In [1]:
import pandas as pd
import numpy as np

## Different ways of Creating DataFrame (cont.)
- 1. Python Dictionary (the one we've been working with until now)
- 2. Using CSV (Mostly used)
- 3. Using Excel
- 4. From list of tuples
- 5. From list of dictionaries

<b> Using CSV</b>

In [2]:
df = pd.read_csv("report_card.csv")
df

Unnamed: 0,Roll no,Name,Maths,Science
0,1,Abhay,70,80
1,2,Abhijeet,90,76
2,3,Abhinav,60,89
3,4,Abhishek,55,75
4,5,Aditya,65,72
5,6,Ajaz,58,68
6,7,Akash,63,66
7,8,Amit,76,82
8,9,Amresh,72,73
9,10,Anand,58,71


<b>Using Excel</b>

In [3]:
# df = pd.read_excel("report_card.xlsx","Sheet1")
df = pd.read_excel("report_card.xlsx")
df

Unnamed: 0,Roll no,Name,Maths,Science
0,1,Abhay,70,80
1,2,Abhijeet,90,76
2,3,Abhinav,60,89
3,4,Abhishek,55,75
4,5,Aditya,65,72
5,6,Ajaz,58,68
6,7,Akash,63,66
7,8,Amit,76,82
8,9,Amresh,72,73
9,10,Anand,58,71


An Excel file can contain multiple sheets. In that case we can pass the ```sheet_name``` parameter to choose one, like the "Sheet1" in the commented-out line above.

<b> From list of tuples</b>

In [4]:
score_card = [
    (1,'Abhay',70,80),
    (2,'Abhijeet',90,76),
    (3,'Abhinav',60,89),
    (4,'Abhishek',55,75)
]

df = pd.DataFrame(score_card,
                 columns=["Roll no","Name","Maths","Science"])
df

Unnamed: 0,Roll no,Name,Maths,Science
0,1,Abhay,70,80
1,2,Abhijeet,90,76
2,3,Abhinav,60,89
3,4,Abhishek,55,75


##### Don't forget to specify the column names, otherwise pandas will name the columns with incremental integers

In [5]:
pd.DataFrame(score_card)

Unnamed: 0,0,1,2,3
0,1,Abhay,70,80
1,2,Abhijeet,90,76
2,3,Abhinav,60,89
3,4,Abhishek,55,75


<b> From list of dictionaries</b>

In [6]:
score_card = [
    {'Roll no': 1, 'Name': 'Abhay', 'Maths': 70, 'Science': 80},
    {'Roll no': 2, 'Name': 'Abhijeet', 'Maths': 90, 'Science': 76},
    {'Roll no': 3, 'Name': 'Abhinav', 'Maths': 60, 'Science': 89},
    {'Roll no': 4, 'Name': 'Abhishek', 'Maths': 55, 'Science': 75}
]

df = pd.DataFrame(score_card)
df

Unnamed: 0,Roll no,Name,Maths,Science
0,1,Abhay,70,80
1,2,Abhijeet,90,76
2,3,Abhinav,60,89
3,4,Abhishek,55,75


There are also other methods for IO operations in pandas. The ones shown above are the most basic and widely used.
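
As a small illustration of those other IO methods, here is a minimal sketch that round-trips a DataFrame through CSV and JSON text. The ```io.StringIO``` buffers stand in for real files and the sample rows are made up for the example.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"Roll no": [1, 2], "Name": ["Abhay", "Abhijeet"], "Maths": [70, 90]}
)

# Write to CSV text without the integer index, then read it back
csv_text = df.to_csv(index=False)
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# Same round trip through JSON, one record per row
json_text = df.to_json(orient="records")
df_from_json = pd.read_json(io.StringIO(json_text))
```

With real files, the buffers would simply be replaced by file paths.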

### Miscellaneous 2: Convert a txt file to CSV

In [7]:
import csv

with open('test.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split(",") for line in stripped if line)
    with open('test.csv', 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('day', 'temperature','windspeed','event'))
        writer.writerows(lines)
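
If the text file is already comma-separated, pandas can probably read it directly, which avoids the intermediate ```csv``` step. The sketch below assumes a header-less file with four fields per line; the ```raw_text``` string is a made-up stand-in for the contents of ```test.txt```.

```python
import io

import pandas as pd

# Stand-in for the contents of test.txt (assumed: comma-separated, no header row)
raw_text = "1/1/2021,32,6,Rain\n1/4/2021,,9,Sunny\n"

# header=None tells pandas the first line is data; names supplies the column labels
df_txt = pd.read_csv(
    io.StringIO(raw_text),
    header=None,
    names=["day", "temperature", "windspeed", "event"],
)

# Serialise back to CSV text, this time with a header row
csv_text = df_txt.to_csv(index=False)
```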

In [8]:
df_test = pd.read_csv('test.csv')
df_test

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2021,32.0,6.0,Rain
1,1/4/2021,,9.0,Sunny
2,1/5/2021,28.0,,Snow
3,1/6/2021,,7.0,
4,1/7/2021,32.0,,Rain
5,1/8/2021,,,Sunny
6,1/9/2021,,,
7,1/10/2021,34.0,8.0,Cloudy
8,1/11/2021,40.0,12.0,Sunny


## Handling Missing Data
- fillna
- dropna
- interpolate

In [2]:
import pandas as pd
import numpy as np

In [5]:
data = pd.read_csv('data/handling_missing_weather_data.csv')
data

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2021,32.0,6.0,Rain
1,1/4/2021,,9.0,Sunny
2,1/5/2021,28.0,,Snow
3,1/6/2021,,7.0,
4,1/7/2021,32.0,,Rain
5,1/8/2021,,,Sunny
6,1/9/2021,,,
7,1/10/2021,34.0,8.0,Cloudy
8,1/11/2021,40.0,12.0,Sunny


Data Analysis for the generated random weather dataset

In [7]:
type(data.day[0])

str

As we can see, the day column holds strings, not datetimes. So first of all we'll convert the day column to the datetime type.

In [8]:
data = pd.read_csv('data/handling_missing_weather_data.csv',parse_dates=["day"])
data

Unnamed: 0,day,temperature,windspeed,event
0,2021-01-01,32.0,6.0,Rain
1,2021-01-04,,9.0,Sunny
2,2021-01-05,28.0,,Snow
3,2021-01-06,,7.0,
4,2021-01-07,32.0,,Rain
5,2021-01-08,,,Sunny
6,2021-01-09,,,
7,2021-01-10,34.0,8.0,Cloudy
8,2021-01-11,40.0,12.0,Sunny


In [9]:
type(data.day[0])

pandas._libs.tslibs.timestamps.Timestamp

As we can now see, the day column has changed, not only in appearance but also in data type.

Also, since we are working with weather data, the day column makes a more useful index than plain integers, because we'll be looking up the weather by day.
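
When the data is already loaded, the same conversion can be done with ```pd.to_datetime```. This is a minimal sketch on a made-up two-row frame; the ```format``` string assumes month/day/year dates like the ones in this dataset.

```python
import pandas as pd

df_days = pd.DataFrame(
    {"day": ["1/1/2021", "1/4/2021"], "temperature": [32.0, None]}
)

# Convert the string column in place; format is month/day/year
df_days["day"] = pd.to_datetime(df_days["day"], format="%m/%d/%Y")
```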

In [10]:
data.set_index("day",inplace=True)
data

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-01,32.0,6.0,Rain
2021-01-04,,9.0,Sunny
2021-01-05,28.0,,Snow
2021-01-06,,7.0,
2021-01-07,32.0,,Rain
2021-01-08,,,Sunny
2021-01-09,,,
2021-01-10,34.0,8.0,Cloudy
2021-01-11,40.0,12.0,Sunny


#### Now we'll start looking into how to handle missing data

```fillna```

In [11]:
new_df = data.fillna(value=0)
new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-01,32.0,6.0,Rain
2021-01-04,0.0,9.0,Sunny
2021-01-05,28.0,0.0,Snow
2021-01-06,0.0,7.0,0
2021-01-07,32.0,0.0,Rain
2021-01-08,0.0,0.0,Sunny
2021-01-09,0.0,0.0,0
2021-01-10,34.0,8.0,Cloudy
2021-01-11,40.0,12.0,Sunny


What we're telling pandas is to fill all the NA values, i.e. the NaN (not a number) values, with some other value, which here is ```0```.

We can now see that all the values that were NaN are replaced by 0's.

But having 0's in place of missing data is not always a good option. For example, the missing entries in the event column are also filled with 0, which makes no sense for an event name.

In [13]:
# specifying a dictionary as the value to fill the missing data according to our choice (column wise)

new_df = data.fillna({
    'temperature': 0,
    'windspeed':0,
    'event': 'No event'
})
new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-01,32.0,6.0,Rain
2021-01-04,0.0,9.0,Sunny
2021-01-05,28.0,0.0,Snow
2021-01-06,0.0,7.0,No event
2021-01-07,32.0,0.0,Rain
2021-01-08,0.0,0.0,Sunny
2021-01-09,0.0,0.0,No event
2021-01-10,34.0,8.0,Cloudy
2021-01-11,40.0,12.0,Sunny


But we're still not satisfied with this. Looking at the temperature column, the temperature was 32.0 on 2021-01-01 but is 0 on 2021-01-04. A drop like that within three days is not realistic.

So, what will be the better options to fill the missing values?

- ```fillna(method='ffill')```

- ```fillna(method='bfill')```

In [14]:
new_df = data.fillna(method='ffill')
new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-01,32.0,6.0,Rain
2021-01-04,32.0,9.0,Sunny
2021-01-05,28.0,9.0,Snow
2021-01-06,28.0,7.0,Snow
2021-01-07,32.0,7.0,Rain
2021-01-08,32.0,7.0,Sunny
2021-01-09,32.0,7.0,Sunny
2021-01-10,34.0,8.0,Cloudy
2021-01-11,40.0,12.0,Sunny


<b>ffill method: </b> Short for forward fill. It carries the previous value forward into the following NaN values. In other words, whenever pandas sees a NaN value it fills it with the value just above it. For example, the temperature on 2021-01-04 is filled with 32. The same applies to all other columns.

But there's still one problem with the ```ffill``` method. If there's no value above a NaN value, that NaN is left unchanged.
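
Note that recent pandas versions deprecate passing ```method``` to ```fillna``` in favour of the dedicated ```ffill()``` and ```bfill()``` methods. The sketch below uses those methods on a made-up Series to show the leading-NaN problem and one possible way around it (chaining a backward fill).

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 32.0, np.nan, 28.0, np.nan])

forward = s.ffill()   # leading NaN stays: there is nothing above it to copy
backward = s.bfill()  # trailing NaN stays: there is nothing below it to copy

# Forward fill first, then backward fill whatever is still missing at the top
both = s.ffill().bfill()
```

Whether chaining the two fills is appropriate depends on the data; it is shown here only to illustrate the behaviour.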

In [15]:
# The axis parameter of fillna changes the direction in which values are forward filled.
# axis = 0 (default): fill down each column, from the row above
# axis = 1: fill across each row, from the column on the left
#
# With axis=1 some cells stay NaN because there is no value to their left to copy

new_df = data.fillna(method='ffill',axis=1)
new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-01,32.0,6.0,Rain
2021-01-04,,9.0,Sunny
2021-01-05,28.0,28.0,Snow
2021-01-06,,7.0,7
2021-01-07,32.0,32.0,Rain
2021-01-08,,,Sunny
2021-01-09,,,
2021-01-10,34.0,8.0,Cloudy
2021-01-11,40.0,12.0,Sunny


<b>bfill method: </b>Short for backward fill. It carries the next value backward into the preceding NaN values. In other words, whenever pandas sees a NaN value it fills it with the value just below it. For example, the temperature on 2021-01-04 is filled with 28, the value below the NaN. The same applies to all other columns.

The ```bfill``` method has a problem similar to ```ffill```: if there's no value below a NaN value, that NaN is left unchanged.

In [16]:
new_df = data.fillna(method="bfill")
new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-01,32.0,6.0,Rain
2021-01-04,28.0,9.0,Sunny
2021-01-05,28.0,7.0,Snow
2021-01-06,32.0,7.0,Rain
2021-01-07,32.0,8.0,Rain
2021-01-08,34.0,8.0,Sunny
2021-01-09,34.0,8.0,Cloudy
2021-01-10,34.0,8.0,Cloudy
2021-01-11,40.0,12.0,Sunny


In [17]:
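# specifying axis=1 fills the NaN values across each row, from the column on the right
# axis = 0 (default): fill up each column, from the row below
# axis = 1: fill across each row, from the column on the right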

new_df = data.fillna(method="bfill", axis=1)
new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-01,32,6,Rain
2021-01-04,9,9,Sunny
2021-01-05,28,Snow,Snow
2021-01-06,7,7,
2021-01-07,32,Rain,Rain
2021-01-08,Sunny,Sunny,Sunny
2021-01-09,,,
2021-01-10,34,8,Cloudy
2021-01-11,40,12,Sunny


Both methods can be limited to some degree. For example, what if we want to fill at most one or two consecutive NaN values above or below a known value? The ```limit``` parameter does this.

In [18]:
# original data
data

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-01,32.0,6.0,Rain
2021-01-04,,9.0,Sunny
2021-01-05,28.0,,Snow
2021-01-06,,7.0,
2021-01-07,32.0,,Rain
2021-01-08,,,Sunny
2021-01-09,,,
2021-01-10,34.0,8.0,Cloudy
2021-01-11,40.0,12.0,Sunny


In [19]:
new_df = data.fillna(method='bfill',limit=1)
new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-01,32.0,6.0,Rain
2021-01-04,28.0,9.0,Sunny
2021-01-05,28.0,7.0,Snow
2021-01-06,32.0,7.0,Rain
2021-01-07,32.0,,Rain
2021-01-08,,,Sunny
2021-01-09,34.0,8.0,Cloudy
2021-01-10,34.0,8.0,Cloudy
2021-01-11,40.0,12.0,Sunny


The values at ```2021-01-09``` are filled but those at ```2021-01-08``` are not, because we limited the fill to one consecutive NaN.
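
The effect of ```limit``` is easiest to see on a tiny example. This sketch uses a made-up Series with a gap of two consecutive NaN values and the ```bfill()``` / ```ffill()``` methods.

```python
import numpy as np
import pandas as pd

s = pd.Series([32.0, np.nan, np.nan, 34.0])

# Backward fill, at most one consecutive NaN: only the NaN next to 34 is filled
limited_back = s.bfill(limit=1)

# Forward fill, at most one consecutive NaN: only the NaN next to 32 is filled
limited_fwd = s.ffill(limit=1)
```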

### Interpolation

In [20]:
new_df = data.interpolate()
new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-01,32.0,6.0,Rain
2021-01-04,30.0,9.0,Sunny
2021-01-05,28.0,8.0,Snow
2021-01-06,30.0,7.0,
2021-01-07,32.0,7.25,Rain
2021-01-08,32.666667,7.5,Sunny
2021-01-09,33.333333,7.75,
2021-01-10,34.0,8.0,Cloudy
2021-01-11,40.0,12.0,Sunny


<b>Interpolation </b>is a type of estimation, a method of constructing new data points within the range of a discrete set of known data points.

Interpolation works only with numeric data, which is why the event column above is left unchanged.
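
Depending on the pandas version, calling ```interpolate()``` on a frame that contains string columns may emit a warning or fail. One way to sidestep that, sketched here on a made-up frame, is to interpolate only the numeric columns.

```python
import numpy as np
import pandas as pd

df_w = pd.DataFrame(
    {
        "temperature": [28.0, np.nan, 32.0],
        "event": ["Snow", None, "Rain"],
    }
)

# Pick out the numeric columns and interpolate only those
numeric_cols = df_w.select_dtypes(include="number").columns
df_w[numeric_cols] = df_w[numeric_cols].interpolate()
```

The event column is untouched and still needs a separate strategy such as a forward or backward fill.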

<b>OK but why interpolation?</b>

The question here is why we fill the missing data points using interpolation when we have other options like
- central tendency (mean/median/mode)
- fillna

The answer is that filling the value at ```2021-01-04``` with a measure of central tendency would be less effective than interpolation. A mean or median ignores where the gap sits, whereas interpolation estimates the value from its neighbours. Above ```2021-01-04``` we have the data for ```2021-01-01```, but below it we have the data for ```2021-01-05```, which is nearer. This will be easier to see with some graphs.
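
Until the graphs are added, a small numeric comparison may help. This is a sketch on made-up temperatures, not the dataset above: the mean puts the same number in every gap, while interpolation follows the neighbouring values.

```python
import numpy as np
import pandas as pd

temps = pd.Series([32.0, np.nan, 28.0, np.nan, 32.0])

# Mean fill: every gap gets the mean of the known values, (32 + 28 + 32) / 3
mean_filled = temps.fillna(temps.mean())

# Linear interpolation: each gap gets the midpoint of its two neighbours
interpolated = temps.interpolate()
```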

#### Graphs will be added soon

If we look more closely and follow linear interpolation by date, the temperature on '4th Jan 2021' should be 29 and not 30, because ```2021-01-04``` is three days after ```2021-01-01``` and only one day before ```2021-01-05```.

#### Methods under interpolation
- linear (default)
- quadratic
- cubic
- polynomial
- time
- index
- nearest
- zero
- etc

In [45]:
# method = time
new_df = data.interpolate(method='time')
new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-01,32.0,6.0,Rain
2021-01-04,29.0,9.0,Sunny
2021-01-05,28.0,8.0,Snow
2021-01-06,30.0,7.0,
2021-01-07,32.0,7.25,Rain
2021-01-08,32.666667,7.5,Sunny
2021-01-09,33.333333,7.75,
2021-01-10,34.0,8.0,Cloudy
2021-01-11,40.0,12.0,Sunny


Now the result is better: the value at ```2021-01-04``` takes the gap in dates into account.
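
The difference between the default method and ```method='time'``` can be checked on just the three relevant dates. This sketch rebuilds them as a small Series; the values are taken from the temperature column above.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2021-01-01", "2021-01-04", "2021-01-05"])
temps = pd.Series([32.0, np.nan, 28.0], index=idx)

# Default (linear): rows are treated as equally spaced, so the gap gets the midpoint
by_position = temps.interpolate()

# method="time": weights by elapsed days, 32 + (28 - 32) * 3/4
by_time = temps.interpolate(method="time")
```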

### Other missing data handling technique
```dropna```: It drops the rows that have missing values.

In [46]:
new_df = data.dropna()
new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-01,32.0,6.0,Rain
2021-01-10,34.0,8.0,Cloudy
2021-01-11,40.0,12.0,Sunny


It dropped all the rows that have NaN in them.

But dropping removed the majority of our data and left us with only 3 rows. Dropping is not a good choice when we have a small dataset.

<b>When to drop data:</b>
- When there are very few rows with missing (NaN) values
- When we have lots of data and dropping some rows does not affect our work
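
One way to judge whether dropping is acceptable is to measure how many rows it would remove. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

df_m = pd.DataFrame(
    {
        "temperature": [32.0, np.nan, 28.0, 34.0],
        "windspeed": [6.0, 9.0, np.nan, 8.0],
    }
)

# True for every row that has at least one missing value
rows_with_nan = df_m.isna().any(axis=1)

# Fraction of rows that dropna() would remove
share_missing = rows_with_nan.mean()
```

What share counts as "few" is a judgment call that depends on the dataset and the analysis.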

<b>Drop only if all the values are missing in a row</b>

```dropna(how='all')```

In [47]:
new_df = data.dropna(how='all')
new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-01,32.0,6.0,Rain
2021-01-04,,9.0,Sunny
2021-01-05,28.0,,Snow
2021-01-06,,7.0,
2021-01-07,32.0,,Rain
2021-01-08,,,Sunny
2021-01-10,34.0,8.0,Cloudy
2021-01-11,40.0,12.0,Sunny


<b>Drop with some threshold</b>

```dropna(thresh)```

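Sometimes we want something between dropping every row with a missing value and dropping only fully empty rows. The ```thresh``` parameter sets the minimum number of non-missing values a row must have to be kept. With ```thresh=2```, a row with fewer than two non-NaN values (here, two or three of its three values missing) is dropped.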
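
To make the rule concrete, this sketch counts the non-missing values per row on a made-up frame and compares the count with what ```dropna(thresh=2)``` keeps.

```python
import numpy as np
import pandas as pd

df_t = pd.DataFrame(
    {
        "temperature": [32.0, np.nan, np.nan],
        "windspeed": [6.0, 7.0, np.nan],
        "event": ["Rain", None, None],
    }
)

# Number of non-missing values in each row
non_missing_per_row = df_t.notna().sum(axis=1)

# Keep only rows with at least 2 non-missing values
kept = df_t.dropna(thresh=2)
```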

In [49]:
new_df = data.dropna(thresh=2)
new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-01,32.0,6.0,Rain
2021-01-04,,9.0,Sunny
2021-01-05,28.0,,Snow
2021-01-07,32.0,,Rain
2021-01-10,34.0,8.0,Cloudy
2021-01-11,40.0,12.0,Sunny


<b>Insert the missing dates</b>

Since we are dealing with weather data, which is highly dependent on dates, it is a good idea to add the missing dates as well.

In [55]:
dt = pd.date_range("01-01-2021","01-11-2021")
idx = pd.DatetimeIndex(dt)
data = data.reindex(idx)
data

Unnamed: 0,temperature,windspeed,event
2021-01-01,32.0,6.0,Rain
2021-01-02,,,
2021-01-03,,,
2021-01-04,,9.0,Sunny
2021-01-05,28.0,,Snow
2021-01-06,,7.0,
2021-01-07,32.0,,Rain
2021-01-08,,,Sunny
2021-01-09,,,
2021-01-10,34.0,8.0,Cloudy


Now the missing dates are also added
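
The same idea can be sketched on a smaller made-up frame, building the date range from the index itself instead of hard-coding the end points. ```asfreq("D")``` is shown as an alternative that should give the same daily index, assuming the index is sorted and has no duplicates.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2021-01-01", "2021-01-04", "2021-01-05"])
df_g = pd.DataFrame({"temperature": [32.0, np.nan, 28.0]}, index=idx)

# Every calendar day between the first and last date in the index
full_range = pd.date_range(df_g.index.min(), df_g.index.max(), freq="D")
reindexed = df_g.reindex(full_range)

# Alternative: convert the index to a daily frequency directly
via_asfreq = df_g.asfreq("D")
```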

In [56]:
new_df = data.interpolate(method='time')
new_df

Unnamed: 0,temperature,windspeed,event
2021-01-01,32.0,6.0,Rain
2021-01-02,31.0,7.0,
2021-01-03,30.0,8.0,
2021-01-04,29.0,9.0,Sunny
2021-01-05,28.0,8.0,Snow
2021-01-06,30.0,7.0,
2021-01-07,32.0,7.25,Rain
2021-01-08,32.666667,7.5,Sunny
2021-01-09,33.333333,7.75,
2021-01-10,34.0,8.0,Cloudy


In [57]:
new_df = new_df.fillna(method='bfill',limit=1)
new_df

Unnamed: 0,temperature,windspeed,event
2021-01-01,32.0,6.0,Rain
2021-01-02,31.0,7.0,
2021-01-03,30.0,8.0,Sunny
2021-01-04,29.0,9.0,Sunny
2021-01-05,28.0,8.0,Snow
2021-01-06,30.0,7.0,Rain
2021-01-07,32.0,7.25,Rain
2021-01-08,32.666667,7.5,Sunny
2021-01-09,33.333333,7.75,Cloudy
2021-01-10,34.0,8.0,Cloudy


In [58]:
new_df = new_df.fillna(method='ffill',limit=1)
new_df

Unnamed: 0,temperature,windspeed,event
2021-01-01,32.0,6.0,Rain
2021-01-02,31.0,7.0,Rain
2021-01-03,30.0,8.0,Sunny
2021-01-04,29.0,9.0,Sunny
2021-01-05,28.0,8.0,Snow
2021-01-06,30.0,7.0,Rain
2021-01-07,32.0,7.25,Rain
2021-01-08,32.666667,7.5,Sunny
2021-01-09,33.333333,7.75,Cloudy
2021-01-10,34.0,8.0,Cloudy


In [65]:
# DataFrame has no attribute `value`; `.values` returns the boolean mask as a NumPy array
new_df.isna().values