# Dealing with NaN

Before we can begin training our learning algorithms with large datasets, we usually need to clean the data first. This means we need to have a method for detecting and correcting errors in our data. 

While any given dataset can have **many types of bad data, such as outliers or incorrect values**, the type we encounter most often is missing values. As we saw earlier, Pandas assigns NaN values to missing data. We will learn how to detect and deal with NaN values.

We will begin by creating a DataFrame with some NaN values in it.

In [13]:
# We create a list of Python dictionaries
import pandas as pd
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},
{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]

# We create a DataFrame  and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])

# We display the DataFrame
store_items

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,,10,,4.0


We can clearly see that the DataFrame we created has 3 NaN values: one in store 1 and two in store 3. However, in cases where we load very large datasets into a DataFrame, possibly with millions of items, the number of NaN values is not easily visualized. 

For these cases, we can use a combination of methods to count the number of NaN values in our data. The following example combines the **.isnull()** and the **.sum()** methods to count the number of NaN values in our DataFrame.

In [14]:
# We count the number of NaN values in store_items
x =  store_items.isnull().sum().sum()

# We print x
print('Number of NaN values in our DataFrame:', x)

Number of NaN values in our DataFrame: 3


In the above example, the .isnull() method returns a Boolean DataFrame of the same size as store_items and indicates with True the elements that are NaN and with False the elements that are not. Let's see an example:

In [55]:
store_items.isnull()

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,False,False,False,False,False,False,True
store 2,False,False,False,False,False,False,False
store 3,False,False,False,True,False,True,False


In Pandas, **logical True values have numerical value 1** and **logical False values have numerical value 0**. Therefore, we can count the number of NaN values by counting the number of logical True values.

In order to count the total number of **logical True values**, we use the **.sum()** method twice. We have to use it twice because the first sum returns a Pandas Series with the sums of logical True values along columns, as we see below:

In [16]:
store_items.isnull().sum()

bikes      0
pants      0
watches    0
shirts     1
shoes      0
suits      1
glasses    1
dtype: int64

The second **.sum()** will then add up the 1s in the above Pandas Series (which is 3).

Instead of counting the number of NaN values, we can also do the opposite and count the number of non-NaN values. We can do this by using the **.count()** method, as shown below:

In [18]:
# We print the number of non-NaN values in our DataFrame
print()
print('Number of non-NaN values in the columns of our DataFrame:\n', store_items.count())


Number of non-NaN values in the columns of our DataFrame:
 bikes      3
pants      3
watches    3
shirts     2
shoes      3
suits      2
glasses    2
dtype: int64
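
The two counting approaches can be checked against each other. The following is a minimal sketch on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration, not part of the store data); it also counts NaN values per row by passing `axis=1` to `.sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two columns, three rows, two missing entries
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 6.0]})

nan_per_column = toy.isnull().sum()        # Series: NaN count for each column
nan_per_row = toy.isnull().sum(axis=1)     # Series: NaN count for each row
total_nan = int(nan_per_column.sum())      # single number for the whole DataFrame

# count() gives the non-NaN entries, so for every column
# the two counts add up to the number of rows
counts_agree = bool((toy.count() + nan_per_column == len(toy)).all())

print(nan_per_row.tolist(), total_nan, counts_agree)
```

Since every entry is either NaN or not, `.count()` and `.isnull().sum()` always add up to the number of rows in each column.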


Now that we know how to check whether our dataset has any NaN values, the next step is to decide what to do with them. In general we have two options: we can either delete or replace the NaN values.

We will start by learning how to eliminate rows or columns from our DataFrame that contain any NaN values. The **.dropna(axis) method** eliminates any rows with NaN values when **axis = 0** is used and eliminates any columns with NaN values when **axis = 1** is used.

In [19]:
# We drop any columns with NaN values
store_items.dropna(axis = 1)

Unnamed: 0,bikes,pants,watches,shoes
store 1,20,30,35,8
store 2,15,5,10,5
store 3,20,30,35,10


Notice that the **.dropna() method eliminates (drops) the rows or columns with NaN values out of place**. This means that it returns a new DataFrame and the original DataFrame is not modified. You can always remove the desired rows or columns in place by **setting the keyword inplace = True inside the dropna() method**, in which case the method returns None.

In [20]:
store_items

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,,10,,4.0


In [24]:
print(store_items.dropna(axis=1, inplace=True))

None


In [25]:
store_items

Unnamed: 0,bikes,pants,watches,shoes
store 1,20,30,35,8
store 2,15,5,10,5
store 3,20,30,35,10
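
The cells above only drop columns. As a minimal sketch of the other direction, the example below uses a small made-up DataFrame (the name `toy` and its values are assumptions for illustration) to compare `axis = 0` and `axis = 1` and to confirm that the original is left untouched when `inplace` is not set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only 'store 2' has a value in every column
toy = pd.DataFrame(
    {'bikes': [20, 15, 20],
     'shirts': [15.0, 2.0, np.nan],
     'glasses': [np.nan, 50.0, 4.0]},
    index=['store 1', 'store 2', 'store 3'],
)

rows_kept = toy.dropna(axis=0)   # drop every row that contains a NaN
cols_kept = toy.dropna(axis=1)   # drop every column that contains a NaN

print(rows_kept.index.tolist())    # rows that survive
print(cols_kept.columns.tolist())  # columns that survive
print(toy.shape)                   # the original DataFrame keeps its shape
```

Dropping rows keeps every column but can discard many observations, while dropping columns keeps every observation but discards whole features, so the right choice depends on where the NaN values are concentrated.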


Now, instead of eliminating NaN values, we can replace them with suitable values. We could choose for example to replace all NaN values with the value 0. We can do this by using the **.fillna() method** as shown below.

In [33]:
# We create the list of Python dictionaries again, since we already dropped the columns with NaN values in place using inplace = True
import pandas as pd
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},
{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]

# We create a DataFrame  and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])

# We replace all NaN values with 0
store_items.fillna(0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,0.0
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,0.0,10,0.0,4.0


We can also use the **.fillna() method** to replace NaN values with previous values in the DataFrame. This is known as **forward filling**. When replacing NaN values with **forward filling,** we can use previous values taken from columns or rows.

The **.fillna(method = 'ffill', axis)** will use the forward filling (ffill) method to replace NaN values using the previous known value along the given axis. Let's see some examples:

In [35]:
# We replace NaN values with the previous value in the column
store_items.fillna(method = 'ffill', axis = 0) 
# With axis = 0 the fill runs down each column, so each NaN takes the value directly above it.
# The NaN in the "glasses" column remains because there is no value above it.

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,2.0,10,7.0,4.0


Notice that the two NaN values in store 3 have been replaced with previous values in their columns. However, notice that the NaN value in store 1 didn't get replaced. That's because there are no previous values in this column, since the NaN value is the first value in that column. However, if we forward fill along the rows, using the previous value in each row, this won't happen. Let's take a look:

In [36]:
store_items.fillna(method = 'ffill', axis = 1)
# With axis = 1 the fill runs along each row, so each NaN takes the value from the column directly to its left.

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20.0,30.0,35.0,15.0,8.0,45.0,45.0
store 2,15.0,5.0,10.0,2.0,5.0,7.0,50.0
store 3,20.0,30.0,35.0,35.0,10.0,10.0,4.0


We see that in this case all the NaN values have been replaced with the previous row values.

Similarly, you can choose to replace the NaN values with the values that go after them in the DataFrame. This is known as **backward filling.** The **.fillna(method = 'backfill', axis)** will use the backward filling (backfill) method to replace NaN values using the next known value along the given axis. Just like with forward filling, we can choose to use row or column values. Let's see some examples:

In [37]:
# We replace NaN values with the next value in the column
store_items.fillna(method = 'backfill', axis = 0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,50.0
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,,10,,4.0


Notice that the NaN value in store 1 has been replaced with the next value in its column. However, the two NaN values in store 3 didn't get replaced, because they are the last values in their columns and there are no next values to copy. If we backward fill along the rows instead, the situation is reversed: the two NaN values in store 3 are filled, but the NaN value in store 1 is not, since glasses is the last column and there is no next value in that row. Let's take a look:

In [38]:
# We replace NaN values with the next value in the row
store_items.fillna(method = 'backfill', axis = 1)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20.0,30.0,35.0,15.0,8.0,45.0,
store 2,15.0,5.0,10.0,2.0,5.0,7.0,50.0
store 3,20.0,30.0,35.0,10.0,10.0,4.0,4.0


Notice that the **.fillna() method** replaces (fills) the NaN values out of place. This means that the original DataFrame is not modified. You can always replace the NaN values in place by setting the keyword **inplace = True inside the fillna() function.**
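
In recent Pandas releases (2.1 and later), passing `method=` to `.fillna()` is deprecated in favor of the dedicated `.ffill()` and `.bfill()` methods. The following is a minimal sketch of the same forward and backward fills written that way, on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one leading and one trailing NaN
toy = pd.DataFrame(
    {'shirts': [15.0, 2.0, np.nan], 'glasses': [np.nan, 50.0, 4.0]},
    index=['store 1', 'store 2', 'store 3'],
)

forward = toy.ffill(axis=0)    # same idea as fillna(method='ffill', axis=0)
backward = toy.bfill(axis=0)   # same idea as fillna(method='backfill', axis=0)

print(forward)    # the leading NaN in 'glasses' is still NaN
print(backward)   # the trailing NaN in 'shirts' is still NaN
```

Both methods also work out of place by default, so `toy` itself is unchanged after these calls.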

We can also choose to replace NaN values by using different interpolation methods. For example, the **.interpolate(method = 'linear', axis) method** will use linear interpolation to replace NaN values using the values along the given axis.

In [39]:
# We replace NaN values by using linear interpolation using column values
store_items.interpolate(method = 'linear', axis = 0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,2.0,10,7.0,4.0


Notice that the two NaN values in store 3 have been filled in. Because they are the last values in their columns, there is no later data point to interpolate toward, so Pandas carries the last known value forward (2.0 for shirts and 7.0 for suits). The NaN value in store 1 didn't get replaced, because it is the first value in its column and there is no data before it to interpolate from. Now, let's interpolate using row values instead:

In [42]:
# We replace NaN values by using linear interpolation using row values
store_items.interpolate(method = 'linear', axis = 1)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20.0,30.0,35.0,15.0,8.0,45.0,45.0
store 2,15.0,5.0,10.0,2.0,5.0,7.0,50.0
store 3,20.0,30.0,35.0,22.5,10.0,7.0,4.0


Just as with the other methods we saw, the .interpolate() method replaces NaN values out of place.
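
To see the three cases side by side, here is a minimal sketch on a made-up Series (the name `s` and its values are assumptions for illustration) that has a NaN at the start, one between two known values, and one at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading, an interior, and a trailing gap
s = pd.Series([np.nan, 10.0, np.nan, 30.0, np.nan])

filled = s.interpolate(method='linear')

# Leading gap: stays NaN (nothing before it to start from)
# Interior gap: midpoint of its neighbors, 20.0
# Trailing gap: last known value carried forward, 30.0
print(filled.tolist())
```

Only the interior gap gets a value that lies between its neighbors; with the default settings, the gaps at the edges behave as described in the text above.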

# Loading Data into a Pandas DataFrame

In machine learning you will most likely use datasets from many sources to train your learning algorithms. Pandas allows us to load data in different formats into DataFrames. One of the most popular formats for storing data is CSV.

CSV stands for Comma Separated Values and offers a simple format to store data. We can load CSV files into Pandas DataFrames using the **pd.read_csv() function.** Let's load kpit_weather data into a Pandas DataFrame.

In [47]:
# We load the kpit_weather data into a DataFrame
kpit_weather = pd.read_csv('kpit_weather.csv')

# We print some information about kpit_weather
print('kpit_weather is of type:', type(kpit_weather))
print('kpit_weather has shape:', kpit_weather.shape)

kpit_weather is of type: <class 'pandas.core.frame.DataFrame'>
kpit_weather has shape: (192, 12)


We see that we have loaded the kpit_weather.csv file into a Pandas DataFrame and it consists of 192 rows and 12 columns. Now let's look at the data.

In [48]:
kpit_weather

Unnamed: 0,ZTime,Time,OAT,DT,SLP,WD,WS,SKY,PPT,PPT6,Plsr.Event,Plsr.Source
0,20170820040000,20170820000000,178,172,10171,0,0,0,0,-9999,,
1,20170820050000,20170820010000,178,172,10177,0,0,0,0,-9999,,
2,20170820060000,20170820020000,167,161,10181,0,0,0,0,-9999,,
3,20170820070000,20170820030000,161,161,10182,0,0,4,0,-9999,,
4,20170820080000,20170820040000,156,156,10186,180,15,-9999,0,-9999,,
...,...,...,...,...,...,...,...,...,...,...,...,...
187,20170827230000,20170827190000,239,156,10204,130,46,-9999,0,-9999,,
188,20170828000000,20170827200000,228,144,10204,130,36,6,0,-9999,,
189,20170828010000,20170827210000,211,150,10207,110,31,-9999,0,-9999,,
190,20170828020000,20170827220000,200,150,10208,120,41,-9999,0,-9999,,


We see that it is quite a large dataset and that Pandas has automatically assigned numerical row indices to the DataFrame. Pandas also used the labels that appear in the data in the CSV file to assign the column labels.

When dealing with large datasets like this one, it is often useful just to take a look at the first few rows of data instead of the whole dataset. We can take a look at the first 5 rows of data using the **.head() method,** as shown below:

In [51]:
kpit_weather.head()

Unnamed: 0,ZTime,Time,OAT,DT,SLP,WD,WS,SKY,PPT,PPT6,Plsr.Event,Plsr.Source
0,20170820040000,20170820000000,178,172,10171,0,0,0,0,-9999,,
1,20170820050000,20170820010000,178,172,10177,0,0,0,0,-9999,,
2,20170820060000,20170820020000,167,161,10181,0,0,0,0,-9999,,
3,20170820070000,20170820030000,161,161,10182,0,0,4,0,-9999,,
4,20170820080000,20170820040000,156,156,10186,180,15,-9999,0,-9999,,


In [74]:
# We can also take a look at the last 5 rows of data by using the .tail() method:

kpit_weather.tail()

Unnamed: 0,ZTime,Time,OAT,DT,SLP,WD,WS,SKY,PPT,PPT6,Plsr.Event,Plsr.Source
187,20170827230000,20170827190000,239,156,10204,130,46,-9999,0,-9999,0.0,0.0
188,20170828000000,20170827200000,228,144,10204,130,36,6,0,-9999,0.0,0.0
189,20170828010000,20170827210000,211,150,10207,110,31,-9999,0,-9999,0.0,0.0
190,20170828020000,20170827220000,200,150,10208,120,41,-9999,0,-9999,0.0,0.0
191,20170828030000,20170827230000,194,150,10210,120,41,4,0,-9999,0.0,0.0


We can also optionally use **.head(N) or .tail(N) to display the first and last N rows of data,** respectively.
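
The loading and previewing steps can be tried without a file on disk. The sketch below is a minimal example that reads CSV text from an in-memory buffer; the variable names and the six made-up rows are assumptions for illustration and are not taken from kpit_weather.csv.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """Time,OAT,WS
0,178,0
1,178,0
2,167,0
3,161,0
4,156,15
5,150,21
"""

# read_csv accepts a file path or any file-like object
weather = pd.read_csv(io.StringIO(csv_text))

print(weather.shape)     # (rows, columns)
print(weather.head(2))   # first 2 rows
print(weather.tail(3))   # last 3 rows
```

As with the real file, the first line of the text supplies the column labels and Pandas assigns numerical row indices starting at 0.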

Let's do a quick check to see whether we have any NaN values in our dataset. To do this, we will use the **.isnull() method** followed by the **.any() method** to check whether any of the columns contain NaN values.

In [75]:
kpit_weather.isnull().any()

ZTime          False
Time           False
OAT            False
DT             False
SLP            False
WD             False
WS             False
SKY            False
PPT            False
PPT6           False
Plsr.Event     False
Plsr.Source    False
dtype: bool
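
When a DataFrame has many columns, it can be handy to turn this Boolean Series into a list of column names. The following is a minimal sketch on a made-up DataFrame (the name `toy` and its values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the SKY column has a missing value
toy = pd.DataFrame({'OAT': [178, 167, 161],
                    'SKY': [0.0, np.nan, 4.0],
                    'PPT': [0, 0, 0]})

has_nan = toy.isnull().any()                         # one True/False per column
columns_with_nan = has_nan[has_nan].index.tolist()   # names of the True columns
any_nan_at_all = bool(has_nan.any())                 # single answer for the whole DataFrame

print(columns_with_nan, any_nan_at_all)
```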

In [76]:
# kpit_weather.fillna(0) on its own would return a new DataFrame and leave kpit_weather unchanged,
# so we pass inplace=True to modify kpit_weather itself
kpit_weather.fillna(0, inplace=True)

kpit_weather

Unnamed: 0,ZTime,Time,OAT,DT,SLP,WD,WS,SKY,PPT,PPT6,Plsr.Event,Plsr.Source
0,20170820040000,20170820000000,178,172,10171,0,0,0,0,-9999,0.0,0.0
1,20170820050000,20170820010000,178,172,10177,0,0,0,0,-9999,0.0,0.0
2,20170820060000,20170820020000,167,161,10181,0,0,0,0,-9999,0.0,0.0
3,20170820070000,20170820030000,161,161,10182,0,0,4,0,-9999,0.0,0.0
4,20170820080000,20170820040000,156,156,10186,180,15,-9999,0,-9999,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
187,20170827230000,20170827190000,239,156,10204,130,46,-9999,0,-9999,0.0,0.0
188,20170828000000,20170827200000,228,144,10204,130,36,6,0,-9999,0.0,0.0
189,20170828010000,20170827210000,211,150,10207,110,31,-9999,0,-9999,0.0,0.0
190,20170828020000,20170827220000,200,150,10208,120,41,-9999,0,-9999,0.0,0.0


When dealing with large datasets, it is often useful to get statistical information from them. Pandas provides the **.describe() method** to get descriptive statistics on each column of the DataFrame. Let's see how this works:

In [77]:
# We get descriptive statistics on our weather data
kpit_weather.describe()

Unnamed: 0,ZTime,Time,OAT,DT,SLP,WD,WS,SKY,PPT,PPT6,Plsr.Event,Plsr.Source
count,192.0,192.0,192.0,192.0,192.0,192.0,192.0,192.0,192.0,192.0,192.0,192.0
mean,20170820000000.0,20170820000000.0,203.255208,150.473958,10089.489583,-166.6875,23.286458,-3643.036458,0.921875,-9789.760417,0.0,0.0
std,2321295.0,2298326.0,44.335967,39.747756,1457.958894,1774.630762,17.696577,4827.079652,9.969542,1438.251904,0.0,0.0
min,20170820000000.0,20170820000000.0,111.0,83.0,-9999.0,-9999.0,0.0,-9999.0,-1.0,-9999.0,0.0,0.0
25%,20170820000000.0,20170820000000.0,167.0,117.0,10159.75,0.0,15.0,-9999.0,0.0,-9999.0,0.0,0.0
50%,20170820000000.0,20170820000000.0,206.0,144.0,10203.0,160.0,21.0,2.0,0.0,-9999.0,0.0,0.0
75%,20170830000000.0,20170830000000.0,233.0,178.0,10230.0,240.0,31.0,4.0,0.0,-9999.0,0.0,0.0
max,20170830000000.0,20170830000000.0,294.0,228.0,10266.0,360.0,88.0,9.0,137.0,150.0,0.0,0.0


In [78]:
# If desired, we can apply the .describe() method on a single column as shown below:

# We get descriptive statistics on a single column of our DataFrame
kpit_weather['SLP'].describe()

count      192.000000
mean     10089.489583
std       1457.958894
min      -9999.000000
25%      10159.750000
50%      10203.000000
75%      10230.000000
max      10266.000000
Name: SLP, dtype: float64

Similarly, you can also look at one statistic by using one of the many statistical functions Pandas provides. Let's look at some examples:

In [79]:
# We print information about our DataFrame  
print()
print('Maximum values of each column:\n', kpit_weather.max())
print()
print('Minimum DT value:', kpit_weather['DT'].min())
print()
print('Average value of each column:\n', kpit_weather.mean())


Maximum values of each column:
 ZTime          2.017083e+13
Time           2.017083e+13
OAT            2.940000e+02
DT             2.280000e+02
SLP            1.026600e+04
WD             3.600000e+02
WS             8.800000e+01
SKY            9.000000e+00
PPT            1.370000e+02
PPT6           1.500000e+02
Plsr.Event     0.000000e+00
Plsr.Source    0.000000e+00
dtype: float64

Minimum DT value: 83

Average value of each column:
 ZTime          2.017082e+13
Time           2.017082e+13
OAT            2.032552e+02
DT             1.504740e+02
SLP            1.008949e+04
WD            -1.666875e+02
WS             2.328646e+01
SKY           -3.643036e+03
PPT            9.218750e-01
PPT6          -9.789760e+03
Plsr.Event     0.000000e+00
Plsr.Source    0.000000e+00
dtype: float64
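
Several of the minimums and averages above are dominated by the value -9999, which looks like a placeholder code for a missing reading rather than a real measurement. That reading is an assumption, since the dataset's documentation is not part of this notebook. If it holds, one option is to turn the placeholder into NaN before computing statistics, because Pandas skips NaN values in methods such as `.mean()` by default. A minimal sketch on made-up values (the name `toy` and its numbers are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using -9999 as a placeholder for "no reading"
toy = pd.DataFrame({'SLP': [10171, -9999, 10181],
                    'SKY': [0, 4, -9999]})

# Assumption: -9999 means "missing", so we convert it to NaN
cleaned = toy.replace(-9999, np.nan)

print(toy['SLP'].mean())      # the placeholder drags the average down
print(cleaned['SLP'].mean())  # NaN is skipped, so only real readings count
print(cleaned.isnull().sum().sum())
```

After the replacement, the NaN tools from the first part of this lesson (`.isnull()`, `.dropna()`, `.fillna()`, `.interpolate()`) apply to these entries as well.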


Another important statistical measure is data correlation. Data correlation can tell us, for example, if the data in different columns are correlated. We can use the **.corr() method** to get the correlation between different columns, as shown below:

In [82]:
# We display the correlation between columns
kpit_weather.corr()

Unnamed: 0,ZTime,Time,OAT,DT,SLP,WD,WS,SKY,PPT,PPT6,Plsr.Event,Plsr.Source
ZTime,1.0,0.992492,-0.410268,-0.633909,0.126709,-0.01385,0.099163,-0.063976,-0.048678,0.015438,,
Time,0.992492,1.0,-0.421351,-0.645899,0.12401,-0.013779,0.099876,-0.0706,-0.048036,-0.000607,,
OAT,-0.410268,-0.421351,1.0,0.523077,0.079418,-0.099641,0.512753,-0.111562,0.054663,-0.084365,,
DT,-0.633909,-0.645899,0.523077,1.0,-0.013142,0.059976,0.184839,-0.11562,0.143104,0.049398,,
SLP,0.126709,0.12401,0.079418,-0.013142,1.0,-0.010804,0.088826,0.101434,0.002025,0.006406,,
WD,-0.01385,-0.013779,-0.099641,0.059976,-0.010804,1.0,0.036256,-0.073338,0.019432,0.036181,,
WS,0.099163,0.099876,0.512753,0.184839,0.088826,0.036256,1.0,-0.223681,0.07037,0.010082,,
SKY,-0.063976,-0.0706,-0.111562,-0.11562,0.101434,-0.073338,-0.223681,1.0,-0.098445,0.110598,,
PPT,-0.048678,-0.048036,0.054663,0.143104,0.002025,0.019432,0.07037,-0.098445,1.0,0.023005,,
PPT6,0.015438,-0.000607,-0.084365,0.049398,0.006406,0.036181,0.010082,0.110598,0.023005,1.0,,


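Correlation values range from -1 to 1. A value close to 1 indicates a strong positive linear relationship, a value close to -1 indicates a strong negative one, and a value close to 0 indicates little or no linear correlation. The Plsr.Event and Plsr.Source columns show NaN because every value in them is 0, and the correlation of a constant column is undefined.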
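
To make the possible outcomes of `.corr()` concrete, here is a minimal sketch on a made-up DataFrame (the name `toy` and its columns are assumptions for illustration) with one perfectly positively related column, one perfectly negatively related column, and one constant column.

```python
import pandas as pd

# Hypothetical toy data built so the relationships are known in advance
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0],
    'double_x': [2.0, 4.0, 6.0, 8.0],      # rises exactly with x
    'minus_x': [-1.0, -2.0, -3.0, -4.0],   # falls exactly as x rises
    'constant': [0.0, 0.0, 0.0, 0.0],      # never changes
})

corr = toy.corr()   # pairwise Pearson correlation between columns
print(corr.round(2))
```

The `x` row of the result holds a correlation of 1 with `double_x`, -1 with `minus_x`, and NaN with `constant`, since a column with no variation has no defined correlation.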

Let's take a look at the **.groupby() method**. The **.groupby()** method allows us to group data in different ways. Let's see how we can group data to get different types of information.

In [106]:
# The fake company dataset used in the original example (fake_company.csv) is not available,
# so we load the weather data again. './' points to the directory where the notebook is located.
data = pd.read_csv('./kpit_weather.csv')

    
    

Let's use the **.groupby() method** to get information from the weather data. We will group the rows by the SKY column and add up the SLP values in each group.

In [103]:
# We display the total SLP for each value of SKY
data.groupby(['SKY'])['SLP'].sum()

SKY
-9999    692633
 0       174068
 2       255320
 4       438339
 6       274867
 8        71361
 9        30594
Name: SLP, dtype: int64

We see that the SLP values add up to 692633 for the rows where SKY is -9999, to 174068 for the rows where SKY is 0, and so on.

Now, let's suppose we want to know the average SLP for each SKY value. In this case, we group the data by SKY using the **.groupby() method**, just as we did before, and then use the **.mean() method** to get the average SLP. Let's see how this works:


In [104]:
# We display the average SLP for each value of SKY
data.groupby(['SKY'])['SLP'].mean()

SKY
-9999     9894.757143
 0       10239.294118
 2       10212.800000
 4       10193.930233
 6       10180.259259
 8       10194.428571
 9       10198.000000
Name: SLP, dtype: float64



We see that the average SLP is 9894.757143 for the rows where SKY is -9999, 10239.294118 for the rows where SKY is 0, and so on. Next, we can compute the total SLP for each value in the WS column.


In [101]:
# We display the total SLP for each value of WS
data.groupby(['WS'])['SLP'].sum()

WS
0     418300
15    316380
21    275604
26    234218
31    224509
36    132532
41     91555
46    101727
51     71364
62     20306
67     10147
72     20282
82     10122
88     10136
Name: SLP, dtype: int64


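We can also group by more than one column by passing a list of column names to the **.groupby() method**. The next cell is carried over from the original company example and groups by Year and Department. The weather data has neither column, so the call raises a KeyError.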



In [105]:
# We display the salary distribution per department per year.
data.groupby(['Year', 'Department'])['Salary'].sum()




KeyError: 'Year'

With the original company dataset, which is not available here, this grouping shows that in 1990 the Admin department paid 55,000 dollars in salaries, the HR department paid 50,000, and the RD department 48,000, while in 1992 the Admin department paid 122,000 dollars in salaries and the RD department paid 52,000.
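
Since the company dataset is not available, the following is a minimal sketch of grouping by one and by two columns on a small made-up table. The name `staff`, its column names, and every figure in it are assumptions for illustration; they are not the values from the original dataset.

```python
import pandas as pd

# Hypothetical records; departments and salaries are made up
staff = pd.DataFrame({
    'Year': [1990, 1990, 1990, 1991, 1991],
    'Department': ['Admin', 'HR', 'Admin', 'Admin', 'HR'],
    'Salary': [50000, 48000, 52000, 55000, 51000],
})

# One key: a single value per year
per_year_total = staff.groupby('Year')['Salary'].sum()
per_year_mean = staff.groupby('Year')['Salary'].mean()

# Two keys: a single value per (year, department) pair
per_year_dept = staff.groupby(['Year', 'Department'])['Salary'].sum()

print(per_year_total)
print(per_year_mean)
print(per_year_dept)
```

Grouping by a list of columns returns a Series with a two-level index, so an individual total can be looked up with a tuple such as `per_year_dept.loc[(1990, 'Admin')]`.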