
# Filling NaNs



## Prep

In this demo we will look at ways to fill the `NaN`s we found earlier instead of dropping them. As a reminder, the drinks dataframe had some missing values in the `continent` column.

In [None]:
%%writefile get_data.sh
if [ ! -f drinks.csv ]; then
  wget -O drinks.csv https://raw.githubusercontent.com/axel-sirota/manage-data-pandas/main/data/drinks.csv
fi

Writing get_data.sh


In [None]:
!bash get_data.sh

--2023-04-24 18:23:32--  https://raw.githubusercontent.com/axel-sirota/normalise-data-pandas/main/data/drinks.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4973 (4.9K) [text/plain]
Saving to: ‘drinks.csv’


2023-04-24 18:23:32 (48.1 MB/s) - ‘drinks.csv’ saved [4973/4973]



In [None]:
import numpy as np
import pandas as pd

drinks = pd.read_csv('drinks.csv')
drinks.continent.value_counts()

AF    53
EU    45
AS    44
OC    16
SA    12
Name: continent, dtype: int64

In [None]:
drinks.continent.value_counts().sum()

170

## How to fill NaNs?

### Filling with a simple value

The simplest way to fill a missing value is with a constant. It is also the most dangerous way, because you have to be very sure that the value you put in is correct.
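
As a minimal sketch on made-up data (the frame `toy` and its values are hypothetical, not the drinks file), `fillna` also accepts a dict that maps column names to fill values. That keeps a text placeholder out of the numeric columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the drinks dataset.
toy = pd.DataFrame({
    "continent": ["EU", np.nan, "AS"],
    "beer_servings": [89.0, np.nan, 25.0],
})

# One constant per column: a label for text, a number for counts.
filled = toy.fillna({"continent": "UNKNOWN", "beer_servings": 0})
print(filled)
```

Whether `0` or `"UNKNOWN"` is a sensible constant depends entirely on the data; the sketch only shows the mechanics.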

In [None]:
# Have you cracked the problems with the drinks dataset?

In [None]:
drinks.fillna('NA', inplace=True)

In [None]:
drinks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     193 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB
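
Before picking a constant, it can help to check why the values are missing in the first place. By default, `read_csv` treats the literal string `NA` as a missing value. The sketch below uses a small inline CSV (hypothetical, not the drinks file) to show that behaviour and one way to switch it off with `keep_default_na=False`:

```python
import io
import pandas as pd

# Hypothetical inline CSV, not the drinks file.
csv_text = "country,continent\nCanada,NA\nFrance,EU\n"

# Default parsing: the string "NA" is treated as a missing value.
default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False turns off the built-in list of missing-value markers.
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default["continent"].isna().tolist())  # [True, False]
print(literal["continent"].tolist())         # ['NA', 'EU']
```

Keep in mind that `keep_default_na=False` applies to the whole file, so it is only a good fit when no other column relies on the default markers.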


In this case we analysed the issue and found that the problem was pandas reading `NA` (North America) as `NaN`. But suppose we don't know the cause, or we need to handle a numeric column. Let's use another dataset to see this:

In [None]:
%%writefile get_data_2.sh
if [ ! -f drinks_mixed.csv ]; then
  wget -O drinks_mixed.csv https://raw.githubusercontent.com/axel-sirota/manage-data-pandas/main/data/drinks_mixed.csv
fi


Writing get_data_2.sh


In [None]:
!bash get_data_2.sh

--2023-04-24 18:47:12--  https://raw.githubusercontent.com/axel-sirota/normalise-data-pandas/main/data/drinks_mixed.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6023 (5.9K) [text/plain]
Saving to: ‘drinks_mixed.csv’


2023-04-24 18:47:12 (41.9 MB/s) - ‘drinks_mixed.csv’ saved [6023/6023]



In [None]:
drinks_mixed = pd.read_csv('drinks_mixed.csv')

Let's analyse this dataframe!

In [None]:
drinks_mixed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Unnamed: 0                    193 non-null    int64  
 1   country                       193 non-null    object 
 2   beer_servings                 145 non-null    float64
 3   spirit_servings               142 non-null    float64
 4   wine_servings                 148 non-null    float64
 5   total_litres_of_pure_alcohol  144 non-null    float64
 6   continent                     170 non-null    object 
dtypes: float64(4), int64(1), object(2)
memory usage: 10.7+ KB


Quite an interesting dataset! What happens if we call `dropna`?

In [None]:
drinks_mixed.dropna()

Unnamed: 0.1,Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
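
An empty result from `dropna` can be checked by counting the gaps directly. The following is a small sketch on a hypothetical frame (`toy` and its values are invented for illustration) that counts missing values per column and per row:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame in which each row misses a different column.
toy = pd.DataFrame({
    "beer_servings": [np.nan, 89.0, 25.0],
    "wine_servings": [0.0, np.nan, 14.0],
    "continent": ["AS", "EU", np.nan],
})

missing_per_column = toy.isna().sum()         # NaN count for each column
rows_with_gap = toy.isna().any(axis=1).sum()  # rows with at least one NaN

print(missing_per_column)
print(rows_with_gap, "of", len(toy), "rows have a gap")
print(len(toy.dropna()))  # 0
```

The same two expressions can be run on `drinks_mixed` to see how the gaps are spread across its columns.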


So every row has at least one `NaN`. One way to handle numeric values, when we are dealing with a time series or don't know a better value, is *forward fill*.

In [None]:
# fillna(method='ffill') is deprecated since pandas 2.1; ffill() is the equivalent call
drinks_mixed.ffill()

Unnamed: 0.1,Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,0,Afghanistan,0.0,0.0,0.0,,AS
1,1,Albania,89.0,132.0,0.0,4.9,EU
2,2,Algeria,25.0,132.0,14.0,0.7,AF
3,3,Andorra,245.0,138.0,312.0,0.7,EU
4,4,Angola,217.0,138.0,45.0,5.9,AF
...,...,...,...,...,...,...,...
188,188,Venezuela,21.0,100.0,3.0,7.7,SA
189,189,Vietnam,111.0,100.0,1.0,2.0,AS
190,190,Yemen,6.0,0.0,1.0,0.1,AS
191,191,Zambia,6.0,19.0,4.0,2.5,AF


Forward fill replaces each missing value with the previous valid value in the same column. The inverse is backfill:

In [None]:
# fillna(method='backfill') is deprecated since pandas 2.1; bfill() is the equivalent call
drinks_mixed.bfill()

Unnamed: 0.1,Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,0,Afghanistan,0.0,0.0,0.0,4.9,AS
1,1,Albania,89.0,132.0,14.0,4.9,EU
2,2,Algeria,25.0,138.0,14.0,0.7,AF
3,3,Andorra,245.0,138.0,312.0,5.9,EU
4,4,Angola,217.0,128.0,45.0,5.9,AF
...,...,...,...,...,...,...,...
188,188,Venezuela,111.0,100.0,3.0,7.7,SA
189,189,Vietnam,111.0,0.0,1.0,2.0,AS
190,190,Yemen,6.0,0.0,4.0,0.1,AS
191,191,Zambia,64.0,19.0,4.0,2.5,AF
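
The two directions are easier to compare on a tiny example. This sketch uses a hypothetical series `s` with gaps at both ends and calls the `ffill()` and `bfill()` shorthand methods:

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps at both ends and in the middle.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

forward = s.ffill()   # copies the last valid value downwards
backward = s.bfill()  # copies the next valid value upwards

print(forward.tolist())   # [nan, 1.0, 1.0, 3.0, 3.0]
print(backward.tolist())  # [1.0, 1.0, 3.0, 3.0, nan]

# Chaining both directions closes the remaining edge gap.
both = s.ffill().bfill()
print(both.tolist())      # [1.0, 1.0, 1.0, 3.0, 3.0]
```

Chaining is only one possible choice; whether copying a neighbouring value is meaningful depends on whether the row order carries information.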


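Notice that in both cases one extreme can still hold `NaN`s: forward fill has nothing to copy into the first rows, and backfill has nothing to copy into the last ones. As a final note, one pretty standard thing to do is to fill with the mean value: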

In [None]:
# numeric_only=True skips the text columns; pandas 2.x raises an error without it
drinks_mixed.fillna(drinks_mixed.mean(numeric_only=True))



Unnamed: 0.1,Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,0,Afghanistan,0.000000,0.000000,0.000000,4.673611,AS
1,1,Albania,89.000000,132.000000,49.587838,4.900000,EU
2,2,Algeria,25.000000,84.380282,14.000000,0.700000,AF
3,3,Andorra,245.000000,138.000000,312.000000,4.673611,EU
4,4,Angola,217.000000,84.380282,45.000000,5.900000,AF
...,...,...,...,...,...,...,...
188,188,Venezuela,100.227586,100.000000,3.000000,7.700000,SA
189,189,Vietnam,111.000000,84.380282,1.000000,2.000000,AS
190,190,Yemen,6.000000,0.000000,49.587838,0.100000,AS
191,191,Zambia,100.227586,19.000000,4.000000,2.500000,AF
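
To make the mean fill concrete, here is a minimal sketch on a hypothetical three-row frame (names and values are invented). It assumes we only want to fill numeric columns, so the text column keeps its gap:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the drinks dataset.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "beer_servings": [10.0, np.nan, 30.0],
    "continent": ["EU", np.nan, "AS"],
})

# Column means are computed from the non-missing values only.
column_means = toy.mean(numeric_only=True)

# fillna aligns the means with the columns by name.
filled = toy.fillna(column_means)

print(filled["beer_servings"].tolist())  # [10.0, 20.0, 30.0]
print(filled["continent"].isna().sum())  # 1
```

The remaining `NaN` in `continent` shows that a mean fill does nothing for text columns, which need a separate strategy such as a constant.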
