<a href="https://colab.research.google.com/github/axel-sirota/normalise-data-pandas/blob/main/module5/NormaliseDataPandas_Mod5Demo2_ReplaceInvalid.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using replace on invalid data


## Prep

In the last series of demos we worked with duplicated rows. Now we turn to another troublesome issue in datasets: invalid data. For this we will work with the drinks dataset again.

In [1]:
%%writefile get_data.sh
if [ ! -f drinks.csv ]; then
  wget -O drinks.csv https://raw.githubusercontent.com/axel-sirota/normalise-data-pandas/main/data/drinks.csv
fi

Writing get_data.sh


In [2]:
!bash get_data.sh

--2023-04-25 14:30:17--  https://raw.githubusercontent.com/axel-sirota/normalise-data-pandas/main/data/drinks.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4973 (4.9K) [text/plain]
Saving to: ‘drinks.csv’


2023-04-25 14:30:17 (46.2 MB/s) - ‘drinks.csv’ saved [4973/4973]



In [3]:
import numpy as np
import pandas as pd

drinks = pd.read_csv('drinks.csv')
drinks

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF
...,...,...,...,...,...,...
188,Venezuela,333,100,3,7.7,SA
189,Vietnam,111,2,1,2.0,AS
190,Yemen,6,0,0,0.1,AS
191,Zambia,32,19,4,2.5,AF


## Detecting invalid data

If you remember, when we were learning about NaNs we saw that the `continent` column had NaNs:

In [4]:
drinks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     170 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB


In [8]:
drinks[drinks.continent.isna()]

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
5,Antigua & Barbuda,102,128,45,4.9,
11,Bahamas,122,176,51,6.3,
14,Barbados,143,173,36,6.3,
17,Belize,263,114,8,6.8,
32,Canada,240,122,100,8.2,
41,Costa Rica,149,87,11,4.4,
43,Cuba,93,137,5,4.2,
50,Dominica,52,286,26,6.6,
51,Dominican Republic,193,147,9,6.2,
54,El Salvador,52,69,2,2.2,


Have you figured out the issue? Let's see the original file!

In [9]:
!tail -n 20 drinks.csv

Tonga,36,21,5,1.1,OC
Trinidad & Tobago,197,156,7,6.4,NA
Tunisia,51,3,20,1.3,AF
Turkey,51,22,7,1.4,AS
Turkmenistan,19,71,32,2.2,AS
Tuvalu,6,41,9,1.0,OC
Uganda,45,9,0,8.3,AF
Ukraine,206,237,45,8.9,EU
United Arab Emirates,16,135,5,2.8,AS
United Kingdom,219,126,195,10.4,EU
Tanzania,36,6,1,5.7,AF
USA,249,158,84,8.7,NA
Uruguay,115,35,220,6.6,SA
Uzbekistan,25,101,8,2.4,AS
Vanuatu,21,18,11,0.9,OC
Venezuela,333,100,3,7.7,SA
Vietnam,111,2,1,2.0,AS
Yemen,6,0,0,0.1,AS
Zambia,32,19,4,2.5,AF
Zimbabwe,64,18,4,4.7,AF


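When pandas reads the file, the string `NA` (the continent code for North America) is parsed as NaN, because `NA` is one of the markers `read_csv` treats as missing by default. There are two ways forward:

- Easy way: replace the NaN values with the string `NA`. You need to be very sure that every NaN really was an `NA` in the file!
- Complex way: change which strings count as missing values in the `read_csv` call.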
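To see the default parsing in isolation, here is a minimal sketch that reads two rows copied from the tail of the file shown above. It assumes the header row matches the column names reported by `drinks.info()`, and it uses an in-memory string instead of `drinks.csv`.

```python
import io
import pandas as pd

# Two rows copied from the tail of drinks.csv; the header is assumed
# to match the column names reported by drinks.info().
csv_text = (
    "country,beer_servings,spirit_servings,wine_servings,"
    "total_litres_of_pure_alcohol,continent\n"
    "Trinidad & Tobago,197,156,7,6.4,NA\n"
    "Tunisia,51,3,20,1.3,AF\n"
)

# With the default settings, the literal string "NA" is read as missing.
parsed = pd.read_csv(io.StringIO(csv_text))
print(parsed.continent.isna().tolist())  # [True, False]
```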

## Replacing

In [10]:
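# Assign the result back; calling replace(..., inplace=True) on a column
# selected from the DataFrame may not update drinks in newer pandas versions
drinks['continent'] = drinks['continent'].replace(np.nan, 'NA')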

In [11]:
drinks

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF
...,...,...,...,...,...,...
188,Venezuela,333,100,3,7.7,SA
189,Vietnam,111,2,1,2.0,AS
190,Yemen,6,0,0,0.1,AS
191,Zambia,32,19,4,2.5,AF


In [12]:
drinks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     193 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB


In [13]:
drinks.continent.value_counts()

AF    53
EU    45
AS    44
NA    23
OC    16
SA    12
Name: continent, dtype: int64

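There it is! This is super risky, though: it assumes every NaN in the column was originally the string `NA`, so a genuinely missing continent would be relabeled as North America too.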
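A small sketch of that risk, using a hypothetical three-value column rather than the real data: once the file has been parsed, a NaN that came from the string `NA` and a NaN that came from an empty field are indistinguishable, so the replacement relabels both.

```python
import numpy as np
import pandas as pd

# Hypothetical column: imagine the first NaN was the string "NA" in the file
# and the second was a genuinely empty field. After parsing they look the same.
continent = pd.Series(["AF", np.nan, np.nan], name="continent")

# The replacement cannot tell the two cases apart.
fixed = continent.replace(np.nan, "NA")
print(fixed.tolist())  # ['AF', 'NA', 'NA']
```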

## Modifying read_csv 

In [14]:
drinks = pd.read_csv('drinks.csv', keep_default_na=False)
drinks

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF
...,...,...,...,...,...,...
188,Venezuela,333,100,3,7.7,SA
189,Vietnam,111,2,1,2.0,AS
190,Yemen,6,0,0,0.1,AS
191,Zambia,32,19,4,2.5,AF


In [15]:
drinks.continent.value_counts()

AF    53
EU    45
AS    44
NA    23
OC    16
SA    12
Name: continent, dtype: int64

Notice that it works. However, by setting `keep_default_na=False` we are telling pandas not to treat any of its default markers as NaN, so genuinely missing values would no longer be detected. The correct way, which I leave to you as homework, is to also set `na_values` to all the usual markers except the one that affects us.
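As a possible starting point for that homework, here is a minimal sketch that combines `keep_default_na=False` with an explicit `na_values` list. The list is a hand-picked subset of the markers pandas treats as missing by default, with `NA` left out; it is an assumption for illustration, not the full default set. The sample uses three rows from the tail of `drinks.csv` plus one hypothetical row (`Atlantis`) with an empty continent, to stand in for a genuinely missing value.

```python
import io
import pandas as pd

# Three rows from the tail of drinks.csv plus one hypothetical row ("Atlantis")
# whose continent field is empty, i.e. genuinely missing.
csv_text = (
    "country,beer_servings,spirit_servings,wine_servings,"
    "total_litres_of_pure_alcohol,continent\n"
    "Trinidad & Tobago,197,156,7,6.4,NA\n"
    "Tunisia,51,3,20,1.3,AF\n"
    "USA,249,158,84,8.7,NA\n"
    "Atlantis,0,0,0,0.0,\n"
)

# Assumed subset of the usual missing-value markers, deliberately without "NA".
# Check the read_csv documentation for the full default list.
na_markers = ["", "#N/A", "N/A", "NULL", "NaN", "n/a", "nan", "null"]

sample = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,  # drop the built-in markers, including "NA"
    na_values=na_markers,   # then opt back in to the ones we still want
)

# "NA" survives as a string, while the empty field is still read as NaN.
print(sample.continent.value_counts(dropna=False))
```

With this setup, North America keeps its `NA` label and the empty field still shows up as missing, which is the behaviour neither of the two approaches above gives on its own.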