<a href="https://colab.research.google.com/github/axel-sirota/manage-data-pandas/blob/main/module3/ManageDataPandas_Mod3Demo1_IdentifyNaNs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Identifying NaNs



## Prep

In this demo we will look at ways of identifying `NaN` (not a number) cells inside our Pandas DataFrames. NaNs are very important because they have a carry-over effect: any arithmetic operation that involves a NaN returns NaN. For example:

In [None]:
import numpy as np 
import pandas as pd

a = 2
b = np.nan

print(f'The result of {a} and {b} added is {a+b}')

The result of 2 and nan added is nan


The same happens with multiplication:

In [None]:
a = 2
b = np.nan

print(f'The result of {a} and {b} multiplied is {a*b}')

The result of 2 and nan multiplied is nan
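The same propagation shows up in aggregate functions. The following is a minimal sketch, assuming a small hand-made array (`values` is a hypothetical example, not part of the dataset used later): NumPy's plain `np.mean` returns NaN as soon as one element is NaN, while `np.nanmean` and the pandas `Series.mean` default (`skipna=True`) ignore the missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, one of them missing
values = np.array([2.0, 4.0, np.nan])

# Plain NumPy mean: the NaN carries over into the result
plain_mean = np.mean(values)

# NaN-aware NumPy mean: the NaN is ignored
nan_aware_mean = np.nanmean(values)

# pandas skips NaNs by default (skipna=True)
series_mean = pd.Series(values).mean()

# Passing skipna=False brings the carry-over behaviour back
strict_mean = pd.Series(values).mean(skipna=False)

print(f'np.mean: {plain_mean}')
print(f'np.nanmean: {nan_aware_mean}')
print(f'Series.mean(): {series_mean}')
print(f'Series.mean(skipna=False): {strict_mean}')
```

Note that skipping NaNs silently changes how many values go into the calculation, which is one more reason to identify them first.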


Therefore NaNs can affect all of our Data Science calculations, such as averages and standard deviations, as well as more complex constructs like neural networks. To work on this, we will first download a public dataset of the amount drunk per type of drink and per country.

In [None]:
%%writefile get_data.sh
if [ ! -f drinks.csv ]; then
  wget -O drinks.csv https://raw.githubusercontent.com/axel-sirota/manage-data-pandas/main/data/drinks.csv
fi

Overwriting get_data.sh


In [None]:
!bash get_data.sh

## Loading the data and analyzing it

In [None]:
drinks = pd.read_csv('drinks.csv')
drinks

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF
...,...,...,...,...,...,...
188,Venezuela,333,100,3,7.7,SA
189,Vietnam,111,2,1,2.0,AS
190,Yemen,6,0,0,0.1,AS
191,Zambia,32,19,4,2.5,AF


We can easily see, for each country, how much beer, spirits, and wine were consumed, the total litres of pure alcohol, and the continent. However, what happens if we describe the DataFrame?


In [None]:
drinks.describe()

Unnamed: 0,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
count,193.0,193.0,193.0,193.0
mean,106.160622,80.994819,49.450777,4.717098
std,101.143103,88.284312,79.697598,3.773298
min,0.0,0.0,0.0,0.0
25%,20.0,4.0,1.0,1.3
50%,76.0,56.0,8.0,4.2
75%,188.0,128.0,59.0,7.2
max,376.0,438.0,370.0,14.4


By default `.describe()` leaves out the non-numeric columns, so let's include them all:

In [None]:
drinks.describe(include='all')

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
count,193,193.0,193.0,193.0,193.0,170
unique,193,,,,,5
top,Afghanistan,,,,,AF
freq,1,,,,,53
mean,,106.160622,80.994819,49.450777,4.717098,
std,,101.143103,88.284312,79.697598,3.773298,
min,,0.0,0.0,0.0,0.0,
25%,,20.0,4.0,1.0,1.3,
50%,,76.0,56.0,8.0,4.2,
75%,,188.0,128.0,59.0,7.2,


It appears that the `continent` column has a problem, since it only registers 170 values while every other column registers 193. `.describe(include='all')` is a great method for diagnosing problems; another tool is `.info()`:

In [None]:
drinks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     170 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB


It is the same idea in a much more compact form: the `Non-Null Count` column shows that `continent` has only 170 non-null values. But one question remains: which rows have that NaN?

## Finding which rows have a NaN

There is a great method called `.isna()` that returns `True` for every value that is `NaN`. Let's see it:

In [None]:
drinks.isna()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
...,...,...,...,...,...,...
188,False,False,False,False,False,False
189,False,False,False,False,False,False
190,False,False,False,False,False,False
191,False,False,False,False,False,False
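A boolean frame of this size is hard to read by eye, so it helps to summarize it. The following is a minimal sketch, assuming a small stand-in frame (`sample` is a hypothetical subset built from a few rows and columns shown in this demo, not the full `drinks` DataFrame). It also shows why a dedicated method is needed: NaN never compares equal to anything, not even to itself.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `drinks`, built from rows shown in this demo
sample = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Antigua & Barbuda', 'Bahamas'],
    'beer_servings': [0, 89, 102, 122],
    'continent': ['AS', 'EU', np.nan, np.nan],
})

# An equality check cannot detect NaN: this comparison is False
equal_to_nan = (np.nan == np.nan)
print(f'np.nan == np.nan -> {equal_to_nan}')

# True counts as 1 when summed, so this gives the NaN count per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the full dataset the equivalent call would be `drinks.isna().sum()`.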


Now we only need an indexer! Let's build it from the `continent` column, which is the one with NaNs:

In [None]:
drinks[drinks.isna()['continent']]

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
5,Antigua & Barbuda,102,128,45,4.9,
11,Bahamas,122,176,51,6.3,
14,Barbados,143,173,36,6.3,
17,Belize,263,114,8,6.8,
32,Canada,240,122,100,8.2,
41,Costa Rica,149,87,11,4.4,
43,Cuba,93,137,5,4.2,
50,Dominica,52,286,26,6.6,
51,Dominican Republic,193,147,9,6.2,
54,El Salvador,52,69,2,2.2,
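The boolean indexing used above can be written in a couple of equivalent ways. The following is a minimal sketch, assuming a small stand-in frame (`sample` is a hypothetical subset built from rows shown in this demo): it selects the mask from a single column first, and then shows how `.any(axis=1)` finds rows with a NaN in any column when the affected column is not known in advance.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `drinks`, built from rows shown in this demo
sample = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Antigua & Barbuda', 'Bahamas'],
    'beer_servings': [0, 89, 102, 122],
    'continent': ['AS', 'EU', np.nan, np.nan],
})

# Build the mask from one column only, then select rows with .loc
rows_missing_continent = sample.loc[sample['continent'].isna()]

# Rows that have at least one NaN in any column
rows_missing_any = sample[sample.isna().any(axis=1)]

print(rows_missing_continent)
print(rows_missing_any)
```

Calling `.isna()` on the single column avoids building a boolean frame for every column when only one of them is of interest.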


There we have them! Can you find out what they have in common?

**Hint:** Compare the raw file with this table and verify!