<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;"
  >
Pandas: Finding and Dropping Missing Data
              
</p>
</div>

Data Science Cohort Live NYC Feb 2022
<p>Phase 1: Topic 5</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

#### Missing data


Observations for a set of variables (columns):
- E.g., observations of octopuses: length, width, mass, beak length, beak width, number of suckers, bioluminescence, species.
- For a given observation, we may have data on some variables but not others.
- This leads to missing/empty values in tabular form.

| Obs_ID  | L (m) | W (m) | L<sub>beak</sub> (cm) | W<sub>beak</sub> (cm) | m (kg) | n<sub>suckers</sub> | Bioluminescent? | Species|
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.1 | .6 | 10 | 5 | 10 | | N | Ghost squid |
| 1 | 30 | 9 | 8 | 5 | 180 | 1200 | N | Giant Squid |
| 2 | 1.5 | .8 |  |  | 12 |  | Y | S. Syrtensis |

<br>

<div align = "right">
<center><img src="Images/syrtensis.jpeg" width="400"/></center>
</div>
<center>Stauroteuthis Syrtensis: The glowing octopus </center>


#### What pandas does with missing values
- On an import: missing values represented as NaNs
- NaN = not a number

Let's take a look at our Titanic dataset.

In [1]:
import numpy as np
import pandas as pd
titanic_df = pd.read_csv('Data/titanic.csv')
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The .info() method shows that Age, Cabin, and Embarked contain null values: their non-null counts are below 891. Look at the Cabin column:

In [2]:
titanic_df['Cabin'].head()

0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object

There are clearly some NaNs: missing values.

The Python type of a NaN entry is float, even though Cabin is an object column:

In [3]:
type(titanic_df.loc[0, 'Cabin'])

float
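
The float result follows from how NaN is defined: `np.nan` is an ordinary floating-point value. A minimal sketch, using a small hypothetical `cabins` Series rather than the Titanic data, shows the type and one practical consequence: NaN never compares equal to anything, so `==` cannot be used to find it.

```python
import numpy as np
import pandas as pd

# np.nan is a plain Python float, even when stored in an object column
nan_type = type(np.nan)

# NaN is not equal to anything, including itself
nan_equals_itself = (np.nan == np.nan)

# so detect it with .isna() rather than with ==
cabins = pd.Series(['C85', np.nan, 'C123'], dtype=object)  # hypothetical toy data
mask = cabins.isna()

print(nan_type, nan_equals_itself, mask.tolist())
```

This is why the sections below rely on `.isna()` and `.notna()` instead of equality tests.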

#### Placeholder Values
- Sometimes nulls are already encoded in data
    - Most common: 0 for missing values
    - A very large number: e.g. 9999 w/ data range [0,10]
    - np.inf: infinity coding
    
Typically a value well outside the range or type of most data values.

In [2]:
# NumPy coerces this mixed int/str array to strings, so 0 and 9999 are stored as '0' and '9999'
datafun = np.array([[23, 45, 10, 22, 0, 31, 8, 6, 9999, 11, 9999],
                    ['NYC', 'NYC', 'PHIL', 'NYC', 'DC', 'BOS', 0, 'NYC', 'BOS', 0, 'DC']])
pd.DataFrame(datafun.T, columns=['miles_driven_hour', 'car_origin_city'])

Unnamed: 0,miles_driven_hour,car_origin_city
0,23,NYC
1,45,NYC
2,10,PHIL
3,22,NYC
4,0,DC
5,31,BOS
6,8,0
7,6,NYC
8,9999,BOS
9,11,0


Find the missing values.

#### Why are missing values a problem?
- NaNs:
    - Many statistical calculations and machine learning algorithms are ill-posed when the data contain NaNs.
- Placeholders:
    - Artificially distort/skew the data distribution (e.g., many meaningless 0 or 9999 values in the data).
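
Both problems can be seen on a few numbers. The sketch below is illustrative only: it reuses the ordinary values from the miles_driven_hour example, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# the ordinary values from the example, without 0 or 9999
speeds = pd.Series([23, 45, 10, 22, 31, 8, 6, 11])
# the same values with two 9999 placeholders appended
with_placeholder = pd.Series([23, 45, 10, 22, 31, 8, 6, 11, 9999, 9999])

clean_mean = speeds.mean()             # 19.5
skewed_mean = with_placeholder.mean()  # about 2015.4: the placeholders dominate

# NumPy propagates NaN through a calculation
nan_mean = np.mean(np.array([1.0, np.nan, 3.0]))     # nan

# pandas skips NaN by default (skipna=True), which quietly changes the sample size
pandas_mean = pd.Series([1.0, np.nan, 3.0]).mean()   # 2.0, computed from 2 values

print(clean_mean, skewed_mean, nan_mean, pandas_mean)
```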
   


#### NaNs: Finding them using pandas
- DataFrame.isna() method: returns a DataFrame of the same shape, with True where an entry is NaN and False otherwise.

In [5]:
titanic_df.isna()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


But often we want to know which rows or which columns have NaNs. First we need to look at:

- Series.any() method: Returns True or False on a Series if any of the elements are True.

In [6]:
pd.Series([False, False, False]).any()

False

In [7]:
pd.Series([False, True, False]).any()

True

- DataFrame.any(axis = ___) method.
- If axis = 0, check if there are any True in each column.
- If axis = 1, check if there are any True in each row.
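
The two axis choices are easy to mix up, so here is a small sketch on a hypothetical Boolean DataFrame (`flags` and its column names are made up for illustration):

```python
import pandas as pd

flags = pd.DataFrame({'a': [False, True, False],
                      'b': [False, False, False]})

# axis=0: collapse the rows, one answer per column
any_per_column = flags.any(axis=0)

# axis=1: collapse the columns, one answer per row
any_per_row = flags.any(axis=1)

print(any_per_column)
print(any_per_row)
```

Column `a` contains a True and column `b` does not; only the row at index 1 contains a True.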

Put it all together by chaining:

In [3]:
titanic_df.isna()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


Does each row have a NaN in it?

In [5]:
titanic_df.isna().any(axis = 1)

0       True
1      False
2       True
3      False
4       True
       ...  
886     True
887    False
888     True
889    False
890     True
Length: 891, dtype: bool

Does each column have a NaN in it?

In [11]:
titanic_df.isna().any(axis = 0)

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

The .notna() method:
- Unsurprisingly, finds all elements in the DataFrame that are not NaNs.

In [12]:
titanic_df.isna()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


In [13]:
titanic_df.notna()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,True,True,True,True,True,True,True,True,True,True,False,True
1,True,True,True,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True,True,False,True
3,True,True,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...
886,True,True,True,True,True,True,True,True,True,True,False,True
887,True,True,True,True,True,True,True,True,True,True,True,True
888,True,True,True,True,True,False,True,True,True,True,False,True
889,True,True,True,True,True,True,True,True,True,True,True,True


The DataFrame.all(axis = __) method:

- If axis = 0, checks if each column is all True.
- If axis = 1, checks if each row is all True.

Chain with .notna() to find all rows/columns that have no NaNs.
    

In [6]:
# all columns that have no NaNs
titanic_df.notna().all(axis = 0)

PassengerId     True
Survived        True
Pclass          True
Name            True
Sex             True
Age            False
SibSp           True
Parch           True
Ticket          True
Fare            True
Cabin          False
Embarked       False
dtype: bool

In [15]:
# all rows that have no NaNs
titanic_df.notna().all(axis = 1)

0      False
1       True
2      False
3       True
4      False
       ...  
886    False
887     True
888    False
889     True
890    False
Length: 891, dtype: bool

#### Selections on columns/rows with/without NaNs:

- Use the .loc[] accessor with .isna(), .any(), etc...
- The Series that we have been generating with .notna().all(), etc are Boolean masks!

Example: select all rows in titanic data without NaNs. Extract Sex, Passenger class, Age, and Cabin columns only.


In [8]:
col_list = ['Sex', 'Pclass', 'Age', 'Cabin']
selection = titanic_df.loc[titanic_df.notna().all(axis = 1), col_list]

selection.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 1 to 889
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Sex     183 non-null    object 
 1   Pclass  183 non-null    int64  
 2   Age     183 non-null    float64
 3   Cabin   183 non-null    object 
dtypes: float64(1), int64(1), object(2)
memory usage: 7.1+ KB
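
The same mask-based selection can be tried on the small cephalopod table from the start of the lecture. This is a sketch only: the column names are hypothetical shorthand, and only a few of the table's columns are included.

```python
import numpy as np
import pandas as pd

# a few columns of the observation table; empty cells become np.nan
octo_df = pd.DataFrame({
    'length_m': [1.1, 30, 1.5],
    'beak_length_cm': [10, 8, np.nan],
    'n_suckers': [np.nan, 1200, np.nan],
    'species': ['Ghost squid', 'Giant Squid', 'S. Syrtensis'],
})

complete_rows = octo_df.notna().all(axis=1)  # Boolean mask over rows
cols_with_nan = octo_df.isna().any(axis=0)   # Boolean mask over columns

# rows with no NaNs, restricted to two columns
complete = octo_df.loc[complete_rows, ['species', 'n_suckers']]

# all rows, restricted to the columns that contain at least one NaN
incomplete_cols = octo_df.loc[:, cols_with_nan]

print(complete)
print(incomplete_cols)
```

A row mask goes in the first slot of `.loc[]` and a column mask in the second, so the two kinds of selection can be combined.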


#### Dropping NaNs easily:
- The .dropna(axis = __, how = __, subset = __) method.
- The chaining with .loc above is flexible.
- But dropping NaNs is a common enough operation that there is a dedicated command for it.
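
The `how` and `subset` parameters are easier to see on a toy frame first. The following is a sketch on hypothetical data (`toy_df` borrows three Titanic column names, but the values are made up):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'Age':   [22.0, np.nan, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', 'C123', np.nan],
    'Fare':  [7.25, 71.28, 53.10, 8.05],
})

# subset: only look at Age when deciding which rows to drop
age_known = toy_df.dropna(axis='index', subset=['Age'])

# how='all': drop a row only if every checked column is NaN
some_info = toy_df.dropna(axis='index', how='all', subset=['Age', 'Cabin'])

print(age_known)
print(some_info)
```

Here `age_known` keeps rows 0 and 2 even though row 0 has no Cabin, and `some_info` drops only row 3, where both Age and Cabin are missing.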

Drop all rows (index) that have any NaNs:

In [17]:
titanic_df.dropna(axis = 'index', how = 'any')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


Drop all columns that have any NaNs:

In [20]:
titanic_df.dropna(axis = 'columns', how = 'any').head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.25
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.925
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1
4,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.05


#### Dealing with Placeholder Values
- One way is to convert suspected placeholder value(s) to NaN and then apply previous methods.

In [10]:
datafun_df = pd.DataFrame(datafun.T, columns = ['miles_driven_hour', 'car_origin_city'])
datafun_df

Unnamed: 0,miles_driven_hour,car_origin_city
0,23,NYC
1,45,NYC
2,10,PHIL
3,22,NYC
4,0,DC
5,31,BOS
6,8,0
7,6,NYC
8,9999,BOS
9,11,0


DataFrame.replace() method: 
- dictionary-style value replacement in a DataFrame
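
A nested dictionary restricts the replacement to particular columns, which helps when a value is a placeholder in one column but could be genuine in another. The sketch below uses a hypothetical `trips_df` built from a dict so that each column keeps its own type. It assumes, purely for illustration, that 0 miles is a real measurement while 9999 and the city label '0' are placeholders.

```python
import numpy as np
import pandas as pd

trips_df = pd.DataFrame({
    'miles_driven_hour': [23, 45, 0, 9999, 11],
    'car_origin_city': ['NYC', 'PHIL', 'DC', 'BOS', '0'],
})

# {column: {old_value: new_value}} replaces only within that column
cleaned = trips_df.replace({
    'miles_driven_hour': {9999: np.nan},
    'car_origin_city': {'0': np.nan},
})

n_missing = cleaned.isna().sum()                     # NaN count per column
complete = cleaned.dropna(axis='index', how='any')   # rows with no NaNs

print(n_missing)
print(complete)
```

The row with 0 miles survives because 0 was not declared a placeholder for that column. Whether that is the right call depends on what the data actually mean.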

In [11]:
# keys are strings because the NumPy array stored every entry as a string
datafun_df.replace({'0': np.nan, '9999': np.nan}, inplace=True)
datafun_df

Unnamed: 0,miles_driven_hour,car_origin_city
0,23.0,NYC
1,45.0,NYC
2,10.0,PHIL
3,22.0,NYC
4,,DC
5,31.0,BOS
6,8.0,
7,6.0,NYC
8,,BOS
9,11.0,


Dropping all rows with any NaNs:

In [12]:
datafun_df.dropna(axis = 'index', how = 'any')

Unnamed: 0,miles_driven_hour,car_origin_city
0,23,NYC
1,45,NYC
2,10,PHIL
3,22,NYC
5,31,BOS
7,6,NYC


- Pandas is effective at finding and dropping missing values.

- Often, dropping values is not the best approach: in the Titanic data, dropping every row with a NaN leaves only 183 of 891 rows.

We will see other possibilities in the next lecture.