<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;"
  >
Pandas: Finding and Dropping Missing Data
              
</p>
</div>

Data Science Cohort Live NYC May 2022
<p>Phase 1: Topic 5</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

#### Missing data


Observations for a set of variables (columns):
- E.g., observations of an octopus: length, width, mass, beak length, beak width, number of suckers, bioluminescence, species.
- For a given observation, we may have data on some variables but not others.
- This leads to missing/empty values in tabular form.

| Obs_ID  | L (m) | W (m) | L<sub>beak</sub> (cm) | W<sub>beak</sub> (cm) | m (kg) | n<sub>suckers</sub> | Bioluminescent? | Species|
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.1 | .6 | 10 | 5 | 10 | | N | Ghost squid |
| 1 | 30 | 9 | 8 | 5 | 180 | 1200 | N | Giant Squid |
| 2 | 1.5 | .8 |  |  | 12 |  | Y | S. Syrtensis |

<br>

<div align = "right">
<center><img src="Images/syrtensis.jpeg" width="400"/></center>
</div>
<center>Stauroteuthis syrtensis: the glowing octopus</center>


#### What pandas does with missing values
- On import, pandas represents missing values as NaNs.
- NaN = not a number
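
As a minimal sketch of this behavior, the snippet below reads a tiny in-memory CSV in which some fields are left empty. The CSV text and the names `csv_text` and `toy_df` are made up for illustration and are not part of the Titanic data.

```python
import io

import pandas as pd

# Hypothetical CSV: Bob has no age or cabin, Cy has no cabin
csv_text = """name,age,cabin
Ann,29,C85
Bob,,
Cy,41,
"""

toy_df = pd.read_csv(io.StringIO(csv_text))

# Empty fields show up as NaN after the import
print(toy_df)

# Number of NaNs in each column
print(toy_df.isna().sum())
```

Note that the `age` column ends up with a floating-point dtype, because NaN itself is a float.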

Let's take a look at our Titanic dataset.

In [None]:
import numpy as np
import pandas as pd
titanic_df = pd.read_csv('Data/titanic.csv')
titanic_df.info()

The .info() method shows us that there are some null values. Look at the Cabin column:

In [None]:
titanic_df['Cabin'].head()

There are clearly some NaNs: missing values.

The type pandas uses for NaN (checked on a missing Cabin entry):

In [None]:
type(titanic_df.loc[0, 'Cabin'])
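
One consequence of NaN being a float is worth a quick sketch: NaN does not compare equal to anything, including itself, so an equality test cannot locate missing values. The Series `s` below is a toy example, not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

print(type(np.nan))      # <class 'float'>
print(np.nan == np.nan)  # False

s = pd.Series([1.0, np.nan, 3.0])

# Comparing against np.nan never matches, even on the missing entry
eq_mask = (s == np.nan)

# .isna() is the reliable way to flag missing entries
na_mask = s.isna()

print(eq_mask.tolist())
print(na_mask.tolist())
```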

#### Placeholder Values
- Sometimes missing values are already encoded in the data with a placeholder:
    - Most common: 0 for missing values
    - A very large number: e.g. 9999 when the data range is [0, 10]
    - np.inf: infinity coding
    
Typically the placeholder is a value well outside the range/type of most data values.

In [None]:
# Mixing numbers and strings in one NumPy array coerces every entry to a string
datafun = np.array([[23, 45, 10, 22, 0, 31, 8, 6, 9999, 11, 9999],
                    ['NYC', 'NYC', 'PHIL', 'NYC', 'DC', 'BOS', 0, 'NYC', 'BOS', 0, 'DC']])
pd.DataFrame(datafun.T, columns = ['miles_driven_hour', 'car_origin_city'])

Find the missing values.

#### Why are missing values a problem?
- NaNs:
    - Many statistical calculations and machine learning algorithms are ill-posed when the data contain NaNs.
- Placeholders:
    - They artificially distort/skew the data distribution (e.g., many meaningless 0 or 9999 values in the data).
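
Both problems can be seen in a small sketch. The `speeds` values below are invented for illustration, with 9999 assumed to be a placeholder for a missing reading.

```python
import numpy as np
import pandas as pd

# Toy data: 9999 stands in for a missing reading (an assumption for this example)
speeds = pd.Series([23, 45, 10, 22, 9999, 31, 9999])

# Placeholders inflate the mean far beyond any real reading (roughly 2875.6)
print(speeds.mean())

# Converting the placeholder to NaN lets pandas skip it (mean of 26.2)
cleaned = speeds.replace(9999, np.nan)
print(cleaned.mean())

# A plain NumPy mean does not skip NaN, so the result is NaN
print(np.mean(cleaned.to_numpy()))
```

Pandas reductions such as `.mean()` skip NaNs by default, but many other tools do not, which is why NaNs usually have to be handled explicitly.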
   


#### NaNs: Finding them using pandas
- DataFrame.isna() method: returns a DataFrame of the same shape, with True where an entry is NaN and False otherwise.

In [None]:
titanic_df.isna()

But often we want to know which rows or which columns have NaNs. First we need to look at:

- Series.any() method: Returns True or False on a Series if any of the elements are True.

In [None]:
pd.Series([False, False, False]).any()

In [None]:
pd.Series([False, True, False]).any()

- DataFrame.any(axis = ___) method.
- If axis = 0, check if there are any True in each column.
- If axis = 1, check if there are any True in each row.
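
A small sketch of the two axis choices on a toy Boolean DataFrame (the name `mask` and its values are made up for illustration):

```python
import pandas as pd

mask = pd.DataFrame({'a': [False, True, False],
                     'b': [False, False, False]})

# axis = 0: one answer per column (is there any True in this column?)
col_any = mask.any(axis=0)
print(col_any)

# axis = 1: one answer per row (is there any True in this row?)
row_any = mask.any(axis=1)
print(row_any)
```

Here column `a` contains a True and column `b` does not, while only the middle row contains a True.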

Put it all together by chaining:

In [None]:
titanic_df.isna()


Does each column contain any NaNs?

In [None]:
titanic_df.isna().any(axis = 0)  # append .sum() to count the columns that contain NaNs

Does each row contain any NaNs?

In [None]:
titanic_df.isna().any(axis = 1)
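
Because True counts as 1 when summed, the same chains can also count missing values. The following is a sketch on a toy DataFrame (column names and values are invented, not the Titanic data):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'age': [22.0, np.nan, 35.0],
                       'cabin': [np.nan, np.nan, 'C85'],
                       'fare': [7.25, 71.28, 8.05]})

# Number of NaNs in each column
nan_per_column = toy_df.isna().sum()
print(nan_per_column)

# Number of rows that contain at least one NaN
n_rows_with_nan = toy_df.isna().any(axis=1).sum()
print(n_rows_with_nan)
```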

The .notna() method:
- Unsurprisingly, finds all elements in the DataFrame that are not NaNs.

In [None]:
titanic_df.isna()

In [None]:
titanic_df.notna()

The DataFrame.all(axis = __) method:

- If axis = 0, checks if each column is all True.
- If axis = 1, checks if each row is all True.

Chaining .notna() with .all() finds all rows/columns that have no NaNs.
    

In [None]:
# all columns that have no NaNs
titanic_df.notna().all(axis = 0)

In [None]:
# all rows that have no NaNs
titanic_df.notna().all(axis = 1)

#### Selections on columns/rows with/without NaNs:

- Use the .loc[] accessor with .isna(), .any(), etc.
- The Series that we have been generating with .notna().all(), etc. are Boolean masks!

Example: select all rows in titanic data without NaNs. Extract Sex, Passenger class, Age, and Cabin columns only.


In [None]:
col_list = ['Sex', 'Pclass', 'Age', 'Cabin']
selection = titanic_df.loc[titanic_df.notna().all(axis = 1), col_list]

print(selection.head())
selection.info()  # .info() prints its report directly and returns None


#### Dropping NaNs easily:
- The .dropna(axis = __, how = __, subset = __) method
- The chaining with .loc above is flexible.
- But dropping NaNs is a common enough operation that there is a dedicated method for it.

Drop all rows (index) that have any NaNs:

In [None]:
titanic_df.dropna(axis = 'index', how = 'any')

Drop all columns that have any NaNs:

In [None]:
titanic_df.dropna(axis = 'columns', how = 'any').head()
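
The `how` and `subset` arguments are not exercised above, so here is a minimal sketch of them on a toy DataFrame (the data and variable names are invented for illustration):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'age': [22.0, np.nan, np.nan],
                       'cabin': [np.nan, np.nan, 'C85'],
                       'fare': [7.25, np.nan, 8.05]})

# how = 'all': drop a row only if every entry in it is NaN (row 1 here)
all_missing_dropped = toy_df.dropna(axis='index', how='all')
print(all_missing_dropped)

# subset: only look at the listed columns when deciding which rows to drop
age_known = toy_df.dropna(axis='index', subset=['age'])
print(age_known)
```

Using `subset` keeps rows that are missing values only in columns we do not care about for a given analysis.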

#### Dealing with Placeholder Values
- One way is to convert suspected placeholder value(s) to NaN and then apply previous methods.

In [None]:
datafun_df = pd.DataFrame(datafun.T, columns = ['miles_driven_hour', 'car_origin_city'])
datafun_df

DataFrame.replace() method: 
- dictionary-style value replacement in a DataFrame

In [None]:
# The placeholders are matched as strings because the NumPy array stored every entry as a string
datafun_df.replace({'0':np.nan, '9999': np.nan}, inplace = True)
datafun_df

Dropping all rows with any NaNs:

In [None]:
datafun_df.dropna(axis = 'index', how = 'any')
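
A placeholder in one column may be a legitimate value in another. As a sketch, the nested-dictionary form of `.replace()` restricts each replacement to a named column. The toy data below assumes, purely for illustration, that 0 is a valid reading in `miles_driven_hour` but a placeholder in `car_origin_city`.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'miles_driven_hour': [23, 0, 8, 9999, 11],
                       'car_origin_city': ['NYC', 'DC', 0, 'BOS', 0]})

# Outer keys are column names; inner dictionaries map placeholder -> NaN
cleaned = toy_df.replace({'miles_driven_hour': {9999: np.nan},
                          'car_origin_city': {0: np.nan}})
print(cleaned)

# The 0 in miles_driven_hour survives; rows with any NaN are then dropped
complete_rows = cleaned.dropna(axis='index', how='any')
print(complete_rows)
```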

- Pandas is effective at finding and dropping missing values.

- But dropping values is often not the best approach.

We will see other possibilities in the next lecture.