# Missing Data Handling
* Missing data in Pandas is often represented by NumPy "NaN" values
* Pandas treats NaN values as floats, which allows them to be used in vectorized operations

In [3]:
import numpy as np
import pandas as pd

In [4]:
sales = [0, 5, 155, np.nan, 518]
items = ["coffee", "bananas", "tea", "coconut", "sugar"]

sales_series = pd.Series(sales, index=items, name="Sales")
sales_series

coffee       0.0
bananas      5.0
tea        155.0
coconut      NaN
sugar      518.0
Name: Sales, dtype: float64
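As a small sketch of what "vectorized" means here, the cell below rebuilds the same Series and runs arithmetic and a reduction over it. The variable names doubled, total and total_strict are illustrative only.

```python
import numpy as np
import pandas as pd

sales_series = pd.Series(
    [0, 5, 155, np.nan, 518],
    index=["coffee", "bananas", "tea", "coconut", "sugar"],
    name="Sales",
)

# Element-wise arithmetic still runs; the missing entry simply stays NaN.
doubled = sales_series * 2

# Reductions skip NaN by default (skipna=True).
total = sales_series.sum()

# With skipna=False the NaN propagates into the result instead.
total_strict = sales_series.sum(skipna=False)

print(doubled)
print(total, total_strict)
```

Because NaN is a float, the whole Series is stored as float64, which is why the integer sales above display as 0.0, 5.0 and so on.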

### Pandas released its own missing data type, "NA".
* This allows missing values to be stored alongside integers (with a nullable dtype such as "Int64"), instead of needing to convert to float
* This is still a relatively new feature, and in practice many operations end up converting the data back to NumPy's NaN

In [5]:
sales = [0, 5, 155, pd.NA, 518]
items = ["coffee", "bananas", "tea", "coconut", "sugar"]

sales_series = pd.Series(sales, index=items, name="Sales")
sales_series

coffee        0
bananas       5
tea         155
coconut    <NA>
sugar       518
Name: Sales, dtype: object
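Note that the output above has dtype object, not an integer dtype: a plain list containing pd.NA is not converted automatically. One way to get integer storage is to request the nullable "Int64" dtype explicitly, sketched below (sales_int is an illustrative name).

```python
import pandas as pd

sales = [0, 5, 155, pd.NA, 518]
items = ["coffee", "bananas", "tea", "coconut", "sugar"]

# Capital-I "Int64" is the nullable integer dtype; lowercase "int64" cannot hold NA.
sales_int = pd.Series(sales, index=items, name="Sales", dtype="Int64")

print(sales_int)
print(sales_int.dtype)
```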

## Identifying Missing Data

* The .isna() and .value_counts() methods let you identify missing data in a Series
* The .isna() method returns True if a value is missing, and False otherwise


In [6]:
checklist = pd.Series(['Complete', np.nan, np.nan, np.nan, 'Complete'])  # np.NaN was removed in NumPy 2.0; use np.nan
checklist

0    Complete
1         NaN
2         NaN
3         NaN
4    Complete
dtype: object

In [7]:
checklist.isna() 

0    False
1     True
2     True
3     True
4    False
dtype: bool
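Since .isna() returns booleans, it can also be summed, averaged, or used as a filter. A minimal sketch, assuming the same checklist Series as above; missing_count, missing_share and present are illustrative names.

```python
import numpy as np
import pandas as pd

checklist = pd.Series(["Complete", np.nan, np.nan, np.nan, "Complete"])

# True counts as 1, so summing the mask counts the missing values.
missing_count = checklist.isna().sum()

# The mean of the mask is the share of missing values.
missing_share = checklist.isna().mean()

# .notna() is the inverse mask; use it to keep only the present values.
present = checklist[checklist.notna()]

print(missing_count, missing_share)
print(present)
```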

* The .value_counts() method returns unique values and their frequency

In [8]:
checklist.value_counts() # .value_counts() suppresses NaN and None values by default

Complete    2
Name: count, dtype: int64

In [9]:
checklist.value_counts(dropna=False) # The dropna=False argument resolves this problem

NaN         3
Complete    2
Name: count, dtype: int64

## Handling Missing Data

* The .dropna() and .fillna() methods let you handle missing data in a Series
* The .dropna() method removes NaN values from your Series or DataFrame

In [13]:
checklist.dropna() # dropping NA values also drops their index labels, leaving gaps in the index


0    Complete
4    Complete
dtype: object

In [14]:
checklist.dropna().reset_index() # reset_index() renumbers the rows; the old index is kept as a column

   index         0
0      0  Complete
1      4  Complete


### The .fillna(value) method replaces NaN values with a specified value

In [18]:
checklist.fillna('INCOMPLETE') 

0      Complete
1    INCOMPLETE
2    INCOMPLETE
3    INCOMPLETE
4      Complete
dtype: object
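One detail worth keeping in mind: .fillna() returns a new Series and leaves the original untouched, so the result has to be assigned to keep it. A short sketch, assuming the same checklist Series (filled is an illustrative name):

```python
import numpy as np
import pandas as pd

checklist = pd.Series(["Complete", np.nan, np.nan, np.nan, "Complete"])

# fillna() returns a copy with the gaps replaced.
filled = checklist.fillna("INCOMPLETE")

# The original still contains its three missing values.
print(checklist.isna().sum())
print(filled.isna().sum())
```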

### It's important to be thoughtful and deliberate in how you handle missing data

* Talk to a subject matter expert before replacing missing data with zeros or the mean

In [20]:
my_series = pd.Series([np.nan] * 5) # a NumPy NaN Series
my_series

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
dtype: float64

In [25]:
my_series.isna().sum() # without .sum(), this returns a boolean Series that is True for every row

5

In [27]:
my_series = pd.Series([pd.NA] * 5) # a pandas NA Series
my_series

0    <NA>
1    <NA>
2    <NA>
3    <NA>
4    <NA>
dtype: object

In [29]:
my_series.astype('Int64') # the pandas NA Series can be datatype cast as an int

0    <NA>
1    <NA>
2    <NA>
3    <NA>
4    <NA>
dtype: Int64

In [33]:
my_series.isna().sum() # isna works fine with pandas NA as well

5

In [37]:
my_series = pd.Series(range(5))
my_series

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [38]:
my_series.loc[1:2] = pd.NA
my_series

0    0.0
1    NaN
2    NaN
3    3.0
4    4.0
dtype: float64
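The output above shows the int64 Series being converted to float64 so it can hold NaN. A possible way to avoid that conversion, sketched here under the assumption that integer storage matters for your data, is to start from the nullable "Int64" dtype (nullable is an illustrative name).

```python
import pandas as pd

# Start from the nullable integer dtype instead of the default int64.
nullable = pd.Series(range(5), dtype="Int64")

# .loc slicing is label-inclusive, so this sets rows 1 and 2.
nullable.loc[1:2] = pd.NA

print(nullable)
print(nullable.dtype)
```

The remaining values stay integers and the gaps display as <NA> rather than NaN.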

In [39]:
my_series.isna()

0    False
1     True
2     True
3    False
4    False
dtype: bool

In [40]:
my_series.value_counts(dropna=False)

NaN    2
0.0    1
3.0    1
4.0    1
Name: count, dtype: int64

In [41]:
my_series.fillna(0)

0    0.0
1    0.0
2    0.0
3    3.0
4    4.0
dtype: float64

In [42]:
my_series.fillna(my_series.mean())

0    0.000000
1    2.333333
2    2.333333
3    3.000000
4    4.000000
dtype: float64
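To see why the choice of fill value deserves some thought, the sketch below compares the mean of the Series before and after each fill. It rebuilds the same values as above; original_mean, zero_filled_mean and mean_filled_mean are illustrative names.

```python
import numpy as np
import pandas as pd

my_series = pd.Series([0, np.nan, np.nan, 3, 4])

# mean() skips NaN, so this averages only 0, 3 and 4.
original_mean = my_series.mean()

# Filling with 0 adds two extra zeros and pulls the mean down.
zero_filled_mean = my_series.fillna(0).mean()

# Filling with the mean leaves the mean unchanged.
mean_filled_mean = my_series.fillna(my_series.mean()).mean()

print(original_mean, zero_filled_mean, mean_filled_mean)
```

Filling with the mean keeps the average intact but makes the data look less spread out than it really is, which is one reason to check with someone who knows the data first.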

In [43]:
my_series.dropna()

0    0.0
3    3.0
4    4.0
dtype: float64

In [44]:
my_series.dropna().reset_index(drop=True)

0    0.0
1    3.0
2    4.0
dtype: float64