<div class="licence">
<span>Licence CC BY-NC-ND</span>
<span>Valérie Roy</span>
<span><img src="media/ensmp-25-alpha.png" /></span>
</div>

In [None]:
import numpy as np
import pandas as pd

## IV) handling **missing data** in *numpy* and  *pandas*

   - in **real data** you can have **missing values**
   - **missing values** are represented in *pandas* by *numpy.nan* (the older alias *numpy.NaN* was removed in NumPy 2.0)

### 1) the type of **missing values**

   - the type of *numpy.nan* is **float**
   
       
   - i.e. *numpy.nan* can only be stored in arrays of **float** or **object** dtype
   
   
   - in other cases a conversion is done
      - **integers** are converted to **float64**
      - **Booleans** are converted to **object**

  
   - when a *numpy.nan* is **present** in a numeric *pandas.Series*
   - the **dtype** of the *pandas.Series* is **numpy float64**
   

In [None]:
df = pd.Series([1, 2, 3, np.nan])  # np.nan works in every NumPy version (np.NaN was removed in NumPy 2.0)
df.dtype
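
The conversions listed above can be checked directly. This is a minimal sketch; the variable names (`nan_type`, `s_int`, `s_bool`) are only illustrative.

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary Python float
nan_type = type(np.nan)

# integers + NaN: the Series is upcast to float64
s_int = pd.Series([1, 2, np.nan])

# Booleans + NaN: there is no Boolean NaN, so the Series falls back to object
s_bool = pd.Series([True, False, np.nan])

print(nan_type, s_int.dtype, s_bool.dtype)
```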

   - if you try to **force** an integer dtype, an **exception** is **raised**

In [None]:
try:
    df = pd.Series([1, 2, 3, np.nan], dtype=np.int64)
except ValueError as e:
    print(e)

   - since **version 0.24**, the *pandas* **library**
   - can hold **integer dtypes** with **missing values**
   
   
   
   - it is not done through the **regular integer type**
   - but it uses **extension types**
   
   
   - the **extended integer-type** that can hold missing values is *'Int64'* (not *'int64'*)
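
As a small illustration of the extension type, the sketch below builds an integer Series with a missing value using `dtype='Int64'`. In recent *pandas* versions the missing entry is displayed as `<NA>` rather than `NaN`.

```python
import pandas as pd

# 'Int64' (capital I) is the nullable integer extension dtype
s = pd.Series([1, 2, 3, None], dtype='Int64')

print(s.dtype)   # the values stay integers, no conversion to float64
print(s.isna())  # only the last entry is missing
print(s.sum())   # missing values are skipped by default
```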

   - in a numeric *pandas.Series*, *None* is replaced by *numpy.nan*
   
   
   - except in a *pandas.Series* of type **object**, where *None* is kept


In [None]:
df = pd.Series([1, 2, 3, None], dtype='object')
df
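
To contrast the two cases, here is a minimal sketch (the names `s_num` and `s_obj` are illustrative) comparing the default dtype with an explicit `object` dtype.

```python
import numpy as np
import pandas as pd

# default dtype: None is converted to NaN and the Series becomes float64
s_num = pd.Series([1, 2, 3, None])

# object dtype: None is stored as is
s_obj = pd.Series([1, 2, 3, None], dtype='object')

print(s_num.dtype, s_num.iloc[3])
print(s_obj.dtype, s_obj.iloc[3])
```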

In [None]:
#pd.isna?

### 2) *pandas* functions for **dealing** with **missing values**

#### *pandas.isna()*, *DataFrame.isna*  and *Index.isna*
   - return the **Boolean mask** of **missing** values 

In [None]:
df = pd.Series([1, 2, np.nan, None], dtype='object')
pd.isna(df)  # same as df.isna()

In [None]:
df[df.isna()] # select the missing values in the Series

In [None]:
df = pd.DataFrame([[1, 2, 3, np.nan], [4, 5, None]])
# 4 columns of two values each (the shorter second row is padded with a missing value)
# the first two are int64
# the third and the fourth are float64 (presence of NaN)
df.head()
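
On a DataFrame, the mask has the same shape as the data, and summing it counts the missing values per column. A short sketch, using an explicitly rectangular variant of the DataFrame above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, np.nan], [4, 5, None, np.nan]])

mask = df.isna()             # Boolean DataFrame, True where a value is missing
missing_per_column = mask.sum()  # True counts as 1

print(mask)
print(missing_per_column)
```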

   - on **index**

In [None]:
df = pd.DataFrame([[1, 2], [4, 5]], index=['a', np.nan])

In [None]:
df.index.isna()

#### *pandas.notna()*, *DataFrame.notna*  and *Index.notna*
   - return the **Boolean mask** of **non-missing** values 
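
A minimal sketch of `notna`, used here to select only the valid entries of a Series (the names `mask` and `valid` are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, None], dtype='object')

mask = s.notna()   # True where a value is present
valid = s[mask]    # keep only the non-missing values

print(mask)
print(valid)
```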

#### *DataFrame.dropna* and *Series.dropna*  **remove missing values**

on *pandas.Series* it removes the missing values
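
For the Series case, a short sketch: `dropna` returns a new Series without the missing entries, and the index labels of the remaining values are preserved.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3, np.nan])

cleaned = s.dropna()  # s itself is left unchanged

print(cleaned)        # only the entries with labels 0 and 2 remain
```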

on *pandas.DataFrame* it removes the **whole row** or **column**
   - *axis = 0* or *axis = 'index'*  for **rows**
   - *axis = 1* or *axis = 'columns'*  for **columns**

In [None]:
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, np.nan, 7], [np.nan, 8, 9, 10]])
df

In [None]:
df.dropna() # by default axis=0

In [None]:
df.dropna(axis='index')

In [None]:
df.dropna(axis=1)

In [None]:
df.dropna(axis='columns')

the parameter *how*



   - when *how='any'* (the default) a **row** or **column** is removed when it contains **at least one** missing value
   
   
   - when *how='all'* a **row** or **column** is removed only when **all** its values are missing


In [None]:
df = pd.DataFrame([[1, 2, 3, np.nan], []])  # the second row is empty, so it contains only missing values
df

In [None]:
df.dropna(how='all')

In [None]:
df.dropna(how='any') # there is nothing left !

the parameter *thresh*
   - you keep only the **rows** (or **columns**)
   - that have **at least** *thresh* **non-missing** values

In [None]:
df = pd.DataFrame([[1, 2, 3, np.nan], [4, 5, np.nan, np.nan], [6, 7, np.nan, np.nan]])
df

In [None]:
df.dropna(thresh=3, axis=0)

In [None]:
df.dropna(thresh=1, axis=1)

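#### *DataFrame.fillna()* and *Series.fillna()*  **missing values** are replaced
   - you can replace them by a given **value**, or specify a **strategy** (*method*) of replacement

methods
   - **forward** (*ffill*): the **last valid** observation is propagated **forward** over the missing values
   - **backward** (*bfill*): the **next valid** observation is propagated **backward** over the missing values
   - in recent *pandas* versions (2.1 and later) the *method* parameter of *fillna* is **deprecated** in favor of the *ffill()* and *bfill()* methods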

In [None]:
df = pd.Series([1, np.nan, np.nan, 5, np.nan,  6, np.nan, 9])
df

In [None]:
df.ffill()  # propagation forward (replaces the deprecated df.fillna(method='ffill'))

In [None]:
df.bfill()  # propagation backward (replaces the deprecated df.fillna(method='bfill'))

   - the same for *pandas.DataFrame*

In [None]:
df = pd.DataFrame([[1, np.nan, np.nan], [np.nan, 6, np.nan], [2, np.nan, 9]])
df.head()

In [None]:
df.ffill(axis=0)  # propagate forward along the rows, i.e. down each column

In [None]:
df.bfill(axis=1)  # propagate backward along the columns, i.e. right to left in each row
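
Besides propagation, `fillna` also accepts the replacement value itself. The sketch below fills with a constant, then with the mean of each column; using the column mean is only one possible choice, shown here as an illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, np.nan], [np.nan, 6, np.nan], [2, np.nan, 9]])

# replace every missing value by the same constant
filled_zero = df.fillna(0)

# replace missing values column by column: df.mean() is a Series
# indexed by column label, and NaN are ignored when computing it
filled_mean = df.fillna(df.mean())

print(filled_zero)
print(filled_mean)
```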

   - computing **equality** in presence of **NaN** values
   - **equals** is not the same as **==**

In [None]:
df1 = pd.DataFrame([[2, 3, 4], [5, np.nan, 7]])
df2 = pd.DataFrame([[2, 3, 4], [5, np.nan, 7]])

In [None]:
df1.equals(df2) # True: NaN at the same location are considered equal

In [None]:
df1 == df2 # element-wise comparison: NaN == NaN is False
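
The difference comes from the floating-point rule that NaN is not equal to itself. A small sketch putting the two comparisons side by side (the names `nan_equal`, `elementwise` and `same` are illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[2, 3, 4], [5, np.nan, 7]])
df2 = pd.DataFrame([[2, 3, 4], [5, np.nan, 7]])

nan_equal = (np.nan == np.nan)  # False: NaN never compares equal to itself

elementwise = (df1 == df2)      # Boolean DataFrame, False where both hold NaN
same = df1.equals(df2)          # single Boolean, NaN at the same place match

print(nan_equal)
print(elementwise)
print(same)
```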