# `info_na()` Vignette 

In this section, we explore `info_na()`, a function in `eda_mds` that extends the behaviour of `pd.DataFrame.info()`. 
We begin an exploratory data analysis with both functions and compare their output, along with the steps needed to obtain the same information, to motivate its use.

The dataset we will use is `titanic.csv` from the `seaborn` example datasets, available here: 
`https://github.com/mwaskom/seaborn-data/blob/master/titanic.csv`

In [3]:
import numpy as np
import pandas as pd
from eda_mds import info_na


In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')


Imagine we are beginning a new data science project. 
As with any project, exploratory data analysis (EDA) is an important first step towards understanding the nature of the data we are working with. 


A crucial part of EDA is determining the quality of the data. 
Missing data points can significantly degrade model performance, and many models fail outright when given missing values, so characterizing these values is essential to quantifying data quality. 
This will inform strategies to remove, impute, or otherwise replace missing values. 
In some cases, specific rows or columns will be fragmented. Let's see what base `pandas` can tell us about missing values: 

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB


`pandas.DataFrame.info` is a great start! It shows us how many values in each column are non-null, alongside the data types. Here, we can see that some columns are missing significant amounts of data: `age` has 714 non-null values out of 891, and `deck` has only 203. 
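
The non-null counts reported by `info()` can also be expressed as null counts and percentages per column. The following is a minimal sketch on a small made-up frame (the `toy` data and the `column_summary` name are illustrative assumptions, not part of `eda_mds`):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic frame (values are made up).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Boolean mask: True wherever a value is missing.
is_null = toy.isna()

# Summing the mask counts nulls; its mean is the fraction of nulls.
column_summary = pd.DataFrame({
    "null count": is_null.sum(),
    "null %": (is_null.mean() * 100).round(2),
})
print(column_summary)
```

The same two expressions should work unchanged on `df`, since they rely only on `DataFrame.isna()`.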

While this may seem like enough information at first glance, what about rows of data? How much data will be lost if we remove, say, all rows with null values? Is missing data randomly dispersed, or is it concentrated in some rows? 

Let's see if we can answer these questions: 

In [13]:
n_rows_any_null = df.isna().any(axis=1).sum()
n_rows = df.shape[0]
print(f"{n_rows_any_null} rows with any null value. ({n_rows_any_null / n_rows * 100:.2f}%)")

709 rows with any null value. (79.57%)


Losing nearly 80% of the rows would be a huge amount of data! Thankfully, the `df.info()` output above shows that most of the missing values are in the `deck` column. 
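
One way to check how much a single column drives the row-level loss is to recount the affected rows with that column excluded. A minimal sketch, again on made-up toy data rather than the Titanic frame:

```python
import numpy as np
import pandas as pd

# Toy data (made up): "deck" is mostly missing, "age" occasionally.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Rows with at least one null, using every column.
rows_with_null = int(toy.isna().any(axis=1).sum())

# The same count after ignoring the sparsest column.
rows_with_null_excl = int(toy.drop(columns="deck").isna().any(axis=1).sum())

print(f"{rows_with_null} rows with any null (all columns)")
print(f"{rows_with_null_excl} rows with any null (excluding 'deck')")
```

A large drop between the two counts suggests that removing or imputing the one column would preserve far more rows than dropping every incomplete row.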

Are there any rows that have more than one null value? Or rows that have all null values? 

In [14]:
n_rows_all_null = df.isna().all(axis=1).sum()
mean_null_rows = df.isna().sum(axis=1).mean().round(2)
max_null_rows = df.isna().sum(axis=1).max()

print(f"{n_rows_all_null} rows have all-null values")
print(f"{mean_null_rows:0.2f}: average null values per row")
print(f"{max_null_rows}: max number of null values in a row")

0 rows have all-null values
0.98: average null values per row
2: max number of null values in a row


These numbers are consistent with our suspicion that `deck` is the primary contributor of null values. 

In this case, we can see that the largest number of null values in any row is two, and that on average we are missing about one value per row. 
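
The mean and maximum hide how the null counts are spread across rows. A minimal sketch of tabulating that distribution, using made-up toy data (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy data (made up) with rows missing zero, one, or two values.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", "E", np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Number of missing values in each row.
nulls_per_row = toy.isna().sum(axis=1)

# How many rows have 0, 1, 2, ... missing values.
distribution = nulls_per_row.value_counts().sort_index()
print(distribution)
```

If most rows fall in the same bucket, the missing values are probably dominated by a single column rather than scattered at random.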


This exercise shows the extra steps needed to characterize a dataset more fully. 
While this is not a huge amount of work, repeating it for every new project quickly becomes tedious. That's where `info_na` comes in: 

In [15]:
info_na(df)


type: <class 'pandas.core.frame.DataFrame'>
shape: (891, 15)
memory usage: 398.1 KB
--------
columns:
 #      column  null count  null %   dtype
 0    survived           0    0.00   int64
 1      pclass           0    0.00   int64
 2         sex           0    0.00  object
 3         age         177   19.87 float64
 4       sibsp           0    0.00   int64
 5       parch           0    0.00   int64
 6        fare           0    0.00 float64
 7    embarked           2    0.22  object
 8       class           0    0.00  object
 9         who           0    0.00  object
10  adult_male           0    0.00    bool
11        deck         688   77.22  object
12 embark_town           2    0.22  object
13       alive           0    0.00  object
14       alone           0    0.00    bool
-----
rows:
total rows            891.00
any null count        709.00
any null %             79.57
all null count          0.00
all null %              0.00
mean null count         0.98
std.dev null count     

We can see that many of the values we computed before (and some extra ones!) are provided, alongside the information given by `pandas.DataFrame.info`. 

This summarizes the primary use case of `info_na()`: characterizing missing values in a dataset in more detail, which is an essential task in most data science projects.
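
For readers curious how the manual row-level steps from earlier might be bundled into one call, here is a rough sketch. It is not the `eda_mds` implementation: `row_null_summary` is a hypothetical helper written for illustration, and the toy data is made up.

```python
import numpy as np
import pandas as pd


def row_null_summary(frame: pd.DataFrame) -> pd.Series:
    """Hypothetical helper (not part of eda_mds): row-level null statistics."""
    is_null = frame.isna()
    nulls_per_row = is_null.sum(axis=1)
    n_rows = len(frame)
    any_null = int(is_null.any(axis=1).sum())
    all_null = int(is_null.all(axis=1).sum())
    return pd.Series({
        "total rows": n_rows,
        "any null count": any_null,
        # Guard against division by zero on an empty frame.
        "any null %": round(any_null / n_rows * 100, 2) if n_rows else 0.0,
        "all null count": all_null,
        "all null %": round(all_null / n_rows * 100, 2) if n_rows else 0.0,
        "mean null count": round(float(nulls_per_row.mean()), 2) if n_rows else 0.0,
        "max null count": int(nulls_per_row.max()) if n_rows else 0,
    })


# Toy data (made up) to exercise the helper.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", "E", np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

summary = row_null_summary(toy)
print(summary)
```

Returning a `pd.Series` keeps the result easy to print or to combine with other summaries; the actual output format of `info_na()` is the one shown above.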