# Identifying Missing Data
### You can identify missing data by column using the .isna() and .sum() methods
* The .info() method can also help identify null values

In [10]:
import numpy as np
import pandas as pd

In [15]:
product_name = [np.nan, 'Dairy', 'Dairy', np.nan, 'Fruits']
product_price = [2.56, np.nan, 4.55, 2.74, np.nan]
product_id = [1, 2, 3, 4, 5]

product_df = pd.DataFrame({'product':product_name, 'price':product_price, 'product_id':product_id})
product_df

,product,price,product_id
0,,2.56,1
1,Dairy,,2
2,Dairy,4.55,3
3,,2.74,4
4,Fruits,,5


In [18]:
product_df.isna() # .isna() returns a DataFrame with Boolean values. True for NaN and False for others.

,product,price,product_id
0,True,False,False
1,False,True,False
2,False,False,False
3,True,False,False
4,False,True,False


In [20]:
product_df.isna().sum() # .sum() counts the True values in each column, because True counts as 1 and False as 0

product       2
price         2
product_id    0
dtype: int64
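
Because True counts as 1 and False as 0, the same Boolean DataFrame can also give the share of missing values per column by swapping `.sum()` for `.mean()`. This is a minimal sketch that rebuilds the same `product_df` so it runs on its own; the name `missing_share` is only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as the product_df built earlier in the notebook
product_df = pd.DataFrame({
    'product': [np.nan, 'Dairy', 'Dairy', np.nan, 'Fruits'],
    'price': [2.56, np.nan, 4.55, 2.74, np.nan],
    'product_id': [1, 2, 3, 4, 5],
})

# The mean of a Boolean column is the fraction of True values,
# i.e. the fraction of rows that are missing in that column
missing_share = product_df.isna().mean()
print(missing_share)
```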

In [17]:
product_df.info() # .info() also reports missing values, but only indirectly: it shows the non-null count per column, so we need to do some math.
                  # "RangeIndex: 5 entries" is the total number of rows. Subtracting a column's Non-Null Count from that total
                  # gives the number of NaN values in that column (for example, 5 - 3 = 2 for 'price').

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   product     3 non-null      object 
 1   price       3 non-null      float64
 2   product_id  5 non-null      int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes
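
The subtraction described in the comment can also be done in code rather than by reading the `.info()` printout. This is a minimal sketch, assuming the same `product_df` (rebuilt here so the cell is self-contained); `non_null` and `missing_per_column` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same data as the product_df built earlier in the notebook
product_df = pd.DataFrame({
    'product': [np.nan, 'Dairy', 'Dairy', np.nan, 'Fruits'],
    'price': [2.56, np.nan, 4.55, 2.74, np.nan],
    'product_id': [1, 2, 3, 4, 5],
})

# .count() returns the number of non-null values per column,
# the same numbers .info() prints under "Non-Null Count"
non_null = product_df.count()

# Total rows minus non-null values = missing values per column
missing_per_column = len(product_df) - non_null
print(missing_per_column)
```

The result should match what `product_df.isna().sum()` reports, which is why `.isna().sum()` is usually the more direct route.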


# Handling Missing Data
### Like with Series, the .dropna() and .fillna() methods let you handle missing data in a DataFrame by either removing it or replacing it with other values

In [21]:
product_df

,product,price,product_id
0,,2.56,1
1,Dairy,,2
2,Dairy,4.55,3
3,,2.74,4
4,Fruits,,5


In [22]:
product_df.fillna(0) # This replaces every missing value with 0. That is not always the best choice: here the text column 'product' also gets a numeric 0

,product,price,product_id
0,0,2.56,1
1,Dairy,0.0,2
2,Dairy,4.55,3
3,0,2.74,4
4,Fruits,0.0,5


In [25]:
product_df.fillna({'price':0, 'product':0}) # A dictionary maps column names to fill values, so we can fill one or more specific columns and leave the rest untouched

,product,price,product_id
0,0,2.56,1
1,Dairy,0.0,2
2,Dairy,4.55,3
3,0,2.74,4
4,Fruits,0.0,5
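
Because the dictionary is keyed by column name, each column can receive a replacement that suits its type. This is a sketch using the same `product_df` (rebuilt so the cell runs alone). The fill choices, the placeholder text `'Unknown'` and the column mean for `price`, are illustrative assumptions and not recommendations from the original notebook.

```python
import numpy as np
import pandas as pd

# Same data as the product_df built earlier in the notebook
product_df = pd.DataFrame({
    'product': [np.nan, 'Dairy', 'Dairy', np.nan, 'Fruits'],
    'price': [2.56, np.nan, 4.55, 2.74, np.nan],
    'product_id': [1, 2, 3, 4, 5],
})

# One fill value per column (both values are assumptions for illustration)
fill_values = {
    'product': 'Unknown',                  # text placeholder for a text column
    'price': product_df['price'].mean(),   # .mean() skips NaN by default
}

# .fillna() returns a new DataFrame; product_df itself is unchanged
filled_df = product_df.fillna(fill_values)
print(filled_df)
```

Whether a mean, a placeholder, or some other value is appropriate depends on how the column will be used later.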


In [26]:
product_df.dropna() # .dropna() drops every row that contains at least one missing value. Here only 1 of the 5 rows survives, so we lose a lot of data

,product,price,product_id
2,Dairy,4.55,3


In [28]:
product_df.dropna(subset=["price"]) # The subset= argument limits the check to the listed columns, so only rows with a missing 'price' are dropped

,product,price,product_id
0,,2.56,1
2,Dairy,4.55,3
3,,2.74,4
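
To see how much data each approach costs, it can help to compare row counts before and after dropping. This is a minimal sketch that again rebuilds the same `product_df`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same data as the product_df built earlier in the notebook
product_df = pd.DataFrame({
    'product': [np.nan, 'Dairy', 'Dairy', np.nan, 'Fruits'],
    'price': [2.56, np.nan, 4.55, 2.74, np.nan],
    'product_id': [1, 2, 3, 4, 5],
})

rows_total = len(product_df)
rows_any = len(product_df.dropna())                    # drop rows with a NaN in any column
rows_price = len(product_df.dropna(subset=['price']))  # drop rows only where 'price' is NaN

print(f"all rows: {rows_total}")
print(f"after dropna(): {rows_any}")
print(f"after dropna(subset=['price']): {rows_price}")
```

Restricting the check with `subset=` keeps rows whose missing values sit in columns that do not matter for the analysis at hand.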
