# How to Handle Missing Data (continued)
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp24&branch=main&urlpath=tree%2Fdata271_sp24%2Fdemos%2Fdata271_demo36_live.ipynb) instead. 
If you don't want to type and want to follow along just by executing the cells, stay in this notebook. 

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Import a dataset about diabetes
df = pd.read_csv('diabetes.csv')
df

In [None]:
df.info()

In [None]:
sns.pairplot(df);

Based on this check, it appears this dataset has no null values. However, if we take a close look at the data we notice that some columns use `0` where there should be nulls.
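
One quick way to spot columns that use `0` as a stand-in for missing values is to count the zeros in each column. The sketch below uses a small made-up table (the values are illustrative and are not taken from `diabetes.csv`). A zero is plausible in a count column, but not in a physical measurement such as BMI.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for a real dataset
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 1, 0],        # zero is a valid count here
    "Glucose": [148, 0, 183, 89],       # zero glucose is not physically plausible
    "BMI": [33.6, 26.6, 0.0, 28.1],     # zero BMI is not physically plausible
})

# Count how many entries in each column are exactly zero
zero_counts = (toy == 0).sum()
print(zero_counts)
```

Columns where a zero cannot be a real measurement are the candidates for conversion to NaN.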

**Step 1:** Convert missing values to nulls.

In [None]:
# Replace zeros in certain columns with nans
df.iloc[:,1:6] = df.iloc[:,1:6].replace(0,np.nan)
df

**Step 2:** Analyze the type and amount of missing data.

In [None]:
df.isna().sum()

In [None]:
df.info()

Several columns have missing data. Let's explore more to determine the type. 

In [None]:
# Make a heatmap to visualize null values in the dataset
sns.heatmap(data = df.isna());

In [None]:
# Install a package that allows for nice graphical analysis of missing values
!pip install missingno

In [None]:
# Import missingno module
import missingno as msno

In [None]:
# Visualize how complete each column is (taller bars mean fewer missing values)
msno.bar(df);

In [None]:
# Another way to make the null map with the msno module 
msno.matrix(df);

In [None]:
# Plot the matrix again sorted by insulin
sorted_df = df.sort_values('Insulin')
msno.matrix(sorted_df);

It looks like some variables only have NaNs when Insulin is NaN. This is an important pattern to notice: while we don't fully understand why, missingness in Insulin appears to be related to missingness in the other columns. Let's quantify this further with a heatmap.

In [None]:
# Correlation in missingness 
msno.heatmap(df);
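
To make the idea of correlated missingness concrete, a similar number can be computed directly with pandas by correlating the missing/not-missing indicators of each column. This is a minimal sketch on made-up data, and it is not claimed to reproduce exactly what `msno.heatmap` does internally.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): "a" and "b" tend to be missing in the same rows
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0, np.nan],
    "c": [np.nan, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# 1 where a value is missing, 0 where it is present
missing = toy.isna().astype(int)

# Pearson correlation between the missingness indicators
nullity_corr = missing.corr()
print(nullity_corr.round(2))
```

Values near 1 mean two columns tend to be missing in the same rows; values near -1 mean one tends to be missing when the other is present.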

**Step 3:** Decide on an approach. 
Some of these columns (e.g., Glucose) have few missing values, and their missingness is not highly correlated with missingness in other variables. We can assume those values are MCAR and drop the affected rows.

In [None]:
# Drop rows that have NaN in the Glucose column
df_dropped = df.dropna(subset=['Glucose'])
df_dropped

In [None]:
df_dropped.isna().sum()
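
Before dropping rows, it can help to check how much data each choice would cost. The sketch below uses a small made-up table (hypothetical values, not the diabetes data) to compare the percentage missing per column with the fraction of rows that would survive a drop.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0, 137.0],
    "Insulin": [np.nan, np.nan, 94.0, np.nan, 168.0],
})

# Percentage of missing values in each column
pct_missing = toy.isna().mean() * 100

# Fraction of rows kept if we drop on one column versus the other
kept_glucose = len(toy.dropna(subset=["Glucose"])) / len(toy)
kept_insulin = len(toy.dropna(subset=["Insulin"])) / len(toy)

print(pct_missing)
print(kept_glucose, kept_insulin)
```

Dropping is cheap for a column with few gaps, but it discards most of the table when applied to a column with many gaps.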

BMI is similar to Glucose: it has very few missing values, and its missingness is not highly correlated with anything else. If we wanted, we could also drop rows with missing BMI.

In [None]:
df_dropped = df_dropped.dropna(subset=['BMI'])
df_dropped.isna().sum()

Since BMI is likely MCAR, we could have also decided to impute the data.

In [None]:
df['BMI'].hist()

In [None]:
# Impute BMI
filled_df = df.copy()
# Assign the result back; chained inplace fillna is unreliable in recent pandas
filled_df['BMI'] = filled_df['BMI'].fillna(filled_df['BMI'].mean())
filled_df

In [None]:
# Visualize the imputation
nullity = df['BMI'].isna()
filled_df.plot(x='Insulin', y='BMI', kind='scatter', alpha=0.8,                   
                   c=nullity, cmap='rainbow',title='Mean Imputation');

It looks like only one BMI value was imputed, but that isn't the case. Most of the missing BMI values occur in rows where Insulin is missing too, and those rows can't be drawn on a scatter plot with Insulin on the x-axis. What if we imputed both variables?

In [None]:
# Check the distribution of Insulin
df['Insulin'].hist()

In [None]:
# Impute BMI and Insulin
filled_df = df.copy()
filled_df['BMI'] = filled_df['BMI'].fillna(filled_df['BMI'].mean())
filled_df['Insulin'] = filled_df['Insulin'].fillna(filled_df['Insulin'].mean())
# True for rows where BMI or Insulin was originally missing
nullity = df['BMI'].isna() | df['Insulin'].isna()
filled_df.plot(x='Insulin', y='BMI', kind='scatter', alpha=0.8,                   
                   c=nullity, cmap='rainbow',title='Mean Imputation');
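
One side effect of mean imputation is that every gap receives the same number, which shrinks the spread of the column. The sketch below illustrates this on synthetic data; the distribution, the 30% missing rate, and the seed are all arbitrary assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic measurements with roughly 30% of values removed at random
values = pd.Series(rng.normal(loc=30, scale=5, size=200))
hide = rng.random(200) < 0.3
observed = values.mask(hide)

# Mean imputation: every gap receives the same number
mean_filled = observed.fillna(observed.mean())

print("std of observed values:   ", observed.std())
print("std after mean imputation:", mean_filled.std())
```

The mean is preserved, but the standard deviation drops, which is why the imputed points line up in a straight band on the scatter plot.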

### Imputing time series data
We saw last time that imputing with the mean, median, or mode is not always the best option, especially for time series data. Let's explore other options.

In [None]:
airquality = pd.read_csv("airquality.csv")
airquality

In [None]:
airquality.info()

In [None]:
airquality['Date'] = pd.to_datetime(airquality['Date'])
airquality

In [None]:
# Look at the missing data in Ozone
plt.figure(figsize=(10,4))
airquality['Ozone'].plot(marker='o')
plt.xlabel('Day')
plt.ylabel('Ozone')
plt.show()

The time series varies a lot. If we fill everything with the mean or median, we might get strange results. 

In [None]:
# Fill NaNs in Ozone with the Ozone mean (leave other columns untouched)
mean_fill = airquality.copy()
mean_fill['Ozone'] = airquality['Ozone'].fillna(airquality['Ozone'].mean())

plt.figure(figsize=(10,4))
mean_fill['Ozone'].plot(color = 'r', marker='o')
airquality['Ozone'].plot( marker='o')
plt.xlabel('Day')
plt.ylabel('Ozone')
plt.show()

In [None]:
# Fill NaNs with the previous datapoint
# (fillna(method='ffill') is deprecated in recent pandas; use ffill())
forward_fill = airquality.ffill()

In [None]:
# Compare the forward-filled values (red) with the original data
plt.figure(figsize=(10,4))
forward_fill['Ozone'].plot(color = 'r', marker='o')
airquality['Ozone'].plot( marker='o')
plt.xlabel('Day')
plt.ylabel('Ozone')
plt.show()

In [None]:
# Fill NaNs with the next datapoint
# (fillna(method='bfill') is deprecated in recent pandas; use bfill())
back_fill = airquality.bfill()

In [None]:
# Compare the back-filled values (red) with the original data
plt.figure(figsize=(10,4))
back_fill['Ozone'].plot(color = 'r', marker='o')
airquality['Ozone'].plot( marker='o')
plt.xlabel('Day')
plt.ylabel('Ozone')
plt.show()

In [None]:
# Linear interpolation
interpolated_oz = airquality.copy()
interpolated_oz['Ozone'] = airquality.Ozone.interpolate(method='linear')

In [None]:
# Compare the linearly interpolated values (red) with the original data
plt.figure(figsize=(10,4))
interpolated_oz['Ozone'].plot(color = 'r', marker='o')
airquality['Ozone'].plot( marker='o')
plt.xlabel('Day')
plt.ylabel('Ozone')
plt.show()

This looks like the most reasonable guess for how to fill in the datapoints.
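
Judging by eye can be backed up with a simple check: hide some values whose true answer is known, fill them with each method, and measure the error. The sketch below does this on a synthetic smooth series (an assumption for illustration; it is not the air quality data), so the ranking it produces is specific to that series.

```python
import numpy as np
import pandas as pd

# Synthetic smooth series standing in for a real time series
t = np.arange(60)
truth = pd.Series(50 + 20 * np.sin(t / 6))

# Hide a few interior points whose true values we know
hidden = [10, 11, 25, 40, 41, 42]
holed = truth.copy()
holed.iloc[hidden] = np.nan

# Fill the gaps with each candidate method
candidates = {
    "mean": holed.fillna(holed.mean()),
    "ffill": holed.ffill(),
    "bfill": holed.bfill(),
    "linear": holed.interpolate(method="linear"),
}

# Mean absolute error, measured on the hidden points only
errors = {
    name: (filled.iloc[hidden] - truth.iloc[hidden]).abs().mean()
    for name, filled in candidates.items()
}
for name, err in errors.items():
    print(f"{name}: {err:.2f}")
```

On this smooth series, linear interpolation has the smallest error. A noisier or more abruptly changing series could rank the methods differently.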

*NOTE:* There are other methods that can be used to interpolate.
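
For example, when observations are unevenly spaced in time, `method='time'` takes the actual gaps into account, while `method='linear'` treats rows as equally spaced. The sketch below uses three made-up readings with a `DatetimeIndex`, which the time method requires.

```python
import numpy as np
import pandas as pd

# Hypothetical readings taken on unevenly spaced days
dates = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
readings = pd.Series([0.0, np.nan, 8.0], index=dates)

# 'linear' ignores the index and treats rows as equally spaced
by_position = readings.interpolate(method="linear")

# 'time' weights the estimate by the real time gaps (needs a DatetimeIndex)
by_time = readings.interpolate(method="time")

print(by_position.iloc[1])
print(by_time.iloc[1])
```

The gap falls one day after the first reading and three days before the last, so the time-aware estimate sits closer to the first value than the position-based one does.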

In [None]:
# Quadratic interpolation (pandas relies on SciPy for this method)
interpolated_oz['Ozone'] = airquality.Ozone.interpolate(method='quadratic')

In [None]:
# Compare the quadratically interpolated values (red) with the original data
plt.figure(figsize=(10,4))
interpolated_oz['Ozone'].plot(color = 'r', marker='o')
airquality['Ozone'].plot( marker='o')
plt.xlabel('Day')
plt.ylabel('Ozone')
plt.show()

This method badly overshoots the range of the observed data, so it is not a good option here.

One other option is to fill null values with a random sample. This could work when the data is MCAR or MAR and could also work for categorical data.

In [None]:
# Get the non-NaN values from the 'Ozone' column
non_nan_values = airquality['Ozone'].dropna()

# Count the number of NaN values in the 'Ozone' column
nan_count = airquality['Ozone'].isna().sum()

# Generate random samples from non-NaN values with replacement to fill NaNs
random_samples = np.random.choice(non_nan_values, nan_count, replace=True)

# Fill NaN values in the 'Ozone' column with the random samples
random_fill = airquality.copy()
random_fill.loc[random_fill['Ozone'].isna(), 'Ozone'] = random_samples

plt.figure(figsize=(10,4))
random_fill['Ozone'].plot(color = 'r', marker='o')
airquality['Ozone'].plot( marker='o')
plt.xlabel('Day')
plt.ylabel('Ozone')
plt.show()
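
Because the fill is random, rerunning the cell gives different values each time. A minimal sketch of a reusable, seeded version is shown here on a made-up categorical column; the helper name `random_sample_fill` and its `seed` parameter are hypothetical and not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

def random_sample_fill(series, seed=None):
    """Fill NaNs by sampling, with replacement, from the observed values."""
    rng = np.random.default_rng(seed)
    filled = series.copy()
    missing = filled.isna()
    observed = filled[~missing].to_numpy()
    # Draw one observed value for every missing entry
    filled[missing] = rng.choice(observed, size=missing.sum(), replace=True)
    return filled

# Toy categorical column (made-up values)
colors = pd.Series(["red", "blue", np.nan, "red", np.nan, "green"])
filled_colors = random_sample_fill(colors, seed=42)
print(filled_colors)
```

Passing the same seed reproduces the same fill, which makes results easier to compare across runs.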

## Activity

**Activity 1:** Run the following cells to import and clean the Lake Mendocino data from last time. Use an appropriate imputation method to fill in the null values in this dataset. Plot your results to evaluate whether your imputation was reasonable.

In [None]:
# Import a dataset about Lake Mendocino
lake = pd.read_csv('coy_wy2024_csvdata.csv')
lake

In [None]:
lake[lake=='-'] = np.nan
notes_columns = [col for col in lake.columns if 'notes' in col]
lake[notes_columns] = lake[notes_columns].replace(0, np.nan)
lake.dropna(axis = 1, thresh = 230,inplace=True)
lake = lake.assign(cons_high = lake['Top of Conservation High (ac-ft)'].astype(float),
              cons = lake['Top of Conservation (ac-ft)'].astype(float),
              gross_pool = lake['Gross Pool'].astype(float),
              gross_pool_elev = lake['Gross Pool(elev)'].astype(float))
lake['date'] = pd.to_datetime(lake['ISO 8601 Date Time'].str[:10])
lake.drop(columns = ['Top of Conservation High (ac-ft)', 'Top of Conservation (ac-ft)','Gross Pool','Gross Pool(elev)'],inplace=True)
lake

In [None]:
# Fill nans


In [None]:
# Check if it worked 


In [None]:
# Plot 
