# Capstone 3 - Data pre-processing

### Table of contents

<div class="alert alert-block alert-info">
<b>Put table of contents here</b>
</div>

## Introduction

### Import relevant libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

### Retrieve variables

In [2]:
# Retrieve original noise_data dataframe
%store -r noise_data

# Retrieve GeoPandas dataframes
%store -r districts_gdf 
%store -r schoolpoints_gdf

# Retrieve schools covered by sensor range
%store -r coverage_matrix

# Retrieve school demographic information
%store -r combined_summary_df

# Retrieve lower grades achievement information
%store -r combined_lg_achievement_df

# Retrieve high school achievement information
%store -r combined_hs_achievement_df

# Retrieve combined achievement data
%store -r coverage_combined_achievement_df
%store -r non_coverage_combined_achievement_df

# Retrieve merged dataset with all metrics
%store -r merged_coverage_df

# Retrieve school lists
%store -r elem_middle_schools
%store -r high_schools

## Preparing to pre-process the data

To simplify our pre-processing stage, we will write a few functions.

### Define a function `identify_missing_values`

To start, we will define a function, `identify_missing_values`, that prints the number of null values in each numerical column and returns the names of the columns that have at least one.

In [3]:
def identify_missing_values(df):
    # Treat every column that is not object, category, or bool as numerical
    numerical_variables = df.select_dtypes(exclude=['object', 'category', 'bool']).columns
    
    # Print the number of null values in each numerical column
    # and collect the names of the columns that have at least one
    missing_numerical_variables = []
    for item in numerical_variables:
        missing_values = df[item].isnull().sum()
        print(f"{item}: {missing_values}")
        if missing_values > 0:
            missing_numerical_variables.append(item)
    
    return missing_numerical_variables
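
For comparison, the same check can be written without an explicit loop. The sketch below is illustrative only: `missing_numeric_columns` and `toy_df` are hypothetical names, and it assumes that "numerical" means the dtypes selected by `include="number"`, which is slightly narrower than the exclusion-based selection above (datetime columns, for example, are left out).

```python
import numpy as np
import pandas as pd


def missing_numeric_columns(df):
    """Return the names of numeric columns that contain at least one null."""
    # Assumption: "numeric" means whatever select_dtypes(include="number") picks
    numeric = df.select_dtypes(include="number")
    null_counts = numeric.isnull().sum()
    return null_counts[null_counts > 0].index.tolist()


# Hypothetical toy data, not taken from the project datasets
toy_df = pd.DataFrame({
    "school": ["A", "B", "C", "D"],
    "enrollment": [410.0, np.nan, 385.0, 520.0],
    "attendance_rate": [0.93, 0.91, np.nan, np.nan],
    "charter": [True, False, False, True],
    "num_teachers": [30, 28, 25, 41],
})

columns_with_gaps = missing_numeric_columns(toy_df)
print(columns_with_gaps)
```

The returned list keeps the dataframe's column order, so it can be used directly to index the columns that need imputation.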

### Define a function `knn_impute`

In [4]:
def knn_impute(df, n_neighbors=5):
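    missing_numerical_variables = identify_missing_values(df)
    
    # Nothing to impute: return an unmodified copy instead of fitting on zero columns
    if not missing_numerical_variables:
        return df.copy()
    
    missing_numerical_data = df[missing_numerical_variables]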
    
    # Perform KNN imputation
    imputer = KNNImputer(n_neighbors=n_neighbors)
    numerical_data_imputed = imputer.fit_transform(missing_numerical_data)
    
    # Convert imputed data back to a dataframe
    numerical_data_imputed = pd.DataFrame(numerical_data_imputed, columns=missing_numerical_variables, 
                                          index=df.index)
    
    # Write the imputed columns back into a copy of the original dataframe
    df_imputed = df.copy()
    df_imputed[missing_numerical_variables] = numerical_data_imputed
    
    return df_imputed
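
To make the behaviour of the imputation step concrete, here is a minimal sketch that applies `KNNImputer` directly to a small dataframe. The data and the names `toy_df`, `toy_imputed`, and `observed` are hypothetical and chosen only for illustration; the point is that the gaps are filled while the observed values pass through unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: two numeric columns, each with one gap
toy_df = pd.DataFrame({
    "enrollment": [400.0, 420.0, np.nan, 610.0, 590.0],
    "attendance_rate": [0.95, 0.94, 0.93, np.nan, 0.88],
})

# Fill each gap from the 2 most similar rows that have a value for that column
imputer = KNNImputer(n_neighbors=2)
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy_df),
    columns=toy_df.columns,
    index=toy_df.index,
)

# Boolean mask of the cells that were present before imputation
observed = toy_df.notna()
print(toy_imputed)
```

One caveat worth keeping in mind: `KNNImputer` measures similarity with an unscaled Euclidean distance, so a column measured in the hundreds (such as `enrollment` here) can dominate one measured as a fraction. Whether the columns should be rescaled before imputation is a judgment call that depends on the data.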

### Define a function `knn_cross_validate`

<div class="alert alert-block alert-danger">
<b>FINISH THIS!</b> Write a function that uses `knn_impute` to cross-validate imputed numerical values.
</div>

In [5]:
# def knn_cross_validate(df, n_neighbors=5):
#     df_imputed = knn_impute(df, n_neighbors=n_neighbors)
    
#     # Finish the rest after learning more about how to cross-validate
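
One possible way to approach this, offered as a sketch rather than the intended implementation: hide a random share of the values that are actually observed, impute them, and compare the imputed values with the hidden originals. The function name `masked_imputation_rmse`, its parameters `mask_fraction` and `random_state`, and the toy data are all assumptions made for illustration. The sketch calls `KNNImputer` directly instead of `knn_impute` so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def masked_imputation_rmse(df, n_neighbors=5, mask_fraction=0.1, random_state=0):
    """Hide a random share of observed numeric values, impute them, return the RMSE."""
    numeric = df.select_dtypes(include="number").astype(float)
    values = numeric.to_numpy(copy=True)
    rng = np.random.default_rng(random_state)

    # Only cells that are actually observed can serve as ground truth
    observed_rows, observed_cols = np.where(~np.isnan(values))
    n_holdout = max(1, int(mask_fraction * observed_rows.size))
    picked = rng.choice(observed_rows.size, size=n_holdout, replace=False)
    rows, cols = observed_rows[picked], observed_cols[picked]

    # Hide the held-out cells, then impute
    masked = values.copy()
    masked[rows, cols] = np.nan
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(masked)

    # KNNImputer drops columns that are entirely NaN, which would misalign indices
    if imputed.shape != values.shape:
        raise ValueError("Masking left a column with no observed values")

    errors = imputed[rows, cols] - values[rows, cols]
    return float(np.sqrt(np.mean(errors ** 2)))


# Hypothetical toy data: three correlated numeric columns
toy_rng = np.random.default_rng(42)
base = toy_rng.normal(size=200)
toy_df = pd.DataFrame({
    "a": base,
    "b": 2 * base + toy_rng.normal(scale=0.1, size=200),
    "c": -base + toy_rng.normal(scale=0.1, size=200),
})

# Compare several neighbor counts on the same held-out cells
scores = {k: masked_imputation_rmse(toy_df, n_neighbors=k) for k in (1, 3, 5, 10)}
print(scores)
```

Because `random_state` is fixed, every value of `n_neighbors` is scored on the same hidden cells, so the resulting errors are directly comparable. Repeating the procedure with several seeds and averaging would give a more stable estimate.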

## Imputing missing values for summary data

In [6]:
# knn_cross_validate(combined_summary_df)

## Imputing missing values for lower grades achievement data

In [7]:
# knn_cross_validate(combined_lg_achievement_df)

## Imputing missing values for high school achievement data

In [8]:
# knn_cross_validate(combined_hs_achievement_df)