# Data Preprocessing

After exploring the data, I noted the steps needed to prepare it:

* Drop `quantity_group`, `wpt_name`, `recorded_by`, `source_type`, `waterpoint_type_group`, `payment_type` columns
* Replace missing values in `funder`, `installer`, `subvillage`, `scheme_name` columns with a placeholder `other` category
* Replace missing values in `public_meeting`, `permit` columns with `True`
* Replace missing values in `scheme_management` with `VWC`
* Replace zeros in `construction_year` with the column average (computed without the zeros)
* Frequency-encode these columns, replacing each value with its relative frequency: `funder`, `installer`, `subvillage`, `scheme_name`, `ward`, `lga`
* One-hot encode `scheme_management`, `basin`, `region`, `region_code`, `district_code`, `extraction_type`, `extraction_type_group`, `extraction_type_class`, `management`, `management_group`, `payment`, `water_quality`, `quality_group`, `source`, `source_class`, `waterpoint_type`
* Normalize `amount_tsh`, `gps_height`, `population` columns
* Create an `amount_tsh`:`gps_height` ratio
* Handle imbalanced class labels
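
The normalization and ratio steps are not implemented in this part of the notebook. As a minimal sketch only, assuming min-max scaling is the intended normalization and that a zero `gps_height` should produce a missing ratio rather than a division by zero (both are assumptions; the toy values and the `tsh_height_ratio` column name are hypothetical), they might look like this:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the column names follow the notes above
toy = pd.DataFrame({
    'amount_tsh': [0.0, 50.0, 6000.0],
    'gps_height': [0, 686, 1390],
    'population': [1, 250, 109],
})

# Build the ratio first, on the raw values.
# Zero heights are turned into NaN so the division never hits zero.
toy['tsh_height_ratio'] = toy['amount_tsh'] / toy['gps_height'].replace(0, np.nan)

# Min-max scaling: (x - min) / (max - min) maps each column onto [0, 1]
num_cols = ['amount_tsh', 'gps_height', 'population']
col_min = toy[num_cols].min()
col_max = toy[num_cols].max()
toy[num_cols] = (toy[num_cols] - col_min) / (col_max - col_min)

print(toy)
```

Computing the ratio before scaling keeps it in the original units. Whether that ordering suits the model is a judgment call.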

In [4]:
import pandas as pd
import numpy as np

# load transformed data
data = pd.read_csv('../data/interim/transformed_data.csv')

## Drop Columns

In [5]:
# a list of columns to drop
cols_to_drop = ['quantity_group', 'wpt_name', 'recorded_by', 'source_type', 'waterpoint_type_group', 'payment_type']

In [103]:
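def drop_columns(df, cols):
    """Return a copy of the dataframe without the given columns
    
    input: 1). A Pandas Dataframe and 2). A list of column names (strings)
    output: A new Pandas Dataframe without the dropped columns
    """
    # DataFrame.drop returns a new dataframe, so the input is left untouched
    return df.drop(columns = cols)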

transformed_data = drop_columns(data, cols_to_drop)
print("Columns dropped! Dataframe now contains {} columns".format(transformed_data.shape[1]))
transformed_data.head(1)

Columns dropped! Dataframe now contains 35 columns


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,num_private,basin,...,management,management_group,payment,water_quality,quality_group,quantity,source,source_class,waterpoint_type,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,0,Lake Nyasa,...,vwc,user-group,pay annually,soft,good,enough,spring,groundwater,communal standpipe,functional


## Replace Missing Values

Each missing value needs to be replaced with a fill value, and the fill value varies by column.

Missing values in these columns will be replaced by an arbitrary placeholder integer that does not already occur in them:

* **`funder`**
* **`installer`**
* **`subvillage`**
* **`scheme_name`**

Missing values in these columns will be replaced by a fixed value:

* **`public_meeting`** - replace with `True`
* **`permit`** - replace with `True`
* **`scheme_management`** - replace with `VWC`
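
As a compact alternative sketch, `DataFrame.fillna` also accepts a dictionary that maps column names to fill values, so the whole plan can be expressed as one lookup. The toy rows below are made up, `4000` mirrors the arbitrary placeholder used in the cells that follow, and filling with the boolean `True` (rather than a string) is an assumption made here so the boolean columns keep a single type:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and one missing entry per column
toy = pd.DataFrame({
    'funder': ['funder_a', np.nan],
    'installer': [np.nan, 'installer_b'],
    'subvillage': ['village_a', np.nan],
    'scheme_name': [np.nan, 'scheme_b'],
    'public_meeting': [True, np.nan],
    'permit': [np.nan, False],
    'scheme_management': ['VWC', np.nan],
})

# One fill value per column
fill_values = {
    'funder': 4000,
    'installer': 4000,
    'subvillage': 4000,
    'scheme_name': 4000,
    'public_meeting': True,
    'permit': True,
    'scheme_management': 'VWC',
}

# fillna returns a new dataframe; columns missing from the dict are left alone
filled = toy.fillna(value=fill_values)
print('There are {} missing values'.format(filled.isnull().sum().sum()))
```

Keeping the fill values in one dictionary makes the plan easy to audit against the list above.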

In [104]:
def fill_missing_vals(df, col, value):
    """Replace missing values in one column with a scalar
    
    input: 1). A Pandas Dataframe, 2). Column name, 3). Value to fill holes: scalar
    output: A Pandas Series (the column) with no missing values
    """
    # fillna returns a new Series, so the input dataframe is not modified
    return df[col].fillna(value = value)

In [105]:
# a list of a few columns containing missing values.  
missing_val_cols = ['funder', 'installer', 'subvillage', 'scheme_name']

# Replace missing values with this number.  I used an arbitrary value that isn't currently found in the columns. 
filler = 4000

for col in missing_val_cols:
    transformed_data[col] = fill_missing_vals(transformed_data, col, value = filler)
    print('{} now contains {} missing values'.format(col, transformed_data[col].isnull().sum()))

funder now contains 0 missing values
installer now contains 0 missing values
subvillage now contains 0 missing values
scheme_name now contains 0 missing values


In [106]:
missing_val_cols = ['public_meeting', 'permit']

# for this set of columns, replace missing values with the boolean True
# (the string 'True' would leave the columns with mixed bool and str values)
filler = True

for col in missing_val_cols:
    transformed_data[col] = fill_missing_vals(transformed_data, col, value = filler)
    print('{} now contains {} missing values'.format(col, transformed_data[col].isnull().sum()))

public_meeting now contains 0 missing values
permit now contains 0 missing values


In [107]:
col = 'scheme_management'
filler = 'VWC'

transformed_data[col] = fill_missing_vals(transformed_data, col, value = filler)
print('{} now contains {} missing values'.format(col, transformed_data[col].isnull().sum()))

scheme_management now contains 0 missing values


In [108]:
# Let's confirm that there aren't any missing values left
print('There are {} missing values'.format(transformed_data.isnull().sum().sum()))

There are 0 missing values


## Replace Other Values

I need to replace the zeros in the **`construction_year`** column with 1997, the average of the column's non-zero values. This value was calculated in the data exploration notebook.

In [109]:
avg_year = 1997
col = 'construction_year'
transformed_data[col] = transformed_data[col].replace(to_replace = 0, value = avg_year)
print("The average of the construction year column is: ", round(transformed_data[col].mean()))

The average of the construction year column is:  1997
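
For reference, the fill value can also be derived from the column itself rather than typed in. This is a minimal sketch on made-up years, not the calculation from the exploration notebook:

```python
import pandas as pd

# Toy column with made-up years; 0 marks an unknown construction year
toy = pd.DataFrame({'construction_year': [0, 1990, 2000, 0, 2010]})
col = 'construction_year'

# Average over the non-zero years only, rounded to a whole year
avg_year = int(round(toy.loc[toy[col] != 0, col].mean()))

# Replace the zeros with that average
toy[col] = toy[col].replace(to_replace=0, value=avg_year)

print(avg_year)           # 2000
print(toy[col].tolist())  # [2000, 1990, 2000, 2000, 2010]
```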


## Handle Categorical Variables

### Use Frequency Distribution

Frequency-encode the following columns:
* `funder`
* `installer`
* `subvillage`
* `scheme_name`
* `ward`
* `lga`

These columns contain too many unique values to be one-hot encoded. Instead, I'll replace each value with its relative frequency, that is, the share of rows in which it appears.

In [110]:
columns = ['funder', 'installer', 'subvillage', 'scheme_name', 'ward', 'lga']

for col in columns:
    # relative frequency of each unique value in the column
    counts = transformed_data[col].value_counts(normalize = True)
    
    # map every element in the column to the frequency of its value
    transformed_data[col] = transformed_data[col].map(counts)
    print('{} transformed'.format(col))

# Sample the relevant columns
sample_data = transformed_data[columns]
print("\n")
print(sample_data.head())

funder transformed
installer transformed
subvillage transformed
scheme_name transformed
ward transformed
lga transformed


     funder  installer  subvillage  scheme_name      ward       lga
0  0.004525   0.001603    0.000081     0.002303  0.000552  0.009253
1  0.000862   0.000673    0.000081     0.474855  0.001212  0.012189
2  0.000121   0.006963    0.008498     0.000135  0.000175  0.005010
3  0.017832   0.003838    0.000323     0.474855  0.000552  0.002869
4  0.000013   0.002276    0.000027     0.474855  0.000189  0.013024


The sample shows that each original string value has been replaced by the relative frequency of that value within its column.
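
To make the encoding concrete, here is a toy sketch with made-up categories. It also shows one caveat of the approach: categories that occur equally often receive the same number and can no longer be told apart.

```python
import pandas as pd

# Toy column with made-up categories
toy = pd.Series(['a', 'b', 'a', 'c', 'a'], name='funder')

# Share of rows per category: a -> 3/5, b -> 1/5, c -> 1/5
freq = toy.value_counts(normalize=True)

# Replace each category with its share
encoded = toy.map(freq)

print(encoded.tolist())  # [0.6, 0.2, 0.6, 0.2, 0.6]
```

Here `b` and `c` both become 0.2, so the model sees them as the same value.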

### One-Hot Encoding 

Most of our categorical variables have few enough unique values to be one-hot encoded. They are listed below:

In [111]:
# Perform one-hot encoding on a list of columns
columns = ['scheme_management', 'basin', 'region', 'region_code', 'district_code', 
           'extraction_type', 'extraction_type_group', 'extraction_type_class', 'management', 
           'management_group', 'payment', 'water_quality', 'quality_group', 
           'source', 'source_class', 'waterpoint_type']

transformed_data = pd.get_dummies(transformed_data, columns = columns)
print('dummies created!')
print('Dataframe shape:', transformed_data.shape)

dummies created!
Dataframe shape: (74250, 204)
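
As a small illustration of what `pd.get_dummies` does with the `columns` argument, here is a sketch on a made-up frame. Numeric codes such as `region_code` are encoded only because they are listed explicitly; without `columns`, `get_dummies` converts only object, string, and categorical columns.

```python
import pandas as pd

# Toy frame with made-up rows
toy = pd.DataFrame({
    'basin': ['Lake Nyasa', 'Lake Victoria', 'Lake Nyasa'],
    'region_code': [11, 20, 11],
    'amount_tsh': [6000.0, 0.0, 25.0],
})

# Each listed column is replaced by one indicator column per unique value,
# named <column>_<value>; unlisted columns are passed through unchanged
encoded = pd.get_dummies(toy, columns=['basin', 'region_code'])

print(encoded.columns.tolist())
# ['amount_tsh', 'basin_Lake Nyasa', 'basin_Lake Victoria', 'region_code_11', 'region_code_20']
```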


In [112]:
transformed_data.head(2)

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,num_private,subvillage,...,source_class_groundwater,source_class_surface,source_class_unknown,waterpoint_type_cattle trough,waterpoint_type_communal standpipe,waterpoint_type_communal standpipe multiple,waterpoint_type_dam,waterpoint_type_hand pump,waterpoint_type_improved spring,waterpoint_type_other
0,69572,6000.0,2011-03-14,0.004525,1390,0.001603,34.938093,-9.856322,0,8.1e-05,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,8776,0.0,2013-03-06,0.000862,1399,0.000673,34.698766,-2.147466,0,8.1e-05,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
