# Data Preprocessing

After exploring the data, I listed the steps needed to prepare it:

* Drop `quantity_group`, `wpt_name`, `recorded_by`, `source_type`, `waterpoint_type_group`, `payment_type` columns
* Replace missing values in `funder`, `installer`, `subvillage`, `scheme_name` columns with a placeholder value that acts as an `other` category
* Replace missing values in `public_meeting`, `permit` columns with `True`
* Replace missing values in `scheme_management` with `VWC`
* Replace zeros in `construction_year` with the column average (computed without the zeros)
* Label encode `funder`, `installer`, `subvillage`, `scheme_name`, `ward`, `lga`, `district_code`, `region_code`, `construction_year` columns, and then group values using a frequency distribution for each value
* One-hot encode `scheme_management`, `basin`, `region`, `extraction_type`, `extraction_type_group`, `extraction_type_class`, `management`, `management_group`, `payment`, `water_quality`, `quality_group`, `source`, `source_class`, `waterpoint_type` columns (`payment_type`, `source_type`, and `waterpoint_type_group` are dropped in the first step, so they are not encoded)
* Normalize `amount_tsh`, `gps_height`, `population` columns
* Create an `amount_tsh`:`gps_height` ratio
* Handle imbalanced class labels
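
The normalization and ratio steps in the list above are not spelled out, so here is a minimal sketch of one way they could look. It assumes min-max scaling (the notes don't say which normalization is meant) and uses a made-up toy frame; `min_max_normalize`, `add_tsh_height_ratio`, and the `tsh_height_ratio` column name are hypothetical, not part of this notebook.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "amount_tsh": [0.0, 50.0, 6000.0],
    "gps_height": [0, 1390, 686],
    "population": [1, 109, 250],
})

def min_max_normalize(df, cols):
    """Rescale each listed column to the [0, 1] range (hypothetical helper)."""
    df = df.copy()
    for col in cols:
        col_min, col_max = df[col].min(), df[col].max()
        span = col_max - col_min
        # a constant column has zero span; map it to 0.0 instead of dividing by zero
        df[col] = (df[col] - col_min) / span if span != 0 else 0.0
    return df

def add_tsh_height_ratio(df):
    """Add amount_tsh / gps_height, leaving the ratio missing where gps_height is 0."""
    df = df.copy()
    height = df["gps_height"].where(df["gps_height"] != 0)  # 0 becomes NaN
    df["tsh_height_ratio"] = df["amount_tsh"] / height
    return df

toy = add_tsh_height_ratio(toy)  # compute the ratio on the raw values first
toy = min_max_normalize(toy, ["amount_tsh", "gps_height", "population"])
print(toy)
```

Computing the ratio before scaling keeps it in the original units. Rows with a `gps_height` of zero end up with a missing ratio, which would still need a fill strategy.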

In [5]:
import pandas as pd

# load transformed data
data = pd.read_csv('../data/interim/transformed_data.csv')

## Drop Columns

In [7]:
# a list of columns to drop
cols_to_drop = ['quantity_group', 'wpt_name', 'recorded_by', 'source_type', 'waterpoint_type_group', 'payment_type']

In [111]:
def drop_columns(df, cols):
    """Drops columns
    
    input: 1). A Pandas Dataframe and 2). A list of strings
    output: A Pandas Dataframe without the dropped columns
    """
    # drop() returns a new dataframe, so the original is left untouched
    return df.drop(columns=cols)

transformed_data = drop_columns(data, cols_to_drop)
print("Columns dropped! Dataframe now contains {} columns".format(transformed_data.shape[1]))
transformed_data.head(1)

Columns dropped! Dataframe now contains 35 columns


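| Unnamed: 0 | id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | num_private | basin | ... | management | management_group | payment | water_quality | quality_group | quantity | source | source_class | waterpoint_type | status_group |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 69572 | 6000.0 | 2011-03-14 | Roman | 1390 | Roman | 34.938093 | -9.856322 | 0 | Lake Nyasa | ... | vwc | user-group | pay annually | soft | good | enough | spring | groundwater | communal standpipe | functional |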


## Replace Missing Values

Missing values need to be replaced with a specified value, and that value varies per column.

Missing values in these columns will be replaced by an arbitrary integer (4000) that does not already appear in them:

* **`funder`**
* **`installer`**
* **`subvillage`**
* **`scheme_name`**

Missing values in these columns will be replaced by a fixed value:

* **`public_meeting`** - replace with `True`
* **`permit`** - replace with `True`
* **`scheme_management`** - replace with `VWC`
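
As a compact alternative to filling one column at a time, the per-column fillers listed above can be passed to `DataFrame.fillna` as a dictionary. This is only a sketch on an invented toy frame; using the boolean `True` for `public_meeting` is my assumption about the intended type.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the values are invented for illustration.
toy = pd.DataFrame({
    "funder": ["Roman", np.nan, "Unicef"],
    "public_meeting": [True, np.nan, False],
    "scheme_management": ["VWC", "WUG", np.nan],
})

# One filler per column. 4000 mirrors the arbitrary placeholder used in this notebook;
# the boolean True for public_meeting is an assumption about the intended type.
fill_values = {
    "funder": 4000,
    "public_meeting": True,
    "scheme_management": "VWC",
}

# fillna accepts a dict mapping column name -> replacement value
filled = toy.fillna(value=fill_values)
print(filled.isnull().sum())
```

Keeping the fillers in one dictionary puts the whole imputation plan in a single place.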

In [99]:
def fill_missing_vals(df, col, value):
    """Replace missing values in one column with a scalar
    
    input: 1). A Pandas Dataframe, 2). Column name, 3). Value to fill holes: scalar
    output: A Pandas Series (the column) with no missing values
    """
    # fillna returns a new Series, so the original dataframe is not modified
    return df[col].fillna(value = value)

In [112]:
# a list of a few columns containing missing values.  
missing_val_cols = ['funder', 'installer', 'subvillage', 'scheme_name']

# Replace missing values with this number, an arbitrary value that isn't currently found in these columns.
filler = 4000

for col in missing_val_cols:
    transformed_data[col] = fill_missing_vals(transformed_data, col, value = filler)
    print('{} now contains {} missing values'.format(col, transformed_data[col].isnull().sum()))

funder now contains 0 missing values
installer now contains 0 missing values
subvillage now contains 0 missing values
scheme_name now contains 0 missing values


In [113]:
missing_val_cols = ['public_meeting', 'permit']

# for this set of columns, replace missing values with True
# (the boolean, not the string 'True', so the column doesn't end up with a third category)
filler = True

for col in missing_val_cols:
    transformed_data[col] = fill_missing_vals(transformed_data, col, value = filler)
    print('{} now contains {} missing values'.format(col, transformed_data[col].isnull().sum()))

public_meeting now contains 0 missing values
permit now contains 0 missing values


In [114]:
col = 'scheme_management'
filler = 'VWC'

transformed_data[col] = fill_missing_vals(transformed_data, col, value = filler)
print('{} now contains {} missing values'.format(col, transformed_data[col].isnull().sum()))

scheme_management now contains 0 missing values


In [115]:
# Let's confirm that there aren't any missing values left
print('There are {} missing values'.format(transformed_data.isnull().sum().sum()))

There are 0 missing values


## Replace Other Values

Zeros in the **`construction_year`** column are replaced with 1997, the average of the column without zeros. I calculated this value in the data exploration notebook.

In [116]:
avg_year = 1997
col = 'construction_year'
transformed_data[col] = transformed_data[col].replace(to_replace = 0, value = avg_year)
print("The average of the construction year column is: ", round(transformed_data[col].mean()))

The average of the construction year column is:  1997
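
For reference, here is a minimal sketch of how an average that ignores the zeros can be computed and applied in one place. The years below are invented toy values, not the real column, so the result differs from 1997.

```python
import pandas as pd

# Toy column; the years are invented and 0 marks an unknown construction year.
years = pd.Series([1999, 0, 2005, 0, 1990], name="construction_year")

# average over the known (non-zero) years only, rounded to a whole year
avg_year = int(round(years[years != 0].mean()))

# swap the zeros for that average
years_filled = years.replace(to_replace=0, value=avg_year)
print(avg_year, years_filled.tolist())
```

Deriving the value from the column itself avoids hard-coding a number that could go stale if the data changes.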


## Handle Categorical Variables

In [None]:
# encode values as numerical types
transformed_data['public_meeting'] = pd.factorize(transformed_data['public_meeting'], sort=True)[0]
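
To make the encoding steps from the plan concrete, here is a small sketch on an invented toy frame. It shows what `pd.factorize` returns, one possible reading of "group values using a frequency distribution" (rare values share an `other` bucket), and one-hot encoding with `pd.get_dummies`. The `min_count` threshold and the `installer_grouped` column are hypothetical.

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "installer": ["DWE", "DWE", "Roman", "DWE", "Unicef", "Roman"],
    "basin": ["Lake Nyasa", "Pangani", "Lake Nyasa", "Rufiji", "Pangani", "Rufiji"],
})

# pd.factorize returns (codes, uniques); with sort=True the codes follow sorted order
codes, uniques = pd.factorize(toy["installer"], sort=True)

# Frequency grouping (assumed interpretation): values seen fewer than
# min_count times are merged into a single "other" bucket.
min_count = 2  # hypothetical threshold
counts = toy["installer"].map(toy["installer"].value_counts())
toy["installer_grouped"] = toy["installer"].where(counts >= min_count, "other")

# One-hot encode a low-cardinality column: one indicator column per category
one_hot = pd.get_dummies(toy["basin"], prefix="basin")

print(codes, list(uniques))
print(toy["installer_grouped"].tolist())
print(one_hot.columns.tolist())
```

Grouping rare values before encoding keeps high-cardinality columns such as `installer` from producing a long tail of categories that each appear only a handful of times.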