# Water Pumps: Preprocessing
## Business Problem:
Tanzania is a developing country where access to water is critical to public health, so it is vital that water pumps stay in working order. Currently, the only way to check a pump's working status is to visit the site in person, which is time-consuming and costly. A more intelligent way to monitor water pump status is therefore desirable.

This project will address the following question: How can the government of Tanzania improve water pump maintenance by knowing the pump functional status in advance?

* Plan:
    1. Normalize continuous variables.
    2. Perform model selection.
    
* Model Selection: Models to Try:
    * Tree-based.
        * Random Forest.
        * AdaBoost.
        * Gradient Boosting/XGBoost.
    * Logistic regression.
    
## Import libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

## Load Dataset
Load the dataset after it has been modified during EDA.

In [2]:
# Parse date_recorded as a datetime column while loading.
df = pd.read_csv('../data/clean/eda_data.csv', parse_dates=['date_recorded'])

## Preprocessing
### Data types
Review the data types present in the dataset.

In [3]:
all_data_types = df.dtypes
unique_data_types = all_data_types.unique()
print(unique_data_types)

[dtype('O') dtype('<M8[ns]') dtype('int64') dtype('float64')]


In [4]:
print(df.select_dtypes(include=['O']).columns.tolist())

['status_group']


The dataset contains four data types: *object* (only the *status_group* column), datetime, int64 and float64.

### Normalization

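First, select only the numerical columns that have not been one-hot encoded. A column is treated as one-hot encoded when its maximum value in the summary statistics is 0 or 1.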

In [5]:
all_numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

In [6]:
df_summary = df[all_numerical_columns].describe().T

In [7]:
numerical_columns = df_summary[~df_summary['max'].isin([0.0, 1.0])].index.to_list()

In [8]:
df[numerical_columns].describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| gps_height | 30423.0 | 983.518456 | 611.146363 | -63.0 | 353.0 | 1148.0 | 1462.0 | 2628.0 |
| longitude | 30423.0 | 35.832252 | 2.650462 | 29.607122 | 34.607104 | 36.404454 | 37.751836 | 40.345193 |
| latitude | 30423.0 | -6.084223 | 2.738387 | -11.568577 | -8.506306 | -5.749784 | -3.599285 | -1.042375 |
| log_population | 30423.0 | 5.236613 | 1.095295 | 0.693147 | 4.60517 | 5.298317 | 5.940171 | 10.325482 |
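
The column selection in cells 5 to 7 relies on the maximum of each column. The following is a minimal sketch of that idea on a toy frame; the frame `toy` and the indicator column names `basin_a` and `basin_b` are hypothetical and not taken from the dataset. It also shows a stricter alternative that checks every value rather than only the maximum, which may be safer if a continuous column could have a maximum of exactly 0 or 1.

```python
import pandas as pd

# Hypothetical toy frame: one continuous column and two one-hot columns.
toy = pd.DataFrame({
    "gps_height": [-63, 353, 2628],
    "basin_a": [0, 1, 0],  # hypothetical indicator column
    "basin_b": [0, 0, 0],  # indicator that is never set: its max is 0
})

# Same approach as the notebook: drop columns whose maximum is 0 or 1.
summary = toy.describe().T
continuous = summary[~summary["max"].isin([0.0, 1.0])].index.to_list()

# Stricter alternative (an assumption, not the notebook's method):
# a column counts as one-hot only if all of its values are 0 or 1.
is_binary = toy.isin([0, 1]).all()
continuous_strict = is_binary[~is_binary].index.to_list()

print(continuous)
print(continuous_strict)
```

On this toy frame both approaches keep only `gps_height`.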


Normalize the selected numerical columns to the range 0.0 to 1.0 using min-max scaling.
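
Min-max scaling maps each value `x` in a column to `(x - min) / (max - min)`. As a rough illustration, the sketch below applies that formula by hand to the five-number summary of `gps_height` from the table above (min, quartiles and max) and compares the result with `MinMaxScaler`. The array `x` is only a stand-in for the real column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# min, 25%, 50%, 75% and max of gps_height from the summary table.
x = np.array([[-63.0], [353.0], [1148.0], [1462.0], [2628.0]])

# Manual min-max scaling: (x - min) / (max - min), column by column.
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same transformation with scikit-learn.
scaled = MinMaxScaler().fit_transform(x)

print(np.round(manual.ravel(), 6))
print(np.allclose(manual, scaled))
```

Because the transformation is linear and increasing, the quartiles of the scaled column should be the scaled quartiles of the original column, roughly 0.1546, 0.4500 and 0.5667 for `gps_height`.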

In [10]:
df[numerical_columns] = MinMaxScaler().fit_transform(df[numerical_columns])

In [11]:
df[numerical_columns].describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| gps_height | 30423.0 | 0.388896 | 0.227108 | 0.0 | 0.154589 | 0.450019 | 0.566704 | 1.0 |
| longitude | 30423.0 | 0.579725 | 0.246828 | 0.0 | 0.465631 | 0.633012 | 0.758489 | 1.0 |
| latitude | 30423.0 | 0.521019 | 0.26015 | 0.0 | 0.290919 | 0.552791 | 0.757091 | 1.0 |
| log_population | 30423.0 | 0.471689 | 0.11371 | 0.0 | 0.406134 | 0.478095 | 0.54473 | 1.0 |
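
The scaler above is fitted on the full dataset. If the data is later split into training and test sets for model selection, one common option is to fit the scaler on the training portion only and reuse it on the test portion, so that test-set statistics do not influence preprocessing. The sketch below illustrates this on randomly generated toy data; the split ratio, the random seed and the frame `toy` are assumptions, not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data standing in for two of the numerical columns.
rng = np.random.default_rng(0)
cols = ["gps_height", "log_population"]
toy = pd.DataFrame({
    "gps_height": rng.uniform(-63, 2628, size=100),
    "log_population": rng.uniform(0.69, 10.33, size=100),
})

# Assumed 80/20 split; copies avoid writing into views of the original frame.
train, test = train_test_split(toy, test_size=0.2, random_state=0)
train, test = train.copy(), test.copy()

# Fit on the training rows only, then apply the same scaling to both parts.
scaler = MinMaxScaler().fit(train[cols])
train[cols] = scaler.transform(train[cols])
test[cols] = scaler.transform(test[cols])

print(train[cols].describe().T[["min", "max"]])
print(test[cols].describe().T[["min", "max"]])
```

With this approach the training columns span exactly 0.0 to 1.0, while test values can fall slightly outside that range when the test set contains values beyond the training minimum or maximum.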
