# Pump It Up Challenge - Cleaning

The approach for this was adapted from Desislava Petkova, who performed the analysis in R. That repository can be viewed [here](https://github.com/dipetkov/DrivenData-PumpItUp/blob/master/transform-data.md).

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import re
import datetime as dt
from scripts import pumpitup

from scipy import stats

from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

sns.set_style("white")
sns.set_context("talk")

%matplotlib inline

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
pd.options.display.max_columns = 50

# Training Data

## Data

In [None]:
train_data = pd.read_csv('data/training_set.csv')

In [None]:
backup = pd.read_csv('data/training_set.csv')

In [None]:
train_data.shape
train_data.head(5)

First, it's interesting to see how many unique values there are for every feature, which gives us an insight into the granularity and usefulness of each one.

We can see that there are some similarly named features which might also share similar levels of detail, while some of them contain thousands of possible values.

In [None]:
df_uniques = pumpitup.unique_count(train_data)
df_uniques
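
The `pumpitup.unique_count` helper lives in the project's `scripts` folder and isn't shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch on a toy dataframe. The function name `unique_count_sketch` and the `n_unique` column are my own assumptions, not the real implementation.

```python
import pandas as pd

def unique_count_sketch(df):
    """Return the number of distinct non-null values per column, largest first."""
    counts = df.nunique(dropna=True)
    return counts.sort_values(ascending=False).to_frame(name='n_unique')

# Toy rows for illustration only, not taken from the real dataset.
toy = pd.DataFrame({
    'region': ['Iringa', 'Iringa', 'Mara', 'Mara'],
    'recorded_by': ['GeoData', 'GeoData', 'GeoData', 'GeoData'],
    'id': [1, 2, 3, 4],
})
result = unique_count_sketch(toy)
print(result)
```

A column with as many unique values as rows (like `id`) or with a single value (like `recorded_by`) stands out immediately in this kind of summary.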

Next we can look at the percentage of missing values for each feature. My function treats zeros as missing values for numeric data, but we should bear the context in mind when interpreting this.

One interesting example is `num_private` where we previously saw that there were 65 values that it could take, but we now see that almost 99% of them are 0, probably rendering the feature useless to us. We also have a few other features with large proportions of missing data, which we address later.
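
The missing-value helper (`pumpitup.percent_missing`) is not shown here either. The sketch below is one possible way to count nulls and numeric zeros together, assuming that is all the real function does. The name `percent_missing_sketch` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def percent_missing_sketch(df):
    """Percentage of missing entries per column; zeros count as missing in numeric columns."""
    missing = df.isnull()
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    # For numeric columns, a zero is treated the same way as a null.
    missing[numeric_cols] = missing[numeric_cols] | (df[numeric_cols] == 0)
    return (100 * missing.mean()).round(1)

# Toy rows for illustration only.
toy = pd.DataFrame({
    'num_private': [0, 0, 0, 7],
    'population': [0, 120, 0, 45],
    'funder': ['Danida', None, 'Hesawa', 'Danida'],
})
result = percent_missing_sketch(toy)
print(result)
```

Boolean columns are deliberately left out of the zero check, since *False* is a legitimate value rather than a gap.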

In [None]:
# Use .loc rather than chained indexing so the assignment reliably modifies train_data
train_data.loc[train_data['operation_years'] < 0, 'operation_years'] = train_data['operation_years'].median()

### Categorical Features

#### Installer and Funder

First up, let's look at the `installer` and `funder` features again. If we plot them on a map, we can see that there is some degree of geographical clustering.

In [None]:
# Work on filled copies so the missing values in train_data stay intact for imputation later
installer = train_data['installer'].fillna('none')
funder = train_data['funder'].fillna('none')

installer_encoded = pumpitup.label_encode(installer, le)
installer_encoded_norm = installer_encoded / installer_encoded.max()
installer_cmap = [cmap(x) for x in installer_encoded_norm]

funder_encoded = pumpitup.label_encode(funder, le)
funder_encoded_norm = funder_encoded / funder_encoded.max()
funder_cmap = [cmap(x) for x in funder_encoded_norm]

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))

ax[0].scatter(train_data['longitude'], train_data['latitude'], c=installer_cmap, linewidth=0)
ax[0].set_ylim(-13, 0)
ax[0].set_xlim(28, 42)
ax[0].set_ylabel('Latitude')
ax[0].set_xlabel('Longitude')

ax[1].scatter(train_data['longitude'], train_data['latitude'], c=funder_cmap, linewidth=0)
ax[1].set_ylim(-13, 0)
ax[1].set_xlim(28, 42)
ax[1].set_ylabel('Latitude')
ax[1].set_xlabel('Longitude')

plt.tight_layout()

To impute these, we will again use imputation by grouping with region and district code. Any missing values in region and district code combinations that do not have any other values will simply be relabeled as *other*.

In [None]:
mask = train_data['installer'].isnull()
# Placeholder first, so groups with no known installer end up labelled 'other'
train_data.loc[mask, 'installer'] = 'other'
train_data.loc[mask, 'installer'] = train_data.groupby(['region', 'district_code'])['installer'].transform(lambda x: x.value_counts().index[0])
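
To make the grouping idea concrete, here is a small self-contained sketch of mode imputation within groups. It is a variant of the cell above rather than a copy of it: the mode is taken over known values only, and *other* is used only when a group has no known values at all. The function name `fill_with_group_mode` and the toy rows are assumptions for illustration.

```python
import pandas as pd

def fill_with_group_mode(df, column, group_cols, fallback='other'):
    """Fill nulls in `column` with the most common known value in each group."""
    def group_mode(values):
        counts = values.value_counts()  # nulls are ignored by default
        return counts.index[0] if len(counts) else fallback

    modes = df.groupby(group_cols)[column].transform(group_mode)
    return df[column].fillna(modes)

# Toy rows for illustration only.
toy = pd.DataFrame({
    'region': ['Iringa', 'Iringa', 'Iringa', 'Mara', 'Mara'],
    'district_code': [1, 1, 1, 2, 2],
    'installer': ['DWE', 'DWE', None, None, None],
})
toy['installer'] = fill_with_group_mode(toy, 'installer', ['region', 'district_code'])
print(toy)
```

The same function would work for `funder`, or for the boolean features with `fallback=True`.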


And now `funder`.

In [None]:
mask = train_data['funder'].isnull()
# Placeholder first, so groups with no known funder end up labelled 'other'
train_data.loc[mask, 'funder'] = 'other'
train_data.loc[mask, 'funder'] = train_data.groupby(['region', 'district_code'])['funder'].transform(lambda x: x.value_counts().index[0])

#### Permit and Public Meeting

Both of these features are boolean. A quick look at grouping by `region` and `district_code` shows that different places have different distributions of *True* vs *False*. Without much else to go on at this point, we can use these to fill in the missing values, defaulting to *True* (the more common value overall in both features) if there are only missing values in the district.

In [None]:
mask = train_data['permit'].isnull()
# Default to True, then overwrite with the most common value in each district
train_data.loc[mask, 'permit'] = True
train_data.loc[mask, 'permit'] = train_data.groupby(['region', 'district_code'])['permit'].transform(lambda x: x.value_counts().index[0])

In [None]:
mask = train_data['public_meeting'].isnull()
# Default to True, then overwrite with the most common value in each district
train_data.loc[mask, 'public_meeting'] = True
train_data.loc[mask, 'public_meeting'] = train_data.groupby(['region', 'district_code'])['public_meeting'].transform(lambda x: x.value_counts().index[0])

Data cleaning done!

We can now finally drop `district_code` from our dataframe.

The regions are shown clearly divided in the left figure. There are 21 regions, with each of them containing a significant number of points from the dataset. Using a colourmap to plot the wards does not make a lot of visual sense beyond simply highlighting the huge increase in precision. In some cases, there is only one point in a ward, making it too granular of a categorical feature for our purposes, so we will drop it.

We will eventually drop `district_code` as well, but for now I'm going to keep it as it will help to resolve some missing values issues later.

In [None]:
train_data.drop(['region_code', 'subvillage', 'ward'], axis=1, inplace=True)

The `lga` feature also contains geographic information more precise than region, but most likely not representing any other political boundaries. However, it does contain information about whether a point is rural or urban.

We can create a recoded feature from this that simply contains *rural*, *urban* and *other* as categories, adding this information on top of the regional information we already have.

In [None]:
series = train_data['lga'].copy()
series[series.str.contains('Rural')] = 'rural'
series[series.str.contains('Urban')] = 'urban'
other_flag = ~(series.str.contains('rural') | series.str.contains('urban'))
series[other_flag] = 'other'

train_data['lga'] = series
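
The same recoding can be written more compactly with `np.select`, which picks the first matching label for each row. This is just an alternative sketch of the idea, with a hypothetical function name and toy district names.

```python
import numpy as np
import pandas as pd

def recode_lga(lga):
    """Collapse district names into 'rural', 'urban' or 'other'."""
    is_rural = lga.str.contains('Rural', na=False).to_numpy()
    is_urban = lga.str.contains('Urban', na=False).to_numpy()
    # The first condition that holds wins; rows matching neither get the default.
    labels = np.select([is_rural, is_urban], ['rural', 'urban'], default='other')
    return pd.Series(labels, index=lga.index, name=lga.name)

# Toy values for illustration only.
toy = pd.Series(['Iringa Rural', 'Arusha Urban', 'Kilosa', 'Moshi Rural'], name='lga')
recoded = recode_lga(toy)
print(recoded.tolist())
```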

In [None]:
lga_cmap = train_data['lga'].copy()
lga_cmap[series.str.contains('rural')] = 'green'
lga_cmap[series.str.contains('urban')] = 'red'
other_flag = ~(series.str.contains('rural') | series.str.contains('urban'))
lga_cmap[other_flag] = 'blue'

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(5, 5))

ax.scatter(train_data['longitude'], train_data['latitude'], c=lga_cmap, linewidth=0)
ax.set_ylim(-13, 0)
ax.set_xlim(28, 42)
ax.set_ylabel('Latitude')
ax.set_xlabel('Longitude')

Plotting our recoded `lga` feature on a map shows collections of points that look like urban centres. The distinction between *rural* and *other* is less clear.

There is also altitude and population information in the `gps_height` and `population` features. We can plot them in the same way to get an idea of how they're distributed. From the missing percentages, we can see that about 35% of both these features are zeros, so we can visualise where that is occurring too.

In [None]:
viridis = plt.get_cmap('viridis')

In [None]:
gps_height_encoded = pumpitup.label_encode(train_data['gps_height'][train_data['gps_height'] > 0], le)
gps_height_encoded_norm = gps_height_encoded / gps_height_encoded.max()
gps_height_cmap = [viridis(x) for x in gps_height_encoded_norm]

population_encoded = pumpitup.label_encode(train_data['population'][train_data['population'] > 0], le)
population_encoded_norm = population_encoded / population_encoded.max()
population_cmap = [viridis(x) for x in population_encoded_norm]
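
The colour mapping above relies on `pumpitup.label_encode`, which isn't shown. I'm assuming it wraps `LabelEncoder.fit_transform`, that is, it replaces each distinct value with its rank among the sorted unique values. The sketch below reproduces that idea with `np.unique` as a stand-in and scales the result to the 0 to 1 range that a colourmap expects. The function name is hypothetical.

```python
import numpy as np
import pandas as pd

def encode_to_unit_interval(values):
    """Map each distinct value to its rank among the sorted unique values, scaled to [0, 1]."""
    # np.unique returns the sorted distinct values and, for each element,
    # the position of its value in that sorted list (an integer label).
    _, codes = np.unique(np.asarray(values), return_inverse=True)
    top = codes.max()
    return codes / top if top > 0 else np.zeros(len(codes))

# Toy altitudes for illustration only.
toy = pd.Series([1200, 35, 35, 640])
scaled = encode_to_unit_interval(toy)
print(scaled)
```

Note that this colours points by rank rather than by magnitude, so two neighbouring ranks get neighbouring colours even if the underlying values are far apart.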

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))

ax[0].scatter(train_data['longitude'][train_data['gps_height'] == 0],\
           train_data['latitude'][train_data['gps_height'] == 0], color='gray')
ax[0].scatter(train_data['longitude'][train_data['gps_height'] > 0],\
           train_data['latitude'][train_data['gps_height'] > 0], c=gps_height_cmap, linewidth=0)
ax[0].set_ylim(-13, 0)
ax[0].set_xlim(28, 42)
ax[0].set_ylabel('Latitude')
ax[0].set_xlabel('Longitude')

ax[1].scatter(train_data['longitude'][train_data['population'] == 0],\
           train_data['latitude'][train_data['population'] == 0], color='gray')
ax[1].scatter(train_data['longitude'][train_data['population'] > 0],\
           train_data['latitude'][train_data['population'] > 0], c=population_cmap, linewidth=0)
ax[1].set_ylim(-13, 0)
ax[1].set_xlim(28, 42)
ax[1].set_ylabel('Latitude')
ax[1].set_xlabel('Longitude')

plt.tight_layout()

In both cases, the missing data is concentrated in particular regions, within clean boundaries. This might cause us some problems later for filling in missing values.

### Time

There are two features related to time: `construction_year` and `date_recorded`.

#### Construction Year

From our missing data percentages, we can see that `construction_year` is missing around 35% of the data, which we should check out in more detail. I also wanted to see whether there was any trend of water pumps being installed in certain parts of the country at different times. Plotting the construction year against longitude and latitude, we can see that to a first approximation, installations are distributed randomly through space and time. However, we can see that the missing data is not random at all, and is concentrated in specific regions of the country. Again, this makes imputing the missing data intelligently quite difficult, as we have little to compare it to.

## Cleaning

We can immediately remove the `id` feature as it is unique for each row and therefore will not help our prediction. Conversely, `recorded_by` takes a single value for the whole dataset, so this can also be removed. We also drop `num_private`, which we saw earlier is almost entirely zeros.

In [None]:
train_data.drop(['id', 'recorded_by', 'num_private'], axis=1, inplace=True)

### Extraction Types

Looking at `extraction_type_class`, `extraction_type_group` and `extraction_type`, we can see a hierarchy of classification, with `extraction_type` as the most granular and `extraction_type_class` the most coarse.

We remove the mid level information and refactor the finest feature.

There are no missing values to take care of.

In [None]:
extraction_types = train_data.groupby(['extraction_type_class', 'extraction_type_group', 'extraction_type'])
extraction_types.count()
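
Reading the grouped table by eye works, but the hierarchy can also be checked programmatically: if a fine-grained feature is nested inside a coarser one, every fine value should map to exactly one coarse value. The helper `is_nested` below is a hypothetical sketch of that check, shown on toy rows rather than the real data.

```python
import pandas as pd

def is_nested(df, fine, coarse):
    """True if every value of `fine` maps to exactly one value of `coarse`."""
    return bool(df.groupby(fine)[coarse].nunique().eq(1).all())

# Toy rows for illustration only.
toy = pd.DataFrame({
    'extraction_type': ['gravity', 'swn 80', 'nira/tanira', 'gravity'],
    'extraction_type_class': ['gravity', 'handpump', 'handpump', 'gravity'],
})
fine_in_coarse = is_nested(toy, 'extraction_type', 'extraction_type_class')
coarse_in_fine = is_nested(toy, 'extraction_type_class', 'extraction_type')
print(fine_in_coarse, coarse_in_fine)
```

The same check could be reused for the other feature families (management, payment, quality, quantity, source and waterpoint) before deciding which level to drop.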

In [None]:
# Merge the small extraction types into broader categories.
# Assigning the result back avoids calling inplace=True on a column selection.
train_data['extraction_type'] = train_data['extraction_type'].replace({
    'other - swn 81': 'swn 81',
    'other - mkulima/shinyanga': 'other handpump',
    'other - play pump': 'other handpump',
    'cemo': 'other motorpump',
    'climax': 'other motorpump',
})

In [None]:
train_data.drop('extraction_type_group', axis=1, inplace=True)

### Management

The management information features `management_group` and `management` capture two different levels of granularity in categories large enough that they need no further modification. There are some unknown values, but it seems that this would refer to the management group being unknown by the recording party, rather than not recorded at all.

In [None]:
train_data.groupby(['management_group', 'management']).count()

### Scheme

The features in `scheme_management` and `scheme_name` are very different in their level of detail, with 13 and 2697 unique values respectively.

The `scheme_name` column also has 47% missing `NaN` values, whilst `scheme_management` is over 95% complete. For this reason, we will leave `scheme_management` without modification for now and drop `scheme_name`.

In [None]:
train_data.drop('scheme_name', axis=1, inplace=True)

However, on closer inspection, `scheme_management` essentially encodes the same information as `management`, so we will drop it too.

In [None]:
df_encoded = train_data[['latitude', 'longitude']]

In [None]:
series = backup.gps_height

In [None]:
height_filled = pumpitup.fill_missing_knn(series, df_encoded, k=5)

In [None]:
train_data['gps_height'] = height_filled

In [None]:
gpsf_height_encoded_norm = height_filled / height_filled.max()
gpsf_height_cmap = [viridis(x) for x in gpsf_height_encoded_norm]

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(5, 5))

ax.scatter(train_data['longitude'], train_data['latitude'], c=gpsf_height_cmap, linewidth=0)
ax.set_ylim(-13, 0)
ax.set_xlim(28, 42)
ax.set_ylabel('Latitude')
ax.set_xlabel('Longitude')

plt.tight_layout()

That looks pretty reasonable! It's certainly going to be better than simply picking the mean or median. In fact, it basically is picking the mean, but locally to where each point is located.
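
For reference, the "local mean" idea can be written in a few lines of NumPy. The real `pumpitup.fill_missing_knn` is not shown in this notebook, so the sketch below is only my assumption of the general approach: for each zero, average the values of the `k` nearest points that do have data. Names and toy coordinates are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_missing_knn_sketch(values, coords, k=5):
    """Replace zeros in `values` with the mean of the k nearest known points."""
    values = values.astype(float).copy()
    xy = coords.to_numpy(dtype=float)
    known = (values.notna() & (values != 0)).to_numpy()
    known_xy, known_vals = xy[known], values.to_numpy()[known]
    k = min(k, len(known_vals))

    for i in np.flatnonzero(~known):
        # Euclidean distance (in degrees) from this point to every known point.
        dist = np.hypot(*(known_xy - xy[i]).T)
        nearest = np.argsort(dist)[:k]
        values.iloc[i] = known_vals[nearest].mean()
    return values

# Two toy clusters, each with one missing (zero) altitude.
coords = pd.DataFrame({
    'latitude': [-2.0, -2.1, -2.2, -9.0, -9.1, -9.2],
    'longitude': [33.0, 33.1, 33.0, 35.0, 35.1, 35.0],
})
heights = pd.Series([1200, 1300, 0, 300, 400, 0])
filled = fill_missing_knn_sketch(heights, coords, k=2)
print(filled.tolist())
```

The explicit loop is fine for a toy example. On the full dataset, a tree-based neighbour search such as scikit-learn's `KNeighborsRegressor` would be the more practical choice.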

**Population**

About 35% of the population data is missing. Population will vary with geographical features such as land type and governmental boundaries. To impute the missing values, we're going to use the `lga`, `region` and `district_code` features. We can use a similar method to that used to impute the missing longitude and latitude values.

In [None]:
mask1 = pumpitup.flag_missing_s(train_data['population'])
# Mark missing entries as NaN so they are ignored when computing the group means
train_data['population'] = train_data['population'].mask(mask1)
# Select the population column before transforming, so a single series is assigned
train_data.loc[mask1, 'population'] = train_data.groupby(['lga', 'region', 'district_code'])['population'].transform('mean')
mask2 = train_data['population'].isnull()
train_data.loc[mask2, 'population'] = train_data.groupby(['lga', 'region'])['population'].transform('mean')
mask3 = train_data['population'].isnull()
train_data.loc[mask3, 'population'] = train_data.groupby('lga')['population'].transform('mean')
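
The cascade above (finest grouping first, then progressively coarser ones) can be wrapped in a small reusable function. This is a sketch under my own assumptions: missing values have already been set to `NaN`, and each fallback uses the means of the originally observed values only. The function name and toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_group_mean(df, column, group_levels):
    """Fill NaNs in `column` using group means, trying each grouping in turn."""
    observed = df[column].astype(float)
    filled = observed.copy()
    for group_cols in group_levels:
        # Mean of the observed values in each group, aligned to every row.
        group_mean = observed.groupby([df[c] for c in group_cols]).transform('mean')
        filled = filled.fillna(group_mean)
    return filled

# Toy rows for illustration only; zeros stand for missing populations.
toy = pd.DataFrame({
    'lga': ['rural', 'rural', 'rural', 'urban', 'urban'],
    'region': ['Mara', 'Mara', 'Iringa', 'Mara', 'Mara'],
    'population': [100, 0, 0, 900, 0],
})
toy['population'] = toy['population'].replace(0, np.nan)
toy['population'] = fill_with_group_mean(toy, 'population', [['lga', 'region'], ['lga']])
print(toy)
```

The rural point in Iringa has no neighbours with data at the finest level, so it falls through to the `lga`-only mean.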

In [None]:
train_data['population'] = train_data['population'].astype(int)

In [None]:
population_encoded = pumpitup.label_encode(train_data['population'][train_data['population'] > 0], le)
population_encoded_norm = population_encoded / population_encoded.max()
population_cmap = [viridis(x) for x in population_encoded_norm]

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(5, 5))

ax.scatter(train_data['longitude'][train_data['population'] == 0],\
           train_data['latitude'][train_data['population'] == 0], color='gray')
ax.scatter(train_data['longitude'][train_data['population'] > 0],\
           train_data['latitude'][train_data['population'] > 0], c=population_cmap, linewidth=0)
ax.set_ylim(-13, 0)
ax.set_xlim(28, 42)
ax.set_ylabel('Latitude')
ax.set_xlabel('Longitude')

plt.tight_layout()

We can see that urban areas have higher populations, while rural and others seem to have lower populations, with the data varying across regions and districts. The imputed values don't seem to fall quite as low as those found in some parts of the map with data available.

### Time

We have some missing values for the `construction_year` feature, or at least, it is unlikely that water pumps installed in the year 0 are still in use today.

We looked before at whether there was a strong relationship between the construction year of a pump and the location, but that didn't seem to be too strong. Maybe we can infer some information from the type of pump and who installed it.

Just before we do that, we also saw that some of the record dates fell during years that were before the construction years. This can't be right. I'm simply going to replace any where `year_recorded` < `construction_year` with the median year recorded.

In [None]:
train_data['construction_year'] = train_data['construction_year'].astype(int)

We can now calculate the number of years in operation and replace any negative values with the median.

In [None]:
train_data['operation_years'] = (train_data['year_recorded'] - train_data['construction_year']).astype(int)
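
As a small worked example of this step, the sketch below computes the years in operation on toy data and replaces the impossible negative value with the median. One assumption that differs from the notebook's own cell: the median here is taken over the valid (non-negative) values only.

```python
import pandas as pd

# Toy rows for illustration only; the third record pre-dates its construction year.
toy = pd.DataFrame({
    'year_recorded': [2011, 2013, 2004, 2012],
    'construction_year': [2000, 2010, 2008, 1995],
})
operation_years = toy['year_recorded'] - toy['construction_year']

# A negative age means the record pre-dates construction, which cannot be right.
invalid = operation_years < 0
operation_years = operation_years.mask(invalid, operation_years[~invalid].median())
print(operation_years.tolist())
```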

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(5, 5))

ax.scatter(train_data['longitude'][train_data['construction_year'] == 0],\
           train_data['latitude'][train_data['construction_year'] == 0], color='gray')
ax.scatter(train_data['longitude'][train_data['construction_year'] > 0],\
           train_data['latitude'][train_data['construction_year'] > 0], c=con_year_cmap, linewidth=0)
ax.set_ylim(-13, 0)
ax.set_xlim(28, 42)
ax.set_ylabel('Latitude')
ax.set_xlabel('Longitude')

Another way to look at this is to plot the longitude and latitude of each pump against the construction year in separate scatter plots. This gives us a sense of whether the installations have been concentrated in particular times and places. Besides years at the very beginning of the records and a few slightly more dense clusters, the installations still seem to be fairly uniformly distributed.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))

ax[0].scatter(train_data['construction_year'][train_data['construction_year'] > 0],\
              train_data['longitude'][train_data['construction_year'] > 0], alpha=0.05, linewidth=0)
ax[0].set_xlim(1959, 2013)
ax[0].set_ylim(28, 42)
ax[0].set_ylabel('Longitude')
ax[0].set_xlabel('Construction Year')

ax[1].scatter(train_data['construction_year'][train_data['construction_year'] > 0],\
              train_data['latitude'][train_data['construction_year'] > 0], alpha=0.05, color='red', linewidth=0)
ax[1].set_xlim(1959, 2013)
ax[1].set_ylim(-12, 0)
ax[1].set_ylabel('Latitude')
ax[1].set_xlabel('Construction Year')

plt.tight_layout()

One thing we could do is create a new feature by subtracting the `construction_year` from `date_recorded` to see whether the time in operation has any effect on failure rate. Before we do this however, I'd like to sort out the missing values, so we will leave `construction_year` alone for the time being.

#### Date Recorded

The `date_recorded` feature is a human readable date in a string format, which is not going to be very useful for our classifier. We can implement a helper function that turns the date into just the year and also derives a new feature `operation_years` which is the year of the record minus the year the pump was constructed.

There may also be a seasonal effect to pump failure (or at least the observance of pump failure). To take this into account, we create another feature that simply holds the month value from the feature.

In [None]:
train_data = pumpitup.convert_dates(train_data)
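
The `convert_dates` helper itself isn't shown, so here is a minimal sketch of what it might look like, based only on the description above. The column name `month_recorded` and the decision to drop `date_recorded` afterwards are my assumptions.

```python
import pandas as pd

def convert_dates_sketch(df):
    """Derive year, month and years-in-operation columns from `date_recorded`."""
    out = df.copy()
    recorded = pd.to_datetime(out['date_recorded'])
    out['year_recorded'] = recorded.dt.year
    out['month_recorded'] = recorded.dt.month
    out['operation_years'] = out['year_recorded'] - out['construction_year']
    return out.drop(columns='date_recorded')

# Toy rows for illustration only; a construction year of 0 marks missing data.
toy = pd.DataFrame({
    'date_recorded': ['2011-03-14', '2013-02-27'],
    'construction_year': [1999, 0],
})
converted = convert_dates_sketch(toy)
print(converted)
```

The second toy row shows why the missing construction years matter: a zero turns into more than 2000 years in operation.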

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(5, 5))

ax.scatter(train_data['construction_year'], train_data['year_recorded'], alpha=0.2, linewidth=0)
ax.set_xlim(1959, 2013)
ax.set_ylim(2003, 2014)
ax.set_ylabel('Year Recorded')
ax.set_xlabel('Construction Year')

Looking at this scatter plot tells us that most of the tests were carried out during the period 2011 through 2013. There are a few points where the `year_recorded` is earlier than the `construction_year`. One other issue is that the missing `construction_year` data has led to records indicating that over a third of pumps have been in operation for around 2000 years. Good going. 

In [None]:
len(train_data[train_data['operation_years'] > 2000])

### Other Variables

There are a few other variables that need to be picked and recategorised to clean up the data.

In particular, `wpt_name`, `installer` and `funder` are features with a large number of unique values.

According to the documentation, `wpt_name` refers to the name of the waterpoint. A few minutes spent looking at the entries and translating some Swahili into English gives us a bit more insight. In many cases the waterpoint name seems to be simply a name, but in others the name seems to refer to something of possible local significance that may affect how much a pump is used and how well it is maintained:

- Shuleni/Shule/School/Sekondari/Secondary/Msingi/Primary - School
- Zahanati/Clinic/Health - Health Clinic
- Hospitalini/Hospitali/Hospital - Hospital
- Office/Kijiji/Ofisini/Ofisi/Idara - Village Office
- Farm/Maziwa - Farm/Dairy
- Pump House/Bombani/Well/Kisima - Water related
- Kanisani/Kanisa/Church/Anglican/Pentecost/Luther/Msikitini/Msikiti/Mosque - Christian and Muslim places of worship
- Center/Market/Sokoni/Madukani - Town/Village Centre + Market/Shopping
- Ccm - Government?
- Kwa - Seems to accompany an individual's name

These groups seem to encompass a large enough number of points or have enough significance to form their own categories. There are lots of other smaller groups which we will group together.

In [None]:
# Map keywords found in the waterpoint name to a broader category.
# Order matters: earlier categories take precedence, because a relabelled
# row no longer contains any of the later keywords.
wpt_keywords = [
    ('school', ['school', 'shule', 'sekondari', 'secondary', 'msingi', 'primary']),
    ('health', ['clinic', 'zahanati', 'health', 'hospital']),
    ('official', ['ccm', 'office', 'kijiji', 'ofis', 'idara']),
    ('farm', ['farm', 'maziwa']),
    ('water', ['pump house', 'pump', 'bombani', 'maji', 'water']),
    ('religious', ['kanisani', 'kanisa', 'church', 'luther', 'anglican', 'pentecost', 'msikitini', 'msikiti']),
    ('center', ['center', 'market', 'sokoni', 'madukani']),
    ('name', ['kwa']),
]

train_data['wpt_name'] = train_data['wpt_name'].str.lower()
for category, keywords in wpt_keywords:
    for keyword in keywords:
        matches = train_data['wpt_name'].str.contains(keyword, regex=False, na=False)
        train_data.loc[matches, 'wpt_name'] = category

# Finally, change any values with 500 or fewer records to 'other', as well as the 'none' values
value_counts = train_data['wpt_name'].value_counts()
to_remove = value_counts[value_counts <= 500].index
train_data.loc[train_data['wpt_name'].isin(to_remove), 'wpt_name'] = 'other'

train_data.loc[train_data['wpt_name'].str.contains('none', regex=False, na=False), 'wpt_name'] = 'other'

In [None]:
train_data['wpt_name'].value_counts()

We're left with some categories of unknown meaning but hopefully they are of some significance.

Next we can look at `funder` and `installer`. Both of these have some large categories and lots of small ones.

For `funder` the large categories seem to be either governments or large NGOs while the small ones are unknown entities. We can keep the big ones, and group the rest under _other_.

In [None]:
value_counts = train_data['funder'].value_counts()
to_remove = value_counts[value_counts <= 500].index
# Relabel the rare funders in place using .loc
train_data.loc[train_data['funder'].isin(to_remove), 'funder'] = 'other'
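
Since the same "keep the big categories, lump the rest" step is applied to several features, it can be factored into a small helper. The sketch below is one way to do it. The name `lump_rare` and the toy values are hypothetical, and the threshold is lowered so the effect is visible on ten rows.

```python
import pandas as pd

def lump_rare(series, min_count=500, other='other'):
    """Replace categories seen `min_count` times or fewer with `other`."""
    counts = series.value_counts()
    rare = counts[counts <= min_count].index
    # Keep values that are not rare; everything else becomes `other`.
    return series.where(~series.isin(rare), other)

# Toy values for illustration only.
toy = pd.Series(['gov'] * 5 + ['danida'] * 3 + ['a', 'b'])
lumped = lump_rare(toy, min_count=2)
print(lumped.value_counts())
```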

In [None]:
train_data['funder'].value_counts()

In [None]:
0.0084*len(train_data)

For `installer`, we can do the same.

In [None]:
value_counts = train_data['installer'].value_counts()
to_remove = value_counts[value_counts <= 500].index
# Relabel the rare installers in place using .loc
train_data.loc[train_data['installer'].isin(to_remove), 'installer'] = 'other'

In [None]:
train_data['installer'].value_counts()

Good, it now looks like we're ready to deal with the missing data, before finally making some predictions.

## Missing Data

Let's check our missing percentages again after all that work. 

In [None]:
pumpitup.percent_missing(train_data)

We can see that there is still quite a bit of work to do. Firstly we can get rid of `amount_tsh` which represents the amount of water available to a pump. This could be useful, but there's just too much missing.

In [None]:
train_data.drop('amount_tsh', axis=1, inplace=True)

In [None]:
train_data.head(1)

### Geographic

#### Longitude and Latitude

Next let's go back to our longitude and latitude. We have a small percentage of these missing, but we have some pretty granular geographic data in the form of the `district_code` within each `region`. We can find the mean longitude and latitude within each district code and then use these to fill in any missing longitude and latitude data.

## Labels

In [None]:
train_labels = pd.read_csv('data/training_labels.csv')

In [None]:
train_labels.head()

In [None]:
train_labels['status_group'].value_counts()

In [None]:
cmap = plt.get_cmap('Accent')
le = preprocessing.LabelEncoder()

In [None]:
region_encoded = pumpitup.label_encode(train_data['region'], le)
region_encoded_norm = region_encoded / region_encoded.max()
region_cmap = [cmap(x) for x in region_encoded_norm]

ward_encoded = pumpitup.label_encode(train_data['ward'], le)
ward_encoded_norm = ward_encoded / ward_encoded.max()
ward_cmap = [cmap(x) for x in ward_encoded_norm]

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))

ax[0].scatter(train_data['longitude'], train_data['latitude'], c=region_cmap, linewidth=0)
ax[0].set_ylim(-13, 0)
ax[0].set_xlim(28, 42)
ax[0].set_ylabel('Latitude')
ax[0].set_xlabel('Longitude')

ax[1].scatter(train_data['longitude'], train_data['latitude'], c=ward_cmap, linewidth=0)
ax[1].set_ylim(-13, 0)
ax[1].set_xlim(28, 42)
ax[1].set_ylabel('Latitude')
ax[1].set_xlabel('Longitude')

plt.tight_layout()

In [None]:
dist_encoded = pumpitup.label_encode(train_data['district_code'], le)
dist_encoded_norm = dist_encoded / dist_encoded.max()
dist_cmap = [cmap(x) for x in dist_encoded_norm]

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(5, 5))

ax.scatter(train_data['longitude'], train_data['latitude'], c=dist_cmap, linewidth=0)
ax.set_ylim(-13, 0)
ax.set_xlim(28, 42)
ax.set_ylabel('Latitude')
ax.set_xlabel('Longitude')

In [None]:
train_data.drop('scheme_management', axis=1, inplace=True)

### Payment

Besides slightly different labelling, `payment` and `payment_type` are essentially the same. Some unknown values are present. `payment_type` is shorter so we'll keep that one.

In [None]:
train_data.groupby(['payment_type', 'payment']).count()

In [None]:
train_data.drop('payment', axis=1, inplace=True)

### Water Quality

Here we have two almost identical features, `water_quality` and `quality_group`, with `water_quality` containing slightly more granular information. We therefore keep that one.

In [None]:
train_data.groupby(['water_quality', 'quality_group']).count()

In [None]:
train_data.drop('quality_group', axis=1, inplace=True)

### Quantity

Two identical features again, `quantity` and `quantity_group`. We'll keep `quantity`.

In [None]:
train_data.groupby(['quantity', 'quantity_group']).count()

In [None]:
train_data.drop('quantity_group', axis=1, inplace=True)

### Source

From `source_class`, `source_type` and `source`, we keep the highest and lowest levels of information. The *other* `source` category seems to fall under *unknown* in `source_class` so we rename it.

In [None]:
train_data.groupby(['source_class', 'source_type', 'source']).count()

In [None]:
train_data.drop('source_type', axis=1, inplace=True)
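
The renaming mentioned above doesn't appear as a cell in this notebook. A minimal sketch of how it could be done is shown below on toy rows, assuming the intent is to fold the *other* value of `source` into *unknown* so that it matches `source_class`.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    'source': ['spring', 'other', 'unknown', 'river'],
    'source_class': ['groundwater', 'unknown', 'unknown', 'surface'],
})
# Fold the 'other' source into 'unknown' so both levels use the same label.
toy['source'] = toy['source'].replace('other', 'unknown')
print(toy)
```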

### Waterpoint

Here we again have two almost identical features, `waterpoint_type` and `waterpoint_type_group`. `waterpoint_type` is slightly more precise, so we keep it and drop the other.

In [None]:
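# Records dated before the pump was built cannot be right: replace them with the median year recorded
train_data.loc[train_data['year_recorded'] < train_data['construction_year'], 'year_recorded'] = train_data['year_recorded'].median()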

In [None]:
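mask1 = pumpitup.flag_missing_s(train_data['construction_year'])
# Mark missing entries as NaN so they are ignored when computing the group medians
train_data['construction_year'] = train_data['construction_year'].mask(mask1)
# Select the construction_year column before transforming, so a single series is assigned
train_data.loc[mask1, 'construction_year'] = train_data.groupby(['extraction_type', 'installer'])['construction_year'].transform('median')
mask2 = train_data['construction_year'].isnull()
train_data.loc[mask2, 'construction_year'] = train_data.groupby(['extraction_type'])['construction_year'].transform('median')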

In [None]:
train_data.groupby(['waterpoint_type_group', 'waterpoint_type']).count()

In [None]:
train_data.drop('waterpoint_type_group', axis=1, inplace=True)

### Geographical Information

The geographical position of each point is recorded at multiple levels, in seemingly increasing precision, as given by:
- `region` or `region_code`
- `district_code` (subcategory within each region)
- `ward`
- `subvillage`
- `longitude` and `latitude`

Each `region` contains one or more `region_code`, with some overlap (e.g. `region_code` 11 appearing in *Iringa* and *Shinyanga*), while the same `district_code` can appear in many regions, indicating that this is a generic sub-division of a `region`. Some `region_code` and `district_code` numbers do not seem to fit into the regular pattern exactly, but without a better understanding of this categorisation system, it's probably best to leave them as they are.

The `ward` and `subvillage` features are much more precise, with 2092 and 19288 unique values respectively. Although individual wards or villages may differ in governance surrounding a water pump, regions and districts likely indicate different political boundaries that may have a larger effect on water pump failure. It probably doesn't make sense to keep this increasingly granular geographical information while we have latitude and longitude as numeric features.

To get an idea of the precision, we can make a quick scatter plot of the geographic positions of every point colour mapped to their `region` and `ward`.

In [None]:
train_data.groupby(['region', 'region_code', 'district_code']).count()

In [None]:
con_year_encoded = pumpitup.label_encode(train_data['construction_year'][train_data['construction_year'] > 0], le)
con_year_encoded_norm = con_year_encoded / con_year_encoded.max()
con_year_cmap = [viridis(x) for x in con_year_encoded_norm]

In [None]:
mask1 = pumpitup.flag_missing_s(train_data['longitude'])
train_data.loc[mask1, 'longitude'] = np.nan
# Select the longitude column before transforming, so a single series is assigned
train_data.loc[mask1, 'longitude'] = train_data.groupby(['region', 'district_code'])['longitude'].transform('mean')
mask2 = train_data['longitude'].isnull()
train_data.loc[mask2, 'longitude'] = train_data.groupby(['region'])['longitude'].transform('mean')

In [None]:
mask3 = pumpitup.flag_missing_s(train_data['latitude'])
train_data.loc[mask3, 'latitude'] = np.nan
# Select the latitude column before transforming, so a single series is assigned
train_data.loc[mask3, 'latitude'] = train_data.groupby(['region', 'district_code'])['latitude'].transform('mean')
mask4 = train_data['latitude'].isnull()
train_data.loc[mask4, 'latitude'] = train_data.groupby(['region'])['latitude'].transform('mean')

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(5, 5))

ax.scatter(train_data['longitude'], train_data['latitude'], color='gray', linewidth=0)
ax.scatter(train_data['longitude'][mask1], train_data['latitude'][mask1], color='red', linewidth=0)
ax.scatter(train_data['longitude'][mask2], train_data['latitude'][mask2], color='blue', linewidth=0)
ax.set_ylim(-13, 0)
ax.set_xlim(28, 42)
ax.set_ylabel('Latitude')
ax.set_xlabel('Longitude')

plt.tight_layout()

The red dots show the locations of points where latitude and longitude were imputed using `region` and `district_code`, while the blue dots show the locations of points where latitude and longitude were imputed using `region` only.

**GPS Height**

From the previous map of the altitude of each pump, we can see a trend of lower altitudes closer to the coast, with elevation increasing further inland. It's difficult to impute the missing `gps_height` values precisely, but we can try to resample the map and use some mean values.