In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import combinations

First we'll open each of the data files and see how big it is.

In [None]:
df = pd.read_csv('../data/training_set_values.csv')

In [None]:
df.shape

In [None]:
df.head()

"training_set_values" has 59,400 records.

In [None]:
df = pd.read_csv('../data/training_set_labels.csv')

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.status_group.value_counts()

"training_set_labels" just tells you what the status of those 59,400 records is.

In [None]:
df = pd.read_csv('../data/test_set_values.csv')

In [None]:
df.shape

In [None]:
df.head()

"test_set_values" is just like training_set_values, with fewer records (14,850).

In [None]:
df = pd.read_csv('../data/SubmissionFormat.csv')

In [None]:
df.shape

In [None]:
df.head()

"SubmissionFormat" is like training_set_labels except the contestant/data scientist has to provide the labels.

# Summary of available files
There is a training set of 59,400 records, a set of labels for those 59,400 records, a test set of 14,850 records, and a template for submitting labels for those 14,850 test records to an online contest.

Because we do not have labels for the 14,850 records in the test set, we cannot use them to evaluate a model. We will have to carve out our own training and test sets from the 59,400 records with known labels.
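
One way to carve out such a split, sketched on toy data. The two small frames below are hypothetical stand-ins for training_set_values.csv and training_set_labels.csv, and the 50% hold-out fraction is an arbitrary choice for illustration. The sketch joins values to labels on `id`, then samples the same fraction from each `status_group` so that the class balance is preserved.

```python
import pandas as pd

# toy stand-ins (assumed data) for the training values and training labels
values = pd.DataFrame({'id': range(10),
                       'gps_height': [0, 5, 10, 0, 20, 30, 0, 15, 25, 35]})
labels = pd.DataFrame({'id': range(10),
                       'status_group': ['functional'] * 6 + ['non functional'] * 4})

# join features to labels on the shared id column, insisting on a 1:1 match
labeled = values.merge(labels, on='id', validate='one_to_one')

# hold out the same fraction of each class (a stratified split)
holdout = labeled.groupby('status_group').sample(frac=0.5, random_state=0)
# everything that was not held out becomes the training set
train = labeled.drop(holdout.index)

print(train.shape, holdout.shape)
```

On the real data a smaller hold-out fraction (for example 0.2) would be more typical; that choice is left open here.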

# Examining the training set

Let's look more closely at the training set. First we'll drop duplicate records, ignoring the id column.

In [69]:
# reload the training set
df = pd.read_csv('../data/training_set_values.csv')
# check for duplicates, excluding the ids
df[df.duplicated(subset=df.columns.difference(['id']))].shape

(37, 40)

In [70]:
# drop the duplicate records
df.drop(df[df.duplicated(subset=df.columns.difference(['id']))].index, inplace=True)
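
The same step could probably be written more compactly with `drop_duplicates`. The sketch below uses a toy frame (assumed data, not the real training set) to show that rows differing only in `id` are treated as duplicates and that the first occurrence is kept.

```python
import pandas as pd

# toy frame: rows 1 and 3 repeat row 0 in everything except the id
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'funder': ['a', 'a', 'b', 'a'],
    'gps_height': [10, 10, 20, 10],
})

# compare on every column except id
cols = toy.columns.difference(['id']).tolist()

# keep the first occurrence of each distinct record
deduped = toy.drop_duplicates(subset=cols)

print(deduped)
```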

Now we'll look at missing values.

In [71]:
# show columns with missing values and the number of values missing
df[df.columns[df.isna().any()]].isna().sum()

funder                3635
installer             3655
subvillage             371
public_meeting        3314
scheme_management     3877
scheme_name          28138
permit                3056
dtype: int64

About half of the records (28,138 of 59,363) are missing 'scheme_name', so we'll drop that feature.
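
To make "about half" explicit, the share of missing values per column can be computed directly. This is a minimal sketch on a toy frame (assumed data); the 0.4 cutoff is a hypothetical threshold chosen only for illustration, not a value used elsewhere in this notebook.

```python
import pandas as pd

# toy frame with different amounts of missing data per column
toy = pd.DataFrame({
    'funder': ['a', None, 'c', 'd'],
    'scheme_name': [None, None, 'x', 'y'],
    'permit': [True, False, True, True],
})

# fraction of missing values in each column
missing_share = toy.isna().mean()

# columns whose missing share exceeds the (hypothetical) cutoff
to_drop = missing_share[missing_share > 0.4].index.tolist()

print(missing_share)
print(to_drop)
```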

In [89]:
# drop scheme_name
df.drop(columns='scheme_name', inplace=True)
# show rows and columns
df.shape

(59363, 39)

In [72]:
# show columns with missing values and the number of values missing
df[df.columns[df.isna().any()]].isna().sum()

funder               3635
installer            3655
subvillage            371
public_meeting       3314
scheme_management    3877
permit               3056
dtype: int64

We'll return to the question of what to do with these missing values after we explore more of the data, in particular the "fictional zeros" in the numerical features.

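Is there a way to optimize which records to drop and which features to drop? One approach is to cycle through all combinations of the features that have missing values. For each number of features we keep, we find the combination that costs the fewest records when rows with missing values are dropped, then print the fraction of records lost alongside the features that would have to be dropped instead.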

In [90]:
na_features = []

for col in df.columns:
    if df[col].dropna().shape[0] < df.shape[0]:
        na_features.append(col)

na_features

['funder',
 'installer',
 'subvillage',
 'public_meeting',
 'scheme_management',
 'permit']

In [91]:
for n in range(len(na_features)):
    # every way of keeping n+1 of the features that have missing values
    sub_list = list(combinations(na_features, n+1))

    high_score = 0
    high_score_features = []

    for item in sub_list:
        features = list(item)
        # number of records left after dropping rows with NaNs in the kept features
        score = df[features].dropna().shape[0]
        if score > high_score:
            high_score = score
            high_score_features = item
    # fraction of records lost under the best combination of this size
    loss = round(1 - high_score / df.shape[0], 2)
    # the features that would have to be dropped instead
    result = [element for element in na_features if element not in high_score_features]
    print(loss, result)

0.01 ['funder', 'installer', 'public_meeting', 'scheme_management', 'permit']
0.06 ['funder', 'installer', 'public_meeting', 'scheme_management']
0.07 ['public_meeting', 'scheme_management', 'permit']
0.08 ['public_meeting', 'scheme_management']
0.13 ['scheme_management']
0.19 []


In [31]:
# count the records whose permit value is True
df.permit.dropna().sum()

38838

Let's look at how many unique values each of these columns has.

In [None]:
df[df.columns[df.isna().any()]].nunique()

# For now
We will drop the columns with missing values and move forward with the ones that are intact.

In [None]:
df.dropna(axis='columns', inplace=True)

Let's look at just numerical features.

In [None]:
df.select_dtypes(include=['number']).info()

None of the numerical features appear to be missing values, but this can be deceptive: a zero may be standing in for a missing value. Let's look more closely for zeros.

In [None]:
df.select_dtypes(include=['number']).describe()
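
A quick way to quantify the zeros, rather than reading them off `describe()`, is to compute the share of exact zeros in each numerical column. The sketch below runs on a toy frame (assumed data) and reuses two column names from the training set only for readability.

```python
import pandas as pd

# toy frame: two numerical columns and one non-numerical column
toy = pd.DataFrame({
    'amount_tsh': [0.0, 0.0, 50.0, 0.0],
    'gps_height': [0, 1200, 0, 300],
    'region': ['a', 'b', 'c', 'd'],
})

# fraction of exact zeros in each numerical column
zero_share = (toy.select_dtypes(include=['number']) == 0).mean()

print(zero_share)
```

A high share does not by itself prove the zeros are placeholders; it only flags which columns deserve a closer look.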

In [None]:
# show value counts for amount_tsh feature
df.amount_tsh.value_counts()

This is the "amount of water available to a water point". *Most* of these values are zero. This seems like a very relevant feature, and it would be a shame if the zeros were some kind of error. Let's optimistically assume the zeros are meaningful.

In [None]:
df.gps_height.value_counts()

This is altitude (or elevation), in meters. There are probably too many zeros here for altitude to be real. We'll investigate this further after we look at longitude and latitude.

In [None]:
# show value counts for longitude feature
df.longitude.value_counts()

In [None]:
# show value counts for latitude feature
df.latitude.value_counts()

There are quite a few zeros or near-zeros for longitude and latitude. Let's first look at the 1,776 records that lack positional coordinates and see whether they're worth repairing.

In [None]:
# describe the numerical features of records with zero longitude
df.select_dtypes(include=['number'])[df['longitude'] == 0].describe()

These records (1,776 of them, now that duplicates have been dropped) seem to carry little information, as they also have probably erroneous zero values for most of the other numerical features. The only numerical data they offer are their region and district codes. Let's find out where those regions and districts are before we drop the records.

In [None]:
# show value counts for the region_code of the zero-longitude records
df[df['longitude'] == 0]['region_code'].value_counts()

In [None]:
# show value counts for the district_code of the zero-longitude records
df[df['longitude'] == 0]['district_code'].value_counts()
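
The districts of interest can also be derived from the data instead of being typed in by hand. This is a minimal sketch on a toy frame (assumed coordinates and codes): it collects the district codes of the zero-longitude records, then selects the records from those districts that do have coordinates.

```python
import pandas as pd

# toy frame: rows 0 and 2 lack coordinates
toy = pd.DataFrame({
    'longitude': [0.0, 33.1, 0.0, 34.5, 35.2],
    'latitude': [0.0, -2.5, 0.0, -3.1, -4.0],
    'district_code': [1, 1, 2, 3, 2],
})

# records with a zero longitude
no_gps = toy['longitude'] == 0

# district codes that contain at least one such record
affected = toy.loc[no_gps, 'district_code'].unique()

# records from those districts that do have coordinates
neighbors = toy[~no_gps & toy['district_code'].isin(affected)]

print(sorted(affected))
print(neighbors)
```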

Maybe if we plot the positions of records *with* GPS coordinates from these districts, we'll get some idea of where these problematic records are coming from.

In [None]:
# generate a geographical map of all listings in the districts where the 1,776 records lacking long/lat are
fig, ax = plt.subplots(figsize=(11,8))
df[(df['longitude'] != 0) & (df['district_code'].isin([1,2,4,6]))].plot.scatter(
    x='longitude', y='latitude', c='district_code', cmap='Blues', ax=ax)
fig.suptitle('Distribution of Districts with Missing GPS Coordinates', size=18);

While the 1,776 problematic records all come from the same handful of districts, the map offers no other clues about what they have in common: "districts" and "regions" do not appear to be contiguous areas, but each consists of several discrete clusters. In any case, it looks like it won't be possible to recover or even approximate the GPS data for these records, so we might as well drop them.

In [None]:
# drop all zero-longitude records
df.drop(df[df['longitude'] == 0].index, inplace=True)

Next, let's make a color-coded plot to see whether the gps height data makes sense by comparing it to an available topographical map.

In [None]:
# set figure with two axes over two columns
fig, (ax1, ax2) = plt.subplots(ncols = 2, figsize=(20,8))
# plot long/lat for nonzero longs with color gradient for elevation
df.plot.scatter(x='longitude', y='latitude', c='gps_height', cmap='plasma', ax=ax1)
# load the topographical map image from disk
im = plt.imread("../images/topo_map.jpeg")
# display the image
im = ax2.imshow(im)
# hide X and Y axes label marks
ax2.xaxis.set_tick_params(labelbottom=False)
ax2.yaxis.set_tick_params(labelleft=False)
# hide X and Y axes tick marks
ax2.set_xticks([])
ax2.set_yticks([])
# title
fig.suptitle('Topographical Map Comparison', size=18)
fig.tight_layout();

It makes sense for the locations along the ocean to have zero elevation, but there are at least three inland clusters that seem more like they are just lacking elevation data. One solution would be to set elevation values equal to the median for all records that have matching geographical location features such as "subvillage", but it appears that this wouldn't help because the missing values are geographically set apart from non-missing values.
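
The median idea, and the reason it falls short, can be sketched on toy data. The frame below is hypothetical, and the sketch assumes that every zero is a missing value, which is too strong for the real data because coastal zeros may be genuine. Subvillage A has known elevations to borrow from; subvillage B has none, so its values stay missing.

```python
import pandas as pd

# toy frame: subvillage B has no known elevations at all
toy = pd.DataFrame({
    'subvillage': ['A', 'A', 'A', 'B', 'B'],
    'gps_height': [100, 0, 120, 0, 0],
})

# treat zeros as missing (an assumption made for this sketch only)
height = toy['gps_height'].mask(toy['gps_height'] == 0)

# median of the known elevations within each subvillage
group_median = height.groupby(toy['subvillage']).transform('median')

# fill the gaps with the group median where one exists
toy['gps_height_imputed'] = height.fillna(group_median)

print(toy)
```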

# *what to do about this?*

In [None]:
# show value counts for num_private feature
df.num_private.value_counts()

In [None]:
df[df.num_private > 0].num_private.describe()

In [None]:
df.region_code.value_counts()

In [None]:
df.district_code.value_counts()

In [None]:
df.population.value_counts()

In [None]:
df[df.population < 10].population.value_counts()

In [None]:
df.construction_year.value_counts()

# A first-glance summary of the numerical features

None of the numerical features are missing any values, but some have a suspicious number of zeros.

* amount_tsh: "Total Static Head". Some kind of measure of available water. This is mostly zeros.
* gps_height: From a spot check, this appears to be given in meters. It has a lot of zeros.
* longitude, latitude: 1,812 records in the raw data (1,776 after dropping duplicates) were essentially (0, 0). We dropped them above.
* num_private: It is not at all clear what this means.
* region_code, district_code: There are fewer unique district codes than region codes, which suggests that district codes may cover broader areas. Some district codes are equal to zero, which may or may not be an error.
* population: There are many zeros, but perhaps this just means the locations are rural?
* construction_year: There are many zeros, which will need to be dealt with.
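
One possible way to deal with such zeros, sketched on toy data, is to convert them to NaN so that they no longer distort summary statistics. The frame is hypothetical, and whether a zero really means "unknown" is an assumption that has to be checked per feature.

```python
import pandas as pd

# toy frame with placeholder zeros in construction_year
toy = pd.DataFrame({'construction_year': [0, 1995, 2005, 0, 2010]})

# median with the zeros left in place
raw_median = toy['construction_year'].median()

# replace zeros with NaN, which pandas skips when aggregating
years = toy['construction_year'].mask(toy['construction_year'] == 0)
clean_median = years.median()

print(raw_median, clean_median)
```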

Now let's look at how many unique values each non-numerical feature has, then re-check for missing values.

In [None]:
df.select_dtypes(include=['object']).nunique()

In [None]:
df[df.columns[df.isna().any()]].isna().sum()

The upshot of this initial exploration is that we should drop the zero-longitude records and at least the scheme_name column and possibly the other columns with missing values. Then we still have to deal with the erroneous zero values for at least elevation, population, and construction year.

We may also need other methods to detect problem values that are not yet apparent.