# Business Problem

The Tanzania Development Trust is a UK charitable organization that has operated in Tanzania since 1975.

They focus on development in rural Tanzania, aiming to support small projects in the poorest parts of the country. One of their priority funding areas is clean water. Their stated water project involves boreholing and rope pump installation in areas with limited access to clean water, currently in the regions of Kagera and Kigoma in the northwest of the country.

A new benefactor wants to expand the project both geographically, to more of the country, and in scope, to include repairing existing pumps before they fail. I have been tasked with developing a model to predict the operating condition of a current waterpoint: functional, needs repair, or non-functional.

The main objective is to identify waterpoints that are in need of repair. [Research shows](https://sswm.info/entrepreneurship-resource/developing-impactful-businesses/maintenance-services-for-rural-water-pumps) that it is much less expensive to repair and rehabilitate a waterpoint than to replace it, and that doing so is more protective of the country's water resources.
The secondary objective is to identify concentrations of non-functional waterpoints that may be eligible locations for new installations.
______________________
The data provided for modeling was collected between March 2011 and March 2013, and contains information for 59,400 waterpoints.

# Imports

In [55]:
import pandas as pd
import numpy as np

In [1]:
# import training data and target
raw_data = pd.read_csv('data/training_data.csv')
raw_target = pd.read_csv('data/training_target.csv')

# .info() prints its summary and returns None, so call it directly
raw_target.info()
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            59400 non-null  int64 
 1   status_group  59400 non-null  object
dtypes: int64(1), object(1)
memory usage: 928.2+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 40 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55765 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59400 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

Create a raw dataframe that merges the training data with the target. We will use this dataframe during initial EDA to compare feature relationships with the target and to understand and deal with null values.

In [2]:
raw_df = pd.merge(raw_data, raw_target, on='id')

The training dataset includes 59,400 entries with 40 columns (including the `id` key) and a target label.
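Because the features and labels arrive in separate files, it can be worth confirming that the merge is one-to-one on `id`. The sketch below uses the `validate` argument of `pd.merge` on two small toy frames; the values are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

# toy stand-ins for raw_data and raw_target (hypothetical values)
toy_data = pd.DataFrame({'id': [1, 2, 3], 'gps_height': [1390, 1399, 686]})
toy_target = pd.DataFrame({'id': [3, 1, 2],
                           'status_group': ['functional', 'non functional', 'functional']})

# validate='one_to_one' raises pandas.errors.MergeError if either side repeats an id
toy_df = pd.merge(toy_data, toy_target, on='id', validate='one_to_one')

# an inner merge silently drops unmatched ids, so compare row counts as well
assert len(toy_df) == len(toy_data) == len(toy_target)
print(toy_df)
```

If both checks pass on the real frames, every waterpoint has exactly one label and no rows were lost in the merge.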

In [3]:
status_values = pd.DataFrame(raw_df.status_group.value_counts())
status_values['percent'] = raw_df.status_group.value_counts(normalize=True) * 100
status_values

| | status_group | percent |
|---|---|---|
| functional | 32259 | 54.308081 |
| non functional | 22824 | 38.424242 |
| functional needs repair | 4317 | 7.267677 |


This is a ternary classification problem. The three possible values are:
- functional (F)
- non functional (NF)
- functional needs repair (FR)

Value counts show that our dataset is not balanced with respect to the label values. Only 7.3% of pumps are classified as functional needs repair, while 54.3% are functional and 38.4% are non functional. We will need to keep this imbalance in mind when modeling.
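One common way to account for this imbalance is to weight each class inversely to its frequency. The sketch below applies the "balanced" heuristic, `n_samples / (n_classes * class_count)`, to the counts shown above. Whether weighting, resampling, or neither is appropriate for the final model is still an open choice; this only illustrates the size of the imbalance.

```python
# class counts copied from the value_counts output above
counts = {
    'functional': 32259,
    'non functional': 22824,
    'functional needs repair': 4317,
}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: rarer classes get proportionally larger weights
class_weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label, weight in class_weights.items():
    print(f'{label}: {weight:.2f}')
```

Under this heuristic a needs-repair waterpoint would count roughly seven times as much as a functional one during training.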
_________
Before any modeling can occur, we must check for and deal with null values.


# Null Checks

In [4]:
# count nulls per column and express them as a percent of all rows
null_checks = pd.DataFrame(data=raw_df.isna().sum(),
                          columns=['null_count'])
null_checks['percent_of_data'] = (null_checks.null_count / len(raw_df)) * 100
# keep only columns that contain nulls, largest share first
null_checks = null_checks[null_checks.percent_of_data > 0.0]
null_checks = null_checks.sort_values('percent_of_data', ascending=False)
null_checks

| | null_count | percent_of_data |
|---|---|---|
| scheme_name | 28166 | 47.417508 |
| scheme_management | 3877 | 6.526936 |
| installer | 3655 | 6.153199 |
| funder | 3635 | 6.119529 |
| public_meeting | 3334 | 5.612795 |
| permit | 3056 | 5.144781 |
| subvillage | 371 | 0.624579 |


There are 7 features with null values in our dataset, and we can see what share of the total data each null count represents.
______
We will work our way up from the smallest null_count to the largest and decide how to deal with the missing data.
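Since the null counts will change as each feature is cleaned, the summary can be wrapped in a small helper and rerun after every step. This is a sketch on a toy frame; `null_summary` is a hypothetical helper name that is not defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def null_summary(df):
    """Return null counts and percentages for columns that contain nulls."""
    summary = pd.DataFrame({
        'null_count': df.isna().sum(),
        'percent_of_data': df.isna().mean() * 100,
    })
    summary = summary[summary.null_count > 0]
    return summary.sort_values('percent_of_data', ascending=False)

# toy frame (hypothetical values) with nulls in two of its three columns
toy = pd.DataFrame({
    'permit': [True, np.nan, False, True],
    'subvillage': ['Kawawa', 'Shuleni', np.nan, np.nan],
    'region': ['Dodoma', 'Dodoma', 'Mwanza', 'Mwanza'],
})
toy_nulls = null_summary(toy)
print(toy_nulls)
```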

## subvillage

In [6]:
subvillage_nans = raw_df[raw_df.subvillage.isnull()]
subvillage_nans.status_group.value_counts(normalize=True)

functional                 0.552561
non functional             0.444744
functional needs repair    0.002695
Name: status_group, dtype: float64

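The null values in subvillage represent 0.6% of our total data. Among these rows the functional share (55.3%) is close to the whole dataset (54.3%), but non functional is higher (44.5% vs 38.4%) and functional needs repair is almost absent (0.27%, a single row, vs 7.3%).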
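To make that comparison explicit, the label shares for the null rows can be placed next to the shares for the full dataset. The sketch below does this on toy data; `compare_target_share` is a hypothetical helper and the rows are made up.

```python
import numpy as np
import pandas as pd

def compare_target_share(df, column, target='status_group'):
    """Percent of each label among rows where `column` is null vs. all rows."""
    overall = df[target].value_counts(normalize=True) * 100
    in_nulls = df.loc[df[column].isna(), target].value_counts(normalize=True) * 100
    # labels missing from the null rows align to NaN, which means a 0% share
    out = pd.DataFrame({'overall_pct': overall, 'null_rows_pct': in_nulls}).fillna(0.0)
    out['difference'] = out.null_rows_pct - out.overall_pct
    return out

toy = pd.DataFrame({
    'subvillage': ['A', np.nan, 'B', np.nan, 'C', 'D', np.nan, 'E'],
    'status_group': ['functional', 'functional', 'non functional', 'non functional',
                     'functional needs repair', 'functional', 'functional', 'non functional'],
})
comparison = compare_target_share(toy, 'subvillage')
print(comparison)
```

The same helper could be reused for each of the other features with nulls.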

In [7]:
subvillage_nans.region.value_counts()

Dodoma    361
Mwanza     10
Name: region, dtype: int64

All but 10 of our subvillage NaNs come from the region of Dodoma; the rest come from Mwanza. Let's look at the subvillage distribution of those regions in the whole dataset.

In [8]:
raw_df[raw_df['region'] == 'Dodoma'].subvillage.value_counts()

Kawawa         54
Shuleni        43
Nyerere        35
Azimio         34
Majengo        32
               ..
Foye            1
Mtatangwe       1
Makao Mapya     1
Soya Mjini      1
Mgomwa          1
Name: subvillage, Length: 705, dtype: int64

In [9]:
raw_df[raw_df['region'] == 'Mwanza'].subvillage.value_counts()

1                     132
Madukani               52
Bujingwa               25
Shuleni                19
Matale                 18
                     ... 
Bukalo                  1
Nyambona                1
Kabaganda B             1
Bulyahilu Center B      1
Mwambogwa               1
Name: subvillage, Length: 1507, dtype: int64

No single subvillage is overwhelmingly dominant in either region, so there is no obvious value to assign to the nulls. It's not clear yet whether we will use subvillage in modeling, so the best course of action here is to drop the rows with a null subvillage.
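One way to quantify "dominant" is the share of rows held by the most common subvillage in each region: when that share is small, filling with the mode would mislabel most of the filled rows. A minimal sketch on toy rows (the counts are made up, only the names echo the output above):

```python
import pandas as pd

toy = pd.DataFrame({
    'region': ['Dodoma'] * 5 + ['Mwanza'] * 4,
    'subvillage': ['Kawawa', 'Kawawa', 'Shuleni', 'Nyerere', 'Azimio',
                   'Madukani', 'Bujingwa', 'Shuleni', 'Matale'],
})

# share of rows taken by the single most common subvillage within each region
top_share = (toy.groupby('region')['subvillage']
                .agg(lambda s: s.value_counts(normalize=True).iloc[0]))
print(top_share)
```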

In [20]:
raw_df.dropna(subset=['subvillage'], inplace=True)

## permit

In [54]:
permit_nans = raw_df[raw_df.permit.isnull()]
permit_nans.status_group.value_counts(normalize=True) * 100

functional                 54.744764
non functional             35.438482
functional needs repair     9.816754
Name: status_group, dtype: float64

The distribution of the target for entries with no permit value is approximately the same as for the whole dataset, with functional needs repair somewhat higher (9.8% vs 7.3%).

In [51]:
permit_distribution = raw_df.permit.value_counts(normalize=True)
permit_distribution

True     0.693066
False    0.306934
Name: permit, dtype: float64

Per the data documentation, the permit feature records whether the waterpoint is permitted. Among the rows that have a value, the split is about 70/30 in favor of permitted.
We should also encode this feature as an integer: 1 for true, 0 for false.

We should impute these 3,056 missing values.
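Because `permit` is a true/false flag rather than a numeric series, filling it means choosing a value for each missing row. One option, sketched below on toy data, is to draw each missing value at random in proportion to the observed True/False shares, so the roughly 70/30 split is preserved. Filling with the most common value is a simpler alternative. The seed, the toy values, and the choice of method are all assumptions for illustration, not a decision made in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the fill is reproducible

# toy permit column (hypothetical values): 7 True, 3 False, 4 missing
permit = pd.Series([True] * 7 + [False] * 3 + [np.nan] * 4, dtype='object')

# observed shares among the non-null values
shares = permit.dropna().astype(bool).value_counts(normalize=True)

# draw one value per missing row, weighted by the observed shares
missing = permit.isna()
draws = rng.choice(shares.index.to_numpy(), size=int(missing.sum()), p=shares.to_numpy())

filled = permit.copy()
filled.loc[missing] = draws
filled = filled.astype(int)  # 1 for True, 0 for False
print(filled.value_counts())
```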

In [59]:
raw_df[raw_df.permit.notnull()]

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59395,60739,10.0,2013-05-03,Germany Republi,1210,CES,37.169807,-3.253847,Area Three Namba 27,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
59396,27263,4700.0,2011-05-07,Cefa-njombe,1212,Cefa,35.249991,-9.070629,Kwa Yahona Kuvala,0,...,soft,good,enough,enough,river,river/lake,surface,communal standpipe,communal standpipe,functional
59397,37057,0.0,2011-04-11,,0,,34.017087,-8.750434,Mashine,0,...,fluoride,fluoride,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump,functional
59398,31282,0.0,2011-03-08,Malec,0,Musa,35.861315,-6.378573,Mshoro,0,...,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,functional


In [None]:
# map booleans to 1/0; nulls stay NaN until they are imputed
raw_df['permit'] = raw_df['permit'].map({True: 1, False: 0})

In [None]:
raw_df['permit'] 

## public_meeting

## funder

## installer

## scheme_management

## scheme_name

# Exploring extraction types

In [57]:
# count of waterpoints per extraction type and its share of all rows
extraction_types = pd.DataFrame(raw_df.extraction_type.value_counts())
extraction_types.columns = ['total_count']
extraction_types['percent_of_total'] = round((extraction_types.total_count / len(raw_df)) * 100, 2)
# non functional count and rate within each extraction type
extraction_types['non-functional'] = raw_df[raw_df.status_group == 'non functional'].extraction_type.value_counts()
extraction_types['non-functional_percent'] = round((extraction_types['non-functional'] / extraction_types.total_count) * 100, 2)
# needs-repair count and rate; types with no matching rows come back as NaN, so fill with 0
extraction_types['needs_repair'] = raw_df[raw_df.status_group == 'functional needs repair'].extraction_type.value_counts()
extraction_types.fillna(0.0, inplace=True)
extraction_types['needs_repair'] = extraction_types['needs_repair'].astype('int')
extraction_types['needs_repair_percent'] = round((extraction_types['needs_repair'] / extraction_types.total_count) * 100, 2)
extraction_types

| | total_count | percent_of_total | non-functional | non-functional_percent | needs_repair | needs_repair_percent |
|---|---|---|---|---|---|---|
| gravity | 26646 | 45.14 | 7988 | 29.98 | 2701 | 10.14 |
| nira/tanira | 8151 | 13.81 | 2089 | 25.63 | 641 | 7.86 |
| other | 6421 | 10.88 | 5189 | 80.81 | 205 | 3.19 |
| submersible | 4656 | 7.89 | 1858 | 39.91 | 227 | 4.88 |
| swn 80 | 3670 | 6.22 | 1368 | 37.28 | 212 | 5.78 |
| mono | 2748 | 4.66 | 1594 | 58.01 | 129 | 4.69 |
| india mark ii | 2400 | 4.07 | 873 | 36.38 | 79 | 3.29 |
| afridev | 1770 | 3.0 | 528 | 29.83 | 42 | 2.37 |
| ksb | 1415 | 2.4 | 686 | 48.48 | 26 | 1.84 |
| other - rope pump | 451 | 0.76 | 141 | 31.26 | 17 | 3.77 |
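A breakdown like this can also be produced more compactly with `pd.crosstab`, which computes row-wise percentages for every status at once. The sketch below runs on a few toy rows (made-up values, not the dataset) and is offered as an alternative, not as the method used above.

```python
import pandas as pd

toy = pd.DataFrame({
    'extraction_type': ['gravity', 'gravity', 'gravity', 'gravity', 'mono', 'mono'],
    'status_group': ['functional', 'functional', 'non functional',
                     'functional needs repair', 'non functional', 'functional'],
})

# normalize='index' makes each row (extraction type) sum to 100%
status_pct = (pd.crosstab(toy.extraction_type, toy.status_group,
                          normalize='index') * 100).round(2)
print(status_pct)
```

Status values that never occur for a given extraction type show up as 0 automatically, so no separate `fillna` step is needed.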
