# Business Understanding


They say, "Water is life." Tanzania is a developing country with a clean water problem. According to [worldometers.info](https://www.worldometers.info/world-population/tanzania-population/), Tanzania's population is 59,353,795, so the country has to provide more than 59 million people with clean water. Many water points are already well established, but many others are non-functional or in need of repair.

DrivenData has launched the 'Pump it Up' competition to address this problem. With a predictive model, people can understand which water points are functional, which are non-functional, and which are functional but need repair. Such a model can help the Tanzanian government find wells that are likely to need maintenance and provide useful information for planning future wells.

The project's aim is to build a model that predicts the status of a water point (functional, non-functional, or functional but needs repair). With this model, we can help the Tanzanian authorities use water sources productively. It can also help JICA (Japan International Cooperation Agency), the Dar es Salaam Water and Sewerage Authority (DAWASA), and the Dar es Salaam Water and Sewerage Corporation (DAWASCO) invest in Tanzania's water resources wisely.

![pump.png](./images/pump.png)



# Pump it Up: Tanzanian Water Wells (Data Mining the Water Table)

### HOSTED BY:  DRIVENDATA
### AUTHOR:  FRED MUTUMA


# Data Understanding


The original data can be obtained from the [DrivenData 'Pump it Up: Data Mining the Water Table' competition](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/23/). There are four data sets:
- submission format
- training set
- test set
- training labels set, which contains the status of each well

With the given training set and labels set, competitors are expected to build a classifier that predicts the condition of a water well, apply it to the test set to determine the status of the wells, and submit the predictions.

In this project, I use the training set and the training labels set. The training set has 59,400 water points with 40 features. The training labels set covers the same 59,400 water points but contains only the id and the status of each point.

The data for this competition comes from the Taarifa waterpoints dashboard, which aggregates data from the Tanzania Ministry of Water.

In their own words:

> Taarifa is an open source platform for the crowd sourced reporting and triaging of infrastructure related issues. Think of it as a bug tracker for the real world which helps to engage citizens with their local government. We are currently working on an Innovation Project in Tanzania, with various partners.

You can learn more here:

- [Taarifa Homepage](http://taarifa.org)
- Taarifa Blog
- Taarifa Github


### Features

- amount_tsh - Total static head (amount of water available to the waterpoint)
- date_recorded - The date the row was entered
- funder - Who funded the well
- gps_height - Altitude of the well
- installer - Organization that installed the well
- longitude - GPS coordinate
- latitude - GPS coordinate
- wpt_name - Name of the waterpoint if there is one
- num_private - (no description provided)
- basin - Geographic water basin
- subvillage - Geographic location
- region - Geographic location
- region_code - Geographic location (coded)
- district_code - Geographic location (coded)
- lga - Geographic location
- ward - Geographic location
- population - Population around the well
- public_meeting - True/False
- recorded_by - Group entering this row of data
- scheme_management - Who operates the waterpoint
- scheme_name - Who operates the waterpoint
- permit - If the waterpoint is permitted
- construction_year - Year the waterpoint was constructed
- extraction_type - The kind of extraction the waterpoint uses
- extraction_type_group - The kind of extraction the waterpoint uses
- extraction_type_class - The kind of extraction the waterpoint uses
- management - How the waterpoint is managed
- management_group - How the waterpoint is managed
- payment - What the water costs
- payment_type - What the water costs
- water_quality - The quality of the water
- quality_group - The quality of the water
- quantity - The quantity of water
- quantity_group - The quantity of water
- source - The source of the water
- source_type - The source of the water
- source_class - The source of the water
- waterpoint_type - The kind of waterpoint
- waterpoint_type_group - The kind of waterpoint
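
Since the features and the labels come as separate files that share an id, they need to be joined before any analysis. The following is a minimal sketch of that step, not the notebook's actual loading code. The `id` and `status_group` column names and the helper `load_training_data` are assumptions for illustration, and the tiny in-memory CSVs stand in for the real competition files.

```python
import io

import pandas as pd


def load_training_data(values_src, labels_src):
    """Read the feature and label files and join them on the shared `id` column.

    Hypothetical helper: both arguments can be file paths or file-like objects.
    """
    values = pd.read_csv(values_src)
    labels = pd.read_csv(labels_src)
    # validate="one_to_one" raises an error if an id is duplicated in either file.
    return values.merge(labels, on="id", how="inner", validate="one_to_one")


# Made-up rows standing in for the real CSV files (illustration only).
values_csv = io.StringIO("id,amount_tsh,quantity\n1,0.0,enough\n2,25.0,dry\n")
labels_csv = io.StringIO("id,status_group\n1,functional\n2,non functional\n")

df = load_training_data(values_csv, labels_csv)
print(df.shape)
print(df.columns.tolist())
```

With the real files, the two `io.StringIO` objects would be replaced by the paths of the downloaded training set and training labels set.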

# Evaluation


- Generally, areas with higher populations have more functional wells.
- Some areas are more likely to have clean water, especially those near good basins.
- Dar es Salaam is one of the most populous cities, but 35% of its water points with good water quality are non-functional.
- Iringa is an important area, but it contains many non-functional water points that have soft water.
- Most wells funded by the government are non-functional.
- Most water points installed by the central government and district councils are non-functional.
- The most common extraction type is gravity, followed by hand pumps. Hand pumps are less efficient than commercial pumps, which suggests that authorities need to focus on pumping type. Many non-functional water points use gravity as their extraction type.
- Some water points that have enough soft water are non-functional.
- Wells constructed in recent years are more often functional than older ones. Recent wells function better, but they will need repair sooner or later.
- Many wells that have enough water are non-functional.
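
Observations like the ones above compare the share of each status within a category (funder, installer, extraction type, and so on). One way to compute such shares is a row-normalized cross-tabulation. This is a sketch on made-up rows, assuming the label column is named `status_group`; it is not the analysis code from the notebook.

```python
import pandas as pd

# Made-up rows for illustration; the real data has 59,400 water points.
wells = pd.DataFrame({
    "extraction_type_class": [
        "gravity", "gravity", "handpump", "handpump", "gravity", "handpump",
    ],
    "status_group": [
        "functional", "non functional", "non functional",
        "functional", "functional", "functional needs repair",
    ],
})

# normalize="index" turns each row into proportions that sum to 1,
# so categories of different sizes can be compared directly.
status_share = pd.crosstab(
    wells["extraction_type_class"],
    wells["status_group"],
    normalize="index",
)
print(status_share.round(2))
```

Swapping `extraction_type_class` for another categorical column, such as `funder` or `installer`, gives the same kind of comparison for that feature.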




## Findings

- 4272 wells are dry but have good water quality. With a sound solution, these wells can be made functional. Finding clean water sources is not the only problem; keeping these sources fed is equally important.
- 2226 (7%) wells have enough soft water but need repair. Authorities must invest in repairs; otherwise these wells will become non-functional.
- 8035 (27%) wells have enough good-quality water but are non-functional. This shows that authorities must invest in technology to pump these good sources.
- Authorities should re-inspect the wells they funded.
- New techniques must be found to feed dry wells and repair broken ones.
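
Counts like those in the findings come from filtering the merged table on several columns at once. The sketch below shows one way to do that with boolean masks. The helper `count_wells` is hypothetical, the rows are made up, and the category values (`"enough"`, `"dry"`, `"good"`, `"non functional"`) are assumed spellings that should be checked against the actual data.

```python
import pandas as pd

# Made-up rows for illustration only.
wells = pd.DataFrame({
    "quantity": ["enough", "enough", "dry", "enough", "dry"],
    "quality_group": ["good", "good", "good", "salty", "good"],
    "status_group": [
        "non functional", "functional needs repair", "non functional",
        "functional", "non functional",
    ],
})


def count_wells(df, **conditions):
    """Count rows where every given column equals the given value."""
    mask = pd.Series(True, index=df.index)
    for column, value in conditions.items():
        mask &= df[column] == value
    return int(mask.sum())


dry_but_good = count_wells(wells, quantity="dry", quality_group="good")
enough_good_broken = count_wells(
    wells, quantity="enough", quality_group="good", status_group="non functional"
)
print(dry_but_good, enough_good_broken)
```

Dividing such a count by the size of a chosen reference group gives a percentage; which reference group to use is an analysis decision.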


For the predictions, I built models for both binary and multiclass targets. According to the results dataframe, the best results for the binary target were obtained with a Random Forest Classifier tuned with grid search. The table below shows the results for the binary target models:

![binary.jpg](./images/binary.jpg)


I prioritized the random forest model based on ROC-AUC scores. I also looked at balanced accuracy on the test set, since the competition metric is balanced accuracy. The model is overfit, but it still gives good balanced accuracy on the test set, so I chose it. XGBoost is not overfit and gives results close to this model. These are the two good models in the project.

According to the confusion matrix on the test data, the split is not perfect: 962 points are predicted as functional but are actually non-functional, and these are the ones we care about most. Another 698 points are predicted as non-functional but are actually functional. With the lessons from the binary model, I proceeded to the multiclass target model.
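
To make the link between the confusion matrix and balanced accuracy concrete, here is a small NumPy sketch. Balanced accuracy is the mean of the per-class recalls, so a class with few samples counts as much as a large one. The labels are made up, and the class encoding (0 = non-functional, 1 = functional) is an assumption for illustration.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def balanced_accuracy(cm):
    """Mean of per-class recall (diagonal divided by row totals)."""
    recalls = np.diag(cm) / cm.sum(axis=1)
    return float(recalls.mean())


# Made-up labels: 0 = non-functional, 1 = functional (assumed encoding).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred, n_classes=2)
plain_accuracy = float(np.trace(cm) / cm.sum())
print(cm)
print(plain_accuracy, balanced_accuracy(cm))
```

In this layout, `cm[0, 1]` counts non-functional points predicted as functional, which is the kind of error highlighted above.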

Of the 3 models I implemented, XGBoost performed best when combined with SMOTE, so the final model is XGBoost with SMOTE. With 86% accuracy, we can predict the functionality of the wells. Below is a summary of the results of the multiclass target models:

![multi.jpg](./images/multi.jpg)
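
SMOTE balances the classes by creating synthetic minority samples that lie between existing minority samples and their nearest neighbours. The source does not say which SMOTE implementation was used, so the following is only a simplified NumPy illustration of that interpolation idea, not the project's code or a library implementation. The function name `smote_like_oversample` and the toy points are hypothetical, and the sketch handles numeric features only.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between
    minority-class samples and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                       # starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                                # position on the segment
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Toy minority class (for example, "functional needs repair") with two numeric features.
X_repair = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(X_repair, n_new=6)
print(synthetic.shape)
```

Oversampling of this kind is normally applied to the training split only, so that the test set keeps the real class distribution.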

# Conclusion

PRIORITIZE EFFICIENTLY: The following criteria should be observed to achieve this.

**Repairs**
- Prioritize functioning wells that need repair and yield clean water.

**Funding**
- Allocate funds and resources to effective organizations with a track record.

**Payment**
- Payments of some kind will provide an incentive to keep wells functional. Pick a metric and charge per unit.

**Location**
- Target repairs at clusters of wells, especially those serving large populations.


Our model is 86% accurate, so I am confident it can help direct funds efficiently and effectively, doing more with less. There is a need to streamline maintenance and repairs.

# Recommendations

**MONITOR WELLS**
- Update the model regularly to schedule preventative maintenance.

**GEOGRAPHIC REGION**
- The model has to consider regional factors: rainfall, climate, geology, etc.

**IMPROVE DATA**
- Quantify qualitative data to improve model.


```python
# Import libraries
import warnings

import numpy as np
import pandas as pd
import matplotlib.cm
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
import folium
from folium.plugins import HeatMap

# Jupyter magic: render matplotlib figures inline in the notebook
%matplotlib inline

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
```