<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Predicting-Water-Well-Status-in-Tanzania-using-Machine-Learning" data-toc-modified-id="Predicting-Water-Well-Status-in-Tanzania-using-Machine-Learning-1">Predicting Water Well Status in Tanzania using Machine Learning</a></span><ul class="toc-item"><li><span><a href="#Executive-Summary" data-toc-modified-id="Executive-Summary-1.1">Executive Summary</a></span></li><li><span><a href="#The-Project" data-toc-modified-id="The-Project-1.2">The Project</a></span><ul class="toc-item"><li><span><a href="#Problem-statement" data-toc-modified-id="Problem-statement-1.2.1">Problem statement</a></span></li><li><span><a href="#Crowdsourcing-algorithms-to-predict-non-functional-wells" data-toc-modified-id="Crowdsourcing-algorithms-to-predict-non-functional-wells-1.2.2">Crowdsourcing algorithms to predict non-functional wells</a></span></li><li><span><a href="#My-approach-to-the-problem" data-toc-modified-id="My-approach-to-the-problem-1.2.3">My approach to the problem</a></span><ul class="toc-item"><li><span><a href="#Exploratory-data-analysis-(EDA)-and-custom-functions-for-visualization-and-modeling" data-toc-modified-id="Exploratory-data-analysis-(EDA)-and-custom-functions-for-visualization-and-modeling-1.2.3.1">Exploratory data analysis (EDA) and custom functions for visualization and modeling</a></span></li><li><span><a href="#Modeling" data-toc-modified-id="Modeling-1.2.3.2">Modeling</a></span></li></ul></li><li><span><a href="#My-Findings" data-toc-modified-id="My-Findings-1.2.4">My Findings</a></span><ul class="toc-item"><li><span><a href="#Outcomes-of-models-produced-by-various-algorithms-and-parameters" data-toc-modified-id="Outcomes-of-models-produced-by-various-algorithms-and-parameters-1.2.4.1">Outcomes of models produced by various algorithms and parameters</a></span></li><li><span><a href="#The-necessity-of-addressing-class-imbalances" 
data-toc-modified-id="The-necessity-of-addressing-class-imbalances-1.2.4.2">The necessity of addressing class imbalances</a></span></li><li><span><a href="#Real-world-impact-from-addressing-class-imbalances" data-toc-modified-id="Real-world-impact-from-addressing-class-imbalances-1.2.4.3">Real-world impact from addressing class imbalances</a></span></li></ul></li><li><span><a href="#Blog-post:" data-toc-modified-id="Blog-post:-1.2.5">Blog post:</a></span></li></ul></li></ul></li><li><span><a href="#Findings" data-toc-modified-id="Findings-2">Findings</a></span><ul class="toc-item"><li><span><a href="#Model-Performance:-Accuracy-and-Reduction-of-False-Negatives" data-toc-modified-id="Model-Performance:-Accuracy-and-Reduction-of-False-Negatives-2.1">Model Performance: Accuracy and Reduction of False Negatives</a></span></li><li><span><a href="#Feature-Importances" data-toc-modified-id="Feature-Importances-2.2">Feature Importances</a></span></li></ul></li><li><span><a href="#Possible-Future-Work" data-toc-modified-id="Possible-Future-Work-3">Possible Future Work</a></span></li></ul></div>

# Predicting Water Well Status in Tanzania using Machine Learning

## Executive Summary

___The problem:___  The government of Tanzania seeks to provide clean water to all of its citizens, but over 40% of the population still lacks access to clean water.  The government collaborated with Taarifa, a non-profit organization, to create a database of all of the water supply projects in the country. Data for each water supply project includes information about the project’s geographic location, local water abundance and quality, technical information (e.g., type of well), funder, installer, project administration, and more.

___Crowdsourcing machine learning to predict well function:___ DrivenData, a social enterprise that works with mission-driven organizations, is hosting a competition to predict water well status using machine learning algorithms. The dataset provided by DrivenData contains data for 59,400 water supply projects. There are 39 features in this dataset and three target classes (well status = ‘functional’, ‘non-functional’, or ‘functional; needs repair’).  Thus, this is a **multi-class classification problem.**

___My approach:___ For this project, I developed functions to customize data visualizations during the exploration phase and evaluated several different supervised machine learning models for predicting water well status. I also employed a few different approaches to address class imbalances, significantly improving performance around costly ‘false-negative’ errors (predicting wells to be ‘functional’, when in fact they were ‘non-functional’ or ‘functional needing repair’).  

For an overview of the functions I developed and deployed, check out my blog post at https://github.com/gdurante2019/gdurante2019.github.io/blob/master/_posts/2020-07-22-data_science_toolbox_function_to_create_top_n_values.markdown.

___Findings:___ In this project, I found that there is a significant trade-off between 1) better overall test accuracy scores when using the original imbalanced class data, and 2) better performance in minimizing false negative errors by addressing class imbalances.  In fact, the problem with false negatives resulting from the use of imbalanced class data is so great that, for a model to be useful, the class imbalance issue _must_ be addressed.  To do this, I employed either SMOTE resampling or the class_weight='balanced' parameter (where available).

## The Project

### Problem statement

The government of Tanzania is working to provide clean water to all of its citizens, but over 40% of the population still lacks access to clean water.  Furthermore, a significant percentage of the wells that have been installed over the years need either repair or replacement.  Because of the variety of well types and the highly distributed nature of well projects, it is difficult for the government to assess a particular well's status at any given time.

### Crowdsourcing algorithms to predict non-functional wells 

In an effort to better understand the functional status of wells around the country, Tanzania has collaborated with Taarifa, a non-profit organization, to create a database of all of the water supply projects in the country. Data for each water supply project includes information about the project’s geographic location, local water abundance and quality, technical information (e.g., type of well), funder, installer, project administration, and more.

DrivenData, a social enterprise that works with mission-driven organizations, is hosting a competition to predict water well status using machine learning algorithms. 

https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/

<img src='images/DrivenData_Tanz_comp_homepage.png' height=70% width=70%>


The dataset provided by DrivenData contains data for 59,400 water supply projects. There are 39 features in this dataset and three target classes (well status = ‘functional’, ‘functional; needs repair’, or ‘non-functional’).  Thus, this is a _**multi-class classification problem.**_

### My approach to the problem
_**Note:**_  I used this dataset to satisfy the project requirement for supervised learning/ensemble methods that had been covered to that point in my data science immersive program, which included decision trees, random forests, bagged trees, boosted trees, and support vector machines.  I chose to evaluate as many algorithms with hyperparameter tuning as possible to get a better sense of how the various models performed given different inputs and hyperparameter settings.
#### Exploratory data analysis (EDA) and custom functions for visualization and modeling
Because of the large number of individual feature values (39 features and anywhere from 2 to 2000 values per feature), I did a lot of EDA and wrote functions to customize data visualizations during the exploration phase.  This helped me to see variation across features and geographies, and helped me to understand the literal and figurative landscape of water projects throughout the country.  

Following are some examples of data visualizations that I developed to help me work with the data during the data exploration phase and the modeling phase.
##### Regional information
Below is a graph of well functional status by region.  Note that the y-axis represents the number of wells for each region.  This gives a nice summary view of how many wells there are by region and how these wells are performing.

<img src='images/status_by_region.png' height=70% width=70%>

It was also interesting to take a look at a few other factors that might be important for well performance.  A different way of considering geographic influences was to visualize water abundance by region:

<img src='images/water_qty_by_region.png' height=70% width=70%>

A quick visual comparison of water abundance with well status in each region suggests a possible correlation between better well performance and a regional water abundance of "enough".  Of course, there are many other factors to consider, but this comparison does suggest that additional insights may be obtained by looking at well status by the water abundance category at the site where the well was installed.

##### Water abundance (quantity)

Speaking of water abundance ("quantity"), we can also view well status by the water abundance ("quantity") at the well site.  The largest number of wells are in areas with the quantity designation "enough", and their performance appears to be the best of the 5 water abundance categories.  Wells in areas categorized as "dry" appear to have the worst performance.  We might consider additional data visualization cross-referencing well status by extraction type and by water abundance as next steps, but it is informative to have a baseline sense of well status across different features.   

<img src='images/status_by_water_qty.png' height=50% width=50%>

##### Extraction type
An obvious parameter of well performance to consider is the type of water extraction mechanism used by the well.  A few examples of visualizations for this parameter are provided below. The first shows the mix of different well extraction types by functional status:  

<img src='images/status_vs_extr_type.png' height=50% width=50%>

Another view of the same information gives us a way to see extraction type on the x-axis, number of wells on the y-axis, and the functional status in the stacked column.  This instantly gives us a sense of both the number of wells _and_ functional status _by extraction type_: 

<img src='images/status_extr_type_class.png' height=60% width=60%>

While these two charts are great for viewing extraction type and performance across all installed wells, we can't really see the well status for small categories (e.g., rope pump, wind-powered).  This is where a percentage stacked column chart can be really useful:  

<img src='images/status_by_extr_type_pct.png' height=60% width=60%>
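
As a minimal sketch of how such a percentage view can be computed (not necessarily how the chart above was produced), the counts per extraction type can be row-normalized with pandas. The tiny DataFrame is made up, and the column names `extraction_type_class` and `status_group` are assumptions about how the data is loaded.

```python
import pandas as pd

# Toy data standing in for the real well records (values are made up).
wells = pd.DataFrame({
    "extraction_type_class": ["gravity", "gravity", "gravity", "gravity",
                              "rope pump", "rope pump"],
    "status_group": ["functional", "functional", "functional", "non functional",
                     "functional", "non functional"],
})

# Count wells per (extraction type, status), then convert each row to
# percentages so that small categories are as readable as large ones.
status_pct = pd.crosstab(
    wells["extraction_type_class"],
    wells["status_group"],
    normalize="index",
) * 100

print(status_pct.round(1))

# With matplotlib installed, the same table can be drawn as a
# 100% stacked column chart:
# status_pct.plot(kind="bar", stacked=True)
```

Because every row sums to 100, a category with only a handful of wells gets the same visual height as one with thousands, which is what makes the small categories legible.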

##### Dealing with very large feature sets

Some features, such as ```funder``` and ```installer```, contain over 2000 unique values each, but the vast majority of water projects were funded or installed by a small percentage of funders or installers.  I wanted to be able to separate out the biggest players across water projects and aggregate the hundreds of very small or one-off players into a single category.  Therefore, I developed functions that allowed me to aggregate smaller players into an "other" category based on a "top-n" cutoff--e.g., if n=100, then the function would show the top 100 installers by number of wells, with the remaining installers (over 1900) aggregated into "other installers".  (This would also allow me to view how the wells of all of the remaining funders or installers performed, if I wished.) 

I used this function not only for visualization but also for modeling--vastly reducing the number of dummy variables in the model while still maintaining information about those wells funded or installed by the small players. 
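
The function below is a simplified sketch of the top-n idea rather than the exact code used in the project. The name `keep_top_n`, its parameters, and the installer names are all hypothetical.

```python
import pandas as pd

def keep_top_n(series, n, other_label="other"):
    """Keep the n most frequent values of a column and relabel the rest.

    Missing values are not counted by value_counts, so they also end up
    under other_label.
    """
    # value_counts sorts by frequency in descending order by default.
    top_values = series.value_counts().index[:n]
    return series.where(series.isin(top_values), other_label)

# Made-up installer names, for illustration only.
installers = pd.Series(["Installer A", "Installer A", "Installer A",
                        "Installer B", "Installer B",
                        "Installer C", "Installer D"])

grouped = keep_top_n(installers, n=2, other_label="other installers")
print(grouped.value_counts())
```

Applying the same transformation before one-hot encoding caps the number of dummy columns for that feature at n + 1.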

One visualization example:  I selected the top 40 funders of water projects and plotted the number of wells they've funded in descending order, and used stacked columns to show what proportion of wells funded by each funder are functional, functional but needing repair, or non-functional (needing replacement).  

<img src='images/status_by_top40_funder.png' height=80% width=80%>

Because there is such a steep drop off in the number of wells represented by each funder beyond the first 10 or so, it is hard to see the functional status of all of the water wells by each funder.  Here again, it's useful to plot functional status as a percentage stacked column:  

<img src='images/status_by_funder_pct.png' height=80% width=80%>

#### Modeling

##### Algorithms used 
This classification project was amenable to several supervised machine learning algorithms.  I created decision-tree, random-forest, bagged-trees, AdaBoost, XGBoost, and SVM models using standard Python libraries, and used GridSearchCV for hyperparameter tuning.  
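
For readers unfamiliar with GridSearchCV, here is a minimal sketch of the tuning pattern using scikit-learn. It runs on synthetic three-class data, and the parameter grid is hypothetical; it does not reproduce the grids or results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the encoded well features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Hypothetical grid; the values searched in the project are not listed here.
param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 20]}

# Every combination in the grid is scored with 3-fold cross-validation.
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("test accuracy:", round(search.score(X_test, y_test), 3))
```

Note that `scoring="accuracy"` selects the model with the best overall accuracy, which is exactly the criterion that can hide poor performance on the smaller classes.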

##### Addressing class imbalances 
I explored a few different approaches to address class imbalances--e.g., ```class_weight='balanced'``` and SMOTE.  Doing this significantly improved model prediction of non-functional wells, reducing costly ‘false-negative’ errors.
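
To make the `class_weight='balanced'` option more concrete: scikit-learn documents it as weighting each class by `n_samples / (n_classes * class_count)`. The sketch below reimplements that heuristic in plain Python; the helper name and the label counts are made up for illustration and are not the actual class counts in the dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}

# Made-up label counts, chosen only to show the effect on a rare class.
labels = (["functional"] * 600
          + ["non functional"] * 300
          + ["functional needs repair"] * 100)

weights = balanced_class_weights(labels)
for cls, weight in weights.items():
    print(f"{cls}: {weight:.2f}")
```

The rarest class receives the largest weight, so misclassifying one of its wells costs the model more during training. SMOTE takes a different route: instead of reweighting, it adds synthetic minority-class samples to the training data.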

##### Writing functions to help automate modeling and customize output
_(placeholder)_

##### Customized functions for displaying model results
I wanted to be able to summarize certain model results in a roll-up report.  To accomplish this, I developed several functions that present the rolled-up results in an easy-to-read table.  This allowed me to identify the most important features resulting from each model run.  An example of such an output follows:


### My Findings

#### Outcomes of models produced by various algorithms and parameters


_(placeholder)_




_(placeholder)_




#### The necessity of addressing class imbalances
Perhaps the most important finding was that there is a significant trade-off between 1) _**better overall test accuracy**_ by using the original imbalanced class data, and 2) _**minimizing costly false negative errors**_ by addressing class imbalances.  

Correctly predicting a non-functional well as non-functional is the whole point of developing these algorithms.  However, many people who joined the DrivenData challenge focused solely on maximizing overall model accuracy.  Focusing on maximizing accuracy without addressing class imbalances results in terrible prediction rates for wells in need of attention.  Such models identify only around 25% of all wells requiring repair or replacement (only ~2000 out of almost 8000), which amounts to a false negative rate of about 75%!  Thus, addressing class imbalances in this project is absolutely essential to ensure that the models deployed are useful.

![Imbalanced_classes_confusion_matrix](images/Conf_matrix_good_bad_small.png)




There are many ways to address class imbalances.  Given where I was in the data science program curriculum and the supervised learning approaches being examined for this project, I used either SMOTE resampling or the `class_weight='balanced'` parameter.  Doing so allowed me to greatly improve prediction of non-functional wells.  **I was able to reduce false negative errors from ~75% to around 25-30%, depending on the modeling approach.**  
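
Since "false negative" here means a well that needs attention being predicted as 'functional', the rate can be computed per class directly from true and predicted labels. The helper below is a hypothetical sketch on ten made-up wells, not the evaluation code used for the figures in this README.

```python
def false_negative_rates(y_true, y_pred, ok_label="functional"):
    """For each class other than ok_label, return the share of its wells
    that were predicted as ok_label."""
    totals, missed = {}, {}
    for actual, predicted in zip(y_true, y_pred):
        if actual == ok_label:
            continue  # only wells that need attention can be false negatives
        totals[actual] = totals.get(actual, 0) + 1
        if predicted == ok_label:
            missed[actual] = missed.get(actual, 0) + 1
    return {cls: missed.get(cls, 0) / total for cls, total in totals.items()}

# Ten made-up wells: 4 non functional, 4 needing repair, 2 functional.
y_true = (["non functional"] * 4
          + ["functional needs repair"] * 4
          + ["functional"] * 2)
y_pred = ["non functional", "non functional", "non functional", "functional",
          "functional", "functional", "functional", "functional needs repair",
          "functional", "functional"]

rates = false_negative_rates(y_true, y_pred)
for cls, rate in rates.items():
    print(f"{cls}: {rate:.0%} predicted as functional")
```

In this toy example, one of four non-functional wells and three of four wells needing repair are missed, even though overall accuracy is 60%.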





##### Without SMOTE:

_(placeholder)_


##### With SMOTE:

_(placeholder)_


#### Real-world impact from addressing class imbalances

Focusing on accuracy at the expense of addressing class imbalances makes a model far worse at identifying wells in need of repair or replacement.  For example, if no class balancing is done, even a high-accuracy model will only correctly identify about 25% of non-functional wells (~2000 out of ~8000):

_(placeholder)_

However, addressing class imbalances produces much better results, correctly identifying 70-75% of malfunctioning wells.  This translates into correctly identifying about 5900 of 7900 wells needing attention--or roughly 4000 _**more**_ non-functional wells than without class balancing!  


_(placeholder)_


This is a huge improvement that could help government officials improve the speed at which they are able to visit, assess, and repair or replace wells.  

More broadly, this project made me very aware of how class imbalances lead to prediction errors.  It provides a real-life example of how important it is to tune data science models to address the most important problems in a particular situation (accurate prediction of non-functioning wells in this case).

### Blog post:
https://github.com/gdurante2019/dsc-mod-3-project-v2-1-online-ds-sp-000

# Findings

## Model Performance: Accuracy and Reduction of False Negatives 
* Imbalanced dataset (original)
  * Training accuracies ranged from 77% to 85%
  * Test accuracies ranged from 74% to 80%
  * False negatives were unacceptably high (up to 77% and 40% for mis-classification of 'functional needs repair' and 'non functional' wells as 'functional', respectively)
* SMOTE resampled data
  * Training accuracies ranged from 73% to 80%
  * Test accuracies ranged from 69% to 78%
  * False negatives were noticeably better than for imbalanced dataset, but still higher than for models run with class_weight='balanced' (~30-35% and ~23-30% for mis-classification of 'functional needs repair' and 'non functional' wells as 'functional', respectively)
* class_weight='balanced' (in models where this parameter is available)
  * Training accuracies ranged from 65% to 78%
  * Test accuracies ranged from 62% to 74%
  * Best performance of the three options for false negatives (as low as 17% and 17% for mis-classification of 'functional needs repair' and 'non functional' wells as 'functional', respectively)
*   **Conclusion--Class imbalances:**  There is a significant trade-off between overall accuracy and minimizing costly false negatives by addressing class imbalances that contribute to the errors in the smaller classes ('non-functional' and 'functional needs repair'). 
  
## Feature Importances 
* At a macro level, I found that the features that showed up most frequently in the aggregate feature importance list were the following (roughly in order of importance):
     * Quantity ('dry', 'enough', 'insufficient', 'seasonal', or 'unknown')
     * Region or lga (depending on which was used in the model)
     * Waterpoint type (standpipe, borehole, improved spring, cattle trough, other)
     * Installer or funder (depending on which was used in the model)
     * Extraction type (e.g., gravity, hand pump, motor, submersible)
     * Source (e.g., river, spring, shallow well, rainwater harvesting)
     * Payment type (e.g., pay by bucket, pay by month, never pay)
* At a more granular level, the top 30 or so dummy variables tended to include a mix of:
     * Water abundance
     * Waterpoint type
     * Geographic location (e.g., LGA)
     * Extraction type
     * Installer/funder
     * Payment type
* I recommend looking more closely at these dummy variables, since they are likely to point to the specific values (e.g., a particular extraction type or installer) associated with more wells needing replacement or repair
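
One way to move between the granular and macro views above is to sum the importances of dummy variables under the feature they were derived from. The sketch below assumes dummy columns are named `<feature>_<value>` (the pandas `get_dummies` default); the function name and all importance values are made up for illustration.

```python
from collections import defaultdict

def aggregate_importances(dummy_importances, base_features):
    """Sum dummy-variable importances under the original feature name."""
    totals = defaultdict(float)
    # Check longer names first so that a feature such as
    # 'extraction_type_class' is not mistaken for 'extraction_type'.
    ordered = sorted(base_features, key=len, reverse=True)
    for name, importance in dummy_importances.items():
        parent = next(
            (f for f in ordered if name == f or name.startswith(f + "_")),
            name,  # columns that are not dummies keep their own name
        )
        totals[parent] += importance
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Made-up importances, for illustration only.
importances = {
    "quantity_dry": 0.20,
    "quantity_enough": 0.10,
    "waterpoint_type_other": 0.15,
    "extraction_type_gravity": 0.05,
    "gps_height": 0.08,
}

ranked = aggregate_importances(
    importances, ["quantity", "waterpoint_type", "extraction_type"])
for feature, total in ranked:
    print(f"{feature}: {total:.2f}")
```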

# Possible Future Work

With more time, I would like to explore additional modeling with a subset of parameters:
* As indicated above, most of the models show very similar lists of dummy variables as having the greatest influence on the model algorithms
* So far, accuracy scores have remained stubbornly in the 75-80% range regardless of which model or subset of variables I use; I'd like to find out whether running a smaller subset of dummy variables might improve results
* There are significant holes in this data set (e.g., missing values or values such as ‘other’ for type of well); with more time, I would like to research a few of these values, such as:
  * ‘Other’ in the category ‘extraction_type’, as the majority of wells in this category are non-functional
  * Year constructed: no date is recorded for a significant percentage of wells; do these wells have something in common, such as:
    * Clustered in certain locations where data was/is not collected?
    * Constructed before a certain date?
    * Constructed by certain installers or funded by certain funders?
    * Of a certain type, e.g., handpump?  

Some features that I was not able to use due to incomplete information are likely to have an impact on the accuracy scores, but would require more research to flesh out:
* Significant funding for projects comes from outside Tanzania
  * My initial quick review of some of the top funders and installers suggests that funding / installation management by certain international entities is associated with a substantially higher rate of functional projects, with few projects in need of maintenance or equipment replacement--most notably Germany (perhaps not surprising, given the cultural emphasis on efficiency and high standards for technical expertise and performance)
  * There is some geographic / country information on a small subset of funders and/or installers, but this information is not captured in the dataset provided, except incidentally (e.g., the rare country reference in the name of the funder or installer)
  * Thus, it would be necessary to research these entities to identify their locations
  * As there are at least 100 that have funded dozens or hundreds of projects, this would be a time-consuming effort.
* Latitude / longitude—Looking at functional status of wells at various geographic locations to identify possible patterns due to geological differences
