![Lake Tanganyika, Tanzania](Images/tanganyika.jpg)

# Tanzania Water
Flatiron Mod 3 Project

## Overview



### Repository Navigation
- [Data Folder Includes](Data)
    - [Data from Taarifa on drivendata.org](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/data/)
        - [Training Set Labels](Data/training_set_labels.csv)
        - [Training Set Values](Data/training_set_values.csv)
        - [Test Set Values](Data/test_set_values.csv)
    - Modified Datasets
        - [Combined Training Set](training_set.csv)
        - [Altitude Data](Data/looked_up_alts.csv)
        - [Cleaned Data](Data/cleaned_data_072920_shaw.csv)
        - [Preliminary Data](Data/water_well.csv)


- Notebooks: [Summary Notebook](Notebooks/Master_Notebook_2.ipynb)
    - [Exploratory Data Analysis](Notebooks/EDA_Alex.ipynb)
    - [Data Production](Notebooks/data_production_notebook.ipynb)
    - [Determine Confidence Intervals](Notebooks/Confidence_Interval_with_inputs.ipynb)
    - [Train Regression](Notebooks/Training_Regression.ipynb)
    - [Make Prediction](Notebooks/Make_Prediction.ipynb)
    - [Visualizations](Notebooks/Visualizations)

- [Presentation](Presentation/Mod_2_Project.pdf)



### ReadMe Navigation

1. [Business Context](#Business-Context)
2. [Current Landscape](#Data-Understanding)
    1. The Problem
    2. Data Understanding
    3. Data Limitations

3. Predictive Analysis
    1. Our Model
    2. Feature Importance
    3. Model Performance

4. Conclusion & Takeaways
    1. Improvement Areas
    2. Growth Opportunities
    3. Takeaways

5. [Feature Engineering](#Feature-Engineering)
6. [Model & Prediction](#Model-&-Prediction)
7. [Conclusions](#Conclusions)
8. [Further Steps](#Further-Steps)
9. [Project Info](#Project-Info)

***

## Business Context

Water in Africa: Technical and Equipment Researchers, LTD. (WATER) is a (hypothetical) consultancy which seeks to amplify the positive effects of progress in water point parts, supply, and maintenance. The public and private sectors can both make progress by analyzing and anticipating the status of water point repair more quickly and efficiently.

### Predicting Water Point Maintenance Status from Data

The instrumental goal of this project is to construct a model capable of predicting the status of a waterpoint in Tanzania as one of three states: "functional", "functional needs repair", or "non functional". The underlying motivation is to create a cheaply implemented, readily available tool for businesses or other organizations looking to provide maintenance and replacement services for these waterpoints. This should in principle lower the overhead costs for such providers (less chance of over-ordering parts, over-hiring staff, etc.) and so should, due to lower volatility and increased profitability, stimulate the creation of new providers or the expansion of existing ones.

A major factor in waterpoint failure is difficulty in accessing necessary maintenance; this model would help ameliorate that. In summary, our goal was:

**Reduce overhead costs by anticipating water pump maintenance status.**




***

### The Problem

For additional context, we identified several related problems: many water points need work, maintenance is costly and infrequent, and water points stay in disrepair because of these costs.

Our model addresses these problems by enabling more efficient market planning: deciding how to distribute resources and anticipating repair, maintenance, and parts needs. It should also reduce volatility and increase profitability for maintenance providers. Ideally, it would also increase the size of the pie by enabling more maintenance providers to enter the space.

## Data Understanding

The data from this project came from various collection efforts in the early 2000s, with GeoData LTD. listed as the data recorder. Combining industry knowledge with 

Overall, the data provided included the sales price and other information for around 21,000 houses sold between May 2014 and May 2015.

An overall map of the data:
![King County House Sales](Images/location.png)

The distribution of prices looked like this:

![Price Histogram](Images/price.png)

### Data Categories
####  Geographical
This category included political-geographic data as well as the altitudes of the water points.


#### Management
Several columns covered this category: installer, scheme management, management, etc.

#### Waterpoint Specifics
These datapoints could be summarized as specific to the actual water point - names, type of extraction, quality of the water, etc.

#### Numerical, Ordinal, Categorical
The columns held a mix of numerical, ordinal, and categorical values, each of which had to be handled differently when implementing our algorithms.
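
As a rough illustration of why the data type matters, the sketch below encodes one numerical, one ordinal, and one categorical value per row. The column names, the quality scale, and the sample rows are assumptions made for this example, not values taken from the project's notebooks.

```python
# Hypothetical ordinal scale; the real dataset's categories may differ.
QUALITY_ORDER = ["poor", "fair", "good"]


def encode_ordinal(value, order):
    """Map an ordered category to its rank (0 = lowest)."""
    return order.index(value)


def encode_one_hot(value, categories):
    """Map an unordered category to a list of 0/1 indicator values."""
    return [1 if value == category else 0 for category in categories]


# Made-up sample rows for illustration only.
rows = [
    {"elevation": 1390.0, "quality": "good", "source": "spring"},
    {"elevation": 0.0, "quality": "poor", "source": "river"},
]
sources = sorted({row["source"] for row in rows})  # ['river', 'spring']

# Numerical values pass through, ordinal values become ranks,
# and categorical values become indicator columns.
encoded = [
    [row["elevation"], encode_ordinal(row["quality"], QUALITY_ORDER)]
    + encode_one_hot(row["source"], sources)
    for row in rows
]
print(encoded)
```

Ranks preserve the ordering of an ordinal column, while indicator columns avoid implying an order that an unordered category does not have.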

### Feature Importance
To make the model understandable, we extracted the features that the model determined to be most strongly associated with the status group, including: quantity of water available, region, payment type, source, elevation, etc.
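
One generic way to rank features is permutation importance: shuffle a single feature and measure how much accuracy drops. The sketch below is not the method used in this project's notebooks; the toy `predict` rule, the column names `quantity` and `payment_type`, and the synthetic rows are all assumptions for illustration.

```python
import random


def predict(row):
    """Toy stand-in model: calls a waterpoint functional unless it is dry."""
    return "non functional" if row["quantity"] == "dry" else "functional"


def accuracy(rows, labels):
    """Fraction of rows whose prediction matches the label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(rows, labels, feature, seed=0):
    """Accuracy lost when one feature's values are shuffled across rows."""
    shuffled = [row[feature] for row in rows]
    random.Random(seed).shuffle(shuffled)
    permuted = [{**row, feature: value} for row, value in zip(rows, shuffled)]
    return accuracy(rows, labels) - accuracy(permuted, labels)


# Synthetic rows: by construction, status depends only on quantity.
rows = [
    {
        "quantity": "dry" if i % 2 else "enough",
        "payment_type": "monthly" if i % 3 else "never pay",
    }
    for i in range(20)
]
labels = [predict(row) for row in rows]

importances = {
    feature: permutation_importance(rows, labels, feature)
    for feature in ("quantity", "payment_type")
}
print(importances)
```

A feature the model ignores scores zero, while a feature the model relies on scores higher because shuffling it breaks the predictions.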

### Data Limitations

Among others, we encountered the following data limitations, which affected how our model trained:

1. Multiple reporters
2. Inconsistent naming/reporting
3. Opaque Values
4. Few examples of "needs repair"
5. Clear mistakes
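
The fourth limitation, few examples of "needs repair", is a class imbalance problem. One common way to compensate (not necessarily what was done in this project) is to weight each class inversely to its frequency. The sketch below uses the formula `n_samples / (n_classes * class_count)`; the label counts are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Made-up label counts, not the project's actual distribution.
labels = (
    ["functional"] * 6
    + ["non functional"] * 3
    + ["functional needs repair"] * 1
)
weights = balanced_class_weights(labels)
print(weights)
```

The rarest class receives the largest weight, so mistakes on it count for more during training.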

We solved some of these issues by:
1. Autocorrection for typographical errors
2. Google Maps data for missing elevation data
3. Filling in other missing data by the most similar attributes
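
Two of the fixes above can be sketched with the standard library. This is a minimal illustration under stated assumptions, not the code from the project's notebooks: the list of known installer names, the similarity cutoff, and the choice of a per-region median as the "most similar attributes" rule are all hypothetical.

```python
import difflib
import statistics

# Assumed canonical spellings; the real list would come from the data.
KNOWN_INSTALLERS = ["government", "danida", "world vision"]


def autocorrect(name, known=KNOWN_INSTALLERS, cutoff=0.8):
    """Return the closest known spelling, or the cleaned input if none is close."""
    cleaned = name.strip().lower()
    matches = difflib.get_close_matches(cleaned, known, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned


def fill_by_group_median(rows, group_key, value_key):
    """Replace missing (None) values with the median of rows sharing group_key.

    Groups with no known values are left as None.
    """
    by_group = {}
    for row in rows:
        if row[value_key] is not None:
            by_group.setdefault(row[group_key], []).append(row[value_key])
    medians = {g: statistics.median(values) for g, values in by_group.items()}
    return [
        {**row, value_key: medians.get(row[group_key])}
        if row[value_key] is None
        else row
        for row in rows
    ]


# Made-up sample rows for illustration only.
rows = [
    {"region": "Iringa", "elevation": 1400},
    {"region": "Iringa", "elevation": 1600},
    {"region": "Iringa", "elevation": None},
    {"region": "Pwani", "elevation": None},
]
filled = fill_by_group_median(rows, "region", "elevation")
print(autocorrect(" Goverment "), filled[2]["elevation"], filled[3]["elevation"])
```

A stricter cutoff reduces the risk of merging two genuinely different names, at the cost of leaving more typos uncorrected.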
                                        

***

## Feature Engineering
Given the categories of data listed above, we selected certain factors within the dataset, and created or found new features that would help to create a more accurate pricing model. The new features were:

1. Defining top school districts
2. Creating a "season sold" factor
3. Reshaping categorical values from continuous variables (e.g., grade/condition)
4. Combining features, like ratio of basement space to living space
5. Creating a user input function
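
Two of the engineered features listed above, the "season sold" factor and the basement-to-living-space ratio, could be derived roughly as follows. The function names and the northern-hemisphere season boundaries are assumptions for this sketch.

```python
from datetime import date

# Index 0 covers December through February, and so on in three-month blocks.
SEASONS = ["winter", "spring", "summer", "fall"]


def season_of(day):
    """Return the meteorological season (northern hemisphere) for a date."""
    return SEASONS[(day.month % 12) // 3]


def basement_ratio(basement_sqft, living_sqft):
    """Ratio of basement space to living space, or 0.0 if there is no living space."""
    if living_sqft == 0:
        return 0.0
    return basement_sqft / living_sqft


print(season_of(date(2014, 5, 2)), basement_ratio(500, 2000))
```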

***

## Model & Prediction

We used Ordinary Least Squares regression to create a model that would help determine the most impactful factors and help us predict the prices more accurately.

### Test Assumptions (Linearity, Multicollinearity)
For this model to work effectively, we evaluated it for linearity between factors and for multicollinearity between pairs of factors, which gave us the following:

![Linearity](Images/linearity.png)


![Multicollinearity](Images/multicollinearity.png)
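
A simple pairwise check for multicollinearity is to compute the Pearson correlation between every pair of features and flag the pairs above a threshold. The sketch below assumes a threshold of 0.8 and uses placeholder feature names and made-up values; it is not the check from the project's notebooks.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.8):
    """List feature pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


# Placeholder features: x2 is exactly twice x1, x3 is only weakly related.
columns = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],
    "x3": [5, 1, 4, 2, 3],
}
flagged = correlated_pairs(columns)
print(flagged)
```

Flagged pairs are candidates for dropping or combining one of the two features before fitting the regression.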



### Iterate and Evaluate Models

After iterating through, adjusting, and evaluating 3 regression models, we adopted the final one, using the highest R-squared value as the measure of performance. With more time, we could have further refined the model to account for what appears to be an exponential pattern in the residuals plot:

![Residuals](Images/QQ_Residuals_Plot.png)
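
The model selection described above can be sketched as computing R-squared, defined as one minus the ratio of the residual sum of squares to the total sum of squares, for each candidate and keeping the highest. The model names and the numbers below are placeholders, not results from this project.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Placeholder targets and predictions from three hypothetical models.
actual = [3.0, 5.0, 7.0, 9.0]
candidates = {
    "model_1": [4.0, 4.0, 8.0, 8.0],
    "model_2": [3.0, 5.5, 6.5, 9.0],
    "model_3": [6.0, 6.0, 6.0, 6.0],  # always predicts the mean
}

scores = {name: r_squared(actual, preds) for name, preds in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

A model that always predicts the mean scores 0, which makes it a useful baseline when comparing candidates.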




### Predict Values with Example Inputs

Given the model we produced, we were able to create functions that would take the user inputs and output a prediction. In this example, the potential client would input the numbers after the colon:


![Example Prediction](Images/example_prediction.png)



## Conclusions

The most predictive factors in home price were:

### Location
Price would increase by 180k USD for homes in a top 5 school district, by 210k for homes outside the city, and by 480k for homes on the waterfront.

![Great Location!](Images/seattle-2426307.jpg)

### Home Size
Basement square footage is worth less than square footage above ground, and too many bedrooms could lower the value of the house.

![Home Size](Images/sq_ft_living.png)

### Quality
Homes with high grades (good architecture and build quality) are worth significantly more than homes with lower grades, and homes in very good condition are worth noticeably more than homes in average condition.

![Grade 6 vs Grade 12 Home](Images/grade6vs12.png)

### When to Sell
Homes sold in spring/summer sold for significantly more than those sold in fall/winter - presumably from that gloomy Seattle rain :-P

![Seasons Sold Comparison Price](Images/mean_price_season_sold.png)

## Further Steps

### Internal to Data Science Team
With less of a time constraint, we would:

1. Fix normality issues in model.
2. Fix heteroskedasticity issues in model.
3. Fix multicollinearity issues in model.
4. Test effectiveness of model using test data.
5. Explore industry and create new features.
6. Deploy consumer product.

### For Real Estate Team
Given additional resources, we would recommend the following to the firm:

1. Collect additional relevant data around other factors, using expertise
2. Open up to home prices above 1 million
3. Provide more time-relevant data


## ASANTE!
Or, Thank you!

![Flag of Tanzania](Images/tanzania_flag.jpg)

## Project Info

Contributors: __[Alexander](https://www.linkedin.com/in/anewt/)__ __[Newton](https://github.com/anewt225)__, __[James](https://www.linkedin.com/in/james-shaw-848984104//)__ __[Shaw](https://github.com/godelayheehoo)__

Languages  : Python

Tools/IDE  : Git, Command Line (Windows), Anaconda, Jupyter Notebook, Google Slides

Libraries  : numpy, pandas, matplotlib, seaborn, scikit-learn, missingno, geopandas, descartes, shapely

Duration   : July 2020
Last Update: 07.31.2020


