# Introduction

## Final Project Submission

***
- Student Name: Adam Marianacci
- Student Pace: Flex
- Scheduled project review date/time: TBD
- Instructor Name: Mark Barbour

# Business Understanding

It is my job to help the WWFA (Water Wells For Africa) organization identify wells in Tanzania that are in need of repair.

# Data Understanding

The data used in this analysis comes from the Taarifa waterpoints dashboard, which aggregates data from the Tanzania Ministry of Water. The final dataframe used in this analysis contained over 38,000 entries. The dataset includes information about water wells in Tanzania such as functional status, water quality, age, source, and altitude, to name a few. One limitation is that the dataset is fairly small for predictive modeling. Another is that many of its features turned out to have little importance for predicting which wells need repair. Even so, the dataset was suitable for the project because it revealed some notable features of wells, and it gave me insight into where repairs are needed to help the WWFA promote access to potable water across Tanzania.

# Data Preparation

In [1]:
# Importing the necessary libraries
import pandas as pd
from datetime import datetime
import numpy as np
import seaborn as sns
import folium
import statsmodels.api as sm
import sklearn
import sklearn.preprocessing as preprocessing
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Set display options to show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [3]:
# Importing the dataframes
df_x = pd.read_csv('data/training_set_values.csv')
df_y = pd.read_csv('data/training_set_labels.csv')

In [4]:
# Combining the 2 dataframes into 1 new dataframe
Waterwells_df = pd.concat([df_y, df_x], axis=1)
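
Concatenating side by side relies on both files listing the wells in the same row order, and it keeps both `id` columns. A minimal sketch of an alternative, merging on the shared `id` key, is shown here; the toy frames and their values are hypothetical stand-ins for the two CSV files.

```python
import pandas as pd

# Hypothetical stand-ins for the label and feature files
labels = pd.DataFrame({
    "id": [3, 1, 2],
    "status_group": ["functional", "non functional", "functional"],
})
values = pd.DataFrame({
    "id": [1, 2, 3],
    "gps_height": [1390, 686, 263],
})

# Merge on the shared key so rows are matched by id, not by position.
# validate="one_to_one" raises if an id is duplicated on either side.
merged = labels.merge(values, on="id", how="inner", validate="one_to_one")
print(merged)
```

With a merge, the result has a single `id` column and the row order of the two files no longer matters.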

In [5]:
# Previewing the dataframe
Waterwells_df.head()

Unnamed: 0,id,status_group,id.1,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,functional,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,functional,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,functional,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,non functional,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,functional,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [7]:
Waterwells_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 42 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   status_group           59400 non-null  object 
 2   id                     59400 non-null  int64  
 3   amount_tsh             59400 non-null  float64
 4   date_recorded          59400 non-null  object 
 5   funder                 55765 non-null  object 
 6   gps_height             59400 non-null  int64  
 7   installer              55745 non-null  object 
 8   longitude              59400 non-null  float64
 9   latitude               59400 non-null  float64
 10  wpt_name               59400 non-null  object 
 11  num_private            59400 non-null  int64  
 12  basin                  59400 non-null  object 
 13  subvillage             59029 non-null  object 
 14  region                 59400 non-null  object 
 15  re

Dropping columns that are not directly related to the business problem, along with columns that have high cardinality and are therefore difficult to one-hot encode.

In [8]:
# Dropping irrelevant columns from the dataframe, also columns with large amounts of missing data
columns_to_drop = [
    'id', 'scheme_management', 'region', 'region_code',
    'payment', 'public_meeting', 'district_code', 'population','amount_tsh',
    'num_private', 'basin', 'latitude', 'longitude',
    'waterpoint_type_group', 'source_class', 'payment_type', 'management_group', 'recorded_by', 
    'extraction_type', 'management', 
    'source_type', 'extraction_type_group', 'permit', 'funder',
    'date_recorded', 'installer', 'ward', 'scheme_name', 'wpt_name', 'lga', 'subvillage'
]

Waterwells_df = Waterwells_df.drop(columns_to_drop, axis=1, errors='ignore')


In [9]:
# Create a new column 'needs_repair' by merging the two categories
Waterwells_df['needs_repair'] = Waterwells_df['status_group'].replace(
    {'functional': 0, 'non functional': 1, 
     'functional needs repair': 1})

# Drop the original 'status_group' column
Waterwells_df.drop('status_group', axis=1, inplace=True)

#Display the updated DataFrame
Waterwells_df.head()



| | gps_height | construction_year | extraction_type_class | water_quality | quality_group | quantity | quantity_group | source | waterpoint_type | needs_repair |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1390 | 1999 | gravity | soft | good | enough | enough | spring | communal standpipe | 0 |
| 1 | 1399 | 2010 | gravity | soft | good | insufficient | insufficient | rainwater harvesting | communal standpipe | 0 |
| 2 | 686 | 2009 | gravity | soft | good | enough | enough | dam | communal standpipe multiple | 0 |
| 3 | 263 | 1986 | submersible | soft | good | dry | dry | machine dbh | communal standpipe multiple | 1 |
| 4 | 0 | 0 | gravity | soft | good | seasonal | seasonal | rainwater harvesting | communal standpipe | 0 |


In [10]:
# Dropping rows with a missing 'construction_year' (recorded as 0) and creating a new df.
# .copy() makes this an independent dataframe, so adding a column is safe.
Construction_Year_df = Waterwells_df[Waterwells_df['construction_year'] != 0].copy()

# Calculate the current year
current_year = datetime.now().year

# Create a new column 'age' by subtracting construction year from the current year
Construction_Year_df['age'] = current_year - Construction_Year_df['construction_year']
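
Because the age is measured from today's date, the `age` values shift every calendar year the notebook is rerun. A minimal sketch of a reproducible variant follows; `REFERENCE_YEAR` and the toy frame are assumptions made for illustration, not values taken from the analysis.

```python
import pandas as pd

# Assumption: a fixed reference year chosen for illustration only
REFERENCE_YEAR = 2013

# Hypothetical rows; 0 marks a missing construction year, as in the dataset
wells = pd.DataFrame({"construction_year": [1999, 2010, 0, 1986]})

# Keep rows with a known construction year; .copy() avoids SettingWithCopyWarning
known = wells[wells["construction_year"] != 0].copy()

# Age relative to the fixed reference year, so reruns give identical values
known["age"] = REFERENCE_YEAR - known["construction_year"]
print(known)
```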

In [11]:
Construction_Year_df['construction_year'].value_counts()

2010    2645
2008    2613
2009    2533
2000    2091
2007    1587
2006    1471
2003    1286
2011    1256
2004    1123
2012    1084
2002    1075
1978    1037
1995    1014
2005    1011
1999     979
1998     966
1990     954
1985     945
1996     811
1980     811
1984     779
1982     744
1994     738
1972     708
1974     676
1997     644
1992     640
1993     608
2001     540
1988     521
1983     488
1975     437
1986     434
1976     414
1970     411
1991     324
1989     316
1987     302
1981     238
1977     202
1979     192
1973     184
2013     176
1971     145
1960     102
1967      88
1963      85
1968      77
1969      59
1964      40
1962      30
1961      21
1965      19
1966      17
Name: construction_year, dtype: int64

In [12]:
# deleting the 'construction_year' column since we replaced it with an 'age' column
Construction_Year_df = Construction_Year_df.drop('construction_year', axis=1)

In [13]:
Construction_Year_df.head()

| | gps_height | extraction_type_class | water_quality | quality_group | quantity | quantity_group | source | waterpoint_type | needs_repair | age |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1390 | gravity | soft | good | enough | enough | spring | communal standpipe | 0 | 25 |
| 1 | 1399 | gravity | soft | good | insufficient | insufficient | rainwater harvesting | communal standpipe | 0 | 14 |
| 2 | 686 | gravity | soft | good | enough | enough | dam | communal standpipe multiple | 0 | 15 |
| 3 | 263 | submersible | soft | good | dry | dry | machine dbh | communal standpipe multiple | 1 | 38 |
| 5 | 0 | submersible | salty | salty | enough | enough | other | communal standpipe multiple | 0 | 15 |


In [14]:
Construction_Year_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38691 entries, 0 to 59399
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   gps_height             38691 non-null  int64 
 1   extraction_type_class  38691 non-null  object
 2   water_quality          38691 non-null  object
 3   quality_group          38691 non-null  object
 4   quantity               38691 non-null  object
 5   quantity_group         38691 non-null  object
 6   source                 38691 non-null  object
 7   waterpoint_type        38691 non-null  object
 8   needs_repair           38691 non-null  int64 
 9   age                    38691 non-null  int64 
dtypes: int64(3), object(7)
memory usage: 3.2+ MB


In [None]:
Construction_Year_df['needs_repair'].value_counts()

In [None]:
# Defining X and y variables
y = Construction_Year_df["needs_repair"]
X = Construction_Year_df.drop("needs_repair", axis=1)

In [None]:
# Performing a train, test, split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

In [None]:
# Looking at the number of missing values in each column
X_train.isna().sum()

In [None]:
# Create a list of all the categorical features
cols_to_transform = ['quantity_group', 'waterpoint_type','extraction_type_class',
                     'quality_group', 'source',
                     'water_quality', 'quantity']
# Create a dataframe with the new dummy columns created from the cols_to_transform list
X_train = pd.get_dummies(
    data=X_train, columns=cols_to_transform, drop_first=True, dtype=int)

In [None]:
X_train.info()

In [None]:
X_train.head()

In [None]:
# Defining the columns to scale
column_to_scale = ['gps_height']

# Initialize the scaler
scaler = MinMaxScaler()

# Fit the scaler on the specified columns and transform the data
X_train[column_to_scale] = scaler.fit_transform(X_train[column_to_scale])

In [None]:
# Inspecting the data to make sure it was scaled
X_train.head()

In [None]:
# Filtering the data based on 'needs_repair'
needs_repair_histogram = Construction_Year_df[Construction_Year_df['needs_repair'] == 1]['gps_height']

#plotting a histogram
plt.hist(needs_repair_histogram, bins=75, color='blue', alpha=0.5)
plt.xlabel('GPS Height')
plt.ylabel('Frequency')
plt.title('Histogram of GPS Height for Needs Repair')
plt.show()

In [None]:
# Create a histogram
plt.figure(figsize=(10, 6))
plt.hist(Construction_Year_df['gps_height'], bins=75, color='skyblue', edgecolor='black')

# Customize the plot
plt.title('Histogram of Wells at Each Altitude')
plt.xlabel('Altitude (gps_height)')
plt.ylabel('Number of Wells')
plt.grid(axis='y', alpha=0.75)

# Show the plot
plt.show()

In [None]:
# Create a histogram for 'gps_height' for all wells
all_histogram, bin_edges_all = np.histogram(Construction_Year_df['gps_height'], bins=75)

# Create a histogram for 'gps_height' for wells that need repair,
# reusing the bin edges from the all-wells histogram so the counts line up bin by bin
needs_repair_histogram, bin_edges_needs_repair = np.histogram(
    Construction_Year_df[Construction_Year_df['needs_repair'] == 1]['gps_height'], bins=bin_edges_all)

# Calculate the ratios
ratios = needs_repair_histogram / all_histogram.astype(float)

# Calculate the bin centers
bin_centers = (bin_edges_all[:-1] + bin_edges_all[1:]) / 2

# Plot the ratios
plt.figure(figsize=(10, 6))
plt.plot(bin_centers, ratios, color='orange', marker='o')

# Customize the plot
plt.title('Ratios of Wells that Need Repair to All Wells at Each Altitude')
plt.xlabel('Altitude (gps_height)')
plt.ylabel('Ratio')
plt.grid(axis='y', alpha=0.75)

# Show the plot
plt.show()
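
The ratio per altitude bin can also be computed without building two histograms. The following is a minimal sketch on randomly generated toy data (not the wells dataset): since `needs_repair` is 0 or 1, its mean within a bin is the share of wells in that bin that need repair.

```python
import numpy as np
import pandas as pd

# Toy data generated for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "gps_height": rng.integers(0, 2000, size=500),
    "needs_repair": rng.integers(0, 2, size=500),
})

# Cut altitude into one shared set of bins
toy["height_bin"] = pd.cut(toy["gps_height"], bins=10)

# The mean of a 0/1 label per bin is the repair rate for that bin
rate_by_bin = toy.groupby("height_bin", observed=True)["needs_repair"].mean()
print(rate_by_bin)
```

Every row is assigned to exactly one bin here, so the numerator and denominator of each ratio always refer to the same altitude range.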

In [None]:
# Filter data for wells that need repair
needs_repair_age = Construction_Year_df[Construction_Year_df['needs_repair'] == 1]['age']

# Create histograms for age of wells
plt.figure(figsize=(10, 6))
plt.hist(needs_repair_age, bins=30, alpha=0.5, color='red', label='Needs Repair')

# Customize the plot
plt.title('Histogram of Well Age for Wells that Need Repair')
plt.xlabel('Age')
plt.ylabel('Number of Wells')
plt.legend()
plt.grid(axis='y', alpha=0.75)

# Show the plot
plt.show()

In [None]:
# Create a histogram
plt.figure(figsize=(10, 6))
plt.hist(Construction_Year_df['age'], bins=75, color='skyblue', edgecolor='black')

# Customize the plot
plt.title('Histogram of Wells at Each Age')
plt.xlabel('Age (age)')
plt.ylabel('Number of Wells')
plt.grid(axis='y', alpha=0.75)

# Show the plot
plt.show()

In [None]:
# Create a histogram for 'age' for all wells
all_histogram_age, bin_edges_all = np.histogram(Construction_Year_df['age'], bins=75)

# Create a histogram for 'age' for wells that need repair,
# reusing the bin edges from the all-wells histogram so the counts line up bin by bin
needs_repair_histo, bin_edges_needs_repair = np.histogram(
    Construction_Year_df[Construction_Year_df['needs_repair'] == 1]['age'], bins=bin_edges_all)

# Calculate the ratios
ratios = needs_repair_histo / all_histogram_age.astype(float)

# Calculate the bin centers
bin_centers = (bin_edges_all[:-1] + bin_edges_all[1:]) / 2

# Plot the ratios
plt.figure(figsize=(10, 6))
plt.plot(bin_centers, ratios, color='orange', marker='o')

# Customize the plot
plt.title('Ratios of Wells that Need Repair to All Wells at Each Age')
plt.xlabel('Age (age)')
plt.ylabel('Ratio')
plt.grid(axis='y', alpha=0.75)

# Show the plot
plt.show()

# Modeling

In [None]:
# Building a logistic regression model
logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')
model_log = logreg.fit(X_train, y_train)
model_log

The classifier was about 74% accurate on the training data.

In [None]:
# Checking the performance on the training data
y_hat_train = logreg.predict(X_train)

train_residuals = np.abs(y_train - y_hat_train)
print(pd.Series(train_residuals, name="Residuals (counts)").value_counts())
print()
print(pd.Series(train_residuals, name="Residuals (proportions)").value_counts(normalize=True))

In [None]:
# Looking at the number of missing values in each column
X_test.isna().sum()

In [None]:
# Create a list of all the categorical features
cols_to_transform = ['quantity_group', 'waterpoint_type','extraction_type_class',
                     'quality_group', 'source',
                     'water_quality', 'quantity']
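# Create a dataframe with the new dummy columns created from the cols_to_transform list,
# then align it to the training columns so the model sees the same features in the same order
X_test = pd.get_dummies(
    data=X_test, columns=cols_to_transform, drop_first=True, dtype=int)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)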

In [None]:
# Apply the scaler that was fitted on the training data (do not refit on the test set)
X_test[column_to_scale] = scaler.transform(X_test[column_to_scale])
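
Encoding and scaling the training and test sets in separate steps makes it easy for the two to drift apart. A minimal sketch of one way to keep them consistent is to fit a single `ColumnTransformer` on the training data and reuse it on the test data. The toy frames are hypothetical, and `'lake'` is a made-up category that appears only in the test frame.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical training and test rows
train = pd.DataFrame({"gps_height": [0, 500, 1000],
                      "source": ["spring", "dam", "spring"]})
test = pd.DataFrame({"gps_height": [250, 2000],
                     "source": ["dam", "lake"]})  # 'lake' is unseen in train

preprocess = ColumnTransformer(
    [
        ("scale", MinMaxScaler(), ["gps_height"]),
        # handle_unknown="ignore" encodes unseen categories as all zeros
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["source"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Learn the scaling range and the categories from the training data only
X_train_t = preprocess.fit_transform(train)
# Reuse the fitted transformer on the test data
X_test_t = preprocess.transform(test)
print(X_train_t.shape, X_test_t.shape)
```

Both arrays end up with the same columns, and the test values are scaled with the range learned from the training set.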

In [None]:
logreg.score(X_test, y_test)

We are still about 74% accurate on our test data.
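
An accuracy figure is easier to judge next to a naive baseline. The sketch below uses made-up labels (not the wells data) to show how scikit-learn's `DummyClassifier` gives the accuracy of always predicting the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 55 wells that work, 45 that need repair
y_toy = np.array([0] * 55 + [1] * 45)
X_toy = np.zeros((100, 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```

A real model's accuracy is only informative to the extent that it exceeds this baseline.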

In [None]:
y_hat_test = logreg.predict(X_test)

test_residuals = np.abs(y_test - y_hat_test)
print(pd.Series(test_residuals, name="Residuals (counts)").value_counts())
print()
print(pd.Series(test_residuals, name="Residuals (proportions)").value_counts(normalize=True))

In [None]:
cvscore = cross_val_score(logreg, X_train, y_train.values, cv=10)

In [None]:
cvscore

In [None]:
np.average(cvscore)

In [None]:
np.std(cvscore)
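
By default `cross_val_score` reports accuracy for a classifier. A minimal sketch of cross-validating on recall instead is shown below, with the scaler placed inside a pipeline so it is refit on each training fold. The data comes from `make_classification` and is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary classification data for illustration
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=42)

# Scaling inside the pipeline is fitted per fold, which avoids leakage
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(solver="liblinear"))

# scoring="recall" measures recall of the positive class on each fold
recall_scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="recall")
print(recall_scores.mean(), recall_scores.std())
```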

Building a single decision tree

In [None]:
# Create the classifier, fit it on the training data and make predictions on the test set
clf = DecisionTreeClassifier(criterion='entropy')

clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
clf.feature_importances_

In [None]:
print("clf.feature_importances_:", clf.feature_importances_)
print("X.columns:", X_train.columns)

In [None]:
features = pd.DataFrame(clf.feature_importances_, index=X_train.columns, columns=['Importance'])
print(features)

Building a Random Forest Model

In [None]:
rf = RandomForestClassifier()

In [None]:
rf.fit(X_train, y_train)

In [None]:
y_pred = rf.predict(X_test)

In [None]:
rf.score(X_test, y_test)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
features = pd.DataFrame(rf.feature_importances_, index = X_train.columns)
print(features)

In [None]:
# Sorting the features by most influential to least
features_sorted = features.sort_values(by=0, ascending=False)
print(features_sorted)

Building a second Random Forest model with custom hyperparameters

In [None]:
rf2 = RandomForestClassifier(n_estimators = 1000,
                            criterion = 'entropy',
                            min_samples_split = 10,
                            max_depth = 15,
                            random_state = 42
)

In [None]:
rf2.fit(X_train, y_train)

In [None]:
rf2.score(X_test, y_test)

In [None]:
y_pred2 = rf2.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred2))

In [None]:
features = pd.DataFrame(rf2.feature_importances_, index = X_train.columns)
print(features)

In [None]:
# Sorting the features by most influential to least
features_sorted = features.sort_values(by=0, ascending=False)
print(features_sorted)
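
Feature importances are easier to read as a named, sorted Series than as a dataframe whose only column is labeled `0`. This is a minimal sketch on synthetic data; the `feature_i` column names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder column names
X_arr, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# A named Series keeps each importance attached to its feature name
importances = pd.Series(
    forest.feature_importances_, index=X_toy.columns, name="importance"
).sort_values(ascending=False)
print(importances)
```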

In [None]:
print(confusion_matrix(y_test, y_pred))

The confusion matrix shows 2,388 true positives and 3,440 true negatives, along with 897 false positives and 1,014 false negatives.
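
The usual metrics follow directly from these four counts. The short calculation below uses only the numbers reported above.

```python
# Counts reported for the confusion matrix above
tp, tn, fp, fn = 2388, 3440, 897, 1014

recall = tp / (tp + fn)        # share of wells needing repair that were caught
specificity = tn / (tn + fp)   # share of working wells correctly left alone
precision = tp / (tp + fp)     # share of flagged wells that truly need repair
accuracy = (tp + tn) / (tp + tn + fp + fn)
macro_recall = (recall + specificity) / 2  # both classes weighted equally

print(f"recall={recall:.3f}, precision={precision:.3f}, "
      f"accuracy={accuracy:.3f}, macro recall={macro_recall:.3f}")
```

For these counts, recall on the needs-repair class is roughly 0.70, precision roughly 0.73, and accuracy roughly 0.75.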

In [None]:
# Generate a confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Set up a figure and axis
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.2)  # Adjust font size for better readability

# Create a heatmap of the confusion matrix
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues', cbar=False,
            annot_kws={"size": 14}, square=True,
            xticklabels=['No Repair Needed', 'Needs Repair'],
            yticklabels=['No Repair Needed', 'Needs Repair'])

plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

# Evaluation

My best-performing model was rf2, the Random Forest model with custom hyperparameters. It achieved a macro-average recall of 76% (the macro average weights every class equally). Although this isn't great, it does help identify wells that are in need of repair. I focused on recall because it measures how many of the actual positive cases the model correctly identified. For this business problem, false negatives are the greater concern: labeling a well as not needing repair when it actually does will leave people without access to water.
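
Since false negatives matter most here, one option worth exploring is lowering the probability threshold at which a well is flagged. The sketch below uses synthetic data, and the 0.3 threshold is an arbitrary illustration rather than a tuned value. It shows that a lower threshold can only keep or raise recall, at the cost of flagging more wells overall.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data for illustration
X_toy, y_toy = make_classification(n_samples=600, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Predicted probability of the positive class ("needs repair")
proba = forest.predict_proba(X_te)[:, 1]

# Flag a well when its probability reaches the threshold
recall_default = recall_score(y_te, (proba >= 0.5).astype(int))
recall_lowered = recall_score(y_te, (proba >= 0.3).astype(int))  # arbitrary lower cut
print(recall_default, recall_lowered)
```

Whether the extra false positives are acceptable depends on how costly an unnecessary inspection is compared with a missed repair.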

# Conclusion

# Recommendations

# Limitations

# Next Steps

In [None]:
# Look for better data on the population around each well