# Water Pump Functionality Prediction 
**By Assimagbe Albert Raphael**

![](https://i.imgur.com/8Ev8H1H.png)



## Project Overview
In this project, I develop machine learning models that predict the functionality status of water pumps across Tanzania from features such as location, water quality, management structure, and technical specifications. The project addresses a critical infrastructure challenge: identifying which pumps are functional, which need repair, and which are non-functional, so that maintenance can be prioritized and communities keep access to clean water.

## Competition Link
[DrivenData - Pump it Up: Data Mining the Water Table](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/)


## Project Goals
1. Develop classification models to predict three target classes:
   - Functional water pumps
   - Water pumps that need repair
   - Non-functional water pumps
2. Analyze which factors most strongly influence water pump functionality
3. Create geospatial visualizations to communicate findings effectively
4. Deploy a simple dashboard to showcase models and insights

## Data Description

### Dataset Source
The data comes from Taarifa and the Tanzanian Ministry of Water, which aggregates information about water points across Tanzania.

### Features Overview
The dataset contains 40+ columns representing:
- Geographic information (coordinates, region, basin)
- Technical specifications (pump type, extraction method)
- Management details (installer, funder, payment type)
- Water characteristics (quality, quantity, source)
- Temporal information (construction year, recording date)

### Target Variable
- **status_group**: The operating condition of the waterpoint with three possible values:
  - `functional` - operational with no repairs needed
  - `functional needs repair` - operational but needs repairs
  - `non functional` - not operational
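
Most classifiers expect numeric targets, so the three label strings are typically mapped to integer codes before training. The following is a minimal sketch, assuming the label strings exactly as listed above; `STATUS_ORDER` and `encode_status` are hypothetical names introduced for illustration, and the worst-to-best ordering is an arbitrary choice.

```python
import pandas as pd

# Label strings as listed above; the ordering is an illustrative assumption.
STATUS_ORDER = ["non functional", "functional needs repair", "functional"]


def encode_status(status: pd.Series) -> pd.Series:
    """Map status_group strings to integer codes 0..2 (hypothetical helper)."""
    categorical = pd.Categorical(status, categories=STATUS_ORDER, ordered=True)
    codes = pd.Series(categorical.codes, index=status.index, name="status_code")
    if (codes < 0).any():
        # pd.Categorical assigns -1 to values outside the known categories.
        raise ValueError("unexpected status_group value")
    return codes


# Tiny made-up sample, not real competition data.
sample = pd.Series(
    ["functional", "non functional", "functional needs repair", "functional"],
    name="status_group",
)
status_codes = encode_status(sample)
print(status_codes.tolist())
```

Failing loudly on an unknown label keeps typos or unexpected values in the target column from silently becoming a fourth class.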

## Project Structure

### 1: Exploratory Data Analysis
- Clean and preprocess the dataset
- Handle missing values and categorical variables
- Create geospatial visualizations of water pump distribution
- Analyze relationships between features and pump functionality
- **Deliverable**: EDA report with visualizations and initial insights
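
The missing-value and categorical handling steps above could be sketched with a scikit-learn `ColumnTransformer`. This is only an illustration on made-up rows: the column names, the median and constant-fill strategies, and the `"unknown"` placeholder are assumptions, not the choices made later in this project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows; column names are illustrative assumptions.
df = pd.DataFrame(
    {
        "gps_height": [1390.0, np.nan, 686.0, 263.0],
        "construction_year": [1999.0, 2010.0, np.nan, 1986.0],
        "basin": pd.Series(
            ["Lake Nyasa", "Lake Victoria", np.nan, "Pangani"], dtype=object
        ),
        "payment_type": pd.Series(
            ["annually", "never pay", "per bucket", "never pay"], dtype=object
        ),
    }
)
numeric_cols = ["gps_height", "construction_year"]
categorical_cols = ["basin", "payment_type"]

# Categorical branch: fill gaps with a placeholder, then one-hot encode.
categorical_steps = Pipeline(
    [
        ("impute", SimpleImputer(strategy="constant", fill_value="unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocess = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="median"), numeric_cols),
        ("cat", categorical_steps, categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense NumPy array
)

features = preprocess.fit_transform(df)
print(features.shape)
```

Keeping imputation inside the transformer means the same fill values learned on the training data are reused when the test set is transformed.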

### 2: Model Development
- Feature engineering and selection
- Train baseline classification models
- Implement cross-validation and hyperparameter tuning
- Evaluate models using appropriate classification metrics
- **Deliverable**: Trained models with documented performance metrics
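
As a rough sketch of the baseline-plus-cross-validation step, the snippet below scores a scaled logistic regression with stratified 5-fold cross-validation. The data is synthetic (`make_classification`), and the class weights, fold count, and model choice are assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced three-class data standing in for the prepared features.
X, y = make_classification(
    n_samples=600,
    n_features=10,
    n_informative=5,
    n_classes=3,
    weights=[0.55, 0.07, 0.38],  # arbitrary imbalance, for illustration only
    random_state=42,
)

# Scaling lives inside the pipeline so it is refit on each training fold.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratification keeps the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(baseline, X, y, cv=cv, scoring="accuracy")

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Any estimator with the scikit-learn interface can replace the logistic regression here, which makes this pattern a convenient yardstick for later models.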

### 3: Insights & Deployment
- Analyze feature importance and model explanations
- Create a Streamlit dashboard with interactive maps
- Document findings and recommendations for maintenance strategies
- **Deliverable**: Interactive dashboard and final presentation
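
One way to approach the feature-importance analysis is permutation importance, which measures how much a held-out score drops when a single feature is shuffled. The sketch below uses synthetic data in which only one feature carries signal; the feature names and the random forest settings are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 600

# Synthetic features: only "signal" determines the label (names are hypothetical).
feature_names = ["signal", "noise_a", "noise_b"]
X = rng.normal(size=(n_samples, 3))
y = np.digitize(X[:, 0], bins=[-0.5, 0.5])  # three classes: 0, 1, 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each column of the held-out set and record the drop in accuracy.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Because the importances are computed on held-out data, they reflect what the model relies on for generalization rather than what it memorized during training.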

### Required Libraries
```
pandas
numpy
scipy
statsmodels
scikit-learn
scikit-multilearn
xgboost
lightgbm
catboost
hyperopt
matplotlib
seaborn
geopandas
folium
streamlit
plotly
```

## Technical Requirements

### Required Technologies
- Python (pandas, scikit-learn, matplotlib/seaborn)
- Geospatial libraries (GeoPandas, Folium)
- Jupyter Notebooks for exploration and documentation
- GitHub for version control
- Streamlit for dashboard creation

### Performance Metric
- Classification accuracy (primary metric)
- Additional metrics: F1-score, precision, recall, and confusion matrix
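
The metrics listed above can be computed with `sklearn.metrics`. The example below uses a handful of made-up labels (not real predictions) to show how accuracy can look acceptable while the minority class is never predicted, which is where macro-averaged F1 and the confusion matrix help.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

labels = ["functional", "functional needs repair", "non functional"]

# Made-up ground truth and predictions, for illustration only.
y_true = [
    "functional", "functional", "functional",
    "non functional", "non functional", "functional needs repair",
]
y_pred = [
    "functional", "functional", "non functional",
    "non functional", "non functional", "functional",
]

acc = accuracy_score(y_true, y_pred)
# Macro averaging weights each class equally, regardless of its frequency.
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(f"accuracy: {acc:.3f}")
print(f"macro F1: {macro_f1:.3f}")
print(cm)
```

Here four of six predictions are correct, yet the single `functional needs repair` pump is missed, so its per-class F1 is zero and the macro F1 falls well below the accuracy.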

## Resources

### Dataset Files
- `training_set.csv`: Training data with labels
- `test_set.csv`: Test data for predictions
- `SubmissionFormat.csv`: Template for competition submissions

### Technical Resources
- [Scikit-learn Classification Documentation](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)
- [GeoPandas Documentation](https://geopandas.org/en/stable/)
- [Folium Documentation](https://python-visualization.github.io/folium/)
- [Streamlit Documentation](https://docs.streamlit.io/)

### Domain Resources
- [Taarifa Platform](https://taarifa.org/)
- [Tanzania Water Point Mapping Project](https://www.maji.go.tz/)

## Acknowledgments
- Data provided by Taarifa and the Tanzania Ministry of Water
- Project taken from the DrivenData competition

In [1]:
# Required imports

import warnings
warnings.filterwarnings('ignore')

# Core libraries
import numpy as np
import pandas as pd
import time
import pickle

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl

# Typing
from typing import Dict, List, Optional, Set

# Scikit-learn
from sklearn import svm, model_selection, tree, linear_model, neighbors, naive_bayes, ensemble
from sklearn import discriminant_analysis, gaussian_process
from sklearn.calibration import calibration_curve
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              AdaBoostClassifier)
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import (roc_auc_score, f1_score, accuracy_score, precision_score,
                             recall_score, classification_report, roc_curve,
                             confusion_matrix)
from sklearn.model_selection import (train_test_split, StratifiedKFold,
                                     cross_val_score, cross_validate)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (LabelEncoder, OrdinalEncoder, StandardScaler,
                                   MinMaxScaler, OneHotEncoder, PolynomialFeatures)
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Notebook display helper
from IPython.display import display

# Advanced models
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import Pool, CatBoostClassifier, CatBoostRegressor


# Multi-label learning
from skmultilearn.problem_transform import BinaryRelevance

# Hyperparameter optimization
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials

# Statistical libraries
from scipy import stats
import statsmodels.api as sm