In [None]:
## OVERVIEW

### Business Understanding

Water Problems in Tanzania, like many countries in Sub-Saharan Africa, faces significant challenges related to water supply and quality. The country struggles with uneven distribution and inadequate infrastructure. Many rural areas, in particular, experience severe water scarcity, which impacts health, agriculture, and economic development. The main issues include limited access to safe drinking water, inadequate sanitation, and frequent droughts, which exacerbate water shortages and affect food security. According to reports, only about 57% of the population has access to basic water services, and the situation is even more dire in rural regions where women and children often have to travel long distances to fetch water from unprotected sources, which are often contaminated and unsafe.

Water Well Projects in Tanzania
To address these critical issues, numerous water well projects have been initiated by the Tanzanian government, non-governmental organizations (NGOs), and international aid agencies. These projects aim to provide sustainable access to clean and safe water, primarily in underserved rural areas. The construction of water wells, including boreholes and shallow wells, is a common solution. These wells tap into underground aquifers to provide a reliable source of clean water. Many projects also incorporate the installation of hand pumps and mechanized pumps to improve water retrieval efficiency.

The challenges faced after implementing water well projects in Tanzania revolve around maintenance and sustainability, water quality, community involvement, financial constraints, environmental factors, and coordination with broader water management efforts. These projects encounter issues like the lack of expertise for maintenance, water contamination risks, insufficient community engagement, funding uncertainties, environmental impacts on water availability, and the need for integrated water resource management strategies. Overcoming these challenges demands a comprehensive approach combining technical solutions, community empowerment, sustainable financing, and holistic water management to ensure the long-term success of water well projects in Tanzania.

## Business Problem

We are tasked with developing a prediction tool that will predict water pumps in need of repair. We used the data provided in an iterative modeling process to create a final classification model or tool that can be used by development organizations to predict whether or not a pump is in need of repair before the pump fails. This tool will enable development organizations to appropriately allocate resources in dispatching repair teams.

Being a data science consulting company we have been hired by the Tanzanian Ministry of Water to create a prediction model to help classify whether water pumps are in need of repair (functional, functional but in need of repairs or non-functional). We have been hired to help reduce the possibility of the above mentioned challenges from occuring as it leads to waste of the Ministry's resources and only send out repair teams to pumps that only need of repairs or are non-functional.

 Main object is to maximize accuracy and  recall to ensure the people of Tanzania have access to potable water and few pumps that are non-functional or in need of repairs are over looked.

In [7]:
# import libraries

# Data Manipulation
import pandas as pd

# Data Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## Numerical operations
import numpy as np





In [56]:
class DataSourcing:
  
  def __init__(self, filepath_status, filepath_features):
        self.filepath_status = filepath_status
        self.filepath_features = filepath_features
        self.load_data()
        self.merge_data()

  def load_data(self):
        self.status_df = pd.read_csv(self.filepath_status)
        self.features_df = pd.read_csv(self.filepath_features)
        print("Data loaded successfully.")
        return self.status_df, self.features_df

  def merge_data(self, how='right', on=None):
        self.data = pd.merge(self.status_df, self.features_df, how=how, on=on)
        print("Data merged successfully.")
        return self.data
  #def __init__(self,dataframe):
    #self.original = dataframe
    #self.dataframe = dataframe
  
  def give_info(self):
    message =  f"""
    ----------------------------------------------------------------------->
    DESCRIPTION OF THE DATAFRAME IN QUESTION:
    ----------------------------------------------------------------------->
    
    Dataframe information => {self.data.info()}
    
    ------------------------------------------------------------------------------------------------------------------------->
    
    Dataframe shape => {self.data.shape[0]} rows, {self.data.shape[1]} columns
    ------------------------------------------------------------------------------------------------------------------------->    
    
    There are {len(self.data.columns)} columns, namely: {self.data.columns}.  
    ------------------------------------------------------------------------------------------------------------------------->
        
    The first 5 records in the dataframe are seen here:
    ------------------------------------------------------------------------------------------------------------------------->
    {self.data.head()}
    ------------------------------------------------------------------------------------------------------------------------->
       
    The last 5 records in the self.dataframe are as follows: 
    ------------------------------------------------------------------------------------------------------------------------->
    {self.data.tail()}
    ------------------------------------------------------------------------------------------------------------------------->
    
    The descriptive statistics of the dataframe (mean,median, max, min, std) are as follows:
    ------------------------------------------------------------------------------------------------------------------------->
    {self.data.describe()}
    ------------------------------------------------------------------------------------------------------------------------->
    """
    print (message)
  
  def null_count(self):
    return self.data.isnull().sum()
  
  def unique_count(self):
    return self.data.nunique()
  
  def unique_per_column(self):
    print("<----- UNIQUE VALUES IN EACH COLUMN ----->")
    for col in self.data.columns:
      print(f"{col} ({len(self.data[col].unique())} unique)\n {sorted(self.data[col].unique())}")
      print()
    print("<----- END OF UNIQUE VALUES IN EACH COLUMN ----->")
    return

    
  def plot_water_pump_status(self):
        fig, ax = plt.subplots(figsize=(15, 10))

        ax.tick_params(axis="x", labelsize=24)
        ax.tick_params(axis="y", labelsize=18)

        ax.bar(x=self.x, height=self.height, color=self.colors)
        
        plt.title(self.title, fontsize=30)
        plt.ylabel(self.ylabel, fontsize=24)
        plt.ylim(0, self.ylim)


#### Loading the dataset

In [59]:
# Filepaths to the datasets
 
filepath_features = ('Data\water_columns.csv')
filepath_status = ('Data/water_status_group.csv')

filepath_features.head()
filepath_status.head()

In [51]:




data_source = DataSourcing(filepath_status, filepath_features)
  

data_source.give_info()

Data loaded successfully.
Data merged successfully.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 59400 entries, 0 to 59399
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   status_group           59400 non-null  object 
 2   amount_tsh             59400 non-null  float64
 3   date_recorded          59400 non-null  object 
 4   funder                 55765 non-null  object 
 5   gps_height             59400 non-null  int64  
 6   installer              55745 non-null  object 
 7   longitude              59400 non-null  float64
 8   latitude               59400 non-null  float64
 9   wpt_name               59400 non-null  object 
 10  num_private            59400 non-null  int64  
 11  basin                  59400 non-null  object 
 12  subvillage             59029 non-null  object 
 13  region                 59400 non-null  object 
 14  re

In [52]:
data_source.data['status_group'].value_counts()

functional                 32259
non functional             22824
functional needs repair     4317
Name: status_group, dtype: int64

In [53]:
data_source.data['status_group'].value_counts(normalize = True)

functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
Name: status_group, dtype: float64

In [57]:
x = ["Functional", "Functional Needs Repair", "Non Functional"]
height = [0.543081, 0.072677, 0.384242]
colors = ["cadetblue", "lightcoral", "cadetblue"]
title = "Status of Water Pumps"
ylabel = "Percentage"
ylim = 0.65
text_annotations = ['54%', '7%', '39%']
output_path = './images/Well_Status_bar_Chart.png'

plotter = Plotter(x, height, colors, title, ylabel, ylim, text_annotations, output_path)
plotter.plot_water_pump_status()

NameError: name 'Plotter' is not defined