<h1>Dataset Explanation</h1>
This Jupyter notebook gives a clear overview of the dataset we are working with:<br>
water pump data from Tanzania

Source: https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/

In [8]:
import pandas as pd

df = pd.read_csv("Training_Set_Values.csv")
df.shape

(59400, 40)

In [9]:
df.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


Here is the official description of each feature:
* amount_tsh - Total static head (amount water available to waterpoint)
* date_recorded - The date the row was entered
* funder - Who funded the well
* gps_height - Altitude of the well
* installer - Organization that installed the well
* longitude - GPS coordinate
* latitude - GPS coordinate
* wpt_name - Name of the waterpoint if there is one
* num_private -
* basin - Geographic water basin
* subvillage - Geographic location
* region - Geographic location
* region_code - Geographic location (coded)
* district_code - Geographic location (coded)
* lga - Geographic location
* ward - Geographic location
* population - Population around the well
* public_meeting - True/False
* recorded_by - Group entering this row of data
* scheme_management - Who operates the waterpoint
* scheme_name - Who operates the waterpoint
* permit - If the waterpoint is permitted
* construction_year - Year the waterpoint was constructed
* extraction_type - The kind of extraction the waterpoint uses
* extraction_type_group - The kind of extraction the waterpoint uses
* extraction_type_class - The kind of extraction the waterpoint uses
* management - How the waterpoint is managed
* management_group - How the waterpoint is managed
* payment - What the water costs
* payment_type - What the water costs
* water_quality - The quality of the water
* quality_group - The quality of the water
* quantity - The quantity of water
* quantity_group - The quantity of water
* source - The source of the water
* source_type - The source of the water
* source_class - The source of the water
* waterpoint_type - The kind of waterpoint
* waterpoint_type_group - The kind of waterpoint

<h2>Missing Values</h2>

The main issue with the dataset is that many values are missing or unavailable.<br>
Take the scheme_management column as an example:

In [10]:
df["scheme_management"].isnull().sum()

3877

As can be seen, the scheme_management column has 3877 missing values.<br>
This is not the only column with missing data (e.g. many rows have a population value of 0).<br>
<br> We therefore need to decide how to handle these values:<br>
* Imputing
* Removing rows with missing values
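
The two options can be sketched on a small toy frame. This is only a minimal sketch, not the strategy this notebook settles on: the toy values are made up, and both the decision to treat a population of 0 as missing and the median/most-frequent fill rules are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values, not read from the CSV).
toy = pd.DataFrame({
    "scheme_management": ["VWC", None, "WUG", "VWC", None],
    "population": [109, 0, 250, 0, 58],
})

# Assumption: a population of 0 means "not recorded", so turn it into NaN.
toy["population"] = toy["population"].mask(toy["population"] == 0)

# Count the missing values in every column.
missing_per_column = toy.isnull().sum()

# Option 1: impute. Median for the numeric column,
# most frequent value for the categorical column.
imputed = toy.copy()
imputed["population"] = imputed["population"].fillna(imputed["population"].median())
imputed["scheme_management"] = imputed["scheme_management"].fillna(
    imputed["scheme_management"].mode()[0]
)

# Option 2: drop every row that has at least one missing value.
dropped = toy.dropna()

print(missing_per_column)
print(len(imputed), "rows after imputing,", len(dropped), "rows after dropping")
```

Imputing keeps every row but introduces guessed values, while dropping keeps only observed values at the cost of losing rows. Which trade-off is acceptable depends on how many rows are affected in each column.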


<h2>Similar/Duplicate Values</h2>

Another issue is that some columns hold very similar values. For example, extraction_type and extraction_type_group agree on most rows:

In [11]:
(df["extraction_type"]==df["extraction_type_group"]).value_counts()

True     56931
False     2469
dtype: int64
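
The comparison above only tells us how many rows differ, not how they differ. The sketch below, run on a toy frame with illustrative values (an assumption, the real columns hold more categories), lists the mismatching value pairs and checks whether the group column is simply a coarser version of the detailed one.

```python
import pandas as pd

# Toy stand-in (illustrative values, not read from the CSV).
toy = pd.DataFrame({
    "extraction_type": ["gravity", "ksb", "gravity", "cemo", "ksb"],
    "extraction_type_group": [
        "gravity", "submersible", "gravity", "other motorpump", "submersible",
    ],
})

# Fraction of rows where both columns hold the same value.
same = toy["extraction_type"] == toy["extraction_type_group"]
agreement = same.mean()

# For the rows that differ, count how often each (detailed, group) pair occurs.
mismatch_pairs = (
    toy.loc[~same]
    .groupby(["extraction_type", "extraction_type_group"])
    .size()
)

# If every detailed value maps to exactly one group, the group column
# carries no information beyond the detailed column.
groups_per_value = toy.groupby("extraction_type")["extraction_type_group"].nunique()
is_pure_coarsening = bool((groups_per_value == 1).all())

print(agreement)
print(mismatch_pairs)
print(is_pure_coarsening)
```

If the same check returns True on the real data, keeping only one column of each such family is a reasonable candidate for reducing redundancy.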

<h1>Dataset Visualization</h1>

In [12]:
import plotly.express as px

target = pd.read_csv("Training_set_labels.csv")
target.head()
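# One color per class, in the order returned by value_counts():
# functional, non functional, functional needs repair
color_list = ["green", "red", "orange"]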

In [13]:
value_count = target["status_group"].value_counts()
value_count.index

Index(['functional', 'non functional', 'functional needs repair'], dtype='object')

In [14]:
bar = px.bar(
    x=value_count.index, y=value_count, color=value_count.index,
    color_discrete_sequence=color_list, title="Target Data Distribution",
    hover_name=value_count, width=800, height=600,
)
bar.update_xaxes(title_text="Label")
bar.update_yaxes(title_text="Quantity")
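
The bar chart shows absolute counts. To judge how imbalanced the classes are, the relative share of each class can be computed as well. The sketch below uses a toy label column with made-up counts (an assumption, not the real distribution), and `imbalance_ratio` is a hypothetical helper name introduced only for this example.

```python
import pandas as pd

# Toy stand-in for the labels file (hypothetical counts, not the real distribution).
toy_target = pd.DataFrame({
    "status_group": (
        ["functional"] * 5
        + ["non functional"] * 4
        + ["functional needs repair"] * 1
    ),
})

# normalize=True returns each class's share of the rows instead of raw counts.
share = toy_target["status_group"].value_counts(normalize=True)

# Ratio between the largest and the smallest class as a rough imbalance indicator.
imbalance_ratio = share.max() / share.min()

print(share)
print(imbalance_ratio)
```

The same two lines work on the real `target` frame by replacing `toy_target` with `target`.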