# Imports

In [1]:
import pandas as pd

# import training data and target
raw_data = pd.read_csv('data/training_data.csv')
raw_target = pd.read_csv('data/training_target.csv')

In [2]:
display(raw_target.info())
print(raw_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            59400 non-null  int64 
 1   status_group  59400 non-null  object
dtypes: int64(1), object(1)
memory usage: 928.2+ KB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 40 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55765 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59400 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

Create a raw DataFrame that merges the data with the target. We will use this DataFrame during initial EDA to compare feature relationships with the target and to understand and handle null values.

In [3]:
raw_df = pd.merge(raw_data, raw_target, on='id')

The training dataset includes 59,400 entries with 40 columns (`id` plus 39 features) and a target label.
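Because the merge joins on `id`, it can be worth confirming that no rows were duplicated or dropped. The following is a minimal sketch on toy data, not part of the original notebook: `toy_data`, `toy_target`, and their values are made up for illustration, and the `validate` argument is an optional extra check.

```python
import pandas as pd

# Toy stand-ins for raw_data and raw_target (values are made up for illustration).
toy_data = pd.DataFrame({"id": [1, 2, 3], "gps_height": [1390, 0, 686]})
toy_target = pd.DataFrame({
    "id": [3, 1, 2],
    "status_group": ["functional", "non functional", "functional"],
})

# validate="one_to_one" raises pandas.errors.MergeError if an id repeats on either side.
toy_df = pd.merge(toy_data, toy_target, on="id", how="inner", validate="one_to_one")

# An inner merge silently drops unmatched ids, so compare row counts explicitly.
assert len(toy_df) == len(toy_data) == len(toy_target)
print(toy_df.shape)  # (3, 3)
```

The same two checks could be applied to `raw_data` and `raw_target` before relying on `raw_df`.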

In [4]:
raw_df.status_group.value_counts(normalize=True)

functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
Name: status_group, dtype: float64

We know this is a ternary classification problem. The three possible values are:
- functional (F)
- non functional (NF)
- functional needs repair (FR)

Early analysis shows that our dataset is imbalanced with respect to the label values. Only 7.3% of pumps are classified as functional needs repair, while 54.3% are functional and 38.4% are non functional. We will need to account for this disparity later.
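One common way to account for this kind of imbalance is to weight each class inversely to its frequency. The sketch below is only an illustration of that idea and is an assumption on our part; the notebook has not yet chosen a method. The proportions are copied from the `value_counts` output above, and `class_weights` is a hypothetical name.

```python
import pandas as pd

# Class proportions copied from the value_counts(normalize=True) output above.
proportions = pd.Series({
    "functional": 0.543081,
    "non functional": 0.384242,
    "functional needs repair": 0.072677,
})

# Inverse-frequency weights: 1 / (n_classes * proportion).
# A class holding exactly one third of the rows would get weight 1.0.
class_weights = 1.0 / (len(proportions) * proportions)
print(class_weights.round(2))
```

With these weights, the rare functional needs repair class counts several times more per row than the majority functional class.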


# Null Checks

In [5]:
null_checks = pd.DataFrame(data=raw_df.isna().sum(),
                          columns=['null_count'])
null_checks['percent_of_data'] = (null_checks.null_count / len(raw_df)) * 100
null_checks = null_checks[null_checks.percent_of_data > 0.0]
null_checks.sort_values('percent_of_data', ascending=False, inplace=True)
null_checks

                   null_count  percent_of_data
scheme_name             28166        47.417508
scheme_management        3877         6.526936
installer                3655         6.153199
funder                   3635         6.119529
public_meeting           3334         5.612795
permit                   3056         5.144781
subvillage                371         0.624579


There are 7 features with null values in our dataset, and the table shows each feature's null count as a percent of the total available data.
______
We will work our way up from the smallest null_count to the largest and decide how to deal with the missing data.
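Since we will re-check nulls after each cleaning step, the summary can be wrapped in a small helper. This is a sketch under stated assumptions: `summarize_nulls` is a hypothetical name, the toy values are made up, and it uses `isna().mean()` as an equivalent way to get the null fraction per column.

```python
import numpy as np
import pandas as pd

def summarize_nulls(df):
    """Return null count and percent per column, keeping only columns with nulls."""
    summary = pd.DataFrame({
        "null_count": df.isna().sum(),
        # The mean of a boolean mask is the fraction of True values.
        "percent_of_data": df.isna().mean() * 100,
    })
    summary = summary[summary["null_count"] > 0]
    return summary.sort_values("percent_of_data", ascending=False)

# Toy frame for illustration only.
toy = pd.DataFrame({
    "permit": [True, np.nan, False, np.nan],
    "subvillage": ["Kawawa", "Shuleni", np.nan, "Azimio"],
    "basin": ["Pangani", "Rufiji", "Pangani", "Internal"],
})
null_summary = summarize_nulls(toy)
print(null_summary)
```

Calling `summarize_nulls(raw_df)` should give the same figures as the `null_checks` table, without mutating anything in place.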

## Subvillage

In [6]:
subvillage_nans = raw_df[raw_df.subvillage.isnull()]
subvillage_nans.status_group.value_counts(normalize=True)

functional                 0.552561
non functional             0.444744
functional needs repair    0.002695
Name: status_group, dtype: float64

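The null values in subvillage represent 0.6% of our total data. Within these rows, the share of functional pumps (55.3%) is close to that of the whole dataset (54.3%), but non functional pumps are overrepresented (44.5% vs. 38.4%) and functional needs repair is nearly absent (0.3% vs. 7.3%, a single row out of 371).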
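Placing the two label distributions side by side makes this comparison easier to read. The sketch below uses a toy frame with made-up values; the column names `all_rows`, `null_subvillage`, and `difference` are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "subvillage": ["Kawawa", np.nan, "Shuleni", np.nan, "Azimio", "Majengo"],
    "status_group": ["functional", "non functional", "functional",
                     "functional", "functional needs repair", "non functional"],
})

overall = toy["status_group"].value_counts(normalize=True)
missing = toy.loc[toy["subvillage"].isna(), "status_group"].value_counts(normalize=True)

# Align both distributions on the label; a label absent from the subset becomes 0.
comparison = pd.concat([overall, missing], axis=1,
                       keys=["all_rows", "null_subvillage"]).fillna(0.0)
comparison["difference"] = comparison["null_subvillage"] - comparison["all_rows"]
print(comparison)
```

The same pattern should work for any of the other features with nulls by swapping the column used in the `isna()` mask.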

In [7]:
subvillage_nans.region.value_counts()

Dodoma    361
Mwanza     10
Name: region, dtype: int64

All but 10 of our subvillage NaNs come from the region of Dodoma; the rest come from Mwanza. Let's look at the subvillage distribution of those regions in the whole dataset.

In [8]:
raw_df[raw_df['region'] == 'Dodoma'].subvillage.value_counts()

Kawawa         54
Shuleni        43
Nyerere        35
Azimio         34
Majengo        32
               ..
Foye            1
Mtatangwe       1
Makao Mapya     1
Soya Mjini      1
Mgomwa          1
Name: subvillage, Length: 705, dtype: int64

In [9]:
raw_df[raw_df['region'] == 'Mwanza']

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
18,34169,0.0,2011-07-22,Hesawa,1162,DWE,32.920154,-1.947868e+00,Ngomee,0,...,milky,milky,insufficient,insufficient,spring,spring,groundwater,other,other,functional needs repair
53,32376,0.0,2011-08-01,Government Of Tanzania,0,Government,0.000000,-2.000000e-08,Polisi,0,...,unknown,unknown,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
85,68717,0.0,2011-08-08,Swedish,0,Sengerema Water Department,32.185517,-2.378772e+00,Kwa Swakala,0,...,soft,good,seasonal,seasonal,shallow well,shallow well,groundwater,hand pump,hand pump,non functional
97,29083,0.0,2011-08-04,Mzee Sh,0,ALLYS,33.079504,-3.009659e+00,Kwa Shij,0,...,unknown,unknown,enough,enough,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,functional
129,41839,0.0,2011-07-27,Hesawa,0,HESAWA,32.993701,-2.690442e+00,Kwa Nhag,0,...,coloured,colored,dry,dry,shallow well,shallow well,groundwater,other,other,non functional
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59221,23897,0.0,2011-07-27,Ham,0,Sengerema Water Department,32.732351,-2.525240e+00,Kwa Mihayo,0,...,soft,good,dry,dry,shallow well,shallow well,groundwater,hand pump,hand pump,non functional
59230,59380,0.0,2011-07-27,Hifab,0,Hesawa,33.186638,-3.206395e+00,Mwambogwa,0,...,soft,good,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump,non functional
59248,37081,0.0,2011-07-30,Hesawa,0,DWE,33.058223,-2.448200e+00,Sese Centre,0,...,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,non functional
59269,39456,0.0,2011-07-24,Government Of Tanzania,1185,RWE,32.933684,-1.985302e+00,School,0,...,soft,good,insufficient,insufficient,lake,river/lake,surface,communal standpipe multiple,communal standpipe,non functional


In [10]:
raw_df[raw_df['region'] == 'Mwanza'].subvillage.value_counts()

1                     132
Madukani               52
Bujingwa               25
Shuleni                19
Matale                 18
                     ... 
Bukalo                  1
Nyambona                1
Kabaganda B             1
Bulyahilu Center B      1
Mwambogwa               1
Name: subvillage, Length: 1507, dtype: int64

In Dodoma, the 361 missing subvillage names amount to about half of the 705 distinct subvillage names recorded for the region. In Mwanza, only 10 names are missing against 1,507 distinct names.
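The missing and distinct counts per region can also be computed in one step rather than read off separate outputs. This is a minimal sketch on toy data with made-up rows; `by_region` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "region": ["Dodoma", "Dodoma", "Dodoma", "Dodoma", "Mwanza", "Mwanza", "Mwanza"],
    "subvillage": ["Kawawa", np.nan, "Shuleni", np.nan, "Madukani", "Bujingwa", np.nan],
})

by_region = pd.DataFrame({
    # Summing a boolean mask per region counts the missing names.
    "n_missing": toy["subvillage"].isna().groupby(toy["region"]).sum(),
    # nunique ignores NaN, so this counts distinct recorded names only.
    "n_distinct": toy.groupby("region")["subvillage"].nunique(),
})
by_region["missing_per_distinct"] = by_region["n_missing"] / by_region["n_distinct"]
print(by_region)
```

Applied to `raw_df`, this would put the Dodoma and Mwanza figures in a single table alongside every other region.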