<a href="https://colab.research.google.com/github/zacherymoy/DS-Unit-2-Kaggle-Challenge/blob/master/Class_3_Assignment_LS_DS_223_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 3*

---

# Cross-Validation


## Assignment
- [ ] [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Continue to participate in our Kaggle challenge. 
- [ ] Use scikit-learn for hyperparameter optimization with RandomizedSearchCV.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


You won't be able to just copy from the lesson notebook to this assignment.

- Because the lesson was ***regression***, but the assignment is ***classification.***
- Because the lesson used [TargetEncoder](https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html), which doesn't work as-is for _multi-class_ classification.

So you will have to adapt the example, which is good real-world practice.

1. Use a model for classification, such as [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
2. Use hyperparameters that match the classifier, such as `randomforestclassifier__ ...`
3. Use a metric for classification, such as [`scoring='accuracy'`](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values)
4. If you’re doing a multi-class classification problem — such as whether a waterpump is functional, functional needs repair, or nonfunctional — then use a categorical encoding that works for multi-class classification, such as [OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html) (not [TargetEncoder](https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html))



## Stretch Goals

### Reading
- Jake VanderPlas, [Python Data Science Handbook, Chapter 5.3](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html), Hyperparameters and Model Validation
- Jake VanderPlas, [Statistics for Hackers](https://speakerdeck.com/jakevdp/statistics-for-hackers?slide=107)
- Ron Zacharski, [A Programmer's Guide to Data Mining, Chapter 5](http://guidetodatamining.com/chapter5/), 10-fold cross validation
- Sebastian Raschka, [A Basic Pipeline and Grid Search Setup](https://github.com/rasbt/python-machine-learning-book/blob/master/code/bonus/svm_iris_pipeline_and_gridsearch.ipynb)
- Peter Worcester, [A Comparison of Grid Search and Randomized Search Using Scikit Learn](https://blog.usejournal.com/a-comparison-of-grid-search-and-randomized-search-using-scikit-learn-29823179bc85)

### Doing
- Add your own stretch goals!
- Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/). See the previous assignment notebook for details.
- In additon to `RandomizedSearchCV`, scikit-learn has [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). Another library called scikit-optimize has [`BayesSearchCV`](https://scikit-optimize.github.io/notebooks/sklearn-gridsearchcv-replacement.html). Experiment with these alternatives.
- _[Introduction to Machine Learning with Python](http://shop.oreilly.com/product/0636920030515.do)_ discusses options for "Grid-Searching Which Model To Use" in Chapter 6:

> You can even go further in combining GridSearchCV and Pipeline: it is also possible to search over the actual steps being performed in the pipeline (say whether to use StandardScaler or MinMaxScaler). This leads to an even bigger search space and should be considered carefully. Trying all possible solutions is usually not a viable machine learning strategy. However, here is an example comparing a RandomForestClassifier and an SVC ...

The example is shown in [the accompanying notebook](https://github.com/amueller/introduction_to_ml_with_python/blob/master/06-algorithm-chains-and-pipelines.ipynb), code cells 35-37. Could you apply this concept to your own pipelines?


### BONUS: Stacking!

Here's some code you can use to "stack" multiple submissions, which is another form of ensembling:

```python
import pandas as pd

# Filenames of your submissions you want to ensemble
files = ['submission-01.csv', 'submission-02.csv', 'submission-03.csv']

target = 'status_group'
submissions = (pd.read_csv(file)[[target]] for file in files)
ensemble = pd.concat(submissions, axis='columns')
majority_vote = ensemble.mode(axis='columns')[0]

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission[target] = majority_vote
submission.to_csv('my-ultimate-ensemble-submission.csv', index=False)
```

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
import pandas as pd

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

In [0]:
train.drop(columns=['scheme_name','extraction_type_group', 'extraction_type_class', 'payment_type', 
                    'source_type', 'waterpoint_type_group',
                    'public_meeting', 'permit', 'recorded_by', 'id'])

Unnamed: 0,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,scheme_management,construction_year,extraction_type,management,management_group,payment,water_quality,quality_group,quantity,quantity_group,source,source_class,waterpoint_type,status_group
0,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,VWC,1999,gravity,vwc,user-group,pay annually,soft,good,enough,enough,spring,groundwater,communal standpipe,functional
1,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,Other,2010,gravity,wug,user-group,never pay,soft,good,insufficient,insufficient,rainwater harvesting,surface,communal standpipe,functional
2,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,VWC,2009,gravity,vwc,user-group,pay per bucket,soft,good,enough,enough,dam,surface,communal standpipe multiple,functional
3,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,VWC,1986,submersible,vwc,user-group,never pay,soft,good,dry,dry,machine dbh,groundwater,communal standpipe multiple,non functional
4,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,,0,gravity,other,other,never pay,soft,good,seasonal,seasonal,rainwater harvesting,surface,communal standpipe,functional
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59395,10.0,2013-05-03,Germany Republi,1210,CES,37.169807,-3.253847,Area Three Namba 27,0,Pangani,Kiduruni,Kilimanjaro,3,5,Hai,Masama Magharibi,125,Water Board,1999,gravity,water board,user-group,pay per bucket,soft,good,enough,enough,spring,groundwater,communal standpipe,functional
59396,4700.0,2011-05-07,Cefa-njombe,1212,Cefa,35.249991,-9.070629,Kwa Yahona Kuvala,0,Rufiji,Igumbilo,Iringa,11,4,Njombe,Ikondo,56,VWC,1996,gravity,vwc,user-group,pay annually,soft,good,enough,enough,river,surface,communal standpipe,functional
59397,0.0,2011-04-11,,0,,34.017087,-8.750434,Mashine,0,Rufiji,Madungulu,Mbeya,12,7,Mbarali,Chimala,0,VWC,0,swn 80,vwc,user-group,pay monthly,fluoride,fluoride,enough,enough,machine dbh,groundwater,hand pump,functional
59398,0.0,2011-03-08,Malec,0,Musa,35.861315,-6.378573,Mshoro,0,Rufiji,Mwinyi,Dodoma,1,4,Chamwino,Mvumi Makulu,0,VWC,0,nira/tanira,vwc,user-group,never pay,soft,good,insufficient,insufficient,shallow well,groundwater,hand pump,functional


In [0]:
import numpy as np

def wrangle(X):
    """Wrangle train, validate, and test sets in the same way"""
    
    # Prevent SettingWithCopyWarning
    X = X.copy()
    
    # About 3% of the time, latitude has small values near zero,
    # outside Tanzania, so we'll treat these values like zero.
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    
    # When columns have zeros and shouldn't, they are like null values.
    # So we will replace the zeros with nulls, and impute missing values later.
    # Also create a "missing indicator" column, because the fact that
    # values are missing may be a predictive signal.
    cols_with_zeros = ['longitude', 'latitude', 'construction_year', 
                       'gps_height', 'population']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
        X[col+'_MISSING'] = X[col].isnull()

    # Convert date_recorded to datetime
    X['date_recorded'] = pd.to_datetime(X['date_recorded'], infer_datetime_format=True)
    
    # Extract components from date_recorded, then drop the original column
    X['year_recorded'] = X['date_recorded'].dt.year
    X['month_recorded'] = X['date_recorded'].dt.month
    X['day_recorded'] = X['date_recorded'].dt.day
    X = X.drop(columns='date_recorded')

    # Engineer feature: how many years from construction_year to date_recorded
    X['years'] = X['year_recorded'] - X['construction_year']
    X['years_MISSING'] = X['years'].isnull()

    # return the wrangled dataframe
    return X


train = wrangle(train)
test = wrangle(test)

In [0]:
train.head()

Unnamed: 0,id,amount_tsh,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group,longitude_MISSING,latitude_MISSING,construction_year_MISSING,gps_height_MISSING,population_MISSING,year_recorded,month_recorded,day_recorded,years,years_MISSING
0,69572,6000.0,Roman,1390.0,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109.0,True,GeoData Consultants Ltd,VWC,Roman,False,1999.0,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional,False,False,False,False,False,2011,3,14,12.0,False
1,8776,0.0,Grumeti,1399.0,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280.0,,GeoData Consultants Ltd,Other,,True,2010.0,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional,False,False,False,False,False,2013,3,6,3.0,False
2,34310,25.0,Lottery Club,686.0,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250.0,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009.0,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional,False,False,False,False,False,2013,2,25,4.0,False
3,67743,0.0,Unicef,263.0,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58.0,True,GeoData Consultants Ltd,VWC,,True,1986.0,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional,False,False,False,False,False,2013,1,28,27.0,False
4,19728,0.0,Action In A,,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,,True,GeoData Consultants Ltd,,,True,,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional,False,False,True,True,True,2011,7,13,,True


In [0]:
import pandas_profiling

#df = pd.read_csv('my_data.csv')
pandas_profiling.ProfileReport(train)

  variable_stats = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)


0,1
Number of variables,50
Number of observations,59400
Total Missing (%),4.5%
Total size in memory,20.7 MiB
Average record size in memory,366.0 B

0,1
Numeric,14
Categorical,27
Boolean,2
Date,0
Text (Unique),0
Rejected,5
Unsupported,2

0,1
Distinct count,59400
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,37115
Minimum,0
Maximum,74247
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,3730.9
Q1,18520.0
Median,37062.0
Q3,55656.0
95-th percentile,70564.0
Maximum,74247.0
Range,74247.0
Interquartile range,37137.0

0,1
Standard deviation,21453
Coef of variation,0.57802
Kurtosis,-1.2015
Mean,37115
MAD,18586
Skewness,0.0026225
Sum,2204638827
Variance,460240000
Memory size,3.4 MiB

Value,Count,Frequency (%),Unnamed: 3
2047,1,0.0%,
72310,1,0.0%,
49805,1,0.0%,
51852,1,0.0%,
62091,1,0.0%,
64138,1,0.0%,
57993,1,0.0%,
60040,1,0.0%,
33413,1,0.0%,
35460,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1,0.0%,
1,1,0.0%,
2,1,0.0%,
3,1,0.0%,
4,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
74240,1,0.0%,
74242,1,0.0%,
74243,1,0.0%,
74246,1,0.0%,
74247,1,0.0%,

0,1
Distinct count,98
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,317.65
Minimum,0
Maximum,350000
Zeros (%),70.1%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,20
95-th percentile,1200
Maximum,350000
Range,350000
Interquartile range,20

0,1
Standard deviation,2997.6
Coef of variation,9.4367
Kurtosis,4903.5
Mean,317.65
MAD,522.12
Skewness,57.808
Sum,18868000
Variance,8985500
Memory size,3.4 MiB

Value,Count,Frequency (%),Unnamed: 3
0.0,41639,70.1%,
500.0,3102,5.2%,
50.0,2472,4.2%,
1000.0,1488,2.5%,
20.0,1463,2.5%,
200.0,1220,2.1%,
100.0,816,1.4%,
10.0,806,1.4%,
30.0,743,1.3%,
2000.0,704,1.2%,

Value,Count,Frequency (%),Unnamed: 3
0.0,41639,70.1%,
0.2,3,0.0%,
0.25,1,0.0%,
1.0,3,0.0%,
2.0,13,0.0%,

Value,Count,Frequency (%),Unnamed: 3
138000.0,1,0.0%,
170000.0,1,0.0%,
200000.0,1,0.0%,
250000.0,1,0.0%,
350000.0,1,0.0%,

0,1
Distinct count,1898
Unique (%),3.2%
Missing (%),6.1%
Missing (n),3635

0,1
Government Of Tanzania,9084
Danida,3114
Hesawa,2202
Other values (1894),41365
(Missing),3635

Value,Count,Frequency (%),Unnamed: 3
Government Of Tanzania,9084,15.3%,
Danida,3114,5.2%,
Hesawa,2202,3.7%,
Rwssp,1374,2.3%,
World Bank,1349,2.3%,
Kkkt,1287,2.2%,
World Vision,1246,2.1%,
Unicef,1057,1.8%,
Tasaf,877,1.5%,
District Council,843,1.4%,

0,1
Distinct count,2428
Unique (%),4.1%
Missing (%),34.4%
Missing (n),20438
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1018.9
Minimum,-90
Maximum,2770
Zeros (%),0.0%

0,1
Minimum,-90.0
5-th percentile,16.0
Q1,393.0
Median,1167.0
Q3,1498.0
95-th percentile,1899.9
Maximum,2770.0
Range,2860.0
Interquartile range,1105.0

0,1
Standard deviation,612.57
Coef of variation,0.60123
Kurtosis,-1.0861
Mean,1018.9
MAD,528.44
Skewness,-0.20193
Sum,39697000
Variance,375240
Memory size,3.4 MiB

Value,Count,Frequency (%),Unnamed: 3
-15.0,60,0.1%,
-16.0,55,0.1%,
-13.0,55,0.1%,
1290.0,52,0.1%,
-20.0,52,0.1%,
303.0,51,0.1%,
-14.0,51,0.1%,
-18.0,49,0.1%,
-19.0,47,0.1%,
1269.0,46,0.1%,

Value,Count,Frequency (%),Unnamed: 3
-90.0,1,0.0%,
-63.0,2,0.0%,
-59.0,1,0.0%,
-57.0,1,0.0%,
-55.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2623.0,1,0.0%,
2626.0,2,0.0%,
2627.0,1,0.0%,
2628.0,1,0.0%,
2770.0,1,0.0%,

0,1
Distinct count,2146
Unique (%),3.6%
Missing (%),6.2%
Missing (n),3655

0,1
DWE,17402
Government,1825
RWE,1206
Other values (2142),35312
(Missing),3655

Value,Count,Frequency (%),Unnamed: 3
DWE,17402,29.3%,
Government,1825,3.1%,
RWE,1206,2.0%,
Commu,1060,1.8%,
DANIDA,1050,1.8%,
KKKT,898,1.5%,
Hesawa,840,1.4%,
0,777,1.3%,
TCRS,707,1.2%,
Central government,622,1.0%,

0,1
Distinct count,57516
Unique (%),96.8%
Missing (%),3.1%
Missing (n),1812
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,35.15
Minimum,29.607
Maximum,40.345
Zeros (%),0.0%

0,1
Minimum,29.607
5-th percentile,30.624
Q1,33.285
Median,35.006
Q3,37.234
95-th percentile,39.15
Maximum,40.345
Range,10.738
Interquartile range,3.9486

0,1
Standard deviation,2.6074
Coef of variation,0.074181
Kurtosis,-0.86928
Mean,35.15
MAD,2.1924
Skewness,-0.13481
Sum,2024200
Variance,6.7987
Memory size,3.4 MiB

Value,Count,Frequency (%),Unnamed: 3
39.08887513,2,0.0%,
39.10530661,2,0.0%,
37.54340145,2,0.0%,
38.18053774,2,0.0%,
32.98856004,2,0.0%,
32.99327684,2,0.0%,
39.09309544,2,0.0%,
39.10124424,2,0.0%,
32.96700926,2,0.0%,
39.08628657,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
29.6071219,1,0.0%,
29.60720109,1,0.0%,
29.61032056,1,0.0%,
29.61096482,1,0.0%,
29.61194674,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
40.32340181,1,0.0%,
40.32522643,1,0.0%,
40.32523996,1,0.0%,
40.34430089,1,0.0%,
40.34519307,1,0.0%,

0,1
Distinct count,57517
Unique (%),96.8%
Missing (%),3.1%
Missing (n),1812
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,-5.8856
Minimum,-11.649
Maximum,-0.99846
Zeros (%),0.0%

0,1
Minimum,-11.649
5-th percentile,-10.601
Q1,-8.6438
Median,-5.1727
Q3,-3.3728
95-th percentile,-1.8027
Maximum,-0.99846
Range,10.651
Interquartile range,5.271

0,1
Standard deviation,2.8099
Coef of variation,-0.47742
Kurtosis,-1.2032
Mean,-5.8856
MAD,2.4813
Skewness,-0.25229
Sum,-338940
Variance,7.8954
Memory size,3.4 MiB

Value,Count,Frequency (%),Unnamed: 3
-6.978755499999999,2,0.0%,
-2.51532072,2,0.0%,
-2.48937845,2,0.0%,
-6.96356538,2,0.0%,
-2.49454559,2,0.0%,
-9.2893492,2,0.0%,
-2.48708461,2,0.0%,
-6.99129411,2,0.0%,
-6.98945622,2,0.0%,
-2.51063865,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
-11.64944018,1,0.0%,
-11.64837759,1,0.0%,
-11.58629656,1,0.0%,
-11.56857679,1,0.0%,
-11.56680457,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
-0.9994692,1,0.0%,
-0.99911702,1,0.0%,
-0.99901209,1,0.0%,
-0.998916,1,0.0%,
-0.99846435,1,0.0%,

0,1
Distinct count,37400
Unique (%),63.0%
Missing (%),0.0%
Missing (n),0

0,1
none,3563
Shuleni,1748
Zahanati,830
Other values (37397),53259

Value,Count,Frequency (%),Unnamed: 3
none,3563,6.0%,
Shuleni,1748,2.9%,
Zahanati,830,1.4%,
Msikitini,535,0.9%,
Kanisani,323,0.5%,
Bombani,271,0.5%,
Sokoni,260,0.4%,
Ofisini,254,0.4%,
School,208,0.4%,
Shule Ya Msingi,199,0.3%,

0,1
Distinct count,65
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.47414
Minimum,0
Maximum,1776
Zeros (%),98.7%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,0
Maximum,1776
Range,1776
Interquartile range,0

0,1
Standard deviation,12.236
Coef of variation,25.807
Kurtosis,11137
Mean,0.47414
MAD,0.9362
Skewness,91.934
Sum,28164
Variance,149.73
Memory size,928.1 KiB

Value,Count,Frequency (%),Unnamed: 3
0,58643,98.7%,
6,81,0.1%,
1,73,0.1%,
5,46,0.1%,
8,46,0.1%,
32,40,0.1%,
45,36,0.1%,
15,35,0.1%,
39,30,0.1%,
93,28,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,58643,98.7%,
1,73,0.1%,
2,23,0.0%,
3,27,0.0%,
4,20,0.0%,

Value,Count,Frequency (%),Unnamed: 3
672,1,0.0%,
698,1,0.0%,
755,1,0.0%,
1402,1,0.0%,
1776,1,0.0%,

0,1
Distinct count,9
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Lake Victoria,10248
Pangani,8940
Rufiji,7976
Other values (6),32236

Value,Count,Frequency (%),Unnamed: 3
Lake Victoria,10248,17.3%,
Pangani,8940,15.1%,
Rufiji,7976,13.4%,
Internal,7785,13.1%,
Lake Tanganyika,6432,10.8%,
Wami / Ruvu,5987,10.1%,
Lake Nyasa,5085,8.6%,
Ruvuma / Southern Coast,4493,7.6%,
Lake Rukwa,2454,4.1%,

0,1
Distinct count,19288
Unique (%),32.5%
Missing (%),0.6%
Missing (n),371

0,1
Madukani,508
Shuleni,506
Majengo,502
Other values (19284),57513

Value,Count,Frequency (%),Unnamed: 3
Madukani,508,0.9%,
Shuleni,506,0.9%,
Majengo,502,0.8%,
Kati,373,0.6%,
Mtakuja,262,0.4%,
Sokoni,232,0.4%,
M,187,0.3%,
Muungano,172,0.3%,
Mbuyuni,164,0.3%,
Mlimani,152,0.3%,

0,1
Distinct count,21
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Iringa,5294
Shinyanga,4982
Mbeya,4639
Other values (18),44485

Value,Count,Frequency (%),Unnamed: 3
Iringa,5294,8.9%,
Shinyanga,4982,8.4%,
Mbeya,4639,7.8%,
Kilimanjaro,4379,7.4%,
Morogoro,4006,6.7%,
Arusha,3350,5.6%,
Kagera,3316,5.6%,
Mwanza,3102,5.2%,
Kigoma,2816,4.7%,
Ruvuma,2640,4.4%,

0,1
Distinct count,27
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,15.297
Minimum,1
Maximum,99
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,2
Q1,5
Median,12
Q3,17
95-th percentile,60
Maximum,99
Range,98
Interquartile range,12

0,1
Standard deviation,17.587
Coef of variation,1.1497
Kurtosis,10.288
Mean,15.297
MAD,9.487
Skewness,3.1738
Sum,908642
Variance,309.32
Memory size,928.1 KiB

Value,Count,Frequency (%),Unnamed: 3
11,5300,8.9%,
17,5011,8.4%,
12,4639,7.8%,
3,4379,7.4%,
5,4040,6.8%,
18,3324,5.6%,
19,3047,5.1%,
2,3024,5.1%,
16,2816,4.7%,
10,2640,4.4%,

Value,Count,Frequency (%),Unnamed: 3
1,2201,3.7%,
2,3024,5.1%,
3,4379,7.4%,
4,2513,4.2%,
5,4040,6.8%,

Value,Count,Frequency (%),Unnamed: 3
40,1,0.0%,
60,1025,1.7%,
80,1238,2.1%,
90,917,1.5%,
99,423,0.7%,

0,1
Distinct count,20
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,5.6297
Minimum,0
Maximum,80
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,1
Q1,2
Median,3
Q3,5
95-th percentile,30
Maximum,80
Range,80
Interquartile range,3

0,1
Standard deviation,9.6336
Coef of variation,1.7112
Kurtosis,16.214
Mean,5.6297
MAD,4.7435
Skewness,3.962
Sum,334407
Variance,92.807
Memory size,928.1 KiB

Value,Count,Frequency (%),Unnamed: 3
1,12203,20.5%,
2,11173,18.8%,
3,9998,16.8%,
4,8999,15.1%,
5,4356,7.3%,
6,4074,6.9%,
7,3343,5.6%,
8,1043,1.8%,
30,995,1.7%,
33,874,1.5%,

Value,Count,Frequency (%),Unnamed: 3
0,23,0.0%,
1,12203,20.5%,
2,11173,18.8%,
3,9998,16.8%,
4,8999,15.1%,

Value,Count,Frequency (%),Unnamed: 3
60,63,0.1%,
62,109,0.2%,
63,195,0.3%,
67,6,0.0%,
80,12,0.0%,

0,1
Distinct count,125
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0

0,1
Njombe,2503
Arusha Rural,1252
Moshi Rural,1251
Other values (122),54394

Value,Count,Frequency (%),Unnamed: 3
Njombe,2503,4.2%,
Arusha Rural,1252,2.1%,
Moshi Rural,1251,2.1%,
Bariadi,1177,2.0%,
Rungwe,1106,1.9%,
Kilosa,1094,1.8%,
Kasulu,1047,1.8%,
Mbozi,1034,1.7%,
Meru,1009,1.7%,
Bagamoyo,997,1.7%,

0,1
Distinct count,2092
Unique (%),3.5%
Missing (%),0.0%
Missing (n),0

0,1
Igosi,307
Imalinyi,252
Siha Kati,232
Other values (2089),58609

Value,Count,Frequency (%),Unnamed: 3
Igosi,307,0.5%,
Imalinyi,252,0.4%,
Siha Kati,232,0.4%,
Mdandu,231,0.4%,
Nduruma,217,0.4%,
Mishamo,203,0.3%,
Kitunda,203,0.3%,
Msindo,201,0.3%,
Chalinze,196,0.3%,
Maji ya Chai,190,0.3%,

0,1
Distinct count,1049
Unique (%),1.8%
Missing (%),36.0%
Missing (n),21381
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,281.09
Minimum,1
Maximum,30500
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,40
Median,150
Q3,324
95-th percentile,897
Maximum,30500
Range,30499
Interquartile range,284

0,1
Standard deviation,564.69
Coef of variation,2.0089
Kurtosis,296.67
Mean,281.09
MAD,259.61
Skewness,10.989
Sum,10687000
Variance,318870
Memory size,928.1 KiB

Value,Count,Frequency (%),Unnamed: 3
1.0,7025,11.8%,
200.0,1940,3.3%,
150.0,1892,3.2%,
250.0,1681,2.8%,
300.0,1476,2.5%,
100.0,1146,1.9%,
50.0,1139,1.9%,
500.0,1009,1.7%,
350.0,986,1.7%,
120.0,916,1.5%,

Value,Count,Frequency (%),Unnamed: 3
1.0,7025,11.8%,
2.0,4,0.0%,
3.0,4,0.0%,
4.0,13,0.0%,
5.0,44,0.1%,

Value,Count,Frequency (%),Unnamed: 3
9865.0,1,0.0%,
10000.0,3,0.0%,
11463.0,1,0.0%,
15300.0,1,0.0%,
30500.0,1,0.0%,

Unsupported value

0,1
Constant value,GeoData Consultants Ltd

0,1
Distinct count,13
Unique (%),0.0%
Missing (%),6.5%
Missing (n),3877

0,1
VWC,36793
WUG,5206
Water authority,3153
Other values (9),10371
(Missing),3877

Value,Count,Frequency (%),Unnamed: 3
VWC,36793,61.9%,
WUG,5206,8.8%,
Water authority,3153,5.3%,
WUA,2883,4.9%,
Water Board,2748,4.6%,
Parastatal,1680,2.8%,
Private operator,1063,1.8%,
Company,1061,1.8%,
Other,766,1.3%,
SWC,97,0.2%,

0,1
Distinct count,2697
Unique (%),4.5%
Missing (%),47.4%
Missing (n),28166

0,1
K,682
,644
Borehole,546
Other values (2693),29362
(Missing),28166

Value,Count,Frequency (%),Unnamed: 3
K,682,1.1%,
,644,1.1%,
Borehole,546,0.9%,
Chalinze wate,405,0.7%,
M,400,0.7%,
DANIDA,379,0.6%,
Government,320,0.5%,
Ngana water supplied scheme,270,0.5%,
wanging'ombe water supply s,261,0.4%,
wanging'ombe supply scheme,234,0.4%,

Unsupported value

0,1
Distinct count,55
Unique (%),0.1%
Missing (%),34.9%
Missing (n),20709
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1996.8
Minimum,1960
Maximum,2013
Zeros (%),0.0%

0,1
Minimum,1960
5-th percentile,1973
Q1,1987
Median,2000
Q3,2008
95-th percentile,2011
Maximum,2013
Range,53
Interquartile range,21

0,1
Standard deviation,12.472
Coef of variation,0.006246
Kurtosis,-0.55885
Mean,1996.8
MAD,10.543
Skewness,-0.73147
Sum,77259000
Variance,155.55
Memory size,928.1 KiB

Value,Count,Frequency (%),Unnamed: 3
2010.0,2645,4.5%,
2008.0,2613,4.4%,
2009.0,2533,4.3%,
2000.0,2091,3.5%,
2007.0,1587,2.7%,
2006.0,1471,2.5%,
2003.0,1286,2.2%,
2011.0,1256,2.1%,
2004.0,1123,1.9%,
2012.0,1084,1.8%,

Value,Count,Frequency (%),Unnamed: 3
1960.0,102,0.2%,
1961.0,21,0.0%,
1962.0,30,0.1%,
1963.0,85,0.1%,
1964.0,40,0.1%,

Value,Count,Frequency (%),Unnamed: 3
2009.0,2533,4.3%,
2010.0,2645,4.5%,
2011.0,1256,2.1%,
2012.0,1084,1.8%,
2013.0,176,0.3%,

0,1
Distinct count,18
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
gravity,26780
nira/tanira,8154
other,6430
Other values (15),18036

Value,Count,Frequency (%),Unnamed: 3
gravity,26780,45.1%,
nira/tanira,8154,13.7%,
other,6430,10.8%,
submersible,4764,8.0%,
swn 80,3670,6.2%,
mono,2865,4.8%,
india mark ii,2400,4.0%,
afridev,1770,3.0%,
ksb,1415,2.4%,
other - rope pump,451,0.8%,

0,1
Distinct count,13
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
gravity,26780
nira/tanira,8154
other,6430
Other values (10),18036

Value,Count,Frequency (%),Unnamed: 3
gravity,26780,45.1%,
nira/tanira,8154,13.7%,
other,6430,10.8%,
submersible,6179,10.4%,
swn 80,3670,6.2%,
mono,2865,4.8%,
india mark ii,2400,4.0%,
afridev,1770,3.0%,
rope pump,451,0.8%,
other handpump,364,0.6%,

0,1
Distinct count,7
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
gravity,26780
handpump,16456
other,6430
Other values (4),9734

Value,Count,Frequency (%),Unnamed: 3
gravity,26780,45.1%,
handpump,16456,27.7%,
other,6430,10.8%,
submersible,6179,10.4%,
motorpump,2987,5.0%,
rope pump,451,0.8%,
wind-powered,117,0.2%,

0,1
Distinct count,12
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
vwc,40507
wug,6515
water board,2933
Other values (9),9445

Value,Count,Frequency (%),Unnamed: 3
vwc,40507,68.2%,
wug,6515,11.0%,
water board,2933,4.9%,
wua,2535,4.3%,
private operator,1971,3.3%,
parastatal,1768,3.0%,
water authority,904,1.5%,
other,844,1.4%,
company,685,1.2%,
unknown,561,0.9%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
user-group,52490
commercial,3638
parastatal,1768
Other values (2),1504

Value,Count,Frequency (%),Unnamed: 3
user-group,52490,88.4%,
commercial,3638,6.1%,
parastatal,1768,3.0%,
other,943,1.6%,
unknown,561,0.9%,

0,1
Distinct count,7
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
never pay,25348
pay per bucket,8985
pay monthly,8300
Other values (4),16767

Value,Count,Frequency (%),Unnamed: 3
never pay,25348,42.7%,
pay per bucket,8985,15.1%,
pay monthly,8300,14.0%,
unknown,8157,13.7%,
pay when scheme fails,3914,6.6%,
pay annually,3642,6.1%,
other,1054,1.8%,

0,1
Distinct count,7
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
never pay,25348
per bucket,8985
monthly,8300
Other values (4),16767

Value,Count,Frequency (%),Unnamed: 3
never pay,25348,42.7%,
per bucket,8985,15.1%,
monthly,8300,14.0%,
unknown,8157,13.7%,
on failure,3914,6.6%,
annually,3642,6.1%,
other,1054,1.8%,

0,1
Distinct count,8
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
soft,50818
salty,4856
unknown,1876
Other values (5),1850

Value,Count,Frequency (%),Unnamed: 3
soft,50818,85.6%,
salty,4856,8.2%,
unknown,1876,3.2%,
milky,804,1.4%,
coloured,490,0.8%,
salty abandoned,339,0.6%,
fluoride,200,0.3%,
fluoride abandoned,17,0.0%,

0,1
Distinct count,6
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
good,50818
salty,5195
unknown,1876
Other values (3),1511

Value,Count,Frequency (%),Unnamed: 3
good,50818,85.6%,
salty,5195,8.7%,
unknown,1876,3.2%,
milky,804,1.4%,
colored,490,0.8%,
fluoride,217,0.4%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
enough,33186
insufficient,15129
dry,6246
Other values (2),4839

Value,Count,Frequency (%),Unnamed: 3
enough,33186,55.9%,
insufficient,15129,25.5%,
dry,6246,10.5%,
seasonal,4050,6.8%,
unknown,789,1.3%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
enough,33186
insufficient,15129
dry,6246
Other values (2),4839

Value,Count,Frequency (%),Unnamed: 3
enough,33186,55.9%,
insufficient,15129,25.5%,
dry,6246,10.5%,
seasonal,4050,6.8%,
unknown,789,1.3%,

0,1
Distinct count,10
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
spring,17021
shallow well,16824
machine dbh,11075
Other values (7),14480

Value,Count,Frequency (%),Unnamed: 3
spring,17021,28.7%,
shallow well,16824,28.3%,
machine dbh,11075,18.6%,
river,9612,16.2%,
rainwater harvesting,2295,3.9%,
hand dtw,874,1.5%,
lake,765,1.3%,
dam,656,1.1%,
other,212,0.4%,
unknown,66,0.1%,

0,1
Distinct count,7
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
spring,17021
shallow well,16824
borehole,11949
Other values (4),13606

Value,Count,Frequency (%),Unnamed: 3
spring,17021,28.7%,
shallow well,16824,28.3%,
borehole,11949,20.1%,
river/lake,10377,17.5%,
rainwater harvesting,2295,3.9%,
dam,656,1.1%,
other,278,0.5%,

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
groundwater,45794
surface,13328
unknown,278

Value,Count,Frequency (%),Unnamed: 3
groundwater,45794,77.1%,
surface,13328,22.4%,
unknown,278,0.5%,

0,1
Distinct count,7
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
communal standpipe,28522
hand pump,17488
other,6380
Other values (4),7010

Value,Count,Frequency (%),Unnamed: 3
communal standpipe,28522,48.0%,
hand pump,17488,29.4%,
other,6380,10.7%,
communal standpipe multiple,6103,10.3%,
improved spring,784,1.3%,
cattle trough,116,0.2%,
dam,7,0.0%,

0,1
Distinct count,6
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
communal standpipe,34625
hand pump,17488
other,6380
Other values (3),907

Value,Count,Frequency (%),Unnamed: 3
communal standpipe,34625,58.3%,
hand pump,17488,29.4%,
other,6380,10.7%,
improved spring,784,1.3%,
cattle trough,116,0.2%,
dam,7,0.0%,

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
functional,32259
non functional,22824
functional needs repair,4317

Value,Count,Frequency (%),Unnamed: 3
functional,32259,54.3%,
non functional,22824,38.4%,
functional needs repair,4317,7.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.030505

0,1
True,1812
(Missing),57588

Value,Count,Frequency (%),Unnamed: 3
True,1812,3.1%,
(Missing),57588,96.9%,

0,1
Correlation,1

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.34864

0,1
True,20709
(Missing),38691

Value,Count,Frequency (%),Unnamed: 3
True,20709,34.9%,
(Missing),38691,65.1%,

0,1
Correlation,0.93323

0,1
Correlation,0.90895

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2011.9
Minimum,2002
Maximum,2013
Zeros (%),0.0%

0,1
Minimum,2002
5-th percentile,2011
Q1,2011
Median,2012
Q3,2013
95-th percentile,2013
Maximum,2013
Range,11
Interquartile range,2

0,1
Standard deviation,0.95876
Coef of variation,0.00047654
Kurtosis,0.61323
Mean,2011.9
MAD,0.89816
Skewness,-0.15098
Sum,119508147
Variance,0.91922
Memory size,928.1 KiB

Value,Count,Frequency (%),Unnamed: 3
2011,28674,48.3%,
2013,24271,40.9%,
2012,6424,10.8%,
2004,30,0.1%,
2002,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2002,1,0.0%,
2004,30,0.1%,
2011,28674,48.3%,
2012,6424,10.8%,
2013,24271,40.9%,

Value,Count,Frequency (%),Unnamed: 3
2002,1,0.0%,
2004,30,0.1%,
2011,28674,48.3%,
2012,6424,10.8%,
2013,24271,40.9%,

0,1
Distinct count,12
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,4.3756
Minimum,1
Maximum,12
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,2
Median,3
Q3,7
95-th percentile,10
Maximum,12
Range,11
Interquartile range,5

0,1
Standard deviation,3.0292
Coef of variation,0.6923
Kurtosis,-0.4984
Mean,4.3756
MAD,2.5952
Skewness,0.90951
Sum,259913
Variance,9.1763
Memory size,928.1 KiB

Value,Count,Frequency (%),Unnamed: 3
3,17936,30.2%,
2,12402,20.9%,
7,6928,11.7%,
1,6354,10.7%,
10,5466,9.2%,
4,3970,6.7%,
8,3364,5.7%,
11,1349,2.3%,
12,621,1.0%,
6,346,0.6%,

Value,Count,Frequency (%),Unnamed: 3
1,6354,10.7%,
2,12402,20.9%,
3,17936,30.2%,
4,3970,6.7%,
5,336,0.6%,

Value,Count,Frequency (%),Unnamed: 3
8,3364,5.7%,
9,328,0.6%,
10,5466,9.2%,
11,1349,2.3%,
12,621,1.0%,

0,1
Distinct count,31
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,15.621
Minimum,1
Maximum,31
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,2
Q1,8
Median,16
Q3,23
95-th percentile,29
Maximum,31
Range,30
Interquartile range,15

0,1
Standard deviation,8.6876
Coef of variation,0.55613
Kurtosis,-1.1802
Mean,15.621
MAD,7.4559
Skewness,-0.063184
Sum,927917
Variance,75.474
Memory size,928.1 KiB

Value,Count,Frequency (%),Unnamed: 3
3,4084,6.9%,
4,2758,4.6%,
18,2531,4.3%,
15,2365,4.0%,
19,2347,4.0%,
23,2294,3.9%,
16,2268,3.8%,
17,2250,3.8%,
22,2141,3.6%,
27,2052,3.5%,

Value,Count,Frequency (%),Unnamed: 3
1,1322,2.2%,
2,1657,2.8%,
3,4084,6.9%,
4,2758,4.6%,
5,1572,2.6%,

Value,Count,Frequency (%),Unnamed: 3
27,2052,3.5%,
28,1889,3.2%,
29,1349,2.3%,
30,1415,2.4%,
31,884,1.5%,

0,1
Distinct count,61
Unique (%),0.1%
Missing (%),34.9%
Missing (n),20709
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,15.356
Minimum,-7
Maximum,53
Zeros (%),1.0%

0,1
Minimum,-7
5-th percentile,1
Q1,5
Median,13
Q3,25
95-th percentile,39
Maximum,53
Range,60
Interquartile range,20

0,1
Standard deviation,12.493
Coef of variation,0.81355
Kurtosis,-0.53443
Mean,15.356
MAD,10.525
Skewness,0.73139
Sum,594130
Variance,156.07
Memory size,928.1 KiB

Value,Count,Frequency (%),Unnamed: 3
3.0,2740,4.6%,
1.0,2303,3.9%,
2.0,2129,3.6%,
5.0,1980,3.3%,
4.0,1890,3.2%,
13.0,1869,3.1%,
7.0,1404,2.4%,
6.0,1381,2.3%,
11.0,1352,2.3%,
14.0,1160,2.0%,

Value,Count,Frequency (%),Unnamed: 3
-7.0,1,0.0%,
-5.0,3,0.0%,
-4.0,2,0.0%,
-3.0,1,0.0%,
-2.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
49.0,25,0.0%,
50.0,84,0.1%,
51.0,31,0.1%,
52.0,11,0.0%,
53.0,91,0.2%,

0,1
Correlation,0.92589

Unnamed: 0,id,amount_tsh,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group,longitude_MISSING,latitude_MISSING,construction_year_MISSING,gps_height_MISSING,population_MISSING,year_recorded,month_recorded,day_recorded,years,years_MISSING
0,69572,6000.0,Roman,1390.0,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109.0,True,GeoData Consultants Ltd,VWC,Roman,False,1999.0,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional,False,False,False,False,False,2011,3,14,12.0,False
1,8776,0.0,Grumeti,1399.0,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280.0,,GeoData Consultants Ltd,Other,,True,2010.0,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional,False,False,False,False,False,2013,3,6,3.0,False
2,34310,25.0,Lottery Club,686.0,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250.0,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009.0,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional,False,False,False,False,False,2013,2,25,4.0,False
3,67743,0.0,Unicef,263.0,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58.0,True,GeoData Consultants Ltd,VWC,,True,1986.0,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional,False,False,False,False,False,2013,1,28,27.0,False
4,19728,0.0,Action In A,,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,,True,GeoData Consultants Ltd,,,True,,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional,False,False,True,True,True,2011,7,13,,True


In [0]:
# Baseline Determine majority class
target = 'status_group'
y_train = train[target]
y_train.value_counts(normalize=True)

functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
Name: status_group, dtype: float64

In [0]:
# The status_group column is the target
target = 'status_group'

# Get a dataframe with all train columns except the target & id
train_features = train.drop(columns=[target])

# Get a list of the numeric features
numeric_features = train_features.select_dtypes(include='number').columns.tolist()

# Get a series with the cardinality of the nonnumeric features
cardinality = train_features.select_dtypes(exclude='number').nunique()

# Get a list of all categorical features with cardinality <= 50
categorical_features = cardinality[cardinality <= 50].index.tolist()

# Combine the lists 
features = numeric_features + categorical_features
print(len(features))
print(features)

42
['id', 'amount_tsh', 'gps_height', 'longitude', 'latitude', 'num_private', 'region_code', 'district_code', 'population', 'construction_year', 'year_recorded', 'month_recorded', 'day_recorded', 'years', 'basin', 'region', 'public_meeting', 'recorded_by', 'scheme_management', 'permit', 'extraction_type', 'extraction_type_group', 'extraction_type_class', 'management', 'management_group', 'payment', 'payment_type', 'water_quality', 'quality_group', 'quantity', 'quantity_group', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group', 'longitude_MISSING', 'latitude_MISSING', 'construction_year_MISSING', 'gps_height_MISSING', 'population_MISSING', 'years_MISSING']


## Use scikit-learn for hyperparameter optimization with RandomizedSearchCV.


In [0]:
#You won't be able to just copy from the lesson notebook to this assignment.

#Because the lesson was regression, but the assignment is classification.
#Because the lesson used TargetEncoder, which doesn't work as-is for multi-class classification.

In [0]:
#So you will have to adapt the example, which is good real-world practice.

#Use a model for classification, such as RandomForestClassifier
#Use hyperparameters that match the classifier, such as randomforestclassifier__ ...
#Use a metric for classification, such as scoring='accuracy'
#If you’re doing a multi-class classification problem — such as whether a waterpump is functional, functional needs 
#repair, or nonfunctional — then use a categorical encoding that works for multi-class classification, 
#such as OrdinalEncoder)

In [0]:
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from random import randint 

X_train = train.drop(columns=target)
y_train = train[target]
X_test = test


#Create pipeline check if this can be used for classification
# pipeline, identical to above but with ordinal encoder
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True, cols=['basin']), 
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

param_distributions = {
    'simpleimputer__strategy': ['mean', 'median'], 
    'RandomForestClassifier__n_estimators': randint(50, 500), 
    'RandomForestClassifier__max_depth': [5, 10, 15, 20, None], 
}
# If you're on Colab, decrease n_iter & cv parameters
search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=100, 
    cv=5, 
    scoring='neg_mean_absolute_error', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)

In [0]:
print('Model Hyperparameters:')
print(pipeline.named_steps['randomforestclassifier'])


Model Hyperparameters:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)


In [0]:
search.fit(X_train, y_train);

TypeError: ignored

In [0]:
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from random import randint 

X_train = train.drop(columns=target)
y_train = train[target]
X_test = test


#Create pipeline check if this can be used for classification
# pipeline, identical to above but with ordinal encoder
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True, cols=['basin']), 
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

param_distributions = {
    'simpleimputer__strategy': ['mean', 'median'], 
    'randomforestregressor__n_estimators': randint(50, 500), 
    'randomforestregressor__max_depth': [5, 10, 15, 20, None], 
}
# If you're on Colab, decrease n_iter & cv parameters
search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=100, 
    cv=5, 
    scoring='neg_mean_absolute_error', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)
#Got everything to work up until here 

#This returns not sure if right 
#pipeline.fit(X_train, y_train)

#This doesn't work 
#search.fit(X_train, y_train);


In [0]:
search

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=Pipeline(memory=None,
                                      steps=[('onehotencoder',
                                              OneHotEncoder(cols=['basin'],
                                                            drop_invariant=False,
                                                            handle_missing='value',
                                                            handle_unknown='value',
                                                            return_df=True,
                                                            use_cat_names=True,
                                                            verbose=0)),
                                             ('ordinalencoder',
                                              OrdinalEncoder(cols=None,
                                                             drop_invariant=False,
                                                      

In [0]:
#Doesn't work
#search.fit(X_train)

In [0]:
#Doesn't work 
#print('Best hyperparameters', search.best_estimator_)
#print('Cross-validation MAE', -search.best_score_)

In [0]:
search.get_params

<bound method BaseEstimator.get_params of RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=Pipeline(memory=None,
                                      steps=[('onehotencoder',
                                              OneHotEncoder(cols=['basin'],
                                                            drop_invariant=False,
                                                            handle_missing='value',
                                                            handle_unknown='value',
                                                            return_df=True,
                                                            use_cat_names=True,
                                                            verbose=0)),
                                             ('ordinalencoder',
                                              OrdinalEncoder(cols=None,
                                                             drop_invariant=False,
            

In [0]:
search.predict

<function sklearn.model_selection._search.BaseSearchCV.predict>

In [0]:
search.predict_log_proba

<function sklearn.model_selection._search.BaseSearchCV.predict_log_proba>

In [0]:
# Doesn't work 
#pd.DataFrame(search.cv_results_).sort_values(by='rank_test_score').T

In [0]:
# Fit on train, score on val
pipeline.fit(X_train, y_train)
print('Validation Accuracy', pipeline.score(X_val, y_val))

search.fit(X_train, y_train);

NameError: ignored

##  Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue Submit Predictions button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)


In [0]:
submission = test[['id']].copy()
submission['status_group'] = y_pred
# submission['status_group']
submission.to_csv('3status_group_ED.csv', index=False)

In [0]:
from google.colab import files
files.download("3status_group_ED.csv")

##  Commit your notebook to your fork of the GitHub repo.