 Lambda School Data Science, Unit 2: Predictive Modeling

 # Regression & Classification, Module 4


 ## Assignment

 - [ ] Watch Aaron's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.
 - [ ] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition.
 - [ ] Do train/validate/test split with the Tanzania Waterpumps data.
 - [ ] Begin with baselines for classification.
 - [ ] Use scikit-learn for logistic regression.
 - [ ] Get your validation accuracy score.
 - [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
 - [ ] Commit your notebook to your fork of the GitHub repo.
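
The split, fit, and score steps in the checklist above can be sketched as follows. This is a minimal illustration on synthetic data, not the waterpumps solution: the feature names, the 80/20 split, and the `random_state` values are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): two numeric features, binary label.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    'feature_a': rng.normal(size=500),
    'feature_b': rng.normal(size=500),
})
y = (X['feature_a'] + 0.5 * X['feature_b'] > 0).astype(int)

# Hold out 20% of the labeled rows for validation; stratify keeps the
# class proportions the same in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f'Validation accuracy: {val_accuracy:.3f}')
```

The Kaggle test set has no labels, so the validation rows carved out of the training data are the only place to measure accuracy before submitting.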

 ---


 ## Stretch Goals

 - [ ] Add your own stretch goal(s)!
 - [ ] Clean the data. For ideas, refer to [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide),  a "reference to problems seen in real-world data along with suggestions on how to resolve them." One of the issues is ["Zeros replace missing values."](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values)
 - [ ] Make exploratory visualizations.
 - [ ] Do one-hot encoding. For example, you could try `quantity`, `basin`, `extraction_type_class`, and more. (But remember, one-hot encoding may not work well with high-cardinality categoricals.)
 - [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
 - [ ] Get and plot your coefficients.
 - [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

 ---

 ## Data Dictionary

 ### Features

 Your goal is to predict the operating condition of a waterpoint for each record in the dataset. You are provided the following set of information about the waterpoints:

 - `amount_tsh` : Total static head (amount of water available to the waterpoint)
 - `date_recorded` : The date the row was entered
 - `funder` : Who funded the well
 - `gps_height` : Altitude of the well
 - `installer` : Organization that installed the well
 - `longitude` : GPS coordinate
 - `latitude` : GPS coordinate
 - `wpt_name` : Name of the waterpoint if there is one
 - `num_private` :
 - `basin` : Geographic water basin
 - `subvillage` : Geographic location
 - `region` : Geographic location
 - `region_code` : Geographic location (coded)
 - `district_code` : Geographic location (coded)
 - `lga` : Geographic location
 - `ward` : Geographic location
 - `population` : Population around the well
 - `public_meeting` : True/False
 - `recorded_by` : Group entering this row of data
 - `scheme_management` : Who operates the waterpoint
 - `scheme_name` : Who operates the waterpoint
 - `permit` : If the waterpoint is permitted
 - `construction_year` : Year the waterpoint was constructed
 - `extraction_type` : The kind of extraction the waterpoint uses
 - `extraction_type_group` : The kind of extraction the waterpoint uses
 - `extraction_type_class` : The kind of extraction the waterpoint uses
 - `management` : How the waterpoint is managed
 - `management_group` : How the waterpoint is managed
 - `payment` : What the water costs
 - `payment_type` : What the water costs
 - `water_quality` : The quality of the water
 - `quality_group` : The quality of the water
 - `quantity` : The quantity of water
 - `quantity_group` : The quantity of water
 - `source` : The source of the water
 - `source_type` : The source of the water
 - `source_class` : The source of the water
 - `waterpoint_type` : The kind of waterpoint
 - `waterpoint_type_group` : The kind of waterpoint

 ### Labels

 There are three possible values:

 - `functional` : the waterpoint is operational and there are no repairs needed
 - `functional needs repair` : the waterpoint is operational, but needs repairs
 - `non functional` : the waterpoint is not operational
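
With three classes, a natural classification baseline is to always predict the most frequent label. The sketch below computes that baseline on a made-up label column; in the notebook the real column would be `train_labels['status_group']`, and the class proportions shown here are invented for illustration.

```python
import pandas as pd

# Hypothetical labels; the proportions are made up for this example.
labels = pd.Series(
    ['functional'] * 6 + ['non functional'] * 3 + ['functional needs repair'],
    name='status_group')

# Share of each class, then the most frequent class and its share.
class_shares = labels.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()

print(class_shares)
print(f'Always predicting "{majority_class}" scores {baseline_accuracy:.2f}')
```

Any model worth submitting should beat this number on the validation set.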

 ---

 ## Generate a submission

 Your code to generate a submission file may look like this:

 ```python
 # estimator is your model or pipeline, which you've fit on X_train

 # X_test is your pandas dataframe or numpy array,
 # with the same number of rows, in the same order, as test_features.csv,
 # and the same number of columns, in the same order, as X_train

 y_pred = estimator.predict(X_test)


 # Makes a dataframe with two columns, id and status_group,
 # and writes to a csv file, without the index

 import pandas

 sample_submission = pandas.read_csv('sample_submission.csv')
 submission = sample_submission.copy()
 submission['status_group'] = y_pred
 submission.to_csv('your-submission-filename.csv', index=False)
 ```

 If you're working locally, the csv file is saved in the same directory as your notebook.

 If you're using Google Colab, you can use this code to download your submission csv file.

 ```python
 from google.colab import files
 files.download('your-submission-filename.csv')
 ```
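
Before uploading, it can help to check that the submission has the same layout as the sample file. The helper below, `check_submission`, is a hypothetical name introduced for this sketch, and the two-row frames stand in for the real CSV files. It assumes the competition expects the original label strings in `status_group`.

```python
import pandas as pd

VALID_LABELS = {'functional', 'functional needs repair', 'non functional'}

def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> None:
    """Raise AssertionError if `submission` does not match the sample's layout."""
    assert list(submission.columns) == list(sample.columns), 'column mismatch'
    assert len(submission) == len(sample), 'row count mismatch'
    assert submission['id'].equals(sample['id']), 'ids differ or are reordered'
    assert set(submission['status_group']) <= VALID_LABELS, 'unexpected label'

# Tiny hypothetical frames standing in for the real CSV files.
sample = pd.DataFrame({'id': [1, 2], 'status_group': ['functional'] * 2})
submission = sample.copy()
submission['status_group'] = ['non functional', 'functional']

check_submission(submission, sample)  # passes silently
```

If the labels were encoded as integers for training, the predictions would need to be mapped back to the label strings before a check like this passes.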

 ---

In [2]:
# Read the Tanzania Waterpumps data
# train_features.csv : the training set features
# train_labels.csv : the training set labels
# test_features.csv : the test set features
# sample_submission.csv : a sample submission file in the correct format
    
import pandas
import numpy

train_features = pandas.read_csv('./data/waterpumps/train_features.csv')
train_labels = pandas.read_csv('./data/waterpumps/train_labels.csv')
test_features = pandas.read_csv('./data/waterpumps/test_features.csv')
sample_submission = pandas.read_csv('./data/waterpumps/sample_submission.csv')

assert train_features.shape == (59400, 40)
assert train_labels.shape == (59400, 2)
assert test_features.shape == (14358, 40)
assert sample_submission.shape == (14358, 2)



In [3]:
train_features.head()


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
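
Several numeric columns in this preview contain zeros (for example `amount_tsh` and `gps_height`). Whether a zero is a real measurement or a placeholder for a missing value has to be judged column by column. Tanzania lies well east of the prime meridian, so a `longitude` of exactly 0 cannot be a real location, and a `construction_year` of 0 is not a plausible year. A sketch of converting such zeros to NaN, using made-up rows and treating the choice of columns as an assumption:

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the data dictionary.
df = pd.DataFrame({
    'longitude': [34.938093, 0.0, 37.460664],
    'construction_year': [1999, 0, 2009],
})

# Assumption: in these two columns 0 is a placeholder, not a measurement.
zero_means_missing = ['longitude', 'construction_year']
df[zero_means_missing] = df[zero_means_missing].replace(0, np.nan)

print(df.isna().sum())
```

Once the zeros are NaN, they show up in `isna().sum()` and can be imputed or flagged like any other missing value.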


In [4]:
train_features.isna().sum()
# We don't impute these separately: every column with missing values is
# categorical, and clean_X (below) lumps NaN into the 'other' category
# before one-hot encoding.


id                           0
amount_tsh                   0
date_recorded                0
funder                    3635
gps_height                   0
installer                 3655
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 371
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            3334
recorded_by                  0
scheme_management         3877
scheme_name              28166
permit                    3056
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_

In [5]:
train_labels.head()


Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional
3,67743,non functional
4,19728,functional


In [6]:
train_kaggle = pandas.merge(train_features, train_labels, on='id')
train_kaggle.head()


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional


In [7]:
train_features.describe(exclude=[numpy.number])


Unnamed: 0,date_recorded,funder,installer,wpt_name,basin,subvillage,region,lga,ward,public_meeting,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
count,59400,55765,55745,59400,59400,59029,59400,59400,59400,56066,...,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400
unique,356,1897,2145,37400,9,19287,21,125,2092,2,...,7,8,6,5,5,10,7,3,7,6
top,2011-03-15,Government Of Tanzania,DWE,none,Lake Victoria,Madukani,Iringa,Njombe,Igosi,True,...,never pay,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
freq,572,9084,17402,3563,10248,508,5294,2503,307,51011,...,25348,50818,50818,33186,33186,17021,17021,45794,28522,34625
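
The `unique` row above is the cardinality of each categorical column, which ranges from a handful of values to tens of thousands. A short sketch of picking out the low-cardinality columns programmatically, using made-up rows and an arbitrary threshold:

```python
import pandas as pd

# Made-up rows standing in for train_features.
df = pd.DataFrame({
    'basin': ['Lake Victoria', 'Pangani', 'Lake Victoria', 'Rufiji'],
    'wpt_name': ['none', 'Zahanati', 'Kwa Mahundi', 'Shuleni'],
    'gps_height': [1390, 1399, 686, 263],
})

# Number of distinct values per non-numeric column, smallest first.
cardinality = df.select_dtypes(exclude='number').nunique().sort_values()

max_cardinality = 3  # arbitrary threshold for this sketch
low_cardinality_cols = cardinality[cardinality <= max_cardinality].index.tolist()
print(low_cardinality_cols)  # ['basin']
```

Columns above the threshold can be dropped, or have their rare values lumped together before encoding.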



 ## One-hot encoding and pre-split cleaning


In [8]:
import category_encoders
from typing import Optional
import numpy

def keepTopN(	column:pandas.Series,
				n:int,
				default:Optional[object] = None) -> pandas.Series:
	"""
	Keeps the top n most popular values of a Series, while replacing the rest with `default`
	
	Args:
		column (pandas.Series): Series to operate on
		n (int): How many values to keep
		default (object, optional): Defaults to NaN. Value with which to replace remaining values
	
	Returns:
		pandas.Series: Series with the most popular n values
	"""

	if default is None: default = numpy.nan

	val_counts = column.value_counts()
	if n > len(val_counts): n = len(val_counts)
	top_n = list(val_counts[:n].index)
	return(column.where(column.isin(top_n), other=default))

def oneHot(	frame:pandas.DataFrame, 
			cols:Optional[list] = None,
			exclude_cols:Optional[list] = None,
			max_cardinality:Optional[int] = None) -> pandas.DataFrame:
	"""
	One-hot encodes the dataframe.
	
	Args:
		frame (pandas.DataFrame): Dataframe to clean
		cols (list, optional): Columns to one-hot encode. Defaults to all string columns.
		exclude_cols (list, optional): Columns to skip one-hot encoding. Defaults to None.
		max_cardinality (int, optional): Maximum cardinality of columns to encode. Defaults to no maximum cardinality.
	
	Returns:
		pandas.DataFrame: The one_hot_encoded dataframe.
	"""


	one_hot_encoded = frame.copy()

	if cols is None:
		cols = list(one_hot_encoded.columns[one_hot_encoded.dtypes == 'object'])

	if exclude_cols is not None:
		# Build a new list so the caller's `cols` list is not mutated
		cols = [col for col in cols if col not in exclude_cols]

	if max_cardinality is not None:
		described = one_hot_encoded[cols].describe(exclude=[numpy.number])
		cols = list(described.columns[described.loc['unique'] <= max_cardinality])

	encoder = category_encoders.OneHotEncoder(return_df=True, use_cat_names=True, cols=cols)
	one_hot_encoded = encoder.fit_transform(one_hot_encoded)

	return(one_hot_encoded)
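
The core idea of `keepTopN` can be seen on a tiny Series. This standalone sketch repeats the same `value_counts` / `isin` / `where` steps on made-up values, including one missing entry:

```python
import pandas as pd

installer = pd.Series(
    ['DWE', 'DWE', 'DWE', 'Government', 'Government', 'RWE', None])

# The two most frequent values; value_counts() skips missing values by default.
top_2 = installer.value_counts().index[:2]

# Anything outside the top 2, including the missing value, becomes 'other'.
lumped = installer.where(installer.isin(top_2), other='other')
print(lumped.tolist())
# ['DWE', 'DWE', 'DWE', 'Government', 'Government', 'other', 'other']
```

Because a missing value is never in the top-n list, it is replaced by the default along with the rare categories.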


In [9]:
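def clean_X(df, max_ordinality=100, int_ts=False):
	"""
	Drops `recorded_by`, lumps rare categories into 'other', and optionally
	converts `date_recorded` to an integer timestamp.

	Args:
		df (pandas.DataFrame): Raw features
		max_ordinality (int): Number of most frequent values to keep per categorical column
		int_ts (bool): If True, replace `date_recorded` with the integer column `date_recorded_ts`

	Returns:
		pandas.DataFrame: The cleaned dataframe
	"""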

	cleaned = df.copy().drop(columns=['recorded_by'])

	categorical_description = cleaned.describe(exclude=[numpy.number])
	if int_ts: 
		cat_cols = categorical_description.drop(columns=['date_recorded']).columns
	else:
		cat_cols = categorical_description.columns
	# high_ordinality_cols = categorical_description[categorical_description.loc['unique'] > max_ordinality].columns
	
	for col in cat_cols:
		cleaned[col] = keepTopN(cleaned[col], max_ordinality, default='other')

	if int_ts:
		cleaned['date_recorded_dt'] = pandas.to_datetime(df['date_recorded'])
		# astype replaces Series.view, which is deprecated in recent pandas versions
		cleaned['date_recorded_ts'] = cleaned['date_recorded_dt'].astype('int64')

		return(cleaned.drop(columns=['date_recorded_dt', 'date_recorded']))
	else:
		return(cleaned)
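
With `int_ts=True`, `date_recorded` becomes an integer count of time units since 1970-01-01. The sketch below shows the conversion on two example date strings. Pinning the unit to nanoseconds is an assumption added here so that the integers do not depend on the pandas version.

```python
import pandas as pd

dates = pd.Series(['2011-03-13', '2013-02-08'])
as_datetime = pd.to_datetime(dates)

# Fix the unit to nanoseconds, then reinterpret as plain integers.
as_int = as_datetime.astype('datetime64[ns]').astype('int64')
print(as_int.tolist())
# [1299974400000000000, 1360281600000000000]
```

Values of this magnitude are far larger than the other features, so feature scaling may matter more once this column is included.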


In [10]:
# Encode the three classes as integers (the numbering itself is arbitrary)
label_codes = {'functional': 1, 'functional needs repair': 2, 'non functional': 3}
train_targets = train_labels.sort_values(by=['id'])['status_group'].map(label_codes)


In [11]:
train_targets.isna().sum()


0

In [12]:
train_targets


9410     3
18428    1
12119    1
10629    1
2343     3
        ..
15137    1
8667     1
22584    3
108      3
39131    3
Name: status_group, Length: 59400, dtype: int64

In [13]:

combined = pandas.concat([train_features, test_features])

cleaned_combined = oneHot(clean_X(combined, max_ordinality=200, int_ts=True))
cleaned_train = cleaned_combined[cleaned_combined['id'].isin(train_features['id'])].sort_values(by=['id'])
cleaned_test = cleaned_combined[cleaned_combined['id'].isin(test_features['id'])].sort_values(by=['id'])
assert list(cleaned_train.columns) == list(cleaned_test.columns)
assert list(cleaned_train['id']) == list(train_labels.sort_values(by=['id'])['id'])
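
Concatenating train and test before encoding guarantees that both end up with the same columns, which is what the first assertion checks. A common alternative, shown here only as a sketch and not as what this notebook does, is to encode the two sets separately and align the test columns to the training columns. It uses `pandas.get_dummies` on made-up rows.

```python
import pandas as pd

train = pd.DataFrame({'basin': ['Pangani', 'Rufiji', 'Pangani']})
test = pd.DataFrame({'basin': ['Rufiji', 'Lake Nyasa']})

train_encoded = pd.get_dummies(train, columns=['basin'], dtype=int)
test_encoded = pd.get_dummies(test, columns=['basin'], dtype=int)

# Keep exactly the training columns: categories seen only in the test set
# are dropped, and categories missing from the test set are filled with zeros.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(test_aligned)
```

Here the test-only category `Lake Nyasa` ends up as a row of zeros, so the model never sees a column it was not trained on.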


In [14]:
set(cleaned_train.columns) - set(cleaned_test.columns)


set()

In [15]:
pandas.set_option('display.max_columns', 500)
# train_features.describe(exclude=[numpy.number])

In [16]:
dict(cleaned_combined.dtypes)


{'id': dtype('int64'),
 'amount_tsh': dtype('float64'),
 'funder_Roman': dtype('int64'),
 'funder_Grumeti': dtype('int64'),
 'funder_other': dtype('int64'),
 'funder_Unicef': dtype('int64'),
 'funder_Mkinga Distric Coun': dtype('int64'),
 'funder_Dwsp': dtype('int64'),
 'funder_Rwssp': dtype('int64'),
 'funder_Wateraid': dtype('int64'),
 'funder_Private': dtype('int64'),
 'funder_Danida': dtype('int64'),
 'funder_World Vision': dtype('int64'),
 'funder_Lawatefuka Water Supply': dtype('int64'),
 'funder_Rudep': dtype('int64'),
 'funder_Hesawa': dtype('int64'),
 'funder_Twe': dtype('int64'),
 'funder_Isf': dtype('int64'),
 'funder_African Development Bank': dtype('int64'),
 'funder_Government Of Tanzania': dtype('int64'),
 'funder_Water': dtype('int64'),
 'funder_Private Individual': dtype('int64'),
 'funder_Undp': dtype('int64'),
 'funder_Kirde': dtype('int64'),
 'funder_Cefa': dtype('int64'),
 'funder_Ces(gmbh)': dtype('int64'),
 'funder_European Union': dtype('int64'),
 'funder_Lga': 

In [17]:
cleaned_test


Unnamed: 0,id,amount_tsh,...,date_recorded_ts
3296,10,0.0,...,1299974400000000000
13662,13,0.0,...,1360281600000000000
5518,14,0.0,...,1364342400000000000
11349,29,0.0,...,1302480000000000000
758,32,0.0,...,1311724800000000000
...,...,...,...,...
10514,74241,0.0,...,1300147200000000000
7445,74244,0.0,...,1360800000000000000
6195,74245,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1420,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1970,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1362355200000000000
11051,74248,0.0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1280,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2011,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1360972800000000000


In [18]:
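import sklearn.preprocessing as preprocessing

# Standardize every feature column except the 'id' identifier.
# Note: the scaler is fit on all of cleaned_train, before the train/validation
# split below, so validation rows influence the scaling statistics.
scaler = preprocessing.StandardScaler()
scaled_train = scaler.fit_transform(cleaned_train.drop(columns=['id']))
scaled_train_ids = cleaned_train['id']
# Reuse the statistics learned from the training data; do not refit on the test data.
scaled_test = scaler.transform(cleaned_test.drop(columns=['id']))
scaled_test_ids = cleaned_test['id']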
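
One way to keep the scaling statistics from seeing validation rows is to put the scaler and the model in a single scikit-learn pipeline and fit it only on the training split. The following is a minimal sketch under stated assumptions: the data is synthetic (generated with `make_classification` as a stand-in for the waterpump features), and the names `X_demo`, `y_demo`, `X_fit`, `X_val`, and `demo_pipeline` are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy
import sklearn.datasets as datasets
import sklearn.linear_model as linear_model
import sklearn.model_selection as model_selection
import sklearn.pipeline as pipeline
import sklearn.preprocessing as preprocessing

# Synthetic three-class data standing in for the real features (assumption).
X_demo, y_demo = datasets.make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3, random_state=1)

# Split first, so the validation rows stay untouched until scoring.
X_fit, X_val, y_fit, y_val = model_selection.train_test_split(
    X_demo, y_demo, random_state=1)

# The scaler is fit inside the pipeline, so it only ever sees X_fit.
demo_pipeline = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression(max_iter=1000),
)
demo_pipeline.fit(X_fit, y_fit)

val_accuracy = demo_pipeline.score(X_val, y_val)
print(f'Validation accuracy: {val_accuracy:.3f}')
```

Calling `demo_pipeline.score(X_val, y_val)` applies the training-split mean and standard deviation to the validation rows before predicting, which is the same thing `scaler.transform` does for `cleaned_test` above.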


In [20]:
import sklearn.linear_model as linear_model
import sklearn.model_selection as model_selection

# Hold out 25% of the rows (the train_test_split default) as a validation set.
X_train_train, X_train_test, y_train_train, y_train_test = model_selection.train_test_split(
    scaled_train, train_targets, random_state=1)

print(f'X_train_train: {X_train_train.shape}')
print(f'X_train_test: {X_train_test.shape}')
print(f'y_train_train: {y_train_train.shape}')
print(f'y_train_test: {y_train_test.shape}')

lr_model = linear_model.LogisticRegression(solver='lbfgs', multi_class='auto')
lr_model.fit(X_train_train, y_train_train)


X_train_train: (44550, 1516)
X_train_test: (14850, 1516)
y_train_train: (44550,)
y_train_test: (14850,)




LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [21]:
y_pred_train_train = lr_model.predict(X_train_train)
y_pred_train_test = lr_model.predict(X_train_test)
y_pred_train_test.shape


(14850,)

In [22]:
y_train_test.shape


(14850,)

In [23]:
y_pred_train_test


array([1, 3, 2, ..., 1, 3, 2])

In [24]:
from scipy.stats import mode
mode(train_targets)


ModeResult(mode=array([1]), count=array([32259]))

In [25]:
import sklearn.metrics as metrics
# Majority-class baseline: predict class 1 ('functional') for every row.
print(f'Baseline (all functional) for train.train: {metrics.accuracy_score(y_train_train, numpy.full(y_train_train.shape[0], 1))}')


Baseline (all functional) for train.train: 0.5432996632996633
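
The same majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`, which finds the most frequent class itself instead of relying on a hard-coded `1`. This is a small sketch on a made-up target vector; `y_demo`, `X_placeholder`, and `baseline` are hypothetical names, and the ten labels are invented for illustration rather than taken from the competition data.

```python
import numpy
import sklearn.dummy as dummy
import sklearn.metrics as metrics

# Hypothetical integer-coded targets, using the same 1/2/3 coding as this notebook.
y_demo = numpy.array([1, 1, 3, 1, 2, 3, 1, 1, 3, 1])
# DummyClassifier ignores the features, so a column of zeros is enough.
X_placeholder = numpy.zeros((y_demo.shape[0], 1))

# 'most_frequent' always predicts the most common class seen during fit.
baseline = dummy.DummyClassifier(strategy='most_frequent')
baseline.fit(X_placeholder, y_demo)
baseline_pred = baseline.predict(X_placeholder)

baseline_accuracy = metrics.accuracy_score(y_demo, baseline_pred)
print(f'Majority class: {baseline_pred[0]}, baseline accuracy: {baseline_accuracy}')
```

Six of the ten made-up labels are class 1, so the baseline accuracy here is 0.6. Fitting the dummy model on the training split and scoring it on the validation split would give the validation baseline in the same way.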


In [26]:
print('Logistic Regression score:')
lr_model.score(X_train_train,y_train_train)


Logistic Regression score:


0.784983164983165

In [27]:
metrics.accuracy_score(y_train_train, y_pred_train_train)


0.784983164983165

In [28]:
# Same majority-class baseline, evaluated on the held-out validation split.
print(f'Baseline (all functional) for train.test: {metrics.accuracy_score(y_train_test, numpy.full(y_train_test.shape[0], 1))}')


Baseline (all functional) for train.test: 0.5424242424242425


In [29]:
print('Logistic Regression score:')
lr_model.score(X_train_test,y_train_test)


Logistic Regression score:


0.7640404040404041

In [30]:
metrics.accuracy_score(y_train_test, y_pred_train_test)



0.7640404040404041

In [31]:
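# cv_generator = model_selection.KFold(n_splits=5)

# 10-fold cross-validation over the default grid of 10 values of C.
# max_iter=20 is well below the default of 100, so lbfgs may stop before converging.
lrCV_model = linear_model.LogisticRegressionCV(solver='lbfgs', multi_class='auto', cv=10, n_jobs=-1, random_state=1, max_iter=20)

# lrCV_model = linear_model.LogisticRegressionCV(solver='lbfgs', multi_class='auto', cv=3, n_jobs=-1, random_state=1)
# Fit on the full training set, not only on X_train_train.
lrCV_model.fit(scaled_train, train_targets)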




LogisticRegressionCV(Cs=10, class_weight=None, cv=10, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=20, multi_class='auto', n_jobs=-1, penalty='l2',
                     random_state=1, refit=True, scoring=None, solver='lbfgs',
                     tol=0.0001, verbose=0)

In [32]:
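# lrCV_model was fit on all of scaled_train, which includes the X_train_test rows,
# so the "train.test" predictions below are not held-out predictions.
y_pred_train_train = lrCV_model.predict(X_train_train)
y_pred_train_test = lrCV_model.predict(X_train_test)
y_pred_train_test.shape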


(14850,)

In [33]:
print('Logistic Regression Cross Validation score for train.train:')
lrCV_model.score(X_train_train,y_train_train)


Logistic Regression Cross Validation score for train.train:


0.7824242424242425

In [34]:
# Not a true validation score: the model has already seen these rows during fit.
print('Logistic Regression Cross Validation score for train.test:')
lrCV_model.score(X_train_test,y_train_test)


Logistic Regression Cross Validation score for train.test:


0.7801346801346801

In [35]:
print('Logistic Regression Cross Validation score for train:')
lrCV_model.score(scaled_train, train_targets)


Logistic Regression Cross Validation score for train:


0.7818518518518518
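
Because `lrCV_model` is fit on all of `scaled_train`, the three scores above are all measured on rows the model was trained on. A held-out estimate needs the cross-validated fit to be restricted to the training split. The sketch below shows that pattern on synthetic data; `X_demo`, `y_demo`, `X_fit`, `X_val`, and `cv_model` are hypothetical names, and the settings `Cs=5`, `cv=3`, and `max_iter=1000` are illustrative choices, not the values used in this notebook.

```python
import sklearn.datasets as datasets
import sklearn.linear_model as linear_model
import sklearn.model_selection as model_selection

# Synthetic three-class data standing in for the scaled features (assumption).
X_demo, y_demo = datasets.make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3, random_state=1)
X_fit, X_val, y_fit, y_val = model_selection.train_test_split(
    X_demo, y_demo, random_state=1)

# The search over C uses folds drawn from X_fit only; X_val is never seen during fit.
cv_model = linear_model.LogisticRegressionCV(Cs=5, cv=3, max_iter=1000)
cv_model.fit(X_fit, y_fit)

fit_accuracy = cv_model.score(X_fit, y_fit)
held_out_accuracy = cv_model.score(X_val, y_val)
print(f'Accuracy on the fitting split: {fit_accuracy:.3f}')
print(f'Accuracy on the held-out split: {held_out_accuracy:.3f}')
```

Once a value of C has been chosen this way, refitting on the full training set before predicting on the Kaggle test set is a reasonable final step, as long as the reported validation score comes from the held-out split.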

In [36]:
# import sklearn.model_selection as model_selection
# train, test = model_selection.train_test_split(train_kaggle, random_state=1)

# print(f'train: {train.shape}')
# print(f'test: {test.shape}')


In [37]:
scaled_test_ids



3296        10
13662       13
5518        14
11349       29
758         32
         ...  
10514    74241
7445     74244
6195     74245
11051    74248
5318     74249
Name: id, Length: 14358, dtype: int64

In [38]:

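# Predict on the scaled test features and index the result by waterpoint id.
y_pred = lrCV_model.predict(scaled_test)
out_df = pandas.DataFrame(y_pred, index=scaled_test_ids, columns=['status_group'])
# Map the integer class codes back to the original status_group labels.
out_df['status_group'] = out_df['status_group'].replace({1: 'functional', 2: 'functional needs repair', 3: 'non functional'})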


In [39]:
out_df = out_df.reset_index()


In [40]:
out_df

# set(train_features['date_recorded'].unique()) - set(test_features['date_recorded'].unique())


,id,status_group
0,10,functional
1,13,non functional
2,14,functional
3,29,non functional
4,32,functional
...,...,...
14353,74241,functional
14354,74244,functional
14355,74245,functional
14356,74248,functional


In [41]:
# submission = pandas.concat([scaled_test_ids, out_df], axis=1)


In [42]:
# submission


In [43]:
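# index=False leaves only the id and status_group columns in the file.
# The ./module4 directory must already exist.
out_df.to_csv('./module4/results.csv', index=False)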
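
Before uploading, it can help to check that the frame has the shape a submission needs. The sketch below is one possible check, not part of the competition tooling: `check_submission`, `valid_labels`, and `demo_submission` are hypothetical names, and the three-row frame is made up to mirror the `id` and `status_group` columns of `out_df`.

```python
import pandas

# The three label strings used in the status_group mapping in this notebook.
valid_labels = {'functional', 'functional needs repair', 'non functional'}

# Hypothetical three-row frame with the same columns as out_df.
demo_submission = pandas.DataFrame({
    'id': [10, 13, 14],
    'status_group': ['functional', 'non functional', 'functional'],
})


def check_submission(frame, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(frame.columns) != ['id', 'status_group']:
        # Stop early: the remaining checks need both columns to exist.
        return ['unexpected columns']
    problems = []
    if len(frame) != expected_rows:
        problems.append('unexpected row count')
    if frame['id'].duplicated().any():
        problems.append('duplicate ids')
    if not set(frame['status_group']).issubset(valid_labels):
        problems.append('unknown status_group label')
    return problems


print(check_submission(demo_submission, expected_rows=3))
```

For the real frame, the call would be `check_submission(out_df, expected_rows=len(scaled_test_ids))`, assuming `out_df` has already had its index reset as in the cells above.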

