<a href="https://colab.research.google.com/github/accarter/DS-Unit-2-Kaggle-Challenge/blob/master/module3-cross-validation/LS_DS_223_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 3*

---

# Cross-Validation


## Assignment
- [ ] [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Continue to participate in our Kaggle challenge. 
- [ ] Use scikit-learn for hyperparameter optimization with RandomizedSearchCV.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


**You can't just copy** from the lesson notebook to this assignment.

- Because the lesson was **regression**, but the assignment is **classification.**
- Because the lesson used [TargetEncoder](https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html), which doesn't work as-is for _multi-class_ classification.

So you will have to adapt the example, which is good real-world practice.

1. Use a model for classification, such as [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
2. Use hyperparameters that match the classifier, such as `randomforestclassifier__ ...`
3. Use a metric for classification, such as [`scoring='accuracy'`](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values)
4. If you’re doing a multi-class classification problem (such as whether a water pump is functional, functional needs repair, or nonfunctional), then use a categorical encoding that works for multi-class classification, such as [OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html) (not [TargetEncoder](https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html))
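
The four adaptations above can be sketched in one small example. This is only an illustration on synthetic data: the column names and values are made up, and scikit-learn's own `OrdinalEncoder` stands in for the `category_encoders` one so that the sketch has no extra dependencies.

```python
import numpy as np
import pandas as pd
from scipy.stats import randint
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Tiny synthetic stand-in for the water pump data (hypothetical values)
rng = np.random.default_rng(42)
n = 120
X_demo = pd.DataFrame({
    'quantity': rng.choice(['enough', 'dry', 'seasonal'], size=n),
    'gps_height': rng.normal(1000, 300, size=n),
})
X_demo.loc[0, 'gps_height'] = np.nan  # give the imputer something to do
y_demo = pd.Series(rng.choice(
    ['functional', 'functional needs repair', 'non functional'], size=n))

demo_pipeline = make_pipeline(
    # ordinal-encode only the categorical column, pass the rest through
    make_column_transformer(
        (OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
         ['quantity']),
        remainder='passthrough'),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(random_state=42),
)

# Parameter names are '<lowercased step name>__<parameter>'
demo_distributions = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__n_estimators': randint(10, 40),
    'randomforestclassifier__max_depth': [3, 5, None],
}

demo_search = RandomizedSearchCV(
    demo_pipeline,
    param_distributions=demo_distributions,
    n_iter=4,
    cv=3,
    scoring='accuracy',
    random_state=42,
)
demo_search.fit(X_demo, y_demo)
print('Best hyperparameters', demo_search.best_params_)
print('Cross-validation accuracy', demo_search.best_score_)
```

Because the labels here are random, the score itself means nothing; the point is the shape of the pipeline, the parameter naming, and the classification metric.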



## Stretch Goals

### Reading
- Jake VanderPlas, [Python Data Science Handbook, Chapter 5.3](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html), Hyperparameters and Model Validation
- Jake VanderPlas, [Statistics for Hackers](https://speakerdeck.com/jakevdp/statistics-for-hackers?slide=107)
- Ron Zacharski, [A Programmer's Guide to Data Mining, Chapter 5](http://guidetodatamining.com/chapter5/), 10-fold cross validation
- Sebastian Raschka, [A Basic Pipeline and Grid Search Setup](https://github.com/rasbt/python-machine-learning-book/blob/master/code/bonus/svm_iris_pipeline_and_gridsearch.ipynb)
- Peter Worcester, [A Comparison of Grid Search and Randomized Search Using Scikit Learn](https://blog.usejournal.com/a-comparison-of-grid-search-and-randomized-search-using-scikit-learn-29823179bc85)

### Doing
- Add your own stretch goals!
- Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/). See the previous assignment notebook for details.
- In addition to `RandomizedSearchCV`, scikit-learn has [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). Another library called scikit-optimize has [`BayesSearchCV`](https://scikit-optimize.github.io/notebooks/sklearn-gridsearchcv-replacement.html). Experiment with these alternatives.
- _[Introduction to Machine Learning with Python](http://shop.oreilly.com/product/0636920030515.do)_ discusses options for "Grid-Searching Which Model To Use" in Chapter 6:

> You can even go further in combining GridSearchCV and Pipeline: it is also possible to search over the actual steps being performed in the pipeline (say whether to use StandardScaler or MinMaxScaler). This leads to an even bigger search space and should be considered carefully. Trying all possible solutions is usually not a viable machine learning strategy. However, here is an example comparing a RandomForestClassifier and an SVC ...

The example is shown in [the accompanying notebook](https://github.com/amueller/introduction_to_ml_with_python/blob/master/06-algorithm-chains-and-pipelines.ipynb), code cells 35-37. Could you apply this concept to your own pipelines?


### BONUS: Stacking!

Here's some code you can use to "stack" multiple submissions by majority vote, which is another form of ensembling:

```python
import pandas as pd

# Filenames of your submissions you want to ensemble
files = ['submission-01.csv', 'submission-02.csv', 'submission-03.csv']

target = 'status_group'
submissions = (pd.read_csv(file)[[target]] for file in files)
ensemble = pd.concat(submissions, axis='columns')
majority_vote = ensemble.mode(axis='columns')[0]

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission[target] = majority_vote
submission.to_csv('my-ultimate-ensemble-submission.csv', index=False)
```
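
To see what the `mode(axis='columns')[0]` step does, here is a small in-memory sketch. The three "models" and their predictions are made up; no submission files are read.

```python
import pandas as pd

# Three hypothetical sets of predictions for the same four pumps
preds = pd.DataFrame({
    'model_a': ['functional', 'non functional', 'functional', 'functional'],
    'model_b': ['functional', 'non functional', 'non functional',
                'functional needs repair'],
    'model_c': ['non functional', 'functional', 'non functional',
                'non functional'],
})

# Each row's most common label(s); tied labels spill into extra columns
votes = preds.mode(axis='columns')

# Column 0 is the winner, or just one of the tied labels when there is a tie
majority = votes[0]
print(votes)
print(majority.tolist())
```

The last row is a three-way tie, so taking column 0 silently picks one of the tied labels. With an odd number of reasonably different submissions, ties are less common but still possible with three classes.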

In [None]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [None]:
import pandas as pd

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

### Filters

In [None]:
import numpy as np


def population_filter(se):
  '''
  Replaces populations of 1 or fewer with np.nan.
  Missing values stay np.nan.
  '''
  return pd.Series(np.where(se <= 1, np.nan, se))


def remove_outliers(se):
  '''
  Remove outer 1% of values (outliers).
  '''
  se = se.dropna()
  return pd.Series(np.where((se >= np.percentile(se, 0.5)) &
                            (se <= np.percentile(se, 99.5)), se, np.nan))


def amount_tsh_filter(se):
  '''
  Replaces total static head values of 0 or less with np.nan.
  '''
  return pd.Series(np.where((se <= 0), np.nan, se))


def gps_height_filter(se):
  '''
  Replaces gps height values of 1 or less with np.nan.
  '''
  return pd.Series(np.where(se <= 1, np.nan, se))


def construction_year_filter(se):
  '''
  Replaces construction years of 0 with np.nan.
  '''
  return pd.Series(np.where(se == 0, np.nan, se))


def num_private_filter(se):
  '''
  Replaces num private values of 0 or less with np.nan.
  '''
  return pd.Series(np.where(se <= 0, np.nan, se))


def axis_filter(se):
  '''
  Replaces the most frequent value in the column with np.nan.
  In this dataset that value is the missing-data placeholder:
  -2e-08 for latitude and 0.0 for longitude.
  '''
  placeholder = se.value_counts().index[0]
  return pd.Series(np.where(se == placeholder, np.nan, se))


def clean(filter):
  '''
  Decorator function for casting pd.Series as type float
  and removing any outliers after applying the specified filter.
  '''
  def wrapper(se):
    return remove_outliers(filter(se.astype(float)))
  return wrapper
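
Two details of these filters are easy to trip over: comparing against `np.nan` with `==`, and the fresh 0-based index that `pd.Series(np.where(...))` creates. A minimal sketch, using a made-up series with a non-default index:

```python
import numpy as np
import pandas as pd

population = pd.Series([0, 1, 250, np.nan], index=[10, 11, 12, 13])

# Equality with NaN is always False, so this never flags the missing row
print((population == np.nan).any())   # False

# isna() is the reliable test for missing values
print(population.isna().tolist())     # [False, False, False, True]

# Series.mask swaps flagged values for NaN and keeps the original index
cleaned = population.mask(population <= 1)
print(cleaned.index.tolist())         # [10, 11, 12, 13]

# pd.Series(np.where(...)) gives the same values but a new 0..n-1 index
rebuilt = pd.Series(np.where(population <= 1, np.nan, population))
print(rebuilt.index.tolist())         # [0, 1, 2, 3]
```

With the default `RangeIndex` of a freshly read CSV, both approaches line up. After a `pd.concat` or a row filter, the rebuilt series may no longer align with the DataFrame it is assigned back to, so `mask` is the safer habit.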

### Replacements

In [None]:
numeric_features = ['population', 'amount_tsh', 'gps_height', 
                    'latitude', 'longitude', 'num_private']

measurements = ['mean', 'median', 'median', 'mean', 'mean', 'mode']

filters = [population_filter, amount_tsh_filter, gps_height_filter, 
           axis_filter, axis_filter, num_private_filter]

replacement_config = dict(zip(numeric_features, zip(measurements, filters)))

measurement_funcs = {
    'mean': lambda se: se.mean(),
    'mode': lambda se: se.mode()[0],
    'median': lambda se: se.median()
}

def find_replacement(df, feature):
  measurement, filter = replacement_config[feature]
  return measurement_funcs[measurement](clean(filter)(df[feature]))

# combine DataFrames for generating replacement values
tanzania = pd.concat([train.copy(), test.copy()])
replacements = {feature: find_replacement(tanzania, feature) for feature in numeric_features}
replacements

{'amount_tsh': 250.0,
 'gps_height': 1192.0,
 'latitude': -5.879993833884067,
 'longitude': 35.15198731008472,
 'num_private': 1.0,
 'population': 319.22367730422}

In [None]:
def get_means_by_region(X, features=['latitude', 'longitude', 'gps_height', 'population']):
  X = X.copy()
  grouped_by_region = X.groupby(['district_code', 'basin'])
  regional_features = {}
  for feature in features:
    regional_features[feature] = {}
    X[feature] = replacement_config[feature][1](X[feature])
    means_by_region = grouped_by_region[feature].mean()
    regional_means = {}
    for (code, basin), avg in zip(means_by_region.index, means_by_region.values):
      # if np.isnan(avg):
      #   avg = district_features[feature] # improve by getting avg by district code
      regional_means[f'{code}-{basin}'] = avg
    regional_features[feature] = regional_means
  return regional_features

def replace_by_location(X, features=['latitude', 'longitude', 'gps_height', 'population']):
  X = X.copy()
  for feature in features:
    X[feature] = replacement_config[feature][1](X[feature])
    for k, avg in regional_features[feature].items():
      code, basin = k.split('-')
      if np.isnan(avg):
        avg = district_features[feature][code]
      code = float(code)
      mask = ((X['district_code'] == code) & 
              (X['basin'] == basin) & 
              X[feature].isna())
      # assign in place so the existing index is kept
      X.loc[mask, feature] = avg
  return X

def get_means_by_district(X, features=['latitude', 'longitude', 'gps_height', 'population']):
  X = X.copy()
  grouped_by_district = X.groupby(['district_code'])
  district_features = {}
  for feature in features:
    district_features[feature] = {}
    X[feature] = replacement_config[feature][1](X[feature])
    means_by_district = grouped_by_district[feature].mean()
    district_means = {}
    for code, avg in zip(means_by_district.index, means_by_district.values):
      if np.isnan(avg):
        avg = replacements[feature] # fall back to the global replacement value
      district_means[f'{code}'] = avg
    district_features[feature] = district_means
  return district_features

tanzania = pd.concat([train, test])
regional_features = get_means_by_region(tanzania)
district_features = get_means_by_district(tanzania)
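
The functions above fill a missing value with the mean of its (district_code, basin) group, then fall back to the district mean, then to a global value. As a rough sketch of the same idea, `groupby(...).transform('mean')` can express that chain without explicit loops. The data below is made up, and the final fallback here is the plain column mean rather than the `replacements` dictionary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'district_code': [1, 1, 1, 2, 2, 3],
    'basin': ['Pangani', 'Pangani', 'Rufiji', 'Rufiji', 'Rufiji', 'Pangani'],
    'gps_height': [100.0, np.nan, np.nan, 300.0, 500.0, np.nan],
})

# transform('mean') returns one value per row, aligned with df's index
region_mean = df.groupby(['district_code', 'basin'])['gps_height'].transform('mean')
district_mean = df.groupby('district_code')['gps_height'].transform('mean')
overall_mean = df['gps_height'].mean()

# Try the most specific mean first, then progressively coarser ones
filled = (df['gps_height']
          .fillna(region_mean)
          .fillna(district_mean)
          .fillna(overall_mean))
print(filled.tolist())  # [100.0, 100.0, 100.0, 300.0, 500.0, 300.0]
```

Row 2 has no other pump in its (district, basin) group, so it takes the district mean; row 5 has no data in its district at all, so it takes the overall mean.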

### Feature Engineering

In [None]:
from geopy.distance import great_circle

In [None]:
# determine the average latitude and longitude of the functional water pumps
# in each (district_code, basin) group; used to calculate the distance of a
# water pump from that point

functional_epicenters = {}

# size() lists every observed group without averaging non-numeric columns
region_groups = train.groupby(['district_code', 'basin']).size().index
for code, basin in region_groups:
  mask = ((train['district_code'] == code) & 
          (train['basin'] == basin) & 
          (train['status_group'] == 'functional'))
  epicenter = tuple(train[mask][axis].mean() for axis in ('latitude', 'longitude'))
  functional_epicenters[f'{code}-{basin}'] = epicenter


def mean_district_epicenter(district):
  lat, lon, n = 0, 0, 0
  for k, coord in functional_epicenters.items():
    if int(k.split('-')[0]) == district:
      lat += coord[0]
      lon += coord[1]
      n += 1
  if n == 0:
    print(district)
    return (replacements['latitude'], replacements['longitude'])
  final_lat = lat / n
  final_lon = lon / n
  # valid latitudes lie in [-90, 90], valid longitudes in [-180, 180]
  if abs(final_lat) > 90 or abs(final_lon) > 180:
    print(f'Whoops: {final_lat} {final_lon}')
  return (final_lat, final_lon)


def coords_to_dist(row):
  '''
  Returns the distance in miles between a water pump and the center of
  the functional water pumps in its (district_code, basin) group.
  '''
  code = int(row['district_code'])
  # functional_epicenters is keyed by district code and basin, not region
  basin = row['basin']
  try:
    epicenter = functional_epicenters['{:.0f}-{}'.format(code, basin)]
  except KeyError:
    # group not seen in train: fall back to the district-level average
    epicenter = mean_district_epicenter(code)
  latitude = row['latitude']
  longitude = row['longitude']
  # print(f'{latitude} {longitude}')  # debug output, one line per row
  return great_circle((latitude, longitude), epicenter).miles
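
Calling `great_circle` once per row through `DataFrame.apply` is slow on tens of thousands of rows. A possible alternative, sketched below, is the haversine formula written with NumPy so that whole columns are processed at once. This is a swapped-in technique, not what the notebook uses: `haversine_miles` is a hypothetical helper, and the Earth radius constant is an assumed mean value, so results may differ slightly from geopy's.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; inputs in degrees, arrays or scalars."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


# Distances from two made-up pumps to a made-up epicenter at (-6.0, 35.0)
lats = np.array([-6.0, -3.0])
lons = np.array([35.0, 37.0])
distances = haversine_miles(lats, lons, -6.0, 35.0)
print(distances)

# Sanity check: one degree of longitude on the equator is about 69 miles
print(haversine_miles(0.0, 0.0, 0.0, 1.0))
```

To use something like this in `wrangle`, the epicenter latitude and longitude would first have to be looked up per row (for example with a merge on district code and basin) so that all four arguments are columns of equal length.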

### Data Wrangling

In [None]:
def replace(X, feature):
  '''
  Replaces feature in X with replacement value.
  '''
  filter = replacement_config[feature][1]
  X[feature] = filter(X[feature])
  X[feature] = X[feature].replace(np.nan, replacements[feature])
  return X

def wrangle(X):
  X = X.copy()

  # Engineer feature: 
  # X['populated'] = pd.Series(np.where(X['population'] <= 250, 0, 1))

  # Engineer feature:
  # X['extraction_type_binary'] = pd.Series(np.where(X['extraction_type'] == 'other', 0, 1))

  # replace nonsense values in numeric columns with guesses

  # let the imputer take care of this step
  # for feature in ['num_private', 'amount_tsh']:
  #   X = replace(X, feature)

  X = replace_by_location(X)  

  # replace NaN values in categorical features with "Unknown"
  nan_cols = ['funder', 'installer', 'subvillage', 'public_meeting', 
              'scheme_management', 'scheme_name', 'permit']

  for feature in nan_cols:
    X[feature] = X[feature].replace(np.nan, 'Unknown')
  
  # drop duplicate columns
  duplicate_cols = ['quantity_group', 'extraction_type_group',
                    'extraction_type_class', 'payment_type',
                    'source_type', 'waterpoint_type_group']
  X = X.drop(columns=duplicate_cols)

  # drop columns meant for book-keeping
  book_keeping_cols = ['id', 'recorded_by']
  X = X.drop(columns=book_keeping_cols)

  # Convert date_recorded to datetime
  X['date_recorded'] = pd.to_datetime(X['date_recorded'], infer_datetime_format=True)

  # Extract components from date_recorded, then drop the original column
  X['year_recorded'] = X['date_recorded'].dt.year
  X['month_recorded'] = X['date_recorded'].dt.month
  X['day_recorded'] = X['date_recorded'].dt.day
  X = X.drop(columns='date_recorded')

  # Engineer feature: how many years from construction_year to date_recorded
  X['years'] = X['year_recorded'] - X['construction_year']
  # X['years'] = pd.Series(np.where(X['years'] > 0, X['years'], 1))

  # Engineer feature: geographic center of functional water pumps
  X['miles_from_functional_epicenter'] = X.apply(coords_to_dist, axis=1)

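  # Engineer feature: boolean features to binary
  # NOTE: missing permits were replaced with the string 'Unknown' above;
  # a non-empty string is truthy, so those rows are encoded as 1
  boolean_features = ['permit']
  for feature in boolean_features:
      X[feature] = pd.Series(np.where(X[feature], 1, 0))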

  X['government_funded'] = pd.Series(np.where(X['funder'] == 'Government of Tanzania', 1, 0))

  X['source'] = pd.Series(np.where(X['source'] == 'other', 'unknown', X['source']))

  # Inspired Feature Engineering:

  # 1. pop/year

  # X['pop/year'] = X['population'] / X['years']

  # 2. water/person

  # X['water/person'] = X['amount_tsh'] / X['population']

  return X
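
One thing to watch in the `years` feature: `wrangle` never applies `construction_year_filter`, so a `construction_year` of 0 produces a value such as 2011 years. A minimal sketch of one way to guard against that, on made-up rows, is to turn the zeros into NaN before subtracting and let the imputer handle the gap later:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date_recorded': ['2011-03-14', '2013-03-06', '2011-07-13'],
    'construction_year': [1999, 2010, 0],
})

recorded = pd.to_datetime(df['date_recorded'])

# Treat a construction year of 0 as missing rather than as the year 0
built = df['construction_year'].replace(0, np.nan)

df['years'] = recorded.dt.year - built
print(df['years'].tolist())  # [12.0, 3.0, nan]
```

Whether NaN, a median age, or a separate "year unknown" indicator works best is something cross-validation can help decide.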

In [None]:
train_clean, test_clean = [wrangle(df) for df in (train, test)]


In [None]:
train_clean.to_csv('train_clean.csv')
test_clean.to_csv('test_clean.csv')

In [None]:
# train_clean = pd.read_csv('train_clean.csv')
# test_clean = pd.read_csv('test_clean.csv')

### Feature Selection

In [None]:
target = 'status_group'
train_features = train_clean.drop(columns=[target])

# Get a list of the numeric features
numeric_features = train_features.select_dtypes(
    include='number').columns.tolist()

# Get a list of the nonnumeric (categorical) features
categorical_features = train_features.select_dtypes(
    exclude='number').columns.tolist()

features = numeric_features + categorical_features

X_train, X_test = [df[features] for df in (train_clean, test_clean)]
y_train = train_clean[target]

# assert_initial_shape(X_train, X_test)

print('Number of columns')
for df, name in [(X_train, 'X_train'), (X_test, 'X_test')]:
  print(f'{name}: {df.shape[1]}')

Number of columns
X_train: 37
X_test: 37
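
Before choosing an encoder for each categorical feature, it can help to look at how many distinct values each one has. A small sketch on a made-up frame (the threshold of 2 is arbitrary and only for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    'basin': ['Pangani', 'Rufiji', 'Pangani', 'Lake Nyasa'],
    'permit': ['True', 'False', 'True', 'True'],
    'gps_height': [1390.0, 1399.0, 686.0, 263.0],
})

# Number of distinct values per nonnumeric column, smallest first
cardinality = demo.select_dtypes(exclude='number').nunique().sort_values()
print(cardinality)

# Columns at or below a chosen cardinality threshold
low_cardinality = cardinality[cardinality <= 2].index.tolist()
print(low_cardinality)  # ['permit']
```

Low-cardinality columns are natural candidates for one-hot encoding, while columns with thousands of levels usually need ordinal, hashing, or target-style encodings.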


### Encoding

In [None]:
import category_encoders as ce

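def encode(encoder, X_train, X_test, features, target, y_train=None):
  '''
  Fits encoder on X_train and applies it to both X_train and X_test.
  Unsupervised encoders (y_train=None) are fit once. Supervised encoders
  are refit once per class on a one-vs-rest binary target; each pass
  re-encodes the output of the previous pass, so the columns are
  overwritten rather than expanded to one column per class.
  The features and target arguments are currently unused.
  '''
  if y_train is None:
    X_train = encoder.fit_transform(X_train, y_train)
    X_test = encoder.transform(X_test)
    return X_train, X_test
  else:
    target_vectors = [[1.0 if x == name else 0.0 for x in y_train] for name in y_train.value_counts().index.tolist()]
    for target_vector in target_vectors:
        X_train = encoder.fit_transform(X_train, target_vector)
        X_test = encoder.transform(X_test)
    return X_train, X_test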

cat_boost_encoded_features = ['funder']
hash_encoded_features = ['scheme_management']
target_encoded_features = ['management']

encoders = [ce.cat_boost.CatBoostEncoder(cols=cat_boost_encoded_features, random_state=42), 
            ce.hashing.HashingEncoder(cols=hash_encoded_features), 
            ce.target_encoder.TargetEncoder(cols=target_encoded_features, 
                                            min_samples_leaf=100,
                                            smoothing=10)]
encoded_features = [cat_boost_encoded_features, hash_encoded_features, target_encoded_features]
target_vectors = [y_train, None, y_train]

X_train_encoded, X_test_encoded = (df.copy() for df in (X_train, X_test))
for encoder, features, target_vector in zip(encoders, encoded_features, target_vectors):
  X_train_encoded, X_test_encoded = encode(encoder, X_train_encoded, X_test_encoded, features, target, target_vector)
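
A more conventional way to adapt target encoding to several classes is to keep one encoded column per class instead of overwriting a single column. The sketch below does this with plain pandas on made-up data. `one_vs_rest_target_encode` is a hypothetical helper, not part of `category_encoders`, and it uses raw in-sample class rates with no smoothing, so it would overfit rare categories if used as-is.

```python
import pandas as pd


def one_vs_rest_target_encode(train_col, y, test_col):
    """Return per-class rate columns for one categorical feature."""
    train_out, test_out = {}, {}
    for label in y.unique():
        is_label = (y == label).astype(float)
        prior = is_label.mean()                       # overall class rate
        rates = is_label.groupby(train_col).mean()    # class rate per category
        name = f'{train_col.name}_{label}'
        train_out[name] = train_col.map(rates)
        # categories never seen in train fall back to the overall rate
        test_out[name] = test_col.map(rates).fillna(prior)
    return pd.DataFrame(train_out), pd.DataFrame(test_out)


funder_train = pd.Series(['Gov', 'Gov', 'NGO', 'NGO'], name='funder')
status = pd.Series(['functional', 'non functional', 'functional', 'functional'])
funder_test = pd.Series(['Gov', 'Other'], name='funder')

train_enc, test_enc = one_vs_rest_target_encode(funder_train, status, funder_test)
print(train_enc)
print(test_enc)
```

In a real workflow the rates should be learned inside each cross-validation fold (for example by wrapping the encoder in the pipeline) so that validation rows never contribute to their own encoding.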

In [None]:
X_train_encoded.to_csv('X_train_encoded.csv')
X_test_encoded.to_csv('X_test_encoded.csv')

# X_train_encoded = pd.read_csv('X_train_encoded.csv', index_col=0)
# X_test_encoded = pd.read_csv('X_test_encoded.csv', index_col=0)

### Pipeline

In [None]:
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from sklearn import model_selection

In [None]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    # ce.TargetEncoder(),
    SimpleImputer(strategy='mean'), 
    RandomForestClassifier(random_state=42, 
                           n_jobs=-1,
                           n_estimators=410,
                           max_features=19,
                           min_samples_leaf=4,
                           max_depth=160
                           )
                         )

result = model_selection.cross_val_score(pipeline, X_train_encoded, y_train, cv=3)

# pipeline.fit(X_train, y_train)
# print('Training Score:', pipeline.score(X_train, y_train))
# print('Validation Score:', accuracy_score(y_val, pipeline.predict(X_val)))
# array([0.81144781, 0.80968013, 0.81010101, 0.80791246, 0.80681818])

In [None]:
result

array([0.80292929, 0.80575758, 0.80277778])

### Cross Validation

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
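# scipy.stats provides distribution objects that RandomizedSearchCV can
# sample from; random.randint and numpy.random.uniform return single values
from scipy.stats import randint, uniform

n_iter = 100

param_distributions = {
    # 'targetencoder__min_samples_leaf': randint(1, 1000),
    # 'targetencoder__smoothing': uniform(1, 1000),
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__n_estimators': randint(50, 500),
    # max_depth must be a positive integer (or None), so start at 20, not 0
    'randomforestclassifier__max_depth': range(20, 200, 20),
    # a float max_features is the fraction of features tried per split;
    # scipy's uniform(loc, scale) samples from [loc, loc + scale]
    'randomforestclassifier__max_features': uniform(0.05, 0.95)
}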

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=n_iter,
    cv=5,
    scoring='accuracy',
    verbose=10,
    return_train_score=True,
    n_jobs=-1
)


search.fit(X_train_encoded, y_train)
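
The difference between a single random number and a distribution object matters for `RandomizedSearchCV`. The sketch below uses scikit-learn's `ParameterSampler`, which draws candidates from a parameter space the same way the search does, to show what gets sampled. The parameter names and ranges are only examples.

```python
from random import randint as py_randint

from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# random.randint draws once and returns a plain int
single_value = py_randint(50, 500)
print(type(single_value))

# scipy.stats.randint returns a frozen distribution with an rvs() method,
# which the search can sample from on every iteration
distribution = randint(50, 500)
print(hasattr(distribution, 'rvs'))

space = {
    'n_estimators': randint(50, 500),      # integers in [50, 500)
    'max_features': uniform(0.1, 0.9),     # floats in [0.1, 1.0]
    'strategy': ['mean', 'median'],        # lists are sampled uniformly
}

samples = list(ParameterSampler(space, n_iter=5, random_state=42))
for candidate in samples:
    print(candidate)
```

Passing a list of pre-drawn values also works, since lists are sampled uniformly, but a distribution object keeps the search space independent of `n_iter`.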

In [None]:
print('Best hyperparameters', search.best_params_)
print('Cross-validation accuracy', search.best_score_)

Best hyperparameters {'simpleimputer__strategy': 'mean', 'randomforestclassifier__n_estimators': 60, 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__max_depth': 100}
Cross-validation accuracy 0.8063299663299663


In [None]:
%%time
from xgboost import XGBClassifier

categorical_features = X_train_encoded.select_dtypes(exclude='number').columns.tolist()

encoder = ce.OrdinalEncoder(cols=categorical_features)
X_train_encoded = encoder.fit_transform(X_train_encoded, y_train)
X_test_encoded = encoder.transform(X_test_encoded)

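# NOTE: nrounds and maximize are not XGBClassifier parameters (they appear
# to be carried over from R-style xgboost code); in the scikit-learn wrapper,
# n_estimators (default 100) sets the number of boosting rounds
modelxgb = XGBClassifier(
    random_state=42,
    objective = 'multi:softmax', 
    booster = 'gbtree', 
    nrounds = 'min.error.idx', 
    num_class = 3, 
    maximize = False, 
    eval_metric = 'merror', 
    eta = .1,
    max_depth = 14, 
    colsample_bytree = .4)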

model = modelxgb.fit(X_train_encoded, y_train)

CPU times: user 1min 13s, sys: 91.5 ms, total: 1min 13s
Wall time: 1min 13s


In [None]:
# X_test_encoded was already transformed in the previous cell; transforming it
# again would treat the integer codes as unseen categories
X_train_encoded.head()

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,amount_tsh,gps_height,longitude,latitude,num_private,region_code,district_code,population,permit,construction_year,year_recorded,month_recorded,day_recorded,years,miles_from_functional_epicenter,government_funded,funder,installer,wpt_name,basin,subvillage,region,lga,ward,public_meeting,scheme_name,extraction_type,management,management_group,payment,water_quality,quality_group,quantity,source,source_class,waterpoint_type
0,0,1,0,0,0,0,0,0,6000.0,1390.0,34.938093,-9.856322,0,11,5,109.0,0,1999,2011,3,14,12,240.993961,0,0.072677,Roman,none,Lake Nyasa,Mnyusi B,Iringa,Ludewa,Mundindi,True,Roman,gravity,0.068902,user-group,pay annually,soft,good,enough,spring,groundwater,communal standpipe
1,0,0,0,0,0,0,0,1,0.0,1399.0,34.698766,-2.147466,0,20,2,280.0,1,2010,2013,3,6,3,311.544818,0,0.072677,GRUMETI,Zahanati,Lake Victoria,Nyamara,Mara,Serengeti,Natta,Unknown,Unknown,gravity,0.099002,user-group,never pay,soft,good,insufficient,rainwater harvesting,surface,communal standpipe
2,0,1,0,0,0,0,0,0,25.0,686.0,37.460664,-3.821329,0,21,4,250.0,1,2009,2013,2,25,4,261.197128,0,0.072677,World vision,Kwa Mahundi,Pangani,Majengo,Manyara,Simanjiro,Ngorika,True,Nyumba ya mungu pipe scheme,gravity,0.068902,user-group,pay per bucket,soft,good,enough,dam,surface,communal standpipe multiple
3,0,1,0,0,0,0,0,0,0.0,263.0,38.486161,-11.155298,0,90,63,58.0,1,1986,2013,1,28,27,127.263226,0,0.072677,UNICEF,Zahanati Ya Nanyumbu,Ruvuma / Southern Coast,Mahakamani,Mtwara,Nanyumbu,Nanyumbu,True,Unknown,submersible,0.068902,user-group,never pay,soft,good,dry,machine dbh,groundwater,communal standpipe multiple
4,0,0,0,0,0,0,1,0,0.0,1242.418131,31.130847,-1.825359,0,18,1,480.564499,1,0,2011,7,13,2011,309.516345,0,0.072677,Artisan,Shuleni,Lake Victoria,Kyanyamisa,Kagera,Karagwe,Nyakasimbi,True,Unknown,gravity,0.065166,other,never pay,soft,good,seasonal,rainwater harvesting,surface,communal standpipe


In [None]:
from sklearn.model_selection import cross_val_score

# For a classifier, an integer cv uses stratified k-fold splits,
# so no explicit StratifiedKFold object is needed
results = cross_val_score(model, X_train_encoded, y_train, cv=5)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))



Accuracy: 81.46% (0.30%)
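
If you do want control over the folds (for example to shuffle the rows), pass a splitter object as `cv`. A minimal sketch on synthetic data follows; a decision tree stands in for the XGBoost model so the example stays fast and dependency-free. Note that `random_state` only has an effect together with `shuffle=True`, and recent scikit-learn versions reject it otherwise.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 150 rows, 3 balanced classes, random features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = np.repeat(
    ['functional', 'functional needs repair', 'non functional'], 50)

# Stratification keeps the class proportions the same in every fold
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    DecisionTreeClassifier(random_state=42),
    X_demo, y_demo,
    cv=kfold,
    scoring='accuracy',
)
print("Accuracy: %.2f%% (%.2f%%)" % (scores.mean() * 100, scores.std() * 100))

# Each held-out fold contains 10 rows of each class
for _, test_idx in kfold.split(X_demo, y_demo):
    labels, counts = np.unique(y_demo[test_idx], return_counts=True)
    print(dict(zip(labels, counts)))
```

The standard deviation across folds gives a rough sense of how much of a difference between two models is just noise.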


In [None]:
model.fit(X_train_encoded, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.4, eta=0.1,
              eval_metric='merror', gamma=0, learning_rate=0.1,
              max_delta_step=0, max_depth=14, maximize=False,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nrounds='min.error.idx', nthread=None, num_class=3,
              objective='multi:softprob', random_state=42, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
              subsample=1, verbosity=1)

In [None]:
test_pred = pd.DataFrame({target: model.predict(X_test_encoded)}, index=test['id'])
test_pred.to_csv('submission.csv')

In [None]:
test['id'][:10]

0    50785
1    51630
2    17168
3    45559
4    49871
5    52449
6    24806
7    28965
8    36301
9    54122
Name: id, dtype: int64

In [None]:
from sklearn.neighbors import RadiusNeighborsClassifier


axes = ['latitude', 'longitude']
# axes = ['latitude']
kn = RadiusNeighborsClassifier(radius=1.0)
# score on the coordinate columns only; cross_val_score fits a clone per fold
kn_results = cross_val_score(kn, X_train_encoded[axes], y_train, cv=3)

In [None]:
kn_results

array([0.51626263, 0.5140404 , 0.55005051])
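
A radius of 1.0 on raw latitude and longitude means "within one degree", which is roughly 69 miles north to south. As a sketch of one alternative, the classifier can use the haversine metric so the radius is a real distance. The coordinates and labels below are made up, the 25-mile radius is arbitrary, and the Earth radius is an assumed mean value. The haversine metric expects `[latitude, longitude]` in radians.

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius

# Made-up pumps: two near (-6, 35), two near (-3, 37)
coords_deg = np.array([
    [-6.00, 35.00],
    [-6.01, 35.01],
    [-3.00, 37.00],
    [-3.01, 37.01],
])
labels = np.array(
    ['functional', 'functional', 'non functional', 'non functional'])

clf = RadiusNeighborsClassifier(
    radius=25 / EARTH_RADIUS_MILES,   # 25 miles expressed in radians
    metric='haversine',
    algorithm='ball_tree',
    outlier_label='most_frequent',    # label for points with no neighbors
)
clf.fit(np.radians(coords_deg), labels)

new_points = np.radians([[-6.005, 35.005], [-3.005, 37.005]])
pred = clf.predict(new_points)
print(pred)
```

Setting `outlier_label` avoids an error at prediction time when a pump has no training neighbor inside the radius.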

In [None]:
categorical_features = X_train_encoded.select_dtypes(exclude='number').columns.tolist()
encoder = ce.OrdinalEncoder(cols=categorical_features)
X_train_encoded = encoder.fit_transform(X_train_encoded, y_train)
X_test_encoded = encoder.transform(X_test_encoded)

In [None]:
X_test_encoded.head()

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,amount_tsh,gps_height,longitude,latitude,num_private,region_code,district_code,population,permit,construction_year,year_recorded,month_recorded,day_recorded,years,miles_from_functional_epicenter,government_funded,funder,installer,wpt_name,basin,subvillage,region,lga,ward,public_meeting,scheme_name,extraction_type,management,management_group,payment,water_quality,quality_group,quantity,source,source_class,waterpoint_type
0,0,0,0,0,1,0,0,0,0.0,1996.0,35.290799,-4.059696,0,21,3,321.0,1,2012,2013,2,4,1,177.610206,0,0.078071,342.0,-1.0,5,10944.0,3,38,574.0,1,2.0,6,0.072677,4,2,1,1,4,2,2,4
1,0,1,0,0,0,0,0,0,0.0,1569.0,36.656709,-3.309214,0,2,2,300.0,1,2000,2013,2,4,13,274.500317,0,0.078071,6.0,-1.0,3,-1.0,17,27,368.0,1,417.0,1,0.072677,1,2,1,1,2,1,1,1
2,0,1,0,0,0,0,0,0,0.0,1567.0,34.767863,-5.004344,0,13,2,500.0,1,2010,2013,2,1,3,115.318722,0,0.078071,20.0,21519.0,5,7345.0,19,33,648.0,1,938.0,6,0.072677,1,2,1,1,2,2,2,4
3,0,1,0,0,0,0,0,0,0.0,267.0,38.058046,-9.418672,0,80,43,250.0,1,1987,2013,1,22,26,104.503966,0,0.078071,131.0,-1.0,4,5580.0,15,106,1796.0,2,2.0,6,0.072677,1,4,1,1,3,6,1,4
4,1,0,0,0,0,0,0,0,500.0,1260.0,35.006123,-10.950412,0,10,3,60.0,1,2000,2013,3,27,13,305.644547,0,0.012113,1133.0,2985.0,4,2891.0,10,98,654.0,2,319.0,1,0.072677,1,7,1,1,1,1,1,1


