# Model Selection

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# load processed data
train = pd.read_csv('../data/processed/train.csv')
X_test = pd.read_csv('../data/processed/test.csv')

y_train = train['status_group']
X_train = train.drop(['status_group'], axis = 1)

test_ids = pd.read_csv('/Users/donaldfung/Documents/Github/datascience/Projects/Pump it Up - Data Mining the Water Table/data/raw/test.csv', usecols = ['id'])

outcomes = {0:'functional', 
            1:'functional needs repair', 
            2:'non functional'}

random_state = 5

I loaded the processed data and separated the target column, `status_group`, from the training features.  I also loaded the IDs from the raw `test.csv` file, since the final submission file requires them.

I also created a dictionary called `outcomes` that maps each encoded class label to its string name. It is used to convert the encoded predictions back into strings for the final submission.
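As a minimal sketch of how the `outcomes` dictionary decodes predictions, the snippet below maps a few encoded labels back to strings with `Series.map`. The `encoded` values are made up for illustration and are not taken from the model.

```python
import pandas as pd

# Same encoding as the `outcomes` dictionary defined above.
outcomes = {0: 'functional',
            1: 'functional needs repair',
            2: 'non functional'}

# Made-up encoded predictions, for illustration only.
encoded = pd.Series([0, 2, 1, 0])

# Series.map looks each code up in the dictionary.
decoded = encoded.map(outcomes)
print(decoded.tolist())

# A code that is missing from the dictionary maps to NaN instead of raising,
# so it is worth checking for missing values after decoding.
unknown = pd.Series([3]).map(outcomes)
print(unknown.isna().all())
```

Checking for missing values after decoding is a cheap way to catch a label that was encoded differently during preprocessing.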

In [7]:
def save_predictions(ids = None, predictions = None, filepath = None):
    """Prepare submissions file"""
    submission = pd.DataFrame({'id': ids, 'status_group': predictions})
    submission['status_group'] = submission['status_group'].apply(reverse_transform)
    submission.to_csv(path_or_buf = filepath, index = False)
    print("File saved!")
            
def reverse_transform(x):
    """Converts numerical labels to strings"""
    x = outcomes[x]
    return x

I created a function, `save_predictions`, to prepare the final submission file.  It converts the numerically encoded predictions into strings with `reverse_transform` before writing the file to CSV.
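One way to sanity-check the file that `save_predictions` produces is to build the same two-column frame, write it to an in-memory buffer, and read it back. This is only a sketch: the `ids` and `predictions` values are hypothetical, and the buffer stands in for the real file path.

```python
import io
import pandas as pd

outcomes = {0: 'functional',
            1: 'functional needs repair',
            2: 'non functional'}

# Hypothetical IDs and encoded predictions, for illustration only.
ids = pd.Series([101, 102, 103])
predictions = [2, 0, 1]

# Build the frame the same way save_predictions does.
submission = pd.DataFrame({'id': ids, 'status_group': predictions})
submission['status_group'] = submission['status_group'].map(outcomes)

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
submission.to_csv(buffer, index = False)
buffer.seek(0)
check = pd.read_csv(buffer)

# The file should contain exactly the two expected columns,
# one row per ID, and only valid label strings.
columns_ok = list(check.columns) == ['id', 'status_group']
rows_ok = len(check) == len(ids)
labels_ok = check['status_group'].isin(list(outcomes.values())).all()
print(columns_ok, rows_ok, labels_ok)
```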

In [5]:
def fit(model, X, y, n_folds, random_state):
    """Evaluate model performance with cross-validation (random_state is currently unused)"""
    scores = cross_val_score(estimator = model,
                             X = X,
                             y = y,
                             scoring = 'accuracy',
                             cv = n_folds, 
                             n_jobs = -1)
    print('CV accuracy scores:', scores)
    print('\nCV accuracy: {} +/- {}'.format(np.mean(scores), np.std(scores)))
    
lr = LogisticRegression(random_state = random_state)
fit(model = lr, X = X_train, y = y_train, n_folds = 5, random_state = random_state)

CV accuracy scores: [ 0.7472435   0.74589681  0.74385522  0.74006734  0.75425156]

CV accuracy: 0.7462628848957399 +/- 0.004671105530607837


I tested the performance of a Logistic Regression classifier on the data.  With its default parameters, the classifier achieved a mean accuracy of approximately 74.6% (standard deviation of about 0.5%) across the five validation folds.  Note that when `cv` is an integer and the estimator is a classifier, `cross_val_score` performs stratified k-fold cross-validation over the training set.

There is a lot of room to improve the current accuracy score.  
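To make the stratified cross-validation mentioned above explicit, the sketch below compares `cv = 5` with a hand-built `StratifiedKFold(n_splits = 5)`. It uses a small synthetic three-class dataset from `make_classification` rather than the pump data, and `max_iter = 1000` is an assumption added only so the solver converges on this toy data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class data, for illustration only (not the pump dataset).
X, y = make_classification(n_samples = 300,
                           n_features = 8,
                           n_informative = 4,
                           n_classes = 3,
                           random_state = 5)

model = LogisticRegression(max_iter = 1000)

# Passing an integer: scikit-learn picks stratified folds for a classifier.
scores_int = cross_val_score(model, X, y, cv = 5)

# Passing the splitter explicitly: same folds, no shuffling.
splitter = StratifiedKFold(n_splits = 5)
scores_explicit = cross_val_score(model, X, y, cv = splitter)

print(np.allclose(scores_int, scores_explicit))
```

Building the splitter by hand is mainly useful when the folds should be shuffled, in which case `StratifiedKFold(n_splits = 5, shuffle = True, random_state = ...)` gives reproducible shuffled folds.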

In [6]:
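def predict(model, X_train, y_train, X_test, random_state):
    """Fit the model on the full training set and predict on the test set (random_state is currently unused)"""
    preds = model.fit(X_train, y_train).predict(X_test)
    return preds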

predictions=predict(model=lr,
            X_train=X_train,
            y_train=y_train,
            X_test=X_test,
            random_state=random_state)

# prepare first submission
save_predictions(ids = test_ids['id'], predictions = predictions, filepath = '../reports/submissions/submission4.csv')

File saved!
