# Train/Test, Cross-Validation, and Regularization

In this notebook, we'll create train/test splits in Python.  We'll also see scikit-learn's `GridSearchCV`, a simple way to use cross-validation to select tunable parameters.

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
# these imports are new; both live in sklearn.model_selection
# (the older sklearn.cross_validation and sklearn.grid_search modules were removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
import pandas as pd

Let's read in the "Hitters" dataset from ISLR that has information on baseball players, their stats, and their salaries.  Also, we'll drop any rows with missing values.

In [None]:
hitters = pd.read_csv("./hitters.csv")
hitters = hitters.dropna()
hitters.head()

We'll get rid of a few categorical columns rather than deal with converting them.  Then we'll create a binary variable for whether a player makes more than the median salary.

In [None]:
X = np.array(hitters.drop(["Salary", "League", "Division", "NewLeague"], axis=1))
y = (hitters["Salary"] >= np.median(hitters["Salary"])).astype("int")
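
If we did want to keep the categorical columns, one common option is one-hot encoding with `pd.get_dummies`. The sketch below uses a small made-up frame (the values are illustrative, not taken from the Hitters data) and assumes the categorical columns hold two-level string codes.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Hits": [81, 130, 141],
    "League": ["A", "N", "A"],
    "Division": ["W", "E", "W"],
})

# drop_first=True keeps a single indicator per two-level column,
# so we don't carry a redundant column for each category.
encoded = pd.get_dummies(toy, columns=["League", "Division"], drop_first=True)
print(encoded.columns.tolist())
```

The encoded frame could then be passed to `np.array` in the same way as above.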

Creating a training/testing split is extremely simple:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=10)

In [None]:
print(X_train.shape)
print(X_test.shape)
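
With a binary target it can also help to keep the class balance the same in both parts of the split. Here is a minimal sketch on synthetic data (the names `X_demo` and `y_demo` are placeholders, not part of the Hitters analysis) using the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 3 features, balanced binary labels (illustrative only).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.repeat([0, 1], 50)

# stratify=y_demo keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=10, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```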

Next, we'll fit a logistic regression model to the training data and score the test data:

In [None]:
# the L1 penalty needs a solver that supports it, such as liblinear
logit = LogisticRegression(penalty="l1", C=1e5, solver="liblinear")
logit.fit(X_train, y_train)

test_preds = logit.predict_proba(X_test)[:, 1]

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, test_preds)

In [None]:
# we want to draw the random baseline ROC line too
fpr_rand = tpr_rand = np.linspace(0, 1, 10)

plt.plot(fpr, tpr)
plt.plot(fpr_rand, tpr_rand, linestyle='--')
plt.show()

In [None]:
roc_auc_score(y_test, test_preds)
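
One way to read the AUC: it is the fraction of (negative, positive) pairs in which the positive example gets the higher score. The tiny hand-made example below (four made-up scores with no ties) checks that reading against `roc_auc_score`.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Four made-up examples: two negatives, two positives.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

neg = [s for s, y in zip(scores, y_true) if y == 0]
pos = [s for s, y in zip(scores, y_true) if y == 1]

# Count the (negative, positive) pairs where the positive is ranked higher.
concordant = sum(p > n for n, p in product(neg, pos))
pairwise_auc = concordant / (len(neg) * len(pos))

print(pairwise_auc, roc_auc_score(y_true, scores))
```

Three of the four pairs are ordered correctly, so both numbers come out to 0.75.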

If we re-run the train/test split with a different `random_state`, we'll get a different AUC, which shows how much this estimate varies from split to split.
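
To get a feel for that variability, we can repeat the split with several seeds and look at the spread of test AUCs. This is a rough sketch on a synthetic dataset from `make_classification` (a stand-in so the cell runs without `hitters.csv`), so the numbers will not match the Hitters results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Hitters dataset).
X_syn, y_syn = make_classification(
    n_samples=260, n_features=16, n_informative=5, random_state=0
)

split_aucs = []
for seed in range(20):
    # A different random_state gives a different train/test partition.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_syn, y_syn, train_size=0.7, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    split_aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

print("min %.3f  max %.3f  std %.3f"
      % (min(split_aucs), max(split_aucs), np.std(split_aucs)))
```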

We can use the test set (which, in this case, should really be called a validation set) to choose the best value of the tunable parameter `C` of the logistic regression, which is the inverse of $\lambda$, the regularization strength.

In [1]:
# create 20 values evenly spaced on a log scale between 10^-10 and 10^10
c_vals = np.logspace(-10, 10, 20)

aucs = []
for c_val in c_vals:
    logit = LogisticRegression(C=c_val)
    logit.fit(X_train, y_train)

    test_preds = logit.predict_proba(X_test)[:, 1]
    aucs.append(roc_auc_score(y_test, test_preds))

In [None]:
aucs

In [None]:
plt.plot(np.log10(c_vals), aucs)
plt.xlabel("log10(C)")
plt.ylabel("Test AUC")
plt.show()
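
Since `C` is the inverse of $\lambda$, a small `C` means a strong penalty and coefficients shrunk toward zero. The sketch below illustrates this on synthetic data (again a stand-in for Hitters) by printing the norm of the fitted coefficient vector for three values of `C`, using the default L2 penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the Hitters dataset).
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)

norms = []
for c_val in [1e-3, 1.0, 1e3]:
    model = LogisticRegression(C=c_val, max_iter=5000).fit(X_syn, y_syn)
    norms.append(np.linalg.norm(model.coef_))
    # lambda = 1 / C, so the penalty weakens as C grows.
    print("C=%g  lambda=%g  ||w||=%.3f" % (c_val, 1.0 / c_val, norms[-1]))
```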

Instead of relying on a single train/test split, we can use scikit-learn's cross-validation tools to choose the tunable parameters of a model.  First, we make a dictionary where each key is the name of a parameter we want to tune (it has to match the name of the parameter in the model) and each value is the list of values we want to try:

In [None]:
param_grid = {"C": np.logspace(2, 8, 50)}

In [None]:
np.logspace(2, 8, 50)

Then, we pass in the model we want to fit and the grid.  The option `n_jobs` lets us split the cross-validation work over multiple cores (`n_jobs=1` uses a single core), and `refit` tells it to refit the best-performing model on all of the data passed to `fit` once the search is done.

In [None]:
cv = GridSearchCV(logit, param_grid, cv=10, n_jobs=1, refit=True, verbose=True)
cv.fit(X_train, y_train)
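
Note that for a classifier, `GridSearchCV` scores each candidate with the estimator's default `score` method (accuracy) unless told otherwise. If we want the search to optimize AUC, to match the metric used earlier, one option is the `scoring="roc_auc"` argument. A minimal sketch on synthetic data, with a smaller grid and fewer folds than above to keep it quick:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Hitters dataset).
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid={"C": np.logspace(-3, 3, 7)},
    scoring="roc_auc",  # rank candidates by cross-validated AUC instead of accuracy
    cv=5,
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(search.best_score_)
```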

We can see the best values and the grid scores:

In [None]:
cv.best_estimator_

In [None]:
cv.best_params_

Let's see what value of $\log_{10} \lambda$ corresponds to the best `C`:

In [None]:
np.log10(1.0/cv.best_params_['C'])

In [None]:
cv.best_score_

In [None]:
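# grid_scores_ was removed in scikit-learn 0.20; cv_results_ holds the per-candidate scores
cv.cv_results_["mean_test_score"]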
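
The per-candidate results are easier to read as a table. Here is a minimal sketch, assuming scikit-learn 0.20 or later, where they are exposed as the `cv_results_` dictionary; it fits a small search on synthetic data and loads a few of the result columns into a pandas DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Hitters dataset).
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid={"C": np.logspace(-2, 2, 5)},
    cv=5,
)
search.fit(X_syn, y_syn)

# One row per value of C, with the mean and spread of the fold scores.
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```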