In [None]:
gs_knn.best_score_

In [None]:
gs_knn.best_params_

In [None]:
evaluate_model(gs_knn.best_estimator_)

True positives: 68
True negatives:  149
False negatives:  35
False positives:  16

#### Print the best parameters and score for the gridsearched kNN model. How does it compare to the logistic regression model?

GridSearch - LogisticRegression  
Best_score = 0.79012345679012341  
Best_params = {'C': 1.3894954943731359, 'penalty': 'l2', 'solver': 'liblinear'}

GridSearch - KNN  
Best_score = 0.8058361391694725  
Best_params = {'leaf_size': 10, 'n_neighbors': 20, 'p': 1}

#### How does the number of neighbors affect the bias-variance tradeoff of your model?

#### [BONUS] Why?

Increasing the number of neighbors increases bias and decreases variance.

With a small k, the k-nearest neighbors algorithm has low bias and high variance. Increasing k means more neighbors contribute to each prediction, which smooths the decision boundary: variance falls and bias rises.
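
The two extremes of k can be sketched in plain Python. The toy one-dimensional dataset and the `knn_predict` / `training_accuracy` helpers below are invented for illustration and are not part of the notebook's sklearn pipeline.

```python
from collections import Counter

# Hypothetical toy data: (feature, label) pairs, invented for illustration.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (5.0, 1), (6.0, 1), (7.0, 1)]

def knn_predict(x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def training_accuracy(k):
    """Fraction of training points that the k-NN vote labels correctly."""
    hits = sum(knn_predict(x, k) == label for x, label in train)
    return hits / float(len(train))

for k in (1, 3, 7):
    print(k, training_accuracy(k))
```

With k = 1 every training point is its own nearest neighbor, so the training set is memorized (low bias, high variance). With k equal to the training-set size, every query gets the overall majority label (high bias, low variance).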

#### In what hypothetical scenario(s) might you prefer logistic regression over kNN, aside from model performance metrics?

Logistic regression needs little hyperparameter tuning (mainly the regularization strength `C`).         
Logistic regression predicts probabilities, which measure the confidence of each prediction.
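
To make the probability point concrete: in a binary logistic regression, the linear score (the quantity sklearn's `decision_function` returns) is mapped to a positive-class probability by the logistic sigmoid. This is a minimal sketch, and the example scores are made up.

```python
import math

def sigmoid(score):
    """Map a linear score w.x + b to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical decision scores, invented for illustration.
for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3))
```

A score of 0 sits on the decision boundary (probability 0.5). Scores further from 0 give probabilities closer to 0 or 1, which is what makes them usable as a confidence measure.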

#### Fit a new kNN model with the optimal parameters found in gridsearch. 

In [None]:
# optimal LR model
confusion_matrix(y_test,y_logreg_pred)

### Optimal KNN model

True positives: 68              
True negatives:  149            
False negatives:  35           
False positives:  16       

### Optimal LogisticRegression model

True positives: 72  
True negatives:  134  
False negatives:  31  
False positives:  31  


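The optimal kNN model is clearly better than logistic regression at predicting true negatives (149 vs. 134, with 16 vs. 31 false positives), while logistic regression finds slightly more true positives (72 vs. 68).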
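
The counts above can be turned into the usual summary metrics. This is a small plain-Python sketch; the `summarize` helper is illustrative and not part of the notebook.

```python
def summarize(tp, tn, fn, fp):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fn + fp
    return {
        'accuracy': (tp + tn) / float(total),
        'precision': tp / float(tp + fp),
        'recall': tp / float(tp + fn),
    }

# Counts copied from the two confusion-matrix summaries above.
knn = summarize(tp=68, tn=149, fn=35, fp=16)
logreg = summarize(tp=72, tn=134, fn=31, fp=31)

for name, scores in (('kNN', knn), ('LogReg', logreg)):
    print(name, {key: round(value, 3) for key, value in sorted(scores.items())})
```

On these counts kNN has higher accuracy and precision, while logistic regression has slightly higher recall.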


#### [BONUS] Plot the ROC curves for the optimized logistic regression model and the optimized kNN model on the same plot.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def auc_plotting_function(rate1, rate2, rate1_name, rate2_name, curve_name):
    # Area under the curve, shown in the legend label.
    AUC = auc(rate1, rate2)
    plt.plot(rate1, rate2, label=curve_name + ' (area = %0.2f)' % AUC, linewidth=4)
    # Diagonal reference line: the ROC curve of a random classifier.
    plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel(rate1_name, fontsize=18)
    plt.ylabel(rate2_name, fontsize=18)
    plt.legend(loc="lower right")
    plt.show()

def plot_roc(y_true, y_score):
    # y_score should be a continuous score (probability or decision value).
    fpr, tpr, _ = roc_curve(y_true, y_score)
    auc_plotting_function(fpr, tpr, 'False Positive Rate', 'True Positive Rate', 'ROC')

In [None]:
from sklearn.metrics import roc_curve, auc

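y_logreg_score = logreg_optimal.decision_function(X_test)
plot_roc(y_test, y_logreg_score)
# kNN has no decision_function; use the positive-class probability rather
# than hard 0/1 predictions so the ROC curve has more than one threshold.
y_knn_score = knn_optimal.predict_proba(X_test)[:, 1]
plot_roc(y_test, y_knn_score)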
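
A ROC curve is traced by sweeping a threshold over continuous scores, so it matters whether the model supplies scores or only hard 0/1 labels. The sketch below uses the fact that AUC equals the fraction of (positive, negative) pairs ranked in the right order, with ties counted as half. The labels, scores, and the `pairwise_auc` helper are invented for illustration.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and scores, invented for illustration.
y_true = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]
# Thresholding at 0.5 throws away the ranking within each predicted class.
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

print(pairwise_auc(y_true, probabilities))
print(pairwise_auc(y_true, hard_labels))
```

On this toy data the probabilities give an AUC of 8/9, while the thresholded labels give only 6/9, because ties inside each predicted class no longer separate positives from negatives.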



## [BONUS] Precision-recall

#### Gridsearch the same parameters for logistic regression but change the scoring function to 'average_precision'

`'average_precision'` will optimize parameters for area under the precision-recall curve instead of for accuracy.
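
As a rough illustration of what this score measures, the sketch below uses the step-wise definition documented for recent scikit-learn releases: the mean of the precision values at each rank where a positive example is retrieved. The sklearn version used in this notebook may compute the area slightly differently, and the `average_precision` helper and data are invented for illustration.

```python
def average_precision(y_true, y_score):
    """Mean of the precision values at each rank where a positive is found.

    Assumes no tied scores and at least one positive label.
    """
    ranked = sorted(zip(y_score, y_true), reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / float(rank)
    return precision_sum / hits

# Hypothetical labels and scores, invented for illustration.
example_true = [1, 0, 1, 0]
example_score = [0.9, 0.8, 0.7, 0.1]
print(average_precision(example_true, example_score))
```

Here the positives sit at ranks 1 and 3, so the score is (1/1 + 2/3) / 2. A ranking that puts every positive first would score 1.0.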

In [None]:
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score,KFold
logreg = LogisticRegression(class_weight='balanced')
cv = KFold(len(y), n_folds=5, shuffle=True)
logreg_parameters = {
    'penalty':['l1','l2'],
    'C':np.logspace(-5,1,50),
    'solver':['liblinear']
}
gs_logreg = GridSearchCV(logreg,logreg_parameters,cv=cv,n_jobs=-1,scoring='average_precision')
gs_logreg.fit(X,y)
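
For reference, the `C` grid from `np.logspace(-5, 1, 50)` can be reproduced in plain Python. Assuming the part 5 search used this same grid (the prompt says "the same parameters"), the best `C` reported there, 1.3894954943731359, is simply the grid value at index 42, i.e. 10 raised to the power 1/7.

```python
# Pure-Python equivalent of np.logspace(-5, 1, 50): 50 exponents evenly
# spaced from -5 to 1, each used as a power of 10.
exponents = [-5 + 6 * i / 49.0 for i in range(50)]
c_grid = [10 ** e for e in exponents]

print(c_grid[0], c_grid[-1])  # smallest and largest C tried
print(c_grid[42])             # grid value closest to the reported best C
```

Because the grid is logarithmic, most candidates are small values of `C` (strong regularization); only a few of the 50 values lie above 1.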

In [None]:
gs_logreg.best_estimator_

#### Examine the best parameters and score. Are they different than the logistic regression gridsearch in part 5?

In [None]:
knn_optimal = gs_knn.best_estimator_
knn_optimal

In [None]:
knn_optimal.fit(X_train,y_train)
y_knn_pred = knn_optimal.predict(X_test)

#### Construct the confusion matrix for the optimal kNN model. Is it different from the logistic regression model? If so, how?

In [None]:
# optimal KNN model
confusion_matrix(y_test,y_knn_pred)

#### Fit a new Logreg model with optimal params 

In [None]:
# Compare accuracy score

# tree with grid: 0.76119402985074625
# Knn with grid: 0.80970149253731338
# LR-Grid_search with scoring = 'average_precision': 0.77611940298507465

#### Plot all three optimized models' ROC curves on the same plot. 

In [None]:
from sklearn.metrics import roc_curve, auc

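y_logreg_score = logreg_optimal_scoring.decision_function(X_test)
plot_roc(y_test, y_logreg_score)

# kNN and decision trees have no decision_function; use the positive-class
# probability instead of hard 0/1 predictions as the ROC score.
y_knn_score = knn_optimal.predict_proba(X_test)[:, 1]
plot_roc(y_test, y_knn_score)

y_tree_score = tree_optimal.predict_proba(X_test)[:, 1]
plot_roc(y_test, y_tree_score)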

#### Use sklearn's BaggingClassifier with your optimized decision tree model as the base estimator. How does the performance compare to the single decision tree classifier?

In [None]:
from sklearn.ensemble import BaggingClassifier
baggingtree = BaggingClassifier(tree_optimal)
evaluate_model(baggingtree)

In [None]:
# compare to 0.76119402985074625 from optimized tree model. Bagging result is better.

#### Gridsearch the optimal n_estimators, max_samples, and max_features for the bagging classifier.

In [None]:
bagging_params = {'n_estimators': [10, 20, 30, 40],
                  'max_samples': [0.3,0.5,0.7,0.8,1.0],
                  'max_features': [0.1,0.3,0.5,0.7,0.8,1.0]}
cv = KFold(len(y),n_folds=5,shuffle=True)

gsbaggingtree = GridSearchCV(baggingtree,
                            bagging_params, n_jobs=-1,
                            cv=cv)
gsbaggingtree.fit(X,y)
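
To get a feel for the cost of this search, the number of candidate models can be counted in plain Python. The parameter lists are copied from the cell above; the fold count matches `n_folds=5`.

```python
from itertools import product

bagging_params = {
    'n_estimators': [10, 20, 30, 40],
    'max_samples': [0.3, 0.5, 0.7, 0.8, 1.0],
    'max_features': [0.1, 0.3, 0.5, 0.7, 0.8, 1.0],
}
n_folds = 5

# Every combination of one value per parameter is one candidate model.
candidates = list(product(*bagging_params.values()))
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

That is 4 x 5 x 6 = 120 candidates and 600 cross-validation fits, not counting any final refit of the best candidate on the full data.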

In [None]:
gsbaggingtree.best_estimator_

In [None]:
gsbaggingtree.best_score_

#### Create a bagging classifier model with the optimal parameters and compare its performance to the other two models.

In [None]:
gs_logreg.best_params_

In [None]:
# Part 5 best_params
#{'C': 1.3894954943731359, 'penalty': 'l2', 'solver': 'liblinear'}

#### Create the confusion matrix. Is it different than when you optimized for the accuracy? If so, why would this be?

In [None]:
evaluate_model(gs_logreg.best_estimator_)

This model performs better than both the baseline logistic regression and the earlier grid-searched logistic regression.

#### Plot the precision-recall curve. 

In [None]:
logreg_optimal_scoring = gs_logreg.best_estimator_
logreg_optimal_scoring.fit(X_train,y_train)

In [None]:
y_pred = logreg_optimal_scoring.predict(X_test)

In [None]:
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
import matplotlib.pyplot as plt

# Named separately from the ROC helpers so this cell does not overwrite them.
def pr_plotting_function(rate1, rate2, rate1_name, rate2_name, curve_name):
    # Label the line so that plt.legend() has an entry to show.
    plt.plot(rate1, rate2, label=curve_name, linewidth=4)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel(rate1_name, fontsize=18)
    plt.ylabel(rate2_name, fontsize=18)
    plt.legend(loc="lower right")
    plt.show()

def plot_precision_recall(y_true, y_score):
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    pr_plotting_function(recall, precision, 'Recall', 'Precision', 'Precision-recall')

In [None]:
y_score = logreg_optimal_scoring.decision_function(X_test)
plot_precision_recall(y_test, y_score)

## [VERY BONUS] Decision trees, ensembles, bagging

#### Gridsearch a decision tree classifier model on the data, searching for optimal depth. Create a new decision tree model with the optimal parameters.

In [None]:
evaluate_model(gsbaggingtree.best_estimator_)

### Random Forest, AdaBoost Regressor, Gradient Boosting Trees Regressor

In [None]:
gs_logreg.best_score_

In [None]:
tree_optimal = gs_tree.best_estimator_

In [None]:
evaluate_model(tree_optimal)

In [None]:
tree_optimal.fit(X_train,y_train)
y_pred = tree_optimal.predict(X_test)

#### Compare the performance of the decision tree model to the logistic regression and kNN models.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV
cv = KFold(len(y), n_folds=5, shuffle=True)
tree_params = {
    'max_depth' : [1,2,3,4,5,6]
}
tree = DecisionTreeClassifier(class_weight='balanced')
gs_tree = GridSearchCV(tree, tree_params, cv=cv, n_jobs=-1)
gs_tree.fit(X,y)

In [None]:
gs_tree.best_estimator_

In [None]:
gs_tree.best_score_

In [None]:
# StandardScaler 'age' and 'fare'

In [None]:
#age
from sklearn.preprocessing import StandardScaler
# StandardScaler expects 2-D input, so select the column as a one-column DataFrame.
scaler = StandardScaler().fit(df[['Age']])
# ravel() flattens the (n, 1) result back to 1-D for the Series.
Age_transformed = scaler.transform(df[['Age']]).ravel()
Age_transformed = pd.Series(Age_transformed,name='Age_transformed')

In [None]:
#Fare
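# StandardScaler expects 2-D input, so select the column as a one-column DataFrame.
scaler = StandardScaler().fit(df[['Fare']])
Fare_transformed = scaler.transform(df[['Fare']]).ravel()
Fare_transformed = pd.Series(Fare_transformed,name='Fare_transformed')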
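
For reference, the transformation `StandardScaler` applies to a single column is a z-score using the population standard deviation (ddof = 0). This is a plain-Python sketch with made-up values; the `standardize` helper is illustrative only.

```python
import math

def standardize(values):
    """Z-score each value using the population standard deviation (ddof=0)."""
    mean = sum(values) / float(len(values))
    variance = sum((v - mean) ** 2 for v in values) / float(len(values))
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]

# Hypothetical ages, invented for illustration.
ages = [22.0, 38.0, 26.0, 35.0, 54.0]
scaled = standardize(ages)
print([round(z, 3) for z in scaled])
```

After scaling, the column has mean 0 and unit variance. This matters for kNN in particular, since its distances would otherwise be dominated by the feature with the largest raw range.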

In [None]:
df_complete = pd.concat([df.PassengerId,df.Name,df.Survived,df.Sex,
                         Embarked_dummies,Pclass_dummies,Age_transformed,Fare_transformed
                         ,df.SibSp,df.Parch], axis=1)
df_complete

## Logistic Regression and Model Validation

#### Define the variables

In [None]:
import numpy as np
np.random.seed(1)
# Drop the identifier, the free-text name, and the target from the features.
X = df_complete.drop(['PassengerId', 'Name', 'Survived'], axis=1)
y = df_complete.Survived
X.head()

In [None]:
y.value_counts()/len(y)

In [None]:
from sklearn.cross_validation import train_test_split, KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

def evaluate_model(model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    a = accuracy_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)
    cr = classification_report(y_test, y_pred)
    print(cm)
    print(cr)
    return a

In [None]:
from sklearn.linear_model import LogisticRegression
evaluate_model(LogisticRegression(class_weight='balanced'))


True positives: 72
True negatives:  134
False negatives:  31
False positives:  31

## Gridsearch - LR

The grid-searched logistic regression gives the same accuracy score as the baseline model.

#### Explain the difference between the L1 (Lasso) and L2 (Ridge) penalties on the model coefficients.

The L2 (Ridge) penalty shrinks coefficients toward zero but never sets them exactly to zero, so every feature stays in the model. In contrast, the L1 (Lasso) penalty does both coefficient shrinkage and variable selection automatically, because it can set some coefficients exactly to zero.
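
The difference is easiest to see in a simplified setting: a single coefficient with squared-error loss, where both penalized problems have closed-form solutions. This is only a sketch of the mechanism, not what the liblinear solver does for logistic regression; the helper names and the example coefficients are invented for illustration.

```python
def ridge_shrink(w, lam):
    """Minimizer of 0.5 * (v - w)**2 + 0.5 * lam * v**2 over v."""
    return w / (1.0 + lam)

def lasso_shrink(w, lam):
    """Minimizer of 0.5 * (v - w)**2 + lam * abs(v) over v (soft-thresholding)."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

# Hypothetical unpenalized coefficients, invented for illustration.
coefficients = [3.0, -1.5, 0.4, -0.2]
lam = 0.5
print([round(ridge_shrink(w, lam), 3) for w in coefficients])
print([lasso_shrink(w, lam) for w in coefficients])
```

Ridge rescales every coefficient by the same factor, so small coefficients stay small but nonzero. Lasso subtracts a fixed amount and clips at zero, so the two small coefficients drop out entirely.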

#### In what hypothetical situations are the Ridge and Lasso penalties useful?

A large number of variables or a low ratio of observations to variables (including the n ≪ p case), high collinearity, the need for a sparse solution (i.e., embedding feature selection in the estimation of model parameters), or accounting for variable grouping in a high-dimensional data set.

## Gridsearch and kNN

#### Perform Gridsearch for the same classification problem as above, but use KNeighborsClassifier as your estimator

Include at least the number of neighbors and the weights in your parameter dictionary.