Handling Typical ML Problems
1. Identify the problem.
2. Split the data into training and validation sets.
3. Identify the different types of variables in the data.
4. Separate out the numerical variables first. These variables don’t need any encoding, so we can start applying normalization and machine learning models to them right away.
5. There are two ways in which we can handle categorical data:
	1. Convert the categorical data to labels using LabelEncoder
	2. Convert the labels to binary variables (one-hot encoding)
6. Stack the features together.
7. Start applying machine learning models.
	At this stage, the only models you should go for are ensemble tree-based models. These include:
		RandomForestClassifier
		RandomForestRegressor
		ExtraTreesClassifier
		ExtraTreesRegressor
		XGBClassifier
		XGBRegressor

	 To use linear models, first scale the features with Normalizer or StandardScaler from scikit-learn.
	 These normalization methods work best on dense features and don’t give very good results when applied to sparse features. StandardScaler can still be applied to sparse matrices as long as the mean is not used for centering (parameter: with_mean=False).

8. If the above steps give a “good” model, we can go on to optimize the hyperparameters.

Otherwise, we can try decomposition methods to improve the model:

    PCA
    LDA
    QDA
    SVD

    PCA - for images, start with 10-15 components and increase this number as long as the quality of the result improves substantially.
    For other types of data, we select 50-60 components initially (we tend to avoid PCA as long as we can deal with the numerical data as it is).
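
The workflow above can be sketched end to end on a small synthetic dataset. This is only an illustration under stated assumptions: the column names (`num_a`, `num_b`, `cat`), the toy target, and all parameter values are hypothetical and are not taken from any dataset used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two numerical columns, one categorical column, binary target
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "num_a": rng.normal(size=200),
    "num_b": rng.normal(size=200),
    "cat": rng.choice(["red", "green", "blue"], size=200),
})
df["target"] = (df["num_a"] + (df["cat"] == "red") > 0.5).astype(int)

# Step 2: stratified train/validation split
X = df.drop(columns="target")
y = df["target"]
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Steps 3-5: separate numerical and categorical variables, one-hot encode the latter
num_cols = ["num_a", "num_b"]
cat_cols = ["cat"]
ohe = OneHotEncoder(handle_unknown="ignore")
cat_train = ohe.fit_transform(X_train[cat_cols]).toarray()
cat_valid = ohe.transform(X_valid[cat_cols]).toarray()

# Step 6: stack the features together
train_features = np.hstack((X_train[num_cols].to_numpy(), cat_train))
valid_features = np.hstack((X_valid[num_cols].to_numpy(), cat_valid))

# Step 7: start with a tree-based ensemble model
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train_features, y_train)
valid_accuracy = clf.score(valid_features, y_valid)
print("Validation accuracy: %.3f" % valid_accuracy)
```

Note that the encoder is fitted on the training rows only and then reused on the validation rows, so both matrices end up with the same columns.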

### All necessary Import Reference

In [None]:
## Imports

%matplotlib inline

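import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # used by the plotting cells below
import seaborn as sns            # used by the plotting cells below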

# calculate accuracy measures and confusion matrix
from sklearn import metrics
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings("ignore")

### Reading Files

In [None]:
#Read file without column names
columns = ['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin',
           'BMI','DiabetesPedigreeFunction','Age','Outcome']
X = pd.read_csv(r"C:\Users\divyakamat\PG_AI_ML\python\project\module_3\pima-indians-diabetes.data", names=columns)
y = X.pop("Outcome")

### Separating Independent and Dependent variables

In [None]:
target = 'Attrition'
X = df.loc[:, df.columns!=target]
y = df.loc[:, df.columns==target]


y = np.where(df['Class']== 2, 1, 0)
y = np.where(df['Attrition']=='Yes', 1, 0)

### Missing Values

In [None]:
# Total missing values for each feature
print(df.isnull().sum())

In [None]:
#Replace Missing Values

# Replace missing values with a number
df['ST_NUM'].fillna(125, inplace=True)

# Location based replacement
df.loc[2,'ST_NUM'] = 125

# Replace using median 
median = df['NUM_BEDROOMS'].median()
df['NUM_BEDROOMS'].fillna(median, inplace=True)

df['Bare Nuclei'] = df['Bare Nuclei'].replace({0: median})
                          
          
#Replace ? with NaN (replace returns a new DataFrame, so assign the result back)
df = df.replace({'?': np.nan})
                          

### Replace zero with mean and median

In [None]:
df['Pres'] = df['Pres'].replace({0: df['Pres'].mean()})
df['mass'] = df['mass'].replace({0: df['mass'].median()})

In [None]:
data.drop(['Unnamed: 32',"id"], axis=1, inplace=True)


data.diagnosis = [1 if x == "M" else 0 for x in data.diagnosis]

In [None]:
#Dataframe copy
train_data = train_df.copy()

#Use median to fill NaNs
train_data["Age"].fillna(train_df["Age"].median(skipna=True), inplace=True)

#Use mode or maximum value to fill nan in categorical features
train_data["Embarked"].fillna(train_df['Embarked'].value_counts().idxmax(), inplace=True)

#Drop a column
train_data.drop('Cabin', axis=1, inplace=True)

### Type Conversions

In [None]:


# to_datetime returns a new Series, so assign the result back
df['date'] = pd.to_datetime(df['date'])

for feature in credit_df.columns: # Loop through all columns in the dataframe
    if credit_df[feature].dtype == 'object': # Only apply for columns with categorical strings
        credit_df[feature] = pd.Categorical(credit_df[feature]).codes # Replace strings with an integer
        
        
df["Bare Nuclei"]=df["Bare Nuclei"].astype(int)


In [None]:
import re

def strip_character(dataCol):
    r = re.compile(r'[^a-zA-Z !@#$%&*_+-=|\:";<>,./()[\]{}\']')
    return r.sub('', dataCol)

df[resultCol] = df[dataCol].apply(strip_character)

#### Categorical column details in tabular format

In [None]:
def basic_details(df):
    b = pd.DataFrame()
    b['Missing value'] = df.isnull().sum()
    b['N unique value'] = df.nunique()
    b['dtype'] = df.dtypes
    return b
basic_details(train)

In [None]:
# dataset with numerical features
numeric_variables= list(X.dtypes[X.dtypes != "object"].index)
X[numeric_variables].head()

In [None]:
#List Categorical features
def describe_categorical(X):
    from IPython.display import display, HTML
    display(HTML(X[X.columns[X.dtypes == "object"]].describe().to_html()))

In [None]:
categorical_feature_columns = list(set(credit_df.columns) - set(credit_df._get_numeric_data().columns))
categorical_feature_columns

In [None]:
for i in [a for a in categorical_feature_columns if not a.startswith('default')]:
    print("----------------------------------------")
    print("Value counts for :",i)
    print("----------------------------------------")
    print(credit_df[i].value_counts())

In [None]:
df[df.dtypes[(df.dtypes != "object")].index.values].hist(figsize=[11,11])

In [None]:
for i in [a for a in categorical_feature_columns if not a.startswith('default')]:
        pd.crosstab(credit_df[i],credit_df.default).plot(kind='bar')

In [None]:
# %% normalization
# Min-max normalization: rescale every column of the DataFrame x_data to the [0, 1] range
x = (x_data - x_data.min()) / (x_data.max() - x_data.min())
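
As a small worked example of the min-max normalization above, the following sketch rescales a toy DataFrame. The column names and values are hypothetical and only serve to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only
x_data = pd.DataFrame({"age": [20, 30, 40], "income": [1000, 3000, 2000]})

# Subtract each column's minimum and divide by its range (max - min)
x = (x_data - x_data.min()) / (x_data.max() - x_data.min())
print(x)
```

Each column now runs from 0 (its smallest value) to 1 (its largest value), and a value halfway through the range maps to 0.5.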

### Code to plot graph before and after missing value imputation

In [None]:
plt.figure(figsize=(15,8))
ax = train_df["Age"].hist(bins=15, density=True, stacked=True, color='teal', alpha=0.6)
train_df["Age"].plot(kind='density', color='teal')
ax = train_data["Age"].hist(bins=15, density=True, stacked=True, color='orange', alpha=0.5)
train_data["Age"].plot(kind='density', color='orange')
ax.legend(['Raw Age', 'Adjusted Age'])
ax.set(xlabel='Age')
plt.xlim(-10,85)
plt.show()

In [None]:
sns.barplot(x='Sex', y='Survived', data=train_df, color="aquamarine")
plt.show()

In [None]:
categorical = [
  'MSZoning', 'LotShape', 'Neighborhood', 'CentralAir', 'SaleCondition', 'MoSold', 'YrSold'
]
fig, ax = plt.subplots(2, 4, figsize=(20, 10))
for variable, subplot in zip(categorical, ax.flatten()):
    sns.countplot(x=housing[variable], ax=subplot)
    for label in subplot.get_xticklabels():
        label.set_rotation(90)

In [None]:
fig, ax = plt.subplots(3, 3, figsize=(15, 10))
for var, subplot in zip(categorical, ax.flatten()):
    sns.boxplot(x=var, y='SalePrice', data=housing, ax=subplot)

#### Stacked Bar Plot

In [None]:
for i in [a for a in categorical_feature_columns if not a.startswith('default')]:
    table=pd.crosstab(credit_df[i],credit_df.default)
    table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)

In [None]:
## Print Max rows
def printall(X,max_rows=10):
    from IPython.display import display, HTML
    display(HTML(X.to_html(max_rows=max_rows)))
printall(X)

In [None]:
def clean_cabin(x):
    try:
        return x[0]
    except TypeError:
        return "None"
    
X["Cabin"] = X.Cabin.apply(clean_cabin)

### Distribution of feature against target class

In [None]:
plt.figure(figsize=(15,8))
ax = sns.kdeplot(df["mass"][df["class"] == 1], color="darkturquoise", fill=True)
sns.kdeplot(df["mass"][df["class"] == 0], color="lightcoral", fill=True)
plt.legend(['Diabetic (class 1)', 'Non-diabetic (class 0)'])
plt.title('Density Plot of mass for diabetic and non diabetic Population')
ax.set(xlabel='mass')
plt.xlim(-10,85)
plt.show()

### Determine outliers in dataset
Outliers are extreme observations that behave very differently from the rest of the data points. In the function below, values of a numeric feature that fall outside the 1.5 × IQR fences are replaced by the feature's 1st or 99th percentile.



In [None]:
def outlier(df,columns):
    for i in columns:
        quartile_1,quartile_3 = np.percentile(df[i],[25,75])
        quartile_f,quartile_l = np.percentile(df[i],[1,99])
        IQR = quartile_3-quartile_1
        lower_bound = quartile_1 - (1.5*IQR)
        upper_bound = quartile_3 + (1.5*IQR)
        print(i,lower_bound,upper_bound,quartile_f,quartile_l)
                
        # Assign through df.loc so the change is written to df itself, not to a temporary copy
        df.loc[df[i] < lower_bound, i] = quartile_f
        df.loc[df[i] > upper_bound, i] = quartile_l
        
outlier(train,num_col)
outlier(test,num_col) 
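
To see what this capping does, here is a minimal sketch on a single toy column. The data (the integers 1 to 99 plus one extreme value) and the variable names are hypothetical; the logic mirrors the `outlier` function above but works on a copy instead of modifying the DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 1..99 plus one extreme value
toy = pd.DataFrame({"value": list(range(1, 100)) + [1000]})

# Fences from the interquartile range, replacement values from the 1st/99th percentiles
q1, q3 = np.percentile(toy["value"], [25, 75])
p01, p99 = np.percentile(toy["value"], [1, 99])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Work on a float copy so the percentile values can be stored without truncation
capped = toy["value"].astype(float).copy()
capped[capped < lower_bound] = p01
capped[capped > upper_bound] = p99

print("Fences:", lower_bound, upper_bound)
print("Max before:", toy["value"].max(), "Max after:", capped.max())
```

Only the extreme value crosses the upper fence here, so it is the only entry that changes; the rest of the column is left untouched.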

### Stratified Splitting

The data “must” be split into training and validation sets according to the labels. For any kind of classification problem, use stratified splitting. In Python, you can do this very easily using scikit-learn.

The examples set eval_size, the size of the validation set, to 10% of the full data, but you can choose this value according to the size of the data you have.

In [None]:
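from sklearn.model_selection import StratifiedKFold  # sklearn.cross_validation was removed in scikit-learn 0.20
eval_size = 0.10
kf = StratifiedKFold(n_splits=round(1./eval_size))
# Take the first fold as the validation set
train_indices, valid_indices = next(kf.split(X, y))
# Positional indexing as written assumes NumPy arrays; use .iloc for pandas objects
X_train, y_train = X[train_indices], y[train_indices]
X_valid, y_valid = X[valid_indices], y[valid_indices]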

In [None]:
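from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
eval_size = 0.10
kf = KFold(n_splits=round(1./eval_size))
# Take the first fold as the validation set
train_indices, valid_indices = next(kf.split(X))
# Positional indexing as written assumes NumPy arrays; use .iloc for pandas objects
X_train, y_train = X[train_indices], y[train_indices]
X_valid, y_valid = X[valid_indices], y[valid_indices]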

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101,stratify=y)
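
The effect of `stratify=y` is easiest to see on imbalanced labels. The sketch below uses hypothetical data (90 negatives, 10 positives) and checks that the positive rate is the same in both parts of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives, 10 positives
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 90 + [1] * 10)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=101, stratify=y_demo)

print("Positive rate overall:", y_demo.mean())
print("Positive rate in train:", y_tr.mean())
print("Positive rate in valid:", y_va.mean())
```

Without stratification, a small validation set could by chance end up with very few or even no positive examples.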

### Standard Scaler

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
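
The introduction notes that StandardScaler can be applied to sparse matrices when `with_mean=False`. A minimal sketch of that case, using a small hypothetical matrix, could look like this:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Hypothetical sparse feature matrix (mostly zeros)
X_sparse = sparse.csr_matrix(
    np.array([[0.0, 2.0], [0.0, 0.0], [4.0, 0.0], [0.0, 6.0]]))

# with_mean=False skips centering, so zeros stay zero and the matrix stays sparse
sparse_scaler = StandardScaler(with_mean=False)
X_scaled = sparse_scaler.fit_transform(X_sparse)

print(type(X_scaled))
print(X_scaled.toarray())
```

Centering would turn every zero into a non-zero value and destroy the sparsity, which is why only the division by the standard deviation is applied here.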

### Label Encoder

In [None]:
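from sklearn.preprocessing import LabelEncoder

# LabelEncoder expects a single 1-D column, so fit one encoder per categorical feature
lbl_encs = {col: LabelEncoder().fit(X_train[col]) for col in categorical_features}
Xtrain_cat = np.column_stack([lbl_encs[col].transform(X_train[col]) for col in categorical_features])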

### OneHotEncoder

In [None]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
ohe.fit(X_train[categorical_features])
Xtrain_cat = ohe.transform(X_train[categorical_features])
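
One practical detail is what happens when the validation set contains a category that was never seen during training. The sketch below is an illustration with hypothetical data; it uses `handle_unknown="ignore"`, which is an addition to the cell above, so that unseen categories are encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_cat = pd.DataFrame({"color": ["red", "green", "red"]})
valid_cat = pd.DataFrame({"color": ["green", "blue"]})  # "blue" never seen in training

ohe_demo = OneHotEncoder(handle_unknown="ignore")
train_encoded = ohe_demo.fit_transform(train_cat)  # sparse matrix by default
valid_encoded = ohe_demo.transform(valid_cat)      # reuse the categories learned on train

print(ohe_demo.categories_)
print(valid_encoded.toarray())
```

The encoder learns its columns from the training data only, so the validation matrix has exactly the same width as the training matrix.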

### One hot encoding using pd.get_dummies

In [None]:

#create categorical variables and drop some variables
training=pd.get_dummies(train_data, columns=["Pclass","Embarked","Sex"])

#Drop the actual columns 
training.drop('Sex_female', axis=1, inplace=True)
training.drop('PassengerId', axis=1, inplace=True)
training.drop('Name', axis=1, inplace=True)
training.drop('Ticket', axis=1, inplace=True)

### Combine text variables

In [None]:
# Join the two text columns of each row into one string
text_data = list(X_train.apply(lambda x: '%s %s' % (x['column_1'], x['column_2']), axis=1))

In [None]:
##We can then use CountVectorizer or TfidfVectorizer on it

from sklearn.feature_extraction.text import CountVectorizer
ctv = CountVectorizer()
text_data_train = ctv.fit_transform(text_data_train)
# Only transform the validation text: refitting would build a different vocabulary
text_data_valid = ctv.transform(text_data_valid)


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfv = TfidfVectorizer()
text_data_train = tfv.fit_transform(text_data_train)
# Only transform the validation text: refitting would build a different vocabulary
text_data_valid = tfv.transform(text_data_valid)

In [None]:
#TfidfVectorizer performs better than the counts most of the time
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                     ngram_range=(1,2), use_idf=True, smooth_idf=True, sublinear_tf=True,
                     stop_words = 'english')



In [None]:
#If you are applying these vectorizers only on the training set,
#make sure to dump it to hard drive so that you can use it later on the validation set.

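import pickle  # cPickle exists only in Python 2
# vectorizer is the fitted vectorizer from above (e.g. ctv or tfv)
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f, protocol=pickle.HIGHEST_PROTOCOL)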
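
The fit-on-train, reuse-on-validation pattern can be checked on a tiny example. The sentences below are hypothetical, and the vectorizer is round-tripped through `pickle.dumps` / `pickle.loads` in memory rather than written to disk, purely to keep the sketch self-contained.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["the cat sat", "the dog barked", "a cat and a dog"]
valid_texts = ["the cat barked", "an unseen bird"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_texts)  # learn vocabulary and IDF on training text only
valid_matrix = vec.transform(valid_texts)      # reuse them on the validation text

# Round-trip through pickle, as one would between a training and a scoring script
restored = pickle.loads(pickle.dumps(vec))
same_output = np.allclose(restored.transform(valid_texts).toarray(),
                          valid_matrix.toarray())

print(train_matrix.shape, valid_matrix.shape, same_output)
```

Words that never appeared in the training text are simply ignored at transform time, so the second validation sentence becomes an all-zero row.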



### Stacker Module

Stacker module is not a model stacker but a feature stacker. The different features after the processing steps described above can be combined using the stacker module.

You can horizontally stack all the features before putting them through further processing by using numpy hstack or sparse hstack depending on whether you have dense or sparse features.


In [None]:
import numpy as np
from scipy import sparse

#in case of dense data
#Stack arrays in sequence horizontally (column wise).
X = np.hstack((x1,x2, ...))

# in case data is sparse
X = sparse.hstack((x1,x2, ...))
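
A concrete version of the two calls above, with small hypothetical arrays standing in for the processed feature blocks, might look like this:

```python
import numpy as np
from scipy import sparse

# Dense blocks, e.g. numerical features and a label-encoded column
x1 = np.array([[1.0, 2.0], [3.0, 4.0]])
x2 = np.array([[0.0], [1.0]])
X_dense = np.hstack((x1, x2))

# Sparse block, e.g. TF-IDF or one-hot features
x3 = sparse.csr_matrix([[0, 0, 5], [7, 0, 0]])
# Convert the dense block to sparse first, then stack and return CSR for fast row slicing
X_mixed = sparse.hstack((sparse.csr_matrix(x1), x3)).tocsr()

print(X_dense.shape, X_mixed.shape)
```

All blocks must have the same number of rows, in the same row order, for the stacked matrix to be meaningful.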

### Decomposition using PCA and SVD

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components = 12)
pca.fit(xtrain)
xtrain = pca.transform(xtrain)



For text data, after conversion of text to sparse matrix, go for Singular Value Decomposition (SVD). A variation of SVD called TruncatedSVD can be found in scikit-learn.

In [None]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components = 120)
svd.fit(xtrain)
xtrain = svd.transform(xtrain)

The number of SVD components that generally works well for TF-IDF or count features is between 120 and 200. Going above this might improve performance, but not substantially, and it comes at the cost of computing power.
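
A runnable sketch of TF-IDF followed by TruncatedSVD is shown below. The documents are hypothetical, and because the toy vocabulary is tiny, only 3 components are used here instead of the 120-200 suggested for real text data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log",
        "cats and dogs are pets", "logs and mats are objects",
        "the pet cat chased the dog"]

# Sparse TF-IDF matrix: one row per document, one column per word
tfidf = TfidfVectorizer().fit_transform(docs)

# TruncatedSVD works directly on the sparse matrix (no centering, unlike PCA)
svd_demo = TruncatedSVD(n_components=3, random_state=0)
reduced = svd_demo.fit_transform(tfidf)

print(tfidf.shape, "->", reduced.shape)
print("Explained variance kept: %.3f" % svd_demo.explained_variance_ratio_.sum())
```

Watching how the explained variance grows as `n_components` increases is one way to judge when adding more components stops paying off.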

### Feature Selection

There are multiple ways to do feature selection. One of the most common is greedy feature selection (forward or backward). In greedy feature selection we choose one feature, train a model and evaluate its performance on a fixed evaluation metric. We keep adding or removing features one by one and record the performance of the model at every step. We then select the features that give the best evaluation score.
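
A minimal sketch of the forward variant of greedy feature selection is given below. It is an illustration under assumptions, not a reference implementation: the data comes from `make_classification`, the helper name `forward_selection` is hypothetical, logistic regression with 5-fold cross-validated accuracy is chosen arbitrarily as the model and metric, and the search stops as soon as no remaining feature improves the score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 8 features, of which 3 are informative
X_fs, y_fs = make_classification(n_samples=300, n_features=8, n_informative=3,
                                 n_redundant=0, random_state=0)

def forward_selection(X, y, model, cv=5):
    """Greedily add the feature that most improves the mean CV accuracy."""
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        # Score every candidate feature when added to the current selection
        scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=cv,
                                     scoring="accuracy").mean()
                  for f in remaining}
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best_score:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected, best_score

selected_features, selected_score = forward_selection(
    X_fs, y_fs, LogisticRegression(max_iter=1000))
print("Selected feature indices:", selected_features)
print("Cross-validated accuracy: %.3f" % selected_score)
```

Each round retrains the model once per remaining feature, so the cost grows quickly with the number of features; this is why the faster model-based methods are often preferred.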

Other faster methods of feature selection include selecting best features from a model. We can either look at coefficients of a logit model or we can train a random forest to select best features and then use them later with other machine learning models.

In [None]:
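from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

clf = RandomForestClassifier(n_estimators = 100, n_jobs = -1)
clf.fit(X,y)
# Estimators no longer have a transform() method; SelectFromModel keeps the features
# whose importance is above the threshold (the mean importance by default)
X_Selected = SelectFromModel(clf, prefit=True).transform(X)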

#Remember to keep low number of estimators and minimal optimization of hyper parameters so that you don’t overfit.

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100,oob_score=True,random_state=42)
model.fit(X,y)
model.feature_importances_

# feature_importances = pd.Series(model.feature_importances_, index=X.columns)
# feature_importances.plot(kind='barh',figsize=[7,6])

# Importances of the model fitted above, sorted from most to least important
print(pd.DataFrame(model.feature_importances_, columns = ["Imp"], index = X.columns).sort_values(['Imp'],ascending=False))


The feature selection can also be achieved using Gradient Boosting Machines. It is good if we use xgboost instead of the implementation of GBM in scikit-learn since xgboost is much faster and more scalable.

In [None]:
import xgboost as xgb

params = {}

model = xgb.train(params,dtrain, num_boost_round=100)
sorted(model.get_fscore().items(),key=lambda t: -t[1])

We can also do feature selection of sparse datasets using RandomForestClassifier / RandomForestRegressor and xgboost.

#### Chi-square Test

In [None]:
from scipy.stats import chisquare,chi2_contingency

alpha=0.05
for i in [a for a in categorical_feature_columns if not a.startswith('default')]:
    print("ChiSquare test for: ",i)
    print("----------------------------------------")
    table=pd.crosstab(credit_df[i],credit_df.default)
    chi2,pval,dof,expected = chi2_contingency(table)
    print("ChiSquare test statistic: ",chi2)
    print("p-value: ",pval)
    
    
    if pval < alpha:
            result="\n{0} is IMPORTANT for Prediction".format(i)
    else:
            result="\n{0} is NOT an important predictor. (Discard {0} from model)".format(i)

    print(result) 
    print("\n===============================================================================")

### Recursive feature elimination
Given an external estimator that assigns weights to features, recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

cols = ["Age","Fare","TravelAlone","Pclass_1","Pclass_2","Embarked_C","Embarked_S","Sex_male","IsMinor"] 
X = final_train[cols]
y = final_train['Survived']
# Build a logreg and compute the feature importances
model = LogisticRegression()
# create the RFE model and select 8 attributes
rfe = RFE(model, n_features_to_select=8)
rfe = rfe.fit(X, y)
# summarize the selection of the attributes
print('Selected features: %s' % list(X.columns[rfe.support_]))

### Feature ranking with recursive feature elimination and cross-validation
RFECV performs RFE in a cross-validation loop to find the optimal number of features. The cell below applies recursive feature elimination to a logistic regression, with the number of selected features tuned automatically by cross-validation.

In [None]:
from sklearn.feature_selection import RFECV
# Create the RFE object and compute a cross-validated score.
# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=LogisticRegression(), step=1, cv=10, scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features: %d" % rfecv.n_features_)
print('Selected features: %s' % list(X.columns[rfecv.support_]))

# Plot number of features VS. cross-validation scores
plt.figure(figsize=(10,6))
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
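# grid_scores_ was removed in scikit-learn 1.2; cv_results_ holds the mean CV score per feature count
mean_scores = rfecv.cv_results_['mean_test_score']
plt.plot(range(1, len(mean_scores) + 1), mean_scores)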
plt.show()

In [None]:
Selected_features = ['Age', 'TravelAlone', 'Pclass_1', 'Pclass_2', 'Embarked_C', 
                     'Embarked_S', 'Sex_male', 'IsMinor']
X = final_train[Selected_features]

plt.subplots(figsize=(8, 5))
sns.heatmap(X.corr(), annot=True, cmap="RdYlGn")
plt.show()

### Plot learning Curves

In [None]:
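from mlxtend.plotting import plot_learning_curves
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split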

#plot learning curves
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
bagging1 = BaggingClassifier(base_estimator=clf1, n_estimators=10, max_samples=0.8, max_features=0.8)

plt.figure()
plot_learning_curves(X_train, y_train, X_test, y_test, bagging1, print_model=False, style='ggplot')
plt.show()

## Model Evaluation

### 1. Model evaluation based on simple train/test split using train_test_split() function

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score 
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_curve, auc, log_loss

# create X (features) and y (response)
X = final_train[Selected_features]
y = final_train['Survived']

# use train/test split with different random_state values
# changing random_state changes the accuracy score, sometimes by a lot,
# which is why a score from a single train/test split is a high-variance estimate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# check classification scores of logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
[fpr, tpr, thr] = roc_curve(y_test, y_pred_proba)
print('Train/Test split results:')
print(logreg.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_test, y_pred))
print(logreg.__class__.__name__+" log_loss is %2.3f" % log_loss(y_test, y_pred_proba))
print(logreg.__class__.__name__+" auc is %2.3f" % auc(fpr, tpr))

idx = np.min(np.where(tpr > 0.95)) # index of the first threshold for which the sensitivity > 0.95

plt.figure()
plt.plot(fpr, tpr, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot([0,fpr[idx]], [tpr[idx],tpr[idx]], '--', color='blue')
plt.plot([fpr[idx],fpr[idx]], [0,tpr[idx]], '--', color='blue')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()

print("Using a threshold of %.3f " % thr[idx] + "guarantees a sensitivity of %.3f " % tpr[idx] +  
      "and a specificity of %.3f" % (1-fpr[idx]) + 
      ", i.e. a false positive rate of %.2f%%." % (np.array(fpr[idx])*100))


### 2. Model evaluation based on K-fold cross-validation using cross_val_score() function

In [None]:
# 10-fold cross-validation logistic regression
logreg = LogisticRegression()
# Use cross_val_score function
# We are passing the entirety of X and y, not X_train or y_train, it takes care of splitting the data
# cv=10 for 10 folds
# scoring = {'accuracy', 'neg_log_loss', 'roc_auc'} for evaluation metric - although there are many others
scores_accuracy = cross_val_score(logreg, X, y, cv=10, scoring='accuracy')
scores_log_loss = cross_val_score(logreg, X, y, cv=10, scoring='neg_log_loss')
scores_auc = cross_val_score(logreg, X, y, cv=10, scoring='roc_auc')
print('K-fold cross-validation results:')
print(logreg.__class__.__name__+" average accuracy is %2.3f" % scores_accuracy.mean())
print(logreg.__class__.__name__+" average log_loss is %2.3f" % -scores_log_loss.mean())
print(logreg.__class__.__name__+" average auc is %2.3f" % scores_auc.mean())

### 3. Model evaluation based on K-fold cross-validation using cross_validate() function

In [None]:
from sklearn.model_selection import cross_validate

scoring = {'accuracy': 'accuracy', 'log_loss': 'neg_log_loss', 'auc': 'roc_auc'}

modelCV = LogisticRegression()

results = cross_validate(modelCV, X, y, cv=10, scoring=list(scoring.values()), 
                         return_train_score=False)

print('K-fold cross-validation results:')
for sc in range(len(scoring)):
    print(modelCV.__class__.__name__+" average %s: %.3f (+/-%.3f)" % (list(scoring.keys())[sc], -results['test_%s' % list(scoring.values())[sc]].mean()
                               if list(scoring.values())[sc]=='neg_log_loss' 
                               else results['test_%s' % list(scoring.values())[sc]].mean(), 
                               results['test_%s' % list(scoring.values())[sc]].std()))

### 4. GridSearchCV evaluating using multiple scorers simultaneously

In [None]:
from sklearn.model_selection import GridSearchCV

X = final_train[Selected_features]

param_grid = {'C': np.arange(1e-05, 3, 0.1)}
scoring = {'Accuracy': 'accuracy', 'AUC': 'roc_auc', 'Log_loss': 'neg_log_loss'}

gs = GridSearchCV(LogisticRegression(), return_train_score=True,
                  param_grid=param_grid, scoring=scoring, cv=10, refit='Accuracy')

gs.fit(X, y)
results = gs.cv_results_

print('='*20)
print("best params: " + str(gs.best_estimator_))
print("best params: " + str(gs.best_params_))
print('best score:', gs.best_score_)
print('='*20)

plt.figure(figsize=(10, 10))
plt.title("GridSearchCV evaluating using multiple scorers simultaneously",fontsize=16)

plt.xlabel("Inverse of regularization strength: C")
plt.ylabel("Score")
plt.grid()

ax = plt.gca()  # reuse the current axes so the title, labels and grid set above are kept
ax.set_xlim(0, param_grid['C'].max()) 
ax.set_ylim(0.35, 0.95)

# Get the regular numpy array from the MaskedArray
X_axis = np.array(results['param_C'].data, dtype=float)

for scorer, color in zip(list(scoring.keys()), ['g', 'k', 'b']): 
    for sample, style in (('train', '--'), ('test', '-')):
        sample_score_mean = -results['mean_%s_%s' % (sample, scorer)] if scoring[scorer]=='neg_log_loss' else results['mean_%s_%s' % (sample, scorer)]
        sample_score_std = results['std_%s_%s' % (sample, scorer)]
        ax.fill_between(X_axis, sample_score_mean - sample_score_std,
                        sample_score_mean + sample_score_std,
                        alpha=0.1 if sample == 'test' else 0, color=color)
        ax.plot(X_axis, sample_score_mean, style, color=color,
                alpha=1 if sample == 'test' else 0.7,
                label="%s (%s)" % (scorer, sample))

    best_index = np.nonzero(results['rank_test_%s' % scorer] == 1)[0][0]
    best_score = -results['mean_test_%s' % scorer][best_index] if scoring[scorer]=='neg_log_loss' else results['mean_test_%s' % scorer][best_index]
        
    # Plot a dotted vertical line at the best score for that scorer marked by x
    ax.plot([X_axis[best_index], ] * 2, [0, best_score],
            linestyle='-.', color=color, marker='x', markeredgewidth=3, ms=8)

    # Annotate the best score for that scorer
    ax.annotate("%0.2f" % best_score,
                (X_axis[best_index], best_score + 0.005))

plt.legend(loc="best")
plt.grid(False)
plt.show()

## Looping through multiple models

In [None]:
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(),
    MLPClassifier(hidden_layer_sizes=(100,100,100),batch_size=10,max_iter=200),
    LogisticRegression(solver='lbfgs', multi_class='multinomial')]

# Logging for Visual Comparison
log_cols=["Classifier", "Accuracy", "Log Loss"]
log = pd.DataFrame(columns=log_cols)

for clf in classifiers:
    clf.fit(X_train, y_train)
    name = clf.__class__.__name__
    
    print("="*max(map(lambda x: len(x.__class__.__name__), classifiers)))
    print(name)
    
    print('****Results****')
    train_predictions = clf.predict(X_test)
    acc = accuracy_score(y_test, train_predictions)
    print("Accuracy: {:.4%}".format(acc))
    
    train_predictions = clf.predict_proba(X_test)
    ll = log_loss(y_test, train_predictions)
    print("Log Loss: {:.4f}".format(ll))
    
    log_entry = pd.DataFrame([[name, acc*100, ll]], columns=log_cols)
    log = pd.concat([log, log_entry], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    
print("="*max(map(lambda x: len(x.__class__.__name__), classifiers)))

In [None]:
##Barplots display
sns.set_color_codes("muted")
sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")

plt.xlabel('Accuracy %')
plt.title('Classifier Accuracy')
plt.show()

sns.set_color_codes("muted")
sns.barplot(x='Log Loss', y='Classifier', data=log, color="g")

plt.xlabel('Log Loss')
plt.title('Classifier Log Loss')
plt.show()

In [None]:
favorite_clf = LogisticRegression(solver='lbfgs', multi_class='multinomial')
favorite_clf.fit(train, labels)
test_predictions = favorite_clf.predict_proba(test)

# Format DataFrame
submission = pd.DataFrame(test_predictions, columns=classes)
submission.insert(0, 'id', test_ids)
submission.reset_index()

### Saving a model using Pickle and Joblib

In [None]:
import pickle


model = LogisticRegression()
model.fit(X_train, Y_train)
# save the model to disk
filename = 'finalized_model.sav'
pickle.dump(model, open(filename, 'wb'))
 
# some time later...
 
# load the model from disk
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_test, Y_test)
print(result)

In [None]:
# joblib provides utilities for efficiently saving and loading Python objects that make use of NumPy data structures.
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23


model = LogisticRegression()
model.fit(X_train, Y_train)
# save the model to disk
filename = 'finalized_model.sav'
joblib.dump(model, filename)
 
# some time later...
 
# load the model from disk
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, Y_test)
print(result)

### References

1. https://www.linkedin.com/pulse/approaching-almost-any-machine-learning-problem-abhishek-thakur
2. https://github.com/abhishekkrthakur/greedyFeatureSelection
3. https://github.com/vsmolyakov/experiments_with_python/blob/master/chp01/ensemble_methods.ipynb
4. https://www.kaggle.com/mnassrib/titanic-logistic-regression-with-python
5. https://www.kaggle.com/mnassrib/twelve-classifiers-for-standardized-leaf-dataset#Stratified-Train/Test-Split