![DiscoveryReady_logo.png](DiscoveryReady_logo.png)

<h2><center>General Ranking & Re-ranking</center></h2>

 <h3><center>Yipeng Han - 2018/07/17</center></h3>

### Introduction

The ranking project includes the following steps:

1. Data Loading & Cleaning
2. Feature Engineering 
3. Models
    * Logistic Regression
    * Random Forest
    * Neural Network
    * Extreme Gradient Boosting
    * Long Short Term Memory
4. Ensemble and Stacking models
5. Apply uncertainty sampling
6. Save Result

There is currently no single code template for the general ranking and re-ranking task, mainly because the input datasets differ in format from project to project. This walkthrough therefore uses the latest Stewart US data as an example. Note that the methods used here, including feature selection, feature engineering, and the machine learning models, may need to change from case to case.

### Step by Step

#### 1. Data Loading & Cleaning

In [None]:
import os
import pandas as pd
import seaborn as sns
import numpy as np
import gc
from time import time
from datetime import datetime
sns.set_style("whitegrid")
sns.set_context('paper')
from collections import Counter
from sklearn import preprocessing
from sklearn.feature_selection import RFECV
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, make_scorer
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_curve, auc, log_loss
os.chdir('R:/Analytics & Automation/Machine Learning/Stewart Merger/US')
mydata = pd.read_csv('20180709 ML Document Export US_export.csv')

In [None]:
mydata.columns

Transform TFIDF into dummy features

In [None]:
tfidf = pd.read_csv('20180709 Stewart US Merger ML Export TFIDF Export.csv')
del tfidf['workspacenum']
tfidf.columns = ['Artifact ID','TFIDF']

a = tfidf['TFIDF']
a = a.apply(lambda x: str(x).split(';'))
count = Counter()
TOP_TERMS = 50
for coord in a:
    count[coord[0]] += 1

tfidf_feature_list= [i for i, num in count.most_common(TOP_TERMS)]
a = a.apply(lambda x: [('TFIDF_'+i) for i in x if i in tfidf_feature_list])
a = a.apply(lambda x: ';'.join(x))
a[a.str.len()==0] = 'TFIDF_nan'
tfidf_feature = a.str.get_dummies(';')
tfidf_feature = pd.concat([tfidf['Artifact ID'],tfidf_feature],axis=1)
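
The same top-N dummy-encoding pattern recurs for every multi-valued column in this notebook, so it may help to see it as one function. The following is a dependency-free sketch rather than the notebook's own code: `top_n_dummies` and its parameters are hypothetical names, and unlike the loop above, which counts only the first term of each document (`coord[0]`), it counts every term.

```python
from collections import Counter

def top_n_dummies(values, top_n, prefix, sep=";"):
    """One-hot encode the top_n most frequent tokens of a multi-valued column.

    values: list of strings such as "a;b;c" (None is treated as empty).
    Returns (columns, rows), where rows is a list of 0/1 lists.
    """
    token_lists = [
        [t for t in str(v).split(sep) if t] if v is not None else []
        for v in values
    ]
    # Count every token of every document (assumption, see lead-in).
    counts = Counter(t for tokens in token_lists for t in tokens)
    keep = [t for t, _ in counts.most_common(top_n)]
    columns = [prefix + t for t in keep] + [prefix + "nan"]
    rows = []
    for tokens in token_lists:
        present = set(tokens)
        row = [int(t in present) for t in keep]
        # Rows with no retained token get the catch-all "<prefix>nan" flag.
        row.append(int(not any(row)))
        rows.append(row)
    return columns, rows

columns, rows = top_n_dummies(
    ["merger;price", "price", None, "other"], top_n=2, prefix="TFIDF_"
)
```

Wrapping the pattern this way would also avoid the copy-and-paste slips that are easy to make when the block is repeated per column.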

Transform cluster into dummy features

In [None]:
cluster = pd.read_csv('20180709 Stewart US Merger ML Export Lowest Level Relativity Analytics Cluster.csv')
cluster = cluster[['ID','ClusterName']]
cluster.columns = ['DocumentArtifactID','ClusterName']
a = cluster['ClusterName']
a = a.apply(lambda x: str(x).split(';'))
count = Counter()
TOP_CLUSTER = 50
for coord in a:
    count[coord[0]] += 1
    
cluster_feature_list= [i for i, num in count.most_common(TOP_CLUSTER)]
a = a.apply(lambda x: [('Cluster_'+i) for i in x if i in cluster_feature_list])
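a = a.apply(lambda x: ';'.join(x))
a[a.str.len()==0] = 'Cluster_nan'
cluster_feature = a.str.get_dummies(';')
# keep the document ID so the merge on 'DocumentArtifactID' further down works
cluster_feature = pd.concat([cluster['DocumentArtifactID'],cluster_feature],axis=1)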

Loading decisions: the Stewart case has 3 sets of decisions, so the whole ranking part is run 3 times, once per decision set.

In [None]:
decision = pd.read_csv('20180709 Stewart Merge US ML Export -Three Set - Training Decisions.csv')
decision = decision[['ArtifactID','SetNum','MajorCategory']]
decision.columns =['Artifact ID','SetNum','Major Category']
decision['Major Category'] = decision['Major Category'].replace('Needs Further Review','Responsive')
decision['Major Category'] = decision['Major Category'].replace('Technical Unresolved ','Not Responsive')
decision1 = decision[decision['SetNum']==1][['Artifact ID','Major Category']]
decision2 = decision[decision['SetNum']==2][['Artifact ID','Major Category']]
decision3 = decision[decision['SetNum']==3][['Artifact ID','Major Category']]
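
Because the whole ranking pipeline is run once per decision set, the per-set tables could also be held in a dictionary keyed by `SetNum` and looped over. A minimal pandas-free sketch of that grouping, using made-up records and a hypothetical `split_by_set` helper:

```python
from collections import defaultdict

# Label normalisation mirroring the replace() calls above.
LABEL_MAP = {
    "Needs Further Review": "Responsive",
    "Technical Unresolved ": "Not Responsive",  # trailing space as written above
}

def split_by_set(records):
    """Group (artifact_id, set_num, category) tuples by set number."""
    by_set = defaultdict(list)
    for artifact_id, set_num, category in records:
        by_set[set_num].append((artifact_id, LABEL_MAP.get(category, category)))
    return dict(by_set)

# Illustrative records only, not real decisions.
records = [
    (101, 1, "Responsive"),
    (102, 1, "Needs Further Review"),
    (101, 2, "Not Responsive"),
    (103, 3, "Technical Unresolved "),
]
decisions = split_by_set(records)
```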

Now create features from the other columns of the main table.

In [None]:

STR_NUMBER = 40
a = mydata['STR - 20180618 US Search Terms 001 W/10']
a = a.apply(lambda x: str(x).split(';'))
a = a.apply(lambda x: ['STR_w10_001'+ i for i in x])
count = Counter()
for coord in a:
    count[coord[0]] += 1    
w10_001_feature_list= [i for i, num in count.most_common(STR_NUMBER)]
a = a.apply(lambda x: [i for i in x if i in w10_001_feature_list])
a = a.apply(lambda x: ','.join(x))
a[a.str.len()==0] = 'STR_w10_001_nan'
w10_001_feature = a.str.get_dummies(',')

a = mydata['STR - 20180618 US Search Terms 001 W/5']
a = a.apply(lambda x: str(x).split(';'))
a = a.apply(lambda x: ['STR_w5_001'+ i for i in x])
count = Counter()
for coord in a:
    count[coord[0]] += 1    
w5_001_feature_list= [i for i, num in count.most_common(STR_NUMBER)]
a = a.apply(lambda x: [i for i in x if i in w5_001_feature_list])
a = a.apply(lambda x: ','.join(x))
a[a.str.len()==0] = 'STR_w5_001_nan'
w5_001_feature = a.str.get_dummies(',')

a = mydata['STR - 20180618 US Search Terms 002 W/10']
a = a.apply(lambda x: str(x).split(';'))
a = a.apply(lambda x: ['STR_w10_002'+ i for i in x])
count = Counter()
for coord in a:
    count[coord[0]] += 1    
w10_002_feature_list= [i for i, num in count.most_common(STR_NUMBER)]
a = a.apply(lambda x: [i for i in x if i in w10_002_feature_list])
a = a.apply(lambda x: ','.join(x))
a[a.str.len()==0] = 'STR_w10_002_nan'
w10_002_feature = a.str.get_dummies(',')


a = mydata['STR - 20180618 US Search Terms 002 W/5']
a = a.apply(lambda x: str(x).split(';'))
a = a.apply(lambda x: ['STR_w5_002'+ i for i in x])
count = Counter()
for coord in a:
    count[coord[0]] += 1    
w5_002_feature_list= [i for i, num in count.most_common(STR_NUMBER)]
a = a.apply(lambda x: [i for i in x if i in w5_002_feature_list])
a = a.apply(lambda x: ','.join(x))
a[a.str.len()==0] = 'STR_w5_002_nan'
w5_002_feature = a.str.get_dummies(',')


a = mydata['STR - 20180619 US Search Terms 003 W/10']
a = a.apply(lambda x: str(x).split(';'))
a = a.apply(lambda x: ['STR_w10_003'+ i for i in x])
count = Counter()
for coord in a:
    count[coord[0]] += 1    
w10_003_feature_list= [i for i, num in count.most_common(STR_NUMBER)]
a = a.apply(lambda x: [i for i in x if i in w10_003_feature_list])
a = a.apply(lambda x: ','.join(x))
a[a.str.len()==0] = 'STR_w10_003_nan'
w10_003_feature = a.str.get_dummies(',')

a = mydata['STR - 20180619 US Search Terms 003 W/5']
a = a.apply(lambda x: str(x).split(';'))
a = a.apply(lambda x: ['STR_w5_003'+ i for i in x])
count = Counter()
for coord in a:
    count[coord[0]] += 1    
w5_003_feature_list= [i for i, num in count.most_common(STR_NUMBER)]
a = a.apply(lambda x: [i for i in x if i in w5_003_feature_list])
a = a.apply(lambda x: ','.join(x))
a[a.str.len()==0] = 'STR_w5_003_nan'
w5_003_feature = a.str.get_dummies(',')



a = mydata['STR - DONOTUSE zAA Keyword Mining 001']
a = a.apply(lambda x: str(x).split(';'))
count = Counter()
TOP_TERMS = 50
for coord in a:
    count[coord[0]] += 1

zAA_feature_list= [i for i, num in count.most_common(TOP_TERMS)]
a = a.apply(lambda x: [('zAA_'+i) for i in x if i in zAA_feature_list])
a = a.apply(lambda x: ','.join(x))
a[a.str.len()==0] = 'zAA_nan'
zAA_feature = a.str.get_dummies(',')

a = mydata['STR - DR Responsive - US (Text)']
a = a.apply(lambda x: str(x).split(';'))
count = Counter()
TOP_TERMS = 50
for coord in a:
    count[coord[0]] += 1

Analytics_Res_feature_list= [i for i, num in count.most_common(TOP_TERMS)]
a = a.apply(lambda x: [('DRres_'+i) for i in x if i in Analytics_Res_feature_list])
a = a.apply(lambda x: ','.join(x))
a[a.str.len()==0] = 'DRres_nan'
DR_Res_feature = a.str.get_dummies(',')

SENDER_NUMBER = 15
SENDER_list = mydata['Sender Domains'].value_counts().index[0:(SENDER_NUMBER+1)]
mydata['Sender Domains'] = mydata['Sender Domains'].apply(lambda x: 
    x if x in SENDER_list else 'Sender: nan')
Sender_domain_feature = mydata['Sender Domains'].str.get_dummies()


Make sure all columns have been transformed into dummies for later use. Next comes some basic feature engineering (transformation and normalization) so that the ML methods below run smoothly. Double-check all features before starting the ML part.
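
For the normalization part, `preprocessing.scale` standardizes a column to zero mean and unit variance using the population standard deviation. A small standard-library sketch of the same arithmetic, with a hypothetical `zscore` helper and made-up values; missing entries are filled with 0 first, as in the cells that follow:

```python
from statistics import fmean, pstdev

def zscore(values, fill=0.0):
    """Fill missing values, then scale to zero mean and unit variance."""
    filled = [fill if v is None else float(v) for v in values]
    mu = fmean(filled)
    sigma = pstdev(filled, mu)  # population standard deviation (ddof=0)
    if sigma == 0:
        return [0.0 for _ in filled]  # constant column: only centering applies
    return [(v - mu) / sigma for v in filled]

scaled = zscore([2, 4, None, 6])
```

Note that filling with 0 before scaling pulls the mean toward 0, so a missing value ends up as a strongly negative z-score for columns whose typical values are large.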

In [None]:
mydata['DocType'] = mydata['DocType'].apply(lambda x: 'Doctype_'+str(x))
DocType = mydata['DocType'].str.get_dummies() 

mydata['ParentDate'] = mydata['ParentDate'].fillna('01/01/1990')
mydata['ParentDate'] = mydata['ParentDate'].apply(lambda x: (datetime(2018,4,1,0,0) - 
      datetime.strptime(str(x),'%m/%d/%Y')).days)
mydata['isParent'] = mydata['isParent'].replace(True,1)
mydata['isParent'] = mydata['isParent'].replace(False,0)
mydata['IsEmbedded'] = mydata['IsEmbedded'].replace(True,1)
mydata['IsEmbedded'] = mydata['IsEmbedded'].replace(False,0)
mydata['IsEmbedded']  = mydata['IsEmbedded'].fillna(0)

gc.collect()


mydata = mydata.drop(['BegDoc','BegAttach' ,'DR ITR Category','DocTypeDesc','DocType','DateSent', 'DateLastMod','Sender Domains'
                      ,'FileTypeDesc','FileExt',
                      'STR - 20180618 US Search Terms 001 W/10',
       'STR - 20180618 US Search Terms 001 W/5',
       'STR - 20180618 US Search Terms 002 W/10',
       'STR - 20180618 US Search Terms 002 W/5',
       'STR - 20180619 US Search Terms 003 W/10',
       'STR - 20180619 US Search Terms 003 W/5',
       'STR - DONOTUSE zAA Keyword Mining 001',
       'STR - DR Responsive - US (Text)'],axis=1)
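
The `ParentDate` transformation in this cell turns a date string into a document age in days relative to a fixed reference date. A standalone sketch of that step, where `days_before_reference` is a hypothetical helper and the defaults copy the values used above:

```python
from datetime import datetime

REFERENCE = datetime(2018, 4, 1)
DEFAULT_DATE = "01/01/1990"  # stand-in for missing dates, as above

def days_before_reference(date_str, reference=REFERENCE):
    """Days between the reference date and an MM/DD/YYYY date string."""
    if date_str is None or str(date_str).strip() == "":
        date_str = DEFAULT_DATE
    parsed = datetime.strptime(str(date_str), "%m/%d/%Y")
    return (reference - parsed).days

ages = [days_before_reference(d) for d in ["03/31/2018", "04/01/2017", None]]
```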


In [None]:
mydata = pd.merge(mydata,tfidf_feature,on='Artifact ID',how='left')
mydata = pd.merge(mydata,cluster_feature,left_on='Artifact ID',right_on='DocumentArtifactID',how='left')
mydata = pd.concat([mydata,DocType,w10_001_feature,w5_001_feature,w10_002_feature,w5_002_feature,
                    w10_003_feature,w5_003_feature,Sender_domain_feature,zAA_feature,DR_Res_feature],axis=1)

mydata['ExtractedTextSize']=mydata['ExtractedTextSize'].fillna(0)
mydata['ExtractedTextSize'] = preprocessing.scale(mydata['ExtractedTextSize'])
mydata['FileSize'] = mydata['FileSize'].fillna(0)
mydata['FileSize'] = preprocessing.scale(mydata['FileSize'])
mydata['ParentDate'] = mydata['ParentDate'].fillna(0)
mydata['ParentDate'] = preprocessing.scale(mydata['ParentDate'])
mydata['NumAttach'] = mydata['NumAttach'].fillna(0)
mydata['NumAttach'] = preprocessing.scale(mydata['NumAttach'])
mydata['Recipients_Count']= preprocessing.scale(mydata['Recipients_Count'])
gc.collect()
b = mydata.copy()
mydata = b.copy()
mydata = pd.merge(mydata,decision3,left_on = 'Artifact ID',right_on='Artifact ID',how = 'left')
mydata['Major Category'] = mydata['Major Category'].replace('Responsive',1)
mydata['Major Category'] = mydata['Major Category'].replace('Not Responsive',0)
mydata['Major Category'] = mydata['Major Category'].replace('Technical Issue',0)

In [None]:
train = mydata[mydata['Major Category'].isin([0,1])].fillna(0)

y = train['Major Category']
print('Richness of this data set: %2.2f' % (sum(y)/len(y)))
train = train.drop(['Major Category'],axis=1)
x = train.copy()
x = x.fillna(0)
y = y.fillna(0)

result = pd.DataFrame()
result['ArtifactID']=mydata['Artifact ID']
mydata_temp = mydata.drop(['Artifact ID'],axis=1).fillna(0)
x_predict = pd.DataFrame()
x_predict['ArtifactID'] = train['Artifact ID']
del train['Artifact ID']

del x['Artifact ID']
del mydata_temp['Major Category']

#### 2. Feature Engineering

We currently use SMOTE to oversample the minority class and truncated SVD to create extra features (the raw data is most likely sparse because so many string dummies are generated).
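
As a rough illustration of what SMOTE-style oversampling does, each synthetic minority sample is placed on the line segment between a minority point and one of its nearest minority neighbours. The sketch below is a simplified standard-library version with hypothetical names. It is not imbalanced-learn's implementation, and it omits both the SVM variant and the Tomek-link cleaning configured in the cell that follows.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, sorted by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(a + lam * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic

# Made-up 2-D minority samples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=6)
```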

In [None]:
from imblearn.over_sampling import SMOTE,ADASYN, RandomOverSampler
from imblearn.combine import SMOTEENN,SMOTETomek
from imblearn.under_sampling import RandomUnderSampler
resampler = SMOTETomek(ratio = 'auto', smote = SMOTE(ratio='auto',k_neighbors=5,kind='svm'))
#resampler = RandomUnderSampler()

from sklearn.decomposition import TruncatedSVD
K = 200
svd = TruncatedSVD(n_components = K,random_state=0)
SVD_x = svd.fit(x)
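# reuse the existing row index so the SVD components align row by row when concatenated
x = pd.concat([x,pd.DataFrame(SVD_x.transform(x),index=x.index)],axis=1)
mydata_temp =  pd.concat([mydata_temp,pd.DataFrame(SVD_x.transform(mydata_temp),index=mydata_temp.index)],axis=1)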
perc=SVD_x.explained_variance_ratio_.cumsum()[K-1]
print("Cumulative explained variation: " + "{:.2%}".format(perc))
# set up for gridsearch
prec_scorer = make_scorer(precision_score)
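
The cumulative explained variance printed above can also guide the choice of `K`: given the per-component ratios, one can look for the smallest number of components that reaches a target share. A sketch with made-up ratios and a hypothetical `components_for` helper:

```python
from itertools import accumulate

def components_for(ratios, target):
    """Smallest number of leading components whose cumulative explained
    variance ratio reaches target, or None if it is never reached."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= target - 1e-12:  # small tolerance for float rounding
            return k
    return None

ratios = [0.40, 0.25, 0.15, 0.10, 0.05]  # illustrative values only
k_needed = components_for(ratios, target=0.80)
```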

#### 3. ML

All ML models follow these steps:

1. RFE (recursive feature elimination) to find the best features
2. GridSearch to find the optimal set of parameters
3. SMOTE on the training set to balance it
4. Fitting data to ML model
5. Make classification report and ROC curve plot
6. Do prediction
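
Step 5 reports precision and recall, and the grid searches in this notebook are scored on precision. For reference, both follow directly from the confusion-matrix counts. A dependency-free sketch with a hypothetical `binary_report` helper and made-up labels:

```python
def binary_report(y_true, y_pred):
    """Confusion counts plus precision and recall for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}

report = binary_report([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
```

Precision answers "how many of the documents flagged as responsive really are", while recall answers "how many of the responsive documents were found".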

##### 3.1 Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
starttime = time()
rfecv = RFECV(estimator=LogisticRegression(), step=0.02, cv=5,verbose=1, scoring='roc_auc')
rfecv.fit(x, y)
endtime = time()
print("Optimal number of features: %d" % rfecv.n_features_)
#print('Selected features: %s' % list(x.columns[rfecv.support_]))
print('Running time: ',(endtime-starttime),'s')

plt.figure(figsize=(10,6))
plt.title('Logistic regression RFE result')
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (ROC AUC)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()

#
#log_selected_feature = list(x.columns[rfecv.support_])
log_x = x.iloc[:,rfecv.support_]
x_train, x_test, y_train, y_test = train_test_split(log_x, y, test_size=0.3, random_state=5)
x_train ,y_train = resampler.fit_sample(x_train, y_train)
logit = LogisticRegression()
parameters={'penalty':['l1','l2'],'C':[1,0.01,0.001,10,0.1],'random_state':[0,5,10,20]}
grid = GridSearchCV(estimator= logit,param_grid=parameters,cv=5,refit=True, scoring=prec_scorer)
grid = grid.fit(x_train,y_train)
logit = grid.best_estimator_
print(grid.best_params_ )
print(grid.best_score_)
logit.fit(x_train,y_train)

y_pred = logit.predict(x_test)
y_pred_proba = logit.predict_proba(x_test)[:, 1]
[fpr, tpr, thr] = roc_curve(y_test, y_pred_proba,pos_label=1)
print('Train/Test split results:')
print(logit.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_test, y_pred))
print(logit.__class__.__name__+" log_loss is %2.3f" % log_loss(y_test, y_pred_proba))
print(logit.__class__.__name__+" auc is %2.3f" % auc(fpr, tpr))
log_confusion_matrix1 = confusion_matrix(y_test, y_pred)
print(log_confusion_matrix1)
classification_report1 = classification_report(y_test, y_pred,digits=3)
print(classification_report1)
print('Logistic Regression model has out of sample accuracy: %.3f'%(100*np.sum(y_pred == y_test)/len(y_test)),'%')

#
#idx = np.min(np.where(tpr > 0.95)) # index of the first threshold for which the sensibility > 0.95
## plot ROC curve
#plt.figure(figsize=(10,6))
#plt.plot(fpr, tpr, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
#plt.plot([0, 1], [0, 1], 'k--')
#plt.plot([0,fpr[idx]], [tpr[idx],tpr[idx]], 'k--', color='blue')
#plt.plot([fpr[idx],fpr[idx]], [0,tpr[idx]], 'k--', color='blue')
#plt.xlim([0.0, 1.0])
#plt.ylim([0.0, 1.05])
#plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
#plt.ylabel('True Positive Rate (recall)', fontsize=14)
#plt.title('Logistic Regression Receiver operating characteristic (ROC) curve')
#plt.legend(loc="lower right")
#plt.show()


temp = logit.predict_proba(mydata_temp.iloc[:,rfecv.support_])[:,1]
result['LR'] = temp
x_predict['LR'] = logit.predict_proba(log_x)[:, 1]
gc.collect()
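
The AUC reported above can be read as the probability that a randomly chosen responsive document is scored higher than a randomly chosen non-responsive one, which is the property that matters for a ranking. A brute-force sketch of that definition (quadratic in the number of documents, so for illustration only; `pairwise_auc` is a hypothetical name):

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.
    Ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

auc_value = pairwise_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```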

##### 3.2 Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfecv2 = RFECV(RandomForestClassifier(), step=1, cv=3,verbose=1, scoring='accuracy')
rfecv2.fit(x, y)
print("Optimal number of features: %d" % rfecv2.n_features_)
#print('Selected features: %s' % list(x.columns[rfecv2.support_]))

plt.figure(figsize=(10,6))
plt.title('Random Forest RFE result')
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (accuracy)")
plt.plot(range(1, len(rfecv2.grid_scores_) + 1), rfecv2.grid_scores_)
plt.show()
#
#forest_selected_feature = list(x.columns[rfecv2.support_])
forest_x = x.iloc[:,rfecv2.support_]
x_train, x_test, y_train, y_test = train_test_split(forest_x, y, test_size=0.3, random_state=40)
x_train,y_train = resampler.fit_sample(x_train,y_train)

rf = RandomForestClassifier()
# Choose some parameter combinations to try
parameters = {'n_estimators': [4, 9, 25, 100], 
              'max_features': ['log2', 'sqrt','auto'], 
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 5, 10], 
              'min_samples_split': [2, 5],
              'min_samples_leaf': [1, 5, 10],
              'oob_score': [True, False]
             }
grid = GridSearchCV(estimator= rf,param_grid=parameters,cv=5, scoring=prec_scorer,verbose=1)
grid = grid.fit(x_train,y_train)
rf = grid.best_estimator_
print(grid.best_params_ )
print(grid.best_score_)
rf.fit(x_train,y_train)

y_pred_3 = rf.predict(x_test).round()
y_pred_proba_3 = rf.predict_proba(x_test)[:,1]
forest_confusion_matrix = confusion_matrix(y_test, y_pred_3)
print(forest_confusion_matrix)
print('Random Forest model has out of sample accuracy of %.3f' %(100*np.sum(y_pred_3 == y_test)/len(y_test)),'%')


[fpr, tpr, thr] = roc_curve(y_test, y_pred_proba_3)
print('Train/Test split results:')
print(rf.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_test, y_pred_3))
print(rf.__class__.__name__+" log_loss is %2.3f" % log_loss(y_test, y_pred_proba_3))
print(rf.__class__.__name__+" auc is %2.3f" % auc(fpr, tpr))
rf_confusion_matrix2 = confusion_matrix(y_test, y_pred_3)
print(rf_confusion_matrix2)
classification_report2 = classification_report(y_test, y_pred_3,digits=3)
print(classification_report2)
print('Random Forest model has out of sample accuracy: %.3f'%(100*np.sum(y_pred_3== y_test)/len(y_test)),'%')

#idx = np.min(np.where(tpr > 0.95)) # index of the first threshold for which the sensibility > 0.95
## plot ROC curve
#plt.figure(figsize=(6,6))
#plt.plot(fpr, tpr, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
#plt.plot([0, 1], [0, 1], 'k--')
#plt.plot([0,fpr[idx]], [tpr[idx],tpr[idx]], 'k--', color='blue')
#plt.plot([fpr[idx],fpr[idx]], [0,tpr[idx]], 'k--', color='blue')
#plt.xlim([0.0, 1.0])
#plt.ylim([0.0, 1.05])
#plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
#plt.ylabel('True Positive Rate (recall)', fontsize=14)
#plt.title('Random Forest Receiver operating characteristic (ROC) curve')
#plt.legend(loc="lower right")
##plt.savefig('Graphs/RF - ROC.png',dpi=1200)
#plt.show()

temp = rf.predict_proba(mydata_temp.iloc[:,rfecv2.support_])
result['RF'] = temp[:,1]
x_predict['RF'] = rf.predict_proba(forest_x)[:, 1]
gc.collect()

##### 3.3 Neural Network

In [None]:
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from keras.initializers import RandomNormal
from keras.optimizers import adam, Adadelta

def recall(y_true, y_pred):
    """Recall metric.
    Only computes a batch-wise average of recall.
    Computes the recall, a metric for multi-label classification of
    how many relevant items are selected.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=40)
x_train ,y_train = resampler.fit_sample(x_train, y_train)

NN = Sequential()
NN.add(Dense(64, input_dim=x_train.shape[1], activation='relu', kernel_initializer  = RandomNormal(mean=0.0, stddev=1/64, seed=None)))
#NN.add(Dropout(0.1))
NN.add(Dense(64,  activation='relu', kernel_initializer  = RandomNormal(mean=0.0, stddev=2/64, seed=None)))
#NN.add(Dropout(0.1))
NN.add(Dense(32,  activation='relu'))
NN.add(Dense(32,  activation='relu'))
NN.add(Dense(32,  activation='relu'))
#NN.add(Dropout(0.1))
##NN.add(Dropout(0.2))
#NN.add(Dense(64,  activation='relu', kernel_initializer  = RandomNormal(mean=0.0, stddev=2/256, seed=None)))
#NN.add(Dense(32,  activation='relu', kernel_initializer  = RandomNormal(mean=0.0, stddev=2/256, seed=None)))
NN.add(Dense(1,activation='sigmoid'))
opt = adam(lr = 0.01)
NN.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy',recall])
NN.summary()

earlystop = EarlyStopping(monitor='val_recall', patience=90,verbose=1, mode='max')
ReduceLR = ReduceLROnPlateau(monitor='val_loss',factor=0.5,verbose=1,patience=30)
#path = 'ModelCheckPoint/weights.{epoch:02d} -- Loss.{val_loss:.2f} -- Recall.{val_recall:.2f}.hdf5'
#Checkpoint = ModelCheckpoint(path,monitor='val_recall',verbose=1,save_best_only=True,mode = 'val_recall')
callbacks_list = [ReduceLR]
NN.fit(x_train,y_train, epochs=100, batch_size=10,verbose=1,validation_data=(x_test, y_test),
          shuffle=True,callbacks=callbacks_list)

# callbacks=callbacks_list
#callbacks = callbacks_list_test
y_nn_pred_prob = NN.predict_proba(x_test).ravel()
y_nn_pred =  NN.predict_classes(x_test).squeeze()

[fpr_nn, tpr_nn, thr_nn] = roc_curve(y_test, y_nn_pred_prob)
print('Train/Test split results:')
print(NN.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_test, y_nn_pred))
print(NN.__class__.__name__+" log_loss is %2.3f" % log_loss(y_test, y_nn_pred_prob))
print(NN.__class__.__name__+" auc is %2.3f" % auc(fpr_nn, tpr_nn))

nn_confusion_matrix = confusion_matrix(y_test, y_nn_pred)
print(nn_confusion_matrix)
classification_report4 = classification_report(y_test, y_nn_pred,digits=3)
print(classification_report4)
print('Neural Network model has out of sample accuracy of %.3f' %(100*np.sum(y_nn_pred == y_test)/len(y_test)),'%')

idx = np.min(np.where(tpr_nn > 0.95))
plt.figure(figsize=(6,6))
plt.plot(fpr_nn, tpr_nn, color='coral', label='ROC curve after SMOTE (area = %0.3f, recall = %0.3f)' % (auc(fpr_nn, tpr_nn),recall_score(y_test, y_nn_pred)))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot([0,fpr_nn[idx]], [tpr_nn[idx],tpr_nn[idx]], 'k--', color='blue')
plt.plot([fpr_nn[idx],fpr_nn[idx]], [0,tpr_nn[idx]], 'k--', color='blue')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Neural Network - ROC curve')
plt.legend(loc="lower right")
plt.savefig('Graphs/NN - ROC.png',dpi=1200)
plt.show()

# flatten the (n, 1) Keras output to 1-D before storing it as a column
temp = NN.predict_proba(mydata_temp).ravel()
result['Neural Network'] = temp
x_predict['Neural Network'] = NN.predict_proba(x).ravel()
gc.collect()

##### 3.4 XGBoost

In [None]:
import os
mingw_path = 'C:\\Program Files\\mingw-w64\\x86_64-5.3.0-posix-seh-rt_v4-rev0\\mingw64\\bin'
os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']
from xgboost import XGBClassifier

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=5)
x_train ,y_train = resampler.fit_sample(x_train, y_train)


xgb = XGBClassifier()
parameters = {'objective':['binary:logistic'],
              'learning_rate': [0.05,0.1,0.001],
              'max_depth': [5,20],
              'subsample': [0.8],
              'colsample_bytree': [0.6,0.8],
              'n_estimators': [5,20,100]}
grid = GridSearchCV(estimator= xgb,param_grid=parameters,cv=5, scoring=prec_scorer,verbose=1)
grid = grid.fit(x_train,y_train)
xgb = grid.best_estimator_
print(grid.best_params_ )
print(grid.best_score_)
xgb.fit(x_train,y_train)


y_xgb_pred_prob = xgb.predict_proba(x_test)[:, 1]
y_xgb_pred =  xgb.predict(x_test).squeeze()

[fpr_xgb, tpr_xgb, thr_xgb] = roc_curve(y_test, y_xgb_pred_prob)
print('Train/Test split results:')
print(xgb.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_test, y_xgb_pred))
print(xgb.__class__.__name__+" log_loss is %2.3f" % log_loss(y_test, y_xgb_pred_prob))
print(xgb.__class__.__name__+" auc is %2.3f" % auc(fpr_xgb, tpr_xgb))

xgb_confusion_matrix = confusion_matrix(y_test, y_xgb_pred)
print(xgb_confusion_matrix)
classification_report5 = classification_report(y_test, y_xgb_pred,digits=3)
print(classification_report5)


temp = xgb.predict_proba(mydata_temp)[:, 1]
result['XGBoost'] = temp
x_predict['XGBoost'] = xgb.predict_proba(x)[:, 1]
gc.collect()

##### 3.5 LSTM

Load the LSTM scores (the cell below is currently commented out)

In [None]:
## Add LSTM result
#os.chdir('R:/Analytics & Automation/Machine Learning/Stewart Merger/US')
#lstm = pd.read_csv('Stewart US LSTM 20180709.csv')
#del lstm['Unnamed: 0']
#lstm.columns=['ArtifactID','LSTM']
#
#ids = lstm['ArtifactID']
#lstm = lstm[~ids.isin(ids[ids.duplicated()])]
#result = pd.merge(result,lstm,on='ArtifactID',how='left').fillna(0.5)
#x_predict = pd.merge(x_predict,lstm,on='ArtifactID',how='left').fillna(0.5)


#### 4. Ensemble Model

In [None]:
result['Ensemble'] = ((result['LR']+result['RF']+result['Neural Network']+result['XGBoost'])/4)
x_predict['Ensemble'] = ((x_predict['LR']+x_predict['RF']+x_predict['Neural Network']+x_predict['XGBoost'])/4)


y_ensemble_pred_prob = x_predict['Ensemble']
y_ensemble_pred =  x_predict['Ensemble'].round().squeeze()

y_pre = [int(item>0.5) for  item in y_ensemble_pred_prob]
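# ROC/AUC and log loss are computed on the probabilities, not the thresholded labels
[fpr, tpr, thr] = roc_curve(y, y_ensemble_pred_prob)
print('In-sample results (documents used for training):')
print("Ensemble model accuracy is %2.3f" % accuracy_score(y, y_pre))
print("Ensemble model log_loss is %2.3f" % log_loss(y,y_ensemble_pred_prob))
print("Ensemble model auc is %2.3f" % auc(fpr, tpr))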

ensemble_confusion_matrix = confusion_matrix(y,y_pre)
print(ensemble_confusion_matrix)
classification_report4 = classification_report(y, y_pre,digits=3)
print(classification_report4)
print('Ensemble model has in-sample accuracy of %.3f' %(100*np.sum(y_pre == y)/len(y)),'%')
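
The ensemble here is an unweighted average of the model probabilities (soft voting). A minimal sketch with made-up scores; `average_scores` is a hypothetical helper, and the optional weights are an assumption that the cell above does not use:

```python
def average_scores(model_scores, weights=None):
    """Weighted average of per-model probability lists (equal weights by default)."""
    names = list(model_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    n_docs = len(model_scores[names[0]])
    return [
        sum(weights[name] * model_scores[name][i] for name in names) / total
        for i in range(n_docs)
    ]

# Illustrative probabilities for two documents.
scores = {
    "LR": [0.9, 0.2],
    "RF": [0.7, 0.4],
    "Neural Network": [0.8, 0.1],
    "XGBoost": [0.6, 0.3],
}
ensemble = average_scores(scores)
```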

#### 5. Stacking Model

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_predict.iloc[:,1:5], y, test_size=0.3, random_state=40)
stack_lr = LogisticRegression()
stack_lr.fit(x_train,y_train)

y_pred = stack_lr.predict(x_test)
y_pred_proba = stack_lr.predict_proba(x_test)[:, 1]
[fpr, tpr, thr] = roc_curve(y_test, y_pred_proba)
print('Train/Test split results:')
print(stack_lr.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_test, y_pred))
print(stack_lr.__class__.__name__+" log_loss is %2.3f" % log_loss(y_test, y_pred_proba))
print(stack_lr.__class__.__name__+" auc is %2.3f" % auc(fpr, tpr))
stack_confusion_matrix = confusion_matrix(y_test, y_pred)
print(stack_confusion_matrix)
classification_report3 = classification_report(y_test, y_pred,digits=3)
print(classification_report3)
print('Stacking Logistic Regression model has out of sample accuracy: %.3f'%(100*np.sum(y_pred == y_test)/len(y_test)),'%')
#
#idx = np.min(np.where(tpr > 0.95))
#plt.figure(figsize=(6,6))
#plt.plot(fpr, tpr, color='coral', label='ROC curve after SMOTE (area = %0.3f, recall = %0.3f)' % (auc(fpr, tpr),float(classification_report3.split()[11])))
#plt.plot([0, 1], [0, 1], 'k--')
#plt.plot([0,fpr[idx]], [tpr[idx],tpr[idx]], 'k--', color='blue')
#plt.plot([fpr[idx],fpr[idx]], [0,tpr[idx]], 'k--', color='blue')
#plt.xlim([0.0, 1.0])
#plt.ylim([0.0, 1.05])
#plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
#plt.ylabel('True Positive Rate (recall)', fontsize=14)
#plt.title('Stacking - ROC curve')
#plt.legend(loc="lower right")
#plt.savefig('Graphs/Stacking - ROC.png',dpi=1200)
#plt.show()

result['Stack logistic'] = stack_lr.predict_proba(result.iloc[:,1:5])[:,1]

#### 6. Output

After running the ML part 3 times (once per decision set), we combine the results of the 3 sets and apply uncertainty sampling to the combined scores.

In [None]:
os.chdir('R:/Analytics & Automation/Machine Learning/Stewart Merger/US Final Pred Coding/20180720 Rank')
# name the file after the decision set that was merged in for this run (set1, set2 or set3)
result.to_csv('US Result set1 20180720.csv')

set1 = pd.read_csv('US Result set1 20180720.csv')
set2 = pd.read_csv('US Result set2 20180720.csv')
set3 = pd.read_csv('US Result set3 20180720.csv')

set4 = (set1+set2+set3)/3
result=set4
del result['Unnamed: 0']

os.chdir('R://Analytics & Automation//Machine Learning//Yipeng//Uncertainty Sampling')
import US

probs = pd.DataFrame(result['Stack logistic'])
index =  US.uncertainty_sampling('least_confident', probs)
result = result.loc[index]
result = result.reset_index().reset_index()
result = result.rename(columns={'level_0':'Review Order'})
result = result.set_index('index')
result = result.sort_index()
os.chdir('R:/Analytics & Automation/Machine Learning/Stewart Merger/US Final Pred Coding/20180720 Rank')
result.to_csv('US Result combined 20180719.csv')
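
The `US` module is a local helper whose source is not shown here. As an assumption about what a `'least_confident'` strategy usually means for a binary score, the sketch below orders documents by how close their probability is to 0.5, most uncertain first. `least_confident_order` is a hypothetical name, not the module's API.

```python
def least_confident_order(probs):
    """Return document positions sorted from most to least uncertain."""
    # Confidence of the predicted class is max(p, 1 - p); lower means more uncertain.
    confidence = [max(p, 1.0 - p) for p in probs]
    return sorted(range(len(probs)), key=lambda i: confidence[i])

# Made-up stacked-model probabilities for four documents.
order = least_confident_order([0.95, 0.52, 0.10, 0.40])
```

Under this reading, documents the stacked model is least sure about would be reviewed first.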