# Intent Classification On Enron E-mail

# Understanding the data

The main objective with this dataset is to predict whether the text of an email has Positive or Negative intent.

# Initial approach

First, I will use a simple Convolutional Neural Network (CNN) for the text classification task.

https://arxiv.org/abs/1408.5882 (Convolutional Neural Networks for Sentence Classification by Yoon Kim)

## Extracting the data
The data is read from a tab-separated .txt file, converted to .csv, and loaded into a dataframe.

In [34]:
import csv

import os
import numpy as np
from keras.layers import Activation, Conv1D, Dense, Embedding, Flatten, Input, MaxPooling1D
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

txt_file = r"train.txt"
csv_file = r"train.csv"

# Use 'with' so both files are closed (and the CSV is fully flushed) before pandas reads it.
# newline='' lets the csv module handle line terminators itself.
with open(txt_file, "rt", encoding='utf-8') as f_in, \
     open(csv_file, "wt", encoding='utf-8', newline='') as f_out:
    in_txt = csv.reader(f_in, delimiter='\t')
    out_csv = csv.writer(f_out)
    out_csv.writerow(('intent', 'text'))
    out_csv.writerows(in_txt)

import pandas as pd 

data = pd.read_csv("train.csv") 
data.head()

print(data.shape)


intent=data['intent'] #extract intent
text=data['text'] #extract text

print(intent[1],text[1])

Using TensorFlow backend.


(3632, 2)
No Act now to keep your life on the go!


## Convert the labels to numeric values



In [35]:
unique_categories=data.intent.unique()
unique_categories_dictionary=dict(zip(unique_categories,range(0,len(unique_categories))))
labels=data['intent'].map(unique_categories_dictionary, na_action='ignore')
print(len(labels))
print(labels)


3632
0       0
1       0
2       0
3       0
4       0
5       0
6       0
7       0
8       0
9       0
10      0
11      0
12      0
13      0
14      0
15      0
16      0
17      0
18      0
19      0
20      0
21      0
22      0
23      0
24      0
25      0
26      0
27      0
28      0
29      0
       ..
3602    1
3603    1
3604    1
3605    1
3606    1
3607    1
3608    1
3609    1
3610    1
3611    1
3612    1
3613    1
3614    1
3615    1
3616    1
3617    1
3618    1
3619    1
3620    1
3621    1
3622    1
3623    1
3624    1
3625    1
3626    1
3627    1
3628    1
3629    1
3630    1
3631    1
Name: intent, Length: 3632, dtype: int64


## Preprocessing the data

We tokenize the text before we can feed it into a neural network. This tokenization process will also remove some of the features of the original text, such as all punctuation or words that are less common.

We have to specify the size of our vocabulary. Words that are less frequent will get removed. In this case we want to retain the 5,000 most common words.
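To make the idea of a frequency-ranked vocabulary concrete, here is a minimal pure-Python sketch. It is not the Keras implementation: `build_word_index` and `to_sequence` are hypothetical helpers, the regular expression is a simplification, and ties are broken alphabetically as an assumption. It mirrors one detail of the Keras `Tokenizer`: index 0 is reserved, and only words whose index is below `num_words` are kept.

```python
import re
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most common word gets index 1 (0 is reserved for padding)."""
    counts = Counter(w for t in texts for w in re.findall(r"[a-z0-9']+", t.lower()))
    # Assumption: ties are broken alphabetically to keep the toy example deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Map words to indices, dropping unknown words and words ranked at or beyond num_words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [word_index[w] for w in tokens if w in word_index and word_index[w] < num_words]

toy_texts = ["Click here to earn today", "Click above to earn", "Act now"]
toy_index = build_word_index(toy_texts)

# With num_words=4 only the three most frequent words (indices 1 to 3) survive
toy_seq = to_sequence("Click here to earn today", toy_index, num_words=4)
print(toy_seq)
```

The rarer words "here" and "today" disappear from the sequence, which is the same kind of information loss the vocabulary limit introduces in the real data.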


In [36]:
vocab_size = 5000
tokenizer = Tokenizer(num_words=vocab_size) # Setup tokenizer
tokenizer.fit_on_texts(text)
sequences = tokenizer.texts_to_sequences(text) # Generate sequences

print (len(sequences))  ## Testing the output of the tokenizer functions.
print (len(sequences[0]))
print (sequences[0])

3632
9
[116, 126, 18, 64, 1, 46, 761, 99, 47]


##### Word_index
Our text is now converted to sequences of numbers. It makes sense to convert some of those sequences back into text to check what the tokenization did to our text. To this end we create an inverse index that maps numbers to words while the tokenizer maps words to numbers.


In [37]:
word_index = tokenizer.word_index
print('Found {:,} unique words.'.format(len(word_index)))

#Create inverse index mapping numbers to words
inv_index = {v: k for k, v in tokenizer.word_index.items()}

# Print out text again
for w in sequences[0]:
    x = inv_index.get(w)
    print(x,end = ' ')


Found 6,800 unique words.
1 contact me now to make 100 today link 

#### Measuring the text length

Let's ensure all sequences have the same length.

In [40]:
# Get the average length of a text
avg = sum(map(len, sequences)) / len(sequences)

# Get the standard deviation of the sequence length
std = np.sqrt(sum(map(lambda x: (len(x) - avg)**2, sequences)) / len(sequences))


avg,std

(16.282764317180618, 10.145697980559813)

In [42]:
pad_sequences([[1,2,3]], maxlen=5) # testing the pad_sequences function

array([[0, 0, 1, 2, 3]])

In [41]:
max_length = 26
text_seq = pad_sequences(sequences, maxlen=max_length)
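By default `pad_sequences` pads and truncates at the front of each sequence. The sketch below imitates that behaviour for a single sequence in plain Python; `pad_pre` is a hypothetical helper written for illustration, not part of Keras. The chosen `max_length` of 26 is close to the mean plus one standard deviation computed above (16.28 + 10.15 ≈ 26.4).

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    trimmed = seq[-maxlen:]                      # truncate from the front
    return [value] * (maxlen - len(trimmed)) + trimmed  # pad at the front

short = pad_pre([1, 2, 3], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6], maxlen=5)
print(short, long_)
```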

### Turning labels into one-hot encodings
Labels can quickly be encoded into one-hot vectors with Keras:


In [43]:
from keras.utils import to_categorical

labels=data['intent'].map(unique_categories_dictionary, na_action='ignore')
labels = to_categorical(np.asarray(labels))

print('Shape of data:', text_seq.shape)
print('Shape of labels:', labels.shape)

print(labels)


Shape of data: (3632, 26)
Shape of labels: (3632, 2)
[[ 1.  0.]
 [ 1.  0.]
 [ 1.  0.]
 ..., 
 [ 0.  1.]
 [ 0.  1.]
 [ 0.  1.]]


### Loading GloVe embeddings

A word embedding represents words and documents as dense vectors. The position of a word within the vector space is learned from text and is based on the words that surround it when it is used. Word embeddings can be trained on the input corpus itself or taken from pre-trained sets such as GloVe, FastText, and Word2Vec.
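A common way to compare two embedding vectors is cosine similarity. The sketch below uses made-up three-dimensional vectors (the values in `toy_embeddings` are invented for illustration and are not GloVe values) to show that words used in similar contexts are expected to score higher than unrelated ones.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, chosen so that "frog" and "toad" point in similar directions
toy_embeddings = {
    "frog": [0.9, 0.1, 0.0],
    "toad": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_embeddings["frog"], toy_embeddings["toad"])
sim_unrelated = cosine_similarity(toy_embeddings["frog"], toy_embeddings["invoice"])
print(sim_related > sim_unrelated)
```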


In [44]:

glove_dir = 'C:/Users/admin/Desktop' # This is the folder with the dataset/glove embeddings

embeddings_index = {} # We create a dictionary of word -> embedding

with open(os.path.join(glove_dir, 'glove.6B.100d.txt'),encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0] # The first value is the word, the rest are the values of the embedding
        embedding = np.asarray(values[1:], dtype='float32') # Load embedding
        embeddings_index[word] = embedding # Add embedding to our embedding dictionary

print('Found {:,} word vectors in GloVe.'.format(len(embeddings_index)))


Found 400,000 word vectors in GloVe.


In [45]:
print (embeddings_index['frog'])
print (len(embeddings_index['frog']))

[ 0.043084    0.53232998  0.54254001 -0.076952   -0.29673001  0.52986002
  0.21379     0.15789001 -0.39520001 -0.91889    -0.65850002  0.68706
  0.10821    -0.10694    -0.34009999  1.04400003  0.12774999  0.51156998
  0.60314     0.71366    -0.53740001  0.37737     0.12186     0.60891002
  0.50107002  2.02150011 -0.47318     0.46952999  0.12542     0.60206997
  0.11007     0.37586999  1.01370001 -0.24779999  0.65748     0.12801
 -0.57647002 -0.25753999  0.62426001  0.010864   -0.40680999  0.16173001
 -0.84694999 -0.24603     0.29078001  0.85460001 -0.067021    0.69331002
 -0.71544999 -0.25184    -0.74741    -0.26506999  0.48730001  0.41991001
 -0.86741    -0.52350003 -0.44773999 -0.044584    0.033836    0.29909
  0.73754001  0.81651002  0.69431001  0.80453002  0.29276001 -0.025244
 -0.30452999 -0.34329     0.11933    -0.29655001  0.1072     -0.18945999
  0.18501    -0.75480002 -0.25628     0.34437999 -0.016743    0.0040503
  0.39342001  0.99404001 -0.32159001 -0.49434     0.41707999 -0

### Using GloVe Embeddings

The following snippet shows how to use pre-trained word embeddings in the model. There are four essential steps:

1. Load the pretrained word embeddings.<br/>
2. Create a tokenizer object (already done in the code above).<br/>
3. Transform the text documents into sequences of tokens and pad them (already done in the code above).<br/>
4. Create a mapping from each token to its embedding.<br/>

In [47]:
embedding_dim = 100 # We use 100 dimensional glove vectors

word_index = tokenizer.word_index
nb_words = min(vocab_size, len(word_index)) # How many words are there actually

embedding_matrix = np.zeros((nb_words+1, embedding_dim))

# The vectors need to be in the same position as their index. 
# Meaning a word with token 1 needs to be in the second row (rows start with zero) and so on

# Loop over all words in the word index
for word, i in word_index.items():
    # If we are above the amount of words we want to use we do nothing
    if i >= vocab_size: 
        continue
    # Get the embedding vector for the word
    embedding_vector = embeddings_index.get(word)
    # If there is an embedding vector, put it in the embedding matrix
    if embedding_vector is not None: 
        embedding_matrix[i] = embedding_vector


### Architecture of the model :

The architecture comprises three key pieces:<br/>

1. Word Embedding: A distributed representation of words where different words that have a similar meaning (based on their usage) also have a similar representation.<br/>
2. Convolutional Model: A feature extraction model that learns to extract salient features from documents represented using a word embedding.<br/>
3. Fully Connected Model: The interpretation of extracted features in terms of a predictive output.<br/>

In [48]:
model = Sequential()
model.add(Embedding(nb_words+1, 
                    embedding_dim, 
                    input_length=max_length, 
                    weights = [embedding_matrix], 
                    trainable = False))
model.add(Conv1D(50, 3, activation='relu'))  #https://arxiv.org/abs/1408.5882 (Convolutional Neural Networks for Sentence Classification by Yoon Kim)
model.add(MaxPooling1D(3))                    #https://arxiv.org/abs/1510.03820(A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification)
model.add(Flatten())
model.add(Dense(50, activation='relu'))
model.add(Dense(2, activation='sigmoid'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 26, 100)           500100    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 24, 50)            15050     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 8, 50)             0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 400)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 50)                20050     
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 102       
Total params: 535,302
Trainable params: 35,202
Non-trainable params: 500,100
_________________________________________________________________
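The output shapes and parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras defaults used above (`Conv1D` with 'valid' padding and stride 1, `MaxPooling1D` with a stride equal to the pool size); the helper names are made up for illustration.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a 1D convolution with 'valid' padding and stride 1."""
    return length - kernel_size + 1

def maxpool1d_out_len(length, pool_size):
    """Output length of non-overlapping max pooling; any remainder is dropped."""
    return length // pool_size

seq_len, emb_dim, filters, kernel = 26, 100, 50, 3

conv_len = conv1d_out_len(seq_len, kernel)      # 26 - 3 + 1
pool_len = maxpool1d_out_len(conv_len, 3)       # 24 // 3
flat_units = pool_len * filters                 # flattened feature count

embedding_params = 5001 * emb_dim               # (nb_words + 1) rows, frozen
conv_params = kernel * emb_dim * filters + filters   # weights + biases
dense1_params = flat_units * 50 + 50
dense2_params = 50 * 2 + 2

trainable = conv_params + dense1_params + dense2_params
total = trainable + embedding_params
print(conv_len, pool_len, flat_units, trainable, total)
```

Only the convolutional and dense layers are trained; the embedding rows account for the non-trainable parameters because the layer was created with `trainable = False`.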

## Preparing the test data

We will prepare the test data in the same way as the training data.

In [50]:
txt_test_file = r"test.txt"
csv_test_file = r"test.csv"

# Use 'with' so both files are closed (and the CSV is fully flushed) before pandas reads it.
# newline='' lets the csv module handle line terminators itself.
with open(txt_test_file, "rt", encoding='utf-8') as f_in, \
     open(csv_test_file, "wt", encoding='utf-8', newline='') as f_out:
    in_txt = csv.reader(f_in, delimiter='\t')
    out_csv = csv.writer(f_out)
    out_csv.writerow(('intent', 'text'))
    out_csv.writerows(in_txt)


test_data = pd.read_csv("test.csv") 
test_data.head()

print(test_data.shape)


test_intent=test_data['intent'] #extract intent
test_text=test_data['text'] #extract text


unique_categories=test_data.intent.unique()
unique_categories_dictionary=dict(zip(unique_categories,range(0,len(unique_categories))))
test_labels=test_data['intent'].map(unique_categories_dictionary, na_action='ignore')
print(len(test_labels))


vocab_size = 5000
test_tokenizer = Tokenizer(num_words=vocab_size) # Setup tokenizer
test_tokenizer.fit_on_texts(test_text)
test_sequences = test_tokenizer.texts_to_sequences(test_text) # Generate sequences


print (len(test_sequences))  ## Testing the output of the tokenizer functions.
print (len(test_sequences[0]))
print (test_sequences[0])


test_word_index = test_tokenizer.word_index
print('Found {:,} unique words.'.format(len(test_word_index)))


#Create inverse index mapping numbers to words
test_inv_index = {v: k for k, v in test_tokenizer.word_index.items()}

# Print out text again
for w in test_sequences[0]:
    x = test_inv_index.get(w)
    print(x,end = ' ')
    
print("\n")   
# Get the average length of a test text
avg = sum(map(len, test_sequences)) / len(test_sequences)

# Get the standard deviation of the test sequence length
std = np.sqrt(sum(map(lambda x: (len(x) - avg)**2, test_sequences)) / len(test_sequences))

avg,std


max_length = 26
test_text_seq = pad_sequences(test_sequences, maxlen=max_length)

test_labels = to_categorical(np.asarray(test_labels))
print('Shape of test data:', test_text_seq.shape)
print('Shape of test labels:', test_labels.shape)

(992, 2)
992
992
12
[3, 143, 313, 2, 25, 7, 4, 818, 47, 30, 627, 133]
Found 3,029 unique words.
i look forward to meeting you and learning about your successful business 

Shape of test data: (992, 26)
Shape of test labels: (992, 2)
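One caveat about the cell above: it fits a second tokenizer on the test text, so a given index generally refers to different words in the training and test sequences. The toy sketch below illustrates the effect with a simplified frequency-ranked index (`fit_index` and `encode` are hypothetical helpers, not Keras functions). With Keras, the usual pattern would be to call `texts_to_sequences` on the test text using the tokenizer that was fitted on the training text.

```python
from collections import Counter

def fit_index(texts):
    """Build a frequency-ranked word index (ties broken alphabetically, as an assumption)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ranked, start=1)}

def encode(text, index):
    """Map known words to their indices; unknown words are dropped."""
    return [index[w] for w in text.lower().split() if w in index]

train_texts = ["click here to earn today", "click to earn"]
test_texts = ["please call me today", "call me"]

train_index = fit_index(train_texts)
test_index = fit_index(test_texts)

# Encoded with an index fitted on the test set: the numbers look valid,
# but in the training index 1, 2 and 4 mean "click", "earn" and "here"
mismatched = encode("call me today", test_index)

# Encoded with the training index: only words seen in training survive
consistent = encode("call me today", train_index)
print(mismatched, consistent)
```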


### Training the model

In [53]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',  # https://stackoverflow.com/questions/42081257/keras-binary-crossentropy-vs-categorical-crossentropy-performance
              metrics=['accuracy'])

model_history=model.fit(text_seq, labels, validation_split=0.0,validation_data=(test_text_seq,test_labels),shuffle=True, epochs=15)



Train on 3632 samples, validate on 992 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


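We get a test accuracy of around 53%, which is not a satisfactory result. One caveat: the test text above was encoded with a tokenizer fitted separately on the test set, so its word indices do not line up with those the network was trained on, which may account for part of the gap. The low accuracy of the deep learning model can also be explained by the following graph: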

![title](ds.png)

As we construct larger neural networks and train them with more and more data, their performance continues to increase. This generally differs from other machine learning techniques, which reach a plateau in performance.<br/>
So the number of training examples is too low for a deep learning model to reach good accuracy. Next, we will try non-deep-learning models.

## Traditional ML models

## Import the relevant libraries

The sklearn library provides various ML algorithms such as Random Forest, SGD (Stochastic Gradient Descent) and Logistic Regression.<br/>

The numpy library handles linear algebra tasks.<br/>

The pandas library is used for data manipulation and analysis (in this case, reading the csv file into dataframes).<br/>

In [1]:

import csv
import numpy as np
import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer
import sklearn


## Converting the training data format

The given training data is in .txt format, with a tab separating the label (i.e. Intent) from the corresponding "text". The data is therefore converted into .csv format for easy manipulation and loading into dataframes.

In [3]:
txt_file = r"train.txt"  # for converting the txt file to a csv file for better handling of data
csv_file = r"train.csv"

# 'with' closes (and flushes) both files before pandas reads the CSV
with open(txt_file, "rt", encoding='utf-8') as f_in, \
     open(csv_file, "wt", encoding='utf-8', newline='') as f_out:
    in_txt = csv.reader(f_in, delimiter='\t')  # read the tab-separated text file
    out_csv = csv.writer(f_out)  # write its contents to a csv file
    out_csv.writerow(('Intent', 'text'))
    out_csv.writerows(in_txt)


train_data = pd.read_csv("train.csv") 
print(train_data.head())
print()

print("Shape of train data",train_data.shape)



  Intent                                               text
0     No     >>> [1]Contact Me Now to Make $100 Today!$LINK
1     No               Act now to keep your life on the go!
2     No  Choose between $500 and $10000 dollars with up...
3     No                         Click above to earn today.
4     No        Click here to receive your first $10 today:

Shape of train data (3657, 2)


### Converting the test data format

In [4]:
# Perform the same steps to convert the test data txt file to a csv file.

txt_test_file = r"test.txt"
csv_test_file = r"test.csv"




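# 'with' closes (and flushes) both files before pandas reads the CSV
with open(txt_test_file, "rt", encoding='utf-8') as f_in, \
     open(csv_test_file, "wt", encoding='utf-8', newline='') as f_out:
    in_txt = csv.reader(f_in, delimiter='\t')
    out_csv = csv.writer(f_out)
    out_csv.writerow(('Intent', 'text'))
    out_csv.writerows(in_txt)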


test_data = pd.read_csv("test.csv") 
test_data.head()

print("Shape of test data:",test_data.shape)


test_intent=test_data['Intent'] #extract intent
test_text=test_data['text'] #extract text


Shape of test data: (907, 2)


### Extracting the Intent and Text into Series

Taking out the columns of the dataframe into corresponding series for better handling of data.

In [5]:
Intent=train_data['Intent'] #extract intent
text=train_data['text'] #extract text
print(train_data.shape)

print(type(Intent))

(3657, 2)
<class 'pandas.core.series.Series'>


### Label Binarizer

Converts target variable to binary output.

In [6]:
from sklearn.preprocessing import LabelBinarizer
lb= LabelBinarizer()

### Stemming

Stemming takes morphologically complex words and reduces them to their root morphemes. For example, if the input contains the words "tasty", "tastier", and "tastiest", the suffixes would be stripped off, and all of these forms would be treated as a single stem, "tasti".<br/>

In this case, I have used PorterStemmer because it is less aggressive than the other commonly used stemmers (Snowball, Lancaster).

In [7]:
from nltk.stem.porter import PorterStemmer 
ps=PorterStemmer()

### Removing Stop words

Stop words are words that are filtered out before or after processing natural-language text. Although "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools. Examples of stop words are ‘the’, ‘is’ and ‘are’.

In [8]:
#Regular expressions help us deal with characters and numerics and modify them according to our requirement.

import re

#import the nltk library.The nltk module contains a list of stop words.

from nltk.corpus import stopwords
sw = set(stopwords.words('english')) #We get a set of English stop words using this line.

## Performing preprocessing

In the preprocessing step, we lower-case the text column, remove punctuation, digits and stop words, and stem the remaining words.<br/>
In the label column, we use LabelBinarizer to convert "Yes" and "No" to 1 and 0 respectively.
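The cleaning steps can be traced on one sample line. The sketch below applies the same two regular expressions as the notebook, followed by stop-word removal; it leaves out stemming and uses a tiny hypothetical stop list (`TOY_STOP_WORDS`) instead of the NLTK one so that it runs without extra downloads.

```python
import re

# Hypothetical stop list for illustration only; the notebook uses NLTK's English list
TOY_STOP_WORDS = {"me", "now", "to", "the", "is", "are"}

def clean(text, stop_words=TOY_STOP_WORDS):
    """Lower-case, strip punctuation, blank out non-letters, then drop stop words."""
    text = re.sub(r"[^\w\s]", "", text.lower())   # punctuation is deleted, not replaced
    text = re.sub(r"[^a-zA-Z]", " ", text)        # digits and other non-letters become spaces
    return " ".join(w for w in text.split() if w not in stop_words)

cleaned = clean(">>> [1]Contact Me Now to Make $100 Today!$LINK")
print(cleaned)
```

Because punctuation is deleted rather than replaced by a space, "Today!$LINK" collapses into the single token "todaylink", which is what appears in the processed column.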

In [9]:
def processing(df):
    
    #lowering and removing punctuation
    df['pre_processed'] = df['text'].apply(lambda x: re.sub(r'[^\w\s]','', x.lower()))
    #removing the numerical values and working only with text values
    df['pre_processed'] = df['pre_processed'].apply(lambda x : re.sub('[^a-zA-Z]', " ", x ))
    #removing the stopwords
    df['pre_processed'] = df['pre_processed'].apply(lambda x: ' '.join([word for word in x.split() if word not in sw ]))
    #stemming the words
    df['pre_processed'] = df['pre_processed'].apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))
   
    df["Intent"]= lb.fit_transform(df["Intent"])
    return df
    
df= processing(train_data)   
#df = df.drop("processed", axis=1)
df.head()


| | Intent | text | pre_processed |
|---|---|---|---|
| 0 | 0 | >>> [1]Contact Me Now to Make $100 Today!$LINK | contact make todaylink |
| 1 | 0 | Act now to keep your life on the go! | act keep life go |
| 2 | 0 | Choose between $500 and $10000 dollars with up... | choos dollar year repay |
| 3 | 0 | Click above to earn today. | click earn today |
| 4 | 0 | Click here to receive your first $10 today: | click receiv first today |


### Splitting the features and target dataframe

In [10]:
def FTsplit(df):       #This subroutine splits the given dataframe into features and target dataframe
    features='pre_processed'
    target = 'Intent'
    X= df[features]
    y= df[target]
    return X, y

In [11]:
X, y =FTsplit(df)

In [12]:
print(X)  #Checking the features dataframe by printing

0                                  contact make todaylink
1                                        act keep life go
2                                 choos dollar year repay
3                                        click earn today
4                                click receiv first today
5                                   click start shop link
6               click watch im sure long hell leav public
7                       confirm view first great opportun
8                       copi past link browser see result
9                    find add clip pressroom second video
10                                       go direct access
11                                              go inform
12      adob acrobat reader pleas click follow instruc...
13                       want confirm simpli ignor messag
14                  dont want chang password ignor messag
15      see noth banner pleas click httpwwwbzprodtvmai...
16      see noth banner pleas click httpwwwbzprodtvmai...
17            

## TF-IDF
TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
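As a rough illustration of the weighting, here is a small pure-Python sketch. It assumes raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` and L2 normalization per document, which to my knowledge matches the scikit-learn defaults; the `tfidf` helper itself is hypothetical and only meant to show the arithmetic.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (tf * smoothed idf, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

toy_docs = ["click earn today", "click start shop", "click receiv first today"]
toy_rows = tfidf(toy_docs)

# "click" occurs in every document, so it gets the lowest weight in the first one
print(sorted(toy_rows[0], key=toy_rows[0].get))
```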


### Prepare the test data

Prepare the test data using the tf-idf vectorizer. It takes a list of documents and returns a document-term matrix.

In [13]:
#Prepare test data using tf-idf vectorizer.

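# The positional "english" in the original call was bound to `input`, not `stop_words`;
# stop words are already removed in processing(), so no argument is needed here.
test_vectorisor=TfidfVectorizer()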

df_test=processing(test_data)

X_test, y_test =FTsplit(df_test)
test_text_feat = X_test
test_features = test_vectorisor.fit_transform(test_text_feat)


print("Dimension of document-term matrix",test_features.shape)


Dimension of document-term matrix (907, 2130)


### Prepare the training data

In [14]:
#Prepare train data using tf-idf vectorizer.

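# The positional "english" in the original call was bound to `input`, not `stop_words`, so it is dropped.
# max_features only makes the column count match the test matrix.
train_vectorisor=TfidfVectorizer(max_features=test_features.shape[1])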
df_train=processing(train_data)

X_train, y_train =FTsplit(df_train)
train_text_feat = X_train
train_features = train_vectorisor.fit_transform(train_text_feat)

print("Dimension of document-term matrix:",train_features.shape)

Dimension of document-term matrix: (3657, 2130)
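Note that the two cells above fit separate vectorizers, so matching the number of columns does not guarantee that column j stands for the same term in both matrices. A common alternative is to learn the vocabulary once on the training text and reuse it for the test text; with scikit-learn that would correspond to `fit_transform` on the training data and `transform` on the test data with the same vectorizer. The sketch below shows the idea with plain term counts; `fit_vocabulary` and `transform` are hypothetical helpers, not scikit-learn functions.

```python
def fit_vocabulary(docs):
    """Assign one column per distinct training term, in sorted order."""
    terms = sorted({t for doc in docs for t in doc.split()})
    return {term: col for col, term in enumerate(terms)}

def transform(docs, vocabulary):
    """Count terms into the columns fixed by `vocabulary`; unseen terms are dropped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term in doc.split():
            if term in vocabulary:
                row[vocabulary[term]] += 1
        rows.append(row)
    return rows

train_docs = ["click earn today", "act keep life go"]
test_docs = ["click click go", "look forward meet"]

vocab = fit_vocabulary(train_docs)           # learned from training text only
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)    # same columns, same meaning
print(test_matrix)
```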


### Importing different classifiers from sklearn library


In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
#import xgboost as xgb

## Initialising different classifiers

In [16]:
svc = SVC(kernel='sigmoid', gamma=1.0)
lrc = LogisticRegression(solver='liblinear', penalty='l1')
rfc = RandomForestClassifier(n_estimators=2000,max_depth=70,min_samples_split=5,min_samples_leaf=100, random_state=111)
abc = AdaBoostClassifier(n_estimators=62, random_state=112)
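# n_iter was replaced by max_iter in later scikit-learn releases; tol=None keeps the fixed 5 passes.
# On scikit-learn >= 1.1 the logistic loss is named 'log_loss'.
sgd = SGDClassifier(loss='log', penalty='l2',alpha=1e-3, max_iter=5, tol=None, random_state=111)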
# xgbb = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
#                         subsample=0.8, nthread=10, learning_rate=0.1)

### Functions to train and predict

In [19]:
clfs = {'SVC' : svc,'LR': lrc, 'RF': rfc,'SGD': sgd}

def train_classifier(clf, feature_train, labels_train):    
    clf.fit(feature_train, labels_train)
    
def predict_labels(clf, features):
    return (clf.predict(features))


In [46]:
# pred_scores = []
# for k,v in clfs.items():
#     train_classifier(v, features_train, labels_train)
#     pred = predict_labels(v,features_test)
#     pred_scores.append((k, [accuracy_score(labels_test,pred)]))

### Training and Prediction

Training and prediction are done using the different classifiers.

In [20]:
pred_scores = []
for k,v in clfs.items():
    train_classifier(v, train_features, y)
    pred = predict_labels(v,test_features)
    pred_scores.append((k, [accuracy_score(y_test,pred)]))



In [22]:
# DataFrame.from_items was removed in pandas 1.0; from_dict gives the same table
df = pd.DataFrame.from_dict(dict(pred_scores), orient='index', columns=['Score']) # Store the scores of the various algorithms in a df.
df

| | Score |
|---|---|
| SVC | 0.714443 |
| LR | 0.738699 |
| RF | 0.753032 |
| SGD | 0.746417 |


The table above shows that the random forest classifier gives the highest accuracy, 75.3%. Stochastic Gradient Descent is a close second with an accuracy of 74.6%.

### Ensemble Methods

The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability over a single estimator.<br/>

Two families of ensemble methods are usually distinguished:<br/>

In <b>averaging methods</b>, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.<br/>
Examples: Bagging methods, Forests of randomized trees, …<br/>

By contrast, in <b>boosting methods</b>, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.<br/>

Below, we use an averaging method: a Voting Classifier with soft voting.
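With `voting='soft'`, the ensemble averages the class probabilities predicted by its members and picks the class with the highest average. The sketch below shows that rule for one sample with unweighted averaging; `soft_vote` is a hypothetical helper and the probabilities are invented for illustration.

```python
def soft_vote(prob_rows):
    """Average per-class probabilities across models; return (winning class, averages)."""
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    avg = [sum(row[c] for row in prob_rows) / n_models for c in range(n_classes)]
    winner = max(range(n_classes), key=lambda c: avg[c])
    return winner, avg

# Hypothetical [P(class 0), P(class 1)] from three models for one email
toy_probs = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
winner, averaged = soft_vote(toy_probs)
print(winner, averaged)
```

In this toy case two of the three models lean towards class 1, yet the averaged probabilities favour class 0 because the first model is much more confident. That is the practical difference between soft voting and a simple majority vote.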

In [26]:
from sklearn.ensemble import VotingClassifier

eclf = VotingClassifier(estimators=[('SGD', sgd), ('LR', lrc), ('RF', rfc), ('Ada', abc)], voting='soft')

In [28]:
eclf.fit(train_features,y)

pred = eclf.predict(test_features)
print(accuracy_score(y_test,pred))




0.740904079383


Clearly, the accuracy of the Voting Classifier (74.09%) is lower than that of the random forest (75.3%).<br/>

Now we will try to improve the accuracy by tuning the hyperparameters of the random forest classifier with GridSearch.

In [30]:
from sklearn.model_selection import GridSearchCV
rfc = RandomForestClassifier(n_estimators=2000,max_depth=70,min_samples_split=5,min_samples_leaf=100, random_state=111)

# parameters for GridSearchCV
param_grid2 = {"n_estimators": [1300, 1400, 1500],
              "max_depth": [70,80],
              "min_samples_split": [5,10,15],
              "min_samples_leaf": [50,100,200],
              "max_leaf_nodes": [20, 40],
              }
grid_search = GridSearchCV(rfc, param_grid=param_grid2)
grid_search.fit(train_features,y)

# print(grid_search.cv_results_)
print(grid_search.best_params_)



{'max_depth': 70, 'max_leaf_nodes': 20, 'min_samples_leaf': 50, 'min_samples_split': 5, 'n_estimators': 1300}
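Grid search simply tries every combination of the listed values, so its cost grows multiplicatively with each added parameter. The sketch below enumerates the grid used above with `itertools.product` to show how many candidate settings are evaluated; each one is additionally fitted once per cross-validation fold. The variable names are illustrative only.

```python
from itertools import product

param_grid = {
    "n_estimators": [1300, 1400, 1500],
    "max_depth": [70, 80],
    "min_samples_split": [5, 10, 15],
    "min_samples_leaf": [50, 100, 200],
    "max_leaf_nodes": [20, 40],
}

names = list(param_grid)
# Cartesian product of all value lists: one dict per candidate setting
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]

n_candidates = len(candidates)   # 3 * 2 * 3 * 3 * 2
print(n_candidates)
```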


We used grid search to find the best hyperparameters for the random forest classifier. Now we refit with those values and predict again (note that the cell below does not pass `max_leaf_nodes`).

In [32]:
eclf.fit(train_features,y)
rfc = RandomForestClassifier(n_estimators=1300,max_depth=70,min_samples_split=5,min_samples_leaf=50, random_state=111)

rfc.fit(train_features,y)

pred = rfc.predict(test_features)
print(accuracy_score(y_test,pred))



0.753031973539


By searching over other candidate values (for n_estimators, max_depth and max_leaf_nodes), we can hope to improve the accuracy further.