# Classifier First Version

This is a very detailed explanation of a basic NLP classifier. If you have any questions, don't hesitate to send them to me!

This is the first version of the classifier. It uses basic ML classifiers from the sklearn library. This code can serve as the basic structure for the classifier, and each part of the algorithm should of course be improved in order to get the best accuracy.

## I- Import basic libraries

I always import these basic libraries at the beginning of a program, simply because it is likely that they will be used in the code.

In [1]:
import numpy as np
import pandas as pd
import nltk
import sklearn

__numpy__ is mostly used to manage arrays and matrices efficiently. It also provides mathematical functions.

__pandas__ is used to handle the dataset. In this algorithm it only helps us import the Excel data into Python as a dataframe.

__nltk__ is used to pre-process the data. You will see later which specific objects of the library we use.

__sklearn__ is the library that contains the optimized classifier algorithms. You will see later how we use these objects.

## II- Import Data

In this part we import the data from Excel into Python dataframes with pandas, and we reshape the data in order to create a convenient input for the rest of the algorithm.

For the next two lines you will need to adapt the path of the data to your own computer!

In [2]:
dataset_Healthy = pd.read_excel(r'C:\Users\Alu\Desktop\S and P Global\Funds Articles_Healthy_New.xlsx', dtype={'Name':str}, quoting = 3)
dataset_Unhealthy = pd.read_excel(r'C:\Users\Alu\Desktop\S and P Global\Fund Articles_Unhealthy_New.xlsx', dtype={'Name':str}, quoting = 3)

These two lines use the read_excel function of the pandas library to import the data from an Excel file into a dataframe.
Two dataframes are created: one for the healthy documents and one for the unhealthy documents.

The parameter dtype tells the function to read the values of the 'Name' column as strings (a string is a chain of characters such as 'i am a cat'; this is the way Python deals with text).
The parameter quoting set to 3 (the value of csv.QUOTE_NONE) tells a text parser to treat quotes (") as normal characters. Because strings are delimited by quotes, quotes inside the data could otherwise cause problems. Note that quoting is really a CSV parsing option: recent versions of pandas no longer accept it in read_excel, so if you get a TypeError you can simply remove it.

If you want more information about the different parameters of a function, you just have to google the name of the function and "documentation" (for instance: pandas.read_excel documentation) and the first link should give you all the information you need.


In [3]:
dataset_Healthy.columns = ['Articles', 'Labels']
dataset_Unhealthy.columns = ['Articles', 'Labels']

Here I rename the two columns of each dataframe.
A dataframe is a big table. It is a different object from a simple Python table (list of lists) or a numpy table (array), but it works almost the same way. Dataframes are more efficient for managing a big set of data, but don't worry if you are not comfortable with them: we will soon convert the data to numpy arrays (which are indexed much like a list of lists).

The first column is Articles. Each row of this column contains one article as a string.
The second column is Labels. This column is empty for now, but it will contain 1 if the article on the same row is healthy and 0 otherwise.
The next code fills this column.

In [4]:
dataset_Healthy['Labels']=1
dataset_Unhealthy['Labels']=0

The Labels column of the healthy dataframe is filled with 1.
The Labels column of the unhealthy dataframe is filled with 0.
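
To see the renaming and labelling steps in isolation, here is a minimal sketch on two tiny made-up dataframes. The names `healthy` and `unhealthy` and their contents are assumptions for illustration only; they stand in for the Excel sheets loaded earlier.

```python
import pandas as pd

# Toy stand-ins for the two Excel sheets (hypothetical content).
healthy = pd.DataFrame({"Name": ["fund grows steadily", "assets rise"],
                        "Unused": [None, None]})
unhealthy = pd.DataFrame({"Name": ["fund collapses"], "Unused": [None]})

for frame, label in ((healthy, 1), (unhealthy, 0)):
    frame.columns = ["Articles", "Labels"]  # rename both columns
    frame["Labels"] = label                 # one value is broadcast to every row

print(healthy)
print(unhealthy)
```

Assigning a single number to a column fills every row with it, which is why one line per dataframe is enough.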

The next lines set the total number of documents we will use in the rest of the algorithm. Of course, to get the best accuracy we should let the ML algorithm learn on as much data as we can. But while building the algorithm it is useful to be able to set the number of documents considered manually.

So, we will cut the two dataframes (only take the first rows = truncate) to get the total number of documents we want.
But if the documents have a specific order in the Excel file, this selection can bias the learning of the future algorithm. So before truncating the two dataframes we need to shuffle their rows (mix the rows).

In [5]:
#Shuffle the dataframes
dataset_Healthy = dataset_Healthy.sample(frac=1).reset_index(drop=True)
dataset_Unhealthy = dataset_Unhealthy.sample(frac=1).reset_index(drop=True)

#Set the size of the data as a parameter
Size_Data = 10000

#Truncate the dataframes (keep half of Size_Data from each class)
dataset_Healthy = dataset_Healthy[:int(Size_Data/2)]
dataset_Unhealthy = dataset_Unhealthy[:int(Size_Data/2)]

The first two lines shuffle the dataframes: sample(frac=1) draws all the rows in a random order and reset_index(drop=True) renumbers them from 0. You don't have to worry about the parameters. If you want more information about them, google the function and find the documentation.

The last two lines truncate the dataframes (take the first Size_Data/2 rows of the data).
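
The shuffle-then-truncate step can be sketched on a toy dataframe as follows. The `random_state` argument is an addition that is not in the cell above: it is a standard pandas parameter that makes the shuffle reproducible from one run to the next. The names `df`, `size_data`, `shuffled` and `truncated` are illustrative.

```python
import pandas as pd

# Ten fake documents, all labelled healthy (illustrative data only).
df = pd.DataFrame({"Articles": [f"doc {i}" for i in range(10)], "Labels": 1})
size_data = 6

# frac=1 draws every row once, in random order; reset_index renumbers from 0.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Keep only the first half of size_data rows, as in the truncation step.
truncated = shuffled[: size_data // 2]

print(truncated)
```

Without `random_state`, each run selects a different subset of documents, which makes accuracy figures harder to compare between runs.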

Note here that I take the same number of healthy and unhealthy documents. I have not tested whether training the ML algorithm on data with more healthy than unhealthy documents affects the learning. In general a strongly imbalanced training set can bias a classifier toward the majority class, so keeping the classes balanced is a safe choice, and it was also the easiest way to truncate for a first version.

Then, we want one dataframe that contains the Size_Data healthy and unhealthy documents in a random order.
So, the next lines concatenate the two previous dataframes and shuffle the result.

In [6]:
#Concatenate
Data = pd.concat([dataset_Healthy,dataset_Unhealthy])

#Shuffle
Data = Data.sample(frac=1).reset_index(drop=True)

Finally, we split the dataset into a training set and a test set:

In [7]:
r_split = 0.2

Data_train =  Data[:int(Size_Data*(1-r_split))].reset_index(drop=True)
Data_test =  Data[int(Size_Data*(1-r_split)):Size_Data].reset_index(drop=True)
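
As an alternative to slicing by hand, sklearn provides `train_test_split`, which shuffles and splits in one call. This is not what the cell above does; it is a hedged sketch of the same 80/20 split on made-up arrays, with `stratify` keeping the healthy/unhealthy ratio identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Twenty fake documents: ten healthy (1) and ten unhealthy (0).
articles = np.array([f"doc {i}" for i in range(20)])
labels = np.array([1] * 10 + [0] * 10)

a_train, a_test, y_tr, y_te = train_test_split(
    articles, labels,
    test_size=0.2,      # same role as r_split
    stratify=labels,    # keep the class ratio in both parts
    random_state=0,     # reproducible split
)

print(len(a_train), len(a_test), y_te)
```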

## III- Preprocessing the data

In this part we will pre-process the training data. The pre-processing phase has three major parts:
- we want to tokenize the documents,
- we want to create the matrix of features that we will use as input of the classifier algorithm,
- finally, we will scale the matrix of features.

### A- Tokenize

Tokenizing means splitting a document into tokens (words). Here we also clean the documents in order to keep only the most relevant part of each one.

So we will create a function named my_tokenizerI that takes a document (of type string) as input and returns the same document but cleaned (the output is also a string).
Here is the function; I explain each line just after the code:

In [8]:
#Import the re library (regular expressions)
import re

#Import objects from nltk
#(the nltk tokenizer, stopwords and WordNet data must be downloaded
# beforehand with nltk.download)
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

#Create the function
def my_tokenizerI(s):
    s = re.sub('[^0-9a-zA-Z]', ' ', s)  # 0-9 (not 1-9) so that the digit 0 is kept
    s = s.lower()
    tokens = nltk.tokenize.word_tokenize(s)
    tokens = [t for t in tokens if len(t) > 2]
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]
    tokens = [t for t in tokens if t not in stopwords]
    tokens = ' '.join(tokens)
    return tokens

Let's focus on the function:

The input is s, a document of type string (--> s= 'even if his wife is vegetarian, the CEO of JP Morgan loves Chicken Gyros...')

__Line 1__ : _s = re.sub('[^0-9a-zA-Z]', ' ', s)_
This line replaces every character that is not a digit or a letter with a space. Note that the character class has to be 0-9: written as 1-9, it would also remove the digit 0.
To do so it uses the sub function of the re library; that's why I import the re library before the function.
s = 'even if his wife is vegetarian  the CEO of JP Morgan loves Chicken Gyros   '

__Line 2__ : _s = s.lower()_
This line puts all characters in lowercase.
s = 'even if his wife is vegetarian  the ceo of jp morgan loves chicken gyros   '

__Line 3__ : _tokens = nltk.tokenize.word_tokenize(s)_
This line uses the word_tokenize function of the nltk library. It transforms the string s into a list of tokens. A token is simply a single word of type string:
tokens = ['even','if','his','wife','is','vegetarian','the','ceo','of','jp','morgan','loves','chicken','gyros']

__Line 4__ : _tokens = [t for t in tokens if len(t) > 2]_
This line simply removes all the words that have 2 letters or fewer:
tokens = ['even','his','wife','vegetarian','the','ceo','morgan','loves','chicken','gyros']

__Line 5__ : _tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]_
This line lemmatizes all the tokens of the list. It means that it replaces each word with its base form (loves -> love ; gyros -> gyro). Called without a part-of-speech argument, as here, the lemmatizer treats every token as a noun, so verb forms such as 'working' are left unchanged unless you pass pos='v'.
I use an object of the nltk library; that's why, before the function, I had to import the class from nltk and then create an instance of it.
tokens = ['even','his','wife','vegetarian','the','ceo','morgan','love','chicken','gyro']

__Line 6__ : _tokens = [t for t in tokens if t not in stopwords]_
This line removes the words that don't carry much meaning. To do so we need the stopwords list, which is simply a list of very common words. There are many lists like that on the internet, but I chose the one from the nltk library. You can see that I imported this list just before the function.
tokens = ['even','wife','vegetarian','ceo','morgan','love','chicken','gyro']

__Line 7__ : _tokens = ' '.join(tokens)_
This last line joins the list of tokens back into the output string:
tokens = 'even wife vegetarian ceo morgan love chicken gyro'

And finally, the output is this cleaned string.

Of course, this tokenizer should be improved; there is a lot to do here that has a great impact on the result. For instance, at the beginning I removed all the numbers from the documents. When I tried keeping the numbers, the final accuracy improved by 1%!!
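
Since keeping the numbers mattered for accuracy, it is worth seeing how much the regular expression alone changes the output. The sketch below is a simplified stand-in for the tokenizer: it uses only `re` and `str.split` (no nltk tokenizing, lemmatizing or stopword removal), and the function name `clean_text` and the sample sentence are made up for the illustration.

```python
import re

def clean_text(s, keep_zero=True):
    # '[^0-9a-zA-Z]' keeps every digit; '[^1-9a-zA-Z]' silently drops the digit 0.
    pattern = "[^0-9a-zA-Z]" if keep_zero else "[^1-9a-zA-Z]"
    s = re.sub(pattern, " ", s).lower()
    # Same length filter as in the tokenizer: keep words longer than 2 characters.
    return " ".join(t for t in s.split() if len(t) > 2)

sample = "The fund returned 100% in 2020, up from 10.5%!"

print(clean_text(sample))                   # the fund returned 100 2020 from
print(clean_text(sample, keep_zero=False))  # the fund returned from
```

With the 1-9 class, "100" and "2020" are broken into one-character pieces that the length filter then throws away, so the numbers disappear entirely.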

Let's see now how this function is used, and how to prepare the input of our classifier.

### B- Create the matrix of feature

This part is a bit more complicated. Our goal is to create a matrix that our classifier algorithm can understand.

Indeed, ML algorithms are based on mathematical models and therefore they only understand numbers. They don't understand the strings in our dataframe. So, we have to find a way to translate our cleaned text into a matrix of numbers.

There are two ways to do so:

__1 ---> bag of words:__
Imagine you make the list of all the distinct words in all the documents of the data. This set of words is called the vocabulary.
Let's say that the size of the vocabulary is 100 000 and the number of documents you use for this algorithm is 10 000.
The bag of words matrix A is the 10 000 x 100 000 matrix where each row represents a document and each column represents a word of the vocabulary. The coefficient at row i and column j of A is the number of times the word j appears in the document i.

__2 ---> tfidf:__
This is almost the same thing: it is also a 10 000 x 100 000 matrix where each row represents a document and each column represents a word. But this time, the coefficient at row i and column j is the tf-idf weight of the word j in the document i: the frequency of the word in the document (tf), weighted down when the word appears in many documents of the corpus (idf).

To summarize, these two methods turn a set of documents into a huge matrix. This matrix will be used as the input of our ML algorithm.
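
Here is a minimal sketch of both matrices on a three-document toy corpus, using sklearn's `CountVectorizer` and `TfidfVectorizer` with their default settings. The corpus and the variable names (`cv_demo`, `tfidf_demo`, and so on) are made up for the illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["fund love chicken", "fund fund growth", "chicken gyro"]

# Bag of words: one row per document, one column per vocabulary word.
cv_demo = CountVectorizer()
counts = cv_demo.fit_transform(corpus).toarray()
vocabulary = sorted(cv_demo.vocabulary_)   # ['chicken', 'fund', 'growth', 'gyro', 'love']
fund_col = cv_demo.vocabulary_["fund"]     # column index of the word 'fund'

# tf-idf: same shape, but counts are re-weighted and each row is normalised.
tfidf_demo = TfidfVectorizer()
weights = tfidf_demo.fit_transform(corpus).toarray()
love_col = tfidf_demo.vocabulary_["love"]
fund_col_tfidf = tfidf_demo.vocabulary_["fund"]

print(vocabulary)
print(counts)
print(np.round(weights, 2))
```

In the first document, 'love' and 'fund' both appear once, yet 'love' gets the larger tf-idf weight because it occurs in only one document of the corpus while 'fund' occurs in two.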

The sklearn library provides the right objects to create these matrices. It will be very easy, you will see.

In [9]:
#Create a clean corpus
corpus_train =[]
for i in range (0,int(Size_Data*(1-r_split))):
    s = my_tokenizerI(Data_train['Articles'][i])
    corpus_train.append(s)

Max_NB_Word = 100000

# Create the matrix of features
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer(max_features=Max_NB_Word)
X_train=tfidf.fit_transform(corpus_train).toarray()

#from sklearn.feature_extraction.text import CountVectorizer 
#cv = CountVectorizer(max_features=Max_NB_Word)
#X_train = cv.fit_transform(corpus_train).toarray()

y_train = Data_train.iloc[:, 1].values


As I said, we will use a sklearn object to create the matrix of features. However, this object requires a list of strings as input.

So before creating the matrix of features we create this list. We initialise it (named corpus_train) as an empty list.
Then we loop over the rows of our dataframe and at each iteration we add to the corpus the document of the current row, tokenized with our previous tokenizer function.
At the end of the loop (it can take 5 to 10 minutes...) we have the list of all our tokenized documents (as strings).

Then we are ready to create the matrix of features.
We import the TfidfVectorizer object from the sklearn library.
Then we initialise the object. It takes one important parameter: max_features sets the maximal size of the vocabulary. So, if you set this parameter to Max_NB_Word = 100000, the object will only consider the 100 000 most frequent words. It is a very important parameter because it reduces the noise in the data (it removes rare words that can have a bad effect on the result).
Finally, we apply the method fit_transform to the corpus. This method returns the huge matrix we want!
The .toarray() call converts the result (a sparse matrix) into a dense numpy array, which some of the later steps require. Be aware that a dense array of this size uses a lot of memory.
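
To give an order of magnitude for the memory cost of `.toarray()`, here is a rough back-of-the-envelope sketch. It assumes 8 000 training documents (Size_Data = 10000 with r_split = 0.2), a full vocabulary of 100 000 words and 8-byte floats; the toy corpus and variable names are illustrative.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Dense storage: every coefficient is stored, zero or not.
n_docs, n_words = 8000, 100000
dense_gb = n_docs * n_words * 8 / 1e9     # float64 = 8 bytes per value

# Sparse storage: only the non-zero coefficients are stored.
docs = ["fund love chicken", "fund fund growth", "chicken gyro"]
X_sparse = TfidfVectorizer().fit_transform(docs)
stored = X_sparse.nnz                              # non-zero coefficients
total = X_sparse.shape[0] * X_sparse.shape[1]      # all coefficients

print(f"dense matrix: about {dense_gb} GB")
print(f"toy corpus: {stored} stored values out of {total}")
```

If memory becomes a problem, one option is to keep the sparse matrix for the models that accept it and only call `.toarray()` for the steps that really need a dense array.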

As you can see, I put the code for the two methods. After some tests the tfidf method seems to give better results, so you can ignore the three commented-out lines of code (the CountVectorizer version)!

We also create our vector of dependent variables y_train: y_train[i]=1 if the document in row i of X_train is healthy and y_train[i]=0 otherwise.

### C- Scale the Matrix

Before fitting the model, we need one last small step: scaling the matrix.

What does that mean? Actually, this is not necessary in our case, but it is often a good thing to do and it doesn't take a lot of work.

Imagine, in a completely different situation, that you have data from a bank and you try to classify the customers. Your matrix of features can have a column that gives the income of a given customer and another column that gives his credit score. A normal value for income is around 100 000 and a credit score is around 0.8. If you don't put these two attributes on the same scale, many classifiers will treat the income as more important just because its values are bigger... To avoid this misunderstanding you should provide your classifier with features that have approximately the same range of values. So before fitting the model you apply an operation on each column to bring, for instance, the income down to values centred on 0 with a spread of about 1.
To do so, you subtract the mean of the incomes of all the customers from the actual value and divide by their standard deviation (this is what StandardScaler does).

As I said, this is not necessary in our case because all the features are tf-idf weights and are therefore already on a comparable scale. With the bag of words method scaling matters more, at least for models that are sensitive to the scale of their inputs such as the kernel SVM (tree-based models such as the random forest are not affected by it).

Overall I think it is better to skip this scaling part. It is very costly in terms of computation and we can avoid it by using the tfidf method or a model that is insensitive to feature scale...
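
The bank example can be worked out by hand with numpy. The three customers below are invented numbers; the point is only to show the operation (subtract the column mean, divide by the column standard deviation).

```python
import numpy as np

# Columns: income, credit score (made-up customers).
X = np.array([[100000.0, 0.8],
              [ 60000.0, 0.6],
              [140000.0, 0.7]])

col_mean = X.mean(axis=0)   # one mean per column
col_std = X.std(axis=0)     # one standard deviation per column

X_scaled = (X - col_mean) / col_std

print(np.round(X_scaled, 3))
```

After the operation both columns have mean 0 and standard deviation 1, so neither dominates just because of its unit.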


In [10]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)

At the end of this part it is important to see that we have only dealt with the training data. We have fitted all our objects (tfidf and scaler) on the training set. Indeed, we used the method fit_transform, which performs the following two operations:

__first__ fit the object to the training data: for the scaler it means that the object memorises the mean and the standard deviation of each column of the training set.

__second__ transform the data: it applies the appropriate transformation according to the previous fit. For instance, for the scaler the transform method uses the means and standard deviations calculated in the fit step to scale each column (by subtracting the mean of the column from each coefficient of the column and then dividing by the standard deviation of the column).

It is important to understand that, since we prepared the training data with these objects before fitting our model, we will need to transform our test set the same way before testing it on the model. So the objects we just created are part of the model and we will need them to transform our test set in the test phase.
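
The difference between fit_transform and transform can be checked on a one-column toy example. The arrays `train` and `test` are invented; `mean_` and `scale_` are the attributes in which StandardScaler stores what it learned during the fit.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train, then scales it
test_scaled = scaler.transform(test)        # reuses the train statistics, learns nothing new

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

The test value 2.0 becomes 0 because 2.0 is the mean of the training column, not because of anything computed on the test set.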

## IV- Fit the Model

Here we will see how to fit a model to our data. You will see that, thanks to our previous work and the sklearn library, this is really easy.

In the code I put four different models:

- Logistic Regression
- Naive Bayes
- Random Forest
- SVM

The math behind each model is a bit complicated and I don't think that it is really important for our project... So I won't explain it here, but if you want a quick theoretical explanation just ask!



In [11]:
# Logistic Regression 
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
classifierNB = GaussianNB()
classifierNB.fit(X_train, y_train)

# Random Forest
from sklearn.ensemble import RandomForestClassifier
classifierRF = RandomForestClassifier(n_estimators = 20, criterion = 'entropy')
classifierRF.fit(X_train, y_train)

# Kernel SVM 
from sklearn.svm import SVC
classifierSVC = SVC(kernel = 'rbf')
classifierSVC.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Once again, three lines:

import the object,

initialise the object,

fit the object!

And after several minutes of processing you have your classifier ready and fitted!!!
We will see later how to use it and how to evaluate it.

If you look for the different models in the sklearn documentation, you will find that there are a lot of parameters we can set on our classifiers to improve them. I think we should focus more on the random forest method, but I didn't try to customize the other methods.

Now that we have fitted our classifiers, let's see how to evaluate the result!
    

## V- Evaluate the model

As I said previously, before evaluating on the test set we must prepare it the same way as the training set.

To do so we need to:
- create the matrix of features X_test the same way as we created X_train
- scale X_test the same way as we scaled X_train



In [14]:
#Corpus test 
corpus_test =[]
for i in range (0,int(Size_Data*(r_split))):
    Tokens = my_tokenizerI(Data_test['Articles'][i])
    corpus_test.append(Tokens)

#Matrix of feature
X_test=tfidf.transform(corpus_test).toarray()
#X_test=cv.transform(corpus_test).toarray()

#Vector of dependant variables
y_test = Data_test.iloc[:, 1].values

#Scale
X_test = sc.transform(X_test)

It is important to observe that we don't create new objects: we use the ones created for preprocessing the training data. We only transform the test set the same way we transformed the training set! That's why we use the "transform" method of the different objects and not the fit_transform method as previously. The objects should stay fitted to the training set!!
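
One way to make the "fit on train, transform on test" rule hard to get wrong is sklearn's `Pipeline`, which chains the preprocessing object and the classifier into a single estimator. This is an alternative to the cells in this document, not what they do. The sketch below uses a made-up four-document corpus and logistic regression.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented corpus: 1 = healthy, 0 = unhealthy.
train_docs = ["fund growth profit", "profit rise growth",
              "fund loss default", "default loss collapse"]
train_labels = [1, 1, 0, 0]
test_docs = ["growth profit", "loss default"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)     # fit_transform on tfidf, then fit the classifier
predictions = model.predict(test_docs)  # transform only, then predict

print(predictions)
```

Calling `predict` on the pipeline applies `transform` (never `fit_transform`) to the new documents, so the vectorizer stays fitted to the training set.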


Now let's see how to make a prediction:

In [15]:
y_pred = classifierRF.predict(X_test)

You just have to use the predict method of the classifier and it will return the vector of 1s and 0s that predicts whether each document of the test set is healthy or unhealthy.
So you can compare y_pred and y_test and hopefully they are similar!!

The confusion matrix is a matrix that makes this comparison:

In [17]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[164,  40],
       [ 21, 175]], dtype=int64)

cm is a 2x2 matrix whose rows are the true labels and whose columns are the predicted labels (label 0 = unhealthy, label 1 = healthy):
- cm[0][0] = number of unhealthy documents correctly predicted

- cm[1][1] = number of healthy documents correctly predicted

- cm[0][1] = number of unhealthy documents predicted as healthy

- cm[1][0] = number of healthy documents predicted as unhealthy

So you can print (cm[0][0]+cm[1][1])/(cm[0][0]+cm[1][1]+cm[1][0]+cm[0][1]) to get the accuracy on the test set!
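
To check which cell of the confusion matrix means what, a small hand-made example helps. The vectors `y_true` and `y_hat` below are invented; `cm_doc` simply copies the matrix printed in the output above so that its accuracy can be computed.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Five unhealthy (0) and three healthy (1) documents, with some mistakes.
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_hat  = [0, 0, 1, 1, 1, 1, 1, 0]

cm_demo = confusion_matrix(y_true, y_hat)   # rows = true labels, columns = predictions
unhealthy_as_healthy = cm_demo[0][1]        # true 0 predicted 1
healthy_as_unhealthy = cm_demo[1][0]        # true 1 predicted 0
accuracy_demo = np.trace(cm_demo) / cm_demo.sum()

# Matrix printed in the notebook output.
cm_doc = np.array([[164, 40], [21, 175]])
accuracy_doc = np.trace(cm_doc) / cm_doc.sum()   # (164 + 175) / 400

print(cm_demo)
print(accuracy_demo, accuracy_doc)
```

The diagonal (the trace) counts the correct predictions, so trace divided by total is the same accuracy formula as the one written out by hand.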

This is a good way to evaluate your final model, but __be careful!!!__ We can't use this accuracy to optimize our hyperparameters (number of trees, number of documents, max number of words considered)! If we use this measure of accuracy on the test set to optimize our parameters, the test set becomes a kind of training set for those parameters and therefore we can no longer use it to give an honest estimate of the accuracy! In other words, we should not touch our test set until we have settled on the final model: the test set should only be used to give the final accuracy of our model!

So how do we optimize the hyperparameters? We should find a way to measure the accuracy of our model without altering the integrity of our test set! This other way is cross-validation. The cross-validation method measures the accuracy of the model by looking only at the training set. How does it work?
- first, the training set is split into 10 parts called folds (you can modify this value)
- then, for each fold, the algorithm learns on the 9 other folds and calculates its accuracy on the fold that was left out
- finally, we take the mean of the 10 accuracies calculated.

So, this method evaluates the accuracy without touching the test set; it will allow us to optimize the parameters.
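
The three steps can be written out as an explicit loop and compared with sklearn's one-line helper. This is a sketch on synthetic data from `make_classification`, with 5 folds and logistic regression to keep it fast. It uses `StratifiedKFold` because, as far as I know, that is the splitter `cross_val_score` picks for a classifier when `cv` is an integer.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
clf = LogisticRegression()

fold_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Learn on 4 folds with a fresh copy of the model...
    fold_model = clone(clf).fit(X[train_idx], y[train_idx])
    # ...and measure the accuracy on the fold that was left out.
    fold_scores.append(fold_model.score(X[val_idx], y[val_idx]))

manual_mean = np.mean(fold_scores)
library_mean = cross_val_score(clf, X, y, cv=5).mean()

print(manual_mean, library_mean)
```

Every sample is used for validation exactly once, and the test set is never involved.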

In [18]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifierRF, X = X_train, y = y_train, cv = 10)
accuracies.mean()

0.8612925797882731

Thanks again sklearn! The parameter cv is the number of folds of the training set; 10 is a good value...


There are a lot of different parameters we can change to find the optimal model. The best accuracy I reached with this algorithm was 0.874, using the random forest algorithm with 20 trees, the entropy criterion, a data size of 10 000 and the 80 000 most frequent words in a tfidf feature matrix.


To be more efficient while looking for the best model you can use grid search. This is a sklearn object (GridSearchCV) that tests every combination of the parameter values you give it, using cross-validation, and returns the best combination together with its accuracy. Here is the code:


In [None]:
from sklearn.model_selection import GridSearchCV
parameters = [{'n_estimators': [5, 10], 'criterion': ['entropy']}]
              
grid_search = GridSearchCV(estimator = classifierRF,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_

You can look at the documentation to get more details on this object, but be aware that on a normal computer running this algorithm can take hours...
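
To get a feel for why grid search is slow, here is a small sketch on synthetic data that counts the work involved. The dataset, the grid values and the variable names are made up; only `GridSearchCV` and its attributes come from sklearn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=120, n_features=8, random_state=0)

param_grid = {"n_estimators": [5, 10], "criterion": ["gini", "entropy"]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="accuracy", cv=3)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])  # 2 x 2 parameter combinations
n_fits = n_candidates * 3                         # each one is fitted once per fold

print(search.best_params_, round(search.best_score_, 3))
print(n_candidates, "candidates,", n_fits, "fits")
```

The number of fits is the product of the number of values per parameter and the number of folds, so adding a parameter or a few values multiplies the running time.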

## VI- Save the model

Here are a few quick lines to save a model in your working directory and load it later. It can take time to fit a model, especially in our case with such a huge matrix of features.
Don't forget to also save the preprocessing objects!

In [None]:
import pickle

#directory="Models\model_RF_1.json"
def Save_model(directory, model):
    # pickle writes a binary file; the with block closes it properly
    with open(directory, 'wb') as f:
        pickle.dump(model, f)
    print("Saved model to disk")


def Load_model(directory):
    with open(directory, 'rb') as f:
        loaded_model = pickle.load(f)
    print("Loaded model from disk")
    return loaded_model
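
Since the preprocessing objects are part of the model, one simple option is to pickle them together with the classifier in a single dictionary. The sketch below is one possible way to do it, on a made-up corpus, writing to a temporary folder; the keys "tfidf" and "classifier" and the file name are arbitrary choices.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["fund growth profit", "profit rise growth",
        "fund loss default", "default loss collapse"]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
clf = LogisticRegression().fit(vectorizer.fit_transform(docs), labels)

# Save the fitted vectorizer and the fitted classifier in one file.
bundle = {"tfidf": vectorizer, "classifier": clf}
path = os.path.join(tempfile.mkdtemp(), "model_bundle.pkl")
with open(path, "wb") as f:
    pickle.dump(bundle, f)

# Later: load both and check that predictions are unchanged.
with open(path, "rb") as f:
    restored = pickle.load(f)

new_docs = ["profit growth"]
before = clf.predict(vectorizer.transform(new_docs))
after = restored["classifier"].predict(restored["tfidf"].transform(new_docs))

print(before, after)
```

Keep in mind that a pickle file should only be loaded if you trust where it comes from, and that it is safest to reload it with the same sklearn version that saved it.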