# Exercise for 2nd interview at Onogone

*Made by Anh-Thi DINH*

## 1. Goals

- The goal is to build a binary classifier based on the attached corpus. Documents are classified as either belonging to the desired class (TRUE) or not (FALSE). 
- We would simply like you to have a look at the data and come up with one or several strategies to build such a classifier and maximize its accuracy.
- This may include suggestions on how to pre-process, vectorize or resample the data or how to evaluate the classifier. 
- If time permits, code samples (in the language of your choice) in which you implement and evaluate your different strategies would be appreciated.

## 2. Results (TL;DR)

What I'm going to do in this notebook:

1. Import the data with the encoding "ANSI" (a trick: I found the encoding by opening the file in Notepad on Windows) and take a look at the data set (**Section 4**).
2. Preprocess the data: remove special characters, single characters and extra spaces, convert all letters to lower case, remove stopwords and stem the words (**Section 5**).
3. Vectorize the data into numbers, then build a baseline model using Logistic Regression and evaluate the results (**Section 6**).
4. Build some other classifiers using: Support Vector Machine (**Section 7.1**), Naive Bayes (**Section 7.2**), Random Forest (**Section 7.3**).
5. Conclusion (**Section 8**) and a discussion of other methods that could improve the results (**Section 9**).

Below is the table of accuracies.

![Table of results](./result.jpg)

## 3. Import necessary libraries

In [190]:
import pandas as pd
import numpy as np

# text preprocessing
import re
import nltk # natural language toolkit
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords = stopwords.words('english')
# for stemming
from nltk.stem.porter import PorterStemmer

# split the data into train / test sets
from sklearn.model_selection import train_test_split

# vectorize contents 
from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.feature_extraction.text import TfidfVectorizer

# different classification algorithms
from sklearn.linear_model import LogisticRegression # logistic regression
from sklearn import svm # support vector machine
from sklearn.naive_bayes import MultinomialNB # naive bayes
from sklearn.ensemble import RandomForestClassifier

# evaluations
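# note: jaccard_similarity_score was deprecated in scikit-learn 0.21 and removed in 0.23 (replaced by jaccard_score)
from sklearn.metrics import jaccard_similarity_score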
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
# from sklearn import metrics

# cross validation
# from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score, cross_val_predict

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\ProgramData\Anaconda3\lib\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 4. Importing data

In [145]:
# Trick: open the csv file in Notepad on Windows to find out its encoding
df = pd.read_csv("exerciceDS.csv", encoding="ANSI")

# Convert True/False to 1/0 so the classes are easier to read
df["Classes"] = df["Classes"].astype(int)

# first look on the data frame
df.head()

|   | Content | Classes |
|---|---|---|
| 0 | To Editor Re Penn Station Now Always Zach Gros... | 1 |
| 1 | My brother drone morning flew park wind carrie... | 1 |
| 2 | This holiday advertising campaign United State... | 1 |
| 3 | Through dim memory South Bronx childhood blurr... | 1 |
| 4 | LONDON For Queen Elizabeth II failed attend Ch... | 1 |
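
The codec name `"ANSI"` is an alias of `mbcs` that Python only provides on Windows, so the import cell above raises a `LookupError` on other systems. Below is a minimal sketch of a more portable loader. It assumes the file was saved in a Western European Windows code page (`cp1252`), which I have not verified for this data set; the helper name `read_csv_with_fallback` and the list of candidate encodings are illustrative and not part of the original notebook.

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (DataFrame, encoding) for the first encoding that decodes the file."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next candidate
    raise ValueError(f"none of {encodings} could decode {path}") from last_error
```

`utf-8` is tried first because it rejects most text written in another encoding, whereas `latin-1` accepts any byte sequence and therefore only makes sense as the last resort.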


Get some information about the data.

In [146]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5991 entries, 0 to 5990
Data columns (total 2 columns):
Content    5991 non-null object
Classes    5991 non-null int32
dtypes: int32(1), object(1)
memory usage: 70.3+ KB


Count the number of desired classes / not desired classes in the data frame.

In [147]:
df.Classes.value_counts()

0    5635
1     356
Name: Classes, dtype: int64
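
The counts above show that the two classes are heavily imbalanced. As a quick check of what this means for accuracy, the sketch below computes the share of positive documents and the accuracy of a trivial classifier that always answers 0. It uses only the two counts printed above.

```python
# Class counts reported by df.Classes.value_counts() above
n_negative, n_positive = 5635, 356
n_total = n_negative + n_positive

positive_rate = n_positive / n_total
# Accuracy of a trivial classifier that always predicts class 0
majority_accuracy = n_negative / n_total

print(f"positive rate: {positive_rate:.3f}")          # 0.059
print(f"always-0 accuracy: {majority_accuracy:.3f}")  # 0.941
```

Any accuracy reported later should therefore be read against this 0.941 floor rather than against 0.5.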

## 5. Data preprocessing

In [150]:
X = df['Content'].values # returns a np array
y = df['Classes'].values

In [151]:
ps = PorterStemmer()       # create the stemmer once, outside the loop
stop_set = set(stopwords)  # build the set once for fast membership tests

contents = []
for text in X:

    # Replace all the special (non-word) characters with a space
    content = re.sub(r'\W', ' ', text)

    # Remove all single characters
    content = re.sub(r'(\b[A-Za-z] \b|\b [A-Za-z]\b)', '', content)

    # Substitute multiple spaces with a single space
    content = re.sub(r'\s+', ' ', content)

    # Convert all letters to lower case
    content = content.lower()

    # Remove stopwords, then stem the remaining words
    # (Porter stemming, which is not the same as lemmatization)
    words = [ps.stem(word) for word in content.split() if word not in stop_set]
    content = ' '.join(words)

    # add to contents
    contents.append(content)
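
To see what the regular expressions in the loop above do to a single document, here is a small sketch that applies the same cleaning steps to one sentence. It leaves out the stemming step and uses a made-up two-word stopword list, so the function `clean_text` and the set `toy_stop_words` are illustrative assumptions, not the notebook's actual pipeline.

```python
import re

def clean_text(text, stop_words):
    """Apply the cleaning steps of the loop above to one string (no stemming)."""
    text = re.sub(r'\W', ' ', text)                            # special characters -> space
    text = re.sub(r'(\b[A-Za-z] \b|\b [A-Za-z]\b)', '', text)  # isolated single letters
    text = re.sub(r'\s+', ' ', text)                           # collapse whitespace
    words = [w for w in text.lower().split() if w not in stop_words]
    return ' '.join(words)

# Hypothetical two-word stopword list, for illustration only
toy_stop_words = {"the", "and"}
print(clean_text("The cat, the dog & a bird!", toy_stop_words))  # cat dog bird
```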

## 6. A baseline model

We first build a simple model that serves as a point of comparison for the more advanced models tested later.

- In this baseline model, we keep the text preprocessing simple (removing stopwords, stemming, ...).
- We use the BOW (bag-of-words) model to create vectors out of the contents.
- We use Logistic Regression (LoR) to train our model.

### 6.1. Vectorizing data

We vectorize `contents` into numbers using BOW and limit the vocabulary based on word frequency: we keep at most the 2000 most frequent words (`max_features`), ignore words that appear in fewer than 5 documents (`min_df`) and ignore words that appear in more than 80% of the documents (`max_df`). Very rare words give too little evidence to be reliable, and words present in almost every document do little to separate the two classes.

In [194]:
# vectorize contents using BOW
vectorizer = CountVectorizer(max_features=2000, min_df=5, max_df=0.8, stop_words=stopwords)
vectorizer.fit(contents)
X = vectorizer.transform(contents)
X

<5991x2000 sparse matrix of type '<class 'numpy.int64'>'
	with 714944 stored elements in Compressed Sparse Row format>

One can see that `X` is a 5991x2000 matrix, where 5991 is the number of documents and 2000 is the size of the vocabulary that `vectorizer` built from `contents`.
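
The effect of the `min_df` and `max_df` thresholds is easier to see on a tiny corpus. The sketch below uses four made-up documents and different threshold values from the cell above, chosen only so that each rule removes exactly one word.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "apple banana",
    "apple cherry",
    "apple banana cherry",
    "apple date",
]

# min_df=2: drop words found in fewer than 2 documents ("date")
# max_df=0.8: drop words found in more than 80% of the documents ("apple")
toy_vectorizer = CountVectorizer(min_df=2, max_df=0.8)
toy_X = toy_vectorizer.fit_transform(toy_docs)

print(sorted(toy_vectorizer.vocabulary_))  # ['banana', 'cherry']
print(toy_X.shape)                         # (4, 2)
```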

### 6.2. Split data into train / test sets

In [156]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=273)
print("X_train: {a}, X_test: {b}".format(a=X_train.shape[0], b=X_test.shape[0]))

X_train: 4493, X_test: 1498
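
With so few positive documents, a purely random split can give the train and test sets slightly different class proportions. One option, which the split above does not use, is the `stratify` argument of `train_test_split`. The sketch below shows its effect on made-up labels; `toy_X` and `toy_y` are illustrative and not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 20 positives out of 200 samples (illustrative, not the real data)
toy_y = np.array([1] * 20 + [0] * 180)
toy_X = np.arange(200).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=273, stratify=toy_y
)

# With stratify, both parts keep the 10% positive rate of the full set
print(y_tr.mean(), y_te.mean())
```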


### 6.3. Classifying using Logistic Regression

Training the model with Logistic Regression.

In [174]:
LoRclass = LogisticRegression()
LoRclass.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Make a prediction.

In [175]:
y_pred = LoRclass.predict(X_test)

Evaluate the model with different types of evaluations.

In [176]:
# mean accuracy (higher is better)
train_acc = LoRclass.score(X_train, y_train)
test_acc = LoRclass.score(X_test, y_test)
print("Mean accuracy: {:.3f} (training), {:.3f} (testing)".format(train_acc, test_acc))

# jaccard (higher is better); for binary labels this legacy function returns the accuracy
test_jaccard = jaccard_similarity_score(y_test, y_pred)
print("Jaccard index: {:.3f}".format(test_jaccard))

# f1_score (higher is better)
test_f1 = f1_score(y_test, y_pred, average='binary')
print("F1-score: {:.3f}".format(test_f1))

# log loss (smaller is better); note that it is computed here on hard 0/1 predictions,
# whereas it is normally computed on probabilities such as LoRclass.predict_proba(X_test)
test_log_loss = log_loss(y_test, y_pred)
print("Log loss: {:.3f}".format(test_log_loss))

Mean accuracy: 0.999 (training), 0.983 (testing)
Jaccard index: 0.983
F1-score: 0.848
Log loss: 0.576


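In our problem, the *F1-score* is more informative than the other metrics. The classes are heavily imbalanced (356 positive documents out of 5991), so accuracy is dominated by the majority class, while F1 measures how well the hard 0/1 predictions recover the positive class. The *Log loss* is normally used for predicted probabilities between 0 and 1; applied to hard predictions as above, it is only a rescaled error rate.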
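
To make the difference between accuracy and F1-score concrete, the sketch below computes both from the four cells of a confusion matrix. The counts are hypothetical and chosen only for illustration; they are not the results of this notebook, and `binary_scores` is a helper name introduced here.

```python
def binary_scores(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for illustration (not taken from the notebook's results)
acc, prec, rec, f1 = binary_scores(tp=40, fp=5, fn=20, tn=935)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
# accuracy=0.975 precision=0.889 recall=0.667 f1=0.762
```

In this example the classifier misses a third of the positive documents, which the accuracy of 0.975 hides and the F1-score of 0.762 reveals.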

The training accuracy is very high (almost 1), which may indicate overfitting. We use cross-validation (K-fold, for example) to get a more reliable evaluation.

In [169]:
# Make cross validated predictions
scores = cross_val_score(LoRclass, X, y, cv=7, scoring='accuracy')
print("Accuracy score: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(LoRclass, X, y, cv=7, scoring='f1') # binary
print("F1-score accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))

Accuracy score: 0.984 (+/- 0.016)
F1-score accuracy: 0.861 (+/- 0.117)
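
One caveat about the cross-validation above: `X` was vectorized on all documents before the folds were created, so the vocabulary is chosen with knowledge of the held-out documents. A possible refinement, sketched below on a tiny made-up corpus, wraps the vectorizer and the classifier in a scikit-learn pipeline so that the vectorizer is re-fitted inside every fold. The names `toy_contents` and `toy_labels` are stand-ins for `contents` and `y`, and `max_iter=1000` is an arbitrary choice.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for `contents` and `y`
toy_contents = ["good film great actor", "great plot good cast",
                "bad film poor actor", "poor plot bad cast"] * 5
toy_labels = [1, 1, 0, 0] * 5

# The vectorizer is re-fitted on the training part of every fold
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, toy_contents, toy_labels, cv=cv, scoring="f1")

print("F1: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
```

`StratifiedKFold` also keeps the class proportions similar in every fold, which matters when one class is rare.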


## 7. Other models 

### 7.1. Classifying using Support Vector Machine algorithm (SVM)

In [177]:
SVMclass = svm.SVC()
SVMclass.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Make a prediction.

In [178]:
y_pred = SVMclass.predict(X_test)

Evaluate the model with different types of evaluations.

In [179]:
# mean accuracy (higher is better)
train_acc = SVMclass.score(X_train, y_train)
test_acc = SVMclass.score(X_test, y_test)
print("Mean accuracy (SVM): {:.3f} (training), {:.3f} (testing)".format(train_acc, test_acc))

# jaccard (higher is better)
test_acc = jaccard_similarity_score(y_test, y_pred)
print("Jaccard index: {:.3f}".format(test_acc))

# f1_score (higher is better)
test_acc = f1_score(y_test, y_pred, average='binary')
print("F1-score: {:.3f}".format(test_acc))

Mean accuracy (SVM): 0.987 (training), 0.977 (testing)
Jaccard index: 0.977
F1-score: 0.761


Let's try with a cross validation to check the accuracy.

In [180]:
# Make cross validated predictions
scores = cross_val_score(SVMclass, X, y, cv=7, scoring='accuracy')
print("Accuracy score: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(SVMclass, X, y, cv=7, scoring='f1')
print("F1-score accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))

Accuracy score: 0.979 (+/- 0.011)
F1-score accuracy: 0.789 (+/- 0.121)


In this case, SVM **takes more time to execute** but **does not give a better result** than Logistic Regression.

### 7.2. Classifying using Naive Bayes

In [185]:
NBclass = MultinomialNB()
NBclass.fit(X_train, y_train)
y_pred = NBclass.predict(X_test)

Evaluate the model.

In [183]:
# mean accuracy (higher is better)
train_acc = NBclass.score(X_train, y_train)
test_acc = NBclass.score(X_test, y_test)
print("Mean accuracy (NBclass): {:.3f} (training), {:.3f} (testing)".format(train_acc, test_acc))

# jaccard (higher is better)
test_acc = jaccard_similarity_score(y_test, y_pred)
print("Jaccard index: {:.3f}".format(test_acc))

# f1_score (higher is better)
test_acc = f1_score(y_test, y_pred, average='binary')
print("F1-score: {:.3f}".format(test_acc))

Mean accuracy (NBclass): 0.935 (training), 0.929 (testing)
Jaccard index: 0.929
F1-score: 0.604


Cross-validation.

In [187]:
# Make cross validated predictions
scores = cross_val_score(NBclass, X, y, cv=7, scoring='accuracy')
print("Accuracy score: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(NBclass, X, y, cv=7, scoring='f1')
print("F1-score accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))

Accuracy score: 0.886 (+/- 0.182)
F1-score accuracy: 0.563 (+/- 0.342)


Naive Bayes **takes less time to execute**, but **its results are clearly worse** (cross-validated F1-score of 0.563).

### 7.3. Random Forest Classifier

In [191]:
RFclass = RandomForestClassifier(n_estimators=1000, random_state=0) # 1000 trees
RFclass.fit(X_train, y_train)
y_pred = RFclass.predict(X_test)

In [192]:
# mean accuracy (higher is better)
train_acc = RFclass.score(X_train, y_train)
test_acc = RFclass.score(X_test, y_test)
print("Mean accuracy (RFclass): {:.3f} (training), {:.3f} (testing)".format(train_acc, test_acc))

# jaccard (higher is better)
test_acc = jaccard_similarity_score(y_test, y_pred)
print("Jaccard index: {:.3f}".format(test_acc))

# f1_score (higher is better)
test_acc = f1_score(y_test, y_pred, average='binary')
print("F1-score: {:.3f}".format(test_acc))

Mean accuracy (RFclass): 1.000 (training), 0.979 (testing)
Jaccard index: 0.979
F1-score: 0.784


In [193]:
# Make cross validated predictions
scores = cross_val_score(RFclass, X, y, cv=7, scoring='accuracy')
print("Accuracy score: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(RFclass, X, y, cv=7, scoring='f1')
print("F1-score accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))

Accuracy score: 0.980 (+/- 0.012)
F1-score accuracy: 0.801 (+/- 0.113)


Because we use a very large number of trees (1000), the Random Forest takes a long time to execute. Its results are good compared with SVM and Naive Bayes, but not better than those of Logistic Regression.

## 8. Conclusion

Logistic Regression appears to be the best of the methods above. The runner-up is the Random Forest, but it takes a long time to run (using fewer trees would make it faster). Both can easily overfit if we rely on a single fixed train/test split (their training accuracies are 0.999 and 1.000 respectively). That is why we need a cross-validation step to evaluate our methods, and the cross-validated results remain reasonably good (F1-scores of 0.861 and 0.801).

## 9. Discussion

There are many other algorithms we can use to classify our data. Besides that, we can also tune the parameters of our algorithms to get better results (if possible). Below are some options:

1. Change the `CountVectorizer` parameters, for example increase the vocabulary size (`max_features`) or adjust the document-frequency thresholds (`min_df`, `max_df`).
2. Instead of using `CountVectorizer`, we can try `TfidfVectorizer`.
3. Feature scaling (not so helpful for the Random Forest).
4. Try word embeddings with the help of **Word2Vec** or a similar method.
5. Try to use Neural Networks with the help of **Keras**.
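
As an illustration of the second option, the sketch below compares `CountVectorizer` and `TfidfVectorizer` on three made-up documents. It only shows how the features change; whether TF-IDF actually improves the scores on this data set has not been tested here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["stock market falls", "stock market rises", "football match tonight"]

counts = CountVectorizer().fit_transform(toy_docs)
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(toy_docs)

# Same vocabulary and matrix shape, but each TF-IDF row is L2-normalised and
# words shared by several documents receive smaller weights than rare ones
print(counts.shape, tfidf.shape)
print(tfidf.toarray().round(2))
```

In the first document, for example, "falls" (present in one document) gets a larger weight than "stock" (present in two).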

Because I don't have enough time to explore all of the options above, I leave them for future work.