In [14]:
import pandas as pd
from stop_words import get_stop_words
from nltk.corpus import stopwords
import re
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from nltk.stem.snowball import ItalianStemmer
import nltk
from num2words import num2words
import matplotlib.pyplot as plt
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler
import numpy as np

### Data exploration

The **TripAdvisor dataset** chosen for this project contains **Italian-language reviews from the well-known travel platform**.

The dataset is composed of **28754 labelled textual reviews** and it has **no missing values / inconsistencies**.

Its attributes are:<br/>
{<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"**text**": string - The review<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"**class**": string ("pos" / "neg") - The binary label associated with the review<br/>
}<br/>

Analyzing the dataset, the first thing that stands out is that **it is label-imbalanced**: the dataset is fairly small, and **almost 70% of the reviews are classified as positive** (19532 out of 28754).<br/>

However, **the negative reviews tend to be longer**: the average length of a negative review is around 140 words, against an average of about 100 words for a positive one.<br/>

Because of this, we can assume that **an average negative review offers more features to extract than an average positive review**, which may partly reduce the gap created by the class imbalance.
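
As a minimal sketch, the per-class length comparison can be computed with pandas as shown here. The toy rows are invented for illustration; only the column names `text` and `class` come from the dataset.

```python
import pandas as pd

# Toy stand-in for the development set (rows are made up for illustration)
toy = pd.DataFrame({
    "text": ["ottimo hotel", "camera sporca e personale scortese", "tutto perfetto"],
    "class": ["pos", "neg", "pos"],
})

# Number of whitespace-separated words in each review
toy["n_words"] = toy["text"].str.split().str.len()

# Average review length per class
avg_len = toy.groupby("class")["n_words"].mean()
print(avg_len)
```

Applied to `dev`, the same two lines would give the averages quoted above, assuming words are counted by splitting on whitespace.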

In [6]:
dev = pd.read_csv('exam_development.csv')
eval = pd.read_csv('exam_evaluation.csv')

print("Development dataset:\n")
print("Shape: " + str(dev.shape) + "\n")
print("Values counts:\n" + str(dev.loc[:, "class"].value_counts()) + "\n\n\n")

print("Evaluation dataset:\n")
print("Shape: " + str(eval.shape))

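# Per-class review counts, passed later to LinearSVC as class_weight
# (the name shadows the built-in dict; it is kept because later cells use it)
dict = dev.loc[:, "class"].value_counts().to_dict()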

Development dataset:

Shape: (28754, 2)

Values counts:
pos    19532
neg     9222
Name: class, dtype: int64



Evaluation dataset:

Shape: (12323, 1)


### Preprocessing

The **preprocessing pipeline** that I designed for this project is the following:
1. Conversion to lower case
2. Punctuation removal
3. Italian stopwords removal
4. Stemming process

First of all, I decided to transform every review to **lower case**, in order to help the other steps (in particular the stopwords removal and stemming processes) of the pipeline.

I also replaced every apostrophe with a space, in order to properly separate elided words (for example "l'hotel").

Then, I removed all the punctuation from the reviews, including "!".<br/>
Initially I did not want to remove it, because it could help to **increase the strength of a sentence** and **amplify its polarity** (for example from "quite positive" to "very positive"), but since this is a binary classification task, I decided to remove it.
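
The first two steps can be sketched on a single sentence as follows. The helper name `normalize` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

def normalize(text):
    """Lower-case, split elided forms on apostrophes, and drop punctuation."""
    text = text.lower().replace("'", " ")
    # \w+ keeps runs of letters, digits and underscores (accented letters included)
    return " ".join(re.findall(r"\w+", text))

print(normalize("L'hotel è BELLISSIMO!!!"))
```

Note that the exclamation marks disappear entirely, which is the trade-off discussed above.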

For the last two steps, which work directly on words, I tried different approaches.<br/>
At first I built the stop-word set as the union of three sets:
- Stop words from the Python library stop_words
- Stop words from the Python library NLTK
- First 100 numbers

After some time I realized that **having numbers inside my stop-word set is potentially dangerous, because it discards useful features**, so I concluded that the best set was the union of the stop_words list and the NLTK list.

I also removed the word "non" from the set, because negation is essential to the meaning of a sentence.
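
A minimal sketch of this construction, using two tiny hypothetical lists in place of the real stop_words and NLTK lists (the names `list_a`, `list_b` and `remove_stop_words` are assumptions made for the example):

```python
# Hypothetical miniature stop-word lists standing in for the two libraries
list_a = {"il", "la", "mi", "non"}
list_b = {"è", "la", "di", "non"}

stop_set = list_a | list_b      # union of the two lists
stop_set.discard("non")         # keep the negation in the text

def remove_stop_words(text):
    """Drop every token that appears in the stop-word set."""
    return " ".join(w for w in text.split() if w not in stop_set)

print(remove_stop_words("il servizio non mi è piaciuto"))
```

Without the `discard` call the negation would be filtered out and the sentence would read as a positive one.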

The stemmer I chose is ItalianStemmer from NLTK.<br/>
I used a stemmer to truncate superlatives, collapse the gender inflections of nouns and adjectives, and normalize the vocabulary.
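
As a small illustration of the gender normalization, the sketch below stems three inflections of the same adjective. The word list is chosen for the example; the expectation is that all forms collapse to a single stem.

```python
from nltk.stem.snowball import ItalianStemmer

stemmer = ItalianStemmer()

# Masculine, feminine and plural forms of the same adjective
forms = ["perfetto", "perfetta", "perfetti"]
stems = [stemmer.stem(w) for w in forms]
print(stems)
```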

After this preprocessing pipeline, I used TfidfVectorizer from sklearn to transform the reviews into a TF-IDF feature matrix.
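
A minimal sketch of this last step on three invented, already pre-processed reviews, using the default TfidfVectorizer settings (the tuned parameters are discussed in the tuning section):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy pre-processed reviews (invented for illustration)
docs = ["hotel perfett", "hotel sporc", "camer sporc pessim"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)     # sparse matrix: one row per review

print(X.shape)                         # (number of reviews, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # terms learned from the corpus
```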

In [17]:
# Stemmer
stemmer = ItalianStemmer()

print("Development dataset")
print("-  -  -  -  -  -  -  -  -")
print("Lower case text")
dev.loc[:, "text"] = dev.text.apply(lambda x: x.lower().replace("'", " "))

print("Remove punctuation")
# Raw string avoids an invalid-escape warning; \w+ keeps only word characters
dev.loc[:, "text"] = dev.text.apply(lambda x: " ".join(re.findall(r'\w+', x)))

print("Remove italian stopwords")
dev.loc[:, "text"] = dev.text.apply(removeStopWords)

print("Stemming process")
dev.loc[:, "text"] = dev.text.apply(stem)

print("- - - - - - - - - - - -")

print("Evaluation dataset")
print("-  -  -  -  -  -  -  -  -")

print("Lower case text")
eval.loc[:, "text"] = eval.text.apply(lambda x: x.lower().replace("'", " "))

print("Remove punctuation")
# Raw string avoids an invalid-escape warning; \w+ keeps only word characters
eval.loc[:, "text"] = eval.text.apply(lambda x: " ".join(re.findall(r'\w+', x)))

print("Remove italian stopwords")
eval.loc[:, "text"] = eval.text.apply(removeStopWords)

print("Stemming process")
eval.loc[:, "text"] = eval.text.apply(stem)

Development dataset
-  -  -  -  -  -  -  -  -
Lower case text
Remove punctuation
Remove italian stopwords
Stemming process
- - - - - - - - - - - -
Evaluation dataset
-  -  -  -  -  -  -  -  -
Lower case text
Remove punctuation
Remove italian stopwords
Stemming process


### Algorithm choice

#### Random Forest
As a first approach, I used the **RandomForestClassifier** from Python's sklearn package as a base classifier. The test set was made up of 20% of the development set.

Before any tuning, the classifier reached 91% accuracy, which was not bad at all.

After this first attempt, I **tuned the hyperparameters** (explained in more detail in the next section), which led the classifier to 93% accuracy.

#### Logistic Regression

In addition, I tried **LogisticRegression**, also from the sklearn package; I thought it could be a good fit for this project because it is designed for **modelling binary classification problems**.

With LogisticRegression the accuracy reached 95%.

#### Linear Support Vector Machine

Finally, I decided to try the **LinearSVC classifier** from sklearn, assuming that, with the preprocessing steps described above and some tuning of the TfidfVectorizer, the dataset would be clearly separated into two distinct clusters.

After a little tuning, LinearSVC reached **97.5% accuracy**.

Even though a Support Vector Machine with a linear kernel performs similarly to Logistic Regression, I think the former performs better on this project because of its **sensitivity to samples near the decision boundary**.

The sigmoid function of LogisticRegression tends not to classify near-neutral reviews properly, while the Support Vector Machine tries to build the separating hyperplane with the widest possible margin between the two clusters.
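
One way to read this difference is through the loss each model minimizes. The sketch below compares the plain hinge loss with the logistic loss as a function of the margin `y * f(x)`; it is an illustration only, and note that sklearn's LinearSVC uses the squared hinge loss by default, so the plain hinge is an assumption made for simplicity.

```python
import math

def hinge_loss(margin):
    """SVM hinge loss: exactly zero once a sample lies beyond the margin."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic regression loss: strictly positive for every finite margin."""
    return math.log(1.0 + math.exp(-margin))

# margin = y * f(x): positive when the sample is on the correct side
for m in [-1.0, 0.0, 0.5, 1.0, 3.0]:
    print(f"margin={m:+.1f}  hinge={hinge_loss(m):.3f}  logistic={logistic_loss(m):.3f}")
```

With the hinge loss, only samples inside or beyond the wrong side of the margin contribute, so the fit is driven by the reviews closest to the boundary.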

All evaluations were scored with f1_score, as suggested in the assignment.
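
For reference, the support-weighted F1 (the `average='weighted'` variant used in the training cell) can be written out by hand as in the following sketch. The labels are invented for illustration and the helper names are assumptions.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 of one class, treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Average of the per-class F1 scores, weighted by class support."""
    n = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, lab) * y_true.count(lab) / n
        for lab in sorted(set(y_true))
    )

y_true = ["pos", "pos", "pos", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "neg"]
print(round(weighted_f1(y_true, y_pred), 3))
```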

### Tuning and validation

To tune the classifiers properly I used **GridSearchCV** on one main parameter, C.
I considered it the most important parameter to work on because of the problems I had with marginal values.
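
A minimal sketch of such a search on an invented toy corpus is shown here. Wrapping the vectorizer and the classifier in a `Pipeline` is an assumption of the example, chosen so that the TF-IDF vocabulary is refit on each training fold; the grid values mirror the ones tried in the training cell.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy pre-processed reviews (invented for illustration)
texts = ["hotel perfett", "camer sporc", "personal eccellent", "serviz pessim",
         "colazion fantast", "bagn sporc", "posizion perfett", "personal scortes"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

# Vectorizer inside the pipeline, so it is refit on each training fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", LinearSVC(max_iter=15000)),
])

search = GridSearchCV(
    pipeline,
    param_grid={"svc__C": [0.5, 1, 5, 10]},
    scoring="f1_weighted",
    cv=2,
)
search.fit(texts, labels)
print(search.best_params_)
```

On a corpus this small the selected value is not meaningful; the point is only the shape of the search.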

After some attempts, I concluded that **C = 5 was the best in terms of accuracy**. So my classifier works best with a small margin around the hyperplane of the LinearSVC.

"**class_weight**" is another crucial parameter I set.<br/>
The dataset is imbalanced, so I had to give the classifier a way to compensate for this disparity.
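
For comparison, a common rebalancing scheme is inverse-frequency weighting, which is the heuristic that sklearn applies when `class_weight="balanced"` is passed. The sketch below computes those weights from the class counts of the development set; whether they would improve on the weights used in this notebook was not tested here.

```python
# Class counts from the development set
counts = {"pos": 19532, "neg": 9222}

n_samples = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights: n_samples / (n_classes * count)
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}
print(weights)
```

The minority class receives the larger weight, so errors on negative reviews cost more during training.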

For the **TfidfVectorizer**, I set *max_df* to 0.3 to prune possible corpus-specific stop words. I think this parameter could also be set as high as 0.4.
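
The effect of *max_df* can be sketched by hand: a term is dropped when the fraction of documents containing it is strictly higher than the threshold. The toy documents and the threshold of 0.5 are invented for illustration.

```python
from collections import Counter

# Toy tokenised reviews (invented for illustration)
docs = [["hotel", "perfett"], ["hotel", "sporc"], ["hotel", "pessim"], ["camer", "sporc"]]

# Document frequency: fraction of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))
doc_freq = {term: n / len(docs) for term, n in df.items()}

max_df = 0.5  # illustrative threshold; the notebook uses 0.3 on the real corpus
kept = sorted(t for t, f in doc_freq.items() if f <= max_df)
print(kept)
```

Here "hotel" appears in three documents out of four and is pruned, much like a corpus-specific stop word.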

Moreover, I fixed *ngram_range* to (1, 2), mainly because I decided to leave the word "non" out of the stop words and therefore in the dataset.<br/>
I also tried trigrams, considering that in Italian the word "non" can appear quite far from the main concept of the sentence.
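
To make the role of bigrams concrete, the sketch below lists the unigrams and bigrams of a short stemmed sentence. The helper `ngrams` and the tokens are illustrative assumptions; TfidfVectorizer builds its n-grams internally.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

tokens = "non consigl hotel".split()
print(ngrams(tokens))
```

The bigram "non consigl" becomes a feature of its own, so the classifier can weight the negated form separately from "consigl".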

In [22]:
# Local training
cv = TfidfVectorizer(ngram_range=(1, 2), binary=True, max_df=0.3)
cv.fit(dev.loc[:, "text"])
X = cv.transform(dev.loc[:, "text"])

print("Fit and Predict:")

X_train, X_test, y_train, y_test = train_test_split(X, dev.loc[:, "class"], test_size=0.2, random_state=0)

for c in [0.5, 1, 5, 10]:
    lr = svm.LinearSVC(class_weight=dict, C=c, max_iter=15000)
    lr.fit(X_train, y_train)
    predictions = lr.predict(X_test)
    print("Accuracy for C:%s \n(accuracy_score):%s"
          % (c, accuracy_score(y_test, predictions)))
    print("(f1_score):", f1_score(y_test, predictions, average='weighted'))

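# Coefficients of the last model fitted in the loop (C=10):
# positive values push towards "pos", negative values towards "neg".
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
feature_to_coef = {
    word: coef for word, coef in zip(
        cv.get_feature_names_out(), lr.coef_[0]
    )
}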

print("\n\nBest positive words:")
for best_positive in sorted(
        feature_to_coef.items(),
        key=lambda x: x[1],
        reverse=True)[:5]:
    print(best_positive)
print("\nBest negative words:")
for best_negative in sorted(
        feature_to_coef.items(),
        key=lambda x: x[1])[:5]:
    print(best_negative)

cv = TfidfVectorizer(ngram_range=(1, 2), max_df=0.3)
cv.fit(dev.loc[:, "text"])
X = cv.transform(dev.loc[:, "text"])
X_test = cv.transform(eval.loc[:, "text"])

lr = svm.LinearSVC(class_weight=dict, max_iter=15000)
lr.fit(X, dev.loc[:, "class"])

predictions = lr.predict(X_test)

with open('exam_export.csv', 'w') as file:
    file.write("Id,Predicted\n")
    for index in eval.index:
        s = predictions[index]
        file.write(str(index) + "," + s + "\n")

Fit and Predict:
Accuracy for C:0.5 
(accuracy_score):0.9680055642496957
(f1_score): 0.9679298239468118
Accuracy for C:1 
(accuracy_score):0.9680055642496957
(f1_score): 0.9679298239468118
Accuracy for C:5 
(accuracy_score):0.9680055642496957
(f1_score): 0.9679298239468118
Accuracy for C:10 
(accuracy_score):0.9680055642496957
(f1_score): 0.9679298239468118


Best positive words:
('perfett', 4.315061287108056)
('eccellent', 4.122722642589093)
('fantast', 3.511618121202611)
('confortevol', 3.191626867207938)
('po', 3.187816105687944)

Best negative words:
('pessim', -4.904545308063406)
('sporc', -4.551094328473539)
('scars', -3.9961653985313177)
('scortes', -3.7267099986243837)
('vecc', -3.3512247498507683)


In [16]:
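# Union of the NLTK and stop_words Italian lists, stored as a set for fast lookup;
# "non" is kept in the text because it carries the negation
stopWords = set(stopwords.words('italian')) | set(get_stop_words('italian'))
stopWords.discard("non")

def removeStopWords(s):
    """Drop every stop word from a whitespace-separated string."""
    return ' '.join(word for word in s.split() if word not in stopWords)

def stem(s):
    """Stem every word with the module-level ItalianStemmer."""
    return ' '.join(stemmer.stem(word) for word in s.split())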