# Text Mining project - IMDB dataset

*Marco Donzella, Rebecca Picarelli*

## 1. Text Representation

The text data were represented in several ways, in order to identify the representation that extracts as much information as possible from them.

First, the training and test datasets are defined, together with their sentiment labels.

In [43]:
import itertools
from shutil import copyfile

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, auc, classification_report,
                             confusion_matrix, f1_score, fbeta_score,
                             precision_recall_curve, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [5]:
copyfile('/content/gdrive/MyDrive/Text Mining/TM_Project/test_neg.csv', 'test_neg.csv')
copyfile('/content/gdrive/MyDrive/Text Mining/TM_Project/test_pos.csv', 'test_pos.csv')
copyfile('/content/gdrive/MyDrive/Text Mining/TM_Project/train_neg.csv', 'train_neg.csv')
copyfile('/content/gdrive/MyDrive/Text Mining/TM_Project/train_pos.csv', 'train_pos.csv')

'train_pos.csv'

In [6]:
test_neg=pd.read_csv('test_neg.csv')
test_pos=pd.read_csv('test_pos.csv')
train_neg=pd.read_csv('train_neg.csv')
train_pos=pd.read_csv('train_pos.csv')

In [7]:
test_neg['sentiment']= False
test_pos['sentiment']= True
train_neg['sentiment']= False
train_pos['sentiment']= True

In [8]:
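# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
test = pd.concat([test_neg, test_pos]).reset_index(drop=True)
test.rename(columns={'0': 'review'}, inplace=True)  # the text column is named '0' in the CSV
test = test[['review', 'sentiment']]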

In [9]:
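# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
train = pd.concat([train_neg, train_pos]).reset_index(drop=True)
train.rename(columns={'0': 'review'}, inplace=True)  # the text column is named '0' in the CSV
train = train[['review', 'sentiment']]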

In [10]:
x_train=train['review']
y_train=train['sentiment']

In [11]:
x_test=test['review']
y_test=test['sentiment']

The following representations of the text data are now defined:
- Binary Vectorizer
- Count Vectorizer
- Count Vectorizer Bigram
- TF_IDF
- TF_IDF Bigram
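
To make the difference between the first, second and fourth representations concrete, here is a minimal sketch on a two-document toy corpus. The corpus and the variable names (`docs`, `binary_vec`, `count_vec`, `tfidf_vec`) are assumptions made for illustration and are not part of the IMDB data or of the notebook's own cells.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
docs = ["good good movie", "bad movie"]

binary_vec = CountVectorizer(binary=True)  # 1 if the term occurs, 0 otherwise
count_vec = CountVectorizer()              # raw term counts
tfidf_vec = TfidfVectorizer()              # counts reweighted by inverse document frequency

X_binary = binary_vec.fit_transform(docs).toarray()
X_count = count_vec.fit_transform(docs).toarray()
X_tfidf = tfidf_vec.fit_transform(docs).toarray()

# Columns follow the alphabetically sorted vocabulary
vocab = sorted(count_vec.vocabulary_)

print(vocab)
print(X_binary)
print(X_count)
print(X_tfidf.round(3))
```

In the first document the binary matrix records "good" as 1 while the count matrix records it as 2. With the default settings, TF-IDF additionally down-weights "movie", which appears in both documents, and scales each row to unit length.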

In [12]:
# Binary Vectorizer
bv = CountVectorizer(min_df=0., max_df=1. ,binary=True, max_features = 1000)
X_train_bv = bv.fit_transform(x_train) 
X_test_bv = bv.transform(x_test)

In [13]:
# Count Vectorizer
cv = CountVectorizer(min_df=0., max_df=1., max_features = 1000)
X_train_cv = cv.fit_transform(x_train) 
X_test_cv = cv.transform(x_test)

In [14]:
# Count Vectorizer Bigram
cv_bi = CountVectorizer(min_df=0., max_df=1., max_features = 1000, ngram_range=(1,2))
X_train_cv_bi = cv_bi.fit_transform(x_train) 
X_test_cv_bi = cv_bi.transform(x_test)
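
Note that `ngram_range=(1, 2)` keeps the unigrams and adds the bigrams; it does not switch to bigrams only. The sketch below shows this on a hypothetical two-document corpus (the documents and the names `uni` and `uni_bi` are assumptions made for illustration).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
docs = ["not good at all", "very good film"]

uni = CountVectorizer()                       # unigrams only (the default)
uni_bi = CountVectorizer(ngram_range=(1, 2))  # unigrams plus bigrams

uni.fit(docs)
uni_bi.fit(docs)

unigrams = sorted(uni.vocabulary_)
uni_and_bigrams = sorted(uni_bi.vocabulary_)

print(unigrams)
print(uni_and_bigrams)
```

The second vocabulary contains every unigram plus terms such as "not good". Because `max_features` keeps only the most frequent terms in the corpus, unigrams and bigrams compete for the same 1000 slots in the cells of this notebook, which may be one reason the bigram variants score so close to the unigram ones.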

In [15]:
# TF_IDF
tfidf_vect = TfidfVectorizer(min_df=0., max_df=1., use_idf=True, max_features = 1000)           
X_train_tfidf = tfidf_vect.fit_transform(x_train)
X_test_tfidf = tfidf_vect.transform(x_test)

In [16]:
# TF_IDF Bigram
tfidf_vect_bi = TfidfVectorizer(min_df=0., max_df=1., ngram_range=(1, 2), use_idf=True, max_features = 1000)
X_train_tfidf_bi = tfidf_vect_bi.fit_transform(x_train)
X_test_tfidf_bi = tfidf_vect_bi.transform(x_test)

## 2. Classification

In [46]:
def score_model(clf, x_train, x_test, y_train, y_test):
    train_score = clf.score(x_train, y_train) # Train Accuracy
    test_score = clf.score(x_test, y_test)    # Test Accuracy

    predictions = clf.predict(x_test)

    prec = precision_score(y_test, predictions)
    rec = recall_score(y_test, predictions)
    f1 = f1_score(y_test, predictions) # F1
    f2 = fbeta_score(y_test, predictions, beta=2) # F2
    cm = confusion_matrix(y_test, predictions)
    scores_strings = ["Train Accuracy", "Test Accuracy", "Test Precision",
                      "Test Recall", "Test F1", "Test F2"]

    scores = [train_score, test_score, prec, rec, f1, f2]

    print(("{:20s} {:.5f}\n"*6)[:-1].format(*itertools.chain(*zip(scores_strings, scores))))
    print("Classification report:")
    print(classification_report(y_test, predictions, digits=5))
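
`score_model` reports both F1 and F2. As a reminder of what `beta` does, the sketch below recomputes the F-beta score by hand from precision and recall and compares it with `fbeta_score`. The labels and the helper `f_beta` are hypothetical and only serve the illustration.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical labels: 3 true positives, 1 false negative, 2 false positives
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # 3 / 5
recall = recall_score(y_true, y_pred)        # 3 / 4


def f_beta(p, r, beta):
    """F-beta from precision p and recall r; beta > 1 weights recall more."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)


f2_manual = f_beta(precision, recall, beta=2)
f2_sklearn = fbeta_score(y_true, y_pred, beta=2)

print(f2_manual, f2_sklearn)
```

With `beta=2` recall counts more than precision, so F2 sits closer to recall than F1 does. Applying the same formula to the rounded precision and recall reported for Naive Bayes in section 2.1 (0.83518 and 0.83872) gives approximately 0.838, in line with the F2 printed there.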

### 2.1 Classification - Binary Vectorizer

In [47]:
clf = MultinomialNB()
clf.fit(X_train_bv, y_train)
score_model(clf, X_train_bv, X_test_bv, y_train, y_test)

Train Accuracy       0.84096
Test Accuracy        0.83660
Test Precision       0.83518
Test Recall          0.83872
Test F1              0.83695
Test F2              0.83801
Classification report:
              precision    recall  f1-score   support

       False    0.83803   0.83448   0.83625     12500
        True    0.83518   0.83872   0.83695     12500

    accuracy                        0.83660     25000
   macro avg    0.83661   0.83660   0.83660     25000
weighted avg    0.83661   0.83660   0.83660     25000



In [19]:
clf = LinearSVC(penalty='l1', loss='squared_hinge', dual=False) # best parameters for sparse matrices
clf.fit(X_train_bv, y_train)
score_model(clf, X_train_bv, X_test_bv, y_train, y_test)

Train Accuracy       0.87644
Test Accuracy        0.85576
Test Precision       0.84710
Test Recall          0.86824
Test F1              0.85754
Test F2              0.86393
Classification report:
              precision    recall  f1-score   support

       False    0.86487   0.84328   0.85394     12500
        True    0.84710   0.86824   0.85754     12500

    accuracy                        0.85576     25000
   macro avg    0.85598   0.85576   0.85574     25000
weighted avg    0.85598   0.85576   0.85574     25000



In [20]:
clf = LogisticRegression()
clf.fit(X_train_bv, y_train)
score_model(clf, X_train_bv, X_test_bv, y_train, y_test)

Train Accuracy       0.87680
Test Accuracy        0.85600
Test Precision       0.84929
Test Recall          0.86560
Test F1              0.85737
Test F2              0.86229
Classification report:
              precision    recall  f1-score   support

       False    0.86297   0.84640   0.85460     12500
        True    0.84929   0.86560   0.85737     12500

    accuracy                        0.85600     25000
   macro avg    0.85613   0.85600   0.85599     25000
weighted avg    0.85613   0.85600   0.85599     25000



In [21]:
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_bv, y_train)
score_model(clf, X_train_bv, X_test_bv, y_train, y_test)

Train Accuracy       1.00000
Test Accuracy        0.82660
Test Precision       0.82942
Test Recall          0.82232
Test F1              0.82585
Test F2              0.82373
Classification report:
              precision    recall  f1-score   support

       False    0.82383   0.83088   0.82734     12500
        True    0.82942   0.82232   0.82585     12500

    accuracy                        0.82660     25000
   macro avg    0.82662   0.82660   0.82660     25000
weighted avg    0.82662   0.82660   0.82660     25000



In [22]:
clf = AdaBoostClassifier(random_state=42)
clf.fit(X_train_bv, y_train)
score_model(clf, X_train_bv, X_test_bv, y_train, y_test)

Train Accuracy       0.80364
Test Accuracy        0.80092
Test Precision       0.78229
Test Recall          0.83392
Test F1              0.80728
Test F2              0.82306
Classification report:
              precision    recall  f1-score   support

       False    0.82218   0.76792   0.79413     12500
        True    0.78229   0.83392   0.80728     12500

    accuracy                        0.80092     25000
   macro avg    0.80224   0.80092   0.80070     25000
weighted avg    0.80224   0.80092   0.80070     25000



### 2.2 Classification - Count Vectorizer

In [23]:
clf = MultinomialNB()
clf.fit(X_train_cv, y_train)
score_model(clf, X_train_cv, X_test_cv, y_train, y_test)

Train Accuracy       0.83424
Test Accuracy        0.83016
Test Precision       0.83213
Test Recall          0.82720
Test F1              0.82966
Test F2              0.82818
Classification report:
              precision    recall  f1-score   support

       False    0.82822   0.83312   0.83066     12500
        True    0.83213   0.82720   0.82966     12500

    accuracy                        0.83016     25000
   macro avg    0.83017   0.83016   0.83016     25000
weighted avg    0.83017   0.83016   0.83016     25000



In [24]:
clf = LinearSVC(penalty='l1', loss='squared_hinge', dual=False) 
clf.fit(X_train_cv, y_train)
score_model(clf, X_train_cv, X_test_cv, y_train, y_test)

Train Accuracy       0.87384
Test Accuracy        0.85680
Test Precision       0.84855
Test Recall          0.86864
Test F1              0.85848
Test F2              0.86455
Classification report:
              precision    recall  f1-score   support

       False    0.86545   0.84496   0.85508     12500
        True    0.84855   0.86864   0.85848     12500

    accuracy                        0.85680     25000
   macro avg    0.85700   0.85680   0.85678     25000
weighted avg    0.85700   0.85680   0.85678     25000



In [25]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_cv, y_train)
score_model(clf, X_train_cv, X_test_cv, y_train, y_test)

Train Accuracy       0.87468
Test Accuracy        0.85712
Test Precision       0.85028
Test Recall          0.86688
Test F1              0.85850
Test F2              0.86351
Classification report:
              precision    recall  f1-score   support

       False    0.86423   0.84736   0.85571     12500
        True    0.85028   0.86688   0.85850     12500

    accuracy                        0.85712     25000
   macro avg    0.85726   0.85712   0.85711     25000
weighted avg    0.85726   0.85712   0.85711     25000



In [26]:
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_cv, y_train)
score_model(clf, X_train_cv, X_test_cv, y_train, y_test)

Train Accuracy       1.00000
Test Accuracy        0.82836
Test Precision       0.83034
Test Recall          0.82536
Test F1              0.82784
Test F2              0.82635
Classification report:
              precision    recall  f1-score   support

       False    0.82640   0.83136   0.82887     12500
        True    0.83034   0.82536   0.82784     12500

    accuracy                        0.82836     25000
   macro avg    0.82837   0.82836   0.82836     25000
weighted avg    0.82837   0.82836   0.82836     25000



In [27]:
clf = AdaBoostClassifier(random_state=42)
clf.fit(X_train_cv, y_train)
score_model(clf, X_train_cv, X_test_cv, y_train, y_test)

Train Accuracy       0.80476
Test Accuracy        0.80012
Test Precision       0.78367
Test Recall          0.82912
Test F1              0.80575
Test F2              0.81961
Classification report:
              precision    recall  f1-score   support

       False    0.81860   0.77112   0.79415     12500
        True    0.78367   0.82912   0.80575     12500

    accuracy                        0.80012     25000
   macro avg    0.80113   0.80012   0.79995     25000
weighted avg    0.80113   0.80012   0.79995     25000



### 2.3 Classification - Count Vectorizer Bigram

In [28]:
clf = MultinomialNB()
clf.fit(X_train_cv_bi, y_train)
score_model(clf, X_train_cv_bi, X_test_cv_bi, y_train, y_test)

Train Accuracy       0.83136
Test Accuracy        0.82604
Test Precision       0.82487
Test Recall          0.82784
Test F1              0.82635
Test F2              0.82724
Classification report:
              precision    recall  f1-score   support

       False    0.82722   0.82424   0.82573     12500
        True    0.82487   0.82784   0.82635     12500

    accuracy                        0.82604     25000
   macro avg    0.82604   0.82604   0.82604     25000
weighted avg    0.82604   0.82604   0.82604     25000



In [29]:
clf = LinearSVC(penalty='l1', loss='squared_hinge', dual=False) 
clf.fit(X_train_cv_bi, y_train)
score_model(clf, X_train_cv_bi, X_test_cv_bi, y_train, y_test)

Train Accuracy       0.87336
Test Accuracy        0.85600
Test Precision       0.84738
Test Recall          0.86840
Test F1              0.85776
Test F2              0.86411
Classification report:
              precision    recall  f1-score   support

       False    0.86505   0.84360   0.85419     12500
        True    0.84738   0.86840   0.85776     12500

    accuracy                        0.85600     25000
   macro avg    0.85622   0.85600   0.85598     25000
weighted avg    0.85622   0.85600   0.85598     25000



In [30]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_cv_bi, y_train)
score_model(clf, X_train_cv_bi, X_test_cv_bi, y_train, y_test)

Train Accuracy       0.87496
Test Accuracy        0.85584
Test Precision       0.84903
Test Recall          0.86560
Test F1              0.85723
Test F2              0.86223
Classification report:
              precision    recall  f1-score   support

       False    0.86292   0.84608   0.85442     12500
        True    0.84903   0.86560   0.85723     12500

    accuracy                        0.85584     25000
   macro avg    0.85598   0.85584   0.85583     25000
weighted avg    0.85598   0.85584   0.85583     25000



In [48]:
clf = RandomForestClassifier()
clf.fit(X_train_cv_bi, y_train)
score_model(clf, X_train_cv_bi, X_test_cv_bi, y_train, y_test)

Train Accuracy       1.00000
Test Accuracy        0.82720
Test Precision       0.82678
Test Recall          0.82784
Test F1              0.82731
Test F2              0.82763
Classification report:
              precision    recall  f1-score   support

       False    0.82762   0.82656   0.82709     12500
        True    0.82678   0.82784   0.82731     12500

    accuracy                        0.82720     25000
   macro avg    0.82720   0.82720   0.82720     25000
weighted avg    0.82720   0.82720   0.82720     25000



In [32]:
clf = AdaBoostClassifier(random_state=42)
clf.fit(X_train_cv_bi, y_train)
score_model(clf, X_train_cv_bi, X_test_cv_bi, y_train, y_test)

Train Accuracy       0.80432
Test Accuracy        0.80004
Test Precision       0.78432
Test Recall          0.82768
Test F1              0.80542
Test F2              0.81863
Classification report:
              precision    recall  f1-score   support

       False    0.81760   0.77240   0.79436     12500
        True    0.78432   0.82768   0.80542     12500

    accuracy                        0.80004     25000
   macro avg    0.80096   0.80004   0.79989     25000
weighted avg    0.80096   0.80004   0.79989     25000



### 2.4 Classification - TF_IDF

In [33]:
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)
score_model(clf, X_train_tfidf, X_test_tfidf, y_train, y_test)

Train Accuracy       0.83968
Test Accuracy        0.83476
Test Precision       0.83160
Test Recall          0.83952
Test F1              0.83554
Test F2              0.83792
Classification report:
              precision    recall  f1-score   support

       False    0.83798   0.83000   0.83397     12500
        True    0.83160   0.83952   0.83554     12500

    accuracy                        0.83476     25000
   macro avg    0.83479   0.83476   0.83476     25000
weighted avg    0.83479   0.83476   0.83476     25000



In [34]:
clf = LinearSVC()
clf.fit(X_train_tfidf, y_train)
score_model(clf, X_train_tfidf, X_test_tfidf, y_train, y_test)

Train Accuracy       0.87500
Test Accuracy        0.85880
Test Precision       0.85365
Test Recall          0.86608
Test F1              0.85982
Test F2              0.86357
Classification report:
              precision    recall  f1-score   support

       False    0.86410   0.85152   0.85776     12500
        True    0.85365   0.86608   0.85982     12500

    accuracy                        0.85880     25000
   macro avg    0.85888   0.85880   0.85879     25000
weighted avg    0.85888   0.85880   0.85879     25000



In [35]:
clf = LogisticRegression()
clf.fit(X_train_tfidf, y_train)
score_model(clf, X_train_tfidf, X_test_tfidf, y_train, y_test)

Train Accuracy       0.87360
Test Accuracy        0.86048
Test Precision       0.85436
Test Recall          0.86912
Test F1              0.86168
Test F2              0.86613
Classification report:
              precision    recall  f1-score   support

       False    0.86682   0.85184   0.85926     12500
        True    0.85436   0.86912   0.86168     12500

    accuracy                        0.86048     25000
   macro avg    0.86059   0.86048   0.86047     25000
weighted avg    0.86059   0.86048   0.86047     25000



In [36]:
clf = RandomForestClassifier()
clf.fit(X_train_tfidf, y_train)
score_model(clf, X_train_tfidf, X_test_tfidf, y_train, y_test)

Train Accuracy       1.00000
Test Accuracy        0.82920
Test Precision       0.83047
Test Recall          0.82728
Test F1              0.82887
Test F2              0.82792
Classification report:
              precision    recall  f1-score   support

       False    0.82794   0.83112   0.82953     12500
        True    0.83047   0.82728   0.82887     12500

    accuracy                        0.82920     25000
   macro avg    0.82920   0.82920   0.82920     25000
weighted avg    0.82920   0.82920   0.82920     25000



In [37]:
clf = AdaBoostClassifier()
clf.fit(X_train_tfidf, y_train)
score_model(clf, X_train_tfidf, X_test_tfidf, y_train, y_test)

Train Accuracy       0.80472
Test Accuracy        0.80068
Test Precision       0.78257
Test Recall          0.83272
Test F1              0.80687
Test F2              0.82218
Classification report:
              precision    recall  f1-score   support

       False    0.82127   0.76864   0.79408     12500
        True    0.78257   0.83272   0.80687     12500

    accuracy                        0.80068     25000
   macro avg    0.80192   0.80068   0.80048     25000
weighted avg    0.80192   0.80068   0.80048     25000



### 2.5 Classification - TF_IDF Bigram

In [38]:
clf = MultinomialNB()
clf.fit(X_train_tfidf_bi, y_train)
score_model(clf, X_train_tfidf_bi, X_test_tfidf_bi, y_train, y_test)

Train Accuracy       0.83676
Test Accuracy        0.83236
Test Precision       0.82439
Test Recall          0.84464
Test F1              0.83439
Test F2              0.84051
Classification report:
              precision    recall  f1-score   support

       False    0.84073   0.82008   0.83028     12500
        True    0.82439   0.84464   0.83439     12500

    accuracy                        0.83236     25000
   macro avg    0.83256   0.83236   0.83233     25000
weighted avg    0.83256   0.83236   0.83233     25000



In [39]:
clf = LinearSVC()
clf.fit(X_train_tfidf_bi, y_train)
score_model(clf, X_train_tfidf_bi, X_test_tfidf_bi, y_train, y_test)

Train Accuracy       0.87592
Test Accuracy        0.85816
Test Precision       0.85269
Test Recall          0.86592
Test F1              0.85925
Test F2              0.86324
Classification report:
              precision    recall  f1-score   support

       False    0.86381   0.85040   0.85705     12500
        True    0.85269   0.86592   0.85925     12500

    accuracy                        0.85816     25000
   macro avg    0.85825   0.85816   0.85815     25000
weighted avg    0.85825   0.85816   0.85815     25000



In [40]:
clf = LogisticRegression()
clf.fit(X_train_tfidf_bi, y_train)
score_model(clf, X_train_tfidf_bi, X_test_tfidf_bi, y_train, y_test)

Train Accuracy       0.87272
Test Accuracy        0.85976
Test Precision       0.85337
Test Recall          0.86880
Test F1              0.86102
Test F2              0.86567
Classification report:
              precision    recall  f1-score   support

       False    0.86638   0.85072   0.85848     12500
        True    0.85337   0.86880   0.86102     12500

    accuracy                        0.85976     25000
   macro avg    0.85988   0.85976   0.85975     25000
weighted avg    0.85988   0.85976   0.85975     25000



In [41]:
clf = RandomForestClassifier()
clf.fit(X_train_tfidf_bi, y_train)
score_model(clf, X_train_tfidf_bi, X_test_tfidf_bi, y_train, y_test)

Train Accuracy       1.00000
Test Accuracy        0.83172
Test Precision       0.83191
Test Recall          0.83144
Test F1              0.83167
Test F2              0.83153
Classification report:
              precision    recall  f1-score   support

       False    0.83153   0.83200   0.83177     12500
        True    0.83191   0.83144   0.83167     12500

    accuracy                        0.83172     25000
   macro avg    0.83172   0.83172   0.83172     25000
weighted avg    0.83172   0.83172   0.83172     25000



In [42]:
clf = AdaBoostClassifier()
clf.fit(X_train_tfidf_bi, y_train)
score_model(clf, X_train_tfidf_bi, X_test_tfidf_bi, y_train, y_test)

Train Accuracy       0.80576
Test Accuracy        0.80164
Test Precision       0.78048
Test Recall          0.83936
Test F1              0.80885
Test F2              0.82688
Classification report:
              precision    recall  f1-score   support

       False    0.82625   0.76392   0.79386     12500
        True    0.78048   0.83936   0.80885     12500

    accuracy                        0.80164     25000
   macro avg    0.80337   0.80164   0.80136     25000
weighted avg    0.80337   0.80164   0.80136     25000
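
The cells in this section repeat the same fit-and-score pattern for every representation and classifier. One possible way to collect such results in a single table is sketched below. It runs on a tiny hypothetical corpus with only two representations and two classifiers, so its numbers say nothing about the IMDB results above; all names in it (`train_docs`, `vectorizers`, `classifiers`, `summary`) are assumptions made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, used only to make the sketch runnable
train_docs = ["great film", "wonderful acting", "terrible plot",
              "awful film", "great acting", "awful plot"]
train_labels = [True, True, False, False, True, False]
test_docs = ["wonderful film", "terrible acting"]
test_labels = [True, False]

vectorizers = {"Count": CountVectorizer(), "TF_IDF": TfidfVectorizer()}
classifiers = {"MultinomialNB": MultinomialNB,
               "LogisticRegression": LogisticRegression}

rows = []
for rep_name, vec in vectorizers.items():
    # Fit the vocabulary on the training set only, then reuse it on the test set
    X_tr = vec.fit_transform(train_docs)
    X_te = vec.transform(test_docs)
    for clf_name, make_clf in classifiers.items():
        clf = make_clf().fit(X_tr, train_labels)
        rows.append({"representation": rep_name,
                     "classifier": clf_name,
                     "test_accuracy": clf.score(X_te, test_labels)})

# One row per classifier, one column per representation
summary = pd.DataFrame(rows).pivot(index="classifier",
                                   columns="representation",
                                   values="test_accuracy")
print(summary)
```

Applied to the notebook's own matrices, the same loop would take the five fitted representations and the five classifiers in place of the toy dictionaries.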

