# Text Mining Project Work (Group 5)

**Text Classification and Sentiment Analysis**

_Prof. Gianluca Moro, Dott. Ing. Nicola Piscaglia – DISI, University of Bologna_

**Bologna Business School** - Alma Mater Studiorum Università di Bologna

## Instructions
- The provided exercises must be completed by the students of Group 5.
- At the end, the file must contain all the required results (as code cell outputs) along with all the commands necessary to reproduce them.
- The function of every command or group of related commands must be documented clearly and concisely.
- The submission deadline is 1 July 2022.
- When finished, one team member will send the notebook file (with the .ipynb extension) by email (using your BBS email account) to the teacher (nicola.piscaglia@bbs.unibo.it) with “[BBS Teamwork] Your last names” as the subject, keeping a copy of the file for safety.
- You are allowed to consult the teaching material and to search the Web for quick reference.
- If still in doubt about anything, ask the teacher.
- Communicating with other teams is strictly NOT allowed. Ask the teacher for any clarification about the exercises.
- Each correctly developed point counts 2/30.

## Setup

The following cell contains the necessary imports.

In [68]:
import numpy as np
import pandas as pd
import gzip
import json
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import os
from urllib.request import urlretrieve
from statsmodels.stats.contingency_tables import mcnemar

Run the following cells to download the necessary files.

In [69]:
def download(file, url):
    if not os.path.exists(file):
        urlretrieve(url, file)

In [70]:
download("Gift_Cards.json.gz", "https://www.dropbox.com/s/c87cjds263jt3sb/Gift_Cards.json.gz?dl=1")

In [71]:
download("Magazine_Subscriptions.json.gz", "https://www.dropbox.com/s/g6om8q8c8pvirw8/Magazine_Subscriptions.json.gz?dl=1")

In [72]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /Users/gioel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Exercises

1) The `Gift_Cards.json.gz` file contains a dataset of reviews posted on Amazon.com about gift card products.
Each review is labeled with a score between 1 and 5 stars (the `overall` feature).

The text of each review is stored in the `reviewText` feature, which is our input data along with `overall`.

Load 100000 random reviews into a new Pandas DataFrame.

In [73]:
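# Read the gzip-compressed JSON Lines file (one review per line);
# the name must match the downloaded file exactly on case-sensitive file systems
df = pd.read_json('Gift_Cards.json.gz', lines=True, compression='gzip')
# Draw 100000 reviews at random; reset_index keeps the original row ids in an `index` column
df = df.sample(n=100000)
df = df.reset_index()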
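
As a rough illustration of what `lines=True` with gzip compression means, the sketch below reads a JSON Lines archive using only the `gzip` and `json` modules imported in the setup cell. The helper `read_json_lines`, the toy records, and the temporary file are hypothetical stand-ins and not part of the assignment data.

```python
import gzip
import json
import os
import random
import tempfile


def read_json_lines(path):
    """Return one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Build a tiny stand-in archive (hypothetical records, not the Amazon reviews).
toy_reviews = [{"overall": stars, "reviewText": f"review {stars}"} for stars in range(1, 6)]
toy_path = os.path.join(tempfile.mkdtemp(), "toy_reviews.json.gz")
with gzip.open(toy_path, "wt", encoding="utf-8") as handle:
    for review in toy_reviews:
        handle.write(json.dumps(review) + "\n")

records = read_json_lines(toy_path)
# A fixed seed makes the random sample reproducible across runs.
sample = random.Random(42).sample(records, k=3)
print(len(records), len(sample))
```

In the notebook itself, passing a `random_state` to `df.sample` would play the same role as the fixed seed here.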


2) Print the number of rows in the dataset and visualize the first 5 rows.

In [74]:
print(len(df))  # number of rows in the sampled dataset
df.head()       # first 5 rows

| | index | overall | vote | verified | reviewTime | reviewerID | asin | style | reviewerName | reviewText | summary | unixReviewTime | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 92033 | 5 | | True | 01 15, 2014 | A19K3K40J7WI4O | B00G4IWEZG | | Esdownunder | A joyful Christmas design suitable for anyone ... | Delightful Christmas colour | 1389744000 | |
| 1 | 11370 | 2 | | True | 06 25, 2014 | A1CHRP1FUC4PO3 | B004LLILK0 | | Gina Bergman | Sent this to my older father...it ended up in ... | A little disappointed | 1403654400 | |
| 2 | 48658 | 5 | | True | 12 29, 2014 | AS9RDPAVLR8GH | B007V6ETXA | | LUIS LOZANO | Transaccion exitosa | Five Stars | 1419811200 | |
| 3 | 70771 | 4 | | True | 06 15, 2015 | A2W770L2WMUXKJ | B00BWDH0LQ | | Dorothy Suz | Well received by recipient | Four Stars | 1434326400 | |
| 4 | 45036 | 5 | | True | 09 13, 2016 | A1M4612IHHCC9K | B006PJHPV2 | | vincent pizzo | Got a good meal at a good price. Card made it ... | Five Stars | 1473724800 | |


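3) Undersample the data by the `overall` feature to obtain a class-balanced dataset.

In [75]:
# Randomly drop reviews from the larger classes until every star rating
# has as many reviews as the least frequent one
rus = RandomUnderSampler(random_state=42)
df,_ = rus.fit_resample(df, df["overall"])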
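
To make the effect of the undersampling step concrete, here is a minimal standard-library sketch of the same idea: group the rows by class, then sample each group down to the size of the smallest one. The function `undersample` and the toy rows are hypothetical; this is not how imbalanced-learn is implemented internally, only an illustration of the outcome.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, key, seed=42):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[key]].append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for label in sorted(by_class):
        # Sampling without replacement keeps each retained review unique.
        balanced.extend(rng.sample(by_class[label], smallest))
    return balanced


# Hypothetical, deliberately imbalanced toy data.
toy_rows = [{"overall": 5, "reviewText": f"great {i}"} for i in range(6)]
toy_rows += [{"overall": 1, "reviewText": f"bad {i}"} for i in range(2)]
toy_rows += [{"overall": 3, "reviewText": f"ok {i}"} for i in range(3)]

balanced = undersample(toy_rows, key="overall")
counts = Counter(row["overall"] for row in balanced)
print(counts)
```

Every class ends up with the size of the minority class, which matches the equal per-class counts reported by `value_counts()` in step 6.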

4) Cast the `reviewText` column to Unicode strings.

In [76]:
# Convert every review to a Unicode string; missing values become literal strings such as 'nan'
df["reviewText"] = df["reviewText"].values.astype('U')

**5)** Select only the `reviewText` and `overall` features from the data and put them in a DataFrame.

In [77]:
df = df[['reviewText', 'overall']]

**6)** Check the distribution of the star ratings.

In [78]:
df["overall"].value_counts()

1    1107
2    1107
3    1107
4    1107
5    1107
Name: overall, dtype: int64

**7)** Remove from the dataframe the reviews rated with 3 stars.

In [79]:
# Keep only the reviews whose rating is not 3 stars
df = df[df['overall'] != 3]

In [80]:
df['overall'].unique()

array([1, 2, 4, 5])

**8)** Add a `label` column to the DataFrame whose value is `"pos"` for reviews with 4 or 5 stars and `"neg"` for reviews with 1 or 2 stars.

In [81]:
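def mapToLabel(value):
    # 4-5 stars -> "pos", 1-2 stars -> "neg" (3-star reviews were removed in step 7)
    if value >= 4:
        return "pos"
    elif value <= 2:
        return "neg"

# Derive the sentiment label from the star rating
df["label"] = df["overall"].apply(mapToLabel)

df.head()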

| | reviewText | overall | label |
|---|---|---|---|
| 0 | The products are awful, worst green green I h... | 1 | neg |
| 1 | I'm located in Canada, and I purchased this on... | 1 | neg |
| 2 | The print option makes the gift card seem a bi... | 1 | neg |
| 3 | couldn't apply it to a kindle when we were tol... | 1 | neg |
| 4 | it REALLY REALLY does not work | 1 | neg |


**9)** Randomly split the dataset into a training set with 80% of the data and a test set with the remaining 20%, stratifying the split by the `label` variable.

In [82]:
y = df["label"]
X = df["reviewText"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("training set shape: " + str(X_train.shape))
print("Test set shape: " + str(X_test.shape))

training set shape: (3542,)
Test set shape: (886,)


**10)** Create a TF-IDF vector space model from the training reviews, excluding words that appear in fewer than 7 documents and using only unigrams. Then, extract the document-term matrix for them.

In [83]:
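# Learn the vocabulary and IDF weights from the training reviews only
vect = TfidfVectorizer(min_df=7, ngram_range=(1,1))
train_dtm = vect.fit_transform(X_train)

In [84]:
# Project the test reviews onto the vocabulary learned from the training set
test_dtm = vect.transform(X_test)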
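
With its default settings, `TfidfVectorizer` weights a term $t$ in a corpus of $n$ documents by its raw count times $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, then scales each document vector to unit Euclidean length. The sketch below reproduces that weighting with the standard library on three hypothetical documents. The function `tfidf_fit_transform` is an illustrative name, and the plain whitespace tokenizer is a simplification of scikit-learn's regular-expression tokenizer.

```python
import math
from collections import Counter


def tfidf_fit_transform(docs, min_df=1):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in docs]  # simplified tokenizer
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # min_df drops terms that appear in too few documents.
    vocab = sorted(term for term, freq in doc_freq.items() if freq >= min_df)
    n_docs = len(docs)
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows


toy_docs = ["great gift card", "bad card", "great great gift"]
vocab, matrix = tfidf_fit_transform(toy_docs, min_df=2)
print(vocab)
for row in matrix:
    print([round(weight, 3) for weight in row])
```

Here `bad` occurs in a single document, so `min_df=2` removes it from the vocabulary, the same way `min_df=7` prunes rare words in the notebook.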

**11)** Train a Support Vector Machine of your choice on the training reviews with a regularization parameter equal to 5, using the representation created above.

In [91]:
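from sklearn import preprocessing

# Encode the string labels as integers ("neg" -> 0, "pos" -> 1);
# fit on the training labels and reuse the same mapping for the test labels
le = preprocessing.LabelEncoder()
le.fit(y_train)
y_train = le.transform(y_train)
y_test = le.transform(y_test)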



In [92]:
# Linear-kernel SVM with regularization parameter C = 5
clf = SVC(C=5, kernel='linear')
clf.fit(train_dtm, y_train)


**12)** Verify the accuracy of the classifier on the test set and try to maximize it by tuning the Support Vector Machine kernel and regularization factor.

In [94]:
clf.score(test_dtm, y_test)

0.8408577878103838

In [98]:
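from sklearn.model_selection import GridSearchCV

# Candidate regularization strengths and kernels
params = {
    'C': [1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid']
}
# 5-fold cross-validated grid search on the training set, scored by accuracy
grid = GridSearchCV(SVC(), params, scoring='accuracy', cv=5)
grid.fit(train_dtm, y_train)
params = grid.best_params_  # best (C, kernel) combination found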

In [101]:
# Retrain an SVM on the full training set with the best parameters found
clf = SVC(C=params['C'], kernel=params['kernel'])
clf.fit(train_dtm, y_train)

In [102]:
clf.score(test_dtm, y_test)

0.871331828442438
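
The two test accuracies above (about 0.841 and 0.871) are measured on the same test reviews, so a paired test such as McNemar's can indicate whether the gap is larger than chance would explain. The setup cell imports `mcnemar` from statsmodels, but this part of the notebook never calls it. The sketch below is a standard-library version of the exact (binomial) form of the test; it should correspond to the exact variant offered by statsmodels, though that correspondence is an assumption here. The function `mcnemar_exact` and the prediction lists are hypothetical and are not the notebook's results.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on the cases where the two models disagree."""
    triples = list(zip(y_true, pred_a, pred_b))
    only_a = sum(a == t and b != t for t, a, b in triples)  # A right, B wrong
    only_b = sum(b == t and a != t for t, a, b in triples)  # B right, A wrong
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # the models never disagree
    # Under the null hypothesis each disagreement favours A or B with probability 0.5.
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Hypothetical predictions for 12 test items (not taken from the notebook).
y_true = [1] * 12
pred_a = [1] + [0] * 9 + [1, 0]  # model A is right alone on 1 item
pred_b = [0] + [1] * 9 + [1, 0]  # model B is right alone on 9 items

only_a, only_b, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(only_a, only_b, p_value)  # 1 9 0.021484375
```

Applied to the notebook, `pred_a` and `pred_b` would be the test-set predictions of the linear SVM and of the tuned SVM, and `y_true` would be `y_test`.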

**13)** Train a Deep Learning model (excluding transformer-based models like BERT) using the document-term representation built in point 10. The usage of recurrent layers is up to you.

14) Evaluate the model by calculating the accuracy on test data. Try to maximize the model accuracy by tuning the neural network.
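
One possible shape for points 13 and 14 is a small feed-forward network that takes TF-IDF rows as input. A document-term matrix carries no word order, so recurrent layers have little to exploit on this representation. The sketch below is only an illustration under stated assumptions: NumPy stands in for a framework such as Keras or PyTorch, the functions `train_mlp` and `predict` are hypothetical names, and the toy matrix replaces the dense form of `train_dtm`. It is not a reference solution.

```python
import numpy as np


def train_mlp(X, y, hidden=8, lr=0.3, epochs=2000, seed=0):
    """Train a one-hidden-layer network with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, size=hidden)
    b2 = 0.0
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                  # hidden activations
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # predicted P(label = 1)
        grad_logit = (p - y) / len(y)             # gradient of mean cross-entropy
        grad_h = np.outer(grad_logit, W2) * (1.0 - h ** 2)
        W2 -= lr * (h.T @ grad_logit)
        b2 -= lr * grad_logit.sum()
        W1 -= lr * (X.T @ grad_h)
        b1 -= lr * grad_h.sum(axis=0)
    return W1, b1, W2, b2


def predict(params, X):
    """Return 0/1 predictions by thresholding the output logit at zero."""
    W1, b1, W2, b2 = params
    return (np.tanh(X @ W1 + b1) @ W2 + b2 > 0).astype(int)


# Toy stand-in for a dense TF-IDF matrix: columns 0-1 mimic "positive" terms,
# columns 2-3 mimic "negative" terms (hypothetical data).
X_toy = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.7, 0.7],
])
y_toy = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

mlp_params = train_mlp(X_toy, y_toy)
accuracy = float((predict(mlp_params, X_toy) == y_toy).mean())
print(f"training accuracy on the toy data: {accuracy:.2f}")
```

The accuracy printed here is measured on the toy training rows only. For point 14 the same comparison would be made between predictions on `test_dtm` and `y_test`, with the hidden size, learning rate, and number of epochs as the tuning knobs.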

15) Evaluate the trained DL model on 50000 random reviews from the dataset in the `Magazine_Subscriptions.json.gz` file.

Hint: you have to repeat the preprocessing steps applied earlier to the Gift Cards reviews.
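
One way to follow the hint is to wrap steps 4 to 8 in a single function so that the Magazine Subscriptions reviews go through exactly the same transformations. The helper `prepare_reviews` and the toy frame below are hypothetical, a sketch assuming the new dataset has the same `reviewText` and `overall` columns.

```python
import pandas as pd


def prepare_reviews(raw):
    """Repeat steps 4-8: cast text, keep two columns, drop 3 stars, add labels."""
    out = raw[["reviewText", "overall"]].copy()
    out["reviewText"] = out["reviewText"].to_numpy().astype("U")
    out = out[out["overall"] != 3].reset_index(drop=True)
    out["label"] = out["overall"].apply(lambda stars: "pos" if stars >= 4 else "neg")
    return out


# Hypothetical rows standing in for the Magazine Subscriptions reviews.
toy = pd.DataFrame({
    "overall": [5, 3, 1, 4],
    "reviewText": ["Loved it", "Average", "Never arrived", "Good value"],
    "verified": [True, True, False, True],
})
prepared = prepare_reviews(toy)
print(prepared)
```

After this step the texts should be passed through the already fitted vectorizer with `vect.transform` and the labels through `le.transform`, not refitted, so that the new document-term matrix has the same columns and label encoding the model was trained on.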

16) Extra: train/fine-tune a transformer-based model (e.g. BERT) on Gift Cards training reviews and evaluate it on the Gift Cards test reviews.