# Text Mining Project Work (Group 5)

**Text Classification and Sentiment Analysis**

_Prof. Gianluca Moro, Dott. Ing. Nicola Piscaglia – DISI, University of Bologna_

**Bologna Business School** - Alma Mater Studiorum Università di Bologna

## Instructions
- The provided exercises must be completed by the students of Group 5.
- At the end, the file must contain all the required results (as code cell outputs) along with all the commands necessary to reproduce them.
- The function of every command or group of related commands must be documented clearly and concisely.
- The submission deadline is the 1st July 2022.
- When finished, one team member will send the notebook file (with the .ipynb extension) by email (using your BBS email account) to the teacher (nicola.piscaglia@bbs.unibo.it) with “[BBS Teamwork] Your last names” as the subject, and keep a copy of the file for safety.
- You are allowed to consult the teaching material and to search the Web for quick reference. 
- If still in doubt about anything, ask the teacher.
- Communicating with other teams is strictly NOT allowed. Ask the teacher for any clarification about the exercises.
- Each correctly developed point counts 2/30.

## Setup

The following cell contains the necessary imports.

In [5]:
import numpy as np
import pandas as pd
import gzip
import json
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import os
from urllib.request import urlretrieve
from statsmodels.stats.contingency_tables import mcnemar

Run the following cells to download the necessary files.

In [6]:
def download(file, url):
    if not os.path.exists(file):
        urlretrieve(url, file)

In [7]:
download("Gift_Cards.json.gz", "https://www.dropbox.com/s/c87cjds263jt3sb/Gift_Cards.json.gz?dl=1")

In [8]:
download("Magazine_Subscriptions.json.gz", "https://www.dropbox.com/s/g6om8q8c8pvirw8/Magazine_Subscriptions.json.gz?dl=1")

In [9]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /Users/gioel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Exercises

1) The `Gift_Cards.json.gz` file contains a dataset of reviews posted on Amazon.com about gift card products. 
Each review is labeled with a score between 1 and 5 stars (the ```overall``` feature).

The text of each review is stored in the ```reviewText``` feature, which is our input data along with ```overall```.

Load 100000 random reviews into a new Pandas dataframe.

In [10]:
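# Read the gzipped JSON Lines file; the name must match the downloaded file exactly
# (file names are case-sensitive on Linux)
df = pd.read_json('Gift_Cards.json.gz', lines=True, compression='gzip')
# Draw 100000 reviews at random
df = df.sample(n=100000)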
df = df.reset_index()
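
Because `sample` is called without a seed, every run of the notebook draws a different subset, so the counts and accuracies reported later will vary slightly. A minimal sketch of a repeatable draw follows, on a hypothetical toy table (the `reviews` frame and the seed value 42 are assumptions, not part of the original solution). It also uses `reset_index(drop=True)`, which avoids carrying the old row positions along as an extra `index` column.

```python
import pandas as pd

# Toy stand-in for the review table (hypothetical data, not the Amazon file).
reviews = pd.DataFrame({
    "overall": [5, 4, 1, 3, 5, 2, 5, 1],
    "reviewText": [f"review {i}" for i in range(8)],
})

# A fixed random_state makes the random draw repeatable across runs.
sample_a = reviews.sample(n=5, random_state=42).reset_index(drop=True)
sample_b = reviews.sample(n=5, random_state=42).reset_index(drop=True)

print(sample_a.equals(sample_b))  # both draws contain the same rows
print(list(sample_a.columns))     # no extra "index" column is added
```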


2) Print the number of rows in the dataset and display the first 5 rows.

In [11]:
print("Number of rows:", len(df))  # row count requested by the exercise
df.head()

Unnamed: 0,index,overall,vote,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,image
0,31463,5,,True,"01 7, 2013",A1YO5TU01S6XBL,B004RD9ACA,,Lori S,Very easy to do and nice that you can put it i...,gift card,1357516800,
1,101270,5,,True,"01 8, 2015",A151SOYKNBCXWV,B00ISCEAMG,,Hawaiian Sister,I ordered this because it was free & who doesn...,Good idea,1420675200,
2,19910,5,,True,"04 6, 2018",A256L5K68UXBYH,B004LLIKVU,{'Gift Amount:': ' 50'},kate bujor,An easy way to send a gift to relatives abroad...,So easy and hassle-free,1522972800,
3,103924,5,,True,"10 10, 2016",A2FB181U8NTPPU,B00JDQJZWG,{'Gift Amount:': ' 75'},Murielle,Everyone loved the cute tin for our grand daug...,Five Stars,1476057600,
4,42243,1,,True,"09 16, 2015",A2R82GWR3MUL6K,B0066AZGD4,,Learning Ray,I wanted to buy a S$100 card but ended up buyi...,I wanted to buy a S$100 card but ended up ...,1442361600,


3) Undersample the data by the `overall` feature to obtain a class-balanced dataset.



In [12]:
rus = RandomUnderSampler(random_state=42)
df,_ = rus.fit_resample(df, df["overall"])
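
Random undersampling keeps, for every class, as many rows as the rarest class has. The same idea can be sketched with pandas alone, which may help when checking what `RandomUnderSampler` does. The `ratings` frame below is hypothetical toy data, and this is an illustrative equivalent rather than the implementation used by imbalanced-learn.

```python
import pandas as pd

# Toy, imbalanced ratings (hypothetical data): six 5-star, three 4-star, two 1-star.
ratings = pd.DataFrame({
    "overall": [5] * 6 + [4] * 3 + [1] * 2,
    "reviewText": [f"text {i}" for i in range(11)],
})

# Size of the rarest class: every class is cut down to this many rows.
smallest = ratings["overall"].value_counts().min()

# Draw `smallest` rows at random from each star rating.
balanced = (
    ratings.groupby("overall")
    .sample(n=smallest, random_state=42)
    .reset_index(drop=True)
)

counts = balanced["overall"].value_counts().to_dict()
print(counts)  # every rating now appears exactly twice
```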

4) Cast the `reviewText` column to Unicode strings.



In [13]:
df["reviewText"] = df["reviewText"].values.astype('U')
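
One side effect of this cast is worth knowing: a missing review (`NaN`) becomes the literal string `"nan"`, which the vectorizer would later treat as an ordinary word. The sketch below illustrates this on a hypothetical two-row column and shows one possible alternative, filling missing values with an empty string first. Whether that matters for this dataset is not checked here.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing review (object dtype, as read from JSON).
texts = pd.Series(["great card", np.nan], dtype=object)

# Casting directly turns the missing value into the literal string "nan".
cast_directly = texts.values.astype("U")

# Filling missing values first keeps them as empty strings instead.
cast_after_fill = texts.fillna("").values.astype("U")

print(cast_directly.tolist())    # ['great card', 'nan']
print(cast_after_fill.tolist())  # ['great card', '']
```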

**5)** Select only the ```reviewText``` and ```overall``` features from the data and put them in a dataframe.





In [14]:
df = df[['reviewText', 'overall']]

**6)** Verify the distribution of the number of stars

In [15]:
df["overall"].value_counts()

1    1103
2    1103
3    1103
4    1103
5    1103
Name: overall, dtype: int64

**7)** Remove from the dataframe the reviews rated with 3 stars.

In [16]:
df = df[df['overall']!= 3]

In [17]:
df['overall'].unique()

array([1, 2, 4, 5])

**8)** Add a `label` column to the DataFrame whose value is `"pos"` for reviews with 4 or 5 stars and `"neg"` for reviews with 1 or 2 stars.

In [18]:
def mapToLabel(value):
  if value >= 4:
    return "pos"
  elif value <= 2:
    return "neg"

df["label"] = df["overall"].apply(mapToLabel)

df.head() 

Unnamed: 0,reviewText,overall,label
0,DOESNT ACTIVATE would never by one again. AMAZ...,1,neg
1,The box is ripped and looks used. Embarrassin...,1,neg
2,My son did not receive this gift from me.,1,neg
3,My daughter wanted to buy an app for my grandd...,1,neg
4,Received an empty envelope! Still trying to fi...,1,neg
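
The same star-to-label mapping can also be written as a dictionary lookup, which makes the rule explicit at a glance. This is only an alternative sketch on a hypothetical `stars` series; the names `star_to_label` and `toy_labels` are illustrative.

```python
import pandas as pd

# Hypothetical ratings; 3-star reviews are assumed to be already removed.
stars = pd.Series([1, 2, 4, 5])

# Explicit lookup table: 1-2 stars are negative, 4-5 stars are positive.
star_to_label = {1: "neg", 2: "neg", 4: "pos", 5: "pos"}

# Series.map replaces each rating with its label (unmapped values become NaN).
toy_labels = stars.map(star_to_label)

print(toy_labels.tolist())  # ['neg', 'neg', 'pos', 'pos']
```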


**9)** Split the dataset randomly into a training set with 80% of data and a test set with the remaining 20%, stratifying the split by the `label` variable

In [19]:
y = df["label"]
X = df["reviewText"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("training set shape: " + str(X_train.shape))
print("Test set shape: " + str(X_test.shape))

training set shape: (3529,)
Test set shape: (883,)
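
Stratifying by the label means the training and test sets keep the same share of `pos` and `neg` reviews as the full dataset. A small sketch on hypothetical toy labels (60 positive, 40 negative) shows how this can be checked:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60% "pos" and 40% "neg".
toy_y = pd.Series(["pos"] * 60 + ["neg"] * 40)
toy_X = pd.Series([f"doc {i}" for i in range(100)])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Share of positive reviews in each split.
train_share = y_tr.value_counts(normalize=True)["pos"]
test_share = y_te.value_counts(normalize=True)["pos"]

print(train_share, test_share)  # both splits keep the 60% share
```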


**10)** Create a TF-IDF vector space model from the training reviews, excluding words that appear in fewer than 7 documents and using only unigrams. Then extract their document-term matrix.

In [35]:
vect = TfidfVectorizer(min_df=7, ngram_range=(1,1))
train_dtm = vect.fit_transform(X_train).toarray()

In [36]:
test_dtm = vect.transform(X_test).toarray()
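
To make the role of `min_df` and of the fit/transform split concrete, here is a minimal sketch on four hypothetical toy reviews, with `min_df=2` instead of 7 so the effect is visible on such a small corpus. The vocabulary is learned from the training documents only; words seen only in the test documents are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora (not the Amazon reviews).
train_docs = ["great gift card", "great card", "card never arrived", "great gift"]
test_docs = ["great present card"]

# Keep only unigrams that occur in at least 2 training documents.
toy_vect = TfidfVectorizer(min_df=2, ngram_range=(1, 1))
toy_train = toy_vect.fit_transform(train_docs)  # learns the vocabulary
toy_test = toy_vect.transform(test_docs)        # reuses it unchanged

vocab = sorted(toy_vect.vocabulary_)
print(vocab)           # "never" and "arrived" are dropped (one document each)
print(toy_test.shape)  # same number of columns as the training matrix
```

Both matrices are returned in sparse format; the notebook converts them with `.toarray()`, which is convenient for Keras but uses more memory.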

**11)** Train a Support Vector Machine of your choice on the training reviews with a regularization parameter equal to 5, using the representation created above

In [91]:
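from sklearn import preprocessing

# SVC accepts string labels directly, so y_train and y_test are left untouched:
# make_target (used for the neural network) needs the original "pos"/"neg" strings.
# The encoder only exposes the integer mapping (neg -> 0, pos -> 1) in separate variables.
le = preprocessing.LabelEncoder()
le.fit(y_train)
y_train_enc = le.transform(y_train)
y_test_enc = le.transform(y_test)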



In [25]:
def make_target(labels):
    return pd.DataFrame({
        "pos": labels == "pos", # if the label is "pos" then return 1 else return 0
        "neg": labels == "neg" # if the label is "neg" then return 1 else 0
    }).astype(int)

In [39]:
train_target = make_target(y_train)
test_target = make_target(y_test)

In [92]:
clf = SVC(C = 5, kernel = 'linear')
clf.fit(train_dtm, y_train)
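
scikit-learn classifiers accept string class labels as they are, so encoding them as integers is optional for `SVC`. A minimal sketch on hypothetical two-dimensional toy points (not TF-IDF vectors) illustrates this:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy features: two well-separated groups of points.
X_toy = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.2]])
y_toy = np.array(["neg", "neg", "pos", "pos"])

# Linear SVM with regularization parameter C=5, trained on string labels.
toy_clf = SVC(C=5, kernel="linear").fit(X_toy, y_toy)

pred = toy_clf.predict(np.array([[0.95, 0.05]]))
print(list(toy_clf.classes_))  # ['neg', 'pos']
print(pred[0])                 # predictions come back as strings too
```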


**12)** Verify the accuracy of the classifier on the test set and try to maximize it by tuning the Support Vector Machine kernel and regularization factor.

In [94]:
clf.score(test_dtm, y_test)

0.8408577878103838

In [98]:
from sklearn.model_selection import GridSearchCV
params = {
    'C' : [1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid']
    }
grid = GridSearchCV(SVC(), params, scoring = 'accuracy', cv = 5)
grid.fit(train_dtm, y_train)
params = grid.best_params_

In [101]:
clf = SVC(C = params['C'], kernel = params['kernel'])
clf.fit(train_dtm, y_train)

In [102]:
clf.score(test_dtm, y_test)

0.871331828442438
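
One possible refinement, sketched here on hypothetical toy reviews: wrapping the vectorizer and the classifier in a `Pipeline` lets the grid search refit the TF-IDF vocabulary inside each cross-validation fold, so the validation fold never influences the features. The step names `tfidf` and `svc`, the toy texts, and the reduced parameter grid are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy reviews: six positive and six negative.
toy_texts = [
    "great card", "love this gift", "works great", "perfect gift",
    "great value", "love it",
    "never arrived", "code did not work", "empty envelope", "bad service",
    "did not activate", "never again",
]
toy_labels = ["pos"] * 6 + ["neg"] * 6

# Vectorizer and classifier are tuned and cross-validated as one unit.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC())])
grid_params = {"svc__C": [1, 5], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, grid_params, scoring="accuracy", cv=3)
search.fit(toy_texts, toy_labels)

print(search.best_params_)
```

By default `GridSearchCV` refits the best configuration on the whole training set, so `search.best_estimator_` can be used for prediction directly.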

**13)** Train a Deep Learning model (excluding transformer-based models like BERT) using the document-term representation built in point 10. The usage of recurrent layers is up to you.

In [37]:
from keras.models import Sequential
from keras.layers import Dense

num_terms = len(vect.get_feature_names_out()) 

model = Sequential([
    Dense(256, activation="sigmoid", input_dim=num_terms),
    Dense(64, activation="sigmoid"),
    Dense(16, activation="sigmoid"),
    Dense(2, activation="softmax")
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

model.fit(train_dtm, train_target,batch_size=200,epochs=3)

Epoch 1/3




Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x151037f10>

14) Evaluate the model by calculating the accuracy on the test data. Try to maximize the model accuracy by tuning the neural network.

In [41]:
model.evaluate(test_dtm, test_target)



[0.6723747849464417, 0.8335220813751221]
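
The two numbers returned by `model.evaluate` are the loss followed by the metrics passed to `compile`, here categorical cross-entropy and accuracy. The NumPy sketch below computes both quantities on made-up probabilities (not the model's actual predictions) to show what each number measures. Keras applies small numerical safeguards, so its values can differ slightly from this plain formula.

```python
import numpy as np

# One-hot targets in the (pos, neg) column order used by make_target.
y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

# Made-up softmax outputs for four reviews (hypothetical values).
y_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.3, 0.7]])

# Categorical cross-entropy: mean negative log-probability of the true class.
loss = -np.mean(np.sum(y_true * np.log(y_prob), axis=1))

# Accuracy: share of reviews whose most probable class is the true one.
accuracy = np.mean(np.argmax(y_prob, axis=1) == np.argmax(y_true, axis=1))

print(round(float(loss), 4), float(accuracy))
```

The third review is misclassified (0.4 for the true class), which lowers the accuracy to 3 out of 4 and contributes the largest term to the loss.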

In [42]:
from keras import regularizers

model = Sequential([
    Dense(256, activation="sigmoid", input_dim=num_terms, kernel_regularizer=regularizers.L1L2(l1=1e-8, l2=1e-6)),
    Dense(64, activation="sigmoid"),
    Dense(16, activation="sigmoid"),
    Dense(2, activation="softmax")
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

In [43]:
model.fit(train_dtm, train_target, batch_size=200, epochs=10)

Epoch 1/10




Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x159c6fb80>

In [44]:
model.evaluate(test_dtm, test_target)







[0.3559566140174866, 0.8686296343803406]

15) Evaluate the trained DL model on 50000 random reviews from the dataset in the `Magazine_Subscriptions.json.gz` file.

Hint: you have to repeat the preprocessing steps applied earlier to the Gift Cards reviews.

In [45]:
df_2 = pd.read_json('Magazine_Subscriptions.json.gz',lines = True,compression= 'gzip')
df_2 = df_2.sample(n = 50000)
df_2 = df_2.reset_index()
df_2.head()

Unnamed: 0,index,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,style,image
0,81371,5,,True,"07 17, 2016",A19FLCEIE0MRLI,B000AIG4ES,Jess,Ordered for my husband who loves cars. He was ...,Great!,1468713600,,
1,83724,3,,True,"04 22, 2016",A3HH3Y7WE9Q7UN,B002NM7VNM,LA,So much more advertising than actual articles....,So much more advertising than actual articles....,1461283200,{'Format:': ' Print Magazine'},
2,75157,1,,True,"07 20, 2016",A1MX2F7Y83O9XK,B00005N7R0,lauren l.,Order at your own risk. Magazine was not as go...,Poor customer service.,1468972800,,
3,31861,5,6.0,True,"02 15, 2014",A3LSSKZJ08O5FY,B00006KSSX,Yarnspinner13,I have been getting Piecework for years and ye...,One of my favorite fiber magazines,1392422400,{'Format:': ' Kindle Edition'},
4,23934,5,,False,"03 23, 2007",A2DJA4JZWGL1OR,B00005NIN8,Byates,I'll never subscribe any other way again. The...,Came fast.,1174608000,,


In [46]:
rus = RandomUnderSampler(random_state=42)
df_2, _ = rus.fit_resample(df_2, df_2["overall"])
df_2.head()

Unnamed: 0,index,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,style,image
0,67527,1,7.0,False,"02 23, 2014",A10D9F8QMOYMB1,B005ST869K,M. Davis,I don't understand this - I was looking forwar...,Digital costs more than dead tree?,1393113600,,
1,28053,1,3.0,True,"09 9, 2016",A2G7RIPZRYDB4D,B000060MGT,Grace,Ordered in May 2016.\nReceived in July 2016.\n...,Ordered in May 2016. Received in July 2016. ...,1473379200,,
2,77247,1,11.0,False,"09 13, 2003",A1YNEY755DJZUU,B00006K85A,David Young (@deepinthecode),"This magazine, while presumably well-intention...",Magazine is dangerous to the faith of Catholics,1063411200,,
3,75131,1,,False,"06 27, 2017",A3R7500TAKDDL3,B00005N7QS,shaheenAustin,Really not good.,"Junk articles, very lacking in any real info",1498521600,{'Format:': ' Kindle Edition'},
4,19187,1,3.0,True,"01 30, 2010",A1XMH5XV0F902F,B00005NIP7,L. Pham,"As of today, I am still waiting to receive Tra...",Never got the magazine,1264809600,{'Format:': ' Print Magazine'},


In [47]:
df_2["reviewText"] = df_2["reviewText"].values.astype('U')
df_2 = df_2[["reviewText", "overall"]]
df_2 = df_2[df_2["overall"] != 3]

df_2["label"] = df_2["overall"].apply(mapToLabel)

df_2.head() # To visualize the new df_2

Unnamed: 0,reviewText,overall,label
0,I don't understand this - I was looking forwar...,1,neg
1,Ordered in May 2016.\nReceived in July 2016.\n...,1,neg
2,"This magazine, while presumably well-intention...",1,neg
3,Really not good.,1,neg
4,"As of today, I am still waiting to receive Tra...",1,neg


In [48]:
y = df_2["label"]
X = df_2["reviewText"]

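# Reuse the vectorizer fitted on the Gift Cards training set: refitting it here
# would change the vocabulary and no longer match the model's input layer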
test_dtm_2 = vect.transform(X).toarray()
test_target_2 = make_target(y)

In [49]:
model.evaluate(test_dtm_2, test_target_2)



[0.5281423330307007, 0.7463718056678772]
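
Accuracy drops from about 0.87 on the Gift Cards test set to about 0.75 on the magazine reviews. Part of this gap may come from vocabulary mismatch, since the vectorizer only knows words seen in the Gift Cards training reviews. This was not measured in the notebook. One way to probe it is to compute the share of target-domain tokens that the fitted vocabulary covers, sketched here on hypothetical toy sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for the two review domains.
source_docs = ["gift card arrived fast", "card code did not work"]
target_docs = ["magazine arrived late", "great magazine articles"]

# Fit on the source domain only, as in the notebook.
fitted = TfidfVectorizer().fit(source_docs)

# Tokenize the target domain with the same preprocessing as the vectorizer.
analyze = fitted.build_analyzer()
tokens = [tok for doc in target_docs for tok in analyze(doc)]

# Tokens outside the fitted vocabulary are silently dropped by transform().
known = [tok for tok in tokens if tok in fitted.vocabulary_]
coverage = len(known) / len(tokens)

print(f"{coverage:.2f}")  # only "arrived" is shared: 1 of 6 tokens
```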

16) Extra: train/fine-tune a transformer-based model (e.g. BERT) on Gift Cards training reviews and evaluate it on the Gift Cards test reviews.