## 1. Importing Libraries & Data


This project was developed by <br><br>

*<center>António Oliveira - 2023039 - Industrial Applications of AI*

### 1.1 Libraries

In [44]:
import pandas as pd


# stopwords
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

# Vectorization
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

# external py file
import functions2 as ext

# Warnings
import warnings
warnings.filterwarnings("ignore")

Checking the available functions in the external file.

In [8]:
functions = [func for func in dir(ext) if callable(getattr(ext, func))]

# Print the list of functions
print("Functions in the external module:\n")
for func in functions:
    print(func+ '\n')

Functions in the external module:

PorterStemmer

WordNetLemmatizer

contains_emoji

detect

detect_language

detect_outliers_per_column

is_emoji

plot_boxplots

preprocessor

replace_emojis

sent_tokenize

stopword_remover

translate_with_deepl

word_tokenize



### 1.2 Data

In [83]:
data = pd.read_csv('/Users/antoniooliveira/Projects/Industrial Applications of AI/Assignment 4/clean_data_v2.csv')
data.head(3)

Unnamed: 0,published_date,published_platform,rating,helpful_votes,title,text,Hour,Day,Month,Year,...,language,char_count,translated_text,contains_emoji,title_contains_emoji,no_emoji_text,clean_text,translated_title,no_emoji_title,clean_title
0,2023-12-28 13:02:14+00:00,Mobile,5,0,Best classes and good environment,Good thanks for everything good work group 👍 h...,3,28,12,2023,...,en,124,Good thanks for everything good work group 👍 h...,True,False,Good thanks for everything good work group thu...,Good thanks everything good work group thumbs_...,Best classes and good environment,Best classes and good environment,Best classes good environment
1,2023-12-12 05:38:26+00:00,Desktop,4,0,Harvard University,Harvard University was founded in 1636 and is ...,1,12,12,2023,...,en,322,Harvard University was founded in 1636 and is ...,False,False,Harvard University was founded in 1636 and is ...,Harvard University founded 1636 private Ivy Le...,Harvard University,Harvard University,Harvard University
2,2023-12-10 13:21:35+00:00,Mobile,3,0,Walk around campus,We did a walk around most of the Harvard Campu...,6,10,12,2023,...,en,301,We did a walk around most of the Harvard Campu...,False,False,We did a walk around most of the Harvard Campu...,"We walk around Harvard Campus, beautiful old, ...",Walk around campus,Walk around campus,Walk around campus


Dropping columns unnecessary to this task.

In [84]:
data = data.drop(['title', 'text', 'contains_emoji',
                 'title_contains_emoji', 'translated_text',
                 'clean_title', 'translated_title', 'Timezone'], axis = 1)

Since this notebook is an extension of the project, the final preprocessing steps are done here. They build on the text that has already been translated and stripped of emojis, and apply the *preprocessor* function (available in *functions2*) to the *no_emoji_text* column.

The aim is to experiment with lemmatisation, stemming, or neither, and then compare the resulting models to see which option performs best.
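To make the effect of word reduction concrete, the sketch below applies NLTK's `PorterStemmer` to a handful of words and prints each one next to its stem. It is only an illustration: the word list is hypothetical, and the code does not reproduce the *preprocessor* function from *functions2*, whose internals are not shown in this notebook. Lemmatisation is left out of the sketch because `WordNetLemmatizer` needs the WordNet corpus to be downloaded first.

```python
from nltk.stem import PorterStemmer

# Hypothetical sample tokens, chosen for illustration only
sample_tokens = ["university", "everything", "walking", "tours"]

stemmer = PorterStemmer()

# Stemming strips suffixes with fixed rules, so a stem may not be a real word
stemmed = [stemmer.stem(token) for token in sample_tokens]

for original, stem in zip(sample_tokens, stemmed):
    print(f"{original:>12} -> {stem}")
```

Stems are cheaper to compute than lemmas but less readable, which is one reason to compare both options against unreduced tokens.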

## 2. Text Preprocessing

**Lemmatization**

In [85]:
data['lemmatized_text'] = data['no_emoji_text'].apply(lambda text: ext.preprocessor(text,
                 remove_punctuation=False,
                 lowercase=True,
                 tokenized_output=False,
                 remove_stopwords=True,
                 lemmatization=True,
                 stemming=False,
                 sentence_output=False
))

                         
data['lemmatized_text']

0       good thanks everything good work group thumbs_...
1       harvard university founded 1636 private ivy le...
2       walk around harvard campus, beautiful old, exp...
3       walk university ground tour guide- tour cruise...
4       finally made harvard!! iconic university campu...
                              ...                        
3144    boston usually time go cambridge( subway) sudd...
3145    hoped harvard university tour would give u ins...
3146    see city, study home one world's+ emblazoned u...
3147    hello, studied harvard law school year therefo...
3148    lovely building long history. free tour( engli...
Name: lemmatized_text, Length: 3149, dtype: object

**Stemming**

In [86]:
data['stemmed_text'] = data['no_emoji_text'].apply(lambda text: ext.preprocessor(text,
                 remove_punctuation=False,
                 lowercase=True,
                 tokenized_output=True,
                 remove_stopwords=True,
                 lemmatization=False,
                 stemming=True,
                 sentence_output=False
))

                         
data['stemmed_text']

0       [good, thank, everyth, good, work, group, thum...
1       [harvard, univers, found, 1636, privat, ivi, l...
2       [walk, around, harvard, campu, ,, beauti, old,...
3       [walk, univers, ground, tour, guid, -, tour, c...
4       [final, made, harvard, !, !, icon, univers, ca...
                              ...                        
3144    [boston, usual, time, go, cambridg, (, subway,...
3145    [hope, harvard, univers, tour, would, give, us...
3146    [see, citi, ,, studi, home, one, world, 's, +,...
3147    [hello, ,, studi, harvard, law, school, year, ...
3148    [love, build, long, histori, ., free, tour, (,...
Name: stemmed_text, Length: 3149, dtype: object

**Simple Tokenization**

In [87]:
data['tokenized_text'] = data['no_emoji_text'].apply(lambda text: ext.preprocessor(text,
                 remove_punctuation=False,
                 lowercase=True,
                 tokenized_output=True,
                 remove_stopwords=True,
                 lemmatization=False,
                 stemming=False,
                 sentence_output=False
))

                         
data['tokenized_text']

0       [good, thanks, everything, good, work, group, ...
1       [harvard, university, founded, 1636, private, ...
2       [walk, around, harvard, campus, ,, beautiful, ...
3       [walk, university, grounds, tour, guide, -, to...
4       [finally, made, harvard, !, !, iconic, univers...
                              ...                        
3144    [boston, usually, time, go, cambridge, (, subw...
3145    [hoped, harvard, university, tour, would, give...
3146    [see, city, ,, study, home, one, world, 's, +,...
3147    [hello, ,, studied, harvard, law, school, year...
3148    [lovely, building, long, history, ., free, tou...
Name: tokenized_text, Length: 3149, dtype: object

### 2.1 Vectorization

As we are using different techniques to reduce words (Stemming, Lemmatization and None), we must define in which column the Vectorization step is to be performed.

The vectorisation techniques used are briefly described below:

- Bag of Words - simple text vectorisation that does not consider word order or context

- TF-IDF - down-weights words that appear in many documents, giving more weight to less common ones (this can be misleading when a rare word is uninformative, such as a typo)

- Doc2Vec (an extension of Word2Vec) - produces dense document embeddings that capture semantic meaning and relationships between words (more complex)

In [45]:
column = data['lemmatized_text']

**Bag Of Words**

In [88]:
vectorizer = CountVectorizer(stop_words=stop_words,max_features=2500)
corpus = column
X = vectorizer.fit_transform(corpus)

feature_names = vectorizer.get_feature_names_out()
print("Selected Features:", feature_names)


Selected Features: ['00' '000' '10' ... 'youth' 'youthful' 'zuckerberg']


**TF-IDF**

In [73]:
vectorizer = TfidfVectorizer(stop_words=stop_words,max_features=2500)
corpus = column
X = vectorizer.fit_transform(corpus)

feature_names = vectorizer.get_feature_names_out()
print("Selected Features:", feature_names)

Selected Features: ['00' '000' '10' ... 'youth' 'youthful' 'zuckerberg']


**Document Embeddings (Doc2Vec)**

In [50]:
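import ast

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def read_corpus(column, tokens_only=False):
    """Yield one token list (or TaggedDocument) per row of `column`."""
    for i, tokens in enumerate(column):
        try:
            # Rows stored as the string form of a list, e.g. "['a', 'b']",
            # are parsed back into a real list
            tokens = ast.literal_eval(tokens)
        except (ValueError, SyntaxError, TypeError):
            # Anything else is passed through unchanged. Doc2Vec expects a list
            # of tokens, so plain strings may need splitting (tokens.split()) first
            pass
        if tokens_only:
            yield tokens
        else:
            yield TaggedDocument(tokens, [i])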

In [51]:
corpus = list(read_corpus(column=column))

Initialising the model and building its vocabulary.

In [53]:
model = Doc2Vec(vector_size=300,negative=5, hs=0, min_count=2,dm=0, sample = 0,epochs=30,workers = 8)
model.build_vocab(corpus)

Showing the total number of words in the corpus.

In [54]:
model.corpus_total_words

644870

Training the model.

In [55]:
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

Storing the document vectors as X, as it was done previously when applying BoW and TF-IDF.

In [56]:
import numpy as np

# Stack the document vectors into a 2-D array so X can be indexed by fold later
X = np.array([model.dv[i] for i in range(len(model.dv))])

## 3. Model Training

In this section different models will be trained with the vectorised data. This process was developed as follows:
1. Selection of a Word Reduction Technique (Lemmatization, Stemming, None)
2. Selection of a Vectorization Technique (BoW, TF-IDF, Doc2Vec)
3. Model Training and Hyperparameter Tuning

Note that the vectorisation techniques overwrite each other when run (each one reassigns `X`), so only one can be run at a time. For that reason, all but the best-performing one will be left commented out.

After this selection, a baseline model was produced, to be used as a guideline when evaluating different models. This baseline model was produced using Stratified K-Fold and Logistic Regression, without any hyperparameter tuning.
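When comparing several models against a baseline, it can help to reduce the per-fold reports to a single number. The sketch below shows one possible way to do that with `cross_val_score`, using synthetic data from `make_classification` as a stand-in for the vectorised reviews. The names `X_demo`, `y_demo` and `baseline` are hypothetical, and weighted F1 is an assumed choice of metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the vectorised reviews and their ratings
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, random_state=42,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
baseline = LogisticRegression(max_iter=1000)

# One weighted F1 score per fold, then a single summary number
fold_scores = cross_val_score(baseline, X_demo, y_demo, cv=cv, scoring="f1_weighted")
print("Per-fold weighted F1:", fold_scores.round(3))
print("Mean weighted F1:", round(fold_scores.mean(), 3))
```

The full `classification_report` per fold remains useful for spotting classes that are never predicted, which a single averaged score can hide.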

However, before starting, one must define the target.


In [74]:
y = data['rating']

**Baseline Model**

In [96]:
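from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold

logreg = LogisticRegression()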

# Initialize Stratified K-Fold
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(stratified_kfold.split(X, y)):
    # Extract the training and validation data for this fold
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    # Train the model 
    logreg.fit(X_train, y_train)

    # Make predictions 
    predictions = logreg.predict(X_val)

    # Evaluate the model
    print(f"Fold {fold + 1}:")
    print(classification_report(y_val, predictions))


Fold 1:
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         3
           2       0.00      0.00      0.00         9
           3       0.34      0.27      0.30        52
           4       0.43      0.38      0.40       200
           5       0.67      0.76      0.71       366

    accuracy                           0.58       630
   macro avg       0.29      0.28      0.28       630
weighted avg       0.56      0.58      0.57       630

Fold 2:
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         3
           2       1.00      0.11      0.20         9
           3       0.50      0.37      0.42        52
           4       0.40      0.37      0.38       200
           5       0.66      0.73      0.69       366

    accuracy                           0.57       630
   macro avg       0.51      0.32      0.34       630
weighted avg       0.56      0.57      0.56       630

Fold 3

**Multinomial NB**

In [92]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import RandomizedSearchCV, ParameterGrid, StratifiedKFold
import numpy as np
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

In [89]:
# Defining Parameters

param_dist = {
    'alpha': np.linspace(0, 1, 20),
    'fit_prior': [True, False],
    'force_alpha': [True, False]
}

# Perform Random Search
random_search = RandomizedSearchCV(MultinomialNB(), param_distributions=param_dist, n_iter=50, cv=5, scoring='f1_weighted', n_jobs=-1, random_state=42)
random_search.fit(X, y)

# Get the best parameters
best_params = random_search.best_params_

In [90]:
print(f"Best parameters: {best_params}")

Best parameters: {'force_alpha': True, 'fit_prior': True, 'alpha': 0.5263157894736842}


In [91]:
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Iterate over the folds
for fold, (train_idx, val_idx) in enumerate(stratified_kfold.split(X, y)):
    
    # Extract the training and testing data for this fold
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    # Train and evaluate your model on this fold
    print(f"Fold {fold + 1}:")
    print("Training data:", X_train.shape[0], len(y_train))
    print("Testing data:", X_val.shape[0], len(y_val))
    
    # Training and evaluation 
    mnb = MultinomialNB(**best_params)
    mnb.fit(X_train,y_train)
    predictions = mnb.predict(X_val)
    print(classification_report(y_val, predictions))
    print("\n")

Fold 1:
Training data: 2519 2519
Testing data: 630 630
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         3
           2       0.00      0.00      0.00         9
           3       0.35      0.33      0.34        52
           4       0.42      0.41      0.42       200
           5       0.70      0.73      0.71       366

    accuracy                           0.58       630
   macro avg       0.30      0.29      0.29       630
weighted avg       0.57      0.58      0.58       630



Fold 2:
Training data: 2519 2519
Testing data: 630 630
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         3
           2       0.00      0.00      0.00         9
           3       0.35      0.31      0.33        52
           4       0.40      0.34      0.37       200
           5       0.66      0.75      0.70       366

    accuracy                           0.57       630
   macro avg       