
# Sentiment Analysis

The dataset is described here: https://www.aclweb.org/anthology/P04-1035.pdf

It is part of NLTK, so it is convenient for us to use.

The goal of this exercise is to build a first machine learning model using the tools that we have seen so far: choose how to preprocess the text, create a bag-of-words feature representation, and train a model using an ML method of your choice.

You need to use the following split for the data:

*   test: 30% of the documents
*   The rest of the documents will be split as
    *   train: 75% of the documents
    *   validation: 25% of the documents


Use accuracy as the evaluation measure.
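
As a quick sanity check on the split above, the sketch below works out how many documents land in each part. It assumes the 2000 reviews of the NLTK corpus and uses only the standard library; the helper name `split_indices` is made up for illustration, and in practice `train_test_split` from scikit-learn does the same job.

```python
import random

def split_indices(n_docs, test_frac=0.30, val_frac=0.25, seed=42):
    """Shuffle document indices and cut them into train / validation / test."""
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)

    n_test = round(n_docs * test_frac)        # 30% of all documents
    test, rest = indices[:n_test], indices[n_test:]

    n_val = round(len(rest) * val_frac)       # 25% of the remaining documents
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(2000)
print(len(train_idx), len(val_idx), len(test_idx))  # 1050 350 600
```

In terms of the full corpus this amounts to 52.5% train, 17.5% validation and 30% test.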

In [1]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Computing Vectorial Representations

In [2]:
!python -m spacy download "en_core_web_sm"

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [3]:
import nltk
nltk.download('movie_reviews') # loads the dataset
nltk.download('punkt')
#!python -m spacy download "en_core_web_sm"


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Loading the data
In the following, I extract the raw content of the reviews (movie_reviews.raw()), i.e. each review is a single string.
Another option is to use movie_reviews.words(), which returns each review as a list of tokens. Feel free to use whichever best fits your needs.


In [4]:
from nltk.corpus import movie_reviews
import random
import spacy
from scipy.sparse import coo_matrix, vstack
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

nlp_en = spacy.load("en_core_web_sm", disable=['ner', 'parser'])

documents = [(movie_reviews.raw(fileid), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

print("number of docs loaded:", len(documents))

corpus = [ x[0] for x in documents ] # the list of texts
y_corpus = [ x[1] for x in documents ] # the corresponding labels: the sentiment of each text
print(corpus[0])
print(y_corpus[0])

random.seed(42)


number of docs loaded: 2000
plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . . . 
critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 
they seem to have taken this pretty neat concept , but executed it terribly . 
so what are the problems with the movie ? 
well , its main problem is that it's simply too jumbled . 
it starts off " normal " but then downshifts into this " fantasy " world in which you , as an 

## Exercise

Create a vectorial representation of the data, then apply a learning algorithm, optimising its hyperparameters on the validation (dev) set. If you need to use any function that depends on random number generators, use 42 as seed.
Test several representations. You may try functions of the libraries we have seen in class or make your own vectorial representation from scratch.
Once you have selected the best hyperparameters and preprocessing, retrain your model on the union of the training and validation sets, then compute the accuracy on the test set.
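
For the "from scratch" route mentioned above, a bag of words needs little more than a vocabulary and per-document counts. The sketch below is only an illustration under simplifying assumptions: whitespace tokenization, toy sentences, and two made-up helper names (`build_vocabulary`, `bag_of_words`) that do not come from any library.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map every token seen in docs to a column index (sorted for a stable order)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def bag_of_words(doc, vocab):
    """Return a dense count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(doc.lower().split())
    vector = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:
            vector[vocab[tok]] = n
    return vector

train_docs = ["a good good movie", "a bad movie"]
vocab = build_vocabulary(train_docs)
print(vocab)                                     # {'a': 0, 'bad': 1, 'good': 2, 'movie': 3}
print(bag_of_words("a good good movie", vocab))  # [1, 0, 2, 1]
print(bag_of_words("a boring movie", vocab))     # [1, 0, 0, 1]
```

A corpus of real size would need a sparse matrix rather than Python lists, which is what `CountVectorizer` returns.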

Report your test performance on Moodle. In Moodle you are also supposed to upload the notebook in .py format (Menu File->Download->Download .py).
In the file with your code, motivate any significant choice you made and describe every preprocessing variant you attempted (clearly highlight the best one, though).

**Bonus Exercise**: for your best model, print the 30 tokens whose corresponding parameters have the highest absolute value. What do you think of this list? Does it make sense? Are all tokens expected?


## Vectorize the data, since it is text

### Split the data first, then fit the vectorizer on the training documents only. Shuffling while splitting is a good rule of thumb.

## Fitting the vectorizer on the whole corpus before splitting would let test-set vocabulary leak into the features. Document order is not a concern, because train_test_split shuffles the texts and their labels together.

In [None]:
#---------------------- STAGE-1: preprocessing

# Load NLP model
nlp_en = spacy.load("en_core_web_sm", disable=['ner', 'parser'])

# Define custom tokenizer
def spacy_tokenizer(text):
    return [token.text for token in nlp_en(text)]

# Split data BEFORE vectorizing
X_train_corpus, X_test_corpus, y_train, y_test = train_test_split(
    corpus, y_corpus, train_size=0.70, random_state=42
)

# Bag-of-words counts, tokenized with the spaCy tokenizer defined above
vectorizer = CountVectorizer(binary=False, tokenizer=spacy_tokenizer)

# Fit on training data only
vectorizer.fit(X_train_corpus)

# Transform train and test data
X_train = vectorizer.transform(X_train_corpus)
X_test = vectorizer.transform(X_test_corpus)

# Print some details
print("Vocabulary size:", len(vectorizer.get_feature_names_out()))
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)


#---------------------- STAGE-2: fitting the model

# split the remaining 70% into train (75%) and validation (25%)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, train_size=0.75, random_state=42)
y_train = np.array(y_train)
y_validation = np.array(y_validation)

clf_lr = LogisticRegression(max_iter=100_000)

# fit the model on the training set
clf_lr.fit(X_train, y_train)

#---------------------- STAGE-3: Prediction stage on the validation set

y_trainPred = clf_lr.predict(X_train)
y_validation_Pred = clf_lr.predict(X_validation)

# without tuning the regularization strength (default C=1.0)
print(accuracy_score(y_train, y_trainPred))
print(accuracy_score(y_validation, y_validation_Pred))



In [None]:
# shapes of the training features and training predictions
print("x: ",X_train.shape)
print("y: ",y_trainPred.shape)

# shapes of the validation features and validation predictions (y_hat)
print("x: ", X_validation.shape)
print("y: ", y_validation_Pred.shape)

In [None]:
#---------------------- STAGE-4: Training with regularization (C is the inverse of lambda)
C = [0.001, 0.01, 0.02, 0.022, 0.024, 0.027, 0.03, 0.1, 1, 1.02, 1.1, 1.2, 30, 98, 99, 100, 150, 200, 250, 300, 350, 1000, 2000, 3000]


for c in C:
    clf_lr = LogisticRegression(C=c, max_iter=100_000)  # train logistic regression with 'c' as the inverse regularization strength
    clf_lr.fit(X_train, y_train)

    # estimate y_hat
    y_trainPred = clf_lr.predict(X_train)
    y_valPred = clf_lr.predict(X_validation)

    tr_acc = accuracy_score(y_train, y_trainPred)
    val_acc = accuracy_score(y_validation, y_valPred)

    print(f"LR. C= {c}.\tTrain ACC: {tr_acc}\tVal Acc: {val_acc}")
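
The loop above prints one line per value of `C`; picking the winner can also be done in code. The following is a sketch on synthetic data from `make_classification` (an assumption made only to keep the example self-contained, with a shortened grid) that keeps the `C` with the highest validation accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized reviews
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=0.75, random_state=42)

best_C, best_val_acc = None, -1.0
for c in [0.001, 0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(C=c, max_iter=10_000).fit(X_tr, y_tr)
    val_acc = accuracy_score(y_val, clf.predict(X_val))
    if val_acc > best_val_acc:          # ties keep the smaller (more regularized) C
        best_C, best_val_acc = c, val_acc

print(f"best C = {best_C}, validation accuracy = {best_val_acc:.3f}")
```

In scikit-learn a smaller `C` means stronger regularization, so breaking ties in favour of the smaller value prefers the simpler model.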




In [None]:
import numpy as np

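# Get the learned coefficients. Note: after the Stage 4 loop, clf_lr is the model
# fitted with the last C in the list, so refit with the chosen C if that is not the best one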
feature_names = vectorizer.get_feature_names_out()
coefficients = clf_lr.coef_[0]  # LogisticRegression stores coefficients as a 2D array

# Get the indices of the top 30 absolute coefficient values
top_30_indices = np.argsort(np.abs(coefficients))[-30:]  # Get the last 30 (highest absolute values)

# Print the top 30 most important words with their coefficients
print("\nTop 30 most influential tokens:")
for idx in reversed(top_30_indices):  # Reverse to get the largest first
    print(f"{feature_names[idx]}: {coefficients[idx]:.4f}")


In [None]:
# testing on the "Test Data"

# Choose the best C value (replace this with the best performing one from Stage 4)
best_C = 1000  # Example value, update it accordingly

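# Train the final model with the best C on the union of the training and
# validation sets, as the exercise asks
X_train_val = vstack([X_train, X_validation], format="csr")
y_train_val = np.concatenate([y_train, y_validation])
final_model = LogisticRegression(C=best_C, max_iter=100_000)
final_model.fit(X_train_val, y_train_val)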

# Predict on test set
y_test_pred = final_model.predict(X_test)

# Evaluate performance
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Final Test Accuracy: {test_accuracy}")
