In this task I will have a couple of steps to perform in my sentiment analysis process:
## Sentiment Analysis process:
    1.  Analyze the data and understand how I should handle it.
    2.  Load the necessary data and filter out redundant information.
    3.  Handle the data and prepare it so that it fits the models I will be using. This step contains a couple of substeps:
        a. Tokenizing the data.
        b. Sanitizing the data by removing all the stopwords, punctuation and numbers that would only reduce the performance of the sentiment analysis model.
        c. Lemmatizing the data in order to reduce each word to its base form for the model.
        d. Vectorizing the data so that it fits the numerical matrix format I will be using when performing the actual sentiment analysis.
    4. Finding the optimal algorithm.
        a. I will have to try different algorithms with cross-validation and use their results to find which one fits the data best.
        b. In addition I will need to find the optimal parameters. I will decide later how to search for them.
    5. Training the model with the preprocessed data.
    6. Testing the model with the preprocessed data.

And then I will evaluate the model to see if its score is sufficient or not.

In [1]:
import json
import nltk
import spacy
import numpy as np
import requests


from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# These are for the sentiment analysis, more exactly for comparing the different algorithms with each other.
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

!pip install https://github.com/explosion/spacy-models/releases/download/nb_core_news_sm-3.1.0/nb_core_news_sm-3.1.0.tar.gz

Defaulting to user installation because normal site-packages is not writeable
Collecting https://github.com/explosion/spacy-models/releases/download/nb_core_news_sm-3.1.0/nb_core_news_sm-3.1.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/nb_core_news_sm-3.1.0/nb_core_news_sm-3.1.0.tar.gz (16.1 MB)

### 1. Analyzing the data & understanding its structure

In [2]:
def data_analysys(json_data) -> None:
    keys_in_data = set()
    print("The size of the data is:", len(json_data))

    for item in json_data:
        keys = item.keys()
        keys_in_data.update(keys)

    print(f'Main keys of this data are: {keys_in_data}')

I will use the function above to analyze the structure of the content of this file, to learn how I should preprocess it and how I can make the preprocessing more efficient.

In the cell below I simply load the documents from the URLs and print information about the received data using the previously defined data_analysys() function.

In [3]:
# Replace this URL with the raw URL of the file you want to fetch from GitHub
train_url = "https://raw.githubusercontent.com/ltgoslo/norec_sentence/main/binary/train.json"
test_url = "https://raw.githubusercontent.com/ltgoslo/norec_sentence/main/binary/test.json"

# Fetch the content of the file
train_response = requests.get(train_url)
test_response = requests.get(test_url)

# Raise an exception if there was an error fetching the file
train_response.raise_for_status() 
test_response.raise_for_status()

# Parse the content of the file from a JSON string
json_training_data = json.loads(train_response.text)
json_test_data = json.loads(test_response.text)

# This is only for analysis, not for loading the content:
data_analysys(json_training_data)
data_analysys(json_test_data)

The size of the data is: 3894
Main keys of this data are: {'text', 'sent_id', 'label'}
The size of the data is: 583
Main keys of this data are: {'text', 'sent_id', 'label'}


### 2. Loading the data & filtering out what's redundant

As you can see, there is an extra, redundant key that I do not have to consider when doing my sentiment analysis, and that is the "sent_id" key. It only identifies the sentence and says nothing about its sentiment, so I can remove it from my data to make model training more efficient.

I will only be using the text and label parts of this data, so I continue by defining a function that reads only the 'label' & 'text' parts of the data.


"data_reading()" extracts the data from the provided JSON file in a structured way: ["sentence", "sentiment"]. I consider the data not too complex, so I decided not to bother too much with visualizing it.

In [4]:
def data_reading(json_data) -> list:
    # Extract message_data directly using a list comprehension
    message_data = [[category["text"], category["label"]] for category in json_data]

    return message_data

Then I proceed to define the global variables, to make sure that I only load this data once. This is done in order to make the processing much faster.

In [5]:
train_data = data_reading(json_training_data)
test_data = data_reading(json_test_data)

### 3. Handling & preparing the data for my model

Before creating the function I was struggling with accessing the data in an efficient way. I wanted to separate the sanitizing, tokenizing, lemmatizing and vectorizing of the data. But I quickly figured that it would cost a lot of computation time, since I would have to access the same data multiple times in nested for loops. So instead I created preprocessing(), which sanitizes and tokenizes, and then lemmatizes the sentences, in one go. In addition to that, this function divides the preprocessed data into sentences and sentiments, which are then ready to be vectorized.

In [6]:
def preprocessing(data):
    sentences = []
    sentiments = []
    lemmatizer = spacy.load("nb_core_news_sm")  # Norwegian lemmatization model.

    for chunk in data:
        text, sentiment = chunk[0], chunk[1]
        # Here I tokenize the data, and prepare it to run it through a sanitizer.
        pre_lemmatized = lemmatizer(text)


        # Here I add the lemmatized lowercase word to the "lemmatized" list,
        # but only after it passes all the tests, to check if it's a legitimate word.  
        lemmatized = [root_word.lemma_.lower() for root_word in pre_lemmatized
                      if (not root_word.is_punct
                          and not root_word.is_currency
                          and not root_word.is_digit
                          and not root_word.is_space
                          and not root_word.is_stop
                          and not root_word.like_num)]


        # This if statement checks if the list is empty or not.
        # If it is, it simply continues to the next sentence without adding the
        # sentiment. That provides reassurance that there will not be any sentiments
        # attached to an empty list.
        if len(lemmatized)==0:
            continue
        else:
            # And here I join the processed words into one sentence in order to do
            # sentence-level sentiment analysis.
            sentence = ' '.join(lemmatized)
            sentences.append(sentence)
            sentiments.append(sentiment)

    return sentences, sentiments


In the last part of the for loop, where I have the if statement, I figured that filtering out all the stopwords may also filter out slang words and leave a sentence with no words at all. So I needed that check in order not to end up with sentiments attached to empty lists.
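To illustrate why the empty-list check matters, here is a minimal sketch. It assumes a tiny, hypothetical stop-word set and two made-up sentences; the notebook itself relies on spaCy's `is_stop` flag rather than a hand-written set.

```python
# Hypothetical stop-word set and sentences, used only for illustration;
# the notebook itself relies on spaCy's token.is_stop flag instead.
STOP_WORDS = {"det", "er", "en", "og", "så"}

def drop_stop_words(sentence: str) -> list[str]:
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in sentence.lower().split() if word not in STOP_WORDS]

examples = [("Det er en god film", "Positive"), ("Og så det", "Negative")]

kept = []
for text, label in examples:
    tokens = drop_stop_words(text)
    if not tokens:  # nothing left, so the label is skipped as well
        continue
    kept.append((" ".join(tokens), label))

print(kept)  # [('god film', 'Positive')]
```

The second sentence consists only of stop words, so both the sentence and its label are dropped and the two output lists stay aligned.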

So I perform the vectorization using CountVectorizer() & LabelEncoder() because that seemed the simplest of the approaches I came across. I considered whether I should use TF-IDF or Bag-of-Words vectorization, but I concluded that TF-IDF seemed better suited to bigger data sets with a variety of lengths and inputs. Here, meanwhile, I have three prepared documents with the same format, which will not vary as much as data from the real world would.
And my implementation is shown below:

In [7]:
def vectornihilation(x_axis, y_axis):
    vectorizer = CountVectorizer()
    labelizer = LabelEncoder()
    X_matrix = vectorizer.fit_transform(x_axis)
    Y_matrix = labelizer.fit_transform(y_axis)

    return X_matrix, Y_matrix, vectorizer, labelizer

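This function gives me a sparse matrix of word counts as X (one row per sentence, one column per word in the vocabulary) and a NumPy array of integer-encoded labels as y, together with the fitted vectorizer and label encoder.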
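To make the shape of these outputs concrete, here is a hand-rolled sketch of the same idea on made-up sentences. It is only an illustration under simplifying assumptions (whitespace tokenization, dense lists); it is not how CountVectorizer or LabelEncoder are implemented.

```python
# Hand-rolled illustration only: CountVectorizer and LabelEncoder do more
# (tokenization rules, sparse storage), but the resulting shapes are analogous.
sentences = ["god film", "svak film", "god god bok"]  # made-up examples
labels = ["Positive", "Negative", "Positive"]

# Vocabulary: every distinct word, sorted, mapped to a column index.
words = sorted({word for sentence in sentences for word in sentence.split()})
vocabulary = {word: index for index, word in enumerate(words)}

# One row per sentence, one column per vocabulary word, cells hold counts.
count_matrix = []
for sentence in sentences:
    row = [0] * len(vocabulary)
    for word in sentence.split():
        row[vocabulary[word]] += 1
    count_matrix.append(row)

# Labels: sorted distinct classes mapped to integers.
classes = sorted(set(labels))
encoded = [classes.index(label) for label in labels]

print(vocabulary)    # {'bok': 0, 'film': 1, 'god': 2, 'svak': 3}
print(count_matrix)  # [[0, 1, 1, 0], [0, 1, 0, 1], [1, 0, 2, 0]]
print(encoded)       # [1, 0, 1]
```

Because the classes are sorted before being numbered, "Negative" becomes 0 and "Positive" becomes 1 in this sketch.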

### 4. Finding the optimal algorithm to perform sentence sentiment analysis

We specifically considered these three models because:
Naive Bayes is often used in sentiment analysis and can score either very well or very badly depending on the type of data and the preprocessing. This model also seemed to be one of the closest to the types of models that the book implements, so I considered it the most relevant.

Logistic Regression is said to be a very precise model in general, with very comprehensible outputs, and to be easy to implement. The book also implements this method, so I thought I would have enough documentation on it to be able to understand it in depth.

Decision Tree is also a very popular option in sentiment analysis because of its ability to generalize. By that I mean that if you find a suitable grouping size for the model to use, you can predict sentiments very accurately.

Continuing:
Now that I have preprocessed data, I can use it together with the GridSearchCV() method in order to find the best algorithm, and the best parameters for that algorithm, to predict the sentiments of the sentences.

I do that by running an exhaustive search over the parameters for each of the algorithms.

In [9]:
def naiveBayes(x_ax, y_ax):
    hypeOmeters = {'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]}
    
    grid = GridSearchCV(GaussianNB(), hypeOmeters, cv=5)
    grid.fit(x_ax, y_ax)
    
    return grid.best_params_['var_smoothing']

My Naive Bayes function performs an exhaustive search for the optimal value of the var_smoothing parameter, which addresses the "zero variance" problem: features whose variance is zero or almost zero (very small). It adds a small portion of the largest feature variance to every variance, so that the model can still make good predictions even when the variance within each label is very small.

Then the function returns the best parameter for the Naive Bayes model, as estimated on our data set.
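As a rough sketch of what this smoothing does, the snippet below adds `var_smoothing` times the largest feature variance to every feature variance. It is a simplified illustration on made-up columns that ignores the per-class split GaussianNB performs; the helper name `smoothed_variances` is hypothetical.

```python
from statistics import pvariance

def smoothed_variances(columns, var_smoothing=1e-9):
    """Add var_smoothing * (largest feature variance) to every feature variance."""
    variances = [pvariance(column) for column in columns]
    epsilon = var_smoothing * max(variances)
    return [variance + epsilon for variance in variances]

# Made-up feature columns: the first one is constant, so its variance is 0.
columns = [[1, 1, 1, 1], [0, 2, 0, 2]]
smoothed = smoothed_variances(columns, var_smoothing=1e-2)
print(smoothed)
```

After smoothing, the constant column no longer has a variance of exactly zero, so a Gaussian density can still be evaluated for it.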

In [11]:
def logisticRegression(x_ax, y_ax):
    C_meters = {'C': [0.001, 0.01, 0.1, 1, 10]}
    
    grid = GridSearchCV(LogisticRegression(), C_meters, cv=5)
    grid.fit(x_ax, y_ax)
    
    return grid.best_params_['C']

Logistic Regression uses the C parameter, the inverse of the regularization strength, which decides how closely the model is allowed to fit the training data. With a too large C the regularization is weak and the model would fit too perfectly to the training data, which is known as overfitting; with a too small C the model is constrained so strongly that it underfits.
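The role of C can be sketched with the L2-penalized objective in the form given in the scikit-learn user guide, 0.5·‖w‖² + C·Σ log(1 + exp(−yᵢ(xᵢ·w + b))). The toy data, weights and helper names below are assumptions made for illustration only.

```python
import math

def data_loss(weights, bias, X, y):
    """Sum of logistic losses; labels y are expected to be -1 or +1."""
    total = 0.0
    for features, label in zip(X, y):
        score = sum(w * x for w, x in zip(weights, features)) + bias
        total += math.log(1.0 + math.exp(-label * score))
    return total

def l2_objective(weights, bias, X, y, C):
    """0.5 * ||w||^2 + C * data_loss: a small C lets the penalty dominate."""
    penalty = 0.5 * sum(w * w for w in weights)
    return penalty + C * data_loss(weights, bias, X, y)

# Made-up toy data and weights.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
weights, bias = [2.0, -2.0], 0.0

for C in (0.01, 1, 100):
    print(C, round(l2_objective(weights, bias, X, y, C), 4))
```

For the same weights, the penalty term stays fixed while the data term is scaled by C, so a larger C rewards fitting the training data more closely.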

In [13]:
def decissionTree(x_ax, y_ax):
    grido_meters = { 'max_depth': [2, 4, 6, 8, 10], 'min_samples_split': [2, 5, 10, 20] }

    grid = GridSearchCV(DecisionTreeClassifier(), grido_meters, cv=5)
    grid.fit(x_ax, y_ax)
    
    max_depth, min_sample = grid.best_params_['max_depth'], grid.best_params_['min_samples_split']
    return max_depth, min_sample


A decision tree tries to classify the input against the label by generalizing as much as possible. It builds a chain of decisions on each input vector, which means it tries to capture the correlation between the "label" and the "text" by splitting the sentences into groups that are as pure as possible, i.e. groups of sentences with the same label.

Therefore we must provide two of the hyperparameters:

max_depth, which limits how deep the tree may grow and so how fine-grained the distinctions between two sentences can be (how deep into the details the model will investigate).

min_samples_split, which sets the minimum number of samples a group must contain before it may be split further, and so how narrow the groups can become.

This function works much like the functions defined above: it performs an exhaustive search for the optimal hyperparameters and then returns the ones with the best score.
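To show what "exhaustive" means here, the sketch below enumerates every combination in the decision tree grid with the standard library. It only counts the candidates; it does not train anything and is not GridSearchCV's own code.

```python
from itertools import product

grid = {"max_depth": [2, 4, 6, 8, 10], "min_samples_split": [2, 5, 10, 20]}

# Every combination of the listed values is one candidate setting.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

cv_folds = 5
print(len(candidates))             # 20 candidate settings
print(len(candidates) * cv_folds)  # 100 model fits during the search
```

With GridSearchCV's default `refit=True`, one more fit on the full data follows for the best candidate, so the cost grows quickly as values or parameters are added to the grid.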


Here we use a general function that goes through each of these functions and automatically records their best parameters. We will use it to find out which of the models is most accurate with the least deviation.

In [29]:
# optimal_paramteres() returns the optimal hyperparameters found for each of the three models.
def optimal_paramteres(sentances, sentiments, y_axis, densed_X):
    Logistic_param = logisticRegression(
        x_ax = densed_X,
        y_ax = y_axis
    )

    Naive_param = naiveBayes(
        x_ax = densed_X, 
        y_ax = y_axis
    )

    DecTree_depth_param, DecTree_sample_param = decissionTree(
        x_ax = densed_X, 
        y_ax = y_axis
    )

    return [Logistic_param, Naive_param, DecTree_depth_param, DecTree_sample_param]

In [15]:
# Here we obtain the best parameters from each of the functions defined.
sentances, sentiments = preprocessing(train_data)
x_ax, y_ax, vectorizer, labelizer = vectornihilation(sentances, sentiments)
densed_X = x_ax.toarray()

# Store the result under a new name so that the optimal_paramteres()
# function is not overwritten and the cell can be re-run.
best_parameters = optimal_paramteres(
    sentances = sentances,
    sentiments = sentiments,
    y_axis = y_ax,
    densed_X = densed_X
)

# Getting the optimal hyperparameters
logistic_param, naive_param, decTree_depth_param, decTree_sample_param = best_parameters

Using the parameters obtained, we compute the best possible cross-validation score for each of the chosen models.
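As a reminder of what cross-validation does with the data, here is a simplified sketch of k-fold splitting. It uses plain contiguous folds; for classifiers scikit-learn stratifies the folds by label by default, which this hypothetical helper does not attempt.

```python
def k_fold_indices(n_samples: int, k: int = 5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

folds = list(k_fold_indices(7, k=3))
for train, validation in folds:
    print(validation, train)
```

Each sample lands in exactly one validation fold, and the model is scored once per fold on data it was not trained on.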

In [26]:
# This function returns the cross-validation scores for the given model,
# which we will use to evaluate which model performs best.
def OptimalSentimentAnalysis(densed_X, y_axis, model, param1, param2=None):
    # Build the estimator with the hyperparameters that belong to its class.
    if model == GaussianNB:
        estimator = model(var_smoothing=param1)
    elif model == LogisticRegression:
        estimator = model(C=param1)
    elif model == DecisionTreeClassifier:
        estimator = model(max_depth=param1, min_samples_split=param2)
    else:
        # Fail with a clear message instead of an undefined variable.
        raise ValueError(f"Unsupported model: {model.__name__}")

    return cross_val_score(estimator, densed_X, y_axis)


In [33]:
# Getting the cross validation score for each model

nb_score = OptimalSentimentAnalysis(
    densed_X = densed_X,
    y_axis = y_ax,
    model = GaussianNB,
    param1 = naive_param
)

lr_score = OptimalSentimentAnalysis(
    densed_X = densed_X,
    y_axis = y_ax,
    model = LogisticRegression,
    param1 = logistic_param
)

dtc_score = OptimalSentimentAnalysis(
    densed_X = densed_X,
    y_axis = y_ax,
    model = DecisionTreeClassifier,
    param1 = decTree_depth_param,
    param2 = decTree_sample_param
)

In [59]:
# Printing the cross-validation mean and standard deviation for each model
models_info = [
    ("GaussianNB", nb_score),
    ("LogisticRegression", lr_score),
    ("DecisionTreeClassifier", dtc_score)
]

for model_name, scores in models_info:
    print(f"Model: {model_name}")
    print(f"Cross-validation mean: {scores.mean():.4f}")
    print(f"Cross-validation standard deviation: {scores.std():.4f}")
    print()

Model: GaussianNB
Cross-validation mean: 0.5829
Cross-validation standard deviation: 0.0101

Model: LogisticRegression
Cross-validation mean: 0.7079
Cross-validation standard deviation: 0.0082

Model: DecisionTreeClassifier
Cross-validation mean: 0.6922
Cross-validation standard deviation: 0.0064



In the results given above we can see that LogisticRegression has the best accuracy, which is why we will be using it to perform the sentiment analysis.

Naive Bayes is an example of how too much generalization can also be bad for a model (underfitting). It has the lowest mean accuracy (0.5829) and also the highest standard deviation across the cross-validation folds (0.0101, against 0.0082 for Logistic Regression and 0.0064 for Decision Tree), so it makes the most wrong decisions and is the least stable of the three.

Logistic Regression sits between the other two in standard deviation while having the highest mean accuracy.
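The mean and standard deviation used in this comparison can be sketched as follows. The fold scores are hypothetical and not taken from the notebook; `pstdev` is used because NumPy's `std()` computes the population standard deviation by default.

```python
from statistics import mean, pstdev

# Hypothetical fold accuracies for two models, not taken from the notebook.
fold_scores = {
    "model_a": [0.70, 0.72, 0.71, 0.69, 0.73],
    "model_b": [0.60, 0.80, 0.65, 0.75, 0.75],
}

for name, scores in fold_scores.items():
    print(f"{name}: mean={mean(scores):.4f} std={pstdev(scores):.4f}")
```

Both hypothetical models have the same mean accuracy, but the second one varies far more from fold to fold, which is why the standard deviation is worth reporting next to the mean.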

We continue below with the best-performing model and proceed to train and test it.

In [60]:
def select_best_model(model_scores_list):
    best_model = None
    best_mean_score = -1
    best_params = None

    for model, scores, params in model_scores_list:
        mean_score = scores.mean()
        if mean_score > best_mean_score:
            best_model = model
            best_mean_score = mean_score
            best_params = params

    return best_model, best_mean_score, best_params


In [61]:
# Example usage
model_scores_list = [
    (GaussianNB, nb_score, naive_param),
    (LogisticRegression, lr_score, logistic_param),
    (DecisionTreeClassifier, dtc_score, (decTree_depth_param, decTree_sample_param))
]

best_model, best_mean_score, best_params = select_best_model(model_scores_list)
print(f'Best model: {best_model.__name__} with mean score: {best_mean_score}')

Best model: LogisticRegression with mean score: 0.7079115397406053


In [66]:
def best_model_train_and_test(train_x, train_y, test_x, test_y, model_class, params):
    if model_class == DecisionTreeClassifier:
        max_depth, min_samples_split = params
        model = model_class(max_depth=max_depth, min_samples_split=min_samples_split)
    elif model_class == GaussianNB:
        var_smoothing = params
        model = model_class(var_smoothing=var_smoothing)
    elif model_class == LogisticRegression:
        C_param = params
        model = model_class(C=C_param)

    # Train the model
    model.fit(train_x, train_y)

    # Test the model
    test_predictions = model.predict(test_x)

    accuracy = accuracy_score(test_y, test_predictions)
    report = classification_report(test_y, test_predictions)

    return (f"{model_class.__name__}:\nAccuracy: {accuracy:.4f}\n{report}\n")


In [67]:
def evaluate_best_model():
    train_sentences, train_sentiments = preprocessing(data_reading(json_training_data))
    test_sentences, test_sentiments = preprocessing(data_reading(json_test_data))

    x_train, y_train, vectorizer, labelizer = vectornihilation(train_sentences, train_sentiments)
    x_test = vectorizer.transform(test_sentences)
    y_test = labelizer.transform(test_sentiments)

    dense_x_train = x_train.toarray()
    dense_x_test = x_test.toarray()

    return best_model_train_and_test(
        dense_x_train,
        y_train,
        dense_x_test,
        y_test,
        best_model,
        best_params
    )


In [68]:
print(evaluate_best_model())

LogisticRegression:
Accuracy: 0.7153
              precision    recall  f1-score   support

           0       0.59      0.28      0.38       182
           1       0.74      0.91      0.82       401

    accuracy                           0.72       583
   macro avg       0.66      0.60      0.60       583
weighted avg       0.69      0.72      0.68       583




My Logistic Regression model has a precision of 59% on negative sentiments and 74% on positive ones: when it predicts "negative" it is right 59% of the time, and when it predicts "positive" it is right 74% of the time. This suggests that positive sentiments are in general more easily detected.

(In our model 1 means positive and 0 means negative)

Recall measures how many of the sentences that truly belong to a class the model actually finds. Supporting my previous conclusion, recall is only 0.28 for negatively labeled sentences against 0.91 for positive ones, so the model misses most negative sentences and positive sentiments are much easier to detect than negative ones.

The F1 score is the harmonic mean of precision and recall, so it shows how well the model balances the two: 0.38 for the negative class and 0.82 for the positive class.

Another factor in my conclusion is the support column, which shows how many test sentences belong to each class: 182 negative and 401 positive. The data is mostly positive. If the training data is similarly skewed, the model had fewer negative examples to learn from, which helps explain its low recall on negative sentences.
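The numbers in the report can be recomputed by hand from raw counts. In the sketch below, the counts are reconstructed from the rounded report (accuracy, recall and support), not printed by the notebook, so they should be read as an assumption; the helper `class_metrics` is hypothetical.

```python
def class_metrics(true_positive: int, false_positive: int, false_negative: int):
    """Return (precision, recall, f1) for one class from its raw counts."""
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts reconstructed from the rounded report, not printed by the notebook:
# 51 of 182 negatives (class 0) and 366 of 401 positives (class 1) are correct.
negative = class_metrics(true_positive=51, false_positive=35, false_negative=131)
positive = class_metrics(true_positive=366, false_positive=131, false_negative=35)

print([round(value, 2) for value in negative])  # [0.59, 0.28, 0.38]
print([round(value, 2) for value in positive])  # [0.74, 0.91, 0.82]
print(round((51 + 366) / 583, 4))               # 0.7153
```

Precision divides the correct predictions for a class by everything predicted as that class, while recall divides them by everything that truly belongs to it; F1 is the harmonic mean of the two.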