In this task I will have a couple of tasks to performe in my process of sentiment analysis:
## Sentiment Analysis process:
    1.  Analyze the Data and understand how I should handle it.
    2.  Load the necessary data and filter out redundant information.
    3.  Handle the data and prepare it such that it fits the models I will be using. This step will contain a couple of subprocess:
        a. Tokenizing the data.
        b. Sanitizing the data by removing all the stopwords, punctuation and numbers that will only reduce the performance of the Sentiment Analysis model. 
        c. Lemmatizing the data in order to make it more understandable for the model.
        d. Vectorizing in order to fit the data to an graph that I will be using when performing the real Sentiment Analysis. 
    4. Finding the most optimal algorithm.
        a. I will have to try different algorithms with cross validation and use their results in order to find which one fits the data the best. 
        b. In addition I will need to find the most optimal parameters. I will see later how I choose to search for them. 
    5. Training the model with the preprocessed data.
    6. Testing the model with the preprocessed data. 

And then I will be evaluating the data, to see if my score is sufficient or not. 

In [1]:
import json
import nltk
import spacy
import numpy as np
import requests


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# This is for the sentimental analysis more exactly for comparing the different algorithms with eachother. 
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

!pip install https://github.com/explosion/spacy-models/releases/download/nb_core_news_sm-3.1.0/nb_core_news_sm-3.1.0.tar.gz

Collecting https://github.com/explosion/spacy-models/releases/download/nb_core_news_sm-3.1.0/nb_core_news_sm-3.1.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/nb_core_news_sm-3.1.0/nb_core_news_sm-3.1.0.tar.gz (16.1 MB)
                                              0.0/16.1 MB ? eta -:--:--
                                              0.4/16.1 MB 8.3 MB/s eta 0:00:02
     --                                       1.2/16.1 MB 12.6 MB/s eta 0:00:02
     -----                                    2.1/16.1 MB 15.1 MB/s eta 0:00:01
     -------                                  3.0/16.1 MB 15.8 MB/s eta 0:00:01
     ----------                               4.0/16.1 MB 17.2 MB/s eta 0:00:01
     ------------                             5.2/16.1 MB 18.5 MB/s eta 0:00:01
     --------------                           6.0/16.1 MB 19.1 MB/s eta 0:00:01
     -----------------                        6.9/16.1 MB 19.2 MB/s eta 0:00:01
     --------------------       

ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\Users\\oskar\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python310\\site-packages\\~ydantic\\annotated_types.cp310-win_amd64.pyd'
Check the permissions.



In [2]:
def data_analysys(json_data) -> list:
    message_data = []

    keys_in_data = set()
    print("The size of the data is:", len(json_data))

    for item in json_data:
        keys = item.keys()
        keys_in_data.update(keys)
        message_data.append((item['text'], item['label']))

    print(keys_in_data)

Using the function defined above I am able to extracts all the different categories from the dataset, without needing to do manual looking into the .json document:

In [3]:
# Replace this URL with the raw URL of the file you want to fetch from GitHub
url = "https://raw.githubusercontent.com/ltgoslo/norec_sentence/main/binary/train.json"

# Fetch the content of the file
response = requests.get(url)
response.raise_for_status()  # Raise an exception if there was an error fetching the file

# Get the content of the file as a string
json_data = json.loads(response.text)

data_analysys(json_data)

The size of the data is: 3894
{'label', 'text', 'sent_id'}


I will only be using the text and label part of this data so I continue in defining a function that will extracts exactly that part of the data. 
"data_reading()" extracts the data from the json file provided in a structured way: ["sentance", "sentiment"]. I consider the data not to complex so I decided to not bother with visualizing the data to much.

In [4]:
def data_reading(json_data) -> list:
    # Since I don't want to overfit, I must 75% only use 75% of the data.
    # So I have to find how much of the data is supposed to go as train data and how much of the data should go to validation:
    message_data = []
    data = json_data

    # Simply implement the division here:
    full = len(data)
    seventy_prcnt = 0.75 * full
    rest_prcnt = full - seventy_prcnt

    # data is an array with dictionaries, each having three elements each.
    for category in data:
        # category is the index of the list with the three elements: sent_id, text, label
        message_data.append([category["text"], category["label"]])  # This appends [text, label] to the array.

    test = message_data[:round(seventy_prcnt)]
    validate = message_data[-round(rest_prcnt):]

    return test, validate


Later when working with this project I struggled with removing the non-alphabetic characters from the dataset and more so to do that in an efficient way. Without needing to access the data unnecessary many times. So I came to a consensus with myself in doing tokenization, sanitazation and lemmatization in one function and name that function: preprocessing()

In [5]:
def preprocessing(data):
    sentences = []
    sentiments = []
    lemmatizer = spacy.load("nb_core_news_sm")  # Norwegian lemmatization model.

    for chunk in data:
        text, sentiment = chunk[0], chunk[1]
        pre_lemmatized = lemmatizer(text)

        lemmatized = [root_word.lemma_.lower() for root_word in pre_lemmatized
                      if (not root_word.is_punct
                          and not root_word.is_currency
                          and not root_word.is_digit
                          and not root_word.is_space
                          and not root_word.is_stop
                          and not root_word.like_num)]

        sentence = ' '.join(lemmatized)
        sentences.append(sentence)
        sentiments.append(sentiment)

    return sentences, sentiments


I figured that some sentances may consist of only stopwords so I will need a function where I can check that the sentance list is not empty:

Before creating the function I was struggling with accessing the data in an efficient way. I wanted to divide: the sanitizing, tokenizing, lemmatizing and vectorizing of the data. But I quickly figured that it would have cost a lot of computation time since I would have to access the same data multiple times in nested for loops. So instead I created preprocessing() which did sanitizing with tokenizing, and then lemmatized the sentances in one go. In addition to that. This function divides the preprocessed data into sentances and sentiments, which are perfectly prepared to be vectorized.  

After running the data through the preprocessing() function my data can be easily vectorized since the function separates sentances and theirs sentiments:

So I performe the vectorization using CountVectorizer() & LabelEncoder() because that seemed most trivial of every other way that I stumbled upon. I considered if I should use TF-IDF or Bag-of-Words vectorization. But I concluded with TF-IDF seemed to be more fit for bigger datas with variaty of lenghts and inputs. Meanwhile here I have tree prepared documents with the same format which will not vary as much as data from the real world would. 
And my implementation is shown bellow:

In [6]:
def vectornihilation(x_axis, y_axis):
    vectorizer = CountVectorizer()
    labelizer = LabelEncoder()
    X_matrix = vectorizer.fit_transform(x_axis)
    Y_matrix = labelizer.fit_transform(y_axis)

    return X_matrix, Y_matrix

And this function gives me a matrix as the x vectors and a numpy array with vector representations of the labels as y.

Now that I have preprocessed data I can use that tougheter with the GridSearchCV() method in order to find the best algorithm and the best parameters for that algorithm to predict the sentiments of the sentances. 

I do that by implementing exhoustive search on the parameters for each of the algorithms.

I choose these tree algorithms because I have read that those are most often tried when performing sentiment analysis. 

In [7]:
def NaiveBayesScore(x_ax, y_ax):
    hypeOmeters = {'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]}

    grid = GridSearchCV(GaussianNB(), hypeOmeters, cv=5)
    grid.fit(X=x_ax, y=y_ax)
    
    optimal_param = grid.best_params_['var_smoothing']
    
    return optimal_param

In [8]:
def LogisticRegressionScore(x_ax, y_ax):
    C_meters = {'C': [0.001, 0.01, 0.1, 1, 10]}

    grid = GridSearchCV(LogisticRegression(), C_meters, cv=5) 
    grid.fit(X=x_ax, y=y_ax)
    
    optimal_param = grid.best_params_['C']

    return optimal_param

In [9]:
def DecissionTreeScore(x_ax, y_ax):
    grido_meters = { 'max_depth': [2, 4, 6, 8, 10], 'min_samples_split': [2, 5, 10, 20] }
    grid = GridSearchCV(DecisionTreeClassifier(), grido_meters, cv=5)

    grid.fit(X=x_ax, y=y_ax)

    optimal_p1, optimal_p2 = grid.best_params_['max_depth'], grid.best_params_['min_samples_split']

    return optimal_p1, optimal_p2


And here I will create a function that will test the different algorithm with their best parameters and present the best algorithm with it's best parameter. 

In [10]:
# Cross-validation:
def OptimalSentimentAnalysis(json_data):
    data, validation = data_reading(json_data) #TODO: Remove the limit.
    sentances, sentiments = preprocessing(data= data)
    sentances_secured, sentiments_secured = check(sentances, sentiments) 
    x_axis, y_axis = vectornihilation(sentances_secured, sentiments_secured)
    densed_X = x_axis.toarray() # I had to convert the matrix to danse array in order to process it

    Logistic_param = LogisticRegressionScore(x_ax=densed_X, y_ax=y_axis)
    Naive_param = NaiveBayesScore(x_ax=densed_X, y_ax=y_axis)
    DecTree_depth_param, DecTree_sample_split_param = DecissionTreeScore(x_ax=densed_X, y_ax=y_axis)

    Cross_Naive_score = cross_val_score(GaussianNB(var_smoothing=Naive_param), densed_X, y_axis)
    Cross_Logistic_score = cross_val_score(LogisticRegression(C=Logistic_param), densed_X, y_axis)
    Cross_DecTree_score = cross_val_score(DecisionTreeClassifier(max_depth=DecTree_depth_param, min_samples_split=DecTree_sample_split_param), densed_X, y_axis)
    
    print(f'Naive Bayes algorithm scored: {Cross_Naive_score.mean()}')
    print(f'LogisticRegression algorithm scored: {Cross_Logistic_score.mean()}')
    print(f'Decission Tree scored: {Cross_DecTree_score.mean()}')

In [11]:
# OptimalSentimentAnalysis("train.json")

Here I quickly explain what different parameters does in the different models:
var_smoothing variable chooses how smoothly the variance should be distributed in the Naive Bayes distribution model.

C is a value that regularates the complexity of the Logistic Regregression model. What I try to find is the maximum Likelyhood parameter. Which is the parameter that gives me the optimal values for the likelyhood of getting Y given an X.

max_depth is the maximum depth of the decision tree. A decision tree can become very complex and can overfit the training data if it is allowed to grow too deep. So, setting a maximum depth for the tree can help to prevent overfitting and improve generalization to new data.

min_samples_split chooses how small the decision groups can be, preventing this parameter from being to small can help in preventing overfitting by creating a more general model that fits more data. 



And now I that I have found that Logistic Regression is the model that I am supposed to use I can validate the accuracy of the model before testing it:

In [12]:
def LogisticSentimentAnalysisRegression(datafile):
    LogiReg = LogisticRegression() #Instantiating the model.

    data, validation = data_reading(datafile)[:50] 
    train_sentances, train_sentiments = preprocessing(data= data)
    x_axis, y_axis = vectornihilation(train_sentances, train_sentiments)
    densed_X = x_axis.toarray() # I had to convert the matrix to danse array in order to process it

    valid_sentances, valid_sentiments = preprocessing(data= validation)
    validX_axis, validY_axis = vectornihilation(valid_sentances, valid_sentiments)
    validDensed_X = validX_axis


    LogiReg.fit(densed_X, y_axis)

    validity = LogiReg.predict(densed_X)

    correct = 0
    for answer, solution in zip(validity, y_axis):
        # Negative: 0, Neutral: 1, Positive: 2
        if answer == solution:
            correct+=1
            print(answer, solution)

    
    score = correct/len(validity)

    print(score)

In [15]:
LogisticSentimentAnalysisRegression(json_data)

0 0
1 1
1 1
1 1
0 0
1 1
1 1
1 1
1 1
0 0
1 1
1 1
0 0
0 0
0 0
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
0 0
0 0
0 0
1 1
0 0
1 1
1 1
0 0
0 0
1 1
1 1
0 0
1 1
0 0
0 0
0 0
0 0
0 0
1 1
0 0
1 1
1 1
1 1
0 0
1 1
1 1
0 0
1 1
1 1
1 1
1 1
0 0
0 0
1 1
1 1
1 1
1 1
1 1
0 0
0 0
1 1
1 1
0 0
1 1
0 0
1 1
1 1
1 1
0 0
0 0
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
0 0
0 0
1 1
1 1
1 1
1 1
0 0
1 1
1 1
1 1
1 1
1 1
0 0
0 0
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
0 0
1 1
1 1
1 1
1 1
1 1
1 1
0 0
0 0
0 0
0 0
1 1
1 1
1 1
1 1
1 1
0 0
0 0
0 0
1 1
1 1
1 1
1 1
1 1
0 0
0 0
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
0 0
1 1
1 1
0 0
0 0
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
0 0
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
