In this task I have several steps to perform as part of my sentiment analysis process:
## Sentiment Analysis process:
    1.  Analyze the data and understand how I should handle it.
    2.  Load the necessary data and filter out redundant information.
    3.  Prepare the data so that it fits the models I will be using. This step contains several subprocesses:
        a. Tokenizing the data.
        b. Sanitizing the data by removing the stopwords, punctuation and numbers, which would only reduce the performance of the sentiment analysis model.
        c. Lemmatizing the data so that different inflections of a word map to the same feature.
        d. Vectorizing the data into the numeric matrix that the sentiment analysis models take as input.
    4. Finding the best algorithm.
        a. I try different algorithms with cross-validation and use the results to find which one fits the data best.
        b. In addition, I need to find the best parameters for each algorithm. I decide later how to search for them.
    5. Training the model with the preprocessed data.
    6. Testing the model with the preprocessed data.

Finally, I evaluate the results to see whether my score is sufficient.

In [4]:
import json
import nltk
import spacy
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# These are for the sentiment analysis, more precisely for comparing the different algorithms with each other.
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

In [5]:
# Inspecting the data that I will handle:
def data_analysys(file_name) -> None:
    keys_in_data = set()
    # The with-statement closes the file automatically, so no explicit close() is needed.
    with open(f'3class/{file_name}', "r") as training_data:
        train_array = json.load(training_data)

    print("The size of the data is:", len(train_array))

    # Collect every key that occurs in any of the records.
    for item in train_array:
        keys_in_data.update(item.keys())

    print(keys_in_data)

Using the function defined above I can extract all the different keys from the dataset without having to inspect the .json document manually:

In [6]:
data_analysys("train.json") 

The size of the data is: 7973
{'text', 'label', 'sent_id'}


I will only use the text and label fields of this data, so I continue by defining a function that extracts exactly those parts.
"data_reading()" extracts the data from the provided JSON file in a structured way: ["sentence", "sentiment"]. I consider the data simple enough that I decided not to visualize it any further.

In [7]:
def data_reading(file_name) -> tuple:
    # To be able to detect overfitting, I train on only 75% of the data
    # and hold out the remaining 25% for validation.
    message_data = []
    with open(f'3class/{file_name}', "r") as data:
        data_array = json.load(data)

    # Index where the training part ends and the validation part begins.
    split_index = round(0.75 * len(data_array))

    # data_array is a list of dictionaries, each with three keys: sent_id, text, label.
    for record in data_array:
        message_data.append([record["text"], record["label"]]) # This appends [text, label] to the list.

    # Slicing at a single index keeps the two parts disjoint and covers all the data.
    train = message_data[:split_index]
    validate = message_data[split_index:]

    return train, validate
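
Slicing the list at a fixed index assumes that the records in the file come in no particular order. If the file happened to be sorted, for example by label, the validation part could get a different label distribution than the training part. Below is a minimal sketch of a shuffled split, assuming shuffling is acceptable for this dataset. The helper name `shuffled_split`, the fixed seed and the toy rows are my own and not part of the provided data:

```python
import random

def shuffled_split(rows, train_fraction=0.75, seed=0):
    """Hypothetical helper: shuffle a copy of rows, then split it at train_fraction."""
    shuffled = list(rows)                  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    split_index = round(train_fraction * len(shuffled))
    return shuffled[:split_index], shuffled[split_index:]

# Toy [text, label] pairs standing in for the real data.
rows = [[f"sentence {i}", "Positive" if i < 4 else "Negative"] for i in range(8)]
train, validate = shuffled_split(rows)
print(len(train), len(validate))
```

Calling the function again with the same seed gives the same split, so results stay comparable between runs.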

Later in the project I struggled to remove the non-alphabetic characters from the dataset, and even more to do so efficiently, without accessing the data more often than necessary. So I decided to do tokenization, sanitization and lemmatization in one function and to name that function preprocessing():

In [8]:
def preprocessing(data):
    sentances = []
    sentiments = []
    lemmatizer = spacy.load("nb_core_news_sm") # Norwegian spaCy pipeline, used here for tokenization and lemmatization.

    for chunk in data: # chunk is one unedited [sentence, sentiment] pair from data.
        that_text_part_of_the_chunk, that_sentiment_part_of_the_chunk = chunk[0], chunk[1]

        # pre_lemmed is the sentence as a spaCy Doc. Its tokens carry the attributes that I use below to sanitize my data.
        pre_lemmed = lemmatizer(that_text_part_of_the_chunk)
        lemmed = []

        for root_word in pre_lemmed:
            # Here I sanitize the data by keeping only the tokens that are not punctuation, currency symbols, numbers, whitespace or stopwords.
            # I found this much more accurate than comparing every word against a hand-made list of symbols.
            if (not root_word.is_punct
                    and not root_word.is_currency
                    and not root_word.is_digit
                    and not root_word.is_space
                    and not root_word.is_stop
                    and not root_word.like_num):

                lemmaRoot = root_word.lemma_ # Reduce the word to its lemma.
                lower_lemmaRoot = lemmaRoot.lower() # Lowercase so that capitalized and lowercase forms count as the same word.
                lemmed.append(lower_lemmaRoot) # Append it to the list that represents the sentence.

        # When the list for the sentence is complete, join the lemmas into one string:
        sentance = ' '.join(lemmed)
        sentances.append(sentance) # And build the list of sentences.
        sentiments.append(that_sentiment_part_of_the_chunk)

    return sentances, sentiments

I realized that some sentences may consist only of stopwords, so I need a function that drops the sentences that are empty after preprocessing, together with their sentiments:

In [9]:
def check(sentances, sentiments):
    fixed_sentances = []
    fixed_sentiments = []

    # Keep only the pairs whose sentence is not empty after preprocessing.
    for sentance, sentiment in zip(sentances, sentiments):
        if len(sentance) != 0:
            fixed_sentances.append(sentance)
            fixed_sentiments.append(sentiment)

    return fixed_sentances, fixed_sentiments


Before creating the function I struggled to access the data efficiently. I wanted to separate the sanitizing, tokenizing, lemmatizing and vectorizing of the data, but I quickly realized that this would cost a lot of computation time, since I would have to pass over the same data multiple times in nested for loops. So instead I created preprocessing(), which tokenizes, sanitizes and lemmatizes the sentences in one pass. In addition, the function separates the preprocessed data into sentences and sentiments, which are then ready to be vectorized.

After running the data through preprocessing(), it can easily be vectorized, since the function separates the sentences from their sentiments.

So I perform the vectorization using CountVectorizer() and LabelEncoder(), because that seemed the most straightforward of the approaches I came across. I considered whether to use TF-IDF or Bag-of-Words vectorization, but I concluded that TF-IDF is better suited to larger datasets with a variety of lengths and inputs. Here I have three prepared documents in the same format, which will not vary as much as real-world data would, so I chose Bag-of-Words, which is what CountVectorizer() implements.
My implementation is shown below:

In [10]:
def vectornihilation(x_axis, y_axis):
    vectorizer = CountVectorizer()
    labelizer = LabelEncoder()
    X_matrix = vectorizer.fit_transform(x_axis)
    Y_matrix = labelizer.fit_transform(y_axis)

    return X_matrix, Y_matrix

This function gives me a sparse matrix of token counts as X and a NumPy array of integer-encoded labels as y.
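
To make it concrete what a Bag-of-Words vectorizer computes, here is a simplified pure-Python sketch. It only splits on whitespace, which is not exactly how CountVectorizer() tokenizes, and the helper names `fit_vocabulary` and `count_vector` as well as the toy sentences are made up for this illustration. It also shows that a vocabulary learned from the training sentences can be reused on new sentences, where unknown words are simply ignored:

```python
from collections import Counter

def fit_vocabulary(sentences):
    """Map every distinct word in the training sentences to a column index."""
    words = sorted({word for sentence in sentences for word in sentence.split()})
    return {word: index for index, word in enumerate(words)}

def count_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs; unknown words are skipped."""
    counts = Counter(sentence.split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector

# Toy preprocessed sentences (already lemmatized and lowercased).
train_sentences = ["god film", "kjedelig film kjedelig slutt"]
vocabulary = fit_vocabulary(train_sentences)
train_vectors = [count_vector(s, vocabulary) for s in train_sentences]

# A new sentence is counted against the same vocabulary.
new_vector = count_vector("god slutt fantastisk", vocabulary)
print(vocabulary)
print(train_vectors)
print(new_vector)
```

Reusing the training vocabulary matters later on: the validation sentences have to be mapped to the same columns that the model was trained on.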

Now that I have preprocessed data, I can use it together with GridSearchCV() to find the best algorithm, and the best parameters for that algorithm, for predicting the sentiments of the sentences.

I do that by running an exhaustive search over the parameters of each algorithm.

I chose these three algorithms because I have read that they are the ones most often tried for sentiment analysis.
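
Exhaustive search means that every combination of the listed parameter values is tried, and each combination is fitted once per cross-validation fold. Below is a small sketch of how such a grid expands, using a decision tree grid with five depths and four split sizes and five folds. The helper name `expand_grid` is my own; GridSearchCV() does this expansion internally:

```python
from itertools import product

def expand_grid(param_grid):
    """Hypothetical helper: list every combination of the values in param_grid."""
    names = list(param_grid)
    return [dict(zip(names, values)) for values in product(*param_grid.values())]

tree_grid = {'max_depth': [2, 4, 6, 8, 10], 'min_samples_split': [2, 5, 10, 20]}
combinations = expand_grid(tree_grid)
folds = 5

print(combinations[0])
print(len(combinations), "combinations,", len(combinations) * folds, "cross-validation fits")
```

The number of fits grows multiplicatively with every extra parameter, which is worth keeping in mind when the search takes long to run.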

In [11]:
def NaiveBayesScore(x_ax, y_ax):
    hypeOmeters = {'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]}

    grid = GridSearchCV(GaussianNB(), hypeOmeters, cv=5)
    grid.fit(X=x_ax, y=y_ax)
    
    optimal_param = grid.best_params_['var_smoothing']
    
    return optimal_param

In [12]:
def LogisticRegressionScore(x_ax, y_ax):
    C_meters = {'C': [0.001, 0.01, 0.1, 1, 10]}

    grid = GridSearchCV(LogisticRegression(), C_meters, cv=5) 
    grid.fit(X=x_ax, y=y_ax)
    
    optimal_param = grid.best_params_['C']

    return optimal_param

In [13]:
def DecissionTreeScore(x_ax, y_ax):
    grido_meters = { 'max_depth': [2, 4, 6, 8, 10], 'min_samples_split': [2, 5, 10, 20] }
    grid = GridSearchCV(DecisionTreeClassifier(), grido_meters, cv=5)

    grid.fit(X=x_ax, y=y_ax)

    optimal_p1, optimal_p2 = grid.best_params_['max_depth'], grid.best_params_['min_samples_split']

    return optimal_p1, optimal_p2


Here I create a function that tests the different algorithms with their best parameters and prints the score of each, so that I can pick the best algorithm with its best parameters.

In [14]:
# Cross-validation:
def OptimalSentimentAnalysis(file):
    data, validation = data_reading(file) #TODO: Remove the limit.
    sentances, sentiments = preprocessing(data= data)
    sentances_secured, sentiments_secured = check(sentances, sentiments) 
    x_axis, y_axis = vectornihilation(sentances_secured, sentiments_secured)
    densed_X = x_axis.toarray() # GaussianNB does not accept sparse input, so I convert the matrix to a dense array.

    Logistic_param = LogisticRegressionScore(x_ax=densed_X, y_ax=y_axis)
    Naive_param = NaiveBayesScore(x_ax=densed_X, y_ax=y_axis)
    DecTree_depth_param, DecTree_sample_split_param = DecissionTreeScore(x_ax=densed_X, y_ax=y_axis)

    Cross_Naive_score = cross_val_score(GaussianNB(var_smoothing=Naive_param), densed_X, y_axis)
    Cross_Logistic_score = cross_val_score(LogisticRegression(C=Logistic_param), densed_X, y_axis)
    Cross_DecTree_score = cross_val_score(DecisionTreeClassifier(max_depth=DecTree_depth_param, min_samples_split=DecTree_sample_split_param), densed_X, y_axis)
    
    print(f'Naive Bayes algorithm scored: {Cross_Naive_score.mean()}')
    print(f'LogisticRegression algorithm scored: {Cross_Logistic_score.mean()}')
    print(f'Decision Tree scored: {Cross_DecTree_score.mean()}')
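
When reading the three scores it helps to have a reference point. Below is a sketch of a majority-class baseline evaluated with a simple k-fold split in plain Python. The helper names and the toy labels are my own, and this version cuts contiguous folds, whereas cross_val_score() stratifies the folds for classifiers, so the numbers are only comparable in spirit:

```python
from collections import Counter

def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def majority_baseline_score(labels, k=5):
    """Mean accuracy of always predicting the most common training label."""
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        correct = sum(1 for i in test if labels[i] == majority)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

# Toy integer-encoded labels standing in for the three sentiment classes.
labels = [1, 1, 2, 1, 0, 1, 2, 1, 1, 0]
baseline = majority_baseline_score(labels, k=5)
print(baseline)
```

A model that scores close to this baseline has learned little beyond the most common label.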

In [15]:
OptimalSentimentAnalysis("train.json")

KeyboardInterrupt: 

Here I briefly explain what the different parameters do in the different models:
var_smoothing is the portion of the largest feature variance that is added to all variances in the Gaussian Naive Bayes model, which keeps the calculation numerically stable.
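
A small sketch of the arithmetic behind var_smoothing, with made-up variances: a fraction var_smoothing of the largest variance is added to every variance, so a feature that never varies no longer causes a division by zero in the Gaussian density. The function name `smoothed_variances` is my own and this is a simplification of what scikit-learn does internally:

```python
def smoothed_variances(variances, var_smoothing=1e-9):
    """Add var_smoothing times the largest variance to every variance."""
    epsilon = var_smoothing * max(variances)
    return [variance + epsilon for variance in variances]

# Made-up per-feature variances; the second feature never varies.
variances = [4.0, 0.0, 0.25]
for var_smoothing in (1e-9, 1e-5):
    print(var_smoothing, smoothed_variances(variances, var_smoothing))
```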

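C is the inverse of the regularization strength in the Logistic Regression model: a smaller C means stronger regularization and therefore a simpler model. The model itself is fitted by maximum likelihood, that is, by finding the weights that maximize the likelihood of observing Y given X, and C controls how strongly large weights are penalized during that fit.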
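
Below is a sketch of how C can enter the objective, assuming the common L2-penalized form for two classes with labels -1 and +1, where the penalty is half the squared weight norm and the data term is multiplied by C. The weights and data points are made up, and `penalized_loss` is a hypothetical helper, not a scikit-learn function:

```python
import math

def penalized_loss(weights, bias, samples, labels, C):
    """0.5 * ||w||^2 + C * sum of logistic losses, for labels in {-1, +1}."""
    penalty = 0.5 * sum(w * w for w in weights)
    data_term = 0.0
    for features, label in zip(samples, labels):
        margin = label * (sum(w * x for w, x in zip(weights, features)) + bias)
        data_term += math.log(1.0 + math.exp(-margin))
    return penalty + C * data_term

# Made-up data: two samples with two features each.
samples = [[1.0, 0.0], [0.0, 1.0]]
labels = [1, -1]
weights, bias = [2.0, -2.0], 0.0

for C in (0.001, 1, 10):
    print(C, penalized_loss(weights, bias, samples, labels, C))
```

With a small C the penalty term dominates, so the optimizer prefers small weights. With a large C, fitting the training data dominates.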

max_depth is the maximum depth of the decision tree. A decision tree can become very complex and can overfit the training data if it is allowed to grow too deep. So, setting a maximum depth for the tree can help to prevent overfitting and improve generalization to new data.

min_samples_split is the minimum number of samples a node must contain before it may be split further. Keeping this parameter from being too small helps prevent overfitting, because the tree cannot create splits that fit only a handful of samples, which gives a more general model.

