# DIM0782 - Machine Learning (DIMAp/UFRN/2024.1)

## Preprocessing the data

### Transforming non-structured data to structured data

This is textual data, so the first step is to turn it into structured data by applying a transformer. I'm going to work mostly with BERT here.

First, I'm going to define the imports block.

In [None]:
import numpy as np
import pandas as pd
import torch
import transformers as ppb

Now reading the sentiments dataset:

In [None]:
sentiments_df = pd.read_csv("datasets/twitter_sentiment_base_original.csv", usecols=["text", "label"])
sentiments_df.head()

### Creating a Shorter Dataset (Optional Step)
---

This might be helpful if your dataset is too big to be processed in a reasonable time for this exercise. The subsequent blocks of code run with or without this block, so you can skip it if you have the computing power to process the original dataset.

In [None]:
sample_size, random_state = 4000, 42
sampled_data = []

for i in range(6): ## Since we have 6 sentiments, labeled from 0 to 5
    sampled_data.append(sentiments_df[sentiments_df['label'] == i].sample(n=sample_size, random_state=random_state))

sentiments_df = pd.concat(sampled_data)
sentiments_df

---
### End of Optional Block

Now I'm going to use a DistilBERT tokenizer and model (a smaller, distilled version of BERT) to transform the text data that I have.

In [None]:
## Loading the pretrained model and tokenizer
tokenizer = (ppb.DistilBertTokenizer).from_pretrained('distilbert-base-uncased')
model = (ppb.DistilBertModel).from_pretrained('distilbert-base-uncased')

## Tokenizing w/ BERT
sentiments_tokenized = sentiments_df["text"].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

The code above outputs the tokenized data with rows of different sizes, since each sentence has a different length. We now need to apply a process called padding to make all of the sentences the same length. We will also need an attention mask, which tells BERT which positions hold real tokens and which hold padding that it should ignore.

In [None]:
## Finding the length of the longest tokenized sentence
biggest_sentence_length = 0
for i in sentiments_tokenized.values:
    if len(i) > biggest_sentence_length:
        biggest_sentence_length = len(i)

## Getting the values padded and the attention mask
sentiments_tokenized_padded = np.array([i + [0]*(biggest_sentence_length-len(i)) for i in sentiments_tokenized.values])
attention_mask = np.where(sentiments_tokenized_padded != 0, 1, 0)
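
To make the padding step concrete, here is a minimal sketch on three made-up token-ID lists. The IDs and the `toy_` names are illustrative only, not real tokenizer output, and the sketch assumes, as the cell above does, that 0 is the padding ID.

```python
import numpy as np

# Hypothetical token-ID lists of different lengths (illustrative values only)
toy_tokenized = [[101, 7592, 102], [101, 2023, 2003, 2307, 102], [101, 102]]

# Pad every list with zeros up to the length of the longest one
max_len = max(len(ids) for ids in toy_tokenized)
toy_padded = np.array([ids + [0] * (max_len - len(ids)) for ids in toy_tokenized])

# 1 marks a real token, 0 marks padding
toy_mask = np.where(toy_padded != 0, 1, 0)

print(toy_padded)
print(toy_mask)
```

Each row of the mask sums to the original length of its sentence, which is a quick sanity check that no real token was masked out.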

Now we're finally going to process the input IDs generated in the steps above and produce the actual BERT embeddings. Given the size of the dataset, I ran multiple tests and realized that memory would be an issue, so I split the input into batches and work from there.

In [None]:
input_ids = torch.tensor(sentiments_tokenized_padded)
attention_mask = torch.tensor(attention_mask)
batch_size = 4000      # Change it as it is necessary 
all_hidden_states = [] # BERT generates what we call hidden states

def batch_generator(input_ids, attention_mask, batch_size):
    for i in range(0, len(input_ids), batch_size):
        yield input_ids[i:i+batch_size], attention_mask[i:i+batch_size]

## Processes the input over batches through the help of a generator
with torch.no_grad():
    for batch_ids, batch_mask in batch_generator(input_ids, attention_mask, batch_size):
        outputs = model(batch_ids, attention_mask=batch_mask)
        all_hidden_states.append(outputs.last_hidden_state)

## Concatenates all batch results
last_hidden_states = torch.cat(all_hidden_states, dim=0)
torch.save(last_hidden_states, 'datasets/tensor.pt')  # same path that the next cell loads from
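
The batching logic can be checked in isolation on a small array. This is only a sketch: `toy_batches` and `toy_rows` are hypothetical stand-ins for the generator and tensors above, using NumPy instead of PyTorch so that it runs without the model.

```python
import numpy as np

def toy_batches(rows, batch_size):
    # Yield consecutive slices of at most batch_size rows
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

toy_rows = np.arange(10).reshape(10, 1)   # 10 fake "sentences"
batch_shapes = [batch.shape[0] for batch in toy_batches(toy_rows, batch_size=4)]

# Concatenating the batches in order restores the original rows
rebuilt = np.concatenate(list(toy_batches(toy_rows, batch_size=4)), axis=0)

print(batch_shapes)
```

The last batch is simply shorter when the number of rows is not a multiple of `batch_size`, and lowering `batch_size` reduces peak memory at the cost of more forward passes.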

Now we can extract the features:

In [None]:
last_hidden_states = torch.load('datasets/tensor.pt')
sentiments_features = last_hidden_states[:,0,:].numpy()
labels = sentiments_df['label']
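
The slice `[:, 0, :]` keeps, for every sentence, only the hidden state at position 0, which is where the `[CLS]` token sits. A small sketch of the same indexing on a stand-in array (the `toy_` names and the tiny sizes are assumptions for illustration; the real hidden size here is 768):

```python
import numpy as np

# Stand-in for the model output: 4 sentences, 6 token positions, 8 hidden units
toy_hidden_states = np.arange(4 * 6 * 8, dtype=float).reshape(4, 6, 8)

# Keep every sentence, only position 0 (the [CLS] token), and all hidden units
toy_features = toy_hidden_states[:, 0, :]

print(toy_features.shape)
```

The result is a 2-D matrix with one row per sentence, which is the shape that the scikit-learn estimators used later expect.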

I'm also going to define this auxiliary function, which saves a dataset with the n-dimensional features produced by each technique.

In [None]:
def features_to_csv(features, labels, filename):
    df = pd.DataFrame(features, columns=[f'feature_{i}' for i in range(features.shape[1])])
    df['label'] = labels.to_numpy()
    
    df.to_csv(filename, index=False)    

## This is the line that will change depending on the file I want to generate
features_to_csv(sentiments_features, labels, 'datasets/twitter_sentiment_base_ready.csv')

## Reduction of Instances

As part of this work, I am required to reduce the number of instances in my dataset.

The first dataset I was working with had about 400,000 records, and I could not run the tokenization and embedding steps in a timely manner. I had to make a drastic reduction in size, so the dataset is no longer that big. My goal here is just to sample 80% of the already reduced dataset (3,200 of the 4,000 records per label).

In [None]:
## Variables for the sampling conditions
sample_size, random_state = 3200, 42
sampled_data = []
sentiments_processed_df = pd.read_csv('datasets/twitter_sentiment_base_ready.csv')

for i in range(6): ## Since we have 6 sentiments, labeled from 0 to 5
    sampled_data.append(sentiments_processed_df[sentiments_processed_df['label'] == i].sample(n=sample_size, random_state=random_state))

sampled_sentiments_df = pd.concat(sampled_data)
sampled_sentiments_df.to_csv('datasets/twitter_sentiments_base_sampled.csv', index=False)
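
As an alternative to the explicit loop, pandas can draw the same number of rows from every label in a single call with `groupby(...).sample(...)`. The sketch below uses a hypothetical miniature dataframe (`toy_df`), not the real dataset, and is not guaranteed to pick the same rows as the loop above.

```python
import pandas as pd

# Hypothetical miniature dataset: 3 labels with 5 rows each
toy_df = pd.DataFrame({
    "feature_0": range(15),
    "label": [0] * 5 + [1] * 5 + [2] * 5,
})

# Draw the same number of rows from every label in one call
toy_sampled = toy_df.groupby("label").sample(n=4, random_state=42)

print(toy_sampled["label"].value_counts().sort_index())
```

Either form keeps the classes balanced, which matters here because accuracy is the only metric reported for most of the classifiers.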

## Attribute selection (w/ Decision Trees)

Decision trees are quite useful for the purpose of doing attribute (or feature) selection because they inherently perform feature selection by splitting nodes based on the most informative features. By using a decision tree, we can identify which attributes (or features) contribute most to predicting the output, making it a practical approach for reducing dimensionality and improving model interpretability.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

sentiments_embeddings_train, sentiments_embeddings_test, sentiments_labels_train, sentiments_labels_test = train_test_split(sentiments_features, labels, test_size=0.2, random_state=42)

# Initialize the classifier
decision_tree = DecisionTreeClassifier(random_state=42)

# Fit the model
decision_tree.fit(sentiments_embeddings_train, sentiments_labels_train)

# Get feature importances
importances = decision_tree.feature_importances_

# Now we sort the importances in descending order and select the top X features - (X: originally 50)
indices = np.argsort(importances)[::-1]
top_features = indices[:50]

# Finally, we keep only the columns of the selected top features
sentiments_embeddings_train_reduced = sentiments_embeddings_train[:, top_features]
sentiments_embeddings_test_reduced = sentiments_embeddings_test[:, top_features]
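
To see why the importances are a reasonable ranking signal, here is a minimal sketch on synthetic data where, by construction, only one feature carries information about the label. All `toy_` names and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: 5 random features, where only feature 2 determines the label
toy_X = rng.normal(size=(200, 5))
toy_y = (toy_X[:, 2] > 0).astype(int)

toy_tree = DecisionTreeClassifier(random_state=42).fit(toy_X, toy_y)
toy_importances = toy_tree.feature_importances_

# Rank features from most to least important and keep the top 2
toy_ranking = np.argsort(toy_importances)[::-1]
toy_top = toy_ranking[:2]

print(toy_importances)
print(toy_top)
```

One caveat worth keeping in mind: features the tree never splits on all get an importance of exactly 0, so their relative order in the ranking is arbitrary. If the fitted tree uses fewer features than the number requested, the tail of the selection is not meaningful.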

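I'm adding this auxiliary block to merge the results back into a single sentiments embeddings array. Note that the rows come out in train-then-test order, not the original order, so the labels must be concatenated the same way (`sentiments_labels_train` followed by `sentiments_labels_test`) before saving.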

In [None]:
sentiments_embeddings_decision_tree = np.concatenate((sentiments_embeddings_train_reduced, sentiments_embeddings_test_reduced))
sentiments_embeddings_decision_tree.shape

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
plt.title("Attribute selection using Decision Tree")
plt.bar(range(sentiments_embeddings_train_reduced.shape[1]), importances[top_features], align='center')
plt.xticks(range(sentiments_embeddings_train_reduced.shape[1]), top_features, rotation=90)
plt.xlim([-1, sentiments_embeddings_train_reduced.shape[1]])

plt.show()

## Principal Component Analysis (PCA)

Running Principal Component Analysis (PCA) is a powerful method to reduce the dimensionality of our data while retaining as much variability as possible. 

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler()
sentiments_features_scaled = scaler.fit_transform(sentiments_features)

# Choose the number of components (example: reduce dimensions to keep 95% of the variance)
pca = PCA(n_components=0.95)
sentiments_pca = pca.fit_transform(sentiments_features_scaled)

print(f"Number of components kept: {pca.n_components_}")

Now, let's examine the explained variance of the PCA. This will tell us how much information (variability) was retained.

In [None]:
explained_variance = pca.explained_variance_ratio_
print(f"Explained variance by each component: {explained_variance}")
print(f"Total variance explained: {sum(explained_variance)}")
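
Passing a fraction such as `0.95` as `n_components` asks PCA to keep the smallest number of components whose cumulative explained-variance ratio reaches that fraction. The sketch below reproduces that selection rule with plain NumPy on synthetic data. The function name `components_for_variance` and the data are assumptions for illustration, and this is not scikit-learn's actual implementation.

```python
import numpy as np

def components_for_variance(X, threshold):
    """Return (k, ratios): the smallest k whose cumulative explained-variance ratio reaches threshold."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values are proportional to the variance along each component
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    ratios = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ratios)), ratios

rng = np.random.default_rng(0)
# Synthetic data whose columns have very different spreads
toy_X = rng.normal(size=(1000, 4)) * np.array([10.0, 5.0, 1.0, 0.1])

k_95, toy_ratios = components_for_variance(toy_X, 0.95)
k_70, _ = components_for_variance(toy_X, 0.70)
print(k_95, k_70)
```

Lowering the threshold can only keep the same number of components or fewer, which is the knob used when the PCA is re-run with a smaller fraction.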

Let us now visualize the variance explained by each component with a scree plot.

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
plt.plot(range(1, pca.n_components_ + 1), pca.explained_variance_ratio_, marker='o', linestyle='--')  # x-range follows the fitted number of components
plt.title('Scree Plot')
plt.xlabel('Number of Components')
plt.ylabel('Variance Explained')
plt.show()

The scree plot is a graphical representation of the variance explained by each principal component in the analysis.

- The x-axis represents the number of principal components. In the plot, it ranges from 1 to over 200.
- The y-axis represents the variance explained by each principal component, as a proportion of the dataset's total variance.
- Elbow method: typically, the goal of a scree plot is to identify the 'elbow' of the graph, the point at which the variance explained by each additional component drops off and becomes minimal. This point is considered a good cut-off for the number of components because beyond it each extra component gives diminishing returns on explained variance.

In our plot, there is a steep drop after the first few components, then the decline slows significantly. This suggests that the first few components capture a substantial amount of the information (variance) in the dataset. After this initial steep drop, the plot levels off around the 40-component mark, indicating that each additional component contributes less and less.

It's worth noting that the first component explains significantly more variance than the subsequent ones. In practical terms, this suggests one dominant pattern in the data, with each additional component contributing less to explaining the dataset.

Based on this plot, we might retain only the components before the curve starts to flatten, reducing dimensionality while keeping most of the information.

That's why I'm going to run the PCA algorithm again, this time with a lower `n_components` value.

In [None]:
pca_round_two = PCA(n_components=0.70)
sentiments_pca = pca_round_two.fit_transform(sentiments_features_scaled)

print(f"Number of components kept: {pca_round_two.n_components_}")

Now let's generate this scree plot again:

In [None]:
plt.figure(figsize=(10, 8))
plt.plot(range(1, pca_round_two.n_components_ + 1), pca_round_two.explained_variance_ratio_, marker='o', linestyle='--')  # x-range follows the fitted number of components
plt.title('Scree Plot')
plt.xlabel('Number of Components')
plt.ylabel('Variance Explained')
plt.show()

## Multinomial Logistic Regression

Given that the dataset has 6 possible sentiments (labels), plain binary logistic regression does not apply, because the decision is not binary. Instead, we can investigate the behavior of a multinomial logistic regression classifier.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, accuracy_score

# Percentage test split method - can adjust percentage
train_features, test_features, train_labels, test_labels = train_test_split(sentiments_features, labels, test_size=0.3, random_state=42)

# Invoking the logistic regression model
logistic_regression_sentiments = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1500)

logistic_regression_sentiments.fit(train_features, train_labels)

# Predict on the test set
y_pred = logistic_regression_sentiments.predict(test_features)

# Evaluate the model
accuracy = accuracy_score(test_labels, y_pred)
print(f'Accuracy: {accuracy:.2f}')

print(classification_report(test_labels, y_pred))
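
What makes the model multinomial is that it produces one linear score per class and turns the six scores into probabilities with the softmax function. Here is a minimal NumPy sketch of that last step. The scores in `toy_scores` are made up for illustration and do not come from the fitted model.

```python
import numpy as np

def softmax(scores):
    # Subtract the row-wise maximum for numerical stability
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical linear scores for 2 samples over 6 sentiment classes
toy_scores = np.array([
    [2.0, 0.5, 0.1, -1.0, 0.0, 0.3],
    [0.0, 0.0, 3.0, 0.0, 0.0, 0.0],
])

toy_probs = softmax(toy_scores)
toy_pred = toy_probs.argmax(axis=1)   # predicted class = highest probability
print(toy_probs.round(3))
print(toy_pred)
```

Each row of probabilities sums to 1, and the predicted label is the class with the largest probability.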

## k-Nearest Neighbors (kNN)

Let us implement the k-Nearest Neighbors algorithm and understand how it behaves with our sentiments dataset.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Percentage test split method - can adjust percentage
train_features, test_features, train_labels, test_labels = train_test_split(sentiments_features, labels, test_size=0.3, random_state=42)

# Create the KNN model with a specified number of neighbors; e.g., k=5
knn = KNeighborsClassifier(n_neighbors=5)

Now let's actually run the kNN algorithm with the parameters set above. This version uses the percentage split, but I'll also show it using 10-fold cross-validation.

In [None]:
# Training the model
knn.fit(train_features, train_labels)

# Predicting the test set
labels_predict = knn.predict(test_features)

print("Accuracy of kNN (5 neighbors, 70/30 split):", accuracy_score(test_labels, labels_predict))
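
For intuition, the prediction rule behind kNN fits in a few lines: measure the distance from the query to every training point, take the k closest, and let them vote. The sketch below is a simplified illustration with Euclidean distance and made-up 2-D points. `knn_predict` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest labels
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]

toy_train_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_train_y = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(toy_train_X, toy_train_y, np.array([0.3, 0.3]), k=3)
pred_near_five = knn_predict(toy_train_X, toy_train_y, np.array([4.8, 5.0]), k=3)
print(pred_near_origin, pred_near_five)
```

Because every prediction scans the whole training set, kNN gets slow on large, high-dimensional data such as 768-dimensional embeddings.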

kNN via the 10-fold cross-validation method:

In [None]:
from sklearn.model_selection import KFold

# Define the KFold cross-validator
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Create the KNN model
knn = KNeighborsClassifier(n_neighbors=5)

# Prepare to collect accuracy scores
accuracies = []

knn_labels = np.array(labels)

# Perform 10-fold CV
for train_index, test_index in kf.split(sentiments_features):
    sentiments_train, sentiments_test = sentiments_features[train_index], sentiments_features[test_index]
    labels_train, labels_test = knn_labels[train_index], knn_labels[test_index]

    # Train the kNN Model inside the current fold
    knn.fit(sentiments_train, labels_train)

    # Predict and evaluate the model
    labels_predict = knn.predict(sentiments_test)
    accuracy = accuracy_score(labels_test, labels_predict)
    accuracies.append(accuracy)
    print("Accuracy from this fold:", accuracy)

# Calculate and print the average accuracy and standard deviation
print("Average accuracy:", np.mean(accuracies))
print("Standard deviation of accuracy:", np.std(accuracies))
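
The idea behind k-fold cross-validation is that every sample is used for testing exactly once. The sketch below builds folds by hand to show that property. `toy_kfold_indices` is a hypothetical helper that mimics the behaviour of a shuffled `KFold`, and it does not reproduce the exact splits scikit-learn would generate for the same seed.

```python
import numpy as np

def toy_kfold_indices(n_samples, n_splits, seed):
    # Shuffle once, then cut the shuffled indices into n_splits nearly equal parts
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_splits)
    for i in range(n_splits):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != i])
        yield train_idx, test_idx

all_test = []
fold_sizes = []
for train_idx, test_idx in toy_kfold_indices(n_samples=25, n_splits=10, seed=42):
    # Train and test indices never overlap inside a fold
    assert len(set(train_idx.tolist()) & set(test_idx.tolist())) == 0
    all_test.extend(test_idx.tolist())
    fold_sizes.append(len(test_idx))

print(fold_sizes)
```

Averaging the accuracy over the folds, as the cell above does, gives a less split-dependent estimate than a single 70/30 split.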

## Decision Tree Classifier

Implementing the Decision Tree Classifier and varying its max depth to compare accuracy. This cell uses the percentage split method.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Percentage test split method - can adjust percentage
train_features, test_features, train_labels, test_labels = train_test_split(sentiments_features, labels, test_size=0.3, random_state=42)

# Create the Decision Tree Classifier (DTC) model with a specified max depth (here 7)
dtc = DecisionTreeClassifier(max_depth=7, random_state=42)

# Training the model
dtc.fit(train_features, train_labels)

# Predicting the test set
labels_predict = dtc.predict(test_features)

print(f"Accuracy of Decision Tree Classifier (max depth {dtc.max_depth}, 70/30 split):", accuracy_score(test_labels, labels_predict))

Now implementing it via the 10-fold cross-validation method:

In [None]:
from sklearn.model_selection import KFold

# Define the KFold cross-validator
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Create the Decision Tree Classifier (DTC) model with a specified max depth (here 7)
dtc = DecisionTreeClassifier(max_depth=7, random_state=42)

# Prepare to collect accuracy scores
accuracies = []

dtc_labels = np.array(labels)

# Perform 10-fold CV
for train_index, test_index in kf.split(sentiments_features):
    sentiments_train, sentiments_test = sentiments_features[train_index], sentiments_features[test_index]
    labels_train, labels_test = dtc_labels[train_index], dtc_labels[test_index]

    # Train the DTC Model inside the current fold
    dtc.fit(sentiments_train, labels_train)

    # Predict and evaluate the model
    labels_predict = dtc.predict(sentiments_test)
    accuracy = accuracy_score(labels_test, labels_predict)
    accuracies.append(accuracy)
    print("Accuracy from this fold:", accuracy)

# Calculate and print the average accuracy and standard deviation
print("Average accuracy:", np.mean(accuracies))
print("Standard deviation of accuracy:", np.std(accuracies))

## Gaussian Naive Bayes

Implementing the Gaussian Naive Bayes classifier. The parameters are `priors=None` (no fixed array of prior probabilities, so they are estimated from the training data) and `var_smoothing=1e-9`, as per the professor's specification.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Percentage test split method - can adjust percentage
train_features, test_features, train_labels, test_labels = train_test_split(sentiments_features, labels, test_size=0.3, random_state=42)

# Create the Gaussian Naive Bayes (GNB) model
gnb = GaussianNB(priors=None, var_smoothing=1e-9)

# Training the model
gnb.fit(train_features, train_labels)

# Predicting the test set
labels_predict = gnb.predict(test_features)

print("Accuracy of Gaussian Naive Bayes:", accuracy_score(test_labels, labels_predict))
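
To show what the two parameters do, here is a compact NumPy sketch of Gaussian Naive Bayes: per-class means and variances, priors estimated from class frequencies, and a smoothing term added to every variance. It is a simplified illustration on made-up points. The function names are hypothetical and this is not scikit-learn's code.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    classes = np.unique(y)
    # Smoothing term: a fraction of the largest feature variance
    epsilon = var_smoothing * X.var(axis=0).max()
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + epsilon
    # With no priors given, use the class frequencies in the training data
    priors = np.array([(y == c).mean() for c in classes])
    return classes, means, variances, priors

def predict_gaussian_nb(model, X):
    classes, means, variances, priors = model
    # Log of the Gaussian density, summed over features, plus the log prior
    diff = X[:, None, :] - means[None, :, :]
    log_likelihood = -0.5 * np.sum(
        np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None], axis=2
    )
    return classes[np.argmax(log_likelihood + np.log(priors), axis=1)]

toy_X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [4.0, 4.0], [4.1, 3.8], [3.9, 4.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
toy_model = fit_gaussian_nb(toy_X, toy_y)
toy_pred = predict_gaussian_nb(toy_model, np.array([[1.1, 1.0], [4.0, 4.1]]))
print(toy_pred)
```

The smoothing term only matters when some feature has near-zero variance within a class, in which case it prevents a division by zero.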

Now with the 10-fold cross-validation method:

In [None]:
from sklearn.model_selection import KFold

# Define the KFold cross-validator
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Create the Gaussian Naive Bayes (GNB) model
gnb = GaussianNB(priors=None, var_smoothing=1e-9)

# Prepare to collect accuracy scores
accuracies = []

gnb_labels = np.array(labels)

# Perform 10-fold CV
for train_index, test_index in kf.split(sentiments_features):
    sentiments_train, sentiments_test = sentiments_features[train_index], sentiments_features[test_index]
    labels_train, labels_test = gnb_labels[train_index], gnb_labels[test_index]

    # Train the GNB Model inside the current fold
    gnb.fit(sentiments_train, labels_train)

    # Predict and evaluate the model
    labels_predict = gnb.predict(sentiments_test)
    accuracy = accuracy_score(labels_test, labels_predict)
    accuracies.append(accuracy)
    print("Accuracy from this fold:", accuracy)

# Calculate and print the average accuracy and standard deviation
print("Average accuracy:", np.mean(accuracies))
print("Standard deviation of accuracy:", np.std(accuracies))

## Multilayer Perceptron (MLP)

The Multilayer Perceptron (MLP) is a type of artificial neural network that is widely used for solving complex pattern recognition and classification problems. It belongs to a larger class of feedforward neural networks and consists of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer.
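
As a rough picture of that layered structure, the sketch below runs one forward pass through a network with a single hidden layer, ReLU activation and a softmax output. The weights are random and untrained, and the sizes are toy values chosen for illustration (the notebook itself uses 768 inputs and 6 classes).

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden, n_classes = 8, 5, 6   # toy sizes for illustration

# Randomly initialised weights and biases (untrained, for illustration only)
W1, b1 = rng.normal(size=(n_inputs, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_classes)), np.zeros(n_classes)

def mlp_forward(X):
    hidden = np.maximum(0.0, X @ W1 + b1)                  # hidden layer with ReLU
    scores = hidden @ W2 + b2                              # output layer scores
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)   # softmax

toy_batch = rng.normal(size=(3, n_inputs))
toy_probs = mlp_forward(toy_batch)
print(toy_probs.shape)
```

Training consists of adjusting `W1`, `b1`, `W2` and `b2` so that these probabilities match the labels. The number of hidden neurons, the iteration limit and the learning rate varied in the cells below all control that process.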

### Percentage Split

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Percentage test split method - can adjust percentage
train_features, test_features, train_labels, test_labels = train_test_split(sentiments_features, labels, test_size=0.3, random_state=42)

# Calculate the neurons configuration, based on requirements provided by the professor
a_value = 387 # (768 attributes + 6 classes) / 2
neurons = [a_value - 50, a_value, a_value + 50]
accuracies = []

# Loop through neuron configurations
for neuron in neurons:
    print("Starting the classifier for neurons amount:", neuron)
    mlp = MLPClassifier(hidden_layer_sizes=(neuron,), max_iter=300, activation='relu', solver='adam', random_state=42)
    mlp.fit(train_features, train_labels)
    labels_predict = mlp.predict(test_features)
    accuracy = accuracy_score(test_labels, labels_predict)
    accuracies.append(accuracy)

# Plotting
plt.figure(figsize=(10, 5))
plt.plot(neurons, accuracies, marker='o', linestyle='-', color='b')
plt.title('MLP Accuracy vs. Number of Neurons')
plt.xlabel('Number of Neurons')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()

Now we're going to take the number of neurons that we found in the previous experiment and run the classifier again, this time varying the max iterations parameter.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Percentage test split method - can adjust percentage
train_features, test_features, train_labels, test_labels = train_test_split(sentiments_features, labels, test_size=0.3, random_state=42)

# Initiate the values needed
accuracies = []
iterations = [300, 1000, 5000]
best_neuron_length = 437

# Loop through the max iteration values
for iteration in iterations:
    print("Starting the classifier for max iterations:", iteration)
    mlp = MLPClassifier(hidden_layer_sizes=(best_neuron_length,), max_iter=iteration, activation='relu', solver='adam', random_state=42)
    mlp.fit(train_features, train_labels)
    labels_predict = mlp.predict(test_features)
    accuracy = accuracy_score(test_labels, labels_predict)
    accuracies.append(accuracy)

# Plotting
plt.figure(figsize=(10, 5))
plt.plot(iterations, accuracies, marker='o', linestyle='-', color='b')
plt.title('MLP Accuracy vs. Number of Iterations')
plt.xlabel('Number of Iterations')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()

Now, with all of that information, we're going to vary the learning rate to look for the best configuration available.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Percentage test split method - can adjust percentage
train_features, test_features, train_labels, test_labels = train_test_split(sentiments_features, labels, test_size=0.3, random_state=42)

# Initiate the values needed
accuracies = []
best_max_iter = 300
best_neuron_length = 437
learning_rates = [0.001, 0.01, 0.1]

# Loop through the learning rates
for learning_rate in learning_rates:
    print("Starting the classifier for learning rate amount:", learning_rate)
    mlp = MLPClassifier(hidden_layer_sizes=(best_neuron_length,), learning_rate_init=learning_rate, max_iter=best_max_iter, activation='relu', solver='adam', random_state=42)
    mlp.fit(train_features, train_labels)
    labels_predict = mlp.predict(test_features)
    accuracy = accuracy_score(test_labels, labels_predict)
    accuracies.append(accuracy)

# Plotting
plt.figure(figsize=(10, 5))
plt.plot(learning_rates, accuracies, marker='o', linestyle='-', color='b')
plt.title('MLP Accuracy vs. Learning Rate')
plt.xlabel('Learning Rate')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()

### 10-fold Cross-Validation

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt

# Define the KFold cross-validator
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Calculate the neurons configuration, based on requirements provided by the professor
a_value = 387 # (768 attributes + 6 classes) / 2
neurons = [a_value - 50, a_value, a_value + 50]
accuracies = []

# Perform 10-fold CV for each neuron that I'm evaluating
for neuron in neurons:
    print("Entering neuron amount of:", neuron)
    local_accuracy_average = []
    for train_index, test_index in kf.split(sentiments_features):
        sentiments_train, sentiments_test = sentiments_features[train_index], sentiments_features[test_index]
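        # Index positionally through NumPy, as in the kNN cell (a pandas Series would index by label)
        labels_train, labels_test = np.asarray(labels)[train_index], np.asarray(labels)[test_index]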
        mlp = MLPClassifier(hidden_layer_sizes=(neuron,), max_iter=300, activation='relu', solver='adam', random_state=42)
        mlp.fit(sentiments_train, labels_train)
        labels_predict = mlp.predict(sentiments_test)
        local_accuracy_average.append(accuracy_score(labels_test, labels_predict))
    print("Processed neurons with 10-fold cross-validation, average accuracy:", np.mean(local_accuracy_average))
    accuracies.append(np.mean(local_accuracy_average))

# Plotting
plt.figure(figsize=(10, 5))
plt.plot(neurons, accuracies, marker='o', linestyle='-', color='b')
plt.title('MLP Accuracy vs. Number of Neurons (10-fold CV)')
plt.xlabel('Number of Neurons')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()

Next, with the neuron count fixed, we vary the max iterations parameter under the same 10-fold cross-validation.

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt


# Define the KFold cross-validator
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Max iteration values to evaluate, with the neuron count fixed
iterations = [300, 1000, 5000]
best_neuron_length = 387
accuracies = []

# Perform 10-fold CV for each max iteration value that I'm evaluating
for iteration in iterations:
    print("Entering iteration amount of:", iteration)
    local_accuracy_average = []
    for train_index, test_index in kf.split(sentiments_features):
        sentiments_train, sentiments_test = sentiments_features[train_index], sentiments_features[test_index]
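        # Index positionally through NumPy, as in the kNN cell (a pandas Series would index by label)
        labels_train, labels_test = np.asarray(labels)[train_index], np.asarray(labels)[test_index]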
        mlp = MLPClassifier(hidden_layer_sizes=(best_neuron_length,), max_iter=iteration, activation='relu', solver='adam', random_state=42)
        mlp.fit(sentiments_train, labels_train)
        labels_predict = mlp.predict(sentiments_test)
        local_accuracy_average.append(accuracy_score(labels_test, labels_predict))
    print("Processed iteration with 10-fold cross-validation, average accuracy:", np.mean(local_accuracy_average))
    accuracies.append(np.mean(local_accuracy_average))

# Plotting
plt.figure(figsize=(10, 5))
plt.plot(iterations, accuracies, marker='o', linestyle='-', color='b')
plt.title('MLP Accuracy vs. Number of Iterations (10-fold CV)')
plt.xlabel('Number of Iterations')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()

Final stage: varying the learning rate under 10-fold cross-validation.

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt


# Define the KFold cross-validator
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Learning rates to evaluate, with the neuron count and max iterations fixed
best_max_iter = 300
best_neuron_length = 387
learning_rates = [0.001, 0.01, 0.1]
accuracies = []

# Perform 10-fold CV for each learning rate that I'm evaluating
for learning_rate in learning_rates:
    print("Entering learning rate amount of:", learning_rate)
    local_accuracy_average = []
    for train_index, test_index in kf.split(sentiments_features):
        sentiments_train, sentiments_test = sentiments_features[train_index], sentiments_features[test_index]
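        # Index positionally through NumPy, as in the kNN cell (a pandas Series would index by label)
        labels_train, labels_test = np.asarray(labels)[train_index], np.asarray(labels)[test_index]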
        mlp = MLPClassifier(hidden_layer_sizes=(best_neuron_length,), max_iter=best_max_iter, learning_rate_init=learning_rate, activation='relu', solver='adam', random_state=42)
        mlp.fit(sentiments_train, labels_train)
        labels_predict = mlp.predict(sentiments_test)
        local_accuracy_average.append(accuracy_score(labels_test, labels_predict))
    print("Processed learning rate with 10-fold cross-validation, average accuracy:", np.mean(local_accuracy_average))
    accuracies.append(np.mean(local_accuracy_average))

# Plotting
plt.figure(figsize=(10, 5))
plt.plot(learning_rates, accuracies, marker='o', linestyle='-', color='b')
plt.title('MLP Accuracy vs. Learning Rate (10-fold CV)')
plt.xlabel('Learning Rate')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()

### Searching for parameters using GridSearchCV

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Percentage test split method - can adjust percentage
train_features, test_features, train_labels, test_labels = train_test_split(sentiments_features, labels, test_size=0.3, random_state=42)

# The parameters we will pass to GridSearch
param_grid = {
    'hidden_layer_sizes': [(450,),(387,),(400,)],
    'solver': ['sgd', 'adam'],
    'learning_rate_init': [0.001, 0.01, 0.1],
    'random_state': [1,50],
    'max_iter': [300]
}

# Create MLPClassifier object
mlp = MLPClassifier(random_state=42)

# Setting up the GridSearchCV
grid_search = GridSearchCV(estimator=mlp, param_grid=param_grid, cv=3,verbose=2, n_jobs=-1)

# Fitting GridSearchCV
grid_search.fit(train_features, train_labels)

# Best parameters found by GridSearchCV
print("Best parameters found: ", grid_search.best_params_)

# Evaluate the best model found from the grid search
best_mlp = grid_search.best_estimator_
y_pred = best_mlp.predict(test_features)
accuracy = accuracy_score(test_labels, y_pred)

print("Accuracy of the best model: ", accuracy)
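
Before launching a grid search it helps to know how many models it will train. The sketch below counts the combinations for the same grid used in the cell above. The names `toy_param_grid`, `cv_folds`, `combinations` and `total_fits` are introduced here only for illustration.

```python
from itertools import product

# Same grid as in the cell above
toy_param_grid = {
    'hidden_layer_sizes': [(450,), (387,), (400,)],
    'solver': ['sgd', 'adam'],
    'learning_rate_init': [0.001, 0.01, 0.1],
    'random_state': [1, 50],
    'max_iter': [300],
}
cv_folds = 3

# Every combination of one value per parameter
combinations = list(product(*toy_param_grid.values()))
total_fits = len(combinations) * cv_folds

print(len(combinations), total_fits)
```

With 36 combinations and 3 folds, the search fits 108 MLPs, plus one final refit of the best configuration on the whole training set. Adding a single extra value to any parameter multiplies that cost, which is why the grid is kept small.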

### MLP With the Best Parameters

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

datasets_to_explore = ['datasets/twitter_sentiment_base_ready.csv', 'datasets/twitter_sentiments_base_sampled.csv', 'datasets/twitter_sentiment_base_decision_tree.csv', 'datasets/twitter_sentiment_base_pca.csv']

# Define the KFold cross-validator
kf = KFold(n_splits=10, shuffle=True, random_state=42)
#best_max_iter = 300
#best_neuron_length = 387
#best_learning_rate = 0.001

best_max_iter = 300
best_neuron_length = 387
best_learning_rate = 0.01

for dataset in datasets_to_explore:
    print('Entering dataset ' + dataset)
    working_base = pd.read_csv(dataset)
    sentiments_features = working_base.loc[:, working_base.columns.str.startswith('feature_')].to_numpy()
    labels = working_base['label'].to_numpy()
    accuracies = []

    # Perform 10-fold CV
    for train_index, test_index in kf.split(sentiments_features):
        sentiments_train, sentiments_test = sentiments_features[train_index], sentiments_features[test_index]
        labels_train, labels_test = labels[train_index], labels[test_index]
        mlp = MLPClassifier(hidden_layer_sizes=(best_neuron_length,), max_iter=best_max_iter, learning_rate_init=best_learning_rate, activation='relu', solver='sgd', random_state=42)
        mlp.fit(sentiments_train, labels_train)
        labels_predict = mlp.predict(sentiments_test)
        accuracies.append(accuracy_score(labels_test, labels_predict))
    
    content = 'Dataset: ' + dataset + ', 10-Fold CV Average Accuracy: ' + str(np.mean(accuracies))
    print(content)
    with open('datasets/results_grid.txt', "a") as file:
        file.write(content + '\n')  # newline keeps one result per line

## K-Means Clustering

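We're now going to evaluate how a few unsupervised machine learning algorithms work, starting with k-Means. We will vary the number of clusters k from 2 to 20, running the algorithm 5 times with different seeds for each k, and compare the average Davies-Bouldin and silhouette scores to find the best setting for our dataset.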

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
import matplotlib.pyplot as plt

db_scores = []
silhouette_scores = []
k_values = range(2, 21)

for k in k_values:
    # Helper variables for calculating the average score after running the algorithm 5 times
    current_db_scores = []
    current_silhouette_scores = []
    for seed in range(5):
        print(f"Entering k={k} iteration of the K-Means, iteration number { seed }.")
        # Runs K-Means 5 times for the current K iteration
        kmeans = KMeans(n_clusters=k, random_state=seed, n_init=1)
        # Use a separate name so the ground-truth `labels` are not overwritten
        cluster_labels = kmeans.fit_predict(sentiments_features)
        # Calculate the scores of interest
        db_score = davies_bouldin_score(sentiments_features, cluster_labels)
        silhouette_grade = silhouette_score(sentiments_features, cluster_labels)
        current_db_scores.append(db_score)
        current_silhouette_scores.append(silhouette_grade)
    # Calculate the scores for the current K
    db_scores.append(np.mean(current_db_scores))
    silhouette_scores.append(np.mean(current_silhouette_scores))
    print(f"Average Davies-Bouldin Score for k={k}: {np.mean(current_db_scores)}")
    print(f"Average Silhouette Score for k={k}: {np.mean(current_silhouette_scores)}")
    with open('datasets/results_k_means.txt', "a") as file:
        file.write(f"Average Davies-Bouldin Score for k={k}: {np.mean(current_db_scores)}\n")
        file.write(f"Average Silhouette Score for k={k}: {np.mean(current_silhouette_scores)}\n")

# Plotting results
fig, ax1 = plt.subplots(figsize=(10, 6))

color = 'tab:red'
ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('Average Davies-Bouldin Score', color=color)
ax1.plot(k_values, db_scores, marker='o', linestyle='-', color=color)
ax1.tick_params(axis='y', labelcolor=color)

'''ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
color = 'tab:blue'
ax2.set_ylabel('Average Silhouette Score', color=color)
ax2.plot(k_values, silhouette_scores, marker='o', linestyle='-', color=color)
ax2.tick_params(axis='y', labelcolor=color)'''

fig.tight_layout()  # otherwise the right y-label is slightly clipped
plt.title('Clustering Performance Evaluation')
plt.grid(True)
plt.show()
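
When reading the two scores collected above, keep in mind that a lower Davies-Bouldin score is better, while a higher silhouette score is better. The following is a minimal sketch of picking k from both lists. It assumes synthetic data from `make_blobs` as a stand-in for `sentiments_features`; the blob centers and the reduced k range are illustrative choices, not part of the original experiment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for sentiments_features: three well-separated 2-D blobs (assumption)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=0.5, random_state=0)

k_values = list(range(2, 7))
db_scores, silhouette_scores = [], []
for k in k_values:
    cluster_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    db_scores.append(davies_bouldin_score(X, cluster_labels))
    silhouette_scores.append(silhouette_score(X, cluster_labels))

# Davies-Bouldin: lower is better; silhouette: higher is better
best_k_db = k_values[int(np.argmin(db_scores))]
best_k_silhouette = k_values[int(np.argmax(silhouette_scores))]
print(f"Best k by Davies-Bouldin: {best_k_db}")
print(f"Best k by silhouette: {best_k_silhouette}")
```

On real data the two criteria do not always agree, so it is worth looking at both curves rather than at a single minimum or maximum.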


## Hierarchical Clustering

Next, we apply agglomerative (hierarchical) clustering with Ward linkage and vary the number of clusters to find the best configuration available.

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import davies_bouldin_score, silhouette_score
import matplotlib.pyplot as plt

# Range of k values to test
k_values = range(2, 21)
db_scores = []
silhouette_scores = []

# Agglomerative Clustering without specifying the number of clusters
clustering = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage='ward')
clustering.fit(sentiments_features)

# Extracting the number of clusters and their labels at different cuts of the dendrogram
from scipy.cluster.hierarchy import linkage, fcluster

# Generate the linkage matrix
Z = linkage(sentiments_features, method='ward')

for k in k_values:
    # Cutting the dendrogram at the specified number of clusters
    # Named cluster_labels so the class labels stored in `labels` are not overwritten
    cluster_labels = fcluster(Z, k, criterion='maxclust')

    # Calculate and store the Davies-Bouldin score
    db_score = davies_bouldin_score(sentiments_features, cluster_labels)
    db_scores.append(db_score)

    # Calculate and store the Silhouette score
    silhouette = silhouette_score(sentiments_features, cluster_labels)
    silhouette_scores.append(silhouette)
    
    print(f"Number of clusters: {k}")
    print(f"Davies-Bouldin Score: {db_score}")
    print(f"Silhouette Score: {silhouette}")

# Plotting the results
plt.figure(figsize=(14, 7))
plt.subplot(1, 2, 1)
plt.plot(k_values, db_scores, marker='o', linestyle='-', color='red')
plt.title('Davies-Bouldin Score vs. Number of Clusters')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Davies-Bouldin Score')
plt.grid(True)

'''
plt.subplot(1, 2, 2)
plt.plot(k_values, silhouette_scores, marker='o', linestyle='-', color='blue')
plt.title('Silhouette Score vs. Number of Clusters')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.grid(True)'''

plt.tight_layout()
plt.show()

### Why distance_threshold=0 and n_clusters=None?

By setting n_clusters=None and distance_threshold=0, we instruct AgglomerativeClustering not to stop merging at a particular number of clusters. Instead, it computes the complete merge tree (the dendrogram), which records every merge from each point being its own cluster to all points being in one cluster. With a threshold of 0, the fitted labels_ place each point in its own cluster; the tree itself holds the full hierarchy.

Once the full dendrogram is constructed, we can efficiently "cut" it at different levels (different values of k) to explore the structure of the data without having to re-run the clustering algorithm each time. In the code above, the cuts are made with SciPy: linkage builds the Ward linkage matrix Z once, and fcluster cuts it for each k. This is much more efficient than re-instantiating and re-running AgglomerativeClustering for each k, especially for large datasets.

If we were to instantiate AgglomerativeClustering within the loop for each k, it would mean recalculating the same hierarchical relationships multiple times: once for each value of k. This recalculation is unnecessary because the hierarchical relationships between data points don't change as k changes; only the level at which we decide to "cut" the tree changes.

Finally, by generating the full dendrogram once, we capture all possible clustering configurations. We can then efficiently explore different configurations by adjusting where we cut the dendrogram. This is particularly useful for exploratory data analysis, where we might not know in advance what the optimal number of clusters should be and want to examine various possibilities quickly.
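
The "build once, cut many times" idea described above can be sketched in isolation. The example below is only an illustration: it assumes a small random matrix `X` in place of `sentiments_features`, and the `cuts` dictionary is a hypothetical helper for holding the results.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Small random stand-in for sentiments_features (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

# The Ward linkage matrix is computed a single time: one row per merge
Z = linkage(X, method='ward')

# Each cut reuses Z; no clustering is re-run inside the loop
cuts = {}
for k in range(2, 6):
    cuts[k] = fcluster(Z, k, criterion='maxclust')
    print(f"k={k}: {len(np.unique(cuts[k]))} clusters")
```

The expensive step is the call to `linkage`; each `fcluster` call only walks the already-built tree.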

### Comparing results

In [None]:
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.cluster import KMeans, AgglomerativeClustering

# Cluster using K-Means
kmeans = KMeans(n_clusters=2, random_state=42).fit(sentiments_features)
kmeans_labels = kmeans.labels_

# Cluster using Hierarchical Clustering
hierarchical = AgglomerativeClustering(n_clusters=2).fit(sentiments_features)
hierarchical_labels = hierarchical.labels_

# Calculate metrics for K-Means
silhouette_kmeans = silhouette_score(sentiments_features, kmeans_labels)
db_index_kmeans = davies_bouldin_score(sentiments_features, kmeans_labels)

# Calculate metrics for Hierarchical Clustering
silhouette_hierarchical = silhouette_score(sentiments_features, hierarchical_labels)
db_index_hierarchical = davies_bouldin_score(sentiments_features, hierarchical_labels)

# Output the results
print("K-Means Silhouette Score:", silhouette_kmeans)
print("Hierarchical Silhouette Score:", silhouette_hierarchical)
print("K-Means Davies-Bouldin Index:", db_index_kmeans)
print("Hierarchical Davies-Bouldin Index:", db_index_hierarchical)


## Ensembles: Bagging

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

test_splits = [0.3, 0.2, 0.1]
n_values = [10, 20]
estimators = [ MLPClassifier(max_iter=300, hidden_layer_sizes=(387,), activation='relu', solver='adam', random_state=42) ] # KNeighborsClassifier(), GaussianNB(), MLPClassifier(max_iter=300, hidden_layer_sizes=(387,))]
max_features = [0.3, 0.5, 0.8]

# Holdout strategy (10/90, 20/80, 30/70)
for test_split in test_splits:
    for estimator in estimators:
        for max_feature in max_features:
            for n in n_values:
                train_features, test_features, train_labels, test_labels = train_test_split(sentiments_features, labels, test_size=test_split, random_state=42)
                bagging = BaggingClassifier(estimator=estimator, max_features=max_feature, n_estimators=n, random_state=42)
                bagging.fit(train_features, train_labels)
                labels_predict = bagging.predict(test_features)
                accuracy = accuracy_score(test_labels, labels_predict)
                display_message = f"Accuracy of Bagging Classifier ({type(estimator).__name__} as the Base Estimator) with {n} estimators, { max_feature } max features and {test_split} test split: {accuracy}"
                print(display_message)
                with open('datasets/bagging_accuracies.txt', "a") as file:
                    file.write(display_message + "\n")

# Cross-validation strategy
for estimator in estimators:
    for max_feature in max_features:
        for n in n_values:
            bagging = BaggingClassifier(estimator=estimator, max_features=max_feature, n_estimators=n, random_state=42)
            scores = cross_val_score(bagging, sentiments_features, labels, cv=10, scoring='accuracy')
            mean_accuracy = np.mean(scores)
            display_message = f"10-fold CV Accuracy of Bagging Classifier ({type(estimator).__name__} as the Base Estimator) with {n} estimators and { max_feature } max features: {mean_accuracy}"
            print(display_message)
            with open('datasets/bagging_accuracies.txt', "a") as file:
                file.write(display_message + "\n")

## Ensembles: Boosting

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

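# AdaBoost requires base estimators whose fit() accepts sample_weight.
# The wrappers below accept the argument but discard it, so the boosting
# reweighting has no effect on how these base estimators are trained.

# Wrapper for KNeighborsClassifier to ignore sample_weight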
class KNeighborsClassifierWrapper(KNeighborsClassifier):
    def fit(self, X, y, sample_weight=None):
        return super().fit(X, y)

# Wrapper for GaussianNB to ignore sample_weight
# (GaussianNB.fit already accepts sample_weight, so this wrapper is optional)
class GaussianNBWrapper(GaussianNB):
    def fit(self, X, y, sample_weight=None):
        return super().fit(X, y)

# Wrapper for MLPClassifier to ignore sample_weight
class MLPClassifierWrapper(MLPClassifier):
    def fit(self, X, y, sample_weight=None):
        return super().fit(X, y)

test_splits = [0.3, 0.2, 0.1]
n_estimators = [10, 20]
estimators = [
    DecisionTreeClassifier(),
    KNeighborsClassifierWrapper(),
    GaussianNBWrapper(),
    MLPClassifierWrapper(max_iter=300, hidden_layer_sizes=(387,))
]

# Assuming sentiments_features and labels are defined and preprocessed
# Holdout strategy (10/90, 20/80, 30/70)
for test_split in test_splits:
    for estimator in estimators:
        for n in n_estimators:
            train_features, test_features, train_labels, test_labels = train_test_split(sentiments_features, labels, test_size=test_split, random_state=42)
            adaBooster = AdaBoostClassifier(estimator=estimator, n_estimators=n, algorithm="SAMME", random_state=42)
            adaBooster.fit(train_features, train_labels)
            labels_predict = adaBooster.predict(test_features)
            accuracy = accuracy_score(test_labels, labels_predict)
            display_message = f"Accuracy of ADA Boosting ({estimator.__class__.__name__} as the Base Estimator) with {n} estimators and {test_split} test split: {accuracy}"
            print(display_message)
            with open('datasets/boosting_accuracies.txt', "a") as file:
                file.write(display_message + "\n")

# Cross-validation strategy
for estimator in estimators:
    for n in n_estimators:
        adaBooster = AdaBoostClassifier(estimator=estimator, n_estimators=n, algorithm="SAMME", random_state=42)
        scores = cross_val_score(adaBooster, sentiments_features, labels, cv=10, scoring='accuracy')
        mean_accuracy = np.mean(scores)
        display_message = f"10-fold CV Accuracy of ADA Boosting ({type(estimator).__name__} as the Base Estimator) with {n} estimators: {mean_accuracy}"
        print(display_message)
        with open('datasets/boosting_accuracies.txt', "a") as file:
            file.write(display_message + "\n")

## Ensembles: Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

test_splits = [0.3, 0.2, 0.1]
criterions = ['gini', 'entropy', 'log_loss']
n_estimators = [10, 100]

# Holdout strategy (10/90, 20/80, 30/70)
for test_split in test_splits:
    for criterion in criterions:
        for n in n_estimators:
            train_features, test_features, train_labels, test_labels = train_test_split(sentiments_features, labels, test_size=test_split, random_state=42)
            randomForest = RandomForestClassifier(criterion=criterion, n_estimators=n, random_state=42)
            randomForest.fit(train_features, train_labels)
            labels_predict = randomForest.predict(test_features)
            accuracy = accuracy_score(test_labels, labels_predict)
            display_message = f"Accuracy of Random Forest Classifier with {n} estimators, { criterion } criterion and {test_split} test split: {accuracy}"
            print(display_message)
            with open('datasets/random_forest_accuracies.txt', "a") as file:
                file.write(display_message + "\n")

# Cross-validation strategy
for criterion in criterions:
    for n in n_estimators:
        randomForest = RandomForestClassifier(criterion=criterion, n_estimators=n, random_state=42)
        scores = cross_val_score(randomForest, sentiments_features, labels, cv=10, scoring='accuracy')
        mean_accuracy = np.mean(scores)
        display_message = f"10-fold CV Accuracy of Random Forest Classifier with {n} estimators and { criterion } criterion: {mean_accuracy}"
        print(display_message)
        with open('datasets/random_forest_accuracies.txt', "a") as file:
            file.write(display_message + "\n")

## Ensembles: Stacking

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

test_splits = [0.3, 0.2, 0.1]
base_estimators_10 = [('MLP01', MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000)),
                      ('MLP02', MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000)),
                      ('MLP03', MLPClassifier(hidden_layer_sizes=(6,), max_iter=1000)),
                      ('MLP04', MLPClassifier(hidden_layer_sizes=(4,), max_iter=1000)),
                      ('MLP05', MLPClassifier(hidden_layer_sizes=(2,), max_iter=1000)),
                      ('kNN01', KNeighborsClassifier(n_neighbors=1)),
                      ('kNN02', KNeighborsClassifier(n_neighbors=2)),
                      ('kNN03', KNeighborsClassifier(n_neighbors=3)),
                      ('kNN04', KNeighborsClassifier(n_neighbors=4)),
                      ('kNN05', KNeighborsClassifier(n_neighbors=5))]
base_estimators_20 = base_estimators_10 + [('MLP06', MLPClassifier(hidden_layer_sizes=(12,), max_iter=1000)),
                                           ('MLP07', MLPClassifier(hidden_layer_sizes=(14,), max_iter=1000)),
                                           ('MLP08', MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000)),
                                           ('MLP09', MLPClassifier(hidden_layer_sizes=(18,), max_iter=1000)),
                                           ('MLP10', MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000)),
                                           ('kNN06', KNeighborsClassifier(n_neighbors=6)),
                                           ('kNN07', KNeighborsClassifier(n_neighbors=7)),
                                           ('kNN08', KNeighborsClassifier(n_neighbors=8)),
                                           ('kNN09', KNeighborsClassifier(n_neighbors=9)),
                                           ('kNN10', KNeighborsClassifier(n_neighbors=10))]
base_estimators = [base_estimators_10, base_estimators_20]   

# Holdout strategy (10/90, 20/80, 30/70)
for test_split in test_splits:
    for base_estimator in base_estimators:
        train_features, test_features, train_labels, test_labels = train_test_split(sentiments_features, labels, test_size=test_split, random_state=42)
        stacking = StackingClassifier(estimators=base_estimator)
        stacking.fit(train_features, train_labels)
        labels_predict = stacking.predict(test_features)
        accuracy = accuracy_score(test_labels, labels_predict)
        display_message = f"Accuracy of Stacking Classifier with { len(base_estimator) } base estimators and {test_split} test split: {accuracy}"
        print(display_message)
        with open('datasets/stacking_accuracies.txt', "a") as file:
            file.write(display_message + "\n")

# Cross-validation strategy
for base_estimator in base_estimators:
    stacking = StackingClassifier(estimators=base_estimator)
    scores = cross_val_score(stacking, sentiments_features, labels, cv=10, scoring='accuracy')
    mean_accuracy = np.mean(scores)
    display_message = f"10-fold CV Accuracy of Stacking Classifier with { len(base_estimator) } estimators: {mean_accuracy}"
    print(display_message)
    with open('datasets/stacking_accuracies.txt', "a") as file:
        file.write(display_message + "\n")

## Friedman Test

In [1]:
import numpy as np
from scipy.stats import friedmanchisquare

# Step 1: Organize the data
# Rows: Datasets, Columns: Algorithms
accuracy_data = np.array([
    [35.03, 29.28, 39.50, 58.96],  # Original Dataset
    [34.28, 29.71, 40.06, 58.26],  # Reduced Dataset 1
    [16.15, 16.39, 16.72, 16.49],  # Reduced Dataset 2
    [32.27, 28.61, 41.84, 38.97]   # Reduced Dataset 3
])

# Step 2: Perform the Friedman test
stat, p = friedmanchisquare(*[accuracy_data[:, i] for i in range(accuracy_data.shape[1])])

# Step 3: Print the results
print('Friedman test statistic:', stat)
print('p-value:', p)

# Interpretation
if p < 0.05:
    print("Significant differences detected (p < 0.05)")
else:
    print("No significant differences detected (p >= 0.05)")


Friedman test statistic: 9.899999999999991
p-value: 0.019435580728320332
Significant differences detected (p < 0.05)
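
To make the statistic above less of a black box, here is a minimal sketch that recomputes it by hand from the mean ranks, using the same accuracy matrix. It assumes there are no ties within a row (true for this data), so the plain formula chi2 = 12n / (k(k+1)) * sum(R_j^2) - 3n(k+1) applies, with n datasets, k algorithms and R_j the mean rank of algorithm j. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Rows: datasets, columns: algorithms (same values as in the cell above)
accuracy_data = np.array([
    [35.03, 29.28, 39.50, 58.96],
    [34.28, 29.71, 40.06, 58.26],
    [16.15, 16.39, 16.72, 16.49],
    [32.27, 28.61, 41.84, 38.97],
])
n, k = accuracy_data.shape

# Rank the algorithms within each dataset (1 = lowest accuracy)
ranks = np.apply_along_axis(rankdata, 1, accuracy_data)
mean_ranks = ranks.mean(axis=0)

# Friedman statistic from the mean ranks (valid when no ties occur within a row)
chi2_manual = 12 * n / (k * (k + 1)) * np.sum(mean_ranks ** 2) - 3 * n * (k + 1)

# Reference value from SciPy, one argument per algorithm (column)
chi2_scipy, p_value = friedmanchisquare(*accuracy_data.T)

print("Mean ranks per algorithm:", mean_ranks)
print("Manual statistic:", chi2_manual)
print("SciPy statistic:", chi2_scipy)
```

The mean ranks also show which algorithms drive the result, something the p-value alone does not reveal.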


In [2]:
import scikit_posthocs as sp
import numpy as np

# Data as provided
accuracy_data = np.array([
    [35.03, 29.28, 39.50, 58.96],  # Original Dataset
    [34.28, 29.71, 40.06, 58.26],  # Reduced Dataset 1
    [16.15, 16.39, 16.72, 16.49],  # Reduced Dataset 2
    [32.27, 28.61, 41.84, 38.97]   # Reduced Dataset 3
])

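# Perform the Nemenyi post-hoc test
# Note: posthoc_nemenyi_friedman treats rows as blocks and columns as groups, so passing
# the transpose compares the four datasets with each other; pass accuracy_data without .T
# to compare the four algorithms that the Friedman test above compared.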
nemenyi_results = sp.posthoc_nemenyi_friedman(accuracy_data.T)

# Print results
print(nemenyi_results)



          0         1         2         3
0  1.000000  0.900000  0.065622  0.823993
1  0.900000  1.000000  0.065622  0.823993
2  0.065622  0.065622  1.000000  0.354855
3  0.823993  0.823993  0.354855  1.000000


In [3]:
from scipy.stats import wilcoxon

# Step 1: Prepare your data
db_index = [2.2472, 2.4941]  # Davies-Bouldin Index for k-Means and Hierarchical Clustering
silhouette = [0.1746, 0.1194]  # Silhouette score for k-Means and Hierarchical Clustering

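# Step 2: Perform the Wilcoxon signed-rank test
# Note: each test receives a single paired observation (n=1), so the p-value is
# necessarily 1.0; a meaningful test needs scores from several paired runs.
stat_db, p_db = wilcoxon([db_index[0]], [db_index[1]])
stat_sil, p_sil = wilcoxon([silhouette[0]], [silhouette[1]])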

# Step 3: Print the results
print(f'Wilcoxon test statistic for DB Index: {stat_db}, p-value: {p_db}')
print(f'Wilcoxon test statistic for Silhouette score: {stat_sil}, p-value: {p_sil}')

# Interpretation
if p_db < 0.05:
    print("Significant differences detected in DB Index (p < 0.05)")
else:
    print("No significant differences detected in DB Index (p >= 0.05)")

if p_sil < 0.05:
    print("Significant differences detected in Silhouette score (p < 0.05)")
else:
    print("No significant differences detected in Silhouette score (p >= 0.05)")


Wilcoxon test statistic for DB Index: 0.0, p-value: 1.0
Wilcoxon test statistic for Silhouette score: 0.0, p-value: 1.0
No significant differences detected in DB Index (p >= 0.05)
No significant differences detected in Silhouette score (p >= 0.05)
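
The p-value of 1.0 above follows from the sample size rather than from the scores themselves. As a rough sketch: for the exact two-sided signed-rank test with n untied, non-zero paired differences, the most extreme outcome (all differences with the same sign) has probability 2 / 2^n. The helper `min_two_sided_p` below is a hypothetical name introduced only for this illustration.

```python
def min_two_sided_p(n_pairs: int) -> float:
    """Smallest two-sided p-value attainable by the exact Wilcoxon signed-rank
    test with n_pairs untied, non-zero paired differences."""
    return min(1.0, 2 / 2 ** n_pairs)

# One pair can never be significant; at least six pairs are needed for p < 0.05
for n_pairs in (1, 5, 6):
    print(f"n={n_pairs}: smallest possible p-value = {min_two_sided_p(n_pairs)}")
```

In practice this means the k-Means and hierarchical scores would have to be collected over several paired runs (for example, one pair per seed or per data subsample) before the test can say anything.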


In [4]:
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# Data Matrix
accuracy_data = np.array([
    [36.17, 37.80125, 39.51875, 62.43875],  # Bagging Default
    [35.37875, 39.17875, 39.27375, 60.6575],  # Bagging Max Features 0.3
    [35.935, 38.52875, 39.2775, 61.30875],  # Bagging Max Features 0.5
    [35.67375, 38.0925, 39.385, 61.72875],  # Bagging Max Features 0.8
    [31.81375, 36.975, 39.5425, 61.19625],  # Boosting Default
    [38.265, 37.895, 37.88, 37.54],  # Random Forest
    [60.1, 60.705, 61.235, 59.885],  # Stacking
])

# Perform the Friedman test
stat, p = friedmanchisquare(*[accuracy_data[:, i] for i in range(accuracy_data.shape[1])])

print('Friedman test statistic:', stat)
print('p-value:', p)

# Interpretation
if p < 0.05:
    print("Significant differences detected (p < 0.05)")
else:
    print("No significant differences detected (p >= 0.05)")

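# Perform the Nemenyi post-hoc test
# Note: with the transpose, the groups compared are the seven rows (ensemble configurations),
# whereas the Friedman test above compared the four columns.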
nemenyi_results = sp.posthoc_nemenyi_friedman(accuracy_data.T)

# Print results
print(nemenyi_results)

Friedman test statistic: 6.599999999999994
p-value: 0.0858010874001231
No significant differences detected (p >= 0.05)
         0         1    2    3         4         5         6
0  1.00000  0.900000  0.9  0.9  0.900000  0.829610  0.900000
1  0.90000  1.000000  0.9  0.9  0.900000  0.900000  0.637136
2  0.90000  0.900000  1.0  0.9  0.900000  0.900000  0.900000
3  0.90000  0.900000  0.9  1.0  0.900000  0.900000  0.900000
4  0.90000  0.900000  0.9  0.9  1.000000  0.900000  0.540899
5  0.82961  0.900000  0.9  0.9  0.900000  1.000000  0.440184
6  0.90000  0.637136  0.9  0.9  0.540899  0.440184  1.000000


## State of the Art for Sentiment Analysis: BERT Fine-Tuning

First, we will fine-tune the BERT model on the sentiments dataset to see whether it reaches a better accuracy score.

In [None]:
import pandas as pd
from datasets import Dataset, ClassLabel
from transformers import BertTokenizer
import torch

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Read dataset using pandas
sentiments_df = pd.read_csv("datasets/twitter_sentiment_base_original.csv", usecols=["text", "label"])

# Get unique labels using pandas
unique_labels = sorted(sentiments_df['label'].unique())
class_label = ClassLabel(num_classes=len(unique_labels), names=list(map(str, unique_labels)))

# Convert the pandas DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(sentiments_df)

# Encode labels to ClassLabel
dataset = dataset.cast_column("label", class_label)

# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize dataset
def tokenize_data(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True, max_length=128)

tokenized_data = dataset.map(tokenize_data, batched=True)

# Rename the column from "label" to "labels" for compatibility with BertForSequenceClassification
tokenized_data = tokenized_data.rename_column("label", "labels")

# Split the dataset into training and testing sets
# (named split_data so that sklearn's train_test_split function is not shadowed)
split_data = tokenized_data.train_test_split(test_size=0.3, stratify_by_column='labels')
train_data = split_data['train']
test_data = split_data['test']
from transformers import BertForSequenceClassification, TrainingArguments, Trainer, default_data_collator

# Load pre-trained BERT model and move it to the selected device
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=6)
model.to(device)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results', 
    num_train_epochs=3, 
    per_device_train_batch_size=16, 
    per_device_eval_batch_size=16,
    warmup_steps=500, 
    weight_decay=0.01, 
    logging_dir='./results/logs', 
    logging_steps=10,
    evaluation_strategy="epoch",
)

# Use default data collator
trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=train_data, 
    eval_dataset=test_data,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

# Train model
trainer.train()

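# Evaluate model
results = trainer.evaluate()
print(results)

# Save the fine-tuned model and tokenizer so the next cell can load them from './fine-tuned-bert'
trainer.save_model('./fine-tuned-bert')
tokenizer.save_pretrained('./fine-tuned-bert')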
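
Without a metrics function, `trainer.evaluate()` reports the evaluation loss but not the accuracy. A minimal sketch of such a function follows; it assumes the predictions arrive as a `(logits, labels)` pair, which is how `Trainer` passes them to its `compute_metrics` argument. The function name and the tiny check at the end are illustrative only.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Return the accuracy for a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# Tiny check with made-up logits: the first row predicts class 1, the second class 0
example_logits = np.array([[0.1, 0.9], [0.8, 0.2]])
example_labels = np.array([1, 1])
example_metrics = compute_metrics((example_logits, example_labels))
print(example_metrics)

# It would be passed when building the trainer, e.g.
# Trainer(model=model, args=training_args, ..., compute_metrics=compute_metrics)
```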

Now that the fine-tuning process is complete and we have its results, we can use the fine-tuned model to classify new text.

In [34]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the fine-tuned model and tokenizer
model = BertForSequenceClassification.from_pretrained('./fine-tuned-bert', num_labels=6)
tokenizer = BertTokenizer.from_pretrained('./fine-tuned-bert')

# Move model to CPU if not already
device = torch.device('cpu')
model.to(device)

# Define the label mapping
label_mapping = {
    0: "sadness",
    1: "joy",
    2: "love",
    3: "anger",
    4: "fear",
    5: "surprise"
}

# Function to predict sentiment of new texts
def predict(texts):
    # Tokenize the input texts
    inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=128).to(device)
    
    # Perform inference with the model (no gradients are needed for prediction)
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Get the logits (raw predictions) from the model outputs
    logits = outputs.logits
    
    # Convert logits to predicted class labels
    predictions = torch.argmax(logits, dim=-1)
    
    # Move predictions back to the CPU and convert to numpy array
    predictions = predictions.cpu().numpy()
    
    # Map the numerical predictions to human-readable labels
    human_readable_predictions = [label_mapping[pred] for pred in predictions]
    
    return predictions, human_readable_predictions

# Example usage: Predict sentiment of new texts
texts = ["surprise!", "In a hole in the ground there once lived happy hobbit."]
predictions, human_readable_predictions = predict(texts)
print("Human-readable Predictions:", human_readable_predictions)

Human-readable Predictions: ['joy', 'joy']


Now let's evaluate this with the larger dataset.

In [21]:
test_data = pd.read_csv('datasets/test_sentiments.csv', usecols=['text', 'label'])
test_data.query('label == 2')

     text                                               label
201  I'm so in love with you. Every moment with you...  2
202  You are my everything. I can't imagine life wi...  2
203  My heart skips a beat every time I see you.        2
204  I love you more than words can express.            2
205  You make my world a better place just by being...  2
...  ...                                                ...
294  My love for you is eternal.                        2
295  You are my everything, my true love.               2
296  My heart is yours, always.                         2
297  You are my everything, forever.                    2


In [23]:
import pandas as pd
from sklearn.metrics import classification_report

# Load the test dataset
test_data = pd.read_csv('datasets/test_sentiments.csv', usecols=['text', 'label'])
test_data = test_data.query('label == 2')

# Evaluating the model on the test dataset
true_labels = test_data['label'].tolist()  # Convert to list
predicted_labels, human_readable_predictions = predict(test_data['text'].tolist())  # Convert text column to list

# Generate a classification report
print(classification_report(true_labels, predicted_labels, target_names=list(label_mapping.values())))

ValueError: Number of classes, 5, does not match size of target_names, 6. Try specifying the labels parameter
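
The error above appears because `classification_report` infers the classes from the labels it actually sees, and that set is smaller than the six names in `target_names`. One way around it, sketched here with made-up labels rather than the real test set, is to pass the full list of class ids through the `labels` parameter; `zero_division=0` is an added assumption to silence warnings for classes with no samples.

```python
from sklearn.metrics import classification_report

label_mapping = {
    0: "sadness", 1: "joy", 2: "love",
    3: "anger", 4: "fear", 5: "surprise",
}

# Made-up example: every true label is "love", predictions are mixed
true_labels = [2, 2, 2, 2]
predicted_labels = [2, 1, 2, 0]

report = classification_report(
    true_labels,
    predicted_labels,
    labels=list(label_mapping.keys()),            # all six class ids, even if absent
    target_names=list(label_mapping.values()),
    zero_division=0,
)
print(report)
```

Classes that never occur in the data then show a support of 0 instead of raising an error.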

### Analysis of the Fine Tuned BERT

#### Sadness: Precision (0.94), Recall (0.99), F1-Score (0.96)
The model is very good at correctly predicting sadness (high recall), and most of its sadness predictions are correct (high precision).

#### Joy: Precision (0.99), Recall (0.91), F1-Score (0.95)
Almost all of the model's joy predictions are correct (high precision), but it misses some joy instances (slightly lower recall).

#### Love: Precision (0.92), Recall (1.00), F1-Score (0.96)
The model identifies all instances of love perfectly (recall of 1.00), though a small number of love predictions might be incorrect.

#### Anger: Precision (0.93), Recall (0.99), F1-Score (0.96)
The model finds nearly all anger instances (high recall), and most of its anger predictions are correct (high precision).

#### Fear: Precision (0.78), Recall (0.86), F1-Score (0.82)
The model's performance for fear is lower compared to other emotions, indicating it misses some fear instances and also has more incorrect fear predictions.

#### Surprise: Precision (0.99), Recall (0.75), F1-Score (0.86)
The model is excellent at correctly predicting surprise when it does so (high precision), but it misses a significant number of surprise instances (lower recall).

#### Overall Performance:

#### Accuracy: 0.92
The model correctly classifies 92% of all instances, which is very good.

#### Macro Average: Precision: 0.92, Recall: 0.92, F1-Score: 0.92
This indicates a balanced performance across all classes.

#### Weighted Average: Precision: 0.92, Recall: 0.92, F1-Score: 0.92
Similar to macro average but considers the support (number of instances per class).
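
As a quick check of how the macro average relates to the per-class numbers listed above, the sketch below averages them directly. It uses only the rounded values from this analysis, so small differences from the reported 0.92 can come from rounding; the weighted average is not reproduced because the per-class supports are not listed here.

```python
import numpy as np

# (precision, recall, f1) per class, copied from the analysis above
per_class = {
    "sadness":  (0.94, 0.99, 0.96),
    "joy":      (0.99, 0.91, 0.95),
    "love":     (0.92, 1.00, 0.96),
    "anger":    (0.93, 0.99, 0.96),
    "fear":     (0.78, 0.86, 0.82),
    "surprise": (0.99, 0.75, 0.86),
}

# Macro average: unweighted mean over the classes
scores = np.array(list(per_class.values()))
macro_precision, macro_recall, macro_f1 = scores.mean(axis=0)
print(f"Macro precision: {macro_precision:.3f}")
print(f"Macro recall:    {macro_recall:.3f}")
print(f"Macro F1:        {macro_f1:.3f}")
```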

#### Summary
High Performance: The model performs very well overall, with an accuracy of 92%. Precision, recall, and F1-scores are generally high, indicating the model makes correct predictions and captures most of the relevant instances for each class.

#### Class-specific Insights:
- Love and Anger: Outstanding performance, with high precision and recall.
- Fear and Surprise: Lower performance, particularly in recall for surprise, suggesting the model could improve in identifying these emotions.
- Balanced Metrics: The macro and weighted averages (0.92 each) confirm that overall performance is good, even though fear and surprise lag behind the other classes.

#### Next Steps
Improve Fear and Surprise Prediction:
- Consider collecting more data for these classes.
- Experiment with data augmentation techniques to enhance the model's ability to recognize these emotions.

## Use this block when you want to import the data from previously saved datasets

In [None]:
import pandas as pd
import numpy as np

working_base = pd.read_csv('datasets/twitter_sentiment_base_ready_reduced.csv')
sentiments_features = working_base.loc[:, working_base.columns.str.startswith('feature_')].to_numpy()
labels = working_base['label'].to_numpy()