In [None]:
# Importing packages
from matplotlib import pyplot as plt
import numpy as np
import random
import utils
import re
import turicreate as tc

### 1) [2 marks] Imagine a movie production company wants to use a sentiment analysis model to identify positive/negative reviews of their movies. Which is worse for this use case, a false positive or a false negative, or are they equally bad? What value of β would be suitable for an Fβ score?

- For a film production company, the distinction between positive and negative reviews feeds directly into decisions. Taking "positive" as the positive class, a false positive is a negative review that the model labels as positive: the company may overestimate how well a film was received, and risks financial loss and brand harm by acting on praise that is not there. A false negative is a positive review that the model labels as negative: the company may miss chances to profit from a well-received film. How serious each error is depends on the specific film being analyzed and on the company's goals. The choice of β for the Fβ score should reflect that priority. A β greater than 1 weights recall more heavily and so penalizes false negatives, a β less than 1 weights precision more heavily and so penalizes false positives, and β = 1 weights the two equally. The evaluation cells in this notebook use β = 2, which treats false negatives as the more costly error.
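
For reference, the Fβ score combines precision $P$ and recall $R$ as $F_\beta = (1 + \beta^2)\,PR / (\beta^2 P + R)$. The sketch below uses made-up precision and recall values (not results from this notebook) to show how β moves the score toward one or the other; the helper name `fbeta` is only for illustration.

```python
def fbeta(precision, recall, beta):
    """F-beta score from precision and recall (standard definition)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical values chosen for illustration only.
precision, recall = 0.9, 0.6

f_half = fbeta(precision, recall, beta=0.5)  # leans toward precision
f_one = fbeta(precision, recall, beta=1.0)   # harmonic mean of the two
f_two = fbeta(precision, recall, beta=2.0)   # leans toward recall
print(f_half, f_one, f_two)
```

Because recall is the weaker of the two values here, the score drops as β grows: the larger β is, the more the low recall counts against the model.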

### 2) [4 marks] Load the original dataset into a dataframe and use the regex Python library to clean the text data so that it is better suited for sentiment analysis. Add a markdown cell to explain what you are doing.

In [None]:
df = tc.SFrame('IMDB_Dataset.csv')
df = df[0:10] #keeping SFrame small for testing
df

In [None]:
# make a list of stopwords (this list is too short)
#stopwords = ['a', 'an', 'and', 
#             'are', 'as', 'at', 
#             'be', 'but', 'by', 
#             'for', 'if', 'in', 
#             'into', 'is', 'it', 
#             'no', 'not', 'of', 
#             'on', 'or', 'such', 
#             'that', 'the', 'their', 
#             'then', 'there', 'these', 
#             'they', 'this', 'to', 'was', 
#             'will', 'with'
#]

In [None]:
# use Turi Create's built-in stopword list instead of the short hand-written one above
stopwords = tc.text_analytics.stop_words()
#stopwords

In [None]:
# This function cleans a single review.

def clean_text(text):
    # remove html tags such as <br />
    text = re.sub(r'<[^>]+>', ' ', text)

    # replace punctuation with spaces (keep word characters and whitespace)
    text = re.sub(r'[^\w\s]', ' ', text)

    # remove numbers
    text = re.sub(r'\d', '', text)

    # make everything lowercase
    text = text.lower()

    # remove words with one to two characters
    text = re.sub(r'\b\w{1,2}\b', '', text)

    # remove stopwords (escaped so each one is matched literally)
    for word in stopwords:
        text = re.sub(r'\b' + re.escape(word) + r'\b', '', text)
    
    return text

In [None]:
# applying the clean_text() function to each row of the SFrame
# note this is much faster than looping through row-by-row
df['cleaned'] = df.apply(lambda x: clean_text(x['review']))
#df['cleaned'][1000]

## Explanation

The code performs text preprocessing on a dataset stored in an SFrame object. 
Here's a breakdown of what each section of the code does:

1. **Loading Data:** Initially, it loads the data from a CSV file named 'IMDB_Dataset.csv' into an SFrame object named 'df'. To keep the SFrame small for testing purposes, only the first 10 rows are selected.

2. **Defining Stopwords:** The stopword list is taken from Turi Create's 'text_analytics.stop_words()'. Stopwords are common words in a language (e.g., "the", "is", "and") that are typically filtered out before processing natural language data because they carry little meaning on their own.

3. **Text Cleaning Function:** The 'clean_text()' function is defined to clean a single review. It performs several cleaning operations like removing HTML tags, punctuation, numbers, converting text to lowercase, removing short words, and removing stopwords.

4. **Applying Text Cleaning:** The 'clean_text()' function is applied to each review in the 'review' column of the SFrame using the 'apply()' method. This process creates a new column named 'cleaned' in the SFrame, containing the cleaned text.

5. **Counting Words:** The 'count_words()' function from the Turi Create library is used to count the frequency of each word in the 'cleaned' text. The word counts are stored in a new column named 'words' in the SFrame.
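
The cleaning and counting steps above can be tried on a single string without Turi Create. The sketch below is a simplified stand-in, not the notebook's exact pipeline: `TINY_STOPWORDS` is a hypothetical four-word list, `clean_text_demo` filters words after splitting instead of using regex substitutions, and `collections.Counter` plays the role of 'count_words()'.

```python
import re
from collections import Counter

# Hypothetical stand-in for tc.text_analytics.stop_words(); the real list is much longer.
TINY_STOPWORDS = {"the", "was", "and", "this"}

def clean_text_demo(text):
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML tags such as <br />
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    text = re.sub(r"\d", "", text)         # drop digits
    text = text.lower()
    # keep words longer than two characters that are not stopwords
    words = [w for w in text.split() if len(w) > 2 and w not in TINY_STOPWORDS]
    return " ".join(words)

sample = "This movie was GREAT!<br /><br />Great cast, and the 2 leads shine."
cleaned = clean_text_demo(sample)
word_counts = Counter(cleaned.split())
print(cleaned)       # movie great great cast leads shine
print(word_counts)
```

The resulting dictionary-like object maps each remaining word to its frequency, which is the same shape of data stored in the 'words' column.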

### 3) [1 mark] Load the cleaned data and labels into an SFrame. Add a column named ‘words’ to the SFrame that stores the count of each word used in each review. Print the SFrame.

In [None]:
# counting words and adding word-count dictionary to SFrame
df['words'] = tc.text_analytics.count_words(df['cleaned'])
df

### 4) [1 mark] Split the data into training/validation/testing sets using 80%/10%/10% respectively.

In [None]:
# Split the data into training, validation, and testing sets using 80%/10%/10% respectively
# First hold out roughly 20% of the rows, then split that holdout in half.
train_data, holdout_data = df.random_split(0.8, seed=0)
validation_data, test_data = holdout_data.random_split(0.5, seed=0)
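
One way to get an 80%/10%/10% split with exact sizes is to shuffle once and cut the shuffled rows at fixed positions. This is a minimal sketch on plain Python indices, which are stand-ins for SFrame rows rather than the notebook's data; by default 'random_split' assigns rows at random, so its set sizes are only approximately 80/10/10.

```python
import random

rows = list(range(1000))   # stand-in for the row indices of a dataset
rng = random.Random(0)     # fixed seed so the split is repeatable
rng.shuffle(rows)

n_train = int(0.8 * len(rows))
n_val = int(0.1 * len(rows))

train_idx = rows[:n_train]
val_idx = rows[n_train:n_train + n_val]
test_idx = rows[n_train + n_val:]
print(len(train_idx), len(val_idx), len(test_idx))
```

The three index lists are disjoint by construction, so no review used for training can leak into validation or testing.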

### 5) [3 marks] Use Turicreate to create logistic classifiers for sentiment analysis. Experiment with different values of hyperparameters to develop two different models

In [None]:
# Create the first logistic classifier model with default hyperparameters
model1 = tc.logistic_classifier.create(
    train_data, target='sentiment', 
    features=['words'], 
    # Track accuracy on the validation set during training
    validation_set = validation_data
)

model1

In [None]:
# Create the second logistic classifier model with custom hyperparameters (step size and maximum number of iterations)
model2 = tc.logistic_classifier.create(
    train_data, 
    target='sentiment', 
    features=['words'], 
    step_size=0.8, 
    max_iterations=400,
    # Track accuracy on the validation set during training
    validation_set = validation_data
)

model2

### 6) [4 marks] For each model:
#### a) display the training and validation accuracies;
#### b) display the confusion matrix on the validation set;
#### c) calculate and display recall, precision, and Fβ score (using the value of β you chose above) on the validation set.
#### d) plot the ROC curve and find the AUC for the validation set.

#### For Model 1

In [None]:
# Predictions for training and validation data using Model 1
class_predictions_train_model1 = model1.predict(train_data, output_type='class')
class_predictions_val_model1 = model1.predict(validation_data, output_type='class')

In [None]:
# Calculate training accuracy for Model 1
train_accuracy_model1 = tc.evaluation.accuracy(train_data['sentiment'], class_predictions_train_model1)
print(f"Training Accuracy: {train_accuracy_model1}")

# Calculate validation accuracy for Model 1
val_accuracy_model1 = tc.evaluation.accuracy(validation_data['sentiment'], class_predictions_val_model1)
print(f"Validation Accuracy: {val_accuracy_model1}")

In [None]:
# Display confusion matrix for Model 1
tc.evaluation.confusion_matrix(validation_data['sentiment'], class_predictions_val_model1)

In [None]:
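# Calculate and display recall, precision, and Fβ score (β = 2) for Model 1
print("Recall:", tc.evaluation.recall(validation_data['sentiment'], class_predictions_val_model1))
print("Precision:", tc.evaluation.precision(validation_data['sentiment'], class_predictions_val_model1))
print("F2 score:", tc.evaluation.fbeta_score(validation_data['sentiment'], class_predictions_val_model1, beta=2.0))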

In [None]:
# Calculate probabilities for Model 1
probabilities_val_model1 = model1.predict(validation_data, output_type='probability')
probabilities_val_model1

In [None]:
roc_data_1 = tc.evaluation.roc_curve(validation_data['sentiment'], probabilities_val_model1)
display(roc_data_1.head())
display(roc_data_1.tail())

In [None]:
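# ROC curve: false positive rate on the x-axis, true positive rate on the y-axis
roc_x_model1 = roc_data_1['fpr']
roc_y_model1 = roc_data_1['tpr']

In [None]:
plt.plot(roc_x_model1, roc_y_model1)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')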

In [None]:
auc =  tc.evaluation.auc(validation_data['sentiment'], probabilities_val_model1)
auc
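
As a cross-check on the AUC value, the area under an ROC curve can be approximated from its (FPR, TPR) points with the trapezoidal rule. The points below are made up for illustration and are not taken from the models in this notebook; `auc_estimate` is a hypothetical name.

```python
import numpy as np

# Hypothetical ROC points, not results from this notebook.
fpr = np.array([0.0, 0.1, 0.4, 1.0])
tpr = np.array([0.0, 0.6, 0.9, 1.0])

# Sort by FPR so the curve runs left to right, then sum the trapezoid areas.
order = np.argsort(fpr)
fpr, tpr = fpr[order], tpr[order]
auc_estimate = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(auc_estimate)
```

A curve along the diagonal gives 0.5 under this rule, and a curve that hugs the top-left corner approaches 1.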

#### For Model 2

In [None]:
# Predictions for training and validation data using Model 2
class_predictions_train_model2 = model2.predict(train_data, output_type='class')
class_predictions_val_model2 = model2.predict(validation_data, output_type='class')

In [None]:
# Calculate training accuracy for Model 2
train_accuracy_model2 = tc.evaluation.accuracy(train_data['sentiment'], class_predictions_train_model2)
print(f"Training Accuracy: {train_accuracy_model2}")

# Calculate validation accuracy for Model 2
val_accuracy_model2 = tc.evaluation.accuracy(validation_data['sentiment'], class_predictions_val_model2)
print(f"Validation Accuracy: {val_accuracy_model2}")

In [None]:
# Display confusion matrix for Model 2
tc.evaluation.confusion_matrix(validation_data['sentiment'], class_predictions_val_model2)

In [None]:
# Calculate and display recall, precision, and Fβ score (β = 2) for Model 2
print("Recall:", tc.evaluation.recall(validation_data['sentiment'], class_predictions_val_model2))
print("Precision:", tc.evaluation.precision(validation_data['sentiment'], class_predictions_val_model2))
print("F2 score:", tc.evaluation.fbeta_score(validation_data['sentiment'], class_predictions_val_model2, beta=2.0))

In [None]:
# Calculate probabilities for Model 2
probabilities_val_model2 = model2.predict(validation_data, output_type='probability')
probabilities_val_model2

In [None]:
roc_data_2 =  tc.evaluation.roc_curve(validation_data['sentiment'], probabilities_val_model2)
display(roc_data_2.head())
display(roc_data_2.tail())

In [None]:
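# ROC curve: false positive rate on the x-axis, true positive rate on the y-axis
roc_x_model2 = roc_data_2['fpr']
roc_y_model2 = roc_data_2['tpr']

In [None]:
plt.plot(roc_x_model2, roc_y_model2)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')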

In [None]:
auc =  tc.evaluation.auc(validation_data['sentiment'], probabilities_val_model2)
auc

### 7) [1 mark] Select which of your two models is the best (or declare a tie) and justify your choice by commenting on metrics and the confusion matrix.

- In terms of training accuracy, validation accuracy, and other evaluation metrics including precision, recall, and Fβ score, both Models 1 and 2 perform similarly. Additionally, there are no appreciable differences between the two models' confusion matrices in terms of their capacity to accurately identify instances and manage false positives and false negatives.

- It is challenging to say with certainty one model is superior to the other because of the similarity in performance across different metrics and the lack of noticeable differences in the confusion matrices. Thus, we conclude that Models 1 and 2 are tied.

### 8) [2 marks] Using the test set:
#### a) calculate and display the accuracy;
#### b) display the confusion matrix;
#### c) calculate and display recall, precision, and Fβ score.
#### d) plot the ROC curve and find the AUC.

In [None]:
# Make predictions on the test set using Model 2 (the two models tied, so Model 2 is used here)
prediction_test_model = model2.predict(test_data, output_type='class')

In [None]:
# Calculate and display accuracy on the test set
tc.evaluation.accuracy(test_data['sentiment'], prediction_test_model)

In [None]:
# Display confusion matrix for the test set
tc.evaluation.confusion_matrix(test_data['sentiment'], prediction_test_model)

In [None]:
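# Calculate and display recall, precision, and Fβ score (β = 2) for the test set
print("Recall:", tc.evaluation.recall(test_data['sentiment'], prediction_test_model))
print("Precision:", tc.evaluation.precision(test_data['sentiment'], prediction_test_model))
print("F2 score:", tc.evaluation.fbeta_score(test_data['sentiment'], prediction_test_model, beta=2.0))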

In [None]:
# Calculate probabilities for Model 2 on the test set
probabilities_test = model2.predict(test_data, output_type='probability')
probabilities_test

In [None]:
roc_test =  tc.evaluation.roc_curve(test_data['sentiment'], probabilities_test)
display(roc_test.head())
display(roc_test.tail())

In [None]:
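# ROC curve: false positive rate on the x-axis, true positive rate on the y-axis
roc_x_test = roc_test['fpr']
roc_y_test = roc_test['tpr']

In [None]:
plt.plot(roc_x_test, roc_y_test)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')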

In [None]:
auc =  tc.evaluation.auc(test_data['sentiment'], probabilities_test)
auc

# Contributions