# Assignment 2 - Elements of Data Processing COMP20008


# Contributors:
### **Euan Marshall - 1480650**
### **Di Wang - 1353637**
### **Harry Huppatz - 1357194**
### **Quoc Minh Hoang - 1386263**

## **Part 1: Data Exploration and Analysis.**

In [None]:
# Importing libraries, exploring the dataset and cleaning missing values.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

titles = pd.read_csv('titles.csv')
display(titles.head())

In [None]:
# Using the built-in pandas function we can return a 
# DataFrame of the Pearson's correlation between each of the continuously measured features.
features = ['tmdb_popularity', 'tmdb_score', 'release_year', 'imdb_votes', 'imdb_score']
titles[features].corr(method = 'pearson')


In [None]:
titles.dropna(subset=['imdb_score'], inplace=True)
non_numeric_mask = pd.to_numeric(titles['imdb_score'], errors='coerce').isna()

# Filter the DataFrame to get rows with non-numeric values
non_numeric_values = titles[non_numeric_mask]

# Display the result
print(non_numeric_values)

In [None]:
# Calculating Pearson and Spearman rank coefficients.

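# Restricting to numeric columns, since corr() cannot handle text columns such as 'title'.
numeric_titles = titles.select_dtypes(include='number')
pearson_corr = numeric_titles.corr(method='pearson')['imdb_score']
spearman_corr = numeric_titles.corr(method='spearman')['imdb_score']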

# Printing Pearson and Spearman rank coefficients.
print("Pearson Correlation Coefficients:")
print(pearson_corr)
print("\nSpearman Rank Correlation Coefficients:")
print(spearman_corr)


In [None]:
# Creating a pair plot to visualise the correlations between the numerical features. 
# Analysing spearman and pearson correlations.
import seaborn as sns

titles.dropna(inplace=True) # Dropping NaN values for purpose of quick viewing.

# Creating pairplot.
sns.pairplot(titles, height=1.2, aspect=1.25)

In [None]:
# Checking NaN values in each column.
df = pd.read_csv('titles.csv')
print(df.isna().sum())

In [None]:
# Exploring the correlation between IMDb score and TMDb score.
df = pd.read_csv('titles.csv')

# Creating a scatter plot.
plt.figure(figsize=(8, 6))
plt.scatter(df['tmdb_score'], df['imdb_score'], alpha=0.2)
plt.xlabel('TMDb Score')
plt.ylabel('IMDb Score')
plt.title('TMDB Score vs. IMDb Score')
plt.grid(True)
plt.show()

In [None]:
# Exploring the correlation between IMDb score and IMDb votes.
df = pd.read_csv('titles.csv')

# Creating a scatter plot.
plt.figure(figsize=(8, 6))
plt.scatter(df['imdb_votes'], df['imdb_score'], alpha=0.5)
plt.xlabel('IMDB Votes')
plt.ylabel('IMDb Score')
plt.title('IMDB Votes vs. IMDb Score')
plt.grid(True)
plt.show()

In [None]:
# Exploring the correlation between IMDb score and TMDb popularity.
df = pd.read_csv('titles.csv')

# Creating a scatter plot.
plt.figure(figsize=(8, 6))
plt.scatter(df['tmdb_popularity'], df['imdb_score'], alpha=0.2)
plt.xlabel('TMDb Popularity')
plt.ylabel('IMDb Score')
plt.title('TMDB Popularity vs. IMDb Score')
plt.grid(True)
plt.show()

## **Part 2: Datafile Cleaning and Imputation.**

**The aim here is to fill the missing IMDb values, as these are what we are trying to predict.**

In [None]:
# Checking for missing IMDb score, TMDB score and IMDb votes.
missing_imdb = df['imdb_score'].isnull()
missing_tmdb = df['tmdb_score'].isnull()
missing_votes = df['imdb_votes'].isnull()

# Creating a condition to identify data points with missing IMDb and at least one of the other columns.
missing_condition = (missing_imdb &
                     (missing_tmdb | missing_votes))

# Counting the number of data points that meet the condition for each missing column.
count_missing_tmdb = len(df[missing_condition & missing_tmdb])
count_missing_votes = len(df[missing_condition & missing_votes])

print(f'Number of data points missing IMDb score and also missing TMDB score: {count_missing_tmdb}')
print(f'Number of data points missing IMDb score and also missing IMDb votes: {count_missing_votes}')

**Evidently, IMDb scores correlate to some degree with both TMDb scores and IMDb votes. However, every data point that is missing an IMDb score is also missing its IMDb votes value. Hence, we reject the idea of using IMDb votes as a feature for imputing IMDb scores.**
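
The overlap check can be reduced to one number per candidate feature: the share of rows missing an IMDb score that still carry a value for that feature. The sketch below uses a small made-up frame; the `toy` values and the `usable_share` name are illustrative assumptions, not taken from `titles.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for titles.csv (values are made up).
toy = pd.DataFrame({
    'imdb_score': [7.1, np.nan, 6.4, np.nan, 8.0],
    'imdb_votes': [1200.0, np.nan, 300.0, np.nan, np.nan],
    'tmdb_score': [7.0, 6.5, np.nan, 5.9, 8.2],
})

missing_imdb = toy['imdb_score'].isna()

# For each candidate feature: fraction of the rows lacking an IMDb score
# for which the feature itself is present (and so could drive imputation).
usable_share = {
    col: toy.loc[missing_imdb, col].notna().mean()
    for col in ['imdb_votes', 'tmdb_score']
}

for col, share in usable_share.items():
    print(f'{col}: available for {share:.0%} of rows missing imdb_score')
```

A feature whose share is zero, as `imdb_votes` is in this toy frame, cannot help impute any of the missing scores.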

In [None]:
# Finding data points with 10/10 TMDb scores, and the total number of them.
df = pd.read_csv('titles.csv')

# Filtering the DataFrame to get rows with TMDB scores equal to 10.
tmdb_10 = df[df['tmdb_score'] == 10]

# Displaying the rows with 10/10 TMDB scores.
display(tmdb_10)
print('Number of perfect TMDb score values: ', len(tmdb_10))
print(f'Percentage of total DataFrame: {len(tmdb_10) / len(df):.1%}')

**We also recognise the presence of data points with 10/10 TMDb scores but much lower IMDb scores. Further investigation showed that these titles are not listed on the TMDb website, so their TMDb scores appear to be incorrect. We therefore delete these rows: they make up just 1.2% of the dataset, and TMDb scores will be used to predict missing IMDb scores because of their stronger correlation. This is an area for improvement, as deleting values is not ideal, but incorrect data would also harm our imputation methods.**

In [None]:
# Removing rows where tmdb_score is 10 and where description is null.
# Chaining the two steps avoids calling an in-place dropna on a filtered slice.
df = df[df['tmdb_score'] != 10].dropna(subset=['description'])
df.to_csv('fixed_titles.csv', index=False)

**USING MACHINE LEARNING MODEL TO FILL MISSING VALUES**


In [None]:
# Checking for any data rows that are missing both imdb_score and tmdb_score.
missing_scores = df[df['tmdb_score'].isna() & df['imdb_score'].isna()]

print('Number of data points with missing IMDb and TMDb scores is', len(missing_scores))

# 5850 is the hard-coded row count used here for the whole dataset.
print(f'The percentage of the whole dataset is {len(missing_scores) / 5850:.1%}')

**From the calculation above, 81 data points are now missing both TMDb and IMDb scores. Because we aim to impute the IMDb score from the TMDb score, these data points must be removed. We therefore lose a further 1.4% of the data (81 of 5850 rows) at this stage, which is reasonable: there is no accurate means of filling these missing IMDb scores when the most strongly correlated feature, the TMDb score, is also missing.**
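
Removing only the rows where both scores are absent relies on the `how='all'` option of `DataFrame.dropna`; the default `how='any'` would also discard the rows we still want to impute. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per missingness pattern.
toy = pd.DataFrame({
    'tmdb_score': [7.0, np.nan, np.nan, 6.1],
    'imdb_score': [np.nan, 6.8, np.nan, 6.0],
})
score_cols = ['tmdb_score', 'imdb_score']

# how='all' drops a row only when every listed column is NaN.
kept_all = toy.dropna(subset=score_cols, how='all')

# how='any' (the default) drops a row when at least one listed column is NaN.
kept_any = toy.dropna(subset=score_cols, how='any')

print(len(toy), len(kept_all), len(kept_any))
```

With `how='all'`, the row that has a TMDb score but no IMDb score survives, which is exactly the kind of row the imputation step needs.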

In [None]:
# Dropping rows where both the TMDb and IMDb scores are NaN.
df.dropna(subset=['tmdb_score', 'imdb_score'], how='all', inplace=True)

**We now build a model using decision trees to predict the missing IMDb scores given the TMDb scores.**

In [None]:
# First, separating the records with a missing IMDb score (but a valid TMDb score) from the dataset.
missing_imdb_score = df[df['imdb_score'].isnull() &
                        df['tmdb_score'].notnull()]

missing_imdb_score.to_csv('imdb_score.csv', index=False)

# Keeping only the records with a known IMDb score for training.
# dropna() returns a new DataFrame, so the original df is left untouched.
df_without_missing_imdb = df.dropna(subset=['imdb_score'])

df_without_missing_imdb.to_csv('fixed_training_titles.csv', index=False)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

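df = pd.read_csv('fixed_training_titles.csv')

# Dropping rows with missing values in the feature and target columns,
# as the later model cells do.
df.dropna(subset=['tmdb_score', 'imdb_score'], inplace=True)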

def train_model(df, features, target):
    X = df[features]
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = DecisionTreeRegressor()
    model.fit(X_train, y_train)
    return model, X_test, y_test

# Train the models
model1, X1_test, y1_test = train_model(df, ['tmdb_score'], 'imdb_score')
                                       
score1 = model1.score(X1_test, y1_test)
print(f'Decision Tree R^2 Score: {score1}')

scores1 = cross_val_score(model1, X1_test, y1_test, cv=10)
print(f'Decision Tree Cross-Validation Score: {scores1.mean()}')

y1_pred = model1.predict(X1_test)
mse1 = mean_squared_error(y1_test, y1_pred)
mae1 = mean_absolute_error(y1_test, y1_pred)
print("Mean Squared Error for Decision Tree:", mse1)
print("Mean Absolute Error for Decision Tree:", mae1)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create the figure and axis
plt.figure(figsize=(10, 6))

# Plot the distribution of actual values in blue
sns.histplot(y1_test, bins=10, color='blue', alpha=0.5, label='Actual', kde=True)

# Plot the distribution of predicted values in red
sns.histplot(y1_pred, bins=10, color='red', alpha=0.5, label='Predicted', kde=True)

# Add labels and a legend
plt.xlabel('IMDb Scores')
plt.ylabel('Frequency')
plt.title('Distribution of Actual vs Predicted Values for Decision Tree Imputation Model')
plt.legend()

# Display the plot
plt.grid(True)
plt.show()

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error
import seaborn as sns
import matplotlib.pyplot as plt


def train_model(df, features, target):
    '''Trains Decision Tree models on the provided data.'''

    # Defining the features and the target for the model.
    X = df[features]
    y = df[target]

    # Splitting the data into training and testing sets.
    # A common rule is to use about 80% of the data for training and about 20% of the data for testing.
    # A common choice is to set random_state to 0 or 42.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initializing the model.
    # Fitting the model to the training data.
    model = DecisionTreeRegressor()
    model.fit(X_train, y_train)

    return model, X_test, y_test


# Defining the features.
selected_features = ['tmdb_score', 'tmdb_popularity']

# Training the models.
model2, X2_test, y2_test = train_model(df, selected_features, 'imdb_score')

score2 = model2.score(X2_test, y2_test)
print(f'Decision Tree Model2 R^2 Score: {score2}')

scores2 = cross_val_score(model2, X2_test, y2_test, cv=10)
print(f'Decision Tree Model2 Cross-Validation Score: {scores2.mean()}')

y2_pred = model2.predict(X2_test)
mse2 = mean_squared_error(y2_test, y2_pred)
mae2 = mean_absolute_error(y2_test, y2_pred)
print("Mean Squared Error for model2:", mse2)
print("Mean Absolute Error for model2:", mae2)

# Creating a figure and axis.
plt.figure(figsize=(10, 6))

# Plotting the distribution of actual IMDb scores in blue.
sns.histplot(y2_test, bins=10, color='blue', alpha=0.5, label='Actual', kde=True)

# Plotting the distribution of predicted IMDb scores in red.
sns.histplot(y2_pred, bins=10, color='red', alpha=0.5, label='Predicted', kde=True)

# Adding labels and a legend.
plt.xlabel('IMDb Score')
plt.ylabel('Frequency')
plt.title('Distribution of Actual vs Predicted Values for Decision Tree Model')
plt.legend()

# Displaying the plot.
plt.grid(True)
plt.show()

**Evidently, this means of imputation is not very accurate. So we try other models.**
**We will next try a K-NN model for imputation of missing IMDb scores, using the TMDb score.**
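
K-NN predicts from distances between feature vectors, so when a second feature such as `tmdb_popularity` is added, its scale relative to `tmdb_score` can matter. The sketch below compares an unscaled K-NN with one wrapped in a `StandardScaler` pipeline. The synthetic data, including the assumption that popularity spans a much wider range than the score, is illustrative only and is not drawn from `titles.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the real columns (assumed shapes and ranges).
rng = np.random.default_rng(0)
n = 400
tmdb_score = rng.uniform(1, 10, n)
tmdb_popularity = rng.lognormal(mean=2.0, sigma=1.0, size=n)
imdb_score = 0.8 * tmdb_score + 1.0 + rng.normal(0, 0.5, n)

X = np.column_stack([tmdb_score, tmdb_popularity])
X_train, X_test, y_train, y_test = train_test_split(
    X, imdb_score, test_size=0.2, random_state=42)

# Plain K-NN: distances are dominated by whichever feature has the larger range.
raw_knn = KNeighborsRegressor().fit(X_train, y_train)

# Scaled K-NN: each feature is standardised (using training data only) before distances are computed.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor()).fit(X_train, y_train)

raw_r2 = raw_knn.score(X_test, y_test)
scaled_r2 = scaled_knn.score(X_test, y_test)
print(f'R^2 without scaling: {raw_r2:.3f}')
print(f'R^2 with scaling:    {scaled_r2:.3f}')
```

Whether scaling helps on the real data would need to be checked in the same way; the pipeline form simply keeps the scaler fitted on the training split only.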

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
import pandas as pd
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('fixed_training_titles.csv')

# Dropping rows with missing values in specified columns.
df.dropna(subset=['tmdb_score', 'imdb_score'], inplace=True)  

def train_model(df, features, target):
    '''Trains K-NN models on the provided data.'''

    # Defining the features and the target for the model.
    X = df[features]
    y = df[target]

    # Splitting the data into training and testing sets.
    # A common rule is to use about 80% of the data for training and about 20% of the data for testing.
    # A common choice is to set random_state to 0 or 42.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initialize the model.
    # Fitting the model to the training data.
    model = KNeighborsRegressor()
    model.fit(X_train, y_train)

    return model, X_test, y_test

# Training the model with 'tmdb_score' as the only feature.
model1, X1_test, y1_test = train_model(df, ['tmdb_score'], 'imdb_score')

score1 = model1.score(X1_test, y1_test)
print(f'K-NN Model1 R^2 Score: {score1}')

scores1 = cross_val_score(model1, X1_test, y1_test, cv=10)
print(f'K-NN Model1 Cross-Validation Score: {scores1.mean()}')

y1_pred = model1.predict(X1_test)
mse1 = mean_squared_error(y1_test, y1_pred)
mae1 = mean_absolute_error(y1_test, y1_pred)
print("Mean Squared Error for K-NN Model1:", mse1)
print("Mean Absolute Error for K-NN Model1:", mae1)

# Creating a figure and axis.
plt.figure(figsize=(10, 6))

# Plotting the distribution of actual IMDb scores in blue.
sns.histplot(y1_test, bins=10, color='blue', alpha=0.5, label='Actual', kde=True)

# Plotting the distribution of predicted IMDb scores in red.
sns.histplot(y1_pred, bins=10, color='red', alpha=0.5, label='Predicted', kde=True)

# Adding labels and a legend.
plt.xlabel('IMDb Score')
plt.ylabel('Frequency')
plt.title('Distribution of Actual vs Predicted IMDb Scores for KNN Method')
plt.legend()

# Displaying the plot.
plt.grid(True)
plt.show()

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
import pandas as pd
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('fixed_training_titles.csv')

# Dropping rows with missing values in specified columns.
df.dropna(subset=['tmdb_score', 'imdb_score'], inplace=True)  

def train_model(df, features, target):
    '''Trains K-NN models on the provided data.'''

    # Defining the features and the target for the model.
    X = df[features]
    y = df[target]

    # Splitting the data into training and testing sets.
    # A common rule is to use about 80% of the data for training and about 20% of the data for testing.
    # A common choice is to set random_state to 0 or 42.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initialize the model.
    # Fitting the model to the training data.
    model = KNeighborsRegressor()
    model.fit(X_train, y_train)

    return model, X_test, y_test

# Defining the features.
selected_features = ['tmdb_score', 'tmdb_popularity']

# Training the models.
model2, X2_test, y2_test = train_model(df, selected_features, 'imdb_score')

score2 = model2.score(X2_test, y2_test)
print(f'K-NN Model2 R^2 Score: {score2}')

scores2 = cross_val_score(model2, X2_test, y2_test, cv=10)
print(f'K-NN Model2 Cross-Validation Score: {scores2.mean()}')

y2_pred = model2.predict(X2_test)
mse2 = mean_squared_error(y2_test, y2_pred)
mae2 = mean_absolute_error(y2_test, y2_pred)
print("Mean Squared Error for K-NN Model2:", mse2)
print("Mean Absolute Error for K-NN Model2:", mae2)

# Creating a figure and axis.
plt.figure(figsize=(10, 6))

# Plotting the distribution of actual IMDb scores in blue.
sns.histplot(y2_test, bins=10, color='blue', alpha=0.5, label='Actual', kde=True)

# Plotting the distribution of predicted IMDb scores in red.
sns.histplot(y2_pred, bins=10, color='red', alpha=0.5, label='Predicted', kde=True)

# Adding labels and a legend.
plt.xlabel('IMDb Score')
plt.ylabel('Frequency')
plt.title('Distribution of Actual vs Predicted IMDb Scores for KNN Method')
plt.legend()

# Displaying the plot.
plt.grid(True)
plt.show()

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt

df = pd.read_csv('fixed_training_titles.csv')

# Dropping rows with missing values in specified columns ('tmdb_score' and 'imdb_score').
df.dropna(subset=['tmdb_score', 'imdb_score'], inplace=True)

def train_model(df, features, target):
    '''Trains Linear Regression models on the provided data.'''    

    # Defining the features and the target for the model.
    X = df[features]
    y = df[target]

    # Splitting the data into training and testing sets.
    # A common rule is to use about 80% of the data for training and about 20% of the data for testing.
    # A common choice is to set random_state to 0 or 42.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initializing the model.
    # Fitting the model to the training data.
    model = LinearRegression()
    model.fit(X_train, y_train)

    return model, X_test, y_test

# Training the Linear Regression model with 'tmdb_score' as the feature and 'imdb_score' as the target.
model, X_test, y_test = train_model(df, ['tmdb_score'], 'imdb_score')

# Calculating R-squared, MSE, and MAE.
y_pred = model.predict(X_test)
r_squared = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

# Print the evaluation metrics
print(f'Linear Regression R-squared Score: {r_squared:.2f}')
print("Mean Squared Error for Linear Regression Model:", mse)
print("Mean Absolute Error for Linear Regression Model:", mae)

# Creating a scatterplot of actual vs. predicted scores, with the ideal y = x line for reference.
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.title('Actual vs. Predicted IMDb Scores')
plt.xlabel('Actual IMDb Score')
plt.ylabel('Predicted IMDb Score')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--', linewidth=2, label='Perfect Prediction (y = x)')
plt.legend()
plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Creating a figure and axis.
plt.figure(figsize=(10, 6))

# Plotting the distribution of actual IMDb scores in blue.
sns.histplot(y_test, bins=10, color='blue', alpha=0.5, label='Actual', kde=True)

# Plotting the distribution of predicted IMDb scores in red.
sns.histplot(y_pred, bins=10, color='red', alpha=0.5, label='Predicted', kde=True)

# Adding labels and a legend.
plt.xlabel('IMDb Values')
plt.ylabel('Frequency')
plt.title('Distribution of Actual vs Predicted IMDb Scores for Linear Regression Model')
plt.legend()

plt.grid(True)
plt.show()

**We will now evaluate the accuracy of simpler imputation methods: the mode, mean and median.**
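
For these constant-value baselines to be comparable with the models above, it helps to score them on the same kind of held-out split. One way to do this is scikit-learn's `DummyRegressor`, which predicts the training mean or median for every row. The sketch below runs on synthetic data; the data-generating lines and the `baseline_errors` name are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (tmdb_score, imdb_score) pairs.
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(300, 1))
y = 0.8 * X[:, 0] + 1.0 + rng.normal(0, 0.5, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

baseline_errors = {}
for strategy in ['mean', 'median']:
    # The constant is learned from the training targets only.
    baseline = DummyRegressor(strategy=strategy).fit(X_train, y_train)
    y_pred = baseline.predict(X_test)
    baseline_errors[strategy] = (
        mean_squared_error(y_test, y_pred),
        mean_absolute_error(y_test, y_pred),
    )
    print(f'{strategy}: MSE={baseline_errors[strategy][0]:.4f}, '
          f'MAE={baseline_errors[strategy][1]:.4f}')
```

A mode baseline could be added in the same loop with `DummyRegressor(strategy='constant', constant=...)`, passing the mode of the training targets.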

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

df = pd.read_csv('fixed_titles.csv')

# Calculating the mode IMDb score (NaN values are excluded by default).
mode_imdb_score = df['imdb_score'].mode()[0]

# Keeping the observed IMDb scores as ground truth before any filling.
# Filled rows have no true value and would only add zero-error terms.
actual_imdb_scores = df['imdb_score'].dropna()  # Actual IMDb scores.
imputed_imdb_scores = np.full(len(actual_imdb_scores), mode_imdb_score)  # Imputed IMDb scores (all the same).

# Filling NaN values with the mode.
df['imdb_score'] = df['imdb_score'].fillna(mode_imdb_score)

# Calculating the Mean Squared Error (MSE) and Mean Absolute Error (MAE).
mse = mean_squared_error(actual_imdb_scores, imputed_imdb_scores)
print(f'Mean Squared Error (MSE) using mode for imputation: {mse:.4f}')
mae = mean_absolute_error(actual_imdb_scores, imputed_imdb_scores)
print("Mean Absolute Error using mode for imputation:", mae)

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

df = pd.read_csv('fixed_titles.csv')

# Calculating the mean IMDb score (NaN values are excluded by default).
mean_imdb_score = df['imdb_score'].mean()

# Keeping the observed IMDb scores as ground truth before any filling.
# Filled rows have no true value and would only add zero-error terms.
actual_imdb_scores = df['imdb_score'].dropna()  # Actual IMDb scores.
imputed_imdb_scores = np.full(len(actual_imdb_scores), mean_imdb_score)  # Imputed IMDb scores (all the same).

# Filling NaN values with the mean.
df['imdb_score'] = df['imdb_score'].fillna(mean_imdb_score)

# Calculating the Mean Squared Error (MSE) and Mean Absolute Error (MAE).
mse = mean_squared_error(actual_imdb_scores, imputed_imdb_scores)
print(f'Mean Squared Error (MSE) using mean for imputation: {mse:.4f}')
mae = mean_absolute_error(actual_imdb_scores, imputed_imdb_scores)
print("Mean Absolute Error using mean for imputation:", mae)

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

df = pd.read_csv('fixed_titles.csv')

# Calculating the median IMDb score (NaN values are excluded by default).
median_imdb_score = df['imdb_score'].median()

# Keeping the observed IMDb scores as ground truth before any filling.
# Filled rows have no true value and would only add zero-error terms.
actual_imdb_scores = df['imdb_score'].dropna()  # Actual IMDb scores.
imputed_imdb_scores = np.full(len(actual_imdb_scores), median_imdb_score)  # Imputed IMDb scores (all the same).

# Filling NaN values with the median.
df['imdb_score'] = df['imdb_score'].fillna(median_imdb_score)

# Calculating the Mean Squared Error (MSE) and Mean Absolute Error (MAE).
mse = mean_squared_error(actual_imdb_scores, imputed_imdb_scores)
print(f'Mean Squared Error (MSE) using median for imputation: {mse:.4f}')
mae = mean_absolute_error(actual_imdb_scores, imputed_imdb_scores)
print("Mean Absolute Error using median for imputation:", mae)

**Evidently, the linear regression model gives more accurate imputation than either the decision tree or the simple methods such as mean or median imputation. Thus, we will use the linear regression model to impute the missing values.**
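
One way to keep such a comparison like-for-like is to score every candidate on the same cross-validation folds with the same metric. The sketch below does this on synthetic data; the data-generating lines and the `cv_mae` name are assumptions for illustration, and the numbers it prints do not reproduce the results reported in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for (tmdb_score, imdb_score) pairs.
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(300, 1))
y = 0.8 * X[:, 0] + 1.0 + rng.normal(0, 0.5, 300)

# A fixed KFold object guarantees every model sees identical folds.
folds = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    'decision_tree': DecisionTreeRegressor(random_state=42),
    'knn': KNeighborsRegressor(),
    'linear_regression': LinearRegression(),
}

cv_mae = {}
for name, estimator in candidates.items():
    # scikit-learn reports errors as negative scores, so the sign is flipped.
    scores = cross_val_score(estimator, X, y, cv=folds,
                             scoring='neg_mean_absolute_error')
    cv_mae[name] = -scores.mean()
    print(f'{name}: mean absolute error = {cv_mae[name]:.4f}')
```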

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np

df = pd.read_csv('fixed_training_titles.csv')

# Dropping rows with missing values in specified columns ('tmdb_score' and 'imdb_score').
df.dropna(subset=['tmdb_score', 'imdb_score'], inplace=True)

# Defining the features and target.
features = ['tmdb_score']
target = 'imdb_score'

# Splitting the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, random_state=42)

# Creating and training a Linear Regression model.
model = LinearRegression()
model.fit(X_train, y_train)

predict_df = pd.read_csv('imdb_score.csv')

# Filling the missing IMDb scores with the model's predictions in one vectorised call.
# Passing a DataFrame with the training column names avoids a feature-name warning.
missing = predict_df['imdb_score'].isna()
if missing.any():
    predict_df.loc[missing, 'imdb_score'] = model.predict(predict_df.loc[missing, features])

# Saving the result to a new CSV file.
combined_df = pd.concat([predict_df, df], axis=0, ignore_index=True)
combined_df.to_csv('final_filled_titles.csv', index=False)

# Ensuring all the relevant missing data has actually been filled.
df = pd.read_csv('final_filled_titles.csv')
df_nan = df.isna().sum()
print(df_nan)

**We have now filled the missing IMDb scores for every data point that had a valid TMDb score. From here, we can use the datafile to conduct natural language processing on the text features of the descriptions, and to investigate the correlations between these text features and the IMDb scores of the shows and films.**

## **Part 3: Description Text Pre-Processing.**

In [None]:
# Importing necessary libraries.
import pandas as pd
import re
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.corpus import stopwords

titles = pd.read_csv('final_filled_titles.csv') 
row = titles[titles['id'] == 'tm997728']

# Extracting the 'description' value from the selected row and converting it to lower case.
lower_sample = row['description'].iloc[0].lower()

# Replacing punctuation characters with spaces in the description text.
x = re.sub(r"[-,':.]", " ", lower_sample)

# Splitting on whitespace; split() with no argument also discards empty strings.
# Fragments left behind by removed apostrophes (single letters other than 'a' and 'i',
# and 're' / 've') are re-attached to the previous word. A new list is built because
# removing items from a list while iterating over it skips elements.
word_list = []
for word in x.split():
    is_fragment = (len(word) == 1 and word not in ('a', 'i')) or word in ('re', 've')
    if is_fragment and word_list:
        word_list[-1] += word
    else:
        word_list.append(word)

# Creating a string without punctuation.
no_punct = ' '.join(word_list)

# Tokenizing the string into words.
tokens = nltk.word_tokenize(no_punct)

stop_words = set(stopwords.words('english'))

# Removing stopwords from the list of tokens.
stopwords_removed = [w for w in tokens if w not in stop_words]

# Reconstructing description with stopwords removed.
stop_removed_str = ' '.join(stopwords_removed)

sia = SentimentIntensityAnalyzer()

# Analyzing sentiment for the original text and text with stopwords removed.
print(f"Sample description: {lower_sample}\n{sia.polarity_scores(lower_sample)} \nWith stopwords removed: {stop_removed_str}\n{sia.polarity_scores(stop_removed_str)}")


**After cleaning and pre-processing the text descriptions we can show that removing stopwords increases the 'visibility' of the sentiment of a given description. Thus we will use these pre-processing techniques in our further analysis.**

## **Part 4: Sentiment Analysis.**

In [None]:
# Preprocessing all descriptions and graphing sentiment vs IMDb score correlation.

titles = pd.read_csv('final_filled_titles.csv') 
titles.dropna(subset=['description'], inplace=True)

def preprocess_description_text(row):
    '''This function preprocesses the description text by converting it to lower case, removing punctuation,
    re-attaching apostrophe fragments (single letters except 'a' and 'i', plus 're' and 've') to the
    previous word, removing stopwords, and calculating the sentiment score.'''

    description = row['description']

    # Converting to lower case and removing punctuation.
    # Assumption: All 'description' data are in English. Therefore, only .lower() is needed.
    lower_sample = description.lower()
    x = re.sub(r"[-,':.]", " ", lower_sample)

    # Re-attaching fragments to the previous word; split() also discards empty strings.
    # A new list is built rather than removing items from the list being iterated over.
    word_list = []
    for word in x.split():
        is_fragment = (len(word) == 1 and word not in ('a', 'i')) or word in ('re', 've')
        if is_fragment and word_list:
            word_list[-1] += word
        else:
            word_list.append(word)

    # Joining words back into a string.
    no_punct = ' '.join(word_list)

    # Tokenizing the string.
    tokens = nltk.word_tokenize(no_punct)

    # Removing stopwords.
    stop_words = set(stopwords.words('english'))
    stopwords_removed = [w for w in tokens if not w in stop_words]

    # Joining words back into a string.
    stop_removed_str = ''
    for word in stopwords_removed:
        stop_removed_str = stop_removed_str + word + ' '
    
    # Calculating sentiment score.
    sia = SentimentIntensityAnalyzer()
    sentiment_score = sia.polarity_scores(stop_removed_str)['compound']  # Getting only the compound score.

    return sentiment_score 

# Applying preprocessing function to each row of the DataFrame.
titles['sentiment_score'] = titles.apply(lambda x : preprocess_description_text(x), axis = 1)

# Creating scatter plot of sentiment scores vs. IMDb ratings.
plt.figure(figsize=(10, 6))
plt.scatter(titles['sentiment_score'], titles['imdb_score'], alpha=0.5)
plt.title('Sentiment Scores vs. IMDb Score')
plt.xlabel('Sentiment Score')
plt.ylabel('IMDb Score')
plt.grid(True)

plt.show()

# Calculating the Pearson and Spearman correlation coefficients.
pearson_corr = titles['sentiment_score'].corr(titles['imdb_score'], method='pearson')
spearman_corr = titles['sentiment_score'].corr(titles['imdb_score'], method='spearman')

# Printing the correlation coefficients.
print("Pearson Correlation Coefficient:", pearson_corr)
print("Spearman Correlation Coefficient:", spearman_corr)

The following cell groups the sentiment scores by genre and plots the mean sentiment score for each genre.

In [None]:
# Initializing an empty dictionary to store the sentiment scores for each genre.
genre_dict = {}

# Iterating over the 'genres' column together with the matching sentiment scores.
# zip() pairs the values by position, so gaps in the DataFrame index cannot misalign them.
for genres, score in zip(titles['genres'], titles['sentiment_score']):

    # Removing quotes and brackets from the genres string.
    wo_punct = re.sub(r'["\'\[\]]', '', genres)

    # Splitting the genres into a list.
    lst = wo_punct.split(', ')

    # Adding each genre and its corresponding sentiment score to the dictionary.
    for j in lst:
        if j not in genre_dict:
            genre_dict[j] = []
        genre_dict[j].append(score)

# Calculating the mean sentiment score for each genre.
genre_means = {genre: sum(values) / len(values) for genre, values in genre_dict.items()}

# Sorting the genres by their mean sentiment score.
genre_means = sorted(genre_means.items(), key=lambda x:x[1])
genre_means = dict(genre_means)

# Extracting genre names and mean values.
genre_names = list(genre_means.keys())
genre_means_values = list(genre_means.values())

# Creating a bar plot of mean sentiment scores by genre.
plt.figure(figsize=(10, 6))
plt.bar(genre_names, genre_means_values)
plt.xlabel('Genre')
plt.ylabel('Mean sentiment_score')
plt.title('Mean sentiment_score by Genre')
plt.xticks(rotation=90)  # Rotating genre names for better visibility.
plt.show()
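
The per-genre aggregation above can also be written with pandas alone, by parsing each genres string into a list and using `explode` so that every (title, genre) pair gets its own row. This is a sketch under the assumption that the `genres` column holds Python-style list strings such as `"['drama', 'crime']"`; the `toy` frame and its values are made up.

```python
import ast

import pandas as pd

# Hypothetical toy data in the assumed format of the 'genres' column.
toy = pd.DataFrame({
    'genres': ["['drama', 'crime']", "['comedy']", "['drama']"],
    'sentiment_score': [-0.6, 0.4, 0.2],
})

# Parsing each string into a real list, e.g. "['drama', 'crime']" -> ['drama', 'crime'].
toy['genre'] = toy['genres'].apply(ast.literal_eval)

# explode() repeats the row once per list element; groupby then averages per genre.
genre_means = (
    toy.explode('genre')
       .groupby('genre')['sentiment_score']
       .mean()
       .sort_values()
)
print(genre_means)
```

The resulting Series is already sorted and can be passed to a bar plot directly, for example with `genre_means.plot.bar()`.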

**From the analysis above, there is no significant correlation between IMDb scores and the sentiment of a description. Hence, we will analyse the correlation between IMDb score and description using other natural language processing techniques.**
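
To put a number on "significant", the correlation coefficients can be reported together with their p-values, which `scipy.stats.pearsonr` and `scipy.stats.spearmanr` both return. The sketch below uses synthetic data in which the two variables are generated independently; it illustrates the calls only and does not reproduce this notebook's results.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic stand-ins: sentiment in [-1, 1], and scores drawn independently of it.
rng = np.random.default_rng(0)
sentiment = rng.uniform(-1, 1, 500)
imdb = rng.normal(6.5, 1.0, 500)

# Each call returns the coefficient and the p-value for the null hypothesis of no association.
r, r_p = pearsonr(sentiment, imdb)
rho, rho_p = spearmanr(sentiment, imdb)

print(f'Pearson r = {r:.3f} (p = {r_p:.3f})')
print(f'Spearman rho = {rho:.3f} (p = {rho_p:.3f})')
```

A small coefficient with a large p-value is consistent with no detectable association. With several thousand titles, even a very weak correlation can yield a small p-value, so the size of the coefficient matters as much as the p-value.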

## **Part 5: Key Word Analysis using TF-IDF Vectorisation.**


In [None]:
# Finding the top words most correlated with IMDb scores.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

df = pd.read_csv('final_filled_titles.csv') 

# Splitting the data into features (descriptions) and target (IMDb ratings).
X = df['description']
y = df['imdb_score']

# Creating a TF-IDF vectorizer to convert descriptions into numerical features.
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(X)

# Fitting a linear regression model to predict IMDb ratings based on TF-IDF features.
regressor = LinearRegression()
regressor.fit(X_tfidf, y)

# Getting the word (feature name) for each column of the TF-IDF matrix.
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# Creating a DataFrame to associate words with their regression coefficients.
coefficients_df = pd.DataFrame({'Word': tfidf_feature_names, 'Coefficient': regressor.coef_})

# Sorting the DataFrame by coefficient value to find the words with the largest positive coefficients.
top_words = coefficients_df.sort_values(by='Coefficient', ascending=False)

# Displaying the words ranked by their association with high IMDb ratings.
print(top_words)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import math
import numpy as np

# Selecting a subset of top words for visualization (here, the top 6 words).
selected_words = top_words.head(6)

# Extracting the TF-IDF weights of the selected words for every description.
presence_df = X_tfidf[:, [tfidf_vectorizer.vocabulary_[word] for word in selected_words['Word'].values]].toarray()

# Creating a DataFrame combining the TF-IDF weights of the selected words with IMDb ratings.
combined_df = pd.DataFrame(presence_df, columns=selected_words['Word'].values)
combined_df['IMDb_Rating'] = y.values

# Calculating the number of rows and columns for the grid.
num_cols = 3 
num_rows = math.ceil(len(selected_words) / num_cols)

# Creating a grid of subplots.
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 10))

# Flattening the axes array for easier iteration.
axes = axes.flatten()

# Creating pairplots for each selected word vs. IMDb rating in separate subplots.
for i, word in enumerate(selected_words['Word'].values):
    ax = axes[i]
    sns.scatterplot(x=word, y='IMDb_Rating', data=combined_df, ax=ax)
    
    # Adding a regression line and calculating the Pearson correlation coefficient.
    sns.regplot(x=word, y='IMDb_Rating', data=combined_df, ax=ax, scatter=False, color='red', line_kws={'label': 'Regression Line'})
    corr_coef = np.corrcoef(combined_df[word], combined_df['IMDb_Rating'])[0, 1]
    
    ax.set_title(f'{word} vs. IMDb Rating')
    ax.set_xlabel('Frequency of \'' + word + '\'')
    ax.set_ylabel('IMDb Rating')

    # Adding legend for the regression line and correlation value.
    ax.legend()
    ax.text(0.7, 0.9, f'Correlation: {corr_coef:.2f}', transform=ax.transAxes, fontsize=10, color='blue')

# Removing any empty subplots (if the number of selected words is not a multiple of num_cols).
for i in range(len(selected_words), num_cols * num_rows):
    fig.delaxes(axes[i])

# Adding a title for the entire figure.
fig.suptitle("Scatter Plots of Selected Words vs. IMDb Rating with Regression Lines and Correlation Values", y=1.02)

# Adjusting the spacing between subplots.
plt.tight_layout()

# Showing the figure.
plt.show()

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import stats
from scipy.stats import spearmanr

df = pd.read_csv('final_filled_titles.csv')

# TF-IDF vectorization of movie descriptions.
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(df['description'])

# Calculating the mean TF-IDF value for each movie's description.
mean_tfidf = np.mean(X_tfidf.toarray(), axis=1)

# Calculating the correlation between the mean TF-IDF values and IMDb scores (Pearson).
pearson_corr = np.corrcoef(mean_tfidf, df['imdb_score'])[0, 1]

# Calculating the Spearman correlation between the mean TF-IDF values and IMDb scores.
spearman_corr, _ = spearmanr(mean_tfidf, df['imdb_score'])

# Create a scatter plot with Matplotlib.
plt.figure(figsize=(10, 6))
plt.scatter(mean_tfidf, df['imdb_score'], alpha=0.5)

# Adding a regression line.
slope, intercept, r_value, p_value, std_err = stats.linregress(mean_tfidf, df['imdb_score'])
plt.plot(mean_tfidf, slope * mean_tfidf + intercept, color='red', label=f'Regression Line (R={r_value:.2f})')

plt.xlabel('Mean TF-IDF Value')
plt.ylabel('IMDb Scores')
plt.title('Correlation between Mean TF-IDF and IMDb Scores')
plt.legend()
plt.grid(True)
plt.show()

# Printing the correlations.
print(f"Pearson Correlation between Mean TF-IDF and IMDb Scores: {pearson_corr:.4f}")
print(f"Spearman Correlation between Mean TF-IDF and IMDb Scores: {spearman_corr:.4f}")

**The graphs above show only very weak correlations between the TF-IDF weight of these words and the IMDb score. This could still serve as a basis for our machine learning prediction stage. However, only a few of the data points actually contain each of these more highly correlated words, so most points sit at a frequency of zero. Furthermore, the Pearson coefficients displayed are low and are unlikely to be sufficient for an accurate machine learning model unless some other feature selection is introduced.**
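
The observation that few data points contain each word can be checked directly by counting the descriptions with a non-zero TF-IDF weight for that word. The following is a minimal sketch on a hypothetical toy corpus; `descriptions` and the helper `word_coverage` are illustrative names and are not part of the analysis above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy descriptions standing in for df['description'].
descriptions = [
    "a documentary about war and history",
    "a comedy about a family holiday",
    "a war drama following a young soldier",
    "a romantic comedy set in paris",
]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(descriptions)

def word_coverage(X, vectorizer, words):
    """Return {word: number of documents with a non-zero TF-IDF weight}."""
    # getnnz(axis=0) counts the stored non-zero entries in each column.
    docs_per_column = X.getnnz(axis=0)
    return {w: int(docs_per_column[vectorizer.vocabulary_[w]]) for w in words}

coverage = word_coverage(X, vectorizer, ['war', 'comedy', 'paris'])
print(coverage)  # {'war': 2, 'comedy': 2, 'paris': 1}
```

Applied to the real TF-IDF matrix, a low count for a word would indicate that its correlation value rests on a small number of titles.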

## **Part 6: Vector Space Analysis.**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import KernelPCA
from scipy.stats import pearsonr, spearmanr
from scipy import stats

df = pd.read_csv('final_filled_titles.csv')

# TF-IDF vectorization of movie descriptions.
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(df['description'])

# Applying Kernel PCA to collapse TF-IDF vectors into one dimension.
kernel_pca = KernelPCA(n_components=1, kernel='rbf')
X_kernel_pca = kernel_pca.fit_transform(X_tfidf.toarray())

# Calculating the correlation between the Kernel PCA component and IMDb scores.
correlation = np.corrcoef(X_kernel_pca.flatten(), df['imdb_score'])[0, 1]

# Calculating Pearson correlation between the Kernel PCA component and IMDb scores.
pearson_corr, _ = pearsonr(X_kernel_pca.flatten(), df['imdb_score'])
print(f"Pearson Correlation between Kernel PCA Component and IMDb Scores: {pearson_corr:.4f}")

# Calculating Spearman correlation between the Kernel PCA component and IMDb scores.
spearman_corr, _ = spearmanr(X_kernel_pca.flatten(), df['imdb_score'])
print(f"Spearman Correlation between Kernel PCA Component and IMDb Scores: {spearman_corr:.4f}")

# Creating a scatter plot.
plt.figure(figsize=(10, 6))
plt.scatter(X_kernel_pca, df['imdb_score'], alpha=0.5)

# Adding a regression line.
slope, intercept, r_value, p_value, std_err = stats.linregress(X_kernel_pca.flatten(), df['imdb_score'])
plt.plot(X_kernel_pca, slope * X_kernel_pca + intercept, color='red', label=f'Regression Line (R={r_value:.2f})')

plt.xlabel('Kernel PCA Component')
plt.ylabel('IMDb Scores')
plt.title('Correlation between Kernel PCA Component and IMDb Scores')
plt.legend()
plt.grid(True)
plt.show()

## **Part 7: Machine Learning model building and assessing.**

**Given the weak Pearson correlation coefficients found between description features and IMDb scores throughout the NLP processing stage, we now apply tree-based machine learning methods such as Decision Trees and Random Forests, which can capture non-linear relationships, in an attempt to maximise the accuracy of a model that uses the description to predict IMDb scores.**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

df = pd.read_csv('final_filled_titles.csv')

# TF-IDF vectorization of movie descriptions
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(df['description'])

# Apply Kernel PCA to collapse TF-IDF vectors into one dimension
kernel_pca = KernelPCA(n_components=1, kernel='rbf')
X_kernel_pca = kernel_pca.fit_transform(X_tfidf.toarray())

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_kernel_pca, df['imdb_score'], test_size=0.2, random_state=42)

# Create and train a Random Forest regressor
regressor = RandomForestRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = regressor.predict(X_test)

# Calculate MSE, RMSE, R-squared, MAE and the mean absolute percentage error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
percentage_error = np.mean(np.abs((y_test - y_pred) / y_test)) * 100

print(f'Mean Absolute Error (MAE): {mae:.4f}')
print(f'Mean Squared Error (MSE): {mse:.4f}')
print(f'Root Mean Squared Error (RMSE): {rmse:.4f}')
print(f'R-squared: {r2:.4f}')
print(f'Percentage Error: {percentage_error:.2f}%')

# Creating scatter plot.
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2, label='Ideal Fit')
plt.xlabel('Actual IMDb Score')
plt.ylabel('Predicted IMDb Score')
plt.title('Actual vs. Predicted IMDb Scores for TF-IDF with PCA Random Forest Method')
plt.legend()
plt.grid(True)
plt.show()

plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, alpha=0.5, label='Actual IMDb Scores')
plt.scatter(X_test, y_pred, color='red', alpha=0.5, label='Predicted IMDb Scores')
plt.xlabel('Kernel PCA Component')
plt.ylabel('IMDb Score')
plt.title('Actual vs. Predicted IMDb Scores')
plt.legend()
plt.grid(True)
plt.show()

**As the metrics above show, this machine learning model is not as accurate as we had hoped. However, IMDb ratings are subjective and highly variable, and they depend on many factors that the description does not capture, including the actual quality of the film, the plot and the actors. With that in mind, the model performs reasonably well at predicting the IMDb score of a movie/TV show from its description alone: it computes the TF-IDF vector of the description and then applies Kernel PCA to reduce that vector to a single component value. The mean absolute error of 1.06 means that, on average, a predicted IMDb score differs from the actual score by about 1.06 points; individual predictions can be further off than this.**
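
To put a mean absolute error in context, it helps to compare it with a naive baseline that always predicts the mean training score. The following is a minimal sketch under stated assumptions: the data is synthetic (`X_demo` and `y_demo` are hypothetical stand-ins for the Kernel PCA component and the IMDb scores), so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-ins (assumptions): one feature column and noisy scores on a 1-10 scale.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 1))
y_demo = np.clip(6.5 + 0.3 * X_demo[:, 0] + rng.normal(scale=1.1, size=500), 1, 10)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Baseline: always predict the mean of the training scores.
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
# Same model settings as the Random Forest cell above.
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

baseline_mae = mean_absolute_error(y_te, baseline.predict(X_te))
forest_mae = mean_absolute_error(y_te, forest.predict(X_te))

print(f'Baseline MAE (predict the mean): {baseline_mae:.4f}')
print(f'Random Forest MAE: {forest_mae:.4f}')
```

If the Random Forest MAE on the real data is not clearly below the baseline MAE, the description features are adding little beyond the average rating.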

In [None]:
# Creating histogram.
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.histplot(y_test, color='blue', alpha=0.5, label='Actual IMDb Scores', kde=True)
sns.histplot(y_pred, color='red', alpha=0.5, label='Predicted IMDb Scores', kde=True)
plt.xlabel('IMDb Score')
plt.ylabel('Frequency')
plt.title('Distribution of Actual vs. Predicted IMDb Scores for TF-IDF with PCA Method')
plt.legend()
plt.grid(True)
plt.show()