### Workflow and Thought Process (SML with USML)

Step 1: Get the data into the right layout.
- Always think tabular first
- Rows are observations / samples
- Columns are variables / features

```
# Load the data into pandas using the correct file type.

df = pd.read_csv(...)  # csv
df = pd.read_excel(..., sheet_name=...)  # excel
df = pd.read_parquet(...)  # parquet

# and 16 other formats including spss and sql.
# ## See here: https://pandas.pydata.org/docs/user_guide/io.html
```

```
# Perform data wrangling to get the correct layout.
# ## There is no shame in using GenAI (ChatGPT, Claude, Gemini)

df = df.apply(lambda x: ..., axis=1)
```
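As one hedged illustration of "thinking tabular", the sketch below pivots a long-format log (one row per measurement) into one row per observation and one column per variable. The column names (`sample_id`, `variable`, `value`) and the data are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per (sample, variable) measurement
long_df = pd.DataFrame({
    "sample_id": [1, 1, 2, 2],
    "variable": ["height", "weight", "height", "weight"],
    "value": [170.0, 65.0, 182.0, 80.0],
})

# Pivot so that each row is one sample and each column is one variable
tidy_df = long_df.pivot(index="sample_id", columns="variable", values="value")
tidy_df.columns.name = None  # drop the leftover "variable" axis label
print(tidy_df)
```

A row-wise `apply` also works, but a reshape such as `pivot` or `melt` is usually easier to read when the issue is layout rather than values.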

Step 2: Explore the data
- Interpret features and form initial hypotheses
- Observations are used to compute summary statistics for inspection
- "Visualize" the problem and solution

```
# Create a profile report of the dataframe
# ## Inspect for anomalies and build a sense of the feature space and
# ## form an initial set of hypotheses

df.profile_report(minimal=True)
```
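Alongside the full profile report, a few pandas one-liners give a quick first look at the same things: summary statistics, missing values, and label frequencies. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data with one missing value
df = pd.DataFrame({
    "age": [23, 35, None, 41],
    "city": ["SG", "KL", "SG", "JKT"],
})

# Summary statistics for the numeric columns
numeric_summary = df.describe()

# Count of missing values per column
missing_counts = df.isna().sum()

# Frequency of each label in a categorical column
city_counts = df["city"].value_counts()

print(numeric_summary, missing_counts, city_counts, sep="\n\n")
```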

Step 3: Identify themes and latent (hidden) features
1. Use-case Development - Hypothesize the possibility of latent (hidden) features by checking whether there are interdependencies between explanatory features.

2. Engineer Features - Group features together under a hypothesized theme and consider whether the interdependency relations are simple or complex.

3. Train an unsupervised model
> - For simple themes, use clustering (create 1 new feature out of n original features).
>
> - For complex themes, use dimension reduction (create k new features out of n original features, where k < n).
>
> - For extremely complex themes with deep layers of interdependency, use a deep learning model instead.

```
# Clustering
# Instantiate an experiment class
exp = ClusteringExperiment()

# Setup experiment
exp.setup(
    data=...,
    normalize=True,
)

# Train models
... = exp.create_model(...)
```

```
# Dimension Reduction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize, MaxAbsScaler, Normalizer
from sklearn.decomposition import NMF, PCA
from sklearn.pipeline import make_pipeline
from scipy.sparse import csr_matrix

# Perform pre-processing
# ## (if NLP) Call an instance of tfidf
tfidf = TfidfVectorizer()

# ## (if not NLP) Call a normalizer
nml = Normalizer() # or
mas = MaxAbsScaler()

# Create a model instance
nmf = NMF(n_components=...) # or
pca = PCA()

# Create a pipeline
# ## Pipeline will execute the fit and transform in sequence
pipeline = make_pipeline(tfidf, nmf)  # or
pipeline = make_pipeline(tfidf, mas, nmf, nml)  # if we want to compute cosine similarity

# Execute the pipeline
# ## Train the pipeline
pipeline.fit(...)

# ## Apply the trained pipeline to transform existing features into new features
new_features = pipeline.transform(...)

# ## We can also do the fit-transform in one pass
# ## The model will still be trained
new_features = pipeline.fit_transform(...)

# Assign new features to each observation
df_with_new_features = pd.DataFrame(new_features, index=...)
```
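The template above can be made concrete on a toy corpus. The following is a minimal sketch, not the only way to arrange the steps: the documents, the index labels, the column names `topic_0`/`topic_1`, and the NMF settings (`init`, `max_iter`) are all assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus; the index labels stand in for observation ids
docs = pd.Series(
    [
        "space ship alien planet",
        "alien invasion space war",
        "romantic wedding love story",
        "love story heartbreak romance",
    ],
    index=["doc_a", "doc_b", "doc_c", "doc_d"],
)

# TF-IDF output is non-negative, so it can feed NMF directly;
# the final Normalizer rescales each row to unit L2 length
pipeline = make_pipeline(
    TfidfVectorizer(),
    NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500),
    Normalizer(),
)
new_features = pipeline.fit_transform(docs)

df_with_new_features = pd.DataFrame(
    new_features, index=docs.index, columns=["topic_0", "topic_1"]
)
print(df_with_new_features.round(2))
```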

4. Analyze the model result

```
# For clustering
# Check sample distribution
exp.plot_model(..., plot='distribution')

# Visualize cluster connectivity
exp.plot_model(..., plot='tsne')

# Check cluster separation
exp.plot_model(..., plot='distance')
```

```
# For dimension reduction
# Extract the top n observations with the highest values for each new feature
# Inspect the observations and find a theme for this feature
df_with_new_features['...'].nlargest(20)
```

5. (Depending on use-case) Use model to group or recommend.

```
# For clustering
# Assign cluster membership to each observation
df_with_cluster = exp.assign_model(...)

# Visualize the relationship between clusters and original features

original_features = ['...', ..., '...']

for feature in original_features:
    sns.boxplot(data=..., x='...', y=feature)
```

```
# For dimension reduction
# Use cosine similarity or other methods as recommender
# Extract embeddings from sample
sample_embedding = [..., ..., ...]

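# Compute cosine similarity score against all observations
# ## The dot product equals cosine similarity only when the rows and the
# ## sample are L2-normalized (e.g., by a Normalizer step in the pipeline)
similarity_score = df_with_new_features.dot(sample_embedding)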

# Extract the closest n observations
similarity_score.nlargest(n)
```
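A dot product is only a cosine similarity when the vectors have unit length. The sketch below, on hypothetical two-dimensional embeddings, normalizes the rows first and then ranks the other observations against one sample.

```python
import numpy as np
import pandas as pd

# Hypothetical embeddings: rows are observations, columns are latent features
embeddings = pd.DataFrame(
    [[3.0, 0.0], [2.0, 2.0], [0.0, 5.0]],
    index=["item_a", "item_b", "item_c"],
)

# L2-normalize every row so that a dot product equals cosine similarity
row_norms = np.linalg.norm(embeddings.to_numpy(), axis=1)
unit = embeddings.div(row_norms, axis=0)

# Score every observation against one sample, then rank the others
sample_embedding = unit.loc["item_a"]
similarity_score = unit.dot(sample_embedding)
closest = similarity_score.drop("item_a").nlargest(2)
print(closest)
```

Dropping the sample itself before ranking avoids recommending the item the user already supplied.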

Step 4: Identify what you want to predict and what you want to predict with:

**Predictive outcome**
- A value: Regression
- A label (i.e., a choice from a set of candidate choices): Classification

**Features**
- Features to ignore (lookahead bias, i.e., values that you will not observe on prediction day). For example, when predicting price_psf, the features price_psm, transaction_price, and nett_price must be ignored.
- Features that should be numerical
- Features that should be categorical
- For classification, is there class imbalance in the outcome feature? If so, set fix_imbalance=True.
- For classification, set data_split_stratify=True so that the train and test datasets have the same proportion of outcome labels.
- If important features have unevenly distributed labels, pass those feature names as a list to data_split_stratify so that the train and test datasets also have similar proportions of those feature labels.
- Consider the need for preprocessing (normalize, transformation, polynomial features).

```
# Instantiate an experiment class
exp = RegressionExperiment()  # or
exp = ClassificationExperiment()

# Setup the experiment
exp.setup(
    data=df,
    target=...,
    fix_imbalance=True,
    data_split_stratify=True,
    ignore_features=[...],
    ...,
    session_id = 137,
    log_experiment=True,
    experiment_name='...',
)

# See pycaret documentation: https://pycaret.gitbook.io/docs
```
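To see what stratifying a split does, the sketch below uses scikit-learn's `train_test_split` on a hypothetical imbalanced outcome. It illustrates the idea only; how PyCaret performs the split internally is not shown here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced outcome: 80% "no", 20% "yes"
df = pd.DataFrame({
    "feature": range(100),
    "outcome": ["no"] * 80 + ["yes"] * 20,
})

# stratify= keeps the outcome proportions equal in both splits
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["outcome"], random_state=137
)

print(train_df["outcome"].value_counts(normalize=True))
print(test_df["outcome"].value_counts(normalize=True))
```

Without `stratify`, a small test set can end up with too few minority labels, which makes the test metrics unreliable.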

Step 5: Fit the data to models that are known to be:
- Easy to understand: Linear/Logistic Regression, Decision Trees
- Good at predictive performance: Ensembles
- Able to adjust for overfitting: Ensembles, Penalized Regressions

```
# Fit the data across all classes of models
exp.compare_models()

# Choose the best models
best_model = exp.create_model(...)
lgb = exp.create_model('lightgbm')
xgb = exp.create_model('xgboost')
```

Step 6: Analyze the performance of the models
- Feature importances: Do the important features make sense?
- Error distribution: Which segments of the data tend to have larger errors?
- Learning curves: Will more samples help, or do we need more features?

```
# Validate the model performance on the test set
exp.predict_model(best_model)

# Inspect the model performance
exp.evaluate_model(best_model)

# Get a shap diagram to inform hypotheses
exp.interpret_model(best_model)
```

Step 7: Improve the solution
- More data and more relevant features
- Use stacking or blending ensembles
- Tune the model hyperparameters
- Insert more intermediate steps into the ML pipeline
- Better explanation

```
# Stack or blend multiple models
stack_ensemble = exp.stack_models(...)
blend_ensemble = exp.blend_models(...)

# Tune models (may or may not improve)
tuned_model = exp.tune_model(
    best_model,
    choose_better=True,
    search_library='optuna',
    n_iter=200,
    optimize='<metric>',
    early_stopping=True
)
```

Step 8: Maintain proper pipeline management
- MLOps to manage data and model versions, performance, and other related artifacts

```
!mlflow ui
```

Step 9: Prepare for deployment
- Train the model on all data
- Save the trained model into a file or
- Use MLFlow model versioning

```
# Finalize the model by training it with all data
final_model = exp.finalize_model(tuned_model)

# Save the final model
# ## Use datetime.now() to ensure that we can trace every model version
model_filename = f'final_model_{datetime.datetime.now().strftime("%Y%m%d-%H%M%S")}'

# ## Save the model
exp.save_model(final_model, model_filename)
```

Step 10: Create an interface for ease of use
- API service
- Dashboard

```
from pycaret.... import load_model, predict_model

# Load latest model
latest_model = load_model(model_filename)

# Insert prediction code into interface code
# ## Convert input data into pandas dataframe
input_df = pd.DataFrame(...)

# ## Get predicted value
# ## predict_model is a function that takes the loaded model and the data
predicted_value = predict_model(latest_model, data=input_df)
```

Step 11: Terminate workflow

```
# To terminate MLFlow and release ngrok
# ## Remove all Python processes containing "mlflow"
!pkill -f mlflow

# ## Remove all ngrok tunnels
ngrok.kill()
```

# Setup

### Allow GPU for traditional machine learning
The following cells install the LightGBM GPU version as well as cuML from rapids.ai.

These are required if we want to run traditional machine learning models on a GPU.

Note that deep learning packages (e.g., PyTorch, TensorFlow) come with native GPU access. There is no need to install anything else.

In [None]:
# LightGBM GPU can be activated with the following script
!mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

/bin/bash: /etc/OpenCL/vendors/nvidia.icd: Permission denied


In [None]:
# For some ML models, we require rapids ai's cuml
!pip install --extra-index-url=https://pypi.nvidia.com cuml-cu12==24.10.*

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com


### Other required installations

In [None]:
# Install the additional packages
# ## Packages installed in a single pip command have their dependency versions checked together;
# ## with separate commands, later packages may replace the dependencies of earlier packages
!pip install ydata_profiling swifter
!pip install pycaret[full]
!pip install scikit-learn-intelex
!pip install fastapi[all] mlflow pyngrok



### Imports and data connectivity

In [None]:
# All import statements should be upfront so that it is easy to
# track what is required

# ## Import the following packages
import datetime
import swifter
import mlflow
import mlflow.data
from mlflow.data.pandas_dataset import PandasDataset
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
from tqdm.notebook import tqdm, tnrange
from pathlib import Path
from pydantic import BaseModel
from ydata_profiling import ProfileReport
from pycaret.regression import RegressionExperiment
from pycaret.classification import ClassificationExperiment
from pycaret.clustering import ClusteringExperiment

In [None]:
# Connect to data folder
try:
    from google.colab import userdata
    from google.colab import drive
    drive.mount('/content/drive')

    data_dir = Path('/content/drive/MyDrive/pcml_data')
    mlrun_dir = Path('/content/drive/MyDrive/logs/mlruns')
    models_dir = Path('/content/drive/MyDrive/Colab Notebooks/models')

except (NotImplementedError, ModuleNotFoundError):
    data_dir = Path('data')
    mlrun_dir = Path('logs/mlruns')
    models_dir = Path('models')

else:
    print(f'Using Colab...')

finally:
    if not mlrun_dir.exists():
        mlrun_dir.mkdir(parents=True)
    mlflow.set_tracking_uri(mlrun_dir)

### Additional imports

In [None]:
# For this exercise, we require these additional packages
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from string import punctuation
from scipy.sparse import csr_matrix
from sklearn.preprocessing import Normalizer, MaxAbsScaler, MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, PCA, NMF
from sklearn.pipeline import make_pipeline
from matplotlib import pyplot as plt
from pycaret.classification import ClassificationExperiment

# punkt_tab contains the tokenizer models used by word_tokenize
nltk.download('punkt_tab')

# stopwords contains the dictionaries for stop words
nltk.download('stopwords')

# Wordnet contains the dictionaries for lemmatizing
nltk.download('wordnet')

# Setup data path
data_path = Path('data')

[nltk_data] Downloading package punkt_tab to /home/dannel/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /home/dannel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/dannel/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#### Case Study - Netflix & Chill

In this exercise, you are given 9957 Netflix titles with various features.

Your tasks:
1. Convert the features into a layout that ML models can use.
2. Convert the description into a set of features that ML models can use.
3. Perform SML on the rating.
4. Create a recommender system that recommends 10 movies to a user after receiving 3 movies that they like as input.
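One possible shape for task 4 is sketched below, under stated assumptions: each title already has a row of latent features (for example from NMF), the three liked titles are averaged into a single "taste" vector, and the remaining titles are ranked by cosine similarity. The function name `recommend`, the toy titles, and the topic columns are hypothetical.

```python
import numpy as np
import pandas as pd

def recommend(embeddings: pd.DataFrame, liked: list, n: int = 10) -> pd.Series:
    """Rank titles by cosine similarity to the average of the liked titles."""
    # L2-normalize rows so dot products are cosine similarities
    norms = np.linalg.norm(embeddings.to_numpy(), axis=1)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    unit = embeddings.div(norms, axis=0)
    # Build one "taste" vector from the liked titles
    profile = unit.loc[liked].mean(axis=0)
    profile = profile / np.linalg.norm(profile)
    # Score everything, drop the liked titles themselves, keep the top n
    scores = unit.dot(profile)
    return scores.drop(index=liked).nlargest(n)

# Hypothetical latent features for five made-up titles
toy = pd.DataFrame(
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1], [0.1, 0.9], [0.85, 0.15]],
    index=["A", "B", "C", "D", "E"],
    columns=["topic_0", "topic_1"],
)
top = recommend(toy, liked=["A", "B", "C"], n=2)
print(top)
```

Averaging is the simplest way to combine several liked titles; it works best when the liked titles are similar to one another.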

In [None]:
# Load the netflix dataset
netflix = pd.read_csv(data_dir / '6USML/netflix_movies.csv')
print(netflix.head())

                    title         year certificate duration  \
0               Cobra Kai     (2018– )       TV-14   30 min   
1               The Crown     (2016– )       TV-MA   58 min   
2        Better Call Saul  (2015–2022)       TV-MA   46 min   
3           Devil in Ohio       (2022)       TV-MA  356 min   
4  Cyberpunk: Edgerunners     (2022– )       TV-MA   24 min   

                          genre  rating  \
0         Action, Comedy, Drama     8.5   
1     Biography, Drama, History     8.7   
2                  Crime, Drama     8.9   
3        Drama, Horror, Mystery     5.9   
4  Animation, Action, Adventure     8.6   

                                         description  \
0  Decades after their 1984 All Valley Karate Tou...   
1  Follows the political rivalries and romance of...   
2  The trials and tribulations of criminal lawyer...   
3  When a psychiatrist shelters a mysterious cult...   
4  A Street Kid trying to survive in a technology...   

                         

In [None]:
# Convert the year feature into just the starting year
netflix['year'] = netflix.year.str.extract(r'(\d{4})')
print(netflix['year'].head())

0    2018
1    2016
2    2015
3    2022
4    2022
Name: year, dtype: object


In [None]:
# Filter the rows for missing values in duration
netflix_duration_na = netflix[netflix.duration.isna()]
print(netflix_duration_na['duration'].head())
netflix_duration_not_na = netflix[netflix.duration.notna()]
print(netflix_duration_not_na['duration'].head())

19     NaN
67     NaN
77     NaN
168    NaN
199    NaN
Name: duration, dtype: object
0     30 min
1     58 min
2     46 min
3    356 min
4     24 min
Name: duration, dtype: object


In [None]:
import math
# Should we drop the column or fill in the missing values?
# ## We can fill in the missing values with the mean value

duration_without_mins = netflix_duration_not_na['duration'].copy().str.replace(' min', '').astype(int)
duration_mean = str(math.ceil(duration_without_mins.mean()))  + ' min'
print(duration_mean)
netflix['duration'] = netflix['duration'].fillna(duration_mean)
# ## We can drop the column
# netflix_no_duration = netflix.drop(columns=['duration'])
# print(netflix_no_duration.head())

# remove " min" from duration column
netflix['duration'] = netflix['duration'].str.replace(' min', '', regex=False).astype(int)
print(netflix.head())

74 min
                    title  year certificate  duration  \
0               Cobra Kai  2018       TV-14        30   
1               The Crown  2016       TV-MA        58   
2        Better Call Saul  2015       TV-MA        46   
3           Devil in Ohio  2022       TV-MA       356   
4  Cyberpunk: Edgerunners  2022       TV-MA        24   

                          genre  rating  \
0         Action, Comedy, Drama     8.5   
1     Biography, Drama, History     8.7   
2                  Crime, Drama     8.9   
3        Drama, Horror, Mystery     5.9   
4  Animation, Action, Adventure     8.6   

                                         description  \
0  Decades after their 1984 All Valley Karate Tou...   
1  Follows the political rivalries and romance of...   
2  The trials and tribulations of criminal lawyer...   
3  When a psychiatrist shelters a mysterious cult...   
4  A Street Kid trying to survive in a technology...   

                                               stars  

In [None]:
# # Split the genre into multiple columns
# split_genres_df = netflix_no_duration['genre'].str.split(', ', expand=True)

# Extract the genre into a separate dataframe using the method explode
exploded_df = netflix.assign(genre_type=netflix['genre'].str.split(', ')).explode('genre_type')

# Think of a way to indicate 1 for every genre that each title belongs to
genre_one_hot_df = pd.get_dummies(exploded_df['genre_type'])
genre_indicators_df = pd.concat([exploded_df[['title']], genre_one_hot_df], axis=1)

# map for title and its hash
mapping = pd.DataFrame({
    'title': netflix['title'],
    'hash': [hash(title) for title in netflix['title']]
})

# to use as index instead of title
genre_indicators_df['title'] = genre_indicators_df['title'].apply(hash)
# Combine rows by title, taking the maximum value for each column
combined_df = genre_indicators_df.groupby('title', sort=False).max().astype(int)

# Reset index (optional, to make title a regular column again)
combined_df.reset_index(inplace=True)

hash_to_title = dict(zip(mapping['hash'], mapping['title']))
combined_df['title'] = combined_df['title'].map(hash_to_title)

# Don't concat it to the netflix df yet
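A shorter route to the same indicator layout is `Series.str.get_dummies`, which splits on a separator and one-hot encodes in one call. The sketch below uses hypothetical rows; it keeps one row per original row, so titles that share a name are not merged.

```python
import pandas as pd

# Hypothetical rows mimicking the title/genre layout above
toy = pd.DataFrame({
    "title": ["Show A", "Show B", "Show A"],  # note the duplicated title
    "genre": ["Action, Drama", "Comedy", "Drama, Horror"],
})

# One indicator column per genre, one row per original row
genre_indicators = toy["genre"].str.get_dummies(sep=", ")

# Joining on the original row index avoids collapsing titles that share a name
toy_with_genres = toy.join(genre_indicators)
print(toy_with_genres)
```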

In [None]:
# exploded_df['genre_type']
# netflix

# Concat combined_df with netflix df on the column title and drop the genre column
# netflix_with_genre = pd.merge(netflix, combined_df, on='title', how='left')

# # drop the genre column
# netflix_with_genre = netflix_with_genre.drop(columns=['genre'])

# # remove the text ' min' from the column duration
# netflix_with_genre['duration'] = netflix_with_genre['duration'].str.replace(' min', '', regex=False).astype(int)

# # convert the column rating to float
# netflix_with_genre['rating'] = netflix_with_genre['rating'].astype(float)

# # convert the column votes from the string format x,xxx to float
# netflix_with_genre['votes'] = netflix_with_genre['votes'].str.replace(',', '').astype(float)

# # convert the column year to int
# netflix_with_genre['year'] = netflix_with_genre['year'].astype(float)


# # save netflix_with_genre to csv
# netflix_with_genre.to_csv('netflix_with_genre.csv', index=False)

In [None]:
# Do the same for the stars
# Check the number of columns that it will generate

# clean
df_stars = netflix.copy()

df_stars['stars'] = df_stars['stars'].str.replace(r"\|\s*", ",", regex=True)
df_stars['stars'] = df_stars['stars'].str.replace("','", "", regex=False).str.replace("', '", ", ", regex=False).str.strip()

# Clean the 'stars' column
df_stars['stars'] = df_stars['stars'].str.replace(r",\s*,", ",", regex=True)  # Remove ', ,'
df_stars['stars'] = df_stars['stars'].str.replace(r"\[\s*'\s*|\s*'\s*\]", "", regex=True)  # Remove surrounding brackets and quotes
df_stars['stars'] = df_stars['stars'].str.replace(r"', '    Stars:", "", regex=True)  # Normalize 'Stars:' formatting
df_stars['stars'] = df_stars['stars'].str.replace(r", '    Stars:", "", regex=True)  # Normalize 'Stars:' formatting
df_stars['stars'] = df_stars['stars'].str.replace(r" \", '", "", regex=True)
df_stars['stars'] = df_stars['stars'].str.replace(r" '    Star:',", "", regex=True)
df_stars['stars'] = df_stars['stars'].str.replace(r"', '    Star:", "", regex=True)
df_stars['stars'] = df_stars['stars'].str.replace(r"[\[\]]", "", regex=True)  # Remove square brackets
df_stars['stars'] = df_stars['stars'].str.replace('"', "", regex=False)  # Remove double quotes

df_stars['stars'] = df_stars['stars'].str.strip()  # Remove leading/trailing spaces

# Extract the stars into a separate dataframe using the method explode
exploded_stars_df = df_stars.assign(star_name=df_stars['stars'].str.split(', ')).explode('star_name')

# Count the number of unique names
unique_name_count = exploded_stars_df['star_name'].nunique()

# Output the result
print(f"Number of unique names: {unique_name_count}")

Number of unique names: 23993


In [None]:
# Given that there are roughly 24k unique stars, we should consider decomposition techniques
# and represent each title with a few latent features

# Should we have 1 NMF for stars + genre or individual NMFs for stars and genre?
# Think about how latent features are generated, and the possibility of interactions between stars and genres

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline
import pandas as pd

# Perform NMF on your choice of stars and genre
netflix_nmf = df_stars.copy() # star list is cleaned already

# Combine 'stars' and 'genre' columns for textual analysis
netflix_nmf['combined_text'] = netflix_nmf['stars'] + ' ' + netflix_nmf['genre']

# Clean and prepare the text
netflix_nmf['combined_text'] = netflix_nmf['combined_text'].fillna('').str.lower()

# What is the scaler that you should use?
# NMF requires non-negative input, so a scaler is needed when the data can be negative.
# TF-IDF already produces non-negative values, so no extra scaler is applied here.
# TF-IDF transforms raw text into a numerical matrix that represents the
# importance of terms within a document while reducing noise
# from overly common terms.

# Vectorize the combined text using TF-IDF
vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(netflix_nmf['combined_text'])

# How many components should we use?
# Trial and error: check whether the topics make sense, without
# oversimplifying or overcomplicating them.
# E.g., topic 6 (kang, yûki, jung, park, ji, min, jin, kim, lee, romance) looks good,
# but we don't want to link all Korean-sounding names to romance.

n_components = 10  # Number of latent features/topics
nmf_model = NMF(n_components=n_components, random_state=42)
nmf_features = nmf_model.fit_transform(tfidf_matrix)

# Display top words for each topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(nmf_model.components_):
    top_features = [feature_names[i] for i in topic.argsort()[-10:]]
    print(f"nmf_feature_{topic_idx}: {', '.join(top_features)}")

# Create the pipeline and apply it to the dataframe

# Add NMF features to the dataframe
# Note: the columns are numbered from 1, while the topics printed above are numbered from 0
for i in range(n_components):
    netflix_nmf[f'nmf_feature_{i+1}'] = nmf_features[:, i]

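# Create a pipeline with TF-IDF vectorizer and NMF
# Note: this pipeline is only defined here; it is never fitted or applied,
# and it uses n_components=5 rather than the 10 used above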
combined_text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
    ('nmf', NMF(n_components=5, random_state=42))
])

nmf_feature_0: peter, navarro, manuel, álvaro, music, family, sport, history, biography, drama
nmf_feature_1: mike, paul, james, van, david, chris, john, michael, family, comedy
nmf_feature_2: anthony, scott, ben, chris, james, david, fi, sci, adventure, action
nmf_feature_3: peter, john, david, michael, music, sport, history, biography, short, documentary
nmf_feature_4: kana, kino, jun, greg, yûki, fantasy, adventure, short, family, animation
nmf_feature_5: brian, ben, clarkson, hammond, jeremy, james, richard, game, reality, tv
nmf_feature_6: kang, yûki, jung, park, ji, min, jin, kim, lee, romance
nmf_feature_7: jakob, eklund, antonia, jason, ella, john, action, tom, robert, thriller
nmf_feature_8: daniel, adam, alejandro, amit, anu, michael, robert, action, mystery, crime
nmf_feature_9: adam, mehmet, david, kim, lee, sci, fi, fantasy, mystery, horror
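A `Pipeline` like the one defined in the cell above becomes useful once it is fitted: `fit_transform` learns the vocabulary and topics, and `transform` then reuses them on unseen text. A minimal sketch on a hypothetical toy corpus (the texts and the NMF settings `init` and `max_iter` are assumptions):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the combined stars + genre text
train_text = [
    "kim lee romance drama",
    "lee park romance",
    "smith jones action thriller",
    "jones brown action crime",
]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nmf", NMF(n_components=2, init="nndsvda", random_state=42, max_iter=500)),
])

# fit_transform learns the vocabulary and the topics in one pass
train_features = toy_pipeline.fit_transform(train_text)

# transform reuses the fitted vocabulary and topics on unseen text
new_features = toy_pipeline.transform(["park kim romance"])
print(train_features.shape, new_features.shape)
```

Keeping the vectorizer and the model in one object means a new title can be embedded with a single call, which is what a recommender needs at prediction time.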


In [None]:
# Combine the NMF features with the original dataframe
# Drop genre and stars columns
netflix_genre_star_nmf_feature = netflix_nmf.drop(columns=['genre', 'stars','combined_text'])

In [None]:
# Observe that there are other features such as year and votes (and duration if you didn't drop it)
# Should we have combined all these and NMF them together?
# Consider the implications:
#   1. interactions between these features (interpretability)
#   2. how would you scale these features
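On the scaling question, one option is `MinMaxScaler`, which maps each column to [0, 1] and therefore keeps the input non-negative, as NMF requires. The sketch below uses hypothetical values for year, votes, and duration; whether these features should share latent components with stars and genre remains a modelling choice.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric columns on very different scales
numeric = pd.DataFrame({
    "year": [1999, 2010, 2018, 2022],
    "votes": [1_200.0, 350_000.0, 48_000.0, 900.0],
    "duration": [95, 46, 30, 120],
})

# MinMaxScaler maps each column to [0, 1], which keeps the data non-negative
# and stops large-valued columns such as votes from dominating
scaled = MinMaxScaler().fit_transform(numeric)

nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
latent = nmf.fit_transform(scaled)
print(np.round(latent, 3))
```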

In [None]:
# Lastly, we need to NMF the description
# Extract the descriptions and clean them
df_des = netflix_nmf.copy()
df_des['description'] = df_des['description'].str.lower()
df_des['description'] = df_des['description'].str.replace(r"\|\s*", ",", regex=True)

In [None]:
# Perform NMF on the descriptions
# What additional preprocessing do you need?

# Vectorize the combined text using TF-IDF
vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df_des['description'])

# What is the scaler that you should use?

# How many components should we use?

n_components = 10  # Number of latent features/topics
nmf_model = NMF(n_components=n_components, random_state=42)
nmf_features = nmf_model.fit_transform(tfidf_matrix)

# Display top words for each topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(nmf_model.components_):
    top_features = [feature_names[i] for i in topic.argsort()[-10:]]
    print(f"nmf_desc_feature_{topic_idx}: {', '.join(top_features)}")

# Create the pipeline and apply it to the dataframe

# Add NMF features to the dataframe
# Note: the columns are numbered from 1, while the topics printed above are numbered from 0
for i in range(n_components):
    df_des[f'nmf_desc_feature_{i+1}'] = nmf_features[:, i]

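# Create a pipeline with TF-IDF vectorizer and NMF
# Note: this pipeline is only defined here; it is never fitted or applied,
# and it uses n_components=5 rather than the 10 used above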
desc_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
    ('nmf', NMF(n_components=5, random_state=42))
])


nmf_desc_feature_0: victor, did, mystery, game, adaptation, kept, unknown, wraps, plot, add
nmf_desc_feature_1: years, girl, mother, father, murder, past, mysterious, woman, man, young
nmf_desc_feature_2: save, journey, ii, people, look, documentary, war, series, summary, world
nmf_desc_feature_3: work, time, follows, personal, changing, documentary, death, career, real, life
nmf_desc_feature_4: town, goes, james, summary, jeremy, takes, team, city, york, new
nmf_desc_feature_5: teenage, girl, students, student, group, summary, best, friends, high, school
nmf_desc_feature_6: finds, christmas, house, tries, mother, daughter, lives, father, home, family
nmf_desc_feature_7: netflix, stage, live, comic, takes, series, comedian, comedy, special, stand
nmf_desc_feature_8: home, friend, son, mother, years, girl, boy, summary, year, old
nmf_desc_feature_9: film, true, lives, based, series, tells, summary, follows, love, story
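Besides reading the topics by eye, the number of components can be compared with NMF's reconstruction error, available as `reconstruction_err_` after fitting. The sketch below is run on a hypothetical toy corpus; the error generally falls as components are added, so it is a guide for spotting diminishing returns rather than a rule.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; in the notebook this would be the description column
docs = [
    "young woman murder mystery",
    "murder mystery detective",
    "high school friends comedy",
    "school students friends",
    "stand up comedy special",
    "comedian stand up special",
]
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Fit NMF for several k and record how well each reconstructs the input
errors = {}
for k in range(1, 5):
    model = NMF(n_components=k, init="nndsvda", random_state=42, max_iter=1000)
    model.fit(tfidf_matrix)
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(f"k={k}: reconstruction error {err:.3f}")
```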


In [None]:
# Combine the NMF features with the original dataframe
# Drop the description column
netflix_nmf_feature_desc = df_des.drop(columns=['description'])

In [None]:
# What dtype is votes?
# Do we need to convert it to a numerical value?

votes_df = netflix_nmf_feature_desc.copy()

# Convert the 'votes' column to numeric by removing commas and changing type to float
votes_df['votes'] = votes_df['votes'].str.replace(',', '', regex=False).astype(float)

# Verify the conversion
print(votes_df['votes'].dtype)  # Should print float64

float64


In [None]:
# Check for missing values in the 'votes' column
missing_votes = votes_df['votes'].isnull().sum()
print(f"Number of missing values in 'votes' column: {missing_votes}")

Number of missing values in 'votes' column: 1173


In [None]:
# We are going to predict ratings
# Can we just drop the rows with missing values?
# Yes, because the model needs that information for training

# Create a copy of the dataframe and name it rating_df
# Proceed with your choice of processing the data
# .copy() avoids a SettingWithCopyWarning in the assignments below
rating_df = votes_df.dropna().copy()

# Clean column names by replacing spaces with underscores
rating_df.columns = rating_df.columns.str.replace(' ', '_')

# If there are categorical variables, standardize their values
if 'certificate' in rating_df.columns:
    rating_df['certificate'] = rating_df['certificate'].str.replace(' ', '_')

# Convert year column to date time
rating_df['year'] = pd.to_datetime(rating_df['year'], format='%Y')

In [None]:
# Perform regression or classification on the netflix dataset?
# Use regression because the target (rating) is a continuous value
from pycaret.regression import RegressionExperiment

exp = RegressionExperiment()

regression_data = rating_df.copy()

features = [
    'year',
    'duration',
    'certificate',
    'votes'
]  # Select relevant features
target = 'rating'

# Set up PyCaret regression environment
regression_setup = exp.setup(
    data=regression_data,
    target=target,
    train_size=0.8,
    normalize=True,  # Normalize numerical features automatically
    feature_selection=True,  # Enable feature selection
    ignore_features=['title'],  # Exclude irrelevant features
    numeric_features=[
        'votes',
        *[f'nmf_feature_{i}' for i in range(1, 11)],  # Numeric NMF features
        *[f'nmf_desc_feature_{i}' for i in range(1, 11)]  # Numeric NMF description features
    ],
    categorical_features=[
        'certificate',
        'genre',
        'stars',
        'combined_text',  # Combined text feature as categorical (or processed separately)
    ],
    date_features=['year'],  # Treat 'year' as a date feature
    use_gpu=True,  # Use GPU for acceleration if available
    session_id=123,  # Set a random seed for reproducibility
    experiment_name='rating_prediction',  # Name the experiment for tracking
    verbose=False  # Suppress detailed output during setup
)

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 2, number of used features: 0


[LightGBM] [Fatal] CUDA Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_CUDA=1

In [None]:
# Perform model selection: compare the candidate regressors with cross-validation
# and keep the best one. `exclude` takes a list of model IDs.
best_model = regression_setup.compare_models(exclude=["rf"])


| ID | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| knn | K Neighbors Regressor | 0.8572 | 1.2917 | 1.1362 | 0.1259 | 0.1637 | 0.1485 | 0.433 |
| lightgbm | Light Gradient Boosting Machine | 0.8631 | 1.3113 | 1.1447 | 0.1126 | 0.1648 | 0.1495 | 1.083 |
| gbr | Gradient Boosting Regressor | 0.8652 | 1.3148 | 1.1462 | 0.1103 | 0.1648 | 0.1496 | 1.102 |
| xgboost | Extreme Gradient Boosting | 0.8651 | 1.3154 | 1.1465 | 0.1099 | 0.165 | 0.1497 | 1.158 |
| ridge | Ridge Regression | 0.8745 | 1.3192 | 1.1482 | 0.1073 | 0.1645 | 0.1502 | 0.427 |
| lr | Linear Regression | 0.8746 | 1.3194 | 1.1482 | 0.1072 | 0.1646 | 0.1502 | 0.421 |
| lar | Least Angle Regression | 0.8746 | 1.3194 | 1.1482 | 0.1072 | 0.1646 | 0.1502 | 0.381 |
| br | Bayesian Ridge | 0.8746 | 1.3194 | 1.1482 | 0.1072 | 0.1645 | 0.1502 | 0.447 |
| et | Extra Trees Regressor | 0.8656 | 1.3233 | 1.1499 | 0.1045 | 0.1655 | 0.15 | 0.69 |
| omp | Orthogonal Matching Pursuit | 0.8759 | 1.3237 | 1.1501 | 0.1043 | 0.1648 | 0.1504 | 0.386 |


[I] [22:40:16.739108] Unused keyword parameter: n_jobs during cuML estimator initialization
[I] [22:45:12.541589] Unused keyword parameter: n_jobs during cuML estimator initialization
[I] [22:45:12.542357] Unused keyword parameter: n_jobs during cuML estimator initialization


In [None]:
# Validate the best model on the hold-out (test) split.
# With no `data` argument, predict_model scores the hold-out set created by setup().
validation_results = regression_setup.predict_model(best_model)
print(validation_results[['rating', 'prediction_label']])


|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | K Neighbors Regressor | 0.8114 | 1.1701 | 1.0817 | 0.1385 | 0.1546 | 0.1381 |


      rating  prediction_label
5174     6.9          6.880000
9121     7.4          6.780000
6694     6.3          6.800000
230      8.4          6.840000
1447     7.0          6.800000
...      ...               ...
4410     6.2          6.800000
747      6.7          6.740000
4684     7.6          6.800000
7570     7.1          6.820001
234      7.1          6.800000

[1277 rows x 2 columns]
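
To put the hold-out scores in context, R2 can be turned back into the RMSE of a constant predictor. The sketch below uses only the two numbers printed above (MSE 1.1701, R2 0.1385) and the identity R2 = 1 - MSE / Var(y). It assumes R2 was computed against the hold-out mean, which is how scikit-learn's `r2_score` defines it; the variable names are illustrative.

```python
import math

# Hold-out metrics copied from the predict_model output above.
holdout_mse = 1.1701
holdout_r2 = 0.1385

# R2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R2)
target_variance = holdout_mse / (1 - holdout_r2)

# RMSE of a model that always predicts the hold-out mean rating.
baseline_rmse = math.sqrt(target_variance)
model_rmse = math.sqrt(holdout_mse)

# Fraction of the baseline error that the model removes.
improvement = 1 - model_rmse / baseline_rmse

print(f"baseline RMSE ~ {baseline_rmse:.3f}, model RMSE ~ {model_rmse:.3f}")
print(f"relative improvement ~ {improvement:.1%}")
```

Under these assumptions the best regressor cuts the error of a predict-the-mean baseline by only about 7%, which is consistent with the predictions clustering around 6.8.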


In [None]:
# Tune the model's hyperparameters
tuned_model = regression_setup.tune_model(best_model)

# Validate the tuned model on test data
validation_results = regression_setup.predict_model(tuned_model, verbose=True)
print(validation_results[['rating', 'prediction_label']])


| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 0.8646 | 1.2499 | 1.118 | 0.1424 | 0.1553 | 0.1419 |
| 1 | 0.9061 | 1.4154 | 1.1897 | 0.1108 | 0.171 | 0.1559 |
| 2 | 0.862 | 1.3038 | 1.1419 | 0.0875 | 0.1638 | 0.1504 |
| 3 | 0.874 | 1.3636 | 1.1677 | 0.1147 | 0.1722 | 0.1558 |
| 4 | 0.8575 | 1.3243 | 1.1508 | 0.1284 | 0.1687 | 0.1524 |
| 5 | 0.8159 | 1.2135 | 1.1016 | 0.1657 | 0.1591 | 0.1417 |
| 6 | 0.8676 | 1.2997 | 1.1401 | 0.1219 | 0.1621 | 0.1479 |
| 7 | 0.8543 | 1.3355 | 1.1557 | 0.1249 | 0.1708 | 0.154 |
| 8 | 0.8281 | 1.1721 | 1.0826 | 0.1101 | 0.1556 | 0.1418 |
| 9 | 0.8505 | 1.2714 | 1.1275 | 0.1315 | 0.1606 | 0.1445 |


[I] [22:45:13.774881] Unused keyword parameter: n_jobs during cuML estimator initialization
[I] [22:45:14.086851] Unused keyword parameter: n_jobs during cuML estimator initialization
Fitting 10 folds for each of 10 candidates, totalling 100 fits
[I] [22:45:14.092216] Unused keyword parameter: n_jobs during cuML estimator initialization
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003029 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 6206
[LightGBM] [Info] Number of data points in the train set: 4593, number of used features: 39
[LightGBM] [Info] Start training from score 6.818855
[I] [22:45:14.478308] Unused keyword parameter: n_jobs during cuML estimator initialization
[I] [22:45:14.542170] Unused keyword parameter: n_jobs during cuML estimator initialization
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead o

|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | K Neighbors Regressor | 0.8114 | 1.1701 | 1.0817 | 0.1385 | 0.1546 | 0.1381 |


      rating  prediction_label
5174     6.9          6.880000
9121     7.4          6.780000
6694     6.3          6.800000
230      8.4          6.840000
1447     7.0          6.800000
...      ...               ...
4410     6.2          6.800000
747      6.7          6.740000
4684     7.6          6.800000
7570     7.1          6.820001
234      7.1          6.800000

[1277 rows x 2 columns]


In [None]:
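# Finalize the model: refit the tuned pipeline on the full dataset (train + hold-out)
final_model = regression_setup.finalize_model(tuned_model)

# Score the full dataset. The finalized model has already seen every one of these
# rows, so the metrics below are in-sample and not an estimate of test performance.
predictions = regression_setup.predict_model(final_model, data=regression_data)
print(predictions[['rating', 'prediction_label']])

# Save the fitted pipeline and model to disk
saved_model = regression_setup.save_model(final_model, 'netflix_regression_model')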



[I] [22:46:10.700688] Unused keyword parameter: n_jobs during cuML estimator initialization
[I] [22:46:10.701302] Unused keyword parameter: n_jobs during cuML estimator initialization


|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | K Neighbors Regressor | 0.0636 | 0.0232 | 0.1523 | 0.9841 | 0.0198 | 0.0091 |


      rating  prediction_label
0        8.5          8.460000
1        8.7          8.780001
2        8.9          8.879999
3        5.9          5.880000
4        8.6          8.460001
...      ...               ...
9952     6.3          6.380000
9953     8.1          8.280001
9954     8.7          8.780001
9955     8.4          8.480000
9956     5.9          5.880000

[6381 rows x 2 columns]
Transformation Pipeline and Model Successfully Saved
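
The jump from R2 of about 0.14 on the hold-out split to about 0.98 here comes from scoring rows the finalized model was trained on. A minimal sketch of that effect follows, using scikit-learn directly on synthetic data. The data, the `weights="distance"` setting and the variable names are assumptions chosen to make the gap obvious; they are not the notebook's tuned configuration.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem: a weak signal buried in noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = 0.3 * X[:, 0] + rng.normal(scale=1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With distance weighting, each training row is its own zero-distance neighbor,
# so the model reproduces the training targets exactly.
knn = KNeighborsRegressor(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)

r2_in_sample = r2_score(y_train, knn.predict(X_train))
r2_holdout = r2_score(y_test, knn.predict(X_test))

print(f"in-sample R2: {r2_in_sample:.3f}")
print(f"hold-out R2:  {r2_holdout:.3f}")
```

The hold-out score from the earlier `predict_model(tuned_model)` call remains the figure to report; the finalized model's score on its own training rows says little about unseen titles.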


In [None]:
# Will a classification model work better?
# Convert the continuous ratings into classes.

# Approach 1: Bin the ratings into 5 classes

# Define bin edges and labels: [0, 2], (2, 4], (4, 6], (6, 8], (8, 10]
bins = [0, 2, 4, 6, 8, 10]
labels = ['Very Poor', 'Poor', 'Average', 'Good', 'Excellent']

# Convert the 'rating' column into a new categorical column 'rating_class'
rating_df['rating_class'] = pd.cut(rating_df['rating'], bins=bins, labels=labels, include_lowest=True)

# Display a sample of the updated dataset to verify the transformation
print(rating_df[['title', 'rating', 'rating_class']].head())


# Approach 2: Round the ratings to the nearest integer (up to 10 classes)

# astype(int) fails on missing values; uncomment to fill them with the median first
# rating_df['rating'] = rating_df['rating'].fillna(rating_df['rating'].median())

# Note: round() sends exact halves to the nearest even integer, so 8.5 becomes 8
rating_df['rating_class_10'] = rating_df['rating'].round().astype(int)

# Display a sample of the updated dataset to verify the transformation
print(rating_df[['title', 'rating', 'rating_class_10']].head())



                    title  rating rating_class
0               Cobra Kai     8.5    Excellent
1               The Crown     8.7    Excellent
2        Better Call Saul     8.9    Excellent
3           Devil in Ohio     5.9      Average
4  Cyberpunk: Edgerunners     8.6    Excellent
                    title  rating  rating_class_10
0               Cobra Kai     8.5                8
1               The Crown     8.7                9
2        Better Call Saul     8.9                9
3           Devil in Ohio     5.9                6
4  Cyberpunk: Edgerunners     8.6                9
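
The two binning rules have edge cases worth checking on a handful of values: `pd.cut` uses right-closed intervals, so a rating of exactly 6.0 lands in 'Average', and `round()` rounds exact halves to the nearest even integer, which is why Cobra Kai's 8.5 becomes 8 above. The sketch below reuses the same bins and labels on a small hand-made series; the sample ratings are illustrative.

```python
import pandas as pd

# A few ratings chosen to hit bin edges and exact halves.
ratings = pd.Series([8.5, 8.7, 5.9, 2.0, 6.0, 7.5])

# Approach 1: five right-closed bins, with 0 included in the first bin.
bins = [0, 2, 4, 6, 8, 10]
labels = ['Very Poor', 'Poor', 'Average', 'Good', 'Excellent']
five_class = pd.cut(ratings, bins=bins, labels=labels, include_lowest=True)

# Approach 2: round to the nearest integer (halves go to the even neighbor).
ten_class = ratings.round().astype(int)

print(pd.DataFrame({'rating': ratings, 'five_class': five_class, 'ten_class': ten_class}))
```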


In [None]:
# Check for class imbalance

class_distribution_5 = rating_df['rating_class'].value_counts().sort_index()
print(class_distribution_5)

class_distribution_10 = rating_df['rating_class_10'].value_counts().sort_index()
print(class_distribution_10)

rating_class
Very Poor       3
Poor          162
Average      1332
Good         3962
Excellent     922
Name: count, dtype: int64
rating_class_10
2       15
3       58
4      216
5      538
6     1498
7     1934
8     1804
9      305
10      13
Name: count, dtype: int64
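
Both distributions are heavily skewed toward the middle classes. A quick way to quantify this is the share of the largest class, which is also the accuracy of a classifier that always predicts that class. The sketch below recomputes it from the 5-class counts printed above; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the rating_class value_counts() output above.
class_counts = pd.Series(
    {'Very Poor': 3, 'Poor': 162, 'Average': 1332, 'Good': 3962, 'Excellent': 922}
)

shares = class_counts / class_counts.sum()
majority_class = shares.idxmax()
majority_share = shares.max()

print(shares.round(4))
print(f"Always predicting '{majority_class}' gives accuracy ~ {majority_share:.4f}")
```

Any 5-class model needs to beat this share on accuracy before it can be said to have learned something from the features.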


In [None]:
# Perform approach 1 classification on the netflix dataset
from pycaret.classification import ClassificationExperiment

# Select target and features
target_variable = 'rating_class'
five_cat_data = rating_df.copy()

# Step 1: Create the experiment
exp = ClassificationExperiment()

# Step 2: Set up the classification environment
clf_setup = exp.setup(
    data=five_cat_data,
    target=target_variable,  # Target column
    session_id=42,  # For reproducibility
    ignore_features=['title', 'rating_class_10', 'rating'],  # Drop the title and the columns that would leak the target
    numeric_features=[
        'votes',
        *[f'nmf_feature_{i}' for i in range(1, 11)],  # Numeric NMF features
        *[f'nmf_desc_feature_{i}' for i in range(1, 11)]  # NMF description features
    ],
    categorical_features=['certificate', 'genre', 'stars', 'combined_text'],  # Treat these as categorical
    date_features=['year'],  # Treat 'year' as a date feature
    use_gpu=True,  # Use GPU for acceleration if available
    verbose=False,  # Suppress detailed output during setup
    experiment_name='rating_classification'  # Name the experiment for tracking
)

# Step 3: Compare models to find the best one (`exclude` takes a list of model IDs)
class_best_model = clf_setup.compare_models(exclude=["rf"])


[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 2, number of used features: 0
[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 2, number of used features: 0
[LightGBM] [Info] Number of positive: 1, number of negative: 1


[LightGBM] [Fatal] CUDA Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_CUDA=1
[LightGBM] [Fatal] CUDA Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_CUDA=1


| ID | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| dummy | Dummy Classifier | 0.6209 | 0.1 | 0.6209 | 0.3855 | 0.4757 | 0.0 | 0.0 | 0.197 |
| lr | Logistic Regression | 0.6191 | 0.0 | 0.6191 | 0.4914 | 0.4866 | 0.0148 | 0.0437 | 3.305 |
| knn | K Neighbors Classifier | 0.5672 | 0.1182 | 0.5672 | 0.5332 | 0.5459 | 0.1492 | 0.1518 | 0.221 |
| et | Extra Trees Classifier | 0.5262 | 0.1305 | 0.5262 | 0.5879 | 0.4785 | 0.1166 | 0.1275 | 0.695 |
| ada | Ada Boost Classifier | 0.4786 | 0.0 | 0.4786 | 0.569 | 0.3953 | 0.0731 | 0.1281 | 0.955 |
| nb | Naive Bayes | 0.4774 | 0.1214 | 0.4774 | 0.5396 | 0.4714 | 0.1398 | 0.1558 | 0.196 |
| dt | Decision Tree Classifier | 0.4592 | 0.1082 | 0.4592 | 0.5511 | 0.4616 | 0.0936 | 0.1034 | 0.24 |
| gbc | Gradient Boosting Classifier | 0.4386 | 0.0 | 0.4386 | 0.5828 | 0.421 | 0.0763 | 0.0877 | 14.687 |
| catboost | CatBoost Classifier | 0.42 | 0.1253 | 0.42 | 0.5727 | 0.4083 | 0.0713 | 0.0828 | 17.681 |
| lightgbm | Light Gradient Boosting Machine | 0.384 | 0.1238 | 0.384 | 0.5952 | 0.3839 | 0.0691 | 0.0875 | 2.279 |


[I] [22:46:47.915294] Unused keyword parameter: n_jobs during cuML estimator initialization


In [None]:
# Tune the model's hyperparameters
class_tuned_model = clf_setup.tune_model(class_best_model)

# Validate the tuned model on the hold-out (test) split
validation_results = clf_setup.predict_model(class_tuned_model)
print(validation_results[['rating_class', 'prediction_label']])

# Finalize the model: refit on the full dataset (train + hold-out)
class_final_model = clf_setup.finalize_model(class_tuned_model)

# Score the full dataset; these rows were used for the refit, so the metrics are in-sample
predictions = clf_setup.predict_model(class_final_model, data=five_cat_data)
print(predictions[['rating_class', 'prediction_label']])

# Save the finalized classifier (not the regression `final_model`)
rating_class_classifier = clf_setup.save_model(class_final_model, 'rating_class_classifier')

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.6197 | 0.0 | 0.6197 | 0.384 | 0.4742 | 0.0 | 0.0 |
| 1 | 0.6197 | 0.0 | 0.6197 | 0.384 | 0.4742 | 0.0 | 0.0 |
| 2 | 0.6197 | 0.0 | 0.6197 | 0.384 | 0.4742 | 0.0 | 0.0 |
| 3 | 0.6197 | 0.0 | 0.6197 | 0.384 | 0.4742 | 0.0 | 0.0 |
| 4 | 0.6197 | 0.5 | 0.6197 | 0.384 | 0.4742 | 0.0 | 0.0 |
| 5 | 0.6197 | 0.5 | 0.6197 | 0.384 | 0.4742 | 0.0 | 0.0 |
| 6 | 0.6233 | 0.0 | 0.6233 | 0.3885 | 0.4787 | 0.0 | 0.0 |
| 7 | 0.6233 | 0.0 | 0.6233 | 0.3885 | 0.4787 | 0.0 | 0.0 |
| 8 | 0.6233 | 0.0 | 0.6233 | 0.3885 | 0.4787 | 0.0 | 0.0 |
| 9 | 0.6211 | 0.0 | 0.6211 | 0.3857 | 0.4759 | 0.0 | 0.0 |


Fitting 10 folds for each of 4 candidates, totalling 40 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


|   | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Dummy Classifier | 0.6209 | 0 | 0.6209 | 0.3855 | 0.4757 | 0.0 | 0.0 |


     rating_class prediction_label
9808         Good             Good
1499         Good             Good
4581         Good             Good
2293      Average             Good
3356         Good             Good
...           ...              ...
6611      Average             Good
335          Good             Good
5721      Average             Good
3412         Good             Good
559          Good             Good

[1915 rows x 2 columns]


|   | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Dummy Classifier | 0.6209 | 0 | 0.6209 | 0.3855 | 0.4757 | 0.0 | 0.0 |


     rating_class prediction_label
0       Excellent             Good
1       Excellent             Good
2       Excellent             Good
3         Average             Good
4       Excellent             Good
...           ...              ...
9952         Good             Good
9953    Excellent             Good
9954    Excellent             Good
9955    Excellent             Good
9956      Average             Good

[6381 rows x 2 columns]
Transformation Pipeline and Model Successfully Saved
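
The selected model is the Dummy Classifier, which predicts 'Good' for every title. Accuracy rewards that on an imbalanced target, while metrics that average over classes do not. The sketch below rebuilds the label vector from the class counts printed earlier and scores an always-'Good' prediction with scikit-learn; treating balanced accuracy and macro F1 as the alternatives is a suggestion, not something the notebook does.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Class counts copied from the rating_class value_counts() output.
counts = {'Very Poor': 3, 'Poor': 162, 'Average': 1332, 'Good': 3962, 'Excellent': 922}

# Rebuild a label vector with those counts and predict the majority class everywhere.
y_true = np.repeat(list(counts.keys()), list(counts.values()))
y_pred = np.full_like(y_true, 'Good')

acc = accuracy_score(y_true, y_pred)
balanced_acc = balanced_accuracy_score(y_true, y_pred)  # mean recall over the 5 classes
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)

print(f"accuracy:          {acc:.4f}")
print(f"balanced accuracy: {balanced_acc:.4f}")
print(f"macro F1:          {macro_f1:.4f}")
```

Ranking candidates by a class-averaged metric would stop the constant predictor from winning the comparison; `compare_models` accepts a `sort` argument for choosing the ranking metric.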


In [None]:
# Perform approach 2 classification on the netflix dataset
from pycaret.classification import ClassificationExperiment

# Select target and features (the `five_cat_data` name is reused from approach 1)
target_variable = 'rating_class_10'
five_cat_data = rating_df.copy()

# Step 1: Create the experiment
exp = ClassificationExperiment()

# Step 2: Set up the classification environment
clf_setup = exp.setup(
    data=five_cat_data,
    target=target_variable,  # Target column
    session_id=42,  # For reproducibility
    ignore_features=['title', 'rating_class', 'rating'],  # Drop the title and the columns that would leak the target
    numeric_features=[
        'votes',
        *[f'nmf_feature_{i}' for i in range(1, 11)],  # Numeric NMF features
        *[f'nmf_desc_feature_{i}' for i in range(1, 11)]  # NMF description features
    ],
    categorical_features=['certificate', 'genre', 'stars', 'combined_text'],  # Treat these as categorical
    date_features=['year'],  # Treat 'year' as a date feature
    use_gpu=True,  # Use GPU for acceleration if available
    verbose=False,  # Suppress detailed output during setup
    experiment_name='rating_classification_10'  # Name the experiment for tracking
)

# Step 3: Compare models to find the best one (`exclude` takes a list of model IDs)
class_10_best_model = clf_setup.compare_models(exclude=["rf"])


[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 2, number of used features: 0
[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 2, number of used features: 0
[LightGBM] [Info] Number of positive: 1, number of negative: 1


[LightGBM] [Fatal] CUDA Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_CUDA=1
[LightGBM] [Fatal] CUDA Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_CUDA=1


| ID | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| catboost | CatBoost Classifier | 0.3594 | 0.5739 | 0.3594 | 0.5058 | 0.2595 | 0.0919 | 0.1667 | 18.544 |
| et | Extra Trees Classifier | 0.3565 | 0.6051 | 0.3565 | 0.4822 | 0.2521 | 0.0858 | 0.1622 | 0.676 |
| gbc | Gradient Boosting Classifier | 0.3536 | 0.0 | 0.3536 | 0.4505 | 0.2615 | 0.0886 | 0.1495 | 26.131 |
| xgboost | Extreme Gradient Boosting | 0.3504 | 0.5346 | 0.3504 | 0.4714 | 0.2815 | 0.0953 | 0.138 | 3.884 |
| lda | Linear Discriminant Analysis | 0.3502 | 0.0 | 0.3502 | 0.4613 | 0.2428 | 0.0789 | 0.1493 | 0.32 |
| ada | Ada Boost Classifier | 0.3497 | 0.0 | 0.3497 | 0.2893 | 0.233 | 0.0699 | 0.1408 | 0.958 |
| lightgbm | Light Gradient Boosting Machine | 0.3471 | 0.5203 | 0.3471 | 0.4344 | 0.2688 | 0.0882 | 0.1348 | 4.368 |
| knn | K Neighbors Classifier | 0.3318 | 0.545 | 0.3318 | 0.3278 | 0.3258 | 0.1105 | 0.1111 | 0.223 |
| lr | Logistic Regression | 0.3229 | 0.0 | 0.3229 | 0.2454 | 0.2162 | 0.0331 | 0.055 | 4.173 |
| dt | Decision Tree Classifier | 0.3229 | 0.482 | 0.3229 | 0.3598 | 0.2765 | 0.076 | 0.0957 | 0.248 |


[I] [23:03:05.562640] Unused keyword parameter: n_jobs during cuML estimator initialization


In [None]:
# Tune the model's hyperparameters
class_10_tuned_model = clf_setup.tune_model(class_10_best_model)

# Validate the tuned model on the hold-out (test) split
validation_results = clf_setup.predict_model(class_10_tuned_model)
print(validation_results[['rating_class_10', 'prediction_label']])

# Finalize the model: refit on the full dataset (train + hold-out)
class_10_final_model = clf_setup.finalize_model(class_10_tuned_model)

# Score the full dataset; these rows were used for the refit, so the metrics are in-sample
predictions = clf_setup.predict_model(class_10_final_model, data=five_cat_data)
print(predictions[['rating_class_10', 'prediction_label']])

# Save the finalized classifier (not the regression `final_model`)
rating_class_10_classifier = clf_setup.save_model(class_10_final_model, 'rating_class_10_classifier')

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.3579 | 0.6074 | 0.3579 | 0.5493 | 0.249 | 0.0876 | 0.1749 |
| 1 | 0.3826 | 0.634 | 0.3826 | 0.6814 | 0.2877 | 0.1211 | 0.2205 |
| 2 | 0.3624 | 0.6064 | 0.3624 | 0.5266 | 0.2565 | 0.0949 | 0.1832 |
| 3 | 0.3669 | 0.61 | 0.3669 | 0.5937 | 0.2668 | 0.0995 | 0.184 |
| 4 | 0.349 | 0.6166 | 0.349 | 0.3486 | 0.2426 | 0.0758 | 0.1366 |
| 5 | 0.3468 | 0.6426 | 0.3468 | 0.4897 | 0.2396 | 0.0737 | 0.1408 |
| 6 | 0.3632 | 0.0 | 0.3632 | 0.6747 | 0.2647 | 0.0955 | 0.1827 |
| 7 | 0.3363 | 0.6029 | 0.3363 | 0.6038 | 0.2166 | 0.0563 | 0.131 |
| 8 | 0.3565 | 0.6412 | 0.3565 | 0.4264 | 0.2496 | 0.091 | 0.1671 |
| 9 | 0.3386 | 0.6446 | 0.3386 | 0.403 | 0.2498 | 0.0639 | 0.1032 |


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


|   | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | CatBoost Classifier | 0.3661 | 0.6489 | 0.3661 | 0.5355 | 0.2651 | 0.1004 | 0.1864 |


      rating_class_10  prediction_label
2032                7                 7
4977                8                 7
4019                7                 7
2961                6                 7
9862                8                 8
...               ...               ...
1906                7                 7
8711                7                 7
202                 7                 7
9303                8                 8
9470                7                 7

[1915 rows x 2 columns]


|   | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | CatBoost Classifier | 0.9992 | 1.0 | 0.9992 | 0.9992 | 0.9992 | 0.999 | 0.999 |


      rating_class_10  prediction_label
0                   8                 8
1                   9                 9
2                   9                 9
3                   6                 6
4                   9                 9
...               ...               ...
9952                6                 6
9953                8                 8
9954                9                 9
9955                8                 8
9956                6                 6

[6381 rows x 2 columns]
Transformation Pipeline and Model Successfully Saved


# Netflix Recommender System

In [26]:
# Create a recommender system
# Should we create another NMF on the processed netflix dataset?
# Think about the implications of doing so
# If not, how would we incorporate non-NMF features into the recommender system?

In [27]:
# Preprocessing - Data Cleaning
# End goal: recommend similar shows, regardless of ratings and votes
import pandas as pd
import math

netflix = pd.read_csv('netflix_recommender/netflix_movies.csv')

# Fill missing durations with the mean duration (rounded up), keeping the ' min' suffix for now
duration_without_mins = netflix.loc[netflix['duration'].notna(), 'duration'].str.replace(' min', '', regex=False).astype(int)
duration_mean = str(math.ceil(duration_without_mins.mean())) + ' min'
netflix['duration'] = netflix['duration'].fillna(duration_mean)

# netflix = netflix.dropna()

# Remove non-digits from year but keep the hyphen character, e.g. (2018-2022)
netflix['year'] = netflix['year'].str.replace(r'[^0-9-]', '', regex=True)

# Take the first four digits of year as int; 9999 marks a missing or unparseable year
netflix['year'] = netflix['year'].str[:4]
netflix['year'] = pd.to_numeric(netflix['year'], errors='coerce').fillna(9999).astype(int)

# convert duration column to remove the text ' min'
netflix['duration'] = netflix['duration'].str.replace(' min', '', regex=False).astype(int)

# convert votes to int before calculating the median.
# Remove commas and convert to numeric
netflix['votes'] = netflix['votes'].str.replace(',', '', regex=False).astype(float)

# fill na in the votes column with median
netflix['votes'] = netflix['votes'].fillna(netflix['votes'].median())

# Now 'votes' contains numeric values without commas. Convert to int if necessary.
netflix['votes'] = netflix['votes'].astype(int)

# convert rating to float
netflix['rating'] = netflix['rating'].astype(float)

# fill empty rows in the rating column with median
netflix['rating'] = netflix['rating'].fillna(netflix['rating'].median())

# Clean the 'stars' column
netflix['stars'] = netflix['stars'].str.replace(r"\|\s*", ",", regex=True)
netflix['stars'] = netflix['stars'].str.replace("','", "", regex=False).str.replace("', '", ", ", regex=False).str.strip()
netflix['stars'] = netflix['stars'].str.replace(r",\s*,", ",", regex=True)  # Remove ', ,'
netflix['stars'] = netflix['stars'].str.replace(r"\[\s*'\s*|\s*'\s*\]", "", regex=True)  # Remove surrounding brackets and quotes
netflix['stars'] = netflix['stars'].str.replace(r"', '    Stars:", "", regex=True)  # Normalize 'Stars:' formatting
netflix['stars'] = netflix['stars'].str.replace(r", '    Stars:", "", regex=True)  # Normalize 'Stars:' formatting
netflix['stars'] = netflix['stars'].str.replace(r" \", '", "", regex=True)
netflix['stars'] = netflix['stars'].str.replace(r" '    Star:',", "", regex=True)
netflix['stars'] = netflix['stars'].str.replace(r"', '    Star:", "", regex=True)
netflix['stars'] = netflix['stars'].str.replace(r"[\[\]]", "", regex=True)  # Remove square brackets
netflix['stars'] = netflix['stars'].str.replace('"', "", regex=False)  # Remove double quotes
netflix['stars'] = netflix['stars'].str.strip()  # Remove leading/trailing spaces


# save to csv
netflix.to_csv('netflix_recommender/netflix_movies_processed.csv', index=False)

print(netflix.head())

                    title  year certificate  duration  \
0               Cobra Kai  2018       TV-14        30   
1               The Crown  2016       TV-MA        58   
2        Better Call Saul  2015       TV-MA        46   
3           Devil in Ohio  2022       TV-MA       356   
4  Cyberpunk: Edgerunners  2022       TV-MA        24   

                          genre  rating  \
0         Action, Comedy, Drama     8.5   
1     Biography, Drama, History     8.7   
2                  Crime, Drama     8.9   
3        Drama, Horror, Mystery     5.9   
4  Animation, Action, Adventure     8.6   

                                         description  \
0  Decades after their 1984 All Valley Karate Tou...   
1  Follows the political rivalries and romance of...   
2  The trials and tribulations of criminal lawyer...   
3  When a psychiatrist shelters a mysterious cult...   
4  A Street Kid trying to survive in a technology...   

                                               stars   votes 

In [28]:
# Using your choice, prepare the NMF features for recommender system

# Perform the step necessary for applying cosine similarity

# Apply the normalizer to the selected_nmf
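
One way to fill in the steps outlined in this cell is sketched below on a toy corpus. After each row of NMF features is scaled to unit L2 length, a plain dot product between rows equals their cosine similarity. The toy descriptions, the choice of 2 components and the names `nmf_features` and `normalized_nmf` are assumptions for illustration; the notebook's `selected_nmf` is not defined in this section, so it is not used here.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import Normalizer

# Toy stand-in for the description column.
toy_descriptions = [
    "a karate rivalry reignites decades later",
    "a criminal lawyer bends the law",
    "royal family drama and political rivalry",
    "a lawyer defends a crime family",
]

# Text -> TF-IDF -> non-negative latent features.
tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy_descriptions)
nmf = NMF(n_components=2, init='nndsvda', random_state=42, max_iter=500)
nmf_features = nmf.fit_transform(tfidf)

# Scale every row to unit L2 norm (all-zero rows are left as zeros).
normalized_nmf = Normalizer(norm='l2').fit_transform(nmf_features)

# Dot products of unit-length rows are cosine similarities.
similarity = normalized_nmf @ normalized_nmf.T

print(np.round(similarity, 3))
```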

In [29]:
# One-hot encode genre: one 0/1 indicator column per genre

exploded_df = netflix.assign(genre_type=netflix['genre'].str.split(', ')).explode('genre_type')

# Think of a way to indicate 1 for every genre that each title belongs to
genre_one_hot_df = pd.get_dummies(exploded_df['genre_type'])
genre_indicators_df = pd.concat([exploded_df[['title']], genre_one_hot_df], axis=1)

# map for title and its hash
mapping = pd.DataFrame({
    'title': netflix['title'],
    'hash': [hash(title) for title in netflix['title']]
})

# Use the title's hash as the grouping key instead of the title string
genre_indicators_df['title'] = genre_indicators_df['title'].apply(hash)
# Combine rows by title, taking the maximum value for each column
combined_df = genre_indicators_df.groupby('title', sort=False).max().astype(int)

# Reset index so the hashed title becomes a regular column again
combined_df.reset_index(inplace=True)

hash_to_title = dict(zip(mapping['hash'], mapping['title']))
combined_df['title'] = combined_df['title'].map(hash_to_title)

# combine this with the netflix df
netflix_recommender = netflix.merge(combined_df, on='title', how='left')

# drop the column genre
netflix_recommender.drop(columns=['genre'], inplace=True)

# convert column headers to lowercase and replace spaces with a hyphen (-)
netflix_recommender.columns = netflix_recommender.columns.str.lower().str.replace(' ', '-')

# save to csv
# netflix_recommender.to_csv('netflix_recommender/netflix_recommender.csv', index=False)
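
The explode, hash, group and merge sequence above can also be expressed with pandas' built-in `Series.str.get_dummies`, which splits on a separator and returns one 0/1 column per distinct value. The sketch below shows this on a two-row toy frame; it is an alternative under the assumption that genres are always separated by `', '`. It differs from the cell above in one respect: rows are aligned by position rather than merged by title, so duplicate titles keep their own genre flags.

```python
import pandas as pd

# Toy stand-in for the netflix frame.
toy = pd.DataFrame({
    'title': ['Cobra Kai', 'Better Call Saul'],
    'genre': ['Action, Comedy, Drama', 'Crime, Drama'],
})

# One indicator column per genre, split on ', '.
genre_flags = toy['genre'].str.get_dummies(sep=', ')

# Same header convention as the notebook: lowercase, spaces replaced by hyphens.
genre_flags.columns = genre_flags.columns.str.lower().str.replace(' ', '-')

# Attach the flags and drop the original genre column.
toy_recommender = pd.concat([toy.drop(columns=['genre']), genre_flags], axis=1)

print(toy_recommender)
```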

In [30]:
# Check what netflix_recommender looks like as a dataframe
print(netflix_recommender.head())

                    title  year certificate  duration  rating  \
0               Cobra Kai  2018       TV-14        30     8.5   
1               The Crown  2016       TV-MA        58     8.7   
2        Better Call Saul  2015       TV-MA        46     8.9   
3           Devil in Ohio  2022       TV-MA       356     5.9   
4  Cyberpunk: Edgerunners  2022       TV-MA        24     8.6   

                                         description  \
0  Decades after their 1984 All Valley Karate Tou...   
1  Follows the political rivalries and romance of...   
2  The trials and tribulations of criminal lawyer...   
3  When a psychiatrist shelters a mysterious cult...   
4  A Street Kid trying to survive in a technology...   

                                               stars   votes  action  \
0  Ralph Macchio, William Zabka, Courtney Henggel...  177031       1   
1  Claire Foy, Olivia Colman, Imelda Staunton, Ma...  199885       0   
2  Bob Odenkirk, Rhea Seehorn, Jonathan Banks, Pa...  50

In [31]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

# Classify titles into similar clusters

# Step 1: Preprocess Descriptions (Text Data)
netflix_recommender['description_clean'] = netflix_recommender['description'].str.lower().fillna('')

# Step 2: Extract Features
# 2a. TF-IDF for Descriptions
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = tfidf_vectorizer.fit_transform(netflix_recommender['description_clean'])

# 2b. Genre-Based Feature Vector
genre_columns = [
    'action', 'adventure', 'animation', 'biography', 'comedy', 'crime',
    'documentary', 'drama', 'family', 'fantasy', 'history', 'horror',
    'music', 'mystery', 'news', 'reality-tv', 'romance', 'sci-fi', 'short',
    'sport', 'talk-show', 'thriller', 'war', 'western'
]
genre_matrix = netflix_recommender[genre_columns].values

# Step 3: Combine Features
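# Standardize the genre indicators (zero mean, unit variance per column).
# Note: TF-IDF rows are L2-normalized, so the two blocks are not on identical scales.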
scaler = StandardScaler()
normalized_genre_matrix = scaler.fit_transform(genre_matrix)


# Combine TF-IDF and normalized genre features
combined_matrix = np.hstack((tfidf_matrix.toarray(), normalized_genre_matrix))

# Step 4: Perform Similarity Grouping
# Compute cosine similarity on the combined feature matrix
similarity_matrix = cosine_similarity(combined_matrix)

# Step 5: Clustering (Optional)
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
netflix_recommender['combined_cluster'] = kmeans.fit_predict(combined_matrix)

# save netflix_recommender to csv
# netflix_recommender.to_csv('netflix_recommender/netflix_recommender_clustered.csv', index=False)

data = netflix_recommender.copy()
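
The two feature blocks combined above live on different scales: `TfidfVectorizer` L2-normalizes each row by default, while each standardized genre column has unit variance, so with 24 genre columns the genre block can carry much more weight in the cosine similarity than the text block. One possible way to balance them is to L2-normalize each block per row and scale one block by an explicit weight. The sketch below uses random stand-in data; `combine_blocks` and `genre_weight` are hypothetical names, not part of this notebook.

```python
import numpy as np
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two feature blocks.
text_block = normalize(rng.random((6, 10)))  # unit-length rows, like TF-IDF output
genre_block = rng.integers(0, 2, size=(6, 4)).astype(float)  # 0/1 genre flags

def combine_blocks(text, genre, genre_weight=1.0):
    """L2-normalize each block per row, then scale the genre block by a weight."""
    text_unit = normalize(text)
    genre_unit = normalize(genre)  # all-zero rows stay all-zero
    return np.hstack((text_unit, genre_weight * genre_unit))

combined = combine_blocks(text_block, genre_block, genre_weight=0.5)
print(combined.shape)
```

With both blocks at unit length, `genre_weight` becomes a single knob for how much genre overlap should count relative to description overlap.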


In [32]:
# Randomly select a title from the netflix dataset
import random

random_index = random.randint(0, len(data) - 1)
random_title = data.iloc[random_index]['title']
print(f"Randomly Selected Title: {random_title}")

# Get the title's profile / embedding
print(data.iloc[random_index])

# Find the top 20 most similar titles

similarity_scores = cosine_similarity([combined_matrix[random_index]], combined_matrix).flatten()
top_20_indices = similarity_scores.argsort()[::-1][1:21]  # Exclude the random title itself
similar_titles = data.iloc[top_20_indices][['title', 'description','combined_cluster']]

# Display Results
print("\nTop 20 Most Similar Titles:")
print(similar_titles)

# Pair each of the top 20 titles with its cosine similarity to the random title
top_20_similarities = [(data.iloc[i]['title'], similarity_scores[i]) for i in top_20_indices]
# Print the top 20 most similar titles with their cosine similarity
print("Top 20 Most Similar Titles:")
for title, score in top_20_similarities:
    print(f"Title: {title} | Similarity: {score:.4f}")

Randomly Selected Title: Grand Army
title                                                       Grand Army
year                                                              2020
certificate                                                      TV-MA
duration                                                            49
rating                                                             7.2
description          Joey copes with a difficult setback in her cas...
stars                Silas Howard, Odessa A’zion, Odley Jean, Amir ...
votes                                                              130
action                                                               0
adventure                                                            0
animation                                                            0
biography                                                            0
comedy                                                               0
crime                                    

In [33]:
# Can you improve the recommendation through the use of other features?

# use genre and description as features

# For each title, get the rating and the votes and present the top 5
# highest rated titles
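
One possible way to approach the exercise above is to combine rating and votes into a single score, so that a title with a high rating but very few votes does not outrank well-established titles. The sketch below uses a vote-weighted (shrinkage) rating on made-up data. The `candidates` frame, the `weighted_rating` helper, and the `min_votes` threshold are all hypothetical choices, not something the notebook prescribes.

```python
import pandas as pd

# Hypothetical candidates, e.g. the titles most similar to a query title.
candidates = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": [9.5, 8.0, 7.0, 8.5],
    "votes": [10, 50_000, 200_000, 3_000],
})

def weighted_rating(df, min_votes):
    """Shrink each rating toward the mean rating; low-vote titles shrink more."""
    mean_rating = df["rating"].mean()
    v = df["votes"]
    return (v / (v + min_votes)) * df["rating"] + (min_votes / (v + min_votes)) * mean_rating

candidates["weighted_rating"] = weighted_rating(candidates, min_votes=1_000)

# Present the top 5 titles by the combined score.
top = candidates.sort_values("weighted_rating", ascending=False).head(5)
print(top[["title", "rating", "votes", "weighted_rating"]])
```

In this toy data, title A has the highest raw rating but only 10 votes, so its score is pulled close to the mean and it no longer ranks first.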


In [34]:
# Step 6: Recommend Movies
def recommend_movies(input_titles, top_n=10):
    # Find the rows of the input movies
    # (assumes `data` has a default RangeIndex, so labels match rows of similarity_matrix)
    input_indices = data[data['title'].isin(input_titles)].index.tolist()
    missing_titles = set(input_titles) - set(data.loc[input_indices, 'title'])
    if not input_indices or missing_titles:
        return f"Could not find the following titles in the dataset: {', '.join(missing_titles)}"

    # Sum the similarity scores of all movies relative to the input movies
    aggregated_scores = np.zeros(similarity_matrix.shape[0])
    for idx in input_indices:
        aggregated_scores += similarity_matrix[idx]

    # Exclude the input movies from the recommendations.
    # Cosine similarity can be negative, so -1 is not a safe sentinel; use -inf.
    aggregated_scores[input_indices] = -np.inf

    # Create a DataFrame with scores, ratings, and votes
    recommendation_data = data.copy()
    recommendation_data['similarity_score'] = aggregated_scores

    # Sort by similarity score first; rating and votes only break ties
    recommendation_data = recommendation_data.sort_values(
        by=['similarity_score', 'rating', 'votes'], ascending=[False, False, False]
    )

    # Select the top N recommended movies
    recommended_movies = recommendation_data.head(top_n)[['title', 'description', 'combined_cluster', 'rating', 'votes']]

    return recommended_movies

# Select 3 random rows from the dataset
random_movies = netflix_recommender.sample(n=3)
print(random_movies[['title', 'description', 'combined_cluster']])

# collect the selected titles as a list
input_movies = random_movies['title'].tolist()

# loop through the list of input_movies
for index, i in random_movies.iterrows():
    print("\n === Input Movie === ")
    print(i[['title', 'description', 'combined_cluster']]) # Accessing series data by label
    print("\n=== Recommended movies === ")
    print(recommend_movies([i['title']], top_n=5))

# recommended_movies = recommend_movies(input_movies, top_n=10)
# print("\nRecommended Movies:")
# print(recommended_movies)

                          title  \
3232  Last Chance U: Basketball   
930                  White Girl   
2962                   Aelliseu   

                                            description  combined_cluster  
3232  Explore an honest and gritty look inside the w...                 3  
930   Summer, New York City. A college girl falls ha...                 3  
2962  The story of detective Park Jin Gyeom who come...                 3  

 === Input Movie === 
title                                       Last Chance U: Basketball
description         Explore an honest and gritty look inside the w...
combined_cluster                                                    3
Name: 3232, dtype: object

=== Recommended movies === 
                            title  \
2028                Last Chance U   
1947                        Cheer   
5153               The Short Game   
7133    Boca Juniors Confidential   
830   Formula 1: Drive to Survive   

                                            