#### Setup 

In [18]:
import pandas as pd
import numpy as np
import datetime
from sklearn.preprocessing import MinMaxScaler
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer

In [19]:
df_merged = pd.read_pickle('data/df_movies_cleaned.pkl')
df_ratings = pd.read_pickle('data/df_ratings_cleaned.pkl')

## Feature Engineering

Feature engineering turns raw data into variables a model can learn from, and it has a direct effect on the accuracy of machine learning models, recommender systems included. For a movie recommender, the goal is to capture movie content, user preferences, and context in structured form. The features built in this section are a weighted score, a combined text field with its TF-IDF vectors, movie age, and the sentiment of each overview. Well-chosen features make recommendations more personalized and relevant, which in turn supports user engagement and retention on the platform.

#### Weighted Score 

In [22]:
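# C: mean rating across all movies (the value each movie is pulled toward)
C = df_merged['vote_average'].mean()

# m: vote-count threshold, set to the 90th percentile of vote_count
m = df_merged['vote_count'].quantile(0.90)

def weighted_score(x, m=m, C=C):
    v = x['vote_count']    # number of votes for this movie
    R = x['vote_average']  # average rating for this movie
    # Weighted average of R and C; more votes shift the weight toward R
    return (v / (v + m) * R) + (m / (m + v) * C)

df_merged['weighted_score'] = df_merged.apply(weighted_score, axis=1)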

The weighted score blends a movie's average rating (vote_average) with the dataset-wide mean rating C, weighted by the movie's number of votes (vote_count) relative to the threshold m, here the 90th percentile of vote counts. A movie with few votes is pulled toward C, while a movie with many votes keeps a score close to its own average. This mitigates the bias toward movies with a high average rating but only a handful of ratings, so the top-ranked recommendations are both well rated and widely rated.
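To make the effect of the formula concrete, the following sketch applies it to two made-up movies. The values of `C` and `m` and both movies are illustrative assumptions, not figures from the dataset.

```python
def weighted_score(v, R, m, C):
    """Blend a movie's own rating R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only (not taken from the dataset)
C = 6.0   # assumed mean rating across all movies
m = 100   # assumed vote-count threshold

niche = weighted_score(v=10, R=9.0, m=m, C=C)      # high rating, few votes
popular = weighted_score(v=1000, R=8.0, m=m, C=C)  # lower rating, many votes

print(round(niche, 2), round(popular, 2))  # 6.27 7.82
```

The movie rated 9.0 by only 10 voters ends up close to the assumed global mean, while the movie rated 8.0 by 1,000 voters keeps most of its own rating and ranks higher.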

#### Textual Feature - Combined Text

In [23]:
df_merged['combined_text'] = df_merged.apply(lambda row: ' '.join([
    ' '.join(row['genre_extracted']),
    ' '.join(row['actors']),
    ' '.join(row['keywords_extracted']),
    # Guard against a missing (non-string) overview, which would break the join
    row['overview'] if isinstance(row['overview'], str) else '',
    ' '.join(row['production_company_extracted'])
]).lower(), axis=1)

The combined_text feature concatenates the textual metadata of each movie (genres, actors, keywords, overview, and production companies) into a single lowercase string. This string serves as the movie's content descriptor for content-based filtering: movies that share genres, cast, keywords, or wording in their descriptions end up with overlapping text, which lets the recommender system suggest movies with similar content and themes.
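The same aggregation can be written as a named helper that skips missing fields, which may be easier to test than an inline lambda. This is a minimal sketch: `build_combined_text` and the example row are hypothetical, and only the column names come from the notebook.

```python
# Column names from the notebook, in the order they are concatenated
FIELDS = ['genre_extracted', 'actors', 'keywords_extracted',
          'overview', 'production_company_extracted']

def build_combined_text(row):
    """Join list-valued and string-valued fields into one lowercase string."""
    parts = []
    for field in FIELDS:
        value = row.get(field)
        if isinstance(value, list):
            parts.extend(str(item) for item in value)
        elif isinstance(value, str):
            parts.append(value)
        # Any other value (None, NaN) is treated as missing and skipped
    return ' '.join(parts).lower()

# Made-up example row with a missing overview
row = {
    'genre_extracted': ['Science Fiction'],
    'actors': ['Jane Doe'],
    'keywords_extracted': ['space', 'alien'],
    'overview': None,
    'production_company_extracted': ['Example Pictures'],
}
combined = build_combined_text(row)
print(combined)  # science fiction jane doe space alien example pictures
```

Because the helper only relies on `row.get`, it should also work when passed to `df_merged.apply(build_combined_text, axis=1)`.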

#### Vectorizing Combined Text

In [28]:
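# Cap the vocabulary at the 10,000 most frequent terms and drop English stop words
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)

# Sparse matrix with one row per movie and one column per vocabulary term
tfidf_matrix = tfidf_vectorizer.fit_transform(df_merged['combined_text'])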

TF-IDF turns each movie's combined_text into a numeric vector in which a term's weight grows with its frequency in that movie and shrinks with the number of movies that contain it. With max_features=10000 the vocabulary is capped at the 10,000 most frequent terms, and English stop words are removed. Similarity between two movies can then be measured directly on these vectors, for example with cosine similarity, so the recommender system can suggest movies whose genres, cast, keywords, and descriptions overlap with those a user already likes.
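To show how TF-IDF vectors lead to a similarity score, here is a small pure-Python sketch on three made-up descriptions. It uses whitespace tokenization and a smoothed IDF with L2 normalization, which is an assumption chosen to resemble scikit-learn's documented defaults; it is not the library's implementation, and the function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors (their dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Made-up descriptions
docs = [
    'space adventure alien',
    'space alien war',
    'romantic comedy wedding',
]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])  # shares "space" and "alien"
sim_far = cosine(vecs[0], vecs[2])    # shares no terms
print(sim_close > sim_far)  # True
```

On the real `tfidf_matrix`, `sklearn.metrics.pairwise.cosine_similarity` or `linear_kernel` would be the usual way to compute the same quantity on the sparse matrix.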

#### Movie Age

In [24]:
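# Reference year is taken at run time, so movie_age changes when the notebook is re-run in a later year
current_year = datetime.datetime.now().year

# Age in years, based on the release year only
df_merged['movie_age'] = current_year - pd.to_datetime(df_merged['release_date']).dt.year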

movie_age is the number of years between a movie's release year and the year in which the notebook is run, so its values shift when the notebook is re-run in a later year. The feature lets the recommender system filter or weight movies by recency, catering to preferences for newer releases or for classic films.
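If reproducible values matter, the reference year can be passed in explicitly instead of being read from the clock. The sketch below assumes release dates are ISO-formatted strings (`YYYY-MM-DD`); the helper name and the sample dates are hypothetical.

```python
from datetime import date

def movie_age(release_date, reference_year):
    """Years between the release year and reference_year, or None if the date is unusable."""
    try:
        return reference_year - date.fromisoformat(release_date).year
    except (TypeError, ValueError):
        # TypeError: value is not a string (e.g. None); ValueError: not a valid ISO date
        return None

age = movie_age('1999-03-31', reference_year=2024)
print(age)  # 25
```

Fixing `reference_year` keeps the feature stable across runs, at the cost of having to update it deliberately.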

#### Sentiment Analysis of Overview

In [25]:
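def get_sentiment(text):
    # Missing or non-string overviews get no score instead of raising
    if not isinstance(text, str):
        return None
    # Polarity ranges from -1.0 (negative) to 1.0 (positive)
    return TextBlob(text).sentiment.polarity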

df_merged['sentiment_polarity'] = df_merged['overview'].apply(get_sentiment)

Sentiment analysis of each overview yields a sentiment_polarity score between -1.0 (negative) and 1.0 (positive) that approximates the emotional tone of the movie's description. Overviews that are missing or not strings receive no score. The feature adds a layer of personalization: the recommender system can distinguish movies not only by genre or content but also by the tone of their descriptions, for example to match a user's current mood.
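One way to use the polarity score for mood-based filtering is to bucket it into coarse labels. The following is a sketch only: `mood_label`, the label names, and the 0.1 threshold are assumptions for illustration and do not appear in the notebook.

```python
def mood_label(polarity, threshold=0.1):
    """Map a polarity in [-1.0, 1.0] to a coarse label; the threshold is an arbitrary choice."""
    # None or NaN (NaN is the only value not equal to itself) means no score is available
    if polarity is None or polarity != polarity:
        return 'unknown'
    if polarity >= threshold:
        return 'positive'
    if polarity <= -threshold:
        return 'negative'
    return 'neutral'

labels = [mood_label(p) for p in (0.6, -0.4, 0.02, None)]
print(labels)  # ['positive', 'negative', 'neutral', 'unknown']
```

Applied to the notebook's data, this could look like `df_merged['sentiment_polarity'].apply(mood_label)`, with the threshold tuned to the observed distribution of scores.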