# TITLE

# Introduction

To fill

<div style="border-left: 6px solid rgba(69, 157, 185, 1);border-radius:5px; box-shadow: 3px 3px 3px rgba(221, 221, 221, 1);" >
    <p style="background-color: rgba(69, 157, 185, 0.1); font-weight:bold; padding: 8px 0 8px 15px;">Analysis</p>
    <div style="padding: 0 0 2px 10px;">
    
**What will be covered :**
- **Part 1 :** fill
- **Part 2 :** fill
- **Part 3 :** fill

</div></div>
<br/>

---

# Part 1: Movie Dataset

### Import libraries

In [2]:
import numpy as np
import pandas as pd

# LDA
# gensim is a popular library for topic modelling
# pip install gensim
# pip install nltk
# pip install kagglehub
# pip install spacy
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
import kagglehub # To extract synopsis dataframes
import spacy


nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/serge/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /Users/serge/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### Load data

In [5]:
movie_columns = ['Wikipedia movie ID', 'Freebase movie ID', 'Movie name', 'Movie release date', 'Movie box office revenue', 'Movie runtime', 'Movie languages', 'Movie countries', 'Movie genres']
movie = pd.read_csv('Data/movie.metadata.tsv', sep='\t', header=None, names=movie_columns)
movie.head()

Unnamed: 0,Wikipedia movie ID,Freebase movie ID,Movie name,Movie release date,Movie box office revenue,Movie runtime,Movie languages,Movie countries,Movie genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"


## Latent Dirichlet Allocation (LDA)
To group movies by something more precise than their genre, we will use Latent Dirichlet Allocation (LDA). LDA clusters movies that share similar themes, based on their textual descriptions. It is an unsupervised method, since the topics are not known in advance. However, the movie summaries contain too little information for LDA. We therefore take the movie synopses from another dataset, "Movie Plot Synopses with Tags" (MPST). These synopses are around 10 times longer than the summaries, which gives the LDA model much more information.

### Exploratory Data Analysis of Synopsis Dataframe

In [7]:
# Load data
synopsis = pd.read_csv('Data/mpst_full_data.csv')

# Check data shape
print("This synopsis dataframe is of size:", synopsis.shape)

# Display df
synopsis.head()

This synopsis dataframe is of size: (14828, 6)


Unnamed: 0,imdb_id,title,plot_synopsis,tags,split,synopsis_source
0,tt0057603,I tre volti della paura,Note: this synopsis is for the orginal Italian...,"cult, horror, gothic, murder, atmospheric",train,imdb
1,tt1733125,Dungeons & Dragons: The Book of Vile Darkness,"Two thousand years ago, Nhagruul the Foul, a s...",violence,train,imdb
2,tt0033045,The Shop Around the Corner,"Matuschek's, a gift store in Budapest, is the ...",romantic,test,imdb
3,tt0113862,Mr. Holland's Opus,"Glenn Holland, not a morning person by anyone'...","inspiring, romantic, stupid, feel-good",train,imdb
4,tt0086250,Scarface,"In May 1980, a Cuban man named Tony Montana (A...","cruelty, murder, dramatic, cult, violence, atm...",val,imdb


### Comment
The dataframe provides extra movie tags and the full movie synopsis, which gives LDA more context to group movies together. On the other hand, it contains only 14,828 rows (movies), considerably fewer than the summaries dataframe. Another important point is that this dataframe uses imdb_id, not Wikipedia movie IDs.

## Adding synopsis and tags to the initial movie df
To match the two datasets, we will retrieve the IMDb ID of each movie in the initial dataset through the imdb library. This is safer than matching on titles, and we will use the library later anyway (cf. part x).

In [11]:
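from imdb import IMDb
ia = IMDb()
movie_id = '1375666'
# Use a separate name so that the `movie` dataframe loaded above is not overwritten
imdb_movie = ia.get_movie(movie_id)
# .get() returns None when the key is not available for this movie
wikipedia_id = imdb_movie.get('wikidata id')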
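
The synopsis dataframe stores IDs with a `tt` prefix (e.g. `tt0086250`), while the lookup above is given a bare numeric string (`'1375666'`). A minimal sketch of a conversion helper, assuming the lookup expects only the numeric part; the name `to_numeric_imdb_id` is hypothetical and not part of the imdb library:

```python
def to_numeric_imdb_id(imdb_id):
    """Return the ID without its 'tt' prefix, e.g. 'tt0086250' -> '0086250'."""
    imdb_id = imdb_id.strip()
    # Leading zeros are kept, so the ID stays a string rather than an int
    return imdb_id[2:] if imdb_id.startswith("tt") else imdb_id

numeric_id = to_numeric_imdb_id("tt0086250")
print(numeric_id)
```

Keeping the result as a string preserves the leading zeros that an integer conversion would lose.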

In [9]:
movies_synopsis = synopsis.merge(movie, left_on='title', right_on='Movie name', how='inner')
# drop() returns a new dataframe, so the result has to be assigned back
movies_synopsis = movies_synopsis.drop(columns=['title', 'split'])
movies_synopsis.head()

Unnamed: 0,imdb_id,plot_synopsis,tags,synopsis_source,Wikipedia movie ID,Freebase movie ID,Movie name,Movie release date,Movie box office revenue,Movie runtime,Movie languages,Movie countries,Movie genres
0,tt1733125,"Two thousand years ago, Nhagruul the Foul, a s...",violence,imdb,30855958,/m/0gfjl1f,Dungeons & Dragons: The Book of Vile Darkness,2012-08-09,,90.0,{},"{""/m/07ssc"": ""United Kingdom""}","{""/m/01hmnh"": ""Fantasy""}"
1,tt0033045,"Matuschek's, a gift store in Budapest, is the ...",romantic,imdb,76353,/m/0k4bt,The Shop Around the Corner,1940-01-12,,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/0hj3nyp"": ..."
2,tt0113862,"Glenn Holland, not a morning person by anyone'...","inspiring, romantic, stupid, feel-good",imdb,171076,/m/016z98,Mr. Holland's Opus,1995-12-29,106269971.0,130.0,"{""/m/02h40lc"": ""English Language"", ""/m/0my5"": ...","{""/m/09c7w0"": ""United States of America""}","{""/m/0hj3n84"": ""Inspirational Drama"", ""/m/0hqx..."
3,tt0086250,"In May 1980, a Cuban man named Tony Montana (A...","cruelty, murder, dramatic, cult, violence, atm...",imdb,76331,/m/0k44g,Scarface,1932,,94.0,"{""/m/02bjrlw"": ""Italian Language"", ""/m/02h40lc...","{""/m/09c7w0"": ""United States of America""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/0gw5w78"": ""G..."
4,tt0086250,"In May 1980, a Cuban man named Tony Montana (A...","cruelty, murder, dramatic, cult, violence, atm...",imdb,267848,/m/01nln3,Scarface,1983-12-01,65884703.0,170.0,"{""/m/02h40lc"": ""English Language"", ""/m/06nm1"":...","{""/m/09c7w0"": ""United States of America""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/01jfsb"": ""Th..."


### Comment

1. Movie loss


The size of the dataframe has not changed significantly: only ... rows did not match, so our original movie df contains most of the movies present in the synopsis df.

2. Same title


We also observe that Scarface appears twice, since there was a remake. For the rest of the analysis, we will assume that there is no significant difference between their synopses.
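
The duplicated Scarface row comes from joining on the title alone: one synopsis row is paired with every movie row that has the same name. Below is a pure-Python sketch of that inner join on a few rows copied from the tables above. The helper `inner_join_on_title` is hypothetical and only mimics the behaviour of the merge on this toy data:

```python
from collections import Counter, defaultdict

# (imdb_id, title) rows from the synopsis dataframe
synopsis_rows = [
    ("tt0086250", "Scarface"),
    ("tt0033045", "The Shop Around the Corner"),
    ("tt0057603", "I tre volti della paura"),
]
# (Wikipedia movie ID, Movie name, Movie release date) rows from the movie dataframe
movie_rows = [
    (76331, "Scarface", "1932"),
    (267848, "Scarface", "1983-12-01"),
    (76353, "The Shop Around the Corner", "1940-01-12"),
]

def inner_join_on_title(left, right):
    """Pair every left row with every right row that has the same title."""
    right_by_title = defaultdict(list)
    for wiki_id, name, release in right:
        right_by_title[name].append((wiki_id, release))
    joined = []
    for imdb_id, title in left:
        for wiki_id, release in right_by_title.get(title, []):
            joined.append((imdb_id, title, wiki_id, release))
    return joined

joined = inner_join_on_title(synopsis_rows, movie_rows)
match_counts = Counter(row[0] for row in joined)

# Synopses attached to more than one movie, and synopses that found no movie
ambiguous = [imdb_id for imdb_id, n in match_counts.items() if n > 1]
unmatched = [title for imdb_id, title in synopsis_rows if imdb_id not in match_counts]

print(len(joined), ambiguous, unmatched)
```

Counting matches per `imdb_id` in this way is one option for quantifying both the lost rows and the ambiguous titles before deciding how to handle remakes.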

# Can we observe new movies' groups using LDA?

## Preprocessing
Before using LDA, it is important to lowercase the text and to remove stop words and names, so that these very frequent words are not used to predict themes.
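
As a rough illustration of the lowercasing and stop-word steps, here is a minimal stdlib sketch. It assumes a tiny hand-written stop-word list and a regex tokenizer as stand-ins for NLTK's English stop words and `word_tokenize`; name removal is not covered here:

```python
import re

# Hypothetical mini stop-word list; NLTK's English list is much longer
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]

tokens = preprocess("The police find a body in the house.")
print(tokens)
```

The regex tokenizer splits differently from `word_tokenize` on apostrophes and hyphens, so the token lists are comparable but not identical.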

In [10]:
# This cell takes a long time to run; you can load the saved movies_synopsis df instead (see the next cell)

"""
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

def remove_named_entities(text):
    doc = nlp(text)
    return ' '.join([token.text for token in doc if token.ent_type_ != "PERSON"])  # Exclude PERSON entities

stop_words = set(stopwords.words('english'))

# Movie synopses (one entry per movie)
sentences = movies_synopsis['plot_synopsis'].tolist()
print("There are", len(sentences), "sentences")

# Remove names
cleaned_sentences = [remove_named_entities(doc) for doc in sentences]
print("After removing names there are", len(cleaned_sentences), "sentences")

# Lowercase, tokenize, and remove stop words and non-alphanumeric tokens
processed_sentences = [
    [word for word in word_tokenize(sentence.lower()) if word.isalnum() and word not in stop_words]
    for sentence in cleaned_sentences
]
print("After total processing there are", len(processed_sentences), "sentences")

# Store processed_sentences in the df so that these steps do not have to be repeated
movies_synopsis['processed synopsis'] = processed_sentences
movies_synopsis.to_csv('Data/movies_synopsis.csv', index=False)
"""


There are 14734 sentences
After removing names there are 14734 sentences
After total processing there are 14734 sentences


In [23]:
import ast

# Load the preprocessed synopses saved by the previous cell
movies_synopsis = pd.read_csv('Data/movies_synopsis.csv')
# The CSV stores each token list as its string representation, so parse it back into a list.
# Passing the raw strings to doc2bow raises:
# TypeError: doc2bow expects an array of unicode tokens on input, not a single string
processed_sentences = movies_synopsis['processed synopsis'].apply(ast.literal_eval).tolist()

In [21]:
# Create a dictionary and corpus for the LDA model
dictionary = Dictionary(processed_sentences)
corpus = [dictionary.doc2bow(sentence) for sentence in processed_sentences]

# Train LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, random_state=42, passes=10, iterations=50)
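
A column of token lists does not survive a CSV round trip as lists: each list is written as its string representation and read back as a single string, which is what `doc2bow` rejects. The stdlib sketch below reproduces this with an in-memory CSV (the sample tokens are made up) and shows one way to restore the list with `ast.literal_eval`:

```python
import ast
import csv
import io

tokens = ["police", "find", "body"]

# Write the list to an in-memory CSV; the writer stores str(tokens)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["processed synopsis"])
writer.writerow([tokens])

# Read it back: the cell is now one string, not a list of tokens
buffer.seek(0)
rows = list(csv.reader(buffer))
raw = rows[1][0]
print(type(raw).__name__, raw)

# literal_eval safely parses the string back into a Python list
restored = ast.literal_eval(raw)
print(restored == tokens)
```

Formats that keep nested types, such as pickle or JSON, would avoid the parsing step altogether; CSV is kept here only because the notebook already uses it.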

In [13]:
for topic_id, topic_words in lda_model.print_topics(num_words=15):
    print(f"Topic {topic_id}: {topic_words}")

Topic 0: 0.016*"police" + 0.008*"car" + 0.007*"man" + 0.007*"kill" + 0.007*"killed" + 0.007*"house" + 0.005*"finds" + 0.005*"dead" + 0.005*"gun" + 0.005*"one" + 0.005*"find" + 0.005*"body" + 0.005*"death" + 0.005*"murder" + 0.005*"shoots"
Topic 1: 0.005*"one" + 0.004*"war" + 0.004*"team" + 0.004*"men" + 0.003*"ship" + 0.003*"escape" + 0.003*"killed" + 0.003*"two" + 0.003*"attack" + 0.003*"world" + 0.003*"army" + 0.003*"earth" + 0.003*"group" + 0.003*"bomb" + 0.003*"time"
Topic 2: 0.022*"hamlet" + 0.018*"king" + 0.012*"macbeth" + 0.009*"claudius" + 0.006*"prince" + 0.006*"banquo" + 0.005*"act" + 0.005*"witches" + 0.005*"ghost" + 0.005*"queen" + 0.004*"castle" + 0.004*"musketeers" + 0.004*"father" + 0.004*"england" + 0.004*"love"
Topic 3: 0.007*"father" + 0.005*"love" + 0.005*"family" + 0.005*"one" + 0.005*"life" + 0.005*"new" + 0.005*"mother" + 0.004*"film" + 0.004*"home" + 0.004*"time" + 0.004*"also" + 0.004*"wife" + 0.003*"son" + 0.003*"later" + 0.003*"two"
Topic 4: 0.007*"back" + 0.0

This approach seems promising! With very basic preprocessing, we can already interpret some themes:

*   Topic 0: Police + Kill --> Detective
*   Topic 1: War + team --> War
*   Topic 2: King + witches --> Medieval
*   Topic 3: Father + Love --> Family
*   Topic 4: Town + Sheriff + Horse --> Western
*   Topic 5: Sea + treasure --> Pirate
*   Topic 6: Prince + Castle --> Fairy tale
*   Topic 7: Money + job --> Business
*   Topic 8: ?
*   Topic 9: ?


However, there is still room for improvement. For example, topic 9 still contains many names, so the name-removal step should be fine-tuned for further analysis. Lemmatization should also be applied, so that inflected forms of the same word are not counted separately (e.g. killed and kill in topic 0).
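
To make the effect of lemmatization concrete, the sketch below merges inflected forms using a small hand-written lookup table. The table `LEMMAS` is purely hypothetical; in practice the mapping would come from a real lemmatizer such as spaCy's `token.lemma_` or NLTK's `WordNetLemmatizer`:

```python
from collections import Counter

# Hypothetical lemma table covering a few forms seen in topic 0
LEMMAS = {"killed": "kill", "kills": "kill", "finds": "find", "shoots": "shoot"}

def lemmatize(tokens, lemmas=LEMMAS):
    """Replace each token by its lemma when one is known, else keep it."""
    return [lemmas.get(tok, tok) for tok in tokens]

tokens = ["kill", "killed", "finds", "find", "police"]
before = Counter(tokens)
after = Counter(lemmatize(tokens))

print(before)
print(after)
```

After the mapping, `kill` and `find` each accumulate the counts of their variants, so a topic would list the word once with a combined weight instead of spreading it over several entries.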







passes: the number of complete passes through the entire corpus during training. Increasing it gives the model more chances to learn the structure of the data.

iterations: the maximum number of inference iterations performed on each document within a pass. Increasing it can help convergence, especially for smaller datasets.