## <div style="color:brown;text-align:center"><u> Researcher: Akinsulure Akintunde </u></div>

### <div style="color:red">Problem Statement</div>

*The objective is to extract meaningful topics using two different topic modelling approaches: LDA and BERTopic. The task is to identify thematic structures within the movie synopses (using the synopsis column in the available data) and compare the topics generated by the traditional method (LDA) with those produced by the more recent, embedding-based method (BERTopic).*

---

In [1]:
### Importing necessary libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Suppressing warnings
import warnings
warnings.filterwarnings('ignore')

---

In [2]:
### Loading the dataset into pandas 
data = pd.read_csv("data/movie_data.csv")

## Inspecting the dataframe
data.head()

Unnamed: 0,Movie,Genre,Runtime,Rating,Votes,Year,Synopsis,Actors,Certificate,Image
0,Gen V,"Action, Adventure, Comedy",,8.0,13679,2023–,"From the world of ""The Boys"" comes ""Gen V,"" wh...","Jaz Sinclair, Chance Perdomo, Lizze Broadway, ...",,https://m.media-amazon.com/images/S/sash/4Fyxw...
1,Ahsoka,"Action, Adventure, Drama",,7.8,69947,2023–,"After the fall of the Galactic Empire, former ...","Rosario Dawson, David Tennant, Natasha Liu Bor...",,https://m.media-amazon.com/images/S/sash/4Fyxw...
2,Loki,"Action, Adventure, Fantasy",53 min,8.2,359924,2021–,The mercurial villain Loki resumes his role as...,"Tom Hiddleston, Owen Wilson, Sophia Di Martino...",,https://m.media-amazon.com/images/S/sash/4Fyxw...
3,The Wheel of Time,"Action, Adventure, Drama",60 min,7.1,125052,2021–,Set in a high fantasy world where magic exists...,"Rosamund Pike, Daniel Henney, Madeleine Madden...",,https://m.media-amazon.com/images/S/sash/4Fyxw...
4,One Piece,"Action, Adventure, Comedy",60 min,8.4,109063,2023–,"In a seafaring world, a young pirate captain s...","Iñaki Godoy, Emily Rudd, Mackenyu, Vincent Regan",,https://m.media-amazon.com/images/S/sash/4Fyxw...


In [3]:
### Only the Synopsis column is needed for this task, so it is separated from the rest.
### .copy() gives an independent DataFrame, so later assignments do not act on a slice of `data`.

new_data = data[['Synopsis']].copy()

# Inspecting new_data
new_data.head()

Unnamed: 0,Synopsis
0,"From the world of ""The Boys"" comes ""Gen V,"" wh..."
1,"After the fall of the Galactic Empire, former ..."
2,The mercurial villain Loki resumes his role as...
3,Set in a high fantasy world where magic exists...
4,"In a seafaring world, a young pirate captain s..."


Good, we'll now dive into data cleaning

---

### <div style="text-align:center;color:blue"><u> Exploratory Data Analysis and Data Visualization</u></div>

In [4]:
# Setting the max column width to None so the full text is displayed rather than truncated

pd.set_option("display.max_colwidth", None)

new_data.head(15)

Unnamed: 0,Synopsis
0,"From the world of ""The Boys"" comes ""Gen V,"" which explores the first generation of superheroes to know that their super powers are from Compound V. These heroes put their physical and moral boundaries to the test competing for the school's top ranking."
1,"After the fall of the Galactic Empire, former Jedi Knight Ahsoka Tano investigates an emerging threat to a vulnerable galaxy."
2,The mercurial villain Loki resumes his role as the God of Mischief in a new series that takes place after the events of “Avengers: Endgame.”
3,"Set in a high fantasy world where magic exists, but only some can access it, a woman named Moiraine crosses paths with five young men and women. This sparks a dangerous, world-spanning journey. Based on the book series by Robert Jordan."
4,"In a seafaring world, a young pirate captain sets out with his crew to attain the title of Pirate King, and to discover the mythical treasure known as 'One Piece.'"
5,"Nine noble families fight for control over the lands of Westeros, while an ancient enemy returns after being dormant for a millennia."
6,"During the French Revolution, vampire hunter prodigy Richter Belmont fights to uphold his family's legacy and prevent the rise of a ruthless, power-hungry vampire."
7,The year is 1717. Wealthy land-owner Stede Bonnet has a midlife crisis and decides to blow up his cushy life to become a pirate. It does not go well. Based on a true story.
8,"Follows the adventures of Monkey D. Luffy and his pirate crew in order to find the greatest treasure ever left by the legendary Pirate, Gold Roger. The famous mystery treasure named ""One Piece""."
9,A boy swallows a cursed talisman - the finger of a demon - and becomes cursed himself. He enters a shaman's school to be able to locate the demon's other body parts and thus exorcise himself.


The first step is to convert all words to lowercase, so that the same word written in different cases is not treated as separate tokens.

In [5]:
new_data['Synopsis'] = new_data['Synopsis'].apply(lambda x : " ".join([i.lower() for i in x.split(" ")]))
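
For reference, pandas also offers a vectorized string accessor that does the same job without a lambda. The sketch below uses a small made-up frame (`demo` and its two rows are illustrative, not part of the dataset) and takes an explicit `.copy()` so the assignment does not operate on a slice of another DataFrame.

```python
import pandas as pd

# Illustrative two-row frame standing in for the synopsis data (assumption).
demo = pd.DataFrame({"Synopsis": ["The Mercurial Villain LOKI", "After The Fall"]})

# Take an independent copy before modifying, then lowercase with the .str accessor.
demo_clean = demo[["Synopsis"]].copy()
demo_clean["Synopsis"] = demo_clean["Synopsis"].str.lower()

print(demo_clean["Synopsis"].tolist())
```

One practical difference: `.str.lower()` passes missing values through as `NaN`, whereas a lambda that calls `x.split(" ")` raises an error on them.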

In [6]:
new_data[['Synopsis']]

Unnamed: 0,Synopsis
0,"from the world of ""the boys"" comes ""gen v,"" which explores the first generation of superheroes to know that their super powers are from compound v. these heroes put their physical and moral boundaries to the test competing for the school's top ranking."
1,"after the fall of the galactic empire, former jedi knight ahsoka tano investigates an emerging threat to a vulnerable galaxy."
2,the mercurial villain loki resumes his role as the god of mischief in a new series that takes place after the events of “avengers: endgame.”
3,"set in a high fantasy world where magic exists, but only some can access it, a woman named moiraine crosses paths with five young men and women. this sparks a dangerous, world-spanning journey. based on the book series by robert jordan."
4,"in a seafaring world, a young pirate captain sets out with his crew to attain the title of pirate king, and to discover the mythical treasure known as 'one piece.'"
...,...
495,"liko, whose partner pokémon is sprigatito, and roy will encounter many characters during their journey, including a group called the rising volt tacklers."
496,"the further and darker adventures of batman with a new robin, a closer association with batgirl and the previous robin now as nightwing."
497,the submarine seaview is commissioned to investigate the mysteries of the seas. usually it finds more problems than answers...
498,"bee, an unemployed woman, is living a normal life until a grumpy companion named puppycat arrives. follow bee and and puppycat as they travel between reality and ""fishbowl space."""


Lovely. Next is the removal of special characters such as `"`, `@`, `$` and `#`, among many others.

In [7]:
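# A regular expression is used to do this
import re            #<--- regex

### Removing every character that is not an ASCII letter or whitespace.
### Note: this also drops digits (e.g. "1717") and accented letters (e.g. the "é" in "pokémon").
new_data['Synopsis'] = new_data['Synopsis'].apply(lambda x : re.sub(r'[^a-zA-Z\s]', '', x))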
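
The pattern `[^a-zA-Z\s]` keeps only ASCII letters and whitespace, so it removes more than symbols. The short sketch below shows the effect on digits and accented letters, and one possible alternative (an assumption, not what this notebook does) that folds accents to ASCII with `unicodedata` before filtering. The helper name `strip_non_letters` is hypothetical.

```python
import re
import unicodedata

def strip_non_letters(text: str, fold_accents: bool = False) -> str:
    """Remove everything except ASCII letters and whitespace."""
    if fold_accents:
        # NFKD splits an accented letter into base letter + combining mark;
        # the regex below then removes the combining mark and keeps the base letter.
        text = unicodedata.normalize("NFKD", text)
    return re.sub(r"[^a-zA-Z\s]", "", text)

# Digits and the full stop disappear, leaving a trailing space.
print(repr(strip_non_letters("the year is 1717.")))

# The accented letter is dropped entirely by default...
print(strip_non_letters("pok\u00e9mon"))

# ...but survives as a plain "e" when accents are folded first.
print(strip_non_letters("pok\u00e9mon", fold_accents=True))
```

Whether digits and accented letters should be kept depends on the downstream topic model; for synopses, losing years and a handful of names may be acceptable.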

In [8]:
new_data[['Synopsis']].head(15)

Unnamed: 0,Synopsis
0,from the world of the boys comes gen v which explores the first generation of superheroes to know that their super powers are from compound v these heroes put their physical and moral boundaries to the test competing for the schools top ranking
1,after the fall of the galactic empire former jedi knight ahsoka tano investigates an emerging threat to a vulnerable galaxy
2,the mercurial villain loki resumes his role as the god of mischief in a new series that takes place after the events of avengers endgame
3,set in a high fantasy world where magic exists but only some can access it a woman named moiraine crosses paths with five young men and women this sparks a dangerous worldspanning journey based on the book series by robert jordan
4,in a seafaring world a young pirate captain sets out with his crew to attain the title of pirate king and to discover the mythical treasure known as one piece
5,nine noble families fight for control over the lands of westeros while an ancient enemy returns after being dormant for a millennia
6,during the french revolution vampire hunter prodigy richter belmont fights to uphold his familys legacy and prevent the rise of a ruthless powerhungry vampire
7,the year is wealthy landowner stede bonnet has a midlife crisis and decides to blow up his cushy life to become a pirate it does not go well based on a true story
8,follows the adventures of monkey d luffy and his pirate crew in order to find the greatest treasure ever left by the legendary pirate gold roger the famous mystery treasure named one piece
9,a boy swallows a cursed talisman the finger of a demon and becomes cursed himself he enters a shamans school to be able to locate the demons other body parts and thus exorcise himself


Looks like that was successful!

---

No punctuation is visible in the display above, since the regex step already strips it, but an explicit removal step is added just in case, as punctuation marks do not carry meaning for this task.

In [9]:
## Removing punctuation from the text in the Synopsis column
import string
new_data['Synopsis'] = new_data['Synopsis'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
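
A quick check shows why this step is not expected to change anything after the regex cleaning: every character in `string.punctuation` is already matched by the pattern `[^a-zA-Z\s]` used earlier. The sketch below is only a sanity check; the variable names are illustrative.

```python
import re
import string

pattern = re.compile(r"[^a-zA-Z\s]")

# Punctuation characters that the regex would NOT remove (expected to be none).
survivors = [ch for ch in string.punctuation if pattern.fullmatch(ch) is None]
print(survivors)

# Curly quotes are not part of string.punctuation, but the regex removes them too.
print("\u201c" in string.punctuation)
print(pattern.sub("", "\u201cendgame\u201d"))
```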

In [10]:
new_data[['Synopsis']].head(15)

Unnamed: 0,Synopsis
0,from the world of the boys comes gen v which explores the first generation of superheroes to know that their super powers are from compound v these heroes put their physical and moral boundaries to the test competing for the schools top ranking
1,after the fall of the galactic empire former jedi knight ahsoka tano investigates an emerging threat to a vulnerable galaxy
2,the mercurial villain loki resumes his role as the god of mischief in a new series that takes place after the events of avengers endgame
3,set in a high fantasy world where magic exists but only some can access it a woman named moiraine crosses paths with five young men and women this sparks a dangerous worldspanning journey based on the book series by robert jordan
4,in a seafaring world a young pirate captain sets out with his crew to attain the title of pirate king and to discover the mythical treasure known as one piece
5,nine noble families fight for control over the lands of westeros while an ancient enemy returns after being dormant for a millennia
6,during the french revolution vampire hunter prodigy richter belmont fights to uphold his familys legacy and prevent the rise of a ruthless powerhungry vampire
7,the year is wealthy landowner stede bonnet has a midlife crisis and decides to blow up his cushy life to become a pirate it does not go well based on a true story
8,follows the adventures of monkey d luffy and his pirate crew in order to find the greatest treasure ever left by the legendary pirate gold roger the famous mystery treasure named one piece
9,a boy swallows a cursed talisman the finger of a demon and becomes cursed himself he enters a shamans school to be able to locate the demons other body parts and thus exorcise himself


No visible change, but it's better to be safe than sorry.

---

**Stopwords:** Stopwords are words like `the`, `that`, `are`, etc., which make no active contribution to the meaning we're trying to extract from this data.

*Below is how these words can be removed*

In [11]:
### Importing the needed library for this operation
### (the stopword corpus must be available locally, e.g. via nltk.download('stopwords'))
from nltk.corpus import stopwords

## Adding a few stopwords to the already defined words
personalized_stopwords = stopwords.words('english')

personalized_stopwords.extend(['from', 'subject', 're', 'edu', 'use', 'not', 'would', 'say', 'could', 'may', 'take',
                             '_', 'be', 'know', 'good', 'go', 'get', 'do', 'done', 'try', 'many', 'some', 'come',
                             'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot', 'lack', 'make',
                             'want', 'seem', 'run', 'need', 'even', 'right', 'line', 'also', 'the'])

# Now, actually removing the stopwords
new_data['Synopsis'] = new_data['Synopsis'].apply(lambda x: " ".join([i for i in x.split(" ") if i not in personalized_stopwords]))
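
When a stopword list is extended by hand, one pitfall is worth guarding against: Python joins adjacent string literals, so a missing comma silently merges two words into one entry that matches nothing. The sketch below uses a short made-up list (`extra_words` is illustrative only) to show the effect, and also converts the list to a `set`, which keeps the `not in` check fast for large vocabularies.

```python
# A comma is missing between 'come' and 'nice', so Python concatenates them.
extra_words = ['many', 'some', 'come' 'nice', 'thank']
print(extra_words)

# With the comma in place, both words are separate entries.
fixed_words = ['many', 'some', 'come', 'nice', 'thank']

# A set gives constant-time membership tests during filtering.
stopword_set = set(fixed_words)

def remove_stopwords(text: str) -> str:
    """Drop every space-separated token that appears in stopword_set."""
    return " ".join(tok for tok in text.split(" ") if tok not in stopword_set)

print(remove_stopwords("many heroes come home"))
```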

In [12]:
new_data[['Synopsis']]

Unnamed: 0,Synopsis
0,world boys comes gen v explores first generation superheroes super powers compound v heroes put physical moral boundaries test competing schools top ranking
1,fall galactic empire former jedi knight ahsoka tano investigates emerging threat vulnerable galaxy
2,mercurial villain loki resumes role god mischief new series takes place events avengers endgame
3,set high fantasy world magic exists access woman named moiraine crosses paths five young men women sparks dangerous worldspanning journey based book series robert jordan
4,seafaring world young pirate captain sets crew attain title pirate king discover mythical treasure known one piece
...,...
495,liko whose partner pokmon sprigatito roy encounter characters journey including group called rising volt tacklers
496,darker adventures batman new robin closer association batgirl previous robin nightwing
497,submarine seaview commissioned investigate mysteries seas usually finds problems answers
498,bee unemployed woman living normal life grumpy companion named puppycat arrives follow bee puppycat travel reality fishbowl space


From observation, there are some one-letter words in here which don't carry much meaning, so it is wise to drop them.

In [13]:
## Lambda function to handle this task
new_data['Synopsis'] = new_data['Synopsis'].apply(lambda x: " ".join([i for i in x.split(" ") if len(i) > 1]))

## viewing the result
new_data[['Synopsis']].head(17)

Unnamed: 0,Synopsis
0,world boys comes gen explores first generation superheroes super powers compound heroes put physical moral boundaries test competing schools top ranking
1,fall galactic empire former jedi knight ahsoka tano investigates emerging threat vulnerable galaxy
2,mercurial villain loki resumes role god mischief new series takes place events avengers endgame
3,set high fantasy world magic exists access woman named moiraine crosses paths five young men women sparks dangerous worldspanning journey based book series robert jordan
4,seafaring world young pirate captain sets crew attain title pirate king discover mythical treasure known one piece
5,nine noble families fight control lands westeros ancient enemy returns dormant millennia
6,french revolution vampire hunter prodigy richter belmont fights uphold familys legacy prevent rise ruthless powerhungry vampire
7,year wealthy landowner stede bonnet midlife crisis decides blow cushy life become pirate well based true story
8,follows adventures monkey luffy pirate crew order find greatest treasure ever left legendary pirate gold roger famous mystery treasure named one piece
9,boy swallows cursed talisman finger demon becomes cursed enters shamans school able locate demons body parts thus exorcise


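The data is looking good, just one step left.

**Lemmatization:** Lemmatization reduces words to their dictionary base form (lemma), stripping inflections such as plurals and tenses, e.g. `boys` reduced to `boy`. With the right part-of-speech tag, `running` is reduced to `run` and `better` to `good`. Called without a tag, as in the cell that follows, `WordNetLemmatizer` treats every word as a noun, which is why verb forms such as `explores` and `investigates` remain unchanged in the output.

*Below is how these words are lemmatized using `nltk`'s `WordNetLemmatizer()`*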

In [14]:
### Importing the library
from nltk.stem import WordNetLemmatizer

## Instantiating the Lemmatizer
lemmatizer = WordNetLemmatizer()

# Performing lemmatization on the data
new_data['Synopsis'] = new_data['Synopsis'].apply(lambda x: " ".join([lemmatizer.lemmatize(i) for i in x.split(" ")]))

In [15]:
# Inspecting
new_data[['Synopsis']].head(17)

Unnamed: 0,Synopsis
0,world boy come gen explores first generation superheroes super power compound hero put physical moral boundary test competing school top ranking
1,fall galactic empire former jedi knight ahsoka tano investigates emerging threat vulnerable galaxy
2,mercurial villain loki resume role god mischief new series take place event avenger endgame
3,set high fantasy world magic exists access woman named moiraine cross path five young men woman spark dangerous worldspanning journey based book series robert jordan
4,seafaring world young pirate captain set crew attain title pirate king discover mythical treasure known one piece
5,nine noble family fight control land westeros ancient enemy return dormant millennium
6,french revolution vampire hunter prodigy richter belmont fight uphold family legacy prevent rise ruthless powerhungry vampire
7,year wealthy landowner stede bonnet midlife crisis decides blow cushy life become pirate well based true story
8,follows adventure monkey luffy pirate crew order find greatest treasure ever left legendary pirate gold roger famous mystery treasure named one piece
9,boy swallow cursed talisman finger demon becomes cursed enters shaman school able locate demon body part thus exorcise


That's a wrap on cleaning this textual data 😉
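
The cleaning steps above can also be gathered into a single reusable function. The sketch below is one possible arrangement, not the notebook's actual code: `clean_synopsis` and `demo_stopwords` are hypothetical names, the stopword set is a tiny stand-in for the NLTK list, and the lemmatizer is passed in as an optional callable so the function runs without NLTK. It uses `split()` rather than `split(" ")`, which also collapses repeated whitespace.

```python
import re
from typing import Callable, Iterable, Optional

def clean_synopsis(
    text: str,
    stopwords: Iterable[str],
    lemmatize: Optional[Callable[[str], str]] = None,
) -> str:
    """Lowercase, strip non-letters, drop stopwords and one-letter tokens."""
    stop = set(stopwords)
    text = text.lower()
    # Keep only ASCII letters and whitespace (digits and punctuation are removed).
    text = re.sub(r"[^a-z\s]", "", text)
    tokens = [tok for tok in text.split() if tok not in stop and len(tok) > 1]
    if lemmatize is not None:
        tokens = [lemmatize(tok) for tok in tokens]
    return " ".join(tokens)

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's list.
demo_stopwords = {"the", "of", "an", "to", "a", "after"}

example = (
    "After the fall of the Galactic Empire, former Jedi Knight Ahsoka Tano "
    "investigates an emerging threat to a vulnerable galaxy."
)
cleaned = clean_synopsis(example, demo_stopwords)
print(cleaned)
```

With a DataFrame, the function could be applied as `new_data['Synopsis'].apply(lambda s: clean_synopsis(s, personalized_stopwords, lemmatizer.lemmatize))`, assuming those names are defined as in the earlier cells.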