In [122]:
# import libraries 
import pandas as pd
import matplotlib.pyplot as plt
import os 
import math
import sys
from pathlib import Path
import seaborn as sns

# Add 'src' to the system path
sys.path.append(str(Path().resolve() / 'src'))
from src.data.process_data import *
from src.data.clean_data import *

IMPORTANT: these scripts and functions assume that the following files are present in the data/raw directory:
- From the CMU dataset:
    - movie.metadata.tsv
    - plot_summaries.txt
- From the TMDB dataset:
    - TMDB_movie_dataset_v11.csv

The data/processed folder must also exist before running the cells below.

Note: download the CMU dataset from https://www.cs.cmu.edu/~ark/personas/data/MovieSummaries.tar.gz
and the TMDB dataset (via the Download button) from https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies
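
The layout described above can be checked before running the cleaning script. The following is a minimal sketch, not part of the project's code: the helper name check_data_layout is hypothetical, and it assumes the notebook is run from the project root so that data/raw and data/processed resolve relative to the current directory.

```python
from pathlib import Path

# File names taken from the note above; the helper itself is hypothetical.
REQUIRED_RAW_FILES = [
    "movie.metadata.tsv",
    "plot_summaries.txt",
    "TMDB_movie_dataset_v11.csv",
]


def check_data_layout(root="."):
    """Create data/processed if needed and return the raw files that are missing."""
    root = Path(root)
    raw_dir = root / "data" / "raw"
    processed_dir = root / "data" / "processed"
    # The cleaning script expects this folder to exist already.
    processed_dir.mkdir(parents=True, exist_ok=True)
    return [name for name in REQUIRED_RAW_FILES if not (raw_dir / name).is_file()]


if __name__ == "__main__":
    missing = check_data_layout()
    if missing:
        print("Missing from data/raw:", ", ".join(missing))
    else:
        print("All raw files found.")
```

An empty list means all three raw files were found; otherwise the returned names are the ones still to be downloaded.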

In [123]:
# Create the clean data files from the raw files
%run src/data/clean_data.py

sep ,
headers []
original df shape (1127777, 24)
after status (1102507, 24)
after release date (760743, 24)
after release year (760743, 25)
after duplicates (746388, 25)
after numeric columns (746387, 25)
after string to list (746387, 25)
after select columns (746387, 13)
sep 	
headers ['wikipedia_movie_id', 'freebase_ID', 'title', 'release_year', 'revenue', 'runtime', 'languages', 'countries', 'genres']
original df shape (81740, 9)
after status (81740, 9)
after release date (81740, 9)
after release year (44006, 9)
after duplicates (43915, 9)
after numeric columns (43915, 9)
after string to list (43915, 9)
after select columns (43915, 5)


In [124]:
# Build a dataframe that combines the CMU movies and plot summaries with the TMDB movies,
# starting from the clean data files
df_combined = create_cmu_tmdb_dataset(
    'data/processed/movies.csv',
    'data/processed/plot_summaries.csv',
    'data/processed/TMDB_clean.csv',
    'inner',
)
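
The last argument above, 'inner', reads like a pandas join type. How create_cmu_tmdb_dataset actually uses it, and which columns it joins on, is not shown in this notebook. As a hedged illustration only, the sketch below shows what an inner join does compared with an outer join in pandas, using made-up toy tables and assumed join keys (title, release_year).

```python
import pandas as pd

# Toy stand-ins for the cleaned CMU and TMDB tables (made-up rows, assumed columns).
cmu = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "release_year": [1990, 1995, 2001],
    "summary": ["plot a", "plot b", "plot c"],
})
tmdb = pd.DataFrame({
    "title": ["Film B", "Film C", "Film D"],
    "release_year": [1995, 2001, 2010],
    "budget": [100, 200, 300],
})

# Inner join: keep only movies present in both tables.
inner = cmu.merge(tmdb, on=["title", "release_year"], how="inner")

# Outer join: keep every movie from either table, filling gaps with NaN.
outer = cmu.merge(tmdb, on=["title", "release_year"], how="outer")

print(len(inner), len(outer))
```

With these toy tables the inner join keeps the two shared films, while the outer join keeps all four distinct films.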

In [125]:
df_combined.info()

<class 'pandas.core.frame.DataFrame'>
Index: 430770 entries, 0 to 746386
Data columns (total 15 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   release_year          430770 non-null  int64 
 1   summary               364351 non-null  object
 2   release_date          430770 non-null  object
 3   budget                430770 non-null  int64 
 4   original_language     430770 non-null  object
 5   overview              364351 non-null  object
 6   genres                430770 non-null  object
 7   production_companies  430770 non-null  object
 8   production_countries  430770 non-null  object
 9   spoken_languages      430770 non-null  object
 10  keywords              430770 non-null  object
 11  title                 430767 non-null  object
 12  revenue               430770 non-null  int64 
 13  runtime               430770 non-null  int64 
 14  dvd_era               430770 non-null  object
dtypes: int64(4), object(11)

In [126]:
df_combined.head()

Unnamed: 0,release_year,summary,release_date,budget,original_language,overview,genres,production_companies,production_countries,spoken_languages,keywords,title,revenue,runtime,dvd_era
0,1987,"A series of murders of rich young women throughout Arizona bear distinctive signatures of a serial killer. Clues lead Detective Charles Mendoza to visit Paul White, a sound expert installing hi-fi systems in wealthy people's homes. His special talent is to make a noise which echoes through the air cavities in his head and shows him where the sound of the speakers should come from and echo in the room. He is married to Joan, whom, ten years earlier, he had seduced away from Mike DeSantos, her then current boyfriend. Joan is questioned by Mendoza, but does not believe his insinuations that her husband is somehow involved in the murders. Various flashbacks show Joan's previous relationship to Mike and later explain how it came to be that he abandoned her. The couple met Paul and befriended him. At Mike's suggestion, he and Mike go on a deer hunting trip together. Paul shoots a deer and brutally mutilates it, demonstrating his sick fascination with killing. This is partly intended to scare Mike off, which it does. Mike catches Joan and Paul after they've made love, and Paul declares that he will take Mike's place. ""I am the one,"" Paul says. Mike puts his gun at the back of Paul's head but decides not to kill and abandons Joan. By now Joan has run into Mike DeSantos working at a gas station in a neighboring town. Mike tells her he got out of prison after suffering a major head injury; he thinks life is looking up. He makes her promise not to tell Paul that she has seen him. Joan soon discovers Paul has committed adultery. By puncturing Paul's tires she provides him with an alibi for the most recent killing. He begs her forgiveness as the police turn their suspicions away from him. At home, Joan looks into a crawl space in the house, and discovers preserved body parts of Paul's victims wrapped in paper and plastic. Joan confronts Paul, and Paul tries to explain his motivations for killing. 
He believes he has been ""chosen"" and is expressing the nothingness of the universe, whose heart is female and destructive like a black hole. He is putting women ""out of their misery,"" but he loves Joan. Joan's distrust of Mike over the next night and day agitates him into a fury. First, he tries to imprison her and then kill her and his daughter. He heavily arms himself and paints his face to look like a samurai warrior or an Indian brave. Joan and the little girl escape in different directions and soon Joan has to elude Paul in the abandoned quarry. It turns out Mike has been staying there, armed with a machine gun, certain that he will meet Paul again. He rescues Joan and takes away Paul's gun, leading him to the edge of the quarry. Paul makes the sound he uses in the emptiness of living rooms and savors its echo from the quarry. While incessantly pontificating about his philosophies of life and death, Paul reveals a lighter with which he has lit the fuse of his explosive vest. Mike opens fire on him with a machine gun and Joan dives into the lake in the quarry. Paul and Mike both die instantly, in a hail of destruction. Joan is reunited later with her daughter. She talks with Detective Mendoza about what the ten years with Paul could have meant, whose destructive and nihilistic nature she never understood. Based on the 1983 novel Mrs White by Margaret Tracy .",1987-06-19,0,en,"In a wealthy and isolated desert community, a sound expert is targeted as the prime suspect of a series of brutal murders of local suburban housewives who were attacked and mutilated in their homes. As he desperately tries to prove his innocence, his wife starts to uncover startling truths...","['Horror', 'Thriller']","[""Mrs. White's Productions""]",['United Kingdom'],['English'],"['based on novel or book', 'gas station', 'psychopath', 'insanity', 'detective', 'arizona', 'slasher', 'series of murders', 'desert', 'quarry', 'giallo']",White Of The EyeWhite of the Eye,0,221,pre
1,1983,"Eva, an upper class housewife, becomes frustrated and leaves her arrogant husband. She is drawn to the idea of becoming a call girl. With the aid of a prostitute named Yvonne, Eva learns the basics and then they both set out looking for Johns together. She meets a charming man who she falls in love with and comes to his house late at night for a romantic tryst. He turns out to be a gigolo. Consequently, they move into his penthouse, large enough for both of them to offer their services separately. Then slowly Eva enters the world of sado-masochism. She finds being a dominatrix extremely satisfying, and begins to take pleasure in controlling others and causing them intense pain. She discovers this in a scene in which a man is hiding under a table. Eva can see that his hands are sticking out from under the table and are clearly visible. Coldly, and with intense inner satisfaction, Eva proceeds to crush the man's hands by slowly walking over them with her stiletto-heeled boots. Chris, the gigolo begins getting jealous of her, wants to know what's going on upstairs and why she's making so much money. She tells him it's from hurting men, and the more she hurts them, the more money she gets. This upsets him greatly. She also becomes jealous of his boyfriend/client, a man who's been coming to him for many years. The scenes in the upstairs room intensify. One day he sneaks up and observers her in a scene dominating a man tied to a chair. He has a look on his face like ""see, this is what you truly are,"" and the look on her face says proudly ""yes, this is what I truly am."" He tries to whisk her away from all of it, buying her furs, talking about marriage. She tells him that she's been dreaming about hitting him, and in the dream he likes it. The setting is all there for a romantic ending, and yet, he panics, he takes all their money and invests it in a restaurant that she doesn't want to be part of. 
She tries to walk out on him, and he gets angry, throws her against the wall, hits her, pours alcohol on her, and lights her on fire. But the last scene shows her unscathed, happy with her friend the sex worker/madame, and they're getting thrown out of a bar that Chris owns. The movie thus extols female independence from men, though in its ending stops short of endorsing a female-dominant/male-submissive romance.",1983-05-11,0,de,"Eva, an upper-class housewife, frustratedly leaves her arrogant husband and decides to enter the call girl business. She lets Yvonne, a prostitute, teach her the basics and both set out for prey together, until Eva starts an affair with Chris, who turns out to be a call boy, as well. Consequently, she moves into his penthouse, large enough for both to offer their services separately.",['Drama'],['Dieter Geissler Filmproduktion'],['Germany'],['German'],"['jealousy', 'eroticism', 'gigolo', 'longing', 'dominatrix', 'sadomasochism', 'conflict', 'divorce', 'bdsm']",A Woman in FlamesA Woman in Flames,0,212,pre
2,2002,"Every hundred years, the evil Morgana returns to claim Fingall's talisman from the wizard Merlin, with which she intends to destroy the world. For the last fourteen hundred years she has failed... now she intends to conquer all. Young Ben Clark moves with his parents to a new town, where he befriends his elderly magician neighbor, Milner . Ben has a natural talent for magic and wants to learn all that he can from this old man. Ben carries the same scar as the original staff-bearer 1,400 years before. Both Morgana and Milner, who is revealed to be Merlin, see this as a sign that this time, the battle between good and evil will be stronger and harder than ever. Ben must make his own choice between good and evil as he is drawn into a battle and must draw on his own spirit and magic to decide which path to follow and hence, the fate of the world as we know it.",2002-04-12,0,en,"Every hundred years, the evil sorceress Morgana returns to claim Fingall's talisman from Merlin, with which she intends to destroy the world. For the last fourteen hundred years she has failed... now she intends to conquer all. Young Ben Clark moves with his parents to a new town, where he befriends his elderly magician neighbor, Milner. Ben has a natural talent for magic and wants to learn all that he can from this old man. Ben carries the same scar as the original staff-bearer 1,400 years before. Both Morgana and Milner, who is revealed to be Merlin, see this as a sign that this time, the battle between good and evil will be stronger and harder than ever. Ben must make his own choice between good and evil as he is drawn into a battle and must draw on his own spirit and magic to decide which path to follow and hence, the fate of the world as we know it.","['Adventure', 'Family', 'Fantasy']","['Peakviewing Productions', 'Peakviewing Transatlantic Plc']",['United Kingdom'],"['French', 'English']",['morgana'],The Sorcerer's ApprenticeThe Sorcerer's Apprentice,0,172,during
3,1997,"Adam, a San Francisco-based artist who works as a cab driver on the side, is having a hard time committing to his girlfriend, Nina. She wants to take their relationship to the next level, but he hasn't really gotten over his ex-girlfriend, Kate, who left him for another woman and is reluctant to move forward with Nina because he's still hanging on to the idea that one day Kate will come back to him. Feeling neglected, Nina breaks up with Adam and starts seeing Kevin, a womanizing bartender who is also Adam's best friend. Meanwhile Rebecca, the new girl in town, gets a job in Kevin's bar and begins an affair with Anne, the woman Kate left Adam for. Tired of her infidelities, Kate breaks up with Anne and returns to Adam. However she soon realises that she's not in love with him anymore and breaks up with him for good. Rebecca soon tires of Anne and breaks off their affair. One day she meets Adam, who is finally attempting to move on from Kate once and for all, and they go out on a date. Nina and Kevin's fling turns into something deeper when she finds out that she's pregnant, an event that strains Adam's friendship with Kevin and threatens to ruin his budding relationship with Rebecca before it has a chance to begin.",1997-04-04,0,en,"Best friends Adam and Kevin have a lot in common. Probably too much! In fact, Kevin is carrying on a torrid affair with Adam's girlfriend, Nina, on the side! And if this triangle wasn't crowded enough, the arrival of a seductive newcomer in town, Rebecca, promptly adds a whole new set of twists to an already tangled mix! Here's a hilarious look at how modern relationships have never been so complicated ... or so much fun!","['Comedy', 'Romance']","['Bandeira Entertainment', 'Miramax']",[],['English'],[],Little cityLittle City,0,183,pre
4,1989,{{Plot|dateAct 1Act 2Act 3Act 4Act 5 Finally negotiations are made for Henry to be named king of both England and France. He has a brief romantic interlude with Catherine while the French and English royal delegations negotiate the Treaty of Troyes. The Greek chorus informs the audience that an English-French union lasted as long as Henry V lived and was only lost under his successor Henry VI.,1989-10-05,9000000,en,Gritty adaption of William Shakespeare's play about the English King's bloody conquest of France.,"['War', 'Drama', 'History']","['BBC Film', 'Renaissance Films', 'Samuel Goldwyn Company']",['United Kingdom'],['English'],"['france', 'kingdom', 'theater play', 'based on true story', 'based on play or musical', 'sword fight', 'honor', 'battlefield', 'historical', 'combat', 'medieval', 'king of england', 'warrior', '15th century']",Henry VHenry V,20337800,274,pre
