# CMU Movies Summary Corpus

- Authors: Zaynab, Lylia, Ali, Christian, Yassin

---

## Tasks

1. **Select Project & Initial Analyses**:
   1. Agree on a project proposal with team members.
   2. Perform initial analyses to verify feasibility of the proposed project, including any additional data.
   3. Acquaint yourself with the provided data, preprocess it, and perform descriptive statistics.

2. **Pipeline & Data Description**:
   1. Create a pipeline for data handling and preprocessing, documented in the notebook.
   2. Describe the relevant aspects of the data, including:
      1. Handling the size of the data.
      2. Understanding the data (formats, distributions, missing values, correlations, etc.).
      3. Considering data enrichment, filtering, and transformation according to project needs.
   3. Develop a plan for methods to be used, with essential mathematical details.
   4. Outline a plan for analysis and communication, discussing alternative approaches considered.

3. **GitHub Repository & Deliverables**:
   1. Create a public GitHub repository named `ada-2023-project-<team>` under the `epfl-ada` GitHub organization. ✅
   2. Ensure the repository contains:
      1. **README.md** file with:
         1. **Title**: Project title.
         2. **Abstract**: 150-word description of the project idea, goals, and motivation.
         3. **Research Questions**: List of research questions to address.
         4. **Proposed Additional Datasets**: Description of additional datasets, expected management, and feasibility analysis.
         5. **Methods**: Methods to be used in the project.
         6. **Proposed Timeline**: Timeline for the project.
         7. **Organization within the Team**: Internal milestones leading to Milestone P3.
         8. **Questions for TAs (optional)**: Any questions for the teaching assistants.
      2. **Code for Initial Analyses**: Structured code for initial analyses and data handling pipelines.
      3. **Notebook** presenting initial results, including:
         1. Main results and descriptive analysis.
         2. External scripts/modules for implementing core logic, to be called from the notebook.

---


## Table of Contents
- [1. Zaynab's part](#Zaynab's-part)
- [2. Lylia's part](#Lylia's-part)
- [3. Ali's part](#Ali's-part)
- [4. Christian's part](#Christian's-part)
- [5. Yassin's part](#Yassin's-part)

---

### Library imports

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

### Data import

In [2]:
DATA_PATH = './data/MovieSummaries/'

### Movie metadata

In [3]:
movie_columns = [
    'WikipediaMovieID', 'FreebaseMovieID', 'MovieName', 'ReleaseDate', 
    'BoxOfficeRevenue', 'Runtime', 'Languages', 'Countries', 'Genres'
]


movie_metadata = pd.read_csv(DATA_PATH+'movie.metadata.tsv', sep='\t', names=movie_columns)

movie_metadata

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,MovieName,ReleaseDate,BoxOfficeRevenue,Runtime,Languages,Countries,Genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"
...,...,...,...,...,...,...,...,...,...
81736,35228177,/m/0j7hxnt,Mermaids: The Body Found,2011-03-19,,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}"
81737,34980460,/m/0g4pl34,Knuckle,2011-01-21,,96.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0..."
81738,9971909,/m/02pygw1,Another Nice Mess,1972-09-22,,66.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}"
81739,913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ..."
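
The `Languages`, `Countries`, and `Genres` columns hold dictionary-like strings that map Freebase IDs to names, and `ReleaseDate` mixes full dates with bare years. The sketch below shows one way these values could be normalized. The helper names `parse_freebase_dict` and `release_year` are hypothetical, and the sample strings assume that the raw TSV stores ordinary JSON (that is, that the doubled quotes in the preview above come from the export, not from the file).

```python
import json

def parse_freebase_dict(raw):
    """Return the names from a '{"/m/id": "Name", ...}' string as a list."""
    if not isinstance(raw, str) or not raw.strip():
        return []  # missing value (e.g. NaN) -> empty list
    return list(json.loads(raw).values())

def release_year(raw):
    """Extract the leading 4-digit year from 'YYYY' or 'YYYY-MM-DD'; None if missing."""
    if not isinstance(raw, str) or len(raw) < 4 or not raw[:4].isdigit():
        return None
    return int(raw[:4])

# Sample values copied from the preview above (assumed raw form, single quotes).
genres = parse_freebase_dict('{"/m/06nbt": "Satire", "/m/01z4y": "Comedy"}')
countries = parse_freebase_dict('{"/m/05b4w": "Norway"}')

print(genres, countries, release_year("2001-08-24"), release_year("1988"))
# ['Satire', 'Comedy'] ['Norway'] 2001 1988
```

On the full table, the same helpers could be applied column-wise, for example `movie_metadata['Genres'].apply(parse_freebase_dict)` and `movie_metadata['ReleaseDate'].apply(release_year)`.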


### Character metadata

In [4]:
character_columns = [
    'WikipediaMovieID', 'FreebaseMovieID', 'ReleaseDate', 'CharacterName',
    'ActorDOB', 'ActorGender', 'ActorHeight', 'ActorEthnicity', 
    'ActorName', 'ActorAgeAtRelease', 'FreebaseCharacterActorMapID',
    'FreebaseCharacterID', 'FreebaseActorID'
]

character_metadata = pd.read_csv(DATA_PATH+'character.metadata.tsv', sep='\t', names=character_columns)


character_metadata

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,ReleaseDate,CharacterName,ActorDOB,ActorGender,ActorHeight,ActorEthnicity,ActorName,ActorAgeAtRelease,FreebaseCharacterActorMapID,FreebaseCharacterID,FreebaseActorID
0,975900,/m/03vyhn,2001-08-24,Akooshay,1958-08-26,F,1.620,,Wanda De Jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7
1,975900,/m/03vyhn,2001-08-24,Lieutenant Melanie Ballard,1974-08-15,F,1.780,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4
2,975900,/m/03vyhn,2001-08-24,Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l
3,975900,/m/03vyhn,2001-08-24,Sgt Jericho Butler,1967-09-12,M,1.750,,Jason Statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc
4,975900,/m/03vyhn,2001-08-24,Bashira Kincaid,1977-09-25,F,1.650,,Clea DuVall,23.0,/m/02vbb3r,/m/0bgchp9,/m/01y9xg
...,...,...,...,...,...,...,...,...,...,...,...,...,...
450664,913762,/m/03pcrp,1992-05-21,Elensh,1970-05,F,,,Dorothy Elias-Fahn,,/m/0kr406c,/m/0kr406h,/m/0b_vcv
450665,913762,/m/03pcrp,1992-05-21,Hibiki,1965-04-12,M,,,Jonathan Fahn,27.0,/m/0kr405_,/m/0kr4090,/m/0bx7_j
450666,28308153,/m/0cp05t9,1957,,1941-11-18,M,1.730,/m/02w7gg,David Hemmings,15.0,/m/0g8ngmc,,/m/022g44
450667,28308153,/m/0cp05t9,1957,,,,,,Roberta Paterson,,/m/0g8ngmj,,/m/0g8ngmm


### Plot summaries

In [5]:
plot_columns = ['WikipediaMovieID', 'PlotSummary']

plot_summaries = pd.read_csv(DATA_PATH+'plot_summaries.txt', sep='\t', names=plot_columns)

plot_summaries


Unnamed: 0,WikipediaMovieID,PlotSummary
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...
3,2231378,"The Lemon Drop Kid , a New York City swindler,..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...
...,...,...
42298,34808485,"The story is about Reema , a young Muslim scho..."
42299,1096473,"In 1928 Hollywood, director Leo Andreyev look..."
42300,35102018,American Luthier focuses on Randy Parsons’ tra...
42301,8628195,"Abdur Rehman Khan , a middle-aged dry fruit se..."


### Name clusters

In [6]:
# In name.clusters.txt the character name comes first,
# followed by the Freebase character/actor map ID.
name_clusters_columns = ['CharacterName', 'FreebaseCharacterActorMapID']

name_clusters = pd.read_csv(DATA_PATH+'name.clusters.txt', sep='\t', names=name_clusters_columns)

name_clusters


Unnamed: 0,CharacterName,FreebaseCharacterActorMapID
0,Stuart Little,/m/0k3w9c
1,Stuart Little,/m/0k3wcx
2,Stuart Little,/m/0k3wbn
3,John Doe,/m/0jyg35
4,John Doe,/m/0k2_zn
...,...,...
2661,John Rolfe,/m/0k5_ql
2662,John Rolfe,/m/02vd6vs
2663,Elizabeth Swann,/m/0k1xvz
2664,Elizabeth Swann,/m/0k1x_d
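
Each row pairs one character name with one Freebase character/actor map ID, so the same character spans several rows. The sketch below groups such pairs into one list of IDs per character, using only the first five rows shown above as sample data. The variable names are illustrative.

```python
from collections import defaultdict

# (character name, Freebase character/actor map ID) pairs from the preview above.
sample_rows = [
    ("Stuart Little", "/m/0k3w9c"),
    ("Stuart Little", "/m/0k3wcx"),
    ("Stuart Little", "/m/0k3wbn"),
    ("John Doe", "/m/0jyg35"),
    ("John Doe", "/m/0k2_zn"),
]

# Collect all map IDs that belong to the same character name.
clusters = defaultdict(list)
for name, map_id in sample_rows:
    clusters[name].append(map_id)

cluster_sizes = {name: len(ids) for name, ids in clusters.items()}
print(cluster_sizes)  # {'Stuart Little': 3, 'John Doe': 2}
```

With pandas, a comparable count per character could be obtained with `name_clusters.groupby(name_clusters.columns[0]).size()`, assuming the character name is the first column of the file.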


### TV tropes clusters

In [7]:
# In tvtropes.clusters.txt the first field is the trope (character type);
# the second is a JSON record describing the character and movie.
tvtropes_columns = ['CharacterType', 'CharacterInfo']

tvtropes_clusters = pd.read_csv(DATA_PATH+'tvtropes.clusters.txt', sep='\t', names=tvtropes_columns)

tvtropes_clusters


Unnamed: 0,CharacterType,CharacterInfo
0,absent_minded_professor,"{""char"": ""Professor Philip Brainard"", ""movie"":..."
1,absent_minded_professor,"{""char"": ""Professor Keenbean"", ""movie"": ""Richi..."
2,absent_minded_professor,"{""char"": ""Dr. Reinhardt Lane"", ""movie"": ""The S..."
3,absent_minded_professor,"{""char"": ""Dr. Harold Medford"", ""movie"": ""Them!..."
4,absent_minded_professor,"{""char"": ""Daniel Jackson"", ""movie"": ""Stargate""..."
...,...,...
496,young_gun,"{""char"": ""Morgan Earp"", ""movie"": ""Tombstone"", ..."
497,young_gun,"{""char"": ""Colorado Ryan"", ""movie"": ""Rio Bravo""..."
498,young_gun,"{""char"": ""Tom Sawyer"", ""movie"": ""The League of..."
499,young_gun,"{""char"": ""William H. 'Billy the Kid' Bonney"", ..."
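
The second field of each row is a JSON record, so it is easier to analyze once it is expanded into separate fields. A minimal sketch follows, assuming the raw file stores ordinary JSON. The helper `expand_trope_row` is hypothetical, and the sample records contain only the `char` and `movie` keys visible in the truncated preview above; the real records carry further keys that are not reproduced here.

```python
import json

def expand_trope_row(trope, raw_json):
    """Parse one JSON record and attach the trope label it belongs to."""
    record = json.loads(raw_json)
    record["trope"] = trope
    return record

# Shortened sample rows: only the keys visible in the preview are included.
sample_rows = [
    ("absent_minded_professor", '{"char": "Daniel Jackson", "movie": "Stargate"}'),
    ("young_gun", '{"char": "Morgan Earp", "movie": "Tombstone"}'),
]

records = [expand_trope_row(trope, raw) for trope, raw in sample_rows]
for record in records:
    print(record["trope"], "|", record["char"], "|", record["movie"])
```

On the loaded table, the same function could be mapped over `zip(tvtropes_clusters.iloc[:, 0], tvtropes_clusters.iloc[:, 1])`, and the resulting list of dictionaries passed to `pd.DataFrame` to obtain one column per key.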


---

## Zaynab's part

We will mainly use the following dataset:
+ World Important Dates

If needed, we could complement it with:
+ Global Terrorism
+ Press Freedom

### Let's see how it looks!

In [9]:
W_events = pd.read_csv('./data/zaynab/World Important Dates.csv', sep=',')

W_events

Unnamed: 0,Sl. No,Name of Incident,Date,Month,Year,Country,Type of Event,Place Name,Impact,Affected Population,Important Person/Group Responsible,Outcome
0,1,Indus Valley Civilization Flourishes,Unknown,Unknown,2600 BC,India,Civilization,Indus Valley,Development of one of the world's earliest urb...,Local inhabitants,Indus Valley people,Positive
1,2,Battle of the Ten Kings,Unknown,Unknown,1400 BC,India,Battle,Punjab,Rigvedic tribes consolidated their control ove...,Rigvedic tribes,Sudas,Positive
2,6,Establishment of the Delhi Sultanate,Unknown,Unknown,1206,India,Political,Delhi,Muslim rule established in parts of India,People of Delhi and surrounding regions,QutbUnknownudUnknowndin Aibak,Mixed
3,7,Battle of Panipat,21,April,1526,India,Battle,Panipat,Foundation of the Mughal Empire in India,Northern Indian kingdoms,Babur,Mixed
4,8,Establishment of British Raj,1,May,1858,India,Colonial,Whole India,Start of direct British governance in India,Indian subcontinent,British East India Company/Empire,Negative
...,...,...,...,...,...,...,...,...,...,...,...,...
1091,1147,First Mexican Empire Declared,28,September,1821,Mexico,Political,Mexico,Brief establishment of an empire soon transiti...,Mexicans,Agustín de Iturbide,Positive
1092,1148,U.S.UnknownMexican War,25,April,1846,Mexico,Military,Northern Mexico,Loss of vast territories to the United States,Mexicans,US,Negative
1093,1149,Reform Wars,Unknown,Unknown,1857,Mexico,Civil War,Mexico,Liberal vs. Conservative conflict leading to c...,Mexicans,Benito Juárez,Mixed
1094,1150,French Intervention in Mexico,Unknown,Unknown,1862,Mexico,Military Intervention,Mexico,Establishment and fall of the Second Mexican E...,Mexicans,Napoleon III,Negative
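
The `Year` column mixes values such as `2600 BC` with plain years such as `1206`, so it cannot be compared directly with movie release years. The sketch below converts both forms to a signed integer, with BC years negative. The helper name `parse_event_year` is hypothetical, and treating every unparseable value (for example `Unknown`) as missing is an assumption about how the data should be handled.

```python
import re

def parse_event_year(raw):
    """'2600 BC' -> -2600, '1206' -> 1206, anything else -> None."""
    match = re.fullmatch(r"\s*(\d+)\s*(BC)?\s*", str(raw))
    if match is None:
        return None  # e.g. 'Unknown' or a missing value
    year = int(match.group(1))
    return -year if match.group(2) else year

# Values taken from the Year column of the preview above.
for raw in ["2600 BC", "1400 BC", "1206", "1858"]:
    print(raw, "->", parse_event_year(raw))
```

Applied as `W_events['Year'].apply(parse_event_year)`, this would give a numeric column that can be filtered to the period covered by the movie release dates.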


In [21]:
W_events.value_counts('Country')

Country
India      74
USA        72
China      60
UK         60
Japan      49
           ..
Malawi      1
Morocco     1
Mosul       1
Niger       1
Malacca     1
Name: count, Length: 113, dtype: int64

---
## Lylia's part

---
## Ali's part

---
## Christian's part

---
## Yassin's part