# CMU Movie Summary Corpus README

This README describes the CMU Movie Summary Corpus, a collection of 42,306 movie plot summaries with metadata at both the movie and character levels. Movie-level metadata include box office revenue, genre, and release date; character-level metadata include gender and estimated age.

The dataset supports the work presented in the following paper:

- David Bamman, Brendan O'Connor, and Noah Smith, "Learning Latent Personas of Film Characters," presented at the Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 2013.

## License

The dataset is made available under the Creative Commons Attribution-ShareAlike License. Send questions or feedback to David Bamman at dbamman@cs.cmu.edu.

## Data Details

### 1. **Plot Summaries (29M)**
   - `plot_summaries.txt.gz`
   - Includes plot summaries of 42,306 movies extracted from the English Wikipedia dump dated November 2, 2012.
   - Each line contains the Wikipedia movie ID and the respective summary.
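The line format described above can be read directly from the compressed file. The following is a minimal sketch, assuming the ID and summary are separated by a single tab and the file is UTF-8 encoded; neither detail is stated here, and the helper names `parse_summary_line` and `iter_summaries` are hypothetical.

```python
import gzip
from typing import Iterator, Tuple


def parse_summary_line(line: str) -> Tuple[int, str]:
    """Split one line into (Wikipedia movie ID, plot summary).

    Assumes a tab separates the ID from the summary text.
    """
    movie_id, summary = line.rstrip("\n").split("\t", 1)
    return int(movie_id), summary


def iter_summaries(path: str) -> Iterator[Tuple[int, str]]:
    """Yield (ID, summary) pairs from a gzip-compressed summaries file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines, if any
                yield parse_summary_line(line)


# Example on an in-memory line rather than the real file.
example = parse_summary_line("23890098\tShlykov, a hard-working taxi driver\n")
```

Splitting on the first tab only keeps any later tab characters inside the summary text intact.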

### 2. **CoreNLP Plot Summaries (628M)**
   - `corenlp_plot_summaries.tar.gz`
   - The above plot summaries processed with the Stanford CoreNLP pipeline for tagging, parsing, NER, and coref.
   - Each filename begins with the Wikipedia movie ID.
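Because each filename begins with the Wikipedia movie ID, the ID can be recovered from the name alone. This sketch assumes only that the ID is a run of leading digits; the example path and its `.xml.gz` suffix are hypothetical, as is the helper name `movie_id_from_filename`.

```python
import re
from pathlib import Path
from typing import Optional

# Matches a run of digits at the very start of a filename.
_LEADING_ID = re.compile(r"^(\d+)")


def movie_id_from_filename(path: str) -> Optional[int]:
    """Return the leading numeric ID of a filename, or None if there is none."""
    match = _LEADING_ID.match(Path(path).name)
    return int(match.group(1)) if match else None


# Hypothetical path, used only for illustration.
example_id = movie_id_from_filename("corenlp_plot_summaries/975900.xml.gz")
```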

## Metadata

### 3. **Movie Metadata (3.4M)**
   - `movie.metadata.tsv.gz`
   - Metadata for 81,741 movies extracted from the Freebase dump dated November 4, 2012.
   - Columns include Wikipedia movie ID, Freebase movie ID, movie name, release date, box office revenue, runtime, languages, countries, and genres.

### 4. **Character Metadata (14M)**
   - `character.metadata.tsv.gz`
   - Metadata for 450,669 characters aligned to the movies above, also extracted from the Freebase dump dated November 4, 2012.
   - Columns include Wikipedia movie ID, Freebase movie ID, movie release date, character name, actor date of birth, actor gender, actor height, actor ethnicity (Freebase ID), actor name, actor age at movie release, Freebase character/actor map ID, Freebase character ID, and Freebase actor ID.
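Since the character records are aligned to the movies, the two tables can be joined on the Wikipedia movie ID. The sketch below uses two tiny stand-in frames; the column names are illustrative labels (the files themselves have no header row), and `attach_movie_names` is a hypothetical helper.

```python
import pandas as pd

# Tiny stand-in frames; column names are illustrative assumptions.
movies = pd.DataFrame({
    "Wikipedia_Movie_ID": [975900, 261236],
    "Movie_Name": ["Ghosts of Mars", "A Woman in Flames"],
})
characters = pd.DataFrame({
    "Wikipedia_Movie_ID": [975900, 975900],
    "Character_Name": ["Akooshay", "Desolation Williams"],
    "Actor_Name": ["Wanda De Jesus", "Ice Cube"],
})


def attach_movie_names(characters: pd.DataFrame, movies: pd.DataFrame) -> pd.DataFrame:
    """Left-join movie names onto character rows by Wikipedia movie ID."""
    return characters.merge(
        movies[["Wikipedia_Movie_ID", "Movie_Name"]],
        on="Wikipedia_Movie_ID",
        how="left",
        # Raises if a movie ID appears more than once in `movies`.
        validate="many_to_one",
    )


cast = attach_movie_names(characters, movies)
```

A left join keeps every character row even when its movie is missing from the movie table, and `validate="many_to_one"` makes the uniqueness assumption explicit instead of silently duplicating rows.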

## Test Data

- `tvtropes.clusters.txt`: 72 character types drawn from tvtropes.com, with 501 instances of those types. Each instance's ID matches a Freebase character/actor map ID in the character metadata.
- `name.clusters.txt`: 970 unique character names used in at least two different movies, with 2,666 instances of those names. Each instance's ID likewise matches a Freebase character/actor map ID in the character metadata.
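The IDs in both test files can be used to pull the matching rows from the character metadata. The layout of the cluster files is not described here, so this sketch starts from a list of map IDs that has already been extracted; the column names and the helper `select_by_map_id` are assumptions for illustration.

```python
import pandas as pd

# Stand-in for the character metadata, with illustrative column names.
characters = pd.DataFrame({
    "Character_Name": ["Akooshay", "Lieutenant Melanie Ballard", "Desolation Williams"],
    "Actor_Name": ["Wanda De Jesus", "Natasha Henstridge", "Ice Cube"],
    "Freebase_Character_Actor_Map_ID": ["/m/0bgchxw", "/m/0jys3m", "/m/0jys3g"],
})


def select_by_map_id(characters: pd.DataFrame, map_ids) -> pd.DataFrame:
    """Return the character rows whose character/actor map ID is in map_ids."""
    mask = characters["Freebase_Character_Actor_Map_ID"].isin(list(map_ids))
    return characters[mask]


# IDs as they might be collected from a cluster file.
matched = select_by_map_id(characters, ["/m/0jys3m", "/m/0jys3g"])
```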

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load plot summaries (tab-separated, no header row)
plot_summaries_columns = ['Wikipedia_Movie_ID', 'Plot_Summary']
plot_summaries = pd.read_csv(
    'MovieSummaries/plot_summaries.txt',
    sep='\t',
    names=plot_summaries_columns,
)

plot_summaries.head()
```

|   | Wikipedia_Movie_ID | Plot_Summary |
| --- | --- | --- |
| 0 | 23890098 | Shlykov, a hard-working taxi driver and Lyosha... |
| 1 | 31186339 | The nation of Panem consists of a wealthy Capi... |
| 2 | 20663735 | Poovalli Induchoodan is sentenced for six yea... |
| 3 | 2231378 | The Lemon Drop Kid , a New York City swindler,... |
| 4 | 595909 | Seventh-day Adventist Church pastor Michael Ch... |


```python
# Load movie metadata (tab-separated, no header row)
movie_metadata_columns = [
    'Wikipedia_Movie_ID', 'Freebase_Movie_ID', 'Movie_Name',
    'Movie_Release_Date', 'Movie_Box_Office_Revenue',
    'Movie_Runtime', 'Movie_Languages',
    'Movie_Countries', 'Movie_Genres',
]
movie_metadata = pd.read_csv(
    'MovieSummaries/movie.metadata.tsv',
    sep='\t',
    names=movie_metadata_columns,
)

movie_metadata.head()
```

|   | Wikipedia_Movie_ID | Freebase_Movie_ID | Movie_Name | Movie_Release_Date | Movie_Box_Office_Revenue | Movie_Runtime | Movie_Languages | Movie_Countries | Movie_Genres |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 975900 | /m/03vyhn | Ghosts of Mars | 2001-08-24 | 14010832.0 | 98.0 | {"/m/02h40lc": "English Language"} | {"/m/09c7w0": "United States of America"} | {"/m/01jfsb": "Thriller", "/m/06n90": "Science... |
| 1 | 3196793 | /m/08yl5d | Getting Away with Murder: The JonBenét Ramsey ... | 2000-02-16 |  | 95.0 | {"/m/02h40lc": "English Language"} | {"/m/09c7w0": "United States of America"} | {"/m/02n4kr": "Mystery", "/m/03bxz7": "Biograp... |
| 2 | 28463795 | /m/0crgdbh | Brun bitter | 1988 |  | 83.0 | {"/m/05f_3": "Norwegian Language"} | {"/m/05b4w": "Norway"} | {"/m/0lsxr": "Crime Fiction", "/m/07s9rl0": "D... |
| 3 | 9363483 | /m/0285_cd | White Of The Eye | 1987 |  | 110.0 | {"/m/02h40lc": "English Language"} | {"/m/07ssc": "United Kingdom"} | {"/m/01jfsb": "Thriller", "/m/0glj9q": "Erotic... |
| 4 | 261236 | /m/01mrr1 | A Woman in Flames | 1983 |  | 106.0 | {"/m/04306rv": "German Language"} | {"/m/0345h": "Germany"} | {"/m/07s9rl0": "Drama"} |
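In the preview above, the language, country, and genre cells appear to hold JSON objects that map Freebase IDs to names. Under that assumption, the names can be pulled out with the standard `json` module. The helper `freebase_names` is hypothetical, and the second sample row is made up to show how a missing value is handled.

```python
import json

import pandas as pd


def freebase_names(cell) -> list:
    """Return the names from one 'Freebase ID -> name' JSON cell.

    Missing cells yield an empty list.
    """
    if pd.isna(cell):
        return []
    return list(json.loads(cell).values())


sample = pd.DataFrame({
    "Movie_Name": ["A Woman in Flames", "Hypothetical untitled film"],
    # First value copied from the preview; second is an invented missing value.
    "Movie_Genres": ['{"/m/07s9rl0": "Drama"}', None],
})
sample["Genre_Names"] = sample["Movie_Genres"].apply(freebase_names)
```

Keeping the result as a list per row makes it straightforward to call `DataFrame.explode` later if one row per movie-genre pair is needed.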


```python
# Load character metadata (tab-separated, no header row)
character_metadata_columns = [
    'Wikipedia_Movie_ID', 'Freebase_Movie_ID', 'Movie_Release_Date',
    'Character_Name', 'Actor_Date_of_Birth', 'Actor_Gender',
    'Actor_Height', 'Actor_Ethnicity', 'Actor_Name',
    'Actor_Age_at_Movie_Release', 'Freebase_Character_Actor_Map_ID',
    'Freebase_Character_ID', 'Freebase_Actor_ID',
]
character_metadata = pd.read_csv(
    'MovieSummaries/character.metadata.tsv',
    sep='\t',
    names=character_metadata_columns,
)

character_metadata.head()
```

|   | Wikipedia_Movie_ID | Freebase_Movie_ID | Movie_Release_Date | Character_Name | Actor_Date_of_Birth | Actor_Gender | Actor_Height | Actor_Ethnicity | Actor_Name | Actor_Age_at_Movie_Release | Freebase_Character_Actor_Map_ID | Freebase_Character_ID | Freebase_Actor_ID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 975900 | /m/03vyhn | 2001-08-24 | Akooshay | 1958-08-26 | F | 1.62 |  | Wanda De Jesus | 42.0 | /m/0bgchxw | /m/0bgcj3x | /m/03wcfv7 |
| 1 | 975900 | /m/03vyhn | 2001-08-24 | Lieutenant Melanie Ballard | 1974-08-15 | F | 1.78 | /m/044038p | Natasha Henstridge | 27.0 | /m/0jys3m | /m/0bgchn4 | /m/0346l4 |
| 2 | 975900 | /m/03vyhn | 2001-08-24 | Desolation Williams | 1969-06-15 | M | 1.727 | /m/0x67 | Ice Cube | 32.0 | /m/0jys3g | /m/0bgchn_ | /m/01vw26l |
| 3 | 975900 | /m/03vyhn | 2001-08-24 | Sgt Jericho Butler | 1967-09-12 | M | 1.75 |  | Jason Statham | 33.0 | /m/02vchl6 | /m/0bgchnq | /m/034hyc |
| 4 | 975900 | /m/03vyhn | 2001-08-24 | Bashira Kincaid | 1977-09-25 | F | 1.65 |  | Clea DuVall | 23.0 | /m/02vbb3r | /m/0bgchp9 | /m/01y9xg |


---

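```python
# Keep the first four characters of the release date as the year;
# dates appear as either 'YYYY-MM-DD' or 'YYYY'. The result is still a string.
movie_metadata['Year'] = movie_metadata['Movie_Release_Date'].str[:4]
# movie_metadata['Year'] = pd.to_datetime(movie_metadata['Year'], errors='coerce')
```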
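The year extracted this way is a string. One possible follow-up, sketched here on a small stand-in frame, converts it to a nullable integer with `pd.to_numeric` and counts films per year. The helper `add_release_year` is hypothetical, and using `Int64` rather than a datetime is a design choice made for this sketch, not something the notebook prescribes.

```python
import pandas as pd


def add_release_year(movies: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `movies` with an integer 'Year' column.

    Unparseable or missing dates become <NA>.
    """
    out = movies.copy()
    year_text = out["Movie_Release_Date"].str[:4]
    out["Year"] = pd.to_numeric(year_text, errors="coerce").astype("Int64")
    return out


# Stand-in rows; the third has a missing release date for illustration.
sample = pd.DataFrame({
    "Movie_Name": ["Ghosts of Mars", "Brun bitter", "Hypothetical undated film"],
    "Movie_Release_Date": ["2001-08-24", "1988", None],
})
with_year = add_release_year(sample)

# Number of films per release year; rows without a year are excluded.
films_per_year = with_year["Year"].value_counts().sort_index()
```

An integer year sorts and groups naturally, and it sidesteps the need to pick a month and day for dates that only record the year.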