## Applied Data Analysis Project
Team: ToeStewBrr - Alexander Sternfeld, Marguerite Thery, Antoine Bonnet, Hugo Bordereaux

Dataset: CMU Movie Summary Corpus

### 1. Preprocessing data

In [1]:
import requests
import tarfile
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Temporary configuration: until the archive can be extracted directly from its URL, please download MovieSummaries.tar.gz manually and place it in the Data directory of this project.

We first extract all files from the MovieSummaries tar file.

In [3]:
if not os.path.exists('Data/MovieSummaries'):
    # The archive is kept local: voluminous data can't be uploaded to Github
    with tarfile.open('Data/MovieSummaries.tar.gz') as my_tar:
        my_tar.extractall('./Data') # specify which folder to extract to
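
The manual download step could probably be avoided by streaming the archive from its URL into tarfile. The following is a minimal sketch, not a tested replacement for the cell above: extract_tar_stream and download_and_extract are hypothetical helper names, the URL is left as a parameter, and the standard-library urllib.request is used here in place of requests.

```python
import tarfile
import urllib.request


def extract_tar_stream(fileobj, dest):
    """Extract a gzipped tar archive from a binary stream into dest."""
    # Mode 'r|gz' reads the archive sequentially, so the stream
    # does not need to support seek() (an HTTP response does not)
    with tarfile.open(fileobj=fileobj, mode='r|gz') as tar:
        tar.extractall(dest)


def download_and_extract(url, dest):
    """Hypothetical helper: stream the archive at url straight into dest."""
    with urllib.request.urlopen(url) as response:
        extract_tar_stream(response, dest)
```

A call would look like download_and_extract(url, './Data'), with url set to wherever the archive is hosted. Nothing is written to disk except the extracted files.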

We then explore the structure of each of the data files.

### 1. plot_summaries.txt [29 M]

Plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia.  Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the summary.

In [4]:
plot_path = 'Data/MovieSummaries/plot_summaries.txt'
plot_cols = ['Wikipedia ID', 'Summary']
plot_df = pd.read_csv(plot_path, sep='\t', header=None, names=plot_cols, index_col=0)
plot_df


Wikipedia ID,Summary
23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
31186339,The nation of Panem consists of a wealthy Capi...
20663735,Poovalli Induchoodan is sentenced for six yea...
2231378,"The Lemon Drop Kid , a New York City swindler,..."
595909,Seventh-day Adventist Church pastor Michael Ch...
...,...
34808485,"The story is about Reema , a young Muslim scho..."
1096473,"In 1928 Hollywood, director Leo Andreyev look..."
35102018,American Luthier focuses on Randy Parsons’ tra...
8628195,"Abdur Rehman Khan , a middle-aged dry fruit se..."


### 2. movie.metadata.tsv.gz [3.4 M]


Metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase.  Tab-separated; columns:

1. Wikipedia movie ID
2. Freebase movie ID
3. Movie name
4. Movie release date
5. Movie box office revenue
6. Movie runtime
7. Movie languages (Freebase ID:name tuples)
8. Movie countries (Freebase ID:name tuples)
9. Movie genres (Freebase ID:name tuples)


In [None]:
movie_path = 'Data/MovieSummaries/movie.metadata.tsv'
movie_cols = ['Wikipedia ID', 'Freebase ID', 'Name', 'Release date',
              'Box office revenue', 'Runtime', 'Languages', 'Countries', 'Genres']
movie_df = pd.read_csv(movie_path, sep='\t', header=None, names=movie_cols, index_col=0)
movie_df
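
The Languages, Countries and Genres columns hold Freebase ID:name tuples, so they need parsing before any analysis by name. The sketch below assumes each cell is a JSON-encoded mapping such as '{"/m/02h40lc": "English Language"}'; that format, the helper parse_freebase_names, and the toy rows are illustrative assumptions to check against the real file.

```python
import json

import pandas as pd


def parse_freebase_names(cell):
    """Return the names from a JSON-encoded {Freebase ID: name} mapping."""
    # Missing cells are treated as "no values"
    if pd.isna(cell):
        return []
    return list(json.loads(cell).values())


# Toy stand-in for movie_df; the values are made up, not taken from the corpus
toy_df = pd.DataFrame({
    'Name': ['Movie A', 'Movie B'],
    'Genres': ['{"/m/01": "Drama", "/m/02": "Comedy"}', '{}'],
})
toy_df['Genre names'] = toy_df['Genres'].apply(parse_freebase_names)
```

If the assumption holds, the same function could be applied to movie_df['Languages'] and movie_df['Countries'].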



### 3. character.metadata.tsv.gz [14 M]

Metadata for 450,669 characters aligned to the movies above, extracted from the November 4, 2012 dump of Freebase.  Tab-separated; columns:

1. Wikipedia movie ID
2. Freebase movie ID
3. Movie release date
4. Character name
5. Actor date of birth
6. Actor gender
7. Actor height (in meters)
8. Actor ethnicity (Freebase ID)
9. Actor name
10. Actor age at movie release
11. Freebase character/actor map ID
12. Freebase character ID
13. Freebase actor ID


In [None]:
char_path = 'Data/MovieSummaries/character.metadata.tsv'
char_cols = ['Wikipedia ID', 'Freebase ID', 'Release date', 'Character name', 'Date of birth',
             'Gender', 'Height', 'Ethnicity', 'Actor name', 'Actor age at release',
             'Freebase character/map ID', 'Freebase character ID', 'Freebase actor ID']
char_df = pd.read_csv(char_path, sep='\t', header=None, names=char_cols, index_col=0)
char_df
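
Since both tables are indexed by Wikipedia ID, characters can be aligned to their movies with an index join. This is a sketch on toy frames (toy_movies and toy_chars are made-up stand-ins for movie_df and char_df), not a result from the real data.

```python
import pandas as pd

# Toy stand-ins, indexed by Wikipedia ID like the frames loaded above
toy_movies = pd.DataFrame(
    {'Name': ['Movie A', 'Movie B']},
    index=pd.Index([101, 102], name='Wikipedia ID'),
)
toy_chars = pd.DataFrame(
    {'Character name': ['Alice', 'Bob', 'Carol'],
     'Actor age at release': [30.0, 45.0, 27.0]},
    index=pd.Index([101, 101, 103], name='Wikipedia ID'),
)

# Left join on the shared index keeps every character row;
# characters whose movie is absent from the metadata get NaN in 'Name'
chars_with_movies = toy_chars.join(toy_movies, how='left')
```

On the real frames, char_df and movie_df share the column names 'Freebase ID' and 'Release date', so the join would need lsuffix/rsuffix arguments or a column selection on one side.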

### 4. corenlp_plot_summaries.tar.gz [628 M, separate download]

The plot summaries from above, run through the Stanford CoreNLP pipeline (tagging, parsing, NER and coref). Each filename begins with the Wikipedia movie ID (which indexes into movie.metadata.tsv).


In [None]:
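import gzip
import shutil
import xml.etree.ElementTree as ET

# Local copy of the extracted corenlp_plot_summaries archive
directory = './corenlp_plot_summaries'

# Decompress every gzipped file in the directory into a plain XML file
# (assuming names like 3217.xml.gz -> 3217.xml), skipping files already extracted
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    if os.path.isfile(f) and filename.endswith('.gz'):
        xml_path = f[:-len('.gz')]
        if not os.path.exists(xml_path):
            with gzip.open(f, 'rb') as f_in, open(xml_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)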


In [None]:
# Use a file already extracted locally to run some tests
tree = ET.parse('3217.xml')
root = tree.getroot()

# Count dependency governors (use parse or basic-dependencies to have more info)
print(len(root.findall('.//governor')))
#print([g.text for g in root.findall('.//governor')])
# Print every NER tag longer than one character, which skips the 'O' (no entity) tag
for ner in root.findall('.//NER'):
    if ner.text and len(ner.text) > 1:
        print(ner.text)
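
Printing raw NER tags loses the words they belong to. The sketch below groups consecutive tokens that share a tag into entities. SAMPLE_XML is a hand-written snippet imitating the CoreNLP layout (sentences, then sentence, tokens, token with word and NER children), and extract_entities is a hypothetical helper; both are assumptions to verify against a real file such as 3217.xml.

```python
import xml.etree.ElementTree as ET

# Hand-written snippet imitating the CoreNLP XML layout (not corpus data)
SAMPLE_XML = """
<root><document><sentences><sentence id="1"><tokens>
  <token id="1"><word>Alice</word><NER>PERSON</NER></token>
  <token id="2"><word>Smith</word><NER>PERSON</NER></token>
  <token id="3"><word>lives</word><NER>O</NER></token>
  <token id="4"><word>in</word><NER>O</NER></token>
  <token id="5"><word>Paris</word><NER>LOCATION</NER></token>
</tokens></sentence></sentences></document></root>
"""


def extract_entities(root):
    """Group consecutive tokens sharing a non-'O' NER tag into (text, tag) pairs."""
    entities = []
    for sentence in root.findall('.//sentences/sentence'):
        words, current_tag = [], None
        for token in sentence.iter('token'):
            word = token.findtext('word')
            tag = token.findtext('NER', default='O')  # missing tag counts as 'O'
            if tag == current_tag and tag != 'O':
                words.append(word)  # same entity continues
                continue
            if words:
                entities.append((' '.join(words), current_tag))
            # Start a new entity, or reset when the token is not an entity
            words, current_tag = ([word], tag) if tag != 'O' else ([], None)
        if words:
            entities.append((' '.join(words), current_tag))
    return entities


root = ET.fromstring(SAMPLE_XML)
entities = extract_entities(root)
```

Note that adjacent tokens with the same tag are merged even when they are distinct entities, which is a simplification of this sketch.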
