Notebook containing initial analyses and data handling pipelines. We will grade the correctness, quality of code, and quality of textual descriptions.


## Play with DATA

[CMU Movie Summary Corpus](http://www.cs.cmu.edu/~ark/personas/)

`plot_summaries.txt` [29 M]

Plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia.  Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the summary.
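
Since each line holds a movie ID followed by its summary, the file can presumably be read as a two-column table. The sketch below assumes a tab separator between the two fields; the column names `movie_id` and `summary` and the helper `read_plot_summaries` are chosen here for illustration and are not part of the dataset.

```python
import csv
import io

import pandas as pd


def read_plot_summaries(source):
    """Read "<movie_id><TAB><summary>" lines into a DataFrame.

    `source` may be a file path or an open text buffer.
    """
    return pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["movie_id", "summary"],
        quoting=csv.QUOTE_NONE,  # summaries may contain quote characters
    )


# Tiny in-memory stand-in for the real file (made-up rows).
sample = io.StringIO('1\tA "quoted" plot.\n2\tAnother plot.\n')
plot_summaries = read_plot_summaries(sample)
print(plot_summaries)
```

On the real data the call would be something like `read_plot_summaries("./data/MovieSummaries/plot_summaries.txt")`, assuming the file sits next to the metadata files.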


`corenlp_plot_summaries.tar` [628 M, separate download]

The plot summaries from above, run through the Stanford CoreNLP pipeline (tagging, parsing, NER and coref). Each filename begins with the Wikipedia movie ID (which indexes into movie.metadata.tsv).


### TEST DATA
`tvtropes.clusters.txt`

72 character types drawn from tvtropes.com, along with 501 instances of those types.  The ID field indexes into the Freebase character/actor map ID in character.metadata.tsv.

`name.clusters.txt`


970 unique character names used in at least two different movies, along with 2,666 instances of those types.  The ID field indexes into the Freebase character/actor map ID in character.metadata.tsv.
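
Because the ID field points at the Freebase character/actor map ID, the clusters can be joined to the character metadata. This is only a sketch: it assumes each line of `name.clusters.txt` is a character name and a map ID separated by a tab, and it uses small made-up frames in place of the real files. The column name `cluster_name` is hypothetical.

```python
import io

import pandas as pd

# Assumed layout of name.clusters.txt: "<character name><TAB><map ID>" per line.
# The rows below are made up for illustration.
clusters_file = io.StringIO("Melanie\t/m/0jys3m\nMelanie\t/m/0xxxxx\n")
name_clusters = pd.read_csv(
    clusters_file,
    sep="\t",
    header=None,
    names=["cluster_name", "freebase_character_actor_map_id"],
)

# Stand-in for the character metadata loaded from character.metadata.tsv.
characters = pd.DataFrame(
    {
        "freebase_character_actor_map_id": ["/m/0jys3m", "/m/0jys3g"],
        "actor_name": ["Natasha Henstridge", "Ice Cube"],
    }
)

# A left join keeps every cluster row; unmatched IDs get NaN actor names.
labelled = name_clusters.merge(
    characters, on="freebase_character_actor_map_id", how="left"
)
print(labelled)
```

A left join is used so that cluster entries without a matching character row stay visible instead of silently disappearing.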


In [2]:
# If you have already downloaded the CoreNLP data, you can skip the download by
# placing it at data/corenlp_plot_summaries.tar
!./data/setup.sh

'.' is not recognized as an internal or external command,
operable program or batch file.


In [1]:
import pandas as pd
import numpy as np

In [2]:
# `movie.metadata.tsv` [3.4 M]

# Metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase.  Tab-separated; columns:

# 1. Wikipedia movie ID
# 2. Freebase movie ID
# 3. Movie name
# 4. Movie release date
# 5. Movie box office revenue
# 6. Movie runtime
# 7. Movie languages (Freebase ID:name tuples)
# 8. Movie countries (Freebase ID:name tuples)
# 9. Movie genres (Freebase ID:name tuples)

movie_metadata = pd.read_csv(
    "./data/MovieSummaries/movie.metadata.tsv",
    sep="\t",
    header=None,
    names=[
        "movie_id",
        "freebase_movie_id",
        "movie_name",
        "movie_release_date",
        "movie_box_office_revenue",
        "movie_runtime",
        "movie_languages",
        "movie_countries",
        "movie_genres",
    ],
    parse_dates=["movie_release_date"],
    date_parser=lambda x: pd.to_datetime(x, errors="coerce"),
)

movie_metadata.head()

Unnamed: 0,movie_id,freebase_movie_id,movie_name,movie_release_date,movie_box_office_revenue,movie_runtime,movie_languages,movie_countries,movie_genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
2,28463795,/m/0crgdbh,Brun bitter,1988-01-01,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
3,9363483,/m/0285_cd,White Of The Eye,1987-01-01,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
4,261236,/m/01mrr1,A Woman in Flames,1983-01-01,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"
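
The language, country and genre columns appear to hold JSON objects mapping Freebase IDs to names. A minimal sketch for pulling out just the names, assuming every non-missing cell is valid JSON as the preview suggests (the helper name `freebase_names` is made up here):

```python
import json


def freebase_names(cell):
    """Return the names from a JSON string of Freebase ID -> name pairs."""
    if not isinstance(cell, str):
        return []  # missing values (e.g. NaN) become an empty list
    return list(json.loads(cell).values())


# Value copied from the `movie_languages` column shown above.
print(freebase_names('{"/m/02h40lc": "English Language"}'))
print(freebase_names("{}"))
```

Applied to the loaded frame this would look like `movie_metadata["movie_genres"].apply(freebase_names)`.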


In [4]:
# movie_id is the Wikipedia page ID:
# https://en.wikipedia.org/?curid={movie_id}

# How to look up a Freebase ID (via the Wikidata query service):
# https://edstem.org/eu/courses/134/discussion/3845

# https://query.wikidata.org/#PREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0APREFIX%20wikibase%3A%20%3Chttp%3A%2F%2Fwikiba.se%2Fontology%23%3E%0A%0ASELECT%20%20%3Fs%20%3FsLabel%20%3Fp%20%20%3Fo%20%3FoLabel%20WHERE%20%7B%0A%20%3Fs%20wdt%3AP646%20%22%2Fm%2F0181lj%22%20%0A%0A%20%20%20SERVICE%20wikibase%3Alabel%20%7B%0A%20%20%20%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%0A%20%20%20%7D%0A%20%7D
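
The link above encodes a SPARQL query that matches the Wikidata property P646 (Freebase ID). The same lookup can be sketched in code; this version only builds the query and the request URL and does not contact the service. It assumes the public endpoint `https://query.wikidata.org/sparql` with its predefined `wdt:`, `wikibase:` and `bd:` prefixes, and it keeps only the `?s` and `?sLabel` variables from the linked query. The function names are made up for this example.

```python
from urllib.parse import urlencode

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def freebase_lookup_query(freebase_id):
    """Build a SPARQL query for the Wikidata item with this Freebase ID (P646)."""
    return (
        "SELECT ?s ?sLabel WHERE {\n"
        f'  ?s wdt:P646 "{freebase_id}" .\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}"
    )


def freebase_lookup_url(freebase_id):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {"query": freebase_lookup_query(freebase_id), "format": "json"}
    return f"{WIKIDATA_SPARQL_ENDPOINT}?{urlencode(params)}"


# Same Freebase ID as in the link above.
print(freebase_lookup_query("/m/0181lj"))
print(freebase_lookup_url("/m/0181lj"))
```

Fetching the URL with any HTTP client should then return the matching item, if one exists.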

In [3]:
# `character.metadata.tsv` [14 M]

# Metadata for 450,669 characters aligned to the movies above, extracted from the November 4, 2012 dump of Freebase.  Tab-separated; columns:

# 1. Wikipedia movie ID
# 2. Freebase movie ID
# 3. Movie release date
# 4. Character name
# 5. Actor date of birth
# 6. Actor gender
# 7. Actor height (in meters)
# 8. Actor ethnicity (Freebase ID)
# 9. Actor name
# 10. Actor age at movie release
# 11. Freebase character/actor map ID
# 12. Freebase character ID
# 13. Freebase actor ID

character_metadata = pd.read_csv(
    "./data/MovieSummaries/character.metadata.tsv",
    sep="\t",
    header=None,
    names=[
        "movie_id",
        "freebase_movie_id",
        "movie_release_date",
        "character_name",
        "actor_birthdate",
        "actor_gender",
        "actor_height",
        "actor_ethnicity",
        "actor_name",
        "actor_age",
        "freebase_character_actor_map_id",
        "freebase_character_id",
        "freebase_actor_id",
    ],
    parse_dates=["movie_release_date", "actor_birthdate"],
    date_parser=lambda x: pd.to_datetime(x, errors="coerce", utc=True),
)
character_metadata['movie_release_date'] = character_metadata['movie_release_date'].dt.date
character_metadata['actor_birthdate'] = character_metadata['actor_birthdate'].dt.date
character_metadata.head()

Unnamed: 0,movie_id,freebase_movie_id,movie_release_date,character_name,actor_birthdate,actor_gender,actor_height,actor_ethnicity,actor_name,actor_age,freebase_character_actor_map_id,freebase_character_id,freebase_actor_id
0,975900,/m/03vyhn,2001-08-24,Akooshay,1958-08-26,F,1.62,,Wanda De Jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7
1,975900,/m/03vyhn,2001-08-24,Lieutenant Melanie Ballard,1974-08-15,F,1.78,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4
2,975900,/m/03vyhn,2001-08-24,Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l
3,975900,/m/03vyhn,2001-08-24,Sgt Jericho Butler,1967-09-12,M,1.75,,Jason Statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc
4,975900,/m/03vyhn,2001-08-24,Bashira Kincaid,1977-09-25,F,1.65,,Clea DuVall,23.0,/m/02vbb3r,/m/0bgchp9,/m/01y9xg


In [4]:
character_metadata.query('actor_age <= 0')[['movie_release_date', 'actor_birthdate','actor_age']]

Unnamed: 0,movie_release_date,actor_birthdate,actor_age
767,1934-05-02,1963-11-07,-29.0
2286,1918-04-14,1931-03-25,-12.0
3892,1965-01-01,1983-03-03,-18.0
6666,1924-01-01,1972-11-07,-48.0
7188,1955-08-07,1973-08-01,-17.0
...,...,...,...
446570,1999-10-03,NaT,-937.0
446581,1955-01-01,1967-05-31,-12.0
446583,1944-02-23,1947-05-28,-3.0
446816,1941-06-20,1957-04-19,-15.0


In [5]:
# Check how actor_age was calculated by recomputing it from the two dates.

calculated_age = (character_metadata.movie_release_date - character_metadata.actor_birthdate).astype('timedelta64[Y]')
# Take an explicit copy so the assignments below do not trigger SettingWithCopyWarning.
ages = character_metadata[['freebase_actor_id', 'actor_age', 'actor_birthdate', 'movie_release_date']].copy()
ages['calculated_age'] = calculated_age
ages['diff'] = ages['actor_age'] - ages['calculated_age']

# Comparisons against NaN are False, so rows with a missing diff are excluded.
print("diff>1 :{}".format(ages[ages['diff'].abs() > 1]))
ages[ages['diff'].notna() & (ages['diff'] != 0)]


diff>1 :Empty DataFrame
Columns: [freebase_actor_id, actor_age, actor_birthdate, movie_release_date, calculated_age, diff]
Index: []


Unnamed: 0,freebase_actor_id,actor_age,actor_birthdate,movie_release_date,calculated_age,diff
34,/m/0bwh7d8,40.0,1947-01-01,1988-01-01,41.0,-1.0
164,/m/02w09gx,36.0,1949-01-01,1986-01-01,37.0,-1.0
767,/m/01wlly9,-29.0,1963-11-07,1934-05-02,-30.0,1.0
962,/m/07m9cm,44.0,1963-12-19,2008-12-18,45.0,-1.0
1179,/m/09vz5s,36.0,1937-01-01,1974-01-01,37.0,-1.0
...,...,...,...,...,...,...
447210,/m/02pb53,-9.0,1942-02-08,1932-08-09,-10.0,1.0
447504,/m/0f12r29,76.0,1933-01-01,2010-01-01,77.0,-1.0
449604,/m/0cm19f,56.0,1915-01-01,1972-01-01,57.0,-1.0
449664,/m/01g42,52.0,1913-11-02,1966-11-02,53.0,-1.0


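Some recorded ages differ from the recomputed age by one year, and some are negative (in the rows shown, the birth date is later than the release date or missing).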
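
To look at individual rows, the age can also be counted by calendar birthdays instead of converting a day difference to years. This is only a sketch of one alternative check, not the method the dataset used; `completed_years` is a name made up here.

```python
from datetime import date


def completed_years(birthdate, release_date):
    """Age in whole years on `release_date`, counted by calendar birthdays."""
    had_birthday = (release_date.month, release_date.day) >= (
        birthdate.month,
        birthdate.day,
    )
    return release_date.year - birthdate.year - (0 if had_birthday else 1)


# Dates copied from rows 962 and 34 of the table above.
print(completed_years(date(1963, 12, 19), date(2008, 12, 18)))
print(completed_years(date(1947, 1, 1), date(1988, 1, 1)))
```

For these two rows the calendar count gives 44 and 41, so it agrees with the recorded `actor_age` in the first case and with `calculated_age` in the second. A negative result means the birth date is later than the release date.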

In [None]:
# CoreNLP: https://stanfordnlp.github.io/CoreNLP/

def load_coreNLP_data(wiki_movie_id: int):
    """
    Load the CoreNLP output for one movie as a BeautifulSoup document.

    Reads data/corenlp_plot_summaries/{wiki_movie_id}.xml.gz
    """
    from bs4 import BeautifulSoup
    import gzip
    
    xml = f'data/corenlp_plot_summaries/{wiki_movie_id}.xml.gz'
    with gzip.open(xml, 'rb') as f:
        soup = BeautifulSoup(f, 'xml')
    return soup

In [None]:
data = load_coreNLP_data(3217)
# The parsed document has this structure:
# <document>
#   <sentences>
#     <sentence>
#       ...
#     </sentence>
#   </sentences>
#   <coreference>
#     <coreference>
#       ...
#     </coreference>
#   </coreference>
# </document>
print(set(tag.name for tag in data.document.find_all(recursive=False)))
print(set(tag.name for tag in data.sentences.find_all(recursive=False)))
print(set(tag.name for tag in data.sentences.sentence.find_all(recursive=False)))
# print(data.sentence)

print(set(tag.name for tag in data.coreference.find_all(recursive=False)))
print(set(tag.name for tag in data.coreference.coreference.find_all(recursive=False)))
print(data.coreference.coreference.prettify())



{'coreference', 'sentences'}
{'sentence'}
{'parse', 'basic-dependencies', 'collapsed-dependencies', 'collapsed-ccprocessed-dependencies', 'tokens'}
{'coreference'}
{'mention'}
<coreference>
 <mention representative="true">
  <sentence>
   1
  </sentence>
  <start>
   23
  </start>
  <end>
   26
  </end>
  <head>
   24
  </head>
 </mention>
 <mention>
  <sentence>
   3
  </sentence>
  <start>
   18
  </start>
  <end>
   20
  </end>
  <head>
   18
  </head>
 </mention>
</coreference>
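
Each mention points at a sentence and a token span, so the mention text can be recovered from the tokens. The sketch below uses the standard-library `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra packages, and it works on a hand-written miniature document rather than a real file. It assumes that sentence and token indices are 1-based, that `end` is exclusive, that tokens carry a `<word>` child, and that `<document>` sits under a `<root>` element. These are assumptions about the CoreNLP XML layout that should be checked against an actual file.

```python
import xml.etree.ElementTree as ET

# Hand-written miniature of the structure printed above (not real data).
SAMPLE = """
<root><document>
  <sentences>
    <sentence id="1"><tokens>
      <token id="1"><word>Alice</word></token>
      <token id="2"><word>Smith</word></token>
      <token id="3"><word>sings</word></token>
    </tokens></sentence>
    <sentence id="2"><tokens>
      <token id="1"><word>She</word></token>
      <token id="2"><word>smiles</word></token>
    </tokens></sentence>
  </sentences>
  <coreference>
    <coreference>
      <mention representative="true">
        <sentence>1</sentence><start>1</start><end>3</end><head>2</head>
      </mention>
      <mention>
        <sentence>2</sentence><start>1</start><end>2</end><head>1</head>
      </mention>
    </coreference>
  </coreference>
</document></root>
"""


def mention_texts(root):
    """Return one list of mention strings per coreference chain."""
    # Map sentence id -> list of words; <sentence> tags inside mentions
    # have no id attribute and are skipped.
    words = {
        int(s.get("id")): [t.findtext("word") for t in s.iter("token")]
        for s in root.iter("sentence")
        if s.get("id") is not None
    }
    chains = []
    for chain in root.findall("./document/coreference/coreference"):
        texts = []
        for m in chain.findall("mention"):
            sent = int(m.findtext("sentence"))
            start, end = int(m.findtext("start")), int(m.findtext("end"))
            # Convert the 1-based, end-exclusive span to a Python slice.
            texts.append(" ".join(words[sent][start - 1 : end - 1]))
        chains.append(texts)
    return chains


print(mention_texts(ET.fromstring(SAMPLE)))
```

The same traversal could be written against the BeautifulSoup object returned by `load_coreNLP_data`; only the element lookups would change.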

