In [2]:
import pandas as pd

from helpers.readers import read_dataframe

![cmu visualization](https://i.postimg.cc/NMqMjwRN/image-2023-11-13-214735878.png)

**TODO:**
* Add the NLP data to the figure above.
* Add an IMDb visualization.
* Add a mapping visualization, including MovieLens.

<a id="Contents"></a> <br>
# Contents
* [1 - Loading default dataframes](#default)
<br>
* [2 - Merged Dataframes](#merged)
<br>

<a class="anchor" id="default"></a>
## Loading default dataframes
[Back to Table of Contents](#Contents)

### CMU Metadata

In [2]:
cmu_movies = read_dataframe(name='cmu/movies', preprocess=True, usecols=[
    "Wikipedia movie ID", 
    "Freebase movie ID", 
    "Movie name", 
    "Movie release date", 
    "Movie box office revenue", 
    "Movie runtime", 
    "Movie languages", 
    "Movie countries", 
    "Movie genres",
])

cmu_movies.info()
cmu_movies.head(1)

Preprocess logs:
✅ Fixed Movie Languages inside Movie Countries
✅ Removed Deseret characters
✅ Movie release date splitted to three columns: Movie release Year, Movie release Month, Movie release Day
✅ Seperated freebase identifiers from Movie Languages, Movie Countries and Movie Genres
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81741 entries, 0 to 81740
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Wikipedia movie ID        81741 non-null  int32  
 1   Freebase movie ID         81741 non-null  string 
 2   Movie name                81741 non-null  string 
 3   Movie box office revenue  8401 non-null   float64
 4   Movie runtime             61291 non-null  float32
 5   Movie release Year        74839 non-null  Int16  
 6   Movie release Month       42667 non-null  Int8   
 7   Movie release Day         39373 non-null  Int8   
 8   Movie languages           81741 non-null  string

Unnamed: 0,Wikipedia movie ID,Freebase movie ID,Movie name,Movie box office revenue,Movie runtime,Movie release Year,Movie release Month,Movie release Day,Movie languages,Movie countries,Movie genres
0,975900,/m/03vyhn,Ghosts of Mars,14010832.0,98.0,2001,8,24,English,United States of America,"Thriller,Science Fiction,Horror,Adventure,Supe..."
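The preprocess log above reports that `Movie release date` is split into year, month and day columns. The implementation inside `read_dataframe` is not shown in this notebook, so the following is only a minimal sketch of one way to do it. It assumes the raw dates are strings of the form `YYYY`, `YYYY-MM` or `YYYY-MM-DD`; the name `split_date_column` and all toy rows except "Ghosts of Mars" are hypothetical.

```python
import pandas as pd


def split_date_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Replace `col` with nullable '<prefix> Year/Month/Day' columns.

    Hypothetical helper: the real preprocessing in helpers.readers may differ.
    """
    # "Movie release date" -> "Movie release"
    prefix = col.rsplit(" ", 1)[0]
    # The year is required; month and day are optional, so partial dates still match.
    pattern = r"^(?P<Year>\d{4})(?:-(?P<Month>\d{1,2}))?(?:-(?P<Day>\d{1,2}))?"
    parts = df[col].str.extract(pattern)

    out = df.drop(columns=[col])
    for part, dtype in (("Year", "Int16"), ("Month", "Int8"), ("Day", "Int8")):
        # Nullable integer dtypes keep missing pieces as <NA> instead of floats.
        out[f"{prefix} {part}"] = pd.to_numeric(parts[part]).astype(dtype)
    return out


# Toy input: the raw string format is an assumption, not taken from the dataset files.
toy_movies = pd.DataFrame({
    "Movie name": ["Ghosts of Mars", "Year only", "Year and month", "No date"],
    "Movie release date": ["2001-08-24", "1988", "1997-05", None],
})
split = split_date_column(toy_movies, "Movie release date")
print(split)
```

The small nullable dtypes (`Int16`, `Int8`) match the ones reported by `cmu_movies.info()` above, which is why the sketch casts explicitly rather than leaving the columns as floats.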


In [3]:
cmu_characters = read_dataframe(name='cmu/characters', preprocess=True, usecols=[
    "Wikipedia movie ID",
    "Freebase movie ID",
    "Movie release date",
    "Character name",
    "Actor DOB",
    "Actor gender",
    "Actor height",
    "Actor ethnicity",
    "Actor name",
    "Actor age at movie release",
    "Freebase character/actor map ID",
    "Freebase character ID",
    "Freebase actor ID",
])

cmu_characters.info()
cmu_characters.head(1)

Preprocess logs:
✅ Movie release date splitted to three columns: Movie release Year, Movie release Month, Movie release Day
✅ Actor DOB splitted to three columns: Actor DOB Year, Actor DOB Month, Actor DOB Day
✅ Dropped Freebase character/actor map ID and Freebase character ID
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450669 entries, 0 to 450668
Data columns (total 15 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   Wikipedia movie ID               450669 non-null  int32  
 1   Freebase movie ID                450669 non-null  string 
 2   Character name                   192794 non-null  string 
 3   Actor gender                     405060 non-null  string 
 4   Actor height                     154824 non-null  float32
 5   Actor ethnicity                  106058 non-null  string 
 6   Actor name                       449441 non-null  string 
 7   Actor age at movie release       292556

Unnamed: 0,Wikipedia movie ID,Freebase movie ID,Character name,Actor gender,Actor height,Actor ethnicity,Actor name,Actor age at movie release,Freebase character/actor map ID,Movie release Year,Movie release Month,Movie release Day,Actor DOB Year,Actor DOB Month,Actor DOB Day
0,975900,/m/03vyhn,Akooshay,F,1.62,,Wanda De Jesus,42.0,/m/0bgchxw,2001,8,24,1958,8,26


### IMDb ([non-commercial datasets](https://developer.imdb.com/non-commercial-datasets/))

In [4]:
imdb_people = read_dataframe(name='imdb/names')
imdb_people.info()
imdb_people.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12904751 entries, 0 to 12904750
Data columns (total 6 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   nconst             string
 1   primaryName        string
 2   birthYear          Int16 
 3   deathYear          Int16 
 4   primaryProfession  string
 5   knownForTitles     string
dtypes: Int16(2), string(4)
memory usage: 467.7 MB


Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0053137,tt0050419,tt0072308,tt0031983"
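`knownForTitles` holds several `tconst` identifiers in one comma-separated string, so it cannot be joined against the title tables directly. A possible way to get one row per person and title is sketched below. The function name `explode_known_for` and the second toy row are hypothetical; the first row is copied from the output above.

```python
import pandas as pd


def explode_known_for(people: pd.DataFrame) -> pd.DataFrame:
    """Return one (nconst, tconst) row per title listed in knownForTitles."""
    pairs = people[["nconst", "knownForTitles"]].dropna(subset=["knownForTitles"])
    # Turn "tt1,tt2" into ["tt1", "tt2"], then give each list element its own row.
    pairs = pairs.assign(tconst=pairs["knownForTitles"].str.split(","))
    pairs = pairs.drop(columns="knownForTitles").explode("tconst")
    return pairs.reset_index(drop=True)


toy_people = pd.DataFrame({
    "nconst": ["nm0000001", "nm_hypothetical"],
    "primaryName": ["Fred Astaire", "Person without titles"],
    "knownForTitles": ["tt0053137,tt0050419,tt0072308,tt0031983", None],
})
known_for = explode_known_for(toy_people)
print(known_for)
```

On the full table (about 12.9 million rows) this produces a considerably longer frame, so it may be worth filtering to the people of interest before exploding.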


In [5]:
imdb_info = read_dataframe(name='imdb/movies', preprocess=True)
imdb_info.info()
imdb_info.head(1)

Preprocess logs:
✅ Moved genres from runtimeMinutes to genres column
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10218119 entries, 0 to 10218118
Data columns (total 9 columns):
 #   Column          Dtype   
---  ------          -----   
 0   tconst          string  
 1   titleType       category
 2   primaryTitle    string  
 3   originalTitle   string  
 4   isAdult         Int16   
 5   startYear       Int16   
 6   endYear         Int16   
 7   runtimeMinutes  Int32   
 8   genres          string  
dtypes: Int16(3), Int32(1), category(1), string(4)
memory usage: 458.0 MB


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"


In [6]:
imdb_principals = read_dataframe(name='imdb/principals')
imdb_principals.info()
imdb_principals.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58535121 entries, 0 to 58535120
Data columns (total 6 columns):
 #   Column      Dtype   
---  ------      -----   
 0   tconst      string  
 1   ordering    int8    
 2   nconst      string  
 3   category    category
 4   job         string  
 5   characters  string  
dtypes: category(1), int8(1), string(4)
memory usage: 1.9 GB


Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,,"[""Self""]"


In [7]:
imdb_ratings = read_dataframe(name='imdb/ratings')
imdb_ratings.info()
imdb_ratings.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1356511 entries, 0 to 1356510
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1356511 non-null  string 
 1   averageRating  1356511 non-null  float32
 2   numVotes       1356511 non-null  int32  
dtypes: float32(1), int32(1), string(1)
memory usage: 20.7 MB


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1997


### Unused IMDb datasets (3 of 7)

In [8]:
imdb_akas = read_dataframe(name='imdb/akas')
imdb_akas.info()
imdb_akas.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37422067 entries, 0 to 37422066
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   titleId          string
 1   ordering         int16 
 2   title            string
 3   region           string
 4   language         string
 5   types            string
 6   attributes       string
 7   isOriginalTitle  Int8  
dtypes: Int8(1), int16(1), string(6)
memory usage: 1.8 GB


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,,imdbDisplay,,0


In [9]:
imdb_crew = read_dataframe(name='imdb/crew')
imdb_crew.info()
imdb_crew.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10218119 entries, 0 to 10218118
Data columns (total 3 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   tconst     string
 1   directors  string
 2   writers    string
dtypes: string(3)
memory usage: 233.9 MB


Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,


In [10]:
imdb_episode = read_dataframe(name='imdb/episode')
imdb_episode.info()
imdb_episode.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7788792 entries, 0 to 7788791
Data columns (total 4 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   tconst         string
 1   parentTconst   string
 2   seasonNumber   Int16 
 3   episodeNumber  Int32 
dtypes: Int16(1), Int32(1), string(2)
memory usage: 178.3 MB


Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber
0,tt0041951,tt0041038,1,9


### Mappings

In [11]:
mapping_w_i_f = read_dataframe(name='mapping_wikipedia_imdb_freebase')
mapping_w_i_f.info()
mapping_w_i_f.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76954 entries, 0 to 76953
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   wikipedia  72189 non-null  Int32 
 1   imdb       76954 non-null  string
 2   freebase   73947 non-null  string
dtypes: Int32(1), string(2)
memory usage: 1.5 MB


Unnamed: 0,wikipedia,imdb,freebase
0,975900,tt0228333,/m/03vyhn


In [12]:
mapping_w_i = read_dataframe(name='mapping_wikipedia_imdb')
mapping_w_i.info()
mapping_w_i.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72180 entries, 0 to 72179
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   wikipedia  72180 non-null  int32 
 1   imdb       72180 non-null  string
dtypes: int32(1), string(1)
memory usage: 846.0 KB


Unnamed: 0,wikipedia,imdb
0,975900,tt0228333


In [13]:
mapping_f_i = read_dataframe(name='mapping_freebase_imdb')
mapping_f_i.info()
mapping_f_i.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73894 entries, 0 to 73893
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   freebase  73894 non-null  string
 1   imdb      73894 non-null  string
dtypes: string(2)
memory usage: 1.1 MB


Unnamed: 0,freebase,imdb
0,/m/0kcn7,tt0058331


### MovieLens

In [14]:
movieLens_movies = read_dataframe(name='movieLens/movies', preprocess=True)
movieLens_movies.info()
movieLens_movies.head(1)

Preprocess logs:
✅ Aligned bad rows
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   adult                  45463 non-null  category
 1   belongs_to_collection  4491 non-null   string  
 2   budget                 45463 non-null  Int32   
 3   genres                 45463 non-null  string  
 4   homepage               7779 non-null   string  
 5   id                     45463 non-null  Int32   
 6   imdb_id                45446 non-null  string  
 7   original_language      45452 non-null  string  
 8   original_title         45463 non-null  string  
 9   overview               44512 non-null  string  
 10  popularity             45463 non-null  float32 
 11  poster_path            45080 non-null  string  
 12  production_companies   45463 non-null  string  
 13  production_countries   45463 non-null  string  
 14  re

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033,81,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415
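Columns such as `genres` and `spoken_languages` are stored as strings that look like Python literals (single-quoted lists of dicts), so `json.loads` would reject them. One option, sketched here under that assumption, is `ast.literal_eval`. The function name `genre_names` is hypothetical, and the toy strings only reuse the visible, untruncated part of the output above.

```python
import ast

import pandas as pd


def genre_names(raw) -> list:
    """Parse a stringified list of {'id': ..., 'name': ...} dicts into a list of names."""
    if pd.isna(raw):
        return []
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Malformed cells are treated as "no genres" rather than aborting the run.
        return []
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items if isinstance(item, dict) and "name" in item]


toy_genres = pd.Series(
    ["[{'id': 16, 'name': 'Animation'}]", "[]", None, "not a list"],
    dtype="object",
)
parsed = toy_genres.map(genre_names)
print(parsed.tolist())
```

`ast.literal_eval` only evaluates literals, so unlike `eval` it does not execute arbitrary code found in a cell.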


### CMU Summaries NLP

In [15]:
cmu_summaries = read_dataframe(name='cmu/summaries', usecols=[
    "Wikipedia movie ID", 
    "Plot Summary"
])
cmu_summaries.info()

cmu_nameclusters = read_dataframe(name='cmu/nameclusters', usecols=['Character name', 'Freebase character/actor map ID'])
cmu_nameclusters.info()

cmu_tvtropes = read_dataframe(name='cmu/tvtropes')
cmu_tvtropes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42303 entries, 0 to 42302
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Wikipedia movie ID  42303 non-null  int32 
 1   Plot Summary        42303 non-null  string
dtypes: int32(1), string(1)
memory usage: 495.9 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2666 entries, 0 to 2665
Data columns (total 2 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   Character name                   2666 non-null   string
 1   Freebase character/actor map ID  2666 non-null   string
dtypes: string(2)
memory usage: 41.8 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 501 entries, 0 to 500
Data columns (total 5 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   Character type              

In [3]:
# Note: this rebinds `cmu_characters`, replacing the dataframe loaded from 'cmu/characters' above.
cmu_characters = read_dataframe('cmu/characters_2023')
cmu_characters.info()
cmu_characters.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 229420 entries, 0 to 229419
Data columns (total 5 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Wikipedia_movie_id  229420 non-null  int64 
 1   Character           229420 non-null  object
 2   AV                  229420 non-null  object
 3   PV                  229420 non-null  object
 4   Att                 229420 non-null  object
dtypes: int64(1), object(4)
memory usage: 8.8+ MB


Unnamed: 0,Wikipedia_movie_id,Character,AV,PV,Att
0,11784534,Ingrid Bergman,[],[],"[Ingrid, Bergman]"


<a class="anchor" id="merged"></a>
## Merged Dataframes
[Back to Table of Contents](#Contents)

### CMU Movies IMDb merge (+ MovieLens?)
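
This merge is not implemented yet. As a rough sketch of how it could look, the toy example below links CMU movies to IMDb ratings through the Wikipedia to IMDb mapping, using the column names shown in the outputs above. The function name `merge_cmu_with_ratings`, the `toy_*` frames, the unmapped movie, and the rating values are made up for illustration.

```python
import pandas as pd


def merge_cmu_with_ratings(cmu: pd.DataFrame, mapping: pd.DataFrame,
                           ratings: pd.DataFrame) -> pd.DataFrame:
    """Left-join CMU movies to IMDb ratings via a wikipedia -> imdb mapping."""
    # validate="many_to_one" raises a MergeError if the right-hand keys are not unique,
    # which guards against silently duplicating movies.
    linked = cmu.merge(mapping, how="left", left_on="Wikipedia movie ID",
                       right_on="wikipedia", validate="many_to_one")
    merged = linked.merge(ratings, how="left", left_on="imdb",
                          right_on="tconst", validate="many_to_one")
    # Keep "imdb" as the single IMDb key; drop the now redundant join columns.
    return merged.drop(columns=["wikipedia", "tconst"])


toy_cmu = pd.DataFrame({
    "Wikipedia movie ID": [975900, 1],  # ID 1 is a made-up movie with no mapping
    "Movie name": ["Ghosts of Mars", "Unmapped example"],
})
toy_mapping = pd.DataFrame({"wikipedia": [975900], "imdb": ["tt0228333"]})
toy_ratings = pd.DataFrame({
    "tconst": ["tt0228333"],
    "averageRating": [5.0],  # placeholder value, not the real rating
    "numVotes": [100],       # placeholder value
})
toy_merged = merge_cmu_with_ratings(toy_cmu, toy_mapping, toy_ratings)
print(toy_merged)
```

A left join keeps CMU movies that have no IMDb match (their rating columns stay missing); switching to `how="inner"` would drop them instead, which is a choice to make once the research questions are fixed.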

### CMU Movies IMDb NLP characters (+ MovieLens?)

In [17]:
# TODO: add the merged dataframes that we will use to answer our questions