# Get information about a movie

In [1]:
%load_ext autoreload
%autoreload 2

In this notebook we retrieve the following information about a movie:
- Title
- Release year
- Duration
- Director and producer
- Cast (ordered by importance)

For each cast member, we also want the age and gender.

## Load packages

In [2]:
from bechdelai.data.imdb import (
    find_movie_from_kerword,
    get_movie_data
)
from bechdelai.data.display import show_movie_suggestions_get_id
from IPython.display import display, HTML, Markdown
import pandas as pd

## Import IMDB Datasets

More details here: https://www.imdb.com/interfaces/

In [3]:
%%time
# Load the large IMDb DataFrames (takes about 2 min)

name_df = pd.read_csv(
    "../../data/imdb/name.basics.tsv.gz",
    sep="\t",
    usecols=["nconst", "primaryName", "birthYear", "deathYear", "primaryProfession"]
)
basics_df = pd.read_csv(
    "../../data/imdb/title.basics.tsv.gz",
    sep="\t",
    usecols=["tconst", "primaryTitle", "startYear", "runtimeMinutes"],
)
principals_df = pd.read_csv(
    "../../data/imdb/title.principals.tsv.gz",
    sep="\t",
    usecols=["tconst", "ordering", "nconst", "category", "characters"],
)

# crew_df = pd.read_csv("../../data/imdb/title.crew.tsv.gz", sep="\t")
# akas_df = pd.read_csv("../../data/imdb/title.akas.tsv.gz", sep="\t")
# ratings_df = pd.read_csv("../../data/imdb/title.ratings.tsv.gz", sep="\t")



Wall time: 1min 28s
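The IMDb files mark missing values with `\N` (visible later in the `deathYear` column). As an optional variation, not what this notebook does, `pd.read_csv` can convert that marker to `NaN` at load time through `na_values`. The sketch below uses a small in-memory sample built from two rows shown in this notebook. The variable names are illustrative, and downstream functions such as `get_movie_data` may expect the raw `\N` strings, so treat this as an assumption to verify before changing the loading code.

```python
import io

import pandas as pd

# Two rows copied from the cast output of this notebook, laid out like
# name.basics.tsv: tab-separated, with "\N" standing for a missing value.
sample_tsv = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\n"
    "nm0000474\tMichael Keaton\t1951\t\\N\n"
    "nm0385757\tPat Hingle\t1924\t2009\n"
)

sample_df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    na_values="\\N",  # treat the IMDb missing-value marker as NaN
    usecols=["nconst", "primaryName", "birthYear", "deathYear"],
)

# deathYear is now numeric with NaN for the missing entry
print(sample_df.dtypes)
```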


Preprocess datasets

In [4]:
%%time
# Convert the IDs to int so that filtering is faster
# TODO: optimize preprocessing time
principals_df["tconst"] = principals_df["tconst"].str[2:].astype(int)
principals_df["nconst"] = principals_df["nconst"].str[2:].astype(int)
basics_df["tconst"] = basics_df["tconst"].str[2:].astype(int)
name_df["nconst"] = name_df["nconst"].str[2:].astype(int)

Wall time: 59.4 s
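To make the conversion above concrete, here is a minimal sketch on IDs that appear in this notebook. The helper `to_imdb_id` is hypothetical (it is not part of `bechdelai`) and assumes IDs are zero-padded to at least seven digits, which matches the IDs shown here.

```python
import pandas as pd

title_ids = pd.Series(["tt0096895"])
name_ids = pd.Series(["nm0000474", "nm0943237"])

# Same transformation as in the cell above: drop the two-letter prefix
# and parse the remaining digits as an integer.
title_numbers = title_ids.str[2:].astype(int)
name_numbers = name_ids.str[2:].astype(int)


def to_imdb_id(number, prefix="tt"):
    """Hypothetical helper: rebuild the string ID, zero-padded to 7 digits."""
    return f"{prefix}{int(number):07d}"


print(title_numbers.tolist(), name_numbers.tolist())
print(to_imdb_id(title_numbers[0]), to_imdb_id(name_numbers[0], prefix="nm"))
```

Integer comparison is generally cheaper than string comparison, which is the motivation stated in the comment above. The reverse helper is only needed when building URLs or displaying IDs.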


## Get list of suggestions from a query (e.g. "Batman")

### Get data from keyword

In [5]:
%%time
ans = find_movie_from_kerword(q="batman")

Wall time: 4.01 s


Show posters and select the desired movie

In [24]:
movie_id = show_movie_suggestions_get_id(ans, top=7, verbose=True)




Select wanted index: 1


ID of the movie: tt0096895


Get data using the `get_movie_data()` function

In [25]:
%%time
movie_data = get_movie_data(movie_id, name_df, basics_df, principals_df)

Wall time: 3.02 s
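The implementation of `get_movie_data` is not shown in this notebook. As a rough illustration of the kind of filter-and-join it could perform on the three tables, the sketch below builds the cast of one title from toy tables. The function `sketch_cast` and the toy frames are hypothetical. Names and orderings come from the output further down, and the extra row with title number `1` is a placeholder that only exercises the filter.

```python
import pandas as pd

# Toy tables with integer IDs, as produced by the preprocessing step.
names = pd.DataFrame({
    "nconst": [474, 197, 107],
    "primaryName": ["Michael Keaton", "Jack Nicholson", "Kim Basinger"],
})
principals = pd.DataFrame({
    "tconst": [96895, 96895, 96895, 1],  # last row: placeholder title
    "nconst": [107, 474, 197, 474],
    "ordering": [3, 1, 2, 1],
    "category": ["actress", "actor", "actor", "actor"],
})


def sketch_cast(title_number, names, principals):
    """Hypothetical helper: cast of one title, in billing order."""
    # Keep only the acting credits of the requested title
    rows = principals[
        (principals["tconst"] == title_number)
        & principals["category"].isin(["actor", "actress"])
    ]
    # Attach the person details, then sort by the IMDb "ordering" column
    return (
        rows.merge(names, on="nconst", how="left")
        .sort_values("ordering")
        .reset_index(drop=True)
    )


cast = sketch_cast(96895, names, principals)
print(cast[["ordering", "primaryName"]])
```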


### Show results

In [26]:
movie_data.keys()

dict_keys(['tconst', 'primaryTitle', 'startYear', 'runtimeMinutes', 'url', 'director', 'producer', 'cast'])

In [27]:
pd.DataFrame([movie_data]).iloc[:, :5]

|    | tconst | primaryTitle | startYear | runtimeMinutes | url |
|---|---|---|---|---|---|
| 0 | tt0096895 | Batman | 1989 | 126 | https://www.imdb.com/name/tt0096895/ |


In [28]:
pd.DataFrame(movie_data["director"])

|    | nconst | primaryName | birthYear | deathYear | primaryProfession | url | gender |
|---|---|---|---|---|---|---|---|
| 0 | nm0000318 | Tim Burton | 1958 | \N | producer,miscellaneous,director | https://www.imdb.com/name/nm0000318/ | ? |


In [29]:
pd.DataFrame(movie_data["producer"])

|    | nconst | primaryName | birthYear | deathYear | primaryProfession | url | gender |
|---|---|---|---|---|---|---|---|
| 0 | nm0005307 | Jon Peters | 1945 | \N | producer,actor,make_up_department | https://www.imdb.com/name/nm0005307/ | M |
| 1 | nm0345542 | Peter Guber | 1942 | \N | producer,executive,miscellaneous | https://www.imdb.com/name/nm0345542/ | ? |


In [30]:
pd.DataFrame(movie_data["cast"]).head(15)

|    | nconst | primaryName | birthYear | deathYear | primaryProfession | url | gender | ordering |
|---|---|---|---|---|---|---|---|---|
| 0 | nm0000474 | Michael Keaton | 1951 | \N | actor,producer,soundtrack | https://www.imdb.com/name/nm0000474/ | M | 1 |
| 1 | nm0000197 | Jack Nicholson | 1937 | \N | actor,soundtrack,producer | https://www.imdb.com/name/nm0000197/ | M | 2 |
| 2 | nm0000107 | Kim Basinger | 1953 | \N | actress,soundtrack,producer | https://www.imdb.com/name/nm0000107/ | F | 3 |
| 3 | nm0943237 | Robert Wuhl | 1951 | \N | actor,writer,producer | https://www.imdb.com/name/nm0943237/ | M | 4 |
| 4 | nm0385757 | Pat Hingle | 1924 | 2009 | actor,producer,soundtrack | https://www.imdb.com/name/nm0385757/ | M | 5 |
| 5 | nm0001850 | Billy Dee Williams | 1937 | \N | actor,soundtrack,writer | https://www.imdb.com/name/nm0001850/ | M | 6 |
| 6 | nm0001284 | Michael Gough | 1916 | 2011 | actor | https://www.imdb.com/name/nm0001284/ | M | 7 |
| 7 | nm0001588 | Jack Palance | 1919 | 2006 | actor,soundtrack,director | https://www.imdb.com/name/nm0001588/ | M | 8 |
| 8 | nm0355717 | Jerry Hall | 1956 | \N | actress | https://www.imdb.com/name/nm0355717/ | F | 9 |
| 9 | nm0910145 | Tracey Walter | 1947 | \N | actor,soundtrack | https://www.imdb.com/name/nm0910145/ | M | 10 |
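The goal stated at the top of the notebook includes the age and gender of the cast. The returned data holds `birthYear` rather than an age, so one possible way to derive it is to subtract the birth year from the release year. The sketch below does this for the first four cast rows and counts genders. The `ageAtRelease` column name is an assumption, and the result is approximate because only years are available.

```python
import pandas as pd

# First four cast rows from the output of this notebook; years are strings,
# as in the raw result.
cast_df = pd.DataFrame({
    "primaryName": ["Michael Keaton", "Jack Nicholson", "Kim Basinger", "Robert Wuhl"],
    "birthYear": ["1951", "1937", "1953", "1951"],
    "gender": ["M", "M", "F", "M"],
})
release_year = 1989  # startYear of the selected movie

# errors="coerce" turns non-numeric markers such as "\N" into NaN
birth_year = pd.to_numeric(cast_df["birthYear"], errors="coerce")
cast_df["ageAtRelease"] = release_year - birth_year

gender_counts = cast_df["gender"].value_counts()

print(cast_df[["primaryName", "ageAtRelease"]])
print(gender_counts)
```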


### Raw result (as dictionary)

In [31]:
movie_data

{'tconst': 'tt0096895',
 'primaryTitle': 'Batman',
 'startYear': '1989',
 'runtimeMinutes': '126',
 'url': 'https://www.imdb.com/name/tt0096895/',
 'director': [{'nconst': 'nm0000318',
   'primaryName': 'Tim Burton',
   'birthYear': '1958',
   'deathYear': '\\N',
   'primaryProfession': 'producer,miscellaneous,director',
   'url': 'https://www.imdb.com/name/nm0000318/',
   'gender': '?'}],
 'producer': [{'nconst': 'nm0005307',
   'primaryName': 'Jon Peters',
   'birthYear': '1945',
   'deathYear': '\\N',
   'primaryProfession': 'producer,actor,make_up_department',
   'url': 'https://www.imdb.com/name/nm0005307/',
   'gender': 'M'},
  {'nconst': 'nm0345542',
   'primaryName': 'Peter Guber',
   'birthYear': '1942',
   'deathYear': '\\N',
   'primaryProfession': 'producer,executive,miscellaneous',
   'url': 'https://www.imdb.com/name/nm0345542/',
   'gender': '?'}],
 'cast': [{'nconst': 'nm0000474',
   'primaryName': 'Michael Keaton',
   'birthYear': '1951',
   'deathYear': '\\N',
   'pri