# Get information about a movie

In [1]:
%load_ext autoreload
%autoreload 2

In this notebook we retrieve information about a movie, such as:
- Title
- Release year
- Duration
- Director and producer
- Cast (ordered by importance)

For each cast member, we also want the age and gender.

## Load packages

In [2]:
from bechdelai.data.imdb import (
    find_movie_from_kerword,
    show_movie_suggestions_get_id,
    get_movie_data
)
from IPython.display import display, HTML, Markdown
import pandas as pd

## Import IMDb datasets

More details here: https://www.imdb.com/interfaces/

In [3]:
%%time
# Load the large IMDb dataframes used by the scraping code (~2 min)
import pandas as pd

name_df = pd.read_csv(
    "../../data/imdb/name.basics.tsv.gz",
    sep="\t",
    usecols=["nconst", "primaryName", "birthYear", "deathYear", "primaryProfession"]
)
basics_df = pd.read_csv(
    "../../data/imdb/title.basics.tsv.gz",
    sep="\t",
    usecols=["tconst", "primaryTitle", "startYear", "runtimeMinutes"],
)
principals_df = pd.read_csv(
    "../../data/imdb/title.principals.tsv.gz",
    sep="\t",
    usecols=["tconst", "ordering", "nconst", "category", "characters"],
)

# crew_df = pd.read_csv("../../data/imdb/title.crew.tsv.gz", sep="\t")
# akas_df = pd.read_csv("../../data/imdb/title.akas.tsv.gz", sep="\t")
# ratings_df = pd.read_csv("../../data/imdb/title.ratings.tsv.gz", sep="\t")



Wall time: 1min 27s
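
The IMDb dumps mark missing values with the literal string `\N`, which is why `birthYear` and `deathYear` appear as `\N` in the outputs of this notebook. The sketch below, run on a small in-memory sample rather than the real files, shows how `pd.read_csv` could convert these markers into proper missing values through `na_values`. The `sample_tsv` and `sample_df` names are illustrative only, and the cells of this notebook do not use this option.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for name.basics.tsv.gz (rows taken from this notebook's outputs).
sample_tsv = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\n"
    "nm0716257\tMatt Reeves\t1966\t\\N\n"
    "nm1249995\tDylan Clark\t\\N\t\\N\n"
)

# na_values="\\N" tells pandas to treat the literal \N marker as missing.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

print(sample_df)
print(sample_df.isna().sum())
```

With this option the year columns become numeric with `NaN` for unknown values, so downstream code would no longer need to compare against the string `\N`.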


Preprocess datasets

In [4]:
%%time
# Convert IDs to integers (dropping the "tt" / "nm" prefix) so that filters run faster
# TODO: optimize preprocessing time
principals_df["tconst"] = principals_df["tconst"].str[2:].astype(int)
principals_df["nconst"] = principals_df["nconst"].str[2:].astype(int)
basics_df["tconst"] = basics_df["tconst"].str[2:].astype(int)
name_df["nconst"] = name_df["nconst"].str[2:].astype(int)

Wall time: 1min 2s
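
The preprocessing step strips the two-letter prefix (`tt` for titles, `nm` for names) and casts the remainder to an integer. Here is a minimal sketch of that conversion on toy data, together with the reverse mapping back to the IMDb string form. The `to_imdb_id` helper is hypothetical, and the zero-padding to 7 digits is an assumption based on the IDs visible in this notebook (for example `nm0716257`).

```python
import pandas as pd

# Toy IDs taken from this notebook's outputs.
toy_df = pd.DataFrame({"nconst": ["nm0716257", "nm1500155", "nm11123883"]})

# Same conversion as in the preprocessing cell: drop the prefix, cast to int.
toy_df["nconst_int"] = toy_df["nconst"].str[2:].astype(int)


def to_imdb_id(number: int, prefix: str = "nm") -> str:
    """Hypothetical helper: rebuild the string ID, zero-padded to at least 7 digits."""
    return f"{prefix}{str(number).zfill(7)}"


toy_df["restored"] = toy_df["nconst_int"].map(to_imdb_id)
print(toy_df)
```

The round trip matters because the integer form loses the leading zeros, while URLs such as `https://www.imdb.com/name/nm0716257/` need the full string ID.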


## Get list of suggestions from a query (e.g. "Batman")

### Get data from keyword

In [5]:
%%time
ans = find_movie_from_kerword(q="batman")

Wall time: 4.33 s


Show the posters and select the wanted movie

In [6]:
movie_id, movie_cast_url = show_movie_suggestions_get_id(ans, top=7, verbose=True)




Select wanted index: 0


ID of the movie: tt1877830
URL of the casting: https://www.imdb.com/title/tt1877830/fullcredits
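
The cast URL printed above follows a simple pattern built from the movie ID. The sketch below reproduces that pattern; `build_cast_url` is a hypothetical helper written for illustration, not necessarily how `show_movie_suggestions_get_id` builds the URL internally.

```python
import re

# URL pattern inferred from the output above.
IMDB_CAST_URL = "https://www.imdb.com/title/{movie_id}/fullcredits"


def build_cast_url(movie_id: str) -> str:
    """Hypothetical helper: build the full-credits URL for an IMDb title ID."""
    # Title IDs in this notebook look like "tt" followed by digits.
    if not re.fullmatch(r"tt\d+", movie_id):
        raise ValueError(f"Unexpected IMDb title ID: {movie_id!r}")
    return IMDB_CAST_URL.format(movie_id=movie_id)


cast_url = build_cast_url("tt1877830")
print(cast_url)
```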


Get the data using the `get_movie_data()` function

In [7]:
%%time
movie_data = get_movie_data(movie_id, movie_cast_url, name_df, basics_df, principals_df)

Wall time: 4.49 s


### Show results

In [8]:
movie_data.keys()

dict_keys(['tconst', 'primaryTitle', 'startYear', 'runtimeMinutes', 'url', 'director', 'producer', 'cast'])

In [13]:
pd.DataFrame([movie_data]).iloc[:, :5]

|   | tconst | primaryTitle | startYear | runtimeMinutes | url |
|---|---|---|---|---|---|
| 0 | tt1877830 | The Batman | 2022 | 176 | https://www.imdb.com/name/tt1877830/ |


In [10]:
pd.DataFrame([movie_data["director"]])

|   | nconst | primaryName | birthYear | deathYear | primaryProfession | url | gender |
|---|---|---|---|---|---|---|---|
| 0 | nm0716257 | Matt Reeves | 1966 | \N | producer,writer,director | https://www.imdb.com/name/nm0716257/ | ? |


In [11]:
pd.DataFrame([movie_data["producer"]])

|   | nconst | primaryName | birthYear | deathYear | primaryProfession | url | gender |
|---|---|---|---|---|---|---|---|
| 0 | nm1249995 | Dylan Clark | \N | \N | producer,camera_department,miscellaneous | https://www.imdb.com/name/nm1249995/ | ? |


In [12]:
pd.DataFrame(movie_data["cast"]).head(15)

|   | nconst | primaryName | birthYear | deathYear | primaryProfession | url | gender | ordering |
|---|---|---|---|---|---|---|---|---|
| 0 | nm1500155 | Robert Pattinson | 1986 | \N | actor,soundtrack,writer | https://www.imdb.com/name/nm1500155/ | M | 1 |
| 1 | nm2368789 | Zoë Kravitz | 1988 | \N | actress,producer,soundtrack | https://www.imdb.com/name/nm2368789/ | F | 2 |
| 2 | nm0942482 | Jeffrey Wright | 1965 | \N | actor,producer,soundtrack | https://www.imdb.com/name/nm0942482/ | M | 3 |
| 3 | nm0268199 | Colin Farrell | 1976 | \N | actor,producer,soundtrack | https://www.imdb.com/name/nm0268199/ | M | 4 |
| 4 | nm0200452 | Paul Dano | 1984 | \N | actor,soundtrack,producer | https://www.imdb.com/name/nm0200452/ | M | 5 |
| 5 | nm0001806 | John Turturro | 1957 | \N | actor,writer,director | https://www.imdb.com/name/nm0001806/ | M | 6 |
| 6 | nm0785227 | Andy Serkis | 1964 | \N | actor,producer,director | https://www.imdb.com/name/nm0785227/ | M | 7 |
| 7 | nm0765597 | Peter Sarsgaard | 1971 | \N | actor,producer | https://www.imdb.com/name/nm0765597/ | M | 8 |
| 8 | nm4422686 | Barry Keoghan | 1992 | \N | actor | https://www.imdb.com/name/nm4422686/ | M | 9 |
| 9 | nm11123883 | Jayme Lawson | \N | \N | actress | https://www.imdb.com/name/nm11123883/ | F | 10 |
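
Since the goal stated at the top of the notebook is to get the age and gender of each cast member, here is a minimal sketch of how these could be derived from the `birthYear`, `gender`, and `startYear` fields. It uses a hand-copied subset of the cast shown above rather than the real `movie_data`, and the `age_at_release` column name is an assumption. Ages are approximate because only years are available.

```python
import pandas as pd

# Release year of the movie ('startYear' in movie_data, stored there as a string).
start_year = int("2022")

# Subset of the cast above; birthYear is a string and may hold the \N marker.
cast_df = pd.DataFrame(
    {
        "primaryName": ["Robert Pattinson", "Zoë Kravitz", "Jeffrey Wright", "Jayme Lawson"],
        "birthYear": ["1986", "1988", "1965", "\\N"],
        "gender": ["M", "F", "M", "F"],
    }
)

# Unknown birth years (\N) become NaN instead of raising an error.
birth_year = pd.to_numeric(cast_df["birthYear"], errors="coerce")
cast_df["age_at_release"] = start_year - birth_year

# Share of cast members labelled "F" in this subset.
female_share = (cast_df["gender"] == "F").mean()

print(cast_df)
print(f"Share of women in this subset: {female_share:.0%}")
```

Rows with an unknown birth year keep a missing age, which is safer than filling in a guess.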


### Raw result (as dictionary)

In [14]:
movie_data

{'tconst': 'tt1877830',
 'primaryTitle': 'The Batman',
 'startYear': '2022',
 'runtimeMinutes': '176',
 'url': 'https://www.imdb.com/name/tt1877830/',
 'director': {'nconst': 'nm0716257',
  'primaryName': 'Matt Reeves',
  'birthYear': '1966',
  'deathYear': '\\N',
  'primaryProfession': 'producer,writer,director',
  'url': 'https://www.imdb.com/name/nm0716257/',
  'gender': '?'},
 'producer': {'nconst': 'nm1249995',
  'primaryName': 'Dylan Clark',
  'birthYear': '\\N',
  'deathYear': '\\N',
  'primaryProfession': 'producer,camera_department,miscellaneous',
  'url': 'https://www.imdb.com/name/nm1249995/',
  'gender': '?'},
 'cast': [{'nconst': 'nm1500155',
   'primaryName': 'Robert Pattinson',
   'birthYear': '1986',
   'deathYear': '\\N',
   'primaryProfession': 'actor,soundtrack,writer',
   'url': 'https://www.imdb.com/name/nm1500155/',
   'gender': 'M',
   'ordering': 1},
  {'nconst': 'nm2368789',
   'primaryName': 'Zoë Kravitz',
   'birthYear': '1988',
   'deathYear': '\\N',
   'pri