# IMDb Dataset Preprocessing

This notebook walks through preprocessing the IMDb dataset: filtering the full IMDb metadata down to the MovieLens movies, then adding plots, cleaned locations, and poster URLs, as shown in the sections below.

Uncomment the cell below to install the dependencies required for this notebook.

In [1]:
# !pip install pyarrow -q
# !pip install fastparquet -q

### Datasets

This section loads several datasets: MovieLens movie information, the links file that maps each movie to its IMDb and TMDB IDs, and the full IMDb metadata.

In [2]:
import pandas as pd
import numpy as np

In [4]:
movies = pd.read_csv('../data/ml-latest-small/movies.csv')
movies.head(2)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy


In [5]:
links = pd.read_csv('../data/ml-latest-small/links.csv')
links.head(2)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
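
The `tmdbId` column displays as a float (`862.0`), which usually indicates that pandas found missing values in an otherwise integer column. A minimal sketch, assuming a small inline CSV as a stand-in for `links.csv` (the second row is made up), of reading such a column with the nullable `Int64` dtype instead:

```python
import io
import pandas as pd

# Hypothetical stand-in for links.csv; the second row is invented and has no tmdbId.
sample_csv = io.StringIO("movieId,imdbId,tmdbId\n1,114709,862\n999999,1234567,\n")

# The nullable "Int64" dtype keeps integers intact and stores gaps as <NA>.
sample_links = pd.read_csv(sample_csv, dtype={"tmdbId": "Int64"})
print(sample_links.dtypes)
```

This is optional for the rest of the notebook, which only uses `imdbId`.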


In [6]:
# Zero-pad each numeric IMDb ID to at least 7 digits and drop duplicates
ml_ttid = list(set(links.imdbId.apply(lambda x: "{:07d}".format(x))))

In [7]:
len(ml_ttid)

9742

In [8]:
df_meta = pd.read_parquet("s3://mlsl-imdb-data/raw-data/movies_llm.parquet")
df_meta.head(2)

Unnamed: 0,titleId,originalTitle,genres,plotLong,imdbRating,year,Actors,Directors,Producers,keyword,location
501,tt0000502,Bohemios,,,"{'numberOfVotes': 8.0, 'rating': 4.5}",1905.0,"[Antonio del Pozo, El Mochuelo]",[Ricardo de Baños],[],[based-on-zarzuela],[]
573,tt0000574,The Story of the Kelly Gang,"[Action, Adventure, Biography, Crime, Drama, H...",,"{'numberOfVotes': 623.0, 'rating': 6.1}",1906.0,"[Elizabeth Tait, John Tait, Norman Campbell, B...",[Charles Tait],"[W.A. Gibson, Millard Johnson, John Tait, Nevi...","[ned-kelly, australia, historic-figure, austra...","[Eltham, Victoria, Australia, Greensborough, V..."


In [9]:
df_meta.shape

(602895, 11)

In [10]:
ml_ttid = ['tt'+ttid for ttid in ml_ttid]
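
The zero-padding and the `tt` prefix are applied in two separate cells above. As a rough sketch, the same conversion can be written as one helper; `to_title_id` is a hypothetical name, not something defined in this notebook:

```python
def to_title_id(imdb_id: int) -> str:
    """Format a numeric MovieLens imdbId as an IMDb title ID."""
    # Pad to at least 7 digits, then add the 'tt' prefix.
    return f"tt{imdb_id:07d}"


sample_ids = [114709, 113497]
print([to_title_id(i) for i in sample_ids])
```

IDs that already have more than 7 digits are left unpadded, since `07d` only sets a minimum width.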

### Filter full movies dataset for only MovieLens movies

In [11]:
df_meta_small = df_meta[df_meta.titleId.isin(ml_ttid)]

In [12]:
df_meta_small.shape

(9086, 11)
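
The counts above suggest that 656 of the 9742 MovieLens IDs (9742 - 9086) have no match in the metadata. A minimal sketch of how the unmatched IDs could be listed, using toy stand-ins for `ml_ttid` and `df_meta` (the values are illustrative only):

```python
import pandas as pd

# Toy stand-ins for `ml_ttid` and `df_meta`; the values are illustrative only.
wanted_ids = ["tt0000001", "tt0000002", "tt0000003"]
meta = pd.DataFrame({"titleId": ["tt0000001", "tt0000003", "tt0000009"]})

# IDs requested from MovieLens that are absent from the metadata table.
missing_ids = sorted(set(wanted_ids) - set(meta["titleId"]))
print(len(missing_ids), missing_ids)
```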

### Retrieve IMDb plots

In [13]:
plots = pd.read_parquet('s3://mlsl-imdb-data/plotinfo/plotinfo.parquet')

In [14]:
plots.shape

(602895, 2)

In [15]:
plots.head(2)

Unnamed: 0,titleId,plot
501,tt0000502,
573,tt0000574,True story of notorious Australian outlaw Ned ...


In [16]:
len(set(df_meta_small.titleId).intersection(set(plots.titleId)))

9086

In [18]:
# Build a titleId -> plot lookup without iterating row by row
plot_dict = dict(zip(plots['titleId'], plots['plot']))

In [19]:
# Look up plots only for the filtered MovieLens rows, not the full metadata table
df_meta_small['plot'] = df_meta_small['titleId'].apply(lambda x: plot_dict[x] if x in plot_dict else None)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_meta_small['plot'] = df_meta_small['titleId'].apply(lambda x: plot_dict[x] if x in plot_dict else None)
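
The warning above appears because `df_meta_small` was created by filtering `df_meta`, so pandas cannot tell whether the assignment is meant to affect the original frame. One common way to avoid it, sketched here on toy data with hypothetical names (`meta`, `wanted_ids`, `plot_lookup`), is to take an explicit copy when filtering:

```python
import pandas as pd

# Toy stand-ins for `df_meta`, `ml_ttid`, and `plot_dict`.
meta = pd.DataFrame({"titleId": ["tt01", "tt02", "tt03"]})
wanted_ids = ["tt01", "tt03"]
plot_lookup = {"tt01": "First plot."}

# .copy() makes the filtered frame independent of `meta`,
# so adding a column to it is unambiguous.
subset = meta[meta["titleId"].isin(wanted_ids)].copy()

# Series.map looks each ID up in the dict; IDs without an entry become NaN.
subset["plot"] = subset["titleId"].map(plot_lookup)
print(subset)
```

The original `meta` frame is left untouched, and the new column lives only on `subset`.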


In [20]:
df_meta_small.head(2)

Unnamed: 0,titleId,originalTitle,genres,plotLong,imdbRating,year,Actors,Directors,Producers,keyword,location,plot
4963,tt0004972,The Birth of a Nation,"[Drama, History, War]",The first half of the film begins before the C...,"{'numberOfVotes': 22971.0, 'rating': 6.3}",1915.0,"[Lillian Gish, Mae Marsh, Henry B. Walthall, M...",[D.W. Griffith],[],"[ku-klux-klan, civil-war, reconstruction-era, ...","[Calexico, California, USA, Forest Lawn Hollyw...",The Stoneman family finds its friendship with ...
6324,tt0006333,"20,000 Leagues Under the Sea","[Action, Adventure, Sci-Fi]",,"{'numberOfVotes': 1545.0, 'rating': 6.2}",1916.0,"[Allen Holubar, Dan Hanlon, Edna Pendleton, Cu...",[Stuart Paton],[],"[national-film-registry, public-domain, based-...","[New Providence Island, Bahamas, Universal Stu...",A French professor and his daughter accompany ...


In [21]:
df_meta_small.isna().sum()

titleId             0
originalTitle       0
genres              0
plotLong         3818
imdbRating          0
year                0
Actors              0
Directors           0
Producers           0
keyword             0
location            0
plot                5
dtype: int64

In [22]:
df_meta_small['rating'] = df_meta_small['imdbRating'].apply(lambda x: x['rating'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_meta_small['rating'] = df_meta_small['imdbRating'].apply(lambda x: x['rating'])
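
The cell above pulls only `rating` out of the `imdbRating` dictionaries. If `numberOfVotes` were also needed, one possible approach is to expand every key into its own column. This is a sketch on a toy frame that reuses the two rows shown earlier, not a step of the original pipeline:

```python
import pandas as pd

# Toy frame mirroring the structure of the `imdbRating` column.
ratings = pd.DataFrame({
    "titleId": ["tt0004972", "tt0006333"],
    "imdbRating": [
        {"numberOfVotes": 22971.0, "rating": 6.3},
        {"numberOfVotes": 1545.0, "rating": 6.2},
    ],
})

# Build one column per dict key, keeping the original index for alignment.
expanded = pd.DataFrame(ratings["imdbRating"].tolist(), index=ratings.index)
ratings = ratings.join(expanded)
print(ratings[["titleId", "rating", "numberOfVotes"]])
```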


In [23]:
df_meta_small.head(2)

Unnamed: 0,titleId,originalTitle,genres,plotLong,imdbRating,year,Actors,Directors,Producers,keyword,location,plot,rating
4963,tt0004972,The Birth of a Nation,"[Drama, History, War]",The first half of the film begins before the C...,"{'numberOfVotes': 22971.0, 'rating': 6.3}",1915.0,"[Lillian Gish, Mae Marsh, Henry B. Walthall, M...",[D.W. Griffith],[],"[ku-klux-klan, civil-war, reconstruction-era, ...","[Calexico, California, USA, Forest Lawn Hollyw...",The Stoneman family finds its friendship with ...,6.3
6324,tt0006333,"20,000 Leagues Under the Sea","[Action, Adventure, Sci-Fi]",,"{'numberOfVotes': 1545.0, 'rating': 6.2}",1916.0,"[Allen Holubar, Dan Hanlon, Edna Pendleton, Cu...",[Stuart Paton],[],"[national-film-registry, public-domain, based-...","[New Providence Island, Bahamas, Universal Stu...",A French professor and his daughter accompany ...,6.2


In [24]:
df_meta_small.columns

Index(['titleId', 'originalTitle', 'genres', 'plotLong', 'imdbRating', 'year',
       'Actors', 'Directors', 'Producers', 'keyword', 'location', 'plot',
       'rating'],
      dtype='object')

In [25]:
df_meta_small['Actors']

4963       [Lillian Gish, Mae Marsh, Henry B. Walthall, M...
6324       [Allen Holubar, Dan Hanlon, Edna Pendleton, Cu...
6854       [Lillian Gish, Robert Harron, Mae Marsh, F.A. ...
7351       [Dorothy Cumming, Creighton Hale, Lionel Braha...
10024      [Mary Pickford, Milla Davenport, Percy Haswell...
                                 ...                        
6694945    [John David Washington, Adam Driver, Laura Har...
6787728    [Dana Carvey, Robert Smigel, Steve Carell, Ste...
6852490    [Fred Rogers, Joanne Rogers, John Rogers, Jim ...
6856582    [Trevor Jackson, Jason Mitchell, Lex Scott Dav...
7188202    [Brian Beacock, Ray Chase, Lucien Dodge, Carri...
Name: Actors, Length: 9086, dtype: object

In [26]:
df_meta_small[['titleId', 'originalTitle', 'genres', 'year',
       'Actors', 'Directors', 'Producers', 'keyword', 'location', 'plot',
       'rating']].to_parquet('s3://mlsl-imdb-data/imdb_ml_10k.parquet', index=False)

In [27]:
df = pd.read_parquet('s3://mlsl-imdb-data/imdb_ml_10k.parquet')

### Process location to remove duplicates in city/country names

In [41]:
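def process_location(loc_list):
    """Return the unique place names found in a movie's location strings."""
    if loc_list is None:
        return []
    new_places = []
    for loc in loc_list:
        # Keep at most the last three comma-separated parts of each location
        places = loc.split(',')[-3:]
        # Strip whitespace before deduplicating so ' USA' and 'USA' collapse
        new_places.extend(place.strip() for place in places)
    return list(set(new_places))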

In [42]:
df['location_new'] = df['location'].apply(process_location)
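
To make the effect of this step concrete, here is a small worked example. `dedupe_places` is a hypothetical variant of `process_location` that strips whitespace before deduplicating and sorts the result so the output is deterministic; the sample strings are made up:

```python
def dedupe_places(loc_list):
    """Hypothetical variant of process_location with a sorted, deterministic result."""
    if loc_list is None:
        return []
    places = set()
    for loc in loc_list:
        # Keep at most the last three comma-separated parts (e.g. city, state, country).
        places.update(part.strip() for part in loc.split(",")[-3:])
    return sorted(places)


sample = [
    "Calexico, California, USA",
    "Studio Lot, 1 Example Road, Los Angeles, California, USA",
]
print(dedupe_places(sample))
# ['Calexico', 'California', 'Los Angeles', 'USA']
```

Both locations end in `California, USA`, so those names appear only once in the result, and the street-level parts of the second string are dropped.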

### Add posters information

In [48]:
posters = pd.read_parquet("s3://mlsl-imdb-data/poster.parquet")

In [49]:
posters.head(2)

Unnamed: 0,titleId,image,url
0,tt0000001,"{'height': 800.0, 'url': 'https://m.media-amaz...",https://m.media-amazon.com/images/M/MV5BZmUzOW...
1,tt0000002,"{'height': 533.0, 'url': 'https://m.media-amaz...",https://m.media-amazon.com/images/M/MV5BZDI4ZD...


In [50]:
posters = dict(zip(posters.titleId, posters.url))

In [51]:
df['poster_url'] = df['titleId'].apply(lambda x: posters[x] if x in posters else None)
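
The dictionary lookup above could also be expressed as a left merge, which makes it easy to count movies without a poster. A minimal sketch with toy frames and placeholder URLs (`movies_toy` and `posters_toy` are hypothetical names):

```python
import pandas as pd

movies_toy = pd.DataFrame({"titleId": ["tt01", "tt02"]})
posters_toy = pd.DataFrame({
    "titleId": ["tt01", "tt09"],
    "url": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
})

# A left merge keeps every movie and leaves poster_url missing where no poster exists.
merged = movies_toy.merge(
    posters_toy.rename(columns={"url": "poster_url"}), on="titleId", how="left"
)
print(merged["poster_url"].isna().sum(), "movie(s) without a poster")
```

Unlike the dictionary lookup, a merge would duplicate a movie's row if the poster table contained more than one entry for the same `titleId`, so the dictionary approach is the safer choice when that cannot be ruled out.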

In [52]:
df.head(2)

Unnamed: 0,titleId,originalTitle,genres,year,Actors,Directors,Producers,keyword,location,plot,rating,location_new,poster_url
0,tt0004972,The Birth of a Nation,"[Drama, History, War]",1915.0,"[Lillian Gish, Mae Marsh, Henry B. Walthall, M...",[D.W. Griffith],[],"[ku-klux-klan, civil-war, reconstruction-era, ...","[Calexico, California, USA, Forest Lawn Hollyw...",The Stoneman family finds its friendship with ...,6.3,"[Thousand Oaks, Fullerton, Los Angeles, Califo...",https://m.media-amazon.com/images/M/MV5BYTM4ZD...
1,tt0006333,"20,000 Leagues Under the Sea","[Action, Adventure, Sci-Fi]",1916.0,"[Allen Holubar, Dan Hanlon, Edna Pendleton, Cu...",[Stuart Paton],[],"[national-film-registry, public-domain, based-...","[New Providence Island, Bahamas, Universal Stu...",A French professor and his daughter accompany ...,6.2,"[Bahamas, California, Universal City, USA, New...",https://m.media-amazon.com/images/M/MV5BMTQ0OT...


In [54]:
df_meta_small = df[['titleId', 'originalTitle', 'genres', 'year',
       'Actors', 'Directors', 'Producers', 'keyword', 'location_new', 'plot',
       'rating', 'poster_url']]

In [55]:
df_meta_small.rename(columns={'location_new': 'location'}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_meta_small.rename(columns={'location_new': 'location'}, inplace=True)


In [56]:
df_meta_small.head(2)

Unnamed: 0,titleId,originalTitle,genres,year,Actors,Directors,Producers,keyword,location,plot,rating,poster_url
0,tt0004972,The Birth of a Nation,"[Drama, History, War]",1915.0,"[Lillian Gish, Mae Marsh, Henry B. Walthall, M...",[D.W. Griffith],[],"[ku-klux-klan, civil-war, reconstruction-era, ...","[Thousand Oaks, Fullerton, Los Angeles, Califo...",The Stoneman family finds its friendship with ...,6.3,https://m.media-amazon.com/images/M/MV5BYTM4ZD...
1,tt0006333,"20,000 Leagues Under the Sea","[Action, Adventure, Sci-Fi]",1916.0,"[Allen Holubar, Dan Hanlon, Edna Pendleton, Cu...",[Stuart Paton],[],"[national-film-registry, public-domain, based-...","[Bahamas, California, Universal City, USA, New...",A French professor and his daughter accompany ...,6.2,https://m.media-amazon.com/images/M/MV5BMTQ0OT...


In [57]:
df_meta_small.to_parquet('s3://mlsl-imdb-data/imdb_ml_10k_posters.parquet', index=False)

The cell above writes the preprocessed data back to S3 as `imdb_ml_10k_posters.parquet`.