# Getting Data

In [67]:
import requests

The get_data() function queries The Movie Database (TMDB) API with a GET request and returns the raw response object. By default, the input is an IMDb ID and the request goes to the /find endpoint. When the optional flag parameter is True, the input is treated as a TMDB movie ID and the request goes to the /movie endpoint instead. If the status code is not 200, the function prints an error message with the ID and still returns the response.

In [68]:
# Websites sometimes block automated requests, for example because of
# authentication errors, so a browser-like User-Agent header is sent.

def get_data(ex_id, flag=False):
    needed_headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}

    # url = "https://api.themoviedb.org/3/movie/popular?api_key=81b49b634e27bb42bfe9fd2a462146c0&page=1"
    if flag:
        # ex_id is a TMDB movie ID
        url = f"https://api.themoviedb.org/3/movie/{ex_id}?api_key=81b49b634e27bb42bfe9fd2a462146c0"
    else:
        # ex_id is an IMDb ID
        url = f"https://api.themoviedb.org/3/find/{ex_id}?api_key=81b49b634e27bb42bfe9fd2a462146c0&language=en-US&external_source=imdb_id"
    response = requests.get(url, headers=needed_headers)
    if response.status_code != 200:
        print('Error', ex_id)
    return response
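
The two request shapes described above can also be built without embedding the key in the URL string. The following is only a sketch: `build_request`, `BASE_URL`, and the `TMDB_API_KEY` environment variable are hypothetical names that do not appear in the notebook, and the function only assembles the URL and query parameters without contacting the API.

```python
import os

BASE_URL = "https://api.themoviedb.org/3"


def build_request(ex_id, by_movie_id=False, api_key=None):
    """Return (url, params) for a lookup; nothing is sent over the network."""
    # TMDB_API_KEY is a hypothetical environment variable name.
    key = api_key or os.environ.get("TMDB_API_KEY", "")
    if by_movie_id:
        # Same endpoint as the flag=True branch of get_data().
        return f"{BASE_URL}/movie/{ex_id}", {"api_key": key}
    # Same endpoint as the default branch: look up by IMDb ID.
    params = {"api_key": key, "language": "en-US", "external_source": "imdb_id"}
    return f"{BASE_URL}/find/{ex_id}", params


url, params = build_request("tt0075314", api_key="demo-key")
print(url)
print(params)
```

The pair can then be passed on as `requests.get(url, params=params, headers=needed_headers)`, which lets requests handle the query-string encoding.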

The get_json() function takes an IMDb ID and makes two calls to get_data(). The first call looks up the IMDb ID, and the function extracts the TMDB movie ID from the first entry in movie_results. The second call retrieves the full movie record using that ID. The function returns the record as a dictionary, or an empty dictionary if the lookup fails or the movie's revenue is 0.

In [88]:
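def get_json(imdb_id):
    response = get_data(imdb_id)
    try:
        # The first match for this IMDb ID gives the TMDB movie ID
        api_id = response.json()['movie_results'][0]['id']
        response = get_data(api_id, True)
        data = response.json()
        # Keep only movies with a reported revenue
        if data['revenue'] != 0:
            return data
        return {}
    except (requests.RequestException, KeyError, IndexError, ValueError):
        # Request failed, no match was found, or the JSON was not as expected
        return {}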
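
The two-step lookup can be tried offline by separating the parsing from the network calls. This is a minimal sketch under stated assumptions: `extract_movie` and the hand-written payloads are hypothetical stand-ins, and only the `movie_results`, `id`, and `revenue` keys are taken from the notebook code.

```python
def extract_movie(find_payload, fetch_details):
    """Mirror get_json(): resolve the TMDB id, fetch details, require revenue."""
    results = find_payload.get("movie_results") or []
    if not results:
        return {}  # the IMDb ID matched no movie
    details = fetch_details(results[0]["id"]) or {}
    if details.get("revenue", 0) == 0:
        return {}  # missing or zero revenue is treated as "no data"
    return details


# Hand-written payloads standing in for real API responses.
fake_find = {"movie_results": [{"id": 103}]}
fake_details = {103: {"id": 103, "title": "Taxi Driver", "revenue": 28570902}}

print(extract_movie(fake_find, fake_details.get))
print(extract_movie({"movie_results": []}, fake_details.get))
```

Passing the fetch step in as an argument keeps the decision logic testable without an API key.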

In [91]:
# data = get_json('tt6359806')

In [4]:
import pandas as pd

In [5]:
# Load the CSV file to get the IDs
df = pd.read_csv(r'titles.csv.zip')

In [6]:
# Preview the non-missing IMDb IDs (the result is not stored)
df['imdb_id'].dropna()

1        tt0075314
2        tt0068473
3        tt0071853
4        tt0061578
5        tt0063929
           ...    
5843    tt14216488
5845    tt13857480
5846    tt11803618
5847    tt14585902
5849    tt13711094
Name: imdb_id, Length: 5447, dtype: object

In [7]:
# Remove rows with missing IMDb IDs

df = df[~df['imdb_id'].isna()]

In [10]:
# Check that these are the IDs
df['imdb_id']

1        tt0075314
2        tt0068473
3        tt0071853
4        tt0061578
5        tt0063929
           ...    
5843    tt14216488
5845    tt13857480
5846    tt11803618
5847    tt14585902
5849    tt13711094
Name: imdb_id, Length: 5447, dtype: object

In [15]:
# Retrieve movie data from The Movie Database API for each IMDb ID in the
# DataFrame. progress_apply() is added to pandas by tqdm.pandas(), which must
# be called before this cell runs. Store the resulting dictionaries in a
# Series called json_str.
from tqdm.notebook import tqdm
tqdm.pandas()
json_str = df['imdb_id'].progress_apply(get_json)

  0%|          | 0/5447 [00:00<?, ?it/s]

In [32]:
# Convert the Series to a list of dictionaries
data = json_str.tolist()

In [33]:
# Normalize the data into a Pandas DataFrame
df_imdb = pd.json_normalize(data)

In [34]:
# Save the DataFrame to a CSV file
df_imdb.to_csv('df_imbd.csv', index=False)

In [35]:
# View dataset
df_imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5447 entries, 0 to 5446
Data columns (total 29 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   adult                                3364 non-null   object 
 1   backdrop_path                        3136 non-null   object 
 2   belongs_to_collection                0 non-null      float64
 3   budget                               3364 non-null   float64
 4   genres                               3364 non-null   object 
 5   homepage                             3364 non-null   object 
 6   id                                   3364 non-null   float64
 7   imdb_id                              3364 non-null   object 
 8   original_language                    3364 non-null   object 
 9   original_title                       3364 non-null   object 
 10  overview                             3364 non-null   object 
 11  popularity                    

In [36]:
# View the dataframe
df_imdb.head()

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,status,tagline,title,video,vote_average,vote_count,belongs_to_collection.id,belongs_to_collection.name,belongs_to_collection.poster_path,belongs_to_collection.backdrop_path
0,False,/orjyEE9ZcMefTsN8zT5ryQTdkIz.jpg,,1300000.0,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",,103.0,tt0075314,en,Taxi Driver,...,Released,"On every street in every city, there's a nobod...",Taxi Driver,False,8.2,10564.0,,,,
1,False,/jOaxen3pkKtD6vJm8sdHY6uVY3A.jpg,,2000000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 12, 'name...",http://www.warnerbros.com/deliverance,10669.0,tt0068473,en,Deliverance,...,Released,What did happen on the Cahulawassee River?,Deliverance,False,7.317,1340.0,,,,
2,False,/nWkxOXpctN2NuToGzwCSdFdiht0.jpg,,400000.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 35, '...",https://www.sonypictures.com/movies/montypytho...,762.0,tt0071853,en,Monty Python and the Holy Grail,...,Released,And now! At Last! Another film completely diff...,Monty Python and the Holy Grail,False,7.804,5076.0,,,,
3,False,/ccegwWa7I2vxxaoXrGB8Wl1g5nl.jpg,,5400000.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,1654.0,tt0061578,en,The Dirty Dozen,...,Released,Train them! Excite them! Arm them!...Then turn...,The Dirty Dozen,False,7.611,1025.0,124492.0,The Dirty Dozen Collection,/3tNe3I17eUjQkHYz7XButEBlupJ.jpg,/z5Wj3wkksrRjyjhZ2flr00gSjTR.jpg
4,,,,,,,,,,,...,,,,,,,,,,


In [37]:
# Remove the columns that we won't need
df_imdb = df_imdb.drop(['adult','backdrop_path',
                        'belongs_to_collection', 
                        'video','belongs_to_collection.id',
                        'belongs_to_collection.name',
                        'poster_path',
                        'belongs_to_collection.poster_path',
                        'belongs_to_collection.backdrop_path'], axis=1)

In [38]:
# Check columns 
df_imdb.columns

Index(['budget', 'genres', 'homepage', 'id', 'imdb_id', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

In [40]:
# Check dataframe
df_imdb.head()

Unnamed: 0,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,1300000.0,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",,103.0,tt0075314,en,Taxi Driver,A mentally unstable Vietnam War veteran works ...,49.645,"[{'id': 46059, 'logo_path': None, 'name': 'Ita...","[{'iso_3166_1': 'US', 'name': 'United States o...",1976-02-09,28570902.0,114.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,"On every street in every city, there's a nobod...",Taxi Driver,8.2,10564.0
1,2000000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 12, 'name...",http://www.warnerbros.com/deliverance,10669.0,tt0068473,en,Deliverance,Intent on seeing the Cahulawassee River before...,16.623,"[{'id': 174, 'logo_path': '/IuAlhI9eVC9Z8UQWOI...","[{'iso_3166_1': 'US', 'name': 'United States o...",1972-08-18,46122355.0,109.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,What did happen on the Cahulawassee River?,Deliverance,7.317,1340.0
2,400000.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 35, '...",https://www.sonypictures.com/movies/montypytho...,762.0,tt0071853,en,Monty Python and the Holy Grail,"King Arthur, accompanied by his squire, recrui...",27.025,"[{'id': 416, 'logo_path': None, 'name': 'Pytho...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}]",1975-05-25,1940906.0,91.0,"[{'english_name': 'French', 'iso_639_1': 'fr',...",Released,And now! At Last! Another film completely diff...,Monty Python and the Holy Grail,7.804,5076.0
3,5400000.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,1654.0,tt0061578,en,The Dirty Dozen,12 American military prisoners in World War II...,21.143,"[{'id': 14159, 'logo_path': None, 'name': 'Sev...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",1967-06-15,45300000.0,150.0,"[{'english_name': 'Latin', 'iso_639_1': 'la', ...",Released,Train them! Excite them! Arm them!...Then turn...,The Dirty Dozen,7.611,1025.0
4,,,,,,,,,,,,,,,,,,,,


In [39]:
# Save this final DataFrame as CSV
# Work on the JSON columns and get the necessary data

# New Code

In [49]:
import pandas as pd
from tqdm.notebook import tqdm
tqdm.pandas()

In [25]:
df = pd.read_csv('df_imbd.csv')

In [26]:
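# Note: 'vote_average' is listed twice, which is why the column appears twice
# in the outputs below; 'vote_count' may have been intended.
imp_cols = ['adult', 'budget', 'genres', 'imdb_id', 'original_language', 'original_title', 'overview',
            'popularity', 'production_companies', 'production_countries', 'revenue', 'runtime',
            'spoken_languages', 'status', 'video', 'vote_average', 'vote_average']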

In [27]:
df = df[imp_cols]

In [28]:
df.isna().sum()

adult                   2083
budget                  2083
genres                  2083
imdb_id                 2083
original_language       2083
original_title          2083
overview                2101
popularity              2083
production_companies    2083
production_countries    2083
revenue                 2083
runtime                 2083
spoken_languages        2083
status                  2083
video                   2083
vote_average            2083
vote_average            2083
dtype: int64

In [30]:
df = df.dropna()

In [31]:
df

Unnamed: 0,adult,budget,genres,imdb_id,original_language,original_title,overview,popularity,production_companies,production_countries,revenue,runtime,spoken_languages,status,video,vote_average,vote_average.1
0,False,1300000.0,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",tt0075314,en,Taxi Driver,A mentally unstable Vietnam War veteran works ...,49.645,"[{'id': 46059, 'logo_path': None, 'name': 'Ita...","[{'iso_3166_1': 'US', 'name': 'United States o...",28570902.0,114.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,False,8.200,8.200
1,False,2000000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 12, 'name...",tt0068473,en,Deliverance,Intent on seeing the Cahulawassee River before...,16.623,"[{'id': 174, 'logo_path': '/IuAlhI9eVC9Z8UQWOI...","[{'iso_3166_1': 'US', 'name': 'United States o...",46122355.0,109.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,False,7.317,7.317
2,False,400000.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 35, '...",tt0071853,en,Monty Python and the Holy Grail,"King Arthur, accompanied by his squire, recrui...",27.025,"[{'id': 416, 'logo_path': None, 'name': 'Pytho...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}]",1940906.0,91.0,"[{'english_name': 'French', 'iso_639_1': 'fr',...",Released,False,7.804,7.804
3,False,5400000.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",tt0061578,en,The Dirty Dozen,12 American military prisoners in World War II...,21.143,"[{'id': 14159, 'logo_path': None, 'name': 'Sev...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",45300000.0,150.0,"[{'english_name': 'Latin', 'iso_639_1': 'la', ...",Released,False,7.611,7.611
5,False,4000000.0,"[{'id': 35, 'name': 'Comedy'}]",tt0079470,en,Life of Brian,"Brian Cohen is an average young Jewish man, bu...",22.461,"[{'id': 20076, 'logo_path': '/i9qXGJIP9fGN22PP...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}]",20745728.0,94.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,False,7.755,7.755
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5441,False,0.0,"[{'id': 16, 'name': 'Animation'}]",tt14586752,en,Super Monsters: Once Upon a Rhyme,"From Goldilocks to Hansel and Gretel, the Supe...",12.941,[],"[{'iso_3166_1': 'US', 'name': 'United States o...",0.0,0.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,False,6.200,6.200
5442,False,0.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",tt14216488,ar,عروستي,The story follows a young man and woman who go...,2.448,[],"[{'iso_3166_1': 'EG', 'name': 'Egypt'}]",0.0,93.0,"[{'english_name': 'Arabic', 'iso_639_1': 'ar',...",Released,False,5.600,5.600
5443,False,0.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 18, ...",tt13857480,en,Fine Wine,After falling for a woman much younger than hi...,2.197,[],"[{'iso_3166_1': 'NG', 'name': 'Nigeria'}]",0.0,130.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,False,2.000,2.000
5444,False,0.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 18, ...",tt11803618,ta,கேர் ஆஃப் காதல்,A heartwarming film that explores the concept ...,3.549,"[{'id': 86205, 'logo_path': '/cuepc2zGdotXIuZj...","[{'iso_3166_1': 'IN', 'name': 'India'}]",0.0,134.0,"[{'english_name': 'Tamil', 'iso_639_1': 'ta', ...",Released,False,6.000,6.000


In [36]:
[d['name'] for d in eval(df['genres'][0])]

['Crime', 'Drama']

In [40]:
[d['name'] for d in eval(df['production_countries'][0])]

['United States of America']

In [41]:
eval(df['production_countries'][0])

[{'iso_3166_1': 'US', 'name': 'United States of America'}]

In [42]:
[d['name'] for d in eval(df['production_companies'][0])]

['Italo/Judeo Productions', 'Bill/Phillips', 'Columbia Pictures']

In [44]:
[d['name'] for d in eval(df['spoken_languages'][0])]

['English', 'Español']

In [45]:
eval(df['spoken_languages'][0])

[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'},
 {'english_name': 'Spanish', 'iso_639_1': 'es', 'name': 'Español'}]

In [46]:
import ast

def get_val(dictionary_list):
    # Parse the stringified list of dicts (literals only) and keep the names
    return [d['name'] for d in ast.literal_eval(dictionary_list)]
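
The name-extraction step can be sketched with `ast.literal_eval`, which parses Python literals only and does not execute arbitrary expressions the way `eval` does. The helper name `names_from_cell` is hypothetical, and returning an empty list for unparsable cells is an assumption about the desired behavior, not something the notebook specifies.

```python
import ast


def names_from_cell(cell):
    """Parse a stringified list of dicts and return the 'name' values."""
    try:
        items = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # the cell is not a valid Python literal
    if not isinstance(items, list):
        return []
    return [d["name"] for d in items if isinstance(d, dict) and "name" in d]


cell = "[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}]"
print(names_from_cell(cell))           # ['Crime', 'Drama']
print(names_from_cell("not a list"))   # []
```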

In [50]:
df['genres'] = df['genres'].progress_apply(get_val)

  0%|          | 0/3346 [00:00<?, ?it/s]

In [52]:
df['production_countries'] = df['production_countries'].progress_apply(get_val)
df['production_companies'] = df['production_companies'].progress_apply(get_val)
df['spoken_languages'] = df['spoken_languages'].progress_apply(get_val)

  0%|          | 0/3346 [00:00<?, ?it/s]

  0%|          | 0/3346 [00:00<?, ?it/s]

  0%|          | 0/3346 [00:00<?, ?it/s]

In [55]:
df[df['revenue']!=0]

Unnamed: 0,adult,budget,genres,imdb_id,original_language,original_title,overview,popularity,production_companies,production_countries,revenue,runtime,spoken_languages,status,video,vote_average,vote_average.1
0,False,1300000.0,"[Crime, Drama]",tt0075314,en,Taxi Driver,A mentally unstable Vietnam War veteran works ...,49.645,"[Italo/Judeo Productions, Bill/Phillips, Colum...",[United States of America],28570902.0,114.0,"[English, Español]",Released,False,8.200,8.200
1,False,2000000.0,"[Drama, Adventure, Thriller]",tt0068473,en,Deliverance,Intent on seeing the Cahulawassee River before...,16.623,"[Warner Bros. Pictures, Elmer Enterprises]",[United States of America],46122355.0,109.0,[English],Released,False,7.317,7.317
2,False,400000.0,"[Adventure, Comedy, Fantasy]",tt0071853,en,Monty Python and the Holy Grail,"King Arthur, accompanied by his squire, recrui...",27.025,"[Python (Monty) Pictures Limited, Michael Whit...",[United Kingdom],1940906.0,91.0,"[Français, Latin, English]",Released,False,7.804,7.804
3,False,5400000.0,"[Action, Adventure, War]",tt0061578,en,The Dirty Dozen,12 American military prisoners in World War II...,21.143,"[Seven Arts Pictures, MKH, Metro-Goldwyn-Mayer]","[United Kingdom, United States of America]",45300000.0,150.0,"[Latin, English, Deutsch, Français, Español]",Released,False,7.611,7.611
5,False,4000000.0,[Comedy],tt0079470,en,Life of Brian,"Brian Cohen is an average young Jewish man, bu...",22.461,"[HandMade Films, Python (Monty) Pictures Limited]",[United Kingdom],20745728.0,94.0,"[English, Latin]",Released,False,7.755,7.755
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5180,False,0.0,"[Animation, Adventure, Family, Comedy]",tt9038290,de,Latte Igel und der magische Wasserstein,When a greedy bear steals a magic stone to kee...,20.206,"[Dreamin' Dolphin Film, Eagle Eye Filmprodukti...","[Belgium, Germany]",4983582.0,89.0,[Deutsch],Released,False,6.700,6.700
5207,False,0.0,"[Crime, Drama]",tt13356884,ja,ヤクザと家族 The Family,"Taken in by the yakuza at a young age, Kenji s...",9.253,"[STAR SANDS, KADOKAWA]",[Japan],2168905.0,136.0,"[普通话, 日本語]",Released,False,7.200,7.200
5274,False,0.0,"[Action, Thriller]",tt11503178,te,Wild Dog,Wild Dog aka Vijay Varma is an NIA agent who’s...,4.645,[Matinee Entertainment],[India],32519.0,122.0,[తెలుగు],Released,False,5.067,5.067
5304,False,0.0,"[Comedy, Science Fiction, Adventure]",tt13043436,ar,الإنس والنمس,The story follows a poor government employee w...,2.554,[Al-Masa Media Production],[Egypt],3487821.0,120.0,[العربية],Released,False,4.100,4.100


In [57]:
dd = pd.read_csv(r'D:\OneDrive - NITT\Desktop\full_proj_v2\data.tsv',delimiter='\t')


In [58]:
dd.shape

(35704646, 8)

In [78]:
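# Unique title IDs in file order. Unlike list(set(...)), the order is the same
# on every run, so the seeded sample below can be reproduced.
imdb_id = dd['titleId'].unique().tolist()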

In [108]:
import random
random.seed(100)

In [110]:
small_imdb_id = random.sample(imdb_id,50000)
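
A seed only makes the sample repeatable if the list being sampled is in the same order each time, and the iteration order of a set of strings can differ between Python processes. The sketch below shows one way to pin the order down. The function name `reproducible_sample` and the short ID list are illustrative, not part of the notebook.

```python
import random


def reproducible_sample(ids, k, seed=100):
    """Sample k unique IDs; the result depends only on the ID values and the seed."""
    pool = sorted(set(ids))    # fixed order, independent of set iteration order
    rng = random.Random(seed)  # local generator, leaves the global state alone
    return rng.sample(pool, k)


ids = ["tt0075314", "tt0068473", "tt0071853", "tt0061578", "tt0063929", "tt0075314"]
first = reproducible_sample(ids, 3)
second = reproducible_sample(list(reversed(ids)), 3)
print(first == second)  # True
```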

In [115]:
pd.DataFrame(small_imdb_id[:25000]).to_csv('imdb1.csv')

In [None]:
data_all = []
for id_ in tqdm(small_imdb_id[25000:]):
    json_data = get_json(id_)
    data_all.append(json_data)

  0%|          | 0/25000 [00:00<?, ?it/s]

In [None]:
import json

with open('data.json', 'w') as fp:
    json.dump(data_all, fp)
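
The loop above keeps all 25,000 results in memory and writes them only at the end, so an interruption would lose everything fetched so far. A possible alternative, sketched here with hypothetical names (`fetch_all`, `fake_fetch`, `data.jsonl`), appends one JSON line per ID and skips IDs already on disk when the run is restarted. `fake_fetch` stands in for get_json() so the example runs without network access.

```python
import json
import os
import tempfile


def fetch_all(ids, fetch, path):
    """Append one JSON line per ID, skipping IDs already present in the file."""
    done = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fp:
            done = {json.loads(line)["imdb_id"] for line in fp if line.strip()}
    with open(path, "a", encoding="utf-8") as fp:
        for id_ in ids:
            if id_ in done:
                continue
            record = {"imdb_id": id_, "data": fetch(id_)}
            fp.write(json.dumps(record) + "\n")
            fp.flush()  # push each record out of Python's buffer right away
    return path


def fake_fetch(id_):
    # Stand-in for get_json(); no network access.
    return {"title": f"movie {id_}"}


out = os.path.join(tempfile.mkdtemp(), "data.jsonl")
fetch_all(["tt1", "tt2"], fake_fetch, out)
fetch_all(["tt1", "tt2", "tt3"], fake_fetch, out)  # only tt3 is fetched now
with open(out, encoding="utf-8") as fp:
    print(sum(1 for _ in fp))  # 3
```

Each line can later be read back with `json.loads`, or the whole file loaded with `pd.read_json(path, lines=True)`.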