## 1. Importing Python Libraries

We shall start by importing the essential Python libraries.

In [1]:
### IMPORTING LIBRARIES
import numpy as np
import pandas as pd
import requests
import re

## 2. Importing the Top Rated Movies Dataframe

Now, we shall create a variable for the API key and also import the movies dataframe that we had pulled in the first session.

In [2]:
api_key = "API Key"
movies = pd.read_csv('tmdb_movies_data.csv')
movies.head(10)

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,False,/5hNcsnMkwU2LknLoru73c76el3z.jpg,"[35, 18, 10749]",19404,hi,दिलवाले दुल्हनिया ले जायेंगे,"Raj is a rich, carefree, happy-go-lucky second...",24.222,/2CAL2433ZeIihfX1Hb2139CX0pW.jpg,1995-10-20,Dilwale Dulhania Le Jayenge,False,8.7,3253
1,False,/iNh3BivHyg5sQRPP1KOkzguEX0H.jpg,"[18, 80]",278,en,The Shawshank Redemption,Framed in the 1940s for the double murder of h...,67.359,/q6y0Go1tsGEsmtFryDOJo3dEmqu.jpg,1994-09-23,The Shawshank Redemption,False,8.7,20172
2,False,/rSPw7tgCH9c6NqICZef4kZjFOQ5.jpg,"[18, 80]",238,en,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...",62.603,/eEslKSwcqmiNS6va24Pbxf2UKmJ.jpg,1972-03-14,The Godfather,False,8.7,15112
3,False,/jtAI6OJIWLWiRItNSZoWjrsUtmi.jpg,[10749],724089,en,Gabriel's Inferno Part II,Professor Gabriel Emerson finally learns the t...,10.796,/x5o8cLZfEXMoZczTYWLrUo1P7UJ.jpg,2020-07-31,Gabriel's Inferno Part II,False,8.7,1334
4,False,/fQq1FWp1rC89xDrRMuyFJdFUdMd.jpg,"[10749, 35]",761053,en,Gabriel's Inferno Part III,The final part of the film adaption of the ero...,34.804,/qtX2Fg9MTmrbgN1UUvGoCsImTM8.jpg,2020-11-19,Gabriel's Inferno Part III,False,8.6,901
5,False,/loRmRzQXZeqG78TqZuyvSlEQfZb.jpg,"[18, 36, 10752]",424,en,Schindler's List,The true story of how businessman Oskar Schind...,35.794,/sF1U4EUQS8YHUYjNl3pMGNIQyr0.jpg,1993-11-30,Schindler's List,False,8.6,12066
6,False,/w2uGvCpMtvRqZg6waC1hvLyZoJa.jpg,[10749],696374,en,Gabriel's Inferno,An intriguing and sinful exploration of seduct...,14.372,/oyG9TL7FcRP4EZ9Vid6uKzwdndz.jpg,2020-05-29,Gabriel's Inferno,False,8.6,2155
7,False,/3ggZWEoa2aegF6AYyjyNRm8noM5.jpg,"[18, 80]",240,en,The Godfather: Part II,In the continuing saga of the Corleone crime f...,40.244,/sSuQTCZwqKrNBNIsksO9IAUoWP9.jpg,1974-12-20,The Godfather: Part II,False,8.6,9097
8,False,/1EAxNqdkVnp48a7NUuNBHGflowM.jpg,"[16, 28, 878]",283566,ja,シン・エヴァンゲリオン劇場版:||,"In the aftermath of the Fourth Impact, strande...",153.382,/jDwZavHo99JtGsCyRzp4epeeBHx.jpg,2021-03-08,Evangelion: 3.0+1.0 Thrice Upon a Time,False,8.6,383
9,False,/l5K9elugftlcyIHHm4nylvsn26X.jpg,[18],255709,ko,소원,After 8-year-old So-won narrowly survives a br...,8.039,/x9yjkm9gIz5qI5fJMUTfBnWiB2o.jpg,2013-10-02,Hope,False,8.6,236


## 3. Manipulating the Genre IDs

Now, let us consider the genre of the first movie.

In [3]:
genre = movies['genre_ids'][0]
genre

'[35, 18, 10749]'

We see that although it has a bunch of genre ids, it is given as a string rather than a list of integers.

In [4]:
len(genre)

15

Here, the length should have been 3, but `len` returns 15, which confirms that the value is a string. Now, let us convert this string into a list of integers. For this, we first remove the square brackets at the beginning and at the end of the string. We then split the remaining string on commas and convert each piece to an integer to get a list.

In [5]:
genre = genre[1:-1] #removing the square brackets
genre = genre.split(", ") #splitting by commas to get the individual ids
genre = [int(item) for item in genre] #converting to get a list of integers
genre

[35, 18, 10749]

We can check to see if we indeed were able to convert the string to a list.

In [6]:
len(genre)

3

We've successfully obtained a list of genre ids. 
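
As an alternative to slicing and splitting by hand, the standard library can parse the string directly. The sketch below is not the method used in this notebook; it assumes the stored value is always a valid Python list literal, as in the first row shown earlier.

```python
import ast

# Sample value copied from the first row of the genre_ids column.
genre_str = '[35, 18, 10749]'

# literal_eval parses a Python literal (here, a list of integers)
# without executing arbitrary code.
genre_list = ast.literal_eval(genre_str)

print(genre_list)       # [35, 18, 10749]
print(len(genre_list))  # 3
```

An empty list stored as `'[]'` is parsed to `[]`, so no special handling for blank values is needed at this step.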

## 4. Extracting the Genre IDs for All the Top Rated Movies

Next, let us try to do this for all the movies using a loop. So for each movie in the list, we clean and split the genre ids.

Here, we face two problems:
1. Some movies have an empty genre list, which leaves an empty string once the brackets are removed. To solve this, we shall replace the empty string with 0.
2. After converting the string to a list, we see that some movies have more than 2 genre ids. For these, we shall keep only the first 2 genre ids. If a movie has fewer than 2 genre ids, we shall extend the list by adding two 0 values and then keep the first 2 entries.

We shall then build a list that holds the two genre ids of each movie, one pair per movie.

In [7]:
### GETTING GENRE IDS FOR ALL THE TOP RATED MOVIES
tmdb_genre_keys = []
for movie in range(0, len(movies)):
    genre_key = movies['genre_ids'][movie]
    genre_key = genre_key[1:-1] #removing the square brackets
    genre_key = genre_key.split(', ') #splitting by commas to get the individual ids
    for index, item in enumerate(genre_key):
        if item == '':
            genre_key[index] = 0 #replacing each blank genre id with 0
    genre_key = [int(float(item)) for item in genre_key] #converting to get a list of integers
    genre_key.extend([0, 0]) #padding so that at least two values exist
    genre_key = genre_key[0:2] #keeping the first two genre ids
    tmdb_genre_keys.append([genre_key[0], genre_key[1]])
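
The per-movie logic above can also be wrapped in a small helper, which makes the three cases (more than two ids, one id, no ids) easy to check in isolation. This is a minimal sketch: the function name `first_two_genres` is hypothetical, and it assumes every stored value is a JSON-style list such as `'[35, 18]'` or `'[]'`.

```python
import json

def first_two_genres(raw):
    """Return the first two genre ids in `raw`, padded with 0 if needed."""
    ids = [int(i) for i in json.loads(raw)]  # '[35, 18]' -> [35, 18]
    return (ids + [0, 0])[:2]                # pad, then keep two values

print(first_two_genres('[35, 18, 10749]'))  # [35, 18]
print(first_two_genres('[10749]'))          # [10749, 0]
print(first_two_genres('[]'))               # [0, 0]
```

Padding before slicing removes the need for separate branches for short and long lists.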

We then convert this list into a pandas dataframe _tmdb_genre_keys_.

In [8]:
### CREATING A PANDAS DATAFRAME FOR GENRE IDS
tmdb_genre_keys = pd.DataFrame(tmdb_genre_keys, columns = ['genre_1', 'genre_2'])
tmdb_genre_keys

Unnamed: 0,genre_1,genre_2
0,35,18
1,18,80
2,18,80
3,10749,0
4,10749,35
...,...,...
9355,28,14
9356,28,12
9357,27,28
9358,28,12


## 5. Pulling the Genre ID Tags Using the API Key

Next, we shall use our API key to pull the genre id tags from the TMDB website. The API documentation at https://developers.themoviedb.org lists the available urls in its left column. We go to the _GENRES_ section, select the url for _'Get Movie List'_ and attach the API key as directed. We then extract information from the response object we get. The documentation shows that the json in the response object contains an array which has the information we need. Let us make a pandas dataframe _tmdb_genre_list_ from this response.

In [9]:
### EXTRACTING TAGS FOR MOVIE GENRE IDS
url = "https://api.themoviedb.org/3/genre/movie/list?api_key=" + api_key + "&language=en-US"
response = requests.get(url)
tmdb_genre_list = pd.DataFrame(response.json())
tmdb_genre_list

Unnamed: 0,genres
0,"{'id': 28, 'name': 'Action'}"
1,"{'id': 12, 'name': 'Adventure'}"
2,"{'id': 16, 'name': 'Animation'}"
3,"{'id': 35, 'name': 'Comedy'}"
4,"{'id': 80, 'name': 'Crime'}"
5,"{'id': 99, 'name': 'Documentary'}"
6,"{'id': 18, 'name': 'Drama'}"
7,"{'id': 10751, 'name': 'Family'}"
8,"{'id': 14, 'name': 'Fantasy'}"
9,"{'id': 36, 'name': 'History'}"
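
The request url can also be assembled from a dictionary of query parameters instead of string concatenation, which handles the encoding of special characters. The sketch below uses only the standard library; the helper name `build_genre_url` and the placeholder key `YOUR_KEY` are hypothetical. `requests.get` accepts a `params` dictionary that performs the same encoding.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.themoviedb.org/3/genre/movie/list"

def build_genre_url(api_key, language="en-US"):
    """Build the genre-list url with an encoded query string."""
    query = urlencode({"api_key": api_key, "language": language})
    return BASE_URL + "?" + query

# YOUR_KEY is a placeholder, not a working key.
print(build_genre_url("YOUR_KEY"))
# https://api.themoviedb.org/3/genre/movie/list?api_key=YOUR_KEY&language=en-US
```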


## 6. Cleaning the Genre ID Tags Data

Now, let us clean this data so that we can get a dataframe which has genre ids in one column and the corresponding genre tag in the other column. For this, we shall do the following:

1. Convert each row value to a string.
2. Remove all punctuation.
3. Remove the term _'id'_ by splitting the string on that term and taking the second value.
4. Remove the term _'name'_ from the remaining string by splitting on that term, which leaves the genre id as the first value and the genre name as the second value.
5. Convert the genre id into an integer value and append the id and name to a list.

We shall do this for all the genres in _tmdb_genre_list_ by using a for-loop. Lastly, since we added 0 whenever a movie had a missing genre id or fewer than 2 genres, we shall also add another row with genre id 0 and genre name _'None'_.

In [10]:
### CLEANING THE TAGS DATA
tmdb_genre_tags = []
for genre_index in range(0, len(tmdb_genre_list)):
    genre = tmdb_genre_list['genres'][genre_index]
    genre = str(genre) #1
    genre = re.sub(r'[^\w\s]+', '', genre) #2
    genre = list(genre.split('id '))[1] #3
    genre = list(genre.split(' name ')) #4
    tmdb_genre_tags.append([int(float(genre[0])), genre[1]]) #5
tmdb_genre_tags.append([0, 'None']) #adding the genre tag for 0
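
Since each entry in the _genres_ array appears to be a dictionary with the keys `id` and `name`, the two fields can also be read directly, without converting to a string first. This is an alternative sketch rather than the approach used above; the `payload` below is a hand-written sample shaped like the first rows of the response shown earlier, not a live API result.

```python
# Hand-written sample shaped like the start of the API response.
payload = {
    "genres": [
        {"id": 28, "name": "Action"},
        {"id": 12, "name": "Adventure"},
        {"id": 16, "name": "Animation"},
    ]
}

# Read the id and name fields directly from each dictionary.
genre_tags = [[entry["id"], entry["name"]] for entry in payload["genres"]]
genre_tags.append([0, "None"])  # tag for the padding value 0

print(genre_tags)
# [[28, 'Action'], [12, 'Adventure'], [16, 'Animation'], [0, 'None']]
```

Reading the fields directly would also keep a genre name intact if it happened to contain punctuation, which the regular expression in step 2 would strip.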

Let us convert this list into a pandas dataframe _tmdb_genre_tags_.

In [11]:
### CREATING A PANDAS DATAFRAME FOR GENRE ID TAGS 
tmdb_genre_tags = pd.DataFrame(tmdb_genre_tags, columns = ['genre_id', 'genre_names'])
tmdb_genre_tags

Unnamed: 0,genre_id,genre_names
0,28,Action
1,12,Adventure
2,16,Animation
3,35,Comedy
4,80,Crime
5,99,Documentary
6,18,Drama
7,10751,Family
8,14,Fantasy
9,36,History


## 7. Replacing the Genre IDs with their Respective Tags

First, we take the two columns of _tmdb_genre_tags_ and store these values in two separate lists: _genre_keys_ and _genre_names_.

In [12]:
### STORING THE COLUMNS OF GENRE ID TAG DATAFRAME
genre_keys = list(tmdb_genre_tags['genre_id'])
genre_names = list(tmdb_genre_tags['genre_names'])

Then, we shall take the dataframe _tmdb_genre_keys_ and in each of its columns, we replace the genre id with the genre name using the lists _genre_keys_ and _genre_names_.

In [13]:
### REPLACING IDS WITH THEIR RESPECTIVE TAGS
tmdb_genre_keys['genre_1'] = tmdb_genre_keys['genre_1'].replace(genre_keys, genre_names)
tmdb_genre_keys['genre_2'] = tmdb_genre_keys['genre_2'].replace(genre_keys, genre_names)
tmdb_genre_keys

Unnamed: 0,genre_1,genre_2
0,Comedy,Drama
1,Drama,Crime
2,Drama,Crime
3,Romance,
4,Romance,Comedy
...,...,...
9355,Action,Fantasy
9356,Action,Adventure
9357,Horror,Action
9358,Action,Adventure
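
The two parallel lists can also be combined into a single dictionary, which makes the id-to-name lookup explicit. The sketch below uses a few hand-picked ids from the tables above and plain Python lists in place of the dataframe columns; the variable names and the `"Unknown"` fallback are illustrative assumptions. The same dictionary could be passed to pandas `Series.replace` or `Series.map`.

```python
# A few ids and names taken from the genre tag table, plus the 0 padding tag.
genre_keys = [28, 12, 35, 18, 0]
genre_names = ["Action", "Adventure", "Comedy", "Drama", "None"]

# Pair each id with its name.
id_to_name = dict(zip(genre_keys, genre_names))

# Stand-ins for the genre_1 and genre_2 columns.
genre_1 = [35, 18, 28]
genre_2 = [18, 0, 12]

# Look up each id; ids missing from the table fall back to "Unknown".
named_1 = [id_to_name.get(g, "Unknown") for g in genre_1]
named_2 = [id_to_name.get(g, "Unknown") for g in genre_2]

print(named_1)  # ['Comedy', 'Drama', 'Action']
print(named_2)  # ['Drama', 'None', 'Adventure']
```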


## 8. Saving the Genres Dataframe

Lastly, we store this dataframe in CSV format.

In [14]:
tmdb_genre_keys.to_csv('tmdb_genres.csv', index = False)