# Data preprocessing for actors analysis

Since our analysis focuses on the career paths of actors, we need a database of actors. Alongside their movies, we want each actor's popularity score from the TMDB database. An actor's 3 most famous movies are essential to the analysis, since we assume they are correlated with the actor's success.  
This gives us access to the release dates of their most popular movies, their genres, and more. This information will help us understand the patterns behind an actor's success.

In [2]:
import requests
import json
import pandas as pd

api_key = "YOUR_API_KEY"


This part of the code fetches the popular-people data from TMDB and saves it as `tmdb_actors_db.json`. It only needs to be run once, since the result is saved in the folder `Data/tmdb_resources`.

In [48]:
# Used to fetch data from TMDB API and save to a JSON file. Needed to be run only once.

url = 'https://api.themoviedb.org/3/person/popular'
headers = {
    "Authorization": f"Bearer {api_key}",
    "accept": "application/json"
}

def fetch_page(page):
    params = {'language': 'en-US', 'page': page}
    response = requests.get(url, params=params, headers=headers)
    
    if response.status_code == 200:
        data = response.json()
        print(f"Page {page} processed.")
        return data['results']
    else:
        print(f"Error: {response.status_code}, {response.text}")
        return []
    
def fetch_all_pages():
    all_data = []
    for page in range(1, 500):  # pages 1 to 499 (range excludes the end value)
        page_results = fetch_page(page)
        all_data.extend(page_results)

    return all_data

def fetch_process():
    all_data = fetch_all_pages()

    # Save all data to a single file
    with open('../Data/tmdb_resources/tmdb_actors_db.json', 'w') as f:
        json.dump({'results': all_data}, f)

    print("Data fetching and appending complete.")

fetch_process()
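
Since the fetch only needs to happen once, one option is to guard it with a check on the saved file. The sketch below is a minimal illustration, not part of the original notebook: `load_or_fetch` is a hypothetical helper, and it assumes the cache file uses the same `{'results': [...]}` layout written above.

```python
import json
from pathlib import Path


def load_or_fetch(cache_path, fetch_fn):
    """Return cached results if the file exists; otherwise fetch and cache them.

    `load_or_fetch` is a hypothetical helper; `fetch_fn` is any zero-argument
    callable returning a list of records (for example `fetch_all_pages`).
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Reuse the saved file instead of calling the API again.
        with cache_path.open("r") as f:
            return json.load(f)["results"]

    results = fetch_fn()
    # Create the target folder if it is missing, then save in the same layout.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w") as f:
        json.dump({"results": results}, f)
    return results
```

In this notebook it could be called as `load_or_fetch('../Data/tmdb_resources/tmdb_actors_db.json', fetch_all_pages)`, so that re-running the cell does not query TMDB a second time.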

In [19]:
json_file_path = "../Data/tmdb_resources/tmdb_actors_db.json"

# Read JSON data from the file
with open(json_file_path, 'r') as file:
    json_data = json.load(file)

# Convert to DataFrame
actors_df = pd.json_normalize(json_data['results'], sep='_')

# Take 'original_language' from the first 'known_for' entry (None if the list is empty)
actors_df['original_language'] = actors_df['known_for'].apply(lambda x: x[0]['original_language'] if x else None)
# Collect the 'genre_ids' list of every 'known_for' entry (one list per movie)
actors_df['genre_ids'] = actors_df['known_for'].apply(lambda x: [movie['genre_ids'] for movie in x] if x else [])

actors_df = actors_df[actors_df['known_for_department'] == "Acting"]
ordered_columns = ["name", "gender", "popularity", "original_language", "genre_ids" , "known_for", "id"]
actors_df = actors_df[ordered_columns]
actors_df.to_csv('../Data/preprocessed_data/actors_db.csv', index=False)

print(f"There are {actors_df.shape[0]} actors in the dataset.")
display(actors_df)



There are 9582 actors in the dataset.


Unnamed: 0,name,gender,popularity,original_language,genre_ids,known_for,id
0,Sangeeth Shobhan,0,226.892,te,"[[35, 10749, 18], [35, 10751], [18, 35]]","[{'adult': False, 'backdrop_path': '/jBnnkkXRZ...",3234630
1,Gary Oldman,2,220.449,en,"[[18, 28, 80, 53], [28, 80, 18, 53], [18, 36]]","[{'adult': False, 'backdrop_path': '/nMKdUUepR...",64
2,Angeli Khang,1,199.449,tl,"[[18, 53], [18], [18, 10749]]","[{'adult': False, 'backdrop_path': '/27bkw4o1z...",3194176
3,Florence Pugh,1,176.589,en,"[[27, 18, 9648], [28, 12, 878], [18, 10749]]","[{'adult': False, 'backdrop_path': '/aAM3cQmYG...",1373737
4,Jason Statham,2,162.466,en,"[[80, 35], [28, 878, 27], [28, 80, 53]]","[{'adult': False, 'backdrop_path': '/ysKahAEPP...",976
...,...,...,...,...,...,...,...
9975,Alice Isaaz,1,14.133,fr,"[[18, 10749], [18, 53], [35]]","[{'adult': False, 'backdrop_path': '/vzcJQORoL...",1288047
9976,Peter Cullen,2,14.133,en,"[[28, 12, 878], [878, 28, 12, 53], [28, 12, 878]]","[{'adult': False, 'backdrop_path': '/2vFuG6bWG...",19540
9977,Mary Crosby,1,14.133,en,"[[28, 12, 37], [878, 28, 12, 35], [35, 18, 104...","[{'adult': False, 'backdrop_path': '/eCebbqmTs...",18465
9978,Daisuke Namikawa,2,14.131,ja,"[[16, 10759], [16, 10759, 10765], [10759, 16, ...","[{'adult': False, 'backdrop_path': '/zotzm1Iza...",110665
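
The `genre_ids` column holds one list of numeric IDs per movie, which is hard to read directly. The sketch below shows one possible way to flatten the nested lists and count genres by name. `GENRE_NAMES` and `genre_counts` are hypothetical names introduced here, and the mapping is only a hand-written subset of TMDB's movie genre IDs; the full list would have to come from TMDB's genre list endpoint, and TV entries use additional IDs that are not covered.

```python
from collections import Counter

# Hand-written subset of TMDB movie genre IDs (assumption: incomplete,
# TV-only genres such as 10759 are not included).
GENRE_NAMES = {
    12: "Adventure",
    18: "Drama",
    27: "Horror",
    28: "Action",
    35: "Comedy",
    53: "Thriller",
    80: "Crime",
    878: "Science Fiction",
    10749: "Romance",
}


def genre_counts(nested_genre_ids):
    """Count genre names across all movies of one actor."""
    # Flatten [[80, 35], [28, 878, 27], ...] into a single list of IDs.
    flat = [gid for movie in nested_genre_ids for gid in movie]
    # IDs missing from the mapping are kept visible rather than dropped.
    return Counter(GENRE_NAMES.get(gid, f"unknown({gid})") for gid in flat)


# genre_ids of Jason Statham, taken from the table above
print(genre_counts([[80, 35], [28, 878, 27], [28, 80, 53]]))
```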


Each actor comes with the 3 movies that made them popular. The information about these movies is available in the `known_for` feature. 
We keep this data in the `known_for` column for later use, when we need specific information about an actor's movies.
The data looks like this:

In [20]:
# Example: get information about the 3 most popular movies of the actor Jackie Chan
def get_popular_movies(actor_name):
    actor_movies = actors_df[actors_df['name'] == actor_name]['known_for'].values[0]
    popular_movies = sorted(actor_movies, key=lambda x: x['popularity'], reverse=True)
    return popular_movies

example_df = pd.json_normalize(get_popular_movies("Jackie Chan"))
ordered_columns = ["title", "popularity", "original_language", "genre_ids", "release_date", "vote_average", "vote_count", "id", "overview"]
example_df[ordered_columns]

Unnamed: 0,title,popularity,original_language,genre_ids,release_date,vote_average,vote_count,id,overview
0,Rush Hour 3,64.894,en,"[28, 35, 80]",2007-08-08,6.436,2948,5174,"After a botched assassination attempt, the mis..."
1,Rush Hour,57.09,en,"[28, 35, 80]",1998-09-18,7.019,4436,2109,When Hong Kong Inspector Lee is summoned to Lo...
2,Rush Hour 2,50.15,en,"[28, 35, 80]",2001-08-03,6.717,3729,5175,It's vacation time for Carter as he finds hims...
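
The example above selects `title` and `release_date`, which works when every `known_for` entry is a movie. As a hedged sketch, assuming that TV entries in `known_for` carry `name` and `first_air_date` instead, the entries could be normalized before sorting. `normalize_known_for` and `top_known_for` are hypothetical helpers, and the TV entry in the sample data is made up for illustration.

```python
def normalize_known_for(entry):
    """Map a known_for entry to common field names.

    Assumption: TV entries use 'name' and 'first_air_date' where movies
    use 'title' and 'release_date'.
    """
    return {
        "title": entry.get("title") or entry.get("name"),
        "release_date": entry.get("release_date") or entry.get("first_air_date"),
        "popularity": entry.get("popularity", 0.0),
    }


def top_known_for(entries):
    """Normalize the entries and sort them by popularity, highest first."""
    rows = [normalize_known_for(e) for e in entries]
    return sorted(rows, key=lambda r: r["popularity"], reverse=True)


sample = [
    {"title": "Rush Hour", "release_date": "1998-09-18", "popularity": 57.09},
    # Made-up TV entry, only here to exercise the fallback fields.
    {"name": "Example Show", "first_air_date": "2010-01-01", "popularity": 60.0},
    {"title": "Rush Hour 3", "release_date": "2007-08-08", "popularity": 64.894},
]

for row in top_known_for(sample):
    print(row["title"], row["release_date"], row["popularity"])
```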
