## Read and Merge Data
<img src="../data/image/read_data.png" alt="My Local Image" width="750">

#### Steps in this notebook
1. Import open data from Excel and CSV files and merge them.
2. Randomly sample 500 titles from the merged data and retrieve more information about them as JSON through an API call.
3. Import the JSON data and merge it with the initially merged data.
4. Export the result to an Excel file.

First, we import the Excel data as a dataframe; it has 3725 rows. To get a consistent key across sources, we extract the year from Release Date and append it to the Title, giving each row an identifier called Title_year.

In [1]:
import requests
import json
import pandas as pd

file_path = '../data/xlsx_data/movies_excel.xlsx'
excel_df = pd.read_excel(file_path, sheet_name='Sheet1')  # Specify the sheet name or index

print(len(excel_df))
excel_df['year'] = excel_df['Release Date'].dt.year.astype(int)
excel_df['Title_year'] = excel_df['Title'] + ' (' + excel_df['year'].astype(str) + ')'
excel_df.head()


3725


Unnamed: 0,Title,Release Date,Color/B&W,Genre,Language,Country,Rating,Lead Actor,Director Name,Lead Actor FB Likes,Cast FB Likes,Director FB Likes,Movie FB Likes,IMDb Score (1-10),Total Reviews,Duration (min),Gross Revenue,Budget,year,Title_year
0,Over the Hill to the Poorhouse,1920-09-15,Black and White,Crime,English,USA,Not Rated,Stephen Carr,Harry F. Millarde,2.0,4,0,0,4.8,1.0,110.0,3000000,100000,1920,Over the Hill to the Poorhouse (1920)
1,Metropolis,1927-01-26,Black and White,Drama,German,Germany,Not Rated,Brigitte Helm,Fritz Lang,136.0,203,756,12000,8.3,260.0,145.0,26435,6000000,1927,Metropolis (1927)
2,The Broadway Melody,1929-11-11,Black and White,Musical,English,USA,Passed,Anita Page,Harry Beaumont,77.0,109,4,167,6.3,36.0,100.0,2808000,379000,1929,The Broadway Melody (1929)
3,42nd Street,1933-08-29,Black and White,Comedy,English,USA,Unrated,Ginger Rogers,Lloyd Bacon,610.0,995,24,439,7.7,65.0,89.0,2300000,439000,1933,42nd Street (1933)
4,Top Hat,1935-04-15,Black and White,Comedy,English,USA,Approved,Ginger Rogers,Mark Sandrich,610.0,824,10,1000,7.8,66.0,81.0,3000000,609000,1935,Top Hat (1935)
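The `.astype(int)` call above works because every row here has a parsed release date. As a minimal sketch on made-up rows (the `toy` frame and its second title are hypothetical, not part of the movie data), the nullable `Int64` dtype would keep a row with a missing date instead of raising:

```python
import pandas as pd

# Hypothetical toy rows; the second one has no release date.
toy = pd.DataFrame({
    "Title": ["Metropolis", "Unknown Film"],
    "Release Date": pd.to_datetime(["1927-01-26", None]),
})

# .dt.year gives NaN for a missing date, so a plain int cast would fail;
# the nullable Int64 dtype stores <NA> for that row instead.
toy["year"] = toy["Release Date"].dt.year.astype("Int64")

# Build the key for every row, then blank it out where the year is missing.
key = toy["Title"] + " (" + toy["year"].astype(str) + ")"
toy["Title_year"] = key.where(toy["year"].notna())

print(toy[["Title", "Title_year"]])
```

Rows without a key can then be inspected or dropped before merging, rather than silently producing a malformed identifier.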


Next, we import the CSV file, which has 1000 rows. We rename Series_Title to Title and build the same Title_year key as above.

In [2]:
csv_df = pd.read_csv("../data/csv_data/imdb_top_1000.csv")
print(len(csv_df))
csv_df.rename(columns={'Series_Title': 'Title'}, inplace=True)
csv_df['Title_year'] = csv_df['Title'] + ' (' + csv_df['Released_Year'].astype(str) + ')'
print(csv_df.columns)

1000
Index(['Poster_Link', 'Title', 'Released_Year', 'Certificate', 'Runtime',
       'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director', 'Star1',
       'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross', 'Title_year'],
      dtype='object')
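Since Released_Year is cast to a string here, any entry that is not a year would end up inside the key unnoticed. A small sketch of a check, using made-up rows (the `toy_csv` frame and the `"PG"` value are assumptions for illustration):

```python
import pandas as pd

# Hypothetical rows; the second has a non-year value in Released_Year.
toy_csv = pd.DataFrame({
    "Title": ["Top Hat", "Some Film"],
    "Released_Year": ["1935", "PG"],
})

# errors="coerce" turns anything that is not a number into NaN.
years = pd.to_numeric(toy_csv["Released_Year"], errors="coerce")

# Rows whose year could not be parsed are candidates for manual correction.
bad_rows = toy_csv[years.isna()]
print(bad_rows)
```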


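Here we merge the Excel and CSV data with an outer join on the new column, **Title_year**, so titles that appear in only one source are kept. Combined_Title then takes the title from whichever source provides it.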

In [3]:
merged_df = pd.merge(excel_df, csv_df, on='Title_year', how='outer')
print(merged_df.columns)
print(len(merged_df))
merged_df['Combined_Title'] = merged_df['Title_x'].combine_first(merged_df['Title_y'])

Index(['Title_x', 'Release Date', 'Color/B&W', 'Genre_x', 'Language',
       'Country', 'Rating', 'Lead Actor', 'Director Name',
       'Lead Actor FB Likes', 'Cast FB Likes', 'Director FB Likes',
       'Movie FB Likes', 'IMDb Score (1-10)', 'Total Reviews',
       'Duration (min)', 'Gross Revenue', 'Budget', 'year', 'Title_year',
       'Poster_Link', 'Title_y', 'Released_Year', 'Certificate', 'Runtime',
       'Genre_y', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director', 'Star1',
       'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross'],
      dtype='object')
4353
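To see where the merged rows come from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`. A minimal sketch on toy frames (`left` and `right` are stand-ins for the Excel and CSV data, not the real files):

```python
import pandas as pd

# Toy stand-ins for excel_df and csv_df; one title is shared.
left = pd.DataFrame({
    "Title": ["Metropolis", "Top Hat"],
    "Title_year": ["Metropolis (1927)", "Top Hat (1935)"],
})
right = pd.DataFrame({
    "Title": ["Metropolis", "Psycho"],
    "Title_year": ["Metropolis (1927)", "Psycho (1960)"],
})

# Outer join keeps every key; _merge records which side each row came from.
toy_merged = pd.merge(left, right, on="Title_year", how="outer", indicator=True)
counts = toy_merged["_merge"].value_counts()

# combine_first fills missing Title_x values from Title_y.
toy_merged["Combined_Title"] = toy_merged["Title_x"].combine_first(toy_merged["Title_y"])

print(counts)
print(toy_merged[["Title_year", "Combined_Title"]])
```

Running the same count on the real frames would show how many of the 4353 rows were matched in both sources.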


Now we use 500 randomly sampled titles from the combined data to query the OMDb API for more information about each movie. The responses are saved as a JSON file in our data folder. **This code is commented out because it only needs to run once; the results are already stored.**

In [4]:

# # Your API key and base URL
# api_key = '8501dc49'
# base_url = 'http://www.omdbapi.com/'

# # List of movie titles to search
# movie_titles = merged_df['Combined_Title'].sample(n=500, random_state=1).to_list()

# # Container to hold all results
# all_movies = []

# # Loop through each movie title
# for title in movie_titles:
    
#     # Query parameters
#     params = {
#         'apikey': api_key,
#         't': title
#     }

#     # Make the request
#     url = requests.Request('GET', base_url, params=params).prepare().url
#     response = requests.get(url)

#     # Check if the request was successful
#     if response.status_code == 200:
#         data = response.json()
#         if data['Response'] == 'True':
#             all_movies.append(data)
#         else: print(title + " not found")

#     else:
#         print(f"Error: Unable to retrieve data for {title}")

# # Save to a JSON file
# with open('../data/json_data/omdb_movies.json', 'w') as json_file:
#     json.dump(all_movies, json_file, indent=4)

# print(f"Successfully saved {len(all_movies)} records to omdb_movies.json")
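The loop above could also be written as a function that reads the key from the environment and passes the query parameters straight to the HTTP call. This is only a sketch under assumptions: `fetch_movies`, `FakeResponse`, `fake_get`, and the `OMDB_API_KEY` variable name are hypothetical, and the demo uses an offline stand-in instead of contacting OMDb.

```python
import os

OMDB_URL = "http://www.omdbapi.com/"  # base URL used in the notebook


def fetch_movies(titles, api_key, get):
    """Query once per title with `get`, a callable shaped like requests.get."""
    found, missing = [], []
    for title in titles:
        response = get(OMDB_URL, params={"apikey": api_key, "t": title}, timeout=10)
        # Treat a non-200 status the same as a title that was not found.
        data = response.json() if response.status_code == 200 else {}
        if data.get("Response") == "True":
            found.append(data)
        else:
            missing.append(title)
    return found, missing


class FakeResponse:
    """Offline stand-in for a response object (demo only)."""

    def __init__(self, payload):
        self.status_code = 200
        self._payload = payload

    def json(self):
        return self._payload


def fake_get(url, params, timeout):
    # Pretend the service only knows one title.
    if params["t"] == "Metropolis":
        return FakeResponse({"Title": "Metropolis", "Year": "1927", "Response": "True"})
    return FakeResponse({"Response": "False"})


# OMDB_API_KEY is an assumed variable name; "demo" is a placeholder fallback.
api_key = os.environ.get("OMDB_API_KEY", "demo")
found, missing = fetch_movies(["Metropolis", "No Such Film"], api_key, fake_get)
print(len(found), missing)
```

For a real run, `requests.get` can be passed as `get`. Keeping the key in an environment variable avoids storing it in the notebook, and returning `missing` makes it easy to see which sampled titles produced no record.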


Next, we import the saved JSON file and build the same Title_year key as for the Excel and CSV data, so that we can finally merge all the data into one Excel file.

In [5]:
with open('../data/json_data/omdb_movies.json', 'r') as file:
    data = json.load(file)
json_df = pd.json_normalize(data)


print(len(json_df))
json_df.rename(columns={'title': 'Title'}, inplace=True)
json_df['Title_year'] = json_df['Title'] + ' (' + json_df['Year'].astype(str) + ')'

238
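The JSON key is built from the title and year that the API returns, which may not be written exactly as in the Excel or CSV data. Before the final merge, it can help to check how many JSON keys already exist in the merged data. A minimal sketch on toy values (`existing_keys` and `toy_json` are made up for illustration):

```python
import pandas as pd

# Keys already present in the merged Excel and CSV data (toy values).
existing_keys = pd.Series(["Metropolis (1927)", "Top Hat (1935)"])

# Toy API records; the second year differs from the one in existing_keys.
toy_json = pd.DataFrame({
    "Title": ["Metropolis", "Top Hat"],
    "Year": ["1927", "1936"],
})
toy_json["Title_year"] = toy_json["Title"] + " (" + toy_json["Year"].astype(str) + ")"

# isin flags the keys that would line up with an existing row in the merge.
matched = toy_json["Title_year"].isin(existing_keys)
unmatched = toy_json.loc[~matched, "Title_year"].tolist()

print(int(matched.sum()), "matched;", "unmatched:", unmatched)
```

In an outer merge, each unmatched key becomes a new row instead of enriching an existing one, so a long unmatched list is worth reviewing.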


In [6]:
merged_df = pd.merge(merged_df, json_df, on='Title_year', how='outer')
merged_df.to_excel('../data/output/merged.xlsx', index=False)