Exploratory Data Analysis of Movie Ratings

Project Overview:
Analyze a movie ratings dataset to uncover insights into viewer preferences and trends over time.

Steps:

Data Collection: Use a publicly available dataset, such as the IMDb or MovieLens datasets, containing movie titles, genres, ratings, and release years.

Data Cleaning: Handle missing or inconsistent data, such as incomplete ratings or incorrect genres.

Descriptive Statistics:

Calculate summary statistics for movie ratings (mean, median, mode).
Analyze the distribution of ratings to identify any skewness or outliers.
Genre Analysis:

Identify the most popular genres based on average ratings.
Determine which genres have the highest and lowest average ratings.
Trends Over Time:

Analyze trends in movie ratings over different years or decades.
Investigate if certain periods had higher overall ratings or more releases.
Visualizations:

Create visualizations such as histograms, bar charts, and line plots to present your findings.
Use heatmaps to show correlations between different genres and ratings.
Conclusions: Summarize key insights, such as which genres are consistently rated higher or how viewer preferences have evolved.

This project will help you practice data manipulation, statistical analysis, and data visualization without diving into machine learning.

In [24]:
import os
import requests
import json
from dotenv import load_dotenv
import pandas as pd
import re
from datetime import datetime
import time

In [25]:

# Load environment variables to access keys
load_dotenv('C:/Users/Earl/Desktop/API-Keys.env.txt')

# Get the API key from the environment
api_key = os.getenv('PAID_OMDB_API_KEY')
base_url = f'http://www.omdbapi.com/?apikey={api_key}&'
def save_dataframe_to_csv(df, filename):
    """
    Save a DataFrame to a CSV file.

    Parameters:
    df (pd.DataFrame): The DataFrame to save.
    filename (str): The path to the file where the DataFrame will be saved.
    """
    try:
        df.to_csv(filename, index=False)
        print(f"DataFrame successfully saved to {filename}")
    except Exception as e:
        print(f"An error occurred while saving the DataFrame to CSV: {e}")
        
def movie_retrieval(title,year):
    """
    retrieve movie info from IMDB.

    Parameters:
    title (str): The move title to search for
    year (int): The year of the movie title

    Return: 
    d.DataFrame([data]): The information of the movie searched as a pandas DataFrame
    """
    retries=5
    backoff_factor=0.5
    params = {
    't': title,  
    'type': 'movie',
    'y': year
    }
    for i in range(retries):
        try:
            response = requests.get(base_url, params=params)
            print(f"Fetching movie: {title} ({year}), Status code: {response.status_code}")  # Debugging statement
            

            if response.status_code == 200:
                try:
                    data = response.json()
                    if 'Title' in data:
                        title = data.get('Title', 'N/A')
                        runtime = data.get('Runtime', 'N/A')
                        country = data.get('Country', 'N/A')
                        print(f"Found movie with Title: {title}, Runtime: {runtime}, Country: {country}")  # Debugging statement
                        return pd.DataFrame([data])
                    else:
                        print("No movies found or the response doesn't contain the 'Title' key.")
                        return None
                except ValueError:
                    print("Response content is not valid JSON.")
                    return None
            elif response.status_code in [520, 524]:
                print(f"Error: {response.status_code} - A server issue occurred. Retrying...")
            else:
                print(f"Error: {response.status_code}")
                return None
        except Exception as e:
            print(f"An error occurred: {str(e)}")
        time.sleep(backoff_factor * (2 ** i))  # Exponential backoff
    return None

def movie_retrieval_by_title(title):
    """
    Retrieve movie info from IMDB by title.

    Parameters:
    title (str): The movie title to search for

    Return: 
    pd.DataFrame: The information of the movies searched as a pandas DataFrame
    """
    retries = 5
    backoff_factor = 0.5
    params = {
        's': title,
        'type': 'movie'
    }
    all_movies = []

    for i in range(retries):
        try:
            response = requests.get(base_url, params=params)
            print(f"Searching movies: {title}, Status code: {response.status_code}")  # Debugging statement

            if response.status_code == 200:
                try:
                    data = response.json()
                    if 'Search' in data:
                        all_movies.extend(data['Search'])
                        total_results = int(data.get('totalResults', len(data['Search'])))
                        current_results = len(data['Search'])

                        while current_results < total_results:
                            params['page'] = (current_results // 10) + 1
                            response = requests.get(base_url, params=params)
                            data = response.json()
                            if 'Search' in data:
                                all_movies.extend(data['Search'])
                                current_results += len(data['Search'])
                            else:
                                break

                        return pd.DataFrame(all_movies)
                    else:
                        print("No movies found or the response doesn't contain the 'Search' key.")
                        return None
                except ValueError:
                    print("Response content is not valid JSON.")
                    return None
            elif response.status_code in [520, 524]:
                print(f"Error: {response.status_code} - A server issue occurred. Retrying...")
            else:
                print(f"Error: {response.status_code}")
                return None
        except Exception as e:
            print(f"An error occurred: {str(e)}")
        time.sleep(backoff_factor * (2 ** i))  # Exponential backoff
    return None

def fetch_wikipedia_page_links(titles):
    """
    fetch wikipedia page links for the 

    Parameters:
    titles (str): The wikipedia page title to search for

    Returns: 
    response.json(): which contains information on the links associated with page
    """
    
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        'action': 'query',
        'format': 'json',
        'titles': titles, #'Lists of American films',
        'prop': 'links',
        'pllimit': 'max'
    }
    response = requests.get(url, params=params)
    
    if response.status_code==200:
        print(f"Respone {response.status_code}-request sucessful.\npage links for '{titles}' acquired")
        return response.json()
    else:
        print(f"response {response.status_code}-error has occured")
        return None

def fetch_wikipedia_page_content(titles):
    """
    fetch wikipedia page content from list of american films in 'year' page titles 

    Parameters:
    titles (str): The wikipedia page title to search for

    Returns: 
    page_content(str): The page content of the page searched
    year(str): The year associated with the list
    """
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        'action': 'query',
        'format': 'json',
        'titles': titles,
        'prop': 'revisions',
        'rvprop': 'content'
    }
    response = requests.get(url, params=params)
    data = response.json()

    year_match = re.search(r'\d{4}', titles)
    year = year_match.group(0) if year_match else None

    pages = data.get('query', {}).get('pages', {})

    for page_id, page_data in pages.items():
        if 'revisions' in page_data:
            page_content = next(iter(data['query']['pages'].values()))['revisions'][0]['*']
            print(f"Response {response.status_code} - Request successful.\nPage: {titles} fetched\n")
            return page_content,year
        else:
            print(f"Error: No revisions found for page: {titles}\n")
            return "", None
    return "", None


#Run on fetch_wikipedia_page_links() data
def filter_american_film_lists_links(data):
    """
    extracts links from wikipedia response.json from fetch_wikipedia_page_links() function

    Parameters:
    data (str): The wikipedia page content

    Returns: 
    links (list of str): list of the links attached to the page that fit the criteria
    """
    pages = data['query']['pages']
    links = []
    
    for page_id in pages:
        for link in pages[page_id].get('links', []):
            title = link['title']
            if title.startswith("List of American films of"):
                links.append(title)
    if links:
        print("List sucessfully created")
        return links
    else: 
        print("error list not created")
        return None

#run on fetch_wikipedia_page_content() data
def extract_titles(wikitext):
    
    pattern = r"''\[\[(.*?)\]\]"
    matches = re.findall(pattern, wikitext)
 
    matches_years_removed= [title.split(' (')[0] for title in matches]
    empty_strings_removed_matches = [title for title in matches_years_removed if title]
    duplicates_removed_matches=list(set(empty_strings_removed_matches)) 
    return duplicates_removed_matches


In [3]:

# Fetch links to pages of American films by year
film_links = filter_american_film_lists_links(fetch_wikipedia_page_links('Lists of American films'))
print(film_links) # debug to make sure list isnt empty



Respone 200-request sucessful.
page links for 'Lists of American films' acquired
List sucessfully created
['List of American films of 1900', 'List of American films of 1901', 'List of American films of 1902', 'List of American films of 1903', 'List of American films of 1904', 'List of American films of 1905', 'List of American films of 1906', 'List of American films of 1907', 'List of American films of 1908', 'List of American films of 1909', 'List of American films of 1910', 'List of American films of 1911', 'List of American films of 1912', 'List of American films of 1913', 'List of American films of 1914', 'List of American films of 1915', 'List of American films of 1916', 'List of American films of 1917', 'List of American films of 1918', 'List of American films of 1919', 'List of American films of 1920', 'List of American films of 1921', 'List of American films of 1922', 'List of American films of 1923', 'List of American films of 1924', 'List of American films of 1925', 'List of 

In [4]:
wikitext = []
for title in film_links:
    page_wikitext, year = fetch_wikipedia_page_content(title)
    if page_wikitext:
        wikitext.append((page_wikitext, year))

#debugging 
if not wikitext:
    print("error wikitext variable list is empty")

Response 200 - Request successful.
Page: List of American films of 1900 fetched

Response 200 - Request successful.
Page: List of American films of 1901 fetched

Response 200 - Request successful.
Page: List of American films of 1902 fetched

Response 200 - Request successful.
Page: List of American films of 1903 fetched

Response 200 - Request successful.
Page: List of American films of 1904 fetched

Response 200 - Request successful.
Page: List of American films of 1905 fetched

Response 200 - Request successful.
Page: List of American films of 1906 fetched

Response 200 - Request successful.
Page: List of American films of 1907 fetched

Response 200 - Request successful.
Page: List of American films of 1908 fetched

Response 200 - Request successful.
Page: List of American films of 1909 fetched

Response 200 - Request successful.
Page: List of American films of 1910 fetched

Response 200 - Request successful.
Page: List of American films of 1911 fetched

Response 200 - Request succe

In [5]:
movie_list_for_imdb_search=[]

for row in wikitext:
    year=int(row[1])
    titles_list=extract_titles(row[0])
    titles_with_year = [(title, year) for title in titles_list]
    movie_list_for_imdb_search.extend(titles_with_year)

#debugging 
if not movie_list_for_imdb_search:
    print("error movie_list_for_imdb_search variable list is empty")

In [6]:
current_year=datetime.now().year
movie_list_for_imdb_search_df=pd.DataFrame(movie_list_for_imdb_search, columns=["Title","Year"])
movie_list_for_imdb_search_df=movie_list_for_imdb_search_df[   (movie_list_for_imdb_search_df['Year'] <= current_year) & 
    (movie_list_for_imdb_search_df['Year'] > 1890)]
movie_list_for_imdb_search_df.sort_values(by='Year',inplace=True)
movie_list_for_imdb_search_df['Title'] = movie_list_for_imdb_search_df['Title'].str.strip()


In [7]:
movie_list_for_imdb_search_df

Unnamed: 0,Title,Year
0,Caught,1900
18,Watermelon Contest,1900
17,1900 in film|1900,1900
16,The Enchanted Drawing,1900
15,Clowns Spinning Hats,1900
...,...,...
35703,Don't Tell Mom the Babysitter's Dead,2024
35702,Am I OK?,2024
35701,Babygirl,2024
35708,Lights Out,2024


In [19]:
def find_imdb_movies_from_list(movie_list_df,just_title=False):
    movie_df=pd.DataFrame()
    unfound_movies_list = []
    for index, row in movie_list_df.iterrows():
        title = row['Title']  # Replace 'title' with the actual column name
        year = row['Year']    # Replace 'year' with the actual column name
        if just_title==False:
            if "|" in title:
                title1=title.split("|")[0]
                title2=title.split("|")[1]
                movie_info1=movie_retrieval(title1,year)
                movie_info2=movie_retrieval(title2,year)
                if movie_info1 is not None:
                    movie_df=pd.concat([movie_df,movie_info1],ignore_index=True)
                elif movie_info2 is not None:
                    movie_df=pd.concat([movie_df,movie_info2],ignore_index=True)
                else:
                    print(f"Movie {title} ({year}) was not found")
                    unfound_movies_list.append({'Title': title, 'Year': year})

            else:
                movie_info=movie_retrieval(title,year)
                if movie_info is not None:
                    movie_df=pd.concat([movie_df,movie_info],ignore_index=True)
                else:
                    print(f"Movie {title} ({year}) was not found")              
                    unfound_movies_list.append({'Title': title, 'Year': year})
        else: 
            if "|" in title:
                title1=title.split("|")[0]
                title2=title.split("|")[1]
                movie_info1=movie_retrieval_by_title(title1)
                movie_info2=movie_retrieval_by_title(title2)
                if movie_info1 is not None:
                    movie_df=pd.concat([movie_df,movie_info1],ignore_index=True)
                elif movie_info2 is not None:
                    movie_df=pd.concat([movie_df,movie_info2],ignore_index=True)
                else:
                    print(f"Movie {title} was not found")
                    unfound_movies_list.append({'Title': title, 'Year': year})

            else:
                movie_info=movie_retrieval_by_title(title)
                if movie_info is not None:
                    movie_df=pd.concat([movie_df,movie_info],ignore_index=True)
                else:
                    print(f"Movie {title} was not found")              
                    unfound_movies_list.append({'Title': title, 'Year': year})
    unfound_movies=pd.DataFrame(unfound_movies_list)
    return movie_df,unfound_movies


In [26]:
movie_df,unfound_movies=find_imdb_movies_from_list(movie_df)
#unfound movies should be rare, but imdb api is very specific with searchs it hardly does close matches, sometimes it just times out, you can try running it
# the function again and set just_titles to true and it will try to find the movies by just the titles, this will give you page results of all close matches
#you can then run it through the function again

Fetching movie: The Girls in the Overalls (1904), Status code: 200
Found movie with Title: The Girls in the Overalls, Runtime: N/A, Country: USA
Fetching movie: A Trip to the Giant's Causeway (1900), Status code: 200
Found movie with Title: A Trip to the Giant's Causeway, Runtime: N/A, Country: United Kingdom
Fetching movie: Over the Garden Wall (1950), Status code: 200
Found movie with Title: Over the Garden Wall, Runtime: 94 min, Country: United Kingdom
Fetching movie: Over the Garden Wall (1934), Status code: 200
Found movie with Title: Over the Garden Wall, Runtime: 68 min, Country: United Kingdom
Fetching movie: Over the Garden Wall (1910), Status code: 200
Found movie with Title: Over the Garden Wall, Runtime: 16 min, Country: United States
Fetching movie: Over the Garden Wall (1919), Status code: 200
Found movie with Title: Over the Garden Wall, Runtime: 50 min, Country: United States
Fetching movie: Over the Garden Wall (1914), Status code: 200
Found movie with Title: Over the 

In [12]:
#save_dataframe_to_csv(movie_df,"american_movie_dataset.csv") 
#if this is the first run of the code, it is recommended to uncomment this to create the initial csv copy

DataFrame successfully saved to american_movie_dataset.csv


In [27]:
movie_df.to_csv("american_movie_dataset.csv", mode='a', index=False, header=False)
#run this to append to exisiting file alternatively, you can run this for the first time and the file is created anyways if it doesnt exist

In [20]:
title_of_movie_df,unfound_movies2=find_imdb_movies_from_list(unfound_movies,just_title=True) 
#here you can get your initial unfound movies by title, and then run it again 



Searching movies: Buffalo Bill's Wild West Parad, Status code: 200
No movies found or the response doesn't contain the 'Search' key.
Movie Buffalo Bill's Wild West Parad was not found
Searching movies: The Girls in the Overalls, Status code: 200
Searching movies: A Trip to the Giant's Causeway, Status code: 200
Searching movies: The Tragical Tale of a Belated Letter, Status code: 200
No movies found or the response doesn't contain the 'Search' key.
Movie The Tragical Tale of a Belated Letter was not found
Searching movies: Scene in Canada -- Logging at Bear Creek, Status code: 200
No movies found or the response doesn't contain the 'Search' key.
Movie Scene in Canada -- Logging at Bear Creek was not found
Searching movies: The Ascent of Mont Blanc, Status code: 200
No movies found or the response doesn't contain the 'Search' key.
Movie The Ascent of Mont Blanc was not found
Searching movies: A Substantial Ghost, Status code: 200
No movies found or the response doesn't contain the 'Sear

In [None]:
movie_df,unfound_movies3=find_imdb_movies_from_list(title_of_movie_df)
#second run, but unfound movies or other issues could still arise.

Fetching movie: The Girls in the Overalls (1904), Status code: 200
Found movie with Title: The Girls in the Overalls, Runtime: N/A, Country: USA
Fetching movie: A Trip to the Giant's Causeway (1900), Status code: 200
Found movie with Title: A Trip to the Giant's Causeway, Runtime: N/A, Country: United Kingdom
Fetching movie: Over the Garden Wall (1950), Status code: 200
Found movie with Title: Over the Garden Wall, Runtime: 94 min, Country: United Kingdom
Fetching movie: Over the Garden Wall (1934), Status code: 200
Found movie with Title: Over the Garden Wall, Runtime: 68 min, Country: United Kingdom
Fetching movie: Over the Garden Wall (1910), Status code: 200
Found movie with Title: Over the Garden Wall, Runtime: 16 min, Country: United States
Fetching movie: Over the Garden Wall (1919), Status code: 200
Found movie with Title: Over the Garden Wall, Runtime: 50 min, Country: United States
Fetching movie: Over the Garden Wall (1914), Status code: 200
Found movie with Title: Over the 