The code scrapes Metacritic's movie browse listing, pages 1 through 42, to collect the top-ranked movies. It uses the BeautifulSoup library to parse each page's HTML and extract every movie title together with its position in the ranking. The results are stored in a list of dictionaries, one per movie with "Position" and "Title" keys, and the list is capped at the first 1010 entries.

In [1]:
import requests
from bs4 import BeautifulSoup

def retrieve_top_movies_data(begin_page, conclude_page):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }

    all_films = []

    for current_page in range(begin_page, conclude_page + 1):
        url = f"https://www.metacritic.com/browse/movie/?releaseYearMin=1910&releaseYearMax=2023&page={current_page}"
        # A timeout keeps one stalled request from hanging the whole loop.
        response = requests.get(url, headers=headers, timeout=10)

        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            film_titles = soup.find_all("h3", class_="c-finderProductCard_titleHeading")

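            # Number entries continuously across pages: the next position is one more than the count collected so far.
            for idx, title_heading in enumerate(film_titles, start=len(all_films) + 1):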
                title_span = title_heading.find("span", class_="c-finderProductCard_title")
                film_title = title_span.text.strip() if title_span else title_heading.text.strip()
                all_films.append({"Position": idx, "Title": film_title})
        else:
            print(f"Error: Unable to retrieve the page {current_page} (Status code: {response.status_code})")

        if len(all_films) >= 1010:
            break

    return all_films[:1010]

if __name__ == "__main__":
    starting_page_number = 1
    concluding_page_number = 42
    films_data = retrieve_top_movies_data(starting_page_number, concluding_page_number)
    print(films_data)
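
The "Position" value is meant to be a running rank across all pages. A minimal sketch of one way to keep that numbering continuous, assuming each page has already been reduced to a list of titles (the helper name `number_across_pages` and the sample data are made up for illustration):

```python
def number_across_pages(pages):
    """Assign a running 1-based position to titles spread over several pages."""
    ranked = []
    for titles in pages:
        # Start each page where the previous one stopped.
        for position, title in enumerate(titles, start=len(ranked) + 1):
            ranked.append({"Position": position, "Title": title})
    return ranked


# Made-up sample standing in for three scraped pages.
sample_pages = [["A", "B"], ["C"], ["D", "E"]]
ranked = number_across_pages(sample_pages)
print([entry["Position"] for entry in ranked])
```

Deriving the start value from the number of entries already collected avoids hard-coding how many movies a page holds.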



In [None]:
for movie in films_data:
    print(movie['Title'])

1. Dekalog (1988)
2. Tokyo Story
3. Casablanca
4. Boyhood
5. Three Colors: Red
6. Rear Window
7. Lawrence of Arabia (re-release)
8. The Godfather
9. The Conformist
10. Citizen Kane
11. The Leopard (re-release)
12. Vertigo
13. Notorious
14. Fanny and Alexander (re-release)
15. Singin' in the Rain
16. Playtime
17. Army of Shadows
18. City Lights
19. Moonlight
20. Intolerance
21. The Rules of the Game
22. Pinocchio
23. Touch of Evil
24. Seven Samurai
25. The Wild Bunch
26. Au hasard Balthazar
27. The Lady Vanishes
28. Pépé le Moko (re-release)
29. The Treasure of the Sierra Madre
30. Pan's Labyrinth
31. Some Like It Hot
32. North by Northwest
33. Hoop Dreams
34. Rashomon
35. The Passion of Joan of Arc
36. All About Eve
37. Metropolis (re-release)
38. Jules and Jim
39. My Left Foot
40. The Night of the Hunter
41. Ran
42. The Third Man
43. Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
44. Quo Vadis, Aida?
45. Psycho
46. Rififi (re-release)
47. Gone with the Wind
48. 4

The scraped titles carry their rank as a prefix (for example "1. Dekalog (1988)"). The code iterates through each entry in the films_data list and removes that prefix by dropping everything up to and including the first space, then writes the cleaned text back to the 'Title' key. After this step, films_data holds the bare movie titles.

In [None]:


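# Strip the leading rank prefix (e.g. "1. ") that precedes each scraped title.
for film in films_data:
    prefix, separator, remainder = film['Title'].partition(" ")

    # Only strip when the first token really is a rank such as "12."; this keeps
    # the cell safe to re-run without removing the first word of a title.
    if separator and prefix.endswith(".") and prefix[:-1].isdigit():
        film['Title'] = remainder

print(films_data)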



[{'Position': 1, 'Title': 'Dekalog (1988)'}, {'Position': 2, 'Title': 'Tokyo Story'}, {'Position': 3, 'Title': 'Casablanca'}, {'Position': 4, 'Title': 'Boyhood'}, {'Position': 5, 'Title': 'Three Colors: Red'}, {'Position': 6, 'Title': 'Rear Window'}, {'Position': 7, 'Title': 'Lawrence of Arabia (re-release)'}, {'Position': 8, 'Title': 'The Godfather'}, {'Position': 9, 'Title': 'The Conformist'}, {'Position': 10, 'Title': 'Citizen Kane'}, {'Position': 11, 'Title': 'The Leopard (re-release)'}, {'Position': 12, 'Title': 'Vertigo'}, {'Position': 13, 'Title': 'Notorious'}, {'Position': 14, 'Title': 'Fanny and Alexander (re-release)'}, {'Position': 15, 'Title': "Singin' in the Rain"}, {'Position': 16, 'Title': 'Playtime'}, {'Position': 17, 'Title': 'Army of Shadows'}, {'Position': 18, 'Title': 'City Lights'}, {'Position': 19, 'Title': 'Moonlight'}, {'Position': 20, 'Title': 'Intolerance'}, {'Position': 21, 'Title': 'The Rules of the Game'}, {'Position': 22, 'Title': 'Pinocchio'}, {'Position'

The code defines a function, fetch_metacritic_info, to scrape Metacritic for director and cast information of a given movie title. The generate_movie_dict function iterates through a list of movie titles, formats them for URL usage, and builds a dictionary (movie_dict) containing director and main cast details obtained from Metacritic. The resulting dictionary provides a structured representation of movie information with modified titles, directors, and main cast members.

In [None]:

import requests
import re
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

def fetch_metacritic_info(movie_title):
    base_url = "https://www.metacritic.com/movie/"
    credits_url = f"{base_url}{movie_title.lower().replace(' ', '-')}/credits/"

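    view_all_page = requests.get(credits_url, headers=headers, timeout=10)

    # If the site redirected to something other than a credits page, build the credits URL from the redirected address.
    if view_all_page.url != credits_url and '/credits/' not in view_all_page.url:
        credits_url = view_all_page.url.rstrip('/') + '/credits/'
        view_all_page = requests.get(credits_url, headers=headers, timeout=10)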

    
    view_all_soup = BeautifulSoup(view_all_page.content, "html.parser")

    credit_tables_list = view_all_soup.find_all('div', class_="c-productCredits g-outer-spacing-bottom-xlarge")

    directed_by_tags = []

    for credit_table in credit_tables_list:
        if "Directed By" in credit_table.get_text().strip():
            director_names = [director.get_text().strip() for director in credit_table.find_all('a')]
            directed_by_tags.extend(director_names)

    cast_list = []

    for credit_table in credit_tables_list:
        if "Cast" in credit_table.get_text().strip():
            cast_names = [cast.get_text().strip() for cast in credit_table.find_all('a')]
            cast_list.extend(cast_names)

    movie_info_dict = {
        'Director': directed_by_tags,
        'Cast': cast_list
    }

    return movie_info_dict

def generate_movie_dict(movie_titles):
    movie_dict = {}

    for title_info in movie_titles:

        title = title_info["Title"]
        formatted_title = title.lower().replace(" ", "-").replace("'", "").replace(":", "").replace("?", "").replace("&", "").replace(".", "")
        movie_info = fetch_metacritic_info(formatted_title)

        formatted_movie_info = {
            'Director': movie_info['Director'],
            'Main Cast': movie_info['Cast']
        }

        movie_dict[title] = formatted_movie_info

    return movie_dict

movie_dict = generate_movie_dict(films_data)


ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
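
A dropped connection like the one above aborts the whole run, because a single failed request propagates out of the loop. A minimal sketch of one way to retry with backoff, assuming the request is wrapped in a zero-argument callable (the helper `call_with_retries` and its parameters are hypothetical, not part of the notebook or of requests):

```python
import time


def call_with_retries(fetch, attempts=3, base_delay=1.0):
    """Call fetch() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:
            # requests' exceptions derive from IOError (an alias of OSError),
            # so this also covers requests.exceptions.ConnectionError.
            if attempt == attempts:
                raise
            # Back off exponentially: base_delay, 2 * base_delay, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Inside fetch_metacritic_info this could be used as `call_with_retries(lambda: requests.get(credits_url, headers=headers, timeout=10))`. Pausing between requests may also help if the server is closing connections because of the request rate, though that is a guess about the cause.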

The code uses the CSV module to write movie information from the movie_dict dictionary to a CSV file named "Movies_info.csv". It creates a CSV writer with specified field names, writes the header, and iterates through the movie dictionary to write each movie's title, directors, and main cast to the CSV file. Finally, it prints a confirmation message indicating the successful writing of movie data to the specified file.

In [None]:
import csv


data_path = "Movies_info.csv"
field_names = ["Movie Title", "Director", "Main Cast"]


with open(data_path, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=field_names)

    writer.writeheader()

    for movie_title, movie_info in movie_dict.items():
        director_str = ", ".join(movie_info["Director"])
        cast_str = ", ".join(movie_info["Main Cast"])

        writer.writerow(
            {
                "Movie Title": movie_title,
                "Director": director_str,
                "Main Cast": cast_str,
            }
        )


print(f"Movie data successfully written to: {data_path}")


Movie data successfully written to: Movies_info.csv
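
Because directors and cast are flattened into comma-separated strings, reading the file back requires splitting them again. A minimal round-trip sketch, assuming the same three column names; the helpers `dump_movies` and `load_movies` and the single sample row are illustrative only, and an in-memory buffer stands in for the file:

```python
import csv
import io

FIELD_NAMES = ["Movie Title", "Director", "Main Cast"]


def dump_movies(movies, handle):
    """Write a movie_dict-shaped mapping as CSV rows."""
    writer = csv.DictWriter(handle, fieldnames=FIELD_NAMES)
    writer.writeheader()
    for title, info in movies.items():
        writer.writerow({
            "Movie Title": title,
            "Director": ", ".join(info["Director"]),
            "Main Cast": ", ".join(info["Main Cast"]),
        })


def load_movies(handle):
    """Rebuild the mapping by splitting the joined name strings."""
    movies = {}
    for row in csv.DictReader(handle):
        movies[row["Movie Title"]] = {
            # Empty cells become empty lists rather than [""].
            "Director": row["Director"].split(", ") if row["Director"] else [],
            "Main Cast": row["Main Cast"].split(", ") if row["Main Cast"] else [],
        }
    return movies


# Sample row in the same shape as movie_dict.
sample = {"Vertigo": {"Director": ["Alfred Hitchcock"],
                      "Main Cast": ["James Stewart", "Kim Novak"]}}
buffer = io.StringIO(newline="")
dump_movies(sample, buffer)
buffer.seek(0)
print(load_movies(buffer) == sample)
```

The split is only lossless when no name contains ", " itself; a different separator such as "; " would be a safer choice if such names occur.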


The code defines a function, find_movie_info, that prompts the user for a movie title and looks it up in the movie_dict dictionary. The lookup is a case-insensitive substring match, and the first title that contains the input wins. If a match is found, the function prints that movie's director and main cast; otherwise it prints a message indicating that the movie is not in the dataset.

In [None]:
def find_movie_info():
    user_input = input("Enter the name of a movie: ").strip().lower()

    located_movie = None
    for film_title, film_info in movie_dict.items():
        if user_input in film_title.lower():
            located_movie = film_title
            break

    if located_movie:
        print(f"\nThe director of {located_movie} is: {', '.join(movie_dict[located_movie]['Director'])}")
        print(f"The cast of {located_movie} includes: {', '.join(movie_dict[located_movie]['Main Cast'])}")
    else:
        print("Movie not found.")
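
The matching rule can be separated from the prompt so that it is easier to check on its own. A minimal sketch, assuming the caller wants every matching title rather than only the first (the helper `find_matching_titles` is hypothetical and not part of the notebook):

```python
def find_matching_titles(query, titles):
    """Return every title containing query, ignoring case and outer whitespace."""
    needle = query.strip().lower()
    if not needle:
        # An empty string is a substring of every title, so treat it as "no match".
        return []
    return [title for title in titles if needle in title.lower()]


sample_titles = ["The Godfather", "The Third Man", "Vertigo"]
print(find_matching_titles("the", sample_titles))
```

Returning all matches makes ambiguous queries visible, and the empty-input guard avoids silently selecting the first movie in the dictionary.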

The code defines three functions: get_director_movies, which returns a list of movies directed by a given director; get_collaborations, which returns a dictionary counting collaborations between the director and performers; and explore_people, which prompts the user to enter a director's name, searches for the director in the dataset, and prints the movies directed by the director and their collaborations with performers. If the director is not found, it prints a message indicating that the director is not in the dataset.

In [None]:
def get_director_movies(director_name, movie_dict):
    directed_movies = []
    for movie_title, movie_info in movie_dict.items():
        if director_name in movie_info['Director']:
            directed_movies.append(movie_title)
    return directed_movies

def get_collaborations(director_name, movie_dict):
    collaborations_count = {}
    for movie_title, movie_info in movie_dict.items():
        if director_name in movie_info['Director']:
            for performer in movie_info['Main Cast']:
                if performer not in collaborations_count:
                    collaborations_count[performer] = 1
                else:
                    collaborations_count[performer] += 1
    return collaborations_count

def explore_people():
    director_query = input("Enter the name of a director (matching is case-sensitive, so capitalize the first letter of each name): ").strip()

    # Look the director up once; an empty list means the name is not in the dataset.
    director_movies = get_director_movies(director_query, movie_dict)
    if not director_movies:
        print("Director not found.")
        return

    collaborations = get_collaborations(director_query, movie_dict)

    print(f"\n{director_query} has directed:")
    for movie in director_movies:
        print(f"- {movie}")

    print("\nThe director has worked together with these people:")
    for actor, count in collaborations.items():
        print(f"- {actor}: {count} collaborations")

The code defines a function, calculate_cosine_similarity, that calculates the cosine similarity between the cast lists of two directors using TfidfVectorizer and cosine_similarity from scikit-learn. The cosine_comparison function prompts the user to input the names of two directors and then calculates and prints the cosine similarity between them based on the cast of movies they directed. If one or both directors are not found or are identical, it prints a message and returns a similarity score of 0.0.
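
The idea behind the comparison can be sketched without scikit-learn. The version below is a simplification: it weights each performer by a raw appearance count instead of the TF-IDF weights described above, so its scores will differ from the notebook's. The helper names and the placeholder data are assumptions made for illustration.

```python
import math
from collections import Counter


def cast_counts(director_name, movies):
    """Count how often each performer appears in the director's movies."""
    counts = Counter()
    for info in movies.values():
        if director_name in info["Director"]:
            counts.update(info["Main Cast"])
    return counts


def cosine_similarity_counts(a, b):
    """Cosine similarity between two performer-count vectors."""
    dot = sum(a[name] * b[name] for name in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # A director with no recorded cast (or not in the data) scores 0.0.
        return 0.0
    return dot / (norm_a * norm_b)


# Placeholder data in the same shape as movie_dict.
movies = {
    "Film 1": {"Director": ["Director X"], "Main Cast": ["Actor P", "Actor Q"]},
    "Film 2": {"Director": ["Director Y"], "Main Cast": ["Actor Q", "Actor R"]},
}
score = cosine_similarity_counts(cast_counts("Director X", movies),
                                 cast_counts("Director Y", movies))
print(round(score, 3))
```

With one shared performer out of two each, the dot product is 1 and both norms are the square root of 2, so the two placeholder directors score 0.5.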

Finally, the code presents a menu to the user, prompting them to choose between checking information about a movie ('movie'), exploring details about people involved in movies ('people'), or comparing directors based on cosine similarity of their cast ('comparison'). It then processes the user's choice and executes the corresponding function or displays an error message for an invalid choice.

In [None]:

print("What do you want to check on Metacritic? (Please choose 'movie', 'people', or 'comparison')")
choice = input("Your choice: ").strip().lower()

if choice=='movie':
    find_movie_info()

elif choice=='people':
    explore_people()

elif choice=='comparison':
    cosine_comparison()

else:
    print("Invalid choice. Please enter 'movie', 'people', or 'comparison'.")


What do you want to check on Metacritic? (Please choose 'movie', 'people', or 'comparison')
  (0, 30418)	0.061175452323935856
  (0, 26946)	0.061175452323935856
  (0, 8597)	0.061175452323935856
  (0, 23069)	0.061175452323935856
  (0, 15517)	0.061175452323935856
  (0, 564)	0.05186504551741835
  (0, 15720)	0.061175452323935856
  (0, 2187)	0.026548039119074014
  (0, 2560)	0.061175452323935856
  (0, 18857)	0.05773925818950446
  (0, 12047)	0.05773925818950446
  (0, 5227)	0.05530123965184974
  (0, 26423)	0.061175452323935856
  (0, 24638)	0.061175452323935856
  (0, 16209)	0.05773925818950446
  (0, 26472)	0.05773925818950446
  (0, 4440)	0.061175452323935856
  (0, 26471)	0.061175452323935856
  (0, 21941)	0.061175452323935856
  (0, 21272)	0.061175452323935856
  (0, 12430)	0.04209643536644841
  (0, 23315)	0.061175452323935856
  (0, 26213)	0.05773925818950446
  (0, 24971)	0.061175452323935856
  (0, 7634)	0.061175452323935856
  :	:
  (1002, 7103)	0.06824736509995778
  (1002, 24681)	0.085491525422615