# IMDb Data Analysis with PySpark

This notebook analyzes the IMDb dataset using PySpark for big data processing.

## 1. Data Loading

Load all IMDb datasets from the official IMDb data source.

In [1]:
from pyspark.sql import SparkSession, DataFrame
from requests import get
import tempfile
import os

LOAD_DATA_FROM_REMOTE = True

# Initialize Spark session
spark = SparkSession.builder \
    .appName("IMDb Data Processing") \
    .getOrCreate()

# Dictionary to store DataFrames
dataframes: dict[str, DataFrame] = {}


if not LOAD_DATA_FROM_REMOTE:
    # Load data from local files
    sources = {
        "name_basics": "data/name.basics.tsv.gz",
        "title_akas": "data/title.akas.tsv.gz",
        "title_basics": "data/title.basics.tsv.gz",
        "title_crew": "data/title.crew.tsv.gz",
        "title_episode": "data/title.episode.tsv.gz",
        "title_principals": "data/title.principals.tsv.gz",
        "title_ratings": "data/title.ratings.tsv.gz",
    }

    # For each source, load the data into a DataFrame
    for name, source in sources.items():
        # Spark can read gzip files directly
        df = spark.read.csv(
            source, sep="\t", header=True, nullValue="\\N", inferSchema=True
        )
        dataframes[name] = df
        print(f"Loaded {name}: {df.count()} rows")
else:
    # Load data from remote files
    sources_url = {
        "name_basics": "https://datasets.imdbws.com/name.basics.tsv.gz",
        "title_akas": "https://datasets.imdbws.com/title.akas.tsv.gz",
        "title_basics": "https://datasets.imdbws.com/title.basics.tsv.gz",
        "title_crew": "https://datasets.imdbws.com/title.crew.tsv.gz",
        "title_episode": "https://datasets.imdbws.com/title.episode.tsv.gz",
        "title_principals": "https://datasets.imdbws.com/title.principals.tsv.gz",
        "title_ratings": "https://datasets.imdbws.com/title.ratings.tsv.gz",
    }

    # For each source, load the data into a DataFrame
    for name, source in sources_url.items():
        # Download the file; raise on HTTP errors instead of saving an error page
        result = get(source)
        result.raise_for_status()
        
        # Create a temporary file to store the downloaded content
        with tempfile.NamedTemporaryFile(delete=False, suffix=".tsv.gz") as tmp_file:
            tmp_file.write(result.content)
            tmp_path = tmp_file.name
        
        # Spark can read gzip files directly from the file path
        df = spark.read.csv(
            tmp_path, sep="\t", header=True, nullValue="\\N", inferSchema=True
        )
        dataframes[name] = df
        print(f"Loaded {name}: {df.count()} rows")

Loaded name_basics: 14942131 rows
Loaded title_akas: 54383161 rows
Loaded title_basics: 12140521 rows
Loaded title_crew: 12140521 rows
Loaded title_episode: 9362369 rows
Loaded title_principals: 96523069 rows
Loaded title_ratings: 1608088 rows


In [2]:
# Preview each dataframe
for name, df in dataframes.items():
    print(f"=== {name} ===")
    df.show(5, truncate=False)

=== name_basics ===
+---------+---------------+---------+---------+---------------------------------+---------------------------------------+
|nconst   |primaryName    |birthYear|deathYear|primaryProfession                |knownForTitles                         |
+---------+---------------+---------+---------+---------------------------------+---------------------------------------+
|nm0000001|Fred Astaire   |1899     |1987     |actor,miscellaneous,producer     |tt0072308,tt0050419,tt0027125,tt0025164|
|nm0000002|Lauren Bacall  |1924     |2014     |actress,miscellaneous,soundtrack |tt0037382,tt0075213,tt0038355,tt0117057|
|nm0000003|Brigitte Bardot|1934     |NULL     |actress,music_department,producer|tt0057345,tt0049189,tt0056404,tt0054452|
|nm0000004|John Belushi   |1949     |1982     |actor,writer,music_department    |tt0072562,tt0077975,tt0080455,tt0078723|
|nm0000005|Ingmar Bergman |1918     |2007     |writer,director,actor            |tt0050986,tt0069467,tt0050976,tt0083922|
+---

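**Explanation:** We initialize a Spark session and load the seven IMDb datasets, either from local files or by downloading them from the official URLs. Each dataset is a gzip-compressed TSV (tab-separated values) file, which Spark reads directly. The `nullValue="\\N"` parameter maps IMDb's `\N` missing-value marker to NULL, and `inferSchema=True` lets Spark derive column types from the data. All DataFrames are stored in a dictionary keyed by dataset name, which the preview above iterates over.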
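To make the reader options concrete, here is a minimal plain-Python sketch (standard library only, not Spark) of what "gzip-compressed TSV with a header row and `\N` as the null marker" means. The two sample rows are copied from the preview above; the helper name `read_imdb_tsv` is hypothetical.

```python
import csv
import gzip
import io

# Tiny in-memory stand-in for a file such as name.basics.tsv.gz (sample rows only).
raw_tsv = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\n"
    "nm0000001\tFred Astaire\t1899\t1987\n"
    "nm0000003\tBrigitte Bardot\t1934\t\\N\n"
)
compressed = gzip.compress(raw_tsv.encode("utf-8"))


def read_imdb_tsv(gz_bytes: bytes) -> list:
    """Parse gzip-compressed TSV bytes, mapping the literal \\N marker to None."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")  # first line is the header
        return [
            {key: (None if value == "\\N" else value) for key, value in row.items()}
            for row in reader
        ]


rows = read_imdb_tsv(compressed)
print(rows[1]["primaryName"], rows[1]["deathYear"])
```

Unlike this sketch, which leaves every value as a string, the notebook's `inferSchema=True` also converts columns such as `birthYear` to integers.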

In [3]:
# Print schema for each dataframe
for name, df in dataframes.items():
    print(f"=== {name} schema ===")
    df.printSchema()

=== name_basics schema ===
root
 |-- nconst: string (nullable = true)
 |-- primaryName: string (nullable = true)
 |-- birthYear: integer (nullable = true)
 |-- deathYear: integer (nullable = true)
 |-- primaryProfession: string (nullable = true)
 |-- knownForTitles: string (nullable = true)

=== title_akas schema ===
root
 |-- titleId: string (nullable = true)
 |-- ordering: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- region: string (nullable = true)
 |-- language: string (nullable = true)
 |-- types: string (nullable = true)
 |-- attributes: string (nullable = true)
 |-- isOriginalTitle: integer (nullable = true)

=== title_basics schema ===
root
 |-- tconst: string (nullable = true)
 |-- titleType: string (nullable = true)
 |-- primaryTitle: string (nullable = true)
 |-- originalTitle: string (nullable = true)
 |-- isAdult: integer (nullable = true)
 |-- startYear: integer (nullable = true)
 |-- endYear: integer (nullable = true)
 |-- runtimeMinutes: string (n

---
## 2. How many total people in the dataset?

In [4]:
# Question 2: How many total people in the dataset?
total_people = dataframes["name_basics"].count()
print(f"Total number of people in the dataset: {total_people}")

Total number of people in the dataset: 14942131


                                                                                

**Explanation:** We use the `count()` action on the `name_basics` DataFrame to get the total number of rows. Each row represents a unique person in the IMDb dataset, identified by their `nconst` (name constant) ID.

---
## 3. What is the earliest year of birth?

In [5]:
# Question 3: What is the earliest year of birth?
from pyspark.sql.functions import min as spark_min, col

earliest_birth = dataframes["name_basics"].select(spark_min("birthYear")).collect()[0][0]
print(f"Earliest year of birth: {earliest_birth}")

# Show who was born in that year
dataframes["name_basics"].filter(col("birthYear") == earliest_birth).show(truncate=False)

                                                                                

Earliest year of birth: 4


+---------+------------------+---------+---------+-----------------+---------------------------------------+
|nconst   |primaryName       |birthYear|deathYear|primaryProfession|knownForTitles                         |
+---------+------------------+---------+---------+-----------------+---------------------------------------+
|nm0784172|Lucio Anneo Seneca|4        |65       |writer           |tt0043802,tt0218822,tt0049203,tt0972562|
+---------+------------------+---------+---------+-----------------+---------------------------------------+



                                                                                

**Explanation:** We use Spark's `min()` aggregation function on the `birthYear` column to find the smallest non-null value, then filter the DataFrame to display the person(s) with that birth year. The result is Lucio Anneo Seneca (Seneca the Younger), the Roman Stoic philosopher, with a recorded birth year of 4.
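Spark's aggregate functions skip NULL values, which matters here because most `birthYear` entries are missing. A minimal plain-Python sketch of that behaviour, using a hypothetical helper `min_ignoring_nulls` and a small sample list where `None` stands in for NULL:

```python
from typing import Optional

# Sample birth years; None stands in for a NULL birthYear.
birth_years = [1899, None, 1934, 4, None, 1949]


def min_ignoring_nulls(values) -> Optional[int]:
    """Return the smallest non-null value, or None if every value is null."""
    known = [v for v in values if v is not None]
    return min(known) if known else None


earliest = min_ignoring_nulls(birth_years)
print(earliest)
```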

---
## 4. How many years ago was this person born?

In [6]:
# Question 4: How many years ago was this person born?
from datetime import datetime

current_year = datetime.now().year
years_ago = current_year - earliest_birth
print(f"The person with the earliest birth year was born {years_ago} years ago (in {earliest_birth})")

The person with the earliest birth year was born 2021 years ago (in 4)


**Explanation:** We calculate the difference between the current year (obtained using Python's `datetime` module) and the earliest birth year found in the previous question. This gives us the number of years since that person was born.

---
## 5. Using only the data in the dataset, determine if this date of birth is correct.

In [7]:
# Question 5: Using only the data in the data set, determine if this date of birth is correct.

# Get the person with the earliest birth year
earliest_person = dataframes["name_basics"].filter(col("birthYear") == earliest_birth).first()
person_id = earliest_person["nconst"]
person_name = earliest_person["primaryName"]
death_year = earliest_person["deathYear"]

print(f"Person: {person_name} (ID: {person_id})")
print(f"Birth Year: {earliest_birth}")
print(f"Death Year: {death_year}")

if death_year:
    age_at_death = death_year - earliest_birth
    print(f"Age at death: {age_at_death} years")
    
# Check what titles this person is known for
known_titles = earliest_person["knownForTitles"]
print(f"\nKnown for titles: {known_titles}")

if known_titles:
    title_ids = known_titles.split(",")
    print("\nChecking the years of their known works:")
    for title_id in title_ids:
        title_info = dataframes["title_basics"].filter(col("tconst") == title_id).first()
        if title_info:
            print(f"  {title_id}: {title_info['primaryTitle']} ({title_info['startYear']})")

Person: Lucio Anneo Seneca (ID: nm0784172)
Birth Year: 4
Death Year: 65
Age at death: 61 years

Known for titles: tt0043802,tt0218822,tt0049203,tt0972562

Checking the years of their known works:
  tt0043802: The Affairs of Messalina (1951)
  tt0218822: Such Is Life (2000)
  tt0049203: Fedra, the Devil's Daughter (1956)
  tt0972562: Medea 2 (2006)


**Explanation:** To check the birth year, we cross-reference several data points:
1. We retrieve the person's death year and compute their age at death (61 years, which is plausible).
2. We look up each of the titles they are "known for" in `title_basics`.

The key insight: the recorded birth year (4) and death year (65) are consistent with each other and with the historical Seneca the Younger, who died in AD 65. (Historians usually place his birth around 4 BC; the integer `birthYear` column cannot distinguish BC from AD.) The films listed (1951-2006) are **adaptations of his literary works**, not films he personally worked on. IMDb credits writers on adaptations of their work, which explains why a philosopher who lived roughly 2,000 years ago appears in the database alongside modern films.
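The cross-check can be condensed into a small plain-Python helper. This is only a sketch: `check_birth_record` and its `max_lifespan` threshold are hypothetical, and the input values are the ones printed in the output above.

```python
def check_birth_record(birth_year, death_year, work_years, max_lifespan=120):
    """Return simple consistency facts about one person record."""
    age_at_death = None if death_year is None else death_year - birth_year
    return {
        "age_at_death": age_at_death,
        # A missing death year cannot contradict the birth year.
        "lifespan_plausible": age_at_death is None or 0 <= age_at_death < max_lifespan,
        # Works dated after the death year cannot be first-hand contributions.
        "posthumous_works": [
            year for year in work_years
            if death_year is not None and year > death_year
        ],
    }


# Birth year, death year, and title years taken from the record shown above.
report = check_birth_record(4, 65, [1951, 2000, 1956, 2006])
print(report)
```

All four title years fall after the recorded death year, which supports reading the credits as adaptations rather than as evidence against the birth year.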

---
## 6. Explain the reasoning for the answer in a code comment or new markdown cell.

In [8]:
# Question 6: Verify the birth date using data cross-referencing

# Get the person with the earliest birth year
person = dataframes["name_basics"].filter(col("birthYear") == earliest_birth).first()
person_id = person["nconst"]

print("=" * 60)
print(f"VERIFICATION ANALYSIS FOR: {person['primaryName']}")
print("=" * 60)

# 1. Check lifespan plausibility
print("\n1. LIFESPAN ANALYSIS:")
print(f"   Birth Year: {person['birthYear']}")
print(f"   Death Year: {person['deathYear']}")
if person['deathYear']:
    age_at_death = person['deathYear'] - person['birthYear']
    print(f"   Age at Death: {age_at_death} years")
    if age_at_death < 120:
        print("   -> Lifespan is PLAUSIBLE (under 120 years)")
    else:
        print("   -> Lifespan is IMPLAUSIBLE (over 120 years)")

# 2. Check profession
print(f"\n2. PROFESSION:")
print(f"   Listed as: {person['primaryProfession']}")
print("   -> 'writer' is consistent with Seneca being a philosopher/playwright")

# 3. Analyze known titles
print("\n3. KNOWN WORKS ANALYSIS:")
known_titles = person['knownForTitles']
if known_titles:
    title_ids = known_titles.split(",")
    for title_id in title_ids:
        title = dataframes["title_basics"].filter(col("tconst") == title_id).first()
        # Look up the person's credited role for this title in title_principals
        principals = dataframes["title_principals"].filter(
            (col("tconst") == title_id) & (col("nconst") == person_id)
        ).first()
        role = principals["category"] if principals else "unknown"
        if title:
            print(f"   - {title['primaryTitle']} ({title['startYear']}) - Role: {role}")

# 4. Conclusion
print("\n" + "=" * 60)
print("CONCLUSION:")
print("=" * 60)
print("""
The birth year (4 AD) is HISTORICALLY CORRECT for Lucio Anneo Seneca,
the famous Roman Stoic philosopher and playwright.

Key evidence:
- Age at death (61 years) matches historical records (4 AD - 65 AD)
- Profession 'writer' aligns with his role as philosopher/playwright
- The modern films (1951-2006) are ADAPTATIONS of his classical works
- IMDb credits original authors for adaptations of their writings

Therefore, the birth date IS CORRECT - this is genuinely Seneca the Younger,
and the modern films are based on his ancient plays like 'Medea' and 'Phaedra'.
""")

VERIFICATION ANALYSIS FOR: Lucio Anneo Seneca

1. LIFESPAN ANALYSIS:
   Birth Year: 4
   Death Year: 65
   Age at Death: 61 years
   -> Lifespan is PLAUSIBLE (under 120 years)

2. PROFESSION:
   Listed as: writer
   -> 'writer' is consistent with Seneca being a philosopher/playwright

3. KNOWN WORKS ANALYSIS:
   - The Affairs of Messalina (1951) - Role: unknown
   - Such Is Life (2000) - Role: writer
   - Fedra, the Devil's Daughter (1956) - Role: writer
   - Medea 2 (2006) - Role: writer

CONCLUSION:

The birth year (4 AD) is HISTORICALLY CORRECT for Lucio Anneo Seneca,
the famous Roman Stoic philosopher and playwright.

Key evidence:
- Age at death (61 years) matches historical records (4 AD - 65 AD)
- Profession 'writer' aligns with his role as philosopher/playwright
- The modern films (1951-2006) are ADAPTATIONS of his classical works
- IMDb credits original authors for adaptations of their writings

Therefore, the birth date IS CORRECT - this is genuinely Seneca the Younger,
and the modern films are based on his ancient plays like 'Medea' and 'Phaedra'.



                                                                                

**Explanation:** This verification combines several data sources:
1. **Lifespan check**: Age at death (61 years) is biologically plausible
2. **Profession check**: "writer" matches Seneca's historical role as playwright
3. **Role analysis**: `title_principals` gives his credited role in each film (writer for three of the four; no entry was found for the fourth)

The conclusion is that the recorded dates are plausible and internally consistent: the record refers to the historical Seneca, and the modern films are adaptations of his classical works (Medea, Phaedra, etc.).

---
## 7. What is the most recent date of birth?

In [9]:
# Question 7: What is the most recent date of birth?
from pyspark.sql.functions import max as spark_max, col

recent_birth = dataframes["name_basics"].select(spark_max("birthYear")).collect()[0][0]
print(f"Most recent year of birth: {recent_birth}")

# Show who was born in that year
dataframes["name_basics"].filter(col("birthYear") == recent_birth).show(truncate=False)

                                                                                

Most recent year of birth: 2025


+----------+-----------------+---------+---------+---------------------+------------------------------------------+
|nconst    |primaryName      |birthYear|deathYear|primaryProfession    |knownForTitles                            |
+----------+-----------------+---------+---------+---------------------+------------------------------------------+
|nm16784939|Kyrah Ivy Jackson|2025     |NULL     |actress              |NULL                                      |
|nm5642311 |Chase Ramsey     |2025     |NULL     |actor,director,writer|tt17505010,tt14715170,tt4236770,tt17062324|
+----------+-----------------+---------+---------+---------------------+------------------------------------------+



                                                                                

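**Explanation:** As with the earliest birth year, we use Spark's `max()` aggregation function on the `birthYear` column. The result shows two people with a birth year of 2025. One has no known titles, which fits a newborn or a recently added entry. The other, Chase Ramsey, is listed as actor, director, and writer with four known titles, which is implausible for someone born in 2025 and suggests a data entry error.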

---
## 8. What percentage of the people do not have a listed date of birth?

In [10]:
# Question 8: What percentage of the people do not have a listed date of birth?
from pyspark.sql.functions import col

total_people = dataframes["name_basics"].count()
people_without_birth = dataframes["name_basics"].filter(col("birthYear").isNull()).count()

percentage_without_birth = (people_without_birth / total_people) * 100
print(f"Total people: {total_people}")
print(f"People without birth year: {people_without_birth}")
print(f"Percentage without birth year: {percentage_without_birth:.2f}%")

Total people: 14942131
People without birth year: 14281170
Percentage without birth year: 95.58%


                                                                                

**Explanation:** We filter the `name_basics` DataFrame to count rows where `birthYear` is NULL, then divide by the total count. The high percentage (95.58%) indicates that most people in IMDb have no recorded birth year, plausibly because many entries are minor cast or crew members.
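As a worked check of the arithmetic, the two counts printed above can be plugged in directly. The complementary figure (people who do have a birth year) is derived here and was not computed in the notebook.

```python
# Counts copied from the cell output above.
total_people = 14_942_131
people_without_birth = 14_281_170

people_with_birth = total_people - people_without_birth
pct_without = people_without_birth / total_people * 100
pct_with = people_with_birth / total_people * 100

print(f"Without birth year: {pct_without:.2f}%")
print(f"With birth year: {people_with_birth} people ({pct_with:.2f}%)")
```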

---
## 9. What is the length of the longest "short" after 1900?

In [11]:
# Question 9: What is the length of the longest "short" after 1900?
from pyspark.sql.functions import max as spark_max, col

longest_short = dataframes["title_basics"].filter(
    (col("titleType") == "short") & 
    (col("startYear") > 1900) &
    (col("runtimeMinutes").isNotNull())
).select(spark_max(col("runtimeMinutes").cast("int"))).collect()[0][0]

print(f"Length of the longest 'short' after 1900: {longest_short} minutes")

# Show the title(s) with that runtime
dataframes["title_basics"].filter(
    (col("titleType") == "short") & 
    (col("startYear") > 1900) &
    (col("runtimeMinutes").cast("int") == longest_short)
).show(truncate=False)

                                                                                

Length of the longest 'short' after 1900: 1311 minutes


+----------+---------+-------------+-------------+-------+---------+-------+--------------+-----------+
|tconst    |titleType|primaryTitle |originalTitle|isAdult|startYear|endYear|runtimeMinutes|genres     |
+----------+---------+-------------+-------------+-------+---------+-------+--------------+-----------+
|tt35509411|short    |Our First Day|Our First Day|0      |2025     |NULL   |1311          |Drama,Short|
+----------+---------+-------------+-------------+-------+---------+-------+--------------+-----------+



                                                                                

**Explanation:** We filter `title_basics` for entries where `titleType` equals "short" and `startYear` is greater than 1900. Because schema inference read `runtimeMinutes` as a string, we cast it to integer before taking the maximum. The result, 1311 minutes (almost 22 hours), is unusually long for a "short" and may be a data entry error in IMDb.
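The cast matters because text values compare character by character rather than numerically. The plain-Python sketch below illustrates this with made-up runtimes (only 1311 comes from the result above). The helper `to_int_or_none` is hypothetical and assumes a lenient cast that turns unparseable values into `None`; whether Spark behaves that way depends on its configuration.

```python
# Hypothetical runtime values stored as text, as in a string-typed column.
numeric_text = ["99", "1311", "45"]


def to_int_or_none(value):
    """Lenient integer cast: non-numeric or missing values become None."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


# Comparing text is lexicographic: "99" sorts after "1311" because "9" > "1".
as_text = max(numeric_text)

# Casting first gives the numeric maximum; junk and missing values are skipped.
mixed = numeric_text + [None, "n/a"]
as_int = max(n for n in map(to_int_or_none, mixed) if n is not None)

print(as_text, as_int)
```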

---
## 10. What is the length of the shortest "movie" after 1900?

In [12]:
# Question 10: What is the length of the shortest "movie" after 1900?
from pyspark.sql.functions import min as spark_min, col

shortest_movie = dataframes["title_basics"].filter(
    (col("titleType") == "movie") & 
    (col("startYear") > 1900) &
    (col("runtimeMinutes").isNotNull()) &
    (col("runtimeMinutes").cast("int") > 0)
).select(spark_min(col("runtimeMinutes").cast("int"))).collect()[0][0]

print(f"Length of the shortest 'movie' after 1900: {shortest_movie} minutes")

# Show the title(s) with that runtime
dataframes["title_basics"].filter(
    (col("titleType") == "movie") & 
    (col("startYear") > 1900) &
    (col("runtimeMinutes").cast("int") == shortest_movie)
).show(truncate=False)

                                                                                

Length of the shortest 'movie' after 1900: 1 minutes


+----------+---------+-------------------------+-------------------------+-------+---------+-------+--------------+----------------------+
|tconst    |titleType|primaryTitle             |originalTitle            |isAdult|startYear|endYear|runtimeMinutes|genres                |
+----------+---------+-------------------------+-------------------------+-------+---------+-------+--------------+----------------------+
|tt0025166 |movie    |George White's Scandals  |George White's Scandals  |0      |1934     |NULL   |1             |Comedy,Musical,Romance|
|tt0469119 |movie    |Love Trap                |Love Trap                |0      |2005     |NULL   |1             |Drama                 |
|tt0810779 |movie    |Bound by Blood           |Bound by Blood           |0      |2007     |NULL   |1             |Action                |
|tt0848384 |movie    |Nikkatsu on Parade       |Nikkatsu on Parade       |0      |1930     |NULL   |1             |Documentary           |
|tt12893768|movie    |If I 

                                                                                

**Explanation:** We filter for `titleType` equals "movie", `startYear` greater than 1900, and `runtimeMinutes` greater than 0 (to exclude invalid entries). Using `min()` on the cast integer value gives us the shortest runtime. Multiple movies are listed at 1 minute, which likely represents incomplete data or placeholder entries.

---
## 11. List all of the genres represented.

In [13]:
# Question 11: List of all of the genres represented
from pyspark.sql.functions import explode, split, col

# Genres are comma-separated, so we need to split and explode them
all_genres = dataframes["title_basics"].filter(col("genres").isNotNull()) \
    .select(explode(split(col("genres"), ",")).alias("genre")) \
    .distinct() \
    .orderBy("genre") \
    .collect()

genres_list = [row["genre"] for row in all_genres]
print(f"Total unique genres: {len(genres_list)}")
print("\nAll genres:")
for genre in genres_list:
    print(f"  - {genre}")

Total unique genres: 28

All genres:
  - Action
  - Adult
  - Adventure
  - Animation
  - Biography
  - Comedy
  - Crime
  - Documentary
  - Drama
  - Family
  - Fantasy
  - Film-Noir
  - Game-Show
  - History
  - Horror
  - Music
  - Musical
  - Mystery
  - News
  - Reality-TV
  - Romance
  - Sci-Fi
  - Short
  - Sport
  - Talk-Show
  - Thriller
  - War
  - Western


                                                                                

**Explanation:** The `genres` column contains comma-separated values (e.g., "Comedy,Drama,Romance"). We use:
1. `split()` to break the string into an array
2. `explode()` to create one row per genre
3. `distinct()` to remove duplicates
4. `orderBy()` to sort alphabetically

This gives us all unique genres in the IMDb dataset.
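The four steps above can be mirrored in plain Python on a handful of sample values. This is a sketch of the logic only, not of Spark's execution; the sample strings are illustrative and `None` stands in for a NULL `genres` value.

```python
# Sample genres column; None stands in for NULL.
genres_column = ["Comedy,Drama,Romance", "Drama,Short", None, "Comedy"]

# split + explode: one list element per (row, genre) pair, skipping nulls.
exploded = [
    genre
    for value in genres_column
    if value is not None
    for genre in value.split(",")
]

# distinct + orderBy: remove duplicates, then sort alphabetically.
unique_sorted = sorted(set(exploded))

print(len(exploded), unique_sorted)
```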

---
## 12. What is the highest rated comedy "movie" in the dataset?

Note: If there is a tie, the tie shall be broken by the movie with the most votes.

In [14]:
# Question 12: What is the highest rated comedy "movie" in the dataset?
# Tie-breaker: movie with the most votes
from pyspark.sql.functions import col

# Join title_basics with title_ratings, filter for comedy movies
comedy_movies = dataframes["title_basics"].filter(
    (col("titleType") == "movie") & 
    (col("genres").contains("Comedy"))
).join(
    dataframes["title_ratings"], 
    dataframes["title_basics"]["tconst"] == dataframes["title_ratings"]["tconst"]
).select(
    dataframes["title_basics"]["tconst"],
    "primaryTitle",
    "startYear",
    "genres",
    "averageRating",
    "numVotes"
).orderBy(
    col("averageRating").desc(), 
    col("numVotes").desc()
)

print("Highest rated comedy movie (tie-broken by most votes):")
comedy_movies.show(1, truncate=False)

# Store the best comedy movie for the next question
best_comedy = comedy_movies.first()
best_comedy_id = best_comedy["tconst"]
print(f"\nMovie ID: {best_comedy_id}")

Highest rated comedy movie (tie-broken by most votes):


                                                                                

+---------+------------+---------+------+-------------+--------+
|tconst   |primaryTitle|startYear|genres|averageRating|numVotes|
+---------+------------+---------+------+-------------+--------+
|tt8458418|O La La     |2018     |Comedy|10.0         |6       |
+---------+------------+---------+------+-------------+--------+
only showing top 1 row



Movie ID: tt8458418


                                                                                

**Explanation:** We perform a JOIN between `title_basics` and `title_ratings` on the `tconst` column. We filter for:
- `titleType` = "movie"
- `genres` contains "Comedy" (using `contains()` for substring matching)

We then sort by `averageRating` descending, with `numVotes` as the tie-breaker. Note: The top result may have few votes, which is statistically unreliable. A more robust analysis might require a minimum vote threshold.
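The sort order and the effect of a vote threshold can be sketched in plain Python. Only the "O La La" row comes from the output above; the other rows, the `top_rated` helper, and the threshold of 1000 votes are hypothetical.

```python
from typing import NamedTuple


class Rated(NamedTuple):
    title: str
    rating: float
    votes: int


movies = [
    Rated("O La La", 10.0, 6),              # row from the output above
    Rated("Hypothetical A", 10.0, 3),       # made-up tie on rating, fewer votes
    Rated("Hypothetical B", 9.1, 25_000),   # made-up
    Rated("Hypothetical C", 8.7, 400_000),  # made-up
]


def top_rated(rows, min_votes=0):
    """Highest rating first; ties broken by the larger vote count."""
    eligible = [r for r in rows if r.votes >= min_votes]
    ranked = sorted(eligible, key=lambda r: (-r.rating, -r.votes))
    return ranked[0] if ranked else None


print(top_rated(movies).title)
print(top_rated(movies, min_votes=1000).title)
```

With no threshold the six-vote title wins the tie-break; with a threshold, a title with a slightly lower rating but far more votes comes out on top.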

---
## 13. Who was the director of the movie?

In [15]:
# Question 13: Who was the director of the movie?
from pyspark.sql.functions import col

# Get the director(s) from title_crew
crew_info = dataframes["title_crew"].filter(col("tconst") == best_comedy_id).first()
director_ids = crew_info["directors"]

print(f"Director ID(s): {director_ids}")

if director_ids:
    # Directors can be comma-separated if there are multiple
    director_id_list = director_ids.split(",")
    
    print("\nDirector(s):")
    for dir_id in director_id_list:
        director_info = dataframes["name_basics"].filter(col("nconst") == dir_id).first()
        if director_info:
            print(f"  - {director_info['primaryName']} ({dir_id})")

                                                                                

Director ID(s): nm7709412

Director(s):


  - Sripad Pai (nm7709412)


                                                                                

**Explanation:** We look up the movie's `tconst` in the `title_crew` DataFrame to get the `directors` field, which contains one or more comma-separated `nconst` IDs. We then look up each ID in `name_basics` to convert it to a human-readable name.
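The ID-to-name resolution amounts to splitting the field and looking each ID up in a name table. A plain-Python sketch, with a dictionary standing in for `name_basics`; the helper `resolve_directors` and the ID `nm9999999` are hypothetical.

```python
# Small stand-in for name_basics: nconst -> primaryName (values from the outputs above).
names = {"nm7709412": "Sripad Pai", "nm0000005": "Ingmar Bergman"}


def resolve_directors(directors_field, name_lookup):
    """Split a comma-separated directors field and map each ID to a name."""
    if not directors_field:  # None (NULL) or empty string: no directors listed
        return []
    return [
        (director_id, name_lookup.get(director_id, "unknown"))
        for director_id in directors_field.split(",")
    ]


print(resolve_directors("nm7709412", names))
print(resolve_directors("nm7709412,nm9999999", names))
```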

---
## 14. List, if any, the alternate titles for the movie.

In [16]:
# Question 14: List, if any, the alternate titles for the movie.
from pyspark.sql.functions import col

# Get alternate titles from title_akas
alternate_titles = dataframes["title_akas"].filter(
    (col("titleId") == best_comedy_id) & 
    (col("isOriginalTitle") == 0)
).select("title", "region", "language", "types", "attributes")

alt_count = alternate_titles.count()
print(f"Movie: {best_comedy['primaryTitle']} ({best_comedy_id})")
print(f"Number of alternate titles: {alt_count}\n")

if alt_count > 0:
    print("Alternate titles:")
    alternate_titles.show(alt_count, truncate=False)
else:
    print("No alternate titles found.")

                                                                                

Movie: O La La (tt8458418)
Number of alternate titles: 1

Alternate titles:


+-------+------+--------+-----------+----------+
|title  |region|language|types      |attributes|
+-------+------+--------+-----------+----------+
|O La La|IN    |en      |imdbDisplay|NULL      |
+-------+------+--------+-----------+----------+



                                                                                

**Explanation:** The `title_akas` (Also Known As) table contains alternate titles for movies in different regions and languages. We filter by the movie's `titleId` and keep only rows where `isOriginalTitle` is 0, which excludes the original title. Such rows can be localized, translated, or alternative marketing titles. Here the single match is the IMDb display title for India, which is identical to the primary title, so the movie has no distinct alternate title.
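A stricter reading of "alternate title" would also drop rows whose title matches the primary title. The plain-Python sketch below shows both variants. Only the first sample row comes from the output above; the other rows, the helper `alternate_titles`, and its `distinct_only` flag are hypothetical.

```python
akas = [
    {"title": "O La La", "region": "IN", "isOriginalTitle": 0},    # row from the output above
    {"title": "O La La", "region": None, "isOriginalTitle": 1},    # made-up original-title row
    {"title": "Oh La La!", "region": "US", "isOriginalTitle": 0},  # made-up localized title
]


def alternate_titles(rows, primary_title, distinct_only=False):
    """Keep non-original rows; optionally drop titles equal to the primary title."""
    result = [row for row in rows if row["isOriginalTitle"] == 0]
    if distinct_only:
        result = [
            row for row in result
            if row["title"].casefold() != primary_title.casefold()
        ]
    return result


print([row["title"] for row in alternate_titles(akas, "O La La")])
print([row["title"] for row in alternate_titles(akas, "O La La", distinct_only=True)])
```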