# Explainer Notebook for Final Project
### Group 72 
#### 02806 Social Data Analysis and Visualization

s193602

s

## 1. Motivation

Our data is a curated selection from the IMDb Datasets, which offer publicly available metadata on the films and TV shows listed in the Internet Movie Database (IMDb). Specifically, we worked with structured information on titles, genres, release years, and user ratings to explore how the film industry has evolved over time, with a focus on genre distribution. We chose this dataset because all members of our group enjoy film and TV, though we are split on which genres are most enjoyable to watch, which led to a discussion on the evolution of genre. The dataset comes from the most widely used and recognized film/TV database, and contains the genre information needed to perform this analysis.

Our goal was to create a fun, interactive experience that highlights how film genres have shifted across time and geography. We wanted users to be able to explore not just how many movies were made in a given year, but also what kinds of stories were being told, and how that has changed across decades, potentially in response to historical events or general public sentiment. Through clean visualizations and intuitive layouts, we aimed to make complex temporal and categorical data feel accessible, insightful, and interesting.



## 2. Basic stats. Let's understand the dataset better



IMDb distributes its movie metadata as several files, each covering a different kind of content. We combined three of them:

**title.basics.tsv.gz (main metadata)**

**title.ratings.tsv.gz (user ratings)**

**title.akas.tsv.gz (alternative names/regions)**  

### Combined Dataset Overview

IMDb does not provide an explicit production-country field in `title.basics`, so we used the `region` field from the `title.akas` dataset as a proxy, aggregating all regions associated with each title into a single pipe-separated (`|`) string.

After merging the three IMDb datasets, we obtained a structured dataframe with **714,015 rows**, all of title type `movie`. The table below summarizes the basic statistics for each column:

| Statistic       | tconst    | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | averageRating | numVotes | production_country |
|----------------|-----------|-----------|---------------|----------------|---------|------------|----------|------------------|--------|----------------|-----------|---------------------|
| **count**       | 714,015   | 714,015   | 714,013       | 714,013        | 714,015 | 608,940    | 0        | 450,303          | 637,026| 329,613        | 329,613   | 699,897             |
| **unique**      | 714,015   | 1         | 613,714       | 627,750        | 2       | 139        | 0        | 510              | 1,494  | 91             | 19,061    | 118,177             |
| **top**         | tt0000009 | movie     | Broken        | Broken         | 0       | 2022       | NaN      | 90               | Drama  | 6.2            | 9         | US                  |
| **freq**        | 1         | 714,015   | 63            | 62             | 704,832 | 20,755     | NaN      | 29,021           | 131,785| 11,004         | 7,518     | 136,485             |
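
As a rough illustration of the merge described above, the sketch below joins three toy dataframes with pandas. The toy IDs and values are made up, and the use of left joins and of sorted, de-duplicated region codes are assumptions rather than a record of our exact code.

```python
import pandas as pd

# Toy stand-ins for the three IMDb files (IDs and values are made up).
basics = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "titleType": ["movie", "movie", "movie"],
    "primaryTitle": ["Alpha", "Beta", "Gamma"],
    "genres": ["Romance", "Drama,Comedy", None],
})
ratings = pd.DataFrame({
    "tconst": ["tt001", "tt002"],
    "averageRating": [5.4, 7.1],
    "numVotes": [223, 1200],
})
akas = pd.DataFrame({
    "titleId": ["tt001", "tt001", "tt001", "tt002"],
    "region": ["US", "HU", "US", None],
})

# Collapse all regions of a title into one sorted, de-duplicated "|"-joined string.
regions = (
    akas.dropna(subset=["region"])
    .groupby("titleId")["region"]
    .agg(lambda s: "|".join(sorted(set(s))))
    .rename("production_country")
    .reset_index()
    .rename(columns={"titleId": "tconst"})
)

# Left joins keep every title from basics, even without ratings or regions.
merged = (
    basics.merge(ratings, on="tconst", how="left")
    .merge(regions, on="tconst", how="left")
)
print(merged[["tconst", "averageRating", "production_country"]])
```

Left joins would explain why `averageRating` and `production_country` have fewer non-missing values than `tconst` in the table above.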

We are only interested in the following columns:

`tconst`, `primaryTitle`, `originalTitle`, `startYear`, `runtimeMinutes`, `genres`, `averageRating`, `numVotes`, `production_country`

We keep only these columns and rename them, which gives the final dataset described below.


The dataset contains the following variables:  
- **title_id**: Unique identifier for each title.  
- **title**: Display title of the movie.  
- **original_title**: Original title of the movie.  
- **release_year**: Year the movie was released.  
- **runtime_minutes**: Duration of the movie in minutes.  
- **genre**: Genre(s) of the movie.  
- **imdb_rating**: IMDb user rating for the movie.  
- **vote_count**: Number of votes received for the IMDb rating.  
- **production_country**: Pipe-separated region codes from `title.akas`, used as a proxy for the countries where the movie was produced.  
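
The column selection and renaming can be sketched as follows. The old-to-new name mapping is inferred from the order of the two lists above, and the helper name `select_and_rename` and the toy rows are hypothetical.

```python
import pandas as pd

KEEP = ["tconst", "primaryTitle", "originalTitle", "startYear", "runtimeMinutes",
        "genres", "averageRating", "numVotes", "production_country"]

# Assumed mapping from IMDb column names to the names used in the final dataset.
RENAME = {
    "tconst": "title_id",
    "primaryTitle": "title",
    "originalTitle": "original_title",
    "startYear": "release_year",
    "runtimeMinutes": "runtime_minutes",
    "genres": "genre",
    "averageRating": "imdb_rating",
    "numVotes": "vote_count",
}

def select_and_rename(df):
    """Keep the columns of interest, rename them, and use nullable integer dtypes."""
    out = df[KEEP].rename(columns=RENAME)
    for col in ["release_year", "runtime_minutes", "vote_count"]:
        # Non-numeric markers such as "\N" become missing values.
        out[col] = pd.to_numeric(out[col], errors="coerce").astype("Int64")
    return out

# Toy input with made-up rows; "\\N" mimics IMDb's missing-value marker.
raw = pd.DataFrame({
    "tconst": ["tt001", "tt002"],
    "titleType": ["movie", "movie"],
    "primaryTitle": ["Alpha", "Beta"],
    "originalTitle": ["Alpha", "Beta"],
    "isAdult": [0, 0],
    "startYear": ["1994", "\\N"],
    "endYear": ["\\N", "\\N"],
    "runtimeMinutes": ["90", "\\N"],
    "genres": ["Drama", "Comedy,Romance"],
    "averageRating": [7.1, None],
    "numVotes": [120, None],
    "production_country": ["US", "DE|FR"],
})
final = select_and_rename(raw)
print(final.dtypes)
```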


The dataset has many missing values, so not every movie can be used in every plot. For each plot, the data is filtered to rows with no missing values in the columns that plot needs.

### Final data

| Column | Data Type | Non-Missing | Missing | Unique Values | Example |
|--------|------------|--------------|---------|----------------|---------|
| title_id | object | 714015 | 0 | 714015 | tt0000009 |
| title | object | 714013 | 2 | 613714 | Miss Jerry |
| original_title | object | 714013 | 2 | 627750 | Miss Jerry |
| release_year | Int64 | 608940 | 105075 | 139 | 1894 |
| runtime_minutes | Int64 | 450303 | 263712 | 510 | 45 |
| genre | object | 637026 | 76989 | 1494 | Romance |
| imdb_rating | float64 | 329613 | 384402 | 91 | 5.4 |
| vote_count | Int64 | 329613 | 384402 | 19061 | 223 |
| production_country | object | 699897 | 14118 | 118177 | AU\|DE\|HU\|US |
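
A per-column overview like the one above can be produced with a few pandas calls. This is a minimal sketch on made-up rows; `summarize` is a hypothetical helper name.

```python
import pandas as pd

def summarize(df):
    """One row per column: dtype, non-missing count, missing count, unique values."""
    return pd.DataFrame({
        "Data Type": df.dtypes.astype(str),
        "Non-Missing": df.notna().sum(),
        "Missing": df.isna().sum(),
        "Unique Values": df.nunique(),  # missing values are not counted as a value
    })

# Toy data with made-up rows.
toy = pd.DataFrame({
    "title_id": ["tt001", "tt002", "tt003"],
    "release_year": pd.array([1994, None, 1994], dtype="Int64"),
})
summary = summarize(toy)
print(summary)
```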

## 3. Data Analysis

The IMDb data was fairly clean to begin with, which made data handling much easier. For many of the plots, the only processing needed was to count or average by year, genre, or country.

Each movie can have multiple genres, and the dataset contained a total of 28 unique genres across all movies:
- ['Action', 'Adult', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Film-Noir', 'Game-Show', 'History', 'Horror', 'Music', 'Musical', 'Mystery', 'News', 'Reality-TV', 'Romance', 'Sci-Fi', 'Short', 'Sport', 'Talk-Show', 'Thriller', 'War', 'Western']

We decided to focus on the 10 most popular genres in the dataset for our plots, so the dataset was further filtered to only contain movies with at least one of the following focus genres:

##### Focus genres
- ['Drama', 'Comedy', 'Documentary', 'Romance', 'Action', 'Crime', 'Thriller', 'Horror', 'Adventure', 'Mystery']
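
One way to arrive at such a list is to count how often each genre occurs and keep the most frequent ones. The sketch below assumes that "most popular" means "appears on the most movies"; the helper name `genre_counts` and the toy rows are made up.

```python
import pandas as pd

def genre_counts(genre_col):
    """Count occurrences of each genre in a comma-separated genre column."""
    return (
        genre_col.dropna()
        .str.split(",")
        .explode()
        .str.strip()
        .value_counts()
    )

# Toy genre column; the real one has 28 distinct genres.
genres = pd.Series(["Drama", "Drama,Romance", "Comedy,Drama", None, "Comedy"])
counts = genre_counts(genres)
top_genres = counts.head(2).index.tolist()  # head(10) would give a top-10 list
print(counts)
```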


### Plot 1: Area chart

1. Selected relevant columns: `title`, `release_year`, and `genre`, and dropped rows with missing values.
2. Split multiple genres into separate rows using `.explode()` after splitting the genre strings.
3. Trimmed whitespace from genre names.
4. Filtered to include only genres in `FOCUS_GENRES`.
5. Limited data to movies released between **1930 and 2023**.
6. Counted the number of unique movie titles per year to avoid duplicate counts across multiple genres.
7. Computed yearly counts of movies per genre using `.groupby()` and `.unstack()`.
8. Reindexed genre columns to match `FOCUS_GENRES` and filled missing values with 0.
9. Calculated the percentage share of each genre per year among the `FOCUS_GENRES`.
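
The steps above can be sketched roughly as follows. The function name `genre_share_by_year` and the toy rows are hypothetical; the column names and `FOCUS_GENRES` are taken from this notebook.

```python
import pandas as pd

FOCUS_GENRES = ["Drama", "Comedy", "Documentary", "Romance", "Action",
                "Crime", "Thriller", "Horror", "Adventure", "Mystery"]

def genre_share_by_year(movies, year_min=1930, year_max=2023):
    # Steps 1-3: select columns, drop missing rows, one row per (movie, genre).
    df = movies[["title", "release_year", "genre"]].dropna()
    df = df.assign(genre=df["genre"].str.split(",")).explode("genre", ignore_index=True)
    df["genre"] = df["genre"].str.strip()
    # Steps 4-5: keep focus genres and the chosen year range.
    df = df[df["genre"].isin(FOCUS_GENRES)]
    df = df[df["release_year"].between(year_min, year_max)]
    # Step 6: unique titles per year, so multi-genre movies count once.
    totals = df.groupby("release_year")["title"].nunique()
    # Steps 7-8: year x genre count table with a fixed column order.
    counts = df.groupby(["release_year", "genre"]).size().unstack(fill_value=0)
    counts = counts.reindex(columns=FOCUS_GENRES, fill_value=0)
    # Step 9: percentage share of each genre within a year.
    share = counts.div(counts.sum(axis=1), axis=0) * 100
    return totals, share

# Toy data with made-up rows.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_year": [2000, 2000, 2001, 1920],
    "genre": ["Drama,Comedy", "Drama", "Horror, Western", "Drama"],
})
totals, share = genre_share_by_year(movies)
print(share.round(1))
```

Note that the shares are relative to all genre tags within the focus genres, so a movie with two focus genres contributes to two columns, while `totals` counts it only once.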

### Plot 2: IMDb rating and runtime barchart

1. Loaded the cleaned movie dataset from a `.parquet` file.
2. Defined a function to extract the first matching genre from the `genre` field based on a predefined `MOVIE_GENRES` list.
3. Created a new column `primary_genre` with the first valid genre per movie.
4. Filtered out rows with missing or non-matching `primary_genre` values.
5. Converted `imdb_rating` and `runtime_minutes` columns to numeric, coercing invalid entries to `NaN`.
6. Grouped data by `primary_genre` and computed:
   - Average IMDb rating
   - Average runtime (in minutes)
   - Movie count per genre
7. Dropped any remaining rows with missing values from the summary table.
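
A minimal sketch of this aggregation is shown below. The contents of `MOVIE_GENRES` are not listed in this notebook, so the focus genres are used here as a stand-in; the function names and toy rows are hypothetical, and loading from the `.parquet` file is replaced by an in-memory dataframe.

```python
import pandas as pd

# Assumption: MOVIE_GENRES is not spelled out above; the focus genres stand in for it.
MOVIE_GENRES = ["Drama", "Comedy", "Documentary", "Romance", "Action",
                "Crime", "Thriller", "Horror", "Adventure", "Mystery"]

def first_matching_genre(genre_field):
    """Return the first genre in the comma-separated field that is in MOVIE_GENRES."""
    if not isinstance(genre_field, str):
        return None
    for genre in genre_field.split(","):
        genre = genre.strip()
        if genre in MOVIE_GENRES:
            return genre
    return None

def genre_summary(movies):
    df = movies.assign(
        primary_genre=movies["genre"].apply(first_matching_genre),
        imdb_rating=pd.to_numeric(movies["imdb_rating"], errors="coerce"),
        runtime_minutes=pd.to_numeric(movies["runtime_minutes"], errors="coerce"),
    ).dropna(subset=["primary_genre"])
    summary = df.groupby("primary_genre").agg(
        avg_rating=("imdb_rating", "mean"),
        avg_runtime=("runtime_minutes", "mean"),
        movie_count=("title_id", "count"),
    )
    # Genres whose averages could not be computed are dropped.
    return summary.dropna()

# Toy data with made-up rows.
movies = pd.DataFrame({
    "title_id": ["t1", "t2", "t3", "t4"],
    "genre": ["Western,Drama", "Drama", "Comedy", "Western"],
    "imdb_rating": [6.0, 8.0, "n/a", 7.0],
    "runtime_minutes": [100, 120, 90, 80],
})
summary = genre_summary(movies)
print(summary)
```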

### Plot 3: Map plot
1. Loaded the movie dataset from a `.parquet` file.
2. Split the `production_country` field (pipe-separated string) into lists and exploded into separate rows.
3. Converted 2-letter country codes to 3-letter ISO Alpha-3 codes using the `pycountry` package.
4. Filtered out rows where country conversion failed (could not be mapped to a valid ISO Alpha-3 code).
5. Mapped 3-letter country codes to full country names for improved hover text.
6. Split the `genre` field by comma and exploded the list to associate each movie-country row with individual genres.
7. Stripped whitespace from genre names to ensure clean and consistent values.
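
The country and genre expansion can be sketched as below. The names `to_alpha3`, `country_genre_rows` and `FALLBACK` are hypothetical, and the small fallback table exists only so the example still runs where `pycountry` is not installed; it is not part of our pipeline.

```python
import pandas as pd

try:
    import pycountry
except ImportError:
    pycountry = None  # the sketch then uses the tiny table below

# Hypothetical fallback table, only for running the example without pycountry.
FALLBACK = {"US": ("USA", "United States"), "DE": ("DEU", "Germany")}

def to_alpha3(code):
    """Return (alpha-3 code, country name), or (None, None) if the code is unknown."""
    if pycountry is None:
        return FALLBACK.get(code, (None, None))
    try:
        country = pycountry.countries.get(alpha_2=code)
    except LookupError:  # guard in case a pycountry version raises instead of returning None
        country = None
    if country is None:
        return (None, None)
    return (country.alpha_3, country.name)

def country_genre_rows(movies):
    df = movies.dropna(subset=["production_country", "genre"])
    # One row per (movie, country code).
    df = df.assign(country_code=df["production_country"].str.split("|"))
    df = df.explode("country_code", ignore_index=True)
    pairs = [to_alpha3(code) for code in df["country_code"]]
    df["iso_alpha3"] = [pair[0] for pair in pairs]
    df["country_name"] = [pair[1] for pair in pairs]
    # Drop region codes that are not valid ISO countries.
    df = df.dropna(subset=["iso_alpha3"])
    # One row per (movie, country, genre).
    df = df.assign(genre=df["genre"].str.split(",")).explode("genre", ignore_index=True)
    df["genre"] = df["genre"].str.strip()
    return df[["title_id", "iso_alpha3", "country_name", "genre"]]

# Toy data with made-up rows; "XWW" stands for a region code with no ISO country.
movies = pd.DataFrame({
    "title_id": ["t1", "t2"],
    "production_country": ["US|XWW", "DE"],
    "genre": ["Drama, Comedy", "Horror"],
})
rows = country_genre_rows(movies)
print(rows)
```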

### Plot 4: Polar plot

For the polar plot, we needed exact release dates for the movies, but the IMDb dataset only contains the release year. To obtain precise release dates, we used the TMDB API: with a personal API key, the exact date can be requested for each movie ID. Because of the API limits this was slow, with an estimated 30 hours for 100k movies. Using parallelized requests, we gathered exact release dates for 50k movies. Before fetching, we filtered the data to only include movies from after the 1980s, so that the conclusions drawn from the plot fit better with modern traditions, holidays, and so on.
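
The parallel fetching can be sketched with a thread pool. To keep the example runnable offline, `fetch_release_date` is a hypothetical stand-in that looks up made-up dates in a dictionary instead of calling the TMDB API, and the `fetch_status` field name is an assumption; real code would also need to respect the API rate limits.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up dates standing in for API responses.
FAKE_RESPONSES = {"tt001": "1994-10-14", "tt002": "2001-12-19"}

def fetch_release_date(title_id):
    """Hypothetical stand-in for one API lookup; no network call is made here."""
    date = FAKE_RESPONSES.get(title_id)
    return {
        "title_id": title_id,
        "release_date_full": date,
        "fetch_status": "ok" if date else "not_found",
    }

def fetch_all(title_ids, max_workers=8):
    """Run the lookups concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_release_date, title_ids))

results = fetch_all(["tt001", "tt003", "tt002"])
for row in results:
    print(row)
```

Threads suit this kind of work because each request spends most of its time waiting on the network rather than computing.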

1. Loaded movie dataset and filtered rows with successful date fetch status.
2. Converted `release_date_full` to datetime and removed invalid dates as well as all January 1st dates. The latter were removed because an unusually large number of movies had January 1st as their release date, which we think is an error in the data returned by the TMDB API.
3. Extracted calendar components:
   - ISO week (`release_week`)
   - Calendar month (`release_month`)
   - Day of year (`release_dayofyear`)
4. Split and exploded the `genre` column to allow per-genre analysis.
5. Filtered to include only genres in `MOVIE_GENRES`.
6. **Weekly Data**:
   - Counted movies per genre per ISO week.
   - Merged with a complete week-genre template to fill missing combinations.
   - Calculated angular positions for polar coordinates (`theta_degrees`).
   - Generated total weekly movie counts and formatted week labels for hover text.
7. **Monthly Data**:
   - Counted movies per genre per month.
   - Merged with a month-genre template and computed `theta_degrees`.
   - Created hover labels using month abbreviations.
8. **Daily Data**:
   - Counted movies per genre per day of year (excluding day 1).
   - Used a full day-of-year template (days 2–365) for consistency.
   - Mapped day numbers to readable labels using a base year for formatting.
   - Calculated angular positions and joined all relevant info for hover interaction.
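
The weekly branch of this pipeline can be sketched as follows. The linear mapping of weeks 1 to 53 onto 0 to 360 degrees is an assumption about how `theta_degrees` is computed, and the function name `weekly_polar_table`, the `n_movies` column and the toy rows are hypothetical.

```python
import pandas as pd

def weekly_polar_table(movies, genres, n_weeks=53):
    df = movies.copy()
    df["release_date_full"] = pd.to_datetime(df["release_date_full"], errors="coerce")
    df = df.dropna(subset=["release_date_full"])
    # Drop January 1st, which is over-represented in the fetched dates.
    jan_first = (df["release_date_full"].dt.month == 1) & (df["release_date_full"].dt.day == 1)
    df = df[~jan_first]
    df = df.assign(
        release_week=df["release_date_full"].dt.isocalendar().week.astype(int),
        genre=df["genre"].str.split(","),
    ).explode("genre", ignore_index=True)
    df["genre"] = df["genre"].str.strip()
    df = df[df["genre"].isin(genres)]
    counts = df.groupby(["release_week", "genre"]).size().rename("n_movies").reset_index()
    # Full week x genre template so that empty combinations show up as zero.
    template = pd.MultiIndex.from_product(
        [range(1, n_weeks + 1), genres], names=["release_week", "genre"]
    ).to_frame(index=False)
    weekly = template.merge(counts, on=["release_week", "genre"], how="left")
    weekly["n_movies"] = weekly["n_movies"].fillna(0).astype(int)
    # Assumed angle: week 1 at 0 degrees, evenly spaced around the circle.
    weekly["theta_degrees"] = (weekly["release_week"] - 1) / n_weeks * 360
    return weekly

# Toy data with made-up rows; January 4th always falls in ISO week 1.
movies = pd.DataFrame({
    "release_date_full": ["2021-01-04", "2020-01-04", "2000-01-01", "not a date"],
    "genre": ["Drama, Comedy", "Drama", "Drama", "Comedy"],
})
weekly = weekly_polar_table(movies, ["Drama", "Comedy"])
print(weekly.head())
```

The monthly and daily tables follow the same pattern with a different grouping key and template.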


## 4. Genre

The genre of our data story is **magazine style**, as it tells a coherent, linear narrative. We felt this format fit well with the temporal nature of our theme and allowed users to scroll through our analysis in a simple, intuitive way.

### Visual Narrative

From the *Visual Narrative* categories laid out by Segel and Heer, we apply several key techniques.

#### Visual Structuring

Naturally, our visual story begins with an **establishing shot** that gives users an overview of the dataset and the genres at play. We also use a **consistent visual platform** by retaining genre-specific colors throughout the analysis. Finally, due to the temporal nature of our data, we incorporate a **timebar** to show how genre popularity evolves through the years.

#### Highlighting

We leave the **close-ups** to the user — our interactive graphs allow them to focus on specific genres through hover and click. We also employ **feature distinction** through our consistent genre color scheme. **Motion** is used through hover animations, which help users engage with the plot and effectively "zoom in" on a given year.

#### Transition Guidance

The **familiar objects** throughout our data story are the different genres and their associated colors. We apply **continuity editing** and **object continuity** to maintain a coherent visual flow, reinforced by the consistent visual styling. While we don’t use elaborate animated transitions, our interactive plots include zoom and selection animations that support a smooth user experience.

### Narrative Structure

#### Ordering

Our project follows a **linear flow**, in line with the magazine-style layout. We also include elements of a **user-directed path** through interactive components.

#### Interactivity

Our plots include **hover highlighting**, **filtering/selection**, and **navigation controls** — most notably in Figure 2, which allows the user to select a genre and view its global distribution. While the plots themselves don’t provide **explicit instruction**, the figure captions do, and the interactions are easily discoverable via **tacit tutorial**. Additionally, the plots display **stimulating default views** — such as total movie production and genre share — that invite exploration.

#### Messaging

Our visualizations are supported by clear **captions/headlines** that explain their content. Relevant plots (e.g., the polar plot) include **annotations**. Since we follow a magazine structure, each plot is accompanied by explanatory text that builds a **narrative** around the data. This structure also includes **introductory text** to frame the project, and a **summary** to wrap it up.


## 5. Visualization

To support our narrative of how movie genres have evolved over time, we chose three complementary and interactive visualizations that each highlight a different dimension of the data.

#### Annual Genre Distribution (Stacked Area Chart)

The first visualization shows the relative share of each genre per year, alongside a bar chart indicating the total number of movies produced for that year. This view gives users an immediate sense of how certain genres have grown or shrunk in popularity, as well as how the film industry has expanded overall. It provides historical context and serves as a strong visual foundation for the rest of the article.

#### Global Genre Specialization (Map Plot)

Next, we include an interactive map that shows genre specialization across countries. By allowing the user to select a genre and see how prevalent it is in different parts of the world, this visualization highlights cultural and regional preferences. It supports a more global, comparative perspective.

#### Weekly Release Patterns (Polar Chart)

Finally, our article features a circular bar chart (polar plot) that visualizes weekly movie release patterns across all genres. This format reveals seasonality — for example, certain genres may be released more often in specific parts of the year. The spiral shape was chosen both for visual interest and its effectiveness in showing cyclical data without overwhelming the viewer.

Together, these visualizations offer a temporal, geographic, and seasonal lens on the same dataset. Their interactivity makes them engaging and allows users to explore their own questions while following a coherent story. Because they are interactive, the plots are also highly information-dense, allowing users to reveal detailed insights on demand without overwhelming the layout visually.



## 6. Discussion. Think critically about your creation
What went well?,
What is still missing? What could be improved?, Why?

## 7. Contributions

You should write (just briefly) which group member was the main responsible for which elements of the assignment. (I want you guys to understand every part of the assignment, but usually there is someone who took lead role on certain portions of the work. That's what you should explain).
It is not OK simply to write "All group members contributed equally".