# Movie Plot Summaries Filtering

This script processes a dataset of movie plot summaries, selecting only those with a minimum length of 1000 characters. The last 400 characters of each qualifying summary are retained and saved in a clean TSV file for further analysis.

## Details

- **Input**: A tab-separated text file with one movie ID and its raw plot summary per line.
- **Output**: A filtered TSV file containing movie IDs and truncated summaries.
- **Goal**: Preprocess plot summaries to simplify subsequent data analysis.
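
The filtering rule can be tried on a few in-memory lines before running it on the full file. The sketch below is only an illustration, not the script itself: `filter_and_truncate`, `min_length`, and `tail_length` are hypothetical names, and each line is assumed to have the form `ID<TAB>summary`.

```python
def filter_and_truncate(lines, min_length=1000, tail_length=400):
    """Keep summaries of at least min_length characters and return their last tail_length characters."""
    rows = []
    for line in lines:
        # Split on the first tab only, so tabs inside the summary are preserved
        movie_id, sep, summary = line.partition('\t')
        if not sep:
            continue  # malformed line: no tab separator
        summary = summary.strip()
        if len(summary) >= min_length:
            rows.append((movie_id, summary[-tail_length:].strip()))
    return rows


sample = [
    "1\t" + "a" * 1200,   # long enough: kept, truncated to its last 400 characters
    "2\tToo short",       # shorter than 1000 characters: dropped
    "no tab here",        # malformed: skipped
]
result = filter_and_truncate(sample)
print(len(result), len(result[0][1]))
```

Using `partition` instead of `split` avoids the `try`/`except` for lines without a tab, at the cost of silently skipping them.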

In [1]:
import pandas as pd

file_path = '../../Data/MovieSummaries/plot_summaries.txt'
output_file_path = '../../src/data/filtered_plot_summaries.tsv'
data = []

with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        try:
            movie_id, summary = line.split('\t', 1)  
            summary = summary.strip()  
            
            if len(summary) >= 1000:
                last_400_characters = summary[-400:].strip()
                data.append({'Movie_ID': movie_id, 'Summary': last_400_characters})
                
        except ValueError:
            print(f"Wrong format on line: {line}")

df = pd.DataFrame(data)
df.to_csv(output_file_path, sep='\t', index=False)

print(f"TSV file created successfully: {output_file_path}")

TSV file created successfully: ../../src/data/filtered_plot_summaries.tsv


# Sentiment Analysis on Movie Plot Summaries

This script combines movie metadata with filtered plot summaries to analyze sentiment and assign a score based on the emotional tone of each summary.

## Details

- **Input 1**: Movie metadata file containing information like title, release date, and genres.
- **Input 2**: Filtered plot summaries (TSV file).
- **Output**: A final TSV file with merged data and a sentiment score for each movie.

## Sentiment Scoring

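- Sentiment is analyzed using `TextBlob`, whose polarity ranges from -1 (negative) to 1 (positive). Scores are assigned from the polarity as follows:
  - **5**: Very happy ending (polarity > 0.5).
  - **4**: Happy ending (0.13 < polarity ≤ 0.5).
  - **3**: Neutral ending (-0.13 ≤ polarity ≤ 0.13).
  - **2**: Sad ending (-0.5 < polarity < -0.13).
  - **1**: Very sad ending (polarity ≤ -0.5).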
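
To check how boundary values are scored, the mapping can be isolated from `TextBlob`. This is a small sketch using the same thresholds as `analyze_sentiment` in the code cell; `polarity_to_score` is a hypothetical helper name, and the polarity is passed in directly rather than computed from text.

```python
def polarity_to_score(polarity):
    """Map a polarity in [-1, 1] to the 1-5 ending score."""
    if polarity > 0.5:
        return 5  # Very happy ending
    if polarity > 0.13:
        return 4  # Happy ending
    if polarity >= -0.13:
        return 3  # Neutral ending
    if polarity > -0.5:
        return 2  # Sad ending
    return 1      # Very sad ending


# Probe each band, including the values that sit exactly on a threshold
probes = (0.8, 0.5, 0.13, 0.0, -0.13, -0.3, -0.5, -1.0)
scores = [polarity_to_score(p) for p in probes]
print(scores)  # [5, 4, 3, 3, 3, 2, 1, 1]
```

Note that a polarity of exactly 0.5 counts as a happy ending (4), while exactly -0.5 counts as a very sad ending (1).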

In [2]:
import pandas as pd
from textblob import TextBlob

metadata_path = '../../Data/MovieSummaries/movie.metadata.tsv'
movie_data = pd.read_csv(metadata_path, sep='\t', header=None, dtype={0: str})  # Load the ID as a string
movie_data.columns = ['Movie_ID', 'Other_Column', 'Title', 'Release_Date', 'Revenue', 'Runtime', 'Languages', 'Country', 'Genres']
summaries_path = '../../src/data/filtered_plot_summaries.tsv'
summaries_data = pd.read_csv(summaries_path, sep='\t', dtype={'Movie_ID': str})

def analyze_sentiment(summary):
    analysis = TextBlob(summary)
    polarity = analysis.sentiment.polarity
    if polarity > 0.5:
        return 5  # Very happy ending
    elif 0.13 < polarity <= 0.5:
        return 4  # Happy ending
    elif -0.13 <= polarity <= 0.13:
        return 3  # Neutral ending
    elif -0.5 < polarity < -0.13:
        return 2  # Sad ending
    else:
        return 1  # Very sad ending

merged_data = pd.merge(movie_data, summaries_data, on='Movie_ID', how='inner')

merged_data['Score'] = merged_data['Summary'].apply(analyze_sentiment)

output_file_path = '../../src/data/movies_dataset_final.tsv'
merged_data.to_csv(output_file_path, sep='\t', index=False)


# Fetching and Enriching Movie Data from TMDB API

This script enriches our existing dataset of movies with additional information from the TMDB API. 

## Details

- **Input**: A TSV file containing basic movie metadata, including Wikipedia IDs and titles.
- **Output**: A serialized `pickle` file storing the enriched data for future use.
- **APIs Used**:
  - **Search Movie**: Retrieves movie details using titles.
  - **Movie Details**: Fetches additional details like genres and release dates.
  - **Movie Credits**: Collects cast and crew information.

## Workflow

1. **Load Existing Dataset**: Reads the movie dataset containing titles and IDs.
2. **Check Existing Data**: Loads pre-existing TMDB data from a `pickle` file to avoid redundant API calls.
3. **Fetch Missing Data**:
   - Uses the TMDB API to search for movies by title and retrieve their IDs.
   - Fetches detailed information and credits for each movie using multithreading for efficiency.
4. **Save Results**: Stores the enriched data in a `pickle` file for further processing.
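
The "check the cache, then fetch what is missing in parallel" pattern from the workflow can be sketched without any network access. In this illustration `fake_fetch` is a hypothetical stand-in for a TMDB request and `fetch_missing` is a hypothetical helper; only the `concurrent.futures` calls are real.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(movie_id, title):
    """Hypothetical stand-in for a TMDB request; returns the key and a payload."""
    return movie_id, {"tmdb_id": len(title), "overview": f"Overview of {title}"}


def fetch_missing(movies, cache, fetch=fake_fetch, max_workers=4):
    """Fetch only the movies whose key is not already in the cache."""
    todo = [(movie_id, title) for movie_id, title in movies if movie_id not in cache]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fetch, movie_id, title) for movie_id, title in todo]
        for future in as_completed(futures):
            movie_id, info = future.result()
            if info:  # skip empty results, e.g. a search with no hit
                cache[movie_id] = info
    return len(todo)


cache = {"a": {"tmdb_id": 1, "overview": "cached"}}
movies = [("a", "Alpha"), ("b", "Beta"), ("c", "Gamma")]
fetched = fetch_missing(movies, cache)
print(fetched, sorted(cache))
```

The cache is updated only in the main thread as futures complete, so the worker functions never write to shared state.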


In [9]:
import requests
from tqdm import tqdm
import os
import pickle
from concurrent.futures import ThreadPoolExecutor, as_completed
import pandas as pd

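# API key and base URL setup
# The key is read from the environment instead of being hard-coded;
# the variable name TMDB_API_KEY is an assumption, adjust it to your setup.
API_KEY = os.environ['TMDB_API_KEY']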
BASE_SEARCH_URL = 'https://api.themoviedb.org/3/search/movie'
BASE_MOVIE_URL = 'https://api.themoviedb.org/3/movie'

# Load the existing movie dataset
file_path = '../../src/data/movies_dataset_final.tsv'
movies_df = pd.read_csv(file_path, sep='\t')

# Define function to fetch movie data from TMDB API
def get_movie_data_from_tmdb(wikipedia_id, title):
    params = {
        'api_key': API_KEY,
        'query': title,
        'language': 'en-US'
    }
    response = requests.get(BASE_SEARCH_URL, params=params)
    if response.status_code == 200:
        data = response.json()
        if data['results']:
            movie_data = data['results'][0]
            overview = movie_data.get('overview', '')
            tmdb_id = movie_data.get('id', None)
            return wikipedia_id, {"overview": overview, "tmdb_id": tmdb_id}
    return wikipedia_id, {}

# Define functions to get specific movie details and credits
def get_movie_details(wikipedia_id, tmdb_id):
    response = requests.get(f"{BASE_MOVIE_URL}/{tmdb_id}", params={'api_key': API_KEY, 'language': 'en-US'})
    return (wikipedia_id, "details", response.json()) if response.status_code == 200 else (wikipedia_id, "details", {})

def get_movie_credits(wikipedia_id, tmdb_id):
    response = requests.get(f"{BASE_MOVIE_URL}/{tmdb_id}/credits", params={'api_key': API_KEY})
    return (wikipedia_id, "credits", response.json()) if response.status_code == 200 else (wikipedia_id, "credits", {})

# Load existing TMDB data if available
DATA_FOLDER = '../../src/data'
if os.path.exists(f'{DATA_FOLDER}/movie_data_from_tmdb.pkl'):
    with open(f'{DATA_FOLDER}/movie_data_from_tmdb.pkl', 'rb') as file:
        movie_data_from_tmdb = pickle.load(file)
else:
    movie_data_from_tmdb = {}

# Fetch missing TMDB data
movies_to_process = [
    (wiki_id, title) for wiki_id, title in zip(movies_df['Other_Column'], movies_df['Title'])
    if wiki_id not in movie_data_from_tmdb
]
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(get_movie_data_from_tmdb, movie_id, title): movie_id for movie_id, title in movies_to_process}
    for future in tqdm(as_completed(futures), total=len(futures), desc="Fetching TMDB IDs"):
        wikipedia_id, movie_info = future.result()
        if movie_info:
            movie_data_from_tmdb[wikipedia_id] = movie_info


Fetching TMDB IDs: 100%|██████████| 431/431 [00:02<00:00, 178.22it/s]


# Enriching Movie Dataset with TMDB Data

This script processes and enriches a movie dataset with additional details fetched from the TMDB API, such as directors, revenue, and vote averages.

## Workflow

1. **Save and Reload Intermediate Data**:
   - Save basic TMDB data (`tmdb_id`) to avoid redundant API requests.
   - Fetch additional details (e.g., credits, revenue, budget) using TMDB API in a multithreaded process.

2. **Add New Fields to the Dataset**:
   - **Director**: Extracts the director's name from the crew data.
   - **Collection**: Identifies the collection to which a movie belongs.
   - **Vote Average**: Retrieves the average user rating from TMDB.
   - **Revenue and Budget**: Adds financial details (if available).
   - **Production Companies**: Extracts production company details.

3. **Update and Save Dataset**:
   - Maps the enriched data fields to the movie dataset.
   - Drops redundant or outdated columns.
   - Saves the final dataset back to a TSV file for further analysis.
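
The field extraction in step 2 amounts to safe lookups in nested dictionaries. The sketch below runs on made-up records shaped like the `details` and `credits` entries stored per movie; the sample values are invented, `find_collection_id` is a hypothetical name, and `None` is used in place of `pd.NA` to keep the example free of dependencies.

```python
def find_director(movie):
    """Return the first crew member whose job is 'Director', or None."""
    for person in movie.get("credits", {}).get("crew", []):
        if person.get("job") == "Director":
            return person.get("name")
    return None


def find_collection_id(movie):
    """Return the collection id, or None when the movie is not in a collection."""
    # 'belongs_to_collection' may be missing or explicitly None
    collection = movie.get("details", {}).get("belongs_to_collection") or {}
    return collection.get("id")


# Invented sample records for illustration only
in_collection = {
    "details": {"belongs_to_collection": {"id": 10, "name": "Example Collection"}},
    "credits": {"crew": [
        {"job": "Producer", "name": "P. Example"},
        {"job": "Director", "name": "D. Example"},
    ]},
}
standalone = {"details": {"belongs_to_collection": None}, "credits": {"crew": []}}

print(find_director(in_collection), find_collection_id(in_collection))
print(find_director(standalone), find_collection_id(standalone))
```

Handling the `None` case with `or {}` avoids the broad `try`/`except` otherwise needed when a movie has no collection.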

In [4]:
# Save basic data to avoid re-fetching
with open(f'{DATA_FOLDER}/movie_data_from_tmdb_only_id.pkl', 'wb') as file:
    pickle.dump(movie_data_from_tmdb, file)

# Fetch additional details and credits
movies_to_process = [(wiki_id, info['tmdb_id']) for wiki_id, info in movie_data_from_tmdb.items() if info.get('tmdb_id') and 'details' not in info]
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = []
    for wiki_id, tmdb_id in movies_to_process:
        futures.append(executor.submit(get_movie_details, wiki_id, tmdb_id))
        futures.append(executor.submit(get_movie_credits, wiki_id, tmdb_id))
    
    for future in tqdm(as_completed(futures), total=len(futures), desc="Fetching Details and Credits"):
        wikipedia_id, data_type, data = future.result()
        if wikipedia_id in movie_data_from_tmdb:
            movie_data_from_tmdb[wikipedia_id][data_type] = data

# Save the enriched TMDB data
with open(f'{DATA_FOLDER}/movie_data_from_tmdb.pkl', 'wb') as file:
    pickle.dump(movie_data_from_tmdb, file)

# Define helper functions for new data fields
def find_director(movie_data):
    crew = movie_data.get('credits', {}).get('crew', [])
    for person in crew:
        if person.get('job') == 'Director':
            return person.get('name', pd.NA)
    return pd.NA

# add collection to the dataset
def find_collection(movie_data_from_tmdb_single):
    try:
        return movie_data_from_tmdb_single['details']['belongs_to_collection']['id']
    except Exception:
        return pd.NA

def find_vote_average(movie_data):
    return movie_data.get('details', {}).get('vote_average', pd.NA)

def find_revenue(movie_data):
    return movie_data.get('details', {}).get('revenue', pd.NA)
    
def find_budget(movie_data):
    return movie_data.get('details', {}).get('budget', pd.NA)

def find_productions(movie_data):
    return movie_data.get('details', {}).get('production_companies', pd.NA)

# Map new data to the movies_df dataset
movies_df['director'] = movies_df['Other_Column'].map(lambda x: find_director(movie_data_from_tmdb.get(x, {})))
movies_df['vote_average'] = movies_df['Other_Column'].map(lambda x: find_vote_average(movie_data_from_tmdb.get(x, {})))
movies_df['revenue'] = movies_df['Other_Column'].map(lambda x: find_revenue(movie_data_from_tmdb.get(x, {})))
movies_df['collection'] = movies_df['Other_Column'].map(lambda x: find_collection(movie_data_from_tmdb.get(x, {})))
movies_df = movies_df.drop(columns=['Revenue', 'Movie_ID_y'], errors='ignore')
movies_df['Budget'] = movies_df['Other_Column'].map(lambda x: find_budget(movie_data_from_tmdb.get(x, {})))
movies_df['Production'] = movies_df['Other_Column'].map(lambda x: find_productions(movie_data_from_tmdb.get(x, {})))

# Save the updated dataset back to the original file
movies_df.to_csv(file_path, sep='\t', index=False)
print(f"Updated dataset saved back to '{file_path}' with new columns: director, vote_average, revenue, collection, Budget, and Production")



Fetching Details and Credits: 100%|██████████| 42982/42982 [04:35<00:00, 156.06it/s]


Updated dataset saved back to '../../src/data/movies_dataset_final.tsv' with new columns: director, vote_average, revenue, collection, Budget, and Production


# Cleaning and Formatting Movie Country Data

This script processes the "Country" column in a movie dataset to extract and clean country names, ensuring consistent and usable data for analysis.

## Workflow

1. **Load Dataset**: Reads a TSV file containing movie metadata, including a "Country" column.
2. **Extract Country Names**: Parses and cleans country names from structured text data.
3. **Handle Missing Data**: Drops rows with invalid or missing country information.
4. **Save Cleaned Dataset**: Exports the updated dataset to the original TSV file.
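
As an alternative to splitting on `", "` and `":"`, the extraction step can be sketched with `json.loads`. This assumes each "Country" value is a valid JSON object mapping an identifier to a country name, which should be verified against the actual file. `extract_names` is a hypothetical helper and the identifiers in the sample are made up.

```python
import json


def extract_names(raw):
    """Return the dictionary values joined by ', ', or None if nothing usable is found."""
    if not isinstance(raw, str) or '{' not in raw:
        return None  # missing value (e.g. NaN) or not a dictionary-like string
    try:
        mapping = json.loads(raw)
    except json.JSONDecodeError:
        return None  # assumption violated: the field is not valid JSON
    if not isinstance(mapping, dict):
        return None
    names = [str(name).strip() for name in mapping.values()]
    return ", ".join(names) if names else None


print(extract_names('{"/m/aaa": "France", "/m/bbb": "Germany"}'))
print(extract_names('{"/m/ccc": "Korea, Republic of"}'))
print(extract_names('{}'))
```

Unlike string splitting, this keeps names that themselves contain a comma or a colon intact.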

In [5]:
import pandas as pd

# Load the dataset
movies_df = pd.read_csv('../../src/data/movies_dataset_final.tsv', sep='\t')

# Helper function to clean up and extract country names
def extract_countries(country_data):
    if pd.isna(country_data) or '{' not in country_data:
        return None
    countries = []
    items = country_data.split(", ")
    for item in items:
        if ':' in item:
            # Clean up each country name
            country_name = item.split(":")[-1].strip().replace("\"", "").replace("}", "").replace("{", "")
            countries.append(country_name)
    # Join all countries with a comma and space
    return ", ".join(countries) if countries else None

# Apply the function to clean and reformat the 'Country' column
movies_df['Country'] = movies_df['Country'].apply(extract_countries)

# Drop rows with missing or invalid country data in 'Country'
movies_df = movies_df.dropna(subset=['Country']).copy()

# Save the updated dataset with the cleaned 'Country' column back to the original file
movies_df.to_csv('../../src/data/movies_dataset_final.tsv', sep='\t', index=False)
print("Dataset updated with cleaned 'Country' column and saved as 'src/data/movies_dataset_final.tsv'")


Dataset updated with cleaned 'Country' column and saved as 'src/data/movies_dataset_final.tsv'


# Cleaning and Formatting Movie Language Data

This script processes the "Languages" column in a movie dataset to extract and clean language names, ensuring standardized and accurate data for analysis.

## Workflow

1. **Load Dataset**: Reads a TSV file containing movie metadata, including a "Languages" column.
2. **Extract Language Names**: Parses and cleans language data, removing unnecessary text and filtering out entries longer than 40 characters.
3. **Handle Missing Data**: Drops rows with invalid or missing language information.
4. **Save Cleaned Dataset**: Exports the updated dataset with the cleaned "Languages" column to the original TSV file.
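
The per-name cleaning in step 2 can be illustrated on its own. This is a minimal sketch: `clean_language` and `max_length` are hypothetical names, and the inputs are invented examples rather than values taken from the dataset.

```python
def clean_language(name, max_length=40):
    """Strip the word 'Language'/'language' and reject names longer than max_length."""
    cleaned = name.replace("Language", "").replace("language", "").strip()
    return cleaned if len(cleaned) <= max_length else None


samples = ("English Language", "French", "Sign language", "x" * 41)
cleaned = [clean_language(name) for name in samples]
print(cleaned)  # ['English', 'French', 'Sign', None]
```

The "Sign language" case shows a side effect of plain substring removal: the word is stripped even when it is part of the name.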


In [6]:
import pandas as pd

# Load the dataset
movies_df = pd.read_csv('../../src/data/movies_dataset_final.tsv', sep='\t')

# Helper function to clean up and extract language names
def extract_languages(language_data):
    if pd.isna(language_data) or '{' not in language_data:
        return None
    languages = []
    items = language_data.split(", ")
    for item in items:
        if ':' in item:
            # Clean each language name and remove "Language"/"language"
            language_name = item.split(":")[-1].strip().replace("\"", "").replace("}", "").replace("Language", "").replace("language", "").strip()
            if len(language_name) <= 40:  # Filter out languages longer than 40 characters
                languages.append(language_name)
    # Join all languages with a comma and space
    return ", ".join(languages) if languages else None

# Apply the function to clean and reformat the 'Languages' column
movies_df['Languages'] = movies_df['Languages'].apply(extract_languages)

# Drop rows with missing or invalid language data in 'Languages'
movies_df = movies_df.dropna(subset=['Languages']).copy()

# Save the updated dataset with the cleaned 'Languages' column back to the original file
movies_df.to_csv('../../src/data/movies_dataset_final.tsv', sep='\t', index=False)
print("Dataset updated with cleaned 'Languages' column and saved as 'src/data/movies_dataset_final.tsv'")


Dataset updated with cleaned 'Languages' column and saved as 'src/data/movies_dataset_final.tsv'


# Cleaning and Formatting Movie Genre Data

This script processes the "Genres" column in a movie dataset to extract and clean genre names, ensuring a standardized and usable format for analysis.

## Workflow

1. **Load Dataset**: Reads a TSV file containing movie metadata, including a "Genres" column.
2. **Extract Genre Names**:
   - Parses and cleans genre data.
   - Removes unwanted words like "Movie", "Film", and their variations.
3. **Handle Missing Data**: Drops rows with invalid or missing genre information.
4. **Save Cleaned Dataset**: Exports the updated dataset with the cleaned "Genres" column to the original TSV file.
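
The word removal in step 2 can be sketched with a compact, case-insensitive pattern. This is an illustrative variant, not the exact pattern used in the code cell: it also matches upper-case forms such as "FILM" and collapses the whitespace left behind. `UNWANTED` and `clean_genre` are hypothetical names.

```python
import re

# Whole words only: "Film" and "movies" match, "Filmed" does not
UNWANTED = re.compile(r'\b(?:movies?|films?)\b', flags=re.IGNORECASE)


def clean_genre(name):
    """Remove the unwanted words, then collapse repeated whitespace."""
    without_words = UNWANTED.sub('', name)
    return ' '.join(without_words.split())


samples = ("Action Film", "Horror movies", "Film noir", "Filmed opera", "Movie")
cleaned = [clean_genre(name) for name in samples]
print(cleaned)  # ['Action', 'Horror', 'noir', 'Filmed opera', '']
```

A genre made up only of an unwanted word becomes an empty string, so callers may want to skip empty results before joining.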


In [7]:
import pandas as pd
import re

# Load the dataset
movies_df = pd.read_csv('../../src/data/movies_dataset_final.tsv', sep='\t')

# Helper function to clean up and extract genre names
def extract_genres(genre_data):
    if pd.isna(genre_data) or '{' not in genre_data:
        return None
    genres = []
    items = genre_data.split(", ")
    for item in items:
        if ':' in item:
            # Clean the genre name and remove unwanted words
            genre_name = item.split(":")[-1].strip().replace('"', '').replace('}', '')
            # Remove words like "Movie", "Movies", "Film", etc.
            genre_name = re.sub(r'\b(Movie|Movies|Film|Films|movie|movies|film|films)\b', '', genre_name).strip()
            # Skip names that are empty once the unwanted words are removed
            if genre_name:
                genres.append(genre_name)
    # Join all genres with a comma and space
    return ", ".join(genres) if genres else None

# Apply the function to clean and reformat the 'Genres' column
movies_df['Genres'] = movies_df['Genres'].apply(extract_genres)

# Drop rows with missing or invalid genre data in 'Genres'
movies_df = movies_df.dropna(subset=['Genres']).copy()

# Save the updated dataset with the cleaned 'Genres' column back to the original file
movies_df.to_csv('../../src/data/movies_dataset_final.tsv', sep='\t', index=False)
print("Dataset updated with cleaned 'Genres' column and saved as 'src/data/movies_dataset_final.tsv'")


Dataset updated with cleaned 'Genres' column and saved as 'src/data/movies_dataset_final.tsv'


# Extracting and Cleaning Release Years

This script processes the "Release_Date" column in a movie dataset to extract and clean release years, ensuring a consistent format for analysis.

## Workflow

1. **Load Dataset**: Reads a TSV file containing movie metadata, including a "Release_Date" column.
2. **Extract Release Years**:
   - Uses regex to identify and extract 4-digit year patterns from various date formats.
   - Removes rows with invalid or missing year data.
3. **Convert to Integer**: Converts the "Release_Date" column to an integer type for easier analysis.
4. **Save Cleaned Dataset**: Exports the updated dataset with cleaned release years to the original TSV file.
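
The year extraction can be checked against a few date formats. This sketch uses the same `\b(\d{4})\b` pattern as the code cell but returns an integer directly; the sample inputs are invented.

```python
import re

YEAR = re.compile(r'\b(\d{4})\b')


def extract_year(value):
    """Return the first standalone 4-digit number as an int, or None."""
    match = YEAR.search(str(value))
    return int(match.group(1)) if match else None


samples = ("2001-05-12", "1999", "May 2010", float("nan"), "12345")
years = [extract_year(value) for value in samples]
print(years)  # [2001, 1999, 2010, None, None]
```

The word boundaries reject longer digit runs such as "12345", and a missing value (`NaN`) becomes the string "nan", which contains no year.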


In [8]:
import pandas as pd
import re

# Load the dataset
movies_df = pd.read_csv('../../src/data/movies_dataset_final.tsv', sep='\t')

# Define a function to extract the 4-digit year from various date formats
def extract_year(date_str):
    # Ensure the date is a string
    date_str = str(date_str)
    
    # Use regex to find a 4-digit year pattern
    match = re.search(r'\b(\d{4})\b', date_str)
    
    if match:
        return match.group(1)  # Return the matched 4-digit year as a string
    else:
        return None  # Return None if no 4-digit year is found

# Apply the function to the 'Release_Date' column to extract only the year
movies_df['Release_Date'] = movies_df['Release_Date'].apply(extract_year)

# Drop rows with no valid year
movies_df = movies_df.dropna(subset=['Release_Date']).copy()

# Convert 'Release_Date' to an integer type for further analysis
movies_df['Release_Date'] = movies_df['Release_Date'].astype(int)

# Save the cleaned dataset back to the original file
movies_df.to_csv('../../src/data/movies_dataset_final.tsv', sep='\t', index=False)
print("Dataset updated with cleaned 'Release_Date' years in '../../src/data/movies_dataset_final.tsv'")


Dataset updated with cleaned 'Release_Date' years in '../../src/data/movies_dataset_final.tsv'
