# Step-by-Step Detailed Overview of the Data Processing and Sentiment Analysis Workflow

This workflow consists of two main parts:
1. **Filtering and Saving Summaries with Minimum Character Requirement**
2. **Analyzing Sentiment and Merging with Movie Metadata**

## Part 1: Filtering and Saving Summaries with Minimum Character Requirement

### Step 1.1: Import Necessary Libraries
We start by importing the `pandas` library, which will be used for data manipulation. File reading relies on Python's built-in `open`, which needs no import.

### Step 1.2: Define File Paths
We specify the path to the `plot_summaries.txt` file (containing movie summaries) and define an output path where the filtered summaries will be saved in TSV format.

### Step 1.3: Filter Summaries with Sufficient Length
We open the `plot_summaries.txt` file and iterate through each line, which contains a movie's ID and its summary, separated by a tab. For each line:
- We try to split the line into `movie_id` and `summary`.
- We check if the summary has at least 1000 characters.
- If it does, we extract the last 400 characters and store them in a list `data`, paired with the `movie_id`.

If a line contains no tab, the split raises a `ValueError`; the line is skipped and a message is printed.
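The filtering rule above can be sketched as a small function that works on any iterable of lines, which makes it easy to try on a few hand-written strings. The name `filter_summaries` and its `min_length` and `tail_length` parameters are illustrative assumptions, not part of the notebook.

```python
def filter_summaries(lines, min_length=1000, tail_length=400):
    """Return (kept, skipped): kept rows as dicts, skipped malformed lines."""
    kept, skipped = [], []
    for line in lines:
        try:
            # Split on the first tab only, so any later tabs stay inside the summary.
            movie_id, summary = line.split('\t', 1)
        except ValueError:
            # No tab in the line: it does not match "<movie_id>\t<summary>".
            skipped.append(line)
            continue
        summary = summary.strip()
        if len(summary) >= min_length:
            kept.append({'Movie_ID': movie_id,
                         'Summary': summary[-tail_length:].strip()})
    return kept, skipped


# Hand-written sample lines (hypothetical data).
sample = [
    "1\t" + "a" * 1200,   # long enough: kept
    "2\tshort summary",   # too short: dropped silently
    "no tab here",        # malformed: reported as skipped
]
kept, skipped = filter_summaries(sample)
print(len(kept), len(skipped))
```

Returning the skipped lines instead of printing them is a design choice made here for testability; the notebook prints them as it goes.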

### Step 1.4: Save Filtered Data to a TSV File
Once all lines are processed, the filtered summaries are collected into a `pandas.DataFrame` and written to a new TSV file (`filtered_plot_summaries_last_400_characters.tsv`) with `to_csv`.

## Part 2: Analyzing Sentiment and Merging with Movie Metadata

### Step 2.1: Import Necessary Libraries
We import `TextBlob` from the `textblob` library for sentiment analysis, and `pandas` for data processing.

### Step 2.2: Load Movie Metadata and Filtered Summaries
We load:
- The `movie.metadata.tsv` file, which has no header row. After loading, we name its columns (`Movie_ID`, `Title`, `Release_Date`, etc.) and read `Movie_ID` as a string so that it has the same type as the key in the summaries file.
- The `filtered_plot_summaries_last_400_characters.tsv` file created in Part 1, containing filtered summaries.
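As a minimal sketch of the loading pattern described above, the snippet below reads a headerless, tab-separated sample from memory and forces the first column to be a string. The three-column sample and its values are made up for illustration; the real metadata file has more columns.

```python
import io

import pandas as pd

# Made-up sample standing in for a headerless TSV file.
raw = "0012\t/m/aaa\tFirst Film\n345\t/m/bbb\tSecond Film\n"

# header=None: the first row is data, not column names.
# dtype={0: str}: keep column 0 as text instead of letting pandas infer integers.
df = pd.read_csv(io.StringIO(raw), sep='\t', header=None, dtype={0: str})
df.columns = ['Movie_ID', 'Other_Column', 'Title']

print(df['Movie_ID'].tolist())
```

Reading the ID as a string keeps values exactly as written (a leading zero, for example, would be lost if the column were parsed as an integer), and it gives both tables the same key type for the later merge.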

### Step 2.3: Define the Sentiment Analysis Function
Using `TextBlob`, we define a function that computes the sentiment polarity of each summary (a value from -1 to 1) and maps it to a score:
- A high positive polarity (> 0.5) corresponds to a "Very Happy Ending" (Score = 5).
- A moderate positive polarity (above 0.13, up to and including 0.5) is a "Happy Ending" (Score = 4).
- A near-zero polarity (from -0.13 to 0.13, inclusive) is classified as "Neutral" (Score = 3).
- A moderate negative polarity (from -0.5 up to, but not including, -0.13) is labeled as a "Sad Ending" (Score = 2).
- A highly negative polarity (< -0.5) represents a "Very Sad Ending" (Score = 1).
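The thresholds listed above can be checked in isolation by separating the bucket logic from the `TextBlob` call. The sketch below does that; `polarity_to_score` is a hypothetical helper name, and it assumes that a polarity of exactly -0.5 counts as a "Sad Ending", since the list reserves Score 1 for values strictly below -0.5.

```python
def polarity_to_score(polarity):
    """Map a polarity in [-1, 1] to an ending score from 1 to 5."""
    if polarity > 0.5:
        return 5  # Very happy ending
    if polarity > 0.13:
        return 4  # Happy ending: 0.13 < polarity <= 0.5
    if polarity >= -0.13:
        return 3  # Neutral: -0.13 <= polarity <= 0.13
    if polarity >= -0.5:
        return 2  # Sad ending: -0.5 <= polarity < -0.13 (assumed boundary)
    return 1      # Very sad ending: polarity < -0.5


# Probe each bucket, including the boundary values.
for value in (0.9, 0.5, 0.13, 0.0, -0.13, -0.3, -0.5, -0.51):
    print(value, polarity_to_score(value))
```

Because each test is reached only when the earlier ones fail, every branch needs a single comparison, which avoids gaps or overlaps between adjacent ranges.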

### Step 2.4: Merge Metadata and Summary Data on Movie ID
We use an inner join to merge `movie_data` and `summaries_data` on the `Movie_ID` column, resulting in a combined dataset of only the movies that have both metadata and summaries.
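To illustrate what the inner join keeps and drops, here is a small sketch on toy tables. The data is invented, and the outer join with `indicator=True` is an optional audit step added here, not something the notebook does.

```python
import pandas as pd

# Toy tables: IDs 2 and 3 appear in both, 1 only in metadata, 4 only in summaries.
movie_data = pd.DataFrame({'Movie_ID': ['1', '2', '3'],
                           'Title': ['A', 'B', 'C']})
summaries_data = pd.DataFrame({'Movie_ID': ['2', '3', '4'],
                               'Summary': ['s2', 's3', 's4']})

# Inner join: only IDs present in both tables survive.
merged = pd.merge(movie_data, summaries_data, on='Movie_ID', how='inner')
print(merged['Movie_ID'].tolist())

# Optional audit: an outer join with indicator=True adds a '_merge' column
# telling whether each row came from the left table, the right table, or both.
audit = pd.merge(movie_data, summaries_data, on='Movie_ID',
                 how='outer', indicator=True)
print(audit['_merge'].value_counts())
```

The audit table is a quick way to see how many movies lack a summary (or the reverse) before they silently drop out of the inner join.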

### Step 2.5: Apply Sentiment Analysis
The `analyze_sentiment` function is applied to the `Summary` column of the merged dataset to generate a `Score` column, representing the sentiment score for each movie.

### Step 2.6: Save the Final Dataset
Finally, the combined dataset with sentiment scores is saved to a TSV file, `movies_dataset_w_scores.tsv`, for further analysis.


In [1]:
import pandas as pd

file_path = 'plot_summaries.txt'
output_file_path = 'filtered_plot_summaries_last_400_characters.tsv'
data = []

with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        try:
            # Each line is "<movie_id>\t<summary>"; split on the first tab only
            movie_id, summary = line.split('\t', 1)
            summary = summary.strip()

            # Keep only summaries of at least 1000 characters, storing their last 400
            if len(summary) >= 1000:
                last_400_characters = summary[-400:].strip()
                data.append({'Movie_ID': movie_id, 'Summary': last_400_characters})

        except ValueError:
            # Raised when the line contains no tab separator
            print(f"Line skipped (bad format): {line}")

df = pd.DataFrame(data)

# Write the result as tab-separated values, without the index column
df.to_csv(output_file_path, sep='\t', index=False)

print(f"TSV file created successfully: {output_file_path}")


TSV file created successfully: filtered_plot_summaries_last_400_characters.tsv


In [2]:
import pandas as pd
from textblob import TextBlob

metadata_path = 'movie.metadata.tsv'
movie_data = pd.read_csv(metadata_path, sep='\t', header=None, dtype={0: str})  # Load the ID as a string
movie_data.columns = ['Movie_ID', 'Other_Column', 'Title', 'Release_Date', 'Revenue', 'Runtime', 'Languages', 'Country', 'Genres']

# Read the file written by the first cell (last 400 characters of each summary)
summaries_path = 'filtered_plot_summaries_last_400_characters.tsv'
summaries_data = pd.read_csv(summaries_path, sep='\t', dtype={'Movie_ID': str})

print("First rows of movie_data:")
print(movie_data.head())

print("First rows of summaries_data:")
print(summaries_data.head())

def analyze_sentiment(summary):
    # TextBlob polarity ranges from -1 (negative) to 1 (positive)
    analysis = TextBlob(summary)
    polarity = analysis.sentiment.polarity
    if polarity > 0.5:
        return 5  # Very happy ending
    elif 0.13 < polarity <= 0.5:
        return 4  # Happy ending
    elif -0.13 <= polarity <= 0.13:
        return 3  # Neutral ending
    elif -0.5 <= polarity < -0.13:
        return 2  # Sad ending (includes -0.5, so only polarity < -0.5 scores 1)
    else:
        return 1  # Very sad ending

# Inner join: keep only movies present in both tables
merged_data = pd.merge(movie_data, summaries_data, on='Movie_ID', how='inner')

print(f"Number of movies with a match: {len(merged_data)}")

merged_data['Score'] = merged_data['Summary'].apply(analyze_sentiment)

output_file_path = 'movies_dataset_w_scores.tsv'
merged_data.to_csv(output_file_path, sep='\t', index=False)

print(f"Dataset with scores saved to {output_file_path}")


First rows of movie_data:
   Movie_ID Other_Column                                              Title  \
0    975900    /m/03vyhn                                     Ghosts of Mars   
1   3196793    /m/08yl5d  Getting Away with Murder: The JonBenét Ramsey ...   
2  28463795   /m/0crgdbh                                        Brun bitter   
3   9363483   /m/0285_cd                                   White Of The Eye   
4    261236    /m/01mrr1                                  A Woman in Flames   

  Release_Date     Revenue  Runtime                           Languages  \
0   2001-08-24  14010832.0     98.0  {"/m/02h40lc": "English Language"}   
1   2000-02-16         NaN     95.0  {"/m/02h40lc": "English Language"}   
2         1988         NaN     83.0  {"/m/05f_3": "Norwegian Language"}   
3         1987         NaN    110.0  {"/m/02h40lc": "English Language"}   
4         1983         NaN    106.0   {"/m/04306rv": "German Language"}   

                                     Coun