# Notebook 1: Data Preprocessing and Preparation

## Objective
This notebook focuses on preprocessing the HowTo100M dataset to prepare it for downstream tasks such as text generation and multimodal learning. Specifically, we will:
1. Parse the provided dataset files (`HowTo100M_v1.csv`, `caption.json`, and `task_ids.csv`).
2. Align video captions with their corresponding task descriptions.
3. Perform data cleaning, such as removing stop words and redundant entries.
4. Save the preprocessed data in a structured format (e.g., CSV or JSON) for use in subsequent notebooks.

## Dataset Files
1. **HowTo100M_v1.csv**: Contains metadata about videos, including YouTube video IDs, task IDs, and categories.
2. **caption.json**: Stores video captions with their corresponding timestamps.
3. **task_ids.csv**: Maps task IDs to task descriptions sourced from WikiHow.

## Steps in this Notebook
1. Load and explore the CSV files (`HowTo100M_v1.csv` and `task_ids.csv`).
2. Parse and preprocess the JSON file (`caption.json`).
3. Align captions with task descriptions using task IDs.
4. Save the final processed dataset for use in future notebooks.

## Step 1: Load HowTo100M_v1.csv
In this step, we will:
1. Load the `HowTo100M_v1.csv` file into a Pandas DataFrame.
2. Display the first few rows to verify its structure and understand the columns.

In [1]:
import pandas as pd

# Load HowTo100M_v1.csv into a Pandas DataFrame
howto100m_df = pd.read_csv("Dataset/HowTo100M_v1.csv")

# Display the first 5 rows to verify the structure
howto100m_df.head()

Unnamed: 0,video_id,category_1,category_2,rank,task_id
0,nVbIUDjzWY4,Cars & Other Vehicles,Motorcycles,27,52907
1,CTPAZ2euJ2Q,Cars & Other Vehicles,Motorcycles,35,109057
2,rwmt7Cbuvfs,Cars & Other Vehicles,Motorcycles,99,52907
3,HnTLh99gcxY,Cars & Other Vehicles,Motorcycles,35,52907
4,EyP3HVhg1u0,Cars & Other Vehicles,Motorcycles,95,52906
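
Before merging, it can help to check the table's size, missing values, and repeated video IDs. The sketch below is not part of the original notebook: it runs those checks on a small hypothetical frame (`sample_df`) that mirrors the columns shown above, with one duplicate row added purely for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the columns of HowTo100M_v1.csv shown above.
# The last row is a deliberate duplicate, added only for illustration.
sample_df = pd.DataFrame(
    {
        "video_id": ["nVbIUDjzWY4", "CTPAZ2euJ2Q", "rwmt7Cbuvfs", "nVbIUDjzWY4"],
        "category_1": ["Cars & Other Vehicles"] * 4,
        "category_2": ["Motorcycles"] * 4,
        "rank": [27, 35, 99, 27],
        "task_id": [52907, 109057, 52907, 52907],
    }
)

# Basic shape and completeness checks
n_rows, n_cols = sample_df.shape
missing_per_column = sample_df.isna().sum()

# Count rows whose video_id already appeared earlier in the frame
n_duplicate_videos = int(sample_df.duplicated(subset="video_id").sum())

# Keep the first occurrence of each video_id
deduplicated_df = sample_df.drop_duplicates(subset="video_id", keep="first")

print(f"rows={n_rows}, columns={n_cols}")
print(missing_per_column)
print(f"duplicate video_id rows: {n_duplicate_videos}")
```

The same calls can be run on `howto100m_df`. Whether repeated video IDs should be dropped depends on whether one video may legitimately belong to several tasks.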


## Step 2: Load task_ids.csv
The `task_ids.csv` file uses a tab (`\t`) as the delimiter rather than a comma. In this step, we will:
1. Load the file into a Pandas DataFrame using the `delimiter="\t"` parameter in `pd.read_csv()`.
2. Display the first few rows to verify the structure and the mapping between task IDs and task descriptions.

In [5]:
# Load task_ids.csv with a tab delimiter.
# No header argument is passed, so pandas uses the file's first line as the column names.
task_ids_df = pd.read_csv("Dataset/task_ids.csv", delimiter="\t")

# Display the first 5 rows to verify the structure
task_ids_df.head()

Unnamed: 0,0,Make a Mexican Bean Toast
0,1,Make a Cinnamon Toast Sandwich
1,2,Make Avocado on Toast
2,3,Make Baked Peach French Toast
3,4,Make Buttered Toast
4,5,Make Banana and Coconut Toast
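
The column names in the output above (`0` and `Make a Mexican Bean Toast`) suggest that the file has no header row and that its first record was consumed as the header. A minimal sketch of one way to keep that record, assuming the file really is headerless, is to pass `header=None` together with explicit names. The column names `task_id` and `task_description` and the in-memory `sample_tsv` text are choices made here for illustration.

```python
import io

import pandas as pd

# In-memory stand-in for the first lines of task_ids.csv
# (assumed layout: tab-separated, no header row).
sample_tsv = (
    "0\tMake a Mexican Bean Toast\n"
    "1\tMake a Cinnamon Toast Sandwich\n"
    "2\tMake Avocado on Toast\n"
)

# header=None keeps the first line as data; names= supplies explicit column labels.
task_ids_sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    header=None,
    names=["task_id", "task_description"],
)

print(task_ids_sample)
```

With named columns, a merge against the video metadata could use `on="task_id"` directly instead of referring to a column called `"0"`.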


## Step 3: Merge Video Metadata with Task Descriptions
In this step, we will:
1. Merge `howto100m_df` (video metadata) with `task_ids_df` (task descriptions) using the `task_id` column.
2. Verify the resulting DataFrame to ensure the merge was successful.

In [8]:
# Merge the video metadata with task descriptions on the 'task_id' column
merged_df = pd.merge(howto100m_df, task_ids_df, left_on="task_id", right_on="0", how="inner")

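# Rename the task_ids.csv columns for clarity.
# Note: the right-hand key "0" becomes a second "task_id" column (shown as task_id.1
# in the output); it duplicates the left-hand key and is not kept in the final dataset.
merged_df.rename(columns={"0": "task_id", "Make a Mexican Bean Toast": "task_description"}, inplace=True)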

# Display the first 500 rows of the merged DataFrame
merged_df.head(500)

Unnamed: 0,video_id,category_1,category_2,rank,task_id,task_id.1,task_description
0,nVbIUDjzWY4,Cars & Other Vehicles,Motorcycles,27,52907,52907,Paint a Motorcycle
1,rwmt7Cbuvfs,Cars & Other Vehicles,Motorcycles,99,52907,52907,Paint a Motorcycle
2,HnTLh99gcxY,Cars & Other Vehicles,Motorcycles,35,52907,52907,Paint a Motorcycle
3,RAidUDTPZ-k,Cars & Other Vehicles,Motorcycles,10,52907,52907,Paint a Motorcycle
4,tYQoPHwNkho,Cars & Other Vehicles,Motorcycles,18,52907,52907,Paint a Motorcycle
...,...,...,...,...,...,...,...
495,cSHD9tI2vWM,Cars & Other Vehicles,Bicycles,61,58362,58362,Repair an Exhaust Pipe with a Tin Can
496,i65eNj0pjrY,Hobbies and Crafts,Crafts,164,58362,58362,Repair an Exhaust Pipe with a Tin Can
497,0AB-dU--HmA,Cars & Other Vehicles,Cars,24,58362,58362,Repair an Exhaust Pipe with a Tin Can
498,pwxIqeyoB_M,Cars & Other Vehicles,Cars,28,58362,58362,Repair an Exhaust Pipe with a Tin Can
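
An inner merge silently drops any video whose `task_id` has no match in the task table. The sketch below shows one way to surface such rows using the `indicator` and `validate` options of `DataFrame.merge`. The frames `videos` and `tasks` are hypothetical toy data, and the second task description is a placeholder rather than a real entry from `task_ids.csv`.

```python
import pandas as pd

# Toy video metadata; the last task_id has no counterpart in `tasks`.
videos = pd.DataFrame(
    {
        "video_id": ["nVbIUDjzWY4", "CTPAZ2euJ2Q", "EyP3HVhg1u0"],
        "task_id": [52907, 109057, 52906],
    }
)

# Toy task table; "Placeholder task description" is invented for this example.
tasks = pd.DataFrame(
    {
        "task_id": [52907, 109057],
        "task_description": ["Paint a Motorcycle", "Placeholder task description"],
    }
)

# A left merge keeps every video; indicator=True adds a "_merge" column that
# records where each row came from, and validate="many_to_one" raises an error
# if a task_id appears more than once in `tasks`.
checked = videos.merge(
    tasks, on="task_id", how="left", validate="many_to_one", indicator=True
)

unmatched = checked[checked["_merge"] == "left_only"]
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")

print(f"videos without a task description: {len(unmatched)}")
print(matched)
```

Because both frames share the key name here, `on="task_id"` yields a single `task_id` column in the result.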


## Step 4: Load and Align Captions with Merged Data
In this step, we will:
1. Load the `caption.json` file, which contains captions for each video, keyed by video ID.
2. Look up the captions for each `video_id` in the merged DataFrame.
3. Store the captions alongside the video metadata and task descriptions in a new `captions` column.

In [9]:
import json

# Load the caption.json file
with open("Dataset/caption.json", "r", encoding="utf-8") as file:
    captions = json.load(file)

# Attach each video's captions by looking up its video_id in the captions dict;
# video_ids with no entry in caption.json receive NaN
merged_df["captions"] = merged_df["video_id"].map(captions)

# Display a sample of the merged DataFrame with captions
merged_df[["video_id", "task_description", "captions"]].head()

Unnamed: 0,video_id,task_description,captions
0,nVbIUDjzWY4,Paint a Motorcycle,"{'start': [13.64, 15.86, 20.6, 23.96, 26.36, 2..."
1,rwmt7Cbuvfs,Paint a Motorcycle,"{'start': [1.8, 6.32, 7.32, 10.86, 13.28, 15.6..."
2,HnTLh99gcxY,Paint a Motorcycle,"{'start': [0.03, 2.37, 4.29, 6.69, 8.42, 8.67,..."
3,RAidUDTPZ-k,Paint a Motorcycle,"{'start': [0.06, 1.38, 3.03, 5.13, 7.44, 8.73,..."
4,tYQoPHwNkho,Paint a Motorcycle,"{'start': [0.0, 6.93, 8.94, 11.07, 12.71, 15.2..."
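
Each `captions` value is a dictionary of parallel lists, which is awkward to feed directly into text models. The sketch below flattens one entry into a single string and removes stop words, as mentioned in the objective. It makes two assumptions: that each entry carries a `text` list alongside `start` (the truncated output above shows only `start`), and that a small hand-written stop-word list is acceptable for illustration. `flatten_captions`, `STOP_WORDS`, and the sample rows are hypothetical.

```python
import pandas as pd

# Small illustrative stop-word list; a real pipeline would likely use a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "is", "in", "we", "so"}


def flatten_captions(entry, stop_words=STOP_WORDS):
    """Join caption segments into one lowercase string without stop words.

    Assumes `entry` is a dict with a "text" list. Anything else (for example
    the NaN that Series.map yields for a missing video_id) gives "".
    """
    if not isinstance(entry, dict):
        return ""
    words = " ".join(entry.get("text", [])).lower().split()
    return " ".join(word for word in words if word not in stop_words)


# Toy rows: one video with captions, one without.
sample = pd.DataFrame(
    {
        "video_id": ["vid_a", "vid_b"],
        "captions": [
            {
                "start": [0.0, 2.5],
                "end": [2.5, 5.0],
                "text": ["we sand the tank", "and apply a primer"],
            },
            float("nan"),
        ],
    }
)

sample["caption_text"] = sample["captions"].apply(flatten_captions)
print(sample[["video_id", "caption_text"]])
```

On the real data the same function could be applied to `merged_df["captions"]`, after confirming the actual keys with `captions[some_video_id].keys()`.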


## Step 5: Clean and Save the Preprocessed Dataset
In this step, we will:
1. Keep only the columns needed downstream: video ID, categories, task description, and captions.
2. Save the cleaned dataset as both a CSV file and a JSON Lines file for use in subsequent notebooks.

In [10]:
# Select only the relevant columns for the final dataset
final_df = merged_df[["video_id", "category_1", "category_2", "task_description", "captions"]]

# Save the preprocessed data to a CSV file
# (each captions dict is written as its string representation)
final_df.to_csv("preprocessed_dataset.csv", index=False)

# Save the preprocessed data as JSON Lines, one record per line (optional)
final_df.to_json("preprocessed_dataset.json", orient="records", lines=True)

# Display a confirmation message
print("Preprocessed dataset saved successfully!")

Preprocessed dataset saved successfully!
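
The two output formats behave differently when read back: the CSV stores each `captions` dictionary as a string, while the JSON Lines file preserves the nested structure. The sketch below demonstrates this on a hypothetical one-row frame written to a temporary directory, so it does not touch the files saved above.

```python
import ast
import os
import tempfile

import pandas as pd

# Toy frame with one nested captions entry.
sample = pd.DataFrame(
    {
        "video_id": ["vid_a"],
        "captions": [{"start": [0.0, 2.5], "text": ["first line", "second line"]}],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    json_path = os.path.join(tmp_dir, "sample.json")

    sample.to_csv(csv_path, index=False)
    sample.to_json(json_path, orient="records", lines=True)

    from_csv = pd.read_csv(csv_path)
    from_json = pd.read_json(json_path, orient="records", lines=True)

csv_value = from_csv.loc[0, "captions"]    # a str holding the dict's repr
json_value = from_json.loc[0, "captions"]  # a dict with the original lists

# The CSV string can be parsed back into a dict with ast.literal_eval.
restored = ast.literal_eval(csv_value)

print(type(csv_value).__name__, type(json_value).__name__)
```

Note that `ast.literal_eval` only works on string cells, so rows with missing captions would need to be skipped or filled first. Reading the JSON Lines file avoids the parsing step altogether.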
