# 01 - Data Exploration

Welcome to the Video Similarity Learning project! In this notebook, we'll explore the dataset and understand the structure of video data.

## Learning Objectives

By the end of this notebook, you will:
- Understand the structure of video data
- Load and visualize video frames
- Explore the dataset metadata
- Understand the similarity learning problem
- **Complete 5 hands-on exercises** and a final report, all of which require critical thinking

## Prerequisites

Make sure you have:
1. Installed all required packages: `pip install -r requirements.txt`
2. Run the data download script: `python scripts/download_data.py`

## Important Note

This notebook contains **interactive exercises** throughout. Each exercise builds on earlier concepts and asks you to think critically about the data and write code from scratch. Copying answers from ChatGPT won't help you understand the underlying concepts!

In [None]:
# Import required libraries
import sys
import os
from pathlib import Path

# Add the project root to the path
# (assumes this notebook is run from a directory one level below the project root)
project_root = Path.cwd().parent
sys.path.append(str(project_root))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
from tqdm import tqdm

# Import our utilities
from utils.video_utils import load_video, get_video_info, visualize_frames
from utils.data_utils import VideoDataset, create_sample_dataset

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")

## 1. Dataset Overview

Let's start by exploring the structure of our dataset.

In [None]:
# Define data paths
data_dir = project_root / "data" / "videos"
metadata_file = data_dir / "sample_metadata.csv"
pairs_file = data_dir / "similarity_pairs.csv"

print(f"Data directory: {data_dir}")
print(f"Metadata file: {metadata_file}")
print(f"Pairs file: {pairs_file}")

# Check if files exist
print(f"\nData directory exists: {data_dir.exists()}")
print(f"Metadata file exists: {metadata_file.exists()}")
print(f"Pairs file exists: {pairs_file.exists()}")

In [None]:
# Load metadata
if metadata_file.exists():
    metadata = pd.read_csv(metadata_file)
    print("Dataset Metadata:")
    print(f"Number of videos: {len(metadata)}")
    print(f"Columns: {list(metadata.columns)}")
    print("\nFirst few rows:")
    display(metadata.head())
    
    print("\nDataset statistics:")
    display(metadata.describe())
else:
    print("Metadata file not found. Please run the data download script first.")

## 🎯 EXERCISE 1: Data Quality Check

**Task**: Write code to identify potential data quality issues in the metadata.

**Requirements**:
1. Check for missing values in each column
2. Identify any duplicate video filenames
3. Check if all video files mentioned in metadata actually exist
4. Find any videos with unusual properties (very short/long duration, extreme file sizes)
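
Before you start, here is a minimal sketch of the pandas idioms involved, run on a toy table rather than the real metadata. The column names (`filename`, `duration`) and the 1.5 × IQR outlier rule are assumptions for illustration; check `metadata.columns` for the real names and pick thresholds that suit the data.

```python
import pandas as pd

# Toy metadata; column names are hypothetical stand-ins for the real ones.
toy = pd.DataFrame({
    "filename": ["a.mp4", "b.mp4", "b.mp4", "c.mp4", "d.mp4", "e.mp4"],
    "duration": [4.0, None, 5.5, 5.0, 4.5, 120.0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# All rows whose filename occurs more than once (keep=False marks every copy)
duplicate_rows = toy[toy["filename"].duplicated(keep=False)]

# Flag durations further than 1.5 * IQR from the quartiles (a common rule of thumb)
q1, q3 = toy["duration"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["duration"] < q1 - 1.5 * iqr) | (toy["duration"] > q3 + 1.5 * iqr)]

print(missing_per_column)
print(duplicate_rows)
print(outliers)
```

For the file check, `Path.exists()` on `data_dir / name` is one straightforward option.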

**Your code here**:

In [None]:
# TODO: Write your data quality check code
# Hint: Use pandas methods like isnull() and duplicated(), plus Path.exists() for the file checks

# 1. Check for missing values
# Your code here...

# 2. Check for duplicate filenames
# Your code here...

# 3. Check if video files exist
# Your code here...

# 4. Find unusual videos
# Your code here...

## 2. Video Data Exploration

Now let's explore the actual video files and understand their properties.

In [None]:
# List video files
video_files = sorted(data_dir.glob("*.mp4"))  # sorted so "first N" is deterministic
print(f"Found {len(video_files)} video files")

if video_files:
    print("\nFirst 5 video files:")
    for video_file in video_files[:5]:
        print(f"  - {video_file.name}")

In [None]:
# Analyze video properties
video_info_list = []

for video_file in tqdm(video_files[:10], desc="Analyzing videos"):  # Analyze first 10 videos
    try:
        info = get_video_info(str(video_file))
        info['filename'] = video_file.name
        video_info_list.append(info)
    except Exception as e:
        print(f"Error analyzing {video_file.name}: {e}")

if video_info_list:
    video_info_df = pd.DataFrame(video_info_list)
    print("\nVideo properties:")
    display(video_info_df.describe())
    
    # Plot video durations
    plt.figure(figsize=(10, 6))
    plt.hist(video_info_df['duration'], bins=20, alpha=0.7, edgecolor='black')
    plt.xlabel('Duration (seconds)')
    plt.ylabel('Number of videos')
    plt.title('Distribution of Video Durations')
    plt.grid(True, alpha=0.3)
    plt.show()

## 🎯 EXERCISE 2: Video Property Analysis

**Task**: Analyze the relationship between video properties and create insightful visualizations.

**Requirements**:
1. Create a scatter plot showing the relationship between video duration and file size
2. Group videos by label and create box plots showing duration distribution for each label
3. Calculate the correlation coefficient between duration and file size
4. Identify the video with the highest and lowest frame rate
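
As a warm-up, the sketch below computes a Pearson correlation and finds extreme rows on a toy table. The `file_size_mb` column is a hypothetical name; whether the real data exposes file size, and under which key, depends on what `get_video_info` and the metadata actually contain.

```python
import numpy as np
import pandas as pd

# Toy video properties; values and the file_size_mb column are made up.
toy = pd.DataFrame({
    "duration": [2.0, 4.0, 6.0, 8.0],
    "file_size_mb": [1.1, 2.0, 3.2, 3.9],
    "fps": [24.0, 30.0, 30.0, 60.0],
})

# Series.corr uses Pearson correlation by default
r = toy["duration"].corr(toy["file_size_mb"])

# The same value from NumPy: off-diagonal entry of the 2x2 correlation matrix
r_numpy = np.corrcoef(toy["duration"], toy["file_size_mb"])[0, 1]

# idxmax/idxmin return the row label of the extreme value
fastest = toy.loc[toy["fps"].idxmax()]
slowest = toy.loc[toy["fps"].idxmin()]

print(f"Pearson r: {r:.3f}")
```

A value of `r` close to 1 would suggest file size grows almost linearly with duration; with only a handful of videos, treat the number as indicative rather than conclusive.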

**Your code here**:

In [None]:
# TODO: Write your video property analysis code

# 1. Scatter plot: duration vs file size
# Your code here...

# 2. Box plots by label
# Your code here...

# 3. Correlation coefficient
# Your code here...

# 4. Frame rate analysis
# Your code here...

## 3. Video Frame Visualization

Let's load and visualize frames from some videos to understand their content.

In [None]:
# Load and visualize a sample video
if video_files:
    sample_video = str(video_files[0])
    print(f"Loading video: {os.path.basename(sample_video)}")
    
    # Load video frames
    frames = load_video(sample_video, max_frames=30)
    print(f"Loaded {len(frames)} frames with shape: {frames.shape}")
    
    # Visualize frames
    visualize_frames(frames, num_frames=8)

## 🎯 EXERCISE 3: Frame Analysis

**Task**: Analyze the visual content of video frames and identify patterns.

**Requirements**:
1. Load frames from 3 different videos (different labels)
2. Calculate the average brightness of each frame
3. Create a plot showing brightness variation over time for each video
4. Identify which video has the most consistent brightness
5. Calculate the standard deviation of brightness for each video
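
One simple way to define brightness is the mean pixel value of a frame. The sketch below applies that definition to synthetic clips, assuming frames are stored as a `uint8` array of shape `(num_frames, height, width, 3)`; confirm what `load_video` returns before reusing the axis indices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic clips with an assumed layout of (frames, height, width, channels)
noisy = rng.integers(0, 256, size=(30, 48, 64, 3), dtype=np.uint8)
flat = np.full((30, 48, 64, 3), 128, dtype=np.uint8)

def frame_brightness(frames: np.ndarray) -> np.ndarray:
    """Mean pixel value per frame (averaged over height, width, channels)."""
    return frames.mean(axis=(1, 2, 3))

noisy_brightness = frame_brightness(noisy)
flat_brightness = frame_brightness(flat)

# A lower standard deviation over time means more consistent brightness
print(f"noisy std: {noisy_brightness.std():.3f}")
print(f"flat std:  {flat_brightness.std():.3f}")
```

Converting to grayscale first (for example with `cv2.cvtColor`) gives a perceptually weighted alternative to the plain channel mean.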

**Your code here**:

In [None]:
# TODO: Write your frame analysis code

# 1. Load frames from 3 different videos
# Your code here...

# 2. Calculate average brightness for each frame
# Your code here...

# 3. Plot brightness over time
# Your code here...

# 4. Identify most consistent video
# Your code here...

# 5. Calculate brightness standard deviation
# Your code here...

## 4. Similarity Pairs Analysis

Let's explore the similarity pairs to understand how the similarity learning problem is structured.

In [None]:
# Load similarity pairs
if pairs_file.exists():
    pairs = pd.read_csv(pairs_file)
    print("Similarity Pairs:")
    print(f"Number of pairs: {len(pairs)}")
    print(f"Columns: {list(pairs.columns)}")
    print("\nFirst few pairs:")
    display(pairs.head())
    
    # Analyze similarity distribution
    print("\nSimilarity distribution:")
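    # Sort by label so the bar order (0, 1) matches the colors and tick labels below;
    # value_counts() alone orders by frequency, which could swap the two bars
    similarity_counts = pairs['similarity'].value_counts().sort_index()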
    print(similarity_counts)
    
    # Plot similarity distribution
    plt.figure(figsize=(8, 6))
    similarity_counts.plot(kind='bar', color=['red', 'green'])
    plt.xlabel('Similarity')
    plt.ylabel('Number of pairs')
    plt.title('Distribution of Similarity Pairs')
    plt.xticks([0, 1], ['Different (0)', 'Similar (1)'])
    plt.grid(True, alpha=0.3)
    plt.show()
else:
    print("Pairs file not found. Please run the data download script first.")

## 🎯 EXERCISE 4: Similarity Pair Investigation

**Task**: Deep dive into the similarity pairs to understand the dataset structure.

**Requirements**:
1. Find the most common video pairs (which videos appear together most often)
2. Create a histogram showing how many times each video appears in pairs
3. Check if there are any videos that only appear in similar pairs or only in different pairs
4. Calculate the percentage of similar vs different pairs for each label combination
5. Identify any potential bias in the dataset
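
The counting steps can be sketched on a toy pairs table. Only the `similarity` column is known from the cell above; `video1` and `video2` are hypothetical names for the two video columns, so adapt them to whatever `pairs.columns` reports.

```python
import pandas as pd

# Toy pairs table; video1/video2 are assumed column names.
pairs_toy = pd.DataFrame({
    "video1": ["a", "a", "b", "c"],
    "video2": ["b", "c", "c", "d"],
    "similarity": [1, 0, 1, 0],
})

# Stack both columns so an appearance counts regardless of which side it is on
appearances = pd.concat([pairs_toy["video1"], pairs_toy["video2"]]).value_counts()

# Long format: one row per (video, similarity) appearance
long_form = pd.concat([
    pairs_toy[["video1", "similarity"]].rename(columns={"video1": "video"}),
    pairs_toy[["video2", "similarity"]].rename(columns={"video2": "video"}),
])

# Videos that only ever see one similarity value
kinds_per_video = long_form.groupby("video")["similarity"].nunique()
one_sided = kinds_per_video[kinds_per_video == 1].index.tolist()

print(appearances)
print("Only one pair type:", one_sided)
```

The long format is a design choice: once each appearance is a row, most of the remaining questions reduce to a `groupby`.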

**Your code here**:

In [None]:
# TODO: Write your similarity pair investigation code

# 1. Find most common video pairs
# Your code here...

# 2. Histogram of video appearances
# Your code here...

# 3. Check for videos with only one type of pair
# Your code here...

# 4. Similarity percentage by label combination
# Your code here...

# 5. Identify dataset bias
# Your code here...

## 5. Label Distribution Analysis

Let's analyze the distribution of labels in our dataset.

In [None]:
# Analyze label distribution
if metadata_file.exists():
    label_counts = metadata['label'].value_counts().sort_index()
    
    print("Label distribution:")
    print(label_counts)
    
    # Plot label distribution
    plt.figure(figsize=(12, 6))
    
    plt.subplot(1, 2, 1)
    label_counts.plot(kind='bar')
    plt.xlabel('Label')
    plt.ylabel('Number of videos')
    plt.title('Distribution of Video Labels')
    plt.xticks(rotation=45)
    plt.grid(True, alpha=0.3)
    
    plt.subplot(1, 2, 2)
    plt.pie(label_counts.values, labels=label_counts.index, autopct='%1.1f%%')
    plt.title('Label Distribution (Pie Chart)')
    
    plt.tight_layout()
    plt.show()

## 🎯 EXERCISE 5: Advanced Label Analysis

**Task**: Perform advanced analysis of label patterns and relationships.

**Requirements**:
1. Create a heatmap showing the similarity matrix between different labels (for example, the fraction of similar pairs for each label combination)
2. Calculate the average video duration for each label
3. Find the label with the highest variance in video duration
4. Create a visualization showing the relationship between label and video properties
5. Suggest potential improvements to the dataset based on your analysis
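
Per-label statistics usually come down to a single `groupby`. Here is a small sketch on toy data; it assumes `label` and `duration` live in the same table, which for the real dataset may require joining the metadata with the extracted video properties first.

```python
import pandas as pd

# Toy table; in the real data, label and duration may come from different sources.
toy = pd.DataFrame({
    "label": ["x", "x", "y", "y", "y"],
    "duration": [2.0, 4.0, 5.0, 5.0, 5.0],
})

# Mean and sample variance (ddof=1) of duration for each label
stats = toy.groupby("label")["duration"].agg(["mean", "var"])

# Label whose durations vary the most
most_variable = stats["var"].idxmax()

print(stats)
print("Highest variance:", most_variable)
```

Note that the variance is undefined (`NaN`) for a label with a single video, so labels with very few samples deserve a second look.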

**Your code here**:

In [None]:
# TODO: Write your advanced label analysis code

# 1. Create similarity matrix heatmap
# Your code here...

# 2. Average duration by label
# Your code here...

# 3. Duration variance by label
# Your code here...

# 4. Label vs properties visualization
# Your code here...

# 5. Dataset improvement suggestions
# Your code here...

## 6. Dataset Statistics Summary

Let's create a comprehensive summary of our dataset.

In [None]:
# Create dataset summary
print("=== DATASET SUMMARY ===")
print(f"Total videos: {len(metadata) if metadata_file.exists() else 'N/A'}")
print(f"Total similarity pairs: {len(pairs) if pairs_file.exists() else 'N/A'}")
print(f"Number of unique labels: {metadata['label'].nunique() if metadata_file.exists() else 'N/A'}")

if video_info_list:
    avg_duration = np.mean([info['duration'] for info in video_info_list])
    avg_fps = np.mean([info['fps'] for info in video_info_list])
    avg_width = int(np.mean([info['width'] for info in video_info_list]))
    avg_height = int(np.mean([info['height'] for info in video_info_list]))
    avg_resolution = f"{avg_width}x{avg_height}"
    
    print(f"Average video duration: {avg_duration:.2f} seconds")
    print(f"Average FPS: {avg_fps:.2f}")
    print(f"Average resolution: {avg_resolution}")

print("\n=== DATASET STRUCTURE ===")
print("data/")
print("├── videos/")
print("│   ├── sample_video_0000.mp4")
print("│   ├── sample_video_0001.mp4")
print("│   ├── ...")
print("│   ├── sample_metadata.csv")
print("│   └── similarity_pairs.csv")
print("└── models/")
print("    └── (pre-trained model placeholders)")

## 🎯 FINAL EXERCISE: Dataset Insights Report

**Task**: Write a comprehensive analysis report based on your findings.

**Requirements**:
1. Summarize the key characteristics of the dataset
2. Identify potential challenges for video similarity learning
3. Suggest preprocessing steps that might be helpful
4. Propose a strategy for handling class imbalance (if any)
5. List 3 potential model architectures that might work well for this data
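
If the similar and different pairs turn out to be imbalanced, inverse-frequency weighting is one option among several (resampling is another). The sketch below uses toy labels and the same heuristic as scikit-learn's "balanced" class weights; it is an illustration, not a recommendation specific to this dataset.

```python
import numpy as np

# Toy similarity labels: 6 "different" (0) pairs and 2 "similar" (1) pairs
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# Number of samples in each class
counts = np.bincount(labels)

# Balanced weighting: n_samples / (n_classes * count_per_class)
class_weights = len(labels) / (len(counts) * counts)

# Per-sample weights, e.g. for a weighted loss
sample_weights = class_weights[labels]

print("Class weights:", class_weights)
```

With this scheme the rarer class gets the larger weight, and the sample weights sum to the number of samples.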

**Your report here** (write your markdown inside the `report` string below):

In [None]:
# TODO: Write your dataset insights report
report = """
## Dataset Insights Report

### Key Characteristics:
[Your analysis here]

### Potential Challenges:
[Your analysis here]

### Suggested Preprocessing:
[Your suggestions here]

### Class Imbalance Strategy:
[Your strategy here]

### Recommended Model Architectures:
[Your recommendations here]
"""

print(report)

## Summary

In this notebook, we've explored:

- ✅ **Dataset Structure**: Understanding how videos and metadata are organized
- ✅ **Video Properties**: Analyzing duration, resolution, and frame rates
- ✅ **Frame Visualization**: Seeing what the actual video content looks like
- ✅ **Similarity Pairs**: Understanding how similarity learning is structured
- ✅ **Label Distribution**: Analyzing the distribution of video categories
- ✅ **5 Interactive Exercises**: Hands-on analysis requiring critical thinking

### Key Takeaways:

1. **Video Similarity Learning** is about determining whether two videos are similar or different
2. **Frame Extraction** is crucial - we extract multiple frames from each video to capture temporal information
3. **Data Organization** matters - we need both individual videos and similarity pairs for training
4. **Visual Patterns** - our synthetic dataset has different visual patterns that should be learnable
5. **Critical Analysis** - understanding data quality and patterns is essential for model success
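
To make the frame extraction takeaway concrete, here is a minimal sketch of uniform temporal sampling, which also gives videos of different lengths the same number of frames. The helper `sample_frame_indices` is hypothetical; how `load_video` actually selects frames is not shown in this notebook.

```python
import numpy as np

def sample_frame_indices(total_frames: int, num_samples: int) -> np.ndarray:
    """Pick up to `num_samples` evenly spaced frame indices from a clip."""
    if total_frames <= 0 or num_samples <= 0:
        return np.array([], dtype=int)
    # Never ask for more frames than the clip contains
    num_samples = min(num_samples, total_frames)
    # Evenly spaced positions from the first to the last frame, rounded to indices
    return np.linspace(0, total_frames - 1, num=num_samples).round().astype(int)

long_clip = sample_frame_indices(100, 5)
short_clip = sample_frame_indices(3, 8)

print(long_clip)
print(short_clip)
```

Uniform sampling covers the whole clip, whereas taking the first N frames only sees its beginning; which is preferable depends on where the distinguishing content sits.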

### Next Steps:

In the next notebook, we'll learn about **Feature Extraction** - how to convert video frames into numerical features that our models can use.

---

**Questions to think about:**
- What makes two videos similar in our dataset?
- How might we improve the dataset for better learning?
- What challenges do you see in video similarity detection?
- How would you handle videos of different lengths?
- What preprocessing steps would be most important for this task?