# 🎬 IMDb Data Preprocessing
**Project:** Director Performance Tracker  
**Objective:** Prepare IMDb datasets for SQL analysis: unzip, convert, clean, and preprocess.

---

**Author:** Anna Reyes Trave  
**Date:** May 2025  
**Tools:** Python, pandas, gzip, shutil

## 1️⃣ Set Up Environment
Import necessary libraries and set folder paths.

## ✅ What we're doing:
- **Unzipping files:** We'll work with `.tsv.gz` files (compressed tab-separated values). These need to be **unzipped** before further processing.
- **Converting to CSV:** We'll later convert them to `.csv` files, which are easier to handle in most data analysis workflows.
- **Cleaning/preprocessing:** We’ll also prepare folders to store cleaned versions of the data.
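
As a rough illustration of the first two steps, the sketch below decompresses one `.tsv.gz` file and re-saves it as CSV. The function names `decompress_gz` and `tsv_to_csv` are hypothetical helpers, not the project's actual code. Treating `\N` as a missing value follows IMDb's convention for null fields, and reading every column as text is an assumption made to postpone type decisions until cleaning.

```python
import gzip
import shutil

import pandas as pd


def decompress_gz(gz_path, tsv_path):
    """Stream a .gz file to disk without loading it all into memory."""
    with gzip.open(gz_path, "rb") as src, open(tsv_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def tsv_to_csv(tsv_path, csv_path):
    """Read a tab-separated file and write it back out as comma-separated."""
    df = pd.read_csv(
        tsv_path,
        sep="\t",
        dtype=str,              # assumption: keep everything as text for now
        keep_default_na=False,  # avoid turning strings such as "NA" into NaN
        na_values=["\\N"],      # IMDb marks missing fields with \N
    )
    df.to_csv(csv_path, index=False)
    return df
```

Some IMDb fields contain literal quote characters, so passing `quoting=csv.QUOTE_NONE` to `read_csv` may be needed on the real files.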

## 📂 Folder structure:
- `data/`: Main data folder containing the **raw IMDb datasets**.
- `data/unzipped/`: Where we will place the **unzipped** `.tsv` files.
- `data/csv/`: Where we will save the **CSV versions** of the datasets.
- `data/cleaned/`: Where cleaned and preprocessed data will go.
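
To make the layout concrete, here is a minimal sketch of where a single raw file would land at each stage. The helper `stage_paths` and the `_cleaned.csv` naming are assumptions for illustration; only the four folder names come from the list above.

```python
from pathlib import Path

DATA = Path("data")  # main data folder holding the raw .tsv.gz files


def stage_paths(raw_name):
    """Return the path of one dataset at each stage of the workflow."""
    suffix = ".tsv.gz"
    # Strip the compressed-TSV suffix to get the dataset's base name.
    stem = raw_name[: -len(suffix)] if raw_name.endswith(suffix) else raw_name
    return {
        "raw": DATA / raw_name,
        "unzipped": DATA / "unzipped" / f"{stem}.tsv",
        "csv": DATA / "csv" / f"{stem}.csv",
        # Hypothetical naming for cleaned output; the project may differ.
        "cleaned": DATA / "cleaned" / f"{stem}_cleaned.csv",
    }
```

For example, `stage_paths("title.basics.tsv.gz")["csv"]` points at `data/csv/title.basics.csv`.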

## 🛠️ Libraries used:
- **os:** For interacting with the operating system (creating folders, handling paths).
- **gzip:** For working with `.gz` compressed files.
- **shutil:** To help copy and extract files.
- **pandas:** The main library for data analysis and manipulation.

> 🔍 **Note:**  
> We are **not importing NumPy yet** because, at this stage, we only handle file operations and basic data cleaning, which pandas fully covers. NumPy is installed in the environment and will be used later if needed for numerical analysis.

## 💾 Git tracking note:
We have set up the `.gitignore` file to exclude the entire `data/` folder from tracking. This ensures we don’t accidentally commit large dataset files (e.g., the IMDb raw data), which can cause push errors or exceed GitHub’s file size limits.

However, to keep the folder structure visible in the repository (since Git does not track empty folders by default), we have added small `README.md` files inside each subfolder (`unzipped/`, `csv/`, and `cleaned/`). These files:

- 💬 Document the purpose of each folder.
- 📁 Force Git to track the otherwise empty folders, ensuring the structure is preserved even when the data files are excluded.

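This keeps the repo clean, documented, and fully structured, making it easier for others (and ourselves!) to understand the workflow. One caveat: the `README.md` files are only tracked if the ignore rules leave them un-ignored, because Git cannot re-include a file whose parent directory is itself excluded.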
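
The placeholder files can be created by hand, but a small script keeps them consistent. The sketch below is one possible way to do it; the function `write_placeholders` and the one-line descriptions in `FOLDER_NOTES` are hypothetical and stand in for whatever text the real README files contain.

```python
from pathlib import Path

# Hypothetical descriptions; the project's actual README text may differ.
FOLDER_NOTES = {
    "data/unzipped": "Unzipped `.tsv` files.",
    "data/csv": "CSV versions of the datasets.",
    "data/cleaned": "Cleaned and preprocessed data.",
}


def write_placeholders(root=".", notes=FOLDER_NOTES):
    """Create each folder and add a README.md if one is not already there."""
    created = []
    for folder, text in notes.items():
        path = Path(root) / folder
        path.mkdir(parents=True, exist_ok=True)
        readme = path / "README.md"
        if not readme.exists():  # never overwrite an existing README
            readme.write_text(f"# {path.name}\n\n{text}\n", encoding="utf-8")
            created.append(readme)
    return created
```

Running it a second time creates nothing, so it is safe to call from a setup cell.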

```python
# Import libraries
import os
import gzip
import shutil
import pandas as pd

# Define folder paths
RAW_DATA_FOLDER = 'data'
UNZIPPED_FOLDER = 'data/unzipped'
CSV_FOLDER = 'data/csv'
CLEANED_FOLDER = 'data/cleaned'

# Create output folders if they don't exist
os.makedirs(UNZIPPED_FOLDER, exist_ok=True)
os.makedirs(CSV_FOLDER, exist_ok=True)
os.makedirs(CLEANED_FOLDER, exist_ok=True)

print("Environment set up completed.")
```

```
Environment set up completed.
```
