# Data Preparation for Text Summarization (Colab Version)

This notebook prepares data for fine-tuning LLMs on text summarization. It mounts Google Drive, installs the required packages, and prepares the dataset.

## 1. Mount Google Drive and Set Up Environment

First, we'll mount Google Drive and create the project directory structure. The required packages are installed in Section 2.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Set up project directory
import os

# Change this to your Google Drive project path
PROJECT_DIR = "/content/drive/MyDrive/text_summarization_project"
os.makedirs(PROJECT_DIR, exist_ok=True)

# Create necessary subdirectories
for dir_name in ['data/raw', 'data/processed', 'models', 'src']:
    os.makedirs(os.path.join(PROJECT_DIR, dir_name), exist_ok=True)
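The cell above only works inside Colab, because `google.colab` is not importable elsewhere. The sketch below shows one way the same setup could fall back to a local folder when the notebook is run outside Colab. The helper names `resolve_project_dir` and `make_project_tree`, and the idea of a local fallback path, are assumptions for illustration and are not part of the original notebook.

```python
import os


def resolve_project_dir(drive_path, fallback_path):
    """Return drive_path inside Colab (after mounting Drive), else fallback_path.

    Hypothetical helper: the original notebook always mounts Drive.
    """
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        return fallback_path
    drive.mount('/content/drive')
    return drive_path


def make_project_tree(root, subdirs=('data/raw', 'data/processed', 'models', 'src')):
    """Create root and the given subdirectories; return the created paths."""
    created = []
    for name in subdirs:
        path = os.path.join(root, *name.split('/'))
        os.makedirs(path, exist_ok=True)  # also creates root if it is missing
        created.append(path)
    return created
```

With these helpers, `PROJECT_DIR = resolve_project_dir("/content/drive/MyDrive/text_summarization_project", "./text_summarization_project")` followed by `make_project_tree(PROJECT_DIR)` would reproduce the layout created above.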

## 2. Install Required Packages

Install all the necessary packages for the project.

In [None]:
# Install required packages
!pip install -q transformers datasets accelerate bitsandbytes peft trl
!pip install -q torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
# scikit-learn provides train_test_split, which is imported in Section 3
!pip install -q evaluate rouge-score scikit-learn numpy pandas matplotlib seaborn
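Because `torch` is pinned while the other packages are not, it can help to confirm which versions actually ended up installed. The following is a minimal sketch using the standard library's `importlib.metadata`. The helper name `report_versions` is hypothetical, and the package list in the usage note is only an example.

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found
```

For example, `report_versions(["torch", "transformers", "datasets", "peft", "trl"])` returns a dictionary that can be printed or checked before continuing. In Colab, the runtime may need a restart after installation for a changed `torch` version to take effect.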

## 3. Import Required Libraries and Set Up Configuration

Import necessary libraries and set up the project configuration.

In [None]:
import os
import json
import pandas as pd
import numpy as np
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

# Configure paths
DATA_DIR = os.path.join(PROJECT_DIR, "data")
RAW_DATA_DIR = os.path.join(DATA_DIR, "raw")
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, "processed")

# Create directories if they don't exist
os.makedirs(RAW_DATA_DIR, exist_ok=True)
os.makedirs(PROCESSED_DATA_DIR, exist_ok=True)

# Configuration parameters (the three split fractions should sum to 1.0)
RANDOM_SEED = 42
TRAIN_SIZE = 0.8
VAL_SIZE = 0.1
TEST_SIZE = 0.1
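To make the role of these parameters concrete, here is a dependency-free sketch of how an 80/10/10 split could be derived from them. It shuffles row indices with a fixed seed and slices them. The function `split_indices` is an assumption for illustration, not the notebook's own method; the `train_test_split` function imported above could achieve a similar result with two successive calls.

```python
import math
import random


def split_indices(n, train_size=0.8, val_size=0.1, test_size=0.1, seed=42):
    """Shuffle range(n) reproducibly and slice it into train/val/test index lists."""
    # Compare with a tolerance, since the fractions are floating-point values.
    if not math.isclose(train_size + val_size + test_size, 1.0):
        raise ValueError("split fractions must sum to 1.0")

    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched

    n_train = round(n * train_size)
    n_val = round(n * val_size)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test split
    return train, val, test
```

Calling `split_indices(len(df), TRAIN_SIZE, VAL_SIZE, TEST_SIZE, RANDOM_SEED)` on a hypothetical DataFrame `df` yields three disjoint index lists that can be passed to `df.iloc[...]`. Assigning the remainder to the test split keeps every row in exactly one split even when the fractions do not divide `n` evenly.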