## IMDB Movie Review Dataset - Sentiment Analysis

This notebook loads the IMDB Large Movie Review Dataset for sentiment analysis. The dataset contains 50,000 labeled movie reviews, split evenly into 25,000 training and 25,000 test reviews, with balanced sentiment labels (positive/negative) in each split.

### Dataset Structure
- **Training set**: 25,000 reviews (12,500 positive, 12,500 negative)
- **Test set**: 25,000 reviews (12,500 positive, 12,500 negative)
- **Directory structure**: `train/` and `test/` folders, each containing `pos/` and `neg/` subdirectories
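- **File format**: Individual text files named `[id]_[rating].txt`, where `id` is a unique review id and `rating` is the reviewer's star score on a 1-10 scale (for example, `200_8.txt`)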
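
The `[id]_[rating].txt` naming convention means the rating can be recovered from the filename alone. The following is a minimal sketch of that idea; the helper name `parse_review_filename` and the regular expression are illustrative assumptions, not part of the notebook's loader.

```python
import re

# Assumed pattern for names such as "123_8.txt": numeric id, underscore, numeric rating
FILENAME_PATTERN = re.compile(r"(\d+)_(\d+)\.txt")

def parse_review_filename(filename):
    """Hypothetical helper: return (review_id, rating) as ints, or None if the name does not match."""
    match = FILENAME_PATTERN.fullmatch(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(parse_review_filename("123_8.txt"))  # (123, 8)
print(parse_review_filename("README"))     # None
```

Returning `None` for non-matching names keeps the helper safe to call on every entry in a directory listing, including stray files.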

### Data Loading Function
The `load_imdb_data()` function:
- **Input**: Path to data directory (train or test)
- **Output**: Tuple of (reviews, labels) where reviews are text strings and labels are integers (1=positive, 0=negative)

In [2]:
import os
import pandas as pd

def load_imdb_data(data_dir):
    """Load reviews and labels from one split directory (train or test).

    Expects `pos/` and `neg/` subdirectories of .txt files and returns
    (reviews, labels), where labels are 1 for positive and 0 for negative.
    A missing subdirectory is skipped silently.
    """
    reviews = []
    labels = []

    # Positive reviews are loaded first, then negative ones
    for subdir, label in (('pos', 1), ('neg', 0)):
        class_dir = os.path.join(data_dir, subdir)
        if not os.path.isdir(class_dir):
            continue
        # Sort filenames so the load order is reproducible across runs
        for filename in sorted(os.listdir(class_dir)):
            if filename.endswith('.txt'):
                with open(os.path.join(class_dir, filename), 'r', encoding='utf-8') as f:
                    reviews.append(f.read())
                    labels.append(label)

    return reviews, labels
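
Because the loader skips a missing `pos/` or `neg/` directory without complaint, a wrong path yields an empty dataset rather than an error. One possible alternative, sketched here under the name `load_imdb_split_strict` (a hypothetical variant, not the notebook's function), raises instead, and is exercised on a tiny temporary directory that mimics the expected layout.

```python
import tempfile
from pathlib import Path

def load_imdb_split_strict(data_dir):
    """Hypothetical stricter variant: raise if pos/ or neg/ is missing."""
    reviews, labels = [], []
    for subdir, label in (("pos", 1), ("neg", 0)):
        class_dir = Path(data_dir) / subdir
        if not class_dir.is_dir():
            raise FileNotFoundError(f"Expected directory not found: {class_dir}")
        # Sorted so the order is the same on every run
        for path in sorted(class_dir.glob("*.txt")):
            reviews.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return reviews, labels

# Build a throwaway directory with one positive and one negative review
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pos").mkdir()
    (root / "neg").mkdir()
    (root / "pos" / "0_9.txt").write_text("Great film.", encoding="utf-8")
    (root / "neg" / "0_3.txt").write_text("Dull and slow.", encoding="utf-8")
    demo_reviews, demo_labels = load_imdb_split_strict(root)

print(demo_reviews)  # ['Great film.', 'Dull and slow.']
print(demo_labels)   # [1, 0]
```

Whether to fail loudly or skip silently is a design choice; the strict form tends to surface path mistakes earlier.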

### Load IMDB Dataset
Loads the training and test splits into pandas DataFrames, then prints each split's shape and sentiment distribution.

In [5]:
# Dataset path - adjust this path according to your dataset location
dataset_path = "imdb_dataset"

train_reviews, train_labels = load_imdb_data(os.path.join(dataset_path, 'train'))

test_reviews, test_labels = load_imdb_data(os.path.join(dataset_path, 'test'))

train_df = pd.DataFrame({
    'review': train_reviews,
    'sentiment': train_labels
})

test_df = pd.DataFrame({
    'review': test_reviews,
    'sentiment': test_labels
})

print(f"\nTraining set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")
print("\nTraining set sentiment distribution:")
print(train_df['sentiment'].value_counts())
print("\nTest set sentiment distribution:")
print(test_df['sentiment'].value_counts())


Training set shape: (25000, 2)
Test set shape: (25000, 2)

Training set sentiment distribution:
sentiment
1    12500
0    12500
Name: count, dtype: int64

Test set sentiment distribution:
sentiment
1    12500
0    12500
Name: count, dtype: int64
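
Since `load_imdb_data()` appends every positive review before any negative one, the resulting DataFrames are ordered by label. If a later step depends on row order (for example, slicing off the first rows as a validation set), shuffling first may be worthwhile. A minimal sketch on a toy DataFrame standing in for `train_df`; the toy data and the seed value 42 are arbitrary assumptions.

```python
import pandas as pd

# Toy stand-in for train_df: positives first, then negatives, as the loader produces
toy_df = pd.DataFrame({
    'review': ['good', 'great', 'bad', 'awful'],
    'sentiment': [1, 1, 0, 0],
})

# sample(frac=1) returns all rows in random order; random_state makes it repeatable
shuffled_df = toy_df.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_df)
print(shuffled_df['sentiment'].value_counts())
```

Shuffling changes only the row order, so the label counts stay the same as before.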
