# Data Preprocessing for Sentiment Analysis
**Author**: Lakshya Khetan  
**Project**: Twitter Sentiment Analysis for Indian Elections

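This notebook walks through text preprocessing with the project's modular `TextPreprocessor`: cleaning raw tweets, tokenizing them, building numerical sequences, and saving the results for model training.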

## Setup and Configuration

In [None]:
import sys
sys.path.append('../src')

from data.preprocessor import TextPreprocessor
from utils.config import ConfigManager
from utils.logger import setup_logger
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Load configuration
config_manager = ConfigManager('../config/config.yaml')
config = config_manager.get_config()

# Setup logging
logger = setup_logger('preprocessing')

## Load Collected Data

Load the Twitter data collected in the previous step.

In [None]:
# Load the collected tweets
data_file = '../data/collected_tweets.csv'
try:
    tweets_df = pd.read_csv(data_file)
    print(f"✅ Loaded {len(tweets_df)} tweets from {data_file}")
    print(f"Columns: {list(tweets_df.columns)}")
except FileNotFoundError:
    print("❌ Data file not found. Run the data collection notebook to work with real tweets.")
    # Fall back to a small hand-written sample so the rest of the notebook still runs
    tweets_df = pd.DataFrame({
        'id': range(1, 6),
        'text': [
            "Great work by Modi government! #BJP2024 https://example.com",
            "RT @user: Not happy with current policies... 😞",
            "Congress has better vision for India's future #Congress2024",
            "Election results will be interesting! #Democracy #India",
            "@politician Your policies are affecting common people badly!!!"
        ],
        'created_at': pd.date_range('2023-01-01', periods=5),
        'user': [f'user_{i}' for i in range(1, 6)]
    })
    print("📝 Using sample data for demonstration")

tweets_df.head()

## Initialize Text Preprocessor

Create an instance of our text preprocessing system.
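
For orientation, the cell below reads four flags from a nested `preprocessing.text_cleaning` section of the configuration. A minimal sketch of that shape as a Python dict follows; only the key names come from this notebook, while the boolean values and the `example_config` name are placeholders, and the real `config.yaml` may contain more settings.

```python
# Hypothetical config shape: key names are taken from this notebook,
# the True/False values are placeholders rather than the project's defaults.
example_config = {
    "preprocessing": {
        "text_cleaning": {
            "remove_urls": True,
            "remove_mentions": True,
            "convert_lowercase": True,
            "remove_stopwords": False,
        }
    }
}

# Walk the nested section the same way the notebook cell does
cleaning = example_config["preprocessing"]["text_cleaning"]
for option, enabled in cleaning.items():
    print(f"{option}: {enabled}")
```

If one of these keys is missing from the loaded configuration, the lookups in the next cell would raise a `KeyError`, so this section is worth checking first when initialization fails.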

In [None]:
# Initialize the text preprocessor
preprocessor = TextPreprocessor(config)

print("Text Preprocessor initialized successfully!")
print("Configuration settings:")
print(f"  - Remove URLs: {config['preprocessing']['text_cleaning']['remove_urls']}")
print(f"  - Remove mentions: {config['preprocessing']['text_cleaning']['remove_mentions']}")
print(f"  - Convert to lowercase: {config['preprocessing']['text_cleaning']['convert_lowercase']}")
print(f"  - Remove stopwords: {config['preprocessing']['text_cleaning']['remove_stopwords']}")

## Text Cleaning Demonstration

Let's see how our preprocessing works on individual tweets.
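
To make the cleaning steps concrete, here is a rough regex-based sketch of what a tweet cleaner of this kind might do. It is an illustration only: `sketch_clean_text` and its patterns are hypothetical and are not the implementation behind `TextPreprocessor.clean_text`, which may handle hashtags, emoji, and stopwords differently.

```python
import re

# Hypothetical patterns for illustration; the project's rules may differ
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"^RT\s+")
NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9\s]")


def sketch_clean_text(text, remove_urls=True, remove_mentions=True, lowercase=True):
    """Illustrative cleaner, not the project's TextPreprocessor.clean_text."""
    text = RETWEET_RE.sub("", text)          # drop a leading retweet marker
    if remove_urls:
        text = URL_RE.sub(" ", text)
    if remove_mentions:
        text = MENTION_RE.sub(" ", text)
    if lowercase:
        text = text.lower()
    # Replace punctuation, emoji, and the '#' sign with spaces
    text = NON_ALNUM_RE.sub(" ", text)
    # Collapse runs of whitespace left behind by the substitutions
    return " ".join(text.split())


print(sketch_clean_text("Great work by Modi government! #BJP2024 https://example.com"))
print(sketch_clean_text("RT @user: Not happy with current policies... 😞"))
```

Note that this sketch keeps the hashtag word and discards emoji. Both are design choices that can affect sentiment signals, so the actual preprocessor may treat them differently.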

In [None]:
# Demonstrate text cleaning on sample tweets
print("Text Cleaning Examples:")
print("=" * 50)

# enumerate() numbers the examples from 1 regardless of the DataFrame's index
for n, original_text in enumerate(tweets_df['text'].head(3), start=1):
    clean_text = preprocessor.clean_text(original_text)

    print(f"\n{n}. Original:")
    print(f"   {original_text}")
    print("   Cleaned:")
    print(f"   {clean_text}")
    print("-" * 30)

## Batch Preprocessing

Process the entire dataset using our batch preprocessing function.

In [None]:
# Preprocess the entire dataset
print("Processing entire dataset...")
processed_df = preprocessor.preprocess_dataframe(
    tweets_df, 
    text_column='text'
)

print(f"✅ Processed {len(processed_df)} tweets")
print(f"Columns after preprocessing: {list(processed_df.columns)}")

# Display results
processed_df[['text', 'clean_text']].head()

## Tokenization and Sequence Creation

Convert cleaned text to numerical sequences for machine learning models.
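
As a rough picture of what "numerical sequences" means here, the sketch below builds a word index by frequency and turns each text into a fixed-length list of integers. The helper names, the reserved padding id `0`, and right-padding are assumptions made for illustration; how `fit_tokenizer` and `create_sequences_from_dataframe` actually behave is not shown in this notebook.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer id, most frequent first; 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending frequency, then alphabetically for a deterministic order
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}


def to_padded_sequences(texts, word_index, max_len):
    """Convert texts to integer lists of length max_len (truncate, then right-pad with 0)."""
    sequences = []
    for text in texts:
        # Words missing from the index are skipped in this sketch
        ids = [word_index[w] for w in text.split() if w in word_index]
        ids = ids[:max_len]
        sequences.append(ids + [0] * (max_len - len(ids)))
    return sequences


texts = ["great work by government", "not happy with government policies"]
word_index = build_word_index(texts)
padded = to_padded_sequences(texts, word_index, max_len=6)

print(word_index)
print(padded)
```

Every row has the same length, which is what lets the sequences be stacked into a single two-dimensional array like the one whose `shape` is printed below.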

In [None]:
# Fit tokenizer on cleaned text
print("Fitting tokenizer...")
clean_texts = processed_df['clean_text'].tolist()
preprocessor.fit_tokenizer(clean_texts)

# Get vocabulary information
vocab_size = preprocessor.get_vocabulary_size()
word_index = preprocessor.get_word_index()

print("✅ Tokenizer fitted successfully")
print(f"Vocabulary size: {vocab_size}")
print(f"Sample word indices: {dict(list(word_index.items())[:10])}")

In [None]:
# Create sequences from text
sequences = preprocessor.create_sequences_from_dataframe(
    processed_df,
    text_column='clean_text'
)

print(f"Created sequences with shape: {sequences.shape}")
print(f"Sequence length: {sequences.shape[1]}")
print("\nSample sequences:")
for i in range(min(3, len(sequences))):
    print(f"{i+1}. {sequences[i][:10]}... (showing first 10 tokens)")

## Data Analysis and Visualization

Analyze the preprocessed data to understand patterns.
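
One practical use of these length statistics is sanity-checking the padded sequence length: a length that covers most tweets avoids truncating much text without padding everything to the longest outlier. The sketch below uses a nearest-rank percentile for that purpose. The `suggest_max_len` helper, the sample counts, and the coverage values are illustrative assumptions; this notebook does not state how the project chooses its sequence length.

```python
import math


def suggest_max_len(word_counts, coverage=0.95):
    """Smallest word count that fully covers the given fraction of texts."""
    if not word_counts:
        raise ValueError("word_counts is empty")
    ordered = sorted(word_counts)
    # Nearest-rank percentile: 1-based rank of the value at the requested coverage
    rank = max(1, math.ceil(coverage * len(ordered)))
    return ordered[rank - 1]


# Made-up word counts with one long outlier
counts = [4, 5, 5, 6, 7, 8, 9, 12, 15, 40]
print(suggest_max_len(counts, coverage=0.9))
print(suggest_max_len(counts, coverage=1.0))
```

With real data, the list passed in could come from `word_counts.tolist()` in the cell below.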

In [None]:
# Analyze text lengths
text_lengths = processed_df['clean_text'].str.len()
word_counts = processed_df['clean_text'].str.split().str.len()

print("Text Statistics:")
print(f"Average character length: {text_lengths.mean():.1f}")
print(f"Average word count: {word_counts.mean():.1f}")
print(f"Max character length: {text_lengths.max()}")
print(f"Max word count: {word_counts.max()}")

In [None]:
# Visualize text length distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Character length distribution
axes[0].hist(text_lengths, bins=20, alpha=0.7, color='blue')
axes[0].set_title('Character Length Distribution')
axes[0].set_xlabel('Characters')
axes[0].set_ylabel('Frequency')

# Word count distribution
axes[1].hist(word_counts, bins=20, alpha=0.7, color='green')
axes[1].set_title('Word Count Distribution')
axes[1].set_xlabel('Words')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

## Save Preprocessed Data

Save the preprocessed data and fitted tokenizer for model training.
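
The `.pickle` extension suggests the tokenizer is serialized with Python's `pickle` module, although `save_tokenizer` itself is not shown here. Under that assumption, a minimal save-and-reload round trip might look like the sketch below; `save_object` and `load_object` are hypothetical helpers, and a plain dict stands in for the fitted tokenizer.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Pickle obj to path, creating the parent directory; return True on success."""
    try:
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as fh:
            pickle.dump(obj, fh)
        return True
    except (OSError, pickle.PicklingError):
        return False


def load_object(path):
    """Load a pickled object; only use this on files you created yourself."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Round trip in a temporary directory so nothing is written to the project tree
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "models", "tokenizer.pickle")
    word_index = {"government": 1, "policies": 2}  # stand-in for a fitted tokenizer
    saved = save_object(word_index, path)
    restored = load_object(path)

print(saved, restored == word_index)
```

Reloading right after saving is a cheap check that the training notebook will see the same vocabulary. Because unpickling can execute arbitrary code, only load pickle files from a trusted source.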

In [None]:
# Save preprocessed dataframe
processed_file = '../data/preprocessed_tweets.csv'
processed_df.to_csv(processed_file, index=False)
print(f"✅ Preprocessed data saved to {processed_file}")

# Save sequences as numpy array
sequences_file = '../data/tweet_sequences.npy'
np.save(sequences_file, sequences)
print(f"✅ Sequences saved to {sequences_file}")

# Save tokenizer (create the target directory first if it does not exist)
import os

tokenizer_file = '../models/tokenizer.pickle'
os.makedirs(os.path.dirname(tokenizer_file), exist_ok=True)
success = preprocessor.save_tokenizer(tokenizer_file)
if success:
    print(f"✅ Tokenizer saved to {tokenizer_file}")
else:
    print(f"❌ Failed to save tokenizer to {tokenizer_file}")

## Summary

### What we accomplished:

1. ✅ **Loaded raw Twitter data** from the collection phase (or the built-in sample data if the file was missing)
2. ✅ **Applied text cleaning** (URLs, mentions, and special characters removed as configured)
3. ✅ **Tokenized text** and created vocabulary
4. ✅ **Generated numerical sequences** for model input
5. ✅ **Analyzed text statistics** and distributions
6. ✅ **Saved preprocessed data** for model training

### Next Steps:

The preprocessed data is now ready for model training. The next notebook will demonstrate:

1. **Model Architecture** - Building LSTM sentiment analysis models
2. **Training Process** - Training models on the preprocessed data
3. **Model Evaluation** - Assessing model performance

Navigate to `04_model_training.ipynb` to continue the workflow.