# Data Collection for Indian Election Sentiment Analysis
**Author**: Lakshya Khetan  
**Project**: Twitter Sentiment Analysis for Indian Elections

This notebook demonstrates how to collect Twitter data using our modular data collection system.

## Setup and Configuration

First, let's import the necessary modules and load our configuration.

In [None]:
import sys
import os
sys.path.append('../src')

from data.collector import TwitterDataCollector
from utils.config import ConfigManager
from utils.logger import setup_logger
import pandas as pd

In [None]:
# Load configuration
config_manager = ConfigManager('../config/config.yaml')
config = config_manager.get_config()

# Setup logging
logger = setup_logger('data_collection')
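
Before creating the collector, it can help to confirm that the loaded configuration contains the values the later cells depend on. The sketch below is only an illustration: the helper `missing_config_keys` and the section and key names (`twitter.bearer_token`, `data.output_dir`, and so on) are assumptions, since the actual layout of `config.yaml` is not shown in this notebook.

```python
def missing_config_keys(config, required):
    """Return the dotted paths in `required` that are absent or empty in `config`."""
    missing = []
    for path in required:
        node = config
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                node = None
                break
            node = node[part]
        if node in (None, ""):
            missing.append(path)
    return missing


# Hypothetical layout; the real config.yaml sections may differ.
example_config = {
    "twitter": {"bearer_token": "", "max_results": 100},
    "data": {"output_dir": "../data"},
}
required_keys = ["twitter.bearer_token", "twitter.max_results", "data.output_dir"]

print(missing_config_keys(example_config, required_keys))
# ['twitter.bearer_token']
```

In the notebook, the same check could be run on `config` itself, assuming `get_config()` returns a nested dictionary.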

## Initialize Data Collector

Create an instance of our Twitter data collector with the loaded configuration.

In [None]:
# Initialize the Twitter data collector
collector = TwitterDataCollector(config)

print("Twitter Data Collector initialized successfully!")
print(f"Configuration loaded: {len(config)} sections")

## Collect Twitter Data

Now let's collect some tweets related to Indian elections.

In [None]:
# Define search keywords for Indian elections
keywords = [
    'modi', 'bjp', 'congress', 'election2024', 
    'india election', 'indian politics', 'lokSabha'
]

# Collect tweets
print("Starting data collection...")
tweets_df = collector.search_tweets(
    keywords=keywords,
    count=100,  # Request 100 tweets (fewer may be returned)
    lang='en'   # English tweets only
)

print(f"Collected {len(tweets_df)} tweets")
tweets_df.head()
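
How `search_tweets` turns the keyword list into a search request is not shown here. As a rough sketch, assuming the collector targets the Twitter API v2 search syntax, the keywords could be joined with `OR`, multi-word phrases quoted, and the language restriction expressed with the `lang:` operator. The function `build_query` is hypothetical and not part of `TwitterDataCollector`.

```python
def build_query(keywords, lang="en", exclude_retweets=True):
    """Combine keywords into a single v2-style search query string (illustrative)."""
    terms = []
    for kw in keywords:
        kw = kw.strip()
        if not kw:
            continue
        # Quote multi-word phrases so they are matched as exact phrases
        terms.append(f'"{kw}"' if " " in kw else kw)
    if not terms:
        raise ValueError("at least one non-empty keyword is required")

    query = "(" + " OR ".join(terms) + ")"
    if lang:
        query += f" lang:{lang}"
    if exclude_retweets:
        query += " -is:retweet"
    return query


print(build_query(["modi", "bjp", "india election"]))
# (modi OR bjp OR "india election") lang:en -is:retweet
```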

## Data Analysis

Let's inspect the collected data: how many tweets we have, which columns are available, and what date range they cover.

In [None]:
# Basic statistics
print("Dataset Statistics:")
print(f"Total tweets: {len(tweets_df)}")
print(f"Columns: {list(tweets_df.columns)}")
print(f"Date range: {tweets_df['created_at'].min()} to {tweets_df['created_at'].max()}")

In [None]:
# Sample tweets
print("Sample tweets:")
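# Number rows with enumerate: the DataFrame index may not be 0-based integers
for n, (_, row) in enumerate(tweets_df.head(5).iterrows(), start=1):
    text = row['text']
    # Add an ellipsis only when the text is actually truncated
    preview = text[:100] + ('...' if len(text) > 100 else '')
    print(f"\n{n}. {preview}")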
    print(f"   Created: {row['created_at']}")
    print(f"   User: {row['user']}")
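
Another quick check is how often each search keyword actually appears in the collected text. The sketch below uses a plain list of strings and made-up sample texts (not collected tweets). In the notebook, `tweets_df['text'].tolist()` could be passed instead. The helper `keyword_counts` is hypothetical.

```python
from collections import Counter


def keyword_counts(texts, keywords):
    """Count how many texts contain each keyword (case-insensitive substring match)."""
    counts = Counter({kw: 0 for kw in keywords})
    for text in texts:
        lowered = text.lower()
        for kw in keywords:
            # Substring matching is crude: 'modi' would also match 'commodity'
            if kw.lower() in lowered:
                counts[kw] += 1
    return counts


# Illustrative sample data only
sample_texts = [
    "BJP rally draws large crowd in Delhi",
    "Congress releases manifesto ahead of India election",
    "Modi addresses BJP workers",
]
counts = keyword_counts(sample_texts, ["modi", "bjp", "congress", "india election"])
for kw, n in counts.most_common():
    print(f"{kw}: {n}")
```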

## Data Filtering

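Apply the collector's filters to refine the dataset. The search above already requested English tweets, so the language filter here acts as a safeguard.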

In [None]:
# Filter by language (if needed)
collector.filter_by_language('en')

# Get updated statistics
stats = collector.get_data_stats()
print("Filtered Dataset Statistics:")
for key, value in stats.items():
    print(f"{key}: {value}")
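
Beyond language, two other common refinements are dropping retweets and removing duplicate texts. The sketch below works on a list of dictionaries, such as the output of `tweets_df.to_dict('records')`. The function `filter_tweets` is hypothetical, the `lang` field is an assumption about the collected columns, and the records are made-up examples.

```python
def filter_tweets(records, lang="en"):
    """Keep non-empty, non-retweet records in `lang`, dropping duplicate texts."""
    seen = set()
    kept = []
    for rec in records:
        text = (rec.get("text") or "").strip()
        if not text or rec.get("lang") != lang:
            continue
        if text.startswith("RT @"):  # classic retweet prefix
            continue
        key = text.lower()
        if key in seen:  # case-insensitive duplicate
            continue
        seen.add(key)
        kept.append(rec)
    return kept


# Illustrative records only
records = [
    {"text": "Voting begins in phase one", "lang": "en"},
    {"text": "RT @someone: Voting begins in phase one", "lang": "en"},
    {"text": "voting begins in phase one", "lang": "en"},
    {"text": "मतदान शुरू", "lang": "hi"},
    {"text": "", "lang": "en"},
]
kept = filter_tweets(records)
print(f"Kept {len(kept)} of {len(records)} records")
# Kept 1 of 5 records
```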

## Save Collected Data

Save the collected tweets for further processing.

In [None]:
# Save to CSV
output_file = '../data/collected_tweets.csv'
success = collector.save_data(output_file, format='csv')

if success:
    print(f"✅ Data successfully saved to {output_file}")
    print(f"File size: {os.path.getsize(output_file)} bytes")
else:
    print("❌ Failed to save data")
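
Because the output path is fixed, each run writes to the same file name. If earlier collections should be kept, one option is to add a timestamp to the file name. This is a minimal sketch with a hypothetical helper `timestamped_path`, and it makes no assumption about how `save_data` treats existing files.

```python
from datetime import datetime
from pathlib import Path


def timestamped_path(path, now=None):
    """Return `path` with a _YYYYMMDD_HHMMSS suffix inserted before the extension."""
    now = now or datetime.now()
    p = Path(path)
    return p.with_name(f"{p.stem}_{now:%Y%m%d_%H%M%S}{p.suffix}")


example = timestamped_path("../data/collected_tweets.csv", datetime(2024, 4, 19, 9, 30, 0))
print(example.name)
# collected_tweets_20240419_093000.csv
```

The result could then be passed to the save call in place of `output_file`, for example as `str(timestamped_path(output_file))`.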

## Next Steps

The collected data is now ready for preprocessing. The next notebook will demonstrate:

1. **Text Preprocessing** - Cleaning and preparing the tweet text
2. **Tokenization** - Splitting text into tokens and converting them to numerical sequences
3. **Data Validation** - Ensuring data quality

Navigate to `03_data_preprocessing.ipynb` to continue the workflow.
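
As a preview of the text preprocessing step, the sketch below shows one simple way tweet text might be cleaned: removing URLs and mentions, keeping hashtag words without the `#`, and normalizing whitespace and case. It is an illustration under those assumptions, not the method used in the next notebook. The function `clean_tweet` and the example handle are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Strip URLs and mentions, drop '#' symbols, collapse whitespace, lowercase."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")  # keep the hashtag word itself
    return WHITESPACE_RE.sub(" ", text).strip().lower()


print(clean_tweet("Big turnout in #LokSabha polls! @example_user https://t.co/abc123"))
# big turnout in loksabha polls!
```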