### Setting Up the Environment

Before we begin data collection and analysis, we need to install the required third-party Python libraries. This cell installs the following packages:

- `pandas`: A data manipulation and analysis library
- `ntscraper`: A tool for scraping tweets from Twitter (X) through Nitter instances
- `ftfy`: A library for fixing text encoding issues

The `datetime` and `re` modules used later are part of the Python standard library, so they need no installation.

Run this cell to install these dependencies:

In [None]:
!pip install pandas ntscraper ftfy

### Importing Required Libraries

In this section, we import the necessary libraries for our Political Sentiment Analyzer project:

In [None]:
from ntscraper import Nitter
from datetime import datetime, timedelta
import re
import pandas as pd
from ftfy import fix_text

### Defining the `correct_encoding` Function

The `correct_encoding` function repairs mojibake and similar encoding damage in tweet text by passing it through `ftfy`'s `fix_text`.

In [None]:
def correct_encoding(text):
    return fix_text(text)
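
To make the problem concrete, the sketch below reproduces the most common kind of encoding damage using only the standard library: UTF-8 bytes decoded with the wrong codec. The sample string is made up, and the manual round trip shown here only illustrates the idea; `ftfy` detects and repairs far more cases than this.

```python
# Mojibake appears when UTF-8 bytes are decoded as Latin-1 by mistake.
original = 'café · 10%'  # made-up sample text
garbled = original.encode('utf-8').decode('latin-1')
print(garbled)   # cafÃ© Â· 10%

# Reversing the wrong decode recovers the text in this simple case.
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired)  # café · 10%
```

The stray `Â` in the garbled output is the same character that `format_date` strips from scraped date strings.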

### Defining the `format_date` Function

The `format_date` function cleans and formats the date string from the tweet data into a standardized format.

In [None]:
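def format_date(date_str):
    # Strip the stray 'Â' (mojibake) and the '·' separator from the scraped date string.
    cleaned_date_str = date_str.replace('Â', '').replace('·', '').strip()
    # Parse e.g. 'Jan 5, 2024 3:07 PM UTC' and re-emit it with an ISO-style date.
    date_obj = datetime.strptime(cleaned_date_str, '%b %d, %Y %I:%M %p UTC')
    return date_obj.strftime('%Y-%m-%d %I:%M %p UTC')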
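
As a quick check of the parsing logic, the sketch below runs the same steps on two sample strings. The input format (`'Jan 5, 2024 · 3:07 PM UTC'`) is an assumption about what the scraper returns, inferred from the format string above; the function is repeated so the example runs on its own.

```python
from datetime import datetime

def format_date(date_str):
    # Same logic as the notebook cell above, repeated for a standalone run.
    cleaned = date_str.replace('Â', '').replace('·', '').strip()
    date_obj = datetime.strptime(cleaned, '%b %d, %Y %I:%M %p UTC')
    return date_obj.strftime('%Y-%m-%d %I:%M %p UTC')

# Assumed sample inputs: one clean, one with the mojibake 'Â' before the separator.
clean_result = format_date('Jan 5, 2024 · 3:07 PM UTC')
garbled_result = format_date('Jan 5, 2024 Â· 3:07 PM UTC')
print(clean_result)
print(garbled_result)
```

Removing the separator leaves two consecutive spaces, which is harmless because `strptime` lets whitespace in the format match any run of whitespace in the input.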

### Defining the `extract_tweet_id` Function

The `extract_tweet_id` function extracts the unique tweet ID from the tweet's URL.

In [None]:
def extract_tweet_id(url):
    match = re.search(r'/status/(\d+)', url)
    return match.group(1) if match else None
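
A short usage sketch follows; both URLs are invented for illustration, and the function is repeated so the example is self-contained.

```python
import re

def extract_tweet_id(url):
    # Capture the run of digits that follows '/status/' in the URL.
    match = re.search(r'/status/(\d+)', url)
    return match.group(1) if match else None

# Hypothetical links: one tweet URL and one profile URL without a status segment.
tweet_id = extract_tweet_id('https://twitter.com/example_user/status/1234567890123456789#m')
missing_id = extract_tweet_id('https://twitter.com/example_user')
print(tweet_id)    # 1234567890123456789
print(missing_id)  # None
```

Note that the ID is returned as a string, which avoids any loss of precision if the value is later handled by tools that treat large integers as floats.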

### Defining the `get_info` Function

The `get_info` function extracts and processes essential information from each tweet. This function returns a dictionary containing the tweet's ID, formatted date, and corrected text.

In [None]:
def get_info(tweet):
    return {
        'id': extract_tweet_id(tweet['link']),
        'date': format_date(tweet['date']),
        'text': correct_encoding(tweet['text'])
    }

### Defining the `get_tweets` Function

The `get_tweets` function scrapes tweets that match the given search terms, one day at a time, over a date range. It uses the `Nitter` class from the `ntscraper` library to fetch tweets, processes each one with `get_info`, and yields every non-empty batch so the caller can save results incrementally.

In [None]:
def get_tweets(terms, start_date_str, end_date_str=None):
    # Without an explicit end date, collect up to the current moment.
    end_date = datetime.strptime(end_date_str, '%Y-%m-%d') if end_date_str else datetime.now()
    curr_date = datetime.strptime(start_date_str, '%Y-%m-%d')

    # The end date is exclusive, so consecutive yearly ranges do not overlap.
    while curr_date < end_date:
        next_date = curr_date + timedelta(days=1)

        curr_date_str = curr_date.strftime('%Y-%m-%d')
        next_date_str = next_date.strftime('%Y-%m-%d')

        # A new scraper per day re-runs the Nitter instance check.
        scraper = Nitter(log_level=1, skip_instance_check=False)

        for term in terms:
            try:
                # Query one term over a single one-day window.
                response = scraper.get_tweets(term, since=curr_date_str, until=next_date_str, near='India', language='en')

                new_tweets = [get_info(tweet) for tweet in response.get('tweets', [])]

                if new_tweets:
                    yield new_tweets

            except Exception as e:
                # Log and continue so one failed term does not stop the run.
                print(f'Failed to fetch {term!r} for {curr_date_str}: {e}')

        print(f'Tweets for {curr_date_str} collected!')

        curr_date = next_date
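
The day-by-day windowing can be tested without any network access. The sketch below isolates it in a hypothetical helper, `daily_windows` (not part of the notebook), which yields the `since`/`until` pairs that would be passed to the scraper. It stops before the end date (exclusive), an assumption chosen so that consecutive ranges do not overlap.

```python
from datetime import datetime, timedelta

def daily_windows(start_date_str, end_date_str):
    """Yield (since, until) date strings, one pair per day, excluding the end date."""
    curr = datetime.strptime(start_date_str, '%Y-%m-%d')
    end = datetime.strptime(end_date_str, '%Y-%m-%d')
    while curr < end:
        nxt = curr + timedelta(days=1)
        yield curr.strftime('%Y-%m-%d'), nxt.strftime('%Y-%m-%d')
        curr = nxt

windows = list(daily_windows('2024-01-30', '2024-02-02'))
for since, until in windows:
    print(since, until)
```

Because `timedelta` arithmetic handles month and year boundaries, the window rolls from 31 January to 1 February without special cases.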

### Defining the Search Terms

Below is the list of search terms used to collect tweets related to India's Union Budget. The terms cover economic policy, individual sectors, and public and political reactions.

In [None]:
terms = [
    'Budget',
    'India Budget',
    'Union Budget',
    'Indian Economy',
    'Finance Minister Budget',
    'Economic Survey',
    'Tax Reforms',
    'Income Tax',
    'GST',
    'Fiscal Deficit',
    'Subsidies',
    'Infrastructure Spending',
    'Public Expenditure',
    'Social Welfare Budget',
    'Agriculture Budget',
    'Healthcare Budget',
    'Modi Government Budget',
    'FM Nirmala Sitharaman Budget',
    'Indian Parliament Budget',
    'Budget Reactions',
    'Opposition Response Budget',
    'Middle Class Budget',
    'Corporate Tax Budget',
    'Defense Budget',
    'Railway Budget',
    'Education Budget',
    'Automobile Budget',
    'Real Estate Budget',
    'Startups Budget',
    'MSME Budget',
    'Banking Sector Budget',
    'Energy Sector Budget',
    'Technology Budget',
    'Digital India Budget',
    'Green Energy Budget',
    'Rural Development Budget',
    'Budget Session',
    'Budget Day',
    'Budget Announcement',
    'Pre-Budget Survey',
    'Post-Budget Analysis'
]

### Defining the `generate_date_ranges` Function

The `generate_date_ranges` function returns one `(start_date, end_date)` pair of strings per year in the given period, each running from 1 January of that year to 1 January of the next. These ranges let us collect tweets one year at a time.

In [None]:
def generate_date_ranges(start_year, end_year):
    date_ranges = []

    for year in range(start_year, end_year + 1):
        start_date = f'{year}-01-01'
        end_date = f'{year + 1}-01-01'
        date_ranges.append((start_date, end_date))

    return date_ranges
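
For reference, this is what the function produces for a two-year span. The function is repeated so the snippet runs on its own, and the years are arbitrary.

```python
def generate_date_ranges(start_year, end_year):
    # One (start, end) pair per year; each end is 1 January of the following year.
    date_ranges = []
    for year in range(start_year, end_year + 1):
        date_ranges.append((f'{year}-01-01', f'{year + 1}-01-01'))
    return date_ranges

ranges = generate_date_ranges(2021, 2022)
print(ranges)
# [('2021-01-01', '2022-01-01'), ('2022-01-01', '2023-01-01')]
```

Each range ends on the same date the next one starts, so whether the end date is treated as inclusive or exclusive decides if 1 January is collected twice.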

### Collecting and Storing Tweets

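The following code collects tweets for each year in the specified range and appends every batch to that year's CSV file. Rows are written without a header, and because the file is opened in append mode, re-running the cell adds duplicate rows; these are removed in the cleaning step.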

In [None]:
start_year = 2021
end_year = datetime.now().year

date_ranges = generate_date_ranges(start_year, end_year)

for start_date_str, end_date_str in date_ranges:
    if end_date_str > datetime.now().strftime('%Y-%m-%d'):
        end_date_str = None

    year = datetime.strptime(start_date_str, '%Y-%m-%d').year
    file_path = f'../../data/budget_{year}.csv'

    for tweets in get_tweets(terms, start_date_str, end_date_str):
        df = pd.DataFrame(tweets)
        df.to_csv(file_path, mode='a', index=False, header=False, encoding='utf-8')

    print(f'Tweets for year {year} collected!')
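
The check that resets a future end date compares date strings directly. This works because zero-padded `YYYY-MM-DD` strings sort in the same order as the dates they represent. The sketch below shows the comparison in a hypothetical helper, `clamp_end_date`, with a fixed "today" so the result is reproducible.

```python
from datetime import date

def clamp_end_date(end_date_str, today):
    """Return None when the range end lies after `today`, else the end unchanged."""
    return None if end_date_str > today.isoformat() else end_date_str

today = date(2024, 7, 15)  # fixed, hypothetical "today"
future_end = clamp_end_date('2025-01-01', today)
past_end = clamp_end_date('2024-01-01', today)
print(future_end)  # None
print(past_end)    # 2024-01-01
```

The same trick would not work for formats such as `DD-MM-YYYY`, where string order and date order differ.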

### Cleaning and Saving Tweet Data

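The following code reads each year's CSV file, removes rows with duplicate tweet IDs, parses the timestamps, sorts the tweets chronologically, and overwrites the file with only the `datetime` and `text` columns plus a header row.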

In [None]:
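for year in range(start_year, end_year + 1):
    file_path = f'../../data/budget_{year}.csv'

    # The raw files have no header row, so supply the column names here.
    df = pd.read_csv(file_path, header=None, names=['tweet_id', 'datetime', 'text'])

    # The same tweet can match several search terms; keep one row per tweet ID.
    df.drop_duplicates(subset='tweet_id', keep='first', inplace=True)

    # Parse the 12-hour timestamp strings so that sorting is chronological.
    df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %I:%M %p %Z')

    df.sort_values(by='datetime', inplace=True)

    # Drop the ID column and overwrite the file, this time with a header row.
    df = df[['datetime', 'text']]

    df.to_csv(file_path, index=False)

    print(f'Cleaned data for year {year}, saved successfully!')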
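
Parsing the timestamps before sorting matters because the stored strings use a 12-hour clock, so sorting them as text puts `01:15 PM` ahead of `11:30 AM`. The sketch below shows the difference using only the standard library. The rows are invented, and the plain-Python deduplication stands in for what `drop_duplicates` does in the cell above.

```python
from datetime import datetime

# Hypothetical rows in the (tweet_id, datetime, text) layout of the raw files.
rows = [
    ('3', '2024-02-01 01:15 PM UTC', 'afternoon tweet'),
    ('1', '2024-02-01 11:30 AM UTC', 'morning tweet'),
    ('3', '2024-02-01 01:15 PM UTC', 'afternoon tweet'),  # duplicate ID
]

def parse(ts):
    return datetime.strptime(ts, '%Y-%m-%d %I:%M %p UTC')

# Keep the first row seen for each tweet ID.
unique = {}
for tweet_id, ts, text in rows:
    unique.setdefault(tweet_id, (ts, text))

timestamps = [ts for ts, _ in unique.values()]
by_string = sorted(timestamps)           # text order: wrong for a 12-hour clock
by_time = sorted(timestamps, key=parse)  # chronological order

print(by_string)
print(by_time)
```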