# Data Collection

#### **Overview of Objectives**
- Extract top trending hashtags from GetDayTrends for the United States.
- Fetch tweets associated with each hashtag using the Twitter API.
- Record each tweet's language (so English tweets can be filtered) and remove duplicate tweets.
- Store results in structured CSV files for individual hashtags and a consolidated dataset.

#### Requirements
1. Update the username, password, and email ID in the [config.ini file](../config.ini) before running the code.

#### Expectations
1. This stage generates CSV files in the [data](./data/) folder, one per hashtag.

#### **Step-by-Step Methodology**

1. **Scraping Trending Hashtags**
   - **Tools Used**:
     - `BeautifulSoup` for HTML parsing.
     - `urllib` for fetching the webpage content.

   - **Process**:
     - Top hashtags were extracted from two sections of the GetDayTrends webpage:
       - Most Tweeted Hashtags.
       - Longest-Trending Hashtags.
     - HTML tags and CSS classes were used to locate and parse hashtags.

   - **Output**:
     - A combined, de-duplicated list of hashtags ready for further processing.

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request

def extract_tags(webpage):
    '''
    Extracts the top hashtags from a webpage
    The code is specific to the structure of the webpage
    '''
    soup = BeautifulSoup(webpage, 'html.parser')

    # Find the section containing the top hashtags
    hashtags_section = soup.find('table', class_='ranking')

    all_hashtags = []
    # Collect the text of each hashtag link in the table
    if hashtags_section:
        hashtags = hashtags_section.find_all('a')
        for hashtag in hashtags:
            all_hashtags.append(hashtag.text)
    return all_hashtags


def get_tags_from_url(url):
    '''
    Fetches the webpage at the given URL and extracts its top hashtags
    '''
    # Build a GET request with a custom User-Agent header, then fetch the page
    request = Request(url, headers={'User-Agent': 'insomnia/2023.5.8'})
    webpage = urlopen(request).read()
    return extract_tags(webpage)

In [2]:

# URL of the target page
url_most_tweeted = 'https://getdaytrends.com/united-states/top/tweeted/day/'
url_longest_trending = 'https://getdaytrends.com/united-states/top/longest/day/'


most_tweeted_hashtags = get_tags_from_url(url_most_tweeted)
longest_trending_hashtags = get_tags_from_url(url_longest_trending)

(most_tweeted_hashtags, longest_trending_hashtags)

# combine the two lists and remove duplicates
all_hashtags = list(set(most_tweeted_hashtags + longest_trending_hashtags))
all_hashtags


['#AskAiah',
 '#Survivor47',
 '#PlayStationWrapUp2024',
 '#AEWDynamite',
 '#ThursdayMotivation',
 '#RHOSLC',
 '#CUTOSHI',
 '#thursdayvibes',
 '#GSWvsHOU',
 '#liftoff',
 '#PMSLive',
 '#BBMAs',
 '#ThursdayThoughts',
 '#playstationwrapup',
 '#TheGameAwards',
 '#WhyIChime',
 '#TNFonPrime']
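
Because `set` does not preserve insertion order, the order of `all_hashtags` (and therefore the numeric prefix of each CSV file written later) can differ between runs. The following is a minimal sketch of an order-preserving alternative, assuming a stable ordering is wanted. `merge_unique` is a hypothetical helper that is not part of the original notebook, and the input lists are illustrative rather than real scrape results.

```python
def merge_unique(*tag_lists):
    """Concatenate lists of hashtags, keeping the first occurrence of each."""
    merged = [tag for tags in tag_lists for tag in tags]
    # dict keys keep insertion order (Python 3.7+), so this de-duplicates stably
    return list(dict.fromkeys(merged))


# Illustrative inputs only, not real scrape results
most_tweeted = ['#AEWDynamite', '#BBMAs', '#RHOSLC']
longest_trending = ['#BBMAs', '#TheGameAwards', '#AEWDynamite']
print(merge_unique(most_tweeted, longest_trending))
```

With this approach the most-tweeted hashtags come first, followed by any longest-trending hashtags not already seen.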

2. **Fetching Tweets**
   - **Tools Used**:
     - `twikit` library for searching Twitter (X) with account credentials rather than an official API key.
     - `random`, `time` for implementing delays to respect API rate limits.

   - **Process**:
     - Authenticated using the username, email, and password stored in the [config.ini file](../config.ini). Session cookies are cached in `cookies.json`, so later runs skip the login.
     - For each hashtag:
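       - Top tweets were fetched in pages of about 20 until at least 50 were collected, so the total can exceed 50 (59 in the logged runs).
       - Tweets were de-duplicated by text and stored in CSV files, one per hashtag.
       - Random delays of 5–20 seconds between pages and 20–40 seconds between hashtags were used to reduce the risk of rate-limit violations.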

   - **Output**:
     - CSV files containing tweets, hashtags, and language metadata for each hashtag.
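
The code in this step reads `username`, `password`, and `email` from an `[X]` section of `config.ini`. The sketch below shows a file layout consistent with that and a small check that fails early when a key is missing. The values are placeholders, and `load_credentials` is a hypothetical helper, not part of the notebook.

```python
from configparser import ConfigParser

# Placeholder values; the real file lives at ../config.ini
SAMPLE_CONFIG = """
[X]
username = your_username
email = you@example.com
password = your_password
"""


def load_credentials(parser):
    """Return (username, email, password) from the [X] section."""
    section = parser['X']
    # Treat absent or empty values as missing
    missing = [key for key in ('username', 'email', 'password') if not section.get(key)]
    if missing:
        raise KeyError(f'Missing keys in [X] section: {missing}')
    return section['username'], section['email'], section['password']


config = ConfigParser()
config.read_string(SAMPLE_CONFIG)
username, email, password = load_credentials(config)
print(username, email)
```

Note that `ConfigParser` applies `%` interpolation by default, so a password containing `%` needs either `%%` in the file or `ConfigParser(interpolation=None)`.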

In [None]:
## Getting tweets for each hashtag

from twikit import Client
import time
from configparser import ConfigParser
from random import randint
import os
import pandas as pd

COOKIES_FILE = './cookies.json'

async def get_top_tweets_for_trend(trend, max_count = 50):
    '''
    Get the top tweets for a given trend
    It fetches the first page (up to 20 tweets) and keeps fetching pages until max_count is reached,
    so the returned list can be longer than max_count
    The function waits for a random time between 5 and 20 seconds before fetching the next page
    to avoid getting blocked by Twitter
    '''
    print(f'Getting tweets for {trend}')
    all_tweets = []
    # first set of tweets
    tweets = await client.search_tweet(trend, 'Top', count = 20)
    if (not tweets):
        print(f'No tweets found for {trend}')
        return all_tweets
    all_tweets.extend(tweets)

    while len(all_tweets) < max_count:
        # get the next set of tweets
        wait_time = randint(5, 20)
        print(f'Got {len(all_tweets)} tweets for {trend}. Getting more after {wait_time} seconds...')
        time.sleep(wait_time)
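        # advance the pager itself; calling next() on the first page every time
        # would request the same second page repeatedly
        tweets = await tweets.next()
        if (not tweets):
            break
        all_tweets.extend(tweets)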

    print(f'Got {len(all_tweets)} tweets for {trend}')
    return all_tweets

# Read the config file for credentials
config = ConfigParser()
config.read('../config.ini')
username = config['X']['username']
password = config['X']['password']
email = config['X']['email']

# authenticate with X.com
# if 'cookies.json' is not present, then try to login, else use the cookies
client = Client('en-US')
if os.path.exists(COOKIES_FILE):
    client.load_cookies(COOKIES_FILE)
else:
    await client.login(
        auth_info_1 = username,
        auth_info_2 = email,
        password = password)
    client.save_cookies(COOKIES_FILE)


# Get the top 50 tweets for each hashtag
# and save the tweets in a CSV file
for index, trend in enumerate(all_hashtags):
    wait_time = randint(20, 40)
    print(f'{index}. Waiting for {wait_time} seconds before fetching tweets for {trend}...')
    time.sleep(wait_time)

    all_tweets = await get_top_tweets_for_trend(trend, 50)
    twitter_texts = [tweet.text for tweet in all_tweets]
    twitter_hashtags = [tweet.hashtags for tweet in all_tweets]
    twitter_langs = [tweet.lang for tweet in all_tweets]
    temp_df = pd.DataFrame({'text': twitter_texts, 'all_hashtags': twitter_hashtags, 'lang': twitter_langs})
    # remove duplicates
    temp_df = temp_df.drop_duplicates(subset='text')
    temp_df['hashtag'] = trend
    print(f'Got {len(temp_df)} unique tweets for {trend} from {len(all_tweets)}')

    # make sure the output folder exists before writing
    os.makedirs('./data', exist_ok=True)
    temp_df.to_csv(f'./data/{index}_{trend}.csv', index=False)

0. Waiting for 25 seconds before fetching tweets for #AskAiah...
Getting tweets for #AskAiah
Got 19 tweets for #AskAiah. Getting more after 15 seconds...
Got 39 tweets for #AskAiah. Getting more after 6 seconds...
Got 59 tweets for #AskAiah
Got 41 unique tweets for #AskAiah from 59
1. Waiting for 39 seconds before fetching tweets for #Survivor47...
Getting tweets for #Survivor47
Got 19 tweets for #Survivor47. Getting more after 15 seconds...
Got 39 tweets for #Survivor47. Getting more after 6 seconds...
Got 59 tweets for #Survivor47
Got 39 unique tweets for #Survivor47 from 59
2. Waiting for 35 seconds before fetching tweets for #PlayStationWrapUp2024...
Getting tweets for #PlayStationWrapUp2024
Got 19 tweets for #PlayStationWrapUp2024. Getting more after 17 seconds...
Got 39 tweets for #PlayStationWrapUp2024. Getting more after 13 seconds...
Got 59 tweets for #PlayStationWrapUp2024
Got 45 unique tweets for #PlayStationWrapUp2024 from 59
3. Waiting for 20 seconds before fetching tweets
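
The log above shows 59 tweets collected for a requested maximum of 50, because whole pages are appended. The sketch below shows one way to cap the result and to wait without blocking the event loop, using `asyncio.sleep` in place of `time.sleep`. `FakePage` is a hypothetical stand-in for a paged search result with an async `next()` method, and `collect_up_to` is an illustrative helper. Neither is part of `twikit` or the original notebook.

```python
import asyncio
from random import uniform


class FakePage(list):
    """Hypothetical stand-in for a paged search result with an async next()."""

    def __init__(self, items, remaining_pages):
        super().__init__(items)
        self._remaining = remaining_pages

    async def next(self):
        if not self._remaining:
            return FakePage([], [])
        return FakePage(self._remaining[0], self._remaining[1:])


async def collect_up_to(first_page, max_count, delay_range=(5, 20)):
    """Follow next() until max_count items are collected, then truncate."""
    collected = list(first_page)
    page = first_page
    while page and len(collected) < max_count:
        # asyncio.sleep yields to the event loop instead of blocking it
        await asyncio.sleep(uniform(*delay_range))
        page = await page.next()
        collected.extend(page)
    return collected[:max_count]


# Three fake pages of 20 items each; delays set to zero for the demo
pages = [[f'tweet {i}' for i in range(start, start + 20)] for start in (0, 20, 40)]
first = FakePage(pages[0], pages[1:])
result = asyncio.run(collect_up_to(first, max_count=50, delay_range=(0, 0)))
print(len(result))
```

Inside a Jupyter notebook, where an event loop is already running, the coroutine would be awaited directly (`await collect_up_to(...)`) rather than passed to `asyncio.run`.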

#### **Highlights and Key Learnings**
1. **Efficient Web Scraping**:
   - Leveraged BeautifulSoup for precise extraction.
   - Focused on reusable code to scrape multiple sections.

2. **Respecting API Guidelines**:
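   - Random delays between requests (5–20 seconds between pages, 20–40 seconds between hashtags) reduced the risk of being blocked.
   - These delays lower the request rate but do not by themselves guarantee compliance with Twitter's rate limits.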
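
Fixed random delays lower the request rate, but a request can still be rejected. One common complement, not used in this notebook, is exponential backoff with jitter: wait longer after each failed attempt, up to a cap. The sketch below only computes the wait times. `backoff_delays` and all of its parameter values are hypothetical choices for illustration.

```python
import random


def backoff_delays(base=20.0, factor=2.0, max_delay=300.0, attempts=5, jitter=0.25, rng=random):
    """Yield increasing wait times (seconds) with +/- jitter, capped at max_delay."""
    delay = base
    for _ in range(attempts):
        spread = delay * jitter
        # jitter spreads retries out so they do not all land at the same moment
        yield min(max_delay, delay + rng.uniform(-spread, spread))
        delay = min(max_delay, delay * factor)


for attempt, wait in enumerate(backoff_delays(), start=1):
    print(f'attempt {attempt}: would wait {wait:.1f}s')
```

In practice each yielded value would be passed to a sleep call before retrying the failed request.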

## Next Steps
[Data Preprocessing](./02_DataPreprocessing.ipynb)