# Phase II: Data Curation, Exploratory Analysis and Plotting (5\%)

Each **project group** will submit a single **jupyter notebook** which contains:

1. (1\%) Expresses the central motivation of the project and explains the (at least) two key questions to be explored. Gives a summary of the data processing pipeline so a technical expert can easily follow along.
2. (2\%) Obtains, cleans, and merges all data sources involved in the project.
3. (2\%) Builds at least two visualizations (graphs/plots) from the data which help to understand or answer the questions of interest. These visualizations will be graded based on how much information they can effectively communicate to readers. Please make sure your visualization are sufficiently distinct from each other.

## Gathering Twitter Data
### Sources
- https://builtin.com/articles/selenium-web-scraping
- https://www.selenium.dev/documentation/webdriver/elements/finders/
- https://www.selenium.dev/documentation/webdriver/actions_api/wheel/
- https://selenium-python.readthedocs.io/locating-elements.html

*Note: Twitter pages first load JavaScript (that then fetches the data) so BeautifulSoup won't work here*

In [1]:
# Selenium: Python library used for automating web browser for web scraping
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import pandas as pd
import time
from datetime import datetime

In [2]:
# scrapes given url and gathers profile and tweet data
def extractTwitterData(url, scroll_amount, team):
    """ Scrapes given Twitter/X profile page to gather profile and tweet data
    Requires ChromeDriver installation [update executable path once installed]
    
    Args:
        url(str): URL of Twitter/X profile (ex: 'https://x.com/ManCity')
        scroll_amount(int): Number of times the page is scrolled to load tweets
        team: team username for finding page elements (ex: 'ManCity')

    Returns:
        profile_data(dict): Dictionary of profile information
        all_tweets_data(dict): List of dictionaries of individual tweet data
    """
    cService = webdriver.ChromeService(executable_path='/Users/KinseyBellerose/Downloads/chromedriver-mac-arm64/chromedriver')
    driver = webdriver.Chrome(service=cService)
    driver.get(url)
    time.sleep(5) # let the page load

    profile_data = extractProfileData(driver, team)
    all_tweets_data = []

    # extract data from tweets as the page is scrolled
    # increase scroll_amount to load more tweets
    for i in range(scroll_amount):
        tweets = driver.find_elements(By.CSS_SELECTOR, 'article[data-testid="tweet"]')
        for tweet in tweets:
            tweet_data = extractTweetData(tweet)
            all_tweets_data.append(tweet_data)
        # scroll down on page
        driver.execute_script('window.scrollTo(0,document.body.scrollHeight);')
        time.sleep(5)
    driver.quit() # closes browsers and stops processes
    return profile_data, all_tweets_data


def extractProfileData(driver, team):
    """ Extract profile information from Twitter/X profile
    Args: 
        driver: Selenium web driver open on profile page
        team(str): Team name for finding page elements
    Returns:
        profile_data(dict): Dictionary containing profile data
    """
    profile_data = {
        'display_name': driver.find_element(By.CSS_SELECTOR, '[data-testid="UserName"]').text,
        'join_date': driver.find_element(By.CSS_SELECTOR, '[data-testid="UserJoinDate"]').text,
        'following_count': driver.find_element(By.CSS_SELECTOR, f'a[href="/{team}/following"]').text,
        'follower_count': driver.find_element(By.CSS_SELECTOR, f'[data-testid="primaryColumn"] [href="/{team}/verified_followers"]').text,
        'post_count': driver.find_element(By.XPATH, '//div[@data-testid="primaryColumn"]//div[contains(text(), "posts")]').text,
        'scraped_at': datetime.now(),
        'profile_url': driver.current_url,
        'team': team
    }
    return profile_data

def extractTweetData(tweet):
    """ Extract individual tweet information from Twitter/X profile
    Args:
        tweet: Page content for single tweet
    Returns:
        tweet_data(dict): Dictionary containing tweet data
    """
    tweet_stats = extract_tweet_stats(tweet)
    author = tweet.find_element(By.CSS_SELECTOR, '[data-testid="User-Name"]').text
    # name, username, and other chars were combined in this element
    author_parts = author.split('\n')
    try:
        tweet_text = tweet.find_element(By.CSS_SELECTOR, '[data-testid="tweetText"]').text,
    except Exception as e:
        tweet_text = 'Error retrieving tweet'

    tweet_data = {
                'text': tweet_text,
                'author_name': author_parts[0] if len(author_parts) > 0 else "",
                'author_username': author_parts[1] if len(author_parts) > 1 else "",
                'timestamp': tweet.find_element(By.CSS_SELECTOR, 'time').get_attribute('datetime'),
                'relative_time': tweet.find_element(By.CSS_SELECTOR, 'time').text,
                'tweet_url': tweet.find_element(By.CSS_SELECTOR, 'a[href*="/status/"]').get_attribute('href'),
                'replies': tweet_stats['replies'],
                'reposts': tweet_stats['reposts'],
                'likes': tweet_stats['likes'],
                'views': tweet_stats['views'],
            }
    return tweet_data

def extract_tweet_stats(tweet):
    """ Retrieves reply, repost, like, and view counts from specific tweet web element
    Args:
        tweet: Page content for a tweet
    Returns:
        Dictionary containing tweet statistics with replies, reposts, likes, and view counts uncleaned
    """
    stat_section = tweet.find_element(By.CSS_SELECTOR, '[role="group"]')
    # stat_section returns as "38 91 1.3K 33K"
    stat_text = stat_section.text.split('\n') # turn into list
    return {
        'replies': stat_text[0] if len(stat_text) > 0 else "0",
        'reposts': stat_text[1] if len(stat_text) > 1 else "0", 
        'likes': stat_text[2] if len(stat_text) > 2 else "0",
        'views': stat_text[3] if len(stat_text) > 3 else "0"
    }

In [3]:
def create_profiles_df(all_profile_data):
    """ Converts profile dictionaries into DataFrame
    Args:
        all_profile_data(dict): List of profile dictionaries for each team
    Returns:
        profiles_df(DataFrame): DataFrame with a row for each team profile
    """
    profiles_series = []
    for profile in all_profile_data:
        profiles_series.append(pd.Series(profile))
    profiles_df = pd.DataFrame(profiles_series)
    return profiles_df

In [4]:
def create_tweet_df(tweets_data):
    """ Converts a team's tweet dictionaries into a DataFrame
    Args:
        tweets_Data(dict): List of tweet dictionaries for a team
    Returns:
        tweets_df(DataFrame): DataFrame with a row for each individual tweet
    """
    tweet_series = []
    for tweet in tweets_data:
        tweet_series.append(pd.Series(tweet))
    tweets_df = pd.DataFrame(tweet_series)
    return tweets_df

In [5]:
# extracts profile + tweet data from each team
def get_teams_twitter_data(teams):
    """ Scrapes Twitter/X for multiple teams and returns profile and tweet data
    Args:
        teams: Collection of team usernames (Ex: 'ManCity')
    Returns:
        profiles_data(dict): List of dictionaries for each profile
        tweets_data(dict): Dictionary linking team usernames to all their tweet data
    """
    profiles_data = []
    tweets_data = {}
    for team in teams:
        url = f"https://x.com/{team}"
        profile_data, team_tweets_data = extractTwitterData(url, 10, team)
        profiles_data.append(profile_data)
        tweets_data[team] = team_tweets_data

    return profiles_data, tweets_data

In [6]:
# turns extracted data into one profile df and a tweet df for each team
def get_team_dfs(profiles_data, tweets_data):  
    """ Converts scraped data into a profile DataFrame and tweet DataFrame
    Args:
        profiles_data(dict): Dictionaries for each team profile
        tweets_data(dict): Dictionary pairing team usernames to tweet dictionaries
    returns:
        combined_tweets_df(DataFrame): DataFrame containing all tweet data for all teams
        profiles_df(DataFrame): DataFrame where each row represents data for one team
    """
    profiles_df = create_profiles_df(profiles_data)
    tweets_list = []
    for team, tweets in tweets_data.items():
        team_tweets_df = create_tweet_df(tweets)
        # add team_name column
        team_tweets_df['team_name'] = team
        tweets_list.append(team_tweets_df)
    combined_tweets_df = pd.concat(tweets_list, ignore_index=True)
    return combined_tweets_df, profiles_df

In [7]:
# ~ 4 minutes
team_usernames = {'ManCity','ManUtd', 'ChelseaFC', 'Arsenal', 'LFC'}
profiles_data, tweets_data = get_teams_twitter_data(team_usernames)

In [8]:
# Note this is only extracting first ~40 tweets from each team 
tweets_dfs, profiles_df = get_team_dfs(profiles_data, tweets_data)
tweets_dfs.to_csv('team_tweets.csv', index=False)
profiles_df.to_csv('team_profiles.csv', index=False)
tweets_dfs

Unnamed: 0,text,author_name,author_username,timestamp,relative_time,tweet_url,replies,reposts,likes,views,team_name
0,(Stunning from Divine Mukasa against Villarrea...,Manchester City,@ManCity,2025-10-22T18:00:19.000Z,2h,https://x.com/ManCity/status/1981058046244901138,34,89,904,40K,ManCity
1,"(An unstoppable header! \n@BernardoCSilva\n ,)",Manchester City,@ManCity,2025-10-22T17:20:00.000Z,2h,https://x.com/ManCity/status/1981047900701389088,31,83,885,32K,ManCity
2,"(Another game, another goal! \n@ErlingHaaland\...",Manchester City,@ManCity,2025-10-22T17:10:00.000Z,2h,https://x.com/ManCity/status/1981045384651714923,43,138,1.3K,35K,ManCity
3,"(Loud and proud as always! ,)",Manchester City,@ManCity,2025-10-21T21:46:32.000Z,22h,https://x.com/ManCity/status/1980752587453350349,70,793,7.4K,280K,ManCity
4,"(A good night's work ,)",Manchester City,@ManCity,2025-10-21T21:23:02.000Z,22h,https://x.com/ManCity/status/1980746673673388196,55,499,7.9K,92K,ManCity
...,...,...,...,...,...,...,...,...,...,...,...
198,(26' - Goal for Frankfurt. Rasmus Kristensen.\...,Liverpool FC,@LFC,2025-10-22T19:25:59.000Z,46m,https://x.com/LFC/status/1981079605327826954,954,585,2.5K,307K,LFC
199,(19’ - We’re forced into an early change as Br...,Liverpool FC,@LFC,2025-10-22T19:19:09.000Z,53m,https://x.com/LFC/status/1981077887840104585,155,132,1.3K,166K,LFC
200,"(Come on you Reds ,)",Liverpool FC,@LFC,2025-10-22T19:06:09.000Z,1h,https://x.com/LFC/status/1981074616237117515,79,343,3.4K,127K,LFC
201,(1’ - Up and running at Deutsche Bank Park \n\...,Liverpool FC,@LFC,2025-10-22T19:00:22.000Z,1h,https://x.com/LFC/status/1981073161195307216,51,141,1.1K,102K,LFC


In [9]:
profiles_df.head()

Unnamed: 0,display_name,join_date,following_count,follower_count,post_count,scraped_at,profile_url,team
0,Manchester City\n@ManCity,Joined April 2008,46.6K Following,18M Followers,174K posts,2025-10-22 16:08:21.045752,https://x.com/ManCity,ManCity
1,Manchester United\n@ManUtd,Joined April 2012,135 Following,38.8M Followers,97K posts,2025-10-22 16:09:20.570527,https://x.com/ManUtd,ManUtd
2,Arsenal\n@Arsenal,Joined April 2009,78.3K Following,22.6M Followers,118.1K posts,2025-10-22 16:10:19.195047,https://x.com/Arsenal,Arsenal
3,Chelsea FC\n@ChelseaFC,Joined March 2009,265 Following,26.3M Followers,126.2K posts,2025-10-22 16:11:18.863206,https://x.com/ChelseaFC,ChelseaFC
4,Liverpool FC\n@LFC,Joined January 2009,335.3K Following,24.9M Followers,130.7K posts,2025-10-22 16:12:19.424287,https://x.com/LFC,LFC
