# Phase II: Data Curation, Exploratory Analysis and Plotting (5\%)

Each **project group** will submit a single **jupyter notebook** which contains:

1. (1\%) Expresses the central motivation of the project and explains the (at least) two key questions to be explored. Gives a summary of the data processing pipeline so a technical expert can easily follow along.
2. (2\%) Obtains, cleans, and merges all data sources involved in the project.
3. (2\%) Builds at least two visualizations (graphs/plots) from the data which help to understand or answer the questions of interest. These visualizations will be graded based on how much information they can effectively communicate to readers. Please make sure your visualization are sufficiently distinct from each other.

In [293]:
import requests
import pandas as pd
from datetime import datetime
import time

In [294]:
season_start = '2025-08-16'
season_end = '2026-05-25'

teams = ['Arsenal FC', 'Manchester City FC', 'Manchester United FC', 'Chelsea FC', 'Liverpool FC']
API_KEY = "001fa633-fc31-4e53-a402-9dfcf5c25c70"

In [295]:
def get_article_data(article, team):
    """ Format specific article json response
    Args:
        article(dictionary): Raw article data from Guardian API
        team(str): Premier League team name for this article
    Returns: 
        dictionary: dictionary containing article information for this team; includes team, title, description, headline, 
                    author, publication, wordcount, published_at, url, section, content_type
    """
    fields = article.get('fields', {})
    article_info = {
        'team': team,
        'title': article.get('webTitle', ''),
        'description': fields.get('trailText', ''),
        'headline': fields.get('headline', ''),
        'author': fields.get('byline', ''),
        'publication': fields.get('publication', ''),
        'wordcount': fields.get('wordcount', ''),
        'published_at': article.get('webPublicationDate', ''),
        'url': article.get('webUrl', ''),
        'section': article.get('sectionName', ''),
        'content_type': article.get('type', ''),
    }
    return article_info

In [296]:
def get_guardian_articles(team, from_date, to_date, api_key, page_size):
    """
    Retrieves articles from The Guardian for a specific team between a specific date range
        Args:
            team(str): Premier League team to search for such as 'Arsenal' or 'Manchester City'
            from_date(str): start date in YYYY-MM-DD format
            to_date(str): end date in YYYY-MM-DD format
            api_key(str): Guardian API key
            page_size(int): Number of articles to retrieve

        Returns: 
            Dataframe: Dataframe containing article information for this team.
                       Includes team, title, description, headline, author, publication, wordcount, 
                       published_at, url, section, content_type
    """
    url = "https://content.guardianapis.com/search"
    params = {
        'q': f'"{team}" Premier League',
        'from-date': from_date,
        'to-date': to_date,
        'api-key': api_key,
        'page-size': page_size,
        'show-fields': 'headline,trailText,byline,publication,wordcount',
        'show-tags': 'keyword',
    }
    response = requests.get(url, params=params)
    data = response.json()
    articles_data = []
    
    if 'results' in data['response']:
        for article in data['response']['results']:
            article_info = get_article_data(article, team)
            articles_data.append(article_info)

    start = datetime.strptime(from_date, '%Y-%m-%d').strftime('%b %d, %Y')
    end = datetime.strptime(to_date, '%Y-%m-%d').strftime('%b %d, %Y')
            
    print(f"Retrieved {len(articles_data)} articles for {team} from {start} to {end}")
    return pd.DataFrame(articles_data)

In [297]:
def get_all_teams_guardian_news(api_key, season_start, season_end, page_size):
    """Retrieve news data for five Premier League teams for a single season
    Args:
        api_key(str): Guardian API key
        season_start(str): start date in YYYY-MM-DD format
        season_end(str): end date in YYYY-MM-DD format
        page_size(int): Number of articles to retrieve per team
    Returns: 
        Dataframe with all article data for all teams
    
    """
    teams = ['Arsenal FC', 'Manchester City FC', 'Manchester United FC', 'Chelsea FC', 'Liverpool FC']
    all_articles = []
    
    for team in teams:
        team_df = get_guardian_articles(
            team=team,
            from_date=season_start,
            to_date=season_end, 
            api_key=api_key,
            page_size=page_size
        )
        all_articles.append(team_df)
    
        return pd.concat(all_articles, ignore_index=True)

In [298]:
# Data for 2025-26 season
all_news_df = get_all_teams_guardian_news(
    api_key=API_KEY,
    season_start=season_start,
    season_end=season_end,
    page_size=100
)

Retrieved 100 articles for Arsenal FC from Aug 16, 2025 to May 25, 2026


In [299]:
def clean_guardian_articles(df):
    """cleans the articles collected from gaurdian
    Args:
        df: uncleaned dataframs of articles from gaurdian API
    Returns: 
        df: datafram with all cleaned relevant article data for all teams
    """
    df['article_datetime'] = pd.to_datetime(df['published_at'], errors='coerce')
    df['article_date'] = df['article_datetime'].dt.date

    df = df.dropna(subset=['article_datetime'])
    
    if 'url' in df.columns:
        df = df.drop_duplicates(subset='url')
    elif 'id' in df.columns:
        df = df.drop_duplicates(subset='id')

    df = df.sort_values('article_datetime')

    keep_cols = [c for c in [
        'team', 'article_datetime', 'article_date',
        'title', 'description',
        'url', 'content_type'
    ] if c in df.columns]

    cleaned = df[keep_cols].copy()
    cleaned = cleaned.reset_index(drop=True)

    return cleaned

In [300]:
clean_articles_df = clean_guardian_articles(all_news_df)
clean_articles_df.head()


Unnamed: 0,team,article_datetime,article_date,title,description,url,content_type
0,Arsenal FC,2025-08-16 14:02:16+00:00,2025-08-16,Aston Villa v Newcastle: Premier League – as i...,<strong>Minute-by-minute report:</strong> Newc...,https://www.theguardian.com/football/live/2025...,liveblog
1,Arsenal FC,2025-08-17 18:01:27+00:00,2025-08-17,Manchester United 0-1 Arsenal: Premier League ...,Riccardo Calafiori punished an error from Alta...,https://www.theguardian.com/football/live/2025...,liveblog
2,Arsenal FC,2025-08-18 07:00:53+00:00,2025-08-18,Premier League: 10 talking points from the wee...,"Kyle Walker has the World Cup in his sights, N...",https://www.theguardian.com/football/2025/aug/...,article
3,Arsenal FC,2025-08-22 01:00:18+00:00,2025-08-22,Premier League may help overcome Australia’s ‘...,Coverage on free-to-air commercial TV for the ...,https://www.theguardian.com/football/2025/aug/...,article
4,Arsenal FC,2025-08-22 15:55:05+00:00,2025-08-22,Premier League team news: predicted lineups fo...,Manchester City host Tottenham on Saturday whi...,https://www.theguardian.com/football/2025/aug/...,article


In [301]:
def load_premier_league_matches(api_key):
    """fetches all games for selected teams for the current season including dates/times
    Args:
        api_key(str): Football-Data.org API token.

    Returns:
        df: cleaned matches with important cols like start and end dates/times
    """

    url = "https://api.football-data.org/v4/competitions/PL/matches"
    headers = {"X-Auth-Token": api_key}
    response = requests.get(url, headers=headers)
    data = response.json()


    matches = []
    for m in data['matches']:
        matches.append({
            "match_id": m["id"],
            "matchday": m["matchday"],
            "utcDate": m["utcDate"],
            "homeTeam": m["homeTeam"]["name"],
            "awayTeam": m["awayTeam"]["name"],
            "status": m["status"],
            "home_goals": m["score"]["fullTime"]["home"],
            "away_goals": m["score"]["fullTime"]["away"]
        })
    matches_df = pd.DataFrame(matches)
    matches_df["match_datetime"] = (pd.to_datetime(matches_df["utcDate"], utc=True).dt.tz_convert(None))
    matches_df['match_date'] = matches_df['match_datetime'].dt.date


    teams_of_interest = [
        "Arsenal FC",
        "Chelsea FC",
        "Liverpool FC",
        "Manchester City FC",
        "Manchester United FC"
    ]
    team_matches = matches_df[
        matches_df['homeTeam'].isin(teams_of_interest) |
        matches_df['awayTeam'].isin(teams_of_interest)
    ].copy()


    rows = []
    for _, row in team_matches.iterrows():
        
        # Home team
        if row["homeTeam"] in teams_of_interest:
            rows.append({
                "team": row["homeTeam"],
                "opponent": row["awayTeam"],
                "match_datetime": row["match_datetime"],
                "match_date": row["match_date"],
            })
            
        # Away team
        if row["awayTeam"] in teams_of_interest:
            rows.append({
                "team": row["awayTeam"],
                "opponent": row["homeTeam"],
                "match_datetime": row["match_datetime"],
                "match_date": row["match_date"],
            })
            
    team_matches_df = pd.DataFrame(rows)
    team_matches_df = team_matches_df.sort_values(["team", "match_datetime"]).reset_index(drop=True)
    
    team_matches_df["game_number"] = (
        team_matches_df.groupby("team").cumcount() + 1
    )
    
    return team_matches_df

In [302]:
api_key = "96015ed5367c48e7a815d82b865eb24a"

team_matches_df = load_premier_league_matches(api_key)
team_matches_df.head()

Unnamed: 0,team,opponent,match_datetime,match_date,game_number
0,Arsenal FC,Manchester United FC,2025-08-17 15:30:00,2025-08-17,1
1,Arsenal FC,Leeds United FC,2025-08-23 16:30:00,2025-08-23,2
2,Arsenal FC,Liverpool FC,2025-08-31 15:30:00,2025-08-31,3
3,Arsenal FC,Nottingham Forest FC,2025-09-13 11:30:00,2025-09-13,4
4,Arsenal FC,Manchester City FC,2025-09-21 15:30:00,2025-09-21,5


In [303]:
def add_match_windows(team_matches_df):
    """gives the start/end windows for each game that articles will be sorted into
    Args:
        team_matches_df(df): all the matches for each team of interest

    Returns:
        df: cleaned matches now with added start/end windows for article classification
    """
    df = team_matches_df.copy()
    df = df.sort_values(["team", "match_datetime"]).reset_index(drop=True)

    df["next_match_datetime"] = df.groupby("team")["match_datetime"].shift(-1)
    season_end = pd.to_datetime("2025-06-01", utc=True).tz_convert(None)
    df["next_match_datetime"] = df["next_match_datetime"].fillna(season_end)

    # start window = current game, end window = start of next game
    df["window_start"] = df["match_datetime"]
    df["window_end"]   = df["next_match_datetime"]

    df["window_start"] = pd.to_datetime(df["window_start"], utc=True).dt.tz_convert(None)
    df["window_end"]   = pd.to_datetime(df["window_end"],   utc=True).dt.tz_convert(None)

    return df


In [304]:

match_windows_df = add_match_windows(team_matches_df)
match_windows_df.head()

Unnamed: 0,team,opponent,match_datetime,match_date,game_number,next_match_datetime,window_start,window_end
0,Arsenal FC,Manchester United FC,2025-08-17 15:30:00,2025-08-17,1,2025-08-23 16:30:00,2025-08-17 15:30:00,2025-08-23 16:30:00
1,Arsenal FC,Leeds United FC,2025-08-23 16:30:00,2025-08-23,2,2025-08-31 15:30:00,2025-08-23 16:30:00,2025-08-31 15:30:00
2,Arsenal FC,Liverpool FC,2025-08-31 15:30:00,2025-08-31,3,2025-09-13 11:30:00,2025-08-31 15:30:00,2025-09-13 11:30:00
3,Arsenal FC,Nottingham Forest FC,2025-09-13 11:30:00,2025-09-13,4,2025-09-21 15:30:00,2025-09-13 11:30:00,2025-09-21 15:30:00
4,Arsenal FC,Manchester City FC,2025-09-21 15:30:00,2025-09-21,5,2025-09-28 15:30:00,2025-09-21 15:30:00,2025-09-28 15:30:00


In [305]:
def assign_articles_to_games(clean_articles_df, match_windows_df):
    """assigns articles to the games they correspond to (after game just played, before next game)
    Args:
        clean_articles_df(df): cleaned set of articles from garudian API
        match_windows_df(df): dataset of games with window timelines to sort articles
    Returns:
        final_result(df): cleaned dataset of all articles assigned to their corresponding game
    """
    articles = clean_articles_df.copy()
    windows = match_windows_df.copy()
    
    articles["article_datetime"] = (
        pd.to_datetime(articles["article_datetime"], utc=True)
        .dt.tz_convert(None)
    )


    articles["article_datetime"] = pd.to_datetime(articles["article_datetime"])

    results = []
    for team in windows["team"].unique():
        team_articles = articles[articles["team"] == team]
        team_windows = windows[windows["team"] == team]

        for _, w in team_windows.iterrows():
            mask = (
                (team_articles["article_datetime"] > w["window_start"]) &
                (team_articles["article_datetime"] <= w["window_end"])
            )

            articles_in_window = team_articles[mask].copy()
            articles_in_window["game_number"] = w["game_number"]
            articles_in_window["match_date"] = w["match_datetime"]

            results.append(articles_in_window)

    final_result = pd.concat(results).reset_index(drop=True)
    return final_result

In [306]:
articles_with_games_df = assign_articles_to_games(clean_articles_df, match_windows_df)
articles_with_games_df.head(10)

Unnamed: 0,team,article_datetime,article_date,title,description,url,content_type,game_number,match_date
0,Arsenal FC,2025-08-17 18:01:27,2025-08-17,Manchester United 0-1 Arsenal: Premier League ...,Riccardo Calafiori punished an error from Alta...,https://www.theguardian.com/football/live/2025...,liveblog,1,2025-08-17 15:30:00
1,Arsenal FC,2025-08-18 07:00:53,2025-08-18,Premier League: 10 talking points from the wee...,"Kyle Walker has the World Cup in his sights, N...",https://www.theguardian.com/football/2025/aug/...,article,1,2025-08-17 15:30:00
2,Arsenal FC,2025-08-22 01:00:18,2025-08-22,Premier League may help overcome Australia’s ‘...,Coverage on free-to-air commercial TV for the ...,https://www.theguardian.com/football/2025/aug/...,article,1,2025-08-17 15:30:00
3,Arsenal FC,2025-08-22 15:55:05,2025-08-22,Premier League team news: predicted lineups fo...,Manchester City host Tottenham on Saturday whi...,https://www.theguardian.com/football/2025/aug/...,article,1,2025-08-17 15:30:00
4,Arsenal FC,2025-08-22 21:45:03,2025-08-22,West Ham 1-5 Chelsea: Premier League – live re...,<strong>Minute-by-minute report:</strong> Chel...,https://www.theguardian.com/football/live/2025...,liveblog,1,2025-08-17 15:30:00
5,Arsenal FC,2025-08-23 19:20:42,2025-08-23,Arsenal 5-0 Leeds: Premier League – as it happ...,<strong>Minute-by-minute report:</strong> Vikt...,https://www.theguardian.com/football/live/2025...,liveblog,2,2025-08-23 16:30:00
6,Arsenal FC,2025-08-25 07:00:06,2025-08-25,Premier League: 10 talking points from the wee...,Richarlison and Martín Zubimendi are changing ...,https://www.theguardian.com/football/2025/aug/...,article,2,2025-08-23 16:30:00
7,Arsenal FC,2025-08-25 22:05:50,2025-08-25,Newcastle 2-3 Liverpool: Premier League – as i...,Ten-man Newcastle fought back from two goals d...,https://www.theguardian.com/football/live/2025...,liveblog,2,2025-08-23 16:30:00
8,Arsenal FC,2025-08-28 23:01:04,2025-08-28,Premier League: 10 things to look out for this...,"How Ruben Amorim could stop the rot, Brighton ...",https://www.theguardian.com/football/blog/2025...,article,2,2025-08-23 16:30:00
9,Arsenal FC,2025-08-29 16:47:41,2025-08-29,Premier League team news: predicted lineups fo...,Liverpool and Arsenal clash at Anfield on Sund...,https://www.theguardian.com/football/2025/aug/...,article,2,2025-08-23 16:30:00
