# Phase II: Data Curation, Exploratory Analysis and Plotting (5\%)

Each **project group** will submit a single **Jupyter notebook** which:

1. (1\%) Expresses the central motivation of the project and explains the (at least) two key questions to be explored. Gives a summary of the data processing pipeline so a technical expert can easily follow along.
2. (2\%) Obtains, cleans, and merges all data sources involved in the project.
3. (2\%) Builds at least two visualizations (graphs/plots) from the data which help to understand or answer the questions of interest. These visualizations will be graded on how much information they effectively communicate to readers. Please make sure your visualizations are sufficiently distinct from each other.

In [56]:
import requests
import pandas as pd
from datetime import datetime
import time

In [66]:
season_start = '2024-08-16'
season_end = '2025-05-25'

teams = ['Arsenal', 'Manchester City', 'Manchester United', 'Chelsea', 'Liverpool']
API_KEY = "001fa633-fc31-4e53-a402-9dfcf5c25c70"
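Writing the key into the notebook as a literal exposes it to anyone who can read the file. A minimal sketch of one alternative is to read it from an environment variable. The variable name `GUARDIAN_API_KEY` and the helper `load_api_key` are assumptions for illustration and are not part of the original notebook.

```python
import os


def load_api_key(env_var="GUARDIAN_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key to the API
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook")
    return key
```

If the variable is set in the shell that launches Jupyter, `API_KEY = load_api_key()` could replace the literal assignment.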

In [68]:
def get_article_data(article, team):
    """ Format specific article json response
    Args:
        article(dictionary): Raw article data from Guardian API
        team(str): Premier League team name for this article
    Returns: 
        dictionary: dictionary containing article information for this team; includes team, title, description, headline, 
                    author, publication, wordcount, published_at, url, section, content_type
    """
    fields = article.get('fields', {})
    article_info = {
        'team': team,
        'title': article.get('webTitle', ''),
        'description': fields.get('trailText', ''),
        'headline': fields.get('headline', ''),
        'author': fields.get('byline', ''),
        'publication': fields.get('publication', ''),
        'wordcount': fields.get('wordcount', ''),
        'published_at': article.get('webPublicationDate', ''),
        'url': article.get('webUrl', ''),
        'section': article.get('sectionName', ''),
        'content_type': article.get('type', ''),
    }
    return article_info

In [75]:
def get_guardian_articles(team, from_date, to_date, api_key, page_size):
    """Retrieve Guardian articles about one team within a date range.

    Args:
        team(str): Premier League team to search for, such as 'Arsenal' or 'Manchester City'
        from_date(str): start date in YYYY-MM-DD format
        to_date(str): end date in YYYY-MM-DD format
        api_key(str): Guardian API key
        page_size(int): Maximum number of articles to retrieve (a single page of results)

    Returns:
        DataFrame: article information for this team. Includes team, title, description,
                   headline, author, publication, wordcount, published_at, url, section,
                   content_type
    """
    url = "https://content.guardianapis.com/search"
    params = {
        'q': f'"{team}" Premier League',
        'from-date': from_date,
        'to-date': to_date,
        'api-key': api_key,
        'page-size': page_size,
        'show-fields': 'headline,trailText,byline,publication,wordcount',
        'show-tags': 'keyword',
    }
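    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors such as a rejected key
    data = response.json()
    articles_data = []

    # Only the first page of results is read, so at most page_size articles are returned
    for article in data.get('response', {}).get('results', []):
        article_info = get_article_data(article, team)
        articles_data.append(article_info)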

    start = datetime.strptime(from_date, '%Y-%m-%d').strftime('%b %d, %Y')
    end = datetime.strptime(to_date, '%Y-%m-%d').strftime('%b %d, %Y')
            
    print(f"Retrieved {len(articles_data)} articles for {team} from {start} to {end}")
    return pd.DataFrame(articles_data)
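`get_guardian_articles` reads a single page of results, so any team with more matching articles than `page_size` is truncated without warning. The sketch below shows one way to walk through every page, assuming the response body reports a `pages` count and that the request accepts a `page` parameter. `collect_all_pages` and `fetch_page` are hypothetical names. `fetch_page` stands in for a function that requests one page number and returns the decoded JSON, and a fake version is used here so the example runs without network access.

```python
import time


def collect_all_pages(fetch_page, max_pages=50, pause_seconds=0.0):
    """Call fetch_page(1), fetch_page(2), ... and gather every result (hypothetical helper)."""
    results = []
    page = 1
    while page <= max_pages:
        body = fetch_page(page).get("response", {})
        results.extend(body.get("results", []))
        # Stop once the last page reported by the response has been read
        if page >= body.get("pages", 1):
            break
        page += 1
        time.sleep(pause_seconds)  # optional pause between requests
    return results


def fake_fetch(page):
    """Stand-in for a real request: three pages with one article each."""
    return {"response": {"pages": 3, "results": [{"id": f"article-{page}"}]}}


articles = collect_all_pages(fake_fetch)
print(len(articles))  # 3
```

The `max_pages` cap is a guard against runaway loops, and a non-zero `pause_seconds` may help stay within whatever rate limit applies to the key.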

In [76]:
def get_all_teams_guardian_news(api_key, season_start, season_end, page_size):
    """Retrieve news data for five Premier League teams for a single season
    Args:
        api_key(str): Guardian API key
        season_start(str): start date in YYYY-MM-DD format
        season_end(str): end date in YYYY-MM-DD format
        page_size(int): Number of articles to retrieve per team
    Returns: 
        Dataframe with all article data for all teams
    
    """
    teams = ['Arsenal', 'Manchester City', 'Manchester United', 'Chelsea', 'Liverpool']
    all_articles = []
    
    for team in teams:
        team_df = get_guardian_articles(
            team=team,
            from_date=season_start,
            to_date=season_end, 
            api_key=api_key,
            page_size=page_size
        )
        all_articles.append(team_df)
    
    return pd.concat(all_articles, ignore_index=True)

In [77]:
# Data for the 2024-25 season
all_news_df = get_all_teams_guardian_news(
    api_key=API_KEY,
    season_start=season_start,
    season_end=season_end,
    page_size=100
)

Retrieved 100 articles for Arsenal from Aug 16, 2024 to May 25, 2025
Retrieved 100 articles for Manchester City from Aug 16, 2024 to May 25, 2025
Retrieved 100 articles for Manchester United from Aug 16, 2024 to May 25, 2025
Retrieved 100 articles for Chelsea from Aug 16, 2024 to May 25, 2025
Retrieved 100 articles for Liverpool from Aug 16, 2024 to May 25, 2025
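Because each team is queried separately, a report on a match between two of the five clubs can be returned once per club. Below is a minimal sketch for spotting such overlaps before analysis, assuming the combined data is available as a list of dictionaries (for example via `all_news_df.to_dict('records')`). `find_shared_urls` is a hypothetical helper and the sample URLs are placeholders.

```python
from collections import defaultdict


def find_shared_urls(records):
    """Map each URL listed under more than one team to the sorted list of those teams."""
    teams_by_url = defaultdict(set)
    for record in records:
        teams_by_url[record["url"]].add(record["team"])
    return {url: sorted(teams) for url, teams in teams_by_url.items() if len(teams) > 1}


sample = [
    {"team": "Arsenal", "url": "https://example.com/arsenal-v-chelsea"},
    {"team": "Chelsea", "url": "https://example.com/arsenal-v-chelsea"},
    {"team": "Liverpool", "url": "https://example.com/liverpool-v-everton"},
]
shared = find_shared_urls(sample)
print(shared)
```

Whether to keep one row per team and article or one row per article depends on the question being asked, so the sketch only reports overlaps and does not drop rows.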


In [78]:
all_news_df

Unnamed: 0,team,title,description,headline,author,publication,wordcount,published_at,url,section,content_type
0,Arsenal,Arsenal 1-0 Newcastle: Premier League – as it happened,Declan Rice’s screamer means one point separates Newcastle in third and seventh placed Forest,Arsenal 1-0 Newcastle: Premier League – as it happened,Rob Smyth,theguardian.com,3651,2025-05-18T17:48:34Z,https://www.theguardian.com/football/live/2025/may/18/arsenal-v-newcastle-premier-league-live,Football,liveblog
1,Arsenal,Ipswich 0-4 Arsenal: Premier League – live reaction,Arsenal delayed Liverpool’s title party and pushed Ipswich to the brink of relegation with a rout at Portman Road,Ipswich 0-4 Arsenal: Premier League – live reaction,Niall McVeigh,theguardian.com,2958,2025-04-20T15:23:39Z,https://www.theguardian.com/football/live/2025/apr/20/ipswich-v-arsenal-premier-league-live,Football,liveblog
2,Arsenal,Liverpool 2-2 Arsenal: Premier League – as it happened,"Arsenal came from two down to earn a 2-2 draw at Anfield, Mikel Merino scoring before being sent off",Liverpool 2-2 Arsenal: Premier League – as it happened,Dominic Booth,theguardian.com,4427,2025-05-11T17:56:58Z,https://www.theguardian.com/football/live/2025/may/11/liverpool-v-arsenal-premier-league-live,Football,liveblog
3,Arsenal,Arsenal v Brentford: Premier League – as it happened,<strong>Minute-by-minute report: </strong>Yoane Wissa cancelled out Thomas Partey’s second-half opener to earn Brentford a share of the spoils at the Emirates Stadium,Arsenal v Brentford: Premier League – as it happened,Barry Glendenning,theguardian.com,3546,2025-04-12T19:09:45Z,https://www.theguardian.com/football/live/2025/apr/12/arsenal-brentford-premier-league-live,Football,liveblog
4,Arsenal,Arsenal 1-2 Bournemouth: Premier League – as it happened,<strong>Minute-by-minute report:</strong> Andoni Iraola’s side came from behind to boost their European hopes with victory at the Emirates. Scott Murray was watching,Arsenal 1-2 Bournemouth: Premier League – as it happened,Scott Murray,theguardian.com,4331,2025-05-03T19:22:14Z,https://www.theguardian.com/football/live/2025/may/03/arsenal-v-bournemouth-premier-league-live,Football,liveblog
5,Arsenal,Premier League guaranteed five teams in Champions League after Arsenal win,The Premier League will be guaranteed at least five teams in the 2025-26 Champions League after Arsenal’s stunning 3-0 win over Real Madrid,Premier League guaranteed five teams in Champions League after Arsenal win,Reuters and Guardian sport,The Guardian,203,2025-04-09T07:43:48Z,https://www.theguardian.com/football/2025/apr/09/premier-league-guaranteed-five-teams-in-champions-league-after-arsenal-win,Football,article
6,Arsenal,Everton 1-1 Arsenal: Premier League – as it happened,Arsenal lost further ground in the Premier League title race after dropping points against Everton in a 1-1 draw at Goodison Park.,Everton 1-1 Arsenal: Premier League – as it happened,Simon Burnton,theguardian.com,2740,2025-04-05T13:52:54Z,https://www.theguardian.com/football/live/2025/apr/05/everton-v-arsenal-premier-league-live,Football,liveblog
7,Arsenal,Arsenal 2-2 Crystal Palace: Premier League – as it happened,Jean-Phillipe Mateta came off to score a stunning goal as Crystal Palace twice came from behind at the Emirates,Arsenal 2-2 Crystal Palace: Premier League – as it happened,Rob Smyth,theguardian.com,3372,2025-04-23T21:37:54Z,https://www.theguardian.com/football/live/2025/apr/23/arsenal-v-crystal-palace-premier-league-live-liverpool,Football,liveblog
8,Arsenal,Arsenal 1-0 Chelsea: Premier League – as it happened,Arsenal secured their first league win in over a month thanks to a first-half Mikel Merino header,Arsenal 1-0 Chelsea: Premier League – as it happened,Daniel Harris,theguardian.com,4251,2025-03-16T16:33:14Z,https://www.theguardian.com/football/live/2025/mar/16/arsenal-v-chelsea-premier-league-updates-live,Football,liveblog
9,Arsenal,Manchester United 1-1 Arsenal: Premier League – as it happened,Bruno Fernandes’ free-kick was levelled by Declan Rice but Arsenal failed to keep their title hopes breathing,Manchester United 1-1 Arsenal: Premier League – as it happened,John Brewin,theguardian.com,4346,2025-03-09T19:11:32Z,https://www.theguardian.com/football/live/2025/mar/09/manchester-united-v-arsenal-premier-league-live,Football,liveblog
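Some `description` values in the output above still carry HTML markup such as `<strong>`, and `published_at` is an ISO 8601 string, not a datetime. A minimal cleaning sketch follows, assuming `wordcount` arrives as a string of digits. `clean_article` is a hypothetical helper, the regular expression is a simple tag stripper and not a full HTML parser, and the sample row is shortened from the output.

```python
import html
import re
from datetime import datetime, timezone

TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_article(article_info):
    """Return a copy with markup removed and typed wordcount/published_at (hypothetical helper)."""
    cleaned = dict(article_info)

    # Drop tags, decode entities such as &amp;, and collapse repeated whitespace
    text = TAG_PATTERN.sub("", cleaned.get("description", ""))
    cleaned["description"] = " ".join(html.unescape(text).split())

    # Convert the word count to an int, or None when it is missing or malformed
    raw_count = str(cleaned.get("wordcount", "")).strip()
    cleaned["wordcount"] = int(raw_count) if raw_count.isdigit() else None

    # Parse timestamps like 2025-04-12T19:09:45Z into timezone-aware datetimes
    raw_date = cleaned.get("published_at", "")
    if raw_date:
        parsed = datetime.strptime(raw_date, "%Y-%m-%dT%H:%M:%SZ")
        cleaned["published_at"] = parsed.replace(tzinfo=timezone.utc)
    else:
        cleaned["published_at"] = None
    return cleaned


sample = {
    "team": "Arsenal",
    "description": "<strong>Minute-by-minute report: </strong>Yoane Wissa earned Brentford a draw",
    "wordcount": "3546",
    "published_at": "2025-04-12T19:09:45Z",
}
cleaned = clean_article(sample)
print(cleaned["description"])
```

Applied to every dictionary before `pd.DataFrame(articles_data)` is built, a step like this would give numeric and datetime columns that are easier to plot.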
