# Phase III: First ML Proof of Concept (5\%)

### Team Names:
- Will
- Kinsey Bellerose
- Zaid
- Ved

Two datasets were collected: one of individual Premier League match results and one of news articles about each team, retrieved from The Guardian API. The goal was to test whether the number of articles written about a team before a game is associated with the team's performance in that game.

In [1]:
import requests
import pandas as pd
from datetime import datetime
import time

## News Article Scraping

We used The Guardian API to search for news articles about each Premier League team within a given date range.

In [2]:
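import os

season_start = '2024-08-16'
season_end = '2025-05-25'
teams = ['Arsenal', 'Manchester City', 'Manchester United', 'Chelsea', 'Liverpool']
# Read the key from the environment instead of hard-coding it in the notebook
# (GUARDIAN_API_KEY is an assumed variable name)
API_KEY = os.environ.get("GUARDIAN_API_KEY", "")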

In [3]:
def get_article_data(article, team):
    """ Format specific article json response
    Args:
        article(dictionary): Raw article data from Guardian API
        team(str): Premier League team name for this article
    Returns: 
        dictionary: dictionary containing article information for this team; includes team, title, description, headline, 
                    author, wordcount, published_at, url, section, content_type
    """
    fields = article.get('fields', {})
    article_info = {
        'team': team,
        'title': article.get('webTitle', ''),
        'description': fields.get('trailText', ''),
        'headline': fields.get('headline', ''),
        'author': fields.get('byline', ''),
        'wordcount': fields.get('wordcount', ''),
        'published_at': article.get('webPublicationDate', ''),
        'url': article.get('webUrl', ''),
        'section': article.get('sectionName', ''),
        'content_type': article.get('type', ''),
    }
    return article_info

In [4]:
def get_guardian_articles(team, from_date, to_date, api_key, page_size):
    """
    Retrieves articles from The Guardian for a specific team between a specific date range
        Args:
            team(str): Premier League team to search for such as 'Arsenal' or 'Manchester City'
            from_date(str): start date in YYYY-MM-DD format
            to_date(str): end date in YYYY-MM-DD format
            api_key(str): Guardian API key
            page_size(int): Number of articles requested per call (only the first page of results is read)

        Returns: 
            Dataframe: Dataframe containing article information for this team.
                       Includes team, title, description, headline, author, wordcount, 
                       published_at, url, section, content_type
    """
    url = "https://content.guardianapis.com/search"
    params = {
        'q': f'"{team}" Premier League',
        'from-date': from_date,
        'to-date': to_date,
        'api-key': api_key,
        'page-size': page_size,
        'show-fields': 'headline,trailText,byline,wordcount',
        'show-tags': 'keyword',
    }
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # surface HTTP errors (bad key, rate limit) early
    data = response.json()

    # 'results' may be missing from an error payload, so fall back to an empty list
    results = data.get('response', {}).get('results', [])
    articles_data = [get_article_data(article, team) for article in results]
    return pd.DataFrame(articles_data)
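
`get_guardian_articles` reads a single page of results, so a season of coverage for one team may be truncated at `page_size` articles. The sketch below shows one way the remaining pages could be collected. It assumes the search response reports a total page count under `pages` and accepts a `page` query parameter; the helper names `fetch_all_pages` and `make_guardian_fetcher` are hypothetical and not part of the original notebook.

```python
def fetch_all_pages(fetch_page, max_pages=10):
    """Collect 'results' from successive pages until the API reports no more.

    fetch_page(page_number) must return the parsed JSON of one search call.
    max_pages is a safety cap so a large query cannot exhaust the rate limit.
    """
    results = []
    page = 1
    while page <= max_pages:
        body = fetch_page(page).get('response', {})
        results.extend(body.get('results', []))
        # 'pages' is assumed to be the total page count for this query
        if page >= body.get('pages', 1):
            break
        page += 1
    return results


def make_guardian_fetcher(params, url="https://content.guardianapis.com/search"):
    """Build a fetch_page callable for the Guardian search endpoint (hypothetical helper)."""
    import requests  # imported lazily so the paging logic can be tried without it

    def fetch_page(page):
        response = requests.get(url, params={**params, 'page': page}, timeout=30)
        response.raise_for_status()
        return response.json()

    return fetch_page
```

Passing the same `params` dictionary built in `get_guardian_articles` to `make_guardian_fetcher` and feeding each result through `get_article_data` would keep the output columns unchanged.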

In [5]:
def get_all_teams_guardian_news(api_key, season_start, season_end, page_size):
    """Retrieve news data for five Premier League teams for a single season
    Args:
        api_key(str): Guardian API key
        season_start(str): start date in YYYY-MM-DD format
        season_end(str): end date in YYYY-MM-DD format
        page_size(int): Number of articles to retrieve per team
    Returns: 
        Dataframe with all article data for all teams
    
    """
    teams = ['Arsenal', 'Manchester City', 'Manchester United', 'Chelsea', 'Liverpool']
    all_articles = []
    
    for team in teams:
        team_df = get_guardian_articles(
            team=team,
            from_date=season_start,
            to_date=season_end, 
            api_key=api_key,
            page_size=page_size
        )
        all_articles.append(team_df)
    
    return pd.concat(all_articles, ignore_index=True)

In [6]:
# Data for the 2024-25 season
all_news_df = get_all_teams_guardian_news(
    api_key=API_KEY,
    season_start=season_start,
    season_end=season_end,
    page_size=100
)
all_news_df.head()

| | team | title | description | headline | author | wordcount | published_at | url | section | content_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arsenal | Arsenal 1-0 Newcastle: Premier League – as it ... | Declan Rice’s screamer means one point separat... | Arsenal 1-0 Newcastle: Premier League – as it ... | Rob Smyth | 3651 | 2025-05-18T17:48:34Z | https://www.theguardian.com/football/live/2025... | Football | liveblog |
| 1 | Arsenal | Ipswich 0-4 Arsenal: Premier League – live rea... | Arsenal delayed Liverpool’s title party and pu... | Ipswich 0-4 Arsenal: Premier League – live rea... | Niall McVeigh | 2958 | 2025-04-20T15:23:39Z | https://www.theguardian.com/football/live/2025... | Football | liveblog |
| 2 | Arsenal | Liverpool 2-2 Arsenal: Premier League – as it ... | Arsenal came from two down to earn a 2-2 draw ... | Liverpool 2-2 Arsenal: Premier League – as it ... | Dominic Booth | 4427 | 2025-05-11T17:56:58Z | https://www.theguardian.com/football/live/2025... | Football | liveblog |
| 3 | Arsenal | Arsenal v Brentford: Premier League – as it ha... | <strong>Minute-by-minute report: </strong>Yoan... | Arsenal v Brentford: Premier League – as it ha... | Barry Glendenning | 3546 | 2025-04-12T19:09:45Z | https://www.theguardian.com/football/live/2025... | Football | liveblog |
| 4 | Arsenal | Arsenal 1-2 Bournemouth: Premier League – as i... | <strong>Minute-by-minute report:</strong> Ando... | Arsenal 1-2 Bournemouth: Premier League – as i... | Scott Murray | 4331 | 2025-05-03T19:22:14Z | https://www.theguardian.com/football/live/2025... | Football | liveblog |
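
To relate these articles to match results, each game needs a count of the team's articles published shortly before it. The following is a minimal sketch of that feature, not the notebook's actual pipeline. It assumes a match table with `team` and `match_date` columns (hypothetical names, since the game dataset is not shown here) and a seven-day look-back window chosen only for illustration.

```python
import pandas as pd


def count_articles_before_matches(news_df, matches_df, window_days=7):
    """For each match, count the team's articles published in the
    window_days before the match date (the match day itself is excluded)."""
    news = news_df.copy()
    # Guardian timestamps look like '2025-05-18T17:48:34Z'; parse them as UTC
    news['published_at'] = pd.to_datetime(news['published_at'], utc=True)

    matches = matches_df.copy()
    # 'team' and 'match_date' are assumed column names for the game dataset
    matches['match_date'] = pd.to_datetime(matches['match_date'], utc=True)

    counts = []
    for _, match in matches.iterrows():
        window_start = match['match_date'] - pd.Timedelta(days=window_days)
        in_window = (
            (news['team'] == match['team'])
            & (news['published_at'] >= window_start)
            & (news['published_at'] < match['match_date'])
        )
        counts.append(int(in_window.sum()))

    matches['articles_before'] = counts
    return matches
```

Excluding the match day keeps live blogs and match reports, which describe the game rather than precede it, out of the count.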


## Individual Game Scraping

# Phase III: First ML Proof of Concept (5\%)

### When: November 23

Each **project group** will submit a single **jupyter notebook** which contains:

1. (2%) The implementation (using NumPy) of your first ML model as a function call to the cleaned data
2. (3%) A discussion of the preliminary results:
   - This may include checking of assumptions, generated plots/tables, measures of fit, or other attributes of the analysis
   - It does not have to be fully correct, but as a proof of concept must demonstrate that the group is close to completing the analysis
   - You **must** discuss some of the potential ethical considerations (or explain why there aren't any) for your project

### Attempt 1 - Linear Regression

In [7]:
# Function to fit a linear regression model and check assumptions with diagnostic plots
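
The cell above is still a placeholder. As a rough illustration of what a NumPy-only fit could look like, the sketch below solves the least-squares problem with `np.linalg.lstsq` and returns the quantities usually needed for assumption checks. The function name `fit_linear_regression` and the returned dictionary keys are assumptions, not the group's final implementation.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Least-squares fit of y on X; a bias column is added here.

    Returns coefficients (intercept first), fitted values, residuals,
    mean squared error and R^2.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)  # treat a 1-D input as a single feature
    X_design = np.column_stack([np.ones(len(X)), X])

    # lstsq minimises ||X_design @ beta - y||^2 and tolerates rank deficiency
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

    fitted = X_design @ beta
    residuals = y - fitted
    mse = float(np.mean(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - float(np.sum(residuals ** 2)) / ss_tot if ss_tot > 0 else float('nan')
    return {'beta': beta, 'fitted': fitted, 'residuals': residuals,
            'mse': mse, 'r2': r2}
```

The returned `fitted` and `residuals` arrays are what a residuals-versus-fitted plot or a residual histogram would be drawn from when checking linearity, constant variance and normality.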

In [8]:
# Show function with 3 different features

### Attempt 2 - Polynomial Regression
#### Creating the Design Matrix

In [10]:
from sklearn.preprocessing import PolynomialFeatures
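
For a single feature, the polynomial design matrix can also be built directly in NumPy. The sketch below uses `np.vander` with increasing powers, so the first column is the bias term; for one feature this should produce the same columns as `PolynomialFeatures(degree=degree)` with its default bias column. The function name `polynomial_design_matrix` is hypothetical.

```python
import numpy as np


def polynomial_design_matrix(x, degree):
    """Return the matrix with columns [1, x, x^2, ..., x^degree] for one feature."""
    x = np.asarray(x, dtype=float).ravel()
    # increasing=True orders the columns from x^0 (bias) up to x^degree
    return np.vander(x, N=degree + 1, increasing=True)
```

For example, `polynomial_design_matrix([1, 2, 3], 2)` gives rows `[1, 1, 1]`, `[1, 2, 4]` and `[1, 3, 9]`.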


#### Cross Validation

In [None]:
# add bias column
# line of best fit
# linreg predict to find MSE and R^2 values
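
The steps listed in the cell above (add a bias column, fit the line of best fit, predict to obtain MSE and R^2) could be combined into a k-fold cross-validation loop along the following lines. This is only a sketch for one feature; the function name `kfold_cv_polynomial`, the default of five folds and the fixed seed are assumptions.

```python
import numpy as np


def kfold_cv_polynomial(x, y, degree, k=5, seed=0):
    """Average held-out MSE and R^2 of a polynomial fit over k shuffled folds."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)

    mses, r2s = [], []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

        # the bias column is the zeroth power in the Vandermonde matrix
        X_train = np.vander(x[train_idx], degree + 1, increasing=True)
        X_test = np.vander(x[test_idx], degree + 1, increasing=True)

        # line (or curve) of best fit on the training folds only
        beta, *_ = np.linalg.lstsq(X_train, y[train_idx], rcond=None)

        # predict on the held-out fold and score it
        resid = y[test_idx] - X_test @ beta
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        mses.append(np.mean(resid ** 2))
        r2s.append(1 - np.sum(resid ** 2) / ss_tot if ss_tot > 0 else np.nan)

    return float(np.mean(mses)), float(np.nanmean(r2s))
```

Comparing the returned MSE across several values of `degree` is one way to choose a polynomial order without rewarding overfitting to the training folds.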