This notebook uses the GNews API to collect articles mentioning the athletes named in the `athletes` dataset. The goal is to collect news articles related to athlete endorsements and any other relevant media coverage. 

The following steps were taken:
1. Load athlete data
2. Fetch articles: a function is defined to query the GNews API for articles mentioning each athlete. 
3. Process athletes: athletes are processed in chunks to handle rate limits and manage large datasets. Fetched articles are stored with details like titles, dates, and descriptions.

In [2]:
import time
import requests
import pandas as pd

# get athlete names
athletes = pd.read_csv('../../data/usernames/accounts_final.csv')

# set api key and base url
api_key = '4eff157235e983a5f72db09ead7ecf00'
base_url = 'https://gnews.io/api/v4/search'

In [67]:
def fetch_articles(query, api_key, retries=3, delay=5):
    # build the URL for the API request (athlete name as query)
    url = f"https://gnews.io/api/v4/search?q={query}&from=2020-01-01&to=2021-12-31&lang=en&max=10&apikey={api_key}"
    for attempt in range(retries):
        try:
            response = requests.get(url)        # send request
            response.raise_for_status()         # raise exception if failed
            data = response.json()              # convert to JSON
            return data.get('articles', [])     # return articles
        except (requests.RequestException, ValueError) as e:
            print(f"Error fetching articles for {query}: {e}")
            if attempt < retries - 1:
                time.sleep(delay)               # wait before retrying
    # return empty list if all retries fail
    return []

In [78]:
def process_athletes_in_chunks(athletes_df, api_key, chunk_size=50, max_athletes=1000, output_file='articles.csv'):
    if max_athletes is None:
        max_athletes = len(athletes_df)
    
    all_articles = []

    # process athletes in chunks
    for start in range(0, min(max_athletes, len(athletes_df)), chunk_size):
        end = min(start + chunk_size, len(athletes_df))
        chunk = athletes_df.iloc[start:end]
        print(f"Processing chunk from index {start} to {end}...")
        
        # get and process articles for each athlete in chunk
        for athlete in chunk['name']:
            print(f"Fetching articles for {athlete}...")
            articles = fetch_articles(athlete, api_key)
            # append articles to list
            for article in articles:
                all_articles.append({
                    'athlete': athlete,
                    'title': article['title'],
                    'description': article['description'],
                    'url': article['url'],
                    'publish_date': article['publishedAt'],})
    
    # convert articles to DataFrame and save to CSV
    articles_df = pd.DataFrame(all_articles)
    articles_df.to_csv(output_file, index=False)
    print(f"Articles saved to {output_file}")

In [79]:
# process athletes
process_athletes_in_chunks(athletes, api_key)

Processing chunk from index 0 to 50...
Fetching articles for Cristiano Ronaldo...
Fetching articles for LeBron James...
Fetching articles for Andrés Iniesta...
Fetching articles for Ronaldinho...
Fetching articles for Karim Benzema...
Fetching articles for Kevin Durant...
Fetching articles for Gerard Piqué...
Fetching articles for Luis Suárez...
Fetching articles for Stephen Curry...
Fetching articles for Sergio Agüero...
Fetching articles for Shaquille O'Neal...
Error fetching articles for Shaquille O'Neal: 400 Client Error: Bad Request for url: https://gnews.io/api/v4/search?q=Shaquille%20O'Neal&from=2020-01-01&to=2021-12-31&lang=en&max=10&apikey=4eff157235e983a5f72db09ead7ecf00
Error fetching articles for Shaquille O'Neal: 400 Client Error: Bad Request for url: https://gnews.io/api/v4/search?q=Shaquille%20O'Neal&from=2020-01-01&to=2021-12-31&lang=en&max=10&apikey=4eff157235e983a5f72db09ead7ecf00
Error fetching articles for Shaquille O'Neal: 400 Client Error: Bad Request for url: htt

In [4]:
# view articles
articles = pd.read_csv('../../data/articles/articles.csv')
articles

Unnamed: 0,athlete,title,description,url,publish_date
0,Cristiano Ronaldo,Al Taawoun Vs Al Nassr Preview & Team News- Ex...,Al Nassr will face Al Taawoun in the Saudi Sup...,https://www.essentiallysports.com/soccer-footb...,2024-08-13T15:15:15Z
1,Cristiano Ronaldo,Cristiano Ronaldo has already made his feeling...,Former Manchester United hero Cristiano Ronald...,https://www.manchestereveningnews.co.uk/sport/...,2024-08-13T10:34:15Z
2,Cristiano Ronaldo,We Will Continue Together: Pepe Reacts to Cris...,Portugal defender Pepe recently announced his ...,https://www.deccanchronicle.com/sports/we-will...,2024-08-12T11:42:00Z
3,Cristiano Ronaldo,'I Should've Beaten Messi & Ronaldo to the Bal...,Franck Ribery has a theory as to why Cristiano...,https://www.givemesport.com/i-shouldve-beaten-...,2024-08-11T20:20:43Z
4,Cristiano Ronaldo,Comparing Lionel Messi and Cristiano Ronaldo's...,"One has more goals, one has more assists.",https://www.givemesport.com/lionel-messi-vs-cr...,2024-08-11T19:00:31Z
...,...,...,...,...,...
8061,Brayden Schenn,Brandon Saad scores in OT to help the Blues be...,ST. LOUIS (AP) — Brandon Saad scored 2:09 into...,https://www.sootoday.com/national-sports/brand...,2024-04-02T04:16:24Z
8062,Brayden Schenn,Brandon Saad scores in OT to help the Blues be...,ST. LOUIS (AP) — Brandon Saad scored 2:09 into...,https://www.guelphtoday.com/national-sports/br...,2024-04-02T04:16:24Z
8063,Brayden Schenn,Revisiting the Brayden Schenn Trade,When the Blues acquired Brayden Schenn from th...,https://thehockeywriters.com/flyers-brayden-sc...,2024-03-26T14:44:14Z
8064,Brayden Schenn,Blues Captain Brayden Schenn's Regression This...,With the Blues in a precarious situation this ...,https://thehockeywriters.com/blues-captain-bra...,2024-03-01T19:20:00Z
