# Corpus collection POC (proof-of-concept)

This section documents the corpus collection process.
We currently extract news articles published between 2022-01-01 and 2022-02-15 in the Politics, Film, Technology, Business and Science sections, using [The Guardian Open Platform](https://open-platform.theguardian.com).


## Instructions

- Run the code cells below to produce the `corpus.tsv` file, which contains each article's category and content.
- If no file is created after running the code, the API key has most likely reached its usage limit. In that case, register for a developer key [here](https://open-platform.theguardian.com/access/) and replace the value of the **MY_API_KEY** variable with your key.
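
Since the key may need to be swapped, one option is to read it from the environment rather than editing the notebook. The following is a minimal sketch, not part of the original notebook: the variable name `GUARDIAN_API_KEY` and the helper `load_api_key` are hypothetical names chosen for illustration.

```python
import os

# Hypothetical environment variable name (an assumption, not from the notebook).
ENV_VAR = "GUARDIAN_API_KEY"


def load_api_key(env=None, env_var=ENV_VAR):
    """Return the API key stored under `env_var`, or raise a clear error.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    key = env.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Guardian developer key."
        )
    return key
```

With this in place, the key assignment could become `MY_API_KEY = load_api_key()`, and a missing key fails with an explicit message instead of silently producing no file.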

In [1]:
# imports
import time
from datetime import date, timedelta

import pandas as pd
import requests

# Guardian API
# Apply for your own key here: https://open-platform.theguardian.com/access/
MY_API_KEY = '1956fa1c-fa95-4b0b-a5a8-0436216ab868'

# Code adapted from https://gist.github.com/dannguyen/c9cb220093ee4c12b840

API_ENDPOINT = 'https://content.guardianapis.com/search'
my_params = {
    'from-date': "",
    'to-date': "",
    'order-by': "newest",
    'show-fields': 'all',
    'page-size': 200,
    'api-key': MY_API_KEY
}

topics = ['Politics', 'Film', 'Technology', 'Business', 'Science']


start_date = date(2022, 1, 1)
end_date = date(2022, 2, 15)
all_days = range((end_date - start_date).days + 1)
all_results = []
for day_count in all_days:
    print('day ', day_count)
    dt = start_date + timedelta(days=day_count)
    datestr = dt.strftime('%Y-%m-%d')
    # query one day at a time
    my_params['from-date'] = datestr
    my_params['to-date'] = datestr
    page_count = 1
    num_pages = 1  # updated from the first response
    while page_count <= num_pages:
        print("...page", page_count)
        my_params['page'] = page_count
        resp = requests.get(API_ENDPOINT, params=my_params)
        # fail loudly on HTTP errors (e.g. a rejected or exhausted key)
        resp.raise_for_status()
        data = resp.json()
        # collect [topic, text] pairs; the text defaults to '' if bodyText is missing
        all_results.extend(
            [item['sectionName'], item.get('fields', {}).get('bodyText', '')]
            for item in data['response']['results']
        )
        # the response reports the total number of pages for this day
        num_pages = data['response']['pages']
        page_count += 1
    # pause between days to avoid sending requests too quickly
    time.sleep(2)

day  0
...page 1
day  1
...page 1
day  2
...page 1
day  3
...page 1
...page 2
day  4
...page 1
...page 2
day  5
...page 1
...page 2
day  6
...page 1
...page 2
day  7
...page 1
day  8
...page 1
day  9
...page 1
...page 2
day  10
...page 1
...page 2
day  11
...page 1
...page 2
day  12
...page 1
...page 2
day  13
...page 1
...page 2
day  14
...page 1
day  15
...page 1
day  16
...page 1
...page 2
day  17
...page 1
...page 2
day  18
...page 1
...page 2
day  19
...page 1
...page 2
day  20
...page 1
...page 2
day  21
...page 1
day  22
...page 1
...page 2
day  23
...page 1
...page 2
day  24
...page 1
...page 2
day  25
...page 1
...page 2
day  26
...page 1
...page 2
day  27
...page 1
...page 2
day  28
...page 1
day  29
...page 1
day  30
...page 1
...page 2
day  31
...page 1
...page 2
day  32
...page 1
...page 2
day  33
...page 1
...page 2
day  34
...page 1
...page 2
day  35
...page 1
day  36
...page 1
day  37
...page 1
...page 2
day  38
...page 1
...page 2
day  39
...page 1
...page 2
day  40
..
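
The log above shows that some days need a second page of results. The paging pattern behind it (request page 1, read the total page count from the response, keep going until that count is reached) can be isolated and checked offline. This is only a sketch under stated assumptions: `collect_all_pages` and `fake_fetch` are hypothetical helpers, and the fake response mimics only the `response.pages` and `response.results` fields that the notebook reads.

```python
def collect_all_pages(fetch_page):
    """Collect the results from every page of one query.

    `fetch_page(page)` is a hypothetical callable returning the decoded JSON
    body for that page, shaped like
    {"response": {"pages": <int>, "results": [...]}}.
    """
    results = []
    page = 1
    num_pages = 1  # corrected once the first response arrives
    while page <= num_pages:
        response = fetch_page(page)["response"]
        results.extend(response["results"])
        num_pages = response["pages"]
        page += 1
    return results


def fake_fetch(page, items=("a", "b", "c", "d", "e"), page_size=2):
    """Offline stand-in for the API: serves `items` in pages of `page_size`."""
    start = (page - 1) * page_size
    total_pages = -(-len(items) // page_size)  # ceiling division
    return {
        "response": {
            "pages": total_pages,
            "results": list(items[start:start + page_size]),
        }
    }


all_items = collect_all_pages(fake_fetch)
print(all_items)
```

Passing the fetch function in as an argument keeps the loop independent of the network, so it can be exercised without spending API quota.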

In [2]:
# filter the results, keeping only the articles whose section is in our topics
final_results = []
for topic, text in all_results:
    if topic in topics:
        final_results.append([topic, text])

# convert to a pandas DataFrame and sort by topic (optional)
final_df = pd.DataFrame(final_results, columns=['Topic', 'Text'])
final_df = final_df.sort_values('Topic').reset_index(drop=True)
final_df

| | Topic | Text |
|---|---|---|
| 0 | Business | The Bank of England is facing fierce criticism... |
| 1 | Business | Treasury officials have quietly introduced a n... |
| 2 | Business | The UK’s cost of living crisis escalated in De... |
| 3 | Business | The global surge in demand for energy could sp... |
| 4 | Business | Bullying, sexual harassment and racism are com... |
| ... | ... | ... |
| 1549 | Technology | The sharing of some of the most insidious imag... |
| 1550 | Technology | The billionaire entrepreneur Elon Musk’s brain... |
| 1551 | Technology | Fossil fuel companies and firms that work clos... |
| 1552 | Technology | Uber’s food delivery service Uber Eats has tur... |


In [3]:
# print number of documents in each category
final_df['Topic'].value_counts()

Politics      509
Business      500
Film          326
Technology    135
Science        84
Name: Topic, dtype: int64

In [4]:
# save our data in tsv file
final_df.to_csv('corpus.tsv', sep="\t", index = False)
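
Article text may contain tabs, line breaks or quotation marks, so it is worth checking that a tab-separated file survives a round trip. The sketch below is an illustration with made-up rows, not the notebook's data, and it uses the standard `csv` module with an in-memory buffer instead of `corpus.tsv`. To our understanding, `DataFrame.to_csv` applies the same minimal quoting by default, but that is an assumption worth verifying against the pandas documentation.

```python
import csv
import io

# Made-up rows containing the characters most likely to break a TSV file.
rows = [
    ["Business", "Line one\nline two\twith a tab"],
    ["Film", 'A review with "quotes" inside'],
]

# Write: csv.writer quotes any field containing the delimiter,
# a quote character, or a line break.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, delimiter="\t")
writer.writerow(["Topic", "Text"])
writer.writerows(rows)

# Read back and check that no field was split or altered.
buffer.seek(0)
reader = csv.reader(buffer, delimiter="\t")
header = next(reader)
restored = list(reader)
print(restored == rows)
```

The same check can be run against the real file by opening `corpus.tsv` with `newline=""` and comparing the number of rows read with `len(final_df)`.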