# Corpus collection code & Corpus + explanation

We extracted the Film articles published between 2021-01-01 and 2021-12-31 from [The Guardian Open Platform](https://open-platform.theguardian.com).

## Instructions
- Run the code chunks below to obtain the corpus.tsv file, which contains 500 Film articles of fewer than 1000 tokens each, retrieved from the Guardian API.
- If no file is created after running the code, the API key has most likely reached its request limit. Register for a developer key [here](https://open-platform.theguardian.com/access/) and replace the value of the **MY_API_KEY** variable with your key.
- Stop and restart: the request progress and the fetched data are saved to disk, so the run can be interrupted and restarted without losing anything.
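The stop-and-restart behaviour can be sketched independently of the API. The helper below is hypothetical (`resume_start` does not appear in the notebook) and assumes the tracker file holds one day offset per line, counted from 2021-01-01. It resumes as soon as a single offset is recorded, which is a simplification of the collection cell.

```python
from datetime import date, timedelta

FIRST_DAY = date(2021, 1, 1)

def resume_start(tracker_lines, first_day=FIRST_DAY):
    """Return (offset, start_date) for the next day to request.

    tracker_lines: lines read from the tracker file, one day offset per line.
    An empty tracker means a fresh start at first_day.
    """
    offsets = [int(line) for line in tracker_lines if line.strip()]
    if not offsets:
        return 0, first_day
    next_offset = max(offsets) + 1
    return next_offset, first_day + timedelta(days=next_offset)

print(resume_start([]))                      # fresh start
print(resume_start(["0\n", "1\n", "2\n"]))   # three days already recorded
```

With three recorded days (offsets 0 to 2), the next request starts at offset 3, that is 2021-01-04.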

## 1. Corpus collection code
This code chunk retrieves Film articles by using Python Requests to make an HTTP request to the Guardian API for every date between the start and end dates (2021-01-01 to 2021-12-31).    
The run can be interrupted and restarted without losing the request progress or the data. The progress is saved in *progress_tracker.txt*, and the raw fetched data is saved in *progress_file.tsv*.
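The per-day, per-page request pattern can be illustrated without network access. This is a minimal sketch: `daterange`, `collect_day` and `fake_fetch` are hypothetical names, and the stub only mimics the two response keys the collection cell reads (`response.pages` and `response.results`).

```python
from datetime import date, timedelta

def daterange(start, end):
    """Yield every date from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)

def collect_day(day, fetch_page):
    """Gather all results for one day, following the reported page count."""
    results = []
    page, num_pages = 1, 1
    while page <= num_pages:
        data = fetch_page(day.isoformat(), page)
        results.extend(data["response"]["results"])
        # the first response tells us how many pages exist for this day
        num_pages = data["response"]["pages"]
        page += 1
    return results

def fake_fetch(datestr, page):
    """Stand-in for the HTTP call: two pages per day, one result per page."""
    return {"response": {"pages": 2, "results": [f"{datestr}-p{page}"]}}

for day in daterange(date(2021, 1, 1), date(2021, 1, 2)):
    print(collect_day(day, fake_fetch))
```

Passing the fetch function as an argument keeps the paging logic testable; in the notebook the same role is played by the `requests.get` call.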

In [1]:
import json
import requests
from datetime import date, timedelta
import time
import pandas as pd
from nltk.tokenize import word_tokenize
import pickle
import os

# Guardian api
# apply for your key here: https://open-platform.theguardian.com/access/
MY_API_KEY = '1956fa1c-fa95-4b0b-a5a8-0436216ab868'

# I followed this example:
# code modified from https://gist.github.com/dannguyen/c9cb220093ee4c12b840

API_ENDPOINT = 'http://content.guardianapis.com/search'
my_params = {
    'from-date': "",
    'to-date': "",
    'order-by': "newest",
    'show-fields': 'all',
    'section':'film',
    'page-size': 200,
    'api-key': MY_API_KEY
}


# check existence of tracker file, create if necessary
if not os.path.exists("progress_tracker.txt"):
    open("progress_tracker.txt", "w").close()
    max_date = 0
    start_date = date(2021,1,1)
else:
    with open("progress_tracker.txt", "r") as f:
        read_line = f.readlines()
    if len(read_line) > 1:
        print('Continued from interrupt')
        # resume from the day after the highest recorded day offset
        max_date = max([int(i.strip()) for i in read_line])+1
        start_date = date(2021,1,1) + timedelta(days=max_date)
    else:
        max_date = 0
        start_date = date(2021,1,1)
    

end_date = date(2021,12,31)
all_days = range((end_date - start_date).days + 1)
all_results = []
with open("progress_tracker.txt", "a") as f:
    for day_count in all_days:
        dt = start_date + timedelta(days=day_count)
        print(dt)
        datestr = dt.strftime('%Y-%m-%d')
        my_params['from-date'] = datestr
        my_params['to-date'] = datestr
        page_count = 1
        num_pages = 1
        while page_count <= num_pages:
            my_params['page'] = page_count
            resp = requests.get(API_ENDPOINT, params=my_params)
            data = resp.json()
            # [topic, text] rows for this page
            rows = [[i["sectionName"], i['fields']['bodyText']] for i in data['response']['results']]
            all_results.extend(rows)
            # save progress: append this page to the raw data file
            pd.DataFrame(rows).to_csv('progress_file.tsv', sep="\t",
                                      mode='a', index=False, header=False)
            # if there is more than one page, request the other pages too
            page_count += 1
            num_pages = data['response']['pages']
        # record the day only after all of its pages are saved,
        # so an interrupted day is requested again on restart
        f.write(str(day_count+max_date)+'\n')
        f.flush()
        if dt == end_date:
            # collection finished: clear the tracker
            f.truncate(0)
        time.sleep(1)
       

2021-01-01
2021-01-02
2021-01-03
2021-01-04
2021-01-05
2021-01-06
2021-01-07
2021-01-08
2021-01-09
2021-01-10
2021-01-11
2021-01-12
2021-01-13
2021-01-14
2021-01-15
2021-01-16
2021-01-17
2021-01-18
2021-01-19
2021-01-20
2021-01-21
2021-01-22
2021-01-23
2021-01-24
2021-01-25
2021-01-26
2021-01-27
2021-01-28
2021-01-29
2021-01-30
2021-01-31
2021-02-01
2021-02-02
2021-02-03
2021-02-04
2021-02-05
2021-02-06
2021-02-07
2021-02-08
2021-02-09
2021-02-10
2021-02-11
2021-02-12
2021-02-13
2021-02-14
2021-02-15
2021-02-16
2021-02-17
2021-02-18
2021-02-19
2021-02-20
2021-02-21
2021-02-22
2021-02-23
2021-02-24
2021-02-25
2021-02-26
2021-02-27
2021-02-28
2021-03-01
2021-03-02
2021-03-03
2021-03-04
2021-03-05
2021-03-06
2021-03-07
2021-03-08
2021-03-09
2021-03-10
2021-03-11
2021-03-12
2021-03-13
2021-03-14
2021-03-15
2021-03-16
2021-03-17
2021-03-18
2021-03-19
2021-03-20
2021-03-21
2021-03-22
2021-03-23
2021-03-24
2021-03-25
2021-03-26
2021-03-27
2021-03-28
2021-03-29
2021-03-30
2021-03-31
2021-04-01

The code chunk below pre-processes the data. We tokenize each text with `word_tokenize` from NLTK and keep only the articles with fewer than 1000 tokens (the count includes punctuation tokens).
We then store the articles and their token counts (Length) in a pandas DataFrame, remove duplicate rows if there are any, and keep only the first 500 articles.
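The selection rule can be sketched with the standard library alone. `select_articles` is a hypothetical helper, and `str.split` is only a stand-in tokenizer: NLTK's `word_tokenize` also separates punctuation, so real token counts will be higher than whitespace counts.

```python
def select_articles(texts, max_tokens=1000, limit=500, tokenize=str.split):
    """Keep texts with fewer than max_tokens tokens, drop exact duplicates,
    and stop once `limit` articles have been collected."""
    seen = set()
    selected = []
    for text in texts:
        n_tokens = len(tokenize(text))
        if n_tokens >= max_tokens or text in seen:
            continue
        seen.add(text)
        selected.append((text, n_tokens))
        if len(selected) == limit:
            break
    return selected

sample = ["a short review", "a short review", "word " * 1200, "another piece of text"]
print(select_articles(sample, limit=2))
```

Here the duplicate and the 1200-token text are skipped, leaving two articles with 3 and 4 tokens.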

In [2]:
# keep only the articles with fewer than 1000 tokens
# convert the DataFrame to a list so that each text can be tokenized
all_results = pd.read_csv('progress_file.tsv', sep = '\t', header=None).values.tolist()
final_results = []
for topic, text in all_results:
    tokenized = word_tokenize(text)
    if len(tokenized) < 1000: # keep articles with fewer than 1000 tokens only
        final_results.append([text, len(tokenized)])

# convert to a DataFrame holding each text and its token count
final_df = pd.DataFrame(final_results, columns = ['Text', 'Length'])
# drop duplicates, if necessary
final_df = final_df.drop_duplicates()
# only keep the first 500 articles
final_df = final_df.head(500)
final_df

| | Text | Length |
|---|---|---|
| 0 | In print, DC Comics was the trailblazer for su... | 774 |
| 1 | Desperately Seeking Susan Out with the old, in... | 701 |
| 2 | It’s the “last day in paradise” for Las Vegas ... | 286 |
| 3 | Joan Micklin Silver, the American film-maker b... | 585 |
| 4 | Sylvie (Tessa Thompson) has been taught by her... | 302 |
| ... | ... | ... |
| 495 | A sequel to the successful spin-off film from ... | 186 |
| 496 | The rescheduled 2021 edition of the Cannes fil... | 403 |
| 497 | In Morocco, sex outside marriage is punishable... | 395 |
| 498 | Were there an Oscar for best on-screen drunk (... | 555 |


Assertions to check for duplicates and document count.

In [3]:
# assert that there are no duplicate texts
assert final_df['Text'].nunique() == len(final_df)
print('There are no duplicates!') 
# assert that there are exactly 500 articles
assert len(final_df) == 500
print('We have 500 articles!') 

There are no duplicates!
We have 500 articles!


Finally, we save our data to the *corpus.tsv* file.
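Article bodies may contain tabs or line breaks, so it is worth confirming that a tab-separated file survives a round trip. The sketch below uses the standard `csv` module with an in-memory buffer rather than pandas, and the two rows are made-up examples. It assumes minimal quoting, which is also the default for pandas `to_csv` as far as we are aware.

```python
import csv
import io

# made-up rows: (text, token count); the first text contains a tab and a newline
rows = [("A review with a\ttab and a\nline break", 8), ("Plain text", 2)]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["Text", "Length"])
writer.writerows(rows)

# read the same content back and rebuild the rows
buffer.seek(0)
reader = csv.reader(buffer, delimiter="\t")
header = next(reader)
restored = [(text, int(length)) for text, length in reader]
print(restored == rows)
```

Fields containing the delimiter or a line break are quoted on write, which is why the reader can recover them unchanged.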

In [4]:
# save our final data in tsv file
final_df.to_csv('corpus.tsv', sep="\t", index = False)

## 2. Corpus + explanation code
Please see corpus_readme.md for detailed explanation.

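We calculate text statistics: the total and average article length in tokens, the total number of distinct tokens in the corpus, and that total divided by the number of articles. Note that the last figure is the corpus vocabulary size per article, not the mean number of distinct tokens within each article.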
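To make the statistics concrete, here is a small sketch on two toy texts. `corpus_stats` is a hypothetical helper and `str.split` stands in for NLTK's `word_tokenize`. It reports both the corpus vocabulary divided by the number of articles and the mean number of distinct tokens per article, which are different quantities.

```python
def corpus_stats(texts, tokenize=str.split):
    """Compute token totals and two different 'distinct token' averages."""
    vocab = set()
    total_tokens = 0
    per_article_distinct = []
    for text in texts:
        tokens = tokenize(text)
        total_tokens += len(tokens)
        vocab.update(tokens)
        per_article_distinct.append(len(set(tokens)))
    n = len(texts)
    return {
        "total_tokens": total_tokens,
        "distinct_tokens": len(vocab),
        "avg_tokens": total_tokens / n,
        # corpus vocabulary size divided by the number of articles
        "distinct_over_articles": len(vocab) / n,
        # mean number of distinct tokens within each article
        "avg_distinct_per_article": sum(per_article_distinct) / n,
    }

print(corpus_stats(["the film is the film", "the sequel is better"]))
```

On this toy input the vocabulary has 5 tokens, so the first average is 2.5, while the articles contain 3 and 4 distinct tokens, giving a per-article mean of 3.5.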

In [5]:
final_result = final_df.values.tolist()
total_text = set()
total_length = 0
for text, word_count in final_result:
    tokenized = word_tokenize(text)
    total_text.update(set(tokenized))
    total_length += len(tokenized)

In [6]:
print('The total text length:', total_length)
print('The total distinct number of tokens:', len(total_text))
print('Average article length in words:', round(total_length/len(final_df), 2))
print('Average article length in distinct tokens:', round(len(total_text)/len(final_df), 2))

The total text length: 269519
The total distinct number of tokens: 28051
Average article length in words: 539.04
Average article length in distinct tokens: 56.1
