In addition to the New York Times corpus, we also wanted to build a corpus from a variety of online news sources, so that our results would not simply reflect how the Times does its reporting. We chose to work with [NewsAPI](newsapi.org), which allows limited querying for free.

In [17]:
import numpy as np
import pandas as pd
import requests
import json
import os
import dotenv
import sys
sys.tracebacklimit = 0 # turn off the error tracebacks
from newspaper import Article
import re
import warnings
warnings.simplefilter("ignore", DeprecationWarning)

In [18]:
dotenv.load_dotenv()
token = os.getenv('token')

In [19]:
# requests inserts the '&' separators itself, so each value holds only the search term or date
myanmardict = {'q':'Rohingya', 'from':'2017-08-24', 'language':'en', 'pageSize':100, 'apiKey': token}

In [20]:
myanmar = requests.get('http://newsapi.org/v2/everything?', params=myanmardict)
myanmar

<Response [200]>
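
Query values passed through `params` are percent-encoded, and the `&` separators between key-value pairs are added automatically, so a separator typed into a value becomes part of the search term. The sketch below illustrates this with the standard library's `urllib.parse.urlencode`, which applies the same form-encoding convention. The parameter values are illustrative only, and no request is sent.

```python
from urllib.parse import urlencode

# Illustrative parameters only (no API key, no network access needed).
with_separator = {'q': 'Rohingya&', 'from': '2017-08-24&'}
without_separator = {'q': 'Rohingya', 'from': '2017-08-24'}

encoded_with = urlencode(with_separator)
encoded_without = urlencode(without_separator)

print(encoded_with)     # the '&' inside each value is escaped as %26
print(encoded_without)  # separators appear only between the key-value pairs
```

A `200` status only confirms that the server accepted the request, so it is worth inspecting the encoded query string when the results look unexpected.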

In [21]:
myanmarjson = json.loads(myanmar.text)

Many of the stories pulled by this request are from different branches of the same news outlet (e.g. Reuters, Reuters UK, Reuters India, etc). In an effort to deduplicate our data, we removed rows with exactly matching titles.

In [22]:
myanmardf = pd.json_normalize(myanmarjson, record_path=['articles'])
# keep only the first article for each exact title
myanmardf = myanmardf.loc[~myanmardf['title'].duplicated()]
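
As a small illustration of exact-title deduplication, the sketch below applies `duplicated()` to a toy dataframe. The outlet names and headlines are made up for the example and do not come from the API response.

```python
import pandas as pd

# Toy data standing in for the API response; outlets and titles are made up.
toy = pd.DataFrame({
    'source': ['Outlet A', 'Outlet A (UK)', 'Outlet B'],
    'title': ['Same headline', 'Same headline', 'Different headline'],
})

# duplicated() marks every repeat of a title after its first occurrence.
mask = toy['title'].duplicated()
deduped = toy.loc[~mask]

print(mask.tolist())
print(deduped['source'].tolist())
```

Only exact matches are removed this way, so a headline that differs by a single character between regional editions would survive this step.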

In [24]:
myanmar_urls = myanmardf['url'].tolist()

In [25]:
len(myanmar_urls)

94

Using the [newspaper3k](https://newspaper.readthedocs.io/en/latest/) package, we extract the full text of each article with an `extractArticles` function. Within the function, we attempt once more to deduplicate the data by checking that the text isn't already stored in the articles dict. Additionally, when building this function, we noticed that some of the URLs returned by the query were missing a / or other punctuation, and hence would raise an exception. To overcome this, the function catches the exception and continues past any broken links.

In [26]:
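def extractArticles(lst, dic):
    """Download each URL in lst and store its whitespace-normalized text in dic."""
    for j in lst:
        try:
            art = Article(j)
            art.download()
            art.parse()
            # collapse runs of whitespace into single spaces
            text = re.sub(r'\s+', ' ', art.text)
            # skip texts that are already stored under another URL
            if text not in dic.values():
                dic[j] = text
        except Exception:
            # skip broken links and failed downloads
            continue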
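
The deduplication check inside `extractArticles` scans every stored text on each iteration, and broken links are skipped without a record. A minimal sketch of the same idea, using a set of already-seen texts and a list of failed URLs, is shown below. `extract_unique` and its `fetch` argument are hypothetical names introduced here so the logic can be run without network access; in this notebook, `fetch` would wrap the `Article` download and parse steps.

```python
import re

def extract_unique(urls, fetch):
    """Return ({url: text}, [failed urls]), keeping only the first copy of each text.

    `fetch` is any callable that takes a URL and returns the article text,
    raising an exception when the link is broken.
    """
    articles, failed, seen = {}, [], set()
    for url in urls:
        try:
            text = re.sub(r'\s+', ' ', fetch(url))
        except Exception:
            failed.append(url)  # remember broken links instead of dropping them silently
            continue
        if text not in seen:
            seen.add(text)
            articles[url] = text
    return articles, failed


# Stand-in for a real download: a lookup table of fake pages.
fake_pages = {'u1': 'same  story', 'u2': 'same story', 'u3': 'other\nstory'}
articles, failed = extract_unique(['u1', 'u2', 'bad', 'u3'], lambda u: fake_pages[u])
print(articles)
print(failed)
```

Keeping the failed URLs makes it possible to count how many articles were lost to broken links, which the original function does not report.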

In [27]:
myanmar_articles = {}
extractArticles(myanmar_urls, myanmar_articles)

In [28]:
myanmar_texts = pd.DataFrame.from_dict(myanmar_articles, orient='index', columns=['text'])

After creating a dataframe of our article texts, we use a regular expression to remove common punctuation (commas, periods, exclamation marks, and question marks) and then convert all text to lower case, for use later.

In [29]:
# remove commas, periods, exclamation marks, and question marks
myanmar_texts['text'] = myanmar_texts['text'].map(lambda x: re.sub(r'[,.!?]', '', x))
# convert to lower case
myanmar_texts['text'] = myanmar_texts['text'].map(lambda x: x.lower())
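
The regular expression above strips only four punctuation marks, so quotation marks, colons, and parentheses remain in the text. If broader cleaning is wanted, one possible alternative (not the approach used in this notebook) is to delete every ASCII punctuation character with `str.translate`. The helper name `clean` and the sample sentence are made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
strip_punct = str.maketrans('', '', string.punctuation)

def clean(text):
    """Remove ASCII punctuation and lower-case the text."""
    return text.translate(strip_punct).lower()

# Made-up sample sentence, not taken from the corpus.
sample = 'Aid groups said: "Conditions are dire!" (Reuters)'
print(clean(sample))
```

This variant also removes apostrophes and hyphens, so a word such as "state-run" becomes "staterun", and it leaves non-ASCII punctuation such as curly quotes in place. Whether that trade-off is acceptable depends on the later analysis.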

Once again, we store the variable for later use in another notebook.

In [31]:
%store myanmar_texts

Stored 'myanmar_texts' (DataFrame)


Now we import the `nyt_urls_df` dataframe from the `NYT.ipynb` notebook, and perform the same steps as above.

In [32]:
%store -r nyt_urls_df

In [33]:
nyt_urls_df.rename({0:'url'}, axis=1, inplace=True)

In [34]:
nyt_urls = nyt_urls_df['url'].tolist()

In [35]:
len(nyt_urls)

1059

In [36]:
nyt_articles = {}
extractArticles(nyt_urls, nyt_articles)

In [37]:
nyt_texts = pd.DataFrame.from_dict(nyt_articles, orient='index', columns=['text'])

In [38]:
# remove commas, periods, exclamation marks, and question marks
nyt_texts['text'] = nyt_texts['text'].map(lambda x: re.sub(r'[,.!?]', '', x))
# convert to lower case
nyt_texts['text'] = nyt_texts['text'].map(lambda x: x.lower())

In [40]:
%store nyt_texts

Stored 'nyt_texts' (DataFrame)
