This notebook loads data from Quotebank and stores the filtered data as JSON files, which are then loaded in the other notebook where all the analysis is done.

The approach has the following structure:

*   From Quotebank, per year, load only the quotes that have at least one URL with a domain containing 'uk' and whose speaker is not 'None'. Save this as quotes-{YEAR}-UK.json.bz2 to Drive.
*   Load quotes-{YEAR}-UK.json.bz2 and filter it again based on our 12 chosen newspapers. Add the matched newspapers as a column. Save as quotes-{YEAR}-UK-filtered.json.bz2.


These two steps are kept separate so that things can be changed later, e.g. changing or adding newspapers, without having to reload everything from Quotebank.
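The two steps share the same read-filter-write shape, so they could be expressed with one streaming helper. The following is a minimal sketch under that assumption; `filter_jsonl_bz2` and `has_speaker` are hypothetical names, not functions used elsewhere in this notebook.

```python
import bz2
import json


def filter_jsonl_bz2(src_path, dst_path, transform):
    """Stream a bz2-compressed JSON-lines file and write the kept records.

    transform(record) returns the (possibly modified) record to keep,
    or None to drop it. Returns the number of records written.
    """
    kept = 0
    with bz2.open(src_path, 'rt', encoding='utf-8') as src, \
         bz2.open(dst_path, 'wt', encoding='utf-8') as dst:
        for line in src:
            record = transform(json.loads(line))
            if record is not None:
                dst.write(json.dumps(record) + '\n')
                kept += 1
    return kept


def has_speaker(record):
    # Example transform: drop quotes whose speaker is the string 'None'.
    return record if record['speaker'] != 'None' else None
```

With a helper like this, changing the newspaper list only means re-running the second step with a different `transform` on the already reduced file.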

Drawbacks and TODOs:

*   Loading only the 'uk' domains may miss UK quotes. E.g. the count for BBC quotes seems suspiciously low. We are missing bbc.com/news/uk, for example (.co.uk seems to be used when in the UK, .com/news/uk when outside? Both are present in the dataset).
*   At the moment none of the 'speaker = None' quotes are used. When focusing on the topics of quotes, these might still contain interesting information.
*   It is possible to load one year into one dataframe, but each yearly dataframe becomes very large. We will look into another approach.
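One possible way to address the first drawback is to look at the URL path in addition to the host. The sketch below is a heuristic, not the filter used in this notebook: `looks_uk` is a hypothetical helper, and treating a `/news/uk` path section on any host as UK coverage is an assumption that would need checking against the data.

```python
import re
from urllib.parse import urlparse

# Assumption: a '/news/uk', '/news/uk-...' or '/news/uk/...' path marks UK
# coverage on a non-.uk host such as bbc.com. '/news/ukraine...' must not match.
UK_PATH = re.compile(r'^/news/uk(?:[-/]|$)')


def looks_uk(url):
    """Return True if the host ends in '.uk' or the path looks like a UK news section."""
    parts = urlparse(url)
    host = parts.hostname or ''
    return host.endswith('.uk') or bool(UK_PATH.match(parts.path))
```

For example, `looks_uk('https://www.bbc.com/news/uk-politics-1')` and `looks_uk('https://www.bbc.co.uk/news/world-1')` are both true, while a bbc.com URL under `/news/world-us-canada-...` is not.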


In [None]:
import bz2
import json
import pandas as pd
import numpy as np

!pip install tld
from tld import get_tld

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
PATH_TO_DATA = '/content/drive/MyDrive/ADAMilestone2/Data/'

Filter the full data set: keep the quotes that have at least one URL with a UK domain and whose speaker is not 'None'. The filtered data is then stored.

In [None]:
YEAR = 2019
path_to_file = PATH_TO_DATA + f'Quotebank/quotes-{YEAR}.json.bz2' 
path_to_out = PATH_TO_DATA + f'quotes-{YEAR}-UK.json.bz2'

def get_domain(url):
    res = get_tld(url, as_object=True)
    return res.tld
    
with bz2.open(path_to_file, 'rb') as s_file:
    with bz2.open(path_to_out, 'wb') as d_file:
        for line in s_file:
            instance = json.loads(line)  # one quote per line
            # skip quotes without an attributed speaker
            if instance['speaker'] != "None":
                # TLD of every URL the quote appeared at, e.g. 'co.uk'
                domains = [get_domain(url) for url in instance['urls']]
                # keep the quote if at least one URL has a TLD containing 'uk'
                if any("uk" in domain for domain in domains):
                    #instance['domains'] = domains
                    d_file.write((json.dumps(instance)+'\n').encode('utf-8')) # writing in the new file

Filter the stored UK data down to the chosen newspapers, then read the result into a dataframe and show a few samples.

In [None]:
chosen_news = ['.thesun.', '.dailymail.', '.metro.', 
               '.thetimes.', '.standard.', '.mirror.', 
               '.telegraph.', '.bbc.', '.independent.',
               '.theguardian.', '.sky.', 'itv.']

In [None]:
YEAR = 2019
path_to_file = PATH_TO_DATA + f'quotes-{YEAR}-UK.json.bz2'
path_to_out = PATH_TO_DATA + f'quotes-{YEAR}-UK-filtered.json.bz2'

def get_domain(url):
    res = get_tld(url, as_object=True)
    return res.tld
    
with bz2.open(path_to_file, 'rb') as s_file:
    with bz2.open(path_to_out, 'wb') as d_file:
        for line in s_file:
            instance = json.loads(line)
            urls = instance['urls']
            newspapers = []
            for pattern in chosen_news:
                # substring match against every URL of the quote
                if any(pattern in url for url in urls):
                    # strip('.') rather than split('.')[1]: 'itv.' has no leading
                    # dot, so split('.')[1] would give an empty name
                    newspapers.append(pattern.strip('.'))
            if len(newspapers) > 0:
                instance['newspapers'] = newspapers
                d_file.write((json.dumps(instance)+'\n').encode('utf-8')) # writing in the new file
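Matching a pattern such as '.bbc.' anywhere in the URL string can also hit the path or query part, or a host that shares the name under a different suffix. A stricter alternative would compare only the hostname, as sketched below. `NEWS_DOMAINS` and `newspapers_for` are hypothetical, and the listed domains are assumptions for illustration that should be verified against the data.

```python
from urllib.parse import urlparse

# Hypothetical mapping from registered domain to newspaper label.
# Only three entries are shown; the domains are assumptions.
NEWS_DOMAINS = {
    'thesun.co.uk': 'thesun',
    'bbc.co.uk': 'bbc',
    'itv.com': 'itv',
}


def newspapers_for(urls):
    """Return the labels of all known newspapers whose host appears in urls."""
    found = []
    for url in urls:
        host = urlparse(url).hostname or ''
        for domain, label in NEWS_DOMAINS.items():
            # match the domain itself or any subdomain of it
            is_match = host == domain or host.endswith('.' + domain)
            if is_match and label not in found:
                found.append(label)
    return found
```

The trade-off is that every outlet needs its full domain spelled out, whereas the substring patterns above are shorter but looser.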

In [None]:
path_to_out

'/content/drive/MyDrive/ADAMilestone2/Data/quotes-2019-UK-filtered.json.bz2'

In [63]:
df = pd.read_json(path_to_out, lines=True, compression='bz2')
print(len(df))
df.sample(5)

517880


| | quoteID | quotation | speaker | qids | date | numOccurrences | probas | urls | phase | newspapers |
|---|---|---|---|---|---|---|---|---|---|---|
| 283170 | 2019-01-04-028730 | I knew nothing about airlines, which I think m... | Herb Kelleher | [Q764247] | 2019-01-04 02:53:09 | 2 | [[Herb Kelleher, 0.7122], [None, 0.2878]] | [http://news.bbc.co.uk/news/world-us-canada-46... | E | [bbc] |
| 384851 | 2019-12-04-045289 | I'm excited by actors that I don't recognise. ... | Nick Cave | [Q192668, Q24218] | 2019-12-04 10:24:51 | 1 | [[Nick Cave, 0.3889], [None, 0.3814], [Lydia W... | [https://www.standard.co.uk/go/london/theatre/... | E | [standard] |
| 334313 | 2019-04-29-056728 | None of that matters because everyone is in su... | CHRIS Evans | [Q17490263, Q178348, Q21538587, Q23418578, Q27... | 2019-04-29 08:31:04 | 1 | [[CHRIS Evans, 0.8749], [None, 0.1068], [Paul ... | [https://www.thesun.co.uk/tvandshowbiz/8961022... | E | [thesun] |
| 47281 | 2019-03-24-027729 | It's nice to have my own company sometimes. I ... | Dani Dyer | [Q22004146] | 2019-03-24 00:01:05 | 4 | [[Dani Dyer, 0.6431], [None, 0.3569]] | [https://www.thesun.co.uk/fabulous/8675813/dan... | E | [thesun, mirror] |
| 242660 | 2019-02-05-083676 | The combination of our state-of-the-art F-35s ... | Gavin Williamson | [Q262409] | 2019-02-05 17:12:51 | 5 | [[Gavin Williamson, 0.7616], [None, 0.1501], [... | [https://www.pressandjournal.co.uk/news/uk/167... | E | [standard] |
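Since `newspapers` holds a list per quote (one sampled row carries both thesun and mirror), per-outlet counts need the lists flattened. The sketch below streams the filtered file and counts labels without building a dataframe, which may also help with the memory concern for full years. `count_newspapers` is a hypothetical helper.

```python
import bz2
import json
from collections import Counter


def count_newspapers(path):
    """Count how many quotes mention each newspaper label in a filtered file."""
    counts = Counter()
    with bz2.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            # a quote with several labels contributes one count to each
            counts.update(json.loads(line).get('newspapers', []))
    return counts
```

Calling `count_newspapers(path_to_out).most_common()` would list the outlets from most to least quoted, which is one way to check whether the BBC count really is unusually low.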
