# ADA Project: Define the political orientation of the NYTimes newspaper
## Notebook 1: Loading and selecting the data

In this first notebook, we implement all the steps needed to create the final dataset for our project from the given Quotebank. Because these steps take a long time to run (a few hours), we put them in a separate preliminary notebook that only needs to be run once.

In the following code, we read the full Quotebank files for 2015-2020. As it is a very large dataset, we keep only the quotes coming from the New York Times. We then store them in compressed JSON files and use these reduced datasets for our actual project (see `project_pt2_analyses.ipynb`).


We also perform part of the quotation cleaning in this notebook, as it also takes a very long time to run. This preprocessing consists of tokenizing and lemmatizing the quotes: each quote is split into individual words (i.e. tokens), and each word is reduced to its base form (lemmatization). For example, laugh, laughs, laughing and laughed would all be reduced to laugh. This simplifies the analysis by reducing the number of unique words. Both techniques are built into the spaCy package, which is used in the `text_tokenizer` function.
The tokenized version of the quotations is stored in another compressed JSON file and will be added to our dataset as an additional column in the second notebook (see `project_pt2_analyses.ipynb`).
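
To make the effect on vocabulary size concrete, here is a toy sketch. It does not use spaCy: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the hand-written lookup table only stands in for a real lemmatizer.

```python
# Toy illustration only: a hand-written lemma table, not spaCy's lemmatizer.
TOY_LEMMAS = {"laughs": "laugh", "laughing": "laugh", "laughed": "laugh"}

def toy_lemmatize(text):
    """Lowercase, split on whitespace, and map each token to its base form if known."""
    tokens = text.lower().split()
    return [TOY_LEMMAS.get(tok, tok) for tok in tokens]

quote = "She laughs and he laughed while they were laughing"
tokens = quote.lower().split()
lemmas = toy_lemmatize(quote)

# Three inflected forms collapse into the single lemma "laugh"
print(len(set(tokens)), "unique tokens before lemmatization")
print(len(set(lemmas)), "unique tokens after lemmatization")
```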

### Part 1: Selecting New York Times Quotes
In this part, the complete dataset is loaded, and only the quotations from the New York Times are selected and saved into reduced-size compressed JSON files, one per Quotebank input file (2015-2020). These JSON files will be loaded in the second notebook and combined into one dataframe.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import bz2
import json
import pandas as pd
import os
import spacy 

In [None]:
# Function used to read the quotes and keep only the nytimes quotations, based on the urls
def keep_NYT(PATH_IN, PATH_OUT):
  """
  Opens a compressed json data file and selects only the New York Times quotes,
  based on the domain name of their urls.
  Selected lines are written to a new compressed json file.
  """
  with bz2.open(PATH_IN, 'rb') as file_in:
    with bz2.open(PATH_OUT, 'wb') as file_out:
      for instance in file_in:
        instance = json.loads(instance) # loading a sample
        if any(".nytimes.com" in url for url in instance['urls']):
          file_out.write((json.dumps(instance)+'\n').encode('utf-8'))
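
The substring test `".nytimes.com" in url` looks at the whole URL string, so it would skip a URL whose host is exactly `nytimes.com` and would accept a URL that only mentions the domain in its path or query. A possible stricter variant, sketched here as an assumption and not what this notebook ran, compares the parsed hostname instead. `is_nyt_url` is a hypothetical helper name.

```python
from urllib.parse import urlparse

def is_nyt_url(url):
    """Return True if the URL's host is nytimes.com or one of its subdomains."""
    try:
        host = (urlparse(url).hostname or "").lower()
    except ValueError:
        # urlparse can raise on malformed URLs; treat those as non-matching
        return False
    return host == "nytimes.com" or host.endswith(".nytimes.com")
```

Inside `keep_NYT`, the condition would then read `any(is_nyt_url(url) for url in instance['urls'])`. The selected rows could differ slightly from those produced by the substring test.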

In [None]:
# Function used to create a dataframe
def to_DataFrame(PATH):
  """
  Loads the data from a compressed json file and stores it in a dataframe
  """
  with bz2.open(PATH, 'rb') as file:
    data = file.readlines()
    data = list(map(json.loads, data))
  return pd.DataFrame(data)
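
`to_DataFrame` reads every raw line into memory with `readlines()` before parsing. If memory becomes a concern, one option is to parse the file line by line with a generator. The sketch below assumes the same one-JSON-object-per-line layout that `keep_NYT` writes; `iter_json_lines` is a hypothetical name.

```python
import bz2
import json

def iter_json_lines(path):
    """Yield one parsed JSON object per line of a bz2-compressed JSON-lines file."""
    with bz2.open(path, 'rt', encoding='utf-8') as file:
        for line in file:
            if line.strip():  # skip empty lines
                yield json.loads(line)
```

The dataframe could then be built with `pd.DataFrame(list(iter_json_lines(PATH)))`. The parsed records still have to fit in memory, but the raw lines are no longer held alongside them.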

In [None]:
# Loading all the data files and creating our new files
DIR_PATH_IN = '/content/drive/Shareddrives/ADA/Quotebank'
DIR_PATH_OUT = '/content/drive/Shareddrives/ADA/NewYorkTimes'
files = os.listdir(DIR_PATH_IN)
frames = []

for f in files:
  # Load the data and keep only the quotes from NYT
  PATH_IN = os.path.join(DIR_PATH_IN, f)
  PATH_OUT = os.path.join(DIR_PATH_OUT, 'nyt-' + f)
  keep_NYT(PATH_IN, PATH_OUT)

  # Load the reduced file into a pandas DataFrame
  frames.append(to_DataFrame(PATH_OUT))

# Concatenate once at the end instead of growing the DataFrame inside the loop
pd_NYT = pd.concat(frames, ignore_index=True)

We can now display the created dataframe to check that everything is correct.

In [None]:
pd_NYT

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2020-02-18-004289,an appetite for power.,,[],2020-02-18 14:44:45,3,"[[None, 0.3665], [Robin Niblett, 0.3339], [Jos...","[https://hypervocal.com/items/3249757, https:/...",E
1,2020-01-09-006199,Andrew Yang's Lies About Supporting Medicare f...,Andrew Yang,"[Q11118258, Q28723576]",2020-01-09 01:21:54,2,"[[Andrew Yang, 0.7197], [None, 0.2804]]",[https://www.nytimes.com/2020/01/08/opinion/me...,E
2,2020-01-22-017789,eager to erase the image of congressional Repu...,Eric Cantor,[Q497271],2020-01-22 21:20:52,2,"[[Eric Cantor, 0.5013], [None, 0.3045], [Kevin...",[http://mobile.nytimes.com/2020/01/22/us/polit...,E
3,2020-01-31-022641,Given the partisan nature of this impeachment ...,Lisa Murkowski,[Q22360],2020-01-31 00:00:00,24,"[[Lisa Murkowski, 0.6433], [None, 0.224], [Joh...",[http://feeds.foxnews.com/~r/foxnews/politics/...,E
4,2020-01-23-024008,"He got on top of me, and he raped me.",Annabella Sciorra,[Q231395],2020-01-23 00:00:00,75,"[[Annabella Sciorra, 0.5251], [Harvey Weinstei...",[https://www.rawstory.com/2020/01/sopranos-act...,E
...,...,...,...,...,...,...,...,...,...
858362,2015-12-14-009516,be the Healthiest Individual Ever Elected to t...,Donald Trump,"[Q22686, Q27947481]",2015-12-14 21:29:14,258,"[[Donald Trump, 0.4715], [None, 0.1979], [Haro...",[http://time.com/4148215/donald-trump-health-p...,E
858363,2015-11-10-015262,Change is inevitable -- it's the progress that...,Andy Stern,[Q4761352],2015-11-10 14:35:05,1,"[[Andy Stern, 0.8624], [None, 0.1376]]",[http://mobile.nytimes.com/blogs/bits/2015/11/...,E
858364,2015-10-15-044368,"I just don't fit in,",,[],2015-10-15 12:00:21,7,"[[None, 0.4883], [Renee Unterman, 0.2619], [Ra...",[http://edmontonjournal.com/news/politics/1015...,E
858365,2015-09-15-104423,"Think to Win: The Strategic Dimension of Tennis,",Allen Fox,[Q1561999],2015-09-15 13:22:33,2,"[[Allen Fox, 0.779], [None, 0.2174], [Rafael N...",[http://dcourier.com/main.asp?SectionID=2&SubS...,E


### Part 2: Tokenization and lemmatization
In this part, the tokenized version of the quotations is created and stored with the corresponding quoteID in a dataframe. This dataframe is then saved into a compressed JSON file that will be loaded in the second notebook.

In [None]:
# Tokenizer and lemmatizer function
nlp = spacy.load('en_core_web_sm')

def text_tokenizer(text):
    """
    Takes in text, tokenizes words, then lemmatizes tokens.
    """
    return ' '.join([w.lemma_ for w in nlp(text)])
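
Since this step is the slow part of the notebook, batching may help. spaCy pipelines expose `nlp.pipe`, which processes texts as a stream in batches instead of one call per text. The sketch below is an alternative to `text_tokenizer`, not what was run here; `lemmatize_batch` and the `batch_size` default are assumptions, and any speed-up has not been measured on this dataset.

```python
def lemmatize_batch(texts, nlp, batch_size=1000):
    """
    Lemmatize an iterable of texts with a loaded spaCy pipeline `nlp`,
    returning one space-joined lemma string per input text.
    """
    docs = nlp.pipe(texts, batch_size=batch_size)
    return [' '.join(token.lemma_ for token in doc) for doc in docs]
```

With the pipeline loaded above, a call could look like `lemmatize_batch(pd_NYT['quotation'], nlp)`, which returns a list in the same order as the input quotations.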



In [None]:
# Create a reduced dataframe that keeps the quoteID to identify each quotation
# (.copy() avoids modifying a slice of pd_NYT and the SettingWithCopyWarning)
tokenized_quotations = pd_NYT[['quoteID']].copy()

# Create the lemmatized version of the quotations
tokenized_quotations['tokenized_quotations'] = pd_NYT['quotation'].apply(text_tokenizer)

# Save the lemmatized column in a compressed json file (to_json returns None when given a path)
PATH_TOKEN = '/content/drive/Shareddrives/ADA/tokenizer/tokenizer_column_data_NYT.json.bz2'
tokenized_quotations.to_json(PATH_TOKEN, orient='records', lines=True, compression='bz2')