
<h1 align=center>
ADA Project: Define the political orientation of newspapers
<br>
Notebook 1: Loading and selecting the data
</h1>

---

In this first notebook, we implement all the steps needed to build the final datasets for our project from the dataset we were given. Because these steps take a few hours to run, we put them in this separate preliminary notebook, which only needs to be run once.

In the following code, we read the full Quotebank files for the period 2015-2020. As it is a very large dataset, we keep only the quotes coming from three newspapers: the *New York Times*, *CNN* and *FOX News*. We then store them in json files, and use these reduced datasets for our actual project (see `project_pt2_analyses.ipynb`).

We also perform part of the quotation cleaning in this notebook, as it also takes a long time to run. This preprocessing consists of tokenizing and lemmatizing the quotes: each quote is split into individual words (i.e. tokens), and each word is reduced to its base form (lemmatization). For example, laugh, laughs, laughing and laughed would all be reduced to laugh. This simplifies the analysis by reducing the number of unique words. Both techniques are built into the spaCy package, which is used in the `add_col_tokens` function from the `src` directory of the repository.
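
To make the idea concrete, here is a toy sketch of tokenization and lemmatization in plain Python. It is not the spaCy pipeline used by `add_col_tokens`: the lemma table, the stop-word list and the `toy_tokens` name are hypothetical and cover only the words in the examples. The stop-word removal is an assumption based on the token outputs shown at the end of this notebook, where words such as "an" and "for" do not appear.

```python
import re
from typing import List

# Hypothetical, hand-written tables: spaCy relies on much larger resources.
LEMMAS = {"laughs": "laugh", "laughing": "laugh", "laughed": "laugh", "kept": "keep"}
STOP_WORDS = {"a", "an", "and", "for", "he", "she", "the", "then", "they"}


def toy_tokens(quote: str) -> List[str]:
    """Split a quote into words, drop stop words, and map each word to its lemma."""
    words = re.findall(r"[A-Za-z']+", quote)  # punctuation is discarded here
    tokens = []
    for word in words:
        if word.lower() in STOP_WORDS:
            continue
        # Unknown words are kept unchanged
        tokens.append(LEMMAS.get(word.lower(), word))
    return tokens


print(toy_tokens("She laughs, he laughed, and then they kept laughing."))
# ['laugh', 'laugh', 'keep', 'laugh']
print(toy_tokens("an appetite for power."))
# ['appetite', 'power']
```

The four inflected forms collapse onto two lemmas, which is the reduction in vocabulary size described above.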

## Selecting newspaper quotes

In this part, the full dataset is loaded, and only the quotations from the three newspapers are selected and saved into six reduced-size json files (one per year, 2015-2020) for each newspaper. The json files will be loaded in the second notebook and combined into one dataframe.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
pip install empath



In [None]:
# Add root to path
import sys
sys.path.append('/content/drive/Shareddrives/ADA')

In [None]:
# Import libraries
import bz2
import json
import os

from tqdm import tqdm

# Import paths
from src.constants import TOKENS_COL
from src.df_factory import (add_col_tokens_from_bz2, create_df_from_bz2,
                            create_df_from_bz2_dir, save_df_bz2)
from src.paths import (CNN_DIR, DATA_DIR, FOX_DIR, NYT_DIR, QUOTEBANK_DIR,
                       SELECTED_DIR, TOKENS_DIR)
from src.text_processing import add_col_tokens

# Define domains
CNN_DOMAIN = '.cnn'
FOX_DOMAIN = '.fox'
NYT_DOMAIN = '.nytimes.com'

# Define dictionary newspaper/domain
NEWSPAPER_DOMAIN = {
    'CNN': CNN_DOMAIN,
    'FOX': FOX_DOMAIN,
    'NYT': NYT_DOMAIN,
}

# Print paths
print('CNN:', CNN_DIR)
print('FOX:', FOX_DIR)
print('NYT:', NYT_DIR)
print('QUOTEBANK:', QUOTEBANK_DIR)
print('SELECTED:', SELECTED_DIR)

CNN: /content/drive/Shareddrives/ADA/data/CNN
FOX: /content/drive/Shareddrives/ADA/data/FOX
NYT: /content/drive/Shareddrives/ADA/data/NYT
QUOTEBANK: /content/drive/Shareddrives/ADA/Quotebank
SELECTED: /content/drive/Shareddrives/ADA/data/selected


In [None]:
def save_newspapers(filename_in: str, filename_out: str):
    """
    Opens a bz2-compressed json file and selects only the quotes from the
    wanted newspapers based on the domain name. Selected lines are written
    to a new bz2-compressed json file.
    """
    with bz2.open(filename_in, 'rb') as file_in:
        with bz2.open(filename_out, 'wb') as file_out:
            for instance in file_in:
                instance = json.loads(instance)     # loading a sample
                if (any(NYT_DOMAIN in url for url in instance['urls']) or
                    any(FOX_DOMAIN in url for url in instance['urls']) or
                    any(CNN_DOMAIN in url for url in instance['urls'])
                ):
                    file_out.write((json.dumps(instance)+'\n').encode('utf-8'))
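
The selection above is a plain substring test on each URL, so a short pattern such as `'.fox'` also matches hosts like `www.foxsports.com`. If a stricter selection were wanted, one possible alternative is to compare the parsed host name against a full domain. The sketch below is only an illustration: `host_matches` is a hypothetical helper, `'foxnews.com'` is an assumed domain rather than the `FOX_DOMAIN` value defined above, and switching to it would change which quotes are kept.

```python
from urllib.parse import urlparse


def host_matches(url: str, domain: str) -> bool:
    """Return True if the URL's host is `domain` or one of its subdomains."""
    host = (urlparse(url).hostname or '').lower()
    return host == domain or host.endswith('.' + domain)


# Substring test, as used in this notebook: matches an unrelated Fox property
print('.fox' in 'https://www.foxsports.com/nfl')
# True

# Host-based test with an assumed full domain
print(host_matches('https://www.foxsports.com/nfl', 'foxnews.com'))
# False
print(host_matches('http://feeds.foxnews.com/~r/foxnews/politics/', 'foxnews.com'))
# True
```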

In [None]:
def save_newspaper(filename_in: str, filename_out: str, domain: str):
    """
    Opens a bz2-compressed json file and selects only the quotes from the
    wanted newspaper based on the domain name. Selected lines are written
    to a new bz2-compressed json file.
    """
    with bz2.open(filename_in, 'rb') as file_in:
        with bz2.open(filename_out, 'wb') as file_out:
            for instance in file_in:
                instance = json.loads(instance)     # loading a sample
                if any(domain in url for url in instance['urls']):
                    file_out.write((json.dumps(instance)+'\n').encode('utf-8'))
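
As a quick sanity check, the filtering logic can be tried on a tiny file instead of the full Quotebank data. The sketch below is self-contained: `filter_by_domain` repeats the logic of `save_newspaper`, and the three sample records, their `quoteID` values and the temporary file names are made up for the illustration.

```python
import bz2
import json
import os
import tempfile


def filter_by_domain(filename_in, filename_out, domain):
    """Same logic as save_newspaper: keep lines with a URL containing `domain`."""
    with bz2.open(filename_in, 'rb') as file_in, \
            bz2.open(filename_out, 'wb') as file_out:
        for line in file_in:
            instance = json.loads(line)
            if any(domain in url for url in instance['urls']):
                file_out.write((json.dumps(instance) + '\n').encode('utf-8'))


# Made-up records that only contain the fields needed by the filter
samples = [
    {'quoteID': 'q1', 'urls': ['https://www.nytimes.com/a']},
    {'quoteID': 'q2', 'urls': ['https://example.org/b']},
    {'quoteID': 'q3', 'urls': ['https://example.org/c',
                               'http://mobile.nytimes.com/d']},
]

with tempfile.TemporaryDirectory() as tmp:
    path_in = os.path.join(tmp, 'quotes-sample.json.bz2')
    path_out = os.path.join(tmp, 'NYT-quotes-sample.json.bz2')

    # Write one json object per line, compressed with bz2
    with bz2.open(path_in, 'wb') as f:
        for sample in samples:
            f.write((json.dumps(sample) + '\n').encode('utf-8'))

    filter_by_domain(path_in, path_out, '.nytimes.com')

    with bz2.open(path_out, 'rb') as f:
        kept_ids = [json.loads(line)['quoteID'] for line in f]

print(kept_ids)
# ['q1', 'q3']
```

A quote is kept as soon as one of its URLs matches, which is why `q3` is selected even though its first URL points elsewhere.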

In [None]:
def create_selected_files():
    """
    Creates the `selected` directory with quotes from the three newspapers
    for years from 2015 to 2020.
    """
    # Create directory
    os.makedirs(SELECTED_DIR, exist_ok=True)

    # Create files
    for filename in tqdm(os.listdir(QUOTEBANK_DIR)):
        filename_in = os.path.join(QUOTEBANK_DIR, filename)
        filename_out = os.path.join(SELECTED_DIR, 'selected-' + filename)
        save_newspapers(filename_in, filename_out)

In [None]:
# Warning: heavy computation (2 hours)
create_selected_files()

In [None]:
def create_newspapers_files():
    """Creates the separated files for the three newspapers."""
    selected_files = os.listdir(SELECTED_DIR)

    # Each newspaper
    for newspaper, domain in NEWSPAPER_DOMAIN.items():
        print('Newspaper:', newspaper)

        # Create directory
        newspaper_dir = os.path.join(DATA_DIR, newspaper)
        os.makedirs(newspaper_dir, exist_ok=True)

        # Create files
        for filename in selected_files:
            print(' -> Loading', filename)
            filename_in = os.path.join(SELECTED_DIR, filename)
            filename_out = os.path.join(
                newspaper_dir, f'{newspaper}-{filename}')
            save_newspaper(filename_in, filename_out, domain)

        print('_' * 50)

In [None]:
# Warning: heavy computation (1 hour)
create_newspapers_files()

Newspaper: CNN
 -> Loading selected-quotes-2020.json.bz2
 -> Loading selected-quotes-2019.json.bz2
 -> Loading selected-quotes-2018.json.bz2
 -> Loading selected-quotes-2017.json.bz2
 -> Loading selected-quotes-2016.json.bz2
 -> Loading selected-quotes-2015.json.bz2
__________________________________________________
Newspaper: FOX
 -> Loading selected-quotes-2020.json.bz2
 -> Loading selected-quotes-2019.json.bz2
 -> Loading selected-quotes-2018.json.bz2
 -> Loading selected-quotes-2017.json.bz2
 -> Loading selected-quotes-2016.json.bz2
 -> Loading selected-quotes-2015.json.bz2
__________________________________________________
Newspaper: NYT
 -> Loading selected-quotes-2020.json.bz2
 -> Loading selected-quotes-2019.json.bz2
 -> Loading selected-quotes-2018.json.bz2
 -> Loading selected-quotes-2017.json.bz2
 -> Loading selected-quotes-2016.json.bz2
 -> Loading selected-quotes-2015.json.bz2
__________________________________________________


## Tokenization and lemmatization

In this part, the tokenized version of each quotation is created and saved, together with the corresponding `quoteID`, as a dataframe in a compressed json file that will be loaded in the second notebook.

The `add_col_tokens` function is implemented in the `src.text_processing` module.
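
The reason for keeping the `quoteID` next to the tokens is that the tokens can later be attached back to the full quotes without recomputing them. The sketch below illustrates this round trip with the standard library only. It is not the implementation of `save_df_bz2` or `add_col_tokens_from_bz2`, which live in `src.df_factory`; the one-object-per-line file layout and the temporary file name are assumptions. The two quotes and their tokens are taken from the outputs shown at the end of this notebook.

```python
import bz2
import json
import os
import tempfile

# Quotes indexed by quoteID
quotes = {
    '2020-02-18-004289': {'quotation': 'an appetite for power.'},
    '2015-10-15-044368': {'quotation': "I just don't fit in,"},
}

# Tokens computed once (the slow step), also indexed by quoteID
tokens = {
    '2020-02-18-004289': ['appetite', 'power'],
    '2015-10-15-044368': ['fit'],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'tokens-sample.json.bz2')

    # Save only the quoteID and the tokens (assumed layout: one object per line)
    with bz2.open(path, 'wt', encoding='utf-8') as f:
        for quote_id, quote_tokens in tokens.items():
            f.write(json.dumps({'quoteID': quote_id, 'tokens': quote_tokens}) + '\n')

    # Later: reload the tokens without running the tokenizer again
    loaded = {}
    with bz2.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            loaded[record['quoteID']] = record['tokens']

# Attach the tokens to the quotes through the shared quoteID
for quote_id, row in quotes.items():
    row['tokens'] = loaded[quote_id]

print(quotes['2020-02-18-004289'])
# {'quotation': 'an appetite for power.', 'tokens': ['appetite', 'power']}
```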

In [None]:
def save_tokens_newspaper(newspaper: str):
    """Saves a dataframe with tokenized quotations for a newspaper."""
    # Create dataframe
    dirname = os.path.join(DATA_DIR, newspaper)
    df = create_df_from_bz2_dir(dirname)

    # Add tokens column
    add_col_tokens(df)

    # Save dataframe
    filename = os.path.join(TOKENS_DIR, f'{newspaper}-tokenizer.json.bz2')
    print('Save file', filename)
    save_df_bz2(df[[TOKENS_COL]].reset_index(), filename)

In [None]:
# Create tokens directory
os.makedirs(TOKENS_DIR, exist_ok=True)

In [None]:
# Warning: heavy computation
save_tokens_newspaper('CNN')

Load bz2 files: 100%|██████████| 6/6 [02:10<00:00, 21.77s/file]
100%|██████████| 597820/597820 [2:02:28<00:00, 81.35it/s]


Save file /content/drive/Shareddrives/ADA/data/tokens/CNN-tokenizer.json.bz2


In [None]:
# Warning: heavy computation
save_tokens_newspaper('NYT')

Load bz2 files: 100%|██████████| 6/6 [02:15<00:00, 22.63s/file]
100%|██████████| 858367/858367 [2:46:00<00:00, 86.18it/s]


Save file /content/drive/Shareddrives/ADA/data/tokens/NYT-tokenizer.json.bz2


In [None]:
# Warning: heavy computation
save_tokens_newspaper('FOX')

Load bz2 files: 100%|██████████| 6/6 [02:57<00:00, 29.52s/file]
100%|██████████| 679319/679319 [2:20:42<00:00, 80.46it/s]


Save file /content/drive/Shareddrives/ADA/data/tokens/FOX-tokenizer.json.bz2


## Loading a dataset for a newspaper

Once the files are created, we can easily load the datasets for the three newspapers. Here is an example with the *New York Times*.

The functions used to create the dataset are implemented in the `src.df_factory` module.

In [None]:
# Create the dataframe
df = create_df_from_bz2_dir(NYT_DIR)
df

Load bz2 files: 100%|██████████| 6/6 [02:05<00:00, 20.88s/file]


quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
2020-02-18-004289,an appetite for power.,,[],2020-02-18 14:44:45,3,"[[None, 0.3665], [Robin Niblett, 0.3339], [Jos...","[https://hypervocal.com/items/3249757, https:/...",E
2020-01-09-006199,Andrew Yang's Lies About Supporting Medicare f...,Andrew Yang,"[Q11118258, Q28723576]",2020-01-09 01:21:54,2,"[[Andrew Yang, 0.7197], [None, 0.2804]]",[https://www.nytimes.com/2020/01/08/opinion/me...,E
2020-01-22-017789,eager to erase the image of congressional Repu...,Eric Cantor,[Q497271],2020-01-22 21:20:52,2,"[[Eric Cantor, 0.5013], [None, 0.3045], [Kevin...",[http://mobile.nytimes.com/2020/01/22/us/polit...,E
2020-01-31-022641,Given the partisan nature of this impeachment ...,Lisa Murkowski,[Q22360],2020-01-31 00:00:00,24,"[[Lisa Murkowski, 0.6433], [None, 0.224], [Joh...",[http://feeds.foxnews.com/~r/foxnews/politics/...,E
2020-01-23-024008,"He got on top of me, and he raped me.",Annabella Sciorra,[Q231395],2020-01-23 00:00:00,75,"[[Annabella Sciorra, 0.5251], [Harvey Weinstei...",[https://www.rawstory.com/2020/01/sopranos-act...,E
...,...,...,...,...,...,...,...,...
2015-12-14-009516,be the Healthiest Individual Ever Elected to t...,Donald Trump,"[Q22686, Q27947481]",2015-12-14 21:29:14,258,"[[Donald Trump, 0.4715], [None, 0.1979], [Haro...",[http://time.com/4148215/donald-trump-health-p...,E
2015-11-10-015262,Change is inevitable -- it's the progress that...,Andy Stern,[Q4761352],2015-11-10 14:35:05,1,"[[Andy Stern, 0.8624], [None, 0.1376]]",[http://mobile.nytimes.com/blogs/bits/2015/11/...,E
2015-10-15-044368,"I just don't fit in,",,[],2015-10-15 12:00:21,7,"[[None, 0.4883], [Renee Unterman, 0.2619], [Ra...",[http://edmontonjournal.com/news/politics/1015...,E
2015-09-15-104423,"Think to Win: The Strategic Dimension of Tennis,",Allen Fox,[Q1561999],2015-09-15 13:22:33,2,"[[Allen Fox, 0.779], [None, 0.2174], [Rafael N...",[http://dcourier.com/main.asp?SectionID=2&SubS...,E


In [None]:
# Look at the dataframe of tokens
tokens_filename = os.path.join(TOKENS_DIR, 'NYT-tokenizer.json.bz2')
create_df_from_bz2(tokens_filename)

quoteID,tokens
2020-02-18-004289,"[appetite, power]"
2020-01-09-006199,"[Andrew, Yang, lie, support, Medicare, expose,..."
2020-01-22-017789,"[eager, erase, image, congressional, Republica..."
2020-01-31-022641,"[partisan, nature, impeachment, beginning, com..."
2020-01-23-024008,[rape]
...,...
2015-12-14-009516,"[healthy, Individual, elect, Presidency]"
2015-11-10-015262,"[change, inevitable, progress, optional]"
2015-10-15-044368,[fit]
2015-09-15-104423,"[think, Win, Strategic, Dimension, Tennis]"


In [None]:
# Add tokens column
tokens_filename = os.path.join(TOKENS_DIR, 'NYT-tokenizer.json.bz2')
df = add_col_tokens_from_bz2(df, tokens_filename)
df

quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase,tokens
2020-02-18-004289,an appetite for power.,,[],2020-02-18 14:44:45,3,"[[None, 0.3665], [Robin Niblett, 0.3339], [Jos...","[https://hypervocal.com/items/3249757, https:/...",E,"[appetite, power]"
2020-01-09-006199,Andrew Yang's Lies About Supporting Medicare f...,Andrew Yang,"[Q11118258, Q28723576]",2020-01-09 01:21:54,2,"[[Andrew Yang, 0.7197], [None, 0.2804]]",[https://www.nytimes.com/2020/01/08/opinion/me...,E,"[Andrew, Yang, lie, support, Medicare, expose,..."
2020-01-22-017789,eager to erase the image of congressional Repu...,Eric Cantor,[Q497271],2020-01-22 21:20:52,2,"[[Eric Cantor, 0.5013], [None, 0.3045], [Kevin...",[http://mobile.nytimes.com/2020/01/22/us/polit...,E,"[eager, erase, image, congressional, Republica..."
2020-01-31-022641,Given the partisan nature of this impeachment ...,Lisa Murkowski,[Q22360],2020-01-31 00:00:00,24,"[[Lisa Murkowski, 0.6433], [None, 0.224], [Joh...",[http://feeds.foxnews.com/~r/foxnews/politics/...,E,"[partisan, nature, impeachment, beginning, com..."
2020-01-23-024008,"He got on top of me, and he raped me.",Annabella Sciorra,[Q231395],2020-01-23 00:00:00,75,"[[Annabella Sciorra, 0.5251], [Harvey Weinstei...",[https://www.rawstory.com/2020/01/sopranos-act...,E,[rape]
...,...,...,...,...,...,...,...,...,...
2015-12-14-009516,be the Healthiest Individual Ever Elected to t...,Donald Trump,"[Q22686, Q27947481]",2015-12-14 21:29:14,258,"[[Donald Trump, 0.4715], [None, 0.1979], [Haro...",[http://time.com/4148215/donald-trump-health-p...,E,"[healthy, Individual, elect, Presidency]"
2015-11-10-015262,Change is inevitable -- it's the progress that...,Andy Stern,[Q4761352],2015-11-10 14:35:05,1,"[[Andy Stern, 0.8624], [None, 0.1376]]",[http://mobile.nytimes.com/blogs/bits/2015/11/...,E,"[change, inevitable, progress, optional]"
2015-10-15-044368,"I just don't fit in,",,[],2015-10-15 12:00:21,7,"[[None, 0.4883], [Renee Unterman, 0.2619], [Ra...",[http://edmontonjournal.com/news/politics/1015...,E,[fit]
2015-09-15-104423,"Think to Win: The Strategic Dimension of Tennis,",Allen Fox,[Q1561999],2015-09-15 13:22:33,2,"[[Allen Fox, 0.779], [None, 0.2174], [Rafael N...",[http://dcourier.com/main.asp?SectionID=2&SubS...,E,"[think, Win, Strategic, Dimension, Tennis]"
