<h1 align="center">
ADA Project: Define the political orientation of newspapers
<br>
Notebook 2: Analyses
</h1>

---

This notebook contains the analyses of the Quotebank dataset, which consists of quotations from multiple newspapers published between 2015 and 2020. The final goal of this project is to determine the political orientation of the selected newspapers, mainly based on the distribution of Republican-oriented and Democratic-oriented quotations.

The selected newspapers are:

- The New York Times
- CNN
- Fox News

CNN is known to lean towards Democratic positions, while Fox News is known for its Republican ones. We run the same analysis on all three newspapers and compare the results.

The determination of the political orientation of the newspapers is based on several chosen topics that are commonly addressed in the USA and on which Republicans and Democrats tend to disagree. We use dictionaries to select the relevant quotations and apply unsupervised machine learning techniques combined with sentiment analysis to highlight the separation of political opinions in the quotes.

Finally, a complementary dataset is used in this project. It contains further information about the speakers identified in Quotebank, such as nationality and party.

Our analysis is divided into the following steps:


PART A: New York Times Newspaper
- Data preprocessing  
  - Data loading  
  - Basic pre-processing  
  - Loading of the additional dataset: speaker information  
  - Speakers cleaning  
  - Initial visualization of the dataset  
  - First step towards sentiment analysis  

- Analyses
  - Topics detection  
  - Sentiment analysis  
  - Topics analyses
  - PCA on speakers  

PART B: CNN Newspaper
- Preprocessing (similar steps)
- Analyses (similar steps)

PART C: Fox News
- Preprocessing (similar steps)
- Analyses (similar steps)

Note that this notebook should be run after the `project_pt1_loading.ipynb` notebook, which loads the complete 2015-2020 Quotebank, selects the quotations coming from the given newspaper and creates a tokenized version of the quotations. In that notebook, the results of these steps are saved into compressed JSON files, which are then loaded in this `project_pt2_analyses.ipynb` notebook.

The functions used during the analyses are implemented in the modules of the `src` directory in the repository.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
pip install empath



In [None]:
pip install -U wordcloud



In [None]:
# Import standard libraries
import gc
import os
import sys
import pandas as pd

# Import functions to display html content
from IPython.display import display, HTML

# Add root to path
sys.path.append('/content/drive/Shareddrives/ADA')

# Run a garbage collection pass (returns the number of unreachable objects found)
gc.collect()

239

In [None]:
# Import modules from src
import src.constants as constants
import src.data_cleaning as dc
import src.df_factory as dff
import src.parquet_files as pf
import src.paths as paths
import src.plot_utils as pu
import src.sentiment_analysis as sa
import src.table_utils as tu
import src.text_processing as tp
import src.wordcloud as wc


The twython library has not been installed. Some functionality from the twitter package will not be available.



[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## PART A: The New York Times

We begin our analysis on the *New York Times* newspaper.

The main dataset used in the project is [Quotebank](https://zenodo.org/record/4277311), an open corpus of millions of quotations attributed to the speakers who uttered them, extracted from news articles published in English between 2015 and 2020.

The complete dataset contains the following information:

- `quoteID`: used to identify the quotation.
- `quotation`: the quotation published in the article.
- `speaker`: the name of the person with the highest probability of being the speaker. The speaker is set to None if they were not found.
- `qids`: used to identify the speaker, if any.
- `date`: the publication date of the quotation.
- `numOccurrences`: the number of occurrences of the quotation in the newspapers.
- `probas`: the probability for different people to be the speaker.
- `urls`: the URLs of the articles in which the quotation was found.
- `phase`: related to the processing of the quotation when building the quotebank.

The given dataset is divided into 6 JSON files according to the year of publication (2015-2020). As it is a very large dataset and we are only interested in a portion of it, we kept only the quotations from the selected newspapers before loading the data. This step was performed in the first notebook, `project_pt1_loading.ipynb`. Here, we load the 6 reduced-size JSON files produced by that notebook and assemble them into one final dataframe.

### Data preprocessing

#### Data loading

First, we create the dataset of quotes.

In [None]:
# Create dataframe of quotes
df_nyt = dff.create_df_from_bz2_dir(paths.NYT_DIR)

Load bz2 files: 100%|██████████| 6/6 [02:08<00:00, 21.43s/file]
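
As a rough illustration of what such a loader might do, the sketch below reads every compressed file in a directory and concatenates the results. It is not the implementation of `dff.create_df_from_bz2_dir`. The function name `load_bz2_dir`, the `*.json.bz2` naming pattern and the one-JSON-object-per-line layout are assumptions made for the example.

```python
import bz2
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_bz2_dir(dir_path):
    """Read every *.json.bz2 file (one JSON object per line) into one dataframe."""
    frames = []
    for path in sorted(Path(dir_path).glob('*.json.bz2')):
        with bz2.open(path, 'rt', encoding='utf-8') as f:
            records = [json.loads(line) for line in f if line.strip()]
        frames.append(pd.DataFrame.from_records(records))
    # Stack the yearly frames and index the result by quote identifier
    return pd.concat(frames, ignore_index=True).set_index('quoteID')


# Tiny demonstration with two made-up yearly files
with tempfile.TemporaryDirectory() as tmp:
    for year in (2015, 2016):
        row = {'quoteID': f'{year}-01-01-000001', 'quotation': 'Hello.'}
        target = Path(tmp) / f'quotes-{year}.json.bz2'
        with bz2.open(target, 'wt', encoding='utf-8') as f:
            f.write(json.dumps(row) + '\n')
    df_demo = load_bz2_dir(tmp)

print(df_demo)
```

Parsing line by line keeps `quoteID` as a plain string, which avoids any automatic date inference on identifiers that look like dates.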


In [None]:
df_nyt

quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
2015-01-01-002600,"At Home in the Whole Food Kitchen,",,[],2015-01-01 03:40:40,44,"[[None, 0.6134], [James Beard, 0.1426], [Karen...",[http://www.washingtonpost.com/pb/recipes/whol...,E
2015-01-01-003238,Blackout: Remembering the Things I Drank to Fo...,,[],2015-01-01 19:26:56,67,"[[None, 0.4248], [Terry Gross, 0.2328], [Alia ...",[http://salon.com/2015/01/01/from_better_call_...,E
2015-01-01-013950,I will get up.,Perumal Murugan,[Q18761417],2015-01-01 00:57:26,6,"[[Perumal Murugan, 0.4423], [None, 0.3639], [B...",[http://www.examiner.com/article/nietzsche-on-...,E
2015-01-01-013998,I wish I could un-see it.,Mike Schroepfer,[Q6848733],2015-01-01 04:48:03,14,"[[Mike Schroepfer, 0.5114], [None, 0.3752], [D...",[http://www.staradvertiser.com/r?19=961&43=651...,E
2015-01-01-020572,Joyful Rendezvous Upon Pure Ice and Snow.,,[],2015-01-01 06:53:00,54,"[[None, 0.5441], [Wang Hui, 0.1782], [Xi Jinpi...",[http://www.sportskeeda.com/winter-sports/2022...,E
...,...,...,...,...,...,...,...,...
2020-04-16-068421,"You can source stories through the internet, d...",Peter Hamby,[Q24851454],2020-04-16 15:09:55,1,"[[Peter Hamby, 0.8762], [None, 0.1238]]",[http://www.nytimes.com/2020/04/16/business/me...,E
2020-04-16-068721,"You have to disobey,",Wayne Hoffman,"[Q16205097, Q7976336]",2020-04-16 09:04:24,2,"[[Wayne Hoffman, 0.7339], [None, 0.2128], [Bra...",[http://mobile.nytimes.com/2020/04/16/us/coron...,E
2020-04-16-068856,You lose the texture.,Peter Hamby,[Q24851454],2020-04-16 15:09:55,1,"[[Peter Hamby, 0.7456], [None, 0.2544]]",[http://www.nytimes.com/2020/04/16/business/me...,E
2020-04-16-068904,You must knock out the coronavirus with your E...,Ryuho Okawa,[Q7385496],2020-04-16 09:00:27,2,"[[Ryuho Okawa, 0.9008], [None, 0.0992]]",[http://nytimes.com/2020/04/16/nyregion/happy-...,E


#### Data cleaning

Once the dataset is loaded, some basic cleaning steps are performed to detect any abnormalities and clean the data, to make it usable for any further analyses.

We start by reducing the size of the data: the columns we do not need, `probas`, `phase` and `urls`, are removed. Note that the URLs can be removed since we already selected the New York Times quotations during the loading (see `project_pt1_loading.ipynb`).

In [None]:
# Drop useless columns
print('Columns dropped:', constants.USELESS_COLS)
dc.drop_useless_columns(df_nyt)
print('Columns kept:', list(df_nyt.columns))

Columns dropped: ['phase', 'probas', 'urls']
Columns kept: ['quotation', 'speaker', 'qids', 'date', 'numOccurrences']


We check that the dataframe does not contain any duplicated rows or missing entries (see `src.data_cleaning.remove_abnormalities`).


In [None]:
# Check abnormalities
dc.remove_abnormalities(df_nyt, verbose=True)

No duplicated rows
No missing entries
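
The kind of check performed here can be sketched with plain pandas. This is only an illustration on a toy dataframe, not the code of `src.data_cleaning.remove_abnormalities`, whose exact behaviour is not shown in this notebook.

```python
import pandas as pd

# Toy data: the second row duplicates the first, the third has no quotation
df_demo = pd.DataFrame({
    'quotation': ['I will get up.', 'I will get up.', None],
    'numOccurrences': [6, 6, 2],
})

# Count fully duplicated rows and missing quotations
n_duplicates = int(df_demo.duplicated().sum())
n_missing = int(df_demo['quotation'].isna().sum())
print('Duplicated rows:', n_duplicates)
print('Missing quotations:', n_missing)

# Drop both kinds of abnormal rows
df_clean = df_demo.drop_duplicates().dropna(subset=['quotation'])
```

Columns that hold Python lists, such as `qids`, are not hashable, so a check like this would likely need to skip them or convert them to tuples first.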


Finally, the type of each column is converted to the most appropriate one. More precisely, it is important for `quotation` and `speaker` to be set to string (see `src.data_cleaning.convert_columns_type`).

In [None]:
# Convert types
dc.convert_columns_type(df_nyt, verbose=True)

Old types:
quotation         object
speaker           object
qids              object
date              object
numOccurrences     int64
dtype: object

New types:
quotation                 string
speaker                   string
qids                      object
date              datetime64[ns]
numOccurrences             Int64
dtype: object
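
For reference, a conversion producing the same target types can be written directly in pandas. The sketch below uses a one-row toy dataframe and is an assumption about how such a conversion could look, not the body of `convert_columns_type`.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'quotation': ['I will get up.'],
    'speaker': ['Perumal Murugan'],
    'date': ['2015-01-01 00:57:26'],
    'numOccurrences': [6],
})

# Dedicated string dtype and nullable integer dtype
df_demo = df_demo.astype({
    'quotation': 'string',
    'speaker': 'string',
    'numOccurrences': 'Int64',
})
# Parse the publication date into a proper timestamp
df_demo['date'] = pd.to_datetime(df_demo['date'])

print(df_demo.dtypes)
```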


#### Tokenization

We add the `tokens` column, which holds the tokens associated with each quotation. The tokens are generated by the first notebook and saved in a compressed JSON file, so we load this file and add the column to the dataframe of quotes.

In [None]:
# Add tokens column
df_nyt = dff.add_col_tokens_from_bz2(df_nyt, paths.NYT_TOKENS_PATH)

# Drop -PRON- tokens
dc.drop_pron_tokens(df_nyt)

df_nyt

100%|██████████| 858367/858367 [00:04<00:00, 178181.57it/s]


quoteID,quotation,speaker,qids,date,numOccurrences,tokens
2015-01-01-002600,"At Home in the Whole Food Kitchen,",,[],2015-01-01 03:40:40,44,"[home, Food, Kitchen]"
2015-01-01-003238,Blackout: Remembering the Things I Drank to Fo...,,[],2015-01-01 19:26:56,67,"[blackout, remember, thing, drink, forget]"
2015-01-01-013950,I will get up.,Perumal Murugan,[Q18761417],2015-01-01 00:57:26,6,[]
2015-01-01-013998,I wish I could un-see it.,Mike Schroepfer,[Q6848733],2015-01-01 04:48:03,14,[wish]
2015-01-01-020572,Joyful Rendezvous Upon Pure Ice and Snow.,,[],2015-01-01 06:53:00,54,"[joyful, rendezvous, Pure, Ice, Snow]"
...,...,...,...,...,...,...
2020-04-16-068421,"You can source stories through the internet, d...",Peter Hamby,[Q24851454],2020-04-16 15:09:55,1,"[source, story, internet, screen]"
2020-04-16-068721,"You have to disobey,",Wayne Hoffman,"[Q16205097, Q7976336]",2020-04-16 09:04:24,2,[disobey]
2020-04-16-068856,You lose the texture.,Peter Hamby,[Q24851454],2020-04-16 15:09:55,1,"[lose, texture]"
2020-04-16-068904,You must knock out the coronavirus with your E...,Ryuho Okawa,[Q7385496],2020-04-16 09:00:27,2,"[knock, coronavirus, cantare, belief]"


#### Speakers

Now that the quotations are preprocessed, we focus on the speakers. Some quotations are attributed to known speakers and some are not. Since our analyses rely on identified speakers, we build a reduced dataset named `df_nyt_unique_speakers` from the complete dataset `df_nyt` by keeping only the quotations attributed to known speakers.

The `qids` column is replaced by a `qid` column that holds a single QID per quotation.

In [None]:
# Create dataframe with identified speakers
df_nyt_unique_speakers = dff.create_df_unique_speakers(df_nyt)
df_nyt_unique_speakers

100%|██████████| 552140/552140 [00:01<00:00, 512935.33it/s]


quoteID,quotation,speaker,date,numOccurrences,tokens,qid
2015-01-01-013950,I will get up.,Perumal Murugan,2015-01-01 00:57:26,6,[],Q18761417
2015-01-01-013998,I wish I could un-see it.,Mike Schroepfer,2015-01-01 04:48:03,14,[wish],Q6848733
2015-01-02-000297,"A Crowbar In the Buddhist Garden,",Stephen Reid,2015-01-02 20:38:45,2,"[crowbar, Buddhist, Garden]",Q7610344
2015-01-02-009154,for services to local government.,Queen Elizabeth II,2015-01-02 10:12:35,8,"[service, local, government]",Q9682
2015-01-02-027603,"It's working out very nicely,",President Donald Trump,2015-01-02 08:54:49,25,"[work, nicely]",Q22686
...,...,...,...,...,...,...
2020-04-16-068389,You can imagine the runway to ramp up the U.S....,Michael Dowse,2020-04-16 05:00:07,4,"[imagine, runway, ramp, Open, short, runway, t...",Q3308160
2020-04-16-068421,"You can source stories through the internet, d...",Peter Hamby,2020-04-16 15:09:55,1,"[source, story, internet, screen]",Q24851454
2020-04-16-068721,"You have to disobey,",Wayne Hoffman,2020-04-16 09:04:24,2,[disobey],Q16205097
2020-04-16-068856,You lose the texture.,Peter Hamby,2020-04-16 15:09:55,1,"[lose, texture]",Q24851454
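
A possible way to obtain such a dataframe is sketched below. It assumes that quotations with an empty `qids` list are dropped and that the first QID is kept when several are listed, which is consistent with the Wayne Hoffman row above but is not confirmed by the source of `create_df_unique_speakers`.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        'speaker': [None, 'Peter Hamby', 'Wayne Hoffman'],
        'qids': [[], ['Q24851454'], ['Q16205097', 'Q7976336']],
    },
    index=pd.Index(['q1', 'q2', 'q3'], name='quoteID'),
)

# Keep only quotations with at least one candidate QID
has_qid = df_demo['qids'].map(len) > 0
df_known = df_demo[has_qid].copy()

# Assumption: the first QID in the list identifies the speaker
df_known['qid'] = df_known['qids'].map(lambda qids: qids[0])
df_known = df_known.drop(columns='qids')

print(df_known)
```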


This new dataframe is then extended using an external dataset given as a ".parquet" file composed of additional information on many speakers. Note that this external dataset is common to all 3 newspapers. Therefore, the loading as well as the preprocessing of this dataset is only performed once, and used in the other parts.

During loading, we only keep the information that is relevant to our analysis:
- `aliases`: the different names used to refer to the person. This information is useful if we want to identify the quotations in which the given speaker was cited.
- `id`: equivalent to the QIDs in the quotes dataset. This column is used to merge the dataframe of speakers with the dataframe of quotes.
- `nationality`: given as a Wikidata item. We need the nationality of the speakers because we focus on American speakers and classify them as Republican, Democrat, other party or no party.
- `US_congress_bio_ID`: member IDs from the "Biographical Directory of the United States Congress". This information indicates how prominent the person is in the political world.
- `party`: the party to which the speaker belongs, given as a Wikidata item. Several items are sometimes listed for one person; this is handled later. This column is particularly helpful to determine the split between Republicans and Democrats. The parties are later grouped into 4 categories.
- `label`: the label used to name the person. We use this information as the speaker's name.

In [None]:
# Load parquet file with speaker attributes
df_speakers = pf.create_df_from_parquet(paths.PARQUET_PATH)

In [None]:
df_speakers

id,aliases,nationality,US_congress_bio_ID,party,label
Q23,"[Washington, President Washington, G. Washingt...","[Q161885, Q30]",W000178,[Q327591],George Washington
Q42,"[Douglas Noel Adams, Douglas Noël Adams, Dougl...",[Q145],,,Douglas Adams
Q1868,"[Paul Marie Ghislain Otlet, Paul Marie Otlet]",[Q31],,,Paul Otlet
Q207,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[Q30],,[Q29468],George W. Bush
Q297,"[Velázquez, Diego Rodríguez de Silva y Velázqu...",[Q29],,,Diego Velázquez
...,...,...,...,...,...
Q106406560,[Barker Howard],[Q30],,,Barker B. Howard
Q106406571,[Charles Macomber],[Q30],,,Charles H. Macomber
Q106406588,,,,,Dina David
Q106406593,,,,,Irma Dexinger


As with the quotes dataset, we check for abnormalities using `remove_abnormalities`.

In [None]:
# Check abnormalities
dc.remove_abnormalities(df_speakers, verbose=True)

No duplicated rows


Some entries seem to be missing. Let's look at them in more detail. A missing `nationality` is not a problem: we simply cannot use that information for the speaker in question. Similarly, a missing entry in `aliases`, `party` or `US_congress_bio_ID` means that the speaker has no recorded alias, party or congress ID. This does not matter to us and should not be understood as a data error.

However, we would like to make sure that no `id` values are missing.

In [None]:
# Check for missing ids
print('Some ids are missing: ', df_speakers.index.isna().any())

Some ids are missing:  False


No ids are missing, thus we will have no problem when merging the dataframes.

As already done for the previous dataset, we convert the type of each column into the most appropriate one.

In [None]:
# Convert types
dc.convert_columns_type(df_speakers, verbose=True)

Old types:
aliases               object
nationality           object
US_congress_bio_ID    object
party                 object
label                 object
dtype: object

New types:
aliases               object
nationality           object
US_congress_bio_ID    string
party                 object
label                 string
dtype: object


Some speakers have multiple parties. Possible explanations are that the party changed its name or that the speaker changed party. We only allow one party per speaker and consider the last one listed to be the current party.

In [None]:
# Number of speakers with multiple parties
pf.get_number_speakers_several_parties(df_speakers)

100%|██████████| 9055981/9055981 [00:09<00:00, 911514.23it/s]


32499

In [None]:
# Affiliate the last party to the speaker
pf.affiliate_speakers_last_party(df_speakers)
df_speakers

100%|██████████| 9055981/9055981 [00:09<00:00, 935203.78it/s]


id,aliases,nationality,US_congress_bio_ID,party,label
Q23,"[Washington, President Washington, G. Washingt...","[Q161885, Q30]",W000178,Q327591,George Washington
Q42,"[Douglas Noel Adams, Douglas Noël Adams, Dougl...",[Q145],,,Douglas Adams
Q1868,"[Paul Marie Ghislain Otlet, Paul Marie Otlet]",[Q31],,,Paul Otlet
Q207,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[Q30],,Q29468,George W. Bush
Q297,"[Velázquez, Diego Rodríguez de Silva y Velázqu...",[Q29],,,Diego Velázquez
...,...,...,...,...,...
Q106406560,[Barker Howard],[Q30],,,Barker B. Howard
Q106406571,[Charles Macomber],[Q30],,,Charles H. Macomber
Q106406588,,,,,Dina David
Q106406593,,,,,Irma Dexinger
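
The "last party" rule can be illustrated on a small series. This is a minimal sketch with a hypothetical helper `last_party` and a made-up speaker `Q_example`; it is not the code of `pf.affiliate_speakers_last_party`.

```python
import pandas as pd

# Party lists per speaker; None means no recorded party
parties = pd.Series(
    [['Q327591'], None, ['Q29552', 'Q29468']],
    index=['Q23', 'Q42', 'Q_example'],
)


def last_party(value):
    """Return the last listed party, or None when nothing is recorded."""
    if isinstance(value, (list, tuple)) and len(value) > 0:
        return value[-1]
    return None


# Number of speakers with more than one listed party
n_multiple = int(
    parties.map(lambda v: isinstance(v, (list, tuple)) and len(v) > 1).sum()
)
single_party = parties.map(last_party)

print(n_multiple)
print(single_party)
```

When the data comes from a parquet file, list-like cells may be loaded as NumPy arrays rather than Python lists, in which case the `isinstance` checks would need to be extended accordingly.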


Additionally, we want to analyze the US parties further. Therefore, we select the American speakers and classify them into 4 categories according to their affiliated party:
- democratic party
- republican party
- other party
- no party

In [None]:
# Select the US speakers and attribute them to a party category
df_speakers_us_party = pf.create_df_us_party(df_speakers)
df_speakers_us_party

id,aliases,US_congress_bio_ID,label,party_name
Q23,"[Washington, President Washington, G. Washingt...",W000178,George Washington,other party
Q207,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",,George W. Bush,republican party
Q633,"[Neil Percival Young, Shakey, Godfather of Gru...",,Neil Young,no party
Q873,"[Mary Louise Streep, Meryl Louise Streep, Stre...",,Meryl Streep,democratic party
Q1381,,,Dave Arneson,no party
...,...,...,...,...
Q106406546,[Leonard Gaskill],,Leonard T. Gaskill,no party
Q106406557,[Andrew Healy],,Andrew F. Healy,no party
Q106406560,[Barker Howard],,Barker B. Howard,no party
Q106406571,[Charles Macomber],,Charles H. Macomber,no party


In [None]:
# Proportions of parties
df_speakers_us_party.party_name.value_counts()

no party            390002
democratic party     22325
republican party     22324
other party           4432
Name: party_name, dtype: int64

Note that to find the Wikidata identifiers corresponding to the American nationality, the Democratic Party and the Republican Party, we simply looked at the values assigned to well-known speakers (Barack Obama for the Democratic Party, Donald Trump for the Republican Party).
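
The selection and classification can be sketched as follows. `Q30` (United States) and `Q29468` (George W. Bush's party) appear in the tables above; `Q29552` for the Democratic Party is our assumption and should be checked against the project's constants. The function `party_category` is hypothetical and only illustrates the four-way grouping.

```python
import pandas as pd

US_QID = 'Q30'             # United States, as seen in the nationality column
REPUBLICAN_QID = 'Q29468'  # party listed for George W. Bush above
DEMOCRATIC_QID = 'Q29552'  # assumed identifier of the Democratic Party


def party_category(party_qid):
    """Map a party QID to one of the four categories."""
    if party_qid is None or pd.isna(party_qid):
        return 'no party'
    if party_qid == DEMOCRATIC_QID:
        return 'democratic party'
    if party_qid == REPUBLICAN_QID:
        return 'republican party'
    return 'other party'


df_demo = pd.DataFrame(
    {
        'label': ['George Washington', 'George W. Bush',
                  'Douglas Adams', 'Barker B. Howard'],
        'nationality': [['Q161885', 'Q30'], ['Q30'], ['Q145'], ['Q30']],
        'party': ['Q327591', 'Q29468', None, None],
    },
    index=['Q23', 'Q207', 'Q42', 'Q106406560'],
)

# Keep speakers whose nationality list contains the United States
is_us = df_demo['nationality'].map(
    lambda qids: isinstance(qids, list) and US_QID in qids
)
df_us = df_demo[is_us].copy()
df_us['party_name'] = df_us['party'].map(party_category)

print(df_us[['label', 'party_name']])
```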

#### Merge quotes and speakers

Since both the `df_nyt_unique_speakers` and `df_speakers_us_party` dataframes have been cleaned separately, we can now merge them into `df_nyt_speakers_party`, which extends each quotation with the speaker information we need.

In [None]:
# Merge dataframes
df_nyt_speakers_party = pf.merge_quotes_speakers(
    df_nyt_unique_speakers, df_speakers_us_party)
df_nyt_speakers_party

quoteID,quotation,speaker,date,numOccurrences,tokens,qid,aliases,US_congress_bio_ID,label,party_name
2015-01-01-013998,I wish I could un-see it.,Mike Schroepfer,2015-01-01 04:48:03,14,[wish],Q6848733,,,Mike Schroepfer,no party
2015-07-20-108903,"We're working on this right now,",Mike Schroepfer,2015-07-20 17:31:41,8,"[work, right]",Q6848733,,,Mike Schroepfer,no party
2016-04-25-054863,It's not totally obvious how all this shakes o...,Mike Schroepfer,2016-04-25 01:07:00,6,"[totally, obvious, shake, lot, consumer, produ...",Q6848733,,,Mike Schroepfer,no party
2016-04-25-088906,The world is making enough phones. It's better...,Mike Schroepfer,2016-04-25 01:07:00,8,"[world, phone, world, device]",Q6848733,,,Mike Schroepfer,no party
2018-02-19-029384,"I can high-five Mark and Sheryl from my desk, ...",Mike Schroepfer,2018-02-19 18:02:51,7,"[high, Mark, Sheryl, desk, team, right]",Q6848733,,,Mike Schroepfer,no party
...,...,...,...,...,...,...,...,...,...,...
2020-04-16-021184,I want to be adventure-ready when this is over...,Hillary Allen,2020-04-16 19:12:21,1,"[want, adventure, ready, chop, wood, good, tra...",Q55214523,,,Hillary Allen,no party
2020-04-16-033449,Most people at the hospital are going to know ...,Jeffrey Hatcher,2020-04-16 15:40:48,1,"[people, hospital, know, neighbor, somebody, r...",Q6176044,,,Jeffrey Hatcher,no party
2020-04-16-044355,The beauty of our town is that you can pick up...,Jeffrey Hatcher,2020-04-16 15:40:48,1,"[beauty, town, pick, phone, talk, doc, spot, c...",Q6176044,,,Jeffrey Hatcher,no party
2020-04-16-045724,"The faster we learn,",Christopher Murray,2020-04-16 09:29:01,1,"[faster, learn]",Q1077588,,,Christopher Murray,no party
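
A merge of this kind can be expressed with `DataFrame.merge`. The sketch below assumes an inner join on the speaker QID, so that quotations from speakers absent from the US party table are dropped; whether `pf.merge_quotes_speakers` does exactly this is an assumption.

```python
import pandas as pd

df_quotes = pd.DataFrame(
    {
        'quotation': ['I wish I could un-see it.',
                      'for services to local government.'],
        'qid': ['Q6848733', 'Q9682'],
    },
    index=pd.Index(['2015-01-01-013998', '2015-01-02-009154'], name='quoteID'),
)
df_parties = pd.DataFrame(
    {'label': ['Mike Schroepfer'], 'party_name': ['no party']},
    index=pd.Index(['Q6848733'], name='id'),
)

# Join quotes to speakers on the QID, then restore the quoteID index
df_merged = (
    df_quotes.reset_index()
    .merge(df_parties, left_on='qid', right_index=True, how='inner')
    .set_index('quoteID')
)

print(df_merged)
```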


### Visualization

Our preprocessing of the data is now done. We will do some visualizations to motivate our project.

#### Top speakers in the New York Times

We are first interested in the distribution of the speakers in the *New York Times*.

In [None]:
# Barplot top speakers
fig = pu.plot_bar_top_speakers(
    df_nyt_speakers_party,
    title='Top 10 speakers in the New York Times between 2015 and 2020',
    filename=os.path.join(paths.FIGS_DIR, 'nyt_bar_top_speakers.html'),
)
fig

As observed in the graph, Donald Trump is by far the most represented personality in the newspaper. Keep in mind that this result is biased, since he was either a presidential candidate or the President of the US during most of the period.

In [None]:
# Pie chart top speakers
fig = pu.plot_pie_top_speakers(
    df_nyt_speakers_party,
    title='Top 10 speakers in the New York Times between 2015 and 2020',
    filename=os.path.join(paths.FIGS_DIR, 'nyt_pie_top_speakers.html'),
)
fig

In [None]:
# Pie chart proportion of parties
fig = pu.plot_pie_parties(
    df_nyt_speakers_party,
    title='Proportions of parties for the New York Times',
    filename=os.path.join(paths.FIGS_DIR, 'nyt_pie_parties.html'),
)
fig

#### Word clouds

**Word cloud** is a technique for visualizing frequent words in a text where the size of the words represents their frequency. For this project, we can use this visualization tool to observe the most frequent words appearing in the quotes, and observe whether there is a difference in the words that are used according to the speakers' political party. To do so we will use the tokenized version of the quotes.

We can easily create wordclouds in Python using the [wordcloud](https://amueller.github.io/word_cloud/) library.

We split the quotes by political party and create a wordcloud for the following categories:

- Democratic party (blue)
- Republican party (red)
- Other parties (green)
- No party (purple)
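
Behind each word cloud is simply a table of word frequencies per party. The sketch below computes such frequencies with `collections.Counter` on a toy dataframe. It is not the code of `wc.create_wordcloud_party`, and the sample tokens are made up.

```python
from collections import Counter

import pandas as pd

df_demo = pd.DataFrame({
    'party_name': ['democratic party', 'democratic party', 'republican party'],
    'tokens': [['health', 'care', 'child'],
               ['health', 'woman'],
               ['military', 'China']],
})

# One Counter of token frequencies per party category
frequencies = {
    party: Counter(token for tokens in group for token in tokens)
    for party, group in df_demo.groupby('party_name')['tokens']
}

for party, counter in frequencies.items():
    print(party, counter.most_common(3))
```

A dictionary of frequencies like these can then be handed to the wordcloud library, for instance through `WordCloud.generate_from_frequencies`.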

In [None]:
# Generate word clouds for each party category
for party_name in constants.PARTIES_LIST:
    wordcloud = wc.create_wordcloud_party(
        df=df_nyt_speakers_party,
        party_name=party_name,
    )
    filename = f"wordcloud_{party_name.replace(' ', '_')}.svg"
    wc.plot_wordcloud(
        wordcloud=wordcloud,
        filename=os.path.join(paths.FIGS_DIR, filename),
        title=f'Wordcloud for {party_name}',
    )


Of course, the plots show a lot of irrelevant words, as we have not yet selected the most pertinent topics. Some words are nevertheless quite characteristic of a party (e.g. "women", "child", "health care" for the Democrats; "military", "China", "North Korea" for the Republicans).

Filtering these words, as will be done with the dictionaries, may highlight some very evocative ideas in each party.
Note that we are mainly interested in the Republican and Democrat plots, but we can use the other two as references.

### Sentiment analysis

An important part of the project concerns **sentiment analysis**. Sentiment analysis is the practice of using algorithms to classify various samples of related text into overall positive and negative categories.

#### First steps towards sentiment analysis


To perform sentiment analysis, we use the Python library [NLTK](https://www.nltk.org/). It stands for Natural Language Toolkit and includes a sentiment analyzer. We use the built-in, pretrained sentiment analyzer called VADER (Valence Aware Dictionary and sEntiment Reasoner).

Let's show how to use the sentiment analyzer.

In [None]:
sa.get_polarity_scores('ADA is awesome!')

{'compound': 0.6588, 'neg': 0.0, 'neu': 0.313, 'pos': 0.687}

We get back a dictionary of different scores. The negative, neutral, and positive scores are related: they all add up to 1 and can’t be negative. The compound score is calculated differently. It’s not just an average, and it can range from -1 to 1.
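
To give an intuition for why the compound score stays between -1 and 1, the sketch below reproduces the normalization step used by VADER, to the best of our knowledge: the summed word valences `x` are mapped to `x / sqrt(x^2 + alpha)` with `alpha = 15`. The value of `alpha` and the function name `normalize` are taken from our reading of the reference implementation and should be treated as assumptions.

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


# The mapping is odd-symmetric and saturates for large inputs
for raw in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(raw, round(normalize(raw), 4))
```

This explains why the compound score is not an average of the three other scores: it is derived from the raw valence sum, not from the `neg`, `neu` and `pos` proportions.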

Let's add the column with compound scores in the dataframe.

In [None]:
# Add compound score column
sa.add_col_compound_score(df_nyt_speakers_party)
df_nyt_speakers_party

100%|██████████| 305187/305187 [01:18<00:00, 3897.06it/s]


quoteID,quotation,speaker,date,numOccurrences,tokens,qid,aliases,US_congress_bio_ID,label,party_name,compound_score
2015-01-01-013998,I wish I could un-see it.,Mike Schroepfer,2015-01-01 04:48:03,14,[wish],Q6848733,,,Mike Schroepfer,no party,0.4019
2015-07-20-108903,"We're working on this right now,",Mike Schroepfer,2015-07-20 17:31:41,8,"[work, right]",Q6848733,,,Mike Schroepfer,no party,0.0000
2016-04-25-054863,It's not totally obvious how all this shakes o...,Mike Schroepfer,2016-04-25 01:07:00,6,"[totally, obvious, shake, lot, consumer, produ...",Q6848733,,,Mike Schroepfer,no party,-0.3400
2016-04-25-088906,The world is making enough phones. It's better...,Mike Schroepfer,2016-04-25 01:07:00,8,"[world, phone, world, device]",Q6848733,,,Mike Schroepfer,no party,0.4404
2018-02-19-029384,"I can high-five Mark and Sheryl from my desk, ...",Mike Schroepfer,2018-02-19 18:02:51,7,"[high, Mark, Sheryl, desk, team, right]",Q6848733,,,Mike Schroepfer,no party,0.0000
...,...,...,...,...,...,...,...,...,...,...,...
2020-04-16-021184,I want to be adventure-ready when this is over...,Hillary Allen,2020-04-16 19:12:21,1,"[want, adventure, ready, chop, wood, good, tra...",Q55214523,,,Hillary Allen,no party,0.4939
2020-04-16-033449,Most people at the hospital are going to know ...,Jeffrey Hatcher,2020-04-16 15:40:48,1,"[people, hospital, know, neighbor, somebody, r...",Q6176044,,,Jeffrey Hatcher,no party,0.4939
2020-04-16-044355,The beauty of our town is that you can pick up...,Jeffrey Hatcher,2020-04-16 15:40:48,1,"[beauty, town, pick, phone, talk, doc, spot, c...",Q6176044,,,Jeffrey Hatcher,no party,0.5859
2020-04-16-045724,"The faster we learn,",Christopher Murray,2020-04-16 09:29:01,1,"[faster, learn]",Q1077588,,,Christopher Murray,no party,0.0000


Then, by plotting the distribution of the compound score, we can determine if the quotes are rather positive or negative.

In [None]:
# Distribution of compound score
pu.plot_hist_compound(
    df_nyt_speakers_party,
    title='Distribution of compound score for the New York Times',
    filename=os.path.join(paths.FIGS_DIR, 'nyt_hist_compound_score.html'),
)


Most of the quotes are neutral. Nevertheless, we can see that there are more positive quotes than negative quotes.
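
One way to turn the compound score into the three classes mentioned here is to apply a symmetric threshold. The sketch below uses ±0.05, a commonly used convention for VADER; both the threshold and the helper `sentiment_label` are assumptions and not part of the `src` modules.

```python
import pandas as pd


def sentiment_label(compound, threshold=0.05):
    """Label a compound score as positive, negative or neutral."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Compound scores of the first five quotes shown above
scores = pd.Series([0.4019, 0.0, -0.34, 0.4404, 0.0])
labels = scores.map(sentiment_label)
counts = labels.value_counts()

print(counts)
```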

#### Sentiment analysis on selected topics

The political orientation of the newspapers is assessed through several topics. We choose 11 different topics that are commonly addressed in the news and on which Republicans and Democrats often disagree. The topics are the following:

- immigration  
- healthcare  
- climate
- trump
- abortion
- women right
- violence
- racism
- war
- tax
- coal 

Initially, all of these topics are used in our analysis. We then use statistical tests to identify the ones that best distinguish Republicans from Democrats.

We first select the quotes that address one of the topics. To do so, we use [Empath](https://github.com/Ejhfast/empath-client) to generate a list of words for each topic from a few example input words called seed terms. Then, for each topic, the quotations are compared with the generated words, which gives each quotation a score reflecting the presence of the generated words in it. We assume that a quotation with a score greater than 0 is relevant to the topic and should be kept.

A sentiment analysis is then performed on each selected quotation. The resulting sentiment serves as an indicator of the speaker's opinion on the topic in question.

Finally, we create an additional dataset `df_nyt_topics` that contains all the quotations that were selected for one or several topics, with the sentiment scores for each topic.
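
The selection rule described above (keep a quotation for a topic when its score is greater than 0) can be sketched without Empath by counting how many tokens fall in each topic's word list. The word sets below are small hand-picked samples, and the helpers `topic_scores` and `relevant_topics` are hypothetical; the real lexicon is generated by Empath from the seed terms.

```python
import pandas as pd

# Hand-picked sample words standing in for the generated lexicon
topic_words = {
    'immigration': {'refugee', 'immigration', 'border', 'citizenship'},
    'healthcare': {'health', 'hospital', 'insurance', 'treatment'},
}


def topic_scores(tokens, topic_words):
    """Count, for each topic, how many tokens belong to its word list."""
    lowered = [token.lower() for token in tokens]
    return {
        topic: sum(token in words for token in lowered)
        for topic, words in topic_words.items()
    }


def relevant_topics(tokens, topic_words):
    """Keep the topics whose score is greater than 0."""
    scores = topic_scores(tokens, topic_words)
    return [topic for topic, score in scores.items() if score > 0]


df_demo = pd.DataFrame({
    'tokens': [['people', 'hospital', 'know', 'neighbor'],
               ['faster', 'learn']],
})
df_demo['topics'] = df_demo['tokens'].map(
    lambda tokens: relevant_topics(tokens, topic_words)
)

print(df_demo)
```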

In [None]:
# Dictionary with topics and seed words
display(HTML(tu.make_html_table_topics(constants.TOPICS_DICT)))

Topic,Seed words
Immigration,"refugee, immigration, border, citizenship, naturalization"
Healthcare,"health, medical, treatment, disease, aid, hospital, insurance, reimbursement"
Climate,"melting, global warming, temperature, rise, change, ecology, meteorology, urgency, co2, greenhouse gas, climate event"
Trump,"president, donald trump, republican, 2016 presidential election"
Abortion,"pregnancy, woman, life, choice, family, child, foetus, body, right, terminate, abort, rape"
Women right,"abortion, sexism, salary gap, sexual harassment, abuse, gender equality, gender, woman, female, patriarchy, feminism"
Violence,"police violence, gun, second amendment, shooting, death, police brutality, firearm"
Racism,"discrimination, privilage, race, ethnicity, equality, afroamerican, white, black, hate crime, color"
War,"military, irak, afghanistan, palestine, middle east, soldier, arm, weapon, missile, conflict, operation, troop, bomb, force"
Tax,"income, revenue, free trade, taxpayer, imposition, fee, social welfare, tax evasion, tariff, deductible, vat"


We first create the lexicon using Empath.

In [None]:
# Create the lexicon
lexicon = tp.create_lexicon(constants.TOPICS_DICT)

Topic: immigration
["immigrants", "citizenship", "asylum_seekers", "immigration", "migrants", "illegal_immigrants", "political_refugees", "refugee_status", "Salvadorans", "emigration", "legal_status", "asylum", "border", "homelands", "refugees", "deportation", "refugee", "Nicaraguans", "homeland", "dual_citizenship", "resettlement", "political_asylum", "asylum-seekers", "naturalization", "emigrants", "Albania", "boat_people", "mainland", "United_States_citizens", "illegal_aliens", "Haitians", "repatriation", "American_citizens", "Soviet_Jews", "exiles", "visas", "Central_Americans", "Guatemalans", "ethnic_Germans", "Chinese_citizens", "American_citizenship", "Mexicans", "indigenous_people", "Rumania", "immigration_officials", "green_cards", "citizen", "persecution", "United_States_citizenship", "Xinjiang", "Tibetans", "Eritrea", "Guantanamo", "Soviet_Jews", "immigrate", "Tajikistan", "religious_persecution", "permanent_residency", "Sri_Lanka", "Ethiopia", "ethnic_Russians", "Cubans", "


Let's explore the quotations that were selected for one or more topics. To do so, we simply add an extra column called `topics` to the original dataframe `df_nyt_speakers_party`; it contains the topics associated with each quotation, if any.

In [None]:
# Add topics column
tp.add_topics_col(df_nyt_speakers_party, lexicon, constants.TOPICS_DICT.keys())
df_nyt_speakers_party

100%|██████████| 305187/305187 [11:12<00:00, 453.78it/s]


quoteID,quotation,speaker,date,numOccurrences,tokens,qid,aliases,US_congress_bio_ID,label,party_name,compound_score,topics
2015-01-01-013998,I wish I could un-see it.,Mike Schroepfer,2015-01-01 04:48:03,14,[wish],Q6848733,,,Mike Schroepfer,no party,0.4019,[]
2015-07-20-108903,"We're working on this right now,",Mike Schroepfer,2015-07-20 17:31:41,8,"[work, right]",Q6848733,,,Mike Schroepfer,no party,0.0000,[]
2016-04-25-054863,It's not totally obvious how all this shakes o...,Mike Schroepfer,2016-04-25 01:07:00,6,"[totally, obvious, shake, lot, consumer, produ...",Q6848733,,,Mike Schroepfer,no party,-0.3400,[]
2016-04-25-088906,The world is making enough phones. It's better...,Mike Schroepfer,2016-04-25 01:07:00,8,"[world, phone, world, device]",Q6848733,,,Mike Schroepfer,no party,0.4404,[]
2018-02-19-029384,"I can high-five Mark and Sheryl from my desk, ...",Mike Schroepfer,2018-02-19 18:02:51,7,"[high, Mark, Sheryl, desk, team, right]",Q6848733,,,Mike Schroepfer,no party,0.0000,[]
...,...,...,...,...,...,...,...,...,...,...,...,...
2020-04-16-021184,I want to be adventure-ready when this is over...,Hillary Allen,2020-04-16 19:12:21,1,"[want, adventure, ready, chop, wood, good, tra...",Q55214523,,,Hillary Allen,no party,0.4939,[]
2020-04-16-033449,Most people at the hospital are going to know ...,Jeffrey Hatcher,2020-04-16 15:40:48,1,"[people, hospital, know, neighbor, somebody, r...",Q6176044,,,Jeffrey Hatcher,no party,0.4939,[healthcare]
2020-04-16-044355,The beauty of our town is that you can pick up...,Jeffrey Hatcher,2020-04-16 15:40:48,1,"[beauty, town, pick, phone, talk, doc, spot, c...",Q6176044,,,Jeffrey Hatcher,no party,0.5859,[]
2020-04-16-045724,"The faster we learn,",Christopher Murray,2020-04-16 09:29:01,1,"[faster, learn]",Q1077588,,,Christopher Murray,no party,0.0000,[]
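
The internals of `tp.add_topics_col` are not shown in this notebook. As a rough sketch of the idea, assuming the lexicon can be viewed as a mapping from topic name to a set of keywords, a quote is tagged with every topic whose keywords overlap its tokens. The names `toy_lexicon` and `assign_topics` are hypothetical and only serve the illustration.

```python
# Hypothetical lexicon: topic name -> set of lowercase keywords.
# The real lexicon comes from tp.create_lexicon and is much larger.
toy_lexicon = {
    "healthcare": {"hospital", "doctor", "insurance"},
    "coal": {"coal", "mining", "energy"},
}


def assign_topics(tokens, lexicon):
    """Return every topic whose keyword set shares at least one token with the quote."""
    token_set = {token.lower() for token in tokens}
    return [topic for topic, keywords in lexicon.items() if token_set & keywords]


print(assign_topics(["people", "hospital", "know", "neighbor"], toy_lexicon))
print(assign_topics(["Barry", "write", "sustained", "manic", "energy"], toy_lexicon))
print(assign_topics(["faster", "learn"], toy_lexicon))
```

With this kind of matching, a single generic keyword (here the hypothetical `energy` entry) is enough to tag an unrelated quote, so the selected quotes are worth inspecting by eye.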



Here are the quotes associated with at least one topic.

In [None]:
# Quotes with at least one topic
df_nyt_speakers_party[
    df_nyt_speakers_party.topics.progress_apply(lambda x: len(x) > 0)
]

100%|██████████| 305187/305187 [00:00<00:00, 796083.82it/s]


quoteID,quotation,speaker,date,numOccurrences,tokens,qid,aliases,US_congress_bio_ID,label,party_name,compound_score,topics
2019-05-20-045322,is the person at Facebook leading the efforts ...,Mike Schroepfer,2019-05-20 10:26:42,2,"[person, Facebook, lead, effort, build, automa...",Q6848733,,,Mike Schroepfer,no party,0.0000,[abortion]
2015-01-05-030101,It is even said that he likes beautiful women ...,President Donald Trump,2015-01-05 18:16:50,192,"[like, beautiful, woman, young]",Q22686,"[Donald John Trump, Donald J. Trump, Trump, Th...",,Donald Trump,republican party,0.7717,"[abortion, women right]"
2015-01-10-010403,He's a lot of fun to be with. It is even said ...,President Donald Trump,2015-01-10 13:14:48,158,"[lot, fun, like, beautiful, woman, young]",Q22686,"[Donald John Trump, Donald J. Trump, Trump, Th...",,Donald Trump,republican party,0.8750,"[abortion, women right]"
2015-04-28-037816,Our great African-American president hasn't ex...,Donald Trump,2015-04-28 11:44:39,31,"[great, african, american, president, exactly,...",Q22686,"[Donald John Trump, Donald J. Trump, Trump, Th...",,Donald Trump,republican party,0.5399,[trump]
2015-05-13-052020,"The baby is born,",President Donald Trump,2015-05-13 10:10:00,44,"[baby, bear]",Q22686,"[Donald John Trump, Donald J. Trump, Trump, Th...",,Donald Trump,republican party,0.0000,[abortion]
...,...,...,...,...,...,...,...,...,...,...,...,...
2020-04-15-027125,If you go in with anxiety and with the very in...,Martha Beck,2020-04-15 19:00:05,1,"[anxiety, innocent, love, intention, control, ...",Q6774337,,,Martha Beck,no party,0.7316,[climate]
2020-04-16-005446,"Barry writes with a sustained, manic energy,",Marcy Dermansky,2020-04-16 18:33:36,1,"[Barry, write, sustained, manic, energy]",Q1894433,,,Marcy Dermansky,no party,0.2732,[coal]
2020-04-16-061946,"`We Ride Upon Sticks' is quirky, comic and pai...",Marcy Dermansky,2020-04-16 18:33:36,1,"[ride, stick, quirky, comic, painstakingly, de...",Q1894433,,,Marcy Dermansky,no party,0.0000,"[women right, racism]"
2020-04-16-033449,Most people at the hospital are going to know ...,Jeffrey Hatcher,2020-04-16 15:40:48,1,"[people, hospital, know, neighbor, somebody, r...",Q6176044,,,Jeffrey Hatcher,no party,0.4939,[healthcare]


As the results seem coherent, we create the `df_nyt_topics` dataframe with one column per topic and one row per quote. Each cell contains the compound score of the quote if the quote is about the topic, and NaN otherwise.

In [None]:
# Create topics dataframe
df_nyt_topics = tp.create_df_topics(
    df_nyt_speakers_party, constants.TOPICS_DICT.keys()
)
df_nyt_topics

Create df topics: 100%|██████████| 11/11 [00:39<00:00,  3.62s/topic]


quoteID,immigration_compound_score,healthcare_compound_score,climate_compound_score,trump_compound_score,abortion_compound_score,women_right_compound_score,violence_compound_score,racism_compound_score,war_compound_score,tax_compound_score,coal_compound_score
2015-01-01-013998,,,,,,,,,,,
2015-07-20-108903,,,,,,,,,,,
2016-04-25-054863,,,,,,,,,,,
2016-04-25-088906,,,,,,,,,,,
2018-02-19-029384,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
2020-04-16-021184,,,,,,,,,,,
2020-04-16-033449,,0.4939,,,,,,,,,
2020-04-16-044355,,,,,,,,,,,
2020-04-16-045724,,,,,,,,,,,
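
The layout described above (one column per topic, the compound score where the quote is tagged, NaN elsewhere) can be sketched with `Series.where`. This is only an illustration under assumed column names (`topics`, `compound_score`); `build_topic_scores` is a hypothetical helper, not the actual `tp.create_df_topics`.

```python
import pandas as pd


def build_topic_scores(df, topics):
    """One column per topic: the quote's compound score if tagged, NaN otherwise."""
    out = pd.DataFrame(index=df.index)
    for topic in topics:
        # Boolean mask: True where the quote's topic list contains this topic
        is_tagged = df["topics"].apply(lambda tagged: topic in tagged)
        column = topic.replace(" ", "_") + "_compound_score"
        # where() keeps the score on tagged rows and puts NaN on the others
        out[column] = df["compound_score"].where(is_tagged)
    return out


toy = pd.DataFrame(
    {
        "compound_score": [0.4939, 0.0, 0.7717],
        "topics": [["healthcare"], [], ["abortion", "women right"]],
    },
    index=["q1", "q2", "q3"],
)
toy_topics = build_topic_scores(toy, ["healthcare", "abortion", "women right"])
print(toy_topics)
```

Keeping NaN rather than 0 for untagged quotes matters for the averages that follow, since pandas skips NaN values when computing a mean.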


As an indicator of the general sentiment of the *New York Times* towards each topic, we average the sentiment scores over the quotations for each topic.

In [None]:
# Average score of all quotations per topic
df_nyt_topics.mean()

immigration_compound_score    0.022796
healthcare_compound_score     0.148200
climate_compound_score        0.056578
trump_compound_score          0.112566
abortion_compound_score       0.087479
women_right_compound_score   -0.011667
violence_compound_score      -0.258490
racism_compound_score         0.040351
war_compound_score           -0.166441
tax_compound_score            0.121272
coal_compound_score           0.121668
dtype: float64

For each topic, the average score is close to 0 (the largest in magnitude is about -0.26, for violence): the negative and positive scores tend to cancel each other out, giving an overall neutral sentiment. These averages are not usable for our purpose, so we perform further analyses on the topics.
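
To make the cancellation effect concrete, here is a small sketch on made-up scores (the values are hypothetical, not taken from the dataset). It compares the signed mean with the mean magnitude and the standard deviation, two quantities that are not affected by positive and negative scores offsetting each other.

```python
from statistics import mean, pstdev

# Hypothetical compound scores for one topic: strong opinions in both directions
scores = [0.8, -0.7, 0.6, -0.65]

average = mean(scores)                             # signed scores cancel out
average_magnitude = mean(abs(s) for s in scores)   # strength regardless of sign
spread = pstdev(scores)                            # population standard deviation

print(f"mean={average:.4f}, mean |score|={average_magnitude:.4f}, std={spread:.4f}")
```

A near-zero mean combined with a large spread would indicate polarized quotes rather than genuinely neutral ones.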

#### Analysis of the quotes according to political party

Now, we complete our dataframe of selected quotes with information on the speaker and the speaker's political party. We do so by merging it with the `df_nyt_speakers_party` dataframe created before, keeping only the speaker's name and party.

From this merged dataframe, we can analyse the mean sentiment score per topic for each party and determine whether the opinion on a topic differs significantly from one party to the other. The goal is to obtain a selection of topics on which democrats and republicans significantly disagree, to later compare it with the opinion of the *New York Times* on the same topics.

A mean sentiment score per topic will also be computed for each speaker; this step is described further down in this notebook.

In [None]:
# Merge the dataframe with speaker and political party infos
df_nyt_topics = df_nyt_topics.merge(
    df_nyt_speakers_party[['label', 'party_name']],
    left_index=True,
    right_index=True,
)
df_nyt_topics

quoteID,immigration_compound_score,healthcare_compound_score,climate_compound_score,trump_compound_score,abortion_compound_score,women_right_compound_score,violence_compound_score,racism_compound_score,war_compound_score,tax_compound_score,coal_compound_score,label,party_name
2015-01-01-013998,,,,,,,,,,,,Mike Schroepfer,no party
2015-07-20-108903,,,,,,,,,,,,Mike Schroepfer,no party
2016-04-25-054863,,,,,,,,,,,,Mike Schroepfer,no party
2016-04-25-088906,,,,,,,,,,,,Mike Schroepfer,no party
2018-02-19-029384,,,,,,,,,,,,Mike Schroepfer,no party
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-04-16-021184,,,,,,,,,,,,Hillary Allen,no party
2020-04-16-033449,,0.4939,,,,,,,,,,Jeffrey Hatcher,no party
2020-04-16-044355,,,,,,,,,,,,Jeffrey Hatcher,no party
2020-04-16-045724,,,,,,,,,,,,Christopher Murray,no party


The mean sentiment scores per topic for each party are computed below. These are the scores on which statistical tests will be performed.

In [None]:
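# Means per party
# numeric_only=True skips the text columns (label, party_name);
# recent pandas versions raise an error on them otherwise
print('Democrats mean scores:')
print(df_nyt_topics[df_nyt_topics.party_name=='democratic party'].mean(numeric_only=True))
print('\nRepublicans mean scores:')
print(df_nyt_topics[df_nyt_topics.party_name=='republican party'].mean(numeric_only=True))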

Democrats mean scores:
immigration_compound_score    0.058077
healthcare_compound_score     0.150884
climate_compound_score        0.027231
trump_compound_score          0.100460
abortion_compound_score       0.056584
women_right_compound_score   -0.036573
violence_compound_score      -0.256246
racism_compound_score         0.030768
war_compound_score           -0.189838
tax_compound_score            0.118945
coal_compound_score           0.106741
dtype: float64

Republicans mean scores:
immigration_compound_score   -0.001884
healthcare_compound_score     0.144360
climate_compound_score        0.043240
trump_compound_score          0.108766
abortion_compound_score       0.053618
women_right_compound_score   -0.062868
violence_compound_score      -0.345475
racism_compound_score        -0.079219
war_compound_score           -0.169192
tax_compound_score            0.123028
coal_compound_score           0.074485
dtype: float64


We can now test whether the sentiment analysis scores differ between the two parties by running a Student's t-test on the scores for each topic.

For each topic, the parameters of the t-test are:

- Variables: sentiment analysis scores for democrats' quotes, sentiment analysis scores for republicans' quotes
- Null hypothesis $H_0$: the mean scores of the two parties are equal (the two parties have the same opinion about the topic)
- Alternative hypothesis $H_A$: the mean scores of the two parties differ (the two parties have different opinions about the topic)
- Significance level: $\alpha = 0.05$

In [None]:
# Run ttest
display(HTML(sa.run_ttest(df_nyt_topics)))

Topic,t-statistic,p-value,Same opinion?
Immigration,2.6035,0.0093,❌
Healthcare,0.3883,0.6978,✅
Climate,-0.4239,0.6717,✅
Trump,-0.6619,0.508,✅
Abortion,0.272,0.7856,✅
Women_right,1.6035,0.1089,✅
Violence,5.6492,0.0,❌
Racism,5.3928,0.0,❌
War,-1.3633,0.1728,✅
Tax,-0.2165,0.8286,✅
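
The implementation of `sa.run_ttest` is not shown here. As a hedged illustration of the statistic behind a two-sample Student's t-test, the sketch below computes the pooled-variance t statistic with the standard library only, on made-up scores. The function name `student_t` and the sample values are hypothetical. The p-value would then come from the t distribution with the returned degrees of freedom; `scipy.stats.ttest_ind` returns both the statistic and the p-value directly.

```python
from math import sqrt
from statistics import mean, variance


def student_t(sample_a, sample_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Pooled estimate of the common variance (equal-variance assumption)
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, n_a + n_b - 2


# Hypothetical scores for one topic, NaN entries already removed
democrat_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
republican_scores = [0.2, 0.3, 0.4, 0.5, 0.6]

t_stat, dof = student_t(democrat_scores, republican_scores)
print(f"t = {t_stat:.4f} with {dof} degrees of freedom")
```

With the democrats' sample passed first, a positive statistic means a higher democratic mean, which is consistent with the signs in the table above (for instance immigration, where the democratic mean is the larger one).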


We can now visualize the results of the sentiment analysis by plotting the mean sentiment score and standard deviation for each topic, for both political parties. 

In [None]:
# Plotting the mean and std per topic sentiment score, Republican vs Democrats
fig = pu.plot_mean_sentiment_scores_per_party(
    df_nyt_topics,
    title = 'Mean score per topic for the sentiment analysis (NYT)',
    filename = os.path.join(paths.FIGS_DIR, 'nyt_sentiment_scores_parties.html')
)
fig.show()

### Topic distribution over the years

Now let's see how much the NYT talked about the different topics each year, and compare this to how much the Republicans and Democrats talked about the same topics.

First we merge our dataframe with the dates of the quotes.

In [None]:
# Merge selected quotes with the dates
df_nyt_topics = df_nyt_topics.merge(
    df_nyt['date'], how='inner', left_index=True, right_index=True
)
df_nyt_topics['date'] = pd.to_datetime(df_nyt_topics['date'])

Then, we count the number of quotes per period of time for each topic; here, per month. We create 3 new dataframes: one with the counts for the whole journal, one with the counts of quotes from democratic speakers, and one with the counts of quotes from republican speakers. To be able to compare these counts between dataframes later on, we also normalize them by the total number of quotes per topic in each dataframe.

In [None]:
# Count the number of quotes per topic for the whole journal,
# for democrats quotes only, and for republicans quotes only
quotes_over_the_years_nyt = df_nyt_topics.groupby(df_nyt_topics.date.dt.to_period('M')).count()
quotes_over_the_years_R = df_nyt_topics[df_nyt_topics.party_name == 'republican party'].drop(
    ['label', 'party_name', 'date'], axis=1).groupby(df_nyt_topics.date.dt.to_period('M')).count()
quotes_over_the_years_D = df_nyt_topics[df_nyt_topics.party_name == 'democratic party'].drop(
    ['label', 'party_name', 'date'], axis=1).groupby(df_nyt_topics.date.dt.to_period('M')).count()

In [None]:
# Normalize the counts
sum_NYT = quotes_over_the_years_nyt.sum(axis=0)
quotes_over_the_years_nyt = quotes_over_the_years_nyt.transform(
    axis=1, func=lambda x: x / sum_NYT
)

sum_R = quotes_over_the_years_R.sum(axis=0)
quotes_over_the_years_R = quotes_over_the_years_R.transform(
    axis=1, func=lambda x: x / sum_R
)

sum_D = quotes_over_the_years_D.sum(axis=0)
quotes_over_the_years_D = quotes_over_the_years_D.transform(
    axis=1, func=lambda x: x / sum_D
)

In [None]:
# Checking that the sum is 1
quotes_over_the_years_nyt.sum()

immigration_compound_score    1.0
healthcare_compound_score     1.0
climate_compound_score        1.0
trump_compound_score          1.0
abortion_compound_score       1.0
women_right_compound_score    1.0
violence_compound_score       1.0
racism_compound_score         1.0
war_compound_score            1.0
tax_compound_score            1.0
coal_compound_score           1.0
label                         1.0
party_name                    1.0
date                          1.0
dtype: float64

In [None]:
quotes_over_the_years_nyt.head()

date,immigration_compound_score,healthcare_compound_score,climate_compound_score,trump_compound_score,abortion_compound_score,women_right_compound_score,violence_compound_score,racism_compound_score,war_compound_score,tax_compound_score,coal_compound_score,label,party_name,date
2015-01,0.000622,0.00026,0.0,0.00013,0.000684,0.000791,0.000471,0.000581,0.00033,0.0,0.0,0.000469,0.000469,0.000469
2015-02,0.000622,0.00052,0.0,0.000519,0.000684,0.000967,0.000707,0.000436,0.00022,0.000529,0.000156,0.000577,0.000577,0.000577
2015-03,0.0,0.00026,0.000632,0.0,0.000442,0.000703,0.000236,0.000436,0.00022,0.0,0.000312,0.000682,0.000682,0.000682
2015-04,0.0,0.00026,0.000632,0.00039,0.000603,0.000264,0.000589,0.000727,0.00022,0.000529,0.000156,0.000492,0.000492,0.000492
2015-05,0.0,0.00026,0.0,0.00013,0.000483,0.000352,0.000118,0.000436,0.00011,0.0,0.0,0.000393,0.000393,0.000393
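
The counting and normalization steps can be reproduced on a toy dataframe as follows. This is a minimal sketch with made-up dates and scores; it uses `DataFrame.div` as a compact alternative to the `transform` calls above, and the column names are only borrowed for readability.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2015-01-03", "2015-01-20", "2015-02-11", "2015-02-15"]
        ),
        "violence_compound_score": [-0.5, float("nan"), -0.2, 0.1],
        "tax_compound_score": [float("nan"), 0.3, float("nan"), float("nan")],
    }
)

# count() ignores NaN, so each cell is the number of quotes tagged with the topic that month
monthly_counts = (
    toy.drop(columns="date").groupby(toy["date"].dt.to_period("M")).count()
)

# Divide each column by its total so that every topic sums to 1 over time
monthly_share = monthly_counts.div(monthly_counts.sum(axis=0), axis=1)
print(monthly_share)
```

After this normalization, each value is the share of a topic's quotes that fall in a given month, which makes dataframes of very different sizes comparable.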


Now, we can visualize the evolution over the years of the frequency at which the New York Times spoke about a specific topic.

In [None]:
# For the whole NYT
fig = pu.plot_topics_count_stacked(
    df=quotes_over_the_years_nyt,
    journal_name='NYT',
    filename=os.path.join(paths.FIGS_DIR, 'nyt_topics_count.html')
)
fig.show()

We can now plot the same quantity for the Republican and Democratic parties and compare the results with those of the whole New York Times.

We only plot the topics for which the t-test above found significantly different opinions between the political parties. For the New York Times, we thus consider the following topics:
- Immigration
- Violence
- Racism

In [None]:
significant_topics = ['immigration', 'violence', 'racism']

for topic in significant_topics:
    fig = pu.plot_topics_R_vs_D(
        df_democrats=quotes_over_the_years_D,
        df_republicans=quotes_over_the_years_R,
        topic=topic,
        filename=os.path.join(paths.FIGS_DIR, f'NYT_R_VS_D_{topic}.html')
    )
    fig.show()

#### PCA on the speakers' quotations

A way to assess the party-specific sentiment scores computed for each selected quotation is to use a dimensionality reduction technique called **Principal Component Analysis** (PCA). The objective is to project the M-dimensional vectors of scores (one dimension per topic) into a lower-dimensional, substantively meaningful vector space. If the sentiment scores are actually party-specific, we should observe a clear demarcation between the democrats and the republicans on the PCA plot.

To do so, we select the democrat and republican speakers in our dataframe and average their compound scores per topic in another dataframe called `df_nyt_avg_topics`.

In [None]:
# Average the compound scores
df_nyt_avg_topics = sa.create_df_avg_compound_score(df_nyt_topics)
df_nyt_avg_topics

label,party_name,immigration_compound_score,healthcare_compound_score,climate_compound_score,trump_compound_score,abortion_compound_score,women_right_compound_score,violence_compound_score,racism_compound_score,war_compound_score,tax_compound_score,coal_compound_score
Aaron Persky,democratic party,,,,,,,-0.8750,,,,
Aaron Peskin,democratic party,,,,0.00000,,0.07720,,,,0.3182,0.02580
Aaron Peterson,democratic party,,,,,-0.44380,,,,,,
Abdul El-Sayed,democratic party,,,,,,,,0.7264,,,
Abel Maldonado,republican party,,,,,,,,,,,-0.38180
...,...,...,...,...,...,...,...,...,...,...,...,...
Zephyr Teachout,democratic party,,0.126,,0.00000,0.07305,0.07222,-0.7791,,0.0000,0.7351,0.35295
Zev Yaroslavsky,democratic party,0.0772,,,,,,,-0.4404,,-0.3612,
Zina Bash,republican party,,,,,0.49390,0.49390,,,,,
Zoe Lofgren,democratic party,,,,-0.01275,-0.33365,-0.33365,-0.1944,,,-0.5267,
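
The per-speaker averaging can be sketched as a filter followed by a `groupby` on speaker and party. The code of `sa.create_df_avg_compound_score` is not shown in this notebook, so the snippet below is only an assumption about its behaviour, run on made-up data.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "label": ["Speaker A", "Speaker A", "Speaker B", "Speaker C"],
        "party_name": [
            "democratic party",
            "democratic party",
            "republican party",
            "no party",
        ],
        "violence_compound_score": [-0.8, -0.4, 0.2, 0.5],
        "tax_compound_score": [float("nan"), 0.3, float("nan"), 0.1],
    }
)

# Keep only speakers affiliated with one of the two major parties
partisan = toy[toy["party_name"].isin(["democratic party", "republican party"])]

# Average each topic score per speaker; NaN values are skipped by mean()
avg_per_speaker = partisan.groupby(["label", "party_name"]).mean()
print(avg_per_speaker)
```

A speaker who never addressed a topic keeps a NaN in that column, which is why the table above is sparse.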


Now that the dataset has been prepared for the PCA, we can begin the analysis.
Note that we replace all the NaN entries in the data by 0, as PCA cannot handle missing entries. As a consequence, we consider that speakers who do not speak out on a topic have a neutral opinion on it.

In [None]:
# Perform PCA
nyt_pca, df_nyt_pca = sa.pca_analysis(df_nyt_avg_topics, stardardize=True)
df_nyt_pca

label,party_name,PC1,PC2
Aaron Persky,democratic party,1.142326,0.851375
Aaron Peskin,democratic party,-0.452370,0.476579
Aaron Peterson,democratic party,0.582375,-0.590420
Abdul El-Sayed,democratic party,-1.153956,-0.541472
Abel Maldonado,republican party,0.245798,-1.094766
...,...,...,...
Zephyr Teachout,democratic party,-0.142656,3.598684
Zev Yaroslavsky,democratic party,0.708817,-1.378969
Zina Bash,republican party,-1.697075,-0.446643
Zoe Lofgren,democratic party,1.619991,-1.712572
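
For readers who want to see the steps behind such a projection, here is a minimal NumPy sketch: NaN entries are replaced by 0, columns are standardized, and the data is projected onto the leading right singular vectors. It is an illustration under these assumptions, not the code of `sa.pca_analysis`; the function `pca_project` and the input values are hypothetical.

```python
import numpy as np


def pca_project(data, n_components=2, standardize=True):
    """Project rows of `data` onto the first principal components via SVD."""
    X = np.nan_to_num(np.asarray(data, dtype=float), nan=0.0)  # missing -> neutral
    X = X - X.mean(axis=0)                                     # center each column
    if standardize:
        std = X.std(axis=0)
        std[std == 0] = 1.0                                    # avoid division by zero
        X = X / std
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, singular_values, Vt = np.linalg.svd(X, full_matrices=False)
    projected = X @ Vt[:n_components].T
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    return projected, explained_ratio[:n_components]


# Hypothetical speaker-by-topic score matrix with missing entries
scores = np.array(
    [
        [0.5, np.nan, -0.2],
        [-0.3, 0.1, np.nan],
        [0.0, 0.4, 0.6],
        [0.7, -0.5, 0.1],
    ]
)
projected, explained = pca_project(scores, n_components=2)
print(projected.shape, explained)
```

Checking the explained variance ratio is a useful companion to the scatter plot: if the first components capture little variance, a 2D view can hide structure that exists in the remaining dimensions.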


Here are the results of the PCA using 2 components, for visualization purposes.

In [None]:
# Plot components 
pu.plot_scatter_pca(
    df_nyt_pca,
    title='Principal component 1 vs principal component 2 for New York Times',
    filename=os.path.join(paths.FIGS_DIR, 'nyt_pca_2d.html'),
)

The results are not the ones we expected: on the components plot, no demarcation between democrats and republicans is visible.
Our interpretation is that the sentiment scores on each topic fail to distinguish the two major parties, meaning that they are not party-specific.

Let's observe the results when projecting the scores onto 3 components.

In [None]:
# PCA with 3 components
nyt_pca, df_nyt_pca = sa.pca_analysis(
    df_nyt_avg_topics, n_components=3, stardardize=True
)
pu.plot_scatter_pca(
    df_nyt_pca,
    title='Principal components PC1 vs PC2 vs PC3 for New York Times',
    filename=os.path.join(paths.FIGS_DIR, 'nyt_pca_3d.html'),
)

The results obtained by projecting our data onto 3 components are the same as the previous ones, as expected: there is still no demarcation between the two parties.

#### Limitations

We see, for example, on the PCA that Donald Trump has a component score of approximately 0. This is surprising, as we would have expected his opinions to be strongly oriented towards the republican side. In addition, all his compound scores are close to zero, showing an overall neutrality on almost all topics.

This shows the major limitation of the approach based on sentiment analysis. Sentiment analysis is not optimal for measuring opinions towards a topic, since it only considers the positive and negative words in the quotations, without taking the context into account. Therefore, two quotations with opposite meanings on the same topic can have the same sentiment score, simply because both use positive key words or, on the contrary, negative ones.

In this example, we assume that Donald Trump is far from neutral and holds a very positive or very negative opinion on each topic. However, when the sentiment scores of his quotations are averaged, they cancel each other out and yield a near-neutral score, which does not reflect his actual positions.

In [None]:
# Compound scores per topic for Donald Trump
df_nyt_avg_topics.loc['Donald Trump']

party_name,immigration_compound_score,healthcare_compound_score,climate_compound_score,trump_compound_score,abortion_compound_score,women_right_compound_score,violence_compound_score,racism_compound_score,war_compound_score,tax_compound_score,coal_compound_score
republican party,-0.038595,0.090076,0.048381,0.09546,0.039805,-0.095797,-0.376245,-0.066973,-0.174575,0.10198,0.047985
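
The limitation described above can be reproduced with a deliberately simplistic word-level scorer. The lexicon `toy_sentiment`, the function `toy_score` and the two quotes are all made up for the illustration and are much cruder than the sentiment tool used in this notebook; the point is only that a purely lexical score is blind to the stance taken.

```python
# Hypothetical word-level sentiment lexicon (real tools use far larger ones)
toy_sentiment = {"great": 1.0, "love": 1.0, "terrible": -1.0, "disaster": -1.0}


def toy_score(quote):
    """Average the lexicon scores of the sentiment words found in the quote."""
    words = [word.strip(".,!?").lower() for word in quote.split()]
    hits = [toy_sentiment[word] for word in words if word in toy_sentiment]
    return sum(hits) / len(hits) if hits else 0.0


# Two made-up quotes with opposite stances on the same topic
supporting = "Raising this tax is a great idea."
opposing = "Repealing this tax would be a great victory."

print(toy_score(supporting), toy_score(opposing))
```

Both quotes receive the same positive score although they defend opposite positions, so averaging such scores per party cannot be expected to separate the parties on that topic.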


In [None]:
# Delete useless variables to free memory
variables = list(globals().keys())
for variable in variables:
    if 'nyt' in variable:
        print('Delete', variable)
        del globals()[variable]
gc.collect()

Delete df_nyt
Delete df_nyt_unique_speakers
Delete df_nyt_speakers_party
Delete df_nyt_topics
Delete quotes_over_the_years_nyt
Delete df_nyt_avg_topics
Delete nyt_pca
Delete df_nyt_pca


151641

## PART B: CNN

In this part, we perform the same preprocessing and similar analyses as the ones we did on the *New York Times*, but this time on the quotations coming from CNN.

Since all the steps were commented in Part A, and since we apply the same functions here, we will limit the comments.

### Data preprocessing

In [None]:
# Create dataframe of quotes
df_cnn = dff.create_df_from_bz2_dir(paths.CNN_DIR)
df_cnn

Load bz2 files: 100%|██████████| 6/6 [02:06<00:00, 21.13s/file]


quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
2015-01-01-000553,a number of ribs and bones in his face.,Harry Reid,"[Q19650494, Q21466700, Q314459, Q5671912]",2015-01-01 18:24:34,47,"[[Harry Reid, 0.7903], [None, 0.2062], [Presid...",[http://www.koco.com/news/images-new-years-day...,E
2015-01-01-000590,a profound lesson should be learned,Xi Jinping,[Q15031],2015-01-01 09:26:06,19,"[[Xi Jinping, 0.9223], [None, 0.0772], [Steven...",[http://rss.cnn.com/~r/rss/cnn_topstories/~3/m...,E
2015-01-01-001105,All cancers are caused by a combination of bad...,Bert Vogelstein,[Q827502],2015-01-01 05:00:00,69,"[[Bert Vogelstein, 0.8643], [None, 0.1259], [C...",[http://www.eurekalert.org/pub_releases/2015-0...,E
2015-01-01-001334,"Also, the timing of the key movement into the ...",Alan Adler,"[Q4706057, Q47545678]",2015-01-01 21:26:55,46,"[[Alan Adler, 0.4895], [Alan Alder, 0.4792], [...",[http://www.reuters.com/article/2015/01/01/gm-...,E
2015-01-01-001457,an environment that is objectively and subject...,,[],2015-01-01 20:30:58,3,"[[None, 0.8312], [Ryan Cassata, 0.1688]]",[http://www.cnn.com/2015/01/01/us/transgender-...,E
...,...,...,...,...,...,...,...,...
2020-04-16-068668,You had gatherings where people had cookouts a...,,[],2020-04-16 03:51:09,7,"[[None, 0.8507], [John early, 0.1493]]",[http://rss.cnn.com/~r/rss/cnn_topstories/~3/w...,E
2020-04-16-068708,You have this digital file and you can just se...,,[],2020-04-16 10:16:15,2,"[[None, 0.7624], [Anthony Costa, 0.2376]]",[http://cnn.com/2020/04/16/tech/coronavirus-me...,E
2020-04-16-068837,"You know, William,",William Eggleston,[Q389912],2020-04-16 08:19:12,1,"[[William Eggleston, 0.5114], [None, 0.4886]]",[https://www.cnn.com/style/article/william-egg...,E
2020-04-16-069081,"You will have no in-person crowds,",Roy Cooper,"[Q16106910, Q7372694, Q7372695]",2020-04-16 01:52:21,2,"[[Roy Cooper, 0.9108], [None, 0.0892]]",[http://rss.cnn.com/~r/rss/cnn_topstories/~3/n...,E


In [None]:
# Clean dataframe
dc.drop_useless_columns(df_cnn)
dc.remove_abnormalities(df_cnn)
dc.convert_columns_type(df_cnn)

In [None]:
# Add tokens column
df_cnn = dff.add_col_tokens_from_bz2(df_cnn, paths.CNN_TOKENS_PATH)

# Drop -PRON- tokens
dc.drop_pron_tokens(df_cnn)

df_cnn

100%|██████████| 597820/597820 [00:05<00:00, 116561.91it/s]


quoteID,quotation,speaker,qids,date,numOccurrences,tokens
2015-01-01-000553,a number of ribs and bones in his face.,Harry Reid,"[Q19650494, Q21466700, Q314459, Q5671912]",2015-01-01 18:24:34,47,"[number, rib, bone, face]"
2015-01-01-000590,a profound lesson should be learned,Xi Jinping,[Q15031],2015-01-01 09:26:06,19,"[profound, lesson, learn]"
2015-01-01-001105,All cancers are caused by a combination of bad...,Bert Vogelstein,[Q827502],2015-01-01 05:00:00,69,"[cancer, cause, combination, bad, luck, enviro..."
2015-01-01-001334,"Also, the timing of the key movement into the ...",Alan Adler,"[Q4706057, Q47545678]",2015-01-01 21:26:55,46,"[timing, key, movement, accessory, position, r..."
2015-01-01-001457,an environment that is objectively and subject...,,[],2015-01-01 20:30:58,3,"[environment, objectively, subjectively, hosti..."
...,...,...,...,...,...,...
2020-04-16-068668,You had gatherings where people had cookouts a...,,[],2020-04-16 03:51:09,7,"[gathering, people, cookout, thing, think, peo..."
2020-04-16-068708,You have this digital file and you can just se...,,[],2020-04-16 10:16:15,2,"[digital, file, send, people, hit, print, like..."
2020-04-16-068837,"You know, William,",William Eggleston,[Q389912],2020-04-16 08:19:12,1,"[know, William]"
2020-04-16-069081,"You will have no in-person crowds,",Roy Cooper,"[Q16106910, Q7372694, Q7372695]",2020-04-16 01:52:21,2,"[person, crowd]"


In [None]:
# Create dataframe with identified speakers
df_cnn_unique_speakers = dff.create_df_unique_speakers(df_cnn)
df_cnn_unique_speakers

100%|██████████| 391542/391542 [00:00<00:00, 486745.34it/s]


quoteID,quotation,speaker,date,numOccurrences,tokens,qid
2015-01-01-000553,a number of ribs and bones in his face.,Harry Reid,2015-01-01 18:24:34,47,"[number, rib, bone, face]",Q19650494
2015-01-01-000590,a profound lesson should be learned,Xi Jinping,2015-01-01 09:26:06,19,"[profound, lesson, learn]",Q15031
2015-01-01-001105,All cancers are caused by a combination of bad...,Bert Vogelstein,2015-01-01 05:00:00,69,"[cancer, cause, combination, bad, luck, enviro...",Q827502
2015-01-01-001334,"Also, the timing of the key movement into the ...",Alan Adler,2015-01-01 21:26:55,46,"[timing, key, movement, accessory, position, r...",Q4706057
2015-01-01-002011,"Any time I get with the president, I will try ...",Kirk Caldwell,2015-01-01 01:00:02,8,"[time, president, try, good, plug, site]",Q6415403
...,...,...,...,...,...,...
2020-04-16-066231,When we get to the point where people go to wo...,Gina Raimondo,2020-04-16 01:52:21,2,"[point, people, work, kid, school, way, able, ...",Q5562913
2020-04-16-066368,"When you look at the dye,",William Eggleston,2020-04-16 08:19:12,1,"[look, dye]",Q389912
2020-04-16-067487,"With homeschooling, I am a kind of a task mast...",Mikie Sherrill,2020-04-16 12:05:41,3,"[homeschooling, kind, task, master, desk, comp...",Q47087146
2020-04-16-068837,"You know, William,",William Eggleston,2020-04-16 08:19:12,1,"[know, William]",Q389912


In [None]:
# Merge speakers
df_cnn_speakers_party = pf.merge_quotes_speakers(
    df_cnn_unique_speakers, df_speakers_us_party)
df_cnn_speakers_party

quoteID,quotation,speaker,date,numOccurrences,tokens,qid,aliases,US_congress_bio_ID,label,party_name
2015-01-01-001105,All cancers are caused by a combination of bad...,Bert Vogelstein,2015-01-01 05:00:00,69,"[cancer, cause, combination, bad, luck, enviro...",Q827502,,,Bert Vogelstein,no party
2015-01-01-024137,"Our study shows, in general, that a change in ...",Bert Vogelstein,2015-01-01 05:00:00,37,"[study, general, change, number, stem, cell, d...",Q827502,,,Bert Vogelstein,no party
2015-01-02-038741,The actual contribution of these random mistak...,Bert Vogelstein,2015-01-02 14:49:01,27,"[actual, contribution, random, mistake, cancer...",Q827502,,,Bert Vogelstein,no party
2015-01-02-041237,"The more these mutations accumulate, the highe...",Bert Vogelstein,2015-01-02 14:49:01,27,"[mutation, accumulate, high, risk, cell, grow,...",Q827502,,,Bert Vogelstein,no party
2017-03-23-007828,"And it will kill 600,000 of us,",Bert Vogelstein,2017-03-23 18:00:15,25,[kill],Q827502,,,Bert Vogelstein,no party
...,...,...,...,...,...,...,...,...,...,...
2020-04-15-030604,Isolationism: A History of America's Efforts t...,Charles A. Kupchan,2020-04-15 04:15:28,1,"[isolationism, history, America, effort, shiel...",Q1063476,,,Charles A. Kupchan,no party
2020-04-16-000673,"a great poet of the color red,",Donna Tartt,2020-04-16 08:19:12,1,"[great, poet, color, red]",Q255339,,,Donna Tartt,no party
2020-04-16-030138,It's pretty obvious that the governor of Calif...,Dan Walters,2020-04-16 18:21:17,2,"[pretty, obvious, governor, California, want, ...",Q5214550,,,Dan Walters,no party
2020-04-16-038073,"perfectly boring, certainly.",Hilton Kramer,2020-04-16 08:19:12,1,"[perfectly, bore, certainly]",Q5764634,,,Hilton Kramer,no party


### Visualization

In [None]:
# Barplot top speakers
fig = pu.plot_bar_top_speakers(
    df_cnn_speakers_party,
    title='Top 10 speakers for CNN between 2015 and 2020',
    filename=os.path.join(paths.FIGS_DIR, 'cnn_bar_top_speakers.html'),
)
fig

In [None]:
# Pie chart top speakers
fig = pu.plot_pie_top_speakers(
    df_cnn_speakers_party,
    title='Top 10 speakers for CNN between 2015 and 2020',
    filename=os.path.join(paths.FIGS_DIR, 'cnn_pie_top_speakers.html'),
)
fig

In [None]:
# Pie chart proportion of parties
fig = pu.plot_pie_parties(
    df_cnn_speakers_party,
    title='Proportions of parties for CNN',
    filename=os.path.join(paths.FIGS_DIR, 'cnn_pie_parties.html'),
)
fig

### Sentiment analysis

In [None]:
# Add compound score column
sa.add_col_compound_score(df_cnn_speakers_party)
df_cnn_speakers_party

100%|██████████| 243740/243740 [01:16<00:00, 3201.05it/s]


quoteID,quotation,speaker,date,numOccurrences,tokens,qid,aliases,US_congress_bio_ID,label,party_name,compound_score
2015-01-01-001105,All cancers are caused by a combination of bad...,Bert Vogelstein,2015-01-01 05:00:00,69,"[cancer, cause, combination, bad, luck, enviro...",Q827502,,,Bert Vogelstein,no party,-0.2960
2015-01-01-024137,"Our study shows, in general, that a change in ...",Bert Vogelstein,2015-01-01 05:00:00,37,"[study, general, change, number, stem, cell, d...",Q827502,,,Bert Vogelstein,no party,-0.6249
2015-01-02-038741,The actual contribution of these random mistak...,Bert Vogelstein,2015-01-02 14:49:01,27,"[actual, contribution, random, mistake, cancer...",Q827502,,,Bert Vogelstein,no party,-0.7845
2015-01-02-041237,"The more these mutations accumulate, the highe...",Bert Vogelstein,2015-01-02 14:49:01,27,"[mutation, accumulate, high, risk, cell, grow,...",Q827502,,,Bert Vogelstein,no party,-0.7579
2017-03-23-007828,"And it will kill 600,000 of us,",Bert Vogelstein,2017-03-23 18:00:15,25,[kill],Q827502,,,Bert Vogelstein,no party,-0.6908
...,...,...,...,...,...,...,...,...,...,...,...
2020-04-15-030604,Isolationism: A History of America's Efforts t...,Charles A. Kupchan,2020-04-15 04:15:28,1,"[isolationism, history, America, effort, shiel...",Q1063476,,,Charles A. Kupchan,no party,0.1027
2020-04-16-000673,"a great poet of the color red,",Donna Tartt,2020-04-16 08:19:12,1,"[great, poet, color, red]",Q255339,,,Donna Tartt,no party,0.6249
2020-04-16-030138,It's pretty obvious that the governor of Calif...,Dan Walters,2020-04-16 18:21:17,2,"[pretty, obvious, governor, California, want, ...",Q5214550,,,Dan Walters,no party,0.7184
2020-04-16-038073,"perfectly boring, certainly.",Hilton Kramer,2020-04-16 08:19:12,1,"[perfectly, bore, certainly]",Q5764634,,,Hilton Kramer,no party,0.6486
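
The name `compound_score` matches the compound score of NLTK's VADER sentiment analyzer (NLTK is listed in the references); whether `sa.add_col_compound_score` really relies on VADER is an assumption here. Under that assumption, the score is the sum of the word valences squashed into [-1, 1]. The sketch below shows only this normalization step; the helper name `normalize_valence` is illustrative, and `alpha = 15` is VADER's default constant.

```python
import math


def normalize_valence(total_valence: float, alpha: float = 15.0) -> float:
    """Map an unbounded sum of word valences to the interval [-1, 1]."""
    score = total_valence / math.sqrt(total_valence ** 2 + alpha)
    # Clip to guard against floating point overshoot
    return max(-1.0, min(1.0, score))


print(round(normalize_valence(3.1), 4))   # 0.6249
print(round(normalize_valence(-3.1), 4))  # -0.6249
print(normalize_valence(0.0))             # 0.0
```

A summed valence of 3.1 maps to 0.6249, a value that appears in the table above, and a quote without any scored word gets exactly 0.0.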


In [None]:
# Distribution of compound score
pu.plot_hist_compound(
    df_cnn_speakers_party,
    title='Distribution of compound score for CNN',
    filename=os.path.join(paths.FIGS_DIR, 'cnn_hist_compound_score.html'),
)

In [None]:
# Add topics column
tp.add_topics_col(df_cnn_speakers_party, lexicon, constants.TOPICS_DICT.keys())
df_cnn_speakers_party.head()

100%|██████████| 243740/243740 [09:41<00:00, 419.11it/s]


quoteID,quotation,speaker,date,numOccurrences,tokens,qid,aliases,US_congress_bio_ID,label,party_name,compound_score,topics
2015-01-01-001105,All cancers are caused by a combination of bad...,Bert Vogelstein,2015-01-01 05:00:00,69,"[cancer, cause, combination, bad, luck, enviro...",Q827502,,,Bert Vogelstein,no party,-0.296,[coal]
2015-01-01-024137,"Our study shows, in general, that a change in ...",Bert Vogelstein,2015-01-01 05:00:00,37,"[study, general, change, number, stem, cell, d...",Q827502,,,Bert Vogelstein,no party,-0.6249,[]
2015-01-02-038741,The actual contribution of these random mistak...,Bert Vogelstein,2015-01-02 14:49:01,27,"[actual, contribution, random, mistake, cancer...",Q827502,,,Bert Vogelstein,no party,-0.7845,[]
2015-01-02-041237,"The more these mutations accumulate, the highe...",Bert Vogelstein,2015-01-02 14:49:01,27,"[mutation, accumulate, high, risk, cell, grow,...",Q827502,,,Bert Vogelstein,no party,-0.7579,[]
2017-03-23-007828,"And it will kill 600,000 of us,",Bert Vogelstein,2017-03-23 18:00:15,25,[kill],Q827502,,,Bert Vogelstein,no party,-0.6908,[]
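
The `topics` column lists the topics detected in each quote. The internals of `tp.add_topics_col` are not shown in this notebook; the sketch below assumes that a topic is assigned as soon as one token of the quote belongs to the keyword set of that topic. The keyword sets and the helper name `match_topics` are made up for illustration and are not the project's lexicon.

```python
from typing import Dict, List, Set

# Hypothetical keyword sets; the real lexicon is built elsewhere in the project
TOY_LEXICON: Dict[str, Set[str]] = {
    "healthcare": {"doctor", "patient", "treatment"},
    "tax": {"tax", "irs"},
}


def match_topics(tokens: List[str], lexicon: Dict[str, Set[str]]) -> List[str]:
    """Return the topics whose keyword set shares at least one token."""
    lowered = {token.lower() for token in tokens}
    return [topic for topic, keywords in lexicon.items() if lowered & keywords]


print(match_topics(["help", "lot", "treatment"], TOY_LEXICON))  # ['healthcare']
print(match_topics(["result", "action"], TOY_LEXICON))          # []
```

A quote can match several topics at once, and a quote matching none gets an empty list, as in most rows above.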


In [None]:
# Create topics dataframe
df_cnn_topics = tp.create_df_topics(
    df_cnn_speakers_party, constants.TOPICS_DICT.keys()
)

# Merge the dataframe with speaker and political party infos
df_cnn_topics = df_cnn_topics.merge(
    df_cnn_speakers_party[['label', 'party_name']],
    left_index=True,
    right_index=True,
)

df_cnn_topics.head()

Create df topics: 100%|██████████| 11/11 [00:36<00:00,  3.33s/topic]


quoteID,immigration_compound_score,healthcare_compound_score,climate_compound_score,trump_compound_score,abortion_compound_score,women_right_compound_score,violence_compound_score,racism_compound_score,war_compound_score,tax_compound_score,coal_compound_score,label,party_name
2015-01-01-001105,,,,,,,,,,,-0.296,Bert Vogelstein,no party
2015-01-01-024137,,,,,,,,,,,,Bert Vogelstein,no party
2015-01-02-038741,,,,,,,,,,,,Bert Vogelstein,no party
2015-01-02-041237,,,,,,,,,,,,Bert Vogelstein,no party
2017-03-23-007828,,,,,,,,,,,,Bert Vogelstein,no party
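
In the table above, each `<topic>_compound_score` column holds the compound score of the quote when the topic was detected and NaN otherwise (the first quote, tagged `[coal]`, keeps its score of -0.296 in `coal_compound_score` only). A minimal sketch of that reshaping follows; the function name `topics_to_columns` is hypothetical and the actual `tp.create_df_topics` may be implemented differently.

```python
import pandas as pd


def topics_to_columns(df: pd.DataFrame, topics) -> pd.DataFrame:
    """Spread the compound score over one column per detected topic."""
    out = pd.DataFrame(index=df.index)
    for topic in topics:
        has_topic = df["topics"].apply(lambda found: topic in found)
        # Keep the score where the topic was detected, NaN elsewhere
        out[f"{topic}_compound_score"] = df["compound_score"].where(has_topic)
    return out


toy = pd.DataFrame(
    {"compound_score": [-0.296, 0.4019], "topics": [["coal"], []]},
    index=pd.Index(["q1", "q2"], name="quoteID"),
)
print(topics_to_columns(toy, ["coal", "tax"]))
```

Using NaN rather than 0 for absent topics matters later: NaN values are skipped by `count()` and by averages, whereas a 0 would be read as a neutral quote on the topic.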


In [None]:
# Run ttest
display(HTML(sa.run_ttest(df_cnn_topics)))

Topic,t-statistic,p-value,Same opinion?
Immigration,0.2662,0.7901,✅
Healthcare,1.7822,0.0748,✅
Climate,-1.4023,0.1613,✅
Trump,-2.3317,0.0198,❌
Abortion,-1.4712,0.1412,✅
Women_right,-0.3038,0.7613,✅
Violence,4.7895,0.0,❌
Racism,2.235,0.0255,❌
War,-1.9046,0.0569,✅
Tax,-0.3098,0.7567,✅
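
The "Same opinion?" column is consistent with a 5% significance level: every topic with a p-value below 0.05 is marked ❌. The exact test behind `sa.run_ttest` is not shown here. The sketch below assumes an independent two-sample t-test between the Republican and Democratic scores of each topic, using `scipy.stats.ttest_ind`; the helper name `ttest_by_topic` and the Welch variant (`equal_var=False`) are assumptions.

```python
import pandas as pd
from scipy import stats


def ttest_by_topic(df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Compare Republican and Democratic compound scores topic by topic."""
    rep = df[df["party_name"] == "republican party"]
    dem = df[df["party_name"] == "democratic party"]
    rows = []
    for col in df.columns[df.columns.str.endswith("_compound_score")]:
        # NaN marks quotes that do not mention the topic, so drop them
        t_stat, p_value = stats.ttest_ind(
            rep[col].dropna(), dem[col].dropna(), equal_var=False
        )
        rows.append({
            "topic": col.replace("_compound_score", ""),
            "t_statistic": t_stat,
            "p_value": p_value,
            "same_opinion": p_value >= alpha,
        })
    return pd.DataFrame(rows)


toy = pd.DataFrame({
    "tax_compound_score": [0.5, 0.6, 0.4, 0.55, -0.5, -0.6, -0.4, -0.55],
    "war_compound_score": [0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2],
    "party_name": ["republican party"] * 4 + ["democratic party"] * 4,
})
print(ttest_by_topic(toy))
```

In this toy data the two parties disagree strongly on `tax` and have identical scores on `war`, so only `war` is flagged as "same opinion".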


In [None]:
# Plot the mean and standard deviation of the sentiment score per topic, Republicans vs. Democrats
fig = pu.plot_mean_sentiment_scores_per_party(
    df=df_cnn_topics,
    title='Mean score per topic for the sentiment analysis (CNN)',
    filename=os.path.join(paths.FIGS_DIR, 'cnn_sentiment_scores_parties.html')
)
fig.show()

### Topic distribution over the years

In [None]:
# Merge selected quotes with the dates
df_cnn_topics = pd.merge(
    df_cnn_topics, df_cnn_speakers_party['date'], how='inner',
    left_index=True, right_index=True
)
df_cnn_topics['date'] = pd.to_datetime(df_cnn_topics['date'])

In [None]:
# Count the number of quotes per topic and per month: for the whole journal,
# for Republican quotes only, and for Democratic quotes only
quotes_over_the_years_cnn = df_cnn_topics.groupby(df_cnn_topics.date.dt.to_period('M')).count()
quotes_over_the_years_R = df_cnn_topics[df_cnn_topics.party_name=='republican party'].drop(
    ['label', 'party_name', 'date'], axis=1).groupby(df_cnn_topics.date.dt.to_period('M')).count()
quotes_over_the_years_D = df_cnn_topics[df_cnn_topics.party_name=='democratic party'].drop(
    ['label', 'party_name', 'date'], axis=1).groupby(df_cnn_topics.date.dt.to_period('M')).count()
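
The counting step relies on the fact that `count()` ignores NaN: since a topic column is NaN whenever the quote does not mention the topic, the monthly count of a column is the number of quotes on that topic in that month. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "tax_compound_score": [0.2, np.nan, -0.4, 0.1],
        "war_compound_score": [np.nan, np.nan, 0.3, np.nan],
        "date": pd.to_datetime(
            ["2015-01-03", "2015-01-20", "2015-02-11", "2015-02-28"]
        ),
    }
)

# count() skips NaN, so each cell is the number of quotes on that topic
monthly = (
    toy.drop(columns="date")
    .groupby(toy["date"].dt.to_period("M"))
    .count()
)
print(monthly)
```

Here January has one `tax` quote and no `war` quote, while February has two `tax` quotes and one `war` quote.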

In [None]:
# Normalize the counts
sum_CNN = quotes_over_the_years_cnn.sum(axis=0)
quotes_over_the_years_cnn = quotes_over_the_years_cnn.transform(
    axis=1, func=lambda x: x / sum_CNN
)

sum_R = quotes_over_the_years_R.sum(axis=0)
quotes_over_the_years_R = quotes_over_the_years_R.transform(
    axis=1, func=lambda x: x / sum_R
)

sum_D = quotes_over_the_years_D.sum(axis=0)
quotes_over_the_years_D = quotes_over_the_years_D.transform(
    axis=1, func=lambda x: x / sum_D
)
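
The normalization divides every column by its own total, so that the monthly shares of each topic sum to 1. The same operation can also be written with `DataFrame.div`, as in this sketch on made-up counts:

```python
import pandas as pd

counts = pd.DataFrame(
    {"tax": [1, 3], "war": [2, 2]},
    index=pd.PeriodIndex(["2015-01", "2015-02"], freq="M", name="date"),
)

# Divide every column by its own total: each topic's monthly shares sum to 1
shares = counts.div(counts.sum(axis=0), axis="columns")
print(shares)
```

Normalizing per topic removes the difference in overall volume between topics, so the resulting curves compare when a topic is discussed rather than how often in absolute terms.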

In [None]:
# Checking that the sum is 1
quotes_over_the_years_cnn.sum()

immigration_compound_score    1.0
healthcare_compound_score     1.0
climate_compound_score        1.0
trump_compound_score          1.0
abortion_compound_score       1.0
women_right_compound_score    1.0
violence_compound_score       1.0
racism_compound_score         1.0
war_compound_score            1.0
tax_compound_score            1.0
coal_compound_score           1.0
label                         1.0
party_name                    1.0
date                          1.0
dtype: float64

In [None]:
quotes_over_the_years_cnn.head()

date,immigration_compound_score,healthcare_compound_score,climate_compound_score,trump_compound_score,abortion_compound_score,women_right_compound_score,violence_compound_score,racism_compound_score,war_compound_score,tax_compound_score,coal_compound_score,label,party_name,date
2015-01,0.007494,0.007115,0.011664,0.007695,0.01153,0.010011,0.015,0.008795,0.013647,0.010515,0.011115,0.010634,0.010634,0.010634
2015-02,0.012563,0.013901,0.006221,0.012368,0.012216,0.010454,0.011517,0.014169,0.016935,0.009727,0.007893,0.011241,0.011241,0.011241
2015-03,0.007274,0.008538,0.006221,0.00907,0.012901,0.013997,0.014323,0.013192,0.013071,0.006835,0.009182,0.01198,0.01198,0.01198
2015-04,0.009918,0.011165,0.014774,0.01223,0.015401,0.016478,0.016355,0.012215,0.017593,0.008675,0.011759,0.01246,0.01246,0.01246
2015-05,0.008816,0.008209,0.007776,0.00907,0.011369,0.00877,0.013355,0.009609,0.015209,0.007361,0.009182,0.010441,0.010441,0.010441


In [None]:
# For the whole journal
fig = pu.plot_topics_count_stacked(
    df=quotes_over_the_years_cnn,
    journal_name='CNN',
    filename=os.path.join(paths.FIGS_DIR, 'cnn_topics_count.html')
)
fig.show()

In [None]:
significant_topics = ['trump', 'violence', 'racism', 'coal']

for topic in significant_topics:
    fig = pu.plot_topics_R_vs_D(
        df_democrats=quotes_over_the_years_D,
        df_republicans=quotes_over_the_years_R,
        topic=topic,
        filename=os.path.join(paths.FIGS_DIR, f'CNN_R_VS_D_{topic}.html')
    )
    fig.show()

#### PCA on the speakers' quotations

In [None]:
# PCA
df_cnn_avg_topics = sa.create_df_avg_compound_score(df_cnn_topics)
cnn_pca, df_cnn_pca = sa.pca_analysis(df_cnn_avg_topics, stardardize=True)
pu.plot_scatter_pca(
    df_cnn_pca,
    title='Principal component 1 vs principal component 2 for CNN',
    filename=os.path.join(paths.FIGS_DIR, 'cnn_pca_2d.html'),
)
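
The internals of `sa.pca_analysis` are not shown in this notebook. As a rough sketch of what a standardized PCA on per-speaker average scores involves, the code below uses plain NumPy. The function name `pca_2d` and the choice of replacing missing averages with 0 are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd


def pca_2d(df_avg: pd.DataFrame):
    """Project standardized per-speaker topic scores on two components."""
    # Assumption: a speaker with no quote on a topic is treated as neutral (0)
    x = df_avg.fillna(0.0).to_numpy(dtype=float)
    # Standardize each topic column to zero mean and unit variance
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance
    _, singular, vt = np.linalg.svd(x, full_matrices=False)
    explained = singular ** 2 / np.sum(singular ** 2)
    coords = pd.DataFrame(
        x @ vt[:2].T, index=df_avg.index, columns=["PC1", "PC2"]
    )
    return coords, explained[:2]


toy = pd.DataFrame(
    {
        "tax": [0.5, -0.4, 0.1, np.nan],
        "war": [0.3, -0.2, 0.0, -0.6],
        "trump": [-0.1, 0.4, 0.2, 0.3],
    },
    index=["speaker A", "speaker B", "speaker C", "speaker D"],
)
coords, ratio = pca_2d(toy)
print(coords)
print(ratio)
```

Standardizing first prevents a topic with a wider score range from dominating the first component; `ratio` gives the share of variance carried by each of the two plotted axes.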

In [None]:
# Delete the CNN variables that are no longer needed, to free memory
variables = list(globals().keys())
for variable in variables:
    if 'cnn' in variable:
        print('Delete', variable)
        del globals()[variable]
gc.collect()

Delete df_cnn
Delete df_cnn_unique_speakers
Delete df_cnn_speakers_party
Delete df_cnn_topics
Delete quotes_over_the_years_cnn
Delete df_cnn_avg_topics
Delete cnn_pca
Delete df_cnn_pca


106742

## Part C: FOX News

### Data preprocessing

In [None]:
# Create dataframe of quotes
df_fox = dff.create_df_from_bz2_dir(paths.FOX_DIR)
df_fox

Load bz2 files: 100%|██████████| 6/6 [03:06<00:00, 31.10s/file]


quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
2015-01-01-000076,"2015 equals Mommy, Daddy, Carmen and a special...",,[],2015-01-01 17:18:35,29,"[[None, 0.4046], [Alec Baldwin, 0.3687], [Hila...",[http://feeds.nydailynews.com/~r/nydnrss/enter...,E
2015-01-01-000655,A `shining city' is perhaps all the president ...,,[],2015-01-01 12:00:00,127,"[[None, 0.603], [Mario Cuomo, 0.2895], [Matild...",[http://www.sacbee.com/news/nation-world/artic...,E
2015-01-01-001082,"Alcoholics come in many forms,",George Koob,[Q5541413],2015-01-01 16:57:13,142,"[[George Koob, 0.8807], [None, 0.1193]]",[http://www.bostonherald.com/news_opinion/nati...,E
2015-01-01-001128,"All I heard was like this `pop, pop, pop, pop....",,[],2015-01-01 15:14:59,40,"[[None, 0.6614], [Steve Adair, 0.3386]]",[http://thechronicleherald.ca/canada/1260375-s...,E
2015-01-01-001254,All those trying to move up fell down on the s...,,[],2015-01-01 01:22:54,223,"[[None, 0.558], [Wu Tao, 0.4256], [Chen Yi, 0....",[http://kwwl.com/story/27739130/35-killed-43-i...,E
...,...,...,...,...,...,...,...,...
2020-04-16-068808,"You know, that mentality, I think, is really p...",Mike Rowe,"[Q3313524, Q455808, Q6848639]",2020-04-16 00:00:00,1,"[[Mike Rowe, 0.9549], [None, 0.0451]]",[https://www.foxnews.com/media/mike-rowe-ameri...,E
2020-04-16-069006,"you stay at home, unless you are getting groce...",Andy Beshear,[Q21572825],2020-04-16 00:00:00,1,"[[Andy Beshear, 0.5395], [None, 0.4228], [Gov....",[https://www.foxnews.com/us/coronavirus-stay-a...,E
2020-04-16-069202,Your wife's name is Mother Earth. And she is w...,Drew Barrymore,[Q676094],2020-04-16 00:00:00,1,"[[Drew Barrymore, 0.6503], [None, 0.3497]]",[http://feeds.foxnews.com/~r/foxnews/entertain...,E
2020-04-16-069247,You're hurting too much so it wasn't going to ...,Paul McCartney,[Q2599],2020-04-16 00:00:00,1,"[[Paul McCartney, 0.8092], [None, 0.1616], [Jo...",[https://www.foxnews.com/entertainment/paul-mc...,E


In [None]:
# Clean dataframe
dc.drop_useless_columns(df_fox)
dc.remove_abnormalities(df_fox)
dc.convert_columns_type(df_fox)

In [None]:
# Add tokens column
df_fox = dff.add_col_tokens_from_bz2(df_fox, paths.FOX_TOKENS_PATH)

# Drop -PRON- tokens
dc.drop_pron_tokens(df_fox)

df_fox

100%|██████████| 679319/679319 [00:04<00:00, 153244.40it/s]


quoteID,quotation,speaker,qids,date,numOccurrences,tokens
2015-01-01-000076,"2015 equals Mommy, Daddy, Carmen and a special...",,[],2015-01-01 17:18:35,29,"[equal, Mommy, Daddy, Carmen, special, guest, ..."
2015-01-01-000655,A `shining city' is perhaps all the president ...,,[],2015-01-01 12:00:00,127,"[shine, city, president, portico, White, House..."
2015-01-01-001082,"Alcoholics come in many forms,",George Koob,[Q5541413],2015-01-01 16:57:13,142,"[alcoholic, come, form]"
2015-01-01-001128,"All I heard was like this `pop, pop, pop, pop....",,[],2015-01-01 15:14:59,40,"[hear, like, pop, pop, pop, pop, half, asleep,..."
2015-01-01-001254,All those trying to move up fell down on the s...,,[],2015-01-01 01:22:54,223,"[try, fall, stair]"
...,...,...,...,...,...,...
2020-04-16-068808,"You know, that mentality, I think, is really p...",Mike Rowe,"[Q3313524, Q455808, Q6848639]",2020-04-16 00:00:00,1,"[know, mentality, think, powerful, sorry, ment..."
2020-04-16-069006,"you stay at home, unless you are getting groce...",Andy Beshear,[Q21572825],2020-04-16 00:00:00,1,"[stay, home, grocery, supply, need]"
2020-04-16-069202,Your wife's name is Mother Earth. And she is w...,Drew Barrymore,[Q676094],2020-04-16 00:00:00,1,"[wife, Mother, Earth, worth, live, day, know, ..."
2020-04-16-069247,You're hurting too much so it wasn't going to ...,Paul McCartney,[Q2599],2020-04-16 00:00:00,1,"[hurt, happen]"
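
`-PRON-` is the placeholder lemma that spaCy 2.x assigns to pronouns, which carries no topical information. How `dc.drop_pron_tokens` is written is not shown here; a minimal sketch of the filtering, with an illustrative helper name, could look like this:

```python
import pandas as pd


def drop_pron(tokens):
    """Remove the '-PRON-' placeholder lemma from a token list."""
    return [token for token in tokens if token != "-PRON-"]


toy = pd.DataFrame({"tokens": [["-PRON-", "hurt", "happen"], ["fear"]]})
toy["tokens"] = toy["tokens"].apply(drop_pron)
print(toy["tokens"].tolist())  # [['hurt', 'happen'], ['fear']]
```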


In [None]:
# Create dataframe with identified speakers
df_fox_unique_speakers = dff.create_df_unique_speakers(df_fox)
df_fox_unique_speakers

100%|██████████| 446195/446195 [00:00<00:00, 477750.81it/s]


quoteID,quotation,speaker,date,numOccurrences,tokens,qid
2015-01-01-001082,"Alcoholics come in many forms,",George Koob,2015-01-01 16:57:13,142,"[alcoholic, come, form]",Q5541413
2015-01-01-001586,And I fear it will be the same in 2015.,Hendrik Vos,2015-01-01 12:44:52,40,[fear],Q3088743
2015-01-01-001700,And making sure that when the relationship bet...,Todd Haymore,2015-01-01 08:15:00,7,"[sure, relationship, United, States, Cuba, cha...",Q28023785
2015-01-01-002255,as a result of his actions.,Rich Rodriguez,2015-01-01 02:36:50,57,"[result, action]",Q7323433
2015-01-01-003085,began to be realized when President Johnson pa...,Hillary Rodham Clinton,2015-01-01 05:08:52,4,"[begin, realize, President, Johnson, pass, Civ...",Q6294
...,...,...,...,...,...,...
2020-04-16-068808,"You know, that mentality, I think, is really p...",Mike Rowe,2020-04-16 00:00:00,1,"[know, mentality, think, powerful, sorry, ment...",Q3313524
2020-04-16-069006,"you stay at home, unless you are getting groce...",Andy Beshear,2020-04-16 00:00:00,1,"[stay, home, grocery, supply, need]",Q21572825
2020-04-16-069202,Your wife's name is Mother Earth. And she is w...,Drew Barrymore,2020-04-16 00:00:00,1,"[wife, Mother, Earth, worth, live, day, know, ...",Q676094
2020-04-16-069247,You're hurting too much so it wasn't going to ...,Paul McCartney,2020-04-16 00:00:00,1,"[hurt, happen]",Q2599
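
Comparing this table with the previous one, quotes without an identified speaker (empty `qids`) are gone and the list column `qids` has become a single `qid`; for Mike Rowe, who has three candidate QIDs, the first one (`Q3313524`) is kept. The sketch below reproduces that behaviour on made-up rows. The actual logic of `dff.create_df_unique_speakers` is not shown and may differ.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "speaker": [None, "Mike Rowe", "Paul McCartney"],
        "qids": [[], ["Q3313524", "Q455808", "Q6848639"], ["Q2599"]],
    },
    index=pd.Index(["q1", "q2", "q3"], name="quoteID"),
)

# Keep quotes that have at least one candidate speaker
identified = toy[toy["qids"].apply(len) > 0].copy()
# Assumption: the first Wikidata QID in the list is the one retained
identified["qid"] = identified["qids"].apply(lambda qids: qids[0])
identified = identified.drop(columns="qids")
print(identified)
```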


In [None]:
# Merge speakers
df_fox_speakers_party = pf.merge_quotes_speakers(
    df_fox_unique_speakers, df_speakers_us_party)
df_fox_speakers_party

quoteID,quotation,speaker,date,numOccurrences,tokens,qid,aliases,US_congress_bio_ID,label,party_name
2015-01-01-001082,"Alcoholics come in many forms,",George Koob,2015-01-01 16:57:13,142,"[alcoholic, come, form]",Q5541413,,,George Koob,no party
2016-05-20-058627,It can help doctors accurately measure a patie...,George Koob,2016-05-20 23:06:54,41,"[help, doctor, accurately, measure, patient, d...",Q5541413,,,George Koob,no party
2016-05-20-116536,This can help a lot with the treatment.,George Koob,2016-05-20 23:06:54,42,"[help, lot, treatment]",Q5541413,,,George Koob,no party
2016-05-20-134436,We wanted to make something people would want ...,George Koob,2016-05-20 23:06:54,39,"[want, people, want, wear]",Q5541413,,,George Koob,no party
2015-01-01-002255,as a result of his actions.,Rich Rodriguez,2015-01-01 02:36:50,57,"[result, action]",Q7323433,[RichRod],,Rich Rodriguez,no party
...,...,...,...,...,...,...,...,...,...,...
2020-04-15-079978,"You do your part to show you're clean, and you...",Noah Lyles,2020-04-15 10:04:57,9,"[clean, state, clean, come, test]",Q15989263,,,Noah Lyles,no party
2020-04-15-034880,It's one of the first times in my entire life ...,Jonathan Thomas,2020-04-15 00:00:00,2,"[time, entire, life, receive, end, donation, a...",Q28663071,,,Jonathan Thomas,no party
2020-04-15-073238,"We went from being open with visitors, with no...",Jonathan Thomas,2020-04-15 00:00:00,2,"[open, visitor, restriction, mask, point, comp...",Q28663071,,,Jonathan Thomas,no party
2020-04-15-046383,Rouses has supported a lot of us in the restau...,Johnny Sanchez,2020-04-15 00:00:00,1,"[rous, support, lot, restaurant, community, year]",Q6267655,,,Johnny Sanchez,no party
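
The merge attaches the speaker attributes (`aliases`, `US_congress_bio_ID`, `label`, `party_name`) to each quote through the `qid` column while keeping `quoteID` as index. Merging on a column normally discards the index, so one way to preserve it is to reset it before the merge and restore it afterwards. The sketch below uses made-up identifiers and a simplified speaker table; the join type used by `pf.merge_quotes_speakers` is an assumption.

```python
import pandas as pd

quotes = pd.DataFrame(
    {"speaker": ["Speaker A", "Speaker B"], "qid": ["Q_A", "Q_B"]},
    index=pd.Index(["2015-01-01-000001", "2015-01-01-000002"], name="quoteID"),
)
# Hypothetical speaker table: one row per QID
speakers = pd.DataFrame(
    {
        "qid": ["Q_A", "Q_B"],
        "label": ["Speaker A", "Speaker B"],
        "party_name": ["no party", "democratic party"],
    }
)

merged = (
    quotes.reset_index()                     # keep quoteID as a regular column
    .merge(speakers, on="qid", how="inner")  # attach speaker attributes
    .set_index("quoteID")                    # restore the original index
)
print(merged)
```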


### Visualization

In [None]:
# Barplot top speakers
fig = pu.plot_bar_top_speakers(
    df_fox_speakers_party,
    title='Top 10 speakers for FOX between 2015 and 2020',
    filename=os.path.join(paths.FIGS_DIR, 'fox_bar_top_speakers.html'),
)
fig

In [None]:
# Pie chart top speakers
fig = pu.plot_pie_top_speakers(
    df_fox_speakers_party,
    title='Top 10 speakers for FOX between 2015 and 2020',
    filename=os.path.join(paths.FIGS_DIR, 'fox_pie_top_speakers.html'),
)
fig

In [None]:
# Pie chart proportion of parties
fig = pu.plot_pie_parties(
    df_fox_speakers_party,
    title='Proportions of parties for FOX',
    filename=os.path.join(paths.FIGS_DIR, 'fox_pie_parties.html'),
)
fig

### Sentiment analysis

In [None]:
# Add compound score column
sa.add_col_compound_score(df_fox_speakers_party)
df_fox_speakers_party

100%|██████████| 301144/301144 [01:37<00:00, 3093.14it/s]


quoteID,quotation,speaker,date,numOccurrences,tokens,qid,aliases,US_congress_bio_ID,label,party_name,compound_score
2015-01-01-001082,"Alcoholics come in many forms,",George Koob,2015-01-01 16:57:13,142,"[alcoholic, come, form]",Q5541413,,,George Koob,no party,0.0000
2016-05-20-058627,It can help doctors accurately measure a patie...,George Koob,2016-05-20 23:06:54,41,"[help, doctor, accurately, measure, patient, d...",Q5541413,,,George Koob,no party,0.4019
2016-05-20-116536,This can help a lot with the treatment.,George Koob,2016-05-20 23:06:54,42,"[help, lot, treatment]",Q5541413,,,George Koob,no party,0.4019
2016-05-20-134436,We wanted to make something people would want ...,George Koob,2016-05-20 23:06:54,39,"[want, people, want, wear]",Q5541413,,,George Koob,no party,0.0772
2015-01-01-002255,as a result of his actions.,Rich Rodriguez,2015-01-01 02:36:50,57,"[result, action]",Q7323433,[RichRod],,Rich Rodriguez,no party,0.0000
...,...,...,...,...,...,...,...,...,...,...,...
2020-04-15-079978,"You do your part to show you're clean, and you...",Noah Lyles,2020-04-15 10:04:57,9,"[clean, state, clean, come, test]",Q15989263,,,Noah Lyles,no party,0.6597
2020-04-15-034880,It's one of the first times in my entire life ...,Jonathan Thomas,2020-04-15 00:00:00,2,"[time, entire, life, receive, end, donation, a...",Q28663071,,,Jonathan Thomas,no party,0.6486
2020-04-15-073238,"We went from being open with visitors, with no...",Jonathan Thomas,2020-04-15 00:00:00,2,"[open, visitor, restriction, mask, point, comp...",Q28663071,,,Jonathan Thomas,no party,-0.5267
2020-04-15-046383,Rouses has supported a lot of us in the restau...,Johnny Sanchez,2020-04-15 00:00:00,1,"[rous, support, lot, restaurant, community, year]",Q6267655,,,Johnny Sanchez,no party,0.3182


In [None]:
# Distribution of compound score
pu.plot_hist_compound(
    df_fox_speakers_party,
    title='Distribution of compound score for FOX',
    filename=os.path.join(paths.FIGS_DIR, 'fox_hist_compound_score.html'),
)

In [None]:
# Add topics column
tp.add_topics_col(df_fox_speakers_party, lexicon, constants.TOPICS_DICT.keys())
df_fox_speakers_party

100%|██████████| 301144/301144 [11:40<00:00, 430.13it/s]


quoteID,quotation,speaker,date,numOccurrences,tokens,qid,aliases,US_congress_bio_ID,label,party_name,compound_score,topics
2015-01-01-001082,"Alcoholics come in many forms,",George Koob,2015-01-01 16:57:13,142,"[alcoholic, come, form]",Q5541413,,,George Koob,no party,0.0000,[]
2016-05-20-058627,It can help doctors accurately measure a patie...,George Koob,2016-05-20 23:06:54,41,"[help, doctor, accurately, measure, patient, d...",Q5541413,,,George Koob,no party,0.4019,"[healthcare, abortion]"
2016-05-20-116536,This can help a lot with the treatment.,George Koob,2016-05-20 23:06:54,42,"[help, lot, treatment]",Q5541413,,,George Koob,no party,0.4019,[healthcare]
2016-05-20-134436,We wanted to make something people would want ...,George Koob,2016-05-20 23:06:54,39,"[want, people, want, wear]",Q5541413,,,George Koob,no party,0.0772,[]
2015-01-01-002255,as a result of his actions.,Rich Rodriguez,2015-01-01 02:36:50,57,"[result, action]",Q7323433,[RichRod],,Rich Rodriguez,no party,0.0000,[]
...,...,...,...,...,...,...,...,...,...,...,...,...
2020-04-15-079978,"You do your part to show you're clean, and you...",Noah Lyles,2020-04-15 10:04:57,9,"[clean, state, clean, come, test]",Q15989263,,,Noah Lyles,no party,0.6597,[]
2020-04-15-034880,It's one of the first times in my entire life ...,Jonathan Thomas,2020-04-15 00:00:00,2,"[time, entire, life, receive, end, donation, a...",Q28663071,,,Jonathan Thomas,no party,0.6486,[abortion]
2020-04-15-073238,"We went from being open with visitors, with no...",Jonathan Thomas,2020-04-15 00:00:00,2,"[open, visitor, restriction, mask, point, comp...",Q28663071,,,Jonathan Thomas,no party,-0.5267,[]
2020-04-15-046383,Rouses has supported a lot of us in the restau...,Johnny Sanchez,2020-04-15 00:00:00,1,"[rous, support, lot, restaurant, community, year]",Q6267655,,,Johnny Sanchez,no party,0.3182,[]


In [None]:
# Create topics dataframe
df_fox_topics = tp.create_df_topics(
    df_fox_speakers_party, constants.TOPICS_DICT.keys()
)

# Merge the dataframe with speaker and political party infos
df_fox_topics = df_fox_topics.merge(
    df_fox_speakers_party[['label', 'party_name']],
    left_index=True,
    right_index=True,
)

df_fox_topics

Create df topics: 100%|██████████| 11/11 [00:42<00:00,  3.83s/topic]


quoteID,immigration_compound_score,healthcare_compound_score,climate_compound_score,trump_compound_score,abortion_compound_score,women_right_compound_score,violence_compound_score,racism_compound_score,war_compound_score,tax_compound_score,coal_compound_score,label,party_name
2015-01-01-001082,,,,,,,,,,,,George Koob,no party
2016-05-20-058627,,0.4019,,,0.4019,,,,,,,George Koob,no party
2016-05-20-116536,,0.4019,,,,,,,,,,George Koob,no party
2016-05-20-134436,,,,,,,,,,,,George Koob,no party
2015-01-01-002255,,,,,,,,,,,,Rich Rodriguez,no party
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-04-15-079978,,,,,,,,,,,,Noah Lyles,no party
2020-04-15-034880,,,,,0.6486,,,,,,,Jonathan Thomas,no party
2020-04-15-073238,,,,,,,,,,,,Jonathan Thomas,no party
2020-04-15-046383,,,,,,,,,,,,Johnny Sanchez,no party


In [None]:
# Run ttest
display(HTML(sa.run_ttest(df_fox_topics)))

Topic,t-statistic,p-value,Same opinion?
Immigration,-0.5218,0.6018,✅
Healthcare,1.9044,0.0569,✅
Climate,-1.2908,0.1972,✅
Trump,-2.2583,0.024,❌
Abortion,0.9311,0.3518,✅
Women_right,0.3921,0.695,✅
Violence,1.8293,0.0674,✅
Racism,-0.4359,0.6629,✅
War,-2.3342,0.0196,❌
Tax,0.7577,0.4487,✅


In [None]:
# Plot the mean and standard deviation of the sentiment score per topic, Republicans vs. Democrats
fig = pu.plot_mean_sentiment_scores_per_party(
    df=df_fox_topics,
    title='Mean score per topic for the sentiment analysis (FOX)',
    filename=os.path.join(paths.FIGS_DIR, 'fox_sentiment_scores_parties.html')
)
fig.show()

### Topic distribution over the years

In [None]:
# Merge selected quotes with the dates
df_fox_topics = pd.merge(
    df_fox_topics, df_fox_speakers_party['date'], how='inner',
    left_index=True, right_index=True
)
df_fox_topics['date'] = pd.to_datetime(df_fox_topics['date'])

In [None]:
# Count the number of quotes per topic and per month: for the whole journal,
# for Republican quotes only, and for Democratic quotes only
quotes_over_the_years_fox = df_fox_topics.groupby(df_fox_topics.date.dt.to_period('M')).count()
quotes_over_the_years_R = df_fox_topics[df_fox_topics.party_name=='republican party'].drop(
    ['label', 'party_name', 'date'], axis=1).groupby(df_fox_topics.date.dt.to_period('M')).count()
quotes_over_the_years_D = df_fox_topics[df_fox_topics.party_name=='democratic party'].drop(
    ['label', 'party_name', 'date'], axis=1).groupby(df_fox_topics.date.dt.to_period('M')).count()

In [None]:
# Normalize the counts
sum_FOX = quotes_over_the_years_fox.sum(axis=0)
quotes_over_the_years_fox = quotes_over_the_years_fox.transform(
    axis=1, func=lambda x: x / sum_FOX
)

sum_R = quotes_over_the_years_R.sum(axis=0)
quotes_over_the_years_R = quotes_over_the_years_R.transform(
    axis=1, func=lambda x: x / sum_R
)

sum_D = quotes_over_the_years_D.sum(axis=0)
quotes_over_the_years_D = quotes_over_the_years_D.transform(
    axis=1, func=lambda x: x / sum_D
)

In [None]:
# Checking that the sum is 1
quotes_over_the_years_fox.sum()

immigration_compound_score    1.0
healthcare_compound_score     1.0
climate_compound_score        1.0
trump_compound_score          1.0
abortion_compound_score       1.0
women_right_compound_score    1.0
violence_compound_score       1.0
racism_compound_score         1.0
war_compound_score            1.0
tax_compound_score            1.0
coal_compound_score           1.0
label                         1.0
party_name                    1.0
date                          1.0
dtype: float64

In [None]:
quotes_over_the_years_fox.head()

date,immigration_compound_score,healthcare_compound_score,climate_compound_score,trump_compound_score,abortion_compound_score,women_right_compound_score,violence_compound_score,racism_compound_score,war_compound_score,tax_compound_score,coal_compound_score,label,party_name,date
2015-01,0.014892,0.008487,0.013898,0.009571,0.008542,0.007573,0.011371,0.011765,0.015642,0.012569,0.010801,0.009544,0.009544,0.009544
2015-02,0.018201,0.010721,0.015161,0.010204,0.008956,0.009142,0.011862,0.012968,0.019737,0.011259,0.010338,0.010012,0.010012,0.010012
2015-03,0.013688,0.009917,0.013266,0.008859,0.010587,0.012212,0.012353,0.014572,0.013292,0.010998,0.011418,0.011031,0.011031,0.011031
2015-04,0.01083,0.008398,0.007581,0.006882,0.008735,0.009483,0.011441,0.010561,0.013359,0.006023,0.009258,0.009417,0.009417,0.009417
2015-05,0.006167,0.004556,0.005685,0.004351,0.006054,0.005662,0.00751,0.006952,0.010003,0.006023,0.004783,0.006678,0.006678,0.006678


In [None]:
# For the whole journal
fig = pu.plot_topics_count_stacked(
    df=quotes_over_the_years_fox,
    journal_name='FOX',
    filename=os.path.join(paths.FIGS_DIR, 'fox_topics_count.html')
)
fig.show()

In [None]:
# Note: racism is not significant for FOX (p = 0.6629) but is useful for our analysis
significant_topics = ['trump', 'war', 'coal', 'racism']

for topic in significant_topics:
    fig = pu.plot_topics_R_vs_D(
        df_democrats=quotes_over_the_years_D, 
        df_republicans=quotes_over_the_years_R,
        topic=topic,
        filename=os.path.join(paths.FIGS_DIR, f'FOX_R_VS_D_{topic}.html')
    )
    fig.show()

#### PCA on the speakers' quotations

In [None]:
# PCA
df_fox_avg_topics = sa.create_df_avg_compound_score(df_fox_topics)
fox_pca, df_fox_pca = sa.pca_analysis(df_fox_avg_topics, stardardize=True)
pu.plot_scatter_pca(
    df_fox_pca,
    title='Principal component 1 vs principal component 2 for FOX',
    filename=os.path.join(paths.FIGS_DIR, 'fox_pca_2d.html'),
)

In [None]:
# Delete the FOX variables that are no longer needed, to free memory
variables = list(globals().keys())
for variable in variables:
    if 'fox' in variable:
        print('Delete', variable)
        del globals()[variable]
gc.collect()

Delete df_fox
Delete df_fox_unique_speakers
Delete df_fox_speakers_party
Delete df_fox_topics
Delete quotes_over_the_years_fox
Delete df_fox_avg_topics
Delete fox_pca
Delete df_fox_pca


87432

## References

- Quotebank
    - [Quotebank: A Corpus of Quotations from a Decade of News](https://zenodo.org/record/4277311)

- NLP
    - [NLTK Documentation](https://www.nltk.org/index.html)
    - [spaCy Documentation](https://spacy.io/)
    - [Empath](https://github.com/Ejhfast/empath-client)
    - [Sentiment Analysis: First Steps With Python's NLTK Library](https://realpython.com/python-nltk-sentiment-analysis/)
    - [Topic Modelling in Python with NLTK and Gensim](https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21)

- Wordcloud
    - [WordCloud Documentation](https://amueller.github.io/word_cloud/)
    - [Generating WordClouds in Python](https://www.datacamp.com/community/tutorials/wordcloud-python)
    - [Simple word cloud in Python](https://towardsdatascience.com/simple-wordcloud-in-python-2ae54a9f58e5)

- Visualization
    - [Plotly](https://plotly.com/graphing-libraries/)
    - [Python-tabulate](https://github.com/astanin/python-tabulate)

- Newspapers
    - [The New York Times](https://www.nytimes.com/)
    - [CNN](https://edition.cnn.com/)
    - [Fox News](https://www.foxnews.com/)