
# All the News Dataset: Sentiment EDA 🎭

This EDA explores the text-derived features and scores in the `AllTheNews` data after word-count and sentiment preprocessing.

#### Notebook Properties
* Upstream Notebook: `src.exploratory_data_analysis.preprocessor_wc_and_sentiments`
* Compute Resources: `32 GB RAM, 4 CPUs` (when not performing EDA on a sample of data)
* Last Updated: `Nov 23 2023`

#### Data

| **Name** | **Type** | **Location Type** | **Description** | **Location** | 
| --- | --- | --- | --- | --- | 
| `all_the_news_wc_sentiment` | `input` | `Delta` | WC & Sentiment assigned `AllTheNews` data | `catalog/text_eda/all_the_news.delta` | 

In [0]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import plotly.express as px
from deltalake import DeltaTable
from src.utils.io import FileSystemHandler

In [0]:
pd.set_option("display.max_columns", None)
pd.options.plotting.backend = "plotly"

datafs = FileSystemHandler("s3")

In [0]:
LIMIT_PARTITIONS: int | None = None
"""An input parameter to limit the number of table partitions to read from delta. Useful to perform EDA on a sample of data."""

SHUFFLE_PARTITIONS: bool = False
"""Whether to randomize the partitions before reading"""

INPUT_TABLE: str = "all_the_news" 
INPUT_CATALOG: str = "text_eda"


### Read Data

In [0]:
atn_delta_table: DeltaTable = datafs.read_delta(
    table=INPUT_TABLE,
    catalog_name=INPUT_CATALOG,
    as_pandas=False,
)

df: pd.DataFrame = datafs.read_delta_partitions(
    delta_table=atn_delta_table,
    N_partitions=LIMIT_PARTITIONS,
    shuffle_partitions=SHUFFLE_PARTITIONS,
)

df["date"] = pd.to_datetime(df["date"])
df = df[df.date < pd.to_datetime("2020-04-01")]
df = df.sort_values(by=["date"])

print(df.shape)
df.head()


## Sentiment EDA

This analysis seeks to answer the following questions:

1. By manual inspection, do polarizing articles appear to be biased?
2. Do certain authors / publications / sections consistently report with a positive or negative sentiment?
3. Is there a sentiment trend over time, and by group?


### Study 1: Sentiment Trends Over Time 📈
See bottom of section for observations

In [0]:
title_word_count_t: pd.DataFrame = (
    df.groupby(pd.Grouper(key="date", freq="M"))["title_word_count"]
    .mean()
    .reset_index()
)

title_word_count_t_fig = px.line(
    title_word_count_t,
    x="date",
    y="title_word_count",
    title="Title Word Count by Month over Time",
    template="plotly_white",
    markers=True,
    range_y=[0,20]
)

title_word_count_t_fig.update_traces(line=dict(color="violet"))
title_word_count_t_fig.show()

In [0]:
article_word_count_t: pd.DataFrame = (
    df.groupby(pd.Grouper(key="date", freq="M"))["article_word_count"].mean().reset_index()
)

article_word_count_t_fig = px.line(
    article_word_count_t,
    x="date",
    y="article_word_count",
    title="Article Word Count by Month over Time",
    template="plotly_white",
    markers=True,
    range_y=[0,700]
)

article_word_count_t_fig.update_traces(line=dict(color="violet"))
article_word_count_t_fig.show()

In [0]:
t_sentiments = (
    df.groupby(pd.Grouper(key="date", freq="M"))
    .agg(
        {
            "vader_compound_title": "mean",
            "article_textblob_sentiment": "mean",
            "title_textblob_sentiment": "mean",
        }
    )
    .reset_index()
)

t_sentiments_fig = px.line(
    t_sentiments,
    x="date",
    y=[
        "vader_compound_title",
        "article_textblob_sentiment",
        "title_textblob_sentiment",
    ],
    labels={"value": "Mean Sentiment Score", "variable": "Sentiment Type"},
    title="Average Sentiment per Model over Time",
    template="plotly_white",
    markers=True,
    range_y=[-0.5, 0.5],
)

t_sentiments_fig.show()

In [0]:
vader_title_polarity_t: pd.DataFrame = (
    df.groupby("date")["vader_compound_title"].mean().reset_index()
)

vader_title_polarity_t_fig = px.line(
    vader_title_polarity_t,
    x="date",
    y="vader_compound_title",
    title="VADER Title Polarity over Time",
    template="plotly_white",
)

vader_title_polarity_t_fig.update_traces(line=dict(color="blue"))
vader_title_polarity_t_fig.show()

In [0]:
textblob_article_polarity_t: pd.DataFrame = (
    df.groupby("date")["article_textblob_sentiment"].mean().reset_index()
)

textblob_article_polarity_t_fig = px.line(
    textblob_article_polarity_t,
    x="date",
    y="article_textblob_sentiment",
    title="Textblob Article Polarity over Time",
    template="plotly_white",
)

textblob_article_polarity_t_fig.update_traces(line=dict(color="red"))
textblob_article_polarity_t_fig.show()


### Study 1: Observations 📝

* The following trends are **consistent**:
  * Average title word count over time (generally 9-10 words per title)
  * Average sentiment per model over time, meaning the choice of model may be irrelevant (with slight variance after Jul 2018)
  * VADER polarity of titles is also consistent (with a marginal dip after Jul 2018)

* The following trends are **curious**:
  * The average word count per article increases by about 100 words after Jul 2018
    * The summary EDA also showed a low article count for the year after Jul 2018. Are the articles denser?
  * The volatility of article polarity (textblob) decreases after Jul 2018
    * Could longer articles give the model more text and make its scores less noisy?
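
One way to put a number on the volatility observation is to compare the standard deviation of the daily mean polarity before and after a cutoff date. The following is a minimal sketch on synthetic data: the `volatility_by_period` helper, the `2018-07-01` cutoff and the generated series are assumptions for illustration, not results from `AllTheNews`.

```python
import numpy as np
import pandas as pd


def volatility_by_period(daily: pd.DataFrame, value_col: str, cutoff: str) -> pd.Series:
    """Std. deviation of a daily series before and after `cutoff` (hypothetical helper)."""
    # Label each row by which side of the cutoff its date falls on
    period = np.where(daily["date"] < pd.Timestamp(cutoff), "before", "after")
    return daily.groupby(period)[value_col].std()


# Synthetic daily series: noisier before the cutoff, calmer after it
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", "2018-12-31", freq="D")
noise_scale = np.where(dates < pd.Timestamp("2018-07-01"), 0.05, 0.01)
daily = pd.DataFrame(
    {
        "date": dates,
        "article_textblob_sentiment": 0.1 + rng.normal(0, 1, len(dates)) * noise_scale,
    }
)

result = volatility_by_period(daily, "article_textblob_sentiment", "2018-07-01")
print(result.round(3))
```

On the real data, the same helper could be applied to a daily frame such as `textblob_article_polarity_t`. A drop in the standard deviation could also simply reflect a change in the number of articles per day, so it is worth checking daily counts alongside.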


### Study 2: Inspection of Polarizing Articles 🔍


#### Polarity Filters

If some fraction of biased articles is assumed to be highly polarized in sentiment, we must consider both highly positive and highly negative sentiments.

**Model Filters**:
* `vader` probability of positive / negative sentiment of article title > 50%
* `vader` probability of neutral sentiment of article title < 50%
* `textblob` sentiment score > `0.15` or < `-0.15`

We establish these filters using two modes: `strict` and `loose`.

**Modes**
* **Strict**: Articles for which both models are confident and agree on the sentiment (`AND` condition)
* **Loose**: Same filters as `strict`, but an article is kept if either model's filter is met (`OR` condition)

> The `0.15` threshold was obtained by eyeballing the time series distributions in Study 1
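
The cells in this study apply only the `vader` title filters. As a minimal sketch of how the two modes could combine both models, assuming the column names used elsewhere in this notebook and a small made-up sample: the `polarity_masks` helper is hypothetical, and it interprets "consensus" as both models agreeing on the sign, with `vader` scoring the title and `textblob` scoring the article.

```python
import pandas as pd


def polarity_masks(
    frame: pd.DataFrame,
    vader_threshold: float = 0.5,
    textblob_threshold: float = 0.15,
) -> tuple[pd.Series, pd.Series]:
    """Return (strict, loose) boolean masks for polarized rows (hypothetical helper)."""
    # VADER: title is mostly positive or mostly negative, and not mostly neutral
    not_neutral = frame["vader_prob_neutral_title"] < vader_threshold
    vader_pos = (frame["vader_prob_positive_title"] > vader_threshold) & not_neutral
    vader_neg = (frame["vader_prob_negative_title"] > vader_threshold) & not_neutral

    # TextBlob: article polarity falls outside the [-0.15, 0.15] band
    textblob_pos = frame["article_textblob_sentiment"] > textblob_threshold
    textblob_neg = frame["article_textblob_sentiment"] < -textblob_threshold

    # Strict: both models fire and agree on the sign (AND)
    strict = (vader_pos & textblob_pos) | (vader_neg & textblob_neg)
    # Loose: any single filter fires (OR)
    loose = vader_pos | vader_neg | textblob_pos | textblob_neg
    return strict, loose


# Made-up rows for illustration only
sample = pd.DataFrame(
    {
        "vader_prob_positive_title": [0.7, 0.1, 0.0, 0.0, 0.6],
        "vader_prob_negative_title": [0.0, 0.1, 0.0, 0.6, 0.0],
        "vader_prob_neutral_title": [0.3, 0.8, 1.0, 0.4, 0.4],
        "article_textblob_sentiment": [0.30, 0.20, 0.05, -0.40, -0.30],
    }
)

strict_mask, loose_mask = polarity_masks(sample)
print(sample[strict_mask].index.tolist())  # rows kept in strict mode
print(sample[loose_mask].index.tolist())   # rows kept in loose mode
```

The last sample row shows why the sign check matters: both models are confident, but they disagree on direction, so the row passes only in loose mode.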

In [0]:
negative_titles_strict: pd.DataFrame = df[
    (df.vader_prob_negative_title > 0.5) & (df.vader_prob_neutral_title < 0.5)
]

positive_titles_strict: pd.DataFrame = df[
    (df.vader_prob_positive_title > 0.5) & (df.vader_prob_neutral_title < 0.5)
]

polarized_titles_strict: pd.DataFrame = pd.concat(
    [positive_titles_strict, negative_titles_strict]
).reset_index(drop=True)

print(polarized_titles_strict.shape)
polarized_titles_strict.head()

In [0]:
polarized_pubs_dist: pd.Series = polarized_titles_strict.publication.value_counts(
    normalize=True
)
all_pubs_dist: pd.Series = df.publication.value_counts(normalize=True)
publication_polarity_diff = pd.concat(
    [all_pubs_dist, polarized_pubs_dist], axis=1
).reset_index()

publication_polarity_diff.columns = [
    "publication",
    "full_ratio",
    "polarized_ratio",
]

publication_polarity_diff["polarity_ratio_increase"] = (
    publication_polarity_diff["polarized_ratio"]
    - publication_polarity_diff["full_ratio"]
)

publication_polarity_diff = publication_polarity_diff.sort_values(
    by="polarity_ratio_increase", ascending=False
)
publication_polarity_diff = publication_polarity_diff.round(2)
publication_polarity_diff[publication_polarity_diff.polarity_ratio_increase > 0]

In [0]:
publication_polarity_diff[publication_polarity_diff.polarity_ratio_increase < 0]


### Study 2: Observations 📝

* Sentiment alone does not make the highly polarized articles evidently biased
* `Refinery 29`, `CNN`, `People` and `NYTimes` make up a larger share of polarized articles than of all articles
* `Reuters`, `Verge`, `CNBC`, `TechCrunch`, etc. make up a smaller share of highly polarized articles, suggesting they are more neutrally worded publications
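
The comparison above uses the absolute difference between two shares, which tends to favor large publications. As a hedged alternative that this notebook does not use, a relative "lift" (polarized share divided by overall share) can be computed alongside it. The sketch below uses made-up publication labels, and the `share_diff` and `lift` column names are assumptions.

```python
import pandas as pd

# Made-up labels for illustration only
all_articles = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"] * 1, name="publication")
polarized_articles = pd.Series(["A"] * 2 + ["B"] * 1 + ["C"] * 1, name="publication")

# Share of each publication in the full set and in the polarized subset
full_ratio = all_articles.value_counts(normalize=True)
polarized_ratio = polarized_articles.value_counts(normalize=True)

# Align on publication; a publication absent from the subset gets a share of 0
comparison = pd.DataFrame(
    {"full_ratio": full_ratio, "polarized_ratio": polarized_ratio}
).fillna(0.0)

comparison["share_diff"] = comparison["polarized_ratio"] - comparison["full_ratio"]
comparison["lift"] = comparison["polarized_ratio"] / comparison["full_ratio"]

print(comparison.sort_values("lift", ascending=False).round(3))
```

A lift above 1 means the publication is over-represented among polarized articles relative to its overall share. Small publications can produce noisy lifts, so reading it next to the raw counts is advisable.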

### Study 3: Sentiment Patterns for Groups 🎙

In [0]:
agg_pub_sentiment: pd.DataFrame = (
    df.groupby("publication")
    .agg(
        article_mean=("article_textblob_sentiment", "mean"),
        article_std=("article_textblob_sentiment", "std"),
        title_mean=("vader_compound_title", "mean"),
        title_std=("vader_compound_title", "std"),
    )
    .reset_index()
    .round(3)
    .sort_values(by="article_mean", ascending=False)
)

print(agg_pub_sentiment.shape)
agg_pub_sentiment.head()

In [0]:
px.scatter(
    agg_pub_sentiment,
    x="article_mean",
    y="article_std",
    color="publication",
    template="plotly_white",
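    # `labels` expects a dict mapping column names to display labels
    labels={
        "article_mean": "Mean Article Polarity",
        "article_std": "Std. Dev. of Article Polarity",
    },
    range_x=[0.05,0.2],
    range_y=[0.05,0.2],
    title="Average Article Polarity of Publications vs. Std. Deviation"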
)

In [0]:
agg_author_sentiment: pd.DataFrame = (
    df.groupby("author")
    .agg(
        article_mean=("article_textblob_sentiment", "mean"),
        article_std=("article_textblob_sentiment", "std"),
        title_mean=("vader_compound_title", "mean"),
        title_std=("vader_compound_title", "std"),
        article_count=("article", "count")
    )
    .reset_index()
    .round(3)
    .query("article_count > 1")
    .sort_values(by="article_count", ascending=False)
)

print(agg_author_sentiment.shape)
agg_author_sentiment.head()

In [0]:
pub_sentiment_by_month: pd.DataFrame = (
    df.groupby([pd.Grouper(key="date", freq="M"), "publication"])[
        "article_textblob_sentiment"
    ]
    .mean()
    .round(3)
    .unstack(level="publication")
)

print(pub_sentiment_by_month.shape)
pub_sentiment_by_month.head()

In [0]:
pub_sentiment_by_month_fig = pub_sentiment_by_month.plot(
    kind="line",
    template="plotly_white",
    title="Sentiment over Time by Publication",
)
pub_sentiment_by_month_fig.show()


### Study 3: Observations 📝

* `Washington Post`, `TMZ` and `Fox News` have highly volatile polarity over time
* The standard deviation of article sentiment varies widely across authors
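
To make "highly volatile" comparable across publications, one option is to rank them by the standard deviation of their monthly mean polarity. This is a minimal sketch, assuming a wide month-by-publication frame shaped like `pub_sentiment_by_month`. The publication names and values below are made up for illustration.

```python
import pandas as pd

# Made-up monthly mean polarity per publication (rows: month end, columns: publication)
monthly = pd.DataFrame(
    {
        "Publication A": [0.10, 0.11, 0.09, 0.10],
        "Publication B": [0.02, 0.20, -0.05, 0.15],
    },
    index=pd.to_datetime(["2019-01-31", "2019-02-28", "2019-03-31", "2019-04-30"]),
)

# Column-wise std over time; months with no articles (NaN) are skipped by default
volatility = monthly.std().sort_values(ascending=False).rename("monthly_std")
print(volatility.round(3))
```

Applied to the real frame, this would be `pub_sentiment_by_month.std()`. Publications with few articles per month will look more volatile simply because each monthly mean rests on less data.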


### EDA Bottom Line 

The final takeaway is that sentiment across articles, titles and authors tends to be fairly neutral, with only occasional spikes. Hence, while sentiment analysis could be an important factor in detecting bias, we would need other indicators as well.