
# Preprocessor: Word Counts & Sentiments 👓

This notebook adds word-count and sentiment-score columns that are used in the text-based EDA.

#### Notebook Properties
* Upstream Notebook: `src.data_ingestion.all_the_news_v2_ingest`
* Compute Resources: `32 GB RAM, 4 CPUs` (when not performing EDA on a sample of data)
* Last Updated: `Nov 23 2023`

#### Data

| **Name** | **Type** | **Location Type** | **Description** | **Location** | 
| --- | --- | --- | --- | --- | 
| `all_the_news` | `input` | `Delta` | Read full delta dataset of `AllTheNews` | `catalog/raw/all_the_news.delta` | 

In [0]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
from random import shuffle
import contextlib
from tqdm.autonotebook import tqdm
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from deltalake import DeltaTable
from deltalake.exceptions import TableNotFoundError
import pyarrow as pa

from src.utils.io import FileSystemHandler, partition_dataframe
from src.utils.schemas import all_the_news_raw_schema

In [0]:
pd.set_option("display.max_columns", None)
pd.options.plotting.backend = "plotly"
tqdm.pandas()

datafs = FileSystemHandler("s3")

In [0]:
LIMIT_PARTITIONS: int | None = None
"""An input parameter to limit the number of table partitions to read from delta. Useful to perform EDA on a sample of data."""

SHUFFLE_PARTITIONS: bool = False
"""Whether to randomize the partitions before reading"""

INPUT_TABLE: str = "all_the_news" 
INPUT_CATALOG: str = "raw"

OUTPUT_TABLE: str = "all_the_news"
OUTPUT_CATALOG: str = "text_eda"
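
How `LIMIT_PARTITIONS` and `SHUFFLE_PARTITIONS` interact is not shown in this notebook, since the sampling happens inside `datafs.read_delta_partitions`. A minimal sketch of one plausible behaviour, assuming the partitions are available as a list of paths; `select_partitions` and the example paths are hypothetical and not part of `src.utils.io`:

```python
import random
from typing import Optional


def select_partitions(
    partitions: list[str],
    limit: Optional[int] = None,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> list[str]:
    """Hypothetical helper: optionally shuffle, then keep at most `limit` partitions."""
    selected = list(partitions)  # copy so the caller's list is left untouched
    if shuffle:
        random.Random(seed).shuffle(selected)
    if limit is not None:
        selected = selected[:limit]
    return selected


# Made-up partition paths, purely for illustration.
example_partitions = [f"part-{i:05d}.parquet" for i in range(10)]
sample = select_partitions(example_partitions, limit=3, shuffle=True, seed=42)
```

Shuffling before truncating means a limited read draws a random subset of partitions instead of always taking the first few.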


### Read Data

In [0]:
atn_delta_table: DeltaTable = datafs.read_delta(
    table=INPUT_TABLE,
    catalog_name=INPUT_CATALOG,
    as_pandas=False,
)

df: pd.DataFrame = datafs.read_delta_partitions(
    delta_table=atn_delta_table,
    N_partitions=LIMIT_PARTITIONS,
    shuffle_partitions=SHUFFLE_PARTITIONS,
)

df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(by=["date"])

print(df.shape)
df.head()


## Summary Text Stats

In [0]:
print(f"Sum of Character Counts in Article Titles: {df.title.dropna().apply(len).sum():,}")
print(f"Sum of Character Counts in Article Bodies: {df.article.dropna().apply(len).sum():,}")

In [0]:
df["title_word_count"] = df["title"].dropna().apply(lambda x: len(str(x).split()))
df["article_word_count"] = df["article"].dropna().progress_apply(lambda x: len(str(x).split()))

In [0]:
# Rows with a missing title or article have no word count; they are counted as
# 0 words here, so the statistics below cover every row in the dataframe.
title_word_counts: pd.Series = df["title_word_count"].fillna(0).astype("int64")
article_word_counts: pd.Series = df["article_word_count"].fillna(0).astype("int64")

title_word_count_sum: int = title_word_counts.sum()
article_word_count_sum: int = article_word_counts.sum()
print(f"Word Counts Sum - Article Titles: {title_word_count_sum:,.0f}")
print(f"Word Counts Sum - Article Bodies: {article_word_count_sum:,.0f}")

title_word_count_mean: float = title_word_counts.mean()
article_word_count_mean: float = article_word_counts.mean()
print()
print(f"Word Counts Mean - Article Titles: {title_word_count_mean:,.0f}")
print(f"Word Counts Mean - Article Bodies: {article_word_count_mean:,.0f}")

title_word_count_med: float = title_word_counts.median()
art_word_count_med: float = article_word_counts.median()
print()
print(f"Word Counts Median - Article Titles: {title_word_count_med:,.0f}")
print(f"Word Counts Median - Article Bodies: {art_word_count_med:,.0f}")


## Apply Sentiment Models

The following sentiment models are used through open-source packages:
* `vaderSentiment`
* `textblob`

These scores support an open-ended exploratory analysis of news sentiment over time, by section, and along other dimensions. Comparing the two models also highlights articles where their scores differ, which prompts a closer look at why. Biased articles might exhibit extreme positive or negative sentiment.
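
One way to surface such disagreements is to rank rows by the absolute gap between the two title scores, since TextBlob polarity and the VADER compound score both lie in [-1, 1]. A minimal sketch on made-up values; the `sentiment_gap` column name is an assumption, while the two score columns are the ones this notebook creates further down:

```python
import pandas as pd

# Made-up titles and scores, purely for illustration.
toy_df = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "title_textblob_sentiment": [0.50, -0.10, 0.00],
        "vader_compound_title": [0.45, 0.60, None],
    }
)

# Absolute difference between the two models; rows missing either score stay NaN.
toy_df["sentiment_gap"] = (
    toy_df["title_textblob_sentiment"] - toy_df["vader_compound_title"]
).abs()

# Largest disagreements first; NaN gaps are sorted to the end.
most_disputed = toy_df.sort_values("sentiment_gap", ascending=False, na_position="last")
```

A large gap does not say which model is right; it only flags rows worth reading by hand.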


#### TextBlob sentiment


⚠️ The runtime of this cell is approximately 2 hours for the full volume of data.

> Consider speeding this up with parallel processing or other methods in the future. For now, the processed data is stored in Delta.
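
As a starting point for that speed-up, the scoring could be spread across CPU cores with the standard library's `concurrent.futures`. A minimal sketch, not benchmarked on this dataset; `score_texts_in_parallel` and its parameters are hypothetical names, and the scoring function has to be a picklable module-level function (not a lambda) when a process pool is used:

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Optional, Sequence, Type


def score_texts_in_parallel(
    texts: Sequence[Optional[str]],
    score_fn: Callable[[str], float],
    max_workers: int = 4,
    chunksize: int = 256,
    executor_cls: Type[Executor] = ProcessPoolExecutor,
) -> list[Optional[float]]:
    """Hypothetical helper: apply `score_fn` to every non-missing text, keeping order."""
    # Score only the non-missing texts, remembering where each one came from.
    positions = [i for i, text in enumerate(texts) if text is not None]
    present = [texts[i] for i in positions]

    with executor_cls(max_workers=max_workers) as executor:
        # `chunksize` batches work per task to cut inter-process overhead
        # (thread pools ignore it).
        scores = list(executor.map(score_fn, present, chunksize=chunksize))

    # Put each score back at its original position; missing texts stay None.
    results: list[Optional[float]] = [None] * len(texts)
    for position, score in zip(positions, scores):
        results[position] = score
    return results
```

With TextBlob, `score_fn` could be a small module-level function that returns `TextBlob(text).sentiment.polarity`. The default of four workers mirrors the four CPUs listed in the notebook properties.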

In [0]:
df["title_textblob_sentiment"] = (
    df["title"].dropna().progress_apply(lambda text: TextBlob(text).sentiment.polarity)
)
df["article_textblob_sentiment"] = (
    df["article"].dropna().progress_apply(lambda text: TextBlob(text).sentiment.polarity)
)

In [0]:
print(df.shape)
df.head()


#### vaderSentiment


Why `VADER` (Valence Aware Dictionary and sEntiment Reasoner):
- VADER is tuned to analyze sentiment in social media text. It handles the nuances and idiosyncrasies of online text, such as emoticons, slang, and abbreviations, which are often challenging for traditional sentiment analysis tools.
- Unlike many sentiment analyzers that rely purely on machine learning models, VADER combines a lexicon (a list of lexical features such as words and emoji, each tagged with its sentiment intensity) with a set of grammatical and syntactical rules to determine sentiment.
- Because it is lexicon- and rule-based, VADER does not require training on large datasets.

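`VADER` provides four scores. The first three are ratios that sum to approximately 1 (not probabilities), and they are stored in the `vader_prob_*` columns below:
- **Positive**: Proportion of the text that falls in the positive category.
- **Negative**: Proportion of the text that falls in the negative category.
- **Neutral**: Proportion of the text that falls in the neutral category.
- **Compound**: The sum of the rule-adjusted valence scores of all words in the text, normalized to lie between -1 (most negative) and +1 (most positive). This score is often used as a single measure of sentiment for a given text.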
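
To make the relationship between the four scores concrete, the sketch below derives them from a list of per-token valence values. It is a simplified illustration that follows the general shape of the `vaderSentiment` scoring step and omits the punctuation-emphasis adjustment; `scores_from_valences` and the example valences are assumptions for illustration, not the library's API:

```python
import math


def normalize(score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))


def scores_from_valences(valences: list[float]) -> dict[str, float]:
    """Hypothetical helper: turn rule-adjusted token valences into VADER-style scores."""
    if not valences:
        return {"pos": 0.0, "neg": 0.0, "neu": 0.0, "compound": 0.0}

    # Each non-neutral token adds 1 on top of its valence so that it is weighted
    # comparably to neutral tokens, which count as 1 each.
    pos_sum = sum(v + 1 for v in valences if v > 0)
    neg_sum = sum(v - 1 for v in valences if v < 0)
    neu_count = sum(1 for v in valences if v == 0)
    total = pos_sum + abs(neg_sum) + neu_count

    return {
        "pos": pos_sum / total,
        "neg": abs(neg_sum) / total,
        "neu": neu_count / total,
        "compound": normalize(sum(valences)),
    }


# Made-up valences for a four-token text: one positive, two neutral, one negative.
example_scores = scores_from_valences([1.9, 0.0, 0.0, -1.5])
```

Here `pos`, `neg`, and `neu` add up to 1, while `compound` is computed from the summed valences, not from the three proportions. The `vaderSentiment` documentation suggests treating a compound score of at least 0.05 as positive and at most -0.05 as negative.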

In [0]:
vader_analyzer = SentimentIntensityAnalyzer()

def assign_vader_scores(text: str | None) -> list[float | None]:
    """Return VADER scores as [pos, neg, neu, compound]; all None if scoring fails."""
    return_list: list[float | None] = [None] * 4

    # Any error raised while scoring (e.g. for a missing title) is swallowed,
    # leaving the four values as None.
    with contextlib.suppress(Exception):
        vader_dict: dict[str, float] = vader_analyzer.polarity_scores(text)
        return_list[0] = vader_dict["pos"]
        return_list[1] = vader_dict["neg"]
        return_list[2] = vader_dict["neu"]
        return_list[3] = vader_dict["compound"]

    return return_list

In [0]:
df[
    [
        "vader_prob_positive_title",
        "vader_prob_negative_title",
        "vader_prob_neutral_title",
        "vader_compound_title",
    ]
] = df.progress_apply(
    lambda row: assign_vader_scores(row.title), axis=1, result_type="expand"
)

In [0]:
print(df.shape)
df.head()


## Save Results

In [0]:
# Clear any previous output table; if it does not exist yet, there is nothing to clear.
with contextlib.suppress(TableNotFoundError):
    print(datafs.clear_delta(table=OUTPUT_TABLE, catalog_name=OUTPUT_CATALOG))

In [0]:
new_text_fields: list[pa.Field] = [
    pa.field("title_word_count", pa.int64()),
    pa.field("article_word_count", pa.int64()),
    pa.field("title_textblob_sentiment", pa.float64()),
    pa.field("article_textblob_sentiment", pa.float64()),
    pa.field("vader_prob_positive_title", pa.float64()),
    pa.field("vader_prob_negative_title", pa.float64()),
    pa.field("vader_prob_neutral_title", pa.float64()),
    pa.field("vader_compound_title", pa.float64()),
]

all_the_news_text_eda_schema = all_the_news_raw_schema

for new_field in new_text_fields:
    all_the_news_text_eda_schema = all_the_news_text_eda_schema.append(new_field)

all_the_news_text_eda_schema

In [0]:
df_partitions: list[pd.DataFrame] = partition_dataframe(df, N_Partitions=54)

for p_df in tqdm(df_partitions):
    datafs.write_delta(
        dataframe=p_df,
        table=OUTPUT_TABLE,
        catalog_name=OUTPUT_CATALOG,
        schema=all_the_news_text_eda_schema,
    )

In [0]:
sample_df: pd.DataFrame = datafs.read_delta_partitions(
    delta_table=datafs.read_delta(
        table=OUTPUT_TABLE,
        catalog_name=OUTPUT_CATALOG,
    ),
    N_partitions=1,
    shuffle_partitions=True,
)
print(sample_df.info())
sample_df.head()