
# All the News Dataset: Exploratory Data Analysis 📊

This is an exploration of the `AllTheNews` dataset. 

#### Notebook Properties
* Upstream Notebook: `src.data_ingestion.all_the_news_v2_ingest`
* Compute Resources: `32 GB RAM, 4 CPUs` (when not performing EDA on a sample of data)
* Last Updated: `Nov 21 2023`

#### Data

| **Name** | **Type** | **Location Type** | **Description** | **Location** | 
| --- | --- | --- | --- | --- | 
| `all_the_news` | `input` | `Delta` | Read full delta dataset of `AllTheNews` | `catalog/raw/all_the_news.delta` | 

In [0]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
from random import shuffle
from tqdm.autonotebook import tqdm
from src.utils.io import FileSystemHandler

In [0]:
pd.set_option("display.max_columns", None)
pd.options.plotting.backend = "plotly"
tqdm.pandas()

datafs = FileSystemHandler("s3")

In [0]:
LIMIT_PARTITIONS: int | None = None
"""An input parameter to limit the number of table partitions to read from delta. Useful to perform EDA on a sample of data."""
SHUFFLE_PARTITIONS: bool = False
"""Whether to randomize the partitions before reading"""

INPUT_TABLE: str = "all_the_news" 
INPUT_CATALOG: str = "raw"


### Read Data

In [0]:
atn_delta_table = datafs.read_delta(
    table=INPUT_TABLE,
    catalog_name=INPUT_CATALOG,
    as_pandas=False,
)

atn_partitions = atn_delta_table.file_uris()

if SHUFFLE_PARTITIONS:
    shuffle(atn_partitions)

if LIMIT_PARTITIONS:
    atn_partitions = atn_partitions[:LIMIT_PARTITIONS]

print(len(atn_partitions))
atn_partitions[0:5]

In [0]:
l_dfs: list[pd.DataFrame] = []

for p in tqdm(atn_partitions):
    df_partition: pd.DataFrame = pd.read_parquet(p)
    l_dfs.append(df_partition)

In [0]:
df = pd.concat(l_dfs).reset_index(drop=True)

del l_dfs

print(df.shape)
df.head()


## Summary Statistics


#### Publications
* 30% of articles come from Reuters
* NYTimes, CNBC & The Hill together make up an additional 25.5%
* The remaining publications account for the other 44.5% of articles

> May have to resample to account for the over-representation of Reuters
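
One possible way to handle the imbalance is to cap the number of articles kept per publication. The sketch below assumes a `publication` column as used in this notebook; `cap_per_publication`, `cap`, and `seed` are hypothetical names, and the toy frame is illustrative only, not real data.

```python
import pandas as pd


def cap_per_publication(df: pd.DataFrame, cap: int, seed: int = 0) -> pd.DataFrame:
    """Keep at most `cap` randomly chosen articles per publication."""
    shuffled = df.sample(frac=1.0, random_state=seed)   # random row order
    capped = shuffled.groupby("publication").head(cap)  # first `cap` rows per group
    return capped.sort_index()                          # restore original row order


# Toy data for illustration only.
toy_articles = pd.DataFrame(
    {
        "publication": ["Reuters"] * 6 + ["CNBC"] * 2 + ["Vox"] * 2,
        "article": [f"text {i}" for i in range(10)],
    }
)
balanced = cap_per_publication(toy_articles, cap=3)
```

Publications already below the cap are left untouched, so this reduces dominance without discarding data from smaller outlets.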

In [0]:
df.publication.value_counts(normalize=True).plot(
    kind="bar",
    template="plotly_white",
    title="Article Share by Publication",
)
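
The percentages quoted in the bullets can be read off numerically rather than from the chart. A small sketch, assuming only that `publication` is a column of labels; `publication_shares` is a hypothetical helper and the toy series is not taken from the dataset.

```python
import pandas as pd


def publication_shares(publications: pd.Series) -> pd.DataFrame:
    """Share and cumulative share of articles per publication, largest first."""
    share = publications.value_counts(normalize=True)  # sorted in descending order
    return pd.DataFrame({"share": share, "cumulative_share": share.cumsum()})


# Toy data for illustration only.
toy_pubs = pd.Series(["Reuters"] * 3 + ["CNBC"] * 2 + ["Vox"] * 1)
shares = publication_shares(toy_pubs)
```

On the real data this would be called as `publication_shares(df.publication)`.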


#### Yearly Articles
* The `2020` article count is about half that of each of the other years
* The dataset only contains articles up to `Apr 2 2020`

> May have to consider removing `2020` data, also because of COVID bias
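
If the `2020` data is dropped, a date cutoff is one simple way to do it. This is a sketch that assumes `date` is a `datetime64` column, which the `pd.Grouper(key="date", ...)` calls in this notebook suggest; `drop_from_year` is a hypothetical helper.

```python
import pandas as pd


def drop_from_year(df: pd.DataFrame, first_excluded_year: int = 2020) -> pd.DataFrame:
    """Keep only rows dated before 1 January of `first_excluded_year`."""
    cutoff = pd.Timestamp(year=first_excluded_year, month=1, day=1)
    return df[df["date"] < cutoff].reset_index(drop=True)


# Toy data for illustration only.
toy_dated = pd.DataFrame(
    {"date": pd.to_datetime(["2019-12-31", "2020-01-01", "2020-04-02", "2018-06-15"])}
)
pre_2020 = drop_from_year(toy_dated)
```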

In [0]:
print("Max Date of all Articles:", df.date.max())

df.year.value_counts().sort_index().plot(
    kind="line",
    template="plotly_white",
    markers=True,
    title="Article Count by Year over Time",
)


#### Monthly Articles

* Article counts are fairly stable over time
* The COVID era shows a spike in the total article count
* The `Apr 2020` count is not a true minimum, since data is only present for the first 2 days of that month
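
To keep a partial final month from reading as a drop in the chart, the monthly counts can exclude it. The sketch below assumes `date` is a `datetime64` column; `complete_month_counts` is a hypothetical helper, and it groups with monthly periods rather than the `pd.Grouper` call used in this notebook.

```python
import pandas as pd


def complete_month_counts(dates: pd.Series) -> pd.Series:
    """Articles per calendar month, dropping the last month if it is incomplete."""
    counts = dates.dt.to_period("M").value_counts().sort_index()
    last = dates.max()
    month_end = last.to_period("M").end_time.normalize()  # last day of that month
    if last.normalize() < month_end:
        counts = counts.iloc[:-1]  # data stops before month end, so drop that month
    return counts


# Toy data for illustration only: one row per day up to 2 April 2020.
toy_dates = pd.Series(pd.date_range("2020-01-01", "2020-04-02", freq="D"))
monthly = complete_month_counts(toy_dates)
```

This treats a month as complete only when the latest article falls on its last calendar day, which is a deliberately conservative rule.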

In [0]:
df.groupby(pd.Grouper(key="date", freq="M"))["article"].count().plot(
    kind="line",
    template="plotly_white",
    markers=True,
    title="Article Count by Month over Time",
)

In [0]:
publication_time_grouped = df.groupby(
    [pd.Grouper(key="date", freq="M"), "publication"]
)["article"].count()

publication_time_grouped_unstacked = publication_time_grouped.unstack(
    level="publication"
)

pub_mo_fig = publication_time_grouped_unstacked.plot(
    kind="line",
    template="plotly_white",
    markers=False,
    title="Articles over Time by Publication",
)
pub_mo_fig.show()

In [0]:
publication_time_grouped_disc = (
    df[df["publication"] != "Reuters"]
    .groupby([pd.Grouper(key="date", freq="M"), "publication"])["article"]
    .count()
)

publication_time_grouped_disc_unstacked = publication_time_grouped_disc.unstack(
    level="publication"
)

pub_mo_disc_fig = publication_time_grouped_disc_unstacked.plot(
    kind="line",
    template="plotly_white",
    markers=False,
    title="[Excluding Reuters] Articles over Time by Publication",
)
pub_mo_disc_fig.show()

In [0]:
df.day.value_counts().sort_index().plot(
    kind="bar",
    template="plotly_white",
    title="Article Count by Day of Month for all Years",
)

In [0]:
print(f"Character Count in Article Titles: {df.title.apply(len).sum():,}")
print(f"Character Count in Article Bodies: {df.article.apply(len).sum():,}")

In [0]:
word_count_titles = df.title.str.split().progress_apply(len).sum()
word_count_articles = df.article.str.split().progress_apply(len).sum()

print(f"Word Count in Article Titles: {word_count_titles:,}")
print(f"Word Count in Article Bodies: {word_count_articles:,}")
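
Note that `apply(len)` raises a `TypeError` if a title or body is missing (`None`/`NaN`). A vectorized sketch that skips missing values is shown below; `text_totals` is a hypothetical helper, and like the cells above it counts words by splitting on whitespace.

```python
import pandas as pd


def text_totals(text: pd.Series) -> tuple[int, int]:
    """Total characters and whitespace-separated words, skipping missing values."""
    chars = text.str.len().sum()              # missing entries become NaN and are skipped
    words = text.str.split().str.len().sum()  # length of each token list
    return int(chars), int(words)


# Toy data for illustration only.
toy_text = pd.Series(["a b", None, "hello world foo"])
total_chars, total_words = text_totals(toy_text)
```

On the real data this would be called as `text_totals(df.title)` and `text_totals(df.article)`.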