
# All the News Dataset: Summary EDA 📊

This is a basic statistical / distribution exploration of the `AllTheNews` dataset. 

#### Notebook Properties
* Upstream Notebook: `src.data_ingestion.all_the_news_v2_ingest`
* Compute Resources: `32 GB RAM, 4 CPUs` (when not performing EDA on a sample of data)
* Last Updated: `Nov 21 2023`

#### Data

| **Name** | **Type** | **Location Type** | **Description** | **Location** | 
| --- | --- | --- | --- | --- | 
| `all_the_news` | `input` | `Delta` | Read full delta dataset of `AllTheNews` | `catalog/raw/all_the_news.delta` | 

In [0]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
from deltalake import DeltaTable
from src.utils.io import FileSystemHandler

In [0]:
pd.set_option("display.max_columns", None)
pd.options.plotting.backend = "plotly"

datafs = FileSystemHandler("s3")

In [0]:
LIMIT_PARTITIONS: int | None = None
"""An input parameter to limit the number of table partitions to read from delta. Useful to perform EDA on a sample of data."""

SHUFFLE_PARTITIONS: bool = False
"""Whether to randomize the partitions before reading"""

INPUT_TABLE: str = "all_the_news" 
INPUT_CATALOG: str = "raw"


### Read Data

In [0]:
atn_delta_table: DeltaTable = datafs.read_delta(
    table=INPUT_TABLE,
    catalog_name=INPUT_CATALOG,
    as_pandas=False,
)

df: pd.DataFrame = datafs.read_delta_partitions(
    delta_table=atn_delta_table,
    N_partitions=LIMIT_PARTITIONS,
    shuffle_partitions=SHUFFLE_PARTITIONS,
)

df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(by=["date"])

print(df.shape)
df.head()


## Summary Statistics


#### Publications
* 30% of articles come from Reuters
* NYTimes, CNBC & the Hill together make up an additional 25.5%
* The remaining publications make up the other ~44.5% of articles

> The sample may have to be re-randomized to account for Reuters' overrepresentation

In [0]:
df.publication.value_counts(normalize=True).plot(
    kind="bar",
    template="plotly_white",
    title="Share of Articles by Publication",
)
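
One possible way to act on the Reuters imbalance noted above is to cap the number of articles kept per publication before sampling. The following is a minimal sketch on a toy frame, not part of the existing pipeline; the helper name `cap_per_publication`, the cap value and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def cap_per_publication(
    frame: pd.DataFrame, max_rows: int, seed: int = 0
) -> pd.DataFrame:
    """Hypothetical helper: keep at most `max_rows` random rows per publication."""
    # Shuffle once so that the rows kept for each publication are a random subset.
    shuffled = frame.sample(frac=1.0, random_state=seed)
    # GroupBy.head keeps the first `max_rows` rows of every group.
    return shuffled.groupby("publication").head(max_rows).sort_index()


# Toy data (assumption): Reuters is heavily overrepresented.
toy = pd.DataFrame(
    {
        "publication": ["Reuters"] * 6 + ["CNBC"] * 2 + ["The Hill"] * 2,
        "article": [f"text {i}" for i in range(10)],
    }
)

balanced = cap_per_publication(toy, max_rows=2)
shares = balanced["publication"].value_counts(normalize=True)
print(shares)
```

A hard cap is only one option; weighting or stratified sampling would keep more data at the cost of a more involved setup.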


#### Authors
* No obvious bias introduced by authors, since article share is roughly uniform across the top 20 authors

In [0]:
df.author.value_counts(normalize=True).head(20).plot(
    kind="bar",
    template="plotly_white",
    title="Share of Articles by Author (Top 20)",
)


#### Publication Section

* 17% of articles come from `World News`, `Business News` and `Market News` sections
* The unique list of sections shows a wide variety of sections spread across the data

In [0]:
df.section.value_counts(normalize=True).head(20).plot(
    kind="bar",
    template="plotly_white",
    title="Share of Articles by Publication Section (Top 20)",
)


#### Yearly Articles
* The `2020` article count is significantly lower (about half) than that of the other years
* The dataset only has articles up to `Apr 2 2020`

> May have to consider removing `2020` data, also because of COVID bias

In [0]:
print("Max Date of all Articles:", df.date.max())

df.year.value_counts().sort_index().plot(
    kind="line",
    template="plotly_white",
    markers=True,
    title="Article Count by Year over Time",
)
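
If the `2020` data is dropped as suggested in the note above, a date cutoff is probably the simplest approach. This is a minimal sketch on toy data; the constant `FIRST_EXCLUDED_DATE` and the toy rows are assumptions, and the real `df` would be filtered the same way on its `date` column.

```python
import pandas as pd

# Toy data (assumption): two pre-2020 articles and two from the partial 2020 year.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2018-05-01", "2019-12-31", "2020-01-01", "2020-04-02"]
        ),
        "article": ["a", "b", "c", "d"],
    }
)

# Hypothetical cutoff: everything on or after this date is excluded.
FIRST_EXCLUDED_DATE = pd.Timestamp("2020-01-01")

pre_2020 = toy[toy["date"] < FIRST_EXCLUDED_DATE]
yearly_counts = pre_2020["date"].dt.year.value_counts().sort_index()
print(yearly_counts)
```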


#### Monthly Articles

* Article counts are fairly stable from month to month
* The COVID era shows a spike in the total count of articles
* The `Apr 2020` count is not a true minimum, since data is only present for the first 2 days of that month

In [0]:
df.groupby(pd.Grouper(key="date", freq="M"))["article"].count().plot(
    kind="line",
    template="plotly_white",
    markers=True,
    title="Article Count by Month over Time",
)
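
Because the final month is incomplete, it can distort the tail of the monthly series. The sketch below shows one way to drop a trailing partial month before counting, using toy data. It relies on a heuristic that is an assumption here: the last month counts as partial when the latest article date falls before that month's final calendar day, so a complete month with no articles on its last day would also be dropped.

```python
import pandas as pd

# Toy data (assumption): April 2020 only covers the first two days.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-02-10", "2020-03-05", "2020-03-31", "2020-04-01", "2020-04-02"]
        ),
        "article": ["a", "b", "c", "d", "e"],
    }
)

last_date = toy["date"].max()
last_period = last_date.to_period("M")

# Heuristic: the final month is partial if the latest article predates month end.
month_is_partial = last_date.normalize() < last_period.end_time.normalize()

months = toy["date"].dt.to_period("M")
complete = toy[months < last_period] if month_is_partial else toy

monthly_counts = complete.groupby(complete["date"].dt.to_period("M"))[
    "article"
].count()
print(monthly_counts)
```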


#### Other Time Series

In [0]:
publication_time_grouped = df.groupby(
    [pd.Grouper(key="date", freq="M"), "publication"]
)["article"].count()

publication_time_grouped_unstacked = publication_time_grouped.unstack(
    level="publication"
)

pub_mo_fig = publication_time_grouped_unstacked.plot(
    kind="line",
    template="plotly_white",
    markers=False,
    title="Articles over Time by Publication",
)
pub_mo_fig.show()

In [0]:
publication_time_grouped_disc = (
    df[df["publication"] != "Reuters"]
    .groupby([pd.Grouper(key="date", freq="M"), "publication"])["article"]
    .count()
)

publication_time_grouped_disc_unstacked = publication_time_grouped_disc.unstack(
    level="publication"
)

pub_mo_disc_fig = publication_time_grouped_disc_unstacked.plot(
    kind="line",
    template="plotly_white",
    markers=False,
    title="[Reuters Discounted] Articles over Time by Publication",
)
pub_mo_disc_fig.show()

In [0]:
df.day.value_counts().sort_index().plot(
    kind="bar",
    template="plotly_white",
    title="Article Count by Day of Month for all Years",
)


#### Dense Authors

The following authors have each written more than 1% of all articles within their publication in this dataset. This is not necessarily a bias, since one-off op-ed authors could hold highly rhetorical sentiments as well. However, the dominance of certain authors is worth noting as a signal for later analysis.

We don't consider authors who have written pieces in multiple publications, since different authors could share the same name and the counts wouldn't represent a single author.
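
The multi-publication exclusion described above could be expressed as a filter on the number of distinct publications per author name. This is a minimal sketch on toy data, with illustrative names and values that are assumptions; it is not necessarily how the cells in this notebook apply the rule.

```python
import pandas as pd

# Toy data (assumption): "A. Smith" appears under two different publications.
toy = pd.DataFrame(
    {
        "author": ["A. Smith", "A. Smith", "B. Jones", "B. Jones", "C. Lee"],
        "publication": ["Reuters", "CNBC", "Vox", "Vox", "Wired"],
    }
)

# Number of distinct publications each author name appears in, aligned to rows.
pubs_per_author = toy.groupby("author")["publication"].transform("nunique")

# Keep only rows whose author name is tied to exactly one publication.
single_pub_authors = toy[pubs_per_author == 1]
print(single_pub_authors)
```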

In [0]:
# Remove generic author names that contain the publication name (e.g. "WIRED Staff").
# Rows with a missing publication or author are dropped as well.
author_cleaned_df: pd.DataFrame = df[
    df.apply(
        lambda row: row.publication.lower() not in row.author.lower()
        if not pd.isnull(row.publication) and not pd.isnull(row.author)
        else False,
        axis=1,
    )
]

print(author_cleaned_df.shape)


Certain authors have written more than 5% of all articles in their publication. If the dataset is considered a fully representative random sample of all the news out there, then these authors require closer inspection in further analysis.

In [0]:
author_pub_ratio: pd.DataFrame = (
    author_cleaned_df
    .groupby(["author", "publication"])
    .agg(articles_by_author_for_publication=("article", "count"))
    .reset_index()
    .query("articles_by_author_for_publication > 1")
    .merge(
        df.groupby(["publication"])
        .agg(
            authors_in_publication=("author", "nunique"),
            articles_in_publication=("article", "count"),
        )
        .reset_index(),
        how="left",
        on="publication",
    )
)

author_pub_ratio["author_ratio_for_publication"] = (
    author_pub_ratio["articles_by_author_for_publication"]
    / author_pub_ratio["articles_in_publication"]
)
# Filter on the unrounded ratio first so that rounding cannot drop borderline authors
author_pub_ratio = author_pub_ratio.query("author_ratio_for_publication > 0.01")
author_pub_ratio = author_pub_ratio.round(3)
author_pub_ratio = author_pub_ratio.sort_values(
    by=["author_ratio_for_publication"], ascending=False
)

print(author_pub_ratio.shape)
author_pub_ratio.head(10)
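
To make the per-publication author ratio concrete, here is a small worked example on toy data. It is a simplified sketch rather than a replacement for the cell above: the names and counts are assumptions, and the denominator here is simply the number of toy rows per publication.

```python
import pandas as pd

# Toy data (assumption): Vox has 4 articles (3 by "A", 1 by "B"), Wired has 2 by "C".
toy = pd.DataFrame(
    {
        "publication": ["Vox", "Vox", "Vox", "Vox", "Wired", "Wired"],
        "author": ["A", "A", "A", "B", "C", "C"],
    }
)

# Articles per (publication, author) pair.
counts = (
    toy.groupby(["publication", "author"]).size().reset_index(name="n_author")
)

# Total articles per publication, broadcast back onto each author row.
counts["n_publication"] = counts.groupby("publication")["n_author"].transform("sum")

# Share of the publication's articles written by each author.
counts["ratio"] = counts["n_author"] / counts["n_publication"]
print(counts)
```

In this toy frame, author `A` wrote 3 of the 4 Vox articles, giving a ratio of 0.75, which would pass the 1% threshold used above.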