## DUPLICATES

(https://www.clrn.org/how-to-handle-duplicate-data-in-machine-learning/)

From this first analysis:

- development set
  - duplicate rows: 1559
  - NaN article col: 1875
- evaluation set
  - duplicate rows: 96
  - NaN article col: 449
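
How these counts were obtained is not shown here. A minimal sketch of one way to get them with pandas, on a toy frame whose column names are taken from the notebook output (the data itself is made up):

```python
import pandas as pd

# Toy frame standing in for the real dataset; values are illustrative only.
toy = pd.DataFrame({
    "title":   ["a", "a", "b", "c"],
    "article": ["x", "x", None, "z"],
    "y":       [0, 0, 1, 2],
})

# Rows that are exact copies of an earlier row (the first occurrence is not counted).
n_duplicate_rows = int(toy.duplicated().sum())

# Missing values in the article column.
n_missing_article = int(toy["article"].isna().sum())

print(n_duplicate_rows, n_missing_article)
```

Note that `duplicated()` compares all columns by default, so two rows that share the same text but differ in any metadata column are not counted.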

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

from src.config import *
from src.preprocessing import *
from src.preprocessing import Preprocessor
from src.utils import load_data

# Load the development set and normalize the two text columns.
news_df = load_data(DEVELOPMENT_PATH)
news_df['title'] = Preprocessor.clean_text(news_df['title'])
news_df['article'] = Preprocessor.clean_text(news_df['article'])

# Treat articles shorter than 5 characters as missing.
news_df.loc[news_df['article'].str.len() < 5, "article"] = pd.NA

In [2]:
news_df.isna().sum()

source        294
title           1
article      1887
page_rank       0
timestamp       0
y               0
dtype: int64

---

Deduplication rules.
Records are grouped at content level using (article, title). In the cells below the two keys are applied one at a time: first article, then title.

- (article, title) with different target labels (y) → drop the entire cluster
- (article, title, y) identical → collapse the records into a single instance, aggregating metadata as follows:
  - source → list of unique non-null sources
  - page_rank → median across observations
  - timestamp → earliest valid value, kept as the single representative
This procedure removes spurious duplicates while preserving informative metadata variability.
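
The rules above can be sketched on a toy frame with `groupby` and named aggregation. This is only an illustration under assumed data, not the notebook's implementation, which loops over the groups and handles the article and title keys in separate passes:

```python
import pandas as pd

# Toy data: (x, t1) is a clean duplicate, (q, t2) has conflicting labels.
toy = pd.DataFrame({
    "article":   ["x", "x", "q", "q", "z"],
    "title":     ["t1", "t1", "t2", "t2", "t3"],
    "source":    ["A", "B", "A", None, "C"],
    "page_rank": [2, 4, 5, 5, 3],
    "timestamp": pd.to_datetime(
        ["2020-01-02", "2020-01-01", "2020-01-03", "2020-01-04", "2020-01-05"]
    ),
    "y":         [0, 0, 1, 2, 1],
})

content_keys = ["article", "title"]

# Rule 1: drop every cluster whose rows disagree on the label.
n_labels = toy.groupby(content_keys)["y"].transform("nunique")
consistent = toy[n_labels.le(1)]

# Rule 2: collapse identical (article, title, y) rows, aggregating metadata.
collapsed = (
    consistent.groupby(content_keys + ["y"], as_index=False)
    .agg(
        source=("source", lambda s: sorted(s.dropna().unique())),
        page_rank=("page_rank", "median"),
        timestamp=("timestamp", "min"),
    )
)
print(collapsed)
```

Here the (q, t2) cluster disappears entirely, while the two (x, t1) rows become one row with both sources, the median page rank, and the earlier timestamp.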

In [3]:
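# Flag every row whose article text appears with more than one label.
conflict_idx = (
    news_df.groupby(['article'])['y']
          .transform('nunique')
          .gt(1)
)
print(news_df.shape)
# Drop the whole conflicting cluster, not just the extra rows.
news_df.drop(index=news_df.index[conflict_idx], inplace=True)
print(news_df.shape)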

(79997, 6)
(76462, 6)


In [None]:
import pandas as pd

keys = ['article', 'y']

# Normalize timestamps: the placeholder "0000-00-00 00:00:00" and any
# unparseable value become NaT.
ts = news_df["timestamp"].replace("0000-00-00 00:00:00", pd.NA)
news_df["timestamp"] = pd.to_datetime(ts, errors="coerce")

# Note: duplicated() treats NaN keys as equal and groupby(dropna=False) keeps
# them, so rows with a missing article are also collapsed, one row per label.
dup_mask = news_df.duplicated(subset=keys, keep=False)
groups = news_df.loc[dup_mask].groupby(keys, dropna=False)

updates = {}          # keep_idx -> updated column values
to_drop = []          # indices of the redundant rows to drop

for (article, y), g in groups:
    keep_idx = g.index[0]
    to_drop.extend(g.index[1:])

    src_nonnull = sorted(g['source'].dropna().astype(str).unique())

    # A single real source goes in 'source' and 'sources' stays missing;
    # otherwise 'source' is missing and 'sources' holds the list.
    if len(src_nonnull) == 1:
        source, sources = src_nonnull[0], pd.NA
    else:
        source, sources = pd.NA, src_nonnull  # list without NaN

    valid_ts = g['timestamp'].dropna()
    timestamp = valid_ts.min() if not valid_ts.empty else pd.NaT

    updates[keep_idx] = {
        'timestamp': timestamp,
        'page_rank': int(round(g['page_rank'].median())),
        'source': source,
        'sources': sources,
        # keep the longest title in the group
        'title': g.loc[g['title'].str.len().idxmax(), 'title'],
    }

# Create the 'sources' column if it does not exist yet.
if 'sources' not in news_df.columns:
    news_df['sources'] = pd.NA

# Apply the aggregated values to the rows that are kept.
updates_df = pd.DataFrame.from_dict(updates, orient='index')
news_df.loc[updates_df.index, updates_df.columns] = updates_df

# Drop the redundant duplicate rows.
news_df.drop(index=to_drop, inplace=True)



# Second pass: same procedure keyed on (title, y), keeping the longest article.
# Note: only 'source' is read here, so a 'sources' list written by the first
# pass is not carried over into a merged row.
keys = ['title', 'y']

dup_mask = news_df.duplicated(subset=keys, keep=False)
groups = news_df.loc[dup_mask].groupby(keys, dropna=False)

updates = {}          # keep_idx -> updated column values
to_drop = []          # indices of the redundant rows to drop

for (title, y), g in groups:
    keep_idx = g.index[0]
    to_drop.extend(g.index[1:])

    src_nonnull = sorted(g['source'].dropna().astype(str).unique())

    # A single real source goes in 'source' and 'sources' stays missing;
    # otherwise 'source' is missing and 'sources' holds the list.
    if len(src_nonnull) == 1:
        source, sources = src_nonnull[0], pd.NA
    else:
        source, sources = pd.NA, src_nonnull  # list without NaN

    valid_ts = g['timestamp'].dropna()
    timestamp = valid_ts.min() if not valid_ts.empty else pd.NaT

    updates[keep_idx] = {
        'timestamp': timestamp,
        'page_rank': int(round(g['page_rank'].median())),
        'source': source,
        'sources': sources,
        # keep the longest article in the group
        'article': g.loc[g['article'].str.len().idxmax(), 'article'],
    }

# Create the 'sources' column if it does not exist yet.
if 'sources' not in news_df.columns:
    news_df['sources'] = pd.NA

# Apply the aggregated values to the rows that are kept.
updates_df = pd.DataFrame.from_dict(updates, orient='index')
news_df.loc[updates_df.index, updates_df.columns] = updates_df

# Drop the redundant duplicate rows.
news_df.drop(index=to_drop, inplace=True)

(72742, 7)
(71947, 7)


In [6]:
news_df.isna().sum()

source         374
title            0
article          6
page_rank        0
timestamp    24253
y                0
sources      71015
dtype: int64
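
The drop in missing `article` values (1887 before, 6 after) is consistent with how pandas treats missing keys: `duplicated()` considers NaN values equal to each other, and `groupby(..., dropna=False)` keeps them as a group. A toy sketch of that behavior, with made-up data:

```python
import pandas as pd

# Three rows with a missing article; two of them share the same label.
toy = pd.DataFrame({
    "article": [None, None, None, "x"],
    "y":       [0, 0, 1, 0],
})

# Rows 0 and 1 are flagged as duplicates of each other even though
# their article is missing.
dup_mask = toy.duplicated(subset=["article", "y"], keep=False)

# With dropna=False the missing key forms its own group, so a collapse
# step would merge these rows into one.
group_sizes = toy.loc[dup_mask].groupby(["article", "y"], dropna=False).size()

print(dup_mask.tolist())
print(group_sizes.tolist())
```

If rows without an article should be kept as distinct records, one option is to restrict `dup_mask` to rows where the key is not null before grouping.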