## DUPLICATES

(https://www.clrn.org/how-to-handle-duplicate-data-in-machine-learning/)

From the first analysis:

- development set
  - duplicate rows: 1559
  - NaN article col: 1875
- evaluation set
  - duplicate rows: 96
  - NaN article col: 449
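
A minimal sketch of how counts like these could be obtained with pandas. Whether the first analysis computed them exactly this way is an assumption; `duplicate_summary` and the toy frame are hypothetical names introduced only for illustration.

```python
import pandas as pd


def duplicate_summary(frame: pd.DataFrame, text_col: str = "article") -> dict:
    """Count fully duplicated rows and missing values in `text_col`."""
    return {
        # duplicated() flags every repeat of a row after its first occurrence
        "duplicate_rows": int(frame.duplicated().sum()),
        # isna() covers None, NaN and pd.NA
        "nan_text": int(frame[text_col].isna().sum()),
    }


# Toy data (not taken from the real dataset)
toy = pd.DataFrame(
    {
        "title": ["a", "a", "b", "c"],
        "article": ["x", "x", None, "z"],
        "y": [0, 0, 1, 2],
    }
)
summary = duplicate_summary(toy)
print(summary)
```

Note that `duplicated()` with the default `keep='first'` counts only the extra copies, not the first occurrence of each duplicated row.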

In [None]:
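import matplotlib.pyplot as plt
import pandas as pd  # pd is used below; import it explicitly rather than relying on the star imports

from src.config import *
from src.preprocessing import *
from src.preprocessing import Preprocessor, initial_prep
from src.utils import load_data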

df = load_data(DEVELOPMENT_PATH)
df = initial_prep(df)

In [None]:
df['title'] = Preprocessor.clean_text(df['title'])
df['article'] = Preprocessor.clean_text(df['article'])
df.loc[df['article'].str.len() < 5, "article"] = pd.NA

# Timestamp formatting
ts = df["timestamp"].replace("0000-00-00 00:00:00", pd.NA)
ts = pd.to_datetime(ts, errors="coerce")  # invalid -> NaT
df["timestamp"] = ts

# DUPLICATES

# If all cols match and also the target => keep only one row
df.drop_duplicates(inplace=True)

# If all cols match but NOT the target => drop all rows
df.drop_duplicates(subset=['source', 'title', 'article', 'page_rank', 'timestamp'], keep=False, inplace=True)

# If source, title, article, page_rank match but NOT timestamp => keep only one row,
# with y = majority vote and timestamp = earliest non-null among the majority-label rows
dup_cols = ['source', 'title', 'article', 'page_rank']
dups = df[df.duplicated(subset=dup_cols, keep=False)].sort_values(['title', 'timestamp'])
# dropna=False: duplicated() treats NaN keys as equal, but groupby skips them by default
for _, g in dups.groupby(dup_cols, dropna=False):
    y = g['y'].value_counts().index[0]  # most frequent label (ties follow value_counts order)
    timestamp = g[g['y'] == y]['timestamp'].min()  # min() ignores NaT
    df.loc[g.index[0], 'y'] = y
    df.loc[g.index[0], 'timestamp'] = timestamp
    df.drop(index=g.index[1:], inplace=True)


In [4]:
df

Id,source,title,article,page_rank,timestamp,y
0,AllAfrica.com,opec boosts nigeria's oil revenue by .82m bpd,the organisation of petroleum exporting countr...,5,2004-09-16 22:39:53,5
1,Xinhua,yearender: mideast peace roadmap reaches dead-...,looking back at the major events that took pla...,5,2004-12-17 19:01:14,0
2,Yahoo,battleground dispatches for oct. 5 \ (cqpoliti...,cqpolitics.com - here are today's battleground...,5,2006-10-05 18:42:29,0
3,BBC,air best to resuscitate newborns,air rather than oxygen should be used to resus...,5,NaT,0
4,Yahoo,high tech german train crash kills at least on...,"<p><a href=""http://us.rd.yahoo.com/dailynews/r...",5,2006-09-22 17:28:57,0
...,...,...,...,...,...,...
79992,Yahoo,italy's embattled prodi faces vote of confiden...,"<p><a href=""http://us.rd.yahoo.com/dailynews/r...",5,2008-01-23 11:39:35,0
79993,All-Baseball.com,"ding dong, the deal is dead","as yesterday began, there was widespread antic...",5,NaT,4
79994,Yahoo,two bombs discovered in sardinia after berlusc...,afp - police discovered two bombs near the sar...,5,NaT,0
79995,Voice,red cross report alleges us detainee abuse at ...,a report by the international committee of the...,5,NaT,3


In [3]:
for a, g in df[df.duplicated(subset=['source', 'title', 'article', 'timestamp'], keep=False)].groupby(['source', 'title', 'article', 'timestamp']):
    print(g)

In [4]:
news_df

Id,source,title,article,page_rank,timestamp,y
0,AllAfrica.com,opec boosts nigeria's oil revenue by .82m bpd,the organisation of petroleum exporting countr...,5,2004-09-16 22:39:53,5
1,Xinhua,yearender: mideast peace roadmap reaches dead-...,looking back at the major events that took pla...,5,2004-12-17 19:01:14,0
2,Yahoo,battleground dispatches for oct. 5 \ (cqpoliti...,cqpolitics.com - here are today's battleground...,5,2006-10-05 18:42:29,0
3,BBC,air best to resuscitate newborns,air rather than oxygen should be used to resus...,5,NaT,0
4,Yahoo,high tech german train crash kills at least on...,"<p><a href=""http://us.rd.yahoo.com/dailynews/r...",5,2006-09-22 17:28:57,0
...,...,...,...,...,...,...
79991,RedNova,leds move into home lighting market,"by mark jewell everett, mass. - joey nicotera'...",4,2007-06-25 07:08:21,2
79992,Yahoo,italy's embattled prodi faces vote of confiden...,"<p><a href=""http://us.rd.yahoo.com/dailynews/r...",5,2008-01-23 11:39:35,0
79994,Yahoo,two bombs discovered in sardinia after berlusc...,afp - police discovered two bombs near the sar...,5,NaT,0
79995,Voice,red cross report alleges us detainee abuse at ...,a report by the international committee of the...,5,NaT,3


In [2]:
news_df.drop_duplicates(inplace=True)

In [3]:
news_df

Id,source,title,article,page_rank,timestamp,y
0,AllAfrica.com,opec boosts nigeria's oil revenue by .82m bpd,the organisation of petroleum exporting countr...,5,2004-09-16 22:39:53,5
1,Xinhua,yearender: mideast peace roadmap reaches dead-...,looking back at the major events that took pla...,5,2004-12-17 19:01:14,0
2,Yahoo,battleground dispatches for oct. 5 \ (cqpoliti...,cqpolitics.com - here are today's battleground...,5,2006-10-05 18:42:29,0
3,BBC,air best to resuscitate newborns,air rather than oxygen should be used to resus...,5,0000-00-00 00:00:00,0
4,Yahoo,high tech german train crash kills at least on...,"<p><a href=""http://us.rd.yahoo.com/dailynews/r...",5,2006-09-22 17:28:57,0
...,...,...,...,...,...,...
79992,Yahoo,italy's embattled prodi faces vote of confiden...,"<p><a href=""http://us.rd.yahoo.com/dailynews/r...",5,2008-01-23 11:39:35,0
79993,All-Baseball.com,"ding dong, the deal is dead","as yesterday began, there was widespread antic...",5,0000-00-00 00:00:00,4
79994,Yahoo,two bombs discovered in sardinia after berlusc...,afp - police discovered two bombs near the sar...,5,0000-00-00 00:00:00,0
79995,Voice,red cross report alleges us detainee abuse at ...,a report by the international committee of the...,5,0000-00-00 00:00:00,3


In [4]:
for a,g in news_df[news_df.drop(columns='y').duplicated(keep=False)].groupby(['source', 'title', 'article', 'page_rank', 'timestamp']):
    if len(g) != 2:
        print(len(g))
        print(g['y'])

3
Id
8076     0
28144    1
78809    5
Name: y, dtype: int64
3
Id
23106    5
52194    4
65990    3
Name: y, dtype: int64
3
Id
15708    3
18532    5
51791    2
Name: y, dtype: int64
3
Id
25516    2
28103    5
72653    3
Name: y, dtype: int64
3
Id
12505    5
29696    2
64346    3
Name: y, dtype: int64
3
Id
4315     0
28835    3
34474    5
Name: y, dtype: int64


In [5]:
news_df.drop_duplicates(subset=['source', 'title', 'article', 'page_rank', 'timestamp'], keep=False, inplace=True)
news_df.shape

(78697, 6)

In [6]:
news_df.isna().sum()

source        294
title           1
article      1880
page_rank       0
timestamp       0
y               0
dtype: int64

---

Deduplication rules.
Records are grouped at the content level by (article, title).

- (article, title) clusters whose records carry different target labels (y) → drop the entire cluster
- (article, title, y) identical → collapse the records into a single instance, aggregating metadata as follows:
  - source → list of unique non-null sources
  - page_rank → median across the records
  - timestamp → earliest value, kept as the single representative

This procedure removes spurious duplicates while retaining the metadata variation across them (all sources, the median page_rank, the earliest timestamp).
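
The rules above can be sketched as follows. This is an illustration under stated assumptions, not necessarily the implementation used in this project: `dedup_by_content` is a hypothetical helper, the toy rows are invented, and passing `dropna=False` (so that rows with a missing article or title still form clusters) is a design choice the rules do not specify.

```python
import pandas as pd


def dedup_by_content(frame: pd.DataFrame) -> pd.DataFrame:
    """Collapse (article, title) clusters; drop clusters with conflicting labels."""
    keys = ["article", "title"]
    # Number of distinct labels inside each (article, title) cluster
    n_labels = frame.groupby(keys, dropna=False)["y"].transform("nunique")
    # Rule 1: clusters with more than one label are dropped entirely
    consistent = frame[n_labels == 1]
    # Rule 2: collapse each remaining cluster into one row
    return consistent.groupby(keys, dropna=False, as_index=False).agg(
        source=("source", lambda s: sorted(s.dropna().unique())),  # unique non-null sources
        page_rank=("page_rank", "median"),
        timestamp=("timestamp", "min"),  # earliest value; NaT is ignored
        y=("y", "first"),  # all labels in the cluster are equal at this point
    )


# Toy data (not taken from the real dataset)
toy = pd.DataFrame(
    {
        "source": ["BBC", "Yahoo", "BBC", "Voice", None],
        "title": ["t1", "t1", "t2", "t2", "t3"],
        "article": ["a1", "a1", "a2", "a2", "a3"],
        "page_rank": [5, 3, 5, 5, 4],
        "timestamp": pd.to_datetime(
            ["2004-09-16", "2004-09-10", None, "2005-01-01", None]
        ),
        "y": [0, 0, 1, 2, 3],
    }
)
result = dedup_by_content(toy)
print(result)
```

In this toy example the `t1` cluster collapses into one row with both sources, the `t2` cluster is dropped because its labels disagree, and the single `t3` row is kept with an empty source list.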

In [3]:
df[df.duplicated(subset=['page_rank', 'title', 'article', 'timestamp'], keep=False)]


Id,source,title,article,page_rank,timestamp,y
1061,Ananova,adv: advance your career today,increase your earning potential with an accred...,5,NaT,1
2323,Telegraph.co.uk,brussels lets big spenders off the hook,the european commission yesterday abandoned ef...,5,NaT,1
3108,Washington,adv: free 4-room digital satellite system,free installation for your free system in up t...,5,NaT,5
4607,BBC,world 'ignoring' war torn darfur,"the world has shown ""callous disregard"" for th...",5,NaT,5
7394,BCC,wireless net to get speed boost,wireless computer networks could soon be runni...,5,NaT,2
9178,Topix.Net,brussels lets big spenders off the hook,the european commission yesterday abandoned ef...,5,NaT,0
9582,Calgary,goosen fires 64 to edge woods at tour champion...,atlanta (cp) - a dreadful front nine sunday le...,5,NaT,3
10352,Boston,bonds hits home run no. 700,barry bonds hit his 700th home run friday nigh...,5,NaT,4
11838,National,iraq car bomb kills 8 us marines in deadliest ...,a car bomb killed eight us marines outside fal...,5,NaT,3
14395,BCC,pay-as-you-drive car cover tested,a uk insurer is fitting mobile technology to s...,5,NaT,2


In [5]:
news_df = news_df.sort_values(['title', 'timestamp']).drop_duplicates(subset=['source', 'title', 'article', 'page_rank', 'y'], keep='first')

In [5]:
news_df[news_df.duplicated(subset=['source', 'title', 'article', 'timestamp'], keep=False)]


Id,source,title,article,page_rank,timestamp,y
22505,RedNova,'high street' sign is among most stolen,"eugene, ore. - the signs marking high street h...",4,NaT,3
