# Original To Curated

The purpose of this notebook is to read the data downloaded from the [Zenodo website](https://zenodo.org/record/5530410) and stored in the `Data/Original/` folder, to aggregate and clean it, and to store the result in the `Data/Curated/` folder in an appropriate format.

In [1]:
import pandas as pd
import numpy as np

Variables pointing to the original data files and to the curated output folder.

In [65]:
dataOriginalPath = 'Data/Original/'
dataTrainX = dataOriginalPath +'clickbait17-train-170630/instances.jsonl'
dataTrainY = dataOriginalPath +'clickbait17-train-170630/truth.jsonl'

# A second training set is available in two more files
dataTrainX1 = dataOriginalPath +'clickbait17-train-170331/instances.jsonl'
dataTrainY1 = dataOriginalPath +'clickbait17-train-170331/truth.jsonl'

dataTestX = dataOriginalPath +'clickbait17-test-170720/instances.jsonl'
dataTestY = dataOriginalPath +'clickbait17-test-170720/truth.jsonl'

dataCuratedPath = 'Data/Curated/'
LCfraction = 'truthMean'  # name of the label column

Read the training data, which is stored in the [JSON Lines](https://jsonlines.org/) format.

In [3]:
# Do not use automatic type discovery, because ids are sometimes converted to int64 and truncated (a bug that took 4 hours to find).
pdoTrainX=pd.read_json(dataTrainX, lines = True, dtype=False)
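
The truncation mentioned in the comment above can be reproduced in isolation. The sketch below is an illustration, not part of the original pipeline: it assumes the ids are stored as JSON strings, uses a hypothetical in-memory `sample` line instead of the real file, and shows that an 18-digit id does not survive a round trip through float64, whereas `dtype=False` leaves it as text.

```python
import io

import pandas as pd

tweet_id = "858462320779026433"

# float64 has a 53-bit mantissa, so integers this large are rounded.
round_tripped = int(float(tweet_id))
print(round_tripped == int(tweet_id))  # False: the last digits changed

# Hypothetical one-line sample in JSON Lines format, standing in for instances.jsonl.
sample = '{"id": "858462320779026433", "postText": ["this is good"]}\n'

# dtype=False disables type inference, so the id is kept as the original string.
frame = pd.read_json(io.StringIO(sample), lines=True, dtype=False)
print(frame.loc[0, "id"] == tweet_id)
```

Keeping the ids as strings also makes them safe to use as index labels when the features and the labels are matched later on.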

In [4]:
pdoTrainX

,postMedia,postText,id,targetCaptions,targetParagraphs,targetTitle,postTimestamp,targetKeywords,targetDescription
0,[],[UK’s response to modern slavery leaving victi...,858462320779026433,[modern-slavery-rex.jpg],[Thousands of modern slavery victims have not ...,‘Inexcusable’ failures in UK’s response to mod...,Sat Apr 29 23:25:41 +0000 2017,"modern slavery, Department For Work And Pensio...",“Inexcusable” failures in the UK’s system for ...
1,[],[this is good],858421020331560960,"[In this July 1, 2010 file photo, Dr. Charmain...",[President Donald Trump has appointed the pro-...,Donald Trump Appoints Pro-Life Advocate as Ass...,Sat Apr 29 20:41:34 +0000 2017,"Americans United for Life, Dr. Charmaine Yoest...",President Donald Trump has appointed pro-life ...
2,[],"[The ""forgotten"" Trump roast: Relive his bruta...",858368123753435136,[President Trump will not attend this year's W...,[When the White House correspondents’ dinner i...,The ‘forgotten’ Trump roast: Relive his brutal...,Sat Apr 29 17:11:23 +0000 2017,"trump whcd, whcd, white house correspondents d...",President Trump won't be at this year's White ...
3,[],[Meet the happiest #dog in the world!],858323428260139008,"[Maru , Maru, Maru, Maru, Maru]",[Adorable is probably an understatement. This ...,"Meet The Happiest Dog In The World, Maru The H...",Sat Apr 29 14:13:46 +0000 2017,"Maru, husky, dogs, pandas, furball, instagram","The article is about Maru, a husky dog who has..."
4,[],[Tokyo's subway is shut down amid fears over a...,858283602626347008,[All nine lines of Tokyo's subway system were ...,[One of Tokyo's major subways systems says it ...,Tokyo's subway is shut down amid fears over an...,Sat Apr 29 11:35:31 +0000 2017,"Tokyo,subway,shut,fears,North,Korean,attack","The temporary suspension, which lasted ten min..."
...,...,...,...,...,...,...,...,...,...
19533,[media/photo_804240867972304896.jpg],[Brazil soccer team and pilot's final intervie...,804250183642976256,"[CNBC, msnbc, NBC NEWS, TODAY, xfinity]",[Watch Live: Joe Biden Honored on Senate Floor...,"NBC News Video See Brazil Soccer Team, Pilot’s...",Thu Dec 01 09:06:00 +0000 2016,,NBC News
19534,[],[😱😱😱😱😱😱😱😱😱😱😱😱😱😱],804156272086020096,"[Instagram/madonna, Speaker Ryan Retreats on H...",[On November 30 Politico reported that Eric Tr...,Politico Scoop: Eric Trump Killed Two Deer,Thu Dec 01 02:52:50 +0000 2016,Politico Scoop: Eric Trump Killed Two Deer,Politico Scoop: Eric Trump Killed Two Deer
19535,[],[Frenchs Forest high school may have to make w...,804149798651588608,[An artist's impression of the proposed new to...,[The Forest High School on Sydney's northern b...,Frenchs Forest high school may relocate to mak...,Thu Dec 01 02:27:07 +0000 2016,"frenchs forest, northern beaches, sydney, rede...",The Forest High School on Sydney's northern be...
19536,[media/photo_804133521023324160.jpg],[Oh Jeff… #bruh],804134698729385984,[Jeff Fisher May Think Danny Woodhead Still Pl...,[NFL coaches have a lot of information to reme...,Los Angeles Rams Jeff Fisher May Think Danny W...,Thu Dec 01 01:27:06 +0000 2016,"Humor, Football, NFL, NFC West, Los Angeles Ra...","Los Angeles Rams news, rumors, scores, schedul..."


Drop the unused columns, keeping a few extra ones for debugging purposes.

In [5]:
pdoTrainX.drop(['postMedia', 'targetCaptions', 'targetParagraphs', 'postTimestamp', 'targetKeywords'], axis=1, inplace=True)

In [6]:
# With dtype=False, every column is left as object
pdoTrainX.dtypes

postText             object
id                   object
targetTitle          object
targetDescription    object
dtype: object

In [7]:
# Use id as the index. Don't forget inplace=True
pdoTrainX.set_index('id', inplace=True)

In [8]:
pdoTrainX

id,postText,targetTitle,targetDescription
858462320779026433,[UK’s response to modern slavery leaving victi...,‘Inexcusable’ failures in UK’s response to mod...,“Inexcusable” failures in the UK’s system for ...
858421020331560960,[this is good],Donald Trump Appoints Pro-Life Advocate as Ass...,President Donald Trump has appointed pro-life ...
858368123753435136,"[The ""forgotten"" Trump roast: Relive his bruta...",The ‘forgotten’ Trump roast: Relive his brutal...,President Trump won't be at this year's White ...
858323428260139008,[Meet the happiest #dog in the world!],"Meet The Happiest Dog In The World, Maru The H...","The article is about Maru, a husky dog who has..."
858283602626347008,[Tokyo's subway is shut down amid fears over a...,Tokyo's subway is shut down amid fears over an...,"The temporary suspension, which lasted ten min..."
...,...,...,...
804250183642976256,[Brazil soccer team and pilot's final intervie...,"NBC News Video See Brazil Soccer Team, Pilot’s...",NBC News
804156272086020096,[😱😱😱😱😱😱😱😱😱😱😱😱😱😱],Politico Scoop: Eric Trump Killed Two Deer,Politico Scoop: Eric Trump Killed Two Deer
804149798651588608,[Frenchs Forest high school may have to make w...,Frenchs Forest high school may relocate to mak...,The Forest High School on Sydney's northern be...
804134698729385984,[Oh Jeff… #bruh],Los Angeles Rams Jeff Fisher May Think Danny W...,"Los Angeles Rams news, rumors, scores, schedul..."


In [9]:
# postText holds one-element lists: keep only the first element, which removes the brackets
pdoTrainX["postText"]=pdoTrainX["postText"].apply(lambda x : x[0])
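
The lambda above raises an `IndexError` if a list happens to be empty. As a minimal sketch, assuming such rows could exist (the notebook does not say whether they do), a defensive variant might look like the following; the helper name `first_or_empty` and the toy ids are hypothetical.

```python
import pandas as pd

# Toy frame with hypothetical ids; the last row has an empty list.
posts = pd.DataFrame(
    {"postText": [["Meet the happiest #dog in the world!"], ["this is good"], []]},
    index=["a", "b", "c"],
)


def first_or_empty(items):
    """Return the first element of a list, or an empty string if the list is empty."""
    return items[0] if items else ""


posts["postText"] = posts["postText"].apply(first_or_empty)
print(posts)
```

Rows that end up with an empty string are then handled by the same length filter that removes empty posts.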

In [10]:
pdoTrainY=pd.read_json(dataTrainY, lines = True, dtype=False)

In [11]:
pdoTrainY

,truthJudgments,truthMean,id,truthClass,truthMedian,truthMode
0,"[1.0, 1.0, 1.0, 1.0, 1.0]",1.000000,858464162594172928,clickbait,1.000000,1.000000
1,"[0.33333333330000003, 0.0, 0.33333333330000003...",0.133333,858462320779026433,no-clickbait,0.000000,0.000000
2,"[0.33333333330000003, 0.6666666666000001, 1.0,...",0.400000,858460992073863168,no-clickbait,0.333333,0.000000
3,"[0.0, 0.6666666666000001, 0.0, 0.3333333333000...",0.266667,858459539296980995,no-clickbait,0.333333,0.333333
4,"[0.0, 0.0, 0.0, 0.0, 0.0]",0.000000,858455355948384257,no-clickbait,0.000000,0.000000
...,...,...,...,...,...,...
19533,"[0.0, 0.6666666666000001, 0.0, 0.0, 0.0]",0.133333,804126501117435904,no-clickbait,0.000000,0.000000
19534,"[0.0, 0.0, 0.0, 0.33333333330000003, 0.0]",0.066667,804123103995580416,no-clickbait,0.000000,0.000000
19535,"[0.6666666666000001, 0.6666666666000001, 0.0, ...",0.333333,804121272967983104,no-clickbait,0.333333,0.000000
19536,"[1.0, 0.0, 0.6666666666000001, 1.0, 1.0]",0.733333,804119512010424320,clickbait,1.000000,1.000000


In [12]:
pdoTrainY.set_index('id', inplace=True)

In [13]:
pdoTrainY

id,truthJudgments,truthMean,truthClass,truthMedian,truthMode
858464162594172928,"[1.0, 1.0, 1.0, 1.0, 1.0]",1.000000,clickbait,1.000000,1.000000
858462320779026433,"[0.33333333330000003, 0.0, 0.33333333330000003...",0.133333,no-clickbait,0.000000,0.000000
858460992073863168,"[0.33333333330000003, 0.6666666666000001, 1.0,...",0.400000,no-clickbait,0.333333,0.000000
858459539296980995,"[0.0, 0.6666666666000001, 0.0, 0.3333333333000...",0.266667,no-clickbait,0.333333,0.333333
858455355948384257,"[0.0, 0.0, 0.0, 0.0, 0.0]",0.000000,no-clickbait,0.000000,0.000000
...,...,...,...,...,...
804126501117435904,"[0.0, 0.6666666666000001, 0.0, 0.0, 0.0]",0.133333,no-clickbait,0.000000,0.000000
804123103995580416,"[0.0, 0.0, 0.0, 0.33333333330000003, 0.0]",0.066667,no-clickbait,0.000000,0.000000
804121272967983104,"[0.6666666666000001, 0.6666666666000001, 0.0, ...",0.333333,no-clickbait,0.333333,0.000000
804119512010424320,"[1.0, 0.0, 0.6666666666000001, 1.0, 1.0]",0.733333,clickbait,1.000000,1.000000


In [14]:
pdoTrainX.sort_index(inplace=True)

In [15]:
pdoTrainY.sort_index(inplace=True)

In [16]:
pdoTrainY['truthMean']

id
804113781580328960    0.066667
804119512010424320    0.733333
804121272967983104    0.333333
804123103995580416    0.066667
804126501117435904    0.133333
                        ...   
858455355948384257    0.000000
858459539296980995    0.266667
858460992073863168    0.400000
858462320779026433    0.133333
858464162594172928    1.000000
Name: truthMean, Length: 19538, dtype: float64

In [17]:
pdoTrainX.index.equals(pdoTrainY.index)

True

In [18]:
pdoTrainX.index.intersection(pdoTrainY.index).empty

False

In [19]:
# The assignment is aligned on the id index
pdoTrainX['truthMean']=pdoTrainY['truthMean']
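
Assigning a Series to a DataFrame column matches rows by index label, not by position. The toy example below, with made-up ids and values, is a sketch of that behaviour: the labels are deliberately stored in a different order than the features.

```python
import pandas as pd

# Hypothetical features and labels, indexed by id but ordered differently.
features = pd.DataFrame({"postText": ["first", "second", "third"]}, index=["10", "20", "30"])
labels = pd.DataFrame({"truthMean": [0.9, 0.1, 0.5]}, index=["30", "10", "20"])

# Each value lands on the row carrying the same id.
features["truthMean"] = labels["truthMean"]
print(features)
```

An id that is missing from the labels would receive `NaN`, which is why checking that both indexes are equal beforehand is useful.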

In [20]:
pdoTrainX

id,postText,targetTitle,targetDescription,truthMean
804113781580328960,"Panama Papers: Europol links 3,500 names to su...","Panama Papers: Europol links 3,500 names to su...",Law enforcement agency analysis uncovers proba...,0.066667
804119512010424320,The key to truly great chicken soup,A Superior Chicken Soup,For the best rendition of this American classi...,0.733333
804121272967983104,Afghan policewomen face down their fears to serve,100 Women 2016: On the frontline with the wome...,The Afghan women risking all to join the police.,0.333333
804123103995580416,Conservatives are watching less football this ...,Older Viewers and Conservatives Are Watching L...,"Many factors are dragging down NFL ratings, in...",0.066667
804126501117435904,Richard Sherman weighs in on Cam Newton’s stru...,Seattle Seahawks Richard Sherman Says 'Karma' ...,"Seattle Seahawks news, rumors, scores, schedul...",0.133333
...,...,...,...,...
858455355948384257,Trump now agrees with the majority of American...,Donald Trump said being US president was harde...,Donald Trump spent a great portion of 2016 ins...,0.000000
858459539296980995,Trump has flip-flopped. But his supporters are...,Trump Has Flip-Flopped. But His Supporters Are...,Barely over a tenth of Trump voters think his ...,0.266667
858460992073863168,Inside North Korea's secret prisons,Inside Kim Jong-un's camps of death: Former No...,A female guard (stock photo) at a North Korean...,0.400000
858462320779026433,UK’s response to modern slavery leaving victim...,‘Inexcusable’ failures in UK’s response to mod...,“Inexcusable” failures in the UK’s system for ...,0.133333


In [21]:
pdoTrainX.describe()

Unnamed: 0,truthMean
count,19538.0
mean,0.32453
std,0.252824
min,0.0
25%,0.133333
50%,0.266667
75%,0.466667
max,1.0


In [59]:
# Statistics for very short posts (4 characters or fewer)
pdoTrainX[pdoTrainX["postText"].str.len() <= 4].describe()

Unnamed: 0,truthMean
count,104.0
mean,0.753205
std,0.190356
min,0.133333
25%,0.6
50%,0.8
75%,0.933333
max,1.0


In [60]:
# Remove the entries whose postText is empty
pdoTrainX=pdoTrainX[pdoTrainX["postText"].str.len() > 0]
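
A variant of the same filter, sketched on a toy frame with hypothetical ids, keeps the boolean mask in a variable so that the number of dropped rows can be reported before they disappear.

```python
import pandas as pd

# Toy frame with hypothetical ids; the second post is empty.
posts = pd.DataFrame(
    {"postText": ["Oh Jeff… #bruh", "", "this is good"], "truthMean": [0.6, 0.8, 0.2]},
    index=["1", "2", "3"],
)

is_empty = posts["postText"].str.len() == 0
print(f"dropping {is_empty.sum()} empty posts out of {len(posts)}")

# .copy() makes the result an independent frame rather than a view of the original.
posts = posts[~is_empty].copy()
```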

In [67]:
# Same processing for the second training set.
pdoTrainX1=pd.read_json(dataTrainX1, lines = True, dtype=False)
pdoTrainX1.drop(['postMedia', 'targetCaptions', 'targetParagraphs', 'postTimestamp', 'targetKeywords'], axis=1, inplace=True)
pdoTrainX1.set_index('id', inplace=True)
pdoTrainX1["postText"]=pdoTrainX1["postText"].apply(lambda x : x[0])

pdoTrainY1=pd.read_json(dataTrainY1, lines = True, dtype=False)
pdoTrainY1.set_index('id', inplace=True)
# The assignment is aligned on the id index, so no sorting is needed here.
pdoTrainX1['truthMean']=pdoTrainY1['truthMean']

pdoTrainX1=pdoTrainX1[pdoTrainX1["postText"].str.len() > 0]
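
Since both training sets go through the same steps, they could be wrapped in one function. The following is a sketch under assumptions, not the notebook's actual code: the name `load_split` and the tiny temporary files are hypothetical, and the records only imitate the structure of the real `instances.jsonl` and `truth.jsonl` files.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

UNUSED_COLUMNS = ["postMedia", "targetCaptions", "targetParagraphs", "postTimestamp", "targetKeywords"]


def load_split(instances_path, truth_path, label="truthMean"):
    """Load one instances/truth pair and return a cleaned frame indexed by id."""
    x = pd.read_json(instances_path, lines=True, dtype=False)
    # errors="ignore" tolerates files that lack some of the unused columns.
    x = x.drop(columns=UNUSED_COLUMNS, errors="ignore").set_index("id")
    x["postText"] = x["postText"].apply(lambda items: items[0])
    y = pd.read_json(truth_path, lines=True, dtype=False).set_index("id")
    x[label] = y[label]  # aligned on the id index
    return x[x["postText"].str.len() > 0]


# Hypothetical two-record files used only to exercise the function.
instances = [
    {"id": "858421020331560960", "postText": ["this is good"], "targetTitle": "t1", "postMedia": []},
    {"id": "858421020331560961", "postText": [""], "targetTitle": "t2", "postMedia": []},
]
truths = [
    {"id": "858421020331560961", "truthMean": 1.0},
    {"id": "858421020331560960", "truthMean": 0.5},
]

with tempfile.TemporaryDirectory() as tmp:
    inst_file = Path(tmp) / "instances.jsonl"
    truth_file = Path(tmp) / "truth.jsonl"
    inst_file.write_text("\n".join(json.dumps(r) for r in instances), encoding="utf-8")
    truth_file.write_text("\n".join(json.dumps(r) for r in truths), encoding="utf-8")
    demo = load_split(inst_file, truth_file)

print(demo)
```

With such a helper, each training set would be loaded by a single call with the corresponding pair of paths, and the two results concatenated.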

In [70]:
pd.concat([pdoTrainX, pdoTrainX1])

id,postText,targetTitle,targetDescription,truthMean
804113781580328960,"Panama Papers: Europol links 3,500 names to su...","Panama Papers: Europol links 3,500 names to su...",Law enforcement agency analysis uncovers proba...,0.066667
804119512010424320,The key to truly great chicken soup,A Superior Chicken Soup,For the best rendition of this American classi...,0.733333
804121272967983104,Afghan policewomen face down their fears to serve,100 Women 2016: On the frontline with the wome...,The Afghan women risking all to join the police.,0.333333
804123103995580416,Conservatives are watching less football this ...,Older Viewers and Conservatives Are Watching L...,"Many factors are dragging down NFL ratings, in...",0.066667
804126501117435904,Richard Sherman weighs in on Cam Newton’s stru...,Seattle Seahawks Richard Sherman Says 'Karma' ...,"Seattle Seahawks news, rumors, scores, schedul...",0.133333
...,...,...,...,...
609056814819323905,Man who received world's first penis transplan...,World's first penis transplant patient is set ...,"Surgeons at Stellenbosch University, who carri...",0.600000
610125815116865536,"RT @NYTSports: Abby didn't start, team couldn'...","At Women’s World Cup, Tie Leaves U.S. on Solid...",With Abby Wambach not starting for the first t...,0.266667
608338587495628801,Obama defends Affordable Care Act ahead of Sup...,Obama Defends Health Law Ahead of Supreme Cour...,President Obama talks at the G7 summit in Germ...,0.400000
609684420082180096,New study of the Deflategate report concludes ...,Deflating ‘Deflategate’,A new study weakens the case against the Patri...,0.400000


In [71]:
# The result has to be converted to a dataset
import datasets as ds
curated = ds.Dataset.from_pandas(pd.concat([pdoTrainX, pdoTrainX1]))  # preserve_index=False would drop the id column

In [72]:
# Preview an 80/20 split; the result is not assigned, so curated itself stays unsplit
curated.train_test_split(0.2)

DatasetDict({
    train: Dataset({
        features: ['postText', 'targetTitle', 'targetDescription', 'truthMean', 'id'],
        num_rows: 17529
    })
    test: Dataset({
        features: ['postText', 'targetTitle', 'targetDescription', 'truthMean', 'id'],
        num_rows: 4383
    })
})

In [73]:
curated.info.description = "Clickbait"
curated.info.version = "0.2.0"
curated.info.supervised_keys = [LCfraction]
curated.save_to_disk(dataCuratedPath)

In [74]:
curated.to_csv(dataCuratedPath + "dataset.csv", sep=';')

Creating CSV from Arrow format: 100%|██████████| 3/3 [00:00<00:00,  8.70ba/s]


7322577
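
When the CSV file is read back with pandas, the ids can be kept as strings, consistent with the rest of this notebook. The sketch below is an assumption about how a consumer might do it: the in-memory `csv_text` is a hypothetical two-line extract whose header follows the feature order shown above, not the real file.

```python
import io

import pandas as pd

# Hypothetical extract of dataset.csv, using the same ';' separator.
csv_text = (
    "postText;targetTitle;targetDescription;truthMean;id\n"
    "this is good;Some title;Some description;0.5;858421020331560960\n"
)

# dtype={"id": str} prevents the id column from being parsed as a number.
restored = pd.read_csv(io.StringIO(csv_text), sep=";", dtype={"id": str}).set_index("id")
print(restored)
```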