# Original To Curated

This notebook reads the data downloaded from [Zenodo](https://zenodo.org/record/5530410) and stored in the `Data/Clickbait17/Original/` folder, aggregates and cleans it, and stores the result in the `Data/Clickbait17/Curated/` folder as a Hugging Face dataset.

In [1]:
import pandas as pd
import numpy as np

Variables pointing to the original and curated data folders.

In [2]:
dataOriginalPath = 'Data/Clickbait17/Original/'
dataTrainX = dataOriginalPath +'clickbait17-train-170630/instances.jsonl'
dataTrainY = dataOriginalPath +'clickbait17-train-170630/truth.jsonl'

# These are the two supplement files
dataTrainX1 = dataOriginalPath +'clickbait17-train-170331/instances.jsonl'
dataTrainY1 = dataOriginalPath +'clickbait17-train-170331/truth.jsonl'

dataTestX = dataOriginalPath +'clickbait17-test-170720/instances.jsonl'
dataTestY = dataOriginalPath +'clickbait17-test-170720/truth.jsonl'

dataCuratedPath = 'Data/Clickbait17/Curated/'
actualOutput = 'truthMean'
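
As an alternative to string concatenation, the same layout could be expressed with `pathlib`. This is only a sketch: the snake_case names and the `split_files` helper are hypothetical and are not used elsewhere in this notebook.

```python
from pathlib import Path

# Same folder layout as above, built with pathlib instead of string concatenation.
data_original_path = Path("Data/Clickbait17/Original")
data_curated_path = Path("Data/Clickbait17/Curated")

def split_files(folder_name):
    """Return the (instances, truth) file paths for one dataset folder."""
    folder = data_original_path / folder_name
    return folder / "instances.jsonl", folder / "truth.jsonl"

train_x, train_y = split_files("clickbait17-train-170630")
print(train_x.as_posix())
print(train_y.as_posix())
```

Joining with `/` avoids having to remember whether a folder variable already ends with a slash.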

Reading the training data, using the [json lines](https://jsonlines.org/) format.

In [3]:
# Disable automatic type detection: ids are sometimes converted to int64 and truncated (a 4-hour bug).
pdoTrainX=pd.read_json(dataTrainX, lines = True, dtype=False)
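
The snippet below illustrates why ids of this size are fragile. It assumes the truncation happens because the value passes through a 64-bit float at some point during type inference. That is a plausible mechanism, not something verified against the pandas internals here.

```python
# One of the ids from the instances file, as it appears in the JSON (a string).
raw_id = "858462320779026433"

exact = int(raw_id)
# float64 has a 53-bit significand, so not every integer above 2**53 is representable.
through_float = int(float(raw_id))

print(exact > 2**53)           # the id is beyond the exactly representable range
print(exact == through_float)  # the round trip does not give back the same id
print(exact, through_float)
```

Keeping `id` as a string with `dtype=False` sidesteps the problem entirely.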

In [4]:
pdoTrainX

,postMedia,postText,id,targetCaptions,targetParagraphs,targetTitle,postTimestamp,targetKeywords,targetDescription
0,[],[UK’s response to modern slavery leaving victi...,858462320779026433,[modern-slavery-rex.jpg],[Thousands of modern slavery victims have not ...,‘Inexcusable’ failures in UK’s response to mod...,Sat Apr 29 23:25:41 +0000 2017,"modern slavery, Department For Work And Pensio...",“Inexcusable” failures in the UK’s system for ...
1,[],[this is good],858421020331560960,"[In this July 1, 2010 file photo, Dr. Charmain...",[President Donald Trump has appointed the pro-...,Donald Trump Appoints Pro-Life Advocate as Ass...,Sat Apr 29 20:41:34 +0000 2017,"Americans United for Life, Dr. Charmaine Yoest...",President Donald Trump has appointed pro-life ...
2,[],"[The ""forgotten"" Trump roast: Relive his bruta...",858368123753435136,[President Trump will not attend this year's W...,[When the White House correspondents’ dinner i...,The ‘forgotten’ Trump roast: Relive his brutal...,Sat Apr 29 17:11:23 +0000 2017,"trump whcd, whcd, white house correspondents d...",President Trump won't be at this year's White ...
3,[],[Meet the happiest #dog in the world!],858323428260139008,"[Maru , Maru, Maru, Maru, Maru]",[Adorable is probably an understatement. This ...,"Meet The Happiest Dog In The World, Maru The H...",Sat Apr 29 14:13:46 +0000 2017,"Maru, husky, dogs, pandas, furball, instagram","The article is about Maru, a husky dog who has..."
4,[],[Tokyo's subway is shut down amid fears over a...,858283602626347008,[All nine lines of Tokyo's subway system were ...,[One of Tokyo's major subways systems says it ...,Tokyo's subway is shut down amid fears over an...,Sat Apr 29 11:35:31 +0000 2017,"Tokyo,subway,shut,fears,North,Korean,attack","The temporary suspension, which lasted ten min..."
...,...,...,...,...,...,...,...,...,...
19533,[media/photo_804240867972304896.jpg],[Brazil soccer team and pilot's final intervie...,804250183642976256,"[CNBC, msnbc, NBC NEWS, TODAY, xfinity]",[Watch Live: Joe Biden Honored on Senate Floor...,"NBC News Video See Brazil Soccer Team, Pilot’s...",Thu Dec 01 09:06:00 +0000 2016,,NBC News
19534,[],[😱😱😱😱😱😱😱😱😱😱😱😱😱😱],804156272086020096,"[Instagram/madonna, Speaker Ryan Retreats on H...",[On November 30 Politico reported that Eric Tr...,Politico Scoop: Eric Trump Killed Two Deer,Thu Dec 01 02:52:50 +0000 2016,Politico Scoop: Eric Trump Killed Two Deer,Politico Scoop: Eric Trump Killed Two Deer
19535,[],[Frenchs Forest high school may have to make w...,804149798651588608,[An artist's impression of the proposed new to...,[The Forest High School on Sydney's northern b...,Frenchs Forest high school may relocate to mak...,Thu Dec 01 02:27:07 +0000 2016,"frenchs forest, northern beaches, sydney, rede...",The Forest High School on Sydney's northern be...
19536,[media/photo_804133521023324160.jpg],[Oh Jeff… #bruh],804134698729385984,[Jeff Fisher May Think Danny Woodhead Still Pl...,[NFL coaches have a lot of information to reme...,Los Angeles Rams Jeff Fisher May Think Danny W...,Thu Dec 01 01:27:06 +0000 2016,"Humor, Football, NFL, NFC West, Los Angeles Ra...","Los Angeles Rams news, rumors, scores, schedul..."


Dropping most of the unused columns, but keeping a few of them for debugging purposes.

In [5]:
# Just object types
pdoTrainX.dtypes

postMedia            object
postText             object
id                   object
targetCaptions       object
targetParagraphs     object
targetTitle          object
postTimestamp        object
targetKeywords       object
targetDescription    object
dtype: object

In [6]:
droppedColumns = ['targetCaptions', 'targetDescription', 'targetParagraphs', 'postTimestamp' , 'targetKeywords']

In [7]:
pdoTrainX.drop(droppedColumns, axis=1, inplace=True)

In [8]:
# Use id as the index. Don't forget inplace=True
pdoTrainX.set_index('id', inplace=True)

In [9]:
pdoTrainX

id,postMedia,postText,targetTitle
858462320779026433,[],[UK’s response to modern slavery leaving victi...,‘Inexcusable’ failures in UK’s response to mod...
858421020331560960,[],[this is good],Donald Trump Appoints Pro-Life Advocate as Ass...
858368123753435136,[],"[The ""forgotten"" Trump roast: Relive his bruta...",The ‘forgotten’ Trump roast: Relive his brutal...
858323428260139008,[],[Meet the happiest #dog in the world!],"Meet The Happiest Dog In The World, Maru The H..."
858283602626347008,[],[Tokyo's subway is shut down amid fears over a...,Tokyo's subway is shut down amid fears over an...
...,...,...,...
804250183642976256,[media/photo_804240867972304896.jpg],[Brazil soccer team and pilot's final intervie...,"NBC News Video See Brazil Soccer Team, Pilot’s..."
804156272086020096,[],[😱😱😱😱😱😱😱😱😱😱😱😱😱😱],Politico Scoop: Eric Trump Killed Two Deer
804149798651588608,[],[Frenchs Forest high school may have to make w...,Frenchs Forest high school may relocate to mak...
804134698729385984,[media/photo_804133521023324160.jpg],[Oh Jeff… #bruh],Los Angeles Rams Jeff Fisher May Think Danny W...


In [10]:
# Remove the brackets from the postText column, i.e. take the first element of the one-element list.
# postMedia may be an empty list, in which case an empty string is used.
pdoTrainX["postText"]=pdoTrainX["postText"].apply(lambda x : x[0])
pdoTrainX["postMedia"]=pdoTrainX["postMedia"].apply(lambda x : x[0] if len(x)>0 else '')
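
The unwrapping step can be tried on a toy frame. This is a minimal sketch: `first_or_empty` is a hypothetical helper and the rows are made up. Unlike the `postText` lambda above, the helper also tolerates an empty list.

```python
import pandas as pd

def first_or_empty(values):
    """Return the first element of a list, or '' when the list is empty."""
    return values[0] if len(values) > 0 else ''

# Toy frame mimicking the two list-valued columns (values are made up).
toy = pd.DataFrame({
    "postText": [["this is good"], ["Oh Jeff… #bruh"]],
    "postMedia": [[], ["media/photo_1.jpg"]],
})
toy["postText"] = toy["postText"].apply(first_or_empty)
toy["postMedia"] = toy["postMedia"].apply(first_or_empty)
print(toy)
```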


In [11]:
# C:Challenge dataset, B:Big dataset (Challenge + Supplement), T:Test dataset, S:Supplement
pdoTrainX["fromDataset"] = "C"

Reading the actual output values

In [12]:
pdoTrainY=pd.read_json(dataTrainY, lines = True, dtype=False)

In [13]:
pdoTrainY

,truthJudgments,truthMean,id,truthClass,truthMedian,truthMode
0,"[1.0, 1.0, 1.0, 1.0, 1.0]",1.000000,858464162594172928,clickbait,1.000000,1.000000
1,"[0.33333333330000003, 0.0, 0.33333333330000003...",0.133333,858462320779026433,no-clickbait,0.000000,0.000000
2,"[0.33333333330000003, 0.6666666666000001, 1.0,...",0.400000,858460992073863168,no-clickbait,0.333333,0.000000
3,"[0.0, 0.6666666666000001, 0.0, 0.3333333333000...",0.266667,858459539296980995,no-clickbait,0.333333,0.333333
4,"[0.0, 0.0, 0.0, 0.0, 0.0]",0.000000,858455355948384257,no-clickbait,0.000000,0.000000
...,...,...,...,...,...,...
19533,"[0.0, 0.6666666666000001, 0.0, 0.0, 0.0]",0.133333,804126501117435904,no-clickbait,0.000000,0.000000
19534,"[0.0, 0.0, 0.0, 0.33333333330000003, 0.0]",0.066667,804123103995580416,no-clickbait,0.000000,0.000000
19535,"[0.6666666666000001, 0.6666666666000001, 0.0, ...",0.333333,804121272967983104,no-clickbait,0.333333,0.000000
19536,"[1.0, 0.0, 0.6666666666000001, 1.0, 1.0]",0.733333,804119512010424320,clickbait,1.000000,1.000000


In [14]:
pdoTrainY.set_index('id', inplace=True)

In [15]:
pdoTrainY

id,truthJudgments,truthMean,truthClass,truthMedian,truthMode
858464162594172928,"[1.0, 1.0, 1.0, 1.0, 1.0]",1.000000,clickbait,1.000000,1.000000
858462320779026433,"[0.33333333330000003, 0.0, 0.33333333330000003...",0.133333,no-clickbait,0.000000,0.000000
858460992073863168,"[0.33333333330000003, 0.6666666666000001, 1.0,...",0.400000,no-clickbait,0.333333,0.000000
858459539296980995,"[0.0, 0.6666666666000001, 0.0, 0.3333333333000...",0.266667,no-clickbait,0.333333,0.333333
858455355948384257,"[0.0, 0.0, 0.0, 0.0, 0.0]",0.000000,no-clickbait,0.000000,0.000000
...,...,...,...,...,...
804126501117435904,"[0.0, 0.6666666666000001, 0.0, 0.0, 0.0]",0.133333,no-clickbait,0.000000,0.000000
804123103995580416,"[0.0, 0.0, 0.0, 0.33333333330000003, 0.0]",0.066667,no-clickbait,0.000000,0.000000
804121272967983104,"[0.6666666666000001, 0.6666666666000001, 0.0, ...",0.333333,no-clickbait,0.333333,0.000000
804119512010424320,"[1.0, 0.0, 0.6666666666000001, 1.0, 1.0]",0.733333,clickbait,1.000000,1.000000
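
Judging from the rows shown, `truthMean` appears to be the arithmetic mean of `truthJudgments`. A quick check on two rows copied from the table above (only rows whose judgment list is displayed in full):

```python
from statistics import mean

# truthJudgments copied from the rows with ids 804126501117435904 and 804119512010424320.
judgments_a = [0.0, 0.6666666666000001, 0.0, 0.0, 0.0]
judgments_b = [1.0, 0.0, 0.6666666666000001, 1.0, 1.0]

print(round(mean(judgments_a), 6))  # compare with the displayed truthMean 0.133333
print(round(mean(judgments_b), 6))  # compare with the displayed truthMean 0.733333
```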


Combining input values and output values

In [16]:
# This check is not strictly necessary, because the assignment in the next cell aligns on the id index.
# After sorting, it should return True anyway.
pdoTrainX.sort_index(inplace=True)
pdoTrainY.sort_index(inplace=True)
pdoTrainX.index.equals(pdoTrainY.index)

True

In [17]:
# This uses the id index to assign each value to the matching row
pdoTrainX['truthMean']=pdoTrainY['truthMean']
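
The index alignment this relies on can be seen on a toy example where the two frames are deliberately in different row orders (the ids and values are made up):

```python
import pandas as pd

# Toy frames whose rows are in different orders.
x = pd.DataFrame({"postText": ["a", "b", "c"]}, index=["id1", "id2", "id3"])
y = pd.DataFrame({"truthMean": [0.3, 0.1, 0.2]}, index=["id3", "id1", "id2"])

# Assigning a Series to a column aligns on index labels, not on row position.
x["truthMean"] = y["truthMean"]
print(x)
```

An id present in `x` but missing from `y` would receive `NaN`, which is what the `index.equals` check in the previous cell guards against.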

In [18]:
pdoTrainX

id,postMedia,postText,targetTitle,fromDataset,truthMean
804113781580328960,,"Panama Papers: Europol links 3,500 names to su...","Panama Papers: Europol links 3,500 names to su...",C,0.066667
804119512010424320,media/photo_804119509338640385.jpg,The key to truly great chicken soup,A Superior Chicken Soup,C,0.733333
804121272967983104,,Afghan policewomen face down their fears to serve,100 Women 2016: On the frontline with the wome...,C,0.333333
804123103995580416,,Conservatives are watching less football this ...,Older Viewers and Conservatives Are Watching L...,C,0.066667
804126501117435904,media/photo_804125377400553474.jpg,Richard Sherman weighs in on Cam Newton’s stru...,Seattle Seahawks Richard Sherman Says 'Karma' ...,C,0.133333
...,...,...,...,...,...
858455355948384257,,Trump now agrees with the majority of American...,Donald Trump said being US president was harde...,C,0.000000
858459539296980995,media/photo_858459536574828544.jpg,Trump has flip-flopped. But his supporters are...,Trump Has Flip-Flopped. But His Supporters Are...,C,0.266667
858460992073863168,media/photo_858460986612862976.jpg,Inside North Korea's secret prisons,Inside Kim Jong-un's camps of death: Former No...,C,0.400000
858462320779026433,,UK’s response to modern slavery leaving victim...,‘Inexcusable’ failures in UK’s response to mod...,C,0.133333


In [19]:
pdoTrainX.describe()

,truthMean
count,19538.0
mean,0.32453
std,0.252824
min,0.0
25%,0.133333
50%,0.266667
75%,0.466667
max,1.0


Same thing for the supplement values

In [20]:
pdoTrainX1=pd.read_json(dataTrainX1, lines = True, dtype=False)
pdoTrainX1.drop(droppedColumns, axis=1, inplace=True)
pdoTrainX1.set_index('id', inplace=True)
pdoTrainX1["postText"]=pdoTrainX1["postText"].apply(lambda x : x[0])
pdoTrainX1["postMedia"]=pdoTrainX1["postMedia"].apply(lambda x : x[0] if len(x)>0 else '')

pdoTrainY1=pd.read_json(dataTrainY1, lines = True, dtype=False)
pdoTrainY1.set_index('id', inplace=True)

pdoTrainX1.sort_index(inplace=True)
pdoTrainY1.sort_index(inplace=True)
pdoTrainX1['truthMean']=pdoTrainY1['truthMean']

pdoTrainX1["fromDataset"] = "S" # C:Challenge dataset, B:Big dataset (Challenge + Supplement), T:Test dataset, S:Supplement

Same thing for the test values

In [26]:
pdoTestX=pd.read_json(dataTestX, lines = True, dtype=False)
pdoTestX.drop(droppedColumns, axis=1, inplace=True)
pdoTestX.set_index('id', inplace=True)
pdoTestX["postText"]=pdoTestX["postText"].apply(lambda x : x[0])
pdoTestX["postMedia"]=pdoTestX["postMedia"].apply(lambda x : x[0] if len(x)>0 else '')

pdoTestY=pd.read_json(dataTestY, lines = True, dtype=False)
pdoTestY.set_index('id', inplace=True)

pdoTestX.sort_index(inplace=True)
pdoTestY.sort_index(inplace=True)
pdoTestX['truthMean']=pdoTestY['truthMean']

pdoTestX["fromDataset"] = "T" # C:Challenge dataset, B:Big dataset (Challenge + Supplement), T:Test dataset, S:Supplement
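
The challenge, supplement and test cells repeat the same steps, so they could be factored into one function. The sketch below is not the code used in this notebook: `load_split` and `DROPPED` are hypothetical names, and the two tiny files are made up only to show the call.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

DROPPED = ['targetCaptions', 'targetDescription', 'targetParagraphs',
           'postTimestamp', 'targetKeywords']

def load_split(instances_path, truth_path, tag):
    """Read one instances/truth pair and return the curated frame."""
    x = pd.read_json(instances_path, lines=True, dtype=False)
    # errors="ignore" lets the toy files below omit the dropped columns.
    x = x.drop(columns=DROPPED, errors="ignore").set_index("id")
    x["postText"] = x["postText"].apply(lambda v: v[0])
    x["postMedia"] = x["postMedia"].apply(lambda v: v[0] if len(v) > 0 else '')
    y = pd.read_json(truth_path, lines=True, dtype=False).set_index("id")
    x["truthMean"] = y["truthMean"]  # aligned on the id index
    x["fromDataset"] = tag
    return x.sort_index()

# Tiny made-up files (ids and values are invented).
tmp = Path(tempfile.mkdtemp())
instances = [
    {"id": "900000000000000002", "postText": ["b"], "postMedia": [], "targetTitle": "B"},
    {"id": "900000000000000001", "postText": ["a"], "postMedia": ["m.jpg"], "targetTitle": "A"},
]
truth = [
    {"id": "900000000000000001", "truthMean": 0.2},
    {"id": "900000000000000002", "truthMean": 0.8},
]
(tmp / "instances.jsonl").write_text(
    "".join(json.dumps(row) + "\n" for row in instances), encoding="utf-8")
(tmp / "truth.jsonl").write_text(
    "".join(json.dumps(row) + "\n" for row in truth), encoding="utf-8")

demo = load_split(tmp / "instances.jsonl", tmp / "truth.jsonl", "C")
print(demo)
```

With such a helper, each of the three datasets would need a single call with its own paths and its `C`, `S` or `T` tag.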

In [28]:
pd.concat([pdoTrainX, pdoTrainX1, pdoTestX])

id,postMedia,postText,targetTitle,fromDataset,truthMean
804113781580328960,,"Panama Papers: Europol links 3,500 names to su...","Panama Papers: Europol links 3,500 names to su...",C,0.066667
804119512010424320,media/photo_804119509338640385.jpg,The key to truly great chicken soup,A Superior Chicken Soup,C,0.733333
804121272967983104,,Afghan policewomen face down their fears to serve,100 Women 2016: On the frontline with the wome...,C,0.333333
804123103995580416,,Conservatives are watching less football this ...,Older Viewers and Conservatives Are Watching L...,C,0.066667
804126501117435904,media/photo_804125377400553474.jpg,Richard Sherman weighs in on Cam Newton’s stru...,Seattle Seahawks Richard Sherman Says 'Karma' ...,C,0.133333
...,...,...,...,...,...
858460992052834304,media/photo_858460989590843398.jpg,"In President Trump's absence, ""nerd prom"" is c...",Plan Samantha Bee: Late Night Comedian Hosts A...,T,0.266667
858461421646086145,,Here's a breakdown of the best and riskiest mo...,Best and riskiest moves for every team's 2017 ...,T,0.600000
858462092407562244,media/photo_858462089165258752.jpg,"Austin, Texas police officer faked his death a...",Texas police officer faked his death and fled ...,T,0.066667
858462894568165376,,Motherhood in the time of Zika,Motherhood in the time of Zika,T,0.333333
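
This notebook does not check whether an id appears in more than one of the three sources. If that matters, a sanity check along these lines could be added. The sketch uses made-up frames in which the id `b` is duplicated on purpose.

```python
import pandas as pd

# Toy frames indexed by id; "b" appears in both.
train = pd.DataFrame({"fromDataset": ["C", "C"]}, index=["a", "b"])
test = pd.DataFrame({"fromDataset": ["T", "T"]}, index=["b", "c"])

combined = pd.concat([train, test])
print(combined.index.is_unique)                # whether every id occurs once
print(combined["fromDataset"].value_counts())  # rows contributed by each source

# verify_integrity=True makes concat raise instead of silently keeping duplicates.
try:
    pd.concat([train, test], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True
print(overlap_detected)
```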


In [29]:
# Convert the result into a Hugging Face dataset
import datasets as ds
curated = ds.Dataset.from_pandas(pd.concat([pdoTrainX, pdoTrainX1, pdoTestX])) #, preserve_index=False)

In [30]:
curated.info.description = "Clickbait"
curated.info.version = "0.6.0"
curated.info.supervised_keys = [actualOutput]
curated.save_to_disk(dataCuratedPath)

In [31]:
# This file is not used. Note that postText contains \n\n line breaks.
# dataCuratedPath already ends with a slash.
curated.to_csv(dataCuratedPath + "dataset.csv", sep=';')

Creating CSV from Arrow format: 100%|██████████| 5/5 [00:00<00:00, 11.34ba/s]


8606387
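
The comment on the CSV export mentions the `\n\n` sequences in `postText`. The standard-library sketch below, using a made-up row, shows the pitfall this presumably refers to: a field with embedded line breaks spans several physical lines, so any consumer that splits the file on newlines miscounts the records, whereas a real CSV parser recovers them.

```python
import csv
import io

# A made-up row whose postText contains a blank line.
rows = [["id", "postText"], ["1", "first line\n\nsecond line"]]

buffer = io.StringIO()
csv.writer(buffer, delimiter=';').writerows(rows)
text = buffer.getvalue()

# Naive line splitting sees more "records" than were written...
naive_lines = text.splitlines()
# ...while a CSV parser honours the quoting and recovers the original rows.
parsed = list(csv.reader(io.StringIO(text, newline=''), delimiter=';'))

print(len(naive_lines), len(parsed))
print(parsed == rows)
```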

In [32]:
# This file is not used either. Note that postText contains \n\n line breaks.
# dataCuratedPath already ends with a slash.
curated.to_json(dataCuratedPath + "dataset.jsonl")

Creating json from Arrow format: 100%|██████████| 5/5 [00:00<00:00, 21.47ba/s]


11564270