# Original To Curated

This notebook reads the data downloaded from the [Zenodo website](https://zenodo.org/record/5530410) and stored in the `Data/Original/` folder, aggregates and cleans it, and stores the result in the `Data/Curated/` folder in an appropriate format.

In [3]:
import pandas as pd
import numpy as np

Variables pointing to the original data folder and the curated data folder.

In [4]:
dataOriginalPath = 'Data/Original/'
dataTrainX = dataOriginalPath +'clickbait17-train-170630/instances.jsonl'
dataTrainY = dataOriginalPath +'clickbait17-train-170630/truth.jsonl'

# An earlier training release (170331), also split into two files
dataTrainX1 = dataOriginalPath +'clickbait17-train-170331/instances.jsonl'
dataTrainY1 = dataOriginalPath +'clickbait17-train-170331/truth.jsonl'

dataTestX = dataOriginalPath +'clickbait17-test-170720/instances.jsonl'
dataTestY = dataOriginalPath +'clickbait17-test-170720/truth.jsonl'

dataCuratedPath = 'Data/Curated/'
actualOutput = 'truthMean'

Reading the training data, which is stored in the [JSON Lines](https://jsonlines.org/) format.

In [5]:
# Disable automatic type detection: ids are sometimes converted to int64 and truncated (a bug that took 4 hours to find).
pdoTrainX=pd.read_json(dataTrainX, lines = True, dtype=False)
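The following sketch illustrates why the ids are safer kept as strings. It uses only the standard library and two made-up JSON Lines records; the float round-trip is an assumed mechanism for the truncation, not a confirmed description of what pandas does internally.

```python
import json

# Two hypothetical JSON Lines records: one JSON object per line, ids stored as strings.
raw = (
    '{"id": "858462320779026433", "postText": ["this is good"]}\n'
    '{"id": "858421020331560960", "postText": ["Oh Jeff"]}\n'
)

# Each line is parsed independently of the others.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]
ids = [record["id"] for record in records]

# A float64 has a 53-bit mantissa, so an 18-digit id cannot always survive
# a round-trip through a floating-point value.
original = int(ids[0])
through_float = int(float(original))

print(ids[0], "kept as string")
print(original == through_float)  # False: the low-order digits were lost
```

Keeping `dtype=False` avoids any such conversion, at the cost of handling the ids as plain strings.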

In [6]:
pdoTrainX

,postMedia,postText,id,targetCaptions,targetParagraphs,targetTitle,postTimestamp,targetKeywords,targetDescription
0,[],[UK’s response to modern slavery leaving victi...,858462320779026433,[modern-slavery-rex.jpg],[Thousands of modern slavery victims have not ...,‘Inexcusable’ failures in UK’s response to mod...,Sat Apr 29 23:25:41 +0000 2017,"modern slavery, Department For Work And Pensio...",“Inexcusable” failures in the UK’s system for ...
1,[],[this is good],858421020331560960,"[In this July 1, 2010 file photo, Dr. Charmain...",[President Donald Trump has appointed the pro-...,Donald Trump Appoints Pro-Life Advocate as Ass...,Sat Apr 29 20:41:34 +0000 2017,"Americans United for Life, Dr. Charmaine Yoest...",President Donald Trump has appointed pro-life ...
2,[],"[The ""forgotten"" Trump roast: Relive his bruta...",858368123753435136,[President Trump will not attend this year's W...,[When the White House correspondents’ dinner i...,The ‘forgotten’ Trump roast: Relive his brutal...,Sat Apr 29 17:11:23 +0000 2017,"trump whcd, whcd, white house correspondents d...",President Trump won't be at this year's White ...
3,[],[Meet the happiest #dog in the world!],858323428260139008,"[Maru , Maru, Maru, Maru, Maru]",[Adorable is probably an understatement. This ...,"Meet The Happiest Dog In The World, Maru The H...",Sat Apr 29 14:13:46 +0000 2017,"Maru, husky, dogs, pandas, furball, instagram","The article is about Maru, a husky dog who has..."
4,[],[Tokyo's subway is shut down amid fears over a...,858283602626347008,[All nine lines of Tokyo's subway system were ...,[One of Tokyo's major subways systems says it ...,Tokyo's subway is shut down amid fears over an...,Sat Apr 29 11:35:31 +0000 2017,"Tokyo,subway,shut,fears,North,Korean,attack","The temporary suspension, which lasted ten min..."
...,...,...,...,...,...,...,...,...,...
19533,[media/photo_804240867972304896.jpg],[Brazil soccer team and pilot's final intervie...,804250183642976256,"[CNBC, msnbc, NBC NEWS, TODAY, xfinity]",[Watch Live: Joe Biden Honored on Senate Floor...,"NBC News Video See Brazil Soccer Team, Pilot’s...",Thu Dec 01 09:06:00 +0000 2016,,NBC News
19534,[],[😱😱😱😱😱😱😱😱😱😱😱😱😱😱],804156272086020096,"[Instagram/madonna, Speaker Ryan Retreats on H...",[On November 30 Politico reported that Eric Tr...,Politico Scoop: Eric Trump Killed Two Deer,Thu Dec 01 02:52:50 +0000 2016,Politico Scoop: Eric Trump Killed Two Deer,Politico Scoop: Eric Trump Killed Two Deer
19535,[],[Frenchs Forest high school may have to make w...,804149798651588608,[An artist's impression of the proposed new to...,[The Forest High School on Sydney's northern b...,Frenchs Forest high school may relocate to mak...,Thu Dec 01 02:27:07 +0000 2016,"frenchs forest, northern beaches, sydney, rede...",The Forest High School on Sydney's northern be...
19536,[media/photo_804133521023324160.jpg],[Oh Jeff… #bruh],804134698729385984,[Jeff Fisher May Think Danny Woodhead Still Pl...,[NFL coaches have a lot of information to reme...,Los Angeles Rams Jeff Fisher May Think Danny W...,Thu Dec 01 01:27:06 +0000 2016,"Humor, Football, NFL, NFC West, Los Angeles Ra...","Los Angeles Rams news, rumors, scores, schedul..."


Dropping the unused columns, while keeping a few of them for debugging purposes.

In [7]:
# With dtype=False every column has the object type
pdoTrainX.dtypes

postMedia            object
postText             object
id                   object
targetCaptions       object
targetParagraphs     object
targetTitle          object
postTimestamp        object
targetKeywords       object
targetDescription    object
dtype: object

In [8]:
pdoTrainX.drop(['postMedia', 'targetCaptions', 'targetParagraphs', 'postTimestamp', 'targetKeywords'], axis=1, inplace=True)

In [9]:
# Use id as the index. Don't forget inplace=True
pdoTrainX.set_index('id', inplace=True)

In [10]:
pdoTrainX

id,postText,targetTitle,targetDescription
858462320779026433,[UK’s response to modern slavery leaving victi...,‘Inexcusable’ failures in UK’s response to mod...,“Inexcusable” failures in the UK’s system for ...
858421020331560960,[this is good],Donald Trump Appoints Pro-Life Advocate as Ass...,President Donald Trump has appointed pro-life ...
858368123753435136,"[The ""forgotten"" Trump roast: Relive his bruta...",The ‘forgotten’ Trump roast: Relive his brutal...,President Trump won't be at this year's White ...
858323428260139008,[Meet the happiest #dog in the world!],"Meet The Happiest Dog In The World, Maru The H...","The article is about Maru, a husky dog who has..."
858283602626347008,[Tokyo's subway is shut down amid fears over a...,Tokyo's subway is shut down amid fears over an...,"The temporary suspension, which lasted ten min..."
...,...,...,...
804250183642976256,[Brazil soccer team and pilot's final intervie...,"NBC News Video See Brazil Soccer Team, Pilot’s...",NBC News
804156272086020096,[😱😱😱😱😱😱😱😱😱😱😱😱😱😱],Politico Scoop: Eric Trump Killed Two Deer,Politico Scoop: Eric Trump Killed Two Deer
804149798651588608,[Frenchs Forest high school may have to make w...,Frenchs Forest high school may relocate to mak...,The Forest High School on Sydney's northern be...
804134698729385984,[Oh Jeff… #bruh],Los Angeles Rams Jeff Fisher May Think Danny W...,"Los Angeles Rams news, rumors, scores, schedul..."


In [11]:
# Remove the brackets from the postText column, i.e. take the first element of each one-element list
pdoTrainX["postText"]=pdoTrainX["postText"].apply(lambda x : x[0])
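The lambda above raises an `IndexError` if any list is empty. A more defensive variant could look like the sketch below; `first_item` is a hypothetical helper introduced here for illustration, and whether empty lists actually occur in this dataset is not checked in the notebook.

```python
def first_item(values, default=""):
    """Return the first element of a list, or `default` when it is empty or missing."""
    if isinstance(values, (list, tuple)) and values:
        return values[0]
    return default


# Made-up samples: a normal one-element list, an empty list, and a missing value.
samples = [["this is good"], [], None]
unwrapped = [first_item(value) for value in samples]
print(unwrapped)  # ['this is good', '', '']
```

It would be applied in the same way, with `pdoTrainX["postText"].apply(first_item)`.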

Reading the actual output values (the truth labels).

In [12]:
pdoTrainY=pd.read_json(dataTrainY, lines = True, dtype=False)

In [13]:
pdoTrainY

,truthJudgments,truthMean,id,truthClass,truthMedian,truthMode
0,"[1.0, 1.0, 1.0, 1.0, 1.0]",1.000000,858464162594172928,clickbait,1.000000,1.000000
1,"[0.33333333330000003, 0.0, 0.33333333330000003...",0.133333,858462320779026433,no-clickbait,0.000000,0.000000
2,"[0.33333333330000003, 0.6666666666000001, 1.0,...",0.400000,858460992073863168,no-clickbait,0.333333,0.000000
3,"[0.0, 0.6666666666000001, 0.0, 0.3333333333000...",0.266667,858459539296980995,no-clickbait,0.333333,0.333333
4,"[0.0, 0.0, 0.0, 0.0, 0.0]",0.000000,858455355948384257,no-clickbait,0.000000,0.000000
...,...,...,...,...,...,...
19533,"[0.0, 0.6666666666000001, 0.0, 0.0, 0.0]",0.133333,804126501117435904,no-clickbait,0.000000,0.000000
19534,"[0.0, 0.0, 0.0, 0.33333333330000003, 0.0]",0.066667,804123103995580416,no-clickbait,0.000000,0.000000
19535,"[0.6666666666000001, 0.6666666666000001, 0.0, ...",0.333333,804121272967983104,no-clickbait,0.333333,0.000000
19536,"[1.0, 0.0, 0.6666666666000001, 1.0, 1.0]",0.733333,804119512010424320,clickbait,1.000000,1.000000


In [14]:
pdoTrainY.set_index('id', inplace=True)

In [15]:
pdoTrainY

id,truthJudgments,truthMean,truthClass,truthMedian,truthMode
858464162594172928,"[1.0, 1.0, 1.0, 1.0, 1.0]",1.000000,clickbait,1.000000,1.000000
858462320779026433,"[0.33333333330000003, 0.0, 0.33333333330000003...",0.133333,no-clickbait,0.000000,0.000000
858460992073863168,"[0.33333333330000003, 0.6666666666000001, 1.0,...",0.400000,no-clickbait,0.333333,0.000000
858459539296980995,"[0.0, 0.6666666666000001, 0.0, 0.3333333333000...",0.266667,no-clickbait,0.333333,0.333333
858455355948384257,"[0.0, 0.0, 0.0, 0.0, 0.0]",0.000000,no-clickbait,0.000000,0.000000
...,...,...,...,...,...
804126501117435904,"[0.0, 0.6666666666000001, 0.0, 0.0, 0.0]",0.133333,no-clickbait,0.000000,0.000000
804123103995580416,"[0.0, 0.0, 0.0, 0.33333333330000003, 0.0]",0.066667,no-clickbait,0.000000,0.000000
804121272967983104,"[0.6666666666000001, 0.6666666666000001, 0.0, ...",0.333333,no-clickbait,0.333333,0.000000
804119512010424320,"[1.0, 0.0, 0.6666666666000001, 1.0, 1.0]",0.733333,clickbait,1.000000,1.000000


Combining input values and output values

In [16]:
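# Sorting is not needed for the assignment below, which aligns on the index.
# It is needed for index.equals(), which compares the ids in order and should return True.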
pdoTrainX.sort_index(inplace=True)
pdoTrainY.sort_index(inplace=True)
pdoTrainX.index.equals(pdoTrainY.index)

True

In [17]:
# The assignment aligns on the id index, so each value lands on the matching row
pdoTrainX['truthMean']=pdoTrainY['truthMean']
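To make the index alignment concrete, here is a minimal sketch on made-up data (the ids and values below are illustrative only). It shows that column assignment matches rows by index label rather than by position, and that `index.equals` is sensitive to order.

```python
import pandas as pd

# Inputs and outputs share the same ids but arrive in a different order.
x = pd.DataFrame({"postText": ["third", "first", "second"]}, index=["3", "1", "2"])
y = pd.DataFrame({"truthMean": [0.1, 0.2, 0.3]}, index=["1", "2", "3"])

# index.equals compares labels position by position, so it is False before sorting.
same_before_sort = x.index.equals(y.index)

# Assigning a Series aligns on the index labels, not on the row position.
x["truthMean"] = y["truthMean"]

same_after_sort = x.sort_index().index.equals(y.sort_index().index)

print(same_before_sort, same_after_sort)
print(x)
```

Row `"3"` receives `0.3` even though it comes first in `x`, which is why the sort is only required for the equality check.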

In [18]:
pdoTrainX

id,postText,targetTitle,targetDescription,truthMean
804113781580328960,"Panama Papers: Europol links 3,500 names to su...","Panama Papers: Europol links 3,500 names to su...",Law enforcement agency analysis uncovers proba...,0.066667
804119512010424320,The key to truly great chicken soup,A Superior Chicken Soup,For the best rendition of this American classi...,0.733333
804121272967983104,Afghan policewomen face down their fears to serve,100 Women 2016: On the frontline with the wome...,The Afghan women risking all to join the police.,0.333333
804123103995580416,Conservatives are watching less football this ...,Older Viewers and Conservatives Are Watching L...,"Many factors are dragging down NFL ratings, in...",0.066667
804126501117435904,Richard Sherman weighs in on Cam Newton’s stru...,Seattle Seahawks Richard Sherman Says 'Karma' ...,"Seattle Seahawks news, rumors, scores, schedul...",0.133333
...,...,...,...,...
858455355948384257,Trump now agrees with the majority of American...,Donald Trump said being US president was harde...,Donald Trump spent a great portion of 2016 ins...,0.000000
858459539296980995,Trump has flip-flopped. But his supporters are...,Trump Has Flip-Flopped. But His Supporters Are...,Barely over a tenth of Trump voters think his ...,0.266667
858460992073863168,Inside North Korea's secret prisons,Inside Kim Jong-un's camps of death: Former No...,A female guard (stock photo) at a North Korean...,0.400000
858462320779026433,UK’s response to modern slavery leaving victim...,‘Inexcusable’ failures in UK’s response to mod...,“Inexcusable” failures in the UK’s system for ...,0.133333


In [19]:
pdoTrainX.describe()

,truthMean
count,19538.0
mean,0.32453
std,0.252824
min,0.0
25%,0.133333
50%,0.266667
75%,0.466667
max,1.0


The same steps are applied to the complementary training files (`clickbait17-train-170331`).

In [20]:
pdoTrainX1=pd.read_json(dataTrainX1, lines = True, dtype=False)
pdoTrainX1.drop(['postMedia', 'targetCaptions', 'targetParagraphs', 'postTimestamp', 'targetKeywords'], axis=1, inplace=True)
pdoTrainX1.set_index('id', inplace=True)
pdoTrainX1["postText"]=pdoTrainX1["postText"].apply(lambda x : x[0])

pdoTrainY1=pd.read_json(dataTrainY1, lines = True, dtype=False)
pdoTrainY1.set_index('id', inplace=True)

pdoTrainX1.sort_index(inplace=True)
pdoTrainY1.sort_index(inplace=True)
pdoTrainX1['truthMean']=pdoTrainY1['truthMean']
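Since both training sets go through identical steps, the pipeline could be wrapped in one function. The sketch below is one possible refactoring, not the notebook's actual code: `load_split` and `UNUSED_COLUMNS` are hypothetical names, and the demo writes two tiny made-up files to a temporary folder so that it runs without the real data.

```python
import json
import os
import tempfile

import pandas as pd

UNUSED_COLUMNS = ["postMedia", "targetCaptions", "targetParagraphs",
                  "postTimestamp", "targetKeywords"]


def load_split(instances_path, truth_path, target="truthMean"):
    """Read one instances/truth pair and return the cleaned frame indexed by id."""
    # dtype=False keeps the ids exactly as they are stored in the file.
    x = pd.read_json(instances_path, lines=True, dtype=False)
    x = x.drop(columns=UNUSED_COLUMNS, errors="ignore").set_index("id")
    # Unwrap the one-element postText lists.
    x["postText"] = x["postText"].apply(lambda items: items[0])
    y = pd.read_json(truth_path, lines=True, dtype=False).set_index("id")
    # Assignment aligns on the id index.
    x[target] = y[target]
    return x.sort_index()


# Demo on made-up records; the real call would pass dataTrainX1 and dataTrainY1.
instances = [
    {"id": "858462320779026433", "postText": ["b"], "targetTitle": "B",
     "targetDescription": "", "postMedia": []},
    {"id": "804119512010424320", "postText": ["a"], "targetTitle": "A",
     "targetDescription": "", "postMedia": []},
]
truths = [
    {"id": "804119512010424320", "truthMean": 0.75},
    {"id": "858462320779026433", "truthMean": 0.25},
]

with tempfile.TemporaryDirectory() as folder:
    instances_path = os.path.join(folder, "instances.jsonl")
    truth_path = os.path.join(folder, "truth.jsonl")
    with open(instances_path, "w", encoding="utf-8") as handle:
        handle.writelines(json.dumps(row) + "\n" for row in instances)
    with open(truth_path, "w", encoding="utf-8") as handle:
        handle.writelines(json.dumps(row) + "\n" for row in truths)
    demo = load_split(instances_path, truth_path)

print(demo)
```

With such a helper, each split becomes a single call, which keeps the two training sets and the test set on exactly the same cleaning steps.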


In [21]:
pdoTrainX1

id,postText,targetTitle,targetDescription,truthMean
607668877594497024,RT @WSJLive: This year's Tony nominees desribe...,Tony Nominees' Craziest Moments on Stage,"Tony Award nominees Carey Mulligan, Elisabeth ...",0.666667
607671137062010881,Orphaned fruit bat pups nursed back to health ...,Going into bat to save a species: Dedication a...,North Sydney's Kukundi fruit bat shelter has e...,0.066667
607672568057700352,Ohio State’s Tyvis Powell learned that champio...,Ohio State's Tyvis Powell Thinks He's Above Cu...,In the offseason following a championship se...,0.200000
607674674168926209,China’s fishermen explain why they think the s...,China’s fishermen explain why they think the s...,Those who sail from Hainan island say the Sout...,0.600000
607675444834398208,"RT @BBCSport: ""I'm living my dream"" - Watch wh...",BBC Sport - Lewis Hamilton 'living his dream' ...,Lewis Hamilton says Mercedes are enabling him ...,0.333333
...,...,...,...,...
610200047951609857,Petition calling for Kay Burley's sacking reac...,Petition to sack Kay Burley following Alton To...,A petition calling for the sacking of Sky News...,0.000000
610200274658029568,RT @BuzzFeedNews: This Trooper Pulled Over An ...,An Old Lady In A Scooter Was Lost On A Highway...,,0.400000
610201503752658944,"RT @irin: No one can ever top this sentence, a...",Rich Californians balk at limits: ‘We’re not a...,"After years of devastating drought, ultra-weal...",0.666667
610201840836186112,VIDEO: ESPN's OTL study reveals that college a...,OTL Investigates Perception of Top College Ath...,ESPN's Outside the Lines conducted a study...,0.200000


In [22]:
pd.concat([pdoTrainX, pdoTrainX1])

id,postText,targetTitle,targetDescription,truthMean
804113781580328960,"Panama Papers: Europol links 3,500 names to su...","Panama Papers: Europol links 3,500 names to su...",Law enforcement agency analysis uncovers proba...,0.066667
804119512010424320,The key to truly great chicken soup,A Superior Chicken Soup,For the best rendition of this American classi...,0.733333
804121272967983104,Afghan policewomen face down their fears to serve,100 Women 2016: On the frontline with the wome...,The Afghan women risking all to join the police.,0.333333
804123103995580416,Conservatives are watching less football this ...,Older Viewers and Conservatives Are Watching L...,"Many factors are dragging down NFL ratings, in...",0.066667
804126501117435904,Richard Sherman weighs in on Cam Newton’s stru...,Seattle Seahawks Richard Sherman Says 'Karma' ...,"Seattle Seahawks news, rumors, scores, schedul...",0.133333
...,...,...,...,...
610200047951609857,Petition calling for Kay Burley's sacking reac...,Petition to sack Kay Burley following Alton To...,A petition calling for the sacking of Sky News...,0.000000
610200274658029568,RT @BuzzFeedNews: This Trooper Pulled Over An ...,An Old Lady In A Scooter Was Lost On A Highway...,,0.400000
610201503752658944,"RT @irin: No one can ever top this sentence, a...",Rich Californians balk at limits: ‘We’re not a...,"After years of devastating drought, ultra-weal...",0.666667
610201840836186112,VIDEO: ESPN's OTL study reveals that college a...,OTL Investigates Perception of Top College Ath...,ESPN's Outside the Lines conducted a study...,0.200000


In [23]:
# Convert the result into a Hugging Face dataset
import datasets as ds
curated = ds.Dataset.from_pandas(pd.concat([pdoTrainX, pdoTrainX1])) #, preserve_index=False)

In [22]:
curated.info.description = "Clickbait"
curated.info.version = "0.3.0"
curated.info.supervised_keys = [actualOutput]
curated.save_to_disk(dataCuratedPath)

In [24]:
# This file is not used. postText contains \n\n newline sequences.
curated.to_csv(dataCuratedPath + "dataset.csv", sep=';')

Creating CSV from Arrow format: 100%|██████████| 3/3 [00:00<00:00, 10.07ba/s]


7342633
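The newline remark can be illustrated with the standard `csv` module. This is a sketch on a made-up row, showing that a field containing `\n\n` is quoted on writing and still parses back as a single record, whereas a naive line-by-line split breaks it apart.

```python
import csv
import io

# One header row and one made-up record whose text spans several lines.
rows = [["id", "postText"], ["1", "First line\n\nSecond paragraph"]]

buffer = io.StringIO()
csv.writer(buffer, delimiter=";").writerows(rows)
text = buffer.getvalue()

# Splitting on line breaks sees four "lines" for two records.
naive_lines = text.splitlines()

# A CSV-aware reader honours the quotes and recovers the two records.
parsed = list(csv.reader(io.StringIO(text, newline=""), delimiter=";"))

print(len(naive_lines), len(parsed))  # 4 2
```

Any tool that reads this CSV therefore has to be quote-aware, which is one reason to prefer the Arrow files written by `save_to_disk`.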

In [26]:
# This file is not used either. postText contains \n\n newline sequences.
curated.to_json(dataCuratedPath + "dataset.jsonl")

Creating json from Arrow format: 100%|██████████| 3/3 [00:00<00:00, 22.83ba/s]


8766898