# Preprocessing and Feature Extraction

## Imports

In [7]:
import json
import nltk
import numpy as np
import pandas as pd

## Read files

There are two different files. The first one, called "instances" contains all the information for a given post. The second, "truth", contains the labels of each instance. The following are the schemas of these files:

`""" Fields in instances.jsonl: <br/>
 { <br/>
    "id": "<instance id>", <br/>
    "postTimestamp": "<weekday> <month> <day> <hour>:<minute>:<second> <time_offset> <year>", <br/>
    "postText": ["<text of the post with links removed>"], <br/>
    "postMedia": ["<path to a file in the media archive>"], <br/>
    "targetTitle": "<title of target article>", <br/>
    "targetDescription": "<description tag of target article>", <br/>
    "targetKeywords": "<keywords tag of target article>", <br/>
    "targetParagraphs": ["<text of the ith paragraph in the target article>"], <br/>
    "targetCaptions": ["<caption of the ith image in the target article>"] <br/>
  } """`


`""" Fields in truth.jsonl:
  {
    "id": "<instance id>",
    "truthJudgments": [<number in [0,1]>],
    "truthMean": <number in [0,1]>,
    "truthMedian": <number in [0,1]>,
    "truthMode": <number in [0,1]>,
    "truthClass": "clickbait | no-clickbait"
  } """`

In [2]:
def loadDataset(size):
    instances = []
    labels = []
    fileName = 'trainSmall' if size == 'small' else 'trainLarge'
    with open('data/'+fileName+'/instances.jsonl') as file:
        for line in file:
            instances.append(json.loads(line))
    with open('data/'+fileName+'/truth.jsonl') as file:
        for line in file:
            labels.append(json.loads(line))
    return instances, labels

In [3]:
instances, labels = loadDataset('small')

## Preprocessing

In [47]:
from nltk.corpus import stopwords

In [62]:
instancesDF = pd.DataFrame(instances)
instancesDF.iloc[0,:]

id                                                  608310377143799810
postMedia                                                           []
postText             [Apple's iOS 9 'App thinning' feature will giv...
postTimestamp                           Tue Jun 09 16:31:10 +0000 2015
targetCaptions       ['App thinning' will be supported on Apple's i...
targetDescription    'App thinning' will be supported on Apple's iO...
targetKeywords       Apple,gives,gigabytes,iOS,9,app,thinning,featu...
targetParagraphs     [Paying for a 64GB phone only to discover that...
targetTitle          Apple gives back gigabytes: iOS 9 'app thinnin...
Name: 0, dtype: object

In [59]:
# In this function the text variable is a list of length 1
def preprocess(text):
    sw = set(stopwords.words('english'))    
    filtered_sentence = []
    if isinstance(lista, list):      
        for t in text:            
            word_tokens = nltk.word_tokenize(t)
            filtered_sentence = filtered_sentence + [w for w in word_tokens if not w in sw] 
    else:
        word_tokens = nltk.word_tokenize(text)
        filtered_sentence = filtered_sentence + [w for w in word_tokens if not w in sw] 
    return filtered_sentence

In [65]:
instancesDF['postTextTokens'] = instancesDF.postText.apply(preprocess)
instancesDF['postTextClean'] = instancesDF.postTextTokens.apply(' '.join)

instancesDF['targetCaptionsTokens'] = instancesDF.targetCaptions.apply(preprocess)
instancesDF['targetCaptionsClean'] = instancesDF.targetCaptionsTokens.apply(' '.join)

instancesDF['targetDescriptionTokens'] = instancesDF.targetDescription.apply(preprocess)
instancesDF['targetDescriptionClean'] = instancesDF.targetDescriptionTokens.apply(' '.join)

instancesDF['targetKeywordsTokens'] = instancesDF.targetKeywords.apply(preprocess)
instancesDF['targetKeywordsClean'] = instancesDF.targetKeywordsTokens.apply(' '.join)

instancesDF['targetParagraphsTokens'] = instancesDF.targetParagraphs.apply(preprocess)
instancesDF['targetParagraphsClean'] = instancesDF.targetParagraphsTokens.apply(' '.join)

instancesDF['targetTitleTokens'] = instancesDF.targetTitle.apply(preprocess)
instancesDF['targetTitleClean'] = instancesDF.targetTitleTokens.apply(' '.join)

In [66]:
instancesDF.head()

Unnamed: 0,id,postMedia,postText,postTimestamp,targetCaptions,targetDescription,targetKeywords,targetParagraphs,targetTitle,postTextTokens,...,targetCaptionsTokens,targetCaptionsClean,targetDescriptionTokens,targetDescriptionClean,targetKeywordsTokens,targetKeywordsClean,targetParagraphsTokens,targetParagraphsClean,targetTitleTokens,targetTitleClean
0,608310377143799810,[],[Apple's iOS 9 'App thinning' feature will giv...,Tue Jun 09 16:31:10 +0000 2015,['App thinning' will be supported on Apple's i...,'App thinning' will be supported on Apple's iO...,"Apple,gives,gigabytes,iOS,9,app,thinning,featu...",[Paying for a 64GB phone only to discover that...,Apple gives back gigabytes: iOS 9 'app thinnin...,"[Apple, 's, iOS, 9, 'App, thinning, ', feature...",...,"['App, thinning, ', supported, Apple, 's, iOS,...",'App thinning ' supported Apple 's iOS 9 later...,"[', A, p, p, h, n, n, n, g, ', w, l, l, b, e, ...",' A p p h n n n g ' w l l b e u p p r e n A p ...,"[A, p, p, l, e, ,, g, v, e, ,, g, g, b, e, ,, ...","A p p l e , g v e , g g b e , O S , 9 , p p , ...","[Paying, 64GB, phone, discover, significantly,...",Paying 64GB phone discover significantly reduc...,"[A, p, p, l, e, g, v, e, b, c, k, g, g, b, e, ...",A p p l e g v e b c k g g b e : O S 9 ' p p h ...
1,609297109095972864,[media/609297109095972864.jpg],[RT @kenbrown12: Emerging market investors are...,Fri Jun 12 09:52:05 +0000 2015,"[Stocks Fall as Investors Watch Central Banks,...",Global investors have yanked $9.3 billion from...,"emerging market,emerging markets,em flows,em i...","[Emerging markets are out of favor., Global in...",Emerging Markets Suffer Largest Outflow in Sev...,"[RT, @, kenbrown12, :, Emerging, market, inves...",...,"[Stocks, Fall, Investors, Watch, Central, Bank...",Stocks Fall Investors Watch Central Banks Do T...,"[G, l, b, l, n, v, e, r, h, v, e, n, k, e, $, ...",G l b l n v e r h v e n k e $ 9 . 3 b l l n f ...,"[e, e, r, g, n, g, r, k, e, ,, e, e, r, g, n, ...","e e r g n g r k e , e e r g n g r k e , e f l ...","[Emerging, markets, favor, ., Global, investor...",Emerging markets favor . Global investors yank...,"[E, e, r, g, n, g, M, r, k, e, S, u, f, f, e, ...",E e r g n g M r k e S u f f e r L r g e O u f ...
2,609504474621612032,[],[U.S. Soccer should start answering tough ques...,Fri Jun 12 23:36:05 +0000 2015,[US to vote for Ali in FIFA election and not B...,A U.S. Senator's scathing letter questioned U....,,"[WINNIPEG, Manitoba – The bubble U.S. Soccer i...",U.S. Soccer should start answering tough quest...,"[U.S., Soccer, start, answering, tough, questi...",...,"[US, vote, Ali, FIFA, election, Blatter, US, v...",US vote Ali FIFA election Blatter US vote Ali ...,"[A, U, ., S, ., S, e, n, r, ', c, h, n, g, l, ...",A U . S . S e n r ' c h n g l e e r q u e n e ...,[],,"[WINNIPEG, ,, Manitoba, –, The, bubble, U.S., ...","WINNIPEG , Manitoba – The bubble U.S. Soccer p...","[U, ., S, ., S, c, c, e, r, h, u, l, r, n, w, ...",U . S . S c c e r h u l r n w e r n g u g h q ...
3,609748367049105409,[],[How theme parks like Disney World left the mi...,Sat Jun 13 15:45:13 +0000 2015,"[Some 1,000 persons turned out in Albuquerque,...","America's top family vacation spots, like the ...","disney, disney world, disney ticket prices, di...",[When Walt Disney World opened in an Orlando s...,How theme parks like Disney World left the mid...,"[How, theme, parks, like, Disney, World, left,...",...,"[Some, 1,000, persons, turned, Albuquerque, ,,...","Some 1,000 persons turned Albuquerque , New Me...","[A, e, r, c, ', p, f, l, v, c, n, p, ,, l, k, ...","A e r c ' p f l v c n p , l k e h e `` h p p e...","[n, e, ,, n, e, w, r, l, ,, n, e, c, k, e, p, ...","n e , n e w r l , n e c k e p r c e , n e w r ...","[When, Walt, Disney, World, opened, Orlando, s...",When Walt Disney World opened Orlando swamp 19...,"[H, w, h, e, e, p, r, k, l, k, e, D, n, e, W, ...",H w h e e p r k l k e D n e W r l l e f h e l ...
4,608688782821453825,[media/608688782821453825.jpg],[Could light bulbs hurt your health? One compa...,Wed Jun 10 17:34:49 +0000 2015,[Electric lights have made the world safer and...,One company will put a health notice on all th...,"health, Should there be warning labels on your...",[(CNN)The light bulb always makes the world's ...,Warning labels on your light bulbs,"[Could, light, bulbs, hurt, health, ?, One, co...",...,"[Electric, lights, made, world, safer, made, p...",Electric lights made world safer made people s...,"[O, n, e, c, p, n, w, l, l, p, u, h, e, l, h, ...",O n e c p n w l l p u h e l h n c e n l l h e ...,"[h, e, l, h, ,, S, h, u, l, h, e, r, e, b, e, ...","h e l h , S h u l h e r e b e w r n n g l b e ...","[(, CNN, ), The, light, bulb, always, makes, w...",( CNN ) The light bulb always makes world 's t...,"[W, r, n, n, g, l, b, e, l, n, u, r, l, g, h, ...",W r n n g l b e l n u r l g h b u l b


## Define features

The following is the list of features to implement:
* numChar(TargetTitle, Post, TargetParagraphs)
* diffNumChar(TargetTitleVsPost, TargetTitleVsTargetParagraphs, PostVsTargetParagraphs)
* ratioNumChar(TargetTitleVsPost, TargetTitleVsTargetParagraphs, PostVsTargetParagraphs)
* numWords (TargetTitle, Post, TargetParagraphs)
* diffNumWords (TargetTitleVsPost, TargetTitleVsTargetParagraphs, PostVsTargetParagraphs)
* ratioNumWords (TargetTitleVsPost, TargetTitleVsTargetParagraphs, PostVsTargetParagraphs)
* numFormalInformal (TargetTitle, Post, TargetParagraphs)
* ratioFormalInformal (TargetTitle, Post, TargetParagraphs)

In [73]:
# numChars
instancesDF['numCharPost'] = instancesDF.postTextClean.apply(len)

In [None]:
# diffNumChars

In [None]:
# ratioNumChars


In [None]:
# numWords


In [None]:
# diffNumWords


In [None]:
# ratioNumWords

## Save feature sets

In [74]:
featureSet = instancesDF[['numCharPost']]
featureSet.head()

Unnamed: 0,numCharPost
0,66
1,91
2,79
3,58
4,74


In [75]:
featureSet.to_csv('feature_set_small.csv')