# Feature Selection

* Coleman-Liau score (CLScore)
* RIX and LIX indices
* Formality measure (fmeasure)
* The most important content features are the number of uppercase words, the presence of question marks and exclamation marks in headlines (titles), and the length of the title (number of words)
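
The formality measure is not implemented in the cells below. A minimal sketch, assuming `fmeasure` refers to the F-score of Heylighen and Dewaele, which combines part-of-speech frequencies expressed as percentages of all tokens. The function name and the category keys are hypothetical, and mapping a tagger's tag set onto these eight categories is left out:

```python
def formality_score(pos_percentages):
    """F-score from POS frequencies given as percentages of all tokens.

    Assumed formula (Heylighen and Dewaele):
    F = (noun + adjective + preposition + article
         - pronoun - verb - adverb - interjection + 100) / 2
    """
    formal = ('noun', 'adjective', 'preposition', 'article')
    deictic = ('pronoun', 'verb', 'adverb', 'interjection')
    # Missing categories are treated as 0 percent
    formal_sum = sum(pos_percentages.get(tag, 0.0) for tag in formal)
    deictic_sum = sum(pos_percentages.get(tag, 0.0) for tag in deictic)
    return (formal_sum - deictic_sum + 100) / 2
```

Under this formula the score ranges from 0 to 100, with higher values indicating more formal text.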


* The character n-gram features and the word 1-gram feature appear to contribute most to performance
    * Character n-grams are known to capture writing style
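
To make the character n-gram feature concrete, here is a small sketch that counts overlapping character n-grams in a string. The helper name `char_ngrams` is hypothetical and is not part of the notebook's code:

```python
from collections import Counter

def char_ngrams(text, n):
    """Count all overlapping character n-grams of length n in text."""
    # A text shorter than n yields an empty Counter
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

counts = char_ngrams("wow!!", 2)
```

For a full feature matrix, scikit-learn's `CountVectorizer(analyzer='char', ngram_range=(1, 3))` would be one option; the range `(1, 3)` is only an illustrative choice.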


* Headline: word count
* Body: informality, measured as the frequencies of two informality indicators (internet slang and bait words). The length of the news body is an additional input feature.
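
The body features above could be sketched as follows. The two word lists are hypothetical placeholders, not the lexicons used in the cited work, and "frequency" is assumed here to mean a share of all tokens rather than a raw count:

```python
import re

# Hypothetical placeholder lexicons; replace with real slang and bait word lists
SLANG_WORDS = {"lol", "omg", "wtf", "lmao"}
BAIT_WORDS = {"click", "shocking", "unbelievable", "amazing"}

def informality_features(body):
    """Return slang frequency, bait-word frequency and body length in tokens."""
    tokens = re.findall(r"[a-z']+", body.lower())
    n = len(tokens)
    if n == 0:
        return {"slang_freq": 0.0, "bait_freq": 0.0, "body_length": 0}
    slang = sum(token in SLANG_WORDS for token in tokens)
    bait = sum(token in BAIT_WORDS for token in tokens)
    return {"slang_freq": slang / n, "bait_freq": bait / n, "body_length": n}
```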


* Sentence length, word length, ratio of stop words to content words
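
A rough sketch of these three features, using regular expressions instead of NLTK so that it runs without downloaded tokenizer models. The stop-word set is a tiny hypothetical sample, and the function name `style_features` is an assumption:

```python
import re

# Tiny hypothetical stop-word sample; a real list (e.g. from NLTK) is much longer
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "that"}

def style_features(text):
    """Average sentence length, average word length, stop/content word ratio."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None  # nothing to measure
    stop = sum(w.lower() in STOP_WORDS for w in words)
    content = len(words) - stop
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Texts made only of stop words have no finite ratio
        "stop_to_content_ratio": stop / content if content else float("inf"),
    }
```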

In [1]:
# get the data

import json
import os
import pandas as pd

# https://github.com/ipython/ipython/issues/10123
directory_path = os.getcwd()
dataset_no_figures_path = directory_path + '/../data/dataset_no_figures/'

truth_classes = {}

with open(dataset_no_figures_path + 'truth_train.jsonl') as f:
    for line in f:
        truth = json.loads(line)
        truth_classes[truth['id']] = truth['truthClass']
        
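# Collect one single-row frame per instance and concatenate once at the end;
# DataFrame.append was removed in pandas 2.0 and copied the frame on every call
frames = []

with open(dataset_no_figures_path + 'instances_train.jsonl') as f:
    for line in f:
        instance = json.loads(line)
        data = pd.DataFrame({'postText': instance['postText'], 'truthClass': truth_classes[instance['id']]}, index=[instance['id']])
        frames.append(data)

df = pd.concat(frames)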
        
print(df)

                                                postText    truthClass
0      Apple's iOS 9 'App thinning' feature will give...  no-clickbait
1      RT @kenbrown12: Emerging market investors are ...  no-clickbait
2      U.S. Soccer should start answering tough quest...     clickbait
3      How theme parks like Disney World left the mid...  no-clickbait
4      Could light bulbs hurt your health? One compan...     clickbait
5      13 classic ’00s songs that were actually meant...     clickbait
6      Dez Bryant is reportedly considering skipping ...  no-clickbait
7      Pregnant mother of 12 accused of keeping kids ...  no-clickbait
8      RT @fionamatthias: 10 ways the expat life Is l...  no-clickbait
9      House #GOP plans two days of debate, Friday sh...  no-clickbait
10     Azeri government behind foreign media ban, say...  no-clickbait
11     Only one in three of us complain when we are u...  no-clickbait
12     An open letter to Jerry Seinfeld from a "polit...  no-clickbait
13    

In [None]:
# preprocess the data

# TODO remove newlines from postText? (e.g., \n in 17560)
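
One possible way to handle the newline TODO is to collapse every run of whitespace into a single space. This is only a sketch; `collapse_whitespace` is a hypothetical helper, and whether newlines should be removed at all is still an open question in this notebook:

```python
import re

def collapse_whitespace(text):
    """Replace each run of whitespace (including \\n and \\r\\n) with one space."""
    return re.sub(r"\s+", " ", text).strip()

# Could be applied column-wise, e.g. df['postText'].map(collapse_whitespace)
cleaned = collapse_whitespace("Breaking\nnews:\r\n read this")
```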

In [28]:
# get the features

from nltk.tokenize import sent_tokenize, word_tokenize
from string import ascii_lowercase, ascii_uppercase
import nltk

def number_of_words(text):
    return len(word_tokenize(text))

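def number_of_uppercase_words(text):
    # Counts tokens that start with an ASCII uppercase letter, so capitalized
    # words ("Apple") are counted as well as all-caps words ("GOP")
    words = word_tokenize(text)
    n_of_uppercase_words = 0
    for word in words:
        if word[0] in ascii_uppercase:
            n_of_uppercase_words += 1
    return n_of_uppercase_words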

def is_exclamation_question_mark_present(text):
    return '!' in text or '?' in text

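def lix(text):
    # LIX = words / sentences + 100 * long words / words,
    # where long words have more than 6 characters
    words = word_tokenize(text)
    long_words = sum(1 for word in words if len(word) > 6)
    return len(words) / len(sent_tokenize(text)) + 100 * long_words / len(words)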

def rix(text):
    lw = 0
    words = word_tokenize(text)
    for word in words:
        if len(word) >= 7:
            lw += 1
    return lw / len(sent_tokenize(text))

# https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index
def clindex(text):
    text_lower = text.lower()
    number_of_letters = 0
    for character in text_lower:
        if character in ascii_lowercase:
            number_of_letters += 1
    number_of_sentences = len(sent_tokenize(text))
    n_of_words = number_of_words(text)
    # L = letters / words * 100 (average number of letters per 100 words)
    l = number_of_letters / n_of_words * 100
    # S = sentences / words * 100 (average number of sentences per 100 words)
    s = number_of_sentences / n_of_words * 100

    return 0.0588 * l - 0.296 * s - 15.8

In [26]:
# normalize features

# TODO get features in range of [0,1] or [-1,1]?
# TODO convert all booleans to 0 and 1?
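
Both TODOs could be covered by min-max scaling, sketched here in plain Python. The helper `min_max_scale` is hypothetical; mapping to [0, 1] rather than [-1, 1] is just one of the two options being considered, and mapping a constant column to zeros is an arbitrary choice:

```python
def min_max_scale(values):
    """Rescale numbers linearly to [0, 1]; a constant column maps to all zeros."""
    values = [float(v) for v in values]  # also turns True/False into 1.0/0.0
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([2, 4, 6])
```

In practice the minimum and maximum should be taken from the training set only and reused for the test set; scikit-learn's `MinMaxScaler` handles this through separate `fit` and `transform` calls.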

In [None]:
# create train and test sets

In [None]:
# create error functions

In [None]:
# test models

# TODO try https://xgboost.ai