This notebook explores feature engineering for text classification.  Your task is to create two new feature functions (like `dictionary_feature` and `unigram_feature` below), and include them in the `build_features` function.  A check grade will be given to generic features that apply across arbitrary text classification problems (e.g., a feature for bigrams); check+ will be given for at least one feature that reveals your own understanding of your data. What features do you think will help for your particular problem? Your grade is *not* tied to whether accuracy goes up or down, so be creative!  You are free to read in any other external resources you like (dictionaries, document metadata, etc.)

Q0: Briefly describe your data (including the categories you're predicting)

The data we will be using is a fake news/true news dataset curated from <a href = 'https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset'>Kaggle</a>. The dependent variable is the veracity of a news story (either True or False), with the text of the news story as the source of predictors.

In [1]:
import sys
from collections import Counter
from sklearn import preprocessing
from sklearn import linear_model
import pandas as pd
from scipy import sparse
import numpy as np

In [2]:
import nltk
import operator
import re

In [3]:
!pip install g2p_en



In [4]:
from g2p_en import G2p
g2p = G2p()

In [5]:
def read_data(filename):
    X=[]
    Y=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            label=cols[0]
            text=cols[1]
            X.append(text)
            Y.append(label)
    return X, Y

In [6]:
# Change this to the directory with your data (from the CheckData_TODO.ipynb exercise).  
# The directory should contain train.tsv, dev.tsv and test.tsv
directory="../6.classification/fake news dataset"

In [7]:
trainX, trainY=read_data("%s/train.tsv" % directory)
devX, devY=read_data("%s/dev.tsv" % directory)

In [8]:
def majority_class(trainY, devY):
    labelCounts=Counter()
    for label in trainY:
        labelCounts[label]+=1
    majority=labelCounts.most_common(1)[0][0]
    
    correct=0.
    for label in devY:
        if label == majority:
            correct+=1
            
    print("%s\t%.3f" % (majority, correct/len(devY)))

Here we'll create two feature classes -- one feature class noting the presence of a word in an external dictionary, and one feature class for the word identity (i.e., unigram).  We'll implement each feature class as a function that takes a single document as input (as a list of tokens) and returns a dict corresponding to the feature we're creating.

In [9]:
# Here's a sample dictionary we can create by inspecting the output of the Mann-Whitney test (in 2.compare/)
dem_dictionary=set(["republican","cut", "opposition"])
repub_dictionary=set(["growth","economy"])

def political_dictionary_feature(tokens):
    feats={}
    for word in tokens:
        if word in dem_dictionary:
            feats["word_in_dem_dictionary"]=1
        if word in repub_dictionary:
            feats["word_in_repub_dictionary"]=1
    return feats

In [10]:
def unigram_feature(tokens):
    feats={}
    for word in tokens:
        feats["UNIGRAM_%s" % word]=1
    return feats
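
The assignment text mentions bigrams as an example of a generic feature. A minimal sketch in the same style as `unigram_feature`; the name `bigram_feature` and the `BIGRAM_` key prefix are assumptions chosen for illustration, not part of the original notebook:

```python
def bigram_feature(tokens):
    # Binary indicator for every pair of adjacent tokens.
    # The "BIGRAM_" prefix keeps these keys distinct from the unigram keys.
    feats = {}
    for first, second in zip(tokens, tokens[1:]):
        feats["BIGRAM_%s_%s" % (first, second)] = 1
    return feats

example = bigram_feature(["the", "president", "said"])
```

Like the unigram feature, this returns one key per distinct bigram, so a document with fewer than two tokens yields an empty dict.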

Q1: Add first new feature function here.  Describe your feature and why you think it will help.

The first feature is fairly naïve: the length of the document in tokens. The hypothesis is that fake news tends to be shorter (the mean lengths of the fake and true news datasets were slightly different), for example because true news quotes many relevant sources and fake news does not. We want to see whether the logistic regression can use that signal.

In [11]:
def new_feature_class_one(tokens):
    feats={}
    feats["length"]=len(tokens)
    return feats

Q2: Add second new feature function here. Describe your feature and why you think it will help.

The second feature is text complexity, as measured by the Flesch-Kincaid grade level. The hypothesis is that fake news documents tend to use simpler language with fewer multisyllabic words. We hope that the logistic regression can use that signal.
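
For reference, the standard Flesch-Kincaid grade level formula, with the same constants that the implementation in this notebook uses, is:

$$\text{FK} = 0.39 \cdot \frac{\text{total words}}{\text{total sentences}} + 11.8 \cdot \frac{\text{total syllables}}{\text{total words}} - 15.59$$

Longer sentences and more syllables per word both raise the grade level.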

In [12]:
arpabet = nltk.corpus.cmudict.dict()

In [13]:
def get_pronunciation(word):
    if word in arpabet:
        # pick the first pronunciation
        return arpabet[word][0]

    else:
        return g2p(word)

def get_syllable_count(word):
    pronunciation=get_pronunciation(word)
    sylls=0
    for phon in pronunciation:
        # vowels in arpabet end in digits (indicating stress)
        if re.search(r"\d$", phon) is not None:
            sylls+=1
    return sylls

In [14]:
def flesch_kincaid_grade_level(tokens):
    num_words=0
    num_syllables=0

    # The tokens arrive without sentence boundaries, so the whole document
    # is treated as a single sentence.
    for token in tokens:
        syllables=get_syllable_count(token)
        # only count tokens with at least one syllable (skips e.g. pure punctuation)
        if syllables > 0:
            num_syllables+=syllables
            num_words+=1

    # guard against documents with no valid words (avoids division by zero)
    if num_words == 0:
        return 0.0

    num_sents=1
    fk=0.39 * (num_words/num_sents) + 11.8 * (num_syllables/num_words) - 15.59
    return fk

In [15]:
def new_feature_class_two(tokens):
    feats={}
    feats["text complexity"]=flesch_kincaid_grade_level(tokens)
    return feats

This is the main function we'll use to aggregate together all of the information from different feature classes.  Each document has a feature dict (`feats`), and we'll update that dict with the new dict that each separate feature class is returning.  (Here you want to make sure that the keys each feature function is creating are unique so they don't get clobbered by other functions).
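
A minimal sketch of the clobbering problem mentioned above. The feature functions here are hypothetical and exist only to illustrate what happens when two functions reuse the same key:

```python
def token_count_feature(tokens):
    # hypothetical feature: number of tokens
    return {"count": len(tokens)}

def type_count_feature(tokens):
    # hypothetical feature: number of distinct tokens; reuses the key "count"
    return {"count": len(set(tokens))}

def type_count_feature_prefixed(tokens):
    # same value, but under a key no other function uses
    return {"TYPES_count": len(set(tokens))}

def combine(tokens, feature_functions):
    # same update pattern as build_features: later dicts overwrite earlier keys
    feats = {}
    for function in feature_functions:
        feats.update(function(tokens))
    return feats

tokens = ["the", "cat", "saw", "the", "dog"]
clobbered = combine(tokens, [token_count_feature, type_count_feature])
distinct = combine(tokens, [token_count_feature, type_count_feature_prefixed])
print(clobbered)  # only one "count" entry survives
print(distinct)   # both values are kept
```

Prefixing keys with the feature class name, as `unigram_feature` does with `UNIGRAM_`, is one simple way to avoid the collision.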

### Preprocessing

We remove the indicators of veracity: namely, '(Reuters)', the news post location (e.g., 'WASHINGTON') and other pre-text metadata in our documents that verify the information. This is an additional pre-processing step that puts the fake and true news datasets on as equal a footing as possible, since, for example, a mention of '(Reuters)' is an excellent sign of verification.
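
The token filter in `build_features` drops 'washington' and 'london' wherever they occur in a story and covers only those two locations. As a hedged alternative, the sketch below strips a leading dateline with a regular expression. It assumes datelines have the shape `CITY (Reuters) - ` at the very start of a document, which has not been verified against the dataset; `strip_dateline` is a hypothetical helper name.

```python
import re

# Assumed dateline shape: "CITY (Reuters) - " at the very start of a document.
# At most 60 non-parenthesis characters are allowed before "(Reuters)".
DATELINE = re.compile(r"^\s*[^()]{0,60}\(reuters\)\s*-\s*", re.IGNORECASE)

def strip_dateline(doc):
    # Remove the leading dateline if present; leave other documents unchanged.
    return DATELINE.sub("", doc, count=1)

cleaned = strip_dateline("WASHINGTON (Reuters) - The Senate voted on Tuesday.")
```

Because the pattern is anchored at the start, later mentions of a city name inside the story body are left in place.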

In [16]:
def build_features(trainX, feature_functions):
    # tokens that give away the source of a story (see Preprocessing above)
    stopwords = {'(reuters)', '-', 'washington', 'london'}
    data=[]
    for doc in trainX:
        feats={}
        doc = doc.lower()
        # split on single spaces; punctuation stays attached to words
        tokens=doc.split(" ")
        tokens = [i for i in tokens if i not in stopwords]

        for function in feature_functions:
            feats.update(function(tokens))

        data.append(feats)
    return data

In [17]:
# This helper function converts a dictionary of feature names to unique numerical ids
def create_vocab(data):
    feature_vocab={}
    idx=0
    for doc in data:
        for feat in doc:
            if feat not in feature_vocab:
                feature_vocab[feat]=idx
                idx+=1
                
    return feature_vocab

In [18]:
# This helper function converts a dictionary of feature names to a sparse representation
# that we can fit in a scikit-learn model.  This is important because almost all feature 
# values will be 0 for most documents (note: why?), and we don't want to save them all in 
# memory.

def features_to_ids(data, feature_vocab):
    new_data=sparse.lil_matrix((len(data), len(feature_vocab)))
    for idx,doc in enumerate(data):
        for f in doc:
            if f in feature_vocab:
                new_data[idx,feature_vocab[f]]=doc[f]
    return new_data
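
On the "why?" in the comment above: each document contains only a small fraction of the features seen across the whole training set, so most entries in a dense matrix would be zero. A small sketch that measures this on a list of feature dicts; `feature_density` is a hypothetical helper, not part of the original notebook:

```python
def feature_density(data):
    # Fraction of (document, feature) cells that hold a nonzero value.
    vocab = set()
    nonzero = 0
    for doc in data:
        vocab.update(doc)
        nonzero += sum(1 for value in doc.values() if value != 0)
    total_cells = len(data) * len(vocab)
    return nonzero / total_cells if total_cells else 0.0

toy_docs = [
    {"UNIGRAM_senate": 1, "UNIGRAM_voted": 1},
    {"UNIGRAM_image": 1},
]
density = feature_density(toy_docs)
```

With real unigram features the vocabulary is far larger than any single document, so the density would be expected to be much lower than in this toy case.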

In [19]:
# This function evaluates a list of feature functions on the training/dev data arguments
def pipeline(trainX, devX, trainY, devY, feature_functions):
    trainX_feat=build_features(trainX, feature_functions)
    devX_feat=build_features(devX, feature_functions)

    # just create vocabulary from features in *training* data
    feature_vocab=create_vocab(trainX_feat)

    trainX_ids=features_to_ids(trainX_feat, feature_vocab)
    devX_ids=features_to_ids(devX_feat, feature_vocab)
    
    logreg = linear_model.LogisticRegression(C=1.0, solver='lbfgs', penalty='l2', max_iter=10000)
    logreg.fit(trainX_ids, trainY)
    print("Accuracy: %.3f" % logreg.score(devX_ids, devY))
    
    # finding weights
    n = 10
    weights=logreg.coef_[0]
    reverse_vocab=[None]*len(weights)
    for k in feature_vocab:
        reverse_vocab[feature_vocab[k]]=k
    for feature, weight in sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))[:n]:
        print("%.3f\t%s" % (weight, feature))
    print()
    for feature, weight in list(reversed(sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))))[:n]:
        print("%.3f\t%s" % (weight, feature))

In [20]:
majority_class(trainY,devY)

0	0.518


Explore the impact of different feature functions by evaluating them below:

In [21]:
features=[political_dictionary_feature]
pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.578
0.552	word_in_dem_dictionary
0.939	word_in_repub_dictionary

0.939	word_in_repub_dictionary
0.552	word_in_dem_dictionary


In [22]:
features=[political_dictionary_feature, unigram_feature]
pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.993
-2.439	UNIGRAM_more:
-2.367	UNIGRAM_via:
-1.855	UNIGRAM_obama
-1.843	UNIGRAM_s
-1.647	UNIGRAM_t
-1.160	UNIGRAM_image
-1.075	UNIGRAM_via
-1.070	UNIGRAM_this
-1.065	UNIGRAM_trump
-1.022	UNIGRAM_gop

2.630	UNIGRAM_on
2.234	UNIGRAM_said
1.969	UNIGRAM_said.
1.618	UNIGRAM_trump’s
1.593	UNIGRAM_u.s.
1.387	UNIGRAM_...
1.273	UNIGRAM_minister
1.218	UNIGRAM_“the
1.166	UNIGRAM_“i
1.046	UNIGRAM_reuters.


The unigram feature is especially good at distinguishing true from fake news. The reporting style of the language ('on [], [] said...') is a big contributing factor, as the coefficient weights show. This is consistent with peer-reviewed literature that found unigrams to be effective:
<ul>
<li>Ahmed H, Traore I, Saad S. “Detecting opinion spams and fake news using text classification”, Journal of Security and Privacy, Volume 1, Issue 1, Wiley, January/February 2018.</li>
<li>Ahmed H, Traore I, Saad S. (2017) “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques”. In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).</li>
</ul>

In [23]:
features=[new_feature_class_one]
pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.556
-0.001	length

-0.001	length


Our length feature is far less effective: at 0.556 accuracy it is only slightly above the majority-class baseline of 0.518.
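
One possible contributor (an untested assumption) is that raw token counts sit on a much larger scale than the binary features and enter the model only linearly. Two hedged variants that could be tried are a log-scaled length and a binned length; the function names and bin edges below are hypothetical choices, not tuned on the data:

```python
import math

def log_length_feature(tokens):
    # log(1 + n) compresses large counts and is defined for empty documents
    return {"log_length": math.log(1 + len(tokens))}

def binned_length_feature(tokens, edges=(100, 300, 1000)):
    # One binary indicator per length range; the edges are arbitrary assumptions.
    n = len(tokens)
    for edge in edges:
        if n < edge:
            return {"LENGTH_BIN_lt_%d" % edge: 1}
    return {"LENGTH_BIN_ge_%d" % edges[-1]: 1}
```

Either function can be passed to `pipeline` in place of `new_feature_class_one`; whether it helps on this dataset would need to be checked empirically.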

In [24]:
features=[new_feature_class_two]
pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.555
-0.001	text complexity

-0.001	text complexity


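Surprisingly, the text complexity measure was not as effective as expected either: its accuracy of 0.555 is nearly identical to that of the length feature. Because `flesch_kincaid_grade_level` receives a flat token list and treats the whole document as one sentence, the words-per-sentence term is simply the word count, so the score largely tracks document length.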

In [None]:
features=[new_feature_class_one, new_feature_class_two]
pipeline(trainX, devX, trainY, devY, features)

In [None]:
features=[unigram_feature, new_feature_class_one, new_feature_class_two]
pipeline(trainX, devX, trainY, devY, features)

### Additional feature exploration

In [27]:
def new_feature_class_three(tokens):
    feats={}
    if 'http' in tokens or 'https' in tokens:
        feats["hyperlink"] = 1
    else:
        feats["hyperlink"] = 0
    return feats

Here, we expect that news stories that contain links are more truthful.
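
Note that `new_feature_class_three` only fires when a token is exactly `http` or `https`. Since `build_features` splits on spaces, a full URL such as `https://example.com/story` stays a single token and does not match. A hedged sketch of a variant that looks inside each token instead; the name `hyperlink_substring_feature` and its key are hypothetical:

```python
def hyperlink_substring_feature(tokens):
    # Fires when any token contains a URL scheme, even with punctuation
    # attached (e.g. "(https://example.com)").
    has_link = any("http://" in tok or "https://" in tok for tok in tokens)
    return {"hyperlink_substring": 1 if has_link else 0}

example = hyperlink_substring_feature(["see", "https://example.com/story", "for", "details"])
```

The results for this variant are not reported here, so whether it changes accuracy on this dataset is an open question.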

In [28]:
features=[new_feature_class_three]
pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.518
-0.853	hyperlink

-0.853	hyperlink
