# Final Project: Detecting Fake News

## Overview

I am using a logistic-regression-style classifier built in Keras to decide whether an article is fake news. The network ends in a single sigmoid unit trained with binary cross-entropy, and it also has one hidden layer of 8 units. The training and test data are split into two separate CSV files with the columns id, title, author, text, and label. The test file is identical except that it has no label column. A label of 1 indicates that the article is unreliable, and a label of 0 indicates that the article can be trusted.

The datasets used for this project can be found [here](https://www.kaggle.com/c/fake-news/data).

### Technologies Used

- Python3
- Pandas
- Numpy
- Keras
- Seaborn
- Matplotlib
- NLTK
- scikit-learn

### Imports

In [45]:
# silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
%matplotlib inline  
import numpy as np
import pandas as pd
import seaborn as sns

import nltk
nltk.download('stopwords')
nltk.download('wordnet')

import keras
from keras.models import Sequential
from keras.layers import Dense, Input, Activation

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report , accuracy_score, precision_score, recall_score, f1_score


pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Data Preprocessing and Exploration

In [0]:
#load in dataset
url = 'https://raw.githubusercontent.com/ebonnecab/ds2-2/master/final_project/fake-news/train.csv'
train_df = pd.read_csv(url)

In [3]:
train_df.head()

|   | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


In [4]:
#get datatypes for each column
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
id        20800 non-null int64
title     20242 non-null object
author    18843 non-null object
text      20761 non-null object
label     20800 non-null int64
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [5]:
#checking for duplicates
duplicatesdf = train_df[train_df.duplicated()]
print(duplicatesdf)

Empty DataFrame
Columns: [id, title, author, text, label]
Index: []


In [6]:
#checking if any columns contain nan values
train_df.isna().any()

id        False
title      True
author     True
text       True
label     False
dtype: bool

In [7]:
#determining how many nan values
print(train_df.isnull().sum())

id           0
title      558
author    1957
text        39
label        0
dtype: int64


#### Observations About Null Values

After exploring the data, I noticed that every row with a missing title or missing text is labeled as fake news (label 1), which is understandable for incomplete articles. However, a few articles with missing author information are labeled as real news.

In [8]:
null_txt_df = train_df[train_df['text'].isnull()]
null_txt_df.head()

|   | id | title | author | text | label |
|---|---|---|---|---|---|
| 142 | 142 | Gorafi Magazine : Entretien exclusif avec Bara... | NaN | NaN | 1 |
| 573 | 573 | Le top des recherches Google passe en top des ... | NaN | NaN | 1 |
| 1200 | 1200 | La Corée du Nord annonce avoir envoyé un missi... | NaN | NaN | 1 |
| 1911 | 1911 | Grand-Prix du Brésil – Romain Grosjean obtient... | NaN | NaN | 1 |
| 2148 | 2148 | Gorafi Magazine: Barack Obama « Je vous ai déj... | NaN | NaN | 1 |
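
One way to check the observation about missing values is to look at the label counts among the rows where a given column is null. The sketch below uses a tiny made-up frame (`toy_df` and `label_counts_where_null` are hypothetical names, and the values are not taken from the Kaggle data) with the same column names as `train_df`.

```python
import pandas as pd

# Hypothetical stand-in for train_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "title":  ["a", None, "c", "d", "e"],
    "author": ["x", "y", None, None, "z"],
    "text":   ["t1", "t2", None, "t4", "t5"],
    "label":  [0, 1, 1, 0, 1],
})

def label_counts_where_null(df, column):
    """Count labels among the rows where `column` is missing."""
    return df.loc[df[column].isnull(), "label"].value_counts().to_dict()

null_label_summary = {
    col: label_counts_where_null(toy_df, col)
    for col in ["title", "author", "text"]
}
print(null_label_summary)
```

Running the same helper on `train_df` before the `dropna` call would show, per column, how many of the incomplete rows carry each label.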


In [9]:
# drop every row that has at least one missing value
train_df = train_df.dropna(how='any')
print(train_df.isnull().sum())

id        0
title     0
author    0
text      0
label     0
dtype: int64


In [0]:
del train_df['id']

In [11]:
train_df.head()

|   | title | author | text | label |
|---|---|---|---|---|
| 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


### Feature Extraction

- Uppercase Letters
- Word Count
- Average Word Length

In [0]:
def avg_word_len(col):
    words = col.split()
    word_len = 0
    for word in words:
        word_len+= len(word)
    if len(words) != 0:
        avg = word_len/len(words)
    else: 
        avg = 0  
    return avg

In [0]:
# Writing 'title' and 'text' features into the same three columns would leave
# only the 'text' values, so the features are computed from 'text' directly.
train_df['Uppercase'] = train_df['text'].str.count(r'[A-Z]')  # number of uppercase letters
train_df['word_count'] = train_df['text'].apply(lambda x : len(str(x).split(" ")))
train_df['avg_word_len'] = train_df['text'].apply(avg_word_len)


In [14]:
train_df.head()

|   | title | author | text | label | Uppercase | word_count | avg_word_len |
|---|---|---|---|---|---|---|---|
| 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 | 210 | 820 | 5.00122 |
| 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 | 105 | 727 | 4.83662 |
| 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 | 226 | 1266 | 5.059242 |
| 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 | 130 | 559 | 4.788151 |
| 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 | 21 | 154 | 5.071429 |
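
The three feature columns above describe the article body only. If title features are wanted as well, each source column needs its own set of output columns so that one does not overwrite the other. A minimal sketch of that idea in plain Python follows; `text_features` and the prefixed key names are hypothetical, and it splits on any whitespace rather than on single spaces.

```python
import re

def text_features(value, prefix):
    """Return simple surface features for one string, keyed by `prefix`."""
    text = str(value)
    words = text.split()
    total_chars = sum(len(w) for w in words)
    return {
        f"{prefix}_uppercase": len(re.findall(r"[A-Z]", text)),  # uppercase letters
        f"{prefix}_word_count": len(words),
        f"{prefix}_avg_word_len": total_chars / len(words) if words else 0,
    }

# Illustrative row; the text value is made up.
row = {"title": "Why the Truth Might Get You Fired", "text": "BREAKING news today"}

features = {}
for col in ("title", "text"):
    features.update(text_features(row[col], prefix=col))
print(features)
```

With a DataFrame, the same function could be applied per column and the resulting dictionaries expanded into separate `title_*` and `text_*` columns.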


#### Text Preprocessing

1. Convert to lowercase
2. Remove punctuation
3. Remove stop words using NLTK
4. Remove numbers
5. Strip whitespace
6. Remove most frequently occurring words
7. Remove rare words
8. Lemmatization using NLTK's WordNetLemmatizer
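
The first five steps can be sketched on a single string as follows. This is a minimal illustration, not the notebook's code: `clean_text` is a hypothetical helper, and `STOP_WORDS` is a small made-up subset standing in for NLTK's English stop-word list.

```python
import re

# Hypothetical stop-word subset; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and"}

def clean_text(raw, stop_words=STOP_WORDS):
    """Lowercase, drop punctuation, stop words and numeric tokens, trim spaces."""
    lowered = raw.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    tokens = [
        tok for tok in no_punct.split()
        if tok not in stop_words and not tok.isnumeric()
    ]
    # Joining on single spaces also removes leading and trailing whitespace.
    return " ".join(tokens)

cleaned = clean_text("  The Senate voted 52 to 48, in a SURPRISE move!  ")
print(cleaned)  # senate voted surprise move
```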

In [0]:
# stop words from NLTK (a set makes the membership checks fast)
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

cols_to_change = ['title', 'text', 'author']

for col in cols_to_change:
    train_df[col] = train_df[col].str.lower()
    # regex=True is needed in pandas >= 2.0, where str.replace matches literally by default
    train_df[col] = train_df[col].str.replace(r"[^\w\s]", "", regex=True)
    train_df[col] = train_df[col].apply(lambda x: " ".join(word for word in x.split() if word not in stop))
    train_df[col] = train_df[col].apply(lambda x: " ".join(word for word in x.split() if not word.isnumeric()))
    train_df[col] = train_df[col].str.strip()

In [0]:
all_words = ' '.join(train_df.text).split()
freq_words = pd.Series(all_words).value_counts()[:10]
rare_words = pd.Series(all_words).value_counts()[-150000:]

In [17]:
print('Total Word Count: ', len(all_words))
print('Most Frequent Words')
print('===================')
print(freq_words)
print('Rarely Occurring Words')
print('======================')
rare_words[:10]

Total Word Count:  8145980
Most Frequent Words
said      77434
mr        66024
trump     42011
one       35302
would     35040
people    32995
new       28214
like      24571
also      23811
us        22372
dtype: int64
Rarely Occurring Words
saberrattling    11
ritualistic      11
paces            11
hornsey          11
mortally         11
steht            11
throttle         11
otsc             11
yam              11
consults         11
dtype: int64

In [0]:
# freq_words and rare_words are Series indexed by word, so the words to drop are their index labels
words_to_drop = set(freq_words.index) | set(rare_words.index)
train_df.text = train_df.text.apply(lambda x: " ".join(word for word in x.split() if word not in words_to_drop))
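
The frequent-word and rare-word filtering can also be sketched without pandas. The version below is a hedged illustration on made-up documents: `drop_frequent_and_rare` is a hypothetical helper, and it defines rare words by a minimum count (`min_count`) rather than by slicing the last 150000 entries of the frequency table as the notebook does.

```python
from collections import Counter

def drop_frequent_and_rare(docs, top_n=2, min_count=2):
    """Remove the `top_n` most common words and words seen fewer than `min_count` times."""
    counts = Counter(word for doc in docs for word in doc.split())
    frequent = {word for word, _ in counts.most_common(top_n)}
    rare = {word for word, n in counts.items() if n < min_count}
    blocked = frequent | rare
    return [" ".join(w for w in doc.split() if w not in blocked) for doc in docs]

# Made-up documents for illustration.
docs = [
    "said trump said vote",
    "said vote trump poll",
    "said poll hornsey vote",
]
filtered = drop_frequent_and_rare(docs)
print(filtered)  # ['trump', 'trump poll', 'poll']
```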

In [19]:
train_df.head(20)

|   | title | author | text | label | Uppercase | word_count | avg_word_len |
|---|---|---|---|---|---|---|---|
| 0 | house dem aide didnt even see comeys letter ja... | darrell lucus | house dem aide didnt even see comeys letter ja... | 1 | 210 | 820 | 5.00122 |
| 1 | flynn hillary clinton big woman campus breitbart | daniel j flynn | ever get feeling life circles roundabout rathe... | 0 | 105 | 727 | 4.83662 |
| 2 | truth might get fired | consortiumnewscom | truth might get fired october tension intellig... | 1 | 226 | 1266 | 5.059242 |
| 3 | civilians killed single us airstrike identified | jessica purkiss | videos civilians killed single airstrike ident... | 1 | 130 | 559 | 4.788151 |
| 4 | iranian woman jailed fictional unpublished sto... | howard portnoy | print iranian woman sentenced six years prison... | 1 | 21 | 154 | 5.071429 |
| 5 | jackie mason hollywood would love trump bombed... | daniel nussbaum | trying times jackie mason voice reason weeks e... | 0 | 45 | 206 | 4.862069 |
| 7 | benoît hamon wins french socialist partys pres... | alissa j rubin | paris france chose idealistic traditional cand... | 0 | 118 | 849 | 4.957421 |
| 9 | backchannel plan ukraine russia courtesy trump... | megan twohey scott shane | week michael flynn resigned national security ... | 0 | 398 | 1773 | 4.889977 |
| 10 | obamas organizing action partners soroslinked ... | aaron klein | organizing action activist group morphed barac... | 0 | 336 | 1553 | 5.284848 |
| 11 | bbc comedy sketch real housewives isis causes ... | chris tomlinson | bbc produced spoof real housewives tv programm... | 0 | 180 | 552 | 4.906844 |


In [0]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(col):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(col)]

for col in ['title', 'text']:
    train_df[col] = train_df[col].apply(lambda x: " ".join(lemmatize_text(str(x))))

In [21]:
train_df.head()

|   | title | author | text | label | Uppercase | word_count | avg_word_len |
|---|---|---|---|---|---|---|---|
| 0 | house dem aide didnt even see comeys letter ja... | darrell lucus | house dem aide didnt even see comeys letter ja... | 1 | 210 | 820 | 5.00122 |
| 1 | flynn hillary clinton big woman campus breitbart | daniel j flynn | ever get feeling life circle roundabout rather... | 0 | 105 | 727 | 4.83662 |
| 2 | truth might get fired | consortiumnewscom | truth might get fired october tension intellig... | 1 | 226 | 1266 | 5.059242 |
| 3 | civilian killed single u airstrike identified | jessica purkiss | video civilian killed single airstrike identif... | 1 | 130 | 559 | 4.788151 |
| 4 | iranian woman jailed fictional unpublished sto... | howard portnoy | print iranian woman sentenced six year prison ... | 1 | 21 | 154 | 5.071429 |


#### Convert Categorical Data to Numerical Data

In [0]:
# replace each author with their relative frequency in the dataset
author_freq = train_df['author'].value_counts(normalize=True)

# map every row's author to that share
train_df['author_num'] = train_df['author'].map(author_freq)

In [23]:
train_df.head(20)

|   | title | author | text | label | Uppercase | word_count | avg_word_len | author_num |
|---|---|---|---|---|---|---|---|---|
| 0 | house dem aide didnt even see comeys letter ja... | darrell lucus | house dem aide didnt even see comeys letter ja... | 1 | 210 | 820 | 5.00122 | 0.000547 |
| 1 | flynn hillary clinton big woman campus breitbart | daniel j flynn | ever get feeling life circle roundabout rather... | 0 | 105 | 727 | 4.83662 | 0.00093 |
| 2 | truth might get fired | consortiumnewscom | truth might get fired october tension intellig... | 1 | 226 | 1266 | 5.059242 | 0.003172 |
| 3 | civilian killed single u airstrike identified | jessica purkiss | video civilian killed single airstrike identif... | 1 | 130 | 559 | 4.788151 | 5.5e-05 |
| 4 | iranian woman jailed fictional unpublished sto... | howard portnoy | print iranian woman sentenced six year prison ... | 1 | 21 | 154 | 5.071429 | 0.002406 |
| 5 | jackie mason hollywood would love trump bombed... | daniel nussbaum | trying time jackie mason voice reason week exc... | 0 | 45 | 206 | 4.862069 | 0.006125 |
| 7 | benoît hamon win french socialist party presid... | alissa j rubin | paris france chose idealistic traditional cand... | 0 | 118 | 849 | 4.957421 | 0.000602 |
| 9 | backchannel plan ukraine russia courtesy trump... | megan twohey scott shane | week michael flynn resigned national security ... | 0 | 398 | 1773 | 4.889977 | 5.5e-05 |
| 10 | obamas organizing action partner soroslinked i... | aaron klein | organizing action activist group morphed barac... | 0 | 336 | 1553 | 5.284848 | 0.002953 |
| 11 | bbc comedy sketch real housewife isi cause out... | chris tomlinson | bbc produced spoof real housewife tv programme... | 0 | 180 | 552 | 4.906844 | 0.002461 |
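
The `author_num` column is a frequency encoding: each author is replaced by the share of rows written by that author. A small worked example in plain Python, using a made-up list of authors and a hypothetical `frequency_encode` helper, shows the arithmetic.

```python
from collections import Counter

def frequency_encode(values):
    """Map each category to its share of all rows."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]

# Made-up author column with four rows.
authors = ["aaron klein", "daniel nussbaum", "aaron klein", "howard portnoy"]
author_num = frequency_encode(authors)
print(author_num)  # [0.5, 0.25, 0.5, 0.25]
```

In the table above, a value of 5.5e-05 corresponds to an author who appears once among the 18285 remaining rows.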


#### TF-IDF for Text Frequency Distribution

In [24]:
tfidf = TfidfVectorizer()

# combine title and body into a single all_text column
train_df.title = train_df.title.astype(str)
train_df.text = train_df.text.astype(str)
train_df['all_text'] = train_df['title'] + ' ' + train_df['text']

# learn the vocabulary and build the document-term matrix;
# toarray() turns the sparse result into a dense 18285 x 35139 array (roughly 5 GB as float64)
features = tfidf.fit_transform(train_df['all_text']).toarray()
features_df = pd.DataFrame(features)
print(tfidf.vocabulary_)
print(tfidf.idf_)

[10.12074384 10.12074384 10.12074384 ... 10.12074384 10.12074384
 10.12074384]
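
The repeated value 10.12074384 at both ends of `idf_` is consistent with terms that occur in exactly one document. With scikit-learn's default `smooth_idf=True`, the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The sketch below reproduces that arithmetic in plain Python; `smoothed_idf` is a hypothetical helper, not part of scikit-learn.

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """IDF as defined for TfidfVectorizer with smooth_idf=True (the default)."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

# 18285 rows remain after dropping NaNs; a term found in one document has df = 1.
idf_single_doc = smoothed_idf(18285, 1)
print(round(idf_single_doc, 4))  # 10.1207

# A term that appears in every document gets the minimum possible value of 1.
idf_everywhere = smoothed_idf(18285, 18285)
print(idf_everywhere)  # 1.0
```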


In [25]:
print(features_df.shape)
print(train_df.shape)

(18285, 35139)
(18285, 9)


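Split the data into training and test sets using the TF-IDF features. Only the TF-IDF matrix is passed to the model; the engineered columns (Uppercase, word_count, avg_word_len, author_num) are not used in this split.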

In [0]:
X_train, X_test, y_train, y_test = train_test_split(features, train_df['label'], random_state=0, test_size=0.25)

## Building the Model

In [0]:
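def build_log_model():
    model_log = Sequential()
    # hidden layer: 8 units fed by the 35139 TF-IDF features
    model_log.add(Dense(8, input_dim = 35139, kernel_initializer='normal'))
    # softmax is applied across the 8 hidden units
    model_log.add(Activation('softmax'))
    # single sigmoid unit outputs the probability of label 1 (unreliable)
    model_log.add(Dense(1, activation='sigmoid'))
    model_log.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer='adam')
    print(model_log.summary())
    return model_log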

## Training and Evaluating the Model

In [37]:
model_log = build_log_model()
model_log.fit(X_train, y_train, epochs=10, batch_size=50,  verbose=1)

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_5 (Dense)              (None, 8)                 281120    
_________________________________________________________________
activation_3 (Activation)    (None, 8)                 0         
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 9         
Total params: 281,129
Trainable params: 281,129
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
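
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The sketch below redoes that arithmetic (`dense_params` is a hypothetical helper) and, for comparison, shows the count for a single sigmoid unit placed directly on the TF-IDF features, which is what a plain logistic regression would use.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters in a fully connected layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(35139, 8)   # first Dense layer
output = dense_params(8, 1)       # sigmoid output layer
print(hidden, output, hidden + output)  # 281120 9 281129

# For comparison only: one sigmoid unit on the raw features, with no hidden layer.
logistic_only = dense_params(35139, 1)
print(logistic_only)  # 35140
```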


<keras.callbacks.History at 0x7f46a2c8cb70>

In [40]:
print("Accuracy of Model")
print("=================")
# evaluate returns [loss, accuracy] for the compiled loss and metrics
print(model_log.evaluate(X_test, y_test, verbose=0))

Accuracy of Model
[0.17161023568554293, 0.9553805774278216]
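
The two numbers printed above are the binary cross-entropy loss (about 0.17) and the accuracy (about 95.5%) on the held-out split. Accuracy alone does not show how the errors divide between missed fake articles and real articles flagged as fake. The notebook imports scikit-learn's `precision_score`, `recall_score`, and `f1_score` but does not call them; the sketch below shows what those metrics measure, written in plain Python on made-up labels and probabilities. `binary_metrics` and the 0.5 threshold are assumptions for illustration.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Precision, recall and F1 for the positive class (label 1 = unreliable)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels and sigmoid outputs, not results from this notebook.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
metrics = binary_metrics(y_true, y_prob)
print(metrics)
```

For the real model, the probabilities would come from `model_log.predict(X_test)`, thresholded the same way before being compared with `y_test`.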


## Resources

- ['Data Science 2.2 - Neural Networks & Deep Learning'](https://github.com/Make-School-Courses/DS-2.2-Deep-Learning)
- ['Fake Real News Classifier'](https://medium.com/deepdatascience/fake-real-news-classifier-with-machine-learning-eaf2cbe44e8f)
- ['Keras Sequential Model Guide'](https://keras.io/getting-started/sequential-model-guide/)