# Final Project: Detecting Fake News

## Overview

I use logistic regression to classify news articles as reliable or unreliable (fake news). The data comes as two separate CSV files, one for training and one for testing. Both files have the columns `id`, `title`, `author`, and `text`; the training file also has a `label` column, which the test file omits. A label of 1 means the article is unreliable, and a label of 0 means the article can be trusted.

The datasets used for this project can be found [here](https://www.kaggle.com/c/fake-news/data).
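To make the idea of logistic regression concrete, here is a minimal sketch of how such a model turns numeric features into a 0/1 label. It is not the model built in this notebook: the function names, weights, and feature values are made up for illustration, and the 0.5 threshold is an assumed default.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 (unreliable) if the predicted probability reaches the threshold, else 0."""
    prob = sigmoid(np.dot(features, weights) + bias)
    return int(prob >= threshold)

# Hypothetical weights and features, for illustration only (not learned from the dataset).
example_weights = np.array([0.8, -0.5])
example_bias = -0.1
example_features = np.array([1.0, 0.2])

example_label = predict_label(example_features, example_weights, example_bias)
print(example_label)
```

Training a logistic regression model amounts to finding the weights and bias that make these probabilities match the labels in the training file as closely as possible.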

### Technologies Used

- Python 3
- pandas
- NumPy
- Keras
- Matplotlib
- Seaborn
- NLTK

### Imports

In [None]:
# silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')
import keras
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import nltk

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## Data Preprocessing and Exploration

In [2]:
#load in dataset
train_df = pd.read_csv('fake-news/train.csv')
train_df.head()

| | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


In [3]:
#get datatypes for each column
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
id        20800 non-null int64
title     20242 non-null object
author    18843 non-null object
text      20761 non-null object
label     20800 non-null int64
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [4]:
#checking for duplicates
duplicatesdf = train_df[train_df.duplicated()]
print(duplicatesdf)

Empty DataFrame
Columns: [id, title, author, text, label]
Index: []


In [5]:
#checking if any columns contain nan values
train_df.isna().any()

id        False
title      True
author     True
text       True
label     False
dtype: bool

In [6]:
#determining how many nan values
print(train_df.isnull().sum())

id           0
title      558
author    1957
text        39
label        0
dtype: int64


#### Observations About Null Values

After exploring the data, I noticed that every row with a missing title or missing text is labeled as fake news (label 1), which is understandable. However, a few articles with missing author information are labeled as real news, so dropping every row that contains a NaN value also discards some reliable articles.
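One way to check an observation like this is to count the labels among the rows where a given column is missing. The sketch below uses a small invented DataFrame as a stand-in for `train_df`; the helper name `labels_when_missing` and the toy rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; the rows are invented for illustration.
toy_df = pd.DataFrame({
    'title':  ['a', np.nan, 'c', 'd'],
    'author': ['x', 'y', np.nan, np.nan],
    'text':   ['t1', 't2', np.nan, 't4'],
    'label':  [0, 1, 1, 0],
})

def labels_when_missing(df, col):
    """Count the labels among the rows where `col` is NaN."""
    return df.loc[df[col].isnull(), 'label'].value_counts().to_dict()

missing_summary = {
    col: labels_when_missing(toy_df, col) for col in ['title', 'author', 'text']
}
print(missing_summary)
```

Running the same helper on the real `train_df` would show, per column, how many rows with a missing value carry each label.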

In [7]:
null_txt_df = train_df[train_df['text'].isnull()]
null_txt_df.head()

| | id | title | author | text | label |
|---|---|---|---|---|---|
| 142 | 142 | Gorafi Magazine : Entretien exclusif avec Bara... | NaN | NaN | 1 |
| 573 | 573 | Le top des recherches Google passe en top des ... | NaN | NaN | 1 |
| 1200 | 1200 | La Corée du Nord annonce avoir envoyé un missi... | NaN | NaN | 1 |
| 1911 | 1911 | Grand-Prix du Brésil – Romain Grosjean obtient... | NaN | NaN | 1 |
| 2148 | 2148 | Gorafi Magazine: Barack Obama « Je vous ai déj... | NaN | NaN | 1 |


In [8]:
train_df = train_df.dropna(how='any')
print(train_df.isnull().sum())

id        0
title     0
author    0
text      0
label     0
dtype: int64


In [9]:
del train_df['id']

In [10]:
train_df.head()

| | title | author | text | label |
|---|---|---|---|---|
| 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


### Feature Extraction

- Uppercase Letters
- Word Count
- Average Word Length
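As a small worked example, the sketch below computes these three features on two invented sentences. The DataFrame `toy_features` and the helper `mean_word_length` are hypothetical names used only here; the uppercase feature counts uppercase letters, matching the `[A-Z]` pattern used in this notebook.

```python
import pandas as pd

def mean_word_length(text):
    """Average length of the whitespace-separated words in `text` (0.0 if empty)."""
    words = str(text).split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

# Invented sentences, for illustration only.
toy_features = pd.DataFrame({'text': ['BREAKING News today', 'calm report']})

toy_features['uppercase'] = toy_features['text'].str.count(r'[A-Z]')        # uppercase letters
toy_features['word_count'] = toy_features['text'].str.split().str.len()     # words per row
toy_features['avg_word_len'] = toy_features['text'].apply(mean_word_length)

print(toy_features)
```

The intuition is that sensational articles may use more capital letters or differ in length and vocabulary, so these simple counts can serve as numeric inputs to the classifier.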

In [11]:
def avg_word_len(text):
    """Return the mean length of the whitespace-separated words in text (0 if there are none)."""
    words = text.split()
    if not words:
        return 0
    return sum(len(word) for word in words) / len(words)

In [12]:
# Compute the features from the article body.
# Looping over both 'title' and 'text' would write to the same three columns,
# so only the values for 'text' would survive.
train_df['Uppercase'] = train_df['text'].str.count(r'[A-Z]')  # number of uppercase letters
train_df['word_count'] = train_df['text'].apply(lambda x: len(str(x).split(" ")))
train_df['avg_word_len'] = train_df['text'].apply(avg_word_len)


In [13]:
train_df.head()

| | title | author | text | label | Uppercase | word_count | avg_word_len |
|---|---|---|---|---|---|---|---|
| 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 | 210 | 820 | 5.00122 |
| 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 | 105 | 727 | 4.83662 |
| 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 | 226 | 1266 | 5.059242 |
| 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 | 130 | 559 | 4.788151 |
| 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 | 21 | 154 | 5.071429 |


#### Text Preprocessing

1. Convert to lowercase
2. Remove punctuation
3. Remove stop words using NLTK
4. Remove the most frequently occurring words
5. Remove rare words
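The five steps above can be sketched on a toy corpus as follows. This is an illustration under stated assumptions, not the exact procedure used on the dataset: `TOY_STOP_WORDS` is a small hard-coded set standing in for NLTK's English stop word list, the sentences are invented, and rare words are defined here by a minimum count (`min_count`), whereas this notebook takes a fixed-size slice from the bottom of the frequency table.

```python
import re
from collections import Counter

# Small hard-coded stop word set standing in for NLTK's English list (assumption).
TOY_STOP_WORDS = {'the', 'a', 'is', 'of', 'and', 'in'}

def basic_clean(text):
    """Steps 1-3: lowercase, strip punctuation, drop stop words. Returns a token list."""
    text = re.sub(r'[^\w\s]', '', text.lower())
    return [w for w in text.split() if w not in TOY_STOP_WORDS]

def clean_corpus(docs, n_frequent=1, min_count=2):
    """Steps 4-5: drop the n_frequent most common words and words seen fewer than min_count times."""
    tokenized = [basic_clean(d) for d in docs]
    counts = Counter(w for doc in tokenized for w in doc)
    frequent = {w for w, _ in counts.most_common(n_frequent)}
    rare = {w for w, c in counts.items() if c < min_count}
    drop = frequent | rare
    return [' '.join(w for w in doc if w not in drop) for doc in tokenized]

# Invented sentences, for illustration only.
toy_docs = [
    'The senator said the vote is rigged!',
    'Senator said: vote early, vote often.',
    'A vote of confidence, said nobody.',
]
cleaned_docs = clean_corpus(toy_docs)
print(cleaned_docs)
```

Computing the word counts after steps 1-3 matters: otherwise the most frequent words are mostly stop words, and the rare-word list contains tokens with capital letters or punctuation that never match the cleaned text.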

In [75]:
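# count every word once and reuse the result
# (the counts printed below were produced from text that was already lowercased
# and stripped of punctuation and stop words)
all_words = ' '.join(train_df.text).split()
word_counts = pd.Series(all_words).value_counts()
freq_words = word_counts[:10]        # the 10 most frequent words
rare_words = word_counts[-150000:]   # the 150,000 least frequent words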

In [52]:
print('Total Word Count: ', len(all_words))
print('Most Frequent Words')
print('===================')
freq_words

Total Word Count:  8320398
Most Frequent Words
===================

said      77434
mr        66024
trump     42011
one       35302
would     35040
people    32995
new       28214
like      24571
also      23811
us        22372
dtype: int64

In [77]:
print('Rarely Occurring Words')
print('======================')
rare_words[:50]

Rarely Occurring Words
======================

ohnehin           9
wades             9
vegetarians       9
новые             9
candice           9
266               9
derelict          9
bansal            9
kadar             9
lifeboats         9
invigorating      9
interwoven        9
entertains        9
shilling          9
hollands          9
preeminence       9
tcm               9
buoy              9
accompaniment     9
archrival         9
overstepping      9
uncontrollably    9
hightower         9
apologise         9
monastic          9
sevens            9
interpersonal     9
earthen           9
renta             9
vader             9
rolex             9
recon             9
laughingly        9
intodayin         9
monticello        9
cosmetology       9
im22              9
matriarch         9
countryman        9
antiwarcom        9
grinch            9
cu                9
gilmour           9
bowers            9
coreyciorciari    9
yorkville         9
lifeanddeath      9
arol              9
grannis           9
burris            9


In [79]:
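#stop words from nltk
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

# sets give fast membership tests; the words of each Series are in its index
stop = set(stopwords.words('english'))
freq_set = set(freq_words.index)
rare_set = set(rare_words.index)
cols_to_change = ['title', 'text', 'author']

def remove_words(text, words):
    """Drop every whitespace-separated token of text that appears in words."""
    return " ".join(word for word in text.split() if word not in words)

for col in cols_to_change:
    train_df[col] = train_df[col].str.lower()
    # regex=True is needed in pandas 2.0+ for the pattern to be treated as a regex
    train_df[col] = train_df[col].str.replace(r"[^\w\s]", "", regex=True)
    if col != 'author':
        # one pass over the combined set is equivalent to three separate passes
        train_df[col] = train_df[col].apply(remove_words, words=stop | freq_set | rare_set)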

In [80]:
train_df.head()

| | title | author | text | label | Uppercase | word_count | avg_word_len |
|---|---|---|---|---|---|---|---|
| 0 | house dem aide didnt even see comeys letter ja... | darrell lucus | house dem aide didnt even see comeys letter ja... | 1 | 210 | 820 | 5.00122 |
| 1 | flynn hillary clinton big woman campus breitbart | daniel j flynn | ever get feeling life circles roundabout rathe... | 0 | 105 | 727 | 4.83662 |
| 2 | truth might get fired | | truth might get fired october 29 2016 tension ... | 1 | 226 | 1266 | 5.059242 |
| 3 | 15 civilians killed single airstrike identified | jessica purkiss | videos 15 civilians killed single airstrike id... | 1 | 130 | 559 | 4.788151 |
| 4 | iranian woman jailed fictional unpublished sto... | howard | print iranian woman sentenced six years prison... | 1 | 21 | 154 | 5.071429 |


## Building the Model

## Training and Evaluating the Model

## Conclusion

## Resources