# Fake News Prediction

In this notebook, I analyze data acquired from <a href='https://www.kaggle.com/'>Kaggle</a> and prepare it for training a machine learning model for <b>Fake News Detection</b>. The resulting model will be used in a fake news detector website to fulfill the requirements of the <b>Advanced Programming</b> subject.

# Data analysis and preprocessing

Data manipulation in this project relies on the <b>Pandas</b> library, so we import it at the beginning of the file.

In [1]:
import pandas as pd

In our dataset, the data is split into two files:

- Fake
- True

They contain fake and true news, respectively. Since we will be using <b>Supervised Learning</b>, we plan to concatenate both files into a single dataframe, labeling fake news with 0 and true news with 1. First, we inspect the True and Fake datasets separately and assign the labels.

In [2]:
df_true = pd.read_csv('True.csv')
df_true.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [3]:
df_fake = pd.read_csv('Fake.csv')
df_fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [4]:
label_value = 1
df_true['label'] = label_value
df_true.head()

Unnamed: 0,title,text,subject,date,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1


In [5]:
label_value = 0
df_fake['label'] = label_value
df_fake.head()

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


Below, we concatenate both dataframes and shuffle the rows to mix the data.

In [6]:
df = pd.concat([df_true, df_fake], axis=0, ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
df.head()

Unnamed: 0,title,text,subject,date,label
0,BREAKING: GOP Chairman Grassley Has Had Enoug...,"Donald Trump s White House is in chaos, and th...",News,"July 21, 2017",0
1,Failed GOP Candidates Remembered In Hilarious...,Now that Donald Trump is the presumptive GOP n...,News,"May 7, 2016",0
2,Mike Pence’s New DC Neighbors Are HILARIOUSLY...,Mike Pence is a huge homophobe. He supports ex...,News,"December 3, 2016",0
3,California AG pledges to defend birth control ...,SAN FRANCISCO (Reuters) - California Attorney ...,politicsNews,"October 6, 2017",1
4,AZ RANCHERS Living On US-Mexico Border Destroy...,Twisted reasoning is all that comes from Pelos...,politics,"Apr 25, 2017",0


Let's check other information regarding our dataset:

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 1.7+ MB


In [8]:
df.columns

Index(['title', 'text', 'subject', 'date', 'label'], dtype='object')
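
Because the labels drive the supervised learning step, it can also help to check how balanced the two classes are. The following is a minimal sketch on a tiny stand-in frame; the `toy` rows are hypothetical, and in the notebook the same calls would presumably be made on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the combined dataframe; only 'label' matters here.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "label": [0, 0, 1, 0, 1],
})

# Absolute number of rows per class (0 = fake, 1 = true).
class_counts = toy["label"].value_counts()

# Share of each class, useful for judging whether accuracy is a fair metric.
class_share = toy["label"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```

If one class dominated heavily, accuracy alone would be a misleading metric, and the per-class precision and recall would deserve more attention.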

### Prepare data

In [9]:
df = df.drop(columns = ['title','subject','date'])
df.head()

Unnamed: 0,text,label
0,"Donald Trump s White House is in chaos, and th...",0
1,Now that Donald Trump is the presumptive GOP n...,0
2,Mike Pence is a huge homophobe. He supports ex...,0
3,SAN FRANCISCO (Reuters) - California Attorney ...,1
4,Twisted reasoning is all that comes from Pelos...,0


We can save the final dataframe to a CSV file in case it is needed later, but this step is optional.

In [10]:
df.to_csv('final_data_file.csv', index=False)

In [11]:
df

Unnamed: 0,text,label
0,"Donald Trump s White House is in chaos, and th...",0
1,Now that Donald Trump is the presumptive GOP n...,0
2,Mike Pence is a huge homophobe. He supports ex...,0
3,SAN FRANCISCO (Reuters) - California Attorney ...,1
4,Twisted reasoning is all that comes from Pelos...,0
...,...,...
44893,ABUJA (Reuters) - The United States has formal...,1
44894,Tune in to the Alternate Current Radio Network...,0
44895,I m convinced the Freedom From Religion group...,0
44896,WASHINGTON (Reuters) - The Republican tax plan...,1


## Processing of <i>'text'</i> column with Natural Language Toolkit

In [12]:
df['text'] = df['text'].apply(lambda x: x.lower())
df.head()

Unnamed: 0,text,label
0,"donald trump s white house is in chaos, and th...",0
1,now that donald trump is the presumptive gop n...,0
2,mike pence is a huge homophobe. he supports ex...,0
3,san francisco (reuters) - california attorney ...,1
4,twisted reasoning is all that comes from pelos...,0


In [13]:
# Remove punctuation

import string

def punctuation_removal(text):
    all_list = [char for char in text if char not in string.punctuation]
    clean_str = ''.join(all_list)
    return clean_str

df['text'] = df['text'].apply(punctuation_removal)
df.head()

Unnamed: 0,text,label
0,donald trump s white house is in chaos and the...,0
1,now that donald trump is the presumptive gop n...,0
2,mike pence is a huge homophobe he supports exg...,0
3,san francisco reuters california attorney gen...,1
4,twisted reasoning is all that comes from pelos...,0
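
The character-by-character filter above can also be expressed with `str.translate`, which is usually faster on long strings. This is only a sketch of an equivalent alternative, not the approach used in the notebook; the names `_PUNCT_TABLE` and `punctuation_removal_fast` are made up for illustration.

```python
import string

# Translation table that deletes every character listed in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def punctuation_removal_fast(text):
    """Return text with all characters in string.punctuation removed."""
    return text.translate(_PUNCT_TABLE)

sample = "san francisco (reuters) - california attorney general"
cleaned = punctuation_removal_fast(sample)
print(cleaned)
```

As with the original function, only ASCII punctuation is removed, so typographic characters such as curly quotes stay in the text, and removing a standalone dash leaves two consecutive spaces behind.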


In [16]:
# Removing stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# A set makes each membership test constant-time instead of scanning a list
stop = set(stopwords.words('english'))

df['text'] = df['text'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))

df.head(10)



Unnamed: 0,text,label
0,donald trump white house chaos trying cover ru...,0
1,donald trump presumptive gop nominee time reme...,0
2,mike pence huge homophobe supports exgay conve...,0
3,san francisco reuters california attorney gene...,1
4,twisted reasoning comes pelosi days especially...,0
5,washington reuters lawyer private practice dec...,1
6,aden reuters salafist imam shot dead gunmen ea...,1
7,kuala lumpur reuters potential witnesses multi...,1
8,goal socialism communism vladimir lenin commun...,0
9,opposing views beliefs much country heated feu...,0


In [17]:
x = df['text']

In [18]:
y = df['label']

In [24]:
# Display both shapes; a cell only shows the value of its last expression
x.shape, y.shape

((44898,), (44898,))

# Machine Learning Model

In [25]:
from sklearn.model_selection import train_test_split

In [26]:
x_train , x_test , y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

### Vectorization

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [28]:
vect = TfidfVectorizer()
x_train = vect.fit_transform(x_train)
x_test = vect.transform(x_test)
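
To illustrate why the vectorizer is fitted on the training split only, here is a small sketch with a hypothetical three-document corpus (`train_docs`, `test_docs` and `demo_vect` are illustrative names, not part of the notebook). Words that appear only in the test documents are not in the learned vocabulary and are ignored by `transform`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents standing in for the cleaned 'text' column.
train_docs = ["senate passes budget", "budget vote delayed"]
test_docs = ["senate rejects budget"]

demo_vect = TfidfVectorizer()

# fit_transform learns the vocabulary and IDF weights from the training data.
train_matrix = demo_vect.fit_transform(train_docs)

# transform reuses that vocabulary; 'rejects' was never seen, so it is dropped.
test_matrix = demo_vect.transform(test_docs)

vocab = sorted(demo_vect.vocabulary_)
print(vocab)
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, which is what lets a model trained on one be evaluated on the other.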

### Logistic Regression

In [30]:
from sklearn.linear_model import LogisticRegression

In [31]:
LogReg = LogisticRegression()
LogReg.fit(x_train, y_train)

In [32]:
LogReg.score(x_test, y_test)

0.9891982182628062

In [34]:
from sklearn.metrics import classification_report

In [35]:
prediction_LR = LogReg.predict(x_test)
print(classification_report(y_test, prediction_LR))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      4669
           1       0.99      0.99      0.99      4311

    accuracy                           0.99      8980
   macro avg       0.99      0.99      0.99      8980
weighted avg       0.99      0.99      0.99      8980
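
Since the model is meant to be used in a website, one possible way to classify a new piece of text is to bundle the vectorizer and the classifier into a single scikit-learn pipeline. The sketch below assumes a hypothetical toy corpus and is not the notebook's trained model; in practice the input would first go through the same lowercasing, punctuation and stopword steps as the training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned toy corpus; 0 = fake, 1 = true as in the notebook.
toy_texts = [
    "washington reuters senate passes budget bill",
    "london reuters prime minister announces election",
    "shocking secret they do not want you to know",
    "you will not believe what this politician did",
]
toy_labels = [1, 1, 0, 0]

# The pipeline applies TF-IDF and then logistic regression, so cleaned
# strings can be passed straight to fit() and predict().
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(toy_texts, toy_labels)

new_article = ["reuters reports senate vote on budget"]
predicted_label = model.predict(new_article)        # array with one entry, 0 or 1
predicted_proba = model.predict_proba(new_article)  # shape (1, 2): P(fake), P(true)

print(predicted_label, predicted_proba)
```

Keeping both steps in one object means the website only has to load and call a single model, and the vectorizer can never get out of sync with the classifier.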

