# Fake News Detection - Machine Learning - Logistic Regression

Implementing Fake News Detection in Python. Implementing a machine learning system to predict if the news is real or fake.

Using textual data.

Resource: https://www.youtube.com/watch?v=nacLBdyG6jE

#### Overview

We will use a 'label data set'. This data set includes many news articles. Along with other information, it contains columns labeled as "Real News" or "Fake News".

We will import this dataset.

We will pre process this dataset.The softwares understands numbers, but do not understand text or characters. Our data is text, so we will have to convert the text data to numbers that be meaningful to the machine.

Once the text is converted, we will split the dataset into training data and test data. 
We will train our machine learning model with the training data set. 
We will use the test data to evaluate our model. 

We will feed the training data into the Logistic Regression model. We are using this model because logistic regression is best suited for binary classification, which means it is being classified into one of two types: real or fake. 

We will train the logistic regression model, resulting in a Trained Logistic Regression model. We will be able to perform evaluations on this model. 

In [1]:
import numpy as np
##### For making numpy arrays

import pandas as pd
##### For creating data frames and storing data 
import re
##### Regular expression library, to search for words in a text or paragraph

from nltk.corpus import stopwords
##### Natural language Toolkit Corpus
#### stopwords to remove words that do not add much value to a paragraph vertex. ex) they, the, a, what, when

from nltk.stem.porter import PorterStemmer
##### PorterStemmer will be used when we performed the stemming function. Steamming is used to remove the prefix and suffix of a word, and return the root word 

from sklearn.feature_extraction.text import TfidfVectorizer
##### To convert text into feature vectors

from sklearn.model_selection import train_test_split
##### To split data set into training data and test data

from sklearn.linear_model import LogisticRegression
##### For Logistic Regression Model 

from sklearn.metrics import accuracy_score
##### For testing accuracy

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Amrita\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
print(stopwords.words("english"))
##### Will print stop words in english

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

### Data Pre-processing

In [4]:
# Load dataset to pandas DataFrame
newsDataset = pd.read_csv('./train.csv.zip')

In [6]:
newsDataset

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1
...,...,...,...,...,...
20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...,0
20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...,0
20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...,0
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal...",1


In [7]:
#see rows and columns
newsDataset.shape

(20800, 5)

In [8]:
# print first 5 rows of dataframe using head function
newsDataset.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [9]:
# Counting (summation) the number of missing values (null) in the dataset
newsDataset.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [10]:
# Replace null values in dataset with a empty string
newsDataset = newsDataset.fillna('')

In [12]:
# We can merge data together. Merge author and news column, seperated by a space
newsDataset['content'] = newsDataset['author'] + " " + newsDataset['title']

In [13]:
# Lets see:
print(newsDataset['content'])

0        Darrell Lucus House Dem Aide: We Didn’t Even S...
1        Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2        Consortiumnews.com Why the Truth Might Get You...
3        Jessica Purkiss 15 Civilians Killed In Single ...
4        Howard Portnoy Iranian woman jailed for fictio...
                               ...                        
20795    Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
20796    Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma...
20797    Michael J. de la Merced and Rachel Abrams Macy...
20798    Alex Ansary NATO, Russia To Hold Parallel Exer...
20799              David Swanson What Keeps the F-35 Alive
Name: content, Length: 20800, dtype: object


In [16]:
# Seperate label column from dataset using the drop() function
# We are removing column, therefore axis = 1
X = newsDataset.drop(columns='label', axis=1)

In [17]:
# A variable to store label data
Y = newsDataset['label']

In [18]:
#print
print(X)
print(Y)

          id                                              title  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
20795  20795  Rapper T.I.: Trump a ’Poster Child For White S...   
20796  20796  N.F.L. Playoffs: Schedule, Matchups and Odds -...   
20797  20797  Macy’s Is Said to Receive Takeover Approach by...   
20798  20798  NATO, Russia To Hold Parallel Exercises In Bal...   
20799  20799                          What Keeps the F-35 Alive   

                                          author  \
0                                  Darrell Lucus   
1                                Daniel J. Flynn   
2                             Consortiu

### Stemming

Stemming - The process we will implement to reduce words to their 'root' words. 

In [34]:
stem_of_word = PorterStemmer()

In [41]:
# Lets create a function!

def stemming(content):
    # take content data and remove all extra numbers and characters.
    # This will keep only characters from the alphabet
    # Others will be replaced by a space 
    content_stem = re.sub('[^a-zA-z]',' ',content)
    
    # convery to lowercase 
    content_stem = content_stem.lower()
    
    # to split and convert to lists
    content_stem = content_stem.split()
    
    # Stemming function. To keep all words other than stopwords which we declared earlier. 
    content_stem = [stem_of_word.stem(word) for word in content_stem if not word in stopwords.words('english')]
    
    # Join remaining words
    content_stem = ' '.join(stem_of_word)
    return content_stem

In [50]:
# content is joined author and title
# To return same column as data column
newsDataset['content'] = newsDataset['content'].apply(stemming)

TypeError: stemming() got an unexpected keyword argument 'axis'

### Error: I am recieving a error on the .apply() function in Python. As I searched online, I was introduced to the fact that this is a error in Python. Some scenarios can provoke this error. 

One solution was add axis=1 as a argument in apply. This would mean to be passing the original data frame columns into the function, rather than the rows. So, this is not the correct solution for me. 

I will come back to this as I continue my research in python. 