Hello! Welcome to a Kaggle beginner's interpretation of the best way to tackle this problem: telling real news articles from fake ones.



**Importing and reading relevant files**

In [1]:
import numpy as np 
import pandas as pd 
from sklearn.model_selection import train_test_split
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re

true = pd.read_csv("/kaggle/input/fake-news-detection/True.csv")
fake = pd.read_csv("/kaggle/input/fake-news-detection/Fake.csv")
true.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [2]:
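from gc import collect
from IPython.display import clear_output
import nltk

# Refresh NLTK's download index, then fetch the Open Multilingual WordNet
# data used by the WordNet lemmatizer.
dler = nltk.downloader.Downloader()
dler._update_index()
nltk.download('omw-1.4')

# Hide the download log and run the garbage collector on each generation.
clear_output()
for i in range(3):
    collect(i)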

Let's combine the true and fake news datasets into a single dataset to make life easier.

At the same time, we add a Class column (1 for true news, 0 for fake news) so that we can tell the two types apart.

In [4]:
true["Class"] = "1"
fake["Class"] = "0"
dataset = pd.concat([true,fake])
dataset = dataset.sample(frac=1).reset_index(drop=True)
dataset.head()

Unnamed: 0,title,text,subject,date,Class
0,Frankfurt starts evacuation before attempt to ...,FRANKFURT (Reuters) - Frankfurt emergency serv...,worldnews,"September 2, 2017",1
1,"WATCH: Kieth Scott’s Wife Drops Mic On Cops, ...",What you are about to see is disturbing. Keith...,News,"September 23, 2016",0
2,Fatal Niger operation sparks calls for public ...,WASHINGTON (Reuters) - Democratic U.S. lawmake...,politicsNews,"October 26, 2017",1
3,BREAKING: ONLY MLB Player To Kneel During Nati...,When Oakland Athletics catcher Bruce Maxwell t...,politics,"Oct 29, 2017",0
4,Rosy White House tax cut forecast clashes with...,WASHINGTON/NEW YORK (Reuters) - The White Hous...,politicsNews,"October 27, 2017",1


**Next, we check for missing values.**

In [5]:
dataset.isnull().sum()

title      0
text       0
subject    0
date       0
Class      0
dtype: int64
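
Note that `isnull()` only counts missing (NaN) values. It does not flag empty or whitespace-only strings, nor repeated rows. A minimal sketch of two extra checks follows, using a small made-up frame (`demo`, not rows from the real dataset) so that it runs on its own:

```python
import pandas as pd

# Made-up stand-in for the combined dataset (these are not real rows).
demo = pd.DataFrame({
    "title": ["Headline A", "Headline B", "Headline A"],
    "text": ["Some article body.", "   ", "Some article body."],
})

# Rows whose text is empty or whitespace-only; isnull() reports 0 for these.
blank_text = demo["text"].str.strip().eq("")
print("blank text rows:", int(blank_text.sum()))

# Rows that exactly repeat an earlier row.
duplicates = demo.duplicated()
print("duplicate rows:", int(duplicates.sum()))
```

The same two expressions could be applied to `dataset` to see whether either case occurs there.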

**Removing unnecessary columns**


On closer examination, the date and subject are not relevant to our classification. Thus, we remove them.

The title, on the other hand, is useful text, so we first append it to the text column and then drop the separate title column.

In [6]:
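# Join with a space so the last word of the text and the first word of the title do not fuse.
dataset["text"] = dataset["text"] + " " + dataset["title"]
dataset.drop(["title", "subject", "date"], axis=1, inplace=True)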
dataset.head()

Unnamed: 0,text,Class
0,FRANKFURT (Reuters) - Frankfurt emergency serv...,1
1,What you are about to see is disturbing. Keith...,0
2,WASHINGTON (Reuters) - Democratic U.S. lawmake...,1
3,When Oakland Athletics catcher Bruce Maxwell t...,0
4,WASHINGTON/NEW YORK (Reuters) - The White Hous...,1


**Lemmatization**


We will now lemmatize the text, reducing each word to its dictionary form (for example, "lawmakers" becomes "lawmaker"), so that different inflections of a word count as the same feature.

In the same pass we also remove stop words such as "a", "the" and "and", which would otherwise add noise to the predictions.

To facilitate the use of the vectorizer, we join the cleaned words with single spaces and convert them all to lowercase to standardise them.


In [7]:
stop_words = set(stopwords.words("english"))

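def Lemmatizer(text):
    # Tokenize, drop stop words, keep letters only, lowercase, then lemmatize.
    word_bank = []
    lem = WordNetLemmatizer()
    word_tokens = word_tokenize(text)
    for word in word_tokens:
        # This check runs on the original-case token, so capitalized
        # stop words such as "The" or "What" are not filtered out.
        if word not in stop_words:
            new_word = re.sub('[^a-zA-Z]', '', word)  # strip digits and punctuation
            new_word = new_word.lower()
            lemmatized_word = lem.lemmatize(new_word)
            word_bank.append(lemmatized_word)

    # Rejoin the cleaned words into one space-separated string.
    return " ".join(word_bank)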

dataset["text"] = dataset["text"].apply(Lemmatizer)
dataset.head()

Unnamed: 0,text,Class
0,frankfurt reuters frankfurt emergency servi...,1
1,what see disturbing keith lamont scott man k...,0
2,washington reuters democratic u lawmaker ca...,1
3,when oakland athletics catcher bruce maxwell t...,0
4,washingtonnew york reuters the white house ...,1
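
The output above still contains "what", "when" and "the", along with some doubled spaces. That is consistent with the stop-word check running before the token is lowercased, and with tokens that become empty once non-letters are stripped. One possible variant is sketched below. To keep it runnable without NLTK data it makes some assumptions: it splits on whitespace instead of calling `word_tokenize`, uses a tiny hypothetical stop-word set (`DEMO_STOP_WORDS`), and takes the lemmatizer as an optional argument that defaults to doing nothing.

```python
import re

# Hypothetical, tiny stop-word set so the sketch runs without NLTK data.
# In the notebook this would be set(stopwords.words("english")).
DEMO_STOP_WORDS = {"a", "about", "an", "and", "are", "is", "the", "to",
                   "what", "when", "you"}

def clean_text(text, stop_words=DEMO_STOP_WORDS, lemmatize=lambda w: w):
    """Lowercase and strip non-letters first, then filter, then lemmatize."""
    tokens = []
    for raw in text.split():  # assumption: whitespace split, not word_tokenize
        word = re.sub(r"[^a-z]", "", raw.lower())
        # Skip tokens that became empty and stop words in any original casing.
        if word and word not in stop_words:
            tokens.append(lemmatize(word))
    return " ".join(tokens)

print(clean_text("What you are about to see is disturbing. The White House - said"))
# prints: see disturbing white house said
```

In the notebook, the same ordering could be tried by passing `WordNetLemmatizer().lemmatize` as `lemmatize` and the NLTK stop-word set as `stop_words`. The accuracy reported later was obtained with the original function, so it may shift slightly.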


**Vectorization and splitting**

To convert the text into numeric features that a model can work with, we will be using a TF-IDF vectorizer.


More info can be read here:
https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a

In short, it weights each word by how often it appears in a given text, scaled down by how many of the texts contain that word, so that words appearing almost everywhere count for less.
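
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on three made-up headlines. It assumes the smoothed formula that, as far as I understand, scikit-learn uses by default: idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The function name `tfidf` and the sample texts are illustrative only. The real `TfidfVectorizer` also tokenizes differently and returns a sparse matrix.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (whitespace tokenization)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # term frequency within this document
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # L2-normalize the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

docs = ["senate passes budget bill", "shocking budget video", "senate vote"]
rows = tfidf(docs)
for term, weight in sorted(rows[0].items()):
    print(f"{term:8s} {weight:.3f}")
```

In the first headline, "passes" and "bill" occur in only one text and so receive a higher weight than "senate" and "budget", which each occur in two.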

In [8]:
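from sklearn.feature_extraction.text import TfidfVectorizer
# Hold out 20% of the articles for testing; random_state fixes the split.
x_train, x_test, y_train, y_test = train_test_split(
    pd.DataFrame(dataset["text"]), pd.DataFrame(dataset["Class"]),
    test_size=0.2, random_state=1)

# Learn the vocabulary and IDF weights from the training text only,
# then apply the same mapping to the test text.
vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(x_train.iloc[:, 0])
xv_test = vectorization.transform(x_test.iloc[:, 0])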

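Let's test this using the passive-aggressive classifier (PAC).

In short, the PAC is a computationally cheap online linear classifier. It leaves its weights unchanged when an example is classified correctly with enough margin (passive) and corrects them when it is not (aggressive). This makes it a common choice for large text classification tasks such as fake news detection.

Read more: https://www.geeksforgeeks.org/passive-aggressive-classifiers/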
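
The passive and aggressive behaviour can be sketched in a few lines of plain Python. The sketch below uses the PA-II update from the passive-aggressive literature, which I believe is the variant that `loss='squared_hinge'` selects in scikit-learn. The function names (`pa2_fit`, `pa2_predict`) and the toy data are made up for illustration, labels are assumed to be +1 or -1, and this is not how scikit-learn implements it internally.

```python
def pa2_fit(samples, labels, C=0.16, epochs=5):
    """Train a linear classifier with the PA-II update; labels are +1 or -1."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x))
            loss = max(0.0, 1.0 - y * score)  # hinge loss on this example
            if loss == 0.0:
                continue  # passive: margin already satisfied, no update
            sq_norm = sum(xi * xi for xi in x)
            # Aggressive: step just far enough to reduce the loss,
            # damped by the regularization term 1 / (2C).
            tau = loss / (sq_norm + 1.0 / (2.0 * C))
            weights = [w + tau * y * xi for w, xi in zip(weights, x)]
    return weights

def pa2_predict(weights, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else -1

# Toy, linearly separable data: the first feature signals +1, the second -1.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2], [0.1, 1.0]]
y = [1, -1, 1, -1]
w = pa2_fit(X, y)
print([pa2_predict(w, x) for x in X])
```

A smaller `C` damps each update more strongly, which trades slower learning for less sensitivity to mislabeled or noisy examples.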

In [9]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import PassiveAggressiveClassifier

# C regularizes the size of each update; max_iter caps the passes over the data.
pa_clf = PassiveAggressiveClassifier(loss='squared_hinge', max_iter=50, C=0.16)
# y_train is a one-column DataFrame, so flatten it to the 1-D array scikit-learn expects.
pa_clf.fit(xv_train, y_train.values.ravel())
y_pred = pa_clf.predict(xv_test)

accscore = accuracy_score(y_test, y_pred)

print('The accuracy of prediction is {:.2f}%.\n'.format(accscore*100))

The accuracy of prediction is 99.64%.

