# <center>Fake News Detection</center>

<img  src='https://img.etimg.com/thumb/msid-76305449,width-650,imgsize-240474,,resizemode-4,quality-100/fake-news.jpg'></img>

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:green; border:0' role="tab" aria-controls="home"><center>Quick navigation</center></h3>

* [1. Introduction](#1)
* [2. Data Reading and Analysis](#2)
* [3. Data Processing and Cleansing](#3)
* [4. Data Exploration](#4)
* [5. Data Visualization](#5)
* [6. Model Training](#6)

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0' role="tab" aria-controls="home"><center>Introduction</center><a id=1></a></h3>

<h2>Problem statement:</h2>

<br>
Verifying the authenticity of information is a longstanding problem for businesses and society, in both print and digital media. On social networks, information spreads so quickly and is amplified so widely that distorted, inaccurate, or false content can cause real-world harm to millions of users within minutes. In recent years this problem has drawn public concern, and several approaches to mitigating it have been proposed. <br>
In this project, you are given a dataset in the fake-news_data.zip folder. The folder contains the CSV file train_news.csv, which you use to build a model that predicts whether a news article is fake or not. You have to try out different models on the dataset, evaluate their performance, and finally report the best model you found and its performance.


<h2>Data Description:</h2>

<br>
The dataset has 6 columns. Each column is described below:<br>
* “id”: Unique id of each news article<br>
* “headline”: The title of the news article<br>
* “news”: The full text of the news article<br>
* “Unnamed: 0”: A serial number<br>
* “written_by”: The author of the news article<br>
* “label”: Whether the news is fake (1) or not fake (0).
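
The column layout above can be sanity-checked before any modelling. The sketch below is only an illustration that uses the standard library: the two sample rows and the helper name `missing_columns` are made up for the example and are not taken from train_news.csv. The serial-number column is spelled `Unnamed: 0`, as in the notebook's `df.drop` call.

```python
import csv
import io

# Column names from the data description (order does not matter here).
EXPECTED_COLUMNS = ["Unnamed: 0", "id", "headline", "written_by", "news", "label"]

# Hypothetical two-row sample standing in for train_news.csv.
SAMPLE_CSV = (
    "Unnamed: 0,id,headline,written_by,news,label\n"
    "0,101,Example headline one,Author A,Body text one,0\n"
    "1,102,Example headline two,Author B,Body text two,1\n"
)


def missing_columns(csv_text, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = set(reader.fieldnames or [])
    return [name for name in expected if name not in present]


print(missing_columns(SAMPLE_CSV))
```

An empty list means every described column is present; any name that comes back points to a header that differs from the description.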


## <a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0' role="tab" aria-controls="home"><center>Data Reading and Analysis</center></h3><a id=2></a>

In [None]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.stem import SnowballStemmer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

In [None]:
#reading data 
df=pd.read_csv('C:/Users/HP/Desktop/Fake News Detection/train_news.csv')
df.head()

In [None]:
# dropping the unnecessary serial-number column
df.drop('Unnamed: 0',axis=1,inplace=True)

In [None]:
# let's check data once again 
df.head()

The data now looks clean.  
  >> Let's take a deeper dive into the data.

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0' role="tab" aria-controls="home"><center>Data Processing and Cleansing</center><a id=3></a></h3>

In [None]:
#distribution of classes for prediction
def create_distribution(dataFile):
    
    return sns.countplot(x='label', data=dataFile, palette='hls')

In [None]:
create_distribution(df)
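
The count plot shows the class balance visually; the same information can be read off as fractions. This is a minimal sketch on made-up labels, and `class_shares` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def class_shares(labels):
    """Return {label: fraction of rows} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels: 1 = fake, 0 = not fake.
example_labels = [0, 1, 1, 0, 1, 0, 0, 0]
shares = class_shares(example_labels)
print(shares)
```

On the notebook's data, `df['label'].value_counts(normalize=True)` should give the same kind of summary. A strongly skewed split would make plain accuracy a weak yardstick for the models trained later.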

In [None]:
# data integrity check: count the missing values in each column
# rows with missing values, if any, are dropped further below with dropna()
df.isnull().sum()
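
To make the per-column count concrete, here is a rough pure-Python equivalent on invented rows. `count_missing` is a hypothetical helper. Note one deliberate difference: pandas' `isnull` flags only NaN/None, while this sketch also treats empty strings as missing.

```python
def count_missing(rows, columns):
    """Count, per column, how many rows hold None or an empty string."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            value = row.get(column)
            if value is None or value == "":
                counts[column] += 1
    return counts


# Hypothetical rows: the second has no author, the third has an empty headline.
example_rows = [
    {"headline": "Example headline one", "written_by": "Author A", "label": 0},
    {"headline": "Example headline two", "written_by": None, "label": 1},
    {"headline": "", "written_by": "Author C", "label": 1},
]
missing = count_missing(example_rows, ["headline", "written_by", "label"])
print(missing)
```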

In [None]:
df.info()

In [None]:
#eng_stemmer = SnowballStemmer('english')
#stopwords = set(nltk.corpus.stopwords.words('english'))
X=df.drop('label',axis=1) # Dropping the output feature
X.head()

In [None]:
y=df['label'] # Assigning output to y
y.head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer

In [None]:
df=df.dropna() # Dropping rows that contain null values
df.shape

In [None]:
news=df.copy()

news.reset_index(inplace=True)

news.head(10)

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
stop_words = set(stopwords.words('english')) # build the stop-word set once instead of once per word
corpus = []
for i in range(0, len(news)):
    headline = re.sub('[^a-zA-Z]', ' ', news['headline'][i]) # replace digits and symbols with spaces
    headline = headline.lower()
    headline = headline.split()
    
    headline = [ps.stem(word) for word in headline if word not in stop_words] # drop stop words and stem the rest
    headline = ' '.join(headline)
    corpus.append(headline)

In [None]:
corpus[2:5]
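
The cleaning steps in the loop above are easier to follow on a single headline. The sketch below repeats them without NLTK so it runs anywhere: the stop-word set is a tiny made-up stand-in for NLTK's English list, the example headline is invented, and stemming is left out on purpose.

```python
import re

# Tiny hypothetical stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "on", "for"}


def clean_text(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [token for token in tokens if token not in stop_words]


tokens = clean_text("The Senate votes 52-48 to pass the 2017 budget!")
print(tokens)  # ['senate', 'votes', 'pass', 'budget']
```

In the notebook each surviving token is additionally passed through `ps.stem`, so the stored corpus contains stems rather than the full words shown here.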

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000,ngram_range=(1,3))
X = cv.fit_transform(corpus).toarray()

In [None]:
X.shape
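
With `ngram_range=(1,3)` the vectorizer counts single words, word pairs, and word triples. The following sketch shows which n-grams come out of one short token list. `word_ngrams` is a hypothetical helper written for illustration and is not how scikit-learn implements it internally.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


grams = word_ngrams(["senate", "passes", "budget"])
print(grams)
```

Three tokens already yield six n-grams, which is why the vocabulary grows quickly and why `max_features=5000` is used to keep only the most frequent terms across the corpus.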

In [None]:
y=news['label']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0) # Dividing data in train test
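
The idea behind a seeded hold-out split can be sketched in a few lines. This is an assumption-laden illustration, not scikit-learn's algorithm: `simple_train_test_split` is a hypothetical helper, and its shuffling and rounding will not reproduce the exact rows that `train_test_split` selects.

```python
import random


def simple_train_test_split(items, test_size=0.33, seed=0):
    """Shuffle indices with a fixed seed and hold out a test fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


train_part, test_part = simple_train_test_split(list(range(100)))
print(len(train_part), len(test_part))  # 67 33
```

Fixing the seed, like `random_state=0` in the notebook, makes the split repeatable so that scores from different models are measured on the same held-out rows.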

In [None]:
print("Feature names:",cv.get_feature_names_out()[:20]) # get_feature_names() was removed in scikit-learn 1.2
print("Parameters:",cv.get_params())

In [None]:
count_df = pd.DataFrame(X_train, columns=cv.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2
count_df.head()

## <a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0' role="tab" aria-controls="home"><center>Model Training</center></h3><a id=5></a>

In [None]:
from sklearn.naive_bayes import MultinomialNB
classifier=MultinomialNB()

from sklearn import metrics
import numpy as np
import itertools

In [None]:
classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy:   %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred)
cm
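
For binary labels 0 and 1, `metrics.confusion_matrix` returns `[[tn, fp], [fn, tp]]`, with true labels in rows and predictions in columns. The sketch below recomputes those four counts by hand on invented predictions and derives accuracy, precision, and recall for the fake class. The helper name and the demo lists are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) for binary labels."""
    tn = fp = fn = tp = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == positive:
            if guess == positive:
                tp += 1
            else:
                fn += 1
        else:
            if guess == positive:
                fp += 1
            else:
                tn += 1
    return tn, fp, fn, tp


# Hypothetical labels and predictions: 1 = fake, 0 = not fake.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true_demo, y_pred_demo)
accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # share of predicted-fake items that are fake
recall = tp / (tp + fn)     # share of fake items that were caught
print([[tn, fp], [fn, tp]], accuracy, precision, recall)
```

Precision and recall are worth reading next to accuracy here, because a missed fake article (`fn`) and a wrongly flagged real one (`fp`) are different kinds of error.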

In [None]:

from sklearn.linear_model import PassiveAggressiveClassifier
linear_clf = PassiveAggressiveClassifier(max_iter=50)

In [None]:
linear_clf.fit(X_train, y_train)
pred = linear_clf.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy:   %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred)
cm
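
The name "passive-aggressive" describes the update rule: the weights stay unchanged when an example is classified with enough margin, and otherwise move just far enough to fix it. Below is a minimal sketch of one PA-I step, the variant that scikit-learn's default `loss="hinge"` corresponds to. It assumes labels encoded as +1/-1 and omits the intercept; `pa_update` is a hypothetical helper, not scikit-learn code.

```python
def pa_update(weights, features, label, C=1.0):
    """One PA-I step. label must be +1 or -1; returns the new weight list."""
    margin = label * sum(w * x for w, x in zip(weights, features))
    loss = max(0.0, 1.0 - margin)  # hinge loss
    norm_sq = sum(x * x for x in features)
    if loss == 0.0 or norm_sq == 0.0:
        return list(weights)       # passive: margin already satisfied
    tau = min(C, loss / norm_sq)   # aggressive: step size capped by C
    return [w + tau * label * x for w, x in zip(weights, features)]


w0 = [0.0, 0.0, 0.0]
example = [1.0, 0.0, 2.0]
w1 = pa_update(w0, example, label=1)   # loss is 1, so the weights move
w2 = pa_update(w1, example, label=1)   # margin is now met, so (almost) nothing changes
print(w1, w2)
```

The cap `C` limits how far a single noisy example can pull the weights; the notebook leaves it at the library default.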

In [None]:
classifier=MultinomialNB(alpha=0.1)

In [None]:
previous_score=0
for alpha in np.arange(0,1,0.1):
    sub_classifier=MultinomialNB(alpha=alpha)
    sub_classifier.fit(X_train,y_train)
    y_pred=sub_classifier.predict(X_test)
    score = metrics.accuracy_score(y_test, y_pred)
    if score>previous_score:
        classifier=sub_classifier
        previous_score=score # remember the best score so far, so only a better model replaces it
    print("Alpha: {}, Score : {}".format(alpha,score))
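
The `alpha` being swept is the additive smoothing term of multinomial Naive Bayes: each term's probability within a class is estimated as (count + alpha) / (total count + alpha × vocabulary size). The sketch below applies that formula to invented counts to show how `alpha` changes the probability of a term never seen in the class. The helper name and the counts are assumptions made for illustration.

```python
import math


def smoothed_log_probs(word_counts, alpha):
    """Log P(word | class) with additive smoothing over the given vocabulary."""
    vocab_size = len(word_counts)
    total = sum(word_counts.values())
    return {
        word: math.log((count + alpha) / (total + alpha * vocab_size))
        for word, count in word_counts.items()
    }


# Hypothetical counts of three vocabulary terms within one class.
counts_in_class = {"breaking": 6, "report": 4, "hoax": 0}

for alpha in (0.1, 1.0):
    log_probs = smoothed_log_probs(counts_in_class, alpha)
    print(alpha, round(math.exp(log_probs["hoax"]), 4))
```

A larger `alpha` gives unseen terms more probability mass and flattens the differences between frequent terms. With `alpha = 0` an unseen term would get probability zero, which is why the sketch uses strictly positive values only.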

In [None]:
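feature_names = cv.get_feature_names_out() # get_feature_names() was removed in scikit-learn 1.2

# MultinomialNB.coef_ was removed in scikit-learn 1.1; for two classes, coef_[0] equalled feature_log_prob_[1]
classifier.feature_log_prob_[1]

In [None]:
# terms with the highest log probability under class 1 (fake)
sorted(zip(classifier.feature_log_prob_[1], feature_names), reverse=True)[:20]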

In [None]:
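corpus = []
stop_words = set(stopwords.words('english')) # build the stop-word set once instead of once per word
for i in range(0, len(news)):
    news1 = re.sub('[^a-zA-Z]', ' ', news['news'][i]) # replace digits and symbols with spaces
    news1 = news1.lower()
    news1 = news1.split()
    
    news1 = [ps.stem(word) for word in news1 if word not in stop_words] # drop stop words and stem the rest
    news1 = ' '.join(news1)
    corpus.append(news1)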

In [None]:
corpus[2:5]