**DESCRIPTION** : A machine learning program to identify when an article might be fake news

**Dataset Description**

 *train.csv*: A full training dataset with the following attributes:

    1. id: unique id for a news article
    2. title: the title of a news article
    3. author: author of the news article
    4. text: the text of the article; could be incomplete
    5. label: a label that marks the article as potentially unreliable
                1: unreliable
                0: reliable


**Importing the required libraries**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re # regular expressions, used here to strip non-letter characters from text
from nltk.corpus import stopwords # stopwords: common words that add little meaning to a paragraph, e.g. 'the', 'with', 'i'
from nltk.stem.porter import PorterStemmer # reduces a word to its stem by stripping suffixes
from sklearn.feature_extraction.text import TfidfVectorizer # converts text into numerical feature vectors
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score
from wordcloud import WordCloud

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
# printing the stopwords of English language
print(stopwords.words('english'))

**Data Pre-processing**

In [5]:
# Loading the training dataset into a pandas DataFrame
news_dataset = pd.read_csv('/dataset/train.csv')

In [None]:
# size of dataset
news_dataset.shape


In [None]:
# printing the first 5 rows of the dataframe
news_dataset.head()

In [None]:
news_dataset.isnull().sum()

In [9]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

Now let’s look at how the articles are distributed across the two labels using the code below.

In [None]:
sns.countplot(data=news_dataset,
              x='label',
              order=news_dataset['label'].value_counts().index)

In [10]:
# Merging 'title' and 'author' column in a new column 'body'
news_dataset['body']=news_dataset['author']+' '+news_dataset['title']


In [None]:
print(news_dataset['body'])

**Stemming :**

Stemming reduces a word to its root form by stripping suffixes.

For example:
teaching, teaches --> teach
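
To make the idea concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm used in this notebook (Porter applies several ordered rule steps); the suffix list and the minimum stem length are assumptions chosen purely for illustration.

```python
# Toy suffix stripper for illustration only; NOT the Porter algorithm.
SUFFIXES = ("ing", "es", "ed", "s")  # hypothetical list, checked in this order

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["teaching", "teaches", "teach"]:
    print(w, "->", naive_stem(w))
```

All three words map to the same token, which is the effect stemming is meant to have on the vocabulary before vectorization.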

In [12]:
port_stem =PorterStemmer()

In [13]:
# build the stopword set once; calling stopwords.words('english') for every word is slow
stop_words = set(stopwords.words('english'))

def stemming(body):
  # replace every character that is not a letter (digits, punctuation, etc.) with a space
  stemmed_body = re.sub(r'[^a-zA-Z]', ' ', body)
  # lowercase everything so that 'News' and 'news' are treated as the same word
  stemmed_body = stemmed_body.lower()
  # split into a list of words
  stemmed_body = stemmed_body.split()
  # stem every word that is not a stopword
  stemmed_body = [port_stem.stem(word) for word in stemmed_body if word not in stop_words]
  # join the stemmed words back into a single string
  stemmed_body = ' '.join(stemmed_body)
  return stemmed_body

In [14]:
# Applying stemming function to our dataset's body column
news_dataset['body']=news_dataset['body'].apply(stemming)

In [None]:
print(news_dataset['body'])

In [16]:
# Separating the data and label
X =news_dataset['body'].values
Y =news_dataset['label'].values


In [None]:
print(X)
print(Y)

Let’s visualize word clouds for real and fake news separately.

In [None]:
# Real
consolidated = ' '.join(
    word for word in news_dataset['body'][news_dataset['label'] == 0].astype(str))
wordCloud = WordCloud(width=1600,
                      height=800,
                      random_state=21,
                      max_font_size=110,
                      collocations=False)
plt.figure(figsize=(15, 10))
plt.imshow(wordCloud.generate(consolidated), interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
# Fake
consolidated = ' '.join(
    word for word in news_dataset['body'][news_dataset['label'] == 1].astype(str))
wordCloud = WordCloud(width=1600,
                      height=800,
                      random_state=21,
                      max_font_size=110,
                      collocations=False)
plt.figure(figsize=(15, 10))
plt.imshow(wordCloud.generate(consolidated), interpolation='bilinear')
plt.axis('off')
plt.show()

**What is a TfidfVectorizer?**

*TF (Term Frequency):* The number of times a word appears in a document is its term frequency. A higher value means the term appears more often in that document, so the document is a better match when the term is among the search terms.

*IDF (Inverse Document Frequency):* Words that occur many times in a document but also occur in many other documents carry little information. IDF measures how rare, and therefore how distinctive, a term is across the entire corpus.

The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.

TfidfVectorizer therefore weights each word count by how common the word is across all documents. This penalizes words that appear almost everywhere and highlights words that are distinctive for a particular document.
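
The weighting can be worked through by hand on a small example. The sketch below uses only the standard library and follows the formula scikit-learn documents for its defaults (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization). The three-document corpus is invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already stemmed and split on whitespace
docs = ["trump win elect", "clinton lose elect", "trump tweet"]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in tokenised for term in set(doc))

def idf(term):
    # Smoothed IDF, as documented for scikit-learn's default smooth_idf=True
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    counts = Counter(doc)                                 # term frequency
    raw = {t: c * idf(t) for t, c in counts.items()}      # tf * idf
    norm = math.sqrt(sum(v * v for v in raw.values()))    # L2 norm of the row
    return {t: v / norm for t, v in raw.items()}

weights = tfidf(tokenised[0])
for term, w in sorted(weights.items()):
    print(f"{term}: {w:.3f}")
```

In the first document, "win" appears in only one document of the corpus, so it receives a larger weight than "trump" and "elect", which each appear in two.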


In [18]:
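# Converting the textual data to numerical data using TfidfVectorizer
# Note: fitting on the full dataset lets test articles influence the vocabulary and IDF weights;
# fitting on the training split only would avoid that
vectorizer = TfidfVectorizer()
vectorizer.fit(X) # Y is not vectorized because it is already numerical
X = vectorizer.transform(X)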

In [None]:
# X after numerical vectorization
print(X)


In [20]:
# Splitting the dataset to training and test data
X_train,X_test,Y_train,Y_test =train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=2)


**Note**: In the above cell

*stratify=Y* ensures that the train and test sets keep the same proportion of 0s and 1s as the original dataset.

*random_state* sets the seed of the random number generator so that the split, and therefore the results, can be reproduced.
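
The following standard-library sketch illustrates both ideas. It is not scikit-learn's implementation; the function name `stratified_split` and the 70/30 label list are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx), sampling each label separately."""
    rng = random.Random(seed)            # fixed seed -> same shuffle every run
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])   # the same share of every label goes to test
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

labels = [0] * 70 + [1] * 30             # hypothetical 70/30 class balance
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```

The test set holds 14 zeros and 6 ones, the same 70/30 ratio as the full list, and calling the function again with the same seed returns the same indices.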

**Model Training : Logistic Regression**

In [None]:
model = LogisticRegression()
#training the model
model.fit(X_train,Y_train)

**Evaluation and Prediction**

In [23]:
# Predict training data
Y_train_pred = model.predict(X_train)
# accuracy score of training data
training_accuracy = accuracy_score(Y_train, Y_train_pred)  # signature is (y_true, y_pred)

In [None]:
print('Accuracy score of training data :',training_accuracy)

In [25]:
# Predict testing data
Y_test_pred = model.predict(X_test)
# accuracy score of test data
test_accuracy = accuracy_score(Y_test, Y_test_pred)  # signature is (y_true, y_pred)

In [None]:
print('Accuracy score of test data :',test_accuracy)

**Model Training : Decision Tree Classifier**

In [None]:
model = DecisionTreeClassifier()
#training the model
model.fit(X_train,Y_train)

**Evaluation and Prediction**


In [44]:
# Predicting training data
Y_train_pred=model.predict(X_train)
# accuracy score of training data
training_accuracy = accuracy_score(Y_train, Y_train_pred)  # signature is (y_true, y_pred)

In [45]:
print('Accuracy score of training data :',training_accuracy)

Accuracy score of training data : 1.0


In [46]:
# Predict testing data
Y_test_pred = model.predict(X_test)
# accuracy score of test data
test_accuracy = accuracy_score(Y_test, Y_test_pred)  # signature is (y_true, y_pred)

In [None]:
print('Accuracy score of test data :',test_accuracy)

The confusion matrix for the Decision Tree Classifier can be plotted with the code below.

In [None]:
# Confusion matrix of the Decision Tree Classifier's predictions on the test set
cm = metrics.confusion_matrix(Y_test, Y_test_pred)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                                            display_labels=['reliable (0)', 'unreliable (1)'])
cm_display.plot()
plt.show()
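
To read the plot, it helps to see how the four cells are counted and what can be derived from them. The sketch below does this by hand with the standard library; the two label lists are invented for illustration and are not results from this notebook.

```python
from collections import Counter

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # hypothetical ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # hypothetical model predictions

pairs = Counter(zip(y_true, y_pred))
tn, fp = pairs[(0, 0)], pairs[(0, 1)]
fn, tp = pairs[(1, 0)], pairs[(1, 1)]

# Same layout as sklearn: rows are true labels, columns are predicted labels
matrix = [[tn, fp], [fn, tp]]
print(matrix)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the articles flagged unreliable, the share that truly are
recall = tp / (tp + fn)      # of the truly unreliable articles, the share that were flagged
print(accuracy, precision, recall)
```

Accuracy alone hides which kind of mistake the model makes; precision and recall separate reliable articles flagged by mistake from unreliable articles that slipped through.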

**Making a predictive system**

In [None]:
# `model` now refers to the Decision Tree Classifier trained above
X_new = X_test[0]
prediction = model.predict(X_new)
print(prediction)
if prediction[0] == 0:
  print("The news is real")
else:
  print("The news is fake")

In [None]:
# Cross-verifying the above prediction against the true label
print(Y_test[0])
if Y_test[0] == 0:
  print("The news is real")
else:
  print("The news is fake")
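
The cells above classify a row that has already been vectorized. A raw headline would first have to pass through the same fitted vectorizer before `predict` is called. The sketch below shows that flow on a tiny invented corpus; the sentences, labels, and names such as `new_text` and `clf` are assumptions for illustration, and stemming is left out for brevity (in this notebook, new text would also go through `stemming` first).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented, already-cleaned toy corpus: 0 = reliable, 1 = unreliable
texts = [
    "senate passes budget bill",
    "central bank reports economic growth",
    "shocking secret cure doctors hate",
    "aliens secret shocking reveal",
]
labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer()
X_toy = vectorizer.fit_transform(texts)     # learn vocabulary and IDF weights
clf = LogisticRegression().fit(X_toy, labels)

new_text = ["shocking secret about aliens"]
X_new = vectorizer.transform(new_text)      # reuse the fitted vectorizer, do not refit
prediction = clf.predict(X_new)
print("unreliable" if prediction[0] == 1 else "reliable")
```

Words the vectorizer never saw during fitting (here "about") are simply ignored by `transform`, so the new row always has the same number of columns as the training matrix.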

**Conclusion**

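Both the *Logistic Regression* and the *Decision Tree Classifier* models perform well on this dataset. Because the Decision Tree reaches a training accuracy of 1.0, which means it fits the training set perfectly, its test accuracy is the more meaningful measure of how well it generalizes.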