# Problem Statement:
### With the spread of social media, our country is fighting a fake news pandemic. A group of volunteers took up the challenge and compiled a list of past real and fake news items. Your task is to build an AI model that can predict whether a news item is real or fake before it goes viral on social media.


## Gaining insights from the data

### Importing all the necessary libraries

In [1]:
# Regular EDA and plotting libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# models from sklearn

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [4]:
# model evaluation

from sklearn.metrics import precision_score, recall_score, f1_score
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay

import nltk
import re

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ksuma\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Reading our train dataset

In [5]:
df=pd.read_csv("train.csv")
df.head()

# DISPLAYING THE FIRST FIVE ROWS OF TRAIN DATASET

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [7]:
# CHECKING THE TOTAL NUMBER OF ROWS AND COLUMNS IN THE DATASET

df.shape

(20800, 5)

In [8]:
# Checking the information (dtypes, non-null counts) of each column

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [9]:
# Calculating the total number of null values in each column

df.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

### *The target variable of our dataset is `label`, which is 1 for fake news and 0 for real news*
#### Let's calculate the following:

1. Total percentage of 1 and 0 in the label column
2. Label values for rows with a null author
3. Label values for rows with an empty title or text

In [12]:
# There is 50.06% fake news (label 1) and 49.94% real news (label 0)
df.label.value_counts(normalize=True)

1    0.500625
0    0.499375
Name: label, dtype: float64

In [13]:
author_empty=df[df["author"].isnull()]
print("Shape of empty author values is:{}".format(author_empty.shape))

Shape of empty author values is:(1957, 5)


In [14]:
# 98.7% "1" (fake) values and 1.3% "0" (real) values
author_empty["label"].value_counts(normalize=True)

1    0.986714
0    0.013286
Name: label, dtype: float64

In [15]:
text_empty=df[df["text"].isnull()]
print("Shape of empty text values is:{}".format(text_empty.shape))

Shape of empty text values is:(39, 5)


In [16]:
# Every news item with no text is fake news
text_empty["label"].value_counts(normalize=True)

1    1.0
Name: label, dtype: float64

In [19]:
title_empty=df[df["title"].isnull()]
print("Shape of empty title values is:{}".format(title_empty.shape))

Shape of empty title values is:(558, 5)


In [20]:
# Every news item with no title is fake news
title_empty["label"].value_counts(normalize=True)

1    1.0
Name: label, dtype: float64

In [22]:
df.set_index('id',inplace=True)

In [23]:
# count articles per author (column naming after value_counts differs across pandas versions,
# so filter the Series directly instead of converting it to a frame)
author_counts = df["author"].value_counts()
# keep only authors with more than 100 articles
main_authors = author_counts[author_counts > 100].index
authors = df[df["author"].isin(main_authors)]
authors["label"].value_counts(normalize=True)

0    0.897895
1    0.102105
Name: label, dtype: float64

### *Insights from the above are as follows:*
1. 98.7% of the news items with a missing author are fake news.
2. Every news item with a missing title or missing text is fake news.
3. Among authors with more than 100 publications, 89.8% of the items are real news and 10.2% are fake news.
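
The missing-value checks behind these insights can be wrapped in one small helper. The following is a minimal sketch on a toy frame; the helper name `fake_share_when_missing` and the toy rows are assumptions for illustration, not part of the original notebook or data.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical rows, not the real data)
toy = pd.DataFrame({
    "title":  ["a", None, "c", "d"],
    "author": ["x", None, None, None],
    "text":   ["t1", "t2", None, "t4"],
    "label":  [0, 1, 1, 0],
})

def fake_share_when_missing(frame, columns, target="label"):
    """For each column, return the share of target == 1 among rows where it is null."""
    shares = {}
    for col in columns:
        missing = frame[frame[col].isnull()]
        # mean of a 0/1 label equals the fraction of 1s
        shares[col] = missing[target].mean() if len(missing) else float("nan")
    return pd.Series(shares, name="fake_share")

result = fake_share_when_missing(toy, ["title", "author", "text"])
print(result)
```

On the real data, calling the helper with `df` in place of `toy` should give the same proportions as the separate `value_counts(normalize=True)` cells.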

# Data Modelling

In [24]:
dfa=df.copy()

In [26]:
# keep=False drops every copy of a duplicated row, not just the repeats
df1 = df.drop_duplicates(keep=False)

In [30]:
print("Removing duplicate rows took the main dataset from {} rows to {} rows.".format(df.shape[0],df1.shape[0]))

Removing duplicate rows took the main dataset from 20800 rows to 20603 rows.
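
The `keep` argument decides how many copies of a duplicated row survive. A small sketch on a toy frame (hypothetical data) shows the difference between `keep=False`, as used here, and the pandas default `keep="first"`:

```python
import pandas as pd

# Toy frame with one duplicated row (hypothetical data)
toy = pd.DataFrame({"title": ["a", "a", "b"], "label": [1, 1, 0]})

dropped_all = toy.drop_duplicates(keep=False)    # removes both copies of "a"
kept_first = toy.drop_duplicates(keep="first")   # keeps one copy of "a"

print(len(dropped_all), len(kept_first))
```

With `keep=False` no trace of a duplicated article remains in the training data, which is the stricter of the two choices.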


In [39]:
# reassign instead of using inplace=True to avoid modifying a derived frame in place
df1 = df1.dropna()

In [40]:
print("After removing the null values we are now left with {} rows".format(df1.shape[0]))

After removing the null values we are now left with 18197 rows


In [41]:
dfa=df1.copy()

In [42]:
dfa.isnull().sum()

title     0
author    0
text      0
label     0
dtype: int64

## Data Training and testing begins...

In [43]:
X=dfa.drop('label',axis=1)

In [44]:
X.head()

id,title,author,text
0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...
2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ..."
3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...
4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...


In [48]:
y=dfa['label']

In [49]:
y.head()

id
0    1
1    0
2    1
3    1
4    1
Name: label, dtype: int64

In [47]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer

In [50]:
dfa.shape

(18197, 4)

In [60]:
dfa.reset_index(inplace=True)

In [61]:
dfa.head(10)

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1
5,5,Jackie Mason: Hollywood Would Love Trump if He...,Daniel Nussbaum,"In these trying times, Jackie Mason is the Voi...",0
6,7,Benoît Hamon Wins French Socialist Party’s Pre...,Alissa J. Rubin,"PARIS — France chose an idealistic, traditi...",0
7,9,"A Back-Channel Plan for Ukraine and Russia, Co...",Megan Twohey and Scott Shane,A week before Michael T. Flynn resigned as nat...,0
8,10,Obama’s Organizing for Action Partners with So...,Aaron Klein,"Organizing for Action, the activist group that...",0
9,11,"BBC Comedy Sketch ""Real Housewives of ISIS"" Ca...",Chris Tomlinson,The BBC produced spoof on the “Real Housewives...,0


In [62]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
corpus = []
import re

In [63]:
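# build the stopword set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))

for i in range(len(dfa)):
    # keep letters only, lowercase, and split on whitespace
    review = re.sub('[^a-zA-Z]', ' ', dfa['title'][i])
    review = review.lower()
    review = review.split()

    # drop stopwords and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)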
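
To see what the cleaning loop does to a single headline, here is a standard-library sketch. It makes two loud assumptions: a tiny hand-written stopword set stands in for NLTK's English list, and the Porter stemming step is left out, so the output is not identical to what the loop above produces. The function name `clean_title` is hypothetical.

```python
import re

# Hypothetical mini stopword list; the notebook uses nltk's stopwords.words('english')
STOP_WORDS = {"the", "in", "of", "a", "to", "you", "we", "t"}

def clean_title(title):
    """Keep letters only, lowercase, split on whitespace, and drop stopwords."""
    letters_only = re.sub('[^a-zA-Z]', ' ', title)
    tokens = letters_only.lower().split()
    return ' '.join(t for t in tokens if t not in STOP_WORDS)

cleaned = clean_title("Why the Truth Might Get You Fired")
print(cleaned)
```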

In [64]:
## Applying Countvectorizer
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000,ngram_range=(1,3))
X = cv.fit_transform(corpus).toarray()

In [65]:
X.shape

(18197, 5000)
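
The `ngram_range=(1,3)` setting means each column counts a single word, a pair of adjacent words, or a triple of adjacent words. A minimal sketch on a two-title toy corpus (hypothetical strings, no `max_features` cap) makes the resulting vocabulary visible:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two already-cleaned toy titles (hypothetical)
toy_corpus = ["truth might get fired", "truth get fired"]

cv_demo = CountVectorizer(ngram_range=(1, 3))
X_demo = cv_demo.fit_transform(toy_corpus)   # sparse matrix of n-gram counts

vocab = sorted(cv_demo.vocabulary_)          # every unigram, bigram and trigram seen
print(vocab)
print(X_demo.shape)
```

In the notebook, `max_features=5000` then keeps only the 5000 most frequent of these n-grams, which is why `X` has 5000 columns.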

In [66]:
# seed NumPy's global RNG so the shuffle in train_test_split is reproducible
np.random.seed(60)

# split into train and test sets (80% / 20%)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
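
An alternative to seeding the global RNG is to pass `random_state` directly, optionally with `stratify` to preserve the class balance in both splits. This is a sketch on synthetic arrays, not the split used in this notebook, so it would not reproduce the exact scores reported below.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y (20 samples, balanced labels)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=60,    # local seed, independent of np.random.seed
    stratify=y_demo,    # keep the 50/50 label ratio in both splits
)
print(X_tr.shape, X_te.shape, y_te.mean())
```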

In [67]:
X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [68]:
y_train

id
6609     0
4251     0
6631     0
12893    1
14494    0
        ..
7502     0
20356    0
13538    0
3630     1
2597     0
Name: label, Length: 14557, dtype: int64

In [69]:
# put models in a dictionary
models ={"Logistic Regression":LogisticRegression(),
          "KNN": KNeighborsClassifier(),
          "Random Forest":RandomForestClassifier()}

In [70]:
# create a function to fit and score models
def fit_and_score(models,X_train,X_test,y_train,y_test):
    # set random seed
    np.random.seed(60)
    # make a dictionary to keep model scores
    model_scores = {}
    # loop through models
    for name,model in models.items():
        # fit the model to the data
        model.fit(X_train,y_train)
        # evaluate the model and append its score to model_scores
        model_scores[name] = model.score(X_test,y_test)
    return model_scores  

In [71]:
model_scores = fit_and_score(models=models,
                             X_train=X_train,
                             X_test=X_test,
                             y_train=y_train,
                             y_test= y_test)
model_scores

{'Logistic Regression': 0.9321428571428572,
 'KNN': 0.7917582417582417,
 'Random Forest': 0.9381868131868132}
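
`model.score` reports accuracy only. The metric functions imported at the top of the notebook (`confusion_matrix`, `precision_score`, `recall_score`, `f1_score`) give a fuller picture of how fake news (label 1) is handled. A minimal sketch on hand-made labels (hypothetical values, not model output):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true labels and predictions (1 = fake, 0 = real)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)       # rows: true class, columns: predicted class
precision = precision_score(y_true, y_pred) # of predicted fakes, how many are fake
recall = recall_score(y_true, y_pred)       # of actual fakes, how many were caught
f1 = f1_score(y_true, y_pred)               # harmonic mean of precision and recall

print(cm)
print(precision, recall, f1)
```

For a fitted model from the `models` dictionary, `y_pred` would come from `model.predict(X_test)` and `y_true` from `y_test`.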

## HYPERPARAMETER TUNING WITH RandomizedSearchCV

We're going to tune:
   * LogisticRegression()
   * RandomForestClassifier()

...using RandomizedSearchCV

In [72]:
# create hyperparameter grid for LogisticRegression
log_reg_grid ={"C": np.logspace(-4,4,20),
               "solver":["liblinear"]}

In [73]:
# create hyperparameter grid for RandomForestClassifier
rf_grid={"n_estimators": np.arange(10,1000,50),
          "max_depth":[None,3,5,10],
          "min_samples_split":np.arange(2,20,2),
           "min_samples_leaf":np.arange(1,20,2)}

In [74]:
# tune LogisticRegression
np.random.seed(42)

# set up random hyperparameter search for LogisticRegression
rs_log_reg =RandomizedSearchCV(estimator=LogisticRegression(),
                              param_distributions=log_reg_grid,
                              cv=5,
                              n_iter=20,
                              verbose= True)
# fit the random hyperparameter search for LogisticRegression
rs_log_reg.fit(X_train,y_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


RandomizedSearchCV(cv=5, estimator=LogisticRegression(), n_iter=20,
                   param_distributions={'C': array([1.00000000e-04, 2.63665090e-04, 6.95192796e-04, 1.83298071e-03,
       4.83293024e-03, 1.27427499e-02, 3.35981829e-02, 8.85866790e-02,
       2.33572147e-01, 6.15848211e-01, 1.62377674e+00, 4.28133240e+00,
       1.12883789e+01, 2.97635144e+01, 7.84759970e+01, 2.06913808e+02,
       5.45559478e+02, 1.43844989e+03, 3.79269019e+03, 1.00000000e+04]),
                                        'solver': ['liblinear']},
                   verbose=True)

In [75]:
rs_log_reg.best_params_

{'solver': 'liblinear', 'C': 1.623776739188721}
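
The selected `C` is one of the 20 log-spaced candidates in `log_reg_grid`. Since `n_iter=20` matches the grid size, the random search tries every candidate. A short sketch of how `np.logspace` lays out those values:

```python
import numpy as np

# 20 values spaced evenly on a log10 scale between 10**-4 and 10**4
c_values = np.logspace(-4, 4, 20)

print(len(c_values))
print(c_values[0], c_values[-1])
print(c_values[10])   # the candidate matching best_params_["C"]
```

Each step multiplies `C` by a constant factor of `10**(8/19)`, so the search covers weak and strong regularization evenly on a log scale.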

In [76]:
rs_log_reg.score(X_test,y_test)

0.9324175824175824