# <u>Machine Learning Project - Fake News Prediction System Using NLP Techniques and Logistic Regression</u>

### <i>In this project, we build a fake news prediction system in Python. We apply some NLP techniques (stop-word removal, stemming, and TF-IDF vectorization) to the text data and use a Logistic Regression model for prediction.</i>

## Importing the Dependencies

In [2]:
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer


## Downloading Stopwords for Preprocessing Step 

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\engyo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Loading Data

In [4]:
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')


## Get information about our Data

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [6]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5200 entries, 0 to 5199
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      5200 non-null   int64 
 1   title   5078 non-null   object
 2   author  4697 non-null   object
 3   text    5193 non-null   object
dtypes: int64(1), object(3)
memory usage: 162.6+ KB


## Viewing the First 5 rows in our Data

In [7]:
train.head()

| | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


#  <u> Preprocessing :- </u>

## Checking Missing values

In [8]:
train.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [9]:
test.isnull().sum()

id          0
title     122
author    503
text        7
dtype: int64

## Handling Missing values

In [10]:
train=train.fillna("")
test=test.fillna('')

## Merging title and author into a Content column

In [11]:
train["Content"]=train["title"]+train["author"]
test['Content']=test['title']+test["author"]

In [12]:
print(train['Content'].head())
print(50*'*')
print(test['Content'].head())

0    House Dem Aide: We Didn’t Even See Comey’s Let...
1    FLYNN: Hillary Clinton, Big Woman on Campus - ...
2    Why the Truth Might Get You FiredConsortiumnew...
3    15 Civilians Killed In Single US Airstrike Hav...
4    Iranian woman jailed for fictional unpublished...
Name: Content, dtype: object
**************************************************
0    Specter of Trump Loosens Tongues, if Not Purse...
1    Russian warships ready to strike terrorists ne...
2    #NoDAPL: Native American Leaders Vow to Stay A...
3    Tim Tebow Will Attempt Another Comeback, This ...
4    Keiser Report: Meme Wars (E995)Truth Broadcast...
Name: Content, dtype: object
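
The concatenation above joins the two columns with no separator, which is why row 2 of the training output reads `FiredConsortiumnew...`: the last word of the title and the author name fuse into a single token. A minimal sketch of the same merge with a space in between, on a toy frame (the two rows are stand-ins for illustration, not the real data):

```python
import pandas as pd

# Toy stand-in for the real train frame; the second row has a missing author.
df = pd.DataFrame({
    "title": ["Why the Truth Might Get You Fired", "Some Headline"],
    "author": ["Consortiumnews.com", None],
})

# String + NaN yields NaN, so missing values are filled with "" first.
df = df.fillna("")

# Join with a space so the title's last word and the author stay separate tokens;
# str.strip() removes the dangling space when either field is empty.
df["Content"] = (df["title"] + " " + df["author"]).str.strip()

print(df["Content"].tolist())
```

Whether the separator helps the model here is an assumption worth checking by re-running the notebook; it mainly keeps author names from being glued to headline words.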


## Separating Features and target 

In [13]:
X=train.drop(columns="label")
Y=train["label"]

# Stemming

### Stemming: reducing a word to its root form
<i> Ex: acting, acted, acts → act. </i>

In [14]:
port_stem=PorterStemmer()

## Creating Stemming Function

In [15]:
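# Build the stop-word set once; calling stopwords.words('english') for every word is slow.
stop_words=set(stopwords.words('english'))

def Stemming (content):
    # NOTE: '^a-zA-Z' is not a character class (that would be '[^a-zA-Z]'); it only matches the
    # literal text "a-zA-Z" at the start of a string, so punctuation and digits are kept.
    stemmed_content=re.sub('^a-zA-Z'," ",content)
    stemmed_content=stemmed_content.lower()
    stemmed_content=stemmed_content.split()
    # Drop English stop words and stem the remaining tokens.
    stemmed_content=[port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    stemmed_content=" ".join(stemmed_content)
    return stemmed_content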
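
The pattern `'^a-zA-Z'` in the function above has no square brackets, so it does not strip non-letters. The following sketch compares it with the bracketed form `'[^a-zA-Z]'` on a sample string (the sample is illustrative, loosely based on the first headline):

```python
import re

sample = "House Dem Aide: We Didn't Even See Comey's Letter (E995)"

# Without brackets, '^a-zA-Z' means: the literal text "a-zA-Z" at the start of the string.
unchanged = re.sub('^a-zA-Z', ' ', sample)

# With brackets, '[^a-zA-Z]' is a negated character class: any single non-letter character.
letters_only = re.sub('[^a-zA-Z]', ' ', sample)

print(unchanged == sample)           # the unbracketed pattern left the text as it was
print(letters_only.lower().split())  # tokens after replacing every non-letter with a space
```

Note that the bracketed pattern also splits contractions at the apostrophe, so switching to it would change the tokens, the vocabulary, and therefore the accuracy figures reported later in this notebook.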

## Applying Stemming Function 

In [16]:
train["Content"]=train['Content'].apply(Stemming)
test["Content"]=test['Content'].apply(Stemming)

In [17]:
print(train["Content"].head())
print(50*'*')
print(test["Content"].head())

0    hous dem aide: didn’t even see comey’ letter j...
1    flynn: hillari clinton, big woman campu - brei...
2              truth might get firedconsortiumnews.com
3    15 civilian kill singl us airstrik identifiedj...
4    iranian woman jail fiction unpublish stori wom...
Name: Content, dtype: object
**************************************************
0    specter trump loosen tongues, purs strings, si...
1    russian warship readi strike terrorist near al...
2    #nodapl: nativ american leader vow stay winter...
3    tim tebow attempt anoth comeback, time basebal...
4    keiser report: meme war (e995)truth broadcast ...
Name: Content, dtype: object


## Separating Features and target 

In [18]:
X=train['Content'].values
Y=train['label'].values

## Applying TF-IDF Vectorizer

In [19]:
vectorizer=TfidfVectorizer()
vectorizer.fit(X)
X=vectorizer.transform(X)
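
To make the result of this step more concrete, here is a small sketch of what `TfidfVectorizer` produces. The three documents are toy stand-ins for the stemmed `Content` strings; the real matrix has one row per article and one column per distinct term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy documents standing in for the stemmed Content column.
docs = ["trump letter", "letter news", "news report"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # same as fit(docs) followed by transform(docs)

print(sorted(vectorizer.vocabulary_))  # the learned terms
print(matrix.shape)                    # (number of documents, number of terms)
print(matrix.toarray().round(2))       # dense view of the sparse TF-IDF weights
```

The result is a sparse matrix, which is why the notebook can pass it directly to `train_test_split` and `LogisticRegression` without converting it to a dense array.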

## Split data into train and test

In [20]:
x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.3,random_state=3)
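
In the cells above, the vectorizer is fitted on all of `X` before the split, so the vocabulary and IDF weights also reflect the held-out rows. One common alternative, sketched here with scikit-learn's `make_pipeline`, is to split the raw text first and fit the vectorizer on the training part only. The texts and labels below are made up for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for train['Content'] and train['label'].
texts = [
    "shock claim alien land capit", "senat pass budget bill",
    "secret cure doctor hate", "court rule tax case",
    "celebr clone reveal insid", "mayor open new bridg",
    "miracl diet melt fat", "team win final match",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw strings first, so the test rows never reach the vectorizer's fit step.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=3, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the classifier.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts_train, y_train)

# predict() applies the already-fitted vectorizer to unseen text before classifying.
preds = pipeline.predict(texts_test)
print(len(preds))
```

On a dataset of this size the effect on the reported accuracy may be small, but the pipeline form keeps the test split strictly unseen during fitting.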

#  <u> Applying Logistic Regression Algorithm :-  </u>

In [21]:
model=LogisticRegression()

#### Training our model 

In [22]:
model.fit(x_train,y_train)

In [23]:
y_pred=model.predict(x_train)


## Computing Training Accuracy

In [24]:
accuracy=accuracy_score(y_train,y_pred)
print("training Accuracy :- ",accuracy*100)

training Accuracy :-  98.32417582417582


#### Testing our Model

In [25]:
y_pred=model.predict(x_test)

## Computing Testing Accuracy

In [26]:
accuracy=accuracy_score(y_test,y_pred)
print("testing Accuracy :- ",accuracy*100)

testing Accuracy :-  95.88141025641026
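
Accuracy is the fraction of predictions that match the true labels. The sketch below computes it by hand on four toy labels and compares the result with `accuracy_score`, whose documented argument order is `(y_true, y_pred)`. The label values are invented for the example.

```python
from sklearn.metrics import accuracy_score

# Toy labels: three of the four predictions are correct.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

# Manual accuracy: count matching positions and divide by the number of samples.
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(manual)
print(accuracy_score(y_true, y_pred))
```

For plain accuracy the value is the same if the two arguments are swapped, but other metrics such as precision and recall are not symmetric, so keeping the `(y_true, y_pred)` order is the safer habit.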


## Making predictions on test data

In [27]:
X=test['Content'].values
X = vectorizer.transform(X)
predictions=model.predict(X)
predictions

array([0, 1, 1, ..., 0, 1, 0], dtype=int64)
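
The bare array is hard to read on its own. One possible way to inspect it, sketched here with stand-in values, is to pair each prediction with its `id` in a DataFrame. The ids and predictions below are hypothetical placeholders for `test['id']` and the `predictions` array above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for test['id'] and the predictions array.
ids = pd.Series([101, 102, 103], name="id")
predictions = np.array([0, 1, 1])

# One row per article: its id and the predicted label.
result = pd.DataFrame({"id": ids, "label": predictions})

print(result)
print(result["label"].value_counts().to_dict())  # how many articles fall in each class
```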