
In [3]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [4]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [5]:
#printing stopwords in english
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Data pre-processing

In [6]:
#loading the dataset to a pandas DataFrame
news_dataset=pd.read_csv('/content/train.csv')

In [7]:
news_dataset.shape


(4986, 2)

In [8]:
#print the first 5 rows of the dataframe (the default for head())
news_dataset.head()

Unnamed: 0,text,label
0,Get the latest from TODAY Sign up for our news...,1
1,2d Conan On The Funeral Trump Will Be Invited...,1
2,It’s safe to say that Instagram Stories has fa...,0
3,Much like a certain Amazon goddess with a lass...,0
4,At a time when the perfect outfit is just one ...,0


In [9]:
#counting the number of missing values in each column
news_dataset.isnull().sum()  # a count of 0 means the column has no missing values

Unnamed: 0,0
text,0
label,0


In [10]:
#replacing the null values with empty string
news_dataset=news_dataset.fillna('')

In [11]:
news_dataset['content']=news_dataset['text']

In [12]:
print(news_dataset['content'])

0       Get the latest from TODAY Sign up for our news...
1       2d  Conan On The Funeral Trump Will Be Invited...
2       It’s safe to say that Instagram Stories has fa...
3       Much like a certain Amazon goddess with a lass...
4       At a time when the perfect outfit is just one ...
                              ...                        
4981    The storybook romance of WWE stars John Cena a...
4982    The actor told friends he’s responsible for en...
4983    Sarah Hyland is getting real.  The Modern Fami...
4984    Production has been suspended on the sixth and...
4985    A jury ruled against Bill Cosby in his sexual ...
Name: content, Length: 4986, dtype: object


In [13]:
#separating the data and label using drop
X=news_dataset.drop(columns='label') # columns= already selects the axis, so axis=1 is not needed
Y=news_dataset['label']

In [14]:
print(X)
print(Y)

                                                   text  \
0     Get the latest from TODAY Sign up for our news...   
1     2d  Conan On The Funeral Trump Will Be Invited...   
2     It’s safe to say that Instagram Stories has fa...   
3     Much like a certain Amazon goddess with a lass...   
4     At a time when the perfect outfit is just one ...   
...                                                 ...   
4981  The storybook romance of WWE stars John Cena a...   
4982  The actor told friends he’s responsible for en...   
4983  Sarah Hyland is getting real.  The Modern Fami...   
4984  Production has been suspended on the sixth and...   
4985  A jury ruled against Bill Cosby in his sexual ...   

                                                content  
0     Get the latest from TODAY Sign up for our news...  
1     2d  Conan On The Funeral Trump Will Be Invited...  
2     It’s safe to say that Instagram Stories has fa...  
3     Much like a certain Amazon goddess with a lass...  
4

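Stemming: the process of reducing a word to its root form, e.g. acting, acted, acts -----> act. The Porter stemmer used here only strips suffixes, so it does not merge every related word: "actor" stays "actor", as the stemmed output further down shows (e.g. "actor told friend").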
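A quick check of what the Porter stemmer does to a few word forms can make this concrete. This is a minimal sketch; the word list is an illustrative assumption and is not taken from the dataset.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words only (assumption, not from the dataset).
words = ["acting", "acted", "acts", "actor"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The inflected verb forms collapse to one stem, while "actor" is left alone because Porter has no rule for the "-or" suffix.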

In [15]:
port_stem=PorterStemmer()

In [16]:
stop_words=set(stopwords.words('english')) # build the stopword set once instead of reloading the list for every word

def stemming(content): # content is the raw text of one article
  stemmed_content=re.sub('[^a-zA-Z]',' ',content) # replace every character that is not a letter (digits, punctuation, etc.) with a space
  stemmed_content=stemmed_content.lower() # convert all letters to lowercase
  stemmed_content=stemmed_content.split() # split the text into a list of words
  stemmed_content=[port_stem.stem(word) for word in stemmed_content if word not in stop_words] # drop stopwords and stem the remaining words
  stemmed_content=' '.join(stemmed_content) # join the stems back into a single string
  return stemmed_content

In [17]:
news_dataset['content']=news_dataset['content'].apply(stemming)

In [18]:
print(news_dataset['content'])

0       get latest today sign newslett one ever truli ...
1                        conan funer trump invit conan tb
2       safe say instagram stori far surpass competito...
3       much like certain amazon goddess lasso height ...
4       time perfect outfit one click away high demand...
                              ...                        
4981    storybook romanc wwe star john cena nikki bell...
4982    actor told friend respons encourag brad reigni...
4983    sarah hyland get real modern famili star took ...
4984    product suspend sixth final season netflix hou...
4985    juri rule bill cosbi sexual assault retrial th...
Name: content, Length: 4986, dtype: object


In [19]:
#separating the data and label
X=news_dataset['content'].values
Y=news_dataset['label'] # assignment, not subtraction; kept as a pandas Series so that Y_test.iloc[...] works later


In [20]:
print(X)

['get latest today sign newslett one ever truli get lose love one blake shelton except older brother richi die nov shelton note tweet monday chang life forev richi die car accid shelton home state oklahoma two year ago shelton sent messag th anniversari loss richi blake half brother share mother passeng car collid school bu ada south oklahoma citi richi driver redena mcmanu year old boy christoph mcmanu die shortli collis bu driver passeng uninjur accord polic report accid clearli remain blake told minut rememb pick phone call week dead tell someth pick phone call tell someth saw tv like constantli shock dead blake shelton play today halloween extravaganza new york citi oct getti imag blake wife miranda lambert wrote singl call inspir richi still two brother bond despit age differ share love countri music bedroom right across hallway mine littl blake said interview listen hank william jr waylon lynyrd skynyrd bob seeger whatev popular realli richi love music would sit go man guy hero c

In [21]:
print(Y)

0       1
1       1
2       0
3       0
4       0
       ..
4981    0
4982    0
4983    0
4984    0
4985    0
Name: label, Length: 4986, dtype: int64


In [22]:
Y.shape

(4986,)

In [23]:
#converting the textual data to numerical data
# TF (term frequency) counts how often a word occurs in a document; a word that is repeated often gets a larger weight.
# IDF (inverse document frequency) lowers the weight of words that occur in many documents, since such words say little about any single one.
vectorizer=TfidfVectorizer()
vectorizer.fit(X)

X=vectorizer.transform(X)
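To see how the TF and IDF parts combine, the weights can be computed by hand on a toy corpus. The sketch below assumes scikit-learn's documented defaults (raw counts for TF, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row); the three documents and the helper names `idf` and `tfidf_row` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, already cleaned and stemmed).
docs = ["trump invit conan", "conan funer", "star wed star"]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed IDF, matching scikit-learn's default smooth_idf=True.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_row(tokens):
    # Raw term counts times IDF, then scaled to unit L2 length.
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / norm for term, v in raw.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, {term: round(weight, 3) for term, weight in row.items()})
```

In the first document "conan" gets a lower weight than "trump" because it also occurs in the second document; in the third, "star" outweighs "wed" because it occurs twice.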

In [24]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 970346 stored elements and shape (4986, 43410)>
  Coords	Values
  (0, 173)	0.1248855332702942
  (0, 185)	0.031190974908102943
  (0, 255)	0.04819290555028625
  (0, 279)	0.09230909714107587
  (0, 527)	0.04197433923817173
  (0, 561)	0.04199869752238838
  (0, 1431)	0.05679635059999219
  (0, 3150)	0.0651905490390564
  (0, 3660)	0.03446435163729747
  (0, 3899)	0.35814418246694796
  (0, 4111)	0.061099763005242426
  (0, 4235)	0.05734174912822063
  (0, 4469)	0.043913097851338366
  (0, 4920)	0.17300809278001672
  (0, 5003)	0.13853321154871961
  (0, 5492)	0.08636725474642402
  (0, 5727)	0.09742842544431664
  (0, 6369)	0.03651391326763337
  (0, 6864)	0.061642552767868085
  (0, 7014)	0.08011018205445468
  (0, 7138)	0.05184599360401606
  (0, 7463)	0.08521999758076429
  (0, 7466)	0.08744902989167297
  (0, 7832)	0.061532162462976187
  (0, 7948)	0.08666268110592094
  :	:
  (4985, 38550)	0.013376247589991572
  (4985, 38816)	0.0111828745730335

Splitting the dataset into training and test data

In [25]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=2) # X is the vectorized content and Y the labels. stratify=Y keeps the proportion of real and fake news the same in the training and test sets; without it the class proportions of the two sets can differ.
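The effect of `stratify` is easiest to see on an imbalanced toy label vector. This is a small sketch with made-up data (80 samples of class 0, 20 of class 1) and placeholder features, not the news dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 of class 0, 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# With stratify=y, the class-1 share is preserved in both splits.
print("class-1 share overall:", y.mean())
print("class-1 share in train:", y_tr.mean())
print("class-1 share in test:", y_te.mean())
```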

Training the model : Logistic regression

In [27]:
model=LogisticRegression()

In [28]:
model.fit(X_train,Y_train)

Evaluation
accuracy score

In [29]:
#accuracy score on the training data
X_train_prediction=model.predict(X_train)
training_data_accuracy=accuracy_score(Y_train,X_train_prediction) # scikit-learn's argument order is (y_true, y_pred)

In [30]:
print('Accuracy score of training data: ', training_data_accuracy)

Accuracy score of training data:  0.8538114343029087


In [31]:
#accuracy score on the test data
X_test_prediction=model.predict(X_test)
test_data_accuracy=accuracy_score(Y_test,X_test_prediction) # scikit-learn's argument order is (y_true, y_pred)

In [32]:
print('Accuracy score of test data: ', test_data_accuracy)

Accuracy score of test data:  0.7845691382765531
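Accuracy is simply the fraction of predictions that match the true labels. The sketch below re-implements it in plain Python on made-up labels to show what the reported numbers (about 0.85 on training data, about 0.78 on test data) measure; the function name `accuracy` and the sample lists are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("accuracy is undefined for empty input")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1]

score = accuracy(y_true, y_pred)
print(score)
```

Because the metric only counts matches, swapping the two arguments gives the same value, which is why the score does not depend on the order in which predictions and labels are passed.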


Making a predictive system

In [41]:
X_new =X_test[1] # X_new is the second row of the test set (index 1); X_test[0] would be the first row

prediction=model.predict(X_new)
print(prediction)

if (prediction[0]==0):
  print('The news is Fake')
else:
  print('The news is Real')

[0]
The news is Fake


In [42]:
print(Y_test.iloc[1])

0
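The cell above predicts on a row that has already been vectorized. For unseen raw text, the same cleaning and the same fitted vectorizer would have to be applied before calling `predict`. A minimal sketch of that idea follows, assuming a hypothetical helper `predict_text` and a toy four-document corpus; `clean` is a simplified stand-in for `stemming` (it skips stopword removal and stemming), and the `toy_` objects are not the notebook's fitted model.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def clean(text):
    # Simplified stand-in for the notebook's stemming(): letters only, lowercase.
    return " ".join(re.sub("[^a-zA-Z]", " ", text).lower().split())

def predict_text(raw_text, preprocess, vectorizer, model):
    # Hypothetical helper: apply the training-time preprocessing, then predict.
    features = vectorizer.transform([preprocess(raw_text)])
    return int(model.predict(features)[0])

# Toy corpus and labels, for illustration only.
texts = ["alpha gamma", "alpha delta", "beta gamma", "beta delta"]
labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform([clean(t) for t in texts]), labels)

print(predict_text("Alpha!!", clean, toy_vectorizer, toy_model))
print(predict_text("BETA 123", clean, toy_vectorizer, toy_model))
```

With the notebook's own objects, the call would presumably look like `predict_text(some_article, stemming, vectorizer, model)`. Note that the vectorizer is only transformed, never refitted, at prediction time.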
