**Description** :
Develop a machine learning program to identify when an article might be fake news.

1 represents fake news


0 represents not fake

Dataset source: https://www.kaggle.com/c/fake-news/data?select=train.csv

**Importing the necessary libraries**

In [93]:
import pandas as pd
import numpy as np
import re #for searching words in a text
from nltk.corpus import stopwords #NLTK = Natural Language Toolkit; stop words are very common words that carry little meaning, such as "the" and "and"
from nltk.stem.porter import PorterStemmer #reduces a word to its stem by stripping common suffixes
from sklearn.feature_extraction.text import TfidfVectorizer #converting the text into feature vectors
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [94]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [95]:
print(stopwords.words('english')) #the stop words NLTK provides for the English language

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

**Data Preprocessing**

In [96]:
#loading the dataset
df = pd.read_csv('/content/train.csv',nrows=10953)
df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [97]:
df.isna().sum()

id           0
title      300
author    1038
text        23
label        0
dtype: int64

In [98]:
#replacing null values with empty string
df.fillna('',inplace=True)

In [99]:
#merging the author name and news title
df['content'] = df['title']+' '+df['author']

In [100]:
print(df['content'])

0        House Dem Aide: We Didn’t Even See Comey’s Let...
1        FLYNN: Hillary Clinton, Big Woman on Campus - ...
2        Why the Truth Might Get You Fired Consortiumne...
3        15 Civilians Killed In Single US Airstrike Hav...
4        Iranian woman jailed for fictional unpublished...
                               ...                        
10948    U.S., in Reversal, Issues Permit for Keystone ...
10949    University of Chicago: Why Milo Scares Student...
10950    Migrants FLOOD Into U.S. From Mexico Right Bef...
10951    Twitter Temporarily Censored Drudge Report Twe...
10952    Trophy Hunting Fees Do Little to Help Threaten...
Name: content, Length: 10953, dtype: object


In [101]:
#Separate the features from the target label
X=df.drop(columns='label')
y=df['label']
print(X)
print(y)

          id                                              title  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
10948  10948  U.S., in Reversal, Issues Permit for Keystone ...   
10949  10949  University of Chicago: Why Milo Scares Student...   
10950  10950  Migrants FLOOD Into U.S. From Mexico Right Bef...   
10951  10951  Twitter Temporarily Censored Drudge Report Twe...   
10952  10952  Trophy Hunting Fees Do Little to Help Threaten...   

                   author                                               text  \
0           Darrell Lucus  House Dem Aide: We Didn’t Even See Comey’s Let...   
1         Daniel J. Flynn  Ever get

**Stemming**
Stemming is the process of reducing a word to its root form (its stem) by stripping affixes. The Porter stemmer used here removes common suffixes.


Example: acting, acted, acts --> act


It is an important step because mapping related word forms to a single stem shrinks the vocabulary, which usually helps model performance.
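
As a quick illustration, the sketch below runs NLTK's `PorterStemmer` on a few inflected words. The word list is chosen only for the example. Note that a stem is not always a dictionary word: in this dataset "Hillary" becomes "hillari".

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Words picked for illustration; inflected forms collapse to a shared stem.
words = ["acting", "acted", "acts", "killed", "hunting"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```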

In [102]:
port_stem = PorterStemmer()

In [103]:
stop_words = set(stopwords.words('english')) #build the stop-word set once; a set lookup is much faster than rebuilding the list for every word

def stemming(content):
  stemmed_content = re.sub('[^a-zA-Z]',' ',content)
  # ^ inside the brackets negates the class: every character that is not a letter (a-z or A-Z),
  # such as a digit or punctuation mark, is replaced by a space
  stemmed_content = stemmed_content.lower() #lowercase so that "News" and "news" become the same token
  stemmed_content = stemmed_content.split() #split on whitespace into a list of words
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words] #drop stop words, stem the rest
  stemmed_content = ' '.join(stemmed_content) #join the stems back into a single string
  return stemmed_content
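
The cleaning steps that run before stemming can be tried in isolation with only the standard library. This is a minimal sketch: the helper name `clean` and the tiny hand-written stop-word set are assumptions for the example (the notebook uses NLTK's full English list), and the stemming step is left out.

```python
import re

# Tiny illustrative stop-word set; the notebook itself uses NLTK's English list.
STOP_WORDS = {"the", "in", "for", "to", "of", "and", "a"}

def clean(text):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]

headline = "15 Civilians Killed In Single US Airstrike"
print(clean(headline))
```

The digits are replaced by spaces and "In" is dropped as a stop word, so only the content words remain for the stemmer.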



In [104]:
df['content']=df['content'].apply(stemming)

In [105]:
print(df['content'])

0        hous dem aid even see comey letter jason chaff...
1        flynn hillari clinton big woman campu breitbar...
2                   truth might get fire consortiumnew com
3        civilian kill singl us airstrik identifi jessi...
4        iranian woman jail fiction unpublish stori wom...
                               ...                        
10948    u revers issu permit keyston oil pipelin new y...
10949    univers chicago milo scare student faculti eve...
10950       migrant flood u mexico right elect robert gehl
10951    twitter temporarili censor drudg report tweet ...
10952    trophi hunt fee littl help threaten speci repo...
Name: content, Length: 10953, dtype: object


In [106]:
#Separating the features and the labels
X=df['content'].values
y=df['label'].values
print(X.shape)
print(y.shape)

(10953,)
(10953,)


In [107]:
#converting the textual data into numerical data
vectorizer = TfidfVectorizer() #TF-IDF weights each word by how often it appears in a document, scaled down for words that are common across the whole corpus
vectorizer.fit(X)
X = vectorizer.transform(X)
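
The weighting can be worked through by hand on a toy corpus. The sketch below follows the formula scikit-learn documents for the `TfidfVectorizer` defaults (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row), but it splits on whitespace instead of using the library's tokenizer. The function name `tfidf` and the three toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalised)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

rows = tfidf(["truth might get fire", "truth kill", "fire"])
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first document, "might" gets a higher weight than "truth" because it occurs in only one document, while "truth" occurs in two.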

In [108]:
print(X)

  (0, 11838)	0.2865708705597008
  (0, 10153)	0.2613181592734911
  (0, 6739)	0.3549603901896217
  (0, 6532)	0.288843602049239
  (0, 5834)	0.24723479687619354
  (0, 5325)	0.22509254700928746
  (0, 3793)	0.2384099144439043
  (0, 2902)	0.2717295903085129
  (0, 2754)	0.3549603901896217
  (0, 2263)	0.24813424458297476
  (0, 1903)	0.36738018000596945
  (0, 217)	0.27415108074335315
  (1, 12631)	0.29293012473648056
  (1, 5194)	0.18714081492537926
  (1, 4210)	0.7215318119129445
  (1, 2730)	0.26362745494594264
  (1, 2149)	0.18802925703388923
  (1, 1708)	0.3859911096211017
  (1, 1466)	0.15150187448413935
  (1, 1166)	0.2859470416490254
  (2, 11784)	0.4249345052000388
  (2, 7258)	0.48957416267184056
  (2, 4557)	0.3453656914835271
  (2, 4124)	0.38745005808339017
  (2, 2378)	0.4561359099733589
  :	:
  (10951, 11396)	0.32577355832180216
  (10951, 10199)	0.33628589388218766
  (10951, 9460)	0.17739886438657157
  (10951, 4102)	0.32577355832180216
  (10951, 3347)	0.32577355832180216
  (10951, 2402)	0.31095

In [109]:
print(y)

[1 0 1 ... 1 0 0]


**Building our Machine Learning algorithm**

Splitting the dataset

In [110]:
#splitting the data into training & test data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,stratify=y,random_state=42) #stratify=y keeps the proportion of fake and real news the same in both splits
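
The effect of `stratify` is easiest to see on an imbalanced toy label array. The sketch below uses made-up labels (70 real, 30 fake) and placeholder features; the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 70 real (0) and 30 fake (1); the features are just row indices.
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# With stratification, each split keeps roughly the overall share of fake news.
print("fake share overall:", y_toy.mean())
print("fake share in train:", y_tr.mean())
print("fake share in test:", y_te.mean())
```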

Training the model

In [111]:
model = LogisticRegression()
model.fit(X_train,y_train)
y_pred_train=model.predict(X_train)
print("accuracy : ",accuracy_score(y_train,y_pred_train))

accuracy :  0.9809573496804487


Evaluating the model performance

In [112]:
y_pred=model.predict(X_test)
print("accuracy: ",accuracy_score(y_test,y_pred))

accuracy:  0.9653073645769933
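
Accuracy alone does not show which kind of mistake the model makes. A minimal sketch of precision and recall for the "fake" class follows; the helper name `fake_news_report` and the toy labels are assumptions for illustration and are not results from this notebook. `sklearn.metrics` also offers `confusion_matrix` and `classification_report` for the same purpose.

```python
def fake_news_report(y_true, y_pred, positive=1):
    """Count confusion-matrix cells and derive accuracy, precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        # Precision: of the articles flagged as fake, how many really are fake?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Recall: of the truly fake articles, how many were flagged?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels, purely for illustration.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
report = fake_news_report(toy_true, toy_pred)
print(report)
```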


**Building a predictive System**

In [113]:
X_news = X_test[0] #first article of the test set
prediction = model.predict(X_news) #returns an array with a single label
if prediction[0] == 0:
  print("real news")
else:
  print("fake news")

fake news
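
To classify a new, raw piece of text rather than a row that is already vectorized, the vectorizer and the classifier have to be applied together. One possible way, sketched below, is scikit-learn's `make_pipeline`. The toy texts, toy labels and the helper `predict_label` are made up for illustration; in the notebook the input would first go through `stemming` and the pipeline would be fit on the training split. Fitting the pipeline on the training data alone also keeps test-set vocabulary out of the TF-IDF step, whereas the notebook fits the vectorizer on all rows before splitting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up, already-cleaned toy texts and labels (1 = fake, 0 = real).
toy_texts = [
    "shock secret cure doctor hate",
    "secret trick shock expert",
    "miracl cure shock reveal",
    "council approv school budget",
    "council vote new budget",
    "school open new librari",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline learns the TF-IDF vocabulary and the classifier in one fit call.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(toy_texts, toy_labels)

def predict_label(text):
    """Map the predicted class of one cleaned text to a readable label."""
    return "fake news" if pipeline.predict([text])[0] == 1 else "real news"

print(predict_label("shock secret cure"))
```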
