<a href="https://colab.research.google.com/github/chrisszelaya/Fake-News-Detector-/blob/main/FakeNewsDetector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Let's Import Our Data**

In [None]:
import numpy as np
import pandas as pd

fake_df = pd.read_csv("Fake.csv", encoding='utf-8', engine="python")
true_df = pd.read_csv("True.csv",encoding='utf-8',engine="python")

fake_df['Label'] = "Fake"
true_df['Label'] = "True"

df = pd.concat([fake_df, true_df], axis=0)
df.head()



Unnamed: 0,title,text,subject,date,Label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",Fake
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",Fake
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",Fake
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",Fake
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",Fake


In [None]:
df = df.drop(['title', 'subject', 'date'], axis=1)

In [None]:
df['Label'].value_counts()

Fake    23481
True    21417
Name: Label, dtype: int64
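
The two classes are close to balanced, which is worth knowing before reading any accuracy figure. A minimal sketch of the arithmetic, using only the counts printed above (the variable names are illustrative and not part of the notebook):

```python
# Counts copied from the value_counts() output above.
counts = {"Fake": 23481, "True": 21417}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any classifier trained on this data should be compared against that baseline rather than against 50%.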

## Cleaning Data


In [None]:
# Keep only the text after the first "(Reuters) - " dateline marker;
# articles without the marker are left unchanged
text = df['text']

text = text.apply(lambda x: x.split("(Reuters) - ", 1)[-1])

# Removing curly apostrophes (’) from the texts
text = text.apply(lambda x: x.replace("’", ''))

df['text'] = text

# Shuffling the rows so that real and fake articles occur randomly among the df
df = df.sample(frac = 1).reset_index(drop = True)
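
To see what the two cleaning steps do to a single article, here is a small standalone sketch. The helper names and the two sample strings are made up for illustration; they do not come from the dataset.

```python
def strip_reuters_prefix(article: str, marker: str = "(Reuters) - ") -> str:
    """Return the text after the first marker, or the article unchanged."""
    # str.partition splits on the first occurrence only.
    _, sep, after = article.partition(marker)
    return after if sep else article


def remove_curly_apostrophes(article: str) -> str:
    """Delete right single quotation marks (U+2019) from the text."""
    return article.replace("\u2019", "")


# Hypothetical sample strings, not taken from Fake.csv or True.csv.
samples = [
    "WASHINGTON (Reuters) - Lawmakers met on Monday.",
    "He couldn\u2019t attend.",
]

cleaned = [remove_curly_apostrophes(strip_reuters_prefix(s)) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

The first string loses its dateline, and the second loses its apostrophe without gaining a space.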


## **Let's Split Our Data Using a 0.8 Train Size and a 0.2 Test Size**

In [None]:
from sklearn.model_selection import train_test_split
labels = df['Label']
X_train, X_test, Y_train, Y_test = train_test_split(df['text'], labels, test_size=0.2, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)


(35918,)
(8980,)
(35918,)
(8980,)
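
The shapes above follow directly from the row count. A quick sketch of the arithmetic, assuming `train_test_split` rounds a fractional test size up and gives the remaining rows to training (this is consistent with the printed shapes):

```python
import math

# Total rows in df, from the label counts shown earlier.
n_samples = 23481 + 21417
test_size = 0.2

# Assumed rounding rule: test rows are rounded up, training gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```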


## Initialize our Term Frequency-Inverse Document Frequency vectorizer and Passive Aggressive Classifier pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.preprocessing import StandardScaler
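# TF-IDF features -> variance scaling -> linear classifier.
# with_mean=False skips centering so the TF-IDF matrix stays sparse.
TFDI = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('std_scaler', StandardScaler(with_mean=False)),
    ('clf', PassiveAggressiveClassifier()),
])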


TFDI.fit(X_train, Y_train)
Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf',  PassiveAggressiveClassifier())])
y_pred = TFDI.predict(X_test)
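
To make the first pipeline step less of a black box, here is a pure-Python sketch of the TF-IDF weighting on a three-document toy corpus. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how the default `TfidfVectorizer` settings are documented. The whitespace tokenizer and the corpus are simplifications invented for this example.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = ["markets rally today", "markets fall today", "storm hits coast"]

tokenized = [d.split() for d in docs]  # simplified tokenizer
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df_counts = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    tf = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df_counts[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

Terms that appear in fewer documents (such as "rally") receive a larger weight than terms shared across documents (such as "markets"), which is what lets the classifier focus on distinctive vocabulary.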



## Final Metrics

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix
score = accuracy_score(Y_test, y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 98.73%
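
Accuracy alone does not show which kind of mistake the model makes. The cell above imports `confusion_matrix` but never calls it; the sketch below shows how it could be used. The label lists here are tiny hypothetical stand-ins, not the notebook's `Y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only.
y_true = ["Fake", "Fake", "True", "True", "True"]
y_hat = ["Fake", "True", "True", "True", "Fake"]

# Rows are true classes, columns are predicted classes, in the order given.
cm = confusion_matrix(y_true, y_hat, labels=["Fake", "True"])
print(cm)

fake_as_true = cm[0, 1]  # fake articles predicted as true
true_as_fake = cm[1, 0]  # true articles predicted as fake
print(f"Fake predicted as True: {fake_as_true}")
print(f"True predicted as Fake: {true_as_fake}")
```

In the notebook itself, the same call would be `confusion_matrix(Y_test, y_pred, labels=["Fake", "True"])`.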


## Conclusion

Overall, our model was extremely successful at detecting whether a news article is real or fake. It achieves 98.73% accuracy on the held-out test set, which is close to perfect.