## Python Fake News Detection Model
A first attempt at fake news detection in Python, built with heavy reliance on online resources.

In [18]:
import pandas as pd

# Load the dataset
dataset = pd.read_csv('data/news.csv')

dataset

| Unnamed: 0.1 | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |
| ... | ... | ... | ... | ... |
| 6330 | 4490 | State Department says it can't find emails fro... | The State Department told the Republican Natio... | REAL |
| 6331 | 8062 | The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... | The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... | FAKE |
| 6332 | 8622 | Anti-Trump Protesters Are Tools of the Oligarc... | Anti-Trump Protesters Are Tools of the Oligar... | FAKE |
| 6333 | 4021 | In Ethiopia, Obama seeks progress on peace, se... | ADDIS ABABA, Ethiopia —President Obama convene... | REAL |


### Splitting the data

In [19]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X = dataset['text']
y = dataset['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### Build and train the model

- The TfidfVectorizer converts each article into a numerical vector, which is the form machine learning algorithms require. Each term is weighted by how often it appears in the article (term frequency) and down-weighted by how many articles contain it (inverse document frequency), so words that are distinctive for an article count for more than words that appear everywhere. These weights are the features from which the classifier learns patterns that separate real from fake articles.

- The PassiveAggressiveClassifier is a linear classifier built for online learning. It suits binary tasks such as REAL versus FAKE where the data may change over time:

    - Online learning: the model can be updated one example or mini-batch at a time (partial_fit in scikit-learn), which fits continuously published news. This notebook trains it once with fit on a fixed dataset.
    - Adaptive learning: each update touches only the current example, so large datasets are processed incrementally and the weights keep adjusting to new patterns of fake news.
    - Margin-based approach: when an example is classified correctly with enough margin, the weights stay unchanged (passive); otherwise they move just far enough to correct the error, up to a limit set by the regularization parameter C (aggressive).
    - Decision boundaries: the result is a linear boundary in TF-IDF space that separates real from fake articles and shifts as updates arrive.
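
To make the TF-IDF weighting described above concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenization and a three-document toy corpus (both hypothetical), and it uses smoothed IDF with L2 normalization, which to my understanding matches the TfidfVectorizer defaults. The function name `tfidf_vectors` is illustrative and not part of scikit-learn; stop-word removal and `max_df` filtering are left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2 normalization."""
    # Simplification: split on whitespace instead of a real tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocabulary]
        # Scale each document vector to unit length.
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocabulary, vectors


# Hypothetical toy corpus, not taken from news.csv.
docs = [
    "senate passes budget bill",
    "shocking secret budget scandal",
    "senate debates budget",
]
vocab, vectors = tfidf_vectors(docs)
for term, weight in zip(vocab, vectors[1]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the second document, "budget" occurs in every document and therefore receives a smaller weight than "shocking", "secret", or "scandal", each of which occurs in only one.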


In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Create a TfidfVectorizer that drops English stop words and
# ignores terms appearing in more than 70% of the documents
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

# Fit the vocabulary on the training set only, then transform both sets
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)

# Create a PassiveAggressiveClassifier and train it on the TF-IDF features
classifier = PassiveAggressiveClassifier(max_iter=50)
classifier.fit(tfidf_train, y_train)
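
The classifier fitted above learns through repeated passive-aggressive updates. A single update can be sketched in plain Python as follows. This is an illustration under stated assumptions and not scikit-learn's implementation: it shows the PA-I variant (which, as far as I know, corresponds to the default hinge loss), omits the intercept, and encodes REAL as +1 and FAKE as -1. The name `pa_update` and the three-feature stream are hypothetical.

```python
def pa_update(weights, x, y, C=1.0):
    """One PA-I update. x is a feature list, y is +1 or -1. Returns new weights."""
    score = sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - y * score)  # hinge loss on this example
    if loss == 0.0:
        return list(weights)  # passive: margin already satisfied
    sq_norm = sum(xi * xi for xi in x)
    if sq_norm == 0.0:
        return list(weights)  # an all-zero example carries no information
    tau = min(C, loss / sq_norm)  # aggressive: step size, capped by C
    return [w + tau * y * xi for w, xi in zip(weights, x)]


# Hypothetical stream of (features, label) pairs; +1 = REAL, -1 = FAKE.
weights = [0.0, 0.0, 0.0]
stream = [
    ([1.0, 0.0, 1.0], +1),
    ([0.0, 1.0, 1.0], -1),
    ([1.0, 0.0, 0.0], +1),
]
for x, y in stream:
    weights = pa_update(weights, x, y)
    print(weights)
```

Feeding the last example a second time leaves the weights unchanged, because it is then classified with a margin of at least 1. That is the "passive" half of the name.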


### Evaluate the model
Measure the accuracy of the model by making predictions on the testing set and comparing them with the actual labels.

Why is the confusion matrix important?

Accuracy reports only how often the model is right. The confusion matrix breaks the predictions down by true and predicted class, so it also shows which kind of error the model makes: fake articles passed as real, or real articles flagged as fake.

In [23]:
from sklearn.metrics import accuracy_score, confusion_matrix

# Predict on the testing set
y_pred = classifier.predict(tfidf_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

Accuracy: 0.936069455406472
Confusion Matrix:
[[587  41]
 [ 40 599]]
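
The matrix printed above can be turned into per-class precision and recall with a few lines of arithmetic. The sketch below copies the four counts from the output and assumes the usual scikit-learn layout: rows are true labels, columns are predicted labels, and labels are in sorted order (FAKE, then REAL).

```python
# Counts copied from the printed confusion matrix.
# Assumed layout: rows = true label, columns = predicted label,
# label order = ['FAKE', 'REAL'] (sorted).
cm = [[587, 41],
      [40, 599]]
labels = ["FAKE", "REAL"]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
print(f"accuracy: {accuracy:.4f}")

for i, label in enumerate(labels):
    true_pos = cm[i][i]
    predicted = sum(row[i] for row in cm)  # column sum: everything predicted as this label
    actual = sum(cm[i])                    # row sum: everything that truly has this label
    precision = true_pos / predicted
    recall = true_pos / actual
    print(f"{label}: precision={precision:.4f} recall={recall:.4f}")
```

Under that layout, 41 fake articles were predicted as real and 40 real articles were predicted as fake, so the two error types are nearly balanced on this test split.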


### Use the model for prediction
Use the trained model to predict the authenticity of new news articles.

In [24]:
# List of news articles to test
news_article = [
    "China seeking to spy on the U.S from a base in Cuba",
    "Obesity on a rise in the U.S"
]

# Transform the news articles using the trained vectorizer
tfidf_news_article = tfidf_vectorizer.transform(news_article)

# Make predictions on the news articles
news_article_pred = classifier.predict(tfidf_news_article)

# Print the predictions
for article, prediction in zip(news_article, news_article_pred):
    print(f"Article: {article}")
    print(f"Prediction: {'Real' if prediction == 'REAL' else 'Fake'}")
    print()

Article: China seeking to spy on the U.S from a base in Cuba
Prediction: Fake

Article: Obesity on a rise in the U.S
Prediction: Fake
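
Both headlines above come back as Fake. The model was trained on the text column (full articles), so short headlines may share few terms with its vocabulary. One way to see how close each call was is the signed distance to the decision boundary, which scikit-learn linear classifiers expose through `decision_function`. The sketch below is self-contained and uses a hypothetical four-document toy corpus in place of news.csv, so its scores say nothing about the real model; `max_df` is omitted because the corpus is tiny, and `random_state=0` is added only for repeatability.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the notebook trains on data/news.csv instead.
train_texts = [
    "officials confirm budget report after senate vote",
    "government publishes quarterly health statistics",
    "shocking secret cure doctors hate revealed",
    "you will not believe this miracle scandal",
]
train_labels = ["REAL", "REAL", "FAKE", "FAKE"]

# Chain the vectorizer and classifier so raw text goes in directly.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    PassiveAggressiveClassifier(max_iter=50, random_state=0),
)
model.fit(train_texts, train_labels)

headlines = [
    "senate vote on health budget",
    "shocking miracle cure revealed",
]
scores = model.decision_function(headlines)  # signed margin per headline
predictions = model.predict(headlines)
for headline, score, label in zip(headlines, scores, predictions):
    print(f"{label:>4}  score={score:+.3f}  {headline}")
```

With two classes, a positive score maps to the second label in sorted order (REAL here) and a negative score to the first (FAKE). A score near zero means the headline sits close to the boundary and the prediction deserves little trust.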

