# Problem Identification

Fake news is a real phenomenon that exists in every society. Hoaxes and propaganda heavily distort people's perspectives, and less literate readers in particular fall victim to misinformation with real, harmful outcomes. With today's technology for spreading information and several irresponsible parties in the journalism industry, this has become a global problem.

This problem does not merely influence but dominates people's perceptions and actions, with highly negative effects on society, and it is driven by a number of people behind the screen. Identifying fake news manually is not effective at this scale. We therefore need more sophisticated technology to counter the current situation.

Our idea is to use Artificial Intelligence to detect whether a news article should be considered ```fake``` or ```real```. We strive to build the best model we can by testing multiple algorithms and comparing multiple performance metrics.

# Why this dataset?

1. A large number of rows, which helps model accuracy.

2. Clear content.

3. Easy to manipulate (simple preprocessing needs).

# Import Libraries

We use ```pandas``` to read the CSV files.

We use ```string``` & ```re``` for preprocessing.

In [14]:
import pandas as pd
import string
import re
import wordcloud  # third-party package; run `pip install wordcloud` first if this raises ModuleNotFoundError

We use ```sklearn``` to create an instance of each algorithm.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

We also use ```sklearn``` to train the algorithms & measure each algorithm's performance.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    cross_val_score,
    GridSearchCV,
)
from sklearn.feature_extraction.text import TfidfVectorizer

# Reading & Visualizing the Files

These commands give a quick preview of the data and label it by adding a ```class``` column.

In [None]:
data_fake = pd.read_csv('datasets/Fake.csv')
data_true = pd.read_csv('datasets/True.csv')

In [None]:
data_fake.tail()

In [None]:
data_true.tail()

In [None]:
# add class col
data_fake["class"] = 0
data_true["class"] = 1

In [None]:
data_fake.head()

## Manual Testing

Take a sample of the dataset for the ```manual testing``` method: the last 10 rows of each file are held out from training. They are labeled the same way as the datasets above.
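As a minimal sketch of this hold-out step, the snippet below splits a made-up DataFrame into a 10-row manual-testing part and a remainder. The names `toy`, `toy_holdout` and `toy_remaining` are hypothetical and only used for illustration; the `.copy()` call is an assumption added here so that later column assignments on the held-out rows do not trigger pandas' `SettingWithCopyWarning`.

```python
import pandas as pd

# Hypothetical stand-in for one of the news files: 25 rows, all labeled 0
toy = pd.DataFrame({"text": [f"article {i}" for i in range(25)]})
toy["class"] = 0

# Hold out the last 10 rows for manual testing (an independent copy)
toy_holdout = toy.iloc[-10:].copy()

# Keep everything except the last 10 rows for training
toy_remaining = toy.iloc[:-10]

print(len(toy_holdout), len(toy_remaining))
```

Because the two slices share no index values, none of the held-out rows can end up in the training data.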

In [None]:
# get data for manual testing (.copy() avoids SettingWithCopyWarning when columns are assigned later)
data_fake_manual_testing = data_fake.iloc[-10:].copy()
data_true_manual_testing = data_true.iloc[-10:].copy()

# Drop data from original datasets
data_fake = data_fake.iloc[:-10]
data_true = data_true.iloc[:-10]

In [None]:
# add class col
data_fake_manual_testing["class"] = 0
data_true_manual_testing["class"] = 1

In [None]:
data_true_manual_testing.head()

In [None]:
data_merge = pd.concat([data_fake,data_true], axis = 0)

In [None]:
data_merge.head()

In [None]:
data_merge.tail()

In [None]:
data = data_merge.drop(['title', 'subject', 'date'], axis = 1)

In [None]:
data.isnull().sum()

In [None]:
data = data.sample(frac=1).reset_index(drop=True)

In [None]:
data.head()

# Preprocessing

Includes lower-casing each text, removing square-bracketed text, replacing URLs with an empty string, removing HTML tags, stripping punctuation, and removing words that contain digits.
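The steps listed above can be sketched on a single made-up string. The function name `demo_clean` and the sample text are hypothetical. In this sketch the URL and HTML patterns run before punctuation is stripped, so that `://`, `<` and `>` are still present for the patterns to match, and the final whitespace collapse is only there to make the demo output readable.

```python
import re


def demo_clean(text):
    text = text.lower()                                 # lower-case
    text = re.sub(r"\[.*?\]", "", text)                 # drop [bracketed] text
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # drop URLs
    text = re.sub(r"<.*?>", "", text)                   # drop HTML tags
    # Replace remaining punctuation and symbols with spaces
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    text = re.sub(r"\w*\d\w*", "", text)                # drop words containing digits
    return " ".join(text.split())                       # collapse whitespace (demo only)


sample = "BREAKING [VIDEO]: Visit https://example.com <b>now</b>, 2nd time in 2017!"
print(demo_clean(sample))  # breaking visit now time in
```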

In [None]:
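def clean_text(text):
    text = text.lower()
    # Drop square-bracketed text
    text = re.sub(r"\[.*?\]", "", text)
    # Remove URLs and HTML tags first, while "://", "<" and ">" are still in the text
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = re.sub(r"<.*?>", "", text)
    # Replace every remaining non-alphanumeric, non-space character with a space
    text = "".join(char if char.isalnum() or char.isspace() else " " for char in text)
    # Strip any ASCII punctuation that is left
    text = "".join(char for char in text if char not in string.punctuation)
    # Drop words that contain digits
    text = re.sub(r"\w*\d\w*", "", text)

    return text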

In [None]:
data['text'] = data['text'].apply(clean_text)

In [None]:
data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

data["text_length"] = data["text"].apply(len)

# Plot scatter plot
plt.figure(figsize=(10, 6))
for class_label in data["class"].unique():
    subset = data[data["class"] == class_label]
    plt.scatter(subset["text_length"], subset["class"], label=class_label)

plt.title("Scatter Plot of Text Length vs. Class")
plt.xlabel("Text Length")
plt.ylabel("Class")
plt.legend(title="Class")
plt.grid(True)
plt.show()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

class_distribution = data["class"].value_counts()

# Plot pie chart
plt.figure(figsize=(8, 8))
plt.pie(
    class_distribution,
    labels=class_distribution.index,
    autopct="%1.1f%%",
    startangle=140,
)
plt.title("Class Distribution")
plt.axis("equal") 
plt.show()

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Generate word cloud for each class
for label in data["class"].unique():
    text = " ".join(data[data["class"] == label]["text"])
    wordcloud = WordCloud(width=800, height=400, background_color="white").generate(
        text
    )

    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.title("Word Cloud for Class {}".format(label))
    plt.axis("off")
    plt.show()

# Training

Includes:

- Assigning the text to $X$ (features) and the class to $y$ (labels).

- Splitting the data into training and testing sets with a fixed random state (for reproducibility).

- Using ```TfidfVectorizer()``` to turn the text into a TF-IDF matrix that the algorithms take as input.
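A minimal sketch of the vectorization step on a made-up corpus may help here. The documents and the `toy_` names are hypothetical. The vectorizer learns its vocabulary from the training texts only; words that appear only in the test text are ignored when it is transformed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and test documents
toy_train_docs = ["the senate passed the bill", "aliens endorse the senate"]
toy_test_docs = ["the bill about aliens and taxes"]

toy_vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and IDF weights from the training texts
toy_xv_train = toy_vectorizer.fit_transform(toy_train_docs)

# transform reuses that vocabulary; unseen words such as "taxes" are dropped
toy_xv_test = toy_vectorizer.transform(toy_test_docs)

print(sorted(toy_vectorizer.vocabulary_))
print(toy_xv_train.shape, toy_xv_test.shape)
```

Both matrices have one column per vocabulary term, which is why a model trained on the first can predict on the second.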

In [None]:
X = data['text']
y = data['class']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=101)

In [None]:
print(X_train)

In [None]:
print(X_test)

In [None]:
print(y_train)

In [None]:
print(y_test)

In [None]:
vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(X_train)
xv_test = vectorization.transform(X_test)

In [None]:
print(xv_train)

In [None]:
print(xv_test)

# Algorithms & Visualization

All of the algorithms follow the same procedure:

1. *Create an instance* of the algorithm class

2. Fit the model on ```xv_train``` and ```y_train```

3. Predict labels for ```xv_test```

4. Score the trained model

5. Compute other performance indicators for comparison
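The five steps above can be sketched as one helper. This is only an illustration: `fit_and_report` and all `toy_` names and texts are hypothetical, and the notebook itself runs the steps cell by cell instead of through a helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


def fit_and_report(model, xv_train, y_train, xv_test, y_test):
    model.fit(xv_train, y_train)                  # step 2: fit
    pred = model.predict(xv_test)                 # step 3: predict
    score = model.score(xv_test, y_test)          # step 4: accuracy score
    report = classification_report(y_test, pred, zero_division=0)  # step 5
    return pred, score, report


# Hypothetical toy data: 1 = real, 0 = fake
toy_train = [
    "senate passes budget bill",
    "aliens secretly run the senate",
    "court upholds budget ruling",
    "miracle cure aliens hide",
]
toy_y_train = [1, 0, 1, 0]
toy_test = ["budget bill ruling", "aliens miracle"]
toy_y_test = [1, 0]

toy_vec = TfidfVectorizer()
toy_xv_train = toy_vec.fit_transform(toy_train)
toy_xv_test = toy_vec.transform(toy_test)

# Step 1: create an instance, then run the remaining steps
toy_pred, toy_score, toy_report = fit_and_report(
    LogisticRegression(), toy_xv_train, toy_y_train, toy_xv_test, toy_y_test
)
print(toy_score)
print(toy_report)
```

Any of the four classifiers used below could be passed in place of `LogisticRegression()`, since they share the same `fit`, `predict` and `score` methods.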

## 1.1) Logistic Regression: Algorithm

In [None]:
LR = LogisticRegression()
LR.fit(xv_train, y_train)

In [None]:
pred_lr = LR.predict(xv_test)

In [None]:
LR.score(xv_test, y_test)

In [None]:
print(classification_report(y_test, pred_lr))

## 1.2) Logistic Regression: Visualization

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

LR = LogisticRegression()
LR.fit(xv_train, y_train)

# Predictions
pred_lr = LR.predict(xv_test)

# accuracy
accuracy = LR.score(xv_test, y_test)
print("Accuracy:", accuracy)

# confusion matrix
cm = confusion_matrix(y_test, pred_lr)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix - Logistic Regression")
plt.show()

print("\nClassification Report:")
print(classification_report(y_test, pred_lr))

## 2.1) Decision Tree Classifier: Algorithm

In [None]:
DT = DecisionTreeClassifier()
DT.fit(xv_train, y_train)

In [None]:
pred_dt = DT.predict(xv_test)

In [None]:
DT.score(xv_test, y_test)

In [None]:
print(classification_report(y_test, pred_dt))

## 2.2) Decision Tree Classifier: Visualization

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

DT = DecisionTreeClassifier()
DT.fit(xv_train, y_train)

# Predictions
pred_dt = DT.predict(xv_test)

# accuracy
accuracy = DT.score(xv_test, y_test)
print("Accuracy:", accuracy)

# confusion matrix
cm = confusion_matrix(y_test, pred_dt)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix - Decision Tree Classifier")
plt.show()

print("\nClassification Report:")
print(classification_report(y_test, pred_dt))

## 3.1) Gradient Boost Classifier: Algorithm

In [None]:
GB = GradientBoostingClassifier(random_state=0)
GB.fit(xv_train, y_train)

In [None]:
pred_gb = GB.predict(xv_test)

In [None]:
GB.score(xv_test, y_test)

In [None]:
print(classification_report(y_test, pred_gb))

## 3.2) Gradient Boost Classifier: Visualization

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

GB = GradientBoostingClassifier(random_state=0)  # same seed as the algorithm cell above
GB.fit(xv_train, y_train)

# Predictions
pred_GB = GB.predict(xv_test)

# accuracy
accuracy = GB.score(xv_test, y_test)
print("Accuracy:", accuracy)

# confusion matrix
cm = confusion_matrix(y_test, pred_GB)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix - Gradient Boost Classifier")
plt.show()

print("\nClassification Report:")
print(classification_report(y_test, pred_GB))

## 4.1) Random Forest Classifier: Algorithm

In [None]:
RF = RandomForestClassifier(random_state=0)
RF.fit(xv_train, y_train)

In [None]:
pred_rf = RF.predict(xv_test)

In [None]:
RF.score(xv_test, y_test)

In [None]:
print(classification_report(y_test, pred_rf))

## 4.2) Random Forest Classifier: Visualization

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

RF = RandomForestClassifier(random_state=0)  # same seed as the algorithm cell above
RF.fit(xv_train, y_train)

# Predictions
pred_rf = RF.predict(xv_test)

# accuracy
accuracy = RF.score(xv_test, y_test)
print("Accuracy:", accuracy)

# confusion matrix
cm = confusion_matrix(y_test, pred_rf)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix - Random Forest Classifier")
plt.show()

print("\nClassification Report:")
print(classification_report(y_test, pred_rf))

# Additional performance & data quality measurement

## 1) k-fold and stratified k-fold validation
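As a minimal sketch of how the two splitters differ, the snippet below counts the positive labels in each validation fold of a small, imbalanced, made-up label vector. The `toy_` names and `positives_per_fold` are hypothetical. Plain k-fold cuts the rows into consecutive blocks, while stratified k-fold keeps the class ratio roughly the same in every fold.

```python
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy data: 8 negatives followed by 4 positives
toy_X = [[i] for i in range(12)]
toy_labels = [0] * 8 + [1] * 4


def positives_per_fold(splitter):
    # Count how many positive labels land in each validation fold
    return [
        sum(toy_labels[i] for i in val_idx)
        for _, val_idx in splitter.split(toy_X, toy_labels)
    ]


kfold_pos = positives_per_fold(KFold(n_splits=4))
strat_pos = positives_per_fold(StratifiedKFold(n_splits=4))

print("KFold:", kfold_pos)            # KFold: [0, 0, 1, 3]
print("StratifiedKFold:", strat_pos)  # StratifiedKFold: [1, 1, 1, 1]
```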

In [None]:
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import accuracy_score


def evaluate_model(model, X_train, y_train, k=5, stratified=False):
    if stratified:
        kf = StratifiedKFold(n_splits=k)
    else:
        kf = KFold(n_splits=k)

    accuracies = []

    for train_index, val_index in kf.split(X_train, y_train):
        # Index the matrix passed in as X_train rather than the global xv_train
        X_train_fold, X_val_fold = X_train[train_index], X_train[val_index]
        y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]
    
        model.fit(X_train_fold, y_train_fold)
        y_pred = model.predict(X_val_fold)
        accuracy = accuracy_score(y_val_fold, y_pred)
        accuracies.append(accuracy)

    avg_accuracy = sum(accuracies) / len(accuracies)
    return avg_accuracy


# The data, clean_text, train/test split and fitted vectorization from the cells
# above are reused here. Reloading and reshuffling the data would refit the
# vectorizer and leave the already trained GB model on a different feature space.

LR = LogisticRegression()
DT = DecisionTreeClassifier()
RF = RandomForestClassifier()

models = [LR, DT, RF]
for model in models:
    avg_accuracy = evaluate_model(model, xv_train, y_train, k=5, stratified=True)
    print("Model:", model.__class__.__name__)
    print("Average accuracy:", avg_accuracy)

## 2) Using cross_val_score and pipeline

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Define the pipeline
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])

# Perform cross-validation
cv_scores = cross_val_score(pipeline, X, y, cv=5)

# Print average accuracy
print("Average Accuracy:", cv_scores.mean())

# Train the pipeline on the training split only, so the test set below stays unseen
pipeline.fit(X_train, y_train)

# Predictions on the test set
pred_lr = pipeline.predict(X_test)

# Create confusion matrix
cm = confusion_matrix(y_test, pred_lr)

# Create heatmap for confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix - Logistic Regression")
plt.show()

# Create classification report
print("\nClassification Report:")
print(classification_report(y_test, pred_lr))

In [None]:
from sklearn.model_selection import GridSearchCV

# Define a parameter grid
param_grid = {
    "clf__C": [
        0.001,
        0.01,
        0.1,
        1,
        10,
        100,
    ],  # Regularization parameter for Logistic Regression
}

# Define the pipeline with TfidfVectorizer and LogisticRegression
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])

# Initialize GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")

# Perform grid search
grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

# Testing

Use a *manual test* to check whether each model's prediction matches the actual label.

In [None]:
def label(n):
    # Map the numeric class back to a readable name
    if n == 0:
        return "Fake"
    elif n == 1:
        return "True"
    else:
        return "Unidentified"


def manual_test(news):
    # Wrap the raw text in a DataFrame and clean it the same way as the training data
    testing_news = {"text": [news]}
    new_def_test = pd.DataFrame(testing_news)
    new_def_test["text"] = new_def_test["text"].apply(clean_text)
    new_x_test = new_def_test["text"]
    # Reuse the fitted vectorizer so the features match what the models were trained on
    new_xv_test = vectorization.transform(new_x_test)
    pred_LR = LR.predict(new_xv_test)
    pred_DT = DT.predict(new_xv_test)
    pred_GB = GB.predict(new_xv_test)
    pred_RF = RF.predict(new_xv_test)

    print(
        "\n\nLR Prediction: {} \nDT Prediction: {} \nGBC Prediction: {} \nRFC Prediction: {}".format(
            label(pred_LR[0]),
            label(pred_DT[0]),
            label(pred_GB[0]),
            label(pred_RF[0]),
        )
    )

In [None]:
news = str(
    "21st Century Wire says This week, the historic international Iranian Nuclear Deal was punctuated by a two-way prisoner swap between Washington and Tehran, but it didn t end quite the way everyone expected. On the Iranian side, one of the U.S. citizens who was detained in Iran, Nosratollah Khosravi-Roodsari, has stayed in Iran, but on the U.S. side   all 7 of the Iranians held in U.S. prisons DID NOT show up to their flight to Geneva for the prisoner exchange   with at least 3 electing to stay in the U.S  TEHRAN SIDE: In Iran, 5 U.S. prisoners were released, with 4 of them making their way to Germany via Switzerland.Will Robinson Daily MailNone of the Iranians freed in the prisoner swap have returned home and could still be in the United States, it has been reported.The seven former inmates, who were released as part of a deal with the Islamic republic, did not show up to get a flight to Geneva, Switzerland, where the exchange was set to take place on Sunday.Three of the Iranians have decided to stay in the United States, ABC reported, with some moving in with their families. However it is not known where the other four are.Three of the Americans who had been detained in Iran   Washington Post journalist Jason Rezaian, former U.S. Marine Amir Hekmati and Christian pastor Saeed Abedini   left Tehran at around 7am the same day, but weren t met by their counterparts in Switzerland Continue this story at the Mail OnlineREAD MORE IRAN NEWS AT: 21st Century Wire Iran Files"
)
manual_test(news)