# **Fake News**

Build a system to identify unreliable news articles.

## **Dataset Description**

**train.csv:** A full training dataset with the following attributes:

* **id:** unique id for a news article
* **title:** the title of a news article
* **author:** author of the news article
* **text:** the text of the article; could be incomplete
* **label:** a label that marks the article as potentially unreliable
    * 1: unreliable
    * 0: reliable

# **Installing Required Tools**

In [1]:
!pip install gradio

Collecting gradio
  Downloading gradio-5.9.1-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.6-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.5.2 (from gradio)
  Downloading gradio_client-1.5.2-py3-none-any.whl.metadata (7.1 kB)
Collecting markupsafe~=2.0 (from gradio)
  Downloading MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting ruff>=0.2.2 (from gradio)
  Downloading ruff-0.8.4-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metad

# **Import the Libraries**

In [2]:
import numpy as np
import pandas as pd
import re  # For regular expressions to clean the text
from nltk.corpus import stopwords   # For filtering out common stop words
from nltk.stem.porter import PorterStemmer   # For stemming words
from sklearn.feature_extraction.text import TfidfVectorizer  # To convert text data into numerical data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import gradio as gr

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
# Print the English stop words
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

# **Data Collection and Data Preprocessing**

In [5]:
# Loading the dataset into a pandas DataFrame
# on_bad_lines='skip' silently drops malformed rows instead of raising an error
news_data = pd.read_csv('train.csv', on_bad_lines='skip', engine='python')

In [6]:
# Checking the shape of the dataset
news_data.shape    # Outputs the number of rows and columns in the dataset.

(1045, 5)

In [7]:
# Checking the number of missing values in each column
print(news_data.isnull().sum())

id          0
title      29
author    108
text        2
label       0
dtype: int64


In [8]:
# Displays the first 5 rows of the dataset.
news_data.head()

| | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


In [9]:
# Replace missing (NaN) values with an empty string so the text columns can be concatenated
news_data.fillna('', inplace=True)


In [10]:
# Checking the shape of the dataset
news_data.shape    # Outputs the number of rows and columns in the dataset.

(1045, 5)

In [11]:
# Merging the author name and news title
news_data['content'] = news_data['author']+' '+news_data['title']
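Filling missing values before this step matters: in pandas, concatenating a string with a missing value yields a missing value, so an article with no author would otherwise lose its title too. A minimal sketch on made-up rows (the frame `toy` and its contents are illustrative, not taken from `train.csv`):

```python
import pandas as pd

# Made-up rows; the second one has no author
toy = pd.DataFrame({
    "author": ["Jane Doe", None],
    "title": ["Budget vote delayed", "Shock cure revealed"],
})

# Without filling, a missing author makes the whole concatenation missing
unfilled = toy["author"] + " " + toy["title"]

# After filling with '', the title survives (with a leading space)
filled_frame = toy.fillna("")
filled = filled_frame["author"] + " " + filled_frame["title"]

print(unfilled.isna().tolist())
print(filled.tolist())
```

The leading space left by an empty author is harmless here, because the later preprocessing splits the text on whitespace.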


In [12]:
print(news_data['content'])

0       Darrell Lucus House Dem Aide: We Didn’t Even S...
1       Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2       Consortiumnews.com Why the Truth Might Get You...
3       Jessica Purkiss 15 Civilians Killed In Single ...
4       Howard Portnoy Iranian woman jailed for fictio...
                              ...                        
1040    Nicholas Confessore and Rachel Shorey Outside ...
1041    Joel B. Pollak Fake News: New York Times Targe...
1042    Michael Wilson, Samantha Schmidt and Sarah Mas...
1043    Jerome Hudson Resistance: Schwarzenegger Calls...
1044    Christine Hauser Virginia Officials Request U....
Name: content, Length: 1045, dtype: object


In [13]:
news_data.head()

| | id | title | author | text | label | content |
|---|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 | Darrell Lucus House Dem Aide: We Didn’t Even S... |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 | Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo... |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 | Consortiumnews.com Why the Truth Might Get You... |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 | Jessica Purkiss 15 Civilians Killed In Single ... |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 | Howard Portnoy Iranian woman jailed for fictio... |


In [14]:
# Splitting features (X) and target (y)
X = news_data.drop(columns='label')
y = news_data['label']

In [15]:
print(X)
print(y)

        id                                              title  \
0        0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1        1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2        2                  Why the Truth Might Get You Fired   
3        3  15 Civilians Killed In Single US Airstrike Hav...   
4        4  Iranian woman jailed for fictional unpublished...   
...    ...                                                ...   
1040  1040  Outside Money Favors Hillary Clinton at a 2-to...   
1041  1041  Fake News: New York Times Targets Breitbart fo...   
1042  1042  After Blast, New Yorkers Examine Themselves fo...   
1043  1043  Resistance: Schwarzenegger Calls for ’Grassroo...   
1044  1044  Virginia Officials Request U.S. Inquiry After ...   

                                                 author  \
0                                         Darrell Lucus   
1                                       Daniel J. Flynn   
2                                    Conso

# **Stemming**

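Stemming is the process of reducing a word to its **root form** (stem). A stem does not have to be a dictionary word: in the processed titles printed further down, "hillary" becomes "hillari" and "campus" becomes "campu".


Conceptual example (the stems actually returned by the Porter stemmer may differ):

Classifier, Classification, Classified  ----> Class

"running" → "run"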

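As a quick check of what the stemmer actually returns, the sketch below runs NLTK's `PorterStemmer` on the example words. The word list and the names `stemmer` and `stems` are chosen for illustration only.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words only; they are not taken from the dataset
words = ["running", "classifier", "classification", "classified"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Because the classifier only needs related words to map to the same token, it does not matter that a stem is not itself an English word.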
In [16]:
# Initialize a Porter Stemmer instance
port_stem = PorterStemmer()

In [17]:
# Build the stop-word set once; looking words up in a set is much faster than
# calling stopwords.words('english') again for every token
stop_words = set(stopwords.words('english'))

# Define a function to preprocess and stem text
def stemming(content):
    """
    Clean, lowercase, tokenize, remove stop words from, and stem the input text.

    Args:
        content (str): A string containing the text to process.

    Returns:
        str: The processed and stemmed string.
    """
    # Replace every non-alphabetic character with a space
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)

    # Convert the cleaned text to lowercase
    stemmed_content = stemmed_content.lower()

    # Split the text into a list of words (tokens)
    stemmed_content = stemmed_content.split()

    # Stem each word that is not a stop word
    stemmed_content = [
        port_stem.stem(word) for word in stemmed_content
        if word not in stop_words
    ]

    # Join the processed words back into a single string
    stemmed_content = ' '.join(stemmed_content)

    return stemmed_content

In [18]:
print(news_data['content'])

0       Darrell Lucus House Dem Aide: We Didn’t Even S...
1       Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2       Consortiumnews.com Why the Truth Might Get You...
3       Jessica Purkiss 15 Civilians Killed In Single ...
4       Howard Portnoy Iranian woman jailed for fictio...
                              ...                        
1040    Nicholas Confessore and Rachel Shorey Outside ...
1041    Joel B. Pollak Fake News: New York Times Targe...
1042    Michael Wilson, Samantha Schmidt and Sarah Mas...
1043    Jerome Hudson Resistance: Schwarzenegger Calls...
1044    Christine Hauser Virginia Officials Request U....
Name: content, Length: 1045, dtype: object


In [19]:
# Apply the `stemming` function to each row in the 'content' column
# This cleans, lowercases, removes stop words from, and stems the text in the column
news_data['content'] = news_data['content'].apply(stemming)


# **OR**

In [20]:
# Splitting features (X) and target (y), using only the preprocessed 'content' column
# (this replaces the X and y defined earlier)
X = news_data['content'].values
y = news_data['label'].values

In [21]:
print(X)

['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire' ...
 'michael wilson samantha schmidt sarah maslin nir blast new yorker examin psycholog shrapnel new york time'
 'jerom hudson resist schwarzenegg call grassroot revolut u exit pari agreement'
 'christin hauser virginia offici request u inquiri inmat death jail new york time']


In [22]:
print(y)

[1 0 1 ... 0 0 0]


In [23]:
y.shape

(1045,)

In [24]:
# -- Converting the textual data to numerical data
# TfidfVectorizer converts text into a numerical representation based on
# Term Frequency-Inverse Document Frequency (TF-IDF) values.
vectorizer = TfidfVectorizer()

# -- Fit the vectorizer to the preprocessed text
# fit() learns the vocabulary and the document frequency of each term from 'X',
# which at this point must still be an array of strings.
# Note: fitting here, before the train/test split, lets test-set rows
# contribute to the vocabulary and IDF weights.
vectorizer.fit(X)

# -- Transform the text into numerical format
# transform() returns a sparse matrix in which each row is a document and
# each column holds one term's TF-IDF value in that document.
X = vectorizer.transform(X)
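To make the TF-IDF weights less opaque, here is a minimal pure-Python sketch of the computation, assuming scikit-learn's documented defaults for `TfidfVectorizer`: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The three-document corpus and the helper name `tfidf_rows` are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: tf * idf[term] for term, tf in counts.items()}
        # L2-normalize so every non-empty row has unit length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({term: w / norm for term, w in raw.items()} if norm else {})
    return rows

# Toy corpus (already lowercased and stemmed, like the 'content' column)
toy_docs = ["hillari clinton campu", "clinton email", "truth fire"]
toy_rows = tfidf_rows(toy_docs)
for row in toy_rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

Terms that occur in many documents (here "clinton") receive a lower weight than terms unique to one document. One difference from this sketch: the default tokenizer of `TfidfVectorizer` only keeps tokens of two or more characters, so single letters such as "j" or "u" never become features.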

In [25]:
print(X)

  (0, 74)	0.2831062902722445
  (0, 570)	0.3421500381169244
  (0, 696)	0.2577029424093616
  (0, 870)	0.32304013448391095
  (0, 917)	0.2831062902722445
  (0, 1176)	0.261803860086362
  (0, 1644)	0.2360448062218859
  (0, 1800)	0.2662959071574504
  (0, 1996)	0.29896450319472506
  (0, 2066)	0.32304013448391095
  (0, 3086)	0.2712616348136228
  (0, 3628)	0.29896450319472506
  (1, 348)	0.3122283352625716
  (1, 438)	0.16778653572542396
  (1, 506)	0.3938225060269664
  (1, 659)	0.21920283267331853
  (1, 864)	0.2882595041930885
  (1, 1314)	0.6684485413436133
  (1, 1607)	0.20931213404905816
  (1, 3852)	0.3122283352625716
  (2, 692)	0.3333924581672224
  (2, 741)	0.4544515529762128
  (2, 1285)	0.3728788804443066
  (2, 1415)	0.35431279418623046
  (2, 2215)	0.47436153135177395
  :	:
  (1042, 3833)	0.2868348550949847
  (1042, 3894)	0.0899999289345843
  (1042, 3895)	0.3038029833686173
  (1043, 69)	0.3489504417235231
  (1043, 499)	0.2520989493160544
  (1043, 1196)	0.3489504417235231
  (1043, 1477)	0.329460

# **Splitting the Dataset into Training and Test Sets**

In [47]:
# Splitting the dataset into Training set and Test Set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)
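The notebook fits the vectorizer on all rows before splitting, so the test rows influence the vocabulary and IDF weights. One possible alternative, sketched below on a made-up toy corpus, is to split the raw strings first and wrap `TfidfVectorizer` and `LogisticRegression` in a scikit-learn `Pipeline`, so that the vectorizer is fitted on the training split only. The texts, labels, and variable names are illustrative assumptions, not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up, already preprocessed strings with made-up labels (1 = unreliable)
toy_texts = [
    "anonym blog shock secret reveal", "offici report budget vote",
    "shock claim cure secret", "senat vote budget bill",
    "secret plot reveal shock", "court rule offici appeal",
    "miracl cure shock reveal", "report offici elect result",
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw strings first, before any vectorizer is fitted
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=1
)

# The pipeline fits TF-IDF on the training strings only, then the classifier
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(texts_train, labels_train)

# At prediction time the fitted vocabulary is reused; unseen words are ignored
toy_predictions = pipeline.predict(texts_test)
print(toy_predictions)
```

With this layout the test accuracy reflects what the model would see on genuinely new articles, at the cost of test-only words being dropped.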


In [48]:
# Checking the shapes of the full, training, and test feature matrices
print(X.shape, X_train.shape, X_test.shape)


(1045, 3913) (836, 3913) (209, 3913)


# **Model Training --> Logistic Regression**

In [49]:
# Initializing the Logistic Regression model
model = LogisticRegression()

In [50]:
# Training the Logistic Regression model with train data
model.fit(X_train, y_train)

# **Model Evaluation**

In [51]:
# Calculate accuracy on the training data
X_train_pred = model.predict(X_train)
train_data_accuracy = accuracy_score(y_train, X_train_pred)  # arguments are (y_true, y_pred)
print('Accuracy on training data : ', train_data_accuracy)


Accuracy on training data :  0.9784688995215312


In [52]:
# Generate and display the confusion matrix on the training data
# The confusion matrix shows the counts of True Positives, True Negatives, False Positives, and False Negatives
conf_matrix = confusion_matrix(y_train, X_train_pred)
print("\nConfusion Matrix:")
print(conf_matrix)


Confusion Matrix:
[[399  18]
 [  0 419]]


In [53]:
# Generate and display the classification report on the training data
# The classification report includes precision, recall, F1-score, and support for each class
class_report = classification_report(y_train, X_train_pred, target_names=["Reliable (0)", "Unreliable (1)"])
print("\nClassification Report:")
print(class_report)


Classification Report:
                precision    recall  f1-score   support

  Reliable (0)       1.00      0.96      0.98       417
Unreliable (1)       0.96      1.00      0.98       419

      accuracy                           0.98       836
     macro avg       0.98      0.98      0.98       836
  weighted avg       0.98      0.98      0.98       836



In [54]:
# accuracy on the test data
X_test_pred = model.predict(X_test)
test_data_accuracy = accuracy_score(y_test, X_test_pred)  # arguments are (y_true, y_pred)
print('Accuracy on test data : ', test_data_accuracy)

Accuracy on test data :  0.9425837320574163


In [55]:
# Generate and display the confusion matrix on the test data
# The confusion matrix shows the counts of True Positives, True Negatives, False Positives, and False Negatives
conf_matrix = confusion_matrix(y_test, X_test_pred)
print("\nConfusion Matrix:")
print(conf_matrix)


Confusion Matrix:
[[ 93  11]
 [  1 104]]
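The figures in a classification report can be recomputed from the confusion matrix by hand. The sketch below does this for the test-set matrix printed above, assuming scikit-learn's layout (rows are true labels, columns are predicted labels) and treating class 1 (unreliable) as the positive class. The variable names are illustrative.

```python
# Test-set confusion matrix copied from the output above
# Layout: rows = true label, columns = predicted label
matrix = [[93, 11],
          [1, 104]]

tn, fp = matrix[0]  # true reliable: predicted reliable / predicted unreliable
fn, tp = matrix[1]  # true unreliable: predicted reliable / predicted unreliable

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_unreliable = tp / (tp + fp)
recall_unreliable = tp / (tp + fn)
f1_unreliable = 2 * precision_unreliable * recall_unreliable / (
    precision_unreliable + recall_unreliable
)

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision_unreliable:.2f}")
print(f"recall    = {recall_unreliable:.2f}")
print(f"f1-score  = {f1_unreliable:.2f}")
```

Read this way, the matrix shows where the errors fall: 11 reliable articles were flagged as unreliable, while only 1 unreliable article was missed.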


In [56]:
# Generate and display the classification report on the test data
# The classification report includes precision, recall, F1-score, and support for each class
class_report = classification_report(y_test, X_test_pred, target_names=["Reliable (0)", "Unreliable (1)"])
print("\nClassification Report:")
print(class_report)


Classification Report:
                precision    recall  f1-score   support

  Reliable (0)       0.99      0.89      0.94       104
Unreliable (1)       0.90      0.99      0.95       105

      accuracy                           0.94       209
     macro avg       0.95      0.94      0.94       209
  weighted avg       0.95      0.94      0.94       209



# **Making a Predictive System**

In [57]:
# Define the predictive system function
def predict_news_reliability(input_text, model, vectorizer):
    """
    Predict whether a given news article is reliable or unreliable.

    Parameters:
        input_text (str): The text to classify. The model was trained on
            "author + title" strings, so other kinds of text may be
            classified less reliably.
        model: The trained machine learning model.
        vectorizer: The fitted TfidfVectorizer instance.

    Returns:
        str: "The news is Real" or "The news is Fake" based on the model's prediction.
    """
    # Preprocess the input text
    preprocessed_text = stemming(input_text)  # Apply the same stemming function used in training

    # Convert text to numerical data using the vectorizer
    vectorized_input = vectorizer.transform([preprocessed_text])  # Convert to numerical format

    # Make a prediction using the trained model
    prediction = model.predict(vectorized_input)

    #  Map the prediction to the corresponding label
    if prediction[0] == 0:
        return "The news is Real"
    else:
        return "The news is Fake"


In [58]:
# Assuming `model` is the trained classifier and `vectorizer` is the fitted TfidfVectorizer
input_text = "Breaking news: The economy is seeing unprecedented growth due to new policies."
prediction_result = predict_news_reliability(input_text, model, vectorizer)

print(f"The news article is predicted to be: {prediction_result}")


The news article is predicted to be: The news is Fake
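A bare label hides how confident the model is. `LogisticRegression` also exposes `predict_proba`, which returns one probability per class. The sketch below shows this on a small made-up corpus; the texts, labels, and names are illustrative, and the resulting probabilities say nothing about the notebook's trained model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up preprocessed training strings and labels (1 = unreliable)
demo_texts = [
    "shock secret reveal cure", "offici budget vote report",
    "secret plot shock claim", "senat bill vote offici",
]
demo_labels = [1, 0, 1, 0]

demo_vectorizer = TfidfVectorizer()
demo_model = LogisticRegression()
demo_model.fit(demo_vectorizer.fit_transform(demo_texts), demo_labels)

# Columns of predict_proba follow the order of demo_model.classes_
new_features = demo_vectorizer.transform(["shock secret budget"])
probabilities = demo_model.predict_proba(new_features)[0]

for label, probability in zip(demo_model.classes_, probabilities):
    print(f"class {label}: {probability:.3f}")
```

If none of the input's words are in the fitted vocabulary, the feature vector is all zeros and the prediction depends only on the model's intercept, which is one reason to inspect probabilities before trusting a label on unfamiliar text.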


# **OR**

In [61]:
X_new = X_test[1]

prediction = model.predict(X_new)
print(prediction)

if (prediction[0]==0):
  print('The news is Real')
else:
  print('The news is Fake')

[1]
The news is Fake


In [66]:
print(y_test[1])

1


In [64]:
X_new = X_test[2]

prediction = model.predict(X_new)
print(prediction)

if (prediction[0]==0):
  print('The news is Real')
else:
  print('The news is Fake')

[0]
The news is Real


In [65]:
print(y_test[2])

0


In [41]:
X_new = X_test[3]

prediction = model.predict(X_new)
print(prediction)

if (prediction[0]==0):
  print('The news is Real')
else:
  print('The news is Fake')

[1]
The news is Fake


In [42]:
print(y_test[3])

1


In [44]:
# Define the predictive system function
def predict_news_reliability(input_text):
    """
    Predict whether a given news article is reliable or unreliable.

    Parameters:
        input_text (str): The text of the news article to predict.

    Returns:
        str: "The news is Real" or "The news is Fake" based on the model's prediction.
    """
    try:
        # Preprocess the input text
        preprocessed_text = stemming(input_text)  # Apply stemming function

        # Convert text to numerical data using the vectorizer
        vectorized_input = vectorizer.transform([preprocessed_text])  # Convert to numerical format

        # Make a prediction using the trained model
        prediction = model.predict(vectorized_input)

        # Map the prediction to the corresponding label
        if prediction[0] == 0:
            return "The news is Real"
        else:
            return "The news is Fake"
    except Exception as e:
        return f"Error: {str(e)}"

# Define the Gradio interface
interface = gr.Interface(
    fn=predict_news_reliability,
    inputs="text",
    outputs="text",
    title="News Reliability Predictive System",
    description=(
        "This system predicts whether a news article is Real or Fake. "
        "Please enter the text of the article to get the prediction."
    ),
    examples=[
        ["Breaking news! Scientists discover water on Mars."],
        ["Click here to win a free iPhone! This is not a scam."],
    ],
)

# Launch the Gradio interface
interface.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://cbc8d66cc55d4f8fe9.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


