<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">

<style>
  body {
    font-family: Arial, sans-serif;
    padding: 20px;
  }
  .container {
    max-width: 600px;
    margin: 0 auto;
    
    border-radius: 10px;
    padding: 20px;
    box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); /* Shadow effect */
  }
  h2 {
    color: #333;
    margin-bottom: 20px;
  }
  ul {
    list-style-type: none;
    padding: 0;
  }
  li {
    margin-bottom: 10px;
  }
  strong {
    font-weight: bold;
  }
</style>
</head>
<body>

<div class="container">
  <h2>News Article Dataset</h2>
  <p>The dataset contains the following features:</p>
  <ul>
    <li><strong>id:</strong> Unique identifier for each news article.</li>
    <li><strong>title:</strong> The title of the news article.</li>
    <li><strong>author:</strong> The author of the news article.</li>
    <li><strong>text:</strong> The text content of the news article.</li>
    <li><strong>label:</strong> The label indicating whether the news article is real (0) or fake (1).</li>
  </ul>
</div>

</body>
</html>


# Importing the Dependencies

In [29]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import nltk
nltk.download('stopwords', quiet=True)

True

Stopwords:

Stopwords are very common words, such as "the," "is," and "and," that carry little meaning on their own. Natural language processing pipelines often filter them out so that tasks like text classification and sentiment analysis focus on the more informative words.
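
As a rough illustration of the filtering described above, the sketch below drops stopwords from one sentence. To stay dependency-free it uses a small hand-picked subset of the NLTK list; `SAMPLE_STOPWORDS` and `remove_stopwords` are hypothetical names, and in the notebook the full `stopwords.words('english')` list would take the subset's place.

```python
import re

# Hand-picked subset of the NLTK English stopword list, for illustration only.
SAMPLE_STOPWORDS = {"the", "is", "and", "a", "of", "in", "to"}

def remove_stopwords(sentence, stop_words=SAMPLE_STOPWORDS):
    """Lowercase the sentence, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [token for token in tokens if token not in stop_words]

filtered = remove_stopwords("The senator is in the capital and the vote is close")
print(filtered)  # ['senator', 'capital', 'vote', 'close']
```

Using a set rather than a list keeps each membership check fast, which matters once the filter runs over thousands of articles.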

In [30]:

print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

# Allow User to Input the File Path or Use a Relative Path

In [31]:

news_dataset = pd.read_csv(r'C:\Users\Abdou\Desktop\Data analytics Projects\Python\Machine learning\Datasets\train.csv\train.csv')
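
The cell above hardcodes an absolute path that only exists on one machine. One possible way to accept a user-supplied or relative path is sketched below; `resolve_dataset_path` and the `NEWS_DATASET_PATH` environment variable are hypothetical names introduced for illustration, not part of the original notebook.

```python
import os
from pathlib import Path

def resolve_dataset_path(user_path=None, default="train.csv"):
    """Pick the dataset path: explicit argument, then env var, then a relative default."""
    # NEWS_DATASET_PATH is a hypothetical environment variable used for illustration.
    candidate = user_path or os.environ.get("NEWS_DATASET_PATH") or default
    return Path(candidate).expanduser()

dataset_path = resolve_dataset_path()
print(dataset_path)
# In the notebook, this path could then be passed to pd.read_csv(dataset_path).
```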


## Data Pre-processing

In [32]:
news_dataset.shape

(20800, 5)

Print the first 5 rows of the DataFrame.


In [33]:

news_dataset.head()

   id                                              title              author                                               text  label
0   0  House Dem Aide: We Didn’t Even See Comey’s Let...       Darrell Lucus  House Dem Aide: We Didn’t Even See Comey’s Let...      1
1   1  FLYNN: Hillary Clinton, Big Woman on Campus - ...     Daniel J. Flynn  Ever get the feeling your life circles the rou...      0
2   2                  Why the Truth Might Get You Fired  Consortiumnews.com  Why the Truth Might Get You Fired October 29, ...      1
3   3  15 Civilians Killed In Single US Airstrike Hav...     Jessica Purkiss  Videos 15 Civilians Killed In Single US Airstr...      1
4   4  Iranian woman jailed for fictional unpublished...      Howard Portnoy  Print \nAn Iranian woman has been sentenced to...      1


Count the missing values in each column.


In [34]:

news_dataset.isnull().sum()


id           0
title      558
author    1957
text        39
label        0
dtype: int64

Replace the null values with empty strings.


In [35]:

news_dataset = news_dataset.fillna('')

Merge the author name and the news title into a single `content` column.


In [36]:

news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']

In [37]:
print(news_dataset['content'])

0        Darrell Lucus House Dem Aide: We Didn’t Even S...
1        Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2        Consortiumnews.com Why the Truth Might Get You...
3        Jessica Purkiss 15 Civilians Killed In Single ...
4        Howard Portnoy Iranian woman jailed for fictio...
                               ...                        
20795    Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
20796    Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma...
20797    Michael J. de la Merced and Rachel Abrams Macy...
20798    Alex Ansary NATO, Russia To Hold Parallel Exer...
20799              David Swanson What Keeps the F-35 Alive
Name: content, Length: 20800, dtype: object


Stemming:

Stemming is a natural language processing technique that reduces words to a root form by stripping affixes, most commonly suffixes. For example, stemming "running" and "runs" yields the common root "run." A stem need not be a dictionary word: the Porter stemmer used below turns "house" into "hous."
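
A minimal sketch of the idea, assuming NLTK is installed (the Porter stemmer itself needs no extra corpus download). The sample words are chosen for illustration.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of the same verb collapse to one stem.
words = ["running", "runs", "run"]
stems = [stemmer.stem(word) for word in words]
print(stems)  # ['run', 'run', 'run']

# Stems are not guaranteed to be dictionary words.
print(stemmer.stem("house"))  # 'hous', as in the notebook output
```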

In [38]:
port_stem = PorterStemmer()
# Build the stopword set once; calling stopwords.words() for every word is very slow.
stop_words = set(stopwords.words('english'))

def stemming(content):
    # Keep letters only, lowercase, then split into words.
    words = re.sub(r'[^a-zA-Z]', ' ', content).lower().split()
    # Drop stopwords and reduce the remaining words to their stems.
    stemmed_words = [port_stem.stem(word) for word in words if word not in stop_words]
    return ' '.join(stemmed_words)

news_dataset['content'] = news_dataset['content'].apply(stemming)

In [39]:
print(news_dataset['content'])

0        darrel lucu hous dem aid even see comey letter...
1        daniel j flynn flynn hillari clinton big woman...
2                   consortiumnew com truth might get fire
3        jessica purkiss civilian kill singl us airstri...
4        howard portnoy iranian woman jail fiction unpu...
                               ...                        
20795    jerom hudson rapper trump poster child white s...
20796    benjamin hoffman n f l playoff schedul matchup...
20797    michael j de la merc rachel abram maci said re...
20798    alex ansari nato russia hold parallel exercis ...
20799                            david swanson keep f aliv
Name: content, Length: 20800, dtype: object


Separate the data and the label.


In [40]:

X = news_dataset['content'].values
Y = news_dataset['label'].values

In [41]:
print(X)

['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire' ...
 'michael j de la merc rachel abram maci said receiv takeov approach hudson bay new york time'
 'alex ansari nato russia hold parallel exercis balkan'
 'david swanson keep f aliv']


In [42]:
print(Y)

[1 0 1 ... 0 1 1]


In [43]:
Y.shape

(20800,)

# Use stratified K-fold cross-validation


In [None]:

from sklearn.model_selection import StratifiedKFold

# Stratified folds keep the real/fake ratio the same in every split.
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=None)
scores = []

# Split the dataset into training and test data for each fold.
for train_index, test_index in kf.split(X, Y):
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]

    # Convert the text to TF-IDF features, fitting on the training fold only.
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(X_train)
    X_test = vectorizer.transform(X_test)

    # Train the model: Logistic Regression.
    model = LogisticRegression()
    model.fit(X_train, Y_train)
    score = model.score(X_test, Y_test)
    scores.append(score)

print(f"Average cross-validation accuracy: {np.mean(scores):.2f}")
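
The loop above refits the vectorizer inside every fold so that no test-fold vocabulary leaks into training. A more compact way to get the same guarantee is to wrap both steps in a scikit-learn pipeline and hand it to `cross_val_score`. The sketch below assumes a tiny made-up corpus in place of the stemmed `content` column, so its scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in for the stemmed `content` column (made up for illustration).
texts = [
    "city council approves new library budget",
    "local school wins regional science fair",
    "senate committee schedules budget hearing",
    "university publishes annual enrollment report",
    "transit agency announces weekend schedule change",
    "health department releases flu season report",
    "shocking secret cure doctors hide from you",
    "aliens secretly control city council claims insider",
    "miracle pill melts fat overnight shocking report",
    "you will not believe this secret budget scandal",
    "celebrity clone spotted secret insider claims",
    "shocking proof moon landing staged insider claims",
]
labels = [0] * 6 + [1] * 6  # 0 = real, 1 = fake

# The pipeline refits TF-IDF on each training fold before fitting the classifier.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.2f}")
```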

# Tune the Logistic Regression model


In [None]:

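params = {
    'C': [0.1, 1, 10],
    'penalty': ['l1', 'l2']
}

best_score = 0
best_params = None
best_model = None

# Note: X_train and X_test come from the last cross-validation fold above.
# Selecting parameters on X_test makes the reported test accuracy optimistic.
for c in params['C']:
    for penalty in params['penalty']:
        # liblinear supports both the l1 and l2 penalties.
        candidate = LogisticRegression(C=c, penalty=penalty, solver='liblinear')
        candidate.fit(X_train, Y_train)
        score = candidate.score(X_test, Y_test)
        if score > best_score:
            best_score = score
            best_params = {'C': c, 'penalty': penalty}
            best_model = candidate

# Keep the best model, not the last one fitted, for the evaluation below.
model = best_model

print(f"Best Logistic Regression model parameters: {best_params}")
print(f"Best test accuracy: {best_score:.2f}")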
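
The manual loop above picks its parameters by scoring on the test fold. One common alternative is to let `GridSearchCV` search the same grid with cross-validation on the training data alone, which leaves the test fold untouched for the final check. The sketch below is an assumption-laden illustration on a made-up corpus, not the notebook's own tuning procedure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy training data standing in for the stemmed articles (made up for illustration).
train_texts = [
    "city council approves new library budget",
    "local school wins regional science fair",
    "senate committee schedules budget hearing",
    "university publishes annual enrollment report",
    "transit agency announces weekend schedule change",
    "health department releases flu season report",
    "shocking secret cure doctors hide from you",
    "aliens secretly control city council claims insider",
    "miracle pill melts fat overnight shocking report",
    "you will not believe this secret budget scandal",
    "celebrity clone spotted secret insider claims",
    "shocking proof moon landing staged insider claims",
]
train_labels = [0] * 6 + [1] * 6  # 0 = real, 1 = fake

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Same grid as the manual loop, addressed through the pipeline step name.
param_grid = {"clf__C": [0.1, 1, 10], "clf__penalty": ["l1", "l2"]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(train_texts, train_labels)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

After fitting, `search.best_estimator_` holds the pipeline refit on all the training data with the winning parameters.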

# Evaluation

Accuracy score on the training data


In [None]:

X_train_prediction = model.predict(X_train)
# accuracy_score expects (y_true, y_pred).
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [47]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.9990384615384615


Accuracy score on the test data


In [None]:

X_test_prediction = model.predict(X_test)
# accuracy_score expects (y_true, y_pred).
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [49]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.9887019230769231
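
Accuracy alone does not show which kind of mistake the model makes. A confusion matrix splits the errors into real articles flagged as fake and fake articles passed as real. The sketch below uses made-up labels purely for illustration; in the notebook, `Y_test` and `X_test_prediction` would take their place.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration: 0 = real, 1 = fake.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)

print(f"accuracy={accuracy:.2f}")
print(f"real flagged as fake={fp}, fake passed as real={fn}")
```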


# Making a Predictive System

In [55]:
# Take one already-vectorized sample from the test fold.
X_new = X_test[6]

prediction = model.predict(X_new)
print(prediction)

if prediction[0] == 0:
    print('The news is Real')
else:
    print('The news is Fake')

[0]
The news is Real
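
The cell above classifies a row that is already vectorized. For raw input, the same preprocessing and the same fitted vectorizer have to be applied first. One possible wrapper is sketched below; `predict_news` and `simple_preprocess` are hypothetical helpers, and the tiny training set is made up so the example runs on its own. In the notebook, the `stemming` function, `vectorizer`, and `model` from the earlier cells would be passed in instead.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def simple_preprocess(text):
    """Stand-in for the notebook's `stemming` function: letters only, lowercased."""
    return " ".join(re.sub(r"[^a-zA-Z]", " ", text).lower().split())

def predict_news(author, title, vectorizer, model, preprocess=simple_preprocess):
    """Classify one article from its author and title; returns 'Real' or 'Fake'."""
    content = preprocess(f"{author} {title}")
    features = vectorizer.transform([content])  # reuse the fitted vocabulary
    return "Real" if model.predict(features)[0] == 0 else "Fake"

# Tiny made-up training set so the sketch is self-contained (0 = real, 1 = fake).
toy_texts = [
    "city council approves new library budget",
    "senate committee schedules budget hearing",
    "health department releases flu season report",
    "shocking secret cure doctors hide from you",
    "miracle pill melts fat overnight shocking report",
    "shocking proof moon landing staged insider claims",
]
toy_labels = [0, 0, 0, 1, 1, 1]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

result = predict_news("Jane Doe", "Senate passes budget bill", toy_vectorizer, toy_model)
print(f"The news is {result}")
```

Calling `transform` rather than `fit_transform` on new input matters: refitting would build a different vocabulary and the model's learned weights would no longer line up with the feature columns.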




<div style="background-color: #f4f4f4; padding: 20px; border-radius: 5px;">
  <h2 style="color: #333; margin-bottom: 10px;">Conclusion</h2>
  <p style="color: #666; line-height: 1.5;">
    This project demonstrates a basic framework for building a fake news detection system using machine learning techniques. The code preprocesses the input dataset, extracts features from the text data, trains a Logistic Regression model, and evaluates its performance.
  </p>
  <p style="color: #666; line-height: 1.5;">
    The key highlights of this project include:
  </p>
  <ul style="color: #666; margin-top: 10px;">
    <li>Handling of missing values and text preprocessing</li>
    <li>Conversion of text data to numerical features using TF-IDF</li>
    <li>Model training and evaluation using K-fold cross-validation</li>
    <li>Hyperparameter tuning for the Logistic Regression model</li>
    <li>A simple predictive system that classifies an individual article from the test data</li>
  </ul>
</div>


# The adventure continues...


<div style="background-color: #f0f8ff; padding: 20px; border-radius: 5px;">
  <p style="color: #666; line-height: 1.5;">
    While the project shows promising results, there are several areas that could be improved to make the system more robust and effective:
  </p>
  <ul style="color: #666; margin-top: 10px;">
    <li>Exploring more advanced feature engineering techniques (e.g., named entity recognition, sentiment analysis)</li>
    <li>Incorporating additional data sources (e.g., social media, fact-checking websites) to enrich the training data</li>
    <li>Experimenting with more sophisticated machine learning models (e.g., ensemble methods, neural networks)</li>
    <li>Implementing a more user-friendly interface for real-world deployment</li>
    <li>Conducting further evaluation on a larger, more diverse dataset to assess the model's generalization capabilities</li>
  </ul>
</div>


