# **Predicting Fake News with PySpark**

This analysis builds an NLP model to distinguish real news from fake news. It uses two datasets published on Kaggle by Clément Bisaillon and runs PySpark in Google Colab.

Link: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset?select=True.csv  
Citations:
- Ahmed H, Traore I, Saad S. “Detecting opinion spams and fake news using text classification”, Journal of Security and Privacy, Volume 1, Issue 1, Wiley, January/February 2018.
- Ahmed H, Traore I, Saad S. (2017) “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques”. In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).


In [1]:
# Install PySpark for this Google Colab Notebook.
!pip install pyspark

# Import 'drive' to upload the necessary files from the Google Drive folder.
from google.colab import drive
drive.mount('/content/drive')

# Import SparkSession from pyspark.sql to start a session.
from pyspark.sql import SparkSession

# Assign a SparkSession to the object 'spark'.
spark = SparkSession.builder.appName('fake_news').getOrCreate()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Load the two datasets and assign them to the objects 'real' for the real news data and 'fake' for the fake news data.
real = spark.read.csv('/content/drive/My Drive/Colab Notebooks/fake news/True.csv', inferSchema=True, header=True)
fake = spark.read.csv('/content/drive/My Drive/Colab Notebooks/fake news/Fake.csv', inferSchema=True, header=True)

## Data Visualization and Cleaning

In [3]:
# Visualize the first row of the 'real' dataset. Set 'truncate' to False to see all of the contents.
real.show(1, truncate=False)


In [4]:
# Visualize the first row of the 'fake' dataset with the same parameters used for 'real'.
fake.show(1, truncate=False)


### Notes for the 'real' and 'fake' datasets
A few steps are needed before the title and text cells can be used for NLP:
1. Convert all characters to lowercase to avoid duplicate entries caused by capitalization.
2. Remove the location and the name of the news organization from the 'real' dataset so that the model learns to separate real from fake news without relying on those markers.
3. Remove all punctuation.
4. Insert a space where a year (e.g. '2017') runs directly into a word (e.g. 'Trump') in the 'fake' dataset. Without the space, the fused string would become its own token, separate from the year and the word.
5. Remove leftover encoding text (e.g. 'xa0') from the documents.

Also, we need to add labels ('real' and 'fake') to the two datasets since they don't have any.

In [5]:
# Import 'lit' to add a constant-valued column that labels each dataset as 'real' or 'fake'.
from pyspark.sql.functions import lit

# Use the 'withColumn' method of PySpark DataFrames to add the label column to each dataset.
real = real.withColumn(colName='label', col=lit('real'))
fake = fake.withColumn(colName='label', col=lit('fake'))

# Visualize the modifications done to the datasets.
real.show(5)
fake.show(5)

+--------------------+--------------------+------------+------------------+-----+
|               title|                text|     subject|              date|label|
+--------------------+--------------------+------------+------------------+-----+
|As U.S. budget fi...|WASHINGTON (Reute...|politicsNews|December 31, 2017 | real|
|U.S. military to ...|WASHINGTON (Reute...|politicsNews|December 29, 2017 | real|
|Senior U.S. Repub...|WASHINGTON (Reute...|politicsNews|December 31, 2017 | real|
|FBI Russia probe ...|WASHINGTON (Reute...|politicsNews|December 30, 2017 | real|
|Trump wants Posta...|SEATTLE/WASHINGTO...|politicsNews|December 29, 2017 | real|
+--------------------+--------------------+------------+------------------+-----+
only showing top 5 rows

+--------------------+--------------------+-------+-----------------+-----+
|               title|                text|subject|             date|label|
+--------------------+--------------------+-------+-----------------+-----+
| Donald 

### Notes on concatenation of the two datasets and column removal
We can now concatenate the two datasets since both already carry labels. We can also remove the 'subject' and 'date' columns, since these will not be used in the analysis.

In [6]:
# Create a new dataset 'full_data' that combines 'real' and 'fake' through the 'union' method, which matches columns by position.
# Drop 'subject' and 'date' with the 'drop' method.
full_data = real.union(fake).drop('subject','date')

In [7]:
# Check for null values: the 'count' row of 'describe' reports the number of non-null entries per column.
full_data.describe().show()

+-------+--------------------+--------------------+-----+
|summary|               title|                text|label|
+-------+--------------------+--------------------+-----+
|  count|               44906|               44898|44906|
|   mean|                null|                null| null|
| stddev|                null|                null| null|
|    min| #AfterTrumpImplo...|                    | fake|
|    max|“You’re Not Welco...|youngers these da...| real|
+-------+--------------------+--------------------+-----+



### Removal of null values
The 'text' column has 8 fewer non-null entries than the other columns (44898 vs 44906). Since only 8 rows are affected, we can drop them.

In [8]:
# Drop rows with any null values.
full_data = full_data.na.drop(how='any')

In [9]:
# Check if there are rows with missing values.
full_data.describe().show()

+-------+--------------------+--------------------+-----+
|summary|               title|                text|label|
+-------+--------------------+--------------------+-----+
|  count|               44898|               44898|44898|
|   mean|                null|                null| null|
| stddev|                null|                null| null|
|    min| #AfterTrumpImplo...|                    | fake|
|    max|“You’re Not Welco...|youngers these da...| real|
+-------+--------------------+--------------------+-----+



In [10]:
# Double-check the 'text' column for NaN values. 'isnan' flags NaN only; null rows were already dropped above.
from pyspark.sql.functions import count, when, isnan
full_data.select(count(when(isnan('text'),1)).alias('Text NaN values')).show()

+---------------+
|Text NaN values|
+---------------+
|              0|
+---------------+



### Notes on empty data in the 'text' column
We can retain the rows whose 'text' value is an empty string, since the 'title' and 'text' columns will be merged and the title still provides content.

In [11]:
# Import the 'concat' and 'col' functions to merge the 'title' and 'text' columns. Drop the 'title' and 'text' columns afterwards.
from pyspark.sql.functions import concat, col

full_data = full_data.withColumn('full_text', concat(col('title'), lit(' '), col('text')))\
                    .drop('title','text')

# Check the modified data.
full_data.show()

+-----+--------------------+
|label|           full_text|
+-----+--------------------+
| real|As U.S. budget fi...|
| real|U.S. military to ...|
| real|Senior U.S. Repub...|
| real|FBI Russia probe ...|
| real|Trump wants Posta...|
| real|White House, Cong...|
| real|Trump says Russia...|
| real|Factbox: Trump on...|
| real|Trump on Twitter ...|
| real|Alabama official ...|
| real|Jones certified U...|
| real|New York governor...|
| real|Factbox: Trump on...|
| real|Trump on Twitter ...|
| real|Man says he deliv...|
| real|Virginia official...|
| real|U.S. lawmakers qu...|
| real|Trump on Twitter ...|
| real|U.S. appeals cour...|
| real|Treasury Secretar...|
+-----+--------------------+
only showing top 20 rows



### Text modification through Regular Expressions
As mentioned in "Notes for the 'real' and 'fake' datasets", the 'full_text' column needs to be cleaned before the model can learn from it properly.

We will convert the text to lowercase and use Regular Expressions to clean the data as much as possible.

In [12]:
# Import 'lower' and 'regexp_replace' to clean the 'full_text' column.
from pyspark.sql.functions import lower, regexp_replace

# Convert all of the characters to lower case.
clean_data = full_data.select('label', lower('full_text').alias('full_text'))

# Remove the name of the news organization and its location.
clean_data = clean_data.select('label', regexp_replace('full_text',r'[a-z]+ \(\w+\) -','').alias('full_text'))

# Remove punctuation marks since they will not be used.
clean_data = clean_data.select('label', regexp_replace('full_text',r'[^\w\s]','').alias('full_text'))

# Put spaces between digits and letters so that the model can learn them separately.
clean_data = clean_data.select('label', regexp_replace('full_text',r'(\d+)(\D+)', r'$1 $2').alias('full_text'))

# Remove formatting text seen in selected rows in the dataset.
clean_data = clean_data.select('label', regexp_replace('full_text',r'xa0',r'').alias('full_text'))

# Check a row to see the changes.
clean_data.show(1, truncate=False)


## Model Creation with TF-IDF
To create a prediction model with all of the rows and the words they carry, TF-IDF will be used to highlight the words that carry significance in determining real vs fake news.
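
To make the weighting concrete, here is a minimal plain-Python sketch of TF-IDF on three toy documents. It uses raw counts as the term frequency and the smoothed inverse document frequency `log((N + 1) / (df + 1))`, which is the formula documented for Spark MLlib's `IDF`. The function name `tf_idf` and the toy documents are hypothetical, and the sketch keeps terms in a dictionary instead of hashing them into a fixed-size vector.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)  # raw counts within this document
        weights.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in term_freq.items()
        })
    return weights

# Hypothetical toy corpus, already tokenized.
docs = [
    ["senate", "passes", "budget"],
    ["shocking", "video", "budget"],
    ["senate", "vote", "delayed"],
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, 'passes' appears in only one document and therefore receives a larger weight than 'budget', which appears in two.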

In [13]:
# Create 'train' and 'test' sets from 'clean_data'. 'sampleBy' draws roughly 70% of the rows of each label for 'train'.
train = clean_data.sampleBy('label', fractions={'real':0.7, 'fake':0.7}, seed=1)
# Leave the rest to 'test'. 'subtract' is a set difference, so duplicate rows and rows identical to a 'train' row are dropped.
test = clean_data.subtract(train)
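
The split above samples each row independently, so the 70% share per label is approximate, not exact. The plain-Python sketch below imitates that behaviour on synthetic rows; `sample_by` is a hypothetical helper, not the PySpark method, and it routes unsampled rows straight into the test set without using a set difference.

```python
import random
from collections import Counter

def sample_by(rows, fractions, seed):
    """Keep each (label, text) row for training with probability fractions[label]."""
    rng = random.Random(seed)
    train, test = [], []
    for label, text in rows:
        target = train if rng.random() < fractions[label] else test
        target.append((label, text))
    return train, test

# Synthetic rows: 1000 per label.
rows = [("real", f"r{i}") for i in range(1000)] + [("fake", f"f{i}") for i in range(1000)]
train, test = sample_by(rows, {"real": 0.7, "fake": 0.7}, seed=1)

# Per-label counts land near 700 but rarely exactly on it.
print(Counter(label for label, _ in train))
```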

### Pipeline for the Features Modification Steps
A pipeline will be used to prepare the features for the model. Its stages are a tokenizer, a stop words remover, a term frequency function, an inverse document frequency function, and a string indexer.
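
The term frequency stage relies on the hashing trick: each token is mapped to a bucket index, and the counts per bucket form a sparse vector. A minimal plain-Python sketch follows. It assumes a tiny hypothetical stop list and uses `zlib.crc32` as a stand-in hash (Spark's `HashingTF` uses MurmurHash3, so the bucket indices will differ). The default of 2^18 buckets mirrors the default `numFeatures` of `HashingTF`.

```python
import zlib

STOP_WORDS = {"the", "a", "of", "to", "in"}  # tiny hypothetical stop list

def hashed_term_counts(tokens, num_features=1 << 18):
    """Drop stop words, then count the remaining tokens per hash bucket."""
    counts = {}
    for token in tokens:
        if token in STOP_WORDS:
            continue
        # Stand-in hash; any deterministic hash illustrates the idea.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts

tokens = "the senate passes the budget".split()
counts = hashed_term_counts(tokens)
print(counts)
```

Because the vector size is fixed in advance, no vocabulary has to be built, at the cost of occasional collisions between different tokens.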

In [14]:
# Import 'Pipeline'.
from pyspark.ml import Pipeline

# Import 'RegexTokenizer', 'StopWordsRemover', 'HashingTF', 'IDF', and 'StringIndexer' to prepare the features for the model.
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, HashingTF, IDF, StringIndexer

# Create instances of the functions for the features modification. Link the input and output columns through the functions.
# Set the Regex Tokenizer to separate on non-alphanumeric characters (\W).
regex_tokenizer = RegexTokenizer(inputCol='full_text', outputCol='token_text', pattern='\\W')

# Set Stop Words Remover with the input coming from 'regex_tokenizer' and set output column to 'token_stop_text'.
stopwords_remover = StopWordsRemover(inputCol='token_text', outputCol='token_stop_text')

# Receive input from 'stopwords_remover' and set output column to 'tf_col'.
hashing_tf = HashingTF(inputCol='token_stop_text', outputCol='tf_col')

# Compute IDF weights from 'tf_col' and write the result to the 'features' column.
idf = IDF(inputCol='tf_col', outputCol='features')

# Convert 'real' and 'fake' labels to numerics for the model to use.
binary_labels = StringIndexer(inputCol='label', outputCol='bin_labels')

In [15]:
# Create a pipeline that houses all of the functions for the model.
pipe = Pipeline(stages=[regex_tokenizer,stopwords_remover,hashing_tf,idf,binary_labels])

# Fit the pipeline on the 'train' set.
model_fit = pipe.fit(train)

# Transform both 'train' and 'test' based on the fitted model.
train_transformed = model_fit.transform(train)
test_transformed = model_fit.transform(test)

# Select only the binary labels and features columns.
train_clean = train_transformed.select('bin_labels','features')
test_clean = test_transformed.select('bin_labels','features')
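
It helps to know which label ends up as 0 and which as 1. With its default ordering (`frequencyDesc`), `StringIndexer` assigns index 0 to the most frequent label. The plain-Python sketch below mimics that rule; `string_index` is a hypothetical helper, and the alphabetical tie-break is an assumption about how equal frequencies are resolved.

```python
from collections import Counter

def string_index(labels):
    """Map labels to float indices, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency, then alphabetically (assumed tie-break).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}

# Toy label column in which 'fake' is the more frequent class.
mapping = string_index(["fake", "real", "fake", "real", "fake"])
print(mapping)
```

Here 'fake' maps to 0.0 and 'real' to 1.0 because 'fake' is more frequent in the toy input. In the notebook, the mapping depends on the label counts in the 'train' set.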

### Model Prediction with Logistic Regression
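We will use Logistic Regression to predict the labels in the clean test set. We will then evaluate the predictions with the area under the ROC curve (AUC), which measures how well the model separates the two classes and is not the same as plain accuracy.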
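
As a reminder of what the classifier computes, the sketch below scores one sparse feature vector with the logistic function and applies a 0.5 threshold. The weights and feature values are hypothetical placeholders, not coefficients learned from this dataset, and `predict_proba` is an illustrative helper, not a PySpark call.

```python
import math

def predict_proba(features, weights, bias=0.0):
    """Probability of class 1 for a sparse {index: value} feature vector."""
    z = bias + sum(weights.get(index, 0.0) * value for index, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and TF-IDF values for two feature indices.
weights = {0: 1.5, 1: -2.0}
features = {0: 0.69, 1: 0.29}

probability = predict_proba(features, weights)
label = 1.0 if probability > 0.5 else 0.0  # hard prediction at a 0.5 threshold
print(probability, label)
```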

In [16]:
# Import 'LogisticRegression'.
from pyspark.ml.classification import LogisticRegression

# Create an instance of 'LogisticRegression' and set the maximum number of iterations to 100.
logreg = LogisticRegression(featuresCol='features',
                            labelCol='bin_labels',
                            predictionCol='predictions',
                            maxIter=100)

# Fit the model with the clean train data.
lr_model = logreg.fit(train_clean)

# Generate predictions for the clean test data with the fitted 'lr_model'.
predictions = lr_model.transform(test_clean)

In [17]:
# Import BinaryClassificationEvaluator to generate AUC score.
from pyspark.ml.evaluation import BinaryClassificationEvaluator

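# Create an instance of BinaryClassificationEvaluator; its default metric is 'areaUnderROC'.
# Note: 'predictions' holds the hard 0/1 labels, so the score reflects a single threshold. Passing 'rawPrediction' would score the full ranking.
evaluator = BinaryClassificationEvaluator(rawPredictionCol='predictions',labelCol='bin_labels')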

# Evaluate the predictions and labels from the 'predictions' dataset.
AUC = evaluator.evaluate(predictions)

# Print AUC score.
print(AUC)

0.9526859951520478
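
The value of an AUC score depends on whether it is computed from continuous scores or from hard 0/1 predictions. The plain-Python sketch below computes AUC as the probability that a random positive example outranks a random negative one, with ties counting half. The labels and probabilities are made-up toy values, and `auc` is a hypothetical helper, not the Spark evaluator. On hard predictions the measure reduces to the mean of the true positive rate and the true negative rate.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Toy data: three positives and three negatives.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
hard = [1 if p > 0.5 else 0 for p in probs]  # thresholded predictions

auc_from_scores = auc(labels, probs)  # uses the full ranking
auc_from_hard = auc(labels, hard)     # uses a single operating point
print(auc_from_scores, auc_from_hard)
```

On this toy data the score-based AUC is 8/9 while the hard-label AUC is 2/3, so the two variants can differ noticeably for the same model.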


## Conclusion
Our model produced an AUC score of about 0.95 on the held-out test set. This suggests that the model separates real news from fake news well on this dataset.