
# Real / Fake News Classification

## TASK 1: Download, Have a Look, and Preprocess Data

### TASK 1.1 [1 point]: Download the data.
Put the two CSV files into the same directory as `main.ipynb`. The following code checks whether you placed them correctly.

In [4]:
import os
files = os.listdir('.')
print("real.csv is in the folder:", "real.csv" in files)
print("fake.csv is in the folder:", "fake.csv" in files)

real.csv is in the folder: True
fake.csv is in the folder: True


### TASK 1.2 [0 point]: Load the data and have a first glance.

In [5]:
import pandas as pd
real_df = pd.read_csv('real.csv')
fake_df = pd.read_csv('fake.csv')
print("Real dataset shape:", real_df.shape)
print("Fake dataset shape:", fake_df.shape)
print("First 3 rows of real dataset:")
print(real_df.head(3))
print("\n\nFirst 3 rows of fake dataset:")
print(fake_df.head(3))

Real dataset shape: (5000, 4)
Fake dataset shape: (5000, 4)
First 3 rows of real dataset:
                                               title  \
0  Thousands march in Helsinki in far-right, anti...   
1  Marseille attacker probably radicalized by bro...   
2  U.S. farmers slam Trump's Cuba clampdown, pres...   

                                                text       subject  \
0  HELSINKI (Reuters) - Supporters of the far rig...     worldnews   
1  ROME (Reuters) - The brother of the man who ki...     worldnews   
2  CHICAGO (Reuters) - U.S. farm groups criticize...  politicsNews   

                date  
0  December 6, 2017   
1   October 9, 2017   
2     June 16, 2017   


First 3 rows of fake dataset:
                                               title  \
0  EP #15: Patrick Henningsen LIVE – ‘Crisis of A...   
1   Desperate For Members, White Supremacists Beg...   
2   Black Caucus Demands Action: Calls On Congres...   

                                                text  s

### TASK 1.3 [3 Points] Sanity Check. See whether there are any missing entries.

Count the number of empty entries in each column; then we can decide how to handle them.

In [6]:
def count_empty_entries(df):
    # An entry counts as empty if it is NaN or consists of one or two spaces
    for column in df.columns:
        empty_count = 0
        for row in df[column]:
            if pd.isna(row) or row == ' ' or row == '  ':
                empty_count += 1

        print(f"Column '{column}' has {empty_count} empty entries.")

print("Counting empty entries in the real dataset:")
count_empty_entries(real_df)
print("\nCounting empty entries in the fake dataset:")
count_empty_entries(fake_df)


Counting empty entries in the real dataset:
Column 'title' has 6 empty entries.
Column 'text' has 6 empty entries.
Column 'subject' has 4 empty entries.
Column 'date' has 5 empty entries.

Counting empty entries in the fake dataset:
Column 'title' has 5 empty entries.
Column 'text' has 143 empty entries.
Column 'subject' has 5 empty entries.
Column 'date' has 10 empty entries.
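
The row-by-row loop above can also be written with vectorized pandas operations. The sketch below is one possible variant, not the assignment's reference solution: the helper name `count_blank_entries` and the toy frame are made up for illustration, and it treats any whitespace-only string as empty, which is slightly broader than the one-or-two-space check used above.

```python
import pandas as pd

def count_blank_entries(df):
    """Count cells per column that are NaN or contain only whitespace."""
    # astype(str) turns missing values into text such as "nan", so they are caught by isna() instead
    whitespace_only = df.astype(str).apply(lambda col: col.str.strip() == "")
    return (df.isna() | whitespace_only).sum()

# Hypothetical toy data, only to show the behaviour
toy = pd.DataFrame({
    "title": ["a", " ", None, "b"],
    "text": ["x", "", "   ", "y"],
})
counts = count_blank_entries(toy)
print(counts)
```

Because the result is a `Series` indexed by column name, it can be compared across `real_df` and `fake_df` directly instead of being read off printed lines.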


### TASK 1.4 [(1+2) Points]. Delete some columns and rows.


In [7]:
# 1 point
def remove_subject_and_date(df):
    df = df.drop(columns=['subject', 'date'])
    return df

real_df = remove_subject_and_date(real_df)
fake_df = remove_subject_and_date(fake_df)
print("real_df's current columns:,", real_df.columns)
print("fake_df's current columns:,", fake_df.columns)

real_df's current columns:, Index(['title', 'text'], dtype='object')
fake_df's current columns:, Index(['title', 'text'], dtype='object')


In [8]:
# 2 points
def remove_rows_with_empty_entries(df):
    df = df[(df['title'].notna()) & (df['title'] != " ") & (df['text'].notna()) & (df['text'] != " ")]

    return df

real_df = remove_rows_with_empty_entries(real_df)
fake_df = remove_rows_with_empty_entries(fake_df)
print("After removing rows with empty entries:")
print("Real dataset shape after cleaning:", real_df.shape)
print("Fake dataset shape after cleaning:", fake_df.shape)

After removing rows with empty entries:
Real dataset shape after cleaning: (4988, 2)
Fake dataset shape after cleaning: (4853, 2)
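
The filter above only matches a single space. If the data might contain other whitespace-only entries (tabs, several spaces), a stricter filter could look like the sketch below. The function name `drop_blank_rows` and the toy frame are assumptions for illustration; the row counts printed above come from the original filter, not from this one.

```python
import pandas as pd

def drop_blank_rows(df, columns=("title", "text")):
    """Return a copy of df without rows whose listed columns are NaN or whitespace-only."""
    subset = df[list(columns)]
    is_blank = subset.isna() | subset.astype(str).apply(lambda col: col.str.strip() == "")
    # .copy() gives an independent frame, so later column assignments do not touch a slice
    return df.loc[~is_blank.any(axis=1)].copy()

# Hypothetical toy data: only the first row is fully populated
toy = pd.DataFrame({
    "title": ["headline a", " ", "headline c", "headline d"],
    "text": ["body a", "body b", None, "\t"],
})
cleaned = drop_blank_rows(toy)
print(cleaned.shape)  # (1, 2)
```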


### TASK 1.5 [5 Points] Remove special characters.
You may notice there are many special characters in the text data, such as punctuation marks, numbers, and other non-alphabetic characters.

Please **remove all these special characters** from the text data in both datasets. **Only keep alphabetic characters and spaces.**

In addition, **make all the characters lower-case**.

In [9]:
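def remove_special_characters(df):
    # Work on a copy so that a filtered slice of the original frame is not modified in place
    df = df.copy()
    # Drop every character that is not an ASCII letter or a space, then lower-case the result
    for column in ['text', 'title']:
        df[column] = df[column].str.replace(r'[^a-zA-Z ]', '', regex=True).str.lower()
    return df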

real_df = remove_special_characters(real_df)
fake_df = remove_special_characters(fake_df)
print("After removing special characters:")
print("Real dataset shape after cleaning:", real_df.head(3))
print("\n\nFake dataset shape after cleaning:", fake_df.head(3))

After removing special characters:
Real dataset shape after cleaning:                                                title  \
0  thousands march in helsinki in farright antifa...   
1  marseille attacker probably radicalized by bro...   
2  us farmers slam trumps cuba clampdown press fo...   

                                                text  
0  helsinki reuters  supporters of the far right ...  
1  rome reuters  the brother of the man who kille...  
2  chicago reuters  us farm groups criticized pre...  


Fake dataset shape after cleaning:                                                title  \
0  ep  patrick henningsen live  crisis of america...   
1   desperate for members white supremacists beg ...   
2   black caucus demands action calls on congress...   

                                                text  
0   join patrick every wednesday at independent t...  
1  by now everyone is aware of how donald trump s...  
2  following the horrific tragedy in dallas texas...  
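
To see what the character filter does to a single string, here is a small standalone sketch using the standard `re` module. The helper names are made up for illustration. Note that removing punctuation can leave double spaces behind (visible in `reuters  supporters` above); collapsing them is an optional extra step that the task does not require.

```python
import re

def keep_letters_and_spaces(s):
    """Remove everything except ASCII letters and spaces, then lower-case."""
    return re.sub(r"[^a-zA-Z ]", "", s).lower()

def collapse_spaces(s):
    """Optional extra step: squeeze runs of whitespace into one space."""
    return re.sub(r"\s+", " ", s).strip()

sample = "HELSINKI (Reuters) - Supporters"
cleaned = keep_letters_and_spaces(sample)
print(repr(cleaned))                   # 'helsinki reuters  supporters'
print(repr(collapse_spaces(cleaned)))  # 'helsinki reuters supporters'
```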


## TASK 2. Create the training and testing sets.

Now that the data is pre-processed, we construct a training set and a testing set to build and evaluate our classifier.

We will follow the steps:
1. TASK 2.1. Assign labels to each row.
2. TASK 2.2. Concatenate the two tables `real_df` and `fake_df` into one data frame `union_df`.
3. TASK 2.3. Split `union_df` into `train_df` and `test_df`, where the training/testing set contains 80%/20% of the data.

### TASK 2.1 [2 Points]. Assign labels to each row.
Add a `label` column to each data frame. For `real_df`, the `label` column is all 1. For `fake_df`, the `label` column is all 0.

In [10]:
def add_label_column(df, value):
    df['label'] = value
    return df

real_df = add_label_column(real_df, 1)  # Label for real news
fake_df = add_label_column(fake_df, 0)  # Label for fake news
print("After adding label column:")
print("Real dataset shape with labels:", real_df.head(3))
print("\n\nFake dataset shape with labels:", fake_df.head(3))


After adding label column:
Real dataset shape with labels:                                                title  \
0  thousands march in helsinki in farright antifa...   
1  marseille attacker probably radicalized by bro...   
2  us farmers slam trumps cuba clampdown press fo...   

                                                text  label  
0  helsinki reuters  supporters of the far right ...      1  
1  rome reuters  the brother of the man who kille...      1  
2  chicago reuters  us farm groups criticized pre...      1  


Fake dataset shape with labels:                                                title  \
0  ep  patrick henningsen live  crisis of america...   
1   desperate for members white supremacists beg ...   
2   black caucus demands action calls on congress...   

                                                text  label  
0   join patrick every wednesday at independent t...      0  
1  by now everyone is aware of how donald trump s...      0  
2  following the horrif

### TASK 2.2 [2 Points]. Concatenate the two data frames into one.

In [11]:
def concatenate_dataframes(df1, df2):
    combined_df = pd.concat([df1, df2], ignore_index=True)
    return combined_df

union_df = concatenate_dataframes(real_df, fake_df)
print("Combined dataset shape:", union_df.shape)
print("First 3 rows of the combined dataset:")
print(union_df.head(3))
print("\n\nLast 3 rows of the combined dataset:")
print(union_df.tail(3))

Combined dataset shape: (9841, 3)
First 3 rows of the combined dataset:
                                               title  \
0  thousands march in helsinki in farright antifa...   
1  marseille attacker probably radicalized by bro...   
2  us farmers slam trumps cuba clampdown press fo...   

                                                text  label  
0  helsinki reuters  supporters of the far right ...      1  
1  rome reuters  the brother of the man who kille...      1  
2  chicago reuters  us farm groups criticized pre...      1  


Last 3 rows of the combined dataset:
                                                  title  \
9838  cnns jim acosta schooled on the meaning of the...   
9839   donald trumps first campaign tv ad is here an...   
9840  one brutal image perfectly captures the truth ...   

                                                   text  label  
9838  we wish president trump could clone senior adv...      0  
9839  while an armed militia group of domestic te

### TASK 2.3 [2 Points]. Split the dataset into training and testing sets.

There are many ways to implement this. You could either implement the splitting from scratch or use the appropriate functions in `sklearn.model_selection`.

The printed results don't need to be exactly the same as the expected output, but they should be close.

In [12]:
from sklearn.model_selection import train_test_split
def split_dataset(df, train_size=0.8):
    train_df, test_df = train_test_split(df, train_size=train_size, random_state=42)
    return train_df, test_df

train_df, test_df = split_dataset(union_df, train_size=0.8)
print("Training dataset shape:", train_df.shape)
print("Testing dataset shape:", test_df.shape)
print("Proportion of training set:", len(train_df) / len(union_df))
print("Proportion of testing set:", len(test_df) / len(union_df))
print("#Positive samples in training set:", len(train_df[train_df['label'] == 1]))
print("#Negative samples in training set:", len(train_df[train_df['label'] == 0]))
print("#Positive samples in testing set:", len(test_df[test_df['label'] == 1]))
print("#Negative samples in testing set:", len(test_df[test_df['label'] == 0]))

Training dataset shape: (7872, 3)
Testing dataset shape: (1969, 3)
Proportion of training set: 0.7999187074484301
Proportion of testing set: 0.20008129255156995
#Positive samples in training set: 3984
#Negative samples in training set: 3888
#Positive samples in testing set: 1004
#Negative samples in testing set: 965
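
The split above is purely random, so the class ratio in the two sets can drift slightly. If you want both sets to preserve the real/fake ratio exactly, `train_test_split` accepts a `stratify` argument. The sketch below uses a made-up toy frame (60 positive, 40 negative rows) to show the effect; it is an optional variation, not what produced the numbers above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a 60/40 class ratio
toy_df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(100)],
    "label": [1] * 60 + [0] * 40,
})

# stratify keeps the 60/40 ratio in both the training and the testing part
train_toy, test_toy = train_test_split(
    toy_df, train_size=0.8, random_state=42, stratify=toy_df["label"]
)

print(len(train_toy), len(test_toy))                          # 80 20
print((test_toy["label"] == 1).sum(), (test_toy["label"] == 0).sum())  # 12 8
```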


## TASK 3: Build a classifier and evaluate its performance.
Now that the training and testing sets are ready, we are going to train classifiers and evaluate their performance.

* You will first encode the news, which means converting each piece of news into a fixed-length vector. Then you will train a classifier.
* You will try two encoders, `TfidfVectorizer` and `SentenceTransformer`, and use `LogisticRegression` as the classifier.
* Regarding the inputs to the classifiers, you are unsure **whether `title` or `text` will give better predictions**, so you will try both with each encoder.

This gives the following combinations of experiments. The goal of this task is to fill in the **classification accuracy on the test set** for each cell of the table at the end.

|  | TfidfVectorizer | SentenceTransformer |
|----------|----------|----------|
| title |  |  |
| text |  |  |

### TASK 3.1: TfidfVectorizer - Logistic Regression

#### TASK 3.1.1 [5 Points]: TfidfVectorizer
Currently, each input is a string. Machine learning models like logistic regression cannot take strings as inputs, so the first step is to convert the input strings into feature vectors. We will use `TfidfVectorizer` from `sklearn.feature_extraction.text` for this. We recommend reading its documentation before you start coding.

You would do the following:
1. Fit the `TfidfVectorizer` using the training inputs.
2. Use the fitted vectorizer to transform the testing inputs.
3. Do the above steps for both `title` column and `text` column.

Hint: When fitting the `TfidfVectorizer`, pay attention to the `max_features` parameter. If you find the training later on takes a very long time, consider setting an appropriate value for this parameter.

In [13]:
def extract_feature_vectors(train_df_input_column, test_df_input_column):
    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer = TfidfVectorizer(max_features=5000)
    train_feature_vectors = vectorizer.fit_transform(train_df_input_column)
    test_feature_vectors = vectorizer.transform(test_df_input_column)
    return train_feature_vectors, test_feature_vectors

train_title_feature_vectors, test_title_feature_vectors = extract_feature_vectors(train_df['title'], test_df['title'])
train_text_feature_vectors,  test_text_feature_vectors  = extract_feature_vectors(train_df['text'],  test_df['text'])
print("Feature vectors for training set (title):", train_title_feature_vectors.shape)
print("Feature vectors for testing set (title): ", test_title_feature_vectors.shape)
print("Feature vectors for training set (text): ", train_text_feature_vectors.shape)
print("Feature vectors for testing set (text):  ", test_text_feature_vectors.shape)

Feature vectors for training set (title): (7872, 5000)
Feature vectors for testing set (title):  (1969, 5000)
Feature vectors for training set (text):  (7872, 5000)
Feature vectors for testing set (text):   (1969, 5000)
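
The key point of the steps above is that the vocabulary is learned from the training inputs only, and the test inputs are merely projected onto it. The toy sketch below (made-up documents, and an arbitrarily small `max_features=4`) illustrates this: words that appear only in the test document get no column, and the feature count is capped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
train_docs = ["trump signs bill", "senate passes bill", "trump tweets again"]
test_docs = ["senate blocks unseen bill"]

vectorizer = TfidfVectorizer(max_features=4)   # keep only the 4 most frequent terms
X_train = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_test = vectorizer.transform(test_docs)        # reuses them; nothing is re-fitted

print(X_train.shape, X_test.shape)              # (3, 4) (1, 4)
print("unseen" in vectorizer.vocabulary_)       # False: test-only words are ignored
```

Fitting on the test set as well would leak information about the test data into the features, which is why only `transform` is called on it.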


#### TASK 3.1.2 [8 Points]: Logistic Regression.
Train a **Logistic Regression model** on the training set and evaluate its performance on the testing set. Feel free to set your own hyper-parameters.

Consider using `accuracy_score` and `classification_report` from `sklearn.metrics` to evaluate and present your results.

You don't need to match the exact values from `expected_output.ipynb`, but you should obtain a similar output format and reasonable values (e.g. accuracy not lower than 0.85).

In [14]:
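def LogisticRegression(train_feature_vectors, train_labels, test_feature_vectors, test_labels):
    # Note: inside this function the name LogisticRegression refers to the
    # scikit-learn class imported below, not to this wrapper function.
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, classification_report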

    model = LogisticRegression()
    model.fit(train_feature_vectors, train_labels)
    predictions = model.predict(test_feature_vectors)
    accuracy = accuracy_score(test_labels, predictions)
    report = classification_report(test_labels, predictions)
    print("Accuracy of Logistic Regression Model:", accuracy)
    print("Classification Report:\n", report)
    return model, accuracy, report

# Prepare labels for training and testing
train_labels = train_df['label'].values
test_labels = test_df['label'].values
# Train and evaluate the Logistic Regression model using title feature vectors
model_title, accuracy_title, report_title = LogisticRegression(
    train_title_feature_vectors, train_labels,
    test_title_feature_vectors, test_labels
)
# Train and evaluate the Logistic Regression model using text feature vectors
model_text, accuracy_text, report_text = LogisticRegression(
    train_text_feature_vectors, train_labels,
    test_text_feature_vectors, test_labels
)
# Print the accuracies for both models
print("Accuracy of Logistic Regression Model using title feature vectors:", accuracy_title)
print("Accuracy of Logistic Regression Model using text feature vectors:", accuracy_text)


Accuracy of Logistic Regression Model: 0.9380396140172677
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.92      0.94       965
           1       0.92      0.96      0.94      1004

    accuracy                           0.94      1969
   macro avg       0.94      0.94      0.94      1969
weighted avg       0.94      0.94      0.94      1969

Accuracy of Logistic Regression Model: 0.9751142712036567
Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.96      0.97       965
           1       0.96      0.99      0.98      1004

    accuracy                           0.98      1969
   macro avg       0.98      0.97      0.98      1969
weighted avg       0.98      0.98      0.98      1969

Accuracy of Logistic Regression Model using title feature vectors: 0.9380396140172677
Accuracy of Logistic Regression Model using text feature vectors: 0.9751142712036567
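
As an alternative to handling the vectorizer and the classifier separately, scikit-learn lets you chain them with `make_pipeline`, so that `fit` and `predict` take raw strings. The sketch below runs on a tiny made-up corpus and is only meant to show the mechanics; it is not the assignment's expected solution, and the toy accuracy says nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: label 1 = real, label 0 = fake
topics = ["economy", "election", "trade", "health"]
train_texts = [f"reuters report on {t}" for t in topics] + \
              [f"shocking truth about {t}" for t in topics]
train_labels = [1] * 4 + [0] * 4
test_texts = ["reuters report on taxes", "shocking truth about taxes"]
test_labels = [1, 0]

# The pipeline fits TF-IDF on the training texts, then the classifier on the vectors
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(train_texts, train_labels)

predictions = classifier.predict(test_texts)
toy_accuracy = accuracy_score(test_labels, predictions)
print(list(predictions), toy_accuracy)
```

A pipeline also guarantees that the vectorizer is never fitted on test data, since `predict` only calls `transform` internally.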


### TASK 3.2 SentenceTransformer - Logistic Regression
You will use a Python package named `sentence_transformers` to convert a sentence into a 384-dimensional vector. You are encouraged to read through the documentation at [https://sbert.net/](https://sbert.net/), which describes its simple interface.

Similar to the previous task, you will first transform the title/text into vectors and train a logistic regression to classify the news.

#### TASK 3.2.1 [3 Points] Install Sentence Transformer.
First, install the sentence-transformers package using pip: `pip install -U sentence-transformers`. After installation, you can test-run the following code from [https://sbert.net/](https://sbert.net/) to check that the installation succeeded.

In [15]:
from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])

(3, 384)
tensor([[1.0000, 0.6660, 0.1046],
        [0.6660, 1.0000, 0.1411],
        [0.1046, 0.1411, 1.0000]])
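
The similarity matrix printed above is symmetric with ones on the diagonal, which is what pairwise cosine similarity produces. Assuming `model.similarity` uses cosine similarity here (to our knowledge its default), the same kind of matrix can be computed by hand with NumPy. The sketch below uses made-up 2-dimensional vectors instead of real embeddings, and the function name is hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    # Scale every row to unit length; the dot products are then cosines
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

# Hypothetical toy "embeddings"
toy = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])
sims = cosine_similarity_matrix(toy)
print(np.round(sims, 4))
# [[1.     0.7071 0.    ]
#  [0.7071 1.     0.7071]
#  [0.     0.7071 1.    ]]
```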


#### TASK 3.2.2 [5 Points] Transform title/text into vectors using SentenceTransformer

In [16]:
def transform_into_vectors(train_df_input_column, test_df_input_column):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    train_feature_vectors = model.encode(train_df_input_column.tolist())
    test_feature_vectors = model.encode(test_df_input_column.tolist())
    return train_feature_vectors, test_feature_vectors

train_title_feature_vectors, test_title_feature_vectors = transform_into_vectors(train_df['title'], test_df['title'])
print("Feature vectors for training set (title):", train_title_feature_vectors.shape)
print("Feature vectors for testing set (title): ", test_title_feature_vectors.shape)
train_text_feature_vectors,  test_text_feature_vectors  = transform_into_vectors(train_df['text'],  test_df['text'])
print("Feature vectors for training set (text): ", train_text_feature_vectors.shape)
print("Feature vectors for testing set (text):  ", test_text_feature_vectors.shape)

Feature vectors for training set (title): (7872, 384)
Feature vectors for testing set (title):  (1969, 384)
Feature vectors for training set (text):  (7872, 384)
Feature vectors for testing set (text):   (1969, 384)


#### TASK 3.2.3 [3 Points]: Logistic Regression (You can reuse the code from TASK 3.1.2)
Train a **Logistic Regression model** on the training set and evaluate its performance on the testing set. Feel free to set your own hyper-parameters.

Consider using `accuracy_score` and `classification_report` from `sklearn.metrics` to evaluate and present your results.

You don't need to match the exact values from `expected_output.ipynb`, but you should obtain a similar output format and reasonable values (e.g. accuracy not lower than 0.85).

In [17]:
def LogisticRegression(train_feature_vectors, train_labels, test_feature_vectors, test_labels):
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, classification_report

    model = LogisticRegression()
    model.fit(train_feature_vectors, train_labels)
    predictions = model.predict(test_feature_vectors)
    accuracy = accuracy_score(test_labels, predictions)
    report = classification_report(test_labels, predictions)
    print("Accuracy of Logistic Regression Model:", accuracy)
    print("Classification Report:\n", report)
    return model, accuracy, report

# Prepare labels for training and testing
train_labels = train_df['label'].values
test_labels = test_df['label'].values
# Train and evaluate the Logistic Regression model using title feature vectors
model_title, accuracy_title, report_title = LogisticRegression(
    train_title_feature_vectors, train_labels,
    test_title_feature_vectors, test_labels
)
# Train and evaluate the Logistic Regression model using text feature vectors
model_text, accuracy_text, report_text = LogisticRegression(
    train_text_feature_vectors, train_labels,
    test_text_feature_vectors, test_labels
)
# Print the accuracies for both models
print("Accuracy of Logistic Regression Model using title feature vectors:", accuracy_title)
print("Accuracy of Logistic Regression Model using text feature vectors:", accuracy_text)


Accuracy of Logistic Regression Model: 0.9080751650584052
Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.89      0.90       965
           1       0.90      0.92      0.91      1004

    accuracy                           0.91      1969
   macro avg       0.91      0.91      0.91      1969
weighted avg       0.91      0.91      0.91      1969

Accuracy of Logistic Regression Model: 0.9410868461147791
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.93      0.94       965
           1       0.93      0.95      0.94      1004

    accuracy                           0.94      1969
   macro avg       0.94      0.94      0.94      1969
weighted avg       0.94      0.94      0.94      1969

Accuracy of Logistic Regression Model using title feature vectors: 0.9080751650584052
Accuracy of Logistic Regression Model using text feature vectors: 0.9410868461147791


### TASK 3.3 [5 Points]:

Finally, fill the test accuracies from your experiments into the table.

|  | TfidfVectorizer | SentenceTransformer |
|----------|----------|----------|
| title | .94 | .91 |
| text  | .98 | .94 |
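
The table entries are the four accuracies printed above, rounded to two decimals. If you prefer to generate the table rather than type it, a minimal sketch could look like this; the dictionary layout and variable names are made up, while the accuracy values are copied from the outputs of TASK 3.1.2 and TASK 3.2.3.

```python
# Accuracies copied from the printed outputs above
accuracies = {
    ("title", "TfidfVectorizer"): 0.9380396140172677,
    ("text", "TfidfVectorizer"): 0.9751142712036567,
    ("title", "SentenceTransformer"): 0.9080751650584052,
    ("text", "SentenceTransformer"): 0.9410868461147791,
}

rows = ["title", "text"]
cols = ["TfidfVectorizer", "SentenceTransformer"]

# Build a markdown table with one row per input column and one column per encoder
lines = ["|  | " + " | ".join(cols) + " |", "|----------|----------|----------|"]
for row in rows:
    cells = [f"{accuracies[(row, col)]:.2f}" for col in cols]
    lines.append(f"| {row} | " + " | ".join(cells) + " |")

print("\n".join(lines))
```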
