<a href="https://colab.research.google.com/github/ashwanissingh/Financial-Misinformation-Detection-using-NLP/blob/main/Final_Capstone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CAPSTONE**
Checklist of basic steps for the Financial Misinformation Detection (FMD) project


1.  Load and inspect the dataset, then preprocess it (cleaning, tokenization, lemmatization, stopword removal) ✅
2.  Feature engineering (TF-IDF for traditional ML, embeddings for RNNs) ✅
3.  Train-test split, model selection (Logistic Regression, RNN with LSTM), training and evaluation (accuracy, classification report) ✅





# Step 1: Preprocessing the Data

1.1 Load the Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd

# Load dataset
file_path = "/content/drive/MyDrive/train.json"
df = pd.read_json(file_path, lines=True)

# Display first few rows
print(df.head())


            ID                                              claim      posted  \
0  FMD_train_0                        Amazon.com and Intifada.com  09/25/2001   
1  FMD_train_1                          $100 JCPenney Coupon Scam  08/10/2015   
2  FMD_train_2  Did Ford Make Colin Kaepernick the Face of The...  09/10/2018   
3  FMD_train_3  State Dept. Employee Candace Claiborne Arreste...  04/15/2017   
4  FMD_train_4  No, Sean Connery's family did not become emoti...  12/03/2020   

                                          sci_digest  \
0  [Is the web site Intifada.com partnered with A...   
1                                                 []   
2  [A joke posted to the political humor section ...   
3  [Department of State employee Candace Claiborn...   
4  [We recently saw the same misleading advertisi...   

                                       justification      issues  \
0  Claim:  On-line bookseller Amazon.com is partn...    [profit]   
1  FACT CHECK: Can Facebook users get a 

1.2 Handle Missing Values

In [None]:
# Check for missing values
print(df.isnull().sum())

# Drop rows with missing values (if any)
df = df.dropna()


ID               0
claim            0
posted           0
sci_digest       0
justification    0
issues           0
image_data       0
label            0
evidence         0
dtype: int64


1.3 Normalize Text

In [None]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\W', ' ', text)  # Remove special characters
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    text = " ".join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

df["clean_claim"] = df["claim"].apply(clean_text)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
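
To see what `clean_text` does to a single claim, the sketch below replays the same four steps on the visible start of one claim from `df.head()`. It avoids the NLTK download by using a small hand-picked stand-in for the stopword list (an assumption for illustration; every word in it also appears in NLTK's English list), so results with the full list may differ for other claims.

```python
import re

# Stand-in for NLTK's English stopword list (assumption: a tiny subset, for illustration only).
demo_stop_words = {"no", "not", "did", "s", "the", "of", "a"}

def clean_text_demo(text, stop_words):
    text = text.lower()                       # lowercase
    text = re.sub(r"\W", " ", text)           # non-word characters become spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return " ".join(w for w in text.split() if w not in stop_words)

claim = "No, Sean Connery's family did not"
cleaned = clean_text_demo(claim, demo_stop_words)
print(cleaned)  # sean connery family
```

Note that the negations `no` and `not` disappear along with the other stopwords. For claims whose meaning hinges on a negation this may remove useful signal, so keeping them is a modelling choice worth testing.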


1.4 Tokenization and Lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

def tokenize_lemmatize(text):
    tokens = text.split()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return " ".join(tokens)

df["clean_claim"] = df["clean_claim"].apply(tokenize_lemmatize)


[nltk_data] Downloading package wordnet to /root/nltk_data...


1.5 Convert Labels into Numerical Format

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df["label_encoded"] = label_encoder.fit_transform(df["label"])  # Converts labels to numbers

# Show label mappings
print(dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))


{'False': np.int64(0), 'NEI': np.int64(1), 'True': np.int64(2)}
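
The mapping printed above follows from how `LabelEncoder` works: it numbers the distinct labels in sorted order. A minimal pure-Python sketch of the same rule (the `labels` list is a made-up sample, not the dataset column):

```python
# Made-up sample of labels (assumption: illustrative only, not the real column).
labels = ["True", "False", "NEI", "False", "True"]

# Sort the distinct labels and number them, the same ordering rule LabelEncoder applies.
mapping = {label: index for index, label in enumerate(sorted(set(labels)))}
encoded = [mapping[label] for label in labels]

print(mapping)  # {'False': 0, 'NEI': 1, 'True': 2}
print(encoded)  # [2, 0, 1, 0, 2]
```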


# Step 2: Feature Engineering

2.1 Convert Text into Numerical Features

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Limit to top 5000 words

# Fit and transform the claims
X = tfidf_vectorizer.fit_transform(df["clean_claim"])

# Convert to dense array
X = X.toarray()

# Show shape of transformed data
print("TF-IDF Feature Matrix Shape:", X.shape)


TF-IDF Feature Matrix Shape: (1953, 4659)
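
The matrix has 4659 columns rather than 5000 because `max_features` is only an upper bound: the vectorizer found 4659 distinct terms in the cleaned claims. As for the values inside the matrix, the sketch below reproduces by hand the weighting that `TfidfVectorizer` applies with its default settings (raw term counts, smoothed IDF, L2 row normalization). The two toy documents are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (assumption: made-up, already-cleaned documents).
docs = ["coupon scam coupon", "amazon coupon"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df_counts = Counter(t for doc in tokenized for t in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]   # term count times IDF
    norm = math.sqrt(sum(v * v for v in raw))   # L2 norm of the row
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in tokenized]
print(vocab)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

A term that occurs in every document (`coupon` here) gets the minimum IDF of 1, while rarer terms are weighted up.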


2.2 Split the Dataset into Training & Testing Sets

In [None]:
from sklearn.model_selection import train_test_split

# Target labels
y = df["label_encoded"]

# Split into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print dataset sizes
print("Training Set Size:", X_train.shape)
print("Testing Set Size:", X_test.shape)


Training Set Size: (1562, 4659)
Testing Set Size: (391, 4659)
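
The split above is purely random, so the share of each class can drift between the training and test sets. One common option, sketched here on made-up labels rather than the real data, is to pass `stratify=y` so that both sets keep the same class proportions.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data (assumption: 100 rows with a 50/20/30 class mix, for illustration only).
y_demo = np.array([0] * 50 + [1] * 20 + [2] * 30)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print("train:", sorted(train_counts.items()))
print("test: ", sorted(test_counts.items()))
```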


# Step 3: Model Selection and Training

3.1 Train a Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize and train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Logistic Regression Accuracy:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Logistic Regression Accuracy: 0.649616368286445

Classification Report:
               precision    recall  f1-score   support

           0       0.60      0.92      0.73       177
           1       1.00      0.06      0.11        68
           2       0.74      0.60      0.66       146

    accuracy                           0.65       391
   macro avg       0.78      0.53      0.50       391
weighted avg       0.72      0.65      0.60       391
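
The two summary rows in the report are averages of the per-class rows: `macro avg` gives every class equal weight, while `weighted avg` weights each class by its support. Recomputing them from the rounded figures printed above shows why the macro F1 (0.50) sits well below the accuracy (0.65): class 1 (NEI) has a recall of only 0.06, and the macro average counts it as much as the two larger classes.

```python
# Per-class (precision, recall, f1, support) copied from the report above.
report = {
    0: (0.60, 0.92, 0.73, 177),
    1: (1.00, 0.06, 0.11, 68),
    2: (0.74, 0.60, 0.66, 146),
}

total = sum(row[3] for row in report.values())

def macro(col):
    # Unweighted mean over classes.
    return sum(row[col] for row in report.values()) / len(report)

def weighted(col):
    # Mean over classes, weighted by each class's support.
    return sum(row[col] * row[3] for row in report.values()) / total

macro_avg = [round(macro(c), 2) for c in range(3)]
weighted_avg = [round(weighted(c), 2) for c in range(3)]
print("macro avg:   ", macro_avg)     # [0.78, 0.53, 0.5]
print("weighted avg:", weighted_avg)  # [0.72, 0.65, 0.6]
```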



3.2 Experiment with Other Models (RNN)

3.2.1 Convert Text into Sequences

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hyperparameters
max_words = 5000  # Max number of words in vocabulary
max_len = 100  # Max length of input sequence

# Tokenize the text
tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(df["clean_claim"])

# Convert text to sequences
X_sequences = tokenizer.texts_to_sequences(df["clean_claim"])

# Pad sequences to ensure equal length
X_padded = pad_sequences(X_sequences, maxlen=max_len, padding="post")

# Convert labels to numpy array
y = df["label_encoded"].values

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.2, random_state=42)

print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)


Training data shape: (1562, 100)
Testing data shape: (391, 100)
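
`pad_sequences` turns the variable-length token lists into one fixed-width array. With `padding="post"` the zeros are appended at the end; truncation was left at its Keras default (`truncating="pre"`), so a claim longer than `max_len` would lose its first tokens rather than its last. A pure-Python imitation of that behaviour, written only for illustration (the helper `pad_post` and the toy ids are not part of Keras or of this notebook):

```python
def pad_post(sequences, maxlen, value=0):
    """Imitate pad_sequences(..., padding="post") with the default truncating="pre"."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # keep the LAST maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))   # append zeros at the end
    return padded

toy_sequences = [[5, 8, 2], [7], [1, 2, 3, 4, 5, 6]]
padded = pad_post(toy_sequences, maxlen=4)
print(padded)  # [[5, 8, 2, 0], [7, 0, 0, 0], [3, 4, 5, 6]]
```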


3.2.2 Define an RNN Model with LSTM

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Define the RNN model
model = Sequential([
    Embedding(input_dim=max_words, output_dim=128, input_length=max_len),
    LSTM(64, return_sequences=True),
    LSTM(32),
    Dropout(0.5),
    Dense(32, activation="relu"),
    Dense(3, activation="softmax")  # 3 output classes (True, False, NEI)
])

# Compile the model
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Print model summary
model.summary()
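
The summary output was not captured in this export. The layer sizes can still be worked out by hand from the standard parameter-count formulas for each layer type, which gives a sense of how large the model is relative to the 1562 training claims. This is a back-of-the-envelope sketch, not the captured output.

```python
max_words, embed_dim = 5000, 128

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias vector.
    return input_dim * units + units

layers = {
    "Embedding(5000, 128)": max_words * embed_dim,
    "LSTM(64)": lstm_params(embed_dim, 64),
    "LSTM(32)": lstm_params(64, 32),
    "Dense(32)": dense_params(32, 32),
    "Dense(3)": dense_params(32, 3),
}
total_params = sum(layers.values())  # Dropout adds no parameters

for name, count in layers.items():
    print(f"{name:22s}{count:>9,d}")
print(f"{'Total':22s}{total_params:>9,d}")  # 702,979
```

Most of the parameters sit in the embedding table, which is trained from scratch here.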




3.2.3 Train the Model

In [None]:
# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))


Epoch 1/10
49/49 ━━━━━━━━━━━━━━━━━━━━ 12s 119ms/step - accuracy: 0.4185 - loss: 1.0571 - val_accuracy: 0.4527 - val_loss: 1.0342
Epoch 2/10
49/49 ━━━━━━━━━━━━━━━━━━━━ 7s 141ms/step - accuracy: 0.4667 - loss: 1.0365 - val_accuracy: 0.4527 - val_loss: 1.0343
Epoch 3/10
49/49 ━━━━━━━━━━━━━━━━━━━━ 9s 107ms/step - accuracy: 0.4709 - loss: 1.0239 - val_accuracy: 0.4527 - val_loss: 1.0322
Epoch 4/10
49/49 ━━━━━━━━━━━━━━━━━━━━ 10s 106ms/step - accuracy: 0.4508 - loss: 1.0340 - val_accuracy: 0.4527 - val_loss: 1.0325
Epoch 5/10
49/49 ━━━━━━━━━━━━━━━━━━━━ 10s 102ms/step - accuracy: 0.4485 - loss: 1.0358 - val_accuracy: 0.4527 - val_loss: 1.0315
Epoch 6/10
49/49 ━━━━━━━━━━━━━━━━━━━━ 5s 106ms/step - accuracy: 0.4510 - loss: 1.0216 - val_accuracy: 0.4527 - val_loss: 1.0314
Epoch 7/10
49/49

3.2.4 Evaluate the Model

In [None]:
# Evaluate on test data
test_loss, test_acc = model.evaluate(X_test, y_test)
print("Test Accuracy:", test_acc)

# Make predictions
y_pred = model.predict(X_test)
y_pred_classes = y_pred.argmax(axis=1)

# Classification report
from sklearn.metrics import classification_report
print("\nClassification Report:\n", classification_report(y_test, y_pred_classes))


13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step - accuracy: 0.4235 - loss: 1.0535
Test Accuracy: 0.4526854157447815
13/13 ━━━━━━━━━━━━━━━━━━━━ 1s 66ms/step

Classification Report:
               precision    recall  f1-score   support

           0       0.45      1.00      0.62       177
           1       0.00      0.00      0.00        68
           2       0.00      0.00      0.00       146

    accuracy                           0.45       391
   macro avg       0.15      0.33      0.21       391
weighted avg       0.20      0.45      0.28       391



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
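
The report shows that the LSTM predicts class 0 for every test claim (recall 1.00 for class 0, 0.00 for the other two). Its accuracy is therefore just the share of class 0 in the test set, which is also the value `val_accuracy` stayed at throughout training. The arithmetic, using the support counts from the report:

```python
# Support counts per class, copied from the classification report above.
support = {0: 177, 1: 68, 2: 146}

majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / sum(support.values())

print(majority_class)               # 0
print(round(baseline_accuracy, 4))  # 0.4527
```

Any model worth keeping should beat this majority-class figure. The logistic regression in 3.1 does (0.65); the LSTM as trained here does not.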


# Step 4: Implement a Transformer (BERT) Model

In [None]:
%pip install transformers datasets torch tensorflow




In [None]:
from datasets import load_dataset
from transformers import BertTokenizer

# Load dataset from Hugging Face
dataset = load_dataset("lzw1008/COLING25-FMD")

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

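# Tokenize function
# Assumption: the Hub dataset exposes the same "claim" column as train.json above;
# the original code read a "text" column, which train.json does not contain.
def tokenize_function(examples):
    return tokenizer(examples["claim"], padding="max_length", truncation=True)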

# Apply tokenization
tokenized_datasets = dataset.map(tokenize_function, batched=True)


4.1 Load the Pretrained BERT Tokenizer

In [None]:
import tensorflow as tf
from transformers import BertTokenizer

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize and encode sequences
def encode_texts(texts, max_len=100):
    return tokenizer(
        texts.tolist(),
        max_length=max_len,
        padding="max_length",
        truncation=True,
        return_tensors="tf"
    )
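
With `padding="max_length"`, the tokenizer returns `input_ids` padded to `max_len` together with an `attention_mask` that marks real tokens with 1 and padding with 0. The sketch below builds both by hand for one short sequence. The special-token ids are those of `bert-base-uncased` (`[PAD]` = 0, `[CLS]` = 101, `[SEP]` = 102); the three word-piece ids in the middle and the helper `pad_and_mask` are made up for illustration.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # special tokens in the bert-base-uncased vocabulary

def pad_and_mask(wordpiece_ids, max_len):
    """Hypothetical helper: wrap ids in [CLS]/[SEP], truncate, pad, and build the mask."""
    body = wordpiece_ids[: max_len - 2]   # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                 # 1 = real token
    padding = max_len - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding  # 0 = padding

input_ids_demo, attention_mask_demo = pad_and_mask([2001, 2002, 2003], max_len=8)
print(input_ids_demo)       # [101, 2001, 2002, 2003, 102, 0, 0, 0]
print(attention_mask_demo)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The mask is what lets BERT ignore the padded positions, so it needs to travel with `input_ids` through any later train-test split.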


In [None]:
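# Encode dataset
import numpy as np
from sklearn.model_selection import train_test_split

# train_encodings was never defined in the original cell; build it with encode_texts.
# Assumption: the raw claims are encoded, since BERT applies its own tokenization.
train_encodings = encode_texts(df["claim"])

# Convert TensorFlow tensors to NumPy arrays
X = np.array(train_encodings["input_ids"])
mask = np.array(train_encodings["attention_mask"])
y = np.array(df["label_encoded"])

# Split ids, masks and labels together so the rows stay aligned
X_train, X_test, mask_train, mask_test, y_train, y_test = train_test_split(
    X, mask, y, test_size=0.2, random_state=42
)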


4.2 Load the Pretrained BERT Model

In [None]:
import tensorflow as tf
from transformers import TFBertModel
from tensorflow.keras.layers import Dense, Dropout, Input, Lambda
from tensorflow.keras.models import Model

# Load pretrained BERT model
bert_model = TFBertModel.from_pretrained("bert-base-uncased")

# Define input layers
input_ids = Input(shape=(100,), dtype=tf.int32, name="input_ids")
attention_mask = Input(shape=(100,), dtype=tf.int32, name="attention_mask")

# Wrap BERT model in a Lambda layer
def extract_bert_embeddings(inputs):
    input_ids, attention_mask = inputs
    return bert_model(input_ids=input_ids, attention_mask=attention_mask)[1]  # Pooled output

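# Keras cannot infer this Lambda's output shape (the NotImplementedError recorded in the
# output below came from omitting it), so pass it explicitly: bert-base-uncased pools to 768 units.
bert_output = Lambda(extract_bert_embeddings, output_shape=(768,))([input_ids, attention_mask])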

# Add dropout and dense layers for classification
x = Dropout(0.3)(bert_output)
x = Dense(64, activation="relu")(x)
output = Dense(3, activation="softmax")(x)  # 3 classes (False, NEI, True), matching label_encoded

# Define model
model = Model(inputs=[input_ids, attention_mask], outputs=output)

# Compile model
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Print model summary
model.summary()


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

NotImplementedError: Exception encountered when calling Lambda.call().

We could not automatically infer the shape of the Lambda's output. Please specify the `output_shape` argument for this Lambda layer.

Arguments received by Lambda.call():
  • args=(['<KerasTensor shape=(None, 100), dtype=int32, sparse=False, name=input_ids>', '<KerasTensor shape=(None, 100), dtype=int32, sparse=False, name=attention_mask>'],)
  • kwargs={'mask': ['None', 'None']}