Project 3: Spam filter for Quora questions
Download the data from here: https://www.dropbox.com/sh/kpf9z73woodfssv/AAAw1_JIzpuVvwteJCma0xMla?dl=0

Goal: Build a model that identifies whether a question on Quora is spam.

Suggested Guidelines:

1. To reduce the dimensionality of your model's input, you can use the GloVe embeddings shared with you (included in the data).

2. Here is how you can use pre-trained embeddings: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

3. You'll have to create and maintain your own train/validation splits from the full data shared with you.

4. Your solution needs to be uploaded to your team's GitHub repo.
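
Guideline 3 leaves the split up to you. One possible approach, sketched here on synthetic labels, is a stratified split, so that the train and validation sets keep the same spam ratio. The toy arrays and the 10% spam ratio are assumptions for illustration; `train_test_split` and its `stratify` argument are the scikit-learn pieces being shown.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real features and labels (assumed 10% spam).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy keeps the class ratio the same in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"train spam ratio: {y_tr.mean():.2f}")
print(f"val spam ratio:   {y_va.mean():.2f}")
```

Fixing `random_state` makes the split reproducible, which helps when several team members need to compare results on the same validation set.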

In [None]:
# Dataset columns: qid, question_text, target

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import numpy as np

In [None]:
tags_df = pd.read_csv('/content/drive/MyDrive/Datasets/DL P3 Qura spam/train (1) (1).csv')

In [None]:
tags_df.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


In [None]:
# Extract the inputs and labels
texts = tags_df['question_text'].values
labels = tags_df['target'].values
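
Before preprocessing, it can help to check how the labels are distributed, since spam is often a small minority class. The sketch below uses made-up labels (an assumed 10% spam rate, not a figure from this dataset) and derives "balanced" class weights with NumPy.

```python
import numpy as np

# Toy labels standing in for tags_df['target'].values (assumed 10% spam).
toy_labels = np.array([0] * 90 + [1] * 10)

# Count how many examples fall in each class.
counts = np.bincount(toy_labels)

# "Balanced" weights: n_samples / (n_classes * count_per_class).
weights = len(toy_labels) / (len(counts) * counts)
class_weight = {cls: float(w) for cls, w in enumerate(weights)}

print(counts)
print(class_weight)
```

If the real labels turn out to be imbalanced, a dictionary like this could be passed to `model.fit` through its `class_weight` argument so that errors on the rare class count for more.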

Preprocess Data:

In [None]:
# Tokenize and pad the sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
MAX_NUM_WORDS = 20000  # Number of words to keep in the tokenizer
MAX_SEQUENCE_LENGTH = 100  # Maximum sequence length to pad

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print(f'Found {len(word_index)} unique tokens.')

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

# Train-validation split
X_train, X_val, y_train, y_val = train_test_split(data, labels, test_size=0.2, random_state=42)


Found 196502 unique tokens.
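
The padding length of 100 is a fixed choice in the cell above. One way to sanity-check it, sketched here on toy sequences rather than the real tokenizer output, is to look at the distribution of sequence lengths and pick a high percentile. The toy data and the 95th-percentile rule are assumptions for illustration.

```python
import numpy as np

# Toy token-id sequences standing in for the tokenizer output.
toy_sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11], [12, 13, 14, 15]]

# Length of every sequence, in tokens.
lengths = np.array([len(seq) for seq in toy_sequences])

# A high percentile is one candidate for the padding length: most
# questions fit without truncation while padding stays short.
p95 = int(np.ceil(np.percentile(lengths, 95)))
print(f"max={lengths.max()}, mean={lengths.mean():.1f}, p95={p95}")
```

In the notebook, the same calculation could be run on `sequences`. A shorter padding length means fewer LSTM steps per example, which may reduce training time.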


Load Pre-trained GloVe Embeddings:

In [None]:
# Load GloVe embeddings
EMBEDDING_DIM = 100  # Must match the GloVe file loaded below (glove.6B ships 50, 100, 200, and 300 dimensions)

embedding_index = {}
with open('/content/drive/MyDrive/Datasets/DL P3 Qura spam/glove.6B.100d[1].txt', 'r', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs

print(f'Found {len(embedding_index)} word vectors.')




Found 400000 word vectors.


In [None]:
# Create an embedding matrix
embedding_matrix = np.zeros((MAX_NUM_WORDS, EMBEDDING_DIM))
for word, i in word_index.items():
    if i < MAX_NUM_WORDS:
        embedding_vector = embedding_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
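
Words that have no GloVe vector keep an all-zero row in the matrix, so it is worth checking how much of the vocabulary is actually covered. The following is a minimal sketch on toy dictionaries; `toy_word_index`, `toy_embedding_index`, and `toy_max_words` are hypothetical stand-ins for the notebook's `word_index`, `embedding_index`, and `MAX_NUM_WORDS`.

```python
import numpy as np

# Toy stand-ins for the notebook's word_index and embedding_index.
toy_word_index = {"the": 1, "quora": 2, "spam": 3, "zzxq": 4}
toy_embedding_index = {
    "the": np.ones(3, dtype="float32"),
    "spam": np.full(3, 0.5, dtype="float32"),
}
toy_max_words = 4  # plays the role of MAX_NUM_WORDS

# Only indices below the cap get a row in the embedding matrix.
kept = [w for w, i in toy_word_index.items() if i < toy_max_words]
missing = [w for w in kept if w not in toy_embedding_index]
coverage = 1 - len(missing) / len(kept)

print(f"coverage: {coverage:.2%}, missing: {missing}")
```

Printing a sample of the missing words on the real data can reveal whether they are typos, rare names, or tokens that a different preprocessing step would recover.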

Build the Model:

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout


# Define the model
model = Sequential()

# Add the embedding layer with pre-trained GloVe embeddings
embedding_layer = Embedding(input_dim=MAX_NUM_WORDS,
                            output_dim=EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)  # Freeze the embedding layer
model.add(embedding_layer)

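# Add LSTM layer
# Note: recurrent_dropout > 0 prevents TensorFlow from using the fast cuDNN
# kernel on GPU, so training is noticeably slower with this setting.
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))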

# Add a fully connected layer
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))

# Output layer for binary classification
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Print the model summary
model.summary()




In [None]:
# Train the model
history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=64,
                    validation_data=(X_val, y_val),
                    verbose=2)

# Evaluate on the validation set (the metrics come from scikit-learn)
from sklearn.metrics import accuracy_score, f1_score

val_preds = (model.predict(X_val) > 0.5).astype("int32")
print(f"Validation Accuracy: {accuracy_score(y_val, val_preds)}")
print(f"F1 Score: {f1_score(y_val, val_preds)}")

Epoch 1/10
13108/13108 - 3071s - 234ms/step - accuracy: 0.9484 - loss: 0.1371 - val_accuracy: 0.9551 - val_loss: 0.1156
Epoch 2/10
13108/13108 - 3044s - 232ms/step - accuracy: 0.9531 - loss: 0.1209 - val_accuracy: 0.9572 - val_loss: 0.1110
Epoch 3/10
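
The evaluation cell uses a fixed 0.5 cutoff on the predicted probabilities. When the classes are imbalanced, a different cutoff may give a better F1 score. The sketch below sweeps thresholds over made-up labels and probabilities; the data, the helper `f1_at_threshold`, and the threshold grid are all assumptions for illustration.

```python
import numpy as np

def f1_at_threshold(y_true, probs, threshold):
    """F1 of the positive class when probs > threshold counts as spam."""
    preds = probs > threshold
    tp = np.sum(preds & (y_true == 1))
    fp = np.sum(preds & (y_true == 0))
    fn = np.sum(~preds & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy validation labels and predicted probabilities (made up).
toy_y = np.array([0, 0, 0, 0, 1, 1, 0, 1])
toy_probs = np.array([0.03, 0.12, 0.22, 0.47, 0.42, 0.62, 0.33, 0.91])

# Evaluate F1 on a grid of candidate thresholds and keep the best one.
thresholds = np.linspace(0.1, 0.9, 17)
scores = [f1_at_threshold(toy_y, toy_probs, t) for t in thresholds]
best = int(np.argmax(scores))
print(f"best threshold ~ {thresholds[best]:.2f}, F1 = {scores[best]:.3f}")
```

In the notebook, `toy_probs` would correspond to `model.predict(X_val).ravel()` and `toy_y` to `y_val`. A threshold tuned this way is fitted to the validation set, so its F1 score is best confirmed on data that was not used for the tuning.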


Save the Model:


In [None]:
model.save('spam_filter_model.h5')
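
The saved model only accepts padded integer sequences, so the vocabulary and padding settings need to be stored with it to score new questions later. Here is a minimal sketch that round-trips them through JSON. The file name `preprocessing.json` and the `encode` helper are hypothetical, and `encode` is a simplified stand-in that skips the punctuation filtering Keras's `Tokenizer` performs.

```python
import json
import os
import tempfile

# Toy stand-ins for tokenizer.word_index and the notebook's constants.
toy_word_index = {"the": 1, "is": 2, "spam": 3}
artifacts = {
    "word_index": toy_word_index,
    "max_num_words": 20000,
    "max_sequence_length": 100,
}

# Hypothetical file name; a temp directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "preprocessing.json")
with open(path, "w", encoding="utf8") as f:
    json.dump(artifacts, f)

# Later, at prediction time, reload the settings.
with open(path, encoding="utf8") as f:
    restored = json.load(f)

def encode(text, word_index, max_num_words):
    """Map lowercased, whitespace-split words to ids; drop unknown or capped words."""
    ids = (word_index.get(w) for w in text.lower().split())
    return [i for i in ids if i is not None and i < max_num_words]

encoded = encode("The spam is real", restored["word_index"], restored["max_num_words"])
print(encoded)
```

Keras's `Tokenizer` also provides a `to_json()` method, which may be a more faithful way to persist the fitted tokenizer itself.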