# Bi-directional RNN and Sentiment Analysis Assignment (Graded)
Now you will build a Bi-directional LSTM and RNN using the Fake News dataset.

### Problem Description
In this assignment, we aim to build a sentiment analysis model using a Bi-directional LSTM and RNN. The dataset used for this task is a Fake News dataset, which contains news articles labeled as real or fake. The goal is to preprocess the text data, convert it into a suitable format for model training, and then train a Bi-directional LSTM model to classify the news articles accurately. The performance of the model will be evaluated using various metrics such as accuracy, confusion matrix, and classification report.

### Description of the Dataset
A full training dataset with the following attributes:

- **id**: unique id for a news article
- **title**: the title of a news article
- **author**: author of the news article
- **text**: the text of the article; could be incomplete
- **label**: a label that marks the article as potentially unreliable
    - 1: unreliable
    - 0: reliable

### Assignment Task
- Import necessary libraries and load the dataset.
- Handle bad lines in the dataset and read it into a DataFrame.
- Perform initial data exploration (head, shape, null values).
- Drop rows with NaN values and verify the changes.
- Separate the dataset into independent features (X) and dependent features (y).
- Import TensorFlow and necessary Keras layers for model building.
- Preprocess the text data (tokenization, padding).
- Build and compile a Sequential model with Embedding and LSTM layers.
- Train the model on the training data and validate it on the test data.
- Evaluate the model's performance using confusion matrix, accuracy score, and classification report.
- Build and train a Bi-directional LSTM model.
- Evaluate the Bi-directional LSTM model's performance using the same metrics.

### Instructions
- Only write code where you see any of the prompts below:

    ```
    # YOUR CODE GOES HERE
    # YOUR CODE ENDS HERE
    # TODO
    ```

- Do not modify any other section of the code unless stated otherwise in the comments.

In [1]:
import re

import nltk
import numpy as np
import pandas as pd
import tensorflow as tf
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer  # stemming
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense, Embedding, LSTM  # Embedding learns dense word vectors
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences  # pre- and post-padding
from tensorflow.keras.preprocessing.text import one_hot

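# Task: Handle bad lines
To handle bad lines in the dataset, we define a function `handle_bad_line` that prints the bad line and returns `None`, which tells pandas to skip that row. We pass this function to `pd.read_csv` through the `on_bad_lines` argument. A callable is only supported with `engine='python'`, and it receives the bad row as a list of strings.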
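
As a minimal sketch of this mechanism on an in-memory CSV (the sample data and the `collect_bad_line` helper are made up for illustration and are not part of the assignment), a row with too many fields is handed to the callable and then skipped:

```python
import io

import pandas as pd

# Hypothetical sample: the second data row has five fields instead of three.
raw_csv = (
    "id,title,label\n"
    "1,Good row,0\n"
    "2,Too,many,fields,1\n"
    "3,Another good row,1\n"
)

skipped_rows = []


def collect_bad_line(bad_line):
    """Record the malformed row (a list of strings) and tell pandas to skip it."""
    skipped_rows.append(bad_line)
    return None


sample_df = pd.read_csv(
    io.StringIO(raw_csv),
    on_bad_lines=collect_bad_line,  # callables require engine="python"
    engine="python",
)

print(sample_df.shape)  # only the well-formed rows remain
print(skipped_rows)     # the rows that were dropped
```

Collecting the rows in a list, rather than only printing them, makes it easy to write them to a log file afterwards.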

In [3]:
def handle_bad_line(line):
    print(line) # TODO: log the line to a file
    return None 

df = pd.read_csv('FNC.csv', delimiter=',', encoding='utf-8', on_bad_lines=handle_bad_line, engine='python')



In [None]:
#TODO : Check if there are any missing values in the dataset. If there are, drop the rows.

In [None]:
# TODO: Check again to see if there are any missing values in the dataset.

# Task: Making the features
We drop the `label` column to obtain the independent features (X) and keep the `label` column on its own as the dependent feature (y).

In [11]:
## Get the Independent Features
X=0 # TODO: Drop the label column from the dataset and store the remaining columns in a variable X with axis 1

In [12]:
## Get the Dependent features
y=df[0] # TODO: Store the label column in a variable y

In [None]:
# TODO: Check the shape of the dataset dependent and independent features and make sure they are of the same length

In [18]:
### Vocabulary size
voc_size=5000

## One Hot representation

In [19]:
messages=X.copy()

In [None]:
messages['title'][1]

In [None]:
messages

In [22]:
messages.reset_index(inplace=True)

In [None]:
messages

## Task: Now we will remove stop words and punctuations

In [None]:
# stopwords
# TODO: download the stopwords from nltk library

In [26]:
ps = PorterStemmer()
corpus = []
for i in range(0, len(messages)):
    # removing special characters and replacing it with blanks
    review = re.sub(0, messages['title'][i]) # TODO: replace 0 special characters with blanks
    review = review # TODO: convert the review to lowercase
    review = review # TODO: split the review into words

    review = 0 # TODO: stem the words and remove the stopwords
    review = ' '.join(review)
    # TODO: append the review to the corpus
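
The cleaning steps in the loop above can be illustrated on a single made-up headline. This is only a sketch: the tiny `SAMPLE_STOPWORDS` set stands in for NLTK's English stop-word list, the `clean_title` name is hypothetical, and stemming is left out so the example runs without any downloads.

```python
import re

# Assumption: a tiny stand-in for nltk's stopwords.words('english').
SAMPLE_STOPWORDS = {"the", "a", "an", "is", "in", "of", "to", "and"}


def clean_title(title):
    """Keep letters only, lowercase, split into words, and drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", title)
    words = letters_only.lower().split()
    kept = [word for word in words if word not in SAMPLE_STOPWORDS]
    return " ".join(kept)


cleaned = clean_title("BREAKING: The Senate votes to end the 2016 shutdown!")
print(cleaned)  # breaking senate votes end shutdown
```

Digits and punctuation become spaces, so `split()` discards them along with any repeated whitespace.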

In [None]:
corpus

In [None]:
onehot_repr=0 # TODO: convert the words in the corpus to onehot representation using the one_hot function
onehot_repr

In [None]:
# TODO : Check the length of the onehot encoded corpus first sentence

In [None]:
# TODO: Check the onehot representation of the first sentence

## Task: Embedding representation
To convert the text data into a numerical format suitable for model training, we use an embedding representation. This involves the following steps:

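1. **One Hot Encoding**: Convert each word in the corpus to an integer index below `voc_size` using the Keras `one_hot` function. The function hashes words, so two different words can occasionally share the same index.
2. **Padding Sequences**: Ensure all sequences have the same length (`sent_length`) by padding shorter sequences with zeros and truncating longer ones.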
3. **Embedding Layer**: Use an embedding layer in the neural network to convert the integer-encoded words into dense vectors of fixed size. This layer learns the word embeddings during training.

The embedding representation helps in capturing the semantic meaning of words and improves the performance of the model.
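
The first two steps can be sketched in plain Python to show what the resulting data looks like. This is an illustration under stated assumptions, not the Keras implementation: `hash_encode` and `pad_post` are hypothetical helpers, MD5 is used so the indices are reproducible across runs, and long sequences are cut at the end.

```python
import hashlib


def hash_encode(sentence, voc_size):
    """Map each word to an integer in [1, voc_size - 1] with a stable hash."""
    indices = []
    for word in sentence.split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (voc_size - 1) + 1)
    return indices


def pad_post(sequence, length, value=0):
    """Keep at most `length` leading tokens, then append `value` until full."""
    trimmed = sequence[:length]
    return trimmed + [value] * (length - len(trimmed))


voc_size = 5000
sent_length = 20

encoded = hash_encode("senate votes end shutdown", voc_size)
padded = pad_post(encoded, sent_length)

print(encoded)      # four integers, one per word
print(len(padded))  # 20
```

Index 0 is never produced by `hash_encode`, so it can safely serve as the padding value. The Embedding layer then looks up one dense vector per integer.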

In [None]:
sent_length=20
embedded_docs=0 # TODO: pad the sequences to make them of the same length with post padding.
print(embedded_docs)

In [None]:
# TODO: Check the length of the first sentence after padding

In [None]:
# TODO: Check the first sentence after padding

In [None]:
## Creating model
# each word is converted into a vector of size 40
embedding_vector_features=40 ##features representation
model=0# TODO : Create a sequential model

# embedding layer
model.add() # TODO : Add the embedding layer with vocabulary size, embedding vector features and input length with sent_length

# LSTM-100 NEURONS
model.add(LSTM(100))

# Sigmoid for binary prediction in model
model.add() # TODO: Add a dense layer with 1 neuron and sigmoid activation function

model.compile() # TODO : Compile the model with binary crossentropy loss function, adam optimizer and accuracy as the metric
print(model.summary())

In [None]:
# Assuming voc_size and sent_length are predefined variables
embedding_vector_features = 40  # Size of the embedding vector

model = 0# TODO : Create a sequential model

# Embedding layer with correct input_dim (voc_size) and without deprecated input_length
model.add() # TODO : Add the embedding layer with vocabulary size and embedding vector features (no input_length argument)

# LSTM layer
model.add(LSTM(100))

# Dense layer with sigmoid activation for binary classification
model.add() # TODO: Add a dense layer with 1 neuron and sigmoid activation function

# Compile the model
model.compile() # TODO : Compile the model with binary crossentropy loss function, adam optimizer and accuracy as the metric

# Display the model summary
print(model.summary())


In [37]:
model.build() # TODO: Build the model with input shape as None and sent_length

In [None]:
# Dummy data: batch size of 1, sentence length of sent_length
dummy_input = np.random.randint(0, voc_size, (1, sent_length))
model.predict(dummy_input)

print(model.summary())


### Embedding Layer
- Output Shape: (None, 20, 40)
    - None: The batch size is flexible.
    - 20: This is the input length or sequence length (`sent_length`), which represents the number of words in each input sequence.
    - 40: This is the embedding dimension size (`embedding_vector_features`), which is the size of each word's embedding vector.
- Param # (200,000):
    - This is the total number of parameters in the Embedding layer.
    - Calculated as `voc_size * embedding_vector_features = 5000 * 40 = 200,000`.

### LSTM Layer
- Output Shape: (None, 100)
    - None: Again, the batch size is flexible.
    - 100: This is the number of LSTM units (neurons) in the layer.
- Param # (56,400):
    - This is the total number of parameters in the LSTM layer.
    - Calculated as `4 * [(embedding_vector_features + LSTM_units) * LSTM_units + LSTM_units]`.
    - Specifically: `4 * [(40 + 100) * 100 + 100] = 4 * [14,000 + 100] = 4 * 14,100 = 56,400`.
    - These parameters are the weights and biases of the input, forget, cell, and output gates in the LSTM.

### Dense Layer
- Output Shape: (None, 1)
    - None: Again, the batch size is flexible.
    - 1: This is the output size, which is 1 because the model is set up for binary classification (predicting one of two classes).
- Param # (101):
    - This is the total number of parameters in the Dense layer.
    - Calculated as `LSTM_units + 1 = 100 + 1 = 101` (100 weights plus one bias).

All the parameters in the model are trainable, meaning they will be updated during training to minimize the loss. The total is 200,000 + 56,400 + 101 = 256,501 trainable parameters.

Non-trainable params: 0

There are no non-trainable parameters in this model. Non-trainable parameters might exist in models with layers like Batch Normalization where some parameters are not updated during training.
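
The parameter counts above can be checked with a few lines of arithmetic. The helper names below are hypothetical and the sketch only reproduces the formulas from this section. It does not inspect a real Keras model.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per vocabulary index."""
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """One weight per input for each output unit, plus one bias per unit."""
    return input_dim * units + units


voc_size = 5000
embedding_vector_features = 40
lstm_units = 100

counts = {
    "embedding": embedding_params(voc_size, embedding_vector_features),
    "lstm": lstm_params(embedding_vector_features, lstm_units),
    "dense": dense_params(lstm_units, 1),
}
total_params = sum(counts.values())

print(counts)        # {'embedding': 200000, 'lstm': 56400, 'dense': 101}
print(total_params)  # 256501
```

Comparing these numbers with the output of `model.summary()` is a quick way to confirm that the layers were configured as intended.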


In [40]:
X_final=np.array(embedded_docs)
y_final=np.array(y)

In [None]:
# TODO: Check the shape of the final dataset

In [42]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split() # TODO: Split the dataset into training and testing sets with 33% data for testing and random state as 42

## Model Training

In [None]:
### Finally Training
model.fit() # TODO : Fit the model with training data, validation data, epochs as 10 and batch size as 64

## Task: Performance Metrics & Accuracy
To evaluate the performance of our trained model, we will use the following metrics:

1. **Confusion Matrix**: This will help us understand the number of true positives, true negatives, false positives, and false negatives.
2. **Accuracy Score**: This will give us the overall accuracy of the model.
3. **Classification Report**: This will provide precision, recall, f1-score, and support for each class.

We will use the test data to make predictions and then calculate these metrics to assess the model's performance.
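
To make these metrics concrete, the sketch below computes them by hand for six made-up predictions. The data and helper names are hypothetical. The matrix layout follows the scikit-learn convention of rows for true labels and columns for predicted labels.

```python
def threshold(probabilities, cutoff=0.5):
    """Turn sigmoid outputs into class labels: above the cutoff is class 1."""
    return [1 if p > cutoff else 0 for p in probabilities]


def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tp += 1
    return [[tn, fp], [fn, tp]]


# Made-up labels and model outputs, for illustration only.
sample_true = [0, 0, 1, 1, 1, 0]
sample_probs = [0.1, 0.7, 0.9, 0.6, 0.3, 0.2]

sample_pred = threshold(sample_probs)
matrix = confusion_counts(sample_true, sample_pred)
(tn, fp), (fn, tp) = matrix

accuracy = (tn + tp) / len(sample_true)
precision = tp / (tp + fp)  # of the articles flagged as class 1, how many were
recall = tp / (tp + fn)     # of the true class 1 articles, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(matrix)  # [[2, 1], [1, 2]]
print(accuracy, precision, recall, f1)
```

The classification report lists the same precision, recall, and f1-score once per class, together with the number of true samples of that class (the support).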

In [None]:
y_pred=model.predict(X_test)

In [45]:
y_pred=np.where(y_pred > 0.5, 1,0)  # threshold the sigmoid outputs: > 0.5 -> 1, otherwise 0

In [46]:
from sklearn.metrics import confusion_matrix

In [None]:
#TODO: Print the confusion matrix

In [None]:
from sklearn.metrics import accuracy_score
# TODO: Print the accuracy score

In [None]:
from sklearn.metrics import classification_report
# TODO: Print the classification report

## Bidirectional LSTM RNN
In the upcoming cells, we will be building and training a Bidirectional LSTM model for sentiment analysis. The steps include:

1. **Importing Bidirectional Layer**: Import the Bidirectional layer from TensorFlow Keras.
2. **Creating the Model**: Define a Sequential model and add the necessary layers, including the Embedding layer and a Bidirectional LSTM layer.
3. **Compiling the Model**: Compile the model with appropriate loss function, optimizer, and metrics.
4. **Building the Model**: Build the model with the specified input shape.
5. **Training the Model**: Train the model using the training data.
6. **Evaluating the Model**: Evaluate the model's performance using confusion matrix, accuracy score, and classification report.
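
To see what the Bidirectional wrapper adds, the toy sketch below runs a one-unit tanh recurrence over a sequence in both directions and concatenates the two final states. It is an illustration only: the weights are arbitrary constants, both directions share them for brevity (Keras trains a separate copy of the layer for each direction), and the function names are hypothetical.

```python
import math


def simple_rnn_last_state(sequence, w_in=0.5, w_rec=0.8):
    """Run a one-unit tanh recurrence and return the final hidden state."""
    hidden = 0.0
    for x in sequence:
        hidden = math.tanh(w_in * x + w_rec * hidden)
    return hidden


def bidirectional_last_state(sequence):
    """Concatenate the final states of a forward pass and a backward pass."""
    forward = simple_rnn_last_state(sequence)
    backward = simple_rnn_last_state(list(reversed(sequence)))
    return [forward, backward]


toy_sequence = [1.0, 0.0, -1.0]
toy_output = bidirectional_last_state(toy_sequence)

print(len(toy_output))  # 2: one feature per direction
print(toy_output)
```

The output has twice as many features as a single direction, which is why a Bidirectional LSTM with 200 units passes 400 values to the Dense layer under the default concatenation mode.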

In [50]:
from tensorflow.keras.layers import Bidirectional

In [None]:
embedding_vector_features=40 ##features representation
model=0 # TODO : Create a sequential model

# embedding layer
model.add() # TODO : Add the embedding layer with vocabulary size, embedding vector features and input length with sent_length

# LSTM NEURONS
model.add() # TODO : Add a Bidirectional LSTM layer with 200 neurons

# Sigmoid for binary prediction in model
model.add() # TODO: Add a dense layer with 1 neuron and sigmoid activation function

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

In [53]:
model.build(input_shape=(None, sent_length))

In [None]:
# Dummy data: batch size of 1, sentence length of sent_length
dummy_input = 0 # TODO: Create a dummy input with random integers between 0 and voc_size with shape (1, sent_length)
model.predict(dummy_input)

print(model.summary())

In [55]:
X_final=np.array(embedded_docs)
y_final=np.array(y)

In [56]:
X_train, X_test, y_train, y_test = train_test_split() # TODO: Split the data into training and testing data with 33% as the test data and random state as 42

## Task: Model Training
To train our Bidirectional LSTM model, we will follow these steps:

1. **Training the Model**: Use the training data to train the Bidirectional LSTM model. We will specify the number of epochs and batch size.
2. **Making Predictions**: Use the trained model to make predictions on the test data.
3. **Evaluating the Model**: Evaluate the model's performance using confusion matrix, accuracy score, and classification report.

In [None]:
### Training
model.fit() # TODO: Train the model with X_train and y_train with 10 epochs and batch size of 64

In [None]:
# Performance Metrics & Accuracy
y_pred=model.predict(X_test)

In [None]:
y_pred=np.where(y_pred > 0.5, 1,0)  # threshold the sigmoid outputs: > 0.5 -> 1, otherwise 0
y_pred

In [None]:
# TODO: Print the confusion matrix

In [None]:
# TODO: Print the accuracy score

In [None]:
# TODO: Print the classification report