# Cross-Domain Sentiment Classification with Domain-Adaptive Neural Networks

## Project Overview

Sentiment analysis, the computational study of opinions expressed in text, is widely used for understanding customer feedback, monitoring social media, and opinion mining. However, a sentiment model trained on one domain often loses accuracy when it is applied to a new domain, because of the discrepancy between the two domains. This project tackles that problem with domain adaptation techniques for neural networks.

## Objective

The primary objective of this project is to develop a neural network that can transfer what it learns in one domain to a different domain. Specifically, we train the model on the IMDB movie review dataset and adapt it to classify sentiment in the Yelp review dataset.

## Approach

The model is trained primarily on source-domain data (IMDB), with a smaller subset of target-domain data (Yelp) used for domain adaptation. A common strategy for setting up the data:

1. Train primarily on the source domain: use all (or a large subset) of the source-domain data to train the feature extractor and sentiment classifier.
2. Use a small subset of the target domain for adaptation: include a smaller, balanced sample of target-domain data during training so the model learns features that are useful for both domains. This is where techniques like the gradient reversal layer come in, encouraging the model to learn domain-agnostic features.
3. Fine-tune further on the target domain if necessary: after training on the combined source data and the small target sample, optionally fine-tune on a larger portion of the target-domain data to improve performance on that specific domain.
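
The role of the gradient reversal layer can be written as an objective. This follows the standard domain-adversarial (DANN) formulation rather than anything specific to this notebook, and the symbols are notation introduced here for illustration: $\theta_f$, $\theta_y$, $\theta_d$ are the parameters of the feature extractor, sentiment classifier, and domain classifier; $L_y$ is the sentiment loss, $L_d$ the domain loss, and $\lambda > 0$ the reversal strength (called `alpha` in the code).

$$
E(\theta_f, \theta_y, \theta_d) = L_y(\theta_f, \theta_y) - \lambda \, L_d(\theta_f, \theta_d)
$$

$$
(\hat\theta_f, \hat\theta_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat\theta_d),
\qquad
\hat\theta_d = \arg\max_{\theta_d} E(\hat\theta_f, \hat\theta_y, \theta_d)
$$

The domain classifier is trained to reduce $L_d$, while the feature extractor is updated with the gradient $\partial L_y / \partial \theta_f - \lambda \, \partial L_d / \partial \theta_f$. The minus sign is what a gradient reversal layer supplies during backpropagation: it pushes the features toward being uninformative about the domain while staying useful for sentiment.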

---


# Initialize Dataframes

In [94]:
# hyperparameters 
yelp_sample_size = 10000  

## IMDB Data ∼ Source Domain

In [1]:
import pandas as pd 
import numpy as np

In [2]:
imdb_df = pd.read_csv('IMDB Dataset.csv')
imdb_df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [3]:
sentiment_counts_imdb = imdb_df['sentiment'].value_counts()
print(sentiment_counts_imdb)

sentiment
positive    25000
negative    25000
Name: count, dtype: int64


## Yelp Data ∼ Target Domain

We treat ratings of 4 and 5 stars as positive and ratings of 1 and 2 stars as negative. We discard 3-star reviews because they are neutral. In further explorations they could instead be assigned to one of the two classes, depending on the needs of the analysis.

https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/data
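
The labelling rule described above, including the alternative of folding 3-star reviews into one class, can be sketched as a small function. The name `stars_to_sentiment` and its `neutral_as` parameter are hypothetical and used only for illustration; the notebook itself filters 3-star rows and applies a lambda.

```python
def stars_to_sentiment(stars, neutral_as=None):
    """Map a Yelp star rating to a sentiment label.

    neutral_as: label to give 3-star reviews, or None to mark them for removal.
    """
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return neutral_as

ratings = [5.0, 3.0, 1.0, 4.0, 2.0]

# Default behaviour: 3-star reviews map to None and are dropped.
labels_dropped = [s for s in map(stars_to_sentiment, ratings) if s is not None]

# Alternative: count 3-star reviews as negative (an arbitrary choice here).
labels_neutral_negative = [stars_to_sentiment(r, neutral_as="negative") for r in ratings]
```

Which class, if any, should absorb the neutral reviews depends on the analysis; dropping them keeps the two classes cleanly separated.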

In [5]:
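import json
import pandas as pd

# The review file is JSON Lines: one JSON object per line.
# A context manager closes the file even if parsing fails.
with open("yelp_academic_dataset_review.json", encoding="utf-8") as data_file:
    review_records = [json.loads(line) for line in data_file]
yelp_df = pd.DataFrame(review_records)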

In [6]:
# Filter out rows where 'stars' is 3 
yelp_df = yelp_df[yelp_df['stars'] != 3.0].copy()

yelp_df['sentiment'] = yelp_df['stars'].apply(lambda x: 'positive' if x >= 4 else 'negative')
yelp_df = yelp_df.rename(columns={'text': 'review'})

yelp_df = yelp_df[['review', 'sentiment']]

yelp_df.head()

Unnamed: 0,review,sentiment
1,I've taken a lot of spin classes over the year...,positive
3,"Wow! Yummy, different, delicious. Our favo...",positive
4,Cute interior and owner (?) gave us tour of up...,positive
5,I am a long term frequent customer of this est...,negative
6,Loved this tour! I grabbed a groupon and the p...,positive


In [7]:
# Ideally the classes are balanced; for simplicity, we also work with a subset of this data.
sentiment_counts = yelp_df['sentiment'].value_counts()
print(sentiment_counts)

sentiment
positive    4684545
negative    1613801
Name: count, dtype: int64


In [90]:
# sampling an equal number of positive and negative reviews
yelp_positive_sample = yelp_df[yelp_df['sentiment'] == 'positive'].sample(n=yelp_sample_size // 2, random_state=42)
yelp_negative_sample = yelp_df[yelp_df['sentiment'] == 'negative'].sample(n=yelp_sample_size // 2, random_state=42)

yelp_balanced_sample = pd.concat([yelp_positive_sample, yelp_negative_sample]).sample(frac=1, random_state=42)

In [96]:
sentiment_counts = yelp_balanced_sample['sentiment'].value_counts()
print(sentiment_counts)

sentiment
negative    5000
positive    5000
Name: count, dtype: int64


## Combine Data to create the final Training Dataset

In [97]:
imdb_df['domain'] = 0
yelp_balanced_sample['domain'] = 1

combined_df = pd.concat([imdb_df, yelp_balanced_sample])

combined_df = combined_df.sample(frac=1, random_state=42).reset_index(drop=True)

In [99]:
combined_df

Unnamed: 0,review,sentiment,domain
0,paula may bitch never butch br br hilari line ...,negative,0
1,mani peopl say show kid hm kid approxim 7 9 ye...,negative,0
2,well written tale make batman sitcom actual re...,positive,0
3,think movi absolut beauti refer breathtak scen...,positive,0
4,film outstand despit nc 17 rate disturb scene ...,positive,0
...,...,...,...
59995,"The girl at the front desk was on the phone, I...",negative,1
59996,avoid one terribl movi excit pointless murder ...,negative,0
59997,product quit surpri absolut love obscur earli ...,positive,0
59998,decent movi although littl bit short time pack...,positive,0


# 1. Initial Model Training with Source Domain (IMDB)

## Data Preprocessing for IMDB: 
<span style="color:red">Status: </span> <span style="color:blue">ALMOST FINISHED: </span>This includes text cleaning, tokenization, and padding.

### Cleaning & Normalization 
First, we clean the texts: lowercase them, strip URLs and special characters, remove stopwords, and apply stemming.

In [8]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk import word_tokenize, pos_tag

import matplotlib.pyplot as plt

In [None]:
#nltk.download('stopwords')

In [9]:
def remove_stopwords(text):
    # remove stop words like "the", "is", "in", "on", "and", "but", etc. 
    # the focus is on the more meaningful words that give insight into the content.
    # A set makes the membership test O(1) per word.
    stop_words = set(stopwords.words('english'))
    return ' '.join(word for word in text.split() if word not in stop_words)

def remove_punctuation(text):
    table = str.maketrans('','',string.punctuation)
    words = text.split()
    filtered_sentence = ''
    for word in words:
        word = word.translate(table)
        filtered_sentence = filtered_sentence + word + ' '
    return filtered_sentence

def normalize_text(text):
    text = text.lower()
    # get rid of urls
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # get rid of non words and extra spaces
    text = re.sub('\\W', ' ', text)
    text = re.sub('\n', '', text)
    text = re.sub(' +', ' ', text)
    text = re.sub('^ ', '', text)
    text = re.sub(' $', '', text)
    return text

def stemming(text):
    ps = PorterStemmer()
    words = text.split()
    filtered_sentence = ''
    for word in words:
        word = ps.stem(word)
        filtered_sentence = filtered_sentence + word + ' '
    return filtered_sentence

def clean_text(text):
    text = text.lower()
    text = text.replace(',',' , ')
    text = text.replace('.',' . ')
    text = text.replace('/',' / ')
    text = text.replace('@',' @ ')
    text = text.replace('#',' # ')
    text = text.replace('?',' ? ')
    text = normalize_text(text)
    text = remove_punctuation(text)
    text = remove_stopwords(text)
    text = stemming(text)
    return text

In [100]:
combined_df['review'] = combined_df['review'].apply(clean_text)

In [67]:
X_train_raw = imdb_df['review']
y_train_raw = imdb_df['sentiment']

In [103]:
from transformers import BertTokenizer
from torch.utils.data import DataLoader, TensorDataset

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

inputs = tokenizer(list(combined_df['review']), padding=True, truncation=True, return_tensors="pt", max_length=512)

In [109]:
import torch

sentiment_mapping = {'positive': 1, 'negative': 0}

combined_df['sentiment'] = combined_df['sentiment'].map(sentiment_mapping)

sentiments = torch.tensor(combined_df['sentiment'].values, dtype=torch.long)
domains = torch.tensor(combined_df['domain'].values, dtype=torch.long)

In [110]:
full_dataset = TensorDataset(inputs['input_ids'], inputs['attention_mask'], sentiments, domains)

train_size = int(0.8 * len(full_dataset))
val_size = len(full_dataset) - train_size

train_dataset, val_dataset = torch.utils.data.random_split(full_dataset, [train_size, val_size])

batch_size = 16
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

## Build Model: 
<span style="color:red">Status: </span> <span style="color:blue">TO DO: </span> Design a neural network architecture suitable for sentiment analysis, e.g., LSTM, GRU, or even a transformer-based model.

In [115]:
from transformers import BertModel, BertTokenizer
import torch.nn as nn
import torch

# Define the model architecture
class SentimentDomainModel(nn.Module):
    def __init__(self, bert_model_name, hidden_size, sentiment_classes, domain_classes):
        super(SentimentDomainModel, self).__init__()
        
        # Feature Extractor
        self.bert = BertModel.from_pretrained(bert_model_name)
        
        # Sentiment Classifier
        self.sentiment_classifier = nn.Sequential(
            nn.Dropout(p=0.1),
            nn.Linear(self.bert.config.hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, sentiment_classes)
        )
        
        # Domain Classifier
        self.domain_classifier = nn.Sequential(
            nn.Dropout(p=0.1),
            nn.Linear(self.bert.config.hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, domain_classes)
        )

        # Strength of the gradient reversal. The value is positive; the sign flip
        # itself happens in GradientReversalFunction.backward.
        self.gradient_reversal_alpha = 1.0

    
    def forward(self, input_ids, attention_mask, token_type_ids=None):
        # BERT Feature Extraction
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        
        pooled_output = outputs.pooler_output

        # Apply the gradient reversal layer with the chosen alpha
        reversed_features = gradient_reversal(pooled_output, self.gradient_reversal_alpha)
        
        # Sentiment classification
        sentiment_output = self.sentiment_classifier(pooled_output)
        
        # Domain classification
        domain_output = self.domain_classifier(reversed_features)
        
        return sentiment_output, domain_output

# Initialize the model
bert_model_name = 'bert-base-uncased'  # You can choose other BERT models as needed
hidden_size = 128  # Size of the hidden layer for both classifiers
sentiment_classes = 2  # Assuming binary classification for sentiment
domain_classes = 2  # Assuming binary classification for domain

model = SentimentDomainModel(bert_model_name, hidden_size, sentiment_classes, domain_classes)

In [None]:
import torch.optim as optim

# Assuming you have already defined the SentimentDomainModel as above

# Criterion for sentiment and domain classification
sentiment_criterion = nn.CrossEntropyLoss()
domain_criterion = nn.CrossEntropyLoss()

# Optimizer
optimizer = optim.Adam(model.parameters(), lr=2e-5)

# Number of epochs
num_epochs = 3

# Training and evaluation function
def train_model(model, train_loader, val_loader, num_epochs, sentiment_criterion, domain_criterion, optimizer):
    for epoch in range(num_epochs):
        model.train()
        total_sentiment_loss = 0
        total_domain_loss = 0

        for batch in train_loader:
            # Unpack the batch
            input_ids, attention_mask, sentiments, domains = batch

            # Zero the parameter gradients
            optimizer.zero_grad()

            # Forward pass
            sentiment_outputs, domain_outputs = model(input_ids, attention_mask, None)

            # Compute loss
            sentiment_loss = sentiment_criterion(sentiment_outputs, sentiments)
            domain_loss = domain_criterion(domain_outputs, domains)

            # Combine losses and backward pass
            total_loss = sentiment_loss + domain_loss
            total_loss.backward()

            # Update parameters
            optimizer.step()

            # Statistics
            total_sentiment_loss += sentiment_loss.item()
            total_domain_loss += domain_loss.item()

        # Validation
        model.eval()
        total_val_loss = 0
        correct = 0
        with torch.no_grad():
            for batch in val_loader:
                input_ids, attention_mask, sentiments, domains = batch
                
                # Forward pass
                sentiment_outputs, domain_outputs = model(input_ids, attention_mask, None)
                
                # Compute loss
                sentiment_loss = sentiment_criterion(sentiment_outputs, sentiments)
                domain_loss = domain_criterion(domain_outputs, domains)
                val_loss = sentiment_loss + domain_loss

                total_val_loss += val_loss.item()

                # Sentiment accuracy
                _, predicted = torch.max(sentiment_outputs.data, 1)
                correct += (predicted == sentiments).sum().item()

        val_accuracy = correct / len(val_loader.dataset)

        print(f'Epoch {epoch+1}/{num_epochs}, '
              f'Train Loss: Sentiment {total_sentiment_loss:.4f} Domain {total_domain_loss:.4f}, '
              f'Val Loss: {total_val_loss:.4f}, Accuracy: {val_accuracy:.4f}')

    # Save the model checkpoint
    torch.save(model.state_dict(), 'sentiment_domain_model.pth')

# Call the training function
train_model(model, train_loader, val_loader, num_epochs, sentiment_criterion, domain_criterion, optimizer)

### Text Embedding

In [None]:
#!pip install torch==2.0.0 torchtext==0.15.0

Pre-trained Word2Vec models have been trained on large datasets and can capture the semantic meaning of words quite well.

In [32]:
# word embedding
import gensim.downloader as api

word2vec_model = api.load("word2vec-google-news-300")

Regarding the aggregation method: taking the mean is a common and straightforward approach, but other methods may retain more information. Some possible alternatives:

- Summation: Instead of averaging the word vectors, sum them. This gives more weight to longer texts.
- TF-IDF Weighting: Weight the word vectors by their term frequency-inverse document frequency scores before averaging or summing, to give more importance to distinctive words.
- Max Pooling: Take the maximum value across each dimension of the word vectors to form the text vector.
- Concatenation: Concatenate the average, max, and min vectors.
- Hierarchical Pooling: Use a more complex pooling strategy that combines vectors in a hierarchical structure to retain more information.
- Paragraph Vectors: Train a model like Doc2Vec that learns fixed-length feature representations for variable-length pieces of text.
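
The simpler alternatives listed above (summation, max pooling, concatenation) differ from the mean only in the reduction applied over the word axis. The sketch below illustrates this on made-up 2-dimensional vectors; the `aggregate` helper and the toy numbers are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

def aggregate(vectors, method="mean"):
    """Collapse a (num_words, dim) array of word vectors into one text vector."""
    vectors = np.asarray(vectors, dtype=np.float32)
    if vectors.size == 0:
        raise ValueError("need at least one word vector")
    if method == "mean":
        return vectors.mean(axis=0)
    if method == "sum":
        return vectors.sum(axis=0)
    if method == "max":
        return vectors.max(axis=0)
    if method == "concat":
        # mean, max and min side by side -> 3 * dim features
        return np.concatenate(
            [vectors.mean(axis=0), vectors.max(axis=0), vectors.min(axis=0)]
        )
    raise ValueError(f"unknown method: {method}")

# Toy "word vectors" for a three-word text (made-up numbers).
toy_vectors = np.array([[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]])

mean_vec = aggregate(toy_vectors, "mean")
sum_vec = aggregate(toy_vectors, "sum")
max_vec = aggregate(toy_vectors, "max")
concat_vec = aggregate(toy_vectors, "concat")
```

Note that concatenation triples the feature dimension, so a classifier consuming it would need a correspondingly larger input layer.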

In [31]:
def text_to_word2vec(text, model):
    words = word_tokenize(text)
    vector_list = [model[word] for word in words if word in model]

    if not vector_list: 
        return np.zeros(model.vector_size)
    
    # Aggregation method: mean vector for the text
    return np.mean(vector_list, axis=0)

In [57]:
X_train_word2vec = np.array([text_to_word2vec(text, word2vec_model) for text in X_train_raw])

> Label Encoding and Dataset Splitting 

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [71]:
one = OneHotEncoder()
y_train_onehot = one.fit_transform(np.asarray(y_train_raw).reshape(-1, 1)).toarray()
y_train_tensor = torch.tensor(y_train_onehot, dtype=torch.float32)

In [72]:
X_train, X_val, y_train, y_val = train_test_split(
    X_train_word2vec, 
    y_train_tensor.numpy(), 
    test_size=0.2,  
    random_state=42  
)

In [73]:
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32)

In [75]:
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)

batch_size = 16

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

In [87]:
from torch.autograd import Function

class GradientReversalFunction(Function):
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        output = grad_output.neg() * ctx.alpha
        # Return gradient for input, None for alpha
        return output, None

# Alias to easily use the function without passing an alpha each time
def gradient_reversal(x, alpha=1.0):
    return GradientReversalFunction.apply(x, alpha)
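
To make the mechanics of the autograd function above concrete, here is a framework-free sketch of the same forward/backward pair. `ManualGradientReversal` and the toy numbers are hypothetical and exist only for illustration; the notebook itself relies on `torch.autograd.Function`.

```python
import numpy as np

class ManualGradientReversal:
    """Hand-written forward/backward pair mirroring GradientReversalFunction."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def forward(self, x):
        # Identity: the domain classifier sees the features unchanged.
        return np.array(x, dtype=np.float64)

    def backward(self, grad_output):
        # Flip the sign and scale by alpha before passing the gradient upstream.
        return -self.alpha * np.asarray(grad_output, dtype=np.float64)

layer = ManualGradientReversal(alpha=0.5)
features = np.array([0.2, -1.0, 3.0])
grad_from_domain_loss = np.array([1.0, -2.0, 0.5])

forward_out = layer.forward(features)
grad_to_features = layer.backward(grad_from_domain_loss)
```

The forward output equals the input, while the gradient reaching the feature extractor is the domain-loss gradient multiplied by `-alpha`. As a result, one optimizer step improves the domain classifier and, at the same time, makes the shared features less useful for telling the domains apart.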

In [112]:
import torch
import torch.nn as nn
from torch.optim import Adam
from transformers import BertModel

# Criterion for sentiment and domain classification
sentiment_criterion = nn.CrossEntropyLoss()
domain_criterion = nn.CrossEntropyLoss()

# Optimizers
optimizer = Adam(model.parameters(), lr=1e-5)

# Number of epochs
num_epochs = 3

for epoch in range(num_epochs):
    model.train()
    total_sentiment_loss = 0
    total_domain_loss = 0

    for inputs, sentiments, domains in train_loader:
        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        sentiment_outputs, domain_outputs = model(inputs['input_ids'], inputs['attention_mask'])

        # Compute loss
        sentiment_loss = sentiment_criterion(sentiment_outputs, sentiments)
        domain_loss = domain_criterion(domain_outputs, domains)

        # Combine losses
        total_loss = sentiment_loss + domain_loss
        total_loss.backward()

        # Update parameters
        optimizer.step()

        # Statistics
        total_sentiment_loss += sentiment_loss.item()
        total_domain_loss += domain_loss.item()

    # Validation
    model.eval()
    total_val_loss = 0
    correct = 0
    with torch.no_grad():
        for inputs, sentiments, domains in val_loader:
            sentiment_outputs, domain_outputs = model(inputs['input_ids'], inputs['attention_mask'])
            sentiment_loss = sentiment_criterion(sentiment_outputs, sentiments)
            domain_loss = domain_criterion(domain_outputs, domains)
            val_loss = sentiment_loss + domain_loss

            total_val_loss += val_loss.item()

            # Sentiment accuracy
            _, predicted = torch.max(sentiment_outputs.data, 1)
            correct += (predicted == sentiments).sum().item()

    val_accuracy = correct / len(val_dataset)

    print(f'Epoch {epoch+1}/{num_epochs}, '
          f'Train Loss: Sentiment {total_sentiment_loss:.4f} Domain {total_domain_loss:.4f}, '
          f'Val Loss: {total_val_loss:.4f}, Accuracy: {val_accuracy:.4f}')

# Save the model checkpoint
torch.save(model.state_dict(), 'sentiment_domain_model.pth')


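ValueError: too many values to unpack (expected 3)

The loop above unpacks three values per batch, but `train_loader` was built from a `TensorDataset` holding four tensors (`input_ids`, `attention_mask`, `sentiments`, `domains`), so each batch yields four. The `train_model` function defined earlier unpacks all four and passes `input_ids` and `attention_mask` to the model directly.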

## (LINEAR) Train Model on IMDB: 
<span style="color:red">Status: </span> <span style="color:blue">TO DO: </span> Using the IMDB dataset, train your model until it achieves satisfactory performance. This trained model captures the characteristics of the source domain.

In [84]:
def evaluate_model(model, criterion, data_loader):
    model.eval()
    total_loss = 0
    correct_predictions = 0
    
    with torch.no_grad():
        for inputs, labels in data_loader:
            outputs = model(inputs)
            
            # Convert one-hot labels to class indices, the target format CrossEntropyLoss expects
            labels = labels.argmax(dim=1)
            
            loss = criterion(outputs, labels)
            total_loss += loss.item()
            
            _, predicted = torch.max(outputs.data, 1)
            correct_predictions += (predicted == labels).sum().item()

    accuracy = (correct_predictions / len(data_loader.dataset)) * 100
    avg_loss = total_loss / len(data_loader)
    
    return avg_loss, accuracy

In [85]:
def train_model(model, criterion, optimizer, train_loader, val_loader, num_epochs=100, patience=5):
    best_val_loss = float('inf')
    no_improvement_epochs = 0
    
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            
            outputs = model(inputs)
            
            # Convert one-hot labels to class indices, the target format CrossEntropyLoss expects
            labels = labels.argmax(dim=1)
            
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        
        val_loss, accuracy = evaluate_model(model, criterion, val_loader)
        print(f"Epoch {epoch+1}, Validation Loss: {val_loss}, Validation Accuracy: {accuracy}%")
        
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            no_improvement_epochs = 0
            # Save the best model
            torch.save(model.state_dict(), 'best_model.pth')
        else:
            no_improvement_epochs += 1
            
        if no_improvement_epochs >= patience:
            print(f"Early stopping triggered after {epoch+1} epochs.")
            break

    return model

In [82]:
import torch.optim as optim

input_dim = 300   # Dimension of Word2Vec features
hidden_dim = 128  # Size of the first hidden layer
output_dim = 2    # Number of classes, assuming binary classification

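# SentimentClassifier is not defined in the cells shown here; it is assumed to be
# an nn.Module that maps input_dim features to output_dim logits.
model = SentimentClassifier(input_dim, hidden_dim, output_dim)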
criterion = nn.CrossEntropyLoss() 
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [83]:
num_epochs = 100
patience = 5  

# Pass the variables defined above instead of repeating the literal values
trained_model = train_model(model, criterion, optimizer, train_loader, val_loader, num_epochs=num_epochs, patience=patience)

Epoch 1, Validation Loss: 0.4208904819965363, Validation Accuracy: 80.97%
Epoch 2, Validation Loss: 0.4089745656967163, Validation Accuracy: 81.65%
Epoch 3, Validation Loss: 0.40679861245155335, Validation Accuracy: 81.49%
Epoch 4, Validation Loss: 0.4046786313056946, Validation Accuracy: 81.57%
Epoch 5, Validation Loss: 0.40317483930587766, Validation Accuracy: 81.96%
Epoch 6, Validation Loss: 0.42194099118709566, Validation Accuracy: 80.72%
Epoch 7, Validation Loss: 0.3963768006324768, Validation Accuracy: 82.19999999999999%
Epoch 8, Validation Loss: 0.4460666743993759, Validation Accuracy: 79.82000000000001%
Epoch 9, Validation Loss: 0.39371530619859696, Validation Accuracy: 82.17%
Epoch 10, Validation Loss: 0.3936872004389763, Validation Accuracy: 82.3%
Epoch 11, Validation Loss: 0.3916040333867073, Validation Accuracy: 82.49%
Epoch 12, Validation Loss: 0.40407978297472, Validation Accuracy: 81.97%
Epoch 13, Validation Loss: 0.40167471257448195, Validation Accuracy: 82.11%
Epoch 14

# 2. Domain Adaptation

## Adversarial Training

# 3. Fine-tuning on Target Domain (Optional but beneficial)

# 4. Evaluation

# Some code I discarded

In [21]:
import torch
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.nn.utils.rnn import pad_sequence

In [22]:
tokenizer = get_tokenizer('basic_english')

def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

In [23]:
vocab_size = 10000
max_length = 50
padding_value = 0  # Index for <pad> token

In [24]:
vocab = build_vocab_from_iterator(yield_tokens(X_train), specials=['<unk>', '<pad>', '<OOV>'], max_tokens=vocab_size)
vocab.set_default_index(vocab['<OOV>'])  

In [26]:
X_train_tokenized = [torch.tensor(vocab(tokenizer(sentence))) for sentence in X_train]
# X_test_tokenized = [torch.tensor(vocab(tokenizer(sentence))) for sentence in X_test]

X_train_padded = pad_sequence(X_train_tokenized, batch_first=True, padding_value=padding_value).to(torch.int64)
# X_test_padded = pad_sequence(X_test_tokenized, batch_first=True, padding_value=padding_value).to(torch.int64)

X_train_padded = X_train_padded[:, :max_length]
# X_test_padded = X_test_padded[:, :max_length]

### Tensorflow approach
NO NEED TO RUN THIS PART!
(notebook I found for inspiration: https://www.kaggle.com/code/antoniofranca/sentiment-analysis-on-imdb-movie-reviews/edit)

In [None]:
# important libraries for deep learning
import tensorflow as tf 
from tensorflow import keras
# for tokenizing texts
from tensorflow.keras.preprocessing.text import Tokenizer
# for text padding and truncating
from tensorflow.keras.utils import pad_sequences

In [None]:
# important properties
vocab_size = 10000
max_length = 50

trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'

In [None]:
# Define tokenizer and fit on texts
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(X_train)

In [None]:
#To Save conf execute this cell
#Save Tokenizer Configuration
import json 
import os 

tok_conf = tokenizer.to_json()

with open('tok_conf.json', 'w') as outfile:
    outfile.write(json.dumps(tok_conf))

In [None]:
# Let's Tokenize and pad texts
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

X_train = pad_sequences(X_train, maxlen=max_length,
                         padding=padding_type,
                         truncating=trunc_type)
X_test = pad_sequences(X_test, maxlen=max_length,
                         padding=padding_type,
                         truncating=trunc_type)

In [None]:
X_train.shape