# Cross-Domain Sentiment Classification with Domain-Adaptive Neural Networks

## Project Overview

Sentiment analysis, the computational study of opinions expressed in text, has wide applications in customer feedback analysis, social media monitoring, and opinion mining. However, a sentiment model trained on one domain often performs noticeably worse on another, because vocabulary, style, and the way sentiment is expressed differ between domains (domain discrepancy). This project tackles that challenge with domain adaptation techniques for neural networks.

## Objective

The primary objective of this project is to develop a neural network that transfers knowledge learned in one domain to a different domain. Specifically, we train the model on the IMDB movie review dataset and adapt it to classify sentiment in the Yelp review dataset.

## Approach

(TO WRITE after finishing the code)

---


# Initialize Dataframes

## IMDB Data ∼ Source Domain

In [19]:
import pandas as pd 
import numpy as np

In [20]:
imdb_df = pd.read_csv('IMDB Dataset.csv')
imdb_df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. `<br /><br />`The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [32]:
sentiment_counts_imdb = imdb_df['sentiment'].value_counts()
print(sentiment_counts_imdb)

sentiment
positive    25000
negative    25000
Name: count, dtype: int64


## Yelp Data ∼ Target Domain

We treat ratings of 4 and 5 stars as positive and ratings of 1 and 2 stars as negative. We discard 3-star reviews because they are neutral. In later experiments, we may instead assign them to one of the two classes if the analysis calls for it.

https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/data

In [27]:
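import json
import pandas as pd

# The file is in JSON Lines format: one review object per line.
# The with-block closes the file even if parsing fails part-way through.
with open("yelp-dataset/yelp_academic_dataset_review.json", encoding="utf-8") as data_file:
    reviews = [json.loads(line) for line in data_file]
yelp_df = pd.DataFrame(reviews)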

In [28]:
# Filter out rows where 'stars' is 3 
yelp_df = yelp_df[yelp_df['stars'] != 3.0].copy()

yelp_df['sentiment'] = yelp_df['stars'].apply(lambda x: 'positive' if x >= 4 else 'negative')
yelp_df = yelp_df.rename(columns={'text': 'review'})

yelp_df = yelp_df[['review', 'sentiment']]

yelp_df.head()

|   | review | sentiment |
|---|--------|-----------|
| 1 | I've taken a lot of spin classes over the year... | positive |
| 3 | Wow! Yummy, different, delicious. Our favo... | positive |
| 4 | Cute interior and owner (?) gave us tour of up... | positive |
| 5 | I am a long term frequent customer of this est... | negative |
| 6 | Loved this tour! I grabbed a groupon and the p... | positive |


In [31]:
# Ideally the two classes would be balanced; check the label distribution
sentiment_counts = yelp_df['sentiment'].value_counts()
print(sentiment_counts)

sentiment
positive    4684545
negative    1613801
Name: count, dtype: int64
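
The Yelp labels above are skewed roughly 3:1 toward positive. One common way to obtain the symmetry the comment asks for is to downsample the majority class. The following is a minimal sketch on a toy frame; the helper name `balance_by_downsampling`, the seed, and the toy data are illustrative assumptions and not part of the notebook so far.

```python
import pandas as pd

def balance_by_downsampling(df, label_col="sentiment", seed=42):
    """Return a shuffled copy of df with every class cut to the smallest class size."""
    n_min = df[label_col].value_counts().min()
    # Draw n_min rows from each class without replacement.
    balanced = df.groupby(label_col).sample(n=n_min, random_state=seed)
    # Shuffle so the classes are not stored in contiguous blocks.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy stand-in for yelp_df: 4 positive and 2 negative reviews.
toy_df = pd.DataFrame({
    "review": ["great", "loved it", "tasty", "nice staff", "awful", "cold food"],
    "sentiment": ["positive"] * 4 + ["negative"] * 2,
})
balanced_df = balance_by_downsampling(toy_df)
print(balanced_df["sentiment"].value_counts())
```

Applied to the real data, `yelp_df = balance_by_downsampling(yelp_df)` would keep 1,613,801 reviews per class. Class weights in the loss function are an alternative that does not discard data.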


# 1. Initial Model Training with Source Domain (IMDB)

## Data Preprocessing for IMDB: 
<span style="color:red">Status: </span> <span style="color:blue">ALMOST FINISHED: </span>This includes text cleaning, tokenization, and padding.

### Cleaning & Normalization 
We first clean the text: lowercasing, removing URLs, special characters, punctuation, and stopwords, and then stemming each remaining word.

In [21]:
import re
import string

import matplotlib.pyplot as plt
import nltk
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [34]:
# Uncomment on the first run to download the NLTK stopword list
# nltk.download('stopwords')

In [23]:
# Build these once instead of on every call; set membership tests are O(1).
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def remove_stopwords(text):
    # Remove stop words like "the", "is", "in", "on", "and", "but", etc.,
    # so the focus is on the more meaningful words that give insight into the content.
    return ' '.join(word for word in text.split() if word not in STOP_WORDS)

def remove_punctuation(text):
    # Delete every character listed in string.punctuation.
    table = str.maketrans('', '', string.punctuation)
    return ' '.join(text.translate(table).split())

def normalize_text(text):
    text = text.lower()
    # get rid of HTML line breaks such as <br />, which appear in the IMDB reviews
    text = re.sub(r'<br\s*/?>', ' ', text)
    # get rid of urls
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # get rid of non-word characters, then collapse runs of whitespace
    text = re.sub(r'\W', ' ', text)
    return ' '.join(text.split())

def stemming(text):
    # Reduce each word to its Porter stem, e.g. "movies" -> "movi".
    return ' '.join(STEMMER.stem(word) for word in text.split())

def clean_text(text):
    # normalize_text must see the raw text: padding '.' and '/' with spaces first
    # would break URLs apart before the URL pattern can match them.
    text = normalize_text(text)
    text = remove_punctuation(text)
    text = remove_stopwords(text)
    text = stemming(text)
    return text

In [29]:
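imdb_df['review'] = imdb_df['review'].apply(clean_text)
# Clean the target domain the same way so both domains are tokenized consistently.
# Note: this is slow on the full Yelp set (about 6.3M reviews).
yelp_df['review'] = yelp_df['review'].apply(clean_text)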

> Dataset Splitting and Label Encoding

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [7]:
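# Train on the source domain (IMDB) and test on the target domain (Yelp).
X_train = imdb_df['review']
y_train = imdb_df['sentiment']

X_test = yelp_df['review']
y_test = yelp_df['sentiment']

# One-hot encode the labels; columns follow the sorted categories: [negative, positive].
one = OneHotEncoder()
y_train = one.fit_transform(np.asarray(y_train).reshape(-1,1)).toarray()
# Reuse the fitted encoder so the test labels get the same column order.
y_test = one.transform(np.asarray(y_test).reshape(-1,1)).toarray()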
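
`train_test_split` is imported above but not used yet. Since the test set here is the entire Yelp corpus, one plausible use is to hold out part of IMDB as an in-domain validation set for monitoring training. The sketch below shows a stratified split on toy data; the 80/20 ratio, the seed, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the cleaned IMDB reviews and their string labels.
reviews = np.array([f"review {i}" for i in range(10)])
labels = np.array(["positive", "negative"] * 5)

# stratify keeps the positive/negative ratio the same in both splits.
X_tr, X_val, y_tr, y_val = train_test_split(
    reviews, labels, test_size=0.2, stratify=labels, random_state=42
)
print(len(X_tr), len(X_val))  # 8 2
```

The split should be made before one-hot encoding, or the encoded labels should be passed through `train_test_split` alongside the reviews, so that rows stay aligned.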

### Tokenization and Padding

In [43]:
# !pip install torch==2.0.0 torchtext==0.15.0

In [10]:
import torch
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.nn.utils.rnn import pad_sequence

In [11]:
tokenizer = get_tokenizer('basic_english')

def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

In [12]:
vocab_size = 10000
max_length = 50
padding_value = 0  # Index for <pad> token

In [13]:
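# '<pad>' is listed first so that it gets index 0, matching padding_value above.
vocab = build_vocab_from_iterator(yield_tokens(X_train), specials=['<pad>', '<unk>', '<OOV>'], max_tokens=vocab_size)
vocab.set_default_index(vocab['<OOV>'])  # tokens outside the vocabulary map to '<OOV>'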

In [14]:
# Truncate each review to max_length *before* padding. Otherwise pad_sequence pads
# every row to the length of the longest review, which wastes a lot of memory.
# An explicit dtype also keeps empty reviews from becoming float tensors.
X_train_tokenized = [torch.tensor(vocab(tokenizer(sentence))[:max_length], dtype=torch.int64) for sentence in X_train]
X_test_tokenized = [torch.tensor(vocab(tokenizer(sentence))[:max_length], dtype=torch.int64) for sentence in X_test]

X_train_padded = pad_sequence(X_train_tokenized, batch_first=True, padding_value=padding_value)
X_test_padded = pad_sequence(X_test_tokenized, batch_first=True, padding_value=padding_value)
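
To make the index, truncate, and pad steps concrete, here is a dependency-free sketch of what they do on a few toy sentences. It mirrors the idea rather than torchtext's internals; `build_vocab`, `encode`, and the tiny `max_tokens` and `max_length` values are illustrative assumptions.

```python
from collections import Counter

PAD, OOV = "<pad>", "<OOV>"

def build_vocab(texts, max_tokens):
    """Map the most frequent tokens to integer ids; ids 0 and 1 are reserved."""
    counts = Counter(tok for text in texts for tok in text.split())
    keep = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: idx for idx, tok in enumerate([PAD, OOV] + keep)}

def encode(text, vocab, max_length):
    """Convert text to ids, truncate to max_length, then right-pad with the <pad> id."""
    ids = [vocab.get(tok, vocab[OOV]) for tok in text.split()][:max_length]
    return ids + [vocab[PAD]] * (max_length - len(ids))

# The vocabulary is built from the source (training) texts only.
train_texts = ["good movi good act", "good movi bad"]
toy_vocab = build_vocab(train_texts, max_tokens=4)
print(toy_vocab)  # {'<pad>': 0, '<OOV>': 1, 'good': 2, 'movi': 3}

short_ids = encode("good food", toy_vocab, max_length=4)
long_ids = encode("bad movi good act good", toy_vocab, max_length=4)
print(short_ids)  # [2, 1, 0, 0]  "food" is out of vocabulary, then two pads
print(long_ids)   # [1, 3, 2, 1]  rare words fall back to <OOV>; the fifth token is cut
```

Because the vocabulary comes from IMDB alone, Yelp-specific words such as "food" map to the out-of-vocabulary id, which is one concrete source of the domain gap this project targets.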

### TensorFlow approach
No need to run this part!
(Notebook used for inspiration: https://www.kaggle.com/code/antoniofranca/sentiment-analysis-on-imdb-movie-reviews/edit)

In [69]:
# important libraries for deep learning
import tensorflow as tf 
from tensorflow import keras
# for tokenizing texts
from tensorflow.keras.preprocessing.text import Tokenizer
# for text padding and truncating
from tensorflow.keras.utils import pad_sequences

In [70]:
# important properties
vocab_size = 10000
max_length = 50

trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'

In [71]:
# Define tokenizer and fit on texts
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(X_train)

In [72]:
# Run this cell to save the tokenizer configuration
import json

tok_conf = tokenizer.to_json()

with open('tok_conf.json', 'w') as outfile:
    outfile.write(json.dumps(tok_conf))

In [73]:
# Let's Tokenize and pad texts
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

X_train = pad_sequences(X_train, maxlen=max_length,
                         padding=padding_type,
                         truncating=trunc_type)
X_test = pad_sequences(X_test, maxlen=max_length,
                         padding=padding_type,
                         truncating=trunc_type)

In [75]:
X_train.shape

(40000, 50)

## Build Model: 
<span style="color:red">Status: </span> <span style="color:blue">TO DO: </span> Design a neural network architecture suitable for sentiment analysis, e.g., LSTM, GRU, or even a transformer-based model.
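
As a starting point for this step, the following is a minimal sketch of one possible architecture: an embedding layer, a bidirectional LSTM, and a linear classifier. The class name `LSTMSentimentClassifier` and all layer sizes are assumptions chosen for illustration; only `vocab_size = 10000`, `max_length = 50`, and the pad index 0 come from the preprocessing above.

```python
import torch
from torch import nn

class LSTMSentimentClassifier(nn.Module):
    """Embedding -> bidirectional LSTM -> linear layer over the final hidden states."""

    def __init__(self, vocab_size=10000, embed_dim=64, hidden_dim=64, num_classes=2, pad_idx=0):
        super().__init__()
        # padding_idx keeps the <pad> embedding fixed at zero.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)                  # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)                  # hidden: (2, batch, hidden_dim)
        features = torch.cat([hidden[0], hidden[1]], dim=1)   # (batch, 2 * hidden_dim)
        return self.classifier(features)                      # raw logits: (batch, num_classes)

model = LSTMSentimentClassifier()
dummy_batch = torch.randint(0, 10000, (4, 50))  # 4 reviews of max_length = 50 token ids
logits = model(dummy_batch)
print(logits.shape)  # torch.Size([4, 2])
```

With post-padding, the forward direction's final state is computed after reading the pad tokens; packing the sequences with `torch.nn.utils.rnn.pack_padded_sequence` would be a natural refinement. A GRU variant only needs `nn.GRU` in place of `nn.LSTM`, with the return value unpacked as `_, hidden`.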

## Train Model on IMDB: 
<span style="color:red">Status: </span> <span style="color:blue">TO DO: </span> Using the IMDB dataset, train your model until it achieves satisfactory performance. This trained model captures the characteristics of the source domain.
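
A training loop for this step could look like the sketch below. It runs on random toy tensors and uses a deliberately tiny placeholder model (`BagOfEmbeddings`, a hypothetical name) so that it is self-contained; the optimizer, learning rate, batch size, and epoch count are all assumptions, and any classifier that returns logits could be substituted.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy stand-ins: 64 "reviews" of 50 token ids each, with one-hot labels like y_train.
X = torch.randint(1, 100, (64, 50))
y_onehot = torch.eye(2)[torch.randint(0, 2, (64,))]

class BagOfEmbeddings(nn.Module):
    """Placeholder model: mean of token embeddings followed by a linear layer."""

    def __init__(self, vocab_size=100, embed_dim=16, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.out = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        return self.out(self.embedding(token_ids).mean(dim=1))

model = BagOfEmbeddings()
loader = DataLoader(TensorDataset(X, y_onehot), batch_size=16, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(5):
    model.train()
    total = 0.0
    for batch_X, batch_y in loader:
        optimizer.zero_grad()
        # The loss is given class indices here, so one-hot rows are converted with argmax.
        loss = loss_fn(model(batch_X), batch_y.argmax(dim=1))
        loss.backward()
        optimizer.step()
        total += loss.item() * len(batch_X)
    epoch_losses.append(total / len(X))
    print(f"epoch {epoch + 1}: loss = {epoch_losses[-1]:.4f}")
```

In practice, "satisfactory performance" is easier to judge on a held-out IMDB validation split than on the training loss alone, since the training loss keeps falling even when the model overfits.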

# 2. Domain Adaptation

# 3. Fine-tuning on Target Domain (Optional but beneficial)

# 4. Evaluation