## Sentiment Analysis Interpreter
We train a simple transformer for sentiment analysis on movie reviews, extract interpretable features from it with a sparse autoencoder (SAE), and generate natural-language explanations of those features with LLMs.

#### Imports

In [1]:
import os
# Project helpers used below: load_data, clean_text, tokenize, generate_vocab_map, ReviewDataset
from dataPreprocessing import *
import pandas as pd
from torch.utils.data import DataLoader, RandomSampler

#### Data Preprocessing

##### Load Data

In [2]:
path = 'dataset/'
train_pos, train_neg, test_pos, test_neg = [], [], [], []
# Map each split/label subdirectory to the list that collects its reviews
sets_dict = {'train/pos/': train_pos, 'train/neg/': train_neg, 'test/pos/': test_pos, 'test/neg/': test_neg}
for dataset in sets_dict:
    file_list = [f for f in os.listdir(os.path.join(path, dataset)) if f.endswith('.txt')]
    load_data(os.path.join(path, dataset), file_list, sets_dict[dataset])
# Positive reviews get label 1, negative reviews get label 0
train_data = pd.concat([pd.DataFrame({'review': train_pos, 'label': 1}), pd.DataFrame({'review': train_neg, 'label': 0})], axis=0, ignore_index=True)
test_data = pd.concat([pd.DataFrame({'review': test_pos, 'label': 1}), pd.DataFrame({'review': test_neg, 'label': 0})], axis=0, ignore_index=True)

In [3]:
print(train_data.shape)
print(train_data.head())
print(train_data.tail())

(25000, 2)
                                              review  label
0  For a movie that gets no respect there sure ar...      1
1  Bizarre horror movie filled with famous faces ...      1
2  A solid, if unremarkable film. Matthau, as Ein...      1
3  It's a strange feeling to sit alone in a theat...      1
4  You probably all already know this by now, but...      1
                                                  review  label
24995  My comments may be a bit of a spoiler, for wha...      0
24996  The "saucy" misadventures of four au pairs who...      0
24997  Oh, those Italians! Assuming that movies about...      0
24998  Eight academy nominations? It's beyond belief....      0
24999  Not that I dislike childrens movies, but this ...      0


In [4]:
print(test_data.shape)
print(test_data.head())
print(test_data.tail())

(25000, 2)
                                              review  label
0  Based on an actual story, John Boorman shows t...      1
1  This is a gem. As a Film Four production - the...      1
2  I really like this show. It has drama, romance...      1
3  This is the best 3-D experience Disney has at ...      1
4  Of the Korean movies I've seen, only three had...      1
                                                  review  label
24995  With actors like Depardieu and Richard it is r...      0
24996  If you like to get a couple of fleeting glimps...      0
24997  When something can be anything you want it to ...      0
24998  I had heard good things about "States of Grace...      0
24999  Well, this movie actually did have one redeemi...      0
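
The `load_data` helper comes from `dataPreprocessing` and is not shown in this notebook. As a rough sketch only, a loader with the same call shape (directory, file list, output list) could look like the following. The name `load_reviews` and the UTF-8 encoding are assumptions for illustration, not the project's actual implementation.

```python
import os
import tempfile

def load_reviews(directory, file_list, out_list):
    """Hypothetical stand-in for load_data: append each file's text to out_list."""
    for name in file_list:
        with open(os.path.join(directory, name), encoding="utf-8") as f:
            out_list.append(f.read())

# Tiny demonstration on two temporary review files.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("0_9.txt", "Great film."), ("1_8.txt", "Loved it.")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(text)
    reviews = []
    files = sorted(f for f in os.listdir(tmp) if f.endswith(".txt"))
    load_reviews(tmp, files, reviews)

print(reviews)
```

Filling a caller-supplied list in place matches how the cell above passes `sets_dict[dataset]` and later reads `train_pos`, `train_neg`, and so on.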


##### Tokenize Data

In [5]:
train_data["tokenized"] = train_data["review"].apply(lambda x: tokenize(clean_text(x.lower())))
test_data["tokenized"] = test_data["review"].apply(lambda x: tokenize(clean_text(x.lower())))

In [6]:
print(train_data.head())

                                              review  label  \
0  For a movie that gets no respect there sure ar...      1   
1  Bizarre horror movie filled with famous faces ...      1   
2  A solid, if unremarkable film. Matthau, as Ein...      1   
3  It's a strange feeling to sit alone in a theat...      1   
4  You probably all already know this by now, but...      1   

                                           tokenized  
0  [for, a, movie, that, gets, no, respect, there...  
1  [bizarre, horror, movie, filled, with, famous,...  
2  [a, solid, ,, if, unremarkable, film, ., matth...  
3  [it, 's, a, strange, feeling, to, sit, alone, ...  
4  [you, probably, all, already, know, this, by, ...  
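
The `clean_text` and `tokenize` helpers are also defined outside this notebook. The printed tokens (for example `it`, `'s` and a standalone `,`) suggest a tokenizer that splits off punctuation and clitics. A minimal regex-based sketch that behaves similarly is shown below. Both function bodies are assumptions, including the guess that raw reviews may contain HTML tags such as `<br />`.

```python
import re

def clean_text(text):
    """Hypothetical cleaner: drop HTML-like tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Hypothetical tokenizer: clitics ('s), words, and single punctuation marks."""
    return re.findall(r"'\w+|\w+|[^\w\s]", text)

sample = "It's a solid,<br />if odd film."
tokens = tokenize(clean_text(sample.lower()))
print(tokens)
# ['it', "'s", 'a', 'solid', ',', 'if', 'odd', 'film', '.']
```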


##### Vocab Map

In [7]:
train_vocab, reversed_train_vocab = generate_vocab_map(train_data)
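
`generate_vocab_map` returns a token-to-index map and its inverse. Its implementation is not shown here, so the following is only a sketch of one common way to build such a pair. The function name `build_vocab_map`, the `<pad>` and `<unk>` special tokens, and the `min_freq` cutoff are all assumptions, and the sketch takes a plain list of token lists rather than a DataFrame.

```python
from collections import Counter

def build_vocab_map(token_lists, min_freq=1):
    """Hypothetical vocab builder: token -> id and id -> token dictionaries."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved ids (assumed convention)
    for tok, freq in counts.items():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    reversed_vocab = {idx: tok for tok, idx in vocab.items()}
    return vocab, reversed_vocab

vocab, reversed_vocab = build_vocab_map(
    [["a", "good", "film"], ["a", "bad", "film"]], min_freq=2
)
print(vocab)           # {'<pad>': 0, '<unk>': 1, 'a': 2, 'film': 3}
print(reversed_vocab)  # {0: '<pad>', 1: '<unk>', 2: 'a', 3: 'film'}
```

Building the map from the training split only, as the cell above does, keeps test-set words out of the vocabulary; unseen test tokens would then fall back to an unknown-token id, if the real helper defines one.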

##### Building PyTorch Dataset

In [8]:
train_dataset = ReviewDataset(train_vocab, train_data)
test_dataset  = ReviewDataset(train_vocab, test_data)

train_sampler = RandomSampler(train_dataset)
test_sampler  = RandomSampler(test_dataset)
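
`ReviewDataset` is defined in `dataPreprocessing`, so how it turns tokens into model inputs is not visible here. For a default `DataLoader` to stack reviews into a batch, each item typically needs the same length, which is usually achieved by truncating and padding the id sequence. The sketch below illustrates that step in plain Python. The function `encode_review`, the `max_len` value, and the special tokens are assumptions, not the project's code.

```python
def encode_review(tokens, vocab, max_len=8, pad_token="<pad>", unk_token="<unk>"):
    """Hypothetical encoder: map tokens to ids, truncate, then right-pad to max_len."""
    ids = [vocab.get(tok, vocab[unk_token]) for tok in tokens[:max_len]]
    ids += [vocab[pad_token]] * (max_len - len(ids))
    return ids

toy_vocab = {"<pad>": 0, "<unk>": 1, "a": 2, "good": 3, "film": 4}
padded = encode_review(["a", "good", "film"], toy_vocab, max_len=5)
truncated = encode_review(["a", "weird", "film"], toy_vocab, max_len=2)
print(padded)     # [2, 3, 4, 0, 0]
print(truncated)  # [2, 1]
```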

##### PyTorch DataLoader

In [9]:
BATCH_SIZE = 64
train_iterator = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=train_sampler)
test_iterator  = DataLoader(test_dataset, batch_size=BATCH_SIZE, sampler=test_sampler)
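
A quick way to sanity-check a loader is to pull one batch and inspect its shapes. The sketch below does this on a toy `TensorDataset` standing in for `ReviewDataset` (the toy data and the batch size of 4 are assumptions). It also contrasts `RandomSampler` with `SequentialSampler`: random order matters for training, while a fixed order can make evaluation runs easier to compare, although metrics such as accuracy do not depend on order.

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset

# Toy stand-in for ReviewDataset: 10 "reviews" of 5 token ids each, with binary labels.
ids = torch.arange(50).reshape(10, 5)
labels = torch.tensor([0, 1] * 5)
toy_dataset = TensorDataset(ids, labels)

train_loader = DataLoader(toy_dataset, batch_size=4, sampler=RandomSampler(toy_dataset))
eval_loader = DataLoader(toy_dataset, batch_size=4, sampler=SequentialSampler(toy_dataset))

# One shuffled training batch: 4 reviews x 5 token ids, plus 4 labels.
batch_ids, batch_labels = next(iter(train_loader))
print(tuple(batch_ids.shape), tuple(batch_labels.shape))

# Sequential evaluation visits the dataset in its stored order.
eval_order = torch.cat([x for x, _ in eval_loader])
print(torch.equal(eval_order, ids))
```

With 10 items and a batch size of 4, each loader yields three batches, the last one holding the remaining 2 items.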