# LELA 60331  Computational Linguistics 1
### Week 12

Today we are going to use PyTorch to perform classification with sequence models.

The first dataset we are going to work with consists of just over 10,000 surnames, labelled with 18 different nationalities. The first task will be to learn a classifier that can accurately assign a nationality to previously unseen surnames. To do this we will use RNNs.

In [None]:
! wget https://raw.githubusercontent.com/cbannard/lela60342/refs/heads/main/surnames_data.csv

We read the data into a Pandas dataframe:

In [None]:
import pandas as pd
import torch
surnames_df=pd.read_csv("surnames_data.csv")
surnames_df

We then use hierarchical indexing in Pandas to represent the data as sequences of separate characters.

### Hierarchical indexing

One Pandas feature that you will find useful in representing language data is hierarchical indexing.

In [None]:
import pandas as pd
import numpy as np
s=pd.Series(np.random.randn(16),index=[[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],["a","b","c","d","a","b","c","d","a","b","c","d","a","b","c","d"]])
s

We can select subsets from the hierarchical index as follows:

In [None]:
s.loc[1]

In [None]:
s.loc[1,"a"]

In [None]:
s.loc[:,"a"]

We can use this to represent our data, where characters belong to words:

In [None]:
import torch
surnames_df=pd.read_csv("surnames_data.csv")

chars=[]
index_1=[]
index_2=[]
for i,row in surnames_df.iterrows():
    chars.extend(list(row.surname))
    index_1.extend([i]*len(row.surname))
    index_2.extend(range(len(row.surname)))

surnames_chars = pd.DataFrame(chars,index=[index_1,index_2])
surnames_chars.columns = ["chars"]
surnames_chars

We can then use the Pandas function get_dummies to produce one-hot codings of the characters:

In [None]:
surnames_oh=pd.get_dummies(surnames_chars.chars,dtype=int)
surnames_oh

And of the nationalities:

In [None]:
nationalities_oh=pd.get_dummies(surnames_df.nationality,dtype=int)
nationalities_oh
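To make the shape of these one-hot tables concrete, here is a minimal sketch on two made-up "names". The names `names`, `toy_chars` and `toy_oh` are illustrative only and are not part of the surname data.

```python
import pandas as pd

# Toy stand-in for the surname data: two short made-up "names".
names = ["abba", "cab"]

chars, name_idx, char_idx = [], [], []
for i, name in enumerate(names):
    chars.extend(name)                   # one row per character
    name_idx.extend([i] * len(name))     # outer index: which name
    char_idx.extend(range(len(name)))    # inner index: position within the name

# Two index lists give a hierarchical (two-level) index.
toy_chars = pd.Series(chars, index=[name_idx, char_idx], name="chars")
toy_oh = pd.get_dummies(toy_chars, dtype=int)

print(toy_oh.columns.tolist())   # the alphabet found in the toy data
print(toy_oh.loc[1].values)      # one-hot rows for the second name, "cab"
```

Each row is one character and each column is one symbol of the alphabet, so selecting a name with `.loc` gives a [characters, alphabet] matrix for that name.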

We will then turn these into tensors for input to PyTorch, and in particular to a recurrent layer. We want a tensor with the shape [Number_of_names, Number_of_characters_in_name, Size_of_alphabet].

However, a single tensor requires that all sequences be of the same length, so we pad each name by adding rows of zeros, each the length of a one-hot coding, to the beginning of the name. The tensor therefore actually has the shape [Number_of_names, Number_of_characters_in_the_longest_name, Size_of_alphabet].

We do this using ZeroPad1d, which we give a tuple with the following entries: padding_left, padding_right, padding_above, padding_below. The first two pad the last dimension (the alphabet) and the last two pad the second-to-last dimension (the character positions), so (0,0,2,0) adds two rows of zeros before the first character.
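The same front padding can be sketched with `torch.nn.functional.pad`, which reads the padding tuple from the last dimension backwards. The sizes `alphabet_size` and `max_length` below are made-up values for illustration, not those of the surname data.

```python
import torch
import torch.nn.functional as F

alphabet_size = 4    # hypothetical alphabet size
max_length = 5       # hypothetical length of the longest name

# A "name" of three characters, one one-hot row per character.
name = torch.eye(alphabet_size)[[0, 2, 1]]    # shape [3, 4]

# (left, right) for the alphabet axis, then (before, after) for the character axis.
padded = F.pad(name, (0, 0, max_length - name.shape[0], 0))

print(padded.shape)   # torch.Size([5, 4])
print(padded)         # two rows of zeros followed by the three one-hot rows
```

Padding at the front rather than the end means the last time step always holds a real character, which matters when the classifier reads only the final step.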





In [None]:
from torch import nn
t=torch.ones([5,5,5])
print(t[0,:,:])
m = nn.ZeroPad1d((0,0,2,0))
print(m(t))

In [None]:
from torch import nn
# Find the length of the longest name. Character positions count from 0, so add 1.
max_length=max([t[1] for t in surnames_oh.index])+1
# Make an array for the name tensors
X = [0] * (max(surnames_oh.index)[0]+1)
# Make an array for the label tensors
y = [0] * (max(nationalities_oh.index)+1)
# Iterate over the name ids (the first level of the hierarchical index), visiting each name once
for name_id in surnames_oh.index.get_level_values(0).unique():
    # Make a tensor from the subset of the dataframe for this name
    s=torch.from_numpy(surnames_oh.loc[name_id].values).to(dtype=torch.float)
    # Pad the tensor with rows of zeros before the first character
    m = nn.ZeroPad1d((0,0,max_length-len(s),0))
    # Add tensors to arrays
    X[name_id] = m(s).cuda()
    y[name_id] = torch.from_numpy(nationalities_oh.loc[name_id].values).to(dtype=torch.float).cuda()
# Combine contents of arrays into a single tensor
X=torch.stack(X)
y=torch.stack(y)

In [None]:
X.shape

In [None]:
y.shape

### RNN layers in PyTorch

RNN layers can be specified as follows. We need to specify the size of the input (e.g. the length of our one-hot vectors), the size of the hidden layer to use and the number of layers to include. Because our data is laid out as (batch, seq, feature) rather than (seq, batch, feature), we also set the batch_first flag.


In [None]:
input=torch.randn((1,10,10))
rnn = nn.RNN(input_size=10, hidden_size=5, num_layers=1, batch_first=True)
# nn.RNN returns the per-step outputs first and the final hidden state second
output, hidden = rnn(input)
print(output)

The RNN returns two values. The first, the output, contains the hidden state of the last layer at every step of the sequence, with shape (batch, seq, hidden_size). The second contains the final hidden state of each layer, with shape (num_layers, batch, hidden_size). Here we are interested in the hidden state from the final step of each sequence, as this is what we will pass on to a linear layer to perform classification. With a single layer we could take this from either of the two returned values. In our code we take it from the output.

In [None]:
hidden
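A minimal sketch, on random data, of how the two values returned by `nn.RNN` relate for a single-layer, unidirectional network. The batch size of 3 and sequence length of 7 are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(3, 7, 10)    # (batch, seq, feature)
rnn = nn.RNN(input_size=10, hidden_size=5, num_layers=1, batch_first=True)

output, h_n = rnn(x)
print(output.shape)   # torch.Size([3, 7, 5]): one hidden vector per time step
print(h_n.shape)      # torch.Size([1, 3, 5]): final hidden vector for each layer

# The last time step of the output is the same vector as the final hidden state.
print(torch.allclose(output[:, -1, :], h_n[-1]))   # True
```

Note that the final hidden state keeps the layer dimension first even when batch_first=True.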

We can use torch.nn.Module to define our whole model

In [None]:
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
n_classes = 18

class SeqModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(input_size=84, hidden_size=42, num_layers=1, batch_first=True)
        self.linear = nn.Linear(42, n_classes)
    def forward(self, x):
        x, _ = self.rnn(x)
        # take only the last output
        x = x[:, -1, :]
        x = self.linear(x)
        return x

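Now that we have the model, we can split the data, train and then test. We will use CrossEntropyLoss because this is an 18-class problem: the loss applies a log-softmax to the raw scores from the linear layer, which is why the model itself has no softmax. We will train in batches.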
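As a rough illustration of what CrossEntropyLoss expects, the sketch below feeds it raw scores together with either integer class ids or one-hot float targets (the form used for `y` in this notebook). The scores and labels are made up, and float targets assume a reasonably recent PyTorch (1.10 or later).

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
n_classes = 18
logits = torch.randn(4, n_classes)           # raw model outputs, no softmax applied
class_ids = torch.tensor([0, 5, 17, 5])      # made-up integer labels
one_hot = F.one_hot(class_ids, n_classes).float()

loss_fn = nn.CrossEntropyLoss()
loss_from_ids = loss_fn(logits, class_ids)
loss_from_one_hot = loss_fn(logits, one_hot)   # float targets are read as class probabilities

# The same value by hand: mean negative log-softmax score of the true class.
manual = -F.log_softmax(logits, dim=1)[torch.arange(4), class_ids].mean()

print(loss_from_ids.item(), loss_from_one_hot.item(), manual.item())
```

All three numbers agree, which shows that the softmax lives inside the loss rather than in the model.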

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

In [None]:
from sklearn.utils import gen_batches
import matplotlib.pyplot as plt
n_epochs = 150
batch_size = 128
model = SeqModel()
model.to("cuda")
ce_loss=[]
optimizer = optim.Adam(model.parameters(),lr=0.005)
loss_fn = nn.CrossEntropyLoss()

for i in range(n_epochs):
    # Total loss over all batches in this epoch
    cumul_loss = 0.0
    batches = gen_batches(X_train.shape[0],batch_size)
    for k in batches:
        inputs=X_train[k]
        targets=y_train[k]
        y_pred = model(inputs)
        loss = loss_fn(y_pred, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        cumul_loss += loss.item()
    ce_loss.append(cumul_loss)

plt.plot(range(1,n_epochs),ce_loss[1:])
plt.xlabel("number of epochs")
plt.ylabel("loss")



The macro-averaged precision, recall and F-score on the test set are as follows:

In [None]:
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
y_test_pred=[np.argmax(x.cpu().detach().numpy()) for x in model(X_test)]
y_test_int=[np.argmax(x.cpu().detach().numpy()) for x in y_test]
precision_recall_fscore_support(y_test_int, y_test_pred, average='macro')
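For reference, macro averaging computes precision and recall separately for each class and then takes the unweighted mean. A small hand-worked sketch with made-up scores for two classes:

```python
import torch

# Made-up scores for 4 examples and 2 classes, plus one-hot targets.
logits = torch.tensor([[2.0, 0.1], [0.2, 1.5], [0.3, 0.9], [0.1, 3.0]])
targets = torch.tensor([[1, 0], [1, 0], [0, 1], [0, 1]])

pred = logits.argmax(dim=1)    # predicted class per example: 0, 1, 1, 1
true = targets.argmax(dim=1)   # true class per example:      0, 0, 1, 1

precisions, recalls = [], []
for c in range(logits.shape[1]):
    tp = ((pred == c) & (true == c)).sum().item()                # true positives for class c
    precisions.append(tp / max((pred == c).sum().item(), 1))     # of those predicted as c
    recalls.append(tp / max((true == c).sum().item(), 1))        # of those truly c

macro_precision = sum(precisions) / len(precisions)
macro_recall = sum(recalls) / len(recalls)
print(macro_precision, macro_recall)   # approximately 0.833 and 0.75
```

Because every class counts equally, a rare nationality that the model never predicts pulls the macro scores down as much as a common one would.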


We can try the model out on individual names as follows:

In [None]:
name="bannard"
torch.manual_seed(42)
charset=list(surnames_oh.columns.values)
nationalities=list(nationalities_oh.columns.values)
# Use the same sequence length as the training tensor X
seq_len = X.shape[1]
oh = torch.zeros(seq_len,len(charset))
for i,c in enumerate(name):
    oh[seq_len-len(name)+i,charset.index(c)] = 1.0
oh=oh.to("cuda")
print(oh.shape)
pred=model(torch.unsqueeze(oh,0))
nationalities[np.argmax(pred.cpu().detach().numpy())]
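The encoding step above can be wrapped in a small helper. This is a sketch only: `encode_name` and `toy_charset` are hypothetical names, and the five-letter alphabet stands in for the real columns of `surnames_oh`.

```python
import torch

def encode_name(name, charset, max_len):
    """Return a [max_len, len(charset)] one-hot tensor, zero-padded at the front."""
    if len(name) > max_len:
        raise ValueError("name is longer than max_len")
    oh = torch.zeros(max_len, len(charset))
    offset = max_len - len(name)
    for i, c in enumerate(name):
        # charset.index raises ValueError for a character not in the alphabet
        oh[offset + i, charset.index(c)] = 1.0
    return oh

toy_charset = ["a", "b", "d", "n", "r"]    # hypothetical alphabet
x = encode_name("bannard", toy_charset, max_len=16).unsqueeze(0)   # add a batch dimension
print(x.shape)   # torch.Size([1, 16, 5])
```

With the real data, `charset` would be the list of columns of `surnames_oh` and the result could be moved to the GPU and passed straight to the model.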

We can examine the model and its weights as follows:

In [None]:
model

In [None]:
model.state_dict()

We can try to improve performance by using a gated RNN, specifically an LSTM (see the Week 12 lecture).

In [None]:
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
n_classes = 18

class SeqModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=84, hidden_size=42, num_layers=1, batch_first=True)
        self.linear = nn.Linear(42, n_classes)
    def forward(self, x):
        x, _ = self.lstm(x)
        # take only the last output
        x = x[:, -1, :]
        x = self.linear(x)
        return x
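One difference from nn.RNN worth noting is the second value returned by nn.LSTM, which is a pair of hidden state and cell state. A minimal sketch on random data, with an arbitrary batch size of 3 and sequence length of 7:

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(3, 7, 84)    # (batch, seq, feature)
lstm = nn.LSTM(input_size=84, hidden_size=42, num_layers=1, batch_first=True)

output, (h_n, c_n) = lstm(x)
print(output.shape)   # torch.Size([3, 7, 42]): hidden state at every time step
print(h_n.shape)      # torch.Size([1, 3, 42]): final hidden state per layer
print(c_n.shape)      # torch.Size([1, 3, 42]): final cell state per layer

# As with the plain RNN, the last output step equals the final hidden state.
print(torch.allclose(output[:, -1, :], h_n[-1]))   # True
```

Since the model above discards the second value with `_`, swapping nn.RNN for nn.LSTM needs no other change to forward.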

In [None]:
from sklearn.utils import gen_batches
import matplotlib.pyplot as plt
n_epochs = 150
batch_size = 128
model = SeqModel()
model.to("cuda")
ce_loss=[]
optimizer = optim.Adam(model.parameters(),lr=0.005)
loss_fn = nn.CrossEntropyLoss()


for i in range(n_epochs):
    # Total loss over all batches in this epoch
    cumul_loss = 0.0
    batches = gen_batches(X_train.shape[0],batch_size)
    for k in batches:
        inputs=X_train[k]
        targets=y_train[k]
        y_pred = model(inputs)
        loss = loss_fn(y_pred, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        cumul_loss += loss.item()
    ce_loss.append(cumul_loss)

plt.plot(range(1,n_epochs),ce_loss[1:])
plt.xlabel("number of epochs")
plt.ylabel("loss")





In [None]:
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
y_test_pred=[np.argmax(x.cpu().detach().numpy()) for x in model(X_test)]
y_test_int=[np.argmax(x.cpu().detach().numpy()) for x in y_test]
precision_recall_fscore_support(y_test_int, y_test_pred, average='macro')

In [None]:
name="bannard"
torch.manual_seed(42)
charset=list(surnames_oh.columns.values)
nationalities=list(nationalities_oh.columns.values)
# Use the same sequence length as the training tensor X
seq_len = X.shape[1]
oh = torch.zeros(seq_len,len(charset))
for i,c in enumerate(name):
    oh[seq_len-len(name)+i,charset.index(c)] = 1.0
oh=oh.to("cuda")
print(oh.shape)
pred=model(torch.unsqueeze(oh,0))
nationalities[np.argmax(pred.cpu().detach().numpy())]

## Review classification with RNN


Next we will apply the same process to the Yelp review sentiment data that we have been working with all semester.

In order to speed things up, I have prepared the tensors that you need from the raw data. I have also used only a random sample of 1000 reviews to keep training time manageable in class.

In [None]:
!gdown 19cYQ_B3diu6RqlpYT5n9qHvS08_cScp7
!gdown 1DHj5zFiWX3hF3o8RxMx2Hn4sOW-VsHTs

!gunzip reviews_for_rnn.pt.gz

In [None]:
import torch
reviews_emb=torch.load("reviews_for_rnn.pt")
labels=torch.load("review_labels_for_rnn.pt")

The tokens in the reviews are represented using 300-element static embedding vectors. The longest review is 887 tokens long, so we pad all the reviews to this length. There are 1000 reviews, so the input data is a 1000x887x300 3D tensor.

In [None]:
reviews_emb.shape

We split this into an 800-review training set and a 200-review test set as follows:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(reviews_emb, labels, test_size=0.2, random_state=30)
X_train = X_train.to("cuda")
X_test = X_test.to("cuda")
y_train = y_train.to("cuda")
y_test = y_test.to("cuda")


Problem 1: Build an LSTM-based classifier for this review data. Note: this is a binary classifier, so you will need to change the loss function and pass the output x of your model through torch.sigmoid().
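As a hint on the loss only, not a full solution, the sketch below shows one possible pairing of torch.sigmoid with a binary loss. The scores and labels are made up, and the choice of nn.BCELoss here is an assumption rather than the required answer.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(6)                            # stand-in for one raw score per review
labels = torch.tensor([1., 0., 1., 1., 0., 0.])    # made-up binary labels, as floats

probs = torch.sigmoid(logits)                      # squash scores into (0, 1)
loss_a = nn.BCELoss()(probs, labels)               # expects probabilities
loss_b = nn.BCEWithLogitsLoss()(logits, labels)    # expects raw scores, applies sigmoid itself

preds = (probs > 0.5).int()                        # threshold at 0.5, as in the evaluation cell
print(loss_a.item(), loss_b.item(), preds.tolist())
```

The two losses give the same value, so the sigmoid belongs either in the model or in the loss, but not in both.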

Once you have built it, you can evaluate it as follows:

In [None]:
import numpy as np
from sklearn.metrics import precision_score, recall_score
labels_pred=[int(x.cpu().detach().numpy() > 0.5) for x in model(X_test)]
#precision_recall_fscore_support(y_test.cpu().detach().numpy(), np.array(labels_pred))
print(precision_score(y_test.cpu().detach().numpy(), np.array(labels_pred)))
print(recall_score(y_test.cpu().detach().numpy(), np.array(labels_pred)))


### Intent classification with RNNs

Now we are going to work with some sentences: utterances input to a dialogue system, each labelled with the speaker's intent.

'PlayMusic' e.g. "play easy listening" \
'AddToPlaylist' e.g. "please add this song to road trip" \
'RateBook' e.g. "give this novel 5 stars" \
'SearchScreeningEvent' e.g. "give me a list of local movie times" \
'BookRestaurant' e.g. "i'd like a table for four at 7pm at Asti" \
'GetWeather' e.g. "what's it like outside" \
'SearchCreativeWork' e.g. "show me the new James Bond trailer"



Problem 2: Build and train an RNN-based classifier using a training subset of the data that can correctly classify a test subset of the data.

I have prebuilt the tensors containing word embeddings for you:

In [None]:
!wget https://raw.githubusercontent.com/cbannard/lela60342/refs/heads/main/utts_emb.pt.gz
!gunzip utts_emb.pt.gz
!wget https://raw.githubusercontent.com/cbannard/lela60342/refs/heads/main/intents_emb.pt
X_utts=torch.load("utts_emb.pt")
y_intents=torch.load("intents_emb.pt")

In [None]:
X_utts.shape

In [None]:
y_intents.shape

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_utts, y_intents, test_size=0.2, random_state=30)
X_train=X_train.to("cuda")
X_test=X_test.to("cuda")
y_train=y_train.to("cuda")
y_test=y_test.to("cuda")


Problem 3: Starting from the Pandas data frame intent_classification (imported as below), compile the data into the format needed by your model. Note that the utterances are of different lengths, so you will need to do some padding. The data frame is hierarchically indexed by utterance and word, so the format is almost identical to that of the name data. Once you have compiled the data, use it to train your model above.
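The padding step can be sketched independently of the data frame. The example below assumes each utterance has already been turned into a [tokens, embedding] tensor; the sizes and random values are made up, and how you get from intent_classification to such tensors is left as part of the exercise.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
emb_dim = 6    # hypothetical embedding size

# Three made-up "utterances" of 2, 5 and 3 tokens, one embedding row per token.
utterances = [torch.randn(n, emb_dim) for n in (2, 5, 3)]

# Pad each utterance at the front to the length of the longest one, then stack.
max_len = max(u.shape[0] for u in utterances)
padded = [F.pad(u, (0, 0, max_len - u.shape[0], 0)) for u in utterances]
X_toy = torch.stack(padded)

print(X_toy.shape)   # torch.Size([3, 5, 6])
```

This mirrors the front padding used for the surnames, so a model that reads the last time step still sees a real token.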

In [None]:
!wget https://raw.githubusercontent.com/cbannard/lela60342/refs/heads/main/intent_classification.pickle

In [None]:
intent_classification = pd.read_pickle("intent_classification.pickle")


In [None]:
intent_classification