# LELA 60342 Research Methods in Computational and Corpus Linguistics
## Week 2

First I am going to introduce you to Pandas and its use in representing linguistic datasets, to prepare you for next week's CL2. Then we'll return to working with PyTorch.

### Pandas

Pandas is a very popular library that provides data structures and powerful tools for manipulating them. It is conventionally imported as follows:


In [1]:
import pandas as pd

### Series

A Series in Pandas is a one-dimensional data structure that is somewhere between a Python list and an ordered Python dictionary. It consists of a sequence of values and an associated array of data labels, called an index. We can create it from a Python list as follows:

In [None]:
s = pd.Series([10,7,1,4])

A default index of sequential integers from 0 through N-1 (N being the length of the list) is added:

In [None]:
s

The values and the index can be obtained separately:

In [None]:
print(s.values)
print(s.index)

We can specify the indices when we create the object:

In [None]:
s = pd.Series([10,7,1,4], index=["games","starts","goals","assists"])
s
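
Because a Series sits between a list and a dictionary, it supports both styles of access. The following is a minimal sketch using the same values and labels as above; the variable names `goals`, `subset` and `more_than_five` are only illustrative.

```python
import pandas as pd

# Same values and labels as the Series above
s = pd.Series([10, 7, 1, 4], index=["games", "starts", "goals", "assists"])

# Dictionary-style lookup by label
goals = s["goals"]

# Passing a list of labels returns a new, smaller Series
subset = s[["games", "assists"]]

# Boolean filtering keeps the labels attached to the surviving values
more_than_five = s[s > 5]

print(goals)
print(subset)
print(more_than_five)
```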

### DataFrame

A DataFrame is a rectangular table of named columns, based on the R data structure of the same name. It has an index for both rows and columns. It is like a dictionary of Series which share the same index. The columns can be of different datatypes.

One way to create a DataFrame is from a Dictionary:



In [None]:
info = {'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'population': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

# The index argument supplies the row labels
df = pd.DataFrame(info, index=["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"])

In [None]:
df

There are a number of different ways to extract subparts of the table. The recommended way is via the special indexers loc and iloc, which select a subset of the data using labels and integer positions respectively. Inside the square brackets, the first entry selects rows and the second selects columns.

In [None]:
df.loc["Ohio"]

In [None]:
df.loc["Ohio","population"]

In [None]:
df.iloc[1,0]
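
The difference between the two indexers is easiest to see side by side. This sketch rebuilds the same table so that it runs on its own; the names `nevada_pop`, `first_two` and `value` are illustrative.

```python
import pandas as pd

info = {"year": [2000, 2001, 2002, 2001, 2002, 2003],
        "population": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df = pd.DataFrame(info, index=["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"])

# loc is label-based: a repeated row label returns every matching row
nevada_pop = df.loc["Nevada", "population"]   # Series of three values

# iloc is position-based: slices exclude the end position, like Python lists
first_two = df.iloc[0:2, :]                   # rows at positions 0 and 1

# A single row position and a single column position give one value
value = df.iloc[1, 0]                         # second row, first column ("year")

print(nevada_pop)
print(first_two)
print(value)
```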

### Loading from files

There are also pandas functions for loading a range of different file types into a dataframe:
https://pandas.pydata.org/docs/reference/io.html

For example the coursework data can be loaded as follows:

In [2]:
!wget  https://raw.githubusercontent.com/cbannard/lela60331_24-25/refs/heads/main/coursework/Compiled_Reviews.txt

--2025-02-07 07:39:55--  https://raw.githubusercontent.com/cbannard/lela60331_24-25/refs/heads/main/coursework/Compiled_Reviews.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22322605 (21M) [text/plain]
Saving to: ‘Compiled_Reviews.txt’


2025-02-07 07:40:01 (3.94 MB/s) - ‘Compiled_Reviews.txt’ saved [22322605/22322605]



In [25]:
reviews_df=pd.read_table("Compiled_Reviews.txt")

In [26]:
reviews_df

Unnamed: 0,REVIEW,RATING,PRODUCT_TYPE,HELPFUL
0,"This is a wonderful album, that evokes memorie...",positive,music,neutral
1,"On one hand, this CD is a straight ahead instr...",positive,music,helpful
2,this band reminds me of the thrill i first got...,positive,music,unhelpful
3,"Like I said I would, I finally got around to p...",positive,music,unhelpful
4,Ok good CD. im not suprised. Ok jaheim may not...,positive,music,neutral
...,...,...,...,...
36510,When I bought the XBox on 6-11-06 it came at a...,negative,computer and video games,unhelpful
36511,This game is horrible. I rented it for 5 days ...,negative,computer and video games,unhelpful
36512,Man! What's the f#$%ing deal with companies h...,negative,computer and video games,unhelpful
36513,"ok, most people have more then one system, i h...",negative,computer and video games,unhelpful
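
`read_table` assumes that columns are separated by tab characters. As a small illustration, the sketch below reads a hypothetical two-row sample from an in-memory string rather than a file, and checks that `read_csv` with an explicit tab separator gives the same result.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample standing in for a file on disk
raw = "REVIEW\tRATING\nGreat album\tpositive\nBroke in a week\tnegative\n"

# read_table splits columns on tabs by default
toy_a = pd.read_table(io.StringIO(raw))

# read_csv does the same when told to use a tab separator
toy_b = pd.read_csv(io.StringIO(raw), sep="\t")

print(toy_a)
print(toy_a.equals(toy_b))
```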


We can select subsets of the data based on values of specified columns:

In [23]:
reviews_df.loc[reviews_df['RATING'] == "negative"]

Unnamed: 0,REVIEW,RATING,PRODUCT_TYPE,HELPFUL
20939,I've always held the philosophy you are what y...,negative,music,unhelpful
20940,someone get this band a producer and put them ...,negative,music,neutral
20941,Tihs Album is not all that good when it came o...,negative,music,neutral
20942,"this industry is ""goin down""u call this an alb...",negative,music,helpful
20943,I'm sorry but the guy below me doesn't know mu...,negative,music,helpful
...,...,...,...,...
36510,When I bought the XBox on 6-11-06 it came at a...,negative,computer and video games,unhelpful
36511,This game is horrible. I rented it for 5 days ...,negative,computer and video games,unhelpful
36512,Man! What's the f#$%ing deal with companies h...,negative,computer and video games,unhelpful
36513,"ok, most people have more then one system, i h...",negative,computer and video games,unhelpful
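
Conditions on several columns can be combined. The sketch below uses a small hypothetical table with the same column names as the coursework data, so the rows shown are invented for illustration.

```python
import pandas as pd

# Hypothetical rows that mimic the columns of the coursework data
toy = pd.DataFrame({
    "REVIEW": ["Great album", "Dull game", "Lovely songs", "Broke in a week"],
    "RATING": ["positive", "negative", "positive", "negative"],
    "PRODUCT_TYPE": ["music", "computer and video games", "music", "music"],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (toy["RATING"] == "negative") & (toy["PRODUCT_TYPE"] == "music")
negative_music = toy.loc[mask]

# isin tests membership in a list of allowed values; the second entry picks a column
selected = toy.loc[toy["PRODUCT_TYPE"].isin(["music"]), "REVIEW"]

print(negative_music)
print(selected)
```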


### Hierarchical indexing

One feature that we will find useful in representing language data is hierarchical indexing. One simple way to create a hierarchical index is to pass a list of lists as the index, with one inner list per level.

In [None]:
import numpy as np
s=pd.Series(np.random.randn(16),index=[[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],["a","b","c","d","a","b","c","d","a","b","c","d","a","b","c","d"]])
s

We can select subsets from the hierarchical index as follows:

In [None]:
s.loc[1]

In [None]:
s.loc[1,"a"]

In [None]:
s.loc[:,"a"]
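
A hierarchical index can also be built explicitly, which lets us name the levels. This is a sketch only: the level names "outer" and "inner" are made up, and integer values replace the random ones so that the results are predictable.

```python
import numpy as np
import pandas as pd

# MultiIndex.from_product builds every (outer, inner) combination
idx = pd.MultiIndex.from_product([[1, 2, 3, 4], ["a", "b", "c", "d"]],
                                 names=["outer", "inner"])
s2 = pd.Series(np.arange(16), index=idx)

# xs takes a cross-section at a named level, like s2.loc[:, "a"]
all_a = s2.xs("a", level="inner")

# unstack pivots the inner level into columns, giving a 4 x 4 DataFrame
table = s2.unstack()

print(all_a)
print(table)
```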

### Annotated Data

### CoNLL-U format for Universal Dependencies

Annotations are encoded in plain text files (UTF-8) with three types of lines:

- Word lines containing the annotation of a word/token/node in 10 fields separated by single tab characters; see below.
- Blank lines marking sentence boundaries. The last line of each sentence is a blank line.
- Sentence-level comments starting with hash (#). Comment lines occur at the beginning of sentences, before word lines.

Sentences consist of one or more word lines, and word lines contain the following fields:

- ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
- FORM: Word form or punctuation symbol.
- LEMMA: Lemma or stem of word form.
- UPOS: Universal part-of-speech tag.
- XPOS: Optional language-specific (or treebank-specific) part-of-speech / morphological tag; underscore if not available.
- FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
- HEAD: Head of the current word, which is either a value of ID or zero (0).
- DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
- DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
- MISC: Any other annotation.
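
To make the field layout concrete, here is a minimal sketch that splits one word line into the ten fields listed above. The word line is invented for illustration and is not taken from the sample file.

```python
FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

# Hypothetical word line, invented for illustration
line = "2\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t3\tnsubj\t_\t_"

# A word line has exactly 10 tab-separated fields
values = line.split("\t")
assert len(values) == len(FIELDS)
token = dict(zip(FIELDS, values))

# FEATS is a |-separated list of Name=Value pairs, or "_" when empty
feats = {}
if token["FEATS"] != "_":
    feats = dict(pair.split("=", 1) for pair in token["FEATS"].split("|"))

print(token["FORM"], token["UPOS"], token["HEAD"], feats)
```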


In [None]:
!wget https://raw.githubusercontent.com/cbannard/lela60342/refs/heads/main/sample.conllu

In [None]:
def conllu2pandas(fname):
    """Read a CoNLL-U file into a DataFrame indexed by (sentence number, token ID)."""
    sentcount = 1
    dat, i1, i2 = [], [], []
    with open(fname, 'r', encoding='UTF-8') as file:
        for line in file:
            line = line.rstrip('\n')
            if line == "":
                # A blank line closes the current sentence
                sentcount += 1
            elif line[0].isdigit():
                # Word line: the first field is the ID, the other nine are annotations
                lst = line.split("\t")
                dat.append(lst[1:])
                i1.append(str(sentcount))
                i2.append(lst[0])
            # Comment lines (starting with #) are skipped
    columns = ["FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]
    return pd.DataFrame(dat, columns=columns, index=[i1, i2])


In [None]:
df=conllu2pandas("sample.conllu")

In [None]:
df.head(40)

To select from the hierarchical index in a data frame, instead of passing a single label or integer to loc for the row, we pass a tuple of labels:

In [None]:
df.loc[("2","1"),]

In [None]:
df.loc[("2","1"),"FORM"]
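
The same selection patterns can be tried on a small hand-made table. This sketch uses a hypothetical two-sentence sample with only two of the columns, laid out like the output of conllu2pandas (both index levels are strings).

```python
import pandas as pd

# Hypothetical two-sentence sample in the same layout as conllu2pandas output
toy = pd.DataFrame(
    {"FORM": ["Dogs", "bark", ".", "Cats", "sleep"],
     "UPOS": ["NOUN", "VERB", "PUNCT", "NOUN", "VERB"]},
    index=[["1", "1", "1", "2", "2"], ["1", "2", "3", "1", "2"]],
)

# A single outer label returns every token of that sentence
sentence_2 = toy.loc["2"]

# A (sentence, token) tuple plus a column label returns one cell
form = toy.loc[("2", "1"), "FORM"]

# Boolean masks work as before, regardless of the index structure
nouns = toy.loc[toy["UPOS"] == "NOUN", "FORM"]

print(sentence_2)
print(form)
print(nouns)
```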

For a very extensive and readable introduction to Pandas functionality see this (available free online) book by its creator:
https://wesmckinney.com/book/


### scikit-learn
Scikit-learn is a machine learning toolkit that has implementations of a diverse range of ML methods, and you might want to use it if you are employing non-neural methods of e.g. classification. 

https://scikit-learn.org/stable/index.html

We won't use it much for ML, but it does have some valuable utilities that can be used on Pandas objects. For example train_test_split():
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#



In [29]:
reviews_df.loc[:,"REVIEW"]

0        This is a wonderful album, that evokes memorie...
1        On one hand, this CD is a straight ahead instr...
2        this band reminds me of the thrill i first got...
3        Like I said I would, I finally got around to p...
4        Ok good CD. im not suprised. Ok jaheim may not...
                               ...                        
36510    When I bought the XBox on 6-11-06 it came at a...
36511    This game is horrible. I rented it for 5 days ...
36512    Man!  What's the f#$%ing deal with companies h...
36513    ok, most people have more then one system, i h...
36514    Seriously....I was waiting quite awhile for th...
Name: REVIEW, Length: 36515, dtype: object

In [33]:
from sklearn.model_selection import train_test_split
X=reviews_df.loc[:,"REVIEW"]
y=reviews_df.loc[:,"RATING"]
X_train, X_val_test, y_train, y_val_test = train_test_split(X, y, test_size=0.2, random_state=30)
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=30)

In [35]:
X_test

29079    After years of e-filing fees being included wi...
825      Quem não gosta de samba, bom sujeito não é, é ...
24508    I have this radio and it's a POS. Remote isn't...
7528     Like the other user I had to use a 1x8 under t...
17228    This was a replacement clock for one that fell...
                               ...                        
12770    This is the candy our two boys ask for the mos...
4066     I've had this card for about 1 1/2 years now a...
34844    I have found Gerber brand clothing to be disap...
34871    These broke within the first 3 days of use. 2 ...
18887     This was exactly what my husband had admired ...
Name: REVIEW, Length: 3652, dtype: object
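
Calling train_test_split() twice in this way gives an 80/10/10 split into training, validation and test sets. The sketch below checks the sizes on hypothetical toy data. The stratify argument is an optional addition, not used above, which keeps the label proportions the same in each part.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 labelled items, half positive and half negative
toy = pd.DataFrame({"REVIEW": [f"review {i}" for i in range(100)],
                    "RATING": ["positive", "negative"] * 50})
X, y = toy["REVIEW"], toy["RATING"]

# First hold out 20%, then halve the held-out part into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=30, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=30, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))

# The original row labels survive the shuffle, so X and y stay aligned
print((X_train.index == y_train.index).all())
```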

### Back to PyTorch

Problem 1: Write code for a linear regression model predicting y from both features in x (data generated below) using PyTorch with autograd to obtain gradients. Plot the learning curve.

In [None]:
import torch
import matplotlib.pyplot as plt

In [None]:
x=torch.tensor([[-0.6832,  0.2324, -1.2326, -0.3170,  0.3240, -1.2326, -1.5989,  0.7818,
-0.3170,  0.2324,  1.0565,  1.4228,  1.3312],
        [-1.5407, -1.2839, -1.0271, -0.7703, -0.5136, -0.2568,  0.0000,  0.2568,
          0.5136,  0.7703,  1.0271,  1.2839,  1.5407]])
y=torch.tensor([33,49,41,54,52,45,36,58,45,69,55,56,68])

In [None]:
n_iters = 1000
num_features = 3  # two feature weights plus a bias term
weights = torch.randn(num_features, requires_grad=True)
num_samples = y.shape[0]
linear_loss = []
lr = 0.002
for i in range(n_iters):
    # Forward pass: x is (2, 13), so x.T is (13, 2); weights[2] is the bias
    y_est = torch.mv(x.T, weights[0:2]) + weights[2]
    errors = y_est - y
    loss = errors.dot(errors) / num_samples  # mean squared error
    linear_loss.append(loss.item())

    # Backward pass: autograd fills weights.grad
    loss.backward()

    # Gradient descent step, kept out of the autograd graph
    with torch.no_grad():
        weights -= lr * weights.grad
    weights.grad = None  # reset so gradients do not accumulate

# The first value is skipped so that the initial loss does not dominate the plot
plt.plot(range(1, n_iters), linear_loss[1:])
plt.xlabel("number of epochs")
plt.ylabel("loss")

Problem 2: Rewrite your code from problem 1 so as to use the GPU for all calculations.
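
As a starting point, the following sketch shows one common pattern for choosing a device and placing tensors on it. The tensors here are hypothetical and are not the data from problem 1. The code falls back to the CPU when no GPU is available.

```python
import torch

# Use the GPU when one is visible to PyTorch, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical tensors, not the data from problem 1
a = torch.tensor([1.0, 2.0, 3.0]).to(device)           # moved after creation
w = torch.randn(3, requires_grad=True, device=device)  # created on the device

# Operands must live on the same device; the result lives there too
out = torch.dot(a, w)
print(out.device)

# Move the result back to the CPU before converting to NumPy or plotting
value = out.detach().cpu().item()
print(value)
```

Creating the weights directly on the device with device= keeps them a leaf tensor, so their .grad attribute is filled by backward() as usual.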

### PyTorch Loss Functions
PyTorch has implementations of all common loss functions:

https://pytorch.org/docs/stable/nn.html#loss-functions

For example, mean squared error loss:

In [None]:
loss = torch.nn.MSELoss()
estimate = torch.randn(30, requires_grad=True)
target = torch.randn(30)
loss(estimate, target)
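
To see what the loss function computes, this sketch compares it with the same quantity written out by hand. The fixed seed and the names `loss_fn`, `builtin` and `manual` are only illustrative.

```python
import torch

torch.manual_seed(0)
estimate = torch.randn(30, requires_grad=True)
target = torch.randn(30)

loss_fn = torch.nn.MSELoss()          # default reduction is "mean"
builtin = loss_fn(estimate, target)

# The same quantity written out by hand
manual = ((estimate - target) ** 2).mean()
print(torch.allclose(builtin, manual))

# The loss is part of the autograd graph, so backward() fills estimate.grad
builtin.backward()
print(estimate.grad.shape)
```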


Problem 3: Rewrite your code from problem 1/2 so as to use the PyTorch loss function.

Problem 4: Implement a logistic regression with two features for this data (as used in week 7 of CL1) using PyTorch's autograd to obtain the gradients for the weight updates and the appropriate PyTorch loss function. Once you have it working, try to use the GPU.

In [2]:
import numpy as np

In [8]:
## Create simulated data
np.random.seed(10)
w1_center = (2, 3)
w2_center = (3, 2)
batch_size=50

x = np.zeros((batch_size, 2))
y = np.zeros(batch_size)
for i in range(batch_size):
    if np.random.random() > 0.5:
        x[i] = np.random.normal(loc=w1_center)
    else:
        x[i] = np.random.normal(loc=w2_center)
        y[i] = 1

x=torch.tensor(x.T)
y=torch.tensor(y)

Next time we'll look at how to use PyTorch to a) specify your forward pass, and b) perform weight updating.