In [74]:
import pandas as pd

# Lab 1: NLPScholar practice
### COSC 426: Fall 2025, Colgate University

Use this notebook to answer the questions in `Lab1.md`. For answers where you are looking at data files generated by the `evaluate` or `analyze` modes (which is almost all of them), you should load the data into the ipynb notebook and proceed from there. 

---

## Submission instructions

Submit the following files to gradescope:

* `Lab1.ipynb` with answers to all of the questions. **Make sure that when you submit the file, the outputs in the cells are visible on gradescope.** 
* All config files you created. Create a separate file for each config setting you use. 

---

## Part 1: Minimal Pair

In this part you will be working with three models: 
* `gpt2`
* `distilbert/distilgpt2`
* `distilbert-base-uncased`

1. What is the difference in surprisal between expected and unexpected when the verb lemma is `LIKE` when we consider the words specified in the ROI column?

In [None]:
# set `conditions` to lemma
df = pd.read_csv('./results/agreement.tsv', sep='\t')

In [13]:
df[df['lemma'] == 'LIKE'][['model', 'diff']]

Unnamed: 0,model,diff
1,distilbert-base-uncased,-2.278639
3,distilbert/distilgpt2,-1.516554
5,gpt2,-1.550564


2. Is the difference in surprisal between expected and unexpected greater for singular verbs compared to plural verbs when we consider the words specified in the ROI column? 

In [None]:
# set `conditions` to number
df = pd.read_csv('./results/agreement.tsv', sep='\t')

In [19]:
df[['model', 'number', 'diff']]

Unnamed: 0,model,number,diff
0,distilbert-base-uncased,plu,-4.133038
1,distilbert-base-uncased,sing,-2.329132
2,distilbert/distilgpt2,plu,-3.141551
3,distilbert/distilgpt2,sing,-1.433033
4,gpt2,plu,-3.184353
5,gpt2,sing,-1.607013


As the table shows, the magnitude of the surprisal difference between expected and unexpected is **greater for plural** verbs than for singular verbs across all three models.
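The comparison can be made explicit by pivoting the table so that each model has one row and each verb number has one column. This is a minimal sketch that re-enters the values from the output above by hand; the helper column names `abs_diff` and `plural_larger` are my own and are not produced by NLPScholar.

```python
import pandas as pd

# Values copied from the ROI-level output table above.
rows = [
    ("distilbert-base-uncased", "plu", -4.133038),
    ("distilbert-base-uncased", "sing", -2.329132),
    ("distilbert/distilgpt2", "plu", -3.141551),
    ("distilbert/distilgpt2", "sing", -1.433033),
    ("gpt2", "plu", -3.184353),
    ("gpt2", "sing", -1.607013),
]
df = pd.DataFrame(rows, columns=["model", "number", "diff"])

# Compare magnitudes, since every diff in the table is negative.
wide = df.assign(abs_diff=df["diff"].abs()).pivot(
    index="model", columns="number", values="abs_diff"
)

# True where the plural difference is larger in magnitude than the singular one.
wide["plural_larger"] = wide["plu"] > wide["sing"]
print(wide)
```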

3. Is the answer to question 2 different if you look at the surprisal of the entire sentence? 


In [None]:
# removed ROI column in the dataset
df = pd.read_csv('./results/agreement.tsv', sep='\t')

In [25]:
df[['model', 'number', 'diff']]

Unnamed: 0,model,number,diff
0,distilbert-base-uncased,plu,-2.580857
1,distilbert-base-uncased,sing,-1.291044
2,distilbert/distilgpt2,plu,-1.570776
3,distilbert/distilgpt2,sing,-0.481562
4,gpt2,plu,-1.592177
5,gpt2,sing,-0.610408


Although the magnitude of the difference shrinks for both plural and singular verbs when the whole sentence is considered, it is still **greater for plural** verbs across all three models, so the answer to question 2 does not change.

4. What is the mean probability of the expected (i.e., grammatical) sentences in `agreement.tsv` (over the entire sentence)?


In [None]:
# removed conditions
df = pd.read_csv('./results/agreement.tsv', sep='\t')

In [33]:
df[['model', 'expected']]

Unnamed: 0,model,expected
0,distilbert-base-uncased,7.662
1,distilbert/distilgpt2,6.631824
2,gpt2,6.330092
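The values above are greater than 1, so they cannot be probabilities themselves. If they are mean surprisals in bits (an assumption on my part; I have not confirmed which measure or log base the file uses), they can be converted back with $p = 2^{-s}$. A rough sketch with the values re-entered by hand:

```python
import pandas as pd

# Values copied from the output above.
# Assumption: `expected` is a mean surprisal measured in bits (log base 2).
df = pd.DataFrame(
    {
        "model": ["distilbert-base-uncased", "distilbert/distilgpt2", "gpt2"],
        "expected": [7.662, 6.631824, 6.330092],
    }
)

# 2 ** -surprisal maps a surprisal in bits back to a probability.
df["prob_from_mean_surprisal"] = 2.0 ** (-df["expected"])
print(df)
```

Note that this gives the probability that corresponds to the mean surprisal (a geometric mean of the underlying probabilities), which is generally not the same as the arithmetic mean of the per-sentence probabilities. Getting the latter would require converting each sentence first and then averaging.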


5. What is the mean probability of the word `chameleons` in the expected sentences when you consider the period vs. not (i.e., `chameleons` vs. `chameleons.`)?


In [38]:
# Left out as instructed

6. What is bad about `minimal_pairs_bad.tsv`? What happens when you run `evaluate` on this data file? What happens when you run `analyze` on this data file? 

In `minimal_pairs_bad.tsv`, the `name` column was renamed to `condition`, and some rows are missing a matching pair ID. Running `evaluate` appears to work without any warnings or errors, but running `analyze` returns two warnings: `WARNING: Excluding pairs which did not have expected or unexpected comparisons: {8, 9, 10, 7}` and `WARNING: No valid condition columns entered`.
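The first warning can be reproduced by hand by checking which pair IDs lack either an expected or an unexpected row. The sketch below uses a toy table rather than the real file, and the column names `pairid` and `comparison` are assumptions about the data format, so they may need adjusting.

```python
import pandas as pd

# Toy stand-in for a minimal-pair file; column names are assumed, not verified.
toy = pd.DataFrame(
    {
        "pairid": [1, 1, 2, 2, 3, 4],
        "comparison": [
            "expected", "unexpected",
            "expected", "unexpected",
            "expected",      # pair 3 has no unexpected row
            "unexpected",    # pair 4 has no expected row
        ],
    }
)

required = {"expected", "unexpected"}

# Collect the set of comparison labels seen for each pair ID.
labels_per_pair = toy.groupby("pairid")["comparison"].agg(set)

# A pair is incomplete if it is missing at least one required label.
incomplete = sorted(
    int(pid) for pid, labels in labels_per_pair.items() if not required <= labels
)
print(incomplete)
```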

---

## Part 2: Token Classification

In this part you will be working on a [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) task with the following model: 
* `distilbert-base-uncased`

1. What is the overall accuracy of the model?

In [62]:
df = pd.read_csv('./results/ner_bycond.tsv', delimiter='\t')
df[['model', 'accuracy']]

Unnamed: 0,model,accuracy
0,distilbert-base-uncased,0.043478


2. What is the overall accuracy of the model if you ignore the `O` label (which indicates that there is no entity)?

In [82]:
df = pd.read_csv('./results/ner_bycond.tsv', delimiter='\t')
df[['model', 'accuracy']]

Unnamed: 0,model,accuracy
0,distilbert-base-uncased,0.1875


3. What is the overall accuracy if you ignore punctuation and the `O` label?


In [83]:
df = pd.read_csv('./results/ner_bycond.tsv', delimiter='\t')
df[['model', 'accuracy']]

Unnamed: 0,model,accuracy
0,distilbert-base-uncased,0.0625


4. What is the accuracy of the `B-location` tag when you consider long vs. short sentences? 


In [97]:
df = pd.read_csv('./results/ner_bycond.tsv', delimiter='\t')
df[['model','condition', 'accuracy']]

Unnamed: 0,model,condition,accuracy
0,distilbert-base-uncased,full,0.2
1,distilbert-base-uncased,short,0.333333


5. Here is a model that has been trained on NER task: [bert-base-NER](https://huggingface.co/dslim/bert-base-NER/tree/main). **This model has been trained on a different set of labels than the model you were evaluating.** If you wanted to use this model instead, what would you need to change in your NLPScholar config file? (*Hint: It might be helpful to look at the model's `config.json`*)

Two settings would need to change: the model and `id2label`. I would set the model name to `dslim/bert-base-NER` and update `id2label` so that it matches the label set listed in that model's `config.json`.
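As a rough sketch of how the new mapping could be read off, the snippet below parses a `config.json`-shaped string and converts the `id2label` keys to integers. The excerpt is shortened and purely illustrative; the real file lists more labels and should be checked directly on the model page.

```python
import json

# Shortened, illustrative excerpt in the shape of a Hugging Face config.json.
# It is NOT the full label set of the real model.
config_text = """
{
  "id2label": {
    "0": "O",
    "1": "B-PER",
    "2": "I-PER"
  }
}
"""

config = json.loads(config_text)

# JSON object keys are always strings, so convert them to integer ids.
id2label = {int(idx): label for idx, label in config["id2label"].items()}

for idx in sorted(id2label):
    print(f"{idx}: {id2label[idx]}")
```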

## Part 3: Text Classification

In this part you will be working on a Sentiment Analysis task with the following models: 
* `siebert/sentiment-roberta-large-english` (which has been trained on sentiment analysis)
* `roberta-large` (which has not been trained on sentiment analysis, like the model in the TokenClassification example)

1. Is the accuracy of the trained model greater than that of the untrained model? 
   

In [None]:
df = pd.read_csv('./results/sentiment.tsv', delimiter='\t')
df[['model', 'accuracy']] 
# I've observed some strange behavior. When I run both models from a single configuration file, 
# they produce the same accuracy. However, when I run them separately using two different configuration files, 
# they produce different results. 
# Suspiciously, the accuracy from the combined run is the average of the two individual model accuracies.

# I suspect there's a bug in the analysis mode. 
# Since I'm aware that parts of NLPScholar are still in development, 
# I've decided to focus on configuring the YAML file correctly, even if the results are inaccurate.
# I may just be misconfiguring the file, but it is strange that what worked for the other experiments does not work in this one case.

Unnamed: 0,model,accuracy
0,roberta-large,0.5


In [106]:
df = pd.read_csv('./results/sentiment.tsv', delimiter='\t')
df[['model', 'accuracy']] 

Unnamed: 0,model,accuracy
0,siebert/sentiment-roberta-large-english,0.875
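One way to cross-check the behavior described in the comments above is to compute accuracy directly from per-example predictions, once grouped by model and once pooled over all rows. The sketch below uses made-up toy predictions, and the column names `model`, `target`, and `predicted` are hypothetical. It only illustrates that pooling across equally sized models yields the average of the individual accuracies; it does not show what NLPScholar actually does internally.

```python
import pandas as pd

# Toy predictions for two hypothetical models, four examples each.
preds = pd.DataFrame(
    {
        "model": ["model_a"] * 4 + ["model_b"] * 4,
        "target": ["pos", "neg", "pos", "neg"] * 2,
        "predicted": [
            "pos", "pos", "pos", "pos",   # model_a: 2 of 4 correct
            "pos", "neg", "pos", "neg",   # model_b: 4 of 4 correct
        ],
    }
)

preds["correct"] = preds["target"] == preds["predicted"]

# Accuracy computed separately for each model.
per_model = preds.groupby("model")["correct"].mean()

# Accuracy computed over all rows, ignoring which model produced them.
pooled = preds["correct"].mean()

print(per_model)
print("pooled:", pooled)
```

If the pooled number matches what the combined run reports, that would suggest the analysis is not being grouped by model for this configuration.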


2. Is the difference between the trained and untrained model greater for full vs. one-line reviews? 


The difference is the same for both review types: 0.875 - 0.5 = 0.375 for full reviews and for one-line reviews.

In [108]:
df = pd.read_csv('./results/sentiment.tsv', delimiter='\t')
df[['model', 'type', 'accuracy']]

Unnamed: 0,model,type,accuracy
0,roberta-large,Full,0.5
1,roberta-large,One-line,0.5


In [110]:
df = pd.read_csv('./results/sentiment.tsv', delimiter='\t')
df[['model', 'type', 'accuracy']]

Unnamed: 0,model,type,accuracy
0,siebert/sentiment-roberta-large-english,Full,0.875
1,siebert/sentiment-roberta-large-english,One-line,0.875


3. Is the f1 score for reviews that are positive greater than f1 score for reviews that are negative when you consider all reviews and all models? 


In [116]:
df = pd.read_csv('./results/sentiment.tsv', delimiter='\t')
df[['model', 'target', 'macro-f1']]

Unnamed: 0,model,target,macro-f1
0,roberta-large,Negative,0.805668
1,roberta-large,Positive,0.805668
2,siebert/sentiment-roberta-large-english,Negative,0.805668
3,siebert/sentiment-roberta-large-english,Positive,0.805668


4. Is the answer to 3 different if you consider just the ChatGPT generated reviews? 


In [119]:
df = pd.read_csv('./results/sentiment.tsv', delimiter='\t')
df[df['source']=='ChatGPT'][['model', 'target', 'macro-f1']]

Unnamed: 0,model,target,macro-f1
0,roberta-large,Negative,0.65368
2,roberta-large,Positive,0.65368
4,siebert/sentiment-roberta-large-english,Negative,0.65368
6,siebert/sentiment-roberta-large-english,Positive,0.65368


5. What is the average probability of the predicted label for both the models? 

In [123]:
df = pd.read_csv('./results/sentiment.tsv', delimiter='\t')
df

Unnamed: 0.1,Unnamed: 0,model,micro-precision,micro-recall,micro-f1,macro-precision,macro-recall,macro-f1,accuracy
0,0,roberta-large,0.6875,0.6875,0.6875,0.718182,0.6875,0.676113,0.6875
1,1,siebert/sentiment-roberta-large-english,0.6875,0.6875,0.6875,0.718182,0.6875,0.676113,0.6875
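The `Unnamed: 0` column in the output above appears because the file's first column has no header, so pandas assigns it a placeholder name. A small sketch of how `index_col=0` avoids this, using an in-memory stand-in for the file with only two of the metric columns:

```python
import io

import pandas as pd

# In-memory stand-in for a TSV whose first column is an unnamed saved index.
tsv_text = (
    "\tmodel\taccuracy\n"
    "0\troberta-large\t0.6875\n"
    "1\tsiebert/sentiment-roberta-large-english\t0.6875\n"
)

# Default read: the header-less first column is named "Unnamed: 0".
default = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# With index_col=0 the first column becomes the index instead of a data column.
indexed = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)

print(list(default.columns))
print(list(indexed.columns))
```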
