# Dataset exploration

This assignment is designed to help you get a sense of the raw data we often work with as researchers. For this assignment, you will be using `pandas` in addition to your own knowledge of linguistic structure from the readings. This assignment requires the qualitative and quantitative assessment of several parts of a very small corpus. 

The coding components of this assignment will largely require loading in the data and building a simple input system that allows you to classify two instances of the same string as they appear in different sentences as meaning the same or different things. 

The assignment is designed to give you a feel for what we want out of computational models of semantic representations of words and also give you experience doing annotation for NLP tasks.

# The problem: Ambiguous words

Language is often ambiguous, with the same string or series of sounds corresponding to different meanings. This **ambiguity** leads to challenges for NLP, such as distinguishing between seriousness and sarcasm (e.g., "That was great"), or between completely different, unrelated meanings (e.g., "I couldn't bear to part with it" vs. "The bear did not want to part with it"). Often, we are interested in identifying what meaning a word has in context. Can we properly guess what a speaker means by "bear", or "great"?

One of the major areas of natural language processing is **Word Sense Disambiguation** (WSD). Under these schemes, a string is assumed to have several possible **senses**, or distinct meanings. We will discuss in greater detail in class why the idea that words have distinct meanings is challenging, from identifying and differentiating these meanings to building models that can assign a label to a given instance of a word. 

## The dataset

This assignment will center around annotating a dataset of sentences that are stored in a `.tsv` file. This dataset, like the others we have worked with, contains one "document" per line. Each document in this dataset corresponds to one line, or subtitle, from a corpus of English subtitles for movies, along with some **metadata**. This particular file is structured as a matrix where the rows are records and the columns are values for that record -- it is very similar to a dictionary structure.
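To make the "rows are records, columns are values" picture concrete, here is a minimal sketch that builds a tiny dataframe from a dictionary of lists. The two subtitles and their labels are copied from rows shown in this notebook; the names `toy_records` and `toy_df` are made up for illustration and are not part of the assignment.

```python
import pandas as pd

# Hypothetical toy data: each key is a column, each list position is one record.
toy_records = {
    "subtitle": ["Neither his post , nor adjacent .", "Send men out to post a guard ."],
    "word": ["post", "post"],
    "meaning_label": ["2", "2"],
}
toy_df = pd.DataFrame(toy_records)

# One row is one record; it can be turned back into a dictionary.
first_record = toy_df.iloc[0].to_dict()
print(toy_df.shape)           # (number of rows, number of columns)
print(first_record["word"])
```

Each row behaves like a small dictionary keyed by column name, which is why the structure feels similar to one.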

The subtitles are presented out of context, one on each line in the `.tsv`.

Some example subtitles from the first column of the dataframe (`"subtitle"`) look like this:

* "Neither his post , nor adjacent ."
* "Send men out to post a guard ."

In the column `'word'` you will find **ambiguous words**. We have selected a small number of ambiguous words for you, each of which has a diverse range of meanings.

The dataset itself comes from [Rice et al. (2018)](https://link.springer.com/article/10.3758/s13428-018-1107-7), a study that looked at how common different types of ambiguous words were in a corpus of subtitles. The third column in the `.tsv` is `'meaning_label'`. This is a categorical label (a string, `str`) for meanings from an internet dictionary called Wordsmyth. The values are often represented as numbers ("1", "2"), but may also be a dash character ("-"), which indicates roughly that the meaning is "not in the Wordsmyth list." The labels themselves have been defined somewhat informally by lexicographers. The label you see is the product of two separate annotators (students in the lab of the Rice et al. authors) going through each of these sentences by hand and assigning labels with the help of the dictionary.
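Because the labels are strings rather than numbers, comparisons such as `== "2"` need the quotes. The sketch below reads a three-row stand-in for the file (the rows are copied from the `tick` examples shown in this notebook; `toy_tsv` and `toy_df` are hypothetical names) and counts how often each label occurs. Passing `dtype` is an optional precaution, assumed here, so that pandas does not guess a numeric type.

```python
import pandas as pd
from io import StringIO

# Hypothetical three-row TSV in the same layout as the assignment file.
toy_tsv = (
    "subtitle\tword\tmeaning_label\n"
    "You heard it tick .\ttick\t1\n"
    "Like a tick from an awful season\ttick\t2\n"
    "I 'd hate to tick off the locals !\ttick\t-\n"
)

# dtype=str keeps "1" and "2" as strings instead of letting pandas infer integers.
toy_df = pd.read_csv(StringIO(toy_tsv), sep="\t", dtype={"meaning_label": str})

# Count how many rows carry each label, including the dash.
label_counts = toy_df["meaning_label"].value_counts()
print(label_counts)
print(toy_df["meaning_label"].tolist())
```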

Here we will be doing something similar, but less complex.

# Question 0: Your imports (1 point)

In [None]:
# put all of your imports here
from google.colab import drive, files
import os
import pandas as pd
import numpy as np

# Question 1: Loading in data using pandas (1 point)

Write your code below and load the file `ambiguous.tsv` into the notebook. Convert it into a dataframe using `pandas`. Your dataframe should be called `df`. The format for loading in the dataframe will look like this:

```python
import pandas as pd

df = pd.read_csv("/path/to/my/filename_for_this_assignment.tsv", sep="\t")
```

The code for this will be a little more complicated if you are loading it in through `files.upload()`:

```python
import pandas as pd
from io import BytesIO
from google.colab import files  # provides files.upload() in Colab

my_file = files.upload()

df = pd.read_csv(BytesIO(my_file["filename_for_this_assignment.tsv"]), sep="\t")
```
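The `BytesIO` step is needed because `files.upload()` returns a dictionary that maps each file name to the file's raw bytes, not to a path. The following sketch imitates that dictionary so it can run outside Colab; the file name `toy.tsv` and its one-row contents are assumptions made for illustration.

```python
import pandas as pd
from io import BytesIO

# Stand-in for the dictionary returned by files.upload(): file name -> raw bytes.
# The file name and contents here are made up for illustration.
fake_upload = {
    "toy.tsv": b"subtitle\tword\tmeaning_label\nThe zip is 06320 .\tzip\t-\n"
}

# BytesIO wraps the bytes in a file-like object that read_csv can consume.
toy_df = pd.read_csv(BytesIO(fake_upload["toy.tsv"]), sep="\t")
print(toy_df.columns.tolist())
```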

In [None]:
# load in the document and store it as the desired dataframe
drive.mount("/content/drive/", force_remount=True)

df = pd.read_csv("/content/drive/MyDrive/Fall 2021 Computational Linguistics Notebooks/HW4/ambiguous.tsv", sep="\t")

df

Mounted at /content/drive/


| | subtitle | word | meaning_label |
|---|---|---|---|
| 0 | Got it . Hey , Scott . Hey , Eiko , where 's t... | post | 2 |
| 1 | Yeah , well , I mean these guys are faster tha... | post | 3 |
| 2 | Neither his post , nor adjacent . | post | 2 |
| 3 | Well , I tried to sneak past you and get out ,... | post | 1 |
| 4 | Send men out to post a guard . | post | 2 |
| ... | ... | ... | ... |
| 346 | You heard it tick . | tick | 1 |
| 347 | Like a tick from an awful season | tick | 2 |
| 348 | I 'd hate to tick off the locals ! | tick | - |
| 349 | Hold on a tick . `` Things to do before I die '' | tick | 1 |


# Question 2: Understanding the dataframe (3 points)

Now that you have loaded `df` into the notebook, please **print** the answer to each of the following. If you have not worked with `pandas` before, please refer to the [documentation](https://pandas.pydata.org/docs/) and/or search for the following on Stack Overflow. If you search for these online, be sure to include the term "pandas" in your search. Each response is worth 1 point.

* The number of rows in `df`
* The names of the columns in `df`
* The unique values in `df['word']`

In [None]:
# number of rows
print('Number of rows in the given dataframe is {}'.format(len(df)))
# column names of df
column_names = list(df.columns)
print('Column names of the dataframe are {}'.format(column_names))
# unique values in df['word']
print('Unique values in the column word are {}'.format(df.word.unique()))

Number of rows in the given dataframe is 351
Column names of the dataframe are ['subtitle', 'word', 'meaning_label']
Unique values in the column word are ['post' 'zip' 'dove' 'tick']


# Question 3: Subsetting the data (5 points)

For this question, I would like you to subset to the rows in the dataframe that only have the word "zip" in them. In order to do this, refer to the `pandas` subsetting documentation: https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html

As with most comparisons, you will want to make sure you use the `==` operator from previous assignments. So, be sure to subset to all rows where the string is `"zip"`. 

Save the subset of the data as a variable called `zip_df`.
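As a rough illustration of how `==` subsetting works, the sketch below uses a three-row toy dataframe (the subtitles are copied from this notebook; `toy_df`, `is_tick`, and `tick_rows` are hypothetical names) and filters on a different word so that the pattern, not the answer, is what it shows.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "subtitle": [
        "Send men out to post a guard .",
        "The zip is 06320 .",
        "You heard it tick .",
    ],
    "word": ["post", "zip", "tick"],
})

# Comparing a column with == gives one True/False value per row (a boolean mask).
is_tick = toy_df["word"] == "tick"
print(is_tick.tolist())

# Indexing the dataframe with the mask keeps only the rows marked True.
tick_rows = toy_df[is_tick]
print(len(tick_rows))
```

The subset keeps the original row index, which is why subsets of `df` can start at an index other than 0.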

## Question 3A: Subsetting to a single matching string (2 points)

In [None]:
# your code for creating the zip dataframe goes here
zip_df = df[df['word'] == 'zip']

zip_df

| | subtitle | word | meaning_label |
|---|---|---|---|
| 75 | But you still have to write the zip code in ca... | zip | - |
| 76 | Why do n't you zip up your jacket . | zip | 2 |
| 77 | I 'll give the afternoon twist a little zip . | zip | 1 |
| 78 | Do n't forget zip your coat . | zip | 2 |
| 79 | No , zip it . Zip it . | zip | 2 |
| ... | ... | ... | ... |
| 160 | To remind us that every day holds the potentia... | zip | - |
| 161 | The zip is 06320 . | zip | - |
| 162 | Bruising on her wrists , plastic zip tie , nav... | zip | 2 |
| 163 | Maybe we live a lot closer to each other , per... | zip | - |


## Question 3B: Describe the subset (1 point)

Once you have created your `zip_df` dataframe, print out the length of `zip_df` in terms of the number of rows.

In [None]:
# print the number of rows of zip_df
print('The length of the subset is {}'.format(len(zip_df)))

The length of the subset is 90


## Question 3C: Create a subset for sense "2" in zip_df

Now, create an additional subset to get the "2" sense. Store this smaller subset as `two_zip_df`. Print the number of rows in `two_zip_df`.
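A second condition can be applied either by subsetting an existing subset again or by combining two masks with `&`. The sketch below shows the combined form on a made-up four-row dataframe; the data and the name `tick_sense_one` are assumptions for illustration only.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "word": ["post", "post", "tick", "tick"],
    "meaning_label": ["2", "3", "1", "2"],
})

# Each comparison needs its own parentheses because & binds more tightly than ==.
mask = (toy_df["word"] == "tick") & (toy_df["meaning_label"] == "1")
tick_sense_one = toy_df[mask]
print(len(tick_sense_one))
```

Building both comparisons from the same dataframe keeps the mask aligned with the rows it is applied to.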

In [None]:
# create smaller subset: rows of zip_df whose sense label is the string '2'
two_zip_df = zip_df[zip_df['meaning_label'] == '2']
print(two_zip_df)
# print number of rows
print('The length of the subset 2 is {}'.format(len(two_zip_df)))

                                              subtitle word meaning_label
76                 Why do n't you zip up your jacket .  zip             2
78                       Do n't forget zip your coat .  zip             2
79                              No , zip it . Zip it .  zip             2
82   You know , sometimes he forgets to zip himself...  zip             2
84                        I can almost zip the dress .  zip             2
92          I had to take me undies off to zip it up .  zip             2
93   Hey , Lanie , you at least wan na zip up or so...  zip             2
96                           Now help me zip this up .  zip             2
97                                     You must zip it  zip             2
98   Zip , zip , she gets into the flak suit , we g...  zip             2
99   I think ... you look beautiful . You sure ? Oh...  zip             2
100                  Here , zip me up will you Frank ?  zip             2
103                Thanks . Can you zi

  


# Question 4: Data Annotation (21 points total)

This question leads you through a long journey to understand how we can get data for our machine learning models. Below, you will find functions that will help us annotate our data inside a notebook. We can store our annotations and then use them to label our data later. 

For this annotation task, we will be simplifying the task of Word Sense Disambiguation down to pairwise comparisons. You will compare randomly selected sentences that contain two instances of the same string (i.e., "zip"). We want you to annotate whether the use of "zip" means roughly the same or different things in the sentences. This task is challenging because sometimes it is unclear from a sentence or there is not enough information to go on. Try your best and answer the questions below. You may have to go through the data several times, but it should teach you a lot about what we need to do good word sense disambiguation!
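To see how sense labels turn into pairwise "same or different" gold labels, consider the following sketch. The five labels are hypothetical, and enumerating every pair with `itertools.combinations` is only an illustration; the assignment samples pairs at random instead.

```python
from itertools import combinations

# Hypothetical sense labels for five sentences containing the same word.
labels = ["2", "2", "1", "-", "2"]

# Every unordered pair of sentences, with a gold "same sense" flag for each.
pairs = list(combinations(range(len(labels)), 2))
same_flags = [labels[i] == labels[j] for i, j in pairs]

print(len(pairs))       # number of distinct pairs
print(sum(same_flags))  # number of pairs whose labels match
```

Because the comparison is plain string equality, two sentences that are both labeled "-" count as a match, even though "not in the Wordsmyth list" does not by itself guarantee that they share a meaning. This may explain some surprising gold labels.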

## Question 4A: Run and answer the prompt (1 point)

In [None]:
# Run the below code and complete one example

def subtitle_randomizer(subset_df):
  # get two rows
  two_rows = subset_df.sample(n=2, replace=False)
  sent_one, sent_two = two_rows['subtitle'].tolist()
  sense_sent_one, sense_sent_two = two_rows['meaning_label'].tolist()
  senses_match_flag = sense_sent_one==sense_sent_two
  print(f"First sentence: {sent_one}\nSecond sentence: {sent_two}")
  return senses_match_flag

def request_judgment():
  # keep asking until the annotator types exactly 'same' or 'different'
  answer = ''
  while answer not in {'same', 'different'}:
    answer = input("Is the meaning of 'zip' in these two sentences the same or different?         :    ")
  # True means the annotator judged the two senses to be the same
  return answer == 'same'

def annotate_one(subset_df):
  gold_label = subtitle_randomizer(subset_df)
  annotator_says_match_flag = request_judgment()
  print(f"\nGold label: {gold_label}\nAnnotator answer: {annotator_says_match_flag}\n")
  return gold_label, annotator_says_match_flag

gold_label, annotator_match = annotate_one(zip_df)

First sentence: I can almost zip the dress .
Second sentence: He goes to the `` bathroom . '' On his way out , he forgets to zip up his `` pants . ''
Is the meaning of 'zip' in these two sentences the same or different?         :    same

Gold label: True
Annotator answer: True



## Question 4B: Reflection on 4A (2 points)

For the specific example you saw above, elaborate on **how you made your decision** about whether the meanings were the same or different. Go into detail on the specific example -- **be specific**. Were there any **difficulties** that the sentences posed? If it was easy, what kind of information did you use? Is there a way that you can think of that would **make the task easier**? Refer to the readings from Bender and Lascarides (Chapter 4) and SLP3 (Chapter 18) where appropriate.

I made my decision based on the whole sentence and the meaning of the word in that context. In the example above, 'zip' refers to the fastener on a piece of clothing in both sentences, so it has the same meaning in both. The decision was easy, and I believe that understanding the context of each sentence is what makes the meaning of the word clear.


## Question 4C: Randomly sample multiple pairs (8 points)

Create a new list called `accuracies`.

Randomly sample `zip_df` 30 times using a `for` loop. For this question, you will make judgments about whether the senses in the two sentences you get are the same or not. If the two senses are the same, you should write "same" (without quotes) in the text box. If the two senses are different, you should write "different" (without quotes). **Keep track of what makes your judgments difficult** for Question 4E. Try to do your very best on these classifications, and repeat the annotation until you get about 80% of them correct.

For each answer you make, you should **compare whether your answer matched** the `gold_label`. So, inside your loop, create a variable (a boolean) called `correct` that is whether your answer and the gold label match. Refer to the code in the `subtitle_randomizer` function if necessary.

In [None]:
# Create a list to store the accuracy of your responses
accuracies = []

# Create a loop to iteratively compare 30 pairs of sentences
for i in range(30):
    # use the annotate_one function with zip_df
    gold_label, annotator_match = annotate_one(zip_df)
    # check whether your answer is correct
    correct = (gold_label == annotator_match)
    # add the correctness score (1 or 0) for your answer to the accuracies list
    accuracies.append(int(correct))

First sentence: Now help me zip this up .
Second sentence: Shall we just step behind the screen and try it on ? Fred ? We 'd be glad to help you zip it up , baby .
Is the meaning of 'zip' in these two sentences the same or different?         :    same

Gold label: True
Annotator answer: True

First sentence: What 's your zip code ? Hmm ?
Second sentence: And here 's me , dude , halfway through August , and zip .
Is the meaning of 'zip' in these two sentences the same or different?         :    different

Gold label: True
Annotator answer: False

First sentence: We did n't even remotely own anything like what Bill was selling them . Nada , zip .
Second sentence: You know , sometimes he forgets to zip himself up .
Is the meaning of 'zip' in these two sentences the same or different?         :    different

Gold label: False
Annotator answer: False

First sentence: I ca n't believe we drove around all day ... and there 's not a single job in this town . There is nothing , nada , zip . Yea

## Question 4D: Assessing your accuracy (2 points)

Now, take the average of `accuracies` using `np.mean` and save it into a variable called `mean_accuracy`. Report this value by printing it. 
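As a quick sanity check on what the mean represents, the sketch below averages a made-up list of five 0/1 scores (`toy_accuracies` and `toy_mean` are hypothetical names, separate from your own `accuracies` list).

```python
import numpy as np

# Hypothetical results: 1 for a judgment that matched the gold label, 0 otherwise.
toy_accuracies = [1, 1, 0, 1, 1]

# The mean of 0/1 values is the proportion of correct judgments (4 out of 5 here).
toy_mean = np.mean(toy_accuracies)
print(toy_mean)
```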

#### Note: If your mean_accuracy is under .8, please go back up to 4C and rerun. Keep track of the number of times you have run.

In [None]:
# compute the proportion of your responses that were correct
mean_accuracy = np.mean(np.array(accuracies))
print('Mean of the accuracy is {}'.format(mean_accuracy))
print('Accuracy percentage is {}%'.format(mean_accuracy*100))

Mean of the accuracy is 0.9
Accuracy percentage is 90.0%


## Question 4E: Reflection on multiple rounds of annotation (8 points)

We asked you to assess your accuracy in 4D. Recall that in 4C we asked you to try to do your best. Tell us a little bit about the following:

1. About how long did it take you to get to 80% accuracy on the annotation?
2. What made some annotations easy?
3. What made some annotations hard?
4. Name some specific examples of sentences you got wrong and why you think that happened.

Refer to the readings from Bender and Lascarides (Chapters 3 and 4) and SLP3 (Chapter 18) where appropriate.

It took about 5 minutes to get past 80% accuracy. Almost all of the annotations were easy. A few, like "Iooks like you got zip .", were hard because the context is difficult to understand. In that sentence, I could not tell whether the word refers to a score on some test or to a part of a dress.

The following are the examples where I went wrong:

Example 1:

First sentence: I ca n't believe we drove around all day ... and there 's not a single job in this town . There is nothing , nada , zip . Yeah .

Second sentence: There 's a partial zip code on the reverse side .

Example 2:

First sentence: What 's your zip code ? Hmm ?

Second sentence: And here 's me , dude , halfway through August , and zip .

Example 3:

First sentence: Do n't forget zip your coat .

Second sentence: Iooks like you got zip .

In example 1, it is clear in the first sentence that 'zip' is a synonym of 'nothing', but in the second sentence it is unclear whether 'zip' is used metaphorically or literally, or whether its meaning is changed by the word 'partial'.

In example 2, it is clear in the first sentence that 'zip' refers to an area code, whereas in the second sentence it is not clear whether 'zip' is a metaphor for '0' or has some other meaning.

As for example 3, as mentioned above, the meaning of 'zip' is unclear in the second sentence.

# Bonus (Free response: 5 points)

Describe another way of solving the problem of annotating senses. Flesh out the procedure you would use, and explain why you think it would be better than the way we did the task here. What improvements does your task make upon the task in Questions 0-4? Given what we have discussed about senses and the difficulty in marking clear boundaries with them, does your method solve this problem? Why or why not?

Other ways of solving the problem of annotating senses include:

1. Supervised methods
2. Semi-supervised methods
3. Unsupervised methods

Currently, we are using a dictionary-based method. I believe that a semi-supervised method would produce better results than the others: because labeled training data is scarce, many word sense disambiguation methods use semi-supervised learning, which can take both labeled and unlabeled input. An early example is the Yarowsky algorithm, which exploits the 'one sense per collocation' and 'one sense per discourse' properties of human language.

The semi-supervised approach would address the challenges in Questions 0-4 by detecting subtleties and patterns, as follows.

When trained with a good classifier, semi-supervised WSD produces better results because the classifier takes care of accuracy, and because the method is semi-supervised we do not have to label every sentence, which saves a lot of time. Whenever new data arrives, we would retrain the model and tune its hyperparameters. The method will only work well if the labeled training data is accurate, so sufficient care must be taken when labeling it.

We would then use the trained model to generate labels for new data from the labeled data. Assuming the learning algorithm is effective, it should discover the patterns and contexts in which a sentence can be read, as well as which synonyms of the word "zip" are relevant to the decision. A good algorithm would handle these complexities and help us label the data appropriately.

Other semi-supervised algorithms use large untagged corpora to supply co-occurrence information that complements the tagged corpora. These methods have the potential to help adapt supervised models to a variety of domains.
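The bootstrapping idea behind the Yarowsky algorithm mentioned above can be sketched roughly as follows. This is a heavily simplified, Yarowsky-style loop and not the published algorithm: it replaces the log-likelihood-ranked decision list with a plain count threshold and leaves out the 'one sense per discourse' step. The toy sentences, the seed collocates, and the sense names `fasten` and `postal` are all made up for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: contexts of "zip" (not taken from the assignment file).
sentences = [
    "please zip up your jacket",
    "zip your coat up",
    "what is your zip code",
    "write the zip code on the envelope",
    "zip up the jacket before you leave",
    "the envelope needs a zip code",
]

# Seed rules (assumed, hand-picked): one collocate per sense.
rules = {"jacket": "fasten", "code": "postal"}


def label_with_rules(tokens, rules):
    """Return the sense if all matching rules agree, otherwise None."""
    senses = {rules[t] for t in tokens if t in rules}
    return senses.pop() if len(senses) == 1 else None


for _ in range(3):  # a few bootstrapping rounds
    # Step 1: label every sentence that the current rules cover.
    labeled = {}
    for s in sentences:
        sense = label_with_rules(s.split(), rules)
        if sense is not None:
            labeled[s] = sense
    # Step 2: count which words co-occur with each sense in the labeled sentences.
    counts = {}
    for s, sense in labeled.items():
        for token in set(s.split()) - {"zip"}:
            counts.setdefault(token, Counter())[sense] += 1
    # Step 3: promote words seen at least twice, with exactly one sense, to new rules.
    for token, sense_counts in counts.items():
        if len(sense_counts) == 1:
            sense, n = next(iter(sense_counts.items()))
            if n >= 2 and token not in rules:
                rules[token] = sense

final_labels = {s: label_with_rules(s.split(), rules) for s in sentences}
for s, sense in final_labels.items():
    print(f"{sense}: {s}")
```

In this toy run, the sentence "zip your coat up" contains neither seed word, yet it receives a label once "up" has been learned as a collocate from the seed-labeled sentences. That is the sense in which unlabeled data extends a small set of labels.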

