# Dataset exploration

This assignment is designed to help you get a sense of the raw data we often work with as researchers. For this assignment, you will be using `pandas` in addition to your own knowledge of linguistic structure from the readings. This assignment requires the qualitative and quantitative assessment of several parts of a very small corpus. 

The coding components of this assignment will largely require loading in the data and building a simple input system that allows you to classify two instances of the same string as they appear in different sentences as meaning the same or different things. 

The assignment is designed to give you a feel for what we want out of computational models of semantic representations of words and also give you experience doing annotation for NLP tasks.

# The problem: Ambiguous words

Language is often ambiguous, with the same string or series of sounds corresponding to different meanings. This **ambiguity** leads to challenges for NLP, such as distinguishing between seriousness and sarcasm (e.g., "That was great"), or between completely different, unrelated meanings (e.g., "I couldn't bear to part with it" vs. "The bear did not want to part with it"). Often, we are interested in identifying what meaning a word has in context. Can we properly guess what a speaker means by "bear", or "great"?

One of the major areas of natural language processing is **Word Sense Disambiguation** (WSD). Under these schemes, a string is assumed to have several possible **senses**, or distinct meanings. We will discuss in greater detail in class why the idea that words have distinct meanings is challenging, from identifying and differentiating these meanings to building models that can assign a label to a given instance of a word. 

## The dataset

This assignment will center around annotating a dataset of sentences that are stored in a `.tsv` file. This dataset, like the others we have worked with, contains one "document" per line. Each document in this dataset corresponds to one line, or subtitle, from a corpus of English subtitles for movies, along with some **metadata**. This particular file is structured as a matrix where the rows are records and the columns are values for that record -- it is very similar to a dictionary structure.

The subtitles are presented out of context, one on each line in the `.tsv`.

Some example subtitles, found in the first column of the dataframe (`"subtitle"`), look like this:

* "Neither his post , nor adjacent ."
* "Send men out to post a guard ."

In the column `'word'` you will find **ambiguous words**. We have selected a small number of ambiguous words for you, each of which has a diverse range of meanings.

The dataset itself comes from [Rice et al. (2018)](https://link.springer.com/article/10.3758/s13428-018-1107-7), a study that looked at how common different types of ambiguous words were in a corpus of subtitles. The third column in the `.tsv` is `'meaning_label'`. This is a categorical label (a string, `str`) for meanings from an internet dictionary called Wordsmyth. The values are often represented as numbers ("1", "2"), but may also be a dash character ("-"), which indicates something like "this meaning is not in the Wordsmyth list." The labels themselves have been defined somewhat informally by lexicographers. The label you see was the product of two separate annotators (students in the Rice et al. paper's lab) going through each of these sentences by hand and assigning labels to them with the help of the dictionary.

Here we will be doing something similar, but less complex.

# Question 0: Your imports (1 point)

In [1]:
# put all of your imports here
from google.colab import files
import pandas as pd
from io import BytesIO
import numpy as np

# Question 1: Loading in data using pandas (1 point)

Write your code below to load the file `ambiguous.tsv` into the notebook. Convert it into a dataframe using `pandas`. Your dataframe should be called `df`. The format for loading in the dataframe will look like this:

```python
import pandas as pd

df = pd.read_csv("/path/to/my/filename_for_this_assignment.tsv", sep="\t")
```

The code for this will be a little more complicated if you are loading it in through `files.upload()`:

```python
import pandas as pd
from io import BytesIO
from google.colab import files  # Colab-only helper that provides files.upload()

my_file = files.upload()

df = pd.read_csv(BytesIO(my_file["filename_for_this_assignment.tsv"]), sep="\t")
```
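
To see why `BytesIO` is needed here, it can help to try the same pattern outside Colab. The sketch below is only an illustration: `fake_upload` is a hypothetical stand-in for the dictionary that `files.upload()` returns (filename mapped to raw bytes), and the two rows are invented, not taken from `ambiguous.tsv`.

```python
from io import BytesIO

import pandas as pd

# Hypothetical stand-in for the return value of files.upload():
# a dict mapping each uploaded filename to its raw bytes.
fake_upload = {
    "toy.tsv": (
        b"subtitle\tword\tmeaning_label\n"
        b"toy sentence with post\tpost\t1\n"
        b"toy sentence with zip\tzip\t-\n"
    )
}

# read_csv expects a path or a file-like object, so the raw bytes
# are wrapped in BytesIO before being parsed as tab-separated text.
toy_df = pd.read_csv(BytesIO(fake_upload["toy.tsv"]), sep="\t")
print(toy_df.shape)
print(list(toy_df.columns))
```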

In [2]:
my_file = files.upload()

Saving ambiguous.tsv to ambiguous.tsv


In [3]:
# load in the document and store it as the desired dataframe
df = pd.read_csv(BytesIO(my_file["ambiguous.tsv"]), sep="\t")

# Question 2: Understanding the dataframe (3 points)

Now that you have `df` in the notebook, please **print** the answer to each of the following. If you have not worked with `pandas` before, please refer to the [documentation](https://pandas.pydata.org/docs/) and/or search for the following on Stack Overflow. If you search for these online, be sure to include the term "pandas" in your search. Each response is worth 1 point.

* The number of rows in `df`
* The names of the columns in `df`
* The unique values in `df['word']`

In [8]:
# number of rows
print("Number of rows in df::\n",len(df))

# column names of df
print("Column Names of df::")
columns = df.columns
for column in columns:
  print(column)

# unique values in df['word']
print("Unique values in df['word']::")
unique_values=df['word'].unique()
for uni_val in unique_values:
  print(uni_val)

Number of rows in df::
 351
Column Names of df::
subtitle
word
meaning_label
Unique values in df['word']::
post
zip
dove
tick


# Question 3: Subsetting the data (5 points)

For this question, I would like you to subset to the rows in the dataframe that only have the word "post" in them. In order to do this, refer to the `pandas` subsetting documentation: https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html

As with most comparisons, you will want to make sure you use the `==` operator from previous assignments. So, be sure to subset to all rows where the string is `"post"`. 

Save the subset of the data as a variable called `post_df`.
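
As a rough illustration of how `==` drives subsetting, the sketch below uses a small invented dataframe (`toy_df` and its rows are assumptions, not part of the assignment data). The comparison produces one boolean per row, and indexing with those booleans keeps only the matching rows.

```python
import pandas as pd

# Invented toy data, only for illustrating the mechanics
toy_df = pd.DataFrame({
    "subtitle": ["toy sentence a", "toy sentence b", "toy sentence c"],
    "word": ["post", "zip", "post"],
})

# The comparison yields one True/False value per row ...
mask = toy_df["word"] == "post"
print(mask.tolist())

# ... and indexing with that boolean Series keeps only the True rows.
toy_post_df = toy_df[mask]
print(len(toy_post_df))
```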

## Question 3A: Subsetting to a single matching string (2 points)

In [None]:
# your code for creating the post dataframe goes here
post_df = df[df['word'] == 'post']
post_df

## Question 3B: Describe the subset (1 point)

Once you have created your `post_df` dataframe, print out the length of `post_df` in terms of the number of rows.

In [10]:
# print the number of rows of post_df
print("Number of rows of post_df::", len(post_df))

Number of rows of post_df:: 75


## Question 3C: Create a subset for sense "2" in post_df

Now, create an additional subset to get the "2" sense. Store this smaller subset as `two_post_df`. Print the number of rows in `two_post_df`.

In [11]:
# create smaller subset
two_post_df = post_df[post_df['meaning_label'] == '2']
# print number of rows
print("Number of rows of two_post_df::", len(two_post_df))

Number of rows of two_post_df:: 36
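
The two-step subsetting used for `post_df` and `two_post_df` could also be written as a single combined mask. The sketch below shows this on invented toy data (the frame and its values are assumptions); note that the labels are strings, which is why the comparison is against `"2"` rather than the integer `2`.

```python
import pandas as pd

# Invented toy data; labels are stored as strings, as in the assignment
toy_df = pd.DataFrame({
    "word": ["post", "post", "zip", "post"],
    "meaning_label": ["2", "1", "2", "-"],
})

# Two-step subsetting, mirroring post_df -> two_post_df
step_one = toy_df[toy_df["word"] == "post"]
two_step = step_one[step_one["meaning_label"] == "2"]

# One combined mask; each comparison needs its own parentheses around it
one_step = toy_df[(toy_df["word"] == "post") & (toy_df["meaning_label"] == "2")]

# Both routes select the same rows
print(two_step.equals(one_step))
```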


# Question 4: Data Annotation (21 points total)

This question leads you through a long journey to understand how we can get data for our machine learning models. Below, you will find functions that will help us annotate our data inside a notebook. We can store our annotations and then use them to label our data later. 

For this annotation task, we will be simplifying the task of Word Sense Disambiguation down to pairwise comparisons. You will compare randomly selected sentences that contain two instances of the same string (i.e., "post"). We want you to annotate whether the use of "post" means roughly the same or different things in the sentences. This task is challenging because sometimes it is unclear from a sentence or there is not enough information to go on. Try your best and answer the questions below. You may have to go through the data several times, but it should teach you a lot about what we need to do good word sense disambiguation!

## Question 4A: Run and answer the prompt (1 point)

In [12]:
# Run the below code and complete one example

def subtitle_randomizer(subset_df):
  # get two rows
  two_rows = subset_df.sample(n=2, replace=False)
  sent_one, sent_two = two_rows['subtitle'].tolist()
  sense_sent_one, sense_sent_two = two_rows['meaning_label'].tolist()
  senses_match_flag = sense_sent_one==sense_sent_two
  print(f"First sentence: {sent_one}\nSecond sentence: {sent_two}")
  return senses_match_flag

def request_judgment():
  # keep prompting until the annotator types exactly 'same' or 'different'
  answer = ''
  while answer not in {'same', 'different'}:
    answer = input("Is the meaning of 'post' in these two sentences the same or different?         :    ")
  # True means the annotator judged the two senses to be the same
  return answer == 'same'

def annotate_one(subset_df):
  gold_label = subtitle_randomizer(subset_df)
  annotator_says_match_flag = request_judgment()
  print(f"\nGold label: {gold_label}\nAnnotator answer: {annotator_says_match_flag}\n")
  return gold_label, annotator_says_match_flag

In [13]:
gold_label, annotator_match = annotate_one(post_df)

First sentence: Eat faster ! It 's no big deal , you can post it on the lnternet
Second sentence: He offers you the post of Viceroy of Fujian and Guangdong Provinces
Is the meaning of 'post' in these two sentences the same or different?         :    different

Gold label: False
Annotator answer: False



## Question 4B: Reflection on 4A (2 points)

For the specific example you saw above, elaborate on **how you made your decision** about whether the meanings were the same or different. Go into detail on the specific example -- **be specific**. Were there any **difficulties** that the sentences posed? If it was easy, what kind of information did you use? Is there a way that you can think of that would **make the task easier**? Refer to the readings from Bender and Lascarides (Chapter 4) and SLP3 (Chapter 18) where appropriate.

*   In this example, the word 'post' has two different meanings. The first sentence is about uploading something to the internet, and the second is about offering someone a position. The two uses are not similar to each other, so I made my decision by analyzing the sense of 'post' in each sentence.
*   These sentences posed no difficulties. They are straightforward sentences with straightforward meanings. As noted above, 'post' is used in different ways in the two sentences.
*   For most of the sentences, the meaning of 'post' is enough to distinguish them from one another. In general, though, I think the task would be easier if we were given more information about each sentence.
> For example: "This is my post ," is quite confusing because it has multiple possible meanings. It may be talking about a position, or it may be talking about a social media post. If we were given more information about the sentence, we might be able to guess the correct sense.




   






## Question 4C: Randomly sample multiple pairs (8 points)

Create a new list called `accuracies`.

Randomly sample `post_df` 30 times using a `for` loop. For this question, you will make judgments about whether the senses in the two sentences you get are the same or not. To say whether two senses are the same, you should write "same" (without quotes) in the text box. If the two senses are different, you should write "different" (without quotes). **Keep track of what makes your judgments difficult** for Question 4E. Try to do your very best on these classifications and go through the answers until you get about 80% of these correct.

For each answer you make, you should **compare whether your answer matched** the `gold_label`. So, inside your loop, create a variable (a boolean) called `correct` that is whether your answer and the gold label match. Refer to the code in the `subtitle_randomizer` function if necessary.

In [15]:
# Create a list to store the accuracy of your responses
accuracies = []

# Create a loop to iteratively compare 30 pairs of sentences
for sample in range(30):
  ## - Use the annotate_one function with post_df
  gold_label, annotator_match = annotate_one(post_df)
  ## check whether your answer is correct
  correct = gold_label == annotator_match
  ## add the correctness score (1 or 0) for your answer to the accuracies list
  accuracies.append(int(correct))

First sentence: Alice , hiding behind the post .
Second sentence: How can I pay you back ? I 'll post the money back , okay ?
Is the meaning of 'post' in these two sentences the same or different?         :    different

Gold label: False
Annotator answer: False

First sentence: Jerry , I 'm sorry . We have a call from the Milwaukee post office . The mail for Henry Finch is being forwarded here to Manhattan .
Second sentence: Bobby , set that on the corner post .
Is the meaning of 'post' in these two sentences the same or different?         :    different

Gold label: False
Annotator answer: False

First sentence: And you ai n't no lamp post . You 're a heel . Buy you a drink .
Second sentence: All right . You set up a command post , okay ?
Is the meaning of 'post' in these two sentences the same or different?         :    different

Gold label: False
Annotator answer: False

First sentence: We must inform the militia at once . I , Philip Philipovich , have taken up an official post . 

## Question 4D: Assessing your accuracy (2 points)

Now, take the average of `accuracies` using `np.mean` and save it into a variable called `mean_accuracy`. Report this value by printing it. 

#### Note: If your mean_accuracy is under .8, please go back up to 4C and rerun. Keep track of the number of times you have run.

In [18]:
# compute the proportion of your responses that were correct
print(accuracies)
array = np.array(accuracies)
mean_accuracy = np.mean(array)
print(mean_accuracy)

[1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1]
0.8666666666666667
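
To connect this score to the reflection that follows, one option is to look up which trials were missed. The sketch below copies the 0/1 scores printed above into a new array (`scores` and `missed_trials` are illustrative names, not part of the assignment) and finds the zero-based positions of the incorrect answers.

```python
import numpy as np

# The 0/1 correctness scores printed above
scores = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1,
                   1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1])

# Zero-based positions of the trials where the annotation
# disagreed with the gold label
missed_trials = np.flatnonzero(scores == 0)
print(missed_trials)
print(len(missed_trials), "misses out of", len(scores))
```

The positions alone do not recover the sentences; for that, the loop would also need to store each sentence pair as it is shown.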


## Question 4E: Reflection on multiple rounds of annotation (8 points)

We asked you to assess your accuracy in 4D. Recall that in 4A we asked you to try to do your best. Tell us a little bit about the following:

1. About how long did it take you to get to 80% accuracy on the annotation?
2. What made some annotations easy?
3. What made some annotations hard?
4. Name some specific examples of sentences you got wrong and why you think that happened.
Refer to the readings from Bender and Lascarides (Chapters 3 and 4) and SLP3 (Chapter 18) where appropriate.

1.   About how long did it take you to get to 80% accuracy on the annotation?
> It took two rounds. On my first try I got 74% accuracy on the annotation, and on my second try I got 86.67%.
2.   What made some annotations easy?
> Some annotations were easy because the senses in the two sentences were straightforward. Just by reading them, you can tell whether the meanings of 'post' in the two sentences are the same or different.\
For example:\
First sentence: And you ai n't no lamp post . You 're a heel . Buy you a drink .\
Second sentence: All right . You set up a command post , okay ?\
In the first sentence, 'post' refers to a pole, whereas in the second it refers to a position. The meaning and overall sense of each sentence made the decision easier.
3.  What made some annotations hard?
> Sometimes it was hard to work out which sense of 'post' a sentence used, either because the sentence gave too little information or because it was difficult to understand.\
For example:\
First sentence: I got a package to pick up at the post office .\
Second sentence: This is my post ,\
In the first sentence, 'post' refers to the organized delivery of mail. I got this sense from the words 'post office' and 'package'. In other words, the word itself is not the only thing that matters; the surrounding information that shows how the word is used is also important.
In the second sentence, 'post' could mean a position, a social media post, or something related to mail. With so little information, it was hard for me to guess the sense.
4.   Name some specific examples of sentences you got wrong and why you think that happened. Refer to the readings from Bender and Lascarides (Chapters 3 and 4) and SLP3 (Chapter 18) where appropriate.
>  The following is a pair of sentences that I got wrong:\
Example 1:\
First sentence: So famous , in fact , that everybody has a reproduction . There are post cards -- We have the calendar .\
Second sentence: Let 's post this on the Dental Society database -- see if anyone responds .\
Is the meaning of 'post' in these two sentences the same or different?         :    same\
Gold label: False\
Annotator answer: True\
I thought both of these sentences were about delivering or uploading something. However, I overlooked the fact that 'post' in the first sentence is related to mailing, whereas in the second sentence it refers to sharing something on social media. My mistake was that I considered the wrong definitions of 'post'.











# Bonus (Free response: 5 points)

Describe another way of solving the problem of annotating senses. Flesh out the procedure you would use, and explain why you think it would be better than the way we did the task here. What improvements does your task make upon the task in Questions 0-4? Given what we have discussed about senses and the difficulty in marking clear boundaries with them, does your method solve this problem? Why or why not?

*   Another way of solving the problem of annotating senses is to measure the semantic similarity between the two sentences. The easiest and most straightforward method is to take the average of the word embeddings in each sentence and calculate the cosine similarity between the two averages.
*  First, we remove the stopwords from each sentence. Then, instead of storing the words themselves, we store the meaning of each word as a vector. From the combined word vectors we assign a meaning to each sentence. We can then compare the meaning of one sentence to the other using cosine similarity.
* This procedure is better than the one used in the task because it reduces the chance of ambiguity between two sentences.
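
A minimal sketch of the averaging-and-cosine procedure described above. Everything here is an assumption made for illustration: the 3-dimensional vectors in `toy_embeddings` are made up (real embeddings would come from a trained model), `toy_stopwords` is a tiny hand-written list, and the example sentences are invented rather than drawn from the dataset.

```python
import numpy as np

# Made-up 3-dimensional "embeddings", for illustration only
toy_embeddings = {
    "mail":   np.array([0.9, 0.1, 0.0]),
    "letter": np.array([0.8, 0.2, 0.1]),
    "fence":  np.array([0.0, 0.9, 0.2]),
    "wooden": np.array([0.1, 0.8, 0.3]),
    "post":   np.array([0.5, 0.5, 0.1]),
}
toy_stopwords = {"the", "a", "this"}

def sentence_vector(sentence):
    """Average the embeddings of the known, non-stopword tokens.

    Assumes the sentence contains at least one known token.
    """
    tokens = [t for t in sentence.lower().split()
              if t not in toy_stopwords and t in toy_embeddings]
    return np.mean([toy_embeddings[t] for t in tokens], axis=0)

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

mail_a = sentence_vector("post the letter")
mail_b = sentence_vector("mail this letter")
pole = sentence_vector("a wooden fence post")

# With these toy vectors, the two mail-related sentences score
# higher than the mail/pole pair.
print(cosine_similarity(mail_a, mail_b))
print(cosine_similarity(mail_a, pole))
```

In this kind of setup the vector for 'post' is the same in every sentence, so any difference in the score comes from the surrounding words. Very short sentences such as "This is my post ," would therefore likely remain hard to separate.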





