# SNLP Assignment 7 - Word Sense Disambiguation

Name 1: William LaCroix<br/>
Student id 1: 7038732<br/>
Email 1: williamplacroix@gmail.com<br/>


Name 2: Nicholas Jennings<br/>
Student id 2: 2573492<br/>
Email 2: s8nijenn@stud.uni-saarland.de<br/>

**Instructions:** Read each question carefully. <br/>
Make sure you appropriately comment your code wherever required. Your final submission should contain the completed Notebook and the respective Python files for any additional exercises necessary. There is no need to submit the data files should they exist. <br/>
Upload the zipped folder on CMS. Only one member of the group should make the submission.

---

## <span style="color:red">EM for Word Sense Disambiguation</span>

In this exercise, you will implement [expectation-maximization](https://machinelearningmastery.com/expectation-maximization-em-algorithm/) for word sense disambiguation.


There is no starter code provided for this exercise, and you have to write your code from scratch. We have however provided some pseudo-code in this notebook.

### The Dataset

NLTK has a corpus reader for the Senseval 2 dataset. This dataset contains data for four ambiguous words: hard, line, serve, and interest. You can read more at http://www.nltk.org/howto/corpus.html (search for "senseval"). The PROVIDED CODE(???) loads the dataset for you and converts each sample into a sample object. The sample class has two attributes: context and label. Label is the ground-truth sense of the ambiguous word. As EM is an unsupervised method, we will only use the labels for the final evaluation. Context is the left and right context of the ambiguous word as word IDs: it is a list of integers, which you can later use to index matrices.

### Pseudo Code

```
instances = senseval.instances(hard_f)

# All training samples as a list.
samples = [sample(inst) for inst in instances]

# Convert contexts to indices so they can be used for indexing.
for sample in samples:
    sample.context_to_index(word_to_id)
```

Now, "samples" contains all the sample objects for the ambiguous word "hard".

```python
from nltk.corpus import senseval
import sklearn.feature_extraction.text as sklfe

# Load instances for the 'hard' word
instances = senseval.instances('hard.pos')

# Define a class to represent a sample
class Sample:
    """
    Class representing a sample with context and label.
    Attributes:
        context_string (str): The left and right context of the ambiguous word.
        label (str): The ground truth sense of the ambiguous word.
        indexed_context (list): The context represented as word IDs.
    """
    def __init__(self, context_tuples, label):
        self.context_string: str = self.context_to_string(context_tuples)
        self.label: str = label
        self.indexed_context: list = []

    def context_to_string(self, context) -> str:
        """
        Converts the context to a string.
        Returns:
            str: The context as a string.
        """
        # Each context entry is indexed at position 0 to get the word form.
        return " ".join(token[0] for token in context)

    def context_string_to_index(self) -> None:
        """
        Converts the context to a list of word IDs.
        Relies on the module-level `vectorizer` having been fitted first.
        Returns:
            None. The indexed context is stored in the `indexed_context` attribute.
        """
        # Tokens missing from the vocabulary (e.g. punctuation) map to 0,
        # which is also the ID of the first vocabulary entry.
        self.indexed_context = [vectorizer.vocabulary_.get(word, 0) for word in self.context_string.split()]

# Convert instances to samples
samples = [Sample(inst.context, inst.senses[0]) for inst in instances]
corpus = [sample.context_string for sample in samples]
vectorizer = sklfe.CountVectorizer(token_pattern=r"(?u)\b'*[\w\-]*\b")
vectorizer.fit_transform(corpus)

# Print the indexed contexts for each sample
for sample in samples:
    sample.context_string_to_index()
    print(f"Context: {sample.context_string}")
    print(f"Indexed Context: {sample.indexed_context}")
    print(f"Label: {sample.label}")
```


```
Context: `` he may lose all popular support , but someone has to kill him to defeat him and that 's hard to do . ''
Indexed Context: [0, 5370, 7087, 6785, 603, 8624, 11099, 0, 1809, 10575, 5341, 11553, 6330, 5496, 11553, 3208, 5496, 711, 11420, 0, 5313, 11553, 3579, 0, 0]
Label: HARD1
Context: clever white house `` spin doctors '' are having a hard time helping president bush explain away the economic bashing that low-and middle-income workers are taking these days .
Indexed Context: [2325, 12406, 5626, 0, 10690, 3583, 0, 867, 5357, 286, 5313, 11530, 5426, 8761, 1800, 4240, 1093, 11422, 3838, 1236, 11420, 6812, 7262, 12542, 867, 11239, 11444, 3126, 0]
Label: HARD1
Context: i find it hard to believe that the sacramento river will ever be quite the same , although i certainly wish that i 'm wrong .
Indexed Context: [5704, 4504, 6100, 5313, 11553, 1343, 11420, 11422, 9786, 9637, 12447, 4149, 1276, 9061, 11422, 9832, 0, 645, 5704, 2080, 12494, 11420, 5704, 0, 12588, 0]
Label: HARD1
...
```

## Implementation

There are three matrices PROVIDED FOR YOU(???), which you will have to use for the implementation: $priors$, $probs$ and $C$.

 - $priors$: A vector of length K, where K is the number of clusters. Each value is a cluster prior $p(s_k)$. It is initialized randomly.

 - $probs$: A V x K matrix, where V is the size of the vocabulary. Each column is a conditional probability distribution over the words. $probs[i, k]$ is $p(v_i \mid s_k)$.

 - $C$: Contains the counts of the words in a given context. The size of the matrix is number of samples x vocabulary size. $C[i, j]$ is $C(v_j \text{ in } c_i)$, the count of word j in context i.
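As a rough illustration of the three objects described above, the following sketch builds them for a toy vocabulary. The helper name `init_parameters`, its arguments, and the toy contexts are assumptions made for this example and are not part of the assignment material.

```python
import numpy as np

def init_parameters(indexed_contexts, vocab_size, num_senses, seed=0):
    """Hypothetical helper: build C and randomly initialise priors and probs."""
    rng = np.random.default_rng(seed)

    # C[i, j] = number of times word j occurs in context i.
    C = np.zeros((len(indexed_contexts), vocab_size))
    for i, context in enumerate(indexed_contexts):
        np.add.at(C[i], context, 1)

    # priors: random vector of length K, normalised to sum to 1.
    priors = rng.random(num_senses)
    priors /= priors.sum()

    # probs: random V x K matrix, each column normalised to sum to 1.
    probs = rng.random((vocab_size, num_senses))
    probs /= probs.sum(axis=0, keepdims=True)
    return priors, probs, C

# Toy data: three contexts over a vocabulary of four word IDs.
toy_contexts = [[0, 1, 1], [2, 3], [1, 3, 3, 3]]
priors, probs, C = init_parameters(toy_contexts, vocab_size=4, num_senses=2)
```

`np.add.at` is used so that a word ID occurring several times in one context is counted each time.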


Get the IDs of the words. We will use these to index the rows of the probs matrix:

```
context_index = sample.context
words_given_sense = probs[context_index, :]
```

This is a matrix: the number of rows is the number of words in the context, and the number of columns is the number of senses. We multiply the entries of each column together to get the probability of a context given a sense, $p(c_i \mid s_k)$:

```
context_given_sense = np.prod(words_given_sense, axis=0)
```

This is a vector where each value is a class conditional probability (size is K).
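Multiplying many small probabilities can underflow for long contexts, so one possible way to organise the expectation step is to work in log space. The sketch below is an assumption about how this could be done, not the reference solution: the function name `e_step` is hypothetical, and it requires every entry of `priors` and `probs` to be strictly positive. It uses the fact that $\log p(c_i \mid s_k) = \sum_j C[i, j] \log p(v_j \mid s_k)$.

```python
import numpy as np

def e_step(priors, probs, C):
    """Hypothetical E-step: posteriors p(s_k | c_i) and total log likelihood."""
    # log p(c_i | s_k) = sum_j C[i, j] * log p(v_j | s_k), shape (N, K).
    log_context_given_sense = C @ np.log(probs)
    # log of the joint p(c_i, s_k) = log p(s_k) + log p(c_i | s_k).
    log_joint = np.log(priors) + log_context_given_sense

    # Log-sum-exp over senses gives log p(c_i) without underflow.
    max_log = log_joint.max(axis=1, keepdims=True)
    log_evidence = max_log + np.log(np.exp(log_joint - max_log).sum(axis=1, keepdims=True))

    posteriors = np.exp(log_joint - log_evidence)
    return posteriors, float(log_evidence.sum())

# Toy check with two senses, three words, and two contexts.
priors = np.array([0.5, 0.5])
probs = np.array([[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]])
C = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
posteriors, log_likelihood = e_step(priors, probs, C)
```

For the first toy context (word 0 twice), the result agrees with normalising `priors * np.prod(probs[[0, 0], :], axis=0)` directly.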

## Exercises

 - Briefly describe what EM is and when it is appropriate to use. Give an example use case (1 point).

 - Write the code for the expectation step (3 points).

 - Write the code for the maximization step. In order to check your implementation, make sure that the log likelihood increases over the iterations (4 points).

 - The code will print the ordered frequency of the labels within each cluster. Each cluster corresponds to one of the senses of "hard". How does the algorithm perform? Briefly explain. (1 point)

 - Does EM find the global optimum? Explain why or why not. (1 point)
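To make the log-likelihood check mentioned in the exercises concrete, here is a minimal sketch of a full EM loop on toy data. All function names (`e_step`, `m_step`, `run_em`), the `smoothing` constant, and the toy count matrix are assumptions for illustration. The small smoothing term keeps probabilities away from zero, which also means the log likelihood is only non-decreasing up to a very small tolerance.

```python
import numpy as np

def e_step(priors, probs, C):
    """Posteriors p(s_k | c_i) and total log likelihood, computed in log space."""
    log_joint = np.log(priors) + C @ np.log(probs)
    max_log = log_joint.max(axis=1, keepdims=True)
    log_evidence = max_log + np.log(np.exp(log_joint - max_log).sum(axis=1, keepdims=True))
    return np.exp(log_joint - log_evidence), float(log_evidence.sum())

def m_step(posteriors, C, smoothing=1e-6):
    """Re-estimate priors and probs from the posteriors (assumed smoothing)."""
    # New prior of each sense: its average posterior over all contexts.
    priors = posteriors.mean(axis=0)
    # Expected count of word j under sense k, shape (V, K).
    expected_counts = C.T @ posteriors + smoothing
    probs = expected_counts / expected_counts.sum(axis=0, keepdims=True)
    return priors, probs

def run_em(C, num_senses, iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    priors = rng.random(num_senses)
    priors /= priors.sum()
    probs = rng.random((C.shape[1], num_senses))
    probs /= probs.sum(axis=0, keepdims=True)

    log_likelihoods = []
    for _ in range(iterations):
        posteriors, log_likelihood = e_step(priors, probs, C)
        log_likelihoods.append(log_likelihood)
        priors, probs = m_step(posteriors, C)
    return priors, probs, posteriors, log_likelihoods

# Toy count matrix: six contexts over a vocabulary of four words.
C = np.array([
    [2, 1, 0, 0], [1, 2, 0, 0], [3, 1, 0, 0],
    [0, 0, 2, 1], [0, 0, 1, 2], [0, 1, 1, 3],
], dtype=float)
priors, probs, posteriors, log_likelihoods = run_em(C, num_senses=2)
```

Different values of `seed` can lead to different final parameters, which is one way to explore the question about the global optimum empirically.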


