## Part-of-speech tagging

The `pos_tagging` directory contains code for part-of-speech tagging based on the training data.

### Approaches

I implemented a first-order hidden Markov model (HMM) in `pos_tagging.py`, in which POS tags are predicted approximately by the following equation:

$$ \hat{t}_{1:n} = \operatorname*{argmax}_{t_1 \cdots t_n} \prod_{i = 1}^n P(w_i|t_i)P(t_i|t_{i - 1}) $$

where $P(w_i|t_i)$ is the emission probability and $P(t_i|t_{i - 1})$ is the transition probability. To assign a nonzero emission probability when $w_i$ does not appear in the training data, I applied the following smoothing:

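$$ P_{\text{smooth}}(w_i|t_i) = \lambda P(w_i|t_i) + (1 - \lambda) \frac{1}{N_{t_i}} $$

where $P(w_i|t_i)$ on the right-hand side is the unsmoothed estimate from the training data, and $N_{t_i}$ is the vocabulary size for $t_i$: the number of unique tokens emitted from $t_i$ plus $1$, which represents `<UNK>`.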
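The smoothing and decoding described above can be sketched as follows. This is a minimal illustration, not the code in `pos_tagging.py`: the function names `train`, `emission_prob`, and `viterbi` are hypothetical, the sentence-start symbol `<s>` is an assumption, and transition probabilities are left unsmoothed.

```python
import math
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Count emissions and transitions from sentences of (word, tag) pairs."""
    emit = defaultdict(Counter)   # emit[tag][word] = count
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    for sentence in tagged_sentences:
        prev = "<s>"  # assumed sentence-start symbol
        for word, tag in sentence:
            emit[tag][word] += 1
            trans[prev][tag] += 1
            prev = tag
    return emit, trans


def emission_prob(word, tag, emit, lam):
    """Interpolate the count-based estimate with a uniform 1 / N_tag term."""
    total = sum(emit[tag].values())
    p_unsmoothed = emit[tag][word] / total if total else 0.0
    n_tag = len(emit[tag]) + 1  # unique words emitted by tag, plus <UNK>
    return lam * p_unsmoothed + (1 - lam) / n_tag


def viterbi(words, emit, trans, lam):
    """Return the most probable tag sequence for words (requires lam < 1)."""
    if not words:
        return []
    tags = list(emit)

    def log_trans(prev, tag):
        count = trans[prev][tag]
        # Unseen transitions get probability zero in this sketch.
        return math.log(count / sum(trans[prev].values())) if count else -math.inf

    # best[i][tag] = (best log-probability of a path ending in tag, back-pointer)
    best = [{}]
    for tag in tags:
        score = log_trans("<s>", tag) + math.log(emission_prob(words[0], tag, emit, lam))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            e = math.log(emission_prob(words[i], tag, emit, lam))
            best[i][tag] = max(
                (best[i - 1][p][0] + log_trans(p, tag) + e, p) for p in tags
            )

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]
```

Working in log space avoids numerical underflow when multiplying many small probabilities over a long sentence.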

To find an appropriate value for $\lambda$, I ran hyperparameter tuning with `tuning.py`. The highest accuracy was obtained at $\lambda = 0.99999$.
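A search of this kind can be sketched as a simple grid search. The snippet below is a hypothetical illustration and not the contents of `tuning.py`: `tune_lambda` and the `evaluate` callback are assumed names, and the candidate grid (values approaching 1) is only one plausible choice.

```python
def tune_lambda(candidates, evaluate):
    """Return (best_lambda, best_score) over the candidate values.

    `evaluate` is a caller-supplied function mapping a lambda value to an
    accuracy on held-out data (hypothetical; not part of the repository).
    """
    best_lam, best_score = None, float("-inf")
    for lam in candidates:
        score = evaluate(lam)
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam, best_score


# Assumed grid: roughly 0.9, 0.99, 0.999, 0.9999, 0.99999
candidates = [1 - 10 ** -k for k in range(1, 6)]
```

In practice `evaluate` would train or re-score the tagger with the given $\lambda$ and return its accuracy on data that is kept separate from the final test set.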

### Results

The highest accuracy achieved was 86.87%. To reproduce this figure, run `eval/gradepos.pl eval/pred.txt eval/wiki-en-test.pos`; the `gradepos.pl` script is available [here](https://github.com/neubig/nlptutorial/tree/master/script).
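For a quick check without the Perl script, accuracy can be approximated in Python. This sketch assumes the reported figure is token-level accuracy (correctly tagged tokens divided by all tokens); `tagging_accuracy` is a hypothetical helper, not part of `gradepos.pl` or this repository.

```python
def tagging_accuracy(predicted, reference):
    """Fraction of tokens whose predicted tag matches the reference tag.

    Both arguments are lists of sentences, each sentence a list of tags.
    """
    correct = total = 0
    for pred_tags, ref_tags in zip(predicted, reference):
        if len(pred_tags) != len(ref_tags):
            raise ValueError("predicted and reference sentences differ in length")
        correct += sum(p == r for p, r in zip(pred_tags, ref_tags))
        total += len(ref_tags)
    return correct / total if total else 0.0
```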

## Dependency Parsing

The `depend_parsing` directory contains code for dependency parsing based on the training data.

---
## Installation

For both tasks, the submitted code was tested with Python 3.9.9 and the packages specified in the `requirements.txt` of each directory.

To download the data, run `cd data` in each directory and then run the following commands for part-of-speech tagging:
```
wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-train.norm_pos
wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-test.norm_pos
```
and the following for dependency parsing:
```
wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/mstparser-en-train.dep
wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/mstparser-en-test.dep
```
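Once downloaded, the part-of-speech training file can be read along these lines. The sketch assumes each line of `wiki-en-train.norm_pos` holds one sentence of space-separated `word_TAG` tokens; `parse_tagged_line` and `read_tagged_file` are hypothetical helpers, not functions from this repository.

```python
def parse_tagged_line(line):
    """Split a line of `word_TAG` tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last underscore so words containing "_" stay intact.
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs


def read_tagged_file(path):
    """Read a whole file into a list of sentences of (word, tag) pairs."""
    with open(path, encoding="utf-8") as f:
        return [parse_tagged_line(line) for line in f if line.strip()]
```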