# Assignment #4: Extracting syntactic groups using machine-learning techniques: Prerequisites
Author: Pierre Nugues

__You must execute this notebook before you start the assignment.__

The goal of the assignment is to create a system to extract syntactic groups from a text. You will apply it to the CoNLL 2000 dataset. 

In this part, you will collect the datasets and the files you need to train your models. You will also collect the script you need to evaluate them.

## Collecting the Training and Test Sets

For both the annotated data and the annotation scheme, you will use the dataset available from [CoNLL 2000](https://www.clips.uantwerpen.be/conll2000/chunking/).
1. Read the description of the CoNLL 2000 task.
2. Download both the training and test sets and decompress them, using the cells below.

CoNLL 2000 is an early dataset and, unlike many current ones, it has no development set.

You can also download the training and test sets from this site: https://huggingface.co/datasets/conll2000

In [1]:
!wget http://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz
!wget http://www.clips.uantwerpen.be/conll2000/chunking/test.txt.gz

--2023-10-02 10:41:13--  http://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz
Resolving www.clips.uantwerpen.be (www.clips.uantwerpen.be)... 146.175.13.81
Connecting to www.clips.uantwerpen.be (www.clips.uantwerpen.be)|146.175.13.81|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz [following]
--2023-10-02 10:41:14--  https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz
Connecting to www.clips.uantwerpen.be (www.clips.uantwerpen.be)|146.175.13.81|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 611540 (597K) [application/x-gzip]
Saving to: ‘train.txt.gz’


2023-10-02 10:41:14 (4.05 MB/s) - ‘train.txt.gz’ saved [611540/611540]

--2023-10-02 10:41:14--  http://www.clips.uantwerpen.be/conll2000/chunking/test.txt.gz
Resolving www.clips.uantwerpen.be (www.clips.uantwerpen.be)... 146.175.13.81
Connecting to www.clips.uantwerpen.be (www.clips.uantwerpe

In [2]:
!gunzip train.txt.gz
!gunzip test.txt.gz

In [3]:
!mkdir -p corpus
!mv train.txt test.txt corpus
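
Once the files are in `corpus`, you can check that they are readable. In the CoNLL 2000 format, each line holds a word, its part of speech, and its chunk tag, separated by spaces, and an empty line ends a sentence. The cell below is a minimal sketch of a reader under that assumption; the function name `read_sentences` is illustrative and is not part of the assignment code.

```python
from pathlib import Path


def read_sentences(lines):
    """Group CoNLL 2000 lines into sentences.

    Each sentence is a list of (word, pos, chunk) tuples.
    An empty line marks the end of a sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if line:
            current.append(tuple(line.split()))
        elif current:
            sentences.append(current)
            current = []
    if current:  # last sentence when the file has no trailing empty line
        sentences.append(current)
    return sentences


# Assumed location: the corpus folder created in the cell above
train_path = Path('corpus/train.txt')
if train_path.exists():
    with train_path.open(encoding='utf-8') as f:
        train_sentences = read_sentences(f)
    print(len(train_sentences), 'sentences')
    print(train_sentences[0][:3])
```

If the printed tuples do not have three fields, the download or decompression probably failed.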

## The evaluation script

You will train the models on the training set and evaluate them on the test set. For the evaluation, you will apply the `conlleval` script, which computes the precision, the recall, and their harmonic mean, F1.
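
To make the F1 definition concrete, here is a minimal sketch of the arithmetic. It assumes you already have three chunk counts: the chunks predicted correctly, all the predicted chunks, and all the chunks in the gold annotation. The function name `f1_score` and the counts in the example are hypothetical; in the assignment, `conlleval` does this counting for you.

```python
def f1_score(nbr_correct, nbr_predicted, nbr_gold):
    """Return (precision, recall, F1) computed from chunk counts."""
    precision = nbr_correct / nbr_predicted if nbr_predicted else 0.0
    recall = nbr_correct / nbr_gold if nbr_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 correct chunks out of 10 predicted, 16 in the gold set
precision, recall, f1 = f1_score(8, 10, 16)
print(f'P = {precision:.3f}, R = {recall:.3f}, F1 = {f1:.3f}')
```

The harmonic mean is dominated by the lower of the two values, so a system cannot reach a high F1 by favoring precision at the expense of recall, or the reverse.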

`conlleval` was originally written in Perl. It has since been rewritten in Python, and you will use one such translation in this lab. The line below installs it. The source code is available from this address: https://github.com/kaniblu/conlleval

In [7]:
!pip3 install conlleval

Collecting conlleval
  Using cached conlleval-0.2-py3-none-any.whl (5.4 kB)
Installing collected packages: conlleval
Successfully installed conlleval-0.2


## Collecting the Embeddings

You will represent the words with dense vectors instead of a one-hot encoding. GloVe embeddings are one such representation. The GloVe files contain a list of words, where each word is represented by a vector of a fixed dimension. In this notebook, we will use the file of 400,000 lowercase words with 100-dimensional vectors.
Download either:
* The GloVe 6B embeddings from [https://nlp.stanford.edu/projects/glove/](https://nlp.stanford.edu/projects/glove/) and keep the 100d vectors; or
* A local copy of this dataset with the cell below (faster)

In [5]:
!wget https://fileadmin.cs.lth.se/nlp/nobackup/embeddings/nobackup/glove.6B.100d.txt.gz

--2023-10-02 10:47:26--  https://fileadmin.cs.lth.se/nlp/nobackup/embeddings/nobackup/glove.6B.100d.txt.gz
Resolving fileadmin.cs.lth.se (fileadmin.cs.lth.se)... 130.235.16.7
Connecting to fileadmin.cs.lth.se (fileadmin.cs.lth.se)|130.235.16.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 134409071 (128M) [application/x-gzip]
Saving to: ‘glove.6B.100d.txt.gz’


2023-10-02 10:47:30 (31.8 MB/s) - ‘glove.6B.100d.txt.gz’ saved [134409071/134409071]



In [6]:
!gunzip glove.6B.100d.txt.gz
!mv glove.6B.100d.txt corpus
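
To verify the embedding file, you can load it into a dictionary. Each line of a GloVe text file holds a word followed by the values of its vector, separated by spaces. The cell below is a minimal sketch under that assumption; the function name `read_embeddings` is illustrative, and the vectors are stored in standard-library `array` objects only to keep the example free of dependencies. In the assignment itself, you will probably prefer NumPy arrays or tensors.

```python
from array import array
from pathlib import Path


def read_embeddings(lines):
    """Map each word to its vector, stored as an array of 32-bit floats."""
    embeddings = {}
    for line in lines:
        values = line.rstrip().split(' ')
        if len(values) < 2:  # skip empty or malformed lines
            continue
        embeddings[values[0]] = array('f', (float(v) for v in values[1:]))
    return embeddings


# Assumed location: the corpus folder used in the cells above
glove_path = Path('corpus/glove.6B.100d.txt')
if glove_path.exists():
    with glove_path.open(encoding='utf-8') as f:
        embeddings = read_embeddings(f)
    print(len(embeddings), 'words')
    print(len(embeddings['the']), 'dimensions')
```

With the file described above, you should see about 400,000 words and vectors of dimension 100.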