# Assignment Resit - Part B


Deadline: Tuesday, November 30 2021 before 17:00

This part of the assignment should be submitted as a zip file containing two python modules and the notebook for part A:

* utils.py
* texts_to_conll.py
* ASSIGNMENT-RESIT-A.ipynb (notebook containing part A)

Please name your zip file as follows: RESIT-ASSIGNMENT.zip and upload it via Canvas (Resit Assignment). 



If you have questions about this topic, please contact the teachers' mailing list: cltl.python.course@gmail.com.

Note that we currently only check this mailing list once a day. We have given you an extra week, so please start on time.
Answers to general questions will be covered on Piazza (https://piazza.com/class/kt1o9ir48ph50c), so please check if your question has already been answered.

All of the covered chapters are important to this assignment. However, please pay special attention to:

* Chapter 14 - Reading and writing text files
* Chapter 15 - Off to analyzing text
* Chapter 16 - Data formats I (CSV and TSV)
* Chapter 19 - More about Natural Language Processing Tools (spaCy)



In this assignment, we are going to write code which converts raw text to a structured format frequently used in Natural Language Processing. No matter what field you end up working in, you will always have to be able to convert data from format A to format B. You have already gained some experience with such conversions in Block 4. 

**The CoNLL format**

Before you use the output of a text analysis system, you usually want to store the output in a structured format. One way of doing this is to use NAF, a format based on XML. In this assignment, we are going to look at CoNLL, which is a table-based format (i.e. it is similar to csv/tsv). 

The format we are converting to is called CoNLL. CoNLL is the name of a conference (Conference on Computational Natural Language Learning). Every year, the conference hosts a 'competition'. In this competition, participants have to build systems for a certain Natural Language Processing problem (usually referred to as a 'task'). To compare results, participants have to stick to the CoNLL format. It has become a popular format for storing the output of NLP systems. 

The goal of this assignment is to write a python module which processes all texts in ../Data/Dreams/. The output should be written to a new directory, in which each text is stored as a csv/tsv file following CoNLL conventions. 

**Text analysis with SpaCy**

In part A of this assignment, you have already used SpaCy to process text. In this part of the assignment, you can make use of the code you have already written. The output files will contain the following information:

* The tokens in each text
* Information about the sentences in each text
* Part-of-speech tags for each token
* The lemma of each token
* Information about entities in a text (i.e. people, places, organizations, etc. that are mentioned)

**The assignment**

We will guide you towards the final file conversion step by step. The assignment is divided into three parts. We provide small toy examples you can use to develop your code. As a final step, you will be asked to transfer all your code to python modules and process a directory of text files with it. 

Exercise 1: A guided tour of the CoNLL format

Exercise 2: Writing a conversion function (text_to_conll)

Exercise 3: Processing multiple files using python modules

**Attention: This notebook should be placed in the same folder as the other Assignments!**


## 1. Understanding the CoNLL format


The CoNLL format represents information about a text in table format. Each token is represented on its own line. Each column contains a piece of information. Sentence boundaries are marked by empty lines. In addition, each token has an index. This index starts at 1 and identifies the position of the token in the sentence. Punctuation marks are also included. 

Consider the following example text: 

*This is an example text. The text mentions a former president of the United States, Barack Obama.*

The representation of this text in CoNLL format looks like this:

|   |      |   |      |        |  |
|----|-----------|-----|-----------|--------|---|
| 1  | This      | DT  | this      |        | O |
| 2  | is        | VBZ | be        |        | O |
| 3  | an        | DT  | an        |        | O |
| 4  | example   | NN  | example   |        | O |
| 5  | text      | NN  | text      |        | O |
| 6  | .         | .   | .         |        | O |
|    |           |     |           |        |   |
| 1  | The       | DT  | the       |        | O |
| 2  | text      | NN  | text      |        | O |
| 3  | mentions  | VBZ | mention   |        | O |
| 4  | a         | DT  | a         |        | O |
| 5  | former    | JJ  | former    |        | O |
| 6  | president | NN  | president |        | O |
| 7  | of        | IN  | of        |        | O |
| 8  | the       | DT  | the       | GPE    | B |
| 9  | United    | NNP | United    | GPE    | I |
| 10 | States    | NNP | States    | GPE    | I |
| 11 | ,         | ,   | ,         |        | O |
| 12 | Barack    | NNP | Barack    | PERSON | B |
| 13 | Obama     | NNP | Obama     | PERSON | I |
| 14 | .         | .   | .         |        | O |

**The columns represent the following information:**

* Column 1: Token index in sentence 
* Column 2: The token as it appears in the text (including punctuation)
* Column 3: The part-of-speech tag
* Column 4: The lemma of the token 
* Column 5: Information about the type of entity (if the token is part of an expression referring to an entity). For example, Barack Obama is recognized as a person.
* Column 6: Information about the position of the token in the entity mention. B stands for 'beginning', I stands for 'inside' and O stands for 'outside'. Anything that is not part of an entity mention is marked as 'outside'. (This is important information for dealing with entity mentions. Don't worry, you do not have to make use of this information here.)
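To make the layout more concrete, here is a minimal sketch of how a file in this format could be read back into Python. The function name `read_conll` and the sample string are assumptions made for illustration and are not part of the assignment; the sketch assumes tab-separated columns and an empty line between sentences.

```python
def read_conll(lines, delimiter='\t'):
    """Group CoNLL rows into sentences (a list of lists of columns)."""
    sentences = []
    current = []
    for line in lines:
        line = line.rstrip('\n')
        if line.strip() == '':
            # An empty line marks a sentence boundary.
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(line.split(delimiter))
    # Do not lose the last sentence if the input has no trailing empty line.
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample (illustrative only): two sentences, six columns each.
sample = (
    "1\tBarack\tNNP\tBarack\tPERSON\tB\n"
    "2\tObama\tNNP\tObama\tPERSON\tI\n"
    "\n"
    "1\tHello\tUH\thello\t\tO\n"
)
sentences = read_conll(sample.splitlines())
print(len(sentences))      # 2
print(sentences[0][1][1])  # Obama
```

Note that the empty entity-type column is kept as an empty string, so every row still has six fields.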


## 2. Writing the conversion function

In this section of the assignment, we will guide you through writing your function. You can accomplish the entire conversion in a single function (i.e. there will be no helper functions at this point). We will first describe what your function should do and then provide small toy examples to help you with some of the steps. 

**The conversion function: text_to_conll**

(1) Define a function called text_to_conll

(2) The function should have the following parameters:

* text: The input text (str) that should be processed and written to a conll file
* nlp: the SpaCy model 
* output_dir: the directory the file should be written to
* basename: the name of the output file without the path (i.e. the file will be written to output_dir/basename)
* delimiter: the field delimiter (by default, it should be a tab)
* start_with_index: By default, this should be True. 
* overwrite_existing_conll_file: By default, this should be set to True. 

(3) The function should do the following:

* Convert text to CoNLL format as shown in the example in exercise 1. 

* The file should have the following columns:
    * Token index in sentence (as shown in the example). If start_with_index is set to False, the first column should be the token.
    * token 
    * part of speech tag (see tips below)
    * lemma 
    * entity type (see tips below)
    * entity iob label (indicates the position of a token in an entity expression; see tips below)


* If the parameter overwrite_existing_conll_file is set to True, the file should be written to output_dir/basename.

* If the parameter overwrite_existing_conll_file is set to False, the function should check whether the file (path: output_dir/basename) exists. If it does, it should print 'File exists. Set param overwrite_existing_conll_file to True if you want to overwrite it.' If it does not exist, it should write it to the specified file. (See tips below)

* The delimiter between fields should be the delimiter specified by the parameter delimiter.
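The file-writing part of these requirements can be sketched separately from the text analysis. The helper below is hypothetical (the name `write_conll` and its `sentences` argument are assumptions, not part of the assignment): it takes rows that have already been analysed by hand and only illustrates the index column, the sentence boundaries, the delimiter and the overwrite check. It is one possible shape, not the required solution.

```python
import os
import tempfile


def write_conll(sentences, output_dir, basename, delimiter='\t',
                start_with_index=True, overwrite_existing_conll_file=True):
    """Write pre-analysed sentences to output_dir/basename.

    `sentences` is a list of sentences; each sentence is a list of
    (token, pos, lemma, entity_type, iob) tuples. Returns True if a
    file was written and False if an existing file was left untouched.
    """
    path = os.path.join(output_dir, basename)
    if os.path.isfile(path) and not overwrite_existing_conll_file:
        print('File exists. Set param overwrite_existing_conll_file '
              'to True if you want to overwrite it.')
        return False
    if not os.path.isdir(output_dir):
        os.mkdir(output_dir)
    with open(path, 'w', encoding='utf-8') as outfile:
        for sentence in sentences:
            # The index restarts at 1 for every sentence.
            for index, row in enumerate(sentence, start=1):
                fields = list(row)
                if start_with_index:
                    fields.insert(0, str(index))
                outfile.write(delimiter.join(fields) + '\n')
            # An empty line marks the sentence boundary.
            outfile.write('\n')
    return True


# Hand-made rows (illustrative only), written to a temporary directory.
rows = [[('Hello', 'UH', 'hello', '', 'O'), ('.', '.', '.', '', 'O')]]
with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, 'test_dir')
    write_conll(rows, out_dir, 'demo.tsv')
    with open(os.path.join(out_dir, 'demo.tsv'), encoding='utf-8') as infile:
        print(infile.read())
```

In the actual function, the rows would come from the SpaCy analysis of `text` rather than from a hand-made list.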


You can define the function in the notebook. Please test it using the following test text. Make sure to test the different parameters. Your test file should be written to `test_dir/test_text.tsv`.

In [None]:
# your function

In [None]:
# test your function
text = 'This is an example text. The text mentions a former president of the United States, Barack Obama.'
basename = 'test_text.tsv'
output_dir = 'test_dir'
text_to_conll(text,
              nlp,
              output_dir,
              basename,
              start_with_index=False,
              overwrite_existing_conll_file=True)

### Tip 0: Import spacy and load your model

(See part A and chapter on SpaCy for more information)

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

### Tip 1: Tokens, POS tags, and lemmas

Experiment with a small example to get the tokens and POS tags. Please refer to the chapter on SpaCy for an example of how to process text with SpaCy.

SpaCy provides more than one kind of POS tag. For this exercise, it does not matter which one you use. Hint: To get a string (rather than a number), use the SpaCy attributes ending with '_'. 

You can use the code below to experiment:

In [3]:

test = 'This is a test.'

doc = nlp(test)

tok = doc[0]
tok.text

'This'
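As a small illustration of the hint about attributes ending in '_', the sketch below collects the string versions of several annotations for one token. The helper name `token_fields` is hypothetical, and the stand-in token built with `SimpleNamespace` only mimics the attribute names so that the example runs without a model; its values are made up for illustration. In the notebook you would pass a real token such as `doc[0]`.

```python
from types import SimpleNamespace


def token_fields(token):
    """Return text, coarse POS tag, fine-grained tag and lemma as strings."""
    # The attributes ending in '_' hold strings; the versions without
    # the underscore (e.g. token.pos) hold integer IDs instead.
    return [token.text, token.pos_, token.tag_, token.lemma_]


# Stand-in for a SpaCy token (assumption for illustration only).
fake_token = SimpleNamespace(text='is', pos_='AUX', tag_='VBZ', lemma_='be')
print(token_fields(fake_token))  # ['is', 'AUX', 'VBZ', 'be']
```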

### Tip 2: Entities

**Entity types**

Entities are things (usually people, places, organizations, etc.) that exist in the real world. SpaCy can tag texts with entity types. If an expression refers to an entity in the world, it will receive a label indicating the type (for example, Barack Obama will be tagged as 'PERSON'). Since the expression 'Barack Obama' consists of two tokens, each token will receive such a label. Use dir() on a token object to find out how to get this information. Hint: **Everything about entities starts with 'ent_'**

**Position of the entity token**

An expression referring to an entity can consist of multiple tokens. To indicate that multiple tokens are part of the same/of different expressions, we often use the IOB system. In this system, we indicate whether a token is outside an entity mention, inside an entity mention or at the beginning of an entity mention.  In practice, most tokens of a text will thus be tagged as 'O'. 'Barack' will be tagged as 'B' and 'Obama' as 'I' (see example above). SpaCy can do this type of labeling. Use dir() on a token object to find out how to get this information. 
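To see why the IOB labels are useful, consider the following sketch, which reassembles entity mentions from labelled tokens. You do not need this for the assignment; the function name `group_mentions` and the hand-written input triples (taken from the example table in exercise 1) are assumptions for illustration.

```python
def group_mentions(tagged_tokens):
    """Collect entity mentions from (token, iob, entity_type) triples."""
    mentions = []
    current = None
    for token, iob, ent_type in tagged_tokens:
        if iob == 'B':
            # A new mention starts here.
            current = {'type': ent_type, 'tokens': [token]}
            mentions.append(current)
        elif iob == 'I' and current is not None:
            # The token continues the mention that is currently open.
            current['tokens'].append(token)
        else:
            # 'O': the token is outside any mention.
            current = None
    return [(' '.join(m['tokens']), m['type']) for m in mentions]


tagged = [
    ('of', 'O', ''),
    ('the', 'B', 'GPE'), ('United', 'I', 'GPE'), ('States', 'I', 'GPE'),
    (',', 'O', ''),
    ('Barack', 'B', 'PERSON'), ('Obama', 'I', 'PERSON'),
    ('.', 'O', ''),
]
print(group_mentions(tagged))
# [('the United States', 'GPE'), ('Barack Obama', 'PERSON')]
```

The 'B' label is what keeps two adjacent mentions apart: without it, two entities that directly follow each other would be merged into one.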

In [25]:
test = 'This is a test.'

doc = nlp(test)

tok = doc[0]
tok.text
dir(tok)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_extension',
 'has_vector',
 'head',
 'i',
 'idx',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 'left_edge',
 'lefts',
 'lemma',
 'lemma_',
 'lex_id',
 ...]
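Since the output of dir() is long, it can help to filter it. The sketch below keeps only the names starting with 'ent_'. The helper name `entity_attributes` and the `FakeToken` class are assumptions for illustration, so that the example runs without a model; in the notebook you would call `entity_attributes(tok)` on the real token.

```python
def entity_attributes(obj):
    """Return the names of all attributes of obj that start with 'ent_'."""
    return [name for name in dir(obj) if name.startswith('ent_')]


class FakeToken:
    """Stand-in with a few token-like attributes (illustration only)."""
    ent_type_ = 'PERSON'
    ent_iob_ = 'B'
    lemma_ = 'Barack'


print(entity_attributes(FakeToken()))  # ['ent_iob_', 'ent_type_']
```

For the real token, the listing above shows that this selects `ent_id`, `ent_id_`, `ent_iob`, `ent_iob_`, `ent_kb_id`, `ent_kb_id_`, `ent_type` and `ent_type_`.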

### Tip 3: Dealing with directories and files

Use os to check if files or directories exist. You can also use os to make a directory if it does not exist yet.

* os.path.isdir(path_to_dir) returns a boolean value. If the directory exists, it returns True. Else it returns False. You can use this to check if a directory exists. If it does not, you can make it.

* os.path.isfile(path_to_file) returns a boolean value. If the file exists, it returns True. Else it returns False. 

* os.mkdir(path_to_dir) makes a new directory. Try it out and create a directory called 'test_dir' in the current directory. 



In [11]:
# Check if file exists

import os

a_path_to_a_file = '../Data/books/Macbeth.txt'

if os.path.isfile(a_path_to_a_file):
    print('File exists:', a_path_to_a_file)
else:
    print('File not found:', a_path_to_a_file)

another_path_to_a_file = '../Data/books/KingLear.txt'

if os.path.isfile(another_path_to_a_file):
    print('File exists:', another_path_to_a_file)
else:
    print('File not found:', another_path_to_a_file)

File exists: ../Data/books/Macbeth.txt
File not found: ../Data/books/KingLear.txt


In [18]:
# check if directory exists

a_path_to_a_dir = '../Data/books/'

if os.path.isdir(a_path_to_a_dir):
    print('Directory exists:', a_path_to_a_dir)
else:
    print('Directory not found:', a_path_to_a_dir)

another_path_to_a_dir = '../Data/films/'

if os.path.isdir(another_path_to_a_dir):
    print('Directory exists:', another_path_to_a_dir)
else:
    print('Directory not found:', another_path_to_a_dir)

Directory exists: ../Data/books/
Directory not found: ../Data/films/
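The check and the creation of a directory can be combined. The following is a minimal sketch; the helper name `ensure_dir` is an assumption, and a temporary directory is used here only so that the example does not leave files behind.

```python
import os
import tempfile


def ensure_dir(path_to_dir):
    """Create path_to_dir if it does not exist yet.

    Returns True if the directory was created and False if it already existed.
    """
    if os.path.isdir(path_to_dir):
        return False
    os.mkdir(path_to_dir)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'test_dir')
    print(ensure_dir(target))  # True
    print(ensure_dir(target))  # False
```

Note that `os.mkdir` only creates a single directory level. For nested paths, `os.makedirs(path_to_dir, exist_ok=True)` is an alternative.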


## 3. Building python modules to process files in a directory

In this exercise, you will write two python modules:

* utils.py
* texts_to_conll.py

The module texts_to_conll.py should do the following: 

* process all text files in a specified directory (we will use '../Data/Dreams')
* write conll files representing these texts to another directory 


**Step 1: Preparation**:  

* Create the two python modules in the same directory as this notebook 
* Copy your function `text_to_conll` to the python module `texts_to_conll.py`
* Move the function `load_text` you have defined in part A to `utils.py` and import it in `texts_to_conll.py` 
* Move the function `get_paths` you have defined in part A to `utils.py` and import it in `texts_to_conll.py`

**Step 2: Convert all text files in ../Data/Dreams**:

Use your functions to convert all files in `../Data/Dreams/`. Please fulfill the following criteria:

* The new files should be placed in a directory called dreams_conll/, located in the current directory
* Each file should be named as follows: [original name without extension].tsv (e.g. vickie1.tsv)
* The files should contain an index column

Tips:

* Use a loop to iterate over the files in ../Data/Dreams. 
* Use string methods and slicing to create the new filename from the original filename (e.g. split on '/' and/or '.', use indices to extract certain substrings, etc.)
* Look at the resulting files to check if your code works. 
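The filename tip can be sketched as follows. The function name `conll_basename` is hypothetical, and the sketch assumes paths that use '/' as separator, as in the examples in this notebook.

```python
def conll_basename(path_to_text):
    """Turn a path such as '../Data/Dreams/vickie1.txt' into 'vickie1.tsv'."""
    # The last element after splitting on '/' is the filename.
    filename = path_to_text.split('/')[-1]
    # Split off the extension at the last '.' only.
    name_without_ext = filename.rsplit('.', 1)[0]
    return name_without_ext + '.tsv'


print(conll_basename('../Data/Dreams/vickie1.txt'))   # vickie1.tsv
print(conll_basename('../Data/Dreams/vickie10.txt'))  # vickie10.tsv
```

A check such as `path.endswith('.txt')` inside the loop is one way to skip files in the directory that are not text files.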


**Step 3: Test and submit**

Please test your code carefully. Then submit all your files in a .zip file via Canvas. 


**Congratulations! You have completed your first file conversion exercise!**


In [3]:
# Files in '../Data/Dreams':
%ls ../Data/Dreams/

IGNORE_ME!    vickie2.txt   vickie5.txt   vickie8.txt
vickie1.txt   vickie3.txt   vickie6.txt   vickie9.txt
vickie10.txt  vickie4.txt   vickie7.txt
