# Lab-1: Introduction to Natural Language Processing tools

Copyright, Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

## Lab-1 in a nutshell
In this notebook, we try to answer the most important questions you might have about this lab session:

* How do I work with the Terminal (Mac, Linux) or the Command Line (Windows)?
* How do I work with Jupyter Notebooks?
* In what order should I go through the notebooks of Lab-1?

In this course, you will be working with so-called Jupyter notebooks. Jupyter notebooks can be opened in a web browser and let you run Python commands, visualise the output and add documentation in Markdown. They are easy to use and adapt, so you can experiment with the material step by step. By adding your own notes, you can create your own learning experience. Below, we explain how to work with them.

## Working with the terminal

Mac (Unix) and Linux come with a terminal application that is straightforward to use. Windows also has a terminal, which can be installed and activated as explained here:

 https://docs.microsoft.com/en-us/windows/terminal/get-started

You can also simulate a Linux terminal in Windows through Git bash:

 https://www.atlassian.com/git/tutorials/git-bash
 

## Why do you need to work from the command line?

Why do you need to learn how to work from the command line if there are so many cool visual applications?

Working from the command line may seem awfully outdated. However, command-line tools have some advantages:

* they do not take up unnecessary memory when running code on large amounts of data
* they are more transparent, which gives you full control over what is running, where the data come from and where the result is stored
* code that runs from the command line can run practically anywhere, which makes sharing easier
* graphical tools become outdated quickly and may no longer run on newer versions of operating systems

You may argue that Jupyter Notebook is also a graphical tool, and you are right, although it is a very lightweight one. However, we do not advise you to use Jupyter notebooks in your future work for building and running complex Natural Language Processing systems. We use notebooks solely for teaching purposes and for documenting scientific experiments. Eventually, we will teach you how to put important code in Python files that can run from the command line and can also be called from high-level notebooks. In the latter case, the notebook documents your experiment, whereas the work is done in the Python script files.
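To make the idea of a Python file that works both from the command line and from a notebook a bit more concrete, here is a minimal sketch. The file name `count_tokens.py`, the function names and the default text are made up for illustration and are not part of the course material.

```python
# count_tokens.py (hypothetical file name, used only for illustration)
import argparse


def count_tokens(text):
    """Return the number of whitespace-separated tokens in a text."""
    return len(text.split())


def main(argv=None):
    # With argv=None, argparse reads the arguments typed on the command line
    parser = argparse.ArgumentParser(description="Count tokens in a text.")
    parser.add_argument(
        "text",
        nargs="?",  # the argument is optional so the sketch also runs without it
        default="a short example sentence",
        help="the text to count tokens in",
    )
    args = parser.parse_args(argv)
    print(count_tokens(args.text))


if __name__ == "__main__":
    # This part runs only when the file is executed as a script,
    # not when it is imported from a notebook or another module
    main()
```

From a terminal you would run such a file with `python count_tokens.py "some text"`. From a notebook in the same folder you could instead write `from count_tokens import count_tokens` and call the function directly, so the notebook documents the experiment while the script does the work.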

## Useful command line commands

A terminal gives you access to a so-called UNIX/Linux *shell*. There are many things you can do within such a shell, which was the only interface to computers for years. Working with shell commands is extremely fast and efficient. You do not need to learn all these commands, but some of them come in very handy when doing Natural Language Processing.

If you want a taste of the power of shell commands, check out the paper "Unix for Poets" by Kenneth Church. It explains how to write your own scripts for counting words, sorting word lists, making n-grams and building concordances for large amounts of text:

https://www.cs.upc.edu/~padro/Unixforpoets.pdf
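The paper does all of this with shell tools. To give a rough idea of what counting words and making n-grams means, here is a small Python analogue. This is only a sketch on a made-up sentence, not the method used in the paper, and splitting on whitespace is a deliberate simplification.

```python
from collections import Counter

# A made-up example text; real experiments would read a large file instead
text = "the cat sat on the mat and the dog sat on the rug"

# Naive tokenization: split on whitespace
tokens = text.split()

# Count how often each word occurs
word_counts = Counter(tokens)

# Bigrams (2-grams) are pairs of neighbouring tokens
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

print(word_counts.most_common(3))
print(bigram_counts.most_common(2))
```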

For now, we will go through a small number of basic commands that you need to know for the classes:

<ul>
<li> <b>pwd</b>: gives the path to the current directory
<li> <b>ls</b> (<b>dir</b>): gives a list of what is stored in the current directory
<li> <b>cd</b>: change directory, either going up to the parent or going into a subdirectory
<li> <b>mkdir</b>: create a new directory in the current directory
<li> <b>rmdir</b>: remove a subfolder when empty
<li> <b>rm</b>: permanently remove files and folders 
</ul>

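You can carry out some of these Unix commands in your notebook using the prefix '%' (a so-called IPython "magic" command) or the prefix '!' (which passes the command to the shell). We will go through the above commands in this notebook, but you should also try them in a terminal or Windows command line.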

### pwd

When you open a terminal or command line box, you will be somewhere on your hard disk. You will see a prompt where you can type a command to the computer. The first thing you want to know is where you are on your disk. For this we use the *pwd* command. Let's try this in this notebook, where we need to put "!" in front of it to tell Jupyter to run a shell command. On your terminal you can type *pwd* directly after the prompt and hit enter.

In [1]:
!pwd

/Users/piek/Desktop/t-MA-HLT-introduction-2024/ma-hlt-labs/lab1.toolkits


The result is the full path to the location where this notebook is running. Note the use of slashes. On Mac and Linux, *paths* are written with forward slashes to separate subdirectories. On a Windows machine, you will see backward slashes, e.g. "C:\\Users\piek\Desktop\t-MA-HLT-introduction-2023\ma-hlt-labs\lab1.toolkits".

The use of slashes in a path is a very common source of errors when people exchange code across platforms. Note that I use a Mac for the notebooks, so any path I show you has forward slashes. If you are on a Windows machine, flip these to backward slashes.
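In Python code, one way to avoid the slash problem is to build paths with the standard `pathlib` module instead of typing the separators yourself. The sketch below uses shortened, illustrative versions of the paths shown above; the `Pure...` classes are used so that the example gives the same result on any operating system.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same sequence of subdirectories, without any slashes in it
parts = ["ma-hlt-labs", "lab1.toolkits", "img"]

# Mac/Linux style: components are joined with forward slashes
posix_path = PurePosixPath("/Users/piek/Desktop").joinpath(*parts)

# Windows style: components are joined with backward slashes
windows_path = PureWindowsPath("C:/Users/piek/Desktop").joinpath(*parts)

print(posix_path)
print(windows_path)
```

In your own scripts you would normally use plain `Path`, which picks the right style for the machine the code runs on.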

### ls (Mac/Linux) or dir (Windows)

In a visual interface, opening a directory immediately shows you the content. In a command line interface you use the *ls* (Mac, Linux) or *dir* (Windows) command: 

In [2]:
%ls

Lab1-apple-samsung-example.txt      Lab1.3-introduction-to-spaCy.ipynb
Lab1-assignment.ipynb               img/
Lab1.1-introduction.ipynb           my_parse_tree.ps
Lab1.2-introduction-to-NLTK.ipynb   spacy_tree_structure.svg


We now get a listing of all the content of the lab1.toolkits directory, which is where we are.

### cd

Now we can try out other things. Let's use the 'cd' (change directory) command to enter the directory named 'img', which is inside the lab1.toolkits directory.

In [3]:
%cd img

/Users/piek/Desktop/t-MA-HLT-introduction-2024/ma-hlt-labs/lab1.toolkits/img



We can inspect the content of this directory using the same *ls* or *dir* command. However, we will now use a parameter "-l" that we add after the command. This parameter tells the system to provide details for each file and subdirectory shown as columns.

In [4]:
%ls -l

total 9456
-rw-r--r--  1 piek  staff   135495 Jul 17 13:48 anaconda-navigator.png
-rw-r--r--  1 piek  staff  1283626 Jul 17 13:48 anaconda-terminal.png
-rw-r--r--  1 piek  staff    34449 Jul 17 13:48 f.png
-rw-r--r--  1 piek  staff   575830 Jul 17 13:48 import-error-env.png
-rw-r--r--  1 piek  staff    68604 Jul 17 13:48 jupyter-lab.png
-rw-r--r--  1 piek  staff   157126 Jul 17 13:48 lab1.png
-rw-r--r--  1 piek  staff    43641 Jul 17 13:48 nltk-parse-2.png
-rw-r--r--  1 piek  staff    48520 Jul 17 13:48 nltk-parse.png
-rw-r--r--  1 piek  staff    62260 Jul 17 13:48 nltk-parser-3.png
-rw-r--r--  1 piek  staff   211656 Jul 17 13:48 nltk.download.png
-rw-r--r--  1 piek  staff    30893 Jul 17 13:48 nltk.tools.png
-rw-r--r--  1 piek  staff  1189585 Jul 17 13:48 pip-list.png
-rw-r--r--  1 piek  staff   202859 Jul 17 13:48 terminal.png
-rw-r--r--  1 piek  staff   769161 Jul 17 13:48 venv.png


We see several image files (.png) that are used in the notebooks of Lab1. The file names are preceded by their size in bytes and by the date and time of their last modification. The first columns show the access permissions, the number of links, the owner and the group. We will not explain all of these here, but you can find documentation online on what they mean and how to change access rights.
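If you want the same kind of information inside a Python script, the standard library can provide it. The following is a minimal sketch; the function name `list_details` is made up for this example, and it is demonstrated on a temporary directory so that no real files are touched.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def list_details(directory):
    """Return (name, size in bytes, last-modified time) per entry, sorted by name."""
    rows = []
    for entry in sorted(Path(directory).iterdir()):
        info = entry.stat()
        modified = datetime.fromtimestamp(info.st_mtime)
        rows.append((entry.name, info.st_size, modified))
    return rows


# Demonstrate on a throwaway directory that is deleted again afterwards
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.txt").write_text("hello", encoding="utf-8")
    for name, size, modified in list_details(tmp):
        print(f"{size:>8}  {modified:%b %d %H:%M}  {name}")
```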

Let's try a few more commands. We already used *cd* to enter a subdirectory, but you can imagine that you also want to move out of a directory by going a level up. To leave a directory and go back to its parent, you do not have to provide the name but can use the generic specification "..":

In [5]:
%cd ..
%ls

/Users/piek/Desktop/t-MA-HLT-introduction-2024/ma-hlt-labs/lab1.toolkits
Lab1-apple-samsung-example.txt      Lab1.3-introduction-to-spaCy.ipynb
Lab1-assignment.ipynb               img/
Lab1.1-introduction.ipynb           my_parse_tree.ps
Lab1.2-introduction-to-NLTK.ipynb   spacy_tree_structure.svg


In [6]:
%cd ..
%ls

/Users/piek/Desktop/t-MA-HLT-introduction-2024/ma-hlt-labs
README.md                   lab2.word_meaning/
data-formats/               lab3.machine_learning/
lab0.eliza/                 lab4.contextualized_models/
lab0.llama/                 lab5.final_assignment/
lab1.toolkits/              venv/


You can see that the ".." is an abstract reference to the parent and we can repeat this all the way up through the path as a tree till we end at the root.
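The same walk up the tree can be sketched in Python with `pathlib`. The path below is a shortened, illustrative version of the one used in this notebook, and `PurePosixPath` is used so that the example behaves the same on every operating system.

```python
from pathlib import PurePosixPath

path = PurePosixPath("/Users/piek/Desktop/ma-hlt-labs/lab1.toolkits")

# .parent corresponds to a single "cd .."
print(path.parent)

# .parents lists every ancestor, ending at the root "/"
for ancestor in path.parents:
    print(ancestor)
```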

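Obviously, you can also specify the exact path to go somewhere if you know it. For example, the next command tries to go to the "img" subdirectory directly. Note that a full path only works if it exists exactly as typed: the path below contains "2023" where the actual directory name contains "2024", so the command reports an error and we stay where we were.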

In [7]:
%cd /Users/piek/Desktop/t-MA-HLT-introduction-2023/ma-hlt-labs/lab1.toolkits/img

[Errno 2] No such file or directory: '/Users/piek/Desktop/t-MA-HLT-introduction-2023/ma-hlt-labs/lab1.toolkits/img'
/Users/piek/Desktop/t-MA-HLT-introduction-2024/ma-hlt-labs



### mkdir

If you want to create a directory yourself, you can do this using the *mkdir* command followed by a name. So let's create a subdirectory with the name "test":

In [8]:
%mkdir test
%cd test
%ls -l

/Users/piek/Desktop/t-MA-HLT-introduction-2024/ma-hlt-labs/test
total 0


What happens if you repeat running the above cell?

### rmdir

Make sure you clean up your mess afterwards and delete the test folder. We do this using the *rmdir* command:

In [9]:
%rmdir test

rmdir: test: No such file or directory


Hey! We get an error. Are you surprised? Let's check where we are using *pwd* and check what is there using *ls*:

In [10]:
%pwd

'/Users/piek/Desktop/t-MA-HLT-introduction-2024/ma-hlt-labs/test'

In [11]:
%ls

We are still inside the "test" folder and there is nothing in there. To remove it, we need to go up to the parent first and remove it from there:

In [12]:
%cd ..
%ls
%rmdir test
%cd .
%ls

/Users/piek/Desktop/t-MA-HLT-introduction-2024/ma-hlt-labs
README.md                   lab3.machine_learning/
data-formats/               lab4.contextualized_models/
lab0.eliza/                 lab5.final_assignment/
lab0.llama/                 test/
lab1.toolkits/              venv/
lab2.word_meaning/
/Users/piek/Desktop/t-MA-HLT-introduction-2024/ma-hlt-labs
README.md                   lab2.word_meaning/
data-formats/               lab3.machine_learning/
lab0.eliza/                 lab4.contextualized_models/
lab0.llama/                 lab5.final_assignment/
lab1.toolkits/              venv/


You can see that *test* was there before the *rmdir* and is no longer there now.

### To use or not to use the "rm" command

If a directory is not empty, *rmdir* will not work. There is a very powerful shell command, *rm*, that can remove both files and directories. When used recursively, it also deletes all subdirectories with their content.

Before you use this command, we advise you to study the documentation carefully. The *rm* command will NOT put your data into the trash but immediately remove it permanently.

What will happen if you go to the root of your disk using *cd* and call the *rm* command recursively to remove any directory or file, using a wildcard for their name? Don't try it, think about it!!!

It will permanently delete everything from your hard drive! To be safe, remove files and directories using your graphical interface unless you know exactly how the *rm* command works.
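The difference between removing an empty directory and removing a directory recursively can be tried out safely in Python. The sketch below works only inside a freshly created temporary directory; the names `test` and `notes.txt` are made up. Note that `shutil.rmtree`, like a recursive *rm*, deletes permanently and does not use the trash.

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a throwaway directory so nothing important can be deleted
base = Path(tempfile.mkdtemp())
test_dir = base / "test"
test_dir.mkdir()
(test_dir / "notes.txt").write_text("some content", encoding="utf-8")

# Like rmdir: this only works for empty directories
try:
    test_dir.rmdir()
    removed_by_rmdir = True
except OSError:
    removed_by_rmdir = False
print("removed by rmdir:", removed_by_rmdir)

# Like a recursive rm: removes the directory and everything inside it
shutil.rmtree(test_dir)
print("still exists:", test_dir.exists())

# Clean up the now empty throwaway directory
base.rmdir()
```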

A final note on working with the terminal/command line. In your professional work, you are most likely going to use the terminal. You may have to connect your laptop to a server that runs Linux, Unix or Windows and run scripts with series of commands without a graphical interface. Typically, you run jobs with a lot of text data (hundreds of thousands or millions of documents) on these servers and not on your laptop.

So it is good to familiarize yourself in due time with different operating systems (Linux, Unix (Mac), Windows) and with using a command line interface. If it is too overwhelming now, don't worry. You will get the hang of it during the master's programme.

Some other useful links:
* Unix tutorial: https://www.ibm.com/developerworks/aix/library/au-unixtext/
* Linux tutorial: https://tldp.org/LDP/intro-linux/html/intro-linux.html

## How do I work with Jupyter Notebooks? 

The first choice you have to make is what editor you want to use for running and creating notebooks. The editor that we recommend is [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/index.html). 

The best way to launch JupyterLab is from the command line:

<ul>
<li> Step 1: open the command line or a terminal.
<li>  Step 2: type "jupyter lab" and press enter.
<li>  Step 3: navigate to the folder where you stored the notebooks for Lab1.
</ul>

The documentation for JupyterLab can be found [here](https://jupyterlab.readthedocs.io/en/stable/).
    
After you have chosen and opened an editor, it is important that you know how to work with notebooks. 
We are going to practice Python using Notebooks. These Notebooks contain instructions and so-called 'code blocks'. The instructions are paragraphs of text that explain the concepts we are going to use. The 'code blocks' contain Python code.

Notebooks are pretty straightforward. Some tips:
* Cells in a notebook contain code or text (Markdown). If you run a cell, it will either run the code or render the text from the Markdown.
* There are five ways to run a cell:
    * Click the 'play' button next to the 'stop' and 'refresh' buttons in the toolbar.
    * Alt + Enter runs the current cell and creates a new cell. 
    * Ctrl + Enter runs the current cell without creating a new cell.
    * Shift + Enter runs the current cell and moves to the next one.
    * Use the menu and select *Kernel* -> *Restart Kernel and Run All Cells* (this runs every cell in the notebook).
* The instructions are written in Markdown. [Here](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) is a nice Markdown cheatsheet if you want to write some more.
* Explore the menus for more options! You can even create a presentation using Notebooks.

If you want to know more, [this](https://www.dataquest.io/blog/jupyter-notebook-tutorial/) is a nice tutorial on notebooks.

## In what order should I go through the notebooks of Lab 1?
Please go through the notebooks in the following order:
* **Lab1.1-introduction.ipynb** (the notebook you are now reading)
* **Lab1.2-introduction-to-NLTK.ipynb** 
* **Lab1.3-introduction-to-spaCy.ipynb**
* **Lab1-assignment.ipynb**

## What are we going to discuss in this lab session?
Natural Language Processing (NLP) tools transform text in some meaningful way (e.g. translation or summarization) or create interpretations of it (e.g. determining the morpho-syntactic structure of a text, determining its meaning, or using it to answer questions). This can involve low-level analysis, e.g. splitting a text into separate words or assigning a part-of-speech tag to each word, and high-level analysis, e.g. detecting entities or events in text, or determining the sentiment or emotions it expresses.
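To give a first impression of what a low-level task such as splitting a text into words looks like, here is a deliberately naive sketch in plain Python. The function name `simple_tokenize` and the regular expression are assumptions made for illustration; this is not how NLTK or spaCy do it, and it breaks on cases such as abbreviations or contractions that real toolkits handle.

```python
import re


def simple_tokenize(text):
    """Split a text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("The cat sat on the mat.")
print(tokens)
```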

There are powerful Python toolkits available that allow you to run all the above NLP tasks and more. 

In this lab, we introduce two popular toolkits:
* [Natural Language Toolkit](https://www.nltk.org/) (NLTK) (see notebook **Lab1.2-introduction-to-NLTK.ipynb**)
* [spaCy](https://spacy.io/) (see notebook **Lab1.3-introduction-to-spaCy.ipynb**)

In the notebook in which we introduce NLTK, we not only show how to perform these tasks in NLTK, but we also **explain** how they work. The online NLTK book is a good aid for this.

For spaCy, we show how to **interpret** its output. spaCy is a more modern library that can be used for production purposes. It is fast, works for many languages and is regularly updated. It is less useful for didactic purposes, as it is less clear what technology is inside.

## End of this notebook