# Lab-1: Introduction to Natural Language Processing tools

Copyright, Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

## Lab-1 in a nutshell
In this notebook, we try to answer the most important questions you might have about this lab session:
* What do I need to have installed on my computer?
* How do I work with Jupyter Notebooks?
* In what order should I go through the notebooks of Lab-1?
* What are we going to discuss in this lab session?
* Can you tell me something about the assignment?
* Who to contact for questions?

## What do I need to have installed on my computer?
Please install [Anaconda Python 3](https://www.anaconda.com/distribution/) on your computer. The install may take a few minutes. You probably already did this for the Python course.

To inspect which Python version you have installed, please run the following cell (click on the cell and then click the *play* button at the top of this notebook):

In [1]:
import sys
print(f'{sys.version_info.major}.{sys.version_info.minor}')

3.7


The output should say **3.7** or higher. If this is the case, you are ready!
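If you prefer a yes/no answer over reading the version number yourself, the check can be sketched as a small function. This is only an illustration: the names `MINIMUM_VERSION` and `python_is_recent_enough` are made up for this example, and the minimum of 3.7 is taken from the text above.

```python
import sys

# Minimum version taken from the text above; the name is hypothetical.
MINIMUM_VERSION = (3, 7)


def python_is_recent_enough(version=None, minimum=MINIMUM_VERSION):
    """Return True if `version` (default: the running Python) is at least `minimum`."""
    if version is None:
        version = sys.version_info
    # Tuples are compared element by element: first major, then minor.
    return tuple(version[:2]) >= tuple(minimum)


print(python_is_recent_enough())
```

Comparing tuples of numbers rather than strings matters here: as a string, '3.10' sorts before '3.7', while the tuple `(3, 10)` is correctly treated as newer than `(3, 7)`.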

## How do I work with Jupyter Notebooks? 
The first choice you have to make is what editor you want to use. One editor that we recommend is [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/index.html) (of course, you are free to use your own preferred editor). You don't have to install it. It is there when you download Anaconda. There are two ways of opening it.

1. via Anaconda Navigator
    * Step 1: open Anaconda Navigator (see [here](https://docs.anaconda.com/anaconda/navigator/getting-started/) for the documentation on how to start Navigator on your computer.)
    * Step 2: you should see something similar to [this](https://user-images.githubusercontent.com/10782481/37635857-265734e6-2c24-11e8-8648-a517ebe23d6c.PNG) on your screen.
    * Step 3: click on the JupyterLab icon
    * Step 4: Navigate in the left bar to the folder where you stored the notebooks for Lab 1
2. using the command line
    * Step 1: open the command line ([on Windows](https://www.lifewire.com/how-to-open-command-prompt-2618089), [on Mac](https://macpaw.com/how-to/use-terminal-on-mac)). Note that you can also open a terminal using the Anaconda Navigator.
    * Step 2: type `jupyter-lab` and press enter.
    * Step 3: navigate to the folder where you stored the notebooks for Lab 1

The documentation for JupyterLab can be found [here](https://jupyterlab.readthedocs.io/en/stable/).
    
After you've chosen and opened an editor, it is important that you know how to work with notebooks. 
We are going to practice Python using Notebooks. These Notebooks contain instructions and so-called 'code blocks'. The instructions are paragraphs of text that explain the concepts we are going to use. The 'code blocks' contain Python code.

Notebooks are pretty straightforward. Some tips:
* Cells in a notebook contain code or text (Markdown). If you run a cell, it will either run the code or render the text from the Markdown.
* There are five ways to run a cell:
    * Click the 'play' button next to the 'stop' and 'refresh' button in the toolbar.
    * Alt + Enter runs the current cell and creates a new cell. 
    * Ctrl + Enter runs the current cell without creating a new cell.
    * Shift + Enter runs the current cell and moves to the next one.
    * Use the menu and select *Kernel* -> *Restart Kernel and Run All Cells*
* The instructions are written in Markdown. [Here](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) is a nice Markdown cheatsheet if you want to write some more.
* Explore the menus for more options! You can even create a presentation using Notebooks.

If you want to know more, [this](https://www.dataquest.io/blog/jupyter-notebook-tutorial/) is a nice tutorial on notebooks.

## In what order should I go through the notebooks of Lab 1?
Please go through the notebooks in the following order:
* **Lab1.1-introduction.ipynb** (the notebook you are now reading)
* **Lab1.2-introduction-to-NLTK.ipynb** 
* **Lab1.3-introduction-to-spaCy.ipynb**
* **Lab1-assignment.ipynb**

## What are we going to discuss in this lab session?
Text data is unstructured. If you want to extract information from large volumes of text, reading it all yourself is not an option. You need to process those texts to obtain structured representations. The common idea behind all Natural Language Processing (NLP) tools is that they try to transform text in some meaningful way. This can be through low-level and high-level analysis. Examples include:

* **Sentence splitting:** splitting texts into sentences
* **Tokenization:** splitting sentences into the individual words and punctuation
* **Part-of-speech (POS) tagging:** identifying the parts of speech of words in context (verbs, nouns, adjectives, etc.)
* **Morphological analysis:** separating words into morphemes and identifying their classes (e.g. tense/aspect of verbs, number, gender)
* **Stemming:** identifying a base form of words in context by removing inflectional/derivational affixes, such as 'troubl' for 'troubles/troubling/troubled'
* **Lemmatization:** identifying the lemmas (dictionary forms) of words in context, such as 'trouble' for 'troubles/troubling/troubled'
* **Word Sense Disambiguation (WSD):** assigning the correct meaning to words in context, e.g. 'bank' as a riverside or as a financial institution
* **Named Entity Recognition (NER):** identifying phrases that name people, locations, organizations, etc. in text
* **Constituency/dependency parsing:** analyzing the grammatical structure of a sentence: *noun and verb phrases*, syntactic relations such as *subject* or *object* with predicates
* **Semantic Role Labeling (SRL):** analyzing the semantic structure of a sentence (*who* does *what* to *whom*, *where* and *when*)
* **Sentiment Analysis:** determining whether a text is positive or negative
* **Topic modeling:** determining the topic of text: *sports*, *politics*, *disaster*, *weather*, etc.
* **Word Vectors (or Word Embeddings) and Semantic Similarity:** representing the meaning of words as vectors (rows of real-valued numbers) where each number captures a dimension of the word's meaning and where semantically similar words have similar vectors (very popular these days and explained later in this course)
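To give a first feeling for what 'similar vectors' in the last item means, here is a minimal sketch. It makes two assumptions that are not part of this lab: the three-dimensional vectors are invented by hand (real word vectors have hundreds of dimensions and are learned from text), and cosine similarity is used as the measure, which is one common choice among several.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made toy vectors, purely illustrative (not taken from any real model).
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))
```

With these toy numbers, 'cat' comes out as more similar to 'dog' than to 'car', simply because we chose the numbers that way.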

There are powerful Python libraries available that allow you to run all of the above NLP tasks and more. 

In this lab, we introduce two of the most popular ones:
* [Natural Language Toolkit](https://www.nltk.org/) (NLTK) (see notebook **Lab1.2-introduction-to-NLTK.ipynb**)
* [spaCy](https://spacy.io/) (see notebook **Lab1.3-introduction-to-spaCy.ipynb**)

For this lab session, we show how to apply the following tasks using NLTK and spaCy:
* **Sentence splitting**
* **Tokenization** 
* **Part-of-speech (POS) tagging** 
* **Stemming and Lemmatization** 
* **Constituency/dependency parsing** 
* **Named Entity Recognition (NER)** 
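To make the first two tasks concrete before we turn to the libraries, here is a deliberately naive sketch that uses only Python's built-in `re` module. The function names `split_sentences` and `tokenize` are made up for this example, and this is not how NLTK or spaCy work internally.

```python
import re


def split_sentences(text):
    """Naively split text at whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Naively split a sentence into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "Text data is unstructured. Can we process it? Yes!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Such simple rules break down quickly: `split_sentences("Dr. Smith left.")` cuts the text in two after 'Dr.'. Handling cases like this is one reason to use a dedicated NLP library.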

In the notebook in which we introduce NLTK, we not only show how to perform these tasks in NLTK, but we also **explain** how they work. The online NLTK book is a good aid for this.

For spaCy, we show how to **interpret** the output. spaCy is a more modern library that can be used for production purposes. It is fast, works for many languages and is regularly updated. It is less useful for didactic purposes, as it is less clear what technology is inside.

## Using a command line interface such as Terminal

A simple way to access the command line is through the Anaconda Navigator. The next image shows a list of *environments* that are available within my Anaconda installation. Your list will be different, but to open a terminal within Anaconda you can select *base (root)* and *Open Terminal*. 
<img src="./img/anaconda-terminal.png">


You should now see a black window. This is called a terminal, and from it you can send commands to your computer by typing.

<img src="./img/terminal.png">

Mac (Unix) and Linux come with a terminal application that is straightforward to use. Windows also has a terminal, which can be installed and activated as explained here:

 https://docs.microsoft.com/en-us/windows/terminal/get-started

You can also simulate a linux terminal in Windows through Git bash:

 https://www.atlassian.com/git/tutorials/git-bash
 
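Finally, you can carry out some Unix commands in your notebook using the prefix '%'. Strictly speaking, '%' marks IPython 'magic' commands, which include shortcuts for common commands such as 'ls', 'cd' and 'mkdir'. Let's try this with the 'ls' command, which lists what is in the current folder.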
 

In [1]:
%ls

Lab1-apple-samsung-example.txt      Lab1.2-introduction-to-NLTK.ipynb
Lab1-assignment.ipynb               Lab1.3-introduction-to-spaCy.ipynb
Lab1-getting-started.pdf            img/
Lab1.1-introduction.ipynb


Here we see the content of the Lab1 folder that you downloaded, because this is the folder from which you run this notebook. Now we can try out other things. Let's use the 'cd' (change directory) command to enter the folder named 'img'.

In [2]:
%cd img
%ls -l

/Users/piek/Desktop/MA-HLT-introduction-2020/ma-hlt-labs/lab1.toolkits/img
total 552
-rw-r--r--@ 1 piek  staff   34449 Aug 20 18:10 f.png
-rw-r--r--@ 1 piek  staff  211656 Sep  3 16:26 nltk.download.png
-rw-r--r--  1 piek  staff   30893 Aug 20 18:10 nltk.tools.png


We see 3 image files (.png) that are used in the notebooks of Lab1. Let's try a few more. Try to figure out what these commands do yourself.

In [3]:
%cd ..
%ls

/Users/piek/Desktop/MA-HLT-introduction-2020/ma-hlt-labs/lab1.toolkits
Lab1-apple-samsung-example.txt      Lab1.2-introduction-to-NLTK.ipynb
Lab1-assignment.ipynb               Lab1.3-introduction-to-spaCy.ipynb
Lab1-getting-started.pdf            img/
Lab1.1-introduction.ipynb


In [4]:
%mkdir test
%cd test
%ls -l

/Users/piek/Desktop/MA-HLT-introduction-2020/ma-hlt-labs/lab1.toolkits/test


What happens if you repeat running the above cell?

Make sure you clean up your mess afterwards.
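The same kind of folder operations can also be done in plain Python, which behaves the same on every operating system. The sketch below is only an illustration: the helper name `make_and_list` is made up, and everything happens inside a temporary folder so that nothing is left behind in your Lab1 folder.

```python
import tempfile
from pathlib import Path


def make_and_list(parent, name="test"):
    """Create folder `name` inside `parent` (like mkdir) and return the sorted names in `parent` (like ls)."""
    new_dir = Path(parent) / name
    # exist_ok=True makes this call safe to repeat.
    new_dir.mkdir(exist_ok=True)
    return sorted(p.name for p in Path(parent).iterdir())


# Work in a temporary folder that is deleted automatically afterwards.
with tempfile.TemporaryDirectory() as scratch:
    print(make_and_list(scratch))
    print(make_and_list(scratch))  # repeating is harmless because of exist_ok=True
    Path(scratch, "test").rmdir()  # clean up: remove the (empty) folder again
    print(sorted(p.name for p in Path(scratch).iterdir()))
```

Note that this sketch never changes the current folder, so it does not behave exactly like the cell with `%cd` above.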

Now that you know how to run a command line instruction in a notebook, you can also install software from a notebook. We will explain this in the next notebook.

A final note on working with the terminal/command line. In your professional work, you are most likely going to use the terminal. You may have to connect your laptop to a server that runs Linux, Unix or Windows and run scripts with series of commands without a graphical interface. Typically, you run jobs with a lot of text data (hundreds of thousands or millions of documents) on these servers and not on your laptop.

So it is good to familiarize yourself in due time with different operating systems: Linux, Unix, Windows and using a command line interface. If it is too overwhelming now, don't worry. You will get it eventually during the course.

### A final note on virtual environments in Python

It is important to realize that it is common practice to install Python packages in different virtual environments. You can define as many environments as you want, and some installations, like Anaconda, create their own. The environments listed in the Navigator image shown above are environments within the Anaconda installation that are available on my machine. In my Anaconda Navigator, you see some exotic names such as *cltl-chat-ui* and *leolani*, which are environments created by me. The list on your machine will be different. You can find instructions and more information about creating, activating and stopping virtual environments here: https://docs.python.org/3/library/venv.html
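For the curious, here is a minimal sketch of creating such an environment from Python itself, using the standard `venv` module documented at the link above. It is an illustration under a few assumptions: the helper name `create_environment` and the folder name `demo-env` are made up, the environment is created without pip to keep the example quick, and it lives in a temporary folder that is removed again. This is Python's own mechanism, not the environment manager behind the Anaconda Navigator.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(env_dir):
    """Create a bare virtual environment in `env_dir` and return the path of its config file."""
    # with_pip=False keeps this sketch fast; a real environment usually wants pip.
    venv.create(env_dir, with_pip=False)
    return Path(env_dir) / "pyvenv.cfg"


# Create the environment in a temporary folder that is deleted automatically.
with tempfile.TemporaryDirectory() as scratch:
    config = create_environment(Path(scratch) / "demo-env")
    print(config.name, config.exists())
```

In practice you would usually create and activate environments from the terminal, as described in the documentation.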

*Why do you need to know about this now?*
As a beginner, it is not likely that you will create environments right away. However, when you install more and more of these fantastic NLP packages and modules that people created and shared with the world, your machine gets cluttered with *stuff*. You may run into problems because packages depend on conflicting versions of other packages, or because packages get installed in different environments. 

The next image, for example, shows two terminals on my machine: one opened from Anaconda and the other, smaller window opened from the Terminal application on my Mac. The command *pip list* shows all the packages installed with the *pip* installer. From the screen dumps you can see that the two terminals list different packages:

<img src="./img/pip-list.png">

If you do not know in which environment you are working, you might be surprised that your code cannot find or import something although you are sure you installed it. For example, you can see in the above image that the package "blinker" is installed in the standard basic Terminal of my Mac but not in the Anaconda terminal. The next screen dump shows what happens if we start Python in both terminals and try to import the module. Even though both terminals tell you that the same Python version is running (3.7.6), the module *blinker* can only be imported in the standard Terminal and not within Anaconda:

<img src="./img/import-error-env.png">

Depending on which terminal you use, you have different installations available. A notebook started from Anaconda cannot import *blinker* unless you install it within the Anaconda environment beforehand. 
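When an import fails unexpectedly, a quick way to find out which Python you are actually running is to ask Python itself. The sketch below uses only the standard library. The helper name `is_importable` is made up for this example, and whether *blinker* is found depends entirely on your own machine.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if the running Python can find the top-level module `module_name`."""
    return importlib.util.find_spec(module_name) is not None


# Where does the running Python live, and which environment does it belong to?
print("Interpreter:", sys.executable)
print("Environment:", sys.prefix)

# 'json' ships with Python; 'blinker' is only found if it is installed here.
for name in ("json", "blinker"):
    print(name, is_importable(name))
```

If you run this in a notebook and in a terminal and the two print different interpreter paths, packages installed from that terminal will not automatically be available in the notebook.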

It is therefore important to realize how you opened a terminal to install packages. Try to do that in a consistent way, especially when starting, so that you do not run into installation and import problems. When you are more skilled, you should start defining virtual environments yourself that you activate at will and install all you need and nothing else within that environment.

## End of this notebook