# Lab-1: Introduction to Natural Language Processing tools

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL


## Lab-1 in a nutshell

In this notebook, we try to answer the most important questions you might have about this lab session:
* What do I need to have installed on my computer?
* How do I work with Jupyter Notebooks?
* In what order should I go through the notebooks of Lab-1?
* What are we going to discuss in this lab session?
* Can you tell me something about the assignment?


## What do I need to have installed on my computer?
We will work with Anaconda, a distribution of Python as well as R that is widely used for scientific computing, particularly data science and machine learning. 

Please install [Anaconda Python 3.12](https://www.anaconda.com/download/). The install may take a few minutes.

### What if I already have Python/Anaconda installed?
To run the code in the lab sessions you will really need the **Anaconda** distribution with Python **3.12**. To inspect which Python version you have installed, please run the following cell (click on the cell and then click the *play* button at the top of this notebook):

```python
import sys
print(f'{sys.version_info.major}.{sys.version_info.minor}')
```

```
3.12
```


The output should say **3.12**. If this is the case, you are ready!

Note that you may have older versions of Python and Anaconda installed on your machine. This could raise incompatibility issues when installing packages during the course. If you come across problems with installing packages, check the Python versions, remove old versions and perhaps reinstall Anaconda as well.

If you want to keep multiple environments on your laptop, make sure you know what you are doing as we cannot guarantee that everything will work.
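If you are unsure which installation a notebook is actually using, a slightly longer check can help. The following is a minimal sketch, not an official Anaconda tool: the function name `describe_python` is made up for this example, and the test for a `conda-meta` folder is only a heuristic (an assumption) for recognising a conda-based installation.

```python
import os
import sys

def describe_python(required=(3, 12)):
    """Summarise the Python installation that runs this cell."""
    version = sys.version_info[:2]
    # Heuristic (assumption): conda installations and environments
    # usually contain a 'conda-meta' folder under sys.prefix.
    looks_like_conda = os.path.isdir(os.path.join(sys.prefix, "conda-meta"))
    return {
        "version": f"{version[0]}.{version[1]}",
        "matches_required": version == tuple(required),
        "looks_like_conda": looks_like_conda,
        "executable": sys.executable,
    }

info = describe_python()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `executable` points to a location you did not expect, the notebook is probably running on a different Python installation than the one you just installed.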

## How do I work with Jupyter Notebooks? 
The first choice you have to make is which editor you want to use. We recommend [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/index.html) (of course, you are free to use your own preferred editor). You don't have to install it separately, because it is included with Anaconda. There are two ways of opening it.

1. via Anaconda Navigator
    * Step 1: open Anaconda Navigator (see [here](https://docs.anaconda.com/anaconda/navigator/getting-started/) for the documentation on how to start Navigator on your computer.)
    * Step 2: you should see something similar to [this](https://user-images.githubusercontent.com/10782481/37635857-265734e6-2c24-11e8-8648-a517ebe23d6c.PNG) on your screen.
    * Step 3: click on the JupyterLab icon
    * Step 4: navigate in the left bar to the folder where you stored the notebooks for Lab 1
2. using the command line
    * Step 1: open the command line ([on Windows](https://www.lifewire.com/how-to-open-command-prompt-2618089), [on Mac](https://macpaw.com/how-to/use-terminal-on-mac))
    * Step 2: type `jupyter-lab` and press enter.
    * Step 3: navigate to the folder where you stored the notebooks for Lab 1

The documentation for JupyterLab can be found [here](https://jupyterlab.readthedocs.io/en/stable/).
    
After you've chosen and opened an editor, it is important that you know how to work with Notebooks. 
We are going to practice NLP using Notebooks. These Notebooks contain instructions and so-called 'code blocks'. The instructions are paragraphs of text that explain the concepts we are going to use. The 'code blocks' contain Python code.

Notebooks are pretty straightforward. Some tips:
* Cells in a notebook contain code or text. If you run a cell, it will either run the code or render the text.
* There are five ways to run a cell:
    * Click the 'play' button next to the 'stop' and 'refresh' button in the toolbar.
    * Alt + Enter runs the current cell and creates a new cell. 
    * Ctrl + Enter runs the current cell without creating a new cell.
    * Shift + Enter runs the current cell and moves to the next one.
    * Use the menu and select *Kernel* -> *Restart Kernel and Run All Cells* (this restarts the kernel and runs every cell in the notebook, not just the current one)
* The instructions are written in Markdown. [Here](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) is a nice Markdown cheatsheet if you want to write some more.
* Explore the menus for more options! You can even create a presentation using Notebooks.

If you want to know more, [this](https://www.dataquest.io/blog/jupyter-notebook-tutorial/) is a nice tutorial on notebooks.

## In what order should I go through the notebooks of Lab 1?
Please go through the notebooks in the following order:
* **Lab1.1-introduction.ipynb** (the notebook you are now reading)
* **Lab1.2-introduction-to-NLTK.ipynb** 
* **Lab1.3-introduction-to-spaCy.ipynb**
* **Lab1-assignment.ipynb**

## What are we going to discuss in this lab session?
Text data is unstructured. But if you want to extract information from text, then you often need to process that data into a more structured representation. The common idea for all Natural Language Processing (NLP) tools is that they try to structure or transform text in some meaningful way. Examples include:

* **Sentence splitting:** splitting texts into sentences
* **Tokenization:** splitting texts into individual tokens (words, punctuation marks, etc.)
* **Stop words recognition:** identifying commonly used words (such as 'the', 'a(n)', 'in', etc.) in text, possibly to ignore them in other tasks
* **Part-of-speech (POS) tagging:** identifying the parts of speech of words in context (verbs, nouns, adjectives, etc.)
* **Morphological analysis:** separating words into morphemes and identifying their classes (e.g. tense/aspect of verbs)
* **Stemming:** reducing words to their stems by removing inflectional/derivational affixes, such as 'troubl' for 'trouble/troubling/troubled'
* **Lemmatization:** identifying the lemmas (dictionary forms) of words in context, such as 'go' for 'go/goes/going/went'
* **Word Sense Disambiguation (WSD):** assigning the correct meaning to words in context
* **Named Entity Recognition (NER):** identifying people, locations, organizations, etc. in text
* **Constituency/dependency parsing:** analyzing the grammatical structure of a sentence
* **Semantic Role Labeling (SRL):** analyzing the semantic structure of a sentence (*who* does *what* to *whom*, *where* and *when*)
* **Sentiment Analysis:** determining whether a text is mostly positive or negative
* **Word Vectors (or Word Embeddings) and Semantic Similarity:** representing the meaning of words as vectors of real-valued numbers, where each position captures a dimension of the word's meaning and where semantically similar words have similar vectors (very popular these days)
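
To make the last item in the list more concrete, here is a minimal sketch of how the similarity of two word vectors can be measured with cosine similarity. The three-dimensional vectors in `toy_vectors` are invented for illustration only (an assumption); real word embeddings are learned from large amounts of text and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, made up for this example.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

print(f"cat vs. dog: {sim_cat_dog:.2f}")
print(f"cat vs. car: {sim_cat_car:.2f}")
```

With these toy values, 'cat' and 'dog' come out as far more similar than 'cat' and 'car', which is the behaviour one hopes to see in real embeddings as well.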

Several Python libraries let you run these NLP tasks. In this lab, we introduce two of the most popular ones:
* [Natural Language Toolkit](https://www.nltk.org/) (NLTK) (see notebook **Lab1.2-introduction-to-NLTK.ipynb**)
* [spaCy](https://spacy.io/) (see notebook **Lab1.3-introduction-to-spaCy.ipynb**)

For this lab session, we show how to perform the following tasks using NLTK and spaCy:
* **Sentence splitting**
* **Tokenization** 
* **Part-of-speech (POS) tagging** 
* **Stop words recognition** 
* **Stemming and Lemmatization** 
* **Constituency/dependency parsing** 
* **Named Entity Recognition (NER)** 

In the notebook in which we introduce NLTK, we not only show how to perform these tasks in NLTK, but we also **explain** them.
For spaCy, we show how to **interpret** the output.
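
To give a first impression of what the first few tasks involve, the sketch below does sentence splitting, tokenization and stop word removal with nothing but Python's built-in `re` module. It is a deliberately naive illustration, not how NLTK or spaCy work internally: the function names, the regular expressions and the tiny `STOP_WORDS` set are all assumptions made for this example.

```python
import re

# Hypothetical, very small stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "in", "is", "it", "of", "and"}

def split_sentences(text):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Return words (optionally with an internal apostrophe) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in STOP_WORDS."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

text = "The cat sat in the garden. It is a sunny day!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
content_tokens = [remove_stop_words(t) for t in tokens]

for sentence, kept in zip(sentences, content_tokens):
    print(sentence, "->", kept)
```

Simple rules like these break quickly on real text (for example, 'Dr. Smith' would be split into two sentences), which is one reason to rely on dedicated NLP libraries instead.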

## Can you tell me something about the assignment?
You will run NLTK and spaCy on the same text. You are then asked to compare the output.
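
One possible way to compare two outputs systematically is to align them and list only the places where they differ. The sketch below does this for two token lists using `difflib` from the standard library. The lists `tokens_a` and `tokens_b` are hypothetical examples written by hand, not actual NLTK or spaCy output, and `diff_tokens` is a helper name made up for this illustration.

```python
from difflib import SequenceMatcher

def diff_tokens(tokens_a, tokens_b):
    """Return the spans where two token lists disagree."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    differences = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (tag, tokens_a[a_start:a_end], tokens_b[b_start:b_end])
            )
    return differences

# Hypothetical tokenizations of the same sentence (invented for this example).
tokens_a = ["I", "do", "n't", "like", "e-mail", "."]
tokens_b = ["I", "don't", "like", "e", "-", "mail", "."]

differences = diff_tokens(tokens_a, tokens_b)
for tag, span_a, span_b in differences:
    print(f"{tag}: {span_a} vs. {span_b}")
```

The same idea can be applied to other kinds of output, such as lists of (token, POS tag) pairs, as long as both outputs are represented as sequences.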