# Lab-1: Introduction to Natural Language Processing tools

We thank [Chantal van Son](https://chantalvanson.wordpress.com/) for creating [a notebook](https://github.com/cltl/python-for-text-analysis/blob/master/Chapters/Chapter%2019%20-%20More%20about%20Natural%20Language%20Processing%20Tools%20(spaCy).ipynb), parts of which were used in the material for Lab 1.

Text data is unstructured. To extract information from text, you often need to convert it into a more structured representation. What all Natural Language Processing (NLP) tools have in common is that they structure or transform text in some meaningful way. Examples include:

* **Sentence splitting:** splitting texts into sentences
* **Tokenization:** splitting texts into individual tokens (words and punctuation marks)
* **Stop word recognition:** identifying commonly used words (such as 'the', 'a(n)', 'in', etc.) in text, possibly to ignore them in other tasks
* **Part-of-speech (POS) tagging:** identifying the parts of speech of words in context (verbs, nouns, adjectives, etc.)
* **Morphological analysis:** separating words into morphemes and identifying their classes (e.g. tense/aspect of verbs)
* **Stemming:** reducing words to their stems by stripping inflectional/derivational affixes, such as 'troubl' for 'trouble/troubling/troubled'
* **Lemmatization:** identifying the lemmas (dictionary forms) of words in context, such as 'go' for 'go/goes/going/went'
* **Word Sense Disambiguation (WSD):** assigning the correct meaning to words in context
* **Named Entity Recognition (NER):** identifying people, locations, organizations, etc. in text
* **Constituency/dependency parsing:** analyzing the grammatical structure of a sentence
* **Semantic Role Labeling (SRL):** analyzing the semantic structure of a sentence (*who* does *what* to *whom*, *where* and *when*)
* **Sentiment Analysis:** determining whether a text is mostly positive or negative
* **Word Vectors (or Word Embeddings) and Semantic Similarity:** representing the meaning of words as vectors of real-valued numbers, where each position captures a dimension of the word's meaning and semantically similar words have similar vectors (very popular these days)
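
To make the last item concrete, here is a minimal sketch of how "similar words have similar vectors" can be measured with cosine similarity. The three-dimensional vectors in `toy_vectors` are made up for illustration (an assumption, not taken from any trained model); real embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made toy vectors (hypothetical values, for illustration only).
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

print(f"cat vs dog: {sim_cat_dog:.2f}")
print(f"cat vs car: {sim_cat_car:.2f}")
```

With these toy values, 'cat' and 'dog' point in almost the same direction, so their similarity is higher than that of 'cat' and 'car'.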

## NLP toolkits
Several toolkits allow you to run these NLP steps from Python. In this lab, we introduce two of the most popular ones:
* [Natural Language Toolkit](https://www.nltk.org/) (see notebook **Lab1-introduction-to-NLTK.ipynb**)
* [spaCy](https://spacy.io/) (see notebook **Lab1-introduction-to-Spacy.ipynb**)

For this Lab, we focus on the following NLP tasks:
* **Sentence splitting**
* **Tokenization** 
* **Part-of-speech (POS) tagging** 
* **Stop word recognition**
* **Stemming**
* **Lemmatization** 
* **Constituency/dependency parsing** 
* **Named Entity Recognition (NER)** 
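
To give a first feel for the simplest of these tasks, the sketch below implements naive sentence splitting, tokenization, and stop word removal using only the Python standard library. It is not how NLTK or spaCy do it: the regular expressions and the tiny `STOP_WORDS` set are assumptions chosen for illustration.

```python
import re

# Tiny hand-picked stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "in", "is", "it"}


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in STOP_WORDS."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


text = "The cat sat in a box. It is happy!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
content_tokens = [remove_stop_words(t) for t in tokens]

print(sentences)
print(tokens)
print(content_tokens)
```

A rule this simple breaks easily: an abbreviation such as 'Dr. Smith' would be split into two sentences. Handling such cases is one reason to use a dedicated toolkit.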

In the notebook in which we introduce NLTK, we not only show how to perform these tasks in NLTK, but we also **explain** them.
For spaCy, we show how to **interpret** the output.

**What do these toolkits have in common?**
They both allow you to run NLP modules on unstructured text.

**How do they differ?**
* With NLTK, you run the NLP steps one by one. This allows you to go through each step in full and understand its input and output. spaCy is more of a black box: you provide it with input and it runs all of the NLP steps for you.
* NLTK was developed for teaching and research, whereas spaCy is what its developers call *industrial-strength*, i.e., meant for use on large amounts of data.
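
The difference in workflow can be sketched without either library. In the toy example below, `ToyDoc`, `add_tokens`, `add_stop_flags`, and `toy_pipeline` are hypothetical names invented for this illustration; they are not part of the NLTK or spaCy API. The first style calls each step by hand, the second hides the steps behind a single call.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical container that accumulates the output of each step."""
    text: str
    tokens: list = field(default_factory=list)
    stop_flags: list = field(default_factory=list)


def add_tokens(doc):
    """Step 1: whitespace tokenization (deliberately simplistic)."""
    doc.tokens = doc.text.split()
    return doc


def add_stop_flags(doc):
    """Step 2: mark each token as a stop word or not (tiny assumed list)."""
    doc.stop_flags = [t.lower() in {"the", "a", "in"} for t in doc.tokens]
    return doc


# Style 1: run every step yourself and inspect the intermediate results.
manual = ToyDoc("The cat sat in a box")
manual = add_tokens(manual)
print(manual.tokens)
manual = add_stop_flags(manual)
print(manual.stop_flags)


# Style 2: one call runs all steps and returns the fully analysed document.
def toy_pipeline(text, steps=(add_tokens, add_stop_flags)):
    doc = ToyDoc(text)
    for step in steps:
        doc = step(doc)
    return doc


analysed = toy_pipeline("The cat sat in a box")
print(list(zip(analysed.tokens, analysed.stop_flags)))
```

Both styles produce the same result; what differs is how much of the intermediate processing you see along the way.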

**Which one is faster?** [spaCy](https://spacy.io/usage/facts-figures)

We encourage you to continue with the two notebooks that introduce NLTK and spaCy.