<div class="row">
    <div class="column">
        <img src="https://datasciencecampus.ons.gov.uk/wp-content/uploads/sites/10/2017/03/data-science-campus-logo-new.svg"
             alt="Data Science Campus Logo"
             align="right" 
             width = "340"
             style="margin: 0px 60px"
             />
    </div>
    <div class="column">
        <img src="https://cdn.ons.gov.uk/assets/images/ons-logo/v2/ons-logo.svg"
             alt="ONS Logo"
             align="left" 
             width = "420"
             style="margin: 0px 30px"/>
    </div>
</div>

# Introduction to Natural Language Processing

## Trainers

<font size="+0.5">Dr Saliha Minhas</font>   
(<saliha.minhas@ons.gov.uk>)  
<br>
<font size="+0.5">Jonathon Mellor</font>   
(<jonathon.mellor@ons.gov.uk>)  

<br>

**Pre-Requisites:** Knowledge of Python data structures and how to perform selection and iteration is required. It is expected that you have completed an Introduction to Python course, at minimum.

<br>

**Brief Description:** Natural Language Processing (NLP) is a sub-field of Artificial Intelligence. It is used for processing and analysing text. NLP can be applied at scale to help gain insights from unstructured data sources. Some example applications include: 
* Search engines (Google) 
* Text classification (spam filters)
* Identifying sentiments for a product (sentiment analysis)
* Methods for discovering abstract topics in a collection of documents (topic modelling)
* Translation
* Information retrieval

This is an Introduction to Natural Language Processing, and so the main concepts are about cleaning, processing, exploring datasets, and becoming familiar with working with text-based data. 

<br>

**Aims, Objectives and Intended Learning Outcomes:** This module will introduce the Natural Language Processing field using the Python programming language. It covers some basic terminology, the process of 'cleaning' a dataset, exploring it and applying simple feature engineering techniques to transform the data. By the end of the module learners will understand and apply the necessary steps to 'clean', explore and transform their dataset in the appropriate order.

<br>

**Datasets:** `Patent Dataset`, `Hep Dataset` (High_Energy_Physics), `Spam/Ham`

<br>

**Libraries:** Please read the pre-course instructions for installation details.

In brief, the following third-party libraries are used:

`pandas, numpy, matplotlib, wordcloud, spacy, nltk, stanza, sklearn`

**Acknowledgements:** Many thanks to everyone listed below for their valuable contributions to the development of this course.
<br>
Skevi Pericleous, Kaveh Jahanshahi, Savvas Stephanides, Joshi Chaitanya, Ian Grimstead, Thanasis Anthopoulos, Gareth Clews, Isabela Breton, Dan Lewis, Dave Pugh.
<br>

## Table of Contents
<br>

<a href="#Module 1 - Background"><font size="+1">Module 1 - Background</font></a>
<ol>
  <li>What is special about language?</li>
  <li>Major levels of linguistic structure</li>
  <li>Challenging tasks in language processing</li>
  <li>Real-life applications and challenges</li>
  <li>How we work</li>
  <li>spaCy, stanza and nltk packages</li>
</ol>

**Learning Outcomes:** 

* Describe what is special about human language
* List the major levels of linguistic structure
* Describe how language processing can be challenging
* Define areas where progress has or hasn't been made in language processing
* Describe the work procedure for this course

<br>

### 1.1 What is special about language?

<br>

<img src="../pics/hello-in-many-languages-word-cloud.png" alt="Word cloud of greetings in various languages">

<br>

Natural language refers to any language that is used by humans and has evolved over time. Natural language is not planned or systematically designed. It differs from formal language, such as that used by computers.

* Language is uniquely human.
* “Infinite use of finite means” (Hauser, Chomsky, & Fitch, 2002). This means that an infinite number of unique sentences can be produced with a limited number of words and rules.
* Language enables you to say things you have never heard before.
* Unlike animal communication, human language can refer to the past, the future, and abstract notions.
* Language is co-operative and enables expression of a shared goal.
* It is a complex system learned quickly and easily by infants with almost no instruction.
* All languages have certain features in common.
* Every language has a system of rules constructing syllables, words and sentences.
* "Knowing" a language means knowing these rules, which are largely subconscious.
* Language is social. It varies according to region, speaker, identity and situation.
* Language is always changing. Languages can be born and can die.
* There are no primitive languages.

<br>

**Origins of Language**

<br>

There are about 5000 languages spoken in the world today (a third of them in Africa), but scholars group them together into relatively few families, probably fewer than twenty. Languages are linked to each other by shared words, sounds or grammatical constructions. The theory is that the members of each linguistic group have descended from one language, a common ancestor. In many cases that original language is judged by the experts to have been spoken in surprisingly recent times, as little as a few thousand years ago. (The London Language School, 2017)

While it is difficult to establish a precise date for the evolution of language, the capacity probably coincides with the emergence of modern *Homo sapiens*, and debatably other hominids. (*Henshilwood CS, Dubreuil B. Reading the artefacts: gleaning language skills from the Middle Stone Age in southern Africa. Cradle Language. 2009;2:61–92.*)



<img src="../pics/old-world-language.jpg" alt="Old World language family derivation tree">

<br>

**Language Universals**


* All languages have verbs and nouns.<br>
* All spoken languages have consonants and vowels.<br>
* In declarative sentences with nominal subject and predicate, the dominant order is almost always one in which the subject precedes the predicate [Greenberg 1]

<br>


<img src="../pics/delcarative-sentence.png" alt="Sentence structure with subject and predicate labelled">



<br>


A declarative sentence is the most common type of sentence in the English language. It makes a statement, provides a fact, offers an explanation, or conveys information, and it usually ends with a full stop. [Greenberg 29]
<br>


"If a language has inflection, it always has derivation"
(Radev, 2016)
<br>

An inflection is a part of a word that modifies a derived word to mark things like quantity, gender and tense.

![Derivation & inflection examples](attachment:derivationinflection.png)

<br>

Affixes like *-ful, -less, -(t)ion*  are called derivational: they generally change the word
class of the stem such as from noun to adjective, or verb to noun.

For example: "use", "useful", "useless" - have different meanings

"intend", "intended", "intention" - are different parts of speech.

Inflectional affixes like *-s*, *-er* or *-ing*, in turn, change the meaning less: they
add grammatical information such as agreement, comparative degree or aspect.
Not all languages have derivational affixes, but languages with inflection have also been found to have derivation.
It seems more important for languages to use affixes for creating new words than to use them to modify existing words.
(Moravcsik, 2013)

In NLP we will often need to work with similar words that have the same or related roots. They differ in their affixes or inflections, and we need strategies for removing these.

These words have roots which will allow us to compare them.

"running" -> "run"

"deception" -> "deceive"

"works" -> "work"

This is explored in Module 2 - Text Normalisation.
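
As a rough illustration of the idea, the sketch below strips a few common inflectional suffixes using plain Python. The function name `naive_stem` and its short suffix list are assumptions made for this example only; it is not the method used by nltk or spaCy, which Module 2 introduces properly.

```python
def naive_stem(word):
    """Strip a few common inflectional suffixes (illustrative only)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


for word in ["running", "works", "deception"]:
    print(word, "->", naive_stem(word))
```

Note that "deception" is left unchanged: relating it to "deceive" requires knowledge of derivation, which simple suffix stripping cannot provide. A rule set this small will also mishandle many ordinary words.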



<br>

NLP draws on research in Linguistics. We are going to give an overview of techniques that help us quantify otherwise unstructured data. It is only when we can derive structure from the text that we can analyse it appropriately.

<br>

### 1.2 Major Levels of Linguistic Structure 


<br>
We will look at the different ways we can break down the structure of language. Each different structure level can give us insight into the text in different ways.

<br>
<br>


<img src="../pics/NLP-Pyramid.png" alt="NLP pyramid showing the levels of linguistic structure" width=400>


<br>

**Morphology** 
<br>

The way words break down into component parts that carry meaning, such as singular versus plural.
<br>

<img src="../pics/pt2_intro_morph_1.gif" alt="Morphology of the word 'unhappiness'">

<br>

**Syntax** 

<br>

Rules that govern the ways in which words combine to form phrases, clauses, and sentences. 
For example - a sentence includes a subject and a predicate. The subject is a noun phrase and the predicate is a verb phrase.

Noun phrase: The cat, Samantha, She

Verb phrase: arrived, went away, had dinner

In general - a "parser" is a program that builds a data structure from some input data. In computing these are often used for things like reading in files. They can also be used in natural language processing.

A natural language <strong>parser</strong> is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb.

<br>

![A parse tree exemplifying the output of parsers](attachment:parse%20tree.png)
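
To make the idea of a parser more concrete, here is a toy sketch that splits a sentence into a noun phrase (subject) and a verb phrase (predicate). The names `VERBS` and `parse_sentence` are hypothetical, and the hand-written verb list is an assumption that only covers the example phrases above; real parsers rely on grammars or statistical models instead.

```python
# Hypothetical mini-lexicon: the only verbs this toy parser knows about.
VERBS = {"arrived", "went", "had"}


def parse_sentence(sentence):
    """Split a sentence into a subject noun phrase and a predicate verb phrase.

    The split happens at the first known verb, which is a large
    simplification of what a real parser does.
    """
    words = sentence.rstrip(".").split()
    for position, word in enumerate(words):
        if word.lower() in VERBS:
            return {"NP": words[:position], "VP": words[position:]}
    raise ValueError("No known verb found in: " + sentence)


print(parse_sentence("The cat arrived."))
print(parse_sentence("Samantha had dinner."))
```

The output is a small data structure built from the input text, which is exactly what the general definition of a parser describes.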

<br>

**Semantics** 

Meaning conveyed in language. Example below:

> "How much Chinese silk was exported to Western Europe by the end of the 18th century?"

To answer this question, we need to know something about <strong>lexical semantics</strong>, the meaning of individual words such as *export* or *silk*, as well as <strong>compositional semantics</strong>: what exactly constitutes *Western Europe* as opposed to *Eastern* or *Southern Europe*, and what *end* means when combined with *the 18th century*. We also need to know something about the relationship of the words to the syntactic structure. For example, we need to know that *by the end of the 18th century* is a temporal end-point (Jurafsky and Martin, 2019).

<br>

**Pragmatics** 

> *“Use of language in social contexts”* (Nordquist, 2017) 

In practice, sharing information takes many factors into account. How we use natural language depends on:

* structural knowledge of language (sentences, words)
* the context of what is being shared

"I don't have any money" 

What does this mean?

Context of this kind needs to be taken into account when we analyse language.
<br>
 


### 1.3 Challenging tasks in Language Processing
<br>

**Ambiguity**


Most tasks in speech and language processing can be viewed as resolving <strong>ambiguity</strong> at one of these levels.

<strong>"I made her duck"</strong>

* *I cooked waterfowl for her*
* *I cooked waterfowl belonging to her*
* *I created the (plaster?) duck she owns*
* *I caused her to quickly lower her head or body*
* *I waved my magic wand and turned her into undifferentiated waterfowl*

(Jurafsky and Martin, 2019)

Consider the following:

* Time flies like an arrow
* Fruit flies like bananas

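In the above examples "flies" is spelt identically, but it plays a different role in each sentence. In the first it is a verb (time passes quickly); in the second it is part of the noun phrase "fruit flies" (the insects), and "like" changes from a preposition to a verb. The two sentences look alike on the surface yet express very different concepts.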
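
The scale of the problem can be sketched by listing every combination of word classes a program would have to consider. The `LEXICON` dictionary and the `possible_taggings` function below are hypothetical and hand-written for these two sentences only.

```python
from itertools import product

# Hypothetical lexicon listing the possible word classes for each word.
LEXICON = {
    "time": ["noun", "verb"],
    "fruit": ["noun"],
    "flies": ["noun", "verb"],
    "like": ["verb", "preposition"],
    "an": ["determiner"],
    "arrow": ["noun"],
    "bananas": ["noun"],
}


def possible_taggings(sentence):
    """Return every combination of word classes the lexicon allows."""
    words = sentence.lower().split()
    options = [LEXICON[word] for word in words]
    return [list(zip(words, tags)) for tags in product(*options)]


for sentence in ["Time flies like an arrow", "Fruit flies like bananas"]:
    taggings = possible_taggings(sentence)
    print(sentence, "->", len(taggings), "candidate taggings")
```

Even with this tiny lexicon, a five-word sentence produces several candidate readings, and a language processing system has to choose the most plausible one.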

<br>

**Coreference resolution**

"After the American Civil War ended in 1865 there were changes to statehood."  - <strong>"How many states were in the United States that year?"</strong>

What year is that year? In our example it refers to 1865, but it's not clear from the sentence in question alone.

This task of coreference resolution makes use of knowledge about how words like *that* or pronouns like *it* or *she* refer to previous parts of the discourse.
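
A deliberately naive sketch of this idea is shown below: it resolves the phrase "that year" to the most recent four-digit number in the preceding text. The function name `resolve_that_year` and the heuristic itself are assumptions for illustration; real coreference systems use far richer linguistic information.

```python
import re


def resolve_that_year(context, question):
    """Replace "that year" with the most recent year mentioned in the context."""
    years = re.findall(r"\b\d{4}\b", context)
    if not years or "that year" not in question:
        return question
    return question.replace("that year", "in " + years[-1])


context = "After the American Civil War ended in 1865 there were changes to statehood."
question = "How many states were in the United States that year?"
print(resolve_that_year(context, question))
```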

<br>

**Other Challenges**
<br>

<img src="../pics/other_challenges.jpg" alt="Examples of challenges to overcome in NLP">

(Jurafsky and Martin, 2019)
<br>

### 1.4 Real-Life Applications and Challenges
<br>

![Applications of NLP](attachment:intro1.png)

Below are some example applications of NLP with their corresponding challenges.


1. Machine Translation Technologies <br>
   Challenge: preserve the meaning of the sentence from one language to the other  <br>
   <br>
2. Search Engines, e.g. Google <br>
   Challenge: recognize natural language questions, extract the meaning of the question and give an answer <br>
   <br>
3. Text Classification, e.g. Spam Filters <br>
   Challenge: reducing false positives and false negatives, i.e. genuine emails sent to the spam folder and spam emails that reach the inbox<br>
    <br>
4. Sentiment Analysis, e.g. identifying sentiments for a product <br>
   Challenge: understanding sarcasm and ironic comments <br>
    <br>
5. Topic Modelling: method for discovering the abstract topics in a document collection<br>
   Challenge: choosing a robust algorithm, which may mean trading speed for accuracy <br>
    <br>
6. Transcription of speech (turning spoken language into written language)<br>
   Challenge: dealing with looser grammar  <br>
    <br>
7. Question Answering: build systems that automatically answer questions posed by humans in a natural language.<br>
   Challenge: understanding the infinitely varied forms of expression <br>
   <br>
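
The spam filter challenge in item 3 can be made concrete by counting the two kinds of error. The sketch below uses made-up labels, and the function name `count_errors` is hypothetical.

```python
def count_errors(true_labels, predicted_labels, positive="spam"):
    """Count false positives and false negatives for the positive class."""
    false_positives = 0
    false_negatives = 0
    for truth, prediction in zip(true_labels, predicted_labels):
        if prediction == positive and truth != positive:
            false_positives += 1  # genuine email sent to the spam folder
        elif prediction != positive and truth == positive:
            false_negatives += 1  # spam that reached the inbox
    return false_positives, false_negatives


# Made-up labels for five emails.
truth = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "spam", "ham", "ham", "ham"]
print(count_errors(truth, predicted))
```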



**Progress Made**

<img src="../pics/challenges.PNG" alt="Progress made in the field of NLP">

<br>


(Jurafsky and Martin, 2019)

**The task is difficult!  What tools do we need?**

* Knowledge about language
* Knowledge about the world
* A way to combine knowledge sources
* How we generally do this:
   * Probabilistic models built from language data
      * P(“maison” -> “house”) high
      * P(“L’avocat général” -> “the general avocado”) low


(Jurafsky and Martin, 2019)
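
One simple way to build such a probabilistic model is to estimate each probability as a relative frequency from observed data. The sketch below uses a handful of made-up translation pairs; the counts, the variable names and the function `translation_probability` are all assumptions for illustration.

```python
from collections import Counter

# Made-up observations of how a French phrase was translated.
observed_translations = [
    ("maison", "house"),
    ("maison", "house"),
    ("maison", "home"),
    ("l'avocat général", "the advocate general"),
    ("l'avocat général", "the advocate general"),
    ("l'avocat général", "the advocate general"),
    ("l'avocat général", "the general avocado"),
]

pair_counts = Counter(observed_translations)
source_counts = Counter(source for source, _ in observed_translations)


def translation_probability(source, target):
    """Estimate P(target | source) as a relative frequency."""
    return pair_counts[(source, target)] / source_counts[source]


print(translation_probability("maison", "house"))
print(translation_probability("l'avocat général", "the general avocado"))
```

With realistic amounts of data, the implausible translation would be seen far less often and its estimated probability would be correspondingly lower.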

More recent advances in natural language processing such as OpenAI's [GPT-3](https://www.technologyreview.com/2020/07/20/1005454/openai-machine-learning-language-generator-gpt-3-nlp/) and Google's [BERT](https://blog.google/products/search/search-language-understanding-bert/) have made massive strides in these more challenging areas of NLP.
<br>  



### 1.5 How we work  

<br>

**Steps**

1. Have a dataset/corpus 

2. Text preprocessing (Data Cleaning)

3. Exploratory Analysis and Data Transformation

4. Split the dataset (Data Scientists may prefer to do the exploratory analysis after they split the Dataset)

5. Identify the technique that is most suitable for your dataset and for what you hope to get out of it. This will affect the data preprocessing you undertake

6. Explore different features of the model on the validation dataset (Tuning)

7. Test the accuracy and the robustness of your model

8. Communicate your results

9. Make a prediction, if it is possible  

<br>

The steps above follow the CRISP-DM model, a widely used industry standard for running data science projects. Most NLP problems can be tackled with a methodical workflow of this kind. The major steps are depicted in the figure below.


<br>

**Note:** This is an Introduction to Natural Language Processing, so anything after steps 2-3 is beyond the scope of this course. For further exploration see "Intermediate NLP".


![Example steps to take in NLP workflow](attachment:crisp-dm.png)
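
Although splitting the dataset (step 4) is beyond the scope of this course, a minimal sketch may help to show what the step involves. The function name `split_dataset`, the 60/20/20 proportions and the fixed seed are assumptions chosen for illustration, not course recommendations.

```python
import random


def split_dataset(documents, train=0.6, validate=0.2, seed=42):
    """Shuffle documents and split them into train, validate and test sets."""
    shuffled = list(documents)
    # A seeded generator makes the shuffle reproducible.
    random.Random(seed).shuffle(shuffled)
    train_end = round(len(shuffled) * train)
    validate_end = train_end + round(len(shuffled) * validate)
    return (
        shuffled[:train_end],
        shuffled[train_end:validate_end],
        shuffled[validate_end:],
    )


corpus = [f"document {i}" for i in range(10)]
train_set, validate_set, test_set = split_dataset(corpus)
print(len(train_set), len(validate_set), len(test_set))
```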


<br>







### 1.6 spaCy, stanza and nltk packages 
<br>  

<span style="color:blue">nltk</span>, <span style="color:blue">stanza</span> and <span style="color:blue">spaCy</span> are Python packages for NLP, and some data scientists strongly favour one over the others. In this course we will use all three packages. 

#### nltk

* nltk is well suited to teaching and understanding, but it is slow.
* A good general tool covering a comprehensive range of NLP tasks.
* nltk was originally a tool for academic research.
* A high degree of customisation is achievable.

#### stanza

* Developed by the Stanford NLP Group at Stanford University.
* A collection of tools that can be applied to many human languages.
* Touted as having state of the art parsers.

_Not available on ONS machines and possibly some other restricted devices_

#### spaCy

* Considered to be production-oriented.
* spaCy is considered fast and robust.
* Less potential to customise than other NLP libraries.
* Sold as having robust tools for large scale language processing.

The first edition of the book on nltk, published by O'Reilly, is available at http://nltk.org/book_1ed/.

The official website of spaCy is https://spacy.io and the source code on github is available at https://github.com/explosion/spaCy.  
<br>
Modules in this course use functions from all three libraries. 

<br>

**Other Libraries and Tools**

<ul>
<li><span style="color:blue">Stanford Parser</span><br>
The Stanford Parser is a statistical natural language parser from the Stanford Natural Language Processing Group. It is used to parse input data written in several languages such as English, German, Arabic and Chinese.</li>
<li><span style="color:blue">MALLET</span><br>
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.</li>
<li><span style="color:blue">Gensim</span><br>
Gensim was primarily developed for topic modeling. However, it now supports a variety of other NLP tasks such as converting words to vectors (word2vec), documents to vectors (doc2vec), finding text similarity, and text summarization.</li>
</ul>
<br>

**Linguistic Knowledge**
<br>

<ul>
<li><span style="color:blue">WordNet</span><br>
WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser.</li>
<li><span style="color:blue">BabelNet</span><br>
BabelNet is both a multilingual encyclopedic dictionary, with lexicographic and encyclopedic coverage of terms, and a semantic network which connects concepts and named entities in a very large network of semantic relations, made up of about 16 million entries, called Babel synsets. Each Babel synset represents a given meaning and contains all the synonyms which express that meaning in a range of different languages.</li>
</ul>
<br>

#### Exercise

The solutions to the questions below can be found in the `\solutions\` folder.

<br>

<ol>
  <li><p><i>I can say something in a natural language that no one has ever said in the history of the universe.</i></p> True or False ? Give a reason for your answer.</li>
  <li> For the following sentence, label each word with a part of speech, such as <i>noun</i> or <i>verb</i>. You may need to look up some words:
      <p><i>The chef cooks the soup</i></p></li>
  <li><p><i>I feel sick today, I don't want to go to work, what do you think Siri?</i></p> What type of NLP application is this? Why would it be difficult to answer?</li>
  <li><p><i>I went to the bank.</i></p> Why would such a sentence be difficult for a language processing application?
What measures could be taken to overcome the issue?</li>
</ol>




Write your answers in this cell.





### References


Introduction - Natural Language Processing, University of Michigan
https://www.youtube.com/watch?v=n25JjoixM3I
<br>
Introducing Language Typology - Edith A. Moravcsik (2013)
<br>
Speech and Language Processing (3rd ed. draft) - Dan Jurafsky and James H. Martin (2019)
<br>
A Practitioner's Guide to Natural Language Processing (Part I) — Processing & Understanding Text
Proven and tested hands-on strategies to tackle NLP tasks - Dipanjan (DJ) Sarkar (2018)
https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

<br>