# NLP in 10 lines tutorial notebook
##### PyCon 2016 

# [spaCy overview](http://spacy.io/docs/#examples)

## Load spaCy resources

In [None]:
# Import spacy and English models


Loading spaCy can take a while. In the meantime, here are a few definitions to help you on your NLP journey.

#### What are Stop Words?

Stop words are the common words in a vocabulary which are of little value when considering word frequencies in text. This is because they don't provide much useful information about what the sentence is telling the reader.

Example: _"the","and","a","are","is"_
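
As a rough illustration of why stop words are usually dropped before counting, the sketch below filters them out of a word-frequency count. The stop list and the sample sentence are made up for this example, and the punctuation stripping is deliberately naive; it is not how spaCy handles stop words.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list; real lists contain hundreds of words.
STOP_WORDS = {"the", "and", "a", "are", "is"}

def content_word_counts(text):
    """Count lower-cased words in `text`, skipping stop words."""
    # Naive clean-up: strip common punctuation from both ends of each word.
    words = [w.strip('.,!?"').lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOP_WORDS)

counts = content_word_counts("The cat and the dog are friends. The dog is happy.")
print(counts.most_common(3))
```

Without the filter, "the" would be the most frequent word even though it says nothing about the topic of the text.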

#### What is a Corpus?

A corpus (plural: corpora) is a large collection of text or documents and can provide useful training data for NLP models. A corpus might be built from transcribed speech or a collection of manuscripts. Each item in a corpus is not necessarily unique and frequency counts of words can assist in uncovering the structure in a corpus.

Examples:

1. Every word written in the complete works of Shakespeare
2. Every word spoken on BBC Radio channels for the past 30 years 

## Process text

In [None]:
# Process sentences 'Hello, world. Natural Language Processing in 10 lines of code.' using spaCy


## Get tokens and sentences

#### What is a Token?
A token is a single chopped up element of the sentence, which could be a word or a group of words to analyse. The task of chopping the sentence up is called "tokenisation".

Example: The following sentence can be tokenised by splitting up the sentence into individual words.

	"Cytora is going to PyCon!"
	["Cytora","is","going","to","PyCon!"]
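
Splitting on whitespace, as in the example above, leaves punctuation attached to the neighbouring word. Tokenisers such as spaCy's normally emit punctuation as separate tokens. The sketch below contrasts the two behaviours with a small regular expression; `simple_tokenise` is a hypothetical helper for illustration and far simpler than a real tokeniser.

```python
import re

def simple_tokenise(sentence):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sentence = "Cytora is going to PyCon!"
print(sentence.split())           # whitespace split keeps "PyCon!" together
print(simple_tokenise(sentence))  # regex split separates the "!"
```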

In [None]:
# Get first token of the processed document

# Print sentences (one sentence per line)

## Part of speech tags

#### What is a Part-of-Speech Tag?
A part-of-speech (POS) tag labels the grammatical category of a word, such as noun, verb or adjective. The tag is context sensitive: the same word can receive a different tag depending on its role in the sentence.
More information about the kinds of part-of-speech tags which are used in NLP can be [found here](http://www.winwaed.com/blog/2011/11/08/part-of-speech-tags/).

Examples:

1. NUM, Numeral - 1, 2, 3
2. PROPN, Proper Noun - "Matic", "Andraz", "Cardiff"
3. INTJ, Interjection - "Uhhhhhhhhhhh"

In [None]:
# For each token, print corresponding part of speech tag


## Visual part of speech tagging ([displaCy](https://displacy.spacy.io))

## Syntactic dependencies

#### What are syntactic dependencies?

We have the part-of-speech tags and we have all of the tokens in a sentence, but how do we relate the two to uncover the syntax of a sentence? Syntactic dependencies describe how the words in a sentence relate to each other. This is important in NLP for extracting structure and understanding grammar in plain text.

Example:

<img src="images/syntax-dependencies-oliver.png" align="left" width=500>
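
A dependency parse can be stored as one head pointer per word, with the root pointing at itself (the convention spaCy also uses for `token.head`). The sketch below walks from a word up to the root using a hand-written parse of a short sentence. The `heads` list is an assumption written by hand for illustration, not the output of a parser.

```python
# Hand-written parse of "Cytora is going to PyCon": heads[i] is the index of
# the head of words[i]; the root ("going") points at itself.
words = ["Cytora", "is", "going", "to", "PyCon"]
heads = [2, 2, 2, 2, 3]

def path_to_root(index):
    """Collect word indices from `index` up to and including the root."""
    path = [index]
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path

for i, word in enumerate(words):
    print(word, "-->", [words[j] for j in path_to_root(i)])
```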

In [None]:
# Write a function that walks up the syntactic tree of the given token and collects all tokens to the root token (including root token).

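def tokens_to_root(token):
    """
    Walk up the syntactic tree, collecting tokens to the root of the given `token`.
    :param token: spaCy token
    :return: list of spaCy tokens
    """
    tokens_to_r = []
    # The root token is its own head, so stop once a token points at itself.
    while token.head.i != token.i:
        tokens_to_r.append(token)
        token = token.head
    tokens_to_r.append(token)

    return tokens_to_r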

# For every token in the document, print its tokens to the root
for token in doc:
    print('{} --> {}'.format(token, tokens_to_root(token)))

# Print dependency labels of the tokens
for token in doc:
    print('-> '.join(['{}-{}'.format(dependent_token, dependent_token.dep_) for dependent_token in tokens_to_root(token)]))


## Named entities

#### What are Named Entities?

A named entity is any real world object such as a person, location, organisation or product with a proper name. 

Example:

	1. Barack Obama
	2. Edinburgh
	3. Ferrari Enzo

In [None]:
# Print all named entities with named entity types

doc_2 = nlp("I went to Paris where I met my old friend Jack from uni.")

## Noun chunks

#### What is a Noun Chunk?
A noun chunk is a phrase built around a noun, together with the words that describe it, recovered from tokenized text using the part-of-speech tags and syntactic dependencies.

Example:

The sentence "The boy saw the yellow dog" contains two nouns, "boy" and "dog".
Therefore the noun chunks will be

	1. "The boy"
	2. "the yellow dog"
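
To make the idea concrete, the sketch below recovers these two chunks from hand-tagged tokens by grouping determiners and adjectives with the noun that follows them. The tags are written by hand and `noun_chunks` is a hypothetical, much simplified stand-in; spaCy's own noun chunker works from the dependency parse rather than from a tag pattern like this.

```python
# Hand-tagged tokens (an assumption for illustration, not tagger output).
tagged = [("The", "DET"), ("boy", "NOUN"), ("saw", "VERB"),
          ("the", "DET"), ("yellow", "ADJ"), ("dog", "NOUN")]

def noun_chunks(tagged):
    """Group runs of DET/ADJ/NOUN tokens that end in a NOUN."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in ("DET", "ADJ", "NOUN"):
            # A determiner or adjective right after a noun starts a new chunk.
            if tag != "NOUN" and current and current[-1][1] == "NOUN":
                chunks.append(current)
                current = []
            current.append((word, tag))
        else:
            # Any other tag ends the run; keep it only if it ended in a noun.
            if current and current[-1][1] == "NOUN":
                chunks.append(current)
            current = []
    if current and current[-1][1] == "NOUN":
        chunks.append(current)
    return [" ".join(word for word, _ in chunk) for chunk in chunks]

print(noun_chunks(tagged))
```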

In [None]:
# Print noun chunks for doc_2


## Unigram probabilities
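
A unigram probability is the relative frequency of a single word in a corpus, and it is usually reported as a logarithm because the raw values are very small. The sketch below estimates it from a made-up toy corpus. The corpus and the `log_prob` helper are assumptions for illustration; a library would typically also smooth the counts so that unseen words receive a finite value.

```python
import math
from collections import Counter

# Hypothetical toy corpus; real estimates come from very large corpora.
corpus = "i went to paris . i met my friend in paris .".split()
counts = Counter(corpus)
total = sum(counts.values())

def log_prob(word):
    """Natural-log relative frequency of `word`; -inf if it was never seen."""
    count = counts[word]
    return math.log(count / total) if count else float("-inf")

for word in ["i", "paris", "friend", "hippo"]:
    print(word, round(log_prob(word), 3))
```

Frequent words score closer to zero, rare words are strongly negative, and a word that never occurs in the corpus gets negative infinity in this unsmoothed version.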

In [None]:
# For every token in doc_2, print log-probability of the word, estimated from counts from a large corpus 


## Word embedding / Similarity

#### What are Word embeddings?

A word embedding represents a word as a vector of numbers, so that a whole vocabulary maps into one shared vector space. This lets words be treated numerically, with word similarity measured by how close their vectors are (for example by distance or by the angle between them).

Example:

With word embeddings, vector arithmetic can capture relationships between words. Differences between related pairs of words tend to point in a similar direction, which gives approximate relations such as:

	king - queen ≈ man - woman
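
The relation above can be checked numerically with cosine similarity, which compares the direction of two vectors. The three-dimensional vectors below are invented for this sketch and were chosen so that the two offsets line up; trained embeddings have many more dimensions and only satisfy such relations approximately.

```python
import math

# Hypothetical 3-dimensional vectors, made up for illustration.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
}

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

royal_offset = subtract(vectors["king"], vectors["queen"])   # king - queen
gender_offset = subtract(vectors["man"], vectors["woman"])   # man - woman

# A value close to 1 means the two offsets point in nearly the same direction.
print(round(cosine(royal_offset, gender_offset), 3))
print(round(cosine(vectors["king"], vectors["queen"]), 3))
```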

In [None]:
# For a given document, calculate similarity between 'apples' and 'oranges' and 'boots' and 'hippos'
doc = nlp("Apples and oranges are similar. Boots and hippos aren't.")
apples = doc[0]
oranges = doc[2]
boots = doc[6]
hippos = doc[8]

# Print similarity between sentence and word 'fruit'
apples_sent, boots_sent = doc.sents
fruit = doc.vocab['fruit']

In [None]:
# Matplotlib Jupyter HACK
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

# Real text analysis

Now that we are familiar with spaCy, in the next section we are going to analyse a real text (Pride & Prejudice).

We would like to:
* Extract all the actors from the book (e.g. Elizabeth, Darcy, Bingley)
* Visualize actors' occurrences relative to their position in the book
* Automatically describe any actor from the book
* Find out which characters have been mentioned in the context of marriage
* Build a keyword extractor that could be used to display a word cloud ([example](http://www.cytora.com/data-samples.html))

## Load text file

In [None]:
def read_file(file_name):
    with open(file_name, 'r') as file:
        return file.read()

## Process full text

In [None]:
# Process `text` with Spacy NLP Parser
text = read_file('data/pride_and_prejudice.txt')
processed_text = nlp(text)

In [None]:
# How many sentences are in the Pride & Prejudice novel?

# Print sentences from index 10 to index 15, to make sure that we have correctly parsed the book

## Find all the personal names

In [None]:
# Extract all the personal names from Pride & Prejudice and count their occurrence. 
# Expected output is a list in the following form: [('elizabeth', 622), ('darcy', 312), ('jane', 286), ('bennet', 266) ...].

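from collections import Counter

def find_actor_occurences(doc):
    """
    Return a list of actors from `doc` with corresponding occurrences.
    
    :param doc: spaCy NLP parsed document
    :return: list of tuples in form
        [('elizabeth', 622), ('darcy', 312), ('jane', 286), ('bennet', 266)]
    """
    # One possible solution: count the lemma of every entity labelled PERSON.
    actors = Counter()
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            actors[ent.lemma_.lower()] += 1
    return actors.most_common()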

print(find_actor_occurences(processed_text)[:20])

## Plot actors' personal names as a time series
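
The time series in this section is essentially a histogram: the book is cut into equal slices and the mentions of an actor are counted per slice. The sketch below shows that counting step on made-up offsets. `mentions_per_bin` is a hypothetical helper that bins over the fixed range 0 to 1, whereas `plt.hist` chooses its bin edges from the range of the data, so the numbers can differ slightly from the plot.

```python
NUM_BINS = 10  # same number of bins as the plotting cell uses

def mentions_per_bin(offsets, doc_length, num_bins=NUM_BINS):
    """Count how many offsets fall into each equal-width slice of the document."""
    counts = [0] * num_bins
    for offset in offsets:
        relative = offset / float(doc_length)  # relative position in the book
        # Clamp so an offset at the very end lands in the last bin.
        counts[min(int(relative * num_bins), num_bins - 1)] += 1
    return counts

# Hypothetical token offsets in a 1000-token document.
print(mentions_per_bin([5, 120, 130, 560, 990], doc_length=1000))
```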

In [None]:
# Plot actor mentions as a time series relative to the position of the actor's occurrence in a book.

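from collections import defaultdict

def get_actors_offsets(doc):
    """
    For every actor in `doc`, collect the offsets of all occurrences into a list.
    The function returns a dictionary that has the actor lemma as a key and a list of occurrence offsets as a value for every actor.
    
    :param doc: spaCy NLP parsed document
    :return: dict object in form
        {'elizabeth': [123, 543, 4534], 'darcy': [205, 2111]}
    """
    # One possible solution: record the start token index of every PERSON entity.
    actor_offsets = defaultdict(list)
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            actor_offsets[ent.lemma_.lower()].append(ent.start)
    return dict(actor_offsets)


actors_occurences = get_actors_offsets(processed_text)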

In [None]:
from matplotlib.pyplot import hist
from cycler import cycler

NUM_BINS = 10

def normalize(occurencies, normalization_constant):
    # Scale each offset by the supplied constant rather than a global document length.
    return [o / float(normalization_constant) for o in occurencies]

def plot_actor_timeseries(actor_offsets, actor_labels, normalization_constant=None):
    """
    Plot actors' personal names specified in `actor_labels` list as time series.
    
    :param actor_offsets: dict object in form {'elizabeth': [123, 543, 4534], 'darcy': [205, 2111]}
    :param actor_labels: list of strings that should match some of the keys in `actor_offsets`
    :param normalization_constant: int
    """
    x = [actor_offsets[actor_label] for actor_label in actor_labels] 
    
    if normalization_constant:
        x = [normalize(actor_offset, normalization_constant) for actor_offset in x]
        

    with plt.style.context('fivethirtyeight'):
        plt.figure()
        n, bins, patches = plt.hist(x, NUM_BINS, label=actor_labels)
        plt.clf()
        
        ax = plt.subplot(111)
        # Set the colour cycle on the axes before plotting so that it takes effect.
        ax.set_prop_cycle(cycler(color=['r','k','c','b','y','m','g','#54a1FF']))
        for i, a in enumerate(n):
            ax.plot([b / float(NUM_BINS - 1) for b in range(len(a))], a, label=actor_labels[i])
            
        ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

plot_actor_timeseries(actors_occurences, ['darcy', 'bingley'], normalization_constant=len(processed_text))

## spaCy parse tree in action

In [None]:
# Find words (adjectives) that describe Mr Darcy.

def get_actor_adjectives(doc, actor_lemma):
    """
    Find all the adjectives related to `actor_lemma` in `doc`
    
    :param doc: Spacy NLP parsed document
    :param actor_lemma: string object
    :return: list of adjectives related to `actor_lemma`
    """
    

print(get_actor_adjectives(processed_text, 'darcy'))

In [None]:
# Find actors that are 'talking', 'saying', 'doing' the most. Find the relationship between 
# entities and corresponding root verbs.

# Find all the actors that got married in the book
#
# Here is an example sentence from which this information could be extracted:
# 
# "her mother was talking to that one person (Lady Lucas) freely,
# openly, and of nothing else but her expectation that Jane would soon
# be married to Mr. Bingley."
#


## Extract Keywords

In [None]:
# Extract keywords using noun chunks from the news article (file 'article.txt').
# spaCy will pick some noun chunks that are not informative at all (e.g. we, what, who).
# Try to find a way to remove uninformative keywords.

article = read_file('data/article.txt')
doc = nlp(article)


## Open task on the RAND Terrorism Dataset

This is an open challenge to apply what you have learnt analysing Pride and Prejudice with spaCy to a dataset of real events. We have preprocessed the [RAND Terrorism Dataset](http://www.rand.org/nsrd/projects/terrorism-incidents.html) for this task, reducing the data to 10033 articles from 1968 to 2009.

Can you find out the following using the code you have written?
- Who are the terrorist groups and other persons mentioned in each article?
- What locations are mentioned in each article? Hint: a location just has a different label to a person
- From all of your entities, can you find out which named entities are terrorists from the syntactic relationships?
- With all of this information, can you plot a figure expressing the relationships between locations and terrorists?

There are no right answers to any of these questions, and there might not even be an answer at all.
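
One possible starting point for the last question is to count how often a group and a location are mentioned in the same article, and arrange the counts as a matrix. The sketch below does this with placeholder entities. The `articles` structure and all names in it are assumptions for illustration, standing in for whatever your named entity extraction returns.

```python
from collections import Counter

# Hypothetical per-article entity lists; the names are placeholders, not real data.
articles = [
    {"groups": ["Group A"], "locations": ["Country X"]},
    {"groups": ["Group A", "Group B"], "locations": ["Country Y"]},
    {"groups": ["Group A"], "locations": ["Country X", "Country Y"]},
]

# Count how often each (group, location) pair appears in the same article.
pair_counts = Counter(
    (group, location)
    for article in articles
    for group in article["groups"]
    for location in article["locations"]
)

# Arrange the counts as a matrix: one row per group, one column per location.
groups = sorted({group for group, _ in pair_counts})
locations = sorted({location for _, location in pair_counts})
matrix = [[pair_counts[(g, l)] for l in locations] for g in groups]

for group, row in zip(groups, matrix):
    print(group, row)
```

A matrix like this could then be wrapped in a Pandas `DataFrame` (with `groups` as the index and `locations` as the columns) and drawn with Seaborn's `heatmap`.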

In [None]:
# To get you started we can import Pandas and Seaborn which might help you
# build a graph or visualisation of the data

import pandas as pd
import seaborn as sns

terrorism_file = read_file('data/rand-terrorism-dataset.txt')
terrorism_articles = nlp(terrorism_file)

#### Example output using Pandas and Seaborn

![Heatmap of terrorist group and country](images/example_output.png)