# NLP in 10 lines tutorial notebook
##### PyCon 2016 

# [spaCy overview](http://spacy.io/docs/#examples)

## Load spaCy resources

In [None]:
# Import spacy and English models


Loading spaCy can take a while. In the meantime, here are a few definitions to help you on your NLP journey.

#### What are Stop Words?

Stop words are the very common words in a vocabulary that are of little value when considering word frequencies in text, because they say little about what a sentence is actually telling the reader.

Example: _"the","and","a","are","is"_
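As a rough illustration of why stop words are usually dropped before counting, here is a minimal plain-Python sketch. The stop-word set simply reuses the five example words above, and `content_word_counts` and the sample sentence are made up for this illustration; spaCy ships its own, much longer list.

```python
from collections import Counter

# Illustrative stop-word set; real lists are much longer.
STOP_WORDS = {"the", "and", "a", "are", "is"}

def content_word_counts(text):
    """Count lower-cased, whitespace-separated words, skipping stop words."""
    words = text.lower().split()
    return Counter(word for word in words if word not in STOP_WORDS)

sample = "the cat and the dog are a cat and a dog"
counts = content_word_counts(sample)
print(counts.most_common())
```

Without the filter, "the", "and" and "a" would be as frequent as the only two words that say anything about the sentence.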

#### What is a Corpus?

A corpus (plural: corpora) is a large collection of text or documents and can provide useful training data for NLP models. A corpus might be built from transcribed speech or a collection of manuscripts. Each item in a corpus is not necessarily unique and frequency counts of words can assist in uncovering the structure in a corpus.

Examples:

1. Every word written in the complete works of Shakespeare
2. Every word spoken on BBC Radio channels for the past 30 years 
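To make the idea of frequency counts concrete, the sketch below counts words over a tiny stand-in corpus of three lines and turns a count into a log relative frequency. The corpus, `word_counts` and `log_probability` are assumptions of this example, and the estimate is unsmoothed, so it only works for words that occur in the corpus.

```python
import math
from collections import Counter

# A tiny stand-in corpus; a real one would hold far more text.
corpus = [
    "to be or not to be",
    "all the world is a stage",
    "the rest is silence",
]

# Accumulate word counts across every document in the corpus.
word_counts = Counter()
for document in corpus:
    word_counts.update(document.lower().split())

total_words = sum(word_counts.values())

def log_probability(word):
    """Log of the word's relative frequency (raises ValueError for unseen words)."""
    return math.log(word_counts[word] / total_words)

for word, count in word_counts.most_common(4):
    print(word, count, round(log_probability(word), 3))
```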

## Process text

In [None]:
# Process sentences 'Hello, world. Natural Language Processing in 10 lines of code.' using spaCy


## Get tokens and sentences

#### What is a Token?
A token is a single chopped-up element of a sentence, such as a word, a punctuation mark or occasionally a group of words, that is analysed as one unit. The task of chopping the sentence up is called "tokenisation".

Example: The following sentence can be tokenised by splitting up the sentence into individual words.

	"Cytora is going to PyCon!"
	["Cytora","is","going","to","PyCon!"]
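Splitting on whitespace gives the list above, with the exclamation mark still attached to "PyCon". The sketch below is a slightly more careful, regex-based tokeniser that separates punctuation into its own tokens. It is only an illustration (`simple_tokenise` is a made-up helper) and is far simpler than spaCy's tokeniser.

```python
import re

def simple_tokenise(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = simple_tokenise("Cytora is going to PyCon!")
print(tokens)
```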

In [None]:
# Get first token of the processed document

# Print sentences (one sentence per line)

## Part of speech tags

#### What is a Part-of-Speech Tag?
A part-of-speech (POS) tag is a context-sensitive label describing the grammatical role that a word plays in its sentence, such as noun, verb or adjective.
More information about the kinds of part-of-speech tags which are used in NLP can be [found here](http://www.winwaed.com/blog/2011/11/08/part-of-speech-tags/).

Examples:

1. NUM, Numeral - 1, 2, 3
2. PROPN, Proper Noun - "Matic", "Andraz", "Cardiff"
3. INTJ, Interjection - "Uhhhhhhhhhhh"

In [None]:
# For each token, print corresponding part of speech tag


## Visual part of speech tagging ([displaCy](https://displacy.spacy.io))

## Syntactic dependencies

#### What are syntactic dependencies?
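As a rough working description, a dependency parse links every token to a head token through a labelled relation (subject, object and so on), and following the heads upwards always ends at the root of the sentence. The sketch below walks that path on a hand-built toy tree. `ToyToken`, its fields and the labels are invented for illustration and are not spaCy objects; marking the root with `head=None` is a choice of this sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToyToken:
    text: str
    dep: str                            # label of the relation to the head
    head: Optional["ToyToken"] = None   # None marks the root in this sketch

def path_to_root(token):
    """Collect the heads above `token`, ending with the root token."""
    path = []
    while token.head is not None:
        token = token.head
        path.append(token)
    return path

# Hand-built tree for "Cytora is going to PyCon"
going = ToyToken("going", "ROOT")
cytora = ToyToken("Cytora", "nsubj", going)
is_ = ToyToken("is", "aux", going)
to = ToyToken("to", "prep", going)
pycon = ToyToken("PyCon", "pobj", to)

for tok in (cytora, is_, going, to, pycon):
    print(tok.text, tok.dep, [head.text for head in path_to_root(tok)])
```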

In [None]:
# Write a function that walks up the syntactic tree from the given token and collects all tokens up to the root token (including the root token).

# For every token in the document, print its tokens up to the root
    
# Print dependency labels of the tokens


## Named entities

#### What is a Named Entity?

A named entity is any real-world object with a proper name, such as a person, location, organisation or product.

Example:

	1. Barack Obama
	2. Edinburgh
	3. Ferrari Enzo

In [None]:
# Print all named entities with named entity types

doc_2 = nlp("I went to Paris where I met my old friend Jack from uni.")

## Noun chunks

#### What is a Noun Chunk?
Noun chunks are phrases built around a noun, that is, the noun together with the words describing it. They are recovered from tokenised text using the part-of-speech tags.

Example:

The sentence "The boy saw the yellow dog" contains two nouns, "boy" and "dog".
Therefore the noun chunks will be:

	1. "The boy"
	2. "the yellow dog"
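The example above can be reproduced with a deliberately simplistic chunker that groups runs of determiners, adjectives and nouns in hand-tagged text. This is a toy sketch: the tags are typed in by hand, `toy_noun_chunks` is a made-up helper, and it should not be read as the way spaCy derives its noun chunks.

```python
# Tags that may appear inside a noun phrase, and tags that count as a noun.
NOUN_PHRASE_TAGS = {"DET", "ADJ", "NOUN", "PROPN"}
NOUN_TAGS = {"NOUN", "PROPN"}

def toy_noun_chunks(tagged_tokens):
    """Group consecutive DET/ADJ/NOUN tokens into chunks that contain a noun."""
    chunks, current = [], []
    # The sentinel at the end flushes a chunk that finishes the sentence.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag in NOUN_PHRASE_TAGS:
            current.append((word, tag))
            continue
        if any(t in NOUN_TAGS for _, t in current):
            chunks.append(" ".join(w for w, _ in current))
        current = []
    return chunks

sentence = [
    ("The", "DET"), ("boy", "NOUN"), ("saw", "VERB"),
    ("the", "DET"), ("yellow", "ADJ"), ("dog", "NOUN"),
]
print(toy_noun_chunks(sentence))
```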

In [None]:
# Print noun chunks for doc_2


## Word probabilities

In [None]:
# For every token in doc_2, print log-probability of the word, estimated from counts from a large corpus 


## Word embedding / Similarity

#### What are Word embeddings?

A word embedding is a representation of a word, and by extension of the vocabulary of a whole corpus, as a vector of numbers. This allows words to be treated numerically, with word similarity represented by how close the vectors are to each other in the embedding space.

Example:
	
With word embeddings, vector arithmetic can capture relationships between words. This means that relations such as the following hold approximately:

	king - queen ≈ man - woman
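The sketch below shows the arithmetic on hand-made three-dimensional vectors whose axes loosely stand for royalty, gender and "is a person". The numbers are invented so that the relation holds exactly; real embeddings have hundreds of dimensions and the relation only holds approximately. `cosine_similarity` and `difference` are helper names chosen for this example.

```python
import math

# Toy vectors: [royalty, gender, person]. Values are made up for illustration.
king = [1.0, 1.0, 1.0]
queen = [1.0, -1.0, 1.0]
man = [0.0, 1.0, 1.0]
woman = [0.0, -1.0, 1.0]

def difference(u, v):
    """Element-wise u - v."""
    return [a - b for a, b in zip(u, v)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(difference(king, queen), difference(man, woman))
print(cosine_similarity(king, man), cosine_similarity(king, woman))
```

In this toy space "king" is closer to "man" than to "woman", and the two difference vectors are identical because both pairs differ only along the gender axis.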

In [None]:
# For a given document, calculate the similarity between 'apples' and 'oranges', and between 'boots' and 'hippos'
doc = nlp("Apples and oranges are similar. Boots and hippos aren't.")
apples = doc[0]
oranges = doc[2]
boots = doc[6]
hippos = doc[8]

# Print the similarity between each sentence and the word 'fruit'
apples_sent, boots_sent = doc.sents
fruit = doc.vocab['fruit']

In [None]:
# Matplotlib Jupyter HACK
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

# Real text analysis

## Load text file

In [None]:
def read_file(file_name):
    with open(file_name, 'r') as file:
        return file.read()

## Process full text

In [None]:
# Process `text` with the spaCy NLP parser
text = read_file('data/pride_and_prejudice.txt')
processed_text = nlp(text)

In [None]:
# How many sentences are there in the book Pride & Prejudice?

# Print sentences from index 10 to index 15, to make sure that we have parsed the correct book

## Find all the personal names

In [None]:
# Extract all the personal names from Pride & Prejudice and count their occurrences.
# Expected output is a list in the following form: [('elizabeth', 622), ('darcy', 312), ('jane', 286), ('bennet', 266) ...].
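One possible shape for the counting step, shown on hypothetical data rather than the real book: the `(text, label)` pairs below are typed in by hand, with label strings that mimic entity types such as `PERSON`, and the counts have nothing to do with the expected output quoted above.

```python
from collections import Counter

# Hypothetical (text, label) pairs standing in for recognised entities.
entities = [
    ("Elizabeth", "PERSON"),
    ("Darcy", "PERSON"),
    ("London", "GPE"),
    ("elizabeth", "PERSON"),
    ("Jane", "PERSON"),
    ("Elizabeth", "PERSON"),
]

# Keep only personal names, lower-cased so that spellings are merged.
name_counts = Counter(
    text.lower() for text, label in entities if label == "PERSON"
)
print(name_counts.most_common())
```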


## Plot actors' personal names as a time series

In [None]:
# Plot actor mentions as a time series relative to the position of the actor's occurrence in the book.
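Before plotting, each mention has to be turned into a position within the book. The sketch below does that for a toy token list; `mention_positions` and the tokens are assumptions of this example. The resulting list could then be handed to a histogram plot such as `plt.hist`.

```python
def mention_positions(tokens, name):
    """Relative positions (0.0 = start of the text) at which `name` occurs."""
    name = name.lower()
    return [
        index / len(tokens)
        for index, token in enumerate(tokens)
        if token.lower() == name
    ]

# Toy token list, not taken from the book.
toy_tokens = ["Jane", "smiled", "at", "Darcy", "while", "Darcy", "watched", "Jane"]
darcy_positions = mention_positions(toy_tokens, "Darcy")
print(darcy_positions)
```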


## spaCy parse tree in action

In [None]:
# Find words (adjectives) that describe Mr Darcy.

In [None]:
# Find actors that are 'talking', 'saying', 'doing' the most. Find the relationship between 
# entities and corresponding root verbs.

# Find all the actors that got married in the book.
# An example sentence from which this information could be extracted:
# 
# her mother was talking to that one person (Lady Lucas) freely,
# openly, and of nothing else but her expectation that Jane would soon
# be married to Mr. Bingley.
#


## Extract Keywords

In [None]:
# Extract keywords using noun chunks from the news article (file 'article.txt').
# spaCy will pick some noun chunks that are not informative at all (e.g. we, what, who).
# Try to find a way to remove non-informative keywords.

article = read_file('data/article.txt')
doc = nlp(article)
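One simple way to drop non-informative chunks is to strip a small list of pronouns and determiners and discard whatever ends up empty. The sketch below does this on hand-written chunk strings; the filter list, `informative_keywords` and the toy chunks are all assumptions of this example, not output from the article.

```python
from collections import Counter

# Hand-picked filter list; an assumption of this sketch, extend as needed.
NON_INFORMATIVE = {"we", "what", "who", "it", "they", "the", "a", "an"}

def informative_keywords(chunks):
    """Count chunks after removing filter words; drop chunks left empty."""
    counts = Counter()
    for chunk in chunks:
        words = [w for w in chunk.lower().split() if w not in NON_INFORMATIVE]
        if words:
            counts[" ".join(words)] += 1
    return counts

# Toy chunk strings standing in for noun chunks.
toy_chunks = ["We", "the new policy", "what", "The new policy", "a report", "who"]
keywords = informative_keywords(toy_chunks)
print(keywords.most_common())
```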
