# [spaCy](http://spacy.io/docs/#examples) introduction

## Load spaCy resources

In [0]:
# Import spacy and English models
import spacy

nlp = spacy.load('en_core_web_lg')

In [18]:
import spacy.cli
spacy.cli.download("en_core_web_lg")   #large
spacy.cli.download("en_core_web_md")  #medium
spacy.cli.download("en_core_web_sm")  #small


[93m    Linking successful[0m
    /usr/local/lib/python2.7/dist-packages/en_core_web_lg -->
    /usr/local/lib/python2.7/dist-packages/spacy/data/en_core_web_lg

    You can now load the model via spacy.load('en_core_web_lg')


[93m    Linking successful[0m
    /usr/local/lib/python2.7/dist-packages/en_core_web_md -->
    /usr/local/lib/python2.7/dist-packages/spacy/data/en_core_web_md

    You can now load the model via spacy.load('en_core_web_md')


[93m    Linking successful[0m
    /usr/local/lib/python2.7/dist-packages/en_core_web_sm -->
    /usr/local/lib/python2.7/dist-packages/spacy/data/en_core_web_sm

    You can now load the model via spacy.load('en_core_web_sm')



In [3]:
# pip install spacy
# python -m spacy download en_core_web_sm

import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_lg')
nlp2 = spacy.load('en_core_web_sm')
nlp3 = spacy.load('en_core_web_md')

# Process whole documents
text = (u""" 1001A, B wing, 10th Floor, The Capital, Bandra-Kurla Complex, Bandra (East), Mumbai - 400 051
CIN: U74990MH2008NPL189067
NPCI/2018-19/UPI OC NO/061 October 22, 2018

To,
All Members of UPI (Unified Payment Interface)
Dear Sir / Madam,
Subject: UPI transaction frequency limit revised to 10 transactions for P2P w.e.f
October 21, 2018
To encourage genuine transactions in the UPI ecosystem and bring in rationality, the UPI
transaction frequency limit has been revised to the following:
 10 transactions per bank account for P2P segment (in a span of 24 hours, where
timestamp of 1st transaction is considered as start time).
 This limit is inclusive only for P2P transactions originating from a unique bank
account.
Requisite changes have been implemented on UPI Fraud & Risk management system and is
in effect from October 21, 2018
Member Banks are hereby advised to make note of this change and do the needful /
communicate accordingly to their partners and stakeholders.
Yours faithfully,
Bharat Panchal
SVP & Head – Risk Management""")
doc = nlp(text)
print(doc)
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

# Determine semantic similarities
doc1 = nlp(u"What happens if I enter wrong UPI-PIN during a transaction?")
doc2 = nlp(u"i gave wrong passcode in UPI, what will happen?")
similarity = doc1.similarity(doc2)
print(doc1.text, doc2.text, similarity)


print('==============================================================================')

doc = nlp2(text)
print(doc)
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

# Determine semantic similarities
doc1 = nlp2(u"What happens if I enter wrong UPI-PIN during a transaction?")
doc2 = nlp2(u"i gave wrong passcode in UPI, what will happen?")
similarity = doc1.similarity(doc2)
print(doc1.text, doc2.text, similarity)


print('==============================================================================')

doc = nlp3(text)
print(doc)
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

# Determine semantic similarities
doc1 = nlp3(u"What happens if I enter wrong UPI-PIN during a transaction?")
doc2 = nlp3(u"i gave wrong passcode in UPI, what will happen?")
similarity = doc1.similarity(doc2)
print(doc1.text, doc2.text, similarity)



 1001A, B wing, 10th Floor, The Capital, Bandra-Kurla Complex, Bandra (East), Mumbai - 400 051
CIN: U74990MH2008NPL189067
NPCI/2018-19/UPI OC NO/061 October 22, 2018

To,
All Members of UPI (Unified Payment Interface)
Dear Sir / Madam,
Subject: UPI transaction frequency limit revised to 10 transactions for P2P w.e.f
October 21, 2018
To encourage genuine transactions in the UPI ecosystem and bring in rationality, the UPI
transaction frequency limit has been revised to the following:
 10 transactions per bank account for P2P segment (in a span of 24 hours, where
timestamp of 1st transaction is considered as start time).
 This limit is inclusive only for P2P transactions originating from a unique bank
account.
Requisite changes have been implemented on UPI Fraud & Risk management system and is
in effect from October 21, 2018
Member Banks are hereby advised to make note of this change and do the needful /
communicate accordingly to their partners and stakeholders.
Yours faithfully,
Bhara

Loading spaCy can take a while, in the meantime here are a few definitions to help you on your NLP journey.

#### What are Stop Words?

Stop words are the common words in a vocabulary which are of little value when considering word frequencies in text. This is because they don't provide much useful information about what the sentence is telling the reader.

Example: _"the","and","a","are","is"_

#### What is a Corpus?

A corpus (plural: corpora) is a large collection of text or documents and can provide useful training data for NLP models. A corpus might be built from transcribed speech or a collection of manuscripts. Each item in a corpus is not necessarily unique and frequency counts of words can assist in uncovering the structure in a corpus.

Examples:

1. Every word written in the complete works of Shakespeare
2. Every word spoken on BBC Radio channels for the past 30 years 

## Process text

In [0]:
# Process sentences 'Hello, world. Natural Language Processing in 10 lines of code.' using spaCy
doc = nlp(u'Hello, world. Natural Language Processing in 10 lines of code.')

## Get tokens and sentences

#### What is a Token?
A token is a single chopped up element of the sentence, which could be a word or a group of words to analyse. The task of chopping the sentence up is called "tokenisation".

Example: The following sentence can be tokenised by splitting up the sentence into individual words.

	"Cytora is going to PyCon!"
	["Cytora","is","going","to","PyCon!"]

In [0]:
# Get first token of the processed document
token = doc[0]
print(token)

# Print sentences (one sentence per line)
for sent in doc.sents:
    print(sent)

## Part of speech tags

#### What is a Speech Tag?
A speech tag is a context sensitive description of what a word means in the context of the whole sentence.
More information about the kinds of speech tags which are used in NLP can be [found here](http://www.winwaed.com/blog/2011/11/08/part-of-speech-tags/).

Examples:

1. CARDINAL, Cardinal Number - 1,2,3
2. PROPN, Proper Noun, Singular - "Matic", "Andraz", "Cardiff"
3. INTJ, Interjection - "Uhhhhhhhhhhh"

In [0]:
# For each token, print corresponding part of speech tag
for token in doc:
    print('{} - {}'.format(token, token.pos_))

## Visual part of speech tagging ([displaCy](https://displacy.spacy.io))

## Syntactic dependencies

#### What are syntactic dependencies?

We have the speech tags and we have all of the tokens in a sentence, but how do we relate the two to uncover the syntax in a sentence? Syntactic dependencies describe how each type of word relates to each other in a sentence, this is important in NLP in order to extract structure and understand grammar in plain text.

Example:

<img src="https://github.com/explosion/spacy-notebooks/blob/master/notebooks/conference_notebooks/pycon_nlp/images/syntax-dependencies-oliver.png?raw=1" align="left" width=500>

In [0]:
# Write a function that walks up the syntactic tree of the given token and collects all tokens to the root token (including root token).

def tokens_to_root(token):
    """
    Walk up the syntactic tree, collecting tokens to the root of the given `token`.
    :param token: Spacy token
    :return: list of Spacy tokens
    """
    tokens_to_r = []
    while token.head is not token:
        tokens_to_r.append(token)
        token = token.head
        tokens_to_r.append(token)

    return tokens_to_r

# For every token in document, print it's tokens to the root
for token in doc:
    print('{} --> {}'.format(token, tokens_to_root(token)))

# Print dependency labels of the tokens
for token in doc:
    print('-> '.join(['{}-{}'.format(dependent_token, dependent_token.dep_) for dependent_token in tokens_to_root(token)]))


## Named entities

#### Named Entities

A named entity is any real world object such as a person, location, organisation or product with a proper name. 

Example:

	1. Barack Obama
	2. Edinburgh
	3. Ferrari Enzo

In [0]:
# Print all named entities with named entity types

doc_2 = nlp(u"I went to Paris where I met my old friend Jack from uni.")
for ent in doc_2.ents:
    print('{} - {}'.format(ent, ent.label_))

## Noun chunks

#### What is a Noun Chunk?
Noun chunks are the phrases based upon nouns recovered from tokenized text using the speech tags.

Example:

The sentence "The boy saw the yellow dog" has 2 noun objects, the boy and the dog. 
Therefore the noun chunks will be

	1. "The boy"
	2. "the yellow dog"

In [0]:
# Print noun chunks for doc_2
print([chunk for chunk in doc_2.noun_chunks])

## Unigram probabilities

In [0]:
# For every token in doc_2, print log-probability of the word, estimated from counts from a large corpus 
for token in doc_2:
    print(token, ',', token.prob)

## Word embedding / Similarity

#### What are Word embeddings?

A word embedding is a representation of a word, and by extension a whole language corpus, in a vector or other form of numerical mapping. This allows words to be treated numerically with word similarity represented as spatial difference in the dimensions of the word embedding mapping.

Example:
	
With word embeddings we can understand that vector operations describe word similarity. This means that we can see vector proofs of statements such as:

	king-queen==man-woman

In [32]:

# For a given document, calculate similarity between 'apples' and 'oranges' and 'boots' and 'hippos'
doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

doc1 = nlp(u"Payment not credited")
doc2 = nlp(u"payment failure")

doc3 = nlp(u"payment made but to wrong account")
doc4 = nlp(u"my money is credited to someone else")
doc5 = nlp(u"mny gone to wrng accnt")


doc6 = nlp(u"unable to add account in iphone")
doc7 = nlp(u"can't reset password in my phone")

print(doc1.similarity(doc2))
print(doc1.similarity(doc3))
print(doc2.similarity(doc3))
print(doc3.similarity(doc4))
print(doc5.similarity(doc4))
print(doc7.similarity(doc6))

# print(apples.similarity(oranges))
# print(boots.similarity(hippos))

# print()
# # Print similarity between sentence and word 'fruit'
# apples_sent, boots_sent = doc.sents
# fruit = doc.vocab[u'fruit']
# print(apples_sent.similarity(fruit))
# print(boots_sent.similarity(fruit))

0.7906641738402326
0.8496588629615003
0.7531852532506461
0.8733585464915755
0.282217543111559
0.8143663374700432


In [1]:
from keras.datasets import imdb

Using TensorFlow backend.


In [2]:
vocabulary_size = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocabulary_size)
print('Loaded dataset with {} training samples, {} test samples'.format(len(X_train), len(X_test)))

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
Loaded dataset with 25000 training samples, 25000 test samples


In [54]:
print('---review---')
print(X_train[6])
print('---label---')
print(y_train[6])

---review---
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    

In [53]:
word2id = imdb.get_word_index()
id2word = {i: word for word, i in word2id.items()}
print('---review with words---')
print([id2word.get(i, ' ') for i in X_train[6]])
print('---label---')
print(y_train[10])

---review with words---
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 

In [5]:
print('Maximum review length: {}'.format(
len(max((X_train + X_test), key=len))))

Maximum review length: 2697


In [6]:
print('Minimum review length: {}'.format(
len(min((X_test + X_test), key=len))))

Minimum review length: 14


In order to feed this data into our RNN, all input documents must have the same length. We will limit the maximum review length to max_words by truncating longer reviews and padding shorter reviews with a null value (0). We can accomplish this using the pad_sequences() function in Keras. For now, set max_words to 500.

In [0]:
from keras.preprocessing import sequence
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

In [8]:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
embedding_size=32
model=Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))

# to change it to multiclass; add neurons in last dense layer & activation to softmax 
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None


In [0]:
model.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])

In [10]:
batch_size = 64
num_epochs = 3
X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]
model.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs)

Train on 24936 samples, validate on 64 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f9873a65bd0>

In [12]:
scores = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', scores[1])

('Test accuracy:', 0.86756)


In [0]:
from gensim.models import Word2Vec

#loading the downloaded model
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, norm_only=True)

#the model is loaded. It can be used to perform all of the tasks mentioned above.

# getting word vectors of a word
dog = model['dog']

#performing king queen magic
print(model.most_similar(positive=['woman', 'king'], negative=['man']))

#picking odd one out
print(model.doesnt_match("breakfast cereal dinner lunch".split()))

#printing similarity index
print(model.similarity('woman', 'man'))

In [33]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [34]:
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


In [0]:
from textblob import TextBlob

In [0]:
wiki = TextBlob("Python is a high-level, general-purpose programming language.")

In [35]:
wiki.tags

[('Python', u'NNP'),
 ('is', u'VBZ'),
 ('a', u'DT'),
 ('high-level', u'JJ'),
 ('general-purpose', u'JJ'),
 ('programming', u'NN'),
 ('language', u'NN')]

In [0]:
testimonial = TextBlob("Transactions on @UPI_NPCI have jumped 85% with 57 banks live on this platform. Read More: http://bit.ly/2h0paOt  #UPIGrows #PayDigitally")

In [37]:
testimonial.sentiment

Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)

In [38]:
testimonial.sentiment.polarity

0.39166666666666666

In [0]:
b = TextBlob("I havv goood speling!")

In [40]:
print(b.correct())

I have good spelling!


In [0]:
en_blob = TextBlob(u'Simple is better than complex.')

In [42]:
en_blob.translate(to='es')

TextBlob("Lo simple es mejor que lo complejo.")

In [43]:
!git clone https://github.com/loretoparisi/word2vec-twitter.git

Cloning into 'word2vec-twitter'...
remote: Enumerating objects: 24, done.[K
Unpacking objects:   4% (1/24)   Unpacking objects:   8% (2/24)   Unpacking objects:  12% (3/24)   Unpacking objects:  16% (4/24)   Unpacking objects:  20% (5/24)   Unpacking objects:  25% (6/24)   Unpacking objects:  29% (7/24)   Unpacking objects:  33% (8/24)   Unpacking objects:  37% (9/24)   Unpacking objects:  41% (10/24)   Unpacking objects:  45% (11/24)   Unpacking objects:  50% (12/24)   remote: Total 24 (delta 0), reused 0 (delta 0), pack-reused 24[K
Unpacking objects:  54% (13/24)   Unpacking objects:  58% (14/24)   Unpacking objects:  62% (15/24)   Unpacking objects:  66% (16/24)   Unpacking objects:  70% (17/24)   Unpacking objects:  75% (18/24)   Unpacking objects:  79% (19/24)   Unpacking objects:  83% (20/24)   Unpacking objects:  87% (21/24)   Unpacking objects:  91% (22/24)   Unpacking objects:  95% (23/24)   Unpacking objects: 100% (24/24)   Unpacking objects: 100% (2

In [46]:
!wget http://yuca.test.iminds.be:8900/fgodin/downloads/word2vec_twitter_model.tar.gz

--2018-12-23 12:29:08--  http://yuca.test.iminds.be:8900/fgodin/downloads/word2vec_twitter_model.tar.gz
Resolving yuca.test.iminds.be (yuca.test.iminds.be)... 193.191.148.189
Connecting to yuca.test.iminds.be (yuca.test.iminds.be)|193.191.148.189|:8900... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4544343010 (4.2G) [application/x-gzip]
Saving to: ‘word2vec_twitter_model.tar.gz’


2018-12-23 12:52:40 (3.07 MB/s) - ‘word2vec_twitter_model.tar.gz’ saved [4544343010/4544343010]



In [50]:
!ls
!cp word2vec_twitter_model.tar.gz word2vec-twitter/
!cd word2vec-twitter  
!ls

sample_data  word2vec-twitter  word2vec_twitter_model.tar.gz
sample_data  word2vec-twitter  word2vec_twitter_model.tar.gz


In [51]:
!python word2vec-twitter/word2vecReader.py

Loading the model, this can take some time...
Traceback (most recent call last):
  File "word2vec-twitter/word2vecReader.py", line 267, in <module>
    model = Word2Vec.load_word2vec_format(model_path, binary=True)
  File "word2vec-twitter/word2vecReader.py", line 115, in load_word2vec_format
    with utils.smart_open(fname) as fin:
  File "/content/word2vec-twitter/word2vecReaderUtils.py", line 661, in smart_open
    return open(fname, mode)
IOError: [Errno 2] No such file or directory: './word2vec_twitter_model.bin'
