### What is NLP?

NLP, or **Natural Language Processing**, is the process of analysing natural language (that is, how we humans speak) and extracting meaningful insights from it. NLP has become a very popular area of interest due to the growing volume of natural-language data and developments in Information Extraction (IE) methodologies.

### Sources of Natural Language

* Social Media (like FB Posts/Comments, Twitter Tweets, YouTube Comments)
* Speech Transcripts (Call Center Conversations) 
* Voice Agents (Amazon Echo, Google Home, Apple Siri) 

### Some Applications of NLP

* Automated Customer Service 
* Chatbots
* Social Listening
* Market Trends and much more

### About this Dataset

This dataset contains tweets carrying the hashtag **#JustDoIt**, posted after **Nike** released its ad campaign with Colin Kaepernick, which turned controversial.

<img src="https://www.thenation.com/wp-content/uploads/2018/09/Kaepernick-Nike-Ad-sg-img.jpg" alt="drawing" width="400"/>

### About spaCy:

spaCy by [explosion.ai](https://explosion.ai/) is a library for advanced **Natural Language Processing** in Python and Cython.
spaCy comes with
*pre-trained statistical models* and word
vectors, and currently supports tokenization for **20+ languages**. It features
the **fastest syntactic parser** in the world, convolutional **neural network models**
for tagging, parsing and **named entity recognition** and easy **deep learning**
integration. It's commercial open-source software, released under the MIT license.

### About this Kernel:

In this Kernel, we will learn how to use *spaCy* in Python to perform a few common NLP tasks.

### Installation

In [1]:
# !! pip3 install spacy

In [6]:
# !! python3 -m spacy download en_core_web_sm

['Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0',
 '  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)',
 'Installing collected packages: en-core-web-sm',
 '  Running setup.py install for en-core-web-sm: started',
 "    Running setup.py install for en-core-web-sm: finished with status 'done'",
 'Successfully installed en-core-web-sm-2.0.0',
 '',
 '\x1b[93m    Linking successful\x1b[0m',
 '    /usr/local/lib/python3.7/site-packages/en_core_web_sm -->',
 '    /usr/local/lib/python3.7/site-packages/spacy/data/en_core_web_sm',
 '',
 "    You can now load the model via spacy.load('en_core_web_sm')",
 '']

Let us begin our journey by loading the required libraries.

### Loading the required Libraries

In [2]:
#import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
#import os
#print(os.listdir("../input"))
import spacy
import random 
from collections import Counter #for counting
import seaborn as sns #for visualization

As we have seen above, *spaCy* comes with pre-trained language models, and since our tweets are predominantly in English, let us load the English model using the following code:

### Loading the spaCy English Model

In [3]:
nlp = spacy.load('en_core_web_sm')

Please note that you can download other language models by running a command like the one below in your shell or terminal

`python -m spacy download en_core_web_sm` 

and then loading the model using `spacy.load()`. The last argument in the above command is the name of the language model to be downloaded.

Now that our model is successfully loaded into `nlp`, let us read our input data using `read_csv()` of `pandas`. 

### Reading input file - Tweets

In [4]:
tweets = pd.read_csv("justdoit_tweets_2018_09_07_2.csv")

As with any dataset, let us do a few basics like understanding the shape (dimension) of the dataset and then see a sample row. 

### Dimension of the input file

In [5]:
tweets.shape

(5089, 72)

### Sample Row

In [6]:
tweets.head(1)

Unnamed: 0,tweet_contributors,tweet_coordinates,tweet_created_at,tweet_display_text_range,tweet_entities,tweet_extended_entities,tweet_favorite_count,tweet_favorited,tweet_full_text,tweet_geo,...,user_profile_text_color,user_profile_use_background_image,user_protected,user_screen_name,user_statuses_count,user_time_zone,user_translator_type,user_url,user_utc_offset,user_verified
0,,,Fri Sep 07 16:25:06 +0000 2018,"[0, 75]","{'hashtags': [{'text': 'quote', 'indices': [47...","{'media': [{'id': 1038100853872197632, 'id_str...",0,False,Done is better than perfect. — Sheryl Sandberg...,,...,333333,True,False,UltraYOUwoman,91870.0,,none,https://t.co/jGlJswxjwS,,False


Now that we know `tweet_full_text` is the column name in which tweets are stored, let us print some sample tweets.

### Sample Tweets Text

For simplicity, let us take a small random sample of tweets.

In [7]:
random.seed(888)
text = tweets.tweet_full_text[random.sample(range(1,100),10)]
text

11    Colin Kaepernick's business partner @Nike send...
56    This is why Colin kneels.  We all should kneel...
75    @Nike is aligning itself with the core values ...
57    Sounds like a plan! #JustDoIt #FirstAmendment ...
64    @washingtonpost Thank you #Kap #JustDoIt #Nike...
50    Invest in #Mojo50?\n\n#JustDoIt \n\n@DocThomps...
81    Owned Yet, Libs? https://t.co/D7I86zTfL7 #Nike...
48    If you work hard, limitless and focus on your ...
70    @JWKeady @Kaepernick7 @KillerMike @tmorello #N...
54    Create Your Own Nike Just Do It Colin Kaeperni...
Name: tweet_full_text, dtype: object
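
The sampling step above relies on `random.seed()` to make the draw repeatable. Here is a minimal sketch of that behaviour using only the standard library. The helper name `sample_labels` is hypothetical and not part of the notebook:

```python
import random

def sample_labels(seed, start=1, stop=100, k=10):
    """Draw k distinct row labels from range(start, stop) after seeding."""
    random.seed(seed)
    return random.sample(range(start, stop), k)

first = sample_labels(888)
second = sample_labels(888)

# Same seed, same draw: the two lists are identical.
print(first == second)
# sample() draws without replacement, so all labels are distinct.
print(len(set(first)))
```

Because `range(1, 100)` starts at 1, the row with label 0 can never be drawn; `range(100)` would include it.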

### Annotation:

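Let us begin our NLP journey with linguistic annotation, which means labelling each word with its linguistic type, such as NOUN or VERB. This gives grammatical labels to our text corpus. The function `nlp()` takes a single string, so let us use `str()` to turn the sampled rows above into one long string. Note that `str()` on a pandas Series returns its printed representation, so the index labels, the truncated text (`...`) and the trailing `Name: tweet_full_text, dtype: object` line become part of the string and show up in the outputs below.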

In [8]:
text_combined = str(text)

In [9]:
doc = nlp(text_combined)
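
As an alternative to `str()`, the tweets can be joined directly so that only their text reaches `nlp()`. This is a small sketch that accepts any iterable of texts. A plain list stands in for the `text` Series here, and both `combine_texts` and the two example strings are hypothetical:

```python
def combine_texts(texts, sep=" "):
    """Join an iterable of texts into one string."""
    # str() guards against non-string entries in the iterable.
    return sep.join(str(t) for t in texts)

sample = ["Sounds like a plan! #JustDoIt", "This is why Colin kneels."]
combined = combine_texts(sample)
print(combined)
```

With the notebook's data this would be called as `combine_texts(text)`, since iterating over a pandas Series yields its values.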

### Tokenization 

`doc` is the annotated text produced by the loaded language model. Tokenization has already been done as part of that step, so we can now print the **tokens**. As a naive description, tokenization is nothing but breaking long sentences or a text corpus into small pieces (mostly words and punctuation marks).

In [10]:
for token in doc:
    print(token)

11
   
Colin
Kaepernick
's
business
partner
@Nike
send
...


56
   
This
is
why
Colin
kneels
.
 
We
all
should
kneel
...


75
   
@Nike
is
aligning
itself
with
the
core
values
...


57
   
Sounds
like
a
plan
!
#
JustDoIt
#
FirstAmendment
...


64
   
@washingtonpost
Thank
you
#
Kap
#
JustDoIt
#
Nike
...


50
   
Invest
in
#
Mojo50?\n\n#JustDoIt
\n\n@DocThomps
...


81
   
Owned
Yet
,
Libs
?
https://t.co/D7I86zTfL7
#
Nike
...


48
   
If
you
work
hard
,
limitless
and
focus
on
your
...


70
   
@JWKeady
@Kaepernick7
@KillerMike
@tmorello
#
N
...


54
   
Create
Your
Own
Nike
Just
Do
It
Colin
Kaeperni
...


Name
:
tweet_full_text
,
dtype
:
object


Since we have already done the annotation, let us print our tokens with their part-of-speech tags.

In [11]:
for token in doc:
    print(token.text, token.pos_)

11 NUM
    SPACE
Colin PROPN
Kaepernick PROPN
's PART
business NOUN
partner NOUN
@Nike VERB
send VERB
... PUNCT

 SPACE
56 NUM
    SPACE
This DET
is VERB
why ADV
Colin PROPN
kneels NOUN
. PUNCT
  SPACE
We PRON
all DET
should VERB
kneel VERB
... PUNCT

 SPACE
75 NUM
    SPACE
@Nike ADJ
is VERB
aligning VERB
itself PRON
with ADP
the DET
core NOUN
values NOUN
... PUNCT

 SPACE
57 NUM
    SPACE
Sounds PROPN
like ADP
a DET
plan NOUN
! PUNCT
# NOUN
JustDoIt PROPN
# SYM
FirstAmendment PROPN
... PUNCT

 SPACE
64 NUM
    SPACE
@washingtonpost ADJ
Thank VERB
you PRON
# SYM
Kap PROPN
# SYM
JustDoIt PROPN
# SYM
Nike PROPN
... PUNCT

 SPACE
50 NUM
    SPACE
Invest VERB
in ADP
# SYM
Mojo50?\n\n#JustDoIt PROPN
\n\n@DocThomps NOUN
... PUNCT

 SPACE
81 NUM
    SPACE
Owned VERB
Yet ADV
, PUNCT
Libs PROPN
? PUNCT
https://t.co/D7I86zTfL7 X
# SYM
Nike PROPN
... PUNCT

 SPACE
48 NUM
    SPACE
If ADP
you PRON
work VERB
hard ADV
, PUNCT
limitless ADJ
and CCONJ
focus VERB
on ADP
your ADJ
... PUNCT

 SPACE
70 N
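
A quick way to summarise such output is to count how often each tag occurs. The sketch below uses `Counter` on a few `(text, pos)` pairs copied from the start of the output above (whitespace tokens omitted), so it runs without a model:

```python
from collections import Counter

# (text, pos) pairs copied from the first tweet in the output above.
tagged = [
    ("11", "NUM"), ("Colin", "PROPN"), ("Kaepernick", "PROPN"),
    ("'s", "PART"), ("business", "NOUN"), ("partner", "NOUN"),
    ("@Nike", "VERB"), ("send", "VERB"), ("...", "PUNCT"),
]

# Count how many tokens carry each part-of-speech tag.
pos_counts = Counter(pos for _, pos in tagged)
for pos, count in pos_counts.items():
    print(pos, count)
```

With the notebook's `doc`, the same count would be `Counter(token.pos_ for token in doc)`.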

That's good, we've got the tokens and their respective POS tags. Perhaps we don't want to see everything but only the noun phrases. The code below prints the noun chunks (noun phrases) in the text.

In [12]:
nouns = list(doc.noun_chunks)
nouns

[11    Colin Kaepernick's business partner,
 We,
 itself,
 the core values,
 57    Sounds,
 a plan,
 #JustDoIt #FirstAmendment,
 you,
 #Kap #JustDoIt #Nike,
 #Mojo50?\n\n#JustDoIt \n\n@DocThomps,
 Yet, Libs,
 #Nike,
 you,
 #N,
 Nike,
 It,
 Colin Kaeperni,
 Name,
 dtype,
 object]

Sometimes, we might need to tokenize based on sentences. Let's say we've got a chat transcript from customer service; in that case, we need to split the transcript into sentences.

In [13]:
list(doc.sents)

[11    Colin Kaepernick's business partner @Nike send...,
 56    ,
 This is why Colin kneels.  ,
 We all should kneel...,
 75    @Nike is aligning itself with the core values ...,
 57    Sounds like a plan!,
 #JustDoIt #FirstAmendment ...,
 64    @washingtonpost Thank you #Kap #JustDoIt #Nike...,
 50    Invest in #Mojo50?\n\n#JustDoIt \n\n@DocThomps...,
 81    ,
 Owned,
 Yet, Libs?,
 https://t.co/D7I86zTfL7 #Nike...,
 48    ,
 If you work hard, limitless and focus on your ...
 70    @JWKeady @Kaepernick7 @KillerMike @tmorello,
 #N...,
 54    ,
 Create Your Own,
 Nike,
 Just Do It Colin Kaeperni...,
 Name:,
 tweet_full_text, dtype: object]
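
For comparison, sentence splitting can be approximated with a simple rule: split after `.`, `!` or `?` when whitespace follows. This is only a sketch under that assumption, not how spaCy does it (its pretrained pipelines typically derive sentence boundaries from the dependency parse), and `naive_sentences` is a hypothetical helper:

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(naive_sentences("This is why Colin kneels.  We all should kneel..."))
# ['This is why Colin kneels.', 'We all should kneel...']
```

A rule like this breaks on abbreviations and on tweets without punctuation, which is one reason a model-based segmenter is usually preferred.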

### Named Entity Recognition (NER)

NER is the process of extracting named entities such as Person, Organization and Location from our text corpus. spaCy also has a module, `displacy`, that lets us visualize our text with its named entities. We can print the named entities using the following code:

In [14]:
for ent in doc.ents:
    print(ent.text,ent.label_)

11 CARDINAL
Colin Kaepernick's PERSON

 GPE
Colin PERSON

 GPE

 GPE
   Sounds PERSON

 GPE
64 CARDINAL
Kap PERSON
Nike ORG

 GPE
50 CARDINAL
#Mojo50?\n\n#JustDoIt \n\n@DocThomps MONEY

 GPE
81 CARDINAL
Libs PERSON
Nike ORG

 GPE
48 CARDINAL

 GPE
@Kaepernick7 NORP
54 DATE
Colin Kaeperni PERSON
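
The output above contains several whitespace-only spans labelled `GPE`. One possible clean-up is to drop those and group the remaining entities by label. The sketch below works on `(text, label)` pairs copied from the output, so it runs without a model:

```python
from collections import defaultdict

# (text, label) pairs copied from the output above; "\n" is one of the
# whitespace-only spans that the model labelled GPE.
entities = [
    ("11", "CARDINAL"), ("Colin Kaepernick's", "PERSON"), ("\n", "GPE"),
    ("Colin", "PERSON"), ("Kap", "PERSON"), ("Nike", "ORG"),
    ("Libs", "PERSON"), ("Nike", "ORG"),
]

by_label = defaultdict(list)
for text, label in entities:
    if text.strip():  # skip whitespace-only entities
        by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```

With the notebook's `doc`, the loop would iterate over `doc.ents` and use `ent.text` and `ent.label_` instead of the copied pairs.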


**spaCy** also allows us to visualize named entities along with their labels.

In [15]:
spacy.displacy.render(doc, style='ent',jupyter=True)

In [26]:
spacy.explain("GPE")

'Countries, cities, states'

In [27]:
spacy.explain("CARDINAL")

'Numerals that do not fall under another type'

### A better Text

In [23]:
hypothetical_text = nlp("Vijay Mallya is returning Indian Government his debt of $4 billion")

In [24]:
for token in hypothetical_text:
    print(token.text, token.lemma_, token.pos_, token.tag_,token.dep_,token.shape_, token.is_alpha,token.is_stop)

Vijay vijay PROPN NNP compound Xxxxx True False
Mallya mallya PROPN NNP nsubj Xxxxx True False
is be VERB VBZ aux xx True True
returning return VERB VBG ROOT xxxx True False
Indian indian ADJ JJ amod Xxxxx True False
Government government PROPN NNP dobj Xxxxx True False
his -PRON- ADJ PRP$ poss xxx True True
debt debt NOUN NN dobj xxxx True False
of of ADP IN prep xx True True
$ $ SYM $ quantmod $ False False
4 4 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False
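
The `shape_` column in the output above encodes the orthographic pattern of each token. The sketch below approximates that mapping (uppercase to `X`, lowercase to `x`, digits to `d`, runs cut off after four identical characters). The function `approx_shape` is a hypothetical re-implementation for illustration, not spaCy's own code:

```python
def approx_shape(text, max_run=4):
    """Approximate word shape: X upper, x lower, d digit, others kept."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch
        # Track the length of the current run of identical shape characters.
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= max_run:
            shape.append(mapped)
    return "".join(shape)

for word in ["Vijay", "returning", "$", "4"]:
    print(word, approx_shape(word))
```

For these four words the sketch reproduces the shapes printed above: `Xxxxx`, `xxxx`, `$` and `d`.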


In [25]:
spacy.displacy.render(hypothetical_text, style='ent',jupyter=True)

### Lemmatization

Lemmatization is the process of reducing a word to its base form (lemma). It is an essential step in NLP for bringing different variants of a single word to one root word.

In [23]:
for token in doc:
    print(token.text, token.lemma_)

11 11
       
Colin colin
Kaepernick kaepernick
's 's
business business
partner partner
@Nike @nike
send send
... ...

 

56 56
       
This this
is be
why why
Colin colin
kneels kneel
. .
   
We -PRON-
all all
should should
kneel kneel
... ...

 

75 75
       
@Nike @nike
is be
aligning align
itself -PRON-
with with
the the
core core
values value
... ...

 

57 57
       
Sounds sounds
like like
a a
plan plan
! !
# #
JustDoIt justdoit
# #
FirstAmendment firstamendment
... ...

 

64 64
       
@washingtonpost @washingtonpost
Thank thank
you -PRON-
# #
Kap kap
# #
JustDoIt justdoit
# #
Nike nike
... ...

 

50 50
       
Invest invest
in in
# #
Mojo50?\n\n#JustDoIt mojo50?\n\n#justdoit
\n\n@DocThomps \n\n@docthomp
... ...

 

81 81
       
Owned own
Yet yet
, ,
Libs libs
? ?
https://t.co/D7I86zTfL7 https://t.co/d7i86ztfl7
# #
Nike nike
... ...

 

48 48
       
If if
you -PRON-
work work
hard hard
, ,
limitless limitless
and and
focus focus
on on
your -PRON-
... ...

 

70 70
       


As you can see in the above output, words like *aligning* and *values* have been converted to their root words *align* and *value*. 
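
To see how lemmas bring variants together, the surface forms can be grouped under their lemma. The sketch below uses a few `(text, lemma)` pairs copied from the output above, so it runs without a model:

```python
from collections import defaultdict

# (text, lemma) pairs copied from the output above.
pairs = [
    ("kneels", "kneel"), ("kneel", "kneel"),
    ("is", "be"), ("aligning", "align"), ("values", "value"),
]

# Map each lemma to the set of surface forms that reduce to it.
variants = defaultdict(set)
for text, lemma in pairs:
    variants[lemma].add(text)

print(sorted(variants["kneel"]))
# ['kneel', 'kneels']
```

With the notebook's `doc`, the loop body would be `variants[token.lemma_].add(token.text)` for each `token` in `doc`.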

### Dependency Parser Visualization

In [24]:
spacy.displacy.render(doc, style='dep',jupyter=True)

In [19]:
spacy.displacy.render(hypothetical_text, style='dep',jupyter=True)

### Thank you!!!

![](get_in_touch.png)