## Installation and Setup

Installation is a two-step process. First, install spaCy using either conda or pip. Next, download the specific model you want, based on language.<br> For more info visit https://spacy.io/usage/
#### 1. From the command line or terminal:
> `conda install -c conda-forge spacy`
> <br>*or*<br>
> `pip install -U spacy`
> ### Alternatively you can create a virtual environment:
> `conda create -n spacyenv python=3 spacy=2`
#### 2. Next, also from the command line (you must run this as admin or use sudo):
> `python -m spacy download en`
> ### If successful, you should see a message like:
> **`Linking successful`**<br>
> `    C:\Anaconda3\envs\spacyenv\lib\site-packages\en_core_web_sm -->`<br>
> `    C:\Anaconda3\envs\spacyenv\lib\site-packages\spacy\data\en`<br>
> ` `<br>
> `    You can now load the model via spacy.load('en')`
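
As a quick sanity check after both steps, the sketch below reports whether the `spacy` package and the model package are visible to the current Python interpreter. It uses only the standard library. The helper name `is_installed` is made up for this example, and the model package name `en_core_web_sm` is taken from the linking message above.

```python
import importlib.util

def is_installed(package_name):
    """Return True if a top-level package is importable from this environment."""
    return importlib.util.find_spec(package_name) is not None

# Report the status of the library and of the English model package.
for name in ("spacy", "en_core_web_sm"):
    print(name, "found" if is_installed(name) else "missing")
```

If either name is reported as missing, the most likely cause is that the install ran in a different environment than the one running the notebook.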

# Working with spaCy in Python

This is a typical set of instructions for importing and working with spaCy. Don't be surprised if this takes a while: spaCy has a fairly large library to load.

In [1]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')  # Load spaCy's small English model

# Create a Doc object
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

# Print each token separately
for token in doc:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj


This doesn't look very user-friendly, but right away we see some interesting things happen:
1. Tesla is recognized to be a Proper Noun, not just a word at the start of a sentence
2. U.S. is kept together as one entity (we call this a 'token')

As we dive deeper into spaCy we'll see what each of these abbreviations means and how it is derived. We'll also see how spaCy can interpret the last three tokens combined `$6 million` as referring to ***money***.

___
# spaCy Objects

After importing the spacy module in the cell above we loaded a **model** and named it `nlp`.<br>Next we created a **Doc** object by applying the model to our text, and named it `doc`.<br>spaCy also builds a companion **Vocab** object that we'll cover in later sections.<br>The **Doc** object that holds the processed text is our focus here.

___
# Pipeline
When we run `nlp`, our text enters a *processing pipeline* that first breaks down the text and then performs a series of operations to tag, parse and describe the data.   Image source: https://spacy.io/usage/spacy-101#pipelines

<img src="pipeline1.png" width="600">

We can check to see what components currently live in the pipeline. In later sections we'll learn how to disable components and add new ones as needed.

>**tagger** - Assigns part-of-speech tags.
>**parser** - Assigns dependency labels.
>**ner** - Detects and labels named entities.

In [2]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x1c4cc36cd08>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x1c4cc298168>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1c4cc298228>)]

In [4]:
nlp.pipe_names

['tagger', 'parser', 'ner']

___
## Tokenization
The first step in processing text is to split up all the component parts (words & punctuation) into "tokens". These tokens are annotated inside the Doc object to contain descriptive information. We'll go into much more detail on tokenization in an upcoming lecture. For now, let's look at another example:

In [7]:
doc2 = nlp(u"Tesla isn't   looking into startups anymore.")
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX aux
n't PART neg
   SPACE 
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


Notice how `isn't` has been split into two tokens. spaCy recognizes both the verb `is` and the negation `n't` attached to it. Notice also that both the extended whitespace and the period at the end of the sentence are assigned their own tokens.
It's important to note that even though `doc2` contains processed information about each token, it also retains the original text:

In [8]:
print(doc2)
print(doc2[0])
print(type(doc2))

Tesla isn't   looking into startups anymore.
Tesla
<class 'spacy.tokens.doc.Doc'>
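
One way to confirm that tokenization keeps the original text is to rebuild the string from the tokens. The sketch below uses `spacy.blank('en')`, a tokenizer-only pipeline that needs no model download. That choice is an assumption made to keep the example light; the `nlp` object loaded earlier should behave the same way for this check. `token.text_with_ws` is a token's text plus any whitespace that follows it.

```python
import spacy

# Tokenizer-only English pipeline; no statistical model is required.
blank_nlp = spacy.blank('en')

raw = "Tesla isn't   looking into startups anymore."
demo_doc = blank_nlp(raw)

# Joining every token with its trailing whitespace restores the input.
rebuilt = ''.join(token.text_with_ws for token in demo_doc)
print(rebuilt == raw)
print(demo_doc.text == raw)
```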


___
## Part-of-Speech Tagging (POS)
The next step after splitting the text up into tokens is to assign parts of speech. In the above example, `Tesla` was recognized to be a ***proper noun***. Here some statistical modeling is required. For example, words that follow "the" are typically nouns.

For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging

In [8]:
doc2[0].pos_

'PROPN'

___
## Dependencies
We also looked at the syntactic dependencies assigned to each token. `Tesla` is identified as an `nsubj` or the ***nominal subject*** of the sentence.
For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing
<br>A good explanation of typed dependencies can be found [here](https://nlp.stanford.edu/software/dependencies_manual.pdf)

In [9]:
doc2[0].dep_

'nsubj'

To see the full name of a tag use `spacy.explain(tag)`

In [10]:
spacy.explain('PROPN')

'proper noun'

In [11]:
spacy.explain('nsubj')

'nominal subject'
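
To look up several labels at once, `spacy.explain` can be applied in a loop. This is a small sketch; the list of labels is hand-picked from the outputs earlier in this notebook, not produced by a model.

```python
import spacy

# Labels copied from the token printouts above.
labels = ['PROPN', 'AUX', 'VERB', 'ADP', 'nsubj', 'pobj', 'VBG']

# Map each label to its human-readable description.
glossary = {label: spacy.explain(label) for label in labels}
for label, meaning in glossary.items():
    print(f'{label:<6} {meaning}')
```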

___
## Additional Token Attributes
We'll see these again in upcoming lectures. For now we just want to illustrate some of the other information that spaCy assigns to tokens:

|Attribute|Description|Value for `doc2[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [13]:
# Lemmas (the base form of the word):
print(doc2[4].text)
print(doc2[4].lemma_)

looking
look


In [14]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc2[4].pos_)
print(doc2[4].tag_ + ' / ' + spacy.explain(doc2[4].tag_))

VERB
VBG / verb, gerund or present participle


In [15]:
# Word Shapes:
print(doc2[0].text+': '+doc2[0].shape_)
print(doc[5].text+' : '+doc[5].shape_)

Tesla: Xxxxx
U.S. : X.X.


In [16]:
# Boolean Values:
print(doc2[0].is_alpha)
print(doc2[0].is_stop)

True
False
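
The lexical attributes can also be collected for every token in one pass. The sketch below is an illustration under one assumption: it uses `spacy.blank('en')`, which only tokenizes, so it covers `shape_`, `is_alpha` and `is_stop` but not `pos_`, `tag_` or `lemma_`, which need the full model loaded earlier.

```python
import spacy

blank_nlp = spacy.blank('en')
demo_doc = blank_nlp('Tesla is looking at buying U.S. startup for $6 million')

# One dictionary per token with its lexical attributes.
rows = [
    {'text': t.text, 'shape': t.shape_, 'is_alpha': t.is_alpha, 'is_stop': t.is_stop}
    for t in demo_doc
]
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` if a tabular view is preferred.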


___
## Spans
Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object in the form `Doc[start:stop]`.

In [17]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [18]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [19]:
type(life_quote)

spacy.tokens.span.Span
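
A Span remembers where it sits in its parent Doc. The sketch below shows this on a shorter sentence made up for the example, again with the tokenizer-only `spacy.blank('en')` pipeline as a lightweight stand-in for the loaded model.

```python
import spacy

blank_nlp = spacy.blank('en')
demo_doc = blank_nlp('The phrase "Life is what happens" is famous.')

# Slice out the quotation, including both quote marks.
quote = demo_doc[2:8]
print(quote.text)
print(quote.start, quote.end, len(quote))  # token offsets into the parent Doc
print(quote.doc is demo_doc)               # the Span keeps a reference to its Doc
```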

___
## Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we'll write our own segmentation rules.

In [20]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [21]:
for sent in doc4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [22]:
doc4[6].is_sent_start

True
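
The example above relies on the loaded model to mark sentence starts. As a lighter alternative, spaCy also ships a rule-based `sentencizer` component that splits on punctuation. Note that this swaps in a different technique from the one used above. The `try`/`except` is an assumption about the environment: it covers the different `add_pipe` calling conventions of spaCy v2 and v3.

```python
import spacy

blank_nlp = spacy.blank('en')
try:
    blank_nlp.add_pipe('sentencizer')  # spaCy v3: add by registered name
except ValueError:
    # spaCy v2: add_pipe expects the component object itself
    blank_nlp.add_pipe(blank_nlp.create_pipe('sentencizer'))

demo_doc = blank_nlp('This is the first sentence. This is another sentence. This is the last sentence.')

# Indices of the tokens that open a sentence.
sentence_starts = [token.i for token in demo_doc if token.is_sent_start]
print(sentence_starts)
print([sent.text for sent in demo_doc.sents])
```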

_______________________________________________________________________________________________________________________

### Using spaCy in a dataset 

The dataset is a collection of tweets by Elon Musk posted between November 16, 2012 and September 29, 2017.

In this kernel, we will learn how to use spaCy in Python to perform a few common NLP tasks. This is just a starter pack for analyzing tweets; it includes no steps for cleaning the text.

In [27]:
import numpy as np 
import pandas as pd 
import os
#print(os.listdir("../input"))
import spacy
import random 
from collections import Counter #for counting
import seaborn as sns #for visualization

os.getcwd()

'C:\\Users\\avina\\Documents\\MLProject\\NLP Practice Notebooks'

In [25]:
nlp = spacy.load('en')

In [32]:
#Reading tweets
tweets = pd.read_csv("data_elonmusk.csv",encoding='latin1')
tweets = tweets.assign(Time=pd.to_datetime(tweets.Time)).drop('row ID', axis='columns')

In [33]:
tweets.head(10)

| |Tweet|Time|Retweet from|User|
|:------|:------|:------|:------|:------|
|0|@MeltingIce Assuming max acceleration of 2 to ...|2017-09-29 17:39:19| |elonmusk|
|1|RT @SpaceX: BFR is capable of transporting sat...|2017-09-29 10:44:54|SpaceX|elonmusk|
|2|@bigajm Yup :)|2017-09-29 10:39:57| |elonmusk|
|3|Part 2 https://t.co/8Fvu57muhM|2017-09-29 09:56:12| |elonmusk|
|4|Fly to most places on Earth in under 30 mins a...|2017-09-29 09:19:21| |elonmusk|
|5|RT @SpaceX: Supporting the creation of a perma...|2017-09-29 08:57:29|SpaceX|elonmusk|
|6|BFR will take you anywhere on Earth in less th...|2017-09-29 08:53:00| |elonmusk|
|7|Mars City\nOpposite of Earth. Dawn and dusk sk...|2017-09-29 06:03:32| |elonmusk|
|8|Moon Base Alpha https://t.co/voY8qEW9kl|2017-09-29 05:44:55| |elonmusk|
|9|Will be announcing something really special at...|2017-09-29 02:36:17| |elonmusk|


In [34]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3218 entries, 0 to 3217
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Tweet         3218 non-null   object        
 1   Time          3218 non-null   datetime64[ns]
 2   Retweet from  525 non-null    object        
 3   User          3218 non-null   object        
dtypes: datetime64[ns](1), object(3)
memory usage: 100.7+ KB


In [35]:
tweets.shape

(3218, 4)

**Sampling the tweets text**

In [48]:
random.seed(123)
text = tweets.Tweet[random.sample(range(1,240),10)]
text

14     Prev ideas for paying ~$10B dev cost incl. Kic...
69                 @newscientist https://t.co/mA8ZgutrbE
23     @USATODAYmoney @NathanBomey That's not a lot o...
197    @skynetcomputer Yeah, it's better than a Model...
105                       @VoltzCoreAudio Exactly. Yeah.
28     @Bobby_Gupta Def not ok. Just sent a reminder ...
215                       First! https://t.co/OXNEgDnhku
231    Love this Tesla P100D drag racing video https:...
224    @caradocp Side booster rockets return to Cape ...
10     RT @SpaceX: Nine years ago today, Falcon 1 bec...
Name: Tweet, dtype: object

#### Annotation
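Let us begin our NLP journey with linguistic annotation, which means marking each word with its linguistic type (NOUN, VERB, and so on). This gives grammatical labels to our text corpus. The function `nlp()` accepts only a string, so we use `str()` to turn the sampled rows above into one long string. Note that `str()` on a pandas Series returns its printed representation, so the resulting string also contains the index labels, the `...` truncation marks, and the trailing `Name: Tweet, dtype: object` line.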

In [49]:
text_combined = str(text)

In [50]:
doc = nlp(text_combined)

In [51]:
print(doc)

14     Prev ideas for paying ~$10B dev cost incl. Kic...
69                 @newscientist https://t.co/mA8ZgutrbE
23     @USATODAYmoney @NathanBomey That's not a lot o...
197    @skynetcomputer Yeah, it's better than a Model...
105                       @VoltzCoreAudio Exactly. Yeah.
28     @Bobby_Gupta Def not ok. Just sent a reminder ...
215                       First! https://t.co/OXNEgDnhku
231    Love this Tesla P100D drag racing video https:...
224    @caradocp Side booster rockets return to Cape ...
10     RT @SpaceX: Nine years ago today, Falcon 1 bec...
Name: Tweet, dtype: object
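
The index numbers and the `Name: Tweet, dtype: object` footer appear in the Doc because `str()` was applied to the whole Series. If only the tweet text is wanted, one option is to join the individual strings. The sketch below shows the difference on a two-row stand-in Series built from two untruncated tweets in the table above; the variable names are made up for this example.

```python
import pandas as pd

# Two untruncated tweets from the table above, used as a small stand-in Series.
sample = pd.Series(
    ['Moon Base Alpha https://t.co/voY8qEW9kl', '@bigajm Yup :)'],
    index=[8, 2],
    name='Tweet',
)

as_repr = str(sample)       # display form: index labels plus a Name/dtype footer
as_text = ' '.join(sample)  # only the tweet strings, separated by spaces

print(as_repr)
print(as_text)
```

Applied to the notebook's data, the equivalent call would be `' '.join(text)`. The outputs in the rest of this section were produced from the `str(text)` version.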


#### Tokenization
`doc` is the annotated text produced by the loaded language model. Tokenization has already happened as part of that processing, so we can now print the **tokens**. As a naive description, tokenization is nothing but breaking long sentences or a text corpus into small chunks (mostly words).

In [53]:
for token in doc:
    print(token)

14
    
Prev
ideas
for
paying
~$10B
dev
cost
incl
.
Kic
...


69
                
@newscientist
https://t.co/mA8ZgutrbE


23
    
@USATODAYmoney
@NathanBomey
That
's
not
a
lot
o
...


197
   
@skynetcomputer
Yeah
,
it
's
better
than
a
Model
...


105
                      
@VoltzCoreAudio
Exactly
.
Yeah
.


28
    
@Bobby_Gupta
Def
not
ok
.
Just
sent
a
reminder
...


215
                      
First
!
https://t.co/OXNEgDnhku


231
   
Love
this
Tesla
P100D
drag
racing
video
https
:
...


224
   
@caradocp
Side
booster
rockets
return
to
Cape
...


10
    
RT
@SpaceX
:
Nine
years
ago
today
,
Falcon
1
bec
...


Name
:
Tweet
,
dtype
:
object


Perhaps we don't want to see everything, just the nouns. `doc.noun_chunks` returns noun phrases, meaning each noun together with the words that describe it. The code below lists them:

In [54]:
nouns = list(doc.noun_chunks)
nouns

[14     Prev ideas,
 ~$10B dev cost incl,
 Kic,
 @USATODAYmoney @NathanBomey,
 a lot,
 it,
 a Model,
 a reminder,
 https://t.co/OXNEgDnhku
 231    Love,
 this Tesla P100D drag racing video https,
 224    @caradocp Side booster rockets,
 Cape,
 RT @SpaceX,
 Nine years ago today, Falcon 1 bec,
 Name,
 Tweet,
 dtype,
 object]
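
`noun_chunks` yields whole phrases. To keep only the individual tokens tagged as nouns, the tokens can be filtered on `pos_` instead. The following is a minimal sketch: `words_with_pos` is a hypothetical helper, and the `Tok` stand-ins (with values copied from the first example in this notebook) are there only so the snippet runs without a model.

```python
from collections import namedtuple

def words_with_pos(tokens, pos_tags=('NOUN',)):
    """Return the text of every token whose coarse POS tag is in pos_tags."""
    return [token.text for token in tokens if token.pos_ in pos_tags]

# Stand-in tokens; a real spaCy Doc exposes the same .text and .pos_ attributes.
Tok = namedtuple('Tok', ['text', 'pos_'])
sample = [Tok('Tesla', 'PROPN'), Tok('is', 'AUX'), Tok('looking', 'VERB'),
          Tok('startup', 'NOUN'), Tok('million', 'NUM')]

print(words_with_pos(sample))
print(words_with_pos(sample, pos_tags=('NOUN', 'PROPN')))
```

With the notebook's data, the call would be `words_with_pos(doc)`.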

Sometimes we need to split text into sentences rather than words. With a chat transcript from customer service, for example, we would segment the transcript sentence by sentence.

In [55]:
list(doc.sents)

[14     Prev ideas for paying ~$10B dev cost incl.,
 Kic...,
 69                 ,
 @newscientist,
 https://t.co/mA8ZgutrbE
 23     ,
 @USATODAYmoney @NathanBomey,
 That's not a lot,
 o...,
 197    ,
 @skynetcomputer,
 Yeah, it's better than a Model...,
 105                       ,
 @VoltzCoreAudio,
 Exactly.,
 Yeah.,
 28     ,
 @Bobby_Gupta Def,
 not ok.,
 Just sent a reminder ...,
 215                       ,
 First!,
 https://t.co/OXNEgDnhku
 231    Love this Tesla P100D drag racing video https:...,
 224    @caradocp Side booster rockets return to Cape ...,
 10     ,
 RT @SpaceX:,
 Nine years ago today, Falcon 1 bec...,
 Name: Tweet, dtype: object]

#### Named Entity Recognition (NER)
NER is the process of extracting named entities such as persons, organizations, and locations from our text corpus. spaCy also has a `displacy` module that lets us visualize text together with its named entities. We can print the named entities using the following code:

In [56]:
for ent in doc.ents:
    print(ent.text,ent.label_)

14 CARDINAL
~$10B ORG
69 CARDINAL
23 CARDINAL
197 CARDINAL
105 CARDINAL
28 CARDINAL
215 CARDINAL
231 CARDINAL
224 CARDINAL
@caradocp ORG
Side PRODUCT
Cape PERSON
10 CARDINAL
Nine years ago today DATE
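
A quick tally of the labels makes this output easier to read. The sketch below counts them with `collections.Counter`, which is imported earlier in this notebook. The `(text, label)` pairs are copied by hand from the output above so the snippet runs on its own.

```python
from collections import Counter

# (text, label) pairs copied from the output above.
entities = [
    ('14', 'CARDINAL'), ('~$10B', 'ORG'), ('69', 'CARDINAL'), ('23', 'CARDINAL'),
    ('197', 'CARDINAL'), ('105', 'CARDINAL'), ('28', 'CARDINAL'), ('215', 'CARDINAL'),
    ('231', 'CARDINAL'), ('224', 'CARDINAL'), ('@caradocp', 'ORG'), ('Side', 'PRODUCT'),
    ('Cape', 'PERSON'), ('10', 'CARDINAL'), ('Nine years ago today', 'DATE'),
]

# Count how often each entity label occurs, most frequent first.
label_counts = Counter(label for _, label in entities)
for label, count in label_counts.most_common():
    print(label, count)
```

On the live Doc the same tally is `Counter(ent.label_ for ent in doc.ents)`. Every `CARDINAL` here is one of the Series index labels that `str()` carried into the text, not part of a tweet.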


**spaCy** also lets us visualize named entities along with their labels.

In [57]:
spacy.displacy.render(doc, style='ent',jupyter=True)

#### Lemmatization
Lemmatization is the process of retrieving the root form of a word. It is an essential step in NLP for bringing the different variants of a single word to one root word.

In [58]:
for token in doc:
    print(token.text, token.lemma_)

14 14
         
Prev prev
ideas idea
for for
paying pay
~$10B ~$10B
dev dev
cost cost
incl incl
. .
Kic kic
... ...

 

69 69
                                 
@newscientist @newscientist
https://t.co/mA8ZgutrbE https://t.co/ma8zgutrbe

 

23 23
         
@USATODAYmoney @usatodaymoney
@NathanBomey @NathanBomey
That that
's be
not not
a a
lot lot
o o
... ...

 

197 197
       
@skynetcomputer @skynetcomputer
Yeah yeah
, ,
it -PRON-
's be
better well
than than
a a
Model model
... ...

 

105 105
                                             
@VoltzCoreAudio @VoltzCoreAudio
Exactly exactly
. .
Yeah yeah
. .

 

28 28
         
@Bobby_Gupta @Bobby_Gupta
Def def
not not
ok ok
. .
Just just
sent send
a a
reminder reminder
... ...

 

215 215
                                             
First first
! !
https://t.co/OXNEgDnhku https://t.co/oxnegdnhku

 

231 231
       
Love Love
this this
Tesla Tesla
P100D P100D
drag drag
racing race
video video
https https
: :
... ...

 

224 224
       
@c

As you can see in the above output, words like `paying`, `ideas` and `sent` have been converted to their root words `pay`, `idea` and `send`.

#### Dependency Parser Visualization

In [59]:
spacy.displacy.render(doc, style='dep',jupyter=True)