# spaCy demo

In this notebook, we will explore the [spaCy](https://spacy.io/) library using the Fake News dataset.

Let us import spaCy and load its English language model.

In [2]:
import numpy as np
import pandas as pd
import spacy

# Load the English language model
nlp = spacy.load('en_core_web_sm')

Let us look at the number of rows and columns present in the dataset.

In [3]:
df = pd.read_csv("fake.csv")
df.shape

(12999, 20)

The columns are described as follows:

* uuid - Unique identifier
* ord_in_thread
* author - author of story
* published - date published
* title - title of the story
* text - text of story
* language - data from webhose.io
* crawled - date the story was archived
* site_url - site URL from BS detector
* country - data from webhose.io
* domain_rank - data from webhose.io
* thread_title
* spam_score - data from webhose.io
* main_img_url - image from story
* replies_count - number of replies
* participants_count - number of participants
* likes - number of Facebook likes
* comments - number of Facebook comments
* shares - number of Facebook shares
* type - type of website (label from BS detector)

Now let us look at the top few rows of the dataset to gain some more understanding.

In [4]:
df.head()

|   | uuid | ord_in_thread | author | published | title | text | language | crawled | site_url | country | domain_rank | thread_title | spam_score | main_img_url | replies_count | participants_count | likes | comments | shares | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6a175f46bcd24d39b3e962ad0f29936721db70db | 0 | Barracuda Brigade | 2016-10-26T21:41:00.000+03:00 | Muslims BUSTED: They Stole Millions In Gov’t B... | Print They should pay all the back all the mon... | english | 2016-10-27T01:49:27.168+03:00 | 100percentfedup.com | US | 25689.0 | Muslims BUSTED: They Stole Millions In Gov’t B... | 0.0 | http://bb4sp.com/wp-content/uploads/2016/10/Fu... | 0 | 1 | 0 | 0 | 0 | bias |
| 1 | 2bdc29d12605ef9cf3f09f9875040a7113be5d5b | 0 | reasoning with facts | 2016-10-29T08:47:11.259+03:00 | Re: Why Did Attorney General Loretta Lynch Ple... | Why Did Attorney General Loretta Lynch Plead T... | english | 2016-10-29T08:47:11.259+03:00 | 100percentfedup.com | US | 25689.0 | Re: Why Did Attorney General Loretta Lynch Ple... | 0.0 | http://bb4sp.com/wp-content/uploads/2016/10/Fu... | 0 | 1 | 0 | 0 | 0 | bias |
| 2 | c70e149fdd53de5e61c29281100b9de0ed268bc3 | 0 | Barracuda Brigade | 2016-10-31T01:41:49.479+02:00 | BREAKING: Weiner Cooperating With FBI On Hilla... | Red State : \nFox News Sunday reported this mo... | english | 2016-10-31T01:41:49.479+02:00 | 100percentfedup.com | US | 25689.0 | BREAKING: Weiner Cooperating With FBI On Hilla... | 0.0 | http://bb4sp.com/wp-content/uploads/2016/10/Fu... | 0 | 1 | 0 | 0 | 0 | bias |
| 3 | 7cf7c15731ac2a116dd7f629bd57ea468ed70284 | 0 | Fed Up | 2016-11-01T05:22:00.000+02:00 | PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe... | Email Kayla Mueller was a prisoner and torture... | english | 2016-11-01T15:46:26.304+02:00 | 100percentfedup.com | US | 25689.0 | PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe... | 0.068 | http://100percentfedup.com/wp-content/uploads/... | 0 | 0 | 0 | 0 | 0 | bias |
| 4 | 0206b54719c7e241ffe0ad4315b808290dbe6c0f | 0 | Fed Up | 2016-11-01T21:56:00.000+02:00 | FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal... | Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ... | english | 2016-11-01T23:59:42.266+02:00 | 100percentfedup.com | US | 25689.0 | FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal... | 0.865 | http://100percentfedup.com/wp-content/uploads/... | 0 | 0 | 0 | 0 | 0 | bias |


The columns "title", "text" and "thread_title" contain textual data. For this introduction, let us concentrate on the "title" column, so let us look at its top few rows.

In [5]:
df["title"].head()

0    Muslims BUSTED: They Stole Millions In Gov’t B...
1    Re: Why Did Attorney General Loretta Lynch Ple...
2    BREAKING: Weiner Cooperating With FBI On Hilla...
3    PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...
4    FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...
Name: title, dtype: object

**Word-Level Attributes:**

Just calling the function "nlp" on a piece of text gets us a lot of information. Let us take one example title from the dataset and apply it.

In [6]:
txt = df["title"][1009]
txt

"Queen Elizabeth II owns every dolphin in Britain and doesn't need a driving licence and doesn't pay tax — here are the incredible powers you didn't know the monarchy has"

In [7]:
doc = nlp(txt)    
olist = []
for token in doc:
    l = [token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_]
    olist.append(l)
    
odf = pd.DataFrame(olist)
odf.columns= ["Text", "StartIndex", "Lemma", "IsPunctuation", "IsSpace", "WordShape", "PartOfSpeech", "POSTag"]
odf

|   | Text | StartIndex | Lemma | IsPunctuation | IsSpace | WordShape | PartOfSpeech | POSTag |
|---|---|---|---|---|---|---|---|---|
| 0 | Queen | 0 | Queen | False | False | Xxxxx | PROPN | NNP |
| 1 | Elizabeth | 6 | Elizabeth | False | False | Xxxxx | PROPN | NNP |
| 2 | II | 16 | II | False | False | XX | PROPN | NNP |
| 3 | owns | 19 | own | False | False | xxxx | VERB | VBZ |
| 4 | every | 24 | every | False | False | xxxx | DET | DT |
| 5 | dolphin | 30 | dolphin | False | False | xxxx | NOUN | NN |
| 6 | in | 38 | in | False | False | xx | ADP | IN |
| 7 | Britain | 41 | Britain | False | False | Xxxxx | PROPN | NNP |
| 8 | and | 49 | and | False | False | xxx | CCONJ | CC |
| 9 | does | 53 | do | False | False | xxxx | AUX | VBZ |


So using "nlp" we got a lot of information. The details are as follows:

* Text - Tokenized word
* StartIndex - Index at which the word starts in the sentence
* Lemma - Lemma of the word (we need not do lemmatization separately)
* IsPunctuation - Whether the given token is a punctuation mark or not
* IsSpace - Whether the given token is just white space or not
* WordShape - Shape of the word: upper-case letters become X, lower-case letters become x and digits become d. Runs of the same symbol are capped at four, so "dolphin" becomes xxxx and "Elizabeth" becomes Xxxxx
* PartOfSpeech - Coarse-grained part of speech of the word (for example NOUN, VERB)
* POSTag - Fine-grained part-of-speech tag of the word (for example NN, VBZ)
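
To make the WordShape column more concrete, here is a minimal pure-Python sketch of the idea. The function name `word_shape_sketch` and the `max_run` parameter are hypothetical; this is an approximation written for illustration, not spaCy's own implementation.

```python
def word_shape_sketch(text, max_run=4):
    """Approximate a spaCy-style word shape (illustrative helper, not spaCy's API)."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        # Map each character to a shape symbol
        if ch.isalpha():
            shape_char = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            shape_char = "d"
        else:
            shape_char = ch  # punctuation and other characters are kept as they are
        # Count how long the current run of identical symbols is
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Keep at most max_run symbols from each run
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Queen", "Elizabeth", "II", "owns", "dolphin"]:
    print(word, word_shape_sketch(word))
```

For these five words the sketch gives the same shapes as the WordShape column in the table above.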

**Named Entity Recognition:**

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. 

Named entity recognition also comes as part of the spaCy package. It is built into the English language model, and we can also train it on our own entity types if needed.

In [8]:
doc = nlp(txt)
olist = []
for ent in doc.ents:
    olist.append([ent.text, ent.label_])
    
odf = pd.DataFrame(olist)
odf.columns = ["Text", "EntityType"]
odf

|   | Text | EntityType |
|---|---|---|
| 0 | Britain | GPE |
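
Each entity in `doc.ents` is a span of the original text, and spaCy spans carry character offsets (`ent.start_char`, `ent.end_char`) along with `ent.label_`. As a rough sketch of what can be done with those offsets, the plain-Python helper below marks entities inline. The helper name `mark_entities` is hypothetical, the offsets are typed in by hand from the StartIndex of "Britain" shown earlier, and the sketch assumes the spans do not overlap.

```python
def mark_entities(text, entities):
    """Wrap each (start_char, end_char, label) span in brackets with its label."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])               # text before the entity
        pieces.append(f"[{text[start:end]} {label}]")   # the entity itself
        cursor = end
    pieces.append(text[cursor:])                        # remaining text
    return "".join(pieces)


# Beginning of the example title used above
title_start = "Queen Elizabeth II owns every dolphin in Britain and doesn't need a driving licence"

# "Britain" starts at character 41 and is 7 characters long
entities = [(41, 48, "GPE")]

marked = mark_entities(title_start, entities)
print(marked)
```

This is similar in spirit to what the entity visualizer does, only with brackets instead of colored highlights.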


The complete list of different entity types can be seen [here](https://spacy.io/usage/linguistic-features#entity-types)

spaCy also includes the displaCy visualizer, which has Jupyter notebook support. It can be used to visualize the named entity recognition results.

In [9]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

Wow, this one looks cool. Let us take one more example, this time showing only the GPE and NORP entities with custom colors.

In [11]:
txt = df["title"][3003]
doc = nlp(txt)
colors = {'GPE': 'lightblue', 'NORP':'lightgreen'}
options = {'ents': ['GPE', 'NORP'], 'colors': colors}
displacy.render(doc, style='ent', jupyter=True, options=options)

**Noun Phrase Chunking:**

Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, "the lavish green grass" or "the world’s largest tech fund". 

Now let us look at how to do noun phrase chunking using spaCy. In addition to the noun phrases themselves, spaCy also gives us the root word of each phrase.

In [12]:
txt = df["title"][2012]
print(txt)

Nukes and the UN: a Historic Treaty to Ban Nuclear Weapons


In [13]:
doc = nlp(txt)
olist = []
for chunk in doc.noun_chunks:
    olist.append([chunk.text, chunk.label_, chunk.root.text])
odf = pd.DataFrame(olist)
odf.columns = ["NounPhrase", "Label", "RootWord"]
odf

|   | NounPhrase | Label | RootWord |
|---|---|---|---|
| 0 | Nukes | NP | Nukes |
| 1 | the UN | NP | UN |
| 2 | a Historic Treaty | NP | Treaty |
| 3 | Ban Nuclear Weapons | NP | Weapons |


**Dependency Parser:**

A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads - [Stanford NLP](https://nlp.stanford.edu/software/nndep.html)

spaCy can produce these dependency parses, which are useful in a variety of tasks.

In [14]:
doc = nlp(df["title"][1009])
olist = []
for token in doc:
    olist.append([token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children]])
odf = pd.DataFrame(olist)
odf.columns = ["Text", "Dep", "Head text", "Head POS", "Children"]
odf

|   | Text | Dep | Head text | Head POS | Children |
|---|---|---|---|---|---|
| 0 | Queen | compound | II | PROPN | [] |
| 1 | Elizabeth | compound | II | PROPN | [] |
| 2 | II | nsubj | owns | VERB | [Queen, Elizabeth] |
| 3 | owns | ccomp | are | AUX | [II, dolphin, and, need] |
| 4 | every | det | dolphin | NOUN | [] |
| 5 | dolphin | dobj | owns | VERB | [every, in] |
| 6 | in | prep | dolphin | NOUN | [Britain] |
| 7 | Britain | pobj | in | ADP | [] |
| 8 | and | cc | owns | VERB | [] |
| 9 | does | aux | need | VERB | [] |


The columns are described as follows:
* Text: The original token text.
* Dep: The syntactic relation connecting child to head.
* Head text: The original text of the token head.
* Head POS: The part-of-speech tag of the token head.
* Children: The immediate syntactic dependents of the token.
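
The "Head text" and "Children" columns describe the same tree from two directions. As a small sketch, the plain-Python snippet below rebuilds the children from the (token, head) pairs in the first ten rows of the table above and walks from one token up towards the root. The names `rows`, `children`, `heads` and `path_to_root` are hypothetical, and keying tokens by their text only works here because no word repeats in this excerpt.

```python
from collections import defaultdict

# (token, head) pairs copied from the first ten rows of the table above
rows = [
    ("Queen", "II"), ("Elizabeth", "II"), ("II", "owns"), ("owns", "are"),
    ("every", "dolphin"), ("dolphin", "owns"), ("in", "dolphin"),
    ("Britain", "in"), ("and", "owns"), ("does", "need"),
]

# Children are simply the inverse of the head relation
children = defaultdict(list)
for token, head in rows:
    children[head].append(token)

heads = dict(rows)


def path_to_root(token, heads):
    """Follow head links upwards until a token has no known head (or is its own head)."""
    path = [token]
    while path[-1] in heads and heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path


print(children["II"])
print(children["dolphin"])
print(path_to_root("Britain", heads))
```

Note that `children["owns"]` comes out shorter here than in the table, because "need" appears later in the sentence and is not among the ten rows shown.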

The best way to understand the dependency parse is to visualize it.

In [15]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

In [16]:
doc = nlp(df["title"][3012])
displacy.render(doc, style='dep', jupyter=True, options={'distance': 60})

**Word Similarity:**

spaCy also ships with word vectors, so we can use them to find similar words. The list of available models can be seen [here](https://spacy.io/models/).

For our case, let us use the 'en_core_web_lg' model available in spaCy (more details about the model can be found at this [link](https://spacy.io/models/en#en_core_web_lg)). The first step is to load the model.

In [17]:
nlp = spacy.load('en_core_web_lg')

Now we can use cosine similarity to find the words that are most similar to the word "Queen".

In [18]:
from scipy import spatial
cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

queen = nlp.vocab['Queen'].vector
computed_similarities = []
for word in nlp.vocab:
    # Ignore words without vectors
    if not word.has_vector:
        continue
    similarity = cosine_similarity(queen, word.vector)
    computed_similarities.append((word, similarity))

computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])
print([w[0].text for w in computed_similarities[:10]])

['Queen', 'Miss', 'she', 'St', 'r.', 'She', 'St.', 'Mont.', 'Mrs', ';-D']
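
The `cosine_similarity` helper in the cell above is one minus SciPy's cosine distance. To show what that measures, here is a plain-Python sketch of the same calculation and of ranking words by it. The three-dimensional vectors in `toy_vectors` are made up purely for illustration and have nothing to do with the real spaCy vectors.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: dot(x, y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_x * norm_y)


# Toy vectors, invented for this example only
toy_vectors = {
    "queen": [0.9, 0.8, 0.1],
    "king": [0.8, 0.9, 0.2],
    "dolphin": [0.1, 0.2, 0.9],
}

# Rank every word by its similarity to "queen", most similar first
target = toy_vectors["queen"]
ranked = sorted(
    toy_vectors,
    key=lambda word: cosine_similarity(target, toy_vectors[word]),
    reverse=True,
)
print(ranked)
```

A word is always most similar to itself, which is why "Queen" comes first in the output above as well.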


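Apart from "Queen" itself, the top matches are mostly titles and pronouns ("Miss", "she", "Mrs") along with some noise such as ";-D", rather than close synonyms. This is probably because the loop only covers the entries currently held in nlp.vocab, not every word that has a vector. Now let us take the other important words from the sentence, "Elizabeth", "Britain" and "Dolphin", and also "King", and check their similarity to "Queen".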

In [19]:
queen = nlp.vocab['Queen']
elizabeth = nlp.vocab['Elizabeth']
britain = nlp.vocab['Britain']
dolphin = nlp.vocab['Dolphin']
king = nlp.vocab['King']
 
print("Word similarity score between Queen and Elizabeth : ",queen.similarity(elizabeth))
print("Word similarity score between Queen and Britain : ",queen.similarity(britain))
print("Word similarity score between Queen and Dolphin : ",queen.similarity(dolphin))
print("Word similarity score between Queen and King : ",queen.similarity(king))

Word similarity score between Queen and Elizabeth :  0.5465951561927795
Word similarity score between Queen and Britain :  0.29826614260673523
Word similarity score between Queen and Dolphin :  0.19133657217025757
Word similarity score between Queen and King :  0.659298300743103


"King" is the most similar word, followed by "Elizabeth" and "Britain", while "Dolphin" is the least similar.

**References:**
1. [Complete Guide to Spacy](https://nlpforhackers.io/complete-guide-to-spacy/)
2. [Spacy documentation](https://spacy.io/)

**More to come. Stay tuned!**