## What we'll be doing:
- Think about the problem
- Take a dataset (https://www.kaggle.com/stackoverflow/stacksample)
- Treat the language as more than just plain text (clue: natural language)
- Demonstrate spaCy and its capabilities
- Benchmark the results on a couple of models.

In [2]:
import pandas as pd

df = pd.read_csv("Questions.csv", nrows=1_000_000,
                 encoding="ISO-8859-1", usecols=['Title', 'Id'])

In [7]:
titles = [_ for _ in df['Title']]

In [8]:
df.head()

    Id                                              Title
0   80  SQLStatement.execute() - multiple queries in o...
1   90  Good branching and merging tutorials for Torto...
2  120                                  ASP.NET Site Maps
3  180                 Function for creating color wheels
4  260  Adding scripting functionality to .NET applica...


In [9]:
def has_golang(text):
    return " go " in text

g = (title for title in titles if has_golang(title))
[next(g) for i in range(2)]

['Where does Console.WriteLine go in ASP.NET?',
 'Should try...catch go inside or outside a loop?']

#### Uh oh! Simple text matching (even with a regex) can't pick out the right titles, because it has no understanding of how a word is used in an English sentence. But wait... spaCy comes to the rescue! Let's see how...
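To make the regex point concrete, here is a minimal sketch using only the standard library. The pattern name `GO_PATTERN` and the three sample titles are picked for illustration (two come from the output above, one from later in this notebook). Even a word-boundary, case-insensitive regex flags all three, because it cannot tell the verb from the language:

```python
import re

# Word-boundary, case-insensitive match for "go" or "golang".
GO_PATTERN = re.compile(r"\b(?:go|golang)\b", re.IGNORECASE)

sample_titles = [
    "Where does Console.WriteLine go in ASP.NET?",      # "go" is a verb
    "Should try...catch go inside or outside a loop?",  # "go" is a verb
    "Shared library in Go?",                            # "Go" is the language
]

regex_matches = [t for t in sample_titles if GO_PATTERN.search(t)]
print(len(regex_matches))  # 3: the regex cannot separate the cases
```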

In [10]:
!pip install spacy

Collecting spacy
  Downloading spacy-3.0.6-cp38-cp38-manylinux2014_x86_64.whl (13.0 MB)
Collecting spacy-legacy<3.1.0,>=3.0.4
  Downloading spacy_legacy-3.0.5-py2.py3-none-any.whl (12 kB)
Collecting pydantic<1.8.0,>=1.7.1
  Downloading pydantic-1.7.3-cp38-cp38-manylinux2014_x86_64.whl (12.2 MB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.1-cp38-cp38-manylinux2014_x86_64.whl (458 kB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.5-cp38-cp38-manylinux2014_x86_64.whl (35 kB)
Collecting typer<0.4.0,>=0.3.0
  Downloading typer-0.3.2-py3-none-any.whl (21 kB)
Collecting catalogue<2.1.0,>=2.0.3
  Downloading catalogue-2.0.4-py3-none-any.whl (16 kB)
Collecting blis<0.8.0,>=0.4.0
  Downloading blis-0.7.4-cp38-cp38-manylinux2014_x86_64.whl (9.8 MB)

In [12]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.0.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [13]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [16]:
[t for t in nlp("Hey I am Anant!")]

[Hey, I, am, Anant, !]

In [15]:
type(nlp)

spacy.lang.en.English

In [17]:
doc = nlp("Hey I'm Anant!")

In [18]:
doc[0]

Hey

In [19]:
type(doc[0])

spacy.tokens.token.Token

In [20]:
from spacy import displacy

displacy.render(doc)

#### Isn't it cool? What you spent hours working out in NLP class is just one command away. We can now see the dependency graph of the sentence with no effort at all...

In [21]:
spacy.explain('intj')

'interjection'

In [22]:
for t in doc:
    print(t, t.pos_, t.dep_)

Hey INTJ intj
I PRON nsubj
'm VERB ROOT
Anant PROPN attr
! PUNCT punct


### Let's go back and fix what went wrong with the vanilla Python approach...

In [23]:
doc = nlp('Where does Console.WriteLine go in ASP.NET?')
for t in doc:
    print(t, t.pos_, t.dep_)

Where ADV advmod
does VERB ROOT
Console PROPN nsubj
. PUNCT punct
WriteLine PROPN nsubj
go VERB ROOT
in ADP prep
ASP.NET PROPN pobj
? PUNCT punct


### Ahhh... There you go! We can filter titles based on the POS tag of the word. *go* here is a *VERB*... Let's see how we can build some logic out of this information!

In [24]:
# Keep only the titles that contain "go" (case-insensitive substring match)...
df = pd.read_csv("Questions.csv", nrows=2_000_000, encoding="ISO-8859-1", usecols=['Title',"Id"])

titles = [_ for _ in df.loc[lambda d: d['Title'].str.lower().str.contains("go")]['Title']]

In [25]:
def has_golang(text):
    doc = nlp(text)
    for t in doc:
        if t.lower_ in ["go","golang"]:
            if t.pos_ != "VERB":
                return True
    return False

g = (title for title in titles if has_golang(title))
[next(g) for i in range(10)]

['Removing all event handlers in one go',
 'How to Create a Dropdown List Hyperlink without the GO button?',
 'Embedding instead of inheritance in Go',
 'Shared library in Go?',
 'multi package makefile example for go',
 "What's the point of having pointers in Go?",
 'Simulate a tcp connection in Go',
 'Trouble reading from a socket in go',
 "How to listen for iPhone keyboard action/touch (ex, 'GO', 'Search', etc)",
 'jQuery UI problem: why do the elements go flying around the screen?']
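A small side note on the `[next(g) for i in range(10)]` idiom used above: it raises `StopIteration` if the generator yields fewer than ten items. A minimal sketch of a safer alternative, assuming a hypothetical helper named `take` built on `itertools.islice`:

```python
from itertools import islice

def take(iterable, n):
    """Return up to the first n items of iterable as a list."""
    return list(islice(iterable, n))

# A generator with only three items: asking for ten simply returns all three.
short_gen = (word for word in ["go", "golang", "gopher"])
first_items = take(short_gen, 10)
print(first_items)  # ['go', 'golang', 'gopher']
```

In this notebook the call would become `take(g, 10)`.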

### Whoa... This is a lot better, isn't it? Still, there are titles where go is neither a VERB nor the language we want... Let's dive deeper and build better logic out of this.

In [26]:
displacy.render(nlp('Embedding instead of inheritance in Go'))

In [27]:
spacy.explain("pobj")

'object of preposition'

In [28]:
displacy.render(nlp('Removing all event handlers in one go'))

### Umm okay... Intriguing, let's confirm this idea with another example...

In [29]:
displacy.render(nlp('How to Create a Dropdown List Hyperlink without the GO button?'))

In [31]:
displacy.render(nlp("multi package makefile example for go"))

### You didn't notice it, did you? It is subtle... Look closely and you will see a pobj dependency on ```go```

In [36]:
%%time
def has_golang(text):
    doc = nlp(text)
    for t in doc:
        if t.lower_ in ["go","golang"]:
            if t.pos_ != "VERB":
                if t.dep_ == "pobj":     # Also require the dependency relation to make the rule stricter
                    return True
    return False

g = (title for title in titles if has_golang(title))
[next(g) for i in range(10)]

CPU times: user 20.3 s, sys: 11.9 ms, total: 20.3 s
Wall time: 20.3 s


['Embedding instead of inheritance in Go',
 'Shared library in Go?',
 'multi package makefile example for go',
 "What's the point of having pointers in Go?",
 'Simulate a tcp connection in Go',
 'Trouble reading from a socket in go',
 "What's the simplest way to edit conflicted files in one go when using git and an editor like Vim or textmate?",
 'Convert string to integer type in Go?',
 'Is there any automated conversion from Go to Python?',
 'Implementing the â\x80\x98deferâ\x80\x99 statement from Go in Objective-C?']

### Whoa... 9 out of 10 are spot on! Let's make it better and optimise it for performance.

In [37]:
%%time
def has_golang(doc):
    for t in doc:
        if t.lower_ in ["go","golang"]:
            if t.pos_ != "VERB":
                if t.dep_ == "pobj":   
                    return True
    return False

g = (doc for doc in nlp.pipe(titles) if has_golang(doc))     #Note that we added nlp.pipe for performance
[next(g) for i in range(10)]

CPU times: user 4.12 s, sys: 2.85 ms, total: 4.12 s
Wall time: 4.12 s


[Embedding instead of inheritance in Go,
 Shared library in Go?,
 multi package makefile example for go,
 What's the point of having pointers in Go?,
 Simulate a tcp connection in Go,
 Trouble reading from a socket in go,
 What's the simplest way to edit conflicted files in one go when using git and an editor like Vim or textmate?,
 Convert string to integer type in Go?,
 Is there any automated conversion from Go to Python?,
 Implementing the âdeferâ statement from Go in Objective-C?]

### Did you see the difference in time between the two code segments? 20.3 s versus a mere 4.12 s... That's nearly 5 times faster than its counterpart... Wowza!
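The speed-up can be checked from the two `%%time` outputs above. This is just the arithmetic on the reported wall times; the variable names are for illustration only:

```python
per_title_seconds = 20.3   # wall time when has_golang calls nlp(text) per title
piped_seconds = 4.12       # wall time when the titles go through nlp.pipe

speedup = per_title_seconds / piped_seconds
print(f"{speedup:.2f}x")   # 4.93x
```

Both timings cover only the work needed to find the first ten matches, so they are a rough indication rather than a full benchmark.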

In [38]:
# Let's try disabling the ner module for now and see the performance...
nlp = spacy.load('en_core_web_sm', disable=['ner'])

df = pd.read_csv("Questions.csv", nrows=2_000_000, encoding="ISO-8859-1", usecols=['Title',"Id"])

titles = [_ for _ in df['Title']]

In [39]:
%%time
def has_golang(doc):
    for t in doc:
        if t.lower_ in ["go","golang"]:
            if t.pos_ != "VERB":
                if t.dep_ == "pobj":   
                    return True
    return False

g = (doc for doc in nlp.pipe(titles) if has_golang(doc))     #Note that we added nlp.pipe for performance
[next(g) for i in range(10)]

CPU times: user 3.58 s, sys: 7.23 ms, total: 3.58 s
Wall time: 3.58 s


[Embedding instead of inheritance in Go,
 Shared library in Go?,
 multi package makefile example for go,
 What's the point of having pointers in Go?,
 Simulate a tcp connection in Go,
 Trouble reading from a socket in go,
 What's the simplest way to edit conflicted files in one go when using git and an editor like Vim or textmate?,
 Convert string to integer type in Go?,
 Is there any automated conversion from Go to Python?,
 Implementing the âdeferâ statement from Go in Objective-C?]

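#### Not much of a change, but still a little faster... (Note that `titles` is no longer pre-filtered for "go" in this run, so the two timings are not strictly comparable.) Let's model the data now.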

In [45]:
df_tags = pd.read_csv("Tags.csv")
go_ids = df_tags.loc[lambda d: d['Tag'] == "go"]['Id']   #Finding out the Ids of Go Tags

def has_go_token(doc):
    for t in doc:
        if t.lower_ in ['go', 'golang']:
            return True
    return False

all_go_sentences = df.loc[lambda d: d['Id'].isin(go_ids)]['Title'].tolist()
# The line above collects every title in df whose Id appears among the go-tagged Ids from the Tags DataFrame.

detectable = [d.text for d in nlp.pipe(all_go_sentences) if has_go_token(d)]
# Keep only the go-tagged titles that actually contain the token go or golang.
# (Note that we only look for the word itself here; we are not trying to detect the concept.)

non_detectable = (df
                  .loc[lambda d: ~d['Id'].isin(go_ids)]
                  .loc[lambda d: d['Title'].str.lower().str.contains("go")]
                 ['Title'].tolist())
# Collect the titles that are not tagged go but still contain the string "go".

non_detectable = [d.text for d in nlp.pipe(non_detectable) if has_go_token(d)]

len(all_go_sentences), len(detectable), len(non_detectable)

(1858, 1208, 1696)
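It helps to read the first two numbers together. A quick calculation on the counts printed above (variable names are for illustration only) shows what share of the go-tagged titles mention the word at all:

```python
n_go_tagged = 1858    # len(all_go_sentences)
n_detectable = 1208   # len(detectable)

coverage = n_detectable / n_go_tagged
print(f"{coverage:.1%}")  # 65.0%
```

So roughly a third of the go-tagged titles never contain go or golang as a token, and no token-based rule can find them. The recall figures below are measured against the 1208 detectable titles only.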

In [46]:
model_name = "en_core_web_sm"
model = spacy.load(model_name, disable=['ner'])

def has_go_token(doc):
    for t in doc:
        if t.lower_ in ['go', 'golang']:
            if t.pos_ != "VERB":
                if t.dep_ == "pobj":
                    return True
    return False

method = "not-verb-but-pobj"

correct = sum(has_go_token(doc) for doc in model.pipe(detectable))
# True positives: go-tagged titles that the rule also flags.

wrong = sum(has_go_token(doc) for doc in model.pipe(non_detectable))
# False positives: titles that contain the word go but are not tagged go,
# yet the rule still flags them (wrongly suggesting they are about Go).

precision = correct/(correct+wrong)
# Precision: of all the titles predicted to be about Go, how many actually are.

recall = correct/len(detectable)
# Recall: of all the detectable Go titles, how many were predicted to be about Go.

accuracy = (correct + len(non_detectable) - wrong)/(len(detectable) + len(non_detectable))
# Accuracy: share of all titles (detectable and non-detectable) that were classified correctly.

f"{precision}, {recall}, {accuracy}, {model_name}, {method}"

'0.9615384615384616, 0.3518211920529801, 0.7245179063360881, en_core_web_sm, not-verb-but-pobj'
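The three formulas can be wrapped in one small function and checked against the output above. This is a sketch: `rule_metrics` is a hypothetical helper, and the counts 425 and 17 are not printed by the notebook but back-derived from the reported recall and precision.

```python
def rule_metrics(correct, wrong, n_detectable, n_non_detectable):
    """Precision, recall and accuracy, using the same formulas as the cell above."""
    precision = correct / (correct + wrong)
    recall = correct / n_detectable
    accuracy = (correct + n_non_detectable - wrong) / (n_detectable + n_non_detectable)
    return precision, recall, accuracy

# Assumed counts, back-derived from the printed metrics:
# recall 0.3518 of 1208 detectable titles gives 425 hits,
# and precision 0.9615 then implies 17 false positives.
precision, recall, accuracy = rule_metrics(
    correct=425, wrong=17, n_detectable=1208, n_non_detectable=1696
)
print(f"{precision:.4f}, {recall:.4f}, {accuracy:.4f}")  # 0.9615, 0.3518, 0.7245
```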

In [47]:
model_name = "en_core_web_sm"
model = spacy.load(model_name, disable=['ner'])

def has_go_token(doc):
    for t in doc:
        if t.lower_ in ['go', 'golang']:
            if t.pos_ != "VERB":
                return True
    return False

method = "not-verb"

correct = sum(has_go_token(doc) for doc in model.pipe(detectable))
wrong = sum(has_go_token(doc) for doc in model.pipe(non_detectable))
precision = correct/(correct+wrong)
recall = correct/len(detectable)
accuracy = (correct + len(non_detectable) - wrong)/(len(detectable) + len(non_detectable))
f"{precision}, {recall}, {accuracy}, {model_name}, {method}"

'0.9229144667370645, 0.7235099337748344, 0.8598484848484849, en_core_web_sm, not-verb'

### As we can see, the recall of the second rule (not-verb) is far better: out of all the detectable Go titles, it finds 72% of them, which is very good...

In [48]:
model_name = "en_core_web_sm"
model = spacy.load(model_name, disable=['ner'])

def has_go_token(doc):
    for t in doc:
        if t.lower_ in ['go', 'golang']:
            if t.pos_ == "NOUN":
                return True
    return False

method = "is-noun"

correct = sum(has_go_token(doc) for doc in model.pipe(detectable))
wrong = sum(has_go_token(doc) for doc in model.pipe(non_detectable))
precision = correct/(correct+wrong)
recall = correct/len(detectable)
accuracy = (correct + len(non_detectable) - wrong)/(len(detectable) + len(non_detectable))
f"{precision}, {recall}, {accuracy}, {model_name}, {method}"

'0.8674698795180723, 0.17880794701986755, 0.647038567493113, en_core_web_sm, is-noun'

In [49]:
model_name = "en_core_web_sm"
model = spacy.load(model_name, disable=['ner'])

def has_go_token(doc):
    for t in doc:
        if t.lower_ in ['go', 'golang']:
            if t.pos_ == "NOUN":
                if t.dep_ == "pobj":
                    return True
    return False

method = "is-noun-is-pobj"

correct = sum(has_go_token(doc) for doc in model.pipe(detectable))
wrong = sum(has_go_token(doc) for doc in model.pipe(non_detectable))
precision = correct/(correct+wrong)
recall = correct/len(detectable)
accuracy = (correct + len(non_detectable) - wrong)/(len(detectable) + len(non_detectable))
f"{precision}, {recall}, {accuracy}, {model_name}, {method}"

'0.9054054054054054, 0.11092715231788079, 0.6253443526170799, en_core_web_sm, is-noun-is-pobj'
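The four cells above differ only in the condition applied to the go token, so the rules can be written once and looped over. The sketch below is not the notebook's code: `Token` is a stand-in namedtuple carrying just the three attributes the rules read (so it runs without a model), and the POS and dependency tags on the two toy titles are hand-written assumptions, not spaCy output.

```python
from collections import namedtuple

# Stand-in for a spaCy token: only the attributes the rules need.
Token = namedtuple("Token", ["lower_", "pos_", "dep_"])

GO_WORDS = {"go", "golang"}

# Each rule receives a token that is already known to be "go" or "golang".
RULES = {
    "not-verb-but-pobj": lambda t: t.pos_ != "VERB" and t.dep_ == "pobj",
    "not-verb": lambda t: t.pos_ != "VERB",
    "is-noun": lambda t: t.pos_ == "NOUN",
    "is-noun-is-pobj": lambda t: t.pos_ == "NOUN" and t.dep_ == "pobj",
}

def has_go_token(doc, rule):
    """True if any go/golang token in doc satisfies the rule."""
    return any(t.lower_ in GO_WORDS and rule(t) for t in doc)

# Two toy "docs" with hand-assigned tags (assumptions for illustration).
go_title = [Token("library", "NOUN", "ROOT"),
            Token("in", "ADP", "prep"),
            Token("go", "PROPN", "pobj")]
verb_title = [Token("where", "ADV", "advmod"),
              Token("does", "VERB", "ROOT"),
              Token("it", "PRON", "nsubj"),
              Token("go", "VERB", "ROOT")]

results = {name: (has_go_token(go_title, rule), has_go_token(verb_title, rule))
           for name, rule in RULES.items()}
for name, flags in results.items():
    print(name, flags)
```

Because a real spaCy `Doc` also iterates over tokens with `lower_`, `pos_` and `dep_`, the same `has_go_token(doc, rule)` should work on the output of `model.pipe(...)`. On the toy data, the two noun rules miss the Go title because the token is tagged PROPN rather than NOUN, which may be one reason their recall is so low.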

### The second rule (not-verb) has the best metrics overall: the highest recall and accuracy, with precision only slightly below the stricter not-verb-but-pobj rule.