# spaCy

Make sure you have installed spaCy. If not, run the cell below. We will cover the capabilities we examined in `scikit-learn` as well as some additional functionality in `spaCy`, including:

- Annotating Text
- Tokenizing Text
- POS tagging and dependencies
- Named entity extraction
- Similarity
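
As a rough sketch of what the similarity item refers to: spaCy's `similarity` methods default to the cosine similarity of the objects' vectors. The plain-Python version below computes that measure. The `cosine_similarity` helper and the three-dimensional vectors are made up for illustration and are not spaCy output.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat it as dissimilar to everything
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (real word vectors have hundreds of dimensions)
dog = [0.9, 0.1, 0.3]
puppy = [0.8, 0.2, 0.35]
bicycle = [0.1, 0.9, 0.2]

print(cosine_similarity(dog, puppy))    # close to 1: vectors point the same way
print(cosine_similarity(dog, bicycle))  # much smaller: vectors point apart
```

With a model that ships word vectors, the equivalent call would be along the lines of `nlp('dog').similarity(nlp('puppy'))`.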

In [2]:
%pip install spacy


In [1]:
import spacy

In [2]:
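# Shortcut links such as 'en' were removed in spaCy 3; load the model by its package name
nlp = spacy.load('en_core_web_sm')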

In [3]:
doc = nlp('This is a sentence about a dog, a priest, and a bicycle in North Dakota named Microsoft.')

### The spaCy pipeline

![](images/spacy_pipeline.svg)

In [6]:
for word in doc:
    print((word.text, word.pos_, word.dep_))

('This', 'DET', 'nsubj')
('is', 'VERB', 'ROOT')
('a', 'DET', 'det')
('sentence', 'NOUN', 'attr')
('about', 'ADP', 'prep')
('a', 'DET', 'det')
('dog', 'NOUN', 'pobj')
(',', 'PUNCT', 'punct')
('a', 'DET', 'det')
('priest', 'NOUN', 'appos')
(',', 'PUNCT', 'punct')
('and', 'CCONJ', 'cc')
('a', 'DET', 'det')
('bicycle', 'NOUN', 'conj')
('in', 'ADP', 'prep')
('North', 'PROPN', 'compound')
('Dakota', 'PROPN', 'pobj')
('named', 'VERB', 'acl')
('Microsoft', 'PROPN', 'oprd')
('.', 'PUNCT', 'punct')


In [7]:
for word in doc:
    print(word.text)

This
is
a
sentence
about
a
dog
,
a
priest
,
and
a
bicycle
in
North
Dakota
named
Microsoft
.


In [8]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

This this DET DT nsubj Xxxx True False
is be VERB VBZ ROOT xx True True
a a DET DT det x True True
sentence sentence NOUN NN attr xxxx True False
about about ADP IN prep xxxx True True
a a DET DT det x True True
dog dog NOUN NN pobj xxx True False
, , PUNCT , punct , False False
a a DET DT det x True True
priest priest NOUN NN appos xxxx True False
, , PUNCT , punct , False False
and and CCONJ CC cc xxx True True
a a DET DT det x True True
bicycle bicycle NOUN NN conj xxxx True False
in in ADP IN prep xx True True
North north PROPN NNP compound Xxxxx True False
Dakota dakota PROPN NNP pobj Xxxxx True False
named name VERB VBD acl xxxx True False
Microsoft microsoft PROPN NNP oprd Xxxxx True False
. . PUNCT . punct . False False
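
The `shape_` column above abbreviates each token's orthography. The sketch below mimics the pattern visible in that output: uppercase letters become `X`, lowercase letters `x`, digits `d`, and a run of the same symbol is capped at four. `word_shape` here is an illustrative reimplementation, not a call into spaCy, and it may differ from the library on edge cases.

```python
def word_shape(text):
    """Approximate a token's shape string, e.g. 'Microsoft' -> 'Xxxxx'."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch  # punctuation is kept as-is
        # Count how many times the same symbol has repeated in a row
        run = run + 1 if sym == last else 1
        last = sym
        if run <= 4:
            shape.append(sym)
    return "".join(shape)

for word in ["This", "is", "sentence", "Microsoft", ","]:
    print(word, word_shape(word))
```

Shapes like these let a model generalize over capitalization and digit patterns without memorizing every word.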


### POS Tags and Dependency Parsing

spaCy POS Tags: https://spacy.io/api/annotation



spaCy Dependencies: https://spacy.io/usage/linguistic-features


In [4]:
from spacy import displacy

In [5]:
doc = nlp('This is a sentence')
opts = {'color': 'white', 'bg': 'lightblue', 'compact': True}
displacy.render(doc, style='dep', jupyter=True, options = opts)

### Named Entity Recognition

spaCy entities: https://spacy.io/usage/linguistic-features#entity-types

In [7]:
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
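
The two numbers printed for each entity are character offsets into the original string, so slicing the text with them recovers the entity. The short check below uses plain Python; the `(start, end, label)` triples are copied from the output above rather than produced by a model.

```python
text = 'Apple is looking at buying U.K. startup for $1 billion'

# (start_char, end_char, label) triples copied from the printed entities
spans = [(0, 5, 'ORG'), (27, 31, 'GPE'), (44, 54, 'MONEY')]

# Slicing with the offsets returns exactly the entity text
recovered = [(text[start:end], label) for start, end, label in spans]
for entity_text, label in recovered:
    print(repr(entity_text), label)
```

Character offsets are handy when the entities need to be highlighted in, or joined back to, the raw text, for example a column in a DataFrame.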


In [8]:
displacy.render(doc, style='ent', jupyter=True)

In [9]:
import pandas as pd

In [10]:
train = pd.read_csv('data/ny_donors.csv')

In [11]:
train.head()

Unnamed: 0.1,Unnamed: 0,id,teacher_id,teacher_prefix,school_state,project_submitted_datetime,project_grade_category,project_subject_categories,project_subject_subcategories,project_title,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_resource_summary,teacher_number_of_previously_posted_projects,project_is_approved
0,13,p173555,9b7f355e34bc9ca5740779b69ee14d8e,Mrs.,NY,2016-11-15 22:13:39,Grades 3-5,Literacy & Language,Literature & Writing,Extra! Extra! Read all about it!! We love to ...,"Each day my fifth graders walk into our \""home...",My students have had a taste of good reading! ...,,,"My students need good books, with life lessons...",5,0
1,21,p116615,b3593a375f2cf7fd4469b928ffac1c95,Mrs.,NY,2016-09-30 08:12:37,Grades PreK-2,"Applied Learning, Music & The Arts","Early Development, Performing Arts",Oral Language Development through the use of p...,Teaching kindergarten in a diverse district po...,Students don't often get the chance to 'play' ...,,,My students need the opportunity to develop or...,0,1
2,30,p081434,17563b7d138a9ca1e7308f0f480e7d09,Ms.,NY,2016-12-06 21:19:44,Grades PreK-2,"Health & Sports, Special Needs","Health & Wellness, Special Needs",Seating Like a Boss- Our 21st Century Room,"\""Great job buddy!\"" is something I hear every...",In order to promote essential learning skills ...,,,My students need an opportunity to sit and wor...,9,0
3,32,p156550,a902ce7ebdce6f236873d6b443c3ca08,Ms.,NY,2017-03-30 20:05:08,Grades 9-12,"Applied Learning, Special Needs","Other, Special Needs",Keeping Students Focused with Fun and Technology!,"Attending a District 75 high school in Bronx, ...","With a classroom lacking technology, student i...",,,My students need a wider variety of Interactiv...,0,1
4,59,p186381,da67f09a612a32fa30c9c80bed7e6365,Mrs.,NY,2016-09-24 11:36:26,Grades PreK-2,Literacy & Language,"Literacy, Literature & Writing",Listening & Learning in First Grade,Who doesn't enjoy listening to a great story? ...,The Listening Center in the classroom is alway...,,,My students need wireless headphones to use in...,9,1


In [13]:
essay = train.project_essay_1[10]

In [14]:
essay

"I have the amazing opportunity to collaborate with over 500 students in our STEM (Science, Technology, Engineering, Math) Inquiry-Based Learning Lab.  We are located in a small, rural community in Western New York.  (We can see cows from our classroom!)  I work with every student in our K-6 population on STEM project based learning.  \\r\\n\\r\\nWe would like to utilize engineering, robotics, and programming to develop solutions to real-world STEM problems.  Please help us move our youth's focus from passive consumers of technology to innovative creators of programs and inventions.  This will allow our students to begin thinking about career paths at an early-age and to develop confidence with a positive mindset concerning the STEM content areas."

In [17]:
doc = nlp(essay)
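
The essay printed above contains literal `\r\n` sequences (a backslash followed by a letter, not real line breaks), which end up as stray tokens once the text is parsed. One possible cleanup step, sketched under that assumption, is shown below. `clean_essay` is a hypothetical helper and `sample` is a shortened, made-up string in the style of the essay.

```python
import re

def clean_essay(text):
    """Replace escaped and real line breaks with spaces, then collapse whitespace."""
    # r'\\r' matches a literal backslash followed by 'r'; [\r\n] matches real breaks
    text = re.sub(r'\\r|\\n|[\r\n]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical sample containing the same literal backslash sequences
sample = 'I work with every student.  \\r\\n\\r\\nWe would like to utilize engineering.'
cleaned = clean_essay(sample)
print(cleaned)
```

Passing the cleaned string to `nlp` instead of the raw essay would avoid the need to filter leftover fragments token by token later on.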

In [18]:
displacy.render(doc, style='ent', jupyter=True)

In [19]:
doc.user_data['title'] = train.project_title[10]  # title from the same row as the essay
displacy.render(doc, style='ent', jupyter=True)

In [20]:
sents = essay.split('.')

In [21]:
sents[0]

'I have the amazing opportunity to collaborate with over 500 students in our STEM (Science, Technology, Engineering, Math) Inquiry-Based Learning Lab'

In [22]:
sent = sents[0].replace('\\', '')

In [23]:
sent

'I have the amazing opportunity to collaborate with over 500 students in our STEM (Science, Technology, Engineering, Math) Inquiry-Based Learning Lab'

In [24]:
doc = nlp(sent)

In [25]:
for word in doc:
    print(word.text, word.pos_)

I PRON
have VERB
the DET
amazing ADJ
opportunity NOUN
to PART
collaborate VERB
with ADP
over ADP
500 NUM
students NOUN
in ADP
our ADJ
STEM NOUN
( PUNCT
Science NOUN
, PUNCT
Technology PROPN
, PUNCT
Engineering PROPN
, PUNCT
Math PROPN
) PUNCT
Inquiry PROPN
- PUNCT
Based VERB
Learning PROPN
Lab NOUN


In [26]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

500 56 59 CARDINAL
STEM 76 80 ORG
Science, Technology, Engineering 82 114 ORG


In [15]:
from spacy.lang.en.stop_words import STOP_WORDS

In [28]:
sentence = []
for w in doc:
    if w.text != 'n' and not w.is_stop and not w.is_punct and not w.like_num:
        sentence.append(w.lemma_)

In [29]:
sentence

['-PRON-',
 'amazing',
 'opportunity',
 'collaborate',
 'student',
 'stem',
 'science',
 'technology',
 'engineering',
 'math',
 'inquiry',
 'base',
 'learning',
 'lab']

In [31]:
noun_chunks = list(doc.noun_chunks)
print(noun_chunks)  

sentences = list(doc.sents)

[I, the amazing opportunity, over 500 students, Science, Technology, Engineering, Math]


### Essay Exercise


- Can you locate a few essays that are talking about helping students improve their mathematics?  
- Tell me something about the kinds of needs a few teachers in different areas describe in the `project_resource_summary` column.  Create a visualization of the dependency tree for two different teachers.

In [16]:
essays = train.project_essay_2

In [18]:
essays[essays.str.contains('math')]

33       Our classroom has a Smartboard that has taken ...
35       My second grade students love math even though...
36       I am requesting supplies such as laminating sh...
37       PROJECT\r\n\r\nThe use of whiteboards is almos...
50       My students all learn in different ways, and m...
51       My first graders love working on the computer,...
54       Having LEGO WeDo sets in our classroom would b...
57       Chromebooks are fantastic for the quality and ...
62       Students will use the ipads during research wr...
63       General school supplies are expensive!  Our cl...
65       Having these 2 computers will help enhance my ...
74       HAVING A PROPER CAMCORDER TO RECORD A VARIETY ...
76       My students are having such a great time recor...
97       Ever since my students have realized that dono...
106      I teach math and science to 52 amazing 6th-gra...
108      The chrome books that I have asked for will be...
119      The students in class 1-306 at P.S. 110 in Car.
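
Note that `str.contains('math')` is a case-sensitive substring match, so it skips "Math" at the start of a sentence and would also accept words such as "aftermath". A stricter matcher could look like the sketch below. `MATH_PATTERN` and `mentions_math` are hypothetical names, and the sample strings are invented for the demonstration.

```python
import re

# Whole-word match for math / maths / mathematics / mathematical, ignoring case
MATH_PATTERN = re.compile(r'\bmath(s|ematics|ematical)?\b', re.IGNORECASE)

def mentions_math(text):
    """True if text is a string that mentions math as a whole word."""
    return isinstance(text, str) and MATH_PATTERN.search(text) is not None

# Invented examples, including a missing value as it would appear in a column
samples = [
    'My second grade students love math.',
    'MATH centers keep us busy.',
    'The aftermath of the storm closed our school.',
    None,
]
results = [mentions_math(s) for s in samples]
print(results)
```

On the pandas side, the same function could be applied with `essays[essays.apply(mentions_math)]`, and `str.contains` itself accepts `case=False` and `na=False` for a lighter-weight fix.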

In [20]:
essays[106]

'I teach math and science to 52 amazing 6th-grade students.  Most of the time we are working with partners or in groups.  When it is time to work alone sometimes it is difficult for my students to stay focused on the assignment in front of them.   These privacy boards are easy to move so that when they need to collaborate they can remove them, but when they need to focus on their own work, like for a test or quiz, we can use them to help.  \\r\\nI am also requesting pencils, eraser tops, and a pencil sharpener so my students always have the supplies they need so they are always ready to work!'

In [21]:
doc = nlp(essays[106])

In [31]:
sents = []
for sent in doc.sents:
    sents.append(sent)

type(sents[4])

spacy.tokens.span.Span

In [33]:
doc = nlp(str(sents[3]))
opts = {'color': 'white', 'bg': 'lightblue', 'compact': True}
displacy.render(doc, style = 'dep', jupyter = True, options = opts)

In [35]:
# Re-parse the full essay: str(sents) would parse the list's printed form, brackets included
doc = nlp(essays[106])
displacy.render(doc, style = 'ent', jupyter = True)