# Challenge 1: spaCy Linguistic Features

Use spaCy to create a [linguistic features table](https://spacy.io/usage/linguistic-features). [Part-of-speech tagging](https://stackabuse.com/python-for-nlp-parts-of-speech-tagging-and-named-entity-recognition/) is a way to "[assign] parts of speech to individual words in a sentence".

What are the [ten parts of speech](https://www.theclassroom.com/10-parts-speech-8344653.html)?

[Click here to view part of speech tagging abbreviations](https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b)
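
As a quick reference, the fine-grained tags that show up later in this notebook can be kept in a small lookup table. This is a hand-written sketch, not part of spaCy: the names `TAG_GLOSSARY` and `describe_tag` are made up for illustration, and the list only covers the tags seen in this notebook.

```python
# Penn Treebank-style tags that appear later in this notebook, paired with
# plain-English descriptions. Hand-written for illustration; not exhaustive.
TAG_GLOSSARY = {
    "VBG": "verb, gerund or present participle",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "RB": "adverb",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner",
    "PRP": "personal pronoun",
    "CC": "coordinating conjunction",
}


def describe_tag(tag):
    """Return a readable description for `tag`, or a fallback if it is not listed."""
    return TAG_GLOSSARY.get(tag, "unknown tag")


print(describe_tag("VBG"))
print(describe_tag("NNS"))
```

With spaCy installed, `spacy.explain("VBG")` returns a similar description without the need to maintain a table by hand.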

In [6]:
import pandas as pd
from string import punctuation
import spacy

In [8]:
# Load the small pretrained model
nlp = spacy.load('en_core_web_sm')

Note: to compare the performance of spaCy's small, medium, and large models, see
[this Stack Overflow question](https://stackoverflow.com/questions/50487495/what-is-difference-between-en-core-web-sm-en-core-web-mdand-en-core-web-lg-mod).

In [9]:
type(nlp)

spacy.lang.en.English

In [10]:
# Define our function 
def lemmatize(tokens):
    """Return the lemmas for each word in `tokens`."""
    
    # spaCy models operate on strings, not lists, so we join the tokens back
    # into a single string of words
    words = ' '.join(tokens)
    
    # this line does all sorts of processing, including the lemmatization.
    # `doc` will be like a list of tokens that we can iterate over
    doc = nlp(words)
    
    # each token in `doc` holds information about that token. The `lemma_`
    # attribute holds the lemma of that token represented as a string. For
    # performance reasons, `lemma` (without the trailing underscore) holds
    # an integer hash of the lemma, which we'll rarely ever need.
    return [token.lemma_ for token in doc]

In [11]:
tokens = ('''I was thinking if off the top of your head you are aware of a 
generalizable comprehension to quickly stem all words in a list of tokens and 
how to quickly write up a one-minute example? This will be really useful for 
students interested in text preprocessing.''').split()

In [12]:
lemmas = lemmatize(tokens)
# Notice that spaCy 2.x lemmatizes pronouns (e.g. "you", "I", "your") to the
# placeholder '-PRON-'. It only tells us that they are pronouns, rather than
# giving us something like "your" -> "you". (spaCy 3 dropped this placeholder.)
print(lemmas)

['-PRON-', 'be', 'think', 'if', 'off', 'the', 'top', 'of', '-PRON-', 'head', '-PRON-', 'be', 'aware', 'of', 'a', 'generalizable', 'comprehension', 'to', 'quickly', 'stem', 'all', 'word', 'in', 'a', 'list', 'of', 'token', 'and', 'how', 'to', 'quickly', 'write', 'up', 'a', 'one', '-', 'minute', 'example', '?', 'this', 'will', 'be', 'really', 'useful', 'for', 'student', 'interested', 'in', 'text', 'preprocessing', '.']


### Another example

Now that our function is defined, let's try it on another sentence.

In [13]:
# Does this cell work correctly? Does it give us any extraneous information?
tokens2 = ("Thinking, jumping, running, quicking, eating, quickly - all in a days work for me, you, she, and he!").split()

lemmas2 = lemmatize(tokens2)
print(lemmas2)

['think', ',', 'jumping', ',', 'running', ',', 'quicke', ',', 'eat', ',', 'quickly', '-', 'all', 'in', 'a', 'day', 'work', 'for', '-PRON-', ',', '-PRON-', ',', '-PRON-', ',', 'and', '-PRON-', '!']


In [14]:
# Remove punctuation ...
sentence = "Thinking, jumping, running, quicking, eating, quickly - all in a days work for me, you, she, and he!"

for char in punctuation:
    sentence = sentence.replace(char, "")
    
print(sentence)

Thinking jumping running quicking eating quickly  all in a days work for me you she and he
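
The loop above calls `replace` once per punctuation character. A common alternative, sketched here with only the standard library, is to build a translation table once and delete every punctuation character in a single pass. The variable names `remove_punct` and `cleaned` are illustrative.

```python
from string import punctuation

sentence = "Thinking, jumping, running, quicking, eating, quickly - all in a days work for me, you, she, and he!"

# Build a translation table whose third argument lists the characters to delete
remove_punct = str.maketrans("", "", punctuation)
cleaned = sentence.translate(remove_punct)

print(cleaned)

# Splitting on whitespace collapses the double space left behind by the hyphen
print(cleaned.split())
```

Both approaches give the same string here; `translate` tends to matter more on long texts, where repeated `replace` calls rescan the string for every character.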


In [15]:
# Re-run the lemmatizer!
# Why is "quickly" not lemmatized? (Perhaps because it is an adverb? And is "quicking" even a word?)
tokens3 = sentence.split()

lemmas3 = lemmatize(tokens3)
print(lemmas3)

['think', 'jump', 'run', 'quicking', 'eat', 'quickly', 'all', 'in', 'a', 'day', 'work', 'for', '-PRON-', '-PRON-', '-PRON-', 'and', '-PRON-']


### Convert to data frame

Convert the linguistic features table to a data frame.

In [16]:
nlp(" ".join(tokens3))

Thinking jumping running quicking eating quickly all in a days work for me you she and he

In [22]:
# Define doc
doc = nlp(" ".join(tokens3))

d = []
for token in doc:
    d.append((token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
              token.shape_, token.is_alpha, token.is_stop))

d

[('Thinking', 'think', 'VERB', 'VBG', 'ROOT', 'Xxxxx', True, False),
 ('jumping', 'jump', 'VERB', 'VBG', 'xcomp', 'xxxx', True, False),
 ('running', 'run', 'VERB', 'VBG', 'amod', 'xxxx', True, False),
 ('quicking', 'quicking', 'NOUN', 'NN', 'dobj', 'xxxx', True, False),
 ('eating', 'eat', 'VERB', 'VBG', 'advcl', 'xxxx', True, False),
 ('quickly', 'quickly', 'ADV', 'RB', 'advmod', 'xxxx', True, False),
 ('all', 'all', 'ADV', 'RB', 'advmod', 'xxx', True, True),
 ('in', 'in', 'ADP', 'IN', 'prep', 'xx', True, True),
 ('a', 'a', 'DET', 'DT', 'det', 'x', True, True),
 ('days', 'day', 'NOUN', 'NNS', 'pobj', 'xxxx', True, False),
 ('work', 'work', 'NOUN', 'NN', 'advcl', 'xxxx', True, False),
 ('for', 'for', 'ADP', 'IN', 'prep', 'xxx', True, True),
 ('me', 'me', 'PRON', 'PRP', 'pobj', 'xx', True, True),
 ('you', 'you', 'PRON', 'PRP', 'npadvmod', 'xxx', True, True),
 ('she', 'she', 'PRON', 'PRP', 'appos', 'xxx', True, True),
 ('and', 'and', 'CCONJ', 'CC', 'cc', 'xxx', True, True),
 ('he', 'he', 

In [21]:
out = pd.DataFrame(d, columns=("text", "lemma", "pos", "tag", "dep", "shape", 
                               "is_alpha", "is_stop"))
out

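|   | text | lemma | pos | tag | dep | shape | is_alpha | is_stop |
|---|------|-------|-----|-----|-----|-------|----------|---------|
| 0 | Thinking | think | VERB | VBG | ROOT | Xxxxx | True | False |
| 1 | jumping | jump | VERB | VBG | xcomp | xxxx | True | False |
| 2 | running | run | VERB | VBG | amod | xxxx | True | False |
| 3 | quicking | quicking | NOUN | NN | dobj | xxxx | True | False |
| 4 | eating | eat | VERB | VBG | advcl | xxxx | True | False |
| 5 | quickly | quickly | ADV | RB | advmod | xxxx | True | False |
| 6 | all | all | ADV | RB | advmod | xxx | True | True |
| 7 | in | in | ADP | IN | prep | xx | True | True |
| 8 | a | a | DET | DT | det | x | True | True |
| 9 | days | day | NOUN | NNS | pobj | xxxx | True | False |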
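
Once the features are in tabular form, simple summaries become easy. The sketch below tallies the coarse part-of-speech labels for the ten rows shown above, using only the standard library; the `rows` list is copied by hand from that output rather than produced by spaCy.

```python
from collections import Counter

# (text, pos) pairs copied from the ten rows of the table above
rows = [
    ("Thinking", "VERB"), ("jumping", "VERB"), ("running", "VERB"),
    ("quicking", "NOUN"), ("eating", "VERB"), ("quickly", "ADV"),
    ("all", "ADV"), ("in", "ADP"), ("a", "DET"), ("days", "NOUN"),
]

# Count how often each part-of-speech label occurs
pos_counts = Counter(pos for _, pos in rows)

for pos, count in pos_counts.most_common():
    print(pos, count)
```

With the data frame itself, `out["pos"].value_counts()` produces the same kind of tally over all of its rows.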


In [None]:
# What if we want to keep the original pronouns instead of spaCy's '-PRON-' placeholder?
# We could fall back to the lowercased token text, like this:

doc = nlp(" ".join(tokens3))

d = []
for token in doc:
    if token.lemma_ != '-PRON-':
        d.append((token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop))
    else:
        d.append((token.text, token.lower_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop))
d
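
The two branches in the loop above differ only in which value goes into the lemma column, so the choice can be pulled out into a small helper. This is a sketch, not spaCy API: `readable_lemma` is a hypothetical name, and the stand-in objects below only mimic the `lower_` and `lemma_` attributes so the example runs without a loaded model.

```python
from types import SimpleNamespace


def readable_lemma(token):
    """Return the token's lemma, or its lowercased text if the lemma is '-PRON-'."""
    return token.lower_ if token.lemma_ == "-PRON-" else token.lemma_


# Stand-ins for spaCy tokens (made-up data, not real model output)
fake_tokens = [
    SimpleNamespace(text="She", lower_="she", lemma_="-PRON-"),
    SimpleNamespace(text="days", lower_="days", lemma_="day"),
]

print([readable_lemma(t) for t in fake_tokens])
```

Inside the loop, the tuple could then be built once with `readable_lemma(token)` in place of `token.lemma_`.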

# Challenge 2: Repeat

Repeat Challenge 1 with a text of your choosing.

In [18]:
## YOUR CODE HERE




# Challenge 3: Context

If you are doing anything that involves text for your individual final project, be sure to start thinking about **_why_** you might want to use n-grams, word2vec, or BERT instead of single-word tokenization. Start brainstorming now, even if it doesn't all make total sense yet!
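
To make the n-gram idea concrete, here is a minimal pure-Python sketch. The helper name `ngrams` is an assumption for illustration (libraries such as NLTK ship their own versions); it slides a window of `n` tokens across the list.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token tuples from `tokens`, in order."""
    # Zip the list against copies of itself shifted by 0, 1, ..., n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "all in a days work".split()

print(ngrams(tokens, 1))  # single-word tokens
print(ngrams(tokens, 2))  # bigrams keep neighbouring words together
```

Single tokens lose word order entirely, while bigrams such as `("days", "work")` preserve a little local context, which is one reason to consider them.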