# Social Media and Human-Computer Interaction - Part 3



### *Goal*: Use social media posts to explore the application of text and natural language processing to see what might be learned from online interactions.

Specifically, we will retrieve, annotate, process, and interpret Twitter data on health-related issues such as depression.

--- 
References:
* [Mining Twitter Data with Python (Part 1: Collecting data)](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)
* The [Tweepy Python API for Twitter](http://www.tweepy.org/)

Required Software
* [Python 3](https://www.python.org)
* [NumPy](http://www.numpy.org) - for preparing data for plotting
* [Matplotlib](https://matplotlib.org) - plots and graphs
* [jsonpickle](https://jsonpickle.github.io) - for storing tweets
---

In [1]:
%matplotlib inline

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import jsonpickle
import json
import random
import tweepy
import spacy
import time
from datetime import datetime

# 3.0 Introduction

This module continues the Social Media Data Science module started in [Part 1](SocialMedia - Part 1.ipynb), covering the natural language processing analysis of our tweet corpus, including:

  1. Natural Language Processing
  2. Construction of classifiers
  
Our case study will apply these topics to Twitter discussions of smoking and vaping. Although details of the tools used to access data and the format and content of the data may differ for various services, the strategies and procedures used to analyze the data will generalize to other tools.

## 3.0.1 Setup

Before we dig in, we must grab a bit of code from [Part 1](SocialMedia - Part 1.ipynb) and [Part 2](SocialMedia - Part 2.ipynb):

1. Our Tweets class
2. Our Twitter API keys - be sure to copy the keys that you generated when you completed [Part 1](SocialMedia - Part 1.ipynb).
3. Configuration of our Twitter connection

In [2]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.tweets={}
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        searchTime=datetime.now()
        while (self.countTweets() < corpus_size):
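            # tweet_mode="extended" returns the 'full_text' field that getText() relies on
            new_tweets = api.search(term,lang="en",count=10,tweet_mode="extended")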
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt,searchTime,term)
            time.sleep(5)
                
    def addTweet(self,tweet,searchTime,term="",count=0):
        id = tweet['id_str']
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
            self.tweets[id]['searchTime']=searchTime
            self.tweets[id]['searchTerm']=term
        self.tweets[id]['count'] = self.tweets[id]['count'] +1
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.tweets.keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)   
            self.tweets=incontents
        
    def getSearchTerm(self,id):
        return self.tweets[id]['searchTerm']
    
    def getSearchTime(self,id):
        return self.tweets[id]['searchTime']
    
    def getText(self,id):
        tweet = self.getTweet(id)
        text=tweet['full_text']
        if 'retweeted_status' in tweet:
            original = tweet['retweeted_status']
            text=original['full_text']
        return text
                
    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
        
   
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
 
    def getCodes(self,id):
        tweet=self.getTweet(id)
        return tweet['codes']
    
    # NEW -ROUTINE TO GET PROFILE
    def getCodeProfile(self):
        summary={}
        for id in self.tweets.keys():
            tweet=self.getTweet(id)
            if 'codes' in tweet:
                for code in tweet['codes']:
                    if code not in summary:
                            summary[code] =0
                    summary[code]=summary[code]+1
        # sort by code name; a lambda avoids the `operator` module, which is not imported above
        sortedsummary = sorted(summary.items(),key=lambda item: item[0],reverse=True)
        return sortedsummary

Fill in the placeholders below with the keys that you generated in Part 1.

In [3]:
consumer_key='YOUR_CONSUMER_KEY'
consumer_secret='YOUR_CONSUMER_SECRET'
access_token='YOUR_ACCESS_TOKEN'
access_secret='YOUR_ACCESS_SECRET'

In [4]:
from tweepy import OAuthHandler

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)
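
One way to keep keys out of the notebook itself is to read them from environment variables. The following is a minimal sketch, not part of the original module: the variable names and the `load_twitter_keys` helper are hypothetical, and you would set the variables in your shell before starting Jupyter.

```python
import os

# Hypothetical variable names; choose whatever names suit your environment.
KEY_NAMES = ("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
             "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET")

def load_twitter_keys(env=os.environ):
    """Return the four credentials as a dict, failing loudly if any is missing."""
    missing = [name for name in KEY_NAMES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in KEY_NAMES}
```

The resulting dictionary entries could then be passed to `OAuthHandler` and `set_access_token` in place of the literal strings.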

# 3.1 Natural language processing

Our ultimate goal is to build a classifier capable of distinguishing tweets related to tobacco smoking from other, unrelated tweets. As a first step, we will use some basic natural language processing to explore the types of words and language found in the tweets.

To do this, we will use the [spaCy](https://spaCy.io/) Python NLP package. spaCy provides significant NLP power out of the box, with customization facilities offering greater flexibility at various stages of the pipeline. Details can be found at the [spaCy web site](https://spaCy.io/) and in this [tutorial](https://nicschrading.com/project/Intro-to-NLP-with-spaCy/). spaCy is built on a neural network model based on recent developments in NLP research. See the [spaCy architecture](https://spaCy.io/api/) description for an overview.


However, before we get into the details, a bit of a roadmap.

Natural Language Processing involves a series of operations on an input text, each building on the previous step to add insight and understanding. Thus, many NLP packages run as pipeline processors providing modular components at each stage of the process. Separating key steps into discrete packages provides needed modularity, as developers can modify and customize individual components as needed. spaCy, like other NLP tools including [GATE](https://gate.ac.uk/) and [cTAKES](https://ctakes.apache.org), operates on such a model. Although the specific components of each pipeline vary from system to system (and from task to task), the key tasks are roughly similar:

1. *Tokenizing*: Splitting the text into words, punctuation, and other markers.
2. *Part of speech tagging*: Classifying terms as nouns, verbs, adjectives, adverbs, etc.
3. *Dependency Parsing* or *Chunking*: Defining relationships between tokens (subject and object of sentence) and grouping into noun and verb phrases.
4. *Named Entity Recognition*: Mapping words or phrases to standard vocabularies or other common, known values. This step is often key for linking free text to accepted terms for diseases, symptoms, and/or anatomic locations.

Each of these steps might be accomplished through rules, machine learning models, or some combination of approaches. After these initial steps are complete, results might be used to identify relationships between items in the text, build classifiers, or otherwise conduct further analysis. We'll get into these topics later.
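
The pipeline idea can be sketched in a few lines of plain Python. This is a toy illustration only, not how spaCy is implemented: the stage functions, the tiny lexicon, and `run_pipeline` are all hypothetical names made up for this example.

```python
def tokenize(text):
    # Stage 1: naive whitespace tokenizer (real tokenizers are far more careful).
    return [{"text": word} for word in text.split()]

def tag(tokens):
    # Stage 2: toy part-of-speech "tagger" driven by a tiny hand-made lexicon.
    lexicon = {"smoking": "NOUN", "affects": "VERB", "health": "NOUN"}
    for token in tokens:
        token["pos"] = lexicon.get(token["text"].lower(), "X")
    return tokens

def keep_nouns(tokens):
    # Stage 3: later stages can rely on annotations added by earlier ones.
    return [token for token in tokens if token["pos"] == "NOUN"]

def run_pipeline(text, stages):
    # Feed the output of each stage into the next one.
    result = text
    for stage in stages:
        result = stage(result)
    return result

nouns = run_pipeline("Smoking affects health", [tokenize, tag, keep_nouns])
print([token["text"] for token in nouns])   # ['Smoking', 'health']
```

Because each stage is an ordinary function, any one of them can be swapped out or customized without touching the others, which is the modularity described above.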

The [spaCy documentation](https://spaCy.io/usage/spaCy-101) and [cTAKES default pipeline description](https://cwiki.apache.org/confluence/display/CTAKES/Default+Clinical+Pipeline) provide two examples of how these components might be arranged in practice.  For more information on NLP theory and methods, see [Speech and Language Processing (3rd ed. draft)](https://web.stanford.edu/~jurafsky/slp3/), perhaps the leading NLP textbook.

Given this introduction, we can read in our tweets and get to work.

# 3.1.1 Reading in data

At the end of [Part 2](SocialMedia - Part 2.ipynb), you saved two sets of tweets, one for smoking and one for vaping. Let's read in the vaping tweets.

In [5]:
vaping=Tweets()
vaping.readTweets("tweets-vaping.json")

In [6]:
vaping.countTweets()

100

and the smoking tweets...

In [7]:
smoking=Tweets()
smoking.readTweets("tweets-smoking.json")
smoking.countTweets()

100

# 3.1.2 NLP Roadmap


spaCy, like many other natural language processing tools, operates as a *pipeline* - a sequential series of operations, each of which conducts some analysis and passes results on to the next. Each of the steps in the pipeline can operate both on the original text and on any of the results of the previous stages. The basic spaCy pipeline starts with the following steps:

1. Tokenizing - splitting into individual elements.
2. Tagging - assigning part-of-speech tags.
3. Parsing - identifying relationships between elements of a sentence.
4. Named Entity Recognition (NER) - identifying domain-specific nouns and concepts. In biomedical literature, this might mean diseases, symptoms, anatomic locations, etc.

Tokenizing is the assumed first stage of every pipeline. To see the contents of a pipeline, we can create an NLP object for the English language and iterate over the components of the pipeline. Although we'll usually use all of the components of the pipeline, they can be [customized](https://spacy.io/usage/processing-pipelines).

In [8]:
import spacy
nlp = spacy.load('en')
for name,proc in nlp.pipeline:
    print(name,proc)

tagger <spacy.pipeline.Tagger object at 0x109ed5780>
parser <spacy.pipeline.DependencyParser object at 0x109ef38e0>
ner <spacy.pipeline.EntityRecognizer object at 0x109ef3888>


# 3.1.3 Tokenizing

Tokenizing is the process of splitting a text into individual components - words - for further processing. Although this might sound simple, the peculiarities of the English language and how it is used often make tokenizing more complex than we might expect.

To see some of the challenges, we will grab a specific pre-chosen tweet and process it.

This will give us a beginning feel for what [spaCy](https://spacy.io) can do, how we might use it, and how we might want to extend and revise the tokenizing process.

In [9]:
tweet_id='974316984740429824'
sample=smoking.getText(tweet_id)
sample

'#Smoking affects multiple parts of our body. Know more: https://t.co/hwTeRdC9Hf \n#SwasthaBharat #NHPIndia #mCessation #QuitSmoking https://t.co/x7xHO9G2Cr'

Tweets have usage patterns that are not standard English - URLs, hashtags, user references (this particular tweet was not selected accidentally). These patterns create challenges for extracting content - we might want to know that "#QuitSmoking" is, in a tweet, a hashtag that should be treated as a complete unit.

We'll see soon how we might do this, but first, to start the NLP process, we can import the spaCy components and create an NLP object:

In [10]:
import spacy
nlp = spacy.load('en')

We can then parse the text of the sample tweet.

In [11]:
parsed = nlp(sample)

The result is a list of tokens. We can print out each token to start:

In [12]:
print([token.text for token in parsed])

['#', 'Smoking', 'affects', 'multiple', 'parts', 'of', 'our', 'body', '.', 'Know', 'more', ':', 'https://t.co/hwTeRdC9Hf', '\n', '#', 'SwasthaBharat', '#', 'NHPIndia', '#', 'mCessation', '#', 'QuitSmoking', 'https://t.co/x7xHO9G2Cr']


We can see right away that this tokenizing isn't quite what we would like. The default English tokenizer treats `#QuitSmoking` as two separate tokens - `#` and `QuitSmoking`. To treat this as a hashtag, we will need to revise the tokenizer.

For another example, consider this potential tweet text:

In [13]:
smoketweet='E-cigarette use by teens linked to later tobacco smoking, study says https://t.co/AhTpFUw0TW'
parsed=nlp(smoketweet)
print( [tok.text for tok in parsed])

['E', '-', 'cigarette', 'use', 'by', 'teens', 'linked', 'to', 'later', 'tobacco', 'smoking', ',', 'study', 'says', 'https://t.co/AhTpFUw0TW']


Note that "E-cigarette" becomes three tokens. This is not what we want - we want it to be held together as one. 

We will revise the spaCy tokenizer to handle these two difficulties - hashtags and "E-cigarette" tokenizing. 
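
Before changing spaCy itself, it can help to pin down the target behavior. The sketch below is a stand-alone regular-expression tokenizer, not the spaCy tokenizer; the pattern and the `tweet_tokens` name are assumptions made for illustration. It keeps URLs, hashtags, and hyphenated words together as single tokens.

```python
import re

TOKEN_PATTERN = re.compile(
    r"https?://\S+"      # URLs, kept whole
    r"|#\w+"             # hashtags such as #QuitSmoking
    r"|\w+(?:-\w+)*"     # words, including hyphenated ones such as E-cigarette
    r"|[^\w\s]"          # any other single non-space character (punctuation)
)

def tweet_tokens(text):
    """Split text into tokens, trying the alternatives above in order."""
    return TOKEN_PATTERN.findall(text)

print(tweet_tokens("#QuitSmoking now: https://t.co/x7xHO9G2Cr"))
print(tweet_tokens("E-cigarette use, study says"))
```

A pattern like this gives us the tokens we want, but none of the tagging or parsing that spaCy layers on top, which is why the rest of this section customizes the spaCy tokenizer instead.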

## 3.1.3.1 Exception rules

"E-cigarette" can be handled with some simple exception rules.

To do this, we can refer to the spaCy documentation, which describes the process for adding a [special-case tokenizer rule](https://spacy.io/usage/linguistic-features#section-tokenization). Essentially, these rules customize tokenizing for specific domains.

Each new rule will be a dictionary with three fields:

* `ORTH` is the text that will be matched
* `LEMMA` is the lemma form
* `POS` is the part of speech
    
These can then be added to the tokenizer:

In [14]:
from spacy.symbols import ORTH, LEMMA, POS
special_case = [{ORTH: u'e-cigarette', LEMMA: u'e-cigarette', POS: u'NOUN'}]
nlp.tokenizer.add_special_case(u'e-cigarette', special_case)
nlp.tokenizer.add_special_case(u'E-cigarette', special_case)

These commands tell the tokenizer that the text "e-cigarette" should be handled by the special-case rule, which keeps it as a single token. Now, let's take a look at the result:

In [15]:
parsed=nlp(smoketweet)
print( [tok.text for tok in parsed])

['e-cigarette', 'use', 'by', 'teens', 'linked', 'to', 'later', 'tobacco', 'smoking', ',', 'study', 'says', 'https://t.co/AhTpFUw0TW']


Now we capture "E-cigarette" as one token. Note the importance of including both capitalizations.
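
To see why both capitalizations matter, consider a toy version of an exception table in plain Python. This is a sketch of the general idea only, not spaCy's implementation; `SPECIAL_CASES` and `split_with_exceptions` are hypothetical names.

```python
import re

# Hypothetical exception table: exact strings that must survive as one token.
SPECIAL_CASES = {"e-cigarette", "E-cigarette"}

def split_with_exceptions(text, special_cases=SPECIAL_CASES):
    tokens = []
    for chunk in text.split():
        if chunk in special_cases:
            tokens.append(chunk)                               # exception: keep whole
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))   # default: split on punctuation
    return tokens

print(split_with_exceptions("E-cigarette use"))   # ['E-cigarette', 'use']
print(split_with_exceptions("E-CIGARETTE use"))   # ['E', '-', 'CIGARETTE', 'use']
```

The lookup is an exact string match, so any spelling that is not listed falls through to the default splitting rules.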

## 3.1.3.2 Tokenizing hashtags

As indicators of the progress and content of Twitter conversations, hashtags are important in tweets. For example, some analyses might want to use trends in hashtags, and their mentions in tweets and retweets, to understand conversational dynamics and the spread of ideas. However, as we saw, they are not handled properly by the default tokenizer. As a reminder:

In [16]:
tweet_id='974316984740429824'
sample=smoking.getText(tweet_id)
print(sample)
parsed = nlp(sample)
print( [tok.text for tok in parsed])

#Smoking affects multiple parts of our body. Know more: https://t.co/hwTeRdC9Hf 
#SwasthaBharat #NHPIndia #mCessation #QuitSmoking https://t.co/x7xHO9G2Cr
['#', 'Smoking', 'affects', 'multiple', 'parts', 'of', 'our', 'body', '.', 'Know', 'more', ':', 'https://t.co/hwTeRdC9Hf', '\n', '#', 'SwasthaBharat', '#', 'NHPIndia', '#', 'mCessation', '#', 'QuitSmoking', 'https://t.co/x7xHO9G2Cr']


We can look specifically at "#Smoking", which becomes two tokens:

In [17]:
print(parsed[0])
print(parsed[1])

#
Smoking


Note how "#Smoking" is split into "#" and "Smoking". To avoid this, we can add a specialized processing component as a member of a [spaCy pipeline](https://spacy.io/usage/processing-pipelines).

To process hashtags, we will use code suggested by a [spaCy GitHub issue](https://github.com/explosion/spaCy/issues/503). To see how this should work, let's walk through some steps.

First, let's look at the tokens in the tweet parsed above. We can iterate through them with `enumerate`. We can also look at a few interesting elements:

* `nbor` gets the next token after a token.
* `idx` is the position of the token's first character in the original text, starting at 0.

In [18]:
print(str(parsed[0].idx)+" "+parsed[0].text)
print(str(parsed[0].nbor().idx)+" "+str(parsed[0].nbor().text))
print(str(parsed[1].nbor().idx)+" "+str(parsed[1].nbor().text))
print(str(parsed[2].nbor().idx)+" "+str(parsed[2].nbor().text))

0 #
1 Smoking
9 affects
17 multiple


Thus, '#' starts the string at index 0, and 'Smoking' occupies the 7 characters starting at index 1. The 9th character (index 8) is a space, so the next token ('affects') starts on the 10th character, which has index 9, etc.

We can use this information to find a hashtag. Essentially, we can look for a token that has the text '#'. If we find one, we can look at the next token and merge all of the characters from the start of the first token to the end of the second token.
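
The offset arithmetic can be checked without spaCy. The sketch below recomputes the same character positions from a plain string; `token_offsets` is a hypothetical helper written for this illustration.

```python
def token_offsets(text, tokens):
    """Return (start, token) pairs, scanning left to right through text."""
    offsets = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)   # first occurrence at or after cursor
        offsets.append((start, token))
        cursor = start + len(token)
    return offsets

text = "#Smoking affects multiple parts"
pairs = token_offsets(text, ["#", "Smoking", "affects", "multiple"])
print(pairs)   # [(0, '#'), (1, 'Smoking'), (9, 'affects'), (17, 'multiple')]

# The hashtag runs from the start of '#' to one past the end of 'Smoking'.
start = pairs[0][0]
end = start + len(pairs[1][1]) + 1
print(text[start:end])   # #Smoking
```

The `+ 1` accounts for the '#' character itself, so `end` is the index of the first character after the hashtag.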

In [19]:
start=parsed[0].idx
length = len(parsed[1].text)
end = start+length+1
print(str(start))
print(str(end))
parsed.merge(start,end)

0
8


#Smoking

This combines the characters from index 0 up to, but not including, index 8 (which is a space) to form a new token.

Now, if we look at the list of tokens, we see that the first two are merged:

In [20]:
print( [tok.text for tok in parsed])

['#Smoking', 'affects', 'multiple', 'parts', 'of', 'our', 'body', '.', 'Know', 'more', ':', 'https://t.co/hwTeRdC9Hf', '\n', '#', 'SwasthaBharat', '#', 'NHPIndia', '#', 'mCessation', '#', 'QuitSmoking', 'https://t.co/x7xHO9G2Cr']


To get this to work for all of the tokens in a tweet, we need a routine that will repeatedly iterate over the tokens until we can't find any more hashtags:


In [21]:
nlp = spacy.load('en')

In [22]:
doc = nlp("twitter #hashtag #")

In [23]:
nlp = spacy.load('en')
def hashtag_pipe(doc):
    merged_hashtag = True
    while merged_hashtag == True:
        merged_hashtag = False
        for token_index,token in enumerate(doc):
            if token.text == '#':
                try:
                    nbor = token.nbor()
                    start_index = token.idx
                    end_index = start_index + len(token.nbor().text) + 1
                    if doc.merge(start_index, end_index) is not None:
                        merged_hashtag = True
                        break
                except:
                    pass
    return doc

This routine might require a bit of explanation. The main routine in lines 6-16 does the bulk of the work shown above - we find a token that contains only the single character '#', we find the end of the next token, and we merge the two.

There is one catch in that inner loop. If the last token in the string is a '#', the attempt to read the next token (on line 9) will cause an exception. If this happens, we're done anyway. So we `try` to get the next token. If it fails, we must be at the end of the document, so the `except`  clause does nothing, as indicated by the `pass`.

However, this is not the whole story. The merging of these two tokens removes one from the list of tokens returned by `enumerate(doc)`. If we continue on, the enumeration will eventually blow up, as the code will try to access a token that is no longer there (try it and see).

To get around this, we change the inner loop to `break` out as soon as a pair of tokens are merged. This will start the process over with a new enumeration. This process will repeat until we make it all the way through lines 6-16 - in other words, all of the way through the tweet -  without finding a pair of tokens to merge. When this happens, `merged_hashtag` will stay False, and the outer loop will exit.
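
The same restart-after-merge logic can be tried on an ordinary Python list, which makes the control flow easier to follow. This is a simplified stand-in for the spaCy document, and `merge_hashtags` is a hypothetical name for this sketch.

```python
def merge_hashtags(tokens):
    """Merge each '#' with the token that follows it, restarting after every merge."""
    tokens = list(tokens)          # work on a copy
    merged = True
    while merged:
        merged = False
        for i, token in enumerate(tokens):
            if token == "#" and i + 1 < len(tokens):
                tokens[i:i + 2] = [token + tokens[i + 1]]   # list shrinks by one
                merged = True
                break              # positions have shifted, so start a fresh scan
    return tokens

print(merge_hashtags(["twitter", "#", "hashtag", "#"]))   # ['twitter', '#hashtag', '#']
```

Here the bounds check `i + 1 < len(tokens)` plays the role of the `try`/`except` in the spaCy version: a trailing '#' has no neighbor, so it is left alone.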

Once we have this routine written, we can then add it to the first position in the pipeline, which will put it after the default tokenizer, but before the part of speech tagger and other components.

In [24]:
nlp.add_pipe(hashtag_pipe,first=True)

And then we can try it out...

In [25]:
doc = nlp("twitter #hashtag")
print(doc[0].text)
print(doc[1].text)

twitter
#hashtag


Returning to our first example...

In [26]:
tweet_id='974316984740429824'
sample=smoking.getText(tweet_id)
print(sample+"\n")
parsed = nlp(sample)
print( [tok.text for tok in parsed])

#Smoking affects multiple parts of our body. Know more: https://t.co/hwTeRdC9Hf 
#SwasthaBharat #NHPIndia #mCessation #QuitSmoking https://t.co/x7xHO9G2Cr

['#Smoking', 'affects', 'multiple', 'parts', 'of', 'our', 'body', '.', 'Know', 'more', ':', 'https://t.co/hwTeRdC9Hf', '\n', '#SwasthaBharat', '#NHPIndia', '#mCessation', '#QuitSmoking', 'https://t.co/x7xHO9G2Cr']


We can try a tweet that ends with a '#':

In [27]:
doc = nlp("twitter #hashtag #")
print([tok.text for tok in doc])

['twitter', '#hashtag', '#']


Great! We can also try a pathological example.

In [28]:
parsed = nlp("weird hashtag ###tag")
print( [tok.text for tok in parsed])

['weird', 'hashtag', '##', '#tag']


Oops. That doesn't work. It's not even clear that this is a legal hashtag. 

**BONUS CHALLENGE**: Perhaps you can extend the routine to make it handle hashtags started by multiple '#' symbols?

Summarizing, we can combine the changes to the tokenizer, wrapping them up in a subroutine as follows:

In [29]:
from spacy.symbols import ORTH, LEMMA, POS

def getTwitterNLP():
    nlp = spacy.load('en')
    special_case = [{ORTH: u'e-cigarette', LEMMA: u'e-cigarette', POS: u'NOUN'}]
    nlp.tokenizer.add_special_case(u'e-cigarette', special_case)
    nlp.tokenizer.add_special_case(u'E-cigarette', special_case)
    def hashtag_pipe(doc):
        merged_hashtag = True
        while merged_hashtag == True:
            merged_hashtag = False
            for token_index,token in enumerate(doc):
                if token.text == '#':
                    try:
                        nbor = token.nbor()
                        start_index = token.idx
                        end_index = start_index + len(token.nbor().text) + 1
                        if doc.merge(start_index, end_index) is not None:
                            merged_hashtag = True
                            break
                    except:
                        pass
        return doc
    nlp.add_pipe(hashtag_pipe,first=True)
    return nlp

In [30]:
nlp = getTwitterNLP()

In [31]:
parsed = nlp("weird e-cigarette hashtag ###tag")
print( [tok.text for tok in parsed])

['weird', 'e-cigarette', 'hashtag', '##', '#tag']


Note that spaCy can also detect sentences. If you have multiple sentences, they will be found in the results of the parser as spans, each with a start and endpoint, given in terms of the positions of the tokens: 

In [32]:
parsed= nlp("This is an example of parsing two sentences. Here is the second sentence.")

In [33]:
for span in parsed.sents:
    print(str(span.start)+" "+str(span.end))

0 9
9 15


Thus, the first sentence includes tokens 0-8 and the second includes tokens 9-14.

It's also possible to access the text of the sentences directly:

In [34]:
sents = list(parsed.sents)
sents[0].text

'This is an example of parsing two sentences.'

In [35]:
print(parsed[0].text)
print(parsed[8].text)
print(parsed[9].text)
print(parsed[14].text)

This
.
Here
.
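
The `start` and `end` values behave like Python slice bounds: `start` is included and `end` is not. A small stand-alone check, using a hand-typed token list in place of the parsed document:

```python
tokens = ["This", "is", "an", "example", "of", "parsing", "two", "sentences", ".",
          "Here", "is", "the", "second", "sentence", "."]
spans = [(0, 9), (9, 15)]   # (start, end) pairs as printed above

for start, end in spans:
    sentence = tokens[start:end]        # end is exclusive, like a Python slice
    print(start, end - 1, " ".join(sentence))
```

Each line printed shows the first and last token index of a sentence, followed by its tokens.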


Tokenizers are traditionally built using optimized [regular expressions](https://www.regular-expressions.info/). For more information about tokenizing in spaCy, see [spaCy 101](https://spacy.io/usage/spacy-101#section-features) and the [detailed discussion of the spaCy tokenizer](https://spacy.io/usage/linguistic-features#tokenization). For a more general introduction, see [Chapter 2 of Speech and Language Processing (3rd ed. draft)](https://web.stanford.edu/~jurafsky/slp3/).

## 3.1.3.3 Lemmatization, stop words, and alpha characterization

The spaCy tokenizer provides a few other useful features along the way:

* Lemmatization: For each token, spaCy can find the *lemma*: the "standard" or "base" form, reducing verb forms to their base verb, plurals to appropriate singular nouns, etc.
* Stop word identification - labelling words as commonly-found words that add little or no information.
* Alphabetic identification - identifying those tokens that contain only alphabetic characters.

To see these in action, let's review a few tokens:

In [36]:
tweet_id='974316984740429824'
sample=smoking.getText(tweet_id)
parsed=nlp(sample)
print(sample)
print(parsed[1].text)
print(parsed[1].lemma)
print(parsed[1].lemma_)
print(parsed[1].is_stop)
print(parsed[1].is_alpha)

#Smoking affects multiple parts of our body. Know more: https://t.co/hwTeRdC9Hf 
#SwasthaBharat #NHPIndia #mCessation #QuitSmoking https://t.co/x7xHO9G2Cr
affects
17543419487618836897
affect
False
True


So, `affects` has the lemma `affect`.  Note that spaCy stores many fields as both hashes for efficiency and as text  for readability. You'll want to use the text form for interpreting results, but the hash for computing. They differ only in the use of the trailing underscore - thus `lemma` is the hash while `lemma_` is the human readable form.
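
The pairing of an integer form with a text form can be illustrated with a toy lookup table. This is a deliberately simplified sketch: it hands out sequential integers, whereas spaCy uses hash values, and `ToyStringStore` is a hypothetical class invented for this example.

```python
class ToyStringStore:
    """Two-way mapping between strings and integer IDs (illustration only)."""

    def __init__(self):
        self._ids = {}        # string -> integer
        self._strings = []    # integer -> string

    def add(self, string):
        # Reuse the existing ID if the string has been seen before.
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def text(self, string_id):
        return self._strings[string_id]

store = ToyStringStore()
lemma = store.add("affect")     # integer form, cheap to compare and store
lemma_ = store.text(lemma)      # readable form
print(lemma, lemma_)
```

Comparing two integers is cheaper than comparing two strings, and each distinct string only needs to be stored once.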

We can also see that `affects` is not a stop word, and it is alphabetic.

Some NLP systems will go a bit further than spaCy's lemmatization, using a process called "stemming" to reduce words to base forms. With a stemming algorithm, "scared" might be reduced to "scare" - see this description of [Porter's stemming algorithm](https://tartarus.org/martin/PorterStemmer/) for more detail. 

# 3.1.4  Part-Of-Speech Tagging 

The next step in NLP is *Part of speech tagging* - classifying each token as one of the parts of speech that we all learned in elementary school. Parts of speech are stored as attributes of each token:


In [37]:
tweet_id='974316984740429824'
sample=smoking.getText(tweet_id)
parsed=nlp(sample)
print(parsed[1].text)
print(parsed[1].pos)
print(parsed[1].pos_)

affects
99
VERB


As discussed before, we have two attributes here - `pos` is the hash code for the part of speech, used for efficiency, while `pos_` is the human readable form. Other attributes derived by spaCy follow the same pattern.

A second attribute - `tag` - provides additional information.

As described in the [spaCy documentation for part-of-speech tags](https://spacy.io/api/annotation#pos-tagging), the tags associated with these two fields come from different sources. `tag_` uses parts of speech from a version of the [Penn Treebank](https://www.seas.upenn.edu/~pdtb/), a well-known corpus of annotated text. `pos_` uses a simpler set of tags from [A Universal Part-of-Speech Tagset](https://arxiv.org/abs/1104.2086), published by researchers from Google.

The tags for `affects` provide an example of the difference. According to the [spaCy documentation](https://spacy.io/api/annotation#pos-tagging), `VBZ` from the Penn tag set indicates a 'verb, 3rd person singular present', while the `VERB` result for `pos_` is a more general tag from the Google set. Many types of verbs in the Penn Treebank correspond to the `VERB` tag from the Google set.
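
The many-to-one relationship between the two tag sets can be written down as a small lookup table. The mapping below is partial and hand-written for illustration, covering only some verb and noun tags in the way the outputs in this notebook show them; the spaCy documentation has the full table, and `coarse_tag` is a hypothetical helper.

```python
from collections import defaultdict

# Partial, hand-written mapping from Penn Treebank tags to universal tags.
PENN_TO_UNIVERSAL = {
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
}

def coarse_tag(fine_tag):
    """Return the universal tag for a Penn tag, or None if it is not in the table."""
    return PENN_TO_UNIVERSAL.get(fine_tag)

# Invert the table to see how many fine-grained tags share each coarse tag.
by_coarse = defaultdict(list)
for fine, coarse in PENN_TO_UNIVERSAL.items():
    by_coarse[coarse].append(fine)

print(coarse_tag("VBZ"))          # VERB
print(sorted(by_coarse["VERB"]))  # ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
```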

In [38]:
print(parsed[1].text)
print(parsed[1].tag_)
print(parsed[1].pos_)

affects
VBZ
VERB


If you want to learn more about a part of speech tag, you can use `spacy.explain`:

In [39]:
print(spacy.explain(parsed[1].pos_))
print(spacy.explain(parsed[1].tag_))

verb
verb, 3rd person singular present


Let's look at token 0 ("#Smoking"), token 3 ("parts"), token 11 ("https://t.co/hwTeRdC9Hf"), and token 13 ("#SwasthaBharat") to see a few more tokens in action.

In [40]:
t0 = parsed[0]
t3 = parsed[3]
t11= parsed[11]
t13 = parsed[13]
print (t0.text,t0.lemma_,t0.pos_,t0.tag_,t0.is_stop,t0.is_alpha)
print (t3.text,t3.lemma_,t3.pos_,t3.tag_,t3.is_stop,t3.is_alpha)
print (t11.text,t11.lemma_,t11.pos_,t11.tag_,t11.is_stop,t11.is_alpha)
print (t13.text,t13.lemma_,t13.pos_,t13.tag_,t13.is_stop,t13.is_alpha)

#Smoking #smoking NOUN NN False False
parts part NOUN NNS False True
https://t.co/hwTeRdC9Hf https://t.co/hwterdc9hf PROPN NNP False False
#SwasthaBharat #swasthabharat NOUN NN False False


Note that URLs are neither alphabetic nor stop words, but they are tagged as proper nouns.


Let's turn the code that we used above into a routine, along with a routine to print out token details and try another tweet or two. To make things easy to read, we'll use some spaces to format things in columns. 

In [41]:
def printTokDetails(parsed):
    print("{:25} {:25} {:7}{:7}{:7}{:7}".format("Token text","Lemma","POS","Tag","Stop?","Alpha?"))
    for tok in parsed:
        print("{:25} {:25} {:7}{:7}{:7}{:7}".format(str(tok.text),str(tok.lemma_),str(tok.pos_),str(tok.tag_),str(tok.is_stop),str(tok.is_alpha)))
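
The column layout comes from the width in each format field: `{:25}` pads a string with spaces to at least 25 characters. A reduced sketch with two hand-typed rows (taken from the output shown later in this section) and only four columns:

```python
rows = [("Made", "make", "VERB", "VBN"),
        ("sandwich", "sandwich", "NOUN", "NN")]

template = "{:25} {:25} {:7}{:7}"
header = template.format("Token text", "Lemma", "POS", "Tag")
lines = [header] + [template.format(*row) for row in rows]
print("\n".join(lines))

# Strings are left-aligned and padded on the right.
print(repr("{:7}".format(str(True))))   # 'True   '
```

Every line comes out the same length, so the columns line up. Wrapping each value in `str()`, as `printTokDetails` does, ensures that non-string values such as booleans are padded as text.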

In [42]:
tweet_id=random.choice(list(smoking.getIds()))
sample2 = smoking.getText(tweet_id)

In [43]:
sample2

'Made a sandwich 10 min ago and been looking for it ever since then🤦🏾‍♂️ I gotta stop smoking😂 https://t.co/NCbNOyvZXe'

In [44]:
parsed2=nlp(sample2)

In [45]:
printTokDetails(parsed2)

Token text                Lemma                     POS    Tag    Stop?  Alpha? 
Made                      make                      VERB   VBN    False  True   
a                         a                         DET    DT     False  True   
sandwich                  sandwich                  NOUN   NN     False  True   
10                        10                        NUM    CD     False  False  
min                       min                       NOUN   NN     False  True   
ago                       ago                       ADV    RB     False  True   
and                       and                       CCONJ  CC     False  True   
been                      be                        VERB   VBN    False  True   
looking                   look                      VERB   VBG    False  True   
for                       for                       ADP    IN     False  True   
it                        -PRON-                    PRON   PRP    False  True   
ever                      ev

You might see some interesting patterns arising here. For example:

* We see many different parts of speech. Initially, we might want to focus on the nouns alone, as they provide much of the content.

* Look for words like "is" or "was" - these might all refer to a common lemma term - "be", corresponding to the generic form of the verb. Do you see any other instances of lemma forms that differ from the parsed text?

* URLs and icons might be present in tweets. Are they classified as alphabetic? Should we include them as part of the "useful" text from a tweet?

How about another?

In [46]:
tweet_id=random.choice(list(smoking.getIds()))
sample2 = smoking.getText(tweet_id)
sample2

'jay peek is at the bus stop! Smoking? Weird weird and odd. Not him maybe.'

In [47]:
parsed2=nlp(sample2)
printTokDetails(parsed2)

Token text                Lemma                     POS    Tag    Stop?  Alpha? 
jay                       jay                       PROPN  NNP    False  True   
peek                      peek                      NOUN   NN     False  True   
is                        be                        VERB   VBZ    False  True   
at                        at                        ADP    IN     False  True   
the                       the                       DET    DT     False  True   
bus                       bus                       NOUN   NN     False  True   
stop                      stop                      NOUN   NN     False  True   
!                         !                         PUNCT  .      False  False  
Smoking                   smoking                   NOUN   NN     False  True   
?                         ?                         PUNCT  .      False  False  
Weird                     weird                     ADJ    JJ     False  True   
weird                     we

Try a few more of these to get a bit more of a feel for the distribution of lemmas and POS tags. The following shortcut routine will make this a bit easier.

In [48]:
def getTweetText(tweets):
    tweet_id = random.choice(list(tweets.getIds()))
    return tweets.getText(tweet_id)

---
## EXERCISE 3.1: Filtering tokens

Although NLP parsing is often a good start, further filtering is often necessary to focus on data relevant for specific tasks. In this problem, we will review some additional tweets and develop a post-processing routine capable of filtering tweets as necessary for our needs. 

3.1.1 Using the `getTweetText` and `printTokDetails` routines above, along with the spaCy parser (`nlp`), examine several tweets to decide which tokens should be included or not. List criteria for keeping/removing tokens. Remember to use `spacy.explain()` for any unfamiliar POS or tag entries. Note that your criteria will not be perfect and will likely need refining. Examine enough tweets to feel confident in your criteria.

3.1.2 Write a routine `includeToken` that will return True if a token matches the criteria that you identified in 3.1.1, and False otherwise. Assume for now that we are only interested in nouns and verbs, as they might be a good starting point to find information about vaping or smoking. 

3.1.3 Write a routine `filterTweetTokens` that will filter the parsed tokens from a single tweet, returning a list of the tokens to be included, based on your criteria.

3.1.4 Run `filterTweetTokens` on a few tweets. Identify any inaccuracies and explain them. When possible, identify an approach for improving performance, and implement it in a revised version of `filterTweetTokens`.

3.1.5 Add these routines to the `Tweets` class, along with some new routines.

3.1.5.1 `parseTweet` will parse one of the tweets in the collection, storing the full list of tokens in a new dictionary entry entitled 'tokens'. `parseTweet` will also filter the tokens, storing the resulting list in an entry entitled 'filteredTokens'.

*NOTE*: The `Tweets` class might or might not have an NLP object available for any given call to `parseTweet`. You should have the class create an NLP object when it is initialized. 

3.1.5.2 `parseTweets` will call `parseTweet` on all of the tweets in a collection.

3.1.5.3 `getTokens` will be used to get all of the tokens for a given tweet.

3.1.5.4 `getFilteredTokens` will be used to get all of the filtered tokens for a tweet. 


3.1.6 When you are done, test this new version of the class by reading in and parsing the 'smoking' tweet set. 

---
*ANSWER FOLLOWS Cut below here*

### 3.1.1 Sample tweets

In [49]:
sample=getTweetText(smoking)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

me: smoking weed hasn’t affected me at all

someone: count to 10

me: https://t.co/SUoGzARpom
Token text                Lemma                     POS    Tag    Stop?  Alpha? 
me                        -PRON-                    PRON   PRP    False  True   
:                         :                         PUNCT  :      False  False  
smoking                   smoke                     VERB   VBG    False  True   
weed                      weed                      NOUN   NN     False  True   
has                       have                      VERB   VBZ    False  True   
n’t                       not                       ADV    RB     False  False  
affected                  affect                    VERB   VBN    False  True   
me                        -PRON-                    PRON   PRP    False  True   
at                        at                        ADP    IN     False  True   
all                       all                       ADV    RB     False  True   


         

In [50]:
sample=getTweetText(smoking)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

I need me some solid smoking partnas
Token text                Lemma                     POS    Tag    Stop?  Alpha? 
I                         -PRON-                    PRON   PRP    False  True   
need                      need                      VERB   VBP    False  True   
me                        -PRON-                    PRON   PRP    False  True   
some                      some                      DET    DT     False  True   
solid                     solid                     ADJ    JJ     False  True   
smoking                   smoking                   NOUN   NN     False  True   
partnas                   partna                    NOUN   NNS    False  True   


In [51]:
sample=getTweetText(smoking)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

me: smoking weed hasn’t affected me at all

someone: count to 10

me: https://t.co/SUoGzARpom
Token text                Lemma                     POS    Tag    Stop?  Alpha? 
me                        -PRON-                    PRON   PRP    False  True   
:                         :                         PUNCT  :      False  False  
smoking                   smoke                     VERB   VBG    False  True   
weed                      weed                      NOUN   NN     False  True   
has                       have                      VERB   VBZ    False  True   
n’t                       not                       ADV    RB     False  False  
affected                  affect                    VERB   VBN    False  True   
me                        -PRON-                    PRON   PRP    False  True   
at                        at                        ADP    IN     False  True   
all                       all                       ADV    RB     False  True   


         

In [52]:
sample=getTweetText(smoking)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

Made a sandwich 10 min ago and been looking for it ever since then🤦🏾‍♂️ I gotta stop smoking😂 https://t.co/NCbNOyvZXe
Token text                Lemma                     POS    Tag    Stop?  Alpha? 
Made                      make                      VERB   VBN    False  True   
a                         a                         DET    DT     False  True   
sandwich                  sandwich                  NOUN   NN     False  True   
10                        10                        NUM    CD     False  False  
min                       min                       NOUN   NN     False  True   
ago                       ago                       ADV    RB     False  True   
and                       and                       CCONJ  CC     False  True   
been                      be                        VERB   VBN    False  True   
looking                   look                      VERB   VBG    False  True   
for                       for                       ADP  

Criteria: 
    
* Alpha is true, and 
* Stop is false, and 
* text is not "RT", and
* POS is NOUN or POS is VERB

### 3.1.2  `includeToken`

Our routine will accept a token only if it meets the criteria given above. 

In [53]:
def includeToken(tok):
    # Reject non-alphabetic tokens, stop words, and the retweet marker "RT".
    if not tok.is_alpha or tok.is_stop or tok.text == 'RT':
        return False
    # Of the remaining tokens, keep only nouns and verbs.
    return tok.pos_ in ('NOUN', 'VERB')

In [54]:
sample=getTweetText(smoking)
parsed=nlp(sample)
print(sample)

Are e-cigarettes leading young people to take up smoking? A new study says yes https://t.co/3Hv17tnER5


In [55]:
print(parsed[0])
includeToken(parsed[0])

Are


True

In [56]:
print(parsed[1])
includeToken(parsed[1])

e


False

In [57]:
print(parsed[2])
includeToken(parsed[2])

-


False

In [58]:
for tok in parsed:
    print(tok,includeToken(tok))

Are True
e False
- False
cigarettes True
leading True
young False
people True
to False
take True
up False
smoking True
? False
A False
new False
study True
says True
yes False
https://t.co/3Hv17tnER5 False


Looks ok. 

### 3.1.3 Write a routine `filterTweetTokens` that will filter the tokens from a single tweet

In [59]:
def filterTweetTokens(tokens):
    filtered=[]
    for tok in tokens:
        if includeToken(tok) == True:
            filtered.append(tok)
    return filtered

In [60]:
f= filterTweetTokens(parsed)
print(sample)
for tok in f:
    print(tok)

Are e-cigarettes leading young people to take up smoking? A new study says yes https://t.co/3Hv17tnER5
Are
cigarettes
leading
people
take
smoking
study
says


### 3.1.4 Run `filterTweetTokens` on a few tweets

In [61]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

Lmao he smoking big bricks. 100? Bro gon have to miss me like aubrey https://t.co/b7SJrqsmH8
smoking
bricks
gon
have
miss
aubrey


In [62]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

Made a sandwich 10 min ago and been looking for it ever since then🤦🏾‍♂️ I gotta stop smoking😂 https://t.co/NCbNOyvZXe
Made
sandwich
min
been
looking
got
stop
smoking


In [63]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

me: smoking weed hasn’t affected me at all

someone: count to 10

me: https://t.co/SUoGzARpom
smoking
weed
has
affected
someone
count


In [64]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)


me: smoking weed hasn’t affected me at all

someone: count to 10

me: https://t.co/SUoGzARpom
smoking
weed
has
affected
someone
count


In [65]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

I use to think weed wasn’t a drug ,, until I started literally having withdrawals from not smoking . That shit is real
use
think
weed
was
drug
started
having
withdrawals
smoking
shit
is


In [66]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

He was walking smoking him a blunt by me saying if u really wannabe on tv n free wit ur babies u will chose me n make him our baby maker.
was
walking
smoking
saying
tv
wit
babies
will
chose
make
baby
maker
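
The filtered lists above still contain auxiliary verbs such as `was`, `is`, `has`, and `will`, which say little about smoking itself. One possible refinement, sketched below under stated assumptions, is to drop tokens whose lemma is a common auxiliary and to return lower-cased lemmas instead of raw tokens. The names `includeTokenRevised`, `filterTweetLemmas`, and `AUXILIARY_LEMMAS` are hypothetical, the auxiliary list is an arbitrary short choice, and the `Tok` named tuple is only a stand-in that mimics the spaCy token attributes used here (`text`, `lemma_`, `pos_`, `is_alpha`, `is_stop`). The sample rows are copied from the token table printed in 3.1.1.

```python
from collections import namedtuple

# Stand-in for a spaCy token, holding only the attributes the filter reads.
Tok = namedtuple("Tok", ["text", "lemma_", "pos_", "is_alpha", "is_stop"])

# Assumed, deliberately short list of auxiliary-verb lemmas to drop.
AUXILIARY_LEMMAS = {"be", "have", "will", "do"}

def includeTokenRevised(tok):
    """Same criteria as includeToken, plus removal of auxiliary verbs by lemma."""
    if not tok.is_alpha or tok.is_stop or tok.text == "RT":
        return False
    if tok.pos_ not in ("NOUN", "VERB"):
        return False
    return tok.lemma_.lower() not in AUXILIARY_LEMMAS

def filterTweetLemmas(tokens):
    """Return the lower-cased lemmas of the tokens that pass the revised filter."""
    return [tok.lemma_.lower() for tok in tokens if includeTokenRevised(tok)]

# Rows copied from the token table in 3.1.1 (apostrophe simplified to ASCII).
rows = [
    Tok("me", "-PRON-", "PRON", True, False),
    Tok(":", ":", "PUNCT", False, False),
    Tok("smoking", "smoke", "VERB", True, False),
    Tok("weed", "weed", "NOUN", True, False),
    Tok("has", "have", "VERB", True, False),
    Tok("n't", "not", "ADV", False, False),
    Tok("affected", "affect", "VERB", True, False),
    Tok("me", "-PRON-", "PRON", True, False),
    Tok("at", "at", "ADP", True, False),
    Tok("all", "all", "ADV", True, False),
]

print(filterTweetLemmas(rows))
```

Returning lemmas rather than tokens groups `smoking` (as a verb) and `smoke` under a single term, which may help later when counting terms across tweets. Whether dropping auxiliaries is appropriate depends on the question being asked.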


### 3.1.5  Adding to the Tweets class

In [67]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.tweets={}
        self.nlp = getTwitterNLP()
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        searchTime=datetime.now()
        while (self.countTweets() < corpus_size):
            new_tweets = api.search(term,lang="en",count=10)
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt,searchTime,term)
            time.sleep(5)
                
    def addTweet(self,tweet,searchTime,term="",count=0):
        id = tweet['id_str']
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
            self.tweets[id]['searchTime']=searchTime
            self.tweets[id]['searchTerm']=term
        self.tweets[id]['count'] = self.tweets[id]['count'] +1
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.tweets.keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)   
            self.tweets=incontents
        
    def getSearchTerm(self,id):
        return self.tweets[id]['searchTerm']
    
    def getSearchTime(self,id):
        return self.tweets[id]['searchTime']
    
    def getText(self,id):
        tweet = self.getTweet(id)
        text=tweet['full_text']
        if 'retweeted_status' in tweet:
            original = tweet['retweeted_status']
            text=original['full_text']
        return text
                
    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
        
   
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
 
    def getCodes(self,id):
        tweet=self.getTweet(id)
        return tweet['codes']
  
    def getCodeProfile(self):
        summary={}
        for id in self.tweets.keys():
            tweet=self.getTweet(id)
            if 'codes' in tweet:
                for code in tweet['codes']:
                    if code not in summary:
                            summary[code] =0
                    summary[code]=summary[code]+1
        # sort by code name (descending); a lambda avoids needing the operator module, which is not imported
        sortedsummary = sorted(summary.items(),key=lambda item: item[0],reverse=True)
        return sortedsummary
    
    # new routine for classifying a token
    def includeToken(self,tok):
        val =False
        if tok.is_alpha == True and tok.is_stop == False:
            if tok.text =='RT':
                val = False
            elif tok.pos_=='NOUN' or tok.pos_=='VERB':
                val = True
        return val
    
    # new routine for filtering a list of tokens.
    def filterTweetTokens(self,tokens):
        filtered=[]
        for tok in tokens:
            # call the method on this object rather than the global function of the same name
            if self.includeToken(tok) == True:
                filtered.append(tok)
        return filtered
    
    def parseTweet(self,id):
        text = self.getText(id)
        # use the NLP object created in __init__ rather than a global
        parsed = self.nlp(text)
        self.tweets[id]['tokens']=parsed
        filtered= self.filterTweetTokens(parsed)
        self.tweets[id]['filteredTokens']=filtered
        
    def parseTweets(self):
        ids=self.getIds()
        for id in ids:
            self.parseTweet(id)
            
    def getTokens(self,id):
        if 'tokens' in self.tweets[id]:
            return self.tweets[id]['tokens']
        else: 
            return None
    
    def getFilteredTokens(self,id):
        if 'filteredTokens' in self.tweets[id]:
            return self.tweets[id]['filteredTokens']
        else:
            return None

### 3.1.6 Trying out the new routines for parsing a collection.

In [68]:
smoking=Tweets()
smoking.readTweets("tweets-smoking.json")
smoking.parseTweets()
smoking.countTweets()

100

In [69]:
tweet_id=random.choice(list(smoking.getIds()))
smoking.getText(tweet_id)
toks = smoking.getTokens(tweet_id)
print([token.text for token in toks])
filtered = smoking.getFilteredTokens(tweet_id)
print([f.text for f in filtered])

['Say', 'no', 'to', 'smoking', 'kids']
['Say', 'smoking', 'kids']


In [70]:
tweet_id=random.choice(list(smoking.getIds()))
smoking.getText(tweet_id)
toks = smoking.getTokens(tweet_id)
print([token.text for token in toks])
filtered = smoking.getFilteredTokens(tweet_id)
print([f.text for f in filtered])

['Made', 'a', 'sandwich', '10', 'min', 'ago', 'and', 'been', 'looking', 'for', 'it', 'ever', 'since', 'then', '\U0001f926', '🏾\u200d', '♂', '\ufe0f', 'I', 'got', 'ta', 'stop', 'smoking', '😂', 'https://t.co/NCbNOyvZXe']
['Made', 'sandwich', 'min', 'been', 'looking', 'got', 'stop', 'smoking']


In [71]:
tweet_id=random.choice(list(smoking.getIds()))
smoking.getText(tweet_id)
toks = smoking.getTokens(tweet_id)
print([token.text for token in toks])
filtered = smoking.getFilteredTokens(tweet_id)
print([f.text for f in filtered])

['@Ij4realOkoli', '@whema', 'Continue', 'smoking']
['Continue', 'smoking']


*END OF ANSWER cut above here*

---

# 3.1.6  Dependency parsing

*Dependency parsing* is the process of identifying the syntactic linkages between elements in a sentence. Dependency parsers link noun phrases to modifiers, subjects to objects, etc. The [spaCy description of dependency parsing](https://spacy.io/usage/linguistic-features#dependency-parse) provides a detailed introduction - here, we provide a brief summary.

Perhaps the easiest way to look at the parsing results is to look at the noun chunks found by spaCy. These can be found by looking at the `noun_chunks` attribute of the parser output:

In [72]:
tweet_id='974316984740429824'
sample=smoking.getText(tweet_id)
print(sample)
print("----")
parsed=nlp(sample)
for chunk in parsed.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,chunk.root.head.text)

#Smoking affects multiple parts of our body. Know more: https://t.co/hwTeRdC9Hf 
#SwasthaBharat #NHPIndia #mCessation #QuitSmoking https://t.co/x7xHO9G2Cr
----
#Smoking #Smoking nsubj affects
multiple parts parts dobj affects
our body body pobj of
https://t.co/hwTeRdC9Hf https://t.co/hwTeRdC9Hf ROOT https://t.co/hwTeRdC9Hf


We can see a few things from this example:

1. `#Smoking` is the nominal subject (`nsubj`) of the sentence, dependent on the verb `affects`.
2. `multiple parts` is a noun phrase with the root text `parts`. It is a direct object of the verb `affects`.
3. `our body` is the object of `of`.
4. The URL is at the root level of the sentence. 


We can look in more detail at the text, dependency, head, and children of each token.   

In [73]:
def printParseTree(parsed):
    print("{:10} {:10} {:7} {:7} {:30}".format("Token text","dep","Head text","POS","Children"))
    for tok in parsed:
        children=[child.text for child in tok.children]
        children=",".join(children)
        print("{:10} {:10} {:7} {:7} {:30}".format(str(tok.text),str(tok.dep_),str(tok.head.text),str(tok.head.pos_),children))

In [74]:
sents =list(parsed.sents)
print(sents[0])
printParseTree(sents[0])

#Smoking affects multiple parts of our body.
Token text dep        Head text POS     Children                      
#Smoking   nsubj      affects VERB                                  
affects    ROOT       affects VERB    #Smoking,parts,.              
multiple   amod       parts   NOUN                                  
parts      dobj       affects VERB    multiple,of                   
of         prep       parts   NOUN    body                          
our        poss       body    NOUN                                  
body       pobj       of      ADP     our                           
.          punct      affects VERB                                  


Here we can see that `affects` is the root verb, with `#Smoking` as its noun subject and `parts` as its object. `parts` is modified by `multiple` and `of our body`.  

We can use the [displacy](https://spacy.io/usage/visualizers#section-dep) renderer to show a graphical depiction of the dependencies. Since displacy seems to prefer showing the parse tree for an entire document, we'll try it on a single sentence.

Note - the "%%capture" line below tells Jupyter to hide some very ugly errors, while still displaying the nice graphical result. 
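
To make the tree structure concrete, the sketch below rebuilds the head and child relationships from the rows of the printed parse table, without calling spaCy. The `rows` data are copied from that table; `children_of` and `path_to_root` are hypothetical helpers written for this illustration, and they assume that each token text appears only once in the sentence (true here, but not in general).

```python
# (token, dependency label, head) rows copied from the printed parse table.
rows = [
    ("#Smoking", "nsubj", "affects"),
    ("affects", "ROOT", "affects"),
    ("multiple", "amod", "parts"),
    ("parts", "dobj", "affects"),
    ("of", "prep", "parts"),
    ("our", "poss", "body"),
    ("body", "pobj", "of"),
    (".", "punct", "affects"),
]

# Map each token to its head; the root token is its own head.
head_of = {tok: head for tok, _dep, head in rows}

def children_of(token):
    """Tokens whose head is `token`, in sentence order (the root is excluded)."""
    return [tok for tok, _dep, head in rows if head == token and tok != token]

def path_to_root(token):
    """Follow head links from `token` until reaching the root."""
    path = [token]
    while head_of[path[-1]] != path[-1]:
        path.append(head_of[path[-1]])
    return path

print(children_of("affects"))
print(" -> ".join(path_to_root("our")))
```

Following the head links from `our` passes through `body`, `of`, and `parts` before reaching `affects`, which is one way to see that `of our body` attaches to `parts`.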

In [75]:
%%capture --no-display
from spacy import displacy

text="#Smoking affects multiple parts of our body."
parsed=nlp(text)
displacy.render(docs=[parsed],jupyter=True, options={'distance': 90})

This diagram shows the structure given above in the printed version of the parse tree. 

These relationships might be useful for some NLP goals, particularly those involving relationships between concepts. 

A variety of approaches - including greedy algorithms, graph-based methods, and machine learning - can be used to extract dependencies. 

# 3.1.7 Named Entity Recognition

*Named entity recognition* is the process of assigning categories to known entities - places, people, things, etc. spaCy provides a statistical model capable of assigning an [entity type](https://spacy.io/api/annotation#named-entities) to many of the terms in a document. For an example, let's look at the entities found in a tweet:

In [76]:
tweet_id='974316845250633730'
sample = smoking.getText(tweet_id)
print(sample)
parsed=nlp(sample)
print("----")
for ent in parsed.ents:
    print(ent.text,ent.label_)

Scott Gottlieb points to potential of 8 million fewer smoking-related deaths -- "an undeniable public health benefit" -- as FDA starts process to cut nicotine levels in cigarettes. @lauriemcginley2  https://t.co/APLwo5Kpf1
----
Scott Gottlieb PERSON
8 million CARDINAL
FDA ORG


Note that entities are not equivalent to tokens: `Scott Gottlieb` and `8 million` are entities, but not tokens. For comparison:

In [77]:
print([tok.text for tok in parsed])

['Scott', 'Gottlieb', 'points', 'to', 'potential', 'of', '8', 'million', 'fewer', 'smoking', '-', 'related', 'deaths', '--', '"', 'an', 'undeniable', 'public', 'health', 'benefit', '"', '--', 'as', 'FDA', 'starts', 'process', 'to', 'cut', 'nicotine', 'levels', 'in', 'cigarettes', '.', '@lauriemcginley2', ' ', 'https://t.co/APLwo5Kpf1']


Thus, two tokens - `Scott` and `Gottlieb` - are combined to form a single entity - `Scott Gottlieb`. We can modify the above to see where each entity starts and ends:

In [78]:
print("{:15} {:5} {:5} {:5}".format("Text","Start","End","Type"))
for ent in parsed.ents:
    print("{:15} {:5} {:5} {:5}".format(ent.text,ent.start_char,ent.end_char,ent.label_))

Text            Start End   Type 
Scott Gottlieb      0    14 PERSON
8 million          38    47 CARDINAL
FDA               124   127 ORG  


Thus, `Scott Gottlieb` starts at character 0 and goes up through (but not including) character 14.
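
Because the end offset is exclusive, these start and end values can be used directly as Python slice bounds. The sketch below checks this on a prefix of the tweet text, using the offsets from the table above; the `entity_texts` helper is a hypothetical name introduced for this illustration, and the text is truncated after "process" only to keep the example short.

```python
# Prefix of the tweet text shown above (truncated after "process").
text = ('Scott Gottlieb points to potential of 8 million fewer '
        'smoking-related deaths -- "an undeniable public health benefit" '
        '-- as FDA starts process')

# (start_char, end_char, label) triples copied from the table above.
spans = [(0, 14, "PERSON"), (38, 47, "CARDINAL"), (124, 127, "ORG")]

def entity_texts(text, spans):
    """Recover each entity's text by slicing with its half-open character span."""
    return [(text[start:end], label) for start, end, label in spans]

for ent_text, label in entity_texts(text, spans):
    print(ent_text, label)
```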

We can also use the spaCy visualizer to look at the named entities in a sentence:

In [79]:
%%capture --no-display
displacy.render(docs=[parsed],jupyter=True, style='ent')



Let's try another.

In [80]:
tweet_id='974316628136652803'
sample = smoking.getText(tweet_id)
print(sample)
parsed=nlp(sample)
print("----")
for ent in parsed.ents:
    print(ent.text,ent.label_)

Should smoking be banned in movies? Peterborough Public Health officials are in favour:
https://t.co/2uEZPG3QF1 #Ptbo #Peterborough #smoking #smokinginmovies
----
Peterborough Public Health ORG


Here, we note that hashtags are not necessarily categorized as entities.  This might be a shortcoming if we were going to use named entities as part of our strategy for classifying tweets. The spaCy named entity recognizer is based on statistical models that can be extended given enough training data. See the discussion of [training the named entity recognizer](https://spacy.io/usage/training#section-ner) for details on how this might be done. 

*Challenge*: Collect some tweets with hashtags and train the spaCy named entity recognizer to add `HASHTAG` as a new entity type.
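
Short of retraining the recognizer, hashtags can also be pulled out with a simple pattern match. The sketch below is a rule-based stopgap rather than named entity recognition: it uses only Python's `re` module, and the `find_hashtags` helper and `HASHTAG_PATTERN` are hypothetical names introduced here. It returns the same text, start, and end layout used in the entity table earlier in this section.

```python
import re

# A '#' followed by one or more word characters (letters, digits, underscore).
HASHTAG_PATTERN = re.compile(r"#\w+")

def find_hashtags(text):
    """Return (hashtag, start_char, end_char) triples for each match in text."""
    return [(m.group(), m.start(), m.end()) for m in HASHTAG_PATTERN.finditer(text)]

# The tweet text from the example above.
sample = ("Should smoking be banned in movies? Peterborough Public Health "
          "officials are in favour:\nhttps://t.co/2uEZPG3QF1 "
          "#Ptbo #Peterborough #smoking #smokinginmovies")

for tag, start, end in find_hashtags(sample):
    print("{:20} {:5} {:5}".format(tag, start, end))
```

This pattern would also match a `#fragment` inside a URL, so URLs may need to be removed first for tweets that contain such links.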

# Exercise 3.2

The natural language processing pipeline consists of several processes that add substantial structure to our understanding of these Tweets. Tokenizing, part of speech tagging, lemmatization, dependency parsing, and named entity recognition each add different details that might be used to understand and classify documents, while also providing some hints as to interesting questions that we might ask.

Review some tweets and discuss any patterns or questions that arise. You might consider some of the following:
* Are there terms that show up more frequently in the `vaping` tweets as opposed to the `smoking` tweets?
* Are the tokens that we filtered (in Exercise 3.1) useful, or do we need the whole set of tokens to interpret the tweets?
* Are the named entities informative? 

Describe any other interesting phenomena that you think you might see in the corpus. Note that this question is not asking for fully statistically supported models. Rather, we're just looking for things that might be interesting to pursue further: it may turn out that any "patterns" you identify here are just incidental.
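
As a starting point for the first question, term counts from two corpora can be compared by relative frequency. The sketch below is one simple approach, not a statistical test. The helpers `term_frequencies` and `distinctive_terms` are hypothetical names, the `smoking_tokens` lists are adapted from filtered output shown earlier, and the `vaping_tokens` lists are invented placeholders that should be replaced with filtered tokens from a real `vaping` collection.

```python
from collections import Counter

def term_frequencies(token_lists):
    """Count lower-cased terms across a list of per-tweet token lists."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(t.lower() for t in tokens)
    return counts

def distinctive_terms(counts_a, counts_b, top_n=5):
    """Terms from corpus A ranked by how much larger their relative
    frequency is in A than in B; ties are broken alphabetically."""
    total_a = sum(counts_a.values()) or 1
    total_b = sum(counts_b.values()) or 1
    diffs = {term: counts_a[term] / total_a - counts_b[term] / total_b
             for term in counts_a}
    ranked = sorted(diffs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, diff in ranked[:top_n] if diff > 0]

# Filtered tokens per tweet. The vaping lists are invented placeholders.
smoking_tokens = [["Say", "smoking", "kids"], ["Continue", "smoking"], ["stop", "smoking"]]
vaping_tokens = [["vaping", "juice"], ["stop", "vaping"], ["vaping", "kids"]]

smoking_counts = term_frequencies(smoking_tokens)
vaping_counts = term_frequencies(vaping_tokens)

print(distinctive_terms(vaping_counts, smoking_counts, top_n=2))
print(distinctive_terms(smoking_counts, vaping_counts, top_n=1))
```

With a corpus of only 100 tweets, differences like these can easily be incidental, so they are best treated as prompts for further investigation.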