<a href="https://colab.research.google.com/github/afeld/python-public-policy/blob/main/lecture_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NYU Wagner - Python Coding for Public Policy**
# Class 5: Natural language processing (NLP)

# LECTURE

This lecture is more about showing what's possible with NLP than getting hung up on the specifics of how to do it.

But first, [a Twitter thread showing the power of data analysis](https://twitter.com/kate_ptrv/status/1332398737604431874).

In [40]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">I couldn’t just walk past this Tweet, so here is some fun <a href="https://twitter.com/hashtag/dataviz?src=hash&amp;ref_src=twsrc%5Etfw">#dataviz</a><br><br>Scented candles: An unexpected victim of the COVID-19 pandemic 1/n <a href="https://t.co/xEmCTQn9sA">https://t.co/xEmCTQn9sA</a> <a href="https://t.co/tVecEiX5Jc">pic.twitter.com/tVecEiX5Jc</a></p>&mdash; Kate Petrova (@kate_ptrv) <a href="https://twitter.com/kate_ptrv/status/1332398737604431874?ref_src=twsrc%5Etfw">November 27, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

## Setup

We will be using [spaCy](https://spacy.io/) for NLP. First, we need to download the model:

In [1]:
!python -m spacy download en_core_web_md --quiet

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [2]:
import spacy

nlp = spacy.load("en_core_web_md")

## Concepts

- Model
- Training
- Document
- [Features](https://spacy.io/usage/spacy-101#features)
  - Tokens
  - Parts of speech
  - Lemmas
  - Entities
- Vectors

## Text structure

Example from [DataCamp](https://campus.datacamp.com/courses/advanced-nlp-with-spacy/finding-words-phrases-names-and-concepts?ex=4).

In [3]:
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are.")

In [4]:
from spacy import displacy

displacy.render(doc)

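This shows the dependency parse: each word is labeled with its [part of speech](https://spacy.io/api/annotation#pos-tagging), and the arcs show how the words relate to each other grammatically.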

## Entities

In [5]:
displacy.render(doc, style='ent')

## Sentiment analysis

For this we'll use a different library, [TextBlob](https://textblob.readthedocs.io/), which works similarly to spaCy.

First, we need to [download the corpora](https://textblob.readthedocs.io/en/dev/install.html) it depends on:

In [6]:
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /Users/afeld/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /Users/afeld/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/afeld/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/afeld/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /Users/afeld/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/afeld/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.


In [7]:
from textblob import TextBlob

blob = TextBlob("I love this class!")
blob.sentiment

Sentiment(polarity=0.625, subjectivity=0.6)

In [8]:
blob2 = TextBlob("I hate pandas")
blob2.sentiment.polarity

-0.8

Let's get some real data to work with: [public comments](https://beta.regulations.gov/document/DOT-OST-2018-0068-12959/comment) on the proposed [Traveling by Air with Service Animals](https://beta.regulations.gov/document/DOT-OST-2018-0068-12959) rule. We'll use a cleaned-up version.

In [10]:
import pandas as pd

pd.options.display.max_colwidth = None

df = pd.read_csv('https://nyu.box.com/shared/static/1zdvfmpy5452nqdlb5b1yho6m3bdihk9.csv')
df

Unnamed: 0,id,date,content
0,DOT-OST-2018-0068-14971,2020-02-24T05:00:00Z,"My wife was depressed on a daily basis. She cried all the time. Once we got our little puppy, my wife feels she has a purpose and has not cried since. She needs to have him with her when travelling or she is very stressed and depressed. He has made such a difference. Please don't take away the ability for our dog to travel with her."
1,DOT-OST-2018-0068-14917,2020-02-24T05:00:00Z,"Docket number (DOT-OST-2018-0068) and the Regulatory ID number (RIN No. 2105-AE63)I am very much in favor of amended regulations that provide strict definitions for what type of animals might be permitted to accompany flyers in the cabin of commercial aircraft. As I have read the pending rules, only a properly trained service dog would fulfill the criteria to accompany a passenger with special needs in the cabin. The wide and random acceptance of non-professional trained emotional support animals in the cabin of aircraft would cease. This is essential for the health and safety of the passengers at large as well as the hard working crew. I hope to see this change made very soon. Thank you."
2,DOT-OST-2018-0068-14918,2020-02-24T05:00:00Z,Service animals should be restricted to those animals previously designated to aid physically incapacitated individuals and emotional support animals do not fulfil that category. Additionally dogs appear to be the only animal that I have observed capable of performing this task. All of the other pre-requisites appear to be reasonable requirements.
3,DOT-OST-2018-0068-14919,2020-02-24T05:00:00Z,"RE: DOCKET NUMBER DOT-OST-2018-0068I am a psychotherapist who has worked with many people with PTSD and Severe panic attacks. I have spoken with men whose PTSD, from serving in the military, has been so severe that they cannot go anywhere without their emotional support animals. I have also treated people who have such bad panic attacks that they become paralyzed or sometimes even faint.I implore you to not take away the privilege of these people having their ESAs on planes, to provide necessary comfort. Just because you cannot see a disorder, doesn't mean it doesn't exist. I liken it to a domestic violence survivor who was emotionally abused. His or her pain can be just as great as one who was physically abused.You will be doing a tremendous disservice to people who legitimately need their ESAs, if you don't let these people travelwith them. How can passengers prove that the service their ESAs serve is to help them feel safe? This is not an observerable task.Clearly people have abused this policy and that couldn't be more wrong. However, there must be a better solution than what the DOT is proposing. Some airlines are having passengers sign waivers that their pets are well-trained. Can't that be enough? Also, can't the proposal just consider allowing people with just dogs and cats as ESAs?Please consider what I have said. If this rule is instituted, many people's lives will be severely negatively affected.Thank you."
4,DOT-OST-2018-0068-14920,2020-02-24T05:00:00Z,"I really appreciate the time that has been spent to really create a workable definition for what should reasonably considered a service animal. I would like to take the time to comment as a frequent passenger of airlines, as my experience is limited to that specifically on the call for comment on restriction by breed. Cabin spaces can be incredibly confined and the presence of a dog that exhibits characteristics of Bully breeds might initially suggest that these dogs should be prohibited. However I would like to offer two counterpoints (1) Bully breeds vary in size and (2) Significant accommodation and allowance are made for large bags As in all breeds there is significant variance of size. While what some would consider pit bulls can be as large as boxers there are others that due to their mixed background are much smaller than a golden retriever or German shepherd which might be considered the poster dogs for service animals. Simply restricting on the basis of appearance of Bully breeding, is close but not sufficient as a workable definition and places legitimate service animals and their handlers at risk. Second, from a narrative standpoint, as a student I have consistently observed the large backpacks taken by students onto planes at the beginning or end of breaks. These bags stuffed with books, computers, and supplies are often too large to fit underneath the slightly smaller spaces allotted for the aisle and window seats. In my own experience, I have only known legitimate service dogs to sit in the front row where they have space to lay or in a separate reserved seat. Given the tolerance we understandably extend to college students at the holidays, I think it would not be unreasonable to extend similar patience towards dogs that significantly improve and aid their handler's successful maneuvering through society."
...,...,...,...
493,DOT-OST-2018-0068-15072,2020-02-24T05:00:00Z,"Yes, I totally approve of the rules for service animals. Passengers claiming pets as emotional support animals threaten the safety and health of other passengers. I was on a flight recently in which a unleashed dog was allowed to wander around the plane. In addition to allergy problems for passengers, dogs can spread germs and bacteria. Please ban all untrained emotional support animals."
494,DOT-OST-2018-0068-15074,2020-02-24T05:00:00Z,"As a dog owner and frequent traveler, I don't have a service animal. But I feel as long as the animal behaves and is not too big, who cares.Some people have issues and they should be helped."
495,DOT-OST-2018-0068-15075,2020-02-24T05:00:00Z,Only true service dogs should be allowed in the airline cabin. Only dogs. And even dogs should be severely restricted. Emotional support animals should not be included. The DOT should require a License of some type that those with service dogs must obtain. It should require a statement from a medical professional re the need for the service dog and also a certificate showing training.I realize some people really need their service dog and of course am willing to make accomodations. But other people have needs as well. My wife is deathly afraid of dogs (even really small ones which I find amusing). I am not afraid of dogs but I am severely allergic to dogs. Planes are full of enough pathogens for all of us to worry about without having to worry about dog dander as well. Thank you for your consideration.
496,DOT-OST-2018-0068-15078,2020-02-24T05:00:00Z,"I am 100% behind new the new proposed regulations limiting emotional support animals. This has been abused for way too long. Anyone can get a note from their vet and/or doctor saying that their poor pet is needed for emotional stability. Enough. Flying is for humans, not pets - at least not pets that share cabin seats. I would like it to be limited to only official & documented service dogs (not all animals). Enough with this attitude that my pet can go with me anywhere."


How would you determine how many are for/against?

In [11]:
df.content.str.contains('support').value_counts()

True     291
False    207
Name: content, dtype: int64

In [12]:
df.content.str.contains('against').value_counts()

False    471
True      27
Name: content, dtype: int64
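
Keyword counts like the two above are easy to misread. As a rough illustration (the comments below are made up for this sketch, not taken from the dataset, and `count_matches` is a hypothetical helper), the word "support" also appears in "emotional support animals", and "against" does not always signal opposition to the rule:

```python
# Made-up comments for illustration only; they are not from the dataset.
comments = [
    "I support the proposed rule.",
    "Please ban emotional support animals from the cabin.",
    "I am against this rule.",
    "Emotional support animals protect me against panic attacks.",
]


def count_matches(texts, keyword):
    """Count the texts that contain the keyword, ignoring case."""
    return sum(1 for text in texts if keyword in text.lower())


for keyword in ["support", "against"]:
    print(keyword, count_matches(comments, keyword))
```

Only one of the made-up comments actually states support for the rule, yet three of them match "support". Substring matching alone cannot tell the two uses apart.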

## Vectors

In short, a vector represents the "meaning" of a word as a list of numbers, which correspond to a point in a high-dimensional space. Words with similar meanings end up near each other. [More info.](https://www.youtube.com/watch?v=LSS_bos_TPI&list=PLRqwX-V7Uu6aQ0oh9nH8c6U1j9gCg-GdF&index=2&t=336s)

In [13]:
nlp('apple').vector

array([-3.6391e-01,  4.3771e-01, -2.0447e-01, -2.2889e-01, -1.4227e-01,
        2.7396e-01, -1.1435e-02, -1.8578e-01,  3.7361e-01,  7.5339e-01,
       -3.0591e-01,  2.3741e-02, -7.7876e-01, -1.3802e-01,  6.6992e-02,
       -6.4303e-02, -4.0024e-01,  1.5309e+00, -1.3897e-02, -1.5657e-01,
        2.5366e-01,  2.1610e-01, -3.2720e-01,  3.4974e-01, -6.4845e-02,
       -2.9501e-01, -6.3923e-01, -6.2017e-02,  2.4559e-01, -6.9334e-02,
       -3.9967e-01,  3.0925e-02,  4.9033e-01,  6.7524e-01,  1.9481e-01,
        5.1488e-01, -3.1149e-01, -7.9939e-02, -6.2096e-01, -5.3277e-03,
       -1.1264e-01,  8.3528e-02, -7.6947e-03, -1.0788e-01,  1.6628e-01,
        4.2273e-01, -1.9009e-01, -2.9035e-01,  4.5630e-02,  1.0120e-01,
       -4.0855e-01, -3.5000e-01, -3.6175e-01, -4.1396e-01,  5.9485e-01,
       -1.1524e+00,  3.2424e-02,  3.4364e-01, -1.9209e-01,  4.3255e-02,
        4.9227e-02, -5.4258e-01,  9.1275e-01,  2.9576e-01,  2.3658e-02,
       -6.8737e-01, -1.9503e-01, -1.1059e-01, -2.2567e-01,  2.41

In [14]:
apple = nlp('apple')
pear = nlp('pear')
apple.similarity(pear)

0.5968762340877272

In [15]:
truck = nlp('truck')
apple.similarity(truck)

0.18432754056976158
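
By default, spaCy's `similarity` is based on the cosine similarity between the two vectors, meaning how closely they point in the same direction. Here is a minimal sketch of that calculation in plain Python. The three-number vectors are invented for illustration; the real vectors are far longer, as the output above shows.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this sketch (not real spaCy vectors).
apple_vec = [0.9, 0.8, 0.1]
pear_vec = [0.8, 0.9, 0.2]
truck_vec = [0.1, 0.2, 0.9]

print(cosine_similarity(apple_vec, pear_vec))   # close to 1: similar direction
print(cosine_similarity(apple_vec, truck_vec))  # much lower: different direction
```

A score near 1 means the vectors point in nearly the same direction, and a score near 0 means they are unrelated.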

## Finding duplicates

In [16]:
df.shape

(498, 3)

In [17]:
df.drop_duplicates('content').shape

(473, 3)

To clean up duplicates more robustly, normalize the text before comparing it, for example by:

- [Removing extra whitespace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html)
- [Normalizing the case](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.casefold.html)
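
The two normalization steps above can be sketched in plain Python. This is an illustration under assumptions, not the course's method: `normalize` is a hypothetical helper, the sample comments are made up, and collapsing internal whitespace goes one step further than `strip`.

```python
import re


def normalize(text):
    """Trim the ends, collapse runs of whitespace, and ignore case."""
    collapsed = re.sub(r"\s+", " ", text.strip())
    return collapsed.casefold()


# Made-up comments: the first two differ only in case and whitespace.
comments = [
    "Please ban all untrained emotional support animals.",
    "  please ban all untrained   emotional support animals.\n",
    "Only true service dogs should be allowed.",
]

seen = set()
unique = []
for comment in comments:
    key = normalize(comment)
    if key not in seen:  # keep only the first comment for each normalized key
        seen.add(key)
        unique.append(comment)

print(len(comments), len(unique))
```

With the DataFrame, a similar key could be built with something like `df.content.map(normalize)` and then passed through `drop_duplicates`.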

## Classification

We want to identify which comments are in support of regulation vs. not…without reading all of them. In NLP, this is known as [text classification](https://monkeylearn.com/text-classification/).


1. Create a training dataset.
   1. Get a subset of the data.
   1. Manually "label" it.
1. Create a model, feeding it the training dataset.
1. Apply the model to the unlabeled comments.

In [18]:
train = [
    ('I love this sandwich.', 'pos'),
    ('this is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('this is my best work.', 'pos'),
    ("what an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('he is my sworn enemy!', 'neg'),
    ('my boss is horrible.', 'neg')
]

from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train)

cl.classify('this is great')

'pos'

In [19]:
cl.classify('I hate it')

'neg'

We can use a _different_ set of pre-labeled data to see how well the model works:

In [20]:
test = [
    ('the beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]

cl.accuracy(test)

0.8333333333333334
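
Accuracy is the fraction of test examples whose predicted label matches the hand-assigned label. The 0.83 above is consistent with 5 of the 6 test sentences being classified correctly. Here is a minimal sketch of the calculation, using a hypothetical keyword-based stand-in (`keyword_classifier`) rather than the Naive Bayes model, and a small made-up test set:

```python
def accuracy(predict, labeled_examples):
    """Fraction of (text, label) pairs for which predict(text) == label."""
    correct = sum(1 for text, label in labeled_examples if predict(text) == label)
    return correct / len(labeled_examples)


def keyword_classifier(text):
    """Hypothetical stand-in: 'neg' if a negation word appears, else 'pos'."""
    negative_words = {"not", "can't", "ain't"}
    words = text.lower().split()
    return "neg" if any(word in negative_words for word in words) else "pos"


# Small illustrative test set; the last example has no negation word,
# so the stand-in classifier gets it wrong.
examples = [
    ("the beer was good.", "pos"),
    ("I do not enjoy my job", "neg"),
    ("I feel amazing!", "pos"),
    ("my boss is horrible.", "neg"),
]

print(accuracy(keyword_classifier, examples))  # 3 of 4 correct
```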

## [In-class exercise](https://colab.research.google.com/github/afeld/python-public-policy/blob/main/lecture_5_exercise.ipynb)

## Features

In [21]:
cl.show_informative_features(5)

Most Informative Features
            contains(my) = True              neg : pos    =      1.7 : 1.0
            contains(an) = False             neg : pos    =      1.6 : 1.0
             contains(I) = False             pos : neg    =      1.4 : 1.0
             contains(I) = True              neg : pos    =      1.4 : 1.0
            contains(my) = False             pos : neg    =      1.3 : 1.0
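
Each row reads as a likelihood ratio: `contains(my) = True` with `neg : pos = 1.7 : 1.0` means that, in the training data, the word "my" showed up about 1.7 times as often in negative examples as in positive ones. The features themselves are yes/no answers to "does the text contain this word?". Below is a simplified sketch of how such features could be built. `extract_features` is a hypothetical helper, and the whitespace tokenization is an assumption, so it will not match TextBlob's own extractor exactly.

```python
def extract_features(text, vocabulary):
    """Map each vocabulary word to whether it appears in the text."""
    words = set(text.split())  # naive whitespace tokenization (assumption)
    return {f"contains({word})": (word in words) for word in sorted(vocabulary)}


# Vocabulary built from two of the training sentences above.
train_texts = ["my boss is horrible.", "what an awesome view"]
vocabulary = {word for text in train_texts for word in text.split()}

features = extract_features("he is my sworn enemy!", vocabulary)
print(features["contains(my)"], features["contains(an)"])
```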


## TF-IDF

TF-IDF stands for term frequency–inverse document frequency. Put another way:

> How specific is a given word to this particular "document" relative to the others?

[More info](https://monkeylearn.com/blog/what-is-tf-idf/)
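
To make the idea concrete, here is a minimal sketch of one common TF-IDF variant in plain Python: term frequency is the word's share of the document, and inverse document frequency is the log of (number of documents / number of documents containing the word). Libraries use slightly different formulas, so treat the exact numbers as illustrative. `tf_idf` is a hypothetical helper, and the documents are short sentences in the style of the training data.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # In how many documents does each word appear at least once?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


docs = [
    "this is an amazing place",
    "this is my best work",
    "this is my sworn enemy",
]
scores = tf_idf(docs)
print(scores[0]["this"])     # appears in every document, so it scores 0
print(scores[0]["amazing"])  # appears in only one document, so it scores highest
```

Words that appear everywhere, such as "this" and "is", get a score of zero, which is the same intuition behind removing stop words.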

In [22]:
nlp.Defaults.stop_words

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [23]:
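for sentence in train:
    # each training example is a (text, label) tuple; parse just the text
    doc = nlp(sentence[0])
    # keep tokens that are not stop words; the comparison is case-sensitive,
    # so a capitalized token like "I" is not filtered out
    non_stop_tokens = [token for token in doc if token.text not in nlp.Defaults.stop_words]
    print(non_stop_tokens)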

[I, love, sandwich, .]
[amazing, place, !]
[I, feel, good, beers, .]
[best, work, .]
[awesome, view]
[I, like, restaurant]
[I, tired, stuff, .]
[I, deal]
[sworn, enemy, !]
[boss, horrible, .]


## [Homework 5 + 6](https://colab.research.google.com/github/afeld/python-public-policy/blob/main/hw_5.ipynb)

## Lecture 6