# DSCI 521: Methods for analysis and interpretation <br>Chapter 2: Feature engineering and language processing

## 2.0 Features

When analyzing large amounts of data, it's important to be able to find trends and patterns. A useful tool in this regard is the _feature_, and the process of creating features is usually referred to as _feature engineering_. Any specific characteristic of a data point may be used as a _feature_. 

Say we had a large group of people, and we wanted to create an algorithm that predicts which individuals are professional basketball players. The most immediate thing that may pop into one's head when it comes to professional basketball players is height, so we could create a height feature by measuring each person's height and recording it in our dataset. This is an example of a _numeric feature_, the focus of __Chapter 1__! These will be incredibly useful for us in the future, when we start using machine learning techniques. Of course, there are plenty of non-numeric feature types as well, which are the focus of this chapter! But to get started, let's look at a familiar example of numerical feature engineering: standardization.

### 2.0.1 Example: standardizing numerical features

In order to avoid the impact of using different units such as inches or pounds for measured features, it's common and important for analysis to standardize data. Essentially, this is a numerical operation where if the data is denoted by $x$ and its mean and standard deviation are denoted by $\mu$ and $\sigma$, respectively, the standardization operation can be expressed as the following transformation:

$$x_i \mapsto \frac{x_i - \mu}{\sigma}$$

Mathematically, this _translates_ the _mean_ of the data to 0 via subtraction (addition) and _scales_ the _standard deviation_ to 1 via division (multiplication). We'll think more about numeric feature development in future chapters, using different measures and quantitative frameworks, but to calculate the current transformation, we only need functions from the `numpy` module (we'll use a [dataset of baseball player heights and weights](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights)):

In [9]:
import numpy as np

# defining a function to perform the standardization

def standardize(data):
    mean = np.mean(data)
    stdev = np.std(data)
    
    standardized_data = (data - mean) / stdev
    
    return standardized_data

In [11]:
import pandas as pd

# load some sample data of baseball player heights and weights
baseball_data = pd.read_csv("./data/baseball_heightweight.csv", header = 0)
print(baseball_data.head())

# standardizing the heights
baseball_data["Height"] = standardize(baseball_data["Height"])

# standardizing the weights
baseball_data["Weight"] = standardize(baseball_data["Weight"])

# standardizing the ages
baseball_data["Age"] = standardize(baseball_data["Age"])

baseball_data.head(15)

              Name Team       Position  Height  Weight    Age
0    Adam_Donachie  BAL        Catcher      74   180.0  22.99
1        Paul_Bako  BAL        Catcher      74   215.0  34.69
2  Ramon_Hernandez  BAL        Catcher      72   210.0  30.78
3     Kevin_Millar  BAL  First_Baseman      72   210.0  35.43
4      Chris_Gomez  BAL  First_Baseman      73   188.0  35.71


| | Name | Team | Position | Height | Weight | Age |
|---|---|---|---|---|---|---|
| 0 | Adam_Donachie | BAL | Catcher | 0.131344 | -1.033741 | -1.330806 |
| 1 | Paul_Bako | BAL | Catcher | 0.131344 | 0.634409 | 1.378644 |
| 2 | Ramon_Hernandez | BAL | Catcher | -0.736447 | 0.396102 | 0.473178 |
| 3 | Kevin_Millar | BAL | First_Baseman | -0.736447 | 0.396102 | 1.550011 |
| 4 | Chris_Gomez | BAL | First_Baseman | -0.302552 | -0.652449 | 1.614852 |
| 5 | Brian_Roberts | BAL | Second_Baseman | -2.038133 | -1.224387 | 0.151286 |
| 6 | Miguel_Tejada | BAL | Shortstop | -2.038133 | 0.348441 | 0.470863 |
| 7 | Melvin_Mora | BAL | Third_Baseman | -1.170343 | -0.080512 | 1.466643 |
| 8 | Aubrey_Huff | BAL | Third_Baseman | 0.999134 | 1.396992 | 0.336548 |
| 9 | Adam_Stern | BAL | Outfielder | -1.170343 | -1.033741 | -0.390603 |


As can be seen, the data no longer depends on units, and rather it is expressed in a standard way that very visibly describes a player's variation from the mean. The closer a value is to 0, the closer that attribute of the player is to the average.
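
As a quick sanity check on this claim, here is a minimal sketch that uses only Python's standard library on a handful of made-up heights. The values and the helper name `standardize_list` are illustrative assumptions and are not part of the baseball dataset. The sketch uses the population standard deviation, which is also what `np.std` computes by default.

```python
import statistics

def standardize_list(values):
    """Standardize a list of numbers using the population standard deviation."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population std, like np.std's default (ddof=0)
    return [(v - mean) / stdev for v in values]

# made-up heights in inches, for illustration only
heights_in = [70.0, 72.0, 74.0, 76.0, 78.0]

# the same heights converted to centimeters
heights_cm = [h * 2.54 for h in heights_in]

z_in = standardize_list(heights_in)
z_cm = standardize_list(heights_cm)

print(z_in)
print(z_cm)
```

The two printed lists should agree up to floating-point rounding, which is the sense in which standardized values no longer depend on the original units.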

### 2.0.2 Working with unstructured data

Much of the data we will run into is fairly unstructured in nature. This is where expertise in data science becomes especially essential. With the rise of social media platforms online, a huge amount of data is comprised of text, and textual data is not generally distributed or generated in neat, structured manners with a set of clearly-defined features. How can we analyze this ubiquitous textual data? As an example, if you wanted to compare two people by the text they've written it might be natural to compare their words, but how do you identify the words in a document?

### 2.0.3 Featurization

Adding structure to an unstructured text object is called _featurization_. Featurization does not only refer to a process for text, but really to any circumstance in which you might need a higher-level, more succinct, or refined form of data representation. For images, it's common to extract feature objects. For example, an image's objects might include cars, people, or lines of paint on the road, depending on the application. These might be represented as polygons or boxed pixel regions.

As it turns out, featurization in images is a relatively complex task whose study could probably comprise an entire course on its own. Since text is quite accessible as a data type, still unstructured, and is human readable, we'll study featurization in its context. Moreover, the basic methodology (regular expressions) required to featurize text is the same for fixing file malformations and other pre-processing tasks, so it may already be familiar! 

Featurization of textual data is a small part of the large field of Natural Language Processing (NLP), so we'll review the field a bit and then explore its tools.

## 2.1 NLP

### 2.1.0 What is NLP?

As mentioned, a huge amount of the data available on the internet comes in the form of unstructured text, mostly generated by humans. Such text of strictly human origin is usually referred to as _natural language_. Thus, a simple way to describe NLP is as the study of techniques that allow for the processing of natural language. In recent years, it has become one of the most important and hottest subfields of computer science. Definitely not a bad thing to familiarize yourself with! Some of the more typical problems include speech recognition (spoken word also counts as natural language!), natural language understanding (machine reading comprehension of natural language), and natural language generation (using a machine to automatically generate human-like text; think of something like Siri or Alexa).

### 2.1.1 Regular expressions

Regular expressions, or regex, are "sequences of characters that define a search pattern", according to Wikipedia. These patterns can be used to search for, find, replace, and do a great deal more with strings.

Regular expression patterns are constructed with both ordinary and special characters. The simplest regular expressions are simply ordinary characters like "A", or "5", or "status". These patterns only match themselves, allowing you to search for exact patterns of characters. Some characters are "special" for regex, like "|" or "\[" or "^". These characters can be used to construct regex that is more powerful than straightforward matching.
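
To make the distinction between ordinary and special characters concrete, here is a small sketch. The string `text` is a made-up example. It shows the same characters behaving as an operator when unescaped and as a literal when escaped, and it uses `re.escape()` from the standard `re` module to do the escaping automatically.

```python
import re

text = "status|active"

# Unescaped, "|" means "or": this pattern matches "status" OR "active"
either = re.findall("status|active", text)
print(either)

# Escaped, "\|" matches a literal vertical bar
literal = re.findall(r"status\|active", text)
print(literal)

# re.escape() escapes every special character in a string for us
escaped_pattern = re.escape("status|active")
via_escape = re.findall(escaped_pattern, text)
print(via_escape)
```

The `r` prefix creates a raw string, so Python passes the backslash through to `re` unchanged.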

#### 2.1.1.1 Basics

Python's included `re` module can be used to construct and use regular expressions. It comes with many useful functions. The most basic of these is `re.search()`, which returns a match object if the pattern matched the string and a `None` value if it didn't. This means `re.search()` outputs can be used with conditional statements (like `if` statements).

In [12]:
import re 

silly_string = "one fish two fish red fish blue fish"
print(re.search("fish", silly_string))

<re.Match object; span=(4, 8), match='fish'>


In [13]:
print(re.search("salmon", silly_string))

None


In [14]:
if re.search("fish", silly_string):
    print("Fish were found.")
else:
    print("There were no fish.")

Fish were found.
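
The match object returned by `re.search()` carries more than a yes/no answer. As a brief sketch, the standard `group()`, `start()`, `end()`, and `span()` methods recover the matched text and where it sits in the original string:

```python
import re

silly_string = "one fish two fish red fish blue fish"
match = re.search("fish", silly_string)

# group() returns the matched text
matched_text = match.group()
print(matched_text)

# start(), end(), and span() give the position of the match
print(match.start(), match.end())
print(match.span())

# the span can be used to slice the original string
start, end = match.span()
print(silly_string[start:end])
```

Note that `re.search()` only reports the first match it finds, even though "fish" appears several times in this string.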


Another useful function is `re.sub()`, which takes a pattern, a replacement, and a string as input and replaces every match of the pattern in the string with the replacement.

In [None]:
silly_cats = re.sub("fish", "cat", silly_string)

print(silly_cats)

`re.findall()` will return all matches of a pattern in a string:

In [None]:
print(re.findall("cat", silly_cats))

#### 2.1.1.2 A few useful character classes and other means for flexibility

- `.` __(wild card)__ In the default mode, this matches any character except a newline.
- `[...]` __(character class)__ Used to indicate flexible matching across a specified set of characters.
- `[^...]` __(complementary character class)__ Used to indicate flexible matching across _everything but_ a specified set of characters.
- `[a-z]` __(lowercase range)__ Used to indicate flexible matching across lowercase letter ranges.
- `[A-Z]` __(uppercase range)__ Used to indicate flexible matching across uppercase letter ranges.
- `[0-9]` __(numeric range)__ Used to indicate flexible matching across numeric ranges.
- `|` __(or)__ `A|B` creates a regular expression that will match either A or B.
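
Each entry in the list above can be tried on a short string. The following is a minimal sketch; the `sample` string is made up so that each pattern picks out a different subset of it.

```python
import re

sample = "cat, cot, cut, c4t, Cat"

dot_matches = re.findall("c.t", sample)          # any character between "c" and "t"
class_matches = re.findall("c[ao]t", sample)     # only "a" or "o" in the middle
negated_matches = re.findall("c[^ao]t", sample)  # anything except "a" or "o"
range_matches = re.findall("c[0-9]t", sample)    # a digit in the middle
upper_matches = re.findall("[A-Z]at", sample)    # an uppercase letter, then "at"
or_matches = re.findall("cat|cut", sample)       # either alternative

for name, result in [("wild card", dot_matches), ("class", class_matches),
                     ("negated class", negated_matches), ("range", range_matches),
                     ("uppercase", upper_matches), ("or", or_matches)]:
    print(name, result)
```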

Like the `string.split()` method, the `re` module has a `re.split()` function, which splits a string wherever a regex pattern matches. We could combine this with a character class:

In [None]:
not_a_silly_string = "Oftentimes, different punctuation characters are used; these indicate different types of stops."

## split a string by several types of punctuation
clauses = re.split("[,;.]", not_a_silly_string)
print(clauses)

If we have some text that we suspect contains Philadelphia area ZIP codes, we could use character classes to extract these.

In [None]:
text = "Drexel's University City campus falls in 19104, while the College of Nursing is in 19102 and the Philadelphia City Hall is in 19107."

# we know philly zipcodes go from 19102 to 19154, so this slightly broader pattern (19100 to 19159) covers them
zipcodes = re.findall("191[0-5][0-9]", text)
print(zipcodes)

#### 2.1.1.3 Exercise: Regex phone numbers
Read the file `phone-numbers.txt`. It contains a phone number in each line. \[Hint: use something like `lines = open("file.txt", "r").readlines()`\] Store only the phone numbers with the area code "215" in a list and print it out. Use regex-based pattern matching, not any other methods which occur to you.

In [None]:
# code goes here

#### 2.1.1.4 Grouping, numbered groups and extensions
Grouping is a great way to modify and extend strings, without simply replacing them. With grouping, you can use the matched content in a substitute string. It's great for re-formatting text. Groups can also serve extended functions if they are initiated by an unescaped question mark.
- `(...)` __(group)__ Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the `\1`, `\2`, etc., special sequences, described below.
- `\1`, `\2`, etc. __(captured groups)__ Matched groups are captured and held in order: low to high from left to right, and in the case of nested groups, from outside to inside.
- `(?:...)` __(non-capturing group)__ Matches `...` as in the parentheses, but does not capture it in a group. This becomes especially important when applying multipliers.
- `(?=...)` __(lookahead)__ Matches if `...` matches next, but doesn’t consume any of the string.
- `(?!...)` __(negative look ahead)__ Matches if `...` doesn’t match next.
- `(?<=...)` __(positive look behind)__ Matches if the current position in the string is preceded by a match for `...` that ends at the current position. 
- `(?<!...)` __(negative look behind)__ Matches if the current position in the string is not preceded by a match for `...`.

In [None]:
tommy_two_tone = ("Apparently, 867-5307 is Jenny's phone number,"+
                 " but I'm not sure what her area code is.")

## let's capture Jenny's phone number and insert the area code
modified_tommy_two_tone = re.sub(r"([0-9][0-9][0-9]-[0-9][0-9][0-9][0-9])",
                                 r"1-800-\1", 
                                 tommy_two_tone)

print(modified_tommy_two_tone)
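
The lookaround and numbered-group forms listed above can be sketched on a couple of made-up strings. Everything here uses standard `re` syntax. The only piece not covered in this section is `\b`, which matches at a word boundary and is used below to keep the doubled-word pattern from matching inside longer words.

```python
import re

prices = "apples cost $3 and pears cost $5, but 12 plums are free"

# positive look behind: digits only when immediately preceded by a dollar sign
dollar_amounts = re.findall(r"(?<=\$)[0-9]+", prices)

# negative look behind: digits NOT preceded by a dollar sign or another digit
other_numbers = re.findall(r"(?<![$0-9])[0-9]+", prices)

# lookahead: the word just before " cost", without consuming " cost" itself
items = re.findall(r"[a-z]+(?= cost)", prices)

# numbered group: \1 re-matches whatever group 1 captured, so this finds doubled words
deduplicated = re.sub(r"\b([a-z]+) \1\b", r"\1", "the the cat sat on on the mat")

print(dollar_amounts)
print(other_numbers)
print(items)
print(deduplicated)
```

Because lookarounds do not consume any of the string, the dollar sign and the word "cost" are required for a match but never appear in the results.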

#### 2.1.1.5 Multipliers (quantifiers)
It was a little bit of overkill to use the numeric character class so many times in a row in the last expression. This is an example of where multipliers can come in really handy.
- `*` __(zero or more)__ Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. 
- `+` __(one or more)__ Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
- `?` __(zero or one)__ Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
- `{m}` __(exactly m times)__ Specifies that exactly m copies of the previous RE should be matched.
- `{m,n}` __(m through n times)__ Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. 
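
As a quick illustration of the list above, here is a sketch that applies each multiplier to the same made-up string of "a"s and "b"s, so the differences are easy to compare side by side.

```python
import re

sample = "a ab abb abbb abbbb"

star = re.findall("ab*", sample)         # "a" followed by zero or more "b"s
plus = re.findall("ab+", sample)         # "a" followed by one or more "b"s
optional = re.findall("ab?", sample)     # "a" followed by at most one "b"
exactly = re.findall("ab{2}", sample)    # "a" followed by exactly two "b"s
between = re.findall("ab{2,3}", sample)  # "a" followed by two to three "b"s

print(star)
print(plus)
print(optional)
print(exactly)
print(between)
```

Notice that `ab{2,3}` takes three "b"s from "abbbb" rather than two: multipliers match as many repetitions as they can.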

In [None]:
tommy_two_tone = "Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is."

## let's get all of the phone numbers in a string
numbers = re.findall("[0-9]{3}-[0-9]{4}", tommy_two_tone)
print(numbers)

In [None]:
tommy_two_tone = "Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is."

## capture the word that appears before a lookahead: "'s phone number" 
## by matching one or more non-space characters before 
## along with the number itself with flexible "|" matching
whos_number = re.findall("([^ ]+)(?='s phone number)|([0-9]{3}-[0-9]{4})", tommy_two_tone)

print(whos_number)

In [None]:
## We can even get a bit more flexible with our area-code handling!
tommy_two_tone = "Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is."
my_contact_information = "If you need my office line, it's 215-895-2185"

## By grouping and using a `{1,2}` flexible match, we can get full and partial numbers
## Note: we have to use a non-capturing group in order to make sure we get the full expression
## without capturing the first three digits, only.
numbers =  re.findall("(?:[0-9]{3}-){1,2}[0-9]{4}", tommy_two_tone)
print(numbers)
numbers =  re.findall("(?:[0-9]{3}-){1,2}[0-9]{4}", my_contact_information)
print(numbers)

#### 2.1.1.6 Escapes and special sequences
As it turns out, some character classes are so common that they have their own special sequences. So, our phone-number example could be even more concise with the `\d` special sequence.
- `\` __(escape)__ Either escapes special characters (permitting you to match characters like `*`, `?`, and so forth), or signals a special sequence.
- `\d` __(digits)__ Matches any Unicode decimal digit. This includes `[0-9]`, and also many other digit characters.
- `\D` __(non-digits)__ Matches any character that is not a Unicode decimal digit.
- `\s` __(whitespace)__ Matches Unicode whitespace characters, including `[\t\n\r]` and space.
- `\w` __(word characters)__ Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore.
- `\W` __(non-word characters)__ Matches Unicode non-word characters;
- `\t` __(tab)__ Matches a tab character.
- `\n` __(newline)__ matches a newline character.
- `\r` __(carriage return)__ matches a carriage return character.
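
Here is a short sketch of several of these sequences at work on a made-up `record` string that contains a tab and a newline. The patterns are written as raw strings (with the `r` prefix) so that Python passes the backslashes through to `re` unchanged.

```python
import re

# "\t" and "\n" here are a real tab and a real newline inside the string
record = "Room 12B,\tfloor 3\nPhone: 215-895-2185 (cost: $5.00?)"

digits = re.findall(r"\d+", record)           # runs of digits
words = re.findall(r"\w+", record)            # runs of word characters
pieces = re.split(r"\s+", record)             # split on any run of whitespace
price = re.findall(r"\$\d+\.\d{2}\?", record)  # escaped "$", "." and "?" match literally

print(digits)
print(words)
print(pieces)
print(price)
```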

In [None]:
tommy_two_tone = "Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is."

## let's get all of the phone numbers in a string
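## Note: the hyphen belongs inside the repeated group so that an optional area code can match;
## the raw string (r"...") keeps Python from interpreting "\d" as a string escape
numbers = re.findall(r"(?:\d{3}-){1,2}\d{4}", tommy_two_tone)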
print(numbers)

In [None]:
tommy_two_tone = "Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is."

## re.search() finds only the first match; the outer parentheses capture it as group 1
numbers = re.search(r"((?:\d{3}-){1,2}\d{4})", tommy_two_tone)
print(numbers.groups())

#### 2.1.1.7 Anchors
Anchors allow you to fix the position of a match relative to the start or end of a string. These become especially handy if you are pre-processing semi-structured text, like a screenplay, stenographer's court record, or the index of a book.
- `^` __(start anchor)__ Matches the start of the string.
- `$` __(end anchor)__ Matches the end of the string or just before the newline at the end of the string.

In [None]:
## an example of some semi-structured text
macbeth = ("First Witch: When shall we three meet again? In thunder, lightning, or in rain?\n"+
           "Second Witch: When the hurlyburly's done, when the battle's lost and won.")
print(macbeth)
print("")

## make some empty lists for our data
speakers = []
speeches = []

## split the document into the lines of the play
lines = macbeth.split("\n")

## loop over the lines
for line in lines:
    
    ## retrieve the matched groups
    ## Note: if we simply split by a colon 
    ## we might mess up what people are saying in the text!
    ## Also note: the lazy (non-greedy) ".*?" matches ANYTHING, zero or more times, but as little as possible!
    ## This comes in very handy when you want loosely anything
    ## that happens to be surrounded by some specified structure
    speaker, speech = re.search("^(.*?): (.*?)$", line).groups()

    ## Grow the lists
    speakers.append(speaker)
    speeches.append(speech)

print(speakers)
print("")
print(speeches)
print("")
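
The `?` placed after `.*` in the pattern above changes how much text the group takes. As a minimal sketch on a made-up line that contains more than one colon, compare a greedy first group with a lazy one:

```python
import re

line = "First Witch: Note this: colons can appear in speech, too."

# greedy: ".*" grabs as much as possible, so the split lands on the LAST ": "
greedy = re.search("^(.*): (.*)$", line).groups()

# lazy: ".*?" grabs as little as possible, so the split lands on the FIRST ": "
lazy = re.search("^(.*?): (.*)$", line).groups()

print(greedy)
print(lazy)
```

Only the lazy version recovers the speaker's name correctly here, which is why the lazy form is the safer choice for this kind of "label: content" structure.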

#### 2.1.1.8 Exercise: Names of the gods
In the cell below is some text. It's an extract from [A Clash of Kings](https://www.goodreads.com/book/show/10572.A_Clash_of_Kings), specifically, about a character's prayer to some fictional gods. Use regex to extract the names of these gods. Your output should be a list that looks something like `["the Father", "the Mother", "the Warrior"]`.

In [None]:
text = 'Lost and weary, Catelyn Stark gave herself over to her gods. She knelt before the Smith, who fixed things that were broken, and asked that he give her sweet Bran his protection. She went to the Maid and beseeched her to lend her courage to Arya and Sansa, to guard them in their innocence. To the Father, she prayed for justice, the strength to seek it and the wisdom to know it, and she asked the Warrior to keep Robb strong and shield him in his battles. Lastly she turned to the Crone, whose statues often showed her with a lamp in one hand. "Guide me, wise lady," she prayed. "Show me the path I must walk, and do not let me stumble in the dark places that lie ahead."'

# code goes here

### 2.1.2 Tokenization
Tokenization is the process of breaking up text into smaller units. Usually, this means breaking a string up into words. 
#### 2.1.2.1 `string.split()`
The simplest possible tokenization would be to use the `string.split()` method:

In [1]:
sentences = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

words = sentences.split()

print(words)

['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']


The problem with this is that punctuation has been captured as part of some words. 
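
One way to see how far regular expressions alone can take us is the following sketch of a very simple regex tokenizer. The pattern is an illustrative assumption, not a standard: it treats a token as either a run of word characters or a single character that is neither a word character nor whitespace.

```python
import re

sentences = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# a token is a run of word characters OR one non-word, non-space character
tokens = re.findall(r"\w+|[^\w\s]", sentences)

print(tokens)
```

This separates the punctuation from the words, but it also breaks the price into three pieces ("3", ".", "88"), which hints at why purpose-built tokenizers exist.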

#### 2.1.2.2 NLTK tokenization
For a more advanced tokenizer, we'll use one of the most well-known Python modules for natural language processing, the Natural Language Toolkit (`nltk`). (Install it with `pip3 install nltk`, then import it with `import nltk` and run `nltk.download()`, which will open up a graphical window and allow you to download the data NLTK needs to perform many tasks.) ([Docs](https://www.nltk.org/))

In [15]:
import nltk

words = nltk.tokenize.word_tokenize(sentences)

print(words)

['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']


#### 2.1.2.3 Sentence tokenization
Clearly, tokenization becomes complex quickly, but it's also not just for the task of separating words. Another common need in language processing is sentence tokenization. NLTK has functionality for this, too, but first let's explore a hardline strategy using only a simple regex pattern. Can you write sentences that break this tokenizer?

In [16]:
## regex-based sentence tokenizer
sentences_tokenized = re.split(r"\s*(?<=[\.\?\!][^a-zA-Z0-9])\s*", sentences)
sentences_tokenized

['Good muffins cost $3.88\nin New York.',
 'Please buy me\ntwo of them.',
 'Thanks.']

#### 2.1.2.4 Exercise: Improving a regex-based sentence tokenizer
First, write a few sentences in a complex (but grammatically acceptable) way so that the (above) regex-based tokenizer breaks. Then, fix the pattern so that the tokenizer can handle your text appropriately.

In [None]:
## code here

#### 2.1.2.5 Sentence tokenization with NLTK
Among many other functionalities, NLTK has a relatively-light sentence tokenizer. Here's how it works:

In [17]:
sentences_tokenized = nltk.sent_tokenize(sentences)
sentences_tokenized

['Good muffins cost $3.88\nin New York.',
 'Please buy me\ntwo of them.',
 'Thanks.']

#### 2.1.2.6 Tokenization with Spacy
A newer set of tools can be found in the `spacy` module (`pip3 install spacy`). ([Docs](https://spacy.io/usage))

In [7]:
import spacy

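# the full model name works in both spaCy v2 and v3 (the "en" shortcut was removed in v3)
nlp = spacy.load("en_core_web_sm")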
doc = nlp(sentences)

# spacy creates "token" objects which have quite a few properties. Check the documentation out if you're interested in learning more.

words = []

for token in doc:
    words.append((token.text, token.i, token.idx))
    
print(words)

[('Good', 0, 0), ('muffins', 1, 5), ('cost', 2, 13), ('$', 3, 18), ('3.88', 4, 19), ('\n', 5, 23), ('in', 6, 24), ('New', 7, 27), ('York', 8, 31), ('.', 9, 35), (' ', 10, 37), ('Please', 11, 38), ('buy', 12, 45), ('me', 13, 49), ('\n', 14, 51), ('two', 15, 52), ('of', 16, 56), ('them', 17, 59), ('.', 18, 63), ('\n\n', 19, 64), ('Thanks', 20, 66), ('.', 21, 72)]


In [None]:
# Install a conda package in the current Jupyter kernel
# import sys
# !conda install --yes --prefix {sys.prefix} spacy

In [4]:
import sys
!{sys.executable} -m pip install spacy

Collecting spacy
  Downloading spacy-2.3.5-cp38-cp38-macosx_10_9_x86_64.whl (10.2 MB)
Collecting wasabi<1.1.0,>=0.4.0
  Using cached wasabi-0.8.0-py3-none-any.whl (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0
  Downloading murmurhash-1.0.5-cp38-cp38-macosx_10_9_x86_64.whl (18 kB)
Collecting srsly<1.1.0,>=1.0.2
  Downloading srsly-1.0.5-cp38-cp38-macosx_10_9_x86_64.whl (177 kB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.5-cp38-cp38-macosx_10_9_x86_64.whl (31 kB)
Collecting catalogue<1.1.0,>=0.0.7
  Using cached catalogue-1.0.0-py2.py3-none-any.whl (7.7 kB)
Collecting thinc<7.5.0,>=7.4.1
  Downloading thinc-7.4.5-cp38-cp38-macosx_10_9_x86_64.whl (982 kB)
Collecting plac<1.2.0,>=0.9.6
  Using cached plac-1.1.3-py2.py3-none-any.whl (20 kB)
Collecting blis<

In [6]:
!{sys.executable} -m spacy download en

Collecting en_core_web_sm==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz (12.0 MB)
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.3.1-py3-none-any.whl size=12047105 sha256=381135ca8bc3882f4f0f59e0c969bd1975ca9e8376689f48b0e9801c8b860ef8
  Stored in directory: /private/var/folders/jk/cc3j41890hg12ds4qbk4gqrc0000gn/T/pip-ephem-wheel-cache-habx5t2_/wheels/ee/4d/f7/563214122be1540b5f9197b52cb3ddb9c4a8070808b22d5a84
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.3.1
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/Applications/Anaconda

The problem – as you may have noticed – is that there are variations between tokenizers that can result in different outcomes for you further down the line. Choice of tokenizer can make or break a particular application. 

### 2.1.3 Higher-level NLP
One very convenient (albeit somewhat heavy) aspect of Spacy is that the function returned by `spacy.load()` (which we named `nlp()`) will actually perform _a lot_ of different language modeling operations, including part of speech (POS) tagging, lemmatization, and grammatical parsing. We'll explore these three capabilities in detail, as they can be very useful for downstream applications. However, it's important at this point to note that language processing is in and of itself a challenging and deep field that data scientists (and many others) are engaged in. As an engineering discipline, NLP is generally broken down into tasks, like POS tagging. While systems built to satisfy tasks like lemmatization have become quite mature (to the point of being packaged into Spacy and NLTK), there are a number of other tasks of interest, especially those focused on semantic processing and language generation, that are very much still unresolved. Moreover, the sophisticated solutions for any of these tasks often rely on complex modeling strategies that not only require knowledge from future chapters in this course, but should really have a full course devoted to their study.

#### 2.1.3.0 Spacy's `Token` object type
Another important aspect of Spacy and its language modeling capacity is its foundational definition of the `Token` object, which, like a `string` or `dict`, has methods and attributes. There are many, so it's best to consult the [docs](https://spacy.io/api/token), but to get us started here are a few we'll play around with:

- A few important attributes:
    - `Token.i`: The index of the token within the parent document.
    - `Token.idx`: The character offset of the token within the parent document.
    - `Token.head`: The syntactic parent, or "governor", of this token.
    - `Token.lemma_`: Base form of the token, with no inflectional suffixes.
    - `Token.pos_`: Coarse-grained part-of-speech.
    - `Token.tag_`: Fine-grained part-of-speech.
    - `Token.dep_`: Syntactic dependency relation.

- A few tree-navigation attributes:
    - `Token.ancestors`: A sequence of the token's syntactic ancestors (its head, its head's head, and so on).
    - `Token.children`: A sequence of the token's immediate syntactic children.
    - `Token.subtree`: A sequence of all the token's syntactic descendants.

#### 2.1.3.1 Part of speech (POS) tagging
POS tags coarsely indicate how individual (word) tokens are categorized. Example tags include 'noun', 'verb', etc., and some are harder to get right than others. For example, a word like `'run'` might get used as a noun or a verb, depending on context. It is up to the POS tagger to determine which tag is correct! Thus, POS tagging is most commonly approached as a supervised learning problem, where human annotators label text tokens with tags by hand. These 'gold standard' data are then used to train algorithms, like those at work inside of Spacy. Let's see what Spacy can do!

In [None]:
running_sentence = "I run around all day, even after I go for a run in the morning."
doc = nlp(running_sentence)

print("token\tcoarse\tfine")
for token in doc:
    print(token.text + "\t" + token.pos_ + "\t" + token.tag_)

#### 2.1.3.2 Exercise: POS tagging 
Apply POS tagging to a sentence of your choosing and filter for only verbs and nouns.

In [None]:
## code here

#### 2.1.3.3 Lemmatization
Simplistically, a lemma is the (somewhat arbitrarily chosen) form of a word that heads an entry in a dictionary. For example, the forms 'run', 'ran', and 'running' might have the common lemma of 'run'. Let's see how Spacy does with this! 

In [None]:
running_sentence = """I ran out of gas after running around all day—maybe I shouldn't go for runs in the morning anymore."""
doc = nlp(running_sentence)

print("token\tlemma\tcoarse")
for token in doc:
    print(token.text + "\t" + token.lemma_+ "\t" + token.pos_)

#### 2.1.3.4 Grammatical parsing
Parse trees, or syntax trees, have been an important object of study for the NLP community, focusing on the relations between words in sentences, like 'subject', 'object', etc.  While grammatical parsing has come a long way in recent years, it can only tell us so much about the meanings of words in text. However, for those with knowledge of grammar and how the different relations represent interaction between words, grammatical parsing can be a powerful tool for featurization and rule-based processing. If languages were entirely beholden to grammar, there would certainly be a lot less ambiguity for NLP to sort out!

Before we get going, it's important to get a few things straight. Grammatical dependencies are encoded on _parse trees_. Generally, a verb will be at the root of such a tree, and subsequently _dependent_ words will fall below in a kind of hierarchy. This makes it important to know what a given word's parents and children are. Thus, to 'navigate' parse trees Spacy has a number of `Token` attributes that generate sequences of related tokens. To understand what's going on here let's look at a relatively simple sentence:

In [None]:
ez_sentence = "I like to work on NLP projects."
doc = nlp(ez_sentence)

print("token\thead\tchildren")
for token in doc:
    print(token.text + "\t" + token.head.text + "\t", list(token.children))

Now that we have a way of knowing which tokens operate on which others, we can explore what these operations (relations) are. Each relation is stored in the `token.dep_` (dependency) attribute, which describes the relation between a given token and its parent (`token.head`):

In [None]:
ez_sentence = "I like to work on NLP projects."
doc = nlp(ez_sentence)

print("token\thead\tdependency")
for token in doc:
    print(token.text + "\t" + token.head.text + "\t", token.dep_)
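
To see how heads, children, and ancestors relate to one another, here is a sketch that does not use Spacy at all. The `heads` dictionary is a hand-written, plausible parse of the example sentence (an assumption for illustration, not Spacy's actual output), and the helper names `children` and `ancestors` are hypothetical stand-ins for the `Token` attributes of the same names.

```python
# A hand-written parse (an assumption for illustration, not Spacy output):
# each token maps to its head; the root is its own head, following Spacy's convention.
heads = {
    "I": "like",
    "like": "like",
    "to": "work",
    "work": "like",
    "on": "work",
    "NLP": "projects",
    "projects": "on",
    ".": "like",
}

def children(token):
    """Tokens whose head is `token` (the root pointing at itself is excluded)."""
    return [t for t, h in heads.items() if h == token and t != token]

def ancestors(token):
    """Follow head links upward until the root is reached."""
    path = []
    while heads[token] != token:
        token = heads[token]
        path.append(token)
    return path

print(children("like"))
print(ancestors("NLP"))
```

Keying the dictionary by token text only works because no word repeats in this sentence; for real text, token positions (like `Token.i`) would be the safer key.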

All of this information (and more) can actually be output in a single organized (JSON-serializable) format using `doc.to_json()`, which replaces the older `doc.print_tree()` method (deprecated since spaCy 2.1):

In [None]:
ez_sentence = "I like to work on NLP projects."
doc = nlp(ez_sentence)
doc.to_json()

Finally, in case you were wondering where Spacy's sentence tokenization is, look no further than the `doc.sents` attribute:

In [None]:
doc = nlp(sentences)
list(map(str, doc.sents))

#### 2.1.3.5 Exercise: using grammar for information extraction
Apply the spacy grammatical parsing and extract any subject-verb token pairs.

In [None]:
## code here

### 2.1.4 Numeric representations of text features
So far, we've covered some of how text is processed in various ways to construct meaningful features. However, as discussed in __Section 2.0.1__, much of the work in feature development is quantitative. Even though our textual objects to this point have been categorical&mdash;including the complex, grammatical parsing&mdash; we can still build numeric features from them. 

#### 2.1.4.1 Word-Frequency Distributions 
Word-frequency distributions are a very common input to machine learning and statistical classifiers. For a single document, $d$, let's call $f_d(w)$ the word frequency function, which indicates the number of times the word $w$ appeared in document $d$. 
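
As a minimal sketch of this definition, the function $f_d$ can be represented with a `Counter` from Python's standard library. The helper name `word_frequency` and the whitespace-based notion of a "word" are simplifying assumptions made for illustration.

```python
from collections import Counter

def word_frequency(document):
    """Return f_d as a Counter mapping each whitespace-separated word to its count."""
    return Counter(document.split())

d = "one fish two fish red fish blue fish"
f_d = word_frequency(d)

print(f_d["fish"])    # number of times "fish" appears in d
print(f_d["salmon"])  # a Counter returns 0 for words that never appear
```

A `Counter` is a convenient fit because $f_d(w)$ should be 0 for any word that does not occur in $d$, and that is how a `Counter` treats missing keys.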

#### 2.1.4.2 Computing word frequencies
Word frequencies are probably the first and easiest numerical representation of text to compute. In some communities, this is referred to as the bag of words (BOW) model. Put simply, the BOW model simply counts up the number of times each word appears in a document. This of course depends on a few things, e.g., case and lemmatization. However, constructing a basic BOW model is quite straightforward, especially using `Counter`. Let's use this very paragraph as our example text for the BOW model.

In [None]:
from collections import Counter

text = """Word frequencies are probably the first and easiest 
numerical representation of text to compute. In some communities, 
this is referred to as the bag of words (BOW) model. 
Put simply, the BOW model simply counts up the 
number of times each word appears in a document. 
This of course depends on a few things, e.g., case and lemmatization. 
However, constructing a basic BOW model is quite straightforward, especially using `Counter`. 
Let's use this very paragraph as our example text for the BOW model."""

doc = nlp(text)
word_counts = Counter()

for word in doc:
    word_counts[word.text] += 1

word_counts.most_common(25)

#### 2.1.4.3 Stop words and lemmatization
Since I had to put irregular whitespace (newlines) into the text, some of it ended up attached to words, and we also ended up counting an empty string. More generally, there may be words that are not of interest for a BOW model; the words we choose to exclude are called 'stop words'. It's straightforward to put together a stop word `set()` and use Python's `in` operator to filter them out:

In [None]:
stop_words = {'\n', ',', '.', '`', 'the', 'and', 'of'}

doc = nlp(text)
word_counts = Counter()

for word in doc:
    if word.text not in stop_words:
        word_counts[word.text] += 1

word_counts.most_common(25)

Another issue here is capitalization: is "The" a different word from "the"? We could solve this by lowercasing all words before counting them, but since we've got Spacy's full power behind us, let's just use the lemmas:

In [None]:
stop_words = {'\n', ',', '.', '`', 'the', 'and', 'of'}

doc = nlp(text)
word_counts = Counter()

for word in doc:
    if word.lemma_ not in stop_words:
        word_counts[word.lemma_] += 1

word_counts.most_common(25)
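
The lowercasing alternative mentioned above can be sketched without Spacy at all. This is only a minimal illustration: the text `toy_text`, the set `toy_stop_words`, and the whitespace-based splitting are assumptions made for the example, and the splitting is far cruder than Spacy's tokenizer.

```python
from collections import Counter

# Hypothetical toy text and stop word set, chosen only for illustration.
toy_text = "The model counts words. The BOW model ignores the order of words."
toy_stop_words = {"the", "of"}

toy_counts = Counter()
for raw in toy_text.split():
    # Strip surrounding punctuation, then lowercase so "The" and "the" match.
    word = raw.strip(".,").lower()
    if word and word not in toy_stop_words:
        toy_counts[word] += 1

print(toy_counts.most_common(3))
```

Lowercasing merges "The" with "the", but unlike lemmatization it will not merge forms such as "counts" and "count".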

#### 2.1.4.4 Exercise: improved word frequency representation
Build a stop word list and lemmatization strategy (potentially using POS tags) to compute 'better' word frequencies, as you see fit.

In [None]:
## code here

#### 2.1.4.5 Token-level quantifications
While frequency gives us a number that we can attach to each word (type) that appeared in a text, it's important to note that these frequencies lack context from the individual instances in which words appeared, i.e., they are document-level quantifications of text. There's a developing literature on token-level quantifications, called _embeddings_, or word vectors. The general idea is to model the _semantics_, i.e., meanings of tokens, as points in some finite-dimensional vector space. The result (of some very complex modeling!) is a system that assigns a vector to each word in an (often fixed) vocabulary. This topic is a deep one, but fortunately for us we once again have the benefit of some functionality through Spacy, thanks to two attributes:

- `Token.has_vector`: A boolean value indicating whether a word vector is associated with the token.
- `Token.vector`: A real-valued meaning representation.

The first makes it easy for us to make sure we're not trying to access a vector that doesn't exist (some words are out of vocabulary!), and the second provides us with the vector itself, provided it exists. Let's see how this works! With the Spacy model loaded here, it appears that word vectors are generally of length 384; other models may use a different length.

In [None]:
ez_sentence = "I like to work on NLP projects."
doc = nlp(ez_sentence)
for word in doc:
    if word.has_vector:
        print(word, len(word.vector), word.vector[:5])

### 2.1.5 Working with multiple documents
Very commonly we'll be in a scenario of multiple ($n$) documents, $d_1, \cdots, d_n$. We can measure each document's word frequencies: $f_{d_j}(w)$, but how can we put all of this information together in a convenient way?

#### 2.1.5.1 Term document matrices (TDMs)
It'd be great to build a matrix of word frequencies that spans multiple documents, but this means we'll need to come up with a master index of all of the words in all of the documents. This in turn means establishing an ordering of the words (for us, an ASCII sort, i.e., roughly alphabetical): $w_1, \cdots, w_m$. Once we have this, we can define a matrix:

$$
TDM = 
\begin{bmatrix}
    f_{d_{1}}(w_{1}) & f_{d_{2}}(w_{1}) & \dots  & f_{d_{n}}(w_{1}) \\
    f_{d_{1}}(w_{2}) & f_{d_{2}}(w_{2}) & \dots  & f_{d_{n}}(w_{2}) \\
    \vdots           & \vdots           & \ddots & \vdots \\
    f_{d_{1}}(w_{m}) & f_{d_{2}}(w_{m}) & \dots  & f_{d_{n}}(w_{m})
\end{bmatrix},
$$

often referred to as a _term document matrix (TDM)_. For an example, let's take the sentences from the 'Names of the gods' example (above) and call them our 'documents', building a TDM.

First things first, let's put our word counting into a function:

In [None]:
def count_words(sentence):
    frequency = Counter()
    for word in sentence:
        frequency[word.text] += 1
    return frequency

Now that we have our word counting off to the side, we can proceed with the matrix construction. But as mentioned above, the first thing we need is a master index of all the words so that we can define an order. Matrices are _ordered_ arrays after all!

In [None]:
text = '''Lost and weary, Catelyn Stark gave herself over to her gods. 
She knelt before the Smith, who fixed things that were broken, 
and asked that he give her sweet Bran his protection. 
She went to the Maid and beseeched her to lend her courage to Arya and Sansa, 
to guard them in their innocence. 
To the Father, she prayed for justice, the strength to seek it and the wisdom to know it, 
and she asked the Warrior to keep Robb strong and shield him in his battles. 
Lastly she turned to the Crone, whose statues often showed her with a lamp in one hand. 
"Guide me, wise lady," she prayed. 
"Show me the path I must walk, and do not let me stumble in the dark places that lie ahead."
'''

doc = nlp(text)
    
## the 'master' set, keeps track of the words in all documents
all_words = set()

## store the word frequencies by document (here, each sentence is a document)
all_doc_frequencies = {}

## loop over the sentences
for j, sentence in enumerate(doc.sents):
    frequency = count_words(sentence)
    all_doc_frequencies[j] = frequency
    doc_words = set(frequency.keys())
    all_words = all_words.union(doc_words)
    
## create a matrix of zeros: (words) x (documents)
TDM = np.zeros((len(all_words),len(all_doc_frequencies)))
## fix a word ordering for the rows
all_words = sorted(list(all_words))
## loop over the (sorted) document numbers and (ordered) words; fill in matrix
for j in all_doc_frequencies:
    for i, word in enumerate(all_words):
        TDM[i,j] = all_doc_frequencies[j][word]

TDM[:10,]

#### 2.1.5.2 Accessing TDM elements
If we want access by words or documents, we need word and document indices. As a convenience, lists have a built-in method that reports the index of the first instance of an object. Since our words in the `all_words` list are unique and ordered with our matrix rows, we can just run:
```
i = all_words.index("the")
```
to get the row index of a specific word; pairing it with a document (column) number picks out a specific word-document combination. Here are the rows (document counts) for the words 'the' and 'she'. As we can see, the former occurs in just about every document, while the latter occurs more sparsely.

In [None]:
print(TDM[all_words.index('the'), ])
print(TDM[all_words.index('she'), ])
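
Since `list.index` rescans the list on every call, one common alternative for larger vocabularies is to build a word-to-row dictionary once. The sketch below uses a hypothetical stand-in vocabulary, `toy_words`, rather than the real `all_words`, and the name `word_to_row` is likewise made up for illustration.

```python
# Hypothetical sorted vocabulary standing in for `all_words`.
toy_words = sorted({"she", "the", "gods", "asked"})

# Map each word to its row number once, so later lookups do not rescan the list.
word_to_row = {word: i for i, word in enumerate(toy_words)}

# Both approaches report the same row index.
print(word_to_row["the"], toy_words.index("the"))
```

With the real objects, `TDM[word_to_row['the'], ]` would then select the same row as `TDM[all_words.index('the'), ]`, assuming the dictionary is built from `all_words` after sorting.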

### 2.1.6 Term frequency-inverse document frequency (TF-IDF)
TDMs (i.e., matrices) provide a powerful quantitative framework through which to view text. One important concept introduced by the TDM is the notion of a _corpus_ or collection of documents. This offers a basis for comparison, and better featurization of documents. Many examples use a variant of frequency called _term frequency-inverse document frequency (TF-IDF)_. TF-IDF still uses the frequency of a word (term) in a document:

- $tf_{d_{i}}(w_{j}) = $ number of times the word $w_j$ appears in document $d_i$,

but goes beyond this to incorporate the frequency of documents containing a word:

- $df_D(w_j) = $ number of documents $d$ in the collection $D$ containing $w_j$.

Words like `'the'`, etc., will generally occur across most or all documents, i.e., with $df_D(w_j)$ close to the total number of documents in $D$. The motivation behind _inverse_ document frequency is that boring words will appear like this, distributed widely across all documents. Thus, an inverse document frequency will reduce the feature-importance of such words.

#### 2.1.6.1 Inverse document frequency
What we're going to wind up with is a measure that says _how surprising_ it is to see _this_ word appear in _that_ document $k$ times. The standard TF-IDF procedure goes beyond inversion of document frequency and into a measurement of word-information density. [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) does a good job laying out an intuitive justification:

> The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

In other words, the quantity known as IDF is actually the _negative logarithm of the fraction of documents that contain a word:_

$$idf_D(w_j) = -\log_2\left(\frac{df_D(w_j)}{N}\right),$$

where $N$ is the total number of documents in the collection $D$, so that $\frac{df_D(w_j)}{N}$ is the portion of documents containing the word.

Note: we'll talk a bit more about logarithms in upcoming chapters. For now, we'll just take it for granted (like lemmatization, or grammatical parsing) that we can compute logarithms using `numpy`.
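
To make the formula concrete, here is a minimal sketch on a made-up corpus of four tiny, pre-tokenized "documents". The names `toy_docs` and `idf` are hypothetical, and the standard library's `math.log2` is used in place of `numpy` only to keep the example self-contained.

```python
import math

# Hypothetical corpus of 4 tiny "documents", each already tokenized.
toy_docs = [
    ["the", "gods", "were", "silent"],
    ["she", "asked", "the", "warrior"],
    ["the", "crone", "held", "a", "lamp"],
    ["she", "prayed", "to", "the", "father"],
]
num_toy_docs = len(toy_docs)

def idf(word):
    # df: the number of documents that contain the word at least once.
    # Assumes the word occurs somewhere in the corpus, so df > 0.
    df = sum(1 for d in toy_docs if word in d)
    # log2(N / df) is the same value as -log2(df / N).
    return math.log2(num_toy_docs / df)

for w in ["the", "she", "lamp"]:
    print(w, idf(w))
```

Here 'the' appears in every document and gets an IDF of zero, 'she' appears in half of them and gets one bit, and 'lamp' appears in a quarter of them and gets two bits.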

#### 2.1.6.2 Aside: why take the negative logarithm? 
In general, the negative logarithm of a probability is a measure closely tied to _entropy_ (strictly speaking, entropy is the average of this quantity over a whole distribution). In some circles, this quantity is called _information_, and in others it is called _surprise_! I'll generally stick to calling it entropy or surprise, and would think of it intuitively as _the smallest number of bits, i.e., $0$s and $1$s, we would need to set aside to ensure there is a unique pattern (a binary encoding) for each word._

This can be seen directly, because any nonzero probability, $p$, can be written as a (negative) power of $2$, i.e., there exists some $b\geq0$ such that $p = 2^{-b}$. As it turns out, $2^b$ is the number of possible $b$-bit selections (with replacement) from the two states of a bit: $0$ or $1$. Thus, we can interpret $p$ as the chance of picking out one particular binary pattern of $0$s and $1$s from $b$ bits. With this representation we can see:

$$-\log_2(p) = -\log_2(2^{-b}) = b,$$

i.e., the negative logarithm produces $b$, the number of bits.
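
A quick numerical check of this identity, using the standard library's `math.log2` and a few arbitrarily chosen bit counts (the loop values are illustrative assumptions, not part of the original example):

```python
import math

# p = 2**(-b): one pattern out of the 2**b possible b-bit patterns.
for b in [1, 3, 8]:
    p = 2 ** (-b)
    bits = -math.log2(p)
    print(b, p, bits)
```

For probabilities that are not exact powers of two, the same calculation simply returns a fractional number of bits.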

#### 2.1.6.3 Putting together TF-IDF
All together, TF-IDF is then ordinarily calculated as:

$$
\begin{align}
tfidf_{d_{i}}(w_{j}) & = idf_D(w_j) tf_{d_{i}}(w_{j}) \\
& = -tf_{d_{i}}(w_{j})\log_2\left(\frac{df_D(w_j)}{N}\right) \\
& = -\log_2\left(\left[\frac{df_D(w_j)}{N}\right]^{tf_{d_{i}}(w_{j})}\right)
\end{align}
$$

While we might compute TF-IDF according to the top expression, it's important to come away with thinking of it in terms of the bottom expression. This is a probability to a power! Basically, this probability (to a power) can be viewed under an independence assumption as:

- the probability that the word, $w_j$, appears in the document, $d_i$, $tf_{d_{i}}(w_{j})$-many times.

But that actually means that TF-IDF _is_ an entropy/surprise framing in its own right, answering the question:

> How surprising is it to see _this_ word appear in _that_ document $k$ times?

Regardless, the negative logarithms, $tfidf_{d_{i}}(w_{j})$, are all non-negative, hence _normalizable_ weights that we can use as features, normalized or not. Whether or not normalized TF-IDF probabilities (e.g., in Naïve Bayes) mean anything (in a modeling sense) is another question.
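
The equality of the top and bottom expressions above can be checked numerically. The values of `tf`, `df`, and `num_docs_total` below are made up purely for illustration:

```python
import math

# Hypothetical values: the word appears tf = 3 times in the document
# and occurs in df = 2 of the 8 documents in the collection.
tf, df, num_docs_total = 3, 2, 8

# Top expression: IDF multiplied by term frequency.
top = tf * -math.log2(df / num_docs_total)
# Bottom expression: negative log of the document fraction raised to the tf power.
bottom = -math.log2((df / num_docs_total) ** tf)

print(top, bottom)
```

Both expressions give six bits here: the word carries two bits of surprise per occurrence, and it occurs three times.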

#### 2.1.6.4 Example: Computing TF-IDF
By working off of our TDM from above, we'll now have an easy time computing TF-IDF. The critical piece here that we've not yet discussed is how to compute logarithms. As mentioned, this is easily available through `numpy` using `np.log2()` (for base 2 logarithms).

In [None]:
num_docs = TDM.shape[1]

## start off with a copy of our TDM (frequencies)
TFIDF = np.array(TDM)
## loop over words
for i, word in enumerate(all_words):
    ## count docs containing the word
    num_docs_containing_word = len([x for x in TDM[i] if x])
    ## compute the inverse document frequency of this word
    IDF = -np.log2(num_docs_containing_word/num_docs)
    ## multiply this row by the IDF to transform it to TFIDF
    TFIDF[i,] = TFIDF[i,]*IDF
    
## check out the TF-IDF for 'the' and 'she'
print(TFIDF[all_words.index('the'), ])
print(TFIDF[all_words.index('she'), ])

In [None]:
print(TDM[all_words.index('the'), ])
print(TDM[all_words.index('she'), ])

#### 2.1.6.5 Exercise: exploring TF-IDF
Rank each of the example TF-IDF matrix's columns by TF-IDF values from high-to-low and interpret the kinds of words that have high TF-IDF values, i.e., are 'more important'. What about the low values, what kinds of words are these?

In [None]:
sorted(zip(TFIDF[:,3], all_words), reverse = True)