# DSCI 521

## Chapter 2: Feature engineering and NLP methods

### Creating features

When analyzing large amounts of data, it's important to be able to quickly find trends and patterns. _Features_ are a useful tool for this, and the process of creating them is usually called _feature engineering_. Any specific characteristic of a data point may be used as a _feature_. For example, say we had a large group of people, and we wanted to create an algorithm that predicts which of them are professional basketball players. The first thing that comes to mind about professional basketball players is probably height, so we could create a height feature by measuring each person's height and recording it in our dataset. This is an example of a _numeric feature_. Numeric features will be very useful later, when we start using machine learning techniques. Of course, there are plenty of non-numeric feature types as well.
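
As a minimal sketch of what recorded features might look like, the snippet below stores a numeric height feature next to a non-numeric one. Every name, value, and field here is invented for illustration, and the 200 cm cutoff is an arbitrary stand-in for a real predictive model.

```python
# Hypothetical people; every name and number here is invented for illustration.
people = [
    {"name": "Avery", "height_cm": 206.0, "favorite_sport": "basketball"},
    {"name": "Blake", "height_cm": 171.5, "favorite_sport": "tennis"},
    {"name": "Casey", "height_cm": 198.0, "favorite_sport": "basketball"},
]

# A numeric feature: one number per data point.
heights = [person["height_cm"] for person in people]

# A non-numeric (categorical) feature: one label per data point.
sports = [person["favorite_sport"] for person in people]

# A deliberately crude rule built on the numeric feature (arbitrary cutoff).
tall_people = [person["name"] for person in people if person["height_cm"] >= 200]

print(heights)
print(sports)
print(tall_people)
```

A real classifier would combine many such features rather than rely on a single threshold.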

Much of the data we will run into is fairly unstructured in nature, and this is where your expertise in data science becomes essential. With the rise of social media platforms, a huge amount of data now consists of text. Textual data does not arrive in a neat, structured form with a set of clearly defined numerical features. How can we analyze this ubiquitous textual data? As an example, if you wanted to compare two people by the text they've written, it might be natural to compare their words, but how do you identify the words in a document?

#### Featurization

This approach to adding structure to an unstructured data object (here, text) is called _featurization_. Featurization does not only refer to a process for text, but to any circumstance in which you might need a higher-level, more succinct, or more refined form of data representation. For images, it's common to extract objects as features. For example, an image's objects might include cars, people, or lines of paint on the road, depending on the application. These might be represented as polygons or boxed pixel regions.

As it turns out, featurization in images is a relatively complex task whose study could probably comprise an entire course on its own. Since text is quite accessible as a data type, still unstructured, and is human readable, we'll study featurization in its context. Moreover, the basic method (regular expressions) required to featurize text is the same low-level method through which one might fix a malformed, e.g., delimitation-broken, data file. This makes regular expressions a critical skill for pre-processing.

Such featurization of textual data is a small part of the large and booming field of Natural Language Processing (NLP).
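
As a rough preview of text featurization, the sketch below represents two invented sentences by their sets of lowercase words and compares them. Splitting on whitespace and scoring overlap with a Jaccard ratio are simplifying assumptions made for this illustration; the regular-expression tools in the next section handle punctuation and spacing more carefully.

```python
# Two invented snippets of text, standing in for two authors.
text_a = "the quick brown fox jumps over the lazy dog"
text_b = "the lazy cat sleeps while the quick dog barks"

# Featurize: represent each text by its set of lowercase words.
words_a = set(text_a.lower().split())
words_b = set(text_b.lower().split())

# Words that both texts use.
shared = words_a & words_b

# Jaccard similarity: shared words divided by all distinct words.
similarity = len(shared) / len(words_a | words_b)

print(sorted(shared))
print(round(similarity, 3))
```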

### Basic NLP methods

#### What is NLP?

As mentioned before, a huge amount of the data available on the internet comes in the form of unstructured text, mostly generated by humans. Such text of human origin is usually referred to as _natural language_. A simple way to describe NLP, then, is as the study of techniques that allow for the processing of natural language. In recent years, it has become one of the most important and most active subfields of computer science. Definitely not a bad thing to familiarize yourself with! Some of the more typical problems include speech recognition (spoken word also counts as natural language!), natural language understanding (machine reading comprehension of natural language), and natural language generation (using a machine to automatically generate human-like text; think of the responses produced by something like Siri or Alexa).

#### Regular Expressions

What are regular expressions?

[From Wikipedia](https://en.wikipedia.org/wiki/Regular_expression):

>A regular expression, regex or regexp ... is ... a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for "find" or "find and replace" operations on strings.

So, finding data formatting errors and replacing/fixing them is a big application. Likewise, finding words and extracting them is another. Before we get going, let's discuss the different types of characters that exist. Straight from the `re` [module's docs](https://docs.python.org/3.4/library/re.html):



>Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so last matches the string 'last'....

>... Some characters, like '|' or '(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.

Before we go through and experiment with regular expression special characters, let's explore a few important functions:

__Functions__


+ `re.split(regex, string)` (__split on delimiter__) Will break `string` apart wherever `regex` matches, returning a list of the delimited pieces (substrings) of `string`.
+ `re.sub(match_regex, replacement, string)` (__match and replace__) Will replace all non-overlapping matches of `match_regex` inside of `string` with `replacement` and return the modified version of `string`. Note that `replacement` is a replacement string (it may refer back to matched groups, as in the grouping section below), not a regular expression.
+ `re.findall(regex, string)` (__find all matches__) Will search for all non-overlapping instances of a pattern, from left to right, and return a list of the matched strings.





In [1]:
import re
a_silly_string = "one fish two fish red fish blue fish"

## splitting by 'fish' to find attributes
types_of_fish = re.split("fish", a_silly_string)
print(types_of_fish)

['one ', ' two ', ' red ', ' blue ', '']


In [2]:
## let's replace our fish with some cats
a_silly_string_about_cats = re.sub("fish",  "cat", a_silly_string)

print(a_silly_string_about_cats)

one cat two cat red cat blue cat


In [3]:
## let's find all of our cats
## Note: this will be more interesting once we've gotten into flexible expressions
found_cats = re.findall("cat", a_silly_string_about_cats)

print(found_cats)

['cat', 'cat', 'cat', 'cat']


__`re.search()` and resulting `Match` objects__

I've left one really important function out of our list so far:

+ `re.search(regex, string)` (__find match anywhere__) Will search for the pattern in the string and return a match object. 

[Note: `re.match(regex, string)` only succeeds if the pattern matches at the very beginning of the string, whereas `re.search()` finds a match anywhere. _I strongly recommend using only `re.search()` for any matching!_]

`re.search()` (or `re.match()`, if you want trouble) returns a match object if a match is found. A match object always has a boolean value of `True`. If no match is found, `None` is returned, which has a boolean value of `False`. This means you can use `re.search()` statements inside of logical if/else control flow. That's really nice. If a match object is the result, there are more methods that can be applied besides truthiness:

+ `.groups()` (__collect group__). Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. We'll get more into this in the grouping section, below.



In [4]:
## Now let's test to see if our string has cats or ducks inside of it
cat_match = re.search("cat", a_silly_string_about_cats)
if cat_match:
    print("there's definitely at least one cat in our string:")
else:
    print("there's no cats in our string")

duck_match = re.search("duck", a_silly_string_about_cats)
if duck_match:
    print("there's definitely at least one duck in our string:")
else:
    print("there's definitely not a single duck in our string")

there's definitely at least one cat in our string:
there's definitely not a single duck in our string
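
To make match objects a little more concrete, here is a small sketch that inspects one and contrasts `re.search()` with `re.match()`. It uses `.group()`, `.groups()`, and `.span()`, which are standard match-object methods; the variable names are just for illustration, and the grouping syntax is covered in more detail further down.

```python
import re

text = "one cat two cat red cat blue cat"

# re.search scans the whole string and returns the first match it finds.
# "[a-z]+" means "one or more lowercase letters"; the parentheses mark groups.
first_cat = re.search(r"([a-z]+) (cat)", text)
whole_match = first_cat.group(0)   # the entire matched text
captured = first_cat.groups()      # the captured groups, as a tuple
position = first_cat.span()        # (start, end) indices of the match
print(whole_match, captured, position)

# re.match only tries the pattern at the very beginning of the string.
match_cat = re.match("cat", text)  # the string does not start with "cat"
match_one = re.match("one", text)  # the string does start with "one"
print(match_cat, match_one is not None)

# A failed search returns None, which is falsy.
duck = re.search("duck", text)
print(duck is None)
```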


__Flexible matching__

Most often, we won't want to split, match, replace, etc., by exactly the same string. In these cases, we can use flexible matching. Here are some common flexible special sequences:


+ `.` (__wild card__) In the default mode, this matches any character except a newline.
+ `[...]` (__character class__) Used to indicate flexible matching across a specified set of characters.
+ `[^...]` (__complementary character class__) Used to indicate flexible matching across _everything but_ a specified set of characters.
+ `[a-z]` (__lowercase range__) Used to indicate flexible matching across lowercase letter ranges.
+ `[A-Z]` (__uppercase range__) Used to indicate flexible matching across uppercase letter ranges.
+ `[0-9]` (__numeric range__) Used to indicate flexible matching across numeric ranges.
+ `A|B` (__or__) Creates a regular expression that will match either `A` or `B`, where `A` and `B` are themselves regular expressions.




In [5]:
not_a_silly_string = "Oftentimes, different punctuation characters are used; these indicate different types of stops."

## split a string by several types of punctuation
clauses = re.split("[,;.]", not_a_silly_string)
print(clauses)


['Oftentimes', ' different punctuation characters are used', ' these indicate different types of stops', '']


In [6]:
tommy_two_tone = "Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is."

## let's get all of the phone numbers in a string
numbers = re.findall("[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]", tommy_two_tone)
print(numbers)

['867-5307']
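
The wild card, the complementary character class, and the "or" operator from the list above are not exercised by the cells so far, so here is a brief sketch of each. The sentence is invented for illustration.

```python
import re

sentence = "The gray cat and the grey dog sat on a mat."

# "." matches any single character except a newline.
spellings = re.findall("gr.y", sentence)

# "|" matches either alternative.
animals = re.findall("cat|dog", sentence)

# "[^...]" matches any single character NOT listed in the class.
# Here: replace everything that is not a lowercase letter or a space.
masked = re.sub("[^a-z ]", "_", sentence)

# "[...]" with explicit characters: "at" preceded by c, s, or m.
rhymes = re.findall("[csm]at", sentence)

print(spellings)
print(animals)
print(masked)
print(rhymes)
```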


__Grouping, numbered groups and extensions__

Grouping is a great way to modify and extend strings, without simply replacing them. With grouping, you can use the matched content in a substitute string. It's great for re-formatting text. Groups can also serve extended functions when the opening parenthesis is immediately followed by a question mark.


+ `(...)` (__group__) Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the `\1`, `\2`, etc., special sequences, described below.
+ `\1`, `\2`, etc. (__captured groups__) Matched groups are captured and held in order: low to high from left to right, and in the case of nested groups, from outside to inside.
+ `(?:...)` (__non-capturing group__) Matches whatever regular expression is inside the parentheses, but does not capture it in a group. This becomes especially important when applying quantifiers (repetition operators such as `*`) to a whole sub-pattern.
+ `(?=...)` (__lookahead__) Matches if `...` matches next, but doesn’t consume any of the string.
+ `(?!...)` (__negative look ahead__) Matches if `...` doesn’t match next.
+ `(?<=...)` (__positive look behind__) Matches if the current position in the string is preceded by a match for `...` that ends at the current position.
+ `(?<!...)` (__negative look behind__) Matches if the current position in the string is not preceded by a match for `...`




In [7]:
tommy_two_tone = "Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is."

## let's capture Jenny's phone number and insert the area code
modified_tommy_two_tone = re.sub(r"([0-9][0-9][0-9]-[0-9][0-9][0-9][0-9])",r"1-800-\1", tommy_two_tone)

print(modified_tommy_two_tone)

Apparently, 1-800-867-5307 is Jenny's phone number, but I'm not sure what her area code is.
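
The extension groups are easier to understand when seen side by side. The following sketch, on an invented string, shows a lookbehind, a lookahead, a negative lookbehind, and the difference between a capturing and a non-capturing group in `re.findall()`. The `+` after a character class means "one or more".

```python
import re

prices = "apples cost $3, pears cost $12, and plums cost 7 euros"

# Positive lookbehind: digits preceded by "$"; the "$" is not part of the match.
dollar_amounts = re.findall(r"(?<=\$)[0-9]+", prices)

# Positive lookahead: digits followed by " euros"; " euros" is not consumed.
euro_amounts = re.findall(r"[0-9]+(?= euros)", prices)

# Negative lookbehind: digits NOT preceded by "$" or by another digit.
non_dollar = re.findall(r"(?<![$0-9])[0-9]+", prices)

# With a capturing group, findall returns only the captured text;
# with a non-capturing group, it returns the whole match.
captured = re.findall(r"(apples|pears) cost", prices)
whole = re.findall(r"(?:apples|pears) cost", prices)

print(dollar_amounts, euro_amounts, non_dollar)
print(captured, whole)
```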


__Anchors__

Anchors allow you to tie matches to absolute positions in a string. These become especially handy if you are pre-processing semi-structured text, like a screenplay, stenographer's court record, or the index of a book.

+ `^` (__start anchor__) Matches the start of the string.
+ `$` (__end anchor__) Matches the end of the string or just before the newline at the end of the string.


In [9]:

## an example of some semi-structured text
macbeth = "First Witch: When shall we three meet again? In thunder, lightning, or in rain?\nSecond Witch: When the hurlyburly's done, when the battle's lost and won."
print(macbeth, '\n')

## make some empty lists for our data
speakers = []
speeches = []

## split the document into the lines of the play
lines = re.split("\n", macbeth)

## loop over the lines
for line in lines:
    
    ## retrieve the matched groups
    ## Note: if we simply split by a colon 
    ## we might mess up what people are saying in the text!
    ## Also note: ".*?" matches ANY character (except a newline), zero or more times!
    ## The trailing "?" makes it non-greedy (lazy): it matches as few characters as possible.
    ## This comes in very handy when you want loosely anything
    ## that happens to be surrounded by some specified structure
    speaker, speech = re.search("^(.*?): (.*?)$", line).groups()

    ## Grow the lists
    speakers.append(speaker)
    speeches.append(speech)

print(speakers, '\n')
print(speeches, '\n')



First Witch: When shall we three meet again? In thunder, lightning, or in rain?
Second Witch: When the hurlyburly's done, when the battle's lost and won. 

['First Witch', 'Second Witch'] 

['When shall we three meet again? In thunder, lightning, or in rain?', "When the hurlyburly's done, when the battle's lost and won."] 
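
The choice between `.*?` and `.*` matters as soon as the delimiter can appear more than once. The sketch below uses an invented line (not from the play) whose speech itself contains a colon followed by a space, and compares a lazy first group with a greedy one.

```python
import re

# An invented line whose speech also contains ": ".
line = "Host: remember: doors open at noon"

# Lazy first group: stops at the FIRST ": " it can.
lazy = re.search(r"^(.*?): (.*)$", line).groups()

# Greedy first group: runs to the LAST ": " it can.
greedy = re.search(r"^(.*): (.*)$", line).groups()

print(lazy)
print(greedy)
```

Only the lazy version separates the speaker from the speech here, which is why the non-greedy form suits this kind of speaker-prefixed text.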



### Formatting Issues in Text Data

Sometimes the text you've got isn't exactly as you want it. A good way to explore these issues is in the context of a pretty standard text processing procedure: counting words.

__Counting words—entirely dependent on 'tokenization'__

A good first pass at counting words might just split a text by space and use a dictionary to count them up. Recall, there's a special `Counter()` type of dictionary that defaults to integer values. Let's give this a try and see what happens.

Note: they're often called tokens, and not words because it's really hard to make sure they're actually words!

In [11]:
from collections import Counter

document = "\tYou might have an easy time reading this, \nbut the computer has some extra spaces tabs and \nnewlines to deal with.  After all, two spaces after \na stop isn't strange!"
print(document, '\n')
word_counts = Counter()
words = re.split(" ", document)
for word in words:
    word_counts[word] += 1
    
## we can use the .most_common() method on a Counter() 
## to order our counted words
print(word_counts.most_common())

	You might have an easy time reading this, 
but the computer has some extra spaces tabs and 
newlines to deal with.  After all, two spaces after 
a stop isn't strange! 

[('spaces', 2), ('\tYou', 1), ('might', 1), ('have', 1), ('an', 1), ('easy', 1), ('time', 1), ('reading', 1), ('this,', 1), ('\nbut', 1), ('the', 1), ('computer', 1), ('has', 1), ('some', 1), ('extra', 1), ('tabs', 1), ('and', 1), ('\nnewlines', 1), ('to', 1), ('deal', 1), ('with.', 1), ('', 1), ('After', 1), ('all,', 1), ('two', 1), ('after', 1), ('\na', 1), ('stop', 1), ("isn't", 1), ('strange!', 1)]


This doesn't look bad, but there's some extra whitespace on the edges of our words!

__Extra Whitespace__

Oftentimes, text has extra whitespace that you need to remove. For this, you don't even need the regular expressions module; you can just use the `.strip()` string method! This removes any leading or trailing whitespace from a string. Also:

+ it looks like we are counting a bunch of empty strings; we can use string object truthiness to avoid this!
+ we're counting punctuation characters in with our words. We can split by non-word characters (`\W`) to get rid of these!



In [12]:
document = "\tYou might have an easy time reading this, \nbut the computer has some extra spaces tabs and \nnewlines to deal with.  After all, two spaces after \na stop isn't strange!"

word_counts = Counter()
## a raw string (r"...") keeps Python from treating \W as a string escape
words = re.split(r"\W", document)
for word in words:
    ## remove any extra white space
    word = word.strip()
    if word:
        word_counts[word] += 1
    
## we can use the .most_common() method on a Counter() 
## to order our counted words
print(word_counts.most_common())



[('spaces', 2), ('You', 1), ('might', 1), ('have', 1), ('an', 1), ('easy', 1), ('time', 1), ('reading', 1), ('this', 1), ('but', 1), ('the', 1), ('computer', 1), ('has', 1), ('some', 1), ('extra', 1), ('tabs', 1), ('and', 1), ('newlines', 1), ('to', 1), ('deal', 1), ('with', 1), ('After', 1), ('all', 1), ('two', 1), ('after', 1), ('a', 1), ('stop', 1), ('isn', 1), ('t', 1), ('strange', 1)]


__Nothing's perfect!__

Notice that the above still isn't perfect! Now that we're using the non-word special character to split our string, we've messed up the contraction _isn't_, which if anything should be broken into two words: _is_ and _n't_. Ultimately, 'tokens' are an admission of the imperfect delimitation of words in text. The better we want our results to be, the more special rules we have to apply, which makes delimitation more particular to the language being processed.
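
One possible special rule, sketched below, is to find tokens directly rather than split on non-word characters, allowing an optional apostrophe-plus-letters suffix. This is only one of many reasonable conventions: it keeps _isn't_ as a single token rather than splitting it in two, and it assumes plain ASCII letters, so digits, hyphens, and accented characters are ignored.

```python
import re
from collections import Counter

document = "\tAfter all, two spaces after \na stop isn't strange!"

# One or more letters, optionally followed by an apostrophe and more letters.
# (?:...) groups the suffix without capturing it, so findall returns whole tokens.
token_pattern = r"[A-Za-z]+(?:'[A-Za-z]+)?"

tokens = re.findall(token_pattern, document)
word_counts = Counter(tokens)

print(tokens)
print(word_counts["isn't"])
```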

__Coming back to Delimitation:__

So, whether it has something to do with the way a language lays out, or the way someone has entered their data, delimitation really matters! How can we fix these delimitation issues? Regular expressions! If a file is giving you trouble, read it in as plain text, find and replace the delimitation issues, and print the file back out as a properly-structured object. Unfortunately, in the wild world of problems there is no fix-all for issues like these; instead, it's all about flexible matching and replacing with regular expressions. Good luck!
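
As a hedged illustration of that find-and-replace workflow, the sketch below repairs a small, invented block of text whose fields are separated inconsistently by commas, semicolons, and tabs. It assumes that none of those characters ever appears inside a legitimate field, which would not hold for every real file.

```python
import re

# Invented, inconsistently delimited data read in as plain text.
raw = "name;height_cm , team\nAvery,206;  Hawks\nBlake\t171 ,Owls"

fixed_lines = []
for line in raw.split("\n"):
    # Treat a comma, semicolon, or tab, plus any surrounding spaces,
    # as a single delimiter (" *" means "zero or more spaces").
    fields = re.split(r" *[,;\t] *", line)
    fixed_lines.append(",".join(fields))

# Re-join into one consistently comma-delimited block of text.
fixed = "\n".join(fixed_lines)
print(fixed)
```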

__Inconsistencies and variations in datetime parsing__

We've discussed inconsistencies in the context of name variations. This might occur if you're trying to process a text and determine who said what to whom, or take in customer data from separate transactions in which the customer has represented themself by different identifiers, e.g., Mr. Williams vs Jake W., etcetera. These would almost certainly be ad hoc jobs for regular expressions and compiled lists of aliases. However, one high-variation entity processing task can be treated in a highly consistent manner: datetime parsing.
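
A compiled list of aliases might look something like the sketch below. The canonical name, the records, and the helper `canonical_name()` are all hypothetical; in practice the alias patterns would have to be assembled by hand for each dataset.

```python
import re

# Hypothetical alias patterns for one customer; names invented for illustration.
aliases = {
    "Jake Williams": re.compile(r"Mr\. Williams|Jake W\.|Jake Williams"),
}

records = [
    "Mr. Williams bought a lamp",
    "Refund issued to Jake W.",
    "Ana P. bought a desk",
]

def canonical_name(record):
    """Return the first canonical name whose alias pattern appears in the record."""
    for name, pattern in aliases.items():
        if pattern.search(record):
            return name
    return None

resolved = [canonical_name(record) for record in records]
print(resolved)
```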

In addition to the standard `datetime` module, let's use another module that works on top of `datetime` to make the parsing of times easy. The `dateutil.parser` module is pre-built to handle a variety of formats. Note, for more advanced and specific datetime parsing, see:

+ https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
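
When the format is known in advance, the standard library's `datetime.strptime()` can parse it explicitly using the format codes described at the link above. A minimal sketch follows; the date strings are illustrative, and `%B` (full month name) assumes an English locale.

```python
from datetime import datetime

# %m/%d/%y: month, day, and two-digit year separated by slashes.
due = datetime.strptime("1/23/18", "%m/%d/%y")

# %B %d, %Y: full month name, day, and four-digit year.
# Note: %B depends on the locale; this assumes English month names.
first_day = datetime.strptime("January 9, 2018", "%B %d, %Y")

# strftime goes the other way: datetime object back to formatted text.
due_text = due.strftime("%Y-%m-%d")

print(due)
print(first_day)
print(due_text)
```

Unlike a flexible parser, `strptime()` raises a `ValueError` when the text does not fit the stated format, which can be useful for catching malformed entries.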

In [13]:


import dateutil.parser as p

## Sometimes it's not just the nature of times and dates that creates variation
schedule = {
    "First day of class": "January 9th, 2018",
    "Class Time": "Tuesday, 6:30pm",
    "Office Hours 1": "Tuesday, 3pm",
    "Office Hours 2": "Thursday, 12pm",
    "Term Exams Start": "March 20, 2018",
    "Term Exams End": "March 24, 2018",
    "Assignment 1 due": "1/23/18",
    "Assignment 2 due": "2/6/18",
    "Project Interim Report due": "2/13/18",
    "Assignment 3 due": "2/20/18",
    "Assignment 4 due": "3/13/18",
    "Project Final Report due": "3/20/18",
}

## regardless of format, we can process any of these datetimes
for event in schedule:
    print(event)
    print(p.parse(schedule[event]), '\n')



First day of class
2018-01-09 00:00:00

Class Time
2018-10-16 18:30:00

Office Hours 1
2018-10-16 15:00:00

Office Hours 2
2018-10-11 12:00:00

Term Exams Start
2018-03-20 00:00:00

Term Exams End
2018-03-24 00:00:00

Assignment 1 due
2018-01-23 00:00:00

Assignment 2 due
2018-02-06 00:00:00

Project Interim Report due
2018-02-13 00:00:00

Assignment 3 due
2018-02-20 00:00:00

Assignment 4 due
2018-03-13 00:00:00 

Project Final Report due
2018-03-20 00:00:00 



__Why datetime parsing?__

Because we can then compare times as numeric objects and ask, for example, how long till the next office hours?


In [14]:
from datetime import datetime

## datetime.now() creates a datetime object with the current time
current_time = datetime.now()

## let's see how long until the next class
## the .seconds attribute holds the seconds left over after whole days
## are removed from the difference (whole days are stored in .days)
next_class = p.parse(schedule["Class Time"])
print("Seconds until the next class: ") 
print((next_class - current_time).seconds, '\n')

## hours are probably more helpful
print("Hours until the next class: ") 
## we can divide seconds by 3600 to get hours
print((next_class - current_time).seconds/3600., '\n')

## let's see how long until the next tuesday office hour
next_office_hour_1 = p.parse(schedule["Office Hours 1"])
print("Hours until the next Tuesday office hour: ") 
print((next_office_hour_1 - current_time).seconds/3600., '\n')

## Since Thursday is farther away, we can add up
## the number of whole .days attribute
## with the fraction of days via .seconds/3600./24.
next_office_hour_2 = p.parse(schedule['Office Hours 2'])
print("Days until the next Thursday office hour: ") 
print((next_office_hour_2 - current_time).days + (next_office_hour_2 - current_time).seconds/3600./24., '\n')



Seconds until the next class: 
13747 

Hours until the next class: 
3.818611111111111 

Hours until the next Tuesday office hour: 
0.3186111111111111 

Days until the next Thursday office hour: 
0.8882754629629629 
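
Subtracting two datetimes yields a `timedelta`, and its `.total_seconds()` method combines the `.days` and `.seconds` parts in one step. The sketch below uses fixed, invented timestamps (not the live clock) so that the arithmetic is reproducible.

```python
from datetime import datetime

# Fixed, invented timestamps so the result does not depend on when this runs.
now = datetime(2018, 10, 16, 14, 40, 53)
next_office_hour = datetime(2018, 10, 18, 12, 0, 0)

gap = next_office_hour - now

# .days holds the whole days; .seconds holds only the remainder (0 to 86399).
print(gap.days, gap.seconds)

# .total_seconds() accounts for both parts at once.
hours_until = gap.total_seconds() / 3600
print(hours_until)

# For an event in the past, .days goes negative while .seconds stays non-negative.
elapsed = now - next_office_hour
print(elapsed.days, elapsed.seconds, elapsed.total_seconds())
```

Because `.seconds` alone ignores whole days, `.total_seconds()` is usually the safer choice whenever the gap might exceed 24 hours or be negative.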

