In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns

### Part 1: SD City Employee Example

# Information Extraction: Text

## Text Data

- How do we *quantify* the similarity of two text documents?
- How do we use a document as input in a regression model?
- How do we turn a text document into a vector of numbers?

### Example: San Diego City Employee Salaries
* Recall Lecture 01: Do men and women make similar salaries?
* Follow-up question: Do men and women make similar salaries among those with similar jobs?
    - How to determine "similar jobs?"

In [None]:
salaries = pd.read_csv('https://transcal.s3.amazonaws.com/public/export/san-diego-2017.csv')
jobtitles = salaries['Job Title']

In [None]:
jobtitles.value_counts().iloc[:100]

In [None]:
jobtitles.shape[0], jobtitles.nunique()

In [None]:
jobtitles.value_counts().iloc[:100].plot(kind='bar')

### Cleaning job titles: assessment
* Can we canonicalize job titles?
* Are they self-consistent?
    - Punctuation? Capitalization? Abbreviations?

In [None]:
# run / rerun
jobtitles.sample(10)

### Cleaning job titles: step 1
* Is every first letter capitalized?
* What punctuation exists? Should it be cleaned?

In [None]:
# Capitalization
jobtitles[(
    jobtitles.str.contains(r'\b[a-z]+\b')
)]

In [None]:
# punctuation: replace with space?
jobtitles[jobtitles.str.contains('[^A-Za-z0-9 ]')].head(15)

### Cleaning job titles: step 1
* Remove: to, the, for
* Replace: non-alphanumeric with space
* Replace multiple spaces with one space

In [None]:
jobtitles = (
    jobtitles
    .str.replace(' to| the| for', '', regex=True)  # include the spaces! (why?)
    .str.replace("[^A-Za-z0-9' ]", ' ', regex=True)
    .str.replace("'", '')
    .str.replace(' +', ' ', regex=True)
)
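
One way to see why the pattern includes the leading spaces is to apply the same replacements to a single string with the standard `re` module. This is only a sketch: `clean_title` and its `stop_pattern` parameter are hypothetical names introduced here, not part of the notebook.

```python
import re

def clean_title(title, stop_pattern=r' to| the| for'):
    """Apply the same four replacements as the pandas chain, on one string."""
    title = re.sub(stop_pattern, '', title)         # drop filler words
    title = re.sub(r"[^A-Za-z0-9' ]", ' ', title)   # other punctuation -> space
    title = title.replace("'", '')                  # drop apostrophes
    return re.sub(r' +', ' ', title)                # collapse runs of spaces

with_spaces = clean_title('Asst to the Director')
without_spaces = clean_title('Asst to the Director', stop_pattern=r'to|the|for')
print(with_spaces)     # the filler words are removed, 'Director' is intact
print(without_spaces)  # the 'to' inside 'Director' is removed as well
```

Even with the leading space, a lowercase word that merely starts with one of these strings would still be clipped; word boundaries (`\b`) in the pattern would be a stricter alternative.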

In [None]:
jobtitles.sample(10)

### Cleaning job titles: abbreviations 

* Which job titles are inconsistently described?
    - Librarian? Engineer? Director?

In [None]:
jobtitles[jobtitles.str.contains('Libr')].value_counts()

In [None]:
jobtitles[jobtitles.str.contains('Eng')].value_counts()

In [None]:
jobtitles[jobtitles.str.contains('Dir')].value_counts()

### The limits of canonicalization
* How do we find which common words to map a job title to?
    - E.g. pattern matching on 'Eng' or 'Libr'
* What about other titles? (there are too many!)
* Adjectives vs Nouns have different meanings
    - Junior/Senior/Director vs Police/Fire/Engineer

### The limits of canonicalization

Naive procedure: 
1. Compute most common words (distribution of words in the dataset)
2. Select "relevant" words using domain expertise  (Police vs Asst)
3. Check if a given word is contained in a given job title.

E.g.: "fire", "police", etc.

In [None]:
# compute counts of all words (together)
bow = jobtitles.str.split().sum()    #splits on space and sums
#bow
words = pd.Series(bow).value_counts()
#words
words.head(10)

In [None]:
len(words)

In [None]:
# is a given word in a job title? count the number of occurrences of each
jobtypes = pd.DataFrame([], index=salaries.index)
for word in words.index:
    re_pat = '\\b{val}\\b'.format(val = word)
    jobtypes[word] = jobtitles.str.count(re_pat).astype(int)

In [None]:
# number of times each word appears in each row
jobtypes.head(12)

In [None]:
# how many columns? (curse of dimensionality)

len(jobtypes.columns)

In [None]:
# total number of occurrences of each word across all job titles
# most common 20 words
jobtypes.iloc[:,:20].sum()

In [None]:
# What does this represent?
# most common 20 words
# What about those with sum = 0? sum > 1?
jobtypes.iloc[:,:20].sum(axis=1)#.describe()

### Problem

- Manually finding all keywords (fire, police, etc.) is labor intensive.

### Part 2

# Bags of Words

### What are the closest job titles to 'Asst Fire Chief'?


* Idea: which other job titles share "the most" words in common?
* Implementation: use the 'word vectors' in `jobtypes` to count up matching words.

In [None]:
# what are the closest job titles to:
jobidx = 76
job = jobtitles.iloc[jobidx]
job

In [None]:
job1 = jobtitles.iloc[0]
job1

In [None]:
# word vectors side-by-side
pd.concat([jobtypes.iloc[0], jobtypes.iloc[76]], axis=1).head(10)

In [None]:
# multiply

cnts = pd.concat([jobtypes.iloc[0], jobtypes.iloc[76]], axis=1)

(cnts.iloc[:,0] * cnts.iloc[:,1]).head(10).to_frame()

In [None]:
# sum the matches
np.sum(cnts.iloc[:,0] * cnts.iloc[:,1])

## Solution attempt 1: bag of words
1. Create a list of all words appearing among *all* text ('bag of words')
2. For each text entry, create a vector, indexed by the distinct words, with counts of the words in that entry.
3. Two text entries are similar if their dot product is large.

### Discussion Question

Given the list of sentences below:
1. What is the index for the word vectors of the sentences?
2. How close are the word-vectors of the first and second sentence?

In [None]:
sentences = [
    'the fox and the moon',
    'the cow and the moon',
    'the cow and the spoon'
]
pd.Series(sentences).to_frame()

### Answer

In [None]:
sentences = pd.Series(sentences)
sentences

In [None]:
words = pd.Series(sentences.str.split().sum()).value_counts()
words

In [None]:
wordvecs = pd.DataFrame([], index=sentences.index)
for word in words.index:
    re_pat = '\\b%s\\b' % word
    wordvecs[word] = sentences.str.count(re_pat).astype(int)
    
wordvecs

In [None]:
np.sum(wordvecs.iloc[0] * wordvecs.iloc[1])

### Bag of Words: Salaries
* Compute the dot product among all word vectors and 'Asst Fire Chief'
* Take the job that is the closest match

In [None]:
jobtitles.iloc[jobidx]

In [None]:
jobvec = jobtypes.iloc[jobidx]
jobvec.head(10)

In [None]:
# dot product with 'Asst Fire Chief' and *all* other titles
matches = jobtypes.apply(lambda ser: np.dot(jobvec, ser), axis=1)
matches.head(10)

In [None]:
jobtitles.loc[matches.sort_values(ascending=False).index].head(10)

## Summary: Bag of Words

* Create an index out of *all* distinct words 
    - The basis for the vector space of words.
* Create vectors for each text entry by computing the counts of words in the entry.
* The dot product between two vectors measures their 'similarity'; dividing by the vectors' lengths gives the cosine of the angle $\theta$ between them.
    - This defines the **cosine distance** between vectors via: $$dist(v, w) = 1 - \cos(\theta) = 1 - \frac{v \cdot w}{|v||w|}$$
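
The cosine distance formula can be sketched in plain Python, assuming whitespace tokenization and non-empty documents. The function name `cosine_distance` and the three short titles are illustrative choices, not taken from the salary data.

```python
import math
from collections import Counter

def cosine_distance(doc_a, doc_b):
    """1 - cos(theta) between the word-count vectors of two strings."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[w] * b[w] for w in a)   # a Counter returns 0 for missing words
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return 1 - dot / (norm_a * norm_b)

d_same = cosine_distance('Asst Fire Chief', 'Asst Fire Chief')   # identical titles
d_close = cosine_distance('Asst Fire Chief', 'Fire Captain')     # one shared word
d_far = cosine_distance('Asst Fire Chief', 'Police Officer')     # no shared words
print(d_same, d_close, d_far)
```

Identical titles give a distance of (numerically) zero, titles with no words in common give a distance of one, and everything else falls in between.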

### Conclusion: Bag of Words
* Bag of words *embeds a document into a vector space*
* Can then use clustering (e.g. k-means) to group like documents (e.g. into 'job-types')
    - Unfortunately, many clustering techniques don't work well in high dimensions.
* Downside: treats all words as *equally important*.
    - "Asst Chief Oper Ofcr" vs "Asst to the Chief Oper Ofcr"


### Part 3

# TF-IDF

Term Frequency / Inverse Document Frequency

### Term Frequency, Inverse Document Frequency

How do we figure out which words are "important" in a document?

1. The most common words often *don't* have much meaning!
2. The very rare words are also less important!

Goal: balance these two observations.

## Term Frequency, Inverse Document Frequency

* The *term frequency* of a word $t$ in a document $d$, denoted ${\rm tf}(t,d)$, is the proportion of words in $d$ that are equal to $t$.
   * A word that occurs often is important to the document's meaning.

* The *document frequency* is how often a word occurs in the entire set of documents.
   * Common words appear everywhere.


* Question: what are the frequencies for a word "the"? (high/low?)

## What about their ratio? Intuition

The relevance of this word to the document.

$$\frac{{\rm TermFrequency}}{{\rm DocumentFrequency}}$$

* `TF`: High, `DF`: High
* `TF`: High, `DF`: Low
* `TF`: Low, `DF`: High
* `TF`: Low, `DF`: Low

## Term Frequency, Inverse Document Frequency

* The *term frequency* of a word $t$ in a document $d$, denoted ${\rm tf}(t,d)$, is the proportion of words in $d$ that are equal to $t$.
* The *inverse document frequency* of a word $t$ in a set of documents $\{d_i\}$, denoted ${\rm idf}(t)$, is: 

$${\rm idf}(t) = \log\left(\frac{{\rm total\ number\ of\ documents}}{{\rm number\ of\ documents\ in\ which\ } t {\rm\ appears}}\right)$$

* The *tf-idf* of a term $t$ in document $d$ is given by the product: 

$${\rm tfidf}(t,d) = {\rm tf}(t,d) \cdot {\rm idf}(t)$$

In [None]:
# What is the tf-idf of 'cow' in the second 'document'?
sentences.to_frame()

### Answer

In [None]:
# the term frequency of 'cow' in the second 'document'
tf = sentences.iloc[1].count('cow') / (sentences.iloc[1].count(' ') + 1)
tf

In [None]:
idf = np.log(len(sentences) / sentences.str.contains('cow').sum())

In [None]:
idf

In [None]:
tf * idf

### TF-IDF of all terms in all documents
* What are the different reasons tf-idf can be zero?
* When is it the largest?

In [None]:
sentences

In [None]:
words = pd.Series(sentences.str.split().sum())

In [None]:
tfidf = pd.DataFrame([], index=sentences.index)  # dataframe of documents
for w in words.unique():
    re_pat = '\\b%s\\b' % w
    tf = sentences.str.count(re_pat) / (sentences.str.count(' ') + 1)
    idf = np.log(len(sentences) / sentences.str.contains(re_pat).sum())
    tfidf[w] = tf * idf

In [None]:
tfidf

### Summary: TF-IDF

* Term Frequency, Inverse Document Frequency balances:
    - how often a word appears in a document/sentence, with
    - how often a word appears *across* documents.
* For a given document, the word with the highest TF-IDF best summarizes that document.
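
As a rough check of this summary, the definitions can be written out in plain Python on the three example sentences. This is a sketch assuming whitespace tokenization; `tfidf_table` is a hypothetical helper name, not a library function.

```python
import math

def tfidf_table(docs):
    """Return one {word: tf-idf} dict per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = {w for words in tokenized for w in words}
    # idf(t) = log(number of documents / number of documents containing t)
    idf = {w: math.log(len(docs) / sum(w in words for words in tokenized))
           for w in vocab}
    # tf(t, d) = occurrences of t in d / number of words in d
    return [{w: words.count(w) / len(words) * idf[w] for w in set(words)}
            for words in tokenized]

docs = ['the fox and the moon', 'the cow and the moon', 'the cow and the spoon']
table = tfidf_table(docs)

# the highest-scoring word of each document (ties are broken arbitrarily)
top_words = [max(row, key=row.get) for row in table]
print(top_words)
```

Words that occur in every sentence ('the', 'and') score zero, while words unique to one sentence ('fox', 'spoon') score highest. The second sentence has no unique word, so 'cow' and 'moon' tie there.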

### Example: State of the Union Addresses

* What are the important words for each address?

In [None]:
import re
with open('data/stateoftheunion1790-2017.txt') as f:
    sotu = f.read()

In [None]:
print(sotu[:20000])

In [None]:
speeches = sotu.split('\n***\n')[1:]

In [None]:
len(speeches)

In [None]:
def extract_struct(speech):
    L = speech.strip().split('\n', maxsplit=3)
    L[3] = re.sub("[^A-Za-z' ]", ' ', L[3]).lower()
    return dict(zip(['speech', 'president', 'date', 'contents'], L))

In [None]:
df = pd.DataFrame(list(map(extract_struct, speeches)))

In [None]:
df

In [None]:
words = pd.Series(df.contents.str.split().sum())

In [None]:
tfidf = pd.DataFrame([], index=df.index)  # dataframe of documents
tf_denom = (df.contents.str.count(' ') + 1)
for w in words.value_counts().iloc[0:500].index:
    # imperfect pattern match for speed
    re_pat = ' %s ' % w
    tf = df.contents.str.count(re_pat) / tf_denom
    idf = np.log(len(df) / df.contents.str.contains(re_pat).sum())
    tfidf[w] =  tf * idf

In [None]:
tfidf.head()

In [None]:
summaries = tfidf.idxmax(axis=1)
summaries

In [None]:
tfidf.iloc[0].argsort()

In [None]:
def five_largest(row):
    return list(row.index[row.argsort()][-5:])

In [None]:
keywords = tfidf.apply(five_largest, axis=1)
keywords_df = pd.concat([
    df['president'],
    keywords
], axis=1)

In [None]:
from IPython.display import display
with pd.option_context('display.max_rows', 300):
    display(keywords_df)

# Features

## Features

* A **feature** is a measurable property or characteristic of a phenomenon being observed.
* Synonyms: (explanatory) variable, attribute
* Examples include:
    - a column of a dataset.
    - a derived value from a dataset, perhaps using additional information.
    
We have been creating features to summarize data!

### Examples of features in SD salary dataset

* Salary of employee
* Employee salaries, standardized by job status (PT/FT)
* Gender/age of employees (derived from SSA names; accurate?)
* Job Family associated to a job title (uses text-techniques)
* TF-IDF summary of each state of the union address (topic modeling)

## What makes a good feature?

* Fidelity to Data Generating Process (Consistency).
* Strongly associated to phenomenon of interest ("contains information").
* Easily used in standard modeling techniques (e.g. quantitative and scaled).

Datasets often come with weak attributes; features may need to be "engineered" to convey information.

## Feature Engineering

* The process of creating effective, quantitative attributes from data.
* Transforming heterogeneous data into quantitative data is required for statistical models!
* Similar observations in the data should transform to nearby points in the (Euclidean) feature space.

Effective feature engineering makes (relationships in) data easy to understand!
- Either for a statistical model, or visually for the data scientist.

## The goal of feature engineering

* Find transformations that turn data into effective quantitative variables.

* Find functions $\phi:{\rm DATA}\to\mathbb{R}^d$ where similar points $x,y\in {\rm DATA}$ have close images $\phi(x), \phi(y)\in \mathbb{R}^d$

* A "good" choice of features depends on many factors:
    - data type (quantitative, ordinal, nominal),
    - the relationship(s) and association(s) being modeled,
    - the model type (e.g. linear models, decision tree models, neural networks).

### Want to build a linear model:

<div class="image-txt-container">
    
* on a dataset of product review data (X),
* to understand the relationship to product ratings (Y).

<img src="imgs/plane.jpg">

</div>


### Want to build a linear model

* Why can't a linear model be built on review data directly?
* What needs to happen to build the model?
* What are concrete steps to take?

|UID|AGE|STATE|HAS_BOUGHT|REVIEW|\||RATING|
|---|---|---|---|---|---|---|
|0|32|NY|True|"Meh."|\||&#10025;&#10025;|
|42|50|WA|True|"Worked out of the box..."|\||&#10025;&#10025;&#10025;&#10025;|
|57|16|CA|NULL|"Hella tots lit yo..."|\||&#10025;|
|...|...|...|...|...|\||...|
|(int)|(int)|(str)|(bool)|(str)|\||(str)|

### Basic Transformations: uninformative features

`UID` was likely used to join the user information (e.g., `AGE` and `STATE`) with some `Reviews` table.  The `UID` raises several questions:
* What is the meaning of the `UID` *number*? 
* Does the magnitude of the `UID` reveal information about the rating?
* Does adding `UID` improve our model?
* **Transformation:** drop the feature.


### Dropping Features

There are certain scenarios where manually dropping features might be helpful:

1. when the feature **does not contain information** associated with the prediction task.  
2. when the feature is **not available at prediction time.**  For example, the feature might contain information collected after the user entered a rating.  This is a common scenario in time-series analysis.


### Basic Transformations: scaling features

`AGE` might contain corrupted or 'clumped' data that requires scaling:
- **Transformation:** apply binning to discretize the data into quartiles.
- **Transformation:** apply non-linear transformation (e.g. log, sqrt).
- **Transformation:** normalize/standardize (z-scale; range).
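
The three transformations can be sketched with the standard library alone. The `ages` list below is made up for illustration, and the sample standard deviation is one of several reasonable choices for the z-scale.

```python
import math
import statistics

ages = [16, 32, 50, 21, 44, 67, 38, 29]   # made-up values for illustration

# Binning: quartile cut points, then the index (0-3) of the bin each age falls in
cuts = statistics.quantiles(ages, n=4)
age_bin = [sum(a > c for c in cuts) for a in ages]

# Non-linear transformation: the log compresses large values
age_log = [math.log(a) for a in ages]

# Standardization (z-scale): zero mean, unit sample standard deviation
mu, sigma = statistics.mean(ages), statistics.stdev(ages)
age_z = [(a - mu) / sigma for a in ages]

# Range (min-max) scaling onto [0, 1]
lo, hi = min(ages), max(ages)
age_range = [(a - lo) / (hi - lo) for a in ages]

print(age_bin)
```

Binning discards the most information and is the least sensitive to corrupted extreme values; the z-scale and range scaling keep the shape of the distribution and only change its units.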

### Basic Transformations: ordinal encoding

How to encode the `RATING` column as a quantitative variable?
* **Transformation:** "number of &#10025;" to "number".
    - Called *ordinal encoding*: map the ordinal values onto the integers, preserving order.
* Does this preserve "distances" between ratings? (yes).
    

In [None]:
order_values = ['✩', '✩✩', '✩✩✩', '✩✩✩✩', '✩✩✩✩✩']
ordinal_enc = {y:x for (x,y) in enumerate(order_values)}
ordinal_enc

In [None]:
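ratings = pd.Series(['✩✩', '✩✩✩✩', '✩'], name='RATING')  # ratings from the table above; `df` has no RATING column
ratings.map(ordinal_enc)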

### Basic Transformations: one-hot encoding

How to encode the `STATE` column as a quantitative variable?
- How do we make STATE into a meaningful number?
- Ordinal Encoding? AL=1,...,WY=50? (No!)
- **Transformation:** 50 binary variables: `is_AL`,...,`is_WY`.
    

## Nominal feature encoding: One hot encoding

* Transform categorical features into many binary features.
* Given a column `col` with values `A_1, A_2, ..., A_N`, define the following quantitative binary columns:

$$\phi_i(x) = \left\{\begin{array}{ll}1 & {\rm if\ } x = A_i \\ 0 &  {\rm if\ } x\neq A_i \\ \end{array}\right. $$

* *Also called:* dummy encoding; indicator variables.
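
A minimal plain-Python sketch of these indicator functions, using a hypothetical column of color names (the helper `one_hot` is introduced here only for illustration):

```python
def one_hot(values):
    """Map each value to a list of 0/1 indicators, one per distinct category."""
    categories = sorted(set(values))   # the distinct values, in a fixed order
    rows = [[1 if x == a else 0 for a in categories] for x in values]
    return categories, rows

categories, rows = one_hot(['red', 'green', 'blue', 'green'])
print(categories)
for row in rows:
    print(row)
```

Each inner list is the image of one observation: position `i` holds the indicator for the `i`-th category, so every row contains exactly one 1.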

### Example: one hot encoding `STATE`

<div class="txt-image-container">

* A column containing US states transforms into 50 feature columns
* e.g. `phi_CA(x) = 1 if x == 'CA' else 0`
* Each row has *exactly one 1*.
* Many of these columns will be *largely* 0.

<img src="imgs/one-hot.png">

</div>

In [None]:
df = pd.DataFrame([['NY'], ['WA'], ['CA'], ['NY'], ['OR']], columns=['STATE'])
df

In [None]:
states = df.STATE.unique()
states

In [None]:
df['STATE'].apply(lambda x: pd.Series(x == states, index=states, dtype=float))