# Vocabulary Analysis Workshop

## Tokenization

The first thing we will do is split the job description text into tokens.  
A token is an element (word, character, symbol) of a string. For example, the string "Cows eat grass" can be tokenized into words or characters:
- "Cows", "eat", "grass"
- 'C', 'o', 'w', 's', ' ', 'e', 'a', 't', ' ', 'g', 'r', 'a', 's', 's'

(See [Tokenization on Wikipedia](https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)).)

Tokenization is a text segmentation process that reduces a string to a list of the tokens (elements) that make it up. In natural language processing (NLP), the term generally refers to the process of taking text and producing a list of words. "Token" can also refer to the individual characters of a string.
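
As a minimal sketch of the two views of the "Cows eat grass" example, the snippet below uses only built-in string operations. Splitting on whitespace is an assumption made for illustration; it is not the tokenizer this workshop asks you to write.

```python
text = "Cows eat grass"

# Word tokens: split on runs of whitespace.
word_tokens = text.split()

# Character tokens: every character, including each space, is a token.
char_tokens = list(text)

print(word_tokens)       # ['Cows', 'eat', 'grass']
print(len(char_tokens))  # 14
```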

Tokenization is often the first process applied to a piece of text, so it heavily affects downstream processes. This can be complicated because there isn't necessarily a "correct" way to tokenize: depending on the language, the kind of document, and the downstream processes, the appropriate tokenization can differ vastly.

In [None]:
from __future__ import division, print_function

%matplotlib inline

import nltk
import pandas as pd
import pickle

from vocab_analysis import *

import answers

In [None]:
jobs_df = pd.read_csv('data/job_descriptions.tsv', sep='\t', encoding='UTF-8', index_col=0)

In [None]:
jobs_df

In [None]:
len(jobs_df)

### Exercise 1: Tokenization

Tokenization in English is often implemented with heuristics or regular expressions. Regular expressions can be used in one of two ways: to identify boundaries, or to identify tokens. Here are some common regular expressions for tokenizing English.
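
The two uses of a regular expression can be sketched with Python's standard `re` module. The patterns and the sample sentence here are illustrative assumptions, not the patterns the workshop has in mind.

```python
import re

text = "The cat sat on the mat."

# Approach 1: the pattern describes tokens; extract every match.
TOKEN_PATTERN = re.compile(r"\w+")
tokens_by_matching = TOKEN_PATTERN.findall(text)

# Approach 2: the pattern describes boundaries (gaps); split on them
# and drop any empty strings left at the edges.
GAP_PATTERN = re.compile(r"\s+")
tokens_by_splitting = [t for t in GAP_PATTERN.split(text) if t]

print(tokens_by_matching)   # ['The', 'cat', 'sat', 'on', 'the', 'mat']
print(tokens_by_splitting)  # ['The', 'cat', 'sat', 'on', 'the', 'mat.']
```

Note that the two approaches already disagree on the final token: matching `\w+` drops the period, while splitting on whitespace keeps it attached to "mat".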

Here are some questions to consider when doing tokenization.
- How do we want to treat punctuation?
  - Should "word." tokenize to ["word", "."], ["word."], or ["word"]?
- How do we want to treat contractions (e.g. "won't")?
  - Should we pre-treat contractions by expanding all instances in the text?
    - "won't" to "will not" then tokenize to ["will", "not"]
  - Should we keep contractions together? Or break them up?
    - "won't" tokenizes to ["won't"] or ["won", "'", "t"]
  - Are we dealing with a more formal register of English, in which contractions are rare?
- How do we want to treat numbers?
  - Should "A1" tokenize to ["A1"] or ["A", "1"]?
- How do we want to treat hyphens?
  - Should "State-of-the-art" tokenize to ["State-of-the-art"], ["State", "of", "the", "art"], or ["State", "-", "of", "-", "the", "-", "art"]?
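
To make these choices concrete, here is a rough sketch in which each answer corresponds to a different token pattern. The pattern names and the sample sentence are hypothetical, chosen only for illustration; none of them is "the" right answer for the job descriptions.

```python
import re

sentence = "She won't use state-of-the-art A1 widgets."

# Hypothetical pattern names; each one encodes a different set of answers
# to the questions above.
PATTERNS = {
    "words_only": r"\w+",
    "keep_contractions_and_hyphens": r"\w+(?:['-]\w+)*",
    "punctuation_as_tokens": r"\w+|[^\w\s]",
    "split_letters_from_digits": r"[A-Za-z]+|\d+",
}

# Apply every pattern to the same sentence so the outputs can be compared.
results = {name: re.findall(pattern, sentence) for name, pattern in PATTERNS.items()}

for name, tokens in results.items():
    print(name, tokens)
```

With `keep_contractions_and_hyphens`, "won't" and "state-of-the-art" each stay in one piece; with `punctuation_as_tokens`, the apostrophe, the hyphens, and the final period become tokens of their own; with `split_letters_from_digits`, "A1" becomes "A" and "1".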
  
Implement your tokenizer. With regular-expression-based tokenizers, the common approaches are either to use the regular expression to identify tokens and extract all matches, or to use it to identify gaps and split the string there. Use the example below to test your tokenizer.

In [None]:
example_job_description = """
This is a description for a generic job.

The employee is expected to have the following:
    1. an A1 certification (recent or renewed)
    2. experience in widget-widget interaction
    
She/he will be expected to stand for 3-4 hours at a time.
She/he won't be expected to actually create widgets.

Full-time
Salary : $50,000/yr
"""

In [None]:
def tokenize(job_description):
    """
    Take a job description and return a list of tokens.

    Parameters
    ----------
    job_description : str
        The text of the job description

    Returns
    -------
    list[str]
        The list of tokens in the job description
    """
    raise NotImplementedError('Implement the tokenizer')

# tokenize = answers.tokenize # uncomment this, and comment out the function above, to skip this exercise

In [None]:
tokenize(example_job_description)
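
Besides reading the output by eye, it can help to run a quick sanity check on it. The helper below is a hypothetical sketch (`check_tokens` is not part of `vocab_analysis` or `answers`), and it assumes word-level tokens, so a token that is empty or padded with whitespace is treated as suspicious.

```python
def check_tokens(tokens):
    """Return a list of problems found in a tokenizer's output (hypothetical helper)."""
    problems = []
    if not isinstance(tokens, list):
        problems.append("output is not a list")
        return problems
    for i, token in enumerate(tokens):
        if not isinstance(token, str):
            problems.append("token %d is not a string: %r" % (i, token))
        elif token == "":
            problems.append("token %d is empty" % i)
        elif token != token.strip():
            problems.append("token %d has surrounding whitespace: %r" % (i, token))
    return problems


# Stand-in tokenizer output for demonstration only; not the exercise answer.
sample_tokens = "Full-time\nSalary : $50,000/yr".split()
print(check_tokens(sample_tokens))          # []
print(check_tokens(["ok", "", " padded"]))  # two problems reported
```

Once your own `tokenize` is implemented, something like `check_tokens(tokenize(example_job_description))` should return an empty list under these assumptions.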

When you are happy with your tokenization, let's tokenize the descriptions, save our work, and move on to analysis using $\mbox{TF.IDF}$.

In [None]:
jobs_df['tokens'] = jobs_df['description'].apply(tokenize)

In [None]:
jobs_df.to_pickle('data/tokenized.pickle')

Let's also save our tokenization function so that it can be reused later.

In [None]:
save_fun(tokenize, imports=['nltk'])

### NEXT => [2. TF.IDF](2. TF.IDF.ipynb)