### Apostrophe Modification
- [Code modified from "Multiple word replace in text using dictionary"](https://www.daniweb.com/programming/software-development/code/216636/multiple-word-replace-in-text-python)
- Drawback: Need to create a dictionary

In [1]:
# Replace words in a text that match key strings in a dict with value strings
import re

def word_replace(text, word_dict):
    # Compile a regular expression pattern into a regular expression object
    # re.escape(string): Return string with all non-alphanumerics backslashed
    # map(function, iterable): Apply function to every item of iterable and return list of results
    pattern = re.compile('|'.join(map(re.escape, word_dict)))
    
    # Define a function to return value from dictionary based on matching
    def translate(match):
        #print match.group()
        return word_dict[match.group()]
    
    return pattern.sub(translate, text)

In [2]:
# Create a dictionary
apostrophes = {"'s": ' is', "'re": ' are', "'ll": ' will', "n't": ' not', "'m": ' am'}

# Text example
text = "That's not possible. I'll see if they're coming. I'm sure they wouldn't mind coming"
print 'Original text: ', text

modified_text = word_replace(text, apostrophes)
print 'Modified text: ', modified_text

Original text:  That's not possible. I'll see if they're coming. I'm sure they wouldn't mind coming
Modified text:  That is not possible. I will see if they are coming. I am sure they would not mind coming


In [3]:
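# Another approach: sequential str.replace() calls. Simpler, but every pass rescans
# text that was already replaced, so the result can depend on the order of the keys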
def replace_word(text, word_dict):
    for key in word_dict:
        text = text.replace(key, word_dict[key])
    return text

# Create a dictionary
apostrophes = {"'s": ' is', "'re": ' are', "'ll": ' will', "n't": ' not', "'m": ' am'}

# Text example
text = "That's not possible. I'll see if they're coming. I'm sure they wouldn't mind coming"
print 'Original text: ', text

modified_text = replace_word(text, apostrophes)
print 'Modified text: ', modified_text

Original text:  That's not possible. I'll see if they're coming. I'm sure they wouldn't mind coming
Modified text:  That is not possible. I will see if they are coming. I am sure they would not mind coming
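One way to judge the two approaches is to try a dictionary whose keys overlap. The sketch below uses Python 3 syntax; `regex_replace`, `loop_replace`, and the `overlapping` dictionary are illustrative names and data, not part of the original code. It suggests that the single-pass regular expression is unaffected by key order here, whereas the sequential `str.replace` loop is not:

```python
import re

def regex_replace(text, word_dict):
    # Single pass: every match is looked up once and never rescanned
    pattern = re.compile('|'.join(map(re.escape, word_dict)))
    return pattern.sub(lambda match: word_dict[match.group()], text)

def loop_replace(text, word_dict):
    # One full pass per key: later keys see the output of earlier replacements
    for key, value in word_dict.items():
        text = text.replace(key, value)
    return text

# Hypothetical dictionary in which one key ("n't") is a suffix of another ("can't")
overlapping = {"n't": " not", "can't": "cannot"}
sample = "I can't go and she wouldn't either"

regex_result = regex_replace(sample, overlapping)
loop_result = loop_replace(sample, overlapping)
print(regex_result)
print(loop_result)
```

With this key order the loop replaces "n't" first and turns "can't" into "ca not", so the longer key never gets a chance to match. Listing longer keys first would work around that for the loop version.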


### Removal of Punctuation

In [1]:
# Text example
text = "This $$notebook** is about ~text cleaning#"
print 'Original text: ', text

import string

# string.punctuation is a string containing the ASCII punctuation characters
output = text.translate(string.maketrans("",""), string.punctuation)

print 'Punctuations: ', string.punctuation
print 'Punctuation free text: ', output

Original text:  This $$notebook** is about ~text cleaning#
Punctuations:  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Punctuation free text:  This notebook is about text cleaning


In [6]:
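# Tokenize on spaces, stripping punctuation from each token.
# Python 2's string.maketrans() takes exactly two equal-length strings, so
# string.maketrans({ch: None for ch in string.punctuation}) raises
# "TypeError: maketrans() takes exactly 2 arguments (1 given)".
# Instead, pass None as the table and the characters to delete:
[s.translate(None, string.punctuation) for s in text.split(' ') if s != '']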
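In Python 3, `str.maketrans` accepts a single dict that maps characters to replacements (or to `None` to delete them), so a dict-based table works there. A minimal sketch, assuming Python 3 and reusing the example sentence from the cell above:

```python
import string

text = "This $$notebook** is about ~text cleaning#"

# Python 3 only: map every punctuation character to None (i.e. delete it)
table = str.maketrans({ch: None for ch in string.punctuation})

# Tokenize on spaces and strip punctuation from each token
tokens = [s.translate(table) for s in text.split(' ') if s != '']
print(tokens)
```

The three-argument form `str.maketrans('', '', string.punctuation)` builds an equivalent deletion table.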

### Removal of Diacritical Marks (accents)

In [5]:
import unicodedata

text = 'JesucristÕ. Mas el herido fue por nuestras rebeliones, mÕlido'

unicode_text = unicode(text, "utf-8")
print unicode_text

def remove_diacritic(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

remove_diacritic(unicode_text)

JesucristÕ. Mas el herido fue por nuestras rebeliones, mÕlido


'JesucristO. Mas el herido fue por nuestras rebeliones, mOlido'
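In Python 3 every `str` is already Unicode, so the `unicode(...)` conversion is not needed. The sketch below is a variant rather than a port of the cell above: instead of encoding to ASCII and ignoring failures, it drops only the combining marks produced by NFKD normalization, which keeps any non-ASCII characters that have no decomposition. The sample sentence is made up for illustration.

```python
import unicodedata

def remove_diacritic(input_str):
    # NFKD splits an accented character into its base character plus combining marks
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    # Keep everything except the combining marks
    return ''.join(ch for ch in nfkd_form if not unicodedata.combining(ch))

cleaned = remove_diacritic('Jesucristó molió café')
print(cleaned)
```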

### Spelling Correction <font color=steelblue>(works with Tokenized text)</font>
- Peter Norvig's [Algorithm](http://norvig.com/spell-correct.html)

In [6]:
#### Code: Peter Norvig's Spelling Correction Algorithm ####
# Import libraries
import re
import collections

# Function to lowercase the text and extract its words (runs of the letters a-z)
def words(text): return re.findall('[a-z]+', text.lower())

# Function to count how many times each word occurs (i.e. train a Bayesian probability model)
def train(features):
    # Create empty dict that will assign value = 1 to nonexistent key
    model = collections.defaultdict(lambda: 1) # 1 is used for smoothing due to new words
    for f in features:
        model[f] += 1 # Update count 
    return model

# Store the word counts learned from the training corpus
NWORDS = train(words(file('data/big.txt').read()))

# English alphabet
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Edit Distance = 1 (According to literature: 80 to 90% of spelling errors are an edit distance of 1 from the target)
def edit_dist_1(word):
    # Split the word iteratively and create a list
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Delete a letter from word splits in splits list
    deletes = [a + b[1:] for a, b in splits if b]
    # Transpose: Swap adjacent letters
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    # Replace: Change one letter to another
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    # Insert: Add a letter
    inserts = [a + c + b for a, b in splits for c in alphabet]
    # Return set
    return set(deletes + transposes + replaces + inserts)

# ----------------------------
# Edit Distance = 2 (Apply "edit_dist_1" function to all the results of "edit_dist_1" function)
def edit_dist_2(word):
    return set(e2 for e1 in edit_dist_1(word) for e2 in edit_dist_1(e1))
# To avoid lot of computation - Optimize by keeping the candidates that are known words
# ----------------------------

# Edit Distance = 2 for known words given in "big.txt"
def known_edit_dist_2(word):
    return set(e2 for e1 in edit_dist_1(word) for e2 in edit_dist_1(e1) if e2 in NWORDS)

# Known words in language model training data
def known(words): return (set(w for w in words if w in NWORDS))

# Function to return the element with highest P(candidate) i.e. how likely is candidate to appear in English
def correct(word):
    candidates = known([word]) or known(edit_dist_1(word)) or known_edit_dist_2(word) or [word]
    return max(candidates, key = NWORDS.get)

In [7]:
# Spelling correction test
# (named test_words so the list does not shadow the words() function defined above)
test_words = ['thos', 'is', 'a', 'gread', 'idia', 'doktor', 'careus', 'whos']
print [correct(word) for word in test_words]

['this', 'is', 'a', 'great', 'idea', 'doctor', 'cares', 'who']
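To see why the edit-distance-2 search benefits from keeping only known words, it helps to count the candidates at distance 1. Norvig's essay gives the total for a word of length n as n deletions, n-1 transpositions, 26n replacements, and 26(n+1) insertions, i.e. 54n+25 before duplicates are removed. The sketch below (Python 3; `edit_candidates` is an illustrative copy of `edit_dist_1` that returns a list instead of a set) checks that count for one word without needing `big.txt`:

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edit_candidates(word):
    # Same four edit types as edit_dist_1, but returned as a list so duplicates can be counted
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return deletes + transposes + replaces + inserts

word = 'thos'
candidates = edit_candidates(word)
n = len(word)

print(len(candidates), 54 * n + 25)   # raw count versus the 54n + 25 formula
print(len(set(candidates)))           # distinct candidates after de-duplication
print('this' in set(candidates))      # the intended word is one edit away
```

Applying the same expansion to every one of these candidates is what makes the distance-2 set large, which is why `known_edit_dist_2` filters against `NWORDS` while generating.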


### Spelling Correction <font color=steelblue>(works with Tokenized text)</font>
- PyEnchant - [a spelling checking library for Python](http://pythonhosted.org/pyenchant/tutorial.html)

In [15]:
#! pip install pyenchant
import enchant
from nltk.metrics import edit_distance

# Spell checker class, taken from "Python Text Processing with NLTK 2.0 Cookbook":
# return the word if it is in the dictionary, else return the closest suggestion
class SpellingReplacer(object):
    def __init__(self, dict_name='en_US', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist
    def replace(self, word):
        # If word in US english dict return word
        if self.spell_dict.check(word):
            return word
        suggestions = self.spell_dict.suggest(word)
        # If word not in US english dict then return the closest matched word
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        # Otherwise return word
        else:
            return word

# Quick test
replacer = SpellingReplacer()
print replacer.replace('iehp')

iehp


In [16]:
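# Spelling correction test
# (named test_words to avoid reusing the name of the words() function from the Norvig example)
test_words = ['thos', 'is', 'a', 'gread', 'idia', 'doktor', 'careus', 'whos']
print [replacer.replace(word) for word in test_words]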

['thews', 'is', 'a', 'read', 'Lidia', 'doctor', 'cares', 'whews']


### Removing non-ASCII Characters

- [Reference](http://stackoverflow.com/questions/4020539/process-escape-sequences-in-a-string-in-python)

In [52]:
import re

# Text example with non-ASCII characters
text = 'Š«Ç‘¾_±Œ_Ä__Ä_îµ stop texting!'

print 'Text (non-ASCII): ', text
print 'Text (): ', text.encode('string-escape')

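# Remove non-ASCII characters: 'string-escape' renders each non-ASCII byte as a
# two-digit \xXX hexadecimal escape, which is then stripped with a regular expression
def replace_glyph_chars(text):
    # Compile regular expression to search: \xXX
    # Note: \w{2,4} also consumes up to two word characters directly after the
    # escape (such as the underscores in the example text)
    glyph_chars_pattern = re.compile('\\\\x(\w{2,4})')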
    # replacement: re.sub(pattern, repl, string)
    temp = re.sub(glyph_chars_pattern, '', text.encode('string-escape'))
    return temp

print 'Text (glyph replaced): ', replace_glyph_chars(text)

Text (non-ASCII):  Š«Ç‘¾_±Œ_Ä__Ä_îµ stop texting!
Text ():  \xc5\xa0\xc2\x81\xc2\x90\xc2\x8d\xc2\xab\xc3\x87\xc2\x8d\xe2\x80\x98\xc2\xbe_\xc2\xb1\xc5\x92_\xc3\x84__\xc3\x84_\xc3\xae\xc2\xb5 stop texting!
Text (glyph replaced):   stop texting!
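The `string-escape` codec exists only in Python 2. A rough Python 3 counterpart, sketched below with illustrative names (`NON_ASCII_PATTERN`, `strip_non_ascii`) and a shortened sample string, removes every character outside the ASCII range directly:

```python
import re

# Any run of characters whose code point is above 0x7F
NON_ASCII_PATTERN = re.compile(r'[^\x00-\x7F]+')

def strip_non_ascii(text):
    # Delete the non-ASCII runs and leave ASCII characters untouched
    return NON_ASCII_PATTERN.sub('', text)

sample = 'Š«Ç¾±Œ stop texting!'
cleaned = strip_non_ascii(sample)

# Equivalent approach: encode to ASCII, silently dropping what cannot be encoded
also_cleaned = sample.encode('ascii', 'ignore').decode('ascii')
print(repr(cleaned), repr(also_cleaned))
```

Unlike the escape-based version, this leaves ASCII word characters that directly follow a removed character (underscores, for example) in place.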


### Character Encoding and Unicode
- [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html)

### Split text without spaces into list of words (or Segment text)
- [Algorithm based on Zipf's law](http://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words)
- [Peter Norvig's Natural Language Corpus Data: Beautiful Data (Google's n-Gram word frequency lists and more)](http://norvig.com/ngrams/)

In [58]:
#### Code: Peter Norvig's Text Segmentation Algorithm ####
import operator

# Define a "Memoize function"
def memo(f):
    "Memoize function f"
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo

# Word segmentation

# Decorator
@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(remaining) for first, remaining in splits(text))
    return max(candidates, key=probability_word_sequence)

# Function to split text in all possible pairs
def splits(text, L=20):
    """
    Return a list of all possible (first, remaining) pairs,
    where length of first <= L
    Example: splits("aword") 
    Output: [('a', 'word'), ('aw', 'ord'), ('awo', 'rd'), ('awor', 'd'), ('aword', '')]
    """
    text_length = range(min(len(text), L))
    return [(text[:i+1], text[i+1:]) for i in text_length]

# Function to calculate Naive Bayes probability of a sequence of words
def probability_word_sequence(words):
    "Return the Naive Bayes probability of a sequence of words."
    return product(probability_word(w) for w in words)

# Function to return product of a sequence of numbers.
def product(numbers):
    "Return the product of a sequence of numbers."
    # reduce(operator.mul, [1, 2, 3], 0.5) = 3.0
    return reduce(operator.mul, numbers, 1)

# Class to create a word probability distribution
class probability_distribution(dict):
    "A probability distribution estimated from word counts in the datafile"
    def __init__(self, data=[], N=None, missingfn=None):
        for key, count in data:
            # Set dictionary key's value by adding count
            self[key] = self.get(key, 0) + int(count)
        # Set value of N
        self.N = float(N or sum(self.itervalues()))
        # Set missing function
        self.missingfn = missingfn or (lambda k, N: 1.0/N)
    def __call__(self, key):
        # If word in dictionary, then return probability of word
        if key in self: return self[key]/self.N
        # If word not in dictionary (i.e. unknown word), 
        # then return probability of unknown word calculated by missingfn
        else: return self.missingfn(key, self.N)
        
# Function to estimate probability of an unknown word: missingfn
def unknown_word_probability(unknown_word, N):
    "Estimate the probability of an unknown word"
    return 10.0/(N * 10**len(unknown_word))
    
#print unknown_word_probability('jokerisabadcharacterinthedarkknight', N)

# Number of tokens: ~1 trillion in Google unigrams
N = 1024908267229

# Function to read 1/3 million most frequent words with counts
def datafile(name, sep='\t'):
    "Read word, count pairs from file"
    for line in file(name):
        yield line.split(sep)
        
probability_word = probability_distribution(datafile('data/count_1w.txt'), N, unknown_word_probability)

In [59]:
# Testing "segment"
print segment('wouldcontact')

['would', 'contact']
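The same idea can be tried without the Google unigram file. The sketch below is a Python 3 toy version under stated assumptions: `TOY_COUNTS` is a made-up six-word vocabulary, `functools.lru_cache` stands in for the hand-written `memo` decorator, and log probabilities are summed instead of multiplying raw probabilities (which avoids underflow on long inputs). The unknown-word penalty follows the same `10 / (N * 10**len(word))` form used above.

```python
from functools import lru_cache
from math import log10

# Hypothetical toy counts; the real model reads data/count_1w.txt
TOY_COUNTS = {'would': 50, 'contact': 20, 'con': 5, 'tact': 3, 'wo': 1, 'uld': 1}
TOTAL = sum(TOY_COUNTS.values())

def log_prob(word):
    # Known word: relative frequency; unknown word: penalty that grows with length
    if word in TOY_COUNTS:
        return log10(TOY_COUNTS[word] / TOTAL)
    return log10(10.0 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment(text):
    # Return the most probable tuple of words for text (tuples are hashable, so they cache)
    if not text:
        return ()
    candidates = [(text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1)]
    return max(candidates, key=lambda words: sum(log_prob(w) for w in words))

best = list(segment('wouldcontact'))
print(best)
```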


### Convert Emoji/Emoticon to Text
- [Emoji data](http://ftp.unicode.org/Public/emoji/1.0/emoji-data.txt)
- [Twitter data preprocessing](https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/)

In [45]:
import re

# Read emoji data
f = open('data/emoji-data.txt', 'r')

# Emoji dictionary
emoji_dict = {}

for line in f:
    # Translate text (in a particular encoding) into unicode (i.e. decode) 
    line = line.decode('utf-8')
    # Replace '(' and ')' with ';' in '(emoji character)'
    line = re.sub('\(', ';', line)
    line = re.sub('\)', ';', line)
    # Split the line by ';'
    line_token = line.split(';')
    # Index 5 holds the emoji character and index 6 holds the emoji description
    emoji_dict[line_token[5]] = line_token[6].strip()
    
f.close()

# Note: To write Unicode out to a file or a terminal, we first need to translate it into a
# suitable encoding; this translation out of Unicode is called encoding
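Once a character-to-description dictionary exists, converting text is a per-character lookup. The sketch below assumes Python 3 and uses a hypothetical two-entry `sample_emoji_dict` in place of the full `emoji_dict`; `emoji_to_text` and the bracketed output format are illustrative choices, not part of the original notebook. It handles single code point emoji only.

```python
# Hypothetical two-entry mapping standing in for the full emoji_dict
sample_emoji_dict = {
    '\U0001F600': 'GRINNING FACE',
    '\u2764': 'HEAVY BLACK HEART',
}

def emoji_to_text(text, mapping):
    # Replace each mapped character with its bracketed description; keep the rest as-is
    return ''.join('[' + mapping[ch] + ']' if ch in mapping else ch for ch in text)

converted = emoji_to_text('I \u2764 text cleaning \U0001F600', sample_emoji_dict)
print(converted)
```

Emoji built from several code points (skin-tone modifiers, flags, sequences with a variation selector) would need matching on multi-character keys rather than a character-by-character scan.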

### Removing Stopwords (English and Spanish) 
- <font color=steelblue>Works with Tokenized text</font>

In [63]:
# Removing stop words: English and Spanish
from nltk.corpus import stopwords
en_stopwords = stopwords.words('english')
sp_stopwords = stopwords.words('spanish')

# Create a single list of stopwords by concatenating the two lists
stop_words = en_stopwords + sp_stopwords

print 'Number of English and Spanish stopwords: ', len(stop_words)

# Testing stopword removal
text = ['this', 'is', 'a', 'notebook', 'about', 'text', 'cleaning']
print [t for t in text if t not in stop_words]

# Adding user-defined custom stopwords
# (copy the list first so that en_stopwords itself is not modified)
custom_stopword = list(en_stopwords)
custom_stopword.append("notebook")

print [t for t in text if t not in custom_stopword]

Number of English and Spanish stopwords:  466
['notebook', 'text', 'cleaning']
['text', 'cleaning']
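Membership tests against a list scan it element by element, so for larger corpora it may be worth holding the stopwords in a set. A minimal sketch in Python 3, assuming two tiny made-up lists (`en_sample`, `sp_sample`) in place of the NLTK corpora so that it runs without downloading them:

```python
# Hypothetical miniature stopword lists standing in for stopwords.words(...)
en_sample = ['this', 'is', 'a', 'about']
sp_sample = ['el', 'la', 'de', 'que']

# Set union merges both languages and removes duplicates
stop_set = set(en_sample) | set(sp_sample)

# Build the custom set as a new object so stop_set is left unchanged
custom_stop_set = stop_set | {'notebook'}

tokens = ['this', 'is', 'a', 'notebook', 'about', 'text', 'cleaning']
filtered = [t for t in tokens if t not in stop_set]
custom_filtered = [t for t in tokens if t not in custom_stop_set]
print(filtered)
print(custom_filtered)
```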


### Tokenization
- Regular Expression tokenizer
- Word tokenizer

In [49]:
# Tokenization - NLTK Regexp 
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

# Testing Regexp tokenizer
text = 'This notebook is about text cleaning!'
print 'Text: ', text

print 'Tokenized text (regexp): ', tokenizer.tokenize(text.lower())

Text:  This notebook is about text cleaning!
Tokenized text (regexp):  ['this', 'notebook', 'is', 'about', 'text', 'cleaning']
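For a pattern like `\w+`, the standard library should give the same tokens without NLTK, since the tokenizer here simply collects every match of the pattern. A minimal sketch in Python 3:

```python
import re

text = 'This notebook is about text cleaning!'

# Collect every run of word characters from the lowercased text
tokens = re.findall(r'\w+', text.lower())
print(tokens)
```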


### Bigrams
- [Measures](http://www.nltk.org/_modules/nltk/metrics/association.html)

In [71]:
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

text = 'This notebook is about text cleaning!'

# Tokenize text
tokens = tokenizer.tokenize(text)
print 'Tokens: ', tokens

# Bigram Finder
bigram_finder = BigramCollocationFinder.from_words(tokens)
print bigram_finder

# Bigrams as per measure: Chi Square used here (more measures)
bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 10)
print bigrams

# Join Tokens and Bigrams
tokens = tokens + bigrams
print tokens

Tokens:  ['This', 'notebook', 'is', 'about', 'text', 'cleaning']
<nltk.collocations.BigramCollocationFinder object at 0x109b7e510>
[('This', 'notebook'), ('about', 'text'), ('is', 'about'), ('notebook', 'is'), ('text', 'cleaning')]
['This', 'notebook', 'is', 'about', 'text', 'cleaning', ('This', 'notebook'), ('about', 'text'), ('is', 'about'), ('notebook', 'is'), ('text', 'cleaning')]


In [85]:
from nltk import bigrams
text = 'This notebook is about text cleaning!'
# Tokenize text
tokens = tokenizer.tokenize(text)
[' '.join(bigram) for bigram in bigrams(tokens)]

['This notebook', 'notebook is', 'is about', 'about text', 'text cleaning']
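Plain n-grams (without association scores) can also be built with `zip` over shifted copies of the token list. The helper below is an illustrative Python 3 sketch; `ngrams` is a name chosen here, not the NLTK function.

```python
def ngrams(tokens, n):
    # zip stops at the shortest slice, so only complete n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['This', 'notebook', 'is', 'about', 'text', 'cleaning']

bigram_strings = [' '.join(gram) for gram in ngrams(tokens, 2)]
trigram_strings = [' '.join(gram) for gram in ngrams(tokens, 3)]
print(bigram_strings)
print(trigram_strings)
```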

### Trigrams

In [72]:
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures

text = 'This notebook is about text cleaning!'

# Tokenize text
tokens = tokenizer.tokenize(text)
print 'Tokens: ', tokens

# Trigram Finder
trigram_finder = TrigramCollocationFinder.from_words(tokens)
print trigram_finder

# Trigrams as per measure: Chi Square used here (more measures)
trigrams = trigram_finder.nbest(TrigramAssocMeasures.chi_sq, 10)
print trigrams

# Join Tokens and Trigrams
tokens = tokens + trigrams
print tokens

Tokens:  ['This', 'notebook', 'is', 'about', 'text', 'cleaning']
<nltk.collocations.TrigramCollocationFinder object at 0x109b7e610>
[('This', 'notebook', 'is'), ('about', 'text', 'cleaning'), ('is', 'about', 'text'), ('notebook', 'is', 'about')]
['This', 'notebook', 'is', 'about', 'text', 'cleaning', ('This', 'notebook', 'is'), ('about', 'text', 'cleaning'), ('is', 'about', 'text'), ('notebook', 'is', 'about')]


In [84]:
from nltk import trigrams

text = 'This notebook is about text cleaning!'
# Tokenize text
tokens = tokenizer.tokenize(text)
[' '.join(trigram) for trigram in trigrams(tokens)]

['This notebook is',
 'notebook is about',
 'is about text',
 'about text cleaning']

### Word Cloud

In [None]:
# Plot Wordcloud
%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud

In [None]:
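# Plot: https://eradiating.wordpress.com/tag/ipython-notebook/
# Assumes tfidf_group is a pandas DataFrame with one row per label: a 'label' column
# first, followed by one TF-IDF weight column per word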
for i in range(tfidf_group.shape[0]):
    words_ = tfidf_group.columns.values[1:]
    weights = tfidf_group.ix[i, 1:].values
    words_weights = zip(words_, weights)
    label = 'Label-' + tfidf_group.ix[i, 'label']
    words = ' '.join([x[0] for x in words_weights for times in
                     range(0, int(x[1]*1000))])
    wordcloud = WordCloud(background_color='black', 
                          width=1200, 
                          height=900).generate(words)
    
    plt.figure(figsize=(15, 8))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.title(label, fontsize = 24)
    plt.tight_layout(pad=0)
    wordcloud.to_file(label+'.png')
    plt.show()

### Stanford NER

- **Issues with 7 Class** - Entities: Person, Location, Organization, Time, Date, Percent, Money and O: Other
    - Often fails to detect a complete date. Example: Check "Barack Hussein Obama was born on 4 August 1961" on [NER Demo](http://nlp.stanford.edu:8080/ner/process)
    - StanfordCoreNLP has SUTime (rule based Stanford Temporal Tagger) however SUTime is not used in Statistical NER directly
    
- **Stanford Temporal Tagger ([SUTime](http://nlp.stanford.edu/software/sutime.shtml))** - Rule based tagger returns (DATE, TIME, DURATION, and SET) tags
    - Java-based; no Python implementation
    - [SUTime Demo](http://nlp.stanford.edu:8080/sutime/process)
    - [english.sutime.txt](https://github.com/evandrix/stanford-corenlp/blob/master/sutime/english.sutime.txt)
    - [english.holidays.sutime.txt](https://github.com/evandrix/stanford-corenlp/blob/master/sutime/english.holidays.sutime.txt)
    - [defs.sutime.txt](https://github.com/evandrix/stanford-corenlp/blob/master/sutime/defs.sutime.txt)

- **Text information extraction** - [Google Presentation](http://videolectures.net/mlas06_nigam_tie/)
    
```bash
$ nano .bash_profile
# Add this line to .bash_profile, then save and quit:
#   export JAVA_HOME=$(/usr/libexec/java_home)
$ source .bash_profile
$ echo $JAVA_HOME
```

```python
# Set Java Path
import os
os.environ['JAVA_HOME'] = '/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home'

# Stanford NER
from nltk.tag import StanfordNERTagger

# Class and NER path: Move "stanford-ner-2015-12-09" to Project directory
class_path = 'stanford-ner-2015-12-09/classifiers/english.muc.7class.distsim.crf.ser.gz'
ner_path = 'stanford-ner-2015-12-09/stanford-ner.jar'

# 7 Class NER Tagger
# Entities: Person, Location, Organization, Time, Date, Percent, Money and O: Other
st = StanfordNERTagger(class_path, ner_path)

# Test
test = 'John is studying at Stony Brook University in NY'
tags = st.tag(test.split())
```

### Regular Expression Examples

```python
import re

# Matching: 'a year', 'a month', 'two years ago', '3 years', '2.5 years'
numbers = """(^a(?=\s)|one|two|three|four|five|six|seven|eight|nine|ten|eleven|
             twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|
             twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|hundred|thousand)"""

prefix = "(about|almost|under|around|over|since)"
dmy = "(year|day|week|month|yrs|yr|y)"
suffix = "(before|after|earlier|later|ago)"
date1 = "((\d+(\.\d+)?|(" + numbers + "[-\s]?))\s*" + dmy + "s?\s*(?:" + suffix + ")?)"
print date1

date1_reg = re.compile(date1, re.IGNORECASE|re.VERBOSE)

cases = ['a year', '2.5 years', 'two years ago', 'nine months', 'eight yrs', 'four yr', 'five y', '5 years']
for c in cases:
    print date1_reg.findall(c)
    

# Matching: '2011', 'August 2011', 'Since 2012'
year = "((?<=\s)\d{4}|^\d{4})"
date3_reg = re.compile(year)
print year

print date3_reg.findall('2011')
print date3_reg.findall('May 2015')
print date3_reg.findall('Since 2012')

# Matching: 11/21/2015 or 11-21-2015
date4 = "\d+[/-]\d+[/-]\d+"
date4_reg = re.compile(date4, re.IGNORECASE)
print date4

print date4_reg.findall('09/12/16')
print date4_reg.findall('09-12-16')
print date4_reg.findall('1/2/2016')
print date4_reg.findall('1-2-2016')
print date4_reg.findall('01/02/2016')
print date4_reg.findall('01-02-2016')

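# Matching: Date in ISO format, with either 'T' or a space before the time
# and an optional fractional second and UTC offset
iso = "\d+[/-]\d+[/-]\d+[T ]\d+:\d+:\d+(?:\.\d+)?(?:Z|[+-]\d+:\d+)?"
iso_reg = re.compile(iso)
print iso
print iso_reg.findall("2016-07-07T21:16:16+00:00")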


# Matching: 'Several years', 'few months', 'almost 3 years'
prefix = "(almost|about|under|around|over|this|next|last|few|several)"
dmy = "(year|day|week|month|yrs|yr|y|mo|wk)"
date5 = "((?:" + prefix + ')?\s*(?:(\d+))?\s*' +  '(?:' + dmy + "s?" + ')?)' 

date5_reg = re.compile(date5, re.IGNORECASE)
print date5_reg.findall('several years')
print date5_reg.findall('few weeks')
print date5_reg.findall('almost 4 years')

# Example of a non-capturing group, (?:...), wrapping an optional prefix
r = '(?:([a-z]{2,})_)?(\d+)_([a-z]{2,}\d+)_(\d+)$'
print r
temp_reg = re.compile(r, re.IGNORECASE)

x = 'SH_6208069141055_BC000388_20110412101855'
y = '6208069141055_BC000388_20110412101855'

print temp_reg.findall(x)
print temp_reg.findall(y)

```

```python
# Example
import re

string = 'foobarbarfoo'

# Look ahead positive: Finds the 1st "bar" which has "bar" after it
pla = 'bar(?=bar)' 
re_1 = re.compile(pla)
print re_1.findall(string)

# Look ahead negative: Finds the 2nd "bar" which does not have "bar" after it
nla = 'bar(?!bar)'
re_2 = re.compile(nla)
print re_2.findall(string)

# Look behind positive: Finds the 1st "bar" which has "foo" before it
plb = '(?<=foo)bar'
re_3 = re.compile(plb)
print re_3.findall(string)

# Look behind negative: Finds the 2nd "bar" which does not have "foo" before it
nlb = '(?<!foo)bar'
re_4 = re.compile(nlb)
print re_4.findall(string)
```
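All four lookaround patterns return the same `['bar']` from `findall`, so the output alone does not show which occurrence matched. A small sketch (Python 3; the `patterns` dictionary and its keys are illustrative) uses `re.finditer` to report the start index of each match instead:

```python
import re

text = 'foobarbarfoo'

patterns = {
    'positive lookahead': r'bar(?=bar)',
    'negative lookahead': r'bar(?!bar)',
    'positive lookbehind': r'(?<=foo)bar',
    'negative lookbehind': r'(?<!foo)bar',
}

# Start index of every match: the first "bar" starts at 3, the second at 6
starts = {name: [m.start() for m in re.finditer(pattern, text)]
          for name, pattern in patterns.items()}

for name, positions in starts.items():
    print(name, positions)
```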

### Using Regular Expressions

- [Regular Expression Tutorial](http://regexone.com/lesson/introduction_abcs)
- [RegExr Tool](http://regexr.com/)
- [RegEx101 (awesome)](https://regex101.com/)
- [Automate boring stuff blog](https://automatetheboringstuff.com/chapter7/)
- [Rexegg](http://www.rexegg.com/)
- [Regular Expression Cheatsheet](http://www.pnotepad.org/docs/search/regular_expressions/)
- [Look ahead/Look behind](http://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups)
---

```python

# Imports
import re
import dateparser
import datetime
import numpy as np

# Predefined strings.
# Adjacent string literals are concatenated, so no stray whitespace ends up
# inside the alternations (several patterns below are compiled without re.VERBOSE)
numbers = ("(^a(?=\s)|one|two|three|four|five|six|seven|eight|nine|ten|"
           "eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|"
           "eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty|"
           "ninety|hundred|thousand)")
month = ("(january|february|march|april|may|june|july|august|september|"
         "october|november|december)")
month_short = "(jan|feb|mar|apr|may|jun|jul|aug|sept|oct|nov|dec)"
prefix = "(couple|almost|about|under|around|over|this|next|last|few|several|since|many|some|number|handful|plenty|various)"
suffix = "(before|after|earlier|later|ago|winter|spring|summer|fall)"
dwmy = "(year|day|week|month|yrs|yr|y|wk)"
number_map = {
    "one": 1,
    "two": 2,
    "three": 3,
    "four": 4,
    "five": 5,
    "six": 6,
    "seven": 7,
    "eight": 8,
    "nine": 9,
    "ten": 10,
    "eleven": 11,
    "twelve": 12,
    "thirteen": 13,
    "fourteen": 14,
    "fifteen": 15,
    "sixteen": 16,
    "seventeen": 17,
    "eighteen": 18,
    "nineteen": 19,
    "twenty": 20,
    "thirty": 30,
    "forty": 40,
    "fifty": 50,
    "sixty": 60,
    "seventy": 70,
    "eighty": 80,
    "ninety": 90,
    "hundred": 100,
    "couple": 2}

string_map = {
    "year": "year",
    "yrs": "year",
    "yr": "year",
    "y": "year",
    "day": "day",
    "week": "week",
    "wk": "week",
    "month": "month"}

seasons = ['winter', 'spring', 'summer', 'fall']
seasons_map = {
    "winter": "january",
    "spring": "april",
    "summer": "july",
    "fall": "october"}
obscure = ['few', 'several', 'many', 'some', 'number', 'plenty', 'various']

# Matching: 'a year', 'a month', 'two years ago', '3 years', '2.5 years'
regex_1 = "((\d+(\.\d+)?|(" + numbers + "[-\s]?))\+?\s*" + dwmy + "s?\s*(?:" + suffix + ")?)"
reg_1 = re.compile(regex_1, re.IGNORECASE|re.VERBOSE)

# Matching: '2011', 'August 2011', 'Since 2012'
regex_2 = "((?<=\s)\d{4}|^\d{4})"
reg_2 = re.compile(regex_2)

# Matching: 11/21/2015 or 11-21-2015
regex_3 = ".*?(\d+[/-]\d+[/-]\d+).*?"
reg_3 = re.compile(regex_3, re.IGNORECASE)

# Matching: Date in ISO format
regex_4 = "\d+[/-]\d+[/-]\d+ \d+:\d+:\d+\.\d+"
reg_4 = re.compile(regex_4)

# Matching: 'Several years', 'few months', 'almost 3 years'
regex_5 = "((?:" + prefix + ')?\s*(?:(\d+))?\s*' +  '(?:' + dwmy + "s?" + ')?)' 
reg_5 = re.compile(regex_5, re.IGNORECASE)

# Matching: 'two and a half years'
regex_6 = "((\d+|" + numbers + ')\s*(?=and a).*?(?<=half )' + dwmy + 's?)'
reg_6 = re.compile(regex_6, re.IGNORECASE)

# Matching: 'The August of 2014' or 'Dec 2006'
regex_7 = '.?(' + month + '|' + month_short + ').*?(\d{4})'
reg_7 = re.compile(regex_7, re.IGNORECASE)

# Matching: 'Around 2014', 'Since 2011'
regex_8 = ".*?((?:" + prefix + ')?\s*(?:(\d+))?\s*' +  '(?:' + dwmy + "s?" + ')?)' 

# Matching: 'last fall', 'last march'
regex_9 = "((" + prefix + ')\s*(' + suffix + '|' + dwmy + '|' + month + '|' + month_short +')\s*)'
reg_9 = re.compile(regex_9, re.IGNORECASE)

# Matching: 'couple of months ago', 'couple years ago', 'couple years'
regex_10 = "((" + prefix + ').*?\s*(' + dwmy + ').*?\s*(?:' + suffix + ')?.*?\s*)'
reg_10 = re.compile(regex_10, re.IGNORECASE)

# Matching: an optional 'a', a time unit, and an optional 'ago'
regex_11 = '.*?(?:(a))?\s*?(months|month|days|day|weeks|week|years|year)s?\s*?(?:(ago))?'
reg_11 = re.compile(regex_11, re.IGNORECASE)

def text_parse(text):
    text = text.lower()
    # Match variations of: a year, two years ago, 3 years, 2.5 years, 2+ years
    found = reg_1.findall(text)
    #print found
    if len(found) > 0:
        a = found[0][1]
        d = found[0][5]
        if a.strip() == 'a':
            b = a
            c = string_map[d]
            temp = ' '.join([b, c])
            date = dateparser.parse(temp)
        elif (a.strip()).isalpha():
            b = str(number_map[a.strip()])
            c = string_map[d]
            temp = ' '.join([b, c])
            date = dateparser.parse(temp)
        else:
            b = str(np.int(np.floor(np.float(a))))
            c = string_map[d]
            temp = ' '.join([b, c])
            date = dateparser.parse(temp)
        return date
    
    # Extract YYYY from variations of '2011', 'August 2011', 'Since 2012'
    found2 = reg_2.findall(text)
    print found2
    #if len(found2) > 0:
        #return dateparser.parse(found2[0])
    
    # Match variations of: 'date is 6/21/16 right?', i.e. 'string MM/DD/YY or MM/DD/YYYY string'
    found3 = reg_3.findall(text)
    #print found3
    if len(found3) > 0:
        temp = found3[0]
        date = dateparser.parse(temp)
        return date
    
    # Matching: Date in ISO format - No need as dateparser parses it
    found4 = reg_4.findall(text)
    #print found4
    
    # Match variations of: 'two and a half years', '3 and a half years'
    found6 = reg_6.findall(text)
    #print found6
    if len(found6) > 0:
        a = found6[0][1]
        d = found6[0][3]
        if (a.strip()).isalpha():
            b = str(number_map[a.strip()])
            c = string_map[d]
            temp = ' '.join([b, c])
            date = dateparser.parse(temp)
        else:
            b = str(np.int(np.floor(np.float(a))))
            c = string_map[d]
            temp = ' '.join([b, c])
            date = dateparser.parse(temp)
        return date
    
    # Match variations of: 'In sept. 2008', 'Last June in 2008' 
    found7 = reg_7.findall(text)
    #print 'found7---', found7
    if len(found7) > 0:
        a = found7[0][0]
        b = found7[0][3]
        temp = ' '.join([a, str(b)])
        date = dateparser.parse(temp)
        return date
    
    # Match variations of: 'last august', 'last fall'
    found9 = reg_9.findall(text)
    #print found9
    if len(found9) > 0:
        crnt_date = datetime.date.today()
        a = found9[0][1]
        b = found9[0][2]  # whole season/unit/month match (index 3 is set only for suffix words)
        if a.strip() == 'last':
            last_yr = str(crnt_date.year - 1)
            if b in seasons:
                e = seasons_map[b]
                temp = ' '.join([e, last_yr])
                date = dateparser.parse(temp)
            else:
                e = b
                temp = ' '.join([e, last_yr])
                date = dateparser.parse(temp)
            return date
        
    # Match variations of: 'couple years', 'couple of months'
    found10 = reg_10.findall(text)
    #print found10
    if len(found10) > 0:
        a = found10[0][1]
        b = found10[0][3]  
        if a.strip() == 'couple':
            f = str(number_map[a.strip()])
            temp = ' '.join([f, b])
            date = dateparser.parse(temp)
        elif a.strip() in obscure:
            f = str(np.random.randint(3, 10))
            temp = ' '.join([f, b])
            date = dateparser.parse(temp)
        else:
            date = ''
        return date
                
#print text_parse('3-4 years ago')
#print text_parse('2 years and six months')
#print reg_10.findall('a couple of months ago')
#print reg_10.findall('few months ago')
#print reg_10.findall('many years ago')
#print text_parse('a couple of months ago')
#print text_parse('few months ago')
#print text_parse('many years ago')
#print text_parse('several weeks ago')
#print text_parse('number of weeks ago')
#print text_parse('various weeks')
print text_parse('since 2012')
print text_parse('i think a year ago sure')
print text_parse('years ago')
print text_parse('month ago')
print text_parse('half a year')
print text_parse('years ago')
print text_parse('year and a half')


#under a year
#year and a half
#Less than a month
#For a year and a half.
#years ago
#I don't remember.   years ago
#I year ago
#over a year
#month ago
#over a year ago
#At least a year ago.
#Half a year
#over a year ago
#month ago
#Over a year ago
#Years ago.
#A little over a year ago.
#I don't know about a year ago, a little less.
#Over a year ago.
#More than a year ago
#Almost a year ago.
#year ago
#month ago
#More than a year ago
```
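The core step of turning a phrase such as "two years ago" into a date can be sketched without `dateparser`. The example below is a simplified Python 3 illustration under loud assumptions: `approximate_date`, `UNIT_DAYS`, `NUMBER_WORDS`, and the fixed `REFERENCE` date are all hypothetical, months and years are approximated as 30 and 365 days, and only a handful of number words are covered.

```python
import re
from datetime import date, timedelta

# Rough day counts per unit (assumption: 30-day months, 365-day years)
UNIT_DAYS = {'day': 1, 'week': 7, 'month': 30, 'year': 365}
# Small illustrative subset of number words
NUMBER_WORDS = {'a': 1, 'one': 1, 'two': 2, 'three': 3, 'couple': 2}

PATTERN = re.compile(
    r'\b(\d+(?:\.\d+)?|a|one|two|three|couple)\s+(?:of\s+)?(day|week|month|year)s?\b',
    re.IGNORECASE)

def approximate_date(text, today):
    # Return today minus the quantity found in text, or None if nothing matches
    match = PATTERN.search(text)
    if match is None:
        return None
    quantity, unit = match.group(1).lower(), match.group(2).lower()
    amount = NUMBER_WORDS[quantity] if quantity in NUMBER_WORDS else float(quantity)
    return today - timedelta(days=int(amount * UNIT_DAYS[unit]))

REFERENCE = date(2016, 7, 7)  # fixed reference date so the results are reproducible
for phrase in ['two years ago', 'i think a year ago sure', 'a couple of months ago', 'years ago']:
    print(phrase, '->', approximate_date(phrase, REFERENCE))
```

Phrases with no quantity, such as "years ago", return `None` here and would need a policy of their own.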

### What is the yield statement in Python?
- [Reference](http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python)

In [40]:
## Reading a list's items one by one: Iteration
# Create list (an iterable)
temp_list = [1, 2, 3]
for i in temp_list:
    print i

## List Comprehension creates a list (an iterable)
temp_list = [x*x for x in range(3)]
print temp_list

## Generators: Generators are iterators (can only iterate over them once)
# because generators do not store all values in memory, they generate
# values on the fly
temp_generator = (x*x for x in range(3)) # Used () instead of []
print temp_generator

# Now iterate over generator
for i in temp_generator:
    print i
    
## "yield" is a keyword that is used like "return", except the function will return a generator
def create_generator():
    temp_list = range(3)
    for i in temp_list:
        print 'i = ', i
        yield i**i
        
# Note: When the function is called, the code in the function body does not run   
temp_generator = create_generator() # create_generator() is called
# Calling function only returns the generator object
print temp_generator # print i in create_generator() did not run

# Now iterate over generator
for i in temp_generator:
    print i

1
2
3
[0, 1, 4]
<generator object <genexpr> at 0x1094443c0>
0
1
4
<generator object create_generator at 0x109444370>
i =  0
1
i =  1
1
i =  2
4
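The "iterate only once" property is easy to check directly. A minimal Python 3 sketch (`squares` is an illustrative generator function, not part of the cell above):

```python
def squares(n):
    # Each value is computed only when the consumer asks for it
    for i in range(n):
        yield i * i

gen = squares(3)

first_value = next(gen)   # runs the body up to the first yield
remaining = list(gen)     # consumes the values that are left
exhausted = list(gen)     # the generator is now empty

print(first_value, remaining, exhausted)
```

To iterate again, call the generator function once more to get a fresh generator object.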
