# Text Mining With Python Regular Expressions

**Word comparison functions** 
- s.startswith(t)
- s.endswith(t)
- t in s
- s.isupper(); s.islower(); s.istitle()
- s.isalpha(); s.isdigit(); s.isalnum()

**String Operations**
- s.lower(); s.upper(); s.title()
- s.split(t)
- s.splitlines()
- s.join(t)
- s.strip(); s.rstrip()
- s.find(t); s.rfind(t)
- s.replace(u, v)
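A few of the methods listed above can be tried on a short string. This is a minimal sketch; the sample string is hypothetical and chosen only for illustration. Note that in `s.join(t)` the string `s` is the separator and `t` is the sequence being joined.

```python
# Hypothetical sample string, chosen only for illustration.
s = "  Text mining with Python  "

clean = s.strip()                  # remove leading and trailing whitespace
words = clean.split(" ")           # split on a single space
joined = "-".join(words)           # the separator is the string join() is called on
position = clean.find("mining")    # index of the first occurrence, or -1 if absent
replaced = clean.replace("Python", "regex").title()

print(clean.startswith("Text"), clean.endswith("Python"), "mining" in clean)
print(words)
print(joined)
print(position)
print(replaced)
```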

In [1]:
text = "Ethics are built right into the ideals and objectives of the United Nations "
print(len(text)) # The length of text (number of characters)
print(len(text.split())) # Number of words (split: text => list of words)

76
13


In [2]:
[w for w in text.split() if w.istitle()] # Capitalized words in text

['Ethics', 'United', 'Nations']

In [3]:
set(text.split()) # Unique words in text

{'Ethics',
 'Nations',
 'United',
 'and',
 'are',
 'built',
 'ideals',
 'into',
 'objectives',
 'of',
 'right',
 'the'}

**Meta-characters: Character matches**

- . : wildcard, matches a single character
- ^ : start of a string
- $ : end of a string
- [] : matches one of the set of characters within []
- [a-z] : matches one of the range of characters a, b, …, z
- [^abc] : matches a character that is not a, b, or c
- a|b : matches either a or b, where a and b are strings
- () : Scoping for operators
- \ : Escape character for special characters (\t, \n, \b)

- \b : Matches word boundary
- \d : Any digit, equivalent to [0-9]
- \D : Any non-digit, equivalent to [^0-9]
- \s : Any whitespace, equivalent to [ \t\n\r\f\v]
- \S : Any non-whitespace, equivalent to [^ \t\n\r\f\v]
- \w : Alphanumeric character, equivalent to [a-zA-Z0-9_]
- \W : Non-alphanumeric, equivalent to [^a-zA-Z0-9_]

- "*" : matches zero or more occurrences
- "+" : matches one or more occurrences
- ? : matches zero or one occurrences
- {n} : exactly n repetitions, n ≥ 0
- {n,} : at least n repetitions
- {,n} : at most n repetitions
- {m,n} : at least m and at most n repetitions
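The meta-characters and quantifiers above can be combined as in the following sketch. The sample sentence is a made-up input, not part of the original notebook.

```python
import re

# Hypothetical sample text for illustration.
sample = "Order 66 shipped, order 7 is pending, order 1234 was lost."

all_numbers = re.findall(r"\d+", sample)            # + : one or more digits
short_numbers = re.findall(r"\b\d{1,2}\b", sample)  # {1,2} with word boundaries
orders = re.findall(r"[Oo]rder", sample)            # [] : one character from a set
edges = re.findall(r"^Order|lost\.$", sample)       # ^, $ and | ; \. escapes the dot
long_orders = re.findall(r"order \d{4,}", sample)   # {4,} : at least four digits

print(all_numbers)
print(short_numbers)
print(orders)
print(edges)
print(long_orders)
```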

**Examples**

In [4]:
import numpy as np
import pandas as pd
import re

import warnings
warnings.filterwarnings('ignore')

In [5]:
dateStr = "23-10-2002\n 23/10/2002\n 23/10/02 \n 10/23/2002\n 23 Oct 2002\n 23 October 2002\n Oct 23, 2002\n October 23, 2002\n"

In [6]:
print(dateStr)

23-10-2002
 23/10/2002
 23/10/02 
 10/23/2002
 23 Oct 2002
 23 October 2002
 Oct 23, 2002
 October 23, 2002



In [7]:
re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', dateStr)

['23-10-2002', '23/10/2002', '23/10/02', '10/23/2002']

In [8]:
re.findall(r'\d{1,2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{2,4}', dateStr)
# The pattern matches the whole date, but findall returns only the text captured by the group in ()

['Oct']

In [9]:
re.findall(r'\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{2,4}', dateStr)

['23 Oct 2002']

In [10]:
re.findall(r'\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{2,4}', dateStr)

['23 Oct 2002', '23 October 2002']

In [11]:
re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{2,4}', dateStr)

['23 Oct 2002', '23 October 2002', 'Oct 23', 'October 23']

In [12]:
re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{2,4}', dateStr)

['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']

In [13]:
time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df['numChar'] = df['text'].str.len() # find the number of characters for each string in df['text']
df['numTokens'] = df['text'].str.split().str.len() # find the number of tokens for each string in df['text']
df

| | text | numChar | numTokens |
|---|---|---|---|
| 0 | Monday: The doctor's appointment is at 2:45pm. | 46 | 7 |
| 1 | Tuesday: The dentist's appointment is at 11:30... | 50 | 8 |
| 2 | Wednesday: At 7:00pm, there is a basketball game! | 49 | 8 |
| 3 | Thursday: Be back home by 11:15 pm at the latest. | 49 | 10 |
| 4 | Friday: Take the train at 08:10 am, arrive at ... | 54 | 10 |


In [14]:
df['text'].str.contains('appointment') # find which entries contain the word 'appointment'

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

In [15]:
df['text'].str.count(r'\d') # find how many times a digit occurs in each string
df['text'].str.findall(r'\d') # find all occurrences of the digits
df['text'].str.findall(r'(\d?\d):(\d\d)') # group and find the hours and minutes

0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object

In [16]:
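# regex=True is required in pandas 2.x, where str.replace treats the pattern as a literal by default
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True) # replace weekdays with 3-letter abbreviations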

0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [17]:
df['text'].str.extract(r'(\d?\d):(\d\d)') # create new columns from first match of extracted groups

| | 0 | 1 |
|---|---|---|
| 0 | 2 | 45 |
| 1 | 11 | 30 |
| 2 | 7 | 00 |
| 3 | 11 | 15 |
| 4 | 08 | 10 |


In [18]:
# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')

| | match | time | hour | minute | period |
|---|---|---|---|---|---|
| 0 | 0 | 2:45pm | 2 | 45 | pm |
| 1 | 0 | 11:30 am | 11 | 30 | am |
| 2 | 0 | 7:00pm | 7 | 00 | pm |
| 3 | 0 | 11:15 pm | 11 | 15 | pm |
| 4 | 0 | 08:10 am | 08 | 10 | am |
| 4 | 1 | 09:00am | 09 | 00 | am |
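The same named-group pattern also works with the standard `re` module, which can help when checking a pattern on a single string before applying it to a whole column. A minimal sketch, using one sentence from `time_sentences`:

```python
import re

# Same named-group pattern as in the extractall call, compiled with the re module.
pattern = re.compile(r"(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))")
sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

# finditer yields one match object per occurrence; groupdict maps group names to text.
matches = [m.groupdict() for m in pattern.finditer(sentence)]
for row in matches:
    print(row)
```

Each dictionary corresponds to one row of the `extractall` result; all captured values are strings, so leading zeros are preserved.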


**Case Study**

Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.


Here is a list of some of the variants we might encounter in this dataset:
* 04/20/2009; 04/20/09; 4/20/09; 4/3/09
* Mar-20-2009; Mar 20, 2009; March 20, 2009;  Mar. 20, 2009; Mar 20 2009;
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
* Feb 2009; Sep 2009; Oct 2010
* 6/2008; 12/2009
* 2009; 2010

Once we have extracted these date patterns from the text, the next step is to sort them in ascending chronological order according to the following rules:
* Assume all dates in xx/xx/xx format are mm/dd/yy
* Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
* Watch out for potential typos as this is a raw, real-life derived dataset.
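The defaulting rules for purely numeric dates could be sketched as follows. `normalize_numeric_date` is a hypothetical helper written for illustration, not the approach used in the solution further down, and it only covers the `mm/dd/yy`, `mm/dd/yyyy`, `mm/yyyy`, and `yyyy` variants.

```python
import re
from datetime import date

def normalize_numeric_date(s):
    """Hypothetical helper: turn mm/dd/yy(yy), mm/yyyy, or yyyy into a date."""
    parts = [int(p) for p in re.split(r"[/-]", s.strip())]
    if len(parts) == 3:        # mm/dd/yy or mm/dd/yyyy
        month, day, year = parts
    elif len(parts) == 2:      # mm/yyyy: day missing, assume the 1st
        month, year = parts
        day = 1
    else:                      # yyyy: month and day missing, assume January 1
        (year,) = parts
        month, day = 1, 1
    if year < 100:             # two-digit years are taken to be in the 1900s
        year += 1900
    return date(year, month, day)

for raw in ["1/5/89", "04/20/2009", "9/2009", "2010"]:
    print(raw, "->", normalize_numeric_date(raw))
```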

With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices.

For example if the original series was this:

    0    1999
    1    2010
    2    1978
    3    2015
    4    1985

Your function should return this:

    0    2
    1    4
    2    0
    3    1
    4    3
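The expected result is the list of original indices ordered by date. A plain-Python sketch of that ordering, using the five years from the example (with pandas, `sort_values().index` produces the same ordering):

```python
# Years from the example, stored at positions 0..4.
years = [1999, 2010, 1978, 2015, 1985]

# Sort the positions by the value stored at each position.
order = sorted(range(len(years)), key=lambda i: years[i])
print(order)  # [2, 4, 0, 1, 3]
```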

In [19]:
test = "04/20/2009\n  04/20/09\n 4/20/09\n 4/3/09\n Mar-20-2009\n Mar 20, 2009\n March 20, 2009\n Mar. 20, 2009\n Mar 20 2009\n 20 Mar 2009\n 20 March 2009\n 20 Mar. 2009\n 20 March, 2009\n Mar 20th, 2009\n Mar 21st, 2009\n Mar 22nd, 2009\n Feb 2009\n Sep 2009\n Oct 2010\n 6/2008\n 12/2009\n 2009\n 2010\n"
print(test)

04/20/2009
  04/20/09
 4/20/09
 4/3/09
 Mar-20-2009
 Mar 20, 2009
 March 20, 2009
 Mar. 20, 2009
 Mar 20 2009
 20 Mar 2009
 20 March 2009
 20 Mar. 2009
 20 March, 2009
 Mar 20th, 2009
 Mar 21st, 2009
 Mar 22nd, 2009
 Feb 2009
 Sep 2009
 Oct 2010
 6/2008
 12/2009
 2009
 2010



In [20]:
doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df2 = pd.Series(doc)
data = pd.DataFrame(df2, columns=['text'])
data.head()

| | text |
|---|---|
| 0 | 03/25/93 Total time of visit (in minutes):\n |
| 1 | 6/18/85 Primary Care Doctor:\n |
| 2 | sshe plans to move as of 7/8/71 In-Home Servic... |
| 3 | 7 on 9/27/75 Audit C Score Current:\n |
| 4 | 2/6/96 sleep studyPain Treatment Pain Level (N... |


In [21]:
pd.to_datetime('March 20, 2009')

Timestamp('2009-03-20 00:00:00')

In [22]:
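# Raw strings keep backslashes such as \d from being treated as string escapes
regex1 = r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})'  # 04/20/2009, 4/3/09
regex2 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[\S]*[\s]*\d{1,2}[\S]*[\s]*\d{4})'  # Mar 20th, 2009
regex3 = r'(\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[\S]*[\s]+\d{4})'  # 20 March, 2009
regex4 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[\S]*[\s]+\d{4})'  # Feb 2009
regex5 = r'((?:\d{1,2}[/-])?[12]\d{3})'  # 6/2008, 2009; [12] instead of [1|2], which would also match a literal |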

In [23]:
full_regex = '(%s|%s|%s|%s|%s)' %(regex1, regex2, regex3, regex4, regex5)
parsed_date = df2.str.extract(full_regex)
parsed_date = parsed_date.iloc[:,0].str.replace('Janaury', 'January').str.replace('Decemeber', 'December')
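# Note: with pandas 2.x, strings in several date formats may require pd.to_datetime(parsed_date, format='mixed')
parsed_date = pd.Series(pd.to_datetime(parsed_date))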
parsed_date = parsed_date.sort_values(ascending=True).index
pd.Series(parsed_date.values)

0        9
1       84
2        2
3       53
4       28
      ... 
495    231
496    141
497    186
498    161
499    413
Length: 500, dtype: int64

### Text Preprocessing

In [24]:
import nltk # Natural Language Toolkit 
# Let's get some text corpora
# nltk.download()
# from nltk.book import *

In [25]:
with open('moby.txt', 'r') as f:
    moby_raw = f.read()

**Word tokenization**

In [26]:
# Text tokenization 
moby_tokens = nltk.word_tokenize(moby_raw)
moby_tokens[1:7]

['Moby', 'Dick', 'by', 'Herman', 'Melville', '1851']

In [27]:
print("Number of tokens: ", len(moby_tokens))
print("Number of unique tokens: ", len(set(moby_tokens)))

Number of tokens:  255038
Number of unique tokens:  20742
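To connect tokenization back to regular expressions, here is a rough regex-based tokenizer. It is only a sketch and is not equivalent to `nltk.word_tokenize`, which handles contractions, abbreviations, and other cases this pattern ignores. The sample sentence is a hypothetical input.

```python
import re

# Hypothetical sample sentence for illustration.
text = "Call me Ishmael. Some years ago, never mind how long precisely."

# \w+ grabs runs of word characters; [^\w\s] grabs each punctuation mark on its own.
tokens = re.findall(r"\w+|[^\w\s]", text)

print(tokens)
print("Number of tokens: ", len(tokens))
print("Number of unique tokens: ", len(set(tokens)))
```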


**Sentence tokenization**

In [28]:
moby_sentence = nltk.sent_tokenize(moby_raw)
print(moby_sentence[4])

"While you take in hand to school others, and to teach them by what
name a whale-fish is to be called in our tongue leaving out, through
ignorance, the letter H, which almost alone maketh the signification
of the word, you deliver that which is not true."


In [29]:
print("Number of sentences: ", len(moby_sentence))

Number of sentences:  9852


**Normalization and stemming**

In [30]:
moby_st = nltk.word_tokenize(moby_raw.lower())
porter = nltk.PorterStemmer()
moby_st = [porter.stem(t) for t in moby_st]
moby_st[1:7]

['mobi', 'dick', 'by', 'herman', 'melvil', '1851']

**Lemmatization**

Common Python packages for lemmatization:
- Wordnet Lemmatizer
- Spacy Lemmatizer
- TextBlob
- CLiPS Pattern
- Stanford CoreNLP
- Gensim Lemmatizer
- TreeTagger

In [31]:
from nltk.stem import WordNetLemmatizer

In [32]:
lemmatizer = WordNetLemmatizer()
moby_lem = [lemmatizer.lemmatize(w,'v') for w in moby_tokens]
len(set(moby_lem))

16887

In [33]:
lemmatizer.lemmatize('are','v') # we can improve the lemmatizer by providing the pos tag

'be'

In [34]:
lemmatizer.lemmatize('are')

'are'

In [35]:
print(lemmatizer.lemmatize("stripes", 'n')) 

stripe


In [36]:
# Lemmatize with POS Tag
from nltk.corpus import wordnet
def get_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.VERB)
moby_lem2 = [lemmatizer.lemmatize(w,get_pos(w)) for w in moby_tokens]
len(set(moby_lem2))

16753

**Word Frequency**

In [37]:
dist = nltk.FreqDist(moby_lem)
dist

FreqDist({',': 19204, 'the': 13715, '.': 7306, 'of': 6513, 'be': 6382, 'and': 6010, 'a': 4545, 'to': 4515, ';': 4173, 'in': 3908, ...})

In [38]:
dist.most_common(10)

[(',', 19204),
 ('the', 13715),
 ('.', 7306),
 ('of', 6513),
 ('be', 6382),
 ('and', 6010),
 ('a', 4545),
 ('to', 4515),
 (';', 4173),
 ('in', 3908)]

In [39]:
# Longest words
longest=max(dist, key=lambda s: (len(s), s))
print("word: ",longest, "\nlength: ",len(longest))

word:  twelve-o'clock-at-night 
length:  23


In [40]:
from string import punctuation
from nltk.corpus import stopwords

In [41]:
en = stopwords.words('english')

In [42]:
sorted([(v,k) for k,v in dist.items() if k.isalpha() and k not in en], reverse=True)[:10]

[(2113, 'I'),
 (1118, 'whale'),
 (880, 'one'),
 (703, 'But'),
 (621, 'say'),
 (609, 'The'),
 (563, 'like'),
 (548, 'ship'),
 (537, 'upon'),
 (498, 'Ahab')]

In [43]:
sorted([(v,k) for k,v in dist.items() if k.isalpha() and k.lower() not in en], reverse=True)[:10]

[(1118, 'whale'),
 (880, 'one'),
 (621, 'say'),
 (563, 'like'),
 (548, 'ship'),
 (537, 'upon'),
 (498, 'Ahab'),
 (486, 'man'),
 (468, 'go'),
 (459, 'seem')]
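The difference between the two filters above comes from case: NLTK's English stop word list is lowercase, so 'I', 'But', and 'The' pass the first filter. A small sketch with `collections.Counter` shows the same effect; the token list and the three-word stop word set are hypothetical stand-ins.

```python
from collections import Counter

# Hypothetical tokens and a tiny stand-in stop word set (NLTK's list is much longer).
tokens = ["The", "whale", "and", "the", "ship", "met", "the", "whale",
          "But", "Ahab", "saw", "the", "whale"]
stop_words = {"the", "and", "but"}

counts = Counter(tokens)

# Case-sensitive check: capitalized stop words such as 'The' and 'But' slip through.
case_sensitive = [(w, c) for w, c in counts.most_common()
                  if w.isalpha() and w not in stop_words]

# Lowercasing before the lookup removes them.
case_insensitive = [(w, c) for w, c in counts.most_common()
                    if w.isalpha() and w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```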

**POS tagging**

In [44]:
exp = "A wonderful little production. The filming technique is very unassuming very old time BBC fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece"
print(exp)

A wonderful little production. The filming technique is very unassuming very old time BBC fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece


In [45]:
# nltk.download('averaged_perceptron_tagger')
pos = nltk.word_tokenize(exp)
nltk.pos_tag(pos)

[('A', 'DT'),
 ('wonderful', 'JJ'),
 ('little', 'JJ'),
 ('production', 'NN'),
 ('.', '.'),
 ('The', 'DT'),
 ('filming', 'NN'),
 ('technique', 'NN'),
 ('is', 'VBZ'),
 ('very', 'RB'),
 ('unassuming', 'JJ'),
 ('very', 'RB'),
 ('old', 'JJ'),
 ('time', 'NN'),
 ('BBC', 'NNP'),
 ('fashion', 'NN'),
 ('and', 'CC'),
 ('gives', 'VBZ'),
 ('a', 'DT'),
 ('comforting', 'NN'),
 ('and', 'CC'),
 ('sometimes', 'RB'),
 ('discomforting', 'VBG'),
 ('sense', 'NN'),
 ('of', 'IN'),
 ('realism', 'NN'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('entire', 'JJ'),
 ('piece', 'NN')]

In [46]:
# nltk.download('tagsets')
nltk.help.upenn_tagset('NN')

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...


In [47]:
# most common parts of speech 
from collections import Counter
pos_list = nltk.pos_tag(moby_tokens)
Counter(w[1] for w in pos_list).most_common(5)

[('NN', 32729), ('IN', 28663), ('DT', 25879), (',', 19204), ('JJ', 17613)]

**Misspelled word correction**

In [48]:
from nltk.corpus import words
#nltk.download('words')
correct_spellings = words.words()

In [49]:
def jaccard_distance(input_list, ngram):
    recommender = []
    for word in input_list:
        spellings = [i for i in correct_spellings if i[0]==word[0]] #should start with the same character
        jdistance = [nltk.jaccard_distance(set(nltk.ngrams(word, ngram)),set(nltk.ngrams(i, ngram))) for i in spellings]
        recommender.append(spellings[np.argmin(jdistance)])
    return recommender
    
jaccard_distance(['cormulent', 'incendenece', 'validrate'], 3)

['corpulent', 'indecence', 'validate']
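To make the metric concrete, the Jaccard distance between two sets is 1 minus the size of their intersection divided by the size of their union. The sketch below computes it by hand on character trigrams; `char_ngrams` and `jaccard` are hypothetical helper names, not NLTK functions.

```python
def char_ngrams(word, n):
    """Set of character n-grams of a word (hypothetical helper)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:          # both sets empty, e.g. words shorter than n
        return 0.0
    return 1 - len(a & b) / len(union)

misspelled = char_ngrams("validrate", 3)   # 7 trigrams
candidate = char_ngrams("validate", 3)     # 6 trigrams
distance = jaccard(misspelled, candidate)  # 4 shared out of 9 distinct

print(sorted(misspelled & candidate))
print(round(distance, 4))
```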

In [50]:
def edit_distance(input_list):
    recommender = []
    for word in input_list:
        spellings = [i for i in correct_spellings if i[0]==word[0]]
        Edistance = [nltk.edit_distance(word,i,transpositions=True) for i in spellings]
        recommender.append(spellings[np.argmin(Edistance)])
    return recommender
    
edit_distance(['cormulent', 'incendenece', 'validrate'])

['corpulent', 'intendence', 'validate']
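For reference, edit distance counts the minimum number of single-character edits needed to turn one word into another. The sketch below is a plain Levenshtein implementation (insertions, deletions, substitutions). It is an illustration only: unlike the `transpositions=True` call above, it does not count a swap of adjacent characters as a single edit.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance using two rows of the dynamic-programming table."""
    prev = list(range(len(b) + 1))         # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

print(levenshtein("cormulent", "corpulent"))
print(levenshtein("validrate", "validate"))
```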