## Processing free-text

In [2]:
text1 = '"Ethics are built right into the ideals and objectives of the United Nationss" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women'

In [3]:
text2 = text1.split(' ')

In [4]:
text2

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nationss"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr',
 '@UN',
 '@UN_Women']
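
One caveat with `split(' ')`, sketched below on a made-up string (`spaced` is a hypothetical example, not taken from the tweet above): splitting on a single space keeps an empty string wherever two spaces are adjacent, while `split()` with no argument splits on any run of whitespace.

```python
# Hypothetical string with a double space and a tab, for illustration only.
spaced = 'free  text\tprocessing'

by_space = spaced.split(' ')    # splits on every single space character
by_whitespace = spaced.split()  # splits on runs of any whitespace

print(by_space)       # ['free', '', 'text\tprocessing']
print(by_whitespace)  # ['free', 'text', 'processing']
```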

## Finding specific words

* Hashtags


In [6]:
[w for w in text1.split(' ') if w.startswith('#')]

['#UNSG']

In [7]:
[w for w in text1.split(' ') if w.startswith('@')]

['@', '@UN', '@UN_Women']

## Finding patterns with regular expressions

* Callouts are more than just tokens beginning with '@'

  @UN_Spokesperson  @katyperry @coursera
  
  
* Match something after '@'

  - Letters
  - Digits
  - Special symbols like '_'
  
  The pattern looks like @[A-Za-z0-9_]+

In [9]:
[w for w in text1.split(' ') if w.startswith('@') ]

['@', '@UN', '@UN_Women']

## Import regular expressions first!

In [8]:
import re

In [10]:
[w for w in text1.split(' ') if re.search(r'@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']

## Parsing the callout regular expression

![image 1](./../images/1.png)

* starts with @
* followed by any letter (upper or lower case), digit, or underscore
* that repeats at least once, but any number of times
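
As a minimal sketch of the pattern just described, `re.findall` can pull the callouts straight out of the string without splitting it first. The variable names `tweet_text` and `callouts` are chosen here for illustration; the tweet is copied verbatim from the first cell so the snippet runs on its own.

```python
import re

# Same tweet as text1, repeated so this snippet is self-contained.
tweet_text = '"Ethics are built right into the ideals and objectives of the United Nationss" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women'

# '@' followed by one or more letters, digits, or underscores.
callouts = re.findall(r'@[A-Za-z0-9_]+', tweet_text)
print(callouts)  # ['@UN', '@UN_Women']
```

The lone '@' before "NY" is skipped because the `+` requires at least one character from the class after it.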


## Meta-characters: Character matches

* . : wildcard, matches any single character (except a newline, by default)
* ^ : start of a string
* $ : end of a string
* [] : matches one of the characters within []
* [a-z] : matches one character in the range a, b, c, ..., z
* [^abc] : matches a character that is not a, b or c
* a|b   : matches either a or b, where a and b are strings
* ()    : scoping for operators
* \     : escape character for special characters (\t, \n, \b)
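
The following sketch tries several of these meta-characters on a short made-up string (`sample` and the result names are hypothetical, chosen only for this illustration).

```python
import re

# Hypothetical sample string, for illustration only.
sample = 'cat bat rat.'

any_at = re.findall(r'.at', sample)        # any character, then 'at'
c_or_b = re.findall(r'[cb]at', sample)     # 'c' or 'b', then 'at'
not_c_or_b = re.findall(r'[^cb]at', sample)  # anything but 'c' or 'b', then 'at'
at_start = re.findall(r'^cat', sample)     # 'cat' only at the start of the string
at_end = re.findall(r'rat\.$', sample)     # 'rat' plus a literal dot at the end
either = re.findall(r'cat|rat', sample)    # either 'cat' or 'rat'

print(any_at)      # ['cat', 'bat', 'rat']
print(c_or_b)      # ['cat', 'bat']
print(not_c_or_b)  # ['rat']
print(at_start)    # ['cat']
print(at_end)      # ['rat.']
print(either)      # ['cat', 'rat']
```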

## Meta-characters: Character symbols

* \b : matches a word boundary
* \d : any digit, equivalent to [0-9]
* \D : any non-digit, equivalent to [^0-9]
* \s : any whitespace, equivalent to [ \t\n\r\f\v]
* \S : any non-whitespace, equivalent to [^ \t\n\r\f\v]
* \w : any alphanumeric character or underscore, equivalent to [a-zA-Z0-9_] for ASCII text
* \W : any character that is not alphanumeric or underscore, equivalent to [^a-zA-Z0-9_] for ASCII text
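
A small sketch of these character symbols in use. The input strings and variable names are made up for this illustration.

```python
import re

digits = re.findall(r'\d', 'Call 911 now')        # every single digit
words = re.split(r'\s', 'Call 911 now')           # split on each whitespace character
non_word = re.findall(r'\W', 'a_b-c!')            # '_' counts as a word character
whole_word = re.findall(r'\bcat\b', 'cat concat cats')  # 'cat' as a standalone word

print(digits)      # ['9', '1', '1']
print(words)       # ['Call', '911', 'now']
print(non_word)    # ['-', '!']
print(whole_word)  # ['cat']
```

The last line shows what `\b` buys: "concat" and "cats" both contain "cat", but neither has a word boundary on both sides of it.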


## Meta-characters: Repetitions

* \* : matches zero or more occurrences
* \+ : matches one or more occurrences
* ? : matches zero or one occurrence
* {n} : exactly n repetitions, n >= 0
* {n,} : at least n repetitions
* {,n} : at most n repetitions
* {m,n} : at least m and at most n repetitions
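
The repetition operators can be compared side by side on made-up inputs, as in the sketch below (all strings and names are hypothetical). It also shows one pitfall in Python's `re`: a space inside the braces turns the quantifier into literal text.

```python
import re

zero_or_more = re.findall(r'ab*', 'a ab abb')        # 'a' then zero or more 'b'
one_or_more = re.findall(r'ab+', 'a ab abb')         # 'a' then at least one 'b'
optional_u = re.findall(r'colou?r', 'color colour')  # 'u' is optional

runs = 'a aa aaa aaaa'
exactly_two = re.findall(r'\ba{2}\b', runs)     # exactly two 'a'
at_least_two = re.findall(r'\ba{2,}\b', runs)   # two or more 'a'
one_to_two = re.findall(r'\ba{1,2}\b', runs)    # one or two 'a'

# With a space inside the braces, '{1, 2}' is matched literally,
# so nothing in '12 345' matches.
with_space = re.findall(r'\d{1, 2}', '12 345')

print(zero_or_more)  # ['a', 'ab', 'abb']
print(one_or_more)   # ['ab', 'abb']
print(optional_u)    # ['color', 'colour']
print(exactly_two)   # ['aa']
print(at_least_two)  # ['aa', 'aaa', 'aaaa']
print(one_to_two)    # ['a', 'aa']
print(with_space)    # []
```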

In [13]:
[w for w in text1.split(' ') if re.search(r'@\w+', w)]

['@UN', '@UN_Women']

## Let's look at some more examples

* Finding specific characters

In [15]:
# finding all the vowels
text3 = 'ougadougou'
re.findall(r'[aeiouy]', text3)


['o', 'u', 'a', 'o', 'u', 'o', 'u']

## Case study: Regular expression for dates

* #### Date variations for 23<sup>rd</sup> October 2002

 23-10-2002 <br />
 23/10/2002 <br />
 23/10/02 <br />
 10/23/02 <br />
 23 Oct 2002 <br />
 23 October 2002 <br />
 Oct 23, 2002 <br />
 October 23, 2002 <br />
 Example pattern for the numeric forms:  **\d{2}[/-]\d{2}[/-]\d{2,4}**

In [19]:
datestr = "23-10-2002\n 23/10/2002\n23/10/02\n10/23/02\n23 Oct 2002\n23 October 2002\nOct 23, 2002\nOctober 23, 2002"

In [25]:
re.findall(r'\d{2}[/-]\d{2}[/-]\d{4}', datestr)

['23-10-2002', '23/10/2002']

In [26]:
re.findall(r'\d{2}[/-]\d{2}[/-]\d{2,4}', datestr)

['23-10-2002', '23/10/2002', '23/10/02', '10/23/02']

In [27]:
re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', datestr)

['23-10-2002', '23/10/2002', '23/10/02', '10/23/02']
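
The last two patterns return the same list because every numeric date in `datestr` uses two-digit days and months. The difference shows up on a made-up string with single-digit parts (`short_dates` is a hypothetical example, not from the notebook):

```python
import re

# Hypothetical input with a single-digit day and month.
short_dates = '3/1/02 and 23/10/2002'

two_digit_only = re.findall(r'\d{2}[/-]\d{2}[/-]\d{2,4}', short_dates)
one_or_two_digits = re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', short_dates)

print(two_digit_only)     # ['23/10/2002']
print(one_or_two_digits)  # ['3/1/02', '23/10/2002']
```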

In [33]:
re.findall(r'\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{2,4}', datestr)

['23 Oct 2002', '23 October 2002']
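
The `(?:...)` form matters here. When a pattern contains a capturing group, `re.findall` returns only what the group captured, not the whole match. A minimal sketch on a shortened copy of the date string (`sample_dates` and the shortened `months` alternation are assumptions made for brevity):

```python
import re

# Shortened inputs for illustration; the real month list has twelve entries.
sample_dates = '23 Oct 2002\n23 October 2002'
months = 'Sep|Oct|Nov'

# Capturing group: findall returns only the text matched inside ( ).
captured = re.findall(r'\d{1,2} (' + months + r')[a-z]* \d{2,4}', sample_dates)

# Non-capturing group: findall returns the full match.
full = re.findall(r'\d{1,2} (?:' + months + r')[a-z]* \d{2,4}', sample_dates)

print(captured)  # ['Oct', 'Oct']
print(full)      # ['23 Oct 2002', '23 October 2002']
```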

In [49]:
re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{4}', datestr)

['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']

In [16]:
# finding all the characters that are not vowels
re.findall(r'[^aeiouy]', text3)

['g', 'd', 'g']

## Exercise

Write code that extracts the hashtags from the following tweet:
    
tweet = "@nltk Text analysis is awesome! #regex #pandas #python"

In [5]:
tweet = "@nltk Text analysis is awesome! #regex #pandas #python"
hashtags = [w for w in tweet.split(' ') if w.startswith('#')]
hashtags

['#regex', '#pandas', '#python']
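
The same exercise can also be solved with a regular expression. The sketch below is one possible alternative, not the only answer; `messy_tweet` is a hypothetical variant added to show that the regex version also drops punctuation attached to a hashtag, which the `split` approach keeps.

```python
import re

tweet = "@nltk Text analysis is awesome! #regex #pandas #python"

# '#' followed by one or more word characters.
hashtags_re = re.findall(r'#\w+', tweet)
print(hashtags_re)  # ['#regex', '#pandas', '#python']

# Hypothetical tweet with punctuation stuck to the hashtags.
messy_tweet = 'Loving #regex, #pandas and #python!'
split_tags = [w for w in messy_tweet.split(' ') if w.startswith('#')]
regex_tags = re.findall(r'#\w+', messy_tweet)

print(split_tags)  # ['#regex,', '#pandas', '#python!']
print(regex_tags)  # ['#regex', '#pandas', '#python']
```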

## Take Home Concepts

* What are regular expressions?
* Regular expression meta-characters
* B