  # Regular expressions
  
  Often the basic tools from the previous section will not be enough for text processing.
  Regular expressions are extremely useful for handling complex text manipulation. We will
  demonstrate a few common cases when pre-processing data for analysis.
  
  A concise introduction to regular expressions in Python is available [here](https://docs.python.org/3/howto/regex.html).
  You can find a more detailed description [here](https://docs.python.org/3/library/re.html).


    As a first example, let us find all callout tags in the first tweet using plain string methods.

In [42]:
import re

with open('./data/tweets_hashtags_callouts.txt', 'r') as f:
    tweets =  f.readlines()
    
tweets = [tweet.rstrip() for tweet in tweets]
tweet = tweets[0]
tweet_words = tweet.split(' ')

callouts = [word for word in tweet_words if word.startswith('@')]
callouts


['@WhiteHouse', '@', '@EddieRispone!']

    Find only the words that start with @ followed by at least one letter or digit.

In [43]:
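# re.match anchors the pattern at the start of the word; re.search would also accept an "@" in the middle.
callouts1 = [word for word in tweet_words if re.match(r"@[A-Za-z0-9]+", word)]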
callouts1

['@WhiteHouse', '@EddieRispone!']
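    The filter above keeps whole words, so trailing punctuation such as `!` stays attached to the callout. As a minimal sketch, a capture group with `re.findall` could extract just the handles. The text of `sample_tweet` is made up for illustration and is not taken from the data file.

```python
import re

# Hypothetical tweet text, used only for illustration.
sample_tweet = "Thanks @WhiteHouse and @ everyone, vote for @EddieRispone!"

# \w+ matches letters, digits and underscores; the group drops the leading "@".
handles = re.findall(r"@(\w+)", sample_tweet)
print(handles)  # ['WhiteHouse', 'EddieRispone']
```

    Note that this simple pattern would also pick up the part after `@` in an e-mail address, so it may need tightening for real data.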

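    Find a digit in a string. Note that `endpos` is the index up to which the regex engine was allowed to search (here the length of the string), not the position where the match ends; `match_digit.end()` and `match_digit.span()` report the match position.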

In [44]:
match_digit = re.search("[0-9]+", "A string with a digit 0.")
match_digit.endpos


24
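    To make the difference between the search limit and the match position concrete, here is a small sketch that inspects the same match object with `group()`, `start()`, `end()` and `span()`. The variable names `text` and `m` are chosen for this example only.

```python
import re

text = "A string with a digit 0."
m = re.search(r"[0-9]+", text)

print(m.group())   # '0'   -> the matched text
print(m.start())   # 22    -> index where the match begins
print(m.end())     # 23    -> index just past the match
print(m.span())    # (22, 23)
print(m.endpos == len(text))  # True: endpos is the search limit, not the match end
```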

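    Using groups to capture each number together with the word that follows it. With more than one group in the pattern, `re.findall` returns a list of tuples.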

In [45]:
# A raw string keeps \s and \w from being treated as (invalid) string escapes.
match_groups = re.findall(r"([0-9]+)\s*(\w+)", "The fox had travelled 2 km after 60 minutes.")
match_groups

[('2', 'km'), ('60', 'minutes')]
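    The captured groups are always strings. When pre-processing data for analysis, one would typically convert them to numbers. A minimal sketch, assuming each unit appears only once in the sentence (the name `measurements` is introduced here for illustration):

```python
import re

sentence = "The fox had travelled 2 km after 60 minutes."
pairs = re.findall(r"([0-9]+)\s*(\w+)", sentence)

# Map each unit to its numeric value; a repeated unit would overwrite the earlier one.
measurements = {unit: int(value) for value, unit in pairs}
print(measurements)  # {'km': 2, 'minutes': 60}
```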

 ## Regular expressions for dates
 
 A regular expression that matches a date written as `YYYY-MM-DD` within a sentence, capturing year, month and day as separate groups.

In [50]:
date_str1 = "The course started on 2019-11-16 at approximately 10 am."
date_match = re.search(r'(\d{4})-(\d{2})-(\d{2})', date_str1)

"Found year: {}, month: {}, day: {}".format(*date_match.groups())
 

'Found year: 2019, month: 11, day: 16'
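    The pattern only checks the shape of the date, so a string such as `2019-13-45` would match as well. One possible way to validate the result is to build a `datetime.date` from the groups, as in this sketch (the names `date_str`, `m` and `course_start` are chosen for the example):

```python
import re
from datetime import date

date_str = "The course started on 2019-11-16 at approximately 10 am."
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", date_str)

if m is not None:
    year, month, day = (int(part) for part in m.groups())
    # date() raises ValueError for impossible dates such as month 13.
    course_start = date(year, month, day)
    print(course_start.isoformat())  # 2019-11-16
```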

    Using named groups. `re.search` returns only the first match, so the second date in the string is not reported.

In [51]:
date_str1 = "The course started on 2019-11-16 at approximately 10 am. The next course will take place at 12 am on 2019-11-17."
date_match_named = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', date_str1)

match_dict = date_match_named.groupdict()
"Found year: {year}, month: {month}, day: {day}".format(**match_dict)



'Found year: 2019, month: 11, day: 16'
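    Since the string contains two dates, it may be useful to collect all of them. A minimal sketch using `re.finditer` with the same named groups (the names `text`, `pattern` and `all_dates` are introduced for this example):

```python
import re

text = ("The course started on 2019-11-16 at approximately 10 am. "
        "The next course will take place at 12 am on 2019-11-17.")

# Compile once so the pattern can be reused across many strings.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# finditer yields one match object per non-overlapping match, left to right.
all_dates = [m.groupdict() for m in pattern.finditer(text)]
for d in all_dates:
    print("Found year: {year}, month: {month}, day: {day}".format(**d))
```

    Each entry of `all_dates` is a dictionary keyed by group name, so the result can be passed on to later processing steps without relying on group positions.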