### Regular expressions:

- Pattern matching
- re module

In [None]:
import re

##### Basic Patterns:

- a,b,c,A,B,C,0,1,2 : Ordinary characters; they match only themselves
- '.' : matches any character except a newline
- '^' : start of string
- '$' : end of string
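
The basic patterns above can be tried out on short strings. This is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# '.' matches any single character except a newline.
dot_matches = re.findall(r'c.t', 'cat cot c\nt cut')
print(dot_matches)   # ['cat', 'cot', 'cut']

# '^' anchors the pattern to the start of the string.
starts_with_hello = re.search(r'^Hello', 'Hello world') is not None
starts_with_world = re.search(r'^world', 'Hello world') is not None
print(starts_with_hello, starts_with_world)   # True False

# '$' anchors the pattern to the end of the string.
ends_with_world = re.search(r'world$', 'Hello world') is not None
print(ends_with_world)   # True
```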

##### Predefined sets:

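- \w : word characters : [a-zA-Z0-9_]
- \s : whitespace characters : [ \t\n\r\f\v]
- \d : digit characters : [0-9]
- These are the ASCII definitions; with str patterns Python 3 also matches Unicode letters, digits and whitespace unless the re.ASCII flag is set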
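
A quick way to see what each predefined set picks up is to run all three against the same string. The sample string below is an assumption chosen to contain letters, digits, an underscore, and several kinds of whitespace.

```python
import re

sample = 'Room 42,\tfloor_7\n'

words = re.findall(r'\w+', sample)    # runs of word characters
digits = re.findall(r'\d+', sample)   # runs of digits
spaces = re.findall(r'\s', sample)    # individual whitespace characters

print(words)    # ['Room', '42', 'floor_7']
print(digits)   # ['42', '7']
print(spaces)   # [' ', '\t', '\n']
```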

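##### Repetition characters:

- '+' : 1 or more (greedy)
- '*' : 0 or more (greedy)
- '?' : 0 or 1 (greedy); placed after another quantifier ('+?', '*?', '??') it makes that quantifier non-greedy
- {n} : exactly n occurrences
- {l,m} : between l and m occurrences, inclusive (greedy)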
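
The difference between greedy and non-greedy repetition is easiest to see side by side. The following is a small sketch on made-up strings; only the standard re module is used.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: '.+' grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r'<.+>', html)
# Non-greedy: '.+?' stops at the first '>' it can.
lazy = re.findall(r'<.+?>', html)

# '?' on its own makes the preceding item optional.
optional = re.findall(r'colou?r', 'color colour')

# {n} and {l,m} count repetitions.
exact = re.findall(r'\d{4}', '2023-10-05')
ranged = re.findall(r'a{2,3}', 'a aa aaa aaaa')

print(greedy)    # ['<b>bold</b> and <i>italic</i>']
print(lazy)      # ['<b>', '</b>', '<i>', '</i>']
print(optional)  # ['color', 'colour']
print(exact)     # ['2023']
print(ranged)    # ['aa', 'aaa', 'aaa']
```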

##### Custom sets:

- [abc] : matches a or b or c
- You can use predefined sets inside custom sets
- [\w.-] : matches all \w characters, '.' and '-'
- Inside [], '.' is a literal '.', not a wildcard
- You can write ranges using '-'
- [a-zA-Z0-9_] == \w (for ASCII text)
- To include a literal '-', put it at the start or end of the set, or escape it as '\-'
- [a-z-] : matches lowercase letters and '-'
- '^' at the beginning inverts the custom set:
- [^abc] : matches every character except a, b, c
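
The custom sets listed above behave as follows on a sample string. The string and the variable names are assumptions picked to exercise ranges, a literal '-', and an inverted set.

```python
import re

sample = 'user-name_01@Mail.com!'

# Lowercase letters and a literal '-' (the '-' is last, so it is not a range).
lower_and_dash = re.findall(r'[a-z-]+', sample)

# A predefined set combined with literal '.' and '-'.
word_dot_dash = re.findall(r'[\w.-]+', sample)

# '^' at the start of the set inverts it.
not_lowercase = re.findall(r'[^a-z]', 'ab1-C')

print(lower_and_dash)  # ['user-name', 'ail', 'com']
print(word_dot_dash)   # ['user-name_01', 'Mail.com']
print(not_lowercase)   # ['1', '-', 'C']
```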

##### Grouping:

- () are used to define groups
- Groups split a match into parts that can be retrieved separately, e.g. with group(1), group(2)
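
As a sketch of how groups divide a match, the example below pulls the parts of a date apart. The date string and the key=value string are made up for illustration.

```python
import re

match = re.search(r'(\d{4})-(\d{2})-(\d{2})', 'Released on 2023-10-05.')

whole = match.group(0)   # the entire match
year = match.group(1)    # first pair of parentheses
parts = match.groups()   # all groups as a tuple

# With groups in the pattern, findall returns one tuple per match.
pairs = re.findall(r'(\w+)=(\d+)', 'x=1, y=22')

print(whole)  # 2023-10-05
print(year)   # 2023
print(parts)  # ('2023', '10', '05')
print(pairs)  # [('x', '1'), ('y', '22')]
```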

In [None]:
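# re.search returns only the first match (or None if nothing matches)
# raw strings (r'...') keep backslashes such as \w from being treated as string escapes
out = re.search(r'w\w+', 'Hello world')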

In [None]:
out.span()

(6, 11)

In [None]:
out.group()

'world'

In [None]:
out = re.search(r'H\w+\s?\w+', 'Hello World')
out.group()

'Hello World'

In [None]:
out = re.search(r'[a-zA-Z\s.?-]+', 'Hello ?.world-')
out.group()

'Hello ?.world-'

In [None]:
text = '''India is the largest democratic country. It is a big country divided into 29 states and 7 union territories. These states and union territories have been created so that the government can run the country more easily. India also has many different kinds of physical features in different parts of the country that are spread over its states and union territories. India is a very diverse country as well, which means that the people around the country are different in many ways. Even though India is such a diverse place, it is united as one country.'''

In [None]:
# re.findall returns every non-overlapping match as a list
out = re.findall(r'\w+', text)

Q. Write a regular expression to extract every email id from a given text, capturing the username and the domain name as separate groups.

In [None]:
re.findall(r'([\w.-]+)@([a-z.]+)', 'Email ids are as follows: abc.123@lpu.com, xyz_21@yahoo.co.in, pqr-22@gmail.com')

[('abc.123', 'lpu.com'), ('xyz_21', 'yahoo.co.in'), ('pqr-22', 'gmail.com')]

In [None]:
out = re.search(r'([\w.-]+)@([a-z.]+)', 'Email ids are as follows: abc.123@lpu.com,')

In [None]:
out.group(0)

'abc.123@lpu.com'

In [None]:
out.group(1)

'abc.123'

In [None]:
out.group(2)

'lpu.com'

In [None]:
# {2,4} is greedy: after 'b' it takes as many characters as allowed, up to 4
re.findall(r'b[\w.-]{2,4}', 'baaaaaa')

['baaaa']

In [None]:
# re.sub replaces every match of the pattern with the given string
re.sub(r'([\w.-]+)@([a-z.]+)', '<<EMAIL>>', 'Email ids are as follows: abc.123@lpu.com, xyz_21@yahoo.co.in, pqr-22@gmail.com')

'Email ids are as follows: <<EMAIL>>, <<EMAIL>>, <<EMAIL>>'
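
The groups in the email pattern can also be reused inside the replacement string. The sketch below hides only the username and keeps the domain by referring to the second group as \2; the masking format and variable names are assumptions for illustration.

```python
import re

EMAIL = r'([\w.-]+)@([a-z.]+)'
text_with_emails = 'Email ids are as follows: abc.123@lpu.com, xyz_21@yahoo.co.in'

# \2 in the replacement refers to the second group (the domain name).
masked = re.sub(EMAIL, r'***@\2', text_with_emails)

print(masked)  # Email ids are as follows: ***@lpu.com, ***@yahoo.co.in
```

Note that [a-z.]+ also accepts a trailing period, so an address at the end of a sentence would capture the full stop as part of the domain.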

### Tokenization

In [6]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')

[nltk_data] Error loading punkt: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>


False

In [None]:
text

'India is the largest democratic country. It is a big country divided into 29 states and 7 union territories. These states and union territories have been created so that the government can run the country more easily. India also has many different kinds of physical features in different parts of the country that are spread over its states and union territories. India is a very diverse country as well, which means that the people around the country are different in many ways. Even though India is such a diverse place, it is united as one country.'

In [None]:
words = word_tokenize(text)

In [None]:
def word_tokenize_custom(text):
    # keeps runs of word characters only; punctuation is dropped
    return re.findall(r'\w+', text)

In [None]:
words_ = word_tokenize_custom(text)
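
One difference from word_tokenize is that the \w+ pattern discards punctuation, whereas word_tokenize returns punctuation marks as tokens of their own. A rough regex approximation, assuming a hypothetical helper name and a made-up sentence, is to add an alternative that matches any single character that is neither a word character nor whitespace:

```python
import re

def word_tokenize_keep_punct(text):
    # a run of word characters, OR one non-word, non-space character
    return re.findall(r'\w+|[^\w\s]', text)

tokens = word_tokenize_keep_punct('India is diverse, yet united.')
print(tokens)  # ['India', 'is', 'diverse', ',', 'yet', 'united', '.']
```

This is only a sketch; it does not reproduce the other rules word_tokenize applies, such as its handling of contractions.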

In [None]:
sents = sent_tokenize(text)

In [None]:
sents

['India is the largest democratic country.',
 'It is a big country divided into 29 states and 7 union territories.',
 'These states and union territories have been created so that the government can run the country more easily.',
 'India also has many different kinds of physical features in different parts of the country that are spread over its states and union territories.',
 'India is a very diverse country as well, which means that the people around the country are different in many ways.',
 'Even though India is such a diverse place, it is united as one country.']

In [None]:
def sent_tokenize_custom(text):
    # splits on every '.', so the periods are dropped and leading spaces are kept
    return re.findall('[^.]+', text)

In [None]:
sent_tokenize_custom(text)

['India is the largest democratic country',
 ' It is a big country divided into 29 states and 7 union territories',
 ' These states and union territories have been created so that the government can run the country more easily',
 ' India also has many different kinds of physical features in different parts of the country that are spread over its states and union territories',
 ' India is a very diverse country as well, which means that the people around the country are different in many ways',
 ' Even though India is such a diverse place, it is united as one country']
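
The '[^.]+' pattern splits at every period, including the one inside a decimal number, and ignores '?' and '!'. One possible refinement, shown here as a sketch with a hypothetical function name and a made-up sample, is to split on whitespace that follows sentence-ending punctuation by using a lookbehind:

```python
import re

def split_sentences(text):
    # split at whitespace preceded by '.', '!' or '?'; the punctuation is kept
    return re.split(r'(?<=[.!?])\s+', text.strip())

sample = 'Pi is about 3.14. Is that exact? No!'

naive = re.findall(r'[^.]+', sample)
refined = split_sentences(sample)

print(naive)    # ['Pi is about 3', '14', ' Is that exact? No!']
print(refined)  # ['Pi is about 3.14.', 'Is that exact?', 'No!']
```

This still breaks on abbreviations such as 'Dr. Rao', which is one reason a trained tokenizer like sent_tokenize is usually preferred.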

### Stopwords

In [2]:
from nltk.corpus import stopwords
nltk.download('stopwords')
sw = stopwords.words('english')

[nltk_data] Error loading stopwords: <urlopen error [WinError 10065] A
[nltk_data]     socket operation was attempted to an unreachable host>


In [4]:
sw.append("also")

In [5]:
sw

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [None]:
# lowercase each token and drop the ones that appear in the stopword list
words_ = [word.lower() for word in words_ if word.lower() not in sw]
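
The same filtering step can be tried without downloading the NLTK stopword corpus. The sketch below uses a tiny hand-written stopword set (an assumption, not the NLTK list) and a set rather than a list, since membership tests on a set do not slow down as the collection grows.

```python
import re

# Hypothetical mini stopword set, for illustration only.
stop_set = {'is', 'the', 'a', 'it', 'and', 'also'}

sentence = 'India is the largest democratic country and it is also diverse'
tokens = re.findall(r'\w+', sentence)

filtered = [t.lower() for t in tokens if t.lower() not in stop_set]
print(filtered)  # ['india', 'largest', 'democratic', 'country', 'diverse']
```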

### Stemming / lemmatization:

In [None]:
from nltk.stem import WordNetLemmatizer, snowball
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
stemmer = snowball.SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

In [None]:
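%%time
# map each token to its stem; repeated tokens collapse into a single key
words_stemmed = {w: stemmer.stem(w) for w in words_}
print(len(words_))  # number of tokens, not the number of dictionary keys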

53
CPU times: user 1.11 ms, sys: 21 µs, total: 1.13 ms
Wall time: 1.1 ms


In [None]:
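%%time
# map each token to its lemma; repeated tokens collapse into a single key
words_lemma = {w: lemmatizer.lemmatize(w) for w in words_}
print(len(words_))  # number of tokens, not the number of dictionary keys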

53
CPU times: user 2.29 s, sys: 75.5 ms, total: 2.36 s
Wall time: 2.46 s


In [None]:
words_stemmed

{'india': 'india',
 'largest': 'largest',
 'democratic': 'democrat',
 'country': 'countri',
 'big': 'big',
 'divided': 'divid',
 '29': '29',
 'states': 'state',
 '7': '7',
 'union': 'union',
 'territories': 'territori',
 'created': 'creat',
 'government': 'govern',
 'run': 'run',
 'easily': 'easili',
 'also': 'also',
 'many': 'mani',
 'different': 'differ',
 'kinds': 'kind',
 'physical': 'physic',
 'features': 'featur',
 'parts': 'part',
 'spread': 'spread',
 'diverse': 'divers',
 'well': 'well',
 'means': 'mean',
 'people': 'peopl',
 'around': 'around',
 'ways': 'way',
 'even': 'even',
 'though': 'though',
 'place': 'place',
 'united': 'unit',
 'one': 'one'}

In [None]:
words_lemma

{'india': 'india',
 'largest': 'largest',
 'democratic': 'democratic',
 'country': 'country',
 'big': 'big',
 'divided': 'divided',
 '29': '29',
 'states': 'state',
 '7': '7',
 'union': 'union',
 'territories': 'territory',
 'created': 'created',
 'government': 'government',
 'run': 'run',
 'easily': 'easily',
 'also': 'also',
 'many': 'many',
 'different': 'different',
 'kinds': 'kind',
 'physical': 'physical',
 'features': 'feature',
 'parts': 'part',
 'spread': 'spread',
 'diverse': 'diverse',
 'well': 'well',
 'means': 'mean',
 'people': 'people',
 'around': 'around',
 'ways': 'way',
 'even': 'even',
 'though': 'though',
 'place': 'place',
 'united': 'united',
 'one': 'one'}
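
To see where the two approaches disagree, the dictionaries can be compared key by key. The sketch below hard-codes a few entries copied from the outputs above so that it runs without NLTK; the variable names are assumptions.

```python
# A few entries copied from words_stemmed and words_lemma above.
stemmed = {'democratic': 'democrat', 'country': 'countri', 'states': 'state',
           'territories': 'territori', 'united': 'unit'}
lemmas = {'democratic': 'democratic', 'country': 'country', 'states': 'state',
          'territories': 'territory', 'united': 'united'}

# Keep only the words for which the stem and the lemma differ.
differs = {w: (stemmed[w], lemmas[w]) for w in stemmed if stemmed[w] != lemmas[w]}

for word, (stem, lemma) in differs.items():
    print(f'{word}: stem={stem}, lemma={lemma}')
```

The stemmer strips suffixes by rule, so results such as 'countri' and 'territori' are not dictionary words, while the lemmatizer returns real words but, called without a part-of-speech argument, leaves forms like 'divided' and 'united' unchanged. Both dictionaries also hold fewer entries than the 53 tokens because repeated words collapse into one key.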