<a href="https://colab.research.google.com/github/anshupandey/natural_language_processing/blob/master/Text_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regular Expressions

In [0]:
import re

In [0]:
data = "my mobile number is 9898989898 and what is your number"

In [0]:
# extracting information
re.findall("[0-9]{10}",data)

['9898989898']

In [0]:
re.sub("[0-9]{10}","**********",data) # hiding tokens / masking tokens

'my mobile number is ********** and what is your number'

In [0]:
re.sub("[0-9]{10}","",data) # removing tokens having specific pattern

'my mobile number is  and what is your number'
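
The pattern `[0-9]{10}` also matches the first ten digits of any longer digit run. The following is a minimal sketch, assuming that only standalone 10-digit numbers should count: it adds word boundaries (`\b`) and precompiles the pattern. The name `PHONE` and the sample text are illustrative and not part of the original notebook.

```python
import re

# Assumption: a mobile number is exactly 10 digits and not part of a longer run.
PHONE = re.compile(r"\b[0-9]{10}\b")

text = "call 9898989898 or use order id 123456789012"

found = PHONE.findall(text)          # standalone 10-digit numbers only
masked = PHONE.sub("*" * 10, text)   # replace each match with ten asterisks

print(found)
print(masked)
```

The 12-digit order id is left alone because no word boundary falls inside a run of digits.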

In [0]:
data = ''' hi my birthday is 30-02-1998 and your birthday is 12-5-1997 and his birthday is 5-4-1992 and his friend's birthday is 04-05-1993 
and his sister's birthday is 04/05/1996  
'''

In [0]:
pattern = "[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}|[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}"  # no spaces around |, otherwise they become part of the match
re.findall(pattern,data)

['30-02-1998', '12-5-1997', '5-4-1992', '04-05-1993', '04/05/1996']
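
A regular expression only checks the shape of a date, so `30-02-1998` is extracted even though 30 February does not exist. One possible follow-up, sketched here under the assumption that the dates are written day-month-year, is to pass each match through `datetime.strptime`. The helper name `valid_dates` and the sample text are illustrative.

```python
import re
from datetime import datetime

# Digits separated by "-" or "/". Note that this also accepts mixed separators.
DATE = re.compile(r"[0-9]{1,2}[-/][0-9]{1,2}[-/][0-9]{4}")

def valid_dates(text):
    """Return the matched strings that are real calendar dates (day-month-year assumed)."""
    result = []
    for raw in DATE.findall(text):
        try:
            datetime.strptime(raw.replace("/", "-"), "%d-%m-%Y")
        except ValueError:
            continue  # impossible date such as 30 February
        result.append(raw)
    return result

sample = "born 30-02-1998, moved 12-5-1997, joined 04/05/1996"
print(valid_dates(sample))
```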

# Text Preprocessing with NLTK

In [0]:
# NLTK = Natural Language Toolkit

- NLTK - basic text preprocessing
- spaCy - basic and advanced text-based applications, preprocessing, NER
- gensim - topic modelling, using pretrained word2vec models
- TensorFlow, PyTorch - deep learning frameworks



We will look at:
- Tokenization
- Morphological Analysis
- PoS Tagging

In [0]:
import nltk

In [0]:
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("tagsets")
nltk.download("averaged_perceptron_tagger")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

# Tokenization

- the process of splitting text into a collection of sentences or words

In [0]:
data = """ John is coming to meet his brother in Delhi from Noida. John has a car. Mr. Mark is enjoying reading novel. Freddy is enjoying playing Pubg game.
today its a good weather. Can you drop me email? my email id is anshu@anshu.com. Thank You! please take care.
"""

In [0]:
data.split('.')

[' John is coming to meet his brother in Delhi from Noida',
 ' John has a car',
 ' Mr',
 ' Mark is enjoying reading novel',
 ' Freddy is enjoying playing Pubg game',
 '\ntoday its a good weather',
 ' Can you drop me email? my email id is anshu@anshu',
 'com',
 ' Thank You! please take care',
 '\n']

In [0]:
nltk.sent_tokenize(data)

[' John is coming to meet his brother in Delhi from Noida.',
 'John has a car.',
 'Mr. Mark is enjoying reading novel.',
 'Freddy is enjoying playing Pubg game.',
 'today its a good weather.',
 'Can you drop me email?',
 'my email id is anshu@anshu.com.',
 'Thank You!',
 'please take care.']
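
The difference between `data.split('.')` and `nltk.sent_tokenize` comes down to knowing when a period does not end a sentence. The sketch below illustrates that idea with a hand-written abbreviation list. It is a rough approximation for illustration, not the Punkt algorithm that NLTK uses, and `ABBREVIATIONS` and `split_sentences` are hypothetical names.

```python
# Assumption: a tiny, hand-picked abbreviation list. A real tokenizer needs far more.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr."}

def split_sentences(text):
    """Split on whitespace-delimited tokens that end in . ! or ?, except known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

sample = "John has a car. Mr. Mark is enjoying reading novel. my email id is anshu@anshu.com. Thank You!"
print(split_sentences(sample))
```

Because the check looks only at the last character of each token, the period inside `anshu@anshu.com.` does not cause a split, while the one at its end does.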

In [0]:
nltk.word_tokenize(data)

['John',
 'is',
 'coming',
 'to',
 'meet',
 'his',
 'brother',
 'in',
 'Delhi',
 'from',
 'Noida',
 '.',
 'John',
 'has',
 'a',
 'car',
 '.',
 'Mr.',
 'Mark',
 'is',
 'enjoying',
 'reading',
 'novel',
 '.',
 'Freddy',
 'is',
 'enjoying',
 'playing',
 'Pubg',
 'game',
 '.',
 'today',
 'its',
 'a',
 'good',
 'weather',
 '.',
 'Can',
 'you',
 'drop',
 'me',
 'email',
 '?',
 'my',
 'email',
 'id',
 'is',
 'anshu',
 '@',
 'anshu.com',
 '.',
 'Thank',
 'You',
 '!',
 'please',
 'take',
 'care',
 '.']
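
For comparison, a word tokenizer can be approximated with a single regular expression that returns runs of word characters and individual punctuation marks. This is only a sketch: unlike `nltk.word_tokenize` above, it breaks `Mr.` into two tokens and does not keep `anshu.com` together. The name `simple_word_tokenize` is illustrative.

```python
import re

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks, in order."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("Can you drop me email? Thank You!"))
print(simple_word_tokenize("Mr. Mark"))
```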

# Morphological Analysis

- converting a word to its root form

cars - car

children - child

wives - wife

- **Stemming**        - faster, less accurate
- **Lemmatization**   - slightly slower, more accurate



In [0]:
#stemming
from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem("cars")

'car'

In [0]:
ps.stem("boxes")

'box'

In [0]:
ps.stem("wives")

'wive'

In [0]:
ps.stem("children")

'children'

In [0]:
# lemmatizers
from nltk.stem import WordNetLemmatizer
wd = WordNetLemmatizer()

wd.lemmatize("wives")

'wife'

In [0]:
wd.lemmatize("children")

'child'

In [0]:
wd.lemmatize("happier",'a') #a = adjective

'happy'

In [0]:
wd.lemmatize("went",'v') # v = verb

'go'
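
The results above show the trade-off: a stemmer applies suffix rules without consulting a dictionary, while a lemmatizer looks words up. The toy sketch below mimics both ideas. `toy_stem` is a handful of made-up plural rules, not the Porter algorithm, and `LEMMAS` is a hypothetical three-entry dictionary standing in for WordNet.

```python
def toy_stem(word):
    """Strip common plural suffixes by rule; no dictionary is consulted."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # e.g. cities -> city
    for suffix in ("sses", "xes", "ches", "shes"):
        if word.endswith(suffix):
            return word[:-2]            # e.g. boxes -> box
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # e.g. cars -> car
    return word

# Hypothetical mini-dictionary of irregular forms.
LEMMAS = {"wives": "wife", "children": "child", "went": "go"}

def toy_lemmatize(word):
    """Look the word up first; fall back to the rule-based stemmer."""
    return LEMMAS.get(word, toy_stem(word))

for w in ["cars", "boxes", "wives", "children"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rules alone turn `wives` into `wive` and leave `children` unchanged, which matches the kind of error seen with the stemmer above. The lookup gets both right but costs a dictionary.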

# PoS Tagging

In [0]:
nltk.pos_tag(nltk.word_tokenize("Donald Trump is happy today and he will announce something big."))

[('Donald', 'NNP'),
 ('Trump', 'NNP'),
 ('is', 'VBZ'),
 ('happy', 'JJ'),
 ('today', 'NN'),
 ('and', 'CC'),
 ('he', 'PRP'),
 ('will', 'MD'),
 ('announce', 'VB'),
 ('something', 'NN'),
 ('big', 'JJ'),
 ('.', '.')]
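
Tagged output is a plain list of `(word, tag)` tuples, so it can be filtered with ordinary Python. The sketch below copies the tagged tokens from the output above and picks words by tag prefix. The helper `words_with_tag` is an illustrative name.

```python
# Tagged tokens copied from the output above.
tagged = [("Donald", "NNP"), ("Trump", "NNP"), ("is", "VBZ"), ("happy", "JJ"),
          ("today", "NN"), ("and", "CC"), ("he", "PRP"), ("will", "MD"),
          ("announce", "VB"), ("something", "NN"), ("big", "JJ"), (".", ".")]

def words_with_tag(pairs, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in pairs if tag.startswith(prefix)]

proper_nouns = words_with_tag(tagged, "NNP")
adjectives = words_with_tag(tagged, "JJ")
verbs = words_with_tag(tagged, "VB")   # the prefix also matches VBZ

print(proper_nouns, adjectives, verbs)
```

Prefix matching is a design choice: `"VB"` collects every verb form, but `"NN"` would also pick up the proper nouns tagged `NNP`.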

In [0]:
nltk.help.upenn_tagset("JJ")

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...


# NER - Named Entity Recognition

In [0]:
import spacy

In [0]:
nlp = spacy.load("en_core_web_sm")

In [0]:
doc = nlp("Rohan is working for Microsoft in India and USA region from 10-05-2018 and has met Kelly a year back.")
from spacy import displacy

displacy.render(doc,style='ent',jupyter=True)

# Spelling Correction with Jaccard Distance

In [0]:
nltk.jaccard_distance(set("orange"),set("orenge"))

0.16666666666666666

In [0]:
nltk.jaccard_distance(set("Anshu"),set("orenge"))

0.8888888888888888
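
The Jaccard distance between two sets A and B is (|A ∪ B| - |A ∩ B|) / |A ∪ B|, that is, the share of the combined elements that the two sets do not have in common. The sketch below recomputes the two values above by hand. `char_jaccard_distance` is an illustrative name, and returning 0.0 for two empty inputs is an assumption made here.

```python
def char_jaccard_distance(a, b):
    """Jaccard distance between the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:          # both strings empty; assumed to be identical
        return 0.0
    return (len(union) - len(set_a & set_b)) / len(union)

# "orange" and "orenge" share 5 of 6 distinct characters -> 1/6
print(char_jaccard_distance("orange", "orenge"))
# "Anshu" and "orenge" share only "n" out of 9 distinct characters -> 8/9
print(char_jaccard_distance("Anshu", "orenge"))
```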

In [0]:
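dd = ["orange","mango","apple","banana"]  # small vocabulary of known words

def correct(word):
  # return the vocabulary word whose character set is closest to `word`
  score = 1   # smallest distance seen so far; 1 means no characters shared
  ans = ""    # stays empty if no vocabulary word shares a character
  for w in dd:
    dist = nltk.jaccard_distance(set(w),set(word))
    if dist<score:
      ans = w
      score = dist
  return ans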

In [0]:
correct("applo")

'apple'
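
Character sets ignore both order and repetition, so for example `set("banana")` and `set("ban")` are identical and their distance is 0. One possible refinement, sketched below, compares sets of character bigrams instead and returns `None` when nothing is close enough. The names `bigrams`, `jaccard`, and `correct_bigram`, and the cutoff `max_distance=0.8`, are assumptions for illustration and not part of the original notebook.

```python
def bigrams(word):
    """Set of adjacent character pairs, e.g. 'apple' -> {'ap', 'pp', 'pl', 'le'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def jaccard(a, b):
    """Jaccard distance between two sets (0.0 if both are empty)."""
    union = a | b
    return (len(union) - len(a & b)) / len(union) if union else 0.0

def correct_bigram(word, vocabulary, max_distance=0.8):
    """Return the closest vocabulary word, or None if none is within max_distance."""
    best, best_dist = None, max_distance
    for candidate in vocabulary:
        dist = jaccard(bigrams(candidate), bigrams(word))
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best

vocabulary = ["orange", "mango", "apple", "banana"]
print(correct_bigram("applo", vocabulary))
print(correct_bigram("zzz", vocabulary))
```

With bigrams, `applo` shares three of five distinct pairs with `apple` and none with the other words, so the choice is unambiguous.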