## Regex
`Link: https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html`
1. Special Character Classes in Regex pattern
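    - \s : matches any whitespace character (space, tab, newline)
    - \S : matches any character that is not whitespace
    - An uppercase class letter negates the lowercase class (\s vs \S, \d vs \D, \w vs \W)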
2. Python Module `import re`
    - In its functions, the first argument is the PATTERN and the STRING to process comes last
    - Important methods 
        - `re.search(pattern, string)`
        - `re.match(pattern, string)`
        - `re.findall(pattern, string)`
        - `re.sub(pattern, new_substring, string)`
        - `re.split(pattern, string)`
3. Character Range 
    - To define a new character class 
    - E.g: character class with small and capital alphabets, space, hyphen, dot
        - `regex: r"[a-zA-Z\-\. ]"`
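        - Inside `[...]` a hyphen denotes a range, so escape it (or place it first or last) to match a literal hyphen. A dot is already literal inside a character class, so escaping it is optional.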
4. Group 
    - To define explicit patterns
    - E.g 
        - r"(a-z)" : This matches the exact string "a-z"; the hyphen has no special meaning here, unlike inside a character range.
5. Or operator : | (pipe)
    - E.g 
        - r"(\s+|,)": This matches a run of whitespace or a comma (,)

6. How to write unicode in the regex

7. How to use new line character in regex '\n', can we use it in raw string r"\n"
8. Learn more about raw string.
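
The regex notes above can be tried out with a short sketch like the one below. It uses only the standard `re` module; the sample strings and variable names are made up for illustration. The last two parts touch on the open questions about newlines, raw strings, and Unicode, and are meant as a starting point rather than a full answer.

```python
import re

sentence = "Hello World, I am  Robo"

# \S+ grabs runs of non-whitespace; \s+ would grab the gaps between them.
words = re.findall(r"\S+", sentence)

# re.match only tries at the start of the string; re.search scans all of it.
at_start = re.match(r"World", sentence)    # None: the string starts with "Hello"
anywhere = re.search(r"World", sentence)   # match object covering "World"

# Character range: letters, hyphen, dot and space only.
name_pattern = r"[a-zA-Z\-. ]+"
name_ok = re.fullmatch(name_pattern, "Dr. Anna-Maria Smith") is not None
digits_ok = re.fullmatch(name_pattern, "Agent 007") is not None

# Group: (a-z) is the literal text "a-z", not a range.
literal = re.search(r"(a-z)", "the range a-z").group(1)
no_range = re.search(r"(a-z)", "abc")      # None: there is no literal "a-z"

# Alternation: replace a run of whitespace or a comma with an underscore.
joined = re.sub(r"(\s+|,)", "_", "a b,c")

# Raw strings: r"\n" is two characters (backslash, n), but the regex
# engine still interprets that pair as a newline.
raw_len, plain_len = len(r"\n"), len("\n")
lines_raw = re.split(r"\n", "first\nsecond")
lines_plain = re.split("\n", "first\nsecond")

# Unicode: for str patterns \w is Unicode-aware, and \uXXXX escapes
# are understood by the regex engine even inside a raw string.
unicode_text = "na\u00efve caf\u00e9"      # "naïve café"
unicode_words = re.findall(r"\w+", unicode_text)
has_e_acute = re.search(r"caf\u00e9", unicode_text) is not None
```

Raw strings are mainly useful because they stop Python from interpreting backslashes before the regex engine sees them, which matters for escapes such as `\b` that mean something different in a normal string.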

## Tokenization 
`Read about it in pdf`

In [1]:
from nltk.tokenize import word_tokenize

In [2]:
word_tokenize("Hello World! I am Robo.")

['Hello', 'World', '!', 'I', 'am', 'Robo', '.']

## Bag of Words 
`Refer pdf`
1. The more frequent a word is, the more relevant/central it is likely to be to the text.

In [3]:
from collections import Counter

In [4]:
counter = Counter(word_tokenize("The cat is in the box. The cat loves the box. The box is over the cat"))
counter

Counter({'The': 3,
         'cat': 3,
         'is': 2,
         'in': 1,
         'the': 3,
         'box': 3,
         '.': 2,
         'loves': 1,
         'over': 1})

In [5]:
counter.most_common(4)

[('The', 3), ('cat', 3), ('the', 3), ('box', 3)]

## Preprocessing 
`Refer pdf`
1. Lowercase
2. Tokenization
3. Stopwords Removal 
4. Stemming/Lemmatization
5. Remove punctuation and unwanted tokens

`Methods`
1. "sample_string".isalpha()
    - Returns True only if the string is non-empty and every character is a letter
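
A quick sketch of how `isalpha()` behaves on a few edge cases may help here. The token list is made up for illustration; only the built-in `str.isalpha` is used.

```python
# Hypothetical tokens, similar to what a tokenizer might produce.
tokens = ["cat", "Box", ".", "box2", "don't", ""]

# Map each token to whether it consists of letters only.
alpha_flags = {tok: tok.isalpha() for tok in tokens}

# Keep only the purely alphabetic tokens, as in the filtering step below.
alpha_tokens = [tok for tok in tokens if tok.isalpha()]
```

Note that tokens with digits or apostrophes are dropped by this filter, which may or may not be what a given task wants.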

In [6]:
from nltk.corpus import stopwords

In [7]:
text = "The cat is in the box. The cat loves the boxes. The box is over the cat"

In [8]:
lower_text = text.lower()

In [9]:
alpha_only = [w for w in word_tokenize(lower_text) if w.isalpha()]

In [10]:
print(alpha_only)

['the', 'cat', 'is', 'in', 'the', 'box', 'the', 'cat', 'loves', 'the', 'boxes', 'the', 'box', 'is', 'over', 'the', 'cat']


In [11]:
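# Build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
no_stops = [w for w in alpha_only if w not in stop_words]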
no_stops

['cat', 'box', 'cat', 'loves', 'boxes', 'box', 'cat']

In [12]:
from nltk.stem import WordNetLemmatizer

In [15]:
# Reuse a single lemmatizer instance for all tokens
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in no_stops]
lemmatized

['cat', 'box', 'cat', 'love', 'box', 'box', 'cat']

In [13]:
[ word_tokenize(w) for w in text.split('.')]

[['The', 'cat', 'is', 'in', 'the', 'box'],
 ['The', 'cat', 'loves', 'the', 'boxes'],
 ['The', 'box', 'is', 'over', 'the', 'cat']]

## Gensim 
1. Open-source NLP library that uses top models to perform complex tasks

In [14]:
# List of documents
mydocs = ["The cat is in the box.","The cat loves the boxes.","The box is over the cat"]

In [15]:
# lowercase, tokenize, remove stopwords and punctuation
stop_words = set(stopwords.words('english'))
articles = [[word for word in word_tokenize(doc.lower())
             if word.isalpha() and word not in stop_words]
            for doc in mydocs]
articles

[['cat', 'box'], ['cat', 'loves', 'boxes'], ['box', 'cat']]

In [16]:
mydocs

['The cat is in the box.',
 'The cat loves the boxes.',
 'The box is over the cat']

In [17]:
from gensim.corpora.dictionary import Dictionary

In [18]:
gDict = Dictionary(articles)

In [19]:
gDict.token2id

{'box': 0, 'cat': 1, 'boxes': 2, 'loves': 3}

In [20]:
gDict.doc2bow(["cat", "loves", "other", "cat", "babies"])

[(1, 2), (3, 1)]

In [21]:
# Create the corpus: one bag-of-words vector per article
mycorpus = [ gDict.doc2bow(article) for article in articles]
mycorpus

[[(0, 1), (1, 1)], [(1, 1), (2, 1), (3, 1)], [(0, 1), (1, 1)]]

## TF-IDF 
`Refer the pdf`
1. Code
    - `from gensim.models.tfidfmodel import TfidfModel`
    - `tfidf = TfidfModel(corpus)`
        - `corpus = [[(id1, freq1), (id2, freq2), (id3, freq3)], [(id4, freq4), (id5, freq5)]]`
    - `tfidf[doc]` : gives the TF-IDF weights for a document given in bag-of-words form
        - `[ (id1, weight1), (id2, weight2), (id3, weight3) ]`
2. TF: term frequency
    - tf(i,j) = No. of occurrences of word(i) in document(j)
        - If the documents have different lengths, normalize by document length:
        - tf(i,j) = (no. of occurrences of word(i) in document(j)) / (total no. of words in document(j))
3. IDF: Inverse Document Frequency
    - idf(i) = log(N/df(i))
        - N = Total no. of documents
        - df(i) = No. of documents in which word/token(i) is present
    - The TF-IDF weight of word(i) in document(j) is tf(i,j) * idf(i)
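
The formulas above can be worked through by hand with a minimal sketch like the following. It computes the length-normalized tf and log(N/df) idf exactly as written in these notes, then multiplies them, which is the usual way to combine the two. The function names are hypothetical, and library implementations such as gensim's `TfidfModel` may use a different log base or normalize the result, so their numbers need not match these.

```python
import math
from collections import Counter

# Tokenized documents, the same as the `articles` example in the Gensim section.
docs = [["cat", "box"], ["cat", "loves", "boxes"], ["box", "cat"]]

def term_frequency(doc):
    """tf = occurrences of each word / total number of words in the document."""
    counts = Counter(doc)
    return {word: count / len(doc) for word, count in counts.items()}

def inverse_document_frequency(all_docs):
    """idf = log(N / df), where df is the number of documents containing the word."""
    n_docs = len(all_docs)
    doc_freq = Counter(word for doc in all_docs for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

idf = inverse_document_frequency(docs)

# One {word: weight} dictionary per document.
tfidf_weights = [
    {word: tf * idf[word] for word, tf in term_frequency(doc).items()}
    for doc in docs
]
```

Because "cat" occurs in every document, its idf is log(3/3) = 0, so it gets zero weight everywhere, while a rarer word such as "loves" gets a positive weight.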

## NER: Named Entity Recognition 
1. Identifies important named entities in the text, such as people, places, and organizations.
2. Code:
    - `tokenize_sent = word_tokenize(sentence)`
    - `tagged_sent = nltk.pos_tag(tokenize_sent)`
    - `nltk.ne_chunk(tagged_sent)`
        - Chunks the tagged sentence into named-entity chunks
        - For this task it uses trained statistical and grammatical parsers, not a knowledge base like Wikipedia
        

## SpaCy
1. Open-source NLP library similar to gensim, but with a different implementation 
2. Another option for NLP tasks 

### Displacy 
1. Visualization tool built by the makers of spaCy. 
2. Used to visualize parse trees.

## Polyglot 
1. Open Source NLP library similar to gensim and Spacy 
2. It supports operations and has word vectors for a large number of languages. 


## CountVectorizer and TfidfVectorizer 
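1. Bag-of-words (BoW) and TF-IDF classes of the scikit-learn library 
2. Methods 
    - `.fit()`
        - Learns parameters from the data; for these vectorizers that is the vocabulary (and, for TfidfVectorizer, the IDF weights)
    - `.transform()`
        - Applies what was learned to convert documents into a document-term matrix
    - `.fit_transform()`
        - Fits on the data and transforms that same data in one step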
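
To make the fit/transform split concrete, here is a toy sketch in plain Python. `ToyCountVectorizer` is a hypothetical stand-in written for these notes, not scikit-learn's `CountVectorizer`; it splits on whitespace only and returns plain lists rather than a sparse matrix.

```python
from collections import Counter

class ToyCountVectorizer:
    """Hypothetical illustration of fit/transform; not scikit-learn's class."""

    def fit(self, documents):
        # Learn the vocabulary: one column index per distinct token.
        tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
        self.vocabulary_ = {tok: idx for idx, tok in enumerate(tokens)}
        return self

    def transform(self, documents):
        # Count only tokens seen during fit; unseen tokens are ignored.
        rows = []
        for doc in documents:
            counts = Counter(doc.lower().split())
            rows.append([counts[tok] for tok in self.vocabulary_])
        return rows

    def fit_transform(self, documents):
        # Convenience: learn the vocabulary and vectorize the same data.
        return self.fit(documents).transform(documents)

train_docs = ["the cat loves the box", "the box is over the cat"]
vectorizer = ToyCountVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)

# New data is only transformed, using the vocabulary learned from training data.
test_matrix = vectorizer.transform(["the dog loves the cat"])
```

The design point is that test data should go through `transform` only, so that both matrices share the same columns; "dog" never appeared during `fit`, so it has no column here.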
    

## Naive Bayes 
1. `from sklearn.naive_bayes import MultinomialNB`
2. Naive Bayes works well on NLP tasks 
    - It is based on probability. 
    - It answers questions like: given a particular piece of data, how likely is a particular outcome?
3. MultinomialNB
    - Works well when features have integer values, such as word counts 
    - Works well for multi-class classification
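
The idea behind a multinomial Naive Bayes classifier can be sketched in plain Python as below. This is a simplified illustration with add-one (Laplace) smoothing, not scikit-learn's `MultinomialNB` implementation; the function names, training documents, and labels are all made up for the example.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log likelihoods of tokens."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        # Smoothing adds alpha to every count so unseen pairs are not zero.
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unknown tokens are skipped."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][t] for t in doc if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy training set: tokenized documents and their labels.
train_docs = [["cat", "box"], ["cat", "loves", "box"], ["dog", "park"], ["dog", "loves", "park"]]
train_labels = ["cat_doc", "cat_doc", "dog_doc", "dog_doc"]

log_prior, log_likelihood = train_multinomial_nb(train_docs, train_labels)
prediction_a = predict(["cat", "box"], log_prior, log_likelihood)
prediction_b = predict(["dog", "loves", "zebra"], log_prior, log_likelihood)
```

Working in log space avoids multiplying many small probabilities together, which is why the scores are sums of logarithms.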

## Confusion Matrix 
1. By default, the columns (top) correspond to predicted labels and the rows (left side) to actual labels
2. `metrics.confusion_matrix(ytest, ypredict, labels=[0,1])`
    - Specifying the labels fixes the row and column order: the first row/column is for label 0 and the second for label 1
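
The layout described above can be checked with a small hand-rolled sketch. `toy_confusion_matrix` is a hypothetical helper written for these notes, following the same rows-are-actual, columns-are-predicted convention; the label lists are made up.

```python
def toy_confusion_matrix(y_true, y_pred, labels):
    """Rows follow the actual labels, columns the predicted labels, in `labels` order."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

# Hypothetical actual and predicted labels.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm_01 = toy_confusion_matrix(y_true, y_pred, labels=[0, 1])
# → [[1, 1], [1, 2]]

# Reversing the label order flips both the rows and the columns.
cm_10 = toy_confusion_matrix(y_true, y_pred, labels=[1, 0])
# → [[2, 1], [1, 1]]
```

With `labels=[0, 1]`, the entry in the first row and second column counts samples that are actually 0 but were predicted as 1.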