<a href="https://colab.research.google.com/github/bhaskarfx/nlp/blob/main/NLP_FDP_Pre_Processing_and_Feature_Extraction_from_Text_Documents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Processing Raw Text**

*by [Dr. Bhaskar Mondal](https://sites.google.com/view/bmondal/bhaskarmondal?authuser=3)*

YouTube Link: https://www.youtube.com/watch?v=sQb63jW4wz4

In [None]:
import nltk

In [None]:
text="""You're flying on cloud no 9 :  the sky is PinkishBlue < . You shouldn't eat cardboard.
Hello Mr. Bob, how are you doing today? The weather is greater, and city is awesome z ."""

**Split by Whitespace**

In [None]:
words=text.split()
print(words)

["You're", 'flying', 'on', 'cloud', 'no', '9', ':', 'the', 'sky', 'is', 'PinkishBlue', '<', '.', 'You', "shouldn't", 'eat', 'cardboard.', 'Hello', 'Mr.', 'Bob,', 'how', 'are', 'you', 'doing', 'today?', 'The', 'weather', 'is', 'greater,', 'and', 'city', 'is', 'awesome', 'z', '.']


Split on Non-Word Characters

In [None]:
import re
words1=re.split(r'\W+', text)
print(words1[:10])

['You', 're', 'flying', 'on', 'cloud', 'no', '9', 'the', 'sky', 'is']
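
The two approaches above tokenize differently: `split()` keeps punctuation attached to words, while `re.split(r'\W+', ...)` breaks contractions apart and can leave an empty string at the end. The sketch below compares them on a shortened sample sentence; `sample` and the `by_*` names are illustrative and not part of the notebook.

```python
import re

# Shortened sample sentence (illustrative, not the notebook's full text)
sample = "You're flying on cloud no 9 : the sky is PinkishBlue."

by_space = sample.split()               # punctuation stays attached
by_nonword = re.split(r"\W+", sample)   # splits on runs of non-word characters
by_findall = re.findall(r"\w+", sample) # collects runs of word characters

print(by_space[:2])         # the contraction stays in one piece
print(by_nonword[:3])       # the apostrophe splits the contraction
print(repr(by_nonword[-1])) # trailing empty string, because the text ends with "."
print(by_findall[-1])       # findall never yields empty strings
```

If empty tokens are unwanted, `re.findall(r"\w+", ...)` is one way to avoid filtering them out afterwards.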


**Number of Words and characters**

In [None]:
word_count=len(text.split()) # split on any whitespace so double spaces and newlines are handled
word_count

35

In [None]:
char_count=len(text)
char_count

174

## **Use Regular Expressions for cleaning text**



Split Attached words

In [None]:
words2=" ".join(re.findall("[A-Z][^A-Z]*", text)) # [A-Z] matches one uppercase letter; [^A-Z]* matches zero or more non-uppercase characters
print(len(words2))

181
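
The `findall` pattern above inserts a space before every capital letter, including capitals that already start a word, which is why the length grows from 174 to 181. A possible alternative, sketched here under the assumption that only lowercase-to-uppercase boundaries should be split, is the hypothetical helper `split_attached`:

```python
import re

def split_attached(s):
    # Insert a space only where a lowercase letter is directly followed
    # by an uppercase letter, e.g. "PinkishBlue" -> "Pinkish Blue".
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", s)

print(split_attached("the sky is PinkishBlue"))
print(split_attached("Hello Mr. Bob"))  # unchanged: no attached words
```

This leaves text without attached words untouched, at the cost of also splitting legitimate camel-case names.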


**Removing Apostrophe**

Using Dictionary

In [None]:
appos={"'s": " is", "'re" : " are", "n't": " not"}

In [None]:
word_list=text.split()

new_sent=[]

for word in word_list:
  for key in appos:
    word=word.replace(key, appos[key])
  new_sent.append(word)

new_text=" ".join(new_sent)
print(new_text)

You are flying on cloud no 9 : the sky is PinkishBlue < . You should not eat cardboard. Hello Mr. Bob, how are you doing today? The weather is greater, and city is awesome z .


Using Regular Expressions

In [None]:
def decontraction(phrase):
  phrase=re.sub(r"\'re", " are", phrase)
  phrase=re.sub(r"n\'t", " not", phrase)

  return phrase

new_text2=decontraction(text)
print(new_text2)

You are flying on cloud no 9 :  the sky is PinkishBlue < . You should not eat cardboard.
Hello Mr. Bob, how are you doing today? The weather is greater, and city is awesome z .


Using Python Library

In [None]:
pip install contractions


In [None]:
import contractions

In [None]:
expan_words=[]
for word in words:
  #text.split()
  expan_words.append(contractions.fix(word))

new_text=" ".join(expan_words)
print(new_text)

you are flying on cloud no 9 : the sky is pinkishblue < . you should not eat cardboard. hello mr. bob, how are you doing today? the weather is greater, and city is awesome z .


**Average Word Length**

In [None]:
def avg_word_len(sen):
  words=sen.split(" ")
  total_len=sum(len(word) for word in words)
  no_of_words=len(words)
  return (total_len/no_of_words)

avg_len=avg_word_len(text)
print(avg_len)

4.0
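
Splitting on a single space, as `avg_word_len` does, counts an empty string for every doubled space, and those empty "words" lower the average. A minimal sketch of a whitespace-robust variant follows; the name `avg_word_len_ws` and the tiny sample string are assumptions for illustration.

```python
def avg_word_len_ws(sentence):
    # split() with no argument collapses runs of whitespace,
    # so no empty tokens are produced.
    words = sentence.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

sample = "a  bb"  # note the double space
print(sample.split(" "))        # ['a', '', 'bb'] -> three tokens, one empty
print(avg_word_len_ws(sample))  # 1.5
```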


Remove Single Chars

In [None]:
def remove_single_char(sen):
  new_text=""
  words=sen.split()
  for word in words:
    if len(word)>1:
      new_text=new_text + " " + word
  return new_text

new_text=remove_single_char(text)
print(new_text)

 You're flying on cloud no the sky is PinkishBlue You shouldn't eat cardboard. Hello Mr. Bob, how are you doing today? The weather is greater, and city is awesome


In [None]:
new_text=" ".join([w for w in text.split() if len(w)>1])
new_text

"You're flying on cloud no the sky is PinkishBlue You shouldn't eat cardboard. Hello Mr. Bob, how are you doing today? The weather is greater, and city is awesome"

**Normalizing Case:** **convert to lower case** 

In [None]:
#Sky, SKY, skY
def convert_to_lower(words):
  words=[word.lower() for word in words]
  return words

in_lower=convert_to_lower(words)
print(in_lower)

["you're", 'flying', 'on', 'cloud', 'no', '9', ':', 'the', 'sky', 'is', 'pinkishblue', '<', '.', 'you', "shouldn't", 'eat', 'cardboard.', 'hello', 'mr.', 'bob,', 'how', 'are', 'you', 'doing', 'today?', 'the', 'weather', 'is', 'greater,', 'and', 'city', 'is', 'awesome', 'z', '.']


**Removing Stop Words**

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
stop_eng=stopwords.words('english')
print(stop_eng)
print(len(stop_eng))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Count no. of stopwords

In [None]:
def count_stop_words(sen):
  words=sen.split()
  my_stop=[]
  for w in words:
    if w in stop_eng:
      my_stop.append(w)
  return len(my_stop)

total_no_stops=count_stop_words(text)
print(total_no_stops)

12


In [None]:
stop_removed=[w for w in convert_to_lower(new_text2.split()) if w not in stop_eng]
print(stop_removed)

['flying', 'cloud', '9', ':', 'sky', 'pinkishblue', '<', '.', 'eat', 'cardboard.', 'hello', 'mr.', 'bob,', 'today?', 'weather', 'greater,', 'city', 'awesome', 'z', '.']


In [None]:
def remove_stopwords(lower_tokens):
    filtered_words=[]
    for word in lower_tokens:
      if word not in stop_eng:
        filtered_words.append(word)
    return filtered_words

print(remove_stopwords(words))

['flying', 'cloud', '9', ':', 'sky', 'pinkishblue', '<', '.', 'eat', 'cardboard.', 'hello', 'mr.', 'bob,', 'today?', 'weather', 'greater,', 'city', 'awesome', 'z', '.']


**Removing Punctuation**

In [None]:
import string

In [None]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [None]:
def remove_punc(words):
  table=str.maketrans('', '', string.punctuation)
  stripped=[w.translate(table) for w in words]
  return stripped


words=convert_to_lower(new_text2.split())
punc_removed=remove_punc(words)
print(punc_removed)

['you', 'are', 'flying', 'on', 'cloud', 'no', '9', '', 'the', 'sky', 'is', 'pinkishblue', '', '', 'you', 'should', 'not', 'eat', 'cardboard', 'hello', 'mr', 'bob', 'how', 'are', 'you', 'doing', 'today', 'the', 'weather', 'is', 'greater', 'and', 'city', 'is', 'awesome', 'z', '']
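
Tokens that consist only of punctuation become empty strings in the output above. One way to drop them, sketched with a hypothetical helper `remove_punc_and_empty` and a short illustrative token list:

```python
import string

def remove_punc_and_empty(words):
    # Same translation table as remove_punc, followed by a filter
    # that discards tokens left empty after stripping punctuation.
    table = str.maketrans("", "", string.punctuation)
    stripped = (w.translate(table) for w in words)
    return [w for w in stripped if w]

sample_tokens = ["cloud", "no", "9", ":", "bob,", "today?", "."]
print(remove_punc_and_empty(sample_tokens))
# ['cloud', 'no', '9', 'bob', 'today']
```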


**Number of numerics**

In [None]:
def count_of_num(sen):
  words=convert_to_lower(sen.split())
  count=0
  for w in words:
    if w.isdigit():
      count+=1
  return count

no_of_num=count_of_num(new_text2)
print(no_of_num)

1


Remove numerics 

In [None]:
def remove_nums(words):
  new_string = ''.join((x for x in words if not x.isdigit()))
  return new_string

print(remove_nums(text))

You're flying on cloud no  :  the sky is PinkishBlue < . You shouldn't eat cardboard.
Hello Mr. Bob, how are you doing today? The weather is greater, and city is awesome z .


## **Tokenization**

**Split into Sentences**

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk import sent_tokenize
sentences = sent_tokenize(text)
print(sentences)

["You're flying on cloud no 9 :  the sky is PinkishBlue < .", "You shouldn't eat cardboard.", 'Hello Mr. Bob, how are you doing today?', 'The weather is greater, and city is awesome z .']


split into words

In [None]:
def tokenization(sentence):
  from nltk.tokenize import word_tokenize
  return (word_tokenize(sentence))

tokens=tokenization(new_text)
print(tokens)

['you', 'are', 'flying', 'on', 'cloud', 'no', '9', ':', 'the', 'sky', 'is', 'pinkishblue', '<', '.', 'you', 'should', 'not', 'eat', 'cardboard', '.', 'hello', 'mr.', 'bob', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'the', 'weather', 'is', 'greater', ',', 'and', 'city', 'is', 'awesome', 'z', '.']


In [None]:
from textblob import TextBlob
TextBlob(new_text).words

WordList(['you', 'are', 'flying', 'on', 'cloud', 'no', '9', 'the', 'sky', 'is', 'pinkishblue', 'you', 'should', 'not', 'eat', 'cardboard', 'hello', 'mr', 'bob', 'how', 'are', 'you', 'doing', 'today', 'the', 'weather', 'is', 'greater', 'and', 'city', 'is', 'awesome', 'z'])

**Spelling correction**

In [None]:
def spell_correct(sentence):
  from textblob import TextBlob
  sentence = TextBlob(sentence).correct()
  return sentence

corrected=spell_correct(new_text)
corrected

TextBlob("you are flying on cloud no 9 : the sky is pinkishblue < . you should not eat cardboard. hello mr. bob, how are you doing today? the weather is greater, and city is awesome z .")

## **Stemming and Lemmatization**


The key difference is that *a stem may not be an actual word*, whereas *a lemma is always an actual word of the language*.

Stemming applies a fixed sequence of rule-based steps to each word, which makes it fast.

Lemmatization also consults a corpus (such as WordNet) to look up the lemma, which makes it slower than stemming. You may also have to supply the part of speech to get the correct lemma.
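
To make the contrast concrete, here is a deliberately simplified sketch. `toy_stem` strips suffixes by rule and `toy_lemmatize` looks words up in a tiny hand-written table; both, along with `TOY_LEMMAS`, are hypothetical illustrations and are not the Porter algorithm or the WordNet lemmatizer.

```python
def toy_stem(word):
    # Rule-based: strip the first matching suffix, no dictionary involved.
    # The result need not be a real word.
    for suffix in ("ies", "ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Lookup-based: a tiny hand-written table keyed by (word, part of speech).
TOY_LEMMAS = {
    ("flying", "v"): "fly",
    ("cities", "n"): "city",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos):
    # Falls back to the word itself when the (word, pos) pair is unknown.
    return TOY_LEMMAS.get((word, pos), word)

print(toy_stem("trouble"), toy_stem("cities"))  # troubl cit
print(toy_lemmatize("cities", "n"))             # city
print(toy_lemmatize("better", "a"))             # good
print(toy_lemmatize("better", "v"))             # better (wrong POS, no match)
```

The last two calls show why the part of speech matters: the same surface form maps to different lemmas depending on the tag supplied.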

In [None]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

porter = PorterStemmer()
lancaster=LancasterStemmer()

#provide a word to be stemmed
print("Porter Stemmer")
print(porter.stem("cats"))
print(porter.stem("trouble"))

print("Lancaster Stemmer") 
print(lancaster.stem("cats"))

Porter Stemmer
cat
troubl
Lancaster Stemmer
cat


**Stem Words**

In [None]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in words]
print(stemmed)

['you', 'are', 'fli', 'on', 'cloud', 'no', '9', ':', 'the', 'sky', 'is', 'pinkishblu', '<', '.', 'you', 'should', 'not', 'eat', 'cardboard.', 'hello', 'mr.', 'bob,', 'how', 'are', 'you', 'do', 'today?', 'the', 'weather', 'is', 'greater,', 'and', 'citi', 'is', 'awesom', 'z', '.']


**Lemmatization** 

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
from textblob import Word

def lemmatization(sentence):
  words = sentence.split()
  lemmas=[]
  for w in words:
      lemmas.append(Word(w).lemmatize())
  return lemmas

lemmas=lemmatization(text)
print(lemmas)

["You're", 'flying', 'on', 'cloud', 'no', '9', ':', 'the', 'sky', 'is', 'PinkishBlue', '<', '.', 'You', "shouldn't", 'eat', 'cardboard.', 'Hello', 'Mr.', 'Bob,', 'how', 'are', 'you', 'doing', 'today?', 'The', 'weather', 'is', 'greater,', 'and', 'city', 'is', 'awesome', 'z', '.']


**Part of speech tagging (POS)**

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
from textblob import TextBlob
result = TextBlob(new_text)
print(result.tags)

[('you', 'PRP'), ('are', 'VBP'), ('flying', 'VBG'), ('on', 'IN'), ('cloud', 'NN'), ('no', 'DT'), ('9', 'CD'), ('the', 'DT'), ('sky', 'NN'), ('is', 'VBZ'), ('pinkishblue', 'JJ'), ('<', 'NN'), ('you', 'PRP'), ('should', 'MD'), ('not', 'RB'), ('eat', 'VB'), ('cardboard', 'NN'), ('hello', 'NN'), ('mr.', 'NN'), ('bob', 'NN'), ('how', 'WRB'), ('are', 'VBP'), ('you', 'PRP'), ('doing', 'VBG'), ('today', 'NN'), ('the', 'DT'), ('weather', 'NN'), ('is', 'VBZ'), ('greater', 'JJR'), ('and', 'CC'), ('city', 'NN'), ('is', 'VBZ'), ('awesome', 'JJ'), ('z', 'NN')]


**Chunking using NLTK:**
Chunking is a natural language process that identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and links them to higher order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.)

In [None]:
tokens = nltk.word_tokenize(new_text)
print(tokens)

tag = nltk.pos_tag(tokens)
print(tag)

grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tag)
print(result)
# result.draw()

['you', 'are', 'flying', 'on', 'cloud', 'no', '9', ':', 'the', 'sky', 'is', 'pinkishblue', '<', '.', 'you', 'should', 'not', 'eat', 'cardboard', '.', 'hello', 'mr.', 'bob', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'the', 'weather', 'is', 'greater', ',', 'and', 'city', 'is', 'awesome', 'z', '.']
[('you', 'PRP'), ('are', 'VBP'), ('flying', 'VBG'), ('on', 'IN'), ('cloud', 'NN'), ('no', 'DT'), ('9', 'CD'), (':', ':'), ('the', 'DT'), ('sky', 'NN'), ('is', 'VBZ'), ('pinkishblue', 'JJ'), ('<', 'NNP'), ('.', '.'), ('you', 'PRP'), ('should', 'MD'), ('not', 'RB'), ('eat', 'VB'), ('cardboard', 'NN'), ('.', '.'), ('hello', 'VB'), ('mr.', 'JJ'), ('bob', 'NN'), (',', ','), ('how', 'WRB'), ('are', 'VBP'), ('you', 'PRP'), ('doing', 'VBG'), ('today', 'NN'), ('?', '.'), ('the', 'DT'), ('weather', 'NN'), ('is', 'VBZ'), ('greater', 'JJR'), (',', ','), ('and', 'CC'), ('city', 'NN'), ('is', 'VBZ'), ('awesome', 'JJ'), ('z', 'NN'), ('.', '.')]
(S
  you/PRP
  are/VBP
  flying/VBG
  on/IN
  (NP cloud/NN
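
The grammar `NP: {<DT>?<JJ>*<NN>}` reads as: an optional determiner, any number of adjectives, then one noun. The pure-Python sketch below mimics that pattern on a list of `(word, tag)` pairs to show what gets grouped. The function `chunk_noun_phrases` is a hypothetical illustration that matches tags exactly; it is not how `nltk.RegexpParser` is implemented.

```python
def chunk_noun_phrases(tagged):
    """Greedy left-to-right match of: optional DT, any number of JJ, one NN."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                 # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":    # zero or more adjectives
            j += 1
        if j < n and tagged[j][1] == "NN":       # mandatory noun closes the chunk
            chunks.append([word for word, _ in tagged[i : j + 1]])
            i = j + 1
        else:
            i += 1                               # no chunk starts here, move on
    return chunks

tagged = [("the", "DT"), ("sky", "NN"), ("is", "VBZ"), ("pinkishblue", "JJ"),
          ("you", "PRP"), ("eat", "VB"), ("cardboard", "NN")]
print(chunk_noun_phrases(tagged))
# [['the', 'sky'], ['cardboard']]
```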

## **Advanced Text Processing: N-grams**

In [None]:
TextBlob(new_text).ngrams(3)

[WordList(['you', 'are', 'flying']),
 WordList(['are', 'flying', 'on']),
 WordList(['flying', 'on', 'cloud']),
 WordList(['on', 'cloud', 'no']),
 WordList(['cloud', 'no', '9']),
 WordList(['no', '9', 'the']),
 WordList(['9', 'the', 'sky']),
 WordList(['the', 'sky', 'is']),
 WordList(['sky', 'is', 'pinkishblue']),
 WordList(['is', 'pinkishblue', 'you']),
 WordList(['pinkishblue', 'you', 'should']),
 WordList(['you', 'should', 'not']),
 WordList(['should', 'not', 'eat']),
 WordList(['not', 'eat', 'cardboard']),
 WordList(['eat', 'cardboard', 'hello']),
 WordList(['cardboard', 'hello', 'mr']),
 WordList(['hello', 'mr', 'bob']),
 WordList(['mr', 'bob', 'how']),
 WordList(['bob', 'how', 'are']),
 WordList(['how', 'are', 'you']),
 WordList(['are', 'you', 'doing']),
 WordList(['you', 'doing', 'today']),
 WordList(['doing', 'today', 'the']),
 WordList(['today', 'the', 'weather']),
 WordList(['the', 'weather', 'is']),
 WordList(['weather', 'is', 'greater']),
 WordList(['is', 'greater', 'and']),
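
An n-gram is a sliding window of `n` consecutive tokens. As a minimal sketch of the idea without TextBlob, the hypothetical helper `ngrams` below builds the windows with list slicing:

```python
def ngrams(tokens, n):
    # One window per starting position; yields nothing if n > len(tokens).
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

tokens = "you are flying on cloud".split()
print(ngrams(tokens, 3))
# [('you', 'are', 'flying'), ('are', 'flying', 'on'), ('flying', 'on', 'cloud')]
print(ngrams(tokens, 2)[:2])
# [('you', 'are'), ('are', 'flying')]
```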


## **Bag-of-Words (BoW) and TF-IDF for Creating Features from Text**
Let's consider three reviews:


*   Review 1: This movie is very scary and long
*   Review 2: This movie is not scary and is slow
*   Review 3: This movie is spooky and good

The vocabulary consists of these 11 words:
\begin{array}{c|c|c|c|c|c|c|c|c|c|c|c|c}\hline
&\text{This}&\text{movie}&\text{is}&\text{very}&\text{scary}&\text{and}&\text{long}&\text{not}&\text{slow}&\text{spooky}&\text{good}&\text{Review Length}\\\hline
\text{Review 1}&1&1&1&1&1&1&1&0&0&0&0&7\\\hline
\text{Review 2}&1&1&2&0&1&1&0&1&1&0&0&8\\\hline
\text{Review 3}&1&1&1&0&0&1&0&0&0&1&1&6\\\hline
\end{array}
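
A minimal sketch of how such count vectors can be computed with the standard library. The helper `bow_vector` is an illustrative name; the vocabulary order follows the column order of the table.

```python
from collections import Counter

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]
vocab = ["This", "movie", "is", "very", "scary", "and",
         "long", "not", "slow", "spooky", "good"]

def bow_vector(review, vocab):
    # Count each whitespace-separated token, then read the counts off
    # in vocabulary order (Counter returns 0 for missing words).
    counts = Counter(review.split())
    return [counts[w] for w in vocab]

vectors = [bow_vector(r, vocab) for r in reviews]
for vec in vectors:
    print(vec, "length:", sum(vec))
```

Each row sums to the length of its review (7, 8 and 6), which is a quick sanity check on the counts.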

## **Term Frequency-Inverse Document Frequency**
---
*   **Term Frequency (TF):** The number of times a term $t$ occurs in document $d$, normalized by the total number of terms in $d$.

>> $tf(t,d) = \frac{\textit{count of t in d} }{ \textit{number of terms in d}}$

>>for example, in Review 2 (which has 8 terms)
* TF(‘movie’) = 1/8
* TF(‘is’) = 2/8 = 1/4

*   **Document Frequency (DF):** The number of documents in the corpus (of $N$ documents) that contain the term $t$.

>>$df(t) =$ number of documents containing $t$



* **Inverse Document Frequency (IDF):** The logarithm of the ratio of the total number of documents $N$ to the document frequency $df(t)$.

>>$idf(t) = \log(\frac{N}{ df(t)})$

>>for example (using base-10 logarithms, with $N = 3$)
* IDF(‘movie’) = log(3/3) = 0
* IDF(‘is’) = log(3/3) = 0
* IDF(‘not’) = log(3/1) = log(3) = 0.48

* **Computation of tf-idf weight:** 
>> tf-idf$(t, d) = tf(t, d) \times idf(t)$

>> for example 
* TF-IDF(‘this’, Review 2) = TF(‘this’, Review 2) * IDF(‘this’) = 1/8 * 0 = 0
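
The worked examples above can be checked with a short sketch that follows the three formulas literally. It assumes base-10 logarithms and lowercased, whitespace-split tokens; the function names `tf`, `df`, `idf` and `tf_idf` are chosen here for illustration.

```python
import math

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]
docs = [r.lower().split() for r in reviews]
N = len(docs)

def tf(term, doc):
    # count of t in d / number of terms in d
    return doc.count(term) / len(doc)

def df(term):
    # number of documents that contain the term
    return sum(1 for doc in docs if term in doc)

def idf(term):
    # log10(N / df); raises ZeroDivisionError for a term seen in no document
    return math.log10(N / df(term))

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

review2 = docs[1]
print(tf("movie", review2))      # 0.125
print(tf("is", review2))         # 0.25
print(round(idf("not"), 2))      # 0.48
print(tf_idf("this", review2))   # 0.0
```

A term that occurs in every document gets an IDF of 0, so its TF-IDF weight is 0 no matter how often it appears.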



In [None]:
r1=" This movie is very scary and long"
r2="This movie is not scary and is slow"
r3="This movie is spooky and good"
corpus=[r1, r2, r3]
corpus

[' This movie is very scary and long',
 'This movie is not scary and is slow',
 'This movie is spooky and good']

In [None]:
def preprocess(data):
  preprocessed=[]
  
  for d in data:
    tokens=tokenization(d)
    tokens = convert_to_lower(tokens)
    tokens = remove_stopwords(tokens)
    tokens = remove_punc(tokens)
    preprocessed.append(tokens)
  return preprocessed
preprocessed=preprocess(corpus)
print(preprocessed)

[['movie', 'scary', 'long'], ['movie', 'scary', 'slow'], ['movie', 'spooky', 'good']]


In [None]:
#create a vocabulary list
set().union(*preprocessed)

{'good', 'long', 'movie', 'scary', 'slow', 'spooky'}

In [None]:
#create a vocabulary list
def vocab_list(ListofList):
  list_new=[]
  cnt=0
  while cnt<len(ListofList):
    for i in ListofList[cnt]:
        if i in list_new:
            continue
        else:
            list_new.append(i)
    cnt+=1
  return list_new
vocab=vocab_list(preprocessed)
print(vocab)

['movie', 'scary', 'long', 'slow', 'spooky', 'good']


In [None]:
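#Term frequency (relative frequency of each word over the whole corpus, not per document)
def tf(corpus):
    dic={}
    # count occurrences of every word across all documents
    for document in corpus:
        for word in document.split():
            dic[word] = dic.get(word, 0) + 1
    # total number of words in the corpus, computed once
    total_words=sum(len(document.split()) for document in corpus)
    for word,freq in dic.items():
        print(word,freq)
        dic[word]=freq/total_words
    return dic
tf=tf(corpus) # note: this rebinds the name tf from the function to its result
print(tf)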

This 3
movie 3
is 4
very 1
scary 2
and 3
long 1
not 1
slow 1
spooky 1
good 1
{'This': 0.14285714285714285, 'movie': 0.14285714285714285, 'is': 0.19047619047619047, 'very': 0.047619047619047616, 'scary': 0.09523809523809523, 'and': 0.14285714285714285, 'long': 0.047619047619047616, 'not': 0.047619047619047616, 'slow': 0.047619047619047616, 'spooky': 0.047619047619047616, 'good': 0.047619047619047616}


In [None]:
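#Inverse Document Frequency (smoothed variant: ln((1+N)/(1+df)) + 1)
import math
import numpy as np
def IDF(corpus, vocab):
  idf_dict={}
  N=len(corpus)
  for i in vocab:
    count=0 # number of documents that contain the term
    for sen in corpus:
      if i in sen.split():
        count=count+1
    # assign once per term, after all documents have been checked
    idf_dict[i]=(math.log((1+N)/(count+1)))+1
  return idf_dict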

In [None]:
idf=IDF(corpus, vocab)
print(idf)

{'movie': 1.0, 'scary': 1.2876820724517808, 'long': 1.6931471805599454, 'slow': 1.6931471805599454, 'spooky': 1.6931471805599454, 'good': 1.6931471805599454}
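
The `IDF` function above computes `math.log((1+N)/(count+1))+1`, a smoothed natural-log variant, rather than the plain $\log(N/df(t))$ from the definition. That is why 'movie', which occurs in all three documents, gets 1.0 instead of 0. A small sketch reproducing the printed values, where `smooth_idf` is an illustrative name:

```python
import math

def smooth_idf(doc_freq, n_docs):
    # Adding 1 to numerator and denominator avoids division by zero;
    # the trailing +1 keeps terms found in every document from being zeroed out.
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 3
for term, doc_freq in [("movie", 3), ("scary", 2), ("long", 1)]:
    print(term, smooth_idf(doc_freq, n_docs))
```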


In [None]:
#Term Frequency – Inverse Document Frequency (TF-IDF)
# words missing from idf (stop words are not in vocab) get a weight of 0

res = {key: tf[key] * idf.get(key, 0) 
                       for key in tf.keys()}

res

{'This': 0.0,
 'and': 0.0,
 'good': 0.08062605621714025,
 'is': 0.0,
 'long': 0.08062605621714025,
 'movie': 0.14285714285714285,
 'not': 0.0,
 'scary': 0.12263638785255054,
 'slow': 0.08062605621714025,
 'spooky': 0.08062605621714025,
 'very': 0.0}

## **Feature extraction: Bag of Words**

**Implementing Bag of Words Algorithm with Python**

In [None]:
def vocab_list(ListofList):
  list_new=[]
  cnt=0
  while cnt<len(ListofList):
    for i in ListofList[cnt]:
        if i in list_new:
            continue
        else:
            list_new.append(i)
    cnt+=1
  return list_new
vocab_l=vocab_list(preprocessed)
print(vocab_l)

['movie', 'scary', 'long', 'slow', 'spooky', 'good']


In [None]:
def vectorize(tokens):
    vocab=vocab_list(tokens)


    vectors=[]
    for t in tokens:
      print(t)
      vector=[]
      for w in vocab:
        vector.append(t.count(w))
      # print(vector)
      vectors.append(vector)
    return vectors

In [None]:
vectors=vectorize(preprocessed)
print(vectors)

['movie', 'scary', 'long']
['movie', 'scary', 'slow']
['movie', 'spooky', 'good']
[[1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0], [1, 0, 0, 0, 1, 1]]


**Create a Bag of Words Model with Sklearn**

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

CountVec = CountVectorizer(ngram_range=(1,1), # to use bigrams ngram_range=(2,2)
                           stop_words='english')
#transform
Count_data = CountVec.fit_transform(corpus)
 
#create dataframe
cv_dataframe=pd.DataFrame(Count_data.toarray(),columns=CountVec.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2
print(cv_dataframe)

   good  long  movie  scary  slow  spooky
0     0     1      1      1     0       0
1     0     0      1      1     1       0
2     1     0      1      0     0       1




**N-Grams?**

**Term Frequency (TF) and inverse document frequency(IDF)**

Feature Extraction with Tf-Idf vectorizer

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#without smooth IDF
print("Without Smoothing:")
#define tf-idf
tf_idf_vec = TfidfVectorizer(use_idf=True, 
                        smooth_idf=False,  
                        ngram_range=(1,1),stop_words='english') # to use only  bigrams ngram_range=(2,2)
#transform
tf_idf_data = tf_idf_vec.fit_transform(corpus)
 
#create dataframe
tf_idf_dataframe=pd.DataFrame(tf_idf_data.toarray(),columns=tf_idf_vec.get_feature_names_out())
print(tf_idf_dataframe)
print("\n")
 
#with smooth
tf_idf_vec_smooth = TfidfVectorizer(use_idf=True,  
                        smooth_idf=True,  
                        ngram_range=(1,1),stop_words='english')
 
 
tf_idf_data_smooth = tf_idf_vec_smooth.fit_transform(corpus)
 
print("With Smoothing:")
tf_idf_dataframe_smooth=pd.DataFrame(tf_idf_data_smooth.toarray(),columns=tf_idf_vec_smooth.get_feature_names_out())
print(tf_idf_dataframe_smooth)

Without Smoothing:
       good      long     movie     scary      slow    spooky
0  0.000000  0.772536  0.368117  0.517376  0.000000  0.000000
1  0.000000  0.000000  0.368117  0.517376  0.772536  0.000000
2  0.670092  0.000000  0.319302  0.000000  0.000000  0.670092


With Smoothing:
       good      long     movie     scary      slow    spooky
0  0.000000  0.720333  0.425441  0.547832  0.000000  0.000000
1  0.000000  0.000000  0.425441  0.547832  0.720333  0.000000
2  0.652491  0.000000  0.385372  0.000000  0.000000  0.652491
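
The numbers in both tables can be reproduced by hand. With its default settings, `TfidfVectorizer` uses raw counts as term frequency, a natural-log IDF with 1 added, and scales each row to unit Euclidean (L2) length. The sketch below applies those steps to the token lists that remain after stop-word removal; the helper `tfidf_l2` is an illustrative name, not a scikit-learn function.

```python
import math

# Tokens left after English stop-word removal (as in the preprocessed corpus)
docs = [["movie", "scary", "long"],
        ["movie", "scary", "slow"],
        ["movie", "spooky", "good"]]

def tfidf_l2(docs, smooth):
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    if smooth:
        idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    else:
        idf = {t: math.log(n / df[t]) + 1 for t in vocab}
    rows = []
    for d in docs:
        raw = [d.count(t) * idf[t] for t in vocab]   # raw count times idf
        norm = math.sqrt(sum(x * x for x in raw))    # L2 norm of the row
        rows.append([x / norm for x in raw])
    return vocab, rows

vocab, plain = tfidf_l2(docs, smooth=False)
_, smoothed = tfidf_l2(docs, smooth=True)
print(vocab)
print([round(x, 6) for x in plain[0]])
print([round(x, 6) for x in smoothed[0]])
```

Because of the L2 scaling, values are comparable across documents of different lengths, but they no longer equal the plain product $tf \times idf$.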




**Decoding the text**

**Named entity recognition**
Named-entity recognition (NER) aims to find named entities in text and classify them into pre-defined categories (names of persons, locations, organizations, times, etc.).

**Resources:**
*  http://www.nltk.org/index.html
*  http://textblob.readthedocs.io/en/dev/
*  https://spacy.io/usage/facts-figures
*  https://radimrehurek.com/gensim/index.html
*  https://opennlp.apache.org/
*  https://www.clips.uantwerpen.be/pages/pattern
*  https://nlp.stanford.edu/software/tokenizer.html#About
*  https://tartarus.org/martin/PorterStemmer/
*  http://www.nltk.org/api/nltk.stem.html
*  https://pypi.python.org/pypi/PyStemmer/1.0.1
*  http://ucrel.lancs.ac.uk/claws/
*  http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
*  https://en.wikipedia.org/wiki/Shallow_parsing
*  https://www.ibm.com/support/knowledgecenter/en/SS8NLW_10.0.0/com.ibm.watson.
*  http://www.bart-coref.org/index.html
*  https://www.cs.utah.edu/nlp/reconcile/
*  https://cogcomp.org/page/software_view/Coref
*  https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/#relations
* PoS Tagging https://www.guru99.com/pos-tagging-chunking-nltk.html