# **NLP** 
#### *Learning by doing*

### Tokenization

In [1]:
import nltk

##### Downloading packages

In [2]:
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_esp.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipp

True

In [3]:
paragraph = """Data Science has many various and at times conflicting definitions. For me, it is the art of using a combination of tools with data at the heart of them to create insights and solve problems.
Most data science solutions are either part of a web application or an API. These web based solutions need to be hosted somewhere. You can build the best object recognition algorithm in the world, but if it is sitting on your computer, it is like the book you bought which remains sitting on the bookcase. No one is getting any value out of either situation.
This doesn’t have to be super complicated either. As we saw in the previous section, there are several types of cloud providers covering a range of use cases — from the super simple to the extremely complicated.
You can start off by using Heroku to deploy your app. This article walks through how to deploy a flask application to Heroku for free using the Heroku CLI.
Simpler solutions continue to be created. Streamlit which has grown in popularity for enabling folks to build data apps quickly, has introduced Streamlit Sharing, a platform help you deploy, manage and share your Streamlit apps."""
sentences = nltk.sent_tokenize(paragraph) # Tokenizing sentences


In [4]:
words = nltk.word_tokenize(paragraph) # Tokenizing words

## **Stemming and Lemmatization Begins** 

## *Stemming*

In [5]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [6]:
# Create the Porter stemmer and build the stop-word set once
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
sentences_stemming = nltk.sent_tokenize(paragraph)
# Stem every word in each sentence, dropping stop words (the check is case-sensitive)
for i in range(len(sentences_stemming)):
  words = nltk.word_tokenize(sentences_stemming[i])
  words = [stemmer.stem(word) for word in words if word not in stop_words]
  sentences_stemming[i] = ' '.join(words)

In [7]:

sentences_stemming

['data scienc mani variou time conflict definit .',
 'for , art use combin tool data heart creat insight solv problem .',
 'most data scienc solut either part web applic api .',
 'these web base solut need host somewher .',
 'you build best object recognit algorithm world , sit comput , like book bought remain sit bookcas .',
 'No one get valu either situat .',
 'thi ’ super complic either .',
 'As saw previou section , sever type cloud provid cover rang use case — super simpl extrem complic .',
 'you start use heroku deploy app .',
 'thi articl walk deploy flask applic heroku free use heroku cli .',
 'simpler solut continu creat .',
 'streamlit grown popular enabl folk build data app quickli , introduc streamlit share , platform help deploy , manag share streamlit app .']

#### A disadvantage of stemming:
*  It chops off suffixes by rule, so the result is often not a real word
*  e.g. scienc, variou, mani
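
To see the effect in isolation, here is a minimal sketch that stems four words taken from the paragraph. It uses only `PorterStemmer`, which needs no downloaded corpus; the word list is picked by hand for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the first sentence of the paragraph
samples = ["science", "various", "many", "definitions"]

# Map each word to its stem; the stemmer also lowercases its input
stems = {word: stemmer.stem(word) for word in samples}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The stems match the tokens in the output above (`scienc`, `variou`, `mani`, `definit`): useful for grouping word forms together, but not readable as words.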

## *Lemmatization*

In [8]:
from nltk.stem import WordNetLemmatizer

In [9]:
# Create the WordNet lemmatizer and build the stop-word set once
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
sentences_lemmitizer = nltk.sent_tokenize(paragraph)
# Lemmatize every word in each sentence, dropping stop words (the check is case-sensitive)
for i in range(len(sentences_lemmitizer)):
  words = nltk.word_tokenize(sentences_lemmitizer[i])
  words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
  sentences_lemmitizer[i] = ' '.join(words)

In [10]:
sentences_lemmitizer

['Data Science many various time conflicting definition .',
 'For , art using combination tool data heart create insight solve problem .',
 'Most data science solution either part web application API .',
 'These web based solution need hosted somewhere .',
 'You build best object recognition algorithm world , sitting computer , like book bought remains sitting bookcase .',
 'No one getting value either situation .',
 'This ’ super complicated either .',
 'As saw previous section , several type cloud provider covering range use case — super simple extremely complicated .',
 'You start using Heroku deploy app .',
 'This article walk deploy flask application Heroku free using Heroku CLI .',
 'Simpler solution continue created .',
 'Streamlit grown popularity enabling folk build data apps quickly , introduced Streamlit Sharing , platform help deploy , manage share Streamlit apps .']
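
Both outputs still contain capitalized stop words such as `For`, `Most` and `You`. The likely reason is that the membership test is case-sensitive while NLTK's English stop-word list is lowercase. The sketch below shows the difference; `stop_words` here is a tiny hand-written stand-in (an assumption for illustration), not the real NLTK list.

```python
# Hypothetical stand-in for set(stopwords.words('english')); the real list is much longer
stop_words = {"for", "me", "it", "is", "the", "of", "most", "are", "you", "can"}

# Tokens resembling the start of the second sentence
tokens = ["For", "me", ",", "it", "is", "the", "art", "of", "using", "tools"]

# Case-sensitive check, as in the cells above: "For" is not in the lowercase set
case_sensitive = [t for t in tokens if t not in stop_words]

# Case-insensitive check: lowercase each token before the lookup
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing the text before filtering, as the cleaning cell in the next section does, has the same effect.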

### Cleaning the text for Bag of Words

In [11]:
# cleaning the data using re, lemmatizer, stemmer, stopwords
import re

In [14]:
ps = PorterStemmer()
wordnet = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the stop-word set once
sentences_BOW = nltk.sent_tokenize(paragraph)
corpus = []

for i in range(len(sentences_BOW)):
    # Keep letters only, lowercase, then split on whitespace
    review = re.sub('[^a-zA-Z]', ' ', sentences_BOW[i])
    review = review.lower()
    review = review.split()
    # Lemmatize the remaining words and drop stop words
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

In [15]:
corpus

['data science many various time conflicting definition',
 'art using combination tool data heart create insight solve problem',
 'data science solution either part web application api',
 'web based solution need hosted somewhere',
 'build best object recognition algorithm world sitting computer like book bought remains sitting bookcase',
 'one getting value either situation',
 'super complicated either',
 'saw previous section several type cloud provider covering range use case super simple extremely complicated',
 'start using heroku deploy app',
 'article walk deploy flask application heroku free using heroku cli',
 'simpler solution continue created',
 'streamlit grown popularity enabling folk build data apps quickly introduced streamlit sharing platform help deploy manage share streamlit apps']

### Bag of Words
Using **scikit-learn's** *CountVectorizer*, a text feature-extraction class that converts a collection of documents into a matrix of token counts.

---



In [16]:
# creating bag of words

from sklearn.feature_extraction.text import CountVectorizer  
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()  # Bag of word

In [18]:
X.shape  # (number of sentences, vocabulary size)

(12, 83)
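
To make the count matrix concrete, the sketch below builds a bag of words by hand for two sentences from `corpus`, using only the standard library. It is a simplified illustration that splits on whitespace; `CountVectorizer` uses its own tokenizer, so this is not its actual implementation.

```python
from collections import Counter

# Two cleaned sentences taken from `corpus`
docs = [
    "start using heroku deploy app",
    "article walk deploy flask application heroku free using heroku cli",
]

# Vocabulary: every distinct word, sorted so each word gets a fixed column
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per sentence, one column per vocabulary word, each cell a raw count
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

The shape is (2, 11): two sentences and eleven distinct words. The `heroku` column holds 2 in the second row because the word occurs twice in that sentence.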

Bag of words is good at converting sentences into vectors, but its major drawback is that raw counts do not indicate which words matter most to the sentiment or meaning of a sentence. Every occurrence carries the same weight, and the vectors hold no semantic information.
TF-IDF addresses the weighting problem by scoring how relevant each word is to a sentence relative to the rest of the corpus.

### **TF-IDF**

**TF** = Number of times a word occurs in a sentence / Number of words in that sentence

**IDF** = log (Number of sentences / Number of sentences containing the word)

Then multiply **TF * IDF**
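
The two formulas can be worked through by hand on three sentences from `corpus`. This is a minimal sketch that assumes the natural logarithm and whitespace tokenization; the helper names `tf`, `idf` and `tf_idf` are introduced here for illustration only.

```python
import math

# Three cleaned sentences taken from `corpus`
docs = [
    "start using heroku deploy app",
    "article walk deploy flask application heroku free using heroku cli",
    "simpler solution continue created",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def tf(word, tokens):
    # Occurrences of the word divided by the number of words in the sentence
    return tokens.count(word) / len(tokens)

def idf(word):
    # log(number of sentences / number of sentences containing the word)
    containing = sum(1 for tokens in tokenized if word in tokens)
    return math.log(n_docs / containing)

def tf_idf(word, tokens):
    return tf(word, tokens) * idf(word)

second = tokenized[1]
print("heroku:", tf_idf("heroku", second))
print("flask: ", tf_idf("flask", second))
```

In the second sentence `heroku` occurs twice but also appears in another sentence, while `flask` occurs once and nowhere else, so `flask` ends up with the higher score. With its default settings, scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row, so its values will differ from the ones computed here.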

In [19]:
# creating TF-IDF model

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
Y = tfidf.fit_transform(corpus).toarray()


In [21]:
Y.shape

(12, 83)