### Machines understand only numbers
How do we feed text into an NLP model and turn it into numbers?

There are many ways to do this. You can count how often each word occurs and use those counts, or you can use word embeddings, which map each word to a vector of numbers.

Here I show one of the vocabulary-based vectorizers.
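To make the word-frequency idea concrete, here is a minimal pure-Python sketch of a bag-of-words count. The function name `bag_of_words` and the two toy sentences are made up for illustration, and the tokenization rule (lowercase, words of two or more characters) is an assumption chosen to resemble scikit-learn's default; it is not the library's actual implementation.

```python
import re
from collections import Counter


def bag_of_words(docs):
    """Map each document to a list of word counts over a shared vocabulary."""
    # Lowercase and keep tokens of two or more word characters (assumed rule)
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # Vocabulary: every distinct token, sorted, mapped to a column index
    distinct = sorted({token for tokens in tokenized for token in tokens})
    vocabulary = {word: idx for idx, word in enumerate(distinct)}
    # One row per document, one column per vocabulary word
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts.get(word, 0) for word in vocabulary])
    return vocabulary, vectors


docs = ["the cat sat", "the cat saw the dog"]  # toy data, not from the notebook
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each document becomes a row of numbers whose length equals the vocabulary size, which is the same shape of result a count-based vectorizer produces.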

#### CountVectorizer

In [1]:
# load the library
import sklearn

In [2]:
# check which version we are using
sklearn.__version__

'0.22.1'

In [3]:
text = ["This guy is super and he is doing great in his career. He has been awarded and will look forward etc."]

In [4]:
# What should we do so that the computer can work with this text?
# Ideally we would clean the data first, but in this example we skip that step

In [5]:
# The text contains many common words that carry little meaning; we want to remove them
# The NLTK library ships a collection of such stopwords, e.g. a, an, the, they

# import NLTK
import nltk

# if you are using NLTK for the first time, you need to download the stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/next_tech/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
# We need to convert the text into numbers
# To do that, we use a vectorizer
# Let us look at CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Importing libraries in the middle of a notebook like this is not good practice; ideally all imports go at the top. Why?
#  Because when you deploy the application, you can then easily write a complete requirements.txt file
#  and will not miss any dependency

In [7]:
# load english stopwords
stopwords = nltk.corpus.stopwords.words('english')

# create a CountVectorizer object that drops the stopwords
count_vec = CountVectorizer(stop_words=stopwords)

In [8]:
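# let us fit the count vectorizer to our data
# note: fit() learns the vocabulary and returns the fitted vectorizer itself, not transformed data
training_data = count_vec.fit(text)

# let us look at the fitted object
training_data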

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None,
                stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...],
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [9]:
# fitting builds a vocabulary from the data (term -> column index); we can have a look at it
count_vec.vocabulary_

{'guy': 5,
 'super': 7,
 'great': 4,
 'career': 1,
 'awarded': 0,
 'look': 6,
 'forward': 3,
 'etc': 2}

In [10]:
# look up a word in the vocabulary
count_vec.vocabulary_.get('career')

1
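The numbers in the vocabulary are column indices, assigned in alphabetical order of the terms. A small sketch of how the mapping can be inverted to list the terms in column order is given below. To keep it self-contained without the NLTK download, it uses a short hand-written stop list (`small_stopwords`, a hypothetical stand-in that covers only the stopwords occurring in this one sentence), so it is an illustration rather than a replacement for the cells above.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["This guy is super and he is doing great in his career. "
        "He has been awarded and will look forward etc."]

# Hypothetical hand-written stop list standing in for the NLTK one
small_stopwords = ["this", "is", "and", "he", "doing", "in", "his",
                   "has", "been", "will"]

vec = CountVectorizer(stop_words=small_stopwords)
vec.fit(text)

# Invert term -> index into index -> term
index_to_term = {idx: term for term, idx in vec.vocabulary_.items()}

# Terms listed in the order of the matrix columns
terms_in_column_order = [index_to_term[i] for i in range(len(index_to_term))]
print(terms_in_column_order)
# ['awarded', 'career', 'etc', 'forward', 'great', 'guy', 'look', 'super']
```

Reading the list by position gives the same indices as the vocabulary shown above, for example `'career'` at position 1.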

In [11]:
# let us transform the data
# this step converts the text into numbers (a sparse matrix of counts)
trans_count_vector = count_vec.transform(text)

In [12]:
# we can inspect several properties of the result, for example its shape
trans_count_vector.shape

(1, 8)

In [13]:
# let us convert it to a dense array
trans_count_vector.toarray()

array([[1, 1, 1, 1, 1, 1, 1, 1]])
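Every count above is 1 because each remaining word occurs once in the sentence the vectorizer was fitted on. The sketch below transforms a different sentence with an already fitted vectorizer, to show that repeated words are counted and that words outside the learned vocabulary are ignored. The new sentence and the hand-written `small_stopwords` list are assumptions made for illustration; the list stands in for the NLTK stopwords so the block runs without a download.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["This guy is super and he is doing great in his career. "
        "He has been awarded and will look forward etc."]

# Hypothetical hand-written stop list standing in for the NLTK one
small_stopwords = ["this", "is", "and", "he", "doing", "in", "his",
                   "has", "been", "will"]

# Learn the vocabulary from the original sentence only
count_vec = CountVectorizer(stop_words=small_stopwords).fit(text)

# A made-up sentence: 'great' appears twice, 'reward' was never seen during fit
new_text = ["A great career is a great reward"]
new_counts = count_vec.transform(new_text).toarray().tolist()
print(new_counts)  # [[0, 1, 0, 0, 2, 0, 0, 0]]
```

Column 1 (`career`) holds 1 and column 4 (`great`) holds 2, while `reward` contributes nothing because `transform` only counts terms learned during `fit`.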

In [14]:
# stop words functionality with count vectorizer

## N-grams

In [15]:
# What is an n-gram? A sequence of n consecutive words
# bigram - a combination of 2 words together

In [16]:
# ngram_range=(2, 2) keeps only bigrams
n_gram_count = CountVectorizer(ngram_range=(2, 2), stop_words=stopwords)

In [17]:
ngram_vector = n_gram_count.fit_transform(text)


In [18]:
# vocabulary
n_gram_count.vocabulary_

{'guy super': 4,
 'super great': 6,
 'great career': 3,
 'career awarded': 1,
 'awarded look': 0,
 'look forward': 5,
 'forward etc': 2}
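Notice that a bigram such as `'super great'` appears in this vocabulary even though the two words are not adjacent in the original sentence: the stopwords are removed first, and the bigrams are built from the tokens that remain. A pure-Python sketch of that order of operations follows. The tokenization rule and the hand-written `small_stopwords` set are assumptions for illustration and only approximate what the vectorizer does internally.

```python
import re

text = ("This guy is super and he is doing great in his career. "
        "He has been awarded and will look forward etc.")

# Hypothetical hand-written stop list standing in for the NLTK one
small_stopwords = {"this", "is", "and", "he", "doing", "in", "his",
                   "has", "been", "will"}

# Step 1: lowercase, tokenize, and drop the stopwords
tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower())
          if t not in small_stopwords]

# Step 2: pair each remaining token with the one that follows it
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)
# ['guy super', 'super great', 'great career', 'career awarded',
#  'awarded look', 'look forward', 'forward etc']
```

Eight remaining tokens give seven bigrams, which matches the seven entries in the vocabulary above.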

In [19]:
ngram_vector.toarray()

array([[1, 1, 1, 1, 1, 1, 1]])