### Word Embeddings
Natural language processing (NLP) is a subfield of AI that deals with natural languages. Most NLP tasks fall into one of two broad categories: understanding natural language or generating it.

Natural Language Understanding (NLU) covers tasks where the computer reads text and does something with it, such as classifying it or extracting information from it. Natural Language Generation (NLG) covers tasks where the computer has to produce text itself, as in machine translation. Of course, a given NLP task need not be strictly NLU or NLG; it can be a mix of the two.

Two of the challenges of working with natural languages are: 1) Ambiguity. Human language is often unclear, and even people sometimes struggle to work out what is meant, so it is reasonable to expect computers to face the same problem. 2) Feeding words to a machine learning or deep learning model. These models are mathematical in nature and can process only numbers, so to make a computer work with language we have to convert our text (made of words) into numbers.

This numeric representation of words is called a word embedding.

In this notebook we will look at one of the simplest ways to vectorize text: the count vectorizer.

The process is fairly simple:
1. Get the text corpus (our term for a list of sentences; these sentences may also be called documents).
2. Tokenize the documents in the corpus: send each sentence/document through a tokenizer to get its individual words/tokens. Each unique token gets its own position in the vector space, that is, each one becomes one feature. Think of a row vector with one entry per token.
3. Get the list of unique words across the corpus. The number of unique words is the size (dimensionality) of our vector space.
4. For each sentence/document in the corpus, count the occurrences of each vocabulary token in it to get a row vector of counts.
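The four steps above can be sketched on a toy corpus using only the standard library. This is a minimal illustration, not the notebook's pipeline: the two sentences, the helper name `tokenize`, and the `toy_` variables are made up for this example, and a simple regular expression stands in for a real tokenizer.

```python
import re

# Step 1: a toy corpus (hypothetical sentences, for illustration only)
toy_corpus = ["The cat sat on the mat.", "The dog sat."]

# Step 2: tokenize each document into lowercase word tokens
def tokenize(doc):
    return re.findall(r"\w+", doc.lower())

tokenized = [tokenize(doc) for doc in toy_corpus]

# Step 3: the sorted unique tokens form the vocabulary (one feature per token)
toy_vocab = sorted({tok for doc in tokenized for tok in doc})

# Step 4: one row of counts per document, aligned with the vocabulary order
toy_vectors = [[doc.count(tok) for tok in toy_vocab] for doc in tokenized]

print(toy_vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
for row in toy_vectors:
    print(row)      # [1, 0, 1, 1, 1, 2] then [0, 1, 0, 0, 1, 1]
```

Note that "the" gets a count of 2 in the first row because it appears twice in the first sentence, and every row has the same length as the vocabulary.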

In [1]:
import re
import nltk
from nltk.tokenize import RegexpTokenizer
import spacy
from random import shuffle
import string
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

nlp = spacy.load("en_core_web_sm")

In [2]:
#get the corpus
with open("sample text.txt","r") as f:
    content = f.read()

# get the sentences (nltk.sent_tokenize needs NLTK's Punkt tokenizer data to be downloaded)
corpus = nltk.sent_tokenize(content)
# a single shuffle already gives a uniform random ordering; keep 10 sentences
shuffle(corpus)
corpus = corpus[:10]

In [3]:
# clean sentences and tokenize
def bag_of_words(s):
    # raw string so that "\w" is not treated as an invalid string escape
    tokenizer = RegexpTokenizer(r"\w+")
    tokens = tokenizer.tokenize(s)
    tokens = [token for token in tokens if token not in string.punctuation]
    tokens = [token.lower() for token in tokens if re.findall(r"\w", token)]
    tokens = [token.strip().strip('.').strip("—").strip("'") for token in tokens]
    return tokens
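To see what a `\w+` tokenizer does to a sentence, here is a rough stdlib-only stand-in. The function name `simple_tokens` and the sample sentence are hypothetical, and `re.findall` is used in place of NLTK's `RegexpTokenizer`, so treat it as an approximation of `bag_of_words` rather than the same function.

```python
import re

def simple_tokens(sentence):
    # keep runs of word characters (letters, digits, underscore), lowercased
    return [tok.lower() for tok in re.findall(r"\w+", sentence)]

sample_tokens = simple_tokens("Mr. McPherson's voice was urgent!")
print(sample_tokens)
# ['mr', 'mcpherson', 's', 'voice', 'was', 'urgent']
```

Punctuation is dropped because it never matches `\w`, and the apostrophe splits "McPherson's" into two tokens, which is worth keeping in mind when reading the vocabulary.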
    

In [4]:
# get the vocabulary
def get_vocab(all_text):
    tokenized_text = [bag_of_words(s) for s in all_text]
    tokens = [token for tokens in tokenized_text for token in tokens]
    tokens = list(set(tokens))
    tokens.sort()
    return tokens

In [5]:
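# build the count vector for sentence s over the vocabulary
def get_count_vect(s, vocab):
    bagofwords = bag_of_words(s)
    # map each vocabulary word to its column index (vocab has no duplicates)
    index = {word: i for i, word in enumerate(vocab)}
    cv = [0]*len(vocab)
    for w in bagofwords:
        # words that are not in the vocabulary are ignored
        if w in index:
            cv[index[w]] += 1
    return cv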

In [6]:
vocabulary = get_vocab(corpus)
cv = [get_count_vect(s, vocabulary) for s in corpus]

In [8]:
print(corpus[5])
print(*zip(vocabulary, cv[5]),sep="\n")

"A pathetic, futile, broken creature."
('a', 1)
('about', 0)
('against', 0)
('an', 0)
('and', 0)
('aware', 0)
('be', 0)
('before', 0)
('brave', 0)
('broke', 0)
('broken', 1)
('but', 0)
('can', 0)
('chance', 0)
('creature', 1)
('day', 0)
('distant', 0)
('down', 0)
('enough', 0)
('find', 0)
('futile', 1)
('going', 0)
('have', 0)
('he', 0)
('him', 0)
('how', 0)
('i', 0)
('in', 0)
('is', 0)
('knew', 0)
('lake', 0)
('let', 0)
('listening', 0)
('machine', 0)
('man', 0)
('mcpherson', 0)
('might', 0)
('more', 0)
('mr', 0)
('music', 0)
('old', 0)
('pathetic', 1)
('said', 0)
('shoscombe', 0)
('some', 0)
('strong', 0)
('that', 0)
('the', 0)
('through', 0)
('to', 0)
('urgent', 0)
('us', 0)
('voice', 0)
('was', 0)
('well', 0)
('were', 0)
('what', 0)
('with', 0)
('woman', 0)
('you', 0)
('yourself', 0)
