# Text Representation

Transforming pre-processed text into a numerical form that can be fed into an ML algorithm for further processing is called feature extraction or text representation.
Feature extraction is a common step in any ML problem, whether the input is text, images, video, or audio.

 - Images are represented as matrices of their pixel values.
 - Video is similar: a video is a collection of frames, where each frame is an image, so a video is represented as a sequential collection of matrices.
 - Audio is transmitted as waves. To represent it mathematically, the wave amplitude is sampled at fixed intervals and recorded, which gives an array representation of the sound wave.
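
As a rough illustration of the three representations above, here is a minimal sketch in plain Python. The pixel values, the 440 Hz tone, and the 8000 Hz sampling rate are made-up assumptions chosen only to keep the example small.

```python
import math

# A tiny 2x3 grayscale "image": one intensity value (0-255) per pixel.
image = [
    [0, 128, 255],
    [64, 192, 32],
]

# A "video" is a sequence of frames, each frame being an image matrix.
# Here the second frame is the first one with inverted intensities.
inverted = [[255 - px for px in row] for row in image]
video = [image, inverted]

# "Audio": sample a sine wave and keep the amplitude at each sampling instant.
sample_rate = 8000  # assumed sampling rate in Hz
frequency = 440     # assumed tone in Hz
audio = [math.sin(2 * math.pi * frequency * n / sample_rate) for n in range(8)]

print(len(image), len(image[0]))     # rows and columns of the image matrix
print(len(video))                    # number of frames in the video
print([round(a, 3) for a in audio])  # array of sampled amplitudes
```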

Text representation approaches are classified into 4 categories:

 - Basic vectorization approaches
 - Distributed representations
 - Universal language representation
 - Handcrafted Features

##### Representing text as vectors of numbers is called the vector space model. It is a simple model used for representing any text blob, and it is fundamental to many NLP operations such as information retrieval and document scoring.

## Basic vectorization approaches
Map each word in the vocabulary of the text corpus to a unique integer ID. Then represent each sentence in the corpus as a |V|-dimensional vector, where V is the vocabulary.

### One-Hot Encoding
In this method, each word w in the corpus is given a unique integer ID between 1 and |V|, where V is the set of words in the corpus vocabulary. Each word is then represented by a |V|-dimensional binary vector that is 1 at the index of its ID and 0 everywhere else.

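- One-hot encoding is intuitive to understand and straightforward to implement.
- The size of a one-hot vector is directly proportional to the size of the vocabulary, so for large corpora it is computationally inefficient to compute and store.
- It doesn't give a fixed-length representation of a text: a text gets one vector per word, so texts with different numbers of words get representations of different sizes.
- It treats words as atomic units and is poor at capturing the meaning of a word in relation to other words. For example, run and ran are no closer to each other than run and apple.
- It has an out-of-vocabulary problem: a word that was not in the corpus has no ID, so it cannot be represented.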
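
The drawbacks listed above can be checked on a toy example. This is only a sketch: `toy_vocab`, `one_hot`, and `squared_distance` are hypothetical names introduced for illustration, and the three-word vocabulary is an assumption.

```python
# Hypothetical toy vocabulary; IDs start at 1 as in the description above.
toy_vocab = {"run": 1, "ran": 2, "apple": 3}

def one_hot(word, vocab):
    """Return the |V|-dimensional one-hot vector for `word` (all zeros if unknown)."""
    vec = [0] * len(vocab)
    if word in vocab:
        vec[vocab[word] - 1] = 1  # IDs start at 1, list indices at 0
    return vec

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

run, ran, apple = (one_hot(w, toy_vocab) for w in ("run", "ran", "apple"))

# Every pair of distinct words is equally far apart, so "run" is no
# closer to "ran" than it is to "apple".
print(squared_distance(run, ran))    # 2
print(squared_distance(run, apple))  # 2

# A word outside the vocabulary has no position to switch on.
print(one_hot("walk", toy_vocab))    # [0, 0, 0]

# One vector per word: texts of different lengths give
# representations with different numbers of rows.
print(len([one_hot(w, toy_vocab) for w in "run ran".split()]))        # 2
print(len([one_hot(w, toy_vocab) for w in "run ran apple".split()]))  # 3
```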


In [2]:
# List of sentences
sent_list = ["i read newspaper yesterday.", "I watched TV Today.", "john read newspaper and watched TV today."]

pre_process_list = [i.lower().replace(".","") for i in sent_list]

In [3]:
print(pre_process_list)

['i read newspaper yesterday', 'i watched tv today', 'john read newspaper and watched tv today']


In [4]:
# Build vocabulary set for the pre-processed list
vocab = {}
count = 0
for i in pre_process_list:
    for w in i.split():
        if w not in vocab:
            count = count + 1
            vocab[w] = count

In [5]:
print(vocab)

{'i': 1, 'read': 2, 'newspaper': 3, 'yesterday': 4, 'watched': 5, 'tv': 6, 'today': 7, 'john': 8, 'and': 9}


In [21]:
def get_onehot_encoding(text):
    """
        Generate a one-hot encoding for a string based on the vocab dict.
        If a word exists in vocab, its one-hot vector is returned.
        If not, a vector of zeros is returned for that word.
    """
    one_hot_encoded = []
    for w in text.split():
        temp = [0]*len(vocab)
        if w in vocab:
            temp[vocab[w] - 1] = 1  # -1 because list indexing starts at 0
        one_hot_encoded.append(temp)
    return one_hot_encoded

In [22]:
get_onehot_encoding(pre_process_list[0])

[[1, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 0, 0, 0, 0, 0]]

In [23]:
# Using scikit-learn
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

In [24]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [32]:
nest_list = [i.split() for i in pre_process_list]

word_list = [ item for elem in nest_list for item in elem]

print(word_list)

['i', 'read', 'newspaper', 'yesterday', 'i', 'watched', 'tv', 'today', 'john', 'read', 'newspaper', 'and', 'watched', 'tv', 'today']


In [34]:
# Label encoding
label_encoder = LabelEncoder()
integer_encoded_values = label_encoder.fit_transform(word_list)

print(integer_encoded_values)

[1 4 3 8 1 7 6 5 2 4 3 0 7 6 5]
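
These labels differ from the IDs in the hand-built `vocab` because `LabelEncoder` numbers the sorted unique values starting from 0, rather than numbering words in order of first appearance starting from 1. A minimal plain-Python sketch of that numbering (the names `words`, `classes`, and `label_of` are introduced here for illustration):

```python
# Same flattened word list as above.
words = ['i', 'read', 'newspaper', 'yesterday', 'i', 'watched', 'tv', 'today',
         'john', 'read', 'newspaper', 'and', 'watched', 'tv', 'today']

# Sorted unique words; the position in this list is the integer label.
classes = sorted(set(words))
label_of = {w: i for i, w in enumerate(classes)}

encoded = [label_of[w] for w in words]
print(classes)
# ['and', 'i', 'john', 'newspaper', 'read', 'today', 'tv', 'watched', 'yesterday']
print(encoded)
# [1, 4, 3, 8, 1, 7, 6, 5, 2, 4, 3, 0, 7, 6, 5]
```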


In [None]:
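# One-hot encoding
# OneHotEncoder expects a 2-D input of shape (n_samples, n_features),
# so each word is wrapped in its own single-element row.
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform([[w] for w in word_list])
print(onehot_encoded.toarray())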

### Bag of words

Similar to one-hot encoding, bag of words maps each word to a unique integer ID between 1 and |V|. Each document in the corpus is then converted into a vector of |V| dimensions, in which the ith component is simply the number of times the word with ID i occurs in the document. In other words, each word in V is scored by its occurrence count in the document.

EX: Vocab = [i=1, read=2, newspaper=3, yesterday=4, today=5, john=6, tv=7, watch=8]

    i read newspaper today. = [1,1,1,0,1,0,0,0]
    i read newspaper today, i watch tv today = [2,1,1,0,2,0,1,1]
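
The two example vectors can be reproduced with a short plain-Python sketch. The helper name `bag_of_words` and the simple punctuation stripping are assumptions made for this illustration.

```python
from collections import Counter

# Vocabulary from the example; list position i holds the word with ID i + 1.
vocab_words = ["i", "read", "newspaper", "yesterday", "today", "john", "tv", "watch"]

def bag_of_words(text, vocab_words):
    """Count how often each vocabulary word occurs in `text`."""
    # Lowercase and strip the punctuation used in the example sentences.
    tokens = text.lower().replace(".", "").replace(",", "").split()
    counts = Counter(tokens)  # missing words count as 0
    return [counts[w] for w in vocab_words]

print(bag_of_words("i read newspaper today.", vocab_words))
# [1, 1, 1, 0, 1, 0, 0, 0]
print(bag_of_words("i read newspaper today, i watch tv today", vocab_words))
# [2, 1, 1, 0, 2, 0, 1, 1]
```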


- With this method, documents that share the same words have vector representations closer to each other in Euclidean space.
- It gives a fixed-length encoding for a sentence of any length.

- The size of the vector grows with the size of the vocabulary. This can be restricted by limiting the vocabulary.
- It doesn't capture similarity between different words.
- It doesn't handle out-of-vocabulary words.
- Word order information is lost in this representation.
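
Two of the points above, closeness in Euclidean space and the loss of word order, can be illustrated with a small sketch. The three documents and the helper names `bow_vector` and `euclidean` are assumptions chosen for this example.

```python
import math
from collections import Counter

# Hypothetical mini corpus: the first two documents share three words,
# the third shares none with the first.
docs = [
    "i read newspaper yesterday",
    "i read newspaper today",
    "john watched tv",
]
vocab_words = sorted({w for d in docs for w in d.split()})

def bow_vector(text):
    """Bag-of-words count vector over `vocab_words`."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab_words]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

a, b, c = (bow_vector(d) for d in docs)

# Documents with more words in common end up closer together.
print(euclidean(a, b) < euclidean(a, c))  # True

# Word order is lost: both orderings map to the same vector.
print(bow_vector("john read newspaper") == bow_vector("newspaper read john"))  # True
```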

In [46]:
from sklearn.feature_extraction.text import CountVectorizer

In [47]:
print(pre_process_list)

['i read newspaper yesterday', 'i watched tv today', 'john read newspaper and watched tv today']


In [48]:
# Initialize the count vectorizer
count_vect = CountVectorizer()

In [57]:
# Build the BoW representation for the sentence list
bow = count_vect.fit_transform(pre_process_list)

print(count_vect.vocabulary_)

{'read': 3, 'newspaper': 2, 'yesterday': 7, 'watched': 6, 'tv': 5, 'today': 4, 'john': 1, 'and': 0}
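
Note that the vocabulary has no entry for "i". The default `token_pattern` of `CountVectorizer` is `(?u)\b\w\w+\b`, which only selects tokens of two or more word characters. The sketch below applies that pattern directly with `re` to show the effect; the constant names are introduced here for illustration.

```python
import re

# Default token pattern of CountVectorizer: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern that also keeps single-character tokens.
KEEP_SINGLE_CHARS = r"(?u)\b\w+\b"

text = "i read newspaper yesterday"
print(re.findall(DEFAULT_PATTERN, text))
# ['read', 'newspaper', 'yesterday']
print(re.findall(KEEP_SINGLE_CHARS, text))
# ['i', 'read', 'newspaper', 'yesterday']
```

Passing `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` should keep "i" in the vocabulary, at the cost of one extra dimension.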


In [58]:
print("i read newspaper yesterday: ", bow[0].toarray())
print("i watched tv today: ", bow[1].toarray())

i read newspaper yesterday:  [[0 0 1 1 0 0 0 1]]
i watched tv today:  [[0 0 0 0 1 1 1 0]]


In [59]:
# "watch" is not in the fitted vocabulary, so it is ignored
new_text = count_vect.transform(["i read newspaper today i watch tv today"])
print("i read newspaper today i watch tv today: ", new_text.toarray())

i read newspaper today i watch tv today:  [[0 0 1 1 2 1 0 0]]


In [61]:
# BoW with binary vectors: each component records presence (1) or absence (0) instead of a count
count_vect = CountVectorizer(binary=True)
bow_rep_bin = count_vect.fit_transform(pre_process_list)
text_2 = count_vect.transform(["i read newspaper today i watch tv today"])
print("i read newspaper today i watch tv today:", text_2.toarray())

i read newspaper today i watch tv today: [[0 0 1 1 1 1 0 0]]


### Bag of N-Grams



### TF-IDF

## Distributed Representations
### Distributional similarity
### Distributional hypothesis
### Distributional representation
### Distributed representation
