# Text Representation

Transforming pre-processed text into a suitable numerical form that can be fed into an ML algorithm for further processing is called feature extraction or text representation.
Feature extraction is a common step in any ML problem, including those that work with images, video, or audio.

 - Images are transformed into a matrix representation based on their pixel values.
 - Video is similar: a video is just a collection of frames, where each frame is an image, so a video is represented as a sequential collection of matrices.
 - Audio is usually transmitted as waves. To represent it mathematically, the amplitude of the wave is sampled and recorded, which gives an array representation of the sound wave.

Text representation approaches are classified into 4 categories:

 - Basic vectorization approaches
 - Distributed representations
 - Universal language representation
 - Handcrafted Features

##### Representing text by vectors of numbers is called the vector space model. It is a simple model for representing any text blob, and it is fundamental to many NLP operations such as information retrieval and document scoring.

## Basic vectorization approaches
Map each word in the vocabulary of the text corpus to a unique integer ID, then represent each sentence or document in the corpus as a V-dimensional vector, where V is the size of the vocabulary.

### One-Hot Encoding
In this method, each word w in the corpus is given a unique integer ID between 1 and |V|, where V is the corpus vocabulary. Each word is then represented by a |V|-dimensional binary vector that is all 0s except for a 1 at the index corresponding to the word's ID.

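- One-hot encoding is intuitive to understand and straightforward to implement.
- The size of a one-hot vector is directly proportional to the size of the vocabulary, so for large corpora it is computationally inefficient to compute and store.
- It doesn't give a fixed-length representation for text: a sentence with n words is encoded as n vectors, so the representation grows with sentence length.
- It treats words as atomic units and is poor at capturing the meaning of a word in relation to other words. For example, it cannot tell that "run" and "ran" are closer to each other than "run" and "apple".
- It suffers from the out-of-vocabulary (OOV) problem: a word that is not in the vocabulary cannot be represented.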


In [1]:
# List of sentences
sent_list = ["i read newspaper yesterday.", "I watched TV Today.", "john read newspaper and watched TV today."]

pre_process_list = [i.lower().replace(".","") for i in sent_list]

In [2]:
print(pre_process_list)

['i read newspaper yesterday', 'i watched tv today', 'john read newspaper and watched tv today']


In [4]:
# Build vocabulary set for the pre-processed list
vocab = {}
count = 0
for i in pre_process_list:
    for w in i.split():
        if w not in vocab:
            count = count + 1
            vocab[w] = count

In [5]:
print(vocab)

{'i': 1, 'read': 2, 'newspaper': 3, 'yesterday': 4, 'watched': 5, 'tv': 6, 'today': 7, 'john': 8, 'and': 9}


In [21]:
def get_onehot_encoding(text):
    """
        Generate a one-hot encoding for each word of a string, based on vocab.
        If a word exists in vocab, its one-hot vector is returned.
        If not, a vector of all zeros is returned for that word.
    """
    one_hot_encoded = []
    for w in text.split():
        temp = [0]*len(vocab)
        if w in vocab:
            temp[vocab[w]-1] = 1  # -1 because list indexing starts from 0
        one_hot_encoded.append(temp)
    return one_hot_encoded

In [22]:
get_onehot_encoding(pre_process_list[0])

[[1, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 0, 0, 0, 0, 0]]

In [23]:
# Using scikit-learn
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

In [24]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [32]:
nest_list = [i.split() for i in pre_process_list]

word_list = [ item for elem in nest_list for item in elem]

print(word_list)

['i', 'read', 'newspaper', 'yesterday', 'i', 'watched', 'tv', 'today', 'john', 'read', 'newspaper', 'and', 'watched', 'tv', 'today']


In [34]:
# Label encoding
label_encoder = LabelEncoder()
integer_encoded_values = label_encoder.fit_transform(word_list)

print(integer_encoded_values)

[1 4 3 8 1 7 6 5 2 4 3 0 7 6 5]


In [None]:
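# One-hot encoding
# OneHotEncoder expects a 2D array of shape (n_samples, n_features); the ragged
# nest_list does not fit that shape, so pass the flat word list as a single column.
import numpy as np
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(np.array(word_list).reshape(-1, 1))
print(onehot_encoded.toarray())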

### Bag of words
    
    Similar to one-hot encoding, bag of words (BoW) maps each word to a unique integer ID between 1 and |V|. Each document in the corpus is then converted into a vector of |V| dimensions, in which the i-th component is simply the number of times the word with ID i occurs in the document. In other words, each word in V is scored by its occurrence count in the document.
    
    Ex: Vocab = [i=1, read=2, newspaper=3, yesterday=4, today=5, john=6, tv=7, watch=8]
        "i read newspaper today." = [1,1,1,0,1,0,0,0]
        "i read newspaper today, i watch tv today" = [2,1,1,0,2,0,1,1]


- With this method, documents that share the same words have vector representations that are close to each other in Euclidean space.
- It gives a fixed-length encoding for a sentence of any length.

- The size of the vector grows with the size of the vocabulary. This can be restricted by limiting the vocabulary.
- It doesn't capture similarity between different words.
- It doesn't handle out-of-vocabulary words.
- Word order information is lost in this representation.
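
The worked example above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the `bag_of_words` helper and the `vocab_order` list are names introduced here for illustration, and the punctuation handling is deliberately naive.

```python
from collections import Counter

# Vocabulary from the worked example, in ID order (IDs 1..8 map to indices 0..7).
vocab_order = ["i", "read", "newspaper", "yesterday", "today", "john", "tv", "watch"]

def bag_of_words(text, vocab_order):
    """Return a count vector with one slot per vocabulary word; OOV words are ignored."""
    # Lowercase and strip basic punctuation so that "today," matches "today".
    tokens = text.lower().replace(".", "").replace(",", "").split()
    counts = Counter(tokens)  # a Counter returns 0 for words it has not seen
    return [counts[word] for word in vocab_order]

doc_a = bag_of_words("i read newspaper today.", vocab_order)
doc_b = bag_of_words("i read newspaper today, i watch tv today", vocab_order)
print(doc_a)  # [1, 1, 1, 0, 1, 0, 0, 0]
print(doc_b)  # [2, 1, 1, 0, 2, 0, 1, 1]
```

Both vectors have length |V| = 8 regardless of sentence length, and any word outside the vocabulary simply contributes nothing.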

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

In [47]:
print(pre_process_list)

['i read newspaper yesterday', 'i watched tv today', 'john read newspaper and watched tv today']


In [48]:
# Initialize the count vectorizer
count_vect = CountVectorizer()

In [57]:
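# Build the BoW representation for the sentence list
bow = count_vect.fit_transform(pre_process_list)

# Note: the default token pattern ignores single-character tokens, so "i" is not in the vocabulary
print(count_vect.vocabulary_)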

{'read': 3, 'newspaper': 2, 'yesterday': 7, 'watched': 6, 'tv': 5, 'today': 4, 'john': 1, 'and': 0}


In [58]:
print("i read newspaper yesterday: ", bow[0].toarray())
print("i watched tv today: ", bow[1].toarray())

i read newspaper yesterday:  [[0 0 1 1 0 0 0 1]]
i watched tv today:  [[0 0 0 0 1 1 1 0]]


In [59]:
# "watch" is not in the fitted vocabulary (only "watched" is), so it is ignored
new_text = count_vect.transform(["i read newspaper today i watch tv today"])
print("i read newspaper today i watch tv today: ", new_text.toarray())

i read newspaper today i watch tv today:  [[0 0 1 1 2 1 0 0]]


In [61]:
# BoW with binary vectors: each entry records presence (1) or absence (0) instead of a count
count_vect = CountVectorizer(binary=True)
bow_rep_bin = count_vect.fit_transform(pre_process_list)
text_2 = count_vect.transform(["i read newspaper today i watch tv today"])
print("i read newspaper today i watch tv today:", text_2.toarray())

i read newspaper today i watch tv today: [[0 0 1 1 1 1 0 0]]


### Bag of N-Grams
    One-hot encoding and bag of words treat words as independent units, with no notion of phrases or word order. The bag of n-grams approach tries to remedy this by breaking text into chunks of n contiguous words, which helps capture some context. Each chunk is called an n-gram, and the vocabulary is the collection of all unique n-grams in the corpus.
    
    Ex: bigram model - "i read newspaper today, i watch tv today" - i read, read newspaper, newspaper today, today i, i watch, watch tv, tv today.
    
    - It captures some context and word-order information in the form of n-grams.
    - Because of this, it is able to capture some semantic similarity.
    - As n increases, dimensionality increases rapidly.
    - It doesn't address the out-of-vocabulary (OOV) problem.
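
To make the chunking concrete, here is a small plain-Python sketch that extracts n-grams from the example sentence. The `ngrams` helper is a name made up for this illustration and is not part of any library.

```python
def ngrams(text, n):
    """Return the n-grams of a text, in reading order, as space-joined strings."""
    # Treat commas as separators so "today,i" and "today, i" tokenize the same way.
    tokens = text.lower().replace(",", " ").split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("i read newspaper today, i watch tv today", 2)
print(bigrams)
# ['i read', 'read newspaper', 'newspaper today', 'today i', 'i watch', 'watch tv', 'tv today']

# The bag-of-n-grams vocabulary is the set of unique n-grams.
bigram_vocab = sorted(set(bigrams))
print(len(bigram_vocab))  # 7
```

A sentence with m tokens yields m - n + 1 n-grams, which is why the vocabulary grows so quickly when several values of n are combined.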



In [5]:
# ngram_range=(1, 3) extracts unigrams, bigrams, and trigrams
count_vect = CountVectorizer(ngram_range=(1,3))

In [6]:
bow = count_vect.fit_transform(pre_process_list)

print(count_vect.vocabulary_)

{'read': 10, 'newspaper': 6, 'yesterday': 20, 'read newspaper': 11, 'newspaper yesterday': 9, 'read newspaper yesterday': 13, 'watched': 17, 'tv': 15, 'today': 14, 'watched tv': 18, 'tv today': 16, 'watched tv today': 19, 'john': 3, 'and': 0, 'john read': 4, 'newspaper and': 7, 'and watched': 1, 'john read newspaper': 5, 'read newspaper and': 12, 'newspaper and watched': 8, 'and watched tv': 2}


In [7]:
print("i read newspaper yesterday: ", bow[0].toarray())
print("i watched tv today: ", bow[1].toarray())

i read newspaper yesterday:  [[0 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 0 1]]
i watched tv today:  [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0]]


In [8]:
new_text = count_vect.transform(["i read newspaper today i watch tv today"])
print("i read newspaper today i watch tv today: ", new_text.toarray())

i read newspaper today i watch tv today:  [[0 0 0 0 0 0 1 0 0 0 1 1 0 0 2 1 1 0 0 0 0]]


### TF-IDF
     In all the approaches above, every word in the text is treated as equally important; there is no notion of some words in a document being more important than others. TF-IDF (term frequency-inverse document frequency) addresses this problem. It tries to quantify the importance of a given word relative to the other words in the document and in the corpus. It is commonly used in information-retrieval systems.
     
     The idea behind TF-IDF is that if a word w appears many times in document A but does not occur much in document B, then w is more important to document A.
     
     TF - measures how often a term occurs in a given document. Raw counts are biased toward longer documents, so the number of occurrences is divided by the length of the document:
     
     TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d)
     
     IDF - measures the importance of a term across the corpus. When computing TF, all terms are given equal importance. However, stop words are very common and occur many times in a document without being important. IDF weighs down the terms that are very common across the corpus and weighs up the rare terms:
     
     IDF(t) = log_e((total number of documents in the corpus) / (number of documents with term t in them))
     
     TF-IDF score = TF * IDF
     
     Even though it often performs better than the methods above, it still suffers from the curse of high dimensionality.
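
As a rough illustration of the TF and IDF definitions above, the following plain-Python sketch scores a couple of words from the three example sentences. The helper names (`tf`, `idf`, `tf_idf`) are made up for this sketch, whitespace tokenization is an assumption, and library implementations such as scikit-learn's use a smoothed variant, so their numbers will differ.

```python
import math

docs = [
    "i read newspaper yesterday",
    "i watched tv today",
    "john read newspaper and watched tv today",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    """Term frequency: occurrences of the term divided by the document length."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Inverse document frequency: log_e(N / number of documents containing the term).
    Assumes the term occurs in at least one document of the corpus."""
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# "yesterday" occurs in only one document, "read" in two,
# so "yesterday" receives the higher score within the first document.
score_yesterday = tf_idf("yesterday", tokenized[0], tokenized)
score_read = tf_idf("read", tokenized[0], tokenized)
print(score_yesterday, score_read)
```

Both words have the same term frequency in the first sentence (1/4), so the difference in their scores comes entirely from the IDF term.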
     


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [10]:
tfidf = TfidfVectorizer()

In [11]:
bow_tfidf = tfidf.fit_transform(pre_process_list)

In [12]:
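# IDF for all words in the vocabulary
print("IDF for all words in the vocabulary",tfidf.idf_)

# All words in the vocabulary.
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions.
print("All words in the vocabulary",tfidf.get_feature_names())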


IDF for all words in the vocabulary [1.69314718 1.69314718 1.28768207 1.28768207 1.28768207 1.28768207
 1.28768207 1.69314718]
All words in the vocabulary ['and', 'john', 'newspaper', 'read', 'today', 'tv', 'watched', 'yesterday']
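
The IDF values printed above do not match the plain log_e(N / df) formula, because `TfidfVectorizer` with its default settings (`smooth_idf=True`, `norm='l2'`) computes a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and then L2-normalizes each document vector. The sketch below reproduces those numbers by hand, assuming these defaults; the document frequencies are counted manually from the three sentences.

```python
import math

n_docs = 3
# Number of documents each vocabulary word appears in (counted by hand).
doc_freq = {"and": 1, "john": 1, "newspaper": 2, "read": 2,
            "today": 2, "tv": 2, "watched": 2, "yesterday": 1}

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
smoothed_idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
print(round(smoothed_idf["and"], 8))        # 1.69314718
print(round(smoothed_idf["newspaper"], 8))  # 1.28768207

# Second document, "i watched tv today": "i" is dropped by the default tokenizer,
# and each remaining word occurs once, so its weight is 1 * idf before normalization.
weights = [smoothed_idf[t] for t in ("today", "tv", "watched")]
norm = math.sqrt(sum(w * w for w in weights))
normalized = [round(w / norm, 8) for w in weights]
print(normalized)  # [0.57735027, 0.57735027, 0.57735027]
```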


In [14]:
print("TFIDF representation for all documents in our corpus\n",bow_tfidf.toarray()) 

TFIDF representation for all documents in our corpus
 [[0.         0.         0.51785612 0.51785612 0.         0.
  0.         0.68091856]
 [0.         0.         0.         0.         0.57735027 0.57735027
  0.57735027 0.        ]
 [0.45212331 0.45212331 0.34385143 0.34385143 0.34385143 0.34385143
  0.34385143 0.        ]]


In [15]:
new_text = tfidf.transform(["i read newspaper today i watch tv today"])
print("i read newspaper today i watch tv today: ", new_text.toarray())

i read newspaper today i watch tv today:  [[0.         0.         0.37796447 0.37796447 0.75592895 0.37796447
  0.         0.        ]]


## Distributed Representations
    The basic vectorization methods have some key drawbacks. To overcome them, methods that learn low-dimensional representations were devised. 
    They use neural network architectures to create dense, low-dimensional representations of words and texts. 
    
### Distributional similarity
    This is the idea that the meaning of a word can be understood from the context in which it appears. Ex: in "MJ rocks", "rocks" literally means stones, but in this context it means "is good".
    
### Distributional hypothesis
    Words that occur in similar contexts have similar meanings. For example, "cat" and "mouse" occur in similar contexts (animals and their characteristics), so according to the distributional hypothesis there should be a strong similarity between the meanings of these two words.
    
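### Distributional representation
    Representation schemes obtained from the distribution of words across the contexts in which they appear, following the distributional hypothesis. The resulting vectors are typically high-dimensional and sparse, as in the count-based schemes above (bag of words, bag of n-grams, TF-IDF).
    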
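### Distributed representation
    Also based on the distributional hypothesis, but compact: instead of sparse, high-dimensional vectors, each word is represented by a dense, low-dimensional vector.
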

### Embedding
    An embedding is a mapping from the vector space of a distributional representation to the vector space of a distributed representation.

### Vector semantics
    A set of NLP methods that aim to learn word representations based on the distributional properties of words in a large corpus.
    


## Word Embeddings
    Suppose we are given the word "cat". Distributionally similar words would be other animals, either domestic or wild. The neural-network-based Word2vec model builds on distributional similarity and can capture word analogy relationships such as "king - man + woman ≈ queen". While learning such semantically rich relationships, Word2vec ensures that the learned word representations are low-dimensional. These representations are called embeddings.
    
    Conceptually, Word2vec takes a large corpus of text as input and “learns” to represent the words in a common vector space based on the contexts in which they appear in the corpus. Given a word w and the words appearing in its context C, how do we find the vector that best represents the meaning of the word? For every word w in the corpus, we start with a vector v initialized with random values. The Word2vec model refines the values in v by predicting v, given the vectors for the words in the context C. It does this using a two-layer neural network.
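
The analogy arithmetic mentioned above is just vector addition followed by a nearest-neighbour search under cosine similarity. The sketch below shows the mechanics on hand-made toy vectors. These are NOT learned embeddings: the four dimensions (royalty, male, female, adult) and all values are invented purely so the arithmetic is easy to follow.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors; dimensions: [royalty, male, female, adult]
toy = {
    "king":   [1.0, 1.0, 0.0, 1.0],
    "queen":  [1.0, 0.0, 1.0, 1.0],
    "man":    [0.0, 1.0, 0.0, 1.0],
    "woman":  [0.0, 0.0, 1.0, 1.0],
    "prince": [1.0, 1.0, 0.0, 0.0],
    "girl":   [0.0, 0.0, 1.0, 0.0],
}

# king - man + woman
query = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Rank every word that is not part of the query by similarity to the result.
candidates = {word: cosine(query, vec) for word, vec in toy.items()
              if word not in ("king", "man", "woman")}
best = max(candidates, key=candidates.get)
print(best)  # queen
```

With real embeddings the match is only approximate, which is why the nearest neighbour of the resulting vector is reported rather than an exact equality.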
    

## Pre-Trained Word Embeddings
    Training our own embeddings is expensive in both time and compute. Fortunately, others have already trained embeddings on large corpora such as Wikipedia, news articles, or even the entire web. These pre-trained embeddings can be thought of as key-value pairs, where each key is a word and the value is its corresponding word vector. 

In [2]:
import warnings #This module ignores the various types of warnings generated
warnings.filterwarnings("ignore") 

import os #This module provides a way of using operating system dependent functionality

import psutil #This module helps in retrieving information on running processes and system resource utilization
process = psutil.Process(os.getpid())
from psutil import virtual_memory
mem = virtual_memory()

import time #This module is used to calculate the time

In [3]:
from gensim.models import Word2Vec, KeyedVectors

In [4]:
pre_trained_embeddings = "../data/GoogleNews-vectors-negative300.bin.gz"

In [5]:
print(process.memory_info().rss / 10**9)

0.094830592


In [6]:
print(f"Total_memory_available:{mem.total}")

Total_memory_available:8424902656


In [7]:
w2v_model = KeyedVectors.load_word2vec_format(pre_trained_embeddings, binary=True)

In [8]:
print(process.memory_info().rss / 10**9)

3.559550976


In [9]:
# Note: on gensim >= 4.0, use len(w2v_model.key_to_index) instead of len(w2v_model.vocab)
print("Number of words in vocabulary: ",len(w2v_model.vocab))

Number of words in vocabulary:  3000000


In [10]:
# check most similar words
w2v_model.most_similar('cat')

[('cats', 0.8099379539489746),
 ('dog', 0.7609456777572632),
 ('kitten', 0.7464985251426697),
 ('feline', 0.7326233983039856),
 ('beagle', 0.7150583267211914),
 ('puppy', 0.7075453996658325),
 ('pup', 0.6934291124343872),
 ('pet', 0.6891531348228455),
 ('felines', 0.6755931377410889),
 ('chihuahua', 0.6709762215614319)]

In [11]:
# check most similar words
w2v_model.most_similar('phone')

[('telephone', 0.8224020600318909),
 ('cell_phone', 0.7831966876983643),
 ('cellphone', 0.7629485130310059),
 ('Phone', 0.7060797214508057),
 ('phones', 0.6894922256469727),
 ('landline', 0.6263927221298218),
 ('voicemail', 0.6252243518829346),
 ('caller_id', 0.6023746132850647),
 ('RingCentral_cloud_computing', 0.5935890674591064),
 ('telephones', 0.5929964780807495)]

In [12]:
# Find vector representation for the word
w2v_model['watch']

array([ 7.81250000e-03,  2.05078125e-02,  1.89453125e-01,  2.85156250e-01,
       -2.55859375e-01,  4.61425781e-02,  2.36816406e-02, -1.11328125e-01,
        1.50390625e-01,  3.83300781e-02, -1.41601562e-02, -3.65234375e-01,
       -7.56835938e-02,  2.09960938e-02, -1.19140625e-01,  1.63085938e-01,
        1.05468750e-01,  1.64062500e-01, -2.03857422e-02, -6.64062500e-02,
        7.95898438e-02,  1.75781250e-01,  1.32812500e-01,  9.42382812e-02,
        3.44238281e-02, -1.90429688e-01,  1.40625000e-01,  1.60156250e-01,
       -5.10253906e-02, -3.54003906e-03, -1.42578125e-01,  1.19140625e-01,
       -3.49121094e-02, -1.82617188e-01,  1.10839844e-01, -1.82617188e-01,
       -8.54492188e-02, -1.46484375e-01, -3.24707031e-02,  7.17773438e-02,
        7.51953125e-02, -3.58886719e-02,  2.20703125e-01, -1.56250000e-01,
       -2.39257812e-01,  1.47460938e-01, -3.61328125e-02, -4.10156250e-02,
        7.53784180e-03, -2.87109375e-01, -1.25000000e-01, -7.56835938e-02,
        2.15820312e-01,  

## Training our own embeddings
    * Continuous bag of words (CBOW)
    * Skip-gram

### Continuous bag of words
    The primary task is to build a model that predicts the center word given the context words in which the center word appears. This is closely related to language modeling. A language model is a statistical model that tries to assign a probability distribution over sequences of words. Its objective is to give high probability to good sentences and low probability to bad ones, where a good sentence is one that is semantically and syntactically correct. Ex: "cat jumped over the dog" gets a probability close to 1.0, while "jumped over the dog cat" gets a probability close to 0.0.
    
    CBOW tries to learn a model that predicts the center word from the words in its context. Ex: in "The furious tiger killed many people", for each window the center word is the target (y) and the remaining words in the window are the input (X).
    
    If k = 2, i.e. two context words on each side, slide a window of size 2k+1 over the text to find the context and the target. At the start of the sentence the window is truncated:
    The furious tiger = (furious, tiger) -> the
    The furious tiger killed = (the, tiger, killed) -> furious
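
A sliding window of size 2k+1 can be sketched in a few lines of plain Python. This only generates the (context, target) training pairs, not the model itself; `cbow_pairs` is a helper name made up for this sketch, and truncating the window at the sentence boundaries is an assumption about how the edges are handled.

```python
def cbow_pairs(tokens, k):
    """Return (context_words, target_word) pairs using a window of k words per side."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - k):i]    # up to k words before the center
        right = tokens[i + 1:i + 1 + k]   # up to k words after the center
        pairs.append((left + right, target))
    return pairs

tokens = "the furious tiger killed many people".split()
pairs = cbow_pairs(tokens, k=2)
for context, target in pairs:
    print(context, "->", target)
# First two lines printed:
# ['furious', 'tiger'] -> the
# ['the', 'tiger', 'killed'] -> furious
```

Each pair supplies the context words as the input and the center word as the label the model is trained to predict.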

### SkipGram
    Skip-gram is similar to CBOW with one change: in the 2k+1 sliding window, the center word is the input (X) and the k words on either side of the center word are the targets (y). Each window therefore yields (center, context) pairs:
    
    The furious tiger = (the, furious)(the, tiger)
    The furious tiger killed = (furious, the)(furious, tiger)(furious, killed)
    
    Several hyperparameters can be tuned, such as the window size, the dimensionality of the vectors to be learned, the learning rate, and the number of epochs.
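
The skip-gram pairs listed above can be generated with the same sliding-window idea, emitting one (center, context) pair per neighbouring word. As before, this is only a sketch of the pair generation, and `skipgram_pairs` is a name invented for the illustration.

```python
def skipgram_pairs(tokens, k):
    """Return (center_word, context_word) pairs using a window of k words per side."""
    pairs = []
    for i, center in enumerate(tokens):
        # Visit every position within k of the center, clipped to the sentence.
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the furious tiger killed many people".split()
pairs = skipgram_pairs(tokens, k=2)
print(pairs[:5])
# [('the', 'furious'), ('the', 'tiger'), ('furious', 'the'), ('furious', 'tiger'), ('furious', 'killed')]
```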
    

In [13]:
# Import test data from gensim model
from gensim.test.utils import common_texts

In [15]:
print(common_texts)

[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]


In [16]:
# Note: on gensim >= 4.0 the `size` parameter is named `vector_size`
model = Word2Vec(common_texts, size=8, window=4, min_count=1, workers=4)

In [17]:
model.save('../data/gensim_common_text_model.w2v')

In [18]:
print(model.wv.most_similar("computer"))

[('human', 0.3920215368270874), ('minors', 0.24046307802200317), ('user', 0.23201751708984375), ('interface', 0.197819322347641), ('survey', 0.1638130098581314), ('eps', -0.008773356676101685), ('trees', -0.013679832220077515), ('time', -0.13756054639816284), ('graph', -0.1475311517715454), ('response', -0.2668741047382355)]


In [20]:
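# Access vectors through model.wv; indexing the model directly is deprecated and was removed in gensim 4.0
print(model.wv["computer"])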

[ 0.0039434   0.04938755 -0.05820011  0.02601842 -0.00767939  0.05581497
 -0.01596179 -0.01445093]


### Using word embeddings to get feature representations of larger text
    A simple approach is to break the text into words, take the embedding of each individual word, and combine them, for example by summing or averaging the vectors.
    
    Both pre-trained and self-trained word embeddings depend on the vocabulary seen in their training data. Word2vec, like the other text representations above, has no good way of handling out-of-vocabulary words. A simple approach is to exclude those words from the feature extraction process.
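
The averaging approach, with out-of-vocabulary words skipped, can be sketched as follows. The three-dimensional `toy_embeddings` are hypothetical values chosen for easy arithmetic, and returning a zero vector when no word is known is an assumption made for this sketch rather than a standard convention.

```python
def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of in-vocabulary words; out-of-vocabulary words are skipped."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to a zero vector (assumption)
    # zip(*vectors) groups the i-th component of every word vector together.
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Hypothetical toy embeddings; "is" and "tourist" are deliberately missing.
toy_embeddings = {
    "goa":         [1.0, 0.0, 2.0],
    "wonderful":   [0.0, 2.0, 2.0],
    "destination": [2.0, 1.0, 2.0],
}

vec = sentence_vector("goa is wonderful tourist destination", toy_embeddings, dim=3)
print(vec)  # [1.0, 1.0, 2.0]
```

Only the three known words contribute to the average, so the result is unaffected by how many unknown words the sentence contains.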

In [23]:
import spacy
nlp = spacy.load("en_core_web_md")

In [24]:
text = nlp("goa is wonderful tourist destination")

In [25]:
# vector for individual word
print(text[0].vector)

[-0.89549    0.38773    0.64984   -0.16708    0.72494   -0.065563
  0.20031   -0.39032    0.35382    0.74078   -1.3711    -0.65238
 -0.38228   -0.23277    0.47455    0.14023   -0.27709    1.9277
  0.55714    0.76838    0.24489    0.091607   0.15209    0.087329
  0.19561    0.070279   0.17415    0.019008  -0.35183    1.146
  0.28155   -0.82137    0.099048   0.25678   -0.42638    0.27792
 -0.25204   -0.31803   -0.50234   -0.36031    0.13668   -0.70532
  0.2811    -0.56934   -0.40299   -0.51336    0.17735   -0.18854
  0.50197    0.15772   -0.036079  -0.066684  -0.25667   -0.81924
 -0.28292    0.32283   -0.041018  -0.42019   -0.018701  -0.46989
 -0.57918   -0.57153    0.19196   -0.33212    0.19154   -0.075422
 -0.015175   0.63033    0.10762   -0.14905   -0.10694   -0.024239
 -0.13572   -0.29651    0.72742    0.1151    -0.18163   -0.10087
  0.97217   -0.058608   0.33354   -0.016199  -0.3009     0.51322
 -0.45041   -0.40163    0.6992     0.45031    0.56161   -0.19313
 -0.086237   0.13358   -

In [26]:
print(text.vector) # average vector for the sentence

[ 1.45461798e-01  1.82797998e-01  2.62645492e-03 -2.22708389e-01
  6.11549973e-01  1.53727397e-01  1.23977400e-01 -2.01021999e-01
 -1.79411396e-01  2.07283592e+00 -5.95480978e-01 -1.43926412e-01
 -1.35687992e-01  2.26209946e-02  1.33888215e-01 -3.11084002e-01
 -2.24286795e-01  1.32608414e+00  4.16280985e-01  9.98557955e-02
 -1.20672002e-01  3.47440019e-02 -9.54186022e-02  1.35446265e-01
  2.04374820e-01 -1.16621945e-02  8.05706009e-02 -1.45420805e-01
 -4.11785990e-01  4.81037915e-01  1.60996795e-01 -3.44648600e-01
  1.58593610e-01  1.81034595e-01 -1.62568256e-01  4.78303954e-02
  1.60867810e-01 -2.12299988e-01 -1.19415268e-01 -1.19790994e-01
  1.66080091e-02 -1.35701984e-01 -1.15529932e-02 -1.90399989e-01
 -9.51439962e-02 -9.52480547e-03  1.07617997e-01 -1.02274001e-01
  1.48561209e-01 -6.36880025e-02 -5.66932037e-02 -1.40958009e-02
 -2.96495967e-02 -1.82335183e-01  2.02155992e-01  7.04765990e-02
  8.09641927e-02 -8.16802010e-02  2.17477977e-02  5.74860089e-02
  3.32629686e-04 -3.56807

* All text representations are biased by what they saw in the training data. 
* Unlike basic vectorization approaches, pre-trained embeddings are generally large files, which poses some challenges during deployment.