## DictVectorizer

In [5]:
from sklearn.feature_extraction import DictVectorizer

In [16]:
v = DictVectorizer(sparse=False)
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]        #sorted by feature name
X = v.fit_transform(D)
print(X)


print(v.inverse_transform(X) == [{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}])

v.transform({'foo': 4, 'unseen_feature': 3})  #unseen_feature is silently ignored (it was not seen during fit)


[[2. 0. 1.]
 [0. 1. 3.]]
True


array([[0., 0., 4.]])
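
As a rough illustration of what the cell above does, here is a minimal pure-Python sketch of dict vectorization. It is not scikit-learn's implementation; the helper names `fit_vocabulary` and `to_rows` are hypothetical, and the sketch only assumes that columns are ordered by sorted feature name and that names missing from the fitted vocabulary are skipped.

```python
def fit_vocabulary(dicts):
    """Map each feature name to a column index, ordered by sorted name."""
    names = sorted({name for d in dicts for name in d})
    return {name: col for col, name in enumerate(names)}


def to_rows(dicts, vocabulary):
    """Turn each dict into a dense row; names outside the vocabulary are skipped."""
    rows = []
    for d in dicts:
        row = [0.0] * len(vocabulary)
        for name, value in d.items():
            if name in vocabulary:
                row[vocabulary[name]] = float(value)
        rows.append(row)
    return rows


D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
vocabulary = fit_vocabulary(D)            # columns: bar, baz, foo
rows = to_rows(D, vocabulary)
unseen = to_rows([{'foo': 4, 'unseen_feature': 3}], vocabulary)
print(vocabulary)
print(rows)
print(unseen)
```

The rows line up with the `fit_transform` output above, and the unseen feature simply contributes nothing to the last row.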

## FeatureHasher (Memory Efficient)
Docstring:     
Implements feature hashing, aka the hashing trick.

This class turns sequences of symbolic feature names (strings) into
scipy.sparse matrices, using a hash function to compute the matrix column
corresponding to a name. The hash function employed is the signed 32-bit
version of Murmurhash3.


Feature names of type byte string are used as-is. Unicode strings are
converted to UTF-8 first, but no Unicode normalization is done.
Feature values must be (finite) numbers.

In [23]:

from sklearn.feature_extraction import FeatureHasher
h = FeatureHasher(n_features=40)
D = [{'dog': 1, 'cat':2, 'elephant':4},{'dog': 2, 'run': 5}]
f = h.transform(D)
f.todense()




matrix([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         -1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., -4.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          2.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         -2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., -5.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.]])
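
The hashing trick itself can be sketched in a few lines. This is only an illustration under stated assumptions: it uses `hashlib.md5` as a stand-in for MurmurHash3 and takes the sign from a separate bit of the digest, so the column indices and signs will not match scikit-learn's output above. The helper name `hash_row` is hypothetical.

```python
import hashlib


def hash_row(features, n_features):
    """Hash each feature name to a column and add its signed value there."""
    row = [0.0] * n_features
    for name, value in features.items():
        digest = hashlib.md5(name.encode('utf-8')).digest()   # stand-in hash, not MurmurHash3
        column = int.from_bytes(digest[:4], 'little') % n_features
        sign = 1.0 if digest[4] & 1 else -1.0                  # assumption: sign taken from one extra bit
        row[column] += sign * value
    return row


D = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
hashed = [hash_row(d, n_features=40) for d in D]

# The same name always lands in the same column, whatever row it appears in.
dog_column = [i for i, x in enumerate(hash_row({'dog': 1}, 40)) if x != 0]
print(dog_column)
```

No vocabulary is stored, which is where the memory saving comes from; the price is that two names can collide in one column and the mapping cannot be inverted.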

## TF-IDF

In [65]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the First document and sec doc.',
    'This document is the second document.',
    'And this is the third one and .',
    'Is this the first document? Umang',
]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())        #removed in scikit-learn 1.2; newer versions use get_feature_names_out()

print(X.shape)
# print(vectorizer.get_stop_words())          #--None
vectorizer.inverse_transform(X)             #return terms per document with nonzero entries in X.
# vectorizer.build_analyzer()


['doc', 'document', 'sec', 'second', 'umang']
(4, 5)


[array(['doc', 'sec', 'document'], dtype='<U8'),
 array(['second', 'document'], dtype='<U8'),
 array([], dtype='<U8'),
 array(['umang', 'document'], dtype='<U8')]
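
To make the weighting concrete, the following pure-Python sketch recomputes TF-IDF by hand. It assumes the tokens that remain after stop-word removal (taken from the `inverse_transform` output above), the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row, which are the settings scikit-learn documents as defaults. The names `tfidf_row` and `X_sketch` are hypothetical.

```python
import math
from collections import Counter

# Tokens left in each document once English stop words are removed.
docs = [
    ['document', 'sec', 'doc'],
    ['document', 'second', 'document'],
    [],
    ['document', 'umang'],
]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_row(doc):
    """Raw count times idf for every vocabulary term, then L2-normalized."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw


X_sketch = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in X_sketch:
    print([round(x, 3) for x in row])
```

Because 'document' appears in three of the four documents, it receives the lowest idf, while terms such as 'umang' that occur in a single document are weighted more heavily.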

### Word embedding

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension.

Methods to generate this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, explainable knowledge base methods, and explicit representation in terms of the context in which words appear.


Software for training and using word embeddings includes Tomas Mikolov's Word2vec, Stanford University's GloVe, AllenNLP's ELMo, fastText, Gensim, Indra, and Deeplearning4j. Principal Component Analysis (PCA) and T-Distributed Stochastic Neighbour Embedding (t-SNE) are both used to reduce the dimensionality of word vector spaces and to visualize word embeddings and clusters.

Thought vectors are an extension of word embeddings to entire sentences or even documents.

# GloVe: Global Vectors for Word Representation 
Introduction

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

1.   Nearest neighbors

The Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words. Sometimes, the nearest neighbors according to this metric reveal rare but relevant words that lie outside an average human's vocabulary. For example, the closest words to the target word frog are frogs, toad, litoria, leptodactylidae, rana, lizard, and eleutherodactylus.
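
As a rough sketch of how such a nearest-neighbor lookup works, the snippet below ranks words by cosine similarity. The three-dimensional vectors are hypothetical values invented for illustration; real GloVe vectors have 50 to 300 dimensions and are learned from a corpus.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy vectors, not taken from any trained model.
toy_vectors = {
    'frog': [0.9, 0.8, 0.1],
    'toad': [0.8, 0.9, 0.2],
    'car': [0.1, 0.2, 0.9],
}


def nearest(word, vectors):
    """Other words sorted from most to least similar to the given word."""
    others = [w for w in vectors if w != word]
    return sorted(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]), reverse=True)


print(nearest('frog', toy_vectors))
```

With these toy values both metrics agree that 'toad' is closer to 'frog' than 'car' is.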

2.   Linear substructures

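The similarity metrics used for nearest neighbor evaluations produce a single scalar that quantifies the relatedness of two words. This simplicity can be problematic, since two words almost always exhibit more intricate relationships than a single number can capture. A natural candidate for a richer description is the vector difference between the two word vectors.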


[Figures: vector-difference visualizations for man - woman, company - ceo, city - zip code, and comparative - superlative]

The underlying concept that distinguishes man from woman, i.e. sex or gender, may be equivalently specified by various other word pairs, such as king and queen or brother and sister. To state this observation mathematically, we might expect that the vector differences man - woman, king - queen, and brother - sister might all be roughly equal. This property and other interesting patterns can be observed in the above set of visualizations.
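
The expectation that these differences are roughly equal can be checked numerically. The sketch below uses hypothetical three-dimensional vectors invented for illustration (not real GloVe vectors) and compares the direction of man - woman with that of king - queen.

```python
import math


def difference(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical toy vectors: the second coordinate loosely plays the role of gender.
toy = {
    'man': [0.1, 0.9, 0.2],
    'woman': [0.1, -0.9, 0.2],
    'king': [0.9, 0.8, 0.3],
    'queen': [0.9, -0.9, 0.35],
}

man_woman = difference(toy['man'], toy['woman'])
king_queen = difference(toy['king'], toy['queen'])
alignment = cosine_similarity(man_woman, king_queen)
print(man_woman, king_queen, round(alignment, 4))
```

A cosine close to 1 means the two difference vectors point in nearly the same direction, which is the sense in which they are "roughly equal".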

## Phrase Detection Features -- gensim : word2vec embedding


sklearn_api.phrases – Scikit learn wrapper for phrase (collocation) detection

In [30]:
import nltk


from nltk.corpus import stopwords
from gensim.models import Word2Vec
import re
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\UMANG
[nltk_data]     PATEL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [90]:

paragraph = """I have three vision for India.In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""



# Preprocessing the data
# \s stands for “whitespace character”. Again, which characters this actually includes, depends on the regex flavor. 
# In all flavors discussed in this tutorial, it includes [ \t\r\n\f].

text = re.sub(r'\[[0-9]*\]',' ',paragraph)   #removing bracketed reference numbers such as [12]
text = re.sub(r'\s+',' ',text)               #collapsing runs of whitespace into a single space
text = text.lower()
text = re.sub(r'\d',' ',text)        #removing digits 
text = re.sub(r'\s+',' ',text)        #removing white spaces
# text = re.sub(r'[^a-z0-9]',' ',text)       #for removing special char
# text


In [42]:
# Preparing the dataset

sentences = nltk.sent_tokenize(text)               #convert to sentences 
 
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]        #sentences to words

stop_words = set(stopwords.words('english'))       #build the stop-word set once instead of once per word
for i in range(len(sentences)):
    sentences[i] = [word for word in sentences[i] if word not in stop_words]   #remove stop words
    

# Training the Word2Vec model
model = Word2Vec(sentences, min_count=1)       #min_count=1 keeps every word, even those seen only once; vectors are 100-dimensional by default


words = model.wv.vocab               #vocabulary (unique words across all sentences); in gensim 4.0+ use model.wv.key_to_index


# Finding Word Vectors
vector = model.wv['war']

# Most similar words
similar = model.wv.most_similar('history')
similar

[('dr.', 0.35709553956985474),
 ('dhawan', 0.30520105361938477),
 ('gdp', 0.1781352311372757),
 ('father', 0.17399102449417114),
 ('vikram', 0.155681312084198),
 ('development', 0.1513759344816208),
 ('four', 0.14021986722946167),
 ('nurture', 0.1370331346988678),
 ('growth', 0.13159319758415222),
 ('portuguese', 0.12773531675338745)]

## Word2Vec
Word2Vec is an efficient solution to the problems of one-hot word vectors, and it leverages the context of the target words. Essentially, we want to use the surrounding words to represent the target words with a neural network whose hidden layer encodes the word representation.  
There are two types of Word2Vec: Skip-gram and Continuous Bag of Words (CBOW).

Skip-gram

For skip-gram, the input is the target word, while the outputs are the words surrounding the target word. For instance, in the sentence “I have a cute dog”, the input would be “a”, whereas the outputs are “I”, “have”, “cute”, and “dog”, assuming a window of 5 words in total (the target plus two words on each side). All the input and output data are of the same dimension and one-hot encoded. The network contains one hidden layer whose dimension is equal to the embedding size, which is smaller than the input/output vector size.
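
The (input, output) training pairs described here can be generated with a short sketch. It reads the window of 5 as two words on each side of the target; the helper name `skipgram_pairs` and its `half_window` parameter are hypothetical, and this is not how gensim builds its batches internally.

```python
def skipgram_pairs(tokens, half_window):
    """Return (input word, output word) pairs for every position in the sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - half_window)
        end = min(len(tokens), i + half_window + 1)
        for j in range(start, end):
            if j != i:                      # the target is never its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = 'I have a cute dog'.split()
pairs = skipgram_pairs(tokens, half_window=2)
pairs_for_a = [p for p in pairs if p[0] == 'a']
print(pairs_for_a)
```

For the input “a” this yields the four pairs from the example: ('a', 'I'), ('a', 'have'), ('a', 'cute'), and ('a', 'dog').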


CBOW (Continuous Bag of Words)

CBOW is similar to skip-gram, except that the input and output are swapped: the surrounding words are the input and the target word is the output.  
The biggest difference between Skip-gram and CBOW is the way the word vectors are generated.  
For CBOW, all the examples with the target word as the target are fed into the network, and the extracted hidden-layer values are averaged.  
For example, assume we only have two sentences, “He is a nice guy” and “She is a wise queen”.
To compute the word representation for the word “a”, we need to feed these two examples, “He is nice guy” and “She is wise queen”, into the neural network and take the average of the values in the hidden layer.
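
A minimal sketch of this example: it collects the context words for each occurrence of the target and averages one hidden-layer vector per example. The whole sentence is treated as the context, as in the text; the hidden-layer values are hypothetical numbers invented for illustration, and the helper names are not part of any library.

```python
def cbow_examples(sentences, target):
    """Return the context words (the input) for each occurrence of the target (the output)."""
    examples = []
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            if word == target:
                examples.append(tokens[:i] + tokens[i + 1:])
    return examples


def average(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


sentences = ['He is a nice guy', 'She is a wise queen']
contexts = cbow_examples(sentences, target='a')
print(contexts)

# Hypothetical hidden-layer activations, one vector per training example.
hidden = [[0.25, 0.5], [0.75, 0.0]]
representation = average(hidden)
print(representation)
```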

In [49]:
#TED talk dataset: first, we download the dataset using urllib and extract the subtitles from the file

import numpy as np
import os
from random import shuffle
import re
import urllib.request
import zipfile
import lxml.etree


In [88]:
#download the data
# urllib.request.urlretrieve("https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip", filename="ted_en-20160408.zip")
# # extract subtitle
# with zipfile.ZipFile('ted_en-20160408.zip', 'r') as z:
#     doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
# input_text = '\n'.join(doc.xpath('//content/text()'))
# input_text

In [57]:
# # remove parenthesis 
# input_text_noparens = re.sub(r'\([^)]*\)', '', input_text)
# # store as list of sentences
# sentences_strings_ted = []
# for line in input_text_noparens.split('\n'):
#     m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
#     sentences_strings_ted.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)
# # store as list of lists of words
# sentences_ted = []
# for sent_str in sentences_strings_ted:
#     tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
#     sentences_ted.append(tokens)
# sentences_ted

In [89]:
# input_text_noparens = re.sub(r'\([^)]*\)', '', input_text)
# sentences_string_ted = nltk.sent_tokenize(input_text_noparens)
# sentences_ted_words  = [nltk.word_tokenize(sentence) for sentence in sentences_string_ted]


In [70]:
sentences_ted_words = []
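for sent_str in sentences_strings_ted:       #sentences_strings_ted is built by the commented-out extraction cell above
    tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()      #replace special symbols and punctuation marks (,.) with spaces, then split into words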
    sentences_ted_words.append(tokens)
# for i in range(len(sentences_ted_words)):
#     sentences_ted_words[i] = [word for word in sentences_ted_words[i] if word not in stopwords.words('english')]   #remove stop words
# sentences_ted_words[:2]



sentences: the list of tokenized sentences (each sentence is a list of words).

size: the dimensionality of the embedding vector (renamed vector_size in gensim 4.0 and later).

window: the maximum distance between the target word and the context words considered around it.

min_count: tells the model to ignore words with total count less than this number.

workers: the number of threads being used.

sg: whether to use skip-gram (sg=1) or CBOW (sg=0).
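
The effect of min_count can be illustrated without gensim. The sketch below is a hypothetical helper, not gensim's own code: it counts every word across all sentences and drops those whose total count falls below the threshold.

```python
from collections import Counter


def filter_by_min_count(sentences, min_count):
    """Drop every word whose total count across all sentences is below min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    kept = {word for word, count in counts.items() if count >= min_count}
    return [[word for word in sentence if word in kept] for sentence in sentences]


toy_sentences = [['we', 'respect', 'freedom'], ['we', 'protect', 'freedom'], ['we', 'build']]
filtered = filter_by_min_count(toy_sentences, min_count=2)
print(filtered)
```

Words removed this way receive no vector at all, which is why a high min_count on a small corpus can leave very little vocabulary.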

In [73]:
model_ted = Word2Vec(sentences=sentences_ted_words, size=100, window=5, min_count=5, workers=4, sg=0)


In [74]:
model_ted.wv.most_similar("man")

[('woman', 0.8511221408843994),
 ('guy', 0.8264572620391846),
 ('lady', 0.7827465534210205),
 ('girl', 0.7550215721130371),
 ('boy', 0.7485848665237427),
 ('gentleman', 0.7180063724517822),
 ('soldier', 0.7176931500434875),
 ('kid', 0.7104731202125549),
 ('poet', 0.6692278385162354),
 ('philosopher', 0.6434281468391418)]

Although Word2Vec successfully handles the issues posed by one-hot vectors, it has several limitations. The biggest challenge is that it cannot represent words that do not appear in the training dataset.

# FastText
FastText is an extension to Word2Vec proposed by Facebook in 2016. Instead of feeding individual words into the neural network, FastText breaks words into several n-grams (sub-words).  
For instance, the tri-grams for the word apple are app, ppl, and ple (ignoring the start and end boundaries of the word).  
The word embedding vector for apple will be the sum of the vectors of all these n-grams.  
After training the neural network, we will have word embeddings for all the n-grams in the training dataset.  
Rare words can now be properly represented, since it is highly likely that some of their n-grams also appear in other words.
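
The sub-word split can be sketched as follows. The helper `char_ngrams` is hypothetical; the optional '<' and '>' markers follow the boundary convention FastText uses, and the real library additionally hashes n-grams of several lengths into buckets, which this sketch leaves out.

```python
def char_ngrams(word, n=3, boundaries=False):
    """Return the character n-grams of a word, optionally with '<' and '>' boundary markers."""
    if boundaries:
        word = '<' + word + '>'
    return [word[i:i + n] for i in range(len(word) - n + 1)]


plain = char_ngrams('apple')
marked = char_ngrams('apple', boundaries=True)
print(plain)
print(marked)

# Tri-grams that a rare word shares with a more common one.
shared = sorted(set(char_ngrams('insidious')) & set(char_ngrams('audacious')))
print(shared)
```

Shared pieces such as 'iou' and 'ous' give a rare word a usable vector, which may also explain why its nearest neighbors often have the same suffix.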

In [77]:
from gensim.models import FastText

In [78]:
model_ted = FastText(sentences_ted_words, size=100, window=2, min_count=10, workers=4,sg=1)

In [87]:
model_ted.wv.most_similar("insidious")   #rarely used word that is not in the training vocabulary

[('audacious', 0.8744534254074097),
 ('dubious', 0.8740344047546387),
 ('ingenious', 0.8653150796890259),
 ('unambiguous', 0.8556569814682007),
 ('ludicrous', 0.8528010845184326),
 ('suspicious', 0.850383996963501),
 ('cautious', 0.8467897176742554),
 ('ambiguous', 0.8442314267158508),
 ('innocuous', 0.842936098575592),
 ('vicious', 0.8421852588653564)]