#  Approach 3: Doc2Vec Similar Search

This approach will treat each description as a document and convert the words into an embedding and the firms themselves into a vector embedding. Then we will just look at the most similar other firm embeddinings using cosine similarity. 

## Feature Extraction

Doc2vec (aka paragraph2vec, aka sentence embeddings) modifies the word2vec algorithm to unsupervised learning of continuous representations for larger blocks of text, such as sentences, paragraphs or entire documents. Note: distributed memory model is the default when running Doc2Vec

The algorithm runs through the sentences iterator twice: once to build the vocab, and once to train the model on the input data, learning a vector representation for each word and for each label in the dataset. Here the input data is the the firm's description and the label is the the firm's name. 

One caveat of the way this algorithm runs is that, since the learning rate decrease over the course of iterating over the data, labels which are only seen in a single LabeledSentence during training will only be trained with a fixed learning rate. This frequently produces less than optimal results. To obtain better results we will iterate over the data several times.

Note on space complexity: With the current implementation, all label vectors are stored separately in RAM. In this case, a unique firm name per description, causes memory usage to grow linearly with the size of the corpus.

## Measuring Similar in Doc2Vec

To measure the similarity of the firm, we use the Gensim's Doc2Vec built in 'most similar' function. This method computes cosine similarity between a simple mean of the projection weight vectors of the given docs. The default number of most similar returned by this function is 10 (topn=10).

In [1]:
# load modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec
import pandas as pd
import numpy
import re
import time
from random import shuffle
from nltk.corpus import stopwords

In [2]:
#functions to create Doc2Vec Similar Search
def train(data_source):
    start = time.time()
    ds = pd.read_csv(data_source, usecols=["name", "desc"])
    ds['desc']=ds['desc'].apply(lambda x: x.lower()) #convert to lower
    ds['desc']=ds['desc'].apply(lambda x: re.sub("[^a-zA-Z]"," ", x)) #remove all non leters
    ds['desc']=ds['desc'].apply(lambda x: x.replace("  "," ")) #replace double spaces
    print "Training data ingested in %s seconds." % (time.time() - start)
    start = time.time()
    frame=_train(ds)
    print "Engine trained in %s seconds." % (time.time() - start)
    return frame

def _train(ds):
    """
    create dataframe to hold firm name and most similar firms
    transform original data into LabeledSenetence object
    train Doc2Vec in iterations while dropping learning rate
    save Doc2Vec model
    print results to dataframe  
    """

    columns = ['name','nearest_firm']
    frame = pd.DataFrame(columns=columns)
    
    train_set=[]

    for idx, row in ds.iterrows():
        train_set.append(LabeledSentence([word for word in row['desc'].split() if
                                      word not in stopwords.words('english')],
                                     [row['name']]))
    
    #manually controlling the learning rate over the course of several iterations
    model = Doc2Vec(alpha=0.025, min_alpha=0.025, size=100)  # use fixed learning rate
    model.build_vocab(train_set)
    for epoch in range(20):
        model.train(train_set)
        model.alpha -= 0.002  # decrease the learning rate
        model.min_alpha = model.alpha  # fix the learning rate, no decay
        
    model.save('./firm2vec.d2v')
        
    for idx, row in ds.iterrows():
        #add to frame
        arr=[row['name'], model.docvecs.most_similar(row['name'])]
        frame.loc[len(frame)]=arr
    return frame

In [3]:
#some useful functions for Doc2vec
#model = Doc2Vec.load('./firm2vec.d2v') #load model
#model.most_similar('bio') #look at most similar word on trained model
#model.vocab #see the model vocab
#model.docvecs['Yantra'] #see vector embedding of a firm

In [4]:
#train, and create dataframe with 10 similar firms
df=train('companies.csv')

Training data ingested in 0.164169073105 seconds.
Engine trained in 57.4152131081 seconds.


In [5]:
#see an example output of df
df.head()

Unnamed: 0,name,nearest_firm
0,Octagen,"[(Destination Kiruna, 0.477032274008), (Perine..."
1,GeckoGo,"[(Snaptracs, 0.43338021636), (Mobile Travel Te..."
2,Yantra,"[(Hewlett-Packard, 0.507278323174), (Angiotech..."
3,Insider Pages,"[(Sports and Things, 0.384937077761), (Kontera..."
4,GrindMedia,"[(Homeworkcentral.com, 0.468018054962), (Spots..."


In [6]:
#load original dataframe for checking performance
ds = pd.read_csv('companies.csv', usecols=["name", "desc"]) 

In [7]:
#functions to check performance 
def checker(firm_name):
    ind=df[(df.name==firm_name)].nearest_firm[df[(df.name==firm_name)].index[0]]
    ind=[word[0] for word in ind]
    print firm_name
    print ds[(ds.name==firm_name)].desc[ds[(ds.name==firm_name)].index[0]]
    print '===============Neighbors==============='
    for i in ind:
        print i
        print ds[(ds.name==i)].desc[ds[(ds.name==i)].index[0]]
        print '-----------------------------'

In [8]:
#check Octagen
checker('Octagen')

Octagen
Operator of biopharmaceutical company. The company develops drugs for hemophilia, other genetic disorders and variations of recombinant B domain to avoid inactivation by flying below the radar screen of the immune system.
Destination Kiruna
Operator of a tourism company in the region of Kiruna. The company organizes events, guided tours, self-drive tours and offers holiday and tourism packages.
-----------------------------
Periness
Operator of a biotechnology company focusing on addressing the problem of male infertility. The company’s product is a systemic protein-based drug for treatment of male sub-fertility.
-----------------------------
Edge Therapeutics
Provider of therapeutic products for acute, fatal and debilitating medical conditions. The company develops implantable technology for direct delivery of therapeutic compounds to the site of brain injury.
-----------------------------
BMDSys Production
Developer of cardiac diagnostic imaging systems. The company uses magn

In [9]:
#check Yantra
checker('Yantra')

Yantra
Provider of distributed order management and supply chain fulfillment solutions for retail, distribution, logistics, and manufacturing industries. The company focuses on distributed order management, supply collaboration, inventory synchronization, reverse logistics, logistics management, networked warehouse management, and delivery and service scheduling. It also provides consulting and support services. It offers Yantra 7x products, a comprehensive group of software applications, which enable organizations to manage their fulfillment processes across customers, operations, suppliers, and partners.
Hewlett-Packard
HP Inc, formerly Hewlett-Packard Company was incorporated in 1947 under the laws of the State of California as the successor to a partnership founded in 1939 by William R. Hewlett and David Packard. Effective in May 1998, it changed its state of incorporation from California to Delaware. The Company is a provider of products, technologies, software, solutions and serv

In [10]:
#check Discovery Engine
checker('Discovery Engine')

Discovery Engine
Developer of an internet search engine. The company offers an interaction model of search engine that also can also compile information from multiple sources.
Eldat Communication
Developer of Electronic Shelf Label (ESL) systems. The company develops electronic defense systems and integrated circuit designs.
-----------------------------
Photonics Applications
Manufacturer of fiber-optic transmission technology for cable and wireless communications networks. The company also provide design, development and consulting services in fiber-optic transmission technology systems.
-----------------------------
Genesis Teleserv
Provider of integrated customer contact services. The company also integrates information management systems.
-----------------------------
Afferent Corporation
Developer of medical devices to treat chronic neurological dysfunction. The company\'s lead technology enhances the function of mechanoreceptor cells involved in sensory perception as a means of 

# Improving the Model

The Model uses Gensim's Doc2Vec for feature extraction and then a cosine measure to get the most similar firms. Possible improvements are:

- Tune the model: learn over more data, play with learning rate, play with stop words, etc.
- Mine more data: could create a webscraper with Selenium to collect more information about the firms; one thought is use crunchbase to get a competitor list; another is just get more description data about the firms to help the firm and word embeddings.