# Approach 1: Naive Nearest Neighbor Search

The first approach establishes a baseline with an extremely simple nearest neighbor search. That is, we do a very naive feature extraction and build each firm's list of similar companies by taking the firms whose feature vectors are closest to it.

## Feature Extraction

The features are extracted from the firms' descriptions. They are just a bag of words, i.e. a count of each word in the description. We do not drop any stopwords or create any n-grams (except, of course, the unigrams that are the words themselves).
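As a minimal sketch of what a bag of words looks like, the snippet below runs `CountVectorizer` with its defaults on two toy descriptions. The descriptions are made up for illustration and are not taken from `companies.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy descriptions (hypothetical, not taken from companies.csv)
docs = [
    "operator of an online travel website",
    "operator of an online music store",
]

vectorizer = CountVectorizer()           # defaults: unigrams, no stop-word removal
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per description

vocab = sorted(vectorizer.vocabulary_)   # learned tokens, in column order
print(vocab)
print(counts.toarray())                  # word counts per description
```

Each column corresponds to one word of the learned vocabulary, so stopwords such as "of" and "an" get their own columns just like any other word.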

## Measuring Similarity in Nearest Neighbor Search

We use the default parameters (except for the number of neighbors, which we set to 10) of scikit-learn's unsupervised learner for neighbor searches (sklearn.neighbors.NearestNeighbors).
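To illustrate how the neighbor search is queried, here is a small sketch on toy count vectors. The matrix `X` is hypothetical, and `n_neighbors=2` is used only to keep the output short.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy count vectors (hypothetical); rows are firms, columns are word counts
X = np.array([
    [2, 0, 1],
    [2, 0, 0],
    [0, 3, 0],
    [0, 2, 1],
])

neigh = NearestNeighbors(n_neighbors=2)  # the notebook uses n_neighbors=10
neigh.fit(X)

# kneighbors returns (distances, indices), each of shape (n_queries, n_neighbors)
distances, indices = neigh.kneighbors(X[:1])
print(indices[0])    # row indices of the firms nearest to firm 0
print(distances[0])  # the corresponding distances
```

Because the query vector is itself part of the fitted data, its closest neighbor is the firm itself at distance 0. This is why every firm appears first in its own neighbor list in the results.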

Note: The distance here is Minkowski with scikit-learn's default order p = 2, which is the Euclidean distance. Minkowski distance is a metric on a normed vector space that generalizes both the Euclidean distance (p = 2) and the Manhattan distance (p = 1).
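The Minkowski distance of order p between two vectors is the p-th root of the sum of the absolute coordinate differences raised to the power p. A small sketch with a hypothetical helper `minkowski` (written here for illustration, not part of the notebook) shows how p = 1 and p = 2 recover the two familiar distances:

```python
import numpy as np

def minkowski(u, v, p):
    """Minkowski distance of order p between two equal-length vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))

a = [1, 0, 2]
b = [0, 2, 0]

manhattan = minkowski(a, b, p=1)  # 1 + 2 + 2 = 5
euclidean = minkowski(a, b, p=2)  # sqrt(1 + 4 + 4) = 3
```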

In [1]:
#load modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors
import nltk
from nltk.stem import PorterStemmer
import string
import time
from sklearn.manifold import TSNE

# Pretty display for notebooks
%matplotlib inline

In [2]:
#functions to create naive nearest neighbor search
def train(data_source):
    start = time.time()
    ds = pd.read_csv(data_source, usecols=["name", "desc"])
    ds['desc']=ds['desc'].str.lower() #convert to lower
    print("Training data ingested in %s seconds." % (time.time() - start))
    start = time.time()
    frame=_train(ds)
    print("Engine trained in %s seconds." % (time.time() - start))
    return frame

def _train(ds):
    """
    create dataframe to hold firm name and most similar firms
    store org data into two lists
    vectorize desc with simple count (bow) with no drop of stopwords
    perform a NN search where k =10 for each firm
    print results to dataframe  
    """

    columns = ['name','nearest_neighbor', 'features']
    frame = pd.DataFrame(columns=columns)
    
    summ = list(ds['desc']) #desc in list
    names = list(ds['name']) #names in list

    # Now, we want to convert the raw text in our desc to a "bag of words" vector
    # To do that, we use the CountVectorizer
    vectorizer = CountVectorizer()

    # first, we "teach" the vectorizer which tokens to vectorize on
    vectorizer.fit(summ)
    # then we vectorize the descriptions (sparse count matrix)
    summ_features = vectorizer.transform(summ)

    #print vectorizer.get_feature_names()
    # dense copy of the counts, used for the per-firm queries below
    x=summ_features.toarray()

    neigh = NearestNeighbors(n_neighbors=10)
    neigh.fit(summ_features) 
        
    for idx, row in ds.iterrows():
        #add to frame
        nn=neigh.kneighbors([x[idx]])
        nn=nn[1][0]
        arr=[row['name'], [names[n] for n in nn], x[idx]]
        frame.loc[len(frame)]=arr
    return frame

In [3]:
#train, and create dataframe with 10 nearest neighbors
df=train('companies.csv')

Training data ingested in 0.363903045654 seconds.
Engine trained in 20.6557528973 seconds.


In [4]:
#see an example output of df
df.head()

| | name | nearest_neighbor | features |
|---|---|---|---|
| 0 | Octagen | [Octagen, Twinstrand Therapeutics, Axovan, Zet... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 1 | GeckoGo | [GeckoGo, Truth Soft, DealAngel, PortugalRes, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 2 | Yantra | [Yantra, Factory Logic, Valdero, GEOCOMtms, Me... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 3 | Insider Pages | [Insider Pages, Tinmar Holdings, Sprockets, Lo... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 4 | GrindMedia | [GrindMedia, Atomic Moguls, Viva! Vision, NuCo... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |


In [5]:
#load original dataframe for checking performance
ds = pd.read_csv("companies.csv", usecols=["name", "desc"])

In [6]:
ds.head()

| | name | desc |
|---|---|---|
| 0 | Octagen | Operator of biopharmaceutical company. The com... |
| 1 | GeckoGo | Operator of an online travel website. The comp... |
| 2 | Yantra | Provider of distributed order management and s... |
| 3 | Insider Pages | Operator of online directory and reviews site ... |
| 4 | GrindMedia | Provider of online action sports and entertain... |


In [7]:
ds['desc'][0]


'Operator of biopharmaceutical company. The company develops drugs for hemophilia, other genetic disorders and variations of recombinant B domain to avoid inactivation by flying below the radar screen of the immune system.'

In [14]:
#functions to check performance 
def checker(firm_name):
    # names of the stored nearest neighbors for this firm
    ind=df[df.name==firm_name].nearest_neighbor.iloc[0]
    print(firm_name)
    print(ds[ds.name==firm_name].desc.iloc[0])
    print('===============Neighbors===============')
    for i in ind:
        # print each neighbor's name followed by its description
        print(i)
        print(ds[ds.name==i].desc.iloc[0])
        print('-----------------------------')

# Checking Performance

For this problem, I implement three separate models. To understand both which implementation may be superior and whether an implementation makes sense at all, I create a function that prints the name and description of the firm we want to look at, then prints a list of the nearest firms and their descriptions. The idea is to read about the other firms and see if they intuitively seem similar.

To compare performance across the three models, I have chosen three firms at random: Octagen, Yantra, and Discovery Engine. In each implementation, we check performance on the same three companies. Of course, you can check additional firms by running the "checker" function.

The discussion of the models' performance is in the capstone project report under 'Results'.

In [15]:
#check Octagen
checker('Octagen')

Octagen
Operator of biopharmaceutical company. The company develops drugs for hemophilia, other genetic disorders and variations of recombinant B domain to avoid inactivation by flying below the radar screen of the immune system.
Octagen
Operator of biopharmaceutical company. The company develops drugs for hemophilia, other genetic disorders and variations of recombinant B domain to avoid inactivation by flying below the radar screen of the immune system.
-----------------------------
Twinstrand Therapeutics
Operator of biopharmaceutical company. The company engages in the discovery, development and commercialization of biological drugs for the treatment of life threatening diseases.
-----------------------------
Axovan
Operator of a biopharmaceutical research company. The company is involved in the discovery of drugs linked to G protein-coupled receptors.
-----------------------------
ZetaRx Biosciences
Operator of biotechnology company. The company engages in the development of thera

In [16]:
#check Yantra
checker('Yantra')

Yantra
Provider of distributed order management and supply chain fulfillment solutions for retail, distribution, logistics, and manufacturing industries. The company focuses on distributed order management, supply collaboration, inventory synchronization, reverse logistics, logistics management, networked warehouse management, and delivery and service scheduling. It also provides consulting and support services. It offers Yantra 7x products, a comprehensive group of software applications, which enable organizations to manage their fulfillment processes across customers, operations, suppliers, and partners.
Yantra
Provider of distributed order management and supply chain fulfillment solutions for retail, distribution, logistics, and manufacturing industries. The company focuses on distributed order management, supply collaboration, inventory synchronization, reverse logistics, logistics management, networked warehouse management, and delivery and service scheduling. It also provides con

In [17]:
#check Discovery Engine
checker('Discovery Engine')

Discovery Engine
Developer of an internet search engine. The company offers an interaction model of search engine that also can also compile information from multiple sources.
Discovery Engine
Developer of an internet search engine. The company offers an interaction model of search engine that also can also compile information from multiple sources.
-----------------------------
Zebido.com
Operator of an auction website.
-----------------------------
ES Enterprise Solutions
Provider of an online software service.
-----------------------------
Realtime Worlds
Developer of video games.
-----------------------------
La La Media
Operator of an online music store.
-----------------------------
Peryskop.pl
Provider of an online search engine. The company provides semantic search engine for products and product's reviews in Polish and English.
-----------------------------
PlanarMag
Developer of planar electromagnetic components.
-----------------------------
Xaar
Developer of ink jet techno

In [18]:
#prepare t-sne, commented out because it is slow and performs poorly
#feat=[list(df['features'][i]) for i in range(len(df['features']))]
#feats = np.array(feat)
#model = TSNE(n_components=2, random_state=0)
#np.set_printoptions(suppress=True)


In [19]:
#takes a while, apply t-sne (reduce features dim to 2)
#feats2d=model.fit_transform(feats) 

In [20]:
def plot(firm_name):
    plt.clf()
    names1=list(df['name'])
    #names=[n.encode('utf-8') for n in names]
    x1=[feats2d[i][0] for i in range(len(names1))]
    y1=[feats2d[i][1] for i in range(len(names1))]
    nn=df[(df.name==firm_name)].nearest_neighbor[df[(df.name==firm_name)].index[0]]
    
    x=[]
    y=[]
    names=[]
    
    for n in nn:
        for i, name in enumerate(names1):
            if n==name:
                names.append(name)
                x.append(x1[i])
                y.append(y1[i])  
                
    #plot scatter plot
    plt.scatter(x,y,color='green')
    
    for i, name in enumerate(names):
        try:
            plt.annotate(name, (x[i],y[i]))
        except Exception:
            # report the name that could not be annotated
            print(name)

    #make titles
    plt.title("t-sne plot")
    plt.xlabel("X")
    plt.ylabel("Y")

    plt.show()

In [21]:
#t-sne doesn't look very good
#plot('Yantra')

# Improving the Model

Since this is a very naive case, there are a lot of potential improvements. I will list them in no particular order.

- Use a different method for feature extraction: TF-IDF or some word vectorization (like word2vec)
- Use n-grams to capture phrases
- Change the distance measure: possibly cosine or Jaccard
- Utilize keywords: either append them to the description in some way or have a separate feature extraction process for them.
- Mine more data: could create a web scraper with Selenium to collect more information about the firms; one thought is to use Crunchbase to get a competitor list.
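
Several of these ideas can be combined in a few lines. The following is only a rough sketch on toy descriptions (hypothetical, not taken from `companies.csv`), assuming TF-IDF weights, English stop-word removal, unigrams plus bigrams, and cosine distance. It is not the implementation used in the later approaches.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy descriptions (hypothetical, not taken from companies.csv)
docs = [
    "developer of an internet search engine",
    "provider of an online search engine for products",
    "developer of video games",
    "operator of an online music store",
]

# TF-IDF weights, English stop words removed, unigrams and bigrams
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
features = vectorizer.fit_transform(docs)

# Cosine distance is not supported by the tree-based algorithms,
# so brute-force search is requested explicitly
neigh = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
neigh.fit(features)

# Query with the first description; its first neighbor is itself
distances, indices = neigh.kneighbors(features[0])
nearest = [docs[i] for i in indices[0]]
```

In this toy example the search-engine description is matched with the other search-engine description rather than with the other "developer" firm, because the shared content words and the shared bigram outweigh the single shared role word.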