# Refinements

In [35]:
#load modules
import pandas as pd
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.manifold import TSNE
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt
import numpy as np

# Pretty display for notebooks
%matplotlib inline

In [36]:
# Functions to create the recommender
def train(data_source):
    # Load only the columns the engine needs
    start = time.time()
    ds = pd.read_csv(data_source, usecols=["name", "desc"])
    print("Training data ingested in %s seconds." % (time.time() - start))
    # Build the table of most-similar firms
    start = time.time()
    frame = _train(ds)
    print("Engine trained in %s seconds." % (time.time() - start))
    return frame

## Quad-grams

For this refinement, we will add quad-grams (four-word sequences) to the content-based recommender engine code by raising the upper bound of `ngram_range` from 3 to 4.
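To make the change concrete, here is a minimal pure-Python sketch of word n-gram extraction. The helper `word_ngrams` is hypothetical and much simpler than `TfidfVectorizer` (no stop-word removal, no tokenization rules); it only illustrates which extra features an upper bound of 4 introduces compared with 3.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Toy description, split on whitespace (an assumption; the real
# vectorizer also lowercases and drops stop words).
tokens = "developer of an internet search engine".split()

trigram_features = word_ngrams(tokens, 1, 3)
quadgram_features = word_ngrams(tokens, 1, 4)

# Features that only exist once quad-grams are enabled
new_features = [g for g in quadgram_features if g not in trigram_features]
print(new_features)
```

Only the four-word phrases are new, so two descriptions gain similarity from this refinement only when they share a run of four consecutive terms.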

In [37]:
def _train(ds):
    """
    Train the engine.
    Create a TF-IDF matrix of unigrams, bigrams, trigrams, and
    quad-grams for each firm. The 'stop_words' param tells the
    TF-IDF module to ignore common English words like 'the', etc.
    Then we compute similarity between all firms using
    scikit-learn's linear_kernel (which in this case is
    equivalent to cosine similarity, because TfidfVectorizer
    L2-normalizes each row by default).
    Iterate through each firm's similar items and store the
    10 most similar (the firm itself is included).
    Similarities and their scores are stored in the
    'content_recommended' column of the returned DataFrame.
    :param ds: A pandas DataFrame with 'name' and 'desc' columns
    """

    columns = ['name','content_recommended']
    frame = pd.DataFrame(columns=columns)
    
    tf = TfidfVectorizer(analyzer='word',
                         ngram_range=(1, 4),
                         min_df=0,
                         stop_words='english')
    tfidf_matrix = tf.fit_transform(ds['desc'])

    cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

    for idx, row in ds.iterrows():
        similar_indices = cosine_similarities[idx].argsort()[:-11:-1]
        similar_items = [(cosine_similarities[idx][i], ds['name'][i])
                         for i in similar_indices]

        # The first item is the firm itself; change the slice below
        # from [0:] to [1:] to remove it.
        # This 'sum' turns a list of tuples into a single tuple:
        # [(1,2), (3,4)] -> (1,2,3,4)
        flattened = sum(similar_items[0:], ())
        # Add to frame
        arr=[row['name'], flattened]
        frame.loc[len(frame)]=arr
    return frame
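The docstring's claim that a plain dot product is equivalent to cosine similarity holds when every vector has unit length, which is what `TfidfVectorizer` produces with its default `norm='l2'`. The sketch below checks the identity in pure Python on two made-up vectors; the helper names are hypothetical and stand in for the sparse-matrix operations scikit-learn performs.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two made-up term-weight vectors
a = [3.0, 0.0, 4.0]
b = [1.0, 2.0, 2.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print(cosine(a, b))         # cosine similarity of the raw vectors
print(dot(a_unit, b_unit))  # dot product of the unit-length vectors
```

Both prints give the same value up to floating-point rounding, which is why the cheaper dot product can stand in for cosine similarity on normalized TF-IDF rows.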

In [38]:
#train, and create dataframe with 10 recommended firms
df=train('companies.csv')

Training data ingested in 0.0337560176849 seconds.
Engine trained in 15.4151630402 seconds.


In [39]:
#load original dataframe for checking performance
ds = pd.read_csv("companies.csv", usecols=["name", "desc"])

In [40]:
# Function to check performance
def checker(firm_name):
    # Flattened tuple (score, name, score, name, ...) stored for this firm
    ind = df.loc[df.name == firm_name, 'content_recommended'].iloc[0]
    ind = ind[1::2]  # take every other item, i.e. the neighbor names
    print(firm_name)
    print(ds.loc[ds.name == firm_name, 'desc'].iloc[0])
    print('===============Neighbors===============')
    for i in ind:
        print(i)
        print(ds.loc[ds.name == i, 'desc'].iloc[0])
        print('-----------------------------')
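The `[1::2]` slice in `checker` only makes sense given how `_train` stores its results: each list of `(score, name)` pairs is flattened into one tuple. A small sketch of that round trip follows; the firm names come from the output in this notebook, but the scores are made up for illustration.

```python
# (score, name) pairs as _train builds them; scores are invented here
similar_items = [(1.0, "Octagen"), (0.31, "Flex Pharma"), (0.27, "Alkermes")]

# sum() with an empty-tuple start concatenates the pairs into one tuple
flattened = sum(similar_items, ())

# Scores sit at even positions, names at odd positions
scores = flattened[0::2]
names = flattened[1::2]

print(flattened)
print(names)
```

Keeping the pairs as a list of tuples would avoid the positional slicing, but the flattened form matches what the `content_recommended` column holds.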

In [41]:
#check Octagen
checker('Octagen')

Octagen
Operator of biopharmaceutical company. The company develops drugs for hemophilia, other genetic disorders and variations of recombinant B domain to avoid inactivation by flying below the radar screen of the immune system.
Octagen
Operator of biopharmaceutical company. The company develops drugs for hemophilia, other genetic disorders and variations of recombinant B domain to avoid inactivation by flying below the radar screen of the immune system.
-----------------------------
Flex Pharma
Operator of a biopharmaceutical company. The company develops clinically proven products and treatments for muscle cramps and spasms.
-----------------------------
Alkermes
Operator of a biopharmaceutical company. The company develops products based on drug-delivery technologies to enhance therapeutic outcomes in major diseases.
-----------------------------
Twinstrand Therapeutics
Operator of biopharmaceutical company. The company engages in the discovery, development and commercialization of

In [42]:
#check Yantra
checker('Yantra')

Yantra
Provider of distributed order management and supply chain fulfillment solutions for retail, distribution, logistics, and manufacturing industries. The company focuses on distributed order management, supply collaboration, inventory synchronization, reverse logistics, logistics management, networked warehouse management, and delivery and service scheduling. It also provides consulting and support services. It offers Yantra 7x products, a comprehensive group of software applications, which enable organizations to manage their fulfillment processes across customers, operations, suppliers, and partners.
Yantra
Provider of distributed order management and supply chain fulfillment solutions for retail, distribution, logistics, and manufacturing industries. The company focuses on distributed order management, supply collaboration, inventory synchronization, reverse logistics, logistics management, networked warehouse management, and delivery and service scheduling. It also provides con

In [43]:
#check Discovery Engine
checker('Discovery Engine')

Discovery Engine
Developer of an internet search engine. The company offers an interaction model of search engine that also can also compile information from multiple sources.
Discovery Engine
Developer of an internet search engine. The company offers an interaction model of search engine that also can also compile information from multiple sources.
-----------------------------
Peryskop.pl
Provider of an online search engine. The company provides semantic search engine for products and product\'s reviews in Polish and English.
-----------------------------
JustSpotted
Provider of real time search engine. The company\'s search engine aggregates and organizes content being shared on the internet. It offers search options on entertainment, technology, sports, world and business, science, gaming, politics and lifestyle topics.
-----------------------------
Zoomf
Provider of a residential property sales and letting search engine. The company also offers consumers local market intelligence 

## Appending the labels to the description

For this refinement, we will append each firm's labels (the `keywords` column) to its description. This gives the TF-IDF vectorizer a bit more text per firm, which will hopefully lead to better similar-company results.
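As a hedged illustration of the concatenation step, the sketch below uses a toy DataFrame with hypothetical firm names and keywords. It also assumes some firms may have no keywords, which this notebook does not state; `fillna("")` keeps such rows from turning into missing values after the concatenation.

```python
import pandas as pd

# Toy data; firm names and keywords are invented for illustration
toy = pd.DataFrame({
    "name": ["Firm A", "Firm B"],
    "desc": ["Developer of an internet search engine.",
             "Operator of a biopharmaceutical company."],
    "keywords": ["search engine", None],
})

# Append keywords to the description; empty string where none exist
toy["desc"] = (toy["desc"] + " " + toy["keywords"].fillna("")).str.strip()

# The keywords column is no longer needed once it is merged in
toy = toy.drop("keywords", axis=1)

print(toy["desc"].tolist())
```

Because keywords are appended as ordinary text, they pass through the same n-gram and stop-word handling as the description itself.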

In [47]:
# Functions to create the recommender
def train(data_source):
    start = time.time()
    ds = pd.read_csv(data_source)
    # Append each firm's keywords to its description, then drop the column
    ds['desc'] = ds['desc'] + ' ' + ds['keywords']
    ds = ds.drop('keywords', axis=1)
    print("Training data ingested in %s seconds." % (time.time() - start))
    start = time.time()
    frame = _train(ds)
    print("Engine trained in %s seconds." % (time.time() - start))
    return frame

def _train(ds):
    """
    Train the engine.
    Create a TF-IDF matrix of unigrams, bigrams, and trigrams
    for each firm. The 'stop_words' param tells the TF-IDF
    module to ignore common English words like 'the', etc.
    Then we compute similarity between all firms using
    scikit-learn's linear_kernel (which in this case is
    equivalent to cosine similarity, because TfidfVectorizer
    L2-normalizes each row by default).
    Iterate through each firm's similar items and store the
    10 most similar (the firm itself is included).
    Similarities and their scores are stored in the
    'content_recommended' column of the returned DataFrame.
    :param ds: A pandas DataFrame with at least 'name' and 'desc' columns
    """

    columns = ['name','content_recommended']
    frame = pd.DataFrame(columns=columns)
    
    tf = TfidfVectorizer(analyzer='word',
                         ngram_range=(1, 3),
                         min_df=0,
                         stop_words='english')
    tfidf_matrix = tf.fit_transform(ds['desc'])

    cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

    for idx, row in ds.iterrows():
        similar_indices = cosine_similarities[idx].argsort()[:-11:-1]
        similar_items = [(cosine_similarities[idx][i], ds['name'][i])
                         for i in similar_indices]

        # The first item is the firm itself; change the slice below
        # from [0:] to [1:] to remove it.
        # This 'sum' turns a list of tuples into a single tuple:
        # [(1,2), (3,4)] -> (1,2,3,4)
        flattened = sum(similar_items[0:], ())
        # Add to frame
        arr=[row['name'], flattened]
        frame.loc[len(frame)]=arr
    return frame
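The slice `argsort()[:-11:-1]` used in `_train` is compact but easy to misread: `argsort` orders indices from lowest to highest score, and the negative-step slice walks back from the end to pick the top ten. The pure-Python sketch below mimics that behavior with a hypothetical helper and a made-up row of similarity scores; it is not the NumPy implementation, and tie ordering may differ.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    # Ascending order of indices by score, like numpy's argsort
    ascending = sorted(range(len(scores)), key=lambda i: scores[i])
    # Walk backwards from the end and stop after k items
    return ascending[:-(k + 1):-1]

# Made-up similarity row; index 1 plays the role of the firm itself
row = [0.10, 1.00, 0.05, 0.42, 0.37]

print(top_k_indices(row, 3))
```

With `k = 10` the stop index becomes `-11`, which is exactly the slice in `_train`; the firm itself comes first because its similarity to itself is the maximum.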

In [48]:
#train, and create dataframe with 10 recommended firms
df=train('companies.csv')

Training data ingested in 0.261860132217 seconds.
Engine trained in 15.1754980087 seconds.


In [49]:
#load original dataframe for checking performance
ds = pd.read_csv("companies.csv", usecols=["name", "desc"])

In [50]:
#check Octagen
checker('Octagen')

Octagen
Operator of biopharmaceutical company. The company develops drugs for hemophilia, other genetic disorders and variations of recombinant B domain to avoid inactivation by flying below the radar screen of the immune system.
Octagen
Operator of biopharmaceutical company. The company develops drugs for hemophilia, other genetic disorders and variations of recombinant B domain to avoid inactivation by flying below the radar screen of the immune system.
-----------------------------
Flex Pharma
Operator of a biopharmaceutical company. The company develops clinically proven products and treatments for muscle cramps and spasms.
-----------------------------
Twinstrand Therapeutics
Operator of biopharmaceutical company. The company engages in the discovery, development and commercialization of biological drugs for the treatment of life threatening diseases.
-----------------------------
Alkermes
Operator of a biopharmaceutical company. The company develops products based on drug-deliver

In [51]:
#check Yantra
checker('Yantra')

Yantra
Provider of distributed order management and supply chain fulfillment solutions for retail, distribution, logistics, and manufacturing industries. The company focuses on distributed order management, supply collaboration, inventory synchronization, reverse logistics, logistics management, networked warehouse management, and delivery and service scheduling. It also provides consulting and support services. It offers Yantra 7x products, a comprehensive group of software applications, which enable organizations to manage their fulfillment processes across customers, operations, suppliers, and partners.
Yantra
Provider of distributed order management and supply chain fulfillment solutions for retail, distribution, logistics, and manufacturing industries. The company focuses on distributed order management, supply collaboration, inventory synchronization, reverse logistics, logistics management, networked warehouse management, and delivery and service scheduling. It also provides con

In [52]:
#check Discovery Engine
checker('Discovery Engine')

Discovery Engine
Developer of an internet search engine. The company offers an interaction model of search engine that also can also compile information from multiple sources.
Discovery Engine
Developer of an internet search engine. The company offers an interaction model of search engine that also can also compile information from multiple sources.
-----------------------------
WiseNut
Provider of search engine and Web-browsing services. The company is the developer of a crawler-based search engine and database of indexed Web pages.
-----------------------------
Peryskop.pl
Provider of an online search engine. The company provides semantic search engine for products and product\'s reviews in Polish and English.
-----------------------------
JustSpotted
Provider of real time search engine. The company\'s search engine aggregates and organizes content being shared on the internet. It offers search options on entertainment, technology, sports, world and business, science, gaming, politic