# Approach 2: Sensitivity Analysis and Robustness Check

In this notebook, I change a few parameters of approach 2 and check how much each change degrades its results. The changes are:
- Bag of Words instead of TF-IDF
- A smaller data set: 3000 firms instead of 5000
- Using only a subset of each description to extract features (that is, testing the effect of a smaller description space)

In [4]:
#load modules
import pandas as pd
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer

# Bag of Words instead of TF-IDF

In [5]:
#functions to create recommender
def train(data_source):
    start = time.time()
    ds = pd.read_csv(data_source, usecols=["name", "desc"])
    print("Training data ingested in %s seconds." % (time.time() - start))
    start = time.time()
    frame = _train(ds)
    print("Engine trained in %s seconds." % (time.time() - start))
    return frame

def _train(ds):
    """
    using BOW instead of TFIDF
    """

    columns = ['name','content_recommended']
    frame = pd.DataFrame(columns=columns)
    
    bow = CountVectorizer(analyzer='word',
                         ngram_range=(1, 3),
                         min_df=0,
                         stop_words='english')
    bow_matrix = bow.fit_transform(ds['desc'])

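    # NOTE: linear_kernel is a plain dot product. CountVectorizer rows are not
    # L2-normalized (unlike TfidfVectorizer's default output), so these scores
    # are not true cosine similarities and can favor long descriptions.
    cosine_similarities = linear_kernel(bow_matrix, bow_matrix)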

    for idx, row in ds.iterrows():
        similar_indices = cosine_similarities[idx].argsort()[:-11:-1]
        similar_items = [(cosine_similarities[idx][i], ds['name'][i])
                         for i in similar_indices]

        # The first item is the item itself; change the slice start from 0 to 1 to remove it.
        # This 'sum' turns a list of tuples into a single tuple:
        # [(1,2), (3,4)] -> (1,2,3,4)
        flattened = sum(similar_items[0:], ())
        #add to frame
        arr=[row['name'], flattened]
        frame.loc[len(frame)]=arr
    return frame
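
One caveat when swapping in `CountVectorizer`: `linear_kernel` computes plain dot products. `TfidfVectorizer` L2-normalizes each row by default, so there the dot product equals cosine similarity, but raw count vectors are not normalized. The sketch below uses made-up toy descriptions (not rows from `companies.csv`) to show how the two scores can rank neighbors differently. Using `cosine_similarity` here is a suggested alternative, not what this notebook ran.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy descriptions (hypothetical, not taken from companies.csv).
docs = [
    "biopharmaceutical company",  # query
    "biopharmaceutical company",  # identical to the query
    # long description that repeats the query terms and adds unrelated ones
    "biopharmaceutical company biopharmaceutical company software retail logistics",
]

counts = CountVectorizer().fit_transform(docs)

dot = linear_kernel(counts, counts)      # raw dot products of count vectors
cos = cosine_similarity(counts, counts)  # dot products of L2-normalized rows

# Scores of the query (row 0) against every document.
print(dot[0])  # the long document gets the largest dot product
print(cos[0])  # the identical document gets the largest cosine similarity
```

If length bias turns out to matter, replacing `linear_kernel` with `cosine_similarity` in the Bag of Words variant would make its scores comparable to the TF-IDF ones.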

In [6]:
#train, and create dataframe with 10 recommended firms
df=train('companies.csv')

Training data ingested in 0.0357511043549 seconds.
Engine trained in 13.7227859497 seconds.


In [7]:
#load original dataframe for checking performance
ds = pd.read_csv("companies.csv", usecols=["name", "desc"])

In [8]:
#functions to check performance 
def checker(firm_name):
    # content_recommended is a flat tuple: (score, name, score, name, ...)
    rec = df[df.name == firm_name].content_recommended.iloc[0]
    neighbors = rec[1::2]  # keep the names, skip the scores
    print(firm_name)
    print(ds[ds.name == firm_name].desc.iloc[0])
    print('===============Neighbors===============')
    for name in neighbors:
        print(name)
        print(ds[ds.name == name].desc.iloc[0])
        print('-----------------------------')
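
`checker` relies on the layout of `content_recommended`: `_train` flattens the `(score, name)` pairs into one tuple, so the firm names sit at the odd positions. A small illustration with made-up scores:

```python
# Hypothetical (score, name) pairs, in the shape _train builds them.
similar_items = [(1.0, "Octagen"), (0.42, "Flex Pharma"), (0.37, "Sterix")]

# sum(..., ()) concatenates the tuples into one flat tuple.
flattened = sum(similar_items, ())

names = flattened[1::2]   # odd positions: firm names
scores = flattened[0::2]  # even positions: similarity scores

print(names)
print(scores)
```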

In [9]:
#check Octagen
checker('Octagen')

Octagen
Operator of biopharmaceutical company. The company develops drugs for hemophilia, other genetic disorders and variations of recombinant B domain to avoid inactivation by flying below the radar screen of the immune system.
Octagen
Operator of biopharmaceutical company. The company develops drugs for hemophilia, other genetic disorders and variations of recombinant B domain to avoid inactivation by flying below the radar screen of the immune system.
-----------------------------
Cephalon
Cephalon, Inc. is an international biopharmaceutical company dedicated to the discovery, development and commercialization of innovative products in four core therapeutic areas: central nervous system, pain, oncology and inflammatory disease. In addition to conducting an active research and development program, the Company markets seven proprietary products in the United States and numerous products in various countries throughout Europe and the world. The Company\'s most significant products are

In [10]:
#check Yantra
checker('Yantra')

Yantra
Provider of distributed order management and supply chain fulfillment solutions for retail, distribution, logistics, and manufacturing industries. The company focuses on distributed order management, supply collaboration, inventory synchronization, reverse logistics, logistics management, networked warehouse management, and delivery and service scheduling. It also provides consulting and support services. It offers Yantra 7x products, a comprehensive group of software applications, which enable organizations to manage their fulfillment processes across customers, operations, suppliers, and partners.
Yantra
Provider of distributed order management and supply chain fulfillment solutions for retail, distribution, logistics, and manufacturing industries. The company focuses on distributed order management, supply collaboration, inventory synchronization, reverse logistics, logistics management, networked warehouse management, and delivery and service scheduling. It also provides con

In [11]:
#check Discovery Engine
checker('Discovery Engine')

Discovery Engine
Developer of an internet search engine. The company offers an interaction model of search engine that also can also compile information from multiple sources.
Discovery Engine
Developer of an internet search engine. The company offers an interaction model of search engine that also can also compile information from multiple sources.
-----------------------------
Hurra Communications
Provider of integrated online marketing services. The company provides products and services on the subjects of search engine and dialogue marketing. The services provided by the company include search engine advertising, search engine optimization, conversation optimization, web analytics, performance display advertising and Facebook advertising.
-----------------------------
JustSpotted
Provider of real time search engine. The company\'s search engine aggregates and organizes content being shared on the internet. It offers search options on entertainment, technology, sports, world and bus

# A decreased data set: 3000 instead of 5000
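
The 3000-firm file is produced by the commented-out `sample` lines in the cell below. One detail worth noting: `sample` keeps the original row labels, while `_train` indexes the similarity matrix with the labels from `iterrows()`, so the frame needs a 0..n-1 index. Writing to CSV and reading back with `usecols` achieves that implicitly. The sketch below does it explicitly on a toy frame (hypothetical rows, not the real data):

```python
import pandas as pd

# Toy stand-in for companies.csv (hypothetical rows).
full = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "desc": ["a a", "b b", "c c", "d d", "e e"],
})

# Reproducible random subset, as in the notebook (n=3 here instead of 3000).
subset = full.sample(n=3, random_state=13)

# sample() keeps the original labels; reset them so that positions in the
# similarity matrix line up with the labels returned by iterrows().
subset = subset.reset_index(drop=True)

print(subset)
```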

In [33]:
#using 3000 firms instead of 5k
#ds = pd.read_csv("companies.csv", usecols=["name", "desc"])
#ds=ds.sample(n=3000, random_state=13) #use only 3000
#ds.to_csv('3k.csv')

def train(data_source):
    start = time.time()
    ds = pd.read_csv(data_source, usecols=["name", "desc"])
    print("Training data ingested in %s seconds." % (time.time() - start))
    start = time.time()
    frame = _train(ds)
    print("Engine trained in %s seconds." % (time.time() - start))
    return frame

def _train(ds):
    """
    using a subset of data
    """

    columns = ['name','content_recommended']
    frame = pd.DataFrame(columns=columns)
    
    tf = TfidfVectorizer(analyzer='word',
                         ngram_range=(1, 3),
                         min_df=0,
                         stop_words='english')
    tfidf_matrix = tf.fit_transform(ds['desc'])

    cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

    for idx, row in ds.iterrows():
        similar_indices = cosine_similarities[idx].argsort()[:-11:-1]
        similar_items = [(cosine_similarities[idx][i], ds['name'][i])
                         for i in similar_indices]

        # The first item is the item itself; change the slice start from 0 to 1 to remove it.
        # This 'sum' turns a list of tuples into a single tuple:
        # [(1,2), (3,4)] -> (1,2,3,4)
        flattened = sum(similar_items[0:], ())
        #add to frame
        arr=[row['name'], flattened]
        frame.loc[len(frame)]=arr
    return frame
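
The slice `argsort()[:-11:-1]` in `_train` picks the ten highest-scoring row indices in descending order: `argsort` sorts ascending, and the negative-step slice walks back from the end. A toy check with NumPy (the scores are made up):

```python
import numpy as np

scores = np.array([0.1, 0.9, 0.3, 1.0, 0.5])  # hypothetical similarity row

# Same idiom as in _train, but keeping the top 3 instead of the top 10.
top3 = scores.argsort()[:-4:-1]

print(top3)          # indices of the three largest scores, best first
print(scores[top3])  # the corresponding scores
```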

In [34]:
#train, and create dataframe with 10 recommended firms
df=train('3k.csv')

Training data ingested in 0.014641046524 seconds.
Engine trained in 8.0909318924 seconds.


In [None]:
#load 3k dataframe for checking performance
ds = pd.read_csv('3k.csv', usecols=["name", "desc"])

In [35]:
#check Octagen
checker('Octagen')

Octagen
Operator of biopharmaceutical company. The company develops drugs for hemophilia, other genetic disorders and variations of recombinant B domain to avoid inactivation by flying below the radar screen of the immune system.
Octagen
Operator of biopharmaceutical company. The company develops drugs for hemophilia, other genetic disorders and variations of recombinant B domain to avoid inactivation by flying below the radar screen of the immune system.
-----------------------------
Flex Pharma
Operator of a biopharmaceutical company. The company develops clinically proven products and treatments for muscle cramps and spasms.
-----------------------------
Twinstrand Therapeutics
Operator of biopharmaceutical company. The company engages in the discovery, development and commercialization of biological drugs for the treatment of life threatening diseases.
-----------------------------
Ansata Therapeutics
Operator of a biopharmaceutical company focused on dermatologic treatments. The c

In [36]:
#check Discovery Engine
checker('Discovery Engine')

Discovery Engine
Developer of an internet search engine. The company offers an interaction model of search engine that also can also compile information from multiple sources.
Discovery Engine
Developer of an internet search engine. The company offers an interaction model of search engine that also can also compile information from multiple sources.
-----------------------------
JustSpotted
Provider of real time search engine. The company\'s search engine aggregates and organizes content being shared on the internet. It offers search options on entertainment, technology, sports, world and business, science, gaming, politics and lifestyle topics.
-----------------------------
Goshme Solucoes Para a Internet
Developer and provider of search engine. The company assists users by providing a list of all search engines and databases appropriate to the query, ranked by relevance, divided by categories and sub-categories, and with a brief description about each search engine.
-----------------

# Taking a subset of description to extract features 

This tests the effect of a smaller description space.
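
The truncation is the `split()[:15]` idiom from the commented-out lines in the next cell. As a quick illustration, here it is wrapped in a helper (`first_n_words` is a hypothetical name, not part of the notebook) and applied to the full Octagen description:

```python
def first_n_words(text, n=15):
    """Keep only the first n whitespace-separated words of text."""
    return ' '.join(text.split()[:n])

# Full Octagen description, as printed earlier in this notebook.
desc = ("Operator of biopharmaceutical company. The company develops drugs "
        "for hemophilia, other genetic disorders and variations of "
        "recombinant B domain to avoid inactivation by flying below the "
        "radar screen of the immune system.")

print(first_n_words(desc))
```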

In [65]:
#using a subset of each description: keep the first 15 words
#ds = pd.read_csv("companies.csv", usecols=["name", "desc"])
#ds['desc']=ds['desc'].apply(lambda x: (' '.join(x.split()[:15])).strip() )
#ds.to_csv('limited.csv')

def train(data_source):
    start = time.time()
    ds = pd.read_csv(data_source, usecols=["name", "desc"])
    print("Training data ingested in %s seconds." % (time.time() - start))
    start = time.time()
    frame = _train(ds)
    print("Engine trained in %s seconds." % (time.time() - start))
    return frame

def _train(ds):
    """
    using limited descriptions
    """

    columns = ['name','content_recommended']
    frame = pd.DataFrame(columns=columns)
    
    tf = TfidfVectorizer(analyzer='word',
                         ngram_range=(1, 3),
                         min_df=0,
                         stop_words='english')
    tfidf_matrix = tf.fit_transform(ds['desc'])

    cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

    for idx, row in ds.iterrows():
        similar_indices = cosine_similarities[idx].argsort()[:-11:-1]
        similar_items = [(cosine_similarities[idx][i], ds['name'][i])
                         for i in similar_indices]

        # The first item is the item itself; change the slice start from 0 to 1 to remove it.
        # This 'sum' turns a list of tuples into a single tuple:
        # [(1,2), (3,4)] -> (1,2,3,4)
        flattened = sum(similar_items[0:], ())
        #add to frame
        arr=[row['name'], flattened]
        frame.loc[len(frame)]=arr
    return frame

In [66]:
#train, and create dataframe with 10 recommended firms
df=train('limited.csv')

Training data ingested in 0.0114548206329 seconds.
Engine trained in 13.2973761559 seconds.


In [None]:
#load limited dataframe for checking performance
ds = pd.read_csv('limited.csv', usecols=["name", "desc"])

In [67]:
#check Octagen
checker('Octagen')

Octagen
Operator of biopharmaceutical company. The company develops drugs for hemophilia, other genetic disorders and variations
Octagen
Operator of biopharmaceutical company. The company develops drugs for hemophilia, other genetic disorders and variations
-----------------------------
Sterix
Operator of a biopharmaceutical company. The company specializes in the research and development of a
-----------------------------
Alkermes
Operator of a biopharmaceutical company. The company develops products based on drug-delivery technologies to enhance
-----------------------------
Flex Pharma
Operator of a biopharmaceutical company. The company develops clinically proven products and treatments for muscle
-----------------------------
Twinstrand Therapeutics
Operator of biopharmaceutical company. The company engages in the discovery, development and commercialization of biological
-----------------------------
Neuromed Pharmaceuticals
Provider of small molecule drugs for the Biopharmaceuti

In [68]:
#check Yantra
checker('Yantra')

Yantra
Provider of distributed order management and supply chain fulfillment solutions for retail, distribution, logistics, and
Yantra
Provider of distributed order management and supply chain fulfillment solutions for retail, distribution, logistics, and
-----------------------------
LCL Logistix
Provider of logistics and supply-chain solutions. The company provides integrated, end-to-end shipping logistics services to
-----------------------------
Mercari Technologies
Provider of supply-chain management and e-fulfillment software company. The company\'s merchandising solutions bringing retailers and
-----------------------------
Speedchain Networks
Provider of supply chain event management and global e-logistics services. The company through its web-native
-----------------------------
Global Beverage Group
Developer of delivery management technologies. The company offers a supply chain management software for the
-----------------------------
DownstreamEnergy.com
Provider of solutio

In [69]:
#check Discovery Engine
checker('Discovery Engine')

Discovery Engine
Developer of an internet search engine. The company offers an interaction model of search engine
Discovery Engine
Developer of an internet search engine. The company offers an interaction model of search engine
-----------------------------
Peryskop.pl
Provider of an online search engine. The company provides semantic search engine for products and
-----------------------------
JustSpotted
Provider of real time search engine. The company\'s search engine aggregates and organizes content being
-----------------------------
Trulia
Provider of a real estate search engine. The company offers information on search sales statistics,
-----------------------------
Krillion
Operator of a shopping search engine. The company offers audio and video accessories, such as
-----------------------------
Yandex
Operator of an Internet search engine in Russia. The company offers access to a range
-----------------------------
Milewise
Provider of a search engine for frequent fliers. The 
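
The checks in this notebook are done by eye. One possible way to put a number on the sensitivity is the overlap between a firm's neighbor lists under two settings. The sketch below is an assumption, not part of the original analysis: `neighbor_overlap` is a hypothetical helper, the tuples are made up, and it expects the flat `(score, name, ...)` layout stored in `content_recommended`.

```python
def neighbor_overlap(flat_a, flat_b):
    """Jaccard overlap between the neighbor names in two flat
    (score, name, score, name, ...) tuples."""
    names_a = set(flat_a[1::2])
    names_b = set(flat_b[1::2])
    if not names_a and not names_b:
        return 1.0  # two empty lists are treated as identical
    return len(names_a & names_b) / float(len(names_a | names_b))

# Made-up recommendation tuples for one firm under two settings.
baseline = (1.0, "Octagen", 0.4, "Flex Pharma", 0.3, "Sterix")
variant = (1.0, "Octagen", 0.5, "Flex Pharma", 0.2, "Alkermes")

print(neighbor_overlap(baseline, variant))
```

Averaging this value over all firms present in both frames would give a single robustness score per variant.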