# url-prediction-model
SanjayKAroraPhD@gmail.com <br>
October 2018

## Description
This notebook validates multiple classifiers to predict correct firm urls.  It assumes as input a matrix of known firm urls and potential matches from MS Bing.  Variables include the search result # (i.e., rank), length of the candidate url, matches of words derived from the known firm name and name of the url, etc. <br>

After model selection, train the most promising models and implement a simple voting mechanism to predict the most likely firm url for each firm.  Future work could involve experimenting with bagging or some other ensemble technique to reduce variance.
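
As a rough illustration of the bagging idea mentioned above, the sketch below wraps a shallow decision tree in scikit-learn's `BaggingClassifier` and scores it with F1. The data are synthetic (`make_classification`), and the names `X_demo`, `y_demo`, `bagged`, and `bag_scores` are placeholders, not part of this notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, unbalanced stand-in for the firm/url feature matrix (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=4,
    weights=[0.85, 0.15], random_state=0)

# Bag 25 shallow trees, each fit on a bootstrap sample of the rows.
bagged = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5),
    n_estimators=25, random_state=0)

# Plain 3-fold CV here; the real data would need the firm-grouped folds.
bag_scores = cross_val_score(bagged, X_demo, y_demo, cv=3, scoring='f1')
print(bag_scores.mean())
```

On the real matrix the same estimator could be dropped into the `classifiers` list, so it would be compared under the same grouped cross-validation as the other models.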

## Change log
v3 updates the script to predict correct matching records (to obtain a firm url) on the test set. 

In [1]:
# import data processing and other libraries
import csv
import sys
import requests
import re
import pprint
import pymongo
import traceback
from time import sleep
import pandas as pd
from IPython.display import display
import time
import numpy as np

In [24]:
# import sklearn
from sklearn.model_selection import train_test_split
from sklearn import svm
from matplotlib.colors import ListedColormap
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import make_scorer, f1_score

## Data pre-processing
Load data and get it ready for cross-validation.

In [6]:
# import data
train_df = pd.read_csv('/Users/sanjay/dev/EAGER/data/modeling/urls/training/bing-firm-url-train-v5-unique.csv')
train_df.head()

Unnamed: 0,firm,firm_length,url,name_clnd,name_length,hit_url,hit_url_length,rank,matches,public,acquired_merged,outcome
0,honeywell international inc.,9,honeywell.com/,Honeywell - Official Site,25,honeywell.com/,26,1,1,0,0,1
1,honeywell international inc.,9,honeywell.com/,Honeywell International Inc. Company Profile |...,54,hoovers.com/company-information/cs/company-pro...,111,3,1,0,0,0
2,honeywell international inc.,9,honeywell.com/,HON:New York Stock Quote - Honeywell Internati...,58,bloomberg.com/quote/HON:US,38,4,1,1,0,0
3,honeywell international inc.,9,honeywell.com/,HON Stock Price - Honeywell International Inc....,62,marketwatch.com/investing/stock/hon,47,5,1,2,0,0
4,honeywell international inc.,9,honeywell.com/,HON Stock Price & News - Honeywell Internation...,56,quotes.wsj.com/HON,26,6,1,3,0,0


In [7]:
# Prep data and set up cross-validation

# X, y
X = train_df.drop(['firm', 'url', 'name_clnd', 'hit_url', 'outcome'], axis=1)
y = np.ravel(train_df[['outcome']].values)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Check how unbalanced we are
display("Outcomes are unbalanced.")
unique, counts = np.unique(y, return_counts=True)
display(dict(zip(unique, counts)))

# Assign groups based on 'firm'
groups = train_df.groupby('firm').ngroup().values

# k-foldGroup
gkf = GroupKFold(n_splits=3)

'Outcomes are unbalanced.'

{0: 963, 1: 154}
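
The grouping matters because each firm contributes several candidate rows, and rows from the same firm should not land in both the training and validation folds. The toy sketch below (three hypothetical firms, two rows each; `toy_df`, `toy_groups`, and `toy_gkf` are illustrative names) checks that `GroupKFold` keeps every firm on one side of each split.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy frame: three hypothetical firms with two candidate rows each.
toy_df = pd.DataFrame({
    'firm': ['a', 'a', 'b', 'b', 'c', 'c'],
    'rank': [1, 2, 1, 2, 1, 2],
})

# Same group-id construction as in the cell above.
toy_groups = toy_df.groupby('firm').ngroup().values

toy_gkf = GroupKFold(n_splits=3)
fold_overlaps = []
for train_idx, test_idx in toy_gkf.split(toy_df, groups=toy_groups):
    # Count firms that appear on both sides of this split.
    shared = set(toy_groups[train_idx]) & set(toy_groups[test_idx])
    fold_overlaps.append(len(shared))

print(fold_overlaps)  # every entry is 0: no firm is split across folds
```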

## Run cross validation on several models
The evaluation metric computed below is the F1 score (`scoring='f1'`), averaged across the group k-fold splits.  Future work might also report other common metrics, such as precision and recall. Another to-do is to tune the hyperparameters using cross-validation.  (They are currently static, as seen in the next cell.)
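
One way the hyperparameter to-do could be approached is a grid search that reuses the grouped folds and the F1 scorer. This is only a sketch on synthetic data: `X_tune`, `y_tune`, `tune_groups`, and the `max_depth` grid are assumptions chosen for illustration, not tuned values for this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, unbalanced stand-in for the training matrix (assumption).
X_tune, y_tune = make_classification(
    n_samples=300, n_features=7, n_informative=4,
    weights=[0.85, 0.15], random_state=0)

# Hypothetical grouping: 50 "firms" with six candidate rows each.
tune_groups = np.repeat(np.arange(50), 6)

# Small illustrative grid over tree depth.
param_grid = {'max_depth': [3, 5, 8]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring='f1',
    cv=GroupKFold(n_splits=3))

# groups is forwarded to GroupKFold so firms stay within a single fold.
search.fit(X_tune, y_tune, groups=tune_groups)
print(search.best_params_, search.best_score_)
```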

In [8]:
# specify a few models

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=0.001, C=100.),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

In [10]:
# build dataframe for output metrics 
eval_df = pd.DataFrame (names,index=(range(len(names))), columns=["Name"])
eval_df['F1'] = np.float64(0)
display (eval_df)

# build dataframe for predicted values
pred_df = pd.DataFrame(index=(range(len(train_df.index)))) # number of rows equals number of training observations

Unnamed: 0,Name,F1
0,Nearest Neighbors,0.0
1,Linear SVM,0.0
2,RBF SVM,0.0
3,Gaussian Process,0.0
4,Decision Tree,0.0
5,Random Forest,0.0
6,Neural Net,0.0
7,AdaBoost,0.0
8,Naive Bayes,0.0
9,QDA,0.0


In [26]:
# build evaluation outputs (mean F1 across the group k-fold splits)
for i, (name, clf) in enumerate(zip(names, classifiers)):
    display(name)
    scores = cross_val_score(clf, X, y, cv=gkf, groups=groups,
                             scoring='f1')
    # DataFrame.set_value was removed in pandas 1.0; .at sets a single cell
    eval_df.at[i, 'F1'] = np.mean(scores)

display(eval_df)
eval_df.to_clipboard()

'Nearest Neighbors'

'Linear SVM'

'RBF SVM'

'Gaussian Process'

'Decision Tree'

'Random Forest'

'Neural Net'

'AdaBoost'

'Naive Bayes'

'QDA'

Unnamed: 0,Name,F1
0,Nearest Neighbors,0.790765
1,Linear SVM,0.90603
2,RBF SVM,0.907911
3,Gaussian Process,0.913805
4,Decision Tree,0.916236
5,Random Forest,0.914336
6,Neural Net,0.929366
7,AdaBoost,0.924306
8,Naive Bayes,0.809434
9,QDA,0.834869


## Cross-validation prediction
This section produces outputs to examine the efficacy of each model. The CSV file generated below can be imported into Excel so that the analyst can review which firms seem more or less likely to have accurately predicted URLs (based on the training data).

In [11]:
# predict across classifiers
for name, clf in zip(names, classifiers):
    display (name)
    y_hat = cross_val_predict(clf, X, y, cv=gkf, groups=groups)
    pred_df[name] = y_hat

'Nearest Neighbors'

'Linear SVM'

'RBF SVM'

'Gaussian Process'

'Decision Tree'

'Random Forest'

'Neural Net'



'AdaBoost'

'Naive Bayes'

'SVC'

'QDA'

In [12]:
# hold simple voting
vote_df = pred_df.copy()
vote_df['Votes'] = vote_df.sum(axis=1)
vote_df['Outcome'] = train_df['outcome']
vote_df['Group'] = groups

In [13]:
# pick from each group the top vote getter
idx = vote_df.groupby(['Group'])['Votes'].transform(max) == vote_df['Votes']
results_df = vote_df[idx]

# filter out 0-vote getting observations
results_df = results_df[(results_df['Votes'] > 0)]

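# Check for ties: a group can have several rows sharing the top (non-zero) vote count
display("Votes sometimes might be tied for non-zero values.")
unique, counts = np.unique(results_df['Group'], return_counts=True)
group_dup_list = list(zip(unique, counts))  # list() so the pairs can be displayed in Python 3
# display(group_dup_list)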

# merge with original training data
results_small_df = results_df[['Votes']]
results_merged_df = train_df.merge (results_small_df, left_index=True, right_index=True, how='inner')
results_merged_df

'Votes sometimes might be tied for non-zero values.'

Unnamed: 0,firm,firm_length,url,name_clnd,name_length,hit_url,hit_url_length,rank,matches,public,acquired_merged,outcome,Votes
0,honeywell international inc.,9,honeywell.com/,Honeywell - Official Site,25,honeywell.com/,26,1,1,0,0,1,10
6,imds corporation,4,imds-ohio.com/,IMDS Corporation,16,imds-ohio.com/,25,1,1,0,0,1,10
13,"ziptronix, inc.",9,xperi.com/,Invensas,8,invensas.com/,25,2,0,0,1,0,2
20,"dcg systems, inc.",11,fei.com/,Electrical Failure Analysis | Thermo Fisher Sc...,54,fei.com/products/electrical-failure-analysis/,57,1,0,0,0,0,3
29,quallion llc,8,quallion.com/,EnerSys Advanced Systems: Powering Submarines ...,59,enersys.com/advancedsystems/,40,2,0,0,0,0,2
38,alliant techsystems inc.,19,northropgrumman.com/Pages/default.aspx,Vista Outdoor - Official Site,29,vistaoutdoor.com/,25,10,0,0,1,0,1
39,sensor-kinesis corporation,14,sensor-kinesis.com/,Sensor-Kinesis Corp. | The Leader in Advanced ...,59,sensor-kinesis.com/,26,1,2,0,0,1,10
46,epcot crenshaw corporation,14,epcotcrenshaw.com/,Epcot | Crenshaw,16,epcotcrenshaw.com/,29,1,2,1,0,1,8
53,gen-probe incorporated,9,hologic.com/,Hologic - Official Site,23,hologic.com/,24,1,0,0,0,1,10
61,"lexmark international, inc.",7,lexmark.com/en_us.html,"Print, secure and manage your information | Le...",62,lexmark.com/en_us.html,34,1,1,0,0,1,6
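
The tie situation flagged above can be made explicit on a toy table. The sketch below uses hypothetical data (`toy_votes`) to find groups with more than one top vote getter and then keeps the row with the lowest search rank, which is one plausible tie-break rather than the only option.

```python
import pandas as pd

# Toy vote table: group 0 has a tie at 4 votes, group 1 has no votes at all.
toy_votes = pd.DataFrame({
    'Group': [0, 0, 0, 1, 1],
    'rank':  [3, 1, 2, 1, 2],
    'Votes': [4, 4, 1, 0, 0],
})

# Keep rows whose vote count equals their group's maximum and is non-zero.
is_top = toy_votes.groupby('Group')['Votes'].transform('max') == toy_votes['Votes']
top_df = toy_votes[is_top & (toy_votes['Votes'] > 0)]

# A group with more than one surviving row has a tie.
tie_counts = top_df['Group'].value_counts()
tied_groups = tie_counts[tie_counts > 1].index.tolist()

# Break ties by keeping the best-ranked (lowest rank) row per group.
winners = top_df.sort_values('rank').groupby('Group').head(1)

print(tied_groups)
print(winners)
```

Here group 0 is reported as tied and resolves to the rank-1 row, while group 1 drops out because none of its rows received a vote.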


In [16]:
# write data frame to csv
results_merged_df.to_csv('/home/eager/EAGER/data/orgs/workshop/bing-firm-url-out-v5.csv')

## Fit best models and predict on (final) test set
After cross-validation, fit the current top-performing models, which include Linear SVM, Gaussian Process, Decision Tree, AdaBoost, and SVC (the RBF-kernel SVM from the cross-validation cells). Predict outputs, aggregate votes, and write the results to a CSV file. This mirrors the cross-validation check above, except that there are now no observed correct y values (i.e., urls), only predicted ones.
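
The shortlist in the next cell is specified by hand. As a hedged alternative, a shortlist could also be derived from the cross-validation table; the sketch below re-enters the F1 values reported above into a hypothetical frame `cv_f1` and takes the five highest. Ranking by mean F1 alone is only one possible criterion and need not reproduce the hand-picked list.

```python
import pandas as pd

# Mean F1 values copied from the cross-validation table above.
cv_f1 = pd.DataFrame({
    'Name': ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
             "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
             "Naive Bayes", "QDA"],
    'F1': [0.790765, 0.90603, 0.907911, 0.913805, 0.916236,
           0.914336, 0.929366, 0.924306, 0.809434, 0.834869],
})

# Five models with the highest mean F1, best first.
shortlist = cv_f1.nlargest(5, 'F1')['Name'].tolist()
print(shortlist)
```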

In [17]:
# specify the highest-performing models from cross-validation
top_names = ["Linear SVM", "Gaussian Process",
         "Decision Tree", "AdaBoost", "SVC" ]

top_classifiers = [
    SVC(kernel="linear", C=0.025),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    AdaBoostClassifier(), 
    SVC(gamma=0.001, C=100.)]

In [20]:
# load final test set
X_test_in = pd.read_csv('/home/eager/EAGER/data/orgs/workshop/bing-final-test-matrix.csv')
X_test = X_test_in.drop(['firm', 'name_clnd', 'hit_url'], axis=1)
X_test.head()

Unnamed: 0,firm_length,name_length,hit_url_length,rank,matches,public,acquired_merged
0,21,31,26,1,2,1,0
1,21,58,82,3,3,1,0
2,21,43,84,7,3,2,0
3,21,57,79,8,3,2,0
4,21,54,95,9,2,2,0


In [21]:
# build dataframe for predicted values
test_df = pd.DataFrame(index=(range(len(X_test.index)))) # number of rows equals number of test observations

# Assign groups based on 'firm'
test_groups = X_test_in.groupby('firm').ngroup().values

In [22]:
# predict
for name, clf in zip(top_names, top_classifiers):
    display (name)
    clf.fit(X, y)
    y_hat = clf.predict (X_test)
    test_df[name] = y_hat

'Linear SVM'

'Gaussian Process'

'Decision Tree'

'AdaBoost'

'SVC'

In [23]:
# hold simple voting
fin_df = test_df.copy()
fin_df['Votes'] = fin_df.sum(axis=1)
fin_df['Group'] = test_groups

In [24]:
# pick from each group the top vote getter
idx = fin_df.groupby(['Group'])['Votes'].transform(max) == fin_df['Votes']
results_df = fin_df[idx]

# merge with original test data
results_small_df = results_df[['Votes', 'Group']]
results_merged_df = X_test_in.merge (results_small_df, left_index=True, right_index=True, how='inner')

# break ties within each group by keeping the row with the lowest search rank
# (zero-vote rows are not filtered out here)
fin_results_merged_df = results_merged_df[results_merged_df.groupby(['Group'])['rank'].transform(min) == results_merged_df['rank']]

# Check whether any group still has more than one top row
display("Votes sometimes might be tied for non-zero values.")
unique, counts = np.unique(fin_results_merged_df['Group'], return_counts=True)
group_dup_list = list(zip(unique, counts))  # list() so the pairs can be displayed in Python 3
# display(group_dup_list)
fin_results_merged_df

'Votes sometimes might be tied for non-zero values.'

Unnamed: 0,firm,firm_length,name_clnd,name_length,hit_url,hit_url_length,rank,matches,public,acquired_merged,Votes,Group
0,ACACIA RESEARCH GROUP LLC,21,Acacia Research - Official Site,31,acaciaresearch.com/,26,1,2,1,0,5,0
5,Ablexis,7,"Ablexis A superior, next generation transgenic...",56,ablexis.com/,23,1,1,0,0,5,1
13,Alliance for Sustainable Energy,31,Alliance for Sustainable Energy - Official Site,47,allianceforsustainableenergy.org/,44,1,4,0,0,1,2
20,CyboEnergy,10,CyboEnergy,10,cyboenergy.com/,26,1,1,0,0,5,3
26,EMD Technologies Inc.,16,EMD Technologies EMD Technologies is a leadin...,60,emd-technologies.com/,29,1,2,0,0,5,4
34,Energen,7,"Diamondback Energy, Inc. (FANG)",31,diamondbackenergy.com/,34,1,0,0,0,4,5
40,FULL CIRCLE BIOCHAR,19,Full Circle Biochar | Full Circle Biochar,41,fullcirclebiochar.com/,29,1,3,1,0,5,6
48,Ferro Corporation,5,Ferro - Official Site,21,ferro.com/,22,1,1,0,0,5,7
54,GOAL ZERO LLC,9,Goal Zero - Official Site,25,goalzero.com/,25,1,2,0,0,5,8
62,Genesco Inc.,7,Investor Overview | Genesco,27,genesco.gcs-web.com/,27,2,1,0,0,0,9


In [25]:
# write data frame to csv
fin_results_merged_df.to_csv('/home/eager/EAGER/data/orgs/workshop/bing-firm-final-urls-out.csv')