### Decision Tree Procedure

Because the regression approach proved unsuccessful, the original plan is modified here. A decision tree model will instead partition the companies into several distinct groups (its leaf nodes). The subsequent text analysis will then be applied within each group to find the highest probabilities and produce a shortlist of companies.

The decision tree itself draws on three fields from the list of acquired companies: the acquisition price, the functional group, and the industry.

In [101]:
import pandas as pd
import numpy as np
from scipy import stats, special
from sklearn import tree
from sklearn import model_selection, metrics, linear_model, datasets, feature_selection
from seaborn import pairplot, heatmap

import matplotlib.pyplot as plt
%matplotlib inline


In [102]:
tree_data = pd.read_csv('Final Project/IPO Information.csv')
# Drop the columns that are not needed for the decision tree
tree_data = tree_data.drop(columns=['Unnamed: 0', 'Actions', 'Exchange', 'Date',
                                    'Description', 'Price', 'Shares'])
tree_data

Unnamed: 0,Symbol,Company Name,Offer Amount,Country,Group,Industry
0,ILPT,Industrial Logistics Properties Trust,480000000,United States,Misc.,Real Estate
1,LBRT,Liberty Oilfield Services Inc.,216428564,United States,Misc.,Energy
2,PACK,Ranpak Holdings Corp.,300000000,United States,Misc.,Consumer Cyclical
3,NINE,"Nine Energy Service, Inc.",161000000,United States,Misc.,Energy
4,ADT,ADT Inc.,1470000000,United States,Devices,Industrials
...,...,...,...,...,...,...
394,CASA,Casa Systems Inc,78000000,United States,Media,Technology
395,NMRK,"NEWMARK GROUP, INC.",280000000,United States,Misc.,Real Estate
396,TRVG,trivago N.V.,287211298,Germany,Search,Communication Services
397,AMTB,Amerant Bancorp Inc.,81900000,United States,Organization,Financial Services


The acquisitions data below includes an additional column, the tech index (the Tech column). It flags whether a company can be considered a tech company: 1 if it falls in the Technology or Communications Services industry, 0 otherwise.

In [103]:
acquired = pd.read_csv('Final Project/acquisitions.csv')
acquired

Unnamed: 0,Acquisition Date,Company,Business,Country,Price,Used as or integrated with,Group,ParentCompany,Industry,Tech
0,11-Feb-05,Verdisoft,Wireless data sharing,United States,58000000,Yahoo! Mobile,OS,Yahoo,Communications Services,1
1,20-Mar-05,Ludicorp,Image hosting service,Canada,40000000,Flickr,Search,Yahoo,Technology,1
2,17-Aug-05,Android,Mobile operating system,United States,50000000,Android,OS,Alphabet,Technology,1
3,23-Aug-05,facebook.com domain name,AboutFace,United States,200000,name change from Thefacebook.com,Business,Facebook,Technology,1
4,12-Dec-05,del.icio.us,Social bookmarking,United States,20000000,del.icio.us,Search,Yahoo,Communications Services,1
...,...,...,...,...,...,...,...,...,...,...
86,10-Dec-18,Sigmoid Labs,Indian railway tracking,India,40000000,,Misc.,Alphabet,Industrials,0
87,03-Jan-19,Superpod,Question and answer app,United States,60000000,Google Assistant,Search,Alphabet,Technology,1
88,06-Jun-19,Looker,"Big data, analytics",United States,2600000000,Google Cloud Platform,Productivity,Alphabet,Technology,1
89,01-Nov-19,Fitbit,Wearables,United States,2100000000,,,Alphabet,Consumer Cyclical,0
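
The Tech column arrives precomputed in the CSV, so how it was built is not shown. A minimal sketch of one way to derive it from Industry, assuming the rule stated above; the toy rows and the TECH_INDUSTRIES name are illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the acquisitions data (illustrative values only).
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Industry": ["Technology", "Communications Services", "Industrials"],
})

# Industries treated as "tech", following the definition in the text.
# TECH_INDUSTRIES is a hypothetical helper name.
TECH_INDUSTRIES = ["Technology", "Communications Services"]

# 1 if the industry is in the tech list, 0 otherwise.
toy["Tech"] = np.where(toy["Industry"].isin(TECH_INDUSTRIES), 1, 0)
print(toy)
```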


In [104]:
acquired['Google'] = np.where(acquired['ParentCompany']=="Alphabet", 1, 0)
# Fill in the missing functional group for Fitbit (row 89); .loc avoids chained assignment
acquired.loc[89, 'Group'] = "Devices"
acquired1 = acquired.copy()
acquired


Unnamed: 0,Acquisition Date,Company,Business,Country,Price,Used as or integrated with,Group,ParentCompany,Industry,Tech,Google
0,11-Feb-05,Verdisoft,Wireless data sharing,United States,58000000,Yahoo! Mobile,OS,Yahoo,Communications Services,1,0
1,20-Mar-05,Ludicorp,Image hosting service,Canada,40000000,Flickr,Search,Yahoo,Technology,1,0
2,17-Aug-05,Android,Mobile operating system,United States,50000000,Android,OS,Alphabet,Technology,1,1
3,23-Aug-05,facebook.com domain name,AboutFace,United States,200000,name change from Thefacebook.com,Business,Facebook,Technology,1,0
4,12-Dec-05,del.icio.us,Social bookmarking,United States,20000000,del.icio.us,Search,Yahoo,Communications Services,1,0
...,...,...,...,...,...,...,...,...,...,...,...
86,10-Dec-18,Sigmoid Labs,Indian railway tracking,India,40000000,,Misc.,Alphabet,Industrials,0,1
87,03-Jan-19,Superpod,Question and answer app,United States,60000000,Google Assistant,Search,Alphabet,Technology,1,1
88,06-Jun-19,Looker,"Big data, analytics",United States,2600000000,Google Cloud Platform,Productivity,Alphabet,Technology,1,1
89,01-Nov-19,Fitbit,Wearables,United States,2100000000,,Devices,Alphabet,Consumer Cyclical,0,1


### Decision Tree - Take 1 

The first approach encodes each functional group as its own binary (0/1) indicator column.

In [105]:
acquired['Search'] = np.where(acquired['Group'] == 'Search', 1,0)
acquired['Media'] = np.where(acquired['Group'] == 'Media', 1,0)
acquired['Devices'] = np.where(acquired['Group'] == 'Devices', 1,0)
acquired['OS'] = np.where(acquired['Group'] == 'OS', 1,0)
acquired['Connectivity'] = np.where(acquired['Group'] == 'Connectivity', 1,0)
acquired['Productivity'] = np.where(acquired['Group'] == 'Productivity', 1,0)
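# Note: this overwrites the original free-text 'Business' description column with the 0/1 indicator
acquired['Business'] = np.where(acquired['Group'] == 'Business', 1,0)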
acquired['Research'] = np.where(acquired['Group'] == 'Research', 1,0)
acquired['Innovation'] = np.where(acquired['Group'] == 'Innovation', 1,0)
acquired['Misc'] = np.where(acquired['Group'] == 'Misc.', 1,0)
acquired

Unnamed: 0,Acquisition Date,Company,Business,Country,Price,Used as or integrated with,Group,ParentCompany,Industry,Tech,Google,Search,Media,Devices,OS,Connectivity,Productivity,Research,Innovation,Misc
0,11-Feb-05,Verdisoft,0,United States,58000000,Yahoo! Mobile,OS,Yahoo,Communications Services,1,0,0,0,0,1,0,0,0,0,0
1,20-Mar-05,Ludicorp,0,Canada,40000000,Flickr,Search,Yahoo,Technology,1,0,1,0,0,0,0,0,0,0,0
2,17-Aug-05,Android,0,United States,50000000,Android,OS,Alphabet,Technology,1,1,0,0,0,1,0,0,0,0,0
3,23-Aug-05,facebook.com domain name,1,United States,200000,name change from Thefacebook.com,Business,Facebook,Technology,1,0,0,0,0,0,0,0,0,0,0
4,12-Dec-05,del.icio.us,0,United States,20000000,del.icio.us,Search,Yahoo,Communications Services,1,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86,10-Dec-18,Sigmoid Labs,0,India,40000000,,Misc.,Alphabet,Industrials,0,1,0,0,0,0,0,0,0,0,1
87,03-Jan-19,Superpod,0,United States,60000000,Google Assistant,Search,Alphabet,Technology,1,1,1,0,0,0,0,0,0,0,0
88,06-Jun-19,Looker,0,United States,2600000000,Google Cloud Platform,Productivity,Alphabet,Technology,1,1,0,0,0,0,0,1,0,0,0
89,01-Nov-19,Fitbit,0,United States,2100000000,,Devices,Alphabet,Consumer Cyclical,0,1,0,0,1,0,0,0,0,0,0
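
The indicator columns above are written out one group at a time. As a hedged alternative, pandas can build the same kind of 0/1 columns in a single call with get_dummies. The sketch below uses a toy Group column rather than the real data. Unlike the hand-written version, the column names come straight from the group labels (so 'Misc.' keeps its trailing period).

```python
import pandas as pd

# Toy functional groups (illustrative values only).
toy = pd.DataFrame({"Group": ["Search", "OS", "Misc.", "Search"]})

# One 0/1 indicator column per distinct group label.
indicators = pd.get_dummies(toy["Group"], dtype=int)

# Attach the indicators next to the original column.
toy_encoded = pd.concat([toy, indicators], axis=1)
print(toy_encoded)
```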


In [106]:
X = acquired[['Price', 'Search', 'Media', 'OS', 'Connectivity', 'Productivity','Business', 'Research', 'Innovation', 'Misc']]
y = acquired['Google']

In [107]:
bt_tree = tree.DecisionTreeClassifier(max_depth=5)
bt_tree.fit(X,y)
bt_tree.score(X,y)

0.8021978021978022

In [108]:
estimator = bt_tree
n_nodes = estimator.tree_.node_count
children_left = estimator.tree_.children_left
children_right = estimator.tree_.children_right
feature = estimator.tree_.feature
threshold = estimator.tree_.threshold

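# Walk the tree iteratively to record each node's depth and whether it is a leaf
node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)]  # seed with the root node id and its parent depth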
while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1

    if (children_left[node_id] != children_right[node_id]):
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True

print("The binary tree structure has %s nodes and has "
      "the following tree structure:"
      % n_nodes)
for i in range(n_nodes):
    if is_leaves[i]:
        print("%snode=%s leaf node." % (node_depth[i] * "\t", i))
    else:
        print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
              "node %s."
              % (node_depth[i] * "\t",
                 i,
                 children_left[i],
                 feature[i],
                 threshold[i],
                 children_right[i],
                 ))

The binary tree structure has 35 nodes and has the following tree structure:
node=0 test node: go to node 1 if X[:, 0] <= 395000000.0 else to node 22.
	node=1 test node: go to node 2 if X[:, 0] <= 165000000.0 else to node 15.
		node=2 test node: go to node 3 if X[:, 0] <= 59000000.0 else to node 8.
			node=3 test node: go to node 4 if X[:, 0] <= 52500000.0 else to node 7.
				node=4 test node: go to node 5 if X[:, 1] <= 0.5 else to node 6.
					node=5 leaf node.
					node=6 leaf node.
				node=7 leaf node.
			node=8 test node: go to node 9 if X[:, 3] <= 0.5 else to node 12.
				node=9 test node: go to node 10 if X[:, 0] <= 117500000.0 else to node 11.
					node=10 leaf node.
					node=11 leaf node.
				node=12 test node: go to node 13 if X[:, 0] <= 69100000.0 else to node 14.
					node=13 leaf node.
					node=14 leaf node.
		node=15 test node: go to node 16 if X[:, 2] <= 0.5 else to node 21.
			node=16 test node: go to node 17 if X[:, 0] <= 368000000.0 else to node 18.
				node=17 leaf

Although this method produced a working decision tree, most of the binary inputs were never used in a split, despite the model's fairly high score (which is accuracy measured on the training data itself). With that in mind, let's try a different approach to the problem.
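
One way to check which inputs a fitted tree uses, rather than reading the node listing by eye, is to inspect the tree's split features and feature_importances_. The following is a minimal sketch on synthetic data; the feature names FlagA and FlagB and the labeling rule are made up for illustration, and only Price actually drives the label.

```python
import numpy as np
from sklearn import tree

rng = np.random.default_rng(0)

# Synthetic inputs: a price plus two binary flags that carry no signal.
price = rng.uniform(1e6, 1e9, size=200)
flag_a = rng.integers(0, 2, size=200)
flag_b = rng.integers(0, 2, size=200)
X_toy = np.column_stack([price, flag_a, flag_b])
y_toy = (price > 4e8).astype(int)  # label depends on price only

names = ["Price", "FlagA", "FlagB"]
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)

# Leaves are marked with a negative feature index, so keep only split nodes.
used = {names[i] for i in clf.tree_.feature if i >= 0}
print("Features used in splits:", used)

# Importance is 0 for any feature that never appears in a split.
for name, importance in zip(names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")

# Text rendering of the tree with readable feature names.
print(tree.export_text(clf, feature_names=names))
```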

In [109]:
tree_data

Unnamed: 0,Symbol,Company Name,Offer Amount,Country,Group,Industry
0,ILPT,Industrial Logistics Properties Trust,480000000,United States,Misc.,Real Estate
1,LBRT,Liberty Oilfield Services Inc.,216428564,United States,Misc.,Energy
2,PACK,Ranpak Holdings Corp.,300000000,United States,Misc.,Consumer Cyclical
3,NINE,"Nine Energy Service, Inc.",161000000,United States,Misc.,Energy
4,ADT,ADT Inc.,1470000000,United States,Devices,Industrials
...,...,...,...,...,...,...
394,CASA,Casa Systems Inc,78000000,United States,Media,Technology
395,NMRK,"NEWMARK GROUP, INC.",280000000,United States,Misc.,Real Estate
396,TRVG,trivago N.V.,287211298,Germany,Search,Communication Services
397,AMTB,Amerant Bancorp Inc.,81900000,United States,Organization,Financial Services


In [110]:
print(tree_data['Industry'].unique())
print(acquired1['Industry'].unique())

['Real Estate' 'Energy' 'Consumer Cyclical' 'Industrials' 'Technology'
 'Healthcare' 'Consumer Defensive' 'Utilities' 'Basic Materials'
 'Financial Services' 'Communication Services']
['Communications Services' 'Technology' 'Industrials' 'Consumer Cyclical'
 'Consumer Defensive' 'Financial Services']
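
Note that the two files spell one industry differently: 'Communication Services' in the IPO data and 'Communications Services' in the acquisitions data. If the Industry column is ever compared or joined across the two tables, the labels would need to be harmonized first. A minimal sketch on toy Series; the choice of canonical spelling is an assumption:

```python
import pandas as pd

# Toy industry labels mirroring the spelling mismatch seen above.
ipo_industries = pd.Series(["Technology", "Communication Services", "Energy"])
acq_industries = pd.Series(["Technology", "Communications Services", "Industrials"])

# Map the acquisitions spelling onto the IPO spelling (assumed canonical here).
canonical = {"Communications Services": "Communication Services"}
acq_clean = acq_industries.replace(canonical)

# Labels that still appear only in the acquisitions data after cleaning.
only_in_acq = set(acq_clean) - set(ipo_industries)
print(only_in_acq)
```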


### Decision Tree - Take 2

Rather than one binary indicator per functional group plus Price, this attempt uses a group index that collapses the groups into three buckets coded 0, 1, and 2. The index is paired with Price and the aforementioned tech index as inputs to the decision tree.

In [111]:
group_index = {'Search':0, 'Media':0, 'Business':0, 'OS':1, 'Organization':1, 'Productivity':1, 'Connectivity':1, 'Devices':2, 'Research':2, 'Innovation':2, 'Misc.':2}
acquired1['Group_Index'] = acquired1['Group'].map(group_index)

acquired1

Unnamed: 0,Acquisition Date,Company,Business,Country,Price,Used as or integrated with,Group,ParentCompany,Industry,Tech,Google,Group_Index
0,11-Feb-05,Verdisoft,Wireless data sharing,United States,58000000,Yahoo! Mobile,OS,Yahoo,Communications Services,1,0,1
1,20-Mar-05,Ludicorp,Image hosting service,Canada,40000000,Flickr,Search,Yahoo,Technology,1,0,0
2,17-Aug-05,Android,Mobile operating system,United States,50000000,Android,OS,Alphabet,Technology,1,1,1
3,23-Aug-05,facebook.com domain name,AboutFace,United States,200000,name change from Thefacebook.com,Business,Facebook,Technology,1,0,0
4,12-Dec-05,del.icio.us,Social bookmarking,United States,20000000,del.icio.us,Search,Yahoo,Communications Services,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
86,10-Dec-18,Sigmoid Labs,Indian railway tracking,India,40000000,,Misc.,Alphabet,Industrials,0,1,2
87,03-Jan-19,Superpod,Question and answer app,United States,60000000,Google Assistant,Search,Alphabet,Technology,1,1,0
88,06-Jun-19,Looker,"Big data, analytics",United States,2600000000,Google Cloud Platform,Productivity,Alphabet,Technology,1,1,1
89,01-Nov-19,Fitbit,Wearables,United States,2100000000,,Devices,Alphabet,Consumer Cyclical,0,1,2


In [112]:
X1 = acquired1[['Price', 'Tech', 'Group_Index']]
y = acquired1['Google']

In [113]:
bt_tree1 = tree.DecisionTreeClassifier(max_depth=4)
bt_tree1.fit(X1,y)
bt_tree1.score(X1,y)

0.7912087912087912

In [114]:
estimator = bt_tree1
n_nodes = estimator.tree_.node_count
children_left = estimator.tree_.children_left
children_right = estimator.tree_.children_right
feature = estimator.tree_.feature
threshold = estimator.tree_.threshold

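# Walk the tree iteratively to record each node's depth and whether it is a leaf
node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)]  # seed with the root node id and its parent depth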
while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1

    if (children_left[node_id] != children_right[node_id]):
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True

print("The binary tree structure has %s nodes and has "
      "the following tree structure:"
      % n_nodes)
for i in range(n_nodes):
    if is_leaves[i]:
        print("%snode=%s leaf node." % (node_depth[i] * "\t", i))
    else:
        print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
              "node %s."
              % (node_depth[i] * "\t",
                 i,
                 children_left[i],
                 feature[i],
                 threshold[i],
                 children_right[i],
                 ))

The binary tree structure has 23 nodes and has the following tree structure:
node=0 test node: go to node 1 if X[:, 0] <= 395000000.0 else to node 16.
	node=1 test node: go to node 2 if X[:, 0] <= 165000000.0 else to node 9.
		node=2 test node: go to node 3 if X[:, 0] <= 59000000.0 else to node 6.
			node=3 test node: go to node 4 if X[:, 2] <= 0.5 else to node 5.
				node=4 leaf node.
				node=5 leaf node.
			node=6 test node: go to node 7 if X[:, 2] <= 1.5 else to node 8.
				node=7 leaf node.
				node=8 leaf node.
		node=9 test node: go to node 10 if X[:, 0] <= 368000000.0 else to node 13.
			node=10 test node: go to node 11 if X[:, 0] <= 247500000.0 else to node 12.
				node=11 leaf node.
				node=12 leaf node.
			node=13 test node: go to node 14 if X[:, 0] <= 385000000.0 else to node 15.
				node=14 leaf node.
				node=15 leaf node.
	node=16 test node: go to node 17 if X[:, 2] <= 0.5 else to node 22.
		node=17 test node: go to node 18 if X[:, 0] <= 3050000000.0 else to node 21.
		

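This time, far fewer nodes (and less depth) are needed to produce a comparable model: 23 nodes at a maximum depth of 4, against 35 nodes at depth 5, with a training accuracy of about 0.79 versus 0.80. However, the tech index (X[:, 1]) never appears in a split in the output above, so let's re-run the tree without it.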

In [115]:
X2 = acquired1[['Price', 'Group_Index']]
y = acquired1['Google']

In [116]:
bt_tree2 = tree.DecisionTreeClassifier(max_depth=4)
bt_tree2.fit(X2,y)
bt_tree2.score(X2,y)

0.7912087912087912
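
The scores in this section come from calling score on the same rows used for fitting, so they measure in-sample accuracy. A cross-validated estimate gives a rough idea of how the tree might do on unseen rows. The sketch below uses synthetic data with random labels, not the acquisitions data; the 5-fold setting is an arbitrary choice.

```python
import numpy as np
from sklearn import model_selection, tree

rng = np.random.default_rng(0)

# Synthetic stand-ins for Price and Group_Index, with random 0/1 labels.
price = rng.uniform(1e6, 3e9, size=90)
group_index = rng.integers(0, 3, size=90)
X_toy = np.column_stack([price, group_index])
y_toy = rng.integers(0, 2, size=90)

clf = tree.DecisionTreeClassifier(max_depth=4, random_state=0)

# In-sample accuracy: fit and score on the same rows.
train_acc = clf.fit(X_toy, y_toy).score(X_toy, y_toy)

# Out-of-sample estimate: 5-fold cross-validation refits on each training split.
cv_scores = model_selection.cross_val_score(clf, X_toy, y_toy, cv=5)

print(f"training accuracy: {train_acc:.3f}")
print(f"mean 5-fold accuracy: {cv_scores.mean():.3f}")
```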

In [117]:
estimator = bt_tree2
n_nodes = estimator.tree_.node_count
children_left = estimator.tree_.children_left
children_right = estimator.tree_.children_right
feature = estimator.tree_.feature
threshold = estimator.tree_.threshold

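# Walk the tree iteratively to record each node's depth and whether it is a leaf
node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)]  # seed with the root node id and its parent depth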
while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1

    if (children_left[node_id] != children_right[node_id]):
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True

print("The binary tree structure has %s nodes and has "
      "the following tree structure:"
      % n_nodes)
for i in range(n_nodes):
    if is_leaves[i]:
        print("%snode=%s leaf node." % (node_depth[i] * "\t", i))
    else:
        print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
              "node %s."
              % (node_depth[i] * "\t",
                 i,
                 children_left[i],
                 feature[i],
                 threshold[i],
                 children_right[i],
                 ))

The binary tree structure has 23 nodes and has the following tree structure:
node=0 test node: go to node 1 if X[:, 0] <= 395000000.0 else to node 16.
	node=1 test node: go to node 2 if X[:, 0] <= 165000000.0 else to node 9.
		node=2 test node: go to node 3 if X[:, 0] <= 59000000.0 else to node 6.
			node=3 test node: go to node 4 if X[:, 1] <= 0.5 else to node 5.
				node=4 leaf node.
				node=5 leaf node.
			node=6 test node: go to node 7 if X[:, 1] <= 1.5 else to node 8.
				node=7 leaf node.
				node=8 leaf node.
		node=9 test node: go to node 10 if X[:, 0] <= 368000000.0 else to node 13.
			node=10 test node: go to node 11 if X[:, 0] <= 247500000.0 else to node 12.
				node=11 leaf node.
				node=12 leaf node.
			node=13 test node: go to node 14 if X[:, 0] <= 385000000.0 else to node 15.
				node=14 leaf node.
				node=15 leaf node.
	node=16 test node: go to node 17 if X[:, 1] <= 0.5 else to node 22.
		node=17 test node: go to node 18 if X[:, 0] <= 3050000000.0 else to node 21.
		

### Decision Tree - Conclusion

From this final decision tree model, we have a classification tree that uses both inputs (Price and Group_Index) to establish 12 distinct leaf nodes, which will be used to partition the companies in our list of IPOs.
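
A minimal sketch of how that partitioning step might look, using the classifier's apply method to get the leaf each row lands in. The toy training rows and IPO rows below are made up. Using Offer Amount as the stand-in for Price is an assumption about how the two tables line up, not something the notebook confirms.

```python
import pandas as pd
from sklearn import tree

# Toy training rows standing in for the acquisitions data (illustrative only).
train = pd.DataFrame({
    "Price": [2e7, 5e7, 1e8, 3e8, 5e8, 2e9],
    "Group_Index": [0, 1, 2, 0, 1, 2],
    "Google": [0, 1, 0, 1, 1, 0],
})
clf = tree.DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(train[["Price", "Group_Index"]], train["Google"])

# Toy IPO rows; symbols and amounts are hypothetical.
ipo = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC"],
    "Offer Amount": [7.8e7, 2.8e8, 1.47e9],
    "Group_Index": [0, 2, 2],
})

# Assumption: the IPO offer amount plays the role of the acquisition price.
X_ipo = ipo[["Offer Amount", "Group_Index"]].rename(columns={"Offer Amount": "Price"})

# apply() returns the id of the leaf node each row falls into.
ipo["Leaf"] = clf.apply(X_ipo)
# Share of class 1 (Google) among training rows in that leaf.
ipo["P_Google"] = clf.predict_proba(X_ipo)[:, 1]
print(ipo)
```

Grouping the IPO rows by the Leaf column would then give one candidate pool per leaf for the text analysis.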