# Housekeeping

The following code will allow you to load and install all required packages, libraries and functions in order to run the code in the notebook. If a package is not installed on your computer, a line of code is provided as a comment that can be used to install the package with via the command prompts of Conda. 

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
from tld import get_tld
from datetime import date
import matplotlib.pyplot as plt
from matplotlib.widgets import Slider
import matplotlib as mpl
mpl.rcParams['text.color'] = 'black'
from matplotlib import style
style.use('ggplot')
pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.linear_model import Lasso, LassoCV

In [34]:
# Read the .csv files
dforg = pd.read_csv('dforgbonus_final.csv', index_col = 0)
dfinv = pd.read_csv('dfinv_final.csv', index_col = 0)
dfacq = pd.read_csv('dfacq_final.csv', index_col = 0)
dfrou = pd.read_csv('dfrou_final.csv', index_col = 0)

## End of data prep and engineering

# Predictive model

In this section we build a predictive model that can estimate the amount of funding start-ups receive. First we need to create the dataframe for the model by taking key variables from all of the other sheets. Thereafter we use 5-fold cross-validation and test data to create a model that generalises well and does not overfit.

## Creating the dataframe for the model

### Data from organization

The whole dforg dataframe will be used as basis for the model master sheet. This because it has the correct permalink to combine all other sheets. Later on, when everything is added, only the necessary columns will remain.

In [35]:
dfmodel = dforg.copy()

### Data from investment

A new variable needs to be added, namely the percentage of investor type i.e. organization or person.  The percentage indicates the amount of funding that came from organizations

In [36]:
# First create a variable where organization is equal to 0
dfinv['org_bin'] = np.where(dfinv['investor_type_x'] == 'organization', 1,0)
# The organizations are grouped together and the percentages are calculated by taking the sum and dividing that by the count.
dfinv['O/P_%'] = dfinv.groupby('company_permalink')['org_bin'].transform('sum') / dfinv.groupby('company_permalink')['org_bin'].transform('count')
# Keep only the first sample to avoid duplicates
dfinv=dfinv.groupby('company_permalink').first().reset_index()

# Merge
dfmodel=pd.merge(dfmodel,dfinv[['company_permalink','O/P_%']],how='left',left_on='permalink',right_on='company_permalink')
dfmodel.drop('company_permalink',axis=1,inplace=True)

### Data from acquisition

Extra data is mainly centered around the average time to acquisition from founding and from last funding.

In [37]:
# Consider the average time from companies that were acquired more than one time
dfacq['avg_time_acq_last_funding']=dfacq.groupby('company_permalink')['time_acq_last_funding'].transform('mean')
dfacq['avg_time_founding_acq']=dfacq.groupby('company_permalink')['time_founding_acq'].transform('mean')
dfacq=dfacq.groupby('company_permalink').first().reset_index()

# Merge
dfmodel = pd.merge(dfmodel, dfacq[['company_permalink','time_acq_last_funding','time_founding_acq']],how='left',left_on='permalink',right_on='company_permalink')
dfmodel.drop('company_permalink',axis=1,inplace=True)

### Data from round

The only extra data we want to extract from here is the last type of funding that each company received i.e. series A, B, C, seed etc.

In [38]:
# Sort values to extract the final funding type of each company
dfrou.sort_values(by='funded_at',ascending=False,inplace=True)
dfrou=dfrou.groupby('company_permalink').first().reset_index()

# Merge to get the final funding round types for each company.
dfmodel = pd.merge(dfmodel,dfrou[['company_permalink','funding_round_type']],how='left',left_on='permalink',right_on='company_permalink')
dfmodel.drop('company_permalink',axis=1,inplace=True)

### Change markets in the bottom 20% to 'other'

As there are a lot of different markets to be found, we decided to combine the largest chunk of these together. We decided on an 80/20 split where the top 80% are still grouped by their market and the remaining 20% is grouped together and named "other".

In [39]:
# Create new dataframe to get the cummulative percentage of the top markets
dfmarketcount = dfmodel['market'].value_counts().rename_axis('market').reset_index(name = 'count')
dfmarketcount['cum_percent'] = 100 * (dfmarketcount['count'].cumsum()/dfmarketcount['count'].sum())

# At which index is the threshold of 80% reached
dfmarketcount[dfmarketcount['cum_percent'].gt(80)].index[0]  # From this we see that the top 75 markets make up 80% of the entries.

# Create list with top 74 markets
marketlink = list(dfmarketcount['market'].head(75))

# Change all other markets to 'other'
dfmodel.loc[~dfmodel['market'].isin(marketlink), 'market'] = 'Other'

### Change domains in the bottom 2% to 'other'

As there are a lot of different domains to be found, we decided to combine the largest chunk of these together. We decided on an 98/2 split where the top 98% are still grouped by their domain and the remaining 2% is grouped together and named "other". This split was chosen as the top 98% are less than 30 different domains. 

In [40]:
# Create new dataframe to get the cummulative percentage of the top domains
dfdomaincount = dforg['Domains'].value_counts().rename_axis('Domains').reset_index(name = 'count')
dfdomaincount['cum_percent'] = 100 * (dfdomaincount['count'].cumsum()/dfmarketcount['count'].sum())

# At which index is the threshold of 98% reached
dfdomaincount[dfdomaincount['cum_percent'].gt(98)].index[0]  # From this we see that the top 29 domains make up 98% of the entries.

# Create list with top 29 domains
domainlink = list(dfdomaincount['Domains'].head(29))

# Change all other domains to 'other'
dforg.loc[~dforg['Domains'].isin(domainlink), 'Domains'] = 'Other'

### Change all non-US states to 'Non-US'

In [41]:
# Change all non-US states to 'Non-US', this because we believe the states only to be interesting for the US.
dforg.loc[dforg['country_code'] != 'USA', 'state_code'] = 'Non-US'

### Transform categorical variables and years to dummies to run the models

One hot encoding can be defined as the process of changing categorical data variables to be fed into machine and deep learning algorithms, which improves the model's prediction and classification accuracy. Preprocessing categorical features for machine learning models using one hot encoding is a frequent practice.
This form of encoding creates a new binary feature for each potential category and assigns a value of 1 to each sample's feature that matches to its original category.

We chose to encode the following variables to dummy variables: 

In [42]:
# Funding round type
round_type = pd.get_dummies(dfmodel['funding_round_type'])
dfmodel = pd.concat([dfmodel,round_type],axis=1)
dfmodel.drop('funding_round_type',axis=1,inplace=True)

# Country code
country_code = pd.get_dummies(dfmodel['country_code'])
dfmodel = pd.concat([dfmodel,country_code],axis=1)
dfmodel.drop('country_code',axis=1,inplace=True)

# Market
market = pd.get_dummies(dfmodel['market'])
dfmodel = pd.concat([dfmodel,market],axis=1)
dfmodel.drop('market',axis=1,inplace=True)

# State code
state_code = pd.get_dummies(dfmodel['state_code'])
dfmodel = pd.concat([dfmodel,state_code],axis=1)
dfmodel.drop('state_code',axis=1,inplace=True)

# Domains
domains = pd.get_dummies(dfmodel['Domains'])
dfmodel = pd.concat([dfmodel,domains],axis=1)
dfmodel.drop('Domains',axis=1,inplace=True)

# First funding year
first_funding = pd.get_dummies(dfmodel['first_funding_year'])
dfmodel = pd.concat([dfmodel,first_funding],axis=1)
dfmodel.drop('first_funding_year',axis=1,inplace=True)

# Last funding year
last_funding = pd.get_dummies(dfmodel['last_funding_year'])
dfmodel = pd.concat([dfmodel,last_funding],axis=1)
dfmodel.drop('last_funding_year',axis=1,inplace=True)


## Normalize variables

As there are some variables (i.e. year and the differing amounts) where the range between entries is too large, a form of normalization needs to take place. Many machine learning algorithms perform better or converge faster when features are on a relatively similar scale and/or close to normally distributed.

We decided to use the MinMaxScaler. MinMaxScaler subtracts the feature's minimum value from each value in the feature, then divides by the range. The range is the difference between the maximum and minimum values.

The shape of the original distribution is preserved by MinMaxScaler. It has no effect on the information included in the original data.

It's worth noting that MinMaxScaler doesn't lessen the significance of outliers.

MinMaxScaler returns a feature with a default range of 0 to 1.

In [43]:
mms = MinMaxScaler()

dfmodel[['founding_year','DaysBetweenFunding','DaysBetweenFoundFund','DaysPerRound','gdpFOU','gdpFUNF',
        'gdpFUNL','gdp95','gdp15','gdp95_15','gdpFOU_FUNL','gdpFUNF_FUNL','gdppcFOU','gdppcFUNF',
        'gdppc95','gdppc15','gdppc95_15','gdppcFOU_FUNL','gdppcFUNF_FUNL','r&dFOU','r&dFUNF','r&dFUNL',
        'r&d95','r&d15','r&d95_15','r&dFOU_FUNL','r&dFUNF_FUNL','time_acq_last_funding',
        'time_founding_acq']] = mms.fit_transform(dfmodel[['founding_year','DaysBetweenFunding','DaysBetweenFoundFund','DaysPerRound','gdpFOU','gdpFUNF',
        'gdpFUNL','gdp95','gdp15','gdp95_15','gdpFOU_FUNL','gdpFUNF_FUNL','gdppcFOU','gdppcFUNF',
        'gdppc95','gdppc15','gdppc95_15','gdppcFOU_FUNL','gdppcFUNF_FUNL','r&dFOU','r&dFUNF','r&dFUNL',
        'r&d95','r&d15','r&d95_15','r&dFOU_FUNL','r&dFUNF_FUNL','time_acq_last_funding','time_founding_acq']])

## Split dataset into training and test set

In [44]:
#Drop unnecessary columns
dfmodel.drop(['permalink','name','homepage_url','category_list','founded_at','first_funding_at',
              'last_funding_at','region','city', 'gdp95', 'gdp15', 'gdp95_15', 'gdppc95', 'gdppc15', 
              'gdppc95_15', 'r&d95', 'r&d15', 'r&d95_15'],axis=1,inplace=True)

#Replace NaN for 0 to avoid ValueError: Input contains NaN, infinity or a value too large for dtype('float32')
dfmodel = dfmodel.replace((np.inf, -np.inf, np.nan), 0).reset_index(drop=True)

#Split set into features and target variable
x = dfmodel.drop('funding_total_usd',axis=1)
y = dfmodel[['funding_total_usd']]

#We also define random_state which corresponds to the seed, so that results are reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state=25)

## Running OLS Model

The objective of a linear regression model is to find a relationship between one or more features(independent variables) and a continuous target variable(dependent variable).

In [75]:
# Run the initial model to establish a baseline
# Create a based model
modelOLS = LinearRegression(fit_intercept = True)

modelOLS.fit(x_train, y_train)

LinearRegression()

In [77]:
# Make predictions
y_pred_ols = modelOLS.predict(x_test)
# Print performance metrics
print(r2_score(y_test, y_pred_ols))
print(mean_absolute_error(y_test,y_pred_ols))

-57941031.91803935
23871830230.74381


The values of the performance metrics shown above are quite large and unhelpful, especially the R-squared. 

## Running Lasso Regression

The main purpose in Lasso Regression is to find the coefficients that minimize the error sum of squares by applying a penalty to these coefficients.

In [82]:
# Define model
modelLasso = Lasso(max_iter=10000).fit(x_train, y_train)

  model = cd_fast.enet_coordinate_descent(


In [89]:
# Make predictions
y_pred_lasso = modelLasso.predict(x_test)
# Print performance metrics
print(r2_score(y_test, y_pred_lasso))
print(mean_absolute_error(y_test, y_pred_lasso))

0.13161494951420794
16978982.44416478


### Hyperparameter tuning

To see whether this score can be improved we choose to do a grid search for hyperparameter tuning. Hyperparameter is a parameter whose value is used to contrl the learning process, and hyper-parameter tuning means choosing the optimal parameters

In [85]:
# Hyperparameter tuning using 5 fold cross-validation.
lasso_cv_model = LassoCV(alphas = np.random.randint(0,1000,100), cv = 5, max_iter = 10000).fit(x_train, y_train)

  return f(*args, **kwargs)
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd

  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descent_gram(
  model = cd_fast.enet_coordinate_descen

In [86]:
# View best alpha value
lasso_cv_model.alpha_

987

In [87]:
# Set up the corrected Lasso model with optimum alpha value
modelLasso_tuned = Lasso(max_iter = 10000).set_params(alpha = lasso_cv_model.alpha_).fit(x_train,y_train)

In [88]:
# Make predictions using tuned model
y_pred_lasso_tuned = modelLasso_tuned.predict(x_test)
# Print performance metrics
print(r2_score(y_test, y_pred_lasso_tuned))
print(mean_absolute_error(y_test, y_pred_lasso_tuned))

0.13050090355335842
16865860.672136225


As we see here, the model performs worse after tuning than before so the R-squared of the Random Forest will be taken as __0.1316__, which means that the independent variables in the Lasso Regression Model explain 13.16% of the change in the dependent variable for this data set.

## Running Random Forest

Random forest is a supervised machine learning algorithm that is commonly used to solve classification and regression problems. It creates decision trees from various samples, using the majority vote for classification and the average for regression.

In [55]:
# Run the initial model to establish a baseline
# Create a based model
rf = RandomForestRegressor()

rf.fit(x_train, y_train)

  rf.fit(x_train, y_train)


RandomForestRegressor()

In [57]:
print(rf.score(x_test,y_test))
print(mean_absolute_error(y_test,rf.predict(x_test)))

0.15628480021642832
13824531.008093499


### Hyperparameter tuning

To see whether this score can be improved we choose to do a grid search for hyperparameter tuning. Hyperparameter is a parameter whose value is used to contrl the learning process, and hyper-parameter tuning means choosing the optimal parameters

In [49]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 500, num = 5)]
# Number of features to consider at every split
max_features = ['auto','sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(2, 6, num = 3 )]
max_depth.append(None)
# Minimum number of samples required to split a node
#min_samples_split = [2, 5]
# Minimum number of samples required at each leaf node
#min_samples_leaf = [1, 2]
# Method of selecting samples for training each tree
bootstrap = [True] #False?

# Create the parameter grid to tune the hyper-parameters 
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               #'min_samples_split': min_samples_split,
               #'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 5, n_jobs = -1, verbose = 10)

# Fit the grid search to the data
grid_search.fit(x_train, y_train)

Fitting 5 folds for each of 40 candidates, totalling 200 fits


  self.best_estimator_.fit(X, y, **fit_params)


GridSearchCV(cv=5, estimator=RandomForestRegressor(), n_jobs=-1,
             param_grid={'bootstrap': [True], 'max_depth': [2, 4, 6, None],
                         'max_features': ['auto', 'sqrt'],
                         'n_estimators': [100, 200, 300, 400, 500]},
             verbose=10)

### View the best parameters from fitting the grid search

In [51]:
grid_search.best_params_

{'bootstrap': True,
 'max_depth': None,
 'max_features': 'sqrt',
 'n_estimators': 400}

### Evaluate the Best Random Forest Model from Grid Search

In [64]:
#Select best model
best_grid = grid_search.best_estimator_

#Test model performance
print(best_grid.score(x_test,y_test))
print(mean_absolute_error(y_test,best_grid.predict(x_test)))

0.11850848198958486
14066650.499949815


As we see here, the model performs worse after tuning than before so the R-squared of the Random Forest will be taken as __0.1562__

In [67]:
# get importance
importances = pd.DataFrame(data={
    'Attribute': x_train.columns,
    'Importance': best_grid.feature_importances_
})
importances = importances.sort_values(by='Importance', ascending=False)
print(importances.shape[0])
importances[importances['Importance'] >0]

513


Unnamed: 0,Attribute,Importance
0,funding_rounds,5.203122e-02
12,DaysBetweenFoundFund,4.136170e-02
11,DaysBetweenFunding,3.988007e-02
13,DaysPerRound,3.952236e-02
53,private_equity,3.675577e-02
...,...,...
70,BLR,1.541648e-10
80,CRI,6.953475e-11
152,TUN,5.763249e-11
454,tk,1.572190e-11


## Running Support Vector Machine Regression

In [96]:
# Define model
modelSVM = SVR(kernel = 'linear')  # We chose for the linear kernel as we have many features in our data set.

modelSVM.fit(x_train, y_train)

  return f(*args, **kwargs)


SVR(kernel='linear')

In [97]:
# Make predictions
y_pred_SVM = modelSVM.predict(x_test)
# Print performance metrics
print(r2_score(y_test, y_pred_SVM))
print(mean_absolute_error(y_test, y_pred_SVM))

-0.02407807177428234
15557847.645472461


Due to the bad performance metric scores we did not decide to pursue further tuning.

# Gradient Boosting

Next, we try the gradient boosting method to estimate the amount of funding startups receive. This method is builds the model additively and fits a regression tree on the negative gradient of a loss function in each stage. 
It can be categorized as an ensemble method, where predictions of multiple base estimators are combined to improve the generalizability. As our goal is to create a model that generalizes well and does not overfit, this is an appropriate method to consider.

In contrast to the random forests where predictions of single estimators are averaged after they have been built independently, in gradient boosting, these estimators are built sequentially and try to reduce the bias of the combined estimator. 

## Gradient Boosting with sklearn

With secondary data included:

In [45]:
from sklearn.ensemble import GradientBoostingRegressor

params = {'n_estimators': 100, 'max_depth': 5, 'learning_rate': 0.1, 'criterion': 'mse'}
gradient_boosting_regressor_model = GradientBoostingRegressor(**params)
y_train = np.asarray(y_train['funding_total_usd'])
gradient_boosting_regressor_model.fit(x_train, y_train)
y_pred = gradient_boosting_regressor_model.predict(x_test)
y_pred

#Test score
gradient_boosting_regressor_model.score(x_test, y_test)

#R^2
print(r2_score(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))

0.14015555652364742
14215340.714933703


## Gradient Boosting using XGBoost

XGBoost is a more regularized form of Gradient Boosting and is known for picking up patterns and regularities in the data. It uses advanced regularization (L1 & L2), which improves model generalization capabilities. Therefore, it tends to deliver higher performance compared to gradient boosting. 
This is achieved by improvisations made on the gradient boosting framework, by introducing more accurate approximations that avoid overfitting and help find the best tree model.

In [2]:
#pip install xgboost

Collecting xgboost
  Downloading xgboost-1.5.2-py3-none-win_amd64.whl (106.6 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.5.2
Note: you may need to restart the kernel to use updated packages.


In [47]:
#Find duplicate columns
#duplicate_columns = x_train.columns[x_train.columns.duplicated()]
#duplicate_columns
#Change names/drop duplicate columns
x_train.columns.values[512] = "lf_2014"
x_train.columns.values[511] = "lf_2013"
x_train.columns.values[510] = "lf_2012"
x_train.columns.values[509] = "lf_2011"
x_train.columns.values[508] = "lf_2010"
x_train.columns.values[507] = "lf_2009"
x_train.columns.values[506] = "lf_2008"
x_train.columns.values[505] = "lf_2007"
x_train.columns.values[504] = "lf_2006"
x_train.columns.values[503] = "lf_2005"
x_train.columns.values[502] = "lf_2004"
x_train.columns.values[501] = "lf_2003"
x_train.columns.values[500] = "lf_2002"
x_train.columns.values[499] = "lf_2001"
x_train.columns.values[498] = "lf_2000"
x_train.columns.values[497] = "lf_1999"
x_train.columns.values[496] = "lf_1998"
x_train.columns.values[495] = "lf_1997"
x_train.columns.values[494] = "lf_1996"
x_train.columns.values[493] = "lf_1994"
x_train.columns.values[492] = "lf_1990"
x_train.drop(x_train.columns[15], axis=1,inplace=True)

x_test.columns.values[512] = "lf_2014"
x_test.columns.values[511] = "lf_2013"
x_test.columns.values[510] = "lf_2012"
x_test.columns.values[509] = "lf_2011"
x_test.columns.values[508] = "lf_2010"
x_test.columns.values[507] = "lf_2009"
x_test.columns.values[506] = "lf_2008"
x_test.columns.values[505] = "lf_2007"
x_test.columns.values[504] = "lf_2006"
x_test.columns.values[503] = "lf_2005"
x_test.columns.values[502] = "lf_2004"
x_test.columns.values[501] = "lf_2003"
x_test.columns.values[500] = "lf_2002"
x_test.columns.values[499] = "lf_2001"
x_test.columns.values[498] = "lf_2000"
x_test.columns.values[497] = "lf_1999"
x_test.columns.values[496] = "lf_1998"
x_test.columns.values[495] = "lf_1997"
x_test.columns.values[494] = "lf_1996"
x_test.columns.values[493] = "lf_1994"
x_test.columns.values[492] = "lf_1990"
x_test.drop(x_test.columns[15], axis=1,inplace=True)

With secondary data included:

In [48]:
from xgboost import XGBRegressor

#drop features with the same name

params = {'max_depth': 8, 'subsample': 0.8, 'lambda': 3}
XGB_model = XGBRegressor(**params)
XGB_model.fit(x_train, y_train)
y_pred = XGB_model.predict(x_test)
#Test score
XGB_model.score(x_test, y_test)

#R^2
print(r2_score(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))

0.1865698193348978
14606070.948245412


We thus see that the model increased in performance with the XGBoost algorithm  

# K-nearest neighbors

In [49]:
from sklearn.neighbors import KNeighborsRegressor

np.random.seed(1234)
params = {'n_neighbors': 15}
knn_model = KNeighborsRegressor(**params)
#knn_model = KNeighborsRegressor()
#y_trainGB = np.asarray(y_trainGB['funding_total_usd'])
knn_model.fit(x_train, y_train)
y_pred = knn_model.predict(x_test)
y_pred

#Test score
knn_model.score(x_test, y_test)

#R^2
print(r2_score(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))

0.017088942302536436
17948528.16029238


Due to its inferior performance compared to the ensemble methods, this method does not seem worth investigating further.