# Hackathon!
<img src="http://i.imgur.com/5l1MUjk.gif" width="600" height="400" />


Today's challenge will allow you to put together all the **moving parts** of the modelling process, in a group setting.

Your group will need to create the best possible model for predicting a company's success in raising funding.

**The evaluation metric will be the mean cross validated (cv=5) accuracy for the best model you can find.**

Tips:
- Focus the EDA on data cleaning. 
    - Check for the missing values (might need to either drop columns or rows or try to impute).
    - Determine the continuous and the categorical features. Create a list of names for each.
    - Some categorical values could be broken down into multiple columns.
    - **If a feature is giving you trouble, take it out from the feature matrix until you get your first models up and running**
    
- Make it a group effort, divide the work, and share through Slack. But also be mindful of the time it might take to integrate the code.

- Have some of you:
    - write a function to quickly grid-search models and output their specification and performance,
    - while someone else focuses on the data cleaning,
    - and someone else looks for good hyperparameter ranges for the models you plan to try out.
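
The grid-search helper mentioned in the tips could look roughly like the sketch below. The function name `quick_grid_search`, its arguments, and the synthetic data are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def quick_grid_search(model, param_grid, X, y, cv=5, scoring="accuracy"):
    """Hypothetical helper: run a grid search and return
    (best_params, best_mean_cv_score, results_table)."""
    gs = GridSearchCV(model, param_grid=param_grid, scoring=scoring, cv=cv)
    gs.fit(X, y)
    # One row per parameter combination, best mean CV score first
    results = (
        pd.DataFrame(gs.cv_results_)[["params", "mean_test_score", "std_test_score"]]
        .sort_values("mean_test_score", ascending=False)
        .reset_index(drop=True)
    )
    return gs.best_params_, gs.best_score_, results


# Synthetic stand-in data (not the startup dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

best_params, best_score, results = quick_grid_search(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]},
    X_demo,
    y_demo,
)
print(best_params, round(best_score, 3))
print(results.head())
```

Returning the full results table, rather than only the winner, makes it easier to compare specifications across teammates.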

## imports

In [1]:
# Your code here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

plt.style.use('seaborn-notebook')
%matplotlib inline 

## The dataset

**Important**: The dataset is "provided as is". You can try to form hypotheses about different features, but the main goal is to create a good predictive model.

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/mkleinbort/Python-Knowledge-Share/master/resource-datasets/startup_success/CAX_Startup_Data.csv", encoding = "ISO-8859-1")
df.head()

Unnamed: 0,Company_Name,Dependent-Company Status,year of founding,Age of company in years,Internet Activity Score,Short Description of company profile,Industry of company,Focus functions of company,Investors,Employee Count,...,Percent_skill_Data Science,Percent_skill_Business Strategy,Percent_skill_Product Management,Percent_skill_Sales,Percent_skill_Domain,Percent_skill_Law,Percent_skill_Consulting,Percent_skill_Finance,Percent_skill_Investment,Renown score
0,Company1,Success,No Info,No Info,-1.0,Video distribution,,operation,KPCB Holdings|Draper Fisher Jurvetson (DFJ)|Kl...,3.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0
1,Company2,Success,2011,3,125.0,,Market Research|Marketing|Crowdfunding,"Marketing, sales",,,...,8.823529412,21.76470588,10.88235294,2.941176471,0.0,0,0,0,0,8
2,Company3,Success,2011,3,455.0,Event Data Analytics API,Analytics|Cloud Computing|Software Development,operations,TechStars|Streamlined Ventures|Amplify Partner...,14.0,...,3.846153846,17.09401709,9.401709402,0.0,2.777777778,0,0,0,0,9
3,Company4,Success,2009,5,-99.0,The most advanced analytics for mobile,Mobile|Analytics,Marketing & Sales,Michael Birch|Max Levchin|Sequoia Capital|Keit...,45.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,5
4,Company5,Success,2010,4,496.0,The Location-Based Marketing Platform,Analytics|Marketing|Enterprise Software,Marketing & Sales,DFJ Frontier|Draper Nexus Ventures|Gil Elbaz|A...,39.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,6


In [3]:
df.shape

(472, 116)

In [4]:
# Drop columns with null values
df = df.dropna(axis=1, how='any')

In [5]:
df.shape

(472, 98)
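
Dropping every column that has a null is quick, but it discards 18 columns and does not catch text placeholders such as the `No Info` values visible in the preview. As a rough alternative, the sketch below inspects the share of missing values and imputes instead of dropping. The toy frame, the median fill, and the `Unknown` label are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "year of founding": ["2011", "No Info", "2009", "2010"],
    "Employee Count": [3.0, np.nan, 14.0, 45.0],
    "Industry of company": ["Analytics", None, "Mobile", "Analytics"],
})

# Treat the "No Info" placeholder as missing, then parse the year as a number
toy = toy.replace("No Info", np.nan)
toy["year of founding"] = pd.to_numeric(toy["year of founding"])

# Fraction of missing values per column, worst first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Impute instead of dropping: median for numeric columns, a sentinel label for text
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())
toy[cat_cols] = toy[cat_cols].fillna("Unknown")
print(toy)
```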

## The baseline

In [6]:
# ~65% baseline accuracy (share of the majority class)
df['Dependent-Company Status'].value_counts(normalize=True)

Success    0.646186
Failed     0.353814
Name: Dependent-Company Status, dtype: float64
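
To compare the baseline with later models on the same footing, it can also be scored with the same 5-fold cross-validation. This is a minimal sketch using scikit-learn's `DummyClassifier`; the labels are synthetic and only mimic the class balance above (305 and 167 rows, derived from the printed proportions and the 472 rows).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic labels with roughly the class balance shown above
y_demo = np.array([1] * 305 + [0] * 167)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(baseline, X_demo, y_demo, scoring="accuracy", cv=5)
print(scores.mean())
```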

## Feature Creation

In [7]:
# Convert all categorical features to numeric codes for scatter charts and distribution analysis
def conv_to_cat(df):
   """
   Converts all non-numeric columns in dataframe df to categorical dtype.
   Input:
       df - pd.DataFrame to process. May contain only numerical columns and text columns of a "categorical" nature.
       If any text column in df has a unique value in every row, the resulting codes are not meaningful.
       None of the text columns may already have the "category" dtype assigned,
       as np.issubdtype gets confused by the pandas categorical dtype.
   Output:
       df - df with all non-numeric columns converted to categorical
       df_num - copy of df with all non-numeric columns converted to integer codes (each column's .values.codes)
   """
   import numpy as np
   df_num = df.copy() # df version with all categorical values converted to numbers
   for c in df.columns:
       if np.issubdtype(df[c].dtype, np.number):
           pass # do nothing, it is already a number
       else:
           # First - convert it to categorical
           df[c] = df[c].astype('category')
           # Second - convert it to numerical value
           df_num[c] = df[c].values.codes
   return df, df_num
           
df_cat, df_num = conv_to_cat(df.copy())

In [8]:
df_num.head()

Unnamed: 0,Company_Name,Dependent-Company Status,year of founding,Number of Investors in Seed,Number of Investors in Angel and or VC,Number of Co-founders,Number of of advisors,Team size Senior leadership,Team size all employees,Presence of a top angel or venture fund in previous round of investment,...,Percent_skill_Data Science,Percent_skill_Business Strategy,Percent_skill_Product Management,Percent_skill_Sales,Percent_skill_Domain,Percent_skill_Law,Percent_skill_Consulting,Percent_skill_Finance,Percent_skill_Investment,Renown score
0,0,1,15,8,0,1,2,2,15,2,...,0,0,0,0,0,0,0,0,0,0
1,111,1,12,13,0,2,0,4,23,0,...,78,58,9,19,0,0,0,0,0,10
2,222,1,12,6,0,3,0,7,1,0,...,47,37,59,0,25,0,0,0,0,11
3,333,1,10,14,0,2,0,4,60,2,...,0,0,0,0,0,0,0,0,0,7
4,417,1,11,15,0,1,1,8,48,0,...,0,0,0,0,0,0,0,0,0,8
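
Integer codes treat each distinct string as one category. Columns such as `Industry of company` in the raw preview hold several pipe-separated tags per row, so one possible way to follow the tip about breaking values into multiple columns is to build one indicator column per tag. The sketch below uses values copied from the preview; the `industry_` prefix is an arbitrary choice.

```python
import pandas as pd

# Values copied from the "Industry of company" column in the preview
industry = pd.Series([
    "Market Research|Marketing|Crowdfunding",
    "Analytics|Cloud Computing|Software Development",
    "Mobile|Analytics",
    "Analytics|Marketing|Enterprise Software",
], name="Industry of company")

# One 0/1 indicator column per distinct tag
industry_dummies = industry.str.get_dummies(sep="|").add_prefix("industry_")
print(industry_dummies.columns.tolist())
print(industry_dummies["industry_Analytics"].tolist())
```

The resulting frame can be joined back onto the feature matrix with `pd.concat([...], axis=1)` once the original text column is dropped.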


## Set Feature Matrix

Using all features.

In [11]:
X = df_num.drop(['Dependent-Company Status'], axis=1)
y = df_num['Dependent-Company Status']

print(X.shape)
print(y.shape)

(472, 97)
(472,)


In [12]:
from sklearn.preprocessing import StandardScaler
# Scale X as X_ss
ss = StandardScaler()
X_scale = ss.fit_transform(X)
X_ss = pd.DataFrame(X_scale)
X_ss.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,87,88,89,90,91,92,93,94,95,96
0,-1.728385,1.424199,0.534092,-0.491627,-0.760455,0.488448,-0.688552,-0.98424,1.758582,1.477015,...,-0.886641,-1.217979,-0.919267,-0.832322,-0.857997,-0.40502,-0.462146,-0.500113,-0.600688,-1.317037
1,-0.913732,0.359995,1.310236,-0.491627,0.114996,-0.505291,0.107033,-0.713359,-0.752918,-0.553128,...,1.482713,0.089092,-0.540835,0.049562,-0.857997,-0.40502,-0.462146,-0.500113,-0.600688,1.087766
2,-0.099079,0.359995,0.223635,-0.491627,0.990446,-0.505291,1.300411,-1.458283,-0.752918,-0.553128,...,0.541046,-0.384158,1.56157,-0.832322,-0.037878,-0.40502,-0.462146,-0.500113,-0.600688,1.328246
3,0.715573,-0.349474,1.465465,-0.491627,0.114996,-0.505291,0.107033,0.539467,1.758582,-0.553128,...,-0.886641,-1.217979,-0.919267,-0.832322,-0.857997,-0.40502,-0.462146,-0.500113,-0.600688,0.366325
4,1.332067,0.005261,1.620694,-0.491627,-0.760455,-0.008422,1.698204,0.133145,-0.752918,-0.553128,...,-0.886641,-1.217979,-0.919267,-0.832322,-0.857997,-0.40502,-0.462146,-0.500113,-0.600688,0.606805


## Modelling Imports

In [13]:
# Modelling library imports
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
# Visualize decision tree outputs
from io import StringIO
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

# initialize the output file object
dot_data = StringIO() 

def build_tree_image(model,dot_data):
    export_graphviz(model, 
                out_file = dot_data,  
                filled = True, 
                rounded = True,
                special_characters = True,
                feature_names = X.columns)  

    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
    return graph
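
If Graphviz or `pydotplus` is not installed, a plain-text dump of the tree is a lightweight alternative. This is a sketch using scikit-learn's `export_text` on synthetic data; the feature names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (not the startup dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# Fit a shallow tree and print its split rules as indented text
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)
rules = export_text(tree, feature_names=feature_names)
print(rules)
```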

----

## Decision Tree

In [14]:
# Create a decision tree classifier with a max depth of 5 levels and a minimum of 10 samples per leaf
model = DecisionTreeClassifier(max_depth=5,
                              min_samples_leaf = 10)
    
# Fit the model
model.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=10, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [15]:
for scoring in ['roc_auc','accuracy','recall','precision']:
    scores = cross_val_score(model, X, y, scoring=scoring, cv=5)
    print( scoring, ':')
    print('CV {} {}, Average {} {}'.format(scoring,scores,scoring,scores.mean()))

roc_auc :
CV roc_auc [ 0.7789296   0.68563163  0.73695976  0.94262295  0.95901639], Average roc_auc 0.8206320680284037
accuracy :
CV accuracy [ 0.82105263  0.63157895  0.84042553  0.82978723  0.81914894], Average accuracy 0.7883986562150056
recall :
CV recall [ 0.93442623  0.75409836  0.98360656  0.73770492  0.72131148], Average recall 0.8262295081967214
precision :
CV precision [ 0.81428571  0.6969697   0.81081081  1.          1.        ], Average precision 0.8644132444132445
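
The loop above refits the model once per metric. A possible shortcut is `cross_validate`, which scores several metrics on the same folds in one pass. The sketch below assumes synthetic data and a fixed `random_state`, neither of which comes from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the startup dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
model_demo = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=0)

# Score all four metrics on the same 5 folds
metrics = ["roc_auc", "accuracy", "recall", "precision"]
cv_results = cross_validate(model_demo, X_demo, y_demo, scoring=metrics, cv=5)

for name in metrics:
    fold_scores = cv_results[f"test_{name}"]
    print(f"{name}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

Printing the standard deviation alongside the mean helps judge how much the folds disagree.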


## DecisionTree Parameter Tuning

In [16]:
# Create another decision tree classifier
model_gscv = DecisionTreeClassifier()

# Try depth from 1 to 10 levels and minimum leaf samples from 1 to 10
params = {'max_depth': range(1,11),
          'min_samples_leaf': range(1,11)}

# Perform a CV grid search with the parameters specified (cv is not set, so the library default applies: 3-fold in the version that produced this output)
gs = GridSearchCV(model_gscv, param_grid=params, scoring='accuracy')
gs.fit(X_ss,y)

GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': range(1, 11), 'min_samples_leaf': range(1, 11)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)

In [17]:
for scoring in ['roc_auc','accuracy','recall','precision']:
    scores = cross_val_score(gs.best_estimator_, X_ss, y, scoring=scoring, cv=5)
    print( scoring, ':')
    print('CV {} {}, Average {} {}'.format(scoring,scores,scoring,scores.mean()))

roc_auc :
CV roc_auc [ 0.87102218  0.73698168  0.89915549  0.99180328  0.97540984], Average roc_auc 0.8948744922708279
accuracy :
CV accuracy [ 0.85263158  0.81052632  0.84042553  0.88297872  0.85106383], Average accuracy 0.847525195968645
recall :
CV recall [ 0.98360656  1.          0.98360656  0.81967213  0.7704918 ], Average recall 0.9114754098360656
precision :
CV precision [ 0.82191781  0.7721519   0.81081081  1.          1.        ], Average precision 0.8809761035528332
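
Scoring `gs.best_estimator_` on the same rows that were used to choose its hyperparameters tends to give an optimistic estimate. One common remedy is nested cross-validation, where the whole search is repeated inside each outer fold. The sketch below is an illustration on synthetic data with a reduced grid, not the notebook's exact setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the startup dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Inner loop: hyperparameter search
inner_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]},
    scoring="accuracy",
    cv=3,
)

# Outer loop: each fold re-runs the whole search on its training part only
nested_scores = cross_val_score(inner_search, X_demo, y_demo, scoring="accuracy", cv=5)
print(nested_scores.mean())
```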


---

## Random Forest

In [21]:
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(n_estimators = 20)
    
model_rf.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [22]:
for scoring in ['roc_auc','accuracy','recall','precision']:
    scores = cross_val_score(model_rf, X, y, scoring=scoring, cv=5)
    print( scoring, ':')
    print('CV {} {}, Average {} {}'.format(scoring,scores,scoring,scores.mean()))

roc_auc :
CV roc_auc [ 0.90284474  0.8924783   0.80228515  1.          1.        ], Average roc_auc 0.9195216387598257
accuracy :
CV accuracy [ 0.71578947  0.84210526  0.79787234  0.90425532  0.88297872], Average accuracy 0.8286002239641658
recall :
CV recall [ 1.          1.          0.98360656  0.90163934  0.81967213], Average recall 0.9409836065573771
precision :
CV precision [ 0.71764706  0.7625      0.79220779  1.          1.        ], Average precision 0.8544709702062644


## RandomForest Parameter Tuning

In [None]:
# New parameters
from sklearn.ensemble import RandomForestClassifier

model_rf3 = RandomForestClassifier()

param_range_rf = [4, 7, 8, 9, 10]
params_rf = [{'criterion': ['gini', 'entropy'],
            'min_samples_leaf': param_range_rf,
            'max_depth': param_range_rf,
            'min_samples_split': param_range_rf[1:],
            'n_estimators': [10, 20]}]

#model_rf2 = RandomForestClassifier(n_estimators = 100, criterion='gini', max_depth=7, min_samples_leaf=2)
    
#model_rf2.fit(X_ss, y)

# Grid search
gs_rf3 = GridSearchCV(model_rf3, param_grid=params_rf, scoring='accuracy', cv=5)
gs_rf3.fit(X_ss,y)

for scoring in ['roc_auc','accuracy','recall','precision']:
    scores = cross_val_score(gs_rf3.best_estimator_, X_ss, y, scoring=scoring, cv=5)
    print( scoring, ':')
    print('CV {} {}, Average {} {}'.format(scoring,scores,scoring,scores.mean()))
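
An exhaustive grid over five hyperparameters grows quickly. When time is short, a randomized search that samples a fixed number of combinations may be enough. The sketch below uses `RandomizedSearchCV` on synthetic data; the grid here has 400 combinations, and `n_iter=10` is an arbitrary budget chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the startup dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

param_distributions = {
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [4, 7, 8, 9, 10],
    "max_depth": [4, 7, 8, 9, 10],
    "min_samples_split": [7, 8, 9, 10],
    "n_estimators": [10, 20],
}

# Sample 10 of the 400 combinations instead of trying them all
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,
    scoring="accuracy",
    cv=5,
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```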

In [68]:
gs_rf2.best_estimator_.get_params()

{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 5,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 3,
 'min_samples_split': 9,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': 1,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

## Try Multiple Models With GSCV

In [33]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm

# Construct some pipelines
pipe_lr = Pipeline([('scl', StandardScaler()),
            ('clf', LogisticRegression())])

pipe_lr_pca = Pipeline([('scl', StandardScaler()),
            ('pca', PCA(n_components=2)),
            ('clf', LogisticRegression())])

pipe_rf = Pipeline([('scl', StandardScaler()),
            ('clf', RandomForestClassifier())])

pipe_rf_pca = Pipeline([('scl', StandardScaler()),
            ('pca', PCA(n_components=2)),
            ('clf', RandomForestClassifier())])

pipe_svm = Pipeline([('scl', StandardScaler()),
            ('clf', svm.SVC())])

pipe_svm_pca = Pipeline([('scl', StandardScaler()),
            ('pca', PCA(n_components=2)),
            ('clf', svm.SVC())])

# Set grid search params
param_range = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
param_range_fl = [1.0, 0.5, 0.1]

grid_params_lr = [{'clf__penalty': ['l1', 'l2'],
        'clf__C': param_range_fl,
        'clf__solver': ['liblinear']}] 

grid_params_rf = [{'clf__criterion': ['gini', 'entropy'],
            'clf__min_samples_leaf': param_range,
            'clf__max_depth': param_range,
            'clf__min_samples_split': param_range[1:]}]

grid_params_svm = [{'clf__kernel': ['linear', 'rbf'], 
            'clf__C': param_range}]

# Construct grid searches
jobs = -1

gs_lr = GridSearchCV(estimator=pipe_lr,
            param_grid=grid_params_lr,
            scoring='accuracy',
            cv=5) 

gs_lr_pca = GridSearchCV(estimator=pipe_lr_pca,
            param_grid=grid_params_lr,
            scoring='accuracy',
            cv=5)

gs_rf = GridSearchCV(estimator=pipe_rf,
            param_grid=grid_params_rf,
            scoring='accuracy',
            cv=5, 
            n_jobs=jobs)

gs_rf_pca = GridSearchCV(estimator=pipe_rf_pca,
            param_grid=grid_params_rf,
            scoring='accuracy',
            cv=5, 
            n_jobs=jobs)

gs_svm = GridSearchCV(estimator=pipe_svm,
            param_grid=grid_params_svm,
            scoring='accuracy',
            cv=5,
            n_jobs=jobs)

gs_svm_pca = GridSearchCV(estimator=pipe_svm_pca,
            param_grid=grid_params_svm,
            scoring='accuracy',
            cv=5,
            n_jobs=jobs)

# List of grid searches for ease of iteration
grids = [gs_lr, gs_lr_pca, gs_rf, gs_rf_pca, gs_svm, gs_svm_pca]

# Dictionary mapping each grid search index to its classifier type for ease of reference
grid_dict = {0: 'Logistic Regression', 1: 'Logistic Regression w/PCA', 
        2: 'Random Forest', 3: 'Random Forest w/PCA', 
        4: 'Support Vector Machine', 5: 'Support Vector Machine w/PCA'}

# Fit the grid search objects
print('Performing model optimizations...')
best_acc = 0.0
best_clf = 0
best_gs = ''
for idx, gs in enumerate(grids):
    print('\nEstimator: %s' % grid_dict[idx])
    # Fit grid search
    gs.fit(X, y)
    # Best params
    print('Best params: %s' % gs.best_params_)
    # Best mean cross-validated accuracy (printed as 'training accuracy')
    print('Best training accuracy: %.3f' % gs.best_score_)
    # Predict with the refit best model; X is the same data it was fit on, not a held-out test set
    y_pred = gs.predict(X)
    # Accuracy on the full training data (printed as 'test set accuracy'), so it is optimistic
    print('Test set accuracy score for best params: %.3f ' % accuracy_score(y, y_pred))
    # Track the model with the highest accuracy on that data
    if accuracy_score(y, y_pred) > best_acc:
        best_acc = accuracy_score(y, y_pred)
        best_gs = gs
        best_clf = idx
print('\nClassifier with best test set accuracy: %s' % grid_dict[best_clf])

Performing model optimizations...

Estimator: Logistic Regression
Best params: {'clf__C': 0.1, 'clf__penalty': 'l1', 'clf__solver': 'liblinear'}
Best training accuracy: 0.858
Test set accuracy score for best params: 0.919 

Estimator: Logistic Regression w/PCA
Best params: {'clf__C': 1.0, 'clf__penalty': 'l1', 'clf__solver': 'liblinear'}
Best training accuracy: 0.839
Test set accuracy score for best params: 0.856 

Estimator: Random Forest
Best params: {'clf__criterion': 'gini', 'clf__max_depth': 7, 'clf__min_samples_leaf': 4, 'clf__min_samples_split': 10}
Best training accuracy: 0.869
Test set accuracy score for best params: 0.951 

Estimator: Random Forest w/PCA
Best params: {'clf__criterion': 'entropy', 'clf__max_depth': 5, 'clf__min_samples_leaf': 3, 'clf__min_samples_split': 2}
Best training accuracy: 0.824
Test set accuracy score for best params: 0.892 

Estimator: Support Vector Machine
Best params: {'clf__C': 2, 'clf__kernel': 'rbf'}
Best training accuracy: 0.845
Test set accur

In [None]:
grids = [gs_rf]
grid_dict = {0: 'Random Forest'}

# Fit the grid search objects
print('Performing model optimizations...')
best_acc = 0.0
best_clf = 0
best_gs = ''
for idx, gs in enumerate(grids):
    print('\nEstimator: %s' % grid_dict[idx])
    # Fit grid search
    gs.fit(X, y)
    # Best params
    print('Best params: %s' % gs.best_params_)
    # Best mean cross-validated accuracy (printed as 'training accuracy')
    print('Best training accuracy: %.3f' % gs.best_score_)
    # Predict with the refit best model; X is the same data it was fit on, not a held-out test set
    y_pred = gs.predict(X)
    # Accuracy on the full training data (printed as 'test set accuracy'), so it is optimistic
    print('Test set accuracy score for best params: %.3f ' % accuracy_score(y, y_pred))
    # Track the model with the highest accuracy on that data
    if accuracy_score(y, y_pred) > best_acc:
        best_acc = accuracy_score(y, y_pred)
        best_gs = gs
        best_clf = idx
print('\nClassifier with best test set accuracy: %s' % grid_dict[best_clf])
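
The loops above call `gs.predict(X)` on the same rows used for fitting, so the second number they print is not a true test score. A held-out split is one way to get an untouched estimate. This is a sketch on synthetic data; the 25% test share and the single logistic-regression pipeline are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the startup dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Keep 25% of the rows aside; the search never sees them
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

pipe = Pipeline([("scl", StandardScaler()),
                 ("clf", LogisticRegression(solver="liblinear"))])
gs_demo = GridSearchCV(pipe, {"clf__C": [1.0, 0.5, 0.1]}, scoring="accuracy", cv=5)
gs_demo.fit(X_train, y_train)

cv_acc = gs_demo.best_score_
test_acc = accuracy_score(y_test, gs_demo.predict(X_test))
print("Mean CV accuracy (training part): %.3f" % cv_acc)
print("Held-out accuracy: %.3f" % test_acc)
```

Because the scaler sits inside the pipeline, it is refit on each training fold, which avoids leaking statistics from the validation rows.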

## SVM Model

In [47]:
# SVM model
model_svm = svm.SVC(C=2, kernel='rbf')
model_svm.fit(X_ss, y)

SVC(C=2, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [48]:
for scoring in ['roc_auc','accuracy','recall','precision']:
    scores = cross_val_score(model_svm, X_ss, y, scoring=scoring, cv=5)
    print( scoring, ':')
    print('CV {} {}, Average {} {}'.format(scoring,scores,scoring,scores.mean()))

roc_auc :
CV roc_auc [ 0.86162006  0.92237223  0.78042722  0.99105812  0.98857427], Average roc_auc 0.908810379591479
accuracy :
CV accuracy [ 0.74736842  0.82105263  0.82978723  0.93617021  0.87234043], Average accuracy 0.8413437849944009
recall :
CV recall [ 0.96721311  1.          0.98360656  0.90163934  0.80327869], Average recall 0.9311475409836065
precision :
CV precision [ 0.72839506  0.78205128  0.8         1.          1.        ], Average precision 0.8620892687559355


--------------

## EDA

In [14]:
# Check for missing values
df.isnull().values.any()

True

In [22]:
df.describe()

Unnamed: 0,Internet Activity Score,Employee Count,Employees count MoM change,Last Funding Amount,Number of Co-founders,Number of of advisors,Team size Senior leadership,Industry trend in investing
count,407.0,306.0,267.0,312.0,472.0,472.0,472.0,390.0
mean,114.159705,31.408497,-1.299625,6379489.0,1.868644,1.016949,3.730932,2.889744
std,225.651419,62.6736,23.967085,9901622.0,1.143481,2.014736,2.516539,0.845765
min,-725.0,0.0,-100.0,10000.0,0.0,0.0,1.0,0.0
25%,-3.5,4.25,0.0,750000.0,1.0,0.0,2.0,2.0
50%,60.0,13.0,0.0,2800000.0,2.0,0.0,3.0,3.0
75%,216.0,31.0,6.0,7850000.0,2.25,1.0,5.0,3.0
max,1535.0,594.0,50.0,77000000.0,7.0,13.0,24.0,5.0
