# Predicting the Survival of New Businesses in Vancouver

## Summary
This analysis aims to predict the survival of new businesses in Vancouver by examining various economic and demographic factors. Using datasets from the City business license registry (City of Vancouver 2023) and other external sources (Statistics Canada 2023), we explore the influence of location, industry, and economic indicators on business survival.

## Introduction
Vancouver's dynamic business landscape is influenced by various factors, including economic cycles, demographic shifts, and urban planning. Predicting the survival of new businesses in this environment is crucial for policymakers and entrepreneurs alike. This project seeks to answer: "Can the survival of a new business in Vancouver be predicted?" We utilize datasets from Vancouver's open data portal and integrate external data such as economic indicators and census data.

We use the following Python packages to perform the analysis: pandas (McKinney 2010), Altair (VanderPlas et al. 2018), and scikit-learn (Pedregosa et al. 2011).

### Dataset Description
The primary dataset comes from the City of Vancouver's business licence registry, which is updated as licences are issued, renewed, or terminated. This dataset is enriched with external data, including economic indicators.




### Preprocessing

In [1]:
# preprocessing
import pandas as pd  # imported explicitly because pd.read_csv is used below
from DataPreprocess import *
from DataFetch import fetch_business_license, fetch_econ_indicators

# viz
import altair as alt
alt.data_transformers.enable("vegafusion")
import matplotlib.pyplot as plt

## Loading dataset

In [2]:
# Option 1: fetch the data by URL (already modularized in DataFetch).
# Loading from the URL takes a while, so these calls are commented out by default.
# business = fetch_business_license()
# raw_econ_index_data_dict = fetch_econ_indicators()

# Option 2 (shortcut): download the files to your local machine,
# put them in the data folder, and read them from there.
business = pd.read_csv('data/business-licences.csv', delimiter = ';')

raw_econ_index_data_dict = {
    'GDP': pd.read_csv('data/gdp_by_industry.csv'),
    'ConsumerPrice': pd.read_csv('data/consumer_price_index.csv'),
    'Employment': pd.read_csv('data/employment_by_industry.csv'),
    'InvestmentConstruction': pd.read_csv('data/investment_in_building_construction.csv')
}


### Business Licence data
#### Clean-up
- Drop rows where `ExpiredDate` and `IssuedDate` are NA.
- Transform `ExpiredDate` and `IssuedDate` to date.
- Calculate the survival interval of each company, which is the difference between the maximum of `ExpiredDate` and the minimum of `IssuedDate`.
- Keep only the newest issued record of each company.
- Keep only records whose latest `ExpiredDate` is in 2022 or earlier: licences issued in 2023 have a default `ExpiredDate` of `2023-12-31`, so we cannot yet know whether those businesses will survive until then.

#### Response Variable for Classification: survival_status
- To balance the number of True and False cases, set the survival threshold to 2 years (730 days).
- Encode the Boolean value as 0/1.
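
The labelling steps above can be sketched on a toy table. This is only an illustration under stated assumptions, not the actual `business_datacleaning` implementation: the function name `sketch_survival_labels` is hypothetical, `BusinessName` is assumed to identify a company, rows missing either date are dropped, and "survived" is assumed to mean strictly more than the threshold.

```python
import pandas as pd


def sketch_survival_labels(licences, survival_threshold=365 * 2, key="BusinessName"):
    """Hypothetical sketch: one row per company with survival_days and survival_status."""
    # Drop rows with a missing date, then parse both date columns.
    licences = licences.dropna(subset=["IssuedDate", "ExpiredDate"]).copy()
    licences["IssuedDate"] = pd.to_datetime(licences["IssuedDate"])
    licences["ExpiredDate"] = pd.to_datetime(licences["ExpiredDate"])

    # Survival interval = latest expiry minus earliest issue date, per company.
    per_company = licences.groupby(key).agg(
        first_issued=("IssuedDate", "min"),
        last_expired=("ExpiredDate", "max"),
    )
    per_company["survival_days"] = (
        per_company["last_expired"] - per_company["first_issued"]
    ).dt.days

    # Keep companies whose latest expiry is in 2022 or earlier.
    per_company = per_company[per_company["last_expired"].dt.year <= 2022].copy()

    # Assumed rule: 1 if the company lasted longer than the threshold, else 0.
    per_company["survival_status"] = (
        per_company["survival_days"] > survival_threshold
    ).astype(int)
    return per_company.reset_index()


# Toy data (invented for illustration only).
toy = pd.DataFrame({
    "BusinessName": ["A", "A", "A", "B", "C", "D"],
    "IssuedDate": ["2015-01-10", "2016-01-05", "2017-01-03", "2020-03-01", "2023-02-01", None],
    "ExpiredDate": ["2015-12-31", "2016-12-31", "2017-12-31", "2020-12-31", "2023-12-31", "2019-12-31"],
})
labels = sketch_survival_labels(toy)
print(labels[["BusinessName", "survival_days", "survival_status"]])
```

In this toy example, company C is excluded because its licence expires in 2023, and company D is dropped because its issue date is missing.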

In [3]:
# Clean the licence data and label survival using a 2-year (730-day) threshold
business = business_datacleaning(business = business, survival_threshold = 365 * 2)
business

Unnamed: 0,FOLDERYEAR,LicenceRSN,LicenceNumber,LicenceRevisionNumber,BusinessName,BusinessTradeName,Status,IssuedDate,ExpiredDate,BusinessType,...,Country,PostalCode,LocalArea,NumberofEmployees,FeePaid,ExtractDate,Geom,geo_point_2d,survival_days,survival_status
0,2013,1786043,13-166627,0,Melissa Cheryl Aston (Melissa Aston),Kazoomko Productions,Issued,2012-12-29,2013-12-31,Entertainment Services,...,CA,,Kitsilano,0.0,129.0,2019-07-21T13:49:06-07:00,,,1828.0,1
1,2013,1786044,13-166628,0,Corus Radio Company,CHMJ AM730 and CFOX 99.3FM,Issued,2013-01-14,2013-12-31,Entertainment Services,...,CA,V7Y 1K9,Downtown,0.0,129.0,2019-07-21T13:49:06-07:00,"{""coordinates"": [-123.119500778402, 49.2822434...","49.2822434350563, -123.119500778402",2908.0,1
2,2013,1786048,13-166632,0,Jamieson Productions Inc,Jamieson Prod Inc,Issued,2013-09-12,2013-12-31,Entertainment Services,...,CA,,Downtown,0.0,191.0,2019-07-21T13:49:06-07:00,,,1936.0,1
4,2013,1786055,13-166639,0,(Jessica Minnie),Petite Pearl Wedding and Event Planning,Issued,2013-06-17,2013-12-31,Entertainment Services,...,CA,,Downtown,0.0,191.0,2019-07-21T13:49:06-07:00,,,3849.0,1
9,2013,1786065,13-166649,0,Holly Perrin Yoos (Holly Yoos),Copperplate Communications,Issued,2012-11-29,2013-12-31,Entertainment Services,...,CA,,Kitsilano,0.0,129.0,2019-07-21T13:49:06-07:00,,,4049.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
643052,2021,3859186,21-260945,0,Betterbite Ltd,,Issued,2021-08-16,2021-12-31,Moving/Transfer Service,...,CA,,Oakridge,1.0,138.0,2023-11-01T02:38:58-07:00,,,867.0,1
643061,2021,3859492,21-261247,0,Vancouver Charcuterie Inc,Charcuterie Vancouver,Issued,2021-10-06,2021-12-31,Ltd Service Food Establishment,...,CA,V6K 1R1,Kitsilano,1.0,155.0,2023-11-01T02:38:58-07:00,"{""coordinates"": [-123.167908743383, 49.2680903...","49.2680903409461, -123.167908743383",816.0,1
643067,2021,3859939,21-261673,0,Ian Martin Information Technology Inc,,Issued,2021-08-11,2021-12-31,Employment Agency,...,CA,V6E 3L2,Downtown,2.0,125.0,2023-11-01T02:38:58-07:00,"{""coordinates"": [-123.121522823224, 49.2870270...","49.2870270555211, -123.121522823224",142.0,0
643070,2021,3860003,21-261730,0,Glee Road Productions Ltd,,Issued,2021-07-19,2021-12-31,Production Company,...,CA,V5L 1R2,Grandview-Woodland,60.0,138.0,2023-11-01T02:38:58-07:00,"{""coordinates"": [-123.06398007395, 49.28178427...","49.2817842705027, -123.06398007395",165.0,0


### Macroeconomics Data
- Create a column `REF_YEAR` representing the year of `REF_DATE`
- Keep rows where `North American Industry Classification System (NAICS) == 'All industries'`. Manually mapping each `BusinessType` in the business licence dataset to its related industry is time-consuming, so this project considers only the overall GDP performance.
- Keep rows where `REF_YEAR >= 2012`
- Keep columns `REF_YEAR` and `VALUE`
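
As a rough illustration of these steps for the GDP table, here is a minimal sketch on invented data. It is not the actual `econ_datacleaning` code: the function name is hypothetical, `REF_DATE` is assumed to be a string starting with the four-digit year, and averaging the values within each year is an assumption.

```python
import pandas as pd

NAICS_COL = "North American Industry Classification System (NAICS)"


def sketch_gdp_cleaning(gdp):
    """Hypothetical sketch: yearly 'All industries' GDP values from 2012 onwards."""
    gdp = gdp.copy()
    # Assumes REF_DATE looks like "2012-01"; the first four characters are the year.
    gdp["REF_YEAR"] = gdp["REF_DATE"].str[:4].astype(int)
    keep = (gdp[NAICS_COL] == "All industries") & (gdp["REF_YEAR"] >= 2012)
    gdp = gdp.loc[keep, ["REF_YEAR", "VALUE"]]
    # Assumption: collapse the remaining rows to one value per year by averaging.
    yearly = gdp.groupby("REF_YEAR", as_index=False)["VALUE"].mean()
    return yearly.rename(columns={"VALUE": "GDPValue"})


# Toy data (invented for illustration only).
toy_gdp = pd.DataFrame({
    "REF_DATE": ["2011-12", "2012-01", "2012-02", "2012-01", "2013-01"],
    NAICS_COL: ["All industries", "All industries", "All industries",
                "Manufacturing", "All industries"],
    "VALUE": [90.0, 100.0, 110.0, 5.0, 120.0],
})
yearly = sketch_gdp_cleaning(toy_gdp)
print(yearly)
```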

In [4]:
econ = econ_datacleaning(raw_econ_index_data_dict)
econ

Unnamed: 0,FOLDERYEAR,GDPValue,ConsumerPriceValue,EmploymentValue,InvestmentConstructionValue
0,2012,1710429.0,1.758333,2296.708333,912810600.0
1,2013,1754173.0,1.266667,2320.475,987533900.0
2,2014,1803636.0,1.475,2348.983333,1036283000.0
3,2015,1820026.0,1.891667,2390.0,1146144000.0
4,2016,1839614.0,1.708333,2468.166667,1267791000.0
5,2017,1901971.0,1.225,2560.616667,1287776000.0
6,2018,1958470.0,1.8,2607.116667,1434572000.0
7,2019,1996744.0,2.166667,2676.116667,1657493000.0
8,2020,1897187.0,1.633333,2509.85,1519591000.0
9,2021,1991978.0,2.566667,2665.416667,1504690000.0


### Combine business licence and macroeconomics data
- Map the yearly economic indicator values (GDP, consumer price, employment, and construction investment) to the year in which each company's first licence was issued (the year the company starts its business).
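
A minimal sketch of such a year-based join is shown below. It assumes both tables share a `FOLDERYEAR` column, as the outputs in this notebook suggest, and that a left join is used. The function name is hypothetical and the real `merge_business_econ_by_year` may differ.

```python
import pandas as pd


def sketch_merge_by_year(business, econ, year_col="FOLDERYEAR"):
    """Hypothetical sketch: attach yearly indicators to each business row."""
    # validate="many_to_one" raises an error if econ has duplicate years.
    return business.merge(econ, on=year_col, how="left", validate="many_to_one")


# Toy data (business names invented; GDP values taken from the table above).
toy_business = pd.DataFrame({
    "BusinessName": ["A", "B", "C", "D"],
    "FOLDERYEAR": [2013, 2013, 2021, 2011],
})
toy_econ = pd.DataFrame({
    "FOLDERYEAR": [2013, 2021],
    "GDPValue": [1754173.0, 1991978.0],
})
merged = sketch_merge_by_year(toy_business, toy_econ)
print(merged)
```

With a left join, a business whose year has no indicator row (2011 in this toy example) is kept with missing indicator values, which a later imputation step would need to handle.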

In [5]:
business_econ = merge_business_econ_by_year(business, econ)

In [6]:
business_econ

Unnamed: 0,FOLDERYEAR,LicenceRSN,LicenceNumber,LicenceRevisionNumber,BusinessName,BusinessTradeName,Status,IssuedDate,ExpiredDate,BusinessType,...,FeePaid,ExtractDate,Geom,geo_point_2d,survival_days,survival_status,GDPValue,ConsumerPriceValue,EmploymentValue,InvestmentConstructionValue
0,2013,1786043,13-166627,0,Melissa Cheryl Aston (Melissa Aston),Kazoomko Productions,Issued,2012-12-29,2013-12-31,Entertainment Services,...,129.0,2019-07-21T13:49:06-07:00,,,1828.0,1,1.754173e+06,1.266667,2320.475000,9.875339e+08
1,2013,1786044,13-166628,0,Corus Radio Company,CHMJ AM730 and CFOX 99.3FM,Issued,2013-01-14,2013-12-31,Entertainment Services,...,129.0,2019-07-21T13:49:06-07:00,"{""coordinates"": [-123.119500778402, 49.2822434...","49.2822434350563, -123.119500778402",2908.0,1,1.754173e+06,1.266667,2320.475000,9.875339e+08
2,2013,1786048,13-166632,0,Jamieson Productions Inc,Jamieson Prod Inc,Issued,2013-09-12,2013-12-31,Entertainment Services,...,191.0,2019-07-21T13:49:06-07:00,,,1936.0,1,1.754173e+06,1.266667,2320.475000,9.875339e+08
3,2013,1786055,13-166639,0,(Jessica Minnie),Petite Pearl Wedding and Event Planning,Issued,2013-06-17,2013-12-31,Entertainment Services,...,191.0,2019-07-21T13:49:06-07:00,,,3849.0,1,1.754173e+06,1.266667,2320.475000,9.875339e+08
4,2013,1786065,13-166649,0,Holly Perrin Yoos (Holly Yoos),Copperplate Communications,Issued,2012-11-29,2013-12-31,Entertainment Services,...,129.0,2019-07-21T13:49:06-07:00,,,4049.0,1,1.754173e+06,1.266667,2320.475000,9.875339e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92312,2021,3859186,21-260945,0,Betterbite Ltd,,Issued,2021-08-16,2021-12-31,Moving/Transfer Service,...,138.0,2023-11-01T02:38:58-07:00,,,867.0,1,1.991978e+06,2.566667,2665.416667,1.504690e+09
92313,2021,3859492,21-261247,0,Vancouver Charcuterie Inc,Charcuterie Vancouver,Issued,2021-10-06,2021-12-31,Ltd Service Food Establishment,...,155.0,2023-11-01T02:38:58-07:00,"{""coordinates"": [-123.167908743383, 49.2680903...","49.2680903409461, -123.167908743383",816.0,1,1.991978e+06,2.566667,2665.416667,1.504690e+09
92314,2021,3859939,21-261673,0,Ian Martin Information Technology Inc,,Issued,2021-08-11,2021-12-31,Employment Agency,...,125.0,2023-11-01T02:38:58-07:00,"{""coordinates"": [-123.121522823224, 49.2870270...","49.2870270555211, -123.121522823224",142.0,0,1.991978e+06,2.566667,2665.416667,1.504690e+09
92315,2021,3860003,21-261730,0,Glee Road Productions Ltd,,Issued,2021-07-19,2021-12-31,Production Company,...,138.0,2023-11-01T02:38:58-07:00,"{""coordinates"": [-123.06398007395, 49.28178427...","49.2817842705027, -123.06398007395",165.0,0,1.991978e+06,2.566667,2665.416667,1.504690e+09


In [None]:
business_econ.columns

## EDA & Visualization

Overall target: to see which features might be useful for predicting survival status, we plotted the distribution of each predictor in the dataset and coloured it by class (failed to survive more than 2 years: green; survived for more than 2 years: orange). Our aim is to identify and omit features whose distributions look similar for both classes, because such features have little power to tell the two classes apart.

In [None]:
business_econ.info()

In [None]:
business.describe(include='all')

### Numeric Features

We start with the numeric features. After discarding features that make no sense to analyse or offer limited value, we picked 6 numeric features for EDA: the four economic indicators, plus `NumberofEmployees` and `FeePaid`.

The resulting figures show that the distributions of all the numeric features look very similar for the two classes. Thus, the EDA alone suggests omitting these features from our model.
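
As a numeric complement to the visual comparison, one could compute a standardized mean difference per feature: values near 0 mean the two classes have nearly the same average relative to their spread. This is a hedged sketch on invented data. The helper name is hypothetical and the measure is not part of the original analysis.

```python
import pandas as pd


def standardized_mean_difference(df, feature, target="survival_status"):
    """Hypothetical helper: (mean of class 1 - mean of class 0) / pooled standard deviation."""
    groups = df.groupby(target)[feature]
    means, variances = groups.mean(), groups.var()
    pooled_sd = ((variances[0] + variances[1]) / 2) ** 0.5
    return (means[1] - means[0]) / pooled_sd


# Toy data (invented): `similar` overlaps fully across classes, `different` does not.
toy = pd.DataFrame({
    "survival_status": [0, 0, 0, 1, 1, 1],
    "similar": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "different": [1.0, 2.0, 3.0, 11.0, 12.0, 13.0],
})
smd_similar = standardized_mean_difference(toy, "similar")
smd_different = standardized_mean_difference(toy, "different")
print(smd_similar, smd_different)
```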


In [None]:
numeric_features = ['GDPValue', 'ConsumerPriceValue', 'EmploymentValue', 'InvestmentConstructionValue'] 
# NumberofEmployees and FeePaid are plotted separately below because of their large variance

In [None]:
# Create a chart object for each feature.
charts_numeric = [alt.Chart(business_econ).transform_density(
    feature,
    as_=[feature, 'density'],
    groupby=['survival_status']
).mark_area(opacity=0.5).encode(
    x=alt.X(feature, title=feature).stack(False),
    y='density:Q',
    color=alt.Color('survival_status:O').scale(scheme='dark2')
).properties(
    width=180,
    height=120
) for feature in numeric_features]


# Combine the charts.
chart_grid = alt.vconcat(*[
    alt.hconcat(*charts_numeric[i:i+2]) for i in range(0, len(charts_numeric), 2)
])

In [None]:
employee = alt.Chart(business_econ).transform_density(
    'NumberofEmployees',
    as_=['NumberofEmployees', 'density'],
    groupby=['survival_status']
).mark_area(opacity=0.5).encode(
    x=alt.X('NumberofEmployees', title='NumberofEmployees', scale=alt.Scale(domain=[0, 5000])).stack(False),
    y='density:Q',
    color=alt.Color('survival_status:O').scale(scheme='dark2')
).properties(
    width=120,
    height=120
)

In [None]:
feepaid = alt.Chart(business_econ).transform_density(
    'FeePaid',
    as_=['FeePaid', 'density'],
    groupby=['survival_status']
).mark_area(opacity=0.5).encode(
    x=alt.X('FeePaid', title='FeePaid', scale=alt.Scale(domain=[0, 5000])).stack(False),
    y='density:Q',
    color=alt.Color('survival_status:O').scale(scheme='dark2')
).properties(
    width=120,
    height=120
)

In [None]:
chart_grid

In [None]:
employee & feepaid

### Categorical Features

For the categorical features, we generated bar charts showing the number of observations per category for each class.

The two charts suggest that both features could influence the target, although the spread of frequencies across categories is broadly similar for the two classes.
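
Because raw counts are dominated by how common each category is, a per-category survival rate can make the comparison easier to read. The sketch below uses invented rows and assumes the `LocalArea` and `survival_status` columns shown earlier. It is an illustration, not part of the original notebook.

```python
import pandas as pd

# Toy data (invented for illustration only).
toy = pd.DataFrame({
    "LocalArea": ["Downtown", "Downtown", "Downtown", "Downtown", "Kitsilano", "Kitsilano"],
    "survival_status": [1, 1, 1, 0, 1, 0],
})

# Number of businesses and share that survived, per category.
rates = (
    toy.groupby("LocalArea")["survival_status"]
    .agg(n="count", survival_rate="mean")
    .sort_values("survival_rate", ascending=False)
)
print(rates)
```

Categories whose survival rate differs clearly from the overall rate, and that have enough observations, are the ones most likely to help the classifier.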

In [None]:
business_econ['City'].value_counts() # With significantly large proportion of data in Vancouver, we would focus our research only on Vancouver
business_econ['Province'].value_counts() # Since most of the data are in BC Province, we would look into records in BC.

categorical_features = ['LocalArea', 'BusinessType'] 

In [None]:
alt.Chart(business_econ).mark_bar(opacity=0.5).encode(
    alt.X('LocalArea', sort='-y').stack(False),
    y='count()',
    color=alt.Color('survival_status:O').scale(scheme='dark2')
).facet(
    'survival_status:O', columns = 2
)

In [None]:
business_econ['BusinessType'].value_counts()

In [None]:
top_20_business_types = business_econ['BusinessType'].value_counts().head(20).index.tolist()

# Filter to include only the top 20 business types
filtered = business_econ[business_econ['BusinessType'].isin(top_20_business_types)]

alt.Chart(filtered).mark_bar(opacity=0.5).encode(
    x=alt.X('BusinessType:N', sort='-y'),
    y='count()',
    color=alt.Color('survival_status:O', scale=alt.Scale(scheme='dark2'))
).facet(
    column='survival_status:O',
    columns=2
)

In [None]:
from DataPreprocess import *
from DataFetch import *

import pandas as pd
import numpy as np

#sklearn tool
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import cross_validate

# Preprocess / transform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    LabelEncoder,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)

# models
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression

### Analysis

We used Logistic Regression and BernoulliNB to predict business survival in Vancouver because the target is binary and most of the predictors are categorical or text-based.

##### Why Logistic Regression and BernoulliNB?
Logistic Regression is effective when the outcome is binary, making it appropriate for predicting whether a business survives or not. Easier interpretability of the model results is another reason why we chose Logistic Regression. It's a linear model that provides coefficients for each predictor variable, making it easy to interpret the impact of each variable on the predicted outcome. This can be crucial for understanding the economic and demographic factors influencing business survival. 

BernoulliNB, a variant of Naive Bayes, suits a binary outcome such as whether a business survives or fails. It models each feature as a binary (present/absent) indicator, which matches the one-hot encoded categories and the binary bag-of-words features built from `BusinessType`. It is also effective with sparse data, making it a reasonable choice for the diverse and potentially sparse economic and demographic factors influencing business longevity in the city.
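
One consequence worth keeping in mind: scikit-learn's `BernoulliNB` binarizes its inputs at a threshold of 0 by default (`binarize=0.0`), so a standardized numeric feature is effectively reduced to an above/below-zero indicator. The sketch below demonstrates this on synthetic data (random numbers, not the licence dataset).

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Synthetic, roughly standardized features (invented for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

bnb = BernoulliNB().fit(X, y)

# Predictions are unchanged if we binarize the inputs ourselves at 0,
# because the model applies the same thresholding internally.
X_binary = (X > 0).astype(float)
same_predictions = np.array_equal(bnb.predict(X), bnb.predict(X_binary))
print(bnb.binarize, same_predictions)
```

This may partly explain why BernoulliNB trails Logistic Regression here, since the magnitude of the numeric features is discarded, although that explanation is a hypothesis rather than something the notebook tests.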

##### Train Test Split
We use 70% of the data for training and the remaining 30% for testing.


##### Results
Logistic Regression performs better, with a mean cross-validation accuracy of about 79% (79.1% averaged over the 10 validation folds) when predicting whether a business will survive. BernoulliNB has a lower mean cross-validation accuracy of 74.5%.

In [2]:
# business = fetch_business_license()
# raw_econ_index_data_dict = fetch_econ_indicators()

business = business_datacleaning(pd.read_csv('data/business-licences.csv', delimiter = ';'), survival_threshold = 730)
raw_econ_index_data_dict = {
    'GDP': pd.read_csv('data/gdp_by_industry.csv'),
    'ConsumerPrice': pd.read_csv('data/consumer_price_index.csv'),
    'Employment': pd.read_csv('data/employment_by_industry.csv'),
    'InvestmentConstruction': pd.read_csv('data/investment_in_building_construction.csv')
}

business = business[business['City'] == 'Vancouver']
econ = econ_datacleaning(raw_econ_index_data_dict)
business_econ = merge_business_econ_by_year(business, econ)
business_econ


Unnamed: 0,FOLDERYEAR,LicenceRSN,LicenceNumber,LicenceRevisionNumber,BusinessName,BusinessTradeName,Status,IssuedDate,ExpiredDate,BusinessType,...,FeePaid,ExtractDate,Geom,geo_point_2d,survival_days,survival_status,GDPValue,ConsumerPriceValue,EmploymentValue,InvestmentConstructionValue
0,2015,2333488,15-103790,0,Hollyhock Properties Ltd,,Issued,2014-12-03,2015-12-31,Apartment House Strata,...,64.0,2019-07-21T13:49:14-07:00,"{""coordinates"": [-123.116856730836, 49.2678622...","49.2678622929998, -123.116856730836",4048.0,1,1.820026e+06,1.891667,2390.000000,1.146144e+09
1,2015,2333496,15-103798,0,(Zandra Paleczny),,Issued,2014-11-06,2015-12-31,Apartment House Strata,...,64.0,2019-07-21T13:49:14-07:00,"{""coordinates"": [-123.133925222671, 49.2796620...","49.2796620031115, -123.133925222671",4045.0,1,1.820026e+06,1.891667,2390.000000,1.146144e+09
2,2015,2333501,15-103803,0,(Dave Dixon),,Issued,2014-11-14,2015-12-31,Apartment House Strata,...,64.0,2019-07-21T13:49:14-07:00,"{""coordinates"": [-123.124998311257, 49.2836868...","49.2836868407842, -123.124998311257",2208.0,1,1.820026e+06,1.891667,2390.000000,1.146144e+09
3,2015,2333502,15-103804,0,Henry B Yuen (Henry Yuen),,Issued,2014-12-05,2015-12-31,Apartment House Strata,...,64.0,2019-07-21T13:49:14-07:00,"{""coordinates"": [-123.132003087572, 49.2741705...","49.2741705397492, -123.132003087572",4007.0,1,1.820026e+06,1.891667,2390.000000,1.146144e+09
4,2015,2333506,15-103808,0,Tsang & Lee Enterprises Inc,,Issued,2015-01-07,2015-12-31,Apartment House Strata,...,64.0,2019-07-21T13:49:14-07:00,"{""coordinates"": [-123.117431658016, 49.2687000...","49.2687000536747, -123.117431658016",4015.0,1,1.820026e+06,1.891667,2390.000000,1.146144e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79390,2022,4042383,22-216683,0,(XueYang Hu),,Issued,2022-05-27,2022-12-31,Apartment House Strata,...,123.0,2023-11-01T02:39:02-07:00,,,583.0,0,2.064208e+06,5.425000,2748.241667,1.794664e+09
79391,2022,4042384,22-216684,0,Frank F Wu & Su-Chi L Wu,,Issued,2022-09-14,2022-12-31,Apartment House Strata,...,143.0,2023-11-01T02:39:02-07:00,,,473.0,0,2.064208e+06,5.425000,2748.241667,1.794664e+09
79392,2022,4042552,22-216852,0,Evermark Real Estate Services Inc,Evermark Real Estate Services,Issued,2022-04-21,2022-12-31,Real Estate Dealer,...,185.0,2023-11-01T02:39:02-07:00,"{""coordinates"": [-123.133618556976, 49.2039954...","49.2039954491368, -123.133618556976",619.0,0,2.064208e+06,5.425000,2748.241667,1.794664e+09
79393,2022,4042611,22-142223,1,RG Plumbing Ltd,,Issued,2022-04-22,2022-12-31,Plumber & Sprinkler Contractor,...,11.0,2023-11-01T02:39:02-07:00,,,618.0,0,2.064208e+06,5.425000,2748.241667,1.794664e+09


In [3]:
econ.columns

Index(['FOLDERYEAR', 'GDPValue', 'ConsumerPriceValue', 'EmploymentValue',
       'InvestmentConstructionValue'],
      dtype='object')

In [4]:
## Create the column transformer
# imp = make_column_transformer(
#     ("drop", drop_features),
#     (SimpleImputer(strategy="most_frequent"), word_features + categorical_features),  # missing_values='NaN'
#     (SimpleImputer(strategy="median"), numeric_features),  # missing_values='NaN'
# )
# preprocessor = make_column_transformer(  
#     (CountVectorizer(binary=True), [0]),  # BusinessType
#     (OneHotEncoder(drop="if_binary", sparse_output=False, handle_unknown='ignore'), [1, 2]),  # categorical
#     (StandardScaler(), [3, 4])  # numeric
# )


In [5]:
def transform(df, word_features, categorical_features, numeric_features):
    """Impute, encode, and scale the three feature groups, then stack them into one dense array."""
    # drop_features = ['Status', 'BusinessSubType', 'FOLDERYEAR', 'LicenceRSN', 'LicenceNumber', 'LicenceRevisionNumber',
    #     'BusinessName', 'BusinessTradeName', 'IssuedDate', 'ExpiredDate', 
    #     'Unit', 'UnitType', 'House', 'Street', 'ExtractDate', 'Geom', 'geo_point_2d']
    
    word_transformer = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        # Flatten the (n, 1) imputed column to 1-D, as CountVectorizer expects.
        # np.ravel avoids np.reshape's `newshape` keyword, which newer NumPy releases deprecate.
        FunctionTransformer(np.ravel),
        CountVectorizer(binary=True)
    )

    categorical_transformer = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(drop="if_binary", sparse_output=False, handle_unknown='ignore')
    )

    numeric_transformer = make_pipeline(
        SimpleImputer(strategy="median"),
        StandardScaler()
    )
    
    word_trans_arr = word_transformer.fit_transform(df[word_features])
    categorical_trans_arr = categorical_transformer.fit_transform(df[categorical_features])
    numeric_trans_arr = numeric_transformer.fit_transform(df[numeric_features])
    
    return np.hstack((word_trans_arr.toarray(), categorical_trans_arr, numeric_trans_arr))


In [6]:
train_df, test_df = train_test_split(business_econ, test_size=0.3, random_state=123)

word_features = ['BusinessType']
categorical_features = ['City', 'LocalArea']
numeric_features = ['NumberofEmployees', 'FeePaid', 
                     'GDPValue', 'ConsumerPriceValue', 'EmploymentValue', 'InvestmentConstructionValue']

X_train = train_df[word_features + categorical_features + numeric_features]
X_test = test_df[word_features + categorical_features + numeric_features]
y_train = train_df["survival_status"]
y_test = test_df["survival_status"]

X_train_transformed = transform(X_train, word_features, categorical_features, numeric_features)
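
An alternative design, sketched here on invented data, wraps the same three transformers in a `ColumnTransformer` inside a scikit-learn pipeline. The preprocessing is then re-fitted on the training part of each cross-validation fold and can later be applied to the test set without refitting. The toy rows and the reduced feature list are assumptions for illustration. This is not the notebook's actual workflow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Toy data (invented for illustration only).
toy = pd.DataFrame({
    "BusinessType": ["Retail Dealer", "Restaurant Class 1", "Retail Dealer",
                     "Contractor", "Restaurant Class 1", "Contractor"] * 2,
    "LocalArea": ["Downtown", "Kitsilano", "Downtown",
                  "Oakridge", "Kitsilano", "Downtown"] * 2,
    "FeePaid": [129.0, 191.0, np.nan, 155.0, 138.0, 125.0] * 2,
    "survival_status": [1, 0, 1, 0, 1, 0] * 2,
})
X = toy.drop(columns="survival_status")
y = toy["survival_status"]

word_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    FunctionTransformer(np.ravel),  # (n, 1) column -> 1-D array of strings
    CountVectorizer(binary=True),
)
cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),  # unseen areas become all-zero rows
)
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

preprocessor = make_column_transformer(
    (word_pipe, ["BusinessType"]),
    (cat_pipe, ["LocalArea"]),
    (num_pipe, ["FeePaid"]),
)
model = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))

# The whole pipeline is cloned and fitted per fold, so no fold sees the others' statistics.
cv_scores = cross_validate(model, X, y, cv=3, return_train_score=True)
print(pd.DataFrame(cv_scores)[["test_score", "train_score"]])
```

With this structure, evaluating on held-out data would be a matter of calling `model.fit(X_train, y_train)` followed by `model.score(X_test, y_test)`.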

In [8]:
bnb = BernoulliNB()
pd.DataFrame(cross_validate(bnb, X_train_transformed, y_train, cv=10, return_train_score=True))

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.083821,0.005145,0.751169,0.746031
1,0.051737,0.00483,0.746312,0.746291
2,0.051149,0.004826,0.747751,0.745572
3,0.052329,0.004608,0.740914,0.746171
4,0.050895,0.004974,0.743793,0.745772
5,0.05262,0.005117,0.741994,0.746191
6,0.05592,0.005003,0.745186,0.746716
7,0.060156,0.005425,0.741587,0.746736
8,0.062785,0.005698,0.744466,0.746936
9,0.058283,0.005274,0.746986,0.746976


In [7]:
logreg = LogisticRegression(random_state=123, max_iter=1000)
pd.DataFrame(cross_validate(logreg, X_train_transformed, y_train, cv=10, return_train_score=True))

Unnamed: 0,fit_time,score_time,test_score,train_score
0,5.330089,0.001129,0.790752,0.791295
1,4.685447,0.001117,0.79525,0.791455
2,4.428624,0.001107,0.792012,0.791975
3,5.179295,0.001085,0.785534,0.791975
4,5.024096,0.00111,0.787693,0.792235
5,3.567812,0.001132,0.79435,0.791555
6,4.858253,0.002255,0.795573,0.792259
7,4.522573,0.001158,0.791974,0.791899
8,4.145999,0.001141,0.785316,0.792179
9,4.459368,0.001088,0.788555,0.792419
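
The mean validation accuracy of each model can be recomputed directly from the `test_score` columns of the two tables above. The short sketch below copies those ten fold scores by hand. The DataFrame and variable names are only illustrative.

```python
import pandas as pd

# test_score values copied from the two cross-validation tables above.
fold_scores = pd.DataFrame({
    "BernoulliNB": [0.751169, 0.746312, 0.747751, 0.740914, 0.743793,
                    0.741994, 0.745186, 0.741587, 0.744466, 0.746986],
    "LogisticRegression": [0.790752, 0.79525, 0.792012, 0.785534, 0.787693,
                           0.79435, 0.795573, 0.791974, 0.785316, 0.788555],
})

mean_accuracy = fold_scores.mean().round(3)
print(mean_accuracy)  # BernoulliNB 0.745, LogisticRegression 0.791
```

The train and validation scores are almost identical for both models, which suggests little overfitting at this model complexity.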


#### Conclusion and Improvements

The Logistic Regression model gives a decent cross-validation accuracy of about 79% and can be used as an assistive model for decisions on whether a business licence is likely to be renewed.

We can further improve the model results by:

- Trying more complex models such as Random Forest or Neural Networks (currently outside the scope of our MDS syllabus).
- Combining other economic and socio-economic factors with our dataset.

## References

City of Vancouver. 2023. 'Business Licences Dataset.' Vancouver Open Data. https://opendata.vancouver.ca/explore/dataset/business-licences/information/?disjunctive.status&disjunctive.businesssubtype&refine.folderyear=23

Statistics Canada. 2023. https://www150.statcan.gc.ca/n1/en/type/data?MM=1

McKinney, Wes. 2010. 'Data Structures for Statistical Computing in Python.' In Proceedings of the 9th Python in Science Conference, edited by Stéfan van der Walt and Jarrod Millman, 51–56.

VanderPlas, J. et al. 2018. 'Altair: Interactive Statistical Visualizations for Python.' Journal of Open Source Software 3 (32): 1057.

Pedregosa, F. et al. 2011. 'Scikit-learn: Machine Learning in Python.' Journal of Machine Learning Research 12 (Oct): 2825–2830.