# Predicting the Survival of New Businesses in Vancouver

By Arturo Boquin, Beth Ou Yang, Prabhjit Thind & Weiran Zhao 2023/12/03

In [9]:
import pandas as pd
from myst_nb import glue
import pickle


In [10]:
test_scores_df = pd.read_csv("../results/tables/test_scores.csv").round(2)
glue("accuracy", test_scores_df['Accuracy'].values[0], display=False)
glue("f1", test_scores_df['F1_Score'].values[0], display=False)
test_scores_df = test_scores_df.style.format().hide()
glue("test_scores_df", test_scores_df, display=False)

In [11]:
confusion_df=pd.read_csv("../results/tables/confusion_matrix.csv", index_col=0)
confusion_df.rename(columns={'Predicted 0':'Predicted: Survived'}, inplace=True)
confusion_df.index.names = ['Actual label:']
glue("total", confusion_df.sum(axis=1).sum(), display=False)
glue("pred_correct", confusion_df['Predicted: Survived'].values[0] + confusion_df['Predicted 1'].values[1], display=False)
glue("false_neg", confusion_df['Predicted: Survived'].values[1], display=False)
glue("confusion_df", confusion_df, display=False)

In [None]:
with open('../results/models/lr_license_renewal_pipeline.pickle', 'rb') as f:
    business_fit = pickle.load(f)

## Summary
This study focuses on forecasting the viability of new businesses in Vancouver by analyzing a range of economic and demographic variables. We utilize data from the City business license registry (City of Vancouver, 2023) and supplementary sources such as Statistics Canada (2023) to assess the impact of factors like location, industry type, and economic conditions on the longevity of businesses.

Our approach involves developing a classification model using logistic regression. This model uses the datasets above to estimate the likelihood that a new business sustains operations over a two-year period. We evaluated the final model on a separate test dataset, where it achieved an accuracy of {glue:}`accuracy`. Out of {glue:text}`total` test cases, it correctly classified {glue:text}`pred_correct` businesses.

Nevertheless, the model produced {glue:text}`false_neg` false negatives, erroneously indicating that certain businesses would thrive when they were actually at risk. These errors could lead to detrimental outcomes, especially in scenarios involving targeted interventions for businesses. Therefore, we recommend further research and refinement of this predictive model before it is used as a tool by policymakers and economic authorities.

## Introduction
The business environment in Vancouver is dynamic, influenced by economic trends, demographic changes, and city planning. Accurate prediction of new business survival is essential for both policymakers and entrepreneurs. The central question of this project is: "Can we predict the survival of new businesses in Vancouver?" To answer this, we used data from Vancouver's open data portal, supplemented with economic and census data. Analysis was conducted using Python packages including Pandas (McKinney, 2010), Altair (VanderPlas et al., 2018), and scikit-learn (Pedregosa et al., 2011).

## Methods
### Dataset Description
The primary dataset for this study is sourced from the City of Vancouver's business license registry, which is regularly updated with new licenses, renewals, and terminations. This dataset is enhanced with external data on economic indicators and demographic trends, providing a comprehensive view of the factors influencing business survival in the city.

### Analysis
Our methodology involved developing a logistic regression model to classify businesses as likely to survive or not over a two-year period. We employed various economic and demographic variables from our datasets in the model. The data was divided, with 70% allocated for training and 30% for testing. Model performance was evaluated using accuracy and other relevant metrics, emphasizing the importance of reducing false negatives due to the high stakes involved in business survival predictions.
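
The workflow described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the data from `make_classification`, the `StandardScaler` step, the stratified split, and the `random_state` values are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the business licence features (assumption: the real
# data mixes numeric and categorical columns and needs more preprocessing).
X, y = make_classification(n_samples=1000, n_features=8, random_state=123)

# 70% of the rows for training, 30% held out for testing.
# Stratifying on y is an assumption, used here to keep class ratios similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

# Scale the features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate on the held-out test set only.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred)
print(f"accuracy={test_accuracy:.2f}, f1={test_f1:.2f}")
```

Keeping the preprocessing and the classifier in one pipeline means the scaler is fitted on the training rows only, so no information from the test set leaks into the model.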

## Results & Discussion
Initial analysis of the datasets revealed significant trends and correlations between various factors and business survival. The logistic regression model showed promising results, though the presence of false negatives warrants further investigation and model refinement. Our findings suggest that, with improved accuracy, such a model could be a valuable tool for predicting business viability, aiding decision-making for both entrepreneurs and policymakers.

To see which features might be useful for predicting survival status, we plotted the distribution of each predictor in the dataset, coloured by class (failed to survive more than 2 years: green; survived for more than 2 years: orange). The aim is to omit features whose distributions look similar for both classes, since such features have little power to tell the two classes apart. As illustrated in Fig. 1, Fig. 2 and Fig. 3 (Visualization of Frequency of Numeric Features), the numeric features show an association with the class label.

```{figure} ../results/figures/numeric_features.png
---
width: 800px
name: numeric_features
---
Comparison of the numeric features distributions.
```

```{figure} ../results/figures/numeric_FeePaid.png
---
width: 800px
name: numeric_FeePaid
---
Numeric feature with large variance: FeePaid.
```

```{figure} ../results/figures/numeric_NumberofEmployees.png
---
width: 800px
name: numeric_NumberofEmployees
---
Numeric feature with large variance: NumberofEmployees.
```
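
As a numeric complement to the visual comparison of numeric features, one could summarize how far apart the two classes sit on each feature. The sketch below is not part of the original analysis: it uses a standardized gap between class means as a stand-in for inspecting the plots, and the column names (`survival_status`, `feature_a`, `feature_b`) and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the column names and values are illustrative only,
# not taken from the project's dataset.
toy = pd.DataFrame(
    {
        "survival_status": [0, 0, 0, 0, 1, 1, 1, 1],
        "feature_a": [1.0, 2.0, 3.0, 4.0, 11.0, 12.0, 13.0, 14.0],
        "feature_b": [5.0, 6.0, 7.0, 8.0, 5.0, 6.0, 7.0, 8.0],
    }
)
feature_cols = ["feature_a", "feature_b"]

# Mean of each numeric feature within each class.
class_means = toy.groupby("survival_status")[feature_cols].mean()

# Absolute gap between the class means, scaled by each feature's overall
# standard deviation so features on different scales are comparable.
gap = (class_means.loc[1] - class_means.loc[0]).abs() / toy[feature_cols].std()

print(gap.sort_values(ascending=False))
```

A gap near zero, as for `feature_b` here, suggests the two classes look alike on that feature, which matches the kind of feature the plots are meant to flag for omission. A mean-based summary can miss differences in shape or spread, so it does not replace looking at the distributions.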

For categorical features, we generated histograms to see the frequency of observations in each class.

As Fig. 4 and Fig. 5 indicate, there is an underlying pattern suggesting that these two features could influence the target, with a similar spread of frequencies across the classes.

```{figure} ../results/figures/categorical_LocalArea.png
---
width: 800px
name: categorical_LocalArea
---
Categorical feature Local Area.
```

```{figure} ../results/figures/varianced_categorical_BusinessType.png
---
width: 800px
name: varianced_categorical_BusinessType
---
Categorical feature Business Type.
```

We used Logistic Regression to predict business survival in Vancouver because of the nature of the data. 
Logistic Regression is effective when the outcome is binary, making it appropriate for predicting whether a business survives or not. 
Interpretability is another reason we chose Logistic Regression. It is a linear model that provides a coefficient for each predictor variable, making it easy to interpret the impact of each variable on the predicted outcome. This can be crucial for understanding the economic and demographic factors influencing business survival. 
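
For reference, the standard form of the logistic regression model makes this interpretation concrete. The symbols below are generic notation introduced here, not names from the project:

```{math}
\log\frac{p}{1-p} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j
```

Here {math}`p` is the predicted probability of the positive class, {math}`x_j` is the {math}`j`-th predictor, and {math}`\beta_j` is its coefficient. Holding the other predictors fixed, a one-unit increase in {math}`x_j` multiplies the odds {math}`p/(1-p)` by {math}`e^{\beta_j}`, so the sign and size of each coefficient indicate the direction and strength of that predictor's association with the outcome.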

We are using 70% of our data as training data and the remaining 30% is used as test data.

Logistic Regression achieves a test accuracy of {glue:}`accuracy` when predicting whether a business will survive or not.

```{glue:figure} test_scores_df
:figwidth: 400px
:name: "test_scores_df"

Accuracy and F1-score from best model predicted on the test data.
```

Our prediction model performed quite well on test data, with a final overall accuracy of {glue:text}`accuracy` and F1-score of {glue:text}`f1`. 

Other indicators that our model performed well come from the confusion matrix, where it made only {glue:text}`false_neg` mistakes. However, all {glue:text}`false_neg` mistakes predicted that certain businesses would thrive when they were actually at risk. Given the implications this has for the economy, this model can be used as a reference for policymakers and authorities ({numref}`Figure {number} <confusion_df>`).

```{glue:figure} confusion_df
:figwidth: 400px
:name: "confusion_df"

Confusion matrix from best model on the test data.
```
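
To make the link between the confusion matrix and the reported scores explicit, the sketch below computes accuracy and F1 from the four cell counts. The function name `scores_from_confusion` and the example counts are hypothetical and are not the project's results.

```python
def scores_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = scores_from_confusion(tn=45, fp=5, fn=10, tp=40)
print(example)
```

Because recall is `tp / (tp + fn)`, every additional false negative lowers recall and therefore F1, which is why F1 is a useful companion to accuracy when false negatives are the costly error.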

### Improvements

- The analysis conducted in this project provides significant insights into the factors influencing new business survival in Vancouver. The careful selection of predictive variables and the use of appropriate models have yielded results that not only predict business survival but also highlight critical economic and demographic factors influencing it.
- While the models show good predictive power, further research incorporating additional data sources, advanced modeling techniques, and a deeper temporal analysis could provide even more nuanced insights.
- The project has the potential to significantly impact policymaking and strategic planning for economic development in urban environments, demonstrating the practical applications of data science in real-world scenarios.


## References

City of Vancouver. 2023. “Business Licences Dataset.” Vancouver Open Data. https://opendata.vancouver.ca/explore/dataset/business-licences/information/?disjunctive.status&disjunctive.businesssubtype&refine.folderyear=23

Statistics Canada. 2023. https://www150.statcan.gc.ca/n1/en/type/data?MM=1

McKinney, Wes. 2010. “Data Structures for Statistical Computing in Python.” In Proceedings of the 9th Python in Science Conference, edited by Stéfan van der Walt and Jarrod Millman, 51–56.

VanderPlas, J. et al. 2018. “Altair: Interactive Statistical Visualizations for Python.” Journal of Open Source Software 3 (32): 1057.

Pedregosa, F. et al. 2011. “Scikit-learn: Machine Learning in Python.” Journal of Machine Learning Research 12 (Oct): 2825–2830.