## 1. Introduction

This report describes the steps followed to prepare the Kschool Data Science Master 15th Edition TFM (final master's project), 'Test your business viability', as well as the main results.

Some **background**   
When I started to look for an idea for the project, I came across an open dataset from the Madrid city council that contains information about the retail stores and activities licensed in Madrid since 2014. See: https://datos.madrid.es/sites/v/index.jsp?vgnextoid=66665cde99be2410VgnVCM1000000b205a0aRCRD&vgnextchannel=374512b9ace9f310VgnVCM100000171f5a0aRCRD   
This data reminded me of my impression that some people open a business without a business case analysis and that many of these businesses close within a short period of time. So I decided to analyse how the commercial premises census evolved and to look for patterns.

The **TFM objective**: the main goal of this TFM is to use this information as the basis for an advanced analytics model that predicts the probability that a commercial premises will close within a 3-year timeframe.   
This is a supervised classification problem, and preparing the project involved heavy use of Pandas and a wide range of classification algorithms, together with recommendations for getting the most out of them (I have added some references at the end of the document). 

Some **important decisions:**   
During the project preparation, and after some time working with the dataset, I realized that the quality of the data was not good enough. I did some field research and Google searches and confirmed that the 'desc_situacion_local' field, which indicates the status of a store ('open', 'closed', etc.), was not up to date.   
I decided to go on with the information available, in what has become a learning and research exercise that has allowed me to apply most of the steps and concepts learnt during the master. The main assumptions are:
- **About Data sources**: the 2014 files had a different structure, with different identifiers and tags, so I decided to use the information available from 2015 until September 2019.  


- **About Target variable**: I wanted to analyze the probability of a business opening and closing within a 2-year timeframe, but I did not have enough samples (less than 1% of the total population). In the end, I defined my target variable as **commercial premises that closed between 2017 and the date of the latest data (3-year timeframe)**.


- **Variables**: I did not find good predictive variables. I started with the retail stores files alone, with no results, so I added further features to try to improve them.  
I used information from the Madrid population census, also available in the Madrid Open Data portal, and some information about the floating population in the different districts of Madrid during one week of April 2018, kindly provided by Kinneo.


- **About the Models**: I have used Logistic Regression as a baseline. 
I started by testing KNN, Decision Tree, Random Forest and XGBoost. The results were quite similar and far from good (around 60% AUC), so I focused on Random Forest and XGBoost, which gave the better results.  
For the sake of simplicity, I will only show the analysis performed with Logistic Regression, Random Forest and XGBoost.  
The report of the modeling results is delivered in a Jupyter Notebook. 

- **Metrics**: the main metrics are Recall, AUC and F1 score. I have also analysed the lift curve, the Precision-Recall curve and the ROC curve. 


- **Project structure and presentation**:
The results and insights of the project are presented in Jupyter notebooks so that both the code and the visuals can be reproduced.   

I have followed the usual steps of a classification problem:   
- Data gathering
- Data preparation: cleaning, target generation, new variables generation
- Exploratory data analysis
- Data modeling
- Testing and metrics  

The results can be reproduced by running the notebooks listed below. 

**Steps for navigating through the contents of the project**: 
- **First**: check the required libraries and install any that are missing
- **Second**: run the Jupyter notebook Data_loading_and_preparation
- **Third**: run the Jupyter notebook Classification Modeling
- **Appendix**

**Important notes**
- Please clone or download the repository from [TFM Github repository](https://github.com/bdiazcor/TFM-Test-your-business-viability) to your local PC and run it following the instructions in this readme. Do not change the directory tree structure.
- I suggest running the project from the readme. However, the notebooks can also be launched directly from the folder TBV_v1. In either case, it is important to follow the steps listed above.
- I have added JavaScript code at the beginning of the notebooks that hides the raw code and makes them easier to read. You can toggle it to expand the code. 

For example, a notebook looks as follows while it is being edited:
![caption](button.png)

After Run All, only the output is shown, with a button at the top to expand the code if needed:
![caption](output.png)

## 2. TFM modules

### 1) Install libraries

I recommend installing all the libraries using the **Anaconda Distribution**.
First, download [Anaconda](https://www.anaconda.com/distribution/). 
Second, install the version of Anaconda that you downloaded, following the instructions on the download page.

I already had almost all the libraries available. The new libraries I had to install were:
- **scipy 1.1.0**: > pip install scipy
- **scikitplot 0.3.7**: > pip install scikit-plot
- **imblearn 0.5.0**: > pip install imbalanced-learn
- **geopy 1.20.0**: > pip install geopy
- **utm 0.5.0**: > pip install utm

Full list of libraries used for this TFM:
- **Python 3.7.3**   
- **pip 19.1.1**   
- **Jupyter Notebook 6.0.2**   
- **Pandas 0.23.4**   
- **Numpy 1.15.4**    
- **matplotlib 3.0.2**      
- **sklearn 0.21.3**    
- **pickle**: part of the Python standard library
- **seaborn 0.9.0**    
- **xgboost 0.90**   
- **scipy 1.1.0**    
- **scikitplot 0.3.7**   
- **imblearn 0.5.0**  
- **geopy 1.20.0** 
- **utm 0.5.0** 
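
To compare a local environment with the list above, a small version check can help. This is a minimal sketch and not part of the repository: the helper names `installed_version` and `version_report` are assumptions made for illustration, and the module names are the import names, which differ from the pip package names for some libraries.

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Some modules (for example pickle) do not expose __version__.
    return getattr(module, "__version__", "unknown")


def version_report(module_names):
    """Map each module name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in module_names}


# Import names, not pip names:
# scikit-learn -> sklearn, scikit-plot -> scikitplot, imbalanced-learn -> imblearn.
MODULES = ["pandas", "numpy", "matplotlib", "sklearn", "seaborn",
           "xgboost", "scipy", "scikitplot", "imblearn", "geopy", "utm"]
```

Calling `version_report(MODULES)` in a notebook cell returns a dictionary in which any library reported as `None` still needs to be installed.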

As an alternative to Anaconda, run the following commands (once Python 3 is installed; the first one installs pip):   
python get-pip.py   
pip install jupyter   
pip install pandas   
pip install numpy   
pip install matplotlib   
pip install -U scikit-learn   
pip install seaborn   
pip install xgboost   
pip install scipy   
pip install scikit-plot   
pip install imbalanced-learn   
pip install geopy   
pip install utm   

### 2) Data loading and data preparation

[Launches Data loading and preparation Notebook](./TBV1_data_cleaning.ipynb). Follows the structure:

**1. Data gathering**   
     1) Activities files (2015 to 2019, Ayuntamiento Madrid)   
     2) Madrid population database (1 January 2019, Ayuntamiento Madrid)   
     3) Madrid floating population (16-22 April 2018, private source)    

**2. Data preparation**: cleaning, target generation, new variables generation   
     1) Commercial premises status normalization   
     2) NaN management   
     3) Merge all years commercial premises info in a single DataFrame   
     4) Generate target variable   
     5) Standardize type of activities   
     6) Generate interesting variables and convert UTM coordinates   
     7) Add population info   
     8) Select activities for analysis   
     9) Merge with info points in radius   
     10) Export to csv   
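
Step 4 above (generating the target variable) can be sketched as follows. This is an illustration under stated assumptions, not the code in the notebook: it assumes the yearly files have already been merged into one long DataFrame, that the status in `desc_situacion_local` has been normalized to `"open"` / `"closed"`, and that the premises identifier and snapshot year live in hypothetical columns `id_local` and `year`. A premises is flagged when its latest recorded status is closed and it was still recorded as open in the last snapshot before the window or later.

```python
import pandas as pd


def build_target(snapshots, window_start=2017):
    """Flag premises that closed from `window_start` onwards (1) or not (0)."""
    ordered = snapshots.sort_values(["id_local", "year"])
    # Status in the most recent snapshot of each premises.
    final_status = ordered.groupby("id_local")["desc_situacion_local"].last()
    # Last year in which each premises was recorded as open (NaN if never).
    last_open = (
        ordered[ordered["desc_situacion_local"] == "open"]
        .groupby("id_local")["year"]
        .max()
        .reindex(final_status.index)
    )
    # Closed now, and still open in the snapshot just before the window or later.
    closed_in_window = (final_status == "closed") & (last_open >= window_start - 1)
    return closed_in_window.astype(int).rename("target")


# Toy data, invented for illustration only.
toy_snapshots = pd.DataFrame({
    "id_local": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    "year": [2015, 2016, 2017, 2015, 2016, 2017, 2015, 2016, 2017, 2017, 2018],
    "desc_situacion_local": [
        "open", "open", "closed",    # closed inside the window
        "open", "open", "open",      # still open
        "open", "closed", "closed",  # closed before the window
        "open", "closed",            # opened and closed inside the window
    ],
})

print(build_target(toy_snapshots))
```

Premises that were already closed before 2017 are deliberately left out of the positive class, since their closure falls outside the timeframe.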

### 3) Classification Modeling

[Launches Data modeling Notebook](./TBV1_classification_model.ipynb#). Follows the structure:   

**1. Import libraries**


**2. Exploratory Data Analysis**   
1) Load .csv to DataFrame.    
2) Select columns of interest      
3) Identify type of columns   
4) Dummify categorical values   
5) Train and test split    
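
Steps 4 and 5 above can be sketched with `pandas.get_dummies` and scikit-learn's `train_test_split`. This is a minimal sketch rather than the notebook's code: the column names (`district`, `n_similar_in_radius`, `target`), the 20% test size and the stratified split are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def prepare_features(df, target_col, categorical_cols, test_size=0.2, seed=42):
    """One-hot encode the categorical columns and split into train and test sets."""
    features = pd.get_dummies(df.drop(columns=[target_col]), columns=categorical_cols)
    labels = df[target_col]
    # Stratify so that both sets keep the same share of closed premises.
    return train_test_split(
        features, labels, test_size=test_size, stratify=labels, random_state=seed
    )


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "district": ["Centro", "Retiro", "Centro", "Latina", "Retiro",
                 "Centro", "Latina", "Retiro", "Centro", "Latina"],
    "n_similar_in_radius": [12, 3, 9, 1, 4, 15, 2, 5, 11, 0],
    "target": [1, 0, 1, 0, 0, 1, 0, 1, 1, 0],
})

X_train, X_test, y_train, y_test = prepare_features(toy, "target", ["district"])
print(X_train.columns.tolist())
```

Stratifying on the target matters when the positive class is small, because a purely random split could leave very few closures in the test set.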

**3. Modeling**   
1) Base model: Logistic Regression    
2) Random Forest   
3) XGBoost. Includes optimal cut-off analysis and cumulative gain and lift charts   
4) Features importance with the best estimator   
5) Future prediction (with reserved 5% of samples)    
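
The optimal cut-off analysis mentioned in step 3 can be illustrated as follows. This sketch uses one common criterion, Youden's J statistic (the threshold that maximises TPR minus FPR on the ROC curve); the notebook may apply a different rule. The function names `optimal_cutoff` and `evaluate` and the toy scores are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score, roc_auc_score, roc_curve


def optimal_cutoff(y_true, y_score):
    """Return the score threshold that maximises TPR - FPR (Youden's J)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]


def evaluate(y_true, y_score, cutoff):
    """Compute AUC on the scores, and Recall and F1 at the given cut-off."""
    y_pred = (np.asarray(y_score) >= cutoff).astype(int)
    return {
        "auc": roc_auc_score(y_true, y_score),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }


# Toy labels and predicted probabilities, invented for illustration only.
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

cutoff = optimal_cutoff(y_true, y_score)
print(cutoff, evaluate(y_true, y_score, cutoff))
```

AUC is independent of the cut-off, while Recall and F1 change with it, which is why the threshold is chosen separately once the model has been trained.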

**4. Summary and conclusions**   
1) Conclusions   
2) Next steps   
3) Summary of the exercise   

## 3. Next steps
Planned improvements for future deliverables:
- Rethink and **correct the target variable**
- **Fine tune the models** and hyperparameters
- Get **more features**: 
    - New: location relative to points of interest and transport stations; web scraping of Google Places for more accurate information about commercial premises status; store prices from Idealista
    - More granularity of the existing (neighbourhood or postal code level)    
- **Pilot** the results. This could be slow, so I will look for conditions that can be tested quickly, e.g. visiting the premises with the highest probability of closure and validating the context against the data available. 
- **Code optimization**
- **Industrialize** all the steps in a pipeline
- **End user web or app** for information queries

## 4. Appendix:

### A) Data
All the datasets needed to run the project are in the folder /TBV_v1/Data. Please clone or download the folder TBV_v1 from [TFM Github repository](https://github.com/bdiazcor/TFM-Test-your-business-viability) to your local PC and run the different parts of this document from there. Do not change the directory structure locally.

### B) Support Notebooks and scripts 
- [Points_in_radius](./Points_in_radius.ipynb): calculates the number of retail stores of the same category within a radius. Its execution takes time (40 min), so the result is already available to merge with the rest of the information
- [clean_functions](./clean_functions.py): set of functions that simplify the notebooks and make the code reusable 
- [dataset_for_modeling](./dataset_for_modeling.py): a simplified version of the Exploratory Data Analysis chapter that generates the dataset for modeling in the other notebooks (Fine tuning Random Forest and Fine tuning XGBoost)
- [Future_prediction](./Future_predictions.py): a function that returns information about the commercial premises for which predictions are made. 
- [Metrics](./metrics.py): a script that includes several functions to plot the ROC and Precision-Recall curves and to calculate the optimal cut-off
- [Random Forest Tuning](./TBV1_clas_rf.ipynb): code extract prepared to be executed independently of Classification Modeling Notebook. The results have been already included in Classification Modeling Notebook
- [Xgboost tuning](./TBV1_cla_xgb.ipynb): code extract prepared to be executed independently of Classification Modeling Notebook. The results have been already included in Classification Modeling Notebook
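
The idea behind the Points_in_radius notebook, counting stores of the same category within a radius, can be sketched in plain Python. This is not the notebook's implementation: it assumes coordinates already converted to latitude and longitude, uses the haversine formula instead of a library call, and the function names, the 200 m radius and the toy stores are all hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def count_same_category_in_radius(stores, radius_m):
    """For each (lat, lon, category) store, count other stores of that category nearby."""
    counts = []
    for i, (lat_i, lon_i, cat_i) in enumerate(stores):
        nearby = 0
        for j, (lat_j, lon_j, cat_j) in enumerate(stores):
            if i == j or cat_i != cat_j:
                continue
            if haversine_m(lat_i, lon_i, lat_j, lon_j) <= radius_m:
                nearby += 1
        counts.append(nearby)
    return counts


# Toy stores in Madrid, invented for illustration only.
toy_stores = [
    (40.4168, -3.7038, "bar"),
    (40.4170, -3.7040, "bar"),     # roughly 30 m from the first bar
    (40.4169, -3.7039, "bakery"),  # close by, but a different category
    (40.4530, -3.6883, "bar"),     # several kilometres away
]

print(count_same_category_in_radius(toy_stores, radius_m=200))
```

The pairwise loop grows quadratically with the number of stores, so a comparison of this kind over a full census can be slow, which is consistent with precomputing the result once and merging it afterwards.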

### C) Variables dictionary
- [Dictionary of variables](./dictionary.pdf)

### D) References

- Madrid council open portal (retail stores and activities census): https://datos.madrid.es/portal/site/egob/menuitem.c05c1f754a33a9fbe4b2e4b284f1a5a0/?vgnextoid=23160329ff639410VgnVCM2000000c205a0aRCRD&vgnextchannel=374512b9ace9f310VgnVCM100000171f5a0aRCRD&vgnextfmt=default

- Madrid council open portal (population census): https://datos.madrid.es/portal/site/egob/menuitem.c05c1f754a33a9fbe4b2e4b284f1a5a0/?vgnextoid=1d755cde99be2410VgnVCM1000000b205a0aRCRD&vgnextchannel=374512b9ace9f310VgnVCM100000171f5a0aRCRD&vgnextfmt=default

- Description of attributes in commercial premises datasets: https://datos.madrid.es/FWProjects/egob/Catalogo/Economia/Ficheros/Estructura_DS_FicheroCLA.pdf

- Incomes per district: https://www.expansion.com/economia/2019/09/12/5d7a1c78e5fdea4b218b458e.html

- EDA analysis and features transformation: https://medium.com/vickdata/four-feature-types-and-how-to-transform-them-for-machine-learning-8693e1c24e80 

- Features encoding: https://towardsdatascience.com/an-easier-way-to-encode-categorical-features-d840ff6b3900  

- Roc and precision and recall curves: https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/ 

- Fine tuning a classifier with Gridsearch: https://towardsdatascience.com/fine-tuning-a-classifier-in-scikit-learn-66e048c21e65

- Xgboost fine tuning: https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/

- Xgboost tutorial: https://www.datacamp.com/community/tutorials/xgboost-in-python

- Optimal cut-off point: https://stackoverflow.com/questions/28719067/roc-curve-and-cut-off-point-python

- Feature importance: https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/

- Cumulative gain and lift curves: https://www.datavedas.com/model-evaluation-in-python/