# The Project Task

*Welcome to your new role as a data scientist at one of the world's leading **payment service providers**.* 

*On your very first day, business stakeholders ask you to set up a model to forecast **credit card fraud** among customers. Local management aims to reduce the number of fraudulent transactions in 2021 by 25%. Can you help them reach this target? How would you structure such a project?* 

We will use the Microsoft Team Data Science Process (TDSP) as inspiration. 

# MS Team Data Science Process (TDSP)

We go through the following *data science lifecycle*:

## Business understanding

*    We approach **business users** and ask how they currently deal with credit card fraud. Do they have a set of fixed rules that they apply to each customer transaction? Do they have a special app developed for that purpose? We take a close look at how they use the app and how helpful it is for their daily work.


*    We approach the **IT department** to get an idea of the data systems. Are data stored within a cloud environment or within a traditional database system? Which data warehouse (DWH) architectures are in place?


*    Finally, we approach colleagues, for instance from the business intelligence department, who have a detailed **business understanding**. We ask which variables are important for identifying fraudulent transactions. Is it the age and the gender of the customer? Is it the distance between the home country of the customer and the transaction country? Later on, during modeling, we will find out whether these variables are indeed important for fraudulent transactions. In many cases business users have a "feeling" about a problem, but "feelings" can be biased and very often do not reflect the real underlying explanatory features. 

## Data acquisition and understanding

*    Now the IT department has granted us access to the data warehouse (DWH) system. We try to understand the columns stored in the tables, and ask IT **how, when, and how often these entries are stored**. We also go back to the business users to understand this information from a business perspective. 


*    Here we can already run first **statistical tests** to check whether certain columns in these tables are relevant for forecasting credit card fraud. Perhaps we already get a first idea of the **possible accuracy of the system.**
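
As an illustration of such a first statistical test, the sketch below checks whether a binary column is associated with the fraud label, using a Pearson chi-square test on a 2x2 contingency table. The function name `chi_square_2x2`, the choice of a "foreign vs. domestic" column, and all counts are assumptions made up for this example; they do not come from a real data set. The test is written with the standard library only and without a continuity correction. In practice one would probably use a statistics library instead, for example `scipy.stats.chi2_contingency`.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    table = [[a, b], [c, d]]: each row is one group of a binary column,
    the columns are the counts of (fraudulent, legitimate) transactions.
    Returns (statistic, p_value). No continuity correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the column and the fraud label were independent
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    # For one degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up counts, for illustration only
counts = [
    [30, 970],    # foreign transactions: 30 fraudulent, 970 legitimate
    [20, 8980],   # domestic transactions: 20 fraudulent, 8980 legitimate
]
stat, p = chi_square_2x2(counts)
print(f"chi2 = {stat:.1f}, p = {p:.3g}")
```

A small p-value only suggests that the column and the fraud label are not independent. It does not tell us how useful the column will be in a model. Because fraud is rare, the expected counts in some cells can be small, in which case an exact test may be more appropriate.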

## Modeling

*     **Approach**: First we have to understand which question we want to answer using machine-learning methods. Is it: "How many transactions on a given day are fraudulent?" Or is it: "What is the probability that the next transaction is fraudulent?" **The question we ask is closely connected to the method we use: regression, classification, or clustering.**


*     **Success measures**: In many meetings with business partners we agree on the success measure of our project. For instance, the business could ask for a model that finds 50% more fraudulent transactions than the current rule-based system. Then we have to ask whether it is realistic that our machine-learning model can outperform the current system by 50%. Perhaps in a first version of the new model a 10% improvement is already a good start, and further improvements can potentially lead to an even higher value. 

    We try to find a good estimate for the **ROI (return on investment)**. How much money could our model potentially save the company in the upcoming years? Based on that value, management will be able to estimate the resources (data engineers, system architects, project managers...) that can be assigned to the project. 
    

*     We start to set up the roles within our team and the infrastructure we will use. 

    If many colleagues work on our project, it is very helpful to have a **code versioning tool** in place, such as git: https://git-scm.com/. Familiarize yourself with **git**, since it is the state-of-the-art versioning system used for data science projects. 
    

*    Coding starts! 
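
The points on approach and success measures above can be made concrete with a small sketch. It first shows how the same labelled transaction history leads to a classification target (one label per transaction) or to a regression target (one fraud count per day), and then computes a relative improvement over a baseline and a simple ROI. All names (`recall`, `relative_improvement`, `roi`) and all numbers are hypothetical and chosen only for illustration. The ROI formula used here, (savings - cost) / cost, is one common simplification.

```python
from collections import Counter

# Hypothetical labelled history: (day, is_fraud) for each transaction
transactions = [
    ("2021-01-01", 0), ("2021-01-01", 1), ("2021-01-01", 0),
    ("2021-01-02", 0), ("2021-01-02", 1), ("2021-01-02", 1),
]

# Classification framing: "Is this transaction fraudulent?"
# One binary target per transaction.
classification_target = [is_fraud for _, is_fraud in transactions]

# Regression framing: "How many transactions on a given day are fraudulent?"
# One count per day.
regression_target = Counter()
for day, is_fraud in transactions:
    regression_target[day] += is_fraud


def recall(frauds_found, frauds_total):
    """Share of all fraudulent transactions that a system detects."""
    return frauds_found / frauds_total


def relative_improvement(new_value, baseline_value):
    """Improvement of the new system relative to the baseline."""
    return (new_value - baseline_value) / baseline_value


def roi(savings, cost):
    """Return on investment as net gain divided by project cost."""
    return (savings - cost) / cost


# Made-up example: the rule-based system finds 40 of 100 known frauds,
# a first model version finds 44 of them.
baseline = recall(40, 100)
model = recall(44, 100)
print(f"relative improvement: {relative_improvement(model, baseline):.0%}")

# Made-up example: expected savings versus project cost
print(f"ROI: {roi(savings=500_000, cost=200_000):.1f}")
```

With these made-up numbers, the target of finding 50% more fraudulent transactions would require the model to detect 60 of the 100 known frauds instead of 40, which helps to judge whether the target is realistic.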

## Deployment

## Customer acceptance

# Setting up the git structure

Following the lifecycle stages of a data science project, the git repository structure could look like this:

```bash
├── etl
│   ├── README.md
│   ├── notebooks
│   ├── docs
│   └── src
│       ├── main.py
│       └── funcs
├── dev
│   ├── README.md
│   ├── notebooks
│   ├── docs
│   └── src
│       ├── main.py
│       └── funcs
├── val
│   ├── README.md
│   ├── docs
│   └── src
│       ├── main.py
│       └── funcs
├── depl
│   ├── README.md
│   ├── runscript.bat
│   ├── docs
│   └── src
│       ├── main.py
│       └── funcs
├── README.md
└── package.bat
```

with 

*    **etl (extract-transform-load)** : automation code for data gathering and for loading the data warehouse
*    **dev (development)** : code that builds the automated machine-learning model
*    **val (validation)** : code for the in-depth validation of the final model
*    **depl (deployment)** : the final version of the machine-learning model, adapted to the IT infrastructure

where 

*    **README.md** : describes the problem and the aim of each section
*    **src** : contains the automated source code (with a main.py script)
*    **docs** : contains documents such as plots and presentations
*    **.bat (.sh) scripts** : are automation scripts
*    **package.bat (.sh)** : creates the Python virtual environment and downloads the required packages automatically
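
The structure above can also be created by a short script instead of by hand. The following is a minimal sketch using only the standard library. The function name `scaffold` and the constant `STAGES` are assumptions made for this example, and all files are created empty. The demo writes into a temporary directory, so nothing is left behind after it runs.

```python
import tempfile
from pathlib import Path

# The four lifecycle stages used as top-level folders
STAGES = ("etl", "dev", "val", "depl")


def scaffold(root):
    """Create the empty folder and file skeleton described above."""
    root = Path(root)
    for stage in STAGES:
        stage_dir = root / stage
        (stage_dir / "docs").mkdir(parents=True, exist_ok=True)
        (stage_dir / "src" / "funcs").mkdir(parents=True, exist_ok=True)
        # Only etl and dev contain a notebooks folder in the layout above
        if stage in ("etl", "dev"):
            (stage_dir / "notebooks").mkdir(exist_ok=True)
        (stage_dir / "README.md").touch()
        (stage_dir / "src" / "main.py").touch()
    # Files that exist only once in the layout
    (root / "depl" / "runscript.bat").touch()
    (root / "README.md").touch()
    (root / "package.bat").touch()
    return root


# Demo: build the skeleton in a temporary directory and list it
with tempfile.TemporaryDirectory() as tmp:
    project = scaffold(Path(tmp) / "fraud-project")
    for path in sorted(project.rglob("*")):
        print(path.relative_to(project))
```

Note that git does not track empty directories. Folders such as `docs` or `funcs` need at least one file (a placeholder such as `.gitkeep` is a common convention) before they show up in the repository.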

# Additional Material: Cookiecutter project template

The following project layout is based on the <a target="_blank" href="https://github.com/drivendata/cookiecutter-data-science">cookiecutter data science project template</a>. 

```bash
    ├── README.md          <- The top-level README for developers using this project.
    ├── data
    │   ├── external       <- Data from third party sources.
    │   ├── interim        <- Intermediate data that has been transformed.
    │   ├── processed      <- The final, canonical data sets for modeling.
    │   └── raw            <- The original, immutable data dump.
    │
    ├── docs               <- A default Sphinx project; see sphinx-doc.org for details
    │
    ├── models             <- Trained and serialized models, model predictions, or model summaries
    │
    ├── notebooks          <- Jupyter notebooks. Naming convention requires no spaces and lower case: file_name.ipynb
    │
    ├── references         <- Data dictionaries, manuals, and all other explanatory materials.
    │
    ├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
    │   └── figures        <- Generated graphics and figures to be used in reporting
    │
    ├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
    │                         generated with `pip freeze > requirements.txt`
    │
    ├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
    ├── src                <- Source code for use in this project.
    │   ├── __init__.py    <- Makes src a Python module
    │   │
    │   ├── data           <- Scripts to download or generate data
    │   │   ├── make_dataset.py
    │   │   └── __init__.py
    │   │
    │   ├── features       <- Scripts to turn raw data into features for modeling
    │   │   └── build_features.py
    │   │
    │   ├── models         <- Scripts to train models and then use trained models to make
    │   │   │                 predictions
    │   │   ├── predict_model.py
    │   │   ├── train_model.py
    │   │   └── __init__.py
    │   │
    │   └── visualization  <- Scripts to create exploratory and results oriented visualizations
    │       ├── visualize.py
    │       └── __init__.py
    │
    └── utils
        ├── __init__.py    <- Makes utils a Python module
        └── functions.py   <- Methods shared across packages
```