# Preparing your Data Science Virtual Environment

## Important Disclaimers

The main objective of this material is to serve as a hands-on study guide for members of FATEC's DATA SQUAD who want to learn more about Data Science and Machine Learning. It is based on the 2nd edition of Aurélien Géron's "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems" (O'Reilly Media, 2019) and is a work in progress. Some of the work will also be based on "Data Science from Scratch: First Principles with Python" by Joel Grus.
    
Basic knowledge of Python is required, although we will link to outside materials whenever the need arises.
    
    


## First steps: Installing the required tools

We will show you how to install and run a Jupyter Notebook Data Science environment using either a Python virtual environment or Docker, on both Windows and Linux (Ubuntu). Which one you use is up to you :)

### Running Jupyter Notebooks on a Virtual Environment

Before we begin, make sure you have all the necessary tools. We will be using Jupyter Notebook to write and run the example code, so we start by showing how to install it.

1. Make sure you have Python installed on your computer, preferably with its executable on the `PATH` environment variable. You can download it [here](https://www.python.org/downloads/). There are some visual instructions and a very brief introduction to Python [here](https://github.com/cafecompython/workshop_python_etec/blob/master/Introdu%C3%A7%C3%A3o_Python.pdf), in Portuguese.


2. After installing Python, you should learn about virtual environments. You can read the official documentation [here](https://docs.python.org/3/tutorial/venv.html). A virtual environment isolates each project in its own self-contained environment. This way, the project's dependencies and requirements stay in their original form, regardless of package or system updates (automatic or unintentional) made to the rest of the system.

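To create a virtual environment, open a Command Prompt (Windows) or a terminal window (Linux - Ubuntu) and run the commands below. The prompts show Windows paths, but the same commands apply on Linux, where the interpreter is usually invoked as `python3` and the `venv` module may require the `python3-venv` package. Type only the command itself: the text after each `#` is an explanatory comment and is not part of the command.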

`Command-Line`
```shell
    C:\User\Desktop> mkdir Data_Science # creates a new directory called "Data_Science"
    C:\User\Desktop> cd Data_Science # navigates to the new directory
    C:\User\Desktop\Data_Science> python -m venv my_env # creates a virtual environment called "my_env" in the new directory (this can take a while)
```
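
The same step can also be scripted with Python's standard-library `venv` module, which is the module that `python -m venv` runs. The sketch below is only an illustration and is not required for the setup; the helper name `create_env` is hypothetical.

```python
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment in env_dir and return its path.

    create_env is a hypothetical helper; venv.create is the
    standard-library function that does the actual work.
    """
    env_path = Path(env_dir)
    # with_pip=True also bootstraps pip inside the new environment
    venv.create(str(env_path), with_pip=with_pip)
    return env_path
```

Calling `create_env("my_env")` from inside the Data_Science directory should produce the same `my_env` folder as the command above.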

To activate your virtual environment, run one of the commands below:
    
On Windows:
   
`Command-Line`
```shell
    C:\User\Desktop\Data_Science> my_env\Scripts\activate.bat
```

You will see an output like the one below:
    
`Command-Line`
```shell
    (my_env)C:\User\Desktop\Data_Science>
```
    
On Linux (Ubuntu):
    
`Command-Line`
```shell
    $ source my_env/bin/activate
```
As on Windows, the prompt will look similar to the one below:
 
`Command-Line`
```shell
 (my_env)$
```

To exit the virtual environment, run the `deactivate` command, which works on both Windows and Linux (Ubuntu). The `(my_env)` prefix disappears and you are back at the regular prompt:

`Command-Line`
```shell
    C:\User\Desktop\Data_Science>
```

or

`Command-Line`
```shell
    $
```
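
Besides looking at the prompt, you can ask Python itself whether it is running from a virtual environment. The snippet below is a minimal sketch that assumes the environment was created with `python -m venv`; the function name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

Run with the environment activated, the first line should report `True` and the prefix should end in `my_env`.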

3. Now that you know how to prepare a virtual environment, it's time to use `pip` to install the packages/modules that will be used (Pandas, NumPy, Matplotlib, Seaborn, Bokeh, Scikit-Learn and others).

First of all, start your virtual environment:

On Windows:
    
`Command-Line`
```shell
    C:\User\Desktop\Data_Science> my_env\Scripts\activate.bat
```
On Linux (Ubuntu):
    
`Command-Line`
```shell
    $ source my_env/bin/activate
```

Now, it's time to install the packages:

`Command-Line`
```shell
(my_env)C:\User\Desktop\Data_Science> pip3 install jupyter numpy pandas matplotlib scikit-learn
```
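
To confirm that the installation worked, you can check from Python whether each library can be found. This is a minimal sketch: the helper `missing_packages` and the list `EXPECTED` are hypothetical names. Note that the import name can differ from the name given to `pip` (Scikit-Learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names, which can differ from the names passed to pip
# (Scikit-Learn is imported as "sklearn").
EXPECTED = ["numpy", "pandas", "matplotlib", "sklearn"]


def missing_packages(import_names):
    """Return the names that the import system cannot locate."""
    return [name for name in import_names if find_spec(name) is None]


missing = missing_packages(EXPECTED)
print("missing:", missing if missing else "none")
```

If any name is listed as missing, make sure the virtual environment is activated and run the `pip` command again.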

4. After that, you just need to start Jupyter Notebook:

`Command-Line`
```shell
(my_env)C:\User\Desktop\Data_Science> jupyter notebook
```

### Running Jupyter Notebooks on Docker with Docker-Compose

Another way to set up a Data Science environment is to leverage Jupyter Docker Stacks. In brief, they are a set of ready-to-run Docker images with Jupyter and common Data Science tools preinstalled, maintained by the [Jupyter Community itself](https://github.com/jupyter/docker-stacks/). All necessary documentation is [here](https://jupyter-docker-stacks.readthedocs.io/en/latest/index.html).

[Docker](https://docs.docker.com/) is a tool for setting up deployable and shareable isolated environments called containers, in a flexible and lightweight way. Containers resemble virtual machines but share the host's operating system kernel, which keeps them small and quick to start. Through [Docker Hub](https://hub.docker.com/search?q=&type=image), the community shares images of a wide variety of operating systems and environments. In summary, Docker lets you run the tools you need inside a container instead of installing each of them directly on your computer.

1. First, you need to have Docker installed on your computer. For **Windows**, we recommend using [Docker Desktop](https://www.docker.com/products/docker-desktop). After clicking the link, you will be taken to a login form for Docker Hub. If you don't already have a Docker Hub account, sign up, then download and install the recommended version for your OS.

On Linux (Ubuntu), first uninstall any old versions of Docker you might have. Run this command from the terminal:

`Command-Line`
```shell
    $ sudo apt remove docker docker-engine docker.io containerd runc
```

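After that, update the apt package index and install Docker Engine Community. These packages come from Docker's own apt repository, so this step assumes that repository has already been added as described in Docker's official installation guide for Ubuntu: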

`Command-Line`
```shell
    $ sudo apt update && sudo apt install docker-ce docker-ce-cli containerd.io
```

For convenience, start Docker now and configure it to start at boot (this command needs to be run only once):

`Command-Line`
```shell
    $ sudo systemctl start docker && sudo systemctl enable docker
```
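
Once the installation finishes, on either system, a quick way to check that the command-line tools are reachable is to look them up on the `PATH`. The sketch below uses only the standard library; `find_tools` is a hypothetical helper name. On Linux, `docker-compose` is only installed in a later step, so it may still be reported as not found at this point.

```python
import shutil


def find_tools(names):
    """Map each command name to its full path, or None when it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools(["docker", "docker-compose"]).items():
    print(f"{tool}: {path or 'not found on PATH'}")
```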



2. With Docker Desktop installed on Windows (by default it starts with your computer), or with Docker installed on Linux (Ubuntu), create and navigate to a new directory called Data_Science from the Command Prompt or the terminal:

`Command-Line`
```shell
    C:\User\Desktop> mkdir Data_Science # creates a new directory called "Data_Science"
    C:\User\Desktop> cd Data_Science # navigates to the new directory
```

3. Inside this directory, let's create a new directory called jupyter, where we are going to create a Dockerfile:

`Command-Line`
```shell
    C:\User\Desktop\Data_Science> mkdir jupyter # creates a new directory called "jupyter"
    C:\User\Desktop\Data_Science> cd jupyter # navigates to the new directory
```

Using your favorite text editor or IDE, create a new file and save it as **Dockerfile** (with no file extension) in the newly created "jupyter" directory. Your directory tree should look like this:

```
Data_Science
|
|--jupyter
    |  Dockerfile
```

Inside the Dockerfile, write the lines below and save the file:

`Dockerfile`
```
    FROM jupyter/all-spark-notebook
    VOLUME /notebooks
    WORKDIR /notebooks
```

In short, a **Dockerfile** is a recipe that tells Docker how to build an image, which can then be run as a container. In this case, `FROM` starts from an image in Jupyter's repository on Docker Hub that already includes packages for most Data Science uses, from NumPy to Spark. `VOLUME` then declares /notebooks as a mount point for files stored outside the container, and `WORKDIR` makes /notebooks the working directory inside the container, creating it if it does not exist.

4. After saving and closing the **Dockerfile**, let's navigate back to the Data_Science directory and create a ***docker-compose.yml*** file:

```
Data_Science
|  docker-compose.yml
|
|--jupyter
    |  Dockerfile
```

Inside the docker-compose.yml file, write and save:

`docker-compose.yml`
```
version:  "3"
services:

  jupyter:
    build: 
      context: ./jupyter
      
    volumes:
      - ./notebooks:/notebooks
      - ./data:/data
    ports:
      - 8888:8888
```
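
If you prefer not to create the directories and files by hand, the layout from steps 2 to 4 can be generated by a short script. This is only a sketch under the assumption that you want exactly the two files shown in this guide; `scaffold`, `DOCKERFILE` and `COMPOSE` are hypothetical names, and existing files with the same names are overwritten.

```python
from pathlib import Path

# File contents copied from this guide
DOCKERFILE = """\
FROM jupyter/all-spark-notebook
VOLUME /notebooks
WORKDIR /notebooks
"""

COMPOSE = """\
version: "3"
services:

  jupyter:
    build:
      context: ./jupyter

    volumes:
      - ./notebooks:/notebooks
      - ./data:/data
    ports:
      - 8888:8888
"""


def scaffold(root):
    """Write jupyter/Dockerfile and docker-compose.yml under root; return both paths."""
    root = Path(root)
    build_dir = root / "jupyter"
    # parents/exist_ok make the call safe to repeat on an existing project
    build_dir.mkdir(parents=True, exist_ok=True)
    dockerfile = build_dir / "Dockerfile"
    compose = root / "docker-compose.yml"
    # write_text replaces any existing file with the same name
    dockerfile.write_text(DOCKERFILE)
    compose.write_text(COMPOSE)
    return dockerfile, compose
```

For example, `scaffold("Data_Science")` run from the Desktop would create the tree shown above.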

Docker-compose is a tool that helps manage multiple Docker containers as services. Although we are only going to have a single container for now, using docker-compose helps us map ports and share volumes between our Jupyter container and our computer. This way, when we shut down the container, we are not going to lose the work done inside it, since the files we create are saved on our own machine.

- version: declares which version of the Compose file format we are using;
- services: declares which services we want to create; in our case, "jupyter" is the name of our service;
- build and context: point to the directory whose Dockerfile is used to build the service's image; in our case, the jupyter directory;
- volumes: maps the directories 'notebooks' and 'data' inside the container to the directories 'notebooks' and 'data' that will be created in our Data_Science directory when we first run docker-compose;
- ports: maps port 8888, where Jupyter Notebook runs inside the Docker container, to port 8888 on your computer.
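
In both the `volumes` and `ports` entries, the value to the left of the colon refers to your computer (the host) and the value to the right refers to the container. The sketch below makes that order explicit for the entries used in this guide. `split_mapping` is a hypothetical helper that handles only the simple two-field form, not the longer variants that Compose also accepts.

```python
def split_mapping(entry):
    """Split a two-field 'host:container' entry into (host, container)."""
    parts = str(entry).split(":")
    if len(parts) != 2:
        # Longer forms (extra fields) are outside the scope of this sketch
        raise ValueError(f"expected exactly one ':' in {entry!r}")
    host, container = parts
    return host, container


for entry in ["./notebooks:/notebooks", "./data:/data", "8888:8888"]:
    host, container = split_mapping(entry)
    print(f"host {host} -> container {container}")
```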


5. On **Windows**, to start our jupyter service, open a Command Prompt window, navigate to the Data_Science directory and run:

`Command-Line`
```
C:\> cd Desktop\Data_Science # navigates to the Data_Science directory
C:\User\Desktop\Data_Science> docker-compose up --build # builds the image and starts the service defined in our docker-compose file
```

On **Linux (Ubuntu)**, make sure you have **pip** installed. If not, run the command below:

`Command-Line`
```
$ sudo apt update && sudo apt install python3-pip
```

After that, using **pip**, install **docker-compose**:

`Command-Line`
```
$ pip3 install --user docker-compose
```

Run docker-compose from our Data_Science directory:

`Command-Line`
```
$ cd Data_Science # navigates to the Data_Science directory
$ docker-compose up --build
```

6. Now, Docker will download the Jupyter image from Docker Hub, build a container with all its dependencies and start Jupyter Notebook. To access the Jupyter panel, open a browser and paste the last line printed on your command line which, if everything ran smoothly, will look something like:

`http://127.0.0.1:8888/?token=<token>`
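
If the page does not load, it can help to check whether anything is listening on port 8888 at all. The following is a minimal troubleshooting sketch using only the standard library; `port_is_open` is a hypothetical helper, and a successful connection only shows that some program is listening on the port, not that it is Jupyter.

```python
import socket


def port_is_open(host="127.0.0.1", port=8888, timeout=1.0):
    """Return True when a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


print("port 8888 reachable:", port_is_open())
```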