#### Week 7 - Analytical Engineering

# Lesson 1: Intro to Analytical Engineering / Pipelines

> ### Warm up exercise:
>
> To start the second week of working with databases, we want to come back to the data protection topic. This warm-up exercise is meant to prepare the students for working with credentials this week.
> Split the students into 3 groups and assign each group one question:
>
> - What is a bad password? (present some concrete ideas)
>   - the same password everywhere
>   - a set of consecutive numbers like 123456...
>   - strings like 'password', 'qwerty', etc.
>   - a short password
>   - only (lowercase) letters
>   - including your name or any other personal data in the password
>   - using real words
>
> - How should you not store a password?
>   - on a piece of paper
>   - or, in general, in a place easily visible to others
>   - GitHub
>   - a shared file
>   - sent as a message to someone
>
> - What might go wrong in a data breach?
>   - identity theft
>   - theft of money
>   - personal data leakage...

After the group work, ask the students to present their ideas.
At the end mention that we will try to make storing credentials more secure this week.

Present the project of the week:

##### 🎯 Project "Weather vs Flights Data" Goal: Construct an ELT data pipeline using Python, pandas, more advanced SQL, SQLAlchemy, and dbt

This week, you will learn how to connect to your database with Python and how to design and automate a pipeline. Next week, we can use the collected data to build a dashboard that visualizes the results.

Over the next few days, you will work with a comprehensive real-world API from https://dev.meteostat.net/api/. It gives us access to historical weather records (daily and hourly) from almost anywhere in the world. We will aim to obtain last year's weather data for 3 airport locations: New York (JFK), Los Angeles (LAX), and Miami (MIA).


### Weather Pipeline Steps:
https://docs.google.com/presentation/d/1nzs5R1U-dkHu19YMmKkSLXKLSHYm5XozARAx34SzYJw

#### Milestones

1. Access data from a real-world API using a Python script

2. Import the raw data into a PostgreSQL database using SQLAlchemy

3. (optional: we don't have it yet) Set up logging in order to catch problems and log successes

4. Connect the database to dbt Cloud, which will be used to parse and prepare the data

5. Learn how to use Common Table Expressions (CTEs) in SQL

6. Clean and prepare the data on dbt Cloud using CTEs and the power of dbt

7. Answer questions with the data and save the answers in a data mart for your stakeholders

   

**Note:** The lectures on logging, managing PostgreSQL users and roles, and query optimization and indexing would go beyond the scope of this week. Consider these topics as next steps to look into.

### Objectives

1. Pyramid of needs in data aka data roles
2. Data roles
3. Tech stack
4. ETL vs ELT
5. dbt
6. `.env` Files

### 1. Pyramid of needs in data aka data roles

In 2017, Monica Rogati wrote an [article about the pyramid of needs in Data Science](https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007).

![pyramid](monica_rogati.jpeg)

The pyramid of needs stresses that the most essential part of any data project is its foundation: data collection, cleaning, storage, modelling, etc., which is typically considered data engineering. The next level is data analytics. Data science / AI should only be the cherry on top, and only once all the lower levels of the pyramid are covered.

Even though the article focuses on AI and Data Science, it opened a discussion about the different parts of data projects, the different roles in data, and how they are all connected with each other.

### 2. Data roles

In data work we typically distinguish the following roles, each connected with some of the tasks and areas mentioned above:

- Data Engineers
  - building and maintaining data infrastructure (data pipelines)
  - collecting raw data sources 
  - cleaning and modelling data for the later data analysis usage 
- Data Analysts
  - (exploratory) data analysis
  - A/B testing 
  - communication with stakeholders (domain knowledge)
- Data Scientists
  - feature engineering
  - machine learning
  - predictions and estimations

However, even though we usually draw boundaries between these roles, reality is much more complex. Data analysts often cover topics from data engineering or data science, and vice versa.

### 3. Tech stack

There are plenty of tools and types of tools used in data projects. Some of them, such as BI tools (e.g. Tableau), are usually specific to data analytics, while many others are shared between different data roles.

![stack](data_stack.png)

### 4. ETL vs ELT

> Explain the steps and then the differences between the two types of pipelines. Here is a great resource:
>
> https://www.geeksforgeeks.org/difference-between-elt-and-etl/

Data engineering is often associated with creating data pipelines - technical solutions responsible for collecting, storing, cleaning and modelling data for further usage. There are two different strategies or types of data pipelines: ETL and ELT:

|       ETL       |       ELT       |
| :-------------: | :-------------: |
| ![etl](etl.png) | ![elt](elt.png) |

The letters in both names refer to:

- (E)xtract - identifying and reading data from one or more sources (such as databases, files, ERP, CRM, etc.)
- (T)ransform - transforming the raw source data into the target format required for analysis projects
- (L)oad - storing the extracted data in the target system (usually a data warehouse or a data lake)

Historically, data pipelines were ETL pipelines because storing huge amounts of data was expensive. In an ETL pipeline, the data is transformed and modelled for further analysis before it is loaded (stored).
As storage has become easier and cheaper, the ELT model has become much more popular. Here the data is loaded first and can be transformed later. That means you don't necessarily need to know in advance how you want to use your data, which makes the approach more flexible. This model has also made it easier for data analysts to work on data transformations together with data engineers.

In our case our ELT process will be covered by:

- (E)xtract - fetching our data (in JSON form) from the weather API
- (L)oad - using a db client (`sqlalchemy`) to connect to a db and send the raw data to it
- (T)ransform - using `dbt` as the data transformation tool
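
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under loud assumptions: the payload is a made-up stand-in for an API response (not real Meteostat output), an in-memory SQLite database replaces PostgreSQL and `sqlalchemy`, and a plain SQL statement plays the role that `dbt` will take over later. The table and column names are hypothetical.

```python
import json
import sqlite3

# Hypothetical payload standing in for an API response (not real Meteostat output).
RAW_RESPONSE = json.dumps({
    "data": [
        {"date": "2023-01-01", "station": "JFK", "tavg": "4.5"},
        {"date": "2023-01-02", "station": "JFK", "tavg": "6.5"},
        {"date": "2023-01-01", "station": "MIA", "tavg": "24.0"},
    ]
})


def extract(response_text):
    """(E)xtract: parse the JSON text and return the list of records."""
    return json.loads(response_text)["data"]


def load(conn, records):
    """(L)oad: store the records as they arrived, without cleaning them."""
    conn.execute("CREATE TABLE raw_weather (date TEXT, station TEXT, tavg TEXT)")
    conn.executemany(
        "INSERT INTO raw_weather VALUES (:date, :station, :tavg)", records
    )


def transform(conn):
    """(T)ransform: build a typed, aggregated table inside the database."""
    conn.execute(
        """
        CREATE TABLE avg_temp_per_station AS
        SELECT station, AVG(CAST(tavg AS REAL)) AS avg_temp
        FROM raw_weather
        GROUP BY station
        """
    )


# SQLite in memory is an assumption for this sketch; the course uses PostgreSQL.
conn = sqlite3.connect(":memory:")
load(conn, extract(RAW_RESPONSE))
transform(conn)
result = conn.execute(
    "SELECT station, avg_temp FROM avg_temp_per_station ORDER BY station"
).fetchall()
print(result)
```

Note that the raw table keeps the temperatures as text, exactly as they were received. The cleaning (casting to numbers) and the aggregation only happen in the transform step, inside the database, which is what distinguishes ELT from ETL.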

### 5. What is dbt?

**dbt** is a transformation workflow tool (it handles the `T` part of ELT) that we will learn about this week. More on that in the following sessions, but some of its properties are:

- **dbt** compiles and runs your analytics code against your data platform
- its data models are reusable (modular) and can be referenced in subsequent work
- you can design tests to check the quality of your data and generate documentation for your tables

### 6. `.env` Files


It's common for teams to maintain distinct "environments" for their codebase. These separate environments allow thorough testing before changes are deployed to the production environment, where they reach end users. When working with multiple environments, developers often use multiple `.env` files to store credentials. For instance, they might have one `.env` file containing database keys for development and another for production.

Separating code from credentials lowers the risk of unauthorized individuals gaining access to sensitive data in the cloud.

**.env** files are designed to store credentials in a key-value format for the various services that the program uses. They are meant to be stored locally and never shared in online code repositories, so that sensitive information remains confidential. Each developer on a team typically manages one or more `.env` files, tailored to the environments they work with.
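
To make the key-value format and the one-file-per-environment idea more concrete, here is a small sketch using only the standard library. It is an illustration, not a replacement for a real `.env` library: the parser is hypothetical and deliberately simplified (it does not handle inline comments, for instance), and the file names `.env.development` and `.env.production` are just one possible naming convention assumed for this example.

```python
import tempfile
from pathlib import Path


def parse_env_text(text):
    """Parse simple KEY=value lines into a dict (illustration only)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and full-line comments.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one pair of matching quotes.
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_env_for(folder, environment):
    """Read the hypothetical file `.env.<environment>` from `folder`."""
    return parse_env_text((Path(folder) / f".env.{environment}").read_text())


# Two throwaway files with made-up values, one per environment.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, ".env.development").write_text("POSTGRES_DB = 'dev_db'\n")
    Path(folder, ".env.production").write_text("POSTGRES_DB = 'prod_db'\n")
    dev = load_env_for(folder, "development")
    prod = load_env_for(folder, "production")

print(dev["POSTGRES_DB"], prod["POSTGRES_DB"])
```

The same code runs against either file, so switching environments only changes which credentials are read, never the code itself.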

#### Usage

In this section, we’ll walk through how to use a `.env` file in a basic Python project.

1. To begin, head to the root of your **week folder** and create a `.env` file containing the credentials you’d like injected into your codebase. It may look something like this:

```python
POSTGRES_USER = 'marylinmonroe'
POSTGRES_PASS = '382jLK393fk3k2' # never add your password to a jupyter notebook!
POSTGRES_HOST = 'data-analytics-course-2.c8g8r1deus2v.eu-central-1.rds.amazonaws.com'
POSTGRES_PORT = '5432'
POSTGRES_DB = 'minty_floats'
POSTGRES_SCHEMA = 's_marylinmonroe'
```

2. Keep in mind that the `.env` file should NOT be uploaded to **GitHub**. Therefore the `.env` file should always be listed in your `.gitignore` file! (it should be there already) A file called `.env_example`, with placeholder values instead of real credentials, can be uploaded to show what the `.env` file should contain.


3. Now, to inject the secrets into your project, you can use a popular module like `python-dotenv`; it parses the `.env` file and makes your secrets accessible within your codebase, either as a dictionary or as environment variables. Go ahead and install the module:

In [None]:
pip install python-dotenv

4. Import the module at the top of the start script for your codebase:

> **Teacher's Note:**  
> There are two ways; the first is **less intrusive**. Decide whether you want to introduce both or just one.

**First option:** `dotenv_values()` will only read the `.env` file and return a *temporary* dictionary

In [None]:
# getting API and DB credentials

from dotenv import dotenv_values

config = dotenv_values()
pg_user = config['POSTGRES_USER']  # align the key label with your .env file !
pg_host = config['POSTGRES_HOST']
pg_port = config['POSTGRES_PORT']
pg_db = config['POSTGRES_DB']
pg_schema = config['POSTGRES_SCHEMA']
pg_pass = config['POSTGRES_PASS']


In [None]:
pg_host

**Second option:** `load_dotenv()` actually loads the key/value pairs into the environment variables of the running Python process (`os.environ`)

In [None]:
# getting API and DB credentials (uncomment to use this option)

# import os
# from dotenv import load_dotenv

# load_dotenv()
# pg_user = os.getenv('POSTGRES_USER')  # align the key label with your .env file !
# pg_host = os.getenv('POSTGRES_HOST')
# pg_port = os.getenv('POSTGRES_PORT')
# pg_db = os.getenv('POSTGRES_DB')
# pg_schema = os.getenv('POSTGRES_SCHEMA')
# pg_pass = os.getenv('POSTGRES_PASS')

In [None]:
pg_host
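
Since the credentials will be used to connect to the database with SQLAlchemy, a natural next step is to assemble them into a connection URL. The sketch below is one possible way to do that, assuming the key names from the `.env` example above. The helper `build_postgres_url` and the values in `example_config` are hypothetical. The function accepts any mapping, so it should work with the dictionary returned by `dotenv_values()` as well as with `os.environ`.

```python
from urllib.parse import quote_plus

# Key names follow the .env example in this lesson.
REQUIRED_KEYS = (
    "POSTGRES_USER",
    "POSTGRES_PASS",
    "POSTGRES_HOST",
    "POSTGRES_PORT",
    "POSTGRES_DB",
)


def build_postgres_url(config):
    """Return a PostgreSQL connection URL built from a credentials mapping."""
    # Fail early with a readable message if a key is absent or empty.
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise KeyError(f"Missing keys in .env: {', '.join(missing)}")
    # Escape characters such as '@' or ':' that would otherwise break the URL.
    user = quote_plus(config["POSTGRES_USER"])
    password = quote_plus(config["POSTGRES_PASS"])
    return (
        f"postgresql://{user}:{password}"
        f"@{config['POSTGRES_HOST']}:{config['POSTGRES_PORT']}/{config['POSTGRES_DB']}"
    )


# Made-up values for illustration; in the lesson these come from the .env file.
example_config = {
    "POSTGRES_USER": "demo_user",
    "POSTGRES_PASS": "p@ss:word",
    "POSTGRES_HOST": "localhost",
    "POSTGRES_PORT": "5432",
    "POSTGRES_DB": "demo_db",
}
url = build_postgres_url(example_config)
# The URL contains the password, so avoid printing it in a notebook.
# It could later be passed to sqlalchemy.create_engine(url).
```

Checking for missing keys up front gives a clearer error than a failed database connection, which helps when the key labels in the code and in the `.env` file have drifted apart.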

Cool. We’ve successfully added a `.env` file with some secrets to the project and accessed those secrets in the codebase. And as long as `.env` is listed in your `.gitignore`, your secrets will stay on your machine when you push your code via git.
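
As a quick safety net, the presence of the `.env` rule in `.gitignore` can be checked from Python before pushing. The helper below is a hypothetical, simplified sketch: it only looks for the literal rules `.env` and `/.env` and does not implement full gitignore pattern matching (wildcards, negation, nested ignore files). In a real repository, the command `git check-ignore -v .env` gives the authoritative answer.

```python
import tempfile
from pathlib import Path


def env_is_ignored(gitignore_path):
    """Return True if the .gitignore file contains a plain `.env` rule."""
    path = Path(gitignore_path)
    if not path.exists():
        return False
    rules = {line.strip() for line in path.read_text().splitlines()}
    # Simplification: only the two most literal spellings are recognised.
    return ".env" in rules or "/.env" in rules


# Demonstration in a throwaway folder, so no real repository is touched.
with tempfile.TemporaryDirectory() as folder:
    gitignore = Path(folder) / ".gitignore"
    missing_file = env_is_ignored(gitignore)  # no .gitignore yet
    gitignore.write_text("__pycache__/\n")
    without_rule = env_is_ignored(gitignore)  # file exists, rule is absent
    gitignore.write_text("__pycache__/\n.env\n")
    with_rule = env_is_ignored(gitignore)  # rule is present

print(missing_file, without_rule, with_rule)
```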