<center>
<img src="./images/00_main_arcada.png" style="width:1400px">
</center>

## Lecture 3: Big Data Engineering


## Instructor:
Anton Akusok <br/> 
email:  anton.akusok@arcada.fi<br/>
messages: "Anton Akusok" @ Microsoft Teams

# Goal for today

* Understand challenges of data engineering at large scale.
* Learn about automation + version control.
* Learn about calling other services using a secret token.
* Be able to write simple automation scripts on GitHub.


## Agenda 

* 0. (intro) What is Big Data Engineering?
* 1. GitHub + Copilot for automation
* 2. Pull requests for collaborative workflows
* 3. *break*
* 4. (intro) Roles and responsibilities in an engineering team
* 5. Communication with other services: API keys and Secrets
* 6. Scheduled tasks and Cat-as-a-Service

# 0. What is Big Data Engineering?

## Big Data

Big Data is big. Too big for any single person to handle.

Big Data Engineering is about iterative development and engineering community. GitHub and automation enable these things.

So in this lecture we learn automation in GitHub.

You will notice these themes a lot over the lecture:
- working together with other people
- communication
- collaboration, work arrangement that enables collaboration

Engineering is a surprisingly social profession.  

Building a large system feels like building an anthill when you are an ant - effective collaboration is its main ingredient.

![anthill](images/anthill.jpg)

## Automation

![lego](images/lego.jpg)


Automation is not a complex tool you have to learn, like Airflow.

It is a "Lego" with little bricks that you use to build anything.

There are many ways to build what you want. They are all good. And the bricks are interchangeable.

Today we will learn these bricks.

### Automation with GitHub Actions

Today we will play around with GitHub Actions because

- GitHub is a great collaboration hub
- It is the "anthill" you will be building together in a company
- You will likely use other tools, but they work very similarly to GitHub Actions

## But I don't know GitHub Actions!

![copilot](images/copilot.jpeg)

Yes, we will use Copilot. Or any other LLM, does not matter - ideally the one that integrates with your IDE.

Writing automation with Copilot is **silly fast**. Ridiculously easy.

Like, it feels almost offensive if you learned any programming thing by reading a book before.

(You still need to read books and understand what you are doing. But from that point on, it became silly fast starting from 2023 A.D.)

Example free IDE: `Cursor.sh` https://cursor.sh/pricing 

I am not saying it's good. I am saying it has a very good LLM (GPT-4), and the free version works for today's tasks.

I *do recommend GPT-4* over any other LLMs right now. They will catch up but it will take a few years.

(grab the free one if you don't have an IDE with AI assistant already)

![cursor_ide](images/cursor.png)

Example I *recommend* using: Free Student pack from GitHub (Microsoft)

https://education.github.com/pack#offers 

It has Copilot, JetBrains (PyCharm etc.) and many others.

*NOT SUITABLE FOR TODAY* because the student status validation takes a few days!

![github_student_offers](images/github_offers.png)

## Just a chat

https://jan.ai

![jan-chat](images/jan.png)

Beware of memory and system load

![jan-gpu](images/jan-gpu.png)

## Why Copilot / LLM?

(I say Copilot but really mean "LLM AI coding assistant" from now on)

- Because nobody is a "Certified GitHub Actions programmer"
- Because you know **what** you want to do, but don't know **how**
- Copilot knows **how** without you spending hours Googling or days reading books
- It also debugs errors, which is majestic when starting with a new tech
- It has a great integration with VS Code (because Microsoft owns VS Code, GitHub, and basically OpenAI)

LLMs are really good at telling you "how", but are utter garbage at telling "what".

**You** have to think. And LLMs won't replace people any time soon. 

But they save a ton of time learning skills - like you learned to hold a spoon, to ride a bicycle, or to write specific code.

Usually a person knows many things "a little" and one thing "deeply". This is called "T-shaped knowledge". LLMs deepen the arms of the "T", making you good with stuff you know a little about.

![llms](images/llms.jpg)

## Automation is someone's computer

- Automation is code running on someone's computer. Just like you run code on a laptop.

- (It probably uses a Docker container - still a computer)

- It runs basic bash scripts. Learning the very basics of bash scripts is very helpful: you will understand what a script does.

https://docs.csc.fi/support/tutorials/env-guide/linux-bash-scripts 

![bash_scripts_readme](images/bash_csc.png)

- To automate something, first literally run it on your laptop in a terminal (Linux and Mac computers have a terminal, Windows has a Linux terminal in WSL).  

- Then put the same code in an automation script.

- There is no difference: both your laptop and an automation Docker machine are regular computers connected to the Internet.

- Automation needs an "event" that starts the run, like you press "Enter" to run a command in a terminal.

*That's all automation is - a computer, a script, and an event.*
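
These three ingredients map directly onto a GitHub Actions workflow, which is a YAML file stored in `.github/workflows`. A minimal sketch; the file name, job name, and commands below are made up for illustration:

```yaml
# .github/workflows/where-am-i.yml (hypothetical file name)
name: Where am I

# The event: start a run on every push to the repository
on: push

jobs:
  where-am-i:
    # The computer: a fresh Linux machine provided by GitHub
    runs-on: ubuntu-latest
    steps:
      - name: Look around
        # The script: the same commands you would type in a terminal
        run: |
          uname -a
          whoami
          date
```

Try the three commands in your own terminal first; the run log in GitHub shows the same kind of output, just from a different computer.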

![automation](images/robo_automation.webp)

# 1. GitHub + Copilot for automation

![github_actions](images/github-actions.png)

*some slides about GitHub Actions that I literally asked ChatGPT to generate for me because I am too lazy to do it myself...*

## GitHub Actions: Automating Your Workflow

### What are GitHub Actions?
- GitHub Actions is a powerful automation tool provided by GitHub.
- It allows you to automate tasks and workflows directly within your GitHub repository.

### Key Features
- **Automation**: Set up workflows to automatically perform tasks such as testing, building, and deploying your code.
- **Event-driven**: Trigger workflows based on various events, such as push, pull request, or issue creation.
- **Customizable**: Define your workflows using YAML syntax and customize them to fit your project's needs.
- **Integration**: Easily integrate with other tools and services, such as testing frameworks, cloud providers, and deployment platforms.

### Why GitHub Actions?
- **Streamlines development**: Automating repetitive tasks saves time and effort, allowing you to focus on writing code.
- **Ensures consistency**: With automated workflows, you can ensure consistent testing, building, and deployment processes across your projects.
- **Facilitates collaboration**: Share and reuse workflows across teams to standardize development practices and improve collaboration.


## Anatomy of a GitHub Actions Workflow

### Workflow File
- Workflows are defined in YAML files stored in the `.github/workflows` directory of your repository.
- Each workflow file contains one or more jobs, which consist of a sequence of steps to be executed.

### Events
- Workflows are triggered by events such as push, pull request, or schedule.
- You can specify the events that should trigger your workflow in the workflow file.

### Jobs and Steps
- A job is a unit of work. Jobs in the same workflow run concurrently by default.
- Each job consists of one or more steps, which are individual tasks to be executed.
- Steps can include actions, shell commands, or scripts.
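
A small sketch of how jobs and steps fit together. The job names and `echo` commands are placeholders: `lint` and `test` start at the same time, while `report` uses `needs` to wait until both have finished.

```yaml
name: Parallel jobs

on: push

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - name: Pretend to check code style
        run: echo "checking code style"

  test:
    runs-on: ubuntu-latest
    steps:
      - name: Pretend to run tests
        run: echo "running tests"

  report:
    # Runs only after both jobs above have succeeded
    needs: [lint, test]
    runs-on: ubuntu-latest
    steps:
      - name: Summarize
        run: echo "lint and test both passed"
```

Each job gets its own fresh machine, so files created in one job are not visible in another unless you pass them along explicitly.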

### Actions
- Actions are reusable units of code that perform specific tasks within a workflow.
- You can use built-in actions provided by GitHub or create custom actions tailored to your project's needs.


## Getting Started with GitHub Actions

### Creating a Workflow
- To create a new workflow, add a file to the `.github/workflows` directory of your repository, either in the browser or in your local clone.
- Name your workflow file with a `.yml` (or `.yaml`) extension and define your workflow using YAML syntax.

### Running Workflows
- Workflows are automatically triggered by events specified in the workflow file.
- You can also manually trigger workflows or schedule them to run at specific times.
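
For example, manual and scheduled triggers could look roughly like this. The workflow name and the cron expression are only an illustration:

```yaml
name: Nightly job

on:
  # Adds a "Run workflow" button in the Actions tab
  workflow_dispatch:
  # Runs on a timer; cron times are in UTC
  schedule:
    - cron: "30 5 * * 1-5"   # 05:30 UTC, Monday to Friday

jobs:
  nightly:
    runs-on: ubuntu-latest
    steps:
      - name: Do the scheduled work
        run: echo "Triggered by ${{ github.event_name }}"
```

Scheduled workflows run from the default branch only, and the shortest interval GitHub accepts is once every 5 minutes.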

### Monitoring Workflows
- View the status and logs of your workflows in the "Actions" tab of your repository.
- Monitor workflow runs, troubleshoot failures, and review logs to ensure your automation is working as expected.

### Example Workflow
```yaml
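name: CI

# Event: run this workflow on every push to the main branch
on:
  push:
    branches: [main]

jobs:
  build:
    # Computer: a fresh Ubuntu machine provided by GitHub
    runs-on: ubuntu-latest

    steps:
    # Download the repository code onto the machine
    - uses: actions/checkout@v4
    - name: Install dependencies
      run: npm install
    - name: Run tests
      run: npm test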
```

# Hands-on exercise 1: "Hello World" action on push to GitHub

Preparation:
- Have a GitHub account (seriously)
- Have or install an IDE with an LLM code assistant

Task: Write a "hello world" application using an LLM assistant.
1. Make a new repo. Or take a repo you don't care about.
2. Download / clone it locally
3. Open it in an IDE
4. Create a new branch and switch to it: `git checkout -b exercise-1-hello-world`
5. Make sure you can use the LLM assistant in the IDE
6. Ask the LLM assistant to help you write a "hello world" example in GitHub Actions
7. Follow the advice, do the coding
8. If unclear, clarify the question. Maybe ask where to save the code it suggested.
9. Commit and push to GitHub
10. Go to GitHub in browser and make a Pull Request

# 2. Pull Requests (PRs) for collaborative workflows

![pull_request](images/pull-request.png)

Git is a genius tool. Very easy to learn.

Sadly really hard to master. 

Because you don't see the reason for doing things a certain way until you join a large company.

(this is why: https://github.com/elastic/elastic-charts/pull/1475#pullrequestreview-802918615)

## Idea of Pull Requests

There is one nice and clean version of code, and many proposals for improving the code.

All development mess is hidden in these proposals.

We can ask for feedback from colleagues on our proposals, and give suggestions on their proposals.

We want to make sure the nice and clean code always works by:
- running tests on the "future" version of code before accepting a proposal
- looking at the proposed code together
- enforcing some coding standards

## What should I do?

- **ALWAYS WORK IN A BRANCH / PULL REQUEST!**

- Make separate PRs for different things

- It's OK to have many PRs and switch between them

- Try to finish, review, and merge PRs fast!

- Write tests that make sure your code still works when anyone changes it, and make sure the tests pass

- Document what you have done: commit messages, docstrings, readme

## What should I do?

(trying out today)
- Set up branch protection rules

- Use coding standards enforcement and validation

- Run automatic tests

- Handle access tokens using Secrets

## Development happens in Pull Requests

PR has: 

- **name and description** because you must know what you are going to do before starting to code

- history of commits **with meaningful messages explaining the changes**

- discussion around new code: questions, answers, comments, whole discussion threads

- tests and code standards

With branch protection rules in place, PRs are **blocked** and cannot be merged until all required checks have passed. A common check is an approval from another engineer - the "4-eyes principle".
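
An automated check is usually just a workflow that runs on the PR. A minimal sketch, assuming a made-up rule that the repository must contain a `README.md`; the workflow and job names are placeholders:

```yaml
name: PR checks

# Run for every pull request that targets the main branch
on:
  pull_request:
    branches: [main]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check that a README exists
        # A non-zero exit code fails the step, the job, and the check
        run: test -f README.md
```

A failing check only blocks the merge if it is marked as required in the branch protection rules for `main`; otherwise GitHub shows the red cross but still lets you merge.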

## Never work in `main` branch

Nobody serious ever sends code directly into the "main" branch!

Some rules can be skipped for your own 1-person repositories, but not this one.

PRs help you have a *fully working* update before it is merged into the code. They also help with testing out new stuff that may not work, or with keeping several proposals in parallel.

Committing directly into the `main` branch will leave you with broken code, and a very annoying way back to the previous working version. Don't do this to yourself.

(also interviewers will check your GitHub and you don't want them to see commits to `main`)


## Working with PRs

To be clear - a PR is a request to merge a branch. You are working in a "branch", but it feels more like you are working in a "PR" because all discussion happens in the PR.

- create a new branch and make a mess! try things out! don't be afraid of breaking something!

- clean up and get to a working version

- ask for feedback from colleagues, use their suggestions to improve the code further

- merge into the main code to get one nice update (use "squash" option to hide the mess inside the PR)

# Hands-on exercise 2: Working in a PR

Task: Make another PR that will fail an automated test
1. Go to the repo
2. Enable main branch protection
3. Add a requirement for another person's approval
4. (optional) Add Codacy for code quality checks
5. Create a new branch (from your previous branch). Let's make a lot of branches and PRs!
6. Ask the LLM assistant to help you write a GitHub Action that always fails
7. Add this as another action
8. Commit and push to GitHub
9. Go to GitHub in browser and make a Pull Request
10. Check that you cannot merge the new PR because some checks are failing

# 3. break

# 4. Roles and tasks of people in an engineering team

## Who are all these people?

We love coding.

These people are here to take up the boring parts, so we can code away.

Learn to talk to them, and you will be the happiest engineer!

## Non-engineer roles in a team

The main trio are:

- **Engineering Manager** is the team lead, and probably your official boss (if that ever matters).

- **Product Manager** is responsible for the feature experience; they will ask you to code more things.

- **Principal Engineer** is the Jedi coder who thinks at a large scale and writes technical docs. Used to be called "Software Architect".

These people are "the court" of the development process - the Product Manager is the prosecutor who gives you more work, the Engineering Manager is the defence attorney who protects their team, and the Principal Engineer ensures everybody follows the rules and writes down the final verdict (technical design document). The "competition" between PM and EM creates a suitable work plan for us, the engineers.

## Engineering Manager

Engineering manager has "direct reports" - like you and me.  
They are the ones to ask when *you* need something.

![engineering_manager](images/em.webp)

## Product Manager / Product Owner

![product_owner](images/PRODUCT%20OWNER.png)

## Principal Engineer

They can and will code, but this is a minority of their tasks. They may not code at all. And they don't belong to a specific team anymore.

This is an example of an "Individual Contributor" role that is a high-level position *without* becoming a manager.

- Think at the large scale how systems work (software architecture), how they interact with other systems

- Review plans for new systems and large system updates (we have a group of principal engineers informally called "The Jedi Council")

- Write and review TDD (technical design documents) that explain how something works before we even build it

- Review code PRs

- Mentor other engineers

## Non-engineer roles in a team

Sometimes you can meet:

- **Project Manager** coordinates the timelines between teams, and looks for dependencies. They speak Jira and Excel.

- **Technical Program Manager** is a company-scale project manager that knows coding. They may ask you about details.

- **Business people** who are curious or enthusiastic. They sound like they live in a world of their own (they do).

## "Engineers" come in different flavours

https://datatalks.club/blog/data-roles.html 

![engineers](images/engineers.png)

## Why would I care?

AI is taking over the world!  
(well, at least the skills in the world)

Future jobs are more about planning and thinking, while AI will do the implementation.

Planning is the ultimate collaboration work:
- **your** decision is always worse than **your + asking everyone around** decision
- missing information, need to ask around
- whatever you plan interacts with systems created by other people
- or other people changed something in their system, now yours is broken and you need to figure out why

## Why would I care?

Another fundamental reason - a distributed agile organization beats a centralized one in speed of development.

Teams in a distributed organization act by themselves, without asking anyone around.

Communication is the key to ensuring things continue to work.

# 5. Communication with other services: API keys and Secrets

![api-keys](images/api-key.webp)

## API Key

An API Key is a magic string that enables calling different services. 

It is basically a username and password combined into one string.

A typical and very good practice is to give API keys a limited lifetime - they will stop working after 30/90/180 days, and need to be re-created.

Of course you can automate creating a new API key using the current API key before it expires.

## Secrets

Like passwords, API keys must never go into Git repository!

There are special encrypted storages for sensitive things like API keys. They are usually called "Secrets".

A "secret" is a placeholder for a string that will be filled in only at runtime (when the automation runs).

A secret can be substituted directly into a script, exposed as a Bash environment variable, or read as a variable directly from a programming language.
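
In GitHub Actions this could look roughly like the sketch below. It assumes a repository secret named `MY_MESSAGE` (a made-up name) has already been created in the repository settings; the workflow passes it to the script as an environment variable.

```yaml
name: Use a secret

on: push

jobs:
  use-secret:
    runs-on: ubuntu-latest
    steps:
      - name: Read the secret from an environment variable
        env:
          # The placeholder is replaced with the real value only at runtime
          MY_MESSAGE: ${{ secrets.MY_MESSAGE }}
        run: |
          echo "Secret length: ${#MY_MESSAGE}"
          echo "Secret value: $MY_MESSAGE"   # the log shows *** here
```

GitHub masks exact secret values in the run log, which is why the second `echo` prints `***`. Passing the secret through `env` instead of pasting `${{ secrets.MY_MESSAGE }}` into the script keeps the value out of the command text itself.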

# Hands-on exercise 3: Secrets in GitHub

Task: Save a secret message, print it out in Actions
1. Go to GitHub website
2. Create a new secret
3. Make another branch
4. Ask the LLM to load the secret in your Actions script
5. Print the secret value with the help of the LLM - make sure it does not appear in plain text
6. Commit and push
7. Check the Actions output


# 6. Cat-as-a-Service

https://thecatapi.com

![cat-api](images/cat-api.png)

Get cats with an API call 

![cats](images/cats2.png)

## The Cat API

- An actual service where you can register and get an API key

- Real calls, receive real data

- ... but no BS like Google Drive API, and very easy to understand

- Same workflow you would use for automation at work
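
A rough sketch of what such a call can look like from a workflow. The secret name `CAT_API_KEY` is made up, and the endpoint and the `x-api-key` header are written here as I understand them from the service's documentation, so check the current docs before relying on them:

```yaml
name: Fetch a cat

on: workflow_dispatch

jobs:
  cat:
    runs-on: ubuntu-latest
    steps:
      - name: Ask The Cat API for one random image
        env:
          CAT_API_KEY: ${{ secrets.CAT_API_KEY }}
        run: |
          # --fail makes curl exit with an error on HTTP 4xx/5xx responses
          curl --fail --silent --show-error \
            --header "x-api-key: $CAT_API_KEY" \
            "https://api.thecatapi.com/v1/images/search?limit=1"
```

The response is expected to be a small JSON document that includes the URL of an image. Run the same `curl` command in your own terminal first, with the key in a local environment variable, before putting it into a workflow.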

# Hands-on exercise 4: Cat-as-a-Service

Task: Load a cat and validate it is actually a cat image
1. Register at The Cat API and get an API key by email
2. Save the API key to GitHub Secrets
3. Make another branch
4. Ask the LLM to help you get the cat image with an API call in Python
5. Ask the LLM to build very simple computer vision code to check if an image has a cat in it
6. Make another Action with the help of the LLM. Install the necessary libraries in the Actions script.
7. Commit and push
8. Make sure it works; debug and fix if not
9. Now add a deployment target: load and print a cat image in the terminal every minute after merge to main. (GitHub's `schedule` trigger runs at most once every 5 minutes, so use the shortest interval that works.)
10. Ask the LLM how to print an image in the terminal using text as graphics.
11. Pass all validations and merge the PR
12. Observe an action actually loading a cat image on schedule
13. Stop all actions once done