Terminology
======

 - **REST API/RESTful**: A web service that follows the "REST" (representational state transfer) architectural style. This style uses a subset of HTTP. It can use a variety of data formats, e.g. JSON.
   - Scalable and stateless
   - High performance, largely due to **caching**
   - Consists of resources
   - REQUEST and RESPONSE
     - A REQUEST maps onto CRUD (Create, Read, Update, Delete) operations via HTTP methods:
     - POST (Create)
     - GET (Read)
     - PUT (Update)
     - DELETE (Delete)
     - REQUEST: Header, Operation (the HTTP method), Endpoint, Parameters/body
     - RESPONSE: usually in JSON format
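
The CRUD-to-HTTP mapping above can be sketched with Python's standard library. This is a minimal illustration, not a real client: the endpoint `https://api.example.com/todos` and the helper `build_request` are made up for the example, and the requests are only constructed, never sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.com/todos"

# CRUD operation -> HTTP method, as listed above.
METHODS = {"create": "POST", "read": "GET", "update": "PUT", "delete": "DELETE"}


def build_request(operation, item_id=None, payload=None):
    """Assemble the parts of a REST request: method, endpoint, headers, body."""
    url = BASE_URL if item_id is None else f"{BASE_URL}/{item_id}"
    body = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Accept": "application/json"}
    if body is not None:
        headers["Content-Type"] = "application/json"
    return Request(url, data=body, headers=headers, method=METHODS[operation])


create = build_request("create", payload={"title": "write notes"})
update = build_request("update", item_id=7, payload={"title": "edit notes"})
delete = build_request("delete", item_id=7)

for req in (create, update, delete):
    print(req.get_method(), req.full_url, req.data)
```

Passing one of these objects to `urllib.request.urlopen` would actually send it; the response body would then usually be decoded with `json.loads`.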


 - **SOAP**: A (different) web messaging protocol specification that uses XML as its message format (and ONLY XML).
 - **WSGI**: WSGI is the Web Server Gateway Interface. It is a Python specification that describes how a web server communicates with web applications, and how web applications can be chained together to process one request.
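
A minimal sketch of the WSGI contract, using only the standard library. The names `hello_app`, `uppercase_middleware` and `call_app` are invented for this example; the fixed part is the interface itself: an application is a callable that takes `(environ, start_response)` and returns an iterable of bytes. The middleware shows the "chaining" idea by wrapping one application in another.

```python
from wsgiref.util import setup_testing_defaults


def hello_app(environ, start_response):
    """A WSGI application: a callable taking (environ, start_response)."""
    body = f"Hello from {environ['PATH_INFO']}".encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8"),
                              ("Content-Length", str(len(body)))])
    return [body]


def uppercase_middleware(app):
    """Wrap another WSGI app and upper-case its response body."""
    def wrapped(environ, start_response):
        return [chunk.upper() for chunk in app(environ, start_response)]
    return wrapped


def call_app(app, path):
    """Invoke a WSGI app directly, roughly as a server would."""
    environ = {}
    setup_testing_defaults(environ)   # fill in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_app(uppercase_middleware(hello_app), "/notes")
print(status, body)
```

To serve the same application over HTTP, `wsgiref.simple_server.make_server("", 8000, hello_app).serve_forever()` would work for local experiments.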


 - **CapEx** (Capital Expenditure): Up-front purchase of assets (such as physical IT infrastructure/machinery). These assets are amortized (depreciate in value) over time.
 - **OpEx** (Operational Expenditure): Day-to-day expenditures, requiring no up-front capital cost.
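 - "Shared responsibility model": The cloud provider is responsible for the security of the underlying infrastructure, while the customer remains responsible for what they deploy and configure on top of it (e.g. data and access control). The exact split depends on the service type (IaaS, PaaS, SaaS).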
 - **Serverless Computing**: Cloud-based computing in which the physical infrastructure is flexibly and automatically allocated in the back-end, i.e. there is no requirement on the customer to allocate physical resources (similar to PaaS). Serverless architectures are **event driven** (triggered by some event).
 - **SLA**: Service-level agreement (e.g. between a provider and a client), e.g. how much database uptime is expected of the provider.
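
To make an uptime figure in an SLA concrete, it helps to convert the percentage into permitted downtime. The targets below (99%, 99.9%, 99.99%) and the 30-day period are illustrative assumptions, not taken from any particular provider's agreement.

```python
def allowed_downtime_minutes(uptime_percent, period_days=30):
    """Downtime (in minutes) permitted by an uptime target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime -> {minutes:.1f} min of downtime per 30 days")
```

Each extra "nine" cuts the permitted downtime by a factor of ten.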

 - **SaaS**: Software as a service. Like you **use someone else's app online** (like Google Docs).
 - **PaaS**: Platform as a service. **You can deploy your app, and also have runtime environments for developing/testing**. Examples: Heroku, PythonAnywhere, AWS Elastic Beanstalk.
 - **IaaS**: Infrastructure as a service. Like a **virtual data center, complete with servers and storage**. Also includes availability of **computing power for data analysis / data mining**. Examples: **AWS, Google Compute Engine, Microsoft Azure**. This is what an IT administrator would work with, or I suppose an analyst doing data mining.


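 - **Hybrid Cloud**: A combination of public and private clouds, with workloads and data able to move between them.
 - **Private Cloud**: Cloud infrastructure used exclusively by a single organization (on-premises or hosted by a third party).
 - **Public Cloud**: Cloud services owned and operated by a third-party provider and shared among many customers over the internet.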

Data Warehouses, Lakes, Lakehouses
========

 - **Database**: Intended to store the real-time information of e.g. a business. Used in everyday, real-time transactions.
 - **Data Warehouse**: Intended to store *historical* data. Not updated in real time; used for historical analytics.
   - **OLTP**: OnLine Transactional Processing, the workload typical of a database (e.g. Create, Read, Update, Delete operations initiated by a customer transaction)
   - **OLAP**: OnLine Analytical Processing, the workload typical of a data warehouse: operations on aggregated data. Warehouses are configured differently from transactional databases for speed reasons.
   - The data in a data warehouse can be **denormalized** in order to improve the speed of analytical queries.
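
The difference between the two workloads, and what denormalization buys, can be sketched with an in-memory SQLite database. SQLite is only a stand-in here (it is not a data warehouse), and the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized, OLTP-style tables: each fact is stored once.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")

# OLTP: small transactions that each touch a few rows.
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "EU"), (2, "US")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (2, 1, 15.0), (3, 2, 40.0)])
conn.execute("UPDATE orders SET amount = 20.0 WHERE id = 2")
conn.commit()

# Denormalized, warehouse-style table: the region is copied onto every
# order row, so analytical queries need no join.
conn.execute("""
    CREATE TABLE orders_wide AS
    SELECT o.id, o.amount, c.region
    FROM orders AS o JOIN customers AS c ON c.id = o.customer_id
""")

# OLAP: an aggregate over many rows.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders_wide GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The price of the denormalized table is redundancy: if a customer's region changed, every copied value would need updating, which is why this layout suits historical data that is rarely modified.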

Docker
========

Terminology:
 - **Virtual machines**: A complete guest operating system runs on virtualized hardware. A **hypervisor** sits between the VMs and the physical hardware and shares it out among them.
    - Uses a lot more space
    - Takes some time to boot up the OS
    - The guest OS has to be updated with software patches like any other OS
 - **A container**, on the other hand, simply virtualizes the **operating system**: containers share the host's kernel. No hypervisor.
 - **"Bare metal"**: Running directly on the physical hardware, with no hypervisor or other virtualization layer in between
 - **UAT**: User acceptance testing, i.e. testing by the intended users before a release goes to production
 - **Stage**: ???
 - **chroot**: A Linux command, "change root directory", that makes a process (and its children) see the given directory as its root directory.
 - **Image**: A read-only template from which containers are created, built from the instructions in a Dockerfile. It can also be a snapshot of a container. It is made up of layers.
 - **Layers**: e.g. a base layer (Ubuntu), another layer (software), then dependencies, configurations, ...
 - **Container**: A running **instance** of a docker **image**.

**3 Stages**:
 - Build images
 - Ship images
 - Run images

Docker for CVMFS (on a Mac)
-------

```bash
brew install --cask osxfuse
# -O saves the download under its remote file name instead of printing it to stdout
curl -O https://ecsft.cern.ch/dist/cvmfs/cvmfs-2.8.0/cvmfs-service-2.8.0-1.x86_64.docker.tar.gz
cat cvmfs-service-2.8.0-1.x86_64.docker.tar.gz | docker load
```

To run the container, do:
```bash
docker run -d --rm -e CVMFS_CLIENT_PROFILE=single   -e CVMFS_REPOSITORIES=sft.cern.ch   --cap-add SYS_ADMIN   --device /dev/fuse --volume /cvmfs:/cvmfs:shared cvmfs/service:2.8.0-1
```

Volumes
--------

To persist data (e.g. on the local machine) in a named volume that Docker manages, and that can be reattached to a new container, create the volume and then mount it with `-v <volume>:<path>`:

```bash
docker volume create todo-db
docker run -dp 3000:3000 -v todo-db:/etc/todos getting-started
```

An Example Dockerfile
---------
```docker
# syntax=docker/dockerfile:1

FROM python:3.8-slim-buster

WORKDIR /app

COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt

COPY . .

CMD [ "python3", "-m" , "flask", "run", "--host=0.0.0.0"]
```

Starting a Dev-mode container
---------

You can start a container from some base image and bind-mount the local source directory into it, so that edits made on the host are immediately visible inside the container. If the container's command runs the app under a file watcher (e.g. `nodemon` for Node.js apps), the watcher restarts the app process whenever a file changes; the image itself is not rebuilt. This lets you develop without having to set up a local VM or anything! Pretty neat.
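
One way such change detection can work is by comparing file modification times between two scans of the mounted directory. The sketch below is an assumption about the general technique, not how any particular watcher tool is implemented; `snapshot` and `changed_paths` are names made up for the example, and a temporary directory stands in for the mounted source tree.

```python
import os
import tempfile


def snapshot(directory):
    """Record the last-modified time of every file under a directory."""
    mtimes = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            mtimes[path] = os.stat(path).st_mtime_ns
    return mtimes


def changed_paths(before, after):
    """Paths that were added, removed, or modified between two snapshots."""
    return sorted(p for p in before.keys() | after.keys()
                  if before.get(p) != after.get(p))


with tempfile.TemporaryDirectory() as workdir:
    app_file = os.path.join(workdir, "app.py")
    with open(app_file, "w") as f:
        f.write("print('v1')\n")
    before = snapshot(workdir)

    # Simulate an edit by moving the modification time forward one second.
    later = before[app_file] + 10**9
    os.utime(app_file, ns=(later, later))
    after = snapshot(workdir)

    changes = [os.path.basename(p) for p in changed_paths(before, after)]
    print(changes)  # a real watcher would restart the app process here
```

A real watcher would repeat the scan in a loop (or subscribe to operating-system file events) and restart the application process each time the list of changes is non-empty.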

Docker Compose
--------

Docker Compose uses YAML files that allow multi-container applications to be spun up quickly. An example file is below:

```yaml
version: "3.7"
services:
  # "app" is the name of the service, below:
  app:
    image: node:12-alpine
    command: sh -c "yarn install && yarn run dev"
    ports:
      - 3000:3000
    working_dir: /app
    volumes:
      - ./:/app
    environment:
      MYSQL_HOST: mysql
      MYSQL_USER: root

  # Another service, called "mysql"
  mysql:
    image: mysql:5.7
    volumes:
      - todo-mysql-data:/var/lib/mysql
    
# If you use a named volume, you must declare it under the top-level "volumes" key (a sibling of "services"), see below
volumes:
  # Below: no info == default volume properties
  todo-mysql-data:
```

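Then run it with the following (`-d` means "in the background"; older installations use the standalone `docker-compose` binary instead of `docker compose`):
```bash
docker compose up -d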
```

To stop it, run:
```bash
docker compose down
```

Other Things of Note:
--------
 - Use `docker scan <image>` to check for vulnerabilities in your images (newer Docker releases replace this command with `docker scout`)
 - Use `docker image history --no-trunc <image>` to see how each layer is built. Also useful for seeing the size of a layer.

Network
--------

A user-defined network lets the containers attached to it reach each other by container name:

```bash
docker network create mysqlnet
```

Microsoft Azure
========

 - Benefits of Cloud Computing: **Scalability, High Availability, Elasticity, Agility, Disaster Recovery**
 - Microsoft Azure uses **virtual machines** in the backend, using a **hypervisor** to perform the virtualization.
 - **Hypervisor**: A hypervisor is a kind of emulator; it is computer software, firmware or hardware that creates and runs virtual machines.
 - **Orchestrator**: Part of the Azure backend that deals with responding to user requests, and it has a web API
 - **Fabric controller**: Part of the backend that creates the new virtual machine on the server.


 - **Azure Marketplace** allows you to integrate someone else's data solutions into your workflow... interesting.


 - **Compute Services**: Virtual machines, containers, serverless computing (microservices)
 - **Cloud Storage**: Disks attached to virtual machines, fileshares, databases
 - **Networking Features**: Allows creation of private network connections, ...???
 - **Integration**: Allows for workflows to be created...


 - Internet of Things (IoT) Hub, IoT Central, and Azure Sphere 
 - Azure Synapse Analytics, HDInsight, and Azure Databricks 
 - Azure Machine Learning, Cognitive Services and Azure Bot Service 
 - Serverless computing solutions: Azure Functions and Logic Apps
 - Azure DevOps, GitHub, GitHub Actions, and Azure DevTest Labs 


| Service (Computing)  | Description  |
| :---      | :---  |
| Azure Virtual Machines  | Windows or Linux virtual machines (VMs) hosted in Azure  |
| Azure Virtual Machine Scale Sets  | Scaling for Windows or Linux VMs hosted in Azure  |
| Azure Kubernetes Service  | Cluster management for VMs that run containerized services  |
| Azure Service Fabric  | Distributed systems platform that runs in Azure or on-premises  |
| Azure Batch  | Managed service for parallel and high-performance computing applications  |
| Azure Container Instances  | Containerized apps run on Azure without provisioning servers or VMs  |
| Azure Functions  | An event-driven, serverless compute service  |


| Service (Networking)  | Description  |
| :---      | :---  |
| Azure Virtual Network     | Connects VMs to incoming virtual private network (VPN) connections |
| Azure Load Balancer       | Balances inbound and outbound connections to applications or service endpoints |
| Azure Application Gateway | Optimizes app server farm delivery while increasing application security |
| Azure VPN Gateway         | Accesses Azure Virtual Networks through high-performance VPN gateways |
| Azure DNS                 | Provides ultra-fast DNS responses and ultra-high domain availability |
| Azure Content Delivery Network  | Delivers high-bandwidth content to customers globally |
| Azure DDoS Protection     | Protects Azure-hosted applications from distributed denial of service (DDoS) attacks |
| Azure Traffic Manager     | Distributes network traffic across Azure regions worldwide |
| Azure ExpressRoute        | Connects to Azure over high-bandwidth dedicated secure connections |
| Azure Network Watcher     | Monitors and diagnoses network issues by using scenario-based analysis |
| Azure Firewall            | Implements high-security, high-availability firewall with unlimited scalability |
| Azure Virtual WAN         | Creates a unified wide area network (WAN) that connects local and remote sites |

| Service (Storage)  | Description  |
| :---      | :---  |
| Azure Blob storage   | Storage service for very large objects, such as video files or bitmaps  |
| Azure File storage   | File shares that can be accessed and managed like a file server  |
| Azure Queue storage  | A data store for queuing and reliably delivering messages between applications  |
| Azure Table storage  | A NoSQL store that hosts unstructured data independent of any schema  |

| Service (Databases)  | Description  |
| :---      | :---  |
| Azure Cosmos DB  | Globally distributed database that supports NoSQL options  |
| Azure SQL Database     | Fully managed relational database with auto-scale, integral intelligence, and robust security  |
| Azure Database for MySQL    | Fully managed and scalable MySQL relational database with high availability and security  |
| Azure Database for PostgreSQL  | Fully managed and scalable PostgreSQL relational database with high availability and security  |
| SQL Server on Azure Virtual Machines  | Service that hosts enterprise SQL Server apps in the cloud  |
| Azure Synapse Analytics  | Fully managed data warehouse with integral security at every level of scale at no extra cost  |
| Azure Database Migration Service  | Service that migrates databases to the cloud with no application code changes  |
| Azure Cache for Redis  | Fully managed service caches frequently used and static data to reduce data and application latency  |
| Azure Database for MariaDB  | Fully managed and scalable MariaDB relational database with high availability and security  |

| Service (IoT)  | Description  |
| :---      | :---  |
| IoT Central  | Fully managed global IoT software as a service (SaaS) solution that makes it easy to connect, monitor, and manage IoT assets at scale  |
| Azure IoT Hub  | Messaging hub that provides secure communications between and monitoring of millions of IoT devices  |
| IoT Edge  | Fully managed service that allows data analysis models to be pushed directly onto IoT devices, which allows them to react quickly to state changes without needing to consult cloud-based AI models  |

| Service (Big Data Analytics)  | Description  |
| :---  | :---  |
| Azure Synapse Analytics  | Run analytics at a massive scale by using a cloud-based enterprise data warehouse that takes advantage of massively parallel processing to run complex queries quickly across petabytes of data.  |
| Azure HDInsight  | Process massive amounts of data with managed Hadoop clusters in the cloud.  |
| Azure Databricks  | Integrate this collaborative Apache Spark-based analytics service with other big data services in Azure.  |

| Service (ML / AI)  | Description  |
| :---  | :---  |
| Azure Machine Learning Service  | Cloud-based environment you can use to develop, train, test, deploy, manage, and track machine learning models. It can auto-generate a model and auto-tune it for you. It will let you start training on your local machine, and then scale out to the cloud.  |
| Azure Machine Learning Studio  | Collaborative visual workspace where you can build, test, and deploy machine learning solutions by using prebuilt machine learning algorithms and data-handling modules.  |

| Service (Cognitive AI)  | Description  |
| :---  | :---  |
| Vision  | Use image-processing algorithms to smartly identify, caption, index, and moderate your pictures and videos.  |
| Speech  | Convert spoken audio into text, use voice for verification, or add speaker recognition to your app.  |
| Knowledge mapping  | Map complex information and data to solve tasks such as intelligent recommendations and semantic search.  |
| Bing Search  | Add Bing Search APIs to your apps and harness the ability to comb billions of webpages, images, videos, and news with a single API call.  |
| Natural Language processing  | Allow your apps to process natural language with pre-built scripts, evaluate sentiment, and learn how to recognize what users want.  |

| Service (DevOps)  | Description  |
| :---  | :---  |
| Azure DevOps  | Use development collaboration tools such as high-performance pipelines, free private Git repositories, configurable Kanban boards, and extensive automated and cloud-based load testing. Formerly known as Visual Studio Team Services.  |
| Azure DevTest Labs  | Quickly create on-demand Windows and Linux environments to test or demo applications directly from deployment pipelines.  |

Apache Kafka
=======

 - A software / distributed system for broadcasting "events" that are listened to by other services
   - Can be used to decouple applications from one another (no more complicated dependencies, just broadcasters/listeners)
 - Can be used for **decoupling**, **messaging**, **location tracking**, **data gathering**
 - Built on four core APIs:
    - **Producer API**: Publishes events to Topics (which are persisted to physical storage for some amount of time)
    - **Consumer API**: Ingests that data from Topics, either in real time or by replaying past events
    - **Streams API**: Transforms/analyzes/aggregates data and re-broadcasts them as a new Topic/Stream
    - **Connector API**: Integration layer for configuring data sources (e.g. "reusable producers and consumers")
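
The roles of producers, consumers and stream processing can be illustrated with a toy, in-memory model. This is not the Kafka client API: `ToyBroker` and its methods are invented for the example, and a real broker is distributed and persists topics to disk. What the sketch does show is that reading an event does not remove it, so different consumers can read the same topic from different offsets.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: each topic is an append-only list."""

    def __init__(self):
        self.topics = defaultdict(list)

    def produce(self, topic, event):
        """Append an event to a topic and return its offset."""
        self.topics[topic].append(event)
        return len(self.topics[topic]) - 1

    def consume(self, topic, offset=0):
        """Read all events from a given offset; reading removes nothing."""
        return self.topics[topic][offset:]


broker = ToyBroker()

# Producer side: publish without knowing who will read.
for reading in (20.0, 25.0, 30.0):
    broker.produce("temperature_c", reading)

# Consumer side: one reader replays from the start, another reads only the latest.
replayed = broker.consume("temperature_c", offset=0)
latest = broker.consume("temperature_c", offset=2)

# Streams-style step: derive a new topic from an existing one.
for reading in replayed:
    broker.produce("temperature_f", reading * 9 / 5 + 32)

print(replayed, latest, broker.consume("temperature_f"))
```

Because the producer and the consumers only share a topic name, either side can be replaced or scaled without the other noticing, which is the decoupling described above.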


Apache Spark
=======

Setting up a version of Apache Spark can be a bit tricky; I found this useful guide to help avoid some of the biggest issues: https://towardsdatascience.com/how-to-get-started-with-pyspark-1adc142456ec

 - **DataFrame API**: Similar to e.g. Pandas DataFrames.
 - **RDD API**: The "Resilient Distributed Dataset" is an **interface** to a sequence of data objects partitioned across a collection of machines. It is the lower-level API on which DataFrames are built, and was the original data structure of Apache Spark.
 - **Dataset API**: *Only available in Java and Scala*; it combines the query optimization of DataFrames with the compile-time type safety of RDDs.

A key feature of Spark is the distinction between data *transformations* and *actions*. *Transformations* are executed "lazily": because they do not immediately require an output, Spark only records them and waits until there is an *action* (which does require an output) before running the accumulated transformation(s) together with the action. This lets Spark optimize the whole chain before executing it.
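
The idea of lazy transformations can be sketched in plain Python, without Spark. `LazyDataset` is a toy class written for this example and is not part of PySpark; it only mimics the pattern in which `map` and `filter` record work and `collect` (the action) performs it.

```python
class LazyDataset:
    """Toy dataset that records transformations and runs them only on an action."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of pending (kind, function) steps

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._rows, self._plan + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._rows, self._plan + (("filter", fn),))

    # Action: executes the whole recorded plan in one pass over the data.
    def collect(self):
        out = []
        for row in self._rows:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                out.append(row)
        return out


calls = []


def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


ds = LazyDataset(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still zero: nothing has run yet
result = ds.collect()             # the action triggers execution
print(calls_before_action, result)
```

Because the full plan is known before anything runs, an engine is free to reorder or fuse steps; here the toy simply applies both steps in a single pass over the rows.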

You can get pyspark by running the following -- **note, however, that pyspark is insufficient to set up a standalone spark cluster**.
```bash
pip3 install pyspark
```

Otherwise, you should go to the Apache Spark website, download one of the latest versions (as a tgz), and extract it in your home directory.

Note that you will have to make sure you have downloaded the Java Development Kit (JDK), e.g. here: https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html.
(Note that we are installing JDK 8, because Spark 2.x apparently does not work with Java versions >8. Spark 3.x, as used in the paths below, also supports Java 11.)

Add the following to your `.bash_profile` script:

```bash
export JAVA_HOME=$(/usr/libexec/java_home)
export PATH=$HOME/spark-3.1.2-bin-hadoop3.2/bin:$PATH
export PYSPARK_PYTHON=python3
```

Then you can use the command-line interface simply by typing:
```bash
pyspark
```

Some More Concepts:
========

 - **Wide vs narrow Transformations**: A wide transformation is one in which an output partition depends on rows from many input partitions, so data must be moved ("shuffled") between the nodes of a Spark cluster, e.g. `orderBy` or `groupBy`. Narrow transformations work on each partition independently, e.g. `filter` (including row-wise predicates such as `contains`). Wide transformations must be used carefully, because the shuffle incurs a fairly hefty time penalty.
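
The difference can be sketched in plain Python by treating two lists as partitions held on two nodes. Everything here is a toy written for illustration (the function names, the data, and the length-based partitioner are assumptions; real systems hash the key), but it shows why a grouped aggregation forces rows to move while a filter does not.

```python
from collections import defaultdict

# Two partitions, standing in for data held on two different nodes.
partitions = [
    [("apple", 3), ("pear", 1), ("apple", 2)],
    [("pear", 4), ("apple", 1), ("plum", 5)],
]


def narrow_filter(parts, predicate):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in parts]


def wide_group_sum(parts, num_partitions=2):
    """Wide: rows with the same key must be shuffled to the same partition."""
    shuffled = [defaultdict(int) for _ in range(num_partitions)]
    moved = 0
    for source_index, part in enumerate(parts):
        for key, value in part:
            target_index = len(key) % num_partitions  # toy partitioner
            if target_index != source_index:
                moved += 1  # this row would cross the network
            shuffled[target_index][key] += value
    return [dict(p) for p in shuffled], moved


filtered = narrow_filter(partitions, lambda row: row[1] >= 2)
grouped, rows_moved = wide_group_sum(partitions)
print(filtered)
print(grouped, rows_moved)
```

The filter never looks outside its own partition, whereas the grouped sum has to relocate rows so that all values for a key end up together; on a real cluster that relocation is network and disk traffic.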