# Intro

The DS/ML section covers Python packages and frameworks specialised for building database systems and machine learning applications.

## Hugging Face

Hugging Face is an ecosystem of packages that covers most aspects of working with deep learning models: sharing, loading, training, evaluating and serving them.

The first thing you need to do is log in:

`huggingface-cli login --token <your HF token>`

The following table shows the structure of the ecosystem:

| Package                           | Purpose                                                                                                                              | Example Usage                                                     |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- |
| **🤗 Hub (`huggingface_hub`)**    | Central repository for models, datasets, and Spaces. Lets you push/pull models and datasets.                                         | Upload a fine-tuned model to share with others.                   |
| **Transformers (`transformers`)** | High-level library with pretrained NLP, vision, and multimodal models. Handles training, inference, and tokenization (via wrappers). | `pipeline("sentiment-analysis")`                                  |
| **Tokenizers (`tokenizers`)**     | Fast, low-level text tokenization library (written in Rust). Often used inside `transformers`.                                       | Train a custom WordPiece/BPE tokenizer.                           |
| **Datasets (`datasets`)**         | Efficient dataset loading, processing, and streaming. Optimized for large ML datasets.                                               | Load `imdb` dataset in one line: `datasets.load_dataset("imdb")`. |
| **Evaluate (`evaluate`)**         | Standardized evaluation metrics library. Works well with `datasets` and `transformers`.                                              | Compute accuracy, F1, BLEU, etc.                                  |
| **Diffusers (`diffusers`)**       | Library for diffusion models (e.g., Stable Diffusion) for images, audio, video.                                                      | Generate images from text prompts.                                |
| **Accelerate (`accelerate`)**     | Utility for running training on any hardware setup (CPU, GPU, multi-GPU, TPU) with minimal code changes.                             | Scale PyTorch model training across GPUs.                         |
| **PEFT (`peft`)**                 | Parameter-Efficient Fine-Tuning library (LoRA, adapters, etc.) for large models.                                                     | Fine-tune a 13B LLM on a laptop GPU.                              |
| **Optimum (`optimum`)**           | Optimizations for transformers (ONNX, quantization, hardware-specific acceleration).                                                 | Run models faster on Intel/NVIDIA chips.                          |
| **Gradio (`gradio`)** (partnered) | Simple UI framework to demo models in the browser.                                                                                   | Deploy a model demo in a few lines.                               |


Check the set of tutorials:
- [LLM course](https://huggingface.co/learn/llm-course/chapter0/1) from Hugging Face.
- The [transformers](transformers.ipynb) package, provided by Hugging Face, allows you to use a variety of popular transformer-based models.

## Spark

Spark is a framework for processing large amounts of data. This section covers its Python SDK.

Some configuration is required to start experimenting with Spark in local mode:

- `pip3 install pyspark`: installs Spark's Python package.
- Install Java: the `openjdk-17-jdk` package from `apt`. Set the `$JAVA_HOME` variable to the JDK path; on Ubuntu: `export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64`.
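
Before starting Spark, it can help to confirm that the Java settings from the list above are visible to Python. The sketch below is only a convenience check and is not part of `pyspark`; the function name `check_spark_prerequisites` and the reported keys are hypothetical. It assumes that Spark's launcher uses `$JAVA_HOME` when it is set and otherwise falls back to a `java` executable on `PATH`.

```python
import os
import shutil


def check_spark_prerequisites() -> dict:
    """Report whether the Java settings that local Spark relies on are visible."""
    java_home = os.environ.get("JAVA_HOME")
    return {
        # JAVA_HOME should point at the JDK directory,
        # e.g. /usr/lib/jvm/java-17-openjdk-amd64 on Ubuntu.
        "java_home_set": bool(java_home),
        # A variable that points at a missing directory is stale.
        "java_home_exists": bool(java_home) and os.path.isdir(java_home),
        # Assumed fallback: a `java` executable found on PATH.
        "java_on_path": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_spark_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If `java_home_set` is reported as missing in a notebook, the variable was probably exported in a different shell than the one that launched the kernel.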

---

If the configuration above is correct, you will be able to run the script below, which creates a local `SparkContext`: a way to experiment with Spark without a cluster.

In [1]:
from pyspark import SparkContext, SparkConf

sc = SparkContext(conf=SparkConf().setMaster("local"))

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/22 15:13:04 WARN Utils: Your hostname, user-ThinkPad-E16-Gen-2, resolves to a loopback address: 127.0.1.1; using 10.202.37.58 instead (on interface wlp0s20f3)
25/08/22 15:13:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/08/22 15:13:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Datasets

Spark operates on Resilient Distributed Datasets (RDDs), but I'll use the term "dataset", as this section mostly focuses on the syntax of `pyspark` rather than on the features that distribute computations. See the [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html) for more details.

---

The following cell demonstrates how to create an RDD from a simple Python list.

In [16]:
rdd = sc.parallelize([1, 2, 3, 4, 5])  # an RDD, not a DataFrame
type(rdd)

pyspark.core.rdd.RDD
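
To give a rough intuition for what happens to the list after `parallelize`, here is a plain-Python model of partitioning. This is a conceptual sketch and not Spark's actual implementation: the helper `split_into_partitions` is hypothetical, and the contiguous near-equal split is an assumption made for illustration. It shows a per-partition "map" step followed by a "reduce" step that combines partial results.

```python
from functools import reduce
from operator import add


def split_into_partitions(data: list, num_partitions: int) -> list:
    """Split a list into contiguous, near-equal chunks (a stand-in for RDD partitions)."""
    n = len(data)
    return [
        data[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


partitions = split_into_partitions([1, 2, 3, 4, 5], num_partitions=2)
print(partitions)  # [[1, 2], [3, 4, 5]]

# A "map" step runs independently on every partition ...
squared = [[x * x for x in part] for part in partitions]

# ... and a "reduce" step first combines values inside each partition,
# then combines the per-partition partial results.
partial_sums = [reduce(add, part, 0) for part in squared]
total = reduce(add, partial_sums, 0)
print(total)  # 55
```

With a running `SparkContext`, the corresponding `pyspark` calls would be `sc.parallelize([1, 2, 3, 4, 5], 2).map(lambda x: x * x).reduce(add)`.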

## Sentence Transformers

The `sentence_transformers` package implements models that build embeddings from sets of texts. Check the [SBERT](https://sbert.net/) page for more details.

---

Consider a basic example of using the `sentence_transformers` package.

The following cell loads the model and displays its type. It's a special object built to provide the interfaces needed for computing embeddings.

In [12]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
type(model)

sentence_transformers.SentenceTransformer.SentenceTransformer

The resulting object has an `encode` method that takes a list of texts and returns a `numpy.ndarray` of embeddings, one row per text.

In [15]:
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)
embeddings

(3, 384)


array([[ 0.01919573,  0.12008536,  0.15959828, ..., -0.0053629 ,
        -0.08109505,  0.05021338],
       [-0.01869039,  0.04151868,  0.07431544, ...,  0.00486597,
        -0.06190442,  0.03187514],
       [ 0.136502  ,  0.08227322, -0.02526165, ...,  0.08762047,
         0.03045845, -0.01075752]], shape=(3, 384), dtype=float32)

The following cell uses the `similarity` method to compute the matrix of pairwise similarities between the embeddings; the result is returned as a tensor.

In [16]:
similarities = model.similarity(embeddings, embeddings)
similarities

tensor([[1.0000, 0.6660, 0.1046],
        [0.6660, 1.0000, 0.1411],
        [0.1046, 0.1411, 1.0000]])
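
The values in such a matrix can be reproduced by hand. The sketch below assumes that the similarity in use is cosine similarity (the library's default, as far as I know) and recomputes it in plain Python on small made-up vectors. The names `cosine_similarity`, `similarity_matrix` and `toy_embeddings` are hypothetical and are not part of `sentence_transformers`.

```python
import math


def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors: list) -> list:
    """Pairwise cosine similarities; entry [i][j] compares vectors i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy 3-dimensional "embeddings" (made up for illustration, not model output).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(value, 4) for value in row])
```

As in the output above, the diagonal is 1 because every vector is identical to itself, and the matrix is symmetric because cosine similarity does not depend on argument order.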