# Intro

The DS/ML section covers Python packages and frameworks specialised for data science and machine learning.

## Hugging Face

Hugging Face is an ecosystem of packages that covers all aspects of working with deep learning models.

The first thing you need to do is log in:

`huggingface-cli login --token <your HF token>`

The following table shows the structure of the ecosystem:

| Package                           | Purpose                                                                                                                              |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| **🤗 Hub (`huggingface_hub`)**    | Central repository for models, datasets, and Spaces. Lets you push/pull models and datasets.                                         |
| **Transformers (`transformers`)** | High-level library with pretrained NLP, vision, and multimodal models. Handles training, inference, and tokenization (via wrappers). |
| **Tokenizers (`tokenizers`)**     | Fast, low-level text tokenization library (written in Rust). Often used inside `transformers`.                                       |
| **Datasets (`datasets`)**         | Efficient dataset loading, processing, and streaming. Optimized for large ML datasets.                                               |
| **Evaluate (`evaluate`)**         | Standardized evaluation metrics library. Works well with `datasets` and `transformers`.                                              |
| **Diffusers (`diffusers`)**       | Library for diffusion models (e.g., Stable Diffusion) for images, audio, video.                                                      |
| **Accelerate (`accelerate`)**     | Utility for running training on any hardware setup (CPU, GPU, multi-GPU, TPU) with minimal code changes.                             |
| **PEFT (`peft`)**                 | Parameter-Efficient Fine-Tuning library (LoRA, adapters, etc.) for large models.                                                     |
| **Optimum (`optimum`)**           | Optimizations for transformers (ONNX, quantization, hardware-specific acceleration).                                                 |
| **Smolagents (`smolagents`)**     | Building agentic systems.                                                                                                            |
| **Gradio (`gradio`)** (partnered) | Simple UI framework to demo models in the browser.                                                                                   |

Find out more: 

- [LLM course](https://huggingface.co/learn/llm-course/chapter0/1) from Hugging Face.
- The [Hugging Face](hugging_face.ipynb) page.

## Spark

Spark is a framework for processing large amounts of data. This section covers its Python SDK.

Some configuration is required to start experimenting with Spark in local mode:

- Install PySpark: `pip3 install pyspark`.
- Install Java: the `openjdk-17-jdk` package in `apt`. Then point the `$JAVA_HOME` variable at the JDK. On Ubuntu: `export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64`.
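
A missing or wrong `JAVA_HOME` is a common reason for PySpark failing to start, so it can help to check it before creating a context. The following is a minimal sketch using only the standard library; `check_java_home` is a hypothetical helper, and it assumes a Linux/macOS layout where the launcher lives at `$JAVA_HOME/bin/java`.

```python
import os
from pathlib import Path


def check_java_home(env=None):
    """Return a list of problems found with JAVA_HOME (empty if it looks usable)."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return ["JAVA_HOME is not set"]
    # Assumption: the Java launcher is located at $JAVA_HOME/bin/java.
    if not (Path(java_home) / "bin" / "java").is_file():
        return [f"no bin/java found under {java_home}"]
    return []


# Report the problems, or a short confirmation when none were found.
print(check_java_home() or "JAVA_HOME looks usable")
```

The function takes the environment as an optional argument so it can be tried against a plain dictionary without touching the real environment.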

---

If the configuration above is correct, you will be able to run the script below. It creates a local `SparkContext`, which is a way to experiment with Spark without a cluster.

In [1]:
from pyspark import SparkContext, SparkConf

sc = SparkContext(conf=SparkConf().setMaster("local"))

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/09/19 09:04:44 WARN Utils: Your hostname, user-ThinkPad-E16-Gen-2, resolves to a loopback address: 127.0.1.1; using 10.202.22.210 instead (on interface enp0s31f6)
25/09/19 09:04:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/19 09:04:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Context and session

PySpark provides two ways to interact with Spark:

- Spark Context: a low-level API for working with the computational resources provided by Spark.
- Spark Session: built on top of the `SparkContext`; it is the entry point through which users interact with Spark SQL.

---

The following cell provides an example of how to create a Spark session.

In [7]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Temp").getOrCreate()
type(spark)

pyspark.sql.session.SparkSession

**Note:** The Spark Session still has a SparkContext under the hood. The following cell shows that the session's `sparkContext` attribute is the same as the context created earlier.

In [4]:
spark.sparkContext is sc

True

### Datasets

Spark actually operates on Resilient Distributed Datasets (RDDs), but I'll use the term "dataset", as this section mostly focuses on the syntax of `pyspark` rather than on the features that distribute the computation. See the [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html) page for more details.

---

The following cell demonstrates how to create an RDD from a simple python list.

In [16]:
df = sc.parallelize([1, 2, 3, 4, 5])
type(df)

pyspark.core.rdd.RDD
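
Once an RDD exists, the usual pattern is to chain lazy transformations and finish with an action. The sketch below wraps that pattern in a hypothetical helper, `even_squares`, which takes the RDD as an argument; `filter`, `map` and `collect` are standard RDD methods.

```python
def even_squares(rdd):
    """Keep the even elements of an RDD, square them, and return a Python list.

    `filter` and `map` are lazy transformations: they only describe the work.
    `collect` is the action that triggers the computation and brings the
    results back to the driver process.
    """
    return rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x).collect()


# Usage with the context created earlier:
# even_squares(sc.parallelize([1, 2, 3, 4, 5]))
```

Note that `collect` pulls every element into local memory, so it is only appropriate for small results like this one.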

### Read CSV

In [None]:
from pyspark.sql import SparkSession

# Explicitly creating the SparkSession
spark = SparkSession.builder \
    .appName("My Spark Application") \
    .getOrCreate()
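
The cell above only creates the session. A minimal sketch of the read itself might look as follows. `load_csv` and the path in the usage comment are hypothetical names, and the sketch assumes the file has a header row; `header` and `inferSchema` are options of the `spark.read.csv` reader.

```python
def load_csv(spark, path):
    """Read a CSV file into a DataFrame using an existing SparkSession.

    Assumes the first line of the file holds column names. `inferSchema=True`
    asks Spark to guess column types, at the cost of an extra pass over the data.
    """
    return spark.read.csv(path, header=True, inferSchema=True)


# Usage (hypothetical path), given the `spark` session created above:
# df = load_csv(spark, "path/to/data.csv")
# df.printSchema()
# df.show(5)
```

If the path does not exist, Spark raises an `AnalysisException` with the `PATH_NOT_FOUND` error class.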


## Sentence transformer

The `sentence_transformers` package implements models for building embeddings from sets of texts. See the [SBERT](https://sbert.net/) page for more details.

---

Consider a basic example of using the `sentence_transformers` package.

The following cell loads the model and displays its type. It is a dedicated object whose interface is built around computing embeddings.

In [12]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
type(model)

sentence_transformers.SentenceTransformer.SentenceTransformer

The resulting object has an `encode` method that takes a collection of texts and returns a `numpy.ndarray` of embeddings.

In [15]:
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)
embeddings

(3, 384)


array([[ 0.01919573,  0.12008536,  0.15959828, ..., -0.0053629 ,
        -0.08109505,  0.05021338],
       [-0.01869039,  0.04151868,  0.07431544, ...,  0.00486597,
        -0.06190442,  0.03187514],
       [ 0.136502  ,  0.08227322, -0.02526165, ...,  0.08762047,
         0.03045845, -0.01075752]], shape=(3, 384), dtype=float32)

The following cell uses the `similarity` method to create a matrix of the embeddings' similarities.

In [16]:
similarities = model.similarity(embeddings, embeddings)
similarities

tensor([[1.0000, 0.6660, 0.1046],
        [0.6660, 1.0000, 0.1411],
        [0.1046, 0.1411, 1.0000]])
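
To my knowledge, `similarity` uses cosine similarity unless the model is configured otherwise, which is why the diagonal is 1 and the matrix is symmetric. The following plain-Python sketch shows what each cell of such a matrix holds. `cosine_similarity` and `similarity_matrix` are hypothetical helpers, not part of the package, and the toy 2-D vectors stand in for real 384-dimensional embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities: cell (i, j) compares vectors i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy vectors: the first and last are orthogonal, the middle one lies between them.
toy = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
for row in similarity_matrix(toy):
    print([round(x, 4) for x in row])
```

Vectors pointing in similar directions score close to 1 and orthogonal vectors score 0, which matches the pattern above: the two weather sentences are close to each other and far from the stadium sentence.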

## LangChain

LangChain is a widely used library for developing LLM-powered, agent-based solutions. The following table lists and describes the central components of the `langchain` package.

| Component | Analogy | Description |
| :--- | :--- | :--- |
| **Models** | The brains | These are the core language models (LLMs) that handle the actual work, like generating text, holding conversations, or creating embeddings. |
| **Prompts** | The instructions | These are the templates used to provide specific instructions and context to the models. They ensure the model responds in a consistent and desired format. |
| **Chains** | The workflow | A way to link multiple components together into a single, automated sequence. This allows you to perform multi-step tasks, like combining a prompt with a model call. |
| **Agents** | The reasoning engine | A more advanced chain that uses an LLM to decide which external **Tools** to use to achieve a goal. It can think, act, and observe, repeating the process until the task is complete. |
| **Tools** | The external capabilities | These are functionalities an agent can use to interact with the world. Examples include a search engine, a calculator, or a database lookup. |


See the [LangChain](langchain.ipynb) page for more.
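
To make the relationship between prompts, models and chains from the table above more concrete, here is a toy sketch in plain Python. It deliberately does not use the LangChain API: `make_prompt`, `fake_model` and `chain` are hypothetical stand-ins, and the "model" just echoes its input.

```python
def make_prompt(template):
    """Prompt: turn input variables into the text that is sent to the model."""
    return lambda variables: template.format(**variables)


def fake_model(prompt_text):
    """Model stand-in: a real chain would call an LLM here."""
    return f"[model reply to: {prompt_text}]"


def chain(*steps):
    """Chain: feed the output of each step into the next one."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


# Prompt -> model -> post-processing step, composed into one callable.
summarise = chain(
    make_prompt("Summarise in one line: {text}"),
    fake_model,
    str.upper,  # stand-in for an output parser
)
print(summarise({"text": "Spark processes large datasets."}))
```

An agent differs from this fixed pipeline in that the model itself decides, step by step, which tool to call next.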

## MCP SDK

There is an MCP SDK for Python. It is provided by the `mcp` package; installing it as `mcp[cli]` also brings in the `mcp` command-line tool.

Define a server object using the `mcp.server.fastmcp.FastMCP` class. Use the `tool`, `resource`, and `prompt` decorators to wrap the functions that implement the corresponding facilities. Sampling has no decorator of its own; a tool requests it through the context object it receives.

---

The following cell writes a minimal server with a single tool to a file.

In [5]:
%%writefile intro_files/mcp_server.py
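from mcp.server.fastmcp import FastMCP

# The name is reported to clients that connect to the server.
mcp = FastMCP("Some service")

@mcp.tool()
def some_tool(inp: str) -> str:
    """Return a placeholder answer for the given input."""
    return f"Output of some tool for {inp}."

# Start the server only when the file is executed directly, not on import.
if __name__ == "__main__":
    mcp.run()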

Overwriting intro_files/mcp_server.py


Run your server using the command `mcp dev intro_files/mcp_server.py`, which starts it together with the MCP Inspector. The following cell launches the command from Python with `os.system` to show the expected output.

In [6]:
import os
os.system("mcp dev intro_files/mcp_server.py &")

0

Starting MCP inspector...
⚙️ Proxy server listening on localhost:6277
🔑 Session token: 4c21942ece36a04554ee01562067c5b129c3b03eaa945ead9d0b8964d9334fe8
   Use this token to authenticate requests or set DANGEROUSLY_OMIT_AUTH=true to disable auth

🚀 MCP Inspector is up and running at:
   http://localhost:6274/?MCP_PROXY_AUTH_TOKEN=4c21942ece36a04554ee01562067c5b129c3b03eaa945ead9d0b8964d9334fe8

🌐 Opening browser...


**Note:** The inspector tool requires `npm` to be installed on your system.