# Loading content from papers.csv

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("papers.csv")

In [3]:
df

Unnamed: 0,Title,Text
0,A Beginner’s Guide to Word Embedding with Gens...,1. Introduction of Word2vec\n\nWord2vec is one...
1,Hands-on Graph Neural Networks with PyTorch & ...,"In my last article, I introduced the concept o..."
2,How to Use ggplot2 in Python,Introduction\n\nThanks to its strict implement...
3,Databricks: How to Save Data Frames as CSV Fil...,Photo credit to Mika Baumeister from Unsplash\...
4,A Step-by-Step Implementation of Gradient Desc...,A Step-by-Step Implementation of Gradient Desc...
...,...,...
1386,Brain: A Mystery,“The most beautiful experience we can have is ...
1387,Machine Learning: Lincoln Was Ahead of His Time,Photo by Jp Valery on Unsplash\n\nIn the 45th ...
1388,AI and Us — an Opera Experience. In my previou...,EKHO COLLECTIVE: OPERA BEYOND SERIES\n\nIn my ...
1389,Digital Skills as a Service (DSaaS),Have you ever thought about what will be in th...
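
The `Unnamed: 0` column in the output above usually appears when a CSV was written with its index and then read back without declaring it. A minimal sketch, assuming that is the case for `papers.csv`: the inline CSV below is a made-up stand-in for the real file, and `sample_df` is a hypothetical name.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for papers.csv, written with a leading index column.
sample_csv = io.StringIO(
    ",Title,Text\n"
    "0,First title,First text\n"
    "1,Second title,Second text\n"
)

# index_col=0 makes pandas use the first column as the row index
# instead of exposing it as a data column named "Unnamed: 0".
sample_df = pd.read_csv(sample_csv, index_col=0)

print(list(sample_df.columns))
```

With the real file this would be `pd.read_csv("papers.csv", index_col=0)`; whether that is appropriate depends on how the file was produced.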


# Qdrant vector database

### Credentials / Secrets

In [4]:
import os
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
OPENAI_PROXY = os.environ.get('OPENAI_PROXY')
QDRANT_URL = os.environ.get('QDRANT_URL')
QDRANT_API_KEY = os.environ.get('QDRANT_API_KEY')
QDRANT_collection_name = os.environ.get('QDRANT_collection_name')

In [5]:
QDRANT_URL

'http://localhost:6333'
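
`os.environ.get` returns `None` for a missing variable, so a typo in `.env` only surfaces later as an authentication or connection error. A small sketch of failing early instead; `require_env` and the `EXAMPLE_QDRANT_URL` variable are hypothetical names introduced for illustration, not part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Set a throwaway variable in-process; in the notebook the values come from .env.
os.environ["EXAMPLE_QDRANT_URL"] = "http://localhost:6333"
print(require_env("EXAMPLE_QDRANT_URL"))
```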

### Installing libraries and defining functions

In [9]:
!pip install langchain-core
!pip install langchain-openai
!pip install langchain-qdrant
!pip install langchain-text-splitters



In [6]:
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
import tiktoken

In [7]:
def loader(titles: list[str], texts: list[str]) -> list[Document]:
    """Takes 2 sequences of equal length and builds a list of documents.

    Each title becomes the page_content (the text that gets embedded);
    the matching article text is stored under the "source" metadata key.

    Args:
        titles (list[str]): List of titles
        texts (list[str]): List of texts

    Returns:
        list[Document]: List of Document objects
    """
    result: list[Document] = []
    for index, title in enumerate(titles):
        result.append(Document(page_content=title, metadata={"source": texts[index]}))
    return result

In [8]:
def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

def docs_num_tokens(docs: list[Document]) -> int:
    """
    Computes the total number of tokens in a list of documents.

    Args:
        docs (list[Document]): List of Document objects whose tokens should be counted.

    Returns:
        int: Total number of tokens across all documents.
    """
    return sum([num_tokens_from_string(doc.page_content) for doc in docs])
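
Token counts like these are often used to keep each embedding request under a size limit. A sketch of greedy batching by token budget; `batch_by_tokens` is a hypothetical helper, and `count_tokens` is a whitespace-splitting stand-in rather than a real tokenizer. In the notebook, `num_tokens_from_string` could be passed as `counter` instead.

```python
from typing import Callable, Iterable


def count_tokens(text: str) -> int:
    """Stand-in token counter (whitespace split); not a real tokenizer."""
    return len(text.split())


def batch_by_tokens(
    texts: Iterable[str],
    max_tokens: int,
    counter: Callable[[str], int] = count_tokens,
) -> list[list[str]]:
    """Greedily group texts so that each batch stays within max_tokens.

    A single text longer than max_tokens is placed in a batch of its own.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for text in texts:
        n = counter(text)
        # Start a new batch when adding this text would exceed the budget.
        if current and current_tokens + n > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


titles = [
    "How to Use ggplot2 in Python",
    "Brain: A Mystery",
    "Digital Skills as a Service (DSaaS)",
]
print(batch_by_tokens(titles, max_tokens=10))
```

The greedy strategy keeps input order and never splits a text, which is simple but does not minimize the number of batches.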

### LangChain pipeline: indexing the vector database

In [9]:
docs = loader(df['Title'].tolist(), df['Text'].tolist())
prompt_tokens = docs_num_tokens(docs)

In [10]:
docs[0]

Document(metadata={'source': '1. Introduction of Word2vec\n\nWord2vec is one of the most popular technique to learn word embeddings using a two-layer neural network. Its input is a text corpus and its output is a set of vectors. Word embedding via word2vec can make natural language computer-readable, then further implementation of mathematical operations on words can be used to detect their similarities. A well-trained set of word vectors will place similar words close to each other in that space. For instance, the words women, men, and human might cluster in one corner, while yellow, red and blue cluster together in another.\n\nThere are two main training algorithms for word2vec, one is the continuous bag of words(CBOW), another is called skip-gram. The major difference between these two methods is that CBOW is using context to predict a target word while skip-gram is using a word to predict a target context. Generally, the skip-gram method can have a better performance compared with 

In [11]:
print(f"Number of tokens in docs = {prompt_tokens}")

Number of tokens in docs = 14142


In [12]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-large", dimensions = 256, api_key=OPENAI_API_KEY)

In [13]:
qdrant = QdrantVectorStore.from_documents(docs,
                                          embeddings,
                                          collection_name=QDRANT_collection_name,
                                          api_key=QDRANT_API_KEY,
                                          url=QDRANT_URL,
                                          force_recreate=True)



### Cosine similarity: semantic search, 4 examples

In [14]:
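def docs_related(query: str, top_k: int = 4) -> list[Document]:
    """Returns the top_k documents whose titles are most similar to the query."""
    return qdrant.similarity_search(query, k=top_k)

def keys_related(docs: list[Document]) -> str:
    """Joins the page contents (titles) of the documents, one per line."""
    return "\n".join([doc.page_content for doc in docs])

def values_related(docs: list[Document]) -> str:
    """Joins the stored article texts, separated by a "---" divider."""
    return "\n\n---\n\n".join([doc.metadata.get("source") for doc in docs])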
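
`similarity_search` leaves the ranking to Qdrant. As an illustration of the idea only, and assuming cosine similarity is the metric in use (as the section title suggests), the sketch below ranks toy 3-dimensional vectors against a query. `cosine_similarity`, `top_k_names`, and the vectors are all hypothetical stand-ins for real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k_names(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k keys whose vectors are most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors; real embeddings in this notebook have 256 dimensions.
toy_vectors = {
    "cnn": [1.0, 0.1, 0.0],
    "rl": [0.0, 1.0, 0.2],
    "deploy": [0.0, 0.1, 1.0],
}
print(top_k_names([0.9, 0.2, 0.0], toy_vectors, k=2))
```

A vector database avoids this exhaustive scan by using an approximate nearest-neighbor index, but the ranking criterion is the same.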

In [15]:
Questions = [
    "Where can I apply Convolutional Neural Network?",
    "What is Reinforcement Learning?",
    "How to deploy a machine learning model?",
    "How to implement a random forest algorithm?"
]

In [16]:
Docs = list(map(docs_related, Questions))
Keys = list(map(keys_related, Docs))
Values = list(map(values_related, Docs))

In [17]:
print(Keys[0])

Convolutional Neural Networks
An introduction to Convolutional Neural Networks
CNN vs fully-connected network for image processing
Understanding Neural Networks


In [18]:
print(Values[0])

Researchers came up with the concept of CNN or Convolutional Neural Network while working on image processing algorithms. Traditional fully connected networks were kind of a black box — that took in all of the inputs and passed through each value to a dense network that followed into a one hot output. That seemed to work with small set of inputs.

But, when we work on a image of 1024x768 pixels, we have an input of 3x1024x768 = 2359296 numbers (RGB values per pixel). A dense multi layer neural network that consumes an input vector of 2359296 numbers would have at least 2359296 weights per neuron in the first layer itself — 2MB of weights per neuron of the first layer. That would be crazy! For the processor as well as the RAM. Back in 1990’s and early 2000’s, this was almost impossible.

That led researchers wondering if there is a better way of doing this job. The first and foremost task in any image processing (recognition or manipulation) is typically detecting the edges and texture.

In [19]:
print(Keys[1])

Reinforcement Learning Introduction
Reinforcement Learning : Markov-Decision Process (Part 1)
Reinforcement Learning is full of Manipulative Consultants
Reinforcement Learning — Model Based Planning Methods


In [20]:
print(Values[1])

Reinforcement Learning Introduction

An introduction to reinforcement learning problems and solutions Y Tech · Follow 4 min read · Jul 25, 2019 -- Share

This post will be an introductory level on reinforcement learning. Throughout this post, the problem definitions and some most popular solutions will be discussed. After this article, you should be able to understand what is reinforcement learning, and how to find the optimal policy for the problem.

The Problem Description

The agent-environment interaction in reinforcement learning

The Setting

The reinforcement learning (RL) framework is characterized by an agent learning to interact with its environment.

learning to interact with its environment. At each time step, the agent receives the environment’s state (the environment presents a situation to the agent), and the agent must choose an appropriate action in response. One time step later, the agent receives a reward (the environment indicates whether the agent has responded app

In [21]:
print(Keys[2])

Deploy ML models at scale
A Recipe for using Open Source Machine Learning models
Build and Deploy a Deep Learning Image Classification App
Deploy Machine Learning Web API using NGINX and Docker on Kubernetes in Python


In [22]:
print(Values[2])

Deploy ML models at scale

ML Model Deployment (Source)

Let’s assume that you have built a ML model and that you are happy with its performance. Then the next step is to deploy the model into production. In this blog series I will cover how you can deploy your model for large scale consumption with in a scalable Infrastructure using AWS using docker container service.

In this blog I will start with the first step of building an API framework for the ML model and running it in you local machine. For the purpose of this blog, let’s consider the Sentiment classification model built here. In order to deploy this model we will follow the below steps:

Convert the model into .hdf5 file or .pkl file Implement a Flask API Run the API

Convert the model into “.hdf5” file or ‘.pkl’ file

In case the model is a built on sklearn, it would be best to save it as a ‘.pkl’ file. Alternatively if it is a deep learning model then it is recommended to save the model as a ‘HDF’ file. The main difference

In [23]:
print(Keys[3])

Implementing Random Forest in R. A Practical Application of Random…
The Basics: Decision Tree Classifiers
Is Random Forest better than Logistic Regression? (a comparison)
The Complete Guide to Decision Trees


In [24]:
print(Values[3])

Implementing Random Forest in R

Photo by Rural Explorer on Unsplash

What is Random Forest (RF)?

In order to understand RF, we need to first understand about decision trees. Rajesh S. Brid wrote a detailed article about decision trees. We will not go too much in details about the definition of decision trees since that is not the purpose of this article. I just want to quickly summarise a few points. A decision tree is series of Yes/No questions. For each level of the tree, if your answer is Yes, you fall into a category, otherwise, you will fall into another category. You will answer this series of Yes/No questions until you reach the final category. You will be classified into that group.

Taken from here

Trees work well with the data we use to train, but they are not performing well when it comes to new data samples. Fortunately, we have Random Forest, which is a combination of many decision trees with flexibility, hence resulting in an improvement in accuracy.

Here I will not g

# LangChain & LLM

In [27]:
import time
import httpx
from openai import OpenAI

client = OpenAI(api_key=OPENAI_API_KEY,
                http_client=httpx.Client(proxy=OPENAI_PROXY,
                                          transport=httpx.HTTPTransport(local_address="0.0.0.0")),
                )


def LLM_request(query: str, context: str) -> dict:

    system_prompt = f"""DOCUMENT:
    {context}

    QUESTION:
    {query}

    INSTRUCTIONS:
    Answer the user's QUESTION using the DOCUMENT text above.
    Keep your answer grounded in the facts of the DOCUMENT.
    If the DOCUMENT doesn't contain the facts to answer the QUESTION, return "No data".
    """

    start_time = time.time()   # record the time before the request is sent

    LLM = client.chat.completions.create(
            messages=[{"role": "system", "content": system_prompt}],
            model="gpt-4o",
            temperature=0.2,
    )

    elapsed_time = time.time() - start_time    # calculate the time it took to receive the response

    return {
            "content": LLM.choices[0].message.content,
            "prompt_tokens": LLM.usage.prompt_tokens,
            "completion_tokens": LLM.usage.completion_tokens,
            "total_tokens": LLM.usage.total_tokens,
            "elapsed_time": elapsed_time
    }
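
Because the prompt is an indented triple-quoted f-string inside a function, every prompt line is sent with four leading spaces. A sketch of one way to avoid that, assuming the indentation is not intentional: dedent a template once and fill it afterwards. `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the instructions are an abbreviated copy of the ones above.

```python
import textwrap

# Dedent the template before formatting, so that multi-line context text
# cannot influence how much indentation is stripped.
PROMPT_TEMPLATE = textwrap.dedent(
    """\
    DOCUMENT:
    {context}

    QUESTION:
    {query}

    INSTRUCTIONS:
    Answer the user's QUESTION using the DOCUMENT text above.
    Keep your answer grounded in the facts of the DOCUMENT.
    """
)


def build_prompt(query: str, context: str) -> str:
    """Fill the template with the retrieved context and the user question."""
    return PROMPT_TEMPLATE.format(context=context, query=query)


example_prompt = build_prompt("What is RL?", "First line.\nSecond line.")
print(example_prompt)
```

Keeping prompt construction in a pure function also makes it possible to inspect or test the prompt without calling the API.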

In [28]:
LLM_Results = [LLM_request(question, context) for question, context in zip(Questions, Values)]
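
Each result dictionary carries token usage and latency, which can be aggregated across requests. A minimal sketch: `summarize_usage` is a hypothetical helper, and the numbers in `sample_results` are made up for illustration, not measurements from this notebook.

```python
def summarize_usage(results: list[dict]) -> dict:
    """Aggregate token counts and latency over a list of result dictionaries."""
    total_prompt = sum(r["prompt_tokens"] for r in results)
    total_completion = sum(r["completion_tokens"] for r in results)
    total_time = sum(r["elapsed_time"] for r in results)
    return {
        "requests": len(results),
        "prompt_tokens": total_prompt,
        "completion_tokens": total_completion,
        "total_tokens": total_prompt + total_completion,
        # Guard against division by zero for an empty list.
        "mean_elapsed_time": total_time / len(results) if results else 0.0,
    }


# Made-up values with the same keys that LLM_request returns.
sample_results = [
    {"prompt_tokens": 1200, "completion_tokens": 300, "elapsed_time": 4.0},
    {"prompt_tokens": 800, "completion_tokens": 200, "elapsed_time": 2.0},
]
print(summarize_usage(sample_results))
```

In the notebook the same function could be called as `summarize_usage(LLM_Results)`.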

In [38]:
print(LLM_Results[0]['content'])

Convolutional Neural Networks (CNNs) can be applied in various domains beyond their initial use in image processing. Some common applications include:

1. **Image Processing**: CNNs are widely used for image classification, segmentation, and enhancement tasks. Examples include identifying satellite images that contain roads or classifying handwritten letters and digits.

2. **Natural Language Processing (NLP)**: CNNs are used for understanding and processing language data, although Recurrent Neural Networks (RNNs) are often preferred for certain NLP tasks.

3. **Speech Recognition**: CNNs can be used to process audio data for recognizing speech patterns.

4. **Population Genetics**: CNNs are applied in genetic inference tasks such as performing selective sweeps, finding gene flow, and inferring population size changes.

5. **Astrophysics**: CNNs help interpret radio telescope data to predict visual images representing the data.

6. **Voice Synthesis**: Deepmind’s WaveNet, a CNN model, 

In [39]:
print(LLM_Results[1]['content'])

Reinforcement Learning (RL) is a framework in which an agent learns to interact with its environment by taking actions and receiving feedback in the form of rewards. The goal of the agent is to maximize the expected cumulative reward over time. In RL, the agent does not receive explicit instructions on how to perform tasks but instead learns through trial and error, guided by the rewards it receives. The environment provides the agent with a state, and the agent must choose an action that leads to a new state and a reward. RL problems can be categorized into episodic tasks, which have a defined start and end, and continuing tasks, which go on indefinitely. The learning process often involves concepts such as the Markov Decision Process (MDP), which provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.


In [40]:
print(LLM_Results[2]['content'])

To deploy a machine learning model, you can follow these steps:

1. **Convert the Model**: Save your trained model in a suitable format. If it's a model built with sklearn, save it as a `.pkl` file. If it's a deep learning model, save it as an `.hdf5` file.

2. **Implement an API**: Use Flask to create an API for your model. Load the saved model and create a Flask application object. Develop a test API function to ensure the API's health and a "POST" request API for processing requests to your model.

3. **Run the API Locally**: Set up the Flask app to run on your local machine. You can test the API using a browser or tools like Postman.

4. **Deploy on a Cloud Server**: Use container services like Docker to package your application. Set up a Dockerfile to create an environment with all necessary dependencies and configurations. Use a web server like Gunicorn and a reverse proxy like NGINX for handling requests.

5. **Monitor and Manage**: Use tools like Supervisord to monitor your pro

In [41]:
print(LLM_Results[3]['content'])

To implement a random forest algorithm, you can follow these steps as outlined in the DOCUMENT:

1. **Understand Decision Trees**: Before implementing a random forest, it's important to understand decision trees, as a random forest is essentially a collection of decision trees. Decision trees involve making a series of Yes/No decisions to classify data into categories.

2. **Use Ensemble Learning**: Random forests utilize ensemble learning by creating multiple decision trees and averaging their results to improve accuracy. This method emphasizes feature selection and does not assume a linear relationship in the data.

3. **Load Necessary Packages**: If you are using R, you need to load specific packages to work with random forests. If these packages are not already installed, you will need to install them first.

4. **Read and Inspect Data**: Load your dataset into R for analysis. For example, in the DOCUMENT, the dataset named "Wisconsin" is read directly from a web link.

5. **Build 