## Loading articles

In [1]:
import pandas as pd

df = pd.read_csv('articles/medium.csv')
df.head()

| Unnamed: 0 | Title | Text |
|---|---|---|
| 0 | A Beginner’s Guide to Word Embedding with Gens... | 1. Introduction of Word2vec\n\nWord2vec is one... |
| 1 | Hands-on Graph Neural Networks with PyTorch & ... | In my last article, I introduced the concept o... |
| 2 | How to Use ggplot2 in Python | Introduction\n\nThanks to its strict implement... |
| 3 | Databricks: How to Save Data Frames as CSV Fil... | Photo credit to Mika Baumeister from Unsplash\... |
| 4 | A Step-by-Step Implementation of Gradient Desc... | A Step-by-Step Implementation of Gradient Desc... |


In [2]:
assert len(df) == 1391

In [3]:
from langchain_community.document_loaders import DataFrameLoader

loader = DataFrameLoader(df, page_content_column="Text")
articles = loader.load()

print("Articles loaded:", len(articles))

Articles loaded: 1391


In [4]:
print(articles[0].page_content)

1. Introduction of Word2vec

Word2vec is one of the most popular technique to learn word embeddings using a two-layer neural network. Its input is a text corpus and its output is a set of vectors. Word embedding via word2vec can make natural language computer-readable, then further implementation of mathematical operations on words can be used to detect their similarities. A well-trained set of word vectors will place similar words close to each other in that space. For instance, the words women, men, and human might cluster in one corner, while yellow, red and blue cluster together in another.

There are two main training algorithms for word2vec, one is the continuous bag of words(CBOW), another is called skip-gram. The major difference between these two methods is that CBOW is using context to predict a target word while skip-gram is using a word to predict a target context. Generally, the skip-gram method can have a better performance compared with CBOW method, for it can capture tw

## Chunking

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # The library defaults are used; uncomment a line to override it.
    # chunk_size=1000,
    # chunk_overlap=20,
    # length_function=len,
)

articles = text_splitter.split_documents(articles)
print("Number of chunks:", len(articles))

Number of chunks: 2731
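
The splitter above runs with its defaults because the `chunk_size` and `chunk_overlap` arguments are commented out. As a rough illustration of what those two parameters control, here is a simplified fixed-width splitter in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one tries to break on separators such as paragraph and line breaks first); `split_with_overlap` is a hypothetical helper written only for this sketch.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the kind of shared context an overlap is meant to preserve across chunk boundaries.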


## Vector database

In [6]:
# Keep only the first 10 chunks so that only a small sample is embedded.
articles = articles[:10]

In [7]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(show_progress=True)

vector_store = FAISS.from_documents(articles, embeddings)

OllamaEmbeddings: 100%|██████████| 10/10 [03:28<00:00, 20.81s/it]
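
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query embedding. The sketch below shows that idea with hand-made 3-dimensional vectors and squared Euclidean distance. The vectors, the chunk labels, and the `top_k` helper are all invented for illustration; a real FAISS index uses optimized data structures rather than a Python sort.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query: list[float], store: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the texts of the k stored vectors closest to the query."""
    ranked = sorted(store, key=lambda item: squared_l2(query, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy "index": (chunk text, embedding) pairs with made-up embeddings.
toy_store = [
    ("chunk about word embeddings", [0.9, 0.1, 0.0]),
    ("chunk about graph networks", [0.1, 0.9, 0.0]),
    ("chunk about plotting", [0.0, 0.1, 0.9]),
]

query_vector = [0.8, 0.2, 0.1]  # stands in for an embedded question
nearest = top_k(query_vector, toy_store, k=2)
print(nearest)
# ['chunk about word embeddings', 'chunk about graph networks']
```

The quality of the results depends entirely on the embedding model: the search itself only compares numbers, so chunks that are embedded poorly can rank highly for unrelated questions.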


In [14]:
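# No k is passed, so the store's default number of hits is returned.
retrieved = vector_store.similarity_search("What is Gensim?")
print("Retrieved documents:", len(retrieved))
# Print the third hit (index 2).
print("Document content:", retrieved[2].page_content)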

OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.63s/it]

Retrieved documents: 4
Document content: Photo credit to Mika Baumeister from Unsplash

When I work on Python projects dealing with large datasets, I usually use Spyder. The environment of Spyder is very simple; I can browse through working directories, maintain large code bases and review data frames I create. However, if I don’t subset the large data, I constantly face memory issues and struggle with very long computational time. For this reason, I occasionally use Databricks. Databricks is a Microsoft Azure platform where you can easily parse large amounts of data into “notebooks” and perform Apache Spark-based analytics.

If you want to work with data frames and run models using pyspark, you can easily refer to Databricks’ website for more information. However, while working on Databricks, I noticed that saving files in CSV, which is supposed to be quite easy, is not very straightforward. In the following section, I would like to share how you can save data frames from Databricks i




## System

In [9]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.llms import Ollama

llm = Ollama(model="llama2")

prompt = ChatPromptTemplate.from_template("""
   Answer the question based on the provided context:
   <context>
   {context}
   </context>
                                          
   Question: 
   {input}
""")

document_chain = create_stuff_documents_chain(llm, prompt)
document_chain

RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableLambda(format_docs)
}), config={'run_name': 'format_inputs'})
| ChatPromptTemplate(input_variables=['context', 'input'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'input'], template='\n   Answer the question based on the provided context:\n   <context>\n   {context}\n   </context>\n                                          \n   Question: \n   {input}\n'))])
| Ollama()
| StrOutputParser(), config={'run_name': 'stuff_documents_chain'})
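
The "stuff" strategy is the simplest way to combine documents: the text of every retrieved document is joined into one string and substituted for `{context}` in the prompt. A plain-Python approximation of that formatting step follows. `stuff_prompt` is a hypothetical helper, the blank-line separator is an assumption, and the sample chunk texts are placeholders; the real chain additionally sends the formatted prompt to the LLM and parses its output.

```python
TEMPLATE = """Answer the question based on the provided context:
<context>
{context}
</context>

Question:
{input}
"""


def stuff_prompt(doc_texts: list[str], question: str, separator: str = "\n\n") -> str:
    """Join all document texts and place them in the prompt's context slot."""
    context = separator.join(doc_texts)
    return TEMPLATE.format(context=context, input=question)


prompt_text = stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],  # placeholder texts
    "What is Gensim?",
)
print(prompt_text)
```

Because every document is pasted in whole, the prompt grows with the number and size of the retrieved chunks, so this approach only works while the combined text fits in the model's context window.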

In [10]:
from langchain.chains import create_retrieval_chain

retriever = vector_store.as_retriever()
retrieval_chain = create_retrieval_chain(retriever, document_chain)
retrieval_chain

RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableBinding(bound=RunnableLambda(lambda x: x['input'])
           | VectorStoreRetriever(tags=['FAISS', 'OllamaEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x00000225408ABFD0>), config={'run_name': 'retrieve_documents'})
})
| RunnableAssign(mapper={
    answer: RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
              context: RunnableLambda(format_docs)
            }), config={'run_name': 'format_inputs'})
            | ChatPromptTemplate(input_variables=['context', 'input'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'input'], template='\n   Answer the question based on the provided context:\n   <context>\n   {context}\n   </context>\n                                          \n   Question: \n   {input}\n'))])
            | Ollama()
            | StrOutputParser(), config={'run_name': 'stuff_documents_chain'})
  }), conf

In [11]:
response = retrieval_chain.invoke({"input": "How to implement the gradient descent?"})
print(response["answer"])

OllamaEmbeddings: 100%|██████████| 1/1 [00:06<00:00,  6.75s/it]


The code for implementing gradient descent in Python using a neural network is shown below:
```
import numpy as np

class NeuralNetwork:
    def __init__(self, input_dim, hidden_dim, output_dim):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        
        self.w = np.random.rand(input_dim, hidden_dim)
        self.b = np.zeros((hidden_dim,))
        
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    
    def sigmoid_derivative(self, x, w):
        return x * (1 - x) * np.exp(x) * (-w)
    
    def gradient_descent(self, x, y, iterations):
        for i in range(iterations):
            Xi = x
            Xj = self.sigmoid(Xi, self.w)
            yhat = self.sigmoid(Xj, self.b)
            
            # gradients for hidden to output weights
            g_b = np.dot(Xj.T, (y - yhat) * self.sigmoid_derivative(Xj, self.b))
            
            # gradients for input to hidden weights
            g_w = np