### Visualize the vector store in 3D
- We use OpenAI's `text-embedding-ada-002` model to embed the documents.
- We then fetch the top 240 vectors and visualize them in 3D.
- Finally, we perform a semantic search and visualize its nearest neighbors.

In [41]:
%pip install -Uqr requirements.txt
%pip install -Uq matplotlib scikit-learn==1.5.0 plotly nbformat==4.2.0

from langchain_community.vectorstores import MongoDBAtlasVectorSearch
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
from langchain_community.document_loaders import PyPDFLoader

import numpy as np
from sklearn.manifold import TSNE
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.io as pio

import os
from dotenv import dotenv_values

config = {
    **dotenv_values(".env"),
    **os.environ
}

openai_embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    api_key=config["OPEN_AI_API_KEY"]
)

print("Completed setup!")


ERROR: Cannot install -r requirements.txt (line 19), -r requirements.txt (line 3), -r requirements.txt (line 70), langchain==0.2.5 and numpy==2.0.0 because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Completed setup!
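
The `config` dictionary in the setup cell merges two mappings, and the order of the `**` unpacking decides which source wins. The sketch below illustrates that precedence with plain dictionaries standing in for `dotenv_values(".env")` and `os.environ`; the sample values and the `require` helper are hypothetical and not part of the notebook.

```python
# Stand-ins for dotenv_values(".env") and os.environ (hypothetical values).
dotenv_file = {
    "OPEN_AI_API_KEY": "key-from-dotenv",
    "MONGODB_ATLAS_CLUSTER_URI": "uri-from-dotenv",
}
process_env = {"OPEN_AI_API_KEY": "key-from-shell"}

# Later unpacked mappings win, so the process environment overrides the .env file.
config = {**dotenv_file, **process_env}


def require(cfg, key):
    """Return cfg[key], raising a readable error if it is missing or empty."""
    value = cfg.get(key)
    if not value:
        raise KeyError(f"Missing required setting: {key}")
    return value


print(require(config, "OPEN_AI_API_KEY"))            # key-from-shell
print(require(config, "MONGODB_ATLAS_CLUSTER_URI"))  # uri-from-dotenv
```

Checking the required keys up front in this way would surface a missing API key or cluster URI before the first network call rather than in the middle of a later cell.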


#### Set up MongoDB

In [42]:

client = MongoClient(config["MONGODB_ATLAS_CLUSTER_URI"], server_api=ServerApi('1'))
DB_NAME = "capit_ai"
# COLLECTION_NAME = "demo_vector"
# ATLAS_VECTOR_SEARCH_INDEX_NAME = "demo_vector_index"
COLLECTION_NAME = "cfa_level_1_ada2"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "cfa_level_1_vindex_1536"
MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]

print("Connected to MongoDB!")


Connected to MongoDB!


##### A. Fine-tune the Model's Vectors Database

In [43]:
# Insert the documents into MongoDB Atlas together with their embeddings
# (`pages` must already be loaded, e.g. with PyPDFLoader)
vector_search = MongoDBAtlasVectorSearch.from_documents(
    documents=pages,
    embedding=openai_embeddings,
    collection=MONGODB_COLLECTION,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)

##### B. Load the Model's Vectors Database

In [44]:
vector_search = MongoDBAtlasVectorSearch.from_connection_string(
    config["MONGODB_ATLAS_CLUSTER_URI"],
    DB_NAME + "." + COLLECTION_NAME,
    embedding=openai_embeddings,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)

print("Loading vector completed!")

Loading vector completed!


#### Test the Model's Vectors Database

In [45]:
# query_embedding = openai_embeddings.embed_query("Who is the Associate Dean in UOWCHK?")
query_embedding = openai_embeddings.embed_query("What is the domain inside CFA level 1?")

print(f"Number of Dimensions: {len(query_embedding)}")
print(query_embedding[:5])


Number of Dimensions: 1536
[0.04898280277848244, -0.030952323228120804, 0.012566697783768177, -0.009240620769560337, -0.03944850340485573]


In [52]:
# Cosine similarity: the cosine of the angle between two vectors. We use this metric in our example.
# Euclidean distance: the straight-line distance between two vectors in multidimensional space.
# Manhattan distance (L1 distance): the sum of the absolute differences between corresponding elements of two vectors.

top_k = 240

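# Note: _similarity_search_with_score is a private method (leading underscore),
# so it may change between LangChain releases. It is used here because it
# accepts a precomputed query embedding.
response = vector_search._similarity_search_with_score(
    # query="Who is Dr Pang? What is his role?",  # query text (unused)
    embedding=query_embedding,  # query embedding
    k=top_k,                    # number of results to return
)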

if len(response) > 0:
    print(f"Number of results: {len(response)}")
    print(f"First result embedding:")
    print(response[0][0].page_content)
    print(response[0][0].metadata.get("embedding"))
else:
    print("No results found!")

Number of results: 240
First result embedding:
Corporate Finance
10%
41
.
0.88
.
Portfolio Management
6%
40
.
0.54
.
Equity Investments
11%
62
.
0.64
.
Fixed Income Investments
11%
56
.
0.71
.
Derivatives
6%
20
.
1.08
.
Alternative Investments
6%
7
.
3.09
.
Totals
100%
525
.
0.69
.
Notice how the LOS counts are not consistent with the exam weights. In fact, some
topic areas with a relatively high number of LOS have a reatively low weight on the
exam, so allocating your preparation time based on the number of LOS will most likely
lead to over-preparation in some areas (e.g., Economics) and under-preparation in others
(especially Ethics).
Formulas
You may be surprised to know that the Level I CFA examination is quite conceptual
and is not heavily weighted toward computations based on memorized formulas. It is
nothing like what my undergraduate students used to refer to as “plug and chug”
problems. Certainly, some formulas are required, but you will find that you need to use
your calculat
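
The comments in the cell above name three ways to compare vectors. As a minimal, dependency-free sketch of what each one computes, the helpers below (hypothetical names, not part of the notebook or of LangChain) implement them for plain Python lists. The metric that Atlas actually applies is determined by the vector search index definition, which this notebook does not show.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line (L2) distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


u = [1.0, 0.0, 0.0]
v = [0.0, 1.0, 0.0]
w = [2.0, 0.0, 0.0]  # same direction as u, twice the length

print(cosine_similarity(u, w))   # 1.0
print(cosine_similarity(u, v))   # 0.0
print(euclidean_distance(u, w))  # 1.0
print(manhattan_distance(u, v))  # 2.0
```

Note that `u` and `w` are identical under cosine similarity but one unit apart under Euclidean distance: cosine similarity ignores vector length, while the two distance measures do not.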

In [47]:

# Collect the embedding of every retrieved document,
# then place the query embedding at index 0
embs = [doc.metadata.get("embedding") for doc, _score in response]
embs.insert(0, query_embedding)


n_components = 3 #3D
embs = np.array(embs) #converting to numpy array

# n_components is the number of dimensions of the embedded space
# random_state is the seed used by the random number generator, useful for reproducibility
# perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. Different values can result in significantly different results.
tsne = TSNE(n_components=n_components, random_state=42, perplexity=50)
reduced_vectors = tsne.fit_transform(embs)  #fitting the model and reducing the dimensions
reduced_vectors[0:10]
# print(reduced_vectors)


array([[ 9.989872 , 31.63792  , 34.802128 ],
       [ 0.4542818, 23.499418 , 12.06794  ],
       [ 9.031281 , 20.498766 , 32.241596 ],
       [-2.750646 , 24.274012 , 32.76383  ],
       [ 2.843891 , 28.26126  , 25.461365 ],
       [ 9.593894 , 27.512913 , 20.909052 ],
       [-1.0333468, 21.858128 , 20.5508   ],
       [16.784018 , 28.797344 , 24.38921  ],
       [-3.446257 , 32.035206 , 18.516897 ],
       [12.595609 , 13.810703 , 36.808807 ]], dtype=float32)
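
To see what the reduction step does independently of the vector store, the following sketch runs t-SNE on synthetic data. The two Gaussian clusters, their sizes, and the perplexity of 10 are assumptions chosen for illustration; only `TSNE(n_components=3, random_state=42, ...)` mirrors the cell above. Recent scikit-learn releases also expect `perplexity` to be smaller than the number of samples, which the 241 points used in the notebook satisfy for a perplexity of 50.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Two synthetic clusters of 30 points each in 16 dimensions
# (stand-ins for the 1536-dimensional embeddings).
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(30, 16))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(30, 16))
points = np.vstack([cluster_a, cluster_b])

# Same call shape as in the notebook, with a perplexity suited to 60 samples.
tsne = TSNE(n_components=3, random_state=42, perplexity=10)
reduced = tsne.fit_transform(points)

print(reduced.shape)  # (60, 3)
```

The output has one row per input vector and three columns, which is why the plotting cell can read the x, y, and z coordinates from columns 0, 1, and 2. The coordinates themselves carry no absolute meaning; only relative neighborhoods are informative.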

#### Visualize the Vectors in 3D Space

In [48]:

# Create a 3D scatter plot
scatter_plot = go.Scatter3d(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    z=reduced_vectors[:, 2],
    mode='markers',
    marker=dict(size=5, color='grey', opacity=0.5, line=dict(color='lightgray', width=1)),
    text=[f"Point {i}" for i in range(len(reduced_vectors))]
)

# Highlight the first point with a different color
highlighted_point = go.Scatter3d(
    x=[reduced_vectors[0, 0]],
    y=[reduced_vectors[0, 1]],
    z=[reduced_vectors[0, 2]],
    mode='markers',
    marker=dict(size=8, color='red', opacity=0.8, line=dict(color='lightgray', width=1)),
    text=["Question"]
)

# Highlight the three nearest documents (rows 1-3; row 0 is the query) in blue
blue_points = go.Scatter3d(
    x=reduced_vectors[1:4, 0],
    y=reduced_vectors[1:4, 1],
    z=reduced_vectors[1:4, 2],
    mode='markers',
    marker=dict(size=8, color='blue', opacity=0.8,  line=dict(color='black', width=1)),
    text=["Top 1 Document","Top 2 Document","Top 3 Document"]
)

# Create the layout for the plot
layout = go.Layout(
    scene=dict(
        xaxis=dict(title='X'),
        yaxis=dict(title='Y'),
        zaxis=dict(title='Z'),
    ),
    title='3D Representation after t-SNE (Perplexity=50)'
)


fig = make_subplots(rows=1, cols=1, specs=[[{'type': 'scatter3d'}]])

# Add the scatter plots to the Figure
fig.add_trace(scatter_plot)
fig.add_trace(highlighted_point)
fig.add_trace(blue_points)

fig.update_layout(layout)

pio.write_html(fig, 'interactive_plot.html')
# fig.show()

--------------------------------------------

In [51]:
openai_gpt4o = ChatOpenAI(model_name='gpt-4o', openai_api_key=config["OPEN_AI_API_KEY"])

response = openai_gpt4o.invoke("Who is the Associate Dean in UOWCHK?")

print(response)

content="As of my last update in October 2023, I don't have real-time access to current personnel details. For the most up-to-date information about the Associate Dean at the University of Wollongong College Hong Kong (UOWCHK), I recommend visiting the official UOWCHK website or directly contacting the institution. They should have the latest and most accurate information regarding their administrative staff." response_metadata={'token_usage': {'completion_tokens': 77, 'prompt_tokens': 17, 'total_tokens': 94}, 'model_name': 'gpt-4o', 'system_fingerprint': 'fp_319be4768e', 'finish_reason': 'stop', 'logprobs': None} id='run-5a73db83-c35c-46fa-815e-652616423233-0' usage_metadata={'input_tokens': 17, 'output_tokens': 77, 'total_tokens': 94}


#### Prepare Fine-tuning Data

In [36]:

loader = PyPDFLoader('demo/Staff - University of Wollongong – UOW.pdf', extract_images=False)
pages = loader.load_and_split()

print(pages)


[Document(page_content='繁中\nVice-P r e s i d ent c u m  Acting D ean 副校長  暨署\n理院長\nCHOI, Charlie Yiu-kuen\nBSc CUHK; MSc \xa0Leeds; PhD\xa0Sunderland; CEng, MIET ,\nMHKIE, RPE (ENS)\nTel:\xa02707 3239\nEma il:\xa0charlie_choi@uow.edu.au\nStaﬀ17/06/2024, 17:15 Staff - University of Wollongong – UOW\nhttps://www.uowchk.edu.hk/about-us/about-the-faculties/science-technology/staff/ 1/8', metadata={'source': 'demo/Staff - University of Wollongong – UOW.pdf', 'page': 0}), Document(page_content='Ass ociate D ean 副院長\nAss i s tant P r ofe ss o r  助理教授\nLAU, Ho Lam\nBEng, MPhil, PhD\xa0HKUST\nTel:\xa02707 3252\nEma il:\xa0ho_lam_lau@uow.edu.au\nPr ofe ss o r  of P r actice 實務教授\nCHENG, Paul Sui-Pong\nMBA\xa0Columbia Southern; DBA American City;\xa017/06/2024, 17:15 Staff - University of Wollongong – UOW\nhttps://www.uowchk.edu.hk/about-us/about-the-faculties/science-technology/staff/ 2/8', metadata={'source': 'demo/Staff - University of Wollongong – UOW.pdf', 'page': 1}), Document(page_content=

#### Reconnect to another MongoDB Vector Database

In [38]:

client = MongoClient(config["MONGODB_ATLAS_CLUSTER_URI"], server_api=ServerApi('1'))
DB_NAME = "capit_ai"
COLLECTION_NAME = "demo_vector"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "demo_vector_index"
MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]

print("Connected to MongoDB!")


Connected to MongoDB!


##### A. Fine-tune the Model's Vectors Database

In [53]:
# Insert the documents into MongoDB Atlas together with their embeddings
vector_search = MongoDBAtlasVectorSearch.from_documents(
    documents=pages,
    embedding=openai_embeddings,
    collection=MONGODB_COLLECTION,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)

#### Test the Fine-tuned Model's Vectors Database

In [56]:
from langchain.chains import RetrievalQA

qa_retriever = vector_search.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

qa = RetrievalQA.from_chain_type(
    llm=openai_gpt4o,
    # chain_type="stuff",
    retriever=qa_retriever,
    return_source_documents=True,
    # chain_type_kwargs={"prompt": PROMPT},
)

# invoke() replaces the deprecated qa({...}) call style
docs = qa.invoke({"query": "Who is the Associate Dean in UOWCHK? tell me about him."})

print(docs["result"])

The Associate Dean at the University of Wollongong College Hong Kong (UOWCHK) is LAU, Ho Lam. He holds a Bachelor of Engineering (BEng), Master of Philosophy (MPhil), and Doctor of Philosophy (PhD) from the Hong Kong University of Science and Technology (HKUST). You can contact him via telephone at 2707 3252 or email at ho_lam_lau@uow.edu.au.
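
The model can answer this time because the retriever places the matching document chunk into the prompt before the question is sent. The sketch below illustrates that idea in plain Python. The `build_stuffed_prompt` function and its template wording are hypothetical and are not LangChain's actual prompt; they only show the general "stuff the retrieved text into the context" pattern that the default `stuff` chain type follows.

```python
def build_stuffed_prompt(question, documents):
    """Concatenate retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# A cleaned-up excerpt of the retrieved chunk, used as sample input.
retrieved = ["Associate Dean\nLAU, Ho Lam\nBEng, MPhil, PhD HKUST\nTel: 2707 3252"]

prompt = build_stuffed_prompt("Who is the Associate Dean in UOWCHK?", retrieved)
print(prompt)
```

With `search_kwargs={"k": 1}` only one chunk is stuffed into the context, so the answer can only be as good as that single top-ranked match.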


In [57]:
print(docs["source_documents"])

[Document(page_content='Ass ociate D ean 副院長\nAss i s tant P r ofe ss o r  助理教授\nLAU, Ho Lam\nBEng, MPhil, PhD\xa0HKUST\nTel:\xa02707 3252\nEma il:\xa0ho_lam_lau@uow.edu.au\nPr ofe ss o r  of P r actice 實務教授\nCHENG, Paul Sui-Pong\nMBA\xa0Columbia Southern; DBA American City;\xa017/06/2024, 17:15 Staff - University of Wollongong – UOW\nhttps://www.uowchk.edu.hk/about-us/about-the-faculties/science-technology/staff/ 2/8', metadata={'_id': ObjectId('667064c19705c5d7a3823fb0'), 'embedding': [0.007546300999820232, 0.012413562275469303, -0.0005811401642858982, -0.03108503296971321, -0.0294489786028862, 0.017042232677340508, -0.00419238954782486, 0.01929180696606636, -0.019619017839431763, -0.02106419950723648, -0.02591782808303833, -0.01038894522935152, 0.031875792890787125, 0.029558049514889717, -0.0060431757010519505, -0.01317023765295744, 0.01621057279407978, -7.669004844501615e-05, -0.0037015730049461126, 0.006349936127662659, -0.032911960035562515, -0.007464498281478882, -0.012808942236
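
The printed source document is dominated by the stored embedding, which appears in the metadata and runs to 1536 numbers. One way to keep the output readable is to drop that key before printing. The sketch below uses a plain dictionary as a stand-in for a document's `metadata`; the `strip_embedding` helper and the shortened sample values are hypothetical.

```python
def strip_embedding(metadata, key="embedding"):
    """Return a copy of the metadata without the (very long) embedding vector."""
    return {k: v for k, v in metadata.items() if k != key}


# Stand-in for doc.metadata, with the embedding shortened to three values.
sample_metadata = {
    "source": "demo/Staff - University of Wollongong – UOW.pdf",
    "page": 1,
    "embedding": [0.0075, 0.0124, -0.0006],
}

print(strip_embedding(sample_metadata))

# In the notebook this could be applied per document, for example:
# for doc in docs["source_documents"]:
#     print(doc.page_content, strip_embedding(doc.metadata))
```

The helper returns a new dictionary, so the original metadata, including the embedding, remains available for later cells.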