<font color='green'>
pip install is the command for installing Python packages with pip, Python's package manager.
<br><br>Here we install the LangChain package along with langchain-openai, the LangChain team's newer, separate package for the OpenAI integration.
</font>

In [1]:
#!pip install langchain==0.1.4
#!pip install langchain-openai==0.0.5

<font color='green'>
Installing the openai package, which includes the classes we can use to communicate with OpenAI services.
</font>

In [2]:
#!pip install openai==1.10.0

## Let's use OpenAI

<font color='green'>
The cell below calls load_dotenv() from the python-dotenv package, which reads key-value pairs from a .env file and loads them as environment variables.
<br>Python's built-in "os" module provides a way to interact with the operating system, such as accessing environment variables, working with files and directories, and executing shell commands.
<br><br>
Its environ attribute is a dictionary-like object that contains the environment variables of the current process.
<br><br>
By accessing os.environ, you can retrieve and manipulate environment variables within your Python program. For example, os.environ['VARIABLE_NAME'] returns the value of the environment variable named "VARIABLE_NAME".
</font>

In [1]:
from dotenv import load_dotenv
load_dotenv()
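
<font color='green'>
As a small sketch of the os.environ access described above: the variable name DEMO_API_KEY and the helper get_env are hypothetical, chosen only for illustration. Using .get() rather than square brackets avoids a KeyError when a variable is not set.
</font>

```python
import os

# Hypothetical variable, set here only so the example runs on its own.
# In the notebook, load_dotenv() would populate such variables from a .env file.
os.environ["DEMO_API_KEY"] = "sk-demo"

def get_env(name, default=None):
    """Return the environment variable `name`, or `default` if it is unset."""
    return os.environ.get(name, default)

print(get_env("DEMO_API_KEY"))        # the value set above
print(get_env("DEMO_MISSING_VAR"))    # None, because it was never set
```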

<font color='green'>
LangChain provides a wrapper around the OpenAI APIs, through which we can access the services OpenAI provides.
<br>
The code snippet below imports the 'OpenAIEmbeddings' class (a wrapper around OpenAI embedding models) from the 'langchain_openai' package.

</font>

In [2]:
# The LangChain team is actively improving the library, so changes land almost every week.
# As part of that, the import below has been deprecated:
#from langchain.embeddings import OpenAIEmbeddings

# New import from langchain_openai, which replaces the one above
from langchain_openai import OpenAIEmbeddings

<font color='green'>
Initialize the OpenAIEmbeddings object
</font>

In [3]:
embeddings = OpenAIEmbeddings()

<font color='green'>
Let's read our input data and get its embedding representation so that we can use it in later tasks
</font>

In [4]:
# Install the packages below if they are not already installed
#!pip install pandas
#!pip install openpyxl

import pandas as pd
df = pd.read_excel(r'./Data.xlsx')
print(df)

         Words
0     Elephant
1         Lion
2        Tiger
3          Dog
4      Cricket
5      Footbal
6       Tennis
7   Basketball
8        Apple
9       Orange
10      Banana


<font color='green'>
Because our words are stored in a pandas dataframe, we can use "apply" to call embed_query on each row. To save time, we store the computed word embeddings in a new CSV file called "word_embeddings.csv" rather than calling OpenAI again to repeat these computations.
</font>

In [5]:
df['embedding'] = df['Words'].apply(lambda x: embeddings.embed_query(x))
df.to_csv('word_embeddings.csv')
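
<font color='green'>
As an alternative sketch, LangChain embedding classes also expose embed_documents, which takes a list of texts and returns one vector per text, so a whole column can be embedded in one call instead of one call per row. The StandInEmbeddings class below is a hypothetical stub that returns made-up 2-dimensional vectors so the example runs without an API key; with the real OpenAIEmbeddings object you would call the same method on "embeddings" instead.
</font>

```python
import pandas as pd

class StandInEmbeddings:
    """Hypothetical stub mimicking the embed_documents interface (not a real model)."""

    def embed_documents(self, texts):
        # Made-up vectors: [length of the word, number of vowels in it]
        return [
            [float(len(t)), float(sum(c in "aeiou" for c in t.lower()))]
            for t in texts
        ]

demo_df = pd.DataFrame({"Words": ["Elephant", "Lion", "Tiger"]})
stub = StandInEmbeddings()

# One batched call for the whole column instead of one call per row
vectors = stub.embed_documents(demo_df["Words"].tolist())
demo_df["embedding"] = pd.Series(vectors, index=demo_df.index)
print(demo_df)
```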

<font color='green'>
Let's load the existing file, which contains the embeddings, so that we can save on charges by not hitting the API repeatedly
</font>

In [6]:
new_df = pd.read_csv('word_embeddings.csv')
print(new_df)

    Unnamed: 0       Words                                          embedding
0            0    Elephant  [-0.018824968530431193, -0.008682483827226351,...
1            1        Lion  [-0.0015009929272673714, -0.010024921244147899...
2            2       Tiger  [-0.013500550598151365, -0.009651595344086194,...
3            3         Dog  [-0.0008935772431031664, -0.015069474605842001...
4            4     Cricket  [0.004101423533335084, -0.007130117598087479, ...
5            5     Footbal  [-0.011442113927512236, -0.008124158902145616,...
6            6      Tennis  [-0.023049057156337614, 0.0015891833895531008,...
7            7  Basketball  [-0.012911115406280736, -0.013261756483018321,...
8            8       Apple  [0.014532958590987296, -0.003988702690459901, ...
9            9      Orange  [0.020821801954896618, -0.029376701870735933, ...
10          10      Banana  [-0.013107325443239795, -0.020157383080523828,...
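
<font color='green'>
One detail worth keeping in mind: a CSV file stores each embedding list as text, so after read_csv the "embedding" column holds strings rather than lists of floats. The sketch below shows one way to convert them back with ast.literal_eval. It uses made-up 3-dimensional vectors and an in-memory buffer as a stand-in for word_embeddings.csv. Passing index=False to to_csv is optional and simply avoids the extra "Unnamed: 0" column.
</font>

```python
import ast
import io
import pandas as pd

# Made-up vectors standing in for real embeddings
demo_df = pd.DataFrame({
    "Words": ["Elephant", "Lion"],
    "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
print(type(loaded["embedding"][0]))   # <class 'str'>

# Parse each string such as "[0.1, 0.2, 0.3]" back into a Python list
loaded["embedding"] = loaded["embedding"].apply(ast.literal_eval)
print(type(loaded["embedding"][0]))   # <class 'list'>
```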


<font color='green'>
Let's get the embedding for our text
</font>

In [7]:
our_Text = "Cat"

In [8]:
text_embedding = embeddings.embed_query(our_Text)

In [9]:
print(f"Our embedding is {text_embedding}")

Our embedding is [-0.008174207879591727, -0.007511803310590737, -0.009956554371743542, -0.02478895115778007, -0.012790553094547418, 0.00665477514359485, -0.0015151649503578348, -0.037832173925964885, -0.014422662356334215, -0.026250339680779573, 0.017154227704543154, 0.046327340706031485, 0.0035646922858117063, 0.0042407544673495525, -0.03228709801998716, -0.004592443287070651, 0.039553060579624245, 0.00526167677875539, 0.007894222515219342, -0.015501631209043831, -0.02372364108176051, 0.005319722854397888, 0.01487337125346158, -0.012141805905252642, -0.006781109980413548, 0.00438416184370826, 0.012387647121867568, -0.013364181668659684, 0.005620194986821517, 0.000973120027242582, 0.009881435756561033, -0.016539625594328398, -0.0177961455054929, -0.039225272290804344, -0.029473585599573242, -0.0003316290630643995, 0.011465743956545437, -0.007361567477209563, 0.020732577602891714, -0.013917323009059427, 0.009451215024468754, 0.009806318072701086, -0.013371010125682151, 0.011916451922350

<font color='green'>
Once we have a vector representing a word, we can determine how similar that word is to the other words in our dataframe
<br>
by computing the cosine similarity between the vector for our search term and each word embedding in the dataframe.
</font>

In [10]:
# The embeddings_utils module is no longer shipped with the openai 1.x package,
# so we compute the cosine similarity ourselves :)

#from openai.embeddings_utils import cosine_similarity


import numpy as np

# Helper that returns the cosine similarity score between two vectors
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

df["similarity score"] = df['embedding'].apply(lambda x: cosine_similarity(x, text_embedding))

df

Unnamed: 0,Words,embedding,similarity score
0,Elephant,"[-0.018824968530431193, -0.008682483827226351,...",0.819722
1,Lion,"[-0.0015009929272673714, -0.010024921244147899...",0.840089
2,Tiger,"[-0.013500550598151365, -0.009651595344086194,...",0.84697
3,Dog,"[-0.0008935772431031664, -0.015069474605842001...",0.878559
4,Cricket,"[0.004101423533335084, -0.007130117598087479, ...",0.795555
5,Footbal,"[-0.011442113927512236, -0.008124158902145616,...",0.780048
6,Tennis,"[-0.023049057156337614, 0.0015891833895531008,...",0.785051
7,Basketball,"[-0.012911115406280736, -0.013261756483018321,...",0.78498
8,Apple,"[0.014532958590987296, -0.003988702690459901, ...",0.833735
9,Orange,"[0.020821801954896618, -0.029376701870735933, ...",0.811868
10,Banana,"[-0.013107325443239795, -0.020157383080523828,...",0.806824
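
<font color='green'>
To make the cosine similarity formula concrete, here is a small worked sketch on made-up 2-dimensional vectors (not real embeddings): vectors pointing the same way score 1, perpendicular vectors score 0, and opposite vectors score -1.
</font>

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # same direction
perp = cosine_similarity([1.0, 0.0], [0.0, 3.0])    # perpendicular
opp = cosine_similarity([1.0, 2.0], [-1.0, -2.0])   # opposite direction

print(round(float(same), 6), round(float(perp), 6), round(float(opp), 6))
```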


<font color='green'>
Sorting the dataframe by similarity score shows that Dog, Tiger, and Lion are the closest matches to our search term, Cat.
</font>

In [11]:
df.sort_values("similarity score", ascending=False).head(10)

Unnamed: 0,Words,embedding,similarity score
3,Dog,"[-0.0008935772431031664, -0.015069474605842001...",0.878559
2,Tiger,"[-0.013500550598151365, -0.009651595344086194,...",0.84697
1,Lion,"[-0.0015009929272673714, -0.010024921244147899...",0.840089
8,Apple,"[0.014532958590987296, -0.003988702690459901, ...",0.833735
0,Elephant,"[-0.018824968530431193, -0.008682483827226351,...",0.819722
9,Orange,"[0.020821801954896618, -0.029376701870735933, ...",0.811868
10,Banana,"[-0.013107325443239795, -0.020157383080523828,...",0.806824
4,Cricket,"[0.004101423533335084, -0.007130117598087479, ...",0.795555
6,Tennis,"[-0.023049057156337614, 0.0015891833895531008,...",0.785051
7,Basketball,"[-0.012911115406280736, -0.013261756483018321,...",0.78498
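
<font color='green'>
As a sketch of the same ranking step, pandas also offers nlargest, which returns the k highest-scoring rows directly. The small dataframe below reuses four of the scores shown above; the names "scores" and "top3" are chosen only for this illustration.
</font>

```python
import pandas as pd

# A few of the similarity scores from the output above
scores = pd.DataFrame({
    "Words": ["Dog", "Tennis", "Tiger", "Apple"],
    "similarity score": [0.878559, 0.785051, 0.84697, 0.833735],
})

# Keep only the 3 rows with the highest similarity score, best first
top3 = scores.nlargest(3, "similarity score")
print(top3["Words"].tolist())   # ['Dog', 'Tiger', 'Apple']
```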
