<font color='green'>
"pip install" is the command used to install Python packages with pip, Python's package manager.
<br><br>Here we install the LangChain package along with langchain-openai, the newer integration package from the LangChain team.
</font>

In [1]:
#!pip install langchain
#!pip install langchain-openai

<font color='green'>
Install the OpenAI package, which includes the classes we can use to communicate with OpenAI services.
</font>

In [2]:
#!pip install openai

## Let's use OpenAI

<font color='green'>
The cell below imports Python's built-in "os" module.
<br>This module provides ways to interact with the operating system, such as accessing environment variables, working with files and directories, and executing shell commands.
<br><br>
The os.environ attribute is a dictionary-like object that contains the environment variables of the current process.
<br><br>
Through os.environ you can read and modify environment variables from within your Python program. For example, os.environ['VARIABLE_NAME'] returns the value of the variable named "VARIABLE_NAME" and raises a KeyError if it is not set, whereas os.getenv('VARIABLE_NAME') returns None in that case.
<br><br>
The cell also calls load_dotenv() from the python-dotenv package, which loads variables defined in a local .env file into the environment before the key is read.
</font>

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
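
<font color='green'>
Because os.getenv returns None for a missing variable, a missing key can go unnoticed until the first API call fails. The sketch below shows one way to fail early instead. The helper name require_env and the variable DEMO_API_KEY are made up for this illustration and are not part of any library.
</font>

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)  # None when the variable is not set
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Demonstration with a throwaway variable (DEMO_API_KEY is a made-up name).
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))  # prints: sk-demo
```

<font color='green'>
In this notebook the same check could be applied to "OPENAI_API_KEY" right after load_dotenv().
</font>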

<font color='green'>
LangChain provides a wrapper around the OpenAI APIs, through which we can access the services OpenAI offers.
<br>
The code snippet below imports the 'OpenAIEmbeddings' class (a wrapper around OpenAI's embedding models) from the 'langchain_openai' package.
</font>

In [2]:
# The LangChain team is actively improving the library, so changes land frequently.
# As part of those changes, the import below has been deprecated:
#from langchain.embeddings import OpenAIEmbeddings

# New import from the langchain-openai package, which replaces the one above
from langchain_openai import OpenAIEmbeddings

<font color='green'>
Initialize the OpenAIEmbeddings object
</font>

In [3]:
embeddings = OpenAIEmbeddings()

<font color='green'>
Let's read our input data and get its embedding representation, so that we can use it in later steps.
</font>

In [8]:
# Install the packages below if they are not already installed
#!pip install pandas
#!pip install openpyxl

import pandas as pd
df = pd.read_excel('data.xlsx')
print(df)

         Words
0     Elephant
1         Lion
2        Tiger
3          Dog
4      Cricket
5      Footbal
6       Tennis
7   Basketball
8        Apple
9       Orange
10      Banana


<font color='green'>
    We can use "apply" to apply the get_embedding function to each row in the dataframe because our words are stored in a pandas dataframe. In order to save time and to save the calculated word embeddings in a new csv file called "word_embeddings.csv" rather than calling OpenAI once more to carry out these computations.
    <font>

In [10]:
df.head()

Unnamed: 0,Words
0,Elephant
1,Lion
2,Tiger
3,Dog
4,Cricket


In [11]:
df['embedding'] = df['Words'].apply(lambda x: embeddings.embed_query(x))
df.to_csv('word_embeddings.csv')
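
<font color='green'>
The idea of computing embeddings once and reusing them can also be expressed as a small cache. This is only a sketch under stated assumptions: the function name load_or_compute_embeddings, the JSON cache format, and the embed_fn parameter are all made up for illustration. embed_fn stands for any function that maps a word to a list of floats.
</font>

```python
import json
import os
from typing import Callable, Dict, List


def load_or_compute_embeddings(
    words: List[str],
    embed_fn: Callable[[str], List[float]],
    cache_path: str,
) -> Dict[str, List[float]]:
    """Return {word: vector}, calling embed_fn only for words missing from the cache file."""
    cache: Dict[str, List[float]] = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            cache = json.load(f)

    changed = False
    for word in words:
        if word not in cache:
            cache[word] = embed_fn(word)  # the only place a (possibly paid) call happens
            changed = True

    # Rewrite the cache file only when something new was computed.
    if changed:
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(cache, f)

    return {word: cache[word] for word in words}
```

<font color='green'>
With the objects from this notebook, a call might look like load_or_compute_embeddings(df['Words'].tolist(), embeddings.embed_query, 'word_embeddings.json'), where the file name is a placeholder.
</font>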

<font color='green'>
    Let's load the existing file, which contains the embeddings, so that we can save on charges by not hitting the API repeatedly.
    </font>

In [12]:
new_df = pd.read_csv('word_embeddings.csv')
print(new_df)

    Unnamed: 0       Words                                          embedding
0            0    Elephant  [-0.018833935260772705, -0.008583207614719868,...
1            1        Lion  [-0.0014748648973181844, -0.00997674185782671,...
2            2       Tiger  [-0.013435241766273975, -0.009651306085288525,...
3            3         Dog  [-0.0008876295178197324, -0.015154650434851646...
4            4     Cricket  [0.003892941400408745, -0.007233874406665564, ...
5            5     Footbal  [-0.011474513448774815, -0.008092077448964119,...
6            6      Tennis  [-0.0230230912566185, 0.0015749745070934296, 0...
7            7  Basketball  [-0.012938675470650196, -0.013289385475218296,...
8            8       Apple  [0.01341729424893856, -0.005096784792840481, -...
9            9      Orange  [0.02072838321328163, -0.029429513961076736, 4...
10          10      Banana  [-0.013066131621599197, -0.020126868039369583,...
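
<font color='green'>
One thing to keep in mind when reloading: a CSV file stores each embedding as text, so after pd.read_csv the "embedding" column holds strings such as "[-0.0188, ...]" rather than lists of numbers. The sketch below shows one way to convert such a string back into a list. The helper name parse_embedding is made up for this illustration.
</font>

```python
import ast
from typing import List


def parse_embedding(text: str) -> List[float]:
    """Convert a string such as "[0.1, -0.2]" back into a list of floats."""
    values = ast.literal_eval(text)  # evaluates Python literals only, not arbitrary code
    return [float(v) for v in values]


raw = "[-0.018833935260772705, -0.008583207614719868]"
vector = parse_embedding(raw)
print(type(raw).__name__, "->", type(vector).__name__, len(vector))  # prints: str -> list 2
```

<font color='green'>
Applied to the reloaded dataframe, this could look like new_df['embedding'] = new_df['embedding'].apply(parse_embedding).
</font>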


<font color='green'>
Let's get the embeddings for our text
</font>

In [13]:
our_Text = "Cat"

In [14]:
text_embedding = embeddings.embed_query(our_Text)

In [15]:
print(f"Our embedding is {text_embedding}")

Our embedding is [-0.008235761895775795, -0.007498231250792742, -0.009963496588170528, -0.02478923462331295, -0.012790698558092117, 0.006682166829705238, -0.0015484736068174243, -0.037859924137592316, -0.014409169554710388, -0.02622332237660885, 0.01718173921108246, 0.04624592140316963, 0.0035852198489010334, 0.00422373041510582, -0.0323420986533165, -0.004657371435314417, 0.03941693156957626, 0.005213933996856213, 0.007853338494896889, -0.015583756379783154, -0.02375122904777527, 0.005340270232409239, 0.014900856651365757, -0.01211462914943695, -0.0067743584513664246, 0.004350066650658846, 0.012387788854539394, -0.01334384735673666, 0.005633917171508074, 0.00092020642478019, 0.009861062280833721, -0.0166081041097641, -0.017741717398166656, -0.039307668805122375, -0.02944660559296608, -0.0003254440671298653, 0.011452216655015945, -0.007354822475463152, 0.020732814446091652, -0.013903824612498283, 0.009478638879954815, 0.009799601510167122, -0.013418965972959995, 0.011937075294554234, -

<font color='green'>
    Once we have a vector representing a word, we can determine how similar that word is to the other words in our dataframe.
    <br>
We do this by computing the cosine similarity between the vector for our search term and each word embedding in the dataframe.
    </font>

In [16]:
# The embeddings_utils module was removed from the openai package in the 1.x releases, so we compute the cosine similarity ourselves :)

# Instead of: from openai.embeddings_utils import cosine_similarity
# we can use the standard formula for cosine similarity, as in the snippet below

import numpy as np
#Creating a function which will help us with finding the cosine_similarity score
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


df["similarity score"] = df['embedding'].apply(lambda x: cosine_similarity(x, text_embedding))

df

Unnamed: 0,Words,embedding,similarity score
0,Elephant,"[-0.018833935260772705, -0.008583207614719868,...",0.819499
1,Lion,"[-0.0014748648973181844, -0.00997674185782671,...",0.83991
2,Tiger,"[-0.013435241766273975, -0.009651306085288525,...",0.846924
3,Dog,"[-0.0008876295178197324, -0.015154650434851646...",0.878248
4,Cricket,"[0.003892941400408745, -0.007233874406665564, ...",0.795419
5,Footbal,"[-0.011474513448774815, -0.008092077448964119,...",0.779953
6,Tennis,"[-0.0230230912566185, 0.0015749745070934296, 0...",0.784875
7,Basketball,"[-0.012938675470650196, -0.013289385475218296,...",0.784836
8,Apple,"[0.01341729424893856, -0.005096784792840481, -...",0.830426
9,Orange,"[0.02072838321328163, -0.029429513961076736, 4...",0.811837
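
<font color='green'>
To build intuition for the score, here is cosine similarity worked out on tiny hand-made vectors, using only the standard library. The function name cosine_similarity_plain and the example vectors are made up for this illustration. The formula is the same as above: the dot product divided by the product of the two vector lengths.
</font>

```python
import math
from typing import Sequence


def cosine_similarity_plain(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between vectors a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity_plain([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity_plain([1.0, 0.0], [-1.0, 0.0])
print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))  # prints: 1.0 0.0 -1.0
```

<font color='green'>
The score depends only on direction, not on length: doubling a vector leaves its similarity to any other vector unchanged.
</font>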


<font color='green'>
    Sorting the dataframe by similarity score shows that Dog, Tiger, and Lion are the words closest to our search term, "Cat".
    </font>

In [17]:
df.sort_values("similarity score", ascending=False).head(10)

Unnamed: 0,Words,embedding,similarity score
3,Dog,"[-0.0008876295178197324, -0.015154650434851646...",0.878248
2,Tiger,"[-0.013435241766273975, -0.009651306085288525,...",0.846924
1,Lion,"[-0.0014748648973181844, -0.00997674185782671,...",0.83991
8,Apple,"[0.01341729424893856, -0.005096784792840481, -...",0.830426
0,Elephant,"[-0.018833935260772705, -0.008583207614719868,...",0.819499
9,Orange,"[0.02072838321328163, -0.029429513961076736, 4...",0.811837
10,Banana,"[-0.013066131621599197, -0.020126868039369583,...",0.806726
4,Cricket,"[0.003892941400408745, -0.007233874406665564, ...",0.795419
6,Tennis,"[-0.0230230912566185, 0.0015749745070934296, 0...",0.784875
7,Basketball,"[-0.012938675470650196, -0.013289385475218296,...",0.784836
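
<font color='green'>
When only the best few matches are needed, a full sort is not required. The sketch below picks the top k entries from a dictionary of scores with heapq.nlargest. The helper name top_k is made up for this illustration, and the scores are copied from the table above.
</font>

```python
import heapq
from typing import Dict, List, Tuple


def top_k(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k (word, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Similarity scores to "Cat", copied from the output above.
scores = {
    "Elephant": 0.819499,
    "Lion": 0.83991,
    "Tiger": 0.846924,
    "Dog": 0.878248,
    "Cricket": 0.795419,
    "Apple": 0.830426,
}
print(top_k(scores, 3))  # prints: [('Dog', 0.878248), ('Tiger', 0.846924), ('Lion', 0.83991)]
```

<font color='green'>
On the dataframe itself, df.nlargest(3, "similarity score") gives the same kind of result.
</font>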


In [18]:
# from openai.embeddings_utils import cosine_similarity
# The import above no longer works because it comes from the old openai library, so we use other libraries instead.

In [26]:
# scipy computes the cosine distance; cosine similarity is 1 minus that distance
from scipy import spatial

def calculate_cosine_similarity(embedding1, embedding2):
    return 1 - spatial.distance.cosine(embedding1, embedding2)


In [27]:
df["similarity score"] = df['embedding'].apply(lambda x: calculate_cosine_similarity(x, text_embedding))

In [31]:
df.sort_values("similarity score", ascending=False)

Unnamed: 0,Words,embedding,similarity score
3,Dog,"[-0.0008876295178197324, -0.015154650434851646...",0.878248
2,Tiger,"[-0.013435241766273975, -0.009651306085288525,...",0.846924
1,Lion,"[-0.0014748648973181844, -0.00997674185782671,...",0.83991
8,Apple,"[0.01341729424893856, -0.005096784792840481, -...",0.830426
0,Elephant,"[-0.018833935260772705, -0.008583207614719868,...",0.819499
9,Orange,"[0.02072838321328163, -0.029429513961076736, 4...",0.811837
10,Banana,"[-0.013066131621599197, -0.020126868039369583,...",0.806726
4,Cricket,"[0.003892941400408745, -0.007233874406665564, ...",0.795419
6,Tennis,"[-0.0230230912566185, 0.0015749745070934296, 0...",0.784875
7,Basketball,"[-0.012938675470650196, -0.013289385475218296,...",0.784836


ModuleNotFoundError: No module named 'openai.embeddings_utils'