<font color='green'>
`pip install` is the command used to install Python packages through pip, Python's package manager.
<br><br>We install the LangChain package together with langchain-openai, the separate package the LangChain team now ships for its OpenAI integrations.
</font>

In [53]:
#!pip install langchain==0.1.4
#%pip install langchain-openai==0.0.5
#%pip install numpy

<font color='green'>
Installing the openai package, which provides the classes we use to communicate with OpenAI services
</font>

In [54]:
#!pip install openai==1.10.0

## Let's use OpenAI

<font color='green'>
The cell below imports Python's built-in "os" module.
<br>This module provides ways to interact with the operating system, such as reading environment variables and working with files and directories.
<br><br>
The environ attribute is a dictionary-like object that holds the environment variables of the current process.
<br><br>
Through os.environ you can read and set environment variables from within your Python program. For example, os.environ['VARIABLE_NAME'] returns the value of the variable named "VARIABLE_NAME", and assigning to it, as the cell below does for OPENAI_API_KEY, sets that variable for the running process.
</font>

In [55]:
import os
os.environ["OPENAI_API_KEY"] = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
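
<font color='green'>
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming the key was exported in the shell before the notebook was started, it can be read from the environment instead. The helper name get_api_key is hypothetical and not part of any library.
</font>

```python
import os

def get_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing or empty."""
    key = env.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key

# Demonstration with a plain dict standing in for os.environ
# (the value is a placeholder, not a real key)
demo_env = {"OPENAI_API_KEY": "sk-placeholder"}
print(get_api_key(demo_env))
```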

<font color='green'>
LangChain provides wrappers around the OpenAI APIs, through which we can access the services OpenAI offers.
<br>
The code snippet below imports the 'OpenAIEmbeddings' class (a wrapper around OpenAI's embedding models) from the 'langchain_openai' package.

</font>

In [56]:
#The LangChain team is actively improving the library, so the API changes frequently.
#As part of those changes, the import below has been deprecated:
#from langchain.embeddings import OpenAIEmbeddings

#New import from langchain_openai, which replaces the one above
from langchain_openai import OpenAIEmbeddings

<font color='green'>
Initialize the OpenAIEmbeddings object
</font>

In [57]:
embeddings = OpenAIEmbeddings()

<font color='green'>
Let's read our input data and compute its embedding representation, so that we can use it in the later steps
</font>

In [58]:
#Install the packages below if they are not already installed...
#%pip install pandas
#%pip install openpyxl

import pandas as pd
df = pd.read_excel('data.xlsx')
print(df)

         Words
0     Elephant
1         Lion
2        Tiger
3          Dog
4      Cricket
5      Footbal
6       Tennis
7   Basketball
8        Apple
9       Orange
10      Banana


<font color='green'>
    Because our words are stored in a pandas dataframe, we can use "apply" to call embeddings.embed_query on each row. We then save the computed word embeddings to a new CSV file called "word_embeddings.csv", so that we can reuse them later instead of calling OpenAI again to repeat these computations.
    </font>

In [59]:
df['embedding'] = df['Words'].apply(lambda x: embeddings.embed_query(x))
df.to_csv('word_embeddings.csv')
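
<font color='green'>
The compute-once-then-reuse idea can be sketched as a small cache helper. This is only an illustration under stated assumptions: it stores the vectors as JSON rather than CSV, the names load_or_compute and fake_embed are hypothetical, and fake_embed stands in for embeddings.embed_query so that the sketch runs without an API key.
</font>

```python
import json
import os

def load_or_compute(path, words, embed_fn):
    """Return {word: embedding}; read `path` if it exists, else compute and save."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    result = {word: embed_fn(word) for word in words}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result

# Stand-in for embeddings.embed_query (assumption: returns a list of floats)
def fake_embed(word):
    return [float(len(word)), 0.0]
```

In the notebook, embed_fn would be embeddings.embed_query, so the API is only called the first time the helper runs for a given file.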

<font color='green'>
    Let's load the existing file, which contains the embeddings, so that we can save on charges by not hitting the API repeatedly
    </font>

In [60]:
new_df = pd.read_csv('word_embeddings.csv')
print(new_df)

    Unnamed: 0       Words                                          embedding
0            0    Elephant  [-0.018878312423034248, -0.008599258796030004,...
1            1        Lion  [-0.0015138392825705174, -0.00997922881147518,...
2            2       Tiger  [-0.013501134742314675, -0.009632539697974465,...
3            3         Dog  [-0.000861077619131276, -0.015141254049242174,...
4            4     Cricket  [0.0039361337947048155, -0.007287947426925405,...
5            5     Footbal  [-0.011466965783521396, -0.008091297497552554,...
6            6      Tennis  [-0.023068435609869063, 0.0016187547091597776,...
7            7  Basketball  [-0.012866201360427364, -0.013317207339922954,...
8            8       Apple  [0.01454721563243437, -0.0040025349546621005, ...
9            9      Orange  [0.0207876073455832, -0.029517869639723315, 1....
10          10      Banana  [-0.013083976674634977, -0.020122427894718062,...
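
<font color='green'>
Note that read_csv loads the embedding column as text, so each cell is a string such as "[-0.0188, ...]" rather than a list of numbers. A minimal sketch of converting such a string back into a list, assuming the cell holds the plain Python list representation written by to_csv; the helper name parse_embedding is hypothetical.
</font>

```python
import ast

def parse_embedding(text):
    """Convert the string form of a list, e.g. "[0.1, -0.2]", back into a list of floats."""
    values = ast.literal_eval(text)
    return [float(v) for v in values]

# The first two values of the "Elephant" row shown above; the real vector is much longer
cell = "[-0.018878312423034248, -0.008599258796030004]"
vector = parse_embedding(cell)
print(type(vector).__name__, len(vector))
```

With the dataframe loaded above, the same helper could be applied to the whole column with new_df['embedding'].apply(parse_embedding).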


<font color='green'>
Let's get the embedding for our text
</font>

In [61]:
our_Text = "Baseball"

In [62]:
text_embedding = embeddings.embed_query(our_Text)

In [63]:
print (f"Our embedding is {text_embedding}")

Our embedding is [-0.0016198110573376614, -0.01616438109644016, 0.015059343575899943, -0.017179472413827262, -0.02625105722689299, 0.00042041202061879806, -0.020006312106160454, -0.017577800942606758, 0.016202928537897492, -0.024889035211099603, 0.006032731864410773, 0.0032829882191454645, 0.016421366523015905, -0.006087341360690377, -0.00573719853071641, -0.00819783380391465, 0.016819693189150248, -0.02406668110099916, 0.015688957684746003, -0.021291239239039177, -0.004333416010575434, 0.0056022815542931745, 0.0010110768514194305, -0.020777268385887688, -0.014583920164205785, -0.01737221148375907, 0.010350086751486513, -0.02114989799948058, 0.0012833207178489167, -0.0010448062119405614, 0.03433324777511307, -0.01195624473626234, -0.015072193033493246, -0.02455495303896404, -0.01428838729617271, -0.023308575210187803, -0.014596769621799087, 0.01647276249074396, 0.007041399384323799, -0.0024445734512066122, 0.028165597537295195, 0.01863143970145119, -0.013877211172445064, -0.01037578473

<font color='green'>
    Once we have a vector representing our search term, we can determine how similar it is to the other words in our dataframe.
    <br>
We do this by computing the cosine similarity between the vector of our search term and each word embedding in our dataframe.
    </font>
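
<font color='green'>
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it is 1 for vectors pointing in the same direction, 0 for perpendicular vectors, and -1 for opposite ones. The sketch below spells the formula out in plain Python on tiny hand-made vectors; the function name cosine_similarity_plain is hypothetical and the vectors are illustrative, not real embeddings.
</font>

```python
import math

def cosine_similarity_plain(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity_plain([1.0, 0.0], [2.0, 0.0]))   # same direction -> 1.0
print(cosine_similarity_plain([1.0, 0.0], [0.0, 3.0]))   # perpendicular -> 0.0
print(cosine_similarity_plain([1.0, 0.0], [-1.0, 0.0]))  # opposite direction -> -1.0
```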

In [64]:
# The embeddings_utils module was removed from the openai package in its 1.x releases, so we define cosine similarity ourselves :)

#from openai.embeddings_utils import cosine_similarity


import numpy as np
#Helper function that computes the cosine similarity score of two vectors
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

df["similarity score"] = df['embedding'].apply(lambda x: cosine_similarity(x, text_embedding))

df

Unnamed: 0,Words,embedding,similarity score
0,Elephant,"[-0.018878312423034248, -0.008599258796030004,...",0.789476
1,Lion,"[-0.0015138392825705174, -0.00997922881147518,...",0.785808
2,Tiger,"[-0.013501134742314675, -0.009632539697974465,...",0.805853
3,Dog,"[-0.000861077619131276, -0.015141254049242174,...",0.764699
4,Cricket,"[0.0039361337947048155, -0.007287947426925405,...",0.889343
5,Footbal,"[-0.011466965783521396, -0.008091297497552554,...",0.854145
6,Tennis,"[-0.023068435609869063, 0.0016187547091597776,...",0.882801
7,Basketball,"[-0.012866201360427364, -0.013317207339922954,...",0.89949
8,Apple,"[0.01454721563243437, -0.0040025349546621005, ...",0.769005
9,Orange,"[0.0207876073455832, -0.029517869639723315, 1....",0.774834


<font color='green'>
    Sorting the dataframe by similarity score reveals that Basketball, Cricket, and Tennis are the words closest to our search term, Baseball.
    </font>

In [65]:
df.sort_values("similarity score", ascending=False).head(10)

Unnamed: 0,Words,embedding,similarity score
7,Basketball,"[-0.012866201360427364, -0.013317207339922954,...",0.89949
4,Cricket,"[0.0039361337947048155, -0.007287947426925405,...",0.889343
6,Tennis,"[-0.023068435609869063, 0.0016187547091597776,...",0.882801
5,Footbal,"[-0.011466965783521396, -0.008091297497552554,...",0.854145
2,Tiger,"[-0.013501134742314675, -0.009632539697974465,...",0.805853
10,Banana,"[-0.013083976674634977, -0.020122427894718062,...",0.804521
0,Elephant,"[-0.018878312423034248, -0.008599258796030004,...",0.789476
1,Lion,"[-0.0015138392825705174, -0.00997922881147518,...",0.785808
9,Orange,"[0.0207876073455832, -0.029517869639723315, 1....",0.774834
8,Apple,"[0.01454721563243437, -0.0040025349546621005, ...",0.769005
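
<font color='green'>
The same ranking can be reproduced without pandas. The sketch below copies the similarity scores from the outputs above into a dictionary and picks the k highest; the helper name top_k is hypothetical.
</font>

```python
import heapq

# Similarity scores to "Baseball", copied from the outputs above
scores = {
    "Elephant": 0.789476, "Lion": 0.785808, "Tiger": 0.805853, "Dog": 0.764699,
    "Cricket": 0.889343, "Footbal": 0.854145, "Tennis": 0.882801,
    "Basketball": 0.89949, "Apple": 0.769005, "Orange": 0.774834, "Banana": 0.804521,
}

def top_k(scores, k):
    """Return the k (word, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

for word, score in top_k(scores, 3):
    print(word, score)
```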
