## Objective: Text Recommendation or text similarity using Generative AI

* Problem statement: Find similar tickets using generative AI

* Approach: Use GPT4All and FAISS to embed text and compute similarity scores, which identify the most similar text.

* Algorithm: Sentence transformers for embedding, FAISS for indexing the embeddings, and a GPT4All model for text understanding

* Data Structure: Unstructured text data or columnar data

* Procedure:
    - Select the required columns and load the data with a CSV loader
    - Create embeddings for the text using a sentence transformer
    - Index all the embeddings in an in-memory FAISS vector store
    - Embed the test query and search the existing index for its nearest neighbours
    - Sort the results by similarity score and recommend the top matches
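
The procedure can be sketched end to end without any of the heavy libraries. This is only an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for the sentence transformer, the list `index` is a brute-force stand-in for FAISS, and the three tickets are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: unit-length bag-of-words counts (stand-in for a sentence transformer)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors, which equals their cosine similarity."""
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


# Hypothetical tickets standing in for the rows of the CSV file
tickets = [
    "Router was unable to resolve DNS names",
    "Printer driver fails to install on Windows",
    "DNS lookup fails on the office router",
]

# "Index" every ticket: here just a list of (row number, embedding) pairs
index = [(row, embed(text)) for row, text in enumerate(tickets)]


def recommend(query, k=2):
    """Embed the query, score it against every indexed ticket, return the top k rows."""
    q = embed(query)
    scored = sorted(((cosine(q, vec), row) for row, vec in index), reverse=True)
    return [(row, round(score, 3)) for score, row in scored[:k]]


top = recommend("router cannot resolve dns")
for row, score in top:
    print(row, score, tickets[row])
```

The real notebook replaces `embed` with a pretrained model and the linear scan with a FAISS index, but the flow (embed, index, query, sort) is the same.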

* Libraries installed: GPT4All, faiss-cpu, Hugging Face (sentence-transformers, huggingface-hub), LangChain

![img](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*tmMrwJAPRy1zKG8-k-J_xg.png)

#### Installations

In [1]:
!pip -qq install gpt4all

#### Uncomment the lines below to install the required packages; if faiss-gpu fails to install, try faiss-cpu

In [None]:
# !pip install faiss-gpu -qq
# !pip install faiss-cpu -qq
# !pip install sentence-transformers -qq
# !pip install huggingface-hub -qq
# !pip install langchain -qq

#### Import statements

In [1]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader
from langchain.document_loaders.csv_loader import CSVLoader
import pandas as pd

#### GPT4All model for ChatGPT-style text generation

In [None]:
from gpt4all import GPT4All
model = GPT4All("orca-mini-3b.ggmlv3.q4_0.bin",
                model_path="/")

#### Testing the model with sample input

In [3]:
output = model.generate("The capital of France is ", max_tokens=20)
output

'1. Paris, the city of light, is located in the northern part of the country and is'

#### Testing the model with a second sample input

In [2]:
output = model.generate("India is a country with population ", max_tokens=20)
output

'1.3 billion people, and it has a rapidly growing economy that is expected to reach $5'

In [None]:
# Another way to load the GPT4All model, via LangChain.
# Not recommended here, because a separate model_path cannot be specified.
from langchain.llms import GPT4All
llm = GPT4All(model="C:/Users/vjakkula/.cache/gpt4all/orca-mini-3b.ggmlv3.q4_0.bin")
llm("The first man on the moon was ... Let's think step by step")

In [29]:
llm("Use the anomaly detection function from the transformers library to train a model on the numpy array.")

"\n\nHere's an example code snippet:\n\n```python\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForSequenceClassification\n\n# Load the data\ndata = ... # Your data here\n\n# Define the target labels\nlabels = ... # Your labels here\n\n# Create an instance of the auto-tokenizer\ntokenizer = AutoTokenizer.from_pretrained('cardiff')\n\n# Convert the data to a numpy array\nX = tokenizer(data, return_tensors=' batch', input_shape=(None,), output_shapes=[ None, 1 ]).prepare()\ny = tokenizer(labels, return_tensors='batch', input_shape=(None,)).prepare()\n\n# Train the model on the numpy array\nmodel = AutoModelForSequenceClassification.from_pretrained('cardiff')\nmodel.fit(X, y)\n``` \n\nNote that you will need to replace `'cardiff'` with the name of your custom tokenizer and `labels` with the names of your target labels."

In [None]:
# Model-generated snippet from the output above, kept commented out for reference.
# import torch
# from transformers import AutoTokenizer, AutoModelForSequenceClassification
#
# # Load the data
# data = ...  # Your data here
#
# # Define the target labels
# labels = ...  # Your labels here
#
# # Create an instance of the auto-tokenizer
# tokenizer = AutoTokenizer.from_pretrained('cardiff')
#
# # Convert the data to a numpy array
# X = tokenizer(data, return_tensors=' batch', input_shape=(None,), output_shapes=[ None, 1 ]).prepare()
# y = tokenizer(labels, return_tensors='batch', input_shape=(None,)).prepare()
#
# # Train the model on the numpy array
# model = AutoModelForSequenceClassification.from_pretrained('cardiff')
# model.fit(X, y)

#### CSV Loader or Data loading

In [None]:
import pandas as pd
df = pd.read_csv('DATA.csv', encoding="utf8")
df

In [2]:
loader = CSVLoader(file_path='DATA.csv', encoding="utf8")
data = loader.load()
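
To make the later `metadata["row"]` lookups easier to follow, here is a rough stdlib emulation of what the CSV loader produces for each row. It assumes the loader turns every row into one document whose text is `column: value` lines and whose metadata records the source file and row number. The CSV content and the dictionary layout are hypothetical stand-ins for LangChain's `Document` objects.

```python
import csv
import io

# Hypothetical CSV content; the notebook reads DATA.csv from disk instead.
raw = "Key,Description\nT-1,Router was unable to resolve\nT-2,Printer is offline\n"

docs = []
for row_number, row in enumerate(csv.DictReader(io.StringIO(raw))):
    # One document per CSV row: every column becomes a "name: value" line
    page_content = "\n".join(f"{col}: {val}" for col, val in row.items())
    docs.append({
        "page_content": page_content,
        "metadata": {"source": "DATA.csv", "row": row_number},
    })

print(docs[1]["page_content"])
print(docs[1]["metadata"])
```

Because the row number is zero-based and follows file order, it can be used with `DataFrame.iloc` on a frame read from the same file, provided both were loaded with the same rows in the same order.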

#### Embedding the text using the GPT4All model and Hugging Face sentence transformers

In [None]:
from gpt4all import GPT4All, Embed4All
embeddings = Embed4All()

In [5]:
text = 'The quick brown fox jumps over the lazy dog'
output = embeddings.embed(text)
len(output),type(output)

(384, list)

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2',cache_folder="LLM Similar Ticket Detection/SentenceTransformers")

In [14]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="LLM Similar Ticket Detection/SentenceTransformers/sentence-transformers_all-MiniLM-L6-v2")

#### Building the FAISS index from the loaded documents and their embeddings

In [13]:
# this code will take longer to execute
db = FAISS.from_documents(data,embeddings)

#### Testing

- Take one query from the data, embed it, and search the FAISS DB with it
- Input: test query
- Output: similarity scores of the nearest documents

In [None]:
print(df.sample()["Description"].values[0])

In [None]:
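query = """ Router was unable to resolve """
# Each result is a (Document, score) pair; the score is a distance, so lower means more similar.
result_docs = db.similarity_search_with_score(query, k=10)
for idx, (doc, distance) in enumerate(result_docs):
    # if idx == 0:
    #     continue
    # Heuristic percentage: 1 - distance, which can go negative for distant matches.
    print(f"Recommendation {idx}, score = {round((1 - distance) * 100, 2)}%\n=============================\n")
    print(doc.page_content)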
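
The percentage printed above is `(1 - score) * 100`, where the score is a distance rather than a similarity. Assuming the vector store uses a flat L2 index (which reports squared L2 distances) and the embeddings are unit length, the squared distance `d` and the cosine similarity are related by `cos = 1 - d / 2`. The notebook does not verify either assumption. The sketch below checks that identity on two hypothetical vectors; `cosine_from_squared_l2` is an illustrative helper, not part of any library.

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(a * b for a, b in zip(u, v))


def cosine_from_squared_l2(d):
    """Convert a squared L2 distance between unit vectors into cosine similarity."""
    return 1.0 - d / 2.0


# Hypothetical embeddings, normalised to unit length
u = normalise([1.0, 2.0, 3.0])
v = normalise([2.0, 1.0, 0.5])

d = squared_l2(u, v)
print("squared L2 distance:", d)
print("cosine via identity:", cosine_from_squared_l2(d))
print("cosine directly:    ", cosine(u, v))
```

Under those assumptions `1 - d` understates the cosine similarity by `d / 2`, so the printed percentages are best read as a ranking aid rather than a calibrated similarity.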

#### Saving the FAISS index to local disk for later use

In [33]:
db.save_local("faiss_index")

In [5]:
# reload the saved FAISS index
new_db = FAISS.load_local("Recommendation_Streamlit_Web_App/SavedModels/faiss_index_v3", embeddings)

In [None]:
result_docs = new_db.similarity_search_with_score("something")
result_docs[0][0].metadata
# for doc in result_docs:
#     print(doc[0].page_content,"\n score = ",doc[1])
#     print(doc[0].metadata)

In [18]:
import os
base = os.path.abspath(os.path.dirname("LLM Similar Ticket Detection.ipynb"))

In [None]:
os.path.join(base,"Data","53766.csv")

In [None]:
result_docs = new_db.similarity_search_with_score(df.iloc[0,:]["Description"],k=10)
for idx,doc in enumerate(result_docs):
    if idx == 0:
        continue
    print(f"Recommendation {idx}, score = {round((1-doc[1])*100,2)}%\n=============================\n")
    print(doc[0].page_content)

In [9]:
idxes = [r[0].metadata["row"] for r in result_docs]
idxes

[0, 331, 3088, 2649, 10362, 9192, 3937, 3099, 335, 8553]

In [None]:
tkt_data1 = pd.read_excel("/Tkt_Recommendation_sep27_Datasets/ISSUE_DATA.xlsx")
tkt_data1.head(1)

#### Metrics for FAISS text similarity

Read the known-duplicates data and use it as ground truth to calculate the metrics.

In [None]:
dup_df = pd.read_csv("Tkt_Recommendation_sep27_Datasets/duplicates.csv")
dup_df.head()

In [None]:
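# Keep only tickets with known duplicates; copy so a "score" column can be added later
# without triggering pandas' SettingWithCopyWarning
ground_filtered = tkt_data1[tkt_data1["Key"].isin(dup_df["Issue"])].copy()
ground_filtered["Description"].values[0]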

In [None]:
def recommend_score(issue, query):
    """Percentage of the ground-truth keys for `issue` found among the top-5 results."""
    result_docs = new_db.similarity_search_with_score(query, k=5)
    # Each document's metadata holds the row number it was loaded from
    idxes = [r[0].metadata["row"] for r in result_docs]
    recommended_keys = tkt_data1.iloc[idxes, :]["Key"].values
    # All cells of the matching duplicates rows, flattened into one array of keys
    ground_truth_keys = dup_df[dup_df["Issue"] == issue].values.reshape(-1)
    hits = sum(1 for each in ground_truth_keys if each in recommended_keys)
    return hits / len(ground_truth_keys) * 100

recommend_score(ground_filtered["Key"].values[0], ground_filtered["Description"].values[0])
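
The metric computed by `recommend_score` is a recall-style score: the share of ground-truth keys that appear among the top-k recommendations. A minimal standalone sketch follows, with hypothetical ticket keys and a hypothetical helper name `recall_at_k`.

```python
def recall_at_k(recommended_keys, ground_truth_keys):
    """Percentage of ground-truth keys that appear in the recommended keys."""
    truth = list(ground_truth_keys)
    if not truth:
        return 0.0  # nothing to find; avoid dividing by zero
    recommended = set(recommended_keys)
    hits = sum(1 for key in truth if key in recommended)
    return 100.0 * hits / len(truth)


# Hypothetical top-5 recommendations and known duplicates for one ticket
recommended = ["T-1", "T-7", "T-9", "T-4", "T-2"]
ground_truth = ["T-1", "T-3"]

score = recall_at_k(recommended, ground_truth)
print(score)  # 50.0: one of the two ground-truth keys was recommended
```

If each row of `duplicates.csv` holds the issue key alongside its duplicate (an assumption, since the file's columns are not shown here), the flattened ground truth includes the queried ticket itself. A search over the same index will usually return that ticket, so scores would rarely fall below 50.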

In [None]:
ground_filtered["score"] = [recommend_score(row["Key"], row["Description"]) for _, row in ground_filtered[["Key", "Description"]].iterrows()]
ground_filtered

In [None]:
ground_filtered["score"].value_counts()

In [38]:
ground_filtered["score"].sum()/ground_filtered["score"].shape[0]

69.26020408163265

### Version 2 of FAISS index

In [None]:
df = pd.read_excel("Tkt_Recommendation_sep27_Datasets/ISSUE_DATA.xlsx")
# df.to_csv("complete_BR_ISSUE_DATA.csv",index=False)
df

In [6]:
loader = CSVLoader(file_path='complete_BR_ISSUE_DATA.csv', encoding="utf8")
data = loader.load()

In [3]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

In [8]:
db = FAISS.from_documents(data,embeddings)

In [9]:
db.save_local("faiss_index_v2")

#### Build FAISS index V4 (summary-only data)

In [12]:
# df = pd.read_csv("complete_BR_ISSUE_DATA.csv")
# df["Summary"].to_csv("BR_ISSUE_DATA_only_Summary.csv",index=False)
loader = CSVLoader(file_path='LLM Similar Ticket Detection/ISSUE_DATA_only_Summary.csv', encoding="utf8")
data = loader.load()
db = FAISS.from_documents(data,embeddings)
db.save_local("faiss_index_v4_only_summary")

In [10]:
def recommend_score(issue,query):
    result_docs = db.similarity_search_with_score(query,k=5)
    idxes = [r[0].metadata["row"] for r in result_docs]
    recommended_keys = tkt_data1.iloc[idxes,:]["Key"].values
    ground_truth_keys = dup_df[dup_df["Issue"]==issue].values.reshape(-1)
    return ((sum([1 if each in recommended_keys else 0 for each in ground_truth_keys]))/(len(ground_truth_keys)))*100
recommend_score(ground_filtered["Key"].values[0],ground_filtered["Description"].values[0])

50.0

In [None]:
ground_filtered["score"] = [recommend_score(row["Key"], row["Description"]) for _, row in ground_filtered[["Key", "Description"]].iterrows()]
ground_filtered

In [16]:
ground_filtered["score"].value_counts()

score
50.000000     253
100.000000    165
75.000000      18
0.000000        5
83.333333       4
62.500000       1
58.333333       1
66.666667       1
Name: count, dtype: int64

In [17]:
ground_filtered["score"].sum()/ground_filtered["score"].shape[0]

69.24293154761905
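
As a sanity check, the mean above can be reproduced from the `value_counts()` output alone, as a count-weighted average of the distinct scores. The numbers are copied from the output shown; only the variable names are new.

```python
# Distinct scores and how many tickets received each, from value_counts() above
score_counts = {
    50.0: 253,
    100.0: 165,
    75.0: 18,
    0.0: 5,
    83.333333: 4,
    62.5: 1,
    58.333333: 1,
    66.666667: 1,
}

n_tickets = sum(score_counts.values())
mean_score = sum(score * count for score, count in score_counts.items()) / n_tickets

print(n_tickets)             # 448
print(round(mean_score, 4))  # 69.2429
```

More than half of the 448 evaluated tickets sit at exactly 50, so the mean is driven mostly by how many tickets reach 100.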

In [None]:
result_docs = db.similarity_search_with_score(df.iloc[0,:]["Description"],k=10)
for idx,doc in enumerate(result_docs):
    # doc[1] is the raw distance returned by the search (lower means more similar), not a percentage
    print(f"Recommendation {idx}, distance = {doc[1]}\n=============================\n")
    print(doc[0].page_content)

### Version 3 of FAISS index

In [2]:
loader = CSVLoader(file_path='ISSUE_DATA.csv', encoding="utf8")
data = loader.load()

In [3]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name='paraphrase-distilroberta-base-v1')

In [4]:
db = FAISS.from_documents(data,embeddings)

In [5]:
db.save_local("faiss_index_v3")

In [None]:
ground_filtered["score"] = [recommend_score(row["Key"], row["Description"]) for _, row in ground_filtered[["Key", "Description"]].iterrows()]
ground_filtered

In [12]:
ground_filtered["score"].sum()/ground_filtered["score"].shape[0]

57.52152423469387

### Version 4 of FAISS index

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name='paraphrase-xlm-r-multilingual-v1')
db_v4 = FAISS.from_documents(data,embeddings)

def recommend_score(issue,query,db):
    result_docs = db.similarity_search_with_score(query,k=5)
    idxes = [r[0].metadata["row"] for r in result_docs]
    recommended_keys = tkt_data1.iloc[idxes,:]["Key"].values
    ground_truth_keys = dup_df[dup_df["Issue"]==issue].values.reshape(-1)
    return ((sum([1 if each in recommended_keys else 0 for each in ground_truth_keys]))/(len(ground_truth_keys)))*100

ground_filtered["score"] = [recommend_score(row["Key"], row["Description"], db_v4) for _, row in ground_filtered[["Key", "Description"]].iterrows()]
ground_filtered

In [5]:
import pandas as pd
links = pd.read_csv("LLM Similar Ticket Detection/Data/links.csv")
cr = "CR-1159059"
links[links["BiraKey"]==cr]["LinkBiraKey"].values
links.shape

(5376, 2)