### Explore Amazon Reviews with LangChain and Pinecone

Goal: Build a simple customer experience (CX) analytics proof of concept (PoC) using LangChain, Pinecone, and Hugging Face embeddings.

### Import libraries

In [1]:
import os
import json
import gzip
import pandas as pd

from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

True

### Load data

In [3]:
# Reviews: one JSON object per line
data = []
with gzip.open('AMAZON_FASHION.json.gz') as f:
    for line in f:
        data.append(json.loads(line.strip()))

# Product metadata, in the same line-delimited format
metadata = []
with gzip.open('meta_AMAZON_FASHION.json.gz') as f:
    for line in f:
        metadata.append(json.loads(line.strip()))
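
Both files are read the same way, so the loop can be factored into one helper. The following is a minimal sketch, assuming each file is gzipped JSON Lines encoded as UTF-8; `read_json_lines` is a hypothetical name, not part of the original notebook.

```python
import gzip
import json
from typing import Iterator


def read_json_lines(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a gzipped JSON Lines file."""
    # "rt" opens the archive in text mode, so each line is already a str.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines instead of failing in json.loads
                yield json.loads(line)
```

With this helper, the two loads would become `data = list(read_json_lines('AMAZON_FASHION.json.gz'))` and `metadata = list(read_json_lines('meta_AMAZON_FASHION.json.gz'))`.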

### Read data into dataframe

In [4]:
df = pd.DataFrame.from_dict(data)
df = df[df['reviewText'].notna()]

df_meta = pd.DataFrame.from_dict(metadata)

In [5]:
df_meta.head()

Unnamed: 0,title,brand,feature,rank,date,asin,imageURL,imageURLHighRes,description,price,also_view,also_buy,fit,details,similar_item,tech1
0,Slime Time Fall Fest [With CDROM and Collector...,Group Publishing (CO),[Product Dimensions:\n \n8....,"13,052,976inClothing,Shoesamp;Jewelry(",8.70 inches,764443682,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,,,,,,,,
1,XCC Qi promise new spider snake preparing men'...,,,"11,654,581inClothing,Shoesamp;Jewelry(",5 star,1291691480,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,,,,,,,,
2,Magical Things I Really Do Do Too!,Christopher Manos,[Package Dimensions:\n \n8....,"19,308,073inClothing,ShoesJewelry(",5 star,1940280001,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,[For the professional or amateur magician. Ro...,,,,,,,
3,"Ashes to Ashes, Oranges to Oranges",Flickerlamp Publishing,[Package Dimensions:\n \n8....,"19,734,184inClothing,ShoesJewelry(",5 star,1940735033,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,,,,,,,,
4,Aether & Empire #1 - 2016 First Printing Comic...,,[Package Dimensions:\n \n10...,"10,558,646inClothing,Shoesamp;Jewelry(",5 star,1940967805,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,,$4.50,,,,,,


### Truncate reviews that are too long

In [6]:
max_text_length = 400

def truncate_review(text):
    # Keep at most the first max_text_length characters
    return text[:max_text_length]

df['truncated'] = df['reviewText'].apply(truncate_review)
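
A fixed character cut can end in the middle of a word. If that matters for the embeddings, one option is to back up to the last whitespace before the limit. This is only a sketch of that variant; `truncate_at_word` is a hypothetical helper and the notebook itself uses the plain slice.

```python
def truncate_at_word(text: str, max_length: int = 400) -> str:
    """Cut text to at most max_length characters without splitting a word."""
    if len(text) <= max_length:
        return text
    cut = text[:max_length]
    next_char = text[max_length]
    # A partial word exists only when the cut falls between two non-space characters.
    if not next_char.isspace() and not cut[-1:].isspace():
        parts = cut.rsplit(None, 1)  # split off the trailing partial word
        if len(parts) == 2:  # keep the hard cut when there is no earlier whitespace
            cut = parts[0]
    return cut.rstrip()


print(truncate_at_word("comfortable insoles for work boots", 15))  # prints: comfortable
```

It could be dropped in with `df['reviewText'].apply(truncate_at_word)`; the result is never longer than the plain slice.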

In [7]:
# Find a product with a lot of reviews
df.groupby('asin').count().sort_values('overall', ascending=False).head()

asin,overall,verified,reviewTime,reviewerID,reviewerName,reviewText,summary,unixReviewTime,vote,style,image,truncated
B000V0IBDM,4380,4380,4380,4380,4379,4380,4378,4380,0,0,38,4380
B000KPIHQ4,4371,4371,4371,4371,4370,4371,4369,4371,193,3346,38,4371
B00I0VHS10,3884,3884,3884,3884,3884,3884,3880,3884,128,3872,107,3884
B00RLSCLJM,3633,3633,3633,3633,3633,3633,3632,3633,225,3538,210,3633
B000PHANNM,2566,2566,2566,2566,2566,2566,2563,2566,85,2563,112,2566


In [11]:
# Extract the product

df_meta[df_meta.asin=='B000KPIHQ4'].values

array([['Powerstep Pinnacle Orthotic Shoe Insoles', nan,
        list(['Shipping Information:\n                    \nView shipping rates and policies']),
        '154inClothing,Shoesamp;Jewelry(', '5 star', 'B000KPIHQ4',
        list(['https://images-na.ssl-images-amazon.com/images/I/414VFpnmvjL._US40_.jpg', 'https://images-na.ssl-images-amazon.com/images/I/51yLLxuD5%2BL._US40_.jpg', 'https://images-na.ssl-images-amazon.com/images/I/51NJmYTkeiL._US40_.jpg', 'https://images-na.ssl-images-amazon.com/images/I/41VRUCCVKEL._US40_.jpg', 'https://images-na.ssl-images-amazon.com/images/I/51b-GUTXm0L._US40_.jpg', 'https://images-na.ssl-images-amazon.com/images/I/41mORzqQTwL._US40_.jpg', 'https://images-na.ssl-images-amazon.com/images/I/61RHVYCqQcL._US40_.jpg']),
        list(['https://images-na.ssl-images-amazon.com/images/I/414VFpnmvjL.jpg', 'https://images-na.ssl-images-amazon.com/images/I/51yLLxuD5%2BL.jpg', 'https://images-na.ssl-images-amazon.com/images/I/51NJmYTkeiL.jpg', 'https://images-

### Create embedding vectors

In [13]:
df = df.loc[df['asin'] == 'B000KPIHQ4'].copy() # copy to avoid SettingWithCopyWarning

In [14]:
df.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image,truncated
11218,3.0,True,"09 26, 2007",A1CIM0XZ3UA926,B000KPIHQ4,M. Cane,"Good price, good product. Howver, it is generi...",Orthotics off the rack,1190764800,2.0,"{'Size Name:': ' Men's 5-5.5, Women's 7-7.5', ...",,"Good price, good product. Howver, it is generi..."
11219,5.0,True,"01 18, 2007",A1EVVPCWRW5YYZ,B000KPIHQ4,Deborah Morris,My husband rates these insoles a 5 for comfort...,Very comfortable,1169078400,3.0,"{'Size Name:': ' Men's 10-10.5, Women's 12', '...",,My husband rates these insoles a 5 for comfort...
11220,5.0,True,"05 18, 2018",A2P3NZ9H4PANK0,B000KPIHQ4,Stephanie,I have worn the Powerstep Pinnacle shoe insole...,... Pinnacle shoe insoles for the past 5 years...,1526601600,,"{'Size Name:': ' Men's 6-6.5, Women's 8-8.5', ...",,I have worn the Powerstep Pinnacle shoe insole...
11221,1.0,True,"05 18, 2018",A2975GY186VV7A,B000KPIHQ4,jessica etim,Very uncomfortable feel like I wasted my money!,Uncomfortable,1526601600,,"{'Size Name:': ' Men's 7-7.5, Women's 9-9.5', ...",,Very uncomfortable feel like I wasted my money!
11222,5.0,True,"05 17, 2018",A3U8E58RIKWDAW,B000KPIHQ4,Nancy Mazzuca,work perfect,Five Stars,1526515200,,"{'Size Name:': ' Men's 9-9.5, Women's 11-11.5'...",,work perfect


In [15]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()


In [16]:
# Use the local Hugging Face model instead of OpenAI embeddings to avoid API costs

df['embeddings'] = df['truncated'].apply(embeddings.embed_query)  # takes ~3 min on a CPU
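
Embedding one review per call is simple but gives the model no chance to process several texts at once. A possible alternative is to pass the texts in batches to a function such as `embeddings.embed_documents`, which takes a list of strings. The sketch below assumes such a callable; `embed_in_batches` and `fake_embed_documents` are hypothetical names, and whether batching is actually faster depends on the model and hardware.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_documents: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed texts in fixed-size batches and return the vectors in input order."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_documents(batch))
    return vectors


# Stand-in embedder for illustration only: maps each text to [length, word count].
def fake_embed_documents(batch: List[str]) -> List[List[float]]:
    return [[float(len(t)), float(len(t.split()))] for t in batch]


print(embed_in_batches(["work perfect", "great"], fake_embed_documents, batch_size=1))
# prints: [[12.0, 2.0], [5.0, 1.0]]
```

In the notebook this might be used as `df['embeddings'] = embed_in_batches(df['truncated'].tolist(), embeddings.embed_documents)`.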

In [17]:
df.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image,truncated,embeddings
11218,3.0,True,"09 26, 2007",A1CIM0XZ3UA926,B000KPIHQ4,M. Cane,"Good price, good product. Howver, it is generi...",Orthotics off the rack,1190764800,2.0,"{'Size Name:': ' Men's 5-5.5, Women's 7-7.5', ...",,"Good price, good product. Howver, it is generi...","[-0.007861517369747162, -0.00678021926432848, ..."
11219,5.0,True,"01 18, 2007",A1EVVPCWRW5YYZ,B000KPIHQ4,Deborah Morris,My husband rates these insoles a 5 for comfort...,Very comfortable,1169078400,3.0,"{'Size Name:': ' Men's 10-10.5, Women's 12', '...",,My husband rates these insoles a 5 for comfort...,"[-0.07544193416833878, 0.025455137714743614, -..."
11220,5.0,True,"05 18, 2018",A2P3NZ9H4PANK0,B000KPIHQ4,Stephanie,I have worn the Powerstep Pinnacle shoe insole...,... Pinnacle shoe insoles for the past 5 years...,1526601600,,"{'Size Name:': ' Men's 6-6.5, Women's 8-8.5', ...",,I have worn the Powerstep Pinnacle shoe insole...,"[-0.06397590041160583, 0.012907425872981548, -..."
11221,1.0,True,"05 18, 2018",A2975GY186VV7A,B000KPIHQ4,jessica etim,Very uncomfortable feel like I wasted my money!,Uncomfortable,1526601600,,"{'Size Name:': ' Men's 7-7.5, Women's 9-9.5', ...",,Very uncomfortable feel like I wasted my money!,"[-0.009998583234846592, -0.05696876719594002, ..."
11222,5.0,True,"05 17, 2018",A3U8E58RIKWDAW,B000KPIHQ4,Nancy Mazzuca,work perfect,Five Stars,1526515200,,"{'Size Name:': ' Men's 9-9.5, Women's 11-11.5'...",,work perfect,"[-0.01534116081893444, -0.005922064650803804, ..."


### Train a simple random forest regressor with scikit-learn

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X_train, X_test, y_train, y_test = train_test_split(
    list(df.embeddings.values),
    df.overall,
    test_size=0.2,
    random_state=1
)

In [19]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=150, random_state=1, n_jobs=-1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [20]:
# Evaluate: mean absolute error (in stars) on the held-out test set
mean_absolute_error(y_test, y_pred)

0.538682258969592
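
An MAE of roughly 0.54 stars is easier to judge next to a trivial baseline, for example always predicting the median training rating. The sketch below shows how such a baseline could be computed in plain Python; the function names and the sample ratings are made up for illustration and are not results from this dataset.

```python
from statistics import median
from typing import Sequence


def mean_absolute_error_py(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def constant_baseline_mae(y_train: Sequence[float], y_test: Sequence[float]) -> float:
    """MAE of always predicting the training-set median rating."""
    guess = median(y_train)
    return mean_absolute_error_py(y_test, [guess] * len(y_test))


# Made-up ratings for illustration only (not taken from the dataset).
train_ratings = [5.0, 5.0, 4.0, 1.0, 5.0]
test_ratings = [5.0, 3.0, 1.0, 5.0]
print(constant_baseline_mae(train_ratings, test_ratings))  # prints: 1.5
```

Running `constant_baseline_mae(list(y_train), list(y_test))` on the notebook's split would show how much of the model's score comes from the embeddings rather than from the skew in the ratings.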

### Load embeddings into the vector database

In [22]:
import pinecone
from langchain.vectorstores.pinecone import Pinecone

# Initialize Pinecone
pinecone.init(
    api_key=os.environ.get("PINECONE_API_KEY"),
    environment=os.environ.get("PINECONE_ENVIRONMENT"),
)

In [23]:
texts = df['truncated'].tolist()

In [24]:
texts[:5]

['Good price, good product. Howver, it is generic and if you really need orthotics, best to have them individually fitted. These are a good value.',
 "My husband rates these insoles a 5 for comfort. He hasn't noticed any improvment as far as leg or foot pain and has wore them consistantly since Christmas. The owner of the Red Wing store where we get his work boots highly recommended them and he was right. They make heavy, steel toed work boots more bearable. Can't say they will cure or even help with orthopedic problems though. Guess you need to",
 'I have worn the Powerstep Pinnacle shoe insoles for the past 5 years and love them.  They are so comfortable and since I have been wearing them have had no foot pain or other discomfort.',
 'Very uncomfortable feel like I wasted my money!',
 'work perfect']

In [28]:
# Upload embeddings to Pinecone

vstore = Pinecone.from_texts(texts, embeddings, index_name="cxanalytics")

In [30]:
# Run a basic similarity search to confirm that the embeddings were uploaded correctly

query = "I love this product"
result = vstore.similarity_search(query, k=4)  # k sets the number of documents returned

result

[Document(page_content='Love this product', metadata={}),
 Document(page_content='I like this product', metadata={}),
 Document(page_content='I like this product a lot', metadata={}),
 Document(page_content='great product', metadata={})]
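
Conceptually, the search embeds the query and returns the stored texts whose vectors are closest to it. The sketch below illustrates that idea with a brute-force cosine-similarity ranking over a toy in-memory index. It assumes a cosine metric (the metric of the `cxanalytics` index is not shown in this notebook), uses made-up two-dimensional vectors, and says nothing about how Pinecone implements the lookup internally.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 4,
) -> List[Tuple[str, float]]:
    """Return the k (text, score) pairs with the highest cosine similarity."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
toy_index = {
    "Love this product": [0.9, 0.1],
    "great product": [0.8, 0.3],
    "Very uncomfortable": [-0.7, 0.6],
}
print([text for text, _ in most_similar([1.0, 0.0], toy_index, k=2)])
# prints: ['Love this product', 'great product']
```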

### Let the LLM access data in the vector store

In [47]:
# Connect to OpenAI

openai_api_key = os.environ.get("OPENAI_API_KEY")

In [None]:
# Test the connection by listing the available models

import openai
openai.Model.list()

In [64]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

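# temperature=0 keeps the answers as deterministic as possible
chat = ChatOpenAI(temperature=0, openai_api_key=openai_api_key)
# "stuff" places all retrieved reviews into a single prompt for the LLM
review_chain = RetrievalQA.from_chain_type(llm=chat, chain_type="stuff", retriever=vstore.as_retriever())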
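
The "stuff" chain type works by retrieving a handful of documents and placing all of them into one prompt together with the question. The sketch below mimics that assembly step in plain Python. The template wording and the `build_stuff_prompt` name are illustrative assumptions, not LangChain's actual prompt or API.

```python
from typing import List


def build_stuff_prompt(question: str, documents: List[str]) -> str:
    """Concatenate all retrieved documents into one context block ahead of the question."""
    context = "\n\n".join(doc.strip() for doc in documents)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


# Two short reviews taken from the sample output earlier in the notebook.
print(build_stuff_prompt("What do users like?", ["work perfect", "Very uncomfortable feel like I wasted my money!"]))
```

One consequence of this design is that the LLM only sees the retrieved reviews, not the full set. With default settings that is likely just a few documents (four, judging by the similarity-search output above), so a summary question may benefit from a larger retrieval size, for example `vstore.as_retriever(search_kwargs={"k": 20})`, as long as the combined text still fits the model's context window.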


In [65]:
q = """
The reviews you see are for a product called 'Powerstep Pinnacle Orthotic Shoe Insoles'.
What is the overall impression of these reviews? Give the most prevalent examples in bullet points.
What do you suggest we focus on improving?
"""

In [66]:
result = review_chain.run(q)
print(result)

The overall impression of these reviews is mixed. Some users found the insoles to be helpful and effective, while others did not notice any improvement. Here are the most prevalent examples:

Positive:
- Helped considerably with foot pain
- Good for moderate support
- Corrected metatarsal pain issues
- Top-notch quality
- Solid construction and feel

Negative:
- Did not do anything for foot pain
- Not suitable for high arches
- Did not provide enough support
- Did not fit well

Based on these reviews, it seems that the effectiveness of the insoles varies depending on the user's specific foot issues. To improve, it may be helpful to provide more detailed information about which foot problems each type of insole is designed to address, so users can make a more informed decision.
