## We still need to check on:
- A larger Sentence Transformer model
- Other embedding methods (Word2Vec, GloVe, FastText, etc.) for "better" embeddings

In [66]:
import pandas as pd

In [67]:
# Read the CSV file into a Pandas DataFrame
df = pd.read_csv("./dataset/cleaned_data/reviews.csv")

In [68]:
df.shape

(542460, 10)

In [69]:
# Create a new DataFrame with only the first 100,000 rows
df_100k = df.head(100000)
df_100k

Unnamed: 0,review_id,reviewer,movie,rating,review_summary,review_date,spoiler_tag,review_detail,helpful,roberta_sentiment
0,rw0213367,SUPERNOVA HEIGHTS,A Nightmare on Elm Street (1984),8.0,A Classic of the horror films History,27 August 2003,0,This is the beginning of a great horror film s...,"['0', '0']",Positive
1,rw0213369,kibler@adelphia.net,A Nightmare on Elm Street (1984),,Better than your average horror movie,1 September 2003,0,"Nightmare on Elm Street, A (1984) John Saxon, ...","['0', '0']",Negative
2,rw0213371,matthew87,A Nightmare on Elm Street (1984),,good slasher flick,1 September 2003,0,"the best freddy film period,1 because the horr...","['0', '0']",Strongly Positive
3,rw0213375,rossrobinson,A Nightmare on Elm Street (1984),10.0,A nightmare on elm st part 1,27 September 2003,0,I remember seeing a nightmare on elm street pa...,"['0', '2']",Negative
4,rw0213376,Andres24,A Nightmare on Elm Street (1984),9.0,Highway to hell,3 October 2003,0,"It's a nightmare. If Nancy falls asleep, they...","['1', '1']",Positive
...,...,...,...,...,...,...,...,...,...,...
99995,rw3473248,burlesonjesse5,Neighbors 2: Sorority Rising (2016),5.0,VIEWS ON FILM review of Neighbors 2: Sorority ...,20 May 2016,1,"If you've read most of my reviews, you'll know...","['2', '9']",Neutral
99996,rw3473254,PyroSikTh,X-Men: Apocalypse (2016),7.0,The Third Movie's Not Always the Worst,18 May 2016,1,While Days of Future Past played on our nostal...,"['4', '10']",Neutral
99997,rw3473256,dvc5159,Eye in the Sky (2015),7.0,Triple Cross,18 May 2016,0,What is the value of a single human life? That...,"['2', '6']",Neutral
99998,rw3473258,dvc5159,Before I Wake (2016),8.0,A Human Heart to Go with the Scares,18 May 2016,0,It is infinitely better to center horror films...,"['11', '23']",Neutral


In [70]:
# Create a new DataFrame with only the first 5 rows
df_5 = df.head(5)
df_5

Unnamed: 0,review_id,reviewer,movie,rating,review_summary,review_date,spoiler_tag,review_detail,helpful,roberta_sentiment
0,rw0213367,SUPERNOVA HEIGHTS,A Nightmare on Elm Street (1984),8.0,A Classic of the horror films History,27 August 2003,0,This is the beginning of a great horror film s...,"['0', '0']",Positive
1,rw0213369,kibler@adelphia.net,A Nightmare on Elm Street (1984),,Better than your average horror movie,1 September 2003,0,"Nightmare on Elm Street, A (1984) John Saxon, ...","['0', '0']",Negative
2,rw0213371,matthew87,A Nightmare on Elm Street (1984),,good slasher flick,1 September 2003,0,"the best freddy film period,1 because the horr...","['0', '0']",Strongly Positive
3,rw0213375,rossrobinson,A Nightmare on Elm Street (1984),10.0,A nightmare on elm st part 1,27 September 2003,0,I remember seeing a nightmare on elm street pa...,"['0', '2']",Negative
4,rw0213376,Andres24,A Nightmare on Elm Street (1984),9.0,Highway to hell,3 October 2003,0,"It's a nightmare. If Nancy falls asleep, they...","['1', '1']",Positive


# Initialize Retriever

A retriever model is used to embed passages and queries, and it creates embeddings such that queries and passages with similar meanings are close in the vector space. We will use a sentence-transformer model as our retriever. The model can be loaded as follows:

In [71]:
from sentence_transformers import SentenceTransformer
import torch

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load the model from huggingface
# smaller model: all-MiniLM-L6-v2
# word_embedding_dimension: 384
retriever = SentenceTransformer(
    'sentence-transformers/all-MiniLM-L6-v2',
    device=device
)
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
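
The `Normalize()` layer in the printout above scales each embedding to unit length, so the cosine similarity between two embeddings reduces to their dot product. The sketch below illustrates this with hand-made 3-dimensional vectors using only the standard library. The vectors are invented stand-ins for the 384-dimensional model output, and the helper names are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length, as a normalization layer would."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors, invented for illustration only
query = [1.0, 2.0, 0.5]
similar_passage = [0.9, 2.1, 0.4]
unrelated_passage = [-2.0, 0.3, 1.5]

q = normalize(query)
s = normalize(similar_passage)
u = normalize(unrelated_passage)

# For unit-length vectors the dot product equals the cosine similarity
print(math.isclose(dot(q, s), cosine_similarity(query, similar_passage)))  # → True
# The passage pointing in a similar direction scores higher
print(dot(q, s) > dot(q, u))  # → True
```

This is why the `cosine` metric is a natural fit for the index that stores these embeddings.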

# Initialize Pinecone Index

Now we need to initialize our Pinecone index. The index stores vector representations of our passages, which we can retrieve using another vector (the query vector). We first need to initialize our connection to Pinecone, which requires a free [API key](https://app.pinecone.io/). We initialize the connection like so:

In [72]:
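# STORE VECTOR EMBEDDINGS
import os
import pinecone

# connect to pinecone environment
# the key is read from an environment variable (the name PINECONE_API_KEY is an
# assumption) so that the secret is not hard-coded in the notebook
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="gcp-starter"  # find next to API key in console
)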

Now we can create our vector index. We will name it `intent-cluster` (feel free to choose any name you prefer). We specify the metric as `cosine` and the dimension as `384`, matching the similarity measure and the dimensionality of the vectors generated by the retriever model.

In [73]:
index_name = "intent-cluster"

# check if the index exists
if index_name not in pinecone.list_indexes():
    # create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=384,
        metric="cosine"
    )

# connect to the index we created
index = pinecone.Index(index_name)
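
An index created with `dimension=384` only accepts 384-dimensional vectors, so a quick length check before upserting can catch a mismatch early, for example after switching to a different embedding model. A minimal sketch; `check_dimensions` is a hypothetical helper, and the sample vectors are zero-filled placeholders rather than real embeddings.

```python
EXPECTED_DIM = 384  # same value as the dimension passed to create_index

def check_dimensions(vectors, expected_dim=EXPECTED_DIM):
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]

# Placeholder vectors: the middle one has the wrong length
sample = [[0.0] * 384, [0.0] * 768, [0.0] * 384]
print(check_dimensions(sample))  # → [1]
```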

# Generate Embeddings and Upsert

We generate embeddings for the reviews in the dataset. Alongside the embeddings, we also store the sentiment label (the `roberta_sentiment` column) in the Pinecone index as metadata. Later we will use this data to understand reviewers' opinions.



We need to convert the review dates to timestamps so that query results can be filtered by period. This is helpful if you want to understand sentiment over a specific period. Let's write a helper function to convert dates to timestamps.

Convert TO timestamp

In [74]:
import dateutil.parser
# convert dates to timestamps
def get_timestamp(date_list):
    timestamps = [dateutil.parser.parse(date).timestamp() for date in date_list]
    return timestamps

# timestamps_result = get_timestamp(dates)
# print(timestamps_result)
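
Storing a numeric `timestamp` in the metadata is what makes period filters possible, since Pinecone metadata filters support range operators such as `$gte` and `$lt` on numbers. The sketch below builds such a filter for one calendar year. `year_filter` is a hypothetical helper, and it uses naive local-time datetimes on the assumption that the stored values were produced from timezone-free date strings, as in `get_timestamp`.

```python
from datetime import datetime

def year_filter(year):
    """Build a Pinecone-style metadata filter covering one calendar year.

    Naive datetimes are interpreted in local time by .timestamp(), which
    mirrors how timezone-free date strings are converted by get_timestamp.
    """
    start = datetime(year, 1, 1).timestamp()
    end = datetime(year + 1, 1, 1).timestamp()
    return {"timestamp": {"$gte": start, "$lt": end}}

period = year_filter(2003)
print(period)

# Once the index is populated and a query vector xq is available,
# the filter would be passed along with the query, e.g.:
# result = index.query(xq, top_k=500, include_metadata=True, filter=period)
```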

Convert FROM timestamp

In [75]:
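from datetime import datetime, timezone

timestamp = 1061956800.0

# Convert timestamp to a timezone-aware UTC datetime object
# (datetime.utcfromtimestamp is deprecated since Python 3.12)
dt_object = datetime.fromtimestamp(timestamp, tz=timezone.utc)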

# Format the datetime object as a string
formatted_date = dt_object.strftime('%Y-%m-%d %H:%M:%S')

print(formatted_date)

2003-08-27 04:00:00
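
The result is 04:00 rather than midnight. A likely explanation (an assumption, since the notebook does not record the machine's timezone) is that the date was parsed into a naive datetime, which `.timestamp()` then interpreted in a UTC-4 local timezone. The sketch below pins dates to UTC so the timestamps do not depend on the machine. `to_utc_timestamp` is a hypothetical helper, and the `%d %B %Y` format is inferred from the `review_date` values shown earlier.

```python
from datetime import datetime, timezone

def to_utc_timestamp(date_str, fmt="%d %B %Y"):
    """Parse a date like '27 August 2003' and return its UTC-midnight timestamp."""
    parsed = datetime.strptime(date_str, fmt)
    return parsed.replace(tzinfo=timezone.utc).timestamp()

ts = to_utc_timestamp("27 August 2003")
print(ts)  # → 1061942400.0

# Converting back in UTC recovers midnight on the same day
back = datetime.fromtimestamp(ts, tz=timezone.utc)
print(back.strftime("%Y-%m-%d %H:%M:%S"))  # → 2003-08-27 00:00:00
```

Whichever convention is chosen, the same one should be used both when storing timestamps and when building period filters.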


In [76]:
# df["review_date"][0]

In [77]:
# get_timestamp([df["review_date"][0]])[0]

In [78]:
# selected_columns = ['review_detail', 'review_date','roberta_sentiment']
# batch = df_100k[selected_columns]
# batch.head()

In [79]:
# NOTE: despite its name, df_50k holds only the first 10 rows (a quick test sample)
df_50k = df.head(10)

In [80]:
from tqdm.auto import tqdm


# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(df_50k), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(df_50k))

    # columns to keep for the embeddings and the metadata
    selected_columns = ['review_detail', 'review_date', 'roberta_sentiment']

    # extract only the rows of this batch; .copy() avoids pandas'
    # SettingWithCopyWarning when columns are added below
    batch = df_50k.iloc[i:i_end][selected_columns].copy()
    # generate embeddings for batch
    emb = retriever.encode(batch["review_detail"].tolist()).tolist()
    # convert review_date to timestamp to enable period filters
    batch["timestamp"] = get_timestamp(batch["review_date"].tolist())
    # get sentiment label and score for reviews in the batch
    # label, score = get_sentiment(batch["review"].tolist())
    batch["label"] = batch["roberta_sentiment"]
    # batch["score"] = score
    # get metadata
    meta = batch.to_dict(orient="records")
    # create unique IDs
    ids = [f"{idx}" for idx in range(i, i_end)]
    # add all to upsert list
    to_upsert = list(zip(ids, emb, meta))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)
# print(meta)
# print(len(meta))
# check that we have all vectors in index
# (the reported count may lag briefly behind a fresh upsert)
index.describe_index_stats()



{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
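
The loop above walks the row range in steps of `batch_size` and builds the IDs for each step from `range(i, i_end)`, so each batch of rows must cover exactly that range. The arithmetic can be checked on its own, without a model or an index. `batch_bounds` is a hypothetical helper introduced only for this illustration.

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, end) pairs that cover range(n_rows) in order."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

# 10 rows fit in a single batch of 64
print(list(batch_bounds(10, 64)))   # → [(0, 10)]
# 130 rows need three batches, the last one partial
print(list(batch_bounds(130, 64)))  # → [(0, 64), (64, 128), (128, 130)]

# IDs built per batch stay unique across the whole dataset
ids = [str(idx) for start, end in batch_bounds(130, 64) for idx in range(start, end)]
print(len(ids) == len(set(ids)) == 130)  # → True
```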

# Opinion Mining

Now that we have the reviews indexed, we can search for topics that reviewers mention and analyze the general opinion about them. The Pinecone vector database makes this very flexible, as we can search for any topic and get the reviews relevant to the search query along with their sentiment labels as metadata.

We will start with a simple query, `nightmare`, and request the top 500 matching reviews to analyze the overall sentiment.

In [85]:
query = "nightmare"
# generate dense vector embeddings for the query
xq = retriever.encode(query).tolist()
# query pinecone
result = index.query(xq, top_k=500, include_metadata=True)
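
Once the matches come back, the overall sentiment for a query can be summarised by counting the `label` values in the metadata. The sketch below uses a small hand-written stand-in for the query result (the labels are invented for illustration, not real query output) and assumes each match exposes a `metadata` dictionary with a `label` key, as set in the upsert step. `count_sentiment` is a hypothetical helper.

```python
from collections import Counter

def count_sentiment(matches):
    """Count how often each sentiment label occurs among query matches."""
    return Counter(match["metadata"]["label"] for match in matches)

# Hand-written stand-in for result["matches"]; values are illustrative only
fake_matches = [
    {"id": "0", "metadata": {"label": "Positive"}},
    {"id": "1", "metadata": {"label": "Negative"}},
    {"id": "4", "metadata": {"label": "Positive"}},
]

counts = count_sentiment(fake_matches)
print(counts.most_common())  # → [('Positive', 2), ('Negative', 1)]
```

With real output, the same function would be called as `count_sentiment(result["matches"])`.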

In [86]:
result

{'matches': [{'id': '1',
              'metadata': {'label': 'Negative',
                           'review_date': datetime.datetime(2003, 9, 1, 0, 0),
                           'review_detail': 'Nightmare on Elm Street, A (1984) '
                                            'John Saxon, Ronee Blakley, '
                                            'Heather Langenkamp, Amanda Wyss, '
                                            'Nick Corri, Johnny Depp, Robert '
                                            'England, Charles Fleischer, Lin '
                                            'Shaye, D: Wes Craven. In the '
                                            'slumber of his victims, a '
                                            'facially burned killer, Fred '
                                            'Krueger, with finger knives '
                                            'plagues and kills the children of '
                                            'the adults who set him ablaze