***Generating Personalized Emails with Amazon Bedrock and Amazon Personalize***


In this notebook, we're building a product recommendation engine that can provide personalized recommendations to users based on their past ratings and reviews. The recommendation engine will leverage machine learning algorithms to analyze the user's behavior and preferences, and then suggest relevant products that they might be interested in.


In [1]:
import time
from time import sleep
import json
from datetime import datetime
import boto3
import pandas as pd
from io import StringIO

***Creating the Product Review Dataset***

Starting from the initial amazon_data.csv file, we keep only the columns needed for the product dataset and drop the rest. The remaining fields are the timestamp, product ID, product name, category, rating count, and description.

In [2]:
# Create the product review dataset filtering out respective columns


bucket_name = 'personalizeproductreviewdata'
review_data_key = 'amazon_data.csv'



data_s3_location1 = "s3://{}/{}".format(bucket_name, review_data_key)  # S3 URL
product_data = pd.read_csv(data_s3_location1)


# Drop the pricing, user, and review columns that the product dataset does not need
product_data = product_data.drop(columns=[
    'discounted_price', 'actual_price', 'discount_percentage',
    'user_id', 'user_name', 'review_id', 'review_title', 'review_content',
    'img_link', 'product_link', 'rating', 'age',
])



product_data.rename(columns={'timestamp': 'CREATION_TIMESTAMP', 'product_id': 'ITEM_ID', 'product_name': 'PRODUCT_NAME', 'category': 'CATEGORY','rating_count': 'RATING_COUNT','about_product': 'DESCRIPTION'}, inplace=True)
product_data['CREATION_TIMESTAMP'] = pd.to_datetime(product_data['CREATION_TIMESTAMP']).astype(int) // 10**9  # Convert to Unix timestamp (seconds)




product_data.head()




Unnamed: 0,CREATION_TIMESTAMP,ITEM_ID,PRODUCT_NAME,CATEGORY,RATING_COUNT,DESCRIPTION
0,1587427700,B07JW9H4J1,Wayona Nylon Braided USB to Lightning Fast Cha...,Computers&Accessories|Accessories&Peripherals|...,24269,High Compatibility : Compatible With iPhone 12...
1,1666858131,B098NS6PVG,Ambrane Unbreakable 60W / 3A Fast Charging 1.5...,Computers&Accessories|Accessories&Peripherals|...,43994,"Compatible with all Type C enabled devices, be..."
2,1664695794,B096MSW6CT,Sounce Fast Phone Charging Cable & Data Sync U...,Computers&Accessories|Accessories&Peripherals|...,7928,【 Fast Charger& Data Sync】-With built-in safet...
3,1692737458,B08HDJ86NZ,boAt Deuce USB 300 2 in 1 Type-C & Micro USB S...,Computers&Accessories|Accessories&Peripherals|...,94363,The boAt Deuce USB 300 2 in 1 cable is compati...
4,1598981603,B08CF3B7N1,Portronics Konnect L 1.2M Fast Charging 3A 8 P...,Computers&Accessories|Accessories&Peripherals|...,16905,[CHARGE & SYNC FUNCTION]- This cable comes wit...
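
To make the timestamp conversion concrete, here is a minimal standard-library sketch of the same idea as `pd.to_datetime(...).astype(int) // 10**9`: parse a datetime and express it as whole seconds since the Unix epoch. The input string format and the helper name `to_unix_seconds` are assumptions for illustration (the raw format in amazon_data.csv is not shown here), and timezone-naive values are assumed to be UTC.

```python
from datetime import datetime, timezone


def to_unix_seconds(ts: str) -> int:
    """Parse an ISO-like datetime string and return seconds since the Unix epoch."""
    dt = datetime.fromisoformat(ts)
    # Assumption: a timestamp without a timezone is interpreted as UTC.
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


# The first row of the table above has CREATION_TIMESTAMP 1587427700.
print(to_unix_seconds("2020-04-21 00:08:20"))  # 1587427700
```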


***Uploading the data to DynamoDB***

DynamoDB is a highly scalable, low-latency, and fully managed NoSQL database service which is well-suited for applications that require fast and consistent performance, such as personalization recommendation engines. DynamoDB offers built-in security features, automatic scaling, and high availability across multiple Availability Zones, making it a reliable and durable choice for storing and retrieving large volumes of data. 


Using the boto3 DynamoDB resource, we create a table whose partition (HASH) key is ITEM_ID, unless the table already exists. Once the table is ready, we convert the RATING_COUNT column in the product_data DataFrame to strings and iterate through the DataFrame, inserting each row as an item into the DynamoDB table.

In [3]:
# Upload the product data to DynamoDB


session = boto3.Session()
dynamodb = session.resource('dynamodb')

table_name = 'productdata'




try:
    table = dynamodb.Table(table_name)
    table.load()
    print(f"Table {table_name} already exists.")
except dynamodb.meta.client.exceptions.ResourceNotFoundException:
    # Define the table schema
    table = dynamodb.create_table(
        TableName=table_name,
        KeySchema=[
            {
                'AttributeName': 'ITEM_ID',
                'KeyType': 'HASH'  
            }
        ],
        AttributeDefinitions=[
            {
                'AttributeName': 'ITEM_ID',
                'AttributeType': 'S'
            }
        ],
        BillingMode='PAY_PER_REQUEST'
    )
    table.meta.client.get_waiter('table_exists').wait(TableName=table_name)
    print(f"Table {table_name} created successfully.")


    
product_data['RATING_COUNT'] = product_data['RATING_COUNT'].astype(str)

# Define the DynamoDB table
table = dynamodb.Table(table_name)

# Insert data into DynamoDB
for index, row in product_data.iterrows():
    item = {
        'TIMESTAMP': row['CREATION_TIMESTAMP'],
        'ITEM_ID': row['ITEM_ID'],
        'PRODUCT_NAME': row['PRODUCT_NAME'],
        'CATEGORY': row['CATEGORY'],
        'RATING_COUNT': row['RATING_COUNT'],
        'DESCRIPTION': row['DESCRIPTION']
    }
    
    table.put_item(Item=item)

print("Data inserted successfully.")

Table productdata already exists.
Data inserted successfully.
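
The row-by-row `put_item` loop works, but two details are worth sketching. The boto3 DynamoDB resource rejects Python `float` values (it expects `Decimal`), and `Table.batch_writer()` can group writes into batches instead of sending one request per row. The helper names `to_dynamodb_item` and `write_items` below are hypothetical, and skipping missing values is an assumption about how gaps should be handled, not something this notebook specifies.

```python
import math
from decimal import Decimal


def to_dynamodb_item(row: dict) -> dict:
    """Return a copy of `row` that the boto3 DynamoDB resource can serialize."""
    item = {}
    for key, value in row.items():
        if value is None:
            continue  # assumption: omit missing attributes entirely
        if isinstance(value, float):
            if math.isnan(value):
                continue  # NaN cannot be stored as a DynamoDB number
            value = Decimal(str(value))  # floats must be sent as Decimal
        item[key] = value
    return item


def write_items(table, rows):
    """Write rows through the table's batch writer (a boto3 Table is expected)."""
    with table.batch_writer() as batch:
        for row in rows:
            batch.put_item(Item=to_dynamodb_item(row))


print(to_dynamodb_item({"ITEM_ID": "B07JW9H4J1", "SCORE": 4.2, "GAP": float("nan")}))
```

With the objects from the cell above, this might be called as `write_items(table, product_data.to_dict('records'))`.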


***Create the user dataset***

We then load the user review data from the same Amazon S3 object into a DataFrame, drop the product-only columns, and rename the remaining columns for consistency. The TIMESTAMP column is converted to a Unix timestamp in seconds, and USERNAME is trimmed to the first two comma-separated names.

In [7]:

# Create the user review dataset filtering out respective columns

bucket_name = 'personalizeproductreviewdata'
user_data_key = 'amazon_data.csv'
data_s3_location1 = "s3://{}/{}".format(bucket_name, user_data_key)  # S3 URL
user_data = pd.read_csv(data_s3_location1)


# Drop the product-only and pricing columns that the user dataset does not need
user_data = user_data.drop(columns=[
    'rating_count', 'category', 'about_product', 'img_link', 'product_link',
    'discounted_price', 'actual_price', 'discount_percentage',
])


user_data.rename(columns={'timestamp': 'TIMESTAMP', 'product_id': 'ITEM_ID', 'product_name':'PRODUCT_NAME','rating':'RATING','user_id': 'USER_ID', 'age': 'AGE','user_name':'USERNAME', 'review_id': 'REVIEW_ID','review_title':'REVIEW_TITLE','review_content':'REVIEW_CONTENT'}, inplace=True)

user_data['TIMESTAMP'] = pd.to_datetime(user_data['TIMESTAMP']).astype(int) // 10**9  # Convert to Unix timestamp (seconds)
user_data['USERNAME'] = user_data['USERNAME'].apply(lambda x: x.split(',')[0] + ',' + x.split(',')[1] if len(x.split(',')) > 1 else x)  # Extract the first and second user entries if available, otherwise keep the original value


user_data.info()
user_data.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1465 entries, 0 to 1464
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   TIMESTAMP       1465 non-null   int64 
 1   ITEM_ID         1465 non-null   object
 2   PRODUCT_NAME    1465 non-null   object
 3   RATING          1465 non-null   object
 4   AGE             1465 non-null   int64 
 5   USER_ID         1465 non-null   object
 6   USERNAME        1465 non-null   object
 7   REVIEW_ID       1465 non-null   object
 8   REVIEW_TITLE    1465 non-null   object
 9   REVIEW_CONTENT  1465 non-null   object
dtypes: int64(2), object(8)
memory usage: 114.6+ KB


Unnamed: 0,TIMESTAMP,ITEM_ID,PRODUCT_NAME,RATING,AGE,USER_ID,USERNAME,REVIEW_ID,REVIEW_TITLE,REVIEW_CONTENT
0,1587427700,B07JW9H4J1,Wayona Nylon Braided USB to Lightning Fast Cha...,4.2,59,"AG3D6O4STAQKAY2UVGEUV46KN35Q,AHMY5CWJMMK5BJRBB...","Manav,Adarsh gupta","R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1K...","Satisfied,Charging is really fast,Value for mo...",Looks durable Charging is fine tooNo complains...
1,1666858131,B098NS6PVG,Ambrane Unbreakable 60W / 3A Fast Charging 1.5...,4.0,38,"AECPFYFQVRUWC3KGNLJIOREFP5LQ,AGYYVPDD7YG7FYNBX...","ArdKn,Nirbhay kumar","RGIQEG07R9HS2,R1SMWZQ86XIN8U,R2J3Y1WL29GWDE,RY...","A Good Braided Cable for Your Type C Device,Go...",I ordered this cable to connect my phone to An...
2,1664695794,B096MSW6CT,Sounce Fast Phone Charging Cable & Data Sync U...,3.9,58,"AGU3BBQ2V2DDAMOAKGFAWDDQ6QHA,AESFLDV2PT363T2AQ...","Kunal,Himanshu","R3J3EQQ9TZI5ZJ,R3E7WBGK7ID0KV,RWU79XKQ6I1QF,R2...","Good speed for earlier versions,Good Product,W...","Not quite durable and sturdy,https://m.media-a..."
3,1692737458,B08HDJ86NZ,boAt Deuce USB 300 2 in 1 Type-C & Micro USB S...,4.2,49,"AEWAZDZZJLQUYVOVGBEUKSLXHQ5A,AG5HTSFRRE6NL3M5S...","Omkar dhale,JD","R3EEUZKKK9J36I,R3HJVYCLYOY554,REDECAZ7AMPQC,R1...","Good product,Good one,Nice,Really nice product...","Good product,long wire,Charges good,Nice,I bou..."
4,1598981603,B08CF3B7N1,Portronics Konnect L 1.2M Fast Charging 3A 8 P...,4.2,19,"AE3Q6KSUK5P75D5HFYHCRAOLODSA,AFUGIFH5ZAFXRDSZH...","rahuls6099,Swasat Borah","R1BP4L2HH9TFUP,R16PVJEXKV6QZS,R2UPDB81N66T4P,R...","As good as original,Decent,Good one for second...","Bought this instead of original apple, does th..."
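
The USERNAME clean-up keeps at most the first two comma-separated names. A compact, equivalent way to express that rule is shown below as a plain function. `first_two_names` is a hypothetical helper name, and the sample strings are made up for illustration.

```python
def first_two_names(raw: str) -> str:
    """Keep the first two comma-separated entries; shorter values pass through unchanged."""
    return ",".join(raw.split(",")[:2])


print(first_two_names("Manav,Adarsh gupta,Third Reviewer"))  # Manav,Adarsh gupta
print(first_two_names("Manav"))  # Manav
```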


***Write back user data to DynamoDB***


Using the boto3 DynamoDB resource, we create a table with REVIEW_ID as the partition key, unless the table already exists. After ensuring the table is ready, we convert the RATING and REVIEW_CONTENT columns in the user_data DataFrame to strings, then iterate through the DataFrame, inserting each row as an item into the DynamoDB table.

In [None]:
session = boto3.Session()

# Get the DynamoDB resource
dynamodb = session.resource('dynamodb')

# Define the table
table_name = 'userdata'

# Check if the table exists, if not, create it
try:
    table = dynamodb.Table(table_name)
    table.load()
    print(f"Table {table_name} already exists.")
except dynamodb.meta.client.exceptions.ResourceNotFoundException:
    # Define the table schema
    table = dynamodb.create_table(
        TableName=table_name,
        KeySchema=[
            {
                'AttributeName': 'REVIEW_ID',
                'KeyType': 'HASH' 
            }
        ],
        AttributeDefinitions=[
            {
                'AttributeName': 'REVIEW_ID',
                'AttributeType': 'S'
            }
        ],
        
        BillingMode='PAY_PER_REQUEST'
        
    )

    table.meta.client.get_waiter('table_exists').wait(TableName=table_name)
    print(f"Table {table_name} created successfully.")

table = dynamodb.Table(table_name)
user_data['RATING'] = user_data['RATING'].astype(str)
user_data['REVIEW_CONTENT'] = user_data['REVIEW_CONTENT'].astype(str)



for index, row in user_data.iterrows():
    item = {
        'TIMESTAMP': row['TIMESTAMP'],
        'AGE': row['AGE'],
        'PRODUCT_ID': row['ITEM_ID'],
        'PRODUCT_NAME': row['PRODUCT_NAME'],
        'REVIEW_ID': row['REVIEW_ID'],
        'REVIEW_TITLE': row['REVIEW_TITLE'],
        'USER_ID': row['USER_ID'],
        'USERNAME': row['USERNAME'],
        'REVIEW_CONTENT': row['REVIEW_CONTENT']
    }
    
    table.put_item(Item=item)

print("Data inserted successfully.")
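
Because REVIEW_ID is the table's partition key, `put_item` replaces any existing item that has the same REVIEW_ID, so duplicate keys in the DataFrame would silently collapse into a single item. A quick check before writing can surface that. This is a sketch in plain Python, with `find_duplicate_keys` as a hypothetical helper and made-up sample IDs.

```python
from collections import Counter


def find_duplicate_keys(rows, key="REVIEW_ID"):
    """Return {key_value: count} for every key value that appears more than once."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


sample = [{"REVIEW_ID": "R1"}, {"REVIEW_ID": "R2"}, {"REVIEW_ID": "R1"}]
print(find_duplicate_keys(sample))  # {'R1': 2}
```

An empty result means every row will become its own item in the table.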





***Creating the interaction dataset***

We are performing sentiment analysis on product reviews using Amazon Comprehend in preparation for training a personalized recommendation model with Amazon Personalize. The sentiment analysis provides valuable insights into the users' opinions and attitudes towards specific products, which can be leveraged by the recommendation engine to generate more accurate and relevant recommendations.

1. Enriching User Interactions: The sentiment analysis adds an extra layer of information to the user interaction data. By analyzing the sentiment of reviews, we can better understand the user's experience and preferences with different products. Positive sentiments indicate satisfaction, while negative sentiments may suggest dissatisfaction or areas for improvement.

2. Improving Recommendation Relevance: Amazon Personalize uses various data sources, including user interactions, to train its recommendation models. By incorporating sentiment information, the model can learn not only what products a user interacted with but also their subjective opinions about those products. This additional context can help the model better understand user preferences and provide more relevant recommendations.

3. Filtering Interactions: In the code for this section, the star rating, not the sentiment label, decides which interactions are kept: reviews with a rating of 2.0 or lower receive no event type and are dropped. The sentiment label and score from Amazon Comprehend are carried along as extra columns. This filtering ensures that the recommendation model focuses on the more positive interactions, which are more likely to lead to accurate recommendations that align with user preferences.
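
Amazon Comprehend's `detect_sentiment` limits its input by UTF-8 byte size (documented as 5 KB at the time of writing; treat the exact figure as an assumption to verify) rather than by character count. A 4000-character chunk, as produced by the `chunk_text` helper in this section, could exceed that limit when the text contains multi-byte characters such as 【. The sketch below splits by encoded size instead. `chunk_text_by_bytes` and the 4500-byte default are hypothetical choices.

```python
def chunk_text_by_bytes(text, max_bytes=4500):
    """Split text into chunks whose UTF-8 encoding is at most max_bytes each.

    Characters are never split, so every chunk is valid text on its own.
    """
    chunks, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        # Start a new chunk when adding this character would exceed the budget.
        if current and size + n > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks


sample = "【Fast Charger】" * 3
chunks = chunk_text_by_bytes(sample, max_bytes=20)
print([len(c.encode("utf-8")) for c in chunks])
```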

In [None]:
# Perform sentiment analysis on review using Amazon Comprehend

import boto3
import pandas as pd

comprehend = boto3.client('comprehend', region_name='us-west-2')

def chunk_text(text, chunk_size=4000):
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks

def get_comprehend_sentiment(text):
    if not text:
        return None, None
    chunks = chunk_text(text)
    sentiments = []
    sentiment_scores = []
    for chunk in chunks:
        response = comprehend.detect_sentiment(Text=chunk, LanguageCode='en')
        sentiment = response['Sentiment'].upper()
        sentiment_score = response['SentimentScore'][sentiment.capitalize()]
        sentiments.append(sentiment)
        sentiment_scores.append(sentiment_score)
    avg_sentiment_score = sum(sentiment_scores) / len(sentiment_scores)
    return sentiments[0], avg_sentiment_score

processed_data = user_data.copy()
processed_data['RATING'] = pd.to_numeric(processed_data['RATING'], errors='coerce')

# Apply sentiment analysis to the 'REVIEW_CONTENT' column
processed_data[['SENTIMENT', 'SENTIMENTSCORE']] = processed_data['REVIEW_CONTENT'].apply(lambda x: pd.Series(get_comprehend_sentiment(x)))

# Add the EVENT_TYPE column. Assign the broader threshold first so that the
# stricter one is not overwritten (every rating above 3.0 is also above 2.0).
processed_data['EVENT_TYPE'] = None
processed_data.loc[processed_data['RATING'] > 2.0, 'EVENT_TYPE'] = 'click'
processed_data.loc[processed_data['RATING'] > 3.0, 'EVENT_TYPE'] = 'read'

# Filter rows that have an EVENT_TYPE assigned
interactions_df = processed_data[processed_data['EVENT_TYPE'].notna()]

# Select relevant columns (as a copy, so later assignments do not touch a view of processed_data)
interactions_df = interactions_df[['TIMESTAMP', 'USERNAME', 'ITEM_ID', 'PRODUCT_NAME', 'EVENT_TYPE', 'SENTIMENT', 'SENTIMENTSCORE']].copy()

# Rename columns. The columns were already upper-cased in user_data, so the
# rename must use the upper-case names; ITEM_ID keeps its name.
interactions_df = interactions_df.rename(columns={
    'USERNAME': 'USER_ID',
    'PRODUCT_NAME': 'ITEM_NAME'
})

interactions_df['SENTIMENTSCORE'] = interactions_df['SENTIMENTSCORE'].astype(str)

# Print the final DataFrame
print(interactions_df)
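
`get_comprehend_sentiment` returns the label of the first chunk together with an average of each chunk's top score, so for multi-chunk reviews the label and the score can describe different things. One possible alternative, sketched here under the assumption that every chunk should count equally, averages each class score across chunks and then picks the strongest class. `aggregate_sentiment` is a hypothetical helper, and the score dictionaries mimic the `SentimentScore` field (`Positive`, `Negative`, `Neutral`, `Mixed`) with made-up numbers.

```python
def aggregate_sentiment(chunk_scores):
    """Average per-class scores over chunks and return (LABEL, mean score of that label)."""
    if not chunk_scores:
        return None, None
    labels = ("Positive", "Negative", "Neutral", "Mixed")
    means = {
        label: sum(scores[label] for scores in chunk_scores) / len(chunk_scores)
        for label in labels
    }
    best = max(means, key=means.get)
    return best.upper(), means[best]


chunk_scores = [
    {"Positive": 0.9, "Negative": 0.05, "Neutral": 0.04, "Mixed": 0.01},
    {"Positive": 0.2, "Negative": 0.7, "Neutral": 0.05, "Mixed": 0.05},
]
label, score = aggregate_sentiment(chunk_scores)
print(label, round(score, 2))  # POSITIVE 0.55
```

Weighting chunks by length would be another reasonable choice, since the last chunk of a review is often much shorter than the others.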


***Write datasets to S3***

Write the interaction, product, and user datasets to the S3 bucket as CSV files. Staging data in Amazon S3 is standard practice with Amazon Personalize, because Personalize imports its training data from an S3 bucket.

In [None]:
# Write interaction data to S3

from io import StringIO
interactions_filename = "interactions.csv"

interactions_df.info()
print(interactions_df.columns)


csv_buffer = StringIO()
interactions_df.to_csv(csv_buffer, index=False)

s3_client = boto3.client('s3')
s3_client.put_object(Bucket='personalizeproductreviewdata', Key='interactions.csv', Body = csv_buffer.getvalue())


print(interactions_df.dtypes)
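
Amazon Personalize interactions datasets need at least USER_ID, ITEM_ID, and TIMESTAMP columns, so it can help to check the CSV header before uploading. The following is a sketch with a hypothetical helper, `missing_required_columns`; the sample CSV row is illustrative only.

```python
import csv
from io import StringIO

REQUIRED_INTERACTION_COLUMNS = ("USER_ID", "ITEM_ID", "TIMESTAMP")


def missing_required_columns(csv_text, required=REQUIRED_INTERACTION_COLUMNS):
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(StringIO(csv_text)), [])
    return [name for name in required if name not in header]


sample_csv = "TIMESTAMP,USER_ID,ITEM_ID,EVENT_TYPE\n1587427700,example_user,B07JW9H4J1,click\n"
print(missing_required_columns(sample_csv))  # []
```

The same check could be run on `csv_buffer.getvalue()` just before the `put_object` call.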


In [None]:
# Write product data to S3


csv_buffer = StringIO()
product_data.to_csv(csv_buffer, index=False)


s3_client = boto3.client('s3')
s3_client.put_object(Bucket='personalizeproductreviewdata', Key='product_data.csv', Body = csv_buffer.getvalue())




In [None]:
# Write user data to S3 (REVIEW_CONTENT is dropped first)

user_data = user_data.drop('REVIEW_CONTENT', axis=1)


csv_buffer = StringIO()
user_data.to_csv(csv_buffer, index=False)

s3_client = boto3.client('s3')
s3_client.put_object(Bucket='personalizeproductreviewdata', Key='user_data.csv', Body = csv_buffer.getvalue())
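
The three upload cells repeat the same buffer-and-`put_object` pattern, which could be folded into one helper as sketched below. `rows_to_csv` and `upload_csv` are hypothetical names, the rows are plain dictionaries rather than DataFrames to keep the sketch dependency-free, and the client is passed in as an argument so the function can be exercised without AWS credentials.

```python
import csv
from io import StringIO


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dictionaries to CSV text with a header row."""
    buffer = StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def upload_csv(s3_client, bucket, key, rows, fieldnames):
    """Upload rows as a UTF-8 CSV object and return the number of bytes sent."""
    body = rows_to_csv(rows, fieldnames).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return len(body)


print(rows_to_csv([{"ITEM_ID": "B07JW9H4J1", "RATING_COUNT": "24269"}],
                  ["ITEM_ID", "RATING_COUNT"]))
```

With a real client this might be called as `upload_csv(boto3.client('s3'), 'personalizeproductreviewdata', 'product_data.csv', rows, fieldnames)`.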



In [None]:
%store product_data
%store user_data
%store interactions_df

In [None]:
%store interactions_filename