# Examination of available product review information
This analysis uses the public Amazon Reviews dataset of product reviews. It is a collection of information gathered in 2020 on products from a range of categories. For now, the study covers only products in the technology domain.

Specifically, the following categories were chosen:
- Cell Phones and Accessories
- Electronics
- Software
- Video Games

In [1]:
import pyspark.sql.functions as F
from pyspark.sql import SQLContext, SparkSession, types as T, DataFrame

import os
import dotenv

dotenv.load_dotenv()

True

In [2]:
# Config
BASE_JSON_PATH = os.getenv("BASE_JSON_PATH")

In [3]:
BASE_JSON_PATH

'/mnt/c/Users/User/Documents/Maestría/Amazon Reviews/raw'

In [4]:
spark = (
    SparkSession.builder
        .appName("ML-Amazon-Reviews")
        .getOrCreate()
)



## Import product characterization data
Product metadata is loaded, including the features extracted from each product's description, its main category, number of ratings, average rating, price, and internal code (parent_asin). This code identifies the product at a general level and ignores variant-specific attributes such as color or packaging, which are out of scope for this study.

In [5]:
item_schema = T.StructType([
    T.StructField("title", T.StringType(), True),
    T.StructField("main_category", T.StringType(), True),
    T.StructField("features", T.ArrayType( T.StringType() ), True),
    T.StructField("description", T.ArrayType( T.StringType() ), True),
    T.StructField("average_rating", T.FloatType(), True),
    T.StructField("rating_number", T.IntegerType(), True),
    T.StructField("price", T.DoubleType(), True),
    T.StructField("store", T.StringType(), True),
    T.StructField("parent_asin", T.StringType(), True)
])

In [6]:
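def load_category_data( path ) -> DataFrame:
    # Read a JSON Lines file (one JSON object per line) using the explicit
    # item schema, so no schema inference pass over the data is needed.
    file = (
        spark.read
            .format('json')
            .schema(item_schema)
            .load(path)
        )

    return file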

In [7]:
meta_software = load_category_data( f"{BASE_JSON_PATH}/meta_categories/meta_Software.jsonl" )
meta_electronics = load_category_data( f"{BASE_JSON_PATH}/meta_categories/meta_Electronics.jsonl" )
meta_Cell_Phones_and_Accessories = load_category_data( f"{BASE_JSON_PATH}/meta_categories/meta_Cell_Phones_and_Accessories.jsonl" )
meta_Video_Games = load_category_data( f"{BASE_JSON_PATH}/meta_categories/meta_Video_Games.jsonl" )

In [8]:
unified_meta_items = meta_software.unionByName(
    meta_electronics.unionByName(
        meta_Cell_Phones_and_Accessories.unionByName(
            meta_Video_Games
        )
    )
)

In [9]:
unified_meta_items.count()


3125022

In [10]:
unified_meta_items.limit(1)

DataFrame[title: string, main_category: string, features: array<string>, description: array<string>, average_rating: float, rating_number: int, price: double, store: string, parent_asin: string]

The uniqueness of the proposed natural key, parent_asin, is checked: the key is unique if its distinct count matches the total row count.

In [11]:
unified_meta_items[['parent_asin']].distinct().count()

                                                                                

3125022
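
Comparing the distinct count with the total row count shows whether duplicates exist, but not which keys repeat. The sketch below illustrates that follow-up check in plain Python on a toy sample. The function name and the ids are hypothetical and are not taken from the dataset; on the full data the same idea would be a `groupBy('parent_asin').count()` followed by a filter on counts greater than 1.

```python
from collections import Counter


def find_duplicate_keys(keys):
    """Return a dict mapping each repeated key to its number of occurrences."""
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical ids, for illustration only.
sample_keys = ["ITEM_A", "ITEM_B", "ITEM_C", "ITEM_B"]

duplicates = find_duplicate_keys(sample_keys)
# The key is unique only when no id appears more than once.
is_unique = len(set(sample_keys)) == len(sample_keys)
print(duplicates, is_unique)
```

An empty result means the key is unique, which corresponds to the two counts above being equal.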

## Import user review data
User reviews of the products are loaded. Each review rates a product from 1 to 5 and includes a title and text. The source data also includes image information, the product variant code, the parent product code (which ignores variants such as color or packaging), the user identifier, the review timestamp, and the review's helpfulness score (helpful). The schema in the next cell reads only the rating, title, text, timestamp, helpful votes, and parent product code.

In [12]:
review_schema = T.StructType([
    T.StructField("rating", T.FloatType(), True),
    T.StructField("title", T.StringType(), True),
    T.StructField("text", T.StringType(), True),
    T.StructField("timestamp", T.TimestampType(), True),
    T.StructField("helpful_vote", T.IntegerType(), True),
    T.StructField("parent_asin", T.StringType(), True)
])

In [13]:
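def load_review_data( path ) -> DataFrame:
    # Read a JSON Lines file (one JSON object per line) using the explicit
    # review schema; fields not listed in the schema are ignored.
    file = (
        spark.read
            .format('json')
            .schema(review_schema)
            .load(path)
        )

    return file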

In [14]:
software = load_review_data( f"{BASE_JSON_PATH}/review_categories/Software.jsonl" )
electronics = load_review_data( f"{BASE_JSON_PATH}/review_categories/Electronics.jsonl" )
cell_Phones_and_Accessories = load_review_data( f"{BASE_JSON_PATH}/review_categories/Cell_Phones_and_Accessories.jsonl" )
video_Games = load_review_data( f"{BASE_JSON_PATH}/review_categories/Video_Games.jsonl" )

In [15]:
unified_reviews = software.unionByName(
    electronics.unionByName(
        cell_Phones_and_Accessories.unionByName(
            video_Games
        )
    )
)

In [16]:
unified_reviews.count()

                                                                                

74204685

## Normalize data for later processing
Each item carries a list of features (features) that must be extracted into one row per feature.

In [17]:
items_features = (
    unified_meta_items
        .select(
            F.col('parent_asin'),
            F.explode( F.col('features') ).alias('feature')
        )
)

In [18]:
items_features.show(10)

+-----------+--------------------+
|parent_asin|             feature|
+-----------+--------------------+
| B00VRPSGEO|All the pressing ...|
| B00NWQXXHQ|ENCOURAGE literac...|
| B00NWQXXHQ|FOLLOW along with...|
| B00NWQXXHQ|LEARN new vocabul...|
| B00NWQXXHQ|TAP objects to he...|
| B00RFKP6AC|Mahjong 2015 is a...|
| B00RFKP6AC|This board game i...|
| B00RFKP6AC|Mahjong involves ...|
| B00RFKP6AC|If you enjoy play...|
| B00RFKP6AC|IF YOU LIKE IT, P...|
+-----------+--------------------+
only showing top 10 rows
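
One detail of this step is easy to miss: Spark's `explode` produces no rows for items whose array is null or empty, so those products disappear from the exploded table (`F.explode_outer` keeps them with a null feature instead). The plain-Python sketch below mimics that behavior on a toy sample. The function name and the items are hypothetical.

```python
def explode_features(items):
    """Yield one (parent_asin, feature) pair per feature string.

    Items whose feature list is None or empty yield nothing, which mirrors
    how Spark's explode drops those rows.
    """
    for parent_asin, features in items:
        for feature in features or []:
            yield parent_asin, feature


# Hypothetical items, for illustration only.
toy_items = [
    ("ITEM_A", ["fast charging", "two-year warranty"]),
    ("ITEM_B", []),    # empty list: no output rows
    ("ITEM_C", None),  # missing list: no output rows
]

rows = list(explode_features(toy_items))
print(rows)
```

If items without features should stay visible in later steps, the outer variant is the safer choice.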



## Import a model for computing semantic similarity between sentences

In [19]:
import tensorflow as tf

import tensorflow_hub as hub


In [20]:
# https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder?hl=es-419
MODULE_URL = "https://tfhub.dev/google/universal-sentence-encoder/4"

In [21]:
word = "Elephant"
sentence = "I am a sentence for which I would like to get its embedding."
# Adjacent string literals are concatenated into a single string. Separating
# them with commas would build a tuple, which the model rejects as a
# mixed-type input.
paragraph = (
    "Universal Sentence Encoder embeddings also support short paragraphs. "
    "There is no hard limit on how long the paragraph is. Roughly, the longer "
    "the more 'diluted' the embedding will be.")
messages = [word, sentence, paragraph]

In [22]:
model = hub.load(MODULE_URL)

In [23]:
message_embeddings = model(messages)

In [None]:
message_embeddings

<tf.Tensor: shape=(3, 512), dtype=float32, numpy=
array([[ 0.00834448,  0.00048086,  0.06595249, ..., -0.03266349,
         0.02640913, -0.0606688 ],
       [ 0.0508086 , -0.01652433,  0.0157378 , ...,  0.00976658,
         0.03170121,  0.01788118],
       [-0.02833269, -0.05586218, -0.01294148, ..., -0.0513303 ,
         0.01178872,  0.00579201]], dtype=float32)>
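
Each row of this tensor is a 512-dimensional embedding, and semantic similarity between two sentences is computed by comparing their rows. The notebook does not state which measure will be used; the sketch below assumes cosine similarity, a common choice, and implements it with the standard library on toy 3-dimensional vectors. The function name and the vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for 512-dimensional embeddings.
a = [1.0, 0.0, 0.0]
b = [2.0, 0.0, 0.0]
c = [0.0, 1.0, 0.0]

print(cosine_similarity(a, b))  # same direction
print(cosine_similarity(a, c))  # orthogonal directions
```

With the real output, the rows could be passed in after converting the tensor, for example with `message_embeddings.numpy().tolist()`.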

In [None]:
layers = hub.KerasLayer(MODULE_URL)

In [None]:
model_tf = tf.saved_model.load('/mnt/c/Users/User/Documents/Maestría/universal-sentence-encoder-tensorflow2-universal-sentence-encoder-v2')

In [None]:
print(model_tf.signatures)

_SignatureMap({'serving_default': <ConcreteFunction (*, inputs: TensorSpec(shape=(None,), dtype=tf.string, name='inputs')) -> Dict[['outputs', TensorSpec(shape=(None, 512), dtype=tf.float32, name='outputs')]] at 0x7FDC507034F0>})


In [None]:
infer = model_tf.signatures["serving_default"]

In [None]:
infer.structured_input_signature

((), {'inputs': TensorSpec(shape=(None,), dtype=tf.string, name='inputs')})
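
This signature accepts a one-dimensional batch of strings, so a tuple or list nested inside the batch is not a valid element. The sketch below shows a small pre-check that could run before the model is called. The helper `as_string_batch` is hypothetical and is not part of TensorFlow or TensorFlow Hub; it joins fragment tuples into one string and rejects anything else.

```python
def as_string_batch(messages):
    """Return a flat list of str suitable for a 1-D string batch.

    Tuples or lists made only of strings are joined into a single string;
    any other element type raises TypeError.
    """
    batch = []
    for message in messages:
        if isinstance(message, str):
            batch.append(message)
        elif isinstance(message, (tuple, list)) and all(
            isinstance(part, str) for part in message
        ):
            batch.append("".join(message))
        else:
            raise TypeError(
                f"unsupported message type: {type(message).__name__}"
            )
    return batch


# Hypothetical input mixing a plain string with a tuple of fragments.
print(as_string_batch(["Elephant", ("first part, ", "second part.")]))
```

The returned list can then be passed to the model or wrapped in a string tensor for the serving signature.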

In [None]:
infer.structured_outputs

{'outputs': TensorSpec(shape=(None, 512), dtype=tf.float32, name='outputs')}