<a href="https://colab.research.google.com/github/davidguzmanr/Datos-Masivos-II/blob/main/SVD/SVD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mini-project: Topic modeling with SVD

The goal of this mini-project is to identify the topics in a set of customer reviews using singular value decomposition (SVD).

- Dataset: https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products
- File: Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19
- Column: reviews.text

## Data

First we download the data; for convenience we use the Kaggle API:

In [1]:
import os, json

USER_ID = 'davidguzman'                # REPLACE WITH YOUR OWN USER NAME
USER_SECRET = 'YOUR_KAGGLE_API_TOKEN'  # REPLACE WITH YOUR OWN PRIVATE API TOKEN (never publish a real token)

KAGGLE_CONFIG_DIR = os.path.join(os.path.expanduser('~'), '.kaggle')
os.makedirs(KAGGLE_CONFIG_DIR, exist_ok=True)

# Write the credentials file that the Kaggle CLI reads
with open(os.path.join(KAGGLE_CONFIG_DIR, 'kaggle.json'), 'w') as f:
    json.dump({'username': USER_ID, 'key': USER_SECRET}, f)

# Restrict permissions so that only the current user can read the token
!chmod 600 {KAGGLE_CONFIG_DIR}/kaggle.json

In [2]:
!kaggle datasets download -d datafiniti/consumer-reviews-of-amazon-products
!unzip consumer-reviews-of-amazon-products.zip

Downloading consumer-reviews-of-amazon-products.zip to /content
 68% 11.0M/16.3M [00:00<00:00, 111MB/s]
100% 16.3M/16.3M [00:00<00:00, 104MB/s]
Archive:  consumer-reviews-of-amazon-products.zip
  inflating: 1429_1.csv              
  inflating: Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv  
  inflating: Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv  


## Preprocessing

In [3]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None  # silence pandas' chained-assignment warning
import matplotlib.pyplot as plt

import re
import string

import nltk
from nltk.tokenize import TweetTokenizer
from nltk import word_tokenize
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
data = pd.read_csv('/content/Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv')
reviews = data[['reviews.text']].copy()  # explicit copy, so later column assignments do not target a view of data

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28332 entries, 0 to 28331
Data columns (total 24 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   28332 non-null  object 
 1   dateAdded            28332 non-null  object 
 2   dateUpdated          28332 non-null  object 
 3   name                 28332 non-null  object 
 4   asins                28332 non-null  object 
 5   brand                28332 non-null  object 
 6   categories           28332 non-null  object 
 7   primaryCategories    28332 non-null  object 
 8   imageURLs            28332 non-null  object 
 9   keys                 28332 non-null  object 
 10  manufacturer         28332 non-null  object 
 11  manufacturerNumber   28332 non-null  object 
 12  reviews.date         28332 non-null  object 
 13  reviews.dateSeen     28332 non-null  object 
 14  reviews.didPurchase  9 non-null      object 
 15  reviews.doRecommend  16086 non-null 

The data contains 28,332 consumer reviews of Amazon products such as Kindles, tablets, batteries, and more (the data schema is documented in [Data Schema](https://developer.datafiniti.co/docs/product-data-schema)).

In [6]:
data['name'].value_counts()[0:10]

AmazonBasics AAA Performance Alkaline Batteries (36 Count)                                     8343
AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary                 3728
Fire HD 8 Tablet with Alexa, 8 HD Display, 16 GB, Tangerine - with Special Offers              2443
All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi, 16 GB - Includes Special Offers, Black          2370
Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Pink Kid-Proof Case                         1676
Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Blue Kid-Proof Case                         1425
Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Green Kid-Proof Case                        1212
Fire Tablet, 7 Display, Wi-Fi, 16 GB - Includes Special Offers, Black                          1024
Fire Tablet with Alexa, 7 Display, 16 GB, Blue - with Special Offers                            987
All-New Fire HD 8 Tablet with Alexa, 8 HD Display, 16 GB, Marine Blue - with Special Offers     883


It is worth noting that most reviews are positive: about 90% give 4 or 5 stars, and among the reviews that state a recommendation, roughly 95% recommend the product.

In [7]:
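# Percentages per value; normalize=True avoids hard-coding the number of rows
print(100 * data['reviews.rating'].value_counts(normalize=True, dropna=False))
print()
print(100 * data['reviews.doRecommend'].value_counts(normalize=True, dropna=False))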

5    70.228011
4    19.935056
3     4.256671
1     3.406043
2     2.174220
Name: reviews.rating, dtype: float64

True     54.189609
NaN      43.223211
False     2.587181
Name: reviews.doRecommend, dtype: float64


In [8]:
reviews['reviews.text'] = reviews['reviews.text'].apply(lambda x: word_tokenize(x.lower()))                                # lowercase and tokenize
reviews['reviews.text'] = reviews['reviews.text'].apply(lambda x: [item for item in x if item not in stop_words])          # remove stop words
reviews['reviews.text'] = reviews['reviews.text'].apply(lambda x: [item for item in x if item not in string.punctuation])  # remove punctuation tokens
reviews['reviews.text'] = reviews['reviews.text'].apply(lambda x: ' '.join(x))                                             # join the tokens back into one string

Now let's look at one review and how it reads after processing:

In [9]:
print('\033[1m Original: ' + '\033[94m' + data['reviews.text'][2] + '\033[0m')
print('\033[1m Processed: ' + '\033[95m' + reviews['reviews.text'][2] + '\033[0m')

 Original: Well they are not Duracell but for the price i am happy.
 Processed: well duracell price happy

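The preprocessing steps above (lowercase, tokenize, drop stop words and punctuation, rejoin) can be sketched as one function. This is a simplified stand-in rather than the notebook's code: a regular expression replaces `word_tokenize`, and `TOY_STOP_WORDS` is a hypothetical, deliberately tiny stop-word set used in place of NLTK's English list so that the example runs without any downloads.

```python
import re

# Hypothetical, deliberately tiny stop-word set (the notebook uses NLTK's English list)
TOY_STOP_WORDS = {'they', 'are', 'not', 'but', 'for', 'the', 'i', 'am'}

def preprocess(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, tokenize, drop stop words, and rejoin into a single string."""
    # Keep runs of letters, digits and apostrophes, so punctuation never
    # becomes a token (stand-in for word_tokenize plus the punctuation filter)
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return ' '.join(tok for tok in tokens if tok not in stop_words)

print(preprocess('Well they are not Duracell but for the price i am happy.'))
```

With this toy stop-word set the review shown above reduces to the same four words as in the notebook output; with a different stop-word list or tokenizer the result may differ.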

## SVD

Now let's try to find the topics, splitting them into $n$ groups (here $n = 5$, set through `n_components`).

In [10]:
# Build a review-term matrix using TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words = 'english', 
                             analyzer = 'word',
                             max_features = 1000,       # maximum number of terms
                             max_df = 0.5,              # ignore terms that appear in more than 50% of the reviews
                             smooth_idf = True)

X = vectorizer.fit_transform(reviews['reviews.text'])
X.shape

(28332, 1000)
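To make the review-term matrix more concrete, here is a minimal sketch on a toy corpus. The three documents in `toy_docs` are hypothetical and not taken from the dataset. Each row of the matrix is one document, each column one term, and with the default settings every row is scaled to unit length; a term that is frequent in a document but rare in the corpus receives a larger weight.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews, used only for illustration
toy_docs = [
    'great batteries great price',
    'good tablet kids',
    'batteries last long',
]

toy_vectorizer = TfidfVectorizer(smooth_idf=True)
toy_X = toy_vectorizer.fit_transform(toy_docs)

vocab = toy_vectorizer.vocabulary_   # maps each term to its column index
dense = toy_X.toarray()

print(toy_X.shape)                     # (number of documents, number of distinct terms)
print(np.linalg.norm(dense, axis=1))   # row norms (rows are L2-normalized by default)

# Weights of the first document, largest first
first_doc = {term: dense[0, col] for term, col in vocab.items() if dense[0, col] > 0}
for term, weight in sorted(first_doc.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{term}: {weight:.3f}')
```

In the first document "great" occurs twice and appears nowhere else, so it gets the largest weight, while "batteries" also appears in another document and is weighted lower than "price".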

In [11]:
# Compute the singular value decomposition of the matrix using TruncatedSVD
from sklearn.decomposition import TruncatedSVD

svd_model = TruncatedSVD(n_components = 5,
                         algorithm = 'randomized', 
                         n_iter = 100, 
                         random_state = 42)
svd_model.fit(X)

TruncatedSVD(algorithm='randomized', n_components=5, n_iter=100,
             random_state=42, tol=0.0)
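As a reminder of what the fitted model holds: a truncated SVD approximates the matrix as $X \approx U_k \Sigma_k V_k^\top$, keeping only the $k$ largest singular values. In scikit-learn, `components_` stores the rows of $V_k^\top$ (one row per topic, one column per term) and `fit_transform` returns $U_k \Sigma_k$ (one row per document). The sketch below checks this against NumPy's full SVD on a small random matrix; `A`, `k` and `toy_svd` are hypothetical names, and `A` is only a stand-in for the TF-IDF matrix.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
A = rng.random((20, 8))   # toy stand-in for the TF-IDF matrix (hypothetical data)
k = 3

toy_svd = TruncatedSVD(n_components=k, algorithm='randomized',
                       n_iter=10, random_state=42)
doc_topic = toy_svd.fit_transform(A)   # U_k * Sigma_k, shape (20, k)
topic_term = toy_svd.components_       # V_k^T, shape (k, 8)

# Reference: full SVD from NumPy, truncated to the k largest singular values
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k]

print(np.allclose(toy_svd.singular_values_, s[:k]))   # singular values agree
print(np.allclose(doc_topic @ topic_term, A_k))       # same rank-k approximation
```

The product `doc_topic @ topic_term` is compared instead of the individual factors because singular vectors are only defined up to sign.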

In [12]:
# The components of the model are the topics of the documents
terms = vectorizer.get_feature_names_out()  # scikit-learn >= 1.0; older versions use get_feature_names()

# Show the 10 highest-weighted words in each of the 5 topics
for i, comp in enumerate(svd_model.components_):
    
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[0:10]
    print("\033[1mTopic " + str(i+1) + ": \033[0m")
    
    for t in sorted_terms:
        print(t[0])
        
    print('\n')

Topic 1:
great
batteries
good
price
work
tablet
product
long
value
use


Topic 2:
good
batteries
far
brand
long
quality
brands
duracell
cheap
say


Topic 3:
batteries
work
great
long
brand
price
brands
lasting
aa
duracell


Topic 4:
great
good
price
value
product
deal
works
quality
shipping
item


Topic 5:
work
value
good
fine
great
expected
far
like
deal
cheaper




The topics appear to correspond to the products with the most reviews (batteries and Fire tablets), which are the most heavily represented in the data.

In [13]:
data['name'].value_counts()[0:5]

AmazonBasics AAA Performance Alkaline Batteries (36 Count)                               8343
AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary           3728
Fire HD 8 Tablet with Alexa, 8 HD Display, 16 GB, Tangerine - with Special Offers        2443
All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi, 16 GB - Includes Special Offers, Black    2370
Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Pink Kid-Proof Case                   1676
Name: name, dtype: int64
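One way to probe this impression, sketched here on hypothetical toy data rather than on the real reviews, is to assign each review to the component with the largest absolute score and cross-tabulate that assignment against the product name. The `toy` DataFrame and its column names are assumptions for illustration, and the two toy products deliberately share no vocabulary, which makes the separation much cleaner than it would be on the actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews; the two products share no words on purpose
toy = pd.DataFrame({
    'name': ['batteries'] * 3 + ['tablet'] * 2,
    'text': ['batteries last long',
             'batteries good price',
             'batteries work long',
             'tablet kids love',
             'tablet screen great'],
})

toy_X = TfidfVectorizer().fit_transform(toy['text'])
toy_scores = TruncatedSVD(n_components=2, random_state=42).fit_transform(toy_X)

# Singular vectors are only defined up to sign, so compare magnitudes
toy['topic'] = np.abs(toy_scores).argmax(axis=1)

# Rows: products, columns: dominant topic, cells: number of reviews
print(pd.crosstab(toy['name'], toy['topic']))
```

On real reviews the vocabularies overlap (words such as "great" or "price" occur for every product), so the first component tends to capture overall word frequency and the table would be far less clear-cut; it is a diagnostic, not a proof that topics match products.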