<img src="http://www.cidaen.es/assets/img/mCIDaeNnb.png" alt="CiDAEN logo" align="right">


<br><br><br>
<h2><font color="#00586D" size=4>Module 12: Big Data Architectures and Processes</font></h2>



<h1><font color="#00586D" size=5>Capstone 12. Part 1: <i>Sentiment</i> model on Amazon Reviews</font></h1>

<br><br><br>
<div style="text-align: right">
<font color="#00586D" size=3>Enrique González, Jacinto Arias</font><br>
<font color="#00586D" size=3>Máster en Ciencia de Datos e Ingeniería de Datos en la Nube</font><br>
<font color="#00586D" size=3>Universidad de Castilla-La Mancha</font>




</div>

<a id="indice"></a>
<h2><font color="#00586D" size=5>Table of Contents</font></h2>


* [1. Introduction](#section1)
* [2. Exploratory analysis](#section2)
* [3. Modeling](#section3)

In [1]:
# Import the libraries used throughout this notebook

import pyspark.sql.functions as sqlf
from pyspark.sql import SparkSession

# Create the Spark session. Adjust the driver memory to suit your machine if needed (the more memory, the better)
spark = SparkSession.builder.config("spark.driver.memory", "8g").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/11 05:32:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


---

<a id="section1"></a>
## <font color="#00586D"> 1. Introducción</font>
<br>

In this capstone we will train a sentiment detection model using Spark and MLlib. Once it is trained, we will extend the project by serializing the model and applying it to a test set.

To do so we will use the __amazon reviews__ dataset, available through the virtual campus under the name amazon-reviews-pds-parquet. The dataset provided is a processed version (fewer categories, stored as parquet) of the usual amazon-reviews dataset available on Kaggle at this link: https://www.kaggle.com/datasets/cynthiarempel/amazon-us-customer-reviews-dataset

This dataset has the following columns (taken from its data dictionary):
```
marketplace       - 2 letter country code of the marketplace where the review was written.
customer_id       - Random identifier that can be used to aggregate reviews written by a single author.
review_id         - The unique ID of the review.
product_id        - The unique Product ID the review pertains to. In the multilingual dataset the reviews
                    for the same product in different countries can be grouped by the same product_id.
product_parent    - Random identifier that can be used to aggregate reviews for the same product.
product_title     - Title of the product.
product_category  - Broad product category that can be used to group reviews 
                    (also used to group the dataset into coherent parts).
star_rating       - The 1-5 star rating of the review.
helpful_votes     - Number of helpful votes.
total_votes       - Number of total votes the review received.
vine              - Review was written as part of the Vine program.
verified_purchase - The review is on a verified purchase.
review_headline   - The title of the review.
review_body       - The review text.
review_date       - The date the review was written.
```

Of these, the `product_category` column is used as the partition key. You can find all the details at the link provided above.

---

<a id="section2"></a>
## <font color="#00586D"> 2. Análisis exploratorio</font>
<br>

Before modeling, we will briefly explore the data to study its properties.

---
### <font color="#004D7F"> <i class="fa fa-pencil-square-o" aria-hidden="true" style="color:#004D7F"></i> Task 1: Loading the data </font>
<br>

Load the full dataset in parquet format and count its records. Do not persist it for now.




In [2]:
# Set up the data read in the following variable

# Solution
reviews_df = spark.read.parquet('data/amazon-reviews-pds-parquet')

                                                                                

In [3]:
# Count the records in the dataset

# Solution
count = reviews_df.count() 
print(count)



16967903


                                                                                

**Expected result**: 16967903 records

<div style="text-align: right"><font size=4> <i class="fa fa-check-square-o" aria-hidden="true" style="color:#00586D"></i></font></div>


---
### <font color="#004D7F"> <i class="fa fa-pencil-square-o" aria-hidden="true" style="color:#004D7F"></i> Task 2: Filtering </font>
<br>


Because the dataset is too large to train the sentiment model on in full, we will work with a single partition, namely `Electronics`. Filter the data to keep only this partition and count the records in the resulting dataset. Do not cache this dataset.



In [4]:
import pyspark.sql.functions as sqlf
reviews_df = reviews_df.filter(sqlf.col('product_category') == 'Electronics')

In [5]:
# Solution
reviews_df.count()

                                                                                

3105119

**Expected result**: 3105119 records

<div style="text-align: right"><font size=4> <i class="fa fa-check-square-o" aria-hidden="true" style="color:#00586D"></i></font></div>


---
### <font color="#004D7F"> <i class="fa fa-pencil-square-o" aria-hidden="true" style="color:#004D7F"></i> Task 3: Storage </font>
<br>


So that we no longer work against the public data, we will write the data locally. Write the data as parquet to the directory `data/electronics.parquet`, using `repartition` to produce 32 partitions. Then reload the dataset and cache it.



In [6]:
# Write your solution here
reviews_df.repartition(32).write.format('parquet').mode('overwrite').save('data/electronics.parquet')
reviews_df_rep = spark.read.format('parquet').load('data/electronics.parquet')
reviews_df_rep.cache()  # Caching exceeded the RAM on my machine, so the notebook was run without the cache. This is the command that would cache the dataset.

                                                                                



<div style="text-align: right"><font size=4> <i class="fa fa-check-square-o" aria-hidden="true" style="color:#00586D"></i></font></div>


---
### <font color="#004D7F"> <i class="fa fa-pencil-square-o" aria-hidden="true" style="color:#004D7F"></i> Task 4: Aggregations </font>
<br>

Obtain the following results from the dataset you just loaded:

1. Show the total number of reviews for each possible number of stars received (*star_rating*)
2. Get the 10 products with the most votes (*total_votes*), showing their name, number of votes and average rating (*star_rating*)
3. Get the number of reviews (1 dataset record -> 1 review) and the average rating (*star_rating*) per month and year. Show the 15 most recent records, ordered by year and month.
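The logic of the third query (count and average per year and month, most recent first) can be checked by hand on a few rows before running it on the full dataset. The following is a minimal sketch in plain Python, not the Spark solution; the `rows` list is made-up illustration data and does not come from the dataset.

```python
from collections import defaultdict
from datetime import date

# Made-up rows for illustration only: (review_date, star_rating).
rows = [
    (date(2015, 8, 3), 5),
    (date(2015, 8, 17), 3),
    (date(2015, 7, 9), 4),
    (date(2014, 12, 25), 1),
]

# Accumulate [review_count, rating_sum] for each (year, month) key.
totals = defaultdict(lambda: [0, 0])
for review_date, star_rating in rows:
    key = (review_date.year, review_date.month)
    totals[key][0] += 1
    totals[key][1] += star_rating

# Most recent month first, mirroring a descending sort on year and then month.
summary = [
    (year, month, count, rating_sum / count)
    for (year, month), (count, rating_sum) in sorted(totals.items(), reverse=True)
]
print(summary)
```

Sorting the `(year, month)` tuples in reverse order plays the same role as ordering by both columns in descending order.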

In [7]:
# Write your solutions below
# Show the total number of reviews for each possible number of stars received (star_rating)
reviews_df_rep.groupBy('star_rating').agg(sqlf.count('star_rating').alias('count')).show()


+-----------+-------+
|star_rating|  count|
+-----------+-------+
|          3| 239459|
|          5|1787754|
|          1| 359248|
|          4| 538824|
|          2| 179834|
+-----------+-------+



                                                                                

In [8]:
# Write your solutions below
# Get the 10 products with the most votes (total_votes), showing their name, number of votes and average rating (star_rating)
reviews_df_rep.orderBy(sqlf.desc('total_votes')).select(sqlf.col('product_title'), sqlf.col('total_votes'), sqlf.col('star_rating')).show(10)


+--------------------+-----------+-----------+
|       product_title|total_votes|star_rating|
+--------------------+-----------+-----------+
|Denon AKDL1 Dedic...|      12944|          3|
|AudioQuest K2 Ter...|       9072|          1|
|Panasonic ErgoFit...|       8680|          5|
|Apple iPod touch ...|       6353|          5|
|Denon AKDL1 Dedic...|       5546|          1|
|Apple iPod touch ...|       4595|          5|
|Bose QuietComfort...|       4556|          4|
|Panasonic ErgoFit...|       4341|          5|
|X-Mini II XAM4-B ...|       4260|          1|
|Denon AKDL1 Dedic...|       4242|          2|
+--------------------+-----------+-----------+
only showing top 10 rows



                                                                                

In [9]:
# Write your solutions below
# Get the number of reviews (1 dataset record -> 1 review) and the average rating (star_rating) per month and year. Show the 15 most recent records, ordered by year and month.
(reviews_df_rep
 .withColumn('year', sqlf.year('review_date'))
 .withColumn('month', sqlf.month('review_date'))
 .groupBy('year','month')
 .agg((sqlf.count('*').alias('review_count')),
      (sqlf.avg('star_rating').alias('mean_star_rating')))
 .orderBy(sqlf.desc('year'),sqlf.desc('month'))
 .show(15)
)


+----+-----+------------+------------------+
|year|month|review_count|  mean_star_rating|
+----+-----+------------+------------------+
|2015|    8|      102984| 4.093985473471608|
|2015|    7|       99806|  4.08580646454121|
|2015|    6|       91486| 4.093478783639027|
|2015|    5|       89357| 4.100439808856609|
|2015|    4|       93152| 4.102466935760907|
|2015|    3|      108861|  4.11561532596614|
|2015|    2|      107291| 4.118062092813004|
|2015|    1|      120404| 4.152602903558021|
|2014|   12|      107891| 4.120232456831432|
|2014|   11|       77529|  4.10810148460576|
|2014|   10|       78128| 4.114210014335449|
|2014|    9|       77753| 4.116111275449179|
|2014|    8|       82143| 4.115664146671049|
|2014|    7|       79424| 4.118352135374698|
|2014|    6|       48375|4.0157726098191215|
+----+-----+------------+------------------+
only showing top 15 rows



                                                                                

**Expected results**:
1. Show the total number of reviews for each possible number of stars received (*star_rating*)

|    |   star_rating |   count |
|---:|--------------:|--------:|
|  0 |             3 |  239459 |
|  1 |             5 | 1787754 |
|  2 |             1 |  359248 |
|  3 |             4 |  538824 |
|  4 |             2 |  179834 |


2. Get the 10 products with the most votes (*total_votes*), showing their name, number of votes and average rating (*star_rating*)

|    | product_title                                                                                    |   total_votes |   star_rating |
|---:|:-------------------------------------------------------------------------------------------------|--------------:|--------------:|
|  0 | Denon AKDL1 Dedicated Link Cable (Discontinued by Manufacturer)                                  |         12944 |             3 |
|  1 | AudioQuest K2 Terminated Speaker Cable - UST 2.44 m Plugs 8' Pair (Discontinued by Manufacturer) |          9072 |             1 |
|  2 | Panasonic ErgoFit In-Ear Earbud Headphone                                                        |          8680 |             5 |
|  3 | Apple iPod touch 8GB (4th Generation)                                                            |          6353 |             5 |
|  4 | Denon AKDL1 Dedicated Link Cable (Discontinued by Manufacturer)                                  |          5546 |             1 |
|  5 | Apple iPod touch 8 GB 2nd Generation                                                             |          4595 |             5 |
|  6 | Bose QuietComfort 15 Acoustic Noise Cancelling Headphones (Discontinued by Manufacturer)         |          4556 |             4 |
|  7 | Panasonic ErgoFit In-Ear Earbud Headphone                                                        |          4341 |             5 |
|  8 | X-Mini II XAM4-B Portable Capsule Speaker, Mono                                                  |          4260 |             1 |
|  9 | Denon AKDL1 Dedicated Link Cable (Discontinued by Manufacturer)                                  |          4242 |             2 |

3. Get the number of reviews (1 dataset record -> 1 review) and the average rating (*star_rating*) per month and year. Show the 15 most recent records, ordered by year and month.

|    |   year |   month |   review_count |   mean_star_rating |
|---:|-------:|--------:|---------------:|-------------------:|
|  0 |   2015 |       8 |         102984 |            4.09399 |
|  1 |   2015 |       7 |          99806 |            4.08581 |
|  2 |   2015 |       6 |          91486 |            4.09348 |
|  3 |   2015 |       5 |          89357 |            4.10044 |
|  4 |   2015 |       4 |          93152 |            4.10247 |
|  5 |   2015 |       3 |         108861 |            4.11562 |
|  6 |   2015 |       2 |         107291 |            4.11806 |
|  7 |   2015 |       1 |         120404 |            4.1526  |
|  8 |   2014 |      12 |         107891 |            4.12023 |
|  9 |   2014 |      11 |          77529 |            4.1081  |
| 10 |   2014 |      10 |          78128 |            4.11421 |
| 11 |   2014 |       9 |          77753 |            4.11611 |
| 12 |   2014 |       8 |          82143 |            4.11566 |
| 13 |   2014 |       7 |          79424 |            4.11835 |
| 14 |   2014 |       6 |          48375 |            4.01577 |


<div style="text-align: right"><font size=4> <i class="fa fa-check-square-o" aria-hidden="true" style="color:#00586D"></i></font></div>

---

<a id="section3"></a>
## <font color="#00586D"> 3. Modeling</font>
<br>

Before modeling, we will carry out two cleaning steps on the data:


---
### <font color="#004D7F"> <i class="fa fa-pencil-square-o" aria-hidden="true" style="color:#004D7F"></i> Task 6: Text preparation </font>
<br>

Clean the review text (`review_body`) using string expressions or regular expressions:
 - Convert all text to lowercase.
 - Remove numbers and punctuation marks.
 - Remove any records whose body is null after applying the transformations above.
 
Show the results for the first 10 rows of the dataframe, ordered by `review_id`
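Before writing the column expression, the cleaning rules can be tried on a single string. The following is a minimal sketch in plain Python, assuming a character class that keeps only lowercase letters and whitespace; `clean_text` is a hypothetical helper, not part of the assignment, and the sample text is adapted from the first review.

```python
import re

# Assumed rule: after lowercasing, drop every character that is not a-z or whitespace.
NON_LETTER = re.compile(r"[^a-z\s]")

def clean_text(text):
    """Lowercase the text and strip digits, punctuation and other symbols."""
    if text is None:
        # Null bodies stay null so they can be filtered out afterwards.
        return None
    return NON_LETTER.sub("", text.lower())

sample = "Great little emergency radio.  The<br />weather band is a feature."
print(clean_text(sample))
# prints: great little emergency radio  thebr weather band is a feature
```

Note that an HTML line break such as `<br />` loses its angle brackets and slash but keeps the letters `br`, which is why a stray `br` token appears in the cleaned text.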

In [10]:
# Write your solution here
# Raw string so that \s reaches the regex engine without an invalid-escape warning
pattern = r"[^a-z\s]"

review_df_rep_clean = (reviews_df_rep
 .select(
     sqlf.col('review_id'), 
     sqlf.col('review_body'),
     sqlf.regexp_replace(sqlf.lower(sqlf.col('review_body')), pattern, "").alias('clean_review_body')         
         )
 .filter(sqlf.col('clean_review_body').isNotNull())
 .orderBy('review_id')
)
review_df_rep_clean.show(10)



+--------------+--------------------+--------------------+
|     review_id|         review_body|   clean_review_body|
+--------------+--------------------+--------------------+
|R10000WMGXS51T|Great little emer...|great little emer...|
|R10001L4QTCA84|Lives up to its c...|lives up to its c...|
|R10003OLR2P5UE|I've gone through...|ive gone through ...|
|R10005O193PJ6W|stopped working a...|stopped working a...|
|R10008LR7CU84N|I ordered this ca...|i ordered this ca...|
|R10009JN2UWOJC|Have not owned it...|have not owned it...|
|R1000AMVKPW32O|Bought for a gift...|bought for a gift...|
|R1000CJMO2L8X4| Perfect for the gym| perfect for the gym|
|R1000EDGJUU3CU|Love these !!! Th...|love these  the s...|
|R1000EG9XXBLXT|I have had good s...|i have had good s...|
+--------------+--------------------+--------------------+
only showing top 10 rows



                                                                                

**Expected result**:

| review_id | review_body | clean_review_body |
|:---|:---|:---|
| R10000WMGXS51T | Great little emergency radio.  Very good reception.  The<br />weather band is a feature.  Can't beat the quality for this price/ | great little emergency radio  very good reception  thebr weather band is a feature  cant beat the quality for this price |
| R10001L4QTCA84 | Lives up to its claim, and really does fit bulky phone cases. Braided cable is sturdy but flexible. I think it stays a little more flexible in the cold weather, which is nice. Definitely getting a few more in the future! | lives up to its claim and really does fit bulky phone cases braided cable is sturdy but flexible i think it stays a little more flexible in the cold weather which is nice definitely getting a few more in the future |
| R10003OLR2P5UE | I've gone through three pairs of these in the last two years. I am in love with the sound quality, and even though I know it's not the best I particularly love how the bass sounds. They're comfortable to wear and very isolating. With these headphones, you don't even need noise canceling. There is very little sound leak, unless you like to listen to music ridiculously loud. All in all, I was very impressed with these. They're without a doubt the best sounding headphones I've ever owned.<br /><br />Now, the problem: The wires are thin and stringy, and do NOT last. On my first pair, the part of the wire that connected to the left cup came apart. I'm not an abuser of headphones, either. On the other two pairs, they wire at the base next to the adapter came apart. I went at them with a soldering iron, desperately trying to make them last as long as I could, but they'd always crap out on me again. The sound quality distorts over time, and the foam around the cups is cheap and wears out quickly.<br /><br />They aren't worth the price for such bad quality. I'd suggest looking around for other pairs, Sony, Denon, and Sennheiser all have superior headphones for a similar price. I myself just ordered a pair of Denon AHD1001's, and here's hoping they last longer! | ive gone through three pairs of these in the last two years i am in love with the sound quality and even though i know its not the best i particularly love how the bass sounds theyre comfortable to wear and very isolating with these headphones you dont even need noise canceling there is very little sound leak unless you like to listen to music ridiculously loud all in all i was very impressed with these theyre without a doubt the best sounding headphones ive ever ownedbr br now the problem the wires are thin and stringy and do not last on my first pair the part of the wire that connected to the left cup came apart im not an abuser of headphones either on the other two pairs they wire at the base next to the adapter came apart i went at them with a soldering iron desperately trying to make them last as long as i could but theyd always crap out on me again the sound quality distorts over time and the foam around the cups is cheap and wears out quicklybr br they arent worth the price for such bad quality id suggest looking around for other pairs sony denon and sennheiser all have superior headphones for a similar price i myself just ordered a pair of denon ahds and heres hoping they last longer |
| R10005O193PJ6W | stopped working after a while, changed batteries, it worked for a few days, then it quit | stopped working after a while changed batteries it worked for a few days then it quit |
| R10008LR7CU84N | I ordered this cable and it doesn't work when I contacted them they told me I was doing something wrong. I then had my dad who it a certified computer tech look at it and there is something wrong with the cable. When I told them they never responded to me again. | i ordered this cable and it doesnt work when i contacted them they told me i was doing something wrong i then had my dad who it a certified computer tech look at it and there is something wrong with the cable when i told them they never responded to me again |
| R10009JN2UWOJC | Have not owned it that long however it has the features , feel and works like a quality unit that would be at a much higher price point                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | have not owned it that long however it has the features  feel and works like a quality unit that would be at a much higher price point                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| R1000AMVKPW32O | Bought for a gift and it is just what was needed to mount the new 32&#34; TV outdoors. The fact that it has full motion swing makes it even better because we can move it around to see it from different angles and still have a sturdy mount.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | bought for a gift and it is just what was needed to mount the new  tv outdoors the fact that it has full motion swing makes it even better because we can move it around to see it from different angles and still have a sturdy mount                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| R1000CJMO2L8X4 | Perfect for the gym                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | perfect for the gym                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| R1000EDGJUU3CU | Love these !!! The sound quality is amazing ! The price was amazing especially for the quality.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | love these  the sound quality is amazing  the price was amazing especially for the quality                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| R1000EG9XXBLXT | I have had good success with these disks, and have used hundreds of them successfully on both computers and a dedicated Panosonic DVD recorder. They seem very reliable, and the lines on the disk label help to keep labeling neat and straight.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | i have had good success with these disks and have used hundreds of them successfully on both computers and a dedicated panosonic dvd recorder they seem very reliable and the lines on the disk label help to keep labeling neat and straight                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |

<div style="text-align: right"><font size=4> <i class="fa fa-check-square-o" aria-hidden="true" style="color:#00586D"></i></font></div>


---
### <font color="#004D7F"> <i class="fa fa-pencil-square-o" aria-hidden="true" style="color:#004D7F"></i> Task 7: Deriving the sentiment label </font>
<br>

Create the `sentiment` variable from the number of stars, assuming that a review with fewer than 3 stars (< 3) is negative. Use 1 for positive sentiment and 0 for negative sentiment. To derive the variable from the star rating you can use Spark's `when` function. Show the result for the first 10 reviews ordered by `review_id`.

In [11]:
# Write your solution here
# Label reviews with fewer than 3 stars as negative (0) and the rest as positive (1).
review_df_rep_sentiment = (
    reviews_df_rep
    .select(
        sqlf.col('review_id'),
        sqlf.col('review_body'),
        sqlf.col('star_rating'),
        sqlf.when(sqlf.col('star_rating') < 3, 0).otherwise(1).alias('sentiment'),
    )
    .orderBy('review_id')
)
review_df_rep_sentiment.show(10)


+--------------+--------------------+-----------+---------+
|     review_id|         review_body|star_rating|sentiment|
+--------------+--------------------+-----------+---------+
|R10000WMGXS51T|Great little emer...|          5|        1|
|R10001L4QTCA84|Lives up to its c...|          5|        1|
|R10003OLR2P5UE|I've gone through...|          3|        1|
|R10005O193PJ6W|stopped working a...|          3|        1|
|R10008LR7CU84N|I ordered this ca...|          1|        0|
|R10009JN2UWOJC|Have not owned it...|          5|        1|
|R1000AMVKPW32O|Bought for a gift...|          5|        1|
|R1000CJMO2L8X4| Perfect for the gym|          5|        1|
|R1000EDGJUU3CU|Love these !!! Th...|          5|        1|
|R1000EG9XXBLXT|I have had good s...|          5|        1|
+--------------+--------------------+-----------+---------+
only showing top 10 rows



                                                                                

**Expected result**:

| review_id      | review_body                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |   star_rating |   sentiment |
|:---------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------:|------------:|
| R10000WMGXS51T | Great little emergency radio.  Very good reception.  The<br />weather band is a feature.  Can't beat the quality for this price/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |             5 |           1 |
| R10001L4QTCA84 | Lives up to its claim, and really does fit bulky phone cases. Braided cable is sturdy but flexible. I think it stays a little more flexible in the cold weather, which is nice. Definitely getting a few more in the future!                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |             5 |           1 |
| R10003OLR2P5UE | I've gone through three pairs of these in the last two years. I am in love with the sound quality, and even though I know it's not the best I particularly love how the bass sounds. They're comfortable to wear and very isolating. With these headphones, you don't even need noise canceling. There is very little sound leak, unless you like to listen to music ridiculously loud. All in all, I was very impressed with these. They're without a doubt the best sounding headphones I've ever owned.<br /><br />Now, the problem: The wires are thin and stringy, and do NOT last. On my first pair, the part of the wire that connected to the left cup came apart. I'm not an abuser of headphones, either. On the other two pairs, they wire at the base next to the adapter came apart. I went at them with a soldering iron, desperately trying to make them last as long as I could, but they'd always crap out on me again. The sound quality distorts over time, and the foam around the cups is cheap and wears out quickly.<br /><br />They aren't worth the price for such bad quality. I'd suggest looking around for other pairs, Sony, Denon, and Sennheiser all have superior headphones for a similar price. I myself just ordered a pair of Denon AHD1001's, and here's hoping they last longer! |             3 |           1 |
| R10005O193PJ6W | stopped working after a while, changed batteries, it worked for a few days, then it quit                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |             3 |           1 |
| R10008LR7CU84N | I ordered this cable and it doesn't work when I contacted them they told me I was doing something wrong. I then had my dad who it a certified computer tech look at it and there is something wrong with the cable. When I told them they never responded to me again.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |             1 |           0 |
| R10009JN2UWOJC | Have not owned it that long however it has the features , feel and works like a quality unit that would be at a much higher price point                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |             5 |           1 |
| R1000AMVKPW32O | Bought for a gift and it is just what was needed to mount the new 32&#34; TV outdoors. The fact that it has full motion swing makes it even better because we can move it around to see it from different angles and still have a sturdy mount.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |             5 |           1 |
| R1000CJMO2L8X4 | Perfect for the gym                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |             5 |           1 |
| R1000EDGJUU3CU | Love these !!! The sound quality is amazing ! The price was amazing especially for the quality.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |             5 |           1 |
| R1000EG9XXBLXT | I have had good success with these disks, and have used hundreds of them successfully on both computers and a dedicated Panosonic DVD recorder. They seem very reliable, and the lines on the disk label help to keep labeling neat and straight.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |             5 |           1 |







<div style="text-align: right"><font size=4> <i class="fa fa-check-square-o" aria-hidden="true" style="color:#00586D"></i></font></div>

-----

---
### <font color="#004D7F"> <i class="fa fa-pencil-square-o" aria-hidden="true" style="color:#004D7F"></i> Task 8: Splitting the dataset </font>
<br>

Split the dataset into training (70% of the data) and test (30% of the data). Then save the test data to the previously created S3 bucket (use the name `electronics_test`).

In [12]:
# Escribe tu solución aquí
joinExpression = review_df_rep_clean["review_id"] == review_df_rep_sentiment["review_id"]
joinType = "left"

df_electronics = (
    review_df_rep_clean
    .join(review_df_rep_sentiment, joinExpression, joinType)
    .select(
        review_df_rep_clean['review_id'],
        sqlf.col('clean_review_body'),
        sqlf.col('sentiment')        
    )
)

df_train, df_test = df_electronics.randomSplit([0.7, 0.3], seed=42)

#Al cambiar la practica ahora se guardaria en local.
df_test.write.format('parquet').mode('overwrite').save('data/electronics_test.parquet')
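
The `randomSplit` call above assigns rows at random, so the resulting sizes are approximately, not exactly, 70% and 30%. The following plain-Python sketch illustrates the idea under that assumption; `random_split` is a hypothetical helper written for this note and is not Spark's implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing one uniform number per row."""
    total = sum(weights)
    # Cumulative normalized boundaries, e.g. roughly [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding at the top end.
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(10_000))
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Because each row is decided independently, the two parts never overlap and always cover the whole input, while the exact counts vary with the seed.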



<div style="text-align: right"><font size=4> <i class="fa fa-check-square-o" aria-hidden="true" style="color:#00586D"></i></font></div>

-----

Next we train the model, trying several preprocessing options. To train a sentiment classifier we first need a representation of the text that the model can learn from. For this we can use feature extraction algorithms such as TF-IDF or Word2Vec, both implemented in Spark MLlib, which transform a text string into a vector that can be used as training data for a classifier.
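
As an illustration of how text becomes a vector, here is a minimal plain-Python sketch of hashed TF-IDF. It is not Spark's code: `zlib.crc32` stands in for the MurmurHash3 function that `HashingTF` uses, and the helper names and toy documents are hypothetical. The IDF weight follows the smoothed formula from the Spark documentation, log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term.

```python
import math
import zlib
from collections import Counter

def hashed_tf(tokens, num_features):
    """Count tokens per hash bucket (the hashing trick)."""
    # crc32 is a stand-in hash chosen for this sketch, not the one Spark uses.
    counts = Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in tokens)
    return [counts.get(i, 0) for i in range(num_features)]

def fit_idf(tf_vectors):
    """Smoothed inverse document frequency per bucket: log((m + 1) / (df + 1))."""
    m = len(tf_vectors)
    num_features = len(tf_vectors[0])
    doc_freq = [sum(1 for v in tf_vectors if v[i] > 0) for i in range(num_features)]
    return [math.log((m + 1) / (df + 1)) for df in doc_freq]

def tf_idf(tf_vector, idf):
    """Scale each bucket count by its IDF weight."""
    return [tf * w for tf, w in zip(tf_vector, idf)]

# Toy tokenized documents, invented for this example.
docs = [
    ["the", "disks", "work", "great"],
    ["the", "recorder", "stopped", "working"],
    ["the", "labels", "look", "great"],
]
NUM_FEATURES = 8
tf_vectors = [hashed_tf(d, NUM_FEATURES) for d in docs]
idf = fit_idf(tf_vectors)
features = [tf_idf(v, idf) for v in tf_vectors]
for row in features:
    print([round(x, 3) for x in row])
```

A word that appears in every document, such as "the" here, gets an IDF weight of zero. When the number of buckets is smaller than the vocabulary, distinct words necessarily share a bucket, so the choice of `numFeatures` in `HashingTF` trades memory against collisions.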

---
### <font color="#004D7F"> <i class="fa fa-pencil-square-o" aria-hidden="true" style="color:#004D7F"></i> Task 8: Sentiment model </font>
<br>

Build a training **pipeline** for a sentiment model from the data prepared above. You will need to chain several **transformers** and **estimators**:
- `Tokenizer` builds a vector of words from each sentence
- `StopWordsRemover` removes the least meaningful words from those word vectors

- Feature construction, two alternatives:
    - TF-IDF model using `HashingTF` and `IDF`
    - `Word2Vec` builds a vector from the list of words

- Binary classification on the sentiment variable used so far: apply a classifier (`LogisticRegression`, `DecisionTreeClassifier`) and avoid ensembles because of their long training time.
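
The first stages of this chain can be sketched in plain Python to show what each one contributes. Everything below is hypothetical: the stop-word list and the 3-dimensional word vectors are made up, and treating unknown words as zero vectors is an assumption of this sketch. It only mirrors the documented behaviour that `Tokenizer` lowercases and splits on whitespace and that the `Word2Vec` model represents a document as the average of its word vectors.

```python
# Hypothetical stop-word list and 3-dimensional word vectors, for illustration only.
STOP_WORDS = {"this", "is", "a", "the", "and"}
EMBEDDINGS = {
    "great": [0.9, 0.1, 0.0],
    "product": [0.1, 0.8, 0.2],
    "terrible": [-0.9, 0.1, 0.1],
}
VECTOR_SIZE = 3

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def document_vector(tokens):
    """Average the word vectors; unknown words count as zero vectors (sketch assumption)."""
    if not tokens:
        return [0.0] * VECTOR_SIZE
    total = [0.0] * VECTOR_SIZE
    for token in tokens:
        for i, x in enumerate(EMBEDDINGS.get(token, [0.0] * VECTOR_SIZE)):
            total[i] += x
    return [x / len(tokens) for x in total]

words = remove_stop_words(tokenize("This is a GREAT product"))
print(words)
print(document_vector(words))
```

The resulting fixed-length vector is what the classifier at the end of the pipeline receives as `features`.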

Search the documentation for each of these elements and connect them in a pipeline together with a classification algorithm.

- We recommend working on a sample (`sample` method), since training time can otherwise be excessive
- Hyperparameter tuning is possible, but it can also be quite slow

**Validate the model on the test set from the previous task using the area under the ROC curve**
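
To make the metric concrete: the area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as one half. The sketch below computes it by comparing every positive/negative pair. It is an equivalent definition for small inputs, not how `BinaryClassificationEvaluator` computes it, and `roc_auc` is a hypothetical helper.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
print(roc_auc(labels, [0.1, 0.4, 0.35, 0.8]))  # one of the four pairs is misordered
print(roc_auc(labels, [0.5, 0.5, 0.5, 0.5]))   # identical scores for every row
```

A model that separates the classes perfectly reaches 1.0, while one that gives every row the same score gets 0.5, the same value as random guessing.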

In [13]:
# Train on a 20% sample of the training split to keep fitting time manageable
df_train_sample = df_train.sample(fraction=0.2, seed=42)

In [14]:
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import HashingTF, IDF
from pyspark.ml.feature import Word2Vec
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [15]:
tokenizer = Tokenizer(inputCol="clean_review_body", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="words_filtered")

hashingTF = HashingTF(inputCol="words_filtered", outputCol="rawFeatures", numFeatures=20)
idf = IDF(inputCol="rawFeatures", outputCol="features")

word2Vec = Word2Vec(vectorSize = 50, maxSentenceLength =100, seed=42, inputCol="words_filtered", outputCol="features")

lr = LogisticRegression(featuresCol = 'features', labelCol = 'sentiment')
# Settings reduced so that training does not crash from lack of memory
dt = DecisionTreeClassifier(maxDepth=3, maxBins=8, maxMemoryInMB=128, cacheNodeIds=False, featuresCol = 'features', labelCol = 'sentiment')

### Word2Vec + Logistic Regression

In [16]:
word2vec_lr_pipeline = Pipeline(stages = [tokenizer, remover, word2Vec, lr])

In [17]:
word2vec_lr_model = word2vec_lr_pipeline.fit(df_train_sample)

23/08/11 05:34:56 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS

In [18]:
word2vec_lr_prediction = word2vec_lr_model.transform(df_test)

In [19]:
evaluator = BinaryClassificationEvaluator(labelCol="sentiment", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
word2vec_lr_roc_auc = evaluator.evaluate(word2vec_lr_prediction)
print(f"Area under the ROC curve for Word2Vec + LR: {word2vec_lr_roc_auc}")

Area under the ROC curve for Word2Vec + LR: 0.9121236637202818


### Word2Vec + Decision Tree

In [20]:
word2vec_dt_pipeline = Pipeline(stages = [tokenizer, remover, word2Vec, dt])

In [21]:
word2vec_dt_model = word2vec_dt_pipeline.fit(df_train_sample)

In [22]:
word2vec_dt_prediction = word2vec_dt_model.transform(df_test)

In [23]:
evaluator = BinaryClassificationEvaluator(labelCol="sentiment", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
word2vec_dt_roc_auc = evaluator.evaluate(word2vec_dt_prediction)
print(f"Area under the ROC curve for Word2Vec + DT: {word2vec_dt_roc_auc}")

### TF-IDF + Logistic Regression

In [24]:
tfidf_lr_pipeline = Pipeline(stages = [tokenizer, remover, hashingTF, idf, lr])

In [25]:
tfidf_lr_model = tfidf_lr_pipeline.fit(df_train_sample)

In [26]:
tfidf_lr_prediction = tfidf_lr_model.transform(df_test)

In [27]:
evaluator = BinaryClassificationEvaluator(labelCol="sentiment", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
tfidf_lr_roc_auc = evaluator.evaluate(tfidf_lr_prediction)
print(f"Area under the ROC curve for TF-IDF + LR: {tfidf_lr_roc_auc}")

### TF-IDF + Decision Tree

In [28]:
tfidf_dt_pipeline = Pipeline(stages = [tokenizer, remover, hashingTF, idf, dt])

In [29]:
tfidf_dt_model = tfidf_dt_pipeline.fit(df_train_sample)

In [30]:
tfidf_dt_prediction = tfidf_dt_model.transform(df_test)

In [31]:
evaluator = BinaryClassificationEvaluator(labelCol="sentiment", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
tfidf_dt_roc_auc = evaluator.evaluate(tfidf_dt_prediction)
print(f"Area under the ROC curve for TF-IDF + DT: {tfidf_dt_roc_auc}")

---
### <font color="#004D7F"> <i class="fa fa-pencil-square-o" aria-hidden="true" style="color:#004D7F"></i> Task 9: Serialization </font>
<br>

Save the trained model locally, in a new directory for models, `../models/pipeline_model`, using Spark's native option.

##### The best area under the curve is obtained with Word2Vec + Logistic Regression
- Area under the ROC curve for Word2Vec + LR: 0.9121229600079983

- Area under the ROC curve for Word2Vec + DT: 0.6978580238168527

- Area under the ROC curve for TF-IDF + LR: 0.6044488701825967

- Area under the ROC curve for TF-IDF + DT: 0.5

In [32]:
# Write your solution here
word2vec_lr_model.write().overwrite().save('models/pipeline_model')

                                                                                