# PySpark workaround for ML with `sparklyr`

This notebook follows on from [this one](./sparklyr_test2.ipynb), where we hit a problem: `sparklyr` gives us no method to unnest array columns.  
Fortunately, PySpark can do it for us.  
We start by importing our data as a **Spark dataframe**:

In [2]:
type(sqlContext)

pyspark.sql.context.HiveContext

In [14]:
bin_reviews = sqlContext.read.json('amazon/bin_reviews.json')

In [5]:
bin_reviews.printSchema()

root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- label: double (nullable = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)



In [19]:
select_reviews = bin_reviews.select('reviewText', 'overall', 'label')
select_reviews.show(2)

+--------------------+-------+-----+
|          reviewText|overall|label|
+--------------------+-------+-----+
|Spiritually and m...|    5.0|  1.0|
|This is one my mu...|    5.0|  1.0|
+--------------------+-------+-----+
only showing top 2 rows



## Tokenizer

In [21]:
from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol="reviewText", outputCol="words")

In [22]:
tokenized_reviews = tokenizer.transform(select_reviews)
tokenized_reviews.show(2)

+--------------------+-------+-----+--------------------+
|          reviewText|overall|label|               words|
+--------------------+-------+-----+--------------------+
|Spiritually and m...|    5.0|  1.0|[spiritually, and...|
|This is one my mu...|    5.0|  1.0|[this, is, one, m...|
+--------------------+-------+-----+--------------------+
only showing top 2 rows
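
To make the `words` column easier to interpret, here is a minimal plain-Python sketch of what `Tokenizer` appears to do: lowercase the text and split it on whitespace. The function name `simple_tokenize` and the input sentence (reconstructed from the tokens printed further down) are assumptions for illustration, not part of the Spark API.

```python
import re

def simple_tokenize(text):
    # Lowercase, then split on runs of whitespace.
    # Punctuation is NOT stripped, so "inspiring!" stays a single token.
    return [tok for tok in re.split(r"\s+", text.lower().strip()) if tok]

tokens = simple_tokenize("Spiritually and mentally inspiring! A book that allows you to")
print(tokens)
```

Because punctuation stays attached to words, tokens such as `inspiring!` or `books.` survive into later steps. If that matters, `RegexTokenizer` from the same module is probably the better choice.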



## StopWordsRemover

In [23]:
from pyspark.ml.feature import StopWordsRemover
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered")

In [24]:
removed_reviews = remover.transform(tokenized_reviews)
removed_reviews.show(2)
sample_review = removed_reviews.first()
print(sample_review['words'][:10])
print(sample_review['filtered'][:10])

+--------------------+-------+-----+--------------------+--------------------+
|          reviewText|overall|label|               words|            filtered|
+--------------------+-------+-----+--------------------+--------------------+
|Spiritually and m...|    5.0|  1.0|[spiritually, and...|[spiritually, men...|
|This is one my mu...|    5.0|  1.0|[this, is, one, m...|[books., masterpi...|
+--------------------+-------+-----+--------------------+--------------------+
only showing top 2 rows

[u'spiritually', u'and', u'mentally', u'inspiring!', u'a', u'book', u'that', u'allows', u'you', u'to']
[u'spiritually', u'mentally', u'inspiring!', u'book', u'allows', u'question', u'morals', u'help', u'discover', u'really']
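
The filtering above can be mimicked in plain Python to see why some words disappear. This is only a sketch: `STOP_WORDS` is a hypothetical, tiny list chosen to match the sample, whereas the default list used by `StopWordsRemover` is much longer.

```python
# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"and", "a", "that", "you", "to"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    # Keep the original order; compare case-insensitively.
    return [w for w in words if w.lower() not in stop_words]

words = ["spiritually", "and", "mentally", "inspiring!", "a",
         "book", "that", "allows", "you", "to"]
filtered = remove_stop_words(words)
print(filtered)
```

Matching is on whole tokens, so a stop word with punctuation attached (for example `to.`) would not be removed in this sketch.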


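## explode

This is the step we were missing in `sparklyr`: `explode` turns each element of the `filtered` array into a row of its own.

In [39]: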
from pyspark.sql.functions import explode
# explode() emits one row per element of the `filtered` array
unnested_reviews = removed_reviews.select('reviewText', 'overall', 'label', 'filtered', explode("filtered").alias("word"))

In [40]:
unnested_reviews.show(5)

+--------------------+-------+-----+--------------------+-----------+
|          reviewText|overall|label|            filtered|       word|
+--------------------+-------+-----+--------------------+-----------+
|Spiritually and m...|    5.0|  1.0|[spiritually, men...|spiritually|
|Spiritually and m...|    5.0|  1.0|[spiritually, men...|   mentally|
|Spiritually and m...|    5.0|  1.0|[spiritually, men...| inspiring!|
|Spiritually and m...|    5.0|  1.0|[spiritually, men...|       book|
|Spiritually and m...|    5.0|  1.0|[spiritually, men...|     allows|
+--------------------+-------+-----+--------------------+-----------+
only showing top 5 rows
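
For readers new to `explode`, the following plain-Python sketch reproduces its row-level behaviour on a list of dictionaries. The helper `explode_rows` and the two sample rows are hypothetical and serve only to illustrate the idea; they do not touch Spark.

```python
def explode_rows(rows, array_col, new_col):
    # Emit one output row per element of row[array_col],
    # copying all other fields unchanged.
    out = []
    for row in rows:
        for item in row[array_col]:
            new_row = dict(row)
            new_row[new_col] = item
            out.append(new_row)
    return out

reviews = [
    {"overall": 5.0, "label": 1.0, "filtered": ["spiritually", "mentally", "inspiring!"]},
    {"overall": 5.0, "label": 1.0, "filtered": []},  # empty array: contributes no rows
]
unnested = explode_rows(reviews, "filtered", "word")
print(len(unnested))
print([r["word"] for r in unnested])
```

As with Spark's `explode`, a review whose array is empty disappears from the result. The other columns are repeated once per word, which is why the unnested dataframe is much larger than the original.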



We save the unnested dataframe so that we can use it in our small `sparklyr` pipeline.  
Saving takes a long time, so be patient!  

In [45]:
# Writes a directory of JSON part files; by default this fails if the path already exists.
unnested_reviews.write.json('amazon/unnested_reviews_json')
# unnested_reviews.write.save('amazon/unnested_reviews.json', format='json', mode='append')
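
The saved output is a directory of `part-*` files, each holding one JSON object per line. To check it without starting Spark, a small reader along these lines may help. The helper `read_json_lines_dir` and its `limit` parameter are hypothetical, and the sketch assumes the part files are uncompressed.

```python
import glob
import json
import os

def read_json_lines_dir(path, limit=None):
    # Read records from every part file in a Spark-style JSON output directory.
    # Stops early once `limit` records have been collected (if limit is given).
    records = []
    for part in sorted(glob.glob(os.path.join(path, "part-*"))):
        with open(part) as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                records.append(json.loads(line))
                if limit is not None and len(records) >= limit:
                    return records
    return records

# Returns an empty list if the directory does not exist yet.
sample = read_json_lines_dir("amazon/unnested_reviews_json", limit=5)
print(len(sample))
```

Using `limit` keeps the check cheap, since the full unnested output can be large.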

## References

- [PySpark cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf)