# Spark DataFrames

In [4]:
!pwd;ls -l Data-ML-100k--master/

/Users/facradri/Dropbox/Tech/apps/Python/PySpark/pySparkTutorial
total 9632
-rwxr-xr-x@  1 facradri  LL\Domain Users       74 Aug 25  2018 README.md
drwxr-x---@ 25 facradri  LL\Domain Users      800 Jan 29  2016 ml-100k
-rwxr-xr-x@  1 facradri  LL\Domain Users  4924029 Aug 25  2018 ml-100k.zip


In [10]:
! head -n 3 Data-ML-100k--master/ml-100k/u.data

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116


In [11]:
# u.data is a tab-separated file with no header row
ratings = spark.read.load(
    "Data-ML-100k--master/ml-100k/u.data",
    format="csv", sep="\t", inferSchema="true", header="false",
)
ratings.show(3)

+---+---+---+---------+
|_c0|_c1|_c2|      _c3|
+---+---+---+---------+
|196|242|  3|881250949|
|186|302|  3|891717742|
| 22|377|  1|878887116|
+---+---+---+---------+
only showing top 3 rows



### Change Column names / Add header
Renaming columns is almost always needed when a file has no header. `toDF` takes the new names as separate arguments, so don't forget the `*` in front of the list.

In [18]:
ratings = ratings.toDF(*['user_id', 'movie_id', 'rating', 'unix_timestamp'])
ratings.show(3)

+-------+--------+------+--------------+
|user_id|movie_id|rating|unix_timestamp|
+-------+--------+------+--------------+
|    196|     242|     3|     881250949|
|    186|     302|     3|     891717742|
|     22|     377|     1|     878887116|
+-------+--------+------+--------------+
only showing top 3 rows
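The `*` matters because `toDF` expects each column name as its own positional argument rather than one list. The sketch below is plain Python, not Spark: `to_df_like` is a hypothetical stand-in that only mimics the `*cols` style of signature, to show what the star does to the call.

```python
def to_df_like(*cols):
    """Hypothetical stand-in that collects positional arguments like toDF(*cols)."""
    return cols

names = ['user_id', 'movie_id', 'rating', 'unix_timestamp']

# With the star, the list is unpacked into four separate arguments.
unpacked = to_df_like(*names)
print(len(unpacked))

# Without it, the whole list arrives as one single argument.
not_unpacked = to_df_like(names)
print(len(not_unpacked))
```

In the first call the function receives four strings; in the second it receives one list, which is not what a function expecting one name per argument wants.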



### Some basic stats

In [21]:
# print(ratings.count())  # row count
print(len(ratings.columns))  # column count

4


In [23]:
# ratings.describe()

In [24]:
ratings.select('user_id','movie_id').show(3)

+-------+--------+
|user_id|movie_id|
+-------+--------+
|    196|     242|
|    186|     302|
|     22|     377|
+-------+--------+
only showing top 3 rows



### Filter
Filter a dataframe using multiple conditions:

In [26]:
ratings.filter((ratings.rating==5) & (ratings.user_id==253)).show(3)

+-------+--------+------+--------------+
|user_id|movie_id|rating|unix_timestamp|
+-------+--------+------+--------------+
|    253|     465|     5|     891628467|
|    253|     510|     5|     891628416|
|    253|     183|     5|     891628341|
+-------+--------+------+--------------+
only showing top 3 rows

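Each condition in the filter above sits in its own parentheses because Python's `&` binds more tightly than `==`. The sketch below uses only the standard-library `ast` module, with the bare names `rating` and `user_id` standing in for the columns, to show how the two spellings are parsed. It does not run Spark.

```python
import ast

# Parse both spellings without evaluating them.
without_parens = ast.parse("rating == 5 & user_id == 253", mode="eval").body
with_parens = ast.parse("(rating == 5) & (user_id == 253)", mode="eval").body

# Without parentheses Python sees a chained comparison:
#   rating == (5 & user_id) == 253
print(type(without_parens).__name__)
print(type(without_parens.comparators[0]).__name__)

# With parentheses the top-level node is the & of two comparisons.
print(type(with_parens).__name__)
print(type(with_parens.left).__name__)
```

The same precedence rule applies to `|` (or) and `~` (not) when combining column conditions.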


### Groupby
The `groupBy` method works on a Spark DataFrame much like a pandas `groupby`, except that the aggregate functions come from `pyspark.sql.functions`, which you need to import.

In [27]:
from pyspark.sql import functions as F
ratings.groupBy("user_id").agg(F.count("user_id"),F.mean("rating")).show(3)

+-------+--------------+------------------+
|user_id|count(user_id)|       avg(rating)|
+-------+--------------+------------------+
|    148|            65|               4.0|
|    463|           133|2.8646616541353382|
|    471|            31|3.3870967741935485|
+-------+--------------+------------------+
only showing top 3 rows



### Sort

In [29]:
ratings.sort("user_id").show(5)

+-------+--------+------+--------------+
|user_id|movie_id|rating|unix_timestamp|
+-------+--------+------+--------------+
|      1|      33|     4|     878542699|
|      1|     202|     5|     875072442|
|      1|     160|     4|     875072547|
|      1|      61|     4|     878542420|
|      1|     189|     3|     888732928|
+-------+--------+------+--------------+
only showing top 5 rows



In [30]:
# descending Sort
from pyspark.sql import functions as F
ratings.sort(F.desc("user_id")).show(5)

+-------+--------+------+--------------+
|user_id|movie_id|rating|unix_timestamp|
+-------+--------+------+--------------+
|    943|     570|     1|     888640125|
|    943|     186|     5|     888639478|
|    943|     232|     4|     888639867|
|    943|      58|     4|     888639118|
|    943|    1067|     2|     875501756|
+-------+--------+------+--------------+
only showing top 5 rows



## Joins/Merging with Spark Dataframes

Spark DataFrames have no `merge` method like pandas; the DataFrame API equivalent is `join`. We can also run SQL against DataFrames, so another way to merge them is with a SQL query.

Let us try to run some SQL on `ratings`.

We first register the `ratings` DataFrame as a temporary view named `ratings_table`, on which we can run SQL queries.

The result of a SQL select statement is again a Spark DataFrame.

In [34]:
# createOrReplaceTempView replaces registerTempTable, deprecated since Spark 2.0
ratings.createOrReplaceTempView('ratings_table')

#newDF = spark.sql('select * from ratings_table where rating>4')

#newDF.show(5)
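A merge expressed in SQL is an ordinary `JOIN` between two registered tables. The sketch below is a stand-in under stated assumptions: it uses Python's built-in `sqlite3` instead of Spark so that it runs without a cluster, and `movies_table`, its `title` column, and the titles themselves are made up for illustration. The three rating rows are the ones shown at the top of this section. With Spark, a query of this shape would be passed to `spark.sql` once both DataFrames are registered.

```python
import sqlite3

# In-memory SQLite database: a stand-in for Spark's temporary views.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings_table (user_id INT, movie_id INT, rating INT)")
# Hypothetical second table; names and titles are illustrative only.
conn.execute("CREATE TABLE movies_table (movie_id INT, title TEXT)")

conn.executemany(
    "INSERT INTO ratings_table VALUES (?, ?, ?)",
    [(196, 242, 3), (186, 302, 3), (22, 377, 1)],
)
conn.executemany(
    "INSERT INTO movies_table VALUES (?, ?)",
    [(242, "Movie A"), (302, "Movie B")],
)

# Inner join on the shared key, like pandas merge with how='inner'.
query = """
    SELECT r.user_id, m.title, r.rating
    FROM ratings_table r
    JOIN movies_table m ON r.movie_id = m.movie_id
    ORDER BY r.user_id
"""
joined = conn.execute(query).fetchall()
print(joined)
conn.close()
```

The rating for movie 377 has no matching row in the hypothetical `movies_table`, so the inner join drops it and only two rows come back.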