# Ex - GroupBy

### Introduction:

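GroupBy follows the split-apply-combine pattern: split the rows into groups by a key, apply an aggregation to each group, and combine the per-group results into a single table.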

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Check out this [Diagram](http://i.imgur.com/yjNkiwL.png)  
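The split-apply-combine idea can be sketched in plain Python before turning to Spark. This is only an illustration: the rows below are made-up values, not taken from the dataset, and the names `rows`, `groups` and `avg_beer` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (continent, beer_servings). Values are made up for illustration.
rows = [
    ("EU", 200),
    ("EU", 100),
    ("AS", 10),
    ("AS", 30),
    ("AF", 60),
]

# Split: bucket the values by their grouping key.
groups = defaultdict(list)
for continent, beer in rows:
    groups[continent].append(beer)

# Apply: reduce each bucket to a single value (here, the mean).
# Combine: collect the per-group results into one mapping.
avg_beer = {continent: mean(values) for continent, values in groups.items()}

print(avg_beer)
```

`groupBy(...).agg(...)` in the steps below expresses the same three stages, with Spark doing the bucketing and the reduction.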
### Step 1. Import the necessary libraries

In [7]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func

In [2]:
spark = SparkSession.builder.appName("exercise-3").getOrCreate()



### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv). 

### Step 3. Assign it to a variable called drinks.

In [3]:
drinks = spark.read.option("header", True).option("inferSchema", True).csv("./../../datasets/alcohol.csv")

In [4]:
drinks.show()

+-----------------+-------------+---------------+-------------+----------------------------+---------+
|          country|beer_servings|spirit_servings|wine_servings|total_litres_of_pure_alcohol|continent|
+-----------------+-------------+---------------+-------------+----------------------------+---------+
|      Afghanistan|            0|              0|            0|                         0.0|       AS|
|          Albania|           89|            132|           54|                         4.9|       EU|
|          Algeria|           25|              0|           14|                         0.7|       AF|
|          Andorra|          245|            138|          312|                        12.4|       EU|
|           Angola|          217|             57|           45|                         5.9|       AF|
|Antigua & Barbuda|          102|            128|           45|                         4.9|       NA|
|        Argentina|          193|             25|          221|          

In [5]:
drinks.printSchema()

root
 |-- country: string (nullable = true)
 |-- beer_servings: integer (nullable = true)
 |-- spirit_servings: integer (nullable = true)
 |-- wine_servings: integer (nullable = true)
 |-- total_litres_of_pure_alcohol: double (nullable = true)
 |-- continent: string (nullable = true)



### Step 4. Which continent drinks more beer on average?

In [10]:
drinks.groupBy("continent").agg(func.avg("beer_servings").alias("avg_beers")).sort("avg_beers", ascending=False).show()

+---------+------------------+
|continent|         avg_beers|
+---------+------------------+
|       EU|193.77777777777777|
|       SA|175.08333333333334|
|       NA|145.43478260869566|
|       OC|           89.6875|
|       AF|61.471698113207545|
|       AS| 37.04545454545455|
+---------+------------------+



### Step 5. For each continent print the statistics for wine consumption.

In [13]:
# GroupedData has no describe() method, so the statistics are computed explicitly
drinks.groupBy("continent").agg(
    func.avg("wine_servings"),
    func.max("wine_servings"),
    func.min("wine_servings"),
    func.sum("wine_servings")
).show()

+---------+------------------+------------------+------------------+------------------+
|continent|avg(wine_servings)|max(wine_servings)|min(wine_servings)|sum(wine_servings)|
+---------+------------------+------------------+------------------+------------------+
|       NA| 24.52173913043478|               100|                 1|               564|
|       SA|62.416666666666664|               221|                 1|               749|
|       AS| 9.068181818181818|               123|                 0|               399|
|       OC|            35.625|               212|                 0|               570|
|       EU|142.22222222222223|               370|                 0|              6400|
|       AF|16.264150943396228|               233|                 0|               862|
+---------+------------------+------------------+------------------+------------------+



### Step 6. Print the mean alcohol consumption per continent for every column

In [15]:
drinks.groupBy("continent").agg(
    func.avg("beer_servings").alias("avg_beer"),
    func.avg("spirit_servings").alias("avg_spirit"),
    func.avg("wine_servings").alias("avg_wine")
).show()

+---------+------------------+------------------+------------------+
|continent|          avg_beer|        avg_spirit|          avg_wine|
+---------+------------------+------------------+------------------+
|       NA|145.43478260869566| 165.7391304347826| 24.52173913043478|
|       SA|175.08333333333334|            114.75|62.416666666666664|
|       AS| 37.04545454545455| 60.84090909090909| 9.068181818181818|
|       OC|           89.6875|           58.4375|            35.625|
|       EU|193.77777777777777|132.55555555555554|142.22222222222223|
|       AF|61.471698113207545|16.339622641509433|16.264150943396228|
+---------+------------------+------------------+------------------+



### Step 7. Print the median alcohol consumption per continent for every column

In [17]:
drinks.groupBy("continent").agg(
    func.percentile_approx("beer_servings", 0.5).alias("median_beer"),
    func.percentile_approx("spirit_servings", 0.5).alias("median_spirit"),
    func.percentile_approx("wine_servings", 0.5).alias("median_wine")
).show()

+---------+-----------+-------------+-----------+
|continent|median_beer|median_spirit|median_wine|
+---------+-----------+-------------+-----------+
|       NA|        143|          137|         11|
|       SA|        162|          100|          8|
|       AS|         16|           16|          1|
|       OC|         49|           35|          8|
|       EU|        219|          122|        128|
|       AF|         32|            3|          2|
+---------+-----------+-------------+-----------+
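All medians above are integers, which suggests that `percentile_approx` returns a value that actually occurs in the column instead of averaging the two middle values when a group has an even number of rows. The plain-Python sketch below illustrates that distinction only; it is not Spark code, the list is made up, and the assumption that `percentile_approx` behaves like `median_low` here should be checked against the PySpark documentation.

```python
from statistics import median, median_low

# Hypothetical group with an even number of values.
values = [1, 2, 3, 4]

# Interpolated median: the average of the two middle values.
interpolated = median(values)

# Element-returning median: the lower of the two middle values.
element = median_low(values)

print(interpolated, element)
```

When comparing these results with a tool that interpolates, small differences for even-sized groups are therefore expected.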



### Step 8. Print the mean, min and max values for spirit consumption.
#### This time output a DataFrame

In [19]:
drinks.groupBy("continent").agg(
    func.mean("spirit_servings").alias("mean_spirit"),
    func.max("spirit_servings").alias("max_spirit"),
    func.min("spirit_servings").alias("min_spirit")
).show()

+---------+------------------+----------+----------+
|continent|       mean_spirit|max_spirit|min_spirit|
+---------+------------------+----------+----------+
|       NA| 165.7391304347826|       438|        68|
|       SA|            114.75|       302|        25|
|       AS| 60.84090909090909|       326|         0|
|       OC|           58.4375|       254|         0|
|       EU|132.55555555555554|       373|         0|
|       AF|16.339622641509433|       152|         0|
+---------+------------------+----------+----------+

