# Aggregating DataFrames in PySpark HW

First, let's start up our PySpark instance.

In [1]:

import pyspark  # only run after findspark.init()
from pyspark.sql import SparkSession

# May take a while locally
spark = SparkSession.builder.appName("aggregate").getOrCreate()

# Counts block-manager entries (driver and executors) via the private _jsc handle,
# which is used here as a stand-in for the number of cores
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print("You are working with", cores, "core(s)")
spark

You are working with 1 core(s)


## Read in the DataFrame for this notebook

In [2]:
airbnb = spark.read.csv('Datasets/nyc_air_bnb.csv',inferSchema=True,header=True)

## About this dataset

This dataset describes the listing activity and metrics for Airbnb in New York City, NY for 2019. Each row in the dataset is a listing.

**Source:** https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data/data

Let's view the dataset so we know what we are working with. Note that `toPandas()` collects every row to the driver; the display below is truncated to the first and last records.

In [3]:
airbnb.toPandas()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365.0
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355.0
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365.0
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194.0
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.10,1,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49074,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9.0
49075,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36.0
49076,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27.0
49077,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2.0


Now print the schema so we can make sure all the variables have the correct types.

In [4]:
airbnb.printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- host_id: string (nullable = true)
 |-- host_name: string (nullable = true)
 |-- neighbourhood_group: string (nullable = true)
 |-- neighbourhood: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- room_type: string (nullable = true)
 |-- price: string (nullable = true)
 |-- minimum_nights: string (nullable = true)
 |-- number_of_reviews: string (nullable = true)
 |-- last_review: string (nullable = true)
 |-- reviews_per_month: string (nullable = true)
 |-- calculated_host_listings_count: string (nullable = true)
 |-- availability_365: integer (nullable = true)



Notice that several columns that are obviously numeric, such as `price`, `minimum_nights`, and `number_of_reviews`, were inferred as strings. Let's cast them to numeric types; otherwise we cannot aggregate them.

In [5]:
from pyspark.sql.types import *
from pyspark.sql.functions import *  # note: shadows Python's built-in max, min, sum, etc.

# Cast the numeric columns that were read as strings.
# Caveat: reviews_per_month is fractional (e.g. 0.21), so IntegerType truncates it;
# DoubleType would preserve the decimals.
airbnbNew = airbnb.withColumn("price", airbnb["price"].cast(IntegerType())) \
                    .withColumn("minimum_nights", airbnb["minimum_nights"].cast(IntegerType())) \
                    .withColumn("number_of_reviews", airbnb["number_of_reviews"].cast(IntegerType())) \
                    .withColumn("reviews_per_month", airbnb["reviews_per_month"].cast(IntegerType())) \
                    .withColumn("calculated_host_listings_count", airbnb["calculated_host_listings_count"].cast(IntegerType()))

In [6]:
airbnbNew.printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- host_id: string (nullable = true)
 |-- host_name: string (nullable = true)
 |-- neighbourhood_group: string (nullable = true)
 |-- neighbourhood: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- room_type: string (nullable = true)
 |-- price: integer (nullable = true)
 |-- minimum_nights: integer (nullable = true)
 |-- number_of_reviews: integer (nullable = true)
 |-- last_review: string (nullable = true)
 |-- reviews_per_month: integer (nullable = true)
 |-- calculated_host_listings_count: integer (nullable = true)
 |-- availability_365: integer (nullable = true)



### Alright, now we are ready to dig in!


### 1. How many rows are in this dataset?

In [7]:
airbnbNew.count()

49079

### 2. How many total reviews does each host have?

In [10]:
airbnbNew.groupBy("host_id").agg({'number_of_reviews':'sum'}).show(5)

+-------+----------------------+
|host_id|sum(number_of_reviews)|
+-------+----------------------+
| 716306|                   197|
|1203500|                    35|
| 368528|                     1|
|1577493|                    16|
|1390555|                    50|
+-------+----------------------+
only showing top 5 rows



### 3. Show the min and max of all the numeric variables in the dataset

In [12]:
summary = airbnbNew.select("price","minimum_nights","number_of_reviews").summary('min', 'max')
summary.toPandas()

Unnamed: 0,summary,price,minimum_nights,number_of_reviews
0,min,-74,0,0
1,max,10000,1250,629


### 4. Which host had the highest number of reviews?

Only display the top result.

Bonus: format the column names

In [54]:
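from pyspark.sql.functions import *
# Maximum reviews on a single listing for each host name, in no particular order.
# DataFrame.alias() names the DataFrame, not the column, so the header is unchanged.
airbnbNew.groupBy('host_name').max('number_of_reviews').alias("max_reviews").show()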

+------------+----------------------+
|   host_name|max(number_of_reviews)|
+------------+----------------------+
|      Mary D|                     5|
|      Alayna|                    27|
|       Sandi|                    12|
|    Laurence|                   156|
|Paul & Elena|                    25|
|       Tyler|                    14|
|   Amaryllis|                     0|
|       Tegan|                     1|
|         Tmc|                    12|
|     Susanna|                    37|
|      Zaineb|                    70|
|        Rony|                    54|
|      Talisa|                     5|
|      Genine|                     3|
|    Pénélope|                     4|
|        Faye|                    56|
|       Yinny|                    48|
|    Francois|                    10|
|      Britta|                     5|
|        Saad|                     6|
+------------+----------------------+
only showing top 20 rows



In [48]:
airbnbNew.sort(desc("price")).toPandas()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,7003697,Furnished room in Astoria apartment,20582832,Kathrine,Queens,Astoria,40.7681,-73.91651,Private room,10000.0,100.0,2.0,2016-02-13,0.0,1.0,0.0
1,13894339,Luxury 1 bedroom apt. -stunning Manhattan views,5143901,Erin,Brooklyn,Greenpoint,40.7326,-73.95739,Entire home/apt,10000.0,5.0,5.0,2017-07-27,0.0,1.0,0.0
2,22436899,1-BR Lincoln Center,72390391,Jelena,Manhattan,Upper West Side,40.77213,-73.98665,Entire home/apt,10000.0,30.0,0.0,,,1.0,83.0
3,4737930,Spanish Harlem Apt,1235070,Olson,Manhattan,East Harlem,40.79264,-73.93898,Entire home/apt,9999.0,5.0,1.0,2015-01-02,0.0,1.0,0.0
4,9528920,"Quiet, Clean, Lit @ LES & Chinatown",3906464,Amy,Manhattan,Lower East Side,40.71355,-73.98507,Private room,9999.0,99.0,6.0,2016-01-01,0.0,1.0,83.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49074,36073776,Amazing Room! Awesome Price!,,,,,,,,,,,,,,
49075,36077664,Cozy Shared BR in East Side,,,,,,,,,,,,,,
49076,36169832,Tiny Room only for “1 lady” :),,,,,,,,,,,,,,
49077,36312667,My Cozy Apartment!,,,,,,,,,,,,,,


### 5. On average, how many nights did most hosts specify for a minimum?

In [57]:
# Average minimum_nights per host name; alias() here names the DataFrame, not the column.
airbnbNew.groupBy('host_name').avg('minimum_nights').alias("avg_min_nights").show()

+------------+-------------------+
|   host_name|avg(minimum_nights)|
+------------+-------------------+
|      Mary D|                7.0|
|      Alayna|                3.5|
|       Sandi|                2.0|
|    Laurence| 2.8461538461538463|
|Paul & Elena|                3.0|
|       Tyler|  4.483870967741935|
|   Amaryllis|                1.0|
|       Tegan|                2.0|
|         Tmc|                3.0|
|     Susanna|                5.0|
|      Zaineb|                4.0|
|        Rony|               3.25|
|      Talisa|                1.0|
|      Genine|                2.0|
|    Pénélope|                1.0|
|        Faye|  5.916666666666667|
|       Yinny|                3.5|
|    Francois|               3.75|
|      Britta|  7.333333333333333|
|        Saad|                4.0|
+------------+-------------------+
only showing top 20 rows



### 6. What is the most expensive neighborhood to stay in on average?

Note: only show the one result

In [78]:
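# Highest (not average) price within each neighbourhood_group value;
# alias() here names the DataFrame, not the column.
airbnbNew.groupBy("neighbourhood_group").max("price").alias('max').show()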

+-------------------+----------+
|neighbourhood_group|max(price)|
+-------------------+----------+
|         Douglaston|         1|
|             Queens|     10000|
|              Nadia|      null|
|            Midtown|        30|
|     Hell's Kitchen|         3|
|  Greenwich Village|        80|
|       Clinton Hill|        14|
|   Ditmars Steinway|         6|
|           Longwood|         5|
|        Little Neck|         1|
|           Flushing|        28|
|               null|      null|
|         Bath Beach|         2|
|        East Harlem|         3|
|            Astoria|        21|
|       East Village|         7|
|        Fort Greene|         1|
|         Mott Haven|         1|
|           Gramercy|         2|
|       Williamsburg|        30|
+-------------------+----------+
only showing top 20 rows



### 7. Display a two by two table that shows the average prices by room type (private and shared only) and neighborhood group (Manhattan and Brooklyn only)

In [65]:
airbnbNew.filter(airbnbNew.room_type.isin(['Shared room','Private room'])) \
         .groupBy("room_type") \
         .pivot("neighbourhood_group", ["Manhattan", "Brooklyn"]) \
         .avg('price') \
         .show(10)

+------------+------------------+-----------------+
|   room_type|         Manhattan|         Brooklyn|
+------------+------------------+-----------------+
| Shared room| 89.06903765690376|50.52784503631961|
|Private room|116.05400302114803|76.47234042553191|
+------------+------------------+-----------------+



### Alright that's all folks!

### Great job!