# Aggregating DataFrames in PySpark HW

First, let's start up our PySpark instance.

In [3]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('AggregateDataset').getOrCreate()
spark

## Read in the DataFrame for this notebook

In [5]:
airbnb = spark.read.csv('data/nyc_air_bnb.csv',inferSchema=True,header=True)

## About this dataset

This dataset describes the listing activity and metrics for Airbnb in New York City, NY, for 2019. Each row in the dataset is a listing.

**Source:** https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data/data

Let's go ahead and view the first few records of the dataset so we know what we are working with.

In [7]:
airbnb.limit(5).toPandas()

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | | | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |


Now print the schema so we can make sure all the variables have the correct types.

In [8]:
airbnb.printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- host_id: string (nullable = true)
 |-- host_name: string (nullable = true)
 |-- neighbourhood_group: string (nullable = true)
 |-- neighbourhood: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- room_type: string (nullable = true)
 |-- price: string (nullable = true)
 |-- minimum_nights: string (nullable = true)
 |-- number_of_reviews: string (nullable = true)
 |-- last_review: string (nullable = true)
 |-- reviews_per_month: string (nullable = true)
 |-- calculated_host_listings_count: string (nullable = true)
 |-- availability_365: integer (nullable = true)



Notice that several columns that are obviously numeric have been read as strings. Let's fix that by casting them to the appropriate types; otherwise we cannot aggregate the numeric columns.

In [9]:
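from pyspark.sql.types import *
# Note: this wildcard import replaces Python's built-in sum, min, and max
# with Spark's column functions for the rest of the notebook.
from pyspark.sql.functions import *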

df_fixed = airbnb \
            .withColumn('id', airbnb['id'].cast(IntegerType())) \
            .withColumn('host_id', airbnb['host_id'].cast(IntegerType())) \
            .withColumn('latitude', airbnb['latitude'].cast(FloatType())) \
            .withColumn('longitude', airbnb['longitude'].cast(FloatType())) \
            .withColumn('price', airbnb['price'].cast(FloatType())) \
            .withColumn('minimum_nights', airbnb['minimum_nights'].cast(IntegerType())) \
            .withColumn('number_of_reviews', airbnb['number_of_reviews'].cast(IntegerType())) \
            .withColumn('last_review', to_date(airbnb['last_review'], 'yyyy-MM-dd')) \
            .withColumn('reviews_per_month', airbnb['reviews_per_month'].cast(FloatType())) \
            .withColumn('calculated_host_listings_count', airbnb['calculated_host_listings_count'].cast(IntegerType()))

df_fixed.printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- host_id: integer (nullable = true)
 |-- host_name: string (nullable = true)
 |-- neighbourhood_group: string (nullable = true)
 |-- neighbourhood: string (nullable = true)
 |-- latitude: float (nullable = true)
 |-- longitude: float (nullable = true)
 |-- room_type: string (nullable = true)
 |-- price: float (nullable = true)
 |-- minimum_nights: integer (nullable = true)
 |-- number_of_reviews: integer (nullable = true)
 |-- last_review: date (nullable = true)
 |-- reviews_per_month: float (nullable = true)
 |-- calculated_host_listings_count: integer (nullable = true)
 |-- availability_365: integer (nullable = true)



### Alright, now we are ready to dig in!


### 1. How many rows are in this dataset?

In [10]:
df_fixed.count()

49079

### 2. How many total reviews does each host have?

In [11]:
df_fixed.groupBy('host_id').sum('number_of_reviews').limit(5).toPandas()

| | host_id | sum(number_of_reviews) |
|---|---|---|
| 0 | 291112 | 35 |
| 1 | 1384111 | 103 |
| 2 | 1597481 | 13 |
| 3 | 2108853 | 18 |
| 4 | 2429432 | 27 |


### 3. Show the min and max of all the numeric variables in the dataset

In [14]:
df_fixed.select('minimum_nights', 'number_of_reviews').summary('min', 'max').toPandas()

| | summary | minimum_nights | number_of_reviews |
|---|---|---|---|
| 0 | min | 0 | 0 |
| 1 | max | 1250 | 629 |


### 4. Which host had the highest number of reviews?

Only display the top result.

Bonus: format the column names

In [24]:
df_highest_reviews = df_fixed.groupBy('host_id').agg(sum('number_of_reviews').alias('reviews'))
df_highest_reviews.orderBy('reviews', ascending=False).limit(1).toPandas()

| | host_id | reviews |
|---|---|---|
| 0 | 37312959 | 2273 |


### 5. On average, what minimum number of nights do hosts specify?

In [25]:
df_fixed.select(avg('minimum_nights')).toPandas()

| | avg(minimum_nights) |
|---|---|
| 0 | 7.128613 |


### 6. What is the most expensive neighborhood to stay in on average?

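Note: only show the one result. The cell below ranks neighbourhoods by their single highest listing price (`max`), which is not the same as the average price.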

In [27]:
df_fixed.groupBy('neighbourhood').agg(max('price').alias('highest_price')) \
    .orderBy('highest_price', ascending=False).limit(1).toPandas()

| | neighbourhood | highest_price |
|---|---|---|
| 0 | Astoria | 10000.0 |


### 7. Display a two-by-two table that shows the average price by room type (private and shared only) and neighborhood group (Manhattan and Brooklyn only)

In [33]:
df_neigh = df_fixed.filter((df_fixed['room_type'] == 'Private room') | (df_fixed['room_type'] == 'Shared room')) \
        .groupBy('room_type').pivot('neighbourhood_group').avg('price')

df_neigh.select('room_type', 'Brooklyn', 'Manhattan').toPandas()

| | room_type | Brooklyn | Manhattan |
|---|---|---|---|
| 0 | Shared room | 50.527845 | 89.069038 |
| 1 | Private room | 76.47234 | 116.054003 |


### Alright that's all folks!

### Great job!