# Airbnb EDA

Kaia Lindberg (pkx2ec)



## 5.10 Final Project Ungraded Assignment
At this point in the course, you should be training and evaluating models. Please create a Jupyter Notebook containing a concise summary of your dataset (described in submission instructions).  

At a minimum, the file should include a summary containing:

- Number of records
- Number of columns
- Statistical summary of response variable
- Statistical summary of potential predictor variables (if there are a large number of predictors, select the top 10)
    - Note: Summarize categorical variables with counts and percentages for each level and summarize numerical variables with mean/quantiles/standard deviation.
- Include up to five helpful graphs

In [53]:
# Imports
import os
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.functions import col, desc


In [36]:
# Start spark session
spark = SparkSession.builder.getOrCreate()


### Users Dataset

In [37]:
# Define schema for data
schema = StructType() \
      .add("id",StringType(),True) \
      .add("date_account_created",StringType(),True) \
      .add("timestamp_first_active",DoubleType(),True) \
      .add("date_first_booking",StringType(),True) \
      .add("gender",StringType(),True) \
      .add("age",DoubleType(),True) \
      .add("signup_method",StringType(),True) \
      .add("signup_flow",IntegerType(),True) \
      .add("language",StringType(),True) \
      .add("affiliate_channel",StringType(),True) \
      .add("affiliate_provider",StringType(),True) \
      .add("first_affiliate_tracked",StringType(),True) \
      .add("signup_app",StringType(),True) \
      .add("first_device_type",StringType(),True) \
      .add("first_browser",StringType(),True) \
      .add("country_destination",StringType(),True)

In [38]:
# Lists of columns
response_col = "country_destination"
id_col = "id"
categorical_cols = ["gender", "signup_method", "language", 
                    "affiliate_channel", "affiliate_provider", "first_affiliate_tracked",
                    "signup_app", "first_device_type", "first_browser"]
numeric_cols = ["timestamp_first_active", "age", "signup_flow"]
date_cols = ["date_account_created", "date_first_booking"]


In [39]:
# Read data in CSV format, using the schema defined above
df = spark.read.option("header", True).csv("./data/train_users_2.csv", schema=schema)


In [40]:
df.show(2)

+----------+--------------------+----------------------+------------------+---------+----+-------------+-----------+--------+-----------------+------------------+-----------------------+----------+-----------------+-------------+-------------------+
|        id|date_account_created|timestamp_first_active|date_first_booking|   gender| age|signup_method|signup_flow|language|affiliate_channel|affiliate_provider|first_affiliate_tracked|signup_app|first_device_type|first_browser|country_destination|
+----------+--------------------+----------------------+------------------+---------+----+-------------+-----------+--------+-----------------+------------------+-----------------------+----------+-----------------+-------------+-------------------+
|gxn3p5htnn|          2010-06-28|    2.0090319043255E13|              null|-unknown-|null|     facebook|          0|      en|           direct|            direct|              untracked|       Web|      Mac Desktop|       Chrome|                NDF|


In [41]:
df.printSchema()

root
 |-- id: string (nullable = true)
 |-- date_account_created: string (nullable = true)
 |-- timestamp_first_active: double (nullable = true)
 |-- date_first_booking: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- age: double (nullable = true)
 |-- signup_method: string (nullable = true)
 |-- signup_flow: integer (nullable = true)
 |-- language: string (nullable = true)
 |-- affiliate_channel: string (nullable = true)
 |-- affiliate_provider: string (nullable = true)
 |-- first_affiliate_tracked: string (nullable = true)
 |-- signup_app: string (nullable = true)
 |-- first_device_type: string (nullable = true)
 |-- first_browser: string (nullable = true)
 |-- country_destination: string (nullable = true)



In [42]:
# Count number of records
df.count()

213451

In [43]:
# Number of columns
len(df.columns)

16

In [44]:
# Statistical summary of response variable
df.groupBy('country_destination').count().show()

+-------------------+------+
|country_destination| count|
+-------------------+------+
|                 NL|   762|
|                 PT|   217|
|                 AU|   539|
|                 CA|  1428|
|                 GB|  2324|
|              other| 10094|
|                 DE|  1061|
|                 ES|  2249|
|                 US| 62376|
|                 FR|  5023|
|                NDF|124543|
|                 IT|  2835|
+-------------------+------+



The response variable has 12 levels. The most common is NDF (no destination found), which means the user did not make a booking; it accounts for 124,543 of the 213,451 records (about 58%). The most common actual destination is the United States (62,376 records, about 29%). `other` is the next largest level (10,094 records), but it is a broad catch-all category.
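The assignment asks for percentages as well as counts for each level. A minimal sketch in plain Python, using the counts printed in the table above; the helper name `level_percentages` is hypothetical and not part of the original notebook:

```python
# Counts copied from the groupBy output above.
destination_counts = {
    "NDF": 124543, "US": 62376, "other": 10094, "FR": 5023,
    "IT": 2835, "GB": 2324, "ES": 2249, "CA": 1428,
    "DE": 1061, "NL": 762, "AU": 539, "PT": 217,
}


def level_percentages(counts):
    """Return {level: percentage of total}, ordered by descending count."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {level: 100.0 * n / total for level, n in ordered}


destination_pct = level_percentages(destination_counts)
for level, pct in destination_pct.items():
    print(f"{level:>6} {destination_counts[level]:>7} {pct:6.2f}%")
```

The same helper could be applied to the count table of any categorical predictor.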

In [45]:
df.select(numeric_cols).show(3)

+----------------------+----+-----------+
|timestamp_first_active| age|signup_flow|
+----------------------+----+-----------+
|    2.0090319043255E13|null|          0|
|    2.0090523174809E13|38.0|          0|
|    2.0090609231247E13|56.0|          3|
+----------------------+----+-----------+
only showing top 3 rows
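
The values of `timestamp_first_active` look like `yyyyMMddHHmmss` timestamps that were read as doubles (for example, `2.0090319043255E13` is `20090319043255`). That reading is an assumption based on the printed values, not something the notebook states. Under that assumption, a small plain-Python sketch (the helper `parse_first_active` is hypothetical) recovers a datetime:

```python
from datetime import datetime


def parse_first_active(value):
    """Convert a yyyyMMddHHmmss number (assumed format) to a datetime."""
    digits = str(int(value))  # 2.0090319043255E13 -> "20090319043255"
    return datetime.strptime(digits, "%Y%m%d%H%M%S")


first_active = parse_first_active(2.0090319043255E13)
print(first_active.isoformat())
```

If the assumption holds, statistics such as a mean or standard deviation computed on the raw number are hard to interpret, and a parsed date would be a more useful predictor.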



In [49]:
# Statistical summary of numeric columns
df.select(numeric_cols).summary().show()

+-------+----------------------+------------------+------------------+
|summary|timestamp_first_active|               age|       signup_flow|
+-------+----------------------+------------------+------------------+
|  count|                213451|            125461|            213451|
|   mean|  2.013085041736745...| 49.66833517985669|3.2673868944160485|
| stddev|   9.253717046788992E9|155.66661183021571|  7.63770686943505|
|    min|    2.0090319043255E13|               1.0|                 0|
|    25%|     2.012122504391E13|              28.0|                 0|
|    50%|    2.0130911053924E13|              34.0|                 0|
|    75%|    2.0140306074825E13|              43.0|                 0|
|    max|    2.0140630235824E13|            2014.0|                25|
+-------+----------------------+------------------+------------------+



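Note the implausible ages: the minimum is 1 and the maximum is 2014, and the mean (49.7) and standard deviation (155.7) are distorted by these extreme values (the median is 34). Age is also missing for 87,990 of the 213,451 records; only 125,461 are non-null.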
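
One way to handle these values is to treat ages outside a plausible range as missing before summarizing. The sketch below is plain Python; the bounds of 15 and 100 and the helper name `clean_age` are illustrative assumptions, not choices made in the original notebook:

```python
MIN_AGE = 15   # assumed lower bound, for illustration only
MAX_AGE = 100  # assumed upper bound, for illustration only


def clean_age(age):
    """Return the age if it is within the assumed plausible range, else None."""
    if age is None:
        return None
    return age if MIN_AGE <= age <= MAX_AGE else None


# Values taken from the outputs above: null, 38.0, 56.0, min 1.0, max 2014.0
sample_ages = [None, 38.0, 56.0, 1.0, 2014.0]
cleaned = [clean_age(a) for a in sample_ages]
print(cleaned)

# Share of records with a missing age, from the summary counts above
missing_share = 1 - 125461 / 213451
print(f"{missing_share:.1%}")
```

Any such bounds should be checked against the distribution of the data before they are used for modeling.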

In [57]:
# Counts for each level of the categorical columns, most frequent first
for cat_col in categorical_cols:
    print(cat_col)
    df.groupBy(cat_col).count().sort(desc("count")).show()

gender
+---------+-----+
|   gender|count|
+---------+-----+
|-unknown-|95688|
|   FEMALE|63041|
|     MALE|54440|
|    OTHER|  282|
+---------+-----+

signup_method
+-------------+------+
|signup_method| count|
+-------------+------+
|        basic|152897|
|     facebook| 60008|
|       google|   546|
+-------------+------+

language
+--------+------+
|language| count|
+--------+------+
|      en|206314|
|      zh|  1632|
|      fr|  1172|
|      es|   915|
|      ko|   747|
|      de|   732|
|      it|   514|
|      ru|   389|
|      pt|   240|
|      ja|   225|
|      sv|   122|
|      nl|    97|
|      tr|    64|
|      da|    58|
|      pl|    54|
|      cs|    32|
|      no|    30|
|      th|    24|
|      el|    24|
|      id|    22|
+--------+------+
only showing top 20 rows

affiliate_channel
+-----------------+------+
|affiliate_channel| count|
+-----------------+------+
|           direct|137727|
|        sem-brand| 26045|
|    sem-non-brand| 18844|
|            other|  8961