# Intro to DataFrames

References:

- DataFrame API docs: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

Rules of thumb:
- Hit tab to auto-complete
- To see all available methods, place a dot (.) after the DataFrame (e.g. `df.`) and hit tab
- Use `.show()` to see the contents of the DataFrame (`.collect()` returns the rows as a Python list instead)

Solutions for potentially challenging exercises can be found at the end of the section. Don't peek unless you're really stuck!

In [None]:
# like in the pyspark shell, SparkSession is already defined
spark

## 1. DataFrame methods

### 1.1 Data input

In [None]:
df = spark.read.json("../data/people/names.json")

# other supported file formats:
# spark.read.parquet("../data/pems_sorted/")
# spark.read.text()
# spark.read.csv()
# spark.read.orc()

# generic form: 
# spark.read.load("path/to/someFile.csv", format="csv", sep=":", inferSchema="true", header="true")

# Loading data from a JDBC source
# jdbcDF = spark.read \
#     .format("jdbc") \
#     .option("url", "jdbc:postgresql:dbserver") \
#     .option("dbtable", "schema.tablename") \
#     .option("user", "username") \
#     .option("password", "password") \
#     .load()

In [None]:
# TODO: try reading different files in ../data


In [None]:
spark.read.parquet("../data/pems_sorted/")

### 1.2 Data output (writing to disk)

- API docs: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter

In [None]:
# writing to a file
df.write.parquet("new_data.parquet")

In [None]:
# overwrite on save
df.write.mode("overwrite").parquet("new_data.parquet")

In [None]:
# You can read from any format and write to any other format (subject to each format's limitations):
# df.write.csv("new_data.csv", header=True)
# df.write.json("new_data.json")
# df.write.orc("new_data.orc")
# df.write.parquet("new_data.parquet")

# generic form:
# df.write.save("fileName.parquet", format="parquet")
# df.write.mode("overwrite").save("fileName.parquet", format="parquet")

# Saving data to a JDBC source
# jdbcDF.write \
#     .format("jdbc") \
#     .option("url", "jdbc:postgresql:dbserver") \
#     .option("dbtable", "schema.tablename") \
#     .option("user", "username") \
#     .option("password", "password") \
#     .save()


### Exploring DataFrames

In [None]:
df = spark.read.json("../data/people/names.json")

In [None]:
df.head(5)

In [None]:
df.show(5)

In [None]:
df.take(5)

In [None]:
# limit(n) returns a new dataframe with the first n rows of the dataframe
df.limit(3).show()

In [None]:
df.show()

In [None]:
df.printSchema()

In [None]:
df.columns

In [None]:
df.count()

In [None]:
df.describe().show()

### Selecting specific columns in a dataframe

In [None]:
# selecting a column
# Column API docs: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column
df['name']



In [None]:
# creating a new dataframe with only selected columns
df.select('name')

In [None]:
# creating a new dataframe with only selected columns
df.select(['name', 'age']).show()

In [None]:
# renaming columns
df = df.withColumnRenamed('AGE', 'age')

In [None]:
df

In [None]:
# Creating new columns
df = df.withColumn('height plus 100', df.height + 100)
df.show()

In [None]:
# Creating a new boolean column from a comparison
df = df.withColumn('is_tall', df.height >= 175)
df.show()

### Filtering
`.filter()` takes either (i) a `Column` of `types.BooleanType` or (ii) a SQL expression string.

In [None]:
# filter using SQL expressions
# df.where('age >= 25').show() is also possible because .where() is an alias for .filter()
df.filter('age >= 25').show()

In [None]:
# filter using a column of boolean types
df.filter(df.age >= 25).show()

In [None]:
# df.age >= 25 returns a Column of booleans
df.age >= 25

In [None]:
df.filter( (df.age >= 25) & (df.age <= 30) ).show()
# you can use df.age or df['age']
# you can replace & with | for 'or' operations

In [None]:
# TODO: try filtering based on other predicates

### groupBy

TL;DR - `.groupBy()` groups rows together based on their values in the given column(s)
- `df.groupBy([cols])`
- [GroupedData operations](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData). Alternatively, assign the result to a variable (its type is GroupedData) and let the Jupyter notebook's auto-complete show you which methods are available:
    - `grouped = df.groupBy('gender')`
    - `grouped.` (and hit tab)

In [None]:
df.head()

In [None]:
df.groupBy('gender')

In [None]:
grouped = df.groupBy('gender')

In [None]:
df.groupBy('gender').count().show()

In [None]:
df.groupBy('gender').avg().show()

In [None]:
# calculating total age and height of all people (i.e. don't groupby anything)
df.groupBy().sum().show()

In [None]:
# TODO: calculate average height of all people 

In [None]:
# TODO: calculate max height for each gender

In [None]:
# TODO: calculate min height for each gender

### 1.3 Crimes Data

In [None]:
crimes = spark.read.csv("../data/crimes/Crimes_-_One_year_prior_to_present.csv", header=True, inferSchema=True)
# try the above without the header and inferSchema option. see what happens!

In [None]:
# TODO: print the schema of the dataframe (i.e. the data type of each column)


In [None]:
# TODO: how many rows are there in the dataframe?


In [None]:
# TODO: Display the first 2 rows


In [None]:
# TODO: What columns are in the dataframe?


In [None]:
# Let's rename the improperly formatted column names
columnNames = crimes.columns
for col in columnNames:
    crimes = crimes.withColumnRenamed(col, col.strip())
    
crimes.columns

In [None]:
# TODO: How many cases resulted in arrest, and how many didn’t?
# Hint: Highlight whitespace between this cell and the next cell to see the hint



<font color="white">Use .groupBy("ARREST")</font>

In [None]:
# TODO: List the total count of cases for each WARD


In [None]:
# TODO: List the total count of cases for each WARD, and sort it (by count) in ascending order


In [None]:
# TODO: Show the top 10 (WARD, count) pairs with the most cases
# To sort in descending order, use the desc() function - .sort(desc("count"))
from pyspark.sql.functions import desc



In [None]:
# TODO: List top 15 categories (PRIMARY DESCRIPTION) of cases


In [None]:
# TODO: List top 5 locations (LOCATION DESCRIPTION) where cases occur


In [None]:
# TODO: Save one of the results to disk (choose any format)
# Note: if your dataframe ends up being partitioned, you can call `your_df.coalesce(1)` before saving (`df.coalesce(1).write...`)


In [None]:
# TODO: submit the preceding task as a Spark job
# 1. Create a Python file named jobs/top_20_crime_locations.py
# 2. Define the SparkSession object
#   - from pyspark.sql import SparkSession
#   - spark = SparkSession.builder.appName("MyAppName").getOrCreate()
# 3. Copy the code in the preceding cell into the file
# 4. Submit the job: ${SPARK_HOME}/bin/spark-submit --master local ./jobs/top_20_crime_locations.py

# if you get stuck, you can refer to ./jobs/top_N_crime_locations_solution.py

In [None]:
# TODO: Use your creativity - create any other interesting DataFrames or insights into the crimes data!

## Using SQL with Spark DataFrames

In [None]:
df = spark.read.json("../data/people/names.json")

In [None]:
df.createOrReplaceTempView('names')

In [None]:
spark.sql("SELECT * FROM names")
# add .show() to see the resulting dataframe. Example:
# df = spark.sql("SELECT * FROM names")
# df.show()

In [None]:
spark.sql("SELECT * FROM names WHERE height > 170").show()