# Manipulating data

### Creating columns

Use the methods defined by Spark's DataFrame class to perform common data operations.

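Column-wise operations are done with the .withColumn() method, which takes the name of the new column and a Column expression that computes it.

You can use spark.table("...") to create a DataFrame from a table in the catalog.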

In [None]:
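# Load the flights table as a DataFrame and preview it
flights = spark.table("flights")
flights.show()

# Add a column with the flight duration in hours (air_time is in minutes)
flights = flights.withColumn("duration_hrs", flights.air_time / 60)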

### SQL in a nutshell


As in the code above, you can do column-wise calculations in SQL:

SELECT origin, dest, air_time / 60 AS duration_hrs FROM flights;
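To see what that query returns without a Spark cluster, here is a minimal sketch that runs it against an in-memory SQLite database from Python's standard library. SQLite stands in for Spark here only so the snippet is runnable. The two sample rows are made up, and an ORDER BY is appended so the output order is deterministic.

```python
import sqlite3

# Hypothetical sample rows; column names mirror the flights table used above.
# air_time is stored as REAL so that SQLite performs floating-point division.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin TEXT, dest TEXT, air_time REAL)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("SEA", "LAX", 132.0), ("PDX", "SFO", 90.0)],
)

# The statement from the text: duration_hrs is computed row by row.
rows = conn.execute(
    "SELECT origin, dest, air_time / 60 AS duration_hrs "
    "FROM flights ORDER BY origin"
).fetchall()
print(rows)
conn.close()
```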

### Filtering Data

The .filter() method is the Spark counterpart of SQL's WHERE clause. It takes either a string containing any expression that could follow WHERE in a SQL query (in this case, "distance > 1000"), or a Spark Column of boolean (True/False) values, such as flights.distance > 1000.

In [None]:
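# Filter with a SQL expression string
long_flights1 = flights.filter("distance > 1000")
# Filter with a boolean Column; the result is the same
long_flights2 = flights.filter(flights.distance > 1000)

# .show() prints the rows itself and returns None, so don't wrap it in print()
long_flights1.show()
long_flights2.show()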
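The string form of .filter() can be read as the text that would follow WHERE. The sketch below illustrates that with the same predicate string in plain SQL, again using Python's built-in sqlite3 in place of Spark so that it runs anywhere. The rows and distances are invented for the example.

```python
import sqlite3

# Hypothetical sample rows; only the column names come from the text above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin TEXT, dest TEXT, distance INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("SEA", "JFK", 2422), ("PDX", "SEA", 129), ("SEA", "DEN", 1024)],
)

# Same predicate text as the string passed to .filter().
# It is a fixed literal here; never splice untrusted input into SQL this way.
predicate = "distance > 1000"
long_flights = conn.execute(
    f"SELECT origin, dest, distance FROM flights WHERE {predicate} ORDER BY distance"
).fetchall()
print(long_flights)
conn.close()
```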

### Selecting

The Spark variant of SQL's SELECT is the .select() method. This method takes multiple arguments - one for each column you want to select. These arguments can either be the column name as a string or a column object.

When you pass a column object, you can perform operations like addition or subtraction on the column to change the data contained in it, much like inside .withColumn().

.select() returns only the columns you specify, while .withColumn() returns all the columns of the DataFrame in addition to the one you defined. It's often a good idea to drop columns you don't need at the beginning of an operation. In this case, you would use .select() and not .withColumn().

In [None]:
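# Select columns by name (strings)
selected1 = flights.select("tailnum", "origin", "dest")

# Select columns using Column objects
temp = flights.select(flights.origin, flights.dest, flights.carrier)

# Build boolean Column filters; comparisons use ==, not =
filterA = flights.origin == "SEA"
filterB = flights.dest == "PDX"

# Keep only flights from SEA to PDX
selected2 = temp.filter(filterA).filter(filterB)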
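One way to picture the difference between .select() and .withColumn() is through their rough SQL counterparts: a SELECT that lists specific columns versus `SELECT *, <expr> AS <name>`. The sketch below compares the output column names of both, using Python's built-in sqlite3 as a stand-in for Spark. The single row and the `column_names` helper are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (tailnum TEXT, origin TEXT, dest TEXT, air_time REAL)"
)
# Hypothetical sample row.
conn.execute("INSERT INTO flights VALUES ('N123AB', 'SEA', 'PDX', 45.0)")


def column_names(query):
    """Return the output column names of a query (helper for this sketch)."""
    cursor = conn.execute(query)
    return [col[0] for col in cursor.description]


# Like .select("origin", "dest"): only the listed columns come back.
select_cols = column_names("SELECT origin, dest FROM flights")

# Like .withColumn("duration_hrs", ...): every existing column plus the new one.
with_column_cols = column_names(
    "SELECT *, air_time / 60 AS duration_hrs FROM flights"
)

print(select_cols)
print(with_column_cols)
conn.close()
```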

### Selecting II


When you select a column using the df.colName notation, you can perform any column operation on it, and the .select() method will return the transformed column. You can also use the .alias() method to rename a column you're selecting. The .selectExpr() method does the same thing, but takes SQL expressions as strings and uses AS for renaming.

In [None]:
# Average speed: distance divided by air time in hours
avg_speed = (flights.distance / (flights.air_time / 60)).alias("avg_speed")

# Column-object version
speed1 = flights.select("origin", "dest", "tailnum", avg_speed)

# Equivalent SQL-expression version
speed2 = flights.selectExpr("origin", "dest", "tailnum", "distance/(air_time/60) as avg_speed")

### Aggregating

All of the common aggregation methods, like .min(), .max(), and .count(), are GroupedData methods. A GroupedData object is created by calling the .groupBy() DataFrame method. Calling .groupBy() with no arguments treats the whole DataFrame as a single group.

In [None]:
# Shortest flight from PDX in terms of distance
flights.filter(flights.origin == "PDX").groupBy().min("distance").show()

# Longest flight from SEA in terms of air time
flights.filter("origin == 'SEA'").groupBy().max("air_time").show()

### Aggregating II


In [None]:
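# Average air time of Delta (DL) flights leaving SEA
flights.filter(flights.carrier == "DL").filter(flights.origin == "SEA").groupBy().avg("air_time").show()

# Total hours in the air across all flights
flights.withColumn("duration_hrs", flights.air_time/60).groupBy().sum("duration_hrs").show()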

### Grouping and Aggregating I


When you pass the name of one or more columns in your DataFrame to the .groupBy() method, the aggregation methods behave as they do with a GROUP BY clause in a SQL query: they return one result per group.

In [None]:
by_plane = flights.groupBy("tailnum")
by_plane.count().show()

by_origin = flights.groupBy("origin")
by_origin.avg("air_time").show()
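For comparison, the two grouped aggregations above correspond roughly to the GROUP BY queries in this sketch. It uses Python's built-in sqlite3 instead of Spark so it can run standalone, and the three rows (including the tail numbers `N1` and `N2`) are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (tailnum TEXT, origin TEXT, air_time REAL)")
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("N1", "SEA", 100.0), ("N1", "PDX", 60.0), ("N2", "SEA", 200.0)],
)

# Rough counterpart of flights.groupBy("tailnum").count()
count_by_plane = conn.execute(
    "SELECT tailnum, COUNT(*) FROM flights GROUP BY tailnum ORDER BY tailnum"
).fetchall()

# Rough counterpart of flights.groupBy("origin").avg("air_time")
avg_by_origin = conn.execute(
    "SELECT origin, AVG(air_time) FROM flights GROUP BY origin ORDER BY origin"
).fetchall()

print(count_by_plane)
print(avg_by_origin)
conn.close()
```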

### Grouping and Aggregating II


In addition to the GroupedData methods, there is also the .agg() method. This method lets you pass an aggregate column expression that uses any of the aggregate functions from the pyspark.sql.functions submodule. This submodule contains many useful functions for computing things like standard deviations. All the aggregation functions in this submodule take the name of a column in a GroupedData table.

In [None]:
import pyspark.sql.functions as F

by_month_dest = flights.groupBy("month", "dest")
by_month_dest.avg("dep_delay").show()

by_month_dest.agg(F.stddev("dep_delay")).show()
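To make the number produced by F.stddev more concrete: the PySpark documentation describes it as an alias for stddev_samp, the sample standard deviation. Assuming that definition, the plain-Python sketch below computes the same quantity per (month, dest) group with the standard library. The delay values are invented, and this illustrates the calculation rather than how Spark implements it.

```python
from collections import defaultdict
from statistics import stdev

# Hypothetical (month, dest, dep_delay) rows.
rows = [
    (1, "LAX", 2.0),
    (1, "LAX", 4.0),
    (1, "LAX", 6.0),
    (2, "LAX", 10.0),
    (2, "LAX", 20.0),
]

# Collect the delays of each (month, dest) group, like groupBy("month", "dest").
groups = defaultdict(list)
for month, dest, dep_delay in rows:
    groups[(month, dest)].append(dep_delay)

# statistics.stdev is the sample standard deviation (divides by n - 1).
stddev_by_group = {key: stdev(values) for key, values in groups.items()}
print(stddev_by_group)
```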

### Joining

A join will combine two different tables along a column that they share. This column is called the key.

### Joining II

In PySpark, joins are performed using the DataFrame method .join().

This method takes three arguments. The first is the other DataFrame, the one you want to join to the DataFrame the method is called on. The second argument, on, is the name of the key column(s) as a string. The names of the key column(s) must be the same in each table. The third argument, how, specifies the kind of join to perform. In this course we'll always use the value how="leftouter".



In [None]:
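# Assumes an `airports` DataFrame is already loaded in the session
airports.show()

# Rename the key column so it matches flights.dest
airports = airports.withColumnRenamed("faa", "dest")
flights_with_airports = flights.join(airports, on="dest", how="leftouter")
flights_with_airports.show()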
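What a left outer join does with unmatched keys is easiest to see on a tiny example. The sketch below runs the SQL form of the join with Python's built-in sqlite3 in place of Spark. The table contents, including the airport name and the unmatched code `XXX`, are made up. Every row of the left table is kept, and columns from the right table are NULL (None in Python) when no key matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (tailnum TEXT, dest TEXT)")
conn.execute("CREATE TABLE airports (dest TEXT, name TEXT)")
# Hypothetical rows: 'XXX' has no matching airport on purpose.
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)", [("N1", "PDX"), ("N2", "XXX")]
)
conn.execute("INSERT INTO airports VALUES ('PDX', 'Portland Intl')")

# Left outer join on the shared key column dest.
joined = conn.execute(
    "SELECT f.tailnum, f.dest, a.name "
    "FROM flights AS f LEFT OUTER JOIN airports AS a ON f.dest = a.dest "
    "ORDER BY f.tailnum"
).fetchall()
print(joined)
conn.close()
```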