# 04.00 Data analytics
SQL can also be used for basic data analytics. For this, we'll be using the on-time performance dataset for flights in the US from the Bureau of Transportation Statistics (BTS).

You can find a subset of attributes for 2018 under `/course/cs0060/data/otp_flights.tar.gz`

For example, use `scp <your user>@ssh.cs.brown.edu:/course/cs0060/data/otp_flights.tar.gz .` to fetch the data.

The first step is to load the actual data into PostgreSQL. This can be done using the `COPY` command.

In [None]:
!mkdir -p data && tar xf otp_flights.tar.gz -C data/

In [None]:
!ls data

In [None]:
!head data/otp_flights_2018_1.csv

In [None]:
!psql -c 'CREATE TABLE flights(OP_UNIQUE_CARRIER VARCHAR, \
OP_CARRIER_FL_NUM NUMERIC, \
FL_DATE VARCHAR, \
ORIGIN_CITY_NAME VARCHAR, \
DEST_CITY_NAME VARCHAR, \
DISTANCE NUMERIC, \
DEP_DELAY NUMERIC, \
ARR_DELAY NUMERIC);' cs6

In [None]:
!psql -c '\d flights' cs6

Now load the data into the database via `COPY ... FROM`:

In [None]:
!psql -c "COPY flights(OP_UNIQUE_CARRIER,OP_CARRIER_FL_NUM,FL_DATE, \
ORIGIN_CITY_NAME,DEST_CITY_NAME,DISTANCE,DEP_DELAY,ARR_DELAY) \
FROM '$(pwd)/data/otp_flights_2018_12.csv' DELIMITER ',' CSV HEADER;" cs6

In [None]:
!psql -c 'SELECT * FROM flights LIMIT 5;' cs6
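To make the `CSV HEADER` options of `COPY` more concrete, here is a rough Python sketch of what the load does: skip the header line, map the remaining fields to columns by position, and treat empty fields as `NULL`. It is only an illustration under assumptions: it uses an in-memory SQLite database as a stand-in for PostgreSQL, only three of the columns, and a made-up two-row sample instead of the real file.

```python
import csv
import io
import sqlite3

# Made-up sample mimicking the CSV layout (not real BTS data).
sample_csv = io.StringIO(
    "OP_UNIQUE_CARRIER,OP_CARRIER_FL_NUM,DEP_DELAY\n"
    "XA,101,5.0\n"
    "XB,202,\n"          # empty DEP_DELAY field
)

# In-memory SQLite database as a stand-in for the PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights ("
    "op_unique_carrier TEXT, op_carrier_fl_num INTEGER, dep_delay REAL)"
)

reader = csv.reader(sample_csv)
header = next(reader)  # like HEADER: the first line is skipped, not loaded

# Like COPY in CSV mode: an unquoted empty field becomes NULL (None).
rows = [[field if field != "" else None for field in row] for row in reader]
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", rows)

loaded = conn.execute(
    "SELECT * FROM flights ORDER BY op_carrier_fl_num"
).fetchall()
print(loaded)
```

Note that the columns are matched by position, not by the names in the header line, which is why the column list in the `COPY` statement has to follow the order of the file.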

## 04.01 Basic analytics
We can use simple aggregates to get information.

Aggregates that can be used are

- `AVG` computes the average
- `SUM` computes the sum
- `MIN` or `MAX` compute the minimum or maximum
- ...

A complete list is available under https://www.postgresql.org/docs/9.5/functions-aggregate.html
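As a small illustration of these aggregates, the following sketch runs them on three made-up rows. It assumes an in-memory SQLite database as a stand-in for the PostgreSQL database used in this notebook; the table is a reduced, hypothetical version of `flights`.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (origin_city_name TEXT, distance REAL, dep_delay REAL)"
)

# Made-up rows for illustration only (not taken from the BTS data).
toy_rows = [
    ("Providence, RI", 100.0, 5.0),
    ("Providence, RI", 300.0, None),  # missing delay, stored as NULL
    ("New York, NY", 200.0, 15.0),
]
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", toy_rows)

# One pass over the table computes all aggregates at once.
min_dist, max_dist, sum_dist, avg_delay, n_rows, n_delays = conn.execute(
    "SELECT MIN(distance), MAX(distance), SUM(distance), "
    "AVG(dep_delay), COUNT(*), COUNT(dep_delay) FROM flights"
).fetchone()

print(min_dist, max_dist, sum_dist, avg_delay, n_rows, n_delays)
```

`AVG(dep_delay)` and `COUNT(dep_delay)` skip the row with the missing delay, whereas `COUNT(*)` counts every row. This matters for columns such as `DEP_DELAY` that may contain empty values.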

For example, what is the longest flight recorded in the dataset?

In [None]:
!psql -c 'SELECT MAX(distance) FROM flights;' cs6

We can now get the rows which belong to the longest flight

In [None]:
!psql -c 'SELECT origin_city_name, dest_city_name FROM flights WHERE distance=4983.0' cs6

As we can see, the output is quite long because one row per flight was returned, so the same city pairs appear over and over.
We can shorten it by using the `DISTINCT` keyword to eliminate the duplicates.

In [None]:
!psql -c 'SELECT DISTINCT origin_city_name, dest_city_name FROM flights WHERE distance=4983.0' cs6
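The effect of `DISTINCT` can be sketched on a few made-up rows. The example below assumes an in-memory SQLite database as a stand-in for PostgreSQL and hypothetical city names. It also uses a subquery for the maximum distance instead of typing in the value by hand, which is one possible way to combine the two steps.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (origin_city_name TEXT, dest_city_name TEXT, distance REAL)"
)

# Made-up rows: the longest route is flown several times.
toy_rows = [
    ("City A", "City B", 500.0),
    ("City A", "City B", 500.0),
    ("City B", "City A", 500.0),
    ("City A", "City C", 120.0),
]
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", toy_rows)

longest = "WHERE distance = (SELECT MAX(distance) FROM flights)"

# Without DISTINCT: one row per flight on the longest route.
all_pairs = conn.execute(
    f"SELECT origin_city_name, dest_city_name FROM flights {longest}"
).fetchall()

# With DISTINCT: each (origin, destination) pair appears once.
distinct_pairs = conn.execute(
    f"SELECT DISTINCT origin_city_name, dest_city_name FROM flights {longest} "
    "ORDER BY origin_city_name, dest_city_name"
).fetchall()

print(len(all_pairs), distinct_pairs)
```

Both directions of the route remain in the result, because `DISTINCT` compares whole rows and `(City A, City B)` differs from `(City B, City A)`.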

## 04.02 Joining other datasets
Sometimes there is information stored in other tables which we would like to combine with the current data.

For this, let's ask the following question:

Which carrier serves the most flights from New York? 

In [None]:
!psql -c "SELECT op_unique_carrier, COUNT(*) FROM flights WHERE origin_city_name \
LIKE '%New York%' GROUP BY op_unique_carrier ORDER BY COUNT(*) DESC;" cs6

The carrier codes, however, are hard to read. Fortunately, there is a lookup table that maps each code to the airline's name, which we can join in!

In [None]:
!head airlines.csv

In [None]:
!psql -c 'CREATE TABLE airlines(code VARCHAR, name VARCHAR);' cs6

In [None]:
!psql -c "COPY airlines(code, name) FROM '$(pwd)/airlines.csv' DELIMITER ',' CSV HEADER;" cs6

Let's now join this table with the flights!

In [None]:
!psql -c "SELECT DISTINCT a.name, f.origin_city_name, f.dest_city_name \
FROM flights f JOIN airlines a ON f.op_unique_carrier = a.code WHERE f.origin_city_name LIKE '%Providence%'" cs6

In [None]:
!psql -c "SELECT a.name, COUNT(*) FROM flights f JOIN airlines a ON f.op_unique_carrier = a.code \
WHERE origin_city_name LIKE '%New York%' \
GROUP BY a.name ORDER BY COUNT(*) DESC;" cs6

So Delta Air Lines seems to operate the most flights out of NYC!

How about Providence (PVD)?

In [None]:
!psql -c "SELECT a.name, COUNT(*) FROM flights f JOIN airlines a ON f.op_unique_carrier = a.code \
WHERE origin_city_name LIKE '%Providence%' \
GROUP BY a.name ORDER BY COUNT(*) DESC;" cs6
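The join-then-aggregate pattern used above can be tried on a tiny example. This is a sketch under assumptions: an in-memory SQLite database stands in for PostgreSQL, and the carrier codes and airline names are invented.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE flights (op_unique_carrier TEXT, origin_city_name TEXT);
    CREATE TABLE airlines (code TEXT, name TEXT);
    """
)

# Invented lookup table and flights (not real carriers).
conn.executemany(
    "INSERT INTO airlines VALUES (?, ?)",
    [("XA", "Example Air"), ("XB", "Sample Airways")],
)
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [
        ("XA", "New York, NY"),
        ("XA", "New York, NY"),
        ("XB", "New York, NY"),
        ("XB", "Providence, RI"),
        ("ZZ", "New York, NY"),  # code without an entry in airlines
    ],
)

# Join each flight with its airline name, then count per airline.
per_airline = conn.execute(
    "SELECT a.name, COUNT(*) FROM flights f "
    "JOIN airlines a ON f.op_unique_carrier = a.code "
    "WHERE f.origin_city_name LIKE '%New York%' "
    "GROUP BY a.name ORDER BY COUNT(*) DESC"
).fetchall()

print(per_airline)
```

The flight with code `ZZ` does not show up in the result: a plain `JOIN` is an inner join, so flights whose carrier code is missing from the lookup table are silently dropped from the counts.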

To reset this notebook's db, run

In [None]:
!psql -c "DROP TABLE airlines; DROP TABLE flights;" cs6

## 04.03 Data analytics in MongoDB

MongoDB also provides data aggregation features, but using them is a bit more involved than writing SQL statements. To aggregate across documents, you define an aggregation pipeline, i.e. a list of stages that the documents pass through one after another:

https://docs.mongodb.com/manual/aggregation/

In [None]:
import pymongo

client = pymongo.MongoClient()

db = client['cs6']

Let's load the flight data into MongoDB

In [None]:
import csv

In [None]:
%%time
with open('data/otp_flights_2018_1.csv') as fp:
    reader = csv.DictReader(fp)
    
    rows = [dict(row) for row in reader]
    
    db.flights.insert_many(rows)

In [None]:
db.flights.find_one()
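One thing to keep in mind: `csv.DictReader` returns every value as a string, so fields such as `DISTANCE` end up as strings in the inserted documents. That is fine for counting, but numeric aggregation would need numbers. The following is a minimal sketch of converting selected fields before inserting; the sample data, the `NUMERIC_FIELDS` tuple and the `to_document` helper are hypothetical and not part of the course material.

```python
import csv
import io

# Made-up sample mimicking the CSV layout (not real BTS data).
sample_csv = io.StringIO(
    "OP_UNIQUE_CARRIER,DISTANCE,DEP_DELAY\n"
    "XA,187.0,-3.0\n"
    "XB,2475.0,\n"       # empty DEP_DELAY field
)

# Hypothetical choice of which columns to treat as numbers.
NUMERIC_FIELDS = ("DISTANCE", "DEP_DELAY")


def to_document(row):
    """Copy a CSV row and convert the numeric fields to float (or None)."""
    doc = dict(row)
    for field in NUMERIC_FIELDS:
        doc[field] = float(doc[field]) if doc[field] else None
    return doc


docs = [to_document(row) for row in csv.DictReader(sample_csv)]
print(docs)
```

A list like `docs` could then be passed to `db.flights.insert_many(...)` in place of the raw rows.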

Let's try to run the same New York query as above in MongoDB!

In [None]:
db.flights.find_one({'ORIGIN_CITY_NAME' : {'$regex' : 'New York'}})

The first step is to define a simple counting pipeline: `$match` restricts the documents to flights originating in New York, and `$group` counts them per carrier.

In [None]:
res = db.flights.aggregate([{'$match' : {'ORIGIN_CITY_NAME' : {'$regex' : 'New York'}}},
                    {'$group' : {'_id' : '$OP_UNIQUE_CARRIER', 'total': { '$sum' : 1}}}])

list(res)

The second step is to sort the result by the count.

In [None]:
res = db.flights.aggregate([{'$match' : {'ORIGIN_CITY_NAME' : {'$regex' : 'New York'}}},
                    {'$group' : {'_id' : '$OP_UNIQUE_CARRIER', 'total': { '$sum' : 1}}},
                           {'$sort' : {'total' : -1}}]) # 1 for ascending, -1 for descending

list(res)
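To see what the three stages compute, here is a plain-Python equivalent on a handful of made-up documents. It is only meant to illustrate the semantics of `$match`, `$group` and `$sort`; it does not use MongoDB, and the carrier codes are invented.

```python
import re
from collections import Counter

# Made-up documents with the same field names as the flights collection.
flights = [
    {"ORIGIN_CITY_NAME": "New York, NY", "OP_UNIQUE_CARRIER": "XA"},
    {"ORIGIN_CITY_NAME": "New York, NY", "OP_UNIQUE_CARRIER": "XB"},
    {"ORIGIN_CITY_NAME": "New York, NY", "OP_UNIQUE_CARRIER": "XA"},
    {"ORIGIN_CITY_NAME": "Providence, RI", "OP_UNIQUE_CARRIER": "XA"},
]

# $match: keep documents whose origin contains "New York".
pattern = re.compile("New York")
matched = [doc for doc in flights if pattern.search(doc["ORIGIN_CITY_NAME"])]

# $group with {'$sum': 1}: count the documents per carrier code.
totals = Counter(doc["OP_UNIQUE_CARRIER"] for doc in matched)

# $sort with -1: order the groups by their count, descending.
result = [
    {"_id": carrier, "total": count}
    for carrier, count in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
]
print(result)
```

Each stage consumes the output of the previous one, which is why the order of the stages in the pipeline matters.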

The `_id` field looks rather unpleasant, but it can be renamed with a `$project` stage.

In [None]:
res = db.flights.aggregate([{'$match' : {'ORIGIN_CITY_NAME' : {'$regex' : 'New York'}}},
                    {'$group' : {'_id' : '$OP_UNIQUE_CARRIER', 'total': { '$sum' : 1}}},
                           {'$sort' : {'total' : -1}}, # 1 for ascending, -1 for descending
                           {'$project' : {'_id' : 0, 'carrier_code' : '$_id', 'total' : 1}}]) # hide _id, expose it as carrier_code

list(res)

What is still missing, though, is the lookup of the airline name. This can also be done in MongoDB, using the `$lookup` stage!

In [None]:
%%time
with open('airlines.csv') as fp:
    reader = csv.DictReader(fp)
    
    rows = [dict(row) for row in reader]
    
    db.airlines.insert_many(rows)

In [None]:
res = db.flights.aggregate([{'$match' : {'ORIGIN_CITY_NAME' : {'$regex' : 'New York'}}},
                    {'$group' : {'_id' : '$OP_UNIQUE_CARRIER', 'total': { '$sum' : 1}}},
                           {'$sort' : {'total' : -1}}, # 1 for ascending, -1 for descending
                           {'$project' : {'_id' : 0, 'carrier_code' : '$_id', 'total' : 1}},
                           {'$lookup' : {'from' : 'airlines',
                                         'localField' : 'carrier_code',
                                         'foreignField' : 'Code',
                                         'as' : 'airline'}}]) # attach the matching airlines documents as an array

list(res)[:5]

Again, some projection is needed to get things nicely formatted!

In [None]:
res = db.flights.aggregate([{'$match' : {'ORIGIN_CITY_NAME' : {'$regex' : 'New York'}}},
                    {'$group' : {'_id' : '$OP_UNIQUE_CARRIER', 'total': { '$sum' : 1}}},
                           {'$sort' : {'total' : -1}},
                           {'$project' : {'_id' : 0, 'carrier_code' : '$_id', 'total' : 1}},
                           {'$lookup' : {'from' : 'airlines', 
                                         'localField' : 'carrier_code',
                                         'foreignField' : 'Code',
                                         'as' : 'airline'}},
                           {'$project' : {'total' : 1, 'airline' : {
                               '$arrayElemAt': [ '$airline.Description', 0 ] }
                                         }
                           }])

list(res)
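The last two stages can again be mimicked in plain Python to show what they do. This is a sketch on invented data, not MongoDB itself: `$lookup` attaches an array of all matching `airlines` documents to each result, and the final `$project` picks the `Description` of the first match. How a carrier without any match is handled in the projection is an assumption here; the sketch simply leaves the field out.

```python
# Made-up output of the earlier stages and a made-up airlines collection.
grouped = [
    {"carrier_code": "XA", "total": 2},
    {"carrier_code": "ZZ", "total": 1},  # no entry in airlines
]
airlines = [{"Code": "XA", "Description": "Example Air"}]

# $lookup: for each document, collect ALL airlines whose 'Code' equals
# 'carrier_code' into an array field named 'airline' (possibly empty).
looked_up = [
    {**doc, "airline": [a for a in airlines if a["Code"] == doc["carrier_code"]]}
    for doc in grouped
]

# $project with $arrayElemAt index 0: keep 'total' and the first description.
projected = []
for doc in looked_up:
    out = {"total": doc["total"]}
    if doc["airline"]:  # assumption: no match means the field is left out
        out["airline"] = doc["airline"][0]["Description"]
    projected.append(out)

print(projected)
```

Unlike the SQL `JOIN` used earlier, `$lookup` keeps documents without a match (with an empty array), so it behaves like a left outer join.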

To reset the MongoDB collections, use

In [None]:
db.flights.drop()
db.airlines.drop()