## Aggregates and Joins

<br/>

### Summary Methods

Pandas provides a set of summary methods that give a quick overview of the data in a DataFrame. They make it easy to get things like _counts_, _mean values_, _unique counts_, and _frequency of values_:


In [None]:
import pandas as pd

# read data from csv
flights = pd.read_csv('../data/flights.csv', header=0)

# invoke summary methods on columns
# describe method on a text field
flights.src.describe()
# describe method on float field
flights.flight_time.describe()

# getting unique values
flights.airline.unique()
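
The unique counts and value frequencies mentioned above can be sketched with `.nunique()` and `.value_counts()`. The rows below are hypothetical sample data, not the contents of `flights.csv`; only the column names are taken from the examples in this section.

```python
import pandas as pd

# Hypothetical sample rows; the real flights.csv is assumed to have these columns
flights = pd.DataFrame({
    "airline": ["AA", "AA", "UA", "DL", "UA", "AA"],
    "src": ["SEA", "SFO", "SEA", "JFK", "SFO", "SEA"],
    "flight_time": [2.0, 5.5, 2.1, 6.0, 5.4, 2.2],
})

# number of distinct airlines
unique_airlines = flights.airline.nunique()
# how often each airline appears, most frequent first
airline_frequency = flights.airline.value_counts()
# mean of a numeric column
mean_flight_time = flights.flight_time.mean()

print("unique airlines:", unique_airlines)
print("airline frequency:\n", airline_frequency)
print("mean flight time:", mean_flight_time)
```

`.unique()` returns the distinct values themselves, while `.nunique()` returns only how many there are.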


<br/>

### Aggregates

Pandas provides a `.groupby()` method that makes it easy to compute aggregates over a DataFrame. It is handy for computing per-group sums, counts, and minimum and maximum values.

The example below shows how to use `count()`, `sum()`, `min()`, and `max()`:

In [None]:
import pandas as pd

# read data from csv
flights = pd.read_csv('../data/flights.csv', header=0)

# get flight counts by airline
flights_per_airline = flights.groupby('airline').flight_number.count()

# total traveled miles by airlines
miles_per_airline = flights.groupby('airline').distance.sum()
# use other functions like min, max with aggregates
min_distance_per_airlines = flights.groupby('airline').distance.min()
max_distance_per_airlines = flights.groupby('airline').distance.max()

print("flights per airline:\n", flights_per_airline.head(5))
print("miles per airline:\n", miles_per_airline.head(5))
print("min distance per airline:\n", min_distance_per_airlines.head(5))
print("max distance per airline:\n", max_distance_per_airlines.head(5))


<br/>

#### Multiple Aggregates Stitched Together

The example below shows how to create a grouped object once, compute different aggregates on one or more columns (or even apply a transformation), and then stitch the results back together for display.

The key here is that every aggregate computed from a `.groupby()` returns a series indexed by the group keys. Because all of the aggregates share that index, they line up row by row when combined:


In [None]:
# create a grouped object
grouped = flights.groupby('airline')
# create aggregate series
flight_count = grouped.flight_number.count()
total_distance = grouped.distance.sum()
min_distance = grouped.distance.min()
max_distance = grouped.distance.max()
# change series names (they become the column names after concat)
flight_count.name, total_distance.name = 'flight_count', 'total_distance'
min_distance.name, max_distance.name = 'min_dist', 'max_dist'
# stitch the aggregates back together, aligned on the airline index
stitched_back = pd.concat([flight_count, total_distance, min_distance, max_distance], axis=1)

# print
print(stitched_back.head(5))
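
To make the shared index concrete, here is a minimal sketch on hypothetical sample rows (the values are invented for illustration). It uses `.rename()` as an alternative to assigning `.name`, and shows that each aggregate series is indexed by the group key, which is what lets `pd.concat` line the columns up.

```python
import pandas as pd

# Hypothetical sample rows using the column names from this section
flights = pd.DataFrame({
    "airline": ["AA", "AA", "UA", "DL", "UA"],
    "flight_number": [100, 101, 200, 300, 201],
    "distance": [680, 2580, 700, 2475, 2586],
})

grouped = flights.groupby("airline")
# .rename() returns a new series with the given name
flight_count = grouped.flight_number.count().rename("flight_count")
total_distance = grouped.distance.sum().rename("total_distance")

# both series are indexed by the group key (airline)
print(flight_count.index.tolist())
print(total_distance.index.tolist())

# concat along axis=1 aligns rows by index label, one column per series
summary = pd.concat([flight_count, total_distance], axis=1)
print(summary)
```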


<br/>

#### Multiple Aggregates Using .agg()

Alternatively, pandas provides the `.agg()` method to apply multiple aggregates to a column at the same time. You can accomplish the same result much more concisely:


In [None]:
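# compute several aggregates of distance per airline in one call
# string names are used instead of the builtins len, sum, min, max,
# which recent pandas versions warn about; note 'count' skips missing values
grouped = flights.groupby('airline').distance.agg(['count', 'sum', 'min', 'max'])
print(grouped.head(5))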
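
When the aggregates span more than one column, as in the stitched example earlier, one option is named aggregation, where each keyword argument to `.agg()` maps an output column name to a `(column, function)` pair. This is a sketch on hypothetical sample rows, reusing the output names from the stitched example.

```python
import pandas as pd

# Hypothetical sample rows using the column names from this section
flights = pd.DataFrame({
    "airline": ["AA", "AA", "UA", "DL", "UA"],
    "flight_number": [100, 101, 200, 300, 201],
    "distance": [680, 2580, 700, 2475, 2586],
})

# each keyword is an output column: name=(source column, aggregate)
summary = flights.groupby("airline").agg(
    flight_count=("flight_number", "count"),
    total_distance=("distance", "sum"),
    min_dist=("distance", "min"),
    max_dist=("distance", "max"),
)
print(summary)
```

This produces the same shape of table as the `pd.concat` approach without creating and renaming the intermediate series.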



<br/>

#### GroupBy Multiple Columns

You can pass a list of columns to the `groupby()` method to aggregate by multiple columns at the same time. The example below calculates flight counts per airline route (airline, src, dest):


In [None]:
# get flight counts for distinct routes
flights_per_route = flights.groupby(['airline', 'src', 'dest']).flight_number.count()
flights_per_route
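
Grouping by several columns yields a series with a multi-level index, one level per group key. The sketch below, on hypothetical sample rows, shows how a single route can be looked up by a tuple of keys and how `.reset_index()` turns the index levels back into ordinary columns. The column name `flight_count` is chosen here for illustration.

```python
import pandas as pd

# Hypothetical sample rows using the column names from this section
flights = pd.DataFrame({
    "airline": ["AA", "AA", "AA", "UA"],
    "src": ["SEA", "SEA", "SFO", "SEA"],
    "dest": ["SFO", "SFO", "JFK", "SFO"],
    "flight_number": [100, 101, 102, 200],
})

flights_per_route = flights.groupby(["airline", "src", "dest"]).flight_number.count()

# the index has one level per group key
print(list(flights_per_route.index.names))

# look up one route with a tuple of keys
aa_sea_sfo = flights_per_route.loc[("AA", "SEA", "SFO")]
print(aa_sea_sfo)

# flatten the index levels into regular columns
routes = flights_per_route.reset_index(name="flight_count")
print(routes)
```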


<br/>

### Sorting Values

Use the `.sort_values()` series method to sort a series by its values, such as the result of an aggregate:


In [None]:
# get total flight distance by airline
grouped = flights.groupby('airline').distance.sum()
# sort in descending order
grouped.sort_values(ascending=False)
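
A DataFrame has a `.sort_values()` method as well, which takes the column (or list of columns) to sort on through the `by` parameter. The sketch below uses hypothetical sample rows to show both a single-column sort and a sort on two columns with different directions.

```python
import pandas as pd

# Hypothetical sample rows using the column names from this section
flights = pd.DataFrame({
    "airline": ["AA", "AA", "UA", "DL", "UA"],
    "distance": [680, 2580, 700, 2475, 2586],
})

# total distance per airline, with airline as a regular column
totals = flights.groupby("airline").distance.sum().reset_index()

# sort the DataFrame by one column, largest total first
sorted_totals = totals.sort_values(by="distance", ascending=False)
print(sorted_totals)

# sort by airline (ascending), then by distance (descending) within each airline
by_airline = flights.sort_values(by=["airline", "distance"], ascending=[True, False])
print(by_airline)
```

Like the series method, `DataFrame.sort_values()` returns a new sorted object and leaves the original unchanged.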