# Pandas Essentials: Sorting and Grouping Your Data

This pandas notebook covers the essentials of sorting and grouping data, using the [New York City pizza restaurant inspection data](https://github.com/ecerami/pydata-essentials/blob/master/pandas/data/NYC_Pizza_2017.csv) as a running example.

Topics include:

* Sorting data
* Grouping data

# Loading and Preparing Data

In [1]:
import pandas as pd

# To get started, we load the NYC Pizza Restaurant Inspection Data Set
pizza_df = pd.read_csv("data/NYC_Pizza_2017.csv")

# Change data type of CAMIS and ZIPCODE to String
pizza_df.CAMIS = pizza_df.CAMIS.astype(str)
pizza_df.ZIPCODE = pizza_df.ZIPCODE.astype(int).astype(str)

# Set max display columns and rows (for more compact view)
pd.options.display.max_columns = 6
pd.options.display.max_rows = 6

In [2]:
pizza_df.head()

| | CAMIS | DBA | BORO | ... | SCORE | GRADE | GRADE DATE |
|---|---|---|---|---|---|---|---|
| 0 | 40363644 | DOMINO'S | MANHATTAN | ... | 4.0 | A | 2017-03-30 |
| 1 | 40363945 | DOMINO'S | MANHATTAN | ... | 12.0 | A | 2017-03-02 |
| 2 | 40364920 | RIZZO'S FINE PIZZA | QUEENS | ... | 12.0 | A | 2016-11-03 |
| 3 | 40365280 | COMO PIZZA | MANHATTAN | ... | 10.0 | A | 2016-08-29 |
| 4 | 40365632 | J&V FAMOUS PIZZA | BROOKLYN | ... | 2.0 | A | 2017-04-05 |


# Sorting DataFrames

Use `sort_values()` to sort a DataFrame by one or more columns. It returns a new, sorted DataFrame, so assign the result to a variable to keep it.

In [11]:
# Example:  Select restaurants by zipcode and then sort by score
neighborhood = pizza_df[pizza_df.ZIPCODE == "10029"]
neighborhood = neighborhood.sort_values(by="SCORE", ascending=True)
neighborhood.head()

| | CAMIS | DBA | BORO | ... | SCORE | GRADE | GRADE DATE |
|---|---|---|---|---|---|---|---|
| 1122 | 50058564 | DREAMERS PIZZA | MANHATTAN | ... | 2.0 | A | 2017-02-02 |
| 134 | 40721285 | DOMINO'S | MANHATTAN | ... | 2.0 | A | 2017-03-24 |
| 203 | 40878675 | TACO BELL PIZZA HUT EXPRESS | MANHATTAN | ... | 2.0 | A | 2016-09-26 |
| 1115 | 50058340 | LITTLE CAESARS | MANHATTAN | ... | 2.0 | A | 2017-03-22 |
| 898 | 50035899 | NEAPOLITAN EXPRESS | MANHATTAN | ... | 4.0 | A | 2016-10-05 |
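
`sort_values()` also accepts a list of columns with a matching list of `ascending` flags, which is useful for breaking ties such as the several restaurants above that share a score of 2.0. The sketch below uses a small made-up frame (the rows and the names `toy_df` and `toy_sorted` are illustrative, not taken from the inspection data) and sorts by `SCORE`, then alphabetically by `DBA`.

```python
import pandas as pd

# Toy data for illustration only; not rows from the NYC data set
toy_df = pd.DataFrame({
    "DBA": ["PIZZA C", "PIZZA A", "PIZZA B", "PIZZA D"],
    "SCORE": [4.0, 2.0, 2.0, 12.0],
})

# Sort by SCORE (lowest first); break ties alphabetically by DBA
toy_sorted = toy_df.sort_values(by=["SCORE", "DBA"], ascending=[True, True])

# reset_index(drop=True) renumbers the rows after sorting
toy_sorted = toy_sorted.reset_index(drop=True)
print(toy_sorted)
```

Without `reset_index()`, the sorted frame keeps the original row labels, which is why the index column in the output above is out of order.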


# Groupby

The `groupby()` method follows the **split/apply/combine** paradigm: it splits a DataFrame into groups by key, applies a function to each group, and combines the results into a new object.
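
Conceptually, the three steps can be written out by hand and compared with the built-in call. The sketch below does this on a small made-up frame; the data and the names `toy_df`, `manual`, and `builtin` are illustrative only, and the loop is a mental model rather than a description of how pandas implements grouping internally.

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    "BORO": ["BRONX", "QUEENS", "BRONX", "QUEENS"],
    "SCORE": [10.0, 4.0, 20.0, 6.0],
})

# Split: iterating over groupby() yields (key, sub-frame) pairs
# Apply: compute the mean SCORE of each sub-frame
pieces = {boro: group["SCORE"].mean() for boro, group in toy_df.groupby("BORO")}

# Combine: collect the per-group results into a single Series
manual = pd.Series(pieces, name="SCORE")

# The equivalent one-liner using groupby() directly
builtin = toy_df.groupby("BORO")["SCORE"].mean()

print(manual)
print(builtin)
```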

## Example 1:  Which ZipCode has the most Pizza Restaurants?

In [5]:
# Count by ZipCode
zipcode = pizza_df.groupby(by="ZIPCODE").size()
zipcode = zipcode.sort_values(ascending=False)
zipcode.head()

ZIPCODE
10019    21
11385    21
10002    19
10003    19
10029    18
dtype: int64
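
For a plain count per value, `value_counts()` on the column is a common shorthand that counts and sorts in descending order in one call. A minimal sketch on made-up ZIP codes (illustrative data only); note that the order of tied counts may differ between the two approaches.

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame(
    {"ZIPCODE": ["10029", "10019", "10029", "10002", "10029", "10019"]}
)

# groupby + size + sort, as in the example above
by_group = toy_df.groupby(by="ZIPCODE").size().sort_values(ascending=False)

# value_counts() counts and sorts (descending) in one step
by_counts = toy_df["ZIPCODE"].value_counts()

print(by_group)
print(by_counts)
```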

## Example 2:  Which Pizza Chain is most Dominant in NYC?

In [6]:
# Count by Business Name, identify chains versus independent restaurants
dba = pizza_df.groupby(by="DBA").size()
dba = dba.sort_values(ascending=False)
dba.head()

DBA
DOMINO'S          84
PAPA JOHN'S       46
LITTLE CAESARS    38
PIZZA HUT         18
JOE'S PIZZA        9
dtype: int64
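
The same counts can be used to separate chains from independent restaurants. The sketch below, on made-up names, treats any name that appears more than once as a chain; that threshold is an assumption made for illustration, not a rule taken from the data set.

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    "DBA": ["CHAIN A", "CHAIN A", "CHAIN A", "CHAIN B", "CHAIN B", "SOLO PIZZA"],
})

counts = toy_df.groupby(by="DBA").size()

# Assumption: a name listed more than once is a chain
chains = counts[counts > 1].sort_values(ascending=False)
independents = counts[counts == 1]

print(chains)
print(len(independents), "independent restaurant(s)")
```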

## Example 3:  Which Borough has the Lowest Mean Inspection Score?

In [7]:
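# Mean Score by BORO (numeric_only=True skips text columns such as DBA,
# which pandas 2.0+ would otherwise refuse to average)
boro = pizza_df.groupby(by="BORO").mean(numeric_only=True)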
boro = boro.sort_values(by="SCORE")
boro.head()

| BORO | SCORE |
|---|---|
| BROOKLYN | 9.973333 |
| QUEENS | 9.989130 |
| MANHATTAN | 10.075758 |
| BRONX | 10.554974 |
| STATEN ISLAND | 10.843137 |


In [8]:
# Alternative Syntax #1
boro = pizza_df.groupby(by="BORO").SCORE.mean()
boro = boro.sort_values()
boro.head()

BORO
BROOKLYN          9.973333
QUEENS            9.989130
MANHATTAN        10.075758
BRONX            10.554974
STATEN ISLAND    10.843137
Name: SCORE, dtype: float64

In [9]:
# Alternative Syntax #2
boro = pizza_df.groupby(by="BORO")["SCORE"].mean()
boro = boro.sort_values()
boro.head()

BORO
BROOKLYN          9.973333
QUEENS            9.989130
MANHATTAN        10.075758
BRONX            10.554974
STATEN ISLAND    10.843137
Name: SCORE, dtype: float64
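
To answer the question directly rather than reading the first row of the sorted output, `idxmin()` returns the index label of the smallest value. A minimal sketch on made-up scores (the data and variable names are illustrative only):

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    "BORO": ["BRONX", "BRONX", "QUEENS", "QUEENS", "BROOKLYN"],
    "SCORE": [12.0, 10.0, 8.0, 10.0, 10.0],
})

mean_score = toy_df.groupby(by="BORO")["SCORE"].mean()

# idxmin() gives the group label with the lowest mean; min() gives the value
lowest_boro = mean_score.idxmin()
lowest_mean = mean_score.min()
print(lowest_boro, lowest_mean)
```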

## Example 4:  Calculate Inspection score stats by Borough

In [14]:
import numpy as np
# Select SCORE first so that text columns are not aggregated
boro = pizza_df.groupby(by="BORO")[["SCORE"]].aggregate(
    [np.min, np.max, np.mean, np.std])
boro.head()

| BORO | SCORE (amin) | SCORE (amax) | SCORE (mean) | SCORE (std) |
|---|---|---|---|---|
| BRONX | 0.0 | 54.0 | 10.554974 | 7.125629 |
| BROOKLYN | 0.0 | 43.0 | 9.973333 | 5.740715 |
| MANHATTAN | 0.0 | 42.0 | 10.075758 | 5.352358 |
| QUEENS | 0.0 | 42.0 | 9.989130 | 5.407391 |
| STATEN ISLAND | 2.0 | 39.0 | 10.843137 | 7.035261 |
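
Named aggregation is another way to write this kind of summary: each keyword passed to `agg()` becomes an output column, built from a `(column, function)` pair. This yields flat column names instead of the two-level (`SCORE`, function) labels that a list of functions produces. A minimal sketch on made-up scores; the output column names such as `min_score` are chosen here for illustration.

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    "BORO": ["BRONX", "BRONX", "QUEENS", "QUEENS"],
    "SCORE": [0.0, 20.0, 4.0, 8.0],
})

# keyword = (source column, aggregation name)
stats = toy_df.groupby(by="BORO").agg(
    min_score=("SCORE", "min"),
    max_score=("SCORE", "max"),
    mean_score=("SCORE", "mean"),
    std_score=("SCORE", "std"),
)
print(stats)
```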
