# Pandas Essentials: Loading and Grokking Your Data

This pandas notebook covers the essentials of loading and "grokking" (that is, understanding) your data. Concepts are illustrated with the [New York City pizza restaurant inspection data](https://github.com/ecerami/pydata-essentials/blob/master/pandas/data/NYC_Pizza_2017.csv).

Topics include:

* Loading data
* Grokking your data set:
    * Getting the shape of your data
    * Peeking at your data
    * Understanding indexes, columns and data types
    * Generating descriptive statistics for your data


# Loading data

In [16]:
# Loading data via pd.read_csv
import pandas as pd
pizza_df = pd.read_csv("data/NYC_Pizza_2017.csv")
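
`pd.read_csv` also accepts options that control how columns are typed while loading. The following is a minimal sketch, not the notebook's own code: it assumes a tiny inline sample (`csv_text`, a hypothetical stand-in for the real file) so that it runs without the data directory.

```python
import io
import pandas as pd

# Hypothetical inline sample standing in for data/NYC_Pizza_2017.csv
csv_text = """CAMIS,DBA,BORO,ZIPCODE,SCORE,GRADE,GRADE DATE
40363644,DOMINO'S,MANHATTAN,10016.0,4.0,A,2017-03-30
40363945,DOMINO'S,MANHATTAN,10023.0,12.0,A,2017-03-02
40364920,RIZZO'S FINE PIZZA,QUEENS,11103.0,12.0,A,2016-11-03
"""

sample_df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["GRADE DATE"],  # parse this column as datetime64
    dtype={"CAMIS": str},        # keep the restaurant ID as text, not a number
)
print(sample_df.dtypes)
```

Whether to load an identifier such as `CAMIS` as text is a design choice; it mainly prevents accidental arithmetic on values that are really labels.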

# Grokking your data

## Getting the shape of your data

In [17]:
# Determine dimensions of data frame
pizza_df.shape

(1148, 10)

In [20]:
# shape is a (rows, columns) tuple, so you can index into it
# for example, element 0 is the number of rows
pizza_df.shape[0]

1148
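
Because `shape` is an ordinary tuple, it can also be unpacked into two names in one step. A small sketch, using a hypothetical three-row frame (`demo_df`) rather than the full data set:

```python
import pandas as pd

# Hypothetical three-row frame used only for illustration
demo_df = pd.DataFrame({
    "DBA": ["DOMINO'S", "COMO PIZZA", "LA VERA PIZZA"],
    "SCORE": [4.0, 10.0, 9.0],
})

n_rows, n_cols = demo_df.shape  # unpack the (rows, columns) tuple
print(n_rows, n_cols)           # 3 2
print(len(demo_df))             # len() also gives the row count: 3
```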

## Peeking at your data

In [3]:
pizza_df.head()

| | CAMIS | DBA | BORO | BUILDING | STREET | ZIPCODE | CUISINE DESCRIPTION | SCORE | GRADE | GRADE DATE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40363644 | DOMINO'S | MANHATTAN | 464 | 3 AVENUE | 10016.0 | Pizza | 4.0 | A | 2017-03-30 |
| 1 | 40363945 | DOMINO'S | MANHATTAN | 148 | WEST 72 STREET | 10023.0 | Pizza | 12.0 | A | 2017-03-02 |
| 2 | 40364920 | RIZZO'S FINE PIZZA | QUEENS | 3013 | STEINWAY STREET | 11103.0 | Pizza | 12.0 | A | 2016-11-03 |
| 3 | 40365280 | COMO PIZZA | MANHATTAN | 4035 | BROADWAY | 10032.0 | Pizza | 10.0 | A | 2016-08-29 |
| 4 | 40365632 | J&V FAMOUS PIZZA | BROOKLYN | 6322 | 18 AVENUE | 11204.0 | Pizza | 2.0 | A | 2017-04-05 |


In [4]:
pizza_df.tail()

| | CAMIS | DBA | BORO | BUILDING | STREET | ZIPCODE | CUISINE DESCRIPTION | SCORE | GRADE | GRADE DATE |
|---|---|---|---|---|---|---|---|---|---|---|
| 1143 | 50059977 | LUIGI'S PIZZA | MANHATTAN | 2127 | AMSTERDAM AVE | 10032.0 | Pizza | 2.0 | A | 2017-03-03 |
| 1144 | 50060049 | TRADITA BRICK OVEN PIZZA | BRONX | 292 | E 204TH ST | 10467.0 | Pizza | 4.0 | A | 2017-03-27 |
| 1145 | 50060439 | LA VERA PIZZA | MANHATTAN | 922 | 2ND AVE | 10017.0 | Pizza | 9.0 | A | 2017-03-24 |
| 1146 | 50060695 | 2 BROS PIZZA | QUEENS | 16417 | JAMAICA AVE | 11432.0 | Pizza | 7.0 | Z | 2017-04-10 |
| 1147 | 50062741 | LA ROMANA PIZZERIA | BROOKLYN | 755 | GRAND ST | 11211.0 | Pizza | 6.0 | A | 2017-04-18 |
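
Both `head()` and `tail()` return five rows by default and accept a row count, and `sample()` draws random rows, which can be more representative than the two ends of the file. A minimal sketch on a hypothetical six-row frame (`demo_df`):

```python
import pandas as pd

# Hypothetical six-row frame used only for illustration
demo_df = pd.DataFrame({
    "DBA": ["DOMINO'S", "RIZZO'S FINE PIZZA", "COMO PIZZA",
            "J&V FAMOUS PIZZA", "LA VERA PIZZA", "2 BROS PIZZA"],
    "SCORE": [4.0, 12.0, 10.0, 2.0, 9.0, 7.0],
})

first_two = demo_df.head(2)    # first 2 rows instead of the default 5
last_three = demo_df.tail(3)   # last 3 rows
# random_state makes the random draw reproducible between runs
random_two = demo_df.sample(n=2, random_state=0)

print(first_two)
print(last_three)
print(random_two)
```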


## Understanding indexes, columns and data types

In [5]:
pizza_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1148 entries, 0 to 1147
Data columns (total 10 columns):
CAMIS                  1148 non-null int64
DBA                    1148 non-null object
BORO                   1148 non-null object
BUILDING               1148 non-null object
STREET                 1148 non-null object
ZIPCODE                1148 non-null float64
CUISINE DESCRIPTION    1148 non-null object
SCORE                  1148 non-null float64
GRADE                  1148 non-null object
GRADE DATE             1148 non-null object
dtypes: float64(2), int64(1), object(7)
memory usage: 89.8+ KB


In [6]:
pizza_df.index

RangeIndex(start=0, stop=1148, step=1)
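
The default `RangeIndex` simply numbers the rows. As an illustrative sketch (not part of the original notebook), a column holding unique identifiers can be promoted to the index with `set_index`, after which `.loc` looks rows up by label. The frame `demo_df` below is a hypothetical three-row sample.

```python
import pandas as pd

# Hypothetical three-row frame used only for illustration
demo_df = pd.DataFrame({
    "CAMIS": [40363644, 40364920, 40365280],
    "DBA": ["DOMINO'S", "RIZZO'S FINE PIZZA", "COMO PIZZA"],
})

print(demo_df.index)  # RangeIndex(start=0, stop=3, step=1)

# Use the restaurant ID as the row label instead of 0, 1, 2
by_id = demo_df.set_index("CAMIS")

name_by_label = by_id.loc[40364920, "DBA"]   # lookup by index label
name_by_position = by_id.iloc[1]["DBA"]      # lookup by integer position
print(name_by_label, name_by_position)
```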

In [7]:
pizza_df.columns

Index([u'CAMIS', u'DBA', u'BORO', u'BUILDING', u'STREET', u'ZIPCODE',
       u'CUISINE DESCRIPTION', u'SCORE', u'GRADE', u'GRADE DATE'],
      dtype='object')

In [8]:
pizza_df.dtypes

CAMIS                    int64
DBA                     object
BORO                    object
BUILDING                object
STREET                  object
ZIPCODE                float64
CUISINE DESCRIPTION     object
SCORE                  float64
GRADE                   object
GRADE DATE              object
dtype: object
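
The dtypes above suggest two possible cleanups: `ZIPCODE` was read as `float64` although ZIP codes are labels rather than quantities, and `GRADE DATE` was read as plain text. The sketch below shows one way to convert them; it assumes a hypothetical three-row frame (`demo_df`) with no missing values, which matches the non-null counts reported by `info()`.

```python
import pandas as pd

# Hypothetical three-row frame used only for illustration
demo_df = pd.DataFrame({
    "ZIPCODE": [10016.0, 10023.0, 11103.0],
    "SCORE": [4.0, 12.0, 12.0],
    "GRADE DATE": ["2017-03-30", "2017-03-02", "2016-11-03"],
})

# float -> int drops the trailing ".0"; int -> str makes it a label
demo_df["ZIPCODE"] = demo_df["ZIPCODE"].astype(int).astype(str)

# Convert ISO-formatted date strings to datetime64
demo_df["GRADE DATE"] = pd.to_datetime(demo_df["GRADE DATE"])

# List the columns that are still numeric
numeric_cols = demo_df.select_dtypes(include="number").columns.tolist()
print(demo_df.dtypes)
print(numeric_cols)
```

Note that `astype(int)` fails if the column contains missing values, so this approach only fits columns that are fully populated.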

## Generating descriptive statistics for your data

In [9]:
# By default, describe() summarizes numeric columns only
pizza_df.describe()

| | CAMIS | ZIPCODE | SCORE |
|---|---|---|---|
| count | 1148.0 | 1148.0 | 1148.0 |
| mean | 44739700.0 | 10740.940767 | 10.141986 |
| std | 4373390.0 | 568.064156 | 5.869539 |
| min | 40363640.0 | 10001.0 | 0.0 |
| 25% | 41146230.0 | 10036.0 | 7.0 |
| 50% | 41614640.0 | 11022.0 | 10.0 |
| 75% | 50019040.0 | 11229.0 | 12.0 |
| max | 50062740.0 | 11694.0 | 54.0 |


In [10]:
# Use include="all" to get summary info for all columns
pizza_df.describe(include="all")

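| | CAMIS | DBA | BORO | BUILDING | STREET | ZIPCODE | CUISINE DESCRIPTION | SCORE | GRADE | GRADE DATE |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1148.0 | 1148 | 1148 | 1148 | 1148 | 1148.0 | 1148 | 1148.0 | 1148 | 1148 |
| unique | NaN | 827 | 5 | 967 | 586 | NaN | 1 | NaN | 4 | 310 |
| top | NaN | DOMINO'S | MANHATTAN | 201 | BROADWAY | NaN | Pizza | NaN | A | 2017-02-22 |
| freq | NaN | 84 | 330 | 5 | 50 | NaN | 1148 | NaN | 1043 | 11 |
| mean | 44739700.0 | NaN | NaN | NaN | NaN | 10740.940767 | NaN | 10.141986 | NaN | NaN |
| std | 4373390.0 | NaN | NaN | NaN | NaN | 568.064156 | NaN | 5.869539 | NaN | NaN |
| min | 40363640.0 | NaN | NaN | NaN | NaN | 10001.0 | NaN | 0.0 | NaN | NaN |
| 25% | 41146230.0 | NaN | NaN | NaN | NaN | 10036.0 | NaN | 7.0 | NaN | NaN |
| 50% | 41614640.0 | NaN | NaN | NaN | NaN | 11022.0 | NaN | 10.0 | NaN | NaN |
| 75% | 50019040.0 | NaN | NaN | NaN | NaN | 11229.0 | NaN | 12.0 | NaN | NaN |
| max | 50062740.0 | NaN | NaN | NaN | NaN | 11694.0 | NaN | 54.0 | NaN | NaN |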
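
For text columns, `describe()` reports `count`, `unique`, `top` and `freq` instead of the numeric statistics, and `value_counts()` gives the full frequency breakdown behind `top` and `freq`. The following sketch illustrates both on a hypothetical four-row frame (`demo_df`); it also shows the `percentiles` argument for requesting quantiles other than the defaults.

```python
import pandas as pd

# Hypothetical four-row frame used only for illustration
demo_df = pd.DataFrame({
    "BORO": ["MANHATTAN", "MANHATTAN", "QUEENS", "BROOKLYN"],
    "GRADE": ["A", "A", "A", "Z"],
    "SCORE": [4.0, 12.0, 12.0, 7.0],
})

# Describing only text columns yields count / unique / top / freq
text_summary = demo_df[["BORO", "GRADE"]].describe()
print(text_summary)

# Full frequency table for one column
grade_counts = demo_df["GRADE"].value_counts()
print(grade_counts)

# Request the 10th and 90th percentiles for a numeric column
score_summary = demo_df["SCORE"].describe(percentiles=[0.1, 0.9])
print(score_summary)
```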


You can also calculate specific statistics directly.

In [9]:
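# numeric_only=True skips text columns, which recent pandas versions no longer drop automatically
pizza_df.mean(numeric_only=True)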

CAMIS      4.473970e+07
ZIPCODE    1.074094e+04
SCORE      1.014199e+01
dtype: float64

In [10]:
pizza_df.median(numeric_only=True)

CAMIS      41614636.5
ZIPCODE       11022.0
SCORE            10.0
dtype: float64

In [15]:
pizza_df.std(numeric_only=True)

CAMIS      4.373390e+06
ZIPCODE    5.680642e+02
SCORE      5.869539e+00
dtype: float64
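
The same statistics can be computed for a single column, or several at once with `agg`. A minimal sketch, assuming a hypothetical five-row frame (`demo_df`); `numeric_only=True` is passed on the whole-frame call so that the text column is skipped rather than raising an error in recent pandas versions.

```python
import pandas as pd

# Hypothetical five-row frame used only for illustration
demo_df = pd.DataFrame({
    "DBA": ["DOMINO'S", "DOMINO'S", "RIZZO'S FINE PIZZA",
            "COMO PIZZA", "J&V FAMOUS PIZZA"],
    "SCORE": [4.0, 12.0, 12.0, 10.0, 2.0],
})

# One statistic for one column
mean_score = demo_df["SCORE"].mean()
print(mean_score)  # 8.0

# Several statistics for one column in a single call
summary = demo_df["SCORE"].agg(["mean", "median", "std"])
print(summary)

# Whole-frame statistic restricted to numeric columns
numeric_means = demo_df.mean(numeric_only=True)
print(numeric_means)
```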