# Pandas

* [Pandas](http://pandas.pydata.org/) is the Swiss Army knife of data analysis. It deserves a series of workshops of its own.
* If you are not already familiar with Pandas, read the [10 minutes to Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html) guide.
* For a more extensive and lighthearted exploration of Pandas, also check [A Pandas Cookbook by Julia Evans](http://jvns.ca/blog/2013/12/22/cooking-with-pandas/).

# Requests

Requests is a simple HTTP library for Python that handles wget/curl-style operations.

## Open Data and Socrata

* Socrata describes itself as a social data discovery platform.
* It provides services to many cities, including powering the [Los Angeles data portal](https://data.lacity.org/).
* Its own portal also has lots of goodies, including [All Starbucks Locations in the World](https://opendata.socrata.com/Business/All-Starbucks-Locations-in-the-World/xy4y-c4mk).

In [1]:
import numpy as np
import pandas as pd
import requests
import os

In [2]:
# GET A CSV OF ALL STARBUCKS LOCATIONS

# If this link is ever broken, use the link above to get a new one

fname = 'All_Starbucks_Locations_in_the_World.csv'
if not os.path.isfile(fname):
    print('Getting file from Socrata portal')
    r = requests.get('https://opendata.socrata.com/api/views/xy4y-c4mk/rows.csv?accessType=DOWNLOAD')
    # Write the response text as UTF-8 bytes; `with` closes the file for us
    with open(fname, 'wb') as f:
        f.write(r.text.encode('utf-8'))
df = pd.read_csv(fname)

Getting file from Socrata portal
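
The download step above can be wrapped in a small reusable helper. This is a minimal sketch, not part of the original workshop: the function name `download_if_missing` and the `timeout` default are assumptions made for illustration. It skips the request when the file is already on disk and raises an error on a failed HTTP response instead of silently saving an error page.

```python
import os

import requests


def download_if_missing(url, fname, timeout=30):
    """Download `url` to `fname` unless `fname` already exists; return `fname`.

    Hypothetical helper for illustration only.
    """
    if os.path.isfile(fname):
        # Cached copy found: no network access needed
        return fname
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    with open(fname, 'wb') as f:
        f.write(response.content)
    return fname
```

A call such as `download_if_missing(url, 'All_Starbucks_Locations_in_the_World.csv')` followed by `pd.read_csv(...)` would then mirror the cell above.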


In [3]:
# LET'S GET SOME SUMMARY STATISTICS BY COUNTRY

by_country = pd.DataFrame(df.groupby(['Country'])['Store ID'].count())
by_country.sort_values('Store ID', ascending=False, inplace=True)
by_country.columns = ['count']
by_country['percentage'] = by_country['count'] / by_country['count'].sum()
by_country.head()

Country,count,percentage
US,12171,0.566304
CN,1715,0.079797
CA,1332,0.061977
JP,1081,0.050298
GB,806,0.037502
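
The same count-and-share summary can also be built with `value_counts`, which returns counts already sorted in descending order. The sketch below uses a tiny made-up frame (`stores`) rather than the real Starbucks data, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for the Starbucks data (values are made up)
stores = pd.DataFrame({
    'Store ID': [1, 2, 3, 4, 5],
    'Country': ['US', 'US', 'CN', 'US', 'CA'],
})

# value_counts() counts rows per country, largest first
summary = stores['Country'].value_counts().to_frame('count')

# Share of all stores found in each country
summary['percentage'] = summary['count'] / summary['count'].sum()

print(summary.head())
```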


In [4]:
# DRILL DOWN BY STATES

us_filter = df['Country'] == 'US'  # avoid shadowing the built-in filter()
usa = pd.DataFrame(df[us_filter])
by_state = pd.DataFrame(usa.groupby(['Country Subdivision'])['Store ID'].count())
by_state.sort_values('Store ID', ascending=False, inplace=True)
by_state.columns = ['count']
by_state['percentage'] = by_state['count'] / by_state['count'].sum()
by_state.head()

Country Subdivision,count,percentage
CA,2631,0.21617
TX,916,0.075261
WA,714,0.058664
FL,624,0.051269
NY,573,0.047079


## Copy is a gotcha

In [5]:
# FOCUS ON LOS ANGELES

cfilter = df['Country'] == 'US'
sfilter = df['Country Subdivision'] == 'CA'
lafilter = df['City'] == 'Los Angeles'
la_filter = cfilter & sfilter & lafilter  # avoid shadowing the built-in filter()
la = df[la_filter].copy()
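
The gotcha in the heading presumably refers to the `.copy()` call. Without it, pandas cannot always tell whether a filtered frame is a view or a copy of the original, and a later assignment may trigger a `SettingWithCopyWarning` (the exact behavior depends on the pandas version). The sketch below, on a made-up frame named `toy`, shows that an explicit copy can be modified without touching the source frame.

```python
import pandas as pd

# Toy stand-in for the full data set (values are made up)
toy = pd.DataFrame({
    'City': ['Los Angeles', 'Seattle', 'Los Angeles'],
    'Store ID': [1, 2, 3],
})

# Explicit copy: la_toy owns its data, independent of toy
la_toy = toy[toy['City'] == 'Los Angeles'].copy()

# Modifying the copy leaves the original frame unchanged
la_toy['Store ID'] = la_toy['Store ID'] * 10

print(toy['Store ID'].tolist())
print(la_toy['Store ID'].tolist())
```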

In [6]:
# HOW MANY ROWS AND COLUMNS?

la.shape

(110, 21)

In [7]:
# CAN YOU FIND YOUR FAVORITE?

la[['Street 1', 'Street 2']]

Unnamed: 0,Street 1,Street 2
4765,12313 Jefferson Blvd,
4914,3242 West Cahuenga Blvd.,
5146,5353 Wilshire Blvd.,
5485,4177 W. Washington Blvd.,
5517,4430 York Blvd.,
5697,3535 S La Cienega Blvd,
5720,3722 Crenshaw Blvd.,
6352,5020 Wilshire Blvd.,
6473,1437 E. Gage Avenue,
6478,8817 South Sepulveda Blvd.,


## A few Pandas features used in this workshop

In [8]:
co_series = la['Ownership Type']=='CO'
co_series.head()

4765    True
4914    True
5146    True
5485    True
5517    True
Name: Ownership Type, dtype: bool

In [9]:
~co_series.head()

4765    False
4914    False
5146    False
5485    False
5517    False
Name: Ownership Type, dtype: bool
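
Note that in `~co_series.head()` the method call binds tighter than `~`, so the expression negates the first five values only. Boolean Series like this are mostly useful as row masks. The sketch below uses a made-up frame (`stores`) to show counting the `True` values and selecting rows with a mask and with its negation.

```python
import pandas as pd

# Toy stand-in for the LA stores (values are made up)
stores = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Ownership Type': ['CO', 'LS', 'CO', 'CO'],
})

# Element-wise comparison yields a boolean Series
is_co = stores['Ownership Type'] == 'CO'

# True counts as 1, so sum() gives the number of matching rows
n_co = int(is_co.sum())

# Use the mask, or its negation, to select rows
co_rows = stores[is_co]
other_rows = stores[~is_co]

print(n_co, co_rows['Name'].tolist(), other_rows['Name'].tolist())
```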

In [10]:
co_series.tolist()

[True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 False,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 False,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 True,
 True,
 False,
 True,
 True,
 False,
 False,
 False,
 True,
 True,
 False]

In [11]:
la.sort_values('Postal Code', inplace=True)
la.head()

Unnamed: 0,Store ID,Name,Brand,Store Number,Phone Number,Ownership Type,Street Combined,Street 1,Street 2,Street 3,...,Country Subdivision,Country,Postal Code,Coordinates,Latitude,Longitude,Timezone,Current Timezone Offset,Olson Timezone,First Seen
19442,1006808,Central & Slauson,Starbucks,21800-197225,323-521-1535,CO,5857 S. Central Ave.,5857 S. Central Ave.,,,...,CA,US,90001,"(33.9883270263672, -118.257118225098)",33.988327,-118.257118,Pacific Standard Time,-420,GMT-08:00 America/Los_Angeles,05/03/2014 04:00:00 AM
6473,8822,"Gage & Compton, Huntington Park",Starbucks,8819-94823,323-585-1928,CO,1437 E. Gage Avenue,1437 E. Gage Avenue,,,...,CA,US,900011789,"(33.9823722839355, -118.249458312988)",33.982372,-118.249458,Pacific Standard Time,-420,GMT-08:00 America/Los_Angeles,12/08/2013 10:41:59 PM
6642,9035,Hancock Park,Starbucks,507-450,323-469-1081,CO,206 North Larchmont,206 North Larchmont,,,...,CA,US,900043707,"(34.0749092102051, -118.323463439941)",34.074909,-118.323463,Pacific Standard Time,-420,GMT-08:00 America/Los_Angeles,12/08/2013 10:41:59 PM
16602,90142,Trojan Grounds @ USC,Starbucks,15244-113064,213-740-6285,LS,642 W 34th St,642 W 34th St,,,...,CA,US,90007,"(34.0212821960449, -118.282356262207)",34.021282,-118.282356,Pacific Standard Time,-420,GMT-08:00 America/Los_Angeles,12/08/2013 10:41:59 PM
16758,91220,Figueroa & Exposition,Starbucks,17413-168644,213-749-9302,CO,"3584 S. Figueroa St., #1B",3584 S. Figueroa St.,#1B,,...,CA,US,90007,"(34.0185890197754, -118.281768798828)",34.018589,-118.281769,Pacific Standard Time,-420,GMT-08:00 America/Los_Angeles,12/08/2013 10:41:59 PM
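
The sorted output shows that `Postal Code` mixes 5-digit and 9-digit values (for example `90001` and `900011789`). If you want to group stores by 5-digit ZIP code, one option is to derive a truncated column first. This is a sketch on made-up rows, and the column name `zip5` is an assumption introduced here.

```python
import pandas as pd

# Toy rows mixing 5-digit and 9-digit postal codes, as in the output above
stores = pd.DataFrame({
    'Name': ['w', 'x', 'y', 'z'],
    'Postal Code': ['90007', '900011789', '90001', '900043707'],
})

# Keep only the first five characters of each code
stores['zip5'] = stores['Postal Code'].astype(str).str[:5]

# Count stores per 5-digit ZIP code
per_zip = stores['zip5'].value_counts()

print(per_zip)
```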


### Indexes

In [12]:
la.index

Int64Index([19442, 6473, 6642, 16602, 16758, 16765, 19714, 11813, 19407, 14979, 15795, 7017, 8542, 12770, 16769, 13464, 14156, 8851, 7447, 12917, 16335, 20202, 14168, 5697, 5720, 8484, 11028, 6520, 16913, 5485, 6587, 13554, 6517, 19819, 9994, 16476, 6771, 19736, 8373, 12240, 11405, 8889, 14980, 6891, 8675, 17044, 8158, 14598, 6806, 9960, 6924, 19423, 12398, 20346, 20924, 9980, 6483, 14813, 5146, 6352, 11119, 12276, 11353, 7609, 9197, 11363, 15669, 13453, 5517, 20095, 18544, 18887, 21120, 8413, 9993, 6478, 7276, 9505, 9869, 9870, 9878, 10783, 16182, 18705, 14097, 20344, 9048, 8617, 10795, 9968, 7116, 17420, 11182, 11518, 7368, 18064, 9976, 9120, 8784, 11193, ...], dtype='int64')

In [13]:
la.index = np.arange(la.shape[0])
la.index

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
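
An alternative to assigning `np.arange(...)` by hand is `reset_index(drop=True)`, which replaces the index with a fresh 0-based one and discards the old labels. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Toy frame with leftover row labels from an earlier filter and sort
frame = pd.DataFrame(
    {'Postal Code': ['90001', '90004', '90007']},
    index=[19442, 6473, 6642],
)

# drop=True discards the old labels instead of keeping them as a column
frame = frame.reset_index(drop=True)

print(frame.index.tolist())
```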

### Column renaming and dropping

In [14]:
la.head()

Unnamed: 0,Store ID,Name,Brand,Store Number,Phone Number,Ownership Type,Street Combined,Street 1,Street 2,Street 3,...,Country Subdivision,Country,Postal Code,Coordinates,Latitude,Longitude,Timezone,Current Timezone Offset,Olson Timezone,First Seen
0,1006808,Central & Slauson,Starbucks,21800-197225,323-521-1535,CO,5857 S. Central Ave.,5857 S. Central Ave.,,,...,CA,US,90001,"(33.9883270263672, -118.257118225098)",33.988327,-118.257118,Pacific Standard Time,-420,GMT-08:00 America/Los_Angeles,05/03/2014 04:00:00 AM
1,8822,"Gage & Compton, Huntington Park",Starbucks,8819-94823,323-585-1928,CO,1437 E. Gage Avenue,1437 E. Gage Avenue,,,...,CA,US,900011789,"(33.9823722839355, -118.249458312988)",33.982372,-118.249458,Pacific Standard Time,-420,GMT-08:00 America/Los_Angeles,12/08/2013 10:41:59 PM
2,9035,Hancock Park,Starbucks,507-450,323-469-1081,CO,206 North Larchmont,206 North Larchmont,,,...,CA,US,900043707,"(34.0749092102051, -118.323463439941)",34.074909,-118.323463,Pacific Standard Time,-420,GMT-08:00 America/Los_Angeles,12/08/2013 10:41:59 PM
3,90142,Trojan Grounds @ USC,Starbucks,15244-113064,213-740-6285,LS,642 W 34th St,642 W 34th St,,,...,CA,US,90007,"(34.0212821960449, -118.282356262207)",34.021282,-118.282356,Pacific Standard Time,-420,GMT-08:00 America/Los_Angeles,12/08/2013 10:41:59 PM
4,91220,Figueroa & Exposition,Starbucks,17413-168644,213-749-9302,CO,"3584 S. Figueroa St., #1B",3584 S. Figueroa St.,#1B,,...,CA,US,90007,"(34.0185890197754, -118.281768798828)",34.018589,-118.281769,Pacific Standard Time,-420,GMT-08:00 America/Los_Angeles,12/08/2013 10:41:59 PM


In [15]:
la.drop('Brand', axis=1, inplace=True)

cols = la.columns.tolist()
cols[0] = 'store_id'
la.columns = cols

la.head()

Unnamed: 0,store_id,Name,Store Number,Phone Number,Ownership Type,Street Combined,Street 1,Street 2,Street 3,City,Country Subdivision,Country,Postal Code,Coordinates,Latitude,Longitude,Timezone,Current Timezone Offset,Olson Timezone,First Seen
0,1006808,Central & Slauson,21800-197225,323-521-1535,CO,5857 S. Central Ave.,5857 S. Central Ave.,,,Los Angeles,CA,US,90001,"(33.9883270263672, -118.257118225098)",33.988327,-118.257118,Pacific Standard Time,-420,GMT-08:00 America/Los_Angeles,05/03/2014 04:00:00 AM
1,8822,"Gage & Compton, Huntington Park",8819-94823,323-585-1928,CO,1437 E. Gage Avenue,1437 E. Gage Avenue,,,Los Angeles,CA,US,900011789,"(33.9823722839355, -118.249458312988)",33.982372,-118.249458,Pacific Standard Time,-420,GMT-08:00 America/Los_Angeles,12/08/2013 10:41:59 PM
2,9035,Hancock Park,507-450,323-469-1081,CO,206 North Larchmont,206 North Larchmont,,,Los Angeles,CA,US,900043707,"(34.0749092102051, -118.323463439941)",34.074909,-118.323463,Pacific Standard Time,-420,GMT-08:00 America/Los_Angeles,12/08/2013 10:41:59 PM
3,90142,Trojan Grounds @ USC,15244-113064,213-740-6285,LS,642 W 34th St,642 W 34th St,,,Los Angeles,CA,US,90007,"(34.0212821960449, -118.282356262207)",34.021282,-118.282356,Pacific Standard Time,-420,GMT-08:00 America/Los_Angeles,12/08/2013 10:41:59 PM
4,91220,Figueroa & Exposition,17413-168644,213-749-9302,CO,"3584 S. Figueroa St., #1B",3584 S. Figueroa St.,#1B,,Los Angeles,CA,US,90007,"(34.0185890197754, -118.281768798828)",34.018589,-118.281769,Pacific Standard Time,-420,GMT-08:00 America/Los_Angeles,12/08/2013 10:41:59 PM
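
Renaming by position (`cols[0] = ...`) breaks silently if the column order changes. A sketch of the same two steps using names instead, on a made-up two-row frame:

```python
import pandas as pd

# Toy frame with the columns touched in the cell above
frame = pd.DataFrame({
    'Store ID': [1006808, 9035],
    'Brand': ['Starbucks', 'Starbucks'],
    'Name': ['Central & Slauson', 'Hancock Park'],
})

# Drop and rename by column name; both calls return new frames
frame = (
    frame
    .drop(columns=['Brand'])
    .rename(columns={'Store ID': 'store_id'})
)

print(frame.columns.tolist())
```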


# Uses for Requests in Data Science
* Easy retrieval of data from the web

# Uses for Pandas in Data Science
* Too many to count!