# Lecture 11: GNIS & Baseball Examples

In [1]:
import numpy as np
import pandas as pd

---

# Scalar Functions and Query Plans

In [2]:
## we'll use the Lahman baseball database in our examples today.
## replace the database connection with a database of your own!
%reload_ext sql
%sql postgresql://localhost:5432/baseball
%config SqlMagic.displaylimit = 20

In [3]:
%%sql
WITH year_num AS
  (SELECT year_id, (year_id % 100) as year
     FROM batting
  )
SELECT year_id, CONCAT('''', LPAD(year::text, 2, '0')) as year
  FROM year_num
 LIMIT 5;

year_id,year
2004,'04
2007,'07
2009,'09
2010,'10
2012,'12
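
As a rough cross-check, the scalar expression in this query can be mirrored in plain Python. This is only a sketch: `format_year` is a hypothetical helper name, and `str.zfill` stands in for `LPAD(..., 2, '0')`, which behaves the same here because `year_id % 100` never has more than two digits.

```python
def format_year(year_id):
    """Mimic CONCAT('''', LPAD((year_id % 100)::text, 2, '0'))."""
    two_digit = year_id % 100             # scalar arithmetic on a single row
    return "'" + str(two_digit).zfill(2)  # left-pad with zeros, then prefix an apostrophe

# A scalar function is applied independently to each input row.
formatted = [format_year(y) for y in (2004, 2007, 2010)]
```

Because the function looks at one row at a time, it can be evaluated while rows stream through a scan, with no extra pass over the data.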


Let's analyze the query below (we've flattened it for convenience):

In [4]:
%%sql
EXPLAIN (VERBOSE true)
SELECT year_id,
       CONCAT('''', LPAD((year_id % 100)::text, 2, '0')) AS year
FROM batting;


QUERY PLAN
Seq Scan on public.batting (cost=0.00..3922.29 rows=104324 width=36)
"Output: year_id, concat('''', lpad(((year_id % 100))::text, 2, '0'::text))"


What if scalar functions mention multiple tables?

The query below computes an arbitrary statistic for pitchers:
* 1 point for every strikeout they throw as a pitcher
* -1 point for every time they themselves struck out as a batter

If the notebook-like output is hard to read, try out the query in `psql`. Note that notebooks don't preserve whitespace when displaying dataframes.
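
One way to picture how a scalar expression over two tables gets evaluated is a nested-loop join with a small cache, which is the strategy the plan output for this query reports (`Nested Loop` over a `Memoize` node). The sketch below is a loose illustration in plain Python and not PostgreSQL's actual implementation; the toy rows, `batting_index`, and `matching_batting_rows` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy rows; the real tables have many more columns and rows.
pitching = [
    {"player_id": "a", "so": 10},
    {"player_id": "a", "so": 7},
    {"player_id": "b", "so": 3},
]
batting = [
    {"player_id": "a", "so": 2},
    {"player_id": "b", "so": 1},
    {"player_id": "b", "so": 4},
]

# Stand-in for an index on batting: player_id -> matching rows.
batting_index = defaultdict(list)
for row in batting:
    batting_index[row["player_id"]].append(row)

cache = {}                     # plays the role of a Memoize node keyed by player_id
stats = {"index_lookups": 0}   # counts how often the "index scan" really runs

def matching_batting_rows(player_id):
    """Return batting rows for a player, probing the index only on a cache miss."""
    if player_id not in cache:
        stats["index_lookups"] += 1
        cache[player_id] = batting_index.get(player_id, [])
    return cache[player_id]

# The outer loop scans pitching; the scalar expression runs once per joined pair.
joined = [
    (p["player_id"], p["so"] - b["so"])
    for p in pitching
    for b in matching_batting_rows(p["player_id"])
]
```

Note that the join yields one output row per matching (pitching row, batting row) pair, so a player with several rows in each table contributes several differences.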

In [5]:
%%sql
EXPLAIN (VERBOSE true)
SELECT p.player_id, p.so - b.so
  FROM pitching p
  INNER JOIN batting b
  ON p.player_id=b.player_id;

QUERY PLAN
Nested Loop (cost=0.43..12961.23 rows=336307 width=13)
"Output: p.player_id, (p.so - b.so)"
-> Seq Scan on public.pitching p (cost=0.00..1374.06 rows=45806 width=13)
"Output: p.player_id, p.year_id, p.stint, p.team_id, p.lg_id, p.w, p.l, p.g, p.gs, p.cg, p.sho, p.sv, p.ipouts, p.h, p.er, p.hr, p.bb, p.so, p.baopp, p.era, p.ibb, p.wp, p.hbp, p.bk, p.bfp, p.gf, p.r, p.sh, p.sf, p.gidp"
-> Memoize (cost=0.43..0.73 rows=7 width=13)
"Output: b.so, b.player_id"
Cache Key: p.player_id
Cache Mode: logical
-> Index Scan using batting_pkey on public.batting b (cost=0.42..0.72 rows=7 width=13)
"Output: b.so, b.player_id"


### Window Functions

In [6]:
%%sql
SELECT name_first, name_last, year_id, HR,
       rank() OVER (ORDER BY HR DESC),
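       -- ROWS 3 PRECEDING covers four rows: the current row plus the three before it
       avg(HR)    OVER (PARTITION BY b.player_id ORDER BY year_id ROWS 3 PRECEDING) as avg_3yr,
       -- despite its alias, this column looks back seven rows, not one
       lag(HR, 7) OVER (PARTITION BY b.player_id ORDER BY year_id) as previous,
       lag(HR, 2) OVER (PARTITION BY b.player_id ORDER BY year_id) as lag2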
FROM batting b, people p
WHERE p.player_id = b.player_id
   AND (name_last = 'Bonds' or name_last = 'Ruth')
ORDER BY HR DESC
LIMIT 10;

name_first,name_last,year_id,hr,rank,avg_3yr,previous,lag2
Barry,Bonds,2001,73,1,48.25,37.0,34
Babe,Ruth,1927,60,2,44.5,54.0,25
Babe,Ruth,1921,59,3,38.25,0.0,29
Babe,Ruth,1920,54,4,24.0,,11
Babe,Ruth,1928,54,4,46.5,59.0,47
Barry,Bonds,2000,49,6,40.0,46.0,37
Babe,Ruth,1930,49,6,52.25,41.0,54
Babe,Ruth,1926,47,8,39.75,29.0,46
Barry,Bonds,1993,46,9,34.5,16.0,25
Barry,Bonds,2002,46,9,50.5,33.0,49
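
To make the window semantics concrete, here is a small pure-Python sketch of `lag`, a `ROWS n PRECEDING` average, and `rank` with ties. The helper names are hypothetical, and the four home-run values are a toy partition chosen to be consistent with the 2001 Bonds row above (a `lag2` of 34 and a windowed average of 48.25).

```python
def lag(values, offset):
    """lag(x, offset): the value `offset` rows earlier, or None if there is none."""
    return [values[i - offset] if i >= offset else None for i in range(len(values))]

def avg_rows_preceding(values, n):
    """avg(x) OVER (... ROWS n PRECEDING): mean of the current row and up to n prior rows."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - n): i + 1]
        out.append(sum(window) / len(window))
    return out

def rank_desc(values):
    """rank() OVER (ORDER BY x DESC): ties share a rank and leave a gap after them."""
    return [1 + sum(other > v for other in values) for v in values]

# One partition, already ordered by year.
hr_by_year = [37, 34, 49, 73]
lag2 = lag(hr_by_year, 2)
moving_avg = avg_rows_preceding(hr_by_year, 3)
ranks = rank_desc([73, 60, 59, 54, 54, 49])
```

The last window contains all four rows, which is why `ROWS 3 PRECEDING` behaves as a four-row average once enough history exists.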


### Inverse Distribution Window Functions

In [7]:
%%sql
SELECT MIN(HR),
       percentile_cont(0.25) WITHIN GROUP (ORDER BY HR) AS p25,
       percentile_cont(0.50) WITHIN GROUP (ORDER BY HR) AS median,
       percentile_cont(0.75) WITHIN GROUP (ORDER BY HR) AS p75,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY HR) AS p99,
       MAX(HR),
       AVG(HR) AS "average home runs"
FROM batting;

min,p25,median,p75,p99,max,average home runs
0,0.0,0.0,2.0,31.0,73,2.831582377976305
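
For intuition, `percentile_cont` can be sketched as linear interpolation between the two nearest ordered values. The function below is a minimal pure-Python approximation of that idea (the name simply mirrors the SQL function, and the input list is hypothetical); it ignores NULL handling.

```python
import math

def percentile_cont(values, fraction):
    """Continuous percentile: interpolate linearly between neighboring ordered values."""
    ordered = sorted(values)
    pos = fraction * (len(ordered) - 1)   # fractional row position, 0-based
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

toy_hr = [0, 0, 2, 31]                    # hypothetical, heavily skewed like HR
median = percentile_cont(toy_hr, 0.50)    # halfway between 0 and 2
p25 = percentile_cont(toy_hr, 0.25)
top = percentile_cont(toy_hr, 1.0)
```

With a skewed column such as `HR`, the median and lower quartile sit at or near zero while the mean is pulled upward by a few large values, which matches the shape of the output above.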


In [8]:
%%sql
SELECT HR, COUNT(*) FROM batting GROUP BY HR ORDER BY HR DESC;

hr,count
73,1
70,1
66,1
65,1
64,1
63,1
61,1
60,1
59,2
58,3


### Hypothetical-Set Window Functions

In [9]:
hrs = 4 # hypothetically, four home runs

In [10]:
%%sql
SELECT {{hrs}} as hypothetical,
       rank({{hrs}}) WITHIN GROUP (ORDER BY HR DESC),
       dense_rank({{hrs}}) WITHIN GROUP (ORDER BY HR DESC),
       percent_rank({{hrs}}) WITHIN GROUP (ORDER BY HR DESC) * 100 AS pct_rank,
       cume_dist({{hrs}}) WITHIN GROUP (ORDER BY HR)
FROM batting
LIMIT 10;

hypothetical,rank,dense_rank,pct_rank,cume_dist
4,18420,63,17.655573022506807,0.823445962137551


Without `jupysql` variable substitution:

In [11]:
%%sql
SELECT 4 as hypothetical,
       rank(4) WITHIN GROUP (ORDER BY HR DESC),
       dense_rank(4) WITHIN GROUP (ORDER BY HR DESC),
       percent_rank(4) WITHIN GROUP (ORDER BY HR DESC) * 100 AS pct_rank,
       cume_dist(4) WITHIN GROUP (ORDER BY HR)
FROM batting
LIMIT 10;

hypothetical,rank,dense_rank,pct_rank,cume_dist
4,18420,63,17.655573022506807,0.823445962137551
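
The hypothetical-set functions answer "where would this value land if it were inserted?". The sketch below encodes my reading of their definitions in plain Python; the helper name and the toy data are hypothetical, and the formulas should be checked against the PostgreSQL documentation before relying on them.

```python
def hypothetical_set(values, x, descending=False):
    """Rank statistics for a value x as if it were added to `values`."""
    def sorts_before(v):
        return v > x if descending else v < x

    n = len(values)
    before = sum(1 for v in values if sorts_before(v))
    peers = sum(1 for v in values if v == x)
    distinct_before = len({v for v in values if sorts_before(v)})
    return {
        "rank": before + 1,                        # gaps after ties
        "dense_rank": distinct_before + 1,         # no gaps
        "percent_rank": before / n,                # (rank - 1) / existing rows
        "cume_dist": (before + peers + 1) / (n + 1),  # counts the hypothetical row too
    }

toy_hr = [0, 0, 1, 4, 4, 10, 73]                  # hypothetical data
desc = hypothetical_set(toy_hr, 4, descending=True)
asc = hypothetical_set(toy_hr, 4)
```

As in the query, the direction of `ORDER BY` matters: ranking descending counts the larger values ahead of the hypothetical one, while `cume_dist` ascending counts the values at or below it.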


# GNIS

This notebook transforms the existing [Geographic Names Information System (GNIS)](https://www.usgs.gov/core-science-systems/ngp/board-on-geographic-names/download-gnis-data) national zip file.

We have provided a subset of the SQL database for you in `data/gnis.sql`.

If you'd like to make your own version of this database, see the end of this notebook. Note: Because of its size, we don't recommend building the GNIS SQL database from scratch on DataHub.


In [12]:
!psql -h localhost -d gnis -c 'SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE datname = current_database() AND pid <> pg_backend_pid();'
!psql -h localhost -c 'DROP DATABASE IF EXISTS gnis'
!psql -h localhost -c 'CREATE DATABASE gnis' 
!psql -h localhost -d gnis -f data/gnis.sql

 pg_terminate_backend 
----------------------
 t
(1 row)

DROP DATABASE
CREATE DATABASE
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 3195
COPY 11533
CREATE INDEX


In [13]:
%reload_ext sql
%sql postgresql://localhost:5432/gnis
%config SqlMagic.displaylimit = 15

* View schema in `psql`
* View some rows below

In [14]:
%sql SELECT COUNT(*) FROM national;

count
11533


In [15]:
%sql SELECT * FROM national WHERE county_name = 'Alameda';

feature_id,feature_name,feature_class,state_alpha,state_numeric,county_name,county_numeric,primary_lat_dms,prim_long_dms,prim_lat_dec,prim_long_dec,source_lat_dms,source_long_dms,source_lat_dec,source_long_dec,elev_in_m,elev_in_ft,map_name,date_created,date_edited
218316,Apperson Creek,Stream,CA,6,Alameda,1.0,373349N,1215000W,37.5635453,-121.8332887,373232N,1214804W,37.5422222,-121.8011111,148.0,486.0,La Costa Valley,01/19/1981,
225998,Irvington High School,School,CA,6,Alameda,1.0,373126N,1215801W,37.523814,-121.9670659,,,,,13.0,43.0,Niles,01/19/1981,03/31/2021
226951,Laurel Elementary School,School,CA,6,Alameda,1.0,374734N,1221147W,37.792899,-122.1964288,,,,,68.0,223.0,Oakland East,06/14/2000,03/14/2021
229367,Murray Elementary School,School,CA,6,Alameda,1.0,374318N,1215557W,37.721801,-121.9326269,,,,,112.0,367.0,Dublin,01/19/1981,03/14/2021
235581,Strawberry Creek,Stream,CA,6,Alameda,1.0,375221N,1221443W,37.8724258,-122.2452464,375251N,1221354W,37.8807588,-122.2316349,154.0,505.0,Oakland East,01/19/1981,08/31/2016
1654274,Hayward Golf Course,Locale,CA,6,Alameda,1.0,373726N,1220250W,37.6238222,-122.0471843,,,,,5.0,16.0,Newark,01/19/1981,
1664964,KOFY-AM (San Mateo),Tower,CA,6,Alameda,1.0,374934N,1221842W,37.8260385,-122.3116366,,,,,2.0,7.0,Oakland West,07/01/1994,
1670278,Lake Elizabeth,Lake,CA,6,Alameda,1.0,373255N,1215742W,37.5487056,-121.9617554,,,,,16.0,52.0,Niles,11/09/1995,03/07/2019
1692819,California School for the Deaf - Fremont,School,CA,6,Alameda,1.0,373334N,1215747W,37.5593966,-121.9631843,,,,,20.0,66.0,Niles,05/08/1996,09/16/2016
1692863,J A Freitas Library,Building,CA,6,Alameda,1.0,374335N,1220925W,37.7263185,-122.1569101,,,,,19.0,62.0,San Leandro,05/08/1996,


In [16]:
%%sql
SELECT *
FROM national TABLESAMPLE BERNOULLI(10);

feature_id,feature_name,feature_class,state_alpha,state_numeric,county_name,county_numeric,primary_lat_dms,prim_long_dms,prim_lat_dec,prim_long_dec,source_lat_dms,source_long_dms,source_lat_dec,source_long_dec,elev_in_m,elev_in_ft,map_name,date_created,date_edited
1230,Belmont Mountains,Range,AZ,4,Maricopa,13.0,333832N,1125404W,33.642258,-112.9010129,,,,,931.0,3054.0,Belmont Mountain,02/08/1980,
2336,Cabeza Prieta Game Range,Park,AZ,4,Yuma,27.0,321500N,1132703W,32.250056,-113.45074,,,,,275.0,902.0,Bryan Mountains,02/08/1980,
2750,Chandler Springs,Spring,AZ,4,Navajo,17.0,352236N,1102831W,35.3766788,-110.4754096,,,,,1685.0,5528.0,Shonto Butte,02/08/1980,
3342,Cottonwood Creek,Stream,AZ,4,Mohave,15.0,365407N,1123348W,36.901931,-112.5632547,370545N,1123733W,37.095818,-112.6257621,1389.0,4557.0,Fredonia,02/08/1980,
7204,Little Tank,Reservoir,AZ,4,Mohave,15.0,343754N,1134557W,34.631793,-113.7658874,,,,,1043.0,3422.0,Beecher Canyon,02/08/1980,03/25/2019
7476,Lower Jumbo Mine,Mine,AZ,4,Yavapai,25.0,335728N,1123413W,33.9578095,-112.5701738,,,,,881.0,2890.0,Red Picacho,02/08/1980,
8960,Old Quijotoa Well,Well,AZ,4,Pima,19.0,320259N,1120352W,32.0497945,-112.064574,,,,,626.0,2054.0,Vainom Kug,02/08/1980,03/27/2018
10197,Reese Ranch,Locale,AZ,4,Gila,7.0,330510N,1104244W,33.0861745,-110.7123268,,,,,633.0,2077.0,Christmas,02/08/1980,
13821,Willow Beach,Locale,AZ,4,Mohave,15.0,355213N,1143934W,35.8702623,-114.6594215,,,,,205.0,673.0,Willow Beach,02/08/1980,
24724,Bethany Villa Adult Mobile Home Park,Populated Place,AZ,4,Maricopa,13.0,333134N,1120951W,33.5261527,-112.1640412,,,,,352.0,1155.0,Glendale,06/27/1984,


# Numerical Granularity

In [17]:
%sql SELECT elev_in_m FROM National LIMIT 2;

elev_in_m
931.0
2707.0


In [18]:
%%sql
SELECT elev_in_m, 
    (elev_in_m / 100)::INTEGER AS quantized,
    ((elev_in_m / 100)::INTEGER) * 100 AS round_to_100,
    SUBSTRING(elev_in_m::TEXT, 1, 2),
    CONCAT(SUBSTRING(elev_in_m::TEXT, 1, 2), '00') AS substring2
FROM National
LIMIT 5;

elev_in_m,quantized,round_to_100,substring,substring2
931.0,9,900,93,9300
2707.0,27,2700,27,2700
275.0,3,300,27,2700
1685.0,17,1700,16,1600
1354.0,14,1400,13,1300
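
The two approaches in this query differ in an important way: dividing and casting rounds numerically, while taking a substring only looks at characters. A small Python sketch of both (hypothetical helper names; Python's `round` is used here as a stand-in for the cast to `INTEGER`) reproduces the pattern in the output above.

```python
def quantize(elev, step=100):
    """Numeric rounding to the nearest multiple of `step`."""
    return round(elev / step) * step

def substring_round(elev):
    """Keep the first two characters and append '00', ignoring magnitude."""
    return int(str(elev)[:2] + "00")

elevations = [931.0, 2707.0, 275.0, 1685.0]
numeric = [quantize(e) for e in elevations]
textual = [substring_round(e) for e in elevations]
```

The substring version is only right when the value happens to have four digits before the decimal point; for three-digit elevations such as 931.0 it inflates the result by a factor of ten.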


In [19]:
%config SqlMagic.named_parameters=True

In [20]:
right_shift = '>>'
left_shift = '<<'

In [21]:
%%sql
/* Since jupysql does not like bit shifts, we can fake them with string interpolation. */
SELECT elev_in_m,
    (16::INTEGER::BIT(12)) AS bit12,
    (16::INTEGER::BIT(12)) {{left_shift}} 3
FROM national
LIMIT 5;

elev_in_m,bit12,?column?
931.0,10000,10000000
2707.0,10000,10000000
275.0,10000,10000000
1685.0,10000,10000000
1354.0,10000,10000000
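
Shifting right and then left by the same amount clears the low-order bits, which rounds a value down to a multiple of a power of two. The sketch below shows that idea with Python integers; the helper names are hypothetical, and the `& 0xFFF` mask is my stand-in for the cast to `BIT(12)`, which keeps only the rightmost 12 bits.

```python
def to_bit12(n):
    """Keep the low 12 bits, as a stand-in for casting an integer to BIT(12)."""
    return n & 0xFFF

def round_down_to_256(n):
    """(bits >> 8) << 8 zeroes the lowest 8 bits: floor to a multiple of 256."""
    return (to_bit12(n) >> 8) << 8

rounded = {n: round_down_to_256(n) for n in (931, 2707, 275)}
shifted_text = format(to_bit12(16) << 3, "012b")   # full 12-bit rendering of 16 << 3
wrapped = round_down_to_256(4100)                  # does not fit in 12 bits
```

Two consequences follow: any result of this rounding is divisible by 256, and elevations of 4096 m or more would wrap around under a 12-bit representation. The notebook display also drops leading zeros, so `10000` in the output above corresponds to the 12-bit string `000000010000`.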


In [22]:
%%sql
EXPLAIN (verbose true)
WITH shifts AS (
    SELECT elev_in_m,
       (elev_in_m::integer::bit(12)) AS bit12,
       (elev_in_m::integer::bit(12) {{right_shift}} 8) AS rightshifted,
       ((elev_in_m::integer::bit(12) {{right_shift}} 8) {{left_shift}} 8)::integer AS round_to_256,
       ((elev_in_m::integer::bit(12) {{right_shift}} 8) {{left_shift}} 8)::integer % 256 AS test
  FROM national
)
SELECT COUNT(DISTINCT elev_in_m) AS elevation_meters_count,
       COUNT(DISTINCT bit12) AS bit12_count,
       COUNT(DISTINCT rightshifted) AS rightshift_count,
       COUNT(DISTINCT round_to_256) AS rounded_count
  FROM shifts;

QUERY PLAN
Aggregate (cost=508.61..508.62 rows=1 width=32)
"Output: count(DISTINCT ""national"".elev_in_m), count(DISTINCT ((""national"".elev_in_m)::integer)::bit(12)), count(DISTINCT (((""national"".elev_in_m)::integer)::bit(12) >> 8)), count(DISTINCT (((((""national"".elev_in_m)::integer)::bit(12) >> 8) << 8))::integer)"
"-> Seq Scan on public.""national"" (cost=0.00..331.58 rows=5058 width=8)"
"Output: ""national"".feature_id, ""national"".feature_name, ""national"".feature_class, ""national"".state_alpha, ""national"".state_numeric, ""national"".county_name, ""national"".county_numeric, ""national"".primary_lat_dms, ""national"".prim_long_dms, ""national"".prim_lat_dec, ""national"".prim_long_dec, ""national"".source_lat_dms, ""national"".source_long_dms, ""national"".source_lat_dec, ""national"".source_long_dec, ""national"".elev_in_m, ""national"".elev_in_ft, ""national"".map_name, ""national"".date_created, ""national"".date_edited"


# Demo 1: Roll-up / Drill-down Practice

Let's start with county-level data on elevations:

In [23]:
%%sql
SELECT state_numeric, county_numeric,
       avg(elev_in_m),
       stddev(elev_in_m), count(*)
FROM national TABLESAMPLE BERNOULLI(10)
GROUP BY state_numeric, county_numeric;

state_numeric,county_numeric,avg,stddev,count
34,39.0,5.0,,1
53,73.0,555.0,777.8174593052023,2
55,135.0,269.0,,1
49,49.0,1688.6666666666667,83.6082133126485,3
37,139.0,2.0,1.0,3
36,65.0,227.75,160.5498364994496,4
28,71.0,109.0,,1
45,45.0,292.0,,1
30,105.0,701.5,86.97413408594535,2
42,107.0,275.0,,1


**Roll up** to state level.
* We save the view as `state_elevations` for later...

In [24]:
%%sql
DROP VIEW IF EXISTS state_elevations;

CREATE VIEW state_elevations AS (
    SELECT state_numeric,
       avg(elev_in_m),
       stddev(elev_in_m), count(*)
    FROM national
    GROUP BY state_numeric
);

In [25]:
%sql SELECT * FROM state_elevations;

state_numeric,avg,stddev,count
68,6.666666666666667,7.99166232186187,14
51,254.55197132616487,260.54513270095333,283
70,18.33333333333333,31.75426480542942,3
69,90.66666666666669,82.82189726555814,6
60,63.5,92.18188542224551,6
22,21.64179104477612,24.352228743507624,208
44,43.67346938775511,53.659958595423696,52
11,46.72,39.98803987861037,25
42,295.9726775956284,174.72171005278577,368
40,323.8679245283019,129.1529420712713,187


**Drill down** to include feature class.

In [26]:
%%sql
SELECT state_numeric, feature_class,
       avg(elev_in_m),
       stddev(elev_in_m), count(*)
FROM national TABLESAMPLE Bernoulli(10)
GROUP BY state_numeric, feature_class
ORDER BY count(*) DESC;

state_numeric,feature_class,avg,stddev,count
36,School,212.8181818181818,207.28474048121257,11
6,Stream,570.1,504.61722567154254,10
48,Populated Place,166.6,209.03864608142567,10
37,Stream,488.44444444444446,351.34531130751947,9
41,Stream,468.77777777777777,441.4350965254626,9
6,Building,352.75,422.09773749689776,8
30,Well,1134.125,314.6673697278263,8
13,Church,173.0,172.09050774851838,8
47,Church,185.25,96.76591784891444,8
55,School,300.2857142857143,99.51333961788622,7


# Demo 2: Connections to Statistics

## Roll up with marginal distributions

In [27]:
%%sql
SELECT state_numeric,
       AVG(elev_in_m),
       STDDEV(elev_in_m), COUNT(*),
       SUM(COUNT(*)) OVER () AS total,
       COUNT(*)/SUM(COUNT(*)) OVER () AS marginal
FROM national TABLESAMPLE Bernoulli(.07)
GROUP BY state_numeric;

state_numeric,avg,stddev,count,total,marginal
1,30.0,,1,12,0.0833333333333333
6,283.0,,1,12,0.0833333333333333
24,755.0,,1,12,0.0833333333333333
25,66.0,,1,12,0.0833333333333333
28,66.0,,1,12,0.0833333333333333
29,384.0,,1,12,0.0833333333333333
37,329.0,352.35351566289216,3,12,0.25
45,283.0,,1,12,0.0833333333333333
49,3025.0,,1,12,0.0833333333333333
51,2.0,,1,12,0.0833333333333333
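
The `marginal` column is each group's count divided by the grand total, so the values form an empirical distribution over states. A minimal Python sketch with hypothetical sampled state codes:

```python
from collections import Counter

# Hypothetical state_numeric values from a small sample.
sampled_states = [1, 6, 24, 37, 37, 37]

counts = Counter(sampled_states)            # COUNT(*) ... GROUP BY state_numeric
total = sum(counts.values())                # SUM(COUNT(*)) OVER ()
marginal = {state: n / total for state, n in counts.items()}
```

In the SQL version the window function runs after grouping, which is why `SUM(COUNT(*)) OVER ()` can see every group's count at once.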


In [28]:
%%sql
SELECT COUNT(DISTINCT county_numeric) FROM national;

count
291


## Drill down with normally-distributed elevations:

Start with the `state_elevations` view from earlier:

In [29]:
%sql SELECT * FROM state_elevations;

state_numeric,avg,stddev,count
68,6.666666666666667,7.99166232186187,14
51,254.55197132616487,260.54513270095333,283
70,18.33333333333333,31.75426480542942,3
69,90.66666666666669,82.82189726555814,6
60,63.5,92.18188542224551,6
22,21.64179104477612,24.352228743507624,208
44,43.67346938775511,53.659958595423696,52
11,46.72,39.98803987861037,25
42,295.9726775956284,174.72171005278577,368
40,323.8679245283019,129.1529420712713,187


The `fips_counties` relation has all counties, including those not in `national`:

In [30]:
%sql SELECT * FROM fips_counties LIMIT 10;

fips,county,state_numeric
1000,Alabama,1
1001,Autauga County,1
1003,Baldwin County,1
1005,Barbour County,1
1007,Bibb County,1
1009,Blount County,1
1011,Bullock County,1
1013,Butler County,1
1015,Calhoun County,1
1017,Chambers County,1


If we wanted to **drill down** to the FIPS counties, we'd need to simulate an elevation for those counties that don't exist in `national`.

Here's the first step in that process, which creates a simulated value for *every* county in `fips_counties`.
* The value is simulated from a normal distribution using that state's elevation statistics (average, standard deviation).
* Much like importing a Python package, we need to load the `tablefunc` extension before we can use its `normal_rand` function.

In [31]:
%sql CREATE EXTENSION IF NOT EXISTS tablefunc;

In [32]:
%%sql
WITH state_cty AS
(SELECT s.state_numeric, f.fips as county_numeric, s.avg, s.stddev, s.count
  FROM state_elevations s, fips_counties f
  WHERE s.state_numeric = f.state_numeric
)
SELECT s.*,
       n.n AS elev_in_m,
       true as elev_in_m_sim -- user-facing flag
  FROM state_cty s,
       LATERAL normal_rand(CAST(s.count AS INTEGER), s.avg, s.stddev) AS n
LIMIT 10;

state_numeric,county_numeric,avg,stddev,count,elev_in_m,elev_in_m_sim
1,1000,146.37888198757764,102.92185851771194,339,107.10747161200904,True
1,1000,146.37888198757764,102.92185851771194,339,45.58477352155776,True
1,1000,146.37888198757764,102.92185851771194,339,296.8410693353228,True
1,1000,146.37888198757764,102.92185851771194,339,111.63404608474836,True
1,1000,146.37888198757764,102.92185851771194,339,271.34961903540034,True
1,1000,146.37888198757764,102.92185851771194,339,108.5882589722316,True
1,1000,146.37888198757764,102.92185851771194,339,364.7120809327268,True
1,1000,146.37888198757764,102.92185851771194,339,306.94352902800904,True
1,1000,146.37888198757764,102.92185851771194,339,5.305131821588361,True
1,1000,146.37888198757764,102.92185851771194,339,79.41120236696895,True
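
The lateral call can be pictured as a loop: for each county row, draw values from a normal distribution parameterized by that state's statistics. The sketch below imitates this with Python's `random.gauss`; the local `normal_rand` helper, the rounded statistics, and the small `count` are hypothetical stand-ins chosen to keep the example short.

```python
import random

def normal_rand(rng, numvals, mean, stddev):
    """Return `numvals` draws from a normal distribution, one per output row."""
    return [rng.gauss(mean, stddev) for _ in range(numvals)]

# Hypothetical, rounded per-state statistics and a couple of counties.
state_stats = {1: {"avg": 146.4, "stddev": 102.9, "count": 3}}
counties = [(1000, 1), (1001, 1), (99999, 99)]   # (fips, state_numeric)

rng = random.Random(0)   # seeded so reruns give the same draws
simulated = []
for fips, state in counties:
    if state not in state_stats:     # inner join: skip counties with no state stats
        continue
    s = state_stats[state]
    for elev in normal_rand(rng, s["count"], s["avg"], s["stddev"]):
        simulated.append(
            {"county_numeric": fips, "elev_in_m": elev, "elev_in_m_sim": True}
        )
```

As in the query output, each county appears once per draw, so the number of simulated rows per county equals the first argument passed to `normal_rand`. Nothing here prevents negative elevations, which a real imputation step might want to handle.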


# Assembling an Explicit Hierarchy

In [33]:
## we'll use the Lahman baseball database in our initial examples today.
## replace the database connection with a database of your own!
%reload_ext sql
%sql postgresql://localhost:5432/baseball

Two relations have the pieces of the hierarchy we want:

In [34]:
%sql SELECT * FROM Appearances WHERE year_id > 1970 LIMIT 2;

year_id,team_id,lg_id,player_id,g_all,gs,g_batting,g_defense,g_p,g_c,g_1b,g_2b,g_3b,g_ss,g_lf,g_cf,g_rf,g_of,g_dh,g_ph,g_pr
1971,ATL,NL,aaronha01,139,129,139,129,0,0,71,0,0,0,0,0,60,60,0,10,0
1971,ATL,NL,aaronto01,25,10,25,18,0,0,11,0,7,0,0,0,0,0,0,8,0


In [35]:
%sql SELECT * FROM Teams LIMIT 1;

year_id,lg_id,team_id,franch_id,div_id,rank,g,ghome,w,l,divwin,wcwin,lgwin,wswin,r,ab,h,h2b,h3b,hr,bb,so,sb,cs,hbp,sf,ra,er,era,cg,sho,sv,ipouts,ha,hra,bba,soa,e,dp,fp,name,park,attendance,bpf,ppf,team_idbr,team_idlahman45,team_idretro
1871,,BS1,BNA,,3,31,,20,10,,,N,,401,1372,426,70,37,3,60,19,73,16,,,303,109,3.55,22,1,3,828,367,2,42,23,243,24,0.834,Boston Red Stockings,South End Grounds I,,103,98,BOS,BS1,BS1


Let's join these two to make our hierarchy! How should we join them?

In [36]:
%%sql
SELECT a.player_id, a.team_id, t.div_id, a.*
FROM Appearances a
NATURAL JOIN Teams t
WHERE a.year_id = 2015
LIMIT 100;

player_id,team_id,div_id,year_id,team_id_1,lg_id,player_id_1,g_all,gs,g_batting,g_defense,g_p,g_c,g_1b,g_2b,g_3b,g_ss,g_lf,g_cf,g_rf,g_of,g_dh,g_ph,g_pr
alvarda02,BAL,E,2015,BAL,AL,alvarda02,12,10,12,12,0,0,0,0,0,0,0,1,12,12,0,0,0
brachbr01,BAL,E,2015,BAL,AL,brachbr01,62,0,5,62,62,0,0,0,0,0,0,0,0,0,0,0,0
brittza01,BAL,E,2015,BAL,AL,brittza01,64,0,2,64,64,0,0,0,0,0,0,0,0,0,0,0,0
cabrace01,BAL,E,2015,BAL,AL,cabrace01,2,0,0,2,2,0,0,0,0,0,0,0,0,0,0,0,0
cabreev01,BAL,E,2015,BAL,AL,cabreev01,29,28,29,28,0,0,0,2,0,27,0,0,0,0,0,0,1
chenwe02,BAL,E,2015,BAL,AL,chenwe02,31,31,0,31,31,0,0,0,0,0,0,0,0,0,0,0,0
clevest01,BAL,E,2015,BAL,AL,clevest01,30,24,30,10,0,9,1,0,0,0,0,0,0,0,18,4,0
davisch02,BAL,E,2015,BAL,AL,davisch02,160,159,160,138,0,0,111,0,0,0,0,0,30,30,22,0,0
deazaal01,BAL,E,2015,BAL,AL,deazaal01,30,27,30,27,0,0,0,0,0,0,19,0,13,27,0,3,0
drakeol01,BAL,E,2015,BAL,AL,drakeol01,13,0,1,13,13,0,0,0,0,0,0,0,0,0,0,0,0


In [37]:
%%sql
CREATE OR REPLACE VIEW bball_tree AS (
    SELECT DISTINCT
        a.player_id, a.team_id, t.div_id,
        a.lg_id, a.year_id
    FROM appearances a
    NATURAL JOIN teams t
);

In [38]:
%sql SELECT * FROM bball_tree LIMIT 25;

player_id,team_id,div_id,lg_id,year_id
gumbeha01,NY1,,NL,1935
gradymi01,SLN,,NL,1897
deshoji01,WS1,,AL,1938
prattla01,BRF,,FL,1915
thompsa01,PHI,,NL,1890
hollica01,DET,,AL,1922
halege01,SLA,,AL,1916
mamaual01,NYA,,AL,1924
henryji01,BOS,,AL,1937
cristch01,PHI,,NL,1906


### Revisiting the Home Run Query

Recall our old home run query:

In [39]:
%%sql
SELECT name_first, name_last, year_id,
       MIN(hr), MAX(hr), AVG(hr), STDDEV(hr), SUM(hr)
FROM batting b, people p
WHERE b.player_id = p.player_id
GROUP BY name_last, name_first, year_id
ORDER BY max DESC
LIMIT 10;

name_first,name_last,year_id,min,max,avg,stddev,sum
Barry,Bonds,2001,73,73,73.0,,73
Mark,McGwire,1998,70,70,70.0,,70
Sammy,Sosa,1998,66,66,66.0,,66
Mark,McGwire,1999,65,65,65.0,,65
Sammy,Sosa,2001,64,64,64.0,,64
Sammy,Sosa,1999,63,63,63.0,,63
Roger,Maris,1961,61,61,61.0,,61
Babe,Ruth,1927,60,60,60.0,,60
Babe,Ruth,1921,59,59,59.0,,59
Giancarlo,Stanton,2017,59,59,59.0,,59


Set up for roll-up/drill-down on the `bball_tree` hierarchy.
* Join each (raw) batting row with the associated `bball_tree` entry by `(player_id, year_id)` in a CTE.
* Use this result for roll-up and drill-down.


In [40]:
%%sql
WITH batting_tree AS (
    SELECT b.*, t.div_id
    FROM batting b, bball_tree t
    WHERE b.player_id = t.player_id
      AND b.year_id = t.year_id
)
SELECT name_first, name_last,
       bt.team_id, bt.lg_id, bt.div_id, bt.year_id,
       MIN(hr), MAX(hr), AVG(hr), STDDEV(hr), SUM(hr)
FROM batting_tree bt, people p
WHERE bt.player_id = p.player_id
GROUP BY bt.player_id, bt.team_id, bt.lg_id, bt.div_id, bt.year_id, name_last, name_first
ORDER BY max DESC
LIMIT 10;


name_first,name_last,team_id,lg_id,div_id,year_id,min,max,avg,stddev,sum
Barry,Bonds,SFN,NL,W,2001,73,73,73.0,,73
Mark,McGwire,SLN,NL,C,1998,70,70,70.0,,70
Sammy,Sosa,CHN,NL,C,1998,66,66,66.0,,66
Mark,McGwire,SLN,NL,C,1999,65,65,65.0,,65
Sammy,Sosa,CHN,NL,C,2001,64,64,64.0,,64
Sammy,Sosa,CHN,NL,C,1999,63,63,63.0,,63
Roger,Maris,NYA,AL,,1961,61,61,61.0,,61
Babe,Ruth,NYA,AL,,1927,60,60,60.0,,60
Babe,Ruth,NYA,AL,,1921,59,59,59.0,,59
Giancarlo,Stanton,MIA,NL,E,2017,59,59,59.0,,59
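
With the hierarchy columns attached to each row, rolling up is a matter of grouping by a coarser level. The sketch below totals home runs at the division and league levels; the rows are a toy subset copied from the output above, and `total_hr` is a hypothetical helper.

```python
from collections import defaultdict

COLUMNS = ("player", "team_id", "lg_id", "div_id", "year_id", "hr")

# Toy subset of the output above.
rows = [
    ("Barry Bonds", "SFN", "NL", "W", 2001, 73),
    ("Sammy Sosa", "CHN", "NL", "C", 2001, 64),
    ("Mark McGwire", "SLN", "NL", "C", 1998, 70),
    ("Sammy Sosa", "CHN", "NL", "C", 1998, 66),
]

def total_hr(rows, level):
    """SUM(hr) grouped by the hierarchy columns named in `level`."""
    idx = [COLUMNS.index(c) for c in level]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

by_division = total_hr(rows, ("lg_id", "div_id", "year_id"))
by_league = total_hr(rows, ("lg_id", "year_id"))   # roll up one more level
```

Each level's totals can also be computed from the level below it, since sums compose along the hierarchy.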


# [Extra] Load in the database from scratch

We download the database, unzip it, load it into pandas, then export to a new database via `jupysql` cell magic.

**CAUTION**: This may crash your DataHub instance. The file is pretty big....

The direct zip download of this file is [here](https://geonames.usgs.gov/docs/stategaz/NationalFile.zip).

In [41]:
# first download and unzip the data
!mkdir -p data
!wget https://geonames.usgs.gov/docs/stategaz/NationalFile.zip -P data/
!unzip -u data/NationalFile.zip -d data/

--2024-10-08 22:34:48--  https://geonames.usgs.gov/docs/stategaz/NationalFile.zip
Resolving geonames.usgs.gov (geonames.usgs.gov)... 137.227.239.220, 2001:49c8:8000:121d::76
Connecting to geonames.usgs.gov (geonames.usgs.gov)|137.227.239.220|:443... connected.
HTTP request sent, awaiting response... 503 Service Unavailable
2024-10-08 22:34:48 ERROR 503: Service Unavailable.

unzip:  cannot find or open data/NationalFile.zip, data/NationalFile.zip.zip or data/NationalFile.zip.ZIP.


In [42]:
import os
fname = os.path.join("data", "NationalFile_20210825.txt")
fname

'data/NationalFile_20210825.txt'

In [43]:
!du -h {fname} # big file

du: cannot access 'data/NationalFile_20210825.txt': No such file or directory


In [44]:
!head -c 1024 {fname}

head: cannot open 'data/NationalFile_20210825.txt' for reading: No such file or directory


In [45]:
# next, load it into pandas
import pandas as pd

national = pd.read_csv("data/NationalFile_20210825.txt", delimiter="|")
national.head(2)

FileNotFoundError: [Errno 2] No such file or directory: 'data/NationalFile_20210825.txt'

In [None]:
national = national.rename(columns=dict([(col, col.lower().strip()) for col in national.columns]))
national.head(2)

Next, get a table sample in pandas.

In [None]:
import numpy as np

p = 0.005 # fraction, not percentage

np.random.seed(42)
national['keep_bool'] = np.random.random(len(national)) < p
national['keep_bool'].value_counts()

In [None]:
national = national[national['keep_bool']].drop(columns=['keep_bool'])
national
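
The `keep_bool` mask is a Bernoulli sample: every row is kept independently with the same probability, which is also the idea behind `TABLESAMPLE BERNOULLI`. A dependency-free sketch (hypothetical helper name) that takes a percentage, as the SQL clause does, rather than a fraction:

```python
import random

def bernoulli_sample(rows, percent, seed=None):
    """Keep each row independently with probability percent / 100."""
    rng = random.Random(seed)
    p = percent / 100.0
    return [row for row in rows if rng.random() < p]

rows = list(range(1000))
none_kept = bernoulli_sample(rows, 0)
all_kept = bernoulli_sample(rows, 100)
sample_a = bernoulli_sample(rows, 10, seed=42)
sample_b = bernoulli_sample(rows, 10, seed=42)   # same seed, same sample
```

The sample size is random rather than fixed, so repeated runs without a seed return different numbers of rows, just as the sampled queries earlier in the notebook do.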

Now, export to SQL

In [None]:
!psql -h localhost -d gnis -c 'SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE datname = current_database() AND pid <> pg_backend_pid();'
!psql -h localhost -c 'DROP DATABASE IF EXISTS gnis'
!psql -h localhost -c 'CREATE DATABASE gnis' 

In [None]:
%reload_ext sql
%sql postgresql://127.0.0.1:5432/gnis

In [None]:
%sql --persist-replace national

In [None]:
%sql ALTER TABLE national DROP COLUMN index;

Now, export to file with `pg_dump`.

In [None]:
!pg_dump -h localhost --encoding utf8 gnis -f data/gnis.sql 

Finally, run the beginning of this notebook again.

In [None]:
!du -h data/gnis.sql

## FIPS

Federal Information Processing Standards (FIPS) Codes for States and Counties

Manually download the file from this link (https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt) and save it in `data/`.
* `wget` does not work here, likely because the FCC website only accepts HTTPS connections to deter attacks on its servers.

In [None]:
!wget https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt -P data/

In [None]:
import pandas as pd
import re

In [None]:
with open('data/fips.txt', 'r') as f:
    lines = f.readlines()

In [None]:
COUNTY_STARTS = 69
OFFSET = 3 # the start of the data itself, after headers

In [None]:
re.match(r'\s+(\d+)\s+(\w+)', lines[COUNTY_STARTS + OFFSET]).groups()

In [None]:
splits = [re.match(r'\s+(\d+)\s+(.*)', line).groups()
          for line in 
          lines[COUNTY_STARTS+OFFSET:]]
splits[0]

**For later**: There is a significant discrepancy between the number of counties created and the number of lines remaining in our dataset. We encourage you to investigate this!

In [None]:
len(lines), len(splits)

> FIPS codes are numbers which uniquely identify geographic areas.  The number of 
digits in FIPS codes vary depending on the level of geography.  State-level FIPS
codes have two digits, county-level FIPS codes have five digits of which the 
first two are the FIPS code of the state to which the county belongs.  When 
using the list below to look up county FIPS codes, it is advisable to first look
up the FIPS code for the state to which the county belongs.  This will help you
identify the right section of the list while scrolling down, which can be
important since there are over 3000 counties and county-equivalents (e.g.
independent cities, parishes, boroughs) in the United States.

In [None]:
fips_counties = pd.DataFrame(data=splits, columns=['fips', 'county'])
fips_counties['state_numeric'] = fips_counties['fips'].str[:2].astype(int)
fips_counties['fips'] = fips_counties['fips'].astype(int)
fips_counties = fips_counties.set_index('fips')
fips_counties
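
The parsing steps above can be sketched without pandas as follows. The sample lines are an assumption about the whitespace layout of `fips.txt` (a padded code followed by a name), and `parse_fips` is a hypothetical helper that skips lines the pattern does not match instead of raising.

```python
import re

LINE_PATTERN = re.compile(r"\s+(\d+)\s+(.*)")

# Assumed layout: leading spaces, a zero-padded code, more spaces, then the name.
sample_lines = [
    "      01000        Alabama\n",
    "      01001        Autauga County\n",
    "\n",                                  # no code on this line, so it is skipped
]

def parse_fips(lines):
    """Extract (fips, county, state_numeric) records from fixed-layout text lines."""
    records = []
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match is None:
            continue
        code, name = match.groups()
        records.append(
            {
                "fips": int(code),
                "county": name.strip(),
                "state_numeric": int(code[:2]),   # first two digits identify the state
            }
        )
    return records

records = parse_fips(sample_lines)
```

The state code must be taken from the string before converting to an integer, because the conversion drops the leading zero that makes the first two characters meaningful.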

In [None]:
%reload_ext sql
%sql postgresql://127.0.0.1:5432/gnis

In [None]:
%sql --persist-replace fips_counties

Now, export to file with `pg_dump`. This exports both the `national` and `fips_counties` relations to the same `gnis.sql` database dump.

In [None]:
!pg_dump -h localhost --encoding utf8 gnis -f data/gnis.sql 

Finally, run the beginning of this notebook again.

In [None]:
!du -h data/gnis.sql