## Using Maths and Stats on Census Data

Population Distribution and Change: 2000 to 2010

- https://www.census.gov/prod/cen2010/briefs/c2010br-01.pdf

Data dictionary
- https://www.census.gov/prod/cen2010/doc/pl94-171.pdf



In [1]:
%load_ext sql

Connect to the `analysis` database created with pgAdmin

In [2]:
%sql postgresql://postgres:eric@localhost:5432/analysis

'Connected: postgres@analysis'

In [5]:
%sql select * from us_counties_2010 limit 1;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


geo_name,state_us_abbreviation,summary_level,region,division,state_fips,county_fips,area_land,area_water,population_count_100_percent,housing_unit_count_100_percent,internal_point_lat,internal_point_lon,p0010001,p0010002,p0010003,p0010004,p0010005,p0010006,p0010007,p0010008,p0010009,p0010010,p0010011,p0010012,p0010013,p0010014,p0010015,p0010016,p0010017,p0010018,p0010019,p0010020,p0010021,p0010022,p0010023,p0010024,p0010025,p0010026,p0010047,p0010063,p0010070,p0020001,p0020002,p0020003,p0020004,p0020005,p0020006,p0020007,p0020008,p0020009,p0020010,p0020011,p0020012,p0020028,p0020049,p0020065,p0020072,p0030001,p0030002,p0030003,p0030004,p0030005,p0030006,p0030007,p0030008,p0030009,p0030010,p0030026,p0030047,p0030063,p0030070,p0040001,p0040002,p0040003,p0040004,p0040005,p0040006,p0040007,p0040008,p0040009,p0040010,p0040011,p0040012,p0040028,p0040049,p0040065,p0040072,h0010001,h0010002,h0010003
Autauga County,AL,50,3,6,1,1,1539582278,25775735,54571,22135,32.5363818,-86.6444901,54571,53702,42855,9643,232,474,32,466,869,814,219,262,177,11,50,32,19,9,16,0,0,5,5,8,1,49,6,0,0,54571,1310,53261,52500,42154,9595,217,467,22,45,761,719,36,6,0,0,39958,39530,31910,6767,180,346,23,304,428,404,22,2,0,0,39958,828,39130,38746,31461,6738,169,341,15,22,384,363,19,2,0,0,22135,20221,1914


In [7]:
%sql select sqrt(10)

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


sqrt
3.16227766016838


In [8]:
%sql select 3 ^ 4

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


?column?
81.0
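The same two calculations can be cross-checked in the notebook's Python kernel. This is a minimal sketch rather than part of the original notebook, and the variable names are illustrative. Note that `^` means exponentiation in PostgreSQL but bitwise XOR in Python, so the operator does not carry over.

```python
import math

# PostgreSQL: sqrt(10)  ->  Python: math.sqrt(10)
root = math.sqrt(10)

# PostgreSQL: 3 ^ 4 (exponentiation)  ->  Python: 3 ** 4
power = 3 ** 4

# Caution: in Python, ^ is bitwise XOR, not exponentiation
xor = 3 ^ 4

print(root, power, xor)
```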


### Using aliases to improve readability

In [10]:
%%sql
SELECT geo_name,
       state_us_abbreviation AS "st",
       p0010001 AS "Total Population",
       p0010003 AS "White Alone",
       p0010004 AS "Black or African American Alone",
       p0010005 AS "Am Indian/Alaska Native Alone",
       p0010006 AS "Asian Alone",
       p0010007 AS "Native Hawaiian and Other Pacific Islander Alone",
       p0010008 AS "Some Other Race Alone",
       p0010009 AS "Two or More Races"
FROM us_counties_2010
limit 5;

 * postgresql://postgres:***@localhost:5432/analysis
5 rows affected.


geo_name,st,Total Population,White Alone,Black or African American Alone,Am Indian/Alaska Native Alone,Asian Alone,Native Hawaiian and Other Pacific Islander Alone,Some Other Race Alone,Two or More Races
Autauga County,AL,54571,42855,9643,232,474,32,466,869
Baldwin County,AL,182265,156153,17105,1216,1348,89,3631,2723
Barbour County,AL,27457,13180,12875,114,107,29,894,258
Bibb County,AL,22915,17381,5047,64,22,13,185,203
Blount County,AL,57322,53068,761,307,117,38,2347,684


### Checking census totals

The race columns should add up to the total population, so the difference should be zero for every county.

In [15]:

%%sql
SELECT geo_name,
       state_us_abbreviation AS "st",
       p0010001 AS "Total",
       p0010003 + p0010004 + p0010005 + p0010006 + p0010007
           + p0010008 + p0010009 AS "All Races",
       (p0010003 + p0010004 + p0010005 + p0010006 + p0010007
           + p0010008 + p0010009) - p0010001 AS "Difference"
FROM us_counties_2010
ORDER BY "Difference" DESC
limit 5

 * postgresql://postgres:***@localhost:5432/analysis
5 rows affected.


geo_name,st,Total,All Races,Difference
Baldwin County,AL,182265,182265,0
Barbour County,AL,27457,27457,0
Bibb County,AL,22915,22915,0
Blount County,AL,57322,57322,0
Autauga County,AL,54571,54571,0
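The same check can be sketched in plain Python using the five Alabama rows returned by the aliased query earlier. The tuple layout and the names `rows` and `differences` are assumptions made for this illustration only.

```python
# (geo_name, total, white, black, am_indian, asian, pacific_islander, other, two_or_more)
rows = [
    ("Autauga County", 54571, 42855, 9643, 232, 474, 32, 466, 869),
    ("Baldwin County", 182265, 156153, 17105, 1216, 1348, 89, 3631, 2723),
    ("Barbour County", 27457, 13180, 12875, 114, 107, 29, 894, 258),
    ("Bibb County", 22915, 17381, 5047, 64, 22, 13, 185, 203),
    ("Blount County", 57322, 53068, 761, 307, 117, 38, 2347, 684),
]

# Sum the seven race columns and subtract the reported total for each county
differences = {name: sum(parts) - total for name, total, *parts in rows}

for name, diff in differences.items():
    print(name, diff)
```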


### Finding percentages in the dataset

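Calculate, for each county, the percentage of the population that is Asian alone. One operand is cast to `numeric` so that the division is not performed as integer division.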

In [35]:
%%sql
SELECT geo_name,
state_us_abbreviation AS "st",
(CAST(p0010006 AS numeric(8,1))/ p0010001) * 100 AS "pct_asian"
FROM us_counties_2010
ORDER by "pct_asian" DESC
limit 5;

 * postgresql://postgres:***@localhost:5432/analysis
5 rows affected.


geo_name,st,pct_asian
Honolulu County,HI,43.89497769109962
Aleutians East Borough,AK,35.97580388411333
San Francisco County,CA,33.27165361664607
Santa Clara County,CA,32.02237037519322
Kauai County,HI,31.324618801329535


In [43]:
%%sql
SELECT geo_name,
state_us_abbreviation AS "st",
(CAST(p0010004 AS numeric(8,1))/ p0010001) * 100 AS "pct_black"
FROM us_counties_2010
ORDER by "pct_black" DESC
limit 5;

 * postgresql://postgres:***@localhost:5432/analysis
5 rows affected.


geo_name,st,pct_black
Jefferson County,MS,85.68470100957805
Claiborne County,MS,84.38150770512286
Holmes County,MS,83.43577455984999
Macon County,AL,82.64497482752192
Greene County,AL,81.48148148148147
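The `CAST` in the two queries above matters because PostgreSQL truncates the result when both operands of `/` are integers. A rough Python analogy, using the Autauga County figures from the earlier output (the variable names are illustrative):

```python
asian, total = 474, 54571  # Autauga County, AL

# Floor division drops the fractional part, much as integer / integer does in PostgreSQL
truncated = asian // total * 100

# Converting one operand first keeps the fractional part,
# mirroring CAST(p0010006 AS numeric(8,1)) / p0010001
pct_asian = float(asian) / total * 100

print(truncated, round(pct_asian, 2))
```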


### Aggregate functions

Calculate the total population across all counties and the average population per county

In [49]:
%sql SELECT sum(p0010001) AS "COUNTY_SUM" , round(avg(p0010001),0) AS "COUNTY_AVG" FROM us_counties_2010;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


COUNTY_SUM,COUNTY_AVG
308745538,98233


### Median

Comparing the continuous (`percentile_cont`) and discrete (`percentile_disc`) percentile functions

In [57]:
%sql DROP TABLE IF EXISTS percentile_test;

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

In [58]:
%%sql
CREATE TABLE percentile_test (
    numbers integer
)

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

In [59]:
%%sql
INSERT INTO percentile_test (numbers)
VALUES (1),(2),(3),(4),(5),(6);

 * postgresql://postgres:***@localhost:5432/analysis
6 rows affected.


[]

In [62]:
# Use .5 to represent the 50th percentile, which is equivalent to the median

In [63]:
%%sql
SELECT percentile_cont(.5)
WITHIN GROUP (ORDER BY numbers),
percentile_disc(.5)
WITHIN GROUP (ORDER BY numbers)
FROM percentile_test;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


percentile_cont,percentile_disc
3.5,3
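The two results differ because `percentile_cont` interpolates between the two middle values, while `percentile_disc` returns an actual value from the set. The sketch below re-implements both ideas in Python to make that visible. The function bodies are an approximation written for this note, not PostgreSQL's source code.

```python
import math

def percentile_cont(values, fraction):
    """Interpolate linearly between the two rows nearest the requested position."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)  # zero-based fractional row position
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])

def percentile_disc(values, fraction):
    """Return the first ordered value whose cumulative share reaches the fraction."""
    ordered = sorted(values)
    index = max(math.ceil(fraction * len(ordered)), 1) - 1
    return ordered[index]

numbers = [1, 2, 3, 4, 5, 6]  # same values as percentile_test
print(percentile_cont(numbers, 0.5), percentile_disc(numbers, 0.5))
```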


Using median and percentiles on census data: compare the average with the median

In [65]:
%%sql 
SELECT sum(p0010001) AS "COUNTY_SUM",
round(avg(p0010001),0) AS "COUNTY_AVG" ,
percentile_cont(.5)
WITHIN GROUP (ORDER BY p0010001) AS "County Median"
FROM us_counties_2010;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


COUNTY_SUM,COUNTY_AVG,County Median
308745538,98233,25857.0


The average is far higher than the median because a few counties have very large populations, as the five largest counties below show.

In [73]:
%%sql
SELECT geo_name,
       state_us_abbreviation AS "st",
       p0010001 AS "Total Population"
FROM us_counties_2010
ORDER BY p0010001 DESC
limit 5;

 * postgresql://postgres:***@localhost:5432/analysis
5 rows affected.


geo_name,st,Total Population
Los Angeles County,CA,9818605
Cook County,IL,5194675
Harris County,TX,4092459
Maricopa County,AZ,3817117
San Diego County,CA,3095313
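To see how a single very large county pulls the average away from the median, here is a small Python sketch. It uses the five Alabama counties from the earlier output and then adds Los Angeles County. The grouping is arbitrary and chosen only for illustration.

```python
import statistics

# Total populations of the five Alabama counties shown earlier
alabama = [54571, 182265, 27457, 22915, 57322]
# The same list with Los Angeles County added
with_la = alabama + [9818605]

mean_before = statistics.mean(alabama)
median_before = statistics.median(alabama)
mean_after = statistics.mean(with_la)
median_after = statistics.median(with_la)

# The median barely moves, while the mean jumps by well over a million
print(mean_before, median_before)
print(round(mean_after), median_after)
```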


### Using an array of percentiles

In [74]:
%%sql
SELECT percentile_cont(array[.25,.5,.75])
WITHIN GROUP (ORDER BY p0010001) AS "Quantiles"
FROM us_counties_2010;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


Quantiles
"[11104.5, 25857.0, 66699.0]"

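For a quick cross-check outside the database, Python's `statistics.quantiles` with `method="inclusive"` uses the same kind of linear interpolation as `percentile_cont`. The sketch below applies it to the six values from `percentile_test` rather than the census table, since the full county data is not reproduced here.

```python
import statistics

numbers = [1, 2, 3, 4, 5, 6]  # same values as percentile_test

# n=4 asks for the three cut points at .25, .5 and .75
quartiles = statistics.quantiles(numbers, n=4, method="inclusive")
print(quartiles)
```

Under that assumption, the middle element should match the `percentile_cont(.5)` result obtained from `percentile_test`.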

Which New York State county has the highest percentage of its population identifying as "American Indian/Alaska Native Alone"?

In [88]:
%%sql
SELECT geo_name,
state_us_abbreviation AS "st",
(CAST(p0010005 AS numeric(8,1))/ p0010001) * 100 AS "pct_native_am"
FROM us_counties_2010
WHERE state_us_abbreviation = 'NY'
ORDER by "pct_native_am" DESC
limit 1;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


geo_name,st,pct_native_am
Franklin County,NY,7.358669741661661
