## SQL Summarizing Data

Using statistical functions to explore the data, and writing complex queries that save intermediate results in temporary tables.

- ER diagram: https://github.com/riched158/SQL/blob/master/data/erdiagram.png




In [1]:
%load_ext sql

Connect to the empty database made with pgAdmin.

In [2]:
%sql postgresql://postgres:eric@localhost:5432/analysis

'Connected: postgres@analysis'

### The tables

In [35]:
%sql select * from evanston311 limit 1

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


id,priority,source,category,date_created,date_completed,street,house_num,zip,description
1340563,NONE,gov.publicstuff.com,Fire Prevention - Inspection of a Commercial Property,2016-01-13 15:03:18+00:00,2016-01-19 16:51:26+00:00,Sheridan Road,606-612,60202,Please contact Debbie at Ext. 222


In [36]:
%sql select * from company limit 1;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


id,exchange,ticker,name,parent_id
1,nasdaq,PYPL,PayPal Holdings Incorporated,


In [37]:
%sql select * from fortune500 limit 1;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


rank,title,name,ticker,url,hq,sector,industry,employees,revenues,revenues_change,profits,profits_change,assets,equity
1,Walmart,"Wal-Mart Stores, Inc.",WMT,http://www.walmart.com,"Bentonville, AR",Retailing,General Merchandisers,2300000,485873,0.8,13643,-7.2,198825,77798


In [38]:
%sql select * from stackoverflow limit 1;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


id,tag,date,question_count,question_pct,unanswered_count,unanswered_pct
1,paypal,2018-09-25,18050,0.001093757,8402,0.001751857


In [39]:
%sql select * from tag_type limit 1;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


id,tag,type
1,amazon-cloudformation,cloud


In [40]:
%sql select * from tag_company limit 1;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


tag,company_id
actionscript,10


### Division
Compute the average revenue per employee for Fortune 500 companies by sector.

In [3]:
%%sql
SELECT sector,
       avg(revenues/employees::numeric) AS avg_rev_employee
FROM fortune500
GROUP BY sector
ORDER BY avg_rev_employee DESC;

 * postgresql://postgres:***@localhost:5432/analysis
21 rows affected.


sector,avg_rev_employee
Materials,4.757583515544884
Energy,1.826014233252424
Financials,1.7263847014097256
Wholesalers,1.413245811893712
Engineering & Construction,0.8611637667387481
"Food, Beverages & Tobacco",0.8308847593924985
Media,0.7956186656546318
Health Care,0.7905328691968146
Telecommunications,0.6295899727918749
Chemicals,0.5954997665807445
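
The ::numeric cast in the query above matters when both columns are integers: PostgreSQL then performs integer division and discards the fractional part. The following is a minimal sketch of that effect, using Python's built-in sqlite3 module as a stand-in for the PostgreSQL connection (SQLite also truncates integer division). The table demo is hypothetical, the values are the Walmart figures from the fortune500 sample row, and the integer column types are an assumption.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (revenues INTEGER, employees INTEGER)")
conn.execute("INSERT INTO demo VALUES (485873, 2300000)")

# First column: integer / integer. Second column: one operand cast first.
int_ratio, real_ratio = conn.execute(
    "SELECT revenues / employees, "
    "       revenues / CAST(employees AS REAL) "
    "FROM demo"
).fetchone()
conn.close()

print("integer division:", int_ratio)
print("after casting:   ", real_ratio)
```

Without the cast the ratio collapses to 0, which would make every sector average misleading.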


What information does the unanswered_pct column in the stackoverflow table contain? Is it the percent of questions with the tag that are unanswered?

In [5]:
%%sql
SELECT unanswered_count/question_count::numeric AS computed_pct, 
/*compare to unanswered_pct*/
unanswered_pct
FROM stackoverflow
WHERE question_count <> 0
LIMIT 10;

 * postgresql://postgres:***@localhost:5432/analysis
10 rows affected.


computed_pct,unanswered_pct
0.4654847645429362,0.001751857
0.3863636363636363,0.000116972
0.3937677053824362,5.8e-05
0.3318965517241379,1.61e-05
0.4292857142857142,0.000125312
0.3479896172925006,0.012886449
0.3508386217225587,0.007619406
0.3072916666666666,1.23e-05
0.3542805100182149,8.11e-05
0.3806577661999348,0.000243743


The values don't match. unanswered_pct is the share of all unanswered questions on Stack Overflow that carry the tag, not the share of the tag's questions that are unanswered.
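
A quick plausibility check of that reading, sketched in Python with the paypal row shown in the stackoverflow sample above: if unanswered_pct really is the tag's share of all unanswered questions, dividing unanswered_count by it should give a site-wide total in the millions. The variable names are illustrative only.

```python
# Values from the sample stackoverflow row (tag = 'paypal').
tag_questions = 18050
tag_unanswered = 8402
unanswered_pct = 0.001751857

# Share of this tag's questions that are unanswered (the computed_pct column).
share_within_tag = tag_unanswered / tag_questions

# Site-wide unanswered total implied by the other interpretation.
implied_total_unanswered = tag_unanswered / unanswered_pct

print(round(share_within_tag, 4))
print(round(implied_total_unanswered))
```

The first figure matches computed_pct for that row, while the second lands at a few million questions, which is consistent with a site-wide denominator.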

### Summarizing numeric columns

In [6]:
%%sql
SELECT min(profits),
       avg(profits),
       max(profits),
       stddev(profits)
FROM fortune500;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


min,avg,max,stddev
-6177,1783.4753507014027,45687,3940.495363490788
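
In PostgreSQL, stddev() is an alias for stddev_samp(), the sample standard deviation (divisor n - 1); stddev_pop() uses the divisor n. The sketch below shows the difference with Python's statistics module on a small, hypothetical list of profit values rather than the real table.

```python
import math
import statistics

# Hypothetical profit values, not taken from the full fortune500 table.
profits = [-6177, 120, 1500, 13643, 45687]
n = len(profits)

sample_sd = statistics.stdev(profits)       # divisor n - 1, like stddev_samp
population_sd = statistics.pstdev(profits)  # divisor n, like stddev_pop

# The two differ only by the factor sqrt((n - 1) / n).
rescaled = sample_sd * math.sqrt((n - 1) / n)

print(sample_sd, population_sd)
```

With around 500 rows the two values are nearly identical, but the gap grows for small groups such as sectors with only a handful of companies.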


Now compute the same summary statistics by sector.

In [8]:
%%sql
SELECT sector,
       min(profits),
       avg(profits),
       max(profits),
       stddev(profits)
FROM fortune500
GROUP BY sector
ORDER BY avg DESC;

 * postgresql://postgres:***@localhost:5432/analysis
21 rows affected.


sector,min,avg,max,stddev
Technology,-1672.0,4137.241860465116,45687.0,8042.983363606666
Telecommunications,-383.5,4127.28,13127.0,5400.73173268627
Health Care,-1721.0,2773.2605263157893,16540.0,3751.818796086771
Financials,-1128.0,2719.7761904761905,24733.0,5064.764070852874
"Food, Beverages & Tobacco",-677.0,2346.1833333333334,14239.0,3412.352156334481
Aerospace & Defense,-941.0,2093.3083333333334,5302.0,2064.779951937795
Motor Vehicles & Parts,-674.9,1919.5333333333333,9427.0,3176.30073198367
Media,-495.9,1821.336363636364,9391.0,2839.299478136369
Industrials,-176.1,1727.6894736842105,8831.0,2326.018251073599
Transportation,69.0,1670.2941176470588,4373.0,1373.013160657332


### Summarize group statistics
How does the maximum value per group vary across groups?

To find out, first summarize by group, and then compute summary statistics of the group results. One way to do this is to compute group values in a subquery, and then summarize the results of the subquery.

For example: what is the standard deviation across tags in the maximum number of Stack Overflow questions per day?

In [10]:
%%sql
SELECT 
stddev(maxval),
min(maxval),
max(maxval),
avg(maxval)
/*Subquery to compute max of question_count by tag*/
FROM (SELECT max(question_count) AS maxval
          FROM stackoverflow
          GROUP BY tag) AS max_results; 

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


stddev,min,max,avg
176458.3795272,30,1138658,52652.43396226415
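
The two-level pattern (aggregate per group in a subquery, then aggregate the group results) can be tried on a toy table. This is a minimal sketch using Python's sqlite3 module as a stand-in for PostgreSQL. The table so and its rows are hypothetical, and because SQLite has no built-in stddev(), that statistic is computed in Python instead.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE so (tag TEXT, question_count INTEGER)")
conn.executemany(
    "INSERT INTO so VALUES (?, ?)",
    [("a", 10), ("a", 30), ("b", 5), ("b", 7), ("c", 100), ("c", 90)],
)

# Outer query summarizes the per-tag maxima produced by the subquery.
min_max, max_max, avg_max = conn.execute(
    """
    SELECT min(maxval), max(maxval), avg(maxval)
    FROM (SELECT max(question_count) AS maxval
          FROM so
          GROUP BY tag) AS max_results
    """
).fetchone()

# Sample standard deviation of the per-tag maxima, computed in Python.
maxvals = [row[0] for row in conn.execute(
    "SELECT max(question_count) FROM so GROUP BY tag"
)]
sd_max = statistics.stdev(maxvals)
conn.close()

print(min_max, max_max, avg_max, sd_max)
```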


### Exploring distributions

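Use trunc() to bin data. A negative second argument zeroes out digits to the left of the decimal point, so trunc(employees, -5) groups companies into bins 100,000 employees wide.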

In [11]:
%%sql
SELECT trunc(employees, -5) AS employee_bin,
count(*)
FROM fortune500
GROUP BY employee_bin
ORDER BY employee_bin;

 * postgresql://postgres:***@localhost:5432/analysis
6 rows affected.


employee_bin,count
0,433
100000,35
200000,20
300000,7
400000,4
2300000,1
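
To make the binning rule concrete, here is a small Python sketch of what trunc(value, -5) does to whole numbers. The helper trunc_to and the employee counts are hypothetical and only illustrate the idea.

```python
from collections import Counter


def trunc_to(value: int, digits: int) -> int:
    """Mimic trunc(value, digits) for integers with digits <= 0.

    Zeroes out the last -digits digits, truncating toward zero.
    """
    step = 10 ** (-digits)
    sign = -1 if value < 0 else 1
    return sign * (abs(value) // step) * step


# Hypothetical employee counts.
employees = [2300000, 367700, 116000, 68000, 9500, 451000]

# Equivalent of GROUP BY trunc(employees, -5) with count(*).
bins = Counter(trunc_to(n, -5) for n in employees)

for employee_bin in sorted(bins):
    print(employee_bin, bins[employee_bin])
```

Each bin label is the lower edge of its range, so the bin 0 holds every company with fewer than 100,000 employees.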


Repeat for companies with fewer than 100,000 employees, using narrower bins of 10,000.

In [14]:
%%sql
SELECT trunc(employees, -4) AS employee_bin,
count(*)
FROM fortune500
WHERE employees < 100000
GROUP BY employee_bin
ORDER BY employee_bin;

 * postgresql://postgres:***@localhost:5432/analysis
10 rows affected.


employee_bin,count
0,102
10000,108
20000,63
30000,42
40000,35
50000,31
60000,18
70000,18
80000,6
90000,10


### Generate series
Summarize the distribution of the number of questions with the tag "dropbox" on Stack Overflow per day by binning the data.

In [15]:
%%sql
SELECT min(question_count), 
       max(question_count)
FROM stackoverflow
WHERE tag =  'dropbox';

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


min,max
2315,3072


In [16]:
%%sql
SELECT generate_series(2200, 3050, 50) AS lower,
       generate_series(2250, 3100, 50) AS upper;

 * postgresql://postgres:***@localhost:5432/analysis
18 rows affected.


lower,upper
2200,2250
2250,2300
2300,2350
2350,2400
2400,2450
2450,2500
2500,2550
2550,2600
2600,2650
2650,2700
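
The generated lower and upper bounds only define the bins; the remaining step is to count how many daily values fall into each one. Below is a minimal Python sketch of that step, assuming half-open bins (lower <= value < upper) and a short, hypothetical list of question counts. Note that generate_series includes its stop value, whereas Python's range excludes it.

```python
# Same bounds as generate_series(2200, 3050, 50) and (2250, 3100, 50).
lowers = range(2200, 3051, 50)
uppers = range(2250, 3101, 50)
bins = list(zip(lowers, uppers))

# Hypothetical daily question counts for one tag.
question_counts = [2315, 2320, 2499, 2500, 3072]

# Count the values in each half-open bin [lower, upper).
counts = {
    (lower, upper): sum(lower <= q < upper for q in question_counts)
    for lower, upper in bins
}

for (lower, upper), count in counts.items():
    print(lower, upper, count)
```

Bins with no values still appear with a count of 0, which is the main advantage of generating the bins explicitly over grouping by a truncated value.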


### Correlation
Compute the correlation between revenues and other financial variables with the corr() function.

In [17]:
%%sql
SELECT corr(revenues,profits) AS rev_profits,
corr(revenues,assets) AS rev_assets,
corr(revenues,equity) AS rev_equity 
FROM fortune500;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


rev_profits,rev_assets,rev_equity
0.599993581572479,0.329499521318506,0.546570999718431
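
corr() returns the Pearson correlation coefficient, which ranges from -1 to 1. As a rough illustration of what is being computed, here is a small Python sketch. The function name pearson and the data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: profits rise with revenues, but not perfectly.
revenues = [100, 200, 300, 400, 500]
profits = [12, 18, 35, 33, 60]

print(pearson(revenues, profits))
```

A coefficient near 0.6, as for revenues and profits above, indicates a moderate positive linear relationship; it says nothing about non-linear ones.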


### Mean and Median
Compute the mean (avg()) and median assets of Fortune 500 companies by sector.

Use the percentile_disc() function to compute the median:

In [18]:
%%sql
SELECT sector,
avg(assets) AS mean,
/*Select the median*/
percentile_disc(0.5) WITHIN GROUP (ORDER BY assets) AS median
FROM fortune500
GROUP BY sector
ORDER BY sector;

 * postgresql://postgres:***@localhost:5432/analysis
21 rows affected.


sector,mean,median
Aerospace & Defense,31897.666666666668,20038
Apparel,11064.8,9739
Business Services,19626.1,12485
Chemicals,20151.214285714286,15769
Energy,48756.21052631579,36119
Engineering & Construction,8199.23076923077,8709
Financials,319245.09523809527,123449
Food & Drug Stores,24630.714285714286,17464
"Food, Beverages & Tobacco",29059.75,15984
Health Care,42078.89473684211,25396
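
percentile_disc() always returns a value that occurs in the data: the first value, in sorted order, whose position divided by the row count reaches the requested fraction. With an even number of rows it therefore returns the lower of the two middle values, whereas percentile_cont() interpolates between them. The sketch below mimics that rule in Python. The function is an illustrative reimplementation and the asset values are hypothetical.

```python
import statistics


def percentile_disc(values, fraction):
    """First sorted value whose position / n is >= fraction."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("percentile_disc needs at least one value")
    for position, value in enumerate(ordered, start=1):
        if position / n >= fraction:
            return value
    return ordered[-1]


# Hypothetical asset values with one very large outlier.
assets = [8709, 12485, 20038, 123449]

print("mean:           ", statistics.mean(assets))
print("discrete median:", percentile_disc(assets, 0.5))
print("interpolated:   ", statistics.median(assets))
```

The outlier pulls the mean far above either median, which is the same pattern visible in the Financials row above.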


### Create a temp table
Find the Fortune 500 companies that have profits in the top 20% for their sector (compared to other Fortune 500 companies).

To do this, first find the 80th percentile of profit for each sector with percentile_disc(fraction) WITHIN GROUP (ORDER BY sort_expression), and save the results in a temporary table.

Then join fortune500 to the temporary table to select companies with profits greater than the 80th percentile cut-off.

In [19]:
%%sql
DROP TABLE IF EXISTS profit80;

CREATE TEMP TABLE profit80 AS
  SELECT sector, 
         percentile_disc(0.8) WITHIN GROUP (ORDER BY profits) AS pct80
    FROM fortune500 
   GROUP BY sector;
   
SELECT * 
  FROM profit80;

 * postgresql://postgres:***@localhost:5432/analysis
Done.
21 rows affected.
21 rows affected.


sector,pct80
Aerospace & Defense,4895.0
Apparel,1074.1
Business Services,1401.0
Chemicals,1500.0
Energy,1311.0
Engineering & Construction,602.7
Financials,3014.0
Food & Drug Stores,2025.7
"Food, Beverages & Tobacco",6073.0
Health Care,4965.0


In [20]:
%%sql
DROP TABLE IF EXISTS profit80;

CREATE TEMP TABLE profit80 AS
  SELECT sector, 
         percentile_disc(0.8) WITHIN GROUP (ORDER BY profits) AS pct80
    FROM fortune500 
   GROUP BY sector;

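/*The WHERE filter on pct80 discards unmatched rows,
  so this LEFT JOIN returns the same rows as an inner join*/
SELECT title, fortune500.sector,
       profits, profits/pct80 AS ratio
FROM fortune500
LEFT JOIN profit80
ON fortune500.sector=profit80.sector
WHERE profits > pct80;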

 * postgresql://postgres:***@localhost:5432/analysis
Done.
21 rows affected.
90 rows affected.


title,sector,profits,ratio
Walmart,Retailing,13643.0,11.109934853420196
Berkshire Hathaway,Financials,24074.0,7.987392169873922
Apple,Technology,45687.0,6.287778695293146
Exxon Mobil,Energy,7840.0,5.980167810831426
McKesson,Wholesalers,2258.0,3.7266875722066346
UnitedHealth Group,Health Care,7017.0,1.4132930513595166
CVS Health,Health Care,5317.0,1.070896273917422
General Motors,Motor Vehicles & Parts,9427.0,2.051131418624891
AT&T,Telecommunications,12976.0,1.4923519263944796
AmerisourceBergen,Wholesalers,1427.9,2.3566595147714144
