## SQL Exploratory data analysis

Exploring data in a PostgreSQL database. The entity relationship diagram can be found here:

- https://github.com/riched158/SQL/blob/master/data/erdiagram.png




In [33]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


Connect to the `analysis` database created with pgAdmin.

In [34]:
%sql postgresql://postgres:eric@localhost:5432/analysis

'Connected: postgres@analysis'

### The tables

In [35]:
%sql select * from evanston311 limit 1

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


id,priority,source,category,date_created,date_completed,street,house_num,zip,description
1340563,NONE,gov.publicstuff.com,Fire Prevention - Inspection of a Commercial Property,2016-01-13 15:03:18+00:00,2016-01-19 16:51:26+00:00,Sheridan Road,606-612,60202,Please contact Debbie at Ext. 222


In [36]:
%sql select * from company limit 1;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


id,exchange,ticker,name,parent_id
1,nasdaq,PYPL,PayPal Holdings Incorporated,


In [37]:
%sql select * from fortune500 limit 1;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


rank,title,name,ticker,url,hq,sector,industry,employees,revenues,revenues_change,profits,profits_change,assets,equity
1,Walmart,"Wal-Mart Stores, Inc.",WMT,http://www.walmart.com,"Bentonville, AR",Retailing,General Merchandisers,2300000,485873,0.8,13643,-7.2,198825,77798


In [38]:
%sql select * from stackoverflow limit 1;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


id,tag,date,question_count,question_pct,unanswered_count,unanswered_pct
1,paypal,2018-09-25,18050,0.001093757,8402,0.001751857


In [39]:
%sql select * from tag_type limit 1;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


id,tag,type
1,amazon-cloudformation,cloud


In [40]:
%sql select * from tag_company limit 1;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


tag,company_id
actionscript,10


### Count missing values
Look at three columns in fortune500 to see which has the most missing values.
count(column) counts only non-NULL values, so count(*) - count(column) gives the number of NULLs in that column.


In [41]:
%%sql
SELECT count(*) - count(ticker) AS missing
  FROM fortune500;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


missing
32


In [42]:
%%sql
SELECT count(*) - count(profits_change) AS missing
  FROM fortune500;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


missing
63


In [43]:
%%sql
SELECT count(*) - count(industry) AS missing
  FROM fortune500;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


missing
13
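
The three queries above repeat the same pattern, so it can be wrapped in a small helper. The following is a minimal sketch, not the notebook's own setup: it assumes an in-memory SQLite database instead of PostgreSQL, and the table `toy_fortune`, its rows, and the helper `count_missing` are all made up for illustration.

```python
import sqlite3

# Toy stand-in for fortune500; table name and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_fortune (name TEXT, ticker TEXT, industry TEXT)")
conn.executemany(
    "INSERT INTO toy_fortune VALUES (?, ?, ?)",
    [
        ("A", "AAA", "Retail"),
        ("B", None, "Retail"),
        ("C", "CCC", None),
        ("D", None, None),
    ],
)


def count_missing(conn, table, column):
    """Return the number of NULLs in `column` as count(*) - count(column)."""
    # Identifiers cannot be bound as query parameters,
    # so only pass trusted table and column names here.
    sql = f"SELECT count(*) - count({column}) AS missing FROM {table}"
    return conn.execute(sql).fetchone()[0]


missing = {
    col: count_missing(conn, "toy_fortune", col)
    for col in ("name", "ticker", "industry")
}
print(missing)  # {'name': 0, 'ticker': 2, 'industry': 2}
```

Collecting the counts in one dictionary makes it easy to compare columns side by side instead of reading three separate outputs.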


### Join on common columns

The company and fortune500 tables can be joined on their common ticker column.

In [44]:
%%sql
SELECT company.name
FROM company
INNER JOIN fortune500
ON company.ticker = fortune500.ticker;

 * postgresql://postgres:***@localhost:5432/analysis
8 rows affected.


name
Apple Incorporated
Amazon.com Inc
Alphabet
Microsoft Corp.
International Business Machines Corporation
PayPal Holdings Incorporated
"eBay, Inc."
Adobe Systems Incorporated
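
An inner join keeps only rows whose join key appears in both tables, and a NULL key never matches anything (not even another NULL). The sketch below illustrates this on hypothetical data, assuming an in-memory SQLite database rather than the PostgreSQL database used above; `toy_company` and `toy_fortune` are made-up tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE toy_company (id INTEGER, ticker TEXT, name TEXT);
CREATE TABLE toy_fortune (ticker TEXT, title TEXT);
INSERT INTO toy_company VALUES
  (1, 'AAA', 'Alpha Inc'),
  (2, 'BBB', 'Beta LLC'),
  (3, NULL, 'Gamma');
INSERT INTO toy_fortune VALUES
  ('AAA', 'Alpha'),
  ('ZZZ', 'Zeta'),
  (NULL, 'Private Co');
""")

# Only 'AAA' exists in both tables; 'BBB' and 'ZZZ' have no partner,
# and the two NULL tickers do not match each other.
joined = conn.execute("""
SELECT toy_company.name, toy_fortune.title
FROM toy_company
INNER JOIN toy_fortune
ON toy_company.ticker = toy_fortune.ticker
""").fetchall()
print(joined)  # [('Alpha Inc', 'Alpha')]
```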


### Most common Stack Overflow tag type
First, count the number of tags of each type.

In [45]:
%%sql
SELECT type, count(*) AS count
FROM tag_type
GROUP BY type
ORDER BY count DESC;

 * postgresql://postgres:***@localhost:5432/analysis
10 rows affected.


type,count
cloud,31
database,6
payment,5
mobile-os,4
api,4
company,4
storage,2
os,2
spreadsheet,2
identity,1


Join the tag_company, company, and tag_type tables, keeping only mutually occurring records.
Select company.name, tag_type.tag, and tag_type.type for tags of the most common type, cloud.

In [46]:
%%sql
SELECT company.name, tag_type.tag, tag_type.type
FROM company
INNER JOIN tag_company
ON company.id = tag_company.company_id
INNER JOIN tag_type
ON tag_company.tag = tag_type.tag
WHERE type='cloud';

 * postgresql://postgres:***@localhost:5432/analysis
31 rows affected.


name,tag,type
Amazon Web Services,amazon-cloudformation,cloud
Amazon Web Services,amazon-cloudfront,cloud
Amazon Web Services,amazon-cloudsearch,cloud
Amazon Web Services,amazon-cloudwatch,cloud
Amazon Web Services,amazon-cognito,cloud
Amazon Web Services,amazon-data-pipeline,cloud
Amazon Web Services,amazon-dynamodb,cloud
Amazon Web Services,amazon-ebs,cloud
Amazon Web Services,amazon-ec2,cloud
Amazon Web Services,amazon-ecs,cloud


### Using Coalesce

The coalesce() function can be useful for specifying a default or backup value when a column contains NULL values.

- coalesce(NULL, 1, 2) = 1
- coalesce(NULL, NULL) = NULL
- coalesce(2, 3, NULL) = 2
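
The three identities above can be checked directly. This is a small sketch that assumes SQLite (via Python's built-in `sqlite3` module) as a stand-in for PostgreSQL; SQL NULL comes back as Python `None`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# coalesce() returns its first non-NULL argument, or NULL if all are NULL.
row = conn.execute(
    "SELECT coalesce(NULL, 1, 2), coalesce(NULL, NULL), coalesce(2, 3, NULL)"
).fetchone()
print(row)  # (1, None, 2)
```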


Apply coalesce() to the fortune500 data. The industry column contains some missing values, so use coalesce() to fall back to sector when industry is NULL (and to 'Unknown' when both are NULL), then find the most common industry.

In [48]:
%%sql
SELECT coalesce(industry, sector, 'Unknown') AS industry2,
count(*) AS count
FROM fortune500 
GROUP BY industry2
ORDER BY count desc
LIMIT 1;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


industry2,count
Utilities: Gas and Electric,22


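### Using coalesce with a self-join
As in the earlier join with fortune500, but now include subsidiaries: self-join company to find each company's parent, then use coalesce() to match on the parent's ticker when there is one and on the company's own ticker otherwise.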

In [49]:
%%sql
SELECT company_original.name, title, rank
FROM company AS company_original
LEFT JOIN company AS company_parent
ON company_original.parent_id = company_parent.id 
INNER JOIN fortune500 
ON coalesce(company_parent.ticker,
            company_original.ticker) = fortune500.ticker
ORDER BY rank;

 * postgresql://postgres:***@localhost:5432/analysis
10 rows affected.


name,title,rank
Apple Incorporated,Apple,3
Amazon.com Inc,Amazon.com,12
Amazon Web Services,Amazon.com,12
Alphabet,Alphabet,27
Google LLC,Alphabet,27
Microsoft Corp.,Microsoft,28
International Business Machines Corporation,IBM,32
PayPal Holdings Incorporated,PayPal Holdings,264
"eBay, Inc.",eBay,310
Adobe Systems Incorporated,Adobe Systems,443
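
To see why the LEFT JOIN and coalesce() are both needed, here is a minimal sketch on made-up data. It assumes an in-memory SQLite database rather than PostgreSQL; the tables `toy_company` and `toy_fortune`, the column `fortune_rank`, and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE toy_company (id INTEGER, ticker TEXT, name TEXT, parent_id INTEGER);
CREATE TABLE toy_fortune (fortune_rank INTEGER, title TEXT, ticker TEXT);
INSERT INTO toy_company VALUES
  (1, 'PAR',  'Parent Corp',    NULL),
  (2, NULL,   'Child Services', 1),
  (3, 'SOLO', 'Solo Inc',       NULL),
  (4, 'NOPE', 'Unlisted Ltd',   NULL);
INSERT INTO toy_fortune VALUES (5, 'Parent', 'PAR'), (9, 'Solo', 'SOLO');
""")

# LEFT JOIN keeps companies without a parent (parent.ticker is NULL for them);
# coalesce() then falls back to the company's own ticker for the final match.
rows = conn.execute("""
SELECT original.name, toy_fortune.title, toy_fortune.fortune_rank
FROM toy_company AS original
LEFT JOIN toy_company AS parent
  ON original.parent_id = parent.id
INNER JOIN toy_fortune
  ON coalesce(parent.ticker, original.ticker) = toy_fortune.ticker
ORDER BY toy_fortune.fortune_rank, original.id
""").fetchall()

for row in rows:
    print(row)
# ('Parent Corp', 'Parent', 5)
# ('Child Services', 'Parent', 5)
# ('Solo Inc', 'Solo', 9)
```

The subsidiary inherits its parent's ranking, while the company whose ticker is absent from the ranking table drops out at the inner join.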


### Casting

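Explore the effect of casting data to another type. Casting to integer rounds to the nearest whole number rather than truncating (for example, -51.5 becomes -52 and 1.5 becomes 2).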

In [50]:
%%sql
SELECT profits_change,
CAST(profits_change AS integer) AS profits_change_int
FROM fortune500
LIMIT 10;

 * postgresql://postgres:***@localhost:5432/analysis
10 rows affected.


profits_change,profits_change_int
-7.2,-7
0.0,0
-14.4,-14
-51.5,-52
53.0,53
20.7,21
1.5,2
-2.7,-3
-2.8,-3
-37.7,-38
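
The difference between rounding and truncating is easy to miss for negative values. The sketch below reproduces a few of the values from the output above in plain Python; it assumes the cast rounds ties away from zero, which matches the rows shown (-51.5 to -52, 1.5 to 2), and the helper name `round_like_cast` is made up for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_like_cast(value: str) -> int:
    """Round to the nearest integer, with ties going away from zero."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


samples = ["-7.2", "-51.5", "1.5", "-2.7"]

rounded = [round_like_cast(s) for s in samples]
truncated = [int(Decimal(s)) for s in samples]  # int() drops the fraction

print(rounded)    # [-7, -52, 2, -3]
print(truncated)  # [-7, -51, 1, -2]
```

Only the first list agrees with the `profits_change_int` column above, which is a quick way to confirm the cast is rounding.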


In [51]:
%%sql
SELECT 10/3, 
10::numeric/3;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


?column?,?column?_1
3,3.333333333333333
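
Dividing two integers gives an integer result, so one operand has to be cast first to keep the fractional part. A minimal sketch of the same effect, assuming SQLite as a stand-in: SQLite has no `::` cast syntax, and its NUMERIC cast would leave 10 as an integer, so `CAST(... AS REAL)` is used here instead of `::numeric`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Integer / integer stays an integer; casting one side first avoids that.
int_div, real_div = conn.execute(
    "SELECT 10/3, CAST(10 AS REAL)/3"
).fetchone()

print(int_div)             # 3
print(round(real_div, 3))  # 3.333
```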


In [52]:
%%sql
SELECT '3.2'::numeric,
       '-123'::numeric,
       '1e3'::numeric,
       '1e-3'::numeric,
       '02314'::numeric,
       '0002'::numeric;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


numeric,numeric_1,numeric_2,numeric_3,numeric_4,numeric_5
3.2,-123,1000,0.001,2314,2


### Summarizing revenues of the Fortune 500


In [54]:
%%sql
SELECT revenues_change, count(*)
FROM fortune500
GROUP BY revenues_change
ORDER BY count(*)
limit 5;

 * postgresql://postgres:***@localhost:5432/analysis
5 rows affected.


revenues_change,count
-45.0,1
-11.3,1
17.3,1
-1.1,1
-4.6,1


There are too many distinct values for the counts to be useful, so reduce them by casting revenues_change to integer before grouping.

In [57]:
%%sql
SELECT revenues_change::integer, count(*)
FROM fortune500
GROUP BY revenues_change::integer
ORDER BY count(*) desc
limit 5;

 * postgresql://postgres:***@localhost:5432/analysis
5 rows affected.


revenues_change,count
2,41
0,25
6,25
3,25
4,24
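
Grouping on a rounded value is a simple form of binning. The sketch below shows the idea on made-up numbers, assuming an in-memory SQLite database; because SQLite's integer cast truncates rather than rounds, `round()` is applied explicitly before the cast. The table `toy_changes` and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_changes (revenues_change REAL)")
conn.executemany(
    "INSERT INTO toy_changes VALUES (?)",
    [(1.8,), (2.1,), (2.4,), (-0.3,), (0.2,), (6.7,)],
)

# Six distinct raw values collapse into three integer bins.
rows = conn.execute("""
SELECT CAST(round(revenues_change) AS INTEGER) AS change_int,
       count(*) AS n
FROM toy_changes
GROUP BY change_int
ORDER BY n DESC, change_int
""").fetchall()
print(rows)  # [(2, 3), (0, 2), (7, 1)]
```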


How many companies increased their revenues?

In [59]:
%%sql
SELECT count(*)
FROM fortune500
WHERE revenues_change > 0;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


count
298
