**Notes for querying PostgreSQL in the terminal**

Enter the database:
<br>
`psql baseball -h localhost -U lacar`

pw: 

Find the tables of the database:
<br>
`\dt`

Get all columns of a table:
<br>
`\d+ my_table`

Selecting based on column value (note single quote marks)
<br>
`SELECT * FROM player_id
WHERE name_last='machado';`

Selecting based on two column values
<br>
`SELECT * FROM player_id
WHERE name_last='machado'
AND name_first='manny';`

Count the number of games that Machado played at 3B in 2019
<br>
Machado's MLB key: 592518

Tried several queries:

Counts every pitch, not games (the LIMIT has no effect on a single-row count):
`SELECT COUNT("game_date") FROM statcast
WHERE "fielder_5"=592518 AND "game_date"
BETWEEN '2019-01-01' AND '2019-12-31' LIMIT 5;`

To get unique game dates:
`SELECT DISTINCT "game_date" FROM statcast WHERE "fielder_5"=592518 AND "game_date" BETWEEN '2019-01-01' AND '2019-12-31' LIMIT 5;`

To get the number of unique game dates (the subquery needs an alias, given with AS):
<br>
`SELECT COUNT(*) FROM (SELECT DISTINCT "game_date" FROM statcast WHERE "fielder_5"=592518 AND "game_date" BETWEEN '2019-01-01' AND '2019-12-31' LIMIT 5) AS machado_3b_games;`

... and without limit
<br>
`SELECT COUNT(*) FROM (SELECT DISTINCT "game_date" FROM statcast WHERE "fielder_5"=592518 AND "game_date" BETWEEN '2019-01-01' AND '2019-12-31') AS machado_3b_games;`
 
 `count 
   119
(1 row)`

Games at SS:
<br>
`SELECT COUNT(*) FROM (SELECT DISTINCT "game_date" FROM statcast WHERE "fielder_6"=592518 AND "game_date" BETWEEN '2019-01-01' AND '2019-12-31') AS machado_ss_games;`
<br>
`count 
    37
(1 row)`
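
The nested `SELECT DISTINCT` counts above can also be written with `COUNT(DISTINCT ...)`, which PostgreSQL supports as well. A minimal sketch, using Python's built-in `sqlite3` module as a stand-in for the real database. The toy rows and the helper name `count_games` are invented for illustration; only the `statcast` column names come from these notes.

```python
import sqlite3

# In-memory stand-in for the statcast table (toy rows, not real data).
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE statcast (game_date TEXT, fielder_5 INTEGER)")
con_demo.executemany(
    "INSERT INTO statcast VALUES (?, ?)",
    [
        ("2019-04-01", 592518),  # two pitches in the same game
        ("2019-04-01", 592518),
        ("2019-04-02", 592518),
        ("2019-04-02", 111111),  # a different third baseman
        ("2018-09-30", 592518),  # outside the 2019 date range
    ],
)

def count_games(con, fielder_id):
    """Count distinct 2019 game dates for one third baseman, two ways."""
    where = "fielder_5 = ? AND game_date BETWEEN '2019-01-01' AND '2019-12-31'"
    nested = con.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT game_date FROM statcast WHERE {where}) AS games",
        (fielder_id,),
    ).fetchone()[0]
    direct = con.execute(
        f"SELECT COUNT(DISTINCT game_date) FROM statcast WHERE {where}",
        (fielder_id,),
    ).fetchone()[0]
    return nested, direct

print(count_games(con_demo, 592518))  # (2, 2)
```

Both forms return the same number; the `COUNT(DISTINCT ...)` form simply avoids the subquery alias.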




Try advanced query

# Order of execution

1. FROM /JOIN (subqueries)
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. DISTINCT
7. ORDER BY
8. LIMIT/OFFSET

<br>
- FROM/JOIN first builds the working dataset; WHERE then filters its rows by the given conditions
<br>
- Column aliases are created in SELECT, so clauses that run before it, such as WHERE and HAVING, cannot reference them; ORDER BY can. (Aliases for table names are fine throughout.)
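
The logical order above can be mimicked step by step in plain Python. This is only an illustrative sketch over invented toy rows, not how PostgreSQL actually executes a query; the query it mirrors is shown in the comment.

```python
from collections import defaultdict

# Toy rows standing in for a table (invented data).
rows = [
    {"pitch_type": "FF", "strikes": 2},
    {"pitch_type": "FF", "strikes": 2},
    {"pitch_type": "SL", "strikes": 2},
    {"pitch_type": "SL", "strikes": 1},
    {"pitch_type": "CH", "strikes": 2},
    {"pitch_type": "FF", "strikes": 0},
]

# Mirrors: SELECT pitch_type, COUNT(*) AS no_pitches FROM rows
#          WHERE strikes = 2 GROUP BY pitch_type HAVING COUNT(*) >= 1
#          ORDER BY no_pitches DESC, pitch_type LIMIT 2
working = list(rows)                                    # 1. FROM
filtered = [r for r in working if r["strikes"] == 2]    # 2. WHERE
groups = defaultdict(list)                              # 3. GROUP BY
for r in filtered:
    groups[r["pitch_type"]].append(r)
kept = {k: g for k, g in groups.items() if len(g) >= 1}  # 4. HAVING
# 5. SELECT: the alias no_pitches only exists from this step onward
selected = [{"pitch_type": k, "no_pitches": len(g)} for k, g in kept.items()]
# 6. DISTINCT: nothing to de-duplicate in this example
# 7. ORDER BY: runs after SELECT, so it can use the alias
ordered = sorted(selected, key=lambda r: (-r["no_pitches"], r["pitch_type"]))
result = ordered[:2]                                    # 8. LIMIT

print(result)
```

The WHERE step never sees `no_pitches`, which is why an alias cannot be used there.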

# Example of how to format a long query

In [57]:
def return_df_metric_rate_batter(metric, batter_id):
    # Rate of one event type (e.g. 'single') among a batter's recorded events

    sql_query = """
    SELECT "game_date", "batter", "events" FROM statcast
    WHERE "batter" = %s
    AND "game_date" BETWEEN '2019-03-28' AND '2019-04-30'
    AND "events" IS NOT NULL
    """

    # Pass batter_id as a query parameter instead of concatenating it into the SQL
    df_events = pd.read_sql_query(sql_query, con, params=(batter_id,))

    # Fraction of rows belonging to each event type
    df_summary = df_events.groupby('events').count()['game_date'] / len(df_events)
    return df_summary[metric]
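
Rather than concatenating `batter_id` into the SQL string, the value can be passed as a query parameter. A minimal sketch with the built-in `sqlite3` module as a stand-in (its placeholder is `?`, whereas psycopg2 uses `%s`). The toy rows and the function name `event_rate` are hypothetical, and the rate is computed in plain Python rather than pandas to keep the example self-contained.

```python
import sqlite3
from collections import Counter

con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE statcast (game_date TEXT, batter INTEGER, events TEXT)")
con_demo.executemany(
    "INSERT INTO statcast VALUES (?, ?, ?)",
    [
        ("2019-04-01", 665487, "single"),
        ("2019-04-01", 665487, None),         # pitch with no recorded event
        ("2019-04-02", 665487, "strikeout"),
        ("2019-04-03", 665487, "single"),
        ("2019-04-03", 665487, "field_out"),
        ("2019-04-03", 592518, "single"),     # different batter
    ],
)

def event_rate(con, metric, batter_id):
    """Share of a batter's non-null events that equal `metric`."""
    sql_query = """
    SELECT events FROM statcast
    WHERE batter = ?
    AND game_date BETWEEN '2019-03-28' AND '2019-04-30'
    AND events IS NOT NULL
    """
    events = [row[0] for row in con.execute(sql_query, (batter_id,))]
    if not events:
        return 0.0
    return Counter(events)[metric] / len(events)

print(event_rate(con_demo, "single", 665487))  # 0.5
```

Keeping the value out of the SQL text also keeps the triple-quoted query readable as one block.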

# SQL with baseball database

In [58]:
import pandas as pd
import sqlalchemy
import sqlalchemy_utils
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2

In [182]:
# Define a database name
# Set your postgres username
dbname = "baseball"
username = "lacar"  # change this to your username

In [183]:
# Working with PostgreSQL in Python
# Connect to make queries using psycopg2
con = None
con = psycopg2.connect(database=dbname, user=username)

## Show what statcast table looks like

In [144]:
sql_query = """
SELECT * FROM statcast
LIMIT 10
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

   level_0  index pitch_type  game_date  release_speed  release_pos_x  \
0   628924   6520         SL 2019-04-23           85.6        -1.2669   
1   628925   6540         SL 2019-04-23           84.4        -1.2766   
2   628926   6542         FF 2019-04-23           92.9        -1.1886   
3   628927   6570         FF 2019-04-23           92.2        -1.0392   
4   628928   6576         FF 2019-04-23           91.2        -1.1659   
5   628929   6599         FF 2019-04-23           86.8         2.5039   
6   628930   6606         CU 2019-04-23           70.9         2.5916   
7   628931   6633         FF 2019-04-23           88.5         2.4167   
8   628932   6641         SL 2019-04-23           81.5         2.8770   
9   628933   6666         SL 2019-04-23           80.5         2.7526   

   release_pos_z       player_name    batter   pitcher  ... home_score  \
0         6.1763      Erik Swanson  665487.0  657024.0  ...        3.0   
1         6.1412      Erik Swanson  594824.0  65

## Get Padres stats for 2019

In [130]:
sql_query = """
SELECT "game_date", "batter", "events", "home_team" FROM statcast
WHERE "game_date" BETWEEN '2019-03-28' AND '2019-04-30'
AND "events" IS NOT NULL
AND "home_team"='SD'
LIMIT 10
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

   game_date    batter                     events home_team
0 2019-04-24  570267.0  grounded_into_double_play        SD
1 2019-04-24  596129.0                     single        SD
2 2019-04-24  571745.0                  field_out        SD
3 2019-04-24  571976.0                  strikeout        SD
4 2019-04-24  665487.0                  force_out        SD
5 2019-04-24  594824.0                     single        SD
6 2019-04-24  642336.0                  field_out        SD
7 2019-04-24  429665.0                  strikeout        SD
8 2019-04-24  605480.0                  field_out        SD
9 2019-04-24  592387.0                  field_out        SD


## Get batters of average WAR (Renfroe-like) in 2019

In [62]:
sql_query = """
SELECT "Name", "WAR" FROM batting_stats
WHERE "WAR" BETWEEN 1.5 AND 2.5
"""

df_query = pd.read_sql_query(sql_query,con)
    
print(df_query)

                  Name  WAR
0            Josh Bell  2.5
1         Kole Calhoun  2.5
2    Edwin Encarnacion  2.5
3          Khris Davis  2.5
4          Nelson Cruz  2.5
..                 ...  ...
231      Freddy Galvis  1.5
232       Amed Rosario  1.5
233    Kevin Kiermaier  1.5
234        Kyle Seager  1.5
235       Juan Lagares  1.5

[236 rows x 2 columns]


In [None]:
sql_query = """
SELECT "Name", "WAR" FROM batting_stats
WHERE "Name" LIKE '%Renfroe'
LIMIT 10
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)


## Get Pitchers of average WAR (1.9-2) 2019

In [132]:
sql_query = """
SELECT "Name", "WAR", "Season" FROM pitching_stats
WHERE "WAR" BETWEEN 1.9 AND 2
AND "Season"=2019
"""

df_query = pd.read_sql_query(sql_query,con)
#print(df_query)
df_query

              Name  WAR  Season
0   Domingo German  2.0  2019.0
1       Wade Miley  2.0  2019.0
2   Brett Anderson  2.0  2019.0
3    Merrill Kelly  2.0  2019.0
4        Ivan Nova  2.0  2019.0
5     Tanner Roark  2.0  2019.0
6  Aroldis Chapman  2.0  2019.0
7     Martin Perez  1.9  2019.0
8   Trent Thornton  1.9  2019.0
9    Daniel Norris  1.9  2019.0


## Get number of pitches that Tatis faced on a two-strike count

In [135]:
# Get Tatis key_mlbam
sql_query = """
SELECT * FROM player_id
WHERE name_last='tatis';
"""
df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

   index name_last name_first  key_mlbam key_retro  key_bbref  key_fangraphs  \
0   1738     tatis   fernando     665487  tatif002  tatisfe02          19709   

   mlb_played_first  mlb_played_last  
0            2019.0           2019.0  


In [None]:
# Get all at-bats, hard-coding the key_mlbam

In [138]:
sql_query = """
SELECT "strikes", "batter", "name_last" FROM statcast
JOIN player_id
ON statcast.batter=player_id.key_mlbam
WHERE key_mlbam='665487'
LIMIT 5
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

   strikes    batter name_last
0      0.0  665487.0     tatis
1      0.0  665487.0     tatis
2      0.0  665487.0     tatis
3      2.0  665487.0     tatis
4      1.0  665487.0     tatis


## Count two-strike pitches using a subquery, searching for "Tatis"

In [141]:
sql_query = """
SELECT COUNT("strikes") FROM statcast
JOIN player_id
ON statcast.batter=player_id.key_mlbam
WHERE key_mlbam=
    (SELECT key_mlbam FROM player_id
    WHERE name_last='tatis'
    AND name_first='fernando')
AND "strikes" = 2
LIMIT 5;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

   count
0    393
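
Because the subquery already returns the `key_mlbam`, the `JOIN` to `player_id` is not strictly needed for a count; filtering `statcast.batter` directly should give the same number as long as `key_mlbam` is unique in `player_id`. A minimal sketch on invented toy rows, using the built-in `sqlite3` module as a stand-in for PostgreSQL:

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE player_id (name_last TEXT, name_first TEXT, key_mlbam INTEGER);
CREATE TABLE statcast (batter INTEGER, strikes INTEGER);
INSERT INTO player_id VALUES ('tatis', 'fernando', 665487), ('machado', 'manny', 592518);
INSERT INTO statcast VALUES (665487, 2), (665487, 2), (665487, 1), (592518, 2);
""")

# Version from the notes: join, then filter on the joined key
with_join = """
SELECT COUNT(strikes) FROM statcast
JOIN player_id ON statcast.batter = player_id.key_mlbam
WHERE key_mlbam = (SELECT key_mlbam FROM player_id
                   WHERE name_last = 'tatis' AND name_first = 'fernando')
AND strikes = 2
"""

# Alternative: filter the batter column directly, no join
without_join = """
SELECT COUNT(strikes) FROM statcast
WHERE batter = (SELECT key_mlbam FROM player_id
                WHERE name_last = 'tatis' AND name_first = 'fernando')
AND strikes = 2
"""

count_join = con_demo.execute(with_join).fetchone()[0]
count_plain = con_demo.execute(without_join).fetchone()[0]
print(count_join, count_plain)  # 2 2
```

The join is still needed when columns from `player_id`, such as `name_last`, appear in the output.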


## Count number of Tatis at-bats

In [142]:
sql_query = """
SELECT COUNT("events") FROM statcast
JOIN player_id
ON statcast.batter=player_id.key_mlbam
WHERE key_mlbam=
    (SELECT key_mlbam FROM player_id
    WHERE name_last='tatis'
    AND name_first='fernando')
AND "events" IS NOT NULL
LIMIT 5;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

   count
0    371


## Get the number of pitches of each pitch type that Tatis faced on two strike counts

In [150]:
sql_query = """
SELECT pitch_type, COUNT(*) AS no_pitches
FROM statcast
JOIN player_id
ON statcast.batter=player_id.key_mlbam
WHERE key_mlbam=
    (SELECT key_mlbam FROM player_id
    WHERE name_last='tatis'
    AND name_first='fernando')
AND "strikes" = 2
GROUP BY pitch_type
ORDER BY no_pitches DESC
LIMIT 10;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

  pitch_type  no_pitches
0         FF         126
1         SL          97
2         CH          57
3         CU          34
4         SI          22
5         FT          21
6         FC          17
7         KC          14
8         FS           5


## Get the breakdown in proportion of pitch types that Tatis faced on two strike counts

**Return to this**

In [195]:
# Just get the total number of two-strike pitches
sql_query = """
SELECT COUNT(*) 
FROM statcast
JOIN player_id
ON statcast.batter=player_id.key_mlbam
WHERE key_mlbam=
    (SELECT key_mlbam FROM player_id
    WHERE name_last='tatis'
    AND name_first='fernando')
AND "strikes" = 2;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

   count
0    393


In [196]:
sql_query = """
SELECT pitch_type, COUNT(*) AS no_pitches
FROM statcast
JOIN player_id
ON statcast.batter=player_id.key_mlbam
WHERE key_mlbam=
    (SELECT key_mlbam FROM player_id
    WHERE name_last='tatis'
    AND name_first='fernando')
AND "strikes" = 2
GROUP BY pitch_type
ORDER BY no_pitches DESC;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

  pitch_type  no_pitches
0         FF         126
1         SL          97
2         CH          57
3         CU          34
4         SI          22
5         FT          21
6         FC          17
7         KC          14
8         FS           5


**Complex query**

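- uses CTEs (common table expressions) defined with WITH
- note the notation: CTEs are separated by commas, with no comma after the last one
- note the cast to decimal (::decimal), which avoids integer division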

In [210]:
sql_query = """
WITH
    total_table AS
        (SELECT COUNT(*) AS total_pitches
        FROM statcast
        JOIN player_id
        ON statcast.batter=player_id.key_mlbam
        WHERE key_mlbam=
            (SELECT key_mlbam FROM player_id
            WHERE name_last='tatis'
            AND name_first='fernando')
        AND "strikes" = 2),
        
    pitch_counts_table AS
        (SELECT pitch_type, COUNT(*) AS no_pitches
        FROM statcast
        JOIN player_id
        ON statcast.batter=player_id.key_mlbam
        WHERE key_mlbam=
            (SELECT key_mlbam FROM player_id
            WHERE name_last='tatis'
            AND name_first='fernando')
        AND "strikes" = 2
        GROUP BY pitch_type)
        
SELECT
    pitch_counts_table.pitch_type, 
    pitch_counts_table.no_pitches,
    pitch_counts_table.no_pitches::decimal / total_table.total_pitches AS pitch_type_proportion
FROM pitch_counts_table, total_table
ORDER BY pitch_counts_table.no_pitches DESC;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

  pitch_type  no_pitches  pitch_type_proportion
0         FF         126               0.320611
1         SL          97               0.246819
2         CH          57               0.145038
3         CU          34               0.086514
4         SI          22               0.055980
5         FT          21               0.053435
6         FC          17               0.043257
7         KC          14               0.035623
8         FS           5               0.012723
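
The `::decimal` cast matters because dividing two integers in PostgreSQL truncates the result. A small sketch of the same CTE pattern on invented toy rows, run through the built-in `sqlite3` module as a stand-in. SQLite has no `::` syntax, so `CAST(... AS REAL)` is used here instead; the `pitches` table and its contents are made up.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE pitches (pitch_type TEXT)")
con_demo.executemany(
    "INSERT INTO pitches VALUES (?)",
    [("FF",), ("FF",), ("SL",), ("CH",)],
)

sql_query = """
WITH
    total_table AS
        (SELECT COUNT(*) AS total_pitches FROM pitches),

    pitch_counts_table AS
        (SELECT pitch_type, COUNT(*) AS no_pitches
        FROM pitches
        GROUP BY pitch_type)

SELECT
    pitch_counts_table.pitch_type,
    pitch_counts_table.no_pitches / total_table.total_pitches AS int_division,
    CAST(pitch_counts_table.no_pitches AS REAL) / total_table.total_pitches AS proportion
FROM pitch_counts_table, total_table
ORDER BY pitch_counts_table.no_pitches DESC, pitch_counts_table.pitch_type;
"""

rows = con_demo.execute(sql_query).fetchall()
for row in rows:
    print(row)
```

The `int_division` column comes back as 0 for every pitch type, while the cast column holds the actual proportions.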


# Alise's SQL questions

In [65]:
import pandas as pd
import sqlalchemy
import sqlalchemy_utils
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2

Host: ec2-54-245-31-214.us-west-2.compute.amazonaws.com
<br>
Port: 5291
<br>
User (Role): sqlpractice [Only has READ privileges]
<br>
Password: iloveSQL!
<br>
Database: ecommerce


In [166]:
# Working with PostgreSQL in Python
# Connect to make queries using psycopg2
con = None
con = psycopg2.connect(host = 'ec2-54-245-31-214.us-west-2.compute.amazonaws.com',
                       port = '5291',
                       database = 'ecommerce',
                       user = 'sqlpractice',
                       password = 'iloveSQL!')

## Overview of tables

In [167]:
## Overview of table
sql_query = """
SELECT * FROM public.employees 
LIMIT 5;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

   employeeid   lastname firstname                  title titleofcourtesy  \
0           1    Davolio     Nancy   Sales Representative             Ms.   
1           2     Fuller    Andrew  Vice President, Sales             Dr.   
2           3  Leverling     Janet   Sales Representative             Ms.   
3           4    Peacock  Margaret   Sales Representative            Mrs.   
4           5   Buchanan    Steven          Sales Manager             Mr.   

   birthdate   hiredate                    address      city region  \
0 1966-12-08 2010-05-01  507 - 20th Ave. E. Apt.57  New York     NY   
1 1970-02-19 2010-08-14         908 W. Capital Way    Tacoma     WA   
2 1981-08-30 2010-04-01         722 Moss Bay Blvd.  Kirkland     WA   
3 1955-09-19 2011-05-03       4110 Old Redmond Rd.   Redmond     WA   
4 1973-03-04 2011-10-17            14 Garrett Hill    London   None   

  postalcode country       homephone extension  \
0      10027     USA  (206) 555-9857      5467   
1      984

In [94]:
sql_query = """
SELECT COUNT(*) FROM public.employees 
LIMIT 5;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

   count
0      9


## Question 1
Return a distinct list of titles and the count of employees with this title

In [77]:
sql_query = """
-- GROUP BY title already yields one row per title, so DISTINCT is not needed
SELECT title AS unique_title, COUNT(employeeid) 
FROM employees 
GROUP BY title;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

               unique_title  count
0             Sales Manager      1
1     Vice President, Sales      1
2      Sales Representative      6
3  Inside Sales Coordinator      1


## Question 2
Return the first name, last name, and title of the employee(s) who have "Sales" in their title and a hire date before January 1, 2011

In [90]:
sql_query = """
SELECT firstname, lastname, title 
FROM employees 
WHERE title LIKE '%Sales%'
AND hiredate < '2011-01-01';
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

  firstname   lastname                  title
0     Nancy    Davolio   Sales Representative
1    Andrew     Fuller  Vice President, Sales
2     Janet  Leverling   Sales Representative


## Question 3
Return each employee's first and last name and number of years of employment

In [92]:
sql_query = """
SELECT firstname, lastname, EXTRACT(YEAR FROM CURRENT_DATE)-EXTRACT(YEAR FROM hiredate) AS employment_length
FROM employees;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

  firstname   lastname  employment_length
0     Nancy    Davolio               10.0
1    Andrew     Fuller               10.0
2     Janet  Leverling               10.0
3  Margaret    Peacock                9.0
4    Steven   Buchanan                9.0
5   Michael     Suyama                9.0
6    Robert       King                8.0
7     Laura   Callahan                8.0
8      Anne  Dodsworth                8.0
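
Subtracting the two `EXTRACT(YEAR ...)` values counts calendar-year boundaries crossed, not completed years. The difference can be seen in plain Python; the function names and the fixed reference date below are assumptions for illustration. In PostgreSQL, `DATE_PART('year', AGE(CURRENT_DATE, hiredate))` should give completed years instead.

```python
from datetime import date

def year_difference(start, today):
    """Same as EXTRACT(YEAR FROM today) - EXTRACT(YEAR FROM start)."""
    return today.year - start.year

def completed_years(start, today):
    """Full years elapsed: subtract one if this year's anniversary has not happened yet."""
    before_anniversary = (today.month, today.day) < (start.month, start.day)
    return today.year - start.year - int(before_anniversary)

hired = date(2011, 10, 17)   # Steven Buchanan's hiredate from the employees table
as_of = date(2020, 6, 1)     # hypothetical "current date"

print(year_difference(hired, as_of))   # 9
print(completed_years(hired, as_of))   # 8
```

The same caveat applies to any age computed from `birthdate` by year subtraction.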


## Question 4
Return the maximum number of years of employment for any employee

In [110]:
# Method 1
sql_query = """
SELECT EXTRACT(YEAR FROM CURRENT_DATE)-EXTRACT(YEAR FROM hiredate) AS employment_length
FROM employees
ORDER BY employment_length DESC
LIMIT 1;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

   employment_length
0               10.0


In [112]:
# Method 2
sql_query = """
SELECT MAX(EXTRACT(YEAR FROM CURRENT_DATE)-EXTRACT(YEAR FROM hiredate)) AS max_years
FROM employees;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

   max_years
0       10.0


## Question 5
Return the employee city ordered by the total number of unique employees currently employed (Assume each employeeid appears only once)

In [126]:
# Method 1
sql_query = """
SELECT city, COUNT(DISTINCT employeeid) AS no_unique_employees
FROM employees
GROUP BY city
ORDER BY no_unique_employees DESC;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

         city  no_unique_employees
0      London                    2
1       Miner                    1
2    New York                    1
3     Redmond                    1
4     Seattle                    1
5      Tacoma                    1
6    Kirkland                    1
7  Winchester                    1


In [168]:
# Method 2 (Alise method)

sql_query = """
SELECT city, COUNT(employeeid) AS employeecount 
FROM public.employees 
GROUP BY 1 
ORDER BY COUNT(employeeid) DESC;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

         city  employeecount
0      London              2
1     Seattle              1
2    Kirkland              1
3  Winchester              1
4       Miner              1
5      Tacoma              1
6    New York              1
7     Redmond              1
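
The two methods list the tied cities in a different order because sorting on the count alone leaves the order of ties unspecified. Adding a second sort key makes the result deterministic. A sketch with the built-in `sqlite3` module as a stand-in and a toy `employees` table (the rows are invented; only the column names follow the notes):

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE employees (employeeid INTEGER, city TEXT)")
con_demo.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(1, "Tacoma"), (2, "London"), (3, "Seattle"), (4, "London"), (5, "Kirkland")],
)

sql_query = """
SELECT city, COUNT(employeeid) AS employeecount
FROM employees
GROUP BY city
ORDER BY employeecount DESC, city ASC;
"""

rows = con_demo.execute(sql_query).fetchall()
print(rows)
# [('London', 2), ('Kirkland', 1), ('Seattle', 1), ('Tacoma', 1)]
```

With `city ASC` as the tie-breaker, cities with equal counts always come back alphabetically.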


## Question 6

Return each city and the birthdate of the oldest employee as "mindate"

In [174]:
# Method 1

sql_query =  """
SELECT city, MIN(birthdate) AS mindate
FROM employees
GROUP BY city
"""

df_query = pd.read_sql_query(sql_query, con)
print(df_query)

         city    mindate
0    New York 1966-12-08
1     Seattle 1976-01-09
2    Kirkland 1981-08-30
3      London 1973-03-04
4  Winchester 1978-05-29
5       Miner 1981-07-02
6      Tacoma 1970-02-19
7     Redmond 1955-09-19


In [175]:
# Method 2

sql_query =  """
SELECT city, MIN(birthdate) AS mindate
FROM employees
GROUP BY 1
"""

df_query = pd.read_sql_query(sql_query, con)
print(df_query)

         city    mindate
0    New York 1966-12-08
1     Seattle 1976-01-09
2    Kirkland 1981-08-30
3      London 1973-03-04
4  Winchester 1978-05-29
5       Miner 1981-07-02
6      Tacoma 1970-02-19
7     Redmond 1955-09-19


## Question 7

Return the first and last name and age of the youngest employee
<br>
**Review years extraction from date**

In [179]:
# Method 1

sql_query =  """
SELECT firstname, lastname, EXTRACT(YEAR FROM CURRENT_DATE)-EXTRACT(YEAR FROM birthdate) AS age
FROM employees
ORDER BY age ASC
LIMIT 1;
"""

df_query = pd.read_sql_query(sql_query, con)
print(df_query)

  firstname   lastname   age
0      Anne  Dodsworth  36.0


In [180]:
# Method 2 - use MAX in a subquery inside WHERE

sql_query =  """
SELECT firstname, lastname, (EXTRACT(YEAR FROM CURRENT_TIMESTAMP) - EXTRACT(YEAR FROM birthdate)) AS age 
FROM public.employees 
WHERE birthdate = (SELECT MAX(birthdate) AS maxbdate FROM public.employees);
"""

df_query = pd.read_sql_query(sql_query, con)
print(df_query)

  firstname   lastname   age
0      Anne  Dodsworth  36.0
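
The two methods differ when several employees share the latest birthdate: `LIMIT 1` returns a single row, while the `MAX` subquery returns every tied employee. A sketch on invented rows, with the built-in `sqlite3` module standing in for PostgreSQL:

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE employees (firstname TEXT, lastname TEXT, birthdate TEXT)")
con_demo.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ann", "Older", "1970-01-01"),
        ("Bo", "Young", "1985-05-05"),
        ("Cy", "Twin", "1985-05-05"),   # same birthdate as Bo
    ],
)

# Method 1 style: sort youngest first and keep one row
limit_rows = con_demo.execute(
    "SELECT firstname FROM employees ORDER BY birthdate DESC LIMIT 1"
).fetchall()

# Method 2 style: keep every row matching the latest birthdate
max_rows = con_demo.execute(
    "SELECT firstname FROM employees "
    "WHERE birthdate = (SELECT MAX(birthdate) FROM employees) "
    "ORDER BY firstname"
).fetchall()

print(len(limit_rows))  # 1
print(max_rows)         # [('Bo',), ('Cy',)]
```

Which behaviour is wanted depends on whether the question asks for one youngest employee or all of them.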


# Bottom of notebook