# SQL in Pandas with SQLAlchemy

We will use the "sshtunnel" library to connect to our remote AWS instance and then pull some data into Pandas using SQLAlchemy and Psycopg2.

## Creating the SSH Tunnel

The "sshtunnel" library can read an SSH config file, so creating a tunnel is straightforward once SSH keys are set up and the SSH config entry has been created. Because no local bind address is passed below, the library picks a free local port on its own; that port is available afterwards as `server.local_bind_port`.

In [2]:
from sshtunnel import SSHTunnelForwarder

AWS_IP_ADDRESS = '35.164.67.69'
AWS_USERNAME = 'anjali'
SSH_KEY_PATH = '/Users/user/.ssh/id_rsa'

server = SSHTunnelForwarder(
    AWS_IP_ADDRESS,
    ssh_username=AWS_USERNAME,
    ssh_pkey=SSH_KEY_PATH,
    remote_bind_address=('localhost', 5432),
)

server.start()
print(server.is_active, server.is_alive, server.local_bind_port)

True True 62514
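The local port in the output above (62514) was not chosen anywhere in the code. As a rough illustration of how a program can obtain a free port, the sketch below asks the operating system for one by binding to port 0. It uses only the standard library, `pick_free_local_port` is a hypothetical helper name, and it is not meant to describe how "sshtunnel" works internally.

```python
import socket


def pick_free_local_port(host: str = "127.0.0.1") -> int:
    """Ask the operating system for an unused TCP port on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 means "any free port"; the OS fills in the real number.
        sock.bind((host, 0))
        return sock.getsockname()[1]


port = pick_free_local_port()
print(port)  # a free port number; the value differs from run to run
```

Since the number changes between runs, later cells should read `server.local_bind_port` rather than hard-coding a value.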


## Connecting via Python

We'll use the Psycopg2 driver alongside SQLAlchemy to connect to this database.

* **SQLAlchemy:** generates SQL statements
* **Psycopg2:** sends the SQL statements to the Postgres database

Let's make the connection to the database. Note that the host for the Postgres database is 'localhost' and the port is whatever local port the `server` tunnel above is bound to. This is because the SSH tunnel forwards that local port to port 5432 on the AWS instance, so the remote database behaves as if it were *local*.

In [3]:
from sqlalchemy import create_engine

# Postgres username, password, and database name
POSTGRES_IP_ADDRESS = 'localhost' ## This is localhost because SSH tunnel is active
POSTGRES_PORT = str(server.local_bind_port)
POSTGRES_USERNAME = 'anjali'     ## CHANGE THIS TO YOUR POSTGRES USERNAME
POSTGRES_PASSWORD = 'Brandywine10' ## CHANGE THIS TO YOUR POSTGRES PASSWORD
POSTGRES_DBNAME = 'tennis'

# A long string that contains the necessary Postgres login information
postgres_str = ('postgresql://{username}:{password}@{ipaddress}:{port}/{dbname}'
                .format(username=POSTGRES_USERNAME, 
                        password=POSTGRES_PASSWORD,
                        ipaddress=POSTGRES_IP_ADDRESS,
                        port=POSTGRES_PORT,
                        dbname=POSTGRES_DBNAME))

# Create the connection
cnx = create_engine(postgres_str)
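One caveat with building the connection string by plain formatting: a password containing characters such as `@` or `/` would break the URL. The sketch below percent-encodes the credentials with the standard library and reads the password from an environment variable instead of the notebook. The helper `build_postgres_url` and the variable name `PGPASSWORD_DEMO` are made up for this illustration, and the fallback password is a dummy value.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(username, password, host, port, dbname):
    """Assemble a postgresql:// URL, percent-encoding the credentials."""
    return "postgresql://{user}:{pw}@{host}:{port}/{db}".format(
        user=quote_plus(username),
        pw=quote_plus(password),
        host=host,
        port=port,
        db=dbname,
    )


# PGPASSWORD_DEMO is a hypothetical variable name used only in this sketch.
password = os.environ.get("PGPASSWORD_DEMO", "p@ss/word")
url = build_postgres_url("anjali", password, "localhost", 62514, "tennis")
print(url)
```

The resulting string can be passed to `create_engine` in the same way as `postgres_str`.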

## Load Some Data!

Pandas has a `read_sql_query` function that passes a SQL statement to a database connection and returns the result as a DataFrame. Here is an example from the `aus_ladies_2013` table.

In [4]:
import pandas as pd

pd.read_sql_query('''SELECT * FROM aus_ladies_2013 LIMIT 5;''', cnx)

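| | player1 | player2 | round | result | fnl1 | fnl2 | fsp_1 | fsw_1 | ssp_1 | ssw_1 | ... | bpc_2 | bpw_2 | npa_2 | npw_2 | tpw_2 | st1_2 | st2_2 | st3_2 | st4_2 | st5_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Serena Williams | Ashleigh Barty | 1 | 1 | 2.0 | 0.0 | 59.0 | 20.0 | 41.0 | 8.0 | ... | 0 | 0 | 2.0 | 4.0 | 31 | 2 | 1 | | | |
| 1 | Vesna Dolonc | Lara Arruabarrena | 1 | 1 | 2.0 | 1.0 | 65.0 | 33.0 | 35.0 | 10.0 | ... | 4 | 7 | | | 74 | 6 | 2 | 4.0 | | |
| 2 | Pauline Parmentier | Karolina Pliskova | 1 | 0 | 0.0 | 2.0 | 63.0 | 16.0 | 37.0 | 4.0 | ... | 5 | 14 | | | 64 | 6 | 6 | | | |
| 3 | Heather Watson | Daniela Hantuchova | 1 | 0 | 1.0 | 2.0 | 61.0 | 41.0 | 39.0 | 19.0 | ... | 5 | 13 | 5.0 | 8.0 | 102 | 7 | 3 | 6.0 | | |
| 4 | Samantha Stosur | Klara Zakopalova | 1 | 1 | 2.0 | 0.0 | 65.0 | 28.0 | 35.0 | 11.0 | ... | 4 | 14 | 10.0 | 15.0 | 60 | 3 | 4 | | | |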


In [5]:
sql_query = '''WITH P1 AS (SELECT player1, COUNT(player1)
               FROM aus_ladies_2013
               GROUP BY  player1)
               
               WITH P2 AS (SELECT player2, COUNT(player1)
               FROM aus_ladies_2013
               GROUP BY  player2)
               
               SELECT column1 [, column2 ]
               FROM table1 [, table2 ]
               UNION ALL
               SELECT column1 [, column2 ]
               FROM table1 [, table2 ]
               
               Select player1, From P1;'''

pd.read_sql_query(sql_query, cnx)

ProgrammingError: (psycopg2.ProgrammingError) syntax error at or near "WITH"
LINE 5:                WITH P2 AS (SELECT player2, COUNT(player1)
                       ^
 [SQL: 'WITH P1 AS (SELECT player1, COUNT(player1)\n               FROM aus_ladies_2013\n               GROUP BY  player1)\n               \n               WITH P2 AS (SELECT player2, COUNT(player1)\n               FROM aus_ladies_2013\n               GROUP BY  player2)\n               \n               SELECT column1 [, column2 ]\n               FROM table1 [, table2 ]\n               UNION ALL\n               SELECT column1 [, column2 ]\n               FROM table1 [, table2 ]\n               \n               Select player1, From P1;'] (Background on this error at: http://sqlalche.me/e/f405)
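The syntax error above comes from the second `WITH`: a statement takes a single `WITH` clause, and additional common table expressions are separated by commas. The bracketed `SELECT column1 [, column2 ]` lines are template placeholders rather than valid SQL, and the trailing comma in `Select player1, From P1` would also fail. The sketch below shows the comma-separated form on a tiny in-memory SQLite table; the table `matches`, its rows, and the guess that the goal was a per-player match count are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matches (player1 TEXT, player2 TEXT)")
conn.executemany(
    "INSERT INTO matches VALUES (?, ?)",
    [("Ana", "Bea"), ("Ana", "Cleo"), ("Bea", "Cleo")],
)

# One WITH keyword, two CTEs separated by a comma.
query = """
WITH p1 AS (SELECT player1 AS name, COUNT(*) AS n
            FROM matches
            GROUP BY player1),
     p2 AS (SELECT player2 AS name, COUNT(*) AS n
            FROM matches
            GROUP BY player2)
SELECT name, SUM(n) AS matches_played
FROM (SELECT name, n FROM p1
      UNION ALL
      SELECT name, n FROM p2) AS both_sides
GROUP BY name
ORDER BY name;
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The same `WITH a AS (...), b AS (...)` shape applies in Postgres, so the query text could be passed to `pd.read_sql_query` once the table name is swapped in.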

In [6]:
sql_command = '''
CREATE TABLE final AS

SELECT  player1 AS name,
      'M' AS gender,
      'US' AS tournament,
      result AS win,
      FSP_1 AS fsp,
      DBF_1 AS dbf,
      UFE_1 AS ufe
FROM    us_men_2013

UNION ALL

SELECT  player2 AS name,
      'M' AS gender,
      'US' AS tournament,
      1-result AS win,
      FSP_2 AS fsp,
      DBF_2 AS dbf,
      UFE_2 AS ufe
FROM    us_men_2013

UNION ALL

SELECT  player1 AS name,
      'M' AS gender,
      'AUS' AS tournament,
      result AS win,
      FSP_1 AS fsp,
      DBF_1 AS dbf,
      UFE_1 AS ufe
FROM    aus_men_2013

UNION ALL

SELECT  player2 AS name,
      'M' AS gender,
      'AUS' AS tournament,
      1-result AS win,
      FSP_2 AS fsp,
      DBF_2 AS dbf,
      UFE_2 AS ufe
FROM    aus_men_2013

UNION ALL

SELECT  player1 AS name,
      'M' AS gender,
      'French' AS tournament,
      result AS win,
      FSP_1 AS fsp,
      DBF_1 AS dbf,
      UFE_1 AS ufe
FROM    french_men_2013

UNION ALL

SELECT  player2 AS name,
      'M' AS gender,
      'French' AS tournament,
      1-result AS win,
      FSP_2 AS fsp,
      DBF_2 AS dbf,
      UFE_2 AS ufe
FROM    french_men_2013

UNION ALL

SELECT  player1 AS name,
      'M' AS gender,
      'wimbledon' AS tournament,
      result AS win,
      FSP_1 AS fsp,
      DBF_1 AS dbf,
      UFE_1 AS ufe
FROM    wim_men_2013

UNION ALL

SELECT  player2 AS name,
      'M' AS gender,
      'wimbledon' AS tournament,
      1-result AS win,
      FSP_2 AS fsp,
      DBF_2 AS dbf,
      UFE_2 AS ufe
FROM    wim_men_2013

UNION ALL

SELECT  player1 AS name,
      'F' AS gender,
      'wimbledon' AS tournament,
      result AS win,
      FSP_1 AS fsp,
      DBF_1 AS dbf,
      UFE_1 AS ufe
FROM    wim_women_2013

UNION ALL

SELECT  player2 AS name,
      'F' AS gender,
      'wimbledon' AS tournament,
      1-result AS win,
      FSP_2 AS fsp,
      DBF_2 AS dbf,
      UFE_2 AS ufe
FROM    wim_women_2013

UNION ALL

SELECT  player1 AS name,
      'F' AS gender,
      'French' AS tournament,
      result AS win,
      FSP_1 AS fsp,
      DBF_1 AS dbf,
      UFE_1 AS ufe
FROM    french_women_2013

UNION ALL

SELECT  player2 AS name,
      'F' AS gender,
      'French' AS tournament,
      1-result AS win,
      FSP_2 AS fsp,
      DBF_2 AS dbf,
      UFE_2 AS ufe
FROM    french_women_2013

UNION ALL

SELECT  player1 AS name,
      'F' AS gender,
      'AUS' AS tournament,
      result AS win,
      FSP_1 AS fsp,
      DBF_1 AS dbf,
      UFE_1 AS ufe
FROM    aus_ladies_2013

UNION ALL

SELECT  player2 AS name,
      'F' AS gender,
      'AUS' AS tournament,
      1-result AS win,
      FSP_2 AS fsp,
      DBF_2 AS dbf,
      UFE_2 AS ufe
FROM    aus_ladies_2013

UNION ALL

SELECT  player1 AS name,
      'F' AS gender,
      'US' AS tournament,
      result AS win,
      FSP_1 AS fsp,
      DBF_1 AS dbf,
      UFE_1 AS ufe
FROM    us_women_2013

UNION ALL

SELECT  player2 AS name,
      'F' AS gender,
      'US' AS tournament,
      1-result AS win,
      FSP_2 AS fsp,
      DBF_2 AS dbf,
      UFE_2 AS ufe
FROM    us_women_2013;
          '''
cnx.execute(sql_command)

ProgrammingError: (psycopg2.ProgrammingError) relation "final" already exists
 [SQL: "\nCREATE TABLE final AS\n\nSELECT  player1 AS name, ...\nFROM    us_women_2013;\n          "] (Background on this error at: http://sqlalche.me/e/f405)
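The error above means the `final` table was already created by an earlier run of this cell. One common way to make such a cell re-runnable is to drop the table first with `DROP TABLE IF EXISTS`. The sketch below shows that pattern, together with the `UNION ALL` reshaping used in the statement: one row per player per match, where the second player's `win` is `1 - result`. It runs against in-memory SQLite, and the tables `demo_matches` and `demo_final` and their rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_matches (player1 TEXT, player2 TEXT, result INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_matches VALUES (?, ?, ?)",
    [("Ana", "Bea", 1), ("Cleo", "Ana", 0)],
)


def rebuild_long_table(connection):
    """Recreate demo_final from scratch so the call can be repeated safely."""
    connection.execute("DROP TABLE IF EXISTS demo_final")
    connection.execute(
        """
        CREATE TABLE demo_final AS
        SELECT player1 AS name, result AS win FROM demo_matches
        UNION ALL
        SELECT player2 AS name, 1 - result AS win FROM demo_matches
        """
    )


rebuild_long_table(conn)
rebuild_long_table(conn)  # the second call does not raise "already exists"

rows = conn.execute(
    "SELECT name, win FROM demo_final ORDER BY name, win"
).fetchall()
print(rows)
conn.close()
```

Dropping the table discards whatever it held, so this only suits tables that are fully derived from other data, as `final` is here.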

In [8]:
sql_query = '''SELECT *
               FROM final
               LIMIT 5;'''

pd.read_sql_query(sql_query, cnx)

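| | name | gender | tournament | win | fsp | dbf | ufe |
|---|---|---|---|---|---|---|---|
| 0 | Richard Gasquet | M | US | 1 | 63.0 | 7 | |
| 1 | Stephane Robert | M | US | 1 | 61.0 | 2 | |
| 2 | Jan-Lennard Struff | M | US | 0 | 55.0 | 13 | |
| 3 | Aljaz Bedene | M | US | 0 | 52.0 | 8 | |
| 4 | Feliciano Lopez | M | US | 1 | 58.0 | 3 | |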


In [9]:
# Question 1: matches played by each player in each tournament
sql_query = '''SELECT name,tournament, COUNT(name) AS matches_played
               FROM final
               GROUP BY  name,tournament
               ORDER BY name ASC
               LIMIT 10;'''

pd.read_sql_query(sql_query, cnx)

| | name | tournament | matches_played |
|---|---|---|---|
| 0 | A Barty | US | 1 |
| 1 | A.Beck | wimbledon | 2 |
| 2 | A.Bedene | wimbledon | 1 |
| 3 | A.Bogomolov Jr. | wimbledon | 1 |
| 4 | A.Cadantu | wimbledon | 2 |
| 5 | A Cornet | US | 3 |
| 6 | A.Cornet | wimbledon | 3 |
| 7 | A.Dolgopolov | wimbledon | 2 |
| 8 | Adrian Mannarino | French | 1 |
| 9 | Adrian Mannarino | US | 3 |


In [10]:
# Question 2: the ten players with the most matches played, with gender
sql_query = '''SELECT name,gender, COUNT(name) AS matches_played
               FROM final
               GROUP BY  name,gender
               ORDER BY matches_played  DESC
               LIMIT 10;'''

pd.read_sql_query(sql_query, cnx)

| | name | gender | matches_played |
|---|---|---|---|
| 0 | Rafael Nadal | M | 21 |
| 1 | Stanislas Wawrinka | M | 17 |
| 2 | Novak Djokovic | M | 17 |
| 3 | David Ferrer | M | 17 |
| 4 | Roger Federer | M | 15 |
| 5 | Tommy Robredo | M | 14 |
| 6 | Richard Gasquet | M | 13 |
| 7 | Serena Williams | F | 11 |
| 8 | Mikhail Youzhny | M | 11 |
| 9 | Maria Sharapova | F | 11 |


In [11]:
# Question 3: the ten highest single-match fsp values, one per player
sql_query = '''SELECT name, MAX(fsp) AS highest_fsp
               FROM final
               GROUP BY  name
               ORDER BY highest_fsp  DESC
               LIMIT 10;'''

pd.read_sql_query(sql_query, cnx)

| | name | highest_fsp |
|---|---|---|
| 0 | S Errani | 93.0 |
| 1 | Sara Errani | 91.0 |
| 2 | Victoria Azarenka | 88.0 |
| 3 | Kurumi Nara | 86.0 |
| 4 | Anabel Medina Garrigues | 86.0 |
| 5 | V.Hanescu | 85.0 |
| 6 | M.Niculescu | 85.0 |
| 7 | Gael Monfils | 84.0 |
| 8 | Rafael Nadal | 84.0 |
| 9 | Carlos Berlocq | 84.0 |


In [17]:
# Question 4: for the three players with the most wins, ufe as a percentage of ufe + dbf
sql_query = '''WITH top_players AS (SELECT name, SUM(win) AS total_wins
                            FROM final
                            GROUP BY name 
                            ORDER BY total_wins  DESC
                            LIMIT 3)

                SELECT name, SUM(ufe*100)/SUM(ufe + dbf) AS percentage
                FROM final
                WHERE name IN (SELECT name FROM top_players)
                GROUP BY name
               ;'''

pd.read_sql_query(sql_query, cnx)

| | name | percentage |
|---|---|---|
| 0 | Stanislas Wawrinka | 91 |
| 1 | David Ferrer | 90 |
| 2 | Rafael Nadal | 93 |
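The percentages above come back as whole numbers, which suggests that `ufe` and `dbf` are integer columns, so `SUM(ufe*100)/SUM(ufe + dbf)` is evaluated with integer division and the fractional part is dropped. The sketch below reproduces the effect on a made-up in-memory SQLite table named `stats`; multiplying by `100.0` instead of `100` keeps the decimals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (name TEXT, ufe INTEGER, dbf INTEGER)")
conn.executemany(
    "INSERT INTO stats VALUES (?, ?, ?)",
    [("Ana", 20, 3), ("Ana", 15, 2)],
)

# First column: integer arithmetic. Second column: floating-point arithmetic.
int_pct, float_pct = conn.execute(
    """
    SELECT SUM(ufe * 100) / SUM(ufe + dbf),
           ROUND(SUM(ufe * 100.0) / SUM(ufe + dbf), 1)
    FROM stats
    """
).fetchone()
print(int_pct, float_pct)  # 87 87.5
conn.close()
```

Here the true ratio is 3500 / 40 = 87.5; the integer form reports 87.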


Finally, close the SSH tunnel once the queries are done.

In [15]:
server.close()