# SQL first steps
First, start the PostgreSQL service by running `sudo service postgresql start` in a terminal. By default the RDBMS is not running, so we need to turn it on.

Next, we need to consider the connection credentials. In this case, I've set up your Docker environments with a new default database called `si330`, with a username of `jovyan` and a password of `si330studentuser`. We'll be making all of our connections to `localhost`, which just means the connection goes from the Docker container back to itself.

In [1]:
# connection information
host="localhost"
dbname="si330"
user="jovyan"
password="si330studentuser"

To actually connect to the database from within Python we're going to use a library called `psycopg2`, which implements the API described in PEP 249. Recall that this API provides `Connection` objects, which are used to get `Cursor` objects.

In [2]:
# create a connection to the database
import psycopg2
conn = psycopg2.connect(host=host, dbname=dbname, user=user, password=password)
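
The `Connection` and `Cursor` pattern from PEP 249 is not specific to PostgreSQL. As a minimal sketch, here is the same flow using Python's built-in `sqlite3` module, which also follows PEP 249. The in-memory database, the `greetings` table, and the name `demo_conn` are all made up for illustration, and this is a stand-in for the course database, not a replacement for it.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for PostgreSQL here
demo_conn = sqlite3.connect(":memory:")   # a PEP 249 Connection
cur = demo_conn.cursor()                  # a PEP 249 Cursor obtained from it

# hypothetical table, created only for this demonstration
cur.execute("CREATE TABLE greetings (id INTEGER, message TEXT)")
cur.execute("INSERT INTO greetings VALUES (1, 'hello'), (2, 'world')")
demo_conn.commit()

# run a query, then pull rows back through the cursor
cur.execute("SELECT id, message FROM greetings ORDER BY id")
first_row = cur.fetchone()    # the next row, as a tuple
remaining = cur.fetchall()    # all rows not yet fetched, as a list of tuples
print(first_row, remaining)

cur.close()
demo_conn.close()
```

The shape is the same as with `psycopg2`: connect, get a cursor, execute, fetch, and close.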

In [3]:
# we can see information about the connection, including the closed field
print(conn)

<connection object at 0x7f1c45264db0; dsn: 'user=jovyan password=xxx dbname=si330 host=localhost', closed: 0>


The RDBMS model is especially strong when it comes to multiple users and ensuring that one user doesn't unintentionally impact others when working with data. One of the ways in which this is accomplished is through **transactions**. A transaction is a set of operations which must all happen together to be valid, like withdrawing money from an ATM or buying items from a vendor. As a data scientist, you tend not to use transactions much. To make our life easier here we are going to set our transaction size to be one query: every command we issue to the database will be treated as a single transaction.

In [4]:
# commit every statement immediately, so each query is its own transaction
conn.autocommit = True
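
To see what a multi-statement transaction buys you, here is a small sketch of the "things which must happen together" idea. It uses the built-in `sqlite3` module with an in-memory database so that it runs without the Postgres service. The `accounts` table, the `transfer` helper, and the name `demo_conn` are hypothetical and exist only for this illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL; `accounts` is a made-up table
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
demo_conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
demo_conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates are kept or neither is."""
    try:
        # sqlite3 uses ? placeholders (psycopg2 uses %s instead)
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        conn.commit()      # make both changes permanent together
        return True
    except sqlite3.IntegrityError:
        conn.rollback()    # undo the credit that already happened
        return False

ok_small = transfer(demo_conn, "alice", "bob", 30)    # alice can afford this
ok_large = transfer(demo_conn, "alice", "bob", 500)   # would overdraw alice
balances = dict(demo_conn.execute("SELECT name, balance FROM accounts"))
print(ok_small, ok_large, balances)
```

In the failed transfer the credit to `bob` has already been applied when the debit is rejected, and the rollback undoes it. With autocommit turned on, each statement is committed on its own, so there is nothing left to roll back as a pair. That is fine for the read-only queries in this notebook.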

## SELECT to get data

The general form of a SELECT statement is:
```
SELECT some_fields FROM some_table
WHERE some_condition_exists
[ GROUP BY some_field ]
[ ORDER BY some_field ]
```
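
As a quick sketch of how these clauses fit together, here is a query that uses all of them. It runs against an in-memory database from the built-in `sqlite3` module rather than Postgres, and the `sales` table and its values are invented for the example.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL; `sales` is a made-up table
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
demo_conn.execute(
    "INSERT INTO sales VALUES ('east', 10), ('west', 5), ('east', 20), ('north', 1), ('west', 7)"
)

sql = """
SELECT region, SUM(amount) AS total   -- some_fields
FROM sales                            -- some_table
WHERE amount > 2                      -- keep only rows that match the condition
GROUP BY region                       -- one output row per region
ORDER BY total DESC                   -- largest total first
"""
rows = demo_conn.execute(sql).fetchall()
print(rows)
```

The `north` row is filtered out by the `WHERE` clause before grouping, so only `east` and `west` appear in the result.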

Let's try this out. PostgreSQL has a system schema called `pg_catalog`, which contains `pg_tables`, a view that lists all of the tables in the database. Since we don't know which fields we should use, we can use `*` to select them all. The WHERE clause is optional, so we can leave it out.

In [5]:
# we can create a new cursor to start doing stuff just by calling conn.cursor()
# in python we want to treat a cursor a bit like a file, so we use a context manager
# this ensures that the cursor is closed and cleaned up at the end
with conn.cursor() as cur:
    sql="""
    SELECT * FROM pg_catalog.pg_tables
    """
    cur.execute(sql)
    results=cur.fetchall()
    for result in results:
        print(result)

('pg_catalog', 'pg_statistic', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_type', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_policy', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_authid', 'postgres', 'pg_global', True, False, False, False)
('pg_catalog', 'pg_user_mapping', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_subscription', 'postgres', 'pg_global', True, False, False, False)
('pg_catalog', 'pg_attribute', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_proc', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_class', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_attrdef', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_constraint', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_inherits', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_index', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg

The result is a set of rows, as tuples. The meaning is a bit unclear though - we have no column names! We have to do a bit more boilerplate to get this to work.

In [10]:
from psycopg2.extras import RealDictCursor
with conn.cursor(cursor_factory = RealDictCursor) as cur:
    sql="""
    SELECT * FROM pg_catalog.pg_tables
    """
    cur.execute(sql)
    results=cur.fetchall()
    # uncomment to see just the column names:
    # results[0].keys()
    for result in results:
        print(result)

RealDictRow([('schemaname', 'pg_catalog'), ('tablename', 'pg_statistic'), ('tableowner', 'postgres'), ('tablespace', None), ('hasindexes', True), ('hasrules', False), ('hastriggers', False), ('rowsecurity', False)])
RealDictRow([('schemaname', 'pg_catalog'), ('tablename', 'pg_type'), ('tableowner', 'postgres'), ('tablespace', None), ('hasindexes', True), ('hasrules', False), ('hastriggers', False), ('rowsecurity', False)])
RealDictRow([('schemaname', 'pg_catalog'), ('tablename', 'pg_policy'), ('tableowner', 'postgres'), ('tablespace', None), ('hasindexes', True), ('hasrules', False), ('hastriggers', False), ('rowsecurity', False)])
RealDictRow([('schemaname', 'pg_catalog'), ('tablename', 'pg_authid'), ('tableowner', 'postgres'), ('tablespace', 'pg_global'), ('hasindexes', True), ('hasrules', False), ('hastriggers', False), ('rowsecurity', False)])
RealDictRow([('schemaname', 'pg_catalog'), ('tablename', 'pg_user_mapping'), ('tableowner', 'postgres'), ('tablespace', None), ('hasindexes'
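
If you would rather not depend on a special cursor class, PEP 249 also specifies a `description` attribute on cursors, whose entries begin with the column name. Here is a minimal sketch of building dictionaries by hand from it, using the built-in `sqlite3` module and a made-up `pets` table. The same idea should carry over to a plain `psycopg2` cursor, though that is not shown here.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL; `pets` is a made-up table
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE pets (name TEXT, species TEXT)")
demo_conn.execute("INSERT INTO pets VALUES ('Rex', 'dog'), ('Tom', 'cat')")

cur = demo_conn.cursor()
cur.execute("SELECT name, species FROM pets ORDER BY name")

# each entry of cur.description is a sequence whose first item is the column name
column_names = [col[0] for col in cur.description]

# pair every value in a row with its column name
records = [dict(zip(column_names, row)) for row in cur.fetchall()]
print(column_names)
print(records)
```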

In [11]:
# It turns out we actually know how to make something beautiful from dictionary objects that represent rows
import pandas as pd
df=None

with conn.cursor(cursor_factory = RealDictCursor) as cur:
    sql="""
    SELECT * FROM pg_catalog.pg_tables
    """
    cur.execute(sql)
    df=pd.DataFrame(cur.fetchall())
df

,schemaname,tablename,tableowner,tablespace,hasindexes,hasrules,hastriggers,rowsecurity
0,pg_catalog,pg_statistic,postgres,,True,False,False,False
1,pg_catalog,pg_type,postgres,,True,False,False,False
2,pg_catalog,pg_policy,postgres,,True,False,False,False
3,pg_catalog,pg_authid,postgres,pg_global,True,False,False,False
4,pg_catalog,pg_user_mapping,postgres,,True,False,False,False
...,...,...,...,...,...,...,...,...
64,information_schema,sql_features,postgres,,False,False,False,False
65,information_schema,sql_implementation_info,postgres,,False,False,False,False
66,information_schema,sql_packages,postgres,,False,False,False,False
67,information_schema,sql_sizing,postgres,,False,False,False,False


Great! But this is not the point. We did this just for beauty and understanding; there is already a great way to bring data into a `DataFrame` from an SQL table, pandas' `read_sql()`, and I use it **all the time**. It's a staple in my workflows. But if we just load the whole table into a `DataFrame` we aren't really doing anything too fancy with SQL, and we are not taking advantage of the benefits of an RDBMS.

In [12]:
# for completeness, here's how we convert a query into a dataframe with pandas.
pd.read_sql("SELECT * FROM pg_catalog.pg_tables", conn)

,schemaname,tablename,tableowner,tablespace,hasindexes,hasrules,hastriggers,rowsecurity
0,pg_catalog,pg_statistic,postgres,,True,False,False,False
1,pg_catalog,pg_type,postgres,,True,False,False,False
2,pg_catalog,pg_policy,postgres,,True,False,False,False
3,pg_catalog,pg_authid,postgres,pg_global,True,False,False,False
4,pg_catalog,pg_user_mapping,postgres,,True,False,False,False
...,...,...,...,...,...,...,...,...
64,information_schema,sql_features,postgres,,False,False,False,False
65,information_schema,sql_implementation_info,postgres,,False,False,False,False
66,information_schema,sql_packages,postgres,,False,False,False,False
67,information_schema,sql_sizing,postgres,,False,False,False,False
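
To make "taking advantage of the RDBMS" a little more concrete, here is a sketch that computes the same per-schema table counts two ways: once by pulling every row into Python, and once by letting the database do the grouping. It uses the built-in `sqlite3` module and a tiny made-up `demo_tables` table that only imitates the shape of `pg_tables`.

```python
import sqlite3
from collections import Counter

# Assumption: in-memory SQLite stands in for PostgreSQL; `demo_tables` is a made-up table
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_tables (schemaname TEXT, tablename TEXT)")
demo_conn.executemany(
    "INSERT INTO demo_tables VALUES (?, ?)",
    [
        ("pg_catalog", "pg_type"),
        ("pg_catalog", "pg_class"),
        ("information_schema", "sql_features"),
    ],
)

# Option 1: fetch every row, then count in Python
all_rows = demo_conn.execute("SELECT schemaname, tablename FROM demo_tables").fetchall()
counts_in_python = dict(Counter(schema for schema, _ in all_rows))

# Option 2: ask the database for the counts, so only the summary rows are returned
counts_in_sql = dict(
    demo_conn.execute(
        "SELECT schemaname, COUNT(*) FROM demo_tables GROUP BY schemaname"
    ).fetchall()
)
print(counts_in_python, counts_in_sql)
```

Both give the same answer, but the second only returns one row per schema. On a table with millions of rows that difference starts to matter.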


In [7]:
# but wait, there's more! If we just want to browse the data from within jupyter we can use the cell magics
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [8]:
%%sql postgres://jovyan:si330studentuser@localhost:5432/si330
SELECT * FROM pg_catalog.pg_tables

69 rows affected.


schemaname,tablename,tableowner,tablespace,hasindexes,hasrules,hastriggers,rowsecurity
pg_catalog,pg_statistic,postgres,,True,False,False,False
pg_catalog,pg_type,postgres,,True,False,False,False
pg_catalog,pg_policy,postgres,,True,False,False,False
pg_catalog,pg_authid,postgres,pg_global,True,False,False,False
pg_catalog,pg_user_mapping,postgres,,True,False,False,False
pg_catalog,pg_subscription,postgres,pg_global,True,False,False,False
pg_catalog,pg_attribute,postgres,,True,False,False,False
pg_catalog,pg_proc,postgres,,True,False,False,False
pg_catalog,pg_class,postgres,,True,False,False,False
pg_catalog,pg_attrdef,postgres,,True,False,False,False


In [10]:
res = %sql SELECT * FROM pg_catalog.pg_tables
for item in res:
    print(item)

 * postgres://jovyan:***@localhost:5432/si330
69 rows affected.
('pg_catalog', 'pg_statistic', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_type', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_policy', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_authid', 'postgres', 'pg_global', True, False, False, False)
('pg_catalog', 'pg_user_mapping', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_subscription', 'postgres', 'pg_global', True, False, False, False)
('pg_catalog', 'pg_attribute', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_proc', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_class', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_attrdef', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_constraint', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_inherits', 'postgres', None, True, False, False, False)
('pg_catalog', 'pg_index',

In [12]:
res = %sql SELECT * FROM pg_catalog.pg_tables
res.DataFrame()

 * postgres://jovyan:***@localhost:5432/si330
69 rows affected.


,schemaname,tablename,tableowner,tablespace,hasindexes,hasrules,hastriggers,rowsecurity
0,pg_catalog,pg_statistic,postgres,,True,False,False,False
1,pg_catalog,pg_type,postgres,,True,False,False,False
2,pg_catalog,pg_policy,postgres,,True,False,False,False
3,pg_catalog,pg_authid,postgres,pg_global,True,False,False,False
4,pg_catalog,pg_user_mapping,postgres,,True,False,False,False
...,...,...,...,...,...,...,...,...
64,information_schema,sql_features,postgres,,False,False,False,False
65,information_schema,sql_implementation_info,postgres,,False,False,False,False
66,information_schema,sql_packages,postgres,,False,False,False,False
67,information_schema,sql_sizing,postgres,,False,False,False,False
