# Dev Setups: Connecting Python and SQL

This Jupyter notebook demonstrates how useful it is to connect Python to a relational database using a Python toolkit called SQLAlchemy. This tutorial follows the previous document, ***Testing Python and Data Science basic stack***.

**This notebook is for Mac OS and Windows specific instructions. See `DS_sql_dev_setup_linux.ipynb` for Linux.**

### First off, what is a relational database?

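Basically, it is a way to store information in tables of rows and columns so that it can be retrieved efficiently with queries, usually written in SQL. Tables can be linked to one another through shared key columns.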

MySQL and PostgreSQL are examples of relational databases.  For the purposes of an Insight project, you can use either one.
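To make the idea of tables and rows concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a server; the `hospitals` and `births` tables and all of their rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for a PostgreSQL server.
con = sqlite3.connect(":memory:")

# A table has named columns; each row is one record.
# Both tables and every row below are hypothetical illustration data.
con.execute("CREATE TABLE hospitals (hospital_id INTEGER PRIMARY KEY, name TEXT)")
con.execute(
    "CREATE TABLE births ("
    "birth_id INTEGER PRIMARY KEY, "
    "hospital_id INTEGER REFERENCES hospitals(hospital_id), "
    "infant_sex TEXT)"
)
con.executemany("INSERT INTO hospitals VALUES (?, ?)", [(1, "North"), (2, "South")])
con.executemany(
    "INSERT INTO births VALUES (?, ?, ?)",
    [(1, 1, "M"), (2, 1, "F"), (3, 2, "F")],
)

# The shared hospital_id column links the two tables, so one query can combine them.
rows = con.execute(
    "SELECT h.name, COUNT(*) FROM births b "
    "JOIN hospitals h ON b.hospital_id = h.hospital_id "
    "GROUP BY h.name ORDER BY h.name"
).fetchall()
print(rows)
con.close()
```

The same SQL statements would work in PostgreSQL with little or no change; only the connection step differs.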

Why would you use a relational database instead of a csv or two?

**A few reasons:**

- They scale to datasets that are too large to load into memory

- They are easy to query

- They support transactions for those cases where you need to write to a database, not just read from it

- They are used almost everywhere in industry, so you should get familiar with them, too.
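As a rough illustration of the transactions point, the sketch below groups two writes so that either both succeed or neither is kept. It uses the built-in `sqlite3` module as a stand-in for PostgreSQL, and the table definition is a simplified, hypothetical one.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical, simplified table: infant_sex may not be NULL.
con.execute(
    "CREATE TABLE birth_data_table (id INTEGER PRIMARY KEY, infant_sex TEXT NOT NULL)"
)
con.commit()

try:
    # Used as a context manager, the connection commits if the block succeeds
    # and rolls back every statement in it if the block raises.
    with con:
        con.execute("INSERT INTO birth_data_table VALUES (1, 'M')")
        con.execute("INSERT INTO birth_data_table VALUES (2, NULL)")  # violates NOT NULL
except sqlite3.IntegrityError:
    print("second insert failed, so the first one was rolled back too")

count = con.execute("SELECT COUNT(*) FROM birth_data_table").fetchone()[0]
print(count)
```

With a couple of CSV files there is no comparable guarantee: a crash between two writes can leave the files inconsistent with each other.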

***What does a relational database look like?***

We can take a look.  First we need to set up a few things. The first thing we want to do is to get a PostgreSQL server up and running.  




## Postgres Installation

**Mac OS installation:**
Go to http://postgresapp.com/ and follow the three steps listed in the Quick Installation Guide. 

**Windows OS installation:** 
Go to https://www.postgresql.org/download/windows/ to download the installer.

**If you're on a Mac, you might need to add `psql` to your PATH in order to interact with Postgres in the Terminal more easily**. See this website for info on bash profiles and PATH: https://hathaway.cc/2008/06/how-to-edit-your-path-environment-variables-on-mac/<br>

**Edit your .bash_profile in your home directory (or .zshrc if your Terminal uses zsh, the default shell since macOS Catalina). Since you already installed Anaconda, it should look something like:**<br>
```export PATH="/Users/YOUR_USER_NAME/anaconda/bin:$PATH"```

**Right below the line added by Anaconda, you can add this line:**<br>

```export PATH="/Applications/Postgres.app/Contents/Versions/latest/bin:$PATH"```

**Save and reload the bash profile in Terminal:**<br>
```source ~/.bash_profile```
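
One way to confirm that the PATH change took effect is to ask Python where it finds `psql`. This is a small sketch using only the standard library; `find_psql` is a hypothetical helper name, not part of any Postgres tooling.

```python
import shutil


def find_psql():
    """Return the full path of the psql executable, or None if it is not on PATH."""
    return shutil.which("psql")


psql_path = find_psql()
if psql_path is None:
    print("psql not found on PATH; re-check the export line in your profile")
else:
    print("psql found at", psql_path)
```

Note that a Jupyter server started before you edited the profile will still see the old PATH, so restart it before running this check.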



## Start your postgresql server

**There are multiple ways to launch a Postgres server. For now, let's stick with the following.**

**Right now the only PSQL user is 'postgres'. You can create your database and connect to it with that username in the terminal.** We're using a dataset on births, so we'll call it birth_db. <br>
``` createdb birth_db -U postgres```<br>
``` psql birth_db -U postgres```

**If you want to make a new user for this database, you can make one now. 
Note: the username in the line below must match your operating system username:**<br>
``` CREATE USER username SUPERUSER PASSWORD 'yourpassword';```<br>

**Exit out of PSQL (\q) and test logging in through this user:**<br>
``` psql birth_db -h localhost -U username```<br>
``` \c ```  (once in PSQL to check how you're logged in)<br>


## Set up SQLAlchemy

In Jupyter you can run command-line commands from a code cell with the "!" special character, as you'll see in the next cell. We do this here for ease, but it's generally considered poor practice. Run the following commands to install the packages Python needs to talk to a SQL database.

In [0]:
!pip install sqlalchemy_utils 
!conda install psycopg2 -y

In [0]:
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2
import pandas as pd

## Define and populate a database

Run the cells below to define your new database and populate it with data from the included CSV file. 

In [0]:
# Define a database name 
# Set your postgres username
dbname = 'birth_db'
username = 'April' # change this to your username

In [0]:
## 'engine' is a connection to a database
## Here, we're using postgres, but sqlalchemy can connect to other things too.
## Note: the URL scheme must be 'postgresql'; SQLAlchemy 1.4+ no longer accepts 'postgres'
engine = create_engine('postgresql://%s@localhost/%s' % (username, dbname))
print(engine.url)
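
The connection URL above has no password. If your server asks for one, it goes after the username, and any special characters in it need to be percent-encoded. The sketch below shows one way to build such a URL with the standard library; `build_postgres_url` and the example password are hypothetical.

```python
from urllib.parse import quote


def build_postgres_url(username, dbname, password=None, host="localhost", port=None):
    """Build a postgresql:// URL, percent-encoding the username and password."""
    user_part = quote(username, safe="")
    if password is not None:
        # Characters such as '@' or '/' would otherwise be read as URL delimiters.
        user_part += ":" + quote(password, safe="")
    host_part = host if port is None else "%s:%d" % (host, port)
    return "postgresql://%s@%s/%s" % (user_part, host_part, dbname)


print(build_postgres_url("April", "birth_db"))
print(build_postgres_url("April", "birth_db", password="p@ss/word", port=5432))
```

The resulting string can be passed to `create_engine` in place of the hand-formatted one. Avoid hard-coding a real password in a notebook you plan to share.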

In [0]:
## create a database (if it doesn't exist)
if not database_exists(engine.url):
    create_database(engine.url)
print(database_exists(engine.url))

In [0]:
# read the data from the included CSV and load it into a pandas dataframe
# you may need to add a path in the command below if you aren't working in the directory where you saved the CSV
birth_data = pd.read_csv('births2012_downsampled.csv', index_col=0)

In [0]:
## insert data into database from Python (proof of concept - this won't be useful for big data, of course)
birth_data.to_sql('birth_data_table', engine, if_exists='replace')

The above line (to_sql) is doing a lot of heavy lifting. It reads the dataframe, creates a table, and adds the data to the table, using the SQLAlchemy engine to talk to the database. So **SQLAlchemy is quite useful!**
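
To see roughly what that one line saves you, here is a hand-written sketch of the same steps: read a CSV, create a matching table, and insert the rows. It uses the built-in `csv` and `sqlite3` modules as stand-ins for pandas and PostgreSQL. The three rows are made up, and every column is stored as text, whereas `to_sql` also chooses column types for you.

```python
import csv
import io
import sqlite3

# Made-up rows that mimic a few columns of the births CSV.
csv_text = """infant_sex,gestation_weeks,delivery_method
M,39,Vaginal
F,40,Cesarean
M,38,Cesarean
"""

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # first line holds the column names
rows = list(reader)    # remaining lines are the data

con = sqlite3.connect(":memory:")

# Step 1: create a table whose columns match the CSV header.
# Column names are interpolated directly only because they come from our own header.
columns = ", ".join('"%s" TEXT' % name for name in header)
con.execute("DROP TABLE IF EXISTS birth_data_table")  # mirrors if_exists='replace'
con.execute("CREATE TABLE birth_data_table (%s)" % columns)

# Step 2: insert every row, using one placeholder per column.
placeholders = ", ".join("?" for _ in header)
con.executemany("INSERT INTO birth_data_table VALUES (%s)" % placeholders, rows)
con.commit()

n_rows = con.execute("SELECT COUNT(*) FROM birth_data_table").fetchone()[0]
print(n_rows)
```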

## Working with PostgreSQL without Python

**Open up the PostgreSQL app and click on the database you just created,** <br>

or alternatively type the following into the command line: <br>

 ```  psql -h localhost ``` <br>
 ```  \c birth_db ```

**You should see something like the following**

`You are now connected to database "birth_db" as user "April".`


**Then try the following query:**

 ```   SELECT * FROM birth_data_table; ```
    
Note that the semicolon marks the end of a statement.

### You can see the table we created!  But it's kinda ugly and hard to read.

Press q in your terminal at any time to get back to the command line. 

Try a few other sample queries.  Before you type in each one, ask yourself what you think the output will look like:

`SELECT * FROM birth_data_table WHERE infant_sex='M';`

`SELECT COUNT(infant_sex) FROM birth_data_table WHERE infant_sex='M';`

`SELECT COUNT(gestation_weeks), infant_sex FROM birth_data_table WHERE infant_sex = 'M' GROUP BY gestation_weeks, infant_sex;`

`SELECT gestation_weeks, COUNT(gestation_weeks) FROM birth_data_table WHERE infant_sex = 'M' GROUP BY gestation_weeks;`

All the above queries run, but they are difficult to visually inspect in the Postgres terminal.
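
If you want to check your predictions on something small enough to read at a glance, the sketch below runs two of the queries above against five made-up rows. It uses the built-in `sqlite3` module as a stand-in for PostgreSQL and adds an `ORDER BY` so that the output order is predictable.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE birth_data_table (infant_sex TEXT, gestation_weeks INTEGER)")
# Five made-up rows: three boys (two at 39 weeks, one at 40) and two girls.
con.executemany(
    "INSERT INTO birth_data_table VALUES (?, ?)",
    [("M", 39), ("M", 39), ("M", 40), ("F", 39), ("F", 41)],
)

# COUNT with a WHERE clause returns a single number.
male_count = con.execute(
    "SELECT COUNT(infant_sex) FROM birth_data_table WHERE infant_sex='M'"
).fetchone()[0]

# GROUP BY returns one row per distinct gestation_weeks value among the boys.
by_week = con.execute(
    "SELECT gestation_weeks, COUNT(gestation_weeks) FROM birth_data_table "
    "WHERE infant_sex = 'M' GROUP BY gestation_weeks ORDER BY gestation_weeks"
).fetchall()

print(male_count)
print(by_week)
```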

## Working with PostgreSQL in Python

In [0]:
# Connect to the database with psycopg2 so we can run queries
con = psycopg2.connect(database=dbname, user=username)

# query:
sql_query = """
SELECT * FROM birth_data_table WHERE delivery_method='Cesarean';
"""
birth_data_from_sql = pd.read_sql_query(sql_query,con)
birth_data_from_sql.head()
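
The query above has the value `'Cesarean'` written directly into the SQL string. When the value comes from a variable or from user input, it is safer to pass it as a parameter than to paste it into the string. The sketch below shows the pattern with the built-in `sqlite3` module as a stand-in; `births_by_method` and the rows are hypothetical. Note that `sqlite3` uses `?` as its placeholder while psycopg2 uses `%s`, and that `pd.read_sql_query` accepts the values through its `params` argument.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE birth_data_table (delivery_method TEXT, birth_weight REAL)")
# Made-up rows for illustration.
con.executemany(
    "INSERT INTO birth_data_table VALUES (?, ?)",
    [("Cesarean", 7.1), ("Vaginal", 6.8), ("Cesarean", 8.0)],
)


def births_by_method(con, method):
    """Return all rows with the given delivery method, passing the value as a parameter."""
    query = "SELECT * FROM birth_data_table WHERE delivery_method = ?"
    # The driver sends `method` separately from the SQL text, so quotes in it are harmless.
    return con.execute(query, (method,)).fetchall()


print(births_by_method(con, "Cesarean"))
# An injection attempt is treated as an ordinary string that matches nothing.
print(births_by_method(con, "Cesarean' OR '1'='1"))
```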

Once the data has been pulled into Python, we can use pandas methods to work with it.

In [0]:
%matplotlib inline
birth_data_from_sql.hist(column='birth_weight');

### Is reading from a SQL database faster than from a Pandas dataframe?  Probably not for the amount of data you can fit on your machine.

In [0]:
def get_data(sql_query, con):
    data = pd.read_sql_query(sql_query, con)
    return data

%timeit get_data(sql_query, con)

birth_data_from_sql = get_data(sql_query, con)
birth_data_from_sql.head()

In [0]:
def get_pandas_data(df, col, value):
    sub_df = df.loc[(df[col] == value)]
    return sub_df

%timeit get_pandas_data(birth_data, 'delivery_method', 'Cesarean')

birth_data_out = get_pandas_data(birth_data, 'delivery_method', 'Cesarean')
birth_data_out.head()

This should have given you a quick taste of how to use SQLAlchemy, as well as how to run a few SQL queries both at the command line and in Python. You can see that `pandas` is actually quite a bit faster than PostgreSQL here. This is because we're working with quite a small database (2716 rows × 37 columns), and communicating between Python and PostgreSQL adds overhead. But as your database gets bigger (and certainly when it's too large to store in memory), working with relational databases becomes a necessity.
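
`%timeit` only works inside Jupyter/IPython. In a plain Python script, the standard library's `timeit` module does the same job. The sketch below compares a database query with an in-memory filter on made-up data, using the built-in `sqlite3` module as a stand-in for PostgreSQL. Because SQLite runs inside the Python process, the gap you see will differ from the one you measured against a real server.

```python
import sqlite3
import timeit

# 3000 made-up records; every third one is a Cesarean delivery.
records = [("Cesarean" if i % 3 == 0 else "Vaginal", i) for i in range(3000)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (delivery_method TEXT, row_id INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", records)


def from_sql():
    """Filter by sending a query to the database."""
    return con.execute("SELECT * FROM t WHERE delivery_method = 'Cesarean'").fetchall()


def from_memory():
    """Filter the same records directly in Python memory."""
    return [r for r in records if r[0] == "Cesarean"]


# Run each function 200 times and report the total elapsed seconds.
sql_time = timeit.timeit(from_sql, number=200)
mem_time = timeit.timeit(from_memory, number=200)
print("database query: %.4f s, in-memory filter: %.4f s" % (sql_time, mem_time))
```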

#### Congrats! You now have Python and SQL ready to go!