## Demo: Interacting with PostgreSQL
For this demo, we'll load the penguin dataset into my local PostgreSQL database.

<img align="left" style="padding-right:10px;" src="figures_wk2/penguins_logo.png" width=150><br>
The Palmer Penguins dataset is one of the seaborn "built-in" datasets.

The seaborn library has access to a special GitHub repository that contains 17 different datasets.

To access one of these datasets, use seaborn's `load_dataset()`.

In [19]:
import seaborn as sns
import pandas as pd

In [21]:
penguins = sns.load_dataset('penguins')

**Palmer Penguins:** The dataset consists of 7 columns:

|field_name|description|data_type|
|---|---|---|
|species|penguin species (Chinstrap, Adélie, or Gentoo)|nominal|
|island|island name (Dream, Torgersen, or Biscoe)|nominal|
|culmen_length_mm|culmen length (mm)|continuous|
|culmen_depth_mm|culmen depth (mm)|continuous|
|flipper_length_mm|flipper length (mm)|continuous|
|body_mass_g|body mass (g)|continuous|
|sex|penguin sex|nominal|

Let's compare the anticipated dataset structure to the actual dataset.

In [24]:
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [29]:
penguins.describe(include = 'all')

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
count,344,344,342.0,342.0,342.0,342.0,333
unique,3,3,,,,,2
top,Adelie,Biscoe,,,,,Male
freq,152,168,,,,,168
mean,,,43.92193,17.15117,200.915205,4201.754386,
std,,,5.459584,1.974793,14.061714,801.954536,
min,,,32.1,13.1,172.0,2700.0,
25%,,,39.225,15.6,190.0,3550.0,
50%,,,44.45,17.3,197.0,4050.0,
75%,,,48.5,18.7,213.0,4750.0,


In [100]:
penguins.head(10)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,Female
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,Male
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,


So what do we know at this point?

Based on the dataset descriptive file, we were expecting our dataset to have a total of 7 columns: 3 text-based columns and 4 numeric columns. Also, we can see that all of our text-based columns are categorical, meaning that each of these columns has a finite set of possible values.
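The expectation described above can be written down as a quick check. This is a minimal sketch, assuming a small hand-made stand-in frame (`sample`) with the same seven column names that seaborn reports, not the real dataset. The `expected` mapping and the `check_structure` helper are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical stand-in with the same seven columns as the penguins frame.
sample = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "island": ["Torgersen", "Biscoe"],
    "bill_length_mm": [39.1, None],
    "bill_depth_mm": [18.7, None],
    "flipper_length_mm": [181.0, None],
    "body_mass_g": [3750.0, None],
    "sex": ["Male", None],
})

# Expected kind per column: 3 text-based and 4 numeric.
expected = {
    "species": "text",
    "island": "text",
    "bill_length_mm": "numeric",
    "bill_depth_mm": "numeric",
    "flipper_length_mm": "numeric",
    "body_mass_g": "numeric",
    "sex": "text",
}

def check_structure(df, expected):
    """Return a list of readable mismatches (empty if the structure matches)."""
    problems = []
    for col, kind in expected.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
            continue
        is_numeric = pd.api.types.is_numeric_dtype(df[col])
        if (kind == "numeric") != is_numeric:
            problems.append(f"{col}: expected {kind}, got {df[col].dtype}")
    extra = set(df.columns) - set(expected)
    problems.extend(f"unexpected column: {c}" for c in sorted(extra))
    return problems

problems = check_structure(sample, expected)
print(problems or "structure matches")
```

To run the same check on the real data, pass the `penguins` frame instead of `sample`.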

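Compared with the information we gathered from the actual dataset, the overall structure matches the descriptive file. The one difference is naming: the descriptive file calls the two bill measurements `culmen_length_mm` and `culmen_depth_mm`, while the seaborn copy names them `bill_length_mm` and `bill_depth_mm`. We also know at this point that the dataset has 344 entries and that some values are missing. Finally, we have summary statistics for each of the numeric columns.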
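The `info()` output already implies how many values are missing: 344 entries minus the non-null counts gives 2 missing values in each measurement column and 11 in `sex`. The sketch below shows one way to get those counts directly. It assumes a three-row stand-in frame (`sample`) modeled on rows 0, 3, and 8 of the `head(10)` output, so the numbers it prints describe the stand-in, not the full dataset.

```python
import pandas as pd

# Stand-in modeled on three rows of the head(10) output above.
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie"],
    "bill_length_mm": [39.1, None, 34.1],
    "sex": ["Male", None, None],
})

# Missing values per column.
missing = sample.isna().sum()

# Number of rows that have at least one missing value.
incomplete_rows = int(sample.isna().any(axis=1).sum())

print(missing[missing > 0])
print("rows with at least one missing value:", incomplete_rows)
```

The same two expressions work unchanged on the full `penguins` frame.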

### Creating a new database and schema
Before we proceed too far, let's take a quick peek at my local PostgreSQL database using pgAdmin4.
<img align="center" style="padding-right:10px;" src="figures_wk2/pgadmin_before.png" width=650><br>
A fresh PostgreSQL installation comes with a generic default database, **postgres**, that contains one schema named **public**.

To create a new database, click the Object drop-down, select Create, and then select Database.
<img align="center" style="padding-right:10px;" src="figures_wk2/create_db.png" width=450><br>

Enter the name of your new database and click Save.
<img align="center" style="padding-right:10px;" src="figures_wk2/create_db_2.png" width=450><br>

Now it's time to add a new schema. Select the newly created database from the tree on the left side of the pgAdmin screen. Then select Object -> Create again. This time pick Schema. As before, enter the schema name and click Save.
<img align="center" style="padding-right:10px;" src="figures_wk2/create_schema.png" width=650><br>

<img align="right" style="padding-left:10px;" src="figures_wk2/pgadmin_after.png" width=200><br>
If we expand the tree view (toggle '>') for the newly created raw schema, we can see a Tables entry. This is where we will be making a storage spot for our data.

Since this is a new PostgreSQL installation, there are no tables in this schema, which is fine. We will fix that.
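The pgAdmin clicks above correspond to two SQL statements. The sketch below only builds those statements as strings. The helper names (`quote_ident`, `create_database_sql`, `create_schema_sql`) are hypothetical, and the names `MSDS610` and `raw` are the ones used later in this demo. Two assumptions to keep in mind: PostgreSQL folds unquoted identifiers to lowercase, so the database name is double-quoted to keep its capitals, and `CREATE DATABASE` cannot run inside a transaction block, so it needs an autocommit connection.

```python
def quote_ident(name: str) -> str:
    """Double-quote a PostgreSQL identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'

def create_database_sql(db_name: str) -> str:
    # Must be executed outside a transaction block (autocommit).
    return f"CREATE DATABASE {quote_ident(db_name)};"

def create_schema_sql(schema_name: str) -> str:
    # IF NOT EXISTS makes the statement safe to re-run.
    return f"CREATE SCHEMA IF NOT EXISTS {quote_ident(schema_name)};"

print(create_database_sql("MSDS610"))
print(create_schema_sql("raw"))
```

The schema statement has to be run while connected to the new database, just as the schema was created under that database in pgAdmin.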


### Loading data
Back to our data.

In [50]:
penguins.shape

(344, 7)

If you don't have the psycopg2 package installed, you can use the following cell. Just uncomment it and run it, then comment it out again.

In [69]:
#!pip install psycopg2
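If you are not sure whether the package is already installed, a quick check avoids reinstalling it. This is a minimal sketch using the standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util

def is_installed(package: str) -> bool:
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package) is not None

if not is_installed("psycopg2"):
    print("psycopg2 not found; run the pip install cell above")
```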

Import the packages necessary to interact with the database.

In [72]:
from sqlalchemy import create_engine

Let's establish a few variables to make our code a bit more readable.

In [75]:
# Note: make sure you use the information from your specific PostgreSQL installation
host = r'127.0.0.1' # denotes that the db is a local installation
db = r'MSDS610' # db we just created
user = r'postgres' # using the postgres user for this demo
pw = r'postgres' # this is the password established during installation
port = r'5432' # default port established during install
schema = r'raw' # schema we just created

In [77]:
db_conn = create_engine("postgresql://{}:{}@{}:{}/{}".format(user, pw, host, port, db))
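The format string above works for the simple `postgres` password used in this demo. If your password contains characters such as `@` or `/`, they would be misread as part of the URL, so they need to be URL-encoded first. The sketch below shows one way to do that with the standard library. The `build_pg_url` helper and the sample password `p@ss/word` are made up for illustration.

```python
from urllib.parse import quote_plus

def build_pg_url(user, pw, host, port, db):
    """Build a PostgreSQL connection URL, URL-encoding the user and password."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(pw), host, port, db
    )

url = build_pg_url("postgres", "p@ss/word", "127.0.0.1", "5432", "MSDS610")
print(url)  # postgresql://postgres:p%40ss%2Fword@127.0.0.1:5432/MSDS610
```

The resulting string can be passed to `create_engine()` in place of the hand-formatted one.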

Let's test our connection to the database. I'm going to pull back a list of the tables that are in the **raw** schema of the **MSDS610** database. <br>
<i>Hint: We know there aren't any tables out there yet, but if there is anything wrong with the connection to the database, this query will tell us.</i>

In [80]:
sql = "select tables.table_name from information_schema.tables where (table_schema = '" + schema + "') order by 1;"
tbl_df = pd.read_sql(sql, db_conn, index_col=None)
tbl_df

Unnamed: 0,table_name


This is good news! Everything is matching up.
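The table-listing query above builds its SQL by string concatenation, which is fine for a fixed schema name in a demo. A variation that passes the schema as a bound parameter is sketched below, assuming SQLAlchemy's `text()` construct and the `params` argument of `pd.read_sql()`. The names `LIST_TABLES_SQL` and `list_tables` are hypothetical.

```python
import pandas as pd
from sqlalchemy import text

# :schema is a bound parameter; its value is sent separately from the SQL text.
LIST_TABLES_SQL = text(
    "SELECT table_name "
    "FROM information_schema.tables "
    "WHERE table_schema = :schema "
    "ORDER BY table_name"
)

def list_tables(conn, schema):
    """Return a one-column DataFrame of the table names in the given schema."""
    return pd.read_sql(LIST_TABLES_SQL, conn, params={"schema": schema})

# Usage with the engine from this demo (requires the running database):
# list_tables(db_conn, "raw")
```

With a bound parameter, a schema name containing a quote character cannot break or alter the query.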

### Time to load some data
We need to define a name for the table that we are about to create in our database.

In [85]:
table_name = r'penguin_data'

One of the reasons Pandas is popular is that it has a lot of built-in functions. Here we use the to_sql() function and let Pandas handle all the work for us. You can read more about this function on the [to_sql()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html) page.

In [88]:
penguins.to_sql(table_name, con=db_conn, if_exists='replace', index=False, schema=schema, chunksize=1000, method='multi')

344
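The `if_exists` argument decides what happens when the table is already there: `'replace'` drops and recreates it, `'append'` adds rows to it, and `'fail'` raises an error. The sketch below demonstrates the three modes. So that it runs without a database server, it assumes an in-memory SQLite connection as a stand-in for the PostgreSQL engine (SQLite has no schemas, so the `schema` argument is left out) and uses a two-row made-up frame.

```python
import sqlite3
import pandas as pd

# Made-up two-row frame; SQLite in memory stands in for PostgreSQL here.
df = pd.DataFrame({"species": ["Adelie", "Gentoo"],
                   "body_mass_g": [3750.0, 5000.0]})
conn = sqlite3.connect(":memory:")

def row_count(conn, table):
    """Return the number of rows currently stored in the table."""
    result = pd.read_sql(f"SELECT COUNT(*) AS n FROM {table}", conn)
    return int(result["n"].iloc[0])

df.to_sql("penguin_data", conn, if_exists="replace", index=False)
after_first = row_count(conn, "penguin_data")

# 'replace' drops the table and writes it again, so the count is unchanged.
df.to_sql("penguin_data", conn, if_exists="replace", index=False)
after_replace = row_count(conn, "penguin_data")

# 'append' keeps the existing rows and adds the new ones.
df.to_sql("penguin_data", conn, if_exists="append", index=False)
after_append = row_count(conn, "penguin_data")

# 'fail' refuses to touch an existing table.
try:
    df.to_sql("penguin_data", conn, if_exists="fail", index=False)
    failed = False
except ValueError:
    failed = True

conn.close()
print(after_first, after_replace, after_append, failed)  # 2 2 4 True
```

Using `'replace'` in this demo means the cell can be re-run without piling up duplicate rows.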

If everything has worked out, our raw schema should now have one table in it.

In [91]:
tbl_df = pd.read_sql(sql, db_conn, index_col=None)
tbl_df

Unnamed: 0,table_name
0,penguin_data


Hooray! We at least know that a table was created at this point.

### Retrieving data 
Okay, time to actually verify that our data was loaded to the database. For this, I'll retrieve the entire dataset.

In [94]:
sql = 'SELECT * FROM ' + schema + '.' + table_name
penguin_check = pd.read_sql(sql, db_conn, index_col=None)

In [96]:
penguin_check.head(10)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,Female
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,Male
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,
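Comparing the first ten rows by eye is a good start. A few programmatic checks can confirm that the whole table made the round trip. The sketch below is one way to do that, again assuming an in-memory SQLite connection as a stand-in for PostgreSQL and a three-row made-up frame, so it runs without a server.

```python
import sqlite3
import pandas as pd

# Made-up frame with missing values, modeled on the head(10) output.
original = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie"],
    "bill_length_mm": [39.1, None, 34.1],
    "sex": ["Male", None, None],
})

# Write the frame and read it back (SQLite stands in for PostgreSQL).
conn = sqlite3.connect(":memory:")
original.to_sql("penguin_data", conn, index=False)
loaded = pd.read_sql("SELECT * FROM penguin_data", conn)
conn.close()

# Same number of rows and columns, same column order,
# and the same number of missing values in every column.
same_shape = loaded.shape == original.shape
same_columns = list(loaded.columns) == list(original.columns)
same_missing = loaded.isna().sum().to_dict() == original.isna().sum().to_dict()

print(same_shape, same_columns, same_missing)  # True True True
```

In the demo itself, the same three comparisons can be made between `penguins` and the frame read back from the database.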


### Verifying through pgAdmin
In pgAdmin, right-click on the MSDS610 database and select Refresh.
<img align="center" style="padding-left:10px;" src="figures_wk2/pgadmin_refresh.png" width=400><br>

After the database refreshes, you should see the penguin_data table.
<img align="right" style="padding-left:10px;" src="figures_wk2/pgadmin_refresh_2.png" width=200><br>

There are several tutorials online that will show you how to further explore pgAdmin.