
# HW 4 - Building a normalized RDB

The goal of this homework is to take a semi-structured non-normalized CSV file and turn it into a set of normalized tables that you then push to your postgres database on AWS.

The original dataset contains 100k district court decisions, although I've downsampled it to only 1000 rows to make the uploads faster. Each row contains info about a judge, their demographics, party affiliation, etc. Rows also contain information about the case they were deciding on. Was it a criminal or civil case? What year was it? Was the direction of the decision liberal or conservative?

While the current denormalized format is fine for analysis, it's not fine for a database as it violates many normalization rules. Your goal is to normalize it by designing a simple schema, then wrangling it into the proper dataframes, then pushing it all to AWS.

For the first part of this assignment you should wind up with three tables: one with case information, one with judge information, and one with casetype information. Each table should be reduced so that it contains no repeated rows, and each should be assigned a primary key. These tables should be called 'cases', 'judges', and 'casetype'.

For the last part you should make a rollup table that calculates the percent of liberal decisions for each party level and each case category. This gives a quick look at how the political party affiliation of judges impacts the direction of a decision for different case categories (e.g. criminal, civil, labor).

**Submission**
1) Run all cells.
2) Create a directory with your name.
3) Create a pdf copy of your notebook.
4) Download .py and .ipynb of the notebook.
5) Put all three files in it.
6) Zip and submit.

## Q1 Bring in data, explore, make schema - 3 points

Start by bringing in your data to `cases`. Call a `.head()` on it to see what columns are there and what they contain.

In [None]:
## Q1 Your code starts here
import pandas as pd
cases = pd.read_csv('https://docs.google.com/spreadsheets/d/1AWLK06JOlSKImgoHNTbj7oXR5mRfsL2WWeQF6ofMq1g/gviz/tq?tqx=out:csv')

In [None]:
# head of cases
...
## Q1 Your function ends here - Any code outside of these start/end markers won't be graded

### Make schema

OK, given that head, you need to make three related tables that will make up a normalized database. Those tables are 'cases', 'judges', and 'casetype'. If it's not clear what info should go into each, explore the data more.

*For each of the tables you create, keep the original column names from the imported cases file above. I'll be using these to test your tables, so if they don't match I won't be able to test them and you'll lose points.*

Remember, you might not have keys yet, and you'll need to reduce the rows, select certain columns, and so on. There isn't a defined path here.

***Include an image file of your schema in your zip file in order to get the 3 points***

## Q2 Make cases table - 6 points

Start by making a table that contains just each case's info. I would call this table that you're going to upload `cases_df` so you don't overwrite your raw data.

This table should have six columns and 1000 rows.

Note, one of these columns should be a judge_id that links to the judges table. You'll need to make this foreign key.

Also, you can leave 'category_name' in this table as well as its id. Normally you'd split that off into its own table as well, but you're already doing that for casetype, which is enough for now.
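
As a rough illustration of how a key like this can be built, here is a minimal sketch on made-up data. The `orders` dataframe and its column names are hypothetical and have nothing to do with the homework file. It assumes that a combination of descriptive columns identifies one entity, and uses pandas' `groupby(...).ngroup()` to give every row of the same entity the same integer. This is one possible approach, not the required one.

```python
import pandas as pd

# Toy data (hypothetical): one row per order, customer details repeated on each row.
orders = pd.DataFrame({
    "order_no": [101, 102, 103, 104],
    "customer_name": ["Ada", "Grace", "Ada", "Linus"],
    "customer_city": ["London", "New York", "London", "Helsinki"],
})

# ngroup() numbers each distinct (name, city) combination, so all rows that
# belong to the same customer receive the same integer key.
orders["customer_id"] = orders.groupby(["customer_name", "customer_city"]).ngroup()

print(orders[["order_no", "customer_id"]])
```

Note that `groupby` excludes missing values in the grouping columns by default, so it is worth checking those columns for NaNs before relying on a key built this way.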

In [None]:
## Q2 part 1 Your code starts here
# Make judge_id in cases
...

In [None]:
# select necessary columns to make cases_df
...

In [None]:
# Show the head of cases_df and print its shape
...
## Q2 part 1 Your function ends here - Any code outside of these start/end markers won't be graded

### Make cases table in your database

Put the helper functions (*get_conn_cur()*, *get_table_names()*, etc. from the previous NB and HW) here to create the connection.
Once you do that, you'll need to do the following:

* Connect, make a table called 'cases' with the correct column names and data types. Be sure to execute and commit the table.
* Make tuples of your data
* Write a SQL string that allows you to insert each tuple of data into the correct columns
* Execute the string many times to fill out 'cases'
* Commit changes and check the table.

I'm not going to leave a full roadmap beyond this. Feel free to add cells as needed to do the above.
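
The steps above can be sketched end to end on a toy table. This is only an illustration under stated assumptions, not the solution: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in so that it runs without AWS, and the `pets` table and its columns are made up. The overall pattern carries over to psycopg2, with the main difference that psycopg2 uses `%s` placeholders where sqlite3 uses `?`, and Postgres column types may be named differently.

```python
import sqlite3
import pandas as pd

# Toy dataframe (hypothetical) standing in for the data you want to upload.
pets = pd.DataFrame({"pet_id": [1, 2, 3], "name": ["Rex", "Tom", "Ivy"], "age": [4, 2, 7]})

# In-memory SQLite database, used here only so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1) Create the table with explicit column names and types, then commit.
cur.execute("CREATE TABLE pets (pet_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);")
conn.commit()

# 2) Turn each dataframe row into a plain tuple.
rows = list(pets.itertuples(index=False, name=None))

# 3) Parameterized INSERT. sqlite3 uses "?" placeholders; psycopg2 uses "%s".
insert_sql = "INSERT INTO pets (pet_id, name, age) VALUES (?, ?, ?);"

# 4) Run the INSERT once per tuple, then commit.
cur.executemany(insert_sql, rows)
conn.commit()

# 5) Check the table.
cur.execute("SELECT * FROM pets ORDER BY pet_id;")
result = cur.fetchall()
print(result)
conn.close()
```

Passing the values separately from the SQL string, rather than formatting them into it, lets the driver handle quoting for you.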

In [None]:
## Q2 part 2 Your code starts here
import psycopg2
aws_host = "test-hw-db.ctxkekv3vnim.us-east-2.rds.amazonaws.com"
def get_conn_cur(): # define function name and arguments (there aren't any)
  # Make a connection
  conn = psycopg2.connect(
    host=aws_host,
    database='hw3_db',
    user='postgres',
    password=...,
    port='5432')

  cur = conn.cursor()   # Make a cursor after the connection exists

  return conn, cur   # Return both the connection and the cursor

...
...
...
### This is an extra function I'm giving you that lets you drop tables from your RDB. This is vital because you can only create a given table once.
# If you try to create a table that already exists on your RDB, you'll get an error.
# I recommend calling this function one line above the code that creates your table, e.g. for cases you'd call it like this: my_drop_table('cases')
def my_drop_table(tab_name):
  conn, cur = get_conn_cur()
  # tab_name is formatted straight into the SQL string, so only pass table names you trust
  tq = """DROP TABLE IF EXISTS %s CASCADE;""" % tab_name
  cur.execute(tq)
  conn.commit()
  conn.close()   # Close the connection so repeated calls don't leave connections open

In [None]:
# Use sql_head to check cases
sql_head(table_name='cases')
## Q2 part 2 Your function ends here - Any code outside of these start/end markers won't be graded

## Q3 Make judges - 6 points

Now make your judges table from the original `cases` dataframe (not the SQL table you just made).

Judges should have five columns, including the `judge_id` column you made. There should be 553 rows after you drop duplicates (remember that judges may have had more than one case).

After you make the dataframe, push it to a SQL table called 'judges'.
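
Reducing a repeated-rows dataframe to one row per entity can be sketched like this. The data and column names are again made up for illustration, and the sketch assumes an id column has already been created. It selects only the columns that describe the entity and then drops the duplicate rows.

```python
import pandas as pd

# Toy data (hypothetical): customer details repeat once per order.
orders = pd.DataFrame({
    "order_no": [101, 102, 103, 104],
    "customer_id": [0, 1, 0, 2],
    "customer_name": ["Ada", "Grace", "Ada", "Linus"],
    "customer_city": ["London", "New York", "London", "Helsinki"],
})

# Keep only the columns that describe a customer, drop the repeated rows,
# and reset the index so it runs 0..n-1 again.
customers = (
    orders[["customer_id", "customer_name", "customer_city"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(customers.shape)
```

A quick sanity check afterwards is that the id column is unique, for example with `customers["customer_id"].is_unique`.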

In [None]:
## Q3 Your code starts here
#Your answer
...


In [None]:
#Run this cell
sql_head(table_name='judges')
## Q3 Your function ends here - Any code outside of these start/end markers won't be graded

## Q4 Make casetype - 6 points

Go make the casetype table. This should have only two columns that allow you to link the casetype name back to the ID in the 'cases' table. There should be 27 rows as well.

In [None]:
## Q4 Your code starts here
#Your answer
...

In [None]:
#run this cell
sql_head(table_name='casetype')
## Q4 Your function ends here - Any code outside of these start/end markers won't be graded

## Q5 A quick test of your tables - 3 points

Below is a query to get the number of unique judges that have ruled on criminal court motion cases. You should get a value of 119 as your return if your database is set up correctly!

In [None]:
## Nothing to code here! Just run this and, if it returns 119 you should get full points!
run_query("""SELECT COUNT(DISTINCT(judges.judge_id)) FROM cases
    JOIN judges ON cases.judge_id = judges.judge_id
        WHERE casetype_id = (SELECT casetype_id FROM casetype
                  WHERE casetype_name = 'criminal court motions'); """)


## Q6 Make rollup table - 6 points

Now let's make that rollup table! The goal here is to make a summary table that is easy to access. We're going to roll the whole thing up by the judge's party and the case category, but you could imagine doing this for each judge to track how they make decisions over time, which would be useful for an analytics database. The one we're making could also be used as a dimension table wherever we need overall party averages.

We want the percentage of liberal decisions at each grouping level (party_name, category_name). To get it we need, first, the number of cases seen at each level and, second, the number of liberal decisions made at each level. `cases` contains the column `libcon_id`, which is 0 if the ruling was conservative and 1 if it was liberal. Thus, you can get the percentage of liberal decisions by dividing the sum of that column by the total number of observations. Your `agg()` can get both the sum and the count.

After you groupby you'll need to reset the index, rename the columns, then make the percentage.

Once you do that, you can push the result to a SQL table called 'rollup'.
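
The groupby, reset, rename, and percentage steps described above can be sketched on toy data. The `games` dataframe and every column name in it are hypothetical, and this is just one way to write it. A 0/1 column is aggregated with both `count` and `sum`, and the ratio of the two gives the percentage.

```python
import pandas as pd

# Toy data (hypothetical): 'won' is 1 for a win and 0 for a loss.
games = pd.DataFrame({
    "team": ["red", "red", "red", "blue", "blue"],
    "venue": ["home", "home", "away", "home", "home"],
    "won": [1, 0, 1, 1, 1],
})

# Count the rows and sum the 0/1 column within each (team, venue) group.
rollup = games.groupby(["team", "venue"])["won"].agg(["count", "sum"])

# Turn the group labels back into ordinary columns.
rollup = rollup.reset_index()

# Give the aggregated columns descriptive names.
rollup = rollup.rename(columns={"count": "total_games", "sum": "num_wins"})

# Percentage of wins per group, rounded to a whole percent.
rollup["percent_won"] = (rollup["num_wins"] / rollup["total_games"] * 100).round()

print(rollup)
```

Because the column holds only 0s and 1s, its sum is the number of 1s, which is why sum divided by count gives the share of wins.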

Let's get started.

In [None]:
## Q6 Your code starts here
# Make a groupby called cases_rollup. This should group by party_name and category_name. It should aggregate the count and sum of libcon_id
...

In [None]:
# reset your index
...

In [None]:
# rename your columns now. Keep the first two the same but call the last two 'total_cases' and 'num_lib_decisions'
...

Now make a new column called 'percent_liberal'.

This should calculate the percentage of decisions that were liberal in nature. Multiply it by 100 so that it's a full percent. Also use the `round()` function on the whole thing to keep it in whole percentages.

In [None]:
# make your metric called 'percent_liberal'
...


Now go and push the whole thing to a table called 'rollup'.

There should be five columns and nine rows.

In [None]:
...

In [None]:
# Run this cell
sql_head('rollup')
## Q6 Your function ends here - Any code outside of these start/end markers won't be graded