Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

## Set User Credentials

With a shared resource at a provider like a MySQL RDBMS, we need to use credentials to authenticate ourselves to the server, and need the logical location of the server itself.

For these notebooks, these are kept in a text file named 'creds.json', stored either in the same directory or in a data directory.  For this notebook, this is stored in the same directory as the notebook.

- Right click on the `creds.json` file and select *Open With*->*Editor*
- The server should be correct, mapped to `"hadoop2.mathsci.denison.edu"`. Likewise, the scheme should be correct, mapped to `"mysql+mysqlconnector"`.  
- Replace the value of the `"user"` key in the `mysql` dictionary (currently `"nostudent"`) with the base part of your email address (i.e. without the `@denison.edu`).  Your password on the MySQL server is, at present, the same as your username, so change the `"pass"` value from `"nostudent"` as well.  
- **NEW INSTRUCTIONS** Lastly, change `database` from `"nostudent"` to your username, because you will be creating and dropping tables, and this must happen in your own space.

**Make sure to use double quotes for strings** ... this is `JSON`, not Python, and we have to follow JSON syntax.
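As a rough illustration of what the edited file is expected to look like, the sketch below parses a JSON string with the same five keys that the connection cell reads from the `mysql` dictionary. The username `student1` is a placeholder, and the real `creds.json` may contain other entries; this is an assumption-laden sketch rather than the actual file.

```python
import json

# Hypothetical contents of creds.json; "student1" is a placeholder username.
creds_text = """
{
  "mysql": {
    "scheme": "mysql+mysqlconnector",
    "server": "hadoop2.mathsci.denison.edu",
    "user": "student1",
    "pass": "student1",
    "database": "student1"
  }
}
"""

mysql = json.loads(creds_text)["mysql"]

# Same template shape as the connection cell: scheme://user:pass@server/database
cstring = "{}://{}:{}@{}/{}".format(
    mysql["scheme"], mysql["user"], mysql["pass"], mysql["server"], mysql["database"]
)

# Single-quoted strings are valid Python but not valid JSON.
try:
    json.loads("{'user': 'student1'}")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False
```

If the file fails to load in the connection cell, a stray single quote or a trailing comma is a likely cause.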

Once this is complete, execute the following cell to connect to the database using SQLAlchemy. If you are off-campus you will need to use a VPN first.

We have set it up so that you have **both** sqlalchemy engine/connection objects **and** a connection to allow the use of cell magics.  During the course of the notebook, you will use the sql magics to allow trial and experimentation, but we will also develop functions that use the sqlalchemy engine/connection to allow us to abstract one or more SQL statements as a function.

In [None]:
import pandas as pd
import os
import os.path
import json
import sqlalchemy as sa

def getmysql_creds(dirname=".",filename="creds.json"):
    """ Using directory and filename parameters, open a credentials file
        and obtain the five parts needed for a connection string to
        a remote provider using the "mysql" dictionary within
        an outer dictionary.  
        
        Return a scheme, server, user, password, and database
    """
    assert os.path.isfile(os.path.join(dirname, filename))
    with open(os.path.join(dirname, filename)) as f:
        D = json.load(f)
    mysql = D["mysql"]
    return mysql["scheme"], mysql["server"], mysql["user"], mysql["pass"],mysql["database"]

scheme, server, u, password, database = getmysql_creds()
template = '{}://{}:{}@{}/{}'

cstring = template.format(scheme, u, password, server,database)

engine=sa.create_engine(cstring)

print(cstring) # you should be in your personal 
               # database space now, if you edited the JSON

In [None]:
%load_ext sql

In [None]:
%sql $cstring

If we query the database server for the set of tables defined for the default database, we should see an "empty" database.

In [None]:
%sql SHOW TABLES

Since the Python variable `database` references a string naming our default database (the same as our username), if we ever change our default database, we can change the default back using the following cell.

In [None]:
%sql USE $database

In [None]:
%sql SHOW TABLES

**Q1** When we have defined a sound database structure, and identified the table names, column attributes, data types for the column attributes, and the single/composite primary key for a table, we can now tell SQL to create a table.

The first link below takes you to the TutorialsPoint tutorial section on creating SQL tables.  Read the section and then copy the CREATE TABLE SQL into the following cell (where it says `query`) and execute it, to create a table called `CUSTOMERS`.  Note: even though this query is not SELECTing something, it's still a query just like we've seen before.

Try and understand the format of the CREATE and how the column names are separated, the data types and constraints for the column are specified, and how the primary key (a singleton) is specified. 

Repeat the SHOW TABLES statement from above.  Then try to execute the CREATE a second time and read the resulting error message.

The second link below takes you to a set of pages that give more detail about the data types supported by our MySQL server.

- [Create Table](https://www.tutorialspoint.com/sql/sql-create-table.htm)  
- [MySQL Data Types](https://dev.mysql.com/doc/refman/5.7/en/data-types.html)
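To make the general shape of a CREATE TABLE statement concrete without giving away the `CUSTOMERS` answer, here is a minimal sketch on a hypothetical `PETS` table. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the MySQL server, so the exact error text and type handling will differ from what MySQL reports; the table and column names are invented for illustration.

```python
import sqlite3

# In-memory SQLite database: a stand-in for the course MySQL server.
conn = sqlite3.connect(":memory:")

# Columns are separated by commas; each has a name, a type, and optional
# constraints. The primary key is declared as a separate clause at the end.
create_pets = """
CREATE TABLE PETS (
    PetId   INT          NOT NULL,
    PetName VARCHAR(32)  NOT NULL,
    Weight  DECIMAL(6, 2),
    PRIMARY KEY (PetId)
)
"""
conn.execute(create_pets)

# Creating the same table a second time is an error.
try:
    conn.execute(create_pets)
    second_create_error = None
except sqlite3.OperationalError as err:
    second_create_error = str(err)

print(second_create_error)
conn.close()
```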

In [None]:
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
%sql $query

In [None]:
# Testing cell
assert 'CREATE' in query
assert 'CustAddress' in query
assert 'DECIMAL (18, 2)' in query

In [None]:
%sql SHOW TABLES

In [None]:
%sql SELECT * FROM CUSTOMERS

**Q2** The link below takes you to the TutorialsPoint tutorial section on removing (DROPing) an SQL table.  In the following cell, use SQL to delete the table you created.

[Drop Table](https://www.tutorialspoint.com/sql/sql-drop-table.htm)

In [None]:
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
%sql $query

In [None]:
# Testing cell

assert 'DROP' in query
assert 'TABLE' in query

If we strictly alternate between a DROP and a CREATE, we can iteratively refine our definition of a table in our database to get what we want.  However, sometimes we know we want to DROP a table, whether or not it already exists (and note that DROPing a table will **lose** all data contained in that table), and then CREATE our new version of the table.
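The difference between a plain DROP and a DROP that tolerates a missing table can be sketched as follows. This again uses an in-memory `sqlite3` database and a hypothetical `PETS` table as stand-ins, so treat it as an illustration of the idea rather than of MySQL's exact behavior or messages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A plain DROP fails when the table does not exist.
try:
    conn.execute("DROP TABLE PETS")
    plain_drop_failed = False
except sqlite3.OperationalError:
    plain_drop_failed = True

# DROP TABLE IF EXISTS succeeds whether or not the table is present.
conn.execute("DROP TABLE IF EXISTS PETS")
conn.execute("CREATE TABLE PETS (PetId INT NOT NULL, PRIMARY KEY (PetId))")
conn.execute("DROP TABLE IF EXISTS PETS")
conn.execute("DROP TABLE IF EXISTS PETS")  # repeating it is harmless

# No user tables remain after the final drops.
remaining = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
conn.close()
```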

The cell below gives the extended DROP syntax that allows us to DROP a table if it exists.  With an existing CUSTOMERS table, execute the following line, and then execute it again.

In [None]:
%sql DROP TABLE IF EXISTS CUSTOMERS

If we do a SHOW TABLES and know that the table exists, but are in the midst of development and want to remind ourselves of the column attributes and data types for a table, we can use the DESCRIBE SQL statement.  Re-create the CUSTOMERS table and then execute the cell below:

In [None]:
%sql DESCRIBE CUSTOMERS

**Q3** In the cell below, write a Python function 
`
dropcreate_CUSTOMERS(conn)
`
that, given an sqlalchemy connection, performs two steps: it DROPs the CUSTOMERS table if it exists, and then CREATEs the CUSTOMERS table.  In this case the function assumes the schema for the CUSTOMERS table, so there are no Python variables to bind; we simply need a string for each SQL command and the ability to execute it.

In your answer, please create these strings, then execute both, then return both strings (drop statement then create statement).
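The overall shape of such a function, shown on a hypothetical `PETS` table rather than `CUSTOMERS`, might look like the sketch below. It uses a `sqlite3` connection so that it runs without a server. With an sqlalchemy connection the execute call can differ: SQLAlchemy 2.x, for instance, expects the string to be wrapped as `sa.text(...)`, so check which version you have.

```python
import sqlite3

def dropcreate_pets(conn):
    """Drop the hypothetical PETS table if present, recreate it,
    and return the two SQL strings (drop first, then create)."""
    drop_stmt = "DROP TABLE IF EXISTS PETS"
    create_stmt = """
    CREATE TABLE PETS (
        PetId   INT         NOT NULL,
        PetName VARCHAR(32) NOT NULL,
        PRIMARY KEY (PetId)
    )
    """
    conn.execute(drop_stmt)
    conn.execute(create_stmt)
    return drop_stmt, create_stmt

conn = sqlite3.connect(":memory:")
drop_sql, create_sql = dropcreate_pets(conn)
drop_sql, create_sql = dropcreate_pets(conn)  # safe to call repeatedly

# SQLite's analogue of DESCRIBE: list the column names of the new table.
columns = [row[1] for row in conn.execute("PRAGMA table_info(PETS)")]
conn.close()
```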

In [None]:
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Run this cell
with engine.connect() as connection:
    d,c = dropcreate_CUSTOMERS(connection)

In [None]:
%sql SHOW TABLES

In [None]:
%sql DESCRIBE CUSTOMERS

In [None]:
# Testing cell
assert 'DROP' in d
assert 'EXISTS' in d
assert 'NOT NULL' in c
assert 'PRIMARY KEY' in c

**Q4** Say that we want to create the tables in a database for a dataset given to us as a csv file.  Consider the `movies.csv` assigned as a DataFrame to the `movies` variable below:

In [None]:
movies = pd.read_csv("movies.csv")
movies.head()

In [None]:
type(movies.iloc[0,0])

We see that the movies dataset has three columns: an integer giving a unique id for the movie, a title as a string, and genres, a vertical-bar-separated list of genres attributed to the given movie.  Setting aside the fact that this dataset does not conform to relational database standards of third normal form, let us build a SQL table that can simply hold the data as given to us.  

We need to come up with a table schema that, for the strings, uses a `VARCHAR(n)` for each attribute where `n` is large enough for the longest anticipated string in our dataset. For this exercise, choose `n` to be a power of 2 (this is a convention for the assignment, not a MySQL requirement). The cells below apply the `len` function to the `title` and `genres` columns and update the dataframe with these new columns.  We will not be using these columns in the SQL table, but they can inform our schema.
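Once the maximum string lengths are known, rounding up to a power of 2 can be sketched as below. The helper name `next_power_of_two` and the sample titles are invented for illustration; the real maximum lengths come from the dataframe columns computed in the cells that follow.

```python
def next_power_of_two(length):
    """Return the smallest power of 2 that is >= length (length >= 1)."""
    if length < 1:
        raise ValueError("length must be at least 1")
    # (length - 1).bit_length() is the number of bits needed for length - 1,
    # so shifting 1 by that amount rounds length up to a power of 2.
    return 1 << (length - 1).bit_length()

# Hypothetical titles standing in for the real column values.
sample_titles = ["Short Film (2001)", "A Much Longer Hypothetical Movie Title (2020)"]
longest = max(len(t) for t in sample_titles)
varchar_n = next_power_of_two(longest)
```

A value already equal to a power of 2 is returned unchanged, so a longest string of exactly 64 characters would still fit in `VARCHAR(64)`.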

In [None]:
movies['titlelen'] = movies.title.apply(len)
movies['genrelen'] = movies.genres.apply(len)
movies.head()

In [None]:
print(max(movies.titlelen), max(movies.genrelen))

**Q5** As we did with the `CUSTOMERS` table, write a function 
`
dropcreate_Movies(conn)
`
that drops the `Movies` table if it exists, and then creates a `Movies` table with fields for `movieId`, `title`, and `genres`, with appropriate data types and `movieId` as the primary key.  Suppose that it is OK for the genres field to be NULL, but not the title.  Also note that, while we are using the same column names in our SQL table as those present in the csv, this need not be the case, and that if we wanted to rename columns, we would need to do it at this point. 

As before, please return the drop statement and the create statement, after executing them.

In [None]:
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Run this cell
with engine.connect() as connection:
    d,c = dropcreate_Movies(connection)

In [None]:
%sql SHOW TABLES

In [None]:
%sql DESCRIBE Movies

In [None]:
# Testing cell
assert 'DROP' in d
assert 'EXISTS' in d
assert 'Movies' in d
assert 'NOT NULL' in c
assert 'PRIMARY KEY' in c
assert 'Movies' in c
assert '(' in c

You should now have a `Movies` table in your personal database space. The next step is to populate this new table with data.

The link below takes you to the TutorialsPoint tutorial on the SQL INSERT statement.  Read the page and then come back to experiment with the INSERT statements below.

[SQL INSERT](https://www.tutorialspoint.com/sql/sql-insert-query.htm)
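As a quick orientation before experimenting, the sketch below shows three common forms of INSERT on a hypothetical `PETS` table, using an in-memory `sqlite3` database as a stand-in for MySQL. The table, columns, and values are invented, and MySQL can be stricter than SQLite about types, so the experiments on `CUSTOMERS` may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE PETS (
    PetId   INT         NOT NULL,
    PetName VARCHAR(32) NOT NULL,
    Weight  DECIMAL(6, 2),
    PRIMARY KEY (PetId)
)
""")

# Form 1: name every column; values are matched to the listed columns in order.
conn.execute("INSERT INTO PETS (PetId, PetName, Weight) VALUES (1, 'Rex', 30.5)")

# Form 2: name a subset of columns, in any order; the omitted nullable
# column is left NULL.
conn.execute("INSERT INTO PETS (PetName, PetId) VALUES ('Mia', 2)")

# Form 3: no column list; values must follow the table's declared column order.
conn.execute("INSERT INTO PETS VALUES (3, 'Bo', 4.25)")

rows = conn.execute(
    "SELECT PetId, PetName, Weight FROM PETS ORDER BY PetId"
).fetchall()
conn.close()
```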

Depending on how you last left the schema for the CUSTOMERS table, you may be able to execute the following SQL INSERT statement.  If it does not work, use DESCRIBE and try to figure out why.  Once you are able to successfully insert the tuple, try to insert the same tuple a second time.  Explore the error message and see what the error is telling you.

In [None]:
%%sql 
INSERT INTO CUSTOMERS (ID, NAME, AGE, ADDRESS, SALARY)
       VALUES (1, 'Bressoud, Thomas', 42, '1234 Main St', 140000)

**Q6** Describe what happened the first time you executed the command above (i.e. what was displayed after you ran the cell, and what it meant). Then describe what happened the second time. Then explain why.

YOUR ANSWER HERE

In [None]:
%sql SELECT * FROM CUSTOMERS

Consider the INSERT in the cell below.  How does that differ from the first INSERT?  Try rearranging the order of the fields and explore what works and what does not work. Make it so that the `SELECT` statement below yields two rows in the database (one for `TB` and one for `MS`).

In [None]:
%%sql
INSERT INTO CUSTOMERS (ID, AGE, NAME)
       VALUES (2, 27, 'Mary Sykes')

In [None]:
%sql SELECT * FROM CUSTOMERS

**Q7** What happened when you swapped the order of AGE and NAME?

YOUR ANSWER HERE

Consider the INSERT in the cell below.  How does that differ from the first and second INSERT?  Try rearranging the order of the fields and explore what works and what does not work. Make it so that the `SELECT` statement afterwards yields three total entries in `CUSTOMERS`.

In [None]:
%%sql
INSERT INTO CUSTOMERS VALUES (3, 'Weinberg, Adam', '131 W. Broadway', 50, 350000)

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
%sql SELECT * FROM CUSTOMERS

**Q8** Explain what went wrong with the first insert statement, and how you fixed this. Please reference "type" in your answer.

YOUR ANSWER HERE

Consider the INSERT in the cell below.  How does that differ from the first and second INSERT?  Try rearranging the order of the fields and explore what works and what does not work.

In [None]:
%%sql   
INSERT INTO CUSTOMERS VALUES (4, 'Kretchmar, R. Matthew', 30, NULL, NULL)

In [None]:
%sql SELECT * FROM CUSTOMERS

**Q9** What is wrong with the four rows presently in your `CUSTOMERS` table? How would this be fixed?

YOUR ANSWER HERE

**Q10** Given what you have learned about INSERT, write a function:
`
insert_Movies(conn, ident, title, genres)
`
that, given an active sqlalchemy database connection and the triple of parameters of a movie identifier, a title, and a genres string, will insert the tuple into the `Movies` table.  Note that here you have variables to bind, and so you will have to refer back to our earlier work on this subject in the prior notebook on SQL programming.

The statement you execute in this problem will be a bound statement. Please return this bound statement (as a string) at the end of your function.
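For a reminder of what binding variables looks like in general, here is a minimal sketch using named placeholders with `sqlite3` on a hypothetical `PETS` table. The technique from the prior notebook may differ in detail; in SQLAlchemy, for instance, a similar `:name` placeholder style is available through `sa.text(...)` and its `bindparams()` method. The function and table names here are assumptions for illustration only.

```python
import sqlite3

def insert_pet(conn, pet_id, pet_name):
    """Insert one row using named placeholders instead of string formatting."""
    stmt = "INSERT INTO PETS (PetId, PetName) VALUES (:pet_id, :pet_name)"
    # The driver substitutes the values, so quotes inside a value are
    # handled safely and never become part of the SQL text.
    conn.execute(stmt, {"pet_id": pet_id, "pet_name": pet_name})
    return stmt

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PETS (PetId INT NOT NULL, PetName VARCHAR(32) NOT NULL, "
    "PRIMARY KEY (PetId))"
)
stmt = insert_pet(conn, 1, "O'Malley")  # the apostrophe causes no trouble
rows = conn.execute("SELECT PetId, PetName FROM PETS").fetchall()
conn.close()
```

Binding keeps values out of the SQL text, which is why a title containing an apostrophe or parentheses does not need special escaping.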

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Try it out, experimenting with invocations using the following cell example as a starting point.

In [None]:
# Run this cell
with engine.connect() as connection:
    s = insert_Movies(connection, 1, "New Movie (2020)", "Comedy")

In [None]:
%sql SELECT * from Movies

In [None]:
# Testing cell
assert 'INSERT' in str(s)
assert 'movieId' in str(s)
assert 'VALUES' in str(s)
assert 'genres' in str(s)

You can always "start again" by executing the `dropcreate_Movies()` function.

In [None]:
with engine.connect() as connection:
    dropcreate_Movies(connection)

In [None]:
%sql SELECT * from Movies

Please insert your five favorite movies into the table, then check with the command below that you have five movies.

In [None]:
%sql SELECT COUNT(*) from Movies

**Q11** Write a function
`
createpopulate_Movies(conn, df, n)
`
that, given an sqlalchemy connection, a dataframe containing the dataset constructed from `movies.csv`, and a row count `n`, will drop/create the Movies table and then iterate over the first `n` rows of the dataframe, adding each to the SQL table. Your function should call `dropcreate_Movies` and should not return anything. Instead, it populates the table in your personal database space.
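The drop/create-then-loop pattern, with a row limit like the third argument passed in the run cell below, can be sketched on a hypothetical `PETS` table as follows. A list of dictionaries stands in for the dataframe here; with pandas, something like `df.to_dict("records")` or `df.itertuples(index=False)` could supply the rows, though which one fits best is left as an assumption.

```python
import sqlite3
from itertools import islice

def createpopulate_pets(conn, records, limit):
    """Recreate the hypothetical PETS table and insert the first `limit` records."""
    conn.execute("DROP TABLE IF EXISTS PETS")
    conn.execute(
        "CREATE TABLE PETS (PetId INT NOT NULL, PetName VARCHAR(32) NOT NULL, "
        "PRIMARY KEY (PetId))"
    )
    insert_stmt = "INSERT INTO PETS (PetId, PetName) VALUES (:PetId, :PetName)"
    # islice stops after `limit` records without copying the whole sequence.
    for record in islice(records, limit):
        conn.execute(insert_stmt, record)
    conn.commit()

# Invented rows standing in for dataframe contents.
records = [{"PetId": i, "PetName": "pet{}".format(i)} for i in range(1, 11)]

conn = sqlite3.connect(":memory:")
createpopulate_pets(conn, records, 4)
count = conn.execute("SELECT COUNT(*) FROM PETS").fetchone()[0]
conn.close()
```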

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Make this cell work before running the one below.
mov = pd.read_csv('movies.csv')
with engine.connect() as connection:
    createpopulate_Movies(connection,mov,100)

In [None]:
# Testing cell
resultset = %sql SELECT * FROM Movies
resultdf = resultset.DataFrame()
assert resultdf.shape == (100,3)
assert '1995' in resultdf.iloc[0,1]
assert 'Romance' in resultdf.iloc[3,2]
resultdf.head()

**Q12** Discuss, from the point of view of database design, how to fix the issue in the third column of your new `Movies` table.

YOUR ANSWER HERE