Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below:

In [None]:
NAME = "your name here"

---

# Exercise 02 - Due Friday, September 29 at 12pm


**Objectives**:  Gain experience loading a CSV dataset into a database and using SQL to explore its contents.  Write and execute a number of SQL queries using common syntax and functions.

**Grading criteria**: All code cells should be executed with outputs, and questions should all be answered with SQL queries in the space provided, unless a text answer is requested.  The notebook itself should be completely reproducible; from start to finish, another person should be able to use the same code to obtain the same results as yours.

For this assignment, you need **not** add narrative description to most of your queries (except where explicitly noted), although you may do so if something you see in the data prompts you.  If you do, add new text cells and use Markdown formatting.

**Deadline**: Friday, September 29, 12pm.

**Suggestion**: if you have worked through the [Software Carpentry SQL lessons](http://swcarpentry.github.io/sql-novice-survey/) and have run through the last two lecture notes notebooks, this should all be fairly easy.  If you have done neither, do them now, before you begin.

# Part A (50 points)
Get the ```survey.db``` SQLite3 database file from the [Software Carpentry lesson](http://swcarpentry.github.io/sql-novice-survey/discussion.html) and connect to it.

In [None]:
!wget -O survey.db http://files.software-carpentry.org/survey.db

To work with it, we'll need the ipython-sql extension loaded, and then we'll need to connect to the db.

In [None]:
%load_ext sql

In [None]:
%sql sqlite:///survey.db

First, take a look at the data in the tables:

In [None]:
%sql SELECT * FROM Site;

In [None]:
%sql SELECT * FROM Visited;

In [None]:
%sql SELECT * FROM Person;

In [None]:
%sql SELECT * FROM Survey;
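
Besides looking at rows, it can help to look at the schema. The following is a minimal sketch using Python's built-in `sqlite3` module rather than ipython-sql; it builds a throwaway in-memory database with two made-up tables (`Place` and `Trip` are hypothetical, not the tables in `survey.db`) and lists each table's columns. Pointing `sqlite3.connect` at `survey.db` instead should work the same way.

```python
import sqlite3

# In-memory stand-in for a real database file; both tables are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Place (name TEXT, lat REAL, long REAL)")
conn.execute("CREATE TABLE Trip (id INTEGER, place TEXT, dated TEXT)")

# SQLite records every table definition in the sqlite_master catalog.
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = {
    table: [col[1] for col in conn.execute(f"PRAGMA table_info({table})")]
    for table in table_names
}

print(table_names)
print(columns)
```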

### Question 1

Describe in your own words what the following query produces:
```
SELECT DISTINCT Site.name 
FROM Site 
JOIN Visited
    ON Site.lat < -49.0 
       AND Site.name = Visited.site 
       AND Visited.dated < '1932-01-01';
```

**EDIT THIS CELL** WITH YOUR ANSWER HERE

### Question 2
Write a query that lists all salinity readings that are out of range (valid readings fall between 0 and 1) and the persons who are responsible for those readings. The result should show the name of the site, the date of the site visit, the type of measurement taken and its reading, followed by the personal name and family name of the person who took the measurement. Tip: you should get 2 records with 6 fields.
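
As a reminder of the general pattern (join tables on their shared key, then filter in `WHERE`), here is a small sketch on made-up data. The `staff` and `reading` tables, their columns, and the 0 to 14 range are all hypothetical and only illustrate the shape of such a query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy tables, unrelated to survey.db.
conn.executescript("""
    CREATE TABLE staff (id TEXT, first_name TEXT);
    CREATE TABLE reading (staff_id TEXT, kind TEXT, value REAL);
    INSERT INTO staff VALUES ('a', 'Ada'), ('b', 'Ben');
    INSERT INTO reading VALUES
        ('a', 'ph', 7.1), ('a', 'ph', 15.2), ('b', 'ph', 6.8), ('b', 'temp', 99.0);
""")

# Join each reading to the person who took it, then keep only the
# 'ph' readings that fall outside the (assumed) valid range 0..14.
rows = conn.execute("""
    SELECT staff.first_name, reading.kind, reading.value
    FROM reading
    JOIN staff ON reading.staff_id = staff.id
    WHERE reading.kind = 'ph'
      AND reading.value NOT BETWEEN 0 AND 14
""").fetchall()

print(rows)
```

Note that `BETWEEN` is inclusive at both ends, so `NOT BETWEEN 0 AND 14` treats 0 and 14 themselves as in range.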

In [None]:
-- YOUR CODE HERE

### Question 3
After further investigation, we realize that Valentina Roerich was reporting salinity as percentages. Write a query that returns all of her original salinity readings, followed by the readings divided by 100. Use the `ROUND` function to round the numbers to three decimal places. Rename these two attributes as "original_reading" and "corrected_reading". Tip: you should get 2 records.
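
If `ROUND` and `AS` are unfamiliar, the sketch below shows both on a hypothetical `reading` table with made-up values; the table, column names, and aliases are illustrative assumptions only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy data, unrelated to survey.db.
conn.executescript("""
    CREATE TABLE reading (kind TEXT, value REAL);
    INSERT INTO reading VALUES ('humidity', 41.5), ('humidity', 12.5), ('temp', 20.0);
""")

cursor = conn.execute("""
    SELECT value AS raw_value,
           ROUND(value / 100, 3) AS scaled_value
    FROM reading
    WHERE kind = 'humidity'
    ORDER BY value
""")

# cursor.description carries the output column names, including aliases.
column_names = [d[0] for d in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)
```

One thing to watch in SQLite: dividing two integers truncates the result, so if the column held integers you would divide by `100.0` instead of `100`.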

In [None]:
-- YOUR CODE HERE

### Question 4
Once you are happy with the corrected salinity measurements from Valentina Roerich in the previous question, write one SQL UPDATE statement to correct all of her salinity measurements in the Survey table. For simplicity's sake, you can assume that all the out-of-range salinity readings were reported by her. Tip: you should see 2 rows updated.
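
For reference, an `UPDATE` has the shape `UPDATE table SET column = expression WHERE condition`. The sketch below runs one against a hypothetical `reading` table with made-up values, using Python's `sqlite3` module so the number of affected rows can be checked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy data, unrelated to survey.db.
conn.executescript("""
    CREATE TABLE reading (kind TEXT, value REAL);
    INSERT INTO reading VALUES ('humidity', 41.5), ('humidity', 0.5), ('temp', 20.0);
""")

# Only rows matching the WHERE clause are modified.
cursor = conn.execute("""
    UPDATE reading
    SET value = value / 100
    WHERE kind = 'humidity' AND value > 1
""")
conn.commit()
updated = cursor.rowcount  # number of rows the UPDATE changed

# Verify that no out-of-range humidity values remain.
remaining = conn.execute(
    "SELECT COUNT(*) FROM reading WHERE kind = 'humidity' AND value > 1"
).fetchone()[0]

print(updated, remaining)
```

Because the `WHERE` clause only matches values that still need fixing, running the statement a second time changes nothing, which helps keep a notebook reproducible under "Run All".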

In [None]:
-- YOUR CODE HERE

Write a query that shows her salinity measurements have been fixed.

In [None]:
-- YOUR CODE HERE

### Question 5
Write a query that shows each site's name with its exact location (lat, long) and the visit date (in ascending order), followed by the personal name and family name of the person who visited the site and took the survey, and the type of measurement taken and its reading. Please avoid all null values. Tip: you should get 15 records with 8 fields.
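
Two details are easy to get wrong here: nulls must be tested with `IS NOT NULL` (a comparison such as `= NULL` never matches), and the sort belongs in `ORDER BY`. A minimal sketch on hypothetical `trip` and `reading` tables with made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy tables, unrelated to survey.db.
conn.executescript("""
    CREATE TABLE trip (id INTEGER, place TEXT, dated TEXT);
    CREATE TABLE reading (trip_id INTEGER, staff_id TEXT, value REAL);
    INSERT INTO trip VALUES
        (1, 'north', '2020-03-01'), (2, 'south', NULL), (3, 'east', '2020-01-15');
    INSERT INTO reading VALUES
        (1, 'a', 0.2), (2, 'a', 0.3), (3, NULL, 0.4), (3, 'b', 0.5);
""")

# Drop rows with a missing date or a missing staff id, then sort by date.
rows = conn.execute("""
    SELECT trip.place, trip.dated, reading.staff_id, reading.value
    FROM trip
    JOIN reading ON reading.trip_id = trip.id
    WHERE trip.dated IS NOT NULL
      AND reading.staff_id IS NOT NULL
    ORDER BY trip.dated ASC
""").fetchall()

print(rows)
```

Dates stored as ISO-8601 text (`YYYY-MM-DD`) sort correctly as plain strings, which is why `ORDER BY` works here without any date conversion.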

In [None]:
-- YOUR CODE HERE

# Part B (50 points)

In this part, we'll download a clean CSV dataset from data.gov, load it into a SQLite database, and perform a series of queries to answer several questions.  For each problem, write and execute queries that provide the answer in the cells provided, with your SQL queries in the places marked.

## Setup - obtain data and create database

The [Connecticut DMV Boating Registrations](http://catalog.data.gov/dataset/dmv-boating-registrations-2008-to-2014) dataset comprises several years of summary records.  It is available from data.gov.

First we download the dataset:

In [None]:
!wget --quiet -O boating.csv "https://data.ct.gov/api/views/mrb6-7ee5/rows.csv?accessType=DOWNLOAD"

Verify that it's what we think it is on the command line:

In [None]:
!head boating.csv | csvlook

Looks right.  How many records are there?

In [None]:
!wc -l boating.csv

So that should be 145, counting the header.  And the basic stats:

In [None]:
!csvstat boating.csv

Looks about right!  

Note, though, that the column names have spaces, punctuation, and Upper Cased Names.  That's annoying!  First let's rename the file.

In [None]:
!mv boating.csv boating-orig.csv

Okay, using output redirection and `tail` we can write a new header line.

In [None]:
!echo "year,tx_type,num" > boating.csv

In [None]:
!tail -n +2 boating-orig.csv >> boating.csv

In [None]:
!head boating.csv | csvlook

Much easier to work with now.
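
The same header swap can be sketched in Python, which may be handy where `tail` is not available. The function name `replace_header` and the sample header text are made up for illustration; like the `tail -n +2` approach, it assumes the original header occupies exactly one line.

```python
import io


def replace_header(src, dst, new_header):
    """Copy src to dst, swapping the first line for new_header."""
    next(src, None)               # discard the original header line
    dst.write(new_header + "\n")  # write the replacement header
    for line in src:              # copy the data rows unchanged
        dst.write(line)


# In-memory stand-ins for the files; the messy header below is invented.
original = io.StringIO("Some Column,Another Column!,Count #\n2008,X,1\n2009,Y,2\n")
cleaned = io.StringIO()
replace_header(original, cleaned, "year,tx_type,num")

print(cleaned.getvalue())
```

With real files, `src` and `dst` would be handles from `open("boating-orig.csv")` and `open("boating.csv", "w")`.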

Next we convert the updated CSV file into a SQLite database using csvkit. First we remove the database file if it exists, so that we can rerun this step repeatedly.

In [None]:
!rm -f boating.db

In [None]:
!csvsql --db sqlite:///boating.db --insert boating.csv
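
For comparison, here is roughly what loading a CSV into SQLite looks like with only Python's standard library. This is a sketch, not what `csvsql` does internally: the column types are an assumption (csvsql infers its own), and the two data rows are invented.

```python
import csv
import io
import sqlite3

# Invented sample rows using the cleaned-up header from above.
csv_text = "year,tx_type,num\n2008,SAMPLE TYPE A,10\n2009,SAMPLE TYPE B,25\n"
reader = csv.DictReader(io.StringIO(csv_text))

conn = sqlite3.connect(":memory:")
# Dropping first makes the load repeatable, like `rm -f boating.db`.
conn.execute("DROP TABLE IF EXISTS boating")
conn.execute("CREATE TABLE boating (year INTEGER, tx_type TEXT, num INTEGER)")

# Each CSV row is a dict, matched to the named placeholders by key.
conn.executemany("INSERT INTO boating VALUES (:year, :tx_type, :num)", reader)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM boating").fetchone()[0]
types = conn.execute("SELECT typeof(year), typeof(num) FROM boating LIMIT 1").fetchone()

print(count, types)
```

The CSV values arrive as strings; declaring the columns `INTEGER` lets SQLite's type affinity store them as integers.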

Now connect to the newly created database. If you get an error here, make sure the file `boating.db` exists and the ipython-sql extension is loaded (you loaded it in Part A).

In [None]:
%sql sqlite:///boating.db

In [None]:
%%sql
SELECT COUNT(*)
FROM boating;

Looks like the same number of rows!  We're good to go.

## Basic queries

In the following queries, we'll do some basic exploration of the data.  Let's first see what a few records look like.

In [None]:
%%sql
SELECT *
FROM boating
LIMIT 10;

This should look familiar!

Let's look at just the "change" types.

In [None]:
%%sql
SELECT *
FROM boating
WHERE tx_type = 'BOAT CHANGE OF TYPE';

How many records do we have here?

In [None]:
%%sql
SELECT COUNT(*)
FROM boating
WHERE tx_type = 'BOAT CHANGE OF TYPE';

Which year had the most of these transactions?

In [None]:
%%sql
SELECT *
FROM boating
WHERE tx_type = 'BOAT CHANGE OF TYPE'
ORDER BY num DESC;

...alright, your turn. Before we start, we need to understand the terminology used here: a _record_ refers to a row in our table, which is a summarized record; a _transaction_ refers to the registration of an individual boat.

### Question 6

Use `DISTINCT` to determine the unique set of transaction types in this dataset. Tip: you should get 21 records.

In [None]:
-- YOUR CODE HERE

### Question 7

Use `SUM` and `GROUP BY` to determine the overall number of transactions (across all years) per transaction type.
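
As a refresher, `GROUP BY` collapses rows that share a value and `SUM` adds up a column within each group. A minimal sketch on a hypothetical `sales` table with made-up numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy data, unrelated to the boating dataset.
conn.executescript("""
    CREATE TABLE sales (year INTEGER, category TEXT, units INTEGER);
    INSERT INTO sales VALUES
        (2020, 'apples', 5), (2021, 'apples', 7),
        (2020, 'pears', 2),  (2021, 'pears', 1);
""")

# One output row per category, with units added up across all years.
rows = conn.execute("""
    SELECT category, SUM(units) AS total_units
    FROM sales
    GROUP BY category
    ORDER BY category
""").fetchall()

print(rows)
```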

In [None]:
-- YOUR CODE HERE

### Question 8

Use `ORDER BY` and `LIMIT` to determine the top five types of transactions overall.

In [None]:
-- YOUR CODE HERE

### Question 9

Use a wildcard search to determine how many _transactions_ in 2012 involve canoes.
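
In SQL, wildcard matching is done with `LIKE`, where `%` stands for any run of characters. The sketch below, on a hypothetical `sales` table with made-up rows, also shows why counting matching rows and summing a quantity column give different answers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy data, unrelated to the boating dataset.
conn.executescript("""
    CREATE TABLE sales (year INTEGER, category TEXT, units INTEGER);
    INSERT INTO sales VALUES
        (2020, 'GREEN APPLE SALE', 5), (2020, 'APPLE RETURN', 2),
        (2020, 'PEAR SALE', 9),        (2021, 'GREEN APPLE SALE', 4);
""")

# '%APPLE%' matches any category containing APPLE anywhere in the text.
matching_records, total_units = conn.execute("""
    SELECT COUNT(*), SUM(units)
    FROM sales
    WHERE year = 2020
      AND category LIKE '%APPLE%'
""").fetchone()

print(matching_records, total_units)
```

`COUNT(*)` counts summary rows, while `SUM(units)` totals the quantity they represent. In SQLite, `LIKE` is case-insensitive for ASCII characters by default.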

In [None]:
-- YOUR CODE HERE

### Question 10

How do the transaction trends over time involving pontoons compare to overall boating transaction activity?  Discuss as appropriate, adding Markdown cells for your discussion after your exploratory queries.

In [None]:
-- YOUR CODE HERE

### Bonus (10 points)

Make a plot (inline, here, using Python) that demonstrates one or more overall trends in boat registrations in Connecticut, drawing data directly from the database.