# Introduction to Databases & SQL



Most DSSG teams store their data in a SQL database (specifically PostgreSQL). SQL suits our type of work well, for example because it can process large amounts of data efficiently. In this module, we will load our data into a PostgreSQL server and query it.

## Goals

- Learn SQL basics
- Be comfortable writing basic SQL queries

## Tasks
- Create schema
- Create tables
- Copy data to your tables
- Query those tables

## Tools
- psql (command line)
- DBeaver

## Create schema
```
create schema jwalsh;
```

## Create table
```
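# Print a CREATE TABLE statement inferred from the CSV
csvsql -i postgresql building_permits.csv
# Same, but strip the double quotes and lowercase the output so identifiers are easy to type
csvsql -i postgresql building_permits.csv | sed -E 's/"//g' | tr '[:upper:]' '[:lower:]'
csvsql -i postgresql building_violations.csv | sed -E 's/"//g' | tr '[:upper:]' '[:lower:]'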
```
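
To make the step above less of a black box: `csvsql -i` looks at the values in each column, picks a type, and prints a `CREATE TABLE` statement. The Python sketch below illustrates that idea on a tiny made-up CSV. It is not csvsql's actual algorithm; the helpers `infer_type` and `create_table_sql` and the sample rows are hypothetical.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest of INTEGER, FLOAT, or VARCHAR that fits every non-empty value."""
    non_empty = [v for v in values if v.strip() != ""]
    for sql_type, caster in (("INTEGER", int), ("FLOAT", float)):
        try:
            for v in non_empty:
                caster(v)
            return sql_type
        except ValueError:
            continue  # at least one value did not parse; try the next, wider type
    return "VARCHAR"


def create_table_sql(table_name, csv_text):
    """Build a CREATE TABLE statement with lowercased column names."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = []
    for i, name in enumerate(header):
        col_type = infer_type([row[i] for row in data])
        columns.append(f"    {name.lower()} {col_type}")
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(columns) + "\n);"


# Made-up sample data; the real file has many more columns.
sample_csv = "ID,ESTIMATED_COST,STREET_NAME\n1,1500.50,MAIN\n2,,STATE\n"
ddl = create_table_sql("building_permits", sample_csv)
print(ddl)
```

Always read the generated statement before running it: an inferred type is only as good as the rows that were sampled.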

## Copy data
Clean and copy the building-permits dataset:

```
cat building_permits.csv |
sed 's/\$//g' |
psql -h dssgsummer2014postgres.c5faqozfo86k.us-west-2.rds.amazonaws.com -U jwalsh -d training_2015 \
     -c "\COPY jwalsh.building_permits FROM STDIN WITH CSV HEADER;"
```
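
The `sed` step above removes dollar signs so that cost values can load into a numeric column. The sketch below shows the same clean-then-load idea in Python, using the built-in `sqlite3` module in place of the PostgreSQL server so it runs anywhere. The `permits` table and its two rows are made up for illustration.

```python
import csv
import io
import sqlite3

# Made-up rows in the style of the permits file, with "$" in the cost column.
raw_csv = "id,estimated_cost\n1,$1500.00\n2,$250.00\n"

# Equivalent of: sed 's/\$//g'
cleaned = raw_csv.replace("$", "")

rows = list(csv.DictReader(io.StringIO(cleaned)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (id INTEGER, estimated_cost REAL)")
# Named placeholders are filled from each row dictionary.
conn.executemany("INSERT INTO permits VALUES (:id, :estimated_cost)", rows)

total = conn.execute("SELECT SUM(estimated_cost) FROM permits").fetchone()[0]
print(total)
```

Note that the `sed` command strips every `$` in the file, not only those in the cost column, so it is worth checking that no text field relies on that character.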

Clean and copy the building-violations dataset (we'll use a sample to avoid Wi-Fi delays):

```
wget -O- https://raw.githubusercontent.com/dssg/data-challenges/master/BuildingInspections/data/Building_Violations_sample_50000.csv > building_violations_sample.csv

cat building_violations_sample.csv | tr '[:upper:]' '[:lower:]' | csvsql -i postgresql | sed 's/"//g'

cat building_violations_sample.csv |
sed 's/, ,/,,/g' |
psql -h dssgsummer2014postgres.c5faqozfo86k.us-west-2.rds.amazonaws.com -U jwalsh -d training_2015 \
     -c "\COPY jwalsh.building_violations_sample FROM STDIN WITH CSV HEADER;"
```

Copy the full dataset from the jwalsh schema into your own schema:

```
CREATE TABLE [schema].[table] AS (SELECT * FROM jwalsh.building_violations);
```

## Query data
Take a look at the first ten rows of each dataset:

```
SELECT * FROM jwalsh.building_violations LIMIT 10;
SELECT * FROM jwalsh.building_permits LIMIT 10;
```

How many rows are there?

```
SELECT COUNT(*) FROM jwalsh.building_violations AS a;
```

Only look at the data that meet specified conditions:

```
SELECT * FROM jwalsh.building_permits AS a WHERE a.estimated_cost > 1000 LIMIT 10;
SELECT COUNT(*) FROM jwalsh.building_permits AS a WHERE a.estimated_cost > 1000;
```
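
To see how `COUNT(*)` and `WHERE` behave on a table small enough to check by eye, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for PostgreSQL. The `permits` table and its four rows are made up; the queries themselves would read the same in psql.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (id INTEGER, estimated_cost REAL)")
conn.executemany(
    "INSERT INTO permits VALUES (?, ?)",
    [(1, 500.0), (2, 1500.0), (3, 25000.0), (4, None)],  # None is stored as NULL
)

# COUNT(*) always returns exactly one row, so a LIMIT adds nothing here.
total = conn.execute("SELECT COUNT(*) FROM permits").fetchone()[0]

# Rows whose estimated_cost is NULL never satisfy a comparison, so row 4 is excluded.
expensive = conn.execute(
    "SELECT COUNT(*) FROM permits AS a WHERE a.estimated_cost > 1000"
).fetchone()[0]

rows = conn.execute(
    "SELECT a.id FROM permits AS a WHERE a.estimated_cost > 1000 ORDER BY a.id LIMIT 10"
).fetchall()

print(total, expensive, rows)
```

The count of matching rows (2) plus the count of non-matching rows is not 4 here, because the row with a missing cost falls into neither group. Use `IS NULL` to find such rows.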

We can use aliases:

```
SELECT a.* FROM jwalsh.building_violations AS a LIMIT 10;
```

We can use sub-queries:

```
SELECT a.* FROM (SELECT * FROM jwalsh.building_violations AS a LIMIT 10) AS a;
```

## Join data
```
SELECT * FROM jwalsh.building_permits AS a, jwalsh.building_violations AS b 
WHERE a.location = b.location 
LIMIT 10;
```

```
SELECT * 
FROM   (SELECT  *,
                street_number || ' ' || street_direction || ' ' || street_name || ' ' || suffix AS address
        FROM    jwalsh.building_permits) AS a
LEFT JOIN jwalsh.building_violations AS b 
     ON a.location = b.location OR
        a.address = b.address
LIMIT 10;
```
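
The two queries above differ in which rows they keep: the comma-style join with a `WHERE` condition returns only matching pairs, while `LEFT JOIN` keeps every row from the left table even when nothing matches. The sketch below shows the difference on two tiny made-up tables, using Python's built-in `sqlite3` module as a stand-in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE permits (permit_id INTEGER, address TEXT);
CREATE TABLE violations (violation_id INTEGER, address TEXT);
INSERT INTO permits VALUES (1, '100 N MAIN ST'), (2, '200 S STATE ST');
INSERT INTO violations VALUES (10, '100 N MAIN ST'), (11, '100 N MAIN ST');
""")

# Inner join: only permit/violation pairs that share an address.
inner_rows = conn.execute("""
    SELECT a.permit_id, b.violation_id
    FROM permits AS a, violations AS b
    WHERE a.address = b.address
    ORDER BY a.permit_id, b.violation_id
""").fetchall()

# Left join: every permit appears; unmatched permits get NULL for violation columns.
left_rows = conn.execute("""
    SELECT a.permit_id, b.violation_id
    FROM permits AS a
    LEFT JOIN violations AS b ON a.address = b.address
    ORDER BY a.permit_id, b.violation_id
""").fetchall()

print(inner_rows)
print(left_rows)
```

Permit 1 appears twice in both results because it matches two violations; a join can increase the row count, which matters if you later sum or count over the joined table.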

Fuzzy matching:

```
-- ROUND(value, digits) needs a numeric argument in PostgreSQL, so cast in case the column is a float
SELECT * FROM ( SELECT *, 
                       ROUND(latitude::numeric, 5) AS lat_rounded, 
                       ROUND(longitude::numeric, 5) AS long_rounded 
                FROM   jwalsh.building_permits) AS a, 
              ( SELECT *,
                       ROUND(latitude::numeric, 5) AS lat_rounded, 
                       ROUND(longitude::numeric, 5) AS long_rounded 
                FROM   jwalsh.building_violations) AS b 
WHERE a.lat_rounded = b.lat_rounded AND a.long_rounded = b.long_rounded
LIMIT 10;
```
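
Rounding coordinates to five decimal places treats points within roughly a meter of each other as the same location. The sketch below compares exact and rounded matching on made-up coordinates, again using Python's built-in `sqlite3` module as a stand-in for PostgreSQL; the table names and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE permits (permit_id INTEGER, latitude REAL, longitude REAL);
CREATE TABLE violations (violation_id INTEGER, latitude REAL, longitude REAL);
INSERT INTO permits VALUES (1, 41.8781234, -87.6298123);
INSERT INTO violations VALUES (10, 41.8781211, -87.6298149),
                              (11, 41.9000000, -87.7000000);
""")

# Exact equality: the coordinates differ in the sixth decimal place, so nothing matches.
exact = conn.execute("""
    SELECT a.permit_id, b.violation_id
    FROM permits AS a, violations AS b
    WHERE a.latitude = b.latitude AND a.longitude = b.longitude
""").fetchall()

# Rounded equality: both points round to (41.87812, -87.62981) and now match.
rounded = conn.execute("""
    SELECT a.permit_id, b.violation_id
    FROM permits AS a, violations AS b
    WHERE ROUND(a.latitude, 5) = ROUND(b.latitude, 5)
      AND ROUND(a.longitude, 5) = ROUND(b.longitude, 5)
""").fetchall()

print(exact, rounded)
```

One limitation of this approach: two points that are very close but fall on opposite sides of a rounding boundary will still fail to match.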

Fuzzy matching using [Levenshtein distance](http://en.wikipedia.org/wiki/Levenshtein_distance):
```
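-- levenshtein_less_equal comes from the fuzzystrmatch extension and compares text,
-- so the numeric coordinates are cast to text first
SELECT *,
       levenshtein_less_equal(a.latitude::text, b.latitude::text, 4) AS leven_lat,
       levenshtein_less_equal(a.longitude::text, b.longitude::text, 4) AS leven_long
FROM jwalsh.building_permits AS a,
     jwalsh.building_violations AS b
LIMIT 10;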
```
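
Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The Python sketch below implements the standard dynamic-programming version so you can see what the database function computes. It is an illustration, not PostgreSQL's implementation, and the example strings are made up.

```python
def levenshtein(source, target):
    """Return the edit distance between two strings."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into an empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete s_char
                    current[j - 1] + 1,                   # insert t_char
                    previous[j - 1] + substitution_cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


d_words = levenshtein("kitten", "sitting")
d_address = levenshtein("100 N MAIN ST", "100 N MAIN STREET")
print(d_words, d_address)

# Treat two strings as a fuzzy match when the distance is at most 4.
is_match = d_address <= 4
```

The `_less_equal` variant in PostgreSQL takes a maximum distance so that it can stop early when only small distances are of interest; the sketch above always computes the full distance.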