# Data cleaning in SQL

General notes:

- We'll be using [ipython-sql](https://github.com/catherinedevlin/ipython-sql) to work with SQL directly within a notebook. To execute a single-line SQL command, prefix it with `%sql`. For a multi-line command, put `%%sql` on the first line of the cell and the SQL on the lines that follow.
- To use the [`psql` command line tool](https://www.postgresql.org/docs/current/app-psql.html), [launch a terminal](https://jupyterlab.readthedocs.io/en/stable/user/terminal.html) and run `psql postgresql://localhost/postgres`.

## Environment setup

***You can probably ignore this section.***

In [1]:
%load_ext sql

In [2]:
%sql postgresql://localhost/postgres

Start from a clean slate:

In [3]:
%sql DROP TABLE IF EXISTS requests;

 * postgresql://localhost/postgres
Done.


[]

### Load data

We're taking data from the CSV and loading it into the local [PostgreSQL](https://www.postgresql.org/) database (the one connected to above) via [pandas](https://pandas.pydata.org/).

In [4]:
import pandas as pd

# use a subset of columns, for simplicity
columns = [
    "Unique Key",
    "Created Date",
    "Closed Date",
    "Agency",
    "Latitude",
    "Longitude"
]
requests = pd.read_csv("data/311_jan_2022.csv", index_col="Unique Key", usecols=columns)
# simplify column names
requests.index = requests.index.rename("unique_key")
requests.columns = requests.columns.str.replace(" ", "_", regex=False).str.replace(r"\W", "", regex=True).str.lower()
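
The renaming steps above can be checked without pandas. This is a minimal sketch, assuming the same three steps (spaces to underscores, drop non-word characters, lowercase); the helper name `simplify` is made up for illustration.

```python
import re


def simplify(name: str) -> str:
    """Mirror the pandas steps: spaces -> underscores, drop non-word chars, lowercase."""
    name = name.replace(" ", "_")
    name = re.sub(r"\W", "", name)
    return name.lower()


# Column names taken from the `columns` list above
original = ["Created Date", "Closed Date", "Agency", "Latitude", "Longitude"]
simplified = [simplify(name) for name in original]
print(simplified)
```

These are the names that show up as table columns once the DataFrame is persisted.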

In [5]:
%sql --persist requests

 * postgresql://localhost/postgres


'Persisted requests'

#### Ensure records were loaded

In [6]:
%sql SELECT * FROM requests LIMIT 3;

 * postgresql://localhost/postgres
3 rows affected.


unique_key,created_date,closed_date,agency,latitude,longitude
52940375,01/01/2022 12:00:00 AM,01/03/2022 08:39:00 AM,DEP,40.75666417742652,-73.99019293432467
52934953,01/01/2022 12:00:10 AM,01/01/2022 01:00:11 AM,NYPD,40.72314288436064,-73.94366208445774
52933158,01/01/2022 12:00:57 AM,01/01/2022 12:58:22 AM,NYPD,40.59747269272421,-73.98885877127528


## [Display the schema](https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-META-COMMANDS)

In [7]:
%sql \d requests

 * postgresql://localhost/postgres
6 rows affected.


Column,Type,Modifiers
unique_key,bigint,
created_date,text,
closed_date,text,
agency,text,
latitude,double precision,
longitude,double precision,


## Exploration

### Counts per unique value

In [8]:
%sql SELECT agency, COUNT(agency) FROM requests GROUP BY agency;

 * postgresql://localhost/postgres
16 rows affected.


agency,count
MAYOR’S OFFICE OF SPECIAL ENFORCEMENT,37
DOITT,1
DPR,794
HPD,12671
DOB,1233
NYPD,18254
DOHMH,1713
DCA,298
DSNY,3975
TLC,318
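
One detail worth knowing about the query above: `COUNT(agency)` counts only non-NULL values, while `COUNT(*)` counts every row in the group. The sketch below illustrates the difference using Python's built-in `sqlite3` module as a stand-in for PostgreSQL, with made-up rows rather than the real 311 data.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for the PostgreSQL table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (unique_key INTEGER, agency TEXT)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [(1, "NYPD"), (2, "NYPD"), (3, "HPD"), (4, None)],  # hypothetical rows
)

rows = conn.execute(
    "SELECT agency, COUNT(agency), COUNT(*) FROM requests GROUP BY agency"
).fetchall()

# Map each agency to (non-NULL count, row count)
counts = {agency: (non_null, total) for agency, non_null, total in rows}
print(counts)
```

The row with a missing agency forms its own group, where `COUNT(agency)` is 0 and `COUNT(*)` is 1.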


## Add constraints

### Primary key

In [9]:
%sql ALTER TABLE requests ADD PRIMARY KEY (unique_key);

 * postgresql://localhost/postgres
Done.


[]
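
Adding a primary key fails if the column contains duplicates or NULLs, so it can help to check for both first. A minimal sketch of those checks, again using Python's built-in `sqlite3` as a stand-in for PostgreSQL; the duplicate and the missing key are invented for illustration and are not in the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (unique_key INTEGER, agency TEXT)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [
        (52940375, "DEP"),
        (52934953, "NYPD"),
        (52934953, "NYPD"),  # deliberate duplicate, hypothetical
        (None, "HPD"),       # deliberate missing key, hypothetical
    ],
)

# Keys that appear more than once would violate uniqueness
duplicates = conn.execute(
    "SELECT unique_key, COUNT(*) FROM requests "
    "GROUP BY unique_key HAVING COUNT(*) > 1"
).fetchall()

# Rows without a key would violate the implied NOT NULL
missing = conn.execute(
    "SELECT COUNT(*) FROM requests WHERE unique_key IS NULL"
).fetchone()[0]

print(duplicates, missing)
```

Since the `ALTER TABLE` above reported `Done.`, both checks would be expected to come back empty for this dataset.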

### Convert dates from strings to datetime

In [10]:
%sql SELECT created_date FROM requests LIMIT 3;

 * postgresql://localhost/postgres
3 rows affected.


created_date
01/01/2022 12:00:00 AM
01/01/2022 12:00:10 AM
01/01/2022 12:00:57 AM


In [11]:
%%sql
ALTER TABLE requests
    ALTER COLUMN created_date SET NOT NULL,
    ALTER COLUMN created_date TYPE TIMESTAMP USING to_timestamp(created_date, 'MM/DD/YYYY HH:MI:SS AM');

 * postgresql://localhost/postgres
Done.


[]

In [12]:
%sql SELECT created_date FROM requests LIMIT 3;

 * postgresql://localhost/postgres
3 rows affected.


created_date
2022-01-01 00:00:00
2022-01-01 00:00:10
2022-01-01 00:00:57
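
As a cross-check on the format string, the same sample values can be parsed in Python. This is a sketch, assuming the `strptime` pattern `%m/%d/%Y %I:%M:%S %p` as the counterpart of the PostgreSQL template and an English (or C) locale for the AM/PM marker.

```python
from datetime import datetime

# Assumed Python equivalent of the PostgreSQL template used above
FORMAT = "%m/%d/%Y %I:%M:%S %p"

# Sample values copied from the created_date output above
samples = [
    "01/01/2022 12:00:00 AM",
    "01/01/2022 12:00:10 AM",
    "01/01/2022 12:00:57 AM",
]

parsed = [datetime.strptime(value, FORMAT) for value in samples]
for value in parsed:
    print(value.isoformat(sep=" "))
```

Note that 12:00 AM maps to hour 0, matching the converted column.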


#### Your turn: Convert `closed_date` to a timestamp

In [13]:
# YOUR CODE HERE

### Ranges

In [14]:
%sql ALTER TABLE requests ADD CHECK (latitude > 40 AND latitude < 41);

 * postgresql://localhost/postgres


IntegrityError: (psycopg2.errors.CheckViolation) check constraint "requests_latitude_check" of relation "requests" is violated by some row

[SQL: ALTER TABLE requests ADD CHECK (latitude > 40 AND latitude < 41);]
(Background on this error at: https://sqlalche.me/e/14/gkpj)

In [15]:
%sql SELECT MIN(latitude), MAX(latitude) FROM requests;

 * postgresql://localhost/postgres
1 rows affected.


min,max
2.040074755e-06,40.91216806886205
