# Discussion 07 Notebook

This notebook accompanies the discussion worksheet handout.

# Section I: Entity Resolution

## Database Setup

In [1]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS disc07'
!psql -h localhost -c 'CREATE DATABASE disc07'

%reload_ext sql
%sql postgresql://127.0.0.1:5432/disc07

DROP DATABASE
CREATE DATABASE


In [2]:
!psql -h localhost -d disc07 -f disc07.sql

SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
CREATE EXTENSION
COMMENT
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE MATERIALIZED VIEW
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 6
COPY 176436
COPY 73006
COPY 21
ALTER TABLE
CREATE INDEX
REFRESH MATERIALIZED VIEW


In [3]:
!psql postgresql://127.0.0.1:5432/disc07 <disc07.sql

SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
CREATE EXTENSION
COMMENT
SET
SET
ERROR:  relation "albums" already exists
ALTER TABLE
ERROR:  relation "sales" already exists
ALTER TABLE
ERROR:  relation "joined_sales" already exists
ALTER TABLE
ERROR:  relation "names" already exists
ALTER TABLE
ERROR:  relation "nodes" already exists
ALTER TABLE
COPY 6
COPY 176436
ERROR:  duplicate key value violates unique constraint "nodes_pkey"
DETAIL:  Key (tax_id)=(1) already exists.
CONTEXT:  COPY nodes, line 1
COPY 21
ERROR:  multiple primary keys for table "nodes" are not allowed
ERROR:  relation "tax_id_ix" already exists
REFRESH MATERIALIZED VIEW


## Initial Exploration

Let us first take a look at the content of the database.

In [4]:
%sql SELECT * FROM albums;

name,artist,track_count
Red (Deluxe Edition),Taylor Swift,22
The Midsummer Station,Owl City,11
thank u next,Ariana Grande,12
Eyes wide open,TWICE,13
Bloom,Red Velvet,11
After LIKE,IVE,2
Red (Deluxe Edition),Taylor Swift,22
The Midsummer Station,Owl City,11
thank u next,Ariana Grande,12
Eyes wide open,TWICE,13


In [5]:
%sql SELECT * FROM sales;

name,day,count
Red [Deluxe Edition],1,3
Eyes wide open,1,1
After LIKE,1,2
Red - Deluxe Edition,2,-1
Eyes wide open (CD),2,2
After Like,2,-1
Red (Deluxe Edition),3,2
Eyes wide open,3,3
After LIKE,3,6
Red [Deluxe Edition],4,1


There is one significant barrier to performing data analysis: joins. Although we have cleaned the data locally, within each table, we have not cleaned up the connections between the tables, such as the strings we may want to join on.

In our dataset, joining on strict equality of names loses data: a sale recorded as `Red [Deluxe Edition]` will not match the album `Red (Deluxe Edition)`.

## Question 1.
Let's try it out: write a query that joins the two tables on strict equality of names.

In [6]:
%config SqlMagic.displaylimit = None

In [7]:
%%sql
-- your code here
SELECT *
FROM sales s
INNER JOIN albums a
ON s.name = a.name;

name,day,count,name_1,artist,track_count
After LIKE,7,8,After LIKE,IVE,2
After LIKE,1,2,After LIKE,IVE,2
After LIKE,1,2,After LIKE,IVE,2
After LIKE,3,6,After LIKE,IVE,2
After LIKE,5,-1,After LIKE,IVE,2
After LIKE,5,-1,After LIKE,IVE,2
After LIKE,7,8,After LIKE,IVE,2
After LIKE,3,6,After LIKE,IVE,2
After LIKE,7,8,After LIKE,IVE,2
After LIKE,1,2,After LIKE,IVE,2


By performing a left join instead, we can see which rows should have matched but did not under the strict equality condition.

In [8]:
%%sql
-- your code here
SELECT *
FROM sales s
LEFT JOIN albums a
ON s.name = a.name;

name,day,count,name_1,artist,track_count
After Like,2,-1,,,
After Like,2,-1,,,
After LIKE,7,8,After LIKE,IVE,2.0
After LIKE,1,2,After LIKE,IVE,2.0
After LIKE,1,2,After LIKE,IVE,2.0
After LIKE,3,6,After LIKE,IVE,2.0
After LIKE,5,-1,After LIKE,IVE,2.0
After LIKE,5,-1,After LIKE,IVE,2.0
After LIKE,7,8,After LIKE,IVE,2.0
After LIKE,3,6,After LIKE,IVE,2.0


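Rows such as `After Like` come back with empty album columns because the name differs from `After LIKE` only in letter case. Unmatched rows like these are missing from the inner join, which could cause trouble in computations later on.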

### Distance Functions on Strings
The Levenshtein distance function can help find strings that are _similar_ but not identical. It computes the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. We can apply this to our dataset.
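
To make the definition concrete, here is a minimal Python sketch of the standard dynamic-programming computation of this distance. It is only an illustration and is not how the `fuzzystrmatch` extension is implemented; the function name and variable names are our own. Note that the sketch compares characters case-sensitively.

```python
def levenshtein(source: str, target: str) -> int:
    """Return the minimum number of single-character edits turning source into target."""
    # previous[j] is the distance between the prefix of source processed so far
    # (initially empty) and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Turning the first i characters of source into "" takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,                      # delete s_char
                current[j - 1] + 1,                   # insert t_char
                previous[j - 1] + substitution_cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("After LIKE", "After Like"))                      # 3
print(levenshtein("Red (Deluxe Edition)", "Red [Deluxe Edition]"))  # 2
```

The two sample pairs come from the `albums` and `sales` tables above: three case changes separate `After LIKE` from `After Like`, and two bracket substitutions separate the `Red` variants.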

## Question 2. 
First, let's compute the Levenshtein distance between every pair of names in the two tables, sorted by this distance in ascending order.

Hint: `levenshtein(text1, text2)`

https://www.postgresql.org/docs/current/fuzzystrmatch.html#FUZZYSTRMATCH-LEVENSHTEIN

In [9]:
%%sql
-- your code here
SELECT a.name AS text1, s.name AS text2, levenshtein(a.name, s.name) AS levenshtein
FROM albums a, sales s
ORDER BY levenshtein ASC;

text1,text2,levenshtein
Red (Deluxe Edition),Red (Deluxe Edition),0
Eyes wide open,Eyes wide open,0
After LIKE,After LIKE,0
After LIKE,After LIKE,0
After LIKE,After LIKE,0
After LIKE,After LIKE,0
After LIKE,After LIKE,0
Eyes wide open,Eyes wide open,0
Eyes wide open,Eyes wide open,0
Red (Deluxe Edition),Red (Deluxe Edition),0


You'll notice that it's a very close boundary between the strings we want to match and those we don't. In reality, a clustering-based approach would be better suited for this scenario, but let's keep going with distance since that's a lot easier to implement.

We can use `< 10` as our threshold for matching strings.
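
As a rough illustration of what a threshold join does, the Python sketch below pairs every album name with every sale name and keeps the pairs whose edit distance falls under the threshold. The helper names `edit_distance` and `fuzzy_join` are hypothetical, and the name lists are a hand-picked subset of the values shown in the tables above, not the full dataset.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    row = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        new_row = [i]
        for j, char_b in enumerate(b, start=1):
            new_row.append(min(
                row[j] + 1,                      # deletion
                new_row[j - 1] + 1,              # insertion
                row[j - 1] + (char_a != char_b)  # substitution (0 if equal)
            ))
        row = new_row
    return row[-1]


def fuzzy_join(left, right, threshold):
    """Return (left_name, right_name, distance) for pairs with distance < threshold."""
    matches = []
    for left_name in left:
        for right_name in right:
            distance = edit_distance(left_name, right_name)
            if distance < threshold:
                matches.append((left_name, right_name, distance))
    return matches


# Sample names copied from the albums and sales outputs above (subset only).
albums = ["Red (Deluxe Edition)", "Eyes wide open", "After LIKE"]
sales = ["Red [Deluxe Edition]", "Red - Deluxe Edition", "Eyes wide open (CD)", "After Like"]

for album_name, sale_name, distance in fuzzy_join(albums, sales, threshold=10):
    print(f"{album_name!r} ~ {sale_name!r} (distance {distance})")
```

Like the SQL join condition, this compares all pairs, so the cost grows with the product of the two table sizes. Lowering the threshold trades missed matches for fewer false ones: with `threshold=4`, for example, `Eyes wide open (CD)` no longer matches `Eyes wide open`, whose distance is 5.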

## Question 3
Write a query that joins the tables, treating two entries as a match if their Levenshtein distance is less than 10.

Make sure to have the following columns in the result:
- `name`: name of the album
- `artist`: name of the artist
- `day`: day of sales
- `count`: number of sales

In [21]:
%%sql

CREATE MATERIALIZED VIEW joined_sales4 AS (
SELECT a.name AS album_name, s.name AS sale_name, day, count
FROM albums AS a 
INNER JOIN sales AS s
ON levenshtein(a.name, s.name) < 10);

RuntimeError: (psycopg2.errors.DuplicateColumn) column "name" specified more than once

[SQL: CREATE MATERIALIZED VIEW joined_sales4 AS (
SELECT a.name, s.name, day, count
FROM albums AS a
INNER JOIN sales AS s
ON levenshtein(a.name, s.name) < 10);]
(Background on this error at: https://sqlalche.me/e/20/f405)


In [17]:
%%sql
DROP MATERIALIZED VIEW IF EXISTS joined_sales;

CREATE MATERIALIZED VIEW joined_sales AS 
    SELECT a.name AS album_name, s.name AS sale_name, day, count
    FROM albums a
        JOIN sales s
        ON levenshtein(a.name, s.name) < 10;

SELECT * FROM joined_sales;

aname,sname,day,count
Red (Deluxe Edition),Red [Deluxe Edition],1,3
Red (Deluxe Edition),Red [Deluxe Edition],1,3
Eyes wide open,Eyes wide open,1,1
Eyes wide open,Eyes wide open,1,1
After LIKE,After LIKE,1,2
After LIKE,After LIKE,1,2
Red (Deluxe Edition),Red - Deluxe Edition,2,-1
Red (Deluxe Edition),Red - Deluxe Edition,2,-1
Eyes wide open,Eyes wide open (CD),2,2
Eyes wide open,Eyes wide open (CD),2,2


In [15]:
%%sql
DROP MATERIALIZED VIEW IF EXISTS joined_sales2;

CREATE MATERIALIZED VIEW joined_sales2 AS 
    SELECT a.name AS aname, s.name AS sname, day, count
    FROM albums a, sales s
    WHERE levenshtein(a.name, s.name) < 10;

SELECT * FROM joined_sales2;

aname,sname,day,count
Red (Deluxe Edition),Red [Deluxe Edition],1,3
Red (Deluxe Edition),Red [Deluxe Edition],1,3
Eyes wide open,Eyes wide open,1,1
Eyes wide open,Eyes wide open,1,1
After LIKE,After LIKE,1,2
After LIKE,After LIKE,1,2
Red (Deluxe Edition),Red - Deluxe Edition,2,-1
Red (Deluxe Edition),Red - Deluxe Edition,2,-1
Eyes wide open,Eyes wide open (CD),2,2
Eyes wide open,Eyes wide open (CD),2,2


# Section IV [Optional]: Hampel X84

## Question 9: Deriving the Magic Number 1.4826

#### Goal: prove that selecting outliers 1.4826 MADs away from the median is equivalent to selecting outliers 1 standard deviation away from the mean, if the data follows a normal distribution.

First, let us find the **MAD** (Median Absolute Deviation) of a standard normal distribution.

Let **Z** be a random variable following a **standard normal distribution** with mean **μ = 0** and standard deviation **σ = 1**. We want to find **z\*** such that:


$$P(Z < z^*) = 0.75$$

Is z\* the MAD?

Hint:
Use `scipy.stats.norm` to compute the value of z\*.


In [None]:
from scipy.stats import norm
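
One way to check the derivation numerically is sketched below. The reasoning: a standard normal has median 0, so its MAD is the median of |Z|, meaning the value m with P(|Z| < m) = 0.5. By symmetry this is the same as P(Z < m) = 0.75, so m equals z\*. The sketch uses the standard library's `statistics.NormalDist` so that it runs without SciPy installed; `norm.ppf(0.75)` from `scipy.stats` computes the same quantile. The variable names are our own.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# z* solves P(Z < z*) = 0.75 (inverse CDF at 0.75).
# With SciPy, norm.ppf(0.75) gives the same quantile.
z_star = standard_normal.inv_cdf(0.75)

# Because the median of Z is 0 and the distribution is symmetric,
# z* is also the MAD of the standard normal.
mad_standard_normal = z_star

# For a normal with standard deviation sigma, MAD = z* * sigma,
# so sigma = MAD / z*; the reciprocal of z* is the scale factor.
scale_factor = 1 / mad_standard_normal

print(f"z* = {z_star:.4f}")
print(f"1 / z* = {scale_factor:.4f}")
```

The reciprocal rounds to 1.4826, which is where the constant in the Hampel X84 rule comes from.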
