# Discussion 07 Notebook

This notebook accompanies the Discussion 07 worksheet handout.

# Section I: Entity Resolution

## Database Setup

In [None]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS disc07'
!psql -h localhost -c 'CREATE DATABASE disc07'

%reload_ext sql
%sql postgresql://127.0.0.1:5432/disc07

In [None]:
!psql -h localhost -d disc07 -f disc07.sql

In [None]:
!psql postgresql://127.0.0.1:5432/disc07 <disc07.sql

## Initial Exploration

Let us first take a look at the content of the database.

In [None]:
%sql SELECT * FROM albums;

In [None]:
%sql SELECT * FROM sales;

One significant barrier to data analysis remains: joins. Although we have cleaned the data locally, within each table, we have not cleaned up the connections between the tables, such as the strings we may want to join on.

In our dataset, simply joining on equality of names results in missing data.

## Question 1
Let's try it out: write a query to join the tables using strict equality.

In [None]:
%config SqlMagic.displaylimit = None

In [None]:
%%sql
-- your code here


By performing a left join, we can see how many rows should have matched but did not under a strict equality condition.

In [None]:
%%sql
-- Left join: sales rows without an exactly matching album name get NULLs in the album columns
SELECT *
FROM sales s
LEFT JOIN albums a
ON s.name = a.name;

We see that we are missing a lot of data, which could cause trouble in computations later on.
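
The same effect can be reproduced on a toy example. The sketch below uses Python's built-in `sqlite3` module with made-up rows; the table layouts and values are assumptions for illustration only and are not taken from `disc07.sql`. It counts how many sales rows survive a strict-equality join and lists the ones a left join exposes as unmatched.

```python
import sqlite3

# Hypothetical toy data; the real disc07 tables may differ in schema and content.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE albums (name TEXT, artist TEXT);
    CREATE TABLE sales (name TEXT, day TEXT, "count" INTEGER);
    INSERT INTO albums VALUES
        ('Abbey Road', 'The Beatles'),
        ('Thriller', 'Michael Jackson');
    INSERT INTO sales VALUES
        ('Abbey Road', 'Mon', 3),
        ('abbey road (remastered)', 'Tue', 5),
        ('Thriler', 'Wed', 2);
""")

# Strict equality: only names that are character-for-character identical match.
inner_rows = conn.execute(
    "SELECT s.name, a.artist FROM sales s JOIN albums a ON s.name = a.name"
).fetchall()

# Left join: keep every sales row, then filter to those with no album match.
unmatched = conn.execute(
    "SELECT s.name FROM sales s LEFT JOIN albums a ON s.name = a.name "
    "WHERE a.name IS NULL ORDER BY s.name"
).fetchall()

print(len(inner_rows), "matched;", len(unmatched), "unmatched")
conn.close()
```

In this toy data, two of the three sales rows refer to a known album but are lost by the strict join because of a typo or extra text in the name.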

### Distance Functions on Strings
The Levenshtein distance function can help find strings that are _similar_ but not identical. It computes the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. We can apply this to our dataset.
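
To make the definition concrete, here is a minimal plain-Python sketch of the classic dynamic-programming computation, assuming unit cost for inserting a character, deleting a character, or replacing one character with another. The function name `levenshtein_distance` is made up for this illustration; it is not the PostgreSQL implementation, only a way to see what the number means.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # replace ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein_distance("kitten", "sitting"))  # 3
```

Identical strings have distance 0, and the distance can never exceed the length of the longer string, which is worth keeping in mind when picking a threshold for short names.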

## Question 2
First, let's compute the Levenshtein distance between every pair of names in the two tables, sorted by this distance in ascending order.

Hint: `levenshtein(text1, text2)`

https://www.postgresql.org/docs/current/fuzzystrmatch.html#FUZZYSTRMATCH-LEVENSHTEIN

In [None]:
%%sql
-- your code here


You'll notice that the boundary between the strings we want to match and those we don't is very narrow. In practice, a clustering-based approach would be better suited to this scenario, but let's keep going with a distance threshold since it is much easier to implement.

We can use `< 10` as our threshold for matching strings.

## Question 3
Write a query that joins the tables on the condition that two entries match if their Levenshtein distance is less than 10.

Make sure to have the following columns in the result:
- `name`: name of the album
- `artist`: name of the artist
- `day`: day of sales
- `count`: number of sales

In [None]:
%%sql
-- your code here


# Section IV [Optional]: Hampel X84

## Question 9: Deriving the Magic Number 1.4826

#### Goal: prove that selecting outliers 1.4826 MAD away from the median is equivalent to selecting outliers 1 standard deviation away from the mean, if the data follows a normal distribution.
First, let us find the **MAD** (Median Absolute Deviation) of a standard normal distribution.

Let **Z** be a random variable following a **standard normal distribution** with mean **μ = 0** and standard deviation **σ = 1**. We want to find **z\*** such that:


$$P(Z < z^*) = 0.75$$

Is z\* the MAD?

Hint:
Use `scipy.stats.norm` to compute the value of z\*.


In [None]:
from scipy.stats import norm
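
Independently of the derivation, the goal stated above can be sanity-checked by simulation. The sketch below is an assumption-laden illustration, not part of the worksheet solution: it draws samples from a standard normal distribution using only the Python standard library, computes their MAD, and compares 1.4826 × MAD with the sample standard deviation. The sample size and seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable
samples = [random.gauss(0.0, 1.0) for _ in range(50_000)]

# MAD: the median of absolute deviations from the median.
med = statistics.median(samples)
mad = statistics.median(abs(x - med) for x in samples)

# Standard deviation of the same samples, for comparison.
std = statistics.pstdev(samples)

scaled_mad = 1.4826 * mad
print(f"1.4826 * MAD = {scaled_mad:.3f}, std = {std:.3f}")
```

If the two printed values agree closely, that supports (but does not prove) the claim that 1.4826 MAD plays the role of one standard deviation for normally distributed data.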
