# Discussion 07 Notebook

This notebook accompanies the discussion worksheet handout.

# Section II: Entity Resolution

## Database Setup

In [None]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS disc07'
!psql -h localhost -c 'CREATE DATABASE disc07'

%reload_ext sql
%sql postgresql://127.0.0.1:5432/disc07

In [None]:
!psql postgresql://127.0.0.1:5432/disc07 <disc07.sql

## Initial Exploration

Let us first take a look at the contents of the database.

In [None]:
%sql SELECT * FROM albums;

In [None]:
%sql SELECT * FROM sales;

One significant barrier to data analysis remains: joins. Although we have cleaned the data locally, within each table, we have not cleaned up the connections between the tables, such as the strings we may want to join on.

In our dataset, we'll notice that simply joining on equality of names will result in missing data.

### 1. Let's try it out: write a query to join the tables using strict equality

In [None]:
%%sql
-- your code here

We see that we are missing a lot of data, which could cause trouble in computations later on.

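## Distance Functions on Strings

The Levenshtein distance function can help find strings that are _similar_ but not identical. It computes the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. In PostgreSQL, a `levenshtein` function is provided by the `fuzzystrmatch` extension. We can apply this to our dataset.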
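To make the definition concrete, here is a minimal Python sketch of the standard dynamic-programming formulation of Levenshtein distance. It is an illustration only, not the implementation PostgreSQL uses, and the function name `levenshtein` is chosen here simply to mirror the SQL function.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # dist[i][j] holds the edit distance between a[:i] and b[:j].
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i  # delete all i characters of a[:i]
    for j in range(len(b) + 1):
        dist[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return dist[len(a)][len(b)]


print(levenshtein("kitten", "sitting"))  # 3
```

The classic example "kitten" to "sitting" takes two substitutions and one insertion, hence a distance of 3.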

### 2. First, let's compute the Levenshtein distance between every pair of names in the two tables, sorted by this distance.

In [None]:
%%sql
-- your code here

You'll notice that the boundary between the strings we want to match and those we don't is very narrow. In practice, a clustering-based approach would be better suited to this scenario, but let's keep going with distance, since it is much easier to implement.

We can use a distance of less than 10 (`< 10`) as our threshold for matching strings.
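As a rough illustration of threshold matching outside the database, the Python sketch below pairs every name on one side with every name on the other and keeps pairs whose distance is below the threshold. The helper `fuzzy_join` and the sample names are hypothetical and are not taken from `disc07.sql`; the recursive `levenshtein` here is a compact stand-in for the SQL function.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def levenshtein(a: str, b: str) -> int:
    """Memoized recursive edit distance (fine for short strings)."""
    if not a or not b:
        return len(a) + len(b)  # insert or delete whatever remains
    return min(
        levenshtein(a[1:], b) + 1,                   # deletion
        levenshtein(a, b[1:]) + 1,                   # insertion
        levenshtein(a[1:], b[1:]) + (a[0] != b[0]),  # substitution or match
    )


def fuzzy_join(left, right, threshold=10):
    """Return (left_name, right_name, distance) for every pair whose
    distance is strictly below threshold, sorted by distance."""
    pairs = [(l, r, levenshtein(l, r)) for l in left for r in right]
    return sorted((p for p in pairs if p[2] < threshold), key=lambda p: p[2])


# Hypothetical sample names, for illustration only.
album_names = ["Abbey Road", "Thriller"]
sales_names = ["abbey road (remastered)", "Thriller", "Back in Black"]

for left, right, distance in fuzzy_join(album_names, sales_names):
    print(distance, left, "<->", right)
```

Because the distance between two strings never exceeds the length of the longer one, any two names shorter than 10 characters fall under a `< 10` threshold regardless of content. This is one reason the boundary between wanted and unwanted matches is so narrow.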

### 3. Create a materialized view `joined_sales` that joins the tables using Levenshtein distance

Make sure to have the following columns in the view:
- `name`: name of the album
- `artist`: name of the artist
- `day`: day of sales
- `count`: number of sales

In [None]:
%%sql
-- your code here

# Section III: Data Granularity

## Initial Exploration

In [None]:
%sql SELECT * FROM nodes ORDER BY tax_id LIMIT 5;

In [None]:
%sql SELECT * FROM names ORDER BY tax_id LIMIT 5;

### 1. Write a SQL query to find the node representing the Animalia kingdom.

In [None]:
%%sql
-- your code here

### 2. Let us drill down into the Animalia kingdom. First, find all child nodes of the Animalia kingdom.

In [None]:
%%sql
-- your code here

### 3. Next, find the names of these nodes, along with the names of their parents.

In [None]:
%%sql
-- your code here

### 4. You will find that there are many synonymous names for the same phylum, all sharing the same `tax_id`. Aggregate them by grouping on `tax_id`.

In [None]:
%%sql
-- your code here

### 5. Challenge: How can we drill down one more layer? What if we want to get the names of all the classes under the Animalia kingdom?

In [None]:
%%sql
-- your code here
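The drill-down idea in this section can also be sketched in plain Python over a parent-pointer table. Everything below is hypothetical: the rows are made up, and the `parent_tax_id` column is an assumption about how `nodes` links a node to its parent, so check the actual schema before relying on it.

```python
from collections import defaultdict

# Hypothetical rows of (tax_id, parent_tax_id); made-up ids, not real taxonomy data.
nodes = [
    (1, None),  # root, standing in for a kingdom
    (2, 1),     # child of 1 (a phylum)
    (3, 1),     # child of 1 (a phylum)
    (4, 2),     # child of 2 (a class)
    (5, 2),     # child of 2 (a class)
    (6, 3),     # child of 3 (a class)
]

# Hypothetical rows of (tax_id, name); one tax_id may have several synonyms.
names = [(2, "Phylum A"), (2, "Phylum A (synonym)"), (3, "Phylum B")]


def drill_down(nodes, root_id, levels):
    """Return the tax_ids exactly `levels` steps below root_id."""
    frontier = {root_id}
    for _ in range(levels):
        # One step down: keep every node whose parent is in the frontier.
        frontier = {tax_id for tax_id, parent in nodes if parent in frontier}
    return frontier


def group_names(names):
    """Collect all synonyms per tax_id, similar in spirit to GROUP BY tax_id."""
    grouped = defaultdict(list)
    for tax_id, name in names:
        grouped[tax_id].append(name)
    return dict(grouped)


print(drill_down(nodes, 1, 1))  # {2, 3}
print(drill_down(nodes, 1, 2))  # {4, 5, 6}
print(group_names(names))
```

Each pass of the loop in `drill_down` plays the role of one more join of the table against itself, which is one way to think about going a layer deeper in SQL.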