## Data Manipulation at Scale: Systems and Algorithms
### Assignment 2: SQL for Data Science

https://www.coursera.org/learn/data-manipulation/programming/nkglo/sql-for-data-science-assignment

Load the `reuters.db` SQLite database. The `%sql` and `%%sql` magics used in the cells are provided by the `ipython-sql` package.

In [None]:
%load_ext sql
%sql sqlite:///reuters.db

**Problem 1: Inspecting the Reuters Dataset and Basic Relational Algebra**

**Problem 1, Part A:** Using Select

In [None]:
%%sql
SELECT * FROM frequency WHERE docid='10398_txt_earn'

**Problem 1, Part B:** Using Select, Project


In [None]:
%%sql
SELECT term FROM frequency WHERE docid='10398_txt_earn' AND count=1

**Problem 1, Part C:** Using Union

In [None]:
%%sql
SELECT term FROM frequency WHERE docid='10398_txt_earn' AND count=1
UNION
SELECT term FROM frequency WHERE docid='925_txt_trade' AND count=1

**Problem 1, Part D:** Count unique documents

In [None]:
%%sql
SELECT count(*) FROM
    (SELECT docid FROM frequency WHERE term='legal'
     UNION
     SELECT docid FROM frequency WHERE term='law')
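
The count above relies on `UNION` removing duplicate rows, so a document that contains both terms is counted once. The following is a minimal sketch of that behaviour using Python's built-in `sqlite3` module and an in-memory table. The rows and the docids `d1`..`d3` are made up for illustration and are not taken from `reuters.db`.

```python
import sqlite3

# Toy stand-in for the frequency table; these rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frequency (docid TEXT, term TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO frequency VALUES (?, ?, ?)",
    [
        ("d1", "legal", 2),
        ("d1", "law", 1),    # d1 contains both terms
        ("d2", "law", 3),
        ("d3", "trade", 5),  # d3 contains neither term
    ],
)

# UNION deduplicates, so d1 appears once in the combined result.
union_count = conn.execute(
    """
    SELECT count(*) FROM (
        SELECT docid FROM frequency WHERE term = 'legal'
        UNION
        SELECT docid FROM frequency WHERE term = 'law'
    )
    """
).fetchone()[0]

# An equivalent formulation without a set operation.
distinct_count = conn.execute(
    "SELECT count(DISTINCT docid) FROM frequency WHERE term IN ('legal', 'law')"
).fetchone()[0]

print(union_count, distinct_count)  # 2 2
```

Replacing `UNION` with `UNION ALL` in this toy example would count `d1` twice, which is why the deduplicating form is the one used for counting unique documents.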

**Problem 1, Part E:** Count documents with more than 300 total terms

In [None]:
%%sql
SELECT count(*) FROM (SELECT sum(count) as wordcount, docid FROM frequency GROUP BY docid HAVING wordcount>300)

**Problem 1, Part F:** Count documents that contain two words

In [None]:
%%sql
SELECT count(*) FROM
    (SELECT docid FROM frequency WHERE term='transactions'
     INTERSECT
     SELECT docid FROM frequency WHERE term='world')

**Problem 2: Matrix Multiplication in SQL**

In [None]:
%sql sqlite:///matrix.db

In [None]:
%%sql
SELECT * FROM B LIMIT 10

In [None]:
%%sql
SELECT A.row_num, B.col_num, sum(A.value*B.value) AS value
    FROM A, B
    WHERE A.col_num=B.row_num
    GROUP BY A.row_num, B.col_num
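
To see why this join computes a matrix product, it can help to run the same query on two tiny matrices whose product is easy to check by hand. The sketch below uses Python's built-in `sqlite3` module with in-memory tables. The column names `row_num`, `col_num`, and `value` follow the query above, while the 2x2 values are invented for illustration and are not the contents of `matrix.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("A", "B"):
    conn.execute(
        f"CREATE TABLE {name} (row_num INTEGER, col_num INTEGER, value INTEGER)"
    )

# Sparse encoding: only nonzero cells are stored.
# A = [[1, 2],      B = [[4, 0],
#      [0, 3]]           [5, 6]]
conn.executemany("INSERT INTO A VALUES (?, ?, ?)", [(0, 0, 1), (0, 1, 2), (1, 1, 3)])
conn.executemany("INSERT INTO B VALUES (?, ?, ?)", [(0, 0, 4), (1, 0, 5), (1, 1, 6)])

# Join on the shared inner index, then sum the products per output cell.
product = conn.execute(
    """
    SELECT A.row_num, B.col_num, sum(A.value * B.value) AS value
    FROM A, B
    WHERE A.col_num = B.row_num
    GROUP BY A.row_num, B.col_num
    ORDER BY A.row_num, B.col_num
    """
).fetchall()

print(product)  # [(0, 0, 14), (0, 1, 12), (1, 0, 15), (1, 1, 18)]
```

The result matches the hand-computed product `[[14, 12], [15, 18]]`. Cells that are zero in either input never appear in the join, so they contribute nothing to the sums.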

**Problem 3: Working with a Term-Document Matrix**

In [None]:
%sql sqlite:///reuters.db

The `frequency` table is a sparse encoding of a term-document matrix: each `(docid, term, count)` row is one nonzero cell, so each document corresponds to a vector with one component per term. Multiplying this matrix by its own transpose gives a square *similarity matrix*, in which each cell holds the similarity of two documents. The similarity here is simply the dot product of the two document vectors.

The condition `A.docid > B.docid` ensures that the dot product for each pair of documents is computed only once; it also skips each document's product with itself.

The notebook crashed without the `LIMIT 100` clause, so the query below returns only the first 100 rows. Remove the limit to compute the full similarity matrix.

In [None]:
%%sql

SELECT A.docid, B.docid, sum(A.count*B.count) as similarity
    FROM frequency as A, frequency as B
    WHERE A.term=B.term AND A.docid > B.docid
    GROUP BY A.docid, B.docid
    LIMIT 100
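
The self-join above can be checked on a small example where the dot products are easy to work out by hand. This is a minimal sketch using Python's built-in `sqlite3` module. The three documents `d1`..`d3` and their term counts are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frequency (docid TEXT, term TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO frequency VALUES (?, ?, ?)",
    [
        ("d1", "oil", 2), ("d1", "price", 1),
        ("d2", "oil", 1), ("d2", "trade", 3),
        ("d3", "price", 4), ("d3", "trade", 1),
    ],
)

# Pair up rows that share a term, then sum count products per document pair.
# A.docid > B.docid keeps one ordering of each pair and drops self-pairs.
similarity = conn.execute(
    """
    SELECT A.docid, B.docid, sum(A.count * B.count) AS similarity
    FROM frequency AS A, frequency AS B
    WHERE A.term = B.term AND A.docid > B.docid
    GROUP BY A.docid, B.docid
    ORDER BY A.docid, B.docid
    """
).fetchall()

print(similarity)  # [('d2', 'd1', 2), ('d3', 'd1', 4), ('d3', 'd2', 3)]
```

Three documents give three unordered pairs, so the result has three rows rather than the nine cells of the full 3x3 matrix. For instance, `d3` and `d1` share only `price`, giving 4 * 1 = 4.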

To search the dataset, add a document that represents the keyword query *'washington taxes treasury'* as document q.

In [None]:
%%sql

CREATE VIEW frequencyAndQuery AS
    SELECT * FROM frequency
    UNION
    SELECT 'q' as docid, 'washington' as term, 1 as count
    UNION
    SELECT 'q' as docid, 'taxes' as term, 1 as count
    UNION
    SELECT 'q' as docid, 'treasury' as term, 1 as count

Now compute the similarity matrix again. Get the 20 most similar documents to the query document q.

In [None]:
%%sql

SELECT * FROM
    (SELECT A.docid as docA, B.docid as docB, sum(A.count*B.count) as similarity
        FROM frequencyAndQuery as A, frequencyAndQuery as B
        WHERE A.term=B.term AND A.docid > B.docid
        GROUP BY A.docid, B.docid)
WHERE docA='q' OR docB='q'
ORDER BY similarity DESC
LIMIT 20
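
The whole keyword-search pattern (add a query document, compute similarities, keep only the pairs involving `q`, rank) can be sketched end to end on toy data. The example below uses Python's built-in `sqlite3` module. The documents `d1`..`d3` and their counts are invented for illustration, and each query keyword is given a count of 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frequency (docid TEXT, term TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO frequency VALUES (?, ?, ?)",
    [
        ("d1", "washington", 2), ("d1", "taxes", 1),
        ("d2", "treasury", 1), ("d2", "oil", 4),
        ("d3", "oil", 2),  # shares no term with the query
    ],
)

# The query is modelled as one more document, 'q', with one row per keyword.
conn.execute(
    """
    CREATE VIEW frequencyAndQuery AS
        SELECT * FROM frequency
        UNION
        SELECT 'q' AS docid, 'washington' AS term, 1 AS count
        UNION
        SELECT 'q' AS docid, 'taxes' AS term, 1 AS count
        UNION
        SELECT 'q' AS docid, 'treasury' AS term, 1 AS count
    """
)

# Rank documents by their dot product with the query document.
top_matches = conn.execute(
    """
    SELECT * FROM
        (SELECT A.docid AS docA, B.docid AS docB, sum(A.count * B.count) AS similarity
            FROM frequencyAndQuery AS A, frequencyAndQuery AS B
            WHERE A.term = B.term AND A.docid > B.docid
            GROUP BY A.docid, B.docid)
    WHERE docA = 'q' OR docB = 'q'
    ORDER BY similarity DESC
    LIMIT 20
    """
).fetchall()

print(top_matches)  # [('q', 'd1', 3), ('q', 'd2', 1)]
```

`d3` does not appear because it shares no term with the query, and the `d3`/`d2` pair (which share `oil`) is removed by the filter on `q`. In this toy data, `'q'` sorts after every other docid, so the query always lands in the `docA` column.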

List all the terms in the most similar document.

In [None]:
%%sql
SELECT term, count FROM frequency WHERE docid='19775_txt_interest' ORDER BY count DESC