# Denison CS181/DA210 SW Lab #11 - Step 2

Before you turn this problem in, make sure everything runs as expected. This is a combination of **restarting the kernel** and then **running all cells** (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says `# YOUR CODE HERE` or "YOUR ANSWER HERE".

---

#### Import Python modules and load "SQL Magic"

In [1]:
import pandas as pd
import os
import os.path
import json
import sys
import importlib

module_dir = "../../modules"
module_path = os.path.abspath(module_dir)
if module_path not in sys.path:
    sys.path.append(module_path)

%load_ext sql

#### Set credentials

In [2]:
def getsqlite_creds(dirname=".",filename="creds.json",source="sqlite"):
    """ Using directory and filename parameters, open a credentials file
        and obtain the three parts needed for a connection string to
        a local provider using the `source` dictionary (default "sqlite")
        within an outer dictionary.

        Return a scheme, a database directory, and a database name
    """
    assert os.path.isfile(os.path.join(dirname, filename))
    with open(os.path.join(dirname, filename)) as f:
        D = json.load(f)
    sqlite = D[source]
    return sqlite["scheme"], sqlite["dbdir"], sqlite["database"]

In [3]:
scheme, dbdir, database = getsqlite_creds(source="sqlite1")
template = '{}:///{}/{}.db'
cstring = template.format(scheme, dbdir, database)
print("Connection string:", cstring)

Connection string: sqlite:///../../dbfiles/book.db
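
The contents of `creds.json` are not shown in this lab. As a minimal sketch, assuming the file holds one dictionary per source with the three keys that `getsqlite_creds` reads, the connection string above could be produced from a file shaped like the hypothetical `example_creds` below (the values are inferred from the printed connection string, not copied from the real file):

```python
import json
import os
import tempfile

# Hypothetical credentials layout, inferred from the keys getsqlite_creds reads
# ("scheme", "dbdir", "database") and from the printed connection string.
example_creds = {
    "sqlite1": {"scheme": "sqlite", "dbdir": "../../dbfiles", "database": "book"},
}

# Write the dictionary to a temporary creds.json and read it back,
# mirroring what getsqlite_creds does with json.load.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "creds.json")
    with open(path, "w") as f:
        json.dump(example_creds, f)
    with open(path) as f:
        entry = json.load(f)["sqlite1"]

# Same template as the lab cell: scheme, then directory, then database name.
example_cstring = "{}:///{}/{}.db".format(
    entry["scheme"], entry["dbdir"], entry["database"]
)
print(example_cstring)
```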


#### Establish Connection from Client to Server

In [4]:
%sql $cstring

---

## Part C: Types of Joins

We'll observe the differences in the types of joins using the following two tables (technically they're "views" in the `book` database, but for our purposes we'll treat them as tables):

In [5]:
%sql SELECT * FROM pop_gdp

 * sqlite:///../../dbfiles/book.db
Done.


code,pop,gdp
CHN,1386.4,12143.5
FRA,66.87,2586.29
GBR,66.06,2637.87
USA,325.15,19485.4


In [6]:
%sql SELECT * FROM country_land

 * sqlite:///../../dbfiles/book.db
Done.


code,country,land
FRA,France,547557.0
GBR,United Kingdom,241930.0
IND,India,2973190.0
USA,United States,9147420.0
VNM,Vietnam,310070.0


We'll use the following "match condition" for the joins we'll explore:
```
    pop_gdp.code = country_land.code
```

#### Inner join

We have already seen inner joins, so we'll use this as our starting point.

First, we construct a combined table that includes all six columns from the two tables and keeps only the rows that satisfy the match condition, that is, rows whose `code` is present in **both** tables.

In [7]:
query = """
SELECT pop_gdp.code AS pg_code,
       pop_gdp.pop AS pg_pop,
       pop_gdp.gdp AS pg_gdp,
       country_land.code AS cl_code,
       country_land.country AS cl_country,
       country_land.land AS cl_land
FROM pop_gdp INNER JOIN country_land ON country_land.code = pop_gdp.code
"""

resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()

 * sqlite:///../../dbfiles/book.db
Done.


Unnamed: 0,pg_code,pg_pop,pg_gdp,cl_code,cl_country,cl_land
0,FRA,66.87,2586.29,FRA,France,547557.0
1,GBR,66.06,2637.87,GBR,United Kingdom,241930.0
2,USA,325.15,19485.4,USA,United States,9147420.0


As only FRA, GBR, and USA are present in both tables, the resulting table has only three records.
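
One way to check this reasoning without the `book` database is to rebuild the two tables in an in-memory SQLite database. The sketch below is only an illustration: it copies the rows shown above into plain tables (the real objects are views) and confirms that the inner join returns exactly the codes the two tables share.

```python
import sqlite3

# In-memory stand-in for the two views, using the rows printed in the lab.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pop_gdp (code TEXT, pop REAL, gdp REAL);
CREATE TABLE country_land (code TEXT, country TEXT, land REAL);
INSERT INTO pop_gdp VALUES
    ('CHN', 1386.4, 12143.5), ('FRA', 66.87, 2586.29),
    ('GBR', 66.06, 2637.87), ('USA', 325.15, 19485.4);
INSERT INTO country_land VALUES
    ('FRA', 'France', 547557.0), ('GBR', 'United Kingdom', 241930.0),
    ('IND', 'India', 2973190.0), ('USA', 'United States', 9147420.0),
    ('VNM', 'Vietnam', 310070.0);
""")

# Codes produced by the inner join, sorted for a stable comparison.
inner_codes = [row[0] for row in conn.execute("""
    SELECT pop_gdp.code
    FROM pop_gdp INNER JOIN country_land ON country_land.code = pop_gdp.code
    ORDER BY pop_gdp.code
""")]

# The same answer computed as a Python set intersection of the code columns.
pg_codes = {row[0] for row in conn.execute("SELECT code FROM pop_gdp")}
cl_codes = {row[0] for row in conn.execute("SELECT code FROM country_land")}
shared_codes = sorted(pg_codes & cl_codes)

print(inner_codes)
print(shared_codes)
conn.close()
```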

#### Left join

To use a `LEFT JOIN`, we simply replace `INNER JOIN` with `LEFT JOIN` in our SQL statement.

Next, we construct a combined table that includes all six columns from the two tables, pairing rows that satisfy the match condition while keeping every row of the `pop_gdp` table, matched or not.

In [8]:
query = """
SELECT pop_gdp.code AS pg_code,
       pop_gdp.pop AS pg_pop,
       pop_gdp.gdp AS pg_gdp,
       country_land.code AS cl_code,
       country_land.country AS cl_country,
       country_land.land AS cl_land
FROM pop_gdp LEFT JOIN country_land ON country_land.code = pop_gdp.code
"""

resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()

 * sqlite:///../../dbfiles/book.db
Done.


Unnamed: 0,pg_code,pg_pop,pg_gdp,cl_code,cl_country,cl_land
0,CHN,1386.4,12143.5,,,
1,FRA,66.87,2586.29,FRA,France,547557.0
2,GBR,66.06,2637.87,GBR,United Kingdom,241930.0
3,USA,325.15,19485.4,USA,United States,9147420.0


As this is a `LEFT JOIN`, it has all records present in the `pop_gdp` table, even if they are not present in the `country_land` table (e.g., CHN has NULL values in the columns coming from `country_land`).
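
The NULL padding can be seen directly from Python, where SQLite's NULL arrives as `None`. The following is a minimal sketch on an in-memory copy of the two tables (plain tables standing in for the views), picking out the left-table rows that found no partner:

```python
import sqlite3

# In-memory stand-in for the two views, using the rows printed in the lab.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pop_gdp (code TEXT, pop REAL, gdp REAL);
CREATE TABLE country_land (code TEXT, country TEXT, land REAL);
INSERT INTO pop_gdp VALUES
    ('CHN', 1386.4, 12143.5), ('FRA', 66.87, 2586.29),
    ('GBR', 66.06, 2637.87), ('USA', 325.15, 19485.4);
INSERT INTO country_land VALUES
    ('FRA', 'France', 547557.0), ('GBR', 'United Kingdom', 241930.0),
    ('IND', 'India', 2973190.0), ('USA', 'United States', 9147420.0),
    ('VNM', 'Vietnam', 310070.0);
""")

# Every pop_gdp row appears once; country is None where no match exists.
left_rows = conn.execute("""
    SELECT pop_gdp.code, country_land.country
    FROM pop_gdp LEFT JOIN country_land ON country_land.code = pop_gdp.code
    ORDER BY pop_gdp.code
""").fetchall()

# Rows from the left table that had no partner in country_land.
unmatched = [code for code, country in left_rows if country is None]

print(left_rows)
print(unmatched)
conn.close()
```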

#### Right join

Some systems do not implement a `RIGHT JOIN` and provide only a `LEFT JOIN`.  In this case, we can use a `LEFT JOIN` and reverse the order of the tables in the `FROM` clause.

Now, we construct a combined table that includes all six columns from the two tables, pairing rows that satisfy the match condition while keeping every row of the `country_land` table, matched or not.

In [9]:
query = """
SELECT pop_gdp.code AS pg_code,
       pop_gdp.pop AS pg_pop,
       pop_gdp.gdp AS pg_gdp,
       country_land.code AS cl_code,
       country_land.country AS cl_country,
       country_land.land AS cl_land
FROM country_land LEFT JOIN pop_gdp ON country_land.code = pop_gdp.code
"""
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()

 * sqlite:///../../dbfiles/book.db
Done.


Unnamed: 0,pg_code,pg_pop,pg_gdp,cl_code,cl_country,cl_land
0,FRA,66.87,2586.29,FRA,France,547557.0
1,GBR,66.06,2637.87,GBR,United Kingdom,241930.0
2,,,,IND,India,2973190.0
3,USA,325.15,19485.4,USA,United States,9147420.0
4,,,,VNM,Vietnam,310070.0


Similar to the previous example, this result has all records present in the `country_land` table, even if they are not present in the `pop_gdp` table (e.g., IND and VNM have NULL values in the columns coming from `pop_gdp`).
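
To see that the order of the tables in the `FROM` clause decides which rows are preserved, the sketch below runs the `LEFT JOIN` in both directions on an in-memory copy of the two tables (an illustrative stand-in, not the `book` database) and compares the results:

```python
import sqlite3

# In-memory stand-in for the two views, using the rows printed in the lab.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pop_gdp (code TEXT, pop REAL, gdp REAL);
CREATE TABLE country_land (code TEXT, country TEXT, land REAL);
INSERT INTO pop_gdp VALUES
    ('CHN', 1386.4, 12143.5), ('FRA', 66.87, 2586.29),
    ('GBR', 66.06, 2637.87), ('USA', 325.15, 19485.4);
INSERT INTO country_land VALUES
    ('FRA', 'France', 547557.0), ('GBR', 'United Kingdom', 241930.0),
    ('IND', 'India', 2973190.0), ('USA', 'United States', 9147420.0),
    ('VNM', 'Vietnam', 310070.0);
""")

# pop_gdp on the left: its 4 rows are preserved.
pg_left = conn.execute("""
    SELECT pop_gdp.code, country_land.code
    FROM pop_gdp LEFT JOIN country_land ON country_land.code = pop_gdp.code
    ORDER BY pop_gdp.code
""").fetchall()

# country_land on the left: its 5 rows are preserved instead.
# The SELECT list is unchanged, so the column order stays the same.
cl_left = conn.execute("""
    SELECT pop_gdp.code, country_land.code
    FROM country_land LEFT JOIN pop_gdp ON country_land.code = pop_gdp.code
    ORDER BY country_land.code
""").fetchall()

print(len(pg_left), len(cl_left))
print(cl_left)
conn.close()
```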

#### Full outer join

A `FULL OUTER JOIN` (often called simply an outer join) is also not implemented in all systems.  Instead, we can take the `UNION` of two `LEFT JOIN` queries, one in each direction, with the second standing in for the `RIGHT JOIN`.

Finally, we construct a combined table that includes all six columns from the two tables, pairing rows that satisfy the match condition while keeping every row of both original tables, matched or not.

In [10]:
query = """
SELECT pop_gdp.code AS pg_code,
       pop_gdp.pop AS pg_pop,
       pop_gdp.gdp AS pg_gdp,
       country_land.code AS cl_code,
       country_land.country AS cl_country,
       country_land.land AS cl_land
FROM pop_gdp LEFT JOIN country_land ON country_land.code = pop_gdp.code
UNION
SELECT pop_gdp.code AS pg_code,
       pop_gdp.pop AS pg_pop,
       pop_gdp.gdp AS pg_gdp,
       country_land.code AS cl_code,
       country_land.country AS cl_country,
       country_land.land AS cl_land
FROM country_land LEFT JOIN pop_gdp ON country_land.code = pop_gdp.code
"""
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()

 * sqlite:///../../dbfiles/book.db
Done.


Unnamed: 0,pg_code,pg_pop,pg_gdp,cl_code,cl_country,cl_land
0,,,,IND,India,2973190.0
1,,,,VNM,Vietnam,310070.0
2,CHN,1386.4,12143.5,,,
3,FRA,66.87,2586.29,FRA,France,547557.0
4,GBR,66.06,2637.87,GBR,United Kingdom,241930.0
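
The output above lists five rows because `DataFrame.head()` returns only the first five rows by default; the USA row is still part of the result. As a rough check, the sketch below rebuilds the two tables in memory (plain tables standing in for the views, and only the two `code` columns selected for brevity) and counts the rows produced with `UNION` and, for comparison, with `UNION ALL`:

```python
import sqlite3

# In-memory stand-in for the two views, using the rows printed in the lab.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pop_gdp (code TEXT, pop REAL, gdp REAL);
CREATE TABLE country_land (code TEXT, country TEXT, land REAL);
INSERT INTO pop_gdp VALUES
    ('CHN', 1386.4, 12143.5), ('FRA', 66.87, 2586.29),
    ('GBR', 66.06, 2637.87), ('USA', 325.15, 19485.4);
INSERT INTO country_land VALUES
    ('FRA', 'France', 547557.0), ('GBR', 'United Kingdom', 241930.0),
    ('IND', 'India', 2973190.0), ('USA', 'United States', 9147420.0),
    ('VNM', 'Vietnam', 310070.0);
""")

# The two directions of the LEFT JOIN, with the same SELECT list in each.
left_sql = """
    SELECT pop_gdp.code AS pg_code, country_land.code AS cl_code
    FROM pop_gdp LEFT JOIN country_land ON country_land.code = pop_gdp.code
"""
right_sql = """
    SELECT pop_gdp.code AS pg_code, country_land.code AS cl_code
    FROM country_land LEFT JOIN pop_gdp ON country_land.code = pop_gdp.code
"""

# UNION removes duplicate rows; UNION ALL keeps every row from both queries.
union_rows = conn.execute(left_sql + " UNION " + right_sql).fetchall()
union_all_rows = conn.execute(left_sql + " UNION ALL " + right_sql).fetchall()

print(len(union_rows), len(union_all_rows))
conn.close()
```

Comparing the two counts is a useful starting point for the checkpoint question at the end of the lab.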


---

## Part D: `LEFT JOIN` and Set Differences

We can use a `LEFT JOIN` to compute differences between sets.  The next two exercises walk you through this process.  First, let's switch back to the `school` database.

In [11]:
scheme, dbdir, database = getsqlite_creds(source="sqlite2")
template = '{}:///{}/{}.db'
cstring = template.format(scheme, dbdir, database)
print("Connection string:", cstring)

%sql $cstring

Connection string: sqlite:///../../dbfiles/school.db
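
Before turning to the exercises, the idea of a set difference via `LEFT JOIN` can be sketched on the two small tables from Part C. The example below is an illustration on an in-memory copy of those tables, not part of the `school` database: a `LEFT JOIN` followed by `WHERE ... IS NULL` keeps only the left-table rows with no match, which is the difference of the two sets of codes.

```python
import sqlite3

# In-memory stand-in for the Part C views, using the rows printed earlier.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pop_gdp (code TEXT, pop REAL, gdp REAL);
CREATE TABLE country_land (code TEXT, country TEXT, land REAL);
INSERT INTO pop_gdp VALUES
    ('CHN', 1386.4, 12143.5), ('FRA', 66.87, 2586.29),
    ('GBR', 66.06, 2637.87), ('USA', 325.15, 19485.4);
INSERT INTO country_land VALUES
    ('FRA', 'France', 547557.0), ('GBR', 'United Kingdom', 241930.0),
    ('IND', 'India', 2973190.0), ('USA', 'United States', 9147420.0),
    ('VNM', 'Vietnam', 310070.0);
""")

# Codes in pop_gdp but not in country_land: keep unmatched left rows only.
only_pop_gdp = [row[0] for row in conn.execute("""
    SELECT pop_gdp.code
    FROM pop_gdp LEFT JOIN country_land ON country_land.code = pop_gdp.code
    WHERE country_land.code IS NULL
    ORDER BY pop_gdp.code
""")]

# The other direction: codes in country_land but not in pop_gdp.
only_country_land = [row[0] for row in conn.execute("""
    SELECT country_land.code
    FROM country_land LEFT JOIN pop_gdp ON country_land.code = pop_gdp.code
    WHERE pop_gdp.code IS NULL
    ORDER BY country_land.code
""")]

print(only_pop_gdp)
print(only_country_land)
conn.close()
```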


**Q6:** Write a SQL query to collect course and class information: the subject, number, and title of each course, plus the term of each class (as a column named `term`).  Your resulting table should include directed studies, and should have records for all rows in the `courses` table.

In [12]:
query6 = """
SELECT coursesubject, coursenum, courses.coursetitle, classterm AS term
FROM courses LEFT JOIN classes USING (coursesubject, coursenum)
"""
# YOUR CODE HERE
# raise NotImplementedError()

resultset6 = %sql $query6
resultdf6 = resultset6.DataFrame()
print(len(resultdf6))
resultdf6.head()

   sqlite:///../../dbfiles/book.db
 * sqlite:///../../dbfiles/school.db
Done.
1896


Unnamed: 0,coursesubject,coursenum,coursetitle,term
0,ARAB,111,Beginning Arabic I,FALL
1,ARAB,112,Beginning Arabic II,SPRING
2,ARAB,211,Intermediate Arabic I,FALL
3,ARAB,361,Directed Study,FALL
4,ARAB,361,Directed Study,FALL


In [13]:
# Testing cell
assert len(resultdf6) == 1896
assert set(resultdf6.columns) == set(["coursesubject", "coursenum", "coursetitle", "term"])
assert True in list(resultdf6["term"].isna())
assert True not in list(resultdf6["coursetitle"].isna())
assert "Beginning Arabic I" in set(resultdf6["coursetitle"])

**Q7:** Using the query from the previous question as a subquery, select the course subject, number, and title for any courses not offered in either the fall or spring terms.  (Hint: think about what you can filter from the result of the previous question.)

In [14]:
query7 = """
SELECT DISTINCT coursesubject, coursenum, coursetitle
FROM (SELECT *
FROM courses LEFT JOIN classes USING (coursesubject, coursenum))
WHERE classterm IS NULL
"""
# YOUR CODE HERE
# raise NotImplementedError()

resultset7 = %sql $query7
resultdf7 = resultset7.DataFrame()
print(len(resultdf7))
resultdf7

   sqlite:///../../dbfiles/book.db
 * sqlite:///../../dbfiles/school.db
Done.
94


Unnamed: 0,coursesubject,coursenum,coursetitle
0,ARTH,363,Independent Study
1,ARTH,452,Senior Research
2,BIOL,363,Independent Study
3,BIOL,364,Independent Study
4,BLST,265,Blk Women & Org Leadership
...,...,...,...
89,WMST,162,Self-Defense for Women
90,WMST,229,Feminism/Fairy Tales
91,WMST,265,Blk Women-Org Leadership
92,WMST,364,Independent Study


In [15]:
# Testing cell
assert len(resultdf7) == 94
assert set(resultdf7.columns) == set(["coursesubject", "coursenum", "coursetitle"])
assert resultdf7.iloc[0,1] == 363
assert "Beginning Arabic I" not in set(resultdf7["coursetitle"])

**Q8:** Further expand your previous SQL query to retrieve the English courses (subject, number, and title) that were not offered in either semester.

In [16]:
query8 = """
SELECT coursesubject, coursenum, coursetitle
FROM
(SELECT DISTINCT coursesubject, coursenum, coursetitle
FROM (SELECT *
FROM courses LEFT JOIN classes USING (coursesubject, coursenum))
WHERE classterm IS NULL)
WHERE coursesubject = 'ENGL'
"""
# YOUR CODE HERE
# raise NotImplementedError()

# A - B = A - (A cap B)
# B - A = B - (A cap B)
# (A-B) cup (B-A) = (A cup B) - (A cap B)

resultset8 = %sql $query8
resultdf8 = resultset8.DataFrame()
resultdf8.head()

   sqlite:///../../dbfiles/book.db
 * sqlite:///../../dbfiles/school.db
Done.


Unnamed: 0,coursesubject,coursenum,coursetitle
0,ENGL,340,Contemporary Drama
1,ENGL,349,Studies in European Lit


In [17]:
# Testing cell
assert len(resultdf8) == 2
assert set(resultdf8.columns) == set(["coursesubject", "coursenum", "coursetitle"])
assert resultdf8["coursetitle"][0] == "Contemporary Drama"

> You've reached the third (and final) checkpoint in the lab.  Make sure to have it signed off by the instructor.
>
> Checkpoint 3: Why can we union only the left and right joins to get an outer join?  Put another way, why don't we need the union of the left, right, and inner joins to build an outer join?

---


## Part E

How much time (in minutes/hours) did you spend on this lab outside of class?

For over 1 hour