# Term Project Spring 2020 - Team Game Cancelled
___

## Course Data ETL

<b> Table of Contents: </b>
<br> [1. ERD for CSV Data](#0)
<br> [2. Data Dictionary](#1)
<br> [3. Set Up - SQL and Database](#2)
<br> [4. Create Tables From ERD](#3)
<br> [5. Importing CSV Data](#4)
<br> [6. Transform and Load Data into ERD](#5)
<br> [7. Quick Test for Entity Integrity](#6)
___

In this project, we created a normalized database for Fairfield University's Academic Calendar and Course Catalogs. During our ETL process, we loaded the source CSV data into a normalized database of our own design. We then designed and populated a data warehouse to answer our analytics questions. Finally, we tested the domain, entity, and relational integrity of our databases and demonstrated that all the course objectives were met.
___

<a id = "0"> <h2> 1. ERD for CSV Data </h2> </a>   
_Design a normalized relational database that can contain all CSV data in your SourceData repository. Document the design with an ERD and a data dictionary._

![Course Data ERD](James/images/CourseDataERD.png)

In [None]:
# ADD updated image ^^

---
<a id = "1"> <h2> 2. Data Dictionary </h2> </a>

[Course Data Dictionary](Docs/CourseDataDictionary.md)

---
<a id = "2"> <h2> 3. Set Up - SQL and Database </h2> </a>

### Imports

In [1]:
%load_ext sql
import pandas as pd
import sqlite3

In [2]:
%sql sqlite:///CourseData.db
conn = sqlite3.connect('CourseData.db')

---
<a id = "3"> <h2> 4. Create Tables From ERD </h2> </a>

In this section we create the empty tables from our ERD in CourseData.db. The tables are populated with data from the CSV files in our SourceData repository in sections 5 and 6.

_Create a SQLite database called CourseData.db in this folder. The database should exactly match your ERD. Populate the database with data from the CSV files._

In [3]:
# is primaryinstructorID a foreign key?? 4/17
# should we add FK to our course dict? 4/22 

In [4]:
%%sql

-- Instructors table
DROP TABLE IF EXISTS INSTRUCTORS;
CREATE TABLE INSTRUCTORS (
    InstructorID INTEGER PRIMARY KEY,
    Name TEXT
);

-- Programs table
DROP TABLE IF EXISTS PROGRAMS;
CREATE TABLE PROGRAMS (
    ProgramID INTEGER PRIMARY KEY,
    ProgramCode TEXT NOT NULL,
    ProgramName TEXT NOT NULL
);

-- Course Catalogs table
DROP TABLE IF EXISTS COURSE_CATALOGS;
CREATE TABLE COURSE_CATALOGS(
    CourseID INTEGER PRIMARY KEY,
    CatalogYear TEXT NOT NULL,
    CatalogID TEXT NOT NULL,
    ProgramID INTEGER NOT NULL,
    CourseTitle TEXT NOT NULL,
    Credits TEXT NOT NULL,
    Prequisites TEXT,
    Corequisites TEXT,
    Fees TEXT,
    Attributes TEXT,
    Description TEXT,
    FOREIGN KEY (ProgramID) REFERENCES Programs(ProgramID),
    FOREIGN KEY (CatalogID) REFERENCES Course_Offerings(CatalogID)
);


-- Location table
DROP TABLE IF EXISTS LOCATION;
CREATE TABLE LOCATION (
    LocationID INTEGER PRIMARY KEY,
    Location TEXT NOT NULL
);

-- Course Offerings table
DROP TABLE IF EXISTS COURSE_OFFERINGS;
CREATE TABLE COURSE_OFFERINGS (
    CourseOfferingID INTEGER PRIMARY KEY,
    CourseID INTEGER,
    Term TEXT,
    CRN INTEGER NOT NULL,
    CatalogID TEXT NOT NULL,
    Section TEXT NOT NULL,
    Credits TEXT NOT NULL,
    Title TEXT NOT NULL,
    Timecodes TEXT NOT NULL,
    PrimaryInstructorID INTEGER,
    Capacity INTEGER NOT NULL,
    Actual INTEGER NOT NULL,
    Remaining INTEGER NOT NULL,
    FOREIGN KEY (CourseID) REFERENCES Course_Catalogs(CourseID),
    FOREIGN KEY (CatalogID) REFERENCES Course_Catalogs(CatalogID)
);

-- Course Meetings table
DROP TABLE IF EXISTS COURSE_MEETINGS;
CREATE TABLE COURSE_MEETINGS (
    CourseMeetingID INTEGER PRIMARY KEY,
    CourseOfferingID INTEGER NOT NULL,
    CRN INTEGER NOT NULL,
    LocationID INTEGER NOT NULL,
    Day TEXT NOT NULL,
    StartDateTime TEXT NOT NULL,
    EndDateTime TEXT NOT NULL,
    FOREIGN KEY (LocationID) REFERENCES LOCATION(LocationID),
    FOREIGN KEY (CourseOfferingID) REFERENCES Course_Offerings(CourseOfferingID)
);

 * sqlite:///CourseData.db
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.


[]
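
SQLite parses the FOREIGN KEY clauses above but, on most builds, does not enforce them unless `PRAGMA foreign_keys = ON` is issued on each connection. The following is a minimal sketch of the difference. It uses a cut-down in-memory copy of PROGRAMS and COURSE_CATALOGS (two columns each, an assumption for brevity) and a hypothetical helper `orphan_insert_rejected` that is not part of this notebook.

```python
import sqlite3

# Cut-down versions of two tables from the schema above (assumption: only
# the key columns are kept so the example stays short).
SCHEMA = """
CREATE TABLE PROGRAMS (
    ProgramID INTEGER PRIMARY KEY,
    ProgramCode TEXT NOT NULL
);
CREATE TABLE COURSE_CATALOGS (
    CourseID INTEGER PRIMARY KEY,
    ProgramID INTEGER NOT NULL,
    FOREIGN KEY (ProgramID) REFERENCES PROGRAMS(ProgramID)
);
"""

def orphan_insert_rejected(enforce: bool) -> bool:
    """Try to insert a catalog row whose ProgramID has no parent row."""
    conn = sqlite3.connect(":memory:")
    if enforce:
        # Must be set per connection, before any data is written.
        conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    try:
        conn.execute("INSERT INTO COURSE_CATALOGS (ProgramID) VALUES (999)")
        return False  # the orphan row was accepted
    except sqlite3.IntegrityError:
        return True   # the foreign key constraint fired
    finally:
        conn.close()

print("rejected without pragma:", orphan_insert_rejected(enforce=False))
print("rejected with pragma:   ", orphan_insert_rejected(enforce=True))
```

Because the setting is per connection, a notebook like this one would need it on both the `%sql` connection and the `sqlite3` connection `conn` for the constraints to be checked during the load.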

 ---
<a id = "4"> <h2> 5. Importing CSV Data </h2> </a>

### List of CSVs:
- Catalogs folder
    - CourseCatalog2017_2018.csv | https://ba-lab.fairfield.edu/user/jneri/lab/tree/term-project-teamgamecancelled/SourceData/Catalogs/CourseCatalog2017_2018.csv
    - CourseCatalog2018_2019.csv
- Each semester's folder
    - course_meetings.csv
    - courses.csv

In [5]:
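# Catalog years to import; each one maps to a CSV in SourceData/Catalogs
catalogyear = ['2017_2018', '2018_2019']

for catyear in catalogyear:
    filepath = 'SourceData/Catalogs/CourseCatalog' + catyear + '.csv'
    data = pd.read_csv(filepath)
    # Tag each row with its catalog year so both years can share one staging table
    data['catyear'] = catyear
    # Note: 'append' adds rows on every run, so rerunning this cell duplicates the data
    data.to_sql('IMPORT_COURSE_CATALOGS', conn, if_exists='append', index=False)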

In [6]:
terms = ['Fall2014','Fall2015','Fall2016','Fall2017','Fall2018',
         'Spring2015','Spring2016','Spring2017','Spring2018','Spring2019',
         'SpringBreak2017',
         'Summer2015','Summer2016','Summer2017','Summer2018',
         'Winter2015','Winter2016','Winter2017','Winter2018']

for term in terms:
    # Course offerings for this term
    filepath = 'SourceData/' + term + '/courses.csv'
    data = pd.read_csv(filepath)
    data.to_sql('IMPORT_COURSE_OFFERINGS', conn, if_exists='append', index=False)

    # Individual class meetings for this term
    filepath = 'SourceData/' + term + '/course_meetings.csv'
    data = pd.read_csv(filepath)
    data.to_sql('IMPORT_COURSE_MEETINGS', conn, if_exists='append', index=False)
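
Because both import loops use `if_exists='append'`, running these cells a second time against an existing CourseData.db would append every file again. One way to keep the import repeatable is to drop the staging table before the loop, as in the minimal sketch below. It uses only the standard library, and two made-up in-memory CSV snippets stand in for the real SourceData files. `load_staging` is a hypothetical helper, not part of this notebook.

```python
import csv
import io
import sqlite3

# Made-up stand-ins for SourceData/<term>/courses.csv (assumption: only
# three of the real columns are shown).
FAKE_FILES = {
    "Fall2014": "term,crn,catalog_id\nFall2014,70384,AC 0011\n",
    "Spring2015": "term,crn,catalog_id\nSpring2015,30001,AC 0011\n",
}

def load_staging(conn, files):
    """Rebuild IMPORT_COURSE_OFFERINGS from scratch and return its row count."""
    conn.execute("DROP TABLE IF EXISTS IMPORT_COURSE_OFFERINGS")
    conn.execute(
        "CREATE TABLE IMPORT_COURSE_OFFERINGS "
        "(term TEXT, crn INTEGER, catalog_id TEXT)"
    )
    for text in files.values():
        rows = [
            (r["term"], int(r["crn"]), r["catalog_id"])
            for r in csv.DictReader(io.StringIO(text))
        ]
        conn.executemany(
            "INSERT INTO IMPORT_COURSE_OFFERINGS VALUES (?, ?, ?)", rows
        )
    conn.commit()
    return conn.execute(
        "SELECT COUNT(*) FROM IMPORT_COURSE_OFFERINGS"
    ).fetchone()[0]

conn = sqlite3.connect(":memory:")
first = load_staging(conn, FAKE_FILES)
second = load_staging(conn, FAKE_FILES)  # rerun: the count does not grow
print(first, second)
```

With pandas, a comparable effect could be had by running `DROP TABLE IF EXISTS` on the staging tables before the loops, or by passing `if_exists='replace'` for the first file and `'append'` for the rest.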

In [7]:
%%sql
-- Record Counts for Course Offerings
SELECT 
    (SELECT Count(*) FROM IMPORT_COURSE_OFFERINGS) as 'RawCount',
    (SELECT Count(*) FROM (SELECT DISTINCT * FROM IMPORT_COURSE_OFFERINGS)) as 'DistinctCount';

 * sqlite:///CourseData.db
Done.


RawCount,DistinctCount
15937,15937


In [8]:
%%sql
-- Record Counts for Catalog Courses
SELECT 
    (SELECT Count(*) FROM IMPORT_COURSE_CATALOGS) as 'RawCount',
    (SELECT Count(*) FROM (SELECT DISTINCT * FROM IMPORT_COURSE_CATALOGS)) as 'DistinctCount';

 * sqlite:///CourseData.db
Done.


RawCount,DistinctCount
4440,4440


In [9]:
%%sql
-- Record Counts for Course Meetings
SELECT 
    (SELECT Count(*) FROM IMPORT_COURSE_MEETINGS) as 'RawCount',
    (SELECT Count(*) FROM (SELECT DISTINCT * FROM IMPORT_COURSE_MEETINGS)) as 'DistinctCount';

 * sqlite:///CourseData.db
Done.


RawCount,DistinctCount
284907,284847


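__Takeaway:__
- IMPORT_COURSE_MEETINGS contains 60 duplicate rows (284907 raw vs. 284847 distinct). We deal with them later, when the normalized tables are loaded with SELECT DISTINCT.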
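
The counts show how many surplus rows there are (284907 - 284847 = 60) but not which rows are repeated. One option is to group on every column and keep the groups with more than one member. The following is a minimal sketch on a made-up in-memory table. The column names follow the meetings import, but the rows are invented and only four columns are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IMPORT_COURSE_MEETINGS "
    "(term TEXT, crn INTEGER, day TEXT, start TEXT)"
)
rows = [
    ("Fall2014", 70384, "T", "2014-09-02T08:00:00"),
    ("Fall2014", 70384, "F", "2014-09-05T08:00:00"),
    ("Fall2014", 70384, "F", "2014-09-05T08:00:00"),  # exact repeat
]
conn.executemany("INSERT INTO IMPORT_COURSE_MEETINGS VALUES (?, ?, ?, ?)", rows)

# Group on every column; any group with more than one row is a duplicate.
duplicates = conn.execute(
    """
    SELECT term, crn, day, start, COUNT(*) AS copies
    FROM IMPORT_COURSE_MEETINGS
    GROUP BY term, crn, day, start
    HAVING COUNT(*) > 1
    """
).fetchall()

# Surplus rows = copies beyond the first in each duplicated group.
surplus = sum(copies - 1 for *_, copies in duplicates)
print(duplicates)
print(surplus)
```

On the real table, the surplus computed this way should equal RawCount minus DistinctCount, since both GROUP BY and DISTINCT treat NULLs as equal.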

 ---
<a id = "5"> <h2> 6. Transform and Load Data into ERD </h2> </a>

- Loading data into the Instructors Table, which includes:
    - Instructor ID and Name
- Entries listed as 'TBA' and entries containing a '/' are left out
- Testing the table with Professor Huntley's name

In [10]:
%%sql

DELETE FROM INSTRUCTORS;

INSERT INTO INSTRUCTORS (Name)
SELECT DISTINCT primary_instructor
FROM IMPORT_COURSE_OFFERINGS 
WHERE primary_instructor <> 'TBA' AND primary_instructor NOT LIKE '%/%';

SELECT * FROM INSTRUCTORS
LIMIT 10; 

SELECT * FROM INSTRUCTORS
WHERE Name like '%Huntley';

 * sqlite:///CourseData.db
0 rows affected.
1095 rows affected.
Done.
Done.


InstructorID,Name
272,Christopher L. Huntley


- Loading data into the Programs Table which includes:
    - Program ID, Program Code and Program Name
- Printing the first 10 rows of the table

In [11]:
%%sql

DELETE FROM PROGRAMS;

INSERT INTO PROGRAMS (ProgramCode,ProgramName)
SELECT DISTINCT program_code,program_name 
FROM IMPORT_COURSE_CATALOGS
ORDER BY program_code;

SELECT * FROM PROGRAMS
LIMIT 10

 * sqlite:///CourseData.db
0 rows affected.
83 rows affected.
Done.


ProgramID,ProgramCode,ProgramName
1,AC,Accounting
2,AE,Applied Ethics
3,AH,Art History
4,AN,Asian Studies
5,AR,Arabic
6,AS,American Studies
7,AY,Anthropology
8,BB,Business
9,BEN,Bioengineering
10,BI,Biology


- Loading data into the Course Catalogs Table and joining it with the Programs Table, which includes:
    - Course ID, Catalog year, Catalog ID, Program ID, Course title, Credits, Prerequisites, Corequisites, Fees, Attributes and Course description
- Printing the first 5 rows of the table

In [12]:
%%sql 

DELETE FROM COURSE_CATALOGS;

INSERT INTO COURSE_CATALOGS (CatalogYear, CatalogID,ProgramID,CourseTitle,Credits,Prequisites,Corequisites,Fees,Attributes,Description)
SELECT DISTINCT catyear, catalog_id,ProgramID,course_title,credits,prereqs,coreqs,fees,attributes,description
FROM IMPORT_COURSE_CATALOGS 
    JOIN PROGRAMS ON (program_code = ProgramCode);
    
SELECT * FROM COURSE_CATALOGS
LIMIT 5


 * sqlite:///CourseData.db
0 rows affected.
4440 rows affected.
Done.


CourseID,CatalogYear,CatalogID,ProgramID,CourseTitle,Credits,Prequisites,Corequisites,Fees,Attributes,Description
1,2017_2018,AN 0301,4,Independent Study,1-3 Credits,,,,,Students undertake an individualized program of study in consultation with a director from the Asian studies faculty.
2,2017_2018,AN 0310,4,Asian Studies Seminar,3 Credits,,,,,"This seminar examines selected topics concerning Asia. This course is taught in conjunction with another 100-300 level course from a rotation of course offerings. Consult the Asian Studies director to identify the conjoined course for a given semester. The seminar concentrates on topics within the parameters of the conjoined course syllabus but adds research emphasis. Students registered for this course must complete a research project, to include 300-level research, in addition to the regular research requirements of the conjoined course, and a 25-50 page term paper in substitution of some portion of the conjoined course requirements, as determined by the instructor. Open to juniors and seniors only."
3,2017_2018,BU 0211,12,Legal Environment of Business,3 Credits,Junior standing.,,,,"This course examines the broad philosophical as well as practical nature and function of the legal system, and introduces students to the legal and social responsibilities of business. The course includes an introduction to the legal system, the federal courts, Constitutional law, the United States Supreme Court, the civil process, and regulatory areas such as employment discrimination, protection of the environment, and corporate governance and securities markets."
4,2017_2018,BU 0220,12,Environmental Law and Policy,3 Credits,,,,"EVME Environmental Studies Major Elective, EVPE Environmental Studies Elective, EVSS Environmental Studies: Social Science, MGEL Management: General Elective","This course surveys issues arising out of federal laws designed to protect the environment and manage resources. It considers in detail the role of the Environmental Protection Agency in the enforcement of environmental policies arising out of such laws as the National Environmental Policy Act, the Clean Water Act, and the Clear Air Act, among others. The course also considers the impact of Congress, political parties, bureaucracy, and interest groups in shaping environmental policy, giving special attention to the impact of environmental regulation on business and private property rights."
5,2017_2018,BU 0311,12,"The Law of Contracts, Sales, and Property",3 Credits,BU 0211.,,,,"This course examines the components of common law contracts including the concepts of offer and acceptance, consideration, capacity and legality, assignment of rights and delegation of duties, as well as discharge of contracts. The course covers Articles 2 and 2A of the Uniform Commercial Code relating to leases, sales of goods, and warranties. The course also considers personal and real property, and bailments."


- Loading data into the Location Table, which includes:
    - Location ID and Location
- Printing the first 10 rows of the table

In [13]:
%%sql

DELETE FROM LOCATION;

INSERT INTO LOCATION (Location)
SELECT DISTINCT location
FROM IMPORT_COURSE_MEETINGS
ORDER BY location;

SELECT * FROM LOCATION
LIMIT 10

 * sqlite:///CourseData.db
0 rows affected.
207 rows affected.
Done.


LocationID,Location
1,BCC 200
2,BD
3,BH
4,BH BY ARR
5,BLM 112
6,BLM LL105
7,BNW 124
8,BNW 127
9,BNW 128
10,BNW 129B


__Quick print out of IMPORT_COURSE_OFFERINGS Table for test__

In [14]:
%%sql
SELECT * 
FROM IMPORT_COURSE_OFFERINGS
LIMIT 5;

 * sqlite:///CourseData.db
Done.


term,crn,catalog_id,section,credits,title,meetings,timecodes,primary_instructor,cap,act,rem
Fall2014,70384,AC 0011,C01,3.0,Introduction to Financial Accounting,"[{'days': 'TF', 'times': '0800am-0915am', 'dates': '09/02-12/08', 'location': 'DSB 105'}]",['TF 0800am-0915am 09/02-12/08 DSB 105'],Michael P. Coyne,0,31,-31
Fall2014,70385,AC 0011,C02,3.0,Introduction to Financial Accounting,"[{'days': 'TF', 'times': '0930am-1045am', 'dates': '09/02-12/08', 'location': 'DSB 105'}]",['TF 0930am-1045am 09/02-12/08 DSB 105'],Michael P. Coyne,0,31,-31
Fall2014,70382,AC 0011,C03,3.0,Introduction to Financial Accounting,"[{'days': 'TF', 'times': '1230pm-0145pm', 'dates': '09/02-12/08', 'location': 'DSB 105'}]",['TF 1230pm-0145pm 09/02-12/08 DSB 105'],Michael P. Coyne,0,31,-31
Fall2014,70291,AC 0011,C04,3.0,Introduction to Financial Accounting,"[{'days': 'MR', 'times': '1100am-1215pm', 'dates': '09/02-12/08', 'location': 'DSB 111'}]",['MR 1100am-1215pm 09/02-12/08 DSB 111'],Rebecca I. Bloch,0,29,-29
Fall2014,70350,AC 0011,C05,3.0,Introduction to Financial Accounting,"[{'days': 'MR', 'times': '1230pm-0145pm', 'dates': '09/02-12/08', 'location': 'DSB 111'}]",['MR 1230pm-0145pm 09/02-12/08 DSB 111'],Rebecca I. Bloch,0,30,-30


- Loading data into the Course Offerings Table and joining it with the Course Catalogs and Instructors Tables, which includes:
    - CourseID, Term, CRN, Credits, CatalogID, Section, Title, Timecodes, PrimaryInstructorID, Capacity, Actual, Remaining
- Printing the first 5 rows of the table

In [15]:
%%sql 

DELETE FROM COURSE_OFFERINGS;

INSERT INTO COURSE_OFFERINGS (CourseID,Term,CRN,Credits,CatalogID,Section,Title,Timecodes,PrimaryInstructorID,Capacity,Actual,Remaining)
SELECT DISTINCT CourseID,term,IMPORT_COURSE_OFFERINGS.crn,IMPORT_COURSE_OFFERINGS.credits,catalog_id,section,title,timecodes,primary_instructor,cap,act,rem
FROM IMPORT_COURSE_OFFERINGS 
    LEFT JOIN COURSE_CATALOGS ON (catalog_id = CatalogID)
    LEFT JOIN INSTRUCTORS ON (primary_instructor = Name);
    
SELECT * FROM COURSE_OFFERINGS
LIMIT 5;

 * sqlite:///CourseData.db
0 rows affected.
31180 rows affected.
Done.


CourseOfferingID,CourseID,Term,CRN,CatalogID,Section,Credits,Title,Timecodes,PrimaryInstructorID,Capacity,Actual,Remaining
1,113,Fall2014,70384,AC 0011,C01,3.0,Introduction to Financial Accounting,['TF 0800am-0915am 09/02-12/08 DSB 105'],Michael P. Coyne,0,31,-31
2,2333,Fall2014,70384,AC 0011,C01,3.0,Introduction to Financial Accounting,['TF 0800am-0915am 09/02-12/08 DSB 105'],Michael P. Coyne,0,31,-31
3,113,Fall2014,70385,AC 0011,C02,3.0,Introduction to Financial Accounting,['TF 0930am-1045am 09/02-12/08 DSB 105'],Michael P. Coyne,0,31,-31
4,2333,Fall2014,70385,AC 0011,C02,3.0,Introduction to Financial Accounting,['TF 0930am-1045am 09/02-12/08 DSB 105'],Michael P. Coyne,0,31,-31
5,113,Fall2014,70382,AC 0011,C03,3.0,Introduction to Financial Accounting,['TF 1230pm-0145pm 09/02-12/08 DSB 105'],Michael P. Coyne,0,31,-31
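
In the sample rows above, each CRN appears twice, once with CourseID 113 and once with CourseID 2333, and the 31180 loaded rows are roughly double the 15937 rows in IMPORT_COURSE_OFFERINGS. That is consistent with the join on CatalogID alone matching the same course once per catalog year. The PrimaryInstructorID column also shows the instructor's name rather than the InstructorID found by the join. The following minimal sketch reproduces both effects on a made-up in-memory database and shows one possible alternative: keep a single catalog row per CatalogID and select InstructorID. Picking the latest catalog year with MAX is an assumption for illustration, not necessarily the rule our ERD calls for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE INSTRUCTORS (InstructorID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE COURSE_CATALOGS (
    CourseID INTEGER PRIMARY KEY, CatalogYear TEXT, CatalogID TEXT);
CREATE TABLE IMPORT_COURSE_OFFERINGS (
    term TEXT, crn INTEGER, catalog_id TEXT, primary_instructor TEXT);

-- Invented rows: one instructor, one course listed in two catalog years.
INSERT INTO INSTRUCTORS (Name) VALUES ('Michael P. Coyne');
INSERT INTO COURSE_CATALOGS (CatalogYear, CatalogID) VALUES
    ('2017_2018', 'AC 0011'), ('2018_2019', 'AC 0011');
INSERT INTO IMPORT_COURSE_OFFERINGS VALUES
    ('Fall2014', 70384, 'AC 0011', 'Michael P. Coyne'),
    ('Fall2014', 70385, 'AC 0011', 'Michael P. Coyne');
""")

# Join on CatalogID only: each offering matches both catalog years.
fanned_out = conn.execute("""
    SELECT COUNT(*)
    FROM IMPORT_COURSE_OFFERINGS
        LEFT JOIN COURSE_CATALOGS ON (catalog_id = CatalogID)
""").fetchone()[0]

# One catalog row per CatalogID (latest year, an assumption) and the
# instructor's key instead of the instructor's name.
one_per_offering = conn.execute("""
    SELECT crn, CourseID, InstructorID
    FROM IMPORT_COURSE_OFFERINGS
        LEFT JOIN (
            SELECT CatalogID, MAX(CatalogYear) AS CatalogYear
            FROM COURSE_CATALOGS
            GROUP BY CatalogID
        ) AS latest ON (catalog_id = latest.CatalogID)
        LEFT JOIN COURSE_CATALOGS AS cc
            ON (cc.CatalogID = latest.CatalogID
                AND cc.CatalogYear = latest.CatalogYear)
        LEFT JOIN INSTRUCTORS ON (primary_instructor = Name)
    ORDER BY crn
""").fetchall()

print(fanned_out)        # 2 offerings x 2 catalog years
print(one_per_offering)  # one row per offering, with integer keys
```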


- Loading data into the Course Meetings Table and joining it with the Course Offerings and Location Tables, which includes:
    - CourseOfferingID, CRN, LocationID, Day, StartDateTime, EndDateTime
- Printing the first 5 rows of the table

In [16]:
%%sql 

DELETE FROM COURSE_MEETINGS;

INSERT INTO COURSE_MEETINGS (CourseOfferingID,CRN,LocationID,Day,StartDateTime,EndDateTime)
SELECT DISTINCT CourseOfferingID,IMPORT_COURSE_MEETINGS.crn,LocationID,day,start,end
FROM IMPORT_COURSE_MEETINGS 
    LEFT JOIN COURSE_OFFERINGS USING (CRN, Term)
    LEFT JOIN LOCATION ON (IMPORT_COURSE_MEETINGS.Location = LOCATION.location);
    
SELECT * FROM COURSE_MEETINGS
LIMIT 5;

 * sqlite:///CourseData.db
0 rows affected.
558473 rows affected.
Done.


CourseMeetingID,CourseOfferingID,CRN,LocationID,Day,StartDateTime,EndDateTime
1,1,70384,99,T,2014-09-02T08:00:00,2014-09-02T09:15:00
2,2,70384,99,T,2014-09-02T08:00:00,2014-09-02T09:15:00
3,1,70384,99,F,2014-09-05T08:00:00,2014-09-05T09:15:00
4,2,70384,99,F,2014-09-05T08:00:00,2014-09-05T09:15:00
5,1,70384,99,T,2014-09-09T08:00:00,2014-09-09T09:15:00


 ---
<a id = "6"> <h2> 7. Quick Test for Entity Integrity </h2> </a>

In [17]:
%%sql
-- Record Counts for Course Meetings
SELECT 
    (SELECT Count(*) FROM IMPORT_COURSE_MEETINGS) as 'RawCount',
    (SELECT Count(*) FROM (SELECT DISTINCT * FROM IMPORT_COURSE_MEETINGS)) as 'DistinctCount';

 * sqlite:///CourseData.db
Done.


RawCount,DistinctCount
284907,284847
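
The count above looks at the staging table. Entity integrity of the normalized tables could be checked in a similar way: for any key, the row count and the count of distinct key values should agree, and no key column should be NULL. The following is a minimal sketch with a hypothetical helper `check_entity_integrity`, run against a made-up in-memory table. Treating (Term, CRN) as a candidate key for COURSE_OFFERINGS is an assumption based on the join used when loading COURSE_MEETINGS.

```python
import sqlite3

def check_entity_integrity(conn, table, key_columns):
    """Return (row_count, distinct_key_count, null_key_count) for a key.

    table and key_columns are interpolated into the SQL text, so they must
    be trusted identifiers (hypothetical helper, for illustration only).
    """
    cols = ", ".join(key_columns)
    null_test = " OR ".join(f"{c} IS NULL" for c in key_columns)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table})"
    ).fetchone()[0]
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {null_test}"
    ).fetchone()[0]
    return total, distinct, nulls

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE COURSE_OFFERINGS ("
    "CourseOfferingID INTEGER PRIMARY KEY, Term TEXT, CRN INTEGER)"
)
# Invented rows: the first two share the same (Term, CRN) pair.
conn.executemany(
    "INSERT INTO COURSE_OFFERINGS (Term, CRN) VALUES (?, ?)",
    [("Fall2014", 70384), ("Fall2014", 70384), ("Fall2014", 70385)],
)

surrogate = check_entity_integrity(conn, "COURSE_OFFERINGS", ["CourseOfferingID"])
natural = check_entity_integrity(conn, "COURSE_OFFERINGS", ["Term", "CRN"])
print(surrogate)
print(natural)
```

An INTEGER PRIMARY KEY column passes this check by construction in SQLite, so the more informative test is usually the one on the natural key, where a distinct count below the row count points to repeated entities.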


In [18]:
%%sql
vacuum;

 * sqlite:///CourseData.db
Done.


[]