# Technical Report
#### Collegiate Courses - New Textbook Library and Bookstore Data

### Overview

* This project is an ETL pipeline: we extract the information we need from raw data, transform it into a consistent format, and load it into a database that analysts can more easily query and present. We set a goal of providing this data in **second normal form (2NF)** for our analysts.

### Extract

* We downloaded a dataset from kaggle.com consisting of two CSV files: one with textbook data, `BNTextbook_2022-02-05.csv | 278 MB`, and another with college course data, `BNCollegeCourses_2022-02-05.csv | 97 MB`. Because of the size of these files, we used pandas to partition them into smaller, GitHub-friendly CSV files.


* We loaded each large CSV into a pandas DataFrame and counted its rows. We capped each smaller CSV at 200,000 rows, so our partitioning code produced 5 textbook CSVs and 3 course CSVs in the `/Resources` folder.


* `courses_1.csv   | 33 MB`
* `courses_2.csv   | 32 MB`
* `courses_3.csv   | 24 MB`

* `textbooks_1.csv | 49 MB`
* `textbooks_2.csv | 53 MB`
* `textbooks_3.csv | 51 MB`
* `textbooks_4.csv | 50 MB`
* `textbooks_5.csv | 54 MB`


* See `Partition_CSVs.ipynb` for the code that wrote these smaller CSVs to the `/Resources` folder.
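
The partitioning step can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the function name `partition_dataframe` and its parameters are hypothetical, and only the 200,000-row cap and the `courses_N.csv` / `textbooks_N.csv` naming come from this report. The cap is consistent with the row counts reported in the Transform section: 537,821 course rows need 3 files and 990,727 textbook rows need 5.

```python
import math
from pathlib import Path

import pandas as pd

MAX_ROWS = 200_000  # row cap per output file, as stated in this report


def partition_dataframe(df: pd.DataFrame, prefix: str, out_dir: Path,
                        max_rows: int = MAX_ROWS) -> list[Path]:
    """Write df as prefix_1.csv, prefix_2.csv, ... with at most max_rows rows each.

    Hypothetical helper; the real logic lives in Partition_CSVs.ipynb.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    n_parts = math.ceil(len(df) / max_rows)
    paths = []
    for i in range(n_parts):
        # Slice out the i-th block of rows by position
        part = df.iloc[i * max_rows:(i + 1) * max_rows]
        path = out_dir / f"{prefix}_{i + 1}.csv"
        part.to_csv(path, index=False)
        paths.append(path)
    return paths
```

Under these assumptions, a call such as `partition_dataframe(pd.read_csv("BNCollegeCourses_2022-02-05.csv"), "courses", Path("Resources"))` would write the three course files.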

### Transform


* After loading the smaller CSV files into pandas DataFrames (one for courses, one for textbooks), we applied the following transformations:

**courses data**

1. starting dataframe: `537821 rows × 15 columns`
2. dropped duplicate rows and unneeded columns
3. rendered textual data in all caps for readability
4. applied boolean masks to map `term` column values onto 4 categories: FALL, WINTER, SPRING, SUMMER
5. reindexed dataframe
6. transformed dataframe: `523968 rows × 8 columns`
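
The courses steps above might look something like the sketch below. It assumes that each raw `term` value contains the season name somewhere in its text (for example "Fall 2021"); the report does not show the raw values, so that matching rule is a guess. The function name and the `keep_columns` / `text_columns` parameters are hypothetical.

```python
import pandas as pd

SEASONS = ("FALL", "WINTER", "SPRING", "SUMMER")


def clean_courses(df: pd.DataFrame, keep_columns: list[str],
                  text_columns: list[str]) -> pd.DataFrame:
    """Hypothetical sketch of the courses transformations listed above."""
    # Step 2: drop duplicate rows, then keep only the needed columns
    out = df.drop_duplicates()[keep_columns].copy()
    # Step 3: upper-case the text columns
    for col in text_columns:
        out[col] = out[col].str.upper()
    # Step 4: one boolean mask per season (assumes the season name appears in the value)
    for season in SEASONS:
        mask = out["term"].str.contains(season, na=False)
        out.loc[mask, "term"] = season
    # Step 5: rows may collide after column removal, so de-duplicate again and reindex
    return out.drop_duplicates().reset_index(drop=True)
```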


**textbooks data**

1. starting dataframe: `990727 rows × 19 columns`
2. dropped duplicate rows and unneeded columns
3. rendered textual data in all caps for readability
4. dropped rows where the `title`, `ISBN`, and `price` columns were _all_ missing data
5. filled missing data in the `edition` and `publisher` columns with "unknown"
6. filled missing data in `ISBN` column with 0.0
7. filled missing data in `price` column with "0.01"
8. converted `ISBN` column from float to string
9. removed trailing ".0" from `ISBN` column data
10. renamed `ISBN` column to `isbn`
11. removed leading dollar signs, commas, and other non-numeric characters from `price` column data
12. converted `price` column from string to float so analysts can perform math in the database
13. reindexed dataframe
14. transformed dataframe: `645255 rows × 10 columns`
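
Steps 4 through 13 above can be approximated with the sketch below. It assumes the raw `ISBN` column is a float column and the raw `price` column holds strings such as "$1,234.50", which matches the conversions described in the list. The regular expression that keeps only digits and the decimal point is our own choice for "other non-numeric characters", and a price with no digits at all would raise an error at the float conversion. The function name is hypothetical.

```python
import pandas as pd


def clean_textbooks(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of textbook steps 4-13 listed above."""
    # Step 4: drop rows only when title, ISBN, and price are all missing
    out = df.dropna(subset=["title", "ISBN", "price"], how="all").copy()
    # Step 5: placeholder text for missing edition/publisher
    out[["edition", "publisher"]] = out[["edition", "publisher"]].fillna("unknown")
    # Steps 6-7: placeholder values for missing ISBN and price
    out["ISBN"] = out["ISBN"].fillna(0.0)
    out["price"] = out["price"].fillna("0.01")
    # Steps 8-10: float -> string, strip the trailing ".0", rename to lowercase
    out["ISBN"] = out["ISBN"].astype(str).str.replace(r"\.0$", "", regex=True)
    out = out.rename(columns={"ISBN": "isbn"})
    # Steps 11-12: keep digits and the decimal point only, then convert to float
    out["price"] = out["price"].str.replace(r"[^0-9.]", "", regex=True).astype(float)
    # Step 13: reindex
    return out.reset_index(drop=True)
```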

**merged courses and textbooks data**

1. merged the two transformed dataframes above with an inner join on 3 shared columns: `department_id`, `course_id`, `section_id`
2. dropped duplicate rows and unneeded columns
3. reindexed dataframe
4. transformed dataframe: `610343 rows × 12 columns`
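
A minimal sketch of the merge, using the three key columns named above. The function name and the `drop_columns` parameter are hypothetical, since the report does not list which columns were dropped after the merge.

```python
from typing import Sequence

import pandas as pd

JOIN_KEYS = ["department_id", "course_id", "section_id"]


def merge_courses_textbooks(courses: pd.DataFrame, textbooks: pd.DataFrame,
                            drop_columns: Sequence[str] = ()) -> pd.DataFrame:
    """Hypothetical sketch of the merge steps listed above."""
    # Step 1: inner join keeps only sections present in both dataframes
    merged = courses.merge(textbooks, on=JOIN_KEYS, how="inner")
    # Step 2: drop unneeded columns (caller-supplied) and duplicate rows
    merged = merged.drop(columns=list(drop_columns)).drop_duplicates()
    # Step 3: reindex
    return merged.reset_index(drop=True)
```

Because the join is an inner join, course sections without a matching textbook row, and textbook rows without a matching course section, do not appear in the result.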

### Load

* We used PostgreSQL to host our data in a database called `library_db`
* We then created the file `populate.sql` to create a table, `course_textbook`
* Lastly, we created a SQLAlchemy engine that connects pandas to the database and appends our dataframe to that SQL table
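
The append step might look roughly like the sketch below. Only the database name `library_db` and the table name `course_textbook` come from this report; the function name, user, password, host, and port are placeholders. It also assumes the table has already been created by `populate.sql`, with column names matching the dataframe.

```python
import pandas as pd
from sqlalchemy import create_engine, text


def load_course_textbook(df: pd.DataFrame, engine, table: str = "course_textbook") -> int:
    """Append df to the table and return the table's total row count afterwards.

    Hypothetical helper, not the project's actual loading code.
    """
    # if_exists="append" adds rows to the existing table instead of replacing it
    df.to_sql(table, con=engine, if_exists="append", index=False)
    with engine.connect() as conn:
        return conn.execute(text(f"SELECT COUNT(*) FROM {table}")).scalar()


# Placeholder connection URL; replace user, password, host, and port with real values:
# engine = create_engine("postgresql://user:password@localhost:5432/library_db")
# load_course_textbook(merged_df, engine)
```

Comparing the returned count against the dataframe's row count is a quick sanity check that the load completed.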