# Project Plan:
## Assumptions:
- I am going to assume that the source data for the recommender engine will not include user/customer information. A recommender needs user demographics, user behavior, and probably device data, none of which we have, so I assume that is what the data scientists will simulate. Because I don't know which variables they will want to change, I don't know which fields the tables will need; for now I am going to assume they will be creating user demographic data.
- I am going to assume that the data scientists can use pandas.
- I am going to assume that analysts can use notebooks or pgAdmin to interact with the db.
- I am going to assume that the data scientists do not want to use a notebook to interact with the db. I am going to assume that they want to use python scripts to interact with the db.

## Overall Objectives:
- Provide a database to support the simulation engine.
- Offer both direct SQL access and Python APIs for read and write operations.
- Create a demonstration recommender engine.

## Plan:

### 1. Evaluate and Select RDBMS:
#### Objective: Choose an RDBMS that fits the needs of the project.
##### PostgreSQL is a good choice for the recommender system for these reasons:
- Scalability: It scales vertically on larger hardware and horizontally through features such as read replicas and table partitioning, making it suitable for an expanding system.
- Performance: PostgreSQL is known for handling complex queries efficiently. This is particularly beneficial for a recommender system that requires complex data interactions.
- Community Support: With a strong open-source community, PostgreSQL provides extensive documentation, forums, and support, aiding in both development and troubleshooting.
- Ease of Integration with Python ORMs: PostgreSQL can be easily interfaced with Python Object-Relational Mapping (ORM) tools like SQLModel, which in turn pairs well with the FastAPI web framework, allowing for a more streamlined and efficient development process.
- In summary, the combination of scalability, advanced query performance, vibrant community support, and easy integration with popular Python ORMs makes PostgreSQL a fitting choice for building a recommender system that can evolve and adapt to complex data needs.

### 2.1 Create raw schema for raw original data ✅

### 2.2 Load raw data into raw schema in db to meet analyst request to see raw data via ad hoc queries.
1. ✅ Load into db (see load_raw_data.ipynb)
2. ✅ Create read only user for analysts.
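
The read-only analyst user could be set up with statements along these lines. This is a minimal sketch, not the script actually used: the helper `read_only_role_statements` and the names `analyst_ro`, `raw`, and `recommender` are assumptions for illustration. It only builds the PostgreSQL statements as strings; running them against the db is left to whatever admin connection is in use.

```python
def read_only_role_statements(role: str, schema: str, database: str) -> list[str]:
    """Build PostgreSQL statements that give `role` read-only access to `schema`.

    All three names are hypothetical; they are checked with a simple guard
    because identifiers cannot be passed as bound query parameters.
    """
    for name in (role, schema, database):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    return [
        # Password is set separately (for example with psql's \password).
        f"CREATE ROLE {role} LOGIN",
        f"GRANT CONNECT ON DATABASE {database} TO {role}",
        f"GRANT USAGE ON SCHEMA {schema} TO {role}",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {role}",
        # Covers tables created later by the role that runs this statement.
        f"ALTER DEFAULT PRIVILEGES IN SCHEMA {schema} GRANT SELECT ON TABLES TO {role}",
    ]


for statement in read_only_role_statements("analyst_ro", "raw", "recommender"):
    print(statement + ";")
```

Because no INSERT, UPDATE, or DELETE privilege is granted, analysts can run ad hoc queries on the raw schema without being able to change it.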

### 3. Design and Normalize the Database:
#### Objective: Transform the initial data into a suitable data model.
#### Tasks:
- ✅ Analyze initial data.
- ✅ Design tables and relationships.
- ✅ Normalize to at least 2NF to balance analytical needs and performance.
- ✅ Document assumptions and rationale for design choices.

#### Assumptions and rationale for schema design choices: 
(see ERD here: artifacts/recommender_db_erd_2023-08-31.png)

1. I have designed this as a snowflaked fact/dimension schema, with the fact_sessions table and the dim_users table being fabricated (with some fabricated data loaded for demo purposes). It is advisable to structure the data at the lowest grain possible (a session of content consumption), so that analysis can be as fine-grained as users might want, and can then be aggregated for reports or feature engineering.
2. The fact_sessions table has only quantitative data, other than foreign keys which connect to the dimension tables. This allows us to add dim tables or add fields to dim tables without having to change the fact table.
3. This schema preserves all the information that was in the original raw tables, but because the data is more normalized, fewer tables are needed. The boolean fields, "is_year_best", "is_all_time_best", "is_main_genre", and "is_main_country" capture all new information from the "best" raw tables.
4. It is highly debatable whether it is optimal to snowflake out dim_genres, dim_prod_coutries, and dim_credits, instead of leaving them denormalized in dim_titles. This is a compromise between ease of analysis and ease of creating simulations. If, after discussion with the data team, the scientists plan on using these tables for analysis, then I would denormalize them back into the dim_titles table. If they are going to be creating different simulations of data for each specific table, then I may leave them normalized, so that the scientists have fewer updates to make.
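
To make the fact/dimension layout concrete, here is a small sketch of how a session-grain fact table joins out to a snowflaked dimension and is then aggregated. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs anywhere. The table names and `is_main_genre` come from the design above, but the columns `minutes_watched`, `age_band`, and `title`, and all the rows, are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL db
conn.executescript("""
CREATE TABLE dim_titles (title_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE dim_genres (
    title_id INTEGER REFERENCES dim_titles(title_id),
    genre TEXT NOT NULL,
    is_main_genre INTEGER NOT NULL  -- boolean flag stored as 0 or 1
);
CREATE TABLE dim_users (user_id INTEGER PRIMARY KEY, age_band TEXT);
-- Fact table: foreign keys plus quantitative data only.
CREATE TABLE fact_sessions (
    session_id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES dim_users(user_id),
    title_id INTEGER REFERENCES dim_titles(title_id),
    minutes_watched REAL NOT NULL
);
""")
conn.executemany("INSERT INTO dim_titles VALUES (?, ?)",
                 [(1, "Title A"), (2, "Title B")])
conn.executemany("INSERT INTO dim_genres VALUES (?, ?, ?)",
                 [(1, "drama", 1), (1, "crime", 0), (2, "comedy", 1)])
conn.executemany("INSERT INTO dim_users VALUES (?, ?)",
                 [(10, "18-24"), (11, "25-34")])
conn.executemany("INSERT INTO fact_sessions VALUES (?, ?, ?, ?)",
                 [(100, 10, 1, 42.0), (101, 11, 1, 18.0), (102, 10, 2, 25.5)])


def minutes_by_main_genre(connection):
    """Aggregate session-grain facts up to each title's main genre."""
    rows = connection.execute("""
        SELECT g.genre, SUM(f.minutes_watched)
        FROM fact_sessions AS f
        JOIN dim_genres AS g
          ON g.title_id = f.title_id AND g.is_main_genre = 1
        GROUP BY g.genre
        ORDER BY g.genre
    """).fetchall()
    return dict(rows)


print(minutes_by_main_genre(conn))
```

Adding a new attribute, such as a device type, would mean adding a dim table or a dim column; the aggregation query against fact_sessions would keep working unchanged.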

### 3. Create relational schema for normalized data, create table schemas, load data into tables

#### 3.1. Set Up Database, Load Initial Data, Fabricate Session and User Data:
##### Tasks:
- ✅ Set up the selected RDBMS. 
- ✅ Create tables as designed.
- ✅ Load initial data. 
- ✅ Test with sample queries to ensure everything is working. 
- ✅ Create user with read/write access to the database.
- ✅ Fabricate session and user data for demo purposes.
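
Fabricating session and user data for the demo could look roughly like the following. This is a sketch under assumptions rather than the code that was actually run: the field names (`age_band`, `minutes_watched`), the value ranges, and the helper names are hypothetical. A seeded `random.Random` is used so that a demo load can be reproduced exactly.

```python
import random

AGE_BANDS = ["18-24", "25-34", "35-44", "45+"]  # hypothetical demographic field


def fabricate_users(n, rng):
    """Return n fake dim_users rows with sequential ids."""
    return [{"user_id": i, "age_band": rng.choice(AGE_BANDS)}
            for i in range(1, n + 1)]


def fabricate_sessions(users, title_ids, n, rng):
    """Return n fake fact_sessions rows that reference existing users and titles."""
    sessions = []
    for session_id in range(1, n + 1):
        user = rng.choice(users)
        sessions.append({
            "session_id": session_id,
            "user_id": user["user_id"],
            "title_id": rng.choice(title_ids),
            "minutes_watched": round(rng.uniform(1.0, 120.0), 1),
        })
    return sessions


rng = random.Random(42)  # fixed seed: the same rows come out on every run
users = fabricate_users(5, rng)
sessions = fabricate_sessions(users, title_ids=[1, 2, 3], n=20, rng=rng)
print(len(users), "users,", len(sessions), "sessions")
```

Drawing `user_id` and `title_id` only from rows that already exist keeps the fabricated facts consistent with the foreign keys in the schema.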

### 4. Enable interaction with the database through Python APIs.
#### Tasks:
- ✅ Implement read and write operations with SQL
- ✅ Create web api for read and write operations. (still needs update operations)
- ✅ Ensure security and validation of inputs. Done in web api
- ✅ Create Python APIs for read and write operations.
- ✅ Create testing for local python api (still needs error case testing)
- ✅ Create testing for web api (still needs error case testing)
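
As an illustration of the shape a local Python read/write API can take, the sketch below validates inputs and uses parameterized SQL. It is not the project's actual API: the function names and the `minutes_watched` column are assumptions, and SQLite stands in for PostgreSQL so the example is self-contained.

```python
import sqlite3


def connect_demo():
    """Open an in-memory SQLite db with a cut-down fact_sessions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact_sessions ("
        "session_id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, "
        "title_id INTEGER NOT NULL, minutes_watched REAL NOT NULL)"
    )
    return conn


def write_session(conn, user_id, title_id, minutes_watched):
    """Validate the inputs, insert one session, and return its new id."""
    for name, value in (("user_id", user_id), ("title_id", title_id)):
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer")
    if (isinstance(minutes_watched, bool)
            or not isinstance(minutes_watched, (int, float))
            or minutes_watched < 0):
        raise ValueError("minutes_watched must be a non-negative number")
    with conn:  # commits on success, rolls back if the insert raises
        cursor = conn.execute(
            "INSERT INTO fact_sessions (user_id, title_id, minutes_watched) "
            "VALUES (?, ?, ?)",  # placeholders keep values out of the SQL text
            (user_id, title_id, float(minutes_watched)),
        )
    return cursor.lastrowid


def read_sessions_for_user(conn, user_id):
    """Return one user's sessions as a list of dicts, oldest first."""
    rows = conn.execute(
        "SELECT session_id, title_id, minutes_watched FROM fact_sessions "
        "WHERE user_id = ? ORDER BY session_id",
        (user_id,),
    ).fetchall()
    return [{"session_id": s, "title_id": t, "minutes_watched": m}
            for s, t, m in rows]


conn = connect_demo()
write_session(conn, user_id=10, title_id=1, minutes_watched=42.0)
print(read_sessions_for_user(conn, 10))
```

Validating before the insert means bad input fails with a clear message instead of a database error, which also makes the error cases straightforward to test.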


### 5. Build a Demo Recommender Engine:
#### Objective: Showcase the functionality of the APIs and database.
#### Tasks:
- ✅ Develop a simple recommender engine.
- ✅ Implement reading and writing data using the developed APIs.
- ✅ Test with real or simulated user data.

#### 5.1. Develop a Simple Recommender Engine:
- Choose a recommendation algorithm that fits the scope of your demonstration. Collaborative filtering is a common approach that can be implemented relatively quickly.
- You can use libraries like Scikit-Surprise, which offers various recommendation algorithms.
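
To show what collaborative filtering does at its simplest, here is a user-based sketch in plain Python. It is deliberately not Scikit-Surprise and not necessarily the algorithm used in the demo engine: it scores unseen titles by a similarity-weighted average of other users' ratings, using cosine similarity. The ratings data and function names are made up for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two {title: rating} dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings, user, k=3):
    """Rank titles the user has not rated by similarity-weighted rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for title, rating in other_ratings.items():
            if title not in ratings[user]:
                scores[title] = scores.get(title, 0.0) + sim * rating
                weights[title] = weights.get(title, 0.0) + sim
    predicted = {t: scores[t] / weights[t] for t in scores}
    # Highest predicted rating first; ties broken alphabetically.
    return sorted(predicted, key=lambda t: (-predicted[t], t))[:k]


ratings = {  # hypothetical ratings on a 1-5 scale
    "u1": {"A": 5, "B": 4},
    "u2": {"A": 5, "B": 4, "C": 5},
    "u3": {"A": 1, "D": 3},
}
print(recommend(ratings, "u1"))
```

Here `u1` is recommended `C` ahead of `D`, because `C` carries the higher predicted rating and comes from `u2`, whose ratings closely match `u1`'s.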

#### 5.2. Implement Reading and Writing Data Using APIs:
- Within your recommender engine, make HTTP requests to the APIs you developed in Part 4 to read and write data.
- For reading, you might retrieve user preferences, historical ratings, or other relevant information.
- For writing, you might store predictions or user feedback.
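
The HTTP calls from the recommender to the web api might be structured like this, using only the standard library. The base URL, the `/sessions` and `/predictions` routes, and the payload fields are all assumptions for illustration; the real routes are whatever the web api from step 4 exposes. Building the request is kept separate from sending it so the request shape can be checked without a running server.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # assumed address of the web api


def build_read_request(user_id, base_url=BASE_URL):
    """GET request for one user's history (hypothetical /sessions route)."""
    query = urlencode({"user_id": user_id})
    return Request(f"{base_url}/sessions?{query}", method="GET")


def build_write_request(prediction, base_url=BASE_URL):
    """POST request storing a prediction (hypothetical /predictions route)."""
    body = json.dumps(prediction).encode("utf-8")
    return Request(
        f"{base_url}/predictions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=5.0):
    """Send a prepared request and decode the JSON response body."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


read_request = build_read_request(user_id=10)
write_request = build_write_request({"user_id": 10, "title_id": 3, "score": 4.5})
print(read_request.get_method(), read_request.full_url)
print(write_request.get_method(), write_request.full_url)
```

With the api running, `send(read_request)` would return the decoded JSON; nothing is sent when the requests are only built and printed as above.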

#### 5.3. Test with Real or Simulated User Data:
- Create tests that mimic real user interactions, or use a dataset that resembles what real users might provide.
- Ensure that the recommender engine can make reasonable predictions and that the read and write operations function correctly.

### 6. Provide Improvement Suggestions:
#### Objective: Analyze the solution and propose improvements.
#### Tasks:
- ✅ Review the entire solution.
- ✅ Identify areas for potential improvements.
- ✅ Document suggestions.

##### Ideas for improvements:
- Generally make the database and apis more robust and secure.
    - Dockerize the database and apis. Or, if not dockerize the database, then migrate the database to a cloud service.
    - Add error handling to local api
    - Complete testing, with error case testing, for all apis.
- If this system were going to break, how would it break? Consider a systematic way to discover where in the code each change would have ramifications. This involves defining likely error cases very specifically, improving code in relevant areas, and then testing for them.
    - A change to db could easily cause the connections to break or user permissions to go haywire.
    - A change to any of the models would have ramifications in any of the apis that use them. 
    - A change to the file/folder structure could break imports and file paths.
    - Haven't done enough error testing to see, but passing the wrong data type to the apis could cause problems.
    - Third party module changes or other dependencies could cause problems.
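
One way to start on the missing error-case testing is to pin down the wrong-data-type case with a few `unittest` tests. The sketch below is an assumption about how that could look, not existing project code: `parse_session_payload` is a hypothetical validator standing in for whatever the apis do with incoming data.

```python
import unittest


def parse_session_payload(payload):
    """Hypothetical validator: check field presence and types, return a clean dict."""
    if not isinstance(payload, dict):
        raise TypeError("payload must be a dict")
    cleaned = {}
    for field, expected in (("user_id", int), ("title_id", int),
                            ("minutes_watched", float)):
        if field not in payload:
            raise ValueError(f"missing field: {field}")
        value = payload[field]
        if isinstance(value, bool):  # bool is an int subclass; reject it
            raise TypeError(f"{field} must not be a bool")
        if expected is float and isinstance(value, int):
            value = float(value)  # whole-number minutes are acceptable
        if not isinstance(value, expected):
            raise TypeError(f"{field} must be {expected.__name__}")
        cleaned[field] = value
    return cleaned


class ParseSessionPayloadErrors(unittest.TestCase):
    def test_wrong_type_is_rejected(self):
        with self.assertRaises(TypeError):
            parse_session_payload(
                {"user_id": "10", "title_id": 1, "minutes_watched": 5.0})

    def test_missing_field_is_rejected(self):
        with self.assertRaises(ValueError):
            parse_session_payload({"user_id": 10, "title_id": 1})

    def test_valid_payload_passes(self):
        cleaned = parse_session_payload(
            {"user_id": 10, "title_id": 1, "minutes_watched": 5})
        self.assertEqual(cleaned["minutes_watched"], 5.0)


def run_error_case_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        ParseSessionPayloadErrors)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_error_case_tests()
```

Each likely failure listed above (db changes, model changes, dependency changes) could get the same treatment: define the error case precisely, then write a test that fails when it occurs.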


### Bonus Challenges (Optional):
✅ Write unit and integration tests for the solution. (partially complete)