#### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.

##### Tools and technology
I chose to do the preliminary exploration purely in Pandas, and then do the heavy lifting in PySpark.

My reasons for using Pandas are simple: I'm more familiar with it, and its visualisation tooling is miles better than PySpark's.

PySpark makes sense as a tool for ETL: it's built for parallel processing of large amounts of data, and it makes that data easy to manipulate (it's also similar enough to Pandas that I can wrap my head around it). I'm storing my data in Parquet, which seemed like the best choice based on what I learnt in the course, and which PySpark supports well.
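As a rough illustration of that ETL shape, here is a minimal sketch rather than the project's actual pipeline. The function names, the paths in the usage note, and the `partition_col` argument are all hypothetical, and it assumes a CSV source with a header row.

```python
def build_session(app_name="capstone-etl"):
    """Create a local SparkSession (imported lazily, so this file loads without Spark)."""
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).master("local[*]").getOrCreate()


def csv_to_parquet(spark, source_path, target_path, partition_col):
    """Read a CSV file, drop empty and duplicate rows, and write partitioned Parquet."""
    df = spark.read.csv(source_path, header=True, inferSchema=True)

    # Assumption: fully empty rows and exact duplicates are safe to discard.
    cleaned = df.dropna(how="all").dropDuplicates()

    (
        cleaned.write
        .mode("overwrite")           # replace the output of a previous run
        .partitionBy(partition_col)  # one directory per distinct value
        .parquet(target_path)
    )
    return cleaned
```

With Spark installed, a call such as `csv_to_parquet(build_session(), "data/cases.csv", "output/cases.parquet", "country")` would run it locally; the file names and the `country` column are placeholders.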

I decided against using S3 for storage because my iteration cycles are a lot longer if I have to constantly write data to a remote destination instead of just doing it locally. I also very quickly went past the free allowance of S3 pushes/pulls and didn't want to spend any more money on it.

I'm not actually using Redshift for any of my evaluation. I've set up the table creation queries; all that's missing is copying the data over into Redshift and querying it there. I'll provide some example queries later on in the project (see notebook 6), but I'll run these against PySpark temporary views. The queries work on both PySpark and Redshift; it's just much faster to iterate when I'm running them locally.

If this isn't enough, I can upload my final data to S3, load it into Redshift and query it from there, but I don't really see the point.
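For reference, querying Parquet output through a temporary view could look roughly like the sketch below. The query, the view name `cases`, and the columns `country` and `new_cases` are placeholders, not the project's real schema (the actual queries are in notebook 6).

```python
# Hypothetical query: plain SQL with no engine-specific functions.
EXAMPLE_QUERY = """
SELECT country, SUM(new_cases) AS total_cases
FROM cases
GROUP BY country
ORDER BY total_cases DESC
"""


def query_parquet_locally(spark, parquet_path, view_name="cases", query=EXAMPLE_QUERY):
    """Register a Parquet dataset as a temporary view and run a SQL query on it.

    `spark` is an existing SparkSession. The view only lives for the
    duration of that session, so nothing is written anywhere.
    """
    df = spark.read.parquet(parquet_path)
    df.createOrReplaceTempView(view_name)
    return spark.sql(query)
```

Calling `.show()` on the returned DataFrame prints the result. Keeping the SQL this plain is what makes it plausible to reuse the same text against a Redshift table of the same name.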

##### Update frequency
This data should be updated either daily or weekly. New Covid-19 case data arrives every day, and potentially new weather data as well; when the source data changes, our representation of it should change too. On the other hand, this isn't critical data (at least to me), so a weekly update would still be fine.

##### Scenario: Data is 100x
Honestly, I wouldn't change much. Spark is built for this exact scenario. If anything, this would give me even more reason not to use S3/Redshift and to just do the work in an EMR cluster. Staging terabytes of data in Redshift isn't fast or cheap, since all of that data still has to be stored on the cluster's nodes; it would be better to partition it across multiple Spark nodes and run the queries there.
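To get a feel for what 100x means for partitioning, here is a back-of-envelope sketch. The 10 GiB starting size is a made-up figure, not a measurement of this project's data; the 128 MiB target matches the default of Spark's `spark.sql.files.maxPartitionBytes` setting.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def estimate_partitions(total_bytes, target_partition_bytes=128 * MIB):
    """Estimate how many input partitions a dataset of the given size splits into."""
    if total_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative, and the target must be positive")
    # Integer ceiling division; even a tiny dataset occupies one partition.
    return max(1, -(-total_bytes // target_partition_bytes))


current_size = 10 * GIB  # assumption for illustration only
print(estimate_partitions(current_size))        # 80
print(estimate_partitions(100 * current_size))  # 8000
```

Going from tens to thousands of partitions is a matter of adding workers rather than changing the code, which is the main reason I'd stay on Spark.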

##### Scenario: Dashboard updated daily at fixed time
If the data needs to be updated at regular intervals, I'd use Airflow (or Luigi, or another orchestration tool) to define my DAGs. I'd want to run my ETL pipeline daily, ideally overnight, and have an SLA (service-level agreement) that it'll be done by a set time (e.g. 7am). If that agreement is broken, I (or one of my team members) should receive an alert: this would mean that the pipeline hasn't completed on time, either because of an error or because the pipeline is too slow and needs to be optimised further.
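The SLA idea itself is simple enough to sketch without any orchestration tool. The function below is a hypothetical stand-in for what Airflow or a similar tool would check on my behalf; it assumes naive datetimes that are all in the same timezone.

```python
from datetime import date, datetime, time


def missed_sla(run_date, finished_at, deadline=time(7, 0)):
    """Return True if the run for `run_date` finished after the deadline, or never finished.

    `finished_at` is the completion timestamp, or None if the run has not
    completed (for example because it failed).
    """
    due = datetime.combine(run_date, deadline)
    return finished_at is None or finished_at > due


run_date = date(2020, 4, 2)
print(missed_sla(run_date, datetime(2020, 4, 2, 6, 45)))  # False: done before 7am
print(missed_sla(run_date, datetime(2020, 4, 2, 7, 10)))  # True: too slow
print(missed_sla(run_date, None))                         # True: never finished
```

Both failure modes mentioned above (an error, or a pipeline that is too slow) end up as `True` here, which is the point at which an alert should go out.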

##### Scenario: Database needs to be accessed by 100+ people
If there will be large amounts of simultaneous access, especially from different disciplines of end users, it might be beneficial to split the data into several databases instead, all fed from the same source-of-truth data that is kept up to date by the ETL pipeline. In jargon, we'd want to serve various data marts so that the end users can operate on the data they actually care about. Furthermore, this would reduce the load on any single database.
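A toy version of the data mart idea is sketched below, using SQLite from the Python standard library as a stand-in for a real warehouse. The table names, mart names, and rows are all made up, and the attached in-memory databases only imitate what would be separately hosted databases in practice.

```python
import sqlite3

# Made-up rows standing in for the source-of-truth table the ETL pipeline maintains.
SOURCE_ROWS = [
    ("2020-04-01", "DE", 6000, 9.5),
    ("2020-04-01", "IT", 4800, 13.0),
    ("2020-04-02", "DE", 6500, 10.0),
]


def build_marts(rows):
    """Load the source table, then derive one narrow mart per audience."""
    conn = sqlite3.connect(":memory:")
    # Each attached database plays the role of a separate data mart.
    conn.execute("ATTACH DATABASE ':memory:' AS epi_mart")
    conn.execute("ATTACH DATABASE ':memory:' AS weather_mart")

    conn.execute(
        "CREATE TABLE facts (day TEXT, country TEXT, new_cases INTEGER, avg_temp_c REAL)"
    )
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", rows)

    # Every mart carries only the columns its users care about.
    conn.execute(
        "CREATE TABLE epi_mart.cases AS SELECT day, country, new_cases FROM main.facts"
    )
    conn.execute(
        "CREATE TABLE weather_mart.temperatures AS "
        "SELECT day, country, avg_temp_c FROM main.facts"
    )
    conn.commit()
    return conn


conn = build_marts(SOURCE_ROWS)
total_de = conn.execute(
    "SELECT SUM(new_cases) FROM epi_mart.cases WHERE country = 'DE'"
).fetchone()[0]
print(total_de)  # 12500
```

The epidemiology users never touch the weather columns and vice versa, and each mart can be rebuilt from the source table whenever the pipeline runs.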