<a href="https://colab.research.google.com/github/christopherhuntley/DATA6510/blob/master/L6_Big_Data_Pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/christopherhuntley/DATA6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **DATA 6510**
# **Lesson 6: Big Data Pipelines** 
_A Visit to the Data Sausage Factory_

## **Learning Objectives**
### **Theory / Be able to explain ...**
- The seven Big Data challenges
- The function of each stage of the Source $\rightarrow$ Lake $\rightarrow$ Warehouse $\rightarrow$ Mart $\rightarrow$ Apps data pipeline
- The basic elements of an ETL process
- The purpose and advantages of columnar database technology

### **Skills / Know how to ...**
- Create Pivot Table-like queries in SQL from any Dimensional Data Warehouse
- Use SQL to import online data into Tableau

--------

## **BIG PICTURE: Business Intelligence is the tip of a long spear**
Business intelligence (BI) is about generating actionable insights from data. It informs decision making at all levels of a firm: Given what we know now, how did it get that way and what can we do to get the results we desire? In other words, how can we use data to make us *smarter*? 

It is important to draw a distinction between BI and the broader world of Business Analytics. BI is **descriptive** (about things that are happening or have already happened) and to some degree **prescriptive** (what can be done now) but **very rarely predictive**. Where it does employ predictions, the models are more likely to be developed and tuned by data scientists with highly specialized training in machine learning and computer science.  

An *apropos* analogy is the difference between accounting and finance. Accounting models are about accurately capturing the *past* and present (so it can be distilled into financial reports), while financial models are about the present and the *future* (so we can make financial decisions that affect the future). While it is certainly possible for an accountant to know a lot about finance and a financial analyst to know a lot of accounting, they are nonetheless different professions. 

It's hard not to chuckle when people lump business intelligence in with database systems. Data visualization, probably *the* key feature for most BI apps, is in the presentation tier, not the data tier. Given access to lots of preprocessed data, with all the anomalies smoothed out, a BI dashboard allows us to see whatever stories the data can tell. That's very important $-$ literally why we collect, clean, and protect the data in the first place $-$ but a BI app is about as much like a database app as Wiffle ball is like baseball. They really are different things, with vastly different skill sets.

![BI Pipelines](https://github.com/christopherhuntley/DATA6510/raw/master/img/L9_Data_Pipelines.png)

The process of collecting, curating, and repackaging data for consumption by business analysts is called a **pipeline**. Starting with raw data from various sources $-$ more on that in a moment $-$ the process flows through several layers of processing to distill the data down to just what the business analyst needs. While much of this "sausage making" process is designed and managed by data engineers, it is (or should be) with the guidance and insights of data analysts. After all, who else really knows what the end product is supposed to be? 

In this lesson we will focus on data pipelines, starting with the general challenges and then highlighting the functional and technical differences between the layers.    

---
## **The Seven *V*s of Big Data**

It has long been tradition to characterize the unique challenges presented by big data with a laundry list of adjectives, all starting with the letter *V*. As we have gotten more familiar with the pitfalls of massive datasets, the  list has [expanded to seven](https://towardsdatascience.com/modern-unified-data-architecture-38182304afcc), each motivated by plenty of war stories:

- **Variety**. Data can come from various disconnected sources, each of which makes its own assumptions about the domain. Even within a given domain, assumptions may change over time. So, what was perfectly valid data a decade ago is not necessarily useful today. 
- **Velocity**. Modern online systems are really, really good at collecting data. It is like the data tier is trying to "drink from a firehose" as it sorts through it all in real-time. Anything that is not stored *immediately* is lost, and not everything can be stored. We have to make hard decisions sometimes. 
- **Volume**. The sheer size of the datasets can be difficult to deal with. We saw that when trying to load the database in the IMDB homework, where the server crashed before the load was complete. In extreme cases, a dataset can be large enough that only part of the data can be loaded at a time. Otherwise, we risk completely filling up the server.  
- **Visibility**. There are always going to be new uses for a given dataset. Each time we try to support a new app or new model we may need a new interface, with data formatted a certain way, perhaps with data that has never been shared before. That opens up all sorts of potential bugs and other pitfalls. 
- **Veracity**. Data can lie. No matter how much we can try to validate every fact, some bugs are going to remain. These kinds of bugs are unknowable and won't surface until uncovered by an application that tries to use the data. 
- **Vulnerability**. As businesses become more and more dependent on data, the data itself becomes more and more enticing for cyber criminals. Even if they don't hold the data hostage, they may steal secrets that are best left hidden. 
- **Value**. It is tempting to just dump everything you have into an over-engineered uber data warehouse. However, needs will inevitably change, requiring continual redesign. Maintenance is then a continual process, with each added design feature a potential liability. If a feature does not provide value, then kill it before it becomes a problem instead of a solution. 

While any given dataset may be subject to any of these issues, the risks grow quickly with more data, since every additional row is another chance for something to go wrong. In other words ...

**If it's big data, assume it's pig ugly and expect to spend 80% of your time applying lots of lipstick.**


 


## **Lakes, Warehouses, and Marts**

An enterprise-scale pipeline starts in the **data lake** layer, progresses through to a **data warehouse** layer, and then ends with a **data mart** layer. It is possible that a given company may combine or even skip one of these layers, of course. 

Many of the most common architectural choices are laid out in the table below. 

|  | Data Lake | Data Warehouse | Data Mart |
|---|---|---|---|
| **Objective** | Retain Everything | Curate & Integrate Content | Package for Use |
| **User Access** | Read-Write | Read-only | Read-Write (views and extracts) |
| **Structured Data**| SQL | SQL, OLAP Cubes | SQL, spreadsheets |
| **Semi-Structured Data** | JSON, Docs, NoSQL, APIs| SQL (with Extensions), Object-Relational DBs | SQL (with Extensions), NoSQL, Spreadsheets |
| **Unstructured Data** | Docs, Text Files, Streams | N/A | NoSQL, Reports |
| **Time Scale** | Now/Online | Historical, Online | Historical |
| **Storage Requirements** | Petabytes/Terabytes | Terabytes/Gigabytes | Gigabytes/Megabytes |
| **Storage Strategy** | Files + Row Stores | Column Stores | Documents / Various |
| **Example Tech** |  MySQL, PostgreSQL, AWS DynamoDB, etc. | GCP BigQuery, AWS Redshift, etc. | Google Sheets, DropBox Files, etc.|

### **Functional Differences**

A **data lake** is a repository for just about any kind of data. A conscious effort is made to retain the data in its original state, including all the bugs and other errors. 

A **data warehouse** is intended as the one true source for all data. It draws from the data lake but then uses business logic to clean it of errors, enrich it with summary facts, and integrate it into a coherent whole. The cleaning and integration are performed as part of the ETL process that loads data into the warehouse. It is important to note that all data enters the data warehouse through the ETL processes. To everything else, access to the data warehouse is strictly read-only.

**Data marts** are generated (extracted) from the data warehouse. Since the warehouse ensures that the data is clean and consistent, there is no need to keep data in a normalized form. In fact, it is usually best to denormalize the data as much as possible (i.e., one table with lots of possibly redundant columns) so as to avoid complex query logic. We can safely do this because the data is never fed back into the data warehouse. The extraction processes that feed the data marts are read-only users just like everybody else.  

> **Data Lakehouses** are a new variation on the traditional Lake/Warehouse/Mart architecture where the Lake $\rightarrow$ Warehouse ETL process is continuous, with all data in the Lake immediately available in the Warehouse. Effectively, this integrates the Lake and Warehouse into a single *Lakehouse*. It also allows analysts to *reverse ETL* analytical results from BI apps back into the Lakehouse, further integrating the entire pipeline. 

### **Data Structure and Format Differences**
How the data is stored and processed depends in part on the degree of structure.

**Structured data** is organized into datasets with a fixed data model (*schema*). Ideally, the data comes from a well-designed relational database so that we can at least assume that it is free of anomalies. If a change is made to the schema then the data is restructured to match. Thus, we always know what the schema is *before writing new data* to a dataset.  

In general, structured data is best kept in a relational database, though there are sometimes reasons to consider alternatives. For example, graph databases are excellent at storing geo-spatial data. It is still highly structured, except the schema is not relational. 

**Semi-structured data** has metadata and perhaps a schema but there is no attempt to retain consistency over time. In other words, each datum may have its own schema, which *might not be known until the data is read*. This of course can cause some data integration and retrieval/search issues. 

The classic example of a semi-structured data format is JSON, which organizes data into hierarchies (trees) of indeterminate depth. It *is* certainly possible to have a consistent structure in JSON but it is not guaranteed.
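
As a small illustration of "not known until the data is read," here is a minimal Python sketch using only the standard library. The three records and the `get_points` helper are made up for this example; the point is only that each JSON document carries its own structure, which we discover while parsing it.

```python
import json

# Hypothetical semi-structured records: each line is valid JSON,
# but no two records share exactly the same set of fields.
raw_lines = [
    '{"player": "A. Smith", "points": 12}',
    '{"player": "B. Jones", "points": 7, "assists": 3}',
    '{"player": "C. Lee", "stats": {"points": 20, "rebounds": 9}}',
]

records = [json.loads(line) for line in raw_lines]

# Schema-on-read: the field names are only known after parsing each record.
schemas = [sorted(rec.keys()) for rec in records]
all_fields = sorted(set().union(*schemas))


def get_points(rec):
    """Look for 'points' at the top level, then inside a nested 'stats' object."""
    if "points" in rec:
        return rec["points"]
    return rec.get("stats", {}).get("points")


points = [get_points(rec) for rec in records]
print(all_fields)
print(points)
```

Every consumer of this data needs logic like `get_points` to cope with the variations, which is exactly the integration and search burden described above.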

**Unstructured data** does not have any schema at all. Raw text from a social media stream, for example, can be about anything. Similarly, photos that have not been tagged for content are just collections of pixels and lines. Data retrieval then becomes a matter of luck and intuition rather than a repeatable process. 

### **Performance Requirements and Technology Differences**
At Big Data scale raw performance matters, and how to best deliver that performance depends on where we are in the pipeline. 

In a data lake the emphasis is capturing and storing (writing) data in close to real time. That means any storage strategy needs to accept *serialized* data in the order it is collected. Files are transmitted in serialized form anyway, so they can be stored as is. Relational data (i.e., in a relational database) comes in one transaction at a time, adding rows with each transaction.  MySQL, for example, is tuned to operate this way. It writes (and reads) individual rows of data really quickly. 

In a data warehouse the emphasis is on making queries run as fast as possible. Since most queries only use a few columns at a time, the best way to [structure the data is in columns](https://en.wikipedia.org/wiki/Column-oriented_DBMS). As we will see later in the lesson, BigQuery uses this strategy to good effect, sometimes offering orders of magnitude speedup over MySQL for the same `SELECT` query. (Python programmers should note that *pandas* also uses a column-oriented strategy: a DataFrame is roughly equivalent to a dictionary of lists, one list per column.)
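
The difference between the two layouts can be sketched in plain Python. This is only a toy model with made-up data (real database engines work with disk pages, not Python lists), but it shows why reading one column is cheap in a column store while adding one row is cheap in a row store.

```python
# The same three hypothetical rows stored two ways.
rows = [  # row store: one tuple per row
    ("g1", "BOS", 2),
    ("g1", "LAL", 3),
    ("g2", "BOS", 0),
]

columns = {  # column store: one list per column
    "game_id": ["g1", "g1", "g2"],
    "team": ["BOS", "LAL", "BOS"],
    "points": [2, 3, 0],
}

# Row store: every row is visited in full, even though only one field is used.
total_from_rows = sum(row[2] for row in rows)

# Column store: only the 'points' list is touched.
total_from_columns = sum(columns["points"])

# Appending one row is a single operation in the row store ...
rows.append(("g2", "LAL", 2))
# ... but one operation per column in the column store.
for name, value in zip(columns, ("g2", "LAL", 2)):
    columns[name].append(value)
```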

In a data mart the best strategy depends entirely on the applications. In many cases it might be best to avoid using a database at all, making data available as CSV files, spreadsheets, or other document formats. In any case, the datasets (or databases) themselves are rarely large enough to make performance an issue.


---
## **Data Warehouses are Designed for Analytics**
Any data warehouse is only as useful as the questions it answers. Whenever possible it should 
- make calculating aggregate statistics as easy as possible using simple sums, averages, etc.
- allow the data to be grouped (labeled) in various ways that make sense to analysts
- allow statistics to be disaggregated to identify the base-level source data
- use keys and other indexes that are **idempotent** (i.e., time invariant) so that a report from years ago can be rerun today without a major redesign

These requirements lead most naturally to a star schema design where:
- There is a central **fact table** with possibly many columns of precomputed aggregable measures (facts) and dimensional labels (foreign keys) that can be used to describe and categorize the facts. The fact table is fine grained, with the facts as atomic as possible. In the stereotypical data warehouse the fact table might have millions of rows. It is also likely to be highly normalized to eliminate potential double-counting errors due to data redundancy. 
- The fact table is surrounded by **dimension tables** that provide the labels and possibly more descriptive detail. The dimension tables are denormalized to eliminate unnecessary relationships. Usually, each dimension is much smaller than the fact table, with a modest number of rows that rarely change.  

We call such databases **Dimensional Data Warehouses**, about which we will go into more detail in Lesson 7. 

### **NBA PlayFacts Data Warehouse**
The ERD for the NBA PlayFacts warehouse is shown below:
- `PlayFact` corresponds to `Event` in the original `PlayLog` data. In a `SELECT` query the facts are what we would aggregate to calculate statistics like a box score. 
- The `Game`, `Team`, ... dimensions surrounding `PlayFact` represent different ways to aggregate the facts. In a `SELECT` query we would join in these tables as needed, using the primary keys in the `GROUP BY` clause. 
- The `players_list` attribute is literally a text string with a listing of the players in alphabetical order.  
- Any relationships between the dimensions (e.g., games and teams) have been eliminated through selective denormalization.

![NBA PlayFacts Dim DW](https://github.com/christopherhuntley/DATA6510/raw/master/img/L9_NBA_PlayFacts_Star_DW_v2.png)

So, why do we want this dimensional design? Because it reduces the vast majority of queries to something like this (where `PlayFact` type $\rightarrow$ `play_facts` table):
```sql
SELECT grouping_columns, aggregate_columns 
FROM play_facts
      JOIN games USING (game_id)
      JOIN teams USING (team_id)
      JOIN lineups USING (lineup_id)
      JOIN players USING (player_id)
      JOIN play_segments USING (play_seg_id)
      JOIN event_types USING (event_type_id)
WHERE row_conditions
GROUP BY grouping_columns
HAVING group_conditions
```

The joins, which are where most `SELECT` queries go awry, can be the same every time. (Why? Because these joins never add rows to the resultset, they don't introduce double-counting anomalies.) All the analyst has to do is fill in a few details into the template:
- the columns to use for the grouping
- the columns and functions for the aggregates
- the conditions for the rows and groups

These details can be configured via a form interface, which usually ends up looking a lot like the ones used for Excel PivotTables (or [Count](https://count.co) if you are more notebook-inclined). What could be more convenient (and bulletproof) than that?
![](https://github.com/christopherhuntley/DATA6510/raw/master/img/L2_excel_pivot_table.png)

*Source: the MS Excel documentation.*
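
To make the query template above concrete, here is a minimal, runnable sketch using Python's built-in `sqlite3` module. The tiny schema is a simplified stand-in for the NBA PlayFacts warehouse: only two dimensions are included, and the `season` and `team_name` columns and all of the data are assumptions made for illustration. It fills in the template once and also checks the claim that joining dimensions to the fact table does not add rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE games (game_id INTEGER PRIMARY KEY, season INTEGER);
CREATE TABLE teams (team_id INTEGER PRIMARY KEY, team_name TEXT);
CREATE TABLE play_facts (
    play_fact_id INTEGER PRIMARY KEY,
    game_id INTEGER REFERENCES games (game_id),
    team_id INTEGER REFERENCES teams (team_id),
    p_points INTEGER
);
INSERT INTO games VALUES (1, 2007), (2, 2008);
INSERT INTO teams VALUES (10, 'BOS'), (20, 'LAL');
INSERT INTO play_facts VALUES
    (1, 1, 10, 2), (2, 1, 20, 3), (3, 1, 10, 0),
    (4, 2, 10, 3), (5, 2, 20, 2), (6, 2, 20, 2);
""")

# Each fact row matches exactly one row per dimension,
# so the joins leave the row count unchanged.
fact_rows = con.execute("SELECT COUNT(*) FROM play_facts").fetchone()[0]
joined_rows = con.execute("""
    SELECT COUNT(*)
    FROM play_facts
         JOIN games USING (game_id)
         JOIN teams USING (team_id)
""").fetchone()[0]

# The template with the grouping columns, aggregate, and conditions filled in.
box = con.execute("""
    SELECT season, team_name, SUM(p_points) AS total_points
    FROM play_facts
         JOIN games USING (game_id)
         JOIN teams USING (team_id)
    WHERE p_points > 0
    GROUP BY season, team_name
    HAVING SUM(p_points) >= 3
    ORDER BY season, team_name
""").fetchall()

print(fact_rows, joined_rows)
print(box)
```

Only the `SELECT` list, `WHERE`, `GROUP BY`, and `HAVING` parts change from one report to the next; the `FROM ... JOIN` block stays the same.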











## **Extract / Transform / Load (ETL) Processes**

Data warehouses are meant to be the "single source of truth," integrating all available data. Thus, they are by design based on data collected from somewhere else. Here we will consider two different possibilities for the NBA PlayFacts warehouse. Then we will show how data can follow a similar ETL process to create data marts for specific purposes. 

### **Transactional RDBMS $\rightarrow$ Dimensional DW**

![Trans 2 Dim ETL](https://github.com/christopherhuntley/DATA6510/raw/master/img/L9_ETL_from_RDBMS_v2.png)

Ideally, the source data is already in a DBMS with properly normalized tables (as shown on the left). Such a database is designed to allow new data to be added to it without much risk of corruption. From there it is fairly straightforward to build the data warehouse on the right, where some data may be safely duplicated for convenience if desired.

> **Pulse Check**  
> Study the "Transactional DB" on the left.
> - Can you spot the subtype/supertype relationship?
> - Why do we allow multiple events per play segment? Can you give an example? 
> - Can you guess an alternate key for the Teams table?   
> - As each play segment is recorded, how many tables are (usually) written to? 
> - Which tables could be written to before the start of the game? 
> - How many joins would be necessary to recreate the boxscore from HW2?


To build the data warehouse on the right, we will need to **transform** the transactional data to fit the new table schema. 

We can use SQL `INSERT` and `UPDATE` queries to populate the warehouse tables (right) directly from the existing database (left). 
- The `play_facts` table is mostly based on the original `events` table, with information from the `play_segments` and `shot_events` tables blended in for convenience. Since the transactional database was highly normalized there is little chance of denormalization causing a data anomaly. 
- The dimension tables are just slightly denormalized versions of their equivalents in the transactional database.

The following partial SQL snippet populates the `play_facts` table from tables extracted from the original database:
```sql
-- populating a table in the data warehouse (right)
INSERT INTO play_facts (game_id, team_id, ..., p_points, ...)
-- using data from the transaction database (left)
SELECT game_id, team_id, ...,
       CASE ... END AS p_points,
       ...
FROM events
     JOIN play_segments USING (play_seg_id)
     LEFT JOIN shot_events USING (event_id);
```

There is actually quite a bit of *transformational* code hidden behind the ellipses (...) but this shows the general pattern: keys copied from existing tables with statistical facts calculated with CASE expressions and functions. 

To keep the queries from getting out of hand, it may be simpler to work in stages, with a first stage that creates the fact rows with the necessary keys, a second stage that updates some of the fact columns, a third stage that handles a few more, etc. This allows each query to be relatively simple, working on a few columns at a time. It also allows some of the earlier stages to inform the later ones, building up complex calculations from simpler ones. 
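
The staged approach might look something like the sketch below, again using Python's built-in `sqlite3` module. This is not the actual PlayFacts ETL code: the source tables are cut down to a few columns, and the `shot_value` and `made` columns (along with all of the data) are hypothetical. Stage 1 creates the fact rows with their keys; stage 2 fills in one measure with a `CASE` expression.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- simplified transactional tables (hypothetical columns and data)
CREATE TABLE play_segments (play_seg_id INTEGER PRIMARY KEY, game_id INTEGER);
CREATE TABLE events (event_id INTEGER PRIMARY KEY, play_seg_id INTEGER,
                     team_id INTEGER, event_type TEXT);
CREATE TABLE shot_events (event_id INTEGER PRIMARY KEY,
                          shot_value INTEGER, made INTEGER);
-- warehouse fact table; p_points stays NULL until stage 2
CREATE TABLE play_facts (event_id INTEGER PRIMARY KEY, game_id INTEGER,
                         team_id INTEGER, p_points INTEGER);

INSERT INTO play_segments VALUES (100, 1), (101, 1);
INSERT INTO events VALUES
    (1, 100, 10, 'shot'), (2, 100, 20, 'rebound'), (3, 101, 20, 'shot');
INSERT INTO shot_events VALUES (1, 3, 1), (3, 2, 0);
""")

# Stage 1: create the fact rows with their keys.
con.execute("""
    INSERT INTO play_facts (event_id, game_id, team_id)
    SELECT event_id, game_id, team_id
    FROM events
         JOIN play_segments USING (play_seg_id)
""")
after_stage_1 = con.execute(
    "SELECT event_id, p_points FROM play_facts ORDER BY event_id"
).fetchall()

# Stage 2: compute one measure. Made shots score their value;
# missed shots and non-shot events score zero.
con.execute("""
    UPDATE play_facts
    SET p_points = COALESCE(
        (SELECT CASE WHEN s.made = 1 THEN s.shot_value ELSE 0 END
         FROM shot_events AS s
         WHERE s.event_id = play_facts.event_id),
        0)
""")
facts = con.execute("""
    SELECT event_id, game_id, team_id, p_points
    FROM play_facts
    ORDER BY event_id
""").fetchall()
print(facts)
```

Each stage touches only a few columns, so it can be tested on its own before the next stage is added.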

### **Data Files $\rightarrow$ Dimensional DW**

![ETL From Files](https://github.com/christopherhuntley/DATA6510/raw/master/img/L9_ETL_from_files.png)

In the absence of a normalized database (e.g., with CSV files), the ETL process is often a bit more laborious. 
- Since the data has not been normalized there is a high potential for data integrity errors. Just one misspelling, negative clock time, missing event type, etc. could trigger a series of errors that we really don't want in our data warehouse. 
- The data may be spread among many files, with slightly different file formats, column names, etc. In the case of the original NBA PlayLog data, there were 20,581 CSV files covering 16 years. The reporting standards had changed a bit over the years. For example, dates went from `mm/dd/yyyy` format to `yyyy-mm-dd` format in 2008.

Given the logical complexities of loading, integrating, validating, and cleaning a data set of this size, we cannot rely on a manual process. In fact, since there will likely be unexpected errors along the way that invalidate everything, we should expect to have to debug and repeat the process *from the beginning* many times before we are done. 
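
As a rough idea of what such automated cleaning involves, here is a minimal sketch using only the Python standard library. It is not the actual NBA PlayLog code: the two in-memory "files," their column names, and the validation rules are assumptions chosen to mirror the problems listed above (changing date formats, renamed columns, negative clock times, missing event types).

```python
import csv
import io
from datetime import datetime

# Two hypothetical source files (as in-memory strings) with different conventions.
old_file = "Date,EventType,Clock\n11/02/2007,shot,650\n11/02/2007,,640\n"
new_file = "date,event_type,clock\n2008-10-28,rebound,-5\n2008-10-28,shot,700\n"

# Map each known header variant to one standard column name.
COLUMN_NAMES = {"Date": "date", "EventType": "event_type", "Clock": "clock"}
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")


def parse_date(text):
    """Try each known date format and return an ISO yyyy-mm-dd string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def clean(file_text):
    """Yield (row, error) pairs; error is None for rows that pass validation."""
    for raw in csv.DictReader(io.StringIO(file_text)):
        row = {COLUMN_NAMES.get(key, key): value for key, value in raw.items()}
        try:
            row["date"] = parse_date(row["date"])
            row["clock"] = int(row["clock"])
            if row["clock"] < 0:
                raise ValueError("negative clock time")
            if not row["event_type"]:
                raise ValueError("missing event type")
            yield row, None
        except ValueError as err:
            yield row, str(err)


good, rejected = [], []
for text in (old_file, new_file):
    for row, error in clean(text):
        if error is None:
            good.append(row)
        else:
            rejected.append((row, error))

print(len(good), "rows loaded;", len(rejected), "rows rejected")
```

Keeping the rejected rows, along with the reason for each rejection, makes it much easier to debug and rerun the whole process from the beginning.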

The NBA PlayLog ETL process was implemented in Python, making liberal use of pandas to do SQL-like tasks like indexing rows, checking columns for uniqueness, detecting invalid player names, etc.  Essentially, pandas DataFrames are used as an intermediary that acts like an in-memory data warehouse before exporting data into data marts. 

The snippet of code below loads each of the 20,581 CSV files, does some minor corrections, and dumps each year of data as a fully denormalized fact table:

![Load CSVs in Python](https://github.com/christopherhuntley/DATA6510/raw/master/img/L9_PlayLog_load_Python.png)
   
The dimensional data is still there, just blended in with the facts. Each yearly CSV file is monstrously huge but still fits in 8 gigabytes of memory. A better solution is to use something like Google BigQuery to hold the data in a columnar database (with year-by-year partitions), but there will always be Pythonistas who prefer CSV files to SQL queries.   

### **Data Warehouse $\rightarrow$ Data Marts**

Creating a data mart from a data warehouse is generally pretty straightforward. In many cases it comes down to a few `SELECT` queries with strategically chosen `GROUP BY` clauses and aggregate calculations, followed by an export to a CSV, spreadsheet, etc. 

For the NBA PlayFacts warehouse, we could create a data mart based on just about any of the dimensions: 
- By game, season, or year
- By player, lineup, or team
- By event_type, period, or even existence of a phrase in the play description

All we need is the grouping logic and a process to calculate the group-wise aggregates. 
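
A minimal sketch of that process, using Python's built-in `sqlite3` and `csv` modules, is shown below. The table is a drastically reduced, hypothetical version of a denormalized fact table (the `season` column and all of the data are made up), and the CSV is written to an in-memory buffer rather than a real file.

```python
import csv
import io
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE play_facts_all (lineup_id INTEGER, season INTEGER,
                             play_length_mins REAL, p_points INTEGER);
INSERT INTO play_facts_all VALUES
    (1, 2007, 1.5, 2), (1, 2007, 0.5, 3), (1, 2008, 2.0, 0),
    (2, 2007, 1.0, 2), (2, 2007, 1.0, 2);
""")

# Roll the facts up to one row per lineup and season.
cursor = con.execute("""
    SELECT lineup_id, season,
           COUNT(*) AS plays,
           SUM(play_length_mins) AS play_length_mins,
           SUM(p_points) AS p_points
    FROM play_facts_all
    GROUP BY lineup_id, season
    ORDER BY lineup_id, season
""")
header = [col[0] for col in cursor.description]
mart_rows = cursor.fetchall()

# Export the rolled-up rows as CSV text, ready for Excel or Tableau.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(mart_rows)
mart_csv = buffer.getvalue()
print(mart_csv)
```

Swapping the `GROUP BY` columns (say, to team and game) yields a different data mart from the same warehouse table.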

The `lineup_facts` dataset from Lesson 3 was created to support a research project on the effect of teamwork in basketball. It includes a fully denormalized table (facts w / dimension columns) `play_facts_all`, from which we can calculate various statistics about the usage and efficiencies of basketball lineups over the years. 

When working with the data in Excel or Tableau, however, it is much easier to use a smaller version that is aggregated by lineup and year. It just barely fits into MS Excel, with just over 200K rows and 35 columns. Besides summarizing (rolling up) the original `play_facts_all` columns, the `season_facts_all` table includes a new column, `plus_minus_36m`, that is needed for the research questions that motivated the study. 

![Season Facts in Excel](https://github.com/christopherhuntley/DATA6510/raw/master/img/L10_Lineup_Season_Facts.png)

---
## **Big Data Tech: Columnar Databases**

While any of the logical data models (relational or NoSQL) can scale up for Big Data applications, they are only partial solutions. To complete the job, we also need technology that can scale up. For that we suggest using a columnar database like Google BigQuery or AWS Redshift.

#### **Row Stores vs Column Stores**

While all relational technology takes care not to use excessive storage space, there is a noticeable difference between transactional databases and analytical databases.

**Transactional databases** work best using write-optimized **row store** technology. Data is continually being written to the database, one row at a time. In the example below each arrow represents one read or write operation.  

![Row Store](https://github.com/christopherhuntley/DATA6510/raw/master/img/L11_RowStore.png)

Because of how transaction control works, any delays in writing new rows also affect any other query that might be executing at the time. Each successive delay then takes up more and more computing capacity, causing further delays. Thus, in order to minimize the risk of system failure (dropped queries and rollbacks), it is best to prioritize writing data over reading it. 

**Analytical databases** tend to use read-optimized **column store** (or *columnar*) technology. Here the emphasis is on reading data organized into columns. Writing individual rows, however, can be very expensive. Thus in typical usage data rows will be added in bulk rather than one at a time. That requires just one write operation per column (shown in red below). 

![Column Store](https://github.com/christopherhuntley/DATA6510/raw/master/img/L11_ColumnStore.png)

> **Heads Up:** There is a common misperception, mostly among old-school data engineers, that column stores are non-relational. **The relational database model is about logic, not physical implementation.** Columnar databases implement all of the defining features and functions of the relational model and are thus *by definition* relational databases. The decision to store data in rows or columns is an implementation detail that has absolutely nothing to do with the relational database model.  

#### **Performance Optimizations**
While columnar databases do not excel at writing new data, they are exceptionally good at `SELECT` queries and raw data storage. 

**Most select queries only use a few columns of any given table.** For a dimensional data warehouse, where the fact tables tend to have many columns, that means that only a small fraction of the table needs to be processed (i.e., in memory) at a time. Let's say that we have a fact table with 50 measure (non-key) columns. If we are only using 2 columns, then the query processor has a lot less work to do in order to carry out a query. 

**A big advantage of column-wise storage is that it makes data compression really simple.** Data compression relies on exploiting repeated patterns in data. If the same pattern is repeated many times, then we can keep one copy of the pattern and then record each place where it applies. That alone can save space but we can go much further if the same pattern is repeated many times in a row. Consider, for example, the `Rating` column of our movies data. We see that 'PG-13' is repeated 27 times in a row. That means that the database only needs to store:
- the string 'PG-13'
- a run length of 26 more repeats

That is a compression ratio of roughly 27:1, since 27 stored strings shrink to one string plus a small counter. Similar logic works for numeric data as well. Typically, numeric data tends to appear in somewhat narrow ranges, with each value in a column similar to the ones before and after it. This allows us to store the numbers as differences (above or below) from some base value. Since the differences are small they can take up less storage space. For example, consider the `ShowTime` column in our example. In SQL, dates and times actually get stored as numbers (in order to simplify date/time arithmetic). However, since the show times are fairly regular, we can 
- calculate the difference between each showtime and the one above it to produce a column of mostly zeros, and then 
- compress the column even further using run lengths on the zeros

The results can be quite impressive, though not quite as good as those for columns of text data. 
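
Both ideas, run lengths and differences, can be sketched in a few lines of Python. This is a toy illustration rather than what any particular database engine does internally, and the `ratings` and `show_times` values are made up. Note that this version stores the total run count (27) rather than the number of additional repeats (26).

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def delta_encode(numbers):
    """Keep the first number, then store each difference from the previous one."""
    if not numbers:
        return []
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]


# A text column with long runs of the same value.
ratings = ["PG-13"] * 27 + ["R"] * 3
rating_runs = run_length_encode(ratings)

# A numeric column (hypothetical show times, in minutes after midnight).
show_times = [1140, 1140, 1140, 1140, 1170, 1170, 1170]
deltas = delta_encode(show_times)
delta_runs = run_length_encode(deltas)

print(rating_runs)
print(deltas)
print(delta_runs)
```

Neither trick works nearly as well on row-wise storage, where neighboring values come from different columns and rarely repeat.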

**Columnar databases don't work as well with inherently row-oriented operations like table joins.** Thus, some data warehouse designs eschew dimension tables (and foreign keys) in favor of dimension columns, with a fully denormalized, one-table design. While that certainly works, it can potentially cause other problems if the source data has any anomalies. So, in recent years, columnar database solutions have implemented **optimized views that do the joins in advance**. This is done behind the scenes, in a way that is invisible to the user. For a dimensional data warehouse where the joins are the same every time anyway, such a strategy provides the speed advantage of column-wise storage and retrieval without giving up the integrity guarantees of table normalization.

#### **Columnar Database Examples**
We have already seen Google BigQuery (introduced in Lesson 3), but it is far from the only example. Others include:
- AWS Redshift
- Snowflake
- Azure Synapse Analytics (formerly Azure SQL Data Warehouse)
- Oracle Autonomous Data Warehouse 
- MariaDB ColumnStore
- PostgreSQL cstore_fdw
- Teradata
- Vertica
- Yellowbrick Data

It is also worth noting that pandas DataFrames and R Data Frames take the same column-centric approach (for the same reasons and with the same benefits/challenges) but without SQL. So, if you are comfortable using data frames for your data science projects you should be able to pick up any of these columnar databases with only a small learning curve. 

---
## **Beyond SQL: Data Warehouse $\rightarrow$ Tableau**

We will conclude this lesson by building a visualization model using data drawn from our data warehouse. Rather than export the data into a data mart as a CSV file, we will use a query to pull data directly from our BigQuery data warehouse. 

![](https://github.com/christopherhuntley/DATA6510/raw/master/img/L9_NBA_PLayFacts_Pipeline_v2.png)

Our tool of choice is [Tableau](https://tableau.com). From the website:
> Tableau helps people see and understand data. Our visual analytics platform is transforming the way people use data to solve problems. See why organizations of all sizes trust Tableau to help them be more data-driven.

We will be using [Tableau Desktop](https://www.tableau.com/products/desktop), the fully-featured version intended for professional use. It is available with free [academic licensing](https://www.tableau.com/academic/students) for  up to one year of classroom use. 

> **Heads Up:** There is also a free Tableau Public version, which works much the same as Tableau Desktop *except* that it doesn't include the ability to extract data from online sources using SQL. 

### **Connecting to the Data Warehouse**
Tableau organizes its analytics models into projects that bundle sheets (visual models), dashboards, and stories. In this quick example we will only need a single sheet.

To get started we will connect the sheet to our NBA Lineup Facts database in BigQuery.

![Connect to BigQuery](https://github.com/christopherhuntley/DATA6510/raw/master/img/L9_Tableau_Connect_to_Data.png)

Tableau will open a browser window to log you into your Google account and authorize you to access the data. 

### **Running a Custom SQL Query**

Once connected, we can select *New Custom SQL* from the data source panel to create a new `SELECT` query.  Instead of drawing on the full data warehouse, the query below returns aggregated data "rolled up" by season. The aggregated data is in effect a **data mart** stored in the same database instance as the data warehouse. 

![SELECT Query](https://github.com/christopherhuntley/DATA6510/raw/master/img/L9_Tableau_Custom_Query.png)

> **Heads Up:** Tableau only allows one `SELECT` statement per query. It will even throw an error if you supply a semicolon at the end of the query.

After running the query (by clicking OK), we are presented with the results. 

![Query Results](https://github.com/christopherhuntley/DATA6510/raw/master/img/L9_Tableau_Query_Results.png)

> **Heads Up:** Tableau will complain if the extracted dataset has a large number of rows. Tableau is designed for visualization, not data management. If there is too much data to display on a chart then Tableau is nudging us to be more selective with the `SELECT` query. 

### **Building the Model**

Creating a Tableau sheet is a lot like using Excel's PivotTable and charting features:
1. Identify fields as either (categorical) dimensions or (continuous) measures. 
2. Drag fields to the Columns and Rows areas to define the horizontal and vertical axes of the plot, respectively. 
   - Dimensions are used to label or group the data
   - Measures are used for aggregation (sums, averages, etc.) and plotting
3. If we only want a subset of the data then we can set one or more filters.
4. Use the "Show Me" drop-down panel to select the desired visualization model.
5. Configure labels, colors, etc. as needed. 

Here we have selected a scatter plot with `play_length_mins` on the horizontal axis and `plus_minus_36m` on the vertical axis. We have also filtered the data to only the lineups with 200+ minutes played.

![Build Model](https://github.com/christopherhuntley/DATA6510/raw/master/img/L9_Tableau_Scatter_Plot_Filter.png)

### **The (mostly) Finished Model**

Except for a few tweaks (giving the sheet a name, adjusting the tick marks, etc.), the model is done in about 5 minutes. 

![Finished Model](https://github.com/christopherhuntley/DATA6510/raw/master/img/L9_Tableau_Scatter_Final.png)

> **Heads Up:** Tableau will refresh the data each time you open the project. Because BigQuery is not free, you may want to avoid rerunning the query by saving the data to your local computer (as a CSV, perhaps) before creating the model.  
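
One way to cache query results locally is to write them to a CSV file once and then point Tableau at that file. The sketch below uses only Python's standard `csv` module. The `play_length_mins` and `plus_minus_36m` field names follow the scatter plot above, but the other fields, the sample rows, and the file name `lineup_rollup.csv` are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical query results; in practice these would come from the warehouse.
FIELDS = ["season", "lineup_id", "play_length_mins", "plus_minus_36m"]
ROWS = [
    ("2018-19", 1, 216.0, 2.0),
    ("2018-19", 2, 48.0, 3.0),
]


def save_rows(path, fields, rows):
    """Write a header line and the data rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(fields)
        writer.writerows(rows)


def load_rows(path):
    """Read the CSV back as a list of dicts (every value comes back as a string)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Save once to a scratch directory, then reuse the file instead of rerunning the query.
csv_path = Path(tempfile.mkdtemp()) / "lineup_rollup.csv"
save_rows(csv_path, FIELDS, ROWS)
cached = load_rows(csv_path)
print(csv_path, len(cached))
```

The trade-off is freshness: the CSV is a snapshot, so it has to be regenerated whenever the warehouse data changes.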

---
## **Shameless Plug: DATA 6550 Big Data Management and DataOps**

DATA 6550 is a new course planned for Spring 2024. It will fill in many of the technical details we leave out of this course, things like: 
- how to build a database
- how to architect data lakes / warehouses
- how to automate and orchestrate ETL processes, etc. 

The course will have two prerequisites: DATA 6505 and DATA 6510. 

[DataOps](https://en.wikipedia.org/wiki/DataOps) is a kind of software engineering that focuses on data pipelines. ETL processes are the core engineering work in any data pipeline. They are often the most complex part to code, and even subtle bugs can cause major anomalies in the data warehouse. 

The typical ETL process flows something like this:
1. Collect reference data (strong entities) from domain sources
2. Extract raw data from original sources into a temporary workspace
3. Apply data integrity checks to flag potential bugs
4. Transform (clean, aggregate/disaggregate, reformat) data for loading into the data warehouse
5. Transfer and/or stage the data for loading
6. Check for and correct process-induced data errors
7. Load into the data warehouse 
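
The seven steps above can be sketched as a single Python function. This is a toy illustration under stated assumptions, not a production design: the raw records, the set of valid team codes, and the `team_minutes` table are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical raw records standing in for data pulled from a source system.
RAW_ROWS = [
    {"team": "BOS", "season": "2019-20", "minutes": "48.5"},
    {"team": "bos", "season": "2019-20", "minutes": "36.0"},
    {"team": "XXX", "season": "2019-20", "minutes": "12.0"},  # unknown team
    {"team": "LAL", "season": "2019-20", "minutes": "n/a"},   # unparseable number
]


def is_number(text):
    """Return True if text can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def run_etl(raw_rows, conn):
    # 1. Collect reference data (here, a hard-coded set of valid team codes).
    valid_teams = {"BOS", "LAL"}

    # 2. Extract raw data into a temporary workspace (a copy of the input).
    workspace = [dict(row) for row in raw_rows]

    # 3. Apply integrity checks, setting aside the rows that fail.
    passed, rejected = [], []
    for row in workspace:
        ok = row["team"].upper() in valid_teams and is_number(row["minutes"])
        (passed if ok else rejected).append(row)

    # 4. Transform: normalize team codes, convert types, aggregate by team and season.
    totals = {}
    for row in passed:
        key = (row["team"].upper(), row["season"])
        totals[key] = totals.get(key, 0.0) + float(row["minutes"])

    # 5. Stage the rows in the shape the warehouse table expects.
    staged = sorted((team, season, mins) for (team, season), mins in totals.items())

    # 6. Check for process-induced errors: staged totals must match the inputs.
    expected = sum(float(row["minutes"]) for row in passed)
    if abs(sum(mins for _, _, mins in staged) - expected) > 1e-9:
        raise ValueError("aggregation changed the total minutes")

    # 7. Load into the warehouse.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS team_minutes (team TEXT, season TEXT, minutes REAL)"
    )
    conn.executemany("INSERT INTO team_minutes VALUES (?, ?, ?)", staged)
    conn.commit()
    return staged, rejected


warehouse = sqlite3.connect(":memory:")
staged, rejected = run_etl(RAW_ROWS, warehouse)
print("loaded:", staged)
print("rejected:", len(rejected))
```

Returning the rejected rows rather than silently dropping them keeps the flagged records available for review, which is the point of step 3.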

While we treat ETL as a discrete and linear process, it is actually three separate actions that can be completed in just about any order or even simultaneously. For example, the [Snowflake](https://www.snowflake.com/) platform supports an [Extract $\rightarrow$ Load $\rightarrow$ Transform (ELT)](https://community.snowflake.com/s/article/ELT-Data-Pipelining-in-Snowflake-Data-Warehouse-using-Streams-and-Tasks) workflow in which data is:
1. Extracted from original sources
2. Loaded into a Snowflake data lake 
3. Transformed upon request to suit the needs of a given data model

Basically, this merges the data lake and data warehouse into one system (called a "Lake House"), with the cleaning and merging happening in near-real time. 
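
The ELT ordering can be illustrated with a small Python sketch. It uses SQLite rather than Snowflake, so it shows only the general pattern: raw data is loaded untouched, and the transformation lives in a view that is evaluated each time it is queried. The `raw_events` table, the `team_minutes` view, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: raw values go in exactly as received, messy or not.
conn.execute("CREATE TABLE raw_events (team TEXT, minutes TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("BOS", "48.5"), ("bos", "36.0"), ("LAL", "n/a")],
)

# Transform on request: the cleaning logic is stored as a view, not as new data.
conn.execute("""
CREATE VIEW team_minutes AS
SELECT UPPER(team) AS team,
       SUM(CAST(minutes AS REAL)) AS minutes
FROM raw_events
WHERE minutes GLOB '[0-9]*'   -- keep only values that start with a digit
GROUP BY UPPER(team)
""")

before = conn.execute("SELECT * FROM team_minutes ORDER BY team").fetchall()

# Newly loaded raw data shows up in the transformed view without a separate ETL run.
conn.execute("INSERT INTO raw_events VALUES ('LAL', '24.0')")
after = conn.execute("SELECT * FROM team_minutes ORDER BY team").fetchall()

print(before)
print(after)
```

Because the view is recomputed on every query, the transformed results always reflect whatever raw data has been loaded so far.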

We can automate (or more formally, *orchestrate*) this sort of ETL process with a tool like [DBT](https://www.getdbt.com/), which will even handle data transfers between the Data Lake and the Data Warehouse. We can, for example, build a data warehouse at Google BigQuery with data drawn from a MySQL database hosted at AWS.

Regardless of the ETL logic or technology used, professionally built [DataOps pipelines](https://en.wikipedia.org/wiki/DataOps) adhere to software engineering practices like:
- All work in code under version control (GitHub or equivalent)
- Automated testing to identify potential issues in ETL processing
- Planned development timelines, with named releases, etc. 
- Orchestration via configuration files, with each pipeline mapped out with directed acyclic graphs (DAGs) of data operations
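
To show what mapping a pipeline as a DAG means in practice, here is a minimal sketch using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The task names are hypothetical, and the dictionary plays the role that a configuration file would play in a real orchestration tool.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_games": set(),
    "extract_players": set(),
    "check_integrity": {"extract_games", "extract_players"},
    "transform_lineups": {"check_integrity"},
    "load_warehouse": {"transform_lineups"},
}

# static_order() yields the tasks so that every task comes after its dependencies.
# It raises graphlib.CycleError if the graph contains a cycle (i.e., is not a DAG).
run_order = list(TopologicalSorter(PIPELINE).static_order())

for step, task in enumerate(run_order, start=1):
    print(step, task)
```

The two extract tasks have no dependency on each other, so an orchestrator is free to run them in either order or in parallel; only the ordering constraints encoded in the graph are guaranteed.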

Besides [DBT](https://www.getdbt.com/), there are a growing number of DataOps solutions like:
- Apache [Spark](https://spark.apache.org/), [Airflow](https://airflow.apache.org/), and other programmer-centric frameworks for building custom ETL systems
- [Alteryx](https://www.alteryx.com/), [Databricks](https://databricks.com/), and other low- or no-code data management tools
- Cloud-platform specific services for [IBM](https://www.ibm.com/products/infosphere-datastage), [Azure](https://azure.microsoft.com/en-us/services/data-factory/), [Amazon](https://aws.amazon.com/glue/), [Google Cloud Platform](https://cloud.google.com/solutions/performing-etl-from-relational-database-into-bigquery), [Snowflake](https://www.snowflake.com/), etc. 






---
## **Congratulations! You've made it to the end of Lesson 6.**

Next time we will focus more on data architecture, with a moderately deep dive into star schema models.



## **On your way out ... Be sure to save your work.**
In Google Drive, drag this notebook file into your `DATA6510` folder so you can find it next time.