# Week 1 - Data Models and Pipelines

## Get Started with Data Modeling, Schemas, and Databases

### Overview


First, you'll learn about
design patterns and database schemas,
including common structures that BI professionals use.
You'll also be introduced to
data pipelines and ETL processes.
You've learned that ETL stands for
extract, transform, and load.
This refers to the process of
gathering data from source systems,
converting it into a useful format,
and bringing it into a data warehouse
or other unified destination system. 

You'll also develop strategies
for gathering information from stakeholders
in order to help you develop
more useful tools and processes for your team. 

After that, you'll focus
on database optimization to reduce
response time or the time it takes
for a database to complete a user request.
This will include exploring different types of
databases and the five factors of database performance:
workload, throughput,
resources, optimization, and contention.

Finally, you'll learn about the importance
of quality testing your ETL processes,
validating your database schema,
and verifying business rules. 


### Modeling and Schemas


In order to make databases useful,
the data has to be organized.
This includes both source systems
from which data is ingested and
moved and the destination
database where it will be acted upon.
These source systems could include data lakes,
which are database systems that store large amounts of
raw data in its original format until it's needed. 


Another type of source system is
an Online Transaction Processing or OLTP database.
An OLTP database is one that has been
optimized for data processing instead of analysis.
One type of destination system is a data mart,
which is a subject oriented database that
can be a subset of a larger data warehouse.
Another possibility is using
an Online Analytical Processing or OLAP database.
This is a tool that has been
optimized for analysis in addition to
processing and can analyze data from multiple databases. 

Unstructured data is not organized
in any easily identifiable manner.
Structured data has been organized in a certain format,
such as rows and columns. 

A data model is a tool for organizing
data elements and how they relate to one another.
These are conceptual models that help
keep data consistent across the system. 


In order to create the data model,
BI professionals will often use
what is referred to as a design pattern.
A design pattern is a solution that uses relevant measures
and facts to create a model to support business needs.
Think of it like a re-usable problem-solving template,
which may be applied to many different scenarios. 

As a refresher, a schema is a way of
describing how something such as data is organized.
You may have encountered schemas
before while working with databases.
For example, some common schemas you
might be familiar with include relational models,
star schemas, snowflake schemas, and NoSQL schemas.

### Dimensional Models

A primary key is an identifier in a database that references a column, or
a group of columns, whose values uniquely identify each record in the table.
In this database we have primary keys in each table. 

A foreign key is a field within a database table that's a primary key in
another table.
The primary keys from each table also appear as foreign keys in other tables. 
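
The relationship between primary and foreign keys can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not the database from the course: the `customers` and `orders` tables and their columns are hypothetical names chosen for the example.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- primary key: unique per row
        name        TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (customer_id),  -- foreign key
        total       REAL NOT NULL
    )
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")  # valid: customer 1 exists

# An order pointing at a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (101, 99, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rejected, order_count)
```

The primary key of `customers` reappears in `orders` as a foreign key, and the database refuses an order that references a missing customer.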


A dimensional model is a type of relational model that has been optimized
to quickly retrieve data from a data warehouse.
Dimensional models can be broken down into facts for measurement and
dimensions that add attributes for context. In a dimensional model,
a fact is a measurement or metric.
For example, a monthly sales number could be a fact. A dimension is a piece of
information that provides more detail and context regarding that fact.
It's the who, what, where, when, why and how.
So if our monthly sales number is the fact, then the dimensions could be
information about each sale, including the customer, the store location and
what products were sold.

An attribute is a characteristic or quality of data
used to label the table columns. In dimensional models,
attributes work kind of the same way. An attribute is a characteristic or quality
that can be used to describe a dimension.
So a dimension provides information about a fact and
an attribute provides information about a dimension. 

Think about a passport.
One dimension on your passport is your hair and eye color.
If you have brown hair and eyes,
brown is the attribute that describes that dimension. 

A dimensional model uses facts, dimensions, and attributes
to create two types of tables: fact tables and dimension tables.
A fact table contains measurements or metrics related to a particular event. 

A dimension table is where attributes of the dimensions of a fact are stored.
These tables are joined to the appropriate fact table using the foreign key.
This gives meaning and context to the facts.
That's how tables are connected in the dimensional model. 
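
As a rough sketch of how a fact table and its dimension tables connect, the example below builds a tiny model with `sqlite3`. The table names (`sales_fact`, `dim_store`, `dim_product`), columns, and rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables hold descriptive attributes.
    CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);

    -- The fact table holds measurements plus foreign keys to each dimension.
    CREATE TABLE sales_fact (
        sale_id    INTEGER PRIMARY KEY,
        store_id   INTEGER REFERENCES dim_store (store_id),
        product_id INTEGER REFERENCES dim_product (product_id),
        amount     REAL
    );

    INSERT INTO dim_store   VALUES (1, 'Lisbon'), (2, 'Porto');
    INSERT INTO dim_product VALUES (10, 'Books'), (20, 'Games');
    INSERT INTO sales_fact  VALUES
        (1, 1, 10, 12.0),
        (2, 1, 20, 30.0),
        (3, 2, 10, 8.0);
""")

# Join the fact table to a dimension to give the numbers context.
sales_by_city = conn.execute("""
    SELECT s.city, SUM(f.amount)
    FROM sales_fact AS f
    JOIN dim_store AS s ON s.store_id = f.store_id
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
print(sales_by_city)
```

The `amount` column is the fact; the city comes from a dimension table reached through the foreign key.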

Think of the schema like a blueprint, it doesn't hold data itself, but
describes the shape of the data and how it might relate to other tables or models.
Any entry in the database is an instance of that schema and
will contain all of the properties described in the schema.

- A star schema is a schema consisting of one fact table that references any
number of dimension tables.
As its name suggests, this schema is shaped like a star.
Notice how each of the dimension tables is connected to the fact table at the center. 

- A snowflake schema is an extension of a star schema with additional dimensions
and, often, subdimensions.
These dimensions and subdimensions break down the schema into even more specific
tables, creating a snowflake pattern. 

### Types of Schema

- Star Schema

A star schema is a schema consisting of one or more fact tables referencing any number of dimension tables. As its name suggests, this schema is shaped like a star. This type of schema is ideal for high-scale information delivery and makes read output more efficient. It also classifies attributes into facts and descriptive dimension attributes (product ID, customer name, sale date).

All the dimension tables link back to the sales_fact table at the center, which confirms this is a star schema.

- Snowflake Schema

A snowflake schema is an extension of a star schema with additional dimensions and, often, subdimensions. 

This fact table branches out to multiple dimension tables and even subdimensions. The dimension tables break out multiple details, such as player international and player club stats, transfer history, and more.

- Flat Model

Flattened schemas are extremely simple database systems with a single table in which each record is represented by a single row of data. Within each row, values are separated by a delimiter, like a comma, and each record sits on its own line. Flat models are not relational; they can’t capture relationships between tables or data items. Because of this, flat models are more often used as a potential source within a data system to capture less complex data that doesn’t need to be updated.

Think of a table with just a names column and a time column, and nothing else, like a single DataFrame.
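
A flat model can be sketched with Python's `csv` module. The file contents below are invented for the example; the point is only that there is one table, one record per line, and no relationships to other tables.

```python
import csv
import io

# Hypothetical flat file: one table, one record per line, comma-delimited.
flat_file = io.StringIO("name,time\nAda,09:00\nGrace,09:30\n")

records = list(csv.DictReader(flat_file))
names = [record["name"] for record in records]
print(names)
```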

- Semi-Structured Schema

In addition to traditional, relational schemas, there are also semi-structured database schemas which have much more flexible rules, but still maintain some organization. Because these databases have less rigid organizational rules, they are extremely flexible and are designed to quickly access data.

1. Document schemas store data as documents, similar to JSON files. These documents store pairs of fields and values of different data types.

2. Key-value schemas pair a string with some relationship to the data, like a filename or a URL, which is then used as a key. This key is connected to the data, which is stored in a single collection. Users directly request data by using the key to retrieve it.

3. Wide-column schemas use flexible, scalable tables. Each row contains a key and related columns stored in a wide format.

4. Graph schemas store data items in collections called nodes. These nodes are connected by edges, which store information about how the nodes are related. However, unlike relational databases, these relationships change as new data is introduced into the nodes.
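
The four semi-structured shapes can be approximated with plain Python data structures. This is only a rough mental model, not how any particular database stores data, and every key, URL, and value below is hypothetical.

```python
import json

# Document: fields and values of different types, similar to a JSON file.
document = json.loads('{"title": "Dune", "year": 1965, "tags": ["sci-fi", "classic"]}')

# Key-value: a string key (here a hypothetical URL) retrieves the stored data directly.
key_value_store = {"https://example.com/books/dune": document}
fetched = key_value_store["https://example.com/books/dune"]

# Wide-column: each row has a key and its own set of related columns.
wide_column = {
    "row-1": {"title": "Dune", "year": 1965},
    "row-2": {"title": "Emma", "author": "Jane Austen"},  # different columns per row
}

# Graph: nodes connected by edges that describe the relationship.
nodes = {"dune", "frank_herbert"}
edges = [("frank_herbert", "wrote", "dune")]
authors_of_dune = [src for src, relation, dst in edges if relation == "wrote" and dst == "dune"]
print(fetched["year"], authors_of_dune)
```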

### Database Comparison

Databases vary based on how the data is processed,
organized and stored.

A database migration involves moving data from one source platform to another target
database.
During a migration, users transition the current database schemas
to a new desired state.
This could involve adding tables or columns, splitting fields,
removing elements, changing data types or other improvements.
The database migration process often requires numerous phases and iterations,
as well as lots of testing. 
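
A small schema migration might look like the sketch below, which adds columns and splits a field using `sqlite3`. The `customers` table and the name-splitting rule are assumptions for illustration; real migrations usually rely on dedicated tooling and far more testing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, full_name TEXT);
    INSERT INTO customers VALUES (1, 'Ada Lovelace'), (2, 'Grace Hopper');
""")

# Migration step 1: add columns to reach the desired schema.
conn.execute("ALTER TABLE customers ADD COLUMN first_name TEXT")
conn.execute("ALTER TABLE customers ADD COLUMN last_name TEXT")

# Migration step 2: split the existing field into the new columns.
rows = conn.execute("SELECT customer_id, full_name FROM customers").fetchall()
for customer_id, full_name in rows:
    first, _, last = full_name.partition(" ")
    conn.execute(
        "UPDATE customers SET first_name = ?, last_name = ? WHERE customer_id = ?",
        (first, last, customer_id),
    )
conn.commit()

# Test the result before retiring the old column.
migrated = conn.execute(
    "SELECT first_name, last_name FROM customers ORDER BY customer_id"
).fetchall()
print(migrated)
```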

This section compares several database types: OLTP,
OLAP, row-based, columnar, distributed, single-homed,
separated storage and compute, and combined databases.

- OLTP

OLTP databases manage database modification and
are operated with traditional database management system software.
These systems are designed to effectively store transactions and
help ensure consistency. 

If two people add the same book to their cart, but there's only one copy then
the person who completes the checkout process first will get the book.
And the OLTP system ensures that there aren't more copies sold than are in stock.
OLTP databases are optimized to read, write and
update single rows of data to ensure that business processes go smoothly.
But they aren't necessarily designed to read many rows together. 
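
The bookstore checkout can be sketched as a single-row transactional update. This is one possible way to prevent overselling, assuming a hypothetical `books` table with a `stock` column; production OLTP systems use their own locking and isolation mechanisms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT, stock INTEGER)")
conn.execute("INSERT INTO books VALUES (1, 'Dune', 1)")  # only one copy in stock
conn.commit()

def checkout(connection, book_id):
    """Try to buy one copy; return True only if stock was available."""
    with connection:  # commits on success, rolls back on error
        cursor = connection.execute(
            # Single-row update that succeeds only while stock remains.
            "UPDATE books SET stock = stock - 1 WHERE book_id = ? AND stock > 0",
            (book_id,),
        )
        return cursor.rowcount == 1

first_buyer = checkout(conn, 1)
second_buyer = checkout(conn, 1)
remaining = conn.execute("SELECT stock FROM books WHERE book_id = 1").fetchone()[0]
print(first_buyer, second_buyer, remaining)
```

The first checkout succeeds, the second finds no stock left, and the count never drops below zero.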

- OLAP

OLAP stands for online analytical processing.
This is a tool that has been optimized for analysis in addition to processing and
can analyze data from multiple databases.
OLAP systems pull data from multiple sources at one time to analyze data and
provide key business insights. 

For example, an OLAP system could
pull data about customer purchases from multiple data warehouses
in order to create personalized home pages for customers based on their preferences.
OLAP database systems enable organizations to address their analytical needs from
a variety of data sources.
Depending on the data maturity of the organization, one of your first tasks as
a BI professional could be to set up an OLAP system. 

- Row-Based 

Row-based databases are organized by rows.
Each row in a table is an instance or an entry in the database and
details about that instance are recorded and organized by column.

This means that if you wanted the average profit of all sales over the last five
years from the bookstore database, you would have to pull each row from those years even if you don't need all of
the information contained in those rows.

- Columnar

Columnar databases are databases organized by columns.
They're used in data warehouses because they are very useful for
analytical queries.
Columnar databases process data quickly,
only retrieving information from specific columns. 

In our average profit of all sales example, with a columnar database,
you could choose to specifically pull the sales column instead of years' worth
of rows.
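
The difference between the two layouts can be illustrated with in-memory Python structures. This is a simplified model with made-up sales records; real row-based and columnar engines differ in how data is laid out on disk, not just in memory.

```python
# Row-based layout: each record keeps all of its fields together.
rows = [
    {"sale_id": 1, "customer": "Ada", "title": "Dune", "profit": 4.0},
    {"sale_id": 2, "customer": "Grace", "title": "Emma", "profit": 6.0},
    {"sale_id": 3, "customer": "Alan", "title": "Ulysses", "profit": 5.0},
]
# Every full row is touched just to read one field.
row_average = sum(row["profit"] for row in rows) / len(rows)

# Columnar layout: each column is stored together.
columns = {
    "sale_id": [1, 2, 3],
    "customer": ["Ada", "Grace", "Alan"],
    "title": ["Dune", "Emma", "Ulysses"],
    "profit": [4.0, 6.0, 5.0],
}
# Only the profit column is read.
profit = columns["profit"]
column_average = sum(profit) / len(profit)
print(row_average, column_average)
```

Both layouts give the same answer; the columnar one simply reads less data to get there.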

- Single-homed

Single-homed databases are databases where all the data is stored in the same
physical location.
This is less common for organizations dealing with large data sets,
and will continue to become rarer as more and
more organizations move their data storage to online and cloud providers.

- Distributed Databases

Distributed databases are collections of data systems distributed across
multiple physical locations.
Think about them like telephone books: it's not actually possible to keep all
the telephone numbers in the world in one book, it would be enormous.
So instead, the phone numbers are broken up by location and
across multiple books in order to make them more manageable. 

- Combined Systems

Combined systems are database systems that store and
analyze data in the same place.
This is a more traditional setup because it enables users to access all of
the data that needs to stay in the system long-term.
But it can become unwieldy as more data is added.

- Separated Storage and Compute

Like the name implies, separated storage and
computing systems are databases where less relevant data is stored remotely,
and the relevant data is stored locally for analysis.

For example, if you have a lot of data but only a few people are querying it,
you can scale up storage without paying for as much computing power, which can save resources.


### Table Comparisons

*OLAP vs OLTP*
Database Technology | Description | Use
---|---|---
OLAP | Online Analytical Processing (OLAP) systems are databases that have been primarily optimized for analysis. | Provide user access to data from a variety of source systems, Used by BI and other data professionals to support decision-making processes, Analyze data from multiple databases, Draw actionable insights from data delivered to reporting tables
OLTP | Online Transaction Processing (OLTP) systems are databases that have been optimized for data processing instead of analysis. | Store transaction data, Used by customer-facing employees or customer self-service applications, Read, write, and update single rows of data, Act as source systems that data pipelines can be pulled from for analysis

*Distributed vs Single-Homed*
Database Technology | Description | Use
---|---|---
Distributed|Distributed databases are collections of data systems distributed across multiple physical locations.|Easily expanded to address increasing or larger scale business needs, Accessed from different networks, Easier to secure than a single-homed database system
Single-homed|Single-homed databases are databases where all of the data is stored in the same physical location.|Data stored in a single location is easier to access and coordinate cross-team, Cuts down on data redundancy, Cheaper to maintain than larger, more complex systems


*Separated Storage and Compute vs Combined*
Database Technology | Description | Use
---|---|---
Separated storage and compute| Separated storage and computing systems are databases where less relevant data is stored remotely, and relevant data is stored locally for analysis.| Run analytical queries more efficiently because the system only needs to process the most relevant data, Scale computation resources and storage systems separately based on your organization’s custom needs
Combined storage and compute| Combined systems are database systems that store and analyze data in the same place.| Traditional setup that allows users to access all possible data at once, Storage and computation resources are linked, so resource management is straightforward


## Choose the Right Database

### Elements of Database Schema

A data warehouse is a specific type of database
that consolidates data from multiple source systems for
data consistency, accuracy and efficient access.
Data warehouses are used to support data-driven decision making.
Often these systems are managed by data warehousing specialists, but
BI professionals may help design them. When it comes to designing a data warehouse,
there are a few things to consider, including the shape and volume of the data and the model.

Typically, the shape of data refers to the rows and
columns of tables within the warehouse and how they are laid out.
The volume of data, currently and in the future, also changes how the warehouse is
designed. The model the warehouse will follow includes all of the tools and
constraints of the system, such as the database itself and
any analysis tools that will be incorporated into the system.

For example, suppose a business's stakeholders are interested in
measuring store profitability and
website traffic in order to evaluate the effectiveness of annual promotions.

The primary business process is sales.
We could have a sales table that includes information such as quantity ordered,
total base amount, total tax amount, total discounts and total net amount.
These are the facts.

The dimensions give those facts context. For instance, store, customer, product, promotion,
time, stock or currency could all be dimensions.

Now there are several dimension tables all connected to a fact table at the center,
and this means we just created a star schema.
With this model, you can answer the specific question about
the effectiveness of annual promotions and
also generate a dashboard with other KPIs and drill-down reports.


In this case, we started with the business's specific needs,
looked at the data dimensions we had and
organized them into tables that formed relationships.
Those relationships helped us determine that a star schema will be the most useful
way to organize this data warehouse. 

### Making Useful Database Schema

Logical data modeling
involves representing different tables
in the physical data model.
Decisions have to be made about
how a system will implement that model.


It's important to consider the schema
early on in any BI project.
There are four elements a database schema should include. 

1. Relevant Data

2. Names and data types for each column

3. Consistent Formatting

4. Unique Keys for every database entry and object

It's often
necessary for a BI professional to add
new information to an existing schema if
the current schema can't
answer a specific business question.
If the business wants to know
which customer service employee
responded the most to requests,
we would need to add that information to
the data warehouse and update the schema accordingly.
The schema also needs to include names and
data types for each column
in each table within the database.
Imagine if you didn't organize your kitchen drawers,
it would be really difficult to find anything if
all of your utensils were just thrown together.
Instead, you probably have
a specific place where you keep
your spoons, forks and knives.


For example, imagine we have
two transactional systems that
we're combining into one database.
One tracks the promotions sent to users,
and the other tracks sales to customers.
In the source systems,
the marketing system that tracks
promotions could have a user ID column,
while the sales system has customer ID instead.
To be consistent in our warehouse schema,
we'll want to use just one of these columns. 
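
Standardizing the key column before combining the two systems could be sketched as follows. The record contents and the `standardize_key` helper are hypothetical; the choice of `customer_id` as the surviving name is also just an assumption for the example.

```python
# Hypothetical extracts from the two source systems.
promotions = [{"user_id": 7, "promotion": "SPRING10"}]
sales = [{"customer_id": 7, "amount": 42.0}]

def standardize_key(records, old_name, new_name):
    """Return copies of the records with one column renamed."""
    return [
        {(new_name if column == old_name else column): value for column, value in record.items()}
        for record in records
    ]

# Use customer_id everywhere in the warehouse schema.
promotions_clean = standardize_key(promotions, "user_id", "customer_id")

# With a shared key, the two sources can be combined.
amount_by_customer = {row["customer_id"]: row["amount"] for row in sales}
combined = [
    {**promo, "amount": amount_by_customer.get(promo["customer_id"])}
    for promo in promotions_clean
]
print(combined)
```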

1. The relevant data: The schema describes how the data is modeled and shaped within the database and must encompass all of the data being described.

2. Names and data types for each column: Include names and data types for each column in each table within the database.

3. Consistent formatting: Ensure consistent formatting across all data entries. Every entry is an instance of the schema, so it needs to be consistent.

4. Unique keys: The schema must use unique keys for each entry within the database. These keys build connections between the tables and enable users to combine relevant data from across the entire database.

### Review Schema

Francisco’s Electronics is launching an e-commerce store for its new home office product line. If it’s a success, company decision-makers plan to bring the rest of their products online as well. The company brought on Mia, a senior BI engineer, to help design its data warehouse. The database needed to store order data for analytics and reporting, and the sales manager needed to generate reports quickly to track the sales so that the success of the site can be determined.

The sales_warehouse database schema contains five tables: Sales, Products, Users, Locations, and Orders, which are connected via keys. The tables contain five to eight columns (or attributes) that range in data type. The data types include varchar or char (or character), integer, decimal, date, text (or string), timestamp, bit, and other types depending on the database system chosen.

- What kind of database schema is this? Why was this type of database selected? 

Mia designed the database with a star schema because Francisco’s Electronics is using this database for reporting and analytics. The benefits of star schema include simpler queries, simplified business reporting logic, query performance gains, and fast aggregations. 

- What naming conventions are used for the tables and fields? Are there any benefits of using these naming conventions? 

This schema uses a snake case naming convention. In snake case, underscores replace spaces and the first letter of each word is lowercase. Using a naming convention helps maintain consistency and improves database readability. Since snake case for tables and fields is an industry standard, Mia used it in the database.

- What is the purpose of using the decimal fields in data elements? 

For fields related to money, there are potential errors when calculating prices, taxes, and fees. You might have values that are technically impossible, such as a value of $0.001, when the smallest value for the United States dollar is one cent, or $0.01. To keep values consistent and avoid accumulated errors, Mia used a decimal(10,2) data type, which keeps only two digits after the decimal point.
Note: Other numeric values, such as exchange rate and quantities, may need extra decimal places to minimize rounding differences in calculations. Also, other data types may be better suited for other fields. To track when an order is created (created_at), you can use a timestamp data type. For other fields with various text sizes, you can use varchar. 
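
Python's `decimal` module gives a feel for why a fixed-precision type suits money. This is only an analogy for a decimal(10,2) column: the price, the tax rate, and the half-up rounding mode below are assumptions, and each database system applies its own rounding rules.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floating point accumulates tiny errors.
float_total = 0.1 + 0.2
# Decimal arithmetic on these values is exact.
decimal_total = Decimal("0.10") + Decimal("0.20")

# Round a computed tax to two places, similar to what a decimal(10,2) column would store.
price = Decimal("19.99")
tax = (price * Decimal("0.0825")).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(float_total == 0.3, decimal_total, tax)
```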

- What is the purpose of each foreign and primary key in the database?

Mia designed the Sales table with a primary key ID and included foreign keys in the other tables to reference the primary keys. The foreign keys must be the same data type as their corresponding primary keys. As you’ve learned, primary keys uniquely identify precisely one record on a table, and foreign keys establish integrity references from that primary key to records in other tables.

## How Data Moves

### ETL and Pipelines

As a refresher, a data pipeline
is a series of processes that transports
data from different sources to
their final destination for storage and analysis.
This automates the flow of data from sources to targets
while transforming the data to make it
useful as soon as it reaches its destination.
In other words, data pipelines are
used to get data from point A to point B
automatically, saving time and
resources and making data more accessible and useful.

They automate the processes
involved in extracting, transforming,
combining, validating,
and loading data for further analysis and visualization.
Effective data pipelines also help
eliminate errors and combat system latency. 

One of the most useful things about
a data pipeline is that it can
pull data from multiple sources,
consolidate it, and then
migrate it over to its proper destination.
These sources can include relational databases,
a website application with
transactional data or an external data source.
Usually, the pipeline has
a push mechanism that enables it to ingest
data from multiple sources in near
real time or at regular intervals.
Once the data has been pulled into the pipeline,
it can be loaded to its destination. 

Often while data is being moved from point A to point B,
the pipeline is also transforming the data.
Transformations include sorting, validation,
and verification, making the data easier to analyze.
This process is called the ETL system.
ETL stands for extract, transform, and load. 

Let's say a business analyst has data in
one place and needs to move it to another.
That's where a data pipeline comes in.
But a lot of the time,
the structure of the source system isn't
ideal for analysis, which is why
a BI professional wants to transform
that data before it gets to the destination system,
and why having set database schemas
already designed and ready to
receive data is so important.

Data pipelines work in three stages:

1. Ingesting the raw data

2. Processing and consolidating it into categories

3. Dumping the data into reporting tables that users can access

Once we've determined what the stakeholders' goal is,
we can start thinking about what data
we need the pipeline to ingest.
In this case, we're going to
want demographic data about the customers.
Our stakeholders are interested in monthly reports.
We can set up our pipeline to automatically
pull in the data we want at monthly intervals.
Once the data is ingested,
we also want our pipeline to
perform some transformations,
so that it's clean and consistent
once it gets delivered to our target tables. 
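
The three pipeline stages can be sketched as three small Python functions. The customer records are hard-coded stand-ins for a real source, and the function names and the region-based grouping are assumptions made for this illustration.

```python
from collections import defaultdict

def ingest():
    """Stage 1: pull raw records from a source (hard-coded here)."""
    return [
        {"customer": "Ada", "age": "36", "region": "north "},
        {"customer": "Grace", "age": "45", "region": "South"},
        {"customer": "Alan", "age": "41", "region": "north"},
    ]

def process(raw_records):
    """Stage 2: clean the values and consolidate them into categories."""
    by_region = defaultdict(list)
    for record in raw_records:
        region = record["region"].strip().lower()
        by_region[region].append(int(record["age"]))
    return by_region

def load(by_region):
    """Stage 3: write a reporting table that users can read."""
    return {
        region: {"customers": len(ages), "average_age": sum(ages) / len(ages)}
        for region, ages in sorted(by_region.items())
    }

reporting_table = load(process(ingest()))
print(reporting_table)
```

In practice a scheduler would run these stages at the monthly interval the stakeholders asked for.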

### ETL Process
Like other pipelines, ETL processes work in
stages and these stages are extract, transform, and load.
Let's start with extraction. 

**Extraction**
- In this stage, the pipeline accesses
source systems and then reads and
collects the necessary data from within them.
Many organizations store
their data in transactional databases,
such as OLTP systems,
which are great for logging records.
Or maybe the business uses flat files,
for instance, HTML or log files.
Either way, ETL makes the data useful for analysis by
extracting it from its source and
moving it into a temporary staging table.

**Transformation**
- The specific transformation activities
depend on the structure and
format of the destination
and the requirements of the business case,
but as you've learned,
these transformations generally include validating,
cleaning, and preparing the data for analysis.
This stage is also when
the ETL pipeline maps the data types from
the sources to the target systems so
the data fits the destination conventions. 

**Loading**
- This is when data is delivered to its target destination.
That could be a data warehouse, a data lake,
or an analytics platform that
works with direct data feeds.
Note that once the data has been delivered,
it can exist within multiple locations
in multiple formats.
For example, there could be
a snapshot table that covers a week of
data and a larger archive
that has some of the same records.
This helps ensure the historical data
is maintained within
the system while giving stakeholders
focused, timely data.
If the business is interested in
understanding and comparing average monthly sales,
the data would be moved to
an OLAP system that has
been optimized for analysis queries.
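
The extract, transform, and load stages can be sketched end to end with `csv` and `sqlite3`. The source file, the `staging_orders` and `orders` tables, and the validation rule (skip rows whose values cannot be converted) are all assumptions for this example rather than a prescribed design.

```python
import csv
import io
import sqlite3

# Hypothetical source: a flat file exported from a transactional system.
source_file = io.StringIO(
    "order_id,order_date,amount\n"
    "1,2024-01-05,20.00\n"
    "2,2024-01-17, 5.00\n"
    "3,2024-02-02,n/a\n"
)

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE staging_orders (order_id TEXT, order_date TEXT, amount TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_month TEXT, amount REAL);
""")

# Extract: copy the raw text values into a temporary staging table.
raw_rows = [(r["order_id"], r["order_date"], r["amount"]) for r in csv.DictReader(source_file)]
warehouse.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", raw_rows)

# Transform: validate, clean, and map source text to the target data types.
staged = warehouse.execute("SELECT order_id, order_date, amount FROM staging_orders").fetchall()
clean_rows = []
for order_id, order_date, amount in staged:
    try:
        clean_rows.append((int(order_id), order_date[:7], float(amount.strip())))
    except ValueError:
        continue  # skip rows that fail validation

# Load: deliver the cleaned rows to the target table.
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean_rows)
warehouse.commit()

monthly_average = warehouse.execute(
    "SELECT order_month, AVG(amount) FROM orders GROUP BY order_month ORDER BY order_month"
).fetchall()
print(monthly_average)
```

The invalid February row never reaches the target table, so the average monthly sales query only reports January.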

### Choosing The Right Tool


KPIs let us know whether or not we're succeeding, so
that we can adjust our processes to better reach objectives.
For example, some financial KPIs are gross profit margin,
net profit margin, and return on assets.
Some HR KPIs are rate of promotion and employee satisfaction.

Next, depending on how your stakeholders want to view the data,
there are different tools you can choose.
Stakeholders might ask for graphs, static reports, or dashboards. 

Some of these tools include Looker Studio, Microsoft PowerBI, and
Tableau.
Some others are Azure Analysis Service, CloudSQL, Pentaho,
SSAS, and SSRS SQL Server, which all have reporting tools built in.

### BI Tools
Tool | Uses
---|---
Azure Analysis Service (AAS) | Connect to a variety of data sources, Build in data security protocols, Grant access and assign roles cross-team, Automate basic processes
CloudSQL | Connect to existing MySQL, PostgreSQL or SQL Server databases,  Automate basic processes, Integrate with existing apps and Google Cloud services, including BigQuery, Observe database processes and make changes
Looker Studio | Visualize data with customizable charts and tables, Connect to a variety of data sources, Share insights internally with stakeholders and online, Collaborate cross-team to generate reports, Use report templates to speed up your reporting
Microsoft PowerBI |Connect to multiple data sources and develop detailed models, Create personalized reports, Use AI to get fast answers using conversational languages, Collaborate cross-team to generate and share insights on Microsoft applications
Pentaho | Develop pipelines with a codeless interface, Connect to live data sources for updated reports, Establish connections to an expanded library, Access an integrated data science toolkit
SSAS SQL Server | Access and analyze data across multiple online databases, Integrate with existing Microsoft services including BI and data warehousing tools and SSRS SQL Server, Use built-in reporting tools
Tableau | Connect and visualize data quickly, Analyze data without technical programming languages, Connect to a variety of data sources including spreadsheets, databases, and cloud sources, Combine multiple views of the data in intuitive dashboards, Build in live connections with updating data sources

### ETL Tools
Tool | Uses
---|---
Apache Nifi | Connect a variety of data sources, Access a web-based user interface, Configure and change pipeline systems as needed, Modify data movement through the system at any time
Google Dataflow | Synchronize or replicate data across a variety of data sources, Identify pipeline issues with smart diagnostic features, Use SQL to develop pipelines from the BigQuery UI, Schedule resources to reduce batch processing costs, Use pipeline templates to kickstart the pipeline creation process and share systems across your organization
IBM InfoSphere Information Server |Integrate data across multiple systems, Govern and explore available data, Improve business alignment and processes, Analyze and monitor data from multiple data sources
Microsoft SQL Server Integration Services (SSIS) | Connect data from a variety of sources, Use built-in transformation tools, Access graphical tools to create solutions without coding, Generate custom packages to address specific business needs
Oracle Data Integrator | Connect data from a variety of sources, Track changes and monitor system performance with built-in features, Access system monitoring and drill-down capabilities, Reduce monitoring costs with access to built-in Oracle services
Pentaho Data Integrator | Connect data from a variety of sources, Create codeless pipelines with drag-and-drop interface, Access dataflow templates for easy use, Analyze data with integrated tools
Talend |Connect data from a variety of sources, Design, implement, and reuse pipeline from a cloud server, Access and search for data using integrated Talend services, Clean and prepare data with built-in tools

## Dataflow

### Working in Google Cloud

Google Dataflow is a serverless data-processing service that reads data
from the source, transforms it, and writes it to the destination location.
Dataflow creates pipelines with open source libraries which you can interact
with using different languages including Python and SQL.
Dataflow includes a selection of pre-built templates that you can customize or
you can use SQL statements to build your own pipelines.
The tool also includes security features to help keep your data safe.

- Jobs 

When you first open the console, you will find the Jobs page. The Jobs page lists the current jobs in your project space. There are also options to CREATE JOB FROM TEMPLATE or CREATE MANAGED DATA PIPELINE from this page, so that you can get started on a new project in your Dataflow console. This is where you will go anytime you want to start something new.

- Pipelines

Open the menu pane to navigate through the console and find the other pages in Dataflow. The Pipelines menu contains a list of all the pipelines you have created. If this is your first time using Dataflow, it will also display the processes you need to enable before you can start building pipelines. If you haven’t already enabled the APIs, click Fix All to enable the API features and set your location. 

- Workbench 

The Workbench section is where you can create and save shareable Jupyter notebooks with live code. This is helpful for first-time ETL tool users to check out examples and visualize the transformations. 

- Snapshots

Snapshots save the current state of a pipeline to create new versions without losing the current state. This is useful when you are testing or updating current pipelines so that you aren’t disrupting the system. This feature also allows you to back up and recover old project versions. You may need to enable APIs to view the Snapshots page; you will learn more about APIs in an upcoming activity. 

- SQL Workspace

Finally, the SQL Workspace is where you interact with your Dataflow jobs, connect to BigQuery functionality, and write necessary SQL queries for your pipelines.

## Python

### Coding With Python


A programming language is a system of words and
symbols used to write instructions that computers follow.
There are lots of different programming languages,
but Python was specifically developed to enable users to
write commands in fewer lines than most other languages. 

Python is a general purpose programming language
that can be applied to a variety of contexts.
In business intelligence, it's used to connect to
a database system to read and modify files.
It can also be combined with
other software tools to develop
pipelines and it can
even process big data and perform calculations. 
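
As a minimal sketch of connecting to a database from Python, the example below uses the standard library's `sqlite3` module. The in-memory database, the `sales` table, and its rows are hypothetical stand-ins for a real BI database system.

```python
import sqlite3

# An in-memory SQLite database stands in for a real database system.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sales (item TEXT, quantity INTEGER)")

# Modify: insert a few hypothetical sales records.
connection.executemany(
    "INSERT INTO sales (item, quantity) VALUES (?, ?)",
    [("latte", 3), ("mocha", 2), ("latte", 5)],
)

# Read: total quantity sold per item.
totals = connection.execute(
    "SELECT item, SUM(quantity) FROM sales GROUP BY item ORDER BY item"
).fetchall()
print(totals)

connection.close()
```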

Python has two defining characteristics: it is primarily object-oriented, and it is interpreted.
Let's first understand what it
means to be object-oriented.
Object-oriented programming languages are
modeled around data objects.
These objects are chunks of
code that capture certain information.
Basically, everything in the system is an object,
and once data has been captured within the code,
it's labeled and defined by the system so that
it can be used again later
without having to re-enter the data. 
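
A short sketch may make this concrete. The `MenuItem` class and its values are hypothetical; the point is that data is captured once in an object and then reused by name.

```python
class MenuItem:
    """A data object that captures a menu item's name and price."""

    def __init__(self, name, price):
        self.name = name
        self.price = price

    def label(self):
        """Return a display string built from the stored data."""
        return f"{self.name}: ${self.price:.2f}"


# The data is entered once, then reused through the object.
latte = MenuItem("latte", 4.5)
print(latte.label())

# Built-in values such as integers are objects too.
print(isinstance(latte, object), isinstance(42, object))
```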

Now, let's consider the fact that
Python is an interpreted language.
Interpreted languages are programming languages
that use an interpreter,
typically another program, to
read and execute coded instructions.
This is different from a compiled programming language,
whose coded instructions are compiled ahead of time into machine code
that is executed directly by the target machine.
One of the biggest differences between
these two types of programming languages is that
the compiled code executed by
the machine is almost impossible for humans to read. 
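
The interpreter's role can be seen from within Python itself. In the sketch below, the built-in `compile` and `exec` functions read a line of source text and execute it at run time. In CPython, the reference implementation, `compile` produces bytecode for the interpreter rather than machine code for the target machine. The source string and the `<example>` file name are placeholders.

```python
source = "total = 2 + 3"

# The interpreter reads the human-readable source text at run time...
code_object = compile(source, "<example>", "exec")

# ...and executes it, storing any names it defines in this namespace.
namespace = {}
exec(code_object, namespace)
print(namespace["total"])
```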

A notebook is an interactive,
editable programming environment
for creating data reports.
This can be a great way to build
dynamic reports for stakeholders.
Python is a great tool to have in your BI toolbox.
There's even an option to use
Python commands in Google Dataflow. 

### Elements of Python

- Python is open source and freely available to the public.

- It is an interpreted programming language, which means it uses another program to read and execute coded instructions.

- Data is often stored in data frames (for example, through the pandas library), similar to R.

- In BI, Python can be used to connect to a database system to work with files.

- It is primarily object-oriented.

- Formulas, functions, and multiple libraries are readily available.

- A community of developers exists for online code support.

- Python uses simple syntax for straightforward coding.

- It integrates with cloud platforms including Google Cloud, Amazon Web Services, and Azure.

## BigQuery

### Scenario

- The problem  

Consider a scenario in which a BI professional, Aviva, is working for a fictitious coffee shop chain. Each year, the cafes offer a variety of seasonal menu items. Company leaders are interested in identifying the most popular and profitable items on their seasonal menus so that they can make more confident decisions about pricing; strategic promotion; and retaining, expanding, or discontinuing menu items.

- The solution

- Data extraction

In order to obtain the information the stakeholders are interested in, Aviva begins extracting the data. The data extraction process includes locating and identifying relevant data, then preparing it to be transformed and loaded. To identify the necessary data, Aviva implements the following strategies:

- Meet with key stakeholders

Aviva leads a workshop with stakeholders to identify their objectives. During this workshop, she asks stakeholders questions to learn about their needs:

1. What information needs to be obtained from the data (for instance, performance of different menu items at different restaurant locations)?

2. What specific metrics should be measured (sales metrics, marketing metrics, product performance metrics)?

3. What sources of data should be used (sales numbers, customer feedback, point-of-sale systems)?

4. Who needs access to this data (management, market analysts)?

5. How will key stakeholders use this data (for example, to determine which items to include on upcoming menus, make pricing decisions)?

- Observe teams in action

Aviva also spends time observing the stakeholders at work and asking them questions about what they’re doing and why. This helps her connect the goals of the project with the organization’s larger initiatives. During these observations, she asks questions about why certain information and activities are important for the organization.

- Organize data in BigQuery

Once Aviva has completed the data extraction process, she transforms the data she’s gathered from different stakeholders and loads it into BigQuery. Then she uses BigQuery to design a target table to organize the data. The target table helps Aviva unify the data. She then uses the target table to develop a final dashboard for stakeholders to review. 
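
As a rough illustration of what unifying data in a target table might involve, the sketch below combines two extracts into one row per menu item. The item names, figures, and column names are hypothetical and do not come from the scenario's data. Plain Python lists and dictionaries stand in for BigQuery tables.

```python
# Hypothetical extracts from two sources, keyed by menu item.
sales = [
    {"item": "cinnamon latte", "units": 120, "revenue": 540.0},
    {"item": "peppermint mocha", "units": 45, "revenue": 225.0},
]
feedback = {"cinnamon latte": 4.6, "peppermint mocha": 3.9}

# Target table: one row per item with the columns the dashboard needs.
target_table = [
    {
        "item": row["item"],
        "units": row["units"],
        "revenue_per_unit": round(row["revenue"] / row["units"], 2),
        "avg_rating": feedback.get(row["item"]),
    }
    for row in sales
]

for row in target_table:
    print(row)
```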

- The results

When stakeholders review the dashboard, they are able to identify several key findings about the popularity and profitability of items on their seasonal menus. For example, the data indicates that many peppermint-based products on their menus have decreased in popularity over the past few years, while cinnamon-based products have increased in popularity. This finding leads stakeholders to decide to retire three of their peppermint-based drinks and bakery items. They also decide to add a selection of new cinnamon-based offerings and launch a campaign to promote these items. 



**Data extraction**

Data extraction is the process of taking data from a source system, such as a database or a SaaS application, so that it can be delivered to a destination system for analysis. You might recognize this as the first step in an ETL (extract, transform, and load) pipeline. There are three primary ways that pipelines can extract data from a source in order to deliver it to a target table:

- Update notification: The source system issues a notification when a record has been updated, which triggers the extraction.

- Incremental extraction: The BI system checks for any data that has changed at the source and ingests these updates.

- Full extraction: The BI system extracts a whole table into the target database system.
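
The three approaches above can be contrasted in a small sketch. Everything here is an assumption made for illustration: the in-memory `SOURCE` records, the `updated` timestamp column, and the function names. A real pipeline would read from a source system rather than a list.

```python
from datetime import datetime

# Hypothetical source table: each record carries a last-updated timestamp.
SOURCE = [
    {"id": 1, "item": "latte", "updated": datetime(2024, 1, 1)},
    {"id": 2, "item": "mocha", "updated": datetime(2024, 1, 5)},
    {"id": 3, "item": "chai", "updated": datetime(2024, 1, 9)},
]


def on_update_notification(record, target):
    """Update notification: the source pushes one changed record."""
    target[record["id"]] = record


def incremental_extraction(source, last_run):
    """Incremental extraction: take only records changed since the last run."""
    return [row for row in source if row["updated"] > last_run]


def full_extraction(source):
    """Full extraction: copy the whole table."""
    return list(source)


target = {}
on_update_notification(SOURCE[0], target)
print(sorted(target))

changed = incremental_extraction(SOURCE, last_run=datetime(2024, 1, 4))
print([row["id"] for row in changed])

print(len(full_extraction(SOURCE)))
```

Incremental extraction depends on the source exposing something like a change timestamp. When no such marker exists, full extraction is the fallback, at the cost of moving the whole table each time.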

Once data is extracted, it must be loaded into target tables for use. In order to drive intelligent business decisions, users need access to data that is current, clean, and usable. This is why it is important for BI professionals to design target tables that can hold all of the information required to answer business questions.

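The following notebook cell runs a query against a BigQuery public dataset from Python:

    import os

    from dotenv import load_dotenv
    from google.cloud import bigquery

    # Load settings from a local .env file into the environment.
    load_dotenv()

    # Assumption: GCLOUDPROJECT holds the Google Cloud project ID, and
    # credentials are resolved through Application Default Credentials.
    client = bigquery.Client(project=os.getenv("GCLOUDPROJECT"))

    # Find the ten addresses with the most street trees.
    query = """
    SELECT
        address,
        COUNT(address) AS number_of_trees
    FROM
        `bigquery-public-data.san_francisco_trees.street_trees`
    WHERE
        address != "null"
    GROUP BY address
    ORDER BY number_of_trees DESC
    LIMIT 10;
    """

    # Start the query; iterating over the job waits for the results.
    query_job = client.query(query)

    print("The query data:")
    for row in query_job:
        # Row values can be accessed by field name or index.
        print("Address = {}, Number of Trees = {}".format(row[0], row[1]))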