# Relational Data

## Relational, Tuple and Schema
The database term "relational" or "relationship" describes the way that data in tables are connected. Newcomers to the world of databases often have a hard time seeing the difference between a database and a spreadsheet. They see tables of data and recognize that databases allow you to organize and query data in new ways, but fail to grasp the significance of the relationships between data that give relational database technology its name. Relationships allow you to describe the connections between different database tables in powerful ways. These relationships can then be leveraged to perform powerful cross-table queries, known as joins.
In a relational database, the table is a relation because it stores the relation between data in its column-row format. The columns are the table's attributes, while the rows represent the data records. A single row is known as a tuple to database designers.<br>
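
To make the idea of a join concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `authors` and `books` tables and their contents are hypothetical and exist only for illustration; each row in `books` points at a row in `authors` through the `author_id` column.

```python
import sqlite3

# In-memory database with two hypothetical, related tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (
        book_id   INTEGER PRIMARY KEY,
        title     TEXT,
        author_id INTEGER REFERENCES authors(author_id)
    );
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO books VALUES (10, 'Notes', 1), (11, 'Compilers', 2), (12, 'Engines', 1);
""")

# The join follows the relationship books.author_id -> authors.author_id.
rows = conn.execute("""
    SELECT b.title, a.name
    FROM books AS b
    JOIN authors AS a ON a.author_id = b.author_id
    ORDER BY b.book_id
""").fetchall()

for title, name in rows:
    print(title, "was written by", name)
```

A spreadsheet would need the author's name repeated on every book row; here the name is stored once and the relationship does the rest.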

### Relation properties in a relational database
-  First off, its name must be unique in the database, i.e. a database cannot contain multiple tables of the same name.
-  Each relation must have a set of columns or attributes, and it must have a set of rows to contain the data. As with table names, no two attributes in the same relation can have the same name.
-  No tuple (or row) can be a duplicate. In practice, a database might actually contain duplicate rows, but there should be practices in place to avoid this, such as the use of unique primary keys (next up).
-  Given that a tuple cannot be a duplicate, it follows that a relation must contain at least one attribute (or column), or combination of attributes, that identifies each tuple (or row) uniquely. This is usually the primary key. Primary key values cannot be duplicated, which means that no two tuples can share the same primary key. The key also cannot have a null value, which simply means that the value must be known.
-  Each cell, or field, must contain a single value. For example, you cannot enter something like "Tom Smith" and expect the database to understand that you have a first and last name; rather, the database will understand that the value of that cell is exactly what has been entered.
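
The properties above can be checked against a real engine. The following is a small sketch with `sqlite3`; the `person` table and its values are assumptions made for the example. The key column is declared `TEXT NOT NULL PRIMARY KEY` because SQLite, unlike most engines, only rejects a null key when `NOT NULL` is written out explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        person_id  TEXT NOT NULL PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO person VALUES ('p1', 'Tom', 'Smith')")

def attempt(sql, params=()):
    """Run a statement and report whether the database accepted it."""
    try:
        conn.execute(sql, params)
        return "accepted"
    except sqlite3.Error as exc:
        return f"rejected: {exc}"

insert = "INSERT INTO person VALUES (?, ?, ?)"
results = {
    "second table named person": attempt("CREATE TABLE person (x TEXT)"),
    "duplicate primary key": attempt(insert, ("p1", "Tom", "Jones")),
    "null primary key": attempt(insert, (None, "Ann", "Lee")),
    "new primary key": attempt(insert, ("p2", "Ann", "Lee")),
}
for case, outcome in results.items():
    print(f"{case}: {outcome}")
```

Note that first and last name live in separate columns, so each cell holds a single value.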

A database **schema** is a collection of metadata that describes the relations or tables in a database. A schema is also described as the layout or blueprint of a database that outlines the way data is organized into tables.
A schema is normally described using Structured Query Language (SQL) as a series of CREATE statements that may be used to replicate the schema in a new database.<br>
An easy way to envision a schema is to think of it as a box that holds tables, stored procedures, views and the rest of the database in its entirety. One can give people access to the box, and the box's ownership can be changed as well.<br>
There are two types of database schema: physical and logical. The former gives a blueprint for how each piece of data is stored in the database. The latter gives structure to the tables and relationships inside the database.
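
As a rough illustration of a schema being "a series of CREATE statements", the sketch below reads the stored CREATE statements out of one SQLite database and replays them in a second, empty one. The `customer` and `invoice` tables are hypothetical; `sqlite_master` is the catalog table in which SQLite keeps its schema, and other engines expose this metadata differently.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE invoice (
        invoice_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        total       REAL
    );
""")

# SQLite stores the text of every CREATE statement in sqlite_master.
statements = [row[0] for row in source.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# Replaying those statements reproduces the schema (not the data) elsewhere.
replica = sqlite3.connect(":memory:")
for statement in statements:
    replica.execute(statement)

tables = [row[0] for row in replica.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```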

## Relational Database Management System (RDBMS)
A relational database (RDB) is a collection of data sets organized into tables, records and columns. RDBs establish well-defined relationships between database tables. Tables can reference one another and share information, which facilitates data searchability, organization and reporting.<br>
RDBs use Structured Query Language (SQL), a standard language that provides an easy programming interface for database interaction.<br>
The relational model is derived from the mathematical concept of a relation over sets and was developed by Edgar F. Codd.
A relational database management system (RDBMS) is a collection of programs and capabilities that enable IT teams and others to create, update, administer and otherwise interact with a relational database. Most commercial RDBMSes use Structured Query Language (SQL) to access the database, although SQL was invented after the initial development of the relational model and is not necessary for its use.




## Indices
The purpose of creating an index on a particular table in your database is to make it faster to search through the table and find the row or rows that you want. The downside is that indexes make it slower to add rows or make updates to existing rows for that table. So, adding indexes can increase read performance and decrease write performance. Indexes are also used to enforce uniqueness constraints.<br>
Indexes increase read performance because they act as a point of reference when reading data: each index entry identifies the row it belongs to, so the database does not have to scan the whole table.<br>
Indexes reduce write performance because writing a single record requires a double entry: first the record itself, then the entry for the index. This is compounded by the fact that the index entry needs to be added at the correct spot in the index.
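
One way to watch an index change how a query is executed is SQLite's `EXPLAIN QUERY PLAN`. The sketch below is only an illustration: the `users` table and the index name `idx_users_email` are made up, and the exact wording of the plan text differs between SQLite versions, so the code prints it rather than relying on a fixed string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(1000)],
)

def plan(sql):
    """Return the human-readable plan; it is the last column of each row."""
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM users WHERE email = 'user42@example.com'"

before = plan(query)   # no index yet: the whole table is scanned
conn.execute("CREATE UNIQUE INDEX idx_users_email ON users (email)")
after = plan(query)    # the index is now used to locate the row

print("before:", before)
print("after: ", after)

# The same index also enforces uniqueness on the email column.
duplicate_rejected = False
try:
    conn.execute("INSERT INTO users (email) VALUES ('user42@example.com')")
except sqlite3.IntegrityError:
    duplicate_rejected = True
print("duplicate email rejected:", duplicate_rejected)
```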
### Types of Indexes
- Clustered Index<br>
A clustered index defines the order in which data is physically stored in a table. Table data can be sorted in only one way; therefore, there can be only one clustered index per table. In SQL Server, the primary key constraint creates a clustered index on that particular column by default.


- Nonclustered Index<br>
A non-clustered index doesn’t sort the physical data inside the table. In fact, a non-clustered index is stored at one place and table data is stored in another place. This is similar to a textbook where the book content is located in one place and the index is located in another. This allows for more than one non-clustered index per table.
It is important to mention here that inside the table the data will be sorted by a clustered index. However, inside the non-clustered index data is stored in a specified order. The index contains column values on which the index is created and the address of the record that the column value belongs to.
When a query is issued against a column on which the index is created, the database will first go to the index and look for the address of the corresponding row in the table. It will then go to that row address and fetch other column values. It is due to this additional step that non-clustered indexes are slower than clustered indexes.
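
The difference between the two index types can be mimicked in plain Python. This is a conceptual sketch only, not how any particular engine stores its data: the "table" is a list kept sorted by `id` (the clustered order), and the non-clustered index is a separate sorted list of `(last_name, position)` pairs. All names and rows are hypothetical.

```python
from bisect import bisect_left

# "Clustered": the rows themselves are stored sorted by the key (id).
table = [
    (1, "Tom", "Smith"),
    (2, "Ann", "Jones"),
    (3, "Bob", "Adams"),
    (4, "Eve", "Brown"),
]
ids = [row[0] for row in table]

def find_by_id(row_id):
    """One step: binary search lands directly on the row."""
    pos = bisect_left(ids, row_id)
    if pos < len(ids) and ids[pos] == row_id:
        return table[pos]
    return None

# "Non-clustered": a separate structure sorted by last name that stores
# only the indexed value and the position (address) of the row.
last_name_index = sorted((row[2], pos) for pos, row in enumerate(table))
last_names = [entry[0] for entry in last_name_index]

def find_by_last_name(last_name):
    """Two steps: search the index, then follow the address to the row."""
    i = bisect_left(last_names, last_name)
    if i < len(last_names) and last_names[i] == last_name:
        _, pos = last_name_index[i]   # step 1: address from the index
        return table[pos]             # step 2: fetch the row itself
    return None

print(find_by_id(3))
print(find_by_last_name("Brown"))
```

The extra hop in `find_by_last_name` corresponds to the additional step described above.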

### Constraints in a relational database
Constraints limit the data, or the type of data, that can be inserted into, updated in, or deleted from a table. The whole purpose of constraints is to maintain data integrity during an insert, update or delete on a table. The following are some important constraints in a relational database management system.<br>
1. Primary key constraint<br>
A table typically has a column or combination of columns that contain values that uniquely identify each row in the table. This column, or combination of columns, is called the primary key (PK) of the table and enforces the entity integrity of the table. Because primary key constraints guarantee unique data, they are frequently defined on an identity column.<br>
When you specify a primary key constraint for a table, the database engine enforces data uniqueness by automatically creating a unique index for the primary key columns. This index also permits fast access to data when the primary key is used in queries. If a primary key constraint is defined on more than one column, values may be duplicated within one column, but each combination of values from all the columns in the primary key constraint definition must be unique.<br>
A table can have only one primary key.
2. Foreign key constraint<br>
A foreign key (FK) is a column or combination of columns that is used to establish and enforce a link between the data in two tables to control the data that can be stored in the foreign key table. In a foreign key reference, a link is created between two tables when the column or columns that hold the primary key value for one table are referenced by the column or columns in another table. This column becomes a foreign key in the second table.
For example, the Sales.SalesOrderHeader table has a foreign key link to the Sales.SalesPerson table because there is a logical relationship between sales orders and salespeople. The SalesPersonID column in the SalesOrderHeader table matches the primary key column of the SalesPerson table. The SalesPersonID column in the SalesOrderHeader table is the foreign key to the SalesPerson table. Once this foreign key relationship is created, a value for SalesPersonID cannot be inserted into the SalesOrderHeader table if it does not already exist in the SalesPerson table.<br>
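
The same rule can be tried out with `sqlite3`. This is a simplified sketch: the table and column names follow the example above, but the `Sales.` schema prefix is dropped, the other columns are invented, and SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON` has been issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.executescript("""
    CREATE TABLE SalesPerson (
        SalesPersonID INTEGER PRIMARY KEY,
        Name          TEXT NOT NULL
    );
    CREATE TABLE SalesOrderHeader (
        SalesOrderID  INTEGER PRIMARY KEY,
        SalesPersonID INTEGER REFERENCES SalesPerson(SalesPersonID)
    );
    INSERT INTO SalesPerson VALUES (1, 'Ann');
""")

def add_order(order_id, sales_person_id):
    """Try to record an order; return True if the database accepted it."""
    try:
        conn.execute(
            "INSERT INTO SalesOrderHeader VALUES (?, ?)",
            (order_id, sales_person_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False

known_person = add_order(100, 1)     # salesperson 1 exists
unknown_person = add_order(101, 99)  # salesperson 99 does not exist
print(known_person, unknown_person)
```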

When a constraint is defined over multiple columns, a **composite key** is created. A composite key specifies multiple columns for a primary-key or foreign-key constraint.

Other types of constraints are NOT NULL, UNIQUE, DEFAULT and CHECK.
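
A short `sqlite3` sketch can show a composite key together with the other constraint types. The `order_line` table, its columns and the sample values are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_line (
        order_id      INTEGER NOT NULL,
        line_no       INTEGER NOT NULL,
        sku           TEXT    NOT NULL,                            -- NOT NULL
        tracking_code TEXT    UNIQUE,                              -- UNIQUE
        quantity      INTEGER NOT NULL DEFAULT 1 CHECK (quantity > 0),  -- DEFAULT, CHECK
        PRIMARY KEY (order_id, line_no)                            -- composite key
    )
""")

def accepted(sql, params=()):
    """Return True if the statement satisfied every constraint."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

insert = "INSERT INTO order_line (order_id, line_no, sku, tracking_code) VALUES (?, ?, ?, ?)"
checks = {
    "first line of order 1": accepted(insert, (1, 1, "A-1", "T1")),
    "same order, new line number": accepted(insert, (1, 2, "B-2", "T2")),
    "repeated (order_id, line_no)": accepted(insert, (1, 1, "C-3", "T3")),
    "missing sku": accepted(insert, (2, 1, None, "T4")),
    "repeated tracking code": accepted(insert, (2, 1, "D-4", "T1")),
    "zero quantity": accepted(
        "INSERT INTO order_line (order_id, line_no, sku, quantity) VALUES (3, 1, 'E-5', 0)"),
}
default_quantity = conn.execute(
    "SELECT quantity FROM order_line WHERE order_id = 1 AND line_no = 1").fetchone()[0]

for case, ok in checks.items():
    print(f"{case}: {'accepted' if ok else 'rejected'}")
print("default quantity:", default_quantity)
```

The value 1 may repeat in `order_id`, but each `(order_id, line_no)` combination can appear only once.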


## SQL (Structured Query Language)
SQL (pronounced "ess-que-el") stands for Structured Query Language. SQL is used to communicate with a database. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems. SQL statements are used to perform tasks such as updating data in a database or retrieving data from a database. Some common relational database management systems are: Oracle, Sybase, Microsoft SQL Server, Access, Ingres, etc. Although most database systems use SQL, most of them also have their own additional proprietary extensions that are usually only used on their system. However, the standard SQL commands such as "Select", "Insert", "Update", "Delete", "Create", and "Drop" can be used to accomplish almost everything that one needs to do with a database. With these commands you are able to access and manipulate the database.<br>
SQL is widely used today across web frameworks and database applications. Knowing SQL gives you the freedom to explore your data, and the power to make better decisions.<br>
The SQL commands seen above are generally categorized as:
1. DDL (Data Definition Language):<br>
DDL consists of the SQL commands that can be used to define the database schema. It deals with descriptions of the database schema and is used to create and modify the structure of database objects in the database.<br>
> - Create – is used to create the database or its objects (like tables, indexes, functions, views, stored procedures and triggers).
> - Drop – is used to delete objects from the database.
> - Alter – is used to alter the structure of the database.
> - Truncate – is used to remove all records from a table, including all space allocated for the records.
> - Comment – is used to add comments to the data dictionary.
> - Rename – is used to rename an object existing in the database.

2. DML (Data Manipulation Language):<br>
The SQL commands that deal with the manipulation of data present in the database belong to DML, and this includes most of the SQL statements.
Examples of DML:<br>
> - Select – is used to retrieve data from a database.
> - Insert – is used to insert data into a table.
> - Update – is used to update existing data within a table.
> - Delete – is used to delete records from a database table.
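
The two categories can be seen side by side in a short `sqlite3` session. This is only a sketch with a hypothetical `employee` table. SQLite is used because it ships with Python, but it has no `TRUNCATE` or `COMMENT` statement, so those two commands are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def table_names():
    return [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]

# DDL: define and change structure.
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("ALTER TABLE employee ADD COLUMN salary REAL")
conn.execute("ALTER TABLE employee RENAME TO staff")
after_ddl = table_names()

# DML: work with the rows inside that structure.
conn.executemany(
    "INSERT INTO staff (name, salary) VALUES (?, ?)",
    [("Ann", 100.0), ("Bob", 80.0)],
)
conn.execute("UPDATE staff SET salary = salary + 10 WHERE name = 'Bob'")
conn.execute("DELETE FROM staff WHERE name = 'Ann'")
remaining = conn.execute("SELECT name, salary FROM staff").fetchall()
conn.commit()

# DDL again: remove the object itself, not just its rows.
conn.execute("DROP TABLE staff")
after_drop = table_names()

print(after_ddl, remaining, after_drop)
```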


## Dimensional Modeling
Before we get to understand what dimensional modeling of data is, we need to understand what a data warehouse is. Data warehousing is a technology that aggregates structured data from one or more sources so that it can be compared and analyzed for greater business intelligence. Data warehouses are typically used to correlate broad business data to provide greater executive insight into corporate performance. They use a different design from standard operational databases that we have seen here already. The latter are optimized to maintain strict accuracy of data in the moment by rapidly updating real-time data. Data warehouses, by contrast, are designed to give a long-range view of data over time. They trade off transaction volume and instead specialize in data aggregation.<br>
Many types of business data are analyzed via data warehouses. The need for a data warehouse often becomes evident when analytic requirements run afoul of the ongoing performance of operational databases. Running a complex query on a database requires the database to enter a temporary fixed state. This is often untenable for transactional databases. A data warehouse is employed to do the analytic work, leaving the transactional database free to focus on transactions. 
One disadvantage of data warehouses is that they are expensive to scale and do not excel at handling raw, unstructured or complex data. However, the data warehouse remains an important tool in this big data era.
### What is a dimensional model?
A dimensional model is a data structure technique optimized for data warehousing tools. The concept of dimensional modelling was developed by Ralph Kimball and is comprised of "fact" and "dimension" tables.
A dimensional model is designed to read, summarize and analyze numeric information like values, balances, counts, weights, etc.
### Elements of Dimensional Data Model
1. Measures aka Facts<br>
Facts record measurements or metrics for a specific event. Fact tables generally consist of numeric values, and foreign keys to dimensional data where descriptive information is kept.[2] Fact tables are designed to a low level of uniform detail (referred to as "granularity" or "grain"), meaning facts can record events at a very atomic level. This can result in the accumulation of a large number of records in a fact table over time. Fact tables are defined as one of three types:
 - Transaction fact tables record facts about a specific event (e.g., sales events)
 - Snapshot fact tables record facts at a given point in time (e.g., account details at month end)
 - Accumulating snapshot tables record aggregate facts at a given point in time (e.g., total month-to-date sales for a product).<br>
 Fact tables are generally assigned a surrogate key (unique identifier) to ensure each row can be uniquely identified.<br>
 
2. Dimension<br>
 Dimension tables usually have a relatively small number of records compared to fact tables, but each record may have a very large number of attributes to describe the fact data. So, a dimension is something that qualifies the quantities or measures. Dimensions store the textual descriptions of the business. With the help of dimensions you can easily identify the measures. Dimensions can define a wide variety of characteristics, but some of the most common attributes defined by dimension tables include:
 - Time dimension tables describe time at the lowest level of time granularity for which events are recorded in the star schema
 - Geography dimension tables describe location data, such as country, state, or city
 - Product dimension tables describe products
 - Employee dimension tables describe employees, such as salespeople
 - Range dimension tables describe ranges of time, dollar values, or other measurable quantities to simplify reporting <br>

Dimension tables are generally assigned a surrogate primary key, usually a single-column integer data type, mapped to the combination of dimension attributes that form the natural key.
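
As an illustration of how facts and dimensions fit together, the sketch below builds a tiny star schema in `sqlite3`. Every table name, column and figure here is hypothetical: `fact_sales` holds the numeric measures plus foreign keys, while `dim_product` and `dim_date` hold the descriptive text used to group and label them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,   -- surrogate key
        product_name TEXT,
        category     TEXT
    );
    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,      -- surrogate key
        full_date TEXT,
        quarter   TEXT
    );
    CREATE TABLE fact_sales (
        sales_key   INTEGER PRIMARY KEY,
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key    INTEGER REFERENCES dim_date(date_key),
        quantity    INTEGER,                -- measure
        amount      REAL                    -- measure
    );
    INSERT INTO dim_product VALUES (1, 'Laptop', 'Computers'), (2, 'Mouse', 'Accessories');
    INSERT INTO dim_date VALUES (20240115, '2024-01-15', 'Q1'), (20240420, '2024-04-20', 'Q2');
    INSERT INTO fact_sales (product_key, date_key, quantity, amount) VALUES
        (1, 20240115, 2, 2000.0),
        (2, 20240115, 5, 100.0),
        (1, 20240420, 1, 1000.0);
""")

# Measures come from the fact table; labels and grouping come from dimensions.
report = conn.execute("""
    SELECT p.category, d.quarter, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date    AS d ON d.date_key    = f.date_key
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
""").fetchall()

for category, quarter, total in report:
    print(category, quarter, total)
```
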
Based on the frequency of data change, dimensions can be broken down into the following types:
  1. Unchanging or static dimension: Dimension values are static and will not change over time.
  2. Slowly changing dimensions (SCD):
  Dimension values change over time. For instance, a Territory or Territory Group may change over time. SCDs are often categorized into 3 types:
   1. Type 1 – Overwriting old values
   > Benefits – Simple to implement: just update the records with the new value. A Product dimension could be an example of a Type 1 SCD.
   > Disadvantages – No history is kept. When the Product Subcategory or Category changes, old sales will be reported under the new names only.
   2. Type 2 – Creating an additional record – The old values are not replaced; instead, a new row containing the new values is added to the table.
   > Benefits – History is kept. That means if a Territory moves to a different group, old sales will be reflected in the old group and new sales will be reflected in the newly changed group.
   > Disadvantages – Requires more care when creating and reporting, since Sales must be calculated on the Territory Name/Group used at the time of the sale.

   3. Type 3 – Creating new fields – A new column keeps the previous value next to the current one, so only the latest change can be seen.
   > Benefits – Keeps track of the latest change.
   > Disadvantages – If the entity in the Dimension is changed more than once, there is no way to retain the full history.
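
The three SCD types can be compared with a small plain-Python sketch. The row layout (`surrogate_key`, `valid_from`, `valid_to`, `previous_group`) and the territory values are assumptions for illustration; real warehouses implement these patterns in SQL or an ETL tool, and column naming varies.

```python
from copy import deepcopy

# One hypothetical dimension row for a sales territory.
base = [{
    "surrogate_key": 1, "territory": "Northwest", "group": "North America",
    "valid_from": "2020-01-01", "valid_to": None, "previous_group": None,
}]

def scd_type1(rows, territory, new_group):
    """Type 1: overwrite the value in place; the old value is lost."""
    for row in rows:
        if row["territory"] == territory:
            row["group"] = new_group
    return rows

def scd_type2(rows, territory, new_group, change_date):
    """Type 2: close the current row and append a new row with a new key."""
    for row in rows:
        if row["territory"] == territory and row["valid_to"] is None:
            row["valid_to"] = change_date
    rows.append({
        "surrogate_key": max(r["surrogate_key"] for r in rows) + 1,
        "territory": territory, "group": new_group,
        "valid_from": change_date, "valid_to": None, "previous_group": None,
    })
    return rows

def scd_type3(rows, territory, new_group):
    """Type 3: keep only the previous value in an extra column."""
    for row in rows:
        if row["territory"] == territory:
            row["previous_group"] = row["group"]
            row["group"] = new_group
    return rows

type1 = scd_type1(deepcopy(base), "Northwest", "Europe")
type2 = scd_type2(deepcopy(base), "Northwest", "Europe", "2024-01-01")
type3 = scd_type3(deepcopy(base), "Northwest", "Europe")
type3 = scd_type3(type3, "Northwest", "Pacific")  # a second change

print(len(type1), type1[0]["group"])
print([(r["surrogate_key"], r["group"], r["valid_to"]) for r in type2])
print(type3[0]["previous_group"], "->", type3[0]["group"])
```

After the second Type 3 change, the original group no longer appears anywhere in the row, which is the disadvantage noted above.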
   
   
## OLTP and OLAP
OLTP (on-line transactional processing) and OLAP (on-line analytical processing) are used in business applications, especially — although not exclusively — in data warehousing and analytics. Together, they form the two different sides of the analytics/warehousing coin: storing and manipulating the data on one hand and analyzing it on the other.<br>
### What is OLTP
On-line transactional processing (OLTP) is a mouthful to say, but the concept is not hard to grasp. OLTP systems are “classical” systems that process data transactions. They are all around you. In the bank, the ATM or the computer system used by the bank teller to record a transaction is an OLTP system, usually a database. If you text someone from your smartphone, you are working with another OLTP system.<br>
In an OLTP system, the transaction volume is high because the whole system is built around transactions. These transactions need to be organized and kept properly.<br>
What does it take to keep transactions organized? It means that database transactions have to be atomic, consistent, isolated, and durable (stable, not easily lost once committed). According to Wikipedia, atomicity is an “indivisible and irreducible series of database operations […] that either all occur, or nothing occurs”. In computer science, these are known as ACID transactions (atomic, consistent, isolated, durable). In simple words, this type of transaction ensures that operations performed by different users do not interfere with each other. For example, if a husband and wife each make a withdrawal from their joint bank account, ACID transactions make sure that they do not withdraw more than their account holds.<br>
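
Atomicity and consistency can be observed directly with `sqlite3`. In this sketch the `account` table, the balances and the `transfer` helper are all hypothetical. The transfer deliberately credits the target before debiting the source, so that a failed debit shows the earlier credit being rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account (
        account_id INTEGER PRIMARY KEY,
        balance    INTEGER NOT NULL CHECK (balance >= 0)  -- consistency rule
    )
""")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 0)")
conn.commit()

def transfer(amount, source=1, target=2):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE account_id = ?",
                (amount, target))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE account_id = ?",
                (amount, source))
        return True
    except sqlite3.IntegrityError:
        return False

def balances():
    return [row[0] for row in conn.execute(
        "SELECT balance FROM account ORDER BY account_id")]

first = transfer(60)    # 100 is enough to cover 60
after_first = balances()
second = transfer(60)   # only 40 left: the debit breaks the CHECK rule
after_second = balances()

print(first, after_first)
print(second, after_second)
```

The second transfer fails on the debit, and the credit that had already been applied inside the same transaction is undone, so the balances are unchanged.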
### What is OLAP?
The key word in OLAP is analytical, which also tells us what the OLAP system does. An OLAP system analyzes data effectively and efficiently.<br>
Unlike OLTP, OLAP systems work with very large amounts of data. Preserving the accuracy and integrity of transactions is not their purpose; this is up to OLTP. OLAP is here to allow us to find trends, crunch numbers, and get the big picture. These systems have a smaller group of users than OLTP systems. For example, you will not interact with your bank’s OLAP system, since it is not concerned with recording your account transactions.<br>
OLAP systems give the user the ability to extract and view business data from different points of view.<br>
Analysts frequently need to group, aggregate and join data. These operations in relational databases are resource intensive. With OLAP data can be pre-calculated and pre-aggregated, making analysis faster.<br>
OLAP databases are divided into one or more cubes. The cubes are designed in such a way that creating and viewing reports become easy.<br>
#### OLAP Cube
<img src="../../../images/olapcube.png" style="height:40vh">
 
At the core of the OLAP concept is an OLAP Cube. The OLAP cube is a data structure optimized for very quick data analysis.
The OLAP Cube consists of measures which are categorized by dimensions. OLAP Cube is also called the hypercube.
Usually, data operations and analysis are performed using a simple spreadsheet, where data values are arranged in row and column format. This is ideal for two-dimensional data. However, OLAP contains multidimensional data, usually obtained from different and unrelated sources, so a spreadsheet is not an optimal option. The cube can store and analyze multidimensional data in a logical and orderly manner.<br>
How does it work?<br>
A data warehouse extracts information from multiple data sources and formats such as text files, Excel sheets, multimedia files, etc.
The extracted data is cleaned and transformed. The data is then loaded into an OLAP server (or OLAP cube), where information is pre-calculated in advance for further analysis.
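
The idea of pre-calculating results can be sketched in plain Python. This is a toy model, not how an OLAP server is implemented: `build_cube` is a hypothetical helper that totals the `sales` measure for every combination of dimensions up front, so that later questions become simple lookups. The 440 and 1560 figures match the roll-up example in the next section; the remaining data is invented.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("quarter", "city", "item")

facts = [
    {"quarter": "Q1", "city": "New Jersey", "item": "Mobile", "sales": 440},
    {"quarter": "Q1", "city": "Los Angeles", "item": "Mobile", "sales": 1560},
    {"quarter": "Q2", "city": "New Jersey", "item": "Modem", "sales": 300},
]

def build_cube(rows):
    """Pre-aggregate sales for every subset of the dimensions."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in dims)] += row["sales"]
            cube[dims] = dict(totals)
    return cube

cube = build_cube(facts)

# Answering a question is now a dictionary lookup, not a scan of the facts.
q1_total = cube[("quarter",)][("Q1",)]
q1_new_jersey = cube[("quarter", "city")][("Q1", "New Jersey")]
grand_total = cube[()][()]
print(q1_total, q1_new_jersey, grand_total)
```

With three dimensions there are eight pre-computed groupings; the number doubles with every added dimension, which is one reason cubes trade storage for query speed.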
#### Basic analytical operations of OLAP


1. Roll-up:
Roll-up is also known as "consolidation" or "aggregation." The roll-up operation can be performed in 2 ways:
- Reducing dimensions
- Climbing up the concept hierarchy. A concept hierarchy is a system of grouping things based on their order or level.

Consider the following diagram:
<img src="../../../images/olaprollup.png" style="height:70vh">

- In this example, the cities New Jersey and Los Angeles are rolled up into the country USA.
- The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively. They become 2000 after roll-up.
- In this aggregation process, data in the location hierarchy moves up from city to country.


2. Drill-down:
In drill-down, data is fragmented into smaller parts. It is the opposite of the roll-up process. It can be done via:
- Moving down the concept hierarchy
- Increasing a dimension

Consider the diagram below:
<img src="../../../images/olapdrilldown.png" style="height:70vh">

- Quarter Q1 is drilled down to the months January, February, and March. Corresponding sales are also registered.
- In this example, the dimension month is added.

3. Slice:
Here, one dimension is selected, and a new sub-cube is created.
The following diagram explains how the slice operation is performed:
<img src="../../../images/olapslice.png" style="height:70vh">

- The dimension Time is sliced with Q1 as the filter.
- A new cube is created altogether.

4. Dice:
This operation is similar to a slice. The difference in dice is you select 2 or more dimensions that result in the creation of a sub-cube.
<img src="../../../images/olapdice.png" style="height:70vh">
 
5. Pivot:
In pivot, you rotate the data axes to provide an alternative presentation of the data.
In the following example, the pivot is based on item types.
 <img src="../../../images/olappivot.png" style="height:70vh">
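
The five operations can also be sketched in plain Python on a handful of fact rows. This is a simplified model under stated assumptions: the hierarchies (`CITY_TO_COUNTRY`, `MONTH_TO_QUARTER`), the Toronto rows and the helper functions are invented for illustration, while the New Jersey and Los Angeles figures (440 and 1560) come from the roll-up example above.

```python
from collections import defaultdict

# Hypothetical concept hierarchies: city -> country and month -> quarter.
CITY_TO_COUNTRY = {"New Jersey": "USA", "Los Angeles": "USA", "Toronto": "Canada"}
MONTH_TO_QUARTER = {"January": "Q1", "February": "Q1", "March": "Q1", "April": "Q2"}

facts = [
    {"month": "January", "city": "New Jersey", "item": "Mobile", "sales": 440},
    {"month": "February", "city": "Los Angeles", "item": "Mobile", "sales": 1560},
    {"month": "March", "city": "Toronto", "item": "Modem", "sales": 395},
    {"month": "April", "city": "Toronto", "item": "Mobile", "sales": 605},
]

def aggregate(rows, key_func):
    """Total the sales measure for each key produced by key_func."""
    totals = defaultdict(int)
    for row in rows:
        totals[key_func(row)] += row["sales"]
    return dict(totals)

def in_q1(row):
    return MONTH_TO_QUARTER[row["month"]] == "Q1"

# Roll-up: climb the location hierarchy from city to country.
by_country = aggregate(facts, lambda r: CITY_TO_COUNTRY[r["city"]])

# Drill-down: start at quarter level, then break Q1 down into months.
by_quarter = aggregate(facts, lambda r: MONTH_TO_QUARTER[r["month"]])
q1_by_month = aggregate([r for r in facts if in_q1(r)], lambda r: r["month"])

# Slice: fix one dimension (time = Q1) to get a sub-cube.
q1_slice = [r for r in facts if in_q1(r)]

# Dice: restrict two or more dimensions (time = Q1 and item = Mobile).
q1_mobile_dice = [r for r in facts if in_q1(r) and r["item"] == "Mobile"]

# Pivot: same totals, with the row and column axes swapped.
def pivot(rows, row_dim, col_dim):
    grid = defaultdict(dict)
    for r in rows:
        cell = grid[r[row_dim]]
        cell[r[col_dim]] = cell.get(r[col_dim], 0) + r["sales"]
    return dict(grid)

city_by_item = pivot(facts, "city", "item")
item_by_city = pivot(facts, "item", "city")

print(by_country)
print(by_quarter, q1_by_month)
print(len(q1_slice), len(q1_mobile_dice))
print(city_by_item["Toronto"], item_by_city["Mobile"])
```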








   

 






















# No exercise

### Solution code

```python
# No exercise
```