In [1]:
import pandas as pd
import numpy as np

In [2]:
!pip install ipython-sql



# Lecture 1. Basic concepts of DBMS (database management systems).

Questions:
* Creating a database.
* Types of databases.
* Normalization.
* Connections between tables.

<b>Creating a database.</b>

Efficient data management typically requires the use of a computer database. A database is a
shared, integrated computer structure that stores a collection of the following:

* End-user data, that is, raw facts of interest to the end user
* Metadata, or data about data, through which the end-user data is integrated and managed

A database management system (DBMS) is a collection of programs that manages the
database structure and controls access to the data stored in the database. In a sense, a database
resembles a very well-organized electronic filing cabinet in which powerful software (the
DBMS) helps manage the cabinet’s contents.

The DBMS serves as the intermediary between the user and the database. The database structure itself is stored as a collection of files, and the only way to access the data in those files is
through the DBMS.
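As a rough illustration of the DBMS acting as the intermediary, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customer` table and its columns are hypothetical names chosen for this example; they do not come from the lecture.

```python
import sqlite3

# An in-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# End-user data goes into a table (hypothetical table and column names).
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customer (name) VALUES (?)", ("Alice",))

# The data is read back through the DBMS, not by parsing storage files.
rows = conn.execute("SELECT id, name FROM customer").fetchall()

# Metadata (data about data) is also kept by the DBMS, here in sqlite_master.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

print(rows)    # [(1, 'Alice')]
print(tables)  # ['customer']
conn.close()
```

Both the end-user data and the metadata are reached with queries, which is the point of the filing-cabinet analogy: the program never touches the underlying files directly.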

<b> Types of databases </b>

A DBMS can be used to build many types of databases. Each database stores a particular collection of data and is used for a specific purpose.

A single-user database supports only one user at a time. In other words, if user A is
using the database, users B and C must wait until user A is done. A single-user database that
runs on a personal computer is called a desktop database.

In contrast, a multiuser database
supports multiple users at the same time. When the multiuser database supports a relatively
small number of users (usually fewer than 50) or a specific department within an organization,
it is called a workgroup database.

A database that supports
data located at a single site is called a centralized database. A database that supports data
distributed across several different sites is called a distributed database.

A cloud database is a database that is created and maintained using
cloud data services, such as Microsoft Azure or Amazon Web Services (AWS). These services, offered by
third-party vendors, provide defined performance measures (data storage capacity, required
throughput, and availability) for the database, but do not necessarily specify the underlying
infrastructure to implement it. The data owners do not have to know, or be concerned about,
what hardware and software are being used to support their databases. The performance capabilities can be renegotiated with the cloud provider as the business demands on the database
change.

In some contexts, such as research environments, a popular way of classifying databases is
according to the type of data stored in them. Using this criterion, databases are grouped into two
categories: general-purpose and discipline-specific databases. General-purpose databases
contain a wide variety of data used in multiple disciplines—for example, a census database
that contains general demographic data and the LexisNexis and ProQuest databases that
contain newspaper, magazine, and journal articles for a variety of topics. Discipline-specific
databases contain data focused on specific subject areas. The data in this type of database is
used mainly for academic or research purposes within a small set of disciplines. 

![image.png](attachment:14b559ad-23de-4559-84ce-ee2ff26f5b33.png)

<b> Normalization </b>

The table is the basic building block of database design. Consequently, the table’s structure
is of great interest. Ideally, entity relationship (ER) modeling yields good table structures. Yet, it is possible to create poor table
structures even in a good database design. How do you recognize a poor table structure, and
how do you produce a good table? The answer to both questions involves normalization.
Normalization is a process for evaluating and correcting table structures to minimize data
redundancies, thereby reducing the likelihood of data anomalies. The normalization process
involves assigning attributes to tables based on the concepts of determination and functional
dependency from the relational database model.

Normalization works through a series of stages called normal forms. The first three stages
are described as first normal form (1NF), second normal form (2NF), and third normal form
(3NF). From a structural point of view, 2NF is better than 1NF, and 3NF is better than 2NF.
For most purposes in business database design, 3NF is as high as you need to go in the normalization process.

Although normalization is a very important ingredient in database design, you should not
assume that the highest level of normalization is always the most desirable. Generally, the
higher the normal form, the more relational join operations you need to produce a specified
output. Also, more resources are required by the database system to respond to end-user queries. A successful design must also consider end-user demand for fast performance. Therefore,
you will occasionally need to denormalize some portions of a database design to meet performance requirements. Denormalization produces a lower normal form; that is, a 3NF table will be
converted to a 2NF table through denormalization. However, the price you pay for increased performance through denormalization is greater data redundancy.
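The trade-off between joins and redundancy can be sketched with `sqlite3`. The tables below (`department`, `employee`, `employee_flat`) are hypothetical and exist only for this illustration: the normalized pair needs a join to produce the report, while the flat table answers the same question without a join but repeats the department name on every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: each department name is stored once.
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        emp_name TEXT,
        dept_id INTEGER REFERENCES department(dept_id)
    );
    INSERT INTO department VALUES (1, 'Sales'), (2, 'Support');
    INSERT INTO employee VALUES (1, 'Ann', 1), (2, 'Bob', 1), (3, 'Eve', 2);

    -- Denormalized: the department name is repeated on every employee row.
    CREATE TABLE employee_flat (
        emp_id INTEGER PRIMARY KEY, emp_name TEXT, dept_name TEXT
    );
    INSERT INTO employee_flat VALUES
        (1, 'Ann', 'Sales'), (2, 'Bob', 'Sales'), (3, 'Eve', 'Support');
""")

# Normalized design: a join is required to pair names with departments.
normalized = conn.execute("""
    SELECT e.emp_name, d.dept_name
    FROM employee AS e
    JOIN department AS d ON d.dept_id = e.dept_id
    ORDER BY e.emp_id
""").fetchall()

# Denormalized design: same output, no join, but 'Sales' is stored twice.
denormalized = conn.execute(
    "SELECT emp_name, dept_name FROM employee_flat ORDER BY emp_id"
).fetchall()

print(normalized == denormalized)  # True
conn.close()
```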

In this section, you learn how to use normalization to produce a set of normalized relations
(tables) that will be used to generate the required information. The objective of normalization
is to ensure that each table conforms to the concept of well-formed relations—in other words,
tables that have the following characteristics:
* Each relation (table) represents a single subject. For example, a COURSE table will contain only data that directly pertain to courses. Similarly, a STUDENT table will contain
only student data.
* Each row/column intersection contains only one (a single) value and not a group of values.
* No data item will be unnecessarily stored in more than one table (tables have minimum
controlled redundancy). The reason for this requirement is to ensure that the data is
updated in only one place.
* All nonprime attributes in a relation (table) are dependent on the primary key—the entire
primary key and nothing but the primary key. The reason for this requirement is to ensure
that the data is uniquely identifiable by a primary key value.
* Each relation (table) has no insertion, update, or deletion anomalies, which ensures the
integrity and consistency of the data.
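The characteristics above can be illustrated with a small decomposition in plain Python. This is only a sketch: the flat rows and the field names (`stu_id`, `course_id`, and so on) are made up for the example, and real normalization would be driven by the functional dependencies in your own data.

```python
# A flat structure: student and course facts are repeated on every row.
flat_rows = [
    {"stu_id": 1, "stu_name": "Ann", "course_id": "DB101", "course_title": "Databases"},
    {"stu_id": 1, "stu_name": "Ann", "course_id": "PY201", "course_title": "Python"},
    {"stu_id": 2, "stu_name": "Bob", "course_id": "DB101", "course_title": "Databases"},
]

# STUDENT: one entry per student, keyed by stu_id (a single subject).
student = {r["stu_id"]: r["stu_name"] for r in flat_rows}

# COURSE: one entry per course, keyed by course_id (a single subject).
course = {r["course_id"]: r["course_title"] for r in flat_rows}

# ENROLLMENT: only the pairs of keys that link students to courses.
enrollment = sorted({(r["stu_id"], r["course_id"]) for r in flat_rows})

# A course title now lives in exactly one place, so renaming it is a
# single update rather than one update per enrolled student.
course["DB101"] = "Database Systems"

print(student)     # {1: 'Ann', 2: 'Bob'}
print(enrollment)  # [(1, 'DB101'), (1, 'PY201'), (2, 'DB101')]
```

In the flat structure the same rename would have to touch two rows, and missing one of them would be an update anomaly.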

Normalization is typically used in conjunction with entity relationship modeling. Database designers commonly use normalization in two situations. When designing a new database structure based on the business requirements of the
end users, the database designer can construct a data model using a technique such as Crow’s
Foot notation ERDs. After the initial design is complete, the designer can use normalization to analyze the relationships among the attributes within each entity and determine if the
structure can be improved through normalization. Alternatively, and also more frequently,
database designers are often asked to modify existing data structures that can be in the form of
flat files, spreadsheets, or older database structures. Again, by analyzing relationships among the
attributes or fields in the data structure, the database designer can use the normalization process to improve the existing data structure and create an appropriate database design. Whether
you are designing a new database structure or modifying an existing one, the normalization
process is the same.

It is very rare to design a completely new database using just normalization. Commonly,
you start by defining the business rules and data constraints, and identifying the functional
dependencies, entities, and attributes using data modeling techniques such as ER modeling.
Then, you apply normalization concepts to validate and further refine the model.

<b>Connections between tables.</b>

Giving some thought to how your tables should relate to each other also helps ensure data integrity and data accuracy, and keeps redundant data to a minimum.

One-to-one relationship
In a one-to-one relationship, a record in one table can correspond to only one record in another table (or in some cases, no records). One-to-one relationships aren’t the most common, since in many cases you can store corresponding information in the same table. Whether you split up that information into multiple tables depends on your overall data model and design methodology; if you’re keeping tables as narrowly focused as possible (like in a normalized database), then you may find one-to-one relationships useful.
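One common way to model a one-to-one relationship is a foreign key that is also declared `UNIQUE`. The sketch below assumes hypothetical `person` and `passport` tables and uses SQLite through Python's `sqlite3` module; other database systems offer similar constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE passport (
        passport_id INTEGER PRIMARY KEY,
        number TEXT NOT NULL,
        -- UNIQUE allows at most one passport row per person.
        person_id INTEGER NOT NULL UNIQUE REFERENCES person(person_id)
    );
    INSERT INTO person VALUES (1, 'Ann');
    INSERT INTO passport VALUES (1, 'X123', 1);
""")

# A second passport for the same person violates the UNIQUE constraint.
try:
    conn.execute("INSERT INTO passport VALUES (2, 'Y456', 1)")
    second_allowed = True
except sqlite3.IntegrityError:
    second_allowed = False

print(second_allowed)  # False
conn.close()
```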

One-to-many relationship
One-to-many relationships are the most common type of relationships between tables in a database. In a one-to-many (sometimes called many-to-one) relationship, a record in one table corresponds to zero, one, or many records in another table.
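A one-to-many relationship is usually expressed by a foreign key on the "many" side. In this sketch the `customer` and `invoice` tables are hypothetical; the query counts the invoices per customer and shows the zero, one, and many cases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE invoice (
        invoice_id INTEGER PRIMARY KEY,
        -- Each invoice points at exactly one customer.
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        total REAL
    );
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Eve');
    INSERT INTO invoice VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# LEFT JOIN keeps customers that have no invoices at all.
counts = conn.execute("""
    SELECT c.name, COUNT(i.invoice_id)
    FROM customer AS c
    LEFT JOIN invoice AS i ON i.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(counts)  # [('Ann', 2), ('Bob', 1), ('Eve', 0)]
conn.close()
```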

Many-to-many relationship
A many-to-many relationship indicates that multiple records in a table are linked to multiple records in another table. Any given record may be associated with only a single record (or none at all), but the key is that records can be, and often are, linked to more than one. Many-to-many relationships aren’t very common in practical database use cases, since adhering to normalization often involves breaking up many-to-many relationships into separate, more focused tables.

In fact, your database system may not even allow for the creation of a direct many-to-many relationship, but you can get around this by creating a third table, known as a join table, and creating one-to-many relationships between it and your two starting tables.

In this sense, the Orders table in Metabase’s Sample Database acts as a join table, creating an intermediate link between People and Products. An ERD of the Sample Database would look something like the image below, where each relationship is specified by the type of line used to connect the tables.
    
Technically speaking the Products and Orders tables have a one-to-many relationship, in that one product can be associated with many orders. But according to our fake company’s database, people seem to only order a single product (they’ll buy like five Lightweight Wool Computers for whatever reason). A real-world (and perhaps more business-savvy) implementation of this database would probably include a join table between the two, making it so orders could contain many different products.
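The join-table approach described above can be sketched as follows. The `author`, `book`, and `author_book` tables are hypothetical and are not part of Metabase's Sample Database; the join table holds only pairs of keys, with a one-to-many relationship to each of the two starting tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (book_id INTEGER PRIMARY KEY, title TEXT);

    -- Join table: one row per (author, book) pair.
    CREATE TABLE author_book (
        author_id INTEGER REFERENCES author(author_id),
        book_id INTEGER REFERENCES book(book_id),
        PRIMARY KEY (author_id, book_id)
    );

    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO book VALUES (1, 'SQL Basics'), (2, 'Data Modeling');
    -- Ann wrote both books; 'SQL Basics' has two authors.
    INSERT INTO author_book VALUES (1, 1), (2, 1), (1, 2);
""")

pairs = conn.execute("""
    SELECT a.name, b.title
    FROM author_book AS ab
    JOIN author AS a ON a.author_id = ab.author_id
    JOIN book AS b ON b.book_id = ab.book_id
    ORDER BY b.book_id, a.author_id
""").fetchall()

print(pairs)
conn.close()
```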

<b>Data Types (PostgreSQL)</b>

Numeric Data Types
* serial: represents an auto-incrementing numeric value that occupies 4 bytes and can store numbers from 1 to 2147483647. The value of this type is formed by auto-incrementing the value of the previous row. Therefore, as a rule, this type is used to define row identifiers.

* smallserial: represents an auto-incrementing numeric value that occupies 2 bytes and can store numbers from 1 to 32767. Analogous to the serial type for small numbers.

* bigserial: represents an auto-incrementing numeric value that occupies 8 bytes and can store numbers from 1 to 9223372036854775807. Analogous to the serial type for large numbers.

* smallint: stores numbers from -32768 to +32767. Occupies 2 bytes. Has an alias of int2.

* integer: stores numbers from -2147483648 to +2147483647. Occupies 4 bytes. Has aliases int and int4.

* bigint: stores numbers from -9223372036854775808 to +9223372036854775807. Occupies 8 bytes. Has alias int8.

* numeric: stores numbers with fixed precision, which can have up to 131072 digits in the integer part and up to 16383 digits after the decimal point.

This type can take two parameters, precision and scale: numeric(precision, scale).

The precision parameter specifies the maximum number of digits that the number can store.

The scale parameter represents the maximum number of digits that the number can contain after the decimal point. This value must be between 0 and the precision parameter. By default, it is 0.

For example, for the number 23.5141, precision is 6 and scale is 4.
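The precision and scale of the example number can be checked with Python's `decimal` module. The helper `precision_and_scale` is a hypothetical name introduced here; it only inspects a decimal literal and is not a PostgreSQL function.

```python
from decimal import Decimal


def precision_and_scale(text):
    """Return (precision, scale) for a plain decimal literal such as '23.5141'."""
    _sign, digits, exponent = Decimal(text).as_tuple()
    scale = max(-exponent, 0)             # digits after the decimal point
    precision = max(len(digits), scale)   # total number of significant digits
    return precision, scale


print(precision_and_scale("23.5141"))  # (6, 4)
print(precision_and_scale("100"))      # (3, 0)
```

So a column declared as numeric(6, 4) is the smallest declaration that fits 23.5141.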

* decimal: stores numbers with fixed precision, which can have up to 131072 digits in the integer part and up to 16383 digits in the fractional part. Same as numeric.

* real: stores floating-point numbers in the range from 1E-37 to 1E+37. Occupies 4 bytes. Has an alias float4.

* double precision: stores floating-point numbers in the range from 1E-307 to 1E+308. Occupies 8 bytes. Has an alias float8.
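The integer ranges listed above follow directly from the storage size, assuming the usual two's-complement signed representation. The helper `signed_range` below is a hypothetical name used only for this check.

```python
def signed_range(num_bytes):
    """Return (min, max) of a signed two's-complement integer of this size."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for name, size in [("smallint", 2), ("integer", 4), ("bigint", 8)]:
    low, high = signed_range(size)
    print(f"{name}: {low} to {high}")
```

The upper bounds of smallserial, serial, and bigserial match the upper bounds of smallint, integer, and bigint, because the serial types start counting at 1 and use the same storage sizes.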

Types for working with currency (monetary units)
For working with monetary units, the money type is defined, which can take values in the range from -92233720368547758.08 to +92233720368547758.07 and occupies 8 bytes.

Character types
* character(n): represents a string of a fixed number of characters. The parameter specifies the number of characters in the string. It has the alias char(n).

* character varying(n): represents a string of variable length. The parameter specifies the maximum number of characters in the string. It has the alias varchar(n).

* text: represents text of arbitrary length.
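The difference between the fixed-length and variable-length types can be mimicked in plain Python. This is a simplified sketch, not the database's actual implementation: `as_char` and `as_varchar` are hypothetical helpers, and they ignore details such as how trailing spaces in over-long input are handled.

```python
def as_char(value, n):
    """Mimic character(n): pad with spaces up to n, reject longer strings."""
    if len(value) > n:
        raise ValueError(f"value too long for type character({n})")
    return value.ljust(n)


def as_varchar(value, n):
    """Mimic character varying(n): keep the string as is, reject longer ones."""
    if len(value) > n:
        raise ValueError(f"value too long for type character varying({n})")
    return value


print(repr(as_char("ab", 5)))     # 'ab   '
print(repr(as_varchar("ab", 5)))  # 'ab'
```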

Binary data
The bytea type is defined to store binary data. It stores data as binary strings, which represent a sequence of octets or bytes.

Types for working with dates and times
* timestamp: stores date and time. Takes 8 bytes. For dates, the lowest value is 4713 BC, the highest value is 294276 AD.

* timestamp with time zone: same as timestamp, but adds time zone information.

* date: represents a date from 4713 BC to 5874897 AD. Takes 4 bytes.

* time: stores time with 1 microsecond precision, without specifying a time zone. Accepts values from 00:00:00 to 24:00:00. Takes 8 bytes.

* time with time zone: stores time with 1 microsecond precision, with time zone information. Accepts values from 00:00:00+1459 to 24:00:00-1459. Takes 12 bytes.

* interval: represents a time interval. It occupies 16 bytes.

Logical type
The boolean type can store one of two values: true or false.

The following values can be specified instead of true: TRUE, 't', 'true', 'y', 'yes', 'on', '1'.

The following values can be specified instead of false: FALSE, 'f', 'false', 'n', 'no', 'off', '0'.
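The accepted literals can be turned into a small parser, which is handy when cleaning data before loading it. `parse_boolean` is a hypothetical helper written for this example; it treats the literals case-insensitively and ignores surrounding whitespace.

```python
TRUE_LITERALS = {"t", "true", "y", "yes", "on", "1"}
FALSE_LITERALS = {"f", "false", "n", "no", "off", "0"}


def parse_boolean(text):
    """Map one of the string literals listed above to a Python bool."""
    value = text.strip().lower()
    if value in TRUE_LITERALS:
        return True
    if value in FALSE_LITERALS:
        return False
    raise ValueError(f"invalid boolean literal: {text!r}")


print(parse_boolean("yes"))  # True
print(parse_boolean("OFF"))  # False
```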

Types for representing Internet addresses
* cidr: Internet address in IPv4 or IPv6 format. For example, 192.168.0.1. Takes from 7 to 19 bytes.

* inet: Internet address in cidr/y format, where cidr is the IPv4 or IPv6 address, and /y is the number of bits in the address (if this parameter is not specified, 32 for IPv4 and 128 for IPv6 are used). For example, 192.168.0.1/24 or 2001:4f8:3:ba:2e0:81ff:fe22:d1f1/128. Takes up 7 to 19 bytes.

* macaddr: stores the MAC address. Takes up 6 bytes.

* macaddr8: stores the MAC address in EUI-64 format. Takes up 8 bytes.
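Python's standard `ipaddress` module uses the same address/prefix notation, so it can be used to explore these values outside the database. The sketch below only illustrates the notation; it does not reproduce how the database stores the types.

```python
import ipaddress

# An address with an explicit prefix length (number of network bits).
with_prefix = ipaddress.ip_interface("192.168.0.1/24")

# Without a prefix, the full address length is assumed.
without_prefix = ipaddress.ip_interface("192.168.0.1")
v6 = ipaddress.ip_interface("2001:4f8:3:ba:2e0:81ff:fe22:d1f1")

print(with_prefix.network)               # 192.168.0.0/24
print(without_prefix.network.prefixlen)  # 32
print(v6.network.prefixlen)              # 128
```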

Geometric types
* point: represents a point on the plane in the format (x,y). Takes up 16 bytes.

* line: represents a line of indefinite length in the format {A,B,C}. Takes up 24 bytes.

* lseg: represents a line segment in the format ((x1,y1),(x2,y2)). Takes up 32 bytes.

* box: represents a rectangle in the format ((x1,y1),(x2,y2)). Takes up 32 bytes.

* path: represents a set of connected points. In the format ((x1,y1),...) the path is closed (the first and last point are connected by a line) and is effectively a polygon. In the format [(x1,y1),...] the path is open. Takes up 16+16n bytes.

* polygon: represents a polygon in the format ((x1,y1),...). Takes up 40+16n bytes.

* circle: represents a circle in the format <(x,y),r>. Takes up 24 bytes.
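Several of these sizes can be reproduced by assuming each coordinate is stored as an 8-byte floating-point number. The check below uses Python's `struct` module; the mapping of types to numbers of doubles is an assumption made for illustration.

```python
import struct

# Number of 8-byte doubles assumed per value.
doubles_per_type = {
    "point": 2,   # x, y
    "lseg": 4,    # x1, y1, x2, y2
    "box": 4,     # two opposite corners
    "circle": 3,  # center x, center y, radius
}

# "<Nd" means N little-endian doubles with no padding.
sizes = {name: struct.calcsize(f"<{n}d") for name, n in doubles_per_type.items()}

print(sizes)  # {'point': 16, 'lseg': 32, 'box': 32, 'circle': 24}
```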

Other data types
* json: stores json data in text form.

* jsonb: stores json data in binary format.

* uuid: stores a universally unique identifier (UUID), for example, a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11. Takes up 16 bytes.

* xml: stores data in XML format.
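The example UUID from the list above can be inspected with Python's standard `uuid` module. A UUID is a 128-bit value, so its binary form is 16 bytes long, while its text form contains 32 hexadecimal digits plus hyphens.

```python
import uuid

value = uuid.UUID("a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11")

print(len(value.bytes))  # 16 (binary form, in bytes)
print(len(value.hex))    # 32 (hexadecimal digits in the text form)
print(str(value))        # a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11
```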