# Chapter 17: Big Data - Hadoop, Spark, NoSQL, IoT

### 17.1 Introduction

**Databases**
- Relational databases: store structured data in tables with a fixed number of columns per row (manipulated with SQL)
- NoSQL databases: created to handle unstructured / semi-structured data
- NewSQL databases: blend the benefits of relational and NoSQL databases

Types of NoSQL databases:
- key-value
- document
- columnar
- graph

**Apache Hadoop**
- Designed for distributed data processing with massive parallelism across clusters of computers
- Hadoop executes tasks by breaking them into pieces that do lots of disk I/O across many computers
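The idea of breaking a task into pieces and combining the partial results can be sketched in plain Python. This is only a single-machine illustration of the split-then-combine pattern (the style usually called MapReduce), not Hadoop's actual API; the helper names `split_into_chunks`, `map_chunk`, and `merge_counts` and the sample lines are assumptions made for this example.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(lines, n_chunks):
    """Divide the input into n_chunks pieces (hypothetical helper)."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """Count the words in one piece; on a cluster, each piece would run on its own node."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Combine two partial word counts into one."""
    return left + right


# Made-up sample input standing in for a large data set
lines = [
    'big data big clusters',
    'data on many computers',
    'many computers many disks',
]

partials = [map_chunk(chunk) for chunk in split_into_chunks(lines, 2)]
totals = reduce(merge_counts, partials, Counter())
```

Here the pieces are processed one after another in a list comprehension; the point is only that each piece can be counted independently before the results are merged.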

**Apache Spark**
- Spark was developed as a way to perform certain big-data tasks in memory for better performance

**IoT**
- Publish/subscribe is the model that IoT and other types of applications use to connect data users with data providers
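A minimal in-process sketch of the publish/subscribe model follows. The `Broker` class and the topic names are hypothetical, invented for illustration; a real IoT system would typically use a networked messaging service rather than a local object.

```python
from collections import defaultdict


class Broker:
    """Hypothetical in-process broker connecting publishers to subscribers."""

    def __init__(self):
        # Maps each topic name to the list of callbacks subscribed to it
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a data user's callback for a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver a data provider's message to every subscriber of the topic."""
        for callback in self._subscribers[topic]:
            callback(message)


broker = Broker()
received = []

broker.subscribe('sensors/temperature', received.append)
broker.publish('sensors/temperature', 21.5)
broker.publish('sensors/humidity', 40)  # no subscribers, so nothing is delivered
```

The publisher never refers to the subscriber directly; both sides only know the topic name, which is what lets data providers and data users be added or removed independently.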


### 17.2 Relational Databases and Structured Query Language (SQL)

- A database is an integrated collection of data
- A database management system (DBMS) provides mechanisms for storing and organizing data in a manner consistent with the database's format
- Relational database management systems (RDBMSs) store data in tables and define relationships among the tables
- SQL is used almost universally with relational database systems to manipulate data and perform queries
- Most popular database systems have Python support; each typically provides a module that adheres to Python's Database Application Programming Interface (DB-API)
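As a rough sketch of what a DB-API module looks like in use, the standard library's `sqlite3` module follows the connect / cursor / execute / fetch pattern. The `pets` table and its rows are made up for this example, and an in-memory database is used so that no file is created.

```python
import sqlite3

# ':memory:' creates a temporary database that exists only for this connection
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

# Hypothetical table used only for this illustration
cursor.execute('CREATE TABLE pets (id INTEGER PRIMARY KEY, name TEXT)')

# '?' placeholders let the module substitute values safely
cursor.executemany('INSERT INTO pets (name) VALUES (?)', [('Rex',), ('Tom',)])
connection.commit()

cursor.execute('SELECT id, name FROM pets ORDER BY id')
rows = cursor.fetchall()  # a list of tuples, one per row

connection.close()
```

Modules for other database systems generally expose the same `connect`, `cursor`, `execute`, and `fetchall` names, although connection arguments and placeholder syntax can differ between them.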

Popular open-source RDBMSs:
- SQLite
- PostgreSQL
- MariaDB
- MySQL

Proprietary RDBMSs:
- Microsoft SQL Server
- Oracle
- Sybase
- IBM Db2

**Tables, Rows, and Columns**
- Tables are composed of rows
- Rows are composed of columns
- Primary key: a column (or group of columns) with a value that's unique for each row
- Rows are unique (by primary key) within a table, but particular column values may be duplicated between rows
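The uniqueness rule can be observed with a small experiment. This sketch uses an in-memory SQLite database and a made-up `people` table (not the chapter's books database): a repeated non-key value is accepted, while a repeated primary key is rejected.

```python
import sqlite3

connection = sqlite3.connect(':memory:')

# Hypothetical table; 'id' is the primary key
connection.execute(
    'CREATE TABLE people (id INTEGER PRIMARY KEY, first TEXT, last TEXT)')

connection.execute("INSERT INTO people VALUES (1, 'Ann', 'Lee')")
# Same last name as the first row: allowed, because 'last' is not a key
connection.execute("INSERT INTO people VALUES (2, 'Bob', 'Lee')")

try:
    # Reusing id 1 violates the primary key's uniqueness
    connection.execute("INSERT INTO people VALUES (1, 'Cy', 'Ng')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = connection.execute('SELECT COUNT(*) FROM people').fetchone()[0]
connection.close()
```

After the failed insert the table still holds only the two valid rows.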

### 17.2.1 A books Database

In [1]:
# Connect to the database in Python

import sqlite3

connection = sqlite3.connect('books.db')

**authors Table**
- The authors table stores all the authors and has three columns: 'id', 'first', 'last'

**Viewing the authors Table's contents**
- The pandas function read_sql executes a SQL query and returns a DataFrame containing the query's results
- read_sql arguments: a string containing the SQL query to execute, the SQLite database's Connection object, and an optional index_col keyword argument indicating which column should be used as the DataFrame's row indices
- When the index_col keyword is not passed, index values starting from 0 appear to the left of the DataFrame's rows

In [3]:
import pandas as pd

pd.options.display.max_columns = 10

pd.read_sql('SELECT * FROM authors', connection, index_col=['id'])

# A SQL SELECT query gets rows and columns from one or more tables in a database
# The * is a wildcard indicating that the query should get all columns from the authors table

id,first,last
1,Paul,Deitel
2,Harvey,Deitel
3,Abbey,Deitel
4,Dan,Quirk
5,Alexander,Wald


**titles table**
- The titles table stores all the books and has four columns: 'isbn', 'title', 'edition', 'copyright'

In [4]:
pd.read_sql('SELECT * FROM titles', connection)

,isbn,title,edition,copyright
0,135404673,Intro to Python for CS and DS,1,2020
1,132151006,Internet & WWW How to Program,5,2012
2,134743350,Java How to Program,11,2018
3,133976890,C How to Program,8,2016
4,133406954,Visual Basic 2012 How to Program,6,2014
5,134601548,Visual C# How to Program,6,2017
6,136151574,Visual C++ How to Program,2,2008
7,134448235,C++ How to Program,10,2017
8,134444302,Android How to Program,3,2017
9,134289366,Android 6 for Programmers,3,2016
