# Structured Query Language (SQL)

### RECAP:

1. Pandas Apply
2. Pandas Groupby
3. Pandas Datatypes
4. Pandas Cleaning Strategies

#### SQL in Python for data scientists

In this session we'll be using SQLite. In industry, most companies use MySQL or PostgreSQL (often as cloud-hosted services). The language and logic are similar.

SQL is a language designed to work with data.

SQLite, MySQL, etc. are database systems that store data in a relational way.

<img src="./images/sql.jpg" height=400 width=400>

With a lot more tables:

<img src="images/relational.png" height=400 width=400>

We use SQL because:

1. It's an efficient data querying language
2. It's the industry standard
3. It runs on scalable database systems

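For example, to find a row in a CSV file you need to parse the file line by line, whereas a database can look the row up through an index, which is typically much faster on large tables (roughly logarithmic rather than linear in the number of rows).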
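A minimal sketch of the scan-versus-index difference, assuming a hypothetical in-memory table `track` and index `idx_track_name` (neither is part of the session's database). `EXPLAIN QUERY PLAN` asks SQLite how it would run a query; the exact wording of the plan varies between SQLite versions.

```python
import sqlite3

# Hypothetical in-memory database, used only for illustration
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE track (track_id INTEGER PRIMARY KEY, name TEXT)")
conn_demo.executemany(
    "INSERT INTO track (name) VALUES (?)",
    [(f"song {i}",) for i in range(1000)],
)

query = "EXPLAIN QUERY PLAN SELECT * FROM track WHERE name = 'song 500'"

# Without an index on name, SQLite has to scan the whole table
plan_before = conn_demo.execute(query).fetchall()[0][-1]

# With an index on name, SQLite can search the index instead
conn_demo.execute("CREATE INDEX idx_track_name ON track (name)")
plan_after = conn_demo.execute(query).fetchall()[0][-1]

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan should describe a scan of `track`, the second a search that uses `idx_track_name`.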

In [8]:
import pandas as pd
import sqlite3

### Types of SQL query

1. DQL (Data Query Language): Anything that reads data without permanently changing the database itself
2. DML (Data Manipulation Language): Anything that changes the data inside tables and rows, but does not change internal structure such as column names or column data types
3. DDL (Data Definition Language): Anything that changes the structure of databases, such as the columns. This is also used to create new tables and remove tables in schemas.
4. DCL (Data Control Language): Manages permissions, so that the correct people are given the correct access to the correct databases
5. TCL (Transaction Control Language): Groups changes into transactions that can be committed or rolled back, which keeps manipulations robust in real production
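As a rough illustration of these categories, here is a sketch with one statement of each kind, run against a hypothetical in-memory table `genre` (made up for this example). DCL is left out because SQLite has no user accounts and does not support `GRANT` or `REVOKE`.

```python
import sqlite3

# Hypothetical in-memory database, used only for illustration
conn_demo = sqlite3.connect(":memory:")
cur = conn_demo.cursor()

# DDL: defines the structure of a table
cur.execute("CREATE TABLE genre (genre_id INTEGER PRIMARY KEY, name TEXT)")

# DML: changes rows, not the structure
cur.execute("INSERT INTO genre (name) VALUES ('Rock'), ('Jazz')")
cur.execute("UPDATE genre SET name = 'Blues' WHERE name = 'Jazz'")

# TCL: commit makes the changes above permanent
conn_demo.commit()

# DML followed by TCL: rollback undoes the uncommitted delete
cur.execute("DELETE FROM genre")
conn_demo.rollback()

# DQL: reads data without changing anything
rows = cur.execute("SELECT name FROM genre ORDER BY name").fetchall()
print(rows)
```

Because the delete is rolled back, the final query still returns both rows.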

As data scientists, we'll mostly use SQL to extract information from a database and then manipulate it with pandas. This is the standard workflow, so this session will focus only on DQL.

The basic syntax of DQL:

```{sql}
SELECT Columns, Expressions
FROM Tables, Joins
WHERE Condition
GROUP BY Column
HAVING Condition
ORDER BY Column
LIMIT number OFFSET number;
```
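To make the template concrete, here is a sketch of one query that uses every clause. It runs on a hypothetical in-memory `track` table with made-up rows; the columns `genre` and `seconds` are illustrative and do not match the Chinook schema.

```python
import sqlite3

# Hypothetical in-memory table with made-up rows
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE track (name TEXT, genre TEXT, seconds INTEGER)")
conn_demo.executemany(
    "INSERT INTO track VALUES (?, ?, ?)",
    [
        ("a", "Rock", 200),
        ("b", "Rock", 300),
        ("c", "Jazz", 400),
        ("d", "Jazz", 200),
        ("e", "Pop", 100),
        ("f", "Pop", 120),
        ("g", "Pop", 30),
        ("h", "Blues", 500),
    ],
)

query = """
SELECT genre, COUNT(*) AS n_tracks, AVG(seconds) AS avg_seconds
FROM track
WHERE seconds >= 60          -- drop very short tracks
GROUP BY genre               -- one row per genre
HAVING COUNT(*) >= 2         -- keep genres with at least two tracks left
ORDER BY avg_seconds DESC    -- longest average first
LIMIT 2 OFFSET 0;            -- first two rows only
"""
rows = conn_demo.execute(query).fetchall()
print(rows)
```

Here `WHERE` removes track `g`, `HAVING` removes Blues (only one track), and `LIMIT` cuts Pop, leaving Jazz and Rock.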

All of this is considered one query. Queries can also be combined with each other through set operations, subqueries, CTEs, etc., but we won't cover that in this session.

SQL does not evaluate these clauses in the order they are written. The logical order is:

1. From: gathers the tables to create a single virtual table to query from
2. Where: initial filtering of rows
3. Group By: changes structure of virtual table
4. Having: filtering of rows based on group by
5. Select: filtering of columns and expressions on columns
6. Order By: sorting
7. Limit / Offset: controls the output
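One practical consequence of this order is that `WHERE` runs before grouping and so cannot use aggregates, while `HAVING` runs after grouping and can. A minimal sketch, assuming a hypothetical in-memory `track` table with made-up rows:

```python
import sqlite3

# Hypothetical in-memory table with made-up rows
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE track (name TEXT, genre TEXT)")
conn_demo.executemany(
    "INSERT INTO track VALUES (?, ?)",
    [("a", "Rock"), ("b", "Rock"), ("c", "Jazz")],
)

# WHERE is evaluated before GROUP BY, so an aggregate there is rejected
try:
    conn_demo.execute(
        "SELECT genre, COUNT(*) FROM track WHERE COUNT(*) > 1 GROUP BY genre"
    )
    where_error = None
except sqlite3.OperationalError as exc:
    where_error = str(exc)

# HAVING is evaluated after GROUP BY, so the same condition works
having_rows = conn_demo.execute(
    "SELECT genre, COUNT(*) FROM track GROUP BY genre HAVING COUNT(*) > 1"
).fetchall()

print("WHERE error:", where_error)
print("HAVING rows:", having_rows)
```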

Structure of today's session

Topic 1: From, Where, Group By, Having (50 mins)

Topic 2: Select, Order By, Limit / Offset (50 mins)

In [9]:
# set up a connection to the SQLite file
conn = sqlite3.connect("./data/chinook.sqlite")

# retrieve a cursor (lets us execute SQL)
cursor = conn.cursor()

# list the tables in the file
cursor.execute("PRAGMA table_list;")

# fetch all result rows
result = cursor.fetchall()

# column 1 of each row is the table name
for table in result:
    print(f"table: {table[1]}")

table: Track
table: PlaylistTrack
table: Playlist
table: sqlite_schema
table: Artist
table: Customer
table: Employee
table: Genre
table: Invoice
table: Album
table: InvoiceLine
table: MediaType
table: sqlite_temp_schema
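`PRAGMA table_list` is only available in SQLite 3.37 and later. On older versions, one alternative is to query the built-in `sqlite_master` table. The sketch below assumes a hypothetical in-memory database with two empty tables so that it runs on its own; the same `SELECT` should also work through the `cursor` above.

```python
import sqlite3

# Hypothetical in-memory database with two empty tables
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
conn_demo.execute("CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT)")

# sqlite_master holds one row per table, index, view, and trigger
table_names = [
    row[0]
    for row in conn_demo.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

for name in table_names:
    print(f"table: {name}")
```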


In [7]:
cursor.execute("SELECT * FROM Track LIMIT 2;")

result = cursor.fetchall()

result

[(1,
  'For Those About To Rock (We Salute You)',
  1,
  1,
  1,
  'Angus Young, Malcolm Young, Brian Johnson',
  343719,
  11170334,
  0.99),
 (2, 'Balls to the Wall', 2, 2, 1, None, 342562, 5510424, 0.99)]

The raw output is hard to read, but we can use pandas to load the query result directly into a DataFrame!

In [None]:
df = pd.read_sql("SELECT * FROM Track", conn) 
df.head(2)

| | TrackId | Name | AlbumId | MediaTypeId | GenreId | Composer | Milliseconds | Bytes | UnitPrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | For Those About To Rock (We Salute You) | 1 | 1 | 1 | Angus Young, Malcolm Young, Brian Johnson | 343719 | 11170334 | 0.99 |
| 1 | 2 | Balls to the Wall | 2 | 2 | 1 | None | 342562 | 5510424 | 0.99 |
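In practice it is often better to filter in SQL and load only the rows you need. A sketch using the `params` argument of `pd.read_sql`, assuming a hypothetical in-memory `Track` table with made-up rows (the names `x`, `y`, `z` are placeholders, not Chinook data):

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table with made-up rows
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE Track (TrackId INTEGER PRIMARY KEY, Name TEXT, UnitPrice REAL)"
)
conn_demo.executemany(
    "INSERT INTO Track (Name, UnitPrice) VALUES (?, ?)",
    [("x", 0.99), ("y", 1.99), ("z", 0.99)],
)

# The ? placeholder is filled from params, which avoids building SQL by hand
df_demo = pd.read_sql(
    "SELECT TrackId, Name FROM Track WHERE UnitPrice = ? ORDER BY TrackId",
    conn_demo,
    params=(0.99,),
)
print(df_demo)
```

Passing values through `params` rather than string formatting also protects against SQL injection when the value comes from user input.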
