# SQL in Python using Pandas
***Author:*** Shilpa

We'll use SQLite to simulate working with a real database: we'll write SQL queries and check the corresponding outputs.

SQLite implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day.

[Reference](https://towardsdatascience.com/have-a-sql-interview-coming-up-ace-it-using-google-colab-6d3c0ffb29dc) 

[Additional Resource: PostgreSQL Interview cheat sheet](https://www.postgresqltutorial.com/wp-content/uploads/2018/03/PostgreSQL-Cheat-Sheet.pdf)

In [1]:
import sqlite3
import pandas as pd

In [2]:
# connect to test_db.db; SQLite creates the file if it does not exist yet
con = sqlite3.connect('test_db.db')

In [3]:
# reusing one of our previous datasets as an example use case: read the CSV into a DataFrame
titanic_df = pd.read_csv('titanic_clean.csv')
print(titanic_df.shape)
titanic_df.head()

(712, 10)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S


In [4]:
# store the DataFrame as table 'titanic_cln' in test_db.db (replacing the table if it already exists)
titanic_df.to_sql(name='titanic_cln',con=con, index=False, if_exists='replace')
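
To confirm that a DataFrame really landed in the database, one option is to list the tables SQLite knows about and count the stored rows. The sketch below is only an illustration: `demo_df`, `demo_con` and the table name `titanic_demo` are made-up stand-ins (not the tutorial's `titanic_clean.csv`), and an in-memory database is used so nothing is written to disk.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for titanic_clean.csv so the sketch runs on its own.
demo_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# ":memory:" keeps the database in RAM instead of creating a file on disk.
demo_con = sqlite3.connect(":memory:")
demo_df.to_sql(name="titanic_demo", con=demo_con, index=False, if_exists="replace")

# sqlite_master is SQLite's built-in catalogue of tables and indexes.
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table'", demo_con)
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM titanic_demo", demo_con)

print(tables["name"].tolist())
print(int(row_count.loc[0, "n"]))
```

The row count should match `demo_df.shape[0]`; the same check can be run against `con` and `titanic_cln`.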

# Executing SQL Queries from Pandas

---
## SQL I
---

## `SELECT` and `FROM`

With SQL, we read data from a database. The two primary clauses found in almost every query are `SELECT` and `FROM`.

- Which columns do you want (`SELECT`)?
- And `FROM` which table?

### - SELECT _(everything)_

In [5]:
sql_query = '''
SELECT * 
FROM titanic_cln
'''
pd.read_sql(sql_query,con)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.2500,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.9250,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1000,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.0500,S
...,...,...,...,...,...,...,...,...,...,...
707,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,29.1250,Q
708,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0000,S
709,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0000,S
710,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0000,C


### - SELECT _(specifically)_

In [6]:
# subsetting specific columns
sql_query = '''
SELECT Sex, Embarked 
FROM titanic_cln
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Sex,Embarked
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S
...,...,...
707,female,Q
708,male,S
709,female,S
710,male,C


### - Namespacing

In [7]:
# gives the same output as above, since we're querying the same table
# prefixing columns with the table name becomes crucial when working with multiple tables, e.g. in joins
sql_query = '''
SELECT titanic_cln.Sex, titanic_cln.Embarked 
FROM titanic_cln
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Sex,Embarked
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S
...,...,...
707,female,Q
708,male,S
709,female,S
710,male,C


In [8]:
# using wildcard with namespacing to select all columns
sql_query = '''
SELECT titanic_cln.* 
FROM titanic_cln
'''
pd.read_sql(sql_query,con)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.2500,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.9250,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1000,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.0500,S
...,...,...,...,...,...,...,...,...,...,...
707,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,29.1250,Q
708,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0000,S
709,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0000,S
710,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0000,C
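
Namespacing starts to matter once two tables share a column name. The sketch below assumes two small hypothetical tables, `passengers` and `tickets`, that both carry a `PassengerId` column; it uses the standard-library cursor rather than `pd.read_sql` to stay dependency-free. The unqualified query is rejected as ambiguous, while the namespaced one runs.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
# Two hypothetical tables that both contain a PassengerId column.
demo_con.executescript("""
CREATE TABLE passengers (PassengerId INTEGER, Name TEXT);
CREATE TABLE tickets (PassengerId INTEGER, Fare REAL);
INSERT INTO passengers VALUES (1, 'Braund'), (2, 'Cumings');
INSERT INTO tickets VALUES (1, 7.25), (2, 71.2833);
""")

# Without a table prefix, SQLite cannot tell which PassengerId is meant.
ambiguous_error = None
try:
    demo_con.execute("""
        SELECT PassengerId, Fare
        FROM passengers JOIN tickets
          ON passengers.PassengerId = tickets.PassengerId
    """)
except sqlite3.OperationalError as exc:
    ambiguous_error = str(exc)

# With namespacing, every column is tied to exactly one table.
joined_rows = demo_con.execute("""
    SELECT passengers.PassengerId, passengers.Name, tickets.Fare
    FROM passengers JOIN tickets
      ON passengers.PassengerId = tickets.PassengerId
    ORDER BY passengers.PassengerId
""").fetchall()

print(ambiguous_error)
print(joined_rows)
```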


### - Aliasing

In [9]:
# like namespacing, aliasing becomes more useful when working with multiple tables, e.g. in joins
# below is an example of aliasing a table
sql_query = '''
SELECT data.Sex, data.Embarked 
FROM titanic_cln AS data
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Sex,Embarked
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S
...,...,...
707,female,Q
708,male,S
709,female,S
710,male,C


In [10]:
# below is an example of aliasing a column: note how Embarked is returned under the new name 'location'
sql_query = '''
SELECT data.Sex, data.Embarked AS location
FROM titanic_cln AS data
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Sex,location
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S
...,...,...
707,female,Q
708,male,S
709,female,S
710,male,C
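
Column aliases are also handy for naming computed expressions, and in SQLite the alias can be reused in `ORDER BY`. A minimal sketch, assuming a hypothetical three-row table `titanic_demo` and an illustrative alias `rounded_fare`:

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE titanic_demo (Sex TEXT, Embarked TEXT, Fare REAL)")
demo_con.executemany(
    "INSERT INTO titanic_demo VALUES (?, ?, ?)",
    [("male", "S", 7.25), ("female", "C", 71.2833), ("female", "S", 7.925)],
)

# 'd' aliases the table; 'location' and 'rounded_fare' alias the output columns.
cursor = demo_con.execute("""
    SELECT d.Sex, d.Embarked AS location, ROUND(d.Fare) AS rounded_fare
    FROM titanic_demo AS d
    ORDER BY rounded_fare DESC
""")

# cursor.description holds one entry per output column; the name comes first.
column_names = [col[0] for col in cursor.description]
aliased_rows = cursor.fetchall()

print(column_names)
print(aliased_rows)
```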


### - SELECT _DISTINCT_

In [11]:
# return only the unique (Sex, location) combinations, building on the previous query
sql_query = '''
SELECT DISTINCT data.Sex, data.Embarked AS location
FROM titanic_cln AS data
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Sex,location
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S
...,...,...
707,female,Q
708,male,S
709,female,S
710,male,C
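
`DISTINCT` applies to the whole selected row, so the result holds at most one row per (Sex, Embarked) combination. The sketch below illustrates this on a hypothetical five-row table `titanic_demo`; the data is made up for the example.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE titanic_demo (Sex TEXT, Embarked TEXT)")
demo_con.executemany(
    "INSERT INTO titanic_demo VALUES (?, ?)",
    [("male", "S"), ("female", "C"), ("female", "S"), ("female", "S"), ("male", "S")],
)

# Plain SELECT returns every stored row, duplicates included.
all_rows = demo_con.execute("SELECT Sex, Embarked FROM titanic_demo").fetchall()

# DISTINCT collapses repeated (Sex, Embarked) pairs into a single row each.
distinct_rows = demo_con.execute("""
    SELECT DISTINCT Sex, Embarked
    FROM titanic_demo
    ORDER BY Sex, Embarked
""").fetchall()

print(len(all_rows), len(distinct_rows))
print(distinct_rows)
```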


### - ORDER BY

In [12]:
# this is how we sort in SQL
# similar to pandas' sort_values; the default order is ASC, so it does not need to be specified explicitly
sql_query = '''
SELECT DISTINCT Sex, Embarked, Pclass
FROM titanic_cln
ORDER BY Pclass
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Sex,Embarked,Pclass
0,female,C,1
1,female,S,1
2,male,S,1
3,male,C,1
4,male,Q,1
5,female,Q,1
6,female,C,2
7,female,S,2
8,male,S,2
9,male,C,2


In [13]:
sql_query = '''
SELECT DISTINCT Sex, Embarked, Pclass
FROM titanic_cln
ORDER BY Pclass DESC
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Sex,Embarked,Pclass
0,male,S,3
1,female,S,3
2,male,Q,3
3,female,Q,3
4,female,C,3
5,male,C,3
6,female,C,2
7,female,S,2
8,male,S,2
9,male,C,2


In [14]:
# ORDER BY with multiple columns is applied left to right
# here we sort by 'Sex' first (ascending by default), then by 'Pclass' descending within each Sex
sql_query = '''
SELECT DISTINCT Sex, Embarked, Pclass
FROM titanic_cln
ORDER BY Sex, Pclass DESC
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Sex,Embarked,Pclass
0,female,S,3
1,female,Q,3
2,female,C,3
3,female,C,2
4,female,S,2
5,female,Q,2
6,female,C,1
7,female,S,1
8,female,Q,1
9,male,S,3


In [15]:
# reversing the sorting priority, as below, gives a different output
sql_query = '''
SELECT DISTINCT Sex, Embarked, Pclass
FROM titanic_cln
ORDER BY Pclass DESC, Sex 
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Sex,Embarked,Pclass
0,female,S,3
1,female,Q,3
2,female,C,3
3,male,S,3
4,male,Q,3
5,male,C,3
6,female,C,2
7,female,S,2
8,female,Q,2
9,male,S,2
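
For readers coming from pandas, a multi-column `ORDER BY` with mixed directions maps onto `sort_values` with a list passed to `ascending`. The sketch below compares the two on a small hypothetical DataFrame (`demo_df`, stored as `titanic_demo`); it is meant as a rough equivalence, not a statement about how either library sorts internally.

```python
import sqlite3
import pandas as pd

# Hypothetical data, just enough to show the two sort keys interacting.
demo_df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Pclass": [1, 3, 3, 1],
})
demo_con = sqlite3.connect(":memory:")
demo_df.to_sql(name="titanic_demo", con=demo_con, index=False)

# SQL: Sex ascending (the default), then Pclass descending within each Sex.
sql_sorted = pd.read_sql(
    "SELECT Sex, Pclass FROM titanic_demo ORDER BY Sex, Pclass DESC", demo_con
)

# pandas: one ascending flag per sort column, in the same left-to-right order.
pandas_sorted = demo_df.sort_values(
    by=["Sex", "Pclass"], ascending=[True, False]
).reset_index(drop=True)

print(sql_sorted)
print(pandas_sorted)
```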


### - LIMIT

In [16]:
# LIMIT is like .head() in pandas: it caps the number of output rows
# crucial for tables with a LOT of data
# combined with ORDER BY Age DESC, it returns the 10 oldest distinct rows like so:
sql_query = '''
SELECT DISTINCT Sex, Embarked, Pclass, Age
FROM titanic_cln
ORDER BY Age DESC
LIMIT 10
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Sex,Embarked,Pclass,Age
0,male,S,1,80.0
1,male,S,3,74.0
2,male,C,1,71.0
3,male,Q,3,70.5
4,male,S,2,70.0
5,male,S,1,70.0
6,male,S,2,66.0
7,male,C,1,65.0
8,male,Q,3,65.0
9,male,S,1,65.0
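
When several rows tie on the sort key at the cutoff, SQL does not promise which of the tied rows `LIMIT` keeps. Adding a second sort column as a tie-breaker makes the result repeatable. A minimal sketch on a hypothetical four-row table, using `PassengerId` as the tie-breaker:

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE titanic_demo (PassengerId INTEGER, Age REAL)")
demo_con.executemany(
    "INSERT INTO titanic_demo VALUES (?, ?)",
    [(1, 65.0), (2, 80.0), (3, 65.0), (4, 70.0)],
)

# Passengers 1 and 3 tie at Age 65; PassengerId decides which one makes the cut.
oldest_three = demo_con.execute("""
    SELECT PassengerId, Age
    FROM titanic_demo
    ORDER BY Age DESC, PassengerId
    LIMIT 3
""").fetchall()

print(oldest_three)
```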


---
## SQL II
---

### - WHERE

#### LOGICAL OPERATIONS

In [20]:
# this is how condition-based filtering is done in SQL
# for example, let's write a filter to return passengers with Age > 70
# remember the mnemonic shared earlier? Clause order matters: putting 'ORDER BY' before 'WHERE' throws an error
sql_query = '''
SELECT DISTINCT Sex, Embarked, Pclass, Age
FROM titanic_cln
WHERE Age > 70
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Sex,Embarked,Pclass,Age
0,male,C,1,71.0
1,male,Q,3,70.5
2,male,S,1,80.0
3,male,S,3,74.0


In [22]:
# comparison operators are similar to Python's, except equality: SQL uses = where Python uses ==
sql_query = '''
SELECT DISTINCT Sex, Embarked, Pclass, Age
FROM titanic_cln
WHERE Age = 70
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Sex,Embarked,Pclass,Age
0,male,S,2,70.0
1,male,S,1,70.0


In [23]:
# advanced filtering with Logical 'AND' operator
sql_query = '''
SELECT DISTINCT Sex, Embarked, Pclass, Age
FROM titanic_cln
WHERE Age = 70 AND Pclass = 2
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Sex,Embarked,Pclass,Age
0,male,S,2,70.0


In [33]:
# advanced filtering with the logical 'OR' operator, combined with 'AND' (the parentheses group the OR conditions)
sql_query = '''
SELECT DISTINCT Sex, Embarked, Pclass, Age
FROM titanic_cln
WHERE Age > 70 AND (Embarked = 'C' OR Embarked = 'Q')
ORDER BY Embarked DESC
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Sex,Embarked,Pclass,Age
0,male,Q,3,70.5
1,male,C,1,71.0


In [35]:
# same result as above, but shorter: the 'IN' operator replaces the chain of 'OR's by listing the accepted values
sql_query = '''
SELECT DISTINCT Sex, Embarked, Pclass, Age
FROM titanic_cln
WHERE Age > 70 AND Embarked IN ('C', 'Q')
ORDER BY Embarked DESC
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Sex,Embarked,Pclass,Age
0,male,Q,3,70.5
1,male,C,1,71.0
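
The parentheses around the `OR` conditions are doing real work: `AND` binds more tightly than `OR`, so dropping them changes which rows match. The sketch below shows the difference on a hypothetical five-row table, and that `IN` matches the parenthesised version.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE titanic_demo (Age REAL, Embarked TEXT)")
demo_con.executemany(
    "INSERT INTO titanic_demo VALUES (?, ?)",
    [(71.0, "C"), (70.5, "Q"), (80.0, "S"), (30.0, "Q"), (25.0, "C")],
)

def run(where_clause):
    """Run the demo query with the given WHERE clause, sorted by Age."""
    query = f"SELECT Age, Embarked FROM titanic_demo WHERE {where_clause} ORDER BY Age"
    return demo_con.execute(query).fetchall()

# Grouped: Age > 70 must hold AND the port must be C or Q.
with_parens = run("Age > 70 AND (Embarked = 'C' OR Embarked = 'Q')")

# Ungrouped: read as (Age > 70 AND Embarked = 'C') OR Embarked = 'Q',
# so every passenger from Q matches regardless of age.
without_parens = run("Age > 70 AND Embarked = 'C' OR Embarked = 'Q'")

# IN is equivalent to the grouped OR version.
with_in = run("Age > 70 AND Embarked IN ('C', 'Q')")

print(with_parens)
print(without_parens)
print(with_in)
```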


#### WILDCARD OPERATIONS

In [39]:
# starts with
sql_query = '''
SELECT DISTINCT Name, Sex, Age
FROM titanic_cln
WHERE Name LIKE 'B%'
LIMIT 5
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Name,Sex,Age
0,"Braund, Mr. Owen Harris",male,22.0
1,"Bonnell, Miss. Elizabeth",female,58.0
2,"Beesley, Mr. Lawrence",male,34.0
3,"Bing, Mr. Lee",male,32.0
4,"Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu...",female,33.0


In [43]:
# ends with (note that LIKE is case-insensitive in SQLite by default, so '%Y' also matches a trailing 'y')
sql_query = '''
SELECT DISTINCT Name, Sex, Age
FROM titanic_cln
WHERE Name LIKE '%Y' 
LIMIT 5
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Name,Sex,Age
0,"Allen, Mr. William Henry",male,35.0
1,"Saundercock, Mr. William Henry",male,20.0
2,"Rugg, Miss. Emily",female,21.0
3,"Goodwin, Miss. Lillian Amy",female,16.0
4,"Newsom, Miss. Helen Monypeny",female,19.0
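
If a case-sensitive match is needed, SQLite's `GLOB` operator is one option: it is case-sensitive and uses `*` and `?` where `LIKE` uses `%` and `_`. The sketch below contrasts the two on a hypothetical table; the all-uppercase name is invented for the demonstration.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE titanic_demo (Name TEXT)")
demo_con.executemany(
    "INSERT INTO titanic_demo VALUES (?)",
    [
        ("Allen, Mr. William Henry",),
        ("Rugg, Miss. Emily",),
        ("Bing, Mr. Lee",),
        ("DEMO, MR. UPPERCASE HENRY",),  # invented row ending in a capital Y
    ],
)

# LIKE ignores ASCII case by default: both 'y' and 'Y' endings match.
like_matches = [row[0] for row in demo_con.execute(
    "SELECT Name FROM titanic_demo WHERE Name LIKE '%Y' ORDER BY Name"
)]

# GLOB is case-sensitive: only the name ending in a capital 'Y' matches.
glob_matches = [row[0] for row in demo_con.execute(
    "SELECT Name FROM titanic_demo WHERE Name GLOB '*Y' ORDER BY Name"
)]

print(like_matches)
print(glob_matches)
```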


In [45]:
# contains
sql_query = '''
SELECT DISTINCT Name, Sex, Age
FROM titanic_cln
WHERE Name LIKE '%q%' 
LIMIT 5
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Name,Sex,Age
0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
1,"Baxter, Mr. Quigg Edmond",male,24.0
2,"Futrelle, Mr. Jacques Heath",male,37.0
3,"Tornquist, Mr. William Henry",male,25.0
4,"Levy, Mr. Rene Jacques",male,36.0


In [47]:
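# NOT LIKE excludes matches (here: Sex does not start with 'm'), chained with other filter criteria
# note: SQLite accepts IS for equality; = is the more portable choice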
sql_query = '''
SELECT DISTINCT Name, Sex, Age
FROM titanic_cln
WHERE Name LIKE '%q%' AND Sex NOT LIKE 'm%' AND Age IS 35
LIMIT 5
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Name,Sex,Age
0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0


In [48]:
# '_' matches exactly one character: here, names whose second character is 'o'
sql_query = '''
SELECT DISTINCT Name, Sex, Age
FROM titanic_cln
WHERE Name LIKE '_o%' 
LIMIT 5
'''
pd.read_sql(sql_query,con)

Unnamed: 0,Name,Sex,Age
0,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0
1,"Bonnell, Miss. Elizabeth",female,58.0
2,"Fortune, Mr. Charles Alexander",male,19.0
3,"Holverson, Mr. Alexander Oskar",male,42.0
4,"Nosworthy, Mr. Richard Cater",male,21.0


---
Now that we've introduced how to use SQL from Python, we'll review the rest of the syntax from the markdown file and explore challenges on the `Google BigQuery` dataset.