# BDP - SQL Workshop

**Exercise notebook**

---

Create a Kaggle account and download the train.csv file from the Titanic competition:

https://www.kaggle.com/c/titanic/data

Move the CSV file into the same directory as this notebook.

---

**The Titanic🛳️ dataset is one of the best-known machine learning datasets.**

**Dataset columns:** <br>
passengerid - *Unique identifier for the passenger* <br>
pclass - *Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)* <br>
name - *Name* <br>
sex - *Sex* <br>
age - *Age* <br>
sibsp - *Number of Siblings / Spouses Aboard* <br>
parch - *Number of Parents / Children Aboard* <br>
ticket - *Ticket Number* <br>
fare - *Passenger Fare* <br>
cabin - *Cabin* <br>
embarked - *Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)* <br>
survived - *Survival (0 = No; 1 = Yes)* <br>

---

**SETUP**

The exercises use the following libraries: <br>
- pandas (tabular data) https://pandas.pydata.org/docs/
- sqlite3 (Python interface to the SQLite database engine) https://www.sqlite.org/index.html <br>

In [2]:
import pandas as pd
import sqlite3

In [3]:
df = pd.read_csv('train.csv')

df.columns = df.columns.str.lower()

# Split the data into features and labels (we will store them in separate tables)
X = df.drop('survived', axis=1)
y = df[['passengerid', 'survived']]

---

Connect to database

In [4]:
con = sqlite3.connect('bdp.db')

Write to database (with pandas)

In [5]:
X.to_sql('titanic_features', con=con, index=False, if_exists='replace')
y.to_sql('titanic_labels', con=con, index=False, if_exists='replace')
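
Because the features and labels are stored in separate tables, any query that needs `survived` alongside a feature has to join the two tables back on `passengerid`. A minimal sketch of that join follows. It uses an in-memory database and three made-up rows, so the `demo_con` connection and the row values are illustrative assumptions rather than part of the dataset.

```python
import sqlite3

# In-memory database with made-up rows; the notebook itself uses bdp.db and train.csv.
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE titanic_features (passengerid INTEGER, sex TEXT)")
demo_con.execute("CREATE TABLE titanic_labels (passengerid INTEGER, survived INTEGER)")
demo_con.executemany(
    "INSERT INTO titanic_features VALUES (?, ?)",
    [(1, "male"), (2, "female"), (3, "female")],
)
demo_con.executemany(
    "INSERT INTO titanic_labels VALUES (?, ?)",
    [(1, 0), (2, 1), (3, 1)],
)

# USING(passengerid) joins on the shared key and returns a single copy of it.
joined_rows = demo_con.execute(
    """
    SELECT passengerid, sex, survived
    FROM titanic_features
        JOIN titanic_labels
            USING(passengerid)
    ORDER BY passengerid
    """
).fetchall()
demo_con.close()

print(joined_rows)
```

Each feature row is paired with exactly one label row as long as `passengerid` is unique in both tables.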

Make sure you can query the database

In [6]:
query = """

SELECT * FROM titanic_features

"""

pd.read_sql_query(query, con=con)

Unnamed: 0,passengerid,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


---

## SQL Exercise

Write a query that returns the output shown below each question. <br>
Be careful not to run the output cell (the output will disappear).

---

### Question 1:

Create categories based on the ticket price (fare) and count the number of passengers in each category. <br>
Return only categories with more than 15 passengers.

In [6]:
# YOUR QUERY
query = """

SELECT 
    -- Branches are checked in order with inclusive upper bounds,
    -- so the ranges cannot overlap or leave gaps (e.g. '20-49' covers 19 < fare <= 49)
    CASE WHEN fare <= 0 THEN '0'
    WHEN fare <= 19 THEN '1-19'
    WHEN fare <= 49 THEN '20-49'
    WHEN fare <= 99 THEN '50-99'
    WHEN fare <= 199 THEN '100-199'
    WHEN fare > 199 THEN '200+' END as fare_category,
    COUNT(passengerid) as total_passengers
FROM titanic_features
GROUP BY 1
HAVING total_passengers > 15
ORDER BY 2 DESC

"""

pd.read_sql_query(query, con=con)

Unnamed: 0,fare_category,total_passengers
0,1-19,492
1,20-49,220
2,50-99,111
3,100-199,33
4,200+,20
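
A `CASE` expression returns the first branch whose condition is true, and NULL when no branch matches and there is no `ELSE`. For a continuous column such as fare, the bucket boundaries therefore decide where values like 19.5 end up. The sketch below compares integer-style `BETWEEN` ranges with cascading upper bounds on five made-up fares. The `fares` table, its values, and the `'50+'` label are illustrative assumptions.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE fares (fare REAL)")
demo_con.executemany(
    "INSERT INTO fares VALUES (?)",
    [(0.0,), (7.25,), (19.5,), (30.0,), (49.0,)],
)

bucketed = demo_con.execute(
    """
    SELECT fare,
        -- Integer-style ranges: 19.5 falls into the gap between 19 and 20
        CASE WHEN fare <= 0 THEN '0'
            WHEN fare BETWEEN 1 AND 19 THEN '1-19'
            WHEN fare BETWEEN 20 AND 49 THEN '20-49' END AS with_gaps,
        -- Cascading upper bounds: every non-NULL fare matches exactly one branch
        CASE WHEN fare <= 0 THEN '0'
            WHEN fare < 20 THEN '1-19'
            WHEN fare < 50 THEN '20-49'
            ELSE '50+' END AS without_gaps
    FROM fares
    ORDER BY fare
    """
).fetchall()
demo_con.close()

for row in bucketed:
    print(row)
```

The fare 19.5 gets NULL (Python `None`) from the first expression and `'1-19'` from the second, which is why a NULL group can silently appear in a `GROUP BY` over gapped ranges.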


In [7]:
## REQUIRED OUTPUT

---

### Question 2:

Find the elders (aged 65 or above) in each class aboard the Titanic and rank them by age within their class (the oldest person in each class is ranked 1). <br>

In [7]:
# YOUR QUERY
query = """

SELECT 
    name,
    pclass,
    age,
    RANK() OVER(PARTITION BY pclass ORDER BY age DESC) as rank
FROM titanic_features
WHERE age >= 65


"""

pd.read_sql_query(query, con=con)

Unnamed: 0,name,pclass,age,rank
0,"Barkworth, Mr. Algernon Henry Wilson",1,80.0,1
1,"Goldschmidt, Mr. George B",1,71.0,2
2,"Artagaveytia, Mr. Ramon",1,71.0,2
3,"Crosby, Capt. Edward Gifford",1,70.0,4
4,"Ostby, Mr. Engelhart Cornelius",1,65.0,5
5,"Millet, Mr. Francis Davis",1,65.0,5
6,"Mitchell, Mr. Henry Michael",2,70.0,1
7,"Wheadon, Mr. Edward H",2,66.0,2
8,"Svensson, Mr. Johan",3,74.0,1
9,"Connors, Mr. Patrick",3,70.5,2
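
In the output above, two first-class passengers aged 71 share rank 2 and the next passenger is ranked 4. That is how `RANK()` handles ties: it leaves a gap after them. The sketch below contrasts it with `DENSE_RANK()` and `ROW_NUMBER()` on made-up rows that mirror this tie. The `elders` table and the single-letter names are illustrative assumptions, and window functions require SQLite 3.25 or newer.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE elders (name TEXT, pclass INTEGER, age REAL)")
demo_con.executemany(
    "INSERT INTO elders VALUES (?, ?, ?)",
    [("A", 1, 80.0), ("B", 1, 71.0), ("C", 1, 71.0), ("D", 1, 70.0), ("E", 2, 70.0)],
)

ranked = demo_con.execute(
    """
    SELECT name,
        RANK() OVER (PARTITION BY pclass ORDER BY age DESC) AS rank_with_gaps,
        DENSE_RANK() OVER (PARTITION BY pclass ORDER BY age DESC) AS dense,
        -- name is added as a tie-breaker so the row numbers are deterministic
        ROW_NUMBER() OVER (PARTITION BY pclass ORDER BY age DESC, name) AS row_num
    FROM elders
    ORDER BY pclass, row_num
    """
).fetchall()
demo_con.close()

for row in ranked:
    print(row)
```

Passenger D is ranked 4 by `RANK()`, 3 by `DENSE_RANK()`, and 4 by `ROW_NUMBER()`, while the ranking restarts at 1 for passenger E in the second class.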


In [9]:
## REQUIRED OUTPUT

Unnamed: 0,name,pclass,age,rank
0,"Barkworth, Mr. Algernon Henry Wilson",1,80.0,1
1,"Goldschmidt, Mr. George B",1,71.0,2
2,"Artagaveytia, Mr. Ramon",1,71.0,2
3,"Crosby, Capt. Edward Gifford",1,70.0,4
4,"Ostby, Mr. Engelhart Cornelius",1,65.0,5
5,"Millet, Mr. Francis Davis",1,65.0,5
6,"Mitchell, Mr. Henry Michael",2,70.0,1
7,"Wheadon, Mr. Edward H",2,66.0,2
8,"Svensson, Mr. Johan",3,74.0,1
9,"Connors, Mr. Patrick",3,70.5,2


---

### Question 3:

Create a classification model (dummy model) that achieves a higher accuracy score than the one you saw in the lecture. <br>
HINT: Women and children were evacuated first and there were barely enough lifeboats to accommodate them all. <br>
<br>
Compute the following evaluation metrics: accuracy, precision, recall and f1_score.

https://en.wikipedia.org/wiki/Confusion_matrix

In [14]:
query = """

WITH raw_data AS (
    SELECT CASE WHEN (sex = 'male' AND age >= 12) OR pclass NOT IN (1, 2) THEN 0 ELSE 1 END AS y_pred
        , survived AS y_true
    FROM titanic_features
        JOIN titanic_labels
            USING(passengerid)
    )


SELECT AVG(y_true = y_pred) AS accuracy
FROM raw_data
"""

pd.read_sql_query(query, con=con)

Unnamed: 0,accuracy
0,0.782267


In [10]:
query = """

WITH raw_data AS (
    SELECT CASE WHEN sex = 'male' THEN 0 ELSE 1 END AS y_pred
        , survived AS y_true
    FROM titanic_features
        JOIN titanic_labels
            USING(passengerid)
    ),

confusion_matrix AS (
    SELECT
        CAST(COUNT(CASE WHEN y_pred = 0 AND y_true = 1 then 1 end) AS FLOAT) as false_neg,
        CAST(COUNT(CASE WHEN y_pred = 0 AND y_true = 0 then 1 end) AS FLOAT) as true_neg,
        CAST(COUNT(CASE WHEN y_pred = 1 AND y_true = 0 then 1 end) AS FLOAT) as false_pos,
        CAST(COUNT(CASE WHEN y_pred = 1 AND y_true = 1 then 1 end) AS FLOAT) as true_pos,
        COUNT(*) as count
    FROM raw_data
),

evaluation_metrics AS (
SELECT (true_pos + true_neg) / count AS accuracy,
    true_pos / (true_pos + false_pos) as precision,
    true_pos / (true_pos + false_neg) as recall
FROM confusion_matrix
)

SELECT accuracy, precision, recall,
    2 * (precision * recall) / (precision + recall) as f1_score
FROM evaluation_metrics

"""
pd.read_sql_query(query, con=con)

Unnamed: 0,accuracy,precision,recall,f1_score
0,0.786756,0.742038,0.681287,0.710366
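
The four metrics can be checked by hand from the confusion-matrix counts. The sketch below applies the same formulas as the query in plain Python. The four counts are not printed anywhere in this notebook; they are an assumption, chosen because they reproduce the metrics shown above for the sex-only model.

```python
# Assumed confusion-matrix counts for the "female => survived" model (891 passengers)
true_pos, false_pos, false_neg, true_neg = 233, 81, 109, 468

total = true_pos + false_pos + false_neg + true_neg

# Share of all predictions that are correct
accuracy = (true_pos + true_neg) / total
# Share of predicted survivors who actually survived
precision = true_pos / (true_pos + false_pos)
# Share of actual survivors the model found
recall = true_pos / (true_pos + false_neg)
# Harmonic mean of precision and recall
f1_score = 2 * precision * recall / (precision + recall)

print(round(accuracy, 6), round(precision, 6), round(recall, 6), round(f1_score, 6))
```

Accuracy alone hides the trade-off: this model is right about 79% of the time, yet it misses roughly a third of the actual survivors (recall of about 0.68).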


In [11]:
## REQUIRED OUTPUT

Unnamed: 0,accuracy,precision,recall,f1_score
0,0.792368,0.723647,0.74269,0.733045


---

### Question 4:

Write a function that aggregates the fare column, optionally grouped by other columns. <br>

The parameter agg accepts 'avg', 'sum', 'count', 'min' or 'max'. <br>
If the user provides a list of columns, the aggregation is computed per group of those columns (GROUP BY). <br>

In [77]:
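def fare_aggregates(columns=None, agg="avg") -> pd.DataFrame:
    # agg is interpolated into the SQL text, so only accept the documented functions
    if agg not in {"avg", "sum", "count", "min", "max"}:
        raise ValueError(f"Unsupported aggregation: {agg}")

    # A None default avoids sharing one mutable list between calls
    columns = list(columns or [])
    select_columns = "".join(f"{column}, " for column in columns)
    group_by_command = "GROUP BY " + ", ".join(columns) if columns else ""

    query = f"""

    -- YOUR QUERY
    SELECT {select_columns}{agg}(fare) as {agg}_fare
    FROM titanic_features
    {group_by_command}
    
    """

    return pd.read_sql_query(query, con=con)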

In [78]:
print(fare_aggregates())
print(fare_aggregates(agg='sum'))
print(fare_aggregates(columns=['sex', 'pclass']))

    avg_fare
0  32.204208
     sum_fare
0  28693.9493
      sex  pclass    avg_fare
0  female       1  106.125798
1  female       2   21.970121
2  female       3   16.118810
3    male       1   67.226127
4    male       2   19.741782
5    male       3   12.661633


In [13]:
fare_aggregates()

Unnamed: 0,avg_fare
0,32.204208


In [14]:
fare_aggregates(agg='sum')

Unnamed: 0,sum_fare
0,28693.9493


In [15]:
fare_aggregates(columns=['sex', 'pclass'])

Unnamed: 0,sex,pclass,avg_fare
0,female,1,106.125798
1,female,2,21.970121
2,female,3,16.11881
3,male,1,67.226127
4,male,2,19.741782
5,male,3,12.661633
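
`fare_aggregates` builds its query by string interpolation because SQL placeholders (`?`) can bind values but not column names. One hedged way to keep that safe is to check each requested column against the table's schema before building the query. The helper `validated_columns` below is hypothetical and not part of the exercise; it relies on SQLite's `PRAGMA table_info`, and the three-column demo table is an illustrative assumption.

```python
import sqlite3


def validated_columns(con, table, columns):
    """Return the columns as a list if each one exists in the table, else raise ValueError."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    # The table name is interpolated here, so it must come from trusted code.
    known = {row[1] for row in con.execute(f"PRAGMA table_info({table})")}
    unknown = [column for column in columns if column not in known]
    if unknown:
        raise ValueError(f"Unknown columns: {unknown}")
    return list(columns)


demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE titanic_features (sex TEXT, pclass INTEGER, fare REAL)")

checked = validated_columns(demo_con, "titanic_features", ["sex", "pclass"])

# A string that is not a real column name is rejected before it reaches any query.
try:
    validated_columns(demo_con, "titanic_features", ["sex; DROP TABLE titanic_features"])
    rejected = False
except ValueError:
    rejected = True

demo_con.close()
print(checked, rejected)
```

Calling such a check at the top of `fare_aggregates` would turn a malformed column name into a clear Python error instead of a SQL syntax error.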


---

Close the connection to the database

In [16]:
con.close()
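
In a notebook, an explicit `con.close()` in the last cell is fine. In a script, the close is skipped if an earlier statement raises. A minimal sketch of one alternative uses `contextlib.closing`, which closes the connection when the block exits. Using the connection object itself in a `with` statement only commits or rolls back the transaction and leaves the connection open. The in-memory database and the table `t` are illustrative assumptions.

```python
import sqlite3
from contextlib import closing

# closing() calls demo_con.close() on exit, even if the block raises
with closing(sqlite3.connect(":memory:")) as demo_con:
    demo_con.execute("CREATE TABLE t (x INTEGER)")
    demo_con.execute("INSERT INTO t VALUES (1)")
    row_count = demo_con.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# The connection is closed here; using it again raises sqlite3.ProgrammingError
try:
    demo_con.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(row_count, still_open)
```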

---