# Exercise 7 - SQL Quiz Prep

- 🏆 20 points available

▶️ First, run the code cell below to import `unittest` and `base64`, two modules used by the **🧭 Check Your Work** sections and the autograder.

In [1]:
# DO NOT MODIFY THE CODE IN THIS CELL
import unittest
import base64
tc = unittest.TestCase()

---

### 🎯 Import packages

#### 👇 Tasks

- ✔️ Import the following Python packages.
    1. `pandas`: Use alias `pd`.
    2. `numpy`: Use alias `np`.
    3. `sqlite3`: No alias.

In [2]:
### BEGIN SOLUTION
import pandas as pd
import numpy as np
import sqlite3
### END SOLUTION

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [3]:
# DO NOT CHANGE THE CODE IN THIS CELL
tc.assertTrue('pd' in globals(), 'Check whether you have correctly imported Pandas with an alias.')
tc.assertTrue('np' in globals(), 'Check whether you have correctly imported NumPy with an alias.')
tc.assertTrue('sqlite3' in globals(), 'Check whether you have correctly imported the sqlite3 package.')

---
### 📌 Populate a database table from a CSV file

▶️ Run the code below to populate the `loans` table.

In [4]:
conn = sqlite3.connect('lending-club-loan-results.db')
c = conn.cursor()

tables = list(pd.read_sql_query("SELECT * FROM sqlite_master WHERE type='table';", con=conn)['tbl_name'])

if 'loans' in tables:
    c.execute('DELETE FROM loans')
    conn.commit()
    
pd.read_csv('https://github.com/bdi475/datasets/raw/main/lending-club-loan-results.csv.gz',
                         compression='gzip') \
    .to_sql(name='loans', index=False, con=conn, if_exists='append')

conn.close()
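
The load pattern above can be tried on a small scale without touching the exercise database. The sketch below is illustrative only: it assumes an in-memory SQLite database and a made-up three-row DataFrame (`df_toy` and `toy_loans` are hypothetical names, not part of the exercise) in place of the real CSV file. It shows that `if_exists='append'` adds rows to an existing table, which is presumably why the cell above clears `loans` before loading.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data standing in for the real CSV file
df_toy = pd.DataFrame({
    'loan_amnt': [1000, 2000, 3000],
    'grade': ['A', 'B', 'A'],
})

conn_toy = sqlite3.connect(':memory:')  # temporary database, nothing is written to disk

# The first write creates the table; a second 'append' write adds the same rows again
df_toy.to_sql(name='toy_loans', index=False, con=conn_toy, if_exists='append')
df_toy.to_sql(name='toy_loans', index=False, con=conn_toy, if_exists='append')
rows_after_two_appends = pd.read_sql_query(
    'SELECT COUNT(*) AS n FROM toy_loans', con=conn_toy)['n'][0]

# Clearing the table before loading keeps the row count stable across reruns
conn_toy.execute('DELETE FROM toy_loans')
conn_toy.commit()
df_toy.to_sql(name='toy_loans', index=False, con=conn_toy, if_exists='append')
rows_after_reset = pd.read_sql_query(
    'SELECT COUNT(*) AS n FROM toy_loans', con=conn_toy)['n'][0]

conn_toy.close()
print(rows_after_two_appends, rows_after_reset)
```

Without the `DELETE` step, every rerun of the cell would duplicate the rows already in the table.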

#### 🧭 Check your work

In [5]:
# DO NOT CHANGE THE CODE IN THIS CELL
conn_checker = sqlite3.connect('lending-club-loan-results.db')
table_to_check = 'loans'

# Check if table exists
user_tables = list(pd.read_sql_query('SELECT * FROM sqlite_master WHERE type="table";', con=conn_checker)['tbl_name'])
tc.assertTrue(table_to_check in user_tables, f'{table_to_check} does not exist in your lending-club-loan-results.db file!')

conn_checker.close()

The table below describes the columns in the `loans` table.

| Field | Description |
|---|---|
| loan_amnt | listed amount of the loan applied for by the borrower |
| int_rate | interest rate on the loan (between 0 and 1) |
| term_in_months | number of payments on the loan - either 36 or 60 months |
| grade | assigned loan grade (A, B, C, D, E, F, G) |
| job_title | job title supplied by the borrower when applying for the loan |
| home_ownership | home ownership status provided by the borrower during registration (RENT, OWN, MORTGAGE) |
| annual_inc | self-reported annual income provided by the borrower during registration |
| loan_status | result of the loan (Fully Paid or Charged Off) |
| purpose | a category provided by the borrower for the loan request |
| did_default | 0 == paid in full, 1 == default (charged off) |

▶️ Run the code below to display the first 10 rows of the `loans` table.

In [6]:
conn = sqlite3.connect('lending-club-loan-results.db')
display(pd.read_sql_query('SELECT * FROM loans LIMIT 10', con=conn))
conn.close()

Unnamed: 0,loan_amnt,int_rate,term_in_months,grade,job_title,home_ownership,annual_inc,loan_status,purpose,did_default
0,30000,0.2235,36,D,Supervisor,MORTGAGE,100000.0,Fully Paid,debt_consolidation,0
1,20000,0.0756,36,A,Teacher,MORTGAGE,100000.0,Fully Paid,credit_card,0
2,2500,0.1356,36,C,Manager,RENT,42000.0,Fully Paid,other,0
3,12950,0.0756,36,A,Teacher,MORTGAGE,55000.0,Fully Paid,debt_consolidation,0
4,5000,0.1033,36,B,Police Officer,MORTGAGE,118964.0,Fully Paid,credit_card,0
5,3000,0.1992,36,D,Teacher,RENT,62000.0,Fully Paid,debt_consolidation,0
6,6000,0.1614,36,C,Truck Driver,RENT,50000.0,Fully Paid,credit_card,0
7,4500,0.1072,36,B,Teacher,RENT,55000.0,Fully Paid,debt_consolidation,0
8,20000,0.0819,36,A,Project Manager,RENT,105000.0,Fully Paid,credit_card,0
9,6000,0.1502,36,C,Teacher,MORTGAGE,101000.0,Fully Paid,vacation,0


---

### 🎯 Challenge 1: Find high profile defaults

#### 👇 Tasks

- ✔️ Write a query that:
    - selects all columns,
    - from the `loans` table,
    - where the `job_title` is `"Attorney"`, `annual_inc` is greater than `300000`, **and** `did_default` is `1`
- ✔️ Store your query in a new variable named `query_high_profile_defaults`.
- ✔️ The order of rows does not matter.
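
Combining several conditions with `AND` keeps only the rows that satisfy every one of them. The sketch below illustrates this on a made-up table, assuming an in-memory database; `toy_loans`, its four rows, and the thresholds are hypothetical and chosen so that each condition rejects exactly one row. It uses the standard-library `sqlite3` cursor directly rather than pandas to stay dependency-free.

```python
import sqlite3

conn_demo = sqlite3.connect(':memory:')
conn_demo.execute(
    'CREATE TABLE toy_loans (job_title TEXT, annual_inc REAL, did_default INTEGER)')
conn_demo.executemany(
    'INSERT INTO toy_loans VALUES (?, ?, ?)',
    [
        ('Teacher', 60000, 1),    # fails the income condition
        ('Teacher', 120000, 0),   # fails the default condition
        ('Nurse', 150000, 1),     # fails the job title condition
        ('Teacher', 110000, 1),   # passes all three conditions
    ],
)

# Every condition joined by AND must hold for a row to be returned
query_demo = '''
SELECT *
FROM toy_loans
WHERE job_title = 'Teacher' AND annual_inc > 100000 AND did_default = 1;
'''
matching_rows = conn_demo.execute(query_demo).fetchall()
conn_demo.close()
print(matching_rows)
```

Only the last inserted row survives the filter, since each of the other rows violates one condition.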

#### 🔑 Expected output

|    |   loan_amnt |   int_rate |   term_in_months | grade   | job_title   | home_ownership   |   annual_inc | loan_status   | purpose            |   did_default |
|---:|------------:|-----------:|-----------------:|:--------|:------------|:-----------------|-------------:|:--------------|:-------------------|--------------:|
|  0 |       15000 |     0.0916 |               36 | B       | Attorney    | MORTGAGE         |       309000 | Charged Off   | debt_consolidation |             1 |
|  1 |       35000 |     0.1344 |               36 | C       | Attorney    | RENT             |       340000 | Charged Off   | credit_card        |             1 |
|  2 |       35000 |     0.1541 |               60 | D       | Attorney    | MORTGAGE         |       350000 | Charged Off   | credit_card        |             1 |
|  3 |       35000 |     0.1709 |               36 | D       | Attorney    | RENT             |       450000 | Charged Off   | debt_consolidation |             1 |

In [7]:
### BEGIN SOLUTION
query_high_profile_defaults = '''
SELECT *
FROM loans
WHERE (job_title = 'Attorney') AND (annual_inc > 300000) AND (did_default = 1);
'''
### END SOLUTION

conn = sqlite3.connect('lending-club-loan-results.db')
df_result = pd.read_sql_query(query_high_profile_defaults, con=conn)
display(df_result)
conn.close()

Unnamed: 0,loan_amnt,int_rate,term_in_months,grade,job_title,home_ownership,annual_inc,loan_status,purpose,did_default
0,15000,0.0916,36,B,Attorney,MORTGAGE,309000.0,Charged Off,debt_consolidation,1
1,35000,0.1344,36,C,Attorney,RENT,340000.0,Charged Off,credit_card,1
2,35000,0.1541,60,D,Attorney,MORTGAGE,350000.0,Charged Off,credit_card,1
3,35000,0.1709,36,D,Attorney,RENT,450000.0,Charged Off,debt_consolidation,1


#### 🧭 Check your work

In [8]:
conn = sqlite3.connect('lending-club-loan-results.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCAqCkZST00gbG9hbnMKV0hFUkUgKGpvYl90aXRsZSA9ICJB\
dHRvcm5leSIpIEFORCAoYW5udWFsX2luYyA+IDMwMDAwMCkgQU5EIChkaWRfZGVmYXVsdCA9IDEpOwo=').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
pd.testing.assert_frame_equal(df_result.sort_values(df_result.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))
conn.close()
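
The check above sorts both DataFrames by every column and resets the index before comparing them, which is what makes the row order irrelevant. The sketch below illustrates the idea on two made-up frames; `df_a`, `df_b`, and the helper `normalize` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Two hypothetical frames holding the same rows in a different order
df_a = pd.DataFrame({'grade': ['B', 'A', 'C'], 'loan_amnt': [2000, 1000, 3000]})
df_b = pd.DataFrame({'grade': ['A', 'C', 'B'], 'loan_amnt': [1000, 3000, 2000]})

def normalize(df):
    # Sort by every column, then discard the old index so positions line up
    return df.sort_values(df.columns.tolist()).reset_index(drop=True)

# A direct comparison fails because the rows are in a different order
try:
    pd.testing.assert_frame_equal(df_a, df_b)
    direct_comparison_passed = True
except AssertionError:
    direct_comparison_passed = False

# After normalizing both sides, the same comparison raises no error
pd.testing.assert_frame_equal(normalize(df_a), normalize(df_b))
print(direct_comparison_passed)
```

The `reset_index(drop=True)` step matters because `assert_frame_equal` also compares index labels, and sorting alone leaves the original labels attached to each row.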

---

### 🎯 Challenge 2: Stats by loan grade

#### 👇 Tasks

- ✔️ Write a query that summarizes the average interest rate and the default rate for each loan grade.
- ✔️ Use the following column names:
    - `grade`: Grade of each loan (e.g., "A", "B", ..., "G")
    - `avg_int_rate`: Average interest rate for each loan grade
    - `default_rate`: Default rate for each loan grade
        - This is the average of `did_default` column.
- ✔️ Sort the result by `grade` in ascending order.
- ✔️ Store your query in a new variable named `query_stats_by_grade`.
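
Averaging a column that holds only 0s and 1s gives the share of rows equal to 1, which is why the mean of `did_default` can be read as a default rate. The sketch below checks this on a made-up table, assuming an in-memory database; `toy_loans`, its six rows, and the column alias `defaults_over_total` are hypothetical and exist only for this illustration.

```python
import sqlite3

conn_demo = sqlite3.connect(':memory:')
conn_demo.execute('CREATE TABLE toy_loans (grade TEXT, did_default INTEGER)')
conn_demo.executemany(
    'INSERT INTO toy_loans VALUES (?, ?)',
    [('A', 0), ('A', 0), ('A', 0), ('A', 1),   # 1 default out of 4 loans
     ('B', 1), ('B', 0)],                      # 1 default out of 2 loans
)

# AVG over a 0/1 column equals (number of 1s) / (number of rows) in each group
query_demo = '''
SELECT
    grade,
    AVG(did_default) AS default_rate,
    SUM(did_default) * 1.0 / COUNT(*) AS defaults_over_total
FROM toy_loans
GROUP BY grade
ORDER BY grade;
'''
rate_rows = conn_demo.execute(query_demo).fetchall()
conn_demo.close()
print(rate_rows)
```

The `* 1.0` in the second expression forces floating-point division; without it, SQLite would perform integer division on the two integer operands.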

#### 🔑 Expected output

|    | grade   |   avg_int_rate |   default_rate |
|---:|:--------|---------------:|---------------:|
|  0 | A       |      0.0704255 |      0.0628832 |
|  1 | B       |      0.104537  |      0.139501  |
|  2 | C       |      0.138426  |      0.233738  |
|  3 | D       |      0.176442  |      0.31796   |
|  4 | E       |      0.211539  |      0.418802  |
|  5 | F       |      0.253798  |      0.472141  |
|  6 | G       |      0.282657  |      0.512246  |

In [9]:
### BEGIN SOLUTION
query_stats_by_grade = '''
SELECT
    grade,
    AVG(int_rate) AS avg_int_rate,
    AVG(did_default) AS default_rate
FROM loans
GROUP BY grade
ORDER BY grade;
'''
### END SOLUTION

conn = sqlite3.connect('lending-club-loan-results.db')
df_result = pd.read_sql_query(query_stats_by_grade, con=conn)
display(df_result)
conn.close()

Unnamed: 0,grade,avg_int_rate,default_rate
0,A,0.070426,0.062883
1,B,0.104537,0.139501
2,C,0.138426,0.233738
3,D,0.176442,0.31796
4,E,0.211539,0.418802
5,F,0.253798,0.472141
6,G,0.282657,0.512246


#### 🧭 Check your work

In [10]:
conn = sqlite3.connect('lending-club-loan-results.db')
decoded_query = base64.b64decode(b'ClNFTEVDVAogICAgZ3JhZGUsCiAgICBBVkcoaW\
50X3JhdGUpIEFTIGF2Z19pbnRfcmF0ZSwKICAgIEFWRyhkaWRfZGVmYXVsdCkgQVMgZGVmYXVsdF9yYXRlCkZST00gbG9hbnMKR1JPVVAgQlkgZ3JhZGUKT1JERVIgQlkgZ3JhZGU7Cg==').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
pd.testing.assert_frame_equal(df_result.reset_index(drop=True),
                              df_check.reset_index(drop=True))
conn.close()

---

### 🎯 Challenge 3: Stats by loan purpose

#### 👇 Tasks

- ✔️ Write a query that summarizes the number of loans, average interest rate, and the default rate for each loan purpose.
- ✔️ Use the following column names:
    - `purpose`: Purpose of each loan (e.g., "debt_consolidation", "credit_card")
    - `num_loans`: Number of loans for each loan purpose
    - `avg_int_rate`: Average interest rate for each loan purpose
    - `default_rate`: Default rate for each loan purpose
        - This is the average of `did_default` column.
- ✔️ Sort the result by `num_loans` in descending order.
- ✔️ Store your query in a new variable named `query_stats_by_purpose`.
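
Counting rows per group and sorting by that count can be sketched on a made-up table. The example below assumes an in-memory database and groups by a different column than the task does; `toy_loans` and its six rows are hypothetical. It shows that a column alias defined in `SELECT` can be reused in `ORDER BY`.

```python
import sqlite3

conn_demo = sqlite3.connect(':memory:')
conn_demo.execute('CREATE TABLE toy_loans (home_ownership TEXT, loan_amnt INTEGER)')
conn_demo.executemany(
    'INSERT INTO toy_loans VALUES (?, ?)',
    [('RENT', 1000), ('OWN', 2000), ('RENT', 3000),
     ('MORTGAGE', 4000), ('RENT', 5000), ('MORTGAGE', 6000)],
)

# COUNT(*) counts the rows in each group; the alias is then used for sorting
query_demo = '''
SELECT
    home_ownership,
    COUNT(*) AS num_loans
FROM toy_loans
GROUP BY home_ownership
ORDER BY num_loans DESC;
'''
count_rows = conn_demo.execute(query_demo).fetchall()
conn_demo.close()
print(count_rows)
```

The toy counts are all distinct, so the descending order is unambiguous; with tied counts, the relative order of the tied groups would not be guaranteed unless a second sort key is added.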

#### 🔑 Expected output

|    | purpose            |   num_loans |   avg_int_rate |   default_rate |
|---:|:-------------------|------------:|---------------:|---------------:|
|  0 | debt_consolidation |      118382 |       0.133737 |       0.217685 |
|  1 | credit_card        |       45718 |       0.114085 |       0.173127 |
|  2 | home_improvement   |       14192 |       0.12765  |       0.184188 |
|  3 | other              |       10829 |       0.143714 |       0.215071 |
|  4 | major_purchase     |        4624 |       0.129519 |       0.222535 |
|  5 | small_business     |        3294 |       0.162964 |       0.319065 |
|  6 | medical            |        2419 |       0.139562 |       0.222819 |
|  7 | car                |        2059 |       0.120086 |       0.154444 |
|  8 | moving             |        1302 |       0.152251 |       0.24424  |
|  9 | vacation           |        1258 |       0.134918 |       0.197933 |
| 10 | house              |        1042 |       0.156619 |       0.255278 |

In [11]:
### BEGIN SOLUTION
query_stats_by_purpose = '''
SELECT
    purpose,
    COUNT(*) AS num_loans,
    AVG(int_rate) AS avg_int_rate,
    AVG(did_default) AS default_rate
FROM loans
GROUP BY purpose
ORDER BY num_loans DESC;
'''
### END SOLUTION

conn = sqlite3.connect('lending-club-loan-results.db')
df_result = pd.read_sql_query(query_stats_by_purpose, con=conn)
display(df_result)
conn.close()

Unnamed: 0,purpose,num_loans,avg_int_rate,default_rate
0,debt_consolidation,118382,0.133737,0.217685
1,credit_card,45718,0.114085,0.173127
2,home_improvement,14192,0.12765,0.184188
3,other,10829,0.143714,0.215071
4,major_purchase,4624,0.129519,0.222535
5,small_business,3294,0.162964,0.319065
6,medical,2419,0.139562,0.222819
7,car,2059,0.120086,0.154444
8,moving,1302,0.152251,0.24424
9,vacation,1258,0.134918,0.197933
10,house,1042,0.156619,0.255278


#### 🧭 Check your work

In [12]:
conn = sqlite3.connect('lending-club-loan-results.db')
decoded_query = base64.b64decode(
    b'ClNFTEVDVAogICAgcHVycG9zZSwKICA'
    b'gIENPVU5UKCopIEFTIG51bV9sb2FucywKICAgIEFWRyhpbnRfcmF0ZSkgQVMgYXZnX'
    b'2ludF9yYXRlLAogICAgQVZHKGRpZF9kZWZhdWx0KSBBUyBkZWZhdWx0X3JhdGUKRlJ'
    b'PTSBsb2FucwpHUk9VUCBCWSBwdXJwb3NlCk9SREVSIEJZIG51bV9sb2FucyBERVNDOwo='
).decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
pd.testing.assert_frame_equal(df_result.reset_index(drop=True),
                              df_check.reset_index(drop=True))
conn.close()