# Data Visualization with Modern Data Science

> Assignment 2

Yao-Jen Kuo <yaojenkuo@ntu.edu.tw> from [DATAINPOINT](https://www.datainpoint.com)

## Instructions

- It is highly recommended that you test your solution in SQLiteStudio first, then paste it into Google Colab.
- Write down your solution between comments `-- BEGIN SOLUTION` and `-- END SOLUTION`.
- Run the tests to check whether your solutions are correct:
    - Runtime -> Restart and run all.
- When you are ready to submit, click File -> Download -> Download `.py`.

![](https://i.imgur.com/Y1BcDdx.png)

- Open a new Colab notebook in a private window, upload the script, and run the tests again before submission to make sure the script runs in a fresh Colab session.

![](https://i.imgur.com/ojlvbds.png)

- Upload the script to the Assignment section on NTU COOL.
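
The workflow above can also be rehearsed in plain Python before touching the real database. The following is a minimal sketch, assuming a hypothetical in-memory table `toy_report` and hypothetical names `toy_connection` and `example_query` (none of them are part of the assignment files). It only illustrates that the `-- BEGIN SOLUTION` and `-- END SOLUTION` markers are ordinary SQL comments, so they do not change what the statement returns.

```python
import sqlite3

# Hypothetical in-memory database standing in for covid19.db.
toy_connection = sqlite3.connect(":memory:")
toy_connection.execute("CREATE TABLE toy_report (Combined_Key TEXT, Confirmed INTEGER)")
toy_connection.executemany(
    "INSERT INTO toy_report VALUES (?, ?)",
    [("Taiwan", 100), ("Japan", 200)],
)

# The solution goes between the two marker comments; SQLite ignores both.
example_query =\
"""
-- BEGIN SOLUTION
SELECT *
  FROM toy_report;
-- END SOLUTION
"""

toy_rows = toy_connection.execute(example_query).fetchall()
print(len(toy_rows))
```

The same query string could be passed to `pd.read_sql` together with a connection, which is how the tests at the end of this notebook evaluate each solution.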

## Run the cell below to download the given files to your working directory.

In [1]:
import unittest
import requests
import numpy as np
import pandas as pd
import sqlite3

file_names = ["covid19.db"]
for file_name in file_names:
    file_url = f"https://raw.githubusercontent.com/datainpoint/asgmts-data-viz-with-modern-ds-2023/main/{file_name}"
    r = requests.get(file_url)
    r.raise_for_status()  # Stop early if the download failed.
    with open(file_name, "wb") as f:
        f.write(r.content)

## 01. Write a SQL statement that is able to select all the columns and records from `daily_report` as specified.

```
              Combined_Key          Last_Update  Confirmed  Deaths
0              Afghanistan  2023-02-01 04:20:54     208545    7882
1                  Albania  2023-02-01 04:20:54     334167    3596
2                  Algeria  2023-02-01 04:20:54     271378    6881
3                  Andorra  2023-02-01 04:20:54      47839     165
4                   Angola  2023-02-01 04:20:54     105184    1931
...                    ...                  ...        ...     ...
4011    West Bank and Gaza  2023-02-01 04:20:54     703228    5708
4012  Winter Olympics 2022  2023-02-01 04:20:54        535       0
4013                 Yemen  2023-02-01 04:20:54      11945    2159
4014                Zambia  2023-02-01 04:20:54     340763    4047
4015              Zimbabwe  2023-02-01 04:20:54     261606    5652

[4016 rows x 4 columns]
```

In [2]:
select_all_from_daily_report =\
"""
-- BEGIN SOLUTION
SELECT *
  FROM daily_report;
-- END SOLUTION
"""

## 02. Write a SQL statement that is able to select four variables from `time_series` as specified.

```
              Date        Country_Region  Confirmed  Daily_Cases
0       2020-01-22           Afghanistan          0            0
1       2020-01-22               Albania          0            0
2       2020-01-22               Algeria          0            0
3       2020-01-22               Andorra          0            0
4       2020-01-22                Angola          0            0
...            ...                   ...        ...          ...
222301  2023-01-31    West Bank and Gaza     703228            0
222302  2023-01-31  Winter Olympics 2022        535            0
222303  2023-01-31                 Yemen      11945            0
222304  2023-01-31                Zambia     340763          181
222305  2023-01-31              Zimbabwe     261606            0

[222306 rows x 4 columns]
```

In [3]:
select_variables_from_time_series =\
"""
-- BEGIN SOLUTION
SELECT Date,
       Country_Region,
       Confirmed,
       Daily_Cases
  FROM time_series;
-- END SOLUTION
"""

## 03. Write a SQL statement that is able to find the records of Taiwan in `time_series`.

```
            Date Country_Region  Confirmed  Deaths  Daily_Cases  Daily_Deaths
0     2020-01-22         Taiwan          1       0            1             0
1     2020-01-23         Taiwan          1       0            0             0
2     2020-01-24         Taiwan          3       0            2             0
3     2020-01-25         Taiwan          3       0            0             0
4     2020-01-26         Taiwan          4       0            1             0
...          ...            ...        ...     ...          ...           ...
1101  2023-01-27         Taiwan    9428486   16204        24348            15
1102  2023-01-28         Taiwan    9455924   16224        27438            20
1103  2023-01-29         Taiwan    9483267   16246        27343            22
1104  2023-01-30         Taiwan    9505551   16276        22284            30
1105  2023-01-31         Taiwan    9537823   16308        32272            32

[1106 rows x 6 columns]
```

In [4]:
find_taiwan_from_time_series =\
"""
-- BEGIN SOLUTION
SELECT *
  FROM time_series
 WHERE Country_Region = 'Taiwan';
-- END SOLUTION
"""

## 04. Write a SQL statement that is able to find the records of Taiwan and dates after 2021-12-31 in `time_series` as specified.

```
    Country_Region        Date  Daily_Cases
0           Taiwan  2022-01-01           21
1           Taiwan  2022-01-02           20
2           Taiwan  2022-01-03           25
3           Taiwan  2022-01-04           34
4           Taiwan  2022-01-05           26
..             ...         ...          ...
391         Taiwan  2023-01-27        24348
392         Taiwan  2023-01-28        27438
393         Taiwan  2023-01-29        27343
394         Taiwan  2023-01-30        22284
395         Taiwan  2023-01-31        32272

[396 rows x 3 columns]
```

In [5]:
find_taiwan_and_specific_date_range_from_time_series =\
"""
-- BEGIN SOLUTION
SELECT Country_Region,
       Date,
       Daily_Cases
  FROM time_series
 WHERE Country_Region = 'Taiwan' AND
       Date > '2021-12-31';
-- END SOLUTION
"""
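
The date filter works because dates written as `YYYY-MM-DD` text sort in the same order as the calendar, so a plain string comparison is enough. The following is a minimal sketch, assuming the `Date` column is stored as text in that format (the expected output suggests this, but the storage type is an assumption) and using a hypothetical in-memory table `toy_series` with made-up values.

```python
import sqlite3

# Hypothetical toy data; the values are made up for illustration.
toy_connection = sqlite3.connect(":memory:")
toy_connection.execute(
    "CREATE TABLE toy_series (Country_Region TEXT, Date TEXT, Daily_Cases INTEGER)"
)
toy_connection.executemany(
    "INSERT INTO toy_series VALUES (?, ?, ?)",
    [
        ("Taiwan", "2021-12-30", 5),
        ("Taiwan", "2021-12-31", 6),
        ("Taiwan", "2022-01-01", 7),
        ("Japan", "2022-01-01", 8),
    ],
)

date_query = """
SELECT Country_Region,
       Date,
       Daily_Cases
  FROM toy_series
 WHERE Country_Region = 'Taiwan' AND
       Date > '2021-12-31'
 ORDER BY Date;
"""
filtered_rows = toy_connection.execute(date_query).fetchall()
print(filtered_rows)  # [('Taiwan', '2022-01-01', 7)]
```

Because `>` is strict, the row dated `2021-12-31` itself is excluded, which matches "dates after 2021-12-31".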

## 05. Write a SQL statement that is able to find the distinct `Last_Update` in `daily_report` as specified.

```
            Last_Update
0   2023-02-01 04:20:54
1   2020-12-21 13:27:30
2   2022-11-22 23:21:06
3   2023-01-30 23:20:55
4   2020-08-04 02:27:56
5   2022-10-21 23:21:56
6   2022-09-12 23:21:04
7   2020-08-07 22:34:20
8   2021-10-10 23:21:42
9   2021-07-31 23:21:38
10  2023-01-08 23:21:00
```

In [6]:
find_distinct_last_update_from_daily_report =\
"""
-- BEGIN SOLUTION
SELECT DISTINCT Last_Update
  FROM daily_report;
-- END SOLUTION
"""
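
To see what `DISTINCT` does, it can help to compare the row counts with and without it. The following is a minimal sketch on a hypothetical in-memory table `toy_report` that contains one repeated timestamp; the table and variable names are assumptions for illustration only.

```python
import sqlite3

# Hypothetical toy table with one duplicated timestamp.
toy_connection = sqlite3.connect(":memory:")
toy_connection.execute("CREATE TABLE toy_report (Last_Update TEXT)")
toy_connection.executemany(
    "INSERT INTO toy_report VALUES (?)",
    [
        ("2023-02-01 04:20:54",),
        ("2023-02-01 04:20:54",),
        ("2020-12-21 13:27:30",),
    ],
)

all_values = toy_connection.execute(
    "SELECT Last_Update FROM toy_report;"
).fetchall()
distinct_values = toy_connection.execute(
    "SELECT DISTINCT Last_Update FROM toy_report;"
).fetchall()
print(len(all_values), len(distinct_values))  # 3 2
```

Without an `ORDER BY` clause, the order of the distinct values is not guaranteed, which is presumably why the test for this question only checks the shape of the result.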

## 06. Write a SQL statement that is able to find the distinct `Date` in `time_series` as specified.

```
            Date
0     2020-01-22
1     2020-01-23
2     2020-01-24
3     2020-01-25
4     2020-01-26
...          ...
1101  2023-01-27
1102  2023-01-28
1103  2023-01-29
1104  2023-01-30
1105  2023-01-31

[1106 rows x 1 columns]
```

In [7]:
find_distinct_date_from_time_series =\
"""
-- BEGIN SOLUTION
SELECT DISTINCT Date
  FROM time_series;
-- END SOLUTION
"""

## 07. Write a SQL statement that is able to find the records of US in `daily_report` as specified.

```
                 Combined_Key  Confirmed  Deaths
0        Autauga, Alabama, US      19471     230
1        Baldwin, Alabama, US      68983     723
2        Barbour, Alabama, US       7299     103
3           Bibb, Alabama, US       7919     109
4         Blount, Alabama, US      18255     261
...                       ...        ...     ...
3273       Teton, Wyoming, US      12058      16
3274       Uinta, Wyoming, US       6317      43
3275  Unassigned, Wyoming, US          0       0
3276    Washakie, Wyoming, US       2729      47
3277      Weston, Wyoming, US       1880      22

[3278 rows x 3 columns]
```

In [8]:
find_us_from_daily_report =\
"""
-- BEGIN SOLUTION
SELECT Combined_Key,
       Confirmed,
       Deaths
  FROM daily_report
 WHERE Combined_Key LIKE '%, US';
-- END SOLUTION
"""
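
The comma and space in the pattern `'%, US'` matter. SQLite's `LIKE` is case-insensitive for ASCII characters by default, so a looser pattern such as `'%US'` would also match names that merely end in "us". The following is a minimal sketch on a hypothetical in-memory table `toy_report` with made-up rows.

```python
import sqlite3

# Hypothetical toy table; only the key column is needed here.
toy_connection = sqlite3.connect(":memory:")
toy_connection.execute("CREATE TABLE toy_report (Combined_Key TEXT)")
toy_connection.executemany(
    "INSERT INTO toy_report VALUES (?)",
    [("Autauga, Alabama, US",), ("Belarus",), ("Cyprus",), ("Taiwan",)],
)

# Loose pattern: anything ending in "us", in any letter case.
loose_matches = [
    row[0]
    for row in toy_connection.execute(
        "SELECT Combined_Key FROM toy_report "
        "WHERE Combined_Key LIKE '%US' ORDER BY Combined_Key;"
    )
]

# Strict pattern: the key must end with ", US".
strict_matches = [
    row[0]
    for row in toy_connection.execute(
        "SELECT Combined_Key FROM toy_report "
        "WHERE Combined_Key LIKE '%, US' ORDER BY Combined_Key;"
    )
]

print(loose_matches)   # ['Autauga, Alabama, US', 'Belarus', 'Cyprus']
print(strict_matches)  # ['Autauga, Alabama, US']
```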

## 08. Write a SQL statement that is able to find the records of US in `daily_report` and show the top 10 records with the most `Confirmed` as specified.

```
                  Combined_Key  Confirmed
0  Los Angeles, California, US    3676266
1      Miami-Dade, Florida, US    1524998
2           Cook, Illinois, US    1507271
3        Maricopa, Arizona, US    1495588
4            Harris, Texas, US    1258801
5    San Diego, California, US    1055035
6          Kings, New York, US     952898
7         Queens, New York, US     894868
8       Orange, California, US     777812
9    Riverside, California, US     770851
```

In [9]:
find_us_most_ten_confirmed_from_daily_report =\
"""
-- BEGIN SOLUTION
SELECT Combined_Key,
       Confirmed
  FROM daily_report
 WHERE Combined_Key LIKE '%, US'
 ORDER BY Confirmed DESC
 LIMIT 10;
-- END SOLUTION
"""
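
A "top N" query combines a descending sort with `LIMIT`: the rows are ordered first, and only then is the result cut off. The following is a minimal sketch on a hypothetical in-memory table `toy_report`, with made-up county names and counts and `LIMIT 2` instead of 10 to keep it short.

```python
import sqlite3

# Hypothetical toy data; names and counts are made up.
toy_connection = sqlite3.connect(":memory:")
toy_connection.execute("CREATE TABLE toy_report (Combined_Key TEXT, Confirmed INTEGER)")
toy_connection.executemany(
    "INSERT INTO toy_report VALUES (?, ?)",
    [
        ("County A, State X, US", 30),
        ("County B, State X, US", 50),
        ("County C, State Y, US", 10),
        ("Elsewhere", 99),
    ],
)

top_query = """
SELECT Combined_Key,
       Confirmed
  FROM toy_report
 WHERE Combined_Key LIKE '%, US'
 ORDER BY Confirmed DESC
 LIMIT 2;
"""
top_rows = toy_connection.execute(top_query).fetchall()
print(top_rows)  # [('County B, State X, US', 50), ('County A, State X, US', 30)]
```

The non-US row has the largest count but is removed by `WHERE` before sorting, so it never reaches the limited result.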

## 09. Write a SQL statement that is able to find the records of US in `lookup_table`.

```
           UID                   Combined_Key iso2 iso3 Country_Region  \
0           16             American Samoa, US   AS  ASM             US   
1          316                       Guam, US   GU  GUM             US   
2          580   Northern Mariana Islands, US   MP  MNP             US   
3          630                Puerto Rico, US   PR  PRI             US   
4          840                             US   US  USA             US   
...        ...                            ...  ...  ...            ...   
3401  84090053     Unassigned, Washington, US   US  USA             US   
3402  84090054  Unassigned, West Virginia, US   US  USA             US   
3403  84090055      Unassigned, Wisconsin, US   US  USA             US   
3404  84090056        Unassigned, Wyoming, US   US  USA             US   
3405  84099999             Grand Princess, US   US  USA             US   

                Province_State      Admin2      Lat     Long_   Population  
0               American Samoa        None -14.2710 -170.1320      55641.0  
1                         Guam        None  13.4443  144.7937     164229.0  
2     Northern Mariana Islands        None  15.0979  145.6739      55144.0  
3                  Puerto Rico        None  18.2208  -66.5901    3193694.0  
4                         None        None  40.0000 -100.0000  329466283.0  
...                        ...         ...      ...       ...          ...  
3401                Washington  Unassigned      NaN       NaN          NaN  
3402             West Virginia  Unassigned      NaN       NaN          NaN  
3403                 Wisconsin  Unassigned      NaN       NaN          NaN  
3404                   Wyoming  Unassigned      NaN       NaN          NaN  
3405            Grand Princess        None      NaN       NaN          NaN  

[3406 rows x 10 columns]
```

In [10]:
find_us_from_lookup_table =\
"""
-- BEGIN SOLUTION
SELECT *
  FROM lookup_table
 WHERE Country_Region = 'US';
-- END SOLUTION
"""

## 10. Write a SQL statement that is able to find the records of Russia and Ukraine in `lookup_table`.

```
       UID                 Combined_Key iso2 iso3 Country_Region  \
0      643                       Russia   RU  RUS         Russia   
1      804                      Ukraine   UA  UKR        Ukraine   
2    64301      Adygea Republic, Russia   RU  RUS         Russia   
3    64302           Altai Krai, Russia   RU  RUS         Russia   
4    64303       Altai Republic, Russia   RU  RUS         Russia   
..     ...                          ...  ...  ...            ...   
109  80424        Volyn Oblast, Ukraine   UA  UKR        Ukraine   
110  80425  Zakarpattia Oblast, Ukraine   UA  UKR        Ukraine   
111  80426   Zaporizhia Oblast, Ukraine   UA  UKR        Ukraine   
112  80427     Zhytomyr Oblast, Ukraine   UA  UKR        Ukraine   
113  80428             Unknown, Ukraine   UA  UKR        Ukraine   

         Province_State Admin2        Lat       Long_   Population  
0                  None   None  61.524010  105.318756  145934460.0  
1                  None   None  48.379400   31.165600   43733759.0  
2       Adygea Republic   None  44.693901   40.152042     453376.0  
3            Altai Krai   None  52.693224   82.693142    2350080.0  
4        Altai Republic   None  50.711410   86.857219     218063.0  
..                  ...    ...        ...         ...          ...  
109        Volyn Oblast   None  50.747200   25.325400    1035330.0  
110  Zakarpattia Oblast   None  48.620800   22.287900    1256802.0  
111   Zaporizhia Oblast   None  47.838800   35.139600    1705836.0  
112     Zhytomyr Oblast   None  50.254700   28.658700    1220193.0  
113             Unknown   None        NaN         NaN          NaN  

[114 rows x 10 columns]
```

In [11]:
find_russia_and_ukraine_from_lookup_table =\
"""
-- BEGIN SOLUTION
SELECT *
  FROM lookup_table
 WHERE Country_Region IN ('Russia', 'Ukraine');
-- END SOLUTION
"""
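
`IN` is shorthand for a chain of equality checks joined by `OR`. The following is a minimal sketch on a hypothetical in-memory table `toy_lookup` with made-up identifiers, showing that both spellings return the same rows.

```python
import sqlite3

# Hypothetical toy table; UID values are made up.
toy_connection = sqlite3.connect(":memory:")
toy_connection.execute("CREATE TABLE toy_lookup (UID INTEGER, Country_Region TEXT)")
toy_connection.executemany(
    "INSERT INTO toy_lookup VALUES (?, ?)",
    [(1, "Russia"), (2, "Ukraine"), (3, "Taiwan"), (4, "Russia")],
)

rows_with_in = toy_connection.execute(
    "SELECT * FROM toy_lookup "
    "WHERE Country_Region IN ('Russia', 'Ukraine') ORDER BY UID;"
).fetchall()
rows_with_or = toy_connection.execute(
    "SELECT * FROM toy_lookup "
    "WHERE Country_Region = 'Russia' OR Country_Region = 'Ukraine' ORDER BY UID;"
).fetchall()

print(rows_with_in == rows_with_or)  # True
print(rows_with_in)  # [(1, 'Russia'), (2, 'Ukraine'), (4, 'Russia')]
```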

## End of assignment. Run the following cell to get the test results.

In [12]:
class TestAssignmentTwo(unittest.TestCase):
    def test_01_select_all_from_daily_report(self):
        all_from_daily_report = pd.read_sql(select_all_from_daily_report, connection)
        self.assertEqual(all_from_daily_report.shape, (4016, 4))
    def test_02_select_variables_from_time_series(self):
        variables_from_time_series = pd.read_sql(select_variables_from_time_series, connection)
        self.assertEqual(variables_from_time_series.shape, (222306, 4))
        columns = variables_from_time_series.columns
        self.assertIn("Date", columns)
        self.assertIn("Country_Region", columns)
        self.assertIn("Confirmed", columns)
        self.assertIn("Daily_Cases", columns)
    def test_03_find_taiwan_from_time_series(self):
        taiwan_from_time_series = pd.read_sql(find_taiwan_from_time_series, connection)
        self.assertEqual(taiwan_from_time_series.shape, (1106, 6))
        np.testing.assert_equal(taiwan_from_time_series['Country_Region'].unique(),
                               np.array(['Taiwan']))
    def test_04_find_taiwan_and_specific_date_range_from_time_series(self):
        taiwan_and_specific_date_range_from_time_series = pd.read_sql(find_taiwan_and_specific_date_range_from_time_series, connection)
        self.assertEqual(taiwan_and_specific_date_range_from_time_series.shape, (396, 3))
        np.testing.assert_equal(taiwan_and_specific_date_range_from_time_series['Country_Region'].unique(),
                               np.array(['Taiwan']))
    def test_05_find_distinct_last_update_from_daily_report(self):
        distinct_last_update_from_daily_report = pd.read_sql(find_distinct_last_update_from_daily_report, connection)
        self.assertEqual(distinct_last_update_from_daily_report.shape, (11, 1))
    def test_06_find_distinct_date_from_time_series(self):
        distinct_date_from_time_series = pd.read_sql(find_distinct_date_from_time_series, connection)
        self.assertEqual(distinct_date_from_time_series.shape, (1106, 1))
    def test_07_find_us_from_daily_report(self):
        us_from_daily_report = pd.read_sql(find_us_from_daily_report, connection)
        self.assertEqual(us_from_daily_report.shape, (3278, 3))
    def test_08_find_us_most_ten_confirmed_from_daily_report(self):
        us_most_ten_confirmed_from_daily_report = pd.read_sql(find_us_most_ten_confirmed_from_daily_report, connection)
        self.assertEqual(us_most_ten_confirmed_from_daily_report.shape, (10, 2))
        combined_keys = us_most_ten_confirmed_from_daily_report["Combined_Key"].values
        self.assertIn("Los Angeles, California, US", combined_keys)
        self.assertIn("Maricopa, Arizona, US", combined_keys)
        self.assertIn("Miami-Dade, Florida, US", combined_keys)
        self.assertIn("Cook, Illinois, US", combined_keys)
        self.assertIn("Harris, Texas, US", combined_keys)
    def test_09_find_us_from_lookup_table(self):
        us_from_lookup_table = pd.read_sql(find_us_from_lookup_table, connection)
        self.assertEqual(us_from_lookup_table.shape, (3406, 10))
        np.testing.assert_equal(us_from_lookup_table['Country_Region'].unique(),
                                np.array(['US']))
    def test_10_find_russia_and_ukraine_from_lookup_table(self):
        russia_and_ukraine_from_lookup_table = pd.read_sql(find_russia_and_ukraine_from_lookup_table, connection)
        self.assertEqual(russia_and_ukraine_from_lookup_table.shape, (114, 10))
        country_regions = russia_and_ukraine_from_lookup_table["Country_Region"].values
        self.assertIn("Russia", country_regions)
        self.assertIn("Ukraine", country_regions)
        
connection = sqlite3.connect("covid19.db")
suite = unittest.TestLoader().loadTestsFromTestCase(TestAssignmentTwo)
runner = unittest.TextTestRunner(verbosity=2)
test_results = runner.run(suite)
number_of_failures = len(test_results.failures)
number_of_errors = len(test_results.errors)
number_of_test_runs = test_results.testsRun
number_of_successes = number_of_test_runs - (number_of_failures + number_of_errors)
print(f"You've got {number_of_successes} successes among {number_of_test_runs} questions.")

test_01_select_all_from_daily_report (__main__.TestAssignmentTwo) ... ok
test_02_select_variables_from_time_series (__main__.TestAssignmentTwo) ... ok
test_03_find_taiwan_from_time_series (__main__.TestAssignmentTwo) ... ok
test_04_find_taiwan_and_specific_date_range_from_time_series (__main__.TestAssignmentTwo) ... ok
test_05_find_distinct_last_update_from_daily_report (__main__.TestAssignmentTwo) ... ok
test_06_find_distinct_date_from_time_series (__main__.TestAssignmentTwo) ... ok
test_07_find_us_from_daily_report (__main__.TestAssignmentTwo) ... ok
test_08_find_us_most_ten_confirmed_from_daily_report (__main__.TestAssignmentTwo) ... ok
test_09_find_us_from_lookup_table (__main__.TestAssignmentTwo) ... ok
test_10_find_russia_and_ukraine_from_lookup_table (__main__.TestAssignmentTwo) ... ok

----------------------------------------------------------------------
Ran 10 tests in 0.865s

OK


You've got 10 successes among 10 questions.
