# The Task At Hand

In this Jupyter notebook I complete the City-Hive exercise I was given. When finished, the project will:

1) Upload a dataset of coronavirus statistics from a CSV file into a PostgreSQL database.

2) Run a query that answers the question "How many countries does the dataset include?" and export its result to a CSV file.

3) Upload that CSV file to an AWS S3 bucket.

# Plan of Action

In [11]:
# Import each third-party dependency, installing it first if it is missing
try:
    import psycopg2
except ImportError:
    !pip3 install psycopg2
    import psycopg2
print("psycopg2 version: {:>30}".format(psycopg2.__version__))

try:
    import sqlalchemy
except ImportError:
    !pip3 install sqlalchemy
    import sqlalchemy
print("sqlalchemy version: {:>30}".format(sqlalchemy.__version__))

try:
    import pandas
except ImportError:
    !pip3 install pandas
    import pandas
print("pandas version: {:>30}".format(pandas.__version__))

try:
    import selenium
except ImportError:
    !pip3 install selenium
    import selenium
print("selenium version: {:>30}".format(selenium.__version__))

# urllib is part of the Python standard library, so it cannot (and need not) be installed with pip
import urllib.request
print("urllib version: {:>30}".format(urllib.request.__version__))


psycopg2 version:    2.9.3 (dt dec pq3 ext lo64)
sqlalchemy version:                          1.4.7
pandas version:                          1.2.4
selenium version:                          4.1.0
urllib version:                            3.8


In [21]:
# Get the URL of the hyperlink that downloads the CSV file
import re

response = urllib.request.urlopen("https://ourworldindata.org/covid-deaths")
text = response.read()
plaintext = text.decode('utf8')
# Collect every href value on the page, then keep the ones that mention "csv"
links = re.findall('href=["\'](.*?)["\']', plaintext)
csv_link = [link for link in links if "csv" in link]
print(csv_link[0])

https://covid.ourworldindata.org/data/owid-covid-data.csv
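The regular expression above works for this page, but it matches any `href` attribute, including ones on non-anchor tags. As a hedged alternative, the sketch below uses the standard library's `html.parser` to collect only `<a href=...>` values and keep those whose path ends in `.csv`. The class name `HrefCollector`, the helper `find_csv_links`, and the sample HTML are illustrative assumptions and are not part of the original notebook.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags only."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def find_csv_links(html_text):
    """Return the anchor links whose path (ignoring any query string) ends in .csv."""
    collector = HrefCollector()
    collector.feed(html_text)
    return [h for h in collector.hrefs if h.lower().split("?")[0].endswith(".csv")]


# Made-up sample page, used only to demonstrate the helper
sample_html = """
<a href="https://example.org/about">About</a>
<a href='https://example.org/data/sample.csv'>Download CSV</a>
<link href="https://example.org/csv-style.css">
"""
csv_links = find_csv_links(sample_html)
print(csv_links)
```

In the notebook, `find_csv_links(plaintext)` could stand in for the `re.findall` step, assuming the download link is a regular anchor tag on the page.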


In [22]:
#Downloading and saving the dataset from the link
from urllib.request import urlretrieve as retrieve

retrieve(csv_link[0], 'CoronaStats.csv')

('CoronaStats.csv', <http.client.HTTPMessage at 0x205d9535f10>)

In [24]:
%ls

 Volume in drive C has no label.
 Volume Serial Number is A60F-1A98

 Directory of C:\Users\gavis

26/01/2022  20:45    <DIR>          .
26/01/2022  20:45    <DIR>          ..
19/11/2020  01:19    <DIR>          .android
26/01/2022  20:47    <DIR>          .conda
20/09/2021  12:11                25 .condarc
20/09/2021  12:11    <DIR>          .continuum
21/03/2021  18:26    <DIR>          .dotnet
20/01/2022  00:31    <DIR>          .idlerc
26/01/2022  14:50    <DIR>          .ipynb_checkpoints
20/09/2021  12:16    <DIR>          .ipython
11/10/2021  11:26    <DIR>          .jupyter
12/10/2021  10:58    <DIR>          .keras
13/10/2021  18:31    <DIR>          .matplotlib
13/12/2021  18:13    <DIR>          .nbi
22/03/2021  13:20    <DIR>          .nuget
25/12/2019  02:06    <DIR>          .Origin
26/01/2022  14:43                65 .pgAdmin4.1057243102.addr
26/01/2022  15:29             1,066 .pgAdmin4.1057243102.log
26/01/2022  14:43             1,580 .pgAdmin4.startup.log
09/02/2020 

## Build the SQL statement that creates the database table for the downloaded CSV

In [35]:
# Use pandas to get the column names of the CSV
import os
import pandas as pd
# The notebook runs from the directory that holds the CSV, so start from the working directory
ipynb_path = os.getcwd()
# os.path.join picks the right path separator for the operating system
csv_file_path = os.path.join(ipynb_path, 'CoronaStats.csv')
csv_data = pd.read_csv(csv_file_path)
column_names = list(csv_data.columns.values)
print(column_names)

['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases', 'new_cases_smoothed', 'total_deaths', 'new_deaths', 'new_deaths_smoothed', 'total_cases_per_million', 'new_cases_per_million', 'new_cases_smoothed_per_million', 'total_deaths_per_million', 'new_deaths_per_million', 'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients', 'icu_patients_per_million', 'hosp_patients', 'hosp_patients_per_million', 'weekly_icu_admissions', 'weekly_icu_admissions_per_million', 'weekly_hosp_admissions', 'weekly_hosp_admissions_per_million', 'new_tests', 'total_tests', 'total_tests_per_thousand', 'new_tests_per_thousand', 'new_tests_smoothed', 'new_tests_smoothed_per_thousand', 'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated', 'total_boosters', 'new_vaccinations', 'new_vaccinations_smoothed', 'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred', 'people_fully_vaccinated_per_hundred', 'total_
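When only the column names are needed, the header row can be read without loading the whole file. The sketch below uses the standard library's `csv` module on an in-memory sample; the helper name `read_header` and the sample rows are assumptions made for illustration. The notebook still needs the full pandas load later on, because the SQL column types are derived from the inferred dtypes.

```python
import csv
import io


def read_header(file_obj):
    """Return the first row of a CSV file object as a list of column names."""
    reader = csv.reader(file_obj)
    return next(reader, [])


# Small in-memory stand-in for CoronaStats.csv
sample = io.StringIO(
    "iso_code,continent,location,date\n"
    "AFG,Asia,Afghanistan,2020-02-24\n"
)
header = read_header(sample)
print(header)
```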

### As the first 5 rows show, this is the correct CSV

In [31]:
csv_data.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,


## Now let's build our SQL statement:
To do so, we pair each column name with the data type it needs in the table definition and separate the pairs with commas. Because pandas and SQL use different names for their data types, we use a dictionary that maps each pandas dtype to the corresponding SQL type name.

In [55]:

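# Map pandas dtype names to the matching PostgreSQL type names
replacements = {
        'timedelta64[ns]': 'varchar',
        'object': 'varchar',
        'float64': 'float',
        'int64': 'int',
        'datetime64[ns]': 'timestamp'
}

# Pair every column name with its SQL type and join the pairs with commas
col_str = ", ".join("{} {}".format(n, d) for (n, d) in zip(column_names, csv_data.dtypes.replace(replacements)))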
print(col_str)

iso_code varchar, continent varchar, location varchar, date varchar, total_cases float, new_cases float, new_cases_smoothed float, total_deaths float, new_deaths float, new_deaths_smoothed float, total_cases_per_million float, new_cases_per_million float, new_cases_smoothed_per_million float, total_deaths_per_million float, new_deaths_per_million float, new_deaths_smoothed_per_million float, reproduction_rate float, icu_patients float, icu_patients_per_million float, hosp_patients float, hosp_patients_per_million float, weekly_icu_admissions float, weekly_icu_admissions_per_million float, weekly_hosp_admissions float, weekly_hosp_admissions_per_million float, new_tests float, total_tests float, total_tests_per_thousand float, new_tests_per_thousand float, new_tests_smoothed float, new_tests_smoothed_per_thousand float, positive_rate float, tests_per_case float, tests_units varchar, total_vaccinations float, people_vaccinated float, people_fully_vaccinated float, total_boosters float, n
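The column definitions above are built from raw column names. As a hedged variation, the sketch below wraps each name in double quotes (so a name that clashes with an SQL keyword or contains unusual characters is still accepted) and falls back to `varchar` for any dtype missing from the mapping. The names `PANDAS_TO_SQL` and `build_column_definitions` and the three example columns are assumptions for illustration; the function takes plain dtype strings so it runs without pandas.

```python
# Mapping from pandas dtype names to PostgreSQL type names (same idea as the notebook's dictionary)
PANDAS_TO_SQL = {
    "object": "varchar",
    "float64": "float",
    "int64": "int",
    "datetime64[ns]": "timestamp",
    "timedelta64[ns]": "varchar",
}


def build_column_definitions(dtypes_by_column, default="varchar"):
    """Build the 'name type, name type, ...' part of a CREATE TABLE statement."""
    parts = []
    for name, dtype in dtypes_by_column.items():
        sql_type = PANDAS_TO_SQL.get(str(dtype), default)
        # Double any embedded quote, then wrap the identifier in double quotes
        quoted = '"' + name.replace('"', '""') + '"'
        parts.append("{} {}".format(quoted, sql_type))
    return ", ".join(parts)


# Three columns taken from the dataset, with the dtypes pandas inferred for them
example = {"iso_code": "object", "date": "object", "total_cases": "float64"}
col_defs = build_column_definitions(example)
print(col_defs)
```

With the notebook's DataFrame, the input dictionary could be produced with `csv_data.dtypes.astype(str).to_dict()`.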

# Establish a connection to our database
I created the database on AWS RDS infrastructure with a PostgreSQL engine.

In [60]:
conn_string = "host=coronastatistics.cz9th7gv5riq.us-east-1.rds.amazonaws.com \
                dbname=''\
                user='postgres' password='Password'"
conn = psycopg2.connect(conn_string)
cursor = conn.cursor()
print("opened database successfully")

opened database successfully
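The connection string above keeps the credentials in the notebook itself. One common alternative is to read them from environment variables. The sketch below assumes the standard libpq variable names (`PGHOST`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`, `PGPORT`) have been set; the helper name `connection_params_from_env` and the example values are hypothetical. It only builds the parameters and does not open a connection.

```python
import os


def connection_params_from_env(env=None):
    """Collect PostgreSQL connection parameters from environment variables."""
    env = os.environ if env is None else env
    required = ["PGHOST", "PGDATABASE", "PGUSER", "PGPASSWORD"]
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "host": env["PGHOST"],
        "dbname": env["PGDATABASE"],
        "user": env["PGUSER"],
        "password": env["PGPASSWORD"],
        "port": int(env.get("PGPORT", "5432")),
    }


# Made-up values standing in for a real environment
fake_env = {
    "PGHOST": "db.example.org",
    "PGDATABASE": "corona",
    "PGUSER": "postgres",
    "PGPASSWORD": "not-a-real-password",
}
params = connection_params_from_env(fake_env)
print({key: value for key, value in params.items() if key != "password"})
```

The resulting dictionary can be passed on as `psycopg2.connect(**params)`, which accepts these keyword arguments.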


In [61]:
#create table
full_sql_query = "create table coronadata" + '(' + col_str + ')' 
cursor.execute(full_sql_query)

In [62]:
# Copy the data from the CSV file into the database table
SQL_STATEMENT = """
COPY coronadata FROM STDIN WITH
        CSV
        HEADER
        DELIMITER AS ','
 """
# Open the file in a context manager so it is closed once the copy finishes
with open(csv_file_path) as my_file:
    cursor.copy_expert(sql = SQL_STATEMENT, file = my_file)
print("file copied to db")


file copied to db


In [63]:
cursor.execute("grant select on table coronadata to public")
conn.commit()
cursor.close()
print("table coronadata imported to db completed")

table coronadata imported to db completed


# Run The Query and save the results in a CSV