## WEEK 6 - SQL WITH PYTHON - 13.02

Project Challenges:  Countries Data

Related files
country_codes.csv (10 KB)
The table country_codes.csv lists countries with their numeric, two-letter, and three-letter country codes. It also includes the latitude and longitude of each country's geographic center. This table will be useful later in the week for visualizing country-level data on a map.



In [16]:
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import text 

In [17]:
import os
from dotenv import load_dotenv

load_dotenv()

True

In [29]:
!echo " # db settings" > .env
!echo "HOST=127.0.0.1" >> .env
!echo "USERNAME=postgres" >> .env
!echo "DB_NAME=climate" >> .env
!echo "PASS=postgres" >> .env
!echo "PORT=5432" >> .env

In [30]:
!cat .env

 # db settings
HOST=127.0.0.1
USERNAME=postgres
DB_NAME=climate
PASS=postgres
PORT=5432


In [31]:
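# load_dotenv() does not override variables that are already set in the shell,
# and USERNAME is predefined on some systems, so check the values if the login fails.
username = os.getenv('USERNAME')
password = os.getenv('PASS')
host = os.getenv('HOST')
port = os.getenv('PORT')
db_name = os.getenv('DB_NAME')

In [32]:
# Take the database name from .env instead of hard-coding it
url = f'postgresql://{username}:{password}@{host}:{port}/{db_name}'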
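The f-string URL works for simple credentials like the ones in this .env file. If a password ever contains characters such as `@` or `/`, they have to be percent-encoded before they go into the URL. The sketch below shows one way to do that with the standard library; `build_pg_url` is a hypothetical helper name and the password is made up for illustration.

```python
from urllib.parse import quote_plus


def build_pg_url(username, password, host, port, db_name):
    """Assemble a PostgreSQL URL, percent-encoding the credentials."""
    return (
        f"postgresql://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}:{port}/{db_name}"
    )


# Made-up password containing characters that would otherwise break the URL
demo_url = build_pg_url("postgres", "p@ss/word", "127.0.0.1", "5432", "climate")
print(demo_url)
```

The resulting string can be passed to create_engine in the same way as the plain f-string version.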

In [33]:
engine = create_engine(url, echo=False)

In [34]:
engine

Engine(postgresql://postgres:***@127.0.0.1:5432/climate)

In [35]:
# 1. Import the CSV file as a pandas DataFrame.

country_data = pd.read_csv('/Users/anacoutinho/Desktop/spiced/cardamon-loops-working-folder/06_week_Climate_Data/02_sql_with_python/country_codes.csv')
country_data.head()

   name            alpha2  alpha3  code  lat       lon
0  Afghanistan     AF      AFG     4     33.0      65.0
1  Albania         AL      ALB     8     41.0      20.0
2  Algeria         DZ      DZA     12    28.0      3.0
3  American Samoa  AS      ASM     16    -14.3333  -170.0
4  Andorra         AD      AND     20    42.5      1.6
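One thing worth checking after reading a table of country codes: by default pandas treats the string "NA" as a missing value, which would turn Namibia's two-letter code into NaN. Whether country_codes.csv is affected was not verified here, so the sketch below uses a tiny made-up CSV to show the behaviour and the `keep_default_na=False` option of `pd.read_csv`.

```python
import io

import pandas as pd

# Made-up two-row CSV; the real file is assumed to have a similar alpha2 column
csv_text = "name,alpha2\nNamibia,NA\nNorway,NO\n"

# Default behaviour: the string "NA" is parsed as a missing value
default_read = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps "NA" as a literal string
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read["alpha2"].isna().sum())
print(literal_read["alpha2"].tolist())
```

Note that `keep_default_na=False` also stops empty cells from becoming NaN, so it is a trade-off rather than a free fix.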


In [47]:
# 2. With Python, define a countries table in the climate database.

with engine.begin() as conn:
    # Start from a clean slate; CASCADE also drops constraints that depend on the table
    conn.execute(text("DROP TABLE IF EXISTS countries CASCADE;"))
    conn.execute(text("""
        CREATE TABLE countries (
            index INT,
            name VARCHAR PRIMARY KEY,
            alpha2 VARCHAR UNIQUE,  -- must be unique so stations.cn can reference it
            alpha3 VARCHAR,
            code INT,
            lat NUMERIC,
            lon NUMERIC
        );
    """))

In [50]:
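with engine.begin() as conn: # Done with echo=True
    # Test row to confirm the table accepts inserts; deleted again so it does not mix with the real data
    conn.execute(text("INSERT INTO countries VALUES (1, 'index', 'name', 'alpha2', 42, 42, 43)"))
    conn.execute(text("DELETE FROM countries WHERE name = 'index'"))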

In [44]:
country_data.columns


Index(['name', 'alpha2', 'alpha3', 'code', 'lat', 'lon'], dtype='object')

In [45]:
country_data.index

RangeIndex(start=0, stop=243, step=1)

In [51]:
# 3. Load the DataFrame into the countries table.
# Use this script as a reference:

# index=True also writes the DataFrame index into the table's index column
country_data.to_sql('countries', engine, if_exists='append', index=True)

243


## Stations Data

Use the stations file from the downloaded ECA_blend data folder and process it the same way as the countries data in the previous exercise. Use the downloaded file specifically, since it contains every station that appears in the downloaded datasets.

## Hints:

- First read the stations data into a notebook and clean it up before uploading it to the database.
- The pd.read_csv method has a skiprows parameter for skipping header lines of a .csv file.
- Clean up the column names of the file: strip the whitespace and convert the names to lowercase.
- Add a foreign key constraint on the cn column that points to the alpha2 column of the countries table.
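The first three hints can be sketched on a small in-memory sample. The text below is a made-up stand-in for stations.txt with two preamble lines and padded column names; the real file needs a larger `skiprows` value.

```python
import io

import pandas as pd

# Made-up sample: two preamble lines, then a header with padded column names
sample_text = (
    "EUROPEAN CLIMATE ASSESSMENT & DATASET\n"
    "FILE FORMAT DESCRIPTION\n"
    "STAID,STANAME                                 ,CN,      LAT,       LON,HGHT\n"
    "1,VAEXJOE                                 ,SE,+56:52:00,+014:48:00,166\n"
)

# Skip the preamble so the third line becomes the header row
stations_demo = pd.read_csv(io.StringIO(sample_text), skiprows=2)

# Strip the padding and lowercase the column names
stations_demo.columns = stations_demo.columns.str.strip().str.lower()

# String values can carry the same padding as the header
stations_demo["staname"] = stations_demo["staname"].str.strip()

print(stations_demo.columns.tolist())
```

Chaining `.str.strip().str.lower()` avoids typing the names by hand, so the cleanup keeps working if the file gains or reorders columns.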

In [66]:
stations = pd.read_csv('../data/ECA_blend_tg/stations.txt', skiprows=17)
stations


      STAID  STANAME                CN   LAT        LON         HGHT
0     1      VAEXJOE                SE   +56:52:00  +014:48:00  166
1     2      FALUN                  SE   +60:37:00  +015:37:00  160
2     3      STENSELE               SE   +65:04:00  +017:09:59  325
3     4      LINKOEPING             SE   +58:24:00  +015:31:59  93
4     5      LINKOEPING-MALMSLAETT  SE   +58:24:00  +015:31:59  93
...   ...    ...                    ...  ...        ...         ...
6450  25150  GDANSK-REBIECHOWO_OLD  PL   +54:22:59  +018:28:00  144
6451  25151  ELBLAG-MILEJEWO        PL   +54:13:23  +019:32:36  151
6452  25156  KROSNO                 PL   +49:42:24  +021:46:09  326
6453  25157  YLJA KRAFTVERK         NO   +61:11:49  +008:22:50  517


In [53]:
stations.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6455 entries, 0 to 6454
Data columns (total 6 columns):
 #   Column                                    Non-Null Count  Dtype 
---  ------                                    --------------  ----- 
 0   STAID                                     6455 non-null   int64 
 1   STANAME                                   6455 non-null   object
 2   CN                                        6455 non-null   object
 3         LAT                                 6455 non-null   object
 4          LON                                6455 non-null   object
 5   HGHT                                      6455 non-null   int64 
dtypes: int64(2), object(4)
memory usage: 302.7+ KB


In [54]:
stations.columns


Index(['STAID', 'STANAME                                 ', 'CN', '      LAT',
       '       LON', 'HGHT'],
      dtype='object')

In [71]:
# Strip the padding and lowercase the names instead of hard-coding them
stations.columns = stations.columns.str.strip().str.lower()



In [72]:
stations.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6455 entries, 0 to 6454
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   staid    6455 non-null   int64 
 1   staname  6455 non-null   object
 2   cn       6455 non-null   object
 3   lat      6455 non-null   object
 4   lon      6455 non-null   object
 5   hght     6455 non-null   int64 
dtypes: int64(2), object(4)
memory usage: 302.7+ KB


In [73]:
# The names are already clean at this point, so strip() leaves them unchanged
stations.columns = stations.columns.str.strip()
stations.columns


Index(['staid', 'staname', 'cn', 'lat', 'lon', 'hght'], dtype='object')

In [74]:
stations.sample(10)


      staid  staname                cn  lat        lon         hght
4598  18353  BODOE - SKIVIKA        NO  +67:18:29  +014:25:50  5
2026  4784   WOLFSBURG              DE  +52:26:33  +010:45:36  56
3840  11058  NURIA                  ES  +42:22:54  +002:09:19  1971
2747  5624   BRAMON                 SE  +62:13:12  +017:44:23  17
4307  18062  ANSTADBLAHEIA          NO  +68:43:05  +015:18:39  500
5374  19150  HOEGALOFTSKVELVEN      NO  +59:24:00  +006:52:00  1078
2700  5575   MORA_A                 SE  +60:57:36  +014:30:36  196
3698  8452   ELABUGA                RU  +48:49:00  +135:52:59  62
1985  4742   WEILBURG (KLARANLAGE)  DE  +50:28:27  +008:15:34  150
438   856    BOSCO CENTRALE         IT  +44:26:20  +010:02:00  902

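The lat and lon columns are still strings in degrees:minutes:seconds form, as the object dtype in the info() output shows. If decimal degrees are wanted for the database or for mapping, a conversion along these lines could be applied; `dms_to_decimal` is a hypothetical helper, and it assumes every value follows the signed `DD:MM:SS` pattern seen in the sample.

```python
def dms_to_decimal(dms: str) -> float:
    """Convert a signed 'DD:MM:SS' string such as '+56:52:00' to decimal degrees."""
    dms = dms.strip()
    # The sign applies to the whole coordinate, not only to the degrees part
    sign = -1.0 if dms.startswith("-") else 1.0
    degrees, minutes, seconds = (float(part) for part in dms.lstrip("+-").split(":"))
    return sign * (degrees + minutes / 60 + seconds / 3600)


print(dms_to_decimal("+014:48:00"))
print(dms_to_decimal("-14:30:00"))
```

On the DataFrame this could be applied per column, for example `stations['lat'] = stations['lat'].map(dms_to_decimal)`.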

In [100]:
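# Empty the stations table before reloading it.
# This assumes the table was already created in a cell that is not shown here.
with engine.begin() as conn:
    conn.execute(text("""DELETE FROM stations;"""))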
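Before a foreign key is added on cn, it can help to check whether every station's country code actually appears in the countries data, because a single unmatched code makes the constraint fail. The sketch below runs that check on toy frames with made-up values; the names `country_data` and `stations` mirror the notebook, but the contents here are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the real frames; the values are made up for illustration
country_data = pd.DataFrame({"alpha2": ["SE", "NO", "DE"]})
stations = pd.DataFrame({"staid": [1, 2, 3], "cn": ["SE", " NO", "XK"]})

# Strip stray whitespace first, then look for codes with no match in countries
codes = stations["cn"].str.strip()
orphans = sorted(codes[~codes.isin(country_data["alpha2"])].unique())

print(orphans)
```

An empty list means every station has a matching country; anything else needs to be fixed in either table before the constraint is added.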

In [102]:
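# The referenced column countries.alpha2 needs a UNIQUE or PRIMARY KEY constraint,
# and every stations.cn value must exist in it; otherwise PostgreSQL rejects the foreign key.
with engine.begin() as conn:
    conn.execute(text("""ALTER TABLE stations
    ADD FOREIGN KEY (cn)
    REFERENCES countries(alpha2);
    """))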
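To see what the foreign key buys without touching the climate database, the sketch below reproduces the relationship in an in-memory SQLite database from the standard library. SQLite stands in for PostgreSQL here only so the example runs without a server; the table layouts are trimmed down and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# alpha2 is UNIQUE so that it can be the target of a foreign key
conn.execute("CREATE TABLE countries (name TEXT PRIMARY KEY, alpha2 TEXT UNIQUE)")
conn.execute(
    "CREATE TABLE stations ("
    " staid INTEGER PRIMARY KEY,"
    " staname TEXT,"
    " cn TEXT REFERENCES countries(alpha2))"
)

conn.execute("INSERT INTO countries VALUES ('Sweden', 'SE')")
conn.execute("INSERT INTO stations VALUES (1, 'VAEXJOE', 'SE')")  # parent row exists

try:
    # Hypothetical station whose country code is missing from countries
    conn.execute("INSERT INTO stations VALUES (2, 'NOWHERE', 'XX')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

station_count = conn.execute("SELECT COUNT(*) FROM stations").fetchone()[0]
print(rejected, station_count)
```

The second insert is refused because no country carries the code it refers to, which is the same guarantee the ALTER TABLE statement adds in PostgreSQL.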