## Building "dimsalesterritory" from the Adventureworks Database
The "dimsalesterritory" table has six columns.
Based on our observation, each column is obtained as follows:
1. "salesterritorykey": This is the primary key. It is generated by us and does not need to be extracted from any other table.
2. "salesterritoryalternativekey": Extracted from the "territoryid" column of the "sales.salesterritory" table.
3. "salesterritoryregion": Extracted from the "name" column of the "sales.salesterritory" table.
4. "salesterritorycountry": Extracted from the "countryregioncode" column of the "sales.salesterritory" table, with additional processing required.
5. "salesterritorygroup": Obtained from the "group" column of the "sales.salesterritory" table.
6. "salesterritoryimage": Let's ignore this column because it seems to come from an external source rather than from any table in the database.

OK, so it looks like we only need to extract a few columns from "sales.salesterritory".
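
The column mapping listed above can be written down as a small Python dictionary, which is convenient later for selecting and renaming columns. This is only a sketch: the names `column_mapping` and `source_columns` are assumptions made for illustration and do not appear in the original notebook.

```python
# Hypothetical helper (not part of the original notebook):
# source column in sales.salesterritory -> target column in dimsalesterritory.
column_mapping = {
    "territoryid": "salesterritoryalternativekey",
    "name": "salesterritoryregion",
    "countryregioncode": "salesterritorycountry",  # still needs code -> full name conversion
    "group": "salesterritorygroup",
}

# The source columns to pull from sales.salesterritory, in mapping order.
source_columns = list(column_mapping)
print(source_columns)
```

The same dictionary could later be passed to `DataFrame.rename(columns=column_mapping)`, so the mapping is written only once.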

In [25]:
# psycopg2 is the Python package used to connect to a PostgreSQL server
import psycopg2
from psycopg2 import OperationalError
 
def create_connection(db_name, db_user, db_password, db_host, db_port):
    connection = None
    try:
        connection = psycopg2.connect(
            database=db_name,
            user=db_user,
            password=db_password,
            host=db_host,
            port=db_port,
        )
        print("Connection to PostgreSQL DB successful")
    except OperationalError as e:
        print(f"The error '{e}' occurred")
    return connection
 

# Connection details
db_name = "Adventureworks" # because the source data is from Adventureworks
db_user = "postgres"
db_password = "postgres"  # Update with your password
db_host = "pgdb"  # Update if your DB is hosted elsewhere
db_port = "5432"
 
# Create the connection
connection = create_connection(db_name, db_user, db_password, db_host, db_port)

Connection to PostgreSQL DB successful


In [26]:
import pandas as pd
from sqlalchemy import create_engine

connection_url = f"postgresql://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}"
# Create the engine
engine = create_engine(connection_url)
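
One caveat with building the connection URL through an f-string: if the password contains reserved characters such as `@` or `/`, the URL is parsed incorrectly unless those characters are percent-encoded. A minimal sketch of the encoding step, assuming a hypothetical helper `build_postgres_url` and a made-up password:

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Return a postgresql:// URL with the user and password percent-encoded.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Made-up password containing reserved characters.
example_url = build_postgres_url("postgres", "p@ss/word", "pgdb", "5432", "Adventureworks")
print(example_url)
```

For the simple password used in this notebook the result is identical to the plain f-string, so this only matters once real credentials are involved.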

In [27]:
# load the table into a pandas DataFrame
df = pd.read_sql_query("SELECT * FROM sales.salesterritory", engine)

In [28]:
df

Unnamed: 0,territoryid,name,countryregioncode,group,salesytd,saleslastyear,costytd,costlastyear,rowguid,modifieddate
0,1,Northwest,US,North America,7887187.0,3298694.0,0.0,0.0,43689a10-e30b-497f-b0de-11de20267ff7,2008-04-30
1,2,Northeast,US,North America,2402177.0,3607149.0,0.0,0.0,00fb7309-96cc-49e2-8363-0a1ba72486f2,2008-04-30
2,3,Central,US,North America,3072175.0,3205014.0,0.0,0.0,df6e7fd8-1a8d-468c-b103-ed8addb452c1,2008-04-30
3,4,Southwest,US,North America,10510850.0,5366576.0,0.0,0.0,dc3e9ea0-7950-4431-9428-99dbcbc33865,2008-04-30
4,5,Southeast,US,North America,2538667.0,3925071.0,0.0,0.0,6dc4165a-5e4c-42d2-809d-4344e0ac75e7,2008-04-30
5,6,Canada,CA,North America,6771829.0,5693989.0,0.0,0.0,06b4af8a-1639-476e-9266-110461d66b00,2008-04-30
6,7,France,FR,Europe,4772398.0,2396540.0,0.0,0.0,bf806804-9b4c-4b07-9d19-706f2e689552,2008-04-30
7,8,Germany,DE,Europe,3805202.0,1307950.0,0.0,0.0,6d2450db-8159-414f-a917-e73ee91c38a9,2008-04-30
8,9,Australia,AU,Pacific,5977815.0,2278549.0,0.0,0.0,602e612e-dfe9-41d9-b894-27e489747885,2008-04-30
9,10,United Kingdom,GB,Europe,5012905.0,1635823.0,0.0,0.0,05fc7e1f-2dea-414e-9ecd-09d150516fb5,2008-04-30


In [29]:
# extract the four columns that we need
extracted_df = df[['territoryid', 'name', 'countryregioncode', 'group']]
# Displaying the extracted DataFrame
extracted_df

Unnamed: 0,territoryid,name,countryregioncode,group
0,1,Northwest,US,North America
1,2,Northeast,US,North America
2,3,Central,US,North America
3,4,Southwest,US,North America
4,5,Southeast,US,North America
5,6,Canada,CA,North America
6,7,France,FR,Europe
7,8,Germany,DE,Europe
8,9,Australia,AU,Pacific
9,10,United Kingdom,GB,Europe


In [30]:
# countryregioncode holds the abbreviated country code, but "dimsalesterritory" stores the full country name
# here, I simply map the codes with a Python dictionary
country2name = {"US": "United States", "CA": "Canada", "GB": "United Kingdom", "AU": "Australia", "DE": "Germany", "FR": "France"}
# work on an explicit copy so that pandas does not raise SettingWithCopyWarning on the assignment below
extracted_df = extracted_df.copy()
extracted_df['countryregioncode'] = extracted_df['countryregioncode'].replace(country2name)
# Displaying
extracted_df

Unnamed: 0,territoryid,name,countryregioncode,group
0,1,Northwest,United States,North America
1,2,Northeast,United States,North America
2,3,Central,United States,North America
3,4,Southwest,United States,North America
4,5,Southeast,United States,North America
5,6,Canada,Canada,North America
6,7,France,France,Europe
7,8,Germany,Germany,Europe
8,9,Australia,Australia,Pacific
9,10,United Kingdom,United Kingdom,Europe
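
Assigning to a column of a DataFrame that was sliced out of another DataFrame can trigger pandas' `SettingWithCopyWarning`. A minimal sketch of one way to avoid it, using a small hand-made sample rather than the real table: take an explicit `.copy()` of the subset, and use `Series.map` so that any country code missing from the dictionary becomes visible. The names `source`, `subset`, and `unmapped_codes` are assumptions for illustration.

```python
import pandas as pd

# Small hand-made sample standing in for sales.salesterritory.
source = pd.DataFrame({
    "territoryid": [1, 6, 7],
    "name": ["Northwest", "Canada", "France"],
    "countryregioncode": ["US", "CA", "FR"],
    "group": ["North America", "North America", "Europe"],
})
country2name = {"US": "United States", "CA": "Canada", "FR": "France"}

# .copy() makes the subset an independent DataFrame, so the assignment below
# modifies it directly and leaves `source` untouched.
subset = source[["territoryid", "name", "countryregioncode", "group"]].copy()

# Series.map returns NaN for codes missing from the dictionary, which makes
# unmapped codes easy to detect (Series.replace would leave them unchanged).
mapped = subset["countryregioncode"].map(country2name)
unmapped_codes = subset.loc[mapped.isna(), "countryregioncode"].unique().tolist()
subset["countryregioncode"] = mapped

print(subset)
print("unmapped codes:", unmapped_codes)
```

Whether `map` or `replace` is preferable depends on what should happen to unknown codes: `replace` keeps the abbreviation, while `map` turns it into a missing value that can be checked.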


In [31]:
# rename the columns
extracted_df = extracted_df.rename(columns={'territoryid': 'salesterritoryalternativekey', 'name': 'salesterritoryregion', 'countryregioncode': 'salesterritorycountry', 'group': 'salesterritorygroup'})
# display
extracted_df

Unnamed: 0,salesterritoryalternativekey,salesterritoryregion,salesterritorycountry,salesterritorygroup
0,1,Northwest,United States,North America
1,2,Northeast,United States,North America
2,3,Central,United States,North America
3,4,Southwest,United States,North America
4,5,Southeast,United States,North America
5,6,Canada,Canada,North America
6,7,France,France,Europe
7,8,Germany,Germany,Europe
8,9,Australia,Australia,Pacific
9,10,United Kingdom,United Kingdom,Europe


In [35]:
# append a new last row that holds 'NA' in each text column
new_row_df = pd.DataFrame([{'salesterritoryalternativekey': len(extracted_df)+1,
           'salesterritoryregion': 'NA',
           'salesterritorycountry': 'NA',
           'salesterritorygroup': 'NA'}])

extracted_df = pd.concat([extracted_df, new_row_df], ignore_index=True)

In [36]:
extracted_df

Unnamed: 0,salesterritoryalternativekey,salesterritoryregion,salesterritorycountry,salesterritorygroup
0,1,Northwest,United States,North America
1,2,Northeast,United States,North America
2,3,Central,United States,North America
3,4,Southwest,United States,North America
4,5,Southeast,United States,North America
5,6,Canada,Canada,North America
6,7,France,France,Europe
7,8,Germany,Germany,Europe
8,9,Australia,Australia,Pacific
9,10,United Kingdom,United Kingdom,Europe


In [None]:
# add the primary key as the first column
extracted_df.insert(0, 'salesterritorykey', range(1, len(extracted_df) + 1))
# now, extracted_df looks the same as the dimension table
extracted_df
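
The steps so far (select, map country codes, rename, append the 'NA' row, add the key) can be collected into one function. The following is a sketch run on a two-row hand-made sample, not the notebook's actual code; the function name `build_dimsalesterritory` is an assumption for illustration, and the alternative key of the 'NA' row follows the `len(...) + 1` convention used above.

```python
import pandas as pd


def build_dimsalesterritory(source, country2name):
    """Sketch of the steps above: select, map, rename, add 'NA' row, add key."""
    dim = source[["territoryid", "name", "countryregioncode", "group"]].copy()
    dim["countryregioncode"] = dim["countryregioncode"].replace(country2name)
    dim = dim.rename(columns={
        "territoryid": "salesterritoryalternativekey",
        "name": "salesterritoryregion",
        "countryregioncode": "salesterritorycountry",
        "group": "salesterritorygroup",
    })
    # Placeholder row, using the same len + 1 convention as the notebook.
    na_row = pd.DataFrame([{
        "salesterritoryalternativekey": len(dim) + 1,
        "salesterritoryregion": "NA",
        "salesterritorycountry": "NA",
        "salesterritorygroup": "NA",
    }])
    dim = pd.concat([dim, na_row], ignore_index=True)
    # Surrogate primary key 1..n as the first column.
    dim.insert(0, "salesterritorykey", range(1, len(dim) + 1))
    return dim


# Two-row hand-made sample standing in for sales.salesterritory.
sample = pd.DataFrame({
    "territoryid": [1, 2],
    "name": ["Northwest", "Northeast"],
    "countryregioncode": ["US", "US"],
    "group": ["North America", "North America"],
})
dim = build_dimsalesterritory(sample, {"US": "United States"})
print(dim)
```

Wrapping the steps in a function makes it easy to rerun the whole transformation after the source table changes.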

In [33]:
# a useful function that you might need in project 1: save dataframe to csv
extracted_df.to_csv('dimsalesterritory.csv', index=False)
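
One thing to watch when the CSV is read back: by default `pd.read_csv` treats the string `NA` as a missing value, so the placeholder row would come back as `NaN`. A minimal sketch of the round trip, using an in-memory buffer and a tiny hand-made frame instead of the real file:

```python
import io

import pandas as pd

dim = pd.DataFrame({
    "salesterritorykey": [1, 2],
    "salesterritoryregion": ["Northwest", "NA"],
})

# Write to an in-memory buffer instead of a file to keep the example self-contained.
buffer = io.StringIO()
dim.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default parsing converts the string "NA" into a missing value.
default_read = pd.read_csv(io.StringIO(csv_text))
# keep_default_na=False keeps "NA" as a literal string.
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read["salesterritoryregion"].isna().tolist())
print(literal_read["salesterritoryregion"].tolist())
```

If the 'NA' row is meant to stay a literal label, passing `keep_default_na=False` when reloading the file is one option.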

## What if we need multiple tables?
We can simply join them in the SQL query passed to the `pd.read_sql_query()` function.

In [34]:
# for example,
df = pd.read_sql_query("SELECT * FROM sales.salesterritoryhistory FULL JOIN sales.salesterritory ON sales.salesterritoryhistory.territoryid = sales.salesterritory.territoryid;", engine)
df

Unnamed: 0,businessentityid,territoryid,startdate,enddate,rowguid,modifieddate,territoryid.1,name,countryregioncode,group,salesytd,saleslastyear,costytd,costlastyear,rowguid.1,modifieddate.1
0,275,2,2011-05-31,2012-11-29,8563ce6a-00ff-47d7-ba4d-3c3e1cdef531,2012-11-22,2,Northeast,US,North America,2402177.0,3607149.0,0.0,0.0,00fb7309-96cc-49e2-8363-0a1ba72486f2,2008-04-30
1,275,3,2012-11-30,NaT,2f44304c-ee87-4c72-813e-ca75c5f61f4c,2012-11-23,3,Central,US,North America,3072175.0,3205014.0,0.0,0.0,df6e7fd8-1a8d-468c-b103-ed8addb452c1,2008-04-30
2,276,4,2011-05-31,NaT,64bcb1b3-a793-40ba-9859-d90f78c3f167,2011-05-24,4,Southwest,US,North America,10510850.0,5366576.0,0.0,0.0,dc3e9ea0-7950-4431-9428-99dbcbc33865,2008-04-30
3,277,3,2011-05-31,2012-11-29,3e9f893d-5142-46c9-a76a-867d1e3d6f90,2012-11-22,3,Central,US,North America,3072175.0,3205014.0,0.0,0.0,df6e7fd8-1a8d-468c-b103-ed8addb452c1,2008-04-30
4,277,2,2012-11-30,NaT,132e4721-32dd-4a73-b556-1837f3a2b9ae,2012-11-23,2,Northeast,US,North America,2402177.0,3607149.0,0.0,0.0,00fb7309-96cc-49e2-8363-0a1ba72486f2,2008-04-30
5,278,6,2011-05-31,NaT,b7c8f9f5-5fb8-47b3-be73-1b9a14bdf8b9,2011-05-24,6,Canada,CA,North America,6771829.0,5693989.0,0.0,0.0,06b4af8a-1639-476e-9266-110461d66b00,2008-04-30
6,279,5,2011-05-31,NaT,57d1cdcf-62ce-499f-8be8-1bb71c4bb7ef,2011-05-24,5,Southeast,US,North America,2538667.0,3925071.0,0.0,0.0,6dc4165a-5e4c-42d2-809d-4344e0ac75e7,2008-04-30
7,280,1,2011-05-31,2012-09-29,fd3f5566-10e2-4960-be12-0365e5665881,2012-09-22,1,Northwest,US,North America,7887187.0,3298694.0,0.0,0.0,43689a10-e30b-497f-b0de-11de20267ff7,2008-04-30
8,281,4,2011-05-31,NaT,9d8754b2-c320-40db-a77f-ff5a1bc0f46b,2011-05-24,4,Southwest,US,North America,10510850.0,5366576.0,0.0,0.0,dc3e9ea0-7950-4431-9428-99dbcbc33865,2008-04-30
9,282,6,2011-05-31,2012-05-29,2c9f5240-d8bf-4f85-897d-6083146dbc4b,2012-05-22,6,Canada,CA,North America,6771829.0,5693989.0,0.0,0.0,06b4af8a-1639-476e-9266-110461d66b00,2008-04-30


Then you can extract the columns that you need, following the same steps as before.
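
Because `SELECT *` returns the shared columns of both tables, the joined result contains duplicates such as `territoryid.1` and `rowguid.1`. For comparison, here is a sketch of the same kind of join done in pandas on hand-made samples (not the real tables), where merging on the key keeps a single `territoryid` column. The frames `history` and `territory` are assumptions for illustration, and in this sample only, territory 9 has no history row.

```python
import pandas as pd

# Hand-made subsets standing in for the two source tables.
history = pd.DataFrame({
    "businessentityid": [275, 275, 276],
    "territoryid": [2, 3, 4],
    "startdate": ["2011-05-31", "2012-11-30", "2011-05-31"],
})
territory = pd.DataFrame({
    "territoryid": [2, 3, 4, 9],
    "name": ["Northeast", "Central", "Southwest", "Australia"],
    "group": ["North America", "North America", "North America", "Pacific"],
})

# how="outer" mirrors FULL JOIN: rows without a match on either side are kept.
joined = history.merge(territory, on="territoryid", how="outer")

# Keep only the columns that are needed.
print(joined[["businessentityid", "territoryid", "name", "group"]])
```

In SQL, the equivalent is to list the wanted columns explicitly instead of using `SELECT *`.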

In [13]:
# remember to close the connection
connection.close()