## Building a Data Model - Star Schema

Note: this logic would be better managed as a .py file in a "gold" directory, but it is kept here so that it is easier to read alongside the other notebooks.

In [1]:
# Read the combined (denormalized) data that we will split into fact and dimension tables
import os

# Move the working directory up one level so the relative paths below resolve
os.chdir('..')

from pyspark.sql import SparkSession
from etl.read_normalize import ingest_parquet
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.appName("OlympicCountryDataPipeline").getOrCreate()

df_denormalized = ingest_parquet(
    input_path = "datasets/countries_olympics_join.parquet"
    , spark = spark
)

25/01/15 21:56:01 WARN Utils: Your hostname, Coles-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 10.0.0.235 instead (on interface en0)
25/01/15 21:56:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/01/15 21:56:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                

In [2]:
df_denormalized.show()

25/01/15 21:56:13 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

+------------+----+------+------+-----+----+------------+--------------------+-----------+----------+---------------------+--------------------------+-------------+--------------------------------+--------------+----------------+---------------+--------------+-------------+-------------+-------+---------+---------+-----------+--------+-------+
|Country_Code|Gold|Silver|Bronze|Total|Year|Country_Name|              Region| Population|Area_sq_mi|Pop_Density_per_sq_mi|Coastline_coast_area_ratio|Net_migration|Infant_mortality_per_1000_births|GDP_per_capita|Literacy_percent|Phones_per_1000|Arable_percent|Crops_percent|Other_percent|Climate|Birthrate|Deathrate|Agriculture|Industry|Service|
+------------+----+------+------+-----+----+------------+--------------------+-----------+----------+---------------------+--------------------------+-------------+--------------------------------+--------------+----------------+---------------+--------------+-------------+-------------+-------+---------+--

## Creating Star Schema Tables - Olympics & Countries

In [12]:
# Dimension tables: one for country attributes, one for regions
dim_country_df = df_denormalized.select(
    "Country_Code", 
    "Country_Name", 
    "Population", 
    "Area_sq_mi", 
    "Pop_Density_per_sq_mi",
    "Coastline_coast_area_ratio",
    "Net_migration",
    "Infant_mortality_per_1000_births",
    "GDP_per_capita",
    "Literacy_percent",
    "Phones_per_1000",
    "Arable_percent",
    "Crops_percent",
    "Other_percent",
    "Climate",
    "Birthrate",
    "Deathrate",
    "Agriculture",
    "Industry",
    "Service").distinct()

dim_region_df = (df_denormalized.select("Region")
                 .distinct()
                 .withColumn("Region_ID", monotonically_increasing_id())
                 )

# Build the fact table: swap Region for its surrogate key (Region_ID) and
# drop the country attributes that now live in dim_country_df
fact_olympics = (
    df_denormalized
    .join(dim_region_df, on = "Region", how = "inner")
    .drop(
        "Region", "Country_Name", "Population", "Area_sq_mi",
        "Pop_Density_per_sq_mi", "Coastline_coast_area_ratio", "Net_migration",
        "Infant_mortality_per_1000_births", "GDP_per_capita", "Literacy_percent",
        "Phones_per_1000", "Arable_percent", "Crops_percent", "Other_percent",
        "Climate", "Birthrate", "Deathrate", "Agriculture", "Industry", "Service",
    )
)



Print the schemas of the fact and dimension tables to see how they join: fact_olympics links to dim_region_df on Region_ID and to dim_country_df on Country_Code.

In [22]:
print("Fact - Olympics")
fact_olympics.printSchema()

print("Dim - Region")
dim_region_df.printSchema()

print("Dim - Country")
dim_country_df.printSchema()

Fact - Olympics
root
 |-- Country_Code: string (nullable = true)
 |-- Gold: integer (nullable = true)
 |-- Silver: integer (nullable = true)
 |-- Bronze: integer (nullable = true)
 |-- Total: integer (nullable = true)
 |-- Year: string (nullable = true)
 |-- Region_ID: long (nullable = false)

Dim - Region
root
 |-- Region: string (nullable = true)
 |-- Region_ID: long (nullable = false)

Dim - Country
root
 |-- Country_Code: string (nullable = true)
 |-- Country_Name: string (nullable = true)
 |-- Population: double (nullable = true)
 |-- Area_sq_mi: double (nullable = true)
 |-- Pop_Density_per_sq_mi: double (nullable = true)
 |-- Coastline_coast_area_ratio: double (nullable = true)
 |-- Net_migration: double (nullable = true)
 |-- Infant_mortality_per_1000_births: double (nullable = true)
 |-- GDP_per_capita: double (nullable = true)
 |-- Literacy_percent: double (nullable = true)
 |-- Phones_per_1000: double (nullable = true)
 |-- Arable_percent: double (nullable = true)
 |-- Crops
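
The schemas above imply two join keys: Region_ID (fact to region) and Country_Code (fact to country). As a minimal sketch of why those keys matter, the same split is reproduced below in plain Python, so that it runs without a Spark session. The rows, country names, and the `duplicate_keys` helper are made up for illustration and are not part of the notebook. `enumerate` stands in for `monotonically_increasing_id`, which in Spark guarantees unique but not consecutive IDs.

```python
from collections import Counter

# Made-up rows standing in for df_denormalized (illustrative values only).
rows = [
    {"Country_Code": "AAA", "Year": "2000", "Total": 5, "Region": "NORTH", "Country_Name": "Aland"},
    {"Country_Code": "AAA", "Year": "2004", "Total": 7, "Region": "NORTH", "Country_Name": "Aland"},
    {"Country_Code": "BBB", "Year": "2000", "Total": 2, "Region": "SOUTH", "Country_Name": "Bland"},
]

# Region dimension: one entry per distinct region, with a surrogate Region_ID.
regions = sorted({r["Region"] for r in rows})
dim_region = {name: region_id for region_id, name in enumerate(regions)}

# Country dimension: distinct (Country_Code, Country_Name) pairs.
dim_country = sorted({(r["Country_Code"], r["Country_Name"]) for r in rows})

# Fact table: keep keys and measures, replace Region with Region_ID.
fact = [
    {
        "Country_Code": r["Country_Code"],
        "Year": r["Year"],
        "Total": r["Total"],
        "Region_ID": dim_region[r["Region"]],
    }
    for r in rows
]


def duplicate_keys(keys):
    """Return the keys that appear more than once (hypothetical helper)."""
    return sorted(k for k, n in Counter(keys).items() if n > 1)


# A dimension key must be unique; otherwise joining the fact table to the
# dimension multiplies fact rows.
country_dupes = duplicate_keys(code for code, _ in dim_country)

# Join the fact table back to both dimensions to confirm no rows were lost.
region_by_id = {region_id: name for name, region_id in dim_region.items()}
name_by_code = dict(dim_country)
rejoined = [
    {
        **f,
        "Region": region_by_id[f["Region_ID"]],
        "Country_Name": name_by_code[f["Country_Code"]],
    }
    for f in fact
]

print("duplicate country keys:", country_dupes)
print("fact rows:", len(fact), "rejoined rows:", len(rejoined))
```

The uniqueness check is worth running on the real tables because `.distinct()` deduplicates on all selected columns together. If any attribute differs between two rows for the same country, Country_Code would appear twice in the dimension. In PySpark, one way to express the same check would be `dim_country_df.groupBy("Country_Code").count().filter("count > 1")`.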

## Adding More Fact & Dimension Tables