# PySpark: Medallion Architecture Example

### This project gives a concise example of a medallion architecture.

#### We import data from a public web API, https://api.openbrewerydb.org/breweries.
#### First, we install and import the required libraries, including PySpark.

In [None]:
%pip install pandas
%pip install requests
%pip install pyspark


In [None]:
import pandas as pd
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

#### Creating a local Spark session

In [None]:
spark = SparkSession.builder.appName("MeuDataFrame").master("local[*]").getOrCreate()

#### Downloading the breweries' data from the web

In [None]:
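url = "https://api.openbrewerydb.org/breweries"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error body

data = response.json()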
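
Before loading the response into a DataFrame, it can help to check that the payload has the shape the later steps rely on. The sketch below assumes the endpoint returns a JSON array of objects; `validate_records` and the sample payload are hypothetical, made up for illustration, and not part of the API or of `requests`.

```python
import json


def validate_records(payload):
    """Return the payload unchanged if it is a list of dicts, else raise ValueError."""
    if not isinstance(payload, list):
        raise ValueError(f"expected a JSON array, got {type(payload).__name__}")
    for i, record in enumerate(payload):
        if not isinstance(record, dict):
            raise ValueError(f"record {i} is not a JSON object")
    return payload


# Hypothetical sample standing in for response.json()
sample = json.loads('[{"id": "a1", "name": "Example Brewery", "country": "United States"}]')
records = validate_records(sample)
```

In the notebook this would be called as `data = validate_records(response.json())`, so that an unexpected response fails here rather than later in the pipeline.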

#### Loading the downloaded data into a pandas DataFrame

In [None]:
pandas_df = pd.DataFrame(data)

#### Writing the data to a CSV file in the working directory

In [None]:
pandas_df.to_csv('/kaggle/working/breweries.csv',index=False)

#### The bronze layer holds the data raw, exactly as it came from the source.

In [None]:
bronze_df = spark.read.csv('/kaggle/working/breweries.csv',header=True)

In [None]:
bronze_df.show(10)
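
CSV carries no type information, and `spark.read.csv` is called here without `inferSchema` or an explicit schema, so every bronze column is read as a string. The standard-library sketch below (plain Python rather than Spark, with a hypothetical sample row) shows the same effect on a numeric field after a CSV round trip.

```python
import csv
import io

# Hypothetical record with a numeric field, standing in for one API row
rows = [{"name": "Example Brewery", "latitude": 44.05}]

# Write the record to an in-memory CSV, much as to_csv writes one to disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "latitude"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the float has become text
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
latitude = read_back[0]["latitude"]
```

In Spark, passing `inferSchema=True` or a `schema` to `spark.read.csv` is the usual way to recover types, typically in the silver layer so that bronze stays as close to the source as possible.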

#### The silver layer is where the data is cleaned and business rules are applied

In [None]:
# Repartitioning the data by region (state/province and country)
silver_df = bronze_df.repartition('state_province','country')

In [None]:
# Remove duplicate rows, drop the columns we don't need, and rename some columns
silver_df = silver_df.dropDuplicates()\
          .drop('id', 'address_2', 'address_3', 'state', 'latitude', 'longitude', 'postal_code', 'phone')\
          .withColumnRenamed('address_1', 'address')\
          .withColumnRenamed('brewery_type', 'type')

In [None]:
silver_df.show(20)
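
To make the three cleaning steps concrete, here is a minimal plain-Python sketch of the same logic on a list of dictionaries. It is not how Spark implements these operations; the sample rows, the shortened `DROP` set, and the `to_silver` helper are hypothetical and exist only for illustration.

```python
# Hypothetical rows; the second is an exact duplicate of the first
rows = [
    {"id": "a1", "name": "Example Brewery", "brewery_type": "micro",
     "address_1": "1 Main St", "phone": "555-0100"},
    {"id": "a1", "name": "Example Brewery", "brewery_type": "micro",
     "address_1": "1 Main St", "phone": "555-0100"},
    {"id": "b2", "name": "Other Brewery", "brewery_type": "brewpub",
     "address_1": "2 Oak Ave", "phone": "555-0101"},
]

DROP = {"id", "phone"}  # shortened version of the notebook's drop list
RENAME = {"address_1": "address", "brewery_type": "type"}


def to_silver(rows):
    """Deduplicate on the full row, then drop and rename columns."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))  # whole-row identity, like dropDuplicates() with no arguments
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({RENAME.get(k, k): v for k, v in row.items() if k not in DROP})
    return cleaned


silver_rows = to_silver(rows)
```

Note the order: duplicates are removed while `id` is still present, so two rows that differ only in a dropped column both survive, which matches the order of operations in the cell above.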

#### The gold layer is the presentation layer, where aggregations for reporting are usually computed.

In [None]:
# Count breweries by type, state/province and country, largest groups first
gold_df = silver_df.groupBy('type', 'state_province', 'country')\
          .count()\
          .orderBy('count', ascending=False)

In [None]:
gold_df.show()
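
As a rough illustration of what this aggregation computes, the sketch below reproduces the group-count-sort logic in plain Python with `collections.Counter`. The sample rows are hypothetical, and this is a conceptual stand-in rather than Spark's execution model.

```python
from collections import Counter

# Hypothetical silver rows
silver_rows = [
    {"name": "A", "type": "micro", "state_province": "Oregon", "country": "United States"},
    {"name": "B", "type": "micro", "state_province": "Oregon", "country": "United States"},
    {"name": "C", "type": "brewpub", "state_province": "Texas", "country": "United States"},
]

# Group by the three key columns and count the rows in each group
counts = Counter(
    (r["type"], r["state_province"], r["country"]) for r in silver_rows
)

# most_common() returns the groups sorted by count, largest first
gold_rows = [
    {"type": t, "state_province": s, "country": c, "count": n}
    for (t, s, c), n in counts.most_common()
]
```

Each entry in `gold_rows` corresponds to one row of `gold_df`: the grouping keys plus a `count` column.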