**CHAPTER 1 DATA COLLECTION**

**1.1 Extract the data from a MySQL database**

1.1.1 Install PyMySQL


In [None]:
! pip install pymysql

1.1.2 Configure the DB credentials: use a Config class to connect to the database

In [None]:
from google.colab import userdata
class Config:
  MYSQL_HOST = userdata.get("MYSQL_HOST")
  MYSQL_PORT = userdata.get("MYSQL_PORT")
  MYSQL_USER = userdata.get("MYSQL_USER")
  MYSQL_PASSWORD = userdata.get("MYSQL_PASSWORD")
  MYSQL_DB = 'r2de3'
  MYSQL_CHARSET = 'utf8mb4'

1.1.3 Connect to DB

In [None]:
import sqlalchemy
engine = sqlalchemy.create_engine(
    "mysql+pymysql://{user}:{password}@{host}:{port}/{db}".format(
        user=Config.MYSQL_USER,
        password=Config.MYSQL_PASSWORD,
        host=Config.MYSQL_HOST,
        port=Config.MYSQL_PORT,
        db=Config.MYSQL_DB,
    )
)
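A password that contains characters such as `@` or `/` can break a connection URL built with plain string formatting. The sketch below shows one way to guard against that by percent-encoding the user and password with the standard library's `quote_plus` before formatting. The helper name `build_mysql_url` and the sample credentials are hypothetical and only serve as an illustration.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, db):
    """Build a mysql+pymysql URL, percent-encoding the user and password.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    return "mysql+pymysql://{user}:{password}@{host}:{port}/{db}".format(
        user=quote_plus(user),
        password=quote_plus(password),
        host=host,
        port=port,
        db=db,
    )


# Made-up credentials: the "@" and "/" in the password are encoded.
example_url = build_mysql_url("demo_user", "p@ss/word", "localhost", 3306, "r2de3")
print(example_url)
```

The resulting string can be passed to `sqlalchemy.create_engine` in the same way as the formatted URL above.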

1.1.4 Show Tables

In [None]:
with engine.connect() as connection:
    result = connection.execute(sqlalchemy.text("SHOW TABLES")).fetchall()
result


1.1.5 Describe Tables

In [None]:
with engine.connect() as connection:
    desc_transaction = connection.execute(sqlalchemy.text("DESCRIBE transaction")).fetchall()
    desc_customer = connection.execute(sqlalchemy.text("DESCRIBE customer")).fetchall()
    desc_product = connection.execute(sqlalchemy.text("DESCRIBE product")).fetchall()
print("== transaction ==")
print(desc_transaction)
print("== customer ==")
print(desc_customer)
print("== product ==")
print(desc_product)

1.1.6 Info: tables and schema of the data

Tables:
*   r2de3.transaction - transaction data
*   r2de3.customer - customer data
*   r2de3.product - product data

1.1.7 Query Table (Method 1: sqlalchemy)

In [None]:
with engine.connect() as connection:
  product_result = connection.execute(sqlalchemy.text("SELECT * FROM r2de3.product;")).fetchall()
print("number of rows: ", len(product_result))


1.1.7.1 Convert data to Pandas

In [None]:
import pandas as pd
product = pd.DataFrame(product_result)
product = product.set_index("ProductNo")

1.1.8 Query Table (Method 2: Pandas)

In [None]:
customer = pd.read_sql("SELECT * FROM r2de3.customer", engine)
customer

1.1.8.1 Query the r2de3.transaction table

In [None]:
transaction = pd.read_sql("SELECT * FROM r2de3.transaction", engine)
transaction
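To experiment with `pd.read_sql` without access to the MySQL server, an in-memory SQLite database can stand in for it. The sketch below is only an illustration: the `customer` columns and rows are made up, and the `?` placeholder is SQLite's parameter style (other drivers use different placeholder syntax).

```python
import sqlite3

import pandas as pd

# In-memory SQLite database used as a stand-in for MySQL (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (CustomerNo TEXT, Country TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("C1", "United Kingdom"), ("C2", "France"), ("C3", "France")],
)
conn.commit()

# Pass values through params instead of formatting them into the SQL string.
french_customers = pd.read_sql(
    "SELECT * FROM customer WHERE Country = ? ORDER BY CustomerNo",
    conn,
    params=("France",),
)
conn.close()

print(french_customers)
```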

1.1.9 Join tables: product & customer & transaction

The merge keys for each table are:
*   transaction: ProductNo, CustomerNo
*   product: ProductNo
*   customer: CustomerNo

In [None]:
merged_transaction = (
    transaction
    .merge(product, how="left", on="ProductNo")
    .merge(customer, how="left", on="CustomerNo")
)
merged_transaction
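A left join keeps every transaction row, but it silently multiplies rows if a lookup table contains a repeated key. As a rough sketch, the `validate="many_to_one"` option of `merge` can guard against this. The miniature tables below are hypothetical, and every column other than the key columns is an assumption.

```python
import pandas as pd

# Hypothetical miniature tables; only ProductNo and CustomerNo come from the notebook.
transaction = pd.DataFrame({
    "TransactionNo": ["T1", "T2"],
    "ProductNo": ["P1", "P2"],
    "CustomerNo": ["C1", "C1"],
})
product = pd.DataFrame({"ProductNo": ["P1", "P2"], "ProductName": ["Mug", "Lamp"]})
customer = pd.DataFrame({"CustomerNo": ["C1"], "Country": ["United Kingdom"]})

# validate="many_to_one" asserts that each key appears at most once on the right side.
merged = (
    transaction
    .merge(product, how="left", on="ProductNo", validate="many_to_one")
    .merge(customer, how="left", on="CustomerNo", validate="many_to_one")
)
rows_preserved = len(merged) == len(transaction)

# A lookup table with a repeated key fails validation instead of duplicating rows.
duplicated_product = pd.concat([product, product.iloc[[0]]], ignore_index=True)
try:
    transaction.merge(duplicated_product, how="left", on="ProductNo", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(rows_preserved, duplicate_detected)
```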


**1.2 Extract the conversion rate data from an API with Requests**

1.2.1 Import the requests package, which is used to call the REST API


In [None]:
import requests

1.2.2 Call the API (HTTP GET) with the requests library to get the conversion rate

In [None]:
url = "https://r2de3-currency-api-vmftiryt6q-as.a.run.app/gbp_thb"
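r = requests.get(url, timeout=30)  # avoid waiting forever if the API does not respond
r.raise_for_status()  # stop early if the API returns an HTTP error
result_conversion_rate = r.json()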

result_conversion_rate

1.2.3 Convert to Pandas

In [None]:
conversion_rate = pd.DataFrame(result_conversion_rate)

conversion_rate

1.2.4 Drop the column that is not needed (id)

In [None]:
conversion_rate = conversion_rate.drop(columns=['id'])

1.2.5 Convert the date column from string to datetime so that it matches the Date column in merged_transaction

In [None]:
conversion_rate['date'] = pd.to_datetime(conversion_rate['date'])

conversion_rate.head()
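The join in the next section only works if both date columns hold the same kind of value. The sketch below shows one way to check this and, if needed, convert the transaction side as well. The sample data is made up, and the case where `Date` arrives as `datetime.date` objects is an assumption about how a database driver might return it, not something the notebook states.

```python
import datetime as dt

import pandas as pd

# Made-up sample: dates arrive from the API as strings.
conversion_rate = pd.DataFrame({"date": ["2023-05-01", "2023-05-02"], "gbp_thb": [42.0, 42.5]})
conversion_rate["date"] = pd.to_datetime(conversion_rate["date"])

# Assumed case: the driver returned datetime.date objects, so the dtype is object.
merged_transaction = pd.DataFrame({"Date": [dt.date(2023, 5, 1), dt.date(2023, 5, 2)]})
was_datetime = pd.api.types.is_datetime64_any_dtype(merged_transaction["Date"])

# Convert only when the column is not already a datetime column.
if not was_datetime:
    merged_transaction["Date"] = pd.to_datetime(merged_transaction["Date"])

print(merged_transaction["Date"].dtype, conversion_rate["date"].dtype)
```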

**1.3 Join the data**

1.3.1 Create final_df by merging merged_transaction with conversion_rate


In [None]:
final_df = merged_transaction.merge(conversion_rate, how="left", left_on="Date", right_on="date")

final_df
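Because this is a left join, a transaction whose date has no conversion rate keeps its row but gets a missing `gbp_thb` value. A minimal sketch of how to look for such rows, using made-up data:

```python
import pandas as pd

# Made-up sample: T3 falls on a date with no conversion rate.
merged_transaction = pd.DataFrame({
    "TransactionNo": ["T1", "T2", "T3"],
    "Date": pd.to_datetime(["2023-05-01", "2023-05-02", "2023-05-03"]),
})
conversion_rate = pd.DataFrame({
    "date": pd.to_datetime(["2023-05-01", "2023-05-02"]),
    "gbp_thb": [42.0, 42.5],
})

final_df = merged_transaction.merge(conversion_rate, how="left", left_on="Date", right_on="date")

# Rows without a matching rate have NaN in gbp_thb.
missing_rate = final_df[final_df["gbp_thb"].isna()]
print(missing_rate["TransactionNo"].tolist())
```

If the list is not empty, the THB amounts for those transactions will also be missing after the conversion step.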

1.3.2 We have the Price and Quantity columns but no total amount yet, so compute it as Price * Quantity

In [None]:
final_df["total_amount"] = final_df["Price"] * final_df["Quantity"]

final_df.head()

1.3.3 With total_amount in place, convert the currency: thb_amount = total_amount * gbp_thb

In [None]:
final_df["thb_amount"] = final_df["total_amount"] * final_df["gbp_thb"]

final_df
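As a small worked example of the two calculations, with made-up prices, quantities, and rates:

```python
import pandas as pd

# Made-up values chosen so the arithmetic is easy to check by hand.
sample = pd.DataFrame({
    "Price": [2.5, 10.0],
    "Quantity": [4, 1],
    "gbp_thb": [40.0, 42.0],
})

sample["total_amount"] = sample["Price"] * sample["Quantity"]      # amount in GBP
sample["thb_amount"] = sample["total_amount"] * sample["gbp_thb"]  # amount in THB

print(sample)
```

The first row gives 2.5 * 4 = 10.0 GBP and 10.0 * 40.0 = 400.0 THB.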

1.3.4 Drop the columns that are no longer needed

The columns that are no longer used can be dropped: date, which duplicates Date, and gbp_thb.

In [None]:
final_df = final_df.drop(["date", "gbp_thb"], axis=1)

final_df.columns

1.3.5 Rename the columns to lowercase and change the names ending with No to end with _id

In [None]:
final_df.columns = ['transaction_id', 'date', 'product_id', 'price', 'quantity', 'customer_id',
       'product_name', 'customer_country', 'customer_name', 'total_amount','thb_amount']

final_df
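Assigning a list to `final_df.columns` depends on the column order being exactly as expected. An alternative sketch uses `rename` with an explicit mapping, which does not depend on order. The original names `ProductName`, `Country`, and `Name` below are assumptions for illustration; the real names come from the product and customer tables.

```python
import pandas as pd

# Empty frame with assumed original column names, only to demonstrate the renaming.
example_df = pd.DataFrame(columns=[
    "TransactionNo", "Date", "ProductNo", "Price", "Quantity", "CustomerNo",
    "ProductName", "Country", "Name", "total_amount", "thb_amount",
])

# Explicit old-name to new-name mapping; names not listed are left unchanged.
rename_map = {
    "TransactionNo": "transaction_id",
    "Date": "date",
    "ProductNo": "product_id",
    "Price": "price",
    "Quantity": "quantity",
    "CustomerNo": "customer_id",
    "ProductName": "product_name",   # assumed original name
    "Country": "customer_country",   # assumed original name
    "Name": "customer_name",         # assumed original name
}
renamed_df = example_df.rename(columns=rename_map)

print(list(renamed_df.columns))
```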

**1.4 Output file**

1.4.1 The last step is to write the output to a Parquet file with to_parquet

By default, pandas also saves the index (0, 1, 2, 3, ...). If the index is not needed, pass index=False.

In [None]:
final_df.to_parquet("output.parquet", index=False)