## Data Processing
In this notebook I run the processing (ETL) step: I read the staged sales data from Azure SQL Database (`dbo.stg_sales`), clean it, and save the result as a CSV file.

In [273]:

import os
import sys
from urllib.parse import quote_plus
from sqlalchemy import create_engine, text
from dotenv import load_dotenv
import pandas as pd
import pyodbc
import numpy as np
from pathlib import Path

In [225]:
DOTENV_PATH = "/Users/haseebsagheer/Documents/Python Learning/Cloud-Retail-Insights/secrets/.env"

# Force .env to override anything already in the process
if load_dotenv(dotenv_path=DOTENV_PATH, override=True):
    print("The .env file is loaded successfully.")
else:
    print("Warning: .env not found or could not be loaded.")

The .env file is loaded successfully.
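The absolute `DOTENV_PATH` above only works on one machine. A minimal sketch of one alternative, assuming the `.env` file lives in a `secrets` folder somewhere above the notebook: walk up from a starting directory until the file is found. `find_env_file` is a hypothetical helper, not part of `python-dotenv`, and the demo runs in a throwaway directory.

```python
from pathlib import Path
import tempfile

def find_env_file(start, relative=Path("secrets") / ".env"):
    """Search `start` and each of its parents for `relative`; return the first match or None."""
    start = Path(start).resolve()
    for folder in (start, *start.parents):
        candidate = folder / relative
        if candidate.is_file():
            return candidate
    return None

# Demo in a temporary project layout: <root>/secrets/.env and <root>/notebooks/
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "secrets").mkdir()
    (root / "secrets" / ".env").write_text("SQL_SERVER=example\n")
    notebooks = root / "notebooks"
    notebooks.mkdir()
    found = find_env_file(notebooks)
    found_ok = found == root / "secrets" / ".env"
    print(found_ok)
```

The returned path could then be passed to `load_dotenv(dotenv_path=...)` as in the cell above.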


In [226]:
# Read the SQL Server credentials from the environment and check that all are present
server   = os.getenv("SQL_SERVER")
database = os.getenv("SQL_DATABASE")
username = os.getenv("SQL_USERNAME")
password = os.getenv("SQL_PASSWORD")
print("Using SQL_SERVER   =", server)
print("Using SQL_DATABASE =", database)
print("Using SQL_USERNAME =", username)

if not all([server, database, username, password]):
    print("ERROR: Missing one or more of SQL_SERVER / SQL_DATABASE / SQL_USERNAME / SQL_PASSWORD")
    sys.exit(1)


Using SQL_SERVER   = sqlsrv-retail-dev.database.windows.net
Using SQL_DATABASE = sqldb-dretail-dev
Using SQL_USERNAME = sqladmin
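The check above only says that something is missing. A small sketch of a helper that names the missing variables could make `.env` problems quicker to spot. `missing_env_vars` is a hypothetical name, and the demo uses a fake environment dictionary so no real credentials are involved.

```python
import os

# Variable names taken from the cell above.
REQUIRED_VARS = ("SQL_SERVER", "SQL_DATABASE", "SQL_USERNAME", "SQL_PASSWORD")

def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

# Fake environment: SQL_USERNAME is absent and SQL_PASSWORD is empty.
fake_env = {
    "SQL_SERVER": "example.database.windows.net",
    "SQL_DATABASE": "demo_db",
    "SQL_PASSWORD": "",
}
print(missing_env_vars(environ=fake_env))
```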


In [227]:
DRIVER_PATH = "/opt/homebrew/lib/libmsodbcsql.18.dylib"

# Build the ODBC connection string for Azure SQL Database
odbc = (
    f"DRIVER={DRIVER_PATH};"
    f"SERVER={server};"
    f"DATABASE={database};"
    f"UID={username};"
    f"PWD={password};"
    "Encrypt=yes;"
    "TrustServerCertificate=no;"
    "Connection Timeout=30;"
)

conn_url = f"mssql+pyodbc:///?odbc_connect={quote_plus(odbc)}"

try:
    engine = create_engine(conn_url, fast_executemany=True)
    print("SQLAlchemy engine created successfully.")
except Exception as e:
    print("Error creating engine:", e)
    sys.exit(2)

SQLAlchemy engine created successfully.
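Note that `create_engine` is lazy: it does not open a connection, so the message above only confirms that the URL was accepted. The sketch below isolates the URL-building step to show what `quote_plus` does to the ODBC string. `build_conn_url` is a hypothetical helper, and the driver path and credentials are made-up placeholders.

```python
from urllib.parse import quote_plus, unquote_plus

def build_conn_url(driver, server, database, username, password):
    """Assemble the ODBC string and embed it, percent-encoded, in a SQLAlchemy URL."""
    odbc = (
        f"DRIVER={driver};SERVER={server};DATABASE={database};"
        f"UID={username};PWD={password};"
        "Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
    )
    return f"mssql+pyodbc:///?odbc_connect={quote_plus(odbc)}"

# Placeholder values only; none of these are real.
url = build_conn_url(
    "/path/to/driver.dylib", "example.database.windows.net",
    "demo_db", "demo_user", "p@ss word",
)
encoded = url.split("odbc_connect=", 1)[1]
print(encoded)                # separators such as ';' and '=' are percent-encoded
print(unquote_plus(encoded))  # decoding restores the original ODBC string
```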


In [228]:

try:
    print("Testing connection...")
    with engine.connect() as conn:
        df = pd.read_sql("SELECT * FROM dbo.stg_sales;", conn)
        if df.empty:
            print("Query returned 0 rows.")
        else:
            print(f"Query returned {len(df)} rows")
            

except Exception as e:
    # Report the underlying error instead of hiding it behind a bare except
    print("Something went wrong while reading data from Azure SQL Server:", e)

Testing connection...
Query returned 9800 rows


### 🔗 Connecting Azure SQL → Pandas DataFrame

I successfully connected to my **Azure SQL Database** and retrieved data into a **pandas DataFrame**.  

During the process I hit several errors: for example, the `.env` variables were not being picked up (`DOTENV_PATH` not found), and the ODBC connection kept failing.  

To fix it, I manually entered the `.env` values into my ODBC connection string:

```python
odbc = (
    f"DRIVER={DRIVER_PATH};"
    f"SERVER={server};"
    f"DATABASE={database};"
    f"UID={username};"
    f"PWD={password};"
    "Encrypt=yes;"
    "TrustServerCertificate=no;"
    "Connection Timeout=30;"
)
```

After several hours of troubleshooting and refining the setup (firewall rules, driver names, env handling), I was finally able to retrieve all rows from `dbo.stg_sales` into a DataFrame 🎉



In [229]:
df_new = df.copy()
null_values = ['N/A', 'na', 'n.a.', '--', '','NA']
df_new = df_new.replace(null_values,np.nan)
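To illustrate the replacement above on a toy frame (made-up rows, not the real data): `replace` with a list matches whole cell values, so a city such as "Panama" is not affected by the `'na'` token.

```python
import numpy as np
import pandas as pd

null_values = ['N/A', 'na', 'n.a.', '--', '', 'NA']

# Toy data: three placeholder tokens and one value that merely contains "na".
toy = pd.DataFrame({
    "City": ["Henderson", "N/A", "--", "Panama"],
    "PostalCode": ["42420", "", "90036", "33311"],
})
cleaned = toy.replace(null_values, np.nan)
print(cleaned.isna().sum())
```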

In [230]:
df_new.head()

Unnamed: 0,RowID,OrderID,OrderDate,ShipDate,ShipMode,CustomerID,CustomerName,Segment,Country,City,State,PostalCode,Region,ProductID,Category,SubCategory,ProductName,Sales
0,1,CA-2017-152156,08/11/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96
1,2,CA-2017-152156,08/11/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,3,CA-2017-138688,12/06/2017,16/06/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
3,4,US-2016-108966,11/10/2016,18/10/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,5,US-2016-108966,11/10/2016,18/10/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368


In [231]:
df_new= df_new.dropna(subset=["PostalCode"])

In [232]:
df_new.count()

RowID           9789
OrderID         9789
OrderDate       9789
ShipDate        9789
ShipMode        9789
CustomerID      9789
CustomerName    9789
Segment         9789
Country         9789
City            9789
State           9789
PostalCode      9789
Region          9789
ProductID       9789
Category        9789
SubCategory     9789
ProductName     9789
Sales           9789
dtype: int64

In [233]:
df_new.drop_duplicates(inplace=True)
df_new.count()

RowID           9789
OrderID         9789
OrderDate       9789
ShipDate        9789
ShipMode        9789
CustomerID      9789
CustomerName    9789
Segment         9789
Country         9789
City            9789
State           9789
PostalCode      9789
Region          9789
ProductID       9789
Category        9789
SubCategory     9789
ProductName     9789
Sales           9789
dtype: int64
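The counts are unchanged, so no fully identical rows were found. One possible reason is that `RowID` is still present at this point: if it is unique per row (an assumption; it looks like a running index), no two rows can be exact duplicates. A toy sketch of the difference, with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "RowID": [1, 2, 3],
    "OrderID": ["A-1", "A-1", "A-2"],
    "Sales": [10.0, 10.0, 5.0],
})

# With RowID included, every row is unique.
dupes_with_id = int(toy.duplicated().sum())

# Ignoring RowID, the first two rows are duplicates of each other.
business_cols = [c for c in toy.columns if c != "RowID"]
dupes_without_id = int(toy.duplicated(subset=business_cols).sum())

print(dupes_with_id, dupes_without_id)
```

This is consistent with the later cell that drops `RowID` and runs `drop_duplicates` again.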

In [252]:
update_dates = ["OrderDate", "ShipDate"]
for col in update_dates:
    # Parse the whole column at once instead of calling pd.to_datetime per element
    df_new[col] = pd.to_datetime(df_new[col], dayfirst=True, errors="coerce")

In [253]:
df_new[update_dates].head(10)

Unnamed: 0,OrderDate,ShipDate
0,2017-11-08,2017-11-11
1,2017-11-08,2017-11-11
2,2017-06-12,2017-06-16
3,2016-10-11,2016-10-18
4,2016-10-11,2016-10-18
5,2015-06-09,2015-06-14
6,2015-06-09,2015-06-14
7,2015-06-09,2015-06-14
8,2015-06-09,2015-06-14
9,2015-06-09,2015-06-14
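A small sketch of what day-first parsing with `errors="coerce"` does, using made-up strings. Passing an explicit `format` is an assumption on my side (the cell above relies on `dayfirst=True` instead); it makes the expected layout unambiguous.

```python
import pandas as pd

# Two valid day/month/year strings, one impossible date, one non-date.
raw = pd.Series(["08/11/2017", "16/06/2017", "31/02/2017", "not a date"])
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")
print(parsed)
```

Values that cannot be parsed become `NaT`, which is why the next cell drops rows with missing dates.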


In [254]:
df_new = df_new.dropna(subset=update_dates)

In [255]:
# Drop rows where the ship date is before the order date, which is impossible
valid_ship_time_mask = (df_new['ShipDate'] - df_new['OrderDate']).dt.days >= 0
df_new = df_new[valid_ship_time_mask].copy()  # copy so later column assignments do not target a view

In [256]:
# Add a column with the number of days it took to ship the order
df_new["DaysToShip"] = (df_new["ShipDate"] - df_new["OrderDate"]).dt.days
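The filter and the new column can be tried on a toy frame with made-up dates, one of which ships before it was ordered:

```python
import pandas as pd

toy = pd.DataFrame({
    "OrderDate": pd.to_datetime(["2017-11-08", "2017-06-16", "2016-10-11"]),
    "ShipDate":  pd.to_datetime(["2017-11-11", "2017-06-12", "2016-10-11"]),
})

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days.
days = (toy["ShipDate"] - toy["OrderDate"]).dt.days

# Keep only non-negative durations; .copy() avoids assigning into a view.
valid = toy[days >= 0].copy()
valid["DaysToShip"] = (valid["ShipDate"] - valid["OrderDate"]).dt.days
print(valid)
```

Same-day shipping (0 days) is kept; only the row with a negative duration is removed.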

In [257]:
df_new.head()

Unnamed: 0,OrderID,OrderDate,ShipDate,ShipMode,CustomerID,CustomerName,Segment,Country,City,State,PostalCode,Region,ProductID,Category,SubCategory,ProductName,Sales,DaysToShip
0,ca-2017-152156,2017-11-08,2017-11-11,second class,cg-12520,claire gute,consumer,united states,henderson,kentucky,42420,south,FUR-BO-10001798,furniture,bookcases,bush somerset collection bookcase,261.96,3
1,ca-2017-152156,2017-11-08,2017-11-11,second class,cg-12520,claire gute,consumer,united states,henderson,kentucky,42420,south,FUR-CH-10000454,furniture,chairs,"hon deluxe fabric upholstered stacking chairs,...",731.94,3
2,ca-2017-138688,2017-06-12,2017-06-16,second class,dv-13045,darrin van huff,corporate,united states,los angeles,california,90036,west,OFF-LA-10000240,office supplies,labels,self-adhesive address labels for typewriters b...,14.62,4
3,us-2016-108966,2016-10-11,2016-10-18,standard class,so-20335,sean o'donnell,consumer,united states,fort lauderdale,florida,33311,south,FUR-TA-10000577,furniture,tables,bretford cr4500 series slim rectangular table,957.5775,7
4,us-2016-108966,2016-10-11,2016-10-18,standard class,so-20335,sean o'donnell,consumer,united states,fort lauderdale,florida,33311,south,OFF-ST-10000760,office supplies,storage,eldon fold 'n roll cart system,22.368,7


In [258]:
df_new[df_new["DaysToShip"] >15]
# Checking for outliers (more than 15 days to ship); none were found

Unnamed: 0,OrderID,OrderDate,ShipDate,ShipMode,CustomerID,CustomerName,Segment,Country,City,State,PostalCode,Region,ProductID,Category,SubCategory,ProductName,Sales,DaysToShip
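The 15-day cutoff is a fixed threshold. As an alternative sketch (not what this notebook uses), the common 1.5 × IQR rule derives the cutoff from the data itself. `iqr_outliers` is a hypothetical helper and the values are made up.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up shipping durations with one obviously large value.
days = pd.Series([3, 3, 4, 7, 7, 5, 2, 4, 30], name="DaysToShip")
mask = iqr_outliers(days)
print(days[mask])
```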


In [259]:
df_new.head(1)

Unnamed: 0,OrderID,OrderDate,ShipDate,ShipMode,CustomerID,CustomerName,Segment,Country,City,State,PostalCode,Region,ProductID,Category,SubCategory,ProductName,Sales,DaysToShip
0,ca-2017-152156,2017-11-08,2017-11-11,second class,cg-12520,claire gute,consumer,united states,henderson,kentucky,42420,south,FUR-BO-10001798,furniture,bookcases,bush somerset collection bookcase,261.96,3


In [260]:
df_new["PostalCode"]= df_new["PostalCode"].astype("Int64")
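A short sketch of what the nullable `Int64` cast does, on made-up values. One caveat worth knowing: an integer column cannot keep leading zeros, so if the data contains ZIP codes that start with 0 (an assumption about the data, not something checked here), a zero-padded string column may be the safer representation.

```python
import pandas as pd

# Postal codes as they might arrive: floats, with a missing value.
codes = pd.Series([42420.0, 5401.0, None], name="PostalCode")

# Nullable integer dtype keeps the missing value as <NA> instead of failing.
as_int = codes.astype("Int64")

# Optional: restore five-digit text so that 5401 reads as "05401".
as_text = as_int.astype("string").str.zfill(5)
print(as_int)
print(as_text)
```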

In [261]:
df_new[df_new["Sales"]<=0]

Unnamed: 0,OrderID,OrderDate,ShipDate,ShipMode,CustomerID,CustomerName,Segment,Country,City,State,PostalCode,Region,ProductID,Category,SubCategory,ProductName,Sales,DaysToShip


In [262]:
# Text columns to normalize: trim stray quotes, commas, and spaces, then lowercase
non_int = [
    "OrderID", "ShipMode", "CustomerID", "CustomerName", "Segment", "Country",
    "City", "State", "Region", "Category", "SubCategory", "ProductName",
]
for col in non_int:
    # str.strip takes the characters to remove from both ends as a single string
    df_new[col] = df_new[col].str.strip('," ')
    df_new[col] = df_new[col].str.lower()
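A toy illustration of the normalization above, with made-up strings. `str.strip` only removes the listed characters from the two ends of each value, so commas inside a product name are left alone.

```python
import pandas as pd

names = pd.Series([
    '  "Claire Gute",',
    "Sean O'Donnell ",
    "Stacking Chairs, Rounded Back",
])

# Strip commas, double quotes, and spaces from both ends, then lowercase.
clean = names.str.strip('," ').str.lower()
print(clean.tolist())
```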

In [263]:

df_new.drop_duplicates(inplace=True)
df_new.head(1)

Unnamed: 0,OrderID,OrderDate,ShipDate,ShipMode,CustomerID,CustomerName,Segment,Country,City,State,PostalCode,Region,ProductID,Category,SubCategory,ProductName,Sales,DaysToShip
0,ca-2017-152156,2017-11-08,2017-11-11,second class,cg-12520,claire gute,consumer,united states,henderson,kentucky,42420,south,FUR-BO-10001798,furniture,bookcases,bush somerset collection bookcase,261.96,3


In [None]:
if 'RowID' in df_new.columns:
    df_new.drop(columns=['RowID'], inplace=True)
    print("dropped")
df_new.drop_duplicates(inplace=True)

# Calendar features derived from OrderDate
df_new["OrderYear"]       = df_new["OrderDate"].dt.year.astype("Int16")
df_new["OrderMonth"]      = df_new["OrderDate"].dt.month.astype("Int8")
df_new["OrderQuarter"]    = df_new["OrderDate"].dt.quarter.astype("Int8")
df_new["OrderYearMonth"]  = df_new["OrderDate"].dt.to_period("M").astype(str)
df_new["OrderWeekOfYear"] = df_new["OrderDate"].dt.isocalendar().week.astype("Int16")
df_new["OrderMonthName"]  = df_new["OrderDate"].dt.month_name()
df_new["OrderIsWeekendOrder"] = (df_new["OrderDate"].dt.dayofweek >= 5)

# Mirror from ShipDate
df_new["ShipYear"]       = df_new["ShipDate"].dt.year.astype("Int16")
df_new["ShipMonth"]      = df_new["ShipDate"].dt.month.astype("Int8")
df_new["ShipQuarter"]    = df_new["ShipDate"].dt.quarter.astype("Int8")
df_new["ShipYearMonth"]  = df_new["ShipDate"].dt.to_period("M").astype(str)
df_new["ShipWeekOfYear"] = df_new["ShipDate"].dt.isocalendar().week.astype("Int16")
df_new["ShipMonthName"]  = df_new["ShipDate"].dt.month_name()
df_new["ShipIsWeekendShip"] = (df_new["ShipDate"].dt.dayofweek >= 5)
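A small sketch of the calendar features on three sample dates. The first two match the first order in the data (ordered on a Wednesday, shipped on a Saturday); the third is an extra date chosen to show that an ISO week can belong to the previous ISO year, which matters if `OrderYear` is ever combined with `OrderWeekOfYear`. The column names in the toy frame are illustrative.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2017-11-08", "2017-11-11", "2017-01-01"]))
iso = dates.dt.isocalendar()  # DataFrame with ISO year, week, and day columns

features = pd.DataFrame({
    "Year": dates.dt.year,
    "IsoYear": iso["year"],
    "IsoWeek": iso["week"],
    "IsWeekend": dates.dt.dayofweek >= 5,  # Monday=0, so 5 and 6 are Sat/Sun
})
print(features)
```

For 2017-01-01 the calendar year is 2017, but the ISO calendar places it in week 52 of 2016.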




In [270]:
pd.set_option("display.max_columns", None)
df_new.head(2)

Unnamed: 0,OrderID,OrderDate,ShipDate,ShipMode,CustomerID,CustomerName,Segment,Country,City,State,PostalCode,Region,ProductID,Category,SubCategory,ProductName,Sales,DaysToShip,OrderYear,OrderMonth,OrderQuarter,OrderYearMonth,OrderWeekOfYear,OrderMonthName,OrderIsWeekendOrder,ShipYear,ShipMonth,ShipQuarter,ShipYearMonth,ShipWeekOfYear,ShipMonthName,ShipIsWeekendShip
0,ca-2017-152156,2017-11-08,2017-11-11,second class,cg-12520,claire gute,consumer,united states,henderson,kentucky,42420,south,FUR-BO-10001798,furniture,bookcases,bush somerset collection bookcase,261.96,3,2017,11,4,2017-11,45,November,False,2017,11,4,2017-11,45,November,True
1,ca-2017-152156,2017-11-08,2017-11-11,second class,cg-12520,claire gute,consumer,united states,henderson,kentucky,42420,south,FUR-CH-10000454,furniture,chairs,"hon deluxe fabric upholstered stacking chairs,...",731.94,3,2017,11,4,2017-11,45,November,False,2017,11,4,2017-11,45,November,True


In [272]:
df_new["Segment"].unique()

array(['consumer', 'corporate', 'home office'], dtype=object)

In [279]:

# Build the output path relative to the notebook's working directory
myPath = Path.cwd().parent / "dataset" / "processed" / "preprocessed_data.csv"

# Create the parent directories if they do not exist yet
myPath.parent.mkdir(parents=True, exist_ok=True)

# Save the processed DataFrame as CSV
print(myPath)
df_new.to_csv(myPath, index=False)

/Users/haseebsagheer/Documents/Python Learning/Cloud-Retail-Insights/dataset/processed/preprocessed_data.csv
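CSV does not store dtypes, so the datetime and nullable-integer columns have to be restored when the file is read back. A minimal sketch with a made-up two-row frame and an in-memory buffer standing in for the file:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "OrderDate": pd.to_datetime(["2017-11-08", "2016-10-11"]),
    "PostalCode": pd.array([42420, None], dtype="Int64"),
    "Sales": [261.96, 957.5775],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: dates come back as text, and the column with a gap becomes float.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read with explicit hints to recover the original dtypes.
buffer.seek(0)
typed = pd.read_csv(buffer, parse_dates=["OrderDate"], dtype={"PostalCode": "Int64"})

print(plain.dtypes)
print(typed.dtypes)
```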
