#### ***Extract***

In [None]:
import pandas as pd
import numpy as np

# Extraction process
try:
    df_financials = pd.read_csv(r"C:\Users\your_user\your_folder\constituents-financials_csv.csv")
    df_esg = pd.read_csv(r"C:\Users\your_user\your_folder\SP 500 ESG Risk Ratings.csv")
except FileNotFoundError as e:
    print(f"Error: {e}")
    raise  # stop here; later cells depend on both dataframes (exit() would kill the notebook kernel)

In [2]:
# Analysis of Financials (df_financials)
print("Analysis: Financials (df_financials)")
print("\n Info Schema and Data Types:")
df_financials.info()
print("\n Head Sample Data:")
print(df_financials.head())

# Statistical summary of numeric columns
print("\n Numeric Statistics:")
print(df_financials.describe())

Analysis: Financials (df_financials)

 Info Schema and Data Types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505 entries, 0 to 504
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Symbol          505 non-null    object 
 1   Name            505 non-null    object 
 2   Sector          505 non-null    object 
 3   Price           505 non-null    float64
 4   Price/Earnings  503 non-null    float64
 5   Dividend Yield  505 non-null    float64
 6   Earnings/Share  505 non-null    float64
 7   52 Week Low     505 non-null    float64
 8   52 Week High    505 non-null    float64
 9   Market Cap      505 non-null    int64  
 10  EBITDA          505 non-null    float64
 11  Price/Sales     505 non-null    float64
 12  Price/Book      497 non-null    float64
 13  SEC Filings     505 non-null    object 
dtypes: float64(9), int64(1), object(4)
memory usage: 55.4+ KB

 Head Sample Data:
  Symbol                 

Main Finding: 

The 52 Week Low and 52 Week High columns are inverted. The describe() statistics confirm it: the mean of the column labelled Low ($122.62) is greater than the mean of the column labelled High ($83.53), which cannot be right. We will swap the two columns in the transformation step.
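Comparing means is suggestive, but a row-level comparison is more direct. The following is a minimal sketch of that check, assuming a toy frame that only borrows the column names from the real file; the symbols and values are invented for illustration.

```python
import pandas as pd

# Toy frame: invented symbols and values, real column names.
toy = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC"],
    "52 Week Low": [259.77, 68.39, 64.60],   # labelled Low, but the larger value
    "52 Week High": [175.49, 48.93, 42.28],  # labelled High, but the smaller value
})

# Count the rows where the column labelled Low exceeds the column labelled High.
inverted_rows = int((toy["52 Week Low"] > toy["52 Week High"]).sum())
share_inverted = inverted_rows / len(toy)

print(f"{inverted_rows} of {len(toy)} rows inverted ({share_inverted:.0%})")
```

If nearly every row is inverted, a column swap is the right fix; if only a few rows are, the problem is more likely bad individual records.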

Minor Issues:

Nulls: There are minor, acceptable data gaps. 2 companies are missing Price/Earnings (505 - 503 = 2), and 8 are missing Price/Book (505 - 497 = 8). This isn't an error; it's incomplete data we'll have to live with.

Cleanup: The SEC Filings column is confirmed as object (text) and is not needed for our KPIs. It will be dropped.

File Quality: Our output shows Price/Earnings is already a float64 (numeric), which is good. The file is cleaner than we might have assumed.

In [3]:
# Analysis of ESG (df_esg)
print("Analysis: ESG (df_esg)")
print("\n Info Schema and Data Types:")
df_esg.info()
print("\n Head Sample Data:")
print(df_esg.head())

Analysis: ESG (df_esg)

 Info Schema and Data Types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Symbol                  503 non-null    object 
 1   Name                    503 non-null    object 
 2   Address                 502 non-null    object 
 3   Sector                  502 non-null    object 
 4   Industry                502 non-null    object 
 5   Full Time Employees     498 non-null    object 
 6   Description             502 non-null    object 
 7   Total ESG Risk score    430 non-null    float64
 8   Environment Risk Score  430 non-null    float64
 9   Governance Risk Score   430 non-null    float64
 10  Social Risk Score       430 non-null    float64
 11  Controversy Level       430 non-null    object 
 12  Controversy Score       403 non-null    float64
 13  ESG Risk Percentile     430 non-null    ob

In [4]:
# checking text-based columns we need to clean
print("\n Unique 'Controversy Level' values:")
print(df_esg['Controversy Level'].unique())

print("\n Unique 'ESG Risk Percentile' values (first 10):")
print(df_esg['ESG Risk Percentile'].unique()[:10])


 Unique 'Controversy Level' values:
[nan 'Moderate Controversy Level' 'Low Controversy Level'
 'Severe Controversy Level' 'None Controversy Level'
 'Significant Controversy Level' 'High Controversy Level']

 Unique 'ESG Risk Percentile' values (first 10):
[nan '50th percentile' '66th percentile' '38th percentile'
 '59th percentile' '23rd percentile' '53rd percentile' '28th percentile'
 '21st percentile' '55th percentile']


Main Finding: 

Large Data Gaps. Out of 503 companies, 73 are missing all primary ESG data (503 - 430 non-null Total ESG Risk score values = 73). This is a major limitation. These NaN values are not errors; they indicate that a large share of the companies are "Not Rated."

Main Finding: Dirty Text Columns. The unique outputs confirm our cleaning plan is necessary.

Controversy Level: The values are verbose (e.g., 'Moderate Controversy Level'). We must strip the "Controversy Level" suffix.

ESG Risk Percentile: This is an object (text) column, not numeric (e.g., '50th percentile'). We will have to extract the number.
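The two cleaning steps described above can be sketched on a small sample. This is an illustration only, assuming two hand-made Series that reuse values seen in the unique() output; it relies on the pandas `.str` accessor leaving missing values untouched, so "Not Rated" rows stay NaN.

```python
import numpy as np
import pandas as pd

# Small hand-made samples using values from the unique() output above.
level = pd.Series([np.nan, "Moderate Controversy Level", "None Controversy Level"])
pct = pd.Series([np.nan, "50th percentile", "23rd percentile"])

# Strip the literal suffix; regex=False treats the pattern as plain text.
level_clean = level.str.replace(" Controversy Level", "", regex=False).str.strip()

# The capture group keeps only the digits; expand=False returns a Series.
pct_clean = pct.str.extract(r"(\d+)", expand=False).astype(float)

print(level_clean.tolist())
print(pct_clean.tolist())
```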

Cleanup and Redundancy:

Redundant Columns: This file has Name and Sector columns, but df_financials already has these. Since df_financials is our base table (our "source of truth"), the Name and Sector in this file are redundant. We will drop them to prevent a merge conflict (which would create Name_x, Name_y columns).

Useless Columns: Address, Full Time Employees, and Description are useless for our KPIs. They're long text fields we can't score or filter on, so they will be dropped.

Useful Column: Industry (e.g., "Solar") is not in the financial file and is not redundant. It's valuable data for user filtering, so we will keep this column.

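File Mismatch: This file has 503 entries; the financial file has 505. Since df_financials is our base table, a LEFT JOIN is the right strategy: every financial record is kept, and any company that isn't in this file will show NaN for all ESG fields. The row counts imply at least two such companies; the exact number depends on how many Symbols the two files actually share, which the row counts alone cannot tell us.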
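Row counts alone do not say how many tickers the two files have in common. A minimal sketch of a key-overlap check follows, assuming two toy frames with invented symbols; the same set operations would apply to the Symbol columns of df_financials and df_esg.

```python
import pandas as pd

# Toy stand-ins for the two files; the symbols are invented.
fin = pd.DataFrame({"Symbol": ["AAA", "BBB", "CCC", "DDD"]})
esg = pd.DataFrame({"Symbol": ["BBB", "CCC", "EEE"]})

fin_symbols = set(fin["Symbol"])
esg_symbols = set(esg["Symbol"])

in_both = fin_symbols & esg_symbols      # will match in the join
only_in_fin = fin_symbols - esg_symbols  # kept by a LEFT JOIN, ESG fields NaN
only_in_esg = esg_symbols - fin_symbols  # dropped by a LEFT JOIN

print(len(in_both), sorted(only_in_fin), sorted(only_in_esg))
```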

#### ***Null Value Analysis (Pre-Transform)***

In [5]:
print("\nFinancials Nulls:")
print(df_financials.isna().sum())

print("\nESG Nulls:")
print(df_esg.isna().sum())


Financials Nulls:
Symbol            0
Name              0
Sector            0
Price             0
Price/Earnings    2
Dividend Yield    0
Earnings/Share    0
52 Week Low       0
52 Week High      0
Market Cap        0
EBITDA            0
Price/Sales       0
Price/Book        8
SEC Filings       0
dtype: int64

ESG Nulls:
Symbol                      0
Name                        0
Address                     1
Sector                      1
Industry                    1
Full Time Employees         5
Description                 1
Total ESG Risk score       73
Environment Risk Score     73
Governance Risk Score      73
Social Risk Score          73
Controversy Level          73
Controversy Score         100
ESG Risk Percentile        73
ESG Risk Level             73
dtype: int64


Financials (df_financials): The nulls are exactly as we thought. They are minor and acceptable: 2 nulls for Price/Earnings and 8 for Price/Book. This shows incomplete data, but it is not a big issue.

ESG (df_esg): This confirms the major data gap. The key KPI columns (Total ESG Risk score, Environment Risk Score, Controversy Level, etc.) are all missing 73 records. This is the "Not Rated" group we identified.

It also shows that Controversy Score is in even worse shape, with 100 nulls. This just reinforces that the ESG data is spotty, but it doesn't change our plan.
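Raw null counts are easier to compare across files once they are expressed as a share of rows. The sketch below is one way to do that, assuming a toy frame with invented values and two of the ESG column names.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names come from the ESG file.
toy = pd.DataFrame({
    "Total ESG Risk score": [21.0, np.nan, 15.5, np.nan],
    "Controversy Score": [2.0, np.nan, np.nan, np.nan],
})

# isna().sum() counts nulls; isna().mean() gives the fraction of null rows.
null_report = pd.DataFrame({
    "nulls": toy.isna().sum(),
    "pct_null": (toy.isna().mean() * 100).round(1),
}).sort_values("nulls", ascending=False)

print(null_report)
```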

#### ***Simulate Proprietary Data (df_proprietary)***

In [6]:

print("Simulating Proprietary Data")

# We are going to use symbols from the financials file to ensure a perfect match
symbols = df_financials['Symbol'].unique()

# Let's now generate a random "Proprietary Values Score" between 40 and 100
np.random.seed(42) # Use a seed for consistent, repeatable results
prop_scores = np.random.randint(40, 101, size=len(symbols))

df_proprietary = pd.DataFrame({'Symbol': symbols, 'Proprietary Values Score': prop_scores})

print(f"Simulated {len(df_proprietary)} proprietary scores.")
print("Verifying head of simulated data:")
print(df_proprietary.head())

Simulating Proprietary Data
Simulated 505 proprietary scores.
Verifying head of simulated data:
  Symbol  Proprietary Values Score
0    MMM                        78
1    AOS                        91
2    ABT                        68
3   ABBV                        54
4    ACN                        82


Result: 

The simulation was successful. It created a new df_proprietary dataframe with 505 records, one for each Symbol in our df_financials base table. The head() output confirms the Proprietary Values Score column is populated with random integers.

Why are we doing this? We're simulating this data because the Proprietary Values Score is a critical KPI from our Part 1 business plan. We must have this column to prove our DSS can work as designed. Since we don't have a real file of internal scores, simulating it is the only way to test our complete ETL process. Using the df_financials['Symbol'] as the key guarantees it has the exact same 505 companies as our base table, ensuring a technically perfect merge.

Why isn't this introducing bias? The scores are random numbers with no real-world correlation, so they carry no signal that could skew results. Our goal at this stage is not to find actual investment insights but to prove the technical capability of our ETL process and DSS, so bias is not a concern for this project.
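As an alternative to seeding NumPy's global state, the simulation could be wrapped in a function that owns its random generator. This is a sketch under assumptions: `simulate_scores` and the ticker list are hypothetical, and `np.random.default_rng` produces different numbers than `np.random.seed(42)` followed by `randint`, so it would not reproduce the scores shown above.

```python
import numpy as np
import pandas as pd

def simulate_scores(symbols, seed=42):
    """Hypothetical helper: one random score in [40, 100] per symbol."""
    rng = np.random.default_rng(seed)  # local generator, no global state
    scores = rng.integers(40, 101, size=len(symbols))  # upper bound is exclusive
    return pd.DataFrame({"Symbol": symbols, "Proprietary Values Score": scores})

symbols = ["AAA", "BBB", "CCC"]  # invented tickers
df_a = simulate_scores(symbols)
df_b = simulate_scores(symbols)

# Same seed, same scores: the simulation is repeatable.
print(df_a.equals(df_b))
```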

#### ***Transformation***

In [7]:
print("Transforming Data (based on analysis)")

# T1: Fix inverted '52 Week Low' and '52 Week High' in df_financials
print("T1: Fixing inverted 52-week columns...")
# rename() maps each label independently, so one call swaps the two names
df_financials.rename(columns={
    '52 Week Low': '52 Week High',
    '52 Week High': '52 Week Low'
}, inplace=True)

# T2: Handling 'Price/Earnings': convert to numeric
# Ensuring the 2 nulls from Block 5 are handled correctly
print("T2: Cleaning 'Price/Earnings'...")
df_financials['Price/Earnings'] = pd.to_numeric(df_financials['Price/Earnings'], errors='coerce')

# T3: Cleaning 'Controversy Level' strings in df_esg
print("T3: Cleaning 'Controversy Level'...")
# No .astype(str) here: it can turn missing values into the literal text 'nan'
df_esg['Controversy Level'] = df_esg['Controversy Level'].str.replace(' Controversy Level', '', regex=False).str.strip()

# T4: Cleaning 'ESG Risk Percentile' strings in df_esg
print("T4: Cleaning 'ESG Risk Percentile'...")
# Use regex to extract only the digits, then convert to float
df_esg['ESG Risk Percentile'] = df_esg['ESG Risk Percentile'].astype(str).str.extract(r'(\d+)').astype(float)

# T5: Field Selection (Dropping unnecessary columns)
print("T5: Dropping unneeded and redundant columns...")
# Dropping from financials:
df_financials = df_financials.drop(columns=['SEC Filings'])

# Dropping from ESG (keeping 'Industry' as we discussed):
df_esg = df_esg.drop(columns=[
    'Name',                 # Redundant with financials
    'Address',              # Useless for KPIs
    'Sector',               # Redundant with financials
    'Full Time Employees',  # Useless for KPIs
    'Description'           # Useless for KPIs
])

print("\nTransformation complete.")
print(f"VERIFY: df_financials columns: {list(df_financials.columns)}")
print(f"VERIFY: df_esg columns: {list(df_esg.columns)}")

Transforming Data (based on analysis)
T1: Fixing inverted 52-week columns...
T2: Cleaning 'Price/Earnings'...
T3: Cleaning 'Controversy Level'...
T4: Cleaning 'ESG Risk Percentile'...
T5: Dropping unneeded and redundant columns...

Transformation complete.
VERIFY: df_financials columns: ['Symbol', 'Name', 'Sector', 'Price', 'Price/Earnings', 'Dividend Yield', 'Earnings/Share', '52 Week High', '52 Week Low', 'Market Cap', 'EBITDA', 'Price/Sales', 'Price/Book']
VERIFY: df_esg columns: ['Symbol', 'Industry', 'Total ESG Risk score', 'Environment Risk Score', 'Governance Risk Score', 'Social Risk Score', 'Controversy Level', 'Controversy Score', 'ESG Risk Percentile', 'ESG Risk Level']


#### ***Verification (Post-Transform)***

In [8]:
print("--- Verifying Transformations ---")

print("\nFinancials Numeric Stats (Post-Transform):")
# This check proves the 52-week columns are fixed.
# The mean High should now be > mean Low.
print(df_financials[['52 Week Low', '52 Week High']].describe())

print("\nFinancials Info (Post-Transform):")
# This check proves 'Price/Earnings' is now float64
df_financials.info()

print("\nESG Unique Values (Post-Transform):")
# This check proves 'Controversy Level' is clean
print(df_esg['Controversy Level'].unique())

print("\nESG Head (Post-Transform):")
# This check proves 'ESG Risk Percentile' is now a number (float64)
# and that 'Industry' was kept.
print(df_esg.head())

--- Verifying Transformations ---

Financials Numeric Stats (Post-Transform):
       52 Week Low  52 Week High
count   505.000000    505.000000
mean     83.536616    122.623832
std     105.725473    155.362140
min       2.800000      6.590000
25%      38.430000     56.250000
50%      62.850000     86.680000
75%      96.660000    140.130000
max    1589.000000   2067.990000

Financials Info (Post-Transform):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505 entries, 0 to 504
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Symbol          505 non-null    object 
 1   Name            505 non-null    object 
 2   Sector          505 non-null    object 
 3   Price           505 non-null    float64
 4   Price/Earnings  503 non-null    float64
 5   Dividend Yield  505 non-null    float64
 6   Earnings/Share  505 non-null    float64
 7   52 Week High    505 non-null    float64
 8   52 Week Low     505 non-null  

Main Fix Verified: 

The output confirms our primary fix. The 52 Week High (mean $122.62) is now correctly greater than the 52 Week Low (mean $83.53). The columns are no longer inverted.

Text Cleanup Verified: The [ESG Unique Values] output proves our string cleaning worked. The Controversy Level column is now a clean list of single-word categories (['Moderate', 'Low', 'Severe', 'None', 'Significant', 'High']) instead of verbose sentences.

Numeric Conversion Verified: The [ESG Head] output shows ESG Risk Percentile is now a clean number (e.g., 50.0, 66.0), not a text string. The [Financials Info] block also confirms Price/Earnings is correctly typed as float64.

Column Selection Verified: The [ESG Head] output confirms our plan to keep the Industry column was successful.
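The checks above are done by reading printed output. They could also be expressed as code so that a failed transformation is caught automatically. The following is a sketch only: `verify` is a hypothetical helper, the toy frames use invented values, and whether the real data passes the row-level High/Low check has not been confirmed here.

```python
import pandas as pd

def verify(fin: pd.DataFrame, esg: pd.DataFrame) -> list:
    """Hypothetical helper: return a list of problems; empty means all passed."""
    problems = []
    if not (fin["52 Week High"] >= fin["52 Week Low"]).all():
        problems.append("52 Week High is below 52 Week Low in at least one row")
    levels = esg["Controversy Level"].dropna().astype(str)
    if levels.str.contains("Controversy Level", regex=False).any():
        problems.append("Controversy Level still carries its suffix")
    if not pd.api.types.is_float_dtype(esg["ESG Risk Percentile"]):
        problems.append("ESG Risk Percentile is not a float column")
    return problems

# Toy frames with invented values.
good_fin = pd.DataFrame({"52 Week Low": [175.49, 48.93], "52 Week High": [259.77, 68.39]})
bad_fin = pd.DataFrame({"52 Week Low": [259.77], "52 Week High": [175.49]})
good_esg = pd.DataFrame({
    "Controversy Level": ["Moderate", None, "Low"],
    "ESG Risk Percentile": [50.0, None, 23.0],
})

print(verify(good_fin, good_esg))
print(verify(bad_fin, good_esg))
```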

#### ***Load (Merge)***

In [None]:
print("Loading (Merging) Data")

# L1: We will merge Financials (base) with ESG data
# Using a LEFT JOIN to keep all 505 financial records
print("Merging Financials and ESG data (LEFT JOIN)...")
df_merged = pd.merge(df_financials, df_esg, on='Symbol', how='left')

# L2: Then we will merge the result with Proprietary data
# Using a LEFT JOIN again to keep all 505 records
print("Merging result with Proprietary data (LEFT JOIN)...")
df_final = pd.merge(df_merged, df_proprietary, on='Symbol', how='left')

print(f"Merge complete. Final dataframe has {len(df_final)} rows.")
print("--- Verifying head of final merged data ---")
print(df_final.head())

Loading (Merging) Data
Merging Financials and ESG data (LEFT JOIN)...
Merging result with Proprietary data (LEFT JOIN)...
Merge complete. Final dataframe has 505 rows.
--- Verifying head of final merged data ---
  Symbol                 Name                  Sector   Price  Price/Earnings  \
0    MMM           3M Company             Industrials  222.89           24.31   
1    AOS      A.O. Smith Corp             Industrials   60.24           27.76   
2    ABT  Abbott Laboratories             Health Care   56.27           22.51   
3   ABBV          AbbVie Inc.             Health Care  108.48           19.41   
4    ACN        Accenture plc  Information Technology  150.51           25.47   

   Dividend Yield  Earnings/Share  52 Week High  52 Week Low    Market Cap  \
0        2.332862            7.92        259.77      175.490  138721055226   
1        1.147959            1.70         68.39       48.925   10783419933   
2        1.908982            0.26         64.60    

The merge was successful, and our LEFT JOIN strategy worked perfectly.

Row Count: The output "Merge complete. Final dataframe has 505 rows." confirms our logic was sound. We started with 505 financial records, joined them against 503 ESG records and 505 proprietary records, and correctly ended with 505 rows. No companies were lost.

Schema: The head() output is the first look at our final, unified dataset. It visually confirms that all our key columns from all three sources are now in a single row for each company:

Financial data (e.g., Price, 52 Week High/Low)

ESG data (e.g., Industry, Total ESG Risk score)

Proprietary data (e.g., Proprietary Values Score)

No Conflicts: Critically, there are no Name_x, Name_y columns. This proves that our transformation step to drop the redundant Name and Sector columns from the ESG file was the correct decision and prevented a messy merge.
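A row count of 505 shows that no base rows were lost, but not how many rows actually found an ESG match. One way to see that during the merge is sketched below, assuming toy frames with invented symbols and values. It uses two documented `pd.merge` options: `validate="one_to_one"` raises an error if Symbol is duplicated on either side, and `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Toy stand-ins with invented symbols and values.
fin = pd.DataFrame({"Symbol": ["AAA", "BBB", "CCC"], "Price": [10.0, 20.0, 30.0]})
esg = pd.DataFrame({
    "Symbol": ["BBB", "CCC", "EEE"],
    "Total ESG Risk score": [21.0, None, 15.0],  # CCC is present but not rated
})

merged = pd.merge(
    fin, esg, on="Symbol", how="left",
    validate="one_to_one",  # fail loudly on duplicate Symbols
    indicator=True,         # adds '_merge': 'both' or 'left_only' for a left join
)

match_counts = merged["_merge"].value_counts()
print(match_counts)
```

In this toy case AAA is `left_only` (no ESG record at all), while CCC is `both` with a NaN score (matched but not rated). Both end up as nulls in the ESG columns, and `_merge` is what tells them apart.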

#### ***Final Analysis (Merged Data)***

In [10]:
print("--- Final Analysis of Merged Data ---")

print("\nFinal Schema and Data Types:")
df_final.info()

print("\nFinal Null Counts (Post-Merge):")
print(df_final.isna().sum())

--- Final Analysis of Merged Data ---

Final Schema and Data Types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505 entries, 0 to 504
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Symbol                    505 non-null    object 
 1   Name                      505 non-null    object 
 2   Sector                    505 non-null    object 
 3   Price                     505 non-null    float64
 4   Price/Earnings            503 non-null    float64
 5   Dividend Yield            505 non-null    float64
 6   Earnings/Share            505 non-null    float64
 7   52 Week High              505 non-null    float64
 8   52 Week Low               505 non-null    float64
 9   Market Cap                505 non-null    int64  
 10  EBITDA                    505 non-null    float64
 11  Price/Sales               505 non-null    float64
 12  Price/Book                497 non-null    float64
 1

Main Finding: 

Critical Data Mismatch. This is the most important finding of the entire ETL process. The [Final Null Counts] show 137 nulls for Total ESG Risk score and all other key ESG fields. This proves our two Kaggle files are not clean subsets of each other. This is the real-world data problem that we would face.

Here's the evidence from the numbers:

Our df_esg file (Block 4) had 430 companies with ESG scores.

Our final unified file only has 368 companies with ESG scores (505 total rows - 137 nulls).

This means 62 companies (430 - 368) that had an ESG score were discarded during the merge. Why? Because their Symbols (ticker) didn't exist in our df_financials base file.

This also means 137 companies from our df_financials base file have no ESG data. This is the true number of "Not Rated" companies in our final universe.

This isn't a failure. This is the finding. We identified and quantified a major data integrity issue. Our final dataset for the dashboard will be based on the 368 companies where we have a complete match.
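The arithmetic behind these figures can be written out explicitly. The sketch below uses only the counts quoted above and assumes each Symbol appears once per file, so that rows and companies are interchangeable.

```python
# Counts quoted in the analysis above.
financial_rows = 505          # rows in the base table and in the merged result
esg_rated_in_source = 430     # non-null Total ESG Risk score in df_esg
esg_nulls_after_merge = 137   # null Total ESG Risk score in df_final

# Companies in the final dataset that have an ESG score.
rated_after_merge = financial_rows - esg_nulls_after_merge

# Rated companies in df_esg whose Symbol had no match in df_financials.
rated_lost_in_merge = esg_rated_in_source - rated_after_merge

print(rated_after_merge, rated_lost_in_merge)
```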

Minor Findings:

The original Price/Earnings (2 nulls) and Price/Book (8 nulls) counts are unchanged. This is correct.

Proprietary Values Score has 0 nulls, which is correct for our simulation.

Conclusion: The data is now fully unified. We have all the evidence for our report, including the critical mismatch we just found. All analysis is complete.

#### ***Save Output Files***

In [None]:
print("Saving Output Files")
df_final.to_csv(r"C:\your_user\your_folder\pdss_unified_dataset.csv", index=False)

# Saving the 100-record sample for the assignment submission
df_final.head(100).to_csv(r"C:\your_user\your_folder\pdss_data_sample.csv", index=False)
print("\nSuccessfully created 'pdss_unified_dataset.csv' and 'pdss_data_sample.csv'")
print("--- ETL Process Complete ---")

Saving Output Files

Successfully created 'pdss_unified_dataset.csv' and 'pdss_data_sample.csv'
--- ETL Process Complete ---


#### ***Loading to MySQL***

In [None]:
from sqlalchemy import create_engine

# Database Connection Configuration
db_user = 'root'
db_password = 'your_password'
db_host = 'localhost'
db_port = '3306'
db_name = 'wealth_management_dss'

# Creating SQLAlchemy Engine
connection_str = f"mysql+pymysql://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}"
engine = create_engine(connection_str)

print("Loading Data into MySQL Database")

# We need to rename columns to match the SQL Schema format (removing spaces/slashes)
df_sql = df_final.copy()
df_sql.columns = [
    'symbol', 'company_name', 'sector', 'current_price', 'pe_ratio', 
    'dividend_yield', 'earnings_per_share', 'week_52_high', 'week_52_low', 
    'market_cap', 'ebitda', 'price_to_sales', 'price_to_book', 'industry', 
    'total_esg_risk', 'environment_risk', 'governance_risk', 'social_risk', 
    'controversy_level', 'controversy_score', 'esg_risk_percentile', 
    'esg_risk_level', 'proprietary_score'
]

try:
    # Writing data to MySQL
    # if_exists='replace' will drop the table and recreate it every time we run the ETL
    # if_exists='append' would add duplicates if we aren't careful
    df_sql.to_sql('investment_universe', con=engine, if_exists='replace', index=False)
    print(f"Success: {len(df_sql)} rows loaded into table 'investment_universe'.")
except Exception as e:
    print(f"Error writing to database: {e}")

Loading Data into MySQL Database
Success: 505 rows loaded into table 'investment_universe'.
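Two refinements to the load step may be worth considering: renaming columns through an explicit mapping instead of by position, and reading the row count back from the database after the write. The sketch below illustrates both under stated assumptions: it uses an in-memory SQLite connection from the standard library as a stand-in for MySQL so that it runs without a server, and the toy frame and three-column `rename_map` are invented for illustration.

```python
import sqlite3
import pandas as pd

# Toy frame: invented rows, three of the real column names.
toy = pd.DataFrame({
    "Symbol": ["AAA", "BBB"],
    "Price/Earnings": [24.31, None],
    "52 Week High": [259.77, 68.39],
})

# An explicit mapping does not depend on column order.
rename_map = {
    "Symbol": "symbol",
    "Price/Earnings": "pe_ratio",
    "52 Week High": "week_52_high",
}
unmapped = set(toy.columns) - set(rename_map)
if unmapped:
    raise KeyError(f"No SQL name defined for: {sorted(unmapped)}")
df_sql = toy.rename(columns=rename_map)

# SQLite stand-in for the MySQL engine (assumption, for a self-contained run).
conn = sqlite3.connect(":memory:")
df_sql.to_sql("investment_universe", con=conn, if_exists="replace", index=False)

# Read the count back from the table rather than trusting len(df_sql).
row_count = pd.read_sql_query(
    "SELECT COUNT(*) AS n FROM investment_universe", conn
)["n"].iloc[0]
conn.close()

print(f"{row_count} rows in table 'investment_universe'")
```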
