# Task 2: Data Transformation & Feature Engineering

In [12]:
from sklearn.preprocessing import MinMaxScaler

# Copy the original data so the raw DataFrame is left unchanged
df_transformed = df.copy()

# Normalize columns used in scoring
scaler = MinMaxScaler()
df_transformed[['Normalized_Irradiance', 'Normalized_GridAccess', 
                'Normalized_Infrastructure', 'Normalized_Cost']] = scaler.fit_transform(
    df_transformed[['Solar_Irradiance_kWh_m2_day', 
                    'Grid_Access_Percent', 
                    'Infrastructure_Index', 
                    'Electricity_Cost_USD_per_kWh']]
)

# Invert normalized grid access: we want higher score if access is lower
df_transformed['Inverse_GridAccess'] = 1 - df_transformed['Normalized_GridAccess']

# Apply weights:
# - Irradiance: 35%
# - Inverse grid access: 25% (priority if underserved)
# - Infrastructure index: 20%
# - Electricity cost (normalized): 20%

df_transformed['Solar_Access_Score'] = (
    0.35 * df_transformed['Normalized_Irradiance'] +
    0.25 * df_transformed['Inverse_GridAccess'] +
    0.20 * df_transformed['Normalized_Infrastructure'] +
    0.20 * df_transformed['Normalized_Cost']
)

# Sort by score for inspection
df_transformed_sorted = df_transformed.sort_values(by='Solar_Access_Score', ascending=False)
df_transformed_sorted

| Unnamed: 0 | Region | Solar_Irradiance_kWh_m2_day | Rural_Pop_Density_per_km2 | Grid_Access_Percent | Infrastructure_Index | Electricity_Cost_USD_per_kWh | Terrain_Ruggedness_Score | Normalized_Irradiance | Normalized_GridAccess | Normalized_Infrastructure | Normalized_Cost | Inverse_GridAccess | Solar_Access_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | Region_32 | 7.35 | 111 | 46.4 | 0.48 | 0.39 | 0.19 | 1.0 | 0.352941 | 0.382353 | 0.965517 | 0.647059 | 0.781339 |
| 6 | Region_7 | 7.08 | 376 | 55.7 | 0.68 | 0.38 | 0.19 | 0.929134 | 0.477273 | 0.676471 | 0.931034 | 0.522727 | 0.77738 |
| 2 | Region_3 | 6.15 | 64 | 28.3 | 0.49 | 0.36 | 0.57 | 0.685039 | 0.110963 | 0.397059 | 0.862069 | 0.889037 | 0.713849 |
| 47 | Region_48 | 6.56 | 304 | 73.4 | 0.82 | 0.37 | 0.3 | 0.792651 | 0.713904 | 0.882353 | 0.896552 | 0.286096 | 0.704733 |
| 30 | Region_31 | 4.9 | 456 | 20.0 | 0.86 | 0.28 | 0.63 | 0.356955 | 0.0 | 0.941176 | 0.586207 | 1.0 | 0.680411 |
| 12 | Region_13 | 5.74 | 188 | 35.2 | 0.46 | 0.39 | 0.93 | 0.577428 | 0.203209 | 0.352941 | 0.965517 | 0.796791 | 0.664989 |
| 9 | Region_10 | 6.04 | 178 | 30.4 | 0.59 | 0.27 | 0.37 | 0.656168 | 0.139037 | 0.544118 | 0.551724 | 0.860963 | 0.664068 |
| 0 | Region_1 | 6.0 | 90 | 23.0 | 0.39 | 0.31 | 0.33 | 0.645669 | 0.040107 | 0.25 | 0.689655 | 0.959893 | 0.653889 |
| 33 | Region_34 | 4.44 | 342 | 32.3 | 0.79 | 0.37 | 0.4 | 0.23622 | 0.164439 | 0.838235 | 0.896552 | 0.835561 | 0.638525 |
| 44 | Region_45 | 4.02 | 306 | 39.0 | 0.9 | 0.4 | 0.86 | 0.125984 | 0.254011 | 1.0 | 1.0 | 0.745989 | 0.630592 |


---

### **Objective**

To assist in prioritizing regions for solar energy investment, I've created a composite **"Solar Access Score"** based on four weighted indicators:

* **Solar Irradiance (35%)** – Primary driver for solar yield
* **Inverse Grid Access (25%)** – Regions with poor grid access are higher priority
* **Infrastructure Index (20%)** – Indicates readiness for deployment logistics
* **Electricity Cost (20%)** – Higher-cost regions offer greater economic return
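
As a quick sanity check, the weighting above can be sketched in plain Python and compared against the top row of the output table (Region_32). The names `WEIGHTS` and `solar_access_score` are illustrative and do not appear in the notebook. The component values are copied from that row's normalized columns.

```python
# Weights taken from the list above; the dictionary keys are illustrative names.
WEIGHTS = {
    "irradiance": 0.35,
    "inverse_grid_access": 0.25,
    "infrastructure": 0.20,
    "cost": 0.20,
}

# The weights should sum to 1 so the score stays within [0, 1].
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9


def solar_access_score(components):
    """Weighted sum of already-normalized components (hypothetical helper)."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Normalized values for Region_32, copied from the output table.
region_32 = {
    "irradiance": 1.0,
    "inverse_grid_access": 0.647059,
    "infrastructure": 0.382353,
    "cost": 0.965517,
}

score = solar_access_score(region_32)
print(round(score, 6))  # 0.781339, matching Solar_Access_Score in the table
```

Because the weights sum to 1 and every input lies in [0, 1], the score is also bounded by [0, 1], which makes regions directly comparable.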

---

### **Calculation Breakdown**

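Each input was **normalized** to the range [0, 1] with Min-Max scaling, `(x - min) / (max - min)`, so that indicators measured in different units are comparable. The score is then built in two steps:

* **Inverse Grid Access** = `1 - (normalized grid access)`, so that underserved areas score higher
* **Solar Access Score** = `0.35 * irradiance + 0.25 * inverse grid access + 0.20 * infrastructure + 0.20 * cost`, a weighted sum of the scaled inputs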
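
To make the scaling and inversion steps concrete, here is a minimal sketch in plain Python. It illustrates the idea rather than the notebook's own code (which uses `MinMaxScaler`). The `min_max` helper and the three grid-access values are hypothetical and chosen only for easy arithmetic.

```python
def min_max(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1.

    Hypothetical helper for illustration; it does not guard against the
    case where all values are equal (which would divide by zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical grid-access percentages, not taken from the dataset.
grid_access = [20.0, 50.0, 80.0]

normalized = min_max(grid_access)       # [0.0, 0.5, 1.0]
inverse = [1 - n for n in normalized]   # [1.0, 0.5, 0.0]

print(normalized)
print(inverse)
```

The region with the lowest grid access ends up with an inverse value of 1.0, so it contributes the full 25% weight to the score.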

---

### **Business Justification for Weighting**

The chosen weights reflect Prime Frontier's operational focus on:

* **Maximizing ROI and energy yield** (solar irradiance carries the heaviest weight)
* **Targeting under-electrified regions** (emphasized via inverse grid access)
* **Feasibility of implementation** (logistics and access depend on infrastructure)
* **Financial leverage** (higher electricity cost = greater savings from solar)