# Predicting Suitable Solar Energy Potential in Buildings of Karachi
#### Solar installation stakeholders face significant challenges in assessing building potential, often requiring costly and time-consuming site surveys. This project addresses this challenge by analyzing the annual solar energy potential for Karachi's buildings using features from the data set.

## Business Value

#### Reduce assessment costs by quickly screening buildings for solar potential.
#### Support urban planning and renewable energy initiatives.
#### Help property owners evaluate solar investment opportunities.
#### Enable scalable solar adoption strategies across Karachi.


In [2]:
# Business Questions

1. How can the surface area of rooftops be optimized to maximize the potential installable area for solar panels in Karachi?
2. How does the energy potential per year vary across the assumed building types?
3. What is the relationship between estimated building height and energy potential per year, and how can it inform solar panel placement?
4. How can businesses use the estimated capacity factor to predict the efficiency and performance of solar installations?
5. How does the estimated tilt of rooftops affect the energy potential per year, and what adjustments can maximize efficiency?




In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score,mean_squared_error, r2_score,mean_absolute_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor,plot_tree
sns.set(color_codes=True)

df = pd.read_csv("karachi_rooftop_solar_potential.csv")

# Display the first 5 rows
df.head(5)




In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Data Source
#### This dataset contains rooftop solar potential data at the level of individual building structures for a sample area of interest in Karachi. The data was gathered by extracting building rooftop footprint polygons from very high-resolution (0.5 m) satellite stereo imagery. Rooftop angle, obstruction, and shading were taken into account when calculating the suitable area.


# Source (https://energydata.info)

URL: https://energydata.info/dataset/karachi-rooftop-solar-potential-mapping

In [None]:
df.dtypes


- Surface_area: total surface area
- Potential_installable_area: area on which panels can be placed
- Estimated_building_height: height of the building
- Estimated_tilt: the angle at which a panel is placed
- Assumed_building_type (encoded): type of building
- Peak_installable_capacity: maximum installable capacity (kW)


In [None]:
print(df.count())

In [None]:
# City has the same value in every row, so it carries no information
df2 = df.pivot_table(index=['City'], aggfunc='size')
print("Row count per City value:\n", df2)

In [None]:
df=df.drop(['Comment', 'uuid','Unit_installation_price','City'], axis=1)
df.head(5)

In [None]:



duplicate_rows_df = df[df.duplicated()]

# .size would count cells (rows x columns); len() counts rows
print("number of duplicate rows:", len(duplicate_rows_df))
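
To illustrate why a duplicate-row count should come from `len()` (or `.shape[0]`) rather than `.size`, here is a minimal sketch on a made-up three-row frame. The values are hypothetical and are not taken from the Karachi dataset.

```python
import pandas as pd

# Toy frame (hypothetical values): the second and third rows are identical.
toy = pd.DataFrame({
    "Surface_area": [100.0, 250.0, 250.0],
    "Estimated_tilt": [10.0, 12.0, 12.0],
})

dupes = toy[toy.duplicated()]      # rows that repeat an earlier row
n_duplicate_rows = len(dupes)      # number of duplicate rows
n_cells = dupes.size               # rows * columns, which is not a row count

print("duplicate rows:", n_duplicate_rows)
print("cells in duplicate rows:", n_cells)

# Dropping the repeats keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
```

With two columns, `.size` reports twice the number of duplicate rows, so the two figures only agree for a single-column frame.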



In [None]:
print(df.isnull().sum())


In [None]:
# Fill missing heights with the mean height of the same building type
df['Estimated_building_height'] = df['Estimated_building_height'].fillna(df.groupby('Assumed_building_type')['Estimated_building_height'].transform('mean'))
# Fill missing capacity factors with the mean of the same building type
df['Estimated_capacity_factor'] = df['Estimated_capacity_factor'].fillna(df.groupby('Assumed_building_type')['Estimated_capacity_factor'].transform('mean'))
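
The group-wise fill used above can be checked on a small example. This is a sketch with invented heights and building types; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing height in each building type.
toy = pd.DataFrame({
    "Assumed_building_type": ["residential", "residential", "residential",
                              "industrial", "industrial"],
    "Estimated_building_height": [6.0, np.nan, 10.0, 12.0, np.nan],
})

# transform('mean') returns a Series aligned to the original rows,
# holding each row's group mean (NaN values are skipped in the mean).
group_mean = (
    toy.groupby("Assumed_building_type")["Estimated_building_height"]
    .transform("mean")
)

# Only the missing entries are replaced; observed heights are untouched.
toy["height_filled"] = toy["Estimated_building_height"].fillna(group_mean)
print(toy)
```

Swapping `"mean"` for `"median"` in the `transform` call gives a fill that is less sensitive to a few very tall buildings, if that is preferred.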

In [None]:
print(df.isnull().sum())

In [None]:
df.Assumed_building_type.value_counts().nlargest(100).plot(kind='bar', figsize=(10,5))
plt.title("Number of Buildings by Assumed Building Type")
plt.ylabel('Number of buildings')
plt.xlabel('Assumed_building_type');

In [None]:
numeric_cols = ['Surface_area', 'Potential_installable_area', 'Peak_installable_capacity',
                'Energy_potential_per_year', 'Estimated_building_height']
correlation = df[numeric_cols].corr()
plt.figure(figsize=(12, 6))
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Key Metrics')
plt.tight_layout()

# The key metrics are highly correlated
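
Reading strongly correlated pairs off a heatmap by eye can be error-prone, so one option is to list them programmatically. The sketch below uses synthetic data and an arbitrary 0.9 threshold, both of which are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
area = rng.uniform(50, 500, size=100)

# Synthetic frame: installable area is proportional to surface area by
# construction, while height is drawn independently.
toy = pd.DataFrame({
    "Surface_area": area,
    "Potential_installable_area": area * 0.6,
    "Estimated_building_height": rng.uniform(3, 30, size=100),
})

corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each
# pair appears once, then flatten to a Series indexed by column pair.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()

# Pairs whose absolute correlation exceeds the chosen threshold.
strong_pairs = pairs[pairs.abs() > 0.9]
print(strong_pairs)
```

Pairs that show up here are candidates for dropping one of the two columns before fitting linear models.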


In [None]:
plt.figure(figsize=(20, 15))

# 1. Surface Area vs Energy Potential Scatter Plot
plt.subplot(2, 2, 1)
sns.scatterplot(data=df, x='Surface_area', y='Energy_potential_per_year',
                hue='Assumed_building_type', alpha=0.6)
plt.title('Surface Area vs Energy Potential by Building Type')
plt.xlabel('Surface Area')
plt.ylabel('Energy Potential per Year')

# Surface area and energy potential are correlated

In [None]:
# 2. Building Height Distribution
plt.subplot(1, 1,1)
sns.boxplot(data=df, x='Assumed_building_type', y='Estimated_building_height')
plt.xticks(rotation=45)
plt.title('Building Height Distribution by Type')
plt.xlabel('Building Type')
plt.ylabel('Estimated Height')

In [None]:
# 3. Capacity Factor Distribution
plt.subplot(2, 2, 3)
sns.histplot(data=df[df['Estimated_capacity_factor'].notna()],
             x='Estimated_capacity_factor', bins=20)
plt.title('Distribution of Estimated Capacity Factor')
plt.xlabel('Capacity Factor')
plt.ylabel('Count')

In [None]:
# 4. Installation Area Efficiency
df['Installation_efficiency'] = (df['Potential_installable_area'] / df['Surface_area']) * 100
plt.subplot(2, 2, 4)
sns.violinplot(data=df, x='Assumed_building_type', y='Installation_efficiency')
plt.xticks(rotation=45)
plt.title('Installation Area Efficiency by Building Type')
plt.xlabel('Building Type')
plt.ylabel('Installation Efficiency (%)')
plt.tight_layout()

In [None]:
# Additional analysis: Energy potential per surface area
print("\nAverage Energy Potential per Surface Area by Building Type:")
efficiency_by_type = df.groupby('Assumed_building_type').agg({
    'Energy_potential_per_year': 'sum',
    'Surface_area': 'sum'
}).assign(
    energy_density=lambda x: x['Energy_potential_per_year'] / x['Surface_area']
).round(2)
print(efficiency_by_type['energy_density'])

In [None]:

# Group the buildings by assumed type for the per-type summaries that follow
groupeddf = df.groupby('Assumed_building_type')


In [None]:
# Number of buildings in each assumed building type
for key, group in groupeddf:
    print(key + " : " + str(group['Assumed_building_type'].count()))


In [None]:
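# Per-type means as a nested dict: {building_type: {column: mean}}
# (no trailing comma after to_dict, which would wrap the result in a tuple)
groupedAVG = groupeddf.agg({
    'Surface_area': 'mean',
    'Potential_installable_area': 'mean',
    'Peak_installable_capacity': 'mean',
    'Energy_potential_per_year': 'mean',
    'Estimated_building_height': 'mean',
    'Estimated_capacity_factor': 'mean',
    'Estimated_tilt': 'mean',
}).to_dict(orient='index')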

In [None]:
print("Avg Space Utilized")
for key, group in groupeddf:
  # Share of each building type's own surface area that can hold panels
  space = group['Potential_installable_area'].sum() / group['Surface_area'].sum() * 100
  print("for " + key + ":" + str(space) + "%")
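
Per-type space utilization can also be computed without a loop. The following sketch, on invented numbers, divides each building type's total installable area by that type's own total surface area.

```python
import pandas as pd

# Hypothetical rooftops: two residential, one commercial.
toy = pd.DataFrame({
    "Assumed_building_type": ["residential", "residential", "commercial"],
    "Surface_area": [100.0, 300.0, 200.0],
    "Potential_installable_area": [50.0, 90.0, 120.0],
})

# Sum both area columns within each building type.
totals = (
    toy.groupby("Assumed_building_type")[
        ["Potential_installable_area", "Surface_area"]
    ].sum()
)

# Ratio of sums: large roofs weigh more than small ones, unlike a
# plain average of per-building ratios.
utilization_pct = (
    totals["Potential_installable_area"] / totals["Surface_area"] * 100
)
print(utilization_pct.round(2))
```

Dividing by the surface area of the whole dataset instead would give each type's share of the city-wide roof area, which answers a different question.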


In [None]:
# Count buildings whose annual energy potential is zero, per building type
print("buildings with zero potential")
for key, buildings_with_zero_potential in groupeddf:
  zero=len(buildings_with_zero_potential[buildings_with_zero_potential['Energy_potential_per_year'] == 0])
  print("in " + key + ":" + str(zero))



In [None]:
# Count buildings whose energy potential exceeds the mean of their own building type

print("buildings with high potential")
for key, buildings_with_high_potential in groupeddf:
  high=len(buildings_with_high_potential[buildings_with_high_potential['Energy_potential_per_year'] > buildings_with_high_potential['Energy_potential_per_year'].mean()])
  print(key + ":" + str(high))




In [None]:
pivot_table = pd.pivot_table(
    df,
    values=[
        'Surface_area',
        'Potential_installable_area',
        'Peak_installable_capacity',
        'Energy_potential_per_year',
        'Estimated_building_height',
        'Estimated_capacity_factor',
        'Estimated_tilt',
    ],
    index=['Assumed_building_type'],  # group by building type
    aggfunc='mean'
)

print(pivot_table)

In [None]:


def analyze_surface_area_optimization():
    """Analyze surface area utilization and optimization potential"""
    # Calculate utilization ratio
    df['utilization_ratio'] = df['Potential_installable_area'] / df['Surface_area'] * 100

    # Group by building type and calculate mean metrics
    utilization_analysis = df.groupby('Assumed_building_type').agg({
        'Surface_area': 'mean',
        'Potential_installable_area': 'mean',
        'utilization_ratio': 'mean'
    }).round(2)

    # Calculate efficiency score
    utilization_analysis['efficiency_score'] = (
        utilization_analysis['utilization_ratio'] *
        utilization_analysis['Potential_installable_area']
    ).round(2)

    return utilization_analysis

def analyze_energy_potential_by_building():
    """Analyze energy potential across different building types"""
    energy_analysis = df.groupby('Assumed_building_type').agg({
        'Energy_potential_per_year': ['mean', 'sum'],
        'Surface_area': 'mean',
        'Peak_installable_capacity': 'mean'
    }).round(2)

    # Mean energy potential per unit of mean surface area
    # (mean over mean, so numerator and denominator use the same aggregate)
    energy_analysis['energy_per_area'] = (
        energy_analysis[('Energy_potential_per_year', 'mean')] /
        energy_analysis[('Surface_area', 'mean')]
    ).round(2)

    return energy_analysis



def analyze_capacity_factor():
    """Analyze capacity factor patterns and implications"""
    capacity_analysis = df.groupby('Assumed_building_type').agg({
        'Estimated_capacity_factor': ['mean', 'std'],
        'Energy_potential_per_year': 'mean'
    }).round(2)

    # Calculate performance efficiency
    capacity_analysis['performance_ratio'] = (
        capacity_analysis[('Energy_potential_per_year', 'mean')] /
        capacity_analysis[('Estimated_capacity_factor', 'mean')]
    ).round(2)

    return capacity_analysis

def analyze_tilt_impact():
    """Analyze the impact of tilt on energy potential"""
    # All estimated tilts in the data are the same, so compare their mean with
    # a heuristic optimum derived from Karachi's latitude (24.8607° N)
    latitude = 24.8607
    optimal_tilt = latitude * 0.76  # rough rule-of-thumb multiplier, not a site-specific optimum

    tilt_analysis = {
        'current_tilt': df['Estimated_tilt'].mean(),
        'optimal_tilt': optimal_tilt,
        'tilt_difference': optimal_tilt - df['Estimated_tilt'].mean(),
        # Assumes 0.5% improvement per degree of tilt correction (a heuristic)
        'potential_improvement': abs(optimal_tilt - df['Estimated_tilt'].mean()) * 0.5
    }

    return tilt_analysis


# Print results with insights


In [None]:
print("1. Surface Area Optimization Analysis:")
print( analyze_surface_area_optimization())


In [None]:
print("\n2. Energy Potential by Building Type:")
print( analyze_energy_potential_by_building())


In [None]:
print("\n5. Tilt Impact Analysis:")
print( analyze_tilt_impact())

In [None]:
features = ['Surface_area', 'Potential_installable_area', 'Peak_installable_capacity',
            'Estimated_tilt', 'Estimated_building_height', 'Estimated_capacity_factor']

X = df[features]
y = df['Energy_potential_per_year']

# Hold out 20% of the buildings for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [None]:

# Fit the scaler on the training split only, then apply it to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
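
One way to make sure the scaled features are actually the ones a model sees is to bundle the scaler and the estimator in a scikit-learn pipeline. The sketch below runs on synthetic data; `X_demo`, `y_demo`, and the coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Three synthetic features on very different scales.
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = X_demo @ np.array([2.0, 0.05, 300.0]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only and reuses
# those statistics when predicting on the test data.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_tr, y_tr)

r2 = model.score(X_te, y_te)  # R^2 on the held-out split
print(f"R^2 on held-out data: {r2:.3f}")
```

Because scaling lives inside the model object, there is no separate scaled array to forget to pass along.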


In [None]:


def evaluate_model(model_name, model, X_train, X_test, y_train, y_test):
    print(f"\n{'='*50}")
    print(f"{model_name} Analysis")
    print(f"{'='*50}")

    # Train model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Bin actual and predicted values into quartiles (each by its own
    # quartile edges) so that classification metrics can be reported
    labels = ['Low', 'Medium', 'High', 'Very High']
    y_test_binned = pd.qcut(y_test, q=4, labels=labels)
    y_pred_binned = pd.qcut(y_pred, q=4, labels=labels)

    # Calculate metrics
    accuracy = accuracy_score(y_test_binned, y_pred_binned)

    # Print classification report with classes in Low-to-Very-High order
    print("\nClassification Report:")
    print(classification_report(y_test_binned, y_pred_binned, labels=labels))

    # Print accuracy
    print(f"\nAccuracy Score: {accuracy:.4f}")

    # Plot confusion matrix; labels= fixes the row/column order so that it
    # matches the tick labels (the default order is alphabetical)
    plt.figure(figsize=(8, 6))
    cm = confusion_matrix(y_test_binned, y_pred_binned, labels=labels)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=labels,
                yticklabels=labels)
    plt.title(f'{model_name} Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.tight_layout()
    plt.show()

    # Return model and predictions for further analysis
    return model, y_pred, accuracy

def analyze_linear_regression(X_train, X_test, y_train, y_test):
    model = LinearRegression()
    return evaluate_model('Linear Regression', model, X_train, X_test, y_train, y_test)

def analyze_ridge_regression(X_train, X_test, y_train, y_test):
    model = Ridge()
    return evaluate_model('Ridge Regression', model, X_train, X_test, y_train, y_test)

def analyze_lasso_regression(X_train, X_test, y_train, y_test):
    model = Lasso()
    return evaluate_model('Lasso Regression', model, X_train, X_test, y_train, y_test)

def analyze_decision_tree(X_train, X_test, y_train, y_test, feature_names):
    model = DecisionTreeRegressor(max_depth=5, random_state=42)
    trained_model, y_pred, accuracy = evaluate_model('Decision Tree', model, X_train, X_test, y_train, y_test)

    # Plot decision tree
    plt.figure(figsize=(20,10))
    plot_tree(trained_model, feature_names=feature_names, filled=True, rounded=True, fontsize=10)
    plt.title("Decision Tree Structure")
    plt.show()

    # Print feature importance
    importance = pd.DataFrame({
        'Feature': feature_names,
        'Importance': trained_model.feature_importances_
    }).sort_values('Importance', ascending=False)
    print("\nFeature Importance:")
    print(importance)

    return trained_model, y_pred, accuracy

def analyze_random_forest(X_train, X_test, y_train, y_test, feature_names):
    model = RandomForestRegressor(n_estimators=5, random_state=42)
    trained_model, y_pred, accuracy = evaluate_model('Random Forest', model, X_train, X_test, y_train, y_test)

    # Print feature importance
    importance = pd.DataFrame({
        'Feature': feature_names,
        'Importance': trained_model.feature_importances_
    }).sort_values('Importance', ascending=False)
    print("\nFeature Importance:")
    print(importance)

    return trained_model, y_pred, accuracy

def analyze_gradient_boosting(X_train, X_test, y_train, y_test, feature_names):
    model = GradientBoostingRegressor(random_state=42)
    trained_model, y_pred, accuracy = evaluate_model('Gradient Boosting', model, X_train, X_test, y_train, y_test)

    # Print feature importance
    importance = pd.DataFrame({
        'Feature': feature_names,
        'Importance': trained_model.feature_importances_
    }).sort_values('Importance', ascending=False)
    print("\nFeature Importance:")
    print(importance)

    return trained_model, y_pred, accuracy
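
Binning predictions by their own quartiles, as `evaluate_model` does, measures whether buildings are ranked into the same quarter rather than whether predicted values fall in the same numeric range as the actual ones. An alternative is to derive the bin edges from the actual values once and apply them to both series. This is a sketch with invented numbers, not the notebook's method.

```python
import numpy as np
import pandas as pd

labels = ["Low", "Medium", "High", "Very High"]

# Hypothetical actual and predicted energy values.
y_true = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0])
y_hat = np.array([12.0, 18.0, 44.0, 41.0, 48.0, 66.0, 90.0, 5.0])

# Inner quartile edges come from the actual values only; the outer
# edges are opened up so out-of-range predictions still get a bin.
inner = np.quantile(y_true, [0.25, 0.5, 0.75])
edges = np.concatenate(([-np.inf], inner, [np.inf]))

true_bins = pd.cut(y_true, bins=edges, labels=labels)
pred_bins = pd.cut(y_hat, bins=edges, labels=labels)

# Fraction of buildings placed in the same bin by both series.
agreement = float((np.asarray(true_bins) == np.asarray(pred_bins)).mean())
print(f"agreement: {agreement:.2f}")
```

Shared edges also avoid the error `pd.qcut` raises when a model produces only a few distinct predictions and the quartile edges collide.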





In [None]:
results = {}

# Run each model and store (trained model, predictions, binned accuracy)
results['Linear Regression'] = analyze_linear_regression(X_train, X_test, y_train, y_test)



In [None]:
# Ridge is sensitive to feature scale, so use the standardized features
results['Ridge Regression'] = analyze_ridge_regression(X_train_scaled, X_test_scaled, y_train, y_test)


In [None]:
# Lasso is sensitive to feature scale, so use the standardized features
results['Lasso Regression'] = analyze_lasso_regression(X_train_scaled, X_test_scaled, y_train, y_test)


In [None]:
results['Random Forest'] = analyze_random_forest(X_train, X_test, y_train, y_test, features)


In [None]:
results['Gradient Boosting'] = analyze_gradient_boosting(X_train, X_test, y_train, y_test, features)




In [None]:

# Each results entry is (model, predictions, accuracy); index 2 is the accuracy
accuracies = pd.DataFrame({
    'Model': list(results.keys()),
    'Accuracy': [result[2] for result in results.values()]
}).sort_values('Accuracy', ascending=False)

print("\nModel Accuracy Comparison:")
print("=========================")
print(accuracies)
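
Since the target is continuous, the models could also be compared on regression metrics alongside the binned accuracy. The sketch below does this on synthetic data with two of the model families. The data, the `candidates` dictionary, and the choice of 50 trees are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic linear target with a little noise.
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

rows = []
for name, estimator in candidates.items():
    pred = estimator.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "Model": name,
        "R2": r2_score(y_te, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_te, pred))),
    })

# One row per model, best R^2 first.
comparison = (
    pd.DataFrame(rows).sort_values("R2", ascending=False).reset_index(drop=True)
)
print(comparison)
```

R² and RMSE are computed on the raw predictions, so they do not depend on how the quartile bins are drawn.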


In [None]:


# Train a shallow decision tree and compute regression metrics
def train_and_evaluate_decision_tree():
    # max_depth=3 keeps the tree small enough to visualize
    dt_model = DecisionTreeRegressor(max_depth=3, random_state=42)
    dt_model.fit(X_train, y_train)

    # Make predictions
    y_pred = dt_model.predict(X_test)

    # Calculate performance metrics
    metrics = {
        'R2 Score': r2_score(y_test, y_pred),
        'Mean Squared Error': mean_squared_error(y_test, y_pred),
        'Root Mean Squared Error': np.sqrt(mean_squared_error(y_test, y_pred)),
        'Mean Absolute Error': mean_absolute_error(y_test, y_pred)
    }

    # Feature importance
    feature_importance = pd.DataFrame({
        'Feature': X.columns,
        'Importance': dt_model.feature_importances_
    }).sort_values('Importance', ascending=False)

    return metrics, feature_importance,dt_model

# Function to visualize decision tree
def visualize_tree(model, feature_names):
    plt.figure(figsize=(20,10))
    plot_tree(model,
             feature_names=feature_names,
             filled=True,
             rounded=True,
             fontsize=10)
    plt.title("Decision Tree Visualization")
    plt.show()

# Train the model and collect metrics
metrics, feature_importance, dTmodel = train_and_evaluate_decision_tree()

# Print results
print("\nDecision Tree Performance Metrics:")
for metric, value in metrics.items():
    print(f"{metric}: {value:.4f}")

print("\nFeature Importance:")
print(feature_importance)

# Visualize the tree
visualize_tree(dTmodel, features)