In [None]:
'''Q.1. Global Power Plant Database
Project Description
The Global Power Plant Database is a comprehensive, open source database of power plants around the world. It centralizes power plant data to make it easier to navigate, compare and draw insights for one’s own analysis. The database covers approximately 35,000 power plants from 167 countries and includes thermal plants (e.g. coal, gas, oil, nuclear, biomass, waste, geothermal) and renewables (e.g. hydro, wind, solar). Each power plant is geolocated and entries contain information on plant capacity, generation, ownership, and fuel type. It will be continuously updated as data becomes available.
Key attributes of the database
The database includes the following indicators:
•	`country` (text): 3 character country code corresponding to the ISO 3166-1 alpha-3 specification [5]
•	`country_long` (text): longer form of the country designation
•	`name` (text): name or title of the power plant, generally in Romanized form
•	`gppd_idnr` (text): 10 or 12 character identifier for the power plant
•	`capacity_mw` (number): electrical generating capacity in megawatts
•	`latitude` (number): geolocation in decimal degrees; WGS84 (EPSG:4326)
•	`longitude` (number): geolocation in decimal degrees; WGS84 (EPSG:4326)
•	`primary_fuel` (text): energy source used in primary electricity generation or export
•	`other_fuel1` (text): energy source used in electricity generation or export
•	`other_fuel2` (text): energy source used in electricity generation or export
•	`other_fuel3` (text): energy source used in electricity generation or export
•	`commissioning_year` (number): year of plant operation, weighted by unit-capacity when data is available
•	`owner` (text): majority shareholder of the power plant, generally in Romanized form
•	`source` (text): entity reporting the data; could be an organization, report, or document, generally in Romanized form
•	`url` (text): web document corresponding to the `source` field
•	`geolocation_source` (text): attribution for geolocation information
•	`wepp_id` (text): a reference to a unique plant identifier in the widely-used PLATTS-WEPP database.
•	`year_of_capacity_data` (number): year the capacity information was reported
•	`generation_gwh_2013` (number): electricity generation in gigawatt-hours reported for the year 2013
•	`generation_gwh_2014` (number): electricity generation in gigawatt-hours reported for the year 2014
•	`generation_gwh_2015` (number): electricity generation in gigawatt-hours reported for the year 2015
•	`generation_gwh_2016` (number): electricity generation in gigawatt-hours reported for the year 2016
•	`generation_gwh_2017` (number): electricity generation in gigawatt-hours reported for the year 2017
•	`generation_gwh_2018` (number): electricity generation in gigawatt-hours reported for the year 2018
•	`generation_gwh_2019` (number): electricity generation in gigawatt-hours reported for the year 2019
•	`generation_data_source` (text): attribution for the reported generation information
•	`estimated_generation_gwh_2013` (number): estimated electricity generation in gigawatt-hours for the year 2013
•	`estimated_generation_gwh_2014` (number): estimated electricity generation in gigawatt-hours for the year 2014 
•	`estimated_generation_gwh_2015` (number): estimated electricity generation in gigawatt-hours for the year 2015 
•	`estimated_generation_gwh_2016` (number): estimated electricity generation in gigawatt-hours for the year 2016 
•	`estimated_generation_gwh_2017` (number): estimated electricity generation in gigawatt-hours for the year 2017 
•	`estimated_generation_note_2013` (text): label of the model/method used to estimate generation for the year 2013
•	`estimated_generation_note_2014` (text): label of the model/method used to estimate generation for the year 2014 
•	`estimated_generation_note_2015` (text): label of the model/method used to estimate generation for the year 2015
•	`estimated_generation_note_2016` (text): label of the model/method used to estimate generation for the year 2016
•	`estimated_generation_note_2017` (text): label of the model/method used to estimate generation for the year 2017 
Fuel Type Aggregation
We define the "Fuel Type" attribute of our database based on common fuel categories. 
Prediction task: predict two targets, 1) `primary_fuel` and 2) `capacity_mw`.

Dataset link:
•	https://github.com/wri/global-power-plant-database/blob/master/source_databases_csv/database_IND.csv

'''


In [62]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/wri/global-power-plant-database/master/source_databases_csv/database_IND.csv")
df

Unnamed: 0,country,country_long,name,gppd_idnr,capacity_mw,latitude,longitude,primary_fuel,other_fuel1,other_fuel2,other_fuel3,commissioning_year,owner,source,url,geolocation_source,wepp_id,year_of_capacity_data,generation_gwh_2013,generation_gwh_2014,generation_gwh_2015,generation_gwh_2016,generation_gwh_2017,generation_gwh_2018,generation_gwh_2019,generation_data_source,estimated_generation_gwh
0,IND,India,ACME Solar Tower,WRI1020239,2.5,28.1839,73.2407,Solar,,,,2011.0,Solar Paces,National Renewable Energy Laboratory,http://www.nrel.gov/csp/solarpaces/project_det...,National Renewable Energy Laboratory,,,,,,,,,,,
1,IND,India,ADITYA CEMENT WORKS,WRI1019881,98.0,24.7663,74.6090,Coal,,,,,Ultratech Cement ltd,Ultratech Cement ltd,http://www.ultratechcement.com/,WRI,,,,,,,,,,,
2,IND,India,AES Saurashtra Windfarms,WRI1026669,39.2,21.9038,69.3732,Wind,,,,,AES,CDM,https://cdm.unfccc.int/Projects/DB/DNV-CUK1328...,WRI,,,,,,,,,,,
3,IND,India,AGARTALA GT,IND0000001,135.0,23.8712,91.3602,Gas,,,,2004.0,,Central Electricity Authority,http://www.cea.nic.in/,WRI,,2019.0,,617.789264,843.747000,886.004428,663.774500,626.239128,,Central Electricity Authority,
4,IND,India,AKALTARA TPP,IND0000002,1800.0,21.9603,82.4091,Coal,Oil,,,2015.0,,Central Electricity Authority,http://www.cea.nic.in/,WRI,,2019.0,,3035.550000,5916.370000,6243.000000,5385.579736,7279.000000,,Central Electricity Authority,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
902,IND,India,YERMARUS TPP,IND0000513,1600.0,16.2949,77.3568,Coal,Oil,,,2016.0,,Central Electricity Authority,http://www.cea.nic.in/,WRI,,2019.0,,,0.994875,233.596650,865.400000,686.500000,,Central Electricity Authority,
903,IND,India,Yelesandra Solar Power Plant,WRI1026222,3.0,12.8932,78.1654,Solar,,,,,Karnataka Power Corporation Limited,Karnataka Power Corporation Limited,http://karnatakapower.com,Industry About,,,,,,,,,,,
904,IND,India,Yelisirur wind power project,WRI1026776,25.5,15.2758,75.5811,Wind,,,,,,CDM,https://cdm.unfccc.int/Projects/DB/TUEV-RHEIN1...,WRI,,,,,,,,,,,
905,IND,India,ZAWAR MINES,WRI1019901,80.0,24.3500,73.7477,Coal,,,,,Hindustan Zinc ltd,Hindustan Zinc ltd,http://www.hzlindia.com/,WRI,,,,,,,,,,,


In [49]:
df.shape # to check the dimension of the data set (rows, columns)

(907, 27)

In [50]:
# Set display option to show all columns
pd.set_option('display.max_columns', None)

print(df.head(10))


  country country_long                      name   gppd_idnr  capacity_mw  \
0     IND        India          ACME Solar Tower  WRI1020239          2.5   
1     IND        India       ADITYA CEMENT WORKS  WRI1019881         98.0   
2     IND        India  AES Saurashtra Windfarms  WRI1026669         39.2   
3     IND        India               AGARTALA GT  IND0000001        135.0   
4     IND        India              AKALTARA TPP  IND0000002       1800.0   
5     IND        India              AKRIMOTA LIG  IND0000003        250.0   
6     IND        India                    ALIYAR  IND0000004         60.0   
7     IND        India           ALLAIN DUHANGAN  IND0000005        192.0   
8     IND        India               ALMATTI DAM  IND0000006        290.0   
9     IND        India               AMAR KANTAK  IND0000007        210.0   

   latitude  longitude primary_fuel other_fuel1 other_fuel2  other_fuel3  \
0   28.1839    73.2407        Solar         NaN         NaN          NaN   


In [51]:
print(df.tail(10))

    country country_long                                               name  \
897     IND        India  Wind power project by Riddhi Siddhi Gluco Biol...   
898     IND        India                    Wind power project in Rajasthan   
899     IND        India                                    YAMUNANAGAR TPP   
900     IND        India                                 YASHWANTRAO MOHITE   
901     IND        India                                      YELHANKA (DG)   
902     IND        India                                       YERMARUS TPP   
903     IND        India                       Yelesandra Solar Power Plant   
904     IND        India                       Yelisirur wind power project   
905     IND        India                                        ZAWAR MINES   
906     IND        India                            iEnergy Theni Wind Farm   

      gppd_idnr  capacity_mw  latitude  longitude primary_fuel other_fuel1  \
897  WRI1026753        34.65    8.8709    77.4466   

In [52]:
df.dtypes #Checking the types of columns

country                      object
country_long                 object
name                         object
gppd_idnr                    object
capacity_mw                 float64
latitude                    float64
longitude                   float64
primary_fuel                 object
other_fuel1                  object
other_fuel2                  object
other_fuel3                 float64
commissioning_year          float64
owner                        object
source                       object
url                          object
geolocation_source           object
wepp_id                     float64
year_of_capacity_data       float64
generation_gwh_2013         float64
generation_gwh_2014         float64
generation_gwh_2015         float64
generation_gwh_2016         float64
generation_gwh_2017         float64
generation_gwh_2018         float64
generation_gwh_2019         float64
generation_data_source       object
estimated_generation_gwh    float64
dtype: object

In [53]:
# Count the missing values in each column
df.isnull().sum()

country                       0
country_long                  0
name                          0
gppd_idnr                     0
capacity_mw                   0
latitude                     46
longitude                    46
primary_fuel                  0
other_fuel1                 709
other_fuel2                 906
other_fuel3                 907
commissioning_year          380
owner                       565
source                        0
url                           0
geolocation_source           19
wepp_id                     907
year_of_capacity_data       388
generation_gwh_2013         907
generation_gwh_2014         509
generation_gwh_2015         485
generation_gwh_2016         473
generation_gwh_2017         467
generation_gwh_2018         459
generation_gwh_2019         907
generation_data_source      458
estimated_generation_gwh    907
dtype: int64

In [54]:
df.info() #detailed information about the data frame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 907 entries, 0 to 906
Data columns (total 27 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   country                   907 non-null    object 
 1   country_long              907 non-null    object 
 2   name                      907 non-null    object 
 3   gppd_idnr                 907 non-null    object 
 4   capacity_mw               907 non-null    float64
 5   latitude                  861 non-null    float64
 6   longitude                 861 non-null    float64
 7   primary_fuel              907 non-null    object 
 8   other_fuel1               198 non-null    object 
 9   other_fuel2               1 non-null      object 
 10  other_fuel3               0 non-null      float64
 11  commissioning_year        527 non-null    float64
 12  owner                     342 non-null    object 
 13  source                    907 non-null    object 
 14  url       

In [None]:
'''For numerical columns (e.g., latitude, longitude), we can impute missing values using the mean or median
of the respective columns. For categorical columns (e.g., primary_fuel), we can use the mode. Among the columns
used for modeling here, only latitude and longitude have gaps (46 missing values each); capacity_mw and
primary_fuel are already complete.

If the percentage of missing values in a column is small, imputing is a reasonable choice. If a column is mostly
or entirely empty (e.g., other_fuel2, other_fuel3, wepp_id), we may consider removing the entire column or,
if applicable, the affected rows.

'''
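
The rule of thumb above can be sketched as follows. The 50% threshold and the small stand-in DataFrame are assumptions made for illustration, not values taken from the dataset, and median imputation is just one of the two options mentioned.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the power plant DataFrame (values are made up)
demo = pd.DataFrame({
    "capacity_mw": [2.5, 98.0, 39.2, 135.0],
    "latitude": [28.18, np.nan, 21.90, 23.87],
    "wepp_id": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column
missing_fraction = demo.isnull().mean()

# Assumed threshold: drop columns with more than 50% missing values
threshold = 0.5
columns_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
cleaned = demo.drop(columns=columns_to_drop)

# Impute the remaining numerical gaps with the column median
cleaned = cleaned.fillna(cleaned.median(numeric_only=True))

print(missing_fraction)
print(columns_to_drop)
```

Inspecting `missing_fraction` before picking a threshold makes the trade-off explicit: a column that is entirely empty carries no information, while a column with a few gaps is usually worth keeping.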




In [56]:
from sklearn.impute import SimpleImputer

# Identify numerical and categorical columns
numerical_cols = ['capacity_mw', 'latitude', 'longitude']
categorical_cols = ['primary_fuel']

# Impute missing values for numerical columns using mean
numerical_imputer = SimpleImputer(strategy='mean')
df[numerical_cols] = numerical_imputer.fit_transform(df[numerical_cols])

# Impute missing values for categorical columns using mode
categorical_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_cols] = categorical_imputer.fit_transform(df[categorical_cols])


In [57]:
# Check the missing values again after imputation
df.isnull().sum()

country                       0
country_long                  0
name                          0
gppd_idnr                     0
capacity_mw                   0
latitude                      0
longitude                     0
primary_fuel                  0
other_fuel1                 709
other_fuel2                 906
other_fuel3                 907
commissioning_year          380
owner                       565
source                        0
url                           0
geolocation_source           19
wepp_id                     907
year_of_capacity_data       388
generation_gwh_2013         907
generation_gwh_2014         509
generation_gwh_2015         485
generation_gwh_2016         473
generation_gwh_2017         467
generation_gwh_2018         459
generation_gwh_2019         907
generation_data_source      458
estimated_generation_gwh    907
dtype: int64

In [58]:
# Feature selection: keep the target (capacity_mw) and the predictor columns
selected_features = df[['capacity_mw', 'latitude', 'longitude', 'primary_fuel']]


In [59]:
#Encode categorical variable
selected_features = pd.get_dummies(selected_features, columns=['primary_fuel'], drop_first=True)
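
As a small illustration of what `drop_first=True` does in the encoding step, the sketch below one-hot encodes a made-up `primary_fuel` column. The three fuel values are chosen only for the example.

```python
import pandas as pd

# Made-up sample with three fuel categories
fuel_demo = pd.DataFrame({"primary_fuel": ["Coal", "Hydro", "Solar", "Coal"]})

# drop_first=True omits the first category (alphabetically "Coal"),
# which then becomes the implicit baseline: a row with all dummies
# equal to zero is a Coal plant
encoded = pd.get_dummies(fuel_demo, columns=["primary_fuel"], drop_first=True)

print(encoded.columns.tolist())
```

Dropping one category avoids perfectly collinear dummy columns, which matters mainly for linear models; tree-based models are not affected in the same way.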


In [60]:
#Split the data into training and testing sets
X = selected_features.drop('capacity_mw', axis=1)
y = selected_features['capacity_mw']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
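
In this notebook the imputer is fit on the full DataFrame before the split, so statistics from the test rows also influence the training data. One possible alternative, sketched here on synthetic data, is to split first and let a scikit-learn `Pipeline` fit the imputer and encoder on the training rows only. The data, the `demo`-prefixed names and the hyperparameters are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (made up); column names follow the dataset
rng = np.random.default_rng(42)
n = 60
demo = pd.DataFrame({
    "latitude": rng.uniform(8, 35, n),
    "longitude": rng.uniform(68, 95, n),
    "primary_fuel": rng.choice(["Coal", "Hydro", "Solar"], n),
    "capacity_mw": rng.uniform(1, 2000, n),
})
demo.loc[::10, "latitude"] = np.nan  # introduce a few gaps

X_demo = demo.drop(columns="capacity_mw")
y_demo = demo["capacity_mw"]
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Imputation and encoding are learned from the training rows only
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["latitude", "longitude"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["primary_fuel"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])
pipeline.fit(Xd_train, yd_train)
demo_predictions = pipeline.predict(Xd_test)
print(demo_predictions[:3])
```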


In [63]:
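# Random forest model. Note: X and y come from the split above, where the target
# is capacity_mw, so this model regresses capacity rather than predicting primary_fuel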
fuel_model = RandomForestRegressor(random_state=42)
fuel_model.fit(X_train, y_train)


RandomForestRegressor(random_state=42)

In [64]:
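# Model evaluation for the random forest (target: capacity_mw, see the split above)
# Store the predictions under their own name so the DataFrame `df` is not overwritten
rf_predictions = fuel_model.predict(X_test)

In [66]:
# classification_report applies only to classifiers; this model is a regressor,
# so evaluate it with mean squared error and R-squared
mse = mean_squared_error(y_test, rf_predictions)
r2 = r2_score(y_test, rf_predictions)

# The printed label says "Primary Fuel", but the values measure errors in predicted capacity_mw
print(f"Primary Fuel Prediction Mean Squared Error: {mse:.2f}")
print(f"Primary Fuel Prediction R-squared: {r2:.2f}")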


Primary Fuel Prediction Mean Squared Error: 201297.53
Primary Fuel Prediction R-squared: 0.34
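
The task statement asks for a `primary_fuel` prediction, which is a classification problem, whereas the random forest above is a regressor fit on `capacity_mw`. A minimal sketch of a fuel classifier is given below. It runs on synthetic data with a made-up relationship between fuel and capacity, so its scores say nothing about the real dataset; the feature choice and all `demo`-style names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: capacity scale depends on the fuel (made-up rule)
rng = np.random.default_rng(0)
n = 120
fuels = rng.choice(["Coal", "Hydro", "Solar"], n)
scale = pd.Series(fuels).map({"Coal": 1000.0, "Hydro": 200.0, "Solar": 10.0}).to_numpy()
fuel_demo = pd.DataFrame({
    "capacity_mw": scale * rng.uniform(0.5, 1.5, n),
    "latitude": rng.uniform(8, 35, n),
    "longitude": rng.uniform(68, 95, n),
    "primary_fuel": fuels,
})

# primary_fuel is the target here, not a feature
X_fuel = fuel_demo[["capacity_mw", "latitude", "longitude"]]
y_fuel = fuel_demo["primary_fuel"]
Xf_train, Xf_test, yf_train, yf_test = train_test_split(
    X_fuel, y_fuel, test_size=0.2, random_state=42, stratify=y_fuel
)

fuel_classifier = RandomForestClassifier(random_state=42)
fuel_classifier.fit(Xf_train, yf_train)
fuel_predictions = fuel_classifier.predict(Xf_test)

accuracy = accuracy_score(yf_test, fuel_predictions)
print(f"Accuracy: {accuracy:.2f}")
print(classification_report(yf_test, fuel_predictions, zero_division=0))
```

With a classifier, `classification_report` becomes the appropriate evaluation tool, and `stratify` keeps the class proportions similar in the training and test sets.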


In [67]:
#Model Training for Capacity Prediction
capacity_model = LinearRegression()
capacity_model.fit(X_train, y_train)

LinearRegression()

In [69]:
#Model Evaluation for Capacity Prediction
capacity_predictions = capacity_model.predict(X_test)
mse_capacity = mean_squared_error(y_test, capacity_predictions)
r2_capacity = r2_score(y_test, capacity_predictions)

print(f"Capacity Prediction Mean Squared Error: {mse_capacity:.2f}")
print(f"Capacity Prediction R-squared: {r2_capacity:.2f}")

Capacity Prediction Mean Squared Error: 224805.92
Capacity Prediction R-squared: 0.26
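
Both MSE values reported in this notebook are in squared megawatts, because both models are fit on `capacity_mw`. Taking the square root gives the RMSE in megawatts, which is easier to compare against typical plant capacities. The sketch below simply reuses the two MSE values from the recorded outputs; the variable names are assumptions.

```python
import math

# MSE values copied from the outputs recorded in this notebook
mse_random_forest = 201297.53
mse_linear = 224805.92

# RMSE has the same unit as the target (MW)
rmse_random_forest = math.sqrt(mse_random_forest)
rmse_linear = math.sqrt(mse_linear)

print(f"Random forest RMSE: {rmse_random_forest:.2f} MW")
print(f"Linear regression RMSE: {rmse_linear:.2f} MW")
```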


In [70]:
# Summarize the metrics of both models (both are fit on capacity_mw)

print("Primary Fuel Prediction Results:")
print(f"Primary Fuel Prediction Mean Squared Error: {mse:.2f}")
print(f"Primary Fuel Prediction R-squared: {r2:.2f}")

print("\nCapacity Prediction Results:")
print(f"Capacity Prediction Mean Squared Error: {mse_capacity:.2f}")
print(f"Capacity Prediction R-squared: {r2_capacity:.2f}")


Primary Fuel Prediction Results:
Primary Fuel Prediction Mean Squared Error: 201297.53
Primary Fuel Prediction R-squared: 0.34

Capacity Prediction Results:
Capacity Prediction Mean Squared Error: 224805.92
Capacity Prediction R-squared: 0.26


In [71]:
#Save the models using joblib
from joblib import dump

dump(fuel_model, 'fuel_model.joblib')
dump(capacity_model, 'capacity_model.joblib')

['capacity_model.joblib']
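
A saved model can be restored with `joblib.load`. The sketch below shows the round trip on a tiny made-up regression problem and a temporary file, so it does not touch the files written above; the data and the file name `demo_model.joblib` are assumptions.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Tiny made-up regression problem (y = 2x + 1)
X_small = np.array([[0.0], [1.0], [2.0], [3.0]])
y_small = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X_small, y_small)

# Save to a temporary location and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    dump(model, path)
    restored = load(path)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X_small), restored.predict(X_small))
print(same_predictions)
```

Loading works reliably only with compatible library versions, so it is worth recording the scikit-learn version used when the model was saved.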