# <div style="text-align: center; background-color: #00FFFF; font-family:Times New Roman; color: white; padding: 14px; line-height: 1;border-radius:20px">📊 EDA | CO2 Emission Data | Visualization</div>

<h3 style="text-align: left;background-color: #00BFFF; font-family:Times New Roman; color: white; padding: 14px; line-height: 1; border-radius:10px"> About Dataset📁</h3>

<h4>The Global Data on Sustainable Energy Dataset contains <mark>21 columns</mark>, each with the following descriptions:</h4>

* <b> <mark>1. Entity</mark></b>: The name of the country or region for which the data is reported.
* <b> <mark>2. Year</mark></b>: The year for which the data is reported, ranging from 2000 to 2020.
* <b> <mark>3. Access to electricity (% of population)</mark></b>: The percentage of population with access to electricity.
* <b> <mark>4. Access to clean fuels for cooking (% of population)</mark></b>: The percentage of the population with primary reliance on clean fuels.
* <b> <mark>5. Renewable-electricity-generating-capacity-per-capita</mark></b>: Installed Renewable energy capacity per person.
* <b> <mark>6. Financial flows to developing countries (US $)</mark></b>: Aid and assistance from developed countries for clean energy projects.

* <b> <mark>7. Renewable energy share in the total final energy consumption (%)</mark></b>: Percentage of renewable energy in final energy consumption.

* <b> <mark>8. Electricity from fossil fuels (TWh)</mark></b>: Electricity generated from fossil fuels (coal, oil, gas) in terawatt-hours.

* <b> <mark>9. Electricity from nuclear (TWh)</mark></b>: Electricity generated from nuclear power in terawatt-hours.
* <b> <mark>10. Electricity from renewables (TWh)</mark></b>: Electricity generated from renewable sources (hydro, solar, wind, etc.) in terawatt-hours.
* <b> <mark>11. Low-carbon electricity (% electricity)</mark></b>: Percentage of electricity from low-carbon sources (nuclear and renewables).
* <b> <mark>12. Primary energy consumption per capita (kWh/person)</mark></b>: Energy consumption per person in kilowatt-hours.
* <b> <mark>13. Energy intensity level of primary energy (MJ/$2017 PPP GDP)</mark></b>: Energy use per unit of GDP at purchasing power parity.
* <b> <mark>14. Value_co2_emissions_kt_by_country</mark></b>: Carbon dioxide emissions by country in kilotons.
* <b> <mark>15. Renewables (% equivalent primary energy)</mark></b>: Equivalent primary energy that is derived from renewable sources.
* <b> <mark>16. GDP growth (annual %)</mark></b>: Annual GDP growth rate based on constant local currency.
* <b> <mark>17. GDP per capita</mark></b>: Gross domestic product per person.
* <b> <mark>18. Density (P/Km2)</mark></b>: Population density in persons per square kilometer.
* <b> <mark>19. Land Area (Km2)</mark></b>: Total land area in square kilometers.
* <b> <mark>20. Latitude</mark></b>: Latitude of the country's centroid in decimal degrees.
* <b> <mark>21. Longitude</mark></b>: Longitude of the country's centroid in decimal degrees.


<a id="1"></a>
# <div style="text-align: center; background-color: #838B8B; font-family:Times New Roman; color: white; padding: 14px; line-height: 1;border-radius:20px">1. Import Necessary Libraries</div>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import shap
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import missingno as mno
import plotly.offline as pyo 
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.io as pio
from wordcloud import WordCloud
color_pal = sns.color_palette()
# 'seaborn-dark-palette' is no longer available under that name in recent Matplotlib
# releases, and 'dark_background' replaced it immediately anyway, so only the latter is applied
plt.style.use('dark_background')

import nltk

import warnings
import statsmodels.api as sm
warnings.filterwarnings('ignore')
sns.set_theme(style='darkgrid', palette='colorblind')
from sklearn.preprocessing import LabelEncoder 
le = LabelEncoder()

#Model
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MaxAbsScaler

In [None]:
df=pd.read_csv('/kaggle/input/global-data-on-sustainable-energy/global-data-on-sustainable-energy (1).csv')

<a id="2"></a>
# <div style="text-align: center; background-color: #CDC8B1; font-family:Times New Roman; color: white; padding: 14px; line-height: 1;border-radius:20px">2. Exploratory Data Analysis 📊 </div>

In [None]:
df.head(5)

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.describe().T

In [None]:
df.describe(include = 'object').T

In [None]:
unique_values =  df.nunique()
unique_values

In [None]:
df.info()

In [None]:
df.dtypes

# <div style="text-align: center; background-color: #CDC8B1; font-family:Times New Roman; color: white; padding: 14px; line-height: 1;border-radius:20px">3. Null values</div>

In [None]:
df.isna().sum()

In [None]:
# Calculating the count of missing values in each column
missing_values = df.isna().sum()

# Creating a bar plot using Plotly Express
fig = px.bar(x=missing_values.index, y=missing_values.values, labels={'x': 'Columns', 'y': 'Missing Values Count'},
             title='Count of Missing Values in Each Column')
fig.show()
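
Raw counts are easier to judge as a share of the rows. The sketch below uses a small made-up frame (`toy` and its values are illustrative assumptions, not the real dataset) to compute the percentage of missing values per column from `isna().mean()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'B', 'B'],
    'CO2': [1.0, np.nan, 3.0, np.nan],
    'gdp_growth': [2.0, 2.5, np.nan, 1.0],
})

# isna() gives booleans; their mean is the fraction of missing rows per column
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns that are candidates for dropping at the top.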

In [None]:
# Drop columns with a high number of missing values
df.drop(columns=['Financial flows to developing countries (US $)','Renewables (% equivalent primary energy)',
                 'Renewable-electricity-generating-capacity-per-capita'], inplace=True)

# Fill the remaining gaps in these columns with the column mean.
# Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about.
mean_fill_cols = [
    'Access to clean fuels for cooking',
    'Renewable energy share in the total final energy consumption (%)',
    'Electricity from nuclear (TWh)',
    'Energy intensity level of primary energy (MJ/$2017 PPP GDP)',
    'Value_co2_emissions_kt_by_country',
    'gdp_growth',
    'gdp_per_capita',
]
for col in mean_fill_cols:
    df[col] = df[col].fillna(df[col].mean())

# Drop rows with any remaining missing values
df = df.dropna()

# Display the shape of the DataFrame after cleaning
df.shape
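
A single global mean mixes countries of very different sizes. One possible alternative, sketched here on made-up values, is to fill each gap with the mean of the same country and fall back to the global mean only when a country has no observations at all. The frame `toy` and the helper name `fill_by_group_mean` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def fill_by_group_mean(frame, group_col, value_col):
    """Fill NaNs with the group's mean, then with the global mean as a fallback."""
    group_mean = frame.groupby(group_col)[value_col].transform('mean')
    filled = frame[value_col].fillna(group_mean)
    return filled.fillna(frame[value_col].mean())

# Hypothetical data: country C has no observed value at all
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'A', 'B', 'B', 'C'],
    'CO2': [10.0, np.nan, 30.0, 1000.0, np.nan, np.nan],
})
toy['CO2_filled'] = fill_by_group_mean(toy, 'Entity', 'CO2')
print(toy)
```

Whether per-country filling is preferable depends on how many values each country is missing; it is shown only as an option to compare against the global mean.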


# <div style="text-align: center; background-color: #6495ED; font-family:Times New Roman; color: white; padding: 14px; line-height: 1;border-radius:20px">4. Duplicate rows</div>


In [None]:
# Finding duplicate rows
duplicate_rows = df[df.duplicated(keep='first')]

# Number of duplicate rows
num_duplicates = duplicate_rows.shape[0]

# Displaying the duplicate rows
print(f"Number of duplicate rows: {num_duplicates}")
duplicate_rows
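
`duplicated()` without arguments only flags rows that match in every column. For country-year data, an additional check that may be useful is whether an (Entity, Year) pair appears more than once. The toy frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical frame: country A is reported twice for the year 2000
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'A', 'B'],
    'Year': [2000, 2000, 2001, 2000],
    'CO2': [1.0, 1.5, 2.0, 3.0],
})

# Rows identical in every column
full_dupes = int(toy.duplicated().sum())

# Rows sharing the same (Entity, Year) key; keep=False marks every member of the clash
key_dupes = int(toy.duplicated(subset=['Entity', 'Year'], keep=False).sum())
print(full_dupes, key_dupes)
```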

# <div style="text-align: center; background-color: #6495ED; font-family:Times New Roman; color: white; padding: 14px; line-height: 1;border-radius:20px">5. Feature engineering</div>

In [None]:
# Rename columns
df.rename(columns={"Value_co2_emissions_kt_by_country":"CO2" , 'Land Area(Km2)':'Land'} , inplace=True)


In [None]:
df.rename(columns={'Density\\n(P/Km2)': 'Density'}, inplace=True)
df['Density'] = df['Density'].str.replace(',', '').astype(int)
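
`astype(int)` raises as soon as one value is missing or malformed. A more forgiving variant, sketched on made-up strings (the series `raw` is an assumption for illustration), strips the separators and lets `pd.to_numeric` turn anything unparseable into a missing value.

```python
import pandas as pd

# Hypothetical raw values: thousands separators, a missing entry, and a junk string
raw = pd.Series(['1,234', '56', None, 'n/a'])

# Strip thousands separators, then coerce anything unparseable to NaN
cleaned = pd.to_numeric(raw.str.replace(',', '', regex=False), errors='coerce')

# The nullable Int64 dtype keeps integers while still allowing missing values
density = cleaned.astype('Int64')
print(density)
```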

In [None]:
# Keep one (Entity, Land) pair per country so that names and areas stay aligned
energy_land = df[['Entity', 'Land']].dropna().drop_duplicates(subset='Entity')

# Country names, in the same order as the land areas below
countries = energy_land['Entity'].to_numpy()

# Clean the land area values: strip thousands separators and convert to integers
land_int = [int(float(str(value).replace(',', ''))) for value in energy_land['Land']]

# Scale the data

In [None]:
# Columns to be scaled
columns_to_scale = ['Electricity from fossil fuels (TWh)','CO2',
                    'Land','Electricity from nuclear (TWh)','Electricity from renewables (TWh)','Density']

# Select only the columns to be scaled
data_to_scale = df[columns_to_scale]

# Initialize the MaxAbsScaler
scaler = MaxAbsScaler(copy=True)

# Scale the selected columns
scaled_data = scaler.fit_transform(data_to_scale)

# Create a new DataFrame with the scaled values
df_scaled = df.copy()
df_scaled[columns_to_scale] = scaled_data

# Display the scaled DataFrame
df_scaled.head()
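
`MaxAbsScaler` divides each column by its largest absolute value, so every column ends up in the range [-1, 1] and zeros stay zeros. The small array below is made up to show that this matches the manual calculation.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical two-column array with one negative value
values = np.array([[10.0, -4.0],
                   [5.0, 2.0],
                   [0.0, 1.0]])

scaled = MaxAbsScaler().fit_transform(values)

# Same result by hand: divide each column by its maximum absolute value
manual = values / np.abs(values).max(axis=0)
print(scaled)
```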


# <div style="text-align: center; background-color: #6495ED; font-family:Times New Roman; color: white; padding: 14px; line-height: 1;border-radius:20px">6. Correlation Matrix</div>

In [None]:
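# Calculate the correlation matrix for the numeric columns
# (numeric_only=True skips the text column 'Entity', which pandas 2.x would otherwise reject)
correlation_matrix = df.corr(numeric_only=True)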

# Create a correlation heatmap using Plotly Express
fig = px.imshow(
    correlation_matrix,  # Matrix containing the data
    labels=dict(x="Features", y="Features", color="Correlation"),  # Customize labels
    x=correlation_matrix.columns,  # x-values: Features
    y=correlation_matrix.columns,  # y-values: Features
    color_continuous_scale='blues',  # Set the color scale
    title='Correlation Heatmap',  # Set the title of the plot
    height=1200  # Set the height of the plot
)

# Display the plot
fig.show()

In [None]:
print('Top 5 Most Positively Correlated to the Target Variable')
correlation_matrix['CO2'].sort_values(ascending=False).head(5)

In [None]:
print('Top 5 Most Negatively Correlated to the Target Variable')
correlation_matrix['CO2'].sort_values(ascending=True).head(5)

# Reduce dimensionality


In [None]:
columns_to_drop = [col for col in correlation_matrix.columns if abs(correlation_matrix.loc['CO2', col]) < 0.5]
print('number of columns to drop ' ,len(columns_to_drop))
columns_to_drop
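
The threshold filter above can be checked on a small synthetic frame where the answer is known in advance. The column names `strong` and `noise` and the generated values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)

# 'strong' drives the target, 'noise' is unrelated to it
toy = pd.DataFrame({
    'strong': strong,
    'noise': rng.normal(size=n),
    'CO2': 2.0 * strong + rng.normal(scale=0.1, size=n),
})

# Correlation of every feature with the target, excluding the target itself
corr_with_target = toy.corr(numeric_only=True)['CO2'].drop('CO2')

# Features whose absolute correlation falls below the 0.5 threshold
weak = corr_with_target[corr_with_target.abs() < 0.5].index.tolist()
print(weak)
```

A correlation threshold only captures linear, one-feature-at-a-time relationships, so it is a rough filter rather than a full feature-selection method.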

In [None]:
cols_to_drop = [
    'Access to electricity (% of population)',
    'Access to clean fuels for cooking',
    'Renewable energy share in the total final energy consumption (%)',
    'Low-carbon electricity (% electricity)',
    'Primary energy consumption per capita (kWh/person)',
    'Energy intensity level of primary energy (MJ/$2017 PPP GDP)',
    'gdp_growth',
    'gdp_per_capita',
    'Latitude',
    'Longitude'
]

# Drop the weakly correlated columns listed above ('Year' is deliberately kept)
df = df.drop(cols_to_drop, axis=1)

# <div style="text-align: center; background-color: #6495ED; font-family:Times New Roman; color: white; padding: 14px; line-height: 1;border-radius:20px">7. Data visualisation</div>

# the target column 'CO2'

In [None]:
# Select the 10 largest values in the 'CO2' column and look up their countries
top_CO2 = df['CO2'].nlargest(10)
locations = df.loc[top_CO2.index, 'Entity']

# Plot the top 10 CO2 values using Matplotlib
plt.figure(figsize=(10, 6))
plt.bar(range(len(top_CO2)), top_CO2, color='#7B66FF')
plt.xlabel('Country')
plt.ylabel('CO2')
plt.legend(['CO2'])
plt.title('Top 10 CO2')
plt.xticks(range(len(top_CO2)), locations)
plt.tight_layout()
plt.show()
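
Because each row is one country in one year, `nlargest` on the raw column can return the same country several times. The made-up frame below contrasts that row-level ranking with a per-country ranking.

```python
import pandas as pd

# Hypothetical frame: country A dominates in every year
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'A', 'B', 'C'],
    'Year': [2018, 2019, 2020, 2019, 2019],
    'CO2': [90.0, 100.0, 95.0, 50.0, 40.0],
})

# Row-level ranking: the three largest rows all belong to A
top_rows = toy.nlargest(3, 'CO2')['Entity'].tolist()

# Country-level ranking: take each country's maximum first, then rank
top_countries = toy.groupby('Entity')['CO2'].max().nlargest(3).index.tolist()
print(top_rows, top_countries)
```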


In [None]:
# Calculate the maximum 'CO2' emissions for each 'Country' category and sort in descending order
max_co2 = df.groupby('Entity')['CO2'].max().reset_index()
max_co2 = max_co2.sort_values(by='CO2', ascending=False)

# Select the top 10 'Country' categories with the highest maximum 'CO2' emissions
top_10_high_co2 = max_co2.head(10)

# Create a bar plot using Plotly Express
fig = px.bar(
    top_10_high_co2,  # DataFrame containing the data
    x='Entity',  # x-values: 'Country' categories
    y='CO2',  # y-values: maximum 'CO2' emissions
    color='CO2',  # Color the bars by their CO2 value
    title='Top 10 Countries by Maximum CO2 Emissions',  # Set the title of the plot
    labels={'Entity': 'Country', 'CO2': 'CO2 Emissions'},  # Customize labels
    template='plotly_white'  # Use a white template for the plot
)

# Set the height of the plot
fig.update_layout(height=650)

# Display the plot
fig.show()


In [None]:
# Calculate the maximum 'CO2' emissions for each 'Year'
CO2_By_Year = df.groupby('Year')['CO2'].max().reset_index()

# Create a line plot using Plotly Express
fig_CO2_By_Year = px.line(
    CO2_By_Year,  # DataFrame containing the data
    x='Year',   # x-values: Year
    y='CO2',  # y-values: maximum CO2
    labels={'Year': 'Year'},  # Customize label for the x-axis
    title='Maximum CO2 Emissions by Year',  # Set the title of the plot
    height=650  # Set the height of the plot
)

# Display the plot
fig_CO2_By_Year.show()
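
The yearly maximum tracks only the single largest emitter. To see how much the choice of aggregate matters, the sketch below computes the maximum, median, and sum per year on made-up values in a single `agg` call.

```python
import pandas as pd

# Hypothetical values: two small emitters and one very large one per year
toy = pd.DataFrame({
    'Year': [2000, 2000, 2000, 2001, 2001, 2001],
    'CO2': [1.0, 2.0, 300.0, 2.0, 3.0, 310.0],
})

# One row per year with three different summaries of the same column
by_year = toy.groupby('Year')['CO2'].agg(['max', 'median', 'sum']).reset_index()
print(by_year)
```

With skewed data like this, the median describes a typical country while the maximum and sum are dominated by the largest one.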


In [None]:
# Create the box plot using Plotly Express for 'CO2'
fig1 = px.box(df_scaled, y='CO2', template='plotly_white', title='CO2 emission (BoxPlot)')

# Customize the layout of the box plot
fig1.update_layout(font=dict(size=17, family="Franklin Gothic"))

# Display the box plot
fig1.show()


In [None]:
columns_to_plot = [
    ('Electricity from fossil fuels (TWh)', 'CO2', 'CO2 emission by Electricity from fossil fuels (TWh)'),
    ('Electricity from renewables (TWh)', 'CO2', 'CO2 emission by Electricity from renewables (TWh)'),
    ('Land','CO2','CO2 emission by Land'),
    ('Electricity from nuclear (TWh)','CO2','CO2 emission by Electricity from nuclear (TWh)')
]

fig = make_subplots(rows=2, cols=2, subplot_titles=[title for _, _, title in columns_to_plot])

for i, (column, y_label, title) in enumerate(columns_to_plot, start=1):
    data = df_scaled.groupby(column)[y_label].sum().reset_index()
    
    # Add a scatter plot to the subplot
    fig.add_trace(
        go.Scatter(x=data[column], y=data[y_label], mode='markers', name=title),
        row=(i - 1) // 2 + 1,  # Calculate the subplot row
        col=(i - 1) % 2 + 1  # Calculate the subplot column
    )

# Update layout and display the plot
fig.update_layout(height=1000, width=1000, showlegend=False, title='CO2 Emissions by Various Factors')
fig.show()
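
The row and column arithmetic in the loop above places traces left to right, top to bottom. The same mapping can be written with `divmod`; the helper name `grid_position` is an assumption for illustration.

```python
def grid_position(index, n_cols):
    """Map a 0-based trace index to the 1-based (row, col) used by make_subplots."""
    row, col = divmod(index, n_cols)
    return row + 1, col + 1

# Positions of four traces in a grid with two columns
positions = [grid_position(i, n_cols=2) for i in range(4)]
print(positions)  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```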


In [None]:
# Calculate the Max 'CO2' for each 'Year' and sort in descending order
Max_CO2 = df.groupby('Year')['CO2'].max().reset_index()
Max_CO2 = Max_CO2.sort_values(by='CO2', ascending=False)

# Select the top 10 years with the highest CO2 emissions
top_10_expensive_CO2 = Max_CO2.head(10)

# Create a bar plot using Plotly Express
fig = px.bar(
    top_10_expensive_CO2,  # DataFrame containing the data
    x='Year',  # x-values: years
    y='CO2',  # y-values: Max CO2 emissions
    color='CO2',  # Color the bars based on the CO2 values
    title='Top 10 Years by Max CO2 Emissions',  # Set the title of the plot
    labels={'Year': 'Year', 'CO2': 'Max CO2'},  # Set labels for axes
    template='plotly_white'  # Use a white template for the plot
)

# Set font color to black
fig.update_traces(textfont_color='black')

# Set the height of the plot
fig.update_layout(height=650)

# Display the plot
fig.show()


# Electricity from fossil fuels (TWh)

In [None]:
# Calculate the maximum 'Electricity from fossil fuels (TWh)' for each 'Entity'
Entity_By_Electricity_from_fossil = df.groupby('Entity')['Electricity from fossil fuels (TWh)'].max().reset_index()

# Create a line plot using Plotly Express
fig_Entity_By_Electricity_from_fossil = px.line(
    Entity_By_Electricity_from_fossil,  # DataFrame containing the data
    x='Entity',   # x-values: country
    y='Electricity from fossil fuels (TWh)',  # y-values: maximum fossil-fuel electricity
    labels={'Entity': 'Entity'},  # Customize label for the x-axis
    title='Entity by Electricity from fossil fuels (TWh)',  # Set the title of the plot
    height=650  # Set the height of the plot
)

# Display the plot
fig_Entity_By_Electricity_from_fossil.show()


In [None]:
# Select the 10 largest values in 'Electricity from fossil fuels (TWh)' and look up their countries
top_fossil = df['Electricity from fossil fuels (TWh)'].nlargest(10)
locations = df.loc[top_fossil.index, 'Entity']

# Plot the top 10 fossil-fuel electricity values using Matplotlib
plt.figure(figsize=(10, 6))
plt.bar(range(len(top_fossil)), top_fossil, color='#7B66FF')
plt.xlabel('Country')
plt.ylabel('Electricity from fossil fuels (TWh)')
plt.legend(['Electricity from fossil fuels (TWh)'])
plt.title('Top 10 Electricity from fossil fuels (TWh)')
plt.xticks(range(len(top_fossil)), locations)
plt.tight_layout()
plt.show()


In [None]:
# Create the box plot using Plotly Express for 'Electricity from fossil fuels (TWh)'
fig1 = px.box(df_scaled, y='Electricity from fossil fuels (TWh)', template='plotly_white', title='Electricity from fossil fuels (TWh) ')

# Customize the layout of the box plot
fig1.update_layout(font=dict(size=17, family="Franklin Gothic"))

# Display the box plot
fig1.show()


# Electricity from renewables (TWh)

In [None]:
# Calculate the maximum 'Electricity from renewables (TWh)' for each 'Entity'
Entity_By_Electricity = df.groupby('Entity')['Electricity from renewables (TWh)'].max().reset_index()

# Create a line plot using Plotly Express
fig_Entity_By_Electricity = px.line(
    Entity_By_Electricity,  # DataFrame containing the data
    x='Entity',   # x-values: country
    y='Electricity from renewables (TWh)',  # y-values: maximum renewable electricity
    labels={'Entity': 'Entity'},  # Customize label for the x-axis
    title='Electricity from renewables (TWh)',  # Set the title of the plot
    height=650  # Set the height of the plot
)

# Display the plot
fig_Entity_By_Electricity.show()

In [None]:
# Create the box plot using Plotly Express for 'Electricity from renewables (TWh)'
fig1 = px.box(df_scaled, y='Electricity from renewables (TWh)', template='plotly_white', title='Electricity from renewables (TWh)')

# Customize the layout of the box plot
fig1.update_layout(font=dict(size=17, family="Franklin Gothic"))

# Display the box plot
fig1.show()


# Land

In [None]:
# Creating a DataFrame using the country names and cleaned land area values
energy_land_data_use_df = pd.DataFrame({'Country': countries, 'Land': land_int})

# Creating a bar plot using Plotly Express
fig = px.bar(energy_land_data_use_df, x='Country', y='Land', labels={'Land': 'Land Area - km2', 'Entity': 'Country'})

# Updating the graph layout and title
fig.update_layout(title={'text': 'Countries Land Area - in km2', 'x': 0.5})

# Displaying the graph
fig.show()


In [None]:
# Calculate the maximum 'Land' for each country and sort in descending order
max_land = df.groupby('Entity')['Land'].max().reset_index()
max_land = max_land.sort_values(by='Land', ascending=False)

# Select the 10 countries with the largest land area
top_10_land = max_land.head(10)

# Create a bar plot using Plotly Express
fig = px.bar(
    top_10_land,  # DataFrame containing the data
    x='Entity',  # x-values: countries
    y='Land',  # y-values: land area
    color='Land',  # Color the bars by land area
    title='Top 10 Countries by Land',  # Set the title of the plot
    labels={'Entity': 'Country', 'Land': 'Land'},  # Customize labels
    template='plotly_white'  # Use a white template for the plot
)

# Set the height of the plot
fig.update_layout(height=650)

# Display the plot
fig.show()


# Entity & Year 

In [None]:
energy_co2_data = df[['Entity', 'Year', 'CO2']]#create new data 
energy_co2_data.head()

In [None]:
# Rows for Canada
energy_co2_data_canada = energy_co2_data[(energy_co2_data['Entity'] == 'Canada')]

# Rows for the United States
energy_co2_data_united_states = energy_co2_data[(energy_co2_data['Entity'] == 'United States')]

# Rows for China
energy_co2_data_china = energy_co2_data[(energy_co2_data['Entity'] == 'China')]

# Rows for Brazil
energy_co2_data_Brazil = energy_co2_data[(energy_co2_data['Entity'] == 'Brazil')]

# Rows for Australia
energy_co2_data_Australia = energy_co2_data[(energy_co2_data['Entity'] == 'Australia')]

In [None]:
# Create one subplot per country (five rows, so five titles)
fig = make_subplots(rows=5, cols=1, subplot_titles=('Canada', 'United States', 'China', 'Brazil', 'Australia'))

# Add each country's CO2 emissions as a bar trace in its own row
fig.add_trace(go.Bar(x=energy_co2_data_canada['Year'], y=energy_co2_data_canada['CO2']), row=1, col=1)
fig.add_trace(go.Bar(x=energy_co2_data_united_states['Year'], y=energy_co2_data_united_states['CO2']), row=2, col=1)
fig.add_trace(go.Bar(x=energy_co2_data_china['Year'], y=energy_co2_data_china['CO2']), row=3, col=1)
fig.add_trace(go.Bar(x=energy_co2_data_Brazil['Year'], y=energy_co2_data_Brazil['CO2']), row=4, col=1)
fig.add_trace(go.Bar(x=energy_co2_data_Australia['Year'], y=energy_co2_data_Australia['CO2']), row=5, col=1)

# Update subplot layout
fig.update_layout(height=1200, width=1200, showlegend=False, 
                  title='CO2 emissions (kilotons) per year for five of the largest countries in the world')

# Show subplot
fig.show()


# The year with the maximum CO2 is 2019

In [None]:
# Select the rows for 2019
energy_co2_data_2019 = energy_co2_data[(energy_co2_data['Year'] == 2019)]
# Drop the missing values
energy_co2_data_2019 = energy_co2_data_2019.dropna()
# Show its columns
energy_co2_data_2019.columns

In [None]:
# Create the bar chart of 2019 CO2 emissions
fig_co2_2019 = px.bar(energy_co2_data_2019, x='Entity', y='CO2')
# Update the graph layout
fig_co2_2019.update_layout(title={'text': 'CO2 emissions (kilotons) by country in 2019', 'x': 0.5})
# Shows graph
fig_co2_2019.show()

# MAP SHAPE


In [None]:
# Function to plot features on world map
def plot_world_map(column_name):
    fig = go.Figure()
    for year in range(2000, 2021):
        # Filter the data for the current year
        filtered_df = df[df['Year'] == year]

        # Create a choropleth trace for the current year
        trace = go.Choropleth(
            locations=filtered_df['Entity'],
            z=filtered_df[column_name],
            locationmode='country names',
            colorscale='Electric',  # Use a different color scale for better contrast
            colorbar=dict(title=column_name),
            zmin=df[column_name].min(),
            zmax=df[column_name].max(),
            visible=False  # Set the trace to invisible initially
        )

        # Add the trace to the figure
        fig.add_trace(trace)

    # Set the first trace to visible
    fig.data[0].visible = True

    # Create animation steps
    steps = []
    for i in range(len(fig.data)):
        step = dict(
            method='update',
            args=[{'visible': [False] * len(fig.data)},  # Set all traces to invisible
                  {'title.text': f'{column_name} Map - {2000 + i}'}],  # Layout update: title for the selected year
            label=str(2000 + i)  # Set the label for each step
        )
        step['args'][0]['visible'][i] = True  # Set the current trace to visible
        steps.append(step)

    # Create the slider
    sliders = [dict(
        active=0,
        steps=steps,
        currentvalue={"prefix": "Year: ", "font": {"size": 14}},  # Increase font size for slider label
    )]

    # Update the layout of the figure with increased size and change the template
    fig.update_layout(
        title_text=f'{column_name} Map with slider',  # Set the initial title
        title_font_size=24,  # Increase title font size
        title_x=0.5,  # Center the title
        geo=dict(
            showframe=True,
            showcoastlines=False,
            projection_type='natural earth'
        ),
        sliders=sliders,
        height=500,  # Set the height of the figure in pixels
        width=1000,  # Set the width of the figure in pixels
        font=dict(family='Arial', size=12),  # Customize font family and size for the whole figure
        margin=dict(t=80, l=50, r=50, b=50),  # Add margin for better layout spacing
        # Change the template to 'plotly_dark'
    )

    # Show the figure
    fig.show()

In [None]:
# Columns to draw on the world map
select_col = ['CO2','Electricity from fossil fuels (TWh)',
 'Electricity from renewables (TWh)','Electricity from nuclear (TWh)']

In [None]:
for i in select_col:
    column_name = i
    print(column_name)
    plot_world_map(column_name)
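
The slider in `plot_world_map` works by keeping one trace per year and switching a visibility mask. The sketch below builds the same kind of step list in plain Python so the masks can be inspected without drawing anything. The helper name `build_slider_steps` is an assumption, and the dotted key `'title.text'` is Plotly.js attribute-path notation for the layout title.

```python
def build_slider_steps(years, title_prefix):
    """Return one slider step per year, each showing exactly one trace."""
    years = list(years)
    steps = []
    for i, year in enumerate(years):
        # True only at position i, so only that year's trace is drawn
        visible = [j == i for j in range(len(years))]
        steps.append({
            'method': 'update',
            'args': [{'visible': visible},
                     {'title.text': f'{title_prefix} Map - {year}'}],
            'label': str(year),
        })
    return steps

steps = build_slider_steps(range(2000, 2003), 'CO2')
print(steps[1]['args'][0]['visible'])  # [False, True, False]
```

Taking the labels from the list of years, rather than from `2000 + i`, keeps the labels correct if the year range ever changes.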

# <div style="text-align: center; background-color: #6495ED; font-family:Times New Roman; color: white; padding: 14px; line-height: 1;border-radius:20px">8. Categorical</div>

In [None]:
# LabelEncoder was imported and instantiated as `le` in section 1;
# encode the country names as integer codes
df['Entity'] = le.fit_transform(df['Entity'])
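
`LabelEncoder` assigns integer codes in sorted order of the labels, so the codes carry an alphabetical ordering that has no physical meaning. The made-up list below shows the mapping and how to recover the original names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical country list with one repeated entry
countries = ['Canada', 'Brazil', 'China', 'Brazil']

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the sorted labels; the position of a label is its code
mapping = {str(name): code for code, name in enumerate(encoder.classes_)}
decoded = [str(name) for name in encoder.inverse_transform(codes)]
print(mapping)
```

Tree-based models can split on such codes without much harm, while a linear model will treat them as a numeric scale, which is worth keeping in mind when comparing the models later.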

# <div style="text-align: center; background-color: #6495ED; font-family:Times New Roman; color: white; padding: 14px; line-height: 1;border-radius:20px">9. Splitting the dataset</div>


# The target column is 'CO2' emission values

In [None]:
X = df.drop(columns=['CO2'])
y = df['CO2']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Display the shapes of the resulting datasets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
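
A random row-level split puts different years of the same country in both the training and the test set. If the goal were to predict emissions for countries the model has never seen, a group-aware split would be one option. The sketch below uses scikit-learn's `GroupShuffleSplit` on a made-up frame; whether this fits the notebook's goal is an open question, so it is shown only for comparison.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical panel: five countries, four years each
toy = pd.DataFrame({
    'Entity': np.repeat(['A', 'B', 'C', 'D', 'E'], 4),
    'Year': np.tile([2000, 2001, 2002, 2003], 5),
    'CO2': np.arange(20, dtype=float),
})

# Split so that every row of a given country lands on the same side
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy['Entity']))

train_entities = set(toy['Entity'].iloc[train_idx])
test_entities = set(toy['Entity'].iloc[test_idx])
print(sorted(train_entities), sorted(test_entities))
```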

# <div style="text-align: center; background-color: #6495ED; font-family:Times New Roman; color: white; padding: 14px; line-height: 1;border-radius:20px">10. Model Building and Analysis</div>

In [None]:
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
}
best_model = None
best_r2 = -np.inf  # R2 can be negative, so do not start the comparison at 0

for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Evaluate the model
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    # Side-by-side view of actual and predicted values
    submit = pd.DataFrame({'Actual CO2': y_test, 'Predict_CO2': y_pred}).reset_index()

    # Track the model with the highest R2 on the test set
    if r2 > best_r2:
        best_r2 = r2
        best_model = model.__class__.__name__

    print(f'{model_name}:')
    print(f'R2 Score: {r2:.2f}')
    print(f'Mean Absolute Error (MAE): {mae:.2f}')
    print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')
    print(submit.head(5))

    print('----------------------------------------')
print(f"The best performing model is: {best_model} with R2 score: {best_r2:.2f}")
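
The three metrics printed in the loop can be reproduced by hand, which makes it clearer what each one measures. The arrays below are made up for illustration.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

errors = y_true - y_hat

# MAE: average size of the error, in the units of the target
mae = np.mean(np.abs(errors))

# RMSE: like MAE, but large errors weigh more because they are squared first
rmse = np.sqrt(np.mean(errors ** 2))

# R2: share of the target's variance explained, 1 - SS_res / SS_tot
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(mae, rmse, r2)
```

R2 is a goodness-of-fit score rather than an accuracy, and it can drop below zero when a model predicts worse than the mean of the target.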

In [None]:
# `model` is the last estimator fitted in the loop above (Gradient Boosting)
importances = model.feature_importances_
feature_names = X.columns
feature_importance_dict = dict(zip(feature_names, importances))
sorted_feature_importance = sorted(feature_importance_dict.items(), key=lambda x: x[1], reverse=True)

top_n = 5  # Set the number of top features to display
top_feature_names, top_importances = zip(*sorted_feature_importance[:top_n])

fig = px.bar(
    x=top_importances,
    y=top_feature_names,
    orientation='h',
    title='Top 5 Feature Importance',
    labels={'x': 'Importance', 'y': 'Feature'},
    color=top_importances,  # Color bars by importance values
    color_continuous_scale='reds',  # Choose a color scale
    text=list(top_importances),  # Values rendered by the text template below
)

fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')

fig.show()

In [None]:
y_pred= model.predict(X_test)

# Residuals
residuals = y_test - y_pred

# Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()

# Forward selection

In [None]:
X = df.drop(columns=['CO2'])
y = df['CO2']

def forward_selection(df, target, significance_level=0.05):
    """Greedily add the feature with the smallest OLS p-value until none is significant."""
    initial_features = df.columns.tolist()
    best_features = []
    while len(best_features) < len(initial_features):
        # Candidates that have not been selected yet, in their original order
        remaining_features = [col for col in initial_features if col not in best_features]
        new_pval = pd.Series(index=remaining_features, dtype=float)
        for new_column in remaining_features:
            # Fit OLS with the already selected features plus one candidate
            model = sm.OLS(target, sm.add_constant(df[best_features + [new_column]])).fit()
            new_pval[new_column] = model.pvalues[new_column]
        min_p_value = new_pval.min()
        if min_p_value < significance_level:
            best_features.append(new_pval.idxmin())
        else:
            break
    return best_features

# Assuming you have already defined X and y as the features and target variable respectively
selected_features = forward_selection(X, y)
print("Selected features:", selected_features)
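
The function above selects features by OLS p-values. A different but related technique, shown here only for comparison, is scikit-learn's `SequentialFeatureSelector`, which adds features greedily according to cross-validated score instead of p-values. The synthetic data and the names `features`, `target`, and `chosen` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300

# Four candidate features; only x1 and x2 influence the target
features = pd.DataFrame(rng.normal(size=(n, 4)), columns=['x1', 'x2', 'x3', 'x4'])
target = 3.0 * features['x1'] - 2.0 * features['x2'] + rng.normal(scale=0.1, size=n)

# Forward selection by 5-fold cross-validated R2, stopping at two features
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction='forward', cv=5
)
selector.fit(features, target)

chosen = features.columns[selector.get_support()].tolist()
print(chosen)
```

Unlike the p-value version, this approach needs the number of features (or a tolerance) to be chosen up front, and it works with any estimator rather than only OLS.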

In [None]:
X = df[selected_features]
y = df['CO2']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Display the shapes of the resulting datasets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

# Model after Feature selection

In [None]:
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
}
best_model = None
best_r2 = -np.inf  # R2 can be negative, so do not start the comparison at 0

for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Evaluate the model
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    # Side-by-side view of actual and predicted CO2 values
    submit = pd.DataFrame({'Actual CO2': y_test, 'Predict_CO2': y_pred}).reset_index()

    # Track the model with the highest R2 on the test set
    if r2 > best_r2:
        best_r2 = r2
        best_model = model.__class__.__name__

    print(f'{model_name}:')
    print(f'R2 Score: {r2:.2f}')
    print(f'Mean Absolute Error (MAE): {mae:.2f}')
    print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')
    print(submit.head(5))
    print('----------------------------------------')
print(f"The best performing model is: {best_model} with R2 score: {best_r2:.2f}")

In [None]:
y_pred= model.predict(X_test)

# Residuals
residuals = y_test - y_pred

# Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()