3.	Undertake research to find similarities between some country(s) against Ireland and apply parametric and non-parametric inferential statistical techniques to compare them (e.g., t-test, analysis of variance, Wilcoxon test, chi-squared test, among others). You must justify your choices and verify the applicability of the tests. Hypotheses and conclusions must be clearly stated. You are expected to use at least 5 different inferential statistics tests. 

## STATISTICS FOR DATA ANALYTICS

In [577]:
# Import libraries
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import (
    chi2,
    chi2_contingency,
    fisher_exact,
    kruskal,
    levene,
    mannwhitneyu,
    norm,
    shapiro,
    ttest_ind,
    wilcoxon,
)

In [579]:
energy_df=pd.read_csv("Energyshare2.csv")

In [440]:
#show first 5 rows
energy_df.head()

Unnamed: 0,Area,Country code,Year,Area type,Continent,Ember region,EU,OECD,G20,G7,ASEAN,Category,Subcategory,Variable,Unit,Value,YoY absolute change,YoY % change
0,Austria,AUT,1990,Country,Europe,Europe,1.0,1.0,0.0,0.0,0.0,Electricity demand,Demand,Demand,TWh,48.81,,
1,Austria,AUT,1990,Country,Europe,Europe,1.0,1.0,0.0,0.0,0.0,Electricity demand,Demand per capita,Demand per capita,MWh,6.36,,
2,Austria,AUT,1990,Country,Europe,Europe,1.0,1.0,0.0,0.0,0.0,Electricity generation,Aggregate fuel,Clean,%,66.25,,
3,Austria,AUT,1990,Country,Europe,Europe,1.0,1.0,0.0,0.0,0.0,Electricity generation,Aggregate fuel,Coal,%,12.56,,
4,Austria,AUT,1990,Country,Europe,Europe,1.0,1.0,0.0,0.0,0.0,Electricity generation,Aggregate fuel,Fossil,%,33.75,,


In [441]:
#checking null values
energy_df.isnull().sum()


Area                       0
Country code            2005
Year                       0
Area type                  0
Continent               2005
Ember region            2005
EU                      2005
OECD                    2005
G20                     2005
G7                      2005
ASEAN                   2005
Category                   0
Subcategory                0
Variable                   0
Unit                       0
Value                      0
YoY absolute change    25602
YoY % change           37548
dtype: int64

In [445]:
#cleaning data
drop_col = ['Area type', 'Country code', 'Continent', 'Ember region', 'EU', 'OECD', 'G20', 'G7', 'ASEAN', 
                   'YoY absolute change', 'YoY % change']
df_cleaned = energy_df.drop(columns=drop_col)
df_cleaned.head()

Unnamed: 0,Area,Year,Category,Subcategory,Variable,Unit,Value
0,Austria,1990,Electricity demand,Demand,Demand,TWh,48.81
1,Austria,1990,Electricity demand,Demand per capita,Demand per capita,MWh,6.36
2,Austria,1990,Electricity generation,Aggregate fuel,Clean,%,66.25
3,Austria,1990,Electricity generation,Aggregate fuel,Coal,%,12.56
4,Austria,1990,Electricity generation,Aggregate fuel,Fossil,%,33.75


In [447]:
#checking unique categories
print(df_cleaned['Category'].unique())


['Electricity demand' 'Electricity generation' 'Electricity imports'
 'Power sector emissions']


In [449]:
#checking unique subcategories
print(df_cleaned['Subcategory'].unique())

['Demand' 'Demand per capita' 'Aggregate fuel' 'Fuel' 'Total'
 'Electricity imports' 'CO2 intensity']


In [451]:
#checking unique variables
print(df_cleaned['Variable'].unique())

['Demand' 'Demand per capita' 'Clean' 'Coal' 'Fossil'
 'Hydro, bioenergy and other renewables' 'Renewables' 'Wind'
 'Wind and solar' 'Bioenergy' 'Gas' 'Hard coal' 'Hydro' 'Lignite'
 'Nuclear' 'Onshore wind' 'Other fossil' 'Other renewables' 'Solar'
 'Total generation' 'Net imports' 'CO2 intensity' 'Offshore wind']


In [453]:
#checking unique units
print(df_cleaned['Unit'].unique())

['TWh' 'MWh' '%' 'MtCO2e' 'gCO2e per kWh']


In [455]:
df_cleaned[['Category', 'Subcategory', 'Variable', 'Unit']].drop_duplicates()


Unnamed: 0,Category,Subcategory,Variable,Unit
0,Electricity demand,Demand,Demand,TWh
1,Electricity demand,Demand per capita,Demand per capita,MWh
2,Electricity generation,Aggregate fuel,Clean,%
3,Electricity generation,Aggregate fuel,Coal,%
4,Electricity generation,Aggregate fuel,Fossil,%
5,Electricity generation,Aggregate fuel,"Hydro, bioenergy and other renewables",%
6,Electricity generation,Aggregate fuel,Renewables,%
7,Electricity generation,Aggregate fuel,Wind,%
8,Electricity generation,Aggregate fuel,Wind and solar,%
9,Electricity generation,Aggregate fuel,Clean,TWh


In [457]:
#Selecting required categories for applying inferential statistics

In [459]:

# STEP 1: Filter the selected indicators
selected_rows = df_cleaned[
    ((df_cleaned['Category'] == 'Electricity demand') & 
     (df_cleaned['Subcategory'] == 'Demand') & 
     (df_cleaned['Variable'] == 'Demand') & 
     (df_cleaned['Unit'] == 'TWh')) |

    ((df_cleaned['Category'] == 'Power sector emissions') & 
     (df_cleaned['Subcategory'] == 'CO2 intensity') & 
     (df_cleaned['Variable'] == 'CO2 intensity') & 
     (df_cleaned['Unit'] == 'gCO2e per kWh')) |

    ((df_cleaned['Category'] == 'Electricity generation') & 
     (df_cleaned['Subcategory'] == 'Total') & 
     (df_cleaned['Variable'] == 'Total generation') & 
     (df_cleaned['Unit'] == 'TWh')) |

    ((df_cleaned['Category'] == 'Electricity generation') & 
     (df_cleaned['Subcategory'] == 'Aggregate fuel') & 
     (df_cleaned['Variable'] == 'Renewables') & 
     (df_cleaned['Unit'] == '%')) |

    ((df_cleaned['Category'] == 'Electricity imports') & 
     (df_cleaned['Subcategory'] == 'Electricity imports') & 
     (df_cleaned['Variable'] == 'Net imports') & 
     (df_cleaned['Unit'] == 'TWh'))
].copy()  # explicit copy, so adding a column below does not raise SettingWithCopyWarning

# STEP 2: Create a unique 'Indicator' column so each indicator stays separate
selected_rows['Indicator'] = (
    selected_rows['Category'] + ' - ' +
    selected_rows['Subcategory'] + ' - ' +
    selected_rows['Variable'] + ' (' +
    selected_rows['Unit'] + ')'
)

# STEP 3: Pivot to wide format: one row per (Area, Year), each Indicator as a column
df_pivot = selected_rows.pivot_table(
    index=['Area', 'Year'],
    columns='Indicator',
    values='Value'
).reset_index()

# Optional: Show all resulting columns (for confirmation)
print("Pivoted columns:\n")
for col in df_pivot.columns:
    print(col)


Pivoted columns:

Area
Year
Electricity demand - Demand - Demand (TWh)
Electricity generation - Aggregate fuel - Renewables (%)
Electricity generation - Total - Total generation (TWh)
Electricity imports - Electricity imports - Net imports (TWh)
Power sector emissions - CO2 intensity - CO2 intensity (gCO2e per kWh)

In [461]:
df_pivot.head()

Indicator,Area,Year,Electricity demand - Demand - Demand (TWh),Electricity generation - Aggregate fuel - Renewables (%),Electricity generation - Total - Total generation (TWh),Electricity imports - Electricity imports - Net imports (TWh),Power sector emissions - CO2 intensity - CO2 intensity (gCO2e per kWh)
0,Austria,1990,48.81,66.25,49.27,-0.46,249.85
1,Austria,1991,50.93,65.09,50.16,0.77,262.76
2,Austria,1992,50.44,72.48,49.89,0.55,201.64
3,Austria,1993,50.58,74.16,51.31,-0.73,183.59
4,Austria,1994,51.13,71.03,51.95,-0.82,196.15


In [463]:
# Subset the pivoted DataFrame (`df_pivot`) by country
ireland_data = df_pivot[df_pivot['Area'] == 'Ireland']
belgium_data = df_pivot[df_pivot['Area'] == 'Belgium']
austria_data = df_pivot[df_pivot['Area'] == 'Austria']
sweden_data = df_pivot[df_pivot['Area'] == 'Sweden']
denmark_data = df_pivot[df_pivot['Area'] == 'Denmark']

# APPLYING HYPOTHESIS TESTING

# CHECKING: 'Electricity generation - Aggregate fuel - Renewables (%)'

# TEST 1

In [466]:
renewables_col = 'Electricity generation - Aggregate fuel - Renewables (%)'

# Drop NaNs because Shapiro-Wilk cannot handle them
ireland_renewables = ireland_data[renewables_col].dropna()
belgium_renewables = belgium_data[renewables_col].dropna()
austria_renewables = austria_data[renewables_col].dropna()
sweden_renewables = sweden_data[renewables_col].dropna()
denmark_renewables = denmark_data[renewables_col].dropna()

In [468]:
ireland_renewables.describe()

count    35.000000
mean     16.786857
std      14.004380
min       4.030000
25%       5.115000
50%       9.920000
75%      26.450000
max      45.320000
Name: Electricity generation - Aggregate fuel - Renewables (%), dtype: float64

In [470]:
belgium_renewables.describe()

count    35.000000
mean      9.807143
std      10.724813
min       0.730000
25%       0.975000
50%       4.020000
75%      17.830000
max      35.860000
Name: Electricity generation - Aggregate fuel - Renewables (%), dtype: float64

In [None]:
#Checking p values

In [486]:
#from scipy.stats import shapiro

# Shapiro-Wilk Test for Ireland
shapiro_ireland = shapiro(ireland_renewables)
print(f"Ireland normality p-value: {shapiro_ireland.pvalue}")

# Shapiro-Wilk Test for Belgium
shapiro_belgium = shapiro(belgium_renewables)
print(f"Belgium normality p-value: {shapiro_belgium.pvalue}")

# Shapiro-Wilk Test for Austria
shapiro_austria = shapiro(austria_renewables)
print(f"Austria normality p-value: {shapiro_austria.pvalue}")

# Shapiro-Wilk Test for Sweden
shapiro_sweden = shapiro(sweden_renewables)
print(f"Sweden normality p-value: {shapiro_sweden.pvalue}")

# Shapiro-Wilk Test for Denmark
shapiro_denmark = shapiro(denmark_renewables)
print(f"Denmark normality p-value: {shapiro_denmark.pvalue}")

Ireland normality p-value: 6.021202170666885e-05
Belgium normality p-value: 3.720865192801833e-05
Austria normality p-value: 0.5452991997419276
Sweden normality p-value: 0.22279384418674253
Denmark normality p-value: 0.0010723735151860999


In [488]:
if shapiro_austria.pvalue > 0.05:
    print("Fail to reject the null hypothesis. Austria renewable electricity data is consistent with a normal distribution")
else:
    print("Reject the null hypothesis. Austria renewable electricity data is not normally distributed")

Fail to reject the null hypothesis. Austria renewable electricity data is consistent with a normal distribution


In [490]:
#p-value > 0.05: fail to reject normality
#p-value < 0.05: data is not normal

if shapiro_ireland.pvalue > 0.05:
    print("Fail to reject the null hypothesis. Ireland renewable electricity data is consistent with a normal distribution")
else:
    print("Reject the null hypothesis. Ireland renewable electricity data is not normally distributed")

Reject the null hypothesis. Ireland renewable electricity data is not normally distributed


In [492]:
if shapiro_belgium.pvalue > 0.05:
    print("Fail to reject the null hypothesis. Belgium renewable electricity data is consistent with a normal distribution")
else:
    print("Reject the null hypothesis. Belgium renewable electricity data is not normally distributed")

Reject the null hypothesis. Belgium renewable electricity data is not normally distributed


In [494]:
if shapiro_sweden.pvalue > 0.05:
    print("Fail to reject the null hypothesis. Sweden renewable electricity data is consistent with a normal distribution")
else:
    print("Reject the null hypothesis. Sweden renewable electricity data is not normally distributed")

Fail to reject the null hypothesis. Sweden renewable electricity data is consistent with a normal distribution


In [496]:
if shapiro_denmark.pvalue > 0.05:
    print("Fail to reject the null hypothesis. Denmark renewable electricity data is consistent with a normal distribution")
else:
    print("Reject the null hypothesis. Denmark renewable electricity data is not normally distributed")

Reject the null hypothesis. Denmark renewable electricity data is not normally distributed


Since the data is not normal for Ireland and Belgium (nor for Denmark), I will apply non-parametric tests, i.e. Mann-Whitney U, Wilcoxon signed-rank and chi-square.
My data is numerical, so I can apply Mann-Whitney and Wilcoxon directly, but since chi-square requires categorical data, I will have to convert the numerical column into categories first.

Moreover, since the Shapiro-Wilk test did not reject normality for Austria and Sweden, I can apply parametric tests to them.

# TEST 2

Test 2: Mann-Whitney U Test 
Use if data is not normal.
Hypotheses:
H0: The distribution of Renewable Electricity generation % is the same for Ireland and Belgium.
H1: The distribution is different.

In [379]:
#from scipy.stats import mannwhitneyu

u_stat, p_value = mannwhitneyu(ireland_renewables,belgium_renewables)

print("Mann-Whitney U Test (Ireland vs Belgium - Renewable electricity generation- % )")
print(f"U-statistic = {u_stat:.4f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("→ Reject H₀: The distribution of Renewable Electricity generation % is different for Ireland and Belgium.\n")
else:
    print("→ Fail to reject H₀: No significant difference\n")

Mann-Whitney U Test (Ireland vs Belgium - Renewable electricity generation- % )
U-statistic = 861.0000, p-value = 0.0036
→ Reject H₀: The distribution of Renewable Electricity generation % is different for Ireland and Belgium.
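To make the U statistic easier to read, here is a minimal sketch that computes it by direct pair counting. The function name `mann_whitney_u` and the toy values are illustrative assumptions, not part of the notebook or of SciPy; recent SciPy versions report U for the first sample, which is the quantity computed here.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: number of (x, y) pairs where x is larger; ties count half."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Toy values, not taken from the dataset
low = [1.0, 2.0, 3.0]
high = [4.0, 5.0, 6.0]

print(mann_whitney_u(low, high))   # no value in `low` exceeds any value in `high`
print(mann_whitney_u(high, low))   # the maximum possible U is len(x) * len(y)
```

Read this way, and assuming that convention, U = 861 out of a possible 35 × 35 = 1225 pairs means Ireland's renewable share exceeds Belgium's in roughly 70% of all year-to-year pairings.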



# TEST 3

Test 3: Wilcoxon Signed-Rank Test 
Hypotheses:
H0: Over the years 1990 to 2024, the median of the yearly paired differences between Ireland's and Belgium's renewable electricity share = 0
H1: Median difference ≠ 0

In [377]:
#from scipy.stats import wilcoxon
w_stat, w_p = wilcoxon(ireland_renewables,belgium_renewables)
print("Wilcoxon Signed-Rank Test (ireland vs belgium)")
print(f"Statistic = {w_stat:.4f}, p-value = {w_p:.4f}")
if w_p < 0.05:
    print("→ Reject H₀: Significant difference in paired samples\n")
else:
    print("→ Fail to reject H₀: No significant difference\n")

Wilcoxon Signed-Rank Test (ireland vs belgium)
Statistic = 0.0000, p-value = 0.0000
→ Reject H₀: Significant difference in paired samples
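A statistic of exactly 0 has a simple reading, which the sketch below illustrates on toy numbers. It is a hand-rolled version of the two-sided signed-rank statistic (the smaller of the positive and negative rank sums, with zero differences dropped); `signed_rank_statistic` is a hypothetical helper written for illustration, not SciPy's implementation.

```python
def signed_rank_statistic(x, y):
    """Smaller of the positive and negative rank sums of the paired differences."""
    # Zero differences carry no sign information and are dropped
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ordered = sorted(diffs, key=abs)

    # Rank the absolute differences, giving tied values their average rank
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + 1 + j + 1) / 2
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return min(w_plus, w_minus)


# Toy paired samples, not taken from the dataset
always_higher = signed_rank_statistic([5, 6, 7, 8], [1, 3, 2, 4])
mixed = signed_rank_statistic([1, 5, 3], [2, 3, 6])
print(always_higher, mixed)
```

When every non-zero difference has the same sign, one of the two rank sums is empty and the statistic is 0. The pairing is by position, so both series must cover the same years in the same order for the test to compare like with like.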



# TEST 4

Test 4: Chi square test
Used to determine if there is a significant association between two categorical variables.
My data is numerical, so I need to make it categorical by using the combined mean renewable share of both countries as a threshold, i.e. renewable share at or above the mean (High) vs below the mean (Low).

Hypotheses:
H0: There is no association between the country (Ireland or Belgium) and the classification of renewable generation ("High" or "Low"), i.e. the proportions of "High"/"Low" are similar in both countries.

H1: There is an association between the country and the classification of renewable generation, i.e. one country is more likely to be "High" or "Low" than the other.


In [423]:
# Use the combined mean of both countries as a single High/Low threshold
combined_mean = pd.concat([ireland_data[renewables_col], belgium_data[renewables_col]]).mean()

# Classify each year as 'High' or 'Low' against the combined mean.
# The labels are kept in separate Series so the country slices are not modified.
ireland_level = ireland_data[renewables_col].apply(lambda x: 'High' if x >= combined_mean else 'Low')
belgium_level = belgium_data[renewables_col].apply(lambda x: 'High' if x >= combined_mean else 'Low')

# Create a contingency table (2x2)
chi_square_data = pd.DataFrame({
    'High': [
        (belgium_level == 'High').sum(),
        (ireland_level == 'High').sum()
    ],
    'Low': [
        (belgium_level == 'Low').sum(),
        (ireland_level == 'Low').sum()
    ]
}, index=['Belgium', 'Ireland'])

print(chi_square_data)


         High  Low
Belgium    12   23
Ireland    15   20


In [425]:
# Perform Chi-Square test
#from scipy.stats import chi2_contingency
chi2_stat, p_value, df_scipy, expected_scipy = chi2_contingency(chi_square_data)

# Print the results
print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of Freedom: {df_scipy}")
print(f"Expected Values:\n{expected_scipy}")


Chi-Square Statistic: 0.2412
p-value: 0.6234
Degrees of Freedom: 1
Expected Values:
[[13.5 21.5]
 [13.5 21.5]]
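The expected values and the statistic above can be reproduced by hand, which is a useful check on what the test is doing. The sketch below assumes the 2x2 table from this section and applies Yates' continuity correction, which SciPy uses by default for tables with one degree of freedom; `chi_square_2x2` is an illustrative helper, not a library function.

```python
def chi_square_2x2(table, yates=True):
    """Chi-square statistic and expected counts for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    stat = 0.0
    for obs_row, exp_row in zip(table, expected):
        for obs, exp in zip(obs_row, exp_row):
            diff = abs(obs - exp)
            if yates:
                # Continuity correction: shrink each deviation by 0.5, never below 0
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / exp
    return stat, expected


# Rows: Belgium, Ireland; columns: High, Low (counts from the table above)
stat, expected = chi_square_2x2([[12, 23], [15, 20]])
print(round(stat, 4), expected)
```

Each observed count is only 1.5 away from its expected count of 13.5 or 21.5, which is why the statistic is so small.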


In [427]:
#from scipy.stats import chi2
# Significance level (alpha)
alpha = 0.05

# Calculate the critical value from the Chi-Square distribution (inverse CDF)
critical_value = chi2.ppf(1 - alpha, df_scipy)

# Display the critical value
critical_value

3.841458820694124

In [429]:
# Output the results
if chi2_stat > critical_value:
    print("Reject the null hypothesis:There is an association between the country and the classification of renewable generation.")
else:
    print("Fail to reject the null hypothesis:There is no association between the country (Ireland or Belgium) and the classification of renewable generation (High or Low).")

Fail to reject the null hypothesis:There is no association between the country (Ireland or Belgium) and the classification of renewable generation (High or Low).


In [431]:
# Output the results using p-value
if p_value < alpha:
      print("Reject the null hypothesis: There is an association between the country and the classification of renewable generation.")
else:
    print("Fail to reject the null hypothesis:There is no association between the country (Ireland or Belgium) and the classification of renewable generation (High or Low).")

Fail to reject the null hypothesis:There is no association between the country (Ireland or Belgium) and the classification of renewable generation (High or Low).


# TEST 5

Now I want to compare renewable electricity generation across three countries: Ireland, Denmark and Belgium. One-way ANOVA assumes normality, and the data for all three countries failed the Shapiro-Wilk test, so I apply its non-parametric alternative, the Kruskal-Wallis test.

Test 5: Kruskal-Wallis test.

Hypotheses:
Null hypothesis (H₀): The median renewable electricity generation is the same for Ireland, Denmark, and Belgium.
Alternative hypothesis (H₁): At least one country differs in median renewable electricity generation.

In [501]:
#from scipy.stats import kruskal

# Perform Kruskal-Wallis H-test
stat, p_value = kruskal(ireland_renewables, denmark_renewables, belgium_renewables)

# Print the results
print(f"Kruskal-Wallis H-statistic = {stat:.4f}")
print(f"p-value = {p_value:.4f}")

# Interpretation
if p_value < 0.05:
    print("→ Reject H₀: There is a significant difference in renewable electricity generation medians between at least two countries.")
else:
    print("→ Fail to reject H₀: No significant difference in renewable electricity generation medians between countries.")


Kruskal-Wallis H-statistic = 23.2590
p-value = 0.0000
→ Reject H₀: There is a significant difference in renewable electricity generation medians between at least two countries.
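As a rough illustration of what the H statistic measures, the sketch below ranks all observations together and compares the rank sums of the groups. It omits the tie correction that SciPy applies, so it only matches `kruskal` exactly when there are no tied values; `kruskal_h` and the toy groups are assumptions made for this example.

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic without tie correction."""
    pooled = sorted(v for g in groups for v in g)

    # Map each distinct value to its (average) rank in the pooled sample
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j + 1) / 2
        i = j + 1

    n = len(pooled)
    # Sum over groups of (rank sum squared / group size)
    total = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Toy groups, not taken from the dataset
separated = kruskal_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
interleaved = kruskal_h([1, 4], [2, 3])
print(separated, interleaved)
```

H grows as the groups occupy more distinct regions of the pooled ranking. A significant result only says that at least one group differs; it does not identify which one.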


In [None]:
#Checking Austria data

Since the Shapiro-Wilk test did not reject normality for Austria, I can apply parametric tests to it. I want to compare Austria's renewables share before and after 2003. I have chosen 2003 as the threshold because, in 2003, the Austrian Green Electricity Act was passed, a major national push to promote renewable energy.

In [503]:
#import pandas as pd
#from scipy.stats import ttest_ind, levene

# Column in df_pivot that holds the renewables share of generation
renewables_col = 'Electricity generation - Aggregate fuel - Renewables (%)'

# Filter data for Austria
austria_data = df_pivot[df_pivot['Area'] == 'Austria'].copy()

# Drop missing values in the renewables column
austria_data = austria_data.dropna(subset=[renewables_col])

# Split by year: before and after 2003
pre_2003 = austria_data[austria_data['Year'] < 2003][renewables_col]
post_2003 = austria_data[austria_data['Year'] >= 2003][renewables_col]

# 1. Levene’s Test: Check equality of variances
levene_stat, levene_p = levene(pre_2003, post_2003)

print(f"Levene's test: stat={levene_stat:.4f}, p-value={levene_p:.4f}")
equal_var = levene_p > 0.05  # If p > 0.05, assume equal variances




Levene's test: stat=8.2088, p-value=0.0072


Since p < 0.05, we reject the null hypothesis of equal variances, i.e. the variances of the two groups (Austria's renewables share pre-2003 vs. post-2003) are significantly different. The standard t-test assumes equal variances, so we apply Welch's t-test instead.
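For reference, Levene's statistic can be sketched as a one-way ANOVA on absolute deviations from each group's centre. The version below centres on the median, which is SciPy's default for `levene`; the helper name `levene_median` and the toy data are illustrative assumptions.

```python
from statistics import mean, median


def levene_median(*groups):
    """Levene's W statistic using absolute deviations from each group's median."""
    # Absolute deviation of every value from its own group's median
    medians = [median(g) for g in groups]
    z = [[abs(v - m) for v in g] for g, m in zip(groups, medians)]

    z_means = [mean(zi) for zi in z]
    n_total = sum(len(zi) for zi in z)
    k = len(z)
    z_grand = sum(sum(zi) for zi in z) / n_total

    # Between-group and within-group variation of the deviations
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Toy samples: the second has twice the spread of the first
w = levene_median([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(w, 4))
```

A large W means the typical distance from the centre differs between groups, which is the evidence of unequal variances used above.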


Test 6: T Test 
Use an independent t-test to compare the energy data before and after 2003 in Austria. We will use Welch’s t-test (which is a variation of the t-test that does not assume equal population variances).

Hypotheses:

H0: There is no significant difference in Austria’s mean renewable electricity share before and after 2003.
H1: There is a significant difference in Austria’s mean renewable electricity share before and after 2003.

In [505]:
#T-Test 
t_stat, p_val = ttest_ind(pre_2003, post_2003, equal_var=equal_var)

print(f"\nT-test:")
print(f"t-statistic = {t_stat:.4f}, p-value = {p_val:.4f}")

# 3. Hypothesis interpretation
alpha = 0.05
if p_val < alpha:
    print("Reject the null hypothesis: There is a significant difference in renewable electricity share before and after 2003.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in renewable electricity share before and after 2003.")


T-test:
t-statistic = -2.3750, p-value = 0.0244
Reject the null hypothesis: There is a significant difference in renewable electricity share before and after 2003.
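The difference between Welch's test and the standard t-test lies in the standard error and the degrees of freedom. A minimal sketch under the usual definitions follows (Welch-Satterthwaite degrees of freedom); `welch_t` and the toy samples are assumptions for illustration, not the notebook's data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Each sample contributes its own variance; nothing is pooled
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy samples with clearly different variances
t_stat_demo, df_demo = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t_stat_demo, 4), round(df_demo, 4))
```

The degrees of freedom come out fractional and below n1 + n2 - 2, which makes the test more conservative when variances differ. The negative t-statistic in the result above simply reflects the order of the arguments: the pre-2003 mean is lower than the post-2003 mean.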


In [None]:
I want to compare the renewable share for Austria and Sweden using a Z-test, since the Shapiro-Wilk test did not reject normality for either country.

Test 7: Z Test

Hypotheses:
H0: There is no difference in average renewable electricity share between Austria and Sweden.
H1: There is a difference in the average renewable electricity share between Austria and Sweden.

In [507]:
renewables_col = 'Electricity generation - Aggregate fuel - Renewables (%)'

# Drop NaNs
austria = df_pivot[df_pivot['Area'] == 'Austria'][renewables_col].dropna()
sweden = df_pivot[df_pivot['Area'] == 'Sweden'][renewables_col].dropna()


In [509]:
print(len(austria), len(sweden))  # Should be ≥ 30 ideally for each


35 35


In [511]:
# Each sample has more than 30 observations, so a Z-test is applicable
#from scipy.stats import norm
#from scipy import stats
#import numpy as np

# Sample statistics
mean1 = np.mean(austria)
mean2 = np.mean(sweden)
std1 = np.std(austria, ddof=1)
std2 = np.std(sweden, ddof=1)
n1 = len(austria)
n2 = len(sweden)

# Standard error of the difference in means (unpooled: each sample keeps its own variance)
se_diff = np.sqrt((std1**2 / n1) + (std2**2 / n2))

# Z-score
z_stat = (mean1 - mean2) / se_diff

# Two-tailed p-value from the standard normal distribution
p_value_ztest = 2 * (1 - stats.norm.cdf(abs(z_stat)))

print(f"z-statistic: {z_stat:.4f}, p-value (z-test): {p_value_ztest:.4f}")



z-statistic: 10.4924, p-value (z-test): 0.0000


In [513]:
# Decision for z-test (using α = 0.05)
if p_value_ztest < alpha:
    print("Reject the null hypothesis: There is a significant difference in the average renewable electricity share between Austria and Sweden. ")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in the average renewable electricity share between Austria and Sweden.")


Reject the null hypothesis: There is a significant difference in the average renewable electricity share between Austria and Sweden. 


## EXTRA WORK

# CHECKING: 'Electricity demand - Demand - Demand (TWh)'

In [None]:
## Applying Inferential Statistics on Electricity demand

In [516]:
demand_col = 'Electricity demand - Demand - Demand (TWh)'

In [518]:

# Drop NaNs because Shapiro-Wilk cannot handle them
#ireland_renewables = ireland_data[renewables_col].dropna()
ireland_demand = ireland_data[demand_col].dropna()
belgium_demand = belgium_data[demand_col].dropna()
austria_demand = austria_data[demand_col].dropna()
sweden_demand = sweden_data[demand_col].dropna()

Test 1: Shapiro-Wilk Test. Used to check normality of data.
Hypotheses: 
H0: Electricity Demand Data is normally distributed.
H1: Electricity Demand Data is not normally distributed.

In [520]:
#from scipy.stats import shapiro

# Shapiro-Wilk Test for Ireland
shapiro_ireland_d = shapiro(ireland_demand)
print(f"Ireland normality p-value: {shapiro_ireland_d.pvalue}")

# Shapiro-Wilk Test for Belgium
shapiro_belgium_d = shapiro(belgium_demand)
print(f"Belgium normality p-value: {shapiro_belgium_d.pvalue}")

# Shapiro-Wilk Test for Austria
shapiro_austria_d = shapiro(austria_demand)
print(f"Austria normality p-value: {shapiro_austria_d.pvalue}")

# Shapiro-Wilk Test for Sweden
shapiro_sweden_d = shapiro(sweden_demand)
print(f"Sweden normality p-value: {shapiro_sweden_d.pvalue}")

Ireland normality p-value: 0.029562734276114853
Belgium normality p-value: 0.00016787639972897517
Austria normality p-value: 0.0004692344698031188
Sweden normality p-value: 0.651233465978291


In [522]:
#p-value > 0.05: fail to reject normality
#p-value < 0.05: data is not normal

if shapiro_austria_d.pvalue > 0.05:
    print("Fail to reject the null hypothesis. Austria electricity demand data is consistent with a normal distribution")
else:
    print("Reject the null hypothesis. Austria electricity demand data is not normally distributed")

if shapiro_ireland_d.pvalue > 0.05:
    print("Fail to reject the null hypothesis. Ireland electricity demand data is consistent with a normal distribution")
else:
    print("Reject the null hypothesis. Ireland electricity demand data is not normally distributed")

if shapiro_belgium_d.pvalue > 0.05:
    print("Fail to reject the null hypothesis. Belgium electricity demand data is consistent with a normal distribution")
else:
    print("Reject the null hypothesis. Belgium electricity demand data is not normally distributed")

if shapiro_sweden_d.pvalue > 0.05:
    print("Fail to reject the null hypothesis. Sweden electricity demand data is consistent with a normal distribution")
else:
    print("Reject the null hypothesis. Sweden electricity demand data is not normally distributed")

Reject the null hypothesis. Austria electricity demand data is not normally distributed
Reject the null hypothesis. Ireland electricity demand data is not normally distributed
Reject the null hypothesis. Belgium electricity demand data is not normally distributed
Fail to reject the null hypothesis. Sweden electricity demand data is consistent with a normal distribution


Test 2: Mann-Whitney U Test. Use if data is not normal. 
Since Ireland and Belgium data is not normally distributed therefore we can apply Mann-Whitney Test.
Hypotheses:
H0: The distribution of Electricity demand in TWh is the same for Ireland and Belgium.
H1: The distribution is different.

In [524]:
u_stat, p_value = mannwhitneyu(ireland_demand,belgium_demand)

print("Mann-Whitney U Test (Ireland vs Belgium - Electricity demand- TWh)")
print(f"U-statistic = {u_stat:.4f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("→ Reject H₀: The distribution of Electricity demand in TWh is different for Ireland and Belgium.\n")
else:
    print("→ Fail to reject H₀: No significant difference\n")


Mann-Whitney U Test (Ireland vs Belgium - Electricity demand- TWh)
U-statistic = 0.0000, p-value = 0.0000
→ Reject H₀: The distribution of Electricity demand in TWh is different for Ireland and Belgium.



Test 3: Wilcoxon Signed-Rank Test. Use if data is not normal and the observations are paired.
Since the Ireland and Belgium data are not normally distributed and the observations can be paired by year, we can apply the Wilcoxon Signed-Rank Test.
Hypotheses:
H0: Over the years 1990 to 2024, the median of the yearly paired differences between Ireland's and Belgium's electricity demand = 0
H1: Median difference ≠ 0

In [526]:
#from scipy.stats import wilcoxon
w_stat, w_p = wilcoxon(ireland_demand,belgium_demand)
print("Wilcoxon Signed-Rank Test (ireland vs belgium)")
print(f"Statistic = {w_stat:.4f}, p-value = {w_p:.4f}")
if w_p < 0.05:
    print("→ Reject H₀: Significant difference in paired samples\n")
else:
    print("→ Fail to reject H₀: No significant difference in electricity demand\n")

Wilcoxon Signed-Rank Test (ireland vs belgium)
Statistic = 0.0000, p-value = 0.0000
→ Reject H₀: Significant difference in paired samples



Test 4: Chi square test
Used to determine if there is a significant association between two categorical variables.
My data is numerical, so I need to make it categorical by using the combined mean electricity demand of both countries as a threshold, i.e. electricity demand at or above the mean (High) vs below the mean (Low).

Hypotheses:
H0: There is no association between the country (Ireland or Belgium) and electricity demand level (High or Low), i.e. the proportion of High and Low electricity demand years is the same for both countries.
H1: There is an association between the country and electricity demand level, i.e. the proportion of High and Low electricity demand years differs between Ireland and Belgium.

In [546]:
# Use the combined mean of both countries as a single High/Low threshold
combined_mean = pd.concat([ireland_data[demand_col], belgium_data[demand_col]]).mean()

# Classify each year as 'High' or 'Low' against the combined mean.
# The labels are kept in separate Series so the country slices are not modified.
ireland_level = ireland_data[demand_col].apply(lambda x: 'High' if x >= combined_mean else 'Low')
belgium_level = belgium_data[demand_col].apply(lambda x: 'High' if x >= combined_mean else 'Low')

# Create a contingency table (2x2)
chi_square_data = pd.DataFrame({
    'High': [
        (belgium_level == 'High').sum(),
        (ireland_level == 'High').sum()
    ],
    'Low': [
        (belgium_level == 'Low').sum(),
        (ireland_level == 'Low').sum()
    ]
}, index=['Belgium', 'Ireland'])

print(chi_square_data)


         High  Low
Belgium    35    0
Ireland     0   35


From the above table, we can see that the two countries are completely separated: every Belgian year is High and every Irish year is Low, so two of the four observed cells are zero. The usual chi-square rule of thumb concerns expected counts (here 17.5 per cell), so the zeros alone do not invalidate it, but for a 2x2 table Fisher's Exact Test gives an exact p-value without relying on a large-sample approximation and handles zero cells correctly. Therefore, I use Fisher's Exact Test here.

Fisher's Exact Test is a statistical significance test used to determine if there is a non-random association between two categorical variables, especially when the data is in a 2x2 contingency table and some cells have low or zero counts. It is called "exact" because it does not rely on approximations (as the chi-square test does). Instead, it calculates the exact probability of getting a table this extreme (or more extreme) just by chance, under the null hypothesis that the row and column variables are independent.

It is commonly used instead of the chi-square test when sample sizes are small or unbalanced.

Test 5: Fisher's Exact Test
Used to check whether two categorical variables are related.

Hypotheses:
H0: There is no association between country and demand level (i.e., the proportions of High/Low are the same for both countries).
H1: There is an association (i.e., one country has more Highs or more Lows than the other).


In [550]:
#from scipy.stats import fisher_exact

# Convert to array format
table = chi_square_data.values  # shape: (2,2)
# Make sure the columns are in [High, Low] order
oddsratio, p_value = fisher_exact(table)

print(f"Fisher's Exact Test p-value: {p_value:.4f}")
if p_value < 0.05:
    print("→ Reject H₀: Country and demand level are associated i.e. One country has more Highs or more Lows than the other\n")
else:
    print("→ Fail to reject H₀:Country and demand level are not associated i.e. The distribution of High/Low is the same in both countries.\n")

Fisher's Exact Test p-value: 0.0000
→ Reject H₀: Country and demand level are associated i.e. One country has more Highs or more Lows than the other
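The printed p-value of 0.0000 is a rounding effect, not a true zero. The sketch below shows the idea behind the exact calculation using hypergeometric weights from `math.comb`. It is a simplified illustration (`fisher_exact_two_sided` is a hypothetical helper, and SciPy's own implementation differs in details such as the tolerance used when comparing probabilities).

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact p-value for a 2x2 table with fixed margins."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def weight(x):
        # Number of ways to get x in the top-left cell given the margins
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)

    # Add up every table that is as likely as, or less likely than, the observed one
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / comb(n, col1)


# Small toy table with the same complete-separation pattern
p_small = fisher_exact_two_sided([[3, 0], [0, 3]])
# The demand table from this section: Belgium all High, Ireland all Low
p_demand = fisher_exact_two_sided([[35, 0], [0, 35]])
print(p_small, p_demand)
```

With complete separation, only the observed table and its mirror image are that extreme, so the p-value is 2 divided by the number of ways to choose 35 years out of 70, which is far smaller than four decimal places can show.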



Now I want to compare electricity demand across three countries: Ireland, Austria and Belgium. One-way ANOVA assumes normality, and the demand data for all three countries failed the Shapiro-Wilk test, so I apply its non-parametric alternative, the Kruskal-Wallis test.

Test 6: Kruskal-Wallis test.

Hypotheses:
H0: The median electricity demand is the same for Ireland, Austria, and Belgium.
H1: At least one country differs in median electricity demand.

In [278]:
#from scipy.stats import kruskal

# Perform Kruskal-Wallis H-test
stat, p_value = kruskal(ireland_demand, austria_demand, belgium_demand)

# Print the results
print(f"Kruskal-Wallis H-statistic = {stat:.4f}")
print(f"p-value = {p_value:.4f}")

# Interpretation
if p_value < 0.05:
    print("→ Reject H₀: There is a significant difference in electricity demand medians between at least two countries.")
else:
    print("→ Fail to reject H₀: No significant difference in electricity demand medians between countries.")


Kruskal-Wallis H-statistic = 88.9036
p-value = 0.0000
→ Reject H₀: There is a significant difference in electricity demand medians between at least two countries.
