**Reasoning**:
Load the "AirQualityDataset.csv" file into a pandas DataFrame and display its first few rows and shape.



In [None]:
import pandas as pd

df = pd.read_csv('AirQualityDataset.csv')
display(df.head())
print(df.shape)

Unnamed: 0,Country,Status,AQI Value,Code,GHG_Emissions,Forest cover,population_density
0,Albania,Good,9,ALB,7.673672,28.791971,101.14792
1,Andorra,Good,10,AND,,34.042553,176.32765
2,Argentina,Good,20,ARG,365.684619,10.440715,16.754303
3,Armenia,Good,22,ARM,10.836337,11.537408,103.69912
4,Australia,Moderate,63,AUS,571.839849,17.421315,3.506746


(110, 7)
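
The preview already shows a missing `GHG_Emissions` value for Andorra, so it may be worth counting missing values per column before any testing. The sketch below rebuilds only the five preview rows shown above (not the full 110-row file) to illustrate the check; on the real data the same call would be `df.isna().sum()`.

```python
import numpy as np
import pandas as pd

# The five rows from the preview above; the full dataset has 110 rows.
preview = pd.DataFrame({
    "Country": ["Albania", "Andorra", "Argentina", "Armenia", "Australia"],
    "Status": ["Good", "Good", "Good", "Good", "Moderate"],
    "AQI Value": [9, 10, 20, 22, 63],
    "Code": ["ALB", "AND", "ARG", "ARM", "AUS"],
    "GHG_Emissions": [7.673672, np.nan, 365.684619, 10.836337, 571.839849],
    "Forest cover": [28.791971, 34.042553, 10.440715, 11.537408, 17.421315],
    "population_density": [101.14792, 176.32765, 16.754303, 103.69912, 3.506746],
})

# Count missing values per column, keeping only columns that have any.
missing_counts = preview.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

The correlation cells later drop rows with missing values pairwise, so these counts indicate how many rows each test can use.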


In [None]:
continent_df=pd.read_csv("/content/world_population.csv")


In [84]:
country_codes = {
    "Brunei": "BRN",
    "Cape Verde": "CPV",
    "Czech Republic": "CZE",
    "Kosovo": "XKX",
    "Laos": "LAO",
    "Macedonia": "MKD",
    "Moldova": "MDA",
    "Palestinian Territory": "PSE",
    "Reunion": "REU",
    "Russia": "RUS",
    "South Korea": "KOR",
    "Taiwan": "TWN",
    "Vatican": "VAT",
    "Vietnam": "VNM"
}

In [85]:
# prompt: fill the Code column's nan values with using the country_codes map

# Fill missing 'Code' values by looking up the country name in the mapping;
# countries that are not in the mapping keep their NaN
df['Code'] = df['Code'].fillna(df['Country'].map(country_codes))


Unnamed: 0,Country,Status,AQI Value,Code,GHG_Emissions,Forest cover,population_density,Continent
0,Albania,Good,9,ALB,7.673672,28.791971,101.14792,Europe
1,Andorra,Good,10,AND,,34.042553,176.32765,Europe
2,Argentina,Good,20,ARG,365.684619,10.440715,16.754303,South America
3,Armenia,Good,22,ARM,10.836337,11.537408,103.69912,Asia
4,Australia,Moderate,63,AUS,571.839849,17.421315,3.506746,Oceania


In [87]:
# prompt: use continent_df to add a continent column to df (merge using Code in df and CCA3 in continent_df)

df = pd.merge(df, continent_df[['CCA3', 'Continent']], left_on='Code', right_on='CCA3', how='left')
df = df.drop('CCA3', axis=1)
display(df.head())
df.shape

Unnamed: 0,Country,Status,AQI Value,Code,GHG_Emissions,Forest cover,population_density,Continent_x,Continent_y
0,Albania,Good,9,ALB,7.673672,28.791971,101.14792,Europe,Europe
1,Andorra,Good,10,AND,,34.042553,176.32765,Europe,Europe
2,Argentina,Good,20,ARG,365.684619,10.440715,16.754303,South America,South America
3,Armenia,Good,22,ARM,10.836337,11.537408,103.69912,Asia,Asia
4,Australia,Moderate,63,AUS,571.839849,17.421315,3.506746,Oceania,Oceania


(110, 9)
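
The `Continent_x` / `Continent_y` columns in the output above are what pandas produces when the merge cell is re-run on a DataFrame that already has a `Continent` column. A minimal sketch of a re-runnable version follows, using two tiny stand-in frames rather than the real files; `add_continent` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

def add_continent(frame, lookup):
    """Attach a Continent column by ISO3 code; safe to call repeatedly."""
    # Drop leftovers from an earlier run so the merge cannot create _x/_y suffixes.
    stale = [c for c in ("Continent", "Continent_x", "Continent_y") if c in frame.columns]
    frame = frame.drop(columns=stale)
    merged = frame.merge(lookup[["CCA3", "Continent"]],
                         left_on="Code", right_on="CCA3", how="left")
    return merged.drop(columns="CCA3")

# Stand-in data for illustration only.
air = pd.DataFrame({"Country": ["Albania", "Kosovo"], "Code": ["ALB", "XKX"]})
lookup = pd.DataFrame({"CCA3": ["ALB"], "Continent": ["Europe"]})

once = add_continent(air, lookup)
twice = add_continent(once, lookup)
print(twice)
print(list(twice.columns))
```

Because the merge uses `how='left'`, a code with no match in the lookup table keeps its row and gets a missing `Continent`.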

In [9]:
df.to_csv("Final_Dataset.csv",index=False)

In [40]:
# prompt: get the highest 5 rows with population_density in df

highest_population_density_rows = df.nlargest(5, 'population_density')
highest_population_density_rows

Unnamed: 0,Country,Status,AQI Value,Code,GHG_Emissions,Forest cover,population_density,Continent
60,Macao,Good,25,MAC,3.11901,,21877.969,Asia
67,Monaco,Good,32,MCO,,0.0,18385.318,Europe
89,Singapore,Good,39,SGP,74.290132,21.960508,8176.461,Asia
38,Hong Kong,Good,32,HKG,40.167761,,7043.8325,Asia
34,Gibraltar,Good,22,GIB,0.712373,0.0,4009.6,Europe


In [88]:
# prompt: visualize AQI data from df on world map

import plotly.express as px

# Create a choropleth map
fig = px.choropleth(df, locations="Code",
                    color="AQI Value",
                    hover_name="Country",
                    color_continuous_scale=px.colors.sequential.Plasma,
                    title="World Map of AQI Values")

# Display the map
fig.show()

In [30]:
ghg_palette = [
    "#1b9e77",  # CO₂ (greenish - plants/forests)
    "#d95f02",  # CH₄ (orange - methane alerts/fire)
    "#7570b3",  # N₂O (purple - rare gas)
    "#e7298a",  # F-gases (pink - synthetic)
    "#66a61e",  # Land use change (green)
    "#e6ab02",  # Agriculture (mustard)
    "#a6761d",  # Industry (brown)
    "#666666",  # Energy (neutral gray)
    "#1f78b4",  # Transport (blue - mobility)
    "#b2df8a",  # Buildings (soft green)
    "#fb9a99",  # Waste (soft red)
    "#6a3d9a",  # Other (deep purple)
]


In [89]:
# prompt: visualize GHG_Emissions by country on a world map but use a very very very wide colorscale
import plotly.colors as pc
fig = px.choropleth(df, locations="Code",
                    color="GHG_Emissions",
                    hover_name="Country",
                    color_continuous_scale=ghg_palette, # Using a very wide colorscale
                    title="World Map of GHG Emissions")

# Display the map
fig.show()

In [54]:
pop_density_palette = [
    "#ffffb3",  # Light Yellow
    "#ffeda0",  # Sand
    "#feb24c",  # Orange
    "#fd8d3c",  # Deep Orange
    "#fc4e2a",  # Coral Red
    "#e31a1c",  # Red
    "#bd0026",  # Dark Red
    "#800026",  # Maroon
    "#54278f",  # Deep Purple
    "#756bb1",  # Medium Purple
    "#9e9ac8",  # Lavender
    "#f2f0f7"   # Pale Lavender (for contrast cap)
]



In [92]:
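# Exclude extreme densities (and missing values) so outliers do not dominate the colorscale
filtered_df = df[df['population_density'] < 1500]
print(filtered_df.shape)
fig = px.choropleth(filtered_df, locations="Code",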
                    color="population_density",
                    hover_name="Country",
                    color_continuous_scale=pop_density_palette,
                    title="World Map of Population Density by country")

# Display the map
fig.show()

(89, 9)
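
Population density is heavily right-skewed: Macao exceeds 21,000 while 89 of the 110 rows fall below 1,500, so a linear color scale compresses most of the map into one band. One alternative to filtering, sketched here as an assumption rather than what the notebook does, is to color by the base-10 logarithm. The rows are the five densest entries from the `nlargest` output plus Australia from the preview; `log10_density` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Densities copied from the outputs above (top five plus Australia).
density = pd.DataFrame({
    "Code": ["MAC", "MCO", "SGP", "HKG", "GIB", "AUS"],
    "population_density": [21877.969, 18385.318, 8176.461, 7043.8325, 4009.6, 3.506746],
})

# log10 spreads values spanning several orders of magnitude more evenly.
density["log10_density"] = np.log10(density["population_density"])
print(density.round(3))

# The transformed column could then be passed as color="log10_density"
# to px.choropleth in place of the raw density.
```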


In [93]:
# Create a choropleth map for forest cover
fig = px.choropleth(df,
                    locations="Code",  # Country codes (e.g., ISO3)
                    color="Forest cover", # Column containing forest cover data
                    hover_name="Country", # Country names for hover tooltip
                    color_continuous_scale=px.colors.sequential.Greens, # Use a green color scale
                    title="World Map of Forest Cover by Country",
                    labels={'Forest cover': 'Forest Cover (%)'}) # Label for the color bar (key must match the column name)

# Display the map
fig.show()

In [71]:
# Define Research Questions, Null Hypotheses, and Alternative Hypotheses

hypotheses = []

# Hypothesis 1
hypotheses.append({
    'research_question': 'Does increased GHG emission correlate with higher AQI values?',
    'h0': 'There is no correlation between AQI value and GHG emissions.',
    'h1': 'There is a positive correlation between AQI value and GHG emissions.'
})


# Hypothesis 2
hypotheses.append({
    'research_question': 'Does increased forest cover correlate with lower AQI values?',
    'h0': 'There is no correlation between AQI value and forest cover.',
    'h1': 'There is a negative correlation between AQI value and forest cover.'
})



# Hypothesis 3
hypotheses.append({
    'research_question': 'Does increased population density correlate with higher AQI values?',
    'h0': 'There is no correlation between AQI value and population density.',
    'h1': 'There is a positive correlation between AQI value and population density.'
})

# Hypothesis 4
hypotheses.append({
    'research_question': 'Do mean AQI values differ significantly across continents?',
    'h0': 'There is no difference in mean AQI values among different continents.',
    'h1': 'There is a significant difference in mean AQI values among different continents.',
})




# Display the hypotheses
for hypothesis in hypotheses:
    print(f"Research Question: {hypothesis['research_question']}")
    print(f"H0: {hypothesis['h0']}")
    print(f"H1: {hypothesis['h1']}")
    print("-" * 50)

Research Question: Does increased GHG emission correlate with higher AQI values?
H0: There is no correlation between AQI value and GHG emissions.
H1: There is a positive correlation between AQI value and GHG emissions.
--------------------------------------------------
Research Question: Does increased forest cover correlate with lower AQI values?
H0: There is no correlation between AQI value and forest cover.
H1: There is a negative correlation between AQI value and forest cover.
--------------------------------------------------
Research Question: Does increased population density correlate with higher AQI values?
H0: There is no correlation between AQI value and population density.
H1: There is a positive correlation between AQI value and population density.
--------------------------------------------------
Research Question: Do mean AQI values differ significantly across continents?
H0: There is no difference in mean AQI values among different continents.
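H1: There is a significant difference in mean AQI values among different continents.
--------------------------------------------------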

## Model training

### Subtask:
Test the hypotheses defined in the previous step by performing correlation analysis.


**Reasoning**:
Compute the Pearson correlation between AQI value and each predictor (GHG emissions, forest cover, population density), dropping rows with missing values pairwise, and store the results.



In [94]:
from scipy.stats import pearsonr

results = {}
filtered_df1=df.dropna(subset=['AQI Value', 'GHG_Emissions'])
filtered_df2=df.dropna(subset=['AQI Value', 'Forest cover'])
filtered_df3=df.dropna(subset=['AQI Value', 'population_density'])
# Hypothesis 1: AQI vs GHG Emissions
correlation, p_value = pearsonr(filtered_df1['AQI Value'], filtered_df1['GHG_Emissions'])
results['hypothesis_1'] = {'correlation': correlation, 'p_value': p_value}


# Hypothesis 2: AQI vs Forest Cover
correlation, p_value = pearsonr(filtered_df2['AQI Value'], filtered_df2['Forest cover'])
results['hypothesis_2'] = {'correlation': correlation, 'p_value': p_value}



# Hypothesis 3: AQI vs Population Density
correlation, p_value = pearsonr(filtered_df3['AQI Value'], filtered_df3['population_density'])
results['hypothesis_3'] = {'correlation': correlation, 'p_value': p_value}


# Print the results
for hypothesis, result in results.items():
    print(f"Hypothesis {hypothesis}:")
    print(f"  Correlation: {result['correlation']:.4f}")
    print(f"  P-value: {result['p_value']:.4f}")
    print("-" * 20)

Hypothesis hypothesis_1:
  Correlation: 0.5060
  P-value: 0.0000
--------------------
Hypothesis hypothesis_2:
  Correlation: -0.0201
  P-value: 0.8440
--------------------
Hypothesis hypothesis_3:
  Correlation: -0.1000
  P-value: 0.3351
--------------------
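
The alternative hypotheses for Hypotheses 1 to 3 are directional, whereas `pearsonr` reports a two-sided p-value by default. A sketch of a directional test follows, assuming SciPy 1.9 or later (where `pearsonr` accepts an `alternative` argument). The data are synthetic and only illustrate the call; they are not the notebook's values.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic, positively related data for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

r, p_two_sided = pearsonr(x, y)
_, p_greater = pearsonr(x, y, alternative="greater")  # H1: r > 0
_, p_less = pearsonr(x, y, alternative="less")        # H1: r < 0

print(f"r = {r:.4f}")
print(f"two-sided p = {p_two_sided:.4g}")
print(f"one-sided p (greater) = {p_greater:.4g}")
print(f"one-sided p (less) = {p_less:.4g}")
```

With a negative sample correlation such as the one for Hypothesis 3 (-0.1000), the `greater` p-value would exceed 0.5, so the directional conclusion for that hypothesis would still be a failure to reject H0.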


In [103]:
# prompt: perform anova test on Continents and their mean AQI Value in df
import numpy as np
import scipy.stats as st

# Perform ANOVA test for Hypothesis 4
# Create a list of AQI values for each continent
continent_groups = df.groupby('Continent')['AQI Value'].apply(list)
#print(continent_groups.values)
# Perform ANOVA test
f_statistic, p_value = st.f_oneway(*continent_groups.values)

# Store the results
results['hypothesis_4'] = {'f_statistic': f_statistic, 'p_value': p_value}

# Print the result for Hypothesis 4
print("Hypothesis 4:")
print(f"  F-statistic: {results['hypothesis_4']['f_statistic']:.4f}")
print(f"  P-value: {results['hypothesis_4']['p_value']:.4f}")
print("-" * 20)

Hypothesis 4:
  F-statistic: 2.7460
  P-value: 0.0227
--------------------
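
For reference, the F-statistic returned by `f_oneway` is the ratio of the between-group mean square to the within-group mean square. The sketch below recomputes it with NumPy on three small made-up groups (not the continent data) and compares it with SciPy's result; `one_way_f` is a hypothetical helper.

```python
import numpy as np
from scipy.stats import f_oneway

def one_way_f(groups):
    """Return the one-way ANOVA F-statistic for a list of 1-D samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n = len(groups), all_values.size
    # Between-group sum of squares: group sizes times squared mean offsets.
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared deviations from each group's mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up samples for illustration only.
samples = [[20, 25, 31, 40], [55, 61, 48], [30, 35, 33, 29, 41]]
f_manual = one_way_f(samples)
f_scipy, p_value = f_oneway(*samples)
print(f"manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}, p = {p_value:.4f}")
```

Note that `groupby('Continent')` drops rows whose continent is missing by default, so such rows do not enter the ANOVA.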


In [99]:
df[df.Continent.isna()]

Unnamed: 0,Country,Status,AQI Value,Code,GHG_Emissions,Forest cover,population_density,Continent
52,Kosovo,Good,25,XKX,,,,


In [101]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
filtered_df=df[df.Code!="XKX"]
tukey = pairwise_tukeyhsd(filtered_df['AQI Value'], filtered_df['Continent'], alpha=0.05)
print(tukey.summary())

        Multiple Comparison of Means - Tukey HSD, FWER=0.05        
    group1        group2    meandiff p-adj   lower    upper  reject
-------------------------------------------------------------------
       Africa          Asia  23.1714 0.5498 -16.9466 63.2894  False
       Africa        Europe  -1.5957    1.0 -40.7493 37.5579  False
       Africa North America   7.1111 0.9984 -42.6331 56.8553  False
       Africa       Oceania     14.0 0.9917 -55.3067 83.3067  False
       Africa South America  33.1429 0.4598 -19.8401 86.1258  False
         Asia        Europe -24.7672 0.0256 -47.6236 -1.9108   True
         Asia North America -16.0603 0.8266 -54.3212 22.2006  False
         Asia       Oceania  -9.1714  0.998 -70.7573 52.4145  False
         Asia South America   9.9714 0.9835 -32.4149 52.3578  False
       Europe North America   8.7069 0.9839 -28.5416 45.9553  False
       Europe       Oceania  15.5957  0.976 -45.3663 76.5578  False
       Europe South America  34.7386 0.1549  -6.

## Model evaluation

### Subtask:
Evaluate the statistical significance of the test results obtained in the previous step.


**Reasoning**:
Compare each p-value with the significance level (alpha = 0.05) to decide whether to reject the null hypothesis. Store the decisions in a new dictionary and print them.



In [106]:
alpha = 0.05
hypothesis_tests = {}
for hypothesis, result in results.items():
    # Compare each p-value with the significance level
    if result['p_value'] < alpha:
        hypothesis_tests[hypothesis] = "Reject H0"
    else:
        hypothesis_tests[hypothesis] = "Fail to Reject H0"

    print(f"Hypothesis {hypothesis}:")
    print(f"  Test Result: {hypothesis_tests[hypothesis]}")
    # Correlation tests store 'correlation'; the ANOVA stores 'f_statistic'
    if 'correlation' in result:
        print(f"  Correlation: {result['correlation']:.4f}")
    else:
        print(f"  F-statistic: {result['f_statistic']:.4f}")
    print(f"  P-value: {result['p_value']:.4f}")
    print("-" * 20)

Hypothesis hypothesis_1:
  Test Result: Reject H0
  Correlation: 0.5060
  P-value: 0.0000
--------------------
Hypothesis hypothesis_2:
  Test Result: Fail to Reject H0
  Correlation: -0.0201
  P-value: 0.8440
--------------------
Hypothesis hypothesis_3:
  Test Result: Fail to Reject H0
  Correlation: -0.1000
  P-value: 0.3351
--------------------
Hypothesis hypothesis_4:
  Test Result: Reject H0
  F-statistic: 2.7460
  P-value: 0.0227
--------------------


## Data visualization

### Subtask:
Visualize the relationships between AQI value and GHG emissions, forest cover, and population density.


**Reasoning**:
Visualize the relationships between AQI value and the predictor variables using scatter plots, with the correlation coefficient and p-value shown in each title, and compare AQI across continents with a box plot.



In [111]:
# prompt: plot some visualizations regarding our df using our columns and hypothesis tests

# Scatter plot for Hypothesis 1: AQI vs GHG Emissions
# (rows with GHG_Emissions >= 2500 are hidden for readability; the title statistics use all rows)
fig1 = px.scatter(filtered_df1[filtered_df1.GHG_Emissions<2500], x='GHG_Emissions', y='AQI Value',
                 title=f"AQI Value vs. GHG Emissions (Correlation: {results['hypothesis_1']['correlation']:.4f}, p-value: {results['hypothesis_1']['p_value']:.4f})")
fig1.show()

# Scatter plot for Hypothesis 2: AQI vs Forest Cover
fig2 = px.scatter(filtered_df2, x='Forest cover', y='AQI Value',
                 title=f"AQI Value vs. Forest Cover (Correlation: {results['hypothesis_2']['correlation']:.4f}, p-value: {results['hypothesis_2']['p_value']:.4f})")
fig2.show()

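# Scatter plot for Hypothesis 3: AQI vs Population Density
# (rows with population_density >= 1000 are hidden for readability; the title statistics use all rows)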
fig3 = px.scatter(filtered_df3[filtered_df3["population_density"]<1000], x='population_density', y='AQI Value',
                 title=f"AQI Value vs. Population Density (Correlation: {results['hypothesis_3']['correlation']:.4f}, p-value: {results['hypothesis_3']['p_value']:.4f})")
fig3.show()

# Box plot for Hypothesis 4: AQI across Continents
fig4 = px.box(df.dropna(subset=['AQI Value', 'Continent']), x='Continent', y='AQI Value',
              title="AQI Value Distribution Across Continents")
fig4.show()

## Summary:

### Q&A
* **Does increased GHG emission correlate with higher AQI values?** Yes. The Pearson test shows a statistically significant positive correlation between GHG emissions and AQI values (r = 0.5060, p < 0.0001).
* **Does increased forest cover correlate with lower AQI values?** No. The analysis did not find a statistically significant correlation between forest cover and AQI values (r = -0.0201, p = 0.8440).
* **Does increased population density correlate with higher AQI values?** No. The analysis did not find a statistically significant correlation between population density and AQI values (r = -0.1000, p = 0.3351).
* **Do mean AQI values differ significantly across continents?** Yes. The ANOVA test shows a statistically significant difference in mean AQI values across continents (F = 2.7460, p = 0.0227). A follow-up Tukey's HSD test found a significant difference between Asia and Europe (mean difference -24.7672, adjusted p = 0.0256).

### Data Analysis Key Findings
* **Significant positive correlation between GHG emissions and AQI:** Higher GHG emissions are associated with higher AQI values (correlation coefficient of approximately 0.51, p < 0.05).
* **No significant correlation between forest cover and AQI:** The analysis did not reveal a statistically significant relationship between forest cover and AQI (r ≈ -0.02, p ≈ 0.84).
* **No significant correlation between population density and AQI:** No statistically significant relationship was observed between population density and AQI (r ≈ -0.10, p ≈ 0.34).
* **Significant differences in AQI values across continents:** The ANOVA test found a significant difference in mean AQI values across continents (F ≈ 2.75, p ≈ 0.023).
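
To put the strength of these correlations in context, squaring each coefficient gives the share of variance in AQI that a simple linear fit on that variable would account for. A small worked calculation, using the coefficients printed in the correlation output:

```python
# Pearson coefficients copied from the correlation output above.
correlations = {
    "GHG_Emissions": 0.5060,
    "Forest cover": -0.0201,
    "population_density": -0.1000,
}

# r squared is the fraction of variance explained by a simple linear fit.
r_squared = {name: r ** 2 for name, r in correlations.items()}
for name, value in r_squared.items():
    print(f"{name}: r^2 = {value:.4f}")
```

On these figures, GHG emissions account for roughly a quarter of the variance in AQI in this sample, and the other two variables for about 1% or less.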


