# Assignment 1
In this assignment, you need to define at least one task based on each of the exercises 1, 7, and 8.

Hand in the assignment as a Jupyter notebook, together with your environment and the dataset you picked, all zipped together and submitted as one file. Please name your file so that it contains your group number. It is important that you clearly state the tasks you are performing on the dataset, as questions or something similar, in the notebook before you do the operations on the data. Also make sure to document your solutions and your thinking so that they can easily be followed. If you fail to do these things, you may not pass this assignment.

The deadline for this assignment is April 12, 2025 to get bonus points, or before the exam (in which case no bonus points are awarded).

Re-submission 1 is by the end of week 33, 2025.

Re-submission 2 is by the end of week 2, 2026.

## Exercise 1 tasks
- Data exploration
- Data preprocessing
- Combining datasets

We chose Data exploration. We will look at the different sectors and gases present in the dataset, the years covered, and whether there are any missing values.

## Exercise 7 tasks
- Filtering, arranging, selecting, mutating
- Pipes, grouping, and summarising
- Dates and date-times

## Exercise 8 tasks
- Tidy data and pivoting
- Relational data, _aka_ combining datasets
- Strings, factors, and advanced operations

## Open the CSV
Read the CSV and print its contents to check that it loaded correctly.

In [None]:
import pandas as pd

# Read the data from the csv and print it to be sure it worked
df = pd.read_csv('../Total air emissions by greenhouse gas.csv', na_values="..")
print(df.to_string())
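
The `na_values=".."` argument is what turns the `..` placeholders in the file into `NaN`. Here is a minimal sketch on a made-up two-row CSV. The rows and numbers are hypothetical and not taken from the dataset; only the column layout mirrors it.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; ".." stands for a missing measurement
csv_text = (
    "greenhouse gas,sector,1990,1991\n"
    "Gas A (kt),WASTE,10.5,..\n"
    "Gas B (t),WASTE,..,3.0\n"
)

# na_values=".." makes pandas parse ".." as NaN, so the year columns stay numeric
sample = pd.read_csv(io.StringIO(csv_text), na_values="..")
print(sample.dtypes)
print(sample.isna().sum())
```

Without `na_values`, the year columns would most likely be read as text, and the numeric operations later in the notebook would not work as intended.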

# Data exploration
## Check for missing values
Check whether there are missing values, then count them in total, by column, and by row.
Add a visual representation with a heatmap.

In [None]:
# Are there any missing values?
has_missing_value = pd.isnull(df).any().any()
print(f"Does the dataframe have missing values? Answer: {has_missing_value}")

In [None]:
# Total missing values
tot_na = pd.isna(df).sum().sum()
print(f"Total missing values: {tot_na}\n")

In [None]:
# Missing values by column (sum() without an axis counts down each column)
col_na = pd.isna(df).sum()
col_na

In [None]:
# Missing values by row
missing_per_row = df.isnull().sum(axis=1)
missing_per_row

In [None]:
# Heatmap of missing values
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cmap="viridis", cbar=False)
plt.show()
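
As a numeric complement to the heatmap, the share of missing values per column can be computed by taking the mean of the boolean mask. This is a sketch on a hypothetical toy frame, not on the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps
toy = pd.DataFrame({
    "sector": ["A", "B", "C", "D"],
    "1990": [1.0, np.nan, 3.0, np.nan],
    "1991": [1.0, 2.0, np.nan, 4.0],
})

# isna() gives True/False; the mean of booleans is the fraction of True values
missing_share = toy.isna().mean() * 100
print(missing_share.sort_values(ascending=False))
```

Applying the same expression to `df` would give the percentage of missing entries for each year column.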


## Different gases and sectors
Print all different gases and sectors

In [None]:
# Create function for better readability
def print_list(title, lst):
    print(title)
    for i in lst:
        print(f"\t{i}")

print_list("Unique greenhouse gases:", df["greenhouse gas"].unique())
print_list("Unique sectors:", df["sector"].unique())


## Group
How many kg / t / kt / kt CO2-eqv. are produced by each sector?  
There are 4 different units (kt CO2-eqv., kt, t, kg), so we plot them separately.

There are already rows for the total by sector, so we will use them.

In [None]:
def plot(df, unit):
    grouped_unit = df[df["greenhouse gas"].str.contains(unit, case=False, na=False)]
    grouped_unit = grouped_unit.set_index("greenhouse gas")

    # Transpose so years are on the X-axis
    grouped_unit.T.plot.area(figsize=(14, 6), colormap="viridis", alpha=0.8)

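    # Strip the regex escapes so the unit reads cleanly on the axis label
    # (a backslash inside an f-string expression is a SyntaxError before Python 3.12)
    unit_label = unit.replace("\\", "")
    plt.title("Emissions by Gas Over Time")
    plt.xlabel("Year")
    plt.ylabel(f"Emissions {unit_label}")
    plt.legend(title="Greenhouse gas", loc="upper right")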
    plt.grid(alpha=0.3)
    plt.show()

# Remove total to not count it twice
rm_tot = df[df["greenhouse gas"] != "Total Greenhouse Gases (kt CO2-eqv.)"]

# Group by 'greenhouse gas' and sum numeric columns
grouped = rm_tot.groupby(["greenhouse gas"], as_index=False).sum(numeric_only=True)

plot(grouped, r"\(kg\)")
plot(grouped, r"\(t\)")
plot(grouped, r"\(kt\)")
plot(grouped, r"\(kt CO2-eqv.\)")
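
The unit filter above relies on escaped parentheses so that `(kt)` does not also match `(kt CO2-eqv.)`. The sketch below illustrates this and shows one possible alternative: extracting the unit into its own value. Apart from the total label, the gas labels are hypothetical examples.

```python
import pandas as pd

# Hypothetical labels following the "name (unit)" pattern used in the dataset
labels = pd.Series([
    "Gas A (kt)",
    "Gas B (t)",
    "Total Greenhouse Gases (kt CO2-eqv.)",
    "Gas C (kg)",
])

# With the parentheses escaped, only the exact "(kt)" unit matches
is_kt = labels.str.contains(r"\(kt\)", case=False, na=False)
print(is_kt.tolist())

# Alternative: capture whatever sits inside the final pair of parentheses
units = labels.str.extract(r"\(([^()]*)\)\s*$", expand=False)
print(units.tolist())
```

With a separate unit column, the four plots could be produced by looping over the unique units instead of listing each pattern by hand.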

In [None]:
# Only keep total
df_filtered = df[df["greenhouse gas"] == "Total Greenhouse Gases (kt CO2-eqv.)"]

df_filtered = df_filtered.drop(columns=["greenhouse gas"])

# Drop the national total: it contains positive and negative values,
# which can't be plotted on a stacked area chart
df_filtered_no_nt = df_filtered[
    df_filtered["sector"] != "NATIONAL TOTAL (including LULUCF, excluding international transports)"
]
df_filtered = df_filtered.set_index("sector")
df_filtered_no_nt = df_filtered_no_nt.set_index("sector")
# print(df_filtered_no_nt.to_string())

# Transpose so years are on the X-axis
df_filtered_no_nt.T.plot.area(figsize=(14, 6), colormap="viridis", alpha=0.8)

# Formatting
plt.title("Emissions by Sector Over Time")
plt.xlabel("Year")
plt.ylabel("Emissions")
plt.legend(title="Sector", loc="upper right")
plt.grid(alpha=0.3)
plt.show()

### Isolate waste data

We want to isolate the waste data and find out which gases the waste sector emits, then look at the evolution between 1990 and 2023.

In [None]:
df_waste_total = df[df["sector"] == "WASTE, TOTAL"].sort_values(by="2023", ascending=False)
df_waste_total


#### See the evolution between 1990 and 2023

In [None]:
df_waste_total_only_1990_2023 = df_waste_total[['greenhouse gas', 'sector','1990', '2023']]
df_waste_total_only_1990_2023

In [None]:
df_waste_total_only_1990_2023_with_evolution = df_waste_total_only_1990_2023.copy()

# Calculate the percent change between 1990 and 2023
df_waste_total_only_1990_2023_with_evolution["percent_change"] = (
    (df_waste_total_only_1990_2023_with_evolution["2023"]
     - df_waste_total_only_1990_2023_with_evolution["1990"])
    / df_waste_total_only_1990_2023_with_evolution["1990"]
) * 100

# Fill NaN values with 0
df_waste_total_only_1990_2023_with_evolution = df_waste_total_only_1990_2023_with_evolution.fillna(0)

# Drop rows where 1990 or 2023 is 0 (this also removes infinite percent changes from dividing by zero)
df_waste_total_only_1990_2023_with_evolution = df_waste_total_only_1990_2023_with_evolution[
    (df_waste_total_only_1990_2023_with_evolution["1990"] != 0) &
    (df_waste_total_only_1990_2023_with_evolution["2023"] != 0)
]

# Sort by percent change, largest increase first
df_waste_total_only_1990_2023_with_evolution.sort_values(by="percent_change", ascending=False)
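
To make the percent-change formula and the role of the zero filter concrete, here is a small worked sketch with hypothetical numbers (the gas names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical emissions for three gases
toy = pd.DataFrame({
    "greenhouse gas": ["gas A", "gas B", "gas C"],
    "1990": [200.0, 0.0, 50.0],
    "2023": [150.0, 10.0, np.nan],
})

# percent change = (new - old) / old * 100
toy["percent_change"] = (toy["2023"] - toy["1990"]) / toy["1990"] * 100
# gas A: (150 - 200) / 200 * 100 = -25
# gas B: division by zero gives inf
# gas C: missing 2023 value gives NaN

# Keep only rows where the change is well defined
valid = toy[(toy["1990"] != 0) & toy["2023"].notna()]
print(valid)
```

This shows why rows with a zero or missing value in either year are dropped before plotting: their percent change is either infinite or undefined.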

#### Plotting the evolution

In [None]:
# Plot the percent change

df_waste_total_only_1990_2023_with_evolution.set_index("greenhouse gas", inplace=True)
df_waste_total_only_1990_2023_with_evolution["percent_change"].plot(kind="barh", figsize=(10, 8), color="mediumseagreen")
plt.xlabel("Percent change (%)")
plt.ylabel("Greenhouse Gas")
plt.title("Percent change in waste emissions from 1990 to 2023")
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


## Describe
Summary statistics of the totals, by year and by sector

In [None]:
print(df_filtered.describe().to_string())
print()
print(df_filtered.T.describe().to_string())

# Preprocessing
## Create row total
Compute the total gas emitted since 1990

In [None]:
years = [str(i) for i in range(1990, 2024)]  # List of column names as strings
df["Total"] = df[years].sum(axis=1)
df
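
Note that `sum(axis=1)` skips `NaN` by default, so a row with no data at all gets a total of 0 rather than `NaN`. The sketch below, on a hypothetical two-row frame, shows the default behaviour next to the `min_count` option, which is one possible way to keep such rows marked as missing:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the second row has no measurements
toy = pd.DataFrame({"1990": [1.0, np.nan], "1991": [2.0, np.nan]})

# Default: NaN values are skipped, so the empty row sums to 0.0
default_total = toy.sum(axis=1)

# min_count=1 requires at least one real value, otherwise the result is NaN
strict_total = toy.sum(axis=1, min_count=1)

print(default_total.tolist())
print(strict_total.tolist())
```

Which behaviour is preferable depends on whether a missing value should be read as "no emissions" or as "not reported".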

### Replace null values

In [None]:
df = df.fillna(0)
df

## Get the top 25% of gas-producing sectors for each greenhouse gas

In [None]:
gas_quantile = df.groupby("greenhouse gas")["Total"].quantile(0.75)  # 75th percentile of "Total" for each gas
is_above_cutoff = lambda x: x["Total"] >= gas_quantile[x["greenhouse gas"]]  # Cutoff function, see ex1, 16
df["Top producer"] = df.apply(is_above_cutoff, axis=1)  # Flag rows at or above their gas's cutoff
df
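
The row-wise `apply` above can also be written without a lambda over rows, using `groupby(...).transform` to broadcast each gas's cutoff back onto its rows. This is only an alternative sketch on hypothetical data, not the method used in the exercise:

```python
import pandas as pd

# Hypothetical totals for two gases across four sectors each
toy = pd.DataFrame({
    "greenhouse gas": ["gas A"] * 4 + ["gas B"] * 4,
    "sector": list("abcdabcd"),
    "Total": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# transform returns one cutoff per row, aligned with the original index
cutoff = toy.groupby("greenhouse gas")["Total"].transform(lambda s: s.quantile(0.75))

# Vectorised comparison instead of a row-by-row apply
toy["Top producer"] = toy["Total"] >= cutoff
print(toy)
```

On large frames, the vectorised comparison is usually faster than `df.apply(..., axis=1)`, and it gives the same flags.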