# Berlin City-Wide Emissions (2023) — Exploratory Data Review

This notebook takes a structured first look at the dataset in `CSV data/2023_City_Wide_Emissions_Berlin.csv`.
It covers loading the file, inspecting the schema, counting missing values, listing unique values in key columns,
summary statistics, duplicate detection, separating categorical from numerical columns, and two simple visualizations.

## 1. Imports
We import common data-analysis libraries: `pandas` for dataframes, `numpy` for numerics, and `matplotlib`/`seaborn` for plotting. `pathlib.Path` builds the file path, and `IPython.display.display` renders tables from inside loops.

In [None]:
# Essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from IPython.display import display

# Display options for readability
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 120)
sns.set_theme(style='whitegrid')

print('pandas:', pd.__version__)
print('numpy :', np.__version__)

## 2. Load the CSV and preview rows
We construct the path and read the CSV. Then we preview the first few rows to get a quick sense of the data.

In [None]:
# Path to the CSV (relative to this notebook)
csv_path = Path('CSV data') / '2023_City_Wide_Emissions_Berlin.csv'
print('Loading from:', csv_path)

# Read the CSV; if special characters fail to decode, pass encoding=... (for example 'latin-1')
df = pd.read_csv(csv_path)

# Show the first 10 rows (adjust to 5 if you prefer)
df.head(10)
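
If the default UTF-8 decoding fails, one option is to try a short list of encodings in turn. The sketch below is only an illustration: the helper name `read_csv_with_fallback`, the encoding list, and the temporary demo file are assumptions and not part of this notebook or the Berlin dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: return (dataframe, encoding) for the first encoding that decodes."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise ValueError(f"None of {encodings} could decode {path}") from last_error


# Toy file, invented for this demo: one value contains a non-ASCII character
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_bytes("district,value\nKöpenick,1\n".encode("cp1252"))
    demo_df, used_encoding = read_csv_with_fallback(demo_path)

print("Decoded with:", used_encoding)
print(demo_df)
```

Keep the list short and ordered from strictest to most permissive: an encoding such as `latin-1` accepts any byte sequence, so it never raises and can silently produce wrong characters.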

## 3. Columns and data types
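We list all columns and their inferred data types (`dtypes`).
This helps identify text (object), numeric, datetime, and categorical-like fields.
Note that `read_csv` does not parse dates unless asked to (for example via `parse_dates=`), so date columns initially show up as text.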

In [None]:
print(f'Columns ({len(df.columns)}):')
print(list(df.columns))

# Show data types in a tidy table
dtype_table = df.dtypes.to_frame(name='dtype')
dtype_table
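
A column that looks numeric can still be inferred as text when a few cells hold stray strings. As a rough check, the sketch below measures what share of each text column `pd.to_numeric` can parse. The toy data and the helper name `numeric_parse_share` are invented for illustration and say nothing about the actual file.

```python
import io

import pandas as pd

# Toy data, invented for illustration; not taken from the Berlin file
toy_csv = io.StringIO("sector,amount\nenergy,10.5\ntransport,unknown\nwaste,3\n")
toy = pd.read_csv(toy_csv)


def numeric_parse_share(series):
    """Hypothetical helper: fraction of non-null values that parse as numbers."""
    non_null = series.dropna()
    if non_null.empty:
        return 0.0
    parsed = pd.to_numeric(non_null, errors="coerce")  # unparseable values become NaN
    return float(parsed.notna().mean())


# Columns that pandas did not infer as numeric
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
shares = {c: numeric_parse_share(toy[c]) for c in text_cols}
print(shares)
```

A share close to 1 suggests a numeric column polluted by a few placeholder strings; a share of 0 suggests a genuine text column.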

## 4. Shape (rows and columns)
We check the overall size of the dataset as `(n_rows, n_columns)`.

In [None]:
print('Shape:', df.shape)
print('Number of rows   :', df.shape[0])
print('Number of columns:', df.shape[1])

## 5–6. Missing values per column and total
We compute missing (null) counts per column and overall, including percentage per column to gauge severity.

In [None]:
null_counts = df.isna().sum().sort_values(ascending=False)
null_pct = (df.isna().mean() * 100).round(2)
missing_summary = pd.concat([null_counts, null_pct], axis=1)
missing_summary.columns = ['null_count', 'null_pct']
display(missing_summary)  # a bare expression is only rendered when it is the last line of a cell

print('Total null values in dataframe:', int(null_counts.sum()))
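
The same per-column summary can be tried on a small frame where the expected counts are easy to verify by eye. This is a sketch on invented toy data; it additionally filters to columns that actually contain nulls, which is an optional step not used in the cell above.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "sector": ["energy", None, "waste", "transport"],
    "amount": [10.5, np.nan, np.nan, 2.0],
    "year": [2023, 2023, 2023, 2023],
})

summary = pd.DataFrame({
    "null_count": toy.isna().sum(),
    "null_pct": (toy.isna().mean() * 100).round(2),
}).sort_values("null_count", ascending=False)

# Keep only columns that have at least one missing value
summary = summary[summary["null_count"] > 0]
print(summary)
print("Total nulls:", int(toy.isna().sum().sum()))
```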

## 7. Unique values in key columns
We attempt to locate columns commonly used in emissions datasets (e.g., city, country, year, scope, emission type)
and display their unique values. If a column is not present, we skip it.

In [None]:
# Helper to find columns by candidate names:
# exact case-insensitive match first, then case-insensitive substring match
def find_columns(candidates):
    cols_lower = {c.lower(): c for c in df.columns}  # map lower->actual
    found = []
    for cand in candidates:
        cand = cand.lower()  # normalise so callers need not lower-case candidates
        # direct exact lower-case match
        if cand in cols_lower:
            found.append(cols_lower[cand])
            continue
        # substring search across columns
        for c in df.columns:
            if cand in c.lower():
                found.append(c)
    # de-duplicate while preserving order
    return list(dict.fromkeys(found))

key_groups = {
    'city': ['city', 'municipality', 'borough'],
    'country': ['country', 'nation'],
    'year': ['year', 'reporting year', 'report_year', 'fiscal year'],
    'scope': ['scope', 'emission scope', 'ghg scope'],
    'emission_type': ['emission type', 'emissions type', 'type', 'category', 'sector']
}

for key, cands in key_groups.items():
    found = find_columns([c.lower() for c in cands])
    if not found:
        print(f"No matching column found for '{key}'. Candidates tried: {cands}")
        continue
    for col in found:
        try:
            n_unique = df[col].nunique(dropna=True)
        except TypeError:
            # Fallback for columns with unhashable entries (lists/dicts): compare string forms
            n_unique = int(df[col].dropna().astype(str).nunique())

        print(f"\n— {key.upper()} — column: '{col}' (unique: {n_unique})")

        try:
            raw_uniques = df[col].dropna().unique()
        except TypeError:
            # If unique() fails (e.g., due to unhashable types), coerce to string
            raw_uniques = df[col].dropna().astype(str).unique()

        values = list(raw_uniques)
        try:
            values = sorted(values)
        except TypeError:
            try:
                values = sorted(values, key=lambda x: str(x))
            except Exception:
                pass

        preview = pd.Series(values[:20], name=f"unique_{col}")
        display(preview.to_frame())

        if n_unique > 20:
            print('... (truncated)')
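
Substring matching with short candidates such as `type` can pick up unrelated columns. The sketch below contrasts plain substring matching with a whole-word variant on a plain list of names. The function `match_columns` and the column names are hypothetical and only serve to show the difference.

```python
import re


def match_columns(columns, candidates, whole_word=False):
    """Hypothetical matcher: return column names that contain any candidate."""
    matches = []
    for cand in candidates:
        cand = cand.lower()
        # Whole-word mode: the candidate must not touch a letter or digit on either side
        pattern = re.compile(rf"(?<![a-z0-9]){re.escape(cand)}(?![a-z0-9])")
        for col in columns:
            name = col.lower()
            hit = bool(pattern.search(name)) if whole_word else cand in name
            if hit and col not in matches:
                matches.append(col)
    return matches


# Invented column names for the demo
toy_columns = ["Emission Type", "Prototype ID", "Reporting_Year", "Sector"]

print(match_columns(toy_columns, ["type"]))
print(match_columns(toy_columns, ["type"], whole_word=True))
print(match_columns(toy_columns, ["year"], whole_word=True))
```

In whole-word mode an underscore counts as a separator, so `Reporting_Year` still matches `year` while `Prototype ID` no longer matches `type`.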


## 8. Summary statistics (numeric columns)
We compute standard descriptive statistics (count, mean, std, min, quartiles, max) for all numeric columns.

In [None]:
numeric_summary = df.describe(include=[np.number]).T
numeric_summary
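
If a numeric column turns out to be skewed, the default quartiles may hide the tails. A minimal sketch, on invented toy values, of asking `describe` for custom percentiles:

```python
import pandas as pd

# Toy values, invented for illustration: one large outlier
toy = pd.DataFrame({"amount": [1.0, 2.0, 2.5, 3.0, 250.0]})

# Request the 5th, 50th, and 95th percentiles instead of the default quartiles
stats = toy.describe(percentiles=[0.05, 0.5, 0.95]).T
print(stats)
```

Here the mean sits far above the median, which is the kind of gap that signals skew or outliers worth inspecting.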

## 9. Duplicate rows
We check how many rows are exact duplicates. If needed, we could display them, but we start by reporting the count.

In [None]:
dup_count = df.duplicated().sum()
print('Duplicate rows:', int(dup_count))

# If you want to see the duplicate rows, uncomment:
# df[df.duplicated(keep=False)].sort_values(list(df.columns)).head(20)
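
Exact duplicates are only one kind of repetition; rows can also repeat on a subset of key columns while differing in the measured value. The sketch below shows the `subset` and `keep` options of `duplicated` on invented toy data. Which columns form a sensible key for the Berlin file is an open question that this example does not answer.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "sector": ["energy", "energy", "waste", "energy"],
    "year": [2023, 2023, 2023, 2023],
    "amount": [1.0, 1.0, 2.0, 3.0],
})

exact_dups = int(toy.duplicated().sum())                        # repeats of a full row
key_dups = int(toy.duplicated(subset=["sector", "year"]).sum())  # repeats of the key only
all_copies = toy[toy.duplicated(keep=False)]                     # every member of a duplicate group

print("Exact duplicates:", exact_dups)
print("Duplicates on (sector, year):", key_dups)
print(all_copies)
```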

## 10. Categorical vs numerical columns
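We infer likely categorical and numerical columns from the pandas dtypes plus one heuristic: a numeric column with more than one distinct value, and at most 10 distinct values (or 2% of the row count, if that is larger), is also flagged as categorical-like. Such columns therefore appear in both lists.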

In [None]:
# Base detection via dtypes
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Heuristic: low-cardinality numerics might be categorical (e.g., 0/1 flags, enums)
low_card_numeric = []
for col in numeric_cols:
    nunique = df[col].nunique(dropna=True)
    if 1 < nunique <= max(10, int(0.02 * len(df))):
        low_card_numeric.append(col)

print('Categorical-like columns:')
print(sorted(list(set(categorical_cols + low_card_numeric))))

print('\nNumerical columns:')
print(sorted(numeric_cols))
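
To see how the low-cardinality threshold behaves at different dataset sizes, the rule can be pulled out into two small functions. The function names below are hypothetical; the formula mirrors the condition used in the cell above.

```python
def low_cardinality_threshold(n_rows, floor=10, fraction=0.02):
    """Largest number of distinct values still treated as categorical-like."""
    return max(floor, int(fraction * n_rows))


def is_low_cardinality(n_unique, n_rows):
    """True when a column has more than one distinct value but no more than the threshold."""
    return 1 < n_unique <= low_cardinality_threshold(n_rows)


print(low_cardinality_threshold(100))    # the floor of 10 applies
print(low_cardinality_threshold(1000))   # 2% of the rows applies
print(is_low_cardinality(2, 1000))       # a 0/1 flag column
print(is_low_cardinality(1, 1000))       # a constant column is excluded
print(is_low_cardinality(8, 8))          # tiny frame: every value distinct, still flagged
```

The last case shows a limitation: with 10 rows or fewer, any numeric column with more than one distinct value is flagged, so the heuristic is only informative on larger tables.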

## 11. Simple visuals
We create a couple of basic plots, if the necessary columns exist:
- Countplot for an "emission type"-like column.
- Bar chart of total emissions by year (using a best-effort guess for the emissions column).

In [None]:
# Try to detect an 'emissions type' column
em_type_candidates = ['emission type', 'emissions type', 'type', 'category', 'sector']
em_type_cols = []
for cand in em_type_candidates:
    for c in df.columns:
        if cand in c.lower():
            em_type_cols.append(c)
em_type_cols = list(dict.fromkeys(em_type_cols))  # de-dupe preserve order

if em_type_cols:
    col = em_type_cols[0]
    plt.figure(figsize=(8, 4))
    sns.countplot(data=df, x=col, order=df[col].value_counts().index, color='#4C72B0')
    plt.title(f'Count of {col}')
    plt.xticks(rotation=30, ha='right')
    plt.tight_layout()
    plt.show()
else:
    print('No emission-type-like column found for countplot.')

# Try to detect a 'year' column
year_cols = []
for cand in ['year', 'reporting year', 'report_year', 'fiscal year']:
    for c in df.columns:
        if cand in c.lower():
            year_cols.append(c)
year_cols = list(dict.fromkeys(year_cols))

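# Try to detect a numeric emissions measure column.
# Short keywords such as 'ton', 'kt', and 'mt' match loosely, so check the column printed below.
em_keywords = ['emission', 'co2', 'co₂', 'ghg', 'tco2e', 'ton', 'kt', 'mt', 'value', 'quantity', 'total']
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
em_value_cols = [
    c for c in num_cols
    # Exclude year columns so that a name like 'Emissions Year' is not picked as the measure
    if c not in year_cols and any(k in c.lower() for k in em_keywords)
]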

if year_cols and em_value_cols:
    ycol = year_cols[0]
    vcol = em_value_cols[0]
    print(f'Using year column: {ycol} | emissions column: {vcol}')
    # Aggregate total emissions by year
    by_year = (
        df.groupby(ycol, dropna=True)[vcol]
          .sum()
          .reset_index()
          .sort_values(ycol)
    )
    plt.figure(figsize=(8, 4))
    sns.barplot(data=by_year, x=ycol, y=vcol, color='#55A868')
    plt.title(f'Total {vcol} by {ycol}')
    plt.xticks(rotation=0)
    plt.tight_layout()
    plt.show()
else:
    if not year_cols:
        print('No year-like column found; skipping emissions-by-year chart.')
    if not em_value_cols:
        print('No numeric emissions column detected; skipping emissions-by-year chart.')
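
One detail of the aggregation is worth checking before reading the bar chart: by default, `sum` skips missing values and returns 0 for a group that contains only nulls, which can look like a real zero. A sketch on invented toy data, using the `min_count` argument to keep such groups as missing:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: 2022 has no recorded values at all
toy = pd.DataFrame({
    "year": [2022, 2022, 2023, 2023],
    "amount": [np.nan, np.nan, 4.0, 6.0],
})

default_sum = toy.groupby("year")["amount"].sum()
strict_sum = toy.groupby("year")["amount"].sum(min_count=1)  # needs at least one non-null value

print(default_sum)
print(strict_sum)
```

With `min_count=1`, the all-null year stays `NaN` and is simply left out of a bar chart instead of being drawn as a zero-height bar.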

---
Notes:
- If column names differ from the guesses above, adjust the candidate lists accordingly.
- For more plots (e.g., breakdown by sector/scope/gas), you can extend the detection logic or set column names explicitly.
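
As a sketch of the second note, column names can be set explicitly in one mapping and validated up front, instead of being guessed. All column names below are hypothetical placeholders, and `resolve_columns` is an illustrative helper, not part of the notebook.

```python
# Hypothetical column names; replace them with the headers in the actual CSV
COLUMN_MAP = {
    "year": "Reporting_Year",
    "sector": "Sector",
    "emissions": "Emissions_tCO2e",
}


def resolve_columns(available, column_map):
    """Return the mapping unchanged if every mapped column exists, else raise KeyError."""
    missing = {role: name for role, name in column_map.items() if name not in available}
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return dict(column_map)


# Stand-in for list(df.columns)
toy_columns = ["Reporting_Year", "Sector", "Emissions_tCO2e", "Notes"]
resolved = resolve_columns(toy_columns, COLUMN_MAP)
print(resolved["emissions"])
```

Failing early with the list of missing names makes a renamed or misspelled header visible immediately, rather than silently skipping a plot.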