# 1. DATA PROFILING AND DATA QUALITY ASSESSMENT

This notebook consolidates all data profiling and quality assessment tasks:
1. **Data Loading & Initial Inspection** - Load the dataset and understand its structure
2. **Manual Data Profiling** - Compute basic statistics, cardinality, uniqueness, distinctness, constancy
3. **Automatic Data Profiling** - Use ydata_profiling and dataprofiler for comprehensive reports
4. **Data Quality Dimensions** - Assess completeness, accuracy, consistency, and duplication
5. **Functional Dependencies** - Discover dependencies in the address-related attributes

## 1.1 Imports and Setup

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import json

pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 140)
%matplotlib inline

## 1.2 Load the Milan Dataset

In [None]:
# Load the dataset - adjust path if needed
MILANO = pd.read_csv("Comune-di-Milano-Pubblici-esercizi(in)-2.csv", sep=";")
print(f"Dataset shape: {MILANO.shape[0]} rows x {MILANO.shape[1]} columns")
MILANO.head()

## 1.3 Initial Inspection

In [None]:
# Column names
print("Columns:")
print(MILANO.columns.tolist())

In [None]:
# Data types
print("Data types:")
MILANO.dtypes

In [None]:
# Separate numerical and categorical columns
NUM = list(MILANO.select_dtypes(include=['int64', 'float64']).columns)
CAT = list(MILANO.select_dtypes(include=['bool', 'object']).columns)

print(f"Numerical columns ({len(NUM)}): {NUM}")
print(f"Categorical columns ({len(CAT)}): {CAT}")
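
A dtype-based split can misfile numeric columns that were read as text, which is why section 2.1 later passes `Superficie somministrazione` through `pd.to_numeric`. The sketch below is one way to spot such columns. The helper `numeric_like_share` and the toy values are hypothetical and are not taken from the Milan file.

```python
import pandas as pd

def numeric_like_share(series: pd.Series) -> float:
    """Share of non-null values that pd.to_numeric is able to parse."""
    non_null = series.dropna()
    if non_null.empty:
        return 0.0
    parsed = pd.to_numeric(non_null, errors="coerce")
    return float(parsed.notna().mean())

# Toy frame with made-up values (assumption: not the real dataset)
toy = pd.DataFrame({
    "area": ["12", "30.5", "n/a", None],
    "street": ["ROMA", "DANTE", "MANZONI", "VERDI"],
})

# A share close to 1 suggests a numeric column stored as text
shares = {name: numeric_like_share(toy[name]) for name in toy.columns}
print(shares)
```

A column with a high share but below 1 is worth inspecting by hand, since the unparsed values are exactly the ones `errors="coerce"` would turn into nulls.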

---
# 2. MANUAL DATA PROFILING

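For each attribute we compute the length (total rows), the non-null count, the number of distinct values, uniqueness (distinct values / total rows), distinctness (distinct values / non-null values), and constancy (frequency of the most common value / non-null values). For numerical attributes we also compute min, max, mean, and standard deviation.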
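
As a minimal sketch, the same ratios can be computed for every column at once. The helper name `profile_columns` and the toy data are assumptions made for illustration. The formulas mirror the single-column cell in section 2.1.

```python
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row of profiling metrics per column of df."""
    n_rows = len(df)
    records = []
    for name in df.columns:
        s = df[name]
        count = int(s.count())        # non-null values
        distinct = int(s.nunique())   # distinct non-null values
        # value_counts() is sorted in descending order, so the first entry is the mode frequency
        top = int(s.value_counts().iloc[0]) if count else 0
        records.append({
            "column": name,
            "count": count,
            "distinct": distinct,
            "uniqueness": distinct / n_rows if n_rows else 0.0,
            "distinctness": distinct / count if count else 0.0,
            "constancy": top / count if count else 0.0,
        })
    return pd.DataFrame(records).set_index("column")

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "zone": [1, 1, 2, None],
    "kind": ["VIA", "VIA", "VIA", "VIA"],
})
summary = profile_columns(toy)
print(summary)
```

On the real data the call would be `profile_columns(MILANO)`. A constancy of 1.0, as for `kind` here, flags a column that carries a single value.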

## 2.1 Single Column Example: Superficie somministrazione

In [None]:
# Total rows and cells
ROWS = len(MILANO)
COLS = MILANO.shape[1]
CELLS = ROWS * COLS

print(f"Total rows: {ROWS}")
print(f"Total columns: {COLS}")
print(f"Total cells: {CELLS}")

In [None]:
# Profiling for 'Superficie somministrazione'
col = "Superficie somministrazione"

COUNT = MILANO[col].count()  # non-null values
DISTINCT = MILANO[col].nunique()  # distinct values
UNIQUENESS = DISTINCT / ROWS
DISTINCTNESS = DISTINCT / COUNT if COUNT > 0 else 0

max_freq = MILANO[col].value_counts().max() if COUNT > 0 else 0
CONSTANCY = max_freq / COUNT if COUNT > 0 else 0

print(f"Column: {col}")
print(f"  Count (non-null): {COUNT}")
print(f"  Distinct values: {DISTINCT}")
print(f"  Uniqueness: {UNIQUENESS:.4f}")
print(f"  Distinctness: {DISTINCTNESS:.4f}")
print(f"  Constancy: {CONSTANCY:.4f}")

In [None]:
# Statistics for numerical column
MILANO[col] = pd.to_numeric(MILANO[col], errors='coerce')

print(f"  Min: {MILANO[col].min()}")
print(f"  Max: {MILANO[col].max()}")
print(f"  Mean: {MILANO[col].mean():.2f}")
print(f"  Std: {MILANO[col].std():.2f}")
print(f"  Mode: {MILANO[col].mode()[0] if len(MILANO[col].mode()) > 0 else 'N/A'}")

In [None]:
# Histogram
plt.figure(figsize=(10, 4))
MILANO[col].hist(bins=50)
plt.title(f"Distribution of {col}")
plt.xlabel(col)
plt.ylabel("Frequency")
plt.show()

## 2.2 Profiling All Numerical Columns

In [None]:
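# Refresh NUM: a column converted with pd.to_numeric after section 1.3 may now be numeric
NUM = list(MILANO.select_dtypes(include=['int64', 'float64']).columns)

# Min and Max for all numerical columns
print("MIN values:")
display(MILANO[NUM].min())

print("\nMAX values:")
display(MILANO[NUM].max())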

In [None]:
# Correlation heatmap for numerical columns
if len(NUM) > 1:
    plt.figure(figsize=(8, 6))
    sns.heatmap(MILANO[NUM].corr(), annot=True, cmap='coolwarm', center=0)
    plt.title("Correlation Heatmap")
    plt.show()

## 2.3 Profiling Categorical Columns

In [None]:
# Unique values per categorical column
for c in CAT:
    n_unique = MILANO[c].nunique()
    print(f"{c}: {n_unique} unique values")

In [None]:
# Value counts for key categorical columns
key_cats = ["Settore storico pe", "Forma commercio", "Forma vendita", "Tipo via"]
for c in key_cats:
    if c in MILANO.columns:
        print(f"\n=== {c} ===")
        display(MILANO[c].value_counts().head(10))

---
# 3. AUTOMATIC DATA PROFILING

This section uses the `ydata_profiling` and `dataprofiler` libraries to generate comprehensive automated reports.

## 3.1 YData Profiling

In [None]:
# Install if needed
# %pip install -U ydata-profiling

In [None]:
from ydata_profiling import ProfileReport

# Generate the profiling report
PROFILE = ProfileReport(MILANO, title="Profiling Report - Milan Public Establishments")
PROFILE

In [None]:
# Save to HTML and JSON
PROFILE.to_file("MILANO_REPORT.html")
PROFILE.to_file("MILANO_REPORT.json")
print("Reports saved: MILANO_REPORT.html, MILANO_REPORT.json")

In [None]:
# Inspect JSON report programmatically
with open("MILANO_REPORT.json") as f:
    JFILE = json.load(f)

print(f"Total rows from JSON: {JFILE['table']['n']}")
print(f"Distinct values for 'Superficie somministrazione': {JFILE['variables']['Superficie somministrazione']['n_distinct']}")

## 3.2 DataProfiler

In [None]:
# Install if needed
# %pip install dataprofiler

In [None]:
from dataprofiler import Profiler

# Reload dataset for fresh profiling
MILANO_DP = pd.read_csv("Comune-di-Milano-Pubblici-esercizi(in)-2.csv", sep=";")
profile = Profiler(MILANO_DP)
readable_report = profile.report(report_options={"output_format": "compact"})

In [None]:
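# Build summary table from dataprofiler
rows = []
for col_report in readable_report["data_stats"]:  # do not reuse `col` from section 2.1
    stats = col_report["statistics"]
    null_count = stats.get("null_count")
    sample_size = stats.get("sample_size")
    # Fall back to null_count / sample_size if the report has no "null_ratio" entry
    null_ratio = stats.get("null_ratio")
    if null_ratio is None and null_count is not None and sample_size:
        null_ratio = null_count / sample_size
    rows.append({
        "column": col_report["column_name"],
        "null_count": null_count,
        "null_ratio": null_ratio,
        "unique_count": stats.get("unique_count"),
        "min": stats.get("min"),
        "max": stats.get("max"),
    })

df_profiler_summary = pd.DataFrame(rows)
df_profiler_summary.sort_values("null_ratio", ascending=False)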

---
# 4. DATA QUALITY DIMENSIONS

Assess the main quality dimensions: Completeness, Accuracy, Consistency, Duplication.

## 4.1 Completeness

In [None]:
# Missing values per column
null_counts = MILANO.isnull().sum()
null_counts[null_counts > 0].sort_values(ascending=False)

In [None]:
# Overall completeness
NOT_NULL = MILANO.count().sum()
NULL = MILANO.isnull().sum().sum()
TOT = MILANO.shape[0] * MILANO.shape[1]

COMPLETENESS = NOT_NULL / TOT
print(f"Total cells: {TOT}")
print(f"Non-null cells: {NOT_NULL}")
print(f"Null cells: {NULL}")
print(f"Overall Completeness: {COMPLETENESS*100:.1f}%")

In [None]:
# Completeness per column
completeness_per_col = MILANO.count() / len(MILANO) * 100
completeness_per_col.sort_values()

## 4.2 Accuracy

In [None]:
# Define acceptable domain for 'Tipo via'
TIPO_VIA_DOMAIN = ['ALZ', 'BST', 'VIA', 'VLE', 'CSO', 'GLL', 'LGO', 'PLE',
                   'PTA', 'PZA', 'RIP', 'VIE', 'FOR', 'VLO', 'PAS', 'LARGO']

CORRECT_TIPO = MILANO["Tipo via"].isin(TIPO_VIA_DOMAIN)
ACCURACY_TIPO = CORRECT_TIPO.sum() / MILANO["Tipo via"].count()
print(f"Accuracy of 'Tipo via': {ACCURACY_TIPO*100:.1f}%")

In [None]:
# Define valid range for 'ZD' (zone codes 1-9)
ZD_RANGE = range(1, 10)
CORRECT_ZD = MILANO["ZD"].apply(lambda x: x in ZD_RANGE if pd.notna(x) else False)
ACCURACY_ZD = CORRECT_ZD.sum() / MILANO["ZD"].count()
print(f"Accuracy of 'ZD': {ACCURACY_ZD*100:.1f}%")
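
Both accuracy checks above follow the same pattern: the share of non-null values that fall inside an accepted domain. A possible generalisation is sketched below. The helper `domain_accuracy` and the toy zone codes are hypothetical. It also returns the offending values, which helps decide whether the domain or the data needs fixing.

```python
import pandas as pd

def domain_accuracy(series: pd.Series, domain):
    """Return (share of non-null values inside domain, counts of offending values)."""
    non_null = series.dropna()
    if non_null.empty:
        return float("nan"), pd.Series(dtype="int64")
    inside = non_null.isin(list(domain))
    return float(inside.mean()), non_null[~inside].value_counts()

# Toy zone codes (made-up values); valid codes are assumed to be 1-9 as in the cell above
toy_zd = pd.Series([1, 9, 0, 12, 12, None])
accuracy, offenders = domain_accuracy(toy_zd, range(1, 10))

print(f"Accuracy: {accuracy:.1%}")
print(offenders)
```

The same helper would accept `TIPO_VIA_DOMAIN` as the domain for `Tipo via`.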

## 4.3 Consistency

In [None]:
# Rule: Superficie somministrazione should be > 0 when present
SUPERFICIE = pd.to_numeric(MILANO["Superficie somministrazione"], errors="coerce")

# Evaluate the rule on non-null values only, without adding a helper column to MILANO
SUPERFICIE_VALID = SUPERFICIE.dropna()
CONSISTENCY = (SUPERFICIE_VALID > 0).mean()
print(f"Consistency (Superficie > 0): {CONSISTENCY*100:.1f}%")

## 4.4 Duplication

In [None]:
# Check for exact duplicates
DUPLICATES = MILANO.duplicated()
n_duplicates = DUPLICATES.sum()
print(f"Number of exact duplicate rows: {n_duplicates}")

if n_duplicates > 0:
    print("\nDuplicated rows:")
    display(MILANO[DUPLICATES])
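
`DataFrame.duplicated()` marks only the repeats by default, so a count of 1 means two identical rows exist. The sketch below, on made-up data, contrasts the default with `keep=False` and with a check restricted to a subset of columns. The column names are hypothetical.

```python
import pandas as pd

# Toy data (assumption: not the real dataset)
toy = pd.DataFrame({
    "street": ["ROMA", "ROMA", "DANTE", "DANTE"],
    "number": [1, 1, 5, 7],
    "sign": ["Bar A", "Bar A", "Bar B", "Bar B"],
})

repeats_only = toy.duplicated()               # first occurrence is not marked
all_members = toy.duplicated(keep=False)      # every row of a duplicate group is marked
same_address = toy.duplicated(subset=["street", "number"], keep=False)

print(f"Repeated rows: {repeats_only.sum()}")
print(f"Rows in duplicate groups: {all_members.sum()}")
print(f"Rows sharing an address: {same_address.sum()}")
```

Restricting the check to key columns can surface records that differ only in secondary fields. Which columns form a sensible key for the Milan data is left open here.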

---
# 5. FUNCTIONAL DEPENDENCIES

Check candidate functional dependencies in the address-related attributes by hand, then discover them with TANE, CTANE, and FD_Mine.

In [None]:
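def check_fd(df, lhs_cols, rhs_col):
    """Check whether the functional dependency lhs_cols -> rhs_col holds in df.

    Prints a short report and returns the violating LHS groups (a Series with
    the number of distinct RHS values), or None when the dependency holds.
    """
    # Distinct RHS values per LHS combination; nulls are ignored by groupby/nunique
    grouped = df.groupby(lhs_cols)[rhs_col].nunique()
    violating_groups = grouped[grouped > 1]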

    total_groups = len(grouped)
    num_violating_groups = len(violating_groups)

    if num_violating_groups == 0:
        print(f"FD {lhs_cols} -> {rhs_col} holds exactly.")
        return None
    else:
        print(f"FD {lhs_cols} -> {rhs_col} is violated in {num_violating_groups} out of {total_groups} groups.")
        violating_rows = (
            df.set_index(lhs_cols)
              .index.isin(violating_groups.index)
              .sum()
        )
        print(f"Approximately {violating_rows} rows ({violating_rows/len(df):.2%}) are involved in violations.")
        return violating_groups.sort_values(ascending=False)

In [None]:
# Test candidate FDs on address columns
print("=== Testing FD: {Tipo via, Descrizione via} -> Codice via ===")
check_fd(MILANO, ["Tipo via", "Descrizione via"], "Codice via")

print("\n=== Testing FD: Codice via -> Tipo via ===")
check_fd(MILANO, ["Codice via"], "Tipo via")

print("\n=== Testing FD: Codice via -> Descrizione via ===")
check_fd(MILANO, ["Codice via"], "Descrizione via")
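
When a dependency is violated, the number of rows involved can overstate the problem: a single misspelled street name puts every row of that group in violation. A complementary measure is the g3 error, the minimum fraction of rows that must be removed for the dependency to hold exactly. The sketch below is an illustrative implementation on toy data, not part of the original analysis. The helper name `g3_error` is an assumption, and rows with nulls in the involved columns are dropped first.

```python
import pandas as pd

def g3_error(df, lhs_cols, rhs_col):
    """Fraction of rows to drop so that lhs_cols -> rhs_col holds exactly."""
    lhs_cols = list(lhs_cols)
    data = df[lhs_cols + [rhs_col]].dropna()
    if data.empty:
        return 0.0
    # Row count for every (LHS, RHS) combination
    pair_sizes = data.groupby(lhs_cols + [rhs_col]).size()
    # Within each LHS group, keep only the rows with the most frequent RHS value
    kept = pair_sizes.groupby(level=list(range(len(lhs_cols)))).max().sum()
    return float(1 - kept / len(data))

# Toy data (made-up values): one street name carries a typo
toy = pd.DataFrame({
    "code": [10, 10, 10, 20, 20],
    "name": ["ROMA", "ROMA", "ROMA.", "DANTE", "DANTE"],
})

err_code_to_name = g3_error(toy, ["code"], "name")
err_name_to_code = g3_error(toy, ["name"], "code")
print(f"code -> name: {err_code_to_name:.2f}")
print(f"name -> code: {err_name_to_code:.2f}")
```

Here `code -> name` fails for one row out of five, while `name -> code` holds exactly. A small g3 error usually points to a few dirty values rather than a dependency that is wrong in principle.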

In [None]:
# Prepare reduced table for FD discovery algorithms
cols_fd = ["Tipo via", "Descrizione via", "Civico", "Codice via", "ZD"]

col_rename = {
    "Tipo via": "A",
    "Descrizione via": "B",
    "Civico": "C",
    "Codice via": "D",
    "ZD": "E",
}

MILANO_FD = (
    MILANO[cols_fd]
    .dropna()
    .astype(str)
    .rename(columns=col_rename)
)

MILANO_FD.to_csv("MILANO_FD_TANE.csv", index=False)
print(f"Saved MILANO_FD_TANE.csv with {len(MILANO_FD)} rows")
MILANO_FD.head()

### Run FD Discovery Algorithms (TANE, FD_Mine, CTANE)

Uncomment and run the cells below if you have the DATADIQ package available.

In [None]:
# import sys
# sys.path.append("./DATADIQ")
# from DATADIQ import tane, ctane
# import fdtool

# source = "MILANO_FD_TANE.csv"
# print("=== TANE ===")
# tane.compute(source)

# print("\n=== FD_Mine ===")
# fdtool.main(source)

### Interpretation of Discovered FDs

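On the reduced table, the FD discovery algorithms (TANE, FD_Mine) find:

- **D → A** and **D → B**: `Codice via` determines both `Tipo via` and `Descrizione via`
- **A,B → D**: The pair (`Tipo via`, `Descrizione via`) determines `Codice via`
- **B,C → E** and **C,D → E**: A street (identified by `Descrizione via` or by `Codice via`) together with `Civico` determines the zone (`ZD`)

These dependencies indicate that, on rows with complete address fields, the address attributes behave like a consistent addressing system: the street code and the (type, description) pair identify each other, and the zone follows from the street and civic number.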

---
# 6. SUMMARY

This notebook has assessed the Milan Public Establishments dataset across multiple dimensions:

| Dimension | Result |
|-----------|--------|
| **Rows** | ~6,900 |
| **Columns** | 13 |
| **Completeness** | ~89% |
| **Exact Duplicates** | 1 row |
| **Accuracy (Tipo via)** | ~100% |
| **Accuracy (ZD)** | ~100% |
| **Consistency (Superficie > 0)** | ~100% |

**Key findings:**
- The `Insegna` column has ~50% missing values
- `Settore storico pe` is messy, with ~4,000 distinct values (multi-label field)
- Address columns have consistent functional dependencies
- Categorical columns like `Forma commercio` are highly imbalanced
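
A multi-label field inflates the distinct count because every combination of labels counts as its own value. One way to see the underlying labels is to split each value and count the parts, as sketched below on made-up data. The separator `;` is an assumption and must be checked against the actual contents of `Settore storico pe`.

```python
import pandas as pd

SEPARATOR = ";"  # assumption: verify the real delimiter in the data

# Toy multi-label values (hypothetical, not taken from the dataset)
toy = pd.Series([
    "bar caffe;ristorante",
    "ristorante",
    "bar caffe ; pizzeria",
    None,
])

labels = (
    toy.dropna()
       .str.split(SEPARATOR)   # one list of labels per row
       .explode()              # one label per row
       .str.strip()            # remove stray spaces around labels
)
labels = labels[labels != ""]
label_counts = labels.value_counts()

print(f"Distinct raw values: {toy.nunique()}")
print(f"Distinct labels: {label_counts.size}")
print(label_counts)
```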