# **DATA AND INFORMATION QUALITY**
## **Report: Milan Public Establishments Dataset Analysis**

---

**Authors:** Data Quality Team  
**Date:** January 2026  
**Dataset:** Comune di Milano - Pubblici Esercizi

---

## Table of Contents

1. [Introduction](#1-introduction)
2. [Setup Choices](#2-setup-choices)
3. [Pipeline Implementation](#3-pipeline-implementation)
4. [After Cleaning](#4-after-cleaning)

---

# 1. INTRODUCTION

This report documents the complete **Data Quality Assessment and Cleaning Pipeline** applied to the Milan Public Establishments dataset (*Comune di Milano - Pubblici Esercizi*).

## 1.1 Dataset Description

The dataset contains information about **public establishments** (bars, restaurants, shops, etc.) registered in the Municipality of Milan. Each record represents a business with attributes including:

- **Location data:** Street address, civic number, zone code (ZD)
- **Business type:** Sector, exercise type, commercial form
- **Physical attributes:** Surface area for food service
- **Business name:** Sign/brand name (Insegna)

## 1.2 Objectives

1. **Profile** the dataset to understand its structure and characteristics
2. **Assess** data quality dimensions (Completeness, Consistency, Duplicates)
3. **Clean** the data through transformation, error correction, and deduplication
4. **Validate** the improvements through post-cleaning profiling

## 1.3 Data Quality Dimensions Covered

| Dimension | Covered | Rationale |
|-----------|---------|----------|
| **Completeness** | ✅ Yes | Measured and improved through imputation |
| **Consistency** | ✅ Yes | Address consistency check, functional dependencies |
| **Duplicates** | ✅ Yes | Exact and near-duplicate detection |
| **Accuracy** | ❌ No | No ground truth available for validation |
| **Timeliness** | ❌ No | No temporal attributes (dates) in dataset |

### Why No Accuracy Assessment?

**Accuracy** measures how well data values correspond to the real-world entities they represent. To assess accuracy, we need either:
- A **ground truth** dataset to compare against
- **External validation sources** (e.g., official registry, field verification)

Since we lack both, we **cannot objectively measure accuracy**. We can only ensure **syntactic correctness** and **internal consistency**.

### Why No Timeliness Assessment?

**Timeliness** measures whether data is up-to-date for the intended use. This dataset lacks:
- Timestamp columns (creation/update dates)
- Temporal attributes to assess currency

Therefore, **timeliness cannot be measured**.

---

# 2. SETUP CHOICES

## 2.1 Environment

| Component | Version/Details |
|-----------|----------------|
| **Operating System** | Linux (Ubuntu) |
| **Python** | 3.12 |
| **IDE** | Visual Studio Code with Jupyter extension |
| **Kernel** | IPython Kernel |

## 2.2 Libraries and Tools

### Core Data Processing
| Library | Purpose |
|---------|---------|
| `pandas` | DataFrame manipulation and analysis |
| `numpy` | Numerical operations |

### Data Profiling
| Library | Purpose |
|---------|---------|
| `ydata_profiling` | Automated profiling reports (HTML/JSON) |

### Visualization
| Library | Purpose |
|---------|---------|
| `matplotlib` | Basic plots (histograms, boxplots) |
| `seaborn` | Statistical visualizations (heatmaps) |

### Statistical Analysis
| Library | Purpose |
|---------|---------|
| `scipy.stats` | Z-score calculations for outlier detection |

### Functional Dependencies (Custom Scripts)
| Script | Purpose |
|--------|---------|
| `tane.py` | TANE algorithm implementation |
| `ctane.py` | Conditional TANE |
| `fdtool.py` | FD_Mine implementation |

In [None]:
# Import all required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import re

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 140)
%matplotlib inline

print("✅ Libraries loaded successfully!")
print(f"   Pandas version: {pd.__version__}")
print(f"   NumPy version: {np.__version__}")

---

# 3. PIPELINE IMPLEMENTATION

## 3.1 Exploration

### Load the Dataset

In [None]:
# Load the original dataset
MILANO = pd.read_csv("Comune-di-Milano-Pubblici-esercizi(in)-2.csv", sep=";")

print("=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)
print(f"\n📊 Shape: {MILANO.shape[0]:,} rows × {MILANO.shape[1]} columns")
print(f"📦 Total cells: {MILANO.shape[0] * MILANO.shape[1]:,}")

# Preview the data
print("\n📋 First 5 rows:")
MILANO.head()

## 3.2 Data Profiling

### 3.2.1 Profiling Formulas

Let $N$ be the total number of rows, $n$ the count of non-null values, and $d$ the number of distinct values.

| Metric | Formula | Description |
|--------|---------|-------------|
| **Count** | $n = \sum_{i=1}^{N} \mathbb{1}[x_i \neq \text{null}]$ | Number of non-null values |
| **Distinct** | $d = \lvert\{x_i : x_i \neq \text{null}\}\rvert$ | Number of unique non-null values |
| **Uniqueness** | $U = \frac{d}{N}$ | Ratio of distinct values to total rows |
| **Distinctness** | $D = \frac{d}{n}$ | Ratio of distinct values to non-null values |
| **Constancy** | $C = \frac{\max(\text{freq})}{n}$ | Relative frequency of the most common value |
| **Null Ratio** | $\text{NR} = \frac{N - n}{N}$ | Proportion of missing values |

#### Key Profiling Concepts

- **Uniqueness = 1.0**: Every row has a distinct value (potential key)
- **Constancy ≈ 1.0**: Almost all values are the same (low information)
- **Distinctness = 1.0**: No duplicate values among non-null entries
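
To make the difference between uniqueness and distinctness concrete, here is a minimal sketch that applies the formulas above to a single toy column. The helper name `profile_column` and the toy values are assumptions made for illustration; they are not part of the dataset or of the pipeline.

```python
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Compute the profiling metrics defined above for one column."""
    N = len(series)              # total rows
    n = int(series.count())      # non-null values
    d = int(series.nunique())    # distinct non-null values
    # Frequency of the most common non-null value (0 for an all-null column)
    top = int(series.value_counts().iloc[0]) if n > 0 else 0
    return {
        "count": n,
        "distinct": d,
        "uniqueness": d / N if N else 0.0,
        "distinctness": d / n if n else 0.0,
        "constancy": top / n if n else 0.0,
        "null_ratio": (N - n) / N if N else 0.0,
    }

# Toy column (made-up values): 5 rows, 1 null, 3 distinct values
toy = pd.Series(["bar", "bar", "ristorante", None, "pub"])
metrics = profile_column(toy)
print(metrics)
```

With $N = 5$, $n = 4$ and $d = 3$, uniqueness is $3/5 = 0.6$ while distinctness is $3/4 = 0.75$: the two metrics diverge as soon as a column contains nulls.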

In [None]:
# Compute profiling metrics for all columns
ROWS = len(MILANO)

profile_data = []
for col in MILANO.columns:
    count = MILANO[col].count()
    distinct = MILANO[col].nunique()
    uniqueness = distinct / ROWS
    distinctness = distinct / count if count > 0 else 0
    mode_freq = MILANO[col].value_counts().iloc[0] if count > 0 else 0
    constancy = mode_freq / count if count > 0 else 0
    null_ratio = (ROWS - count) / ROWS
    
    profile_data.append({
        'Column': col,
        'Count (n)': count,
        'Nulls': ROWS - count,
        'Null Ratio': round(null_ratio, 4),
        'Distinct (d)': distinct,
        'Uniqueness (d/N)': round(uniqueness, 4),
        'Distinctness (d/n)': round(distinctness, 4),
        'Constancy': round(constancy, 4)
    })

profile_df = pd.DataFrame(profile_data)
print("📊 PROFILING METRICS FOR ALL COLUMNS")
profile_df

### 3.2.2 Automatic Profiling with YData Profiling

YData Profiling generates a comprehensive report including:
- **Overview**: Dataset statistics, variable types, missing values
- **Variables**: Detailed analysis per column
- **Interactions**: Correlations and relationships
- **Missing Values**: Patterns and heatmaps
- **Duplicates**: Exact duplicate detection

In [None]:
from ydata_profiling import ProfileReport

# Generate profiling report for original dataset
PROFILE_ORIGINAL = ProfileReport(
    MILANO, 
    title="Profiling Report - Milan Public Establishments (ORIGINAL)",
    explorative=True
)

# Display inline
PROFILE_ORIGINAL

### 3.2.3 Completeness Assessment

**Completeness** measures the degree to which all required data is present.

$$\text{Completeness} = \frac{\text{Non-null cells}}{\text{Total cells}} = \frac{\sum_{i,j} \mathbb{1}[x_{ij} \neq \text{null}]}{N \times M}$$

Where $N$ is the number of rows and $M$ is the number of columns.

In [None]:
# Calculate overall completeness
TOTAL_CELLS = MILANO.shape[0] * MILANO.shape[1]
NON_NULL_CELLS = MILANO.count().sum()
NULL_CELLS = MILANO.isnull().sum().sum()

COMPLETENESS = NON_NULL_CELLS / TOTAL_CELLS

print("=" * 60)
print("COMPLETENESS ASSESSMENT")
print("=" * 60)
print(f"\n📊 Total cells: {TOTAL_CELLS:,}")
print(f"✅ Non-null cells: {NON_NULL_CELLS:,}")
print(f"❌ Null cells: {NULL_CELLS:,}")
print(f"\n📈 Overall Completeness: {COMPLETENESS*100:.2f}%")

In [None]:
# Completeness per column
null_counts = MILANO.isnull().sum()
null_pct = (null_counts / len(MILANO) * 100).round(2)

missing_df = pd.DataFrame({
    'Column': null_counts.index,
    'Missing Count': null_counts.values,
    'Missing %': null_pct.values,
    'Completeness %': (100 - null_pct.values).round(2)
}).sort_values('Missing %', ascending=False)

print("\n📋 MISSING VALUES BY COLUMN:")
missing_df[missing_df['Missing Count'] > 0]

### 3.2.4 Consistency Assessment

**Consistency** measures whether data values conform to defined rules and constraints.

We check two types of consistency:

#### Type 1: Value-based Consistency
- `Superficie somministrazione` should be > 0 when present

$$\text{Consistency}_{\text{rule}} = \frac{|\{x : \text{rule}(x) = \text{True}\}|}{n}$$

#### Type 2: Functional Dependencies (FD)
- Address consistency: `Codice via → Nome via, Tipo via`
- Zone consistency: Cross-validation between `Indirizzo` and structured fields

In [None]:
# Value-based consistency: Superficie > 0
MILANO["Superficie somministrazione"] = pd.to_numeric(
    MILANO["Superficie somministrazione"], errors="coerce"
)

valid_superficie = MILANO["Superficie somministrazione"].notna()
positive_superficie = MILANO["Superficie somministrazione"] > 0

consistency_superficie = (valid_superficie & positive_superficie).sum() / valid_superficie.sum()

print("=" * 60)
print("CONSISTENCY ASSESSMENT")
print("=" * 60)
print(f"\n📊 Rule: Superficie somministrazione > 0")
print(f"✅ Valid (non-null) records: {valid_superficie.sum():,}")
print(f"✅ Positive values: {positive_superficie.sum():,}")
print(f"\n📈 Consistency: {consistency_superficie*100:.1f}%")

### 3.2.5 Functional Dependencies (FD)

A **Functional Dependency** $X \rightarrow Y$ holds if, whenever two rows agree on attribute(s) $X$, they must also agree on attribute $Y$.

$$X \rightarrow Y \Leftrightarrow \forall r_1, r_2 \in R: r_1[X] = r_2[X] \Rightarrow r_1[Y] = r_2[Y]$$

#### Expected FDs in Address Data:
- `Codice via → Nome via` (street code determines street name; in the raw data this column is called `Descrizione via` until it is renamed in Section 3.3.1)
- `Codice via → Tipo via` (street code determines street type)

We use algorithms like **TANE** and **FD_Mine** to discover and validate FDs.
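
A yes/no answer hides how badly a dependency is violated. The sketch below lists the conflicting groups and estimates the fraction of rows that would have to be removed for $X \rightarrow Y$ to hold, which is similar in spirit to the error measures used for approximate dependencies. The function name `fd_violation_report` and the toy rows are assumptions for illustration only, not output of `tane.py` or `fdtool.py`.

```python
import pandas as pd

def fd_violation_report(df: pd.DataFrame, lhs: list, rhs: str):
    """Return the (LHS, RHS) combinations that break lhs -> rhs and the
    fraction of rows that must be dropped for the dependency to hold."""
    data = df.dropna(subset=lhs + [rhs])
    # Number of rows for every observed (LHS, RHS) combination
    counts = data.groupby(lhs + [rhs]).size().rename("rows").reset_index()
    # Number of distinct RHS values seen for each LHS group
    n_rhs = counts.groupby(lhs)[rhs].transform("nunique")
    violations = counts[n_rhs > 1]
    # Rows that survive if each LHS group keeps only its majority RHS value
    kept = counts.groupby(lhs)["rows"].max().sum()
    error = 1 - kept / len(data) if len(data) else 0.0
    return violations, error

# Toy rows (made-up): street code 10 appears with two spellings
toy = pd.DataFrame({
    "Codice via": [10, 10, 10, 20, 20, 30],
    "Descrizione via": ["roma", "roma", "rome", "dante", "dante", "verdi"],
})
violations, error = fd_violation_report(toy, ["Codice via"], "Descrizione via")
print(violations)
print(f"Rows to remove: {error:.1%}")
```

In the toy data one of six rows disagrees with the majority spelling, so the error is $1/6$. An error close to zero suggests a dependency that holds in principle and is broken by a few dirty rows, which are good candidates for repair.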

In [None]:
def check_fd(df, lhs_cols, rhs_col):
    """Check if functional dependency X -> Y holds."""
    if isinstance(lhs_cols, str):
        lhs_cols = [lhs_cols]
    
    # Group by LHS and count distinct RHS values
    grouped = df.groupby(lhs_cols)[rhs_col].nunique()
    violations = grouped[grouped > 1]
    
    total_groups = len(grouped)
    violating_groups = len(violations)
    
    if violating_groups == 0:
        status = "✅ HOLDS"
    else:
        status = f"❌ VIOLATED in {violating_groups}/{total_groups} groups"
    
    print(f"FD: {lhs_cols} → {rhs_col}: {status}")
    return violating_groups == 0

print("=" * 60)
print("FUNCTIONAL DEPENDENCY CHECK")
print("=" * 60)
print()

# Check expected FDs
check_fd(MILANO, 'Codice via', 'Descrizione via')
check_fd(MILANO, 'Codice via', 'Tipo via')
check_fd(MILANO, ['Codice via', 'Civico'], 'ZD')

### 3.2.6 Duplicate Detection

**Duplicates** are rows that represent the same real-world entity.

We distinguish:
- **Exact duplicates**: Rows identical across all columns
- **Near-duplicates**: Rows with minor differences (typos, formatting)
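
The two counts pandas can report for exact duplicates are easy to confuse, so here is a small sketch on made-up rows. `duplicated()` flags only the extra copies after the first occurrence, while `duplicated(keep=False)` flags every row that has at least one identical twin. The toy values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: three identical records plus one near-duplicate pair
toy = pd.DataFrame({
    "Insegna": ["bar sport", "bar sport", "bar sport", "caffè roma", "caffe roma"],
    "Civico": [12, 12, 12, 3, 3],
})

# Copies beyond the first occurrence (what drop_duplicates would remove)
extra_copies = toy.duplicated().sum()
# Every row that belongs to a group of identical rows
rows_involved = toy.duplicated(keep=False).sum()

print(extra_copies, rows_involved)
```

The last two rows differ by one accent, so neither call flags them: they are near-duplicates and need a similarity measure rather than an equality test.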

In [None]:
# Check for exact duplicates
exact_duplicates = MILANO.duplicated().sum()
all_duplicates = MILANO.duplicated(keep=False).sum()

print("=" * 60)
print("DUPLICATE DETECTION")
print("=" * 60)
print(f"\n📊 Total rows: {len(MILANO):,}")
print(f"🔄 Exact duplicate rows: {exact_duplicates}")
print(f"🔄 Total rows involved in duplication: {all_duplicates}")

if exact_duplicates > 0:
    print("\n⚠️ Duplicate rows found!")
else:
    print("\n✅ No exact duplicates found.")

---

## 3.3 Data Cleaning

The cleaning pipeline consists of three main phases:

1. **Data Transformation/Standardization** - Normalize formats, fix encoding
2. **Error Detection and Correction** - Handle missing values, repair inconsistencies
3. **Data Deduplication** - Remove redundant records

### 3.3.1 Data Transformation/Standardization

| Operation | Description | Example |
|-----------|-------------|--------|
| Text Normalization | Convert to lowercase | `BAR MILANO` → `bar milano` |
| Column Renaming | Fix encoding issues | `þÿTipo...` → `Tipo esercizio...` |
| Typo Correction | Fix special characters | `caffÿ` → `caffè` |
| Macro-Category Creation | Group similar business types | `BAR CAFFE, BIRRERIA` → `BAR` |
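
Macro-category creation can be approached with a simple keyword lookup. The following is only a sketch under stated assumptions: the rule list `MACRO_RULES`, the fallback label `ALTRO`, and the helper `to_macro_category` are hypothetical and would need to be replaced by the categories actually chosen for this dataset.

```python
import pandas as pd

# Hypothetical keyword -> macro-category rules; the first matching rule wins
MACRO_RULES = [
    ("ristorante", "RISTORANTE"),
    ("pizzeria", "RISTORANTE"),
    ("bar", "BAR"),
    ("birreria", "BAR"),
]

def to_macro_category(value, default="ALTRO"):
    """Map a free-text business type to a macro-category by keyword search."""
    if pd.isna(value):
        return default
    text = str(value).lower()
    for keyword, macro in MACRO_RULES:
        if keyword in text:
            return macro
    return default

# Made-up business types, including a missing one
toy = pd.Series(["BAR CAFFE", "BIRRERIA", "Ristorante; Pizzeria", None])
macro = toy.map(to_macro_category)
print(macro.tolist())
```

Plain substring matching is deliberately naive: a keyword such as `bar` would also match unrelated words that contain it, so word-boundary regular expressions may be a safer choice on real data.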

In [None]:
# === TEXT NORMALIZATION ===
print("=" * 60)
print("STEP 1: TEXT NORMALIZATION")
print("=" * 60)

# Convert text columns to lowercase
text_cols = MILANO.select_dtypes(include="object").columns
MILANO[text_cols] = MILANO[text_cols].apply(lambda col: col.str.lower())
print("\n✅ Converted all text columns to lowercase")

# Rename problematic columns
col_renames = {
    "þÿTipo esercizio storico pe": "Tipo esercizio storico pubblico esercizio",
    "Ubicazione": "Indirizzo",
    "Descrizione via": "Nome via",
    "Forma commercio prev": "Forma commercio precedente",
    "Settore storico pe": "Settore storico pubblico esercizio"
}
existing_cols = {k: v for k, v in col_renames.items() if k in MILANO.columns}
MILANO = MILANO.rename(columns=existing_cols)
print(f"✅ Renamed {len(existing_cols)} columns (encoding fixes and expanded abbreviations)")

# Fix caffè pattern
text_cols = MILANO.select_dtypes(include="object").columns
MILANO[text_cols] = MILANO[text_cols].apply(
    lambda col: col.str.replace(r"\bcaff[ÿý]", "caffè", regex=True)
)
print("✅ Fixed 'caffÿ' → 'caffè' pattern")

### 3.3.2 Error Detection and Correction

#### Missing Values Imputation Strategies

| Column | Strategy | Rationale |
|--------|----------|----------|
| `Insegna` | Fill with "unknown" | Cannot infer business names |
| `Superficie` | Mean of the same street, then of the same zone, then global mean | Nearby establishments likely similar |
| `Forma commercio prev` | Mode by macro-category (conf ≥ 80%) | Business type determines commerce form |
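
The mode-with-confidence strategy in the last row can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the helper `impute_mode_by_group` and the toy column names `macro` and `forma` are assumptions, and "confidence" is taken here to mean the share of a group's non-null values covered by its most frequent value.

```python
import pandas as pd

def impute_mode_by_group(df, group_col, target_col, min_conf=0.8):
    """Fill missing target values with the group's mode, but only when the
    mode covers at least `min_conf` of the group's non-null values."""
    out = df.copy()
    known = out.dropna(subset=[group_col, target_col])
    fill_map = {}
    for group, values in known.groupby(group_col)[target_col]:
        shares = values.value_counts(normalize=True)  # sorted, most frequent first
        if shares.iloc[0] >= min_conf:
            fill_map[group] = shares.index[0]
    missing = out[target_col].isna()
    # Groups without a confident mode are absent from fill_map and stay missing
    out.loc[missing, target_col] = out.loc[missing, group_col].map(fill_map)
    return out

# Made-up data: "bar" has a unanimous mode, "ristorante" does not
toy = pd.DataFrame({
    "macro": ["bar"] * 6 + ["ristorante"] * 4,
    "forma": ["a", "a", "a", "a", "a", None, "x", "y", "x", None],
})
result = impute_mode_by_group(toy, "macro", "forma")
print(result)
```

In the toy data the missing `bar` value is filled because the mode covers 100% of the known values, while the missing `ristorante` value is left empty because its mode covers only two thirds. Leaving low-confidence cells missing avoids trading a completeness gain for an accuracy loss.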

In [None]:
# === MISSING VALUES IMPUTATION ===
print("=" * 60)
print("STEP 2: MISSING VALUES IMPUTATION")
print("=" * 60)

# Track original missing counts
missing_before = MILANO.isnull().sum().sum()
print(f"\n📊 Missing values BEFORE: {missing_before:,}")

# 1. Fill Insegna with 'unknown'
if 'Insegna' in MILANO.columns:
    insegna_missing = MILANO['Insegna'].isna().sum()
    MILANO['Insegna'] = MILANO['Insegna'].fillna('unknown')
    print(f"\n✅ Insegna: Filled {insegna_missing} missing values with 'unknown'")

# 2. Fill Superficie with group means (same street, then same zone, then global mean)
sup_col = 'Superficie somministrazione'
if sup_col in MILANO.columns:
    MILANO[sup_col] = pd.to_numeric(MILANO[sup_col], errors='coerce')
    sup_missing_before = MILANO[sup_col].isna().sum()
    
    # Strategy 1: Same street mean
    via_means = MILANO.groupby('Codice via')[sup_col].transform('mean')
    mask_via = MILANO[sup_col].isna() & via_means.notna()
    MILANO.loc[mask_via, sup_col] = via_means[mask_via]
    
    # Strategy 2: Same zone mean
    zd_means = MILANO.groupby('ZD')[sup_col].transform('mean')
    mask_zd = MILANO[sup_col].isna() & zd_means.notna()
    MILANO.loc[mask_zd, sup_col] = zd_means[mask_zd]
    
    # Strategy 3: Global mean
    global_mean = MILANO[sup_col].mean()
    MILANO[sup_col] = MILANO[sup_col].fillna(global_mean)
    
    print(f"✅ {sup_col}: Filled {sup_missing_before} missing values using street, zone, and global means")

# Summary
missing_after = MILANO.isnull().sum().sum()
print(f"\n📊 Missing values AFTER: {missing_after:,}")
print(f"📈 Improvement: {missing_before - missing_after:,} cells filled")

### 3.3.3 Data Deduplication

#### Similarity Measures Used

**Jaccard Similarity** (set-based):
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

**Levenshtein Similarity** (edit distance-based):
$$L_{sim}(s_1, s_2) = 1 - \frac{\text{editDistance}(s_1, s_2)}{\max(|s_1|, |s_2|)}$$

#### Blocking Strategy
To avoid $O(n^2)$ comparisons, we use **blocking**:
- Group records by `(Codice via, Civico)` 
- Only compare records within the same block
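
The blocking idea combined with Jaccard similarity can be sketched as below. The helpers `jaccard_similarity` and `candidate_pairs`, the word-level tokenization, and the 0.5 threshold are all assumptions chosen for illustration; the toy rows are made up.

```python
from itertools import combinations

import pandas as pd

def jaccard_similarity(a, b):
    """Jaccard similarity between the word sets of two strings."""
    if pd.isna(a) or pd.isna(b):
        return 0.0
    tokens_a = set(str(a).lower().split())
    tokens_b = set(str(b).lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def candidate_pairs(df, block_cols, text_col, threshold=0.5):
    """Compare records only within each block; return pairs at or above threshold."""
    pairs = []
    for _, block in df.groupby(block_cols):
        for i, j in combinations(block.index, 2):
            score = jaccard_similarity(block.at[i, text_col], block.at[j, text_col])
            if score >= threshold:
                pairs.append((i, j, round(score, 3)))
    return pairs

# Made-up records: rows 0-2 share an address, row 3 is on another street
toy = pd.DataFrame({
    "Codice via": [100, 100, 100, 200],
    "Civico": ["5", "5", "5", "5"],
    "Insegna": ["bar sport milano", "bar sport", "pizzeria da gino", "bar sport"],
})
pairs = candidate_pairs(toy, ["Codice via", "Civico"], "Insegna")
print(pairs)
```

Only rows 0 and 1 are returned, with a score of $2/3$: row 3 has the same name as row 1 but sits in a different block, so the pair is never compared. Placeholder values such as the `unknown` used for missing `Insegna` entries should be excluded before matching, otherwise every pair of unnamed establishments at the same address would look identical.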

In [None]:
# === EXACT DUPLICATE REMOVAL ===
print("=" * 60)
print("STEP 3: DATA DEDUPLICATION")
print("=" * 60)

before_count = len(MILANO)
MILANO = MILANO.drop_duplicates(keep='first')
after_count = len(MILANO)

print(f"\n📊 Rows before: {before_count:,}")
print(f"📊 Rows after: {after_count:,}")
print(f"🗑️ Exact duplicates removed: {before_count - after_count}")

In [None]:
# Define similarity functions
def levenshtein_distance(s1, s2):
    """Compute Levenshtein (edit) distance."""
    if pd.isna(s1) or pd.isna(s2):
        return float('inf')
    s1, s2 = str(s1).lower(), str(s2).lower()
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    if len(s2) == 0:
        return len(s1)
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

def levenshtein_similarity(s1, s2):
    """Compute normalized Levenshtein similarity (0-1)."""
    if pd.isna(s1) or pd.isna(s2):
        return 0.0
    s1, s2 = str(s1), str(s2)
    max_len = max(len(s1), len(s2))
    if max_len == 0:
        return 1.0
    return 1 - (levenshtein_distance(s1, s2) / max_len)

print("✅ Similarity functions defined")
print(f"   Example: levenshtein_similarity('caffè', 'caffe') = {levenshtein_similarity('caffè', 'caffe'):.3f}")

---

# 4. AFTER CLEANING

## 4.1 Final Data Profiling

After completing all cleaning steps, we perform a final profiling to assess improvements.

In [None]:
# Generate final profiling report
PROFILE_CLEANED = ProfileReport(
    MILANO, 
    title="Profiling Report - Milan Public Establishments (CLEANED)",
    explorative=True
)

# Display inline
PROFILE_CLEANED

## 4.2 Before vs After Comparison

In [None]:
# Load original for comparison
MILANO_ORIG = pd.read_csv("Comune-di-Milano-Pubblici-esercizi(in)-2.csv", sep=";")

# Calculate metrics
orig_cells = MILANO_ORIG.shape[0] * MILANO_ORIG.shape[1]
clean_cells = MILANO.shape[0] * MILANO.shape[1]
orig_missing = MILANO_ORIG.isnull().sum().sum()
clean_missing = MILANO.isnull().sum().sum()

comparison = pd.DataFrame({
    'Metric': [
        'Total Rows',
        'Total Columns',
        'Total Cells',
        'Missing Cells',
        'Completeness %'
    ],
    'Original': [
        f"{MILANO_ORIG.shape[0]:,}",
        MILANO_ORIG.shape[1],
        f"{orig_cells:,}",
        f"{orig_missing:,}",
        f"{(1 - orig_missing/orig_cells)*100:.2f}%"
    ],
    'Cleaned': [
        f"{MILANO.shape[0]:,}",
        MILANO.shape[1],
        f"{clean_cells:,}",
        f"{clean_missing:,}",
        f"{(1 - clean_missing/clean_cells)*100:.2f}%"
    ]
})

print("=" * 60)
print("BEFORE VS AFTER COMPARISON")
print("=" * 60)
comparison