# **DATA AND INFORMATION QUALITY**
## **Report: Milan Public Establishments Dataset Analysis**

---

**Authors:** Data Quality Team  
**Date:** January 2026  
**Dataset:** Comune di Milano - Pubblici Esercizi

---

## Table of Contents

1. [Introduction](#1-introduction)
2. [Setup Choices](#2-setup-choices)
3. [Pipeline Implementation](#3-pipeline-implementation)
4. [After Cleaning](#4-after-cleaning)

---

# 1. INTRODUCTION

This report documents the complete **Data Quality Assessment and Cleaning Pipeline** applied to the Milan Public Establishments dataset (*Comune di Milano - Pubblici Esercizi*).

## 1.1 Dataset Description

The dataset contains information about **public establishments** (bars, restaurants, shops, etc.) registered in the Municipality of Milan. Each record represents a business with attributes including:

- **Location data:** Street address, civic number, zone code (ZD)
- **Business type:** Sector, exercise type, commercial form
- **Physical attributes:** Surface area for food service
- **Business name:** Sign/brand name (Insegna)

## 1.2 Objectives

1. **Profile** the dataset to understand its structure and characteristics
2. **Assess** data quality dimensions (Completeness, Consistency, Duplicates)
3. **Clean** the data through transformation, error correction, and deduplication
4. **Validate** the improvements through post-cleaning profiling

## 1.3 Data Quality Dimensions Covered

| Dimension | Covered | Rationale |
|-----------|---------|----------|
| **Completeness** | ✅ Yes | Measured and improved through imputation |
| **Consistency** | ✅ Yes | Address consistency check, functional dependencies |
| **Duplicates** | ✅ Yes | Exact and near-duplicate detection |
| **Accuracy** | ❌ No | No ground truth available for validation |
| **Timeliness** | ❌ No | No temporal attributes (dates) in dataset |

### Why No Accuracy Assessment?

**Accuracy** measures how well data values correspond to the real-world entities they represent. To assess accuracy, we need either:
- A **ground truth** dataset to compare against
- **External validation sources** (e.g., official registry, field verification)

Since we lack both, we **cannot objectively measure accuracy**. We can only ensure **syntactic correctness** and **internal consistency**.

### Why No Timeliness Assessment?

**Timeliness** measures whether data is up-to-date for the intended use. This dataset lacks:
- Timestamp columns (creation/update dates)
- Temporal attributes to assess currency

Therefore, **timeliness cannot be measured**.

---

# 2. SETUP CHOICES

## 2.1 Environment

| Component | Version/Details |
|-----------|----------------|
| **Operating System** | Linux (Ubuntu) |
| **Python** | 3.12 |
| **IDE** | Visual Studio Code with Jupyter extension |
| **Kernel** | IPython Kernel |

## 2.2 Libraries and Tools

### Core Data Processing
| Library | Purpose |
|---------|---------|
| `pandas` | DataFrame manipulation and analysis |
| `numpy` | Numerical operations |

### Data Profiling
| Library | Purpose |
|---------|---------|
| `ydata_profiling` | Automated profiling reports (HTML/JSON) |

### Visualization
| Library | Purpose |
|---------|---------|
| `matplotlib` | Basic plots (histograms, boxplots) |
| `seaborn` | Statistical visualizations (heatmaps) |

### Statistical Analysis
| Library | Purpose |
|---------|---------|
| `scipy.stats` | Z-score calculations for outlier detection |

### Functional Dependencies (Custom Scripts)
| Script | Purpose |
|--------|---------|
| `tane.py` | TANE algorithm implementation |
| `ctane.py` | Conditional TANE |
| `fdtool.py` | FD_Mine implementation |

```python
# Import all required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import re
```

---

# 3. PIPELINE IMPLEMENTATION

## 3.1 Exploration





 Shape: 6,904 rows × 13 columns
 Total cells: 89,752
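
The figures above follow directly from the DataFrame shape (6,904 × 13 = 89,752). A minimal sketch of how such a summary could be produced is shown here; the helper `shape_summary` and the toy frame are illustrative assumptions, not the notebook's actual code.

```python
import pandas as pd

def shape_summary(df: pd.DataFrame) -> dict:
    """Return row count, column count, and total cell count of a DataFrame."""
    rows, cols = df.shape
    return {"rows": rows, "columns": cols, "cells": rows * cols}

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({"Insegna": ["bar a", None, "bar c"], "ZD": [1, 2, 3]})
summary = shape_summary(toy)
print(f"Shape: {summary['rows']:,} rows × {summary['columns']} columns")
print(f"Total cells: {summary['cells']:,}")
```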

## 3.2 Data Profiling



### 3.2.2 Automatic Profiling with YData Profiling

YData Profiling generates a comprehensive report including:
- **Overview**: Dataset statistics, variable types, missing values
- **Variables**: Detailed analysis per column
- **Interactions**: Correlations and relationships
- **Missing Values**: Patterns and heatmaps
- **Duplicates**: Exact duplicate detection

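Critical metadata findings (share of missing values per column):

| Column | Missing |
|--------|---------|
| `þÿTipo esercizio storico pe` | 19.6% |
| `Insegna` | 49.4% |
| `Civico` | 2.3% |
| `Forma Commercio` | 22.8% |
| `Forma commercio prev` | 20.2% |
| `Forma vendita` | 20.6% |
| `Superficie di somministrazione` | 1.1% |

The `þÿ` prefix on the first column name is encoding residue in the raw header (most likely a byte-order mark decoded as text); it is removed during column renaming in Section 3.3.1.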
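
Per-column missing percentages like the ones listed above can also be cross-checked with plain pandas, independently of the profiling report. The sketch below assumes missing values are represented as `NaN`/`None`; the helper `missing_percent` and the toy data are illustrative only.

```python
import pandas as pd

def missing_percent(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, highest first."""
    return (df.isna().mean() * 100).round(1).sort_values(ascending=False)

# Hypothetical toy frame; the real dataset has 13 columns.
toy = pd.DataFrame({
    "Insegna": ["bar a", None, None, "bar d"],
    "Civico": ["1", "2", "3", None],
    "ZD": [1, 2, 3, 4],
})
report = missing_percent(toy)
print(report[report > 0])  # show only columns with missing values
```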

### 3.2.3 Completeness Assessment

**Completeness** measures the degree to which all required data is present.

$$\text{Completeness} = \frac{\text{Non-null cells}}{\text{Total cells}}$$


COMPLETENESS ASSESSMENT
 Total cells: 89,752
 Non-null cells: 80,341
 Null cells: 9,411
 Overall Completeness: 89.51%
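
As a quick check, the reported value follows from the formula: 80,341 / 89,752 ≈ 0.8951. A minimal sketch of the computation is given below, assuming nulls are encoded as `NaN`/`None`; the function name `completeness` and the toy frame are assumptions for illustration.

```python
import pandas as pd

def completeness(df: pd.DataFrame) -> float:
    """Overall completeness: non-null cells divided by total cells."""
    total = df.size
    non_null = int(df.notna().sum().sum())
    return non_null / total if total else float("nan")

# Hypothetical toy frame: 8 cells, 2 of them missing.
toy = pd.DataFrame({
    "Insegna": ["bar a", None, "bar c", None],
    "ZD": [1, 2, 3, 4],
})
print(f"Overall Completeness: {completeness(toy):.2%}")

# Cross-check of the figures reported for the real dataset.
print(f"{80341 / 89752:.2%}")
```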

### 3.2.4 Consistency Assessment

**Consistency** measures whether data values conform to defined rules and constraints.

We check two types of consistency:

#### Type 1: Value-based Consistency
- `Superficie somministrazione` should be > 0 when present


#### Type 2: Functional Dependencies (FD)
- Address consistency: `Codice via → Nome via, Tipo via`
- Zone consistency: Cross-validation between `Indirizzo` and structured fields

CONSISTENCY ASSESSMENT

 Rule: Superficie somministrazione > 0
 Valid (non-null) records: 6,825
 Positive values: 6,825

 Consistency: 100.0%
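
The value-based rule can be expressed as the share of non-null values that satisfy the constraint. The following is a minimal sketch under the assumption that the surface column is already numeric; `positive_rule_consistency` is a hypothetical helper name.

```python
import pandas as pd

def positive_rule_consistency(series: pd.Series) -> float:
    """Share of non-null values that are strictly positive."""
    present = series.dropna()
    if present.empty:
        return float("nan")
    return float((present > 0).mean())

# Hypothetical toy column: one missing value and one rule violation (0.0).
surface = pd.Series([35.0, None, 0.0, 120.0], name="Superficie somministrazione")
print(f"Valid (non-null) records: {surface.notna().sum()}")
print(f"Consistency: {positive_rule_consistency(surface):.1%}")
```

Missing values are excluded from the denominator, so the rule measures consistency separately from completeness.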

### 3.2.5 Functional Dependencies (FD)

A **Functional Dependency** $X \rightarrow Y$ holds if, whenever two rows agree on attribute(s) $X$, they must also agree on attribute $Y$.

$$X \rightarrow Y \Leftrightarrow \forall r_1, r_2 \in R: r_1[X] = r_2[X] \Rightarrow r_1[Y] = r_2[Y]$$

#### Expected FDs in Address Data:
- `Codice via → Nome via` (street code determines street name; the street-name column is reported as `Descrizione via` in the check output)
- `Codice via → Tipo via` (street code determines street type)

We use algorithms like **TANE** and **FD_Mine** to discover and validate FDs.

FUNCTIONAL DEPENDENCY CHECK

FD: ['Codice via'] → Descrizione via: HOLDS

FD: ['Codice via'] → Tipo via: HOLDS

FD: ['Codice via', 'Civico'] → ZD: HOLDS
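
A single candidate FD can be validated directly from the definition, without running a discovery algorithm: group by the left-hand side and verify that each group has exactly one right-hand-side value. The sketch below is a direct check, not the TANE or FD_Mine implementation; the helper names and toy data are assumptions.

```python
import pandas as pd

def fd_violations(df: pd.DataFrame, lhs: list[str], rhs: str) -> pd.Series:
    """Return the LHS groups that map to more than one distinct RHS value."""
    distinct = df.groupby(lhs)[rhs].nunique(dropna=False)
    return distinct[distinct > 1]

def fd_holds(df: pd.DataFrame, lhs: list[str], rhs: str) -> bool:
    """True if lhs -> rhs holds on every group of the DataFrame."""
    return fd_violations(df, lhs, rhs).empty

# Hypothetical toy data: street 200 appears with two different zones.
toy = pd.DataFrame({
    "Codice via": [100, 100, 200, 200],
    "Descrizione via": ["roma", "roma", "dante", "dante"],
    "ZD": [1, 1, 2, 3],
})
for rhs in ["Descrizione via", "ZD"]:
    status = "HOLDS" if fd_holds(toy, ["Codice via"], rhs) else "VIOLATED"
    print(f"FD: ['Codice via'] → {rhs}: {status}")
```

Rows with a missing value in the left-hand side are ignored by `groupby`, so they neither support nor violate the dependency in this sketch.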

### 3.2.6 Duplicate Detection

**Duplicates** are rows that represent the same real-world entity.

We distinguish:
- **Exact duplicates**: Rows identical across all columns
- **Near-duplicates**: Rows with minor differences (typos, formatting)

DUPLICATE DETECTION

 Total rows: 6,904

 Exact duplicate rows: 1
 
 Total rows involved in duplication: 2
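
The two counts above differ because one counts only the redundant copies while the other counts every row that has an identical twin. A minimal sketch using `DataFrame.duplicated` illustrates the distinction; the helper `duplicate_summary` and the toy frame are illustrative assumptions.

```python
import pandas as pd

def duplicate_summary(df: pd.DataFrame) -> dict:
    """Count redundant copies and all rows that take part in exact duplication."""
    redundant = int(df.duplicated(keep="first").sum())  # copies after the first occurrence
    involved = int(df.duplicated(keep=False).sum())     # every row with an identical twin
    return {"total_rows": len(df), "exact_duplicate_rows": redundant, "rows_involved": involved}

# Hypothetical toy frame: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "Insegna": ["bar a", "bar b", "bar a", "bar c"],
    "Codice via": [100, 200, 100, 300],
})
print(duplicate_summary(toy))
```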


---

## 3.3 Data Cleaning

The cleaning pipeline consists of three main phases:

1. **Data Transformation/Standardization** - Normalize formats, fix encoding
2. **Error Detection and Correction** - Handle missing values, repair inconsistencies
3. **Data Deduplication** - Remove redundant records

### 3.3.1 Data Transformation/Standardization

| Operation | Description | Example |
|-----------|-------------|--------|
| Text Normalization | Convert to lowercase | `BAR MILANO` → `bar milano` |
| Column Renaming | Fix encoding issues | `þÿTipo...` → `Tipo esercizio...` |
| Typo Correction | Fix special characters | `caffÿ` → `caffè` |
| Macro-Category Creation | Group similar business types | `BAR CAFFE, BIRRERIA` → `BAR` |

TEXT NORMALIZATION

 Converted all text columns to lowercase

 Renamed 5 columns with encoding issues
 
 Fixed 'caffÿ' → 'caffè' pattern

### 3.3.2 Error Detection and Correction

#### Missing Values Imputation Strategies

| Column | Strategy | Rationale |
|--------|----------|----------|
| `Insegna` | Fill with "unknown" | Cannot infer business names |
| `Superficie` | KNN by street/zone, then global mean | Nearby establishments likely similar |
| `Forma commercio prev` | Mode by macro-category (conf ≥ 80%) | Business type determines commerce form |

MISSING VALUES IMPUTATION

 Missing values BEFORE: 5,922

 Insegna: Filled 0 missing values with 'unknown'
 Superficie somministrazione: Filled 0 missing values using KNN + global mean

 Missing values AFTER: 5,922
 Improvement: 0 cells filled
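
The strategies in the table can be sketched as follows. This is a simplified illustration, not the notebook's implementation: the KNN step is replaced here by a plain group-mean fallback (street, then zone, then global mean), and the column name `Macro categoria` is a hypothetical stand-in for the macro-category column.

```python
import pandas as pd

def impute_surface(df, col="Superficie somministrazione", groups=("Codice via", "ZD")):
    """Fill missing surfaces with the street mean, then the zone mean, then the global mean."""
    filled = df[col].copy()
    for key in groups:
        # Group means are computed on the observed (original) values only.
        filled = filled.fillna(df.groupby(key)[col].transform("mean"))
    return filled.fillna(df[col].mean())

def impute_mode(df, target, by, min_conf=0.8):
    """Fill missing values with the group mode when its share is at least min_conf."""
    filled = df[target].copy()
    for _, block in df.groupby(by):
        shares = block[target].value_counts(normalize=True)  # NaN excluded
        if not shares.empty and shares.iloc[0] >= min_conf:
            filled.loc[block.index[block[target].isna()]] = shares.index[0]
    return filled

# Hypothetical toy data for the surface column.
surfaces = pd.DataFrame({
    "Codice via": [100, 100, 200, 300, 400],
    "ZD": [1, 1, 1, 1, 2],
    "Superficie somministrazione": [40.0, None, 60.0, None, None],
})
surface_filled = impute_surface(surfaces)

# Hypothetical toy data for the commerce form: 'bar' reaches 80%, 'ristorante' does not.
forms = pd.DataFrame({
    "Macro categoria": ["bar"] * 6 + ["ristorante"] * 3,
    "Forma commercio prev": ["x", "x", "x", "x", "y", None, "x", "y", None],
})
form_filled = impute_mode(forms, "Forma commercio prev", "Macro categoria")

# Business names cannot be inferred, so they receive a placeholder.
insegna_filled = pd.Series(["bar a", None]).fillna("unknown")
```

The confidence threshold leaves a value missing when no category clearly dominates, which avoids introducing arbitrary guesses.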

### 3.3.3 Data Deduplication

#### Similarity Measures Used

**Jaccard Similarity** (set-based):
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

**Levenshtein Similarity** (edit distance-based):
$$L_{sim}(s_1, s_2) = 1 - \frac{\text{editDistance}(s_1, s_2)}{\max(|s_1|, |s_2|)}$$
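
Both measures can be written in a few lines of pure Python. The sketch below is one possible implementation; it assumes Jaccard is computed over whitespace-separated tokens, which the formula itself does not specify.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated token sets (tokenization is an assumption)."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def levenshtein_distance(s1: str, s2: str) -> int:
    """Edit distance via dynamic programming, keeping only the previous row."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(s1: str, s2: str) -> float:
    """1 minus the edit distance normalized by the longer string length."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(s1, s2) / longest

print(f"{levenshtein_similarity('caffè', 'caffe'):.3f}")  # 0.800
print(f"{jaccard_similarity('bar milano', 'bar roma'):.3f}")
```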

#### Blocking Strategy
To avoid $O(n^2)$ comparisons, we use **blocking**:
- Group records by `(Codice via, Civico)` 
- Only compare records within the same block
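
The blocking step can be sketched as a generator of candidate pairs restricted to each `(Codice via, Civico)` block. The helper name `candidate_pairs` and the toy data are assumptions for illustration.

```python
from itertools import combinations

import pandas as pd

def candidate_pairs(df: pd.DataFrame, keys=("Codice via", "Civico")) -> list:
    """Return index pairs of records that share the same blocking key."""
    pairs = []
    for _, block in df.groupby(list(keys)):
        pairs.extend(combinations(block.index, 2))
    return pairs

# Hypothetical toy frame: only rows 0 and 1 share street code and civic number.
toy = pd.DataFrame({
    "Codice via": [100, 100, 100, 200],
    "Civico": ["1", "1", "2", "1"],
    "Insegna": ["bar roma", "bar rома", "pizzeria", "bar roma"],
})
pairs = candidate_pairs(toy)
all_pairs = len(toy) * (len(toy) - 1) // 2
print(f"Compared {len(pairs)} pair(s) instead of {all_pairs}")
```

Records with a missing `Codice via` or `Civico` fall outside every block in this sketch, so they would need separate handling.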

 DATA DEDUPLICATION

 Rows before: 6,904
 Rows after: 6,903
 Exact duplicates removed: 1

Similarity functions defined
   Example: levenshtein_similarity('caffè', 'caffe') = 0.800

---

# 4. AFTER CLEANING

## 4.1 Final Data Profiling

After completing all cleaning steps, we perform a final profiling to assess improvements.

## 4.2 Before vs After Comparison

| Column | Missing before | Missing after |
|--------|----------------|---------------|
| `Tipo esercizio storico pe` | 19.6% | 0% |
| `Insegna` | 49.4% | 0% |
| `Civico` | 2.3% | 0% |
| `Forma Commercio` | 22.8% | <0.1% |
| `Forma commercio prev` | 20.2% | <0.1% |
| `Forma vendita` | 20.6% | 0% |
| `Superficie di somministrazione` | 1.1% | 0% |