# 1. DATA QUALITY ASSESSMENT
This first step sets up the working environment by importing the libraries used throughout the notebook. The pandas library is used to load and manipulate tabular data, numpy provides numerical utilities, and datetime is used to work with dates in the examples seen during the lectures.

In [3]:
import pandas as pd
import numpy as np
from datetime import datetime

The dataset of public establishments in the city of Milan is loaded from the CSV file. The separator is specified as semicolon because the file uses ; rather than commas to separate fields. Displaying the DataFrame immediately after loading allows a quick visual check that the data has been read correctly.

In [4]:
MILANO = pd.read_csv("Comune-di-Milano-Pubblici-esercizi(in)-2.csv", sep=";")
MILANO


Unnamed: 0,þÿTipo esercizio storico pe,Insegna,Ubicazione,Tipo via,Descrizione via,Civico,Codice via,ZD,Forma commercio,Forma commercio prev,Forma vendita,Settore storico pe,Superficie somministrazione
0,,,ALZ NAVIGLIO GRANDE N. 12 ; isolato:057; (z.d. 6),ALZ,NAVIGLIO GRANDE,12,5144,6,,,,"Ristorante, trattoria, osteria;Genere Merceol....",83.0
1,,,ALZ NAVIGLIO GRANDE N. 44 (z.d. 6),ALZ,NAVIGLIO GRANDE,44,5144,6,,,,Bar gastronomici e simili,26.0
2,,,ALZ NAVIGLIO GRANDE N. 48 (z.d. 6),ALZ,NAVIGLIO GRANDE,48,5144,6,,,,Bar gastronomici e simili,58.0
3,,,ALZ NAVIGLIO GRANDE N. 8 (z.d. 6),ALZ,NAVIGLIO GRANDE,8,5144,6,,,,"BAR CAFFÿý E SIMILI;Ristorante, trattoria, ost...",101.0
4,,,ALZ NAVIGLIO PAVESE N. 24 (z.d. 6),ALZ,NAVIGLIO PAVESE,24,5161,6,,,,Bar gastronomici e simili,51.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6899,"wine,birr.,pub enot.,caff.,the",bar cherry,VLE DORIA ANDREA N. 12 ; isolato:031; accesso:...,VLE,DORIA ANDREA,12,2230,2,solo somministrazione,somministrazione,misto,"Wine,birr.,pub enot.,caff.,the",59.0
6900,"wine,birr.,pub enot.,caff.,the",la balusa,VIA GARIGLIANO N. 5 ; isolato:277; accesso: ac...,VIA,GARIGLIANO,5,1134,9,solo somministrazione,somministrazione,misto,"Wine,birr.,pub enot.,caff.,the",40.0
6901,"wine,birr.,pub enot.,caff.,the",la champagnerie sas,VIA SOTTOCORNO PASQUALE N. 4 ; isolato:014; ac...,VIA,SOTTOCORNO PASQUALE,4,3152,4,solo somministrazione,somministrazione,misto,BAR CAFFÿý E SIMILI;Bar gastronomici e simili,53.0
6902,"wine,birr.,pub enot.,caff.,the",old rooster,VIA CASTROVILLARI N. 23 ; isolato:150; accesso...,VIA,CASTROVILLARI,23,6299,7,solo somministrazione,somministrazione,misto,"Wine,birr.,pub enot.,caff.,the",43.0
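
The first column name in the output above starts with the two characters þÿ, which look like residue of a byte-order mark left over from an earlier encoding conversion. The sketch below shows one way such a prefix could be stripped from the column names after loading. It runs on a toy DataFrame rather than on the real file, and it assumes the residue is exactly the prefix shown in the output; the true encoding of the CSV is not known here.

```python
import pandas as pd

# Toy frame reproducing the stray prefix seen in the first column name.
df = pd.DataFrame({
    "þÿTipo esercizio storico pe": ["wine,birr.,pub enot.,caff.,the"],
    "Insegna": ["bar cherry"],
})

# Remove the two-character residue only when it opens a name,
# then trim any surrounding whitespace.
df.columns = df.columns.str.replace(r"^þÿ", "", regex=True).str.strip()

print(list(df.columns))
```

Cleaning the names early keeps later column selections free of invisible or hard-to-type characters.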


The overall size of the dataset is inspected by looking at the number of rows and columns. This provides a first idea of how many records and attributes will be considered in the data quality assessment.

In [5]:
MILANO.shape



(6904, 13)

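The names of the attributes are retrieved and a sample of the first tuples is displayed. Because a notebook cell renders only its last expression, the column names appear here as the header of the head() output rather than as a separate list. Looking at the schema helps to recognise which columns are categorical, numeric or identifiers, while the first rows give an intuitive feeling of the content of the dataset.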

In [6]:
MILANO.columns
MILANO.head()


Unnamed: 0,þÿTipo esercizio storico pe,Insegna,Ubicazione,Tipo via,Descrizione via,Civico,Codice via,ZD,Forma commercio,Forma commercio prev,Forma vendita,Settore storico pe,Superficie somministrazione
0,,,ALZ NAVIGLIO GRANDE N. 12 ; isolato:057; (z.d. 6),ALZ,NAVIGLIO GRANDE,12,5144,6,,,,"Ristorante, trattoria, osteria;Genere Merceol....",83.0
1,,,ALZ NAVIGLIO GRANDE N. 44 (z.d. 6),ALZ,NAVIGLIO GRANDE,44,5144,6,,,,Bar gastronomici e simili,26.0
2,,,ALZ NAVIGLIO GRANDE N. 48 (z.d. 6),ALZ,NAVIGLIO GRANDE,48,5144,6,,,,Bar gastronomici e simili,58.0
3,,,ALZ NAVIGLIO GRANDE N. 8 (z.d. 6),ALZ,NAVIGLIO GRANDE,8,5144,6,,,,"BAR CAFFÿý E SIMILI;Ristorante, trattoria, ost...",101.0
4,,,ALZ NAVIGLIO PAVESE N. 24 (z.d. 6),ALZ,NAVIGLIO PAVESE,24,5161,6,,,,Bar gastronomici e simili,51.0


The data types inferred by pandas for each column are examined. This information is useful to understand how the system interprets each attribute and to decide later which columns are suitable for numerical computations and which ones should be treated as categories or textual descriptions.

In [7]:
MILANO.dtypes



þÿTipo esercizio storico pe     object
Insegna                         object
Ubicazione                      object
Tipo via                        object
Descrizione via                 object
Civico                          object
Codice via                       int64
ZD                               int64
Forma commercio                 object
Forma commercio prev            object
Forma vendita                   object
Settore storico pe              object
Superficie somministrazione    float64
dtype: object
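
The Civico column is reported as object even though the first rows contain plain house numbers, which suggests that at least some values are not purely numeric. A minimal sketch of how such values could be located is given below. The sample values are hypothetical and only illustrate the technique; the patterns in the real column may differ.

```python
import pandas as pd

# Hypothetical house numbers; the real column may contain other patterns.
civico = pd.Series(["12", "44", "8A", None, "23/2"], name="Civico")

# Values that cannot be parsed as numbers become NaN under errors="coerce".
as_number = pd.to_numeric(civico, errors="coerce")

# Present in the source but not numeric: these would explain the object dtype.
non_numeric = civico[civico.notna() & as_number.isna()]

print(non_numeric.tolist())
```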

As an example of basic inspection on a categorical column, the distinct values of the attribute describing the historical sector of the establishment are retrieved. This reveals which types of activities are present in the dataset and how many different labels are used.


In [8]:
MILANO["Settore storico pe"].unique()



array(['Ristorante, trattoria, osteria;Genere Merceol.Autorizz.Sanit.;Ristorante',
       'Bar gastronomici e simili',
       'BAR CAFFÿý E SIMILI;Ristorante, trattoria, osteria', ...,
       "LETT. GIOCHI LECITI + SOCIETA';Genere Merceol.Autorizz.Sanit.;Trattoria;Ristorante, trattoria, osteria",
       'Disco-piano-americ.bar serali;Trattoria;BAR CAFFÿý E SIMILI;Pizzerie e simili;Ristorante, trattoria, osteria;Pizzeria;Genere Merceol.Autorizz.Sanit.',
       'Tavola calda;Pizzeria;Ristorante, trattoria, osteria;Trattoria;Genere Merceol.Autorizz.Sanit.'],
      dtype=object)

The number of distinct sectors is computed, and then the frequency of each sector is analysed. The count of unique values shows how many different categories exist, while the value counts show how they are distributed across the dataset.

In [9]:
MILANO["Settore storico pe"].nunique()

SETTORE_COUNT = MILANO["Settore storico pe"].value_counts()
SETTORE_COUNT



Settore storico pe
Genere Merceol.Autorizz.Sanit.                                                                                                                                                               214
Ristorante, trattoria, osteria                                                                                                                                                               174
Bar gastronomici e simili                                                                                                                                                                    165
BAR CAFFÿý E SIMILI                                                                                                                                                                           85
BAR CAFFÿý E SIMILI;Bar gastronomici e simili                                                                                                                                                 55
                

The duplication dimension is investigated by checking whether some tuples are exact duplicates of others. A boolean Series is created with duplicated, and then it is used to verify if any duplicate exists and to inspect which records are repeated.

In [10]:
DUPLICATES = MILANO.duplicated()
DUPLICATES

print(DUPLICATES.any())

MILANO[DUPLICATES]



True


Unnamed: 0,þÿTipo esercizio storico pe,Insegna,Ubicazione,Tipo via,Descrizione via,Civico,Codice via,ZD,Forma commercio,Forma commercio prev,Forma vendita,Settore storico pe,Superficie somministrazione
940,,,VIA SALASCO N. 29 (z.d. 5),VIA,SALASCO,29,4051,5,,,,Spaccio bevande analcoliche,
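
With its default setting, duplicated marks only the later copies of a repeated row, so the output above shows one member of the duplicate pair. The sketch below, on toy data with a hypothetical two-column schema, shows how keep=False exposes every copy and how drop_duplicates removes the repeats.

```python
import pandas as pd

# Toy data: rows 1 and 2 are identical, including the missing value.
df = pd.DataFrame({
    "Ubicazione": ["VIA A N. 1", "VIA B N. 2", "VIA B N. 2"],
    "Superficie": [30.0, None, None],
})

# keep=False flags every member of a duplicate group, not only the later copies.
all_copies = df[df.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each group by default.
deduplicated = df.drop_duplicates()

print(len(all_copies), len(deduplicated))
```

Inspecting all copies side by side before dropping them makes it easier to confirm that the rows really describe the same establishment.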


The completeness dimension begins with an inspection of missing values. The isnull method marks, for each cell, whether the value is missing (True) or present (False). As an example, the null pattern is also inspected for the single attribute describing the serving surface; since the output is truncated, it shows only the first and last rows.

In [11]:
MILANO.isnull()

MILANO["Superficie somministrazione"].isnull()



0       False
1       False
2       False
3       False
4       False
        ...  
6899    False
6900    False
6901    False
6902    False
6903    False
Name: Superficie somministrazione, Length: 6904, dtype: bool
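
Because the truncated boolean output hides most rows, it can help to reduce the mask to a count and to the positions of the missing entries. The following sketch uses a short, made-up Series in place of the real column.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the serving-surface column.
surface = pd.Series([83.0, np.nan, 58.0, np.nan, 51.0],
                    name="Superficie somministrazione")

# Summing a boolean mask counts the True entries, i.e. the missing cells.
missing_count = surface.isnull().sum()

# Index labels of the rows where the value is missing.
missing_positions = surface.index[surface.isnull()].tolist()

print(missing_count, missing_positions)
```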

The number of non-null values is counted for each column, and then summed to obtain the total number of available values in the dataset. This quantity is the numerator in the completeness measure introduced in the exercise.

In [12]:
MILANO.count()

NOT_NULL = MILANO.count().sum()
NOT_NULL



80341

The number of missing values is computed by summing the result of isnull().sum() over all columns. The total number of cells is obtained as the product between the number of rows and columns. As in the lecture, the equality between the total number of cells and the sum of null and non-null values is also verified.

In [13]:
MILANO.isnull().sum()

NULL = MILANO.isnull().sum().sum()
NULL

TOT = MILANO.shape[0] * MILANO.shape[1]
TOT

# Verify the identity instead of overwriting TOT with the sum.
assert TOT == NOT_NULL + NULL
TOT


89752

The completeness of the dataset is evaluated according to the definition given in class. The ratio between the total number of non-null values and the total number of cells is computed and then formatted as a percentage. This gives a single quantitative indicator of how complete the dataset is.

In [14]:
COMPLETENESS = NOT_NULL / TOT
COMPLETENESS = "{0:.1f}%".format(COMPLETENESS * 100)
print(COMPLETENESS)



89.5%
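
The single percentage above summarises the whole table; the same ratio can also be computed column by column to see where the missing values concentrate. The sketch below does this on a toy DataFrame with hypothetical values, using the mean of the notna mask as the fraction of non-null cells.

```python
import numpy as np
import pandas as pd

# Toy table with hypothetical values.
df = pd.DataFrame({
    "Insegna": ["bar cherry", None, "old rooster", None],
    "ZD": [2, 9, 4, 7],
    "Superficie": [59.0, 40.0, np.nan, 43.0],
})

# Mean of a boolean mask = fraction of True values, here non-null cells per column.
per_column = df.notna().mean()

# Same ratio over every cell of the table.
overall = df.notna().to_numpy().mean()

print(per_column.map("{:.1%}".format))
print("{0:.1f}%".format(overall * 100))
```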


To study accuracy, a set of acceptable values is defined for the attribute describing the type of street. This list plays the role of a definition domain, similar to the external STYLES data source used in the example with beers. The attribute will be considered syntactically accurate when its values belong to this domain.

In [15]:
TIPO_VIA_DOMAIN = ['ALZ','BST','VIA','VLE','CSO','GLL','LGO','PLE',
                   'PTA','PZA','RIP','VIE','FOR','VLO','PAS','LARGO']


The syntactic accuracy of the street-type attribute is measured. A boolean Series identifies, for each tuple, whether the value of Tipo via belongs to the previously defined domain. The number of correct values is then summed, the number of non-null values is counted, and their ratio is formatted as a percentage to obtain the accuracy of this attribute.

In [16]:
CORRECT_TIPO = MILANO["Tipo via"].isin(TIPO_VIA_DOMAIN)
CORRECT_TIPO

CORRECT_TIPO_COUNT = np.sum(CORRECT_TIPO)
CORRECT_TIPO_COUNT

NOT_NULL_TIPO = MILANO["Tipo via"].count()
NOT_NULL_TIPO

ACCURACY_TIPO = CORRECT_TIPO_COUNT / NOT_NULL_TIPO
ACCURACY_TIPO = "{0:.1f}%".format(ACCURACY_TIPO * 100)
print(ACCURACY_TIPO)



100.0%
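
When the accuracy is below 100%, the next question is which values fall outside the domain. A minimal sketch follows, with a shortened hypothetical domain and made-up values; it lists the offending values and their frequencies so they can be mapped or corrected.

```python
import pandas as pd

# Shortened, hypothetical domain and sample values.
domain = ["VIA", "VLE", "PZA"]
tipo = pd.Series(["VIA", "via", "VLE", None, "STR", "VIA"], name="Tipo via")

# Missing values are a completeness issue, so they are set aside first.
present = tipo.dropna()

# Values that are present but not part of the domain.
invalid = present[~present.isin(domain)]
invalid_counts = invalid.value_counts()

accuracy = present.isin(domain).sum() / present.count()
print(invalid_counts)
print("{0:.1f}%".format(accuracy * 100))
```

Note that the lower-case "via" is counted as invalid: membership in the domain is case-sensitive.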


A second accuracy example follows the pattern used in the exercise for the ibu attribute. Here it is assumed that valid zone codes must belong to the range from 1 to 9. A Python range object is created, and for each value of ZD it is checked whether it falls into this range. The number of correct codes and the number of non-null codes are used to compute and print the accuracy of the zone attribute.

In [21]:
ZD_RANGE_CORRECT = range(1, 10)

CORRECT_ZD = sum(1 for item in MILANO["ZD"] if item in ZD_RANGE_CORRECT)
CORRECT_ZD

NOT_NULL_ZD = MILANO["ZD"].count()
NOT_NULL_ZD

ACCURACY_ZD = CORRECT_ZD / NOT_NULL_ZD
ACCURACY_ZD = "{0:.1f}%".format(ACCURACY_ZD * 100)
print(ACCURACY_ZD)


100.0%
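
The same range check can be written without an explicit Python loop. The sketch below uses the pandas between method, whose bounds are inclusive by default, on a few made-up zone codes.

```python
import pandas as pd

# Made-up zone codes; 0 and 12 are outside the assumed valid range 1..9.
zd = pd.Series([6, 2, 9, 0, 12, 4], name="ZD")

# between(1, 9) is True where 1 <= value <= 9; missing values would yield False.
in_range = zd.between(1, 9)

accuracy = in_range.sum() / zd.count()
print("{0:.1f}%".format(accuracy * 100))
```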


Timeliness is defined in the exercise as the extent to which the age of the data is appropriate for the task at hand. In the professor’s example it is computed on the PROPERTY dataset by converting an update timestamp into a date, deriving the currency in days and then applying a formula that depends on an assumed volatility. In the Milan establishments dataset there is no attribute recording the update time of each record, so the same computation cannot be performed. For this reason the timeliness dimension is acknowledged conceptually but is not evaluated numerically in this notebook.
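
For reference, the sketch below illustrates how such a computation could look if an update timestamp were available. Everything in it is an assumption: the timestamps, the assessment date and the volatility are invented, and the formula max(0, 1 - currency / volatility) is one commonly used formulation that may differ from the one adopted in the lecture.

```python
import pandas as pd

# Hypothetical update timestamps; the Milan dataset has no such column.
updates = pd.to_datetime(pd.Series(["2024-01-01", "2023-07-01", "2020-01-01"]))

reference = pd.Timestamp("2024-07-01")  # assumed assessment date
volatility_days = 730                   # assumed period over which data stays valid

# Currency: age of each record in days at the assessment date.
currency_days = (reference - updates).dt.days

# Assumed formula: 1 - currency / volatility, floored at 0.
timeliness = (1 - currency_days / volatility_days).clip(lower=0)

print(currency_days.tolist())
print(timeliness.round(2).tolist())
```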

The consistency dimension is evaluated by defining a simple business rule on a numerical attribute. The serving surface is converted explicitly to a numeric type using pd.to_numeric, as shown in the example on the number of bathrooms. This guarantees that any non-numeric values are treated as missing before checking the rule.

In [22]:
MILANO["Superficie somministrazione"] = pd.to_numeric(
    MILANO["Superficie somministrazione"],
    errors="coerce"
)
MILANO["Superficie somministrazione"]



0        83.0
1        26.0
2        58.0
3       101.0
4        51.0
        ...  
6899     59.0
6900     40.0
6901     53.0
6902     43.0
6903     95.0
Name: Superficie somministrazione, Length: 6904, dtype: float64

A basic rule is introduced: whenever the serving surface is present, its value should be strictly greater than zero. A new column called consistency is added to the dataset using numpy.where. The column takes value 1 when the rule is satisfied and 0 otherwise. Because a comparison with a missing value evaluates to False, rows with a missing serving surface also receive 0 at this stage, which is why they must be excluded before the measure is computed. This follows the same pattern as the rule on the number of bathrooms and bedrooms in the professor’s example.

In [23]:
MILANO["consistency"] = np.where(
    MILANO["Superficie somministrazione"] > 0,
    1,
    0
)
MILANO[["Superficie somministrazione", "consistency"]]



Unnamed: 0,Superficie somministrazione,consistency
0,83.0,1
1,26.0,1
2,58.0,1
3,101.0,1
4,51.0,1
...,...,...
6899,59.0,1
6900,40.0,1
6901,53.0,1
6902,43.0,1


To compute the consistency measure, only tuples where the serving surface is not null are considered, because the rule cannot be evaluated when the attribute is missing. A filtered DataFrame is created with these tuples, a boolean mask identifies which ones satisfy the rule, and the number of consistent tuples is obtained by summing the mask.

In [24]:
MILANO_COUNT = MILANO[MILANO["Superficie somministrazione"].notna()]
MILANO_COUNT

CONSISTENT_MASK = MILANO_COUNT["consistency"] == 1
CONSISTENT = CONSISTENT_MASK.sum()
CONSISTENT



np.int64(6825)

The total number of tuples where the rule has been evaluated is counted, and the fraction of tuples satisfying the rule is computed. This ratio is then formatted as a percentage, providing a single quantitative indicator of consistency for the chosen business rule, exactly in the style of the exercise on the PROPERTY dataset.

In [25]:
COUNT = MILANO_COUNT["consistency"].count()
COUNT

CONSISTENCY = CONSISTENT / COUNT
CONSISTENCY = "{0:.1f}%".format(CONSISTENCY * 100)
print(CONSISTENCY)

100.0%
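
The same measure can be obtained more compactly by dropping the missing values first and averaging the boolean result of the rule. The sketch below uses made-up surfaces, including a zero and a negative value, to show a case where the rule is violated.

```python
import numpy as np
import pandas as pd

# Made-up surfaces: one missing, one zero and one negative value.
surface = pd.Series([83.0, 0.0, np.nan, 26.0, -5.0])

# The rule can only be evaluated where the value is present.
evaluated = surface.dropna()

# Business rule: the serving surface must be strictly positive.
satisfied = evaluated > 0

consistency = satisfied.mean()
print("{0:.1f}%".format(consistency * 100))
```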


### Data Quality – First checks on the Milan dataset

From the value counts of **`Settore storico pe`** it is clear that this column is very messy from a data-quality point of view. There are about 3975 distinct values, more than half the number of rows (6904), and only a few categories appear often, such as *Genere Merceol.Autorizz.Sanit.* (214 rows), *Ristorante, trattoria, osteria* (174 rows) and *Bar gastronomici e simili* (165 rows). Many other entries are long strings in which several sectors are concatenated with `;` (for example combinations of bar, trattoria, pizzeria, games, etc.). The column is therefore a multi-label field encoded as free text, with a long tail of almost unique combinations. For profiling it is useful to see the most common sectors, but for later analysis this field will probably need cleaning or simplification (for example splitting on `;`, normalising names and perhaps keeping only the main categories); otherwise its very high cardinality makes it uninformative.
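
A minimal sketch of the splitting step is shown below. It uses a handful of made-up entries rather than the real column, and it assumes that `;` is the only separator between labels.

```python
import pandas as pd

# Made-up entries imitating the concatenated sector labels.
settore = pd.Series([
    "Bar gastronomici e simili",
    "Trattoria;Pizzeria",
    "Pizzeria;Bar gastronomici e simili",
    None,
], name="Settore storico pe")

# Split each entry on ';', give every label its own row, and trim whitespace.
labels = settore.dropna().str.split(";").explode().str.strip()

# Frequencies of the individual labels instead of the raw combinations.
label_counts = labels.value_counts()
print(label_counts)
```

Counting individual labels rather than whole strings reduces the cardinality to the set of base sectors, which is easier to normalise afterwards.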

The duplicate check shows a good situation: `MILANO.duplicated()` flags only **one duplicated row** (the second copy of a pair, since the first occurrence is not marked by default), where almost all descriptive fields are missing and only the address information and the sector *Spaccio bevande analcoliche* are present. At full-row level the dataset is therefore essentially free of exact duplicates, apart from this single pair, which can easily be dropped in the cleaning phase.

The calls to `MILANO.isnull()` and to `MILANO["Superficie somministrazione"].isnull()` are used to inspect missing values at cell level. Even though the printed snippet for `Superficie somministrazione` shows only `False`, the output is truncated, and the non-null counts show that this column has fewer values than the table has rows, so some `NaN` values exist in the rows not displayed. Finally, `MILANO.count()` together with `NOT_NULL = MILANO.count().sum()` lets us compare the number of non-null cells (80,341) with the total number of cells (89,752). This is the basis for computing the overall completeness of the dataset and, combined with column-wise counts, will help us decide where imputation or other strategies for handling missing data are needed.
