# 2. DATA PROFILING (YData Profiling)

In this notebook the focus is on automatic data profiling. The goal is to generate a complete profiling report for the Milan dataset and to inspect some of the summary information programmatically.



The classes and modules needed for this notebook are imported. `ProfileReport` is the main class from `ydata_profiling`, `pandas` is used to load the Milan dataset, and `json` is used later to open and inspect the profiling report saved in JSON format.


In [1]:
from ydata_profiling import ProfileReport
import pandas as pd
import json


The Milan dataset of public establishments is loaded from the local CSV file using `read_csv`. The separator is set to semicolon because fields in the file are separated by `;`. The DataFrame is stored in the variable `MILANO`, which will be used as input for the profiling report.


In [2]:
MILANO = pd.read_csv("Comune-di-Milano-Pubblici-esercizi(in)-2.csv", sep=";")
MILANO


Unnamed: 0,þÿTipo esercizio storico pe,Insegna,Ubicazione,Tipo via,Descrizione via,Civico,Codice via,ZD,Forma commercio,Forma commercio prev,Forma vendita,Settore storico pe,Superficie somministrazione
0,,,ALZ NAVIGLIO GRANDE N. 12 ; isolato:057; (z.d. 6),ALZ,NAVIGLIO GRANDE,12,5144,6,,,,"Ristorante, trattoria, osteria;Genere Merceol....",83.0
1,,,ALZ NAVIGLIO GRANDE N. 44 (z.d. 6),ALZ,NAVIGLIO GRANDE,44,5144,6,,,,Bar gastronomici e simili,26.0
2,,,ALZ NAVIGLIO GRANDE N. 48 (z.d. 6),ALZ,NAVIGLIO GRANDE,48,5144,6,,,,Bar gastronomici e simili,58.0
3,,,ALZ NAVIGLIO GRANDE N. 8 (z.d. 6),ALZ,NAVIGLIO GRANDE,8,5144,6,,,,"BAR CAFFÿý E SIMILI;Ristorante, trattoria, ost...",101.0
4,,,ALZ NAVIGLIO PAVESE N. 24 (z.d. 6),ALZ,NAVIGLIO PAVESE,24,5161,6,,,,Bar gastronomici e simili,51.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6899,"wine,birr.,pub enot.,caff.,the",bar cherry,VLE DORIA ANDREA N. 12 ; isolato:031; accesso:...,VLE,DORIA ANDREA,12,2230,2,solo somministrazione,somministrazione,misto,"Wine,birr.,pub enot.,caff.,the",59.0
6900,"wine,birr.,pub enot.,caff.,the",la balusa,VIA GARIGLIANO N. 5 ; isolato:277; accesso: ac...,VIA,GARIGLIANO,5,1134,9,solo somministrazione,somministrazione,misto,"Wine,birr.,pub enot.,caff.,the",40.0
6901,"wine,birr.,pub enot.,caff.,the",la champagnerie sas,VIA SOTTOCORNO PASQUALE N. 4 ; isolato:014; ac...,VIA,SOTTOCORNO PASQUALE,4,3152,4,solo somministrazione,somministrazione,misto,BAR CAFFÿý E SIMILI;Bar gastronomici e simili,53.0
6902,"wine,birr.,pub enot.,caff.,the",old rooster,VIA CASTROVILLARI N. 23 ; isolato:150; accesso...,VIA,CASTROVILLARI,23,6299,7,solo somministrazione,somministrazione,misto,"Wine,birr.,pub enot.,caff.,the",43.0
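The first column name in the output above starts with `þÿ`, which looks like byte-order-mark debris left over from the file's original encoding. The following is a minimal sketch, assuming the prefix carries no meaning, of how such a prefix could be stripped from the column labels. `clean_column_names` is a hypothetical helper, and the rest of this notebook keeps the original column name.

```python
def clean_column_names(columns, debris="þÿ"):
    """Return the column labels with a leading debris prefix removed.

    Assumption: the prefix is an encoding artifact and not part of the
    real column name.
    """
    cleaned = []
    for name in columns:
        if name.startswith(debris):
            name = name[len(debris):]
        cleaned.append(name.strip())
    return cleaned


# Example with two of the column names shown above.
columns = ["þÿTipo esercizio storico pe", "Insegna"]
cleaned_columns = clean_column_names(columns)
print(cleaned_columns)

# With a DataFrame this could be applied as:
# MILANO.columns = clean_column_names(MILANO.columns)
```

If the columns were renamed this way, every later lookup that uses the original name (for example in the JSON report) would have to use the cleaned name instead.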


A profiling report is created from the Milan DataFrame using `ProfileReport`. This object contains all the computed statistics and sections described before. Displaying the object in the notebook shows the interactive HTML report.


In [3]:
PROFILE = ProfileReport(MILANO, title="Profiling Report - Milan Public Establishments")
PROFILE





The profiling report is saved to an HTML file. This file can be opened in a web browser to explore the full interactive report outside the notebook.


In [37]:
PROFILE.to_file("MILANO_REPORT.html")



The same profiling report is also exported to a JSON file. The JSON version contains all the summary information in a structured format. It can be read and inspected with Python to extract specific statistics, like the total number of rows or the number of distinct values for a given variable.


In [4]:
PROFILE.to_file("MILANO_REPORT.json")



The JSON report is opened from disk and loaded into a Python object using the `json` module. The variable `JFILE` now contains the full profiling summary in dictionary form.


In [6]:
# Open the report with a context manager so the file is closed after loading.
with open("MILANO_REPORT.json") as file:
    JFILE = json.load(file)


The JSON object is displayed to see its structure. This shows that the report is divided into sections such as `"table"` and `"variables"`, which can be accessed by key. 


In [7]:
JFILE


{'analysis': {'title': 'Profiling Report - Milan Public Establishments',
  'date_start': '2025-12-12 09:31:28.867211',
  'date_end': '2025-12-12 09:31:30.750075'},
 'time_index_analysis': 'None',
 'table': {'n': 6904,
  'n_var': 13,
  'memory_size': 718148,
  'record_size': 104.01911935110081,
  'n_cells_missing': 9411,
  'n_vars_with_missing': 8,
  'n_vars_all_missing': 0,
  'p_cells_missing': 0.1048556021035743,
  'types': {'Categorical': 5, 'Text': 5, 'Numeric': 3},
  'n_duplicates': 1,
  'p_duplicates': 0.00014484356894553882},
 'variables': {'þÿTipo esercizio storico pe': {'n_distinct': 23,
   'p_distinct': 0.004143397586020537,
   'is_unique': False,
   'n_unique': 1,
   'p_unique': 0.0001801477211313277,
   'type': 'Categorical',
   'hashable': True,
   'value_counts_without_nan': {'bar caffÿý': 3274,
    'ristorante': 807,
    'trattoria': 554,
    'pizzeria': 350,
    'tavola calda': 104,
    'spaccio bevande analcoliche': 82,
    'ristorante, trattoria, osteria': 62,
    'bar

The total number of rows in the dataset is retrieved from the JSON profile. This is done by accessing the `"table"` section and then the `"n"` key. This confirms that the profiling report correctly captured the size of the Milan dataset.


In [8]:
JFILE["table"]["n"]


6904
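The other fields of the `"table"` section can be read in the same way. The sketch below copies a few values from the output shown earlier into a small dictionary (an excerpt, not the full report) and checks that the reported ratios are consistent with the counts. It assumes that `p_cells_missing` is computed over all `n * n_var` cells and `p_duplicates` over all rows.

```python
import math

# Excerpt of JFILE["table"], copied from the output above.
table = {
    "n": 6904,
    "n_var": 13,
    "n_cells_missing": 9411,
    "p_cells_missing": 0.1048556021035743,
    "n_duplicates": 1,
    "p_duplicates": 0.00014484356894553882,
}

# Fraction of missing cells, assuming the denominator is rows * columns.
total_cells = table["n"] * table["n_var"]
missing_ratio = table["n_cells_missing"] / total_cells

# Fraction of duplicated rows.
duplicate_ratio = table["n_duplicates"] / table["n"]

print(f"missing cells: {missing_ratio:.2%}")
print(f"duplicate rows: {duplicate_ratio:.4%}")

# Compare the recomputed ratios with the ones stored in the report.
cells_consistent = math.isclose(
    missing_ratio, table["p_cells_missing"], rel_tol=1e-6
)
duplicates_consistent = math.isclose(
    duplicate_ratio, table["p_duplicates"], rel_tol=1e-6
)
print(cells_consistent, duplicates_consistent)
```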

As an example of how to inspect the statistics of a single attribute, the number of distinct values is read for the variable `Superficie somministrazione`. In the JSON report this information is stored under the `"variables"` section, in the `"n_distinct"` field for that variable. 


In [9]:
JFILE["variables"]["Superficie somministrazione"]["n_distinct"]


400
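The same lookup can be repeated for every variable. The following is a minimal sketch using a two-variable excerpt of `JFILE["variables"]`. The `Numeric` type for `Superficie somministrazione` is taken from the interpretation section of this notebook rather than from the displayed output, and `summarize_variables` is a hypothetical helper.

```python
def summarize_variables(variables):
    """Return (name, type, n_distinct) tuples, highest cardinality first."""
    summary = [
        (name, info.get("type"), info.get("n_distinct"))
        for name, info in variables.items()
    ]
    # Variables without an n_distinct value sort last.
    return sorted(summary, key=lambda item: item[2] or 0, reverse=True)


# Excerpt of JFILE["variables"]; only the fields used here are included.
variables = {
    "þÿTipo esercizio storico pe": {"type": "Categorical", "n_distinct": 23},
    "Superficie somministrazione": {"type": "Numeric", "n_distinct": 400},
}

variable_summary = summarize_variables(variables)
for name, var_type, n_distinct in variable_summary:
    print(f"{name}: {var_type}, {n_distinct} distinct values")

# With the full report: summarize_variables(JFILE["variables"])
```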

## dataprofiler library

`dataprofiler` is another library that can be used for automatic data profiling. It produces a profile object for a pandas DataFrame and can generate a readable report with many statistics, similar to `ydata_profiling`. In this section the same Milan dataset is used again, but the profiling is done with `dataprofiler`.


The `dataprofiler` library is installed using `pip`. This makes the `Data` and `Profiler` classes available in the environment.


In [43]:
%pip install dataprofiler


Note: you may need to restart the kernel to use updated packages.


The Milan dataset is loaded again into a DataFrame.


In [44]:
import pandas as pd

MILANO = pd.read_csv("Comune-di-Milano-Pubblici-esercizi(in)-2.csv", sep=";")
MILANO


Unnamed: 0,þÿTipo esercizio storico pe,Insegna,Ubicazione,Tipo via,Descrizione via,Civico,Codice via,ZD,Forma commercio,Forma commercio prev,Forma vendita,Settore storico pe,Superficie somministrazione
0,,,ALZ NAVIGLIO GRANDE N. 12 ; isolato:057; (z.d. 6),ALZ,NAVIGLIO GRANDE,12,5144,6,,,,"Ristorante, trattoria, osteria;Genere Merceol....",83.0
1,,,ALZ NAVIGLIO GRANDE N. 44 (z.d. 6),ALZ,NAVIGLIO GRANDE,44,5144,6,,,,Bar gastronomici e simili,26.0
2,,,ALZ NAVIGLIO GRANDE N. 48 (z.d. 6),ALZ,NAVIGLIO GRANDE,48,5144,6,,,,Bar gastronomici e simili,58.0
3,,,ALZ NAVIGLIO GRANDE N. 8 (z.d. 6),ALZ,NAVIGLIO GRANDE,8,5144,6,,,,"BAR CAFFÿý E SIMILI;Ristorante, trattoria, ost...",101.0
4,,,ALZ NAVIGLIO PAVESE N. 24 (z.d. 6),ALZ,NAVIGLIO PAVESE,24,5161,6,,,,Bar gastronomici e simili,51.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6899,"wine,birr.,pub enot.,caff.,the",bar cherry,VLE DORIA ANDREA N. 12 ; isolato:031; accesso:...,VLE,DORIA ANDREA,12,2230,2,solo somministrazione,somministrazione,misto,"Wine,birr.,pub enot.,caff.,the",59.0
6900,"wine,birr.,pub enot.,caff.,the",la balusa,VIA GARIGLIANO N. 5 ; isolato:277; accesso: ac...,VIA,GARIGLIANO,5,1134,9,solo somministrazione,somministrazione,misto,"Wine,birr.,pub enot.,caff.,the",40.0
6901,"wine,birr.,pub enot.,caff.,the",la champagnerie sas,VIA SOTTOCORNO PASQUALE N. 4 ; isolato:014; ac...,VIA,SOTTOCORNO PASQUALE,4,3152,4,solo somministrazione,somministrazione,misto,BAR CAFFÿý E SIMILI;Bar gastronomici e simili,53.0
6902,"wine,birr.,pub enot.,caff.,the",old rooster,VIA CASTROVILLARI N. 23 ; isolato:150; accesso...,VIA,CASTROVILLARI,23,6299,7,solo somministrazione,somministrazione,misto,"Wine,birr.,pub enot.,caff.,the",43.0


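The `Profiler` class from `dataprofiler` is used to create a profile for the Milan dataset. This object computes statistics and type information for each attribute, similarly to the profiling report created with `ydata_profiling`. Displaying the `profile` object only shows its representation (a `StructuredProfiler` instance). The log printed during profiling also reports that the optional data labeler was skipped because `tensorflow` is not installed; the statistics are still computed.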


In [10]:
from dataprofiler import Data, Profiler

profile = Profiler(MILANO)
profile


INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns... 




Profiling Type: data_labeler
Exception: ModuleNotFoundError
Message: No module named 'tensorflow'

For labeler errors, try installing the extra ml requirements via:

$ pip install dataprofiler[ml] --user


  profiler_utils.warn_on_profile("data_labeler", e)
100%|██████████| 13/13 [00:00<00:00, 184.61it/s]

INFO:DataProfiler.profilers.profile_builder: Calculating the statistics... 



100%|██████████| 13/13 [00:00<00:00, 25.55it/s]


<dataprofiler.profilers.profile_builder.StructuredProfiler at 0x75c4f051e840>

A compact readable report is generated from the dataprofiler profile using the `report` method. The `output_format` is set to `"compact"`, as in the teacher's example. The resulting object contains a structured summary of the profiling results for the Milan dataset.


In [11]:
readable_report = profile.report(report_options={"output_format": "compact"})
readable_report


{'global_stats': {'samples_used': 5000,
  'column_count': 13,
  'row_count': 6904,
  'row_has_null_ratio': 0.5258,
  'row_is_null_ratio': 0.0,
  'unique_row_ratio': 0.9999,
  'duplicate_row_count': 1,
  'file_type': "<class 'pandas.core.frame.DataFrame'>",
  'encoding': None,
  'correlation_matrix': None,
  'chi2_matrix': '[[ 1., nan, nan,  0., nan,  0., nan,  0.,  0.,  0.,  0., nan,  0.], ... , [ 0., nan, nan,  0., nan,  0., nan,  0.,  0.,  0.,  0., nan,  1.]]',
  'profile_schema': {'þÿTipo esercizio storico pe': [0],
   'Insegna': [1],
   'Ubicazione': [2],
   'Tipo via': [3],
   'Descrizione via': [4],
   'Civico': [5],
   'Codice via': [6],
   'ZD': [7],
   'Forma commercio': [8],
   'Forma commercio prev': [9],
   'Forma vendita': [10],
   'Settore storico pe': [11],
   'Superficie somministrazione': [12]},
  'times': {'row_stats': 0.0126}},
 'data_stats': [{'column_name': 'þÿTipo esercizio storico pe',
   'data_type': 'string',
   'categorical': True,
   'order': 'random',
   'sa

In this part the compact report produced by `dataprofiler` is used to build a small summary table.  
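For each column the table shows the number of null values, the null ratio, the number of unique values, and the minimum and maximum. For string columns the `min` and `max` values appear to be string lengths rather than data values: `Tipo via`, for example, has a minimum and maximum of 3, which matches three-letter codes such as `VIA`.  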

This helps to connect the automatic profiling done by `dataprofiler` with the manual data profiling performed before (for example on `Superficie somministrazione`).

In [12]:
import pandas as pd

rows = []
for col in readable_report["data_stats"]:
    name = col["column_name"]
    stats = col["statistics"]
    rows.append({
        "column": name,
        "null_count": stats.get("null_count"),
        "null_ratio": stats.get("null_ratio"),
        "unique_count": stats.get("unique_count"),
        "min": stats.get("min"),
        "max": stats.get("max"),
    })

df_profiler_summary = pd.DataFrame(rows)
df_profiler_summary


Unnamed: 0,column,null_count,null_ratio,unique_count,min,max
0,þÿTipo esercizio storico pe,962,,23,5.0,30.0
1,Insegna,2450,,2123,2.0,27.0
2,Ubicazione,0,,4674,22.0,135.0
3,Tipo via,0,,14,3.0,3.0
4,Descrizione via,0,,1629,3.0,32.0
5,Civico,108,,243,1.0,3.0
6,Codice via,0,,1642,1.0,7602.0
7,ZD,0,,9,1.0,9.0
8,Forma commercio,1122,,2,21.0,23.0
9,Forma commercio prev,999,,2,6.0,16.0


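Using the summary table, the columns are sorted by `null_ratio`.  
In the output below, however, the `null_ratio` column is empty: `stats.get("null_ratio")` returned no value for any column, so the sort leaves the rows in their original order. The `null_count` column is the one to read here: the attributes with the most missing values, such as `Insegna`, are the most critical from the completeness point of view.  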

These results can be compared with the information shown in the HTML report of `ydata_profiling` and with the completeness analysis done in the previous notebooks.

In [13]:
df_profiler_summary.sort_values("null_ratio", ascending=False).head(10)


Unnamed: 0,column,null_count,null_ratio,unique_count,min,max
0,þÿTipo esercizio storico pe,962,,23,5.0,30.0
1,Insegna,2450,,2123,2.0,27.0
2,Ubicazione,0,,4674,22.0,135.0
3,Tipo via,0,,14,3.0,3.0
4,Descrizione via,0,,1629,3.0,32.0
5,Civico,108,,243,1.0,3.0
6,Codice via,0,,1642,1.0,7602.0
7,ZD,0,,9,1.0,9.0
8,Forma commercio,1122,,2,21.0,23.0
9,Forma commercio prev,999,,2,6.0,16.0
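Because the `null_ratio` column is empty in the tables above, a ratio can instead be derived from `null_count`. The sketch below uses the null counts shown in the summary table and divides them by `samples_used` (5000) from `global_stats`. That the counts refer to the sampled rows rather than to all 6904 rows is an assumption.

```python
# Null counts copied from the summary table above (first ten columns).
null_counts = {
    "þÿTipo esercizio storico pe": 962,
    "Insegna": 2450,
    "Ubicazione": 0,
    "Tipo via": 0,
    "Descrizione via": 0,
    "Civico": 108,
    "Codice via": 0,
    "ZD": 0,
    "Forma commercio": 1122,
    "Forma commercio prev": 999,
}

# Assumption: null counts were computed on the sampled rows only.
SAMPLES_USED = 5000

null_ratios = {name: count / SAMPLES_USED for name, count in null_counts.items()}

# Columns ordered from the least complete to the most complete.
ranked = sorted(null_ratios.items(), key=lambda item: item[1], reverse=True)
for name, ratio in ranked:
    print(f"{name}: {ratio:.1%}")
```

In the notebook the same idea could be applied to `df_profiler_summary` by dividing its `null_count` column by `readable_report["global_stats"]["samples_used"]` before sorting.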


Finally, the statistics for the column `Superficie somministrazione` are extracted directly from `readable_report`.  
This makes it possible to check that the minimum, maximum, number of null values and number of unique values detected by `dataprofiler` are consistent with:

- the manual data profiling done in the previous notebook, and  
- the information shown in the `ydata_profiling` HTML report.

This step shows that the automatic tools are computing the same kind of statistics that were already analysed by hand.

In [14]:
sup_stats = next(
    c for c in readable_report["data_stats"]
    if c["column_name"] == "Superficie somministrazione"
)
sup_stats


{'column_name': 'Superficie somministrazione',
 'data_type': 'int',
 'categorical': True,
 'order': 'random',
 'samples': "['251.0', '119.0', '90.0', '113.0', '27.0']",
 'statistics': {'min': 4.0,
  'max': 2041.0,
  'mode': '[60.0175]',
  'median': 63.553,
  'sum': 426909.0,
  'mean': 86.2965,
  'variance': 7634.4104,
  'stddev': 87.3751,
  'skewness': 6.9714,
  'kurtosis': 96.4466,
  'quantiles': {0: 42.2069, 1: 63.553, 2: 101.2013},
  'median_abs_deviation': 26.6513,
  'num_zeros': 0,
  'num_negatives': 0,
  'unique_count': 368,
  'unique_ratio': 0.0744,
  'categories': "['60.0', '59.0', '58.0', ... , '5.0', '465.0', '203.0']",
  'gini_impurity': 0.9918,
  'unalikeability': 0.992,
  'categorical_count': {'60.0': 107,
   '59.0': 91,
   '58.0': 75,
   '50.0': 72,
   '57.0': 71,
   '32.0': 71,
   '38.0': 71,
   '40.0': 69,
   '41.0': 69,
   '30.0': 68,
   '34.0': 67,
   '80.0': 65,
   '55.0': 64,
   '35.0': 64,
   '70.0': 64,
   '53.0': 62,
   '52.0': 61,
   '36.0': 61,
   '37.0': 58,
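A few of the reported values can be cross-checked against each other. The sketch below uses the statistics shown above and assumes that `mean` is `sum` divided by the number of non-null sampled values and that `stddev` is the square root of `variance`. Both are assumptions about how the report is computed, and the figures are rounded in the report, so only approximate agreement can be expected.

```python
import math

# Values copied from the statistics of `Superficie somministrazione` above.
stats = {
    "sum": 426909.0,
    "mean": 86.2965,
    "variance": 7634.4104,
    "stddev": 87.3751,
    "median": 63.553,
    "unique_count": 368,
    "unique_ratio": 0.0744,
}

# Number of non-null values implied by sum / mean (rounded, since the
# report shows rounded figures).
implied_count = round(stats["sum"] / stats["mean"])

# Standard deviation recomputed from the variance.
stddev_from_variance = math.sqrt(stats["variance"])

# Unique ratio recomputed from the unique count and the implied count.
unique_ratio_from_count = stats["unique_count"] / implied_count

# A mean well above the median is a simple sign of right skew.
mean_above_median = stats["mean"] > stats["median"]

print(implied_count)
print(round(stddev_from_variance, 4))
print(round(unique_ratio_from_count, 4))
print(mean_above_median)
```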
 

### YData Profiling – Interpretation of the report

The YData Profiling report shows that the Milan dataset has 13 columns and 6904 rows, with only one duplicate record but many rows that contain at least one missing value (about 10.5% of all cells are missing, across 8 of the 13 columns). From a data type point of view, the columns can be grouped into three blocks: numeric (`Codice via`, `ZD`, `Superficie somministrazione`), purely categorical (`Tipo via`, `Forma commercio`, `Forma commercio prev`, `Forma vendita`, `þÿTipo esercizio storico pe`), and high-cardinality text (`Ubicazione`, `Descrizione via`, `Insegna`, `Civico`, `Settore storico pe`).

YData clearly highlights two main issues: the presence of many missing values concentrated on some key variables (especially `Insegna`, the commerce/sales forms and partly the establishment type) and the strong class imbalance in several categorical columns. For example, `Tipo via` is almost always “VIA”, `Forma commercio` is almost always “solo somministrazione”, and `Forma commercio prev` is dominated by “somministrazione”, while `Forma vendita` is a bit more balanced between “misto” and “al banco”. The column `þÿTipo esercizio storico pe` is also skewed towards bar/café categories.

The text-like columns such as `Ubicazione`, `Descrizione via`, `Insegna` and especially `Settore storico pe` have a very large number of distinct values: they are very informative for understanding each single establishment, but they are difficult to use directly in a model without some feature engineering (for example parsing, grouping or embedding). `Superficie somministrazione` appears as a useful numeric variable but with a strongly right-skewed distribution: most venues are small or medium size, with few very large cases.

Finally, YData reports some interesting correlations: street codes (`Codice via`) are related to the zone (`ZD`), and the historical establishment type is linked to the sales form, the commerce forms and the surface area. In practice, the combination of type of venue, way of serving and size follows clear patterns that are also intuitively reasonable (small bars serving at the counter, larger restaurants more often serving at the table or mixed).


### DataProfiler – Interpretation of the report

The DataProfiler report is computed on a sample of 5000 of the 6904 rows and globally confirms the structure seen with YData: 13 columns, a very varied dataset with almost no duplicate records (one duplicate row, unique row ratio 0.9999), and a significant fraction of rows that contain at least one null value (about 53%, from `row_has_null_ratio`). For each column, DataProfiler provides descriptive statistics, frequency distributions, number of unique values and measures such as gini impurity and unalikeability to evaluate how concentrated or spread out the categories are.

From the sample we see that `þÿTipo esercizio storico pe` has 23 categories with a strong prevalence of bar and restaurant types; `Tipo via` has few distinct values with “VIA” clearly dominant, while `Civico` and `Codice via` have many distinct values and behave more like identifiers than proper numeric measurements. `ZD` is an integer from 1 to 9 with a non-uniform distribution but all zones present. The commerce forms (`Forma commercio` and `Forma commercio prev`) have only two categories each and a very strong concentration on “solo somministrazione” / “somministrazione”, which is reflected in very low gini impurity values. `Forma vendita` shows four categories with a good presence of “misto” and “al banco” and fewer cases for “al tavolo” and “self service”.

For text-like columns such as `Insegna`, `Ubicazione`, `Descrizione via` and `Settore storico pe`, DataProfiler highlights high cardinality and many unique values, confirming that these are descriptive fields rather than clean low-cardinality categorical variables. `Superficie somministrazione` is treated as an integer (and also flagged as categorical) with a very right-skewed distribution: many small/medium surfaces and a few very large values. The report also gives null counts per column, which on the sample are in line with what YData observes on the full dataset, in particular for `Insegna`, the commerce/sales forms and the establishment type.
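The right skew of `Superficie somministrazione` can be made concrete with the quartiles in the report. The following is a minimal sketch, assuming the three entries of `quantiles` are the first quartile, the median and the third quartile, and using the common 1.5 × IQR rule. That rule is a convention chosen here for illustration, not something DataProfiler reports.

```python
# Quartiles, minimum and maximum copied from the report for
# `Superficie somministrazione`.
q1, median, q3 = 42.2069, 63.553, 101.2013
minimum, maximum = 4.0, 2041.0

# Interquartile range and Tukey-style fences (1.5 * IQR rule).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond a fence are usually treated as potential outliers.
has_high_outliers = maximum > upper_fence
has_low_outliers = minimum < lower_fence

print(f"IQR: {iqr:.4f}")
print(f"fences: [{lower_fence:.4f}, {upper_fence:.4f}]")
print(f"high outliers: {has_high_outliers}, low outliers: {has_low_outliers}")
```

The maximum lies far above the upper fence while the minimum stays inside the lower one, which matches the long right tail described above.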


### Comparison between YData Profiling and DataProfiler

The two tools basically tell the same story about the Milan dataset, even if they present it in different ways. YData works on the full dataset and focuses on visual summaries and data quality indicators (missing by column, imbalance of categorical variables, automatic correlations between features). DataProfiler works on a 5000-row sample and returns a more “numerical” view for each column, with classical statistics, distributions, category counts, gini impurity and data type representation.

The biggest differences are in style, not in content: YData groups columns into Numeric, Categorical and Text, while DataProfiler uses labels like int, string, text with an additional categorical flag. In both cases, however, `Codice via`, `ZD` and `Superficie somministrazione` are interpreted as numeric, the commerce and sales forms as very imbalanced categorical variables, and `Ubicazione`, `Descrizione via`, `Insegna` and `Settore storico pe` as high-cardinality text fields. The levels of missing data and the imbalance between categories that emerge from DataProfiler are consistent with what YData shows, taking into account that one uses the full dataset and the other a sample.

In summary, the results of YData Profiling and DataProfiler are coherent with each other: both confirm that the dataset is structurally well-formed but has many missing values in some key variables, categorical features dominated by a few main classes (especially for establishment and commerce types), and some very rich but complex text fields. The small numerical differences between the two reports (for example 400 distinct values of `Superficie somministrazione` in YData against 368 in the DataProfiler sample) are easily explained by sampling in DataProfiler and do not change the main conclusions about the data.
