Project: **Classification of descriptive texts by intended use, for assets subject to Italian judicial auctions.**

Student: **Alessandro Monolo** | *10439147*

Supervisor: Marco Brambilla

Company contact: Simone Redaelli

Master: Data Science & Artificial Intelligence

University: Politecnico di Milano

<hr style="border:1px solid black">

## Pre-processing of the data frame for the "Logistic Regression" predictive model

- **1.** Create a **new column** named **"Regione"** derived from the "Provincia" field;


- **2.** **Create** a **new column** from the **boolean** column **"Catasto_Fabbricati"**, converted to **Integer**;


- **3.** Apply **pre-processing** to the **target variable "Destinazione d'uso"** using **"LabelEncoder"**;


- **4.** **Pre-processing** of the categorical variable **"Comune"**:
    - Extract the **10 most frequent** values;
    - Create **10 dummy variables** that flag whether the asset is located in one of the top 10 municipalities by count;


- **5.** **Drop** the columns that are **not useful to the model**;


- **6.** **Pre-processing** of the categorical columns with **"One-Hot Encoder"**:
    - Specifically the last 3 remaining categorical columns: **Regione**, **Provincia**, **Tribunale**;


- **7.** **Conclusions**;


- **8.** **Export** the data frame in **csv** format;


- **9.** **References**.

<hr style="border:1px solid black">

**Import the required libraries:**

In [1]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
import math
import warnings
from matplotlib import cm
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder

#### Set pandas options:

In [2]:
pd.set_option('display.max_colwidth', None)
pd.options.display.max_rows = 5000
pd.options.display.max_columns = 1000
pd.options.display.float_format = '{:.6f}'.format
pd.options.mode.chained_assignment = None

**Import the CSV file:**

In [3]:
df = pd.read_csv("D:\\Master_Cefriel_DS_AI_Monolo\\0_Project_Work\\Dataset\\8_Dataset_Benchmark\\Dataset_Benchmark.csv",
                 dtype={'Numero_Lotto' : 'Int64'})
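The `dtype={'Numero_Lotto' : 'Int64'}` argument selects pandas' nullable integer type. The notebook does not say why; one plausible reason, assumed here, is that the raw column may contain missing values, which would otherwise force it to float. A minimal sketch on a made-up three-row CSV (the data and the `csv_text` name are hypothetical):

```python
import io
import pandas as pd

# Hypothetical three-row CSV: the second row has no lot number.
csv_text = "Numero_Lotto,Superficie\n1,69.0\n,15.0\n2,80.0\n"

# Default parsing: the missing value forces the column to float64.
default_df = pd.read_csv(io.StringIO(csv_text))

# Nullable integer dtype: values stay integers and the gap becomes <NA>.
nullable_df = pd.read_csv(io.StringIO(csv_text), dtype={"Numero_Lotto": "Int64"})

print(default_df["Numero_Lotto"].dtype)
print(nullable_df["Numero_Lotto"].dtype)
```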

### 1 - Create a new "Regione" column by mapping each province to its region

In [4]:
# Dictionary mapping every province code found in the "Provincia" field of the data frame to its region:
province_to_region = {
    'TO': 'Piemonte',
    'TE': 'Abruzzo',
    'MT': 'Basilicata',
    'AT': 'Piemonte',
    'RC': 'Calabria',
    'FI': 'Toscana',
    'VT': 'Lazio',
    'RM': 'Lazio',
    'AL': 'Piemonte',
    'BI': 'Piemonte',
    'ME': 'Sicilia',
    'IM': 'Liguria',
    'CS': 'Calabria',
    'VC': 'Piemonte',
    'SV': 'Liguria',
    'CN': 'Piemonte',
    'FG': 'Puglia',
    'SP': 'Liguria',
    'NO': 'Piemonte',
    'LI': 'Toscana',
    'MB': 'Lombardia',
    'BA': 'Puglia',
    'MI': 'Lombardia',
    'VA': 'Lombardia',
    'AO': 'Valle d\'Aosta',
    'OR': 'Sardegna',
    'VB': 'Piemonte',
    'PV': 'Lombardia',
    'CZ': 'Calabria',
    'SU': 'Sardegna',
    'GE': 'Liguria',
    'PC': 'Emilia-Romagna',
    'SA': 'Campania',
    'CO': 'Lombardia',
    'SR': 'Sicilia',
    'OT': 'Sardegna',
    'SS': 'Sardegna',
    'GR': 'Toscana',
    'SI': 'Toscana',
    'PR': 'Emilia-Romagna',
    'CL': 'Sicilia',
    'PI': 'Toscana',
    'LT': 'Lazio',
    'RG': 'Sicilia',
    'MS': 'Toscana',
    'SO': 'Lombardia',
    'AV': 'Campania',
    'AR': 'Toscana',
    'BG': 'Lombardia',
    'TV': 'Veneto',
    'LO': 'Lombardia',
    'BL': 'Veneto',
    'LC': 'Lombardia',
    'CR': 'Lombardia',
    'PN': 'Friuli-Venezia Giulia',
    'CT': 'Sicilia',
    'UD': 'Friuli-Venezia Giulia',
    'EN': 'Sicilia',
    'VR': 'Veneto',
    'PA': 'Sicilia',
    'VE': 'Veneto',
    'BS': 'Lombardia',
    'RE': 'Emilia-Romagna',
    'RN': 'Emilia-Romagna',
    'FE': 'Emilia-Romagna',
    'RA': 'Emilia-Romagna',
    'PD': 'Veneto',
    'NU': 'Sardegna',
    'MN': 'Lombardia',
    'VI': 'Veneto',
    'RO': 'Veneto',
    'BN': 'Campania',
    'VV': 'Calabria',
    'MO': 'Emilia-Romagna',
    'BZ': 'Trentino-Alto Adige',
    'KR': 'Calabria',
    'TR': 'Umbria',
    'BO': 'Emilia-Romagna',
    'GO': 'Friuli-Venezia Giulia',
    'TS': 'Friuli-Venezia Giulia',
    'LE': 'Puglia',
    'OG': 'Sardegna',
    'FC': 'Emilia-Romagna',
    'IS': 'Molise',
    'PU': 'Marche',
    'AN': 'Marche',
    'FM': 'Marche',
    'MC': 'Marche',
    'AP': 'Marche',
    'PE': 'Abruzzo',
    'LU': 'Toscana',
    'AQ': 'Abruzzo',
    'CE': 'Campania',
    'PT': 'Toscana',
    'PO': 'Toscana',
    'TP': 'Sicilia',
    'NAP': 'Campania',
    'PG': 'Umbria',
    'CH': 'Abruzzo',
    'RI': 'Lazio',
    'FR': 'Lazio',
    'TA': 'Puglia',
    'CB': 'Molise',
    'BR': 'Puglia',
    'BT': 'Puglia',
    'PZ': 'Basilicata',
    'AG': 'Sicilia',
    'CA': 'Sardegna',
    'VS': 'Sardegna'
}

# Map the dictionary above onto a new "Regione" column:
df['Regione'] = df['Provincia'].map(province_to_region)
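`Series.map` with a dictionary returns `NaN` for any key that the dictionary does not contain, so a quick check for unmapped province codes can be worthwhile. The sketch below uses a hypothetical three-entry subset of the mapping and a made-up code `"XX"`; it is an illustration, not part of the original notebook:

```python
import pandas as pd

# Hypothetical subset of the mapping and a toy frame; "XX" is a made-up code.
province_to_region = {"TO": "Piemonte", "MI": "Lombardia", "RM": "Lazio"}
toy = pd.DataFrame({"Provincia": ["TO", "MI", "XX", "RM"]})

toy["Regione"] = toy["Provincia"].map(province_to_region)

# Codes that the dictionary does not cover end up as NaN in "Regione".
unmapped = sorted(toy.loc[toy["Regione"].isna(), "Provincia"].unique())
print(unmapped)
```

An empty list means every province code was covered by the dictionary.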

### 2 - Convert the boolean column "Catasto_Fabbricati" to an Integer column

In [5]:
# Create a new column from Catasto_Fabbricati, with True = 1 and False = 0:
df['Catasto_Fabbricati_Int'] = df['Catasto_Fabbricati'].astype(int)

### 3 - Pre-processing of the target variable "Destinazione d'uso" with LabelEncoder

- To pre-process the target variable, I use **"LabelEncoder"** from the **scikit-learn** library, which maps each category of the target variable to a **sequential integer**, from 0 to N-1, where N is the number of categories.

In [6]:
# Instantiate the encoder for the target variable:
label_encoder = LabelEncoder()

# Fit the encoder on the target variable and store the result in a new column:
df['Destinazione_Uso_Encoded'] = label_encoder.fit_transform(df['Destinazione_Uso'])

In [7]:
df['Destinazione_Uso_Encoded'].value_counts()

4    132207
6     16907
5      7743
3      5804
2      2269
1       560
0       550
Name: Destinazione_Uso_Encoded, dtype: int64

In [8]:
df['Destinazione_Uso'].value_counts()

RESIDENTIAL              132207
STORAGE                   16907
RETAIL                     7743
LAND                       5804
INDUSTRIAL                 2269
HOTEL                       560
AGRICULTURAL BUILDING       550
Name: Destinazione_Uso, dtype: int64
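The two value counts above line up because `LabelEncoder` assigns codes in sorted order of the class names. A minimal sketch, fitting a fresh encoder on just the seven category names (rather than on the full data frame), shows how the code-to-label mapping can be recovered from `classes_` and how `inverse_transform` turns codes back into names:

```python
from sklearn.preprocessing import LabelEncoder

# Category names taken from the value counts above.
labels = ["RESIDENTIAL", "STORAGE", "RETAIL", "LAND",
          "INDUSTRIAL", "HOTEL", "AGRICULTURAL BUILDING"]

encoder = LabelEncoder()
encoder.fit(labels)

# classes_ is sorted, and each class's position is its integer code.
code_to_label = {code: label for code, label in enumerate(encoder.classes_)}
print(code_to_label)

# inverse_transform maps integer codes (e.g. model predictions) back to names.
decoded = encoder.inverse_transform([4, 6])
print(list(decoded))
```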

### 4 - Pre-processing of the categorical variable "Comune"

- To pre-process this column, I follow the technique described in **Niculescu-Mizil et al. (2009)**.
- I select the **ten most frequent categories** of the variable and **create ten dummy variables**, each flagging whether an asset belongs to the corresponding category.
- The result is ten extra columns, one per top-10 municipality, **instead of the hundreds or thousands of dummy columns** that a full encoding would create, which would inflate the dimensionality of the data frame and make it hard to use.

In [7]:
# Select the ten most frequent categories of the "Comune" column and store them in a list:
top_10 = [x for x in df["Comune"].value_counts().sort_values(ascending=False).head(10).index]

# User-defined function that adds one dummy column per label to the given data frame (modified in place):
def one_hot_top_x(df, variable, top_x_labels):
    for label in top_x_labels:
        df[variable + "_" + label] = np.where(df[variable] == label, 1, 0)

# Apply the function above to the chosen categorical variable, here "Comune":
one_hot_top_x(df, "Comune", top_10)
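To make the effect of the top-k encoding concrete, here is a minimal sketch on a made-up six-row frame with k = 2 (the data and the `top_2` name are illustrative only). Rows whose municipality falls outside the top k get 0 in every dummy column:

```python
import numpy as np
import pandas as pd

# Toy data: ROMA appears 3 times, BARI twice, ASTI once.
toy = pd.DataFrame({"Comune": ["ROMA", "ROMA", "BARI", "ROMA", "BARI", "ASTI"]})

# value_counts() sorts by descending frequency, so head(2) gives the top 2.
top_2 = toy["Comune"].value_counts().head(2).index.tolist()

# One 0/1 column per top label, as in one_hot_top_x above.
for label in top_2:
    toy["Comune_" + label] = np.where(toy["Comune"] == label, 1, 0)

print(toy)
```

The ASTI row ends up with zeros in both `Comune_ROMA` and `Comune_BARI`, so all infrequent municipalities share a single implicit "other" encoding.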

### 5 - Drop the columns that the predictive model does not need

In [8]:
# List the columns to drop from the data frame:
columns_to_drop = ["Destinazione_Uso",
                   "Comune",
                   "Categoria_Catastale",
                   "Descrizione_Bene",
                   "Clean_Descrizione_Bene",
                   "Words",
                   "Catasto_Fabbricati"]

# Drop the listed columns:
df = df.drop(columns=columns_to_drop)

### 6 - Pre-processing of the remaining categorical variables with One-Hot Encoder:

In [9]:
# List the categorical columns to encode with "One-Hot Encoder":
categorical_columns = ['Regione', 'Provincia', 'Tribunale']

# Build a "ColumnTransformer" that applies "One-Hot Encoder" to the categorical columns listed above:
preprocessor = ColumnTransformer(transformers=[('cat', OneHotEncoder(sparse_output=False, drop=None), categorical_columns)],
                                 remainder='passthrough')

# Apply the ColumnTransformer to the data frame (the result is a NumPy array):
df_encoded = preprocessor.fit_transform(df)

# Get the names of the columns generated by "One-Hot Encoder":
encoded_column_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_columns)

# Append the names of the passthrough (non-categorical) columns, which follow the encoded ones in the output:
all_column_names = list(encoded_column_names) + df.columns[~df.columns.isin(categorical_columns)].tolist()

# Rebuild a data frame from the array and assign the column names:
df_encoded = pd.DataFrame(df_encoded, columns=all_column_names)

# Build a dictionary that sets the dtype of every column except the three float columns to integer:
dtype_dict = {col: 'int' for col in df_encoded.columns if col not in ["Numero_Vani", "Superficie", "Rapporto_Vani_Superficie"]}

# Convert all columns except the three listed above to integers:
df_encoded = df_encoded.astype(dtype_dict)

# Convert the float columns back to their original format:
float_columns = ["Numero_Vani", "Superficie", "Rapporto_Vani_Superficie"]
df_encoded[float_columns] = df_encoded[float_columns].astype('float')

### 7 - Conclusions

In [10]:
for index, (col, dtype) in enumerate(df_encoded.dtypes.items(), 1):
    print(f'N° Colonna: {index} - Nome Colonna: {col} - Tipo di dato: {dtype}\n________________________________________________________________________________\n')

N° Colonna: 1 - Nome Colonna: Regione_Abruzzo - Tipo di dato: int32
________________________________________________________________________________

N° Colonna: 2 - Nome Colonna: Regione_Basilicata - Tipo di dato: int32
________________________________________________________________________________

N° Colonna: 3 - Nome Colonna: Regione_Calabria - Tipo di dato: int32
________________________________________________________________________________

N° Colonna: 4 - Nome Colonna: Regione_Campania - Tipo di dato: int32
________________________________________________________________________________

N° Colonna: 5 - Nome Colonna: Regione_Emilia-Romagna - Tipo di dato: int32
________________________________________________________________________________

N° Colonna: 6 - Nome Colonna: Regione_Friuli-Venezia Giulia - Tipo di dato: int32
________________________________________________________________________________

N° Colonna: 7 - Nome Colonna: Regione_Lazio - Tipo di dato: int32
________

In [11]:
# Number of null values per column of the data frame:
df_encoded.isnull().sum()

Regione_Abruzzo                                     0
Regione_Basilicata                                  0
Regione_Calabria                                    0
Regione_Campania                                    0
Regione_Emilia-Romagna                              0
Regione_Friuli-Venezia Giulia                       0
Regione_Lazio                                       0
Regione_Liguria                                     0
Regione_Lombardia                                   0
Regione_Marche                                      0
Regione_Molise                                      0
Regione_Piemonte                                    0
Regione_Puglia                                      0
Regione_Sardegna                                    0
Regione_Sicilia                                     0
Regione_Toscana                                     0
Regione_Trentino-Alto Adige                         0
Regione_Umbria                                      0
Regione_Valle d'Aosta       

In [12]:
# Shape of the resulting data frame:
df_encoded.shape

(166040, 333)

In [13]:
# First 10 rows of the data frame:
df_encoded.head(10)

Unnamed: 0,Regione_Abruzzo,Regione_Basilicata,Regione_Calabria,Regione_Campania,Regione_Emilia-Romagna,Regione_Friuli-Venezia Giulia,Regione_Lazio,Regione_Liguria,Regione_Lombardia,Regione_Marche,Regione_Molise,Regione_Piemonte,Regione_Puglia,Regione_Sardegna,Regione_Sicilia,Regione_Toscana,Regione_Trentino-Alto Adige,Regione_Umbria,Regione_Valle d'Aosta,Regione_Veneto,Provincia_AG,Provincia_AL,Provincia_AN,Provincia_AO,Provincia_AP,Provincia_AQ,Provincia_AR,Provincia_AT,Provincia_AV,Provincia_BA,Provincia_BG,Provincia_BI,Provincia_BL,Provincia_BN,Provincia_BO,Provincia_BR,Provincia_BS,Provincia_BT,Provincia_BZ,Provincia_CA,Provincia_CB,Provincia_CE,Provincia_CH,Provincia_CL,Provincia_CN,Provincia_CO,Provincia_CR,Provincia_CS,Provincia_CT,Provincia_CZ,Provincia_EN,Provincia_FC,Provincia_FE,Provincia_FG,Provincia_FI,Provincia_FM,Provincia_FR,Provincia_GE,Provincia_GO,Provincia_GR,Provincia_IM,Provincia_IS,Provincia_KR,Provincia_LC,Provincia_LE,Provincia_LI,Provincia_LO,Provincia_LT,Provincia_LU,Provincia_MB,Provincia_MC,Provincia_ME,Provincia_MI,Provincia_MN,Provincia_MO,Provincia_MS,Provincia_MT,Provincia_NAP,Provincia_NO,Provincia_NU,Provincia_OG,Provincia_OR,Provincia_OT,Provincia_PA,Provincia_PC,Provincia_PD,Provincia_PE,Provincia_PG,Provincia_PI,Provincia_PN,Provincia_PO,Provincia_PR,Provincia_PT,Provincia_PU,Provincia_PV,Provincia_PZ,Provincia_RA,Provincia_RC,Provincia_RE,Provincia_RG,Provincia_RI,Provincia_RM,Provincia_RN,Provincia_RO,Provincia_SA,Provincia_SI,Provincia_SO,Provincia_SP,Provincia_SR,Provincia_SS,Provincia_SU,Provincia_SV,Provincia_TA,Provincia_TE,Provincia_TO,Provincia_TP,Provincia_TR,Provincia_TS,Provincia_TV,Provincia_UD,Provincia_VA,Provincia_VB,Provincia_VC,Provincia_VE,Provincia_VI,Provincia_VR,Provincia_VS,Provincia_VT,Provincia_VV,Tribunale_AGRIGENTO,Tribunale_ALESSANDRIA,Tribunale_ALESSANDRIA (EX ACQUI TERME),Tribunale_ALESSANDRIA (EX TORTONA),Tribunale_ANCONA,Tribunale_AOSTA,Tribunale_AREZZO,Tribunale_ASCOLI 
PICENO,Tribunale_ASTI,Tribunale_ASTI (EX ALBA),Tribunale_AVELLINO,Tribunale_AVELLINO (EX SANT'ANGELO DEI LOMBARDI),Tribunale_AVEZZANO,Tribunale_BARCELLONA POZZO DI GOTTO,Tribunale_BARI,Tribunale_BARI (EX BITONTO),Tribunale_BELLUNO,Tribunale_BENEVENTO,Tribunale_BENEVENTO (EX ARIANO IRPINO),Tribunale_BERGAMO,Tribunale_BIELLA,Tribunale_BOLOGNA,Tribunale_BOLZANO - BOZEN,Tribunale_BRESCIA,Tribunale_BRINDISI,Tribunale_BUSTO ARSIZIO,Tribunale_CAGLIARI,Tribunale_CALTAGIRONE,Tribunale_CALTANISSETTA,Tribunale_CAMPOBASSO,Tribunale_CASSINO,Tribunale_CASTROVILLARI,Tribunale_CASTROVILLARI (EX ROSSANO),Tribunale_CATANIA,Tribunale_CATANIA (EX ACIREALE),Tribunale_CATANIA (EX ADRANO),Tribunale_CATANIA (EX BELPASSO),Tribunale_CATANIA (EX BRONTE),Tribunale_CATANIA (EX GIARRE),Tribunale_CATANIA (EX MASCALUCIA),Tribunale_CATANIA (EX PATERNO),Tribunale_CATANZARO,Tribunale_CHIETI,Tribunale_CIVITAVECCHIA,Tribunale_CIVITAVECCHIA (EX BRACCIANO),Tribunale_COMO,Tribunale_COSENZA,Tribunale_CREMONA,Tribunale_CREMONA (EX CREMA),Tribunale_CROTONE,Tribunale_CUNEO,Tribunale_CUNEO (EX MONDOVI),Tribunale_CUNEO (EX SALUZZO),Tribunale_ENNA,Tribunale_ENNA (EX NICOSIA),Tribunale_FERMO,Tribunale_FERRARA,Tribunale_FIRENZE,Tribunale_FOGGIA,Tribunale_FOGGIA (EX LUCERA (EX RODI GARGANICO),Tribunale_FOGGIA (EX LUCERA),Tribunale_FORLI,Tribunale_FROSINONE,Tribunale_GELA,Tribunale_GENOVA,Tribunale_GENOVA (EX CHIAVARI),Tribunale_GORIZIA,Tribunale_GROSSETO,Tribunale_IMPERIA,Tribunale_IMPERIA (EX SAN REMO),Tribunale_ISERNIA,Tribunale_IVREA,Tribunale_L'AQUILA,Tribunale_LA SPEZIA,Tribunale_LAGONEGRO,Tribunale_LAGONEGRO (EX SALA CONSILINA),Tribunale_LAMEZIA TERME,Tribunale_LANCIANO,Tribunale_LANUSEI,Tribunale_LARINO,Tribunale_LARINO (EX TERMOLI),Tribunale_LATINA,Tribunale_LATINA (EX GAETA),Tribunale_LATINA (EX TERRACINA),Tribunale_LECCE,Tribunale_LECCO,Tribunale_LIVORNO,Tribunale_LOCRI,Tribunale_LODI,Tribunale_LUCCA,Tribunale_MACERATA,Tribunale_MACERATA (EX 
CAMERINO),Tribunale_MANTOVA,Tribunale_MARSALA,Tribunale_MASSA,Tribunale_MATERA,Tribunale_MESSINA,Tribunale_MILANO,Tribunale_MODENA,Tribunale_MONZA,Tribunale_NAPOLI,Tribunale_NAPOLI (EX CASORIA),Tribunale_NAPOLI NORD,Tribunale_NOCERA INFERIORE,Tribunale_NOLA,Tribunale_NOVARA,Tribunale_NUORO,Tribunale_ORISTANO,Tribunale_PADOVA,Tribunale_PALERMO,Tribunale_PALMI,Tribunale_PAOLA,Tribunale_PARMA,Tribunale_PATTI,Tribunale_PATTI (EX MISTRETTA),Tribunale_PAVIA,Tribunale_PAVIA (EX VIGEVANO),Tribunale_PAVIA (EX VOGHERA),Tribunale_PERUGIA,Tribunale_PESARO,Tribunale_PESARO (EX FANO),Tribunale_PESCARA,Tribunale_PIACENZA,Tribunale_PISA,Tribunale_PISTOIA,Tribunale_PORDENONE,Tribunale_POTENZA,Tribunale_POTENZA (EX MELFI),Tribunale_PRATO,Tribunale_RAGUSA,Tribunale_RAGUSA (EX MODICA),Tribunale_RAVENNA,Tribunale_REGGIO DI CALABRIA,Tribunale_REGGIO NELL'EMILIA,Tribunale_RIETI,Tribunale_RIMINI,Tribunale_ROMA,Tribunale_ROMA (EX OSTIA),Tribunale_ROVERETO,Tribunale_ROVIGO,Tribunale_SALERNO,Tribunale_SALERNO (EX EBOLI),Tribunale_SALERNO (EX MERCATO SAN SEVERINO),Tribunale_SANTA MARIA CAPUA VETERE,Tribunale_SANTA MARIA CAPUA VETERE (EX AVERSA),Tribunale_SASSARI,Tribunale_SAVONA,Tribunale_SCIACCA,Tribunale_SIENA,Tribunale_SIENA (EX MONTEPULCIANO),Tribunale_SIRACUSA,Tribunale_SONDRIO,Tribunale_SPOLETO,Tribunale_SULMONA,Tribunale_TARANTO,Tribunale_TEMPIO PAUSANIA,Tribunale_TEMPIO PAUSANIA (EX OLBIA),Tribunale_TERAMO,Tribunale_TERMINI IMERESE,Tribunale_TERNI,Tribunale_TERNI (EX ORVIETO),Tribunale_TIVOLI,Tribunale_TIVOLI (EX CASTELNUOVO DI PORTO),Tribunale_TORINO,Tribunale_TORINO (EX PINEROLO),Tribunale_TORRE ANNUNZIATA,Tribunale_TRANI,Tribunale_TRAPANI,Tribunale_TREVISO,Tribunale_TRIESTE,Tribunale_UDINE,Tribunale_UDINE (EX TOLMEZZO),Tribunale_URBINO,Tribunale_VALLO DELLA LUCANIA,Tribunale_VARESE,Tribunale_VASTO,Tribunale_VELLETRI,Tribunale_VENEZIA,Tribunale_VERBANIA,Tribunale_VERCELLI,Tribunale_VERCELLI (EX CASALE MONTEFERRATO),Tribunale_VERCELLI (EX VARALLO),Tribunale_VERONA,Tribunale_VIBO 
VALENTIA,Tribunale_VICENZA,Tribunale_VICENZA (EX BASSANO DEL GRAPPA),Tribunale_VITERBO,Numero_Lotto,Numero_Vani,Superficie,N_Caratteri,Rapporto_Vani_Superficie,Catasto_Fabbricati_Int,Destinazione_Uso_Encoded,Comune_ROMA,Comune_PALERMO,Comune_NAPOLI,Comune_BARI,Comune_CATANIA,Comune_TARANTO,Comune_ALESSANDRIA,Comune_TORINO,Comune_SASSARI,Comune_MESSINA
0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3.5,69.0,182,0.050725,1,4,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.0,15.0,101,0.066667,1,4,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,4.0,80.0,223,0.05,1,4,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.0,21.0,223,0.047619,1,4,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,4.0,80.0,103,0.05,1,4,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,5.0,100.0,103,0.05,1,4,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3.5,73.0,103,0.047945,1,6,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0.5,12.0,71,0.041667,1,4,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3.5,70.0,180,0.05,1,4,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,6.5,130.0,98,0.05,1,4,0,0,0,0,0,0,0,0,0,0


In [14]:
# Count the integer and float columns in the data frame:
integer_columns = df_encoded.select_dtypes(include=['int']).columns
num_integer_columns = len(integer_columns)
print("Numero totale di colonne in formato Integer:", num_integer_columns)

float_columns = df_encoded.select_dtypes(include=['float64']).columns
num_float_columns = len(float_columns)
print("Numero totale di colonne in formato float:", num_float_columns)

Numero totale di colonne in formato Integer: 330
Numero totale di colonne in formato float: 3


- For each of the columns **"Regione", "Tribunale" and "Provincia"**, one **dummy variable** was created per category it contained.
    - Specifically, **187** for the **"Tribunale"** column, **109** for the **"Provincia"** column, and **20** for the **"Regione"** column;
    - These add to the **existing numeric columns** (**7** in total);
    - And to the dummies for the **top 10 municipalities**.


- There are no **null values** in any column of the data frame;


- There are no **object columns** left in the data frame after applying **"One-Hot Encoder"**;


- The data frame therefore has **166,040 rows** and **333 columns**;


- In total there are **330 integer columns** and **3 float columns** in the data frame;


- After the dummy variables were created, the **object columns were removed** from the data frame, because the **Logistic Regression** model requires numeric variables only;


- The only originally **boolean** variable was also removed after being converted to a 0/1 variable.
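The column totals quoted above can be cross-checked with simple arithmetic. The variable names below are illustrative; the figures are the ones stated in the conclusions:

```python
# Figures quoted in the conclusions above.
dummy_counts = {"Tribunale": 187, "Provincia": 109, "Regione": 20}
numeric_columns = 7
top_comune_dummies = 10
float_column_count = 3

total_columns = sum(dummy_counts.values()) + numeric_columns + top_comune_dummies
integer_column_count = total_columns - float_column_count

print(total_columns)         # 333
print(integer_column_count)  # 330
```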

### 8 - Export the data frame to CSV:

In [15]:
# to_csv writes the file and returns None when given a path, so the result is not assigned:
df_encoded.to_csv("D:\\Master_Cefriel_DS_AI_Monolo\\0_Project_Work\\Dataset\\9_Dataset_Logistic_Regression\\Dataset_Logistic_Regression.csv",
                  index=False)
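A quick round-trip check can confirm that an export like this reloads with the same shape and dtypes. The sketch below uses a made-up two-column frame and a temporary directory instead of the project path; the file name `toy_export.csv` is hypothetical:

```python
import os
import tempfile
import pandas as pd

# Toy frame with one integer and one float column.
toy = pd.DataFrame({"Numero_Lotto": [1, 2], "Superficie": [69.0, 15.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_export.csv")
    # With a path argument, to_csv writes the file and returns None.
    result = toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.dtypes)
```

With `index=False`, the reloaded frame has no extra `Unnamed: 0` column.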

### 9 - References

- *Niculescu-Mizil, A., Perlich, C., Swirszcz, G., Sindhwani, V., Liu, Y., Melville, P., Wang, D., Xiao, J., Hu, J., Singh, M., Shang, W., & Zhu, Y. (2009). Winning the KDD Cup Orange Challenge with ensemble selection. Knowledge Discovery and Data Mining, 23–34*. http://people.cs.uchicago.edu/~vikass/KDDCup-jmlr09.pdf