# Categorical variable standardization · Car Sales

**Data source:** `data/processed/database_cleaned_2.csv` (cleaned dataset, with outliers already handled).  
**Notebook goals:**  
- Standardize the categorical variables (Company, Dealer_Name, Model, Transmission, Gender, Customer Name)  
- Create and maintain mapping files (`mappings/*_mapping.csv`) that unify inconsistent or duplicate values  
- Generate the detail file `database_for_tableau_city_state.csv` to support geocoding in Tableau  

**Business case note:**  
The main metric we will use in Tableau is the **income-to-price ratio**, which measures how affordable the car is: higher values mean cars that are more affordable relative to average income.
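
As a quick illustration of the metric, the ratio can be sketched on a couple of rows. The column names `Annual Income` and `Price ($)` come from the dataset; the values below are illustrative only, and the name `income_to_price` is a hypothetical choice for this sketch.

```python
import pandas as pd

# Illustrative values only, not taken from the real file
toy = pd.DataFrame({
    "Annual Income": [13500, 1480000],
    "Price ($)": [26000, 19000],
})

# income-to-price ratio: higher values = the car is more affordable for that income
toy["income_to_price"] = toy["Annual Income"] / toy["Price ($)"]
print(toy)
```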

In [21]:
# === ENVIRONMENT AND PATH SETUP ===
# Purpose (for the team):
# - Centralize paths so everyone works consistently (notebook root vs repo root)
# - ALWAYS use relative paths → compatible with GitHub / other environments
# - The source is database_cleaned_2.csv (cleaned, outliers handled)

import pandas as pd
import numpy as np
from pathlib import Path

# locate the repo root (works whether the notebook is in / or in /notebook)
def get_repo_root():
    here = Path.cwd().resolve()
    for base in (here, here.parent, here.parent.parent):
        if (base / "data").exists():
            return base
    return here

REPO_ROOT = get_repo_root()
DATA_DIR  = REPO_ROOT / "data"
PROC_DIR  = DATA_DIR / "processed"
MAPPINGS_DIR = Path("mappings")

# canonical files
CLEANED_PATH = PROC_DIR / "database_cleaned_2.csv"
OUT_DETAIL   = PROC_DIR / "database_for_tableau_city_state.csv"

print("Repo root:", REPO_ROOT)
print("Processed dir:", PROC_DIR)
print("CLEANED_PATH exists?", CLEANED_PATH.exists())

# load the cleaned dataset
if not CLEANED_PATH.exists():
    raise FileNotFoundError(f"{CLEANED_PATH} not found. Generate database_cleaned_2.csv first.")

cols0 = pd.read_csv(CLEANED_PATH, nrows=0).columns.tolist()
parse_arg = ['Date'] if 'Date' in cols0 else None
df = pd.read_csv(CLEANED_PATH, parse_dates=parse_arg, low_memory=False)

print("Caricato df shape:", df.shape)
display(df.head(5))

Repo root: /Users/serenatempesta/Documents/Progetti/Data_Analysis/progetto_finale
Processed dir: /Users/serenatempesta/Documents/Progetti/Data_Analysis/progetto_finale/data/processed
CLEANED_PATH exists? True
Caricato df shape: (23906, 25)


Unnamed: 0,Date,Customer Name,Gender,Annual Income,Dealer_Name,Company,Model,Engine,Transmission,Color,...,_income_clean,Company_mapped,Dealer_Name_mapped,Model_mapped,Transmission_mapped,Gender_mapped,Customer Name_mapped,Dealer_Region_mapped,_price_clean_w,_income_clean_w
0,2022-01-02,Geraldine,Male,13500,Buddy Storbeck's Diesel Service Inc,Ford,Expedition,DoubleÂ Overhead Camshaft,Auto,Black,...,13500,ford,buddy storbeck's diesel service inc,expedition,auto,male,geraldine,middletown,26000.0,13500
1,2022-01-02,Gia,Male,1480000,C & M Motors Inc,Dodge,Durango,DoubleÂ Overhead Camshaft,Auto,Black,...,1480000,dodge,c & m motors inc,durango,auto,male,gia,aurora,19000.0,1480000
2,2022-01-02,Gianna,Male,1035000,Capitol KIA,Cadillac,Eldorado,Overhead Camshaft,Manual,Red,...,1035000,cadillac,capitol kia,eldorado,manual,male,gianna,greenville,31500.0,1035000
3,2022-01-02,Giselle,Male,13500,Chrysler of Tri-Cities,Toyota,Celica,Overhead Camshaft,Manual,Pale White,...,13500,toyota,chrysler of tri-cities,celica,manual,male,giselle,pasco,14000.0,13500
4,2022-01-02,Grace,Male,1465000,Chrysler Plymouth,Acura,TL,DoubleÂ Overhead Camshaft,Auto,Red,...,1465000,acura,chrysler plymouth,tl,auto,male,grace,janesville,24500.0,1465000


## Mapping template generation
We show the top values for selected categorical columns and generate the `*_mapping_template.csv` files in `mappings/`.
Colleagues can then use these templates to define the canonical versions (lowercase, no extra whitespace).

In [22]:
MAPPINGS_DIR.mkdir(exist_ok=True)

cat_cols = ['Company', 'Dealer_Name', 'Model', 'Transmission', 'Gender', 'Customer Name']

# Top values
for c in cat_cols:
    if c in df.columns:
        print(f"\n--- {c} (top 20) ---")
        display(df[c].astype(str).str.strip().str.lower().value_counts().head(20))
    else:
        print(f"\nColonna non presente: {c}")

# Template file (raw → empty canonical)
for col in cat_cols:
    if col in df.columns:
        vals = pd.Series(df[col].astype(str).str.strip().str.lower().unique())
        mapping_df = pd.DataFrame({'raw': vals, 'canonical': [''] * len(vals)})
        mapping_path = MAPPINGS_DIR / f"{col}_mapping_template.csv"
        mapping_df.sort_values('raw').to_csv(mapping_path, index=False)
        print("Template generato:", mapping_path)


--- Company (top 20) ---


Company
chevrolet     1819
dodge         1671
ford          1614
volkswagen    1333
mercedes-b    1285
mitsubishi    1277
chrysler      1120
oldsmobile    1111
toyota        1110
nissan         886
mercury        874
lexus          802
pontiac        796
bmw            790
volvo          789
honda          708
acura          689
cadillac       652
plymouth       617
saturn         586
Name: count, dtype: int64


--- Dealer_Name (top 20) ---


Dealer_Name
progressive shippers cooperative association no    1318
rabun used car sales                               1313
race car help                                      1253
saab-belle dodge                                   1251
star enterprises inc                               1249
tri-state mack inc                                 1249
ryder truck rental and leasing                     1248
u-haul co                                          1247
scrivener performance engineering                  1246
suburban ford                                      1243
nebo chevrolet                                      633
pars auto sales                                     630
new castle ford lincoln mercury                     629
mckinney dodge chrysler jeep                        629
hatfield volkswagen                                 629
gartner buick hyundai saab                          628
pitre buick-pontiac-gmc of scottsdale               628
capitol kia                         


--- Model (top 20) ---


Model
diamante         418
silhouette       411
prizm            411
passat           391
ram pickup       383
jetta            382
rl               372
ls400            354
lhs              330
a6               329
528i             324
3000gt           303
montero sport    302
s40              282
tl               269
pathfinder       267
durango          262
grand marquis    261
323i             260
metro            258
Name: count, dtype: int64


--- Transmission (top 20) ---


Transmission
auto      12571
manual    11335
Name: count, dtype: int64


--- Gender (top 20) ---


Gender
male      18798
female     5108
Name: count, dtype: int64


--- Customer Name (top 20) ---


Customer Name
thomas      92
emma        90
lucas       88
nathan      80
louis       76
lea         75
chloe       74
paul        71
theo        65
sarah       65
hugo        64
leo         63
alexis      62
dylan       61
victor      60
camille     59
benjamin    57
antoine     56
samuel      56
julie       54
Name: count, dtype: int64

Template generato: mappings/Company_mapping_template.csv
Template generato: mappings/Dealer_Name_mapping_template.csv
Template generato: mappings/Model_mapping_template.csv
Template generato: mappings/Transmission_mapping_template.csv
Template generato: mappings/Gender_mapping_template.csv
Template generato: mappings/Customer Name_mapping_template.csv


In [23]:
# === INITIALIZE THE MAPPINGS (creates *_mapping.csv if missing) ===
# Purpose (team):
# - Ensure the final files `mappings/<col>_mapping.csv` (raw, canonical) exist.
# - If the final file EXISTS → leave it untouched (safe).
# - If it does NOT exist → use the template if present; otherwise generate it from the df values.

from pathlib import Path
import pandas as pd

if 'df' not in globals():
    raise RuntimeError("DataFrame `df` not found. Run the loading cell first.")

MAPPINGS_DIR = Path("mappings")
MAPPINGS_DIR.mkdir(exist_ok=True)
cat_cols = ['Company', 'Dealer_Name', 'Model', 'Transmission', 'Gender', 'Customer Name']

created, skipped = [], []

def _norm(s):
    return s.astype(str).str.strip().str.lower()

for col in cat_cols:
    template_path = MAPPINGS_DIR / f"{col}_mapping_template.csv"
    final_path    = MAPPINGS_DIR / f"{col}_mapping.csv"

    if final_path.exists():
        skipped.append(final_path.name)
        continue

    if template_path.exists():
        df_map = pd.read_csv(template_path, dtype=str).fillna('')
        if not {'raw','canonical'}.issubset(df_map.columns):
            vals = pd.Series(sorted(_norm(df[col]).unique())) if col in df.columns else pd.Series([], dtype=str)
            df_map = pd.DataFrame({'raw': vals, 'canonical': vals})
        else:
            df_map['raw']       = _norm(df_map['raw'])
            df_map['canonical'] = _norm(df_map['canonical'])
            df_map.loc[df_map['canonical'] == '', 'canonical'] = df_map.loc[df_map['canonical'] == '', 'raw']
    elif col in df.columns:
        vals  = pd.Series(sorted(_norm(df[col]).unique()))
        df_map = pd.DataFrame({'raw': vals, 'canonical': vals})
    else:
        continue

    df_map = df_map[['raw','canonical']].drop_duplicates()
    df_map.to_csv(final_path, index=False)
    created.append(final_path.name)

print("📄 Mapping creati   :", created if created else "none")
print("⏭️  Mapping saltati  :", skipped if skipped else "none")
print("📁 Cartella mapping :", MAPPINGS_DIR.resolve())

📄 Mapping creati   : none
⏭️  Mapping saltati  : ['Company_mapping.csv', 'Dealer_Name_mapping.csv', 'Model_mapping.csv', 'Transmission_mapping.csv', 'Gender_mapping.csv', 'Customer Name_mapping.csv']
📁 Cartella mapping : /Users/serenatempesta/Documents/Progetti/Data_Analysis/progetto_finale/notebook/mappings


In [24]:
# === MAPPING INSPECTION ===
# Goal: quickly display the final `*_mapping.csv` files for a prioritized review.
from pathlib import Path
import pandas as pd

MAPPINGS_DIR = Path("mappings")
cat_cols = ['Company', 'Dealer_Name', 'Model', 'Transmission', 'Gender', 'Customer Name']

for col in cat_cols:
    path = MAPPINGS_DIR / f"{col}_mapping.csv"
    if path.exists():
        m = pd.read_csv(path, dtype=str).fillna('')
        m['raw'] = m['raw'].astype(str).str.strip().str.lower()
        m['canonical'] = m['canonical'].astype(str).str.strip().str.lower()
        print(f"\n-- {col} mapping -- {path.name}")
        display(m.head(10))
        print(f"Total unique raw values: {len(m)}")
    else:
        print(f"\n{col}: mapping file non trovato: {path.name}")


-- Company mapping -- Company_mapping.csv


Unnamed: 0,raw,canonical
0,acura,acura
1,audi,audi
2,bmw,bmw
3,buick,buick
4,cadillac,cadillac
5,chevrolet,chevrolet
6,chrysler,chrysler
7,dodge,dodge
8,ford,ford
9,honda,honda


Total unique raw values: 30

-- Dealer_Name mapping -- Dealer_Name_mapping.csv


Unnamed: 0,raw,canonical
0,buddy storbeck's diesel service inc,buddy storbeck's diesel service inc
1,c & m motors inc,c & m motors inc
2,capitol kia,capitol kia
3,chrysler of tri-cities,chrysler of tri-cities
4,chrysler plymouth,chrysler plymouth
5,classic chevy,classic chevy
6,clay johnson auto sales,clay johnson auto sales
7,diehl motor co inc,diehl motor co inc
8,enterprise rent a car,enterprise rent a car
9,gartner buick hyundai saab,gartner buick hyundai saab


Total unique raw values: 28

-- Model mapping -- Model_mapping.csv


Unnamed: 0,raw,canonical
0,3-sep,3-sep
1,3000gt,3000gt
2,300m,300m
3,323i,323i
4,328i,328i
5,4runner,4runner
6,5-sep,5-sep
7,528i,528i
8,a4,a4
9,a6,a6


Total unique raw values: 154

-- Transmission mapping -- Transmission_mapping.csv


Unnamed: 0,raw,canonical
0,auto,auto
1,manual,manual


Total unique raw values: 2

-- Gender mapping -- Gender_mapping.csv


Unnamed: 0,raw,canonical
0,female,female
1,male,male


Total unique raw values: 2

-- Customer Name mapping -- Customer Name_mapping.csv


Unnamed: 0,raw,canonical
0,aahil,aahil
1,aaliyah,aaliyah
2,aarav,aarav
3,aaron,aaron
4,aarya,aarya
5,aayan,aayan
6,abby,abby
7,abdal,abdal
8,abdelatif,abdelatif
9,abderramine,abderramine


Total unique raw values: 3022
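
The `Model` mapping above contains the values `3-sep` and `5-sep`, which look like date-formatted strings rather than model names. A plausible explanation, stated here only as an assumption, is that a spreadsheet converted model names such as `9-3` and `9-5` into dates before the CSV was exported. Before editing the mapping, it is worth checking which brands these rows belong to. The sketch below runs on a small hypothetical frame; on the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical rows for illustration only; on the real data use df instead of toy
toy = pd.DataFrame({
    "Company": ["Saab", "Saab", "Audi", "BMW"],
    "Model": ["3-Sep", "5-Sep", "A6", "528i"],
})

# Flag model values shaped like "<day>-<month abbreviation>", e.g. "3-sep"
model_norm = toy["Model"].astype(str).str.strip().str.lower()
date_like = model_norm.str.fullmatch(r"\d{1,2}-[a-z]{3}")

# List the brands behind each suspicious value so a reviewer can decide the canonical name
suspects = toy.loc[date_like, ["Company", "Model"]].drop_duplicates()
print(suspects)
```

If the check confirms the assumption, the fix belongs in `Model_mapping.csv` as a `raw` → `canonical` correction, in the same way as the other mapping files.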


## Advanced inspection: finding candidates for canonicalization

Goal: identify entries that probably need to be unified (abbreviations, punctuation, short variants, obvious typos).  
Run the code cell to get:
- Company value counts, sorted;
- any Company values that contain special characters, look like abbreviations, or are very short;
- suspicious top models (e.g. near-duplicates).
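
One way to surface likely typos or near-duplicates, beyond symbol and length checks, is a string-similarity pass. The sketch below uses `difflib.get_close_matches` from the standard library on a hypothetical list of values; the helper name `near_duplicates` and the 0.85 cutoff are assumptions to tune, not part of this notebook.

```python
from difflib import get_close_matches

def near_duplicates(values, cutoff=0.85):
    """Return pairs of distinct values whose similarity ratio is >= cutoff."""
    values = sorted(set(values))
    pairs = []
    for i, v in enumerate(values):
        # compare only against later values so each pair is reported once
        for match in get_close_matches(v, values[i + 1:], n=5, cutoff=cutoff):
            pairs.append((v, match))
    return pairs

# Hypothetical normalized values, for illustration only
sample_values = ["mercedes-b", "mercedes-benz", "volkswagen", "volkswagon", "bmw", "ford"]
print(near_duplicates(sample_values))
```

The pairs are only candidates for manual review: a high similarity score does not prove that two values refer to the same brand or model.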

In [25]:
# Advanced inspection for Company and Model
import pandas as pd

# company counts
company_counts = df['Company'].astype(str).str.strip().str.lower().value_counts()
print("Company - top (count):")
display(company_counts.head(50))

# candidates with symbols or short codes
candidates_symbols = [v for v in company_counts.index if any(ch in v for ch in ".-/&()'") or len(v) <= 3]
print("\nCompany - candidate contenenti simboli o short (esempi):")
display(candidates_symbols[:50])

# very rare values (f < 0.1%) - to evaluate a possible "other" bucket
freq = company_counts / len(df)
rare_companies = freq[freq < 0.001]  # 0.1% threshold (adjust for your analysis)
print("\nCompany molto rare (f < 0.1%):")
display(rare_companies)

# for Model: simple similarity check (possible duplicates after trimming)
model_counts = df['Model'].astype(str).str.strip().str.lower().value_counts()
print("\nModel - top (count):")
display(model_counts.head(50))

# show examples of similar models for a manual check (e.g. same prefix)
print("\nEsempi modelli che contengono 'passat' o 'jetta' per verificarne varianti:")
display(model_counts[model_counts.index.str.contains('passat')].head(50))
display(model_counts[model_counts.index.str.contains('jetta')].head(50))

Company - top (count):


Company
chevrolet     1819
dodge         1671
ford          1614
volkswagen    1333
mercedes-b    1285
mitsubishi    1277
chrysler      1120
oldsmobile    1111
toyota        1110
nissan         886
mercury        874
lexus          802
pontiac        796
bmw            790
volvo          789
honda          708
acura          689
cadillac       652
plymouth       617
saturn         586
lincoln        492
audi           468
buick          439
subaru         405
jeep           363
porsche        361
hyundai        264
saab           210
infiniti       195
jaguar         180
Name: count, dtype: int64


Company - candidate contenenti simboli o short (esempi):


['mercedes-b', 'bmw']


Company molto rare (f < 0.1%):


Series([], Name: count, dtype: float64)


Model - top (count):


Model
diamante         418
silhouette       411
prizm            411
passat           391
ram pickup       383
jetta            382
rl               372
ls400            354
lhs              330
a6               329
528i             324
3000gt           303
montero sport    302
s40              282
tl               269
pathfinder       267
durango          262
grand marquis    261
323i             260
metro            258
forester         255
corvette         245
accord           243
300m             243
sunfire          241
viper            240
s-class          238
malibu           237
concorde         237
eldorado         232
sebring coupe    230
explorer         225
expedition       215
slk              212
c70              210
continental      206
328i             206
bravada          205
neon             205
park avenue      202
i30              195
frontier         195
cutlass          194
bonneville       185
voyager          181
s-type           180
tacoma           179
celica 


Esempi modelli che contengono 'passat' o 'jetta' per verificarne varianti:


Model
passat    391
Name: count, dtype: int64

Model
jetta    382
Name: count, dtype: int64

## Recommended corrections for `Company`

Goal: fix the few obvious variants so that brands are unified.  
Immediate suggestion: map `mercedes-b` → `mercedes-benz`.  
Note: `bmw` is already a valid canonical value (leave it unchanged).

In [26]:
# === TARGETED CORRECTIONS ON COMPANY (with backup) ===
from pathlib import Path
import pandas as pd
from shutil import copyfile

MAPPINGS_DIR = Path("mappings")
company_map_path = MAPPINGS_DIR / "Company_mapping.csv"
backup_path = MAPPINGS_DIR / "Company_mapping_backup.csv"

if not company_map_path.exists():
    raise FileNotFoundError(f"{company_map_path} not found. Run the mapping initialization cell first.")

# 1) backup
if not backup_path.exists():
    copyfile(company_map_path, backup_path)
    print("Backup creato:", backup_path.name)
else:
    print("Backup già esistente:", backup_path.name)

# 2) proposed corrections
corrections = {
    "mercedes-b": "mercedes-benz",
}

# 3) apply the corrections
df_map = pd.read_csv(company_map_path, dtype=str).fillna('')
df_map['raw'] = df_map['raw'].astype(str).str.strip().str.lower()
df_map['canonical'] = df_map['canonical'].astype(str).str.strip().str.lower()

applied = []
for raw_val, new_can in corrections.items():
    mask = df_map['raw'] == raw_val
    if mask.any():
        df_map.loc[mask, 'canonical'] = new_can
        applied.append(raw_val)

df_map.to_csv(company_map_path, index=False)
print("Correzioni applicate per le raw:", applied)
display(df_map[df_map['raw'].isin(applied)].head(20))

Backup già esistente: Company_mapping_backup.csv
Correzioni applicate per le raw: ['mercedes-b']


Unnamed: 0,raw,canonical
16,mercedes-b,mercedes-benz


## Note: Company mapping update

I created a backup (`mappings/Company_mapping_backup.csv`) and applied the targeted correction:
- `mercedes-b` → `mercedes-benz`

This change has been saved to `mappings/Company_mapping.csv`.
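
To review exactly which rows differ from the backup, the two mapping files can be compared on `raw`. The sketch below uses small in-memory frames standing in for `Company_mapping_backup.csv` and `Company_mapping.csv`; the rows and variable names are illustrative.

```python
import pandas as pd

# Stand-ins for the backup and the current mapping (illustrative rows only)
backup = pd.DataFrame({"raw": ["bmw", "mercedes-b"], "canonical": ["bmw", "mercedes-b"]})
current = pd.DataFrame({"raw": ["bmw", "mercedes-b"], "canonical": ["bmw", "mercedes-benz"]})

# Align the two versions on the raw value and keep rows whose canonical changed
diff = backup.merge(current, on="raw", how="outer", suffixes=("_backup", "_current"))
changed = diff[diff["canonical_backup"] != diff["canonical_current"]]
print(changed)
```

With the real files, each frame would first be loaded with `pd.read_csv(path, dtype=str)`.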

In [27]:
# === APPLY MAPPINGS TO THE DATAFRAME ===
from pathlib import Path
import pandas as pd

MAPPINGS_DIR = Path("mappings")
cat_cols = ['Company', 'Dealer_Name', 'Model', 'Transmission', 'Gender', 'Customer Name']

def load_mapping_dict(path: Path):
    m = pd.read_csv(path, dtype=str).fillna('')
    m['raw'] = m['raw'].astype(str).str.strip().str.lower()
    m['canonical'] = m['canonical'].astype(str).str.strip().str.lower()
    return {r: c for r, c in zip(m['raw'], m['canonical']) if c != ''}

for c in cat_cols:
    if c in df.columns:
        mapping_file = MAPPINGS_DIR / f"{c}_mapping.csv"
        canonical_col = c.lower().replace(' ', '_') + '_canonical'
        if mapping_file.exists():
            mapping_dict = load_mapping_dict(mapping_file)
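            # astype(str) turns NaN into the string 'nan' (so pd.notna is always True here);
            # map the normalized values first, then restore the original missing values
            normalized = df[c].astype(str).str.strip().str.lower()
            df[canonical_col] = normalized.map(lambda x: mapping_dict.get(x, x)).where(df[c].notna())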
            print("Mapping applicato:", c)
        else:
            df[canonical_col] = df[c].astype(str).str.strip().str.lower().replace({'nan': None})
            print("Fallback (no mapping):", c)

print("\nColonne create/aggiornate:")
print([c for c in df.columns if c.endswith('_canonical')])

Mapping applicato: Company
Mapping applicato: Dealer_Name
Mapping applicato: Model
Mapping applicato: Transmission
Mapping applicato: Gender
Mapping applicato: Customer Name

Colonne create/aggiornate:
['company_canonical', 'dealer_name_canonical', 'model_canonical', 'transmission_canonical', 'gender_canonical', 'customer_name_canonical']


## Post-mapping QA and sample export

Run the check cell to compare distinct counts before and after mapping, view the top canonical values, and save a sample for internal sharing or presentations.
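
A further check that may be useful here is mapping coverage: whether every normalized raw value in the data has an entry in the mapping file. The sketch below uses hypothetical in-memory data; the helper name `unmapped_values` is an assumption, not part of this notebook.

```python
import pandas as pd

def unmapped_values(series, mapping_df):
    """Return normalized values of `series` that have no `raw` entry in `mapping_df`."""
    normalized = series.dropna().astype(str).str.strip().str.lower()
    known = set(mapping_df["raw"].astype(str).str.strip().str.lower())
    return sorted(set(normalized) - known)

# Illustrative data only
toy_col = pd.Series(["Ford", " ford ", "Mercedes-B", "Tesla", None])
toy_map = pd.DataFrame({"raw": ["ford", "mercedes-b"], "canonical": ["ford", "mercedes-benz"]})
print(unmapped_values(toy_col, toy_map))
```

An empty result means the mapping covers every value; anything listed falls through to the "keep the normalized raw value" behaviour of the mapping step.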

In [28]:
# QA and sample export
def no_surrounding_spaces(series):
    vals = [x for x in series.dropna().unique() if isinstance(x, str)]
    return all(v == v.strip() for v in vals)

for c in ['Company','Model']:
    if c in df.columns:
        raw_n = df[c].nunique()
        can_col = c.lower().replace(' ', '_') + '_canonical'
        can_n = df[can_col].nunique() if can_col in df.columns else None
        print(f"\n{c}: raw distinct = {raw_n}, canonical distinct = {can_n}")
        if can_col in df.columns:
            display(df[can_col].value_counts().head(30))
            print("No surrounding spaces:", no_surrounding_spaces(df[can_col]))

# save a sample for sharing (do not commit raw data)
sample = df.sample(n=min(200, len(df)), random_state=42)
sample_path = PROC_DIR / "sample_standardized.csv"
PROC_DIR.mkdir(parents=True, exist_ok=True)
sample.to_csv(sample_path, index=False)
print("\nSample salvato in:", sample_path.resolve())


Company: raw distinct = 30, canonical distinct = 30


company_canonical
chevrolet        1819
dodge            1671
ford             1614
volkswagen       1333
mercedes-benz    1285
mitsubishi       1277
chrysler         1120
oldsmobile       1111
toyota           1110
nissan            886
mercury           874
lexus             802
pontiac           796
bmw               790
volvo             789
honda             708
acura             689
cadillac          652
plymouth          617
saturn            586
lincoln           492
audi              468
buick             439
subaru            405
jeep              363
porsche           361
hyundai           264
saab              210
infiniti          195
jaguar            180
Name: count, dtype: int64

No surrounding spaces: True

Model: raw distinct = 154, canonical distinct = 154


model_canonical
diamante         418
silhouette       411
prizm            411
passat           391
ram pickup       383
jetta            382
rl               372
ls400            354
lhs              330
a6               329
528i             324
3000gt           303
montero sport    302
s40              282
tl               269
pathfinder       267
durango          262
grand marquis    261
323i             260
metro            258
forester         255
corvette         245
accord           243
300m             243
sunfire          241
viper            240
s-class          238
malibu           237
concorde         237
eldorado         232
Name: count, dtype: int64

No surrounding spaces: True

Sample salvato in: /Users/serenatempesta/Documents/Progetti/Data_Analysis/progetto_finale/data/processed/sample_standardized.csv


In [29]:
# === EXPORT city_state for Tableau (uses the in-memory df) ===
# Purpose (team): produce `database_for_tableau_city_state.csv` with a consistent city_state column.
# Note: this notebook does NOT compute the ratio; the aggregation notebook does.

# conservative normalization: Title Case and trim
df['dealer_region_mapped'] = df['Dealer_Region'].astype(str).str.strip().str.title()

# city_state: if the value already contains a comma, keep it; otherwise append ", USA"
def mk_city_state(v):
    s = str(v).strip()
    if s == "" or s.lower() in ['nan','none']:
        return ""
    return s if ',' in s else f"{s}, USA"

df['city_state'] = df['dealer_region_mapped'].apply(mk_city_state)

# save the detail file
OUT_DETAIL.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(OUT_DETAIL, index=False)
print("Salvato dettaglio per Tableau:", OUT_DETAIL.resolve())
display(df[['dealer_region_mapped','city_state']].drop_duplicates().head(20))

Salvato dettaglio per Tableau: /Users/serenatempesta/Documents/Progetti/Data_Analysis/progetto_finale/data/processed/database_for_tableau_city_state.csv


Unnamed: 0,dealer_region_mapped,city_state
0,Middletown,"Middletown, USA"
1,Aurora,"Aurora, USA"
2,Greenville,"Greenville, USA"
3,Pasco,"Pasco, USA"
4,Janesville,"Janesville, USA"
5,Scottsdale,"Scottsdale, USA"
6,Austin,"Austin, USA"


In [30]:
# === FINAL SANITY CHECK ===
from pathlib import Path
import pandas as pd

out = Path(OUT_DETAIL)
assert out.exists(), f"File not found: {out}"
df_check = pd.read_csv(out, nrows=5)
print("OK:", out.name, "| righe campione:", len(df_check))
print("Colonne:", df_check.columns.tolist())
display(df_check.head())

OK: database_for_tableau_city_state.csv | righe campione: 5
Colonne: ['Date', 'Customer Name', 'Gender', 'Annual Income', 'Dealer_Name', 'Company', 'Model', 'Engine', 'Transmission', 'Color', 'Price ($)', 'Body Style', 'Dealer_Region', '_price_clean', 'Price', '_income_clean', 'Company_mapped', 'Dealer_Name_mapped', 'Model_mapped', 'Transmission_mapped', 'Gender_mapped', 'Customer Name_mapped', 'Dealer_Region_mapped', '_price_clean_w', '_income_clean_w', 'company_canonical', 'dealer_name_canonical', 'model_canonical', 'transmission_canonical', 'gender_canonical', 'customer_name_canonical', 'dealer_region_mapped', 'city_state']


Unnamed: 0,Date,Customer Name,Gender,Annual Income,Dealer_Name,Company,Model,Engine,Transmission,Color,...,_price_clean_w,_income_clean_w,company_canonical,dealer_name_canonical,model_canonical,transmission_canonical,gender_canonical,customer_name_canonical,dealer_region_mapped,city_state
0,2022-01-02,Geraldine,Male,13500,Buddy Storbeck's Diesel Service Inc,Ford,Expedition,DoubleÂ Overhead Camshaft,Auto,Black,...,26000.0,13500,ford,buddy storbeck's diesel service inc,expedition,auto,male,geraldine,Middletown,"Middletown, USA"
1,2022-01-02,Gia,Male,1480000,C & M Motors Inc,Dodge,Durango,DoubleÂ Overhead Camshaft,Auto,Black,...,19000.0,1480000,dodge,c & m motors inc,durango,auto,male,gia,Aurora,"Aurora, USA"
2,2022-01-02,Gianna,Male,1035000,Capitol KIA,Cadillac,Eldorado,Overhead Camshaft,Manual,Red,...,31500.0,1035000,cadillac,capitol kia,eldorado,manual,male,gianna,Greenville,"Greenville, USA"
3,2022-01-02,Giselle,Male,13500,Chrysler of Tri-Cities,Toyota,Celica,Overhead Camshaft,Manual,Pale White,...,14000.0,13500,toyota,chrysler of tri-cities,celica,manual,male,giselle,Pasco,"Pasco, USA"
4,2022-01-02,Grace,Male,1465000,Chrysler Plymouth,Acura,TL,DoubleÂ Overhead Camshaft,Auto,Red,...,24500.0,1465000,acura,chrysler plymouth,tl,auto,male,grace,Janesville,"Janesville, USA"
