# Data cleaning Immobiliare.it

Importing the libraries

In [29]:
import numpy as np
import pandas as pd
import re

I check the raw dataframe and the data types:

In [30]:
df_raw = pd.read_csv('house_prices_italy.csv')
df_raw.head()

Unnamed: 0.1,Unnamed: 0,region,city,area,rooms,toilets,price
0,0,abruzzo,Pescara,89m²,3,1,€ 75.000
1,1,abruzzo,Spoltore,199m²,5+,3+,€ 235.000
2,2,abruzzo,Pescara,227m²,5,3+,€ 299.000
3,3,abruzzo,Appartamenti di nuova costruzione a Tortoreto,43m²,2 - 4,1,da € 165.000
4,4,abruzzo,Rosciano,530m²,5+,3+,€ 650.000


In [31]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  40000 non-null  int64 
 1   region      40000 non-null  object
 2   city        40000 non-null  object
 3   area        39975 non-null  object
 4   rooms       39052 non-null  object
 5   toilets     40000 non-null  object
 6   price       40000 non-null  object
dtypes: int64(1), object(6)
memory usage: 2.1+ MB


Are there any null values? Let's see:

In [32]:
df_raw.isnull().sum()

Unnamed: 0      0
region          0
city            0
area           25
rooms         948
toilets         0
price           0
dtype: int64

I decided to drop the first column since it does not carry any info:

In [33]:
df = df_raw.drop("Unnamed: 0", axis=1)

In [34]:
df.head()

Unnamed: 0,region,city,area,rooms,toilets,price
0,abruzzo,Pescara,89m²,3,1,€ 75.000
1,abruzzo,Spoltore,199m²,5+,3+,€ 235.000
2,abruzzo,Pescara,227m²,5,3+,€ 299.000
3,abruzzo,Appartamenti di nuova costruzione a Tortoreto,43m²,2 - 4,1,da € 165.000
4,abruzzo,Rosciano,530m²,5+,3+,€ 650.000


I cast the "area" column to string. Otherwise the regex search below raises a TypeError on the missing values, which are stored as float NaN.

In [35]:
df['area'] = df['area'].astype(str)

I do not like the column names, so I rename them to include the units:

In [36]:
df.rename(columns={'area':'area[m2]', 'price':'price[€]'}, inplace=True)
df.sample(10)

Unnamed: 0,region,city,area[m2],rooms,toilets,price[€]
3536,basilicata,Maratea,269m²,4,1,"da € 228.093,00"
5690,campania,Torchiara,90m²,3,1,€ 110.000
16753,lombardia,Milano,162m²,4,3+,Prezzo su richiesta
33606,trentino-alto-adige,Mezzolombardo,143m²,5+,3,€ 285.000
27715,sardegna,Castelsardo,35m²,2,1,"€ 79.000€ 95.000(-16,8%)"
4155,campania,Acerra,106m²,3,1,"da € 64.293,00"
29790,sicilia,Palermo,125m²,4,2,€ 215.000
20368,molise,Venafro,132m²,5,3,€ 175.000
38581,veneto,Marcon,150m²,5+,2,€ 165.000
1722,abruzzo,Tortoreto,128m²,4,1,€ 99.000


Let's start the cleaning with a regex on the "area" column, keeping only the digits:

In [37]:
# compile the pattern: the first run of digits in the string
p = re.compile(r'[0-9]+')


df['area[m2]'] = df['area[m2]'].apply(lambda x: np.nan if p.search(x) is None else p.search(x).group())
df.head()

Unnamed: 0,region,city,area[m2],rooms,toilets,price[€]
0,abruzzo,Pescara,89,3,1,€ 75.000
1,abruzzo,Spoltore,199,5+,3+,€ 235.000
2,abruzzo,Pescara,227,5,3+,€ 299.000
3,abruzzo,Appartamenti di nuova costruzione a Tortoreto,43,2 - 4,1,da € 165.000
4,abruzzo,Rosciano,530,5+,3+,€ 650.000


In [38]:
df.isna().sum()

region        0
city          0
area[m2]     25
rooms       948
toilets       0
price[€]      0
dtype: int64
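
As an aside, the same extraction can be sketched with the vectorized string methods of pandas. This is only an alternative under assumptions, not the approach used in this notebook: the `toy` frame and its values are made up for illustration. `Series.str.extract` returns NaN where nothing matches, so missing values pass through without a prior cast to string, and `pd.to_numeric` turns the digit strings into numbers.

```python
import pandas as pd

# Toy rows mimicking the raw "area" values; None stands for a missing entry.
toy = pd.DataFrame({'area': ['89m²', '199m²', None]})

# str.extract returns the first capture group and NaN where nothing matches,
# so missing values do not need to be cast to str beforehand.
digits = toy['area'].str.extract(r'([0-9]+)', expand=False)

# Convert the digit strings to numbers; missing entries stay NaN.
toy['area[m2]'] = pd.to_numeric(digits)
```

With this variant the column ends up numeric instead of holding digit strings, which may be more convenient for later plots.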

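Now it is the turn of the "price" column. The pattern keeps the first number that contains a dot, so listings without such a number (for example "Prezzo su richiesta", price on request) become NaN: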

In [39]:
# first "digits.digits" group, e.g. '75.000' from '€ 75.000'
p_price = re.compile(r'[0-9]+\.[0-9]+')

In [40]:
df['price[€]'] = df['price[€]'].apply(lambda x: np.nan if p_price.search(x) is None else p_price.search(x).group())
df.isna().sum()

region         0
city           0
area[m2]      25
rooms        948
toilets        0
price[€]    1265
dtype: int64
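
Note that the extracted values are still strings such as '75.000', where the dot is the Italian thousands separator, so a numeric parser would read them as 75.0 rather than 75000. The sketch below shows one possible way to get the price in euros instead. The helper `parse_price` and the pattern `PRICE_RE` are hypothetical names introduced here, and the behavior differs from the cell above: it keeps every dot-separated group (so prices above one million are not cut short) and also accepts prices written without a dot.

```python
import re

import numpy as np

# Hypothetical pattern: digits followed by any number of ".ddd" thousands groups.
PRICE_RE = re.compile(r'[0-9]+(?:\.[0-9]{3})*')


def parse_price(text):
    """Return the first price found in `text` as a float in euros, or NaN."""
    match = PRICE_RE.search(text)
    if match is None:
        return np.nan
    # Dots are thousands separators here, so remove them before converting.
    return float(match.group().replace('.', ''))


# Strings in the formats seen in the sample above.
examples = ['€ 75.000', 'da € 228.093,00', '€ 79.000€ 95.000(-16,8%)', 'Prezzo su richiesta']
parsed = [parse_price(s) for s in examples]
```

It could be applied with `df['price[€]'].apply(parse_price)` on the raw strings. Cents after the comma are ignored, and for discounted listings only the first price is kept.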

I now rename the regions from their lowercase, hyphenated form to properly capitalized names:

In [41]:
regions_dict = {'abruzzo':'Abruzzo', 'basilicata':'Basilicata', 'campania':'Campania', 'calabria':'Calabria', 'emilia-romagna':'Emilia-Romagna',
       'friuli-venezia-giulia':'Friuli-Venezia Giulia', 'lazio': 'Lazio', 'liguria':'Liguria', 'lombardia':'Lombardia', 'marche':'Marche',
       'molise':'Molise', 'piemonte':'Piemonte', 'puglia':'Puglia', 'sardegna':'Sardegna', 'sicilia':'Sicilia', 'toscana':'Toscana',
       'trentino-alto-adige': 'Trentino-Alto Adige', 'umbria':'Umbria', 'valle-d-aosta':"Valle d'Aosta", 'veneto':'Veneto'}

df.replace({'region':regions_dict}, inplace=True)
df.isna().sum()


region         0
city           0
area[m2]      25
rooms        948
toilets        0
price[€]    1265
dtype: int64
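
Since `replace` silently leaves any value that is not a key of the dictionary unchanged, a quick check for unmapped regions can be useful. The following is a minimal sketch on toy data; the `toy` frame and the shortened `mapping` are made up for illustration and deliberately omit one region.

```python
import pandas as pd

toy = pd.DataFrame({'region': ['abruzzo', 'valle-d-aosta', 'lazio']})

# Shortened, illustrative mapping: 'lazio' is left out on purpose.
mapping = {'abruzzo': 'Abruzzo', 'valle-d-aosta': "Valle d'Aosta"}

# Values present in the data but absent from the dictionary keys.
unmapped = sorted(set(toy['region']) - set(mapping))

# replace only touches the values that have a key; the others stay as they are.
toy['region'] = toy['region'].replace(mapping)
```

An empty `unmapped` list on the real dataframe would confirm that every region has been renamed.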

In [42]:
df.sample(20)

Unnamed: 0,region,city,area[m2],rooms,toilets,price[€]
4776,Campania,Casalnuovo di Napoli,96,3,1,180.0
15194,Liguria,Loano,70,3,1,370.0
30223,Toscana,Pisa,135,5,2,245.0
37636,Valle d'Aosta,Champorcher,80,4,1,115.0
3119,Basilicata,Matera,53,2,1,200.0
30283,Toscana,Scandicci,415,5+,3+,790.0
5955,Campania,Napoli,65,3,2,149.0
33036,Trentino-Alto Adige,Merano,187,4,2,950.0
6206,Calabria,Diamante,60,3,1,120.0
36265,Valle d'Aosta,Sarre,751,5+,1,292.132


I believe this is enough. The cleaned data can now be loaded into a data visualization tool to continue with some visualizations.

In [43]:
# index=False avoids writing the row index as an extra unnamed column
df.to_csv('house_prices_italy_cleaned.csv', index=False)
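
By default `to_csv` also writes the row index, which is how an "Unnamed: 0" column like the one dropped at the beginning tends to appear when the file is read back. The sketch below illustrates the difference on a made-up one-row frame, using in-memory buffers instead of real files.

```python
import io

import pandas as pd

toy = pd.DataFrame({'region': ['Abruzzo'], 'price[€]': [75000]})

# Default behavior: the index is written as a first column without a name.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
```

Here `cols_with_index` contains the extra "Unnamed: 0" entry, while `cols_without_index` holds only the two data columns.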