In [1]:
import geopandas as gpd

# Read the GeoJSON file
catastici = gpd.read_file("./data/raw/20240221_Catastici1741_Intermediate.geojson")

# Filtering the dataset

In [2]:
# filter the necessary datapoints
catastici_ppl = catastici[(catastici['owner_code'] == 'PPL') & (catastici['owner_count']=='1')]
catastici_ppl = catastici_ppl[['owner_first_name','owner_family_name','function','place','an_rendi']]

Drop the rows where the owner's first name is not given, so that they do not confuse the model. -> 1273 rows

In [3]:
# drop the rows with no owner first name
catastici_ppl = catastici_ppl[catastici_ppl.owner_first_name!='']
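
The comparison with `''` removes only exact empty strings; missing values and whitespace-only names would pass through. A minimal sketch on a toy frame illustrates a stricter filter. The sample values are invented for illustration, and whether the real data contains such cases is an assumption:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "owner_first_name": ["angela", "", "  ", None, "paolo"],
    "owner_family_name": ["patella", "rossi", "bianchi", "verdi", "zen"],
})

# Same style of filter as the cell above: drops only the exact empty string.
loose = toy[toy.owner_first_name != ""]

# Stricter variant: treat missing values and whitespace-only names as empty too.
names = toy["owner_first_name"].fillna("").str.strip()
strict = toy[names != ""]

print(len(loose), len(strict))
```

If the two counts differ on the real dataset, the stricter variant may be the safer choice.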

Some first and family names contain "|" followed by a family relationship, e.g. "angela patella paolo | _moglie". There are 118 such rows, so I am dropping them as well to have a cleaner dataset.

In [4]:
# raw strings avoid the invalid-escape warning for '\|'
catastici_ppl = catastici_ppl[(~catastici_ppl['owner_first_name'].str.contains(r'\|')) & (~catastici_ppl['owner_family_name'].str.contains(r'\|'))]
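
An equivalent way to express the pipe filter is to match the literal character instead of a regular expression, and to say explicitly how missing values are handled. This is a sketch on invented rows (the second name is made up, and `has_pipe` is a hypothetical helper, not part of the notebook):

```python
import pandas as pd

# Toy data; only the first value comes from the text above, the rest is invented.
toy = pd.DataFrame({
    "owner_first_name": ["angela patella paolo | _moglie", "marco", None],
    "owner_family_name": ["patella", "zen | _fratello", "corner"],
})

def has_pipe(series: pd.Series) -> pd.Series:
    """Return True where the value contains a literal '|' (hypothetical helper)."""
    # regex=False matches the character itself; na=False counts missing values as "no pipe".
    return series.str.contains("|", regex=False, na=False)

mask = has_pipe(toy["owner_first_name"]) | has_pipe(toy["owner_family_name"])
clean = toy[~mask]

print(int(mask.sum()), "rows flagged,", len(clean), "rows kept")
```

With `na=False` the mask stays boolean even when a name is missing, so the negation with `~` is safe.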

Drop the rows where either of the remaining 2 columns (function and place) is 'nan', i.e. where that information is not given.

In [5]:
# drop the rows whose value is the string 'nan'
catastici_ppl = catastici_ppl[catastici_ppl['function']!='nan']        # 278
catastici_ppl = catastici_ppl[catastici_ppl['place']!='nan']           # 18
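
The cell above drops rows whose `function` or `place` is the string `'nan'`. If the goal is instead to keep those rows and mark the gap with a `'NOT GIVEN'` placeholder, a minimal sketch could look like the following. The toy values are invented, and treating real missing values the same way as the string `'nan'` is an assumption:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "function": ["casa", "nan", None, "bottega"],
    "place": ["san polo", "rialto", "nan", None],
})

PLACEHOLDER = "NOT GIVEN"

filled = toy.copy()
for col in ["function", "place"]:
    # Flag both real missing values and the literal string "nan".
    missing = filled[col].isna() | (filled[col] == "nan")
    # Replace flagged entries with the placeholder, keep everything else.
    filled[col] = filled[col].mask(missing, PLACEHOLDER)

print(filled)
```

This keeps every row, at the cost of a placeholder category in both columns.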

In [6]:
# lowercase everything
for col in catastici_ppl.columns.to_list():
    catastici_ppl[col] = catastici_ppl[col].str.lower()

<font color='red'> The price is given in various forms: with different currencies, sometimes including the rent period, and sometimes as goods exchanged instead of a monetary payment. I kept only the numerical payments, which are given in ducati.</font>

In [7]:
catastici_ppl = catastici_ppl[catastici_ppl['an_rendi'].str.isnumeric()]
catastici_ppl['an_rendi'] = catastici_ppl['an_rendi'].astype(int)
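
One caveat with `str.isnumeric()`: it follows Python's `str.isnumeric`, which also accepts characters such as "½" that `int()` cannot parse. A stricter sketch, on invented sample values, accepts ASCII digits only (whether the real `an_rendi` column contains such characters is an assumption):

```python
import pandas as pd

# Invented sample values covering the cases described above.
rents = pd.Series(["12", "12.5", "10 ducati", "½", None])

# "½" is numeric according to Python, but int("½") raises ValueError,
# so a later astype(int) could fail on such a value.
print("½".isnumeric())

# Stricter alternative: ASCII digits only, missing values count as non-matching.
strict = rents.str.fullmatch(r"[0-9]+", na=False)

as_int = rents[strict].astype(int)
print(as_int.tolist())
```

The `na=False` argument keeps the mask boolean, so rows with a missing price are simply filtered out.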

In [8]:
# drop duplicate rows
catastici_ppl = catastici_ppl[['owner_first_name', 'owner_family_name', 'function', 'place', 'an_rendi']].drop_duplicates()

Rename the columns and store the result as a CSV file.

In [9]:
# rename the columns
catastici_ppl.rename({
    'owner_first_name':'Owner_First_Name',
    'owner_family_name':'Owner_Family_Name',
    'function':'Property_Type',
    'place':'Property_Location',
    'an_rendi':'Rent_Income'
}, axis=1, inplace=True)
catastici_ppl.to_csv('./data/clean/catastici_num.csv', index=False)
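
A quick round-trip check can confirm that the stored file has the expected columns and that the rent is read back as an integer. The sketch below uses a single invented row and an in-memory buffer instead of the real file, so it can run without the dataset:

```python
import io

import pandas as pd

EXPECTED_COLUMNS = [
    "Owner_First_Name",
    "Owner_Family_Name",
    "Property_Type",
    "Property_Location",
    "Rent_Income",
]

# One invented row standing in for the cleaned dataset.
toy = pd.DataFrame(
    [["angela", "patella", "casa", "san polo", 12]],
    columns=EXPECTED_COLUMNS,
)

# Write to an in-memory buffer instead of ./data/clean/ and read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

columns_ok = list(reloaded.columns) == EXPECTED_COLUMNS
rent_is_int = pd.api.types.is_integer_dtype(reloaded["Rent_Income"])
print(columns_ok, rent_is_int)
```

The same two checks could be applied to the real CSV by passing its path to `pd.read_csv`.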