# Data Preparation KI-Anwendung

The goal of this notebook is to prepare the housing data from KI-Anwendung so that it has the same structure as the prepared data from Data Analytics.

The prepared data should have the following columns:
Index(['bfs_number', 'bfs_name', 'lat', 'lon', 'rooms', 'area', 'price', 'postalcode',
       'address', 'town'],
      dtype='object')
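
A small check like the one below can show which of the target columns a data frame still lacks. This is only a sketch: `TARGET_COLUMNS` and `missing_columns` are names introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# Target schema as listed above (hypothetical constant name).
TARGET_COLUMNS = ['bfs_number', 'bfs_name', 'lat', 'lon', 'rooms', 'area',
                  'price', 'postalcode', 'address', 'town']


def missing_columns(frame: pd.DataFrame, expected=TARGET_COLUMNS) -> list:
    """Return the expected columns that are absent from `frame`, in schema order."""
    return [col for col in expected if col not in frame.columns]


# Toy frame that covers only part of the schema, for illustration.
example = pd.DataFrame({'rooms': [4.5], 'area': [148], 'price': [4180]})
print(missing_columns(example))
```

Running such a check on the final data frame before export makes any gap between the target schema and the exported file visible.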

In [31]:
# Libraries
import pandas as pd

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [32]:
# Read the data to a pandas data frame
df = pd.read_csv('apartments_data_zurich_30.12.2023.csv', sep=',', encoding='utf-8')

# Get number of rows and columns
df.shape

(1008, 7)

In [33]:
df.head()

Unnamed: 0,web-scraper-order,web-scraper-start-url,rooms_area_price_raw,address_raw,price_raw,description_raw,text_raw
0,1703955431-1,https://www.immoscout24.ch/de/wohnung/mieten/k...,"4,5 Zimmer, 148 m², CHF 4180.—","Schaffhauserstrasse 363, 8050 Zürich, ZH",CHF 4180.—,««Renovierte 4.5-Zimmerwohnung an zentraler La...,"4,5 Zimmer, 148 m², CHF 4180.—Schaffhauserstra..."
1,1703955431-2,https://www.immoscout24.ch/de/wohnung/mieten/k...,"1,5 Zimmer, 35 m², CHF 1620.—","Bernerstrasse Süd 167, 8048 Zürich, ZH",CHF 1620.—,«City Pop - Furnished Apartment in Zurich-Alts...,"1,5 Zimmer, 35 m², CHF 1620.—Bernerstrasse Süd..."
2,1703955431-3,https://www.immoscout24.ch/de/wohnung/mieten/k...,"15 m², CHF 2167.—","Militärstrasse 24, 8004 Zürich, ZH",CHF 2167.—,«Studio Apartment Mini»,"15 m², CHF 2167.—Militärstrasse 24, 8004 Züric..."
3,1703955431-4,https://www.immoscout24.ch/de/wohnung/mieten/k...,"4,5 Zimmer, 110 m², CHF 4500.—","8600 Dübendorf, ZH",CHF 4500.—,«Grosszügige 4.5-Zimmer-Maisonette-Wohnung in ...,"4,5 Zimmer, 110 m², CHF 4500.—8600 Dübendorf, ..."
4,1703955431-5,https://www.immoscout24.ch/de/wohnung/mieten/k...,"2,5 Zimmer, 50 m², CHF 1650.—","Rütistrasse 16, 8118 Pfaffhausen, ZH",CHF 1650.—,«Wohnung in Pfaffhausen»,"2,5 Zimmer, 50 m², CHF 1650.—Rütistrasse 16, 8..."


In [34]:
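# Keep the original address string, then remove the canton suffix
df['address_raw_origin'] = df['address_raw']
df['address_raw'] = df['address_raw'].str.replace(', ZH', '', regex=False)

# Postal code (first four-digit number) and the town name that follows it
df['town'] = df['address_raw'].str.extract(r'(?<=\d\d\d\d)(.+)', expand=False).str.strip()
df['zip'] = df['address_raw'].str.extract(r'(\d\d\d\d)', expand=False)

# Rooms (decimal comma -> decimal point), living area in m² and price in CHF
df['rooms'] = df['rooms_area_price_raw'].str.extract(r'(.+)(?=Zimmer)', expand=False)
df['rooms'] = df['rooms'].str.replace(',', '.', regex=False).str.strip()
df['size'] = df['rooms_area_price_raw'].str.extract(r'(\d+)(?=\s*m²)', expand=False)
df['price'] = df['rooms_area_price_raw'].str.extract(r'(\d+)(?=\s*\.)', expand=False)
df['address'] = df['address_raw']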
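
The parsing step can also be sketched with named capture groups, which makes it easier to see which part of the raw string ends up in which column. This is an alternative to the patterns used in the notebook, not the notebook's own method. It assumes that prices carry no thousands separator and that the first four-digit number in the address is the postal code. The sample strings are taken from the `df.head()` output above; the names `sample`, `listing`, `location` and `parsed` are illustrative.

```python
import pandas as pd

# Sample strings taken from the df.head() output above.
sample = pd.DataFrame({
    'rooms_area_price_raw': ['4,5 Zimmer, 148 m², CHF 4180.—',
                             '15 m², CHF 2167.—'],
    'address_raw': ['Schaffhauserstrasse 363, 8050 Zürich, ZH',
                    '8600 Dübendorf, ZH'],
})

# The room count is optional (studios may omit it); groups without a match become NaN.
listing = sample['rooms_area_price_raw'].str.extract(
    r'(?:(?P<rooms>\d+(?:,\d+)?)\s*Zimmer,\s*)?(?P<area>\d+)\s*m²,\s*CHF\s*(?P<price>\d+)'
)
# Convert the decimal comma to a decimal point; NaN values are left untouched.
listing['rooms'] = listing['rooms'].str.replace(',', '.', regex=False)

# Postal code followed by the town name, up to the next comma.
location = sample['address_raw'].str.extract(r'(?P<postalcode>\d{4})\s+(?P<town>[^,]+)')

parsed = pd.concat([listing, location], axis=1)
print(parsed)
```

All extracted values are still strings at this point, so a conversion to numeric types is needed afterwards, as in the notebook.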


In [35]:
# Drop rows in which a parsed field holds the placeholder string 'undefined'
df = df.drop(df[df['rooms'] == 'undefined'].index)
df = df.drop(df[df['zip'] == 'undefined'].index)
df = df.drop(df[df['size'] == 'undefined'].index)
df = df.drop(df[df['price'] == 'undefined'].index)

# Drop rows with a missing value in any column (failed extractions yield NaN)
df = df.dropna()

In [36]:
# Convert the parsed strings to numeric types and use the target column names
df['rooms'] = df['rooms'].astype(float)
df['area'] = df['size'].astype(int)
df['price'] = df['price'].astype(int)
df['postalcode'] = df['zip'].astype(int)

In [37]:
df.shape

(886, 16)

In [38]:
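# Add the BFS municipality number via a postal code lookup table
df_bfs_number_postalcode = pd.read_csv('bfs_number_postalcode.csv')

# merge() performs an inner join by default, so listings whose postal code
# is not in the lookup table are dropped
df = df.merge(df_bfs_number_postalcode, on='postalcode')
df.shape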

(855, 18)
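
To see which postal codes find no match in such a lookup, a left join with `indicator=True` can be used. The sketch below runs on toy data; the frames `listings` and `lookup` and their values are illustrative assumptions, not the contents of `bfs_number_postalcode.csv`.

```python
import pandas as pd

# Toy stand-ins for the listings and the postal code lookup table (illustrative values).
listings = pd.DataFrame({'postalcode': [8050, 8048, 9999],
                         'price': [4180, 1620, 1000]})
lookup = pd.DataFrame({'postalcode': [8050, 8048],
                       'bfs_number': [261, 261]})

# A left join keeps every listing; the '_merge' column records whether a match was found.
checked = listings.merge(lookup, on='postalcode', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'postalcode']
print(sorted(unmatched.unique()))
```

If the lookup table could contain a postal code more than once, passing `validate='many_to_one'` to `merge` would raise an error instead of silently duplicating listings.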

In [39]:
df.columns

Index(['web-scraper-order', 'web-scraper-start-url', 'rooms_area_price_raw',
       'address_raw', 'price_raw', 'description_raw', 'text_raw',
       'address_raw_origin', 'town', 'zip', 'rooms', 'size', 'price',
       'address', 'area', 'postalcode', 'Unnamed: 0', 'bfs_number'],
      dtype='object')

In [40]:
# Export the prepared columns
columns_to_export = ['bfs_number', 'rooms', 'area', 'price', 'postalcode',
                     'address', 'town', 'description_raw']
df[columns_to_export].to_csv('apartments_data_zurich_30.12.2023_with_bfs.csv',
                             sep=',',
                             encoding='utf-8',
                             index=False)