Purpose

The purpose of this data science project is to come up with a pricing model for the the rents charged for german apartments. The project aims to build a predictive model for rents based on facilities and properties boasted by each individual listing. The model may be useful to companies or individuals looking to price their apartments for rents. 

Imports

In [78]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from pathlib import Path
import requests
import pandas_profiling
from pandas_profiling.utils.cache import cache_file



Data Wrangling Objectives

Fundamental questions to resolve in this notebook before moving on.
Is this the right type of data that should be used to model rent prices?
    Has the required target value been identified?
    Are there potentially useful features?
Are there fundamental issues with the data?

Loading the raw data

In [79]:
immo_data= pd.read_csv('../data/raw/immo_data.csv')

Data Definition

The features of an apartment (the entity) has to be defined in sufficient detail for further analysis. In doing so we gain an understanding of its relevance to the each apartment record. We also identify our feature of interest - the target feature - which in our case is the rent charged for each apartment, and start to get a sense for how these features are related to our target feature. Of relevance is representing the features in the right data format for further processing and taking note of features that have limited values. 

We start inspecting the dataset by reviewing summary level information about the features and viewing the first view entries

This report gives us all the summary level information we want about the data set in a lot of detail

In [80]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [81]:
report = immo_data.profile_report(sort='None', html={'style':{'full_width': True}}, progress_bar=False)


The info command gives a sense to the shape and quality of the dataset.

In [82]:
immo_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 268850 entries, 0 to 268849
Data columns (total 49 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   regio1                    268850 non-null  object 
 1   serviceCharge             261941 non-null  float64
 2   heatingType               223994 non-null  object 
 3   telekomTvOffer            236231 non-null  object 
 4   telekomHybridUploadSpeed  45020 non-null   float64
 5   newlyConst                268850 non-null  bool   
 6   balcony                   268850 non-null  bool   
 7   picturecount              268850 non-null  int64  
 8   pricetrend                267018 non-null  float64
 9   telekomUploadSpeed        235492 non-null  float64
 10  totalRent                 228333 non-null  float64
 11  yearConstructed           211805 non-null  float64
 12  scoutId                   268850 non-null  int64  
 13  noParkSpaces              93052 non-null   f

We observe that the dataset has 49 features associated with 268,850 apartment records that have been put up for rent. The dataset features comprise of 4 main datatypes, including 6 boolean, 18 float, 6 integer and 19 object datatypes. We would need to verfiy that these datatypes are accurate in the context of our goal of developing a price model for rent. We also note a varying level of quality associated with each feature in terms of how many records exist for the particular feature. We would need to deal with those features that have a lot of empty values. One last observation is the amount of memory used up by the dataset (89.7+ MB).This size is manageable with the operating system currently in use. A larger dataset would necessitate the use of additional memory, possibly with cloud computing technologies. 

In [83]:
immo_data.head()

Unnamed: 0,regio1,serviceCharge,heatingType,telekomTvOffer,telekomHybridUploadSpeed,newlyConst,balcony,picturecount,pricetrend,telekomUploadSpeed,totalRent,yearConstructed,scoutId,noParkSpaces,firingTypes,hasKitchen,geo_bln,cellar,yearConstructedRange,baseRent,houseNumber,livingSpace,geo_krs,condition,interiorQual,petsAllowed,street,streetPlain,lift,baseRentRange,typeOfFlat,geo_plz,noRooms,thermalChar,floor,numberOfFloors,noRoomsRange,garden,livingSpaceRange,regio2,regio3,description,facilities,heatingCosts,energyEfficiencyClass,lastRefurbish,electricityBasePrice,electricityKwhPrice,date
0,Nordrhein_Westfalen,245.0,central_heating,ONE_YEAR_FREE,,False,False,6,4.62,10.0,840.0,1965.0,96107057,1.0,oil,False,Nordrhein_Westfalen,True,2.0,595.0,244.0,86.0,Dortmund,well_kept,normal,,Sch&uuml;ruferstra&szlig;e,Schüruferstraße,False,4,ground_floor,44269,4.0,181.4,1.0,3.0,4,True,4,Dortmund,Schüren,Die ebenerdig zu erreichende Erdgeschosswohnun...,Die Wohnung ist mit Laminat ausgelegt. Das Bad...,,,,,,May19
1,Rheinland_Pfalz,134.0,self_contained_central_heating,ONE_YEAR_FREE,,False,True,8,3.47,10.0,,1871.0,111378734,2.0,gas,False,Rheinland_Pfalz,False,1.0,800.0,,89.0,Rhein_Pfalz_Kreis,refurbished,normal,no,no_information,,False,5,ground_floor,67459,3.0,,,,3,False,4,Rhein_Pfalz_Kreis,Böhl_Iggelheim,Alles neu macht der Mai – so kann es auch für ...,,,,2019.0,,,May19
2,Sachsen,255.0,floor_heating,ONE_YEAR_FREE,10.0,True,True,8,2.72,2.4,1300.0,2019.0,113147523,1.0,,False,Sachsen,True,9.0,965.0,4.0,83.8,Dresden,first_time_use,sophisticated,,Turnerweg,Turnerweg,True,6,apartment,1097,3.0,,3.0,4.0,3,False,4,Dresden,Äußere_Neustadt_Antonstadt,Der Neubau entsteht im Herzen der Dresdner Neu...,"* 9 m² Balkon\n* Bad mit bodengleicher Dusche,...",,,,,,Oct19
3,Sachsen,58.15,district_heating,ONE_YEAR_FREE,,False,True,9,1.53,40.0,,1964.0,108890903,,district_heating,False,Sachsen,False,2.0,343.0,35.0,58.15,Mittelsachsen_Kreis,,,,Gl&uuml;ck-Auf-Stra&szlig;e,Glück-Auf-Straße,False,2,other,9599,3.0,86.0,3.0,,3,False,2,Mittelsachsen_Kreis,Freiberg,Abseits von Lärm und Abgasen in Ihre neue Wohn...,,87.23,,,,,May19
4,Bremen,138.0,self_contained_central_heating,,,False,True,19,2.46,,903.0,1950.0,114751222,,gas,False,Bremen,False,1.0,765.0,10.0,84.97,Bremen,refurbished,,,Hermann-Henrich-Meier-Allee,Hermann-Henrich-Meier-Allee,False,5,apartment,28213,3.0,188.9,1.0,,3,False,4,Bremen,Neu_Schwachhausen,Es handelt sich hier um ein saniertes Mehrfami...,Diese Wohnung wurde neu saniert und ist wie fo...,,,,,,Feb20


The head method allows us to see some, but not all the features of the dataset. We can change the viewing options with as follows. 

In [85]:
pd.options.display.max_columns = None
display(immo_data)

Unnamed: 0,regio1,serviceCharge,heatingType,telekomTvOffer,telekomHybridUploadSpeed,newlyConst,balcony,picturecount,pricetrend,telekomUploadSpeed,totalRent,yearConstructed,scoutId,noParkSpaces,firingTypes,hasKitchen,geo_bln,cellar,yearConstructedRange,baseRent,houseNumber,livingSpace,geo_krs,condition,interiorQual,petsAllowed,street,streetPlain,lift,baseRentRange,typeOfFlat,geo_plz,noRooms,thermalChar,floor,numberOfFloors,noRoomsRange,garden,livingSpaceRange,regio2,regio3,description,facilities,heatingCosts,energyEfficiencyClass,lastRefurbish,electricityBasePrice,electricityKwhPrice,date
0,Nordrhein_Westfalen,245.00,central_heating,ONE_YEAR_FREE,,False,False,6,4.62,10.0,840.0,1965.0,96107057,1.0,oil,False,Nordrhein_Westfalen,True,2.0,595.0,244,86.00,Dortmund,well_kept,normal,,Sch&uuml;ruferstra&szlig;e,Schüruferstraße,False,4,ground_floor,44269,4.0,181.40,1.0,3.0,4,True,4,Dortmund,Schüren,Die ebenerdig zu erreichende Erdgeschosswohnun...,Die Wohnung ist mit Laminat ausgelegt. Das Bad...,,,,,,May19
1,Rheinland_Pfalz,134.00,self_contained_central_heating,ONE_YEAR_FREE,,False,True,8,3.47,10.0,,1871.0,111378734,2.0,gas,False,Rheinland_Pfalz,False,1.0,800.0,,89.00,Rhein_Pfalz_Kreis,refurbished,normal,no,no_information,,False,5,ground_floor,67459,3.0,,,,3,False,4,Rhein_Pfalz_Kreis,Böhl_Iggelheim,Alles neu macht der Mai – so kann es auch für ...,,,,2019.0,,,May19
2,Sachsen,255.00,floor_heating,ONE_YEAR_FREE,10.0,True,True,8,2.72,2.4,1300.0,2019.0,113147523,1.0,,False,Sachsen,True,9.0,965.0,4,83.80,Dresden,first_time_use,sophisticated,,Turnerweg,Turnerweg,True,6,apartment,1097,3.0,,3.0,4.0,3,False,4,Dresden,Äußere_Neustadt_Antonstadt,Der Neubau entsteht im Herzen der Dresdner Neu...,"* 9 m² Balkon\n* Bad mit bodengleicher Dusche,...",,,,,,Oct19
3,Sachsen,58.15,district_heating,ONE_YEAR_FREE,,False,True,9,1.53,40.0,,1964.0,108890903,,district_heating,False,Sachsen,False,2.0,343.0,35,58.15,Mittelsachsen_Kreis,,,,Gl&uuml;ck-Auf-Stra&szlig;e,Glück-Auf-Straße,False,2,other,9599,3.0,86.00,3.0,,3,False,2,Mittelsachsen_Kreis,Freiberg,Abseits von Lärm und Abgasen in Ihre neue Wohn...,,87.23,,,,,May19
4,Bremen,138.00,self_contained_central_heating,,,False,True,19,2.46,,903.0,1950.0,114751222,,gas,False,Bremen,False,1.0,765.0,10,84.97,Bremen,refurbished,,,Hermann-Henrich-Meier-Allee,Hermann-Henrich-Meier-Allee,False,5,apartment,28213,3.0,188.90,1.0,,3,False,4,Bremen,Neu_Schwachhausen,Es handelt sich hier um ein saniertes Mehrfami...,Diese Wohnung wurde neu saniert und ist wie fo...,,,,,,Feb20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
268845,Bayern,90.00,heat_pump,ONE_YEAR_FREE,,False,True,0,2.74,10.0,910.0,2016.0,115641081,1.0,geothermal,False,Bayern,True,9.0,820.0,,90.00,Weilheim_Schongau_Kreis,mint_condition,sophisticated,no,no_information,,False,6,roof_storey,82390,3.0,,,,3,False,4,Weilheim_Schongau_Kreis,Eberfing,"Diese schöne, neuwertige Wohnung im Dachgescho...",Fliesen und Parkett. Sichtbarer Dachstuhl.,,,,,,Feb20
268846,Hessen,220.00,gas_heating,,,False,True,12,6.49,,1150.0,1983.0,96981497,1.0,gas,True,Hessen,False,4.0,930.0,,115.00,Bergstraße_Kreis,well_kept,sophisticated,negotiable,no_information,,False,6,apartment,68519,3.5,,1.0,1.0,3,False,5,Bergstraße_Kreis,Viernheim,Hier wird eine Wohnung im 2 Familienhaus angeb...,"Parkett, Kamin, Badewanne&Dusche\nGroßer Balko...",,,2015.0,,,May19
268847,Hessen,220.00,central_heating,ONE_YEAR_FREE,,False,True,21,2.90,40.0,930.0,1965.0,66924271,1.0,gas,False,Hessen,True,2.0,650.0,10,95.00,Limburg_Weilburg_Kreis,well_kept,,negotiable,Emsbachstrasse,Emsbachstrasse,False,5,apartment,65552,4.0,160.77,1.0,2.0,4,True,4,Limburg_Weilburg_Kreis,Limburg_an_der_Lahn,gemütliche 4-Zimmer-Wohnung im Obergeschoss ei...,"Böden: Wohn-/Schlafbereich = Laminat, Küche + ...",,,2019.0,,,Feb20
268848,Nordrhein_Westfalen,175.00,heat_pump,,,True,True,16,4.39,,1015.0,2019.0,110938302,1.0,gas,False,Nordrhein_Westfalen,True,9.0,840.0,58,70.00,Köln,first_time_use,sophisticated,no,Idastra&szlig;e,Idastraße,True,6,apartment,51069,2.0,24.70,,5.0,2,False,3,Köln,Dellbrück,"Neubau Erstbezug, gehobener Standard, alle Ein...","Wände:\nMaler­vlies, weiß gestrichen alter­nat...",40.00,NO_INFORMATION,2019.0,,,May19


First of all we need to identify each record uniquely. It would be preferrable not to use address information captured in teh street and streetPlain feature since this involves using 2 features and the datatype of these features is object (text), which makes it difficult for processing. However, there is scoutID feature that may work. This feature must have unique values for all 268,849 apartment records. 

In [87]:
immo_data["scoutId"].nunique()

268850

Perfect! Each scoutId value is distinct and exists for all 268,850 records. 

Let's now move scoutId to the first column as a unique identifier for each record. 

In [94]:
cols = list(immo_data) #Get a list of column names for the dataset
#Get the list index of scoutId using list index method, pop it from the list using the list pop method, and insert the value (scoutId) as the first list item using the list insert method
cols.insert(0,cols.pop(cols.index("scoutId")))
#Use the ix indexer to rearrange the dataframe based on the new list column names

AttributeError: 'DataFrame' object has no attribute 'ix'

In [93]:
cols

['scoutId',
 'regio1',
 'serviceCharge',
 'heatingType',
 'telekomTvOffer',
 'telekomHybridUploadSpeed',
 'newlyConst',
 'balcony',
 'picturecount',
 'pricetrend',
 'telekomUploadSpeed',
 'totalRent',
 'yearConstructed',
 'noParkSpaces',
 'firingTypes',
 'hasKitchen',
 'geo_bln',
 'cellar',
 'yearConstructedRange',
 'baseRent',
 'houseNumber',
 'livingSpace',
 'geo_krs',
 'condition',
 'interiorQual',
 'petsAllowed',
 'street',
 'streetPlain',
 'lift',
 'baseRentRange',
 'typeOfFlat',
 'geo_plz',
 'noRooms',
 'thermalChar',
 'floor',
 'numberOfFloors',
 'noRoomsRange',
 'garden',
 'livingSpaceRange',
 'regio2',
 'regio3',
 'description',
 'facilities',
 'heatingCosts',
 'energyEfficiencyClass',
 'lastRefurbish',
 'electricityBasePrice',
 'electricityKwhPrice',
 'date']

We observe a variety of prices, costs and factors that may be used to determine rent. These includes serviceCharge, priceTrend, totalRent, baseRent, baseRentRange, heatingCosts, electricityBasePrice, electricityKwhPrice.

In [30]:
immo_data.dtypes

regio1                       object
serviceCharge               float64
heatingType                  object
telekomTvOffer               object
telekomHybridUploadSpeed    float64
newlyConst                     bool
balcony                        bool
picturecount                  int64
pricetrend                  float64
telekomUploadSpeed          float64
totalRent                   float64
yearConstructed             float64
scoutId                       int64
noParkSpaces                float64
firingTypes                  object
hasKitchen                     bool
geo_bln                      object
cellar                         bool
yearConstructedRange        float64
baseRent                    float64
houseNumber                  object
livingSpace                 float64
geo_krs                      object
condition                    object
interiorQual                 object
petsAllowed                  object
street                       object
streetPlain                 

In [31]:
immo_data[['serviceCharge', 'pricetrend', 'totalRent', 'baseRent', 'heatingCosts', 'electricityBasePrice']]

Unnamed: 0,serviceCharge,pricetrend,totalRent,baseRent,heatingCosts,electricityBasePrice
0,245.00,4.62,840.0,595.0,,
1,134.00,3.47,,800.0,,
2,255.00,2.72,1300.0,965.0,,
3,58.15,1.53,,343.0,87.23,
4,138.00,2.46,903.0,765.0,,
...,...,...,...,...,...,...
268845,90.00,2.74,910.0,820.0,,
268846,220.00,6.49,1150.0,930.0,,
268847,220.00,2.90,930.0,650.0,,
268848,175.00,4.39,1015.0,840.0,40.00,


In [32]:
immo_data.describe()

Unnamed: 0,serviceCharge,telekomHybridUploadSpeed,picturecount,pricetrend,telekomUploadSpeed,totalRent,yearConstructed,scoutId,noParkSpaces,yearConstructedRange,baseRent,livingSpace,baseRentRange,geo_plz,noRooms,thermalChar,floor,numberOfFloors,noRoomsRange,livingSpaceRange,heatingCosts,lastRefurbish,electricityBasePrice,electricityKwhPrice
count,261941.0,45020.0,268850.0,267018.0,235492.0,228333.0,211805.0,268850.0,93052.0,211805.0,268850.0,268850.0,268850.0,268850.0,268850.0,162344.0,217541.0,171118.0,268850.0,268850.0,85518.0,80711.0,46846.0,46846.0
mean,151.206113,10.0,9.791958,3.389001,28.804928,901.3315,1966.40059,106969700.0,1.327634,3.714544,694.1294,74.355548,3.765256,37283.022235,2.641261,114.749533,2.122405,3.572319,2.571542,3.07079,76.990866,2013.904536,89.113612,0.199769
std,308.29579,0.0,6.408399,1.964874,16.337151,33238.33,46.992207,12500930.0,8.361403,2.738134,19536.02,254.759208,2.214357,27798.037296,2.63344,61.653663,3.634934,6.375496,0.937594,1.407127,147.716278,10.963125,5.395805,0.009667
min,0.0,10.0,0.0,-12.33,1.0,0.0,1000.0,28871740.0,0.0,1.0,0.0,0.0,1.0,852.0,1.0,0.1,-1.0,0.0,1.0,1.0,0.0,1015.0,71.43,0.1705
25%,95.0,10.0,6.0,2.0,10.0,469.8,1950.0,106691000.0,1.0,1.0,338.0,54.0,2.0,9128.0,2.0,79.0,1.0,2.0,2.0,2.0,54.0,2012.0,90.76,0.1915
50%,135.0,10.0,9.0,3.39,40.0,650.0,1973.0,111158400.0,1.0,3.0,490.0,67.32,3.0,38667.0,3.0,107.0,2.0,3.0,3.0,3.0,70.0,2017.0,90.76,0.1985
75%,190.0,10.0,13.0,4.57,40.0,985.0,1996.0,113768800.0,1.0,5.0,799.0,87.0,5.0,57072.0,3.0,140.3,3.0,4.0,3.0,4.0,90.0,2019.0,90.76,0.2055
max,146118.0,10.0,121.0,14.92,100.0,15751540.0,2090.0,115711700.0,2241.0,9.0,9999999.0,111111.0,9.0,99998.0,999.99,1996.0,999.0,999.0,5.0,7.0,12613.0,2919.0,90.76,0.2276


In [33]:
immo_data.regio1.nunique()

16

In [34]:
immo_data.regio1.value_counts()

Nordrhein_Westfalen       62863
Sachsen                   58154
Bayern                    21609
Sachsen_Anhalt            20124
Hessen                    17845
Niedersachsen             16593
Baden_Württemberg         16091
Berlin                    10406
Thüringen                  8388
Rheinland_Pfalz            8368
Brandenburg                6954
Schleswig_Holstein         6668
Mecklenburg_Vorpommern     6634
Hamburg                    3759
Bremen                     2965
Saarland                   1429
Name: regio1, dtype: int64

In [35]:
immo_data.heatingType.head()

0                   central_heating
1    self_contained_central_heating
2                     floor_heating
3                  district_heating
4    self_contained_central_heating
Name: heatingType, dtype: object

In [36]:
immo_data.heatingType.nunique()

13

In [37]:
immo_data.heatingType.unique()

array(['central_heating', 'self_contained_central_heating',
       'floor_heating', 'district_heating', 'gas_heating', 'oil_heating',
       nan, 'wood_pellet_heating', 'electric_heating',
       'combined_heat_and_power_plant', 'heat_pump',
       'night_storage_heater', 'stove_heating', 'solar_heating'],
      dtype=object)

In [38]:
immo_data.heatingType.value_counts()

central_heating                   128977
district_heating                   24808
gas_heating                        19955
self_contained_central_heating     19087
floor_heating                      17697
oil_heating                         5042
heat_pump                           2737
combined_heat_and_power_plant       1978
night_storage_heater                1341
wood_pellet_heating                  961
electric_heating                     901
stove_heating                        344
solar_heating                        166
Name: heatingType, dtype: int64

In [39]:
immo_data.heatingType.describe()

count              223994
unique                 13
top       central_heating
freq               128977
Name: heatingType, dtype: object

In [40]:
immo_data['telekomTvOffer'].head()

0    ONE_YEAR_FREE
1    ONE_YEAR_FREE
2    ONE_YEAR_FREE
3    ONE_YEAR_FREE
4              NaN
Name: telekomTvOffer, dtype: object

In [41]:
immo_data.telekomTvOffer.nunique()

3

In [42]:
immo_data.telekomTvOffer.value_counts()

ONE_YEAR_FREE    227632
NONE               4957
ON_DEMAND          3642
Name: telekomTvOffer, dtype: int64

In [43]:
immo_data.telekomTvOffer.unique()

array(['ONE_YEAR_FREE', nan, 'NONE', 'ON_DEMAND'], dtype=object)

In [44]:
immo_data.heatingType.value_counts()

central_heating                   128977
district_heating                   24808
gas_heating                        19955
self_contained_central_heating     19087
floor_heating                      17697
oil_heating                         5042
heat_pump                           2737
combined_heat_and_power_plant       1978
night_storage_heater                1341
wood_pellet_heating                  961
electric_heating                     901
stove_heating                        344
solar_heating                        166
Name: heatingType, dtype: int64

In [45]:
immo_data.firingTypes.unique()

array(['oil', 'gas', nan, 'district_heating', 'gas:electricity',
       'electricity', 'pellet_heating', 'natural_gas_light',
       'combined_heat_and_power_fossil_fuels',
       'district_heating:local_heating', 'steam_district_heating',
       'natural_gas_heavy', 'gas:district_heating', 'solar_heating:gas',
       'environmental_thermal_energy', 'local_heating',
       'gas:natural_gas_light', 'geothermal',
       'combined_heat_and_power_regenerative_energy', 'heat_supply',
       'oil:electricity', 'solar_heating', 'geothermal:solar_heating',
       'district_heating:electricity', 'liquid_gas', 'wood',
       'hydro_energy', 'combined_heat_and_power_renewable_energy', 'coal',
       'gas:steam_district_heating', 'bio_energy',
       'gas:environmental_thermal_energy', 'wood_chips', 'gas:oil',
       'solar_heating:wood', 'geothermal:gas',
       'solar_heating:gas:electricity',
       'gas:natural_gas_light:heat_supply', 'pellet_heating:gas',
       'solar_heating:gas:bio_energy'

There are a lot of firingTypes. Need to split the items to see which of these are unique

In [46]:
split_data = immo_data["firingTypes"].str.split(":", expand = True)

In [47]:
split_data.tail(50)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
268800,electricity,,,,,,,,,,,,
268801,gas,,,,,,,,,,,,
268802,gas,,,,,,,,,,,,
268803,gas,,,,,,,,,,,,
268804,gas,,,,,,,,,,,,
268805,natural_gas_light,,,,,,,,,,,,
268806,district_heating,,,,,,,,,,,,
268807,gas,,,,,,,,,,,,
268808,,,,,,,,,,,,,
268809,,,,,,,,,,,,,


We have an entry with 13 possible firing types. But most entries have one. We'll give these firing types columns from 1 to 13

In [48]:
split_data.columns = ["fuelType1", "fuelType2", "fuelType3", "fuelType4", "fuelType5", "fuelType6", "fuelType7", "fuelType8", "fuelType9", "fuelType10", "fuelType11", "fuelType12", "fuelType13"]

In [49]:
fueltypes = [[x, split_data[x].count()] for x in ["fuelType1", "fuelType2", "fuelType3", "fuelType4", "fuelType5", "fuelType6", "fuelType7", "fuelType8", "fuelType9", "fuelType10", "fuelType11", "fuelType12", "fuelType13"]]

In [50]:
fueltypes

[['fuelType1', 211886],
 ['fuelType2', 3407],
 ['fuelType3', 69],
 ['fuelType4', 17],
 ['fuelType5', 4],
 ['fuelType6', 4],
 ['fuelType7', 2],
 ['fuelType8', 2],
 ['fuelType9', 2],
 ['fuelType10', 2],
 ['fuelType11', 2],
 ['fuelType12', 2],
 ['fuelType13', 1]]

There are more entries in fuelType 1 and 2 than the others. We select only these columns and add to the dataframe

In [51]:
primaryFuels = split_data[["fuelType1","fuelType2"]]

In [52]:
immo_data = pd.concat([immo_data, primaryFuels], axis = 1)

In [53]:
immo_data.shape

(268850, 51)

In [54]:
immo_data[["heatingType", "firingTypes", "fuelType1", "fuelType2"]].head()

Unnamed: 0,heatingType,firingTypes,fuelType1,fuelType2
0,central_heating,oil,oil,
1,self_contained_central_heating,gas,gas,
2,floor_heating,,,
3,district_heating,district_heating,district_heating,
4,self_contained_central_heating,gas,gas,


We can drop the firingTypes column since we have captured relevant information in 

In [55]:
del immo_data["firingTypes"]

In [56]:
immo_data.shape

(268850, 50)

In [57]:
immo_data.fuelType1.unique()

array(['oil', 'gas', nan, 'district_heating', 'electricity',
       'pellet_heating', 'natural_gas_light',
       'combined_heat_and_power_fossil_fuels', 'steam_district_heating',
       'natural_gas_heavy', 'solar_heating',
       'environmental_thermal_energy', 'local_heating', 'geothermal',
       'combined_heat_and_power_regenerative_energy', 'heat_supply',
       'liquid_gas', 'wood', 'hydro_energy',
       'combined_heat_and_power_renewable_energy', 'coal', 'bio_energy',
       'wood_chips', 'combined_heat_and_power_bio_energy', 'wind_energy',
       'coal_coke'], dtype=object)

In [58]:
immo_data.fuelType1.nunique()

25

In [59]:
immo_data.fuelType2.nunique()

22

In [60]:
immo_data.fuelType1.describe()

count     211886
unique        25
top          gas
freq      112712
Name: fuelType1, dtype: object

In [61]:
immo_data.geo_bln.head()

0    Nordrhein_Westfalen
1        Rheinland_Pfalz
2                Sachsen
3                Sachsen
4                 Bremen
Name: geo_bln, dtype: object

In [62]:
immo_data[["regio1", "geo_bln"]].tail(50)

Unnamed: 0,regio1,geo_bln
268800,Hessen,Hessen
268801,Mecklenburg_Vorpommern,Mecklenburg_Vorpommern
268802,Brandenburg,Brandenburg
268803,Nordrhein_Westfalen,Nordrhein_Westfalen
268804,Nordrhein_Westfalen,Nordrhein_Westfalen
268805,Nordrhein_Westfalen,Nordrhein_Westfalen
268806,Baden_Württemberg,Baden_Württemberg
268807,Sachsen,Sachsen
268808,Saarland,Saarland
268809,Nordrhein_Westfalen,Nordrhein_Westfalen


In [63]:
immo_data['geo_bln'].unique()

array(['Nordrhein_Westfalen', 'Rheinland_Pfalz', 'Sachsen', 'Bremen',
       'Schleswig_Holstein', 'Baden_Württemberg', 'Thüringen', 'Hessen',
       'Niedersachsen', 'Bayern', 'Hamburg', 'Sachsen_Anhalt',
       'Mecklenburg_Vorpommern', 'Berlin', 'Brandenburg', 'Saarland'],
      dtype=object)

regio1 and geo_bln appear to have the same values. We can therefore go ahead and drop the column

In [64]:
del immo_data['geo_bln']

In [65]:
immo_data.shape

(268850, 49)

In [66]:
immo_data.houseNumber

0         244
1         NaN
2           4
3          35
4          10
         ... 
268845    NaN
268846    NaN
268847     10
268848     58
268849      8
Name: houseNumber, Length: 268850, dtype: object

In [67]:
immo_data.geo_krs

0                        Dortmund
1               Rhein_Pfalz_Kreis
2                         Dresden
3             Mittelsachsen_Kreis
4                          Bremen
                   ...           
268845    Weilheim_Schongau_Kreis
268846           Bergstraße_Kreis
268847     Limburg_Weilburg_Kreis
268848                       Köln
268849          Frankfurt_am_Main
Name: geo_krs, Length: 268850, dtype: object

In [68]:
immo_data.geo_krs.nunique()

419

In [69]:
immo_data.geo_krs.unique()

array(['Dortmund', 'Rhein_Pfalz_Kreis', 'Dresden', 'Mittelsachsen_Kreis',
       'Bremen', 'Schleswig_Flensburg_Kreis', 'Emmendingen_Kreis',
       'Gelsenkirchen', 'Chemnitz', 'Südliche_Weinstraße_Kreis', 'Hamm',
       'Weimar', 'Main_Kinzig_Kreis', 'Duisburg', 'Göttingen_Kreis',
       'Neumünster', 'Stuttgart', 'Leipzig', 'München', 'Hamburg',
       'Braunschweig', 'Esslingen_Kreis', 'Magdeburg', 'Schwerin',
       'Passau', 'Mettmann_Kreis', 'Vogtlandkreis', 'Groß_Gerau_Kreis',
       'Sächsische_Schweiz_Osterzgebirge_Kreis', 'Görlitz_Kreis',
       'Rheinisch_Bergischer_Kreis', 'Essen', 'Meißen_Kreis', 'Mannheim',
       'Wesermarsch_Kreis', 'Hochsauerlandkreis', 'Unna_Kreis',
       'Bautzen_Kreis', 'Berlin', 'Frankfurt_am_Main', 'Halle_Saale',
       'Steinburg_Kreis', 'Aschaffenburg', 'Oder_Spree_Kreis',
       'Bremerhaven', 'Zwickau_Kreis', 'Nordsachsen_Kreis',
       'Mansfeld_Südharz_Kreis', 'Alzey_Worms_Kreis', 'Gießen_Kreis',
       'Main_Taunus_Kreis', 'Zwickau', 'Wupp

In [70]:
immo_data.select_dtypes('object')

Unnamed: 0,regio1,heatingType,telekomTvOffer,houseNumber,geo_krs,condition,interiorQual,petsAllowed,street,streetPlain,typeOfFlat,regio2,regio3,description,facilities,energyEfficiencyClass,date,fuelType1,fuelType2
0,Nordrhein_Westfalen,central_heating,ONE_YEAR_FREE,244,Dortmund,well_kept,normal,,Sch&uuml;ruferstra&szlig;e,Schüruferstraße,ground_floor,Dortmund,Schüren,Die ebenerdig zu erreichende Erdgeschosswohnun...,Die Wohnung ist mit Laminat ausgelegt. Das Bad...,,May19,oil,
1,Rheinland_Pfalz,self_contained_central_heating,ONE_YEAR_FREE,,Rhein_Pfalz_Kreis,refurbished,normal,no,no_information,,ground_floor,Rhein_Pfalz_Kreis,Böhl_Iggelheim,Alles neu macht der Mai – so kann es auch für ...,,,May19,gas,
2,Sachsen,floor_heating,ONE_YEAR_FREE,4,Dresden,first_time_use,sophisticated,,Turnerweg,Turnerweg,apartment,Dresden,Äußere_Neustadt_Antonstadt,Der Neubau entsteht im Herzen der Dresdner Neu...,"* 9 m² Balkon\n* Bad mit bodengleicher Dusche,...",,Oct19,,
3,Sachsen,district_heating,ONE_YEAR_FREE,35,Mittelsachsen_Kreis,,,,Gl&uuml;ck-Auf-Stra&szlig;e,Glück-Auf-Straße,other,Mittelsachsen_Kreis,Freiberg,Abseits von Lärm und Abgasen in Ihre neue Wohn...,,,May19,district_heating,
4,Bremen,self_contained_central_heating,,10,Bremen,refurbished,,,Hermann-Henrich-Meier-Allee,Hermann-Henrich-Meier-Allee,apartment,Bremen,Neu_Schwachhausen,Es handelt sich hier um ein saniertes Mehrfami...,Diese Wohnung wurde neu saniert und ist wie fo...,,Feb20,gas,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
268845,Bayern,heat_pump,ONE_YEAR_FREE,,Weilheim_Schongau_Kreis,mint_condition,sophisticated,no,no_information,,roof_storey,Weilheim_Schongau_Kreis,Eberfing,"Diese schöne, neuwertige Wohnung im Dachgescho...",Fliesen und Parkett. Sichtbarer Dachstuhl.,,Feb20,geothermal,
268846,Hessen,gas_heating,,,Bergstraße_Kreis,well_kept,sophisticated,negotiable,no_information,,apartment,Bergstraße_Kreis,Viernheim,Hier wird eine Wohnung im 2 Familienhaus angeb...,"Parkett, Kamin, Badewanne&Dusche\nGroßer Balko...",,May19,gas,
268847,Hessen,central_heating,ONE_YEAR_FREE,10,Limburg_Weilburg_Kreis,well_kept,,negotiable,Emsbachstrasse,Emsbachstrasse,apartment,Limburg_Weilburg_Kreis,Limburg_an_der_Lahn,gemütliche 4-Zimmer-Wohnung im Obergeschoss ei...,"Böden: Wohn-/Schlafbereich = Laminat, Küche + ...",,Feb20,gas,
268848,Nordrhein_Westfalen,heat_pump,,58,Köln,first_time_use,sophisticated,no,Idastra&szlig;e,Idastraße,apartment,Köln,Dellbrück,"Neubau Erstbezug, gehobener Standard, alle Ein...","Wände:\nMaler­vlies, weiß gestrichen alter­nat...",NO_INFORMATION,May19,gas,
