# TampereBNB Listings - Price Prediction
<p> Get ready for an exhilarating data science adventure! In this exciting assignment, you will dive into the world of TampereBNB, the popular platform for short-term accommodation rentals. 
    <br>
    Your mission? To analyze data from this platform and use your data science skills to predict missing prices for some of the listings, using the tools mentioned in the following cell. </p>
<br>

## Instructions
- Train a regression model of your choice on predicting the listing prices of the training data. 
- Use the trained model to get the price predictions for the listings in the testing data.
- Store the resulting dataframe as a pickled (out.pkl) file. 

**NOTE: The code snippets for loading the data files and outputting the resulting dataframe, are provided. Do not update them.**


#### Accessing the dataset
To facilitate your work, we have created two separate training and testing TampereBNB csv files, located within the `data/` folder. Make sure the path to the files is the same, before submitting your solution.

#### TODO

- You are expected to predict prices for the listings on the testing data by using the following libraries (Besides the built-in python modules, specific libraries can be included upon request):
    - scikit-learn (sklearn)
    - pandas
    - numpy
    
- Store your predictions as a dataframe with the attribute `Hinta` (case sensitive).
- Save the dataframe in a pickle file, `out.pkl` (case sensitive).



## Importing the required packages

In [694]:
import sklearn
import numpy as np
import pandas as pd 

from sklearn import preprocessing
from sklearn.impute import SimpleImputer as Imputer
from sklearn.ensemble import RandomForestClassifier

## Loading Data

In [695]:
# !MAKE SURE TO NOT CHANGE THE CODE WITHIN THIS CELL!. 
# Instead, put the data files within a folder named 'data' such that the paths would work.
training_df = pd.read_csv('./data/Tampere_BNB_training_listing.csv')
testing_df = pd.read_csv('./data/Tampere_BNB_testing_listing.csv')


In [696]:
print(training_df.head())
training_df.info()

  Kaupunginosa           Huoneisto Talot.    m2    Rv  Krs Hissi  Kunto  \
0  Niemenranta       2h , kt, s, p     kt  50.0  2020  2/6    on   hyvä   
1       Vuores            1 H + KT     kt  28.0  2018  1/4    on   hyvä   
2  Niemenranta     3 h , kt , s, p     kt  63.0  2020  3/6    on   hyvä   
3     Keskusta  3h+k+vh+kph/wc+...     kt  84.0  1964  5/7    on    NaN   
4     Hervanta   2 h, kk, s, ph, p     kt  52.0  1995  6/6    on  tyyd.   

   Asunnon tyyppi  Pituusaste  Leveysaste  Hinta  
0  Kaksi huonetta   23.696606   61.524269    300  
1           Yksiö   23.804092   61.433185    162  
2  Kolme huonetta   23.696636   61.519368    363  
3  Kolme huonetta   24.062369   61.463896    483  
4  Kaksi huonetta   23.848751   61.446601    174  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Kaupunginosa    1080 non-null   object 
 

In [697]:
print(testing_df.head())
testing_df.info()

   Kaupunginosa           Huoneisto Talot.     m2    Rv  Krs Hissi  Kunto  \
0        Kaleva  2h, k, rt, kph,...     kt   56.0  1956  3/4    on  tyyd.   
1      Keskusta             2h,kk,s     kt   56.0  1908  NaN    on   hyvä   
2  Hämeenpuisto  4-5 h,avok,2xkp...     kt  184.0  1928  3/6    on   hyvä   
3      Viinikka       3h, k, kph, p     kt   65.0  1956  2/2    ei   hyvä   
4       Leinola  4 h + k + s + w...     ok  112.0  1978  NaN    ei   hyvä   

               Asunnon tyyppi  Pituusaste  Leveysaste  
0              Kaksi huonetta   23.805462   61.496083  
1              Kaksi huonetta   24.064453   61.468052  
2  Neljä huonetta tai enemmän   23.751403   61.492285  
3              Kolme huonetta   23.786301   61.489502  
4  Neljä huonetta tai enemmän   23.914912   61.490368  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 360 entries, 0 to 359
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   K

## Data Cleaning (optional)

In [698]:
# Clean the data for further processing

In [699]:
#Imputation strategy: Replace Non-existing values with the most frequent value, which is OK with at least Kunto
cols_to_clean =['Krs','Kunto']

imputer = Imputer(strategy='most_frequent')
training_df[cols_to_clean] = imputer.fit_transform(training_df[cols_to_clean])
training_df=training_df.dropna()

testing_df[cols_to_clean] = imputer.fit_transform(testing_df[cols_to_clean])
testing_df=testing_df.dropna()

## Feature Engineering (optional)

In [700]:
# Feature engineering/extraction/discovery to extract features new from the raw data.

In [701]:
cols_to_remove = ['Huoneisto','Kaupunginosa'] # ,'Pituusaste','Leveysaste'
training_df = training_df.drop(cols_to_remove, axis=1)
testing_df = testing_df.drop(cols_to_remove, axis=1)

In [702]:
taulukot = [training_df, testing_df]

for x in range(len(taulukot)):

    # Convert Krs to numerical values
    y = []
    for i in taulukot[x]['Krs']:
        a = int(i.split('/')[0])
        y.append(abs(a))

    taulukot[x]['Krs'] = y
    taulukot[x]['Krs'] = taulukot[x]['Krs']#.astype(int)

training_df = pd.get_dummies(training_df)
testing_df = pd.get_dummies(testing_df)

## Data Modeling

In [703]:
# Implement your prediction solution here.

In [704]:
# x = training_df.drop('Hinta', axis=1).values #returns numpy array
# min_max_scaler = preprocessing.MinMaxScaler()
# x_scaled = min_max_scaler.fit_transform(x)
# xMinMax = pd.DataFrame(x_scaled)

In [705]:
# Initialize the classifiers that we will be testing
clf = RandomForestClassifier()

#train model with train data
x = training_df.drop('Hinta', axis=1)
clf.fit(X=x, y=training_df['Hinta'])
   
#predict test data
predictions = clf.predict(X=testing_df)

In [690]:
# Make sure to store your predictions in the 'Hinta' attribute of the testing dataframe.

# By default, the following assignment will initialize the 'Hinta' column with the given constant.
# testing_df['Hinta'] = 1000000
testing_df['Hinta'] = predictions


In [706]:
predictions

array([237, 309, 519, 303, 279, 387, 645, 261, 213, 276, 261, 159, 327,
       390, 120, 192, 303, 297, 234, 195, 243, 246, 204, 264, 384, 246,
       237, 393, 168, 321, 510, 309, 348, 318, 282, 273, 312, 282, 357,
       351, 120, 204, 261, 180, 237, 192, 150, 354, 327, 336, 327, 168,
       282, 234, 159, 234, 255, 207, 297, 258, 213, 114, 204, 168, 291,
       207, 405, 195, 309, 186, 531, 225, 177, 255, 309, 192, 246, 651,
       651, 231, 336, 144, 366, 114, 228, 168, 243, 252, 498, 246, 360,
       168, 180, 384, 231, 177, 342, 219, 240, 192, 237, 585, 192, 327,
       270, 390, 405, 255, 351, 222, 399, 594, 267, 150, 264, 171, 276,
       180, 300, 285, 153, 174, 231, 363, 285, 357, 243, 192, 213, 471,
       180, 210, 198, 168, 324, 147, 171, 237, 225, 294, 273, 315, 264,
       120, 432, 708, 243, 213, 360, 138, 153, 177, 147, 147, 243, 183,
       159, 426, 264, 237, 165, 132, 285, 213, 186, 342, 258, 393, 150,
       354, 276, 234, 363, 384, 357, 303, 159, 294, 237, 315, 23

In [691]:
predictions

array([471, 471, 471, 399, 840, 492, 471, 471, 336, 471, 399, 492, 840,
       675, 675, 399, 492, 471, 399, 336, 336, 471, 336, 471, 675, 336,
       471, 471, 471, 675, 492, 336, 399, 399, 399, 399, 471, 471, 675,
       399, 336, 336, 471, 471, 336, 471, 471, 471, 471, 399, 471, 471,
       675, 399, 675, 471, 675, 336, 399, 399, 471, 336, 336, 471, 471,
       471, 471, 471, 399, 471, 492, 336, 336, 399, 399, 471, 471, 675,
       675, 471, 399, 675, 399, 336, 471, 399, 471, 399, 675, 399, 471,
       675, 336, 471, 471, 336, 471, 336, 471, 471, 471, 471, 492, 399,
       399, 675, 675, 399, 675, 336, 399, 840, 675, 471, 471, 471, 471,
       336, 675, 399, 399, 471, 336, 675, 471, 675, 399, 471, 471, 471,
       336, 399, 336, 471, 471, 399, 399, 336, 471, 471, 399, 399, 492,
       336, 471, 471, 399, 336, 675, 675, 336, 471, 399, 399, 336, 336,
       399, 840, 336, 336, 471, 471, 471, 471, 471, 675, 471, 471, 336,
       492, 675, 399, 675, 471, 492, 471, 675, 471, 336, 399, 39

## Store the results

In [652]:
# !MAKE SURE TO NOT CHANGE THE CODE WITHIN THIS CELL!. 
testing_df.to_pickle("out.pkl")

In [653]:
# #Imputation strategy: Replace Non-existing values with the most frequent value in the union of the two dataframes
# krs_mode = pd.concat([training_df['Krs'], testing_df['Krs']], axis=0).mode()
# kunto_mode = pd.concat([training_df['Kunto'], testing_df['Kunto']], axis=0).mode()

# # cols_to_clean =['Krs','Kunto']

# imputer = Imputer(strategy='constant', fill_value=krs_mode)
# training_df['Krs'] = imputer.fit_transform(training_df['Krs'])

# imputer = Imputer(strategy='constant', fill_value=kunto_mode)
# training_df['Kunto'] = imputer.fit_transform(training_df['Kunto'])

# training_df=training_df.dropna()

In [654]:
# # Convert Kaupunginosa to numerical values
# y = []
# for i in training_df['']:
#     if i == '':
#         y.append(1)
#     else:
#         y.append(0)

# training_df = training_df.drop('', axis=1)
# pd.concat()

# 

# for x in [testing_df]: # , testing_df

#     # Convert Talot. to numerical values
#     y = []
#     for i in x['Talot.']:
#         if i == 'ok':
#             y.append(2)
#         elif i == 'rt':
#             y.append(1)
#         else: # kt
#             y.append(0)

#     x = x.drop('Talot.', axis=1)
#     x['Talot.'] = y

#     # Convert Krs to numerical values
#     y = []
#     for i in x['Krs']:
#         y.append(i.split('/')[0])

#     x = x.drop('Krs', axis=1)
#     x['Krs'] = y

#     # Convert Hissi to numerical values
#     y = []
#     for i in x['Hissi']:
#         if i == 'on':
#             y.append(1)
#         else: # ei
#             y.append(0)

#     x = x.drop('Hissi', axis=1)
#     x['Hissi'] = y


#     # Convert Kunto to numerical values
#     y = []
#     for i in x['Kunto']:
#         if i == 'hyvä':
#             y.append(2)
#         elif i == 'tyyd.':
#             y.append(1)
#         else: # huono
#             y.append(0)

#     x = x.drop('Kunto', axis=1)
#     x['Kunto'] = y


#     # Convert Asunnon tyyppi to numerical values
#     y = []
#     for i in x['Asunnon tyyppi']:
#         if i == 'Yksiö':
#             y.append(1)
#         elif i == 'Kaksi huonetta':
#             y.append(2)
#         elif i == 'Kolme huonetta':
#             y.append(3)
#         else:
#             y.append(4)

#     x = x.drop('Asunnon tyyppi', axis=1)
#     x['Asunnon tyyppi'] = y