## NAME : Emre Koç
* In this project, the aim is to build a linear regression model that predicts price. After examining the data, I concluded that simply choosing relevant features would not be enough. The dataset as a whole is too general: some features lose information if they are excluded, but mislead the model if they are included as they are. Instead of only including or excluding such features, I filter on some of them, so the resulting model targets a subset of the data rather than all of it.

# Installing Library

* Scikit-learn is the main library of this project. I will use the algorithms it provides to train my model. Although these algorithms could be implemented from scratch without any library, using a solid, well-defined library makes the work much simpler.

In [5]:
! pip install scikit-learn



# Importing Library
* Pandas is the library that I will use for data preprocessing.

In [6]:
import pandas as pd


# Data Preprocessing - Feature Engineering

* This is the most important step of the training process. After carefully investigating the dataset, I identified the features that I will include. 'room_type' is not one of the training features; instead, I use it to select the target data. I will estimate the price of listings whose room type is 'Entire home/apt'. This filtering was necessary because the dataset covers a broad range of home categories, and a single linear model over all of them would be unlikely to perform well. Encoding room_type and using it as a feature was also an option.

* After the filtering, I chose my features as 'accommodates', 'bathrooms_text', 'beds', 'number_of_reviews', and 'review_scores_rating'. Having made this choice using both intuition and investigation, I thought that further feature engineering would be beneficial, so I combined 'number_of_reviews' and 'review_scores_rating' by multiplying them. The idea is to treat the new feature as the total number of rating points the host has received. I also thought that the relationship between the features and price is not purely linear and that some features carry more weight than others, so I used polynomial features to capture more complex relationships.

* Finally, I also performed data cleaning and modifications: I dropped rows with missing values and converted the 'bathrooms_text' feature to a numerical value by extracting the number from the text. At the end, I have 5895 samples to start the training.
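The alternative mentioned above, encoding room_type instead of filtering on it, could look roughly like the sketch below. It is only an illustration and not part of the trained model: the toy rows and the `toy` and `encoded` names are made up, and one-hot encoding via `pd.get_dummies` is just one possible encoding.

```python
import pandas as pd

# Toy rows standing in for the listings data (values are invented for illustration).
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Entire home/apt"],
    "accommodates": [4, 2, 1, 6],
})

# One indicator column per room type. drop_first=True drops the first category
# (alphabetically "Entire home/apt"), which then acts as the baseline when the
# model also fits an intercept.
encoded = pd.get_dummies(toy, columns=["room_type"], drop_first=True, dtype=float)

print(encoded.columns.tolist())
```

With this encoding, all room types would stay in the training data, and the model would learn a separate price offset for each type relative to the baseline.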

In [7]:
df = pd.read_csv("data.csv")
df_main = df[["room_type", "accommodates", "bathrooms_text", "beds","number_of_reviews", "review_scores_rating", "price"]]


In [8]:
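import re

def first_number(text):
    # First number in the text as a float (e.g. "1.5 baths" -> 1.5), or None if there is none
    match = re.search(r'\d+\.?\d*', str(text))
    return float(match.group()) if match else None

# Keep only entire homes/apartments; .copy() avoids modifying a view of the original frame
df_main = df_main[df_main["room_type"] == "Entire home/apt"].copy()
df_main["bathrooms_text"] = df_main["bathrooms_text"].apply(first_number)
# Strip the currency symbol and thousands separators, e.g. "$1,250.00" -> 1250.0
df_main["price"] = df_main["price"].apply(lambda text: float(str(text).replace("$","").replace(",","").strip()))
df_main = df_main.dropna()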

df_main["accommodates"] = df_main["accommodates"].astype(float)
df_main["beds"] = df_main["beds"].astype(float)
df_main["number_of_reviews"] = df_main["number_of_reviews"].astype(float)
df_main["review_scores_rating"] = df_main["review_scores_rating"].astype(float)

In [9]:
df_main["weighted_review"] = df_main["review_scores_rating"] * df_main["number_of_reviews"]

X = df_main[["accommodates", "bathrooms_text", "beds", "weighted_review"]]
Y = df_main[["price"]]
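
To make the combined feature concrete, here is a small sketch of how 'weighted_review' behaves. The three rows are invented, and the 0 to 5 rating scale is an assumption about the dataset.

```python
import pandas as pd

# Hypothetical listings: many reviews with a good rating, few reviews with a
# perfect rating, and a moderate number of reviews with a mediocre rating.
toy_reviews = pd.DataFrame({
    "number_of_reviews": [100.0, 2.0, 50.0],
    "review_scores_rating": [4.5, 5.0, 3.0],
})

# Same operation as in the cell above: rating multiplied by review count.
toy_reviews["weighted_review"] = (
    toy_reviews["review_scores_rating"] * toy_reviews["number_of_reviews"]
)

print(toy_reviews)
```

The product is dominated by the review count, so a listing with many reviews scores far higher than a listing with a perfect rating but only a couple of reviews.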


# Train Model

* In this part, I apply data normalization and the polynomial features mentioned earlier. Normalization matters for linear regression in general, but in this case the scales of the features are not that different, so skipping it was also an option; I saw no harm in applying it anyway. Note that `preprocessing.normalize` returns a new array rather than modifying its input, so the calls in the cell below, whose return values are not stored, leave the features unchanged.

* For polynomial features, I used the PolynomialFeatures class from sklearn. I set interaction_only to False because I also need higher-order terms of the original features to capture their non-linear impact.

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures



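X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# normalize() scales each row (sample) to unit norm and returns a new array;
# the return values are not assigned, so X_train and X_test stay unscaled.
preprocessing.normalize(X_train)
preprocessing.normalize(X_test)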

poly = PolynomialFeatures(degree=2, interaction_only=False)
X_train = poly.fit_transform(X_train)
X_test = poly.transform(X_test)

model = LinearRegression()
model.fit(X_train, y_train)
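
`preprocessing.normalize` rescales each sample (row) to unit norm and returns a new array, whereas normalizing inputs for regression usually means scaling each feature (column). A minimal sketch of one way to chain per-feature scaling with the polynomial expansion is shown below. It is not the model trained above: the data is synthetic, and `X_demo`, `y_demo`, and `pipe` are names introduced only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with 4 features, like X above (values are random).
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 10, size=(200, 4))
y_demo = 20 * X_demo[:, 0] + 5 * X_demo[:, 1] ** 2 + rng.normal(0, 10, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on the training split only and then reused on the test
# split, so no test-set statistics leak into training.
pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, interaction_only=False),
    LinearRegression(),
)
pipe.fit(X_tr, y_tr)

demo_r2 = pipe.score(X_te, y_te)
print(f"R² on the synthetic hold-out split: {demo_r2:.3f}")
```

For ordinary least squares with a full degree-2 expansion, per-feature scaling mainly improves numerical conditioning; it would matter more if a regularized model such as Ridge were used instead.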

# Evaluate Model

* Here I evaluate the performance of the model. I calculated both the Mean Squared Error (MSE) and the R² score. In short, the model performed poorly: the MSE is about 18,744 (an RMSE of roughly 137 in price units) and R² is about 0.23, so the model explains less than a quarter of the variance in price on the test set. This indicates underfitting: the resulting model is too simple to capture the price pattern. I may need a more complex model than simple linear regression, such as a random forest.

In [11]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'MSE: {mse}')
print(f'R²: {r2}')



MSE: 18743.911536297983
R²: 0.23291157390208317
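
As a follow-up to the evaluation, the sketch below converts the reported MSE into an RMSE and shows how a random forest could be fitted and scored in the same way. The random forest part runs on synthetic data, so its score says nothing about the listings dataset; `X_syn`, `y_syn`, and `forest` are names introduced only for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# RMSE is in the same units as price, which makes the reported MSE easier to read.
rmse_reported = np.sqrt(18743.911536297983)  # MSE printed above
print(f"RMSE of the linear model: {rmse_reported:.1f}")

# Synthetic data with a non-linear signal, used only to demonstrate the API.
rng = np.random.default_rng(42)
X_syn = rng.uniform(0, 10, size=(300, 3))
y_syn = 50 * np.sin(X_syn[:, 0]) + 10 * X_syn[:, 1] + rng.normal(0, 5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)
y_hat = forest.predict(X_te)

forest_mse = mean_squared_error(y_te, y_hat)
forest_r2 = r2_score(y_te, y_hat)
print(f"Random forest on synthetic data: MSE={forest_mse:.1f}, R²={forest_r2:.3f}")
```

On the real data, the same X_train and y_train could be passed to the forest without the polynomial expansion, since tree ensembles can model non-linear effects directly; whether that improves on the R² above would have to be measured.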
