# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind, formulate a project plan, and implement that plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
# reference: ModelSelectionForLogisticRegression.ipynb
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df_book = pd.read_csv(bookReviewDataSet_filename, header=0)

df_book.head()

Unnamed: 0,Review,Positive Review
0,This was perhaps the best of Johannes Steinhof...,True
1,This very fascinating book is a story written ...,True
2,The four tales in this collection are beautifu...,True
3,The book contained more profanity than I expec...,False
4,We have now entered a second time of deep conc...,True


In [3]:
df_airbnb = pd.read_csv(airbnbDataSet_filename, header=0)
print(df_airbnb.columns)
df_airbnb.head()

Index(['name', 'description', 'neighborhood_overview', 'host_name',
       'host_location', 'host_about', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_listings_count',
       'host_total_listings_count', 'host_has_profile_pic',
       'host_identity_verified', 'neighbourhood_group_cleansed', 'room_type',
       'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'has_availability', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'review_scores_rating', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value',

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.8,0.17,True,8.0,...,4.79,4.86,4.41,False,3,3,0,0,0.33,9
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.09,0.69,True,1.0,...,4.8,4.71,4.64,False,1,1,0,0,4.86,6
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.0,0.25,True,1.0,...,5.0,4.5,5.0,False,1,1,0,0,0.02,3
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.0,1.0,True,1.0,...,4.42,4.87,4.36,False,1,0,1,0,3.68,4
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,,,True,1.0,...,4.95,4.94,4.92,False,1,0,1,0,0.87,7


In [4]:
df_airbnb['room_type'].unique()

array(['Entire home/apt', 'Private room', 'Hotel room', 'Shared room'],
      dtype=object)

In [5]:
df_airbnb['review_scores_rating'].describe()

count    28022.000000
mean         4.683482
std          0.505857
min          0.000000
25%          4.600000
50%          4.830000
75%          5.000000
max          5.000000
Name: review_scores_rating, dtype: float64

In [6]:
df_airbnb['review_scores_rating'].unique()

array([4.7 , 4.45, 5.  , 4.21, 4.91, 4.56, 4.88, 4.86, 4.87, 4.76, 4.52,
       4.89, 4.66, 4.74, 4.39, 4.81, 4.9 , 4.49, 4.14, 4.68, 4.75, 4.82,
       4.55, 4.58, 4.85, 4.93, 4.6 , 4.78, 4.83, 4.8 , 4.84, 4.41, 4.95,
       4.71, 4.69, 4.34, 4.05, 4.97, 4.43, 4.62, 4.77, 4.92, 4.33, 3.5 ,
       4.67, 4.61, 4.63, 4.42, 4.65, 4.3 , 4.  , 4.35, 4.79, 4.98, 4.72,
       4.37, 4.53, 4.23, 4.59, 4.99, 4.64, 4.51, 4.12, 4.57, 4.73, 4.28,
       4.17, 4.5 , 4.48, 4.96, 4.29, 3.67, 4.36, 4.94, 4.44, 4.54, 4.07,
       4.18, 4.25, 3.75, 4.38, 4.11, 3.8 , 4.13, 4.46, 4.15, 4.26, 2.  ,
       4.47, 0.  , 4.24, 4.4 , 4.09, 3.  , 4.31, 4.06, 4.16, 4.32, 4.2 ,
       4.27, 1.  , 4.08, 4.22, 3.9 , 3.6 , 1.5 , 4.19, 3.57, 3.86, 3.29,
       3.83, 3.88, 3.43, 3.33, 3.71, 2.67, 3.89, 2.5 , 3.78, 3.4 , 3.25,
       3.2 , 3.93, 3.13, 3.22, 4.1 , 2.33, 1.75, 3.91, 4.03, 4.02, 3.96,
       3.17, 3.98, 3.56, 3.55, 3.73, 2.89, 3.3 , 3.97, 3.94, 3.95, 3.92,
       3.63, 4.04, 2.25, 3.76, 3.81, 3.65, 3.79, 2.

In [7]:
df_airbnb['review_scores_rating']

0        4.70
1        4.45
2        5.00
3        4.21
4        4.91
         ... 
28017    5.00
28018    5.00
28019    1.00
28020    5.00
28021    5.00
Name: review_scores_rating, Length: 28022, dtype: float64

In [8]:
df_airbnb['review_scores_cleanliness']

0        4.62
1        4.49
2        5.00
3        3.73
4        4.82
         ... 
28017    5.00
28018    5.00
28019    1.00
28020    5.00
28021    5.00
Name: review_scores_cleanliness, Length: 28022, dtype: float64

## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1. The data set I will be using is the Airbnb NYC listings data set.
2. I will predict the listing price from the location, room type, review score, and description. The label is `price`.
3. This is a supervised learning problem. It is a regression problem because the label is a continuous value.
4. My features are the location, room type, review score, and description.
5. This is an important problem because a model trained on historical data can estimate the listing price for an apartment, house, or hotel room in a given area. This helps customers judge which listings are affordable, and it helps hosts decide how to set their prices.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
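One way to act on the outlier check mentioned above is the interquartile-range (IQR) rule. The sketch below is only an illustration: it uses a small made-up `price` column rather than the Airbnb data, and the 1.5 multiplier is the conventional default, not something this lab prescribes.

```python
import pandas as pd

# Hypothetical toy data standing in for a numeric column such as `price`
toy = pd.DataFrame({"price": [70, 85, 90, 100, 115, 120, 130, 180, 1000]})

# Interquartile range: the spread of the middle 50% of the values
q1 = toy["price"].quantile(0.25)
q3 = toy["price"].quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR beyond the quartiles are flagged as potential outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = toy[(toy["price"] < lower) | (toy["price"] > upper)]
print(outliers)
```

A box plot such as `sns.boxplot(x=toy["price"])` draws its whiskers with the same rule by default, so the flagged rows correspond to the points plotted beyond the whiskers.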

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [9]:
# check the datatypes in the data set
df_airbnb.dtypes

name                                             object
description                                      object
neighborhood_overview                            object
host_name                                        object
host_location                                    object
host_about                                       object
host_response_rate                              float64
host_acceptance_rate                            float64
host_is_superhost                                  bool
host_listings_count                             float64
host_total_listings_count                       float64
host_has_profile_pic                               bool
host_identity_verified                             bool
neighbourhood_group_cleansed                     object
room_type                                        object
accommodates                                      int64
bathrooms                                       float64
bedrooms                                        

In [10]:
# look at the size and shape of the set
print(df_airbnb.shape)

(28022, 50)


In [11]:
# summary statistics (count, mean, std, min, quartiles, max) for the numeric columns
df_airbnb.describe()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
count,16179.0,16909.0,28022.0,28022.0,28022.0,28022.0,25104.0,26668.0,28022.0,28022.0,...,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0
mean,0.906901,0.791953,14.554778,14.554778,2.874491,1.142174,1.329708,1.629556,154.228749,18.689387,...,4.8143,4.808041,4.750393,4.64767,9.5819,5.562986,3.902077,0.048283,1.758325,5.16951
std,0.227282,0.276732,120.721287,120.721287,1.860251,0.421132,0.700726,1.097104,140.816605,25.569151,...,0.438603,0.464585,0.415717,0.518023,32.227523,26.121426,17.972386,0.442459,4.446143,2.028497
min,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,29.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.01,1.0
25%,0.94,0.68,1.0,1.0,2.0,1.0,1.0,1.0,70.0,2.0,...,4.81,4.81,4.67,4.55,1.0,0.0,0.0,0.0,0.13,4.0
50%,1.0,0.91,1.0,1.0,2.0,1.0,1.0,1.0,115.0,30.0,...,4.96,4.97,4.88,4.78,1.0,1.0,0.0,0.0,0.51,5.0
75%,1.0,1.0,3.0,3.0,4.0,1.0,1.0,2.0,180.0,30.0,...,5.0,5.0,5.0,5.0,3.0,1.0,1.0,0.0,1.83,7.0
max,1.0,1.0,3387.0,3387.0,16.0,8.0,12.0,21.0,1000.0,1250.0,...,5.0,5.0,5.0,5.0,421.0,308.0,359.0,8.0,141.0,13.0


In [12]:
# count the missing values in each column
nan_count = df_airbnb.isnull().sum()
nan_count

name                                                5
description                                       570
neighborhood_overview                            9816
host_name                                           0
host_location                                      60
host_about                                      10945
host_response_rate                              11843
host_acceptance_rate                            11113
host_is_superhost                                   0
host_listings_count                                 0
host_total_listings_count                           0
host_has_profile_pic                                0
host_identity_verified                              0
neighbourhood_group_cleansed                        0
room_type                                           0
accommodates                                        0
bathrooms                                           0
bedrooms                                         2918
beds                        

In [13]:
condition = nan_count != 0 # look for all columns with missing values

col_names = nan_count[condition].index # get the column names
print(col_names)

Index(['name', 'description', 'neighborhood_overview', 'host_location',
       'host_about', 'host_response_rate', 'host_acceptance_rate', 'bedrooms',
       'beds'],
      dtype='object')


In [14]:
nan_cols = list(col_names)
nan_col_types = df_airbnb[nan_cols].dtypes
nan_col_types

name                      object
description               object
neighborhood_overview     object
host_location             object
host_about                object
host_response_rate       float64
host_acceptance_rate     float64
bedrooms                 float64
beds                     float64
dtype: object
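The columns with missing values split into text columns and numeric columns, and the two kinds usually need different handling. Below is a minimal sketch of mean imputation with a missingness flag for a numeric column and an empty-string fill for a text column. The column names mirror the Airbnb ones, but the values are invented, and whether mean imputation suits a column like `bedrooms` is a judgment call rather than a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 2.0, 3.0, np.nan],
    "description": ["Sunny loft", None, "Quiet room", "Studio", "Duplex"],
})

# Record which rows were missing before filling, so a model can still use that signal
toy["bedrooms_na"] = toy["bedrooms"].isnull().astype(int)

# Replace missing numeric values with the column mean (mean() skips NaN by default)
toy["bedrooms"] = toy["bedrooms"].fillna(toy["bedrooms"].mean())

# Text has no meaningful mean; here missing text becomes an empty string
toy["description"] = toy["description"].fillna("")

print(toy)
```

Dropping the affected rows is the other common option for text columns when only a small share of rows is missing.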

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

- I will create a new feature, the average review score, by averaging `review_scores_rating`, `review_scores_cleanliness`, `review_scores_checkin`, `review_scores_communication`, `review_scores_location`, and `review_scores_value`.
- I will use NLP techniques to process the description and transform it into numerical features.
- I will use one-hot encoding to transform the categorical values (room type, location) into numerical features.
- I will also refine the location values to make them more NYC-specific.
- My candidate models are linear regression, a decision tree, and a neural network. I will use grid search to tune hyperparameters and select among them.
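Grid search is a model selection procedure rather than a model: it fits one estimator per hyperparameter combination and compares them with cross-validation. The sketch below shows the idea with a decision tree regressor on synthetic data. The data, the parameter ranges, and the names `X_toy` and `y_toy` are assumptions for illustration, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: stands in for the prepared features)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 1, size=200)

# Candidate hyperparameter values; these ranges are illustrative only
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 25]}

# 5-fold cross-validation, scored by negative RMSE (higher is better)
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_toy, y_toy)

print(grid.best_params_)
print(-grid.best_score_)  # cross-validated RMSE of the best setting
```

Running the search on the training split only keeps the test split untouched for a final, unbiased evaluation.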

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import roc_auc_score, mean_squared_error
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
#get all the columns needed for this model (each column listed once; .copy() keeps later in-place edits away from df_airbnb)
df_airbnb_cleared = df_airbnb.loc[:, ['name', 'description', 'room_type', 'review_scores_rating', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'neighbourhood_group_cleansed', 'price']].copy()

In [6]:
#check the new columns in df_airbnb_cleared
df_airbnb_cleared.columns

Index(['name', 'description', 'room_type', 'review_scores_rating',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'neighbourhood_group_cleansed', 'price'],
      dtype='object')

In [7]:
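#create a new column 'average_review_score' that averages all the review scores into one holistic score
#note: review_scores_checkin and review_scores_communication appear twice in this list,
#so they count double in the mean; the output below reflects that weighting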
review_columns = ['review_scores_rating', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_checkin', 'review_scores_communication', 'review_scores_communication', 'review_scores_location', 'review_scores_value']
df_airbnb_cleared['average_review_score'] = df_airbnb[review_columns].mean(axis=1).round(2) 
print(df_airbnb_cleared['average_review_score'].head())

0    4.71
1    4.68
2    4.94
3    4.42
4    4.93
Name: average_review_score, dtype: float64
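The `review_columns` list above names `review_scores_checkin` and `review_scores_communication` twice, which changes the result of the row mean. A small sketch on made-up numbers (the column names `rating`, `checkin`, and `value` are hypothetical) shows the effect of listing a column twice:

```python
import pandas as pd

# Hypothetical one-row frame with three scores
toy = pd.DataFrame({"rating": [5.0], "checkin": [3.0], "value": [4.0]})

# Listing a column twice selects it twice, so it counts double in the row mean
weighted = toy[["rating", "checkin", "checkin", "value"]].mean(axis=1)

# Listing each column once gives the plain, unweighted average
unweighted = toy[["rating", "checkin", "value"]].mean(axis=1)

print(weighted[0], unweighted[0])
```

If an unweighted average is the goal, `list(dict.fromkeys(review_columns))` removes repeated names while keeping their order.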


In [8]:
df_airbnb_cleared.drop(columns=review_columns, inplace=True)
print(df_airbnb_cleared.columns)

Index(['name', 'description', 'room_type', 'neighbourhood_group_cleansed',
       'price', 'average_review_score'],
      dtype='object')


In [9]:
#one-hot encode the categorical room_type column by hand: one 0/1 indicator column per category
room_types =  list(df_airbnb_cleared['room_type'].unique())
print(room_types)
for room_type in room_types:
    df_airbnb_cleared['room_type_' + room_type] = np.where(df_airbnb_cleared['room_type']==room_type,1,0)

['Entire home/apt', 'Private room', 'Hotel room', 'Shared room']


In [10]:
print(df_airbnb_cleared.columns)

Index(['name', 'description', 'room_type', 'neighbourhood_group_cleansed',
       'price', 'average_review_score', 'room_type_Entire home/apt',
       'room_type_Private room', 'room_type_Hotel room',
       'room_type_Shared room'],
      dtype='object')


In [11]:
#one-hot encode the location (neighbourhood_group_cleansed) the same way
neighbourhoods = list(df_airbnb_cleared['neighbourhood_group_cleansed'].unique())
print(neighbourhoods)
for neighbourhood in neighbourhoods:
    df_airbnb_cleared['neighbourhoods_' + neighbourhood] = np.where(df_airbnb_cleared['neighbourhood_group_cleansed']==neighbourhood,1,0)

['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']


In [12]:
print(df_airbnb_cleared.columns)

Index(['name', 'description', 'room_type', 'neighbourhood_group_cleansed',
       'price', 'average_review_score', 'room_type_Entire home/apt',
       'room_type_Private room', 'room_type_Hotel room',
       'room_type_Shared room', 'neighbourhoods_Manhattan',
       'neighbourhoods_Brooklyn', 'neighbourhoods_Queens',
       'neighbourhoods_Staten Island', 'neighbourhoods_Bronx'],
      dtype='object')
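The two loops above build the indicator columns by hand. For comparison, here is a sketch of the same encoding with `pd.get_dummies` on a toy frame. The rows are invented, and the `prefix` values are chosen only to match the column names used in this notebook.

```python
import pandas as pd

# Hypothetical toy rows using the same column names as the notebook
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
    "neighbourhood_group_cleansed": ["Manhattan", "Queens", "Manhattan"],
})

# get_dummies creates one 0/1 column per category and drops the original columns
encoded = pd.get_dummies(
    toy,
    columns=["room_type", "neighbourhood_group_cleansed"],
    prefix=["room_type", "neighbourhoods"],
    dtype=int,
)
print(encoded.columns.tolist())
```

One difference to keep in mind: `get_dummies` only creates columns for categories present in the frame it is given, so it is usually applied before the train/test split, as the manual loops are here.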


In [13]:
#drop the original neighbourhood_group_cleansed and room_type column
df_airbnb_cleared.drop(columns=['neighbourhood_group_cleansed', 'room_type'], inplace=True)
print(df_airbnb_cleared.columns)

Index(['name', 'description', 'price', 'average_review_score',
       'room_type_Entire home/apt', 'room_type_Private room',
       'room_type_Hotel room', 'room_type_Shared room',
       'neighbourhoods_Manhattan', 'neighbourhoods_Brooklyn',
       'neighbourhoods_Queens', 'neighbourhoods_Staten Island',
       'neighbourhoods_Bronx'],
      dtype='object')


In [14]:
#check if there's null in description and name 
np.sum(df_airbnb_cleared.isnull(), axis = 0)

name                              5
description                     570
price                             0
average_review_score              0
room_type_Entire home/apt         0
room_type_Private room            0
room_type_Hotel room              0
room_type_Shared room             0
neighbourhoods_Manhattan          0
neighbourhoods_Brooklyn           0
neighbourhoods_Queens             0
neighbourhoods_Staten Island      0
neighbourhoods_Bronx              0
dtype: int64

In [15]:
#The data set has over 28,000 rows, so I will drop the few rows that are missing a name or description instead of imputing text.
df_airbnb_cleared.dropna(axis=0, inplace=True)
df_airbnb_cleared.isnull().sum()

name                            0
description                     0
price                           0
average_review_score            0
room_type_Entire home/apt       0
room_type_Private room          0
room_type_Hotel room            0
room_type_Shared room           0
neighbourhoods_Manhattan        0
neighbourhoods_Brooklyn         0
neighbourhoods_Queens           0
neighbourhoods_Staten Island    0
neighbourhoods_Bronx            0
dtype: int64

In [16]:
# Now we are done with handling the features
# split the data into X and y, where X holds the features and y is the label
y = df_airbnb_cleared['price']
X = df_airbnb_cleared.drop(columns='price')

In [17]:
#hold out 25% of the rows for testing (75% for training), with a fixed random state of 1234 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.75, random_state=1234)

#take a look at the head of the training set
X_train.head()

Unnamed: 0,name,description,average_review_score,room_type_Entire home/apt,room_type_Private room,room_type_Hotel room,room_type_Shared room,neighbourhoods_Manhattan,neighbourhoods_Brooklyn,neighbourhoods_Queens,neighbourhoods_Staten Island,neighbourhoods_Bronx
3042,Cozy Bedroom in best area of FiDi,"Hi everyone,<br />My bedroom is in a huge/atyp...",4.0,0,1,0,0,1,0,0,0,0
13901,豪华双人间 法拉盛步行地铁站,Basement studio<br /><br /><b>Guest access</b>...,3.88,0,1,0,0,0,0,1,0,0
19229,42nd Street Studio in Hell’s Kitchen,Studio apartment in a doorman building located...,4.84,1,0,0,0,1,0,0,0,0
23213,Magnificent place in a heart of Midtown,1 BR Unit for rent in central midtown (1 block...,4.0,1,0,0,0,1,0,0,0,0
249,Hancock Town House!-Stuyvesant Mews,"<b>The space</b><br />Hello, my name is Fred. ...",4.65,0,1,0,0,0,1,0,0,0


<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

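I will now use a TF-IDF vectorizer to turn the text into a numeric matrix: each row represents one listing, each column represents one word in the vocabulary, and each entry is that word's TF-IDF weight in that listing. The vectorizer is fit on the training set only and then applied to both the training and test sets.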
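To make the row and column layout concrete, here is a minimal sketch on three made-up listing texts. It calls `fit_transform` on the whole toy corpus for brevity, whereas the notebook fits on the training split only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical listing texts, not taken from the data set
docs = [
    "cozy room in brooklyn",
    "spacious loft in manhattan",
    "cozy loft",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

print(tfidf.shape)
print(sorted(vectorizer.vocabulary_))
```

The matrix has one row per document and one column per distinct word, which is why the real vocabulary of tens of thousands of words produces a very wide, mostly zero matrix.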

In [29]:
#transform description

# create a tfidf vectorizer 
tfidf_vectorizer = TfidfVectorizer()

# fit the vectorizer to X_train description
tfidf_vectorizer.fit(X_train['description'])

# print the first 50 items in the vocabulary
print("Vocabulary size {0}: ".format(len(tfidf_vectorizer.vocabulary_)))
print(str(list(tfidf_vectorizer.vocabulary_.items())[0:50])+'\n')

#transform the training and test data 
X_train_tfidf = tfidf_vectorizer.transform(X_train['description'])
X_test_tfidf = tfidf_vectorizer.transform(X_test['description'])

# create and print the dense matrix (for display only; dense form is memory-heavy for a large vocabulary)
tfidf_matrix = X_train_tfidf.todense()
print(tfidf_matrix)

Vocabulary size 25544: 
[('hi', 11061), ('everyone', 8838), ('br', 4357), ('my', 14885), ('bedroom', 3715), ('is', 12112), ('in', 11663), ('huge', 11404), ('atypical', 3118), ('apartment', 2705), ('perfectly', 16420), ('located', 13354), ('the', 21621), ('best', 3861), ('area', 2857), ('of', 15640), ('financial', 9384), ('district', 7678), ('stone', 20737), ('street', 20821), ('with', 23746), ('all', 2378), ('its', 12160), ('bars', 3511), ('and', 2567), ('restaurants', 18435), ('20', 673), ('second', 19287), ('walk', 23276), ('from', 9901), ('place', 16695), ('lot', 13459), ('touristic', 22027), ('places', 16699), ('just', 12407), ('few', 9314), ('minutes', 14462), ('away', 3243), ('space', 20293), ('perfect', 16415), ('for', 9674), ('or', 15808), ('persons', 16489), ('amenities', 2501), ('really', 17866), ('comfortable', 5940), ('compared', 6027), ('to', 21873), ('ordinary', 15821), ('old', 15689)]

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0

In [30]:
#transform name

# create a second tfidf vectorizer for the listing names
# (note: this reuses the variable names from the previous cell, so the
# description vectorizer and its matrices are overwritten from here on)
tfidf_vectorizer = TfidfVectorizer()

# fit the vectorizer to X_train name
tfidf_vectorizer.fit(X_train['name'])

# print the first 50 items in the vocabulary
print("Vocabulary size {0}: ".format(len(tfidf_vectorizer.vocabulary_)))
print(str(list(tfidf_vectorizer.vocabulary_.items())[0:50])+'\n')

#transform the training and test data 
X_train_tfidf = tfidf_vectorizer.transform(X_train['name'])
X_test_tfidf = tfidf_vectorizer.transform(X_test['name'])

# create and print the dense matrix
tfidf_matrix = X_train_tfidf.todense()
print(tfidf_matrix)

Vocabulary size 5299: 
[('cozy', 1759), ('bedroom', 1137), ('in', 2753), ('best', 1168), ('area', 971), ('of', 3586), ('fidi', 2244), ('豪华双人间', 5275), ('法拉盛步行地铁站', 5252), ('42nd', 478), ('street', 4519), ('studio', 4534), ('hell', 2587), ('kitchen', 2917), ('magnificent', 3169), ('place', 3782), ('heart', 2574), ('midtown', 3310), ('hancock', 2545), ('town', 4764), ('house', 2698), ('stuyvesant', 4552), ('mews', 3294), ('breezy', 1299), ('bushwick', 1378), ('apartment', 937), ('with', 5103), ('backyard', 1043), ('beautiful', 1122), ('bdrm', 1105), ('large', 2967), ('private', 3869), ('patio', 3702), ('hip', 2632), ('modern', 3362), ('brooklyn', 1321), ('minutes', 3342), ('to', 4735), ('manhattan', 3198), ('great', 2490), ('location', 3080), ('clean', 1564), ('bronx', 1319), ('1bdrm', 200), ('3bdrm', 417), ('apt', 958), ('green', 2493), ('rm', 4091), ('medical', 3266), ('professional', 3894)]

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ...
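Because the name cell reuses `tfidf_vectorizer`, `X_train_tfidf`, and `X_test_tfidf`, only the name features remain available afterwards. One possible way to keep both text columns and the numeric columns is to use one vectorizer per column and stack the results side by side. The sketch below does this on toy data; all variable names and values in it are hypothetical, and it is an alternative to, not a description of, what this notebook does.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy listings
names = ["cozy room", "huge loft", "sunny studio"]
descriptions = ["near the subway", "top floor with a view", "quiet street"]
numeric = np.array([[4.5, 1], [4.9, 0], [4.2, 1]])  # e.g. review score, room-type flag

# One vectorizer per text column, so neither vocabulary overwrites the other
name_vec = TfidfVectorizer()
desc_vec = TfidfVectorizer()
name_tfidf = name_vec.fit_transform(names)
desc_tfidf = desc_vec.fit_transform(descriptions)

# Stack the blocks column-wise into a single sparse feature matrix
X_combined = hstack([name_tfidf, desc_tfidf, csr_matrix(numeric)]).tocsr()
print(X_combined.shape)
```

With real data, each vectorizer would be fit on the training split and then used to `transform` the test split, and the test blocks would be stacked in the same order.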

In [31]:
#I will now fit a linear regression model
#(X_train_tfidf holds the TF-IDF features of the listing names from the previous cell)
lr_model = LinearRegression()
lr_model.fit(X_train_tfidf, y_train)

In [32]:
y_lr_pred = lr_model.predict(X_test_tfidf)

In [31]:
from sklearn.metrics import mean_squared_error, r2_score
# 1. Compute the RMSE as the square root of mean_squared_error()
#    (the squared=False shortcut is deprecated in recent scikit-learn releases)
lr_rmse = np.sqrt(mean_squared_error(y_test, y_lr_pred))

# 2. Compute the R2 score using r2_score()
lr_r2 = r2_score(y_test, y_lr_pred)

print('[LR] Root Mean Squared Error: {0}'.format(lr_rmse))
print('[LR] R2: {0}'.format(lr_r2))

[LR] Root Mean Squared Error: 112.01948029634318
[LR] R2: 0.38162549964514214
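To read these two numbers, it helps to see how they are defined. The sketch below computes RMSE and R2 by hand with NumPy on four made-up prices; the arrays are illustrative and unrelated to the model above.

```python
import numpy as np

# Hypothetical true prices and predictions
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

# RMSE: square root of the mean squared residual, in the same units as the label
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(rmse, r2)
```

RMSE is expressed in the units of the label, so the value reported above is in the same units as `price`, and an R2 of about 0.38 means the model explains roughly 38% of the variance in test-set prices.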




In [34]:
#now I will check how my model is performing
#by looking at a few individual listings from the test set

print('Apartment 1: \n')
print(X_test.to_numpy()[120])

print('\nPrediction: {}\n'.format(y_lr_pred[120]))

print('Actual:  {}\n'.format(y_test.to_numpy()[120]))


Apartment 1: 

['Private Suite, Free Metrocard*'
 "Looking for an affordable place to stay in NYC? I have the perfect room in the lower level of my house in Flushing, Queens, a quiet residential NYC neighborhood.<br /><br /><b>The space</b><br />The space has its own entrance and VERY SMALL bathroom so you have privacy but we are upstairs if you need anything. This is a private part of my primary residence. We use the room for relaxing when we don't have guests. Room size: 500 sq ft/47 square meters .<br /><br />THERE ARE TWO SLEEPING OPTIONS: The full-sized sofabed for 1 or 2 people and twin-sized bed for 1 person. The room size is ideal for 1-2 persons.<br /><br />1 closet, refrigerator, kitchen sink, hot pot, microwave, and toaster (For safety reasons, there is no stove/oven). <br /><br />The small private bathroom has a toilet, sink, and shower. Towels and toiletries are in the cabinet above the toilet. It is a tiny bathroom.<br /><br /><b>Guest access</b><br />Laundry room<br />Bi

In [35]:
print('Apartment 2: \n')
print(X_test.to_numpy()[95])

print('\nPrediction: {}\n'.format(y_lr_pred[95]))

print('Actual:  {}\n'.format(y_test.to_numpy()[95]))

Apartment 2: 

['Brooklyn Royal Huge most stylish next level Loft'
 'Brooklyn Royal Loft in Downtown Brooklyn , one of the best location , and one of the Top Notch 3800 sqft  huge Loft . everything inside will make you feel like in wonderland. State of art lights system, good sound system , smoke machines  ,Royal Luxurious chandeliers installed in every section. Vvip lounges at upper and lower level . Huge main floor ,<br />Note: its 5 hours based, no over night stay, and pricing is with the number of guests and other rental things like lights, sound, security, lounge<br /><br /><b>The space</b><br />Brooklyn Royal Loft in Downtown Brooklyn , one of the best location , 5 minutes short walk from A and C train , 10 minutes from Brooklyn Bridge,,is one of the Top Notch 3800 sqft  huge Loft . We created this loft with Royal touch and everything inside will make you feel like in wonderland. State of art lights system, amazing visual effects , good sound system , smoke machines  ,Royal Luxur

In [36]:
print('Apartment 3: \n')
print(X_test.to_numpy()[1000])

print('\nPrediction: {}\n'.format(y_lr_pred[1000]))

print('Actual:  {}\n'.format(y_test.to_numpy()[1000]))

Apartment 3: 

['Private Bushwick Room Near Many Trains COZY'
 "Private room in a wonderful, old Bushwick building. The apartment is full of plants, has a large kitchen, and 2 cute kittens. The apartment is near many hip bars and restaurants. You'll also be close to 3 subway stops, all of which will take you to Manhattan in 20 minutes.<br /><br /><b>The space</b><br />The apartment is an quirky, creative, all natural kind of space, with very friendly roommates and lots of beautiful natural light.<br /><br /><b>Guest access</b><br />Bathroom, kitchen, bedroom.<br /><br /><b>Other things to note</b><br />There are cats, and if you need help, you can get in contact with us."
 4.5 0 1 0 0 0 1 0 0 0]

Prediction: -67.91036347840026

Actual:  29.0



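The metrics and the spot checks above show that the model does not perform well: the errors are large, and the prediction for Apartment 3 is negative (about -67.91 against an actual price of 29.0). Next, I will check how the model performs on the numerical features alone instead of the text features.

I will start with the average review score and the one-hot encoded room type and neighbourhood.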
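A negative price is possible because plain linear regression places no bound on its output. One option, which this notebook does not use, is to fit the model on the logarithm of the price and map predictions back with the exponential, which keeps them positive. The sketch below illustrates this with scikit-learn's `TransformedTargetRegressor` on synthetic data; `X_toy` and `y_toy` are assumptions made up for the example.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
# Synthetic, strictly positive, right-skewed "prices"
y_toy = np.exp(4.5 + 0.5 * X_toy[:, 0] + rng.normal(0, 0.3, size=300))

# Fit on log(price); predictions are mapped back with exp, so they stay positive
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
model.fit(X_toy, y_toy)
preds = model.predict(X_toy)
print(preds.min())
```

This requires every label to be strictly positive, which holds for `price` here since its minimum in the summary statistics is 29.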

In [37]:
df_airbnb_cleared.columns

Index(['name', 'description', 'price', 'average_review_score',
       'room_type_Entire home/apt', 'room_type_Private room',
       'room_type_Hotel room', 'room_type_Shared room',
       'neighbourhoods_Manhattan', 'neighbourhoods_Brooklyn',
       'neighbourhoods_Queens', 'neighbourhoods_Staten Island',
       'neighbourhoods_Bronx'],
      dtype='object')

In [38]:
col = ['average_review_score', 'room_type_Entire home/apt', 'room_type_Private room',
       'room_type_Hotel room', 'room_type_Shared room',
       'neighbourhoods_Manhattan', 'neighbourhoods_Brooklyn',
       'neighbourhoods_Queens', 'neighbourhoods_Staten Island',
       'neighbourhoods_Bronx']
X_1 = df_airbnb_cleared[col]
y_1 = df_airbnb_cleared['price']

In [39]:
# split the dataset, using 75% of the rows for training and a random state of 1234
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X_1,y_1,train_size=0.75, random_state=1234)

# take a look at the head of the training set
X_train_1.head()

Unnamed: 0,average_review_score,room_type_Entire home/apt,room_type_Private room,room_type_Hotel room,room_type_Shared room,neighbourhoods_Manhattan,neighbourhoods_Brooklyn,neighbourhoods_Queens,neighbourhoods_Staten Island,neighbourhoods_Bronx
3042,4.0,0,1,0,0,1,0,0,0,0
13901,3.88,0,1,0,0,0,0,1,0,0
19229,4.84,1,0,0,0,1,0,0,0,0
23213,4.0,1,0,0,0,1,0,0,0,0
249,4.65,0,1,0,0,0,1,0,0,0


In [40]:
#train the linear regression model again
lr_model_1 = LinearRegression()
lr_model_1.fit(X_train_1, y_train_1)
y_lr_pred_1 = lr_model_1.predict(X_test_1)

In [41]:
# 1. Compute the RMSE using mean_squared_error()
lr_rmse_1 = mean_squared_error(y_test_1, y_lr_pred_1, squared=False)

# 2. Compute the R2 score using r2_score()
lr_r2_1 = r2_score(y_test_1, y_lr_pred_1)

print('[LR] Root Mean Squared Error: {0}'.format(lr_rmse_1))
print('[LR] R2: {0}'.format(lr_r2_1))

[LR] Root Mean Squared Error: 127.6839012270449
[LR] R2: 0.17147424263851452




In [42]:
print('Apartment 1: \n')
print(X_test.to_numpy()[120])

print('\nPrediction: {}\n'.format(y_lr_pred_1[120]))

print('Actual:  {}\n'.format(y_test_1.to_numpy()[120]))


Apartment 1: 

['Private Suite, Free Metrocard*'
 "Looking for an affordable place to stay in NYC? I have the perfect room in the lower level of my house in Flushing, Queens, a quiet residential NYC neighborhood.<br /><br /><b>The space</b><br />The space has its own entrance and VERY SMALL bathroom so you have privacy but we are upstairs if you need anything. This is a private part of my primary residence. We use the room for relaxing when we don't have guests. Room size: 500 sq ft/47 square meters .<br /><br />THERE ARE TWO SLEEPING OPTIONS: The full-sized sofabed for 1 or 2 people and twin-sized bed for 1 person. The room size is ideal for 1-2 persons.<br /><br />1 closet, refrigerator, kitchen sink, hot pot, microwave, and toaster (For safety reasons, there is no stove/oven). <br /><br />The small private bathroom has a toilet, sink, and shower. Towels and toiletries are in the cabinet above the toilet. It is a tiny bathroom.<br /><br /><b>Guest access</b><br />Laundry room<br />Bi

In [43]:
print('Apartment 2: \n')
print(X_test.to_numpy()[95])

print('\nPrediction: {}\n'.format(y_lr_pred_1[95]))

print('Actual:  {}\n'.format(y_test_1.to_numpy()[95]))

Apartment 2: 

['Brooklyn Royal Huge most stylish next level Loft'
 'Brooklyn Royal Loft in Downtown Brooklyn , one of the best location , and one of the Top Notch 3800 sqft  huge Loft . everything inside will make you feel like in wonderland. State of art lights system, good sound system , smoke machines  ,Royal Luxurious chandeliers installed in every section. Vvip lounges at upper and lower level . Huge main floor ,<br />Note: its 5 hours based, no over night stay, and pricing is with the number of guests and other rental things like lights, sound, security, lounge<br /><br /><b>The space</b><br />Brooklyn Royal Loft in Downtown Brooklyn , one of the best location , 5 minutes short walk from A and C train , 10 minutes from Brooklyn Bridge,,is one of the Top Notch 3800 sqft  huge Loft . We created this loft with Royal touch and everything inside will make you feel like in wonderland. State of art lights system, amazing visual effects , good sound system , smoke machines  ,Royal Luxur

In [44]:
print('Apartment 3: \n')
print(X_test.to_numpy()[1000])

print('\nPrediction: {}\n'.format(y_lr_pred_1[1000]))

print('Actual:  {}\n'.format(y_test_1.to_numpy()[1000]))

Apartment 3: 

['Private Bushwick Room Near Many Trains COZY'
 "Private room in a wonderful, old Bushwick building. The apartment is full of plants, has a large kitchen, and 2 cute kittens. The apartment is near many hip bars and restaurants. You'll also be close to 3 subway stops, all of which will take you to Manhattan in 20 minutes.<br /><br /><b>The space</b><br />The apartment is an quirky, creative, all natural kind of space, with very friendly roommates and lots of beautiful natural light.<br /><br /><b>Guest access</b><br />Bathroom, kitchen, bedroom.<br /><br /><b>Other things to note</b><br />There are cats, and if you need help, you can get in contact with us."
 4.5 0 1 0 0 0 1 0 0 0]

Prediction: 81.8125

Actual:  29.0



The predictions still have a large error. This might be caused by our feature selection: the chosen features may not be the ones most strongly correlated with price. I will now move on to the description, processed using TF-IDF.
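One way to check whether the selected features relate to price is to look at their correlation with the target. The following is only a minimal sketch on a made-up toy frame: the column names mirror the notebook, but the values and the `demo` and `corr_with_price` names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only
demo = pd.DataFrame({
    'price': [29.0, 40.0, 97.0, 150.0, 220.0],
    'average_review_score': [4.5, 4.0, 4.8, 4.6, 4.9],
    'room_type_Private room': [1, 1, 0, 0, 0],
})

# Pearson correlation of every feature with price, strongest first
corr_with_price = (
    demo.corr()['price']
    .drop('price')
    .sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```

On the real data, the same call on `df_airbnb_cleared[col + ['price']]` would show which of the selected columns carry a linear signal for price.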

In [45]:
X = df_airbnb_cleared['description']
y = df_airbnb_cleared['price']

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.75, random_state=1234)
X_train.head()

3042     Hi everyone,<br />My bedroom is in a huge/atyp...
13901    Basement studio<br /><br /><b>Guest access</b>...
19229    Studio apartment in a doorman building located...
23213    1 BR Unit for rent in central midtown (1 block...
249      <b>The space</b><br />Hello, my name is Fred. ...
Name: description, dtype: object

In [47]:
# transform the description
# create a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# fit the vectorizer to X_train description
tfidf_vectorizer.fit(X_train)

# print the first 50 items in the vocabulary
print("Vocabulary size {0}: ".format(len(tfidf_vectorizer.vocabulary_)))
print(str(list(tfidf_vectorizer.vocabulary_.items())[0:50])+'\n')

#transform the training and test data 
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# create and print the dense matrix
tfidf_matrix = X_train_tfidf.todense()
print(tfidf_matrix)

Vocabulary size 25544: 
[('hi', 11061), ('everyone', 8838), ('br', 4357), ('my', 14885), ('bedroom', 3715), ('is', 12112), ('in', 11663), ('huge', 11404), ('atypical', 3118), ('apartment', 2705), ('perfectly', 16420), ('located', 13354), ('the', 21621), ('best', 3861), ('area', 2857), ('of', 15640), ('financial', 9384), ('district', 7678), ('stone', 20737), ('street', 20821), ('with', 23746), ('all', 2378), ('its', 12160), ('bars', 3511), ('and', 2567), ('restaurants', 18435), ('20', 673), ('second', 19287), ('walk', 23276), ('from', 9901), ('place', 16695), ('lot', 13459), ('touristic', 22027), ('places', 16699), ('just', 12407), ('few', 9314), ('minutes', 14462), ('away', 3243), ('space', 20293), ('perfect', 16415), ('for', 9674), ('or', 15808), ('persons', 16489), ('amenities', 2501), ('really', 17866), ('comfortable', 5940), ('compared', 6027), ('to', 21873), ('ordinary', 15821), ('old', 15689)]

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0

In [48]:
#perform linear regression
lr_model = LinearRegression()
lr_model.fit(X_train_tfidf, y_train)

In [49]:
y_lr_pred = lr_model.predict(X_test_tfidf)

In [50]:
#Compute the RMSE using mean_squared_error()
lr_rmse = mean_squared_error(y_test, y_lr_pred, squared=False)

#Compute the R2 score using r2_score()
lr_r2 = r2_score(y_test, y_lr_pred)

print('[LR] Root Mean Squared Error: {0}'.format(lr_rmse))
print('[LR] R2: {0}'.format(lr_r2))

[LR] Root Mean Squared Error: 501.8218003519243
[LR] R2: -11.797725057186819




In [51]:
print('Apartment 1: \n')
print(X_test.to_numpy()[120])

print('\nPrediction: {}\n'.format(y_lr_pred[120]))

print('Actual:  {}\n'.format(y_test.to_numpy()[120]))

print('Apartment 2: \n')
print(X_test.to_numpy()[95])

print('\nPrediction: {}\n'.format(y_lr_pred[95]))

print('Actual:  {}\n'.format(y_test.to_numpy()[95]))

print('Apartment 3: \n')
print(X_test.to_numpy()[1000])

print('\nPrediction: {}\n'.format(y_lr_pred[1000]))

print('Actual:  {}\n'.format(y_test.to_numpy()[1000]))

Apartment 1: 

Looking for an affordable place to stay in NYC? I have the perfect room in the lower level of my house in Flushing, Queens, a quiet residential NYC neighborhood.<br /><br /><b>The space</b><br />The space has its own entrance and VERY SMALL bathroom so you have privacy but we are upstairs if you need anything. This is a private part of my primary residence. We use the room for relaxing when we don't have guests. Room size: 500 sq ft/47 square meters .<br /><br />THERE ARE TWO SLEEPING OPTIONS: The full-sized sofabed for 1 or 2 people and twin-sized bed for 1 person. The room size is ideal for 1-2 persons.<br /><br />1 closet, refrigerator, kitchen sink, hot pot, microwave, and toaster (For safety reasons, there is no stove/oven). <br /><br />The small private bathroom has a toilet, sink, and shower. Towels and toiletries are in the cabinet above the toilet. It is a tiny bathroom.<br /><br /><b>Guest access</b><br />Laundry room<br />Bikes<br />Wifi<br />Patio and garden<

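From the high RMSE and the negative R2 score, the model clearly does not fit well: a negative R2 means it predicts worse than always guessing the mean price. This suggests that the description has little linear relationship with the price, although an unregularized linear regression on 25,544 TF-IDF features may also be overfitting. I will redo the prediction using all the numerical features in the data set.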
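To make the meaning of a negative R2 concrete, here is a small sketch with made-up numbers (the arrays below are assumptions, not notebook data). Predicting the mean for every row gives an R2 of 0, and predictions that are further off than that give a negative score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical prices, for illustration only
y_true = np.array([50.0, 100.0, 150.0, 200.0])

# Baseline: always predict the mean price
mean_baseline = np.full_like(y_true, y_true.mean())

# Predictions that swing far away from the true values
wild_pred = np.array([400.0, -300.0, 600.0, -100.0])

r2_baseline = r2_score(y_true, mean_baseline)
r2_wild = r2_score(y_true, wild_pred)

print('R2 of mean baseline:', r2_baseline)
print('R2 of wild predictions:', r2_wild)
```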

In [20]:
df_airbnb.dtypes

name                                             object
description                                      object
neighborhood_overview                            object
host_name                                        object
host_location                                    object
host_about                                       object
host_response_rate                              float64
host_acceptance_rate                            float64
host_is_superhost                                  bool
host_listings_count                             float64
host_total_listings_count                       float64
host_has_profile_pic                               bool
host_identity_verified                             bool
neighbourhood_group_cleansed                     object
room_type                                        object
accommodates                                      int64
bathrooms                                       float64
bedrooms                                        

In [21]:
df_num_only = df_airbnb.select_dtypes(include=['number'])

In [22]:
df_num_only.dtypes

host_response_rate                              float64
host_acceptance_rate                            float64
host_listings_count                             float64
host_total_listings_count                       float64
accommodates                                      int64
bathrooms                                       float64
bedrooms                                        float64
beds                                            float64
price                                           float64
minimum_nights                                    int64
maximum_nights                                    int64
minimum_minimum_nights                          float64
maximum_minimum_nights                          float64
minimum_maximum_nights                          float64
maximum_maximum_nights                          float64
minimum_nights_avg_ntm                          float64
maximum_nights_avg_ntm                          float64
availability_30                                 

In [23]:
np.sum(df_num_only.isnull())

host_response_rate                              11843
host_acceptance_rate                            11113
host_listings_count                                 0
host_total_listings_count                           0
accommodates                                        0
bathrooms                                           0
bedrooms                                         2918
beds                                             1354
price                                               0
minimum_nights                                      0
maximum_nights                                      0
minimum_minimum_nights                              0
maximum_minimum_nights                              0
minimum_maximum_nights                              0
maximum_maximum_nights                              0
minimum_nights_avg_ntm                              0
maximum_nights_avg_ntm                              0
availability_30                                     0
availability_60             

I will fill the missing values in each column with that column's mean.

In [24]:
# Fill missing values in each affected column with that column's mean.
# Assigning the result back avoids chained in-place modification.
for col_name in ['host_response_rate', 'host_acceptance_rate', 'bedrooms', 'beds']:
    df_num_only[col_name] = df_num_only[col_name].fillna(df_num_only[col_name].mean())
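The mean imputation above can also be written in one call. This is a minimal sketch on a made-up toy frame (the `demo` and `demo_filled` names and all values are assumptions); unlike the cell above, it fills every column that has missing values, not just the four listed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the notebook, values are invented
demo = pd.DataFrame({
    'bedrooms': [1.0, np.nan, 3.0],
    'beds': [1.0, 2.0, np.nan],
    'price': [100.0, 150.0, 300.0],
})

# DataFrame.mean() skips NaN by default, so each mean uses observed values only;
# fillna with a Series fills each column with its own mean
demo_filled = demo.fillna(demo.mean())
print(demo_filled)
```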

In [25]:
np.sum(df_num_only.isnull())

host_response_rate                              0
host_acceptance_rate                            0
host_listings_count                             0
host_total_listings_count                       0
accommodates                                    0
bathrooms                                       0
bedrooms                                        0
beds                                            0
price                                           0
minimum_nights                                  0
maximum_nights                                  0
minimum_minimum_nights                          0
maximum_minimum_nights                          0
minimum_maximum_nights                          0
maximum_maximum_nights                          0
minimum_nights_avg_ntm                          0
maximum_nights_avg_ntm                          0
availability_30                                 0
availability_60                                 0
availability_90                                 0


In [26]:
y = df_num_only['price']
X = df_num_only.drop(columns='price')

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33,random_state=1234) 

In [28]:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

In [29]:
y_lr_pred=lr_model.predict(X_test)

In [32]:
lr_rmse = mean_squared_error(y_test, y_lr_pred, squared=False)

lr_r2 = r2_score(y_test, y_lr_pred)

print('[LR] Root Mean Squared Error: {0}'.format(lr_rmse))
print('[LR] R2: {0}'.format(lr_r2))

[LR] Root Mean Squared Error: 112.01948029634318
[LR] R2: 0.38162549964514214




I will apply scaling and see if it will help. 

In [33]:
col_list = list(df_num_only.columns)

In [34]:
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
df_scaled = std_scaler.fit_transform(df_num_only.to_numpy())
df_scaled = pd.DataFrame(df_scaled, columns=col_list)

In [35]:
df_scaled.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
0,-0.6190192,-2.893351,-0.054298,-0.054298,-1.007673,-0.337606,0.0,-0.588233,-0.030031,0.442362,...,-0.123804,-0.038834,0.263662,-0.458809,-0.204236,-0.09812,-0.217119,-0.109127,-0.321256,1.888373
1,-4.730337,-0.474289,-0.112284,-0.112284,0.06747,-0.337606,-0.497128,1.28049,-0.562648,-0.691838,...,-0.078204,-0.017309,-0.097167,-0.014806,-0.266296,-0.174687,-0.217119,-0.109127,0.697623,0.409419
2,0.5390984,-2.521188,-0.112284,-0.112284,0.605041,0.849692,1.010653,0.346128,0.857665,-0.535396,...,0.423398,0.413191,-0.602328,0.680156,-0.266296,-0.174687,-0.217119,-0.109127,-0.390981,-1.069535
3,0.5390984,0.967845,-0.112284,-0.112284,-0.470102,-0.337606,-0.497128,-0.588233,-0.612359,-0.652727,...,-0.351805,-0.835257,0.287717,-0.555332,-0.266296,-0.21297,-0.161477,-0.109127,0.432219,-0.57655
4,6.428845e-16,0.0,-0.112284,-0.112284,-1.007673,-0.337606,-0.497128,-0.588233,-0.562648,-0.652727,...,0.354998,0.305566,0.456104,0.52572,-0.266296,-0.21297,-0.161477,-0.109127,-0.1998,0.902404


In [36]:
y = df_scaled['price']
X = df_scaled.drop(columns='price')

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33,random_state=1234) 
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_lr_pred=lr_model.predict(X_test)
lr_rmse = mean_squared_error(y_test, y_lr_pred, squared=False)

lr_r2 = r2_score(y_test, y_lr_pred)

print('[LR] Root Mean Squared Error: {0}'.format(lr_rmse))
print('[LR] R2: {0}'.format(lr_r2))

[LR] Root Mean Squared Error: 0.7955132777357903
[LR] R2: 0.38162549961866576




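The RMSE is now a much smaller number, but it is measured in standardized price units rather than dollars, and R2 is unchanged (about 0.38), so the fit itself has not improved.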
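The effect of standardizing the target on these two metrics can be checked on synthetic data. The sketch below is an illustration under stated assumptions: `X_demo`, `y_demo`, and `fit_and_score` are hypothetical names, the data is randomly generated, and RMSE is computed with `np.sqrt` rather than the `squared=False` argument. It shows that the RMSE shrinks by exactly the target's standard deviation while R2 stays the same.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for features and a price-like target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 150 + 40 * X_demo[:, 0] + rng.normal(scale=60, size=500)

# Standardize the target the same way the notebook does (fit on all rows)
y_std = StandardScaler().fit_transform(y_demo.reshape(-1, 1)).ravel()

def fit_and_score(X, y):
    """Fit a linear regression and return (RMSE, R2) on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=1234)
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    return rmse, r2_score(y_te, pred)

rmse_raw, r2_raw = fit_and_score(X_demo, y_demo)
rmse_std, r2_std = fit_and_score(X_demo, y_std)

print('raw:          RMSE={:.3f}  R2={:.3f}'.format(rmse_raw, r2_raw))
print('standardized: RMSE={:.3f}  R2={:.3f}'.format(rmse_std, r2_std))
# Multiplying by the target's standard deviation recovers the raw-unit RMSE
print('rescaled RMSE: {:.3f}'.format(rmse_std * y_demo.std()))
```

To compare models fairly, it helps to report RMSE in the original price units, for example by inverse-transforming the predictions before scoring.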

In [38]:
price_scaler = StandardScaler()
price_scaler.fit(df_num_only[['price']])

In [40]:
y_lr_pred_unscaled = price_scaler.inverse_transform(y_lr_pred.reshape(-1,1))
y_test_unscaled = price_scaler.inverse_transform(y_test.values.reshape(-1, 1))

In [43]:
print('Apartment 1: \n')
print(X_test.to_numpy()[120])

print('\nPrediction: {}\n'.format(y_lr_pred_unscaled[120]))

print('Actual:  {}\n'.format(y_test_unscaled[120]))

print('Apartment 2: \n')
print(X_test.to_numpy()[95])

print('\nPrediction: {}\n'.format(y_lr_pred_unscaled[95]))

print('Actual:  {}\n'.format(y_test_unscaled[95]))

print('Apartment 3: \n')
print(X_test.to_numpy()[1000])

print('\nPrediction: {}\n'.format(y_lr_pred_unscaled[1000]))

print('Actual:  {}\n'.format(y_test_unscaled[1000]))

Apartment 1: 

[ 0.53909845 -0.19516627 -0.10399991 -0.10399991 -1.00767327 -0.33760586
  0.         -0.58823265  0.44236178 -0.00610549  0.42666901  0.2423094
 -0.0267443  -0.04100123  0.25510639 -0.03561033 -0.68018951 -0.81674098
 -0.34014473  1.0127523  -0.50783268 -0.22931415 -0.43313143 -0.02665279
 -0.49374777  0.42339773  0.41319061  0.60043576  0.04310755 -0.23526584
 -0.2129702  -0.10583519 -0.10912727 -0.30326262  0.40941894]

Prediction: [87.00265057]

Actual:  [40.]

Apartment 2: 

[-0.03996039  0.68872194 -0.1122836  -0.1122836  -0.47010173  2.03699038
 -0.49712826 -0.58823265 -0.69183771 -0.00613153 -0.65579887 -0.48478611
 -0.0267443  -0.04100123 -0.4878546  -0.03561033 -0.68018951 -0.91181805
 -0.99130498 -0.69912985 -0.50783268 -0.22931415  1.33571125  0.62571687
  0.67374206  0.42339773  0.41319061  0.60043576  0.68015612 -0.23526584
 -0.17468677 -0.1614771  -0.10912727  0.27927516  0.90240354]

Prediction: [162.26928864]

Actual:  [97.]

Apartment 3: 

[ 0.53909845 

In this problem, we initially combined NLP on the text data with averages for the numerical data to predict Airbnb prices using linear regression, since price is a continuous target (a regression problem) rather than a categorical one. However, this model had a poor RMSE. We then separated the text and the numerical features and analyzed them separately. The text predicted price less accurately than the numerical features, which suggests that price is more strongly related to the rating than to the description. However, there was still a large RMSE. Using all the numerical features raised R2 to about 0.38. I then suspected that a scaling issue in the data was causing the large RMSE, so I standardized the data. After standardization the RMSE was about 0.80, but this value is in standardized price units rather than dollars, and R2 stayed at about 0.38.
Overall, we see that the relationship between the description and the price is very weak, while the ratings relate better to the price. Because the price has a very large range, it is hard to predict with small error, which is why our RMSE in dollars is so high. Scaling brought the reported RMSE under 1 because the units changed, not because the predictions became more accurate. To further improve this prediction model, we might need to improve our price data, and also consider better scaling methods and feature selection.