### Business Analytics Group Assignment - Predicting Airbnb Listing Prices in Melbourne {-}

--- 

**Kaggle Competition Ends:** Friday, 2 June 2023 @ 3:00pm (Week 13)  
**Assignment Due Date on iLearn:** Friday, 2 June 2023 @ 11.59pm (Week 13)   

**Overview:**   

- In the group assignment you will form a team of 3 students and participate in a forecasting competition on Kaggle
- The goal is to predict listed prices of Airbnb properties in Melbourne based on various Airbnb characteristics and regression models
- Assessment Summary:  
    - Write a problem statement and perform Exploratory Data Analysis  
    - Clean up data, deal with categorical features and missing observations, and create new explanatory variables (feature engineering)  
    - Construct and tune forecasting models, produce forecasts and submit your predictions to Kaggle  
    - Each member of the team will record a video presentation of their work  
    - Marks will be awarded for producing a prediction in the top 5 positions of your unit as well as for reaching the highest ranking on Kaggle amongst all teams.

**Instructions:** 

- Form a team of 3 students (minimum 2 students)  
- Each team member needs to join [https://www.kaggle.com](https://www.kaggle.com/)  
- Choose a team leader and form a team in the competition [https://www.kaggle.com/t/a154f28787174b628a2b7eaa238a5c12](https://www.kaggle.com/t/a154f28787174b628a2b7eaa238a5c12)
    - Team leader to click on `team` and join and invite other team members to join
    - Your **team's name must start** with your unit code, for instance you could have a team called BUSA8001_masterful_geniuses or BUSA3020_l33t 
- All team members should work on all the tasks listed below; however:   
    - Choose a team member who will be responsible for one of each of the 3 tasks listed below    
- Your predictions must be generated by a model you develop here 
    - You will receive a mark of zero if your code provided here does not produce the forecasts you submit to Kaggle

**Marks**: 

- Total Marks: 40
- Your individual mark will consist of:  
    - 50% x overall assignment mark + 45% x mark for the task that you are responsible for + 5% x mark received from your teammates for your effort in group work 

**Competition Marks:**  

- 1 mark: Ranking in the top 5 places of your unit on Kaggle (make sure you name your team as instructed above)   
- 2 marks: Reaching the first place in your unit (make sure you name your team as instructed above)   


**Submissions:**  

1. On Kaggle: submit your team's forecast in order to be ranked by Kaggle
    - Can do this as many times as necessary while building their model  
2. On iLearn **only team leader to submit** this Jupyter notebook re-named `Group_Assignment_Team_Name.ipynb` where Team_Name is your team's name on Kaggle   
    - The Jupyter notebook must contain team members names/ID numbers, and team name in the competition
    - Provide answers to the 3 Tasks below in the allocated cells including all codes/outputs/writeups 
    - One 15 minute video recording of your work 
        - Each team member to provide a 5 minute presentation of the Task that they led (it is best to jointly record your video using Zoom)
        - When recording your video make sure your face is visible, that you share your Jupyter Notebook and explain everything you've done in the submitted Jupyter notebook on screen
        - 5 marks will be deducted from each Task for which there is no video presentation or if you don't follow the above instructions
        
3. On iLearn each student needs to submit a file with their teammates' names, ID number and a mark for their group effort (out of 100%)



---

**Fill out the following information**

For each team member provide name, Student ID number and which task is performed below

- Team Name on Kaggle: `BUSA8001_superhosts`
- Team Leader and Team Member 1: `Felix Rosenberger`
- Team Member 2: `John Rizk`

---

## Task 1: Problem Description and Initial Data Analysis {-}

1. Read the Competition Overview on Kaggle [https://www.kaggle.com/t/a154f28787174b628a2b7eaa238a5c12](https://www.kaggle.com/t/a154f28787174b628a2b7eaa238a5c12)
2. Referring to Competition Overview and the data provided on Kaggle write about a 500 words **Problem Description** focusing on key points that will need to be addressed as first steps in Tasks 2 and 3 below, using the following headings:
    - Forecasting Problem - explain what you are trying to do and how it could be used in the real world (i.e. why it may be important)
    - Evaluation Criteria - explain the criteria used to assess forecast performance 
    - Types of Variables/Features
    - Data summary and main data characteristics
    - Missing Values (only explain what you found at this stage)
    - Hint: you should **not** discuss any specific predictive algorithms at this stage
    - Note: This task should be completed in a single Markdown cell (text box)
    
Total Marks: 12


In [13]:
# Task 1 code here
import pandas as pd
import numpy as np

# setting display options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
#pd.set_option('display.max_rows', None)

# read in data
trainpath = "train.csv"
df_train = pd.read_csv(trainpath, index_col='ID')
testpath = "test.csv"
df_test = pd.read_csv(testpath, index_col='ID')

# concatenate dataframes to reduce redundancies in operations
df = pd.concat([df_train, df_test])

#df.head()
#test_ids = df.ID.iloc[7000:].values
df.to_csv("df_1.csv")

In [None]:
# data summary and characteristics
df['price'].describe()
df.info()
df.dtypes.value_counts().to_frame()
#print(len(df.columns))
# types of variables / features
#vtypes = df.dtypes.to_frame()
#vtypes.value_counts()
#vtypes

#print(df.shape[1])


In [None]:
df = pd.concat([df_train, df_test])
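# regex=False so that '$' is removed as a literal character (as a regex it is an end-of-string anchor)
df['price'] = df.price.str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype('float')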
d = df.groupby('neighbourhood_cleansed')['price'].describe().round(2)
#d = df.groupby(['property_type','bathrooms'])['price'].describe().round(2)

#d = df.groupby('property_type')['price'].mean().dropna().round(2)
d.sort_values(by=['mean'], inplace=True, ascending=False)
#d.sort_values(by=['count'], inplace=True, ascending=False)
d
#df['price']
print(min(df['price']))
print(max(df['price']))

In [None]:
# missing values
missing_values_count_train = pd.DataFrame(df_train.isnull().sum(axis=0)).loc[df_train.isnull().sum(axis=0) != 0]

missing_values_count_test = pd.DataFrame(df_test.isnull().sum(axis=0)).loc[df_test.isnull().sum(axis=0) != 0]

missing_values_count = pd.merge(missing_values_count_train, missing_values_count_test, how='outer', left_index=True, right_index=True)
missing_values_count.fillna(0, inplace=True)
missing_values_count = missing_values_count.rename(columns={'0_x': 'missing_train', '0_y': 'missing_test'})
missing_values_count['total_missing'] = missing_values_count['missing_train'] + missing_values_count['missing_test']

print('Total Missing records = ', sum(missing_values_count['total_missing']))

missing_values_count


### <font color='darkblue'>Forecasting Problem</font>
<font color='darkblue'>
Airbnb is an "online marketplace that connects people who want to rent out their homes with people who are looking for accommodations in specific locales." https://www.investopedia.com/articles/personal-finance/032814/pros-and-cons-using-airbnb.asp

The goal of this assignment is to develop a model, based on statistical machine learning, for predicting the nightly prices of Melbourne-based Airbnb listings from their features and characteristics. The model can be used to assess how rental prices differ by the characteristics of a property or between suburbs, which can in turn be used to determine the profitability and feasibility of certain listings. 

### Evaluation Criteria
The criterion used to assess prediction performance is the root mean squared error (RMSE). This metric is the square root of the average squared difference between a model's predictions and the actual target values, so large errors are penalised more heavily than small ones. The lower the RMSE, the better the prediction quality. It also has the advantage of being in the same unit as the predicted variable, which makes it easy to interpret.
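
As a minimal sketch of how RMSE is computed, the snippet below uses NumPy on made-up prices. The helper name `rmse` and the example values are assumptions for illustration and are not taken from the competition data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical nightly prices (illustrative values, not from the dataset)
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

# Errors are -10, 10 and -30; the largest error dominates the squared average
score = rmse(actual, predicted)
print(round(score, 2))
```

Because the errors are squared before averaging, the single 30-dollar miss contributes nine times as much as each 10-dollar miss.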

### Variables / Features

The data consists of 60 columns: 27 of type object, 18 of type float64 and 15 of type int64.
<br>
<br>
The variables were classified into the following types: 21 nominal, 12 ordinal and 27 numeric.


### Data Summary and Main Characteristics

The price distribution is heavily right-skewed: prices range from <b>25</b> to <b>145160</b>, with a mean of <b>285.65</b>.

Prices appear to be sensitive to <b>property type</b>. <b>Private room in villa</b> has the highest mean price of <b>2358.36</b>, far above the mean prices of other property types, which suggests that the price distribution for this property type may be highly skewed. <b>Private room in bungalow</b> has the lowest mean price of <b>64.11</b>. 

<b>Entire rental unit</b> has the most listings (<b>2984</b>) and a mean price of <b>296.87</b>, close to the overall mean price, while <b>Shared room in guesthouse</b> has the fewest listings (<b>2</b>) and a mean price of <b>67.00</b>.

Prices also seem sensitive to <b>neighbourhood_cleansed</b>. <b>Boroondara</b> has the highest mean price of <b>894.95</b> and <b>Greater Dandenong</b> has the lowest at <b>115.41</b>. <b>Melbourne</b> has the most listings (<b>2062</b>), with a mean price of <b>335.35</b>, and <b>Greater Dandenong</b> has the fewest listings (<b>32</b>).

<b>Instant bookable</b> properties have a slightly higher mean price (<b>298.96</b>) than those that are not instant bookable (<b>281.34</b>).

<b>Entire home/apt</b> has the most listings <b>by room type</b>, with a mean price of <b>312.19</b>. <b>Hotel room</b> and <b>Shared room</b> have the fewest listings, with a combined total of <b>80</b>.

Mean prices increase with the number of guests a listing <b>accommodates</b>, from <b>81.65</b> for 1 guest to <b>724.04</b> for 16 guests. 

An initial review of the features indicates that price may depend on variables including:
<br>&nbsp;&nbsp;room_type (nominal)
<br>&nbsp;&nbsp;neighbourhood_cleansed (nominal)
<br>&nbsp;&nbsp;accommodates (numeric)
<br>&nbsp;&nbsp;bathrooms (numeric)
<br>&nbsp;&nbsp;bedrooms (numeric)
<br>&nbsp;&nbsp;beds (numeric)
<br>&nbsp;&nbsp;amenities (nominal)
<br>&nbsp;&nbsp;review_scores_rating (ordinal)
<br>&nbsp;&nbsp;instant_bookable (nominal)

### Missing Observations

There are 24202 missing values across 29 variables, spread fairly evenly between the training and test data sets.
</font>    

---

## Task 2: Data Cleaning, Missing Observations and Feature Engineering {-}
- In this task you will follow a set of instructions/questions listed below.
- Make sure you **explain** each step you do both in Markdown text and on your video.
    - Do not just read out your commands without explaining what they do and why you used them 

Total Marks: 12

**Task 2, Question 1**: Clean **all** numerical features and the target variable `price` so that they can be used in training algorithms. For instance, `host_response_rate` feature is in object format containing both numerical values and text. Extract numerical values (or equivalently eliminate the text) so that the numerical values can be used as a regular feature.  
(2 marks)

In [14]:
# Data Cleaning

#Functions
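def replace_string(df, c, s, r='', f='strip'):
    # f selects the mode: 'find_replace' (literal substring), 'replace' (whole value) or 'strip' (regex)
    if f == 'find_replace':
        mask = (df[c].notnull()) & (df[c].astype(str).str.contains(s, regex=False))
        df.loc[mask, c] = df.loc[mask, c].astype(str).str.replace(s, r, regex=False)
    elif f == 'replace':
        df[c] = df[c].replace(s, r)
    elif f == 'strip':
        # s is a regular expression here, so special characters such as '$' must be escaped
        df[c] = df[c].dropna().astype(str).str.replace(s, r, regex=True)
    return df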

def replace_numeric(df, c, n, r=0, f='match'):
    if f == 'isgreater':
        df.loc[df[c] > n, c] = r
    elif f == 'isless':
        df.loc[df[c] < n, c] = r
    elif f == 'match':
        df.loc[df[c] == n, c] = r
    return df

def convert_numeric(df, c, t, d=1):
    df[c] = pd.to_numeric(df[c], errors='coerce')
    df[c] = df[c].astype(t)
    df[c] = df[c] / d
    return df


In [15]:
df2 = df

# price
# '$' is escaped because 'strip' mode treats the pattern as a regular expression
df2 = replace_string(df2, 'price', r'\$','', 'strip')
df2 = replace_string(df2, 'price', ',','', 'strip')
df2 = convert_numeric(df2, 'price', 'float', 1)

# host_response_rate
df2 = replace_string(df2, 'host_response_rate', '%','', 'strip')
df2 = convert_numeric(df2, 'host_response_rate', 'float', 100)

# host_acceptance_rate
df2 = replace_string(df2, 'host_acceptance_rate', '%','', 'strip')
df2 = convert_numeric(df2, 'host_acceptance_rate', 'float', 100)

# bathrooms
df2 = replace_string(df2, 'bathrooms', 'Half-bath','0.5', 'find_replace')
df2 = replace_string(df2, 'bathrooms', 'half-bath','0.5', 'find_replace')
df2 = replace_string(df2, 'bathrooms', r'[^0-9.]','', 'strip')
df2 = convert_numeric(df2, 'bathrooms', 'float', 1)

# max/min nights - replace extreme values
df2 = replace_numeric(df2, 'maximum_nights', 9000, 1000, 'isgreater')
df2 = replace_numeric(df2, 'minimum_maximum_nights', 9000, 1000, 'isgreater')
df2 = replace_numeric(df2, 'maximum_maximum_nights', 9000, 1000, 'isgreater')
df2 = replace_numeric(df2, 'minimum_nights_avg_ntm', 9000, 1000, 'isgreater')
df2 = replace_numeric(df2, 'maximum_nights_avg_ntm', 9000, 1000, 'isgreater')

df2.to_csv("df_2.csv")

#### <font color='darkblue'>price</font>
<font color='darkblue'>
The target variable price was converted from a string containing the $ sign and thousands separators (commas) to a numeric float variable.
<br>
    
#### <font color='darkblue'>host_response_rate and host_acceptance_rate</font>    
host_response_rate and host_acceptance_rate were converted from strings containing the % sign to numeric float variables and divided by 100, so they are expressed as proportions between 0 and 1.
<br>

#### <font color='darkblue'>bathrooms</font>
Some records for bathrooms contained 'Half-bath' or 'half-bath', that is, they did not contain any numeric characters. These substrings were replaced with the string '0.5'. All remaining non-numeric characters were then removed and the whole column was converted to a numeric float variable.
<br>

#### <font color='darkblue'>maximum_nights, minimum_maximum_nights, maximum_maximum_nights, minimum_nights_avg_ntm and maximum_nights_avg_ntm</font>    
These variables had some extreme values that appear to be errors. Values greater than 9000 were replaced with 1000, which appears to be the upper limit based on the other min/max nights variables.
</font>
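
The cleaning steps described above can also be sketched with pandas string methods directly. This is an illustrative alternative rather than the notebook's own helper functions: the frame `raw` and its values are made up, and the bathrooms step extracts the first number instead of stripping non-numeric characters.

```python
import pandas as pd

# Hypothetical raw values mimicking the formats described above
raw = pd.DataFrame({
    "price": ["$1,250.00", "$85.00", None],
    "host_response_rate": ["100%", "75%", None],
    "bathrooms": ["1.5 baths", "Shared half-bath", "2 shared baths"],
    "maximum_nights": [1125, 99999, 30],
})

clean = pd.DataFrame(index=raw.index)

# regex=False treats "$" as a literal character rather than an end-of-string anchor
price_text = raw["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False)
clean["price"] = pd.to_numeric(price_text)

# Percent strings become proportions between 0 and 1
clean["host_response_rate"] = pd.to_numeric(raw["host_response_rate"].str.rstrip("%")) / 100

# Map "half-bath" (any case) to 0.5 first, then pull out the first number
baths = raw["bathrooms"].str.replace(r"(?i)half-bath", "0.5", regex=True)
clean["bathrooms"] = baths.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)

# Replace implausible values (above 9000) with 1000, as done for the nights columns
clean["maximum_nights"] = raw["maximum_nights"].where(raw["maximum_nights"] <= 9000, 1000)
```

Missing values stay missing throughout, so they can be handled separately in the imputation step.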


In [None]:
df2

**Task 2, Question 2** Create at least 4 new features from existing features which contain multiple items of information, e.g. creating `email`,  `phone`, `work_email`, etc. from feature `host_verifications`.  
(2 marks)

In [16]:
df3 = df2

# Create new features email, phone and work_email from host_verifications
df3 = replace_string(df3, 'host_verifications', "['email']","'1','0','0'", 'replace')
df3 = replace_string(df3, 'host_verifications', "['phone']","'0','1','0'", 'replace')
df3 = replace_string(df3, 'host_verifications', "['email', 'phone']","'1','1','0'", 'replace')
df3 = replace_string(df3, 'host_verifications', "['phone', 'work_email']","'0','1','1'", 'replace')
df3 = replace_string(df3, 'host_verifications', "['email', 'phone', 'work_email']","'1','1','1'", 'replace')

df3[['email', 'phone', 'work_email']] = df3['host_verifications'].str.split(',', expand=True)

df3 = replace_string(df3, 'email', "'",'', 'strip')
df3 = convert_numeric(df3, 'email','int', 1)

df3 = replace_string(df3, 'phone', "'",'', 'strip')
df3 = convert_numeric(df3, 'phone','int', 1)

df3 = replace_string(df3, 'work_email', "'",'', 'strip')
df3 = convert_numeric(df3, 'work_email','int', 1)

df3.drop(['host_verifications'], axis=1, inplace=True)

In [17]:
# Create new features smoke_alarm, kitchen, essential, hangers, wifi from amenities
# These are the top 5 amenities in the dataset

from collections import Counter

amenity_count = Counter()
amenity_count_total = Counter()
count_total = []

for amenities_str in df3['amenities']:
    amenity_count_total = 0
    amenities_list = amenities_str.strip('][').replace('"', '').split(', ')
    for amenity in amenities_list:
        amenity_count[amenity] += 1
        amenity_count_total  += 1
    count_total.append(amenity_count_total)

df_amenities = pd.DataFrame(columns=[ 'amenity_count'])
df_amenities['amenity_count'] = amenity_count
df_amenities = df_amenities.sort_values('amenity_count', ascending=False)
df_amenities.head(5)

#df_acc = pd.DataFrame(columns=[ 'total_amenity_counts'])
#df_acc['total_amenity_counts'] = count_total
#df_acc
#print(acc)

#Smoke alarm	9548
#Kitchen	9383
#Essentials	9327
#Hangers	8702
#Wifi	8618

df3['amenity_count'] = count_total

df3[['smoke_alarm','kitchen','essentials','hangers','wifi']] = 0

for idx, amenities_str in df3['amenities'].items():
    amenities_list = amenities_str.strip('][').replace('"', '').split(', ')
    if 'Smoke alarm' in amenities_list:
        df3.loc[idx, 'smoke_alarm'] = 1
    if 'Kitchen' in amenities_list:
        df3.loc[idx, 'kitchen'] = 1        
    if 'Essentials' in amenities_list:
        df3.loc[idx, 'essentials'] = 1      
    if 'Hangers' in amenities_list:
        df3.loc[idx, 'hangers'] = 1      
    if 'Wifi' in amenities_list:
        df3.loc[idx, 'wifi'] = 1              


#df3.drop(['amenities'], axis=1, inplace=True) # needed in Felix notebook to calculate sum of amenities

df3

df3.to_csv("df_3.csv")


### <font color='darkblue'>New Features</font>
<font color='darkblue'>

#### <font color='darkblue'>host_verifications</font>    
Created new binary numeric features 'email', 'phone' and 'work_email' from 'host_verifications' and deleted the column 'host_verifications'.
<br>
    
#### <font color='darkblue'>amenities</font>    
Created new binary numeric features 'smoke_alarm', 'kitchen', 'essentials', 'hangers' and 'wifi' from 'amenities', which are the top 5 amenities in the dataset, plus 'amenity_count', the number of amenities per listing. The 'amenities' column is kept at this stage and dropped later.
<br>
</font>
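
A compact way to derive the same kind of amenity indicators is sketched below. It assumes that each `amenities` entry is a valid JSON list, which matches the values displayed in this notebook but has not been checked for every row. The frame `listings` and the `targets` mapping are hypothetical.

```python
import json
import pandas as pd

# Hypothetical listings; the strings follow the JSON-like format seen in `amenities`
listings = pd.DataFrame({
    "amenities": ['["Wifi", "Kitchen", "Smoke alarm"]', '["Wifi"]', '[]'],
})

# Parse each string into a Python list of amenity names
parsed = listings["amenities"].apply(json.loads)
listings["amenity_count"] = parsed.apply(len)

# One 0/1 column per amenity of interest (names chosen for illustration)
targets = {"Wifi": "wifi", "Kitchen": "kitchen", "Smoke alarm": "smoke_alarm"}
for amenity, column in targets.items():
    listings[column] = parsed.apply(lambda items, a=amenity: int(a in items))
```

One difference from splitting on `', '`: an empty list is counted as 0 amenities here, whereas `''.split(', ')` returns `['']` and would be counted as 1.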

**Task 2, Question 3**: Impute missing values for all features in both training and test datasets. Hint: make sure you do **not** impute the price in the test dataset.
(3 marks)

In [18]:
df4 = df3

from sklearn.impute import SimpleImputer

def impute_missing(df, c, s='most_frequent'):
    for col in c:
        i = SimpleImputer(missing_values = np.nan, strategy=s) 
        i = i.fit(df[[col]])
        df[[col]] = i.transform(df[[col]])
    return df

# host_location-> most_frequent
df4 = impute_missing(df4, ['host_location'], 'most_frequent')

# host_response_time -> most_frequent
df4 = impute_missing(df4, ['host_response_time'], 'most_frequent')

# host_response_rate, host_acceptance_rate -> mean
df4 = impute_missing(df4, ['host_response_rate', 'host_acceptance_rate'], 'mean')

# host_is_superhost -> most_frequent
df4 = impute_missing(df4, ['host_is_superhost'], 'most_frequent')

# host_neighbourhood, neighbourhood, neighbourhood_cleansed -> most_frequent
df4 = impute_missing(df4, ['host_neighbourhood', 'neighbourhood', 'neighbourhood_cleansed'], 'most_frequent')

# property_type, room_type -> most_frequent
df4 = impute_missing(df4, ['property_type', 'room_type'], 'most_frequent')

# bathrooms, bedrooms, beds -> median
df4 = impute_missing(df4, ['bathrooms','bedrooms','beds'], 'median')

# minimum_minimum_nights, maximum_maximum_nights -> median
df4 = impute_missing(df4, ['minimum_minimum_nights', 'maximum_maximum_nights'], 'median')

# availability_365 -> mean
df4 = impute_missing(df4, ['availability_365'], 'mean')

# first_review, last_review -> most_frequent
df4 = impute_missing(df4, ['first_review', 'last_review'], 'most_frequent')

#review_scores_accuracy, review_scores_checkin, review_scores_cleanliness, review_scores_communication, review_scores_location
# review_scores_rating, review_scores_value -> mean
df4 = impute_missing(df4, ['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness','review_scores_checkin',
                           'review_scores_communication', 'review_scores_location','review_scores_value'], 'mean')
# reviews_per_month -> mean
df4 = impute_missing(df4, ['reviews_per_month'], 'mean')

# email, phone, work_email -> most_frequent
df4 = impute_missing(df4, ['email', 'phone', 'work_email'], 'most_frequent')

# smoke_alarm, kitchen, essentials, hangers, wifi -> most_frequent
df4 = impute_missing(df4, ['smoke_alarm', 'kitchen', 'essentials', 'hangers', 'wifi'], 'most_frequent')

#df4

df4.to_csv("df_4.csv")

### <font color='darkblue'>Imputing missing values</font>
<font color='darkblue'>
<br>
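Imputed missing values for all features except the free-text columns 'description', 'neighborhood_overview' and 'host_about'. Categorical, binary and date features were imputed with the most frequent value, and numerical features with the mean or median. The target variable 'price' was not imputed.
<br>
<br>
</font>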
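
One possible variation, sketched here under assumptions, is to learn the fill values from the training rows only and then apply them to both training and test rows. The notebook itself fits each imputer on the combined data. The frame `data` is hypothetical, and training rows are identified as those with a non-missing `price`.

```python
import numpy as np
import pandas as pd

# Hypothetical combined frame: rows with a price are training rows, rows without are test rows
data = pd.DataFrame({
    "beds": [1.0, np.nan, 3.0, np.nan],
    "room_type": ["Entire home/apt", None, "Entire home/apt", "Private room"],
    "price": [100.0, 150.0, np.nan, np.nan],
})
is_train = data["price"].notna()

# Learn the fill values from training rows only, then apply them everywhere
fill_values = {
    "beds": data.loc[is_train, "beds"].median(),
    "room_type": data.loc[is_train, "room_type"].mode().iloc[0],
}

# `price` is not in fill_values, so it stays missing for the test rows
imputed = data.fillna(fill_values)
```

Leaving `price` out of the fill dictionary is what keeps the test targets untouched, in line with the hint in the question.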

**Task 2, Question 4**: Encode all categorical variables appropriately as discussed in class. 


Where a categorical feature contains more than 5 unique values, map the feature into 5 most frequent values + 'other' and then encode appropriately. For instance, you could group then map `property_type` into 5 most frequent property types + 'other'  
(2 marks)


In [30]:
df5 = df4


#onehot encoder function
def onehot(df, c):
    for col in c:
        df = df.join(pd.get_dummies(df[[col]], drop_first=True))
        df.drop([col], axis=1, inplace=True)
    return df

#encode binary classifiers
# 'host_is_superhost','host_has_profile_pic','host_identity_verified','has_availability','instant_bookable'
df5 = onehot(df5, ['source', 'host_is_superhost','host_has_profile_pic','host_identity_verified','has_availability','instant_bookable'])

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

#encode source
#le = encoder.fit_transform(df5['source'].values)
#df5['source'] = le
#print('source:', encoder.classes_)

#encode room_type
le = encoder.fit_transform(df5['room_type'].values)
df5['room_type'] = le
room_type_classes = encoder.classes_


#encode top 5 property_type and other
top_5_property_type = df5['property_type'].value_counts().nlargest(5).index.tolist()  
encoder.fit(top_5_property_type + ['other'])  
#df5['property_type_encoded'] = df5['property_type'].apply(lambda x: x if x in top_5_property_type else 'other')
df5['property_type'] = df5['property_type'].apply(lambda x: x if x in top_5_property_type else 'other')
df5 = onehot(df5, ['property_type'])

#df5['property_type_encoded'] = encoder.transform(df5['property_type'].apply(lambda x: x if x in top_5_property_type else 'other'))
#df5.drop(['property_type'], axis=1, inplace=True)
#df5 = df5.rename(columns={'property_type_encoded': 'property_type'})
#property_type_classes = encoder.classes_


#encode top 5 neighbourhood_cleansed and other
top_5_neighbourhood_cleansed = df5['neighbourhood_cleansed'].value_counts().nlargest(5).index.tolist()  
encoder.fit(top_5_neighbourhood_cleansed + ['other'])  
df5['neighbourhood_cleansed_encoded'] = encoder.transform(df5['neighbourhood_cleansed'].apply(lambda x: x if x in top_5_neighbourhood_cleansed else 'other'))
df5.drop(['neighbourhood_cleansed'], axis=1, inplace=True)
df5 = df5.rename(columns={'neighbourhood_cleansed_encoded': 'neighbourhood_cleansed'})
neighbourhood_cleansed_classes = encoder.classes_


# map/rank host_response_time
host_response_mapping = {'within an hour':1, 'within a few hours':2, 'within a day':3, 'a few days or more':4}
df5['host_response_time'] = df5['host_response_time'].map(host_response_mapping)

# convert host_since into days based on current date
from datetime import datetime
today = datetime.today()
df5['host_since'] = pd.to_datetime(df5['host_since'], format='%Y/%m/%d')
df5['host_since'] = (today - df5['host_since']).dt.days

#df5

df5.to_csv("df_5.csv")

<font color='darkblue'>

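* Encoded binary features
<br>
'source', 'host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'has_availability', 'instant_bookable' using one-hot encoding (with the first level dropped).
<br>
<br>
* Encoded 'room_type' using LabelEncoder.
<br>
<br>
* Grouped 'property_type' and 'neighbourhood_cleansed' into their 5 most frequent values, with the rest grouped as 'other'. 'property_type' was then one-hot encoded and 'neighbourhood_cleansed' was encoded using LabelEncoder.
<br>
<br>
* Mapped host_response_time to an ordinal ranking from 1 ('within an hour') to 4 ('a few days or more').
<br>
<br>
* Converted host_since from a date into the number of days between that date and the current date.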
</font>
<br>
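
The "top values plus 'other'" grouping can be written as a small reusable helper. The sketch below is illustrative: `group_rare` is a hypothetical name, the property types are made up, and `top_n=2` is used only to keep the example small (the task asks for 5).

```python
import pandas as pd

def group_rare(series, top_n=5, other="other"):
    """Keep the top_n most frequent values of a Series and map the rest to `other`."""
    keep = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(keep), other)

# Hypothetical property types
property_type = pd.Series(
    ["Entire rental unit"] * 3 + ["Entire home"] * 2 + ["Tiny home", "Boat"]
)

grouped = group_rare(property_type, top_n=2)

# One 0/1 indicator column per remaining category
dummies = pd.get_dummies(grouped, prefix="property_type", dtype=int)
```

One-hot encoding avoids implying an order between categories, which integer labels from `LabelEncoder` can suggest to some models.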

**Task 2, Question 5**: Perform any other actions you think need to be done on the data before constructing predictive models, and clearly explain what you have done.   
(1 mark)

In [31]:
import numpy as np

df6 = df5
df6['log_price'] = np.log(df6['price'])

df6.to_csv("df_6.csv")





In [32]:
df6

Preview of `df6` (first and last displayed rows; the `ID` index and 5 of the 73 columns are shown, the remaining columns are omitted for width):

| ID | name | room_type | accommodates | price | log_price |
|---|---|---|---|---|---|
| 0 | The Stables, Richmond | 0 | 2 | 132.0 | 4.882802 |
| 1 | Room in Cool Deco Apartment in Brunswick East | 2 | 2 | 39.0 | 3.663562 |
| 2 | The Suite @ Angelus Retreat | 0 | 4 | 270.0 | 5.598422 |
| 3 | Million Dollar Views Over Melbourne | 2 | 2 | 1000.0 | 6.907755 |
| 4 | Melbourne - Old Trafford Apartment | 0 | 5 | 116.0 | 4.753590 |
| ... | ... | ... | ... | ... | ... |
| 9995 | Comfy Bedroom with Private Bathroom | 2 | 2 | NaN | NaN |
| 9996 | Light-Filled & Cosy in the Heart of South Yarra | 0 | 2 | NaN | NaN |
| 9997 | New 6 bedrooms house in Williams Landing | 0 | 16 | NaN | NaN |
| 9998 | Comfortable bedroom in CBD | 2 | 2 | NaN | NaN |


<br>
<font color='darkblue'>
* Created the <b>log_price</b> variable, the natural logarithm of <b>price</b>. The transformation compresses the scale of the target, reduces the influence of outliers, and brings its distribution closer to normal.
</font>
<br>
<br>
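
As a minimal sketch of why the log transform helps, the snippet below uses made-up prices (not the assignment data) to compare skewness before and after taking logs, and to confirm that `np.exp` maps values back to dollars. The array `prices` and the helper `skewness` are illustrative assumptions.

```python
import numpy as np

# Hypothetical right-skewed nightly prices, including one extreme outlier
prices = np.array([80.0, 95.0, 110.0, 120.0, 150.0, 200.0, 450.0, 4000.0])


def skewness(x):
    """Sample skewness: the mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


log_prices = np.log(prices)

print("skew of price:    ", round(skewness(prices), 2))
print("skew of log_price:", round(skewness(log_prices), 2))

# np.exp inverts np.log, so predictions made in log space map back to dollars
recovered = np.exp(log_prices)
```

The same round trip is what the modelling cells rely on when they train on `log_price` and apply `np.exp` to the predictions.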

In [None]:

df7 = df6.copy()  # copy, so the in-place drops below do not also remove columns from df6
df7.drop(['amenities'], axis=1, inplace=True) 
#df7.drop(['name', 'description','neighborhood_overview','host_name',
#          'host_about','neighbourhood','latitude','longitude'], axis=1, inplace=True)
df7.drop(['name', 'description','neighborhood_overview','host_name',
          'host_about','neighbourhood'], axis=1, inplace=True)

df7.drop(['host_location', #'host_response_rate','host_acceptance_rate',
          'host_neighbourhood',
          #'host_listings_count'
         ], 
          axis=1, inplace=True)
df7.drop(['minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm'], axis=1, inplace=True)
#df7.drop(['number_of_reviews','number_of_reviews_ltm', 'number_of_reviews_l30d'], axis=1, inplace=True)
df7.drop(['first_review', 'last_review'], axis=1, inplace=True)
#df7.drop(['review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value'], axis=1, inplace=True) 
df7.drop(['calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms'], axis=1, inplace=True)
df7.drop(['reviews_per_month'], axis=1, inplace=True)
 
df7

df7.to_csv("df_7.csv")

In [None]:

df7


#### <font color='darkblue'> Feature selection </font>
<font color='darkblue'>
* Dropped features for initial analysis.
</font>
<br>
<br>

**Task 2, Question 6**: Perform exploratory data analysis to measure the relationship between the features and the target and write up your findings. 
(2 marks)
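
One simple way to measure the relationship between numeric features and the target is to rank them by correlation with `log_price`. The sketch below does this on synthetic data; the column names mirror the dataset, but the values are made up and the frame `demo` is an illustrative assumption rather than the assignment data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the listings table (values are invented)
accommodates = rng.integers(1, 9, size=n)
bedrooms = np.clip(accommodates // 2 + rng.integers(0, 2, size=n), 1, None)
number_of_reviews = rng.integers(0, 200, size=n)
log_price = 4.0 + 0.25 * accommodates + rng.normal(0, 0.3, size=n)

demo = pd.DataFrame({
    "accommodates": accommodates,
    "bedrooms": bedrooms,
    "number_of_reviews": number_of_reviews,
    "log_price": log_price,
})

# Pearson correlation of each feature with the target, strongest first
corr_with_target = (
    demo.corr()["log_price"]
    .drop("log_price")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so scatter plots or group-wise box plots are a useful complement for categorical or non-linear features.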

In [None]:

import pandas as pd
import matplotlib.pyplot as plt

# log_price is already on a log scale, so plot it on a linear axis
plt.hist(df7['log_price'], bins=100)
plt.xlabel('log_price')
plt.ylabel('Number of listings')
plt.show()
        

`(Task 2, Question 6 Text Here)`

--- 
## Task 3: Fit and tune a forecasting model/Submit predictions/Report score and ranking {-}

Make sure you **clearly explain each step** you do, both in text and on the recorded video.

1. Build a machine learning (ML) regression model by taking into account the outcomes of Tasks 1 & 2 (Explain carefully)
2. Fit the model and tune hyperparameters via cross-validation: make sure you comment and explain each step clearly
3. Create predictions using the test dataset and submit your predictions on Kaggle's competition page
4. Provide Kaggle ranking and **score** (screenshot your best submission) and Comment
5. Make sure your Python code works, so that a marker can replicate all of your results and obtain the same RMSE from Kaggle

- Hint: to perform well in this assignment you will need to iterate Tasks 2 & 3, creating new features and training various models in order to find the best one.

Total Marks: 12
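
The steps above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: `make_regression` stands in for the engineered features, `y_log` is an invented log-price target, and the small grid is chosen for speed rather than accuracy.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered features and the log-price target
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
y_log = 5.0 + (y - y.mean()) / (4 * y.std())  # rescaled to a plausible log-price range

X_tr, X_val, y_tr, y_val = train_test_split(X, y_log, test_size=0.25, random_state=42)

# Steps 1-2: tune the whole pipeline, so preprocessing is refit inside each CV fold
pipe_demo = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=42))
param_grid_demo = {
    "randomforestregressor__n_estimators": [50, 100],
    "randomforestregressor__max_depth": [5, None],
}
search = GridSearchCV(pipe_demo, param_grid_demo, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("CV RMSE (log scale):", -search.best_score_)

# Step 3: predict in log space, then convert back to dollars
pred_dollar = np.exp(search.predict(X_val))
rmse_dollar = np.sqrt(np.mean((np.exp(y_val) - pred_dollar) ** 2))
print("hold-out RMSE (dollars):", rmse_dollar)
```

Passing the pipeline itself to `GridSearchCV` keeps the scaler and the model together, so the hyperparameters are chosen under the same preprocessing that is used for the final predictions.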

In [None]:
df9 = df7
df9.to_csv("df_9.csv")

df_test_train = pd.read_csv("df_7.csv")

df_train = df_test_train.iloc[:7000]
df_train = df_train[df_train['price'] < 4000]

df_test = df_test_train.iloc[7000:]

y_train = df_train['log_price'].values.ravel()

#y_test = df_test['log_price']

X_train = df_train.drop(['ID', 'price','log_price','latitude','longitude','source_previous scrape'], axis=1).values

X_test = df_test.drop(['ID', 'price','log_price','latitude','longitude','source_previous scrape'], axis=1).values



In [22]:
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_regression


from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostRegressor
from sklearn.svm import SVC

from sklearn.metrics import mean_squared_error

from sklearn.model_selection import cross_val_score, KFold


In [None]:
# fit and predict on test data

#X_train, y_train = make_regression(n_samples=150, n_features=33, random_state=42)

pipe = make_pipeline(StandardScaler(), 
       RandomForestRegressor(random_state=42, n_estimators=400, max_depth=20))

pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
y_pred_dollar = np.exp(y_pred)

test_id = np.arange(7000, 10000, 1)
pred = pd.DataFrame({"ID":test_id, "price":y_pred_dollar})
pred.to_csv("pricepredictions.csv", index=False, header=True)
pred.head()


In [None]:
# test on subset of training data
pipe = make_pipeline(StandardScaler(),
                     RandomForestRegressor(random_state=42, n_estimators=400, max_depth=20))

# hold-out rows with known prices
df_test_t = pd.read_csv("df_test.csv")

y_test = df_test_t['log_price'].values

X_test = df_test_t.drop(['ID', 'price','log_price'], axis=1).values

pipe.fit(X_train, y_train)

# predict log prices, then convert back to dollars
y_pred = pipe.predict(X_test)
y_pred_dollar = np.round(np.exp(y_pred))

y_test_dollar = np.round(np.exp(y_test))
rmse = np.sqrt(mean_squared_error(y_test_dollar, y_pred_dollar))
print(f'Root Mean Squared Error: {rmse}')

test_pred = pd.DataFrame({"price":y_test_dollar, "pred":y_pred_dollar})
test_pred.head(100)
#print(train_pred.head(15))


In [None]:



pipe_rf = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=42))

# 20-fold cross-validation; cross_val_score fits a fresh clone of the pipeline on each fold
n_folds = 20
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

cv_scores = -cross_val_score(pipe_rf, X_train, y_train, cv=kf, scoring='neg_mean_squared_error')

mse = np.mean(cv_scores)
rmse = np.sqrt(mse)
print("Cross-validated RMSE (log scale):", rmse)

# refit on the full training set
pipe_rf.fit(X_train, y_train)

# in-sample error: optimistic, because the model has already seen these rows
y_pred = pipe_rf.predict(X_train)
mse = mean_squared_error(y_train,y_pred)
rmse = np.sqrt(mse)
print("Training RMSE (log scale):", rmse)

# calculate training RMSE in original scale
y_pred_dollar = np.exp(y_pred)
y_train_dollar = np.exp(y_train)
rmse = np.sqrt(mean_squared_error(y_train_dollar, y_pred_dollar))
print(f'Training RMSE in AUD: {rmse}')

In [None]:
#X_test1 = df7.drop(['price','source_previous scrape','host_since','host_has_profile_pic_t','host_identity_verified_t'
#             ], axis=1).iloc[7000:10000].values
y_pred1 = pipe_rf.predict(X_test)
y_pred1
#len(y_pred1)

In [None]:
# calculate hold-out RMSE in original scale, using the X_test predictions from the previous cell
y_pred_dollar = np.exp(y_pred1)
y_test_dollar = np.exp(y_test)

rmse = np.sqrt(mean_squared_error(y_test_dollar, y_pred_dollar))
print(f'Root Mean Squared Error in AUD: {rmse}')

In [None]:
y_pred_dollar

In [26]:

#df6  = pd.read_csv('df_6.csv')

df_train = df6[:7000]
#df_train = df_train[df_train['price'] < 4001]

df_test = df6[7000:]



columns_to_drop = [
'name','description','neighborhood_overview','host_name','host_location','host_about',
'host_neighbourhood','neighbourhood','amenities','first_review','last_review',
'property_type','amenities']


dfxx = df6.drop(columns=columns_to_drop).values


In [None]:
# RandomForestRegressor

#df6  = pd.read_csv('df_6.csv')

df_train = df6[:7000]
#df_train = df_train[df_train['price'] < 4001]

df_test = df6[7000:]



columns_to_drop = [
'name','description','neighborhood_overview','host_name','host_location','host_about',
'host_neighbourhood','neighbourhood','amenities','first_review','last_review',
'amenities',
    
    
   
'price'
,'log_price'
    
#,'calculated_host_listings_count_shared_rooms'
#,'calculated_host_listings_count_entire_homes'
#,'email'
#,'smoke_alarm'
#,'property_type_Entire home'
#,'maximum_nights_avg_ntm'
#,'minimum_nights'
#,'calculated_host_listings_count'
#,'maximum_maximum_nights'
#,'minimum_minimum_nights'
#,'availability_60'
#,'review_scores_rating'

,'review_scores_location'
,'number_of_reviews_l30d'
,'maximum_minimum_nights'
,'reviews_per_month'
,'minimum_maximum_nights'
,'availability_365'
,'review_scores_cleanliness'
,'review_scores_value'
,'number_of_reviews'
,'availability_90'
,'number_of_reviews_ltm'
,'host_listings_count'
,'review_scores_communication'
,'minimum_nights_avg_ntm'
,'longitude'
,'maximum_nights'
,'host_response_rate'
,'amenity_count'
,'host_identity_verified_t'
,'review_scores_accuracy'
,'host_acceptance_rate'
,'source_previous scrape'
,'hangers'
,'review_scores_checkin'
,'latitude'
,'wifi'
,'kitchen'
,'host_since'
,'beds'
,'property_type_other'
,'property_type_Entire rental unit'
,'property_type_Private room in rental unit'
,'host_response_time'
,'host_is_superhost_t'
,'work_email'
,'essentials'
,'phone'
,'has_availability_t'
,'host_has_profile_pic_t'

    
]
    



y_train = df_train['log_price'].values

X_train = df_train.drop(columns=columns_to_drop).values

X_test =  df_test.drop(columns=columns_to_drop).values
  
    
    

param_grid = {
    'n_estimators': [100, 200, 300,400,500,600,700,800],  # Number of trees in the forest
    'max_depth': [3, 4, 5, 10, 15, 20, 25, 30,50 ]  # Maximum depth of each decision tree
}

rf_model = RandomForestRegressor(random_state=42)

grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, scoring='neg_mean_squared_error', cv=5, n_jobs = 24)

grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
#print(best_params)

pipe = make_pipeline(StandardScaler(),
                     RandomForestRegressor(random_state=42, **best_params))
    
#pipe = make_pipeline(StandardScaler(),
#                     RandomForestRegressor(random_state=42, n_estimators=20 , max_depth=700))
      
    
    
    
pipe.fit(X_train, y_train)

# in-sample predictions: this RMSE is optimistic and is not an estimate of the Kaggle score
y_pred = pipe.predict(X_train)
y_pred_dollar = np.round(np.exp(y_pred))

y_train_dollar = np.round(np.exp(y_train))


rmse = np.sqrt(mean_squared_error(y_train_dollar, y_pred_dollar))
print(f'Training Root Mean Squared Error: {rmse}')

train_pred = pd.DataFrame({"price":y_train_dollar, "pred":y_pred_dollar})
train_pred.head(5)


#df_test = pd.read_csv("test.csv")
#X_test = df_test.drop(['ID'], axis=1).values
y_pred = pipe.predict(X_test)
y_pred_dollar = np.round(np.exp(y_pred))

#test_pred = pd.DataFrame({"price":y_test_dollar, "pred":y_test_pred_dollar})
#test_pred.head(100)


test_id = np.arange(7000, 10000, 1)
pred = pd.DataFrame({"ID":test_id, "price": np.exp(y_pred)})
pred.to_csv("pricepredictions.csv", index=False, header=True)
pred.head(5)


In [None]:
df6

In [None]:
# LASSO
from sklearn.linear_model import Lasso

#df6  = pd.read_csv('df_6.csv')

df_train = df6[:7000]

df_test = df6[7000:]
columns_to_drop = [
'name','description','neighborhood_overview','host_name','host_location','host_about',
'host_neighbourhood','neighbourhood','amenities','first_review','last_review',
    
   
'price'
,'log_price'

    
,'minimum_nights_avg_ntm'
,'availability_365'
,'maximum_maximum_nights'
,'host_response_rate'
,'longitude'
,'availability_90'
,'host_listings_count'
,'minimum_nights'
,'availability_60'
,'number_of_reviews_ltm'
,'amenity_count'
,'number_of_reviews'
,'reviews_per_month'
,'beds'
,'maximum_nights'
,'review_scores_cleanliness'
,'host_acceptance_rate'
,'hangers'
,'wifi'
,'latitude'
,'work_email'
,'source_previous scrape'
,'kitchen'
,'maximum_nights_avg_ntm'
,'review_scores_accuracy'
,'review_scores_checkin'
,'host_is_superhost_t'
,'review_scores_communication'
,'essentials'
,'host_since'
,'host_response_time'
,'host_has_profile_pic_t'
,'phone'
#,'has_availability_t'
   
    
]

y_train = df_train['log_price'].values

X_train = df_train.drop(columns=columns_to_drop).values

X_test =  df_test.drop(columns=columns_to_drop).values
  

    
param_grid = {
    'lasso__alpha': [0.1, 0.5, 1.0],
    'lasso__max_iter': [1000, 2000, 3000,4000],
    'lasso__tol': [0.01,0.001, 0.0001, 0.00001]
}

# tune the full pipeline, so alpha is selected on the same standardized
# features that the final model is fitted on
lasso_pipe = make_pipeline(StandardScaler(), Lasso(random_state=42))

grid_search = GridSearchCV(estimator=lasso_pipe, param_grid=param_grid, cv=5, n_jobs=24)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_

# best pipeline, already refit on the full training set
pipe = grid_search.best_estimator_
    
    
    
    
    
    
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_train)
y_pred_dollar = np.round(np.exp(y_pred))

y_train_dollar = np.round(np.exp(y_train))




rmse = np.sqrt(mean_squared_error(y_train_dollar, y_pred_dollar))
print(f'Root Mean Squared Error: {rmse}')

train_pred = pd.DataFrame({"price":y_train_dollar, "pred":y_pred_dollar})
train_pred.head(5)

y_pred = pipe.predict(X_test)
y_pred_dollar = np.round(np.exp(y_pred))

test_id = np.arange(7000, 10000, 1)
pred = pd.DataFrame({"ID":test_id, "price":y_pred_dollar})
pred.to_csv("pricepredictions.csv", index=False, header=True)
pred.head(5)

In [None]:
# GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor

#df6  = pd.read_csv('df_6.csv')

df_train = df6[:7000]
#df_train = df_train[df_train['price'] < 4000]

df_test = df6[7000:]



columns_to_drop = [
'name','description','neighborhood_overview','host_name','host_location','host_about',
'host_neighbourhood','neighbourhood','amenities','first_review','last_review',
    
   
'price'
,'log_price'

    
,'minimum_nights_avg_ntm'
,'availability_365'
,'maximum_maximum_nights'
,'host_response_rate'
,'longitude'
,'availability_90'
,'host_listings_count'
,'minimum_nights'
,'availability_60'
,'number_of_reviews_ltm'
,'amenity_count'
,'number_of_reviews'
,'reviews_per_month'
,'beds'
,'maximum_nights'
,'review_scores_cleanliness'
,'host_acceptance_rate'
,'hangers'
,'wifi'
,'latitude'
,'work_email'
,'source_previous scrape'
,'kitchen'
,'maximum_nights_avg_ntm'
,'review_scores_accuracy'
,'review_scores_checkin'
,'host_is_superhost_t'
#,'review_scores_communication'
#,'essentials'
#,'host_since'
#,'host_response_time'
#,'host_has_profile_pic_t'
#,'phone'
#,'has_availability_t'

   
]

y_train = df_train['log_price'].values

X_train = df_train.drop(columns=columns_to_drop).values

X_test =  df_test.drop(columns=columns_to_drop).values
  

columns_to_drop = [
'name','description','neighborhood_overview','host_name','host_location','host_about',
'host_neighbourhood','neighbourhood','amenities','first_review','last_review',
    
 
   
    
]
    
param_grid = {
    'n_estimators': [100, 200, 300,400,500],  # Number of boosting stages to perform
    'learning_rate': [0.1, 0.05, 0.01,0.5],  # Learning rate shrinks the contribution of each tree
    'max_depth': [3, 4, 5]  # Maximum depth of each decision tree
}

gb_model = GradientBoostingRegressor(random_state=42)

grid_search = GridSearchCV(estimator=gb_model, param_grid=param_grid, cv=5, n_jobs=24)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

pipe = make_pipeline(StandardScaler(),
                     GradientBoostingRegressor(random_state=42, **best_params))    
    
    
    
    
    
    
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_train)
y_pred_dollar = np.round(np.exp(y_pred))

y_train_dollar = np.round(np.exp(y_train))


rmse = np.sqrt(mean_squared_error(y_train_dollar, y_pred_dollar))
print(f'Root Mean Squared Error: {rmse}')

train_pred = pd.DataFrame({"price":y_train_dollar, "pred":y_pred_dollar})
train_pred.head(5)

y_pred = pipe.predict(X_test)
y_pred_dollar = np.round(np.exp(y_pred))

test_id = np.arange(7000, 10000, 1)
pred = pd.DataFrame({"ID":test_id, "price":np.exp(y_pred)})
pred.to_csv("pricepredictions.csv", index=False, header=True)
pred.head(5)


In [None]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

#df6  = pd.read_csv('df_6.csv')

df_train = df6[:7000]

df_test = df6[7000:]

    
columns_to_drop = [
'name','description','neighborhood_overview','host_name','host_location','host_about',
'host_neighbourhood','neighbourhood','amenities','first_review','last_review',
    
   
'price'
,'log_price'

    
,'minimum_nights_avg_ntm'
,'availability_365'
,'maximum_maximum_nights'
,'host_response_rate'
,'longitude'
,'availability_90'
,'host_listings_count'
,'minimum_nights'
,'availability_60'
,'number_of_reviews_ltm'
,'amenity_count'
,'number_of_reviews'
,'reviews_per_month'
,'beds'
,'maximum_nights'
,'review_scores_cleanliness'
,'host_acceptance_rate'
,'hangers'
,'wifi'
,'latitude'
,'work_email'
,'source_previous scrape'
,'kitchen'
#,'maximum_nights_avg_ntm'
#,'review_scores_accuracy'
#,'review_scores_checkin'
#,'host_is_superhost_t'
#,'review_scores_communication'
#,'essentials'
#,'host_since'
#,'host_response_time'
#,'host_has_profile_pic_t'
#,'phone'
#,'has_availability_t'

  
    
]


y_train = df_train['log_price'].values

X_train = df_train.drop(columns=columns_to_drop).values

X_test =  df_test.drop(columns=columns_to_drop).values


param_grid = {
    'elasticnet__alpha': [0.1, 0.5, 1.0],
    'elasticnet__l1_ratio': [0.2, 0.5, 0.8],
    'elasticnet__max_iter': [1000, 2000, 3000],
    'elasticnet__tol': [0.001, 0.0001, 0.00001]
}

# tune the full pipeline, so the penalty is selected on the same standardized
# features that the final model is fitted on
en_pipe = make_pipeline(StandardScaler(), ElasticNet())

grid_search = GridSearchCV(estimator=en_pipe, param_grid=param_grid, cv=5, n_jobs=24)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# best pipeline, already refit on the full training set
pipe = best_model
    
    
    
    
    
    
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_train)
y_pred_dollar = np.round(np.exp(y_pred))

y_train_dollar = np.round(np.exp(y_train))


rmse = np.sqrt(mean_squared_error(y_train_dollar, y_pred_dollar))
print(f'Root Mean Squared Error: {rmse}')

train_pred = pd.DataFrame({"price":y_train_dollar, "pred":y_pred_dollar})
train_pred.head(5)

y_pred = pipe.predict(X_test)
y_pred_dollar = np.round(np.exp(y_pred))

test_id = np.arange(7000, 10000, 1)
pred = pd.DataFrame({"ID":test_id, "price":y_pred_dollar})
pred.to_csv("pricepredictions.csv", index=False, header=True)
pred.head(5)


In [None]:
import pandas as pd
from xgboost import plot_importance, DMatrix
import xgboost as xgb
import matplotlib.pyplot as plt

#df6  = pd.read_csv('df_6.csv')

df_train = df6[:7000]

#df_train = df_train[df_train['price'] < 4000]

columns_to_drop = [
'name','description','neighborhood_overview','host_name','host_location','host_about',
'host_neighbourhood','neighbourhood','amenities','first_review','last_review','price',
'log_price'
]


y_train = df_train['log_price'].values

X_train = df_train.drop(columns=columns_to_drop).values

feature_names = df_train.drop(columns=columns_to_drop).columns
model = xgb.XGBRegressor()

model.fit(X_train, y_train)

importance = model.feature_importances_

# attach the column names to the booster so each bar is labelled with its own feature;
# plot_importance sorts the bars itself, so the tick labels must not be overwritten by hand
model.get_booster().feature_names = list(feature_names)

fig, ax = plt.subplots(figsize=(45, 8))
plot_importance(model, ax=ax)
plt.show()

In [None]:
df_importances = pd.DataFrame({'Feature': feature_names, 'Importance': importance })

df_importances['Rank'] = df_importances['Importance'].rank(ascending=False)

# Sort the importances by rank
df_importances = df_importances.sort_values(by='Rank')

# Print the feature importances with ranks
print(len(importance))
print(len(feature_names))
df_importances

In [None]:
print(feature_names)

In [None]:
#df_test = pd.read_csv("test.csv")
#X_test = df_test.drop(['ID'], axis=1).values
y_pred = pipe.predict(X_test)
y_pred_dollar = np.round(np.exp(y_pred))

#test_pred = pd.DataFrame({"price":y_test_dollar, "pred":y_test_pred_dollar})
#test_pred.head(100)


test_id = np.arange(7000, 10000, 1)
pred = pd.DataFrame({"ID":test_id, "price":y_pred_dollar})
pred.to_csv("pricepredictions.csv", index=False, header=True)
pred.head(100)

In [None]:
print("Best Hyperparameters:", best_params)

In [None]:

y_test_pred = pipe.predict(X_test)
y_test_pred_dollar = np.round(np.exp(y_test_pred))

y_test_dollar = np.round(np.exp(y_test))
rmse = np.sqrt(mean_squared_error(y_test_dollar, y_test_pred_dollar))
print(f'Root Mean Squared Error: {rmse}')

test_pred = pd.DataFrame({"price":y_test_dollar, "pred":y_test_pred_dollar})
test_pred.head(100)
#print(train_pred.head(15))


#test_id = np.arange(7000, 10000, 1)
#pred = pd.DataFrame({"ID":test_id, "price":y_pred_dollar})
#pred.to_csv("pricepredictions.csv", index=False, header=True)


In [None]:
y_pred1 = pipe.predict(X_test)

y_pred1_dollar = np.round(np.exp(y_pred1))
test_id = np.arange(7000, 10000, 1)
pred = pd.DataFrame({"ID":test_id, "price":y_pred1_dollar})
pred.to_csv("pricepredictions.csv", index=False, header=True)

In [None]:
xxxx = sum(train_pred['price']) - sum(train_pred['pred']) 

xxxx

#test_id = np.arange(7000, 10000, 1)
#pred = pd.DataFrame({"ID":test_id, "price":y_pred_dollar})
#pred.to_csv("pricepredictions.csv", index=False, header=True)

#train_pred.to_csv("train_pred.csv")

In [None]:
pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=42))


param_grid = {'randomforestregressor__n_estimators': [100, 200, 300],
              'randomforestregressor__max_depth': [1, 3, 5, 10, 20, None]
              }

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')

grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
best_params = grid_search.best_params_

print(best_model)
print(best_params)

In [None]:
print(xxx['price'] > 4000)

In [None]:
df

In [None]:
df7

In [None]:
# columns_to_drop = [
# 'name','description','neighborhood_overview','host_name','host_location','host_about',
# 'host_neighbourhood','neighbourhood','amenities','first_review','last_review',
    
# 'minimum_minimum_nights',
# 'maximum_minimum_nights',
# 'minimum_maximum_nights',
# 'maximum_maximum_nights',
# 'minimum_nights_avg_ntm',
# 'maximum_nights_avg_ntm',

# 'calculated_host_listings_count',
# 'calculated_host_listings_count_entire_homes',
# 'calculated_host_listings_count_private_rooms',
# 'calculated_host_listings_count_shared_rooms'
    
# ,'price'
# ,'host_since'
# ,'host_response_time'
# #,'host_response_rate'
# #,'host_acceptance_rate'
# ,'host_listings_count'
# ,'latitude'
# ,'longitude'
# #,'room_type'
# #,'accommodates'
# #,'bathrooms'
# #,'bedrooms'
# #,'beds'
# ,'minimum_nights'
# ,'maximum_nights'
# ,'availability_30'
# ,'availability_60'
# ,'availability_90'
# ,'availability_365'
# #,'number_of_reviews'
# #,'number_of_reviews_ltm'
# ,'number_of_reviews_l30d'
# #,'review_scores_rating'
# #,'review_scores_accuracy'
# #,'review_scores_cleanliness'
# #,'review_scores_checkin'
# #,'review_scores_communication'
# #,'review_scores_location'
# #,'review_scores_value'
# ,'email'
# ,'phone'
# ,'work_email'
# ,'amenity_count'
# ,'smoke_alarm'
# #,'kitchen'
# ,'essentials'
# ,'hangers'
# #,'wifi'
# ,'source_previous scrape'
# ,'host_is_superhost_t'
# ,'host_has_profile_pic_t'
# ,'host_identity_verified_t'
# #,'has_availability_t'
# ,'instant_bookable_t'
# #,'property_type'
# #,'neighbourhood_cleansed'
# ,'log_price'

# ]
