# Preamble 
This project was inspired by the work of [scrapfishies](https://github.com/scrapfishies).
# Prediction of rental prices  

For this project we scrape rental listings from Craigslist and assemble them into a dataset that we can analyze.

The study relies on the tools listed below:

#### **Web scraping**
* Requests
* BeautifulSoup 

#### **Tools** 
* Pandas
* Numpy

#### **Modeling** 
* sklearn
* statsmodels


#### **Visualizations** 
* Seaborn
* Matplotlib



# Step 1: Web scraping from Craigslist

* Scrape each listing page, collecting the date the ad was posted, its title, URL, rent amount, square footage, neighborhood and number of bedrooms.

* Append all of those listings together into a dataframe.

* Repeat the process until the desired number of listings is obtained.
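
The loop described above can be sketched as follows. This is only an illustration: `scrape_pages` and `fake_page` are hypothetical names, not the helpers from `scrape_cl.py`, and the fake page function stands in for a real request-and-parse step.

```python
from random import uniform
from time import sleep

import pandas as pd


def scrape_pages(scrape_page, n_pages, max_wait=0.0):
    """Call scrape_page(page_number) for each page and stack the rows.

    scrape_page is any callable returning a list of dicts (one per listing).
    """
    frames = []
    for page in range(1, n_pages + 1):
        rows = scrape_page(page)
        frames.append(pd.DataFrame(rows))
        # Pause a random amount of time between pages to avoid hammering the site
        sleep(uniform(0, max_wait))
    return pd.concat(frames, ignore_index=True)


def fake_page(page):
    # Stand-in for a real scrape: two made-up listings per page
    return [
        {"title": f"listing {page}-a", "price": 2000 + page},
        {"title": f"listing {page}-b", "price": 3000 + page},
    ]


listings = scrape_pages(fake_page, n_pages=3)
print(listings.shape)
```

Passing the page scraper in as an argument keeps the pagination and waiting logic separate from the HTML parsing, which makes the loop easy to try out without touching the network.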

First, let's import the libraries we'll use in this study. 
In this case we're going to scrape 25 pages of Craigslist apartment and housing listings.

**Date and time of scrape :** 

>22 January 2022

**Settings for scraping from Craigslist :**

> SF bay area **>** san francisco **>** housing **>** apartments / housing for rent


In [None]:
from bs4 import BeautifulSoup
import requests

import pandas as pd
import numpy as np

from random import randint
from time import sleep
%load scrape_cl.py
from scrape_cl import *

In [None]:
start_url = 'https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1'

In [None]:
import six
import sys
sys.modules['sklearn.externals.six'] = six
from mlxtend.classifier import StackingCVClassifier

In [None]:
sf=full_listings_scrape(start_url)

Scraping page 1 of 25...

Listing page scrape complete!
Number of postings scraped: 123

Individual posts scrape complete!
Number of posts scraped:  123

Page 1 of 25 scrape complete!

Scraping page 2 of 25...

Listing page scrape complete!
Number of postings scraped: 125

Individual posts scrape complete!
Number of posts scraped:  125

Page 2 of 25 scrape complete!

Scraping page 3 of 25...

Listing page scrape complete!
Number of postings scraped: 120

Individual posts scrape complete!
Number of posts scraped:  120

Page 3 of 25 scrape complete!

Scraping page 4 of 25...

Listing page scrape complete!
Number of postings scraped: 120

Individual posts scrape complete!
Number of posts scraped:  120

Page 4 of 25 scrape complete!

Scraping page 5 of 25...

Listing page scrape complete!
Number of postings scraped: 121

Individual posts scrape complete!
Number of posts scraped:  121

Page 5 of 25 scrape complete!

Scraping page 6 of 25...

Listing page scrape complete!
Number of postings 

In [None]:
#Drop extra index
sf = sf.drop(['index'], axis=1)

In [None]:
sf.tail()

Unnamed: 0,date,title,link,price,brs,sqft,hood,bath,amenities
3096,Jan 22,Top floor Richmond/USF studio private bath/Kit...,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,2200,1.0,400.0,richmond / seacliff,1Ba,"[flooring: wood, apartment, no laundry on site..."
3097,Jan 22,"Yoga Studio, Historic architectural detail, Ou...",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,1798,,,downtown / civic / van ness,1Ba,"[apartment, w/d in unit, no parking]"
3098,Jan 22,"VERY BRIGHT & NICE 3 BEDROOMS, 2 BATHS FLAT!!!",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,3700,3.0,,richmond / seacliff,2Ba,"[apartment, laundry in bldg, attached garage, ..."
3099,Jan 22,"Penthouse Flat by Beach and Park, Large Yard",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,2795,1.0,,sunset / parkside,1Ba,"[cats are OK - purrr, flooring: wood, flat, no..."
3100,Jan 22,"LARGE STUDIO, downtown, newly renovated, eat-i...",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,1725,,600.0,downtown / civic / van ness,1Ba,"[cats are OK - purrr, dogs are OK - wooof, flo..."


In [None]:
# Save our data to a CSV file
sf.to_csv('raw_sf_scrape.csv', index=False)

# Step 2 : Cleaning our Data

Thanks to the scraping process we now have a decent dataset to work with. The first move is to clean it as well as we can before moving on to the predictive part. 

Let's begin by: 

* Importing the libraries that we will use on our dataset

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

* Importing the Data

In [None]:
sf_r=pd.read_csv('raw_sf_scrape.csv')

* Type of data to be handled

In [None]:
sf_r.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3101 entries, 0 to 3100
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   date       3101 non-null   object 
 1   title      3101 non-null   object 
 2   link       3101 non-null   object 
 3   price      3101 non-null   int64  
 4   brs        2632 non-null   float64
 5   sqft       1437 non-null   float64
 6   hood       3101 non-null   object 
 7   bath       3099 non-null   object 
 8   amenities  3099 non-null   object 
dtypes: float64(2), int64(1), object(6)
memory usage: 218.2+ KB


* First and last rows of our dataset 

In [None]:
sf_r.head()

Unnamed: 0,date,title,link,price,brs,sqft,hood,bath,amenities
0,Jan 25,"Bright, immaculate 3 Bd Ba Apt. /1plus with Ba...",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,4200,3.0,,north beach / telegraph hill,1Ba,"['flooring: wood', 'apartment', 'laundry in bl..."
1,Jan 25,"Dog Park, Stainless Steel Appliances, Cable Re...",https://sfbay.craigslist.org/sfc/apa/d/saratog...,1280,2.0,1620.0,"***Saratoga, CA*** city of san francisco",1.5Ba,"['air conditioning', 'cats are OK - purrr', 'd..."
2,Jan 25,"Home in Ingleside, Urbano Dr., Single Family D...",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,5999,4.0,1999.0,ingleside / SFSU / CCSF,3Ba,"['application fee details: Application fee', '..."
3,Jan 25,"Brand new carpet, top flr, bright, 98 walk sco...",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,3995,3.0,1200.0,inner richmond,2Ba,"['flooring: carpet', 'apartment', 'laundry in ..."
4,Jan 25,Amazing SF Location Eastside Calhoun at Union,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,2950,1.0,635.0,north beach / telegraph hill,1Ba,"['flooring: wood', 'apartment', 'w/d in unit',..."


In [None]:
sf_r.tail()

Unnamed: 0,date,title,link,price,brs,sqft,hood,bath,amenities
3096,Jan 22,Top floor Richmond/USF studio private bath/Kit...,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,2200,1.0,400.0,richmond / seacliff,1Ba,"['flooring: wood', 'apartment', 'no laundry on..."
3097,Jan 22,"Yoga Studio, Historic architectural detail, Ou...",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,1798,,,downtown / civic / van ness,1Ba,"['apartment', 'w/d in unit', 'no parking']"
3098,Jan 22,"VERY BRIGHT & NICE 3 BEDROOMS, 2 BATHS FLAT!!!",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,3700,3.0,,richmond / seacliff,2Ba,"['apartment', 'laundry in bldg', 'attached gar..."
3099,Jan 22,"Penthouse Flat by Beach and Park, Large Yard",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,2795,1.0,,sunset / parkside,1Ba,"['cats are OK - purrr', 'flooring: wood', 'fla..."
3100,Jan 22,"LARGE STUDIO, downtown, newly renovated, eat-i...",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,1725,,600.0,downtown / civic / van ness,1Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."


* Duplicates are likely, since a user may post the same ad several times to give it a better chance of being seen by apartment hunters.

> Assuming the user keeps the same title, we build a new dataframe sorted by title so that repetitions are easy to spot.

In [None]:
sf=sf_r.drop(['date','link'],axis=1)

In [None]:
sf.head()

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
0,"Bright, immaculate 3 Bd Ba Apt. /1plus with Ba...",4200,3.0,,north beach / telegraph hill,1Ba,"['flooring: wood', 'apartment', 'laundry in bl..."
1,"Dog Park, Stainless Steel Appliances, Cable Re...",1280,2.0,1620.0,"***Saratoga, CA*** city of san francisco",1.5Ba,"['air conditioning', 'cats are OK - purrr', 'd..."
2,"Home in Ingleside, Urbano Dr., Single Family D...",5999,4.0,1999.0,ingleside / SFSU / CCSF,3Ba,"['application fee details: Application fee', '..."
3,"Brand new carpet, top flr, bright, 98 walk sco...",3995,3.0,1200.0,inner richmond,2Ba,"['flooring: carpet', 'apartment', 'laundry in ..."
4,Amazing SF Location Eastside Calhoun at Union,2950,1.0,635.0,north beach / telegraph hill,1Ba,"['flooring: wood', 'apartment', 'w/d in unit',..."


In [None]:
sf.sort_values("title",inplace=True)

In [None]:
sf.head()

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
439,! Rent this Single Room in Coliving Community!,1045,,90.0,SOMA / south beach,sharedBa,"['flooring: wood', 'furnished', 'apartment', '..."
2595,"""Move In Special"" Spacious Apartment in Sunny ...",2995,2.0,975.0,mission district,1Ba,"['flooring: carpet', 'apartment', 'laundry in ..."
2580,"""Move In Special"" Spacious Apartment in Sunny ...",2995,2.0,975.0,mission district,1Ba,"['flooring: carpet', 'apartment', 'laundry in ..."
1381,#146 Beautiful One Bedroom With Patio Availabl...,3920,1.0,805.0,hayes valley,1Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."
2955,#174 Amazing One Bedroom With Great Amenities ...,3870,1.0,805.0,hayes valley,1Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."


In [None]:
# keep=False removes every copy of a duplicated row, including the first one
sf.drop_duplicates(keep=False,inplace=True)  
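
The `keep` argument decides which copies survive. A small sketch on made-up rows (the data below is illustrative, not taken from the scrape) shows the difference between `keep=False`, used above, and `keep='first'`:

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["Sunny flat", "Sunny flat", "Top floor studio"],
    "price": [2995, 2995, 2200],
})

# keep=False drops every row that has a duplicate, including the first copy
no_copies = demo.drop_duplicates(keep=False)

# keep='first' keeps one representative of each duplicated listing
one_copy = demo.drop_duplicates(keep="first")

print(len(no_copies), len(one_copy))
```

With `keep=False` a reposted listing disappears entirely, so `keep='first'` may be preferable when the goal is to retain one copy of each listing.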

* Showing the size of our dataset after modification 

In [None]:
sf.shape

(2448, 7)

In [None]:
sf.head()

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
439,! Rent this Single Room in Coliving Community!,1045,,90.0,SOMA / south beach,sharedBa,"['flooring: wood', 'furnished', 'apartment', '..."
1381,#146 Beautiful One Bedroom With Patio Availabl...,3920,1.0,805.0,hayes valley,1Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."
2955,#174 Amazing One Bedroom With Great Amenities ...,3870,1.0,805.0,hayes valley,1Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."
2811,#21 Beautiful Townhome Now Available... Check ...,5105,2.0,1388.0,hayes valley,2.5Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."
312,#368 WOW! Beautiful Townhome With Amazing Amen...,4850,2.0,1388.0,hayes valley,2.5Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."


## Cleaning variable bathrooms
* In this phase we simplify the bathroom count by stripping any extra text from the numerical value      

In [None]:
sf.bath.unique()

array(['sharedBa', '1Ba', '2.5Ba', '2Ba', 'splitBa', '1.5Ba', '3Ba',
       '4Ba', '5Ba', '3.5Ba', '7Ba', '4.5Ba', nan, '5.5Ba', '6.5Ba'],
      dtype=object)

* Now that we know which values this variable can take, we make some changes to make it easier to use  

In [None]:
shared_baths=sf_r[(sf_r.bath=="sharedBa")]
shared_baths_links=list(shared_baths.link)

In [None]:
# Number of listings with the value 'sharedBa'
len(shared_baths_links)

18

* In order to better understand the *sharedBa* value let's investigate the links

In [None]:
for link in shared_baths_links[:5]:
  print(link)
  print("")

https://sfbay.craigslist.org/sfc/apa/d/1200-private-rm-not-share-for-rent/7431680580.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-cozy-studio-in-prime/7437214610.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-1050-special-private/7437205341.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-rent-this-single-room-in/7435784205.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-amazing-value-for-sro/7437116433.html



* A quick check shows that **sharedBa** denotes listings that are not full apartments, so we exclude them from our study  

In [None]:
sf=sf[sf.bath!='sharedBa']

* Let's now look at the changes we've made

In [None]:
sf.bath.unique()

array(['1Ba', '2.5Ba', '2Ba', 'splitBa', '1.5Ba', '3Ba', '4Ba', '5Ba',
       '3.5Ba', '7Ba', '4.5Ba', nan, '5.5Ba', '6.5Ba'], dtype=object)

* Another modification concerns **splitBa**, which we replace with the value 1  

In [None]:
sf["bath"]=sf["bath"].replace("splitBa",'1Ba')

In [None]:
sf.bath.unique()

array(['1Ba', '2.5Ba', '2Ba', '1.5Ba', '3Ba', '4Ba', '5Ba', '3.5Ba',
       '7Ba', '4.5Ba', nan, '5.5Ba', '6.5Ba'], dtype=object)

* Let's now look at the missing-value cases, which here are **nan** and **'0Ba'** 

In [None]:
sf.to_csv('sf_scrape.csv', index=False)

In [None]:
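# Caution: `sf_r.bath == np.nan` is always False (NaN never equals itself),
# so only '0Ba' can match here; `sf_r.bath.isna()` would select the NaN rows
miss_bath_info=sf_r[(sf_r.bath == np.nan)|(sf_r.bath=="0Ba")]
len(miss_bath_info)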

0
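
A comparison with `np.nan` never matches, because NaN is not equal to anything, itself included. The sketch below, on a made-up Series, contrasts `== np.nan` with `isna()`:

```python
import numpy as np
import pandas as pd

bath = pd.Series(["1Ba", np.nan, "0Ba", np.nan])

# Equality against NaN is always False, so this mask selects nothing
eq_mask = bath == np.nan

# isna() is the reliable way to detect missing values
na_mask = bath.isna()

print(eq_mask.sum(), na_mask.sum())

# Combined filter for missing or zero-bathroom rows
missing = bath[bath.isna() | (bath == "0Ba")]
print(len(missing))
```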

* As before, open the links to better understand what **nan** and **'0Ba'** mean

In [None]:
for link in miss_bath_info.link:
  print(link)
  print("")

* The posts specify that the place contains one bathroom, so we replace those values with 1

In [None]:
sf["bath"]=sf["bath"].replace(np.nan,'1Ba')
sf["bath"]=sf["bath"].replace('0Ba','1Ba')

In [None]:
sf.bath.unique()

array(['1Ba', '2.5Ba', '2Ba', '1.5Ba', '3Ba', '4Ba', '5Ba', '3.5Ba',
       '7Ba', '4.5Ba', '5.5Ba', '6.5Ba'], dtype=object)

* Now let's remove the suffix **'Ba'** so the column can be used in computations later on

In [None]:
sf["bath"]=sf["bath"].str.replace("Ba",'').astype(float)
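
The bathroom steps above can be gathered into one helper. This is a sketch under the same rules as the text (drop `sharedBa`; treat `splitBa`, `0Ba` and missing values as one bathroom); `clean_bath` is a hypothetical name and the sample rows are made up.

```python
import numpy as np
import pandas as pd


def clean_bath(df):
    """Return a copy of df with 'bath' converted to a float column."""
    out = df[df["bath"] != "sharedBa"].copy()
    # splitBa, 0Ba and missing values are all treated as a single bathroom
    out["bath"] = out["bath"].replace({"splitBa": "1Ba", "0Ba": "1Ba"}).fillna("1Ba")
    # Strip the 'Ba' suffix and convert to a number
    out["bath"] = out["bath"].str.replace("Ba", "", regex=False).astype(float)
    return out


demo = pd.DataFrame({"bath": ["1Ba", "2.5Ba", "sharedBa", "splitBa", np.nan]})
cleaned = clean_bath(demo)
print(cleaned["bath"].tolist())
```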

In [None]:
sf.head()

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
1381,#146 Beautiful One Bedroom With Patio Availabl...,3920,1.0,805.0,hayes valley,1.0,"['cats are OK - purrr', 'dogs are OK - wooof',..."
2955,#174 Amazing One Bedroom With Great Amenities ...,3870,1.0,805.0,hayes valley,1.0,"['cats are OK - purrr', 'dogs are OK - wooof',..."
2811,#21 Beautiful Townhome Now Available... Check ...,5105,2.0,1388.0,hayes valley,2.5,"['cats are OK - purrr', 'dogs are OK - wooof',..."
312,#368 WOW! Beautiful Townhome With Amazing Amen...,4850,2.0,1388.0,hayes valley,2.5,"['cats are OK - purrr', 'dogs are OK - wooof',..."
902,#370 Beautiful Townhome With Amazing Amenities...,4750,2.0,1388.0,hayes valley,2.5,"['cats are OK - purrr', 'dogs are OK - wooof',..."


## Dealing with missing values
* To improve the predictions that come later, we deal with the missing values now  

In [None]:
sf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2436 entries, 1381 to 579
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   title      2436 non-null   object 
 1   price      2436 non-null   int64  
 2   brs        2045 non-null   float64
 3   sqft       1198 non-null   float64
 4   hood       2436 non-null   object 
 5   bath       2436 non-null   float64
 6   amenities  2434 non-null   object 
dtypes: float64(3), int64(1), object(3)
memory usage: 152.2+ KB


* Let's first see which variables contain missing values

In [None]:
sf.isnull().sum()

title           0
price           0
brs           391
sqft         1238
hood            0
bath            0
amenities       2
dtype: int64
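
Expressed as shares, the counts above give a quick sense of how much of each column is missing. The arithmetic below simply reuses the figures printed above (2436 rows in total):

```python
import pandas as pd

# Missing counts as printed above, for a dataset of 2436 rows
missing = pd.Series({"brs": 391, "sqft": 1238, "amenities": 2})
n_rows = 2436

# Percentage of rows with a missing value, per column
share = (missing / n_rows * 100).round(1)
print(share)
```

On the real frame the same percentages can be obtained with `sf.isna().mean() * 100`.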

* To begin, we keep only the rows where **sqft** is not missing 

In [None]:
sf=sf[sf["sqft"].notna()]
sf.shape

(1198, 7)

In [None]:
sf.head()

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
1381,#146 Beautiful One Bedroom With Patio Availabl...,3920,1.0,805.0,hayes valley,1.0,"['cats are OK - purrr', 'dogs are OK - wooof',..."
2955,#174 Amazing One Bedroom With Great Amenities ...,3870,1.0,805.0,hayes valley,1.0,"['cats are OK - purrr', 'dogs are OK - wooof',..."
2811,#21 Beautiful Townhome Now Available... Check ...,5105,2.0,1388.0,hayes valley,2.5,"['cats are OK - purrr', 'dogs are OK - wooof',..."
312,#368 WOW! Beautiful Townhome With Amazing Amen...,4850,2.0,1388.0,hayes valley,2.5,"['cats are OK - purrr', 'dogs are OK - wooof',..."
902,#370 Beautiful Townhome With Amazing Amenities...,4750,2.0,1388.0,hayes valley,2.5,"['cats are OK - purrr', 'dogs are OK - wooof',..."


* Let's focus now on the **bedrooms**

In [None]:
sf.brs.unique()

array([ 1.,  2., nan,  3.,  4.,  5.,  8.,  7.,  6.])

In [None]:
len(sf.brs.unique())

9

* Let's see how many **NaN** values we have 

In [None]:
miss_brs=sf[sf.brs.isnull()]

In [None]:
miss_brs.head(3)

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
1042,"$1,950 Newly remodeled studio",1950,,470.0,mission district,1.0,"['flooring: wood', 'apartment', 'no laundry on..."
569,$1950 TOP FL Private SUNNY Quiet Heart Hayes V...,1950,,400.0,hayes valley,1.0,"['flooring: wood', 'apartment', 'laundry in bl..."
2269,$1950 TOP FL Private SUNNY Quiet Heart Hayes V...,1950,,400.0,hayes valley,1.0,"['flooring: wood', 'apartment', 'laundry in bl..."


In [None]:
print('Number of missing bedroom rows :',len(miss_brs))

Number of missing bedroom rows : 167


* As before, let's find out what these missing values mean by checking the links 

In [None]:
br_nan=sf_r[sf_r['brs'].isnull()]
br_nan_links=list(br_nan.link)

In [None]:
for link in br_nan_links[:4]:
  print(link)
  print("")

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-nice-studio-available-now/7436185394.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-large-remodeled-art-deco/7433050028.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-bright-top-floor-studio/7432088404.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-top-floor-renovated-and/7434393942.html



* The bedroom field was left empty, but the information is often still present in the title. Let's make use of that

In [None]:
def replace_missing_brs(post_title):
    # Infer the bedroom count from the listing title; returns None when nothing matches
    lowered = post_title.lower()
    if 'studio' in lowered:
        return 0
    # Strip spaces so that '1 BR' and '1BR' match the same pattern
    compact = lowered.replace(' ', '')
    # Patterns are checked in this order; the first match wins
    patterns = [('1br', 1), ('1bed', 1), ('onebed', 1), ('2br', 2), ('3br', 3),
                ('4br', 4), ('4bd', 4), ('4bed', 4)]
    for pattern, beds in patterns:
        if pattern in compact:
            return beds
    return None

* Before applying the function, let's fill the NaN values with the string **'missing'**   

In [None]:
sf["brs"]=sf['brs'].fillna('missing')

In [None]:
sf["beds"]=sf.apply(lambda row : replace_missing_brs(row['title']) 
                                                    if row['brs'] == 'missing' 
                                                    else row['brs'], axis=1)
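
An alternative to the `'missing'` placeholder is to fill the gaps directly with `fillna`, which keeps the column numeric. The sketch below also swaps the chain of `elif` tests for a regular expression. It is an illustrative variant, not the function used above: `beds_from_title` is a hypothetical name, the sample rows are made up, and because the pattern accepts any number of bedrooms it would not reproduce the recovery figures of this notebook exactly.

```python
import re

import numpy as np
import pandas as pd

# A number followed by br / bd / bed / bedroom (optionally plural)
BEDS_RE = re.compile(r"(\d+)\s*(?:br|bd|bed(?:room)?)s?\b", re.IGNORECASE)


def beds_from_title(title):
    """Guess the bedroom count from a title; NaN when nothing matches."""
    if "studio" in title.lower():
        return 0.0
    match = BEDS_RE.search(title)
    return float(match.group(1)) if match else np.nan


demo = pd.DataFrame({
    "title": ["Bright studio", "Large 2 BR flat", "Great location", "3bed house"],
    "brs": [np.nan, np.nan, np.nan, 3.0],
})

# Only rows where brs is missing are filled from the title
demo["beds"] = demo["brs"].fillna(demo["title"].map(beds_from_title))
print(demo["beds"].tolist())
```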

In [None]:
sf.head()

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities,beds
1381,#146 Beautiful One Bedroom With Patio Availabl...,3920,1,805.0,hayes valley,1.0,"['cats are OK - purrr', 'dogs are OK - wooof',...",1.0
2955,#174 Amazing One Bedroom With Great Amenities ...,3870,1,805.0,hayes valley,1.0,"['cats are OK - purrr', 'dogs are OK - wooof',...",1.0
2811,#21 Beautiful Townhome Now Available... Check ...,5105,2,1388.0,hayes valley,2.5,"['cats are OK - purrr', 'dogs are OK - wooof',...",2.0
312,#368 WOW! Beautiful Townhome With Amazing Amen...,4850,2,1388.0,hayes valley,2.5,"['cats are OK - purrr', 'dogs are OK - wooof',...",2.0
902,#370 Beautiful Townhome With Amazing Amenities...,4750,2,1388.0,hayes valley,2.5,"['cats are OK - purrr', 'dogs are OK - wooof',...",2.0


In [None]:
miss_brs_mod=sf[sf['beds'].isnull()]

In [None]:
print("Number of missing bedrooms at start: ", len(miss_brs))
print("Number of missing bedrooms after title parse: ",len(miss_brs_mod))
print("Number of recovered bedrooms: ", len(miss_brs) - len(miss_brs_mod))
print("Percent recovered: ", ((len(miss_brs) - len(miss_brs_mod)) / len(miss_brs)) * 100, '%')

Number of missing bedrooms at start:  167
Number of missing bedrooms after title parse:  49
Number of recovered bedrooms:  118
Percent recovered:  70.65868263473054 %



* We recovered the bedroom count for 118 listings; now let's drop the rows that are still missing it

In [None]:
sf = sf[['title', 'price', 'sqft', 'beds', 'bath', 'hood', 'amenities']]

In [None]:
sf=sf[sf['beds'].notna()]

* Now let's check our cleaned data

In [None]:
sf.head()

Unnamed: 0,title,price,sqft,beds,bath,hood,amenities
1381,#146 Beautiful One Bedroom With Patio Availabl...,3920,805.0,1.0,1.0,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',..."
2955,#174 Amazing One Bedroom With Great Amenities ...,3870,805.0,1.0,1.0,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',..."
2811,#21 Beautiful Townhome Now Available... Check ...,5105,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',..."
312,#368 WOW! Beautiful Townhome With Amazing Amen...,4850,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',..."
902,#370 Beautiful Townhome With Amazing Amenities...,4750,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',..."


## Amenities re-adaptation

* As can be seen, the amenities column packs many pieces of information into a single value. In this part we improve our dataset by separating them into new columns. To do so we use the **ast** module   

In [None]:
from ast import literal_eval

In [None]:
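def f(x):
    # Parse the string representation of a list back into a Python list
    try:
        return literal_eval(str(x))   
    except Exception as e:
        # Values that are not valid literals (e.g. NaN) fall back to an empty list
        print(e)
        return []

sf['amens_list'] = sf.amenities.apply(f)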

malformed node or string: <_ast.Name object at 0x7f9eb14f6ad0>
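
The message printed above is what `literal_eval` raises on a value that is not a Python literal. This is most likely a row where `amenities` is missing, since `str(nan)` gives the bare name `nan`. A small sketch with made-up inputs (`parse_amenities` is a hypothetical name):

```python
from ast import literal_eval


def parse_amenities(value):
    """Turn the string form of a list back into a list; [] if parsing fails."""
    try:
        return literal_eval(str(value))
    except (ValueError, SyntaxError):
        return []


# A list written to CSV comes back as its string representation
as_text = "['flooring: wood', 'apartment', 'w/d in unit']"
parsed = parse_amenities(as_text)
print(type(parsed).__name__, len(parsed))

# A missing value becomes the bare name nan, which is not a literal
print(parse_amenities(float("nan")))
```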


In [None]:
sf.head()

Unnamed: 0,title,price,sqft,beds,bath,hood,amenities,amens_list
1381,#146 Beautiful One Bedroom With Patio Availabl...,3920,805.0,1.0,1.0,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...","[cats are OK - purrr, dogs are OK - wooof, apa..."
2955,#174 Amazing One Bedroom With Great Amenities ...,3870,805.0,1.0,1.0,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...","[cats are OK - purrr, dogs are OK - wooof, apa..."
2811,#21 Beautiful Townhome Now Available... Check ...,5105,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...","[cats are OK - purrr, dogs are OK - wooof, apa..."
312,#368 WOW! Beautiful Townhome With Amazing Amen...,4850,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...","[cats are OK - purrr, dogs are OK - wooof, apa..."
902,#370 Beautiful Townhome With Amazing Amenities...,4750,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...","[cats are OK - purrr, dogs are OK - wooof, apa..."


* Let's test our results before the big changes

In [None]:
amen_eg=sf.loc[1381,'amens_list']

In [None]:
type(amen_eg)

list

In [None]:
'condo' in amen_eg

False

* We can now create columns in order to better separate the values in **amenities**

In [None]:
# Preview without 'amenities'; the result is not assigned, so sf keeps the column
sf.drop(['amenities'],axis=1)

Unnamed: 0,title,price,sqft,beds,bath,hood,amens_list
1381,#146 Beautiful One Bedroom With Patio Availabl...,3920,805.0,1.0,1.0,hayes valley,"[cats are OK - purrr, dogs are OK - wooof, apa..."
2955,#174 Amazing One Bedroom With Great Amenities ...,3870,805.0,1.0,1.0,hayes valley,"[cats are OK - purrr, dogs are OK - wooof, apa..."
2811,#21 Beautiful Townhome Now Available... Check ...,5105,1388.0,2.0,2.5,hayes valley,"[cats are OK - purrr, dogs are OK - wooof, apa..."
312,#368 WOW! Beautiful Townhome With Amazing Amen...,4850,1388.0,2.0,2.5,hayes valley,"[cats are OK - purrr, dogs are OK - wooof, apa..."
902,#370 Beautiful Townhome With Amazing Amenities...,4750,1388.0,2.0,2.5,hayes valley,"[cats are OK - purrr, dogs are OK - wooof, apa..."
...,...,...,...,...,...,...,...
2376,♣♣♣♣ LARGE STUDIO 450 SQ FT ♣♣♣♣,1425,450.0,0.0,1.0,downtown / civic / van ness,"[cats are OK - purrr, flooring: wood, apartmen..."
111,♤ ♤ ♤Huge Rooms & Ideal Location - Safe & Conv...,5200,2200.0,5.0,2.0,inner sunset / UCSF,"[flat, laundry in bldg, no smoking, attached g..."
2827,"❤️❤️Fantastic Furnished, 1 BR/ 1 BA, Laundry, ...",3495,600.0,1.0,1.0,castro / upper market,"[cats are OK - purrr, dogs are OK - wooof, fur..."
1144,"❤️❤️Million Dollar Views! Furnished, Laundry, ...",3495,600.0,1.0,1.0,twin peaks / diamond hts,"[cats are OK - purrr, dogs are OK - wooof, flo..."


In [None]:
sf.head()

Unnamed: 0,title,price,sqft,beds,bath,hood,amenities,amens_list
1381,#146 Beautiful One Bedroom With Patio Availabl...,3920,805.0,1.0,1.0,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...","[cats are OK - purrr, dogs are OK - wooof, apa..."
2955,#174 Amazing One Bedroom With Great Amenities ...,3870,805.0,1.0,1.0,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...","[cats are OK - purrr, dogs are OK - wooof, apa..."
2811,#21 Beautiful Townhome Now Available... Check ...,5105,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...","[cats are OK - purrr, dogs are OK - wooof, apa..."
312,#368 WOW! Beautiful Townhome With Amazing Amen...,4850,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...","[cats are OK - purrr, dogs are OK - wooof, apa..."
902,#370 Beautiful Townhome With Amazing Amenities...,4750,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...","[cats are OK - purrr, dogs are OK - wooof, apa..."


## Laundry
* We will group this variable into three different classes

In [None]:
def laundry_parse(amen_list):
    if 'w/d in unit' in amen_list:
        return '(a) in-unit'
    elif 'laundry in bldg' in amen_list:
        return '(b) on-site'
    elif 'laundry on site' in amen_list:
        return '(b) on-site'
    else:
        return '(c) no laundry'

In [None]:
sf['laundry'] = sf['amens_list'].apply(lambda amen_list: laundry_parse(amen_list))

## Pets 
* Here we group the listings into four categories

In [None]:
def pets_allowed(amen_list):
    if 'dogs are OK - wooof' in amen_list and 'cats are OK - purrr' in amen_list:
        return '(a) both'
    elif 'dogs are OK - wooof' in amen_list:
        return '(b) dogs'
    elif 'cats are OK - purrr' in amen_list:
        return '(c) cats'
    else:
        return '(d) no pets'

In [None]:
sf['pets']=sf['amens_list'].apply(lambda amen_list: pets_allowed(amen_list))

## Housing Type
We now group the housing types into the following three categories

In [None]:
def housing_type(amen_list):
    if 'cottage/cabin' in amen_list:
        return '(a) single'
    elif 'duplex' in amen_list:
        return '(b) double'
    elif 'house' in amen_list:
        return '(a) single'
    elif 'in-law' in amen_list:
        return '(b) double'
    elif 'townhouse' in amen_list:
        return '(a) single'
    else:
        return '(c) multi'

In [None]:
sf['housing_type']=sf['amens_list'].apply(lambda amen_list:housing_type(amen_list))

## Parking
* This variable will be grouped into four categories

In [None]:
def parking_situation(amen_list):
    if 'attached garage' in amen_list:
        return '(b) protected'
    elif 'valet parking' in amen_list:
        return '(a) valet'
    elif 'carport' in amen_list:
        return '(b) protected'
    elif 'detached garage' in amen_list:
        return '(b) protected'
    elif 'off-street parking' in amen_list:
        return '(c) off-street'
    else:
        return '(d) no parking'

In [None]:
sf['parking']=sf['amens_list'].apply(lambda amen_list: parking_situation(amen_list))
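
The four parsers above share one pattern: test a few amenity strings in priority order and fall back to a default. A generic helper makes that pattern explicit. This is a sketch, not the code used above; `first_match` and `LAUNDRY_RULES` are hypothetical names.

```python
def first_match(amen_list, rules, default):
    """Return the label of the first rule whose amenity is in amen_list."""
    for amenity, label in rules:
        if amenity in amen_list:
            return label
    return default


# Rules are checked in order, so the highest-priority option comes first
LAUNDRY_RULES = [
    ("w/d in unit", "(a) in-unit"),
    ("laundry in bldg", "(b) on-site"),
    ("laundry on site", "(b) on-site"),
]

print(first_match(["apartment", "laundry in bldg"], LAUNDRY_RULES, "(c) no laundry"))
# Membership in a list is an exact match, so 'no laundry on site'
# does not trigger the 'laundry on site' rule
print(first_match(["no laundry on site"], LAUNDRY_RULES, "(c) no laundry"))
```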

In [None]:
sf=sf.drop(['amens_list'],axis=1)

In [None]:
sf.shape

(1149, 11)

In [None]:
sf.head()

Unnamed: 0,title,price,sqft,beds,bath,hood,amenities,laundry,pets,housing_type,parking
1381,#146 Beautiful One Bedroom With Patio Availabl...,3920,805.0,1.0,1.0,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected
2955,#174 Amazing One Bedroom With Great Amenities ...,3870,805.0,1.0,1.0,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected
2811,#21 Beautiful Townhome Now Available... Check ...,5105,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected
312,#368 WOW! Beautiful Townhome With Amazing Amen...,4850,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected
902,#370 Beautiful Townhome With Amazing Amenities...,4750,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected


##  Neighborhoods
* The main objective of this part is to group the neighborhoods into a smaller number of districts, so that the regression models do not have to handle too many categories

In [None]:
cl_locations = ['alamo square / nopa', 'bayview', 'bernal heights', 
               'castro / upper market', 'cole valley / ashbury hts','downtown / civic / van ness',
               'excelsior / outer mission','financial district','glen park','haight ashbury','hayes valley',
               'ingleside / SFSU / CCSF','inner richmond','inner sunset / UCSF', 'laurel hts / presidio',
               'lower haight','lower nob hill','lower pac hts','marina / cow hollow','mission district',
               'nob hill','noe valley','north beach / telegraph hill','pacific heights','portola district',
               'potrero hill','richmond / seacliff', 'russian hill','SOMA / south beach','sunset / parkside',
               'tenderloin','treasure island','twin peaks / diamond hts','USF / panhandle','visitacion valley',
               'west portal / forest hill', 'western addition']

In [None]:
len(cl_locations)

37

In [None]:
hood_list=list(sf.hood.unique())

In [None]:
print("Number of unique neighborhoods provided: ", len(hood_list))
print("Number of NON-CL Hoods provided: ", len(hood_list) - len(cl_locations))

Number of unique neighborhoods provided:  71
Number of NON-CL Hoods provided:  34


In [None]:
extra_hoods = []

for hood in hood_list:
    if hood not in cl_locations:
        extra_hoods.append(hood)
        
print(extra_hoods)   

['San Francsico city of san francisco ', 'San Francisco city of san francisco ', 'russian hill city of san francisco ', 'San Francisco Civic Center city of san francisco ', ' city of san francisco ', 'Dog Patch city of san francisco ', '601 Niantic Ave Daly City, CA city of san francisco ', 'Los Banos city of san francisco ', '2228 Union St city of san francisco ', 'South of Market city of san francisco ', 'SAN JOSE city of san francisco ', 'SoMa city of san francisco ', '801 E Jones St city of san francisco ', 'Pacific Heights city of san francisco ', 'Bernal Heights city of san francisco ', 'Outer Richmond District city of san francisco ', 'Geary @ 41st Ave. city of san francisco ', 'Mission & Cortland city of san francisco ', 'Richmond District city of san francisco ', 'Ocean Ave city of san francisco ', '***Saratoga, CA*** city of san francisco ', 'Mission Bay city of san francisco ', 'Sausalito city of san francisco ', 'Showplace Square city of san francisco ', 'South Park city of
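
Many of the free-text values above end with the suffix ` city of san francisco `, so they cannot match a cleaned neighborhood name directly. One possible normalization, shown here as an assumption rather than the method used in this notebook, strips that suffix and the surrounding whitespace; a set difference then lists the values that remain unknown. `normalize_hood` and the sample values are illustrative.

```python
SUFFIX = "city of san francisco"


def normalize_hood(hood):
    """Strip whitespace and a trailing 'city of san francisco' suffix."""
    cleaned = hood.strip()
    if cleaned.endswith(SUFFIX):
        cleaned = cleaned[: -len(SUFFIX)].strip()
    return cleaned


known = {"hayes valley", "SoMa", "Mission Bay"}
raw = ["hayes valley", "SoMa city of san francisco ", "SAN JOSE city of san francisco "]

cleaned = [normalize_hood(h) for h in raw]
# Values that are still not recognized after normalization
unknown = sorted(set(cleaned) - known)
print(cleaned)
print(unknown)
```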

* Now we will group the different neighborhoods by district key 

In [None]:
sf_map_dict = {1: ['inner richmond', 'richmond / seacliff', 'San Franciso Richmond District' 
               ],
           2: ['inner sunset / UCSF', 'sunset / parkside', 'Golden Gate Heights, San Francisco' 
              ], 
           3: ['ingleside / SFSU / CCSF' 
              ], 
           4: ['twin peaks / diamond hts', 'west portal / forest hill', 'San Francisco/SunnySide'
              ], 
           5: ['alamo square / nopa', 'castro / upper market', 'cole valley / ashbury hts', 
               'glen park', 'haight ashbury', 'noe valley', 'The Castro', 'Glen Park',
               'North Panhandle', 'Eureka Valley' 
              ], 
           6: ['hayes valley', 'lower haight', 'USF / panhandle', 'western addition', 
              'NOPA', 'Hayes Valley', 'Lower Pacific Heights'
              ], 
           7: ['laurel hts / presidio', 'lower pac hts', 'marina / cow hollow', 'pacific heights',
              'Pacific Heights', 'Marina District', 'Marina'
              ], 
           8: ['downtown / civic / van ness', 'financial district', 'lower nob hill', 'nob hill', 
              'north beach / telegraph hill', 'russian hill', 'tenderloin', 'Nob Hill', 'Lower Nob Hill', 
               'North Beach', 'Civic Center, Downtown, Van Ness', 'Telegraph Hill', 
               'Embarcadero / North Waterfront', 'San Francisco North Waterfront' 
              ], 
           9: ['bernal heights', 'mission district', 'potrero hill', 'SOMA / south beach', 
              'Mission Bay', 'Mission District', 'South Park', 'South Beach', 'Rincon Hill, San Francisco', 
              'Rincon Hill', 'SoMa', 'SOMA', 'South of Market' 
              ], 
           10: ['bayview', 'excelsior / outer mission', 'portola district', 
                'visitacion valley', 'Bayview' 
               ]
           }

In [None]:
sf.head()

Unnamed: 0,title,price,sqft,beds,bath,hood,amenities,laundry,pets,housing_type,parking
1381,#146 Beautiful One Bedroom With Patio Availabl...,3920,805.0,1.0,1.0,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected
2955,#174 Amazing One Bedroom With Great Amenities ...,3870,805.0,1.0,1.0,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected
2811,#21 Beautiful Townhome Now Available... Check ...,5105,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected
312,#368 WOW! Beautiful Townhome With Amazing Amen...,4850,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected
902,#370 Beautiful Townhome With Amazing Amenities...,4750,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected


In [None]:
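def set_hood_district(hood):
  """Return the district key for a neighborhood, or None if it is not in sf_map_dict."""
  for dist,hood_list in sf_map_dict.items():
    if hood in hood_list:
      return dist
  return None  # unmatched neighborhoods end up as missing values in the new column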

In [None]:
sf['hood_district'] = sf['hood'].apply(set_hood_district)
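
The lookup above scans every neighborhood list for each row. As a possible alternative, here is a minimal sketch that inverts the dictionary once and lets `Series.map` do the lookup. The names `toy_map`, `hood_to_district` and `toy` are illustrative only, and the dictionary is a small subset, not the full mapping used in this notebook.

```python
import pandas as pd

# Toy subset of the district dictionary (illustrative, not the full mapping)
toy_map = {6: ['hayes valley', 'lower haight'],
           8: ['nob hill', 'russian hill']}

# Invert once: neighborhood name -> district key
hood_to_district = {hood: dist
                    for dist, hoods in toy_map.items()
                    for hood in hoods}

toy = pd.DataFrame({'hood': ['hayes valley', 'nob hill', 'Sausalito']})

# Neighborhoods absent from the dictionary are mapped to NaN
toy['hood_district'] = toy['hood'].map(hood_to_district)
print(toy)
```

As with `set_hood_district`, neighborhoods outside the dictionary come out as missing values, which is what the next section filters on.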

In [None]:
sf.head()

Unnamed: 0,title,price,sqft,beds,bath,hood,amenities,laundry,pets,housing_type,parking,hood_district
1381,#146 Beautiful One Bedroom With Patio Availabl...,3920,805.0,1.0,1.0,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected,6.0
2955,#174 Amazing One Bedroom With Great Amenities ...,3870,805.0,1.0,1.0,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected,6.0
2811,#21 Beautiful Townhome Now Available... Check ...,5105,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected,6.0
312,#368 WOW! Beautiful Townhome With Amazing Amen...,4850,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected,6.0
902,#370 Beautiful Townhome With Amazing Amenities...,4750,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected,6.0


## Dropping rows with missing values

In [None]:
sf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1149 entries, 1381 to 579
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   title          1149 non-null   object 
 1   price          1149 non-null   int64  
 2   sqft           1149 non-null   float64
 3   beds           1149 non-null   float64
 4   bath           1149 non-null   float64
 5   hood           1149 non-null   object 
 6   amenities      1148 non-null   object 
 7   laundry        1149 non-null   object 
 8   pets           1149 non-null   object 
 9   housing_type   1149 non-null   object 
 10  parking        1149 non-null   object 
 11  hood_district  1091 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 156.7+ KB


In [None]:
sf.isnull().sum()

title             0
price             0
sqft              0
beds              0
bath              0
hood              0
amenities         1
laundry           0
pets              0
housing_type      0
parking           0
hood_district    58
dtype: int64

In [None]:
sf=sf[sf['hood_district'].notna()]
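
For reference, the same filter can be written with `DataFrame.dropna` restricted to one column. The sketch below uses a made-up frame (`toy`) to show that both forms keep the same rows; it is an illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Made-up rows: the second neighborhood has no district
toy = pd.DataFrame({'hood': ['hayes valley', 'Sausalito', 'nob hill'],
                    'hood_district': [6.0, np.nan, 8.0]})

# Boolean mask, as used in the cell above
by_mask = toy[toy['hood_district'].notna()]

# Equivalent: drop rows where this specific column is missing
by_dropna = toy.dropna(subset=['hood_district'])

print(len(toy), len(by_dropna))
```

Restricting `subset` to `hood_district` matters here: a plain `dropna()` would also remove the row whose `amenities` value is missing.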

In [None]:
sf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1091 entries, 1381 to 579
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   title          1091 non-null   object 
 1   price          1091 non-null   int64  
 2   sqft           1091 non-null   float64
 3   beds           1091 non-null   float64
 4   bath           1091 non-null   float64
 5   hood           1091 non-null   object 
 6   amenities      1090 non-null   object 
 7   laundry        1091 non-null   object 
 8   pets           1091 non-null   object 
 9   housing_type   1091 non-null   object 
 10  parking        1091 non-null   object 
 11  hood_district  1091 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 110.8+ KB


# Outliers
* Before developing a prediction model, we check the outliers to see whether the posts are legitimate

In [None]:
sf.describe()

Unnamed: 0,price,sqft,beds,bath,hood_district
count,1091.0,1091.0,1091.0,1091.0,1091.0
mean,3930.183318,961.496792,1.648946,1.384051,6.909258
std,2235.911325,502.823954,1.047735,0.616354,2.582584
min,225.0,1.0,0.0,1.0,1.0
25%,2702.5,633.0,1.0,1.0,6.0
50%,3500.0,830.0,2.0,1.0,8.0
75%,4500.0,1200.0,2.0,2.0,9.0
max,28500.0,5530.0,8.0,6.5,10.0


## Price
* The table above shows that **75%** of the listings are priced at or below **4500**, while the maximum value is **28500**. Let's check whether the high-priced listings are accurate
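
A note on reading `describe()`: the 75% row is the value that three quarters of the observations fall at or below. The sketch below checks this on made-up prices (`toy_prices` is illustrative, not scraped data).

```python
import pandas as pd

# Made-up prices with one very large value
toy_prices = pd.Series([1000, 2000, 3000, 4000, 20000])

# Third quartile, the same statistic describe() reports as "75%"
q75 = toy_prices.quantile(0.75)

# Share of observations at or below the third quartile
share_at_or_below = (toy_prices <= q75).mean()

print(q75, share_at_or_below)
```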

In [None]:
high_price_sf=sf[sf.price>10000]

In [None]:
len(high_price_sf)

16

In [None]:
high_price_sf.head()

Unnamed: 0,title,price,sqft,beds,bath,hood,amenities,laundry,pets,housing_type,parking,hood_district
445,1355 Lombard #300/Fort Mason/Lombard Shopping-...,12995,2100.0,3.0,3.0,russian hill,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected,8.0
1603,"181 FREMONT 2bed, 2.5baths & Office on 58th floor",24000,2147.0,2.0,2.5,SOMA / south beach,"['EV charging', 'air conditioning', 'applicati...",(a) in-unit,(a) both,(c) multi,(a) valet,9.0
459,"High Floor Corner Unit At Ritz, Downtown Views...",11500,2100.0,3.0,3.0,downtown / civic / van ness,"['application fee details: $40 per person', 'c...",(a) in-unit,(a) both,(c) multi,(a) valet,8.0
171,Inner Mission 3-Level Victorian Home; 4BD/4.5B...,12000,2210.0,4.0,4.5,mission district,"['application fee details: $30 Per Applicant',...",(a) in-unit,(d) no pets,(a) single,(b) protected,9.0
677,"Luxury 5 Bed, 3.5 Bath Home Coming Soon!",12500,2850.0,5.0,3.5,mission district,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(a) single,(b) protected,9.0


* To investigate these listings we need their **Craigslist** links. For that we go back to **sf_r**, the raw dataset from before the link column was dropped

In [None]:
high_prices=sf_r[sf_r.price>10000]

In [None]:
len(high_prices)

23

In [None]:
high_prices.describe()

Unnamed: 0,price,brs,sqft
count,23.0,23.0,18.0
mean,16270.478261,3.478261,2713.833333
std,5714.312158,1.16266,1085.017064
min,10496.0,2.0,1900.0
25%,12000.0,3.0,2111.75
50%,13995.0,3.0,2240.0
75%,17997.5,3.0,2890.5
max,29500.0,7.0,5530.0


In [None]:
high_prices.head()

Unnamed: 0,date,title,link,price,brs,sqft,hood,bath,amenities
21,Jan 25,"Luxury house in Sea Cliff with 3 levels, 5 bed...",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,16500,5.0,2904.0,richmond / seacliff,4Ba,"['furnished', 'house', 'w/d in unit', 'no smok..."
32,Jan 25,Stunning Kent Woodlands estate on a 4.3-acre lot,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,25950,6.0,5530.0,pacific heights,6.5Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."
143,Jan 24,"Penthouse with Amazing Features! DECK - Views,...",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,16995,3.0,3192.0,nob hill,3Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."
171,Jan 24,Inner Mission 3-Level Victorian Home; 4BD/4.5B...,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,12000,4.0,2210.0,mission district,4.5Ba,"['application fee details: $30 Per Applicant',..."
172,Jan 24,Wonderful 3BR/3.5BA Home w/ Water/GG Bridge Vi...,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,12800,3.0,2334.0,marina / cow hollow,3.5Ba,"['application fee details: $30 Per Applicant',..."


In [None]:
price_28500=high_prices[high_prices.price==28500]

In [None]:
price_28500

Unnamed: 0,date,title,link,price,brs,sqft,hood,bath,amenities
1994,Jan 23,Marina Blvd - 7 Bedroom + Garage + Patio + Views!,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,28500,7.0,5410.0,marina / cow hollow,5.5Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."


* After checking the listing with the maximum rent, it looks legitimate, so we keep all the high-priced listings for our study

## Square footage

In [None]:
high_sqft_sf=sf[sf.sqft>5000]

In [None]:
len(high_sqft_sf)

2

In [None]:
high_sqft_sf.head()

Unnamed: 0,title,price,sqft,beds,bath,hood,amenities,laundry,pets,housing_type,parking,hood_district
1994,Marina Blvd - 7 Bedroom + Garage + Patio + Views!,28500,5410.0,7.0,5.5,marina / cow hollow,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected,7.0
32,Stunning Kent Woodlands estate on a 4.3-acre lot,25950,5530.0,6.0,6.5,pacific heights,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(a) single,(b) protected,7.0


* To narrow the check, we look at listings with a square footage above **3K** (the 75th percentile is 1200) and below **5K**

In [None]:
high_sqft_sf=sf[(sf.sqft>3000)&(sf.sqft<5000)]

In [None]:
len(high_sqft_sf)
high_sqft_sf.head()

Unnamed: 0,title,price,sqft,beds,bath,hood,amenities,laundry,pets,housing_type,parking,hood_district
560,4 beds 5 baths with home theater almost 4000ftHi,8800,3854.0,4.0,5.0,west portal / forest hill,"['house', 'laundry in bldg', 'no smoking', 'at...",(b) on-site,(d) no pets,(a) single,(b) protected,4.0
703,"Gorgeous Bay View GG Bridge&Alcatraz, Beautifu...",9550,3300.0,3.0,3.0,pacific heights,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(d) no parking,7.0
143,"Penthouse with Amazing Features! DECK - Views,...",16995,3192.0,3.0,3.0,nob hill,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(d) no parking,8.0
49,"Spectacular 2-bedroom, 2-bath Russian Hill co-...",4200,3460.0,2.0,2.0,russian hill,"['flooring: other', 'apartment', 'w/d in unit'...",(a) in-unit,(d) no pets,(c) multi,(b) protected,8.0
1758,Stunning Top Floor 3x3 PENTHOUSE w/Sweeping Ci...,16995,3300.0,3.0,3.0,nob hill,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected,8.0


In [None]:
high_sqft_sf.describe()

Unnamed: 0,price,sqft,beds,bath,hood_district
count,5.0,5.0,5.0,5.0,5.0
mean,11308.0,3421.2,3.0,3.2,7.0
std,5580.954891,260.159951,0.707107,1.095445,1.732051
min,4200.0,3192.0,2.0,2.0,4.0
25%,8800.0,3300.0,3.0,3.0,7.0
50%,9550.0,3300.0,3.0,3.0,8.0
75%,16995.0,3460.0,3.0,3.0,8.0
max,16995.0,3854.0,4.0,5.0,8.0


* To reduce the dispersion of our data, we keep only the listings under **4000** square feet

In [None]:
sf=sf[sf.sqft<4000]

In [None]:
sf.describe()

Unnamed: 0,price,sqft,beds,bath,hood_district
count,1089.0,1089.0,1089.0,1089.0,1089.0
mean,3887.401286,953.216713,1.640037,1.375574,6.909091
std,2001.537779,464.603463,1.027604,0.583859,2.584954
min,225.0,1.0,0.0,1.0,1.0
25%,2700.0,632.0,1.0,1.0,6.0
50%,3500.0,830.0,2.0,1.0,8.0
75%,4500.0,1200.0,2.0,2.0,9.0
max,24000.0,3854.0,8.0,5.0,10.0


* We follow the same steps for the smallest listings, here those under **250** square feet

In [None]:
low_sqft_sf=sf[ sf.sqft <250 ]

In [None]:
low_sqft_sf.head()

Unnamed: 0,title,price,sqft,beds,bath,hood,amenities,laundry,pets,housing_type,parking,hood_district
2806,Bright Contemporary Studio - PRIVATE ENTRY & GATE,1100,140.0,0.0,1.0,sunset / parkside,"['in-law', 'no laundry on site', 'street parki...",(c) no laundry,(d) no pets,(b) double,(d) no parking,2.0
2094,Garden Studio at Central Sunset close to shopp...,1600,200.0,0.0,1.0,sunset / parkside,"['flooring: wood', 'apartment', 'laundry in bl...",(b) on-site,(d) no pets,(c) multi,(d) no parking,2.0
2275,Large Remodeled 2BR in Lower Pacific Heights W...,3500,1.0,2.0,1.0,lower pac hts,"['cats are OK - purrr', 'dogs are OK - wooof',...",(b) on-site,(a) both,(c) multi,(d) no parking,7.0
597,Lovely Studio * New flooring and paint * Ideal...,1695,225.0,0.0,1.0,tenderloin,"['apartment', 'laundry on site', 'no parking']",(b) on-site,(d) no pets,(c) multi,(d) no parking,8.0
2731,Newly Remodeled In- Law Studio,1100,230.0,1.0,1.0,richmond / seacliff,"['flooring: wood', 'flat', 'laundry in bldg', ...",(b) on-site,(d) no pets,(c) multi,(d) no parking,1.0


In [None]:
len(low_sqft_sf)

8

In [None]:
low_sqft_index=list(low_sqft_sf.index)

In [None]:
low_sqft_index

[2806, 2094, 2275, 597, 2731, 1447, 156, 803]

In [None]:
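# low_sqft_index holds index labels, so select with .loc (label-based) rather than .iloc (position-based)
for link in sf_r.loc[low_sqft_index].link: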
  print(link)
  print('')

https://sfbay.craigslist.org/sfc/apa/d/bright-contemporary-studio-private/7436414296.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-garden-studio-at-central/7434472513.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-large-remodeled-2br-in/7436670971.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-lovely-studio-new/7437169291.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-newly-remodeled-in-law/7435519856.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-newly-updated-studio/7428086465.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-room-private-bath/7430490749.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-studio-in-historic-pet/7437112536.html
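
The lookup above works because the cleaned frame kept the index labels of the raw frame. Here is a minimal sketch of why label-based `.loc` and position-based `.iloc` can return different rows once the row order no longer matches the labels. The frame `raw` and its `url_*` values are placeholders, not real listings.

```python
import pandas as pd

# Placeholder raw frame with the default index 0..3
raw = pd.DataFrame({'link': ['url_a', 'url_b', 'url_c', 'url_d']})

# Reverse the row order: the labels travel with the rows, the positions do not
reordered = raw.iloc[::-1]

labels = [0, 1]  # index labels we want to look up

by_label = reordered.loc[labels, 'link'].tolist()      # rows labelled 0 and 1
by_position = reordered.iloc[labels]['link'].tolist()  # first and second rows

print(by_label)
print(by_position)
```

On a frame that still has its default index in the original order, both calls return the same rows, so the difference only shows up after sorting, filtering or shuffling.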



In [None]:
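# Index 156 matches the "room-private-bath" link above, which appears to be a single room rather than a whole unit
sf = sf.drop(index=156)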

In [None]:
sf.head()

Unnamed: 0,title,price,sqft,beds,bath,hood,amenities,laundry,pets,housing_type,parking,hood_district
1381,#146 Beautiful One Bedroom With Patio Availabl...,3920,805.0,1.0,1.0,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected,6.0
2955,#174 Amazing One Bedroom With Great Amenities ...,3870,805.0,1.0,1.0,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected,6.0
2811,#21 Beautiful Townhome Now Available... Check ...,5105,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected,6.0
312,#368 WOW! Beautiful Townhome With Amazing Amen...,4850,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected,6.0
902,#370 Beautiful Townhome With Amazing Amenities...,4750,1388.0,2.0,2.5,hayes valley,"['cats are OK - purrr', 'dogs are OK - wooof',...",(a) in-unit,(a) both,(c) multi,(b) protected,6.0


In [None]:
sf.describe()

Unnamed: 0,price,sqft,beds,bath,hood_district
count,1088.0,1088.0,1088.0,1088.0,1088.0
mean,3889.917279,953.909007,1.640625,1.375919,6.913603
std,2000.734624,464.254833,1.027893,0.584017,2.581849
min,225.0,1.0,0.0,1.0,1.0
25%,2703.75,633.5,1.0,1.0,6.0
50%,3500.0,830.0,2.0,1.0,8.0
75%,4500.0,1200.0,2.0,2.0,9.0
max,24000.0,3854.0,8.0,5.0,10.0


# Final Cleaning and exporting

In [None]:
sf_clean = sf[['price', 'sqft', 'beds', 'bath', 'laundry', 'pets', 'housing_type', 'parking', 'hood_district']].reset_index(drop=True)
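
To illustrate what the cell above does, here is a small sketch on a made-up frame (`toy`, with index labels borrowed from the table output for readability): selecting a list of columns keeps only those columns, and `reset_index(drop=True)` replaces the leftover labels with 0, 1, 2, ... without adding the old index as a column.

```python
import pandas as pd

# Made-up frame whose index still carries the original row labels
toy = pd.DataFrame({'title': ['a', 'b', 'c'],
                    'price': [3920, 3870, 5105],
                    'sqft': [805.0, 805.0, 1388.0]},
                   index=[1381, 2955, 2811])

# Keep the modeling columns and renumber the rows from zero
toy_clean = toy[['price', 'sqft']].reset_index(drop=True)

print(toy_clean)
```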

In [None]:
sf_clean.head()

Unnamed: 0,price,sqft,beds,bath,laundry,pets,housing_type,parking,hood_district
0,3920,805.0,1.0,1.0,(a) in-unit,(a) both,(c) multi,(b) protected,6.0
1,3870,805.0,1.0,1.0,(a) in-unit,(a) both,(c) multi,(b) protected,6.0
2,5105,1388.0,2.0,2.5,(a) in-unit,(a) both,(c) multi,(b) protected,6.0
3,4850,1388.0,2.0,2.5,(a) in-unit,(a) both,(c) multi,(b) protected,6.0
4,4750,1388.0,2.0,2.5,(a) in-unit,(a) both,(c) multi,(b) protected,6.0


In [None]:
sf_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1088 entries, 0 to 1087
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   price          1088 non-null   int64  
 1   sqft           1088 non-null   float64
 2   beds           1088 non-null   float64
 3   bath           1088 non-null   float64
 4   laundry        1088 non-null   object 
 5   pets           1088 non-null   object 
 6   housing_type   1088 non-null   object 
 7   parking        1088 non-null   object 
 8   hood_district  1088 non-null   float64
dtypes: float64(4), int64(1), object(4)
memory usage: 76.6+ KB


In [None]:
sf_clean.to_csv('sf_clean.csv',index=False)
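
The `index=False` argument matters when the file is read back. The sketch below round-trips a made-up frame through an in-memory buffer (so nothing is written to disk) to show the extra column that appears when the index is exported. The names `toy`, `with_index` and `without_index` are illustrative.

```python
import io
import pandas as pd

toy = pd.DataFrame({'price': [3920, 3870], 'sqft': [805.0, 805.0]})

# Export with the index: reading it back yields an extra unnamed column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# Export without the index: the columns round-trip unchanged
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```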

# Step 3: Predicting rental prices

[Colab notebook where the prediction study was carried out separately](https://colab.research.google.com/drive/1CtuPRUxeYJaojwkcam7zV4GVSfIVUlzT?authuser=1#scrollTo=NbKuc0M8cwB4)