# Data Cleaning

**To Do:**
- [x] Drop date col
- [x] Drop link col
- [x] Drop dupes 
- [x] BRs: NaN --> 1
- [x] BAs: Remove 'Ba' suffix; convert to int
    * SplitBa --> 1
    * SharedBa --> DROP (means it's not a housing/apt rental)
- [ ] Hoods: create groups
- [x] Amenities: create groups
- [x] Deal with missing data (esp. sqft) --> *use Median*?


## Import Data

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
sf_raw = pd.read_csv('raw_sf_scrape.csv')

In [3]:
sf_raw.head()

Unnamed: 0,date,title,link,price,brs,sqft,hood,bath,amenities
0,Oct 1,"3D Virtual Tour - 2 BR, 2 BA Condo 966 Sq. Ft....",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,3850,2.0,966.0,Mission Bay,2Ba,"['condo', 'w/d in unit', 'attached garage']"
1,Oct 1,Beautiful house for rent,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,900,1.0,,portola district,0Ba,['house']
2,Oct 1,4 BEDROOM APARTMENT IN THE HAIGHT,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,4995,4.0,,haight ashbury,2Ba,"['apartment', 'w/d in unit', 'street parking']"
3,Oct 1,4 BEDROOM APARTMENT IN THE HAIGHT,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,4995,4.0,,haight ashbury,2Ba,"['apartment', 'w/d in unit', 'street parking']"
4,Oct 1,ENJOY GOLDEN GATE PARK EVERYDAY,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,2800,2.0,700.0,USF / panhandle,1Ba,"['apartment', 'laundry in bldg', 'no smoking',..."


In [4]:
sf_raw.tail()

Unnamed: 0,date,title,link,price,brs,sqft,hood,bath,amenities
3088,Sep 30,1 Month Free Rent: Massive Modern Luxury Studio,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,1948,,500.0,SOMA / south beach,1Ba,['application fee details: $30 application fee...
3089,Sep 30,1 Month Free Rent: Massive Modern Luxury Studio,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,1948,,500.0,downtown / civic / van ness,1Ba,['application fee details: $30 application fee...
3090,Sep 30,"Large, Beautiful Classic Victorian 1bd/1ba",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,3395,1.0,800.0,laurel hts / presidio,1Ba,"['apartment', 'laundry in bldg']"
3091,Sep 30,3 bedroom in Nopa with Private Deck (Top Floor),https://sfbay.craigslist.org/sfc/apa/d/san-fra...,4995,3.0,1500.0,alamo square / nopa,2Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."
3092,Sep 30,Private 1 bed/1 bath/1 kitchen in-law unit ava...,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,1795,1.0,300.0,ingleside / SFSU / CCSF,1Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."


In [5]:
len(sf_raw)

3093

## Dealing with duplicate listings

In [6]:
sf = sf_raw.drop(['date', 'link'], axis=1)

In [7]:
sf.head()

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
0,"3D Virtual Tour - 2 BR, 2 BA Condo 966 Sq. Ft....",3850,2.0,966.0,Mission Bay,2Ba,"['condo', 'w/d in unit', 'attached garage']"
1,Beautiful house for rent,900,1.0,,portola district,0Ba,['house']
2,4 BEDROOM APARTMENT IN THE HAIGHT,4995,4.0,,haight ashbury,2Ba,"['apartment', 'w/d in unit', 'street parking']"
3,4 BEDROOM APARTMENT IN THE HAIGHT,4995,4.0,,haight ashbury,2Ba,"['apartment', 'w/d in unit', 'street parking']"
4,ENJOY GOLDEN GATE PARK EVERYDAY,2800,2.0,700.0,USF / panhandle,1Ba,"['apartment', 'laundry in bldg', 'no smoking',..."


In [8]:
sf.sort_values('title', inplace=True)

In [9]:
sf.head(3)

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
2240,"""INCENTIVE"" 2 BR + Bonus Room(Office/Den) - DU...",3400,2.0,,castro / upper market,1Ba,"['application fee details: $30 Per Applicant',..."
2243,"""INCENTIVE"" 2 BR + Bonus Room(Office/Den) - DU...",3400,2.0,,castro / upper market,1Ba,"['application fee details: $30 Per Applicant',..."
2245,"""INCENTIVE"" IN THE HEART OF THE DUBOCE TRIANGLE",3700,3.0,,castro / upper market,1Ba,"['application fee details: $30 Per Applicant',..."


In [10]:
# keep=False drops every copy of a duplicated row, not just the extras
sf.drop_duplicates(keep=False, inplace=True)

In [11]:
sf.head()

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
2245,"""INCENTIVE"" IN THE HEART OF THE DUBOCE TRIANGLE",3700,3.0,,castro / upper market,1Ba,"['application fee details: $30 Per Applicant',..."
1980,"#107 Furnished Only Heaven by Views, Luxury, S...",6800,2.0,1600.0,pacific heights,2Ba,"['furnished', 'apartment', 'w/d in unit', 'no ..."
2018,"#120 Furnished ONLY Gorgeous, Cozy Junior-One-...",3500,1.0,550.0,pacific heights,1Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."
2095,#125 FURNISHED ONLY Beautiful 2BR in Marina,5100,2.0,1300.0,marina / cow hollow,1Ba,"['EV charging', 'cats are OK - purrr', 'dogs a..."
2015,#132 Furnished ONLY Beautiful Mansion in Heart...,9000,3.0,3500.0,pacific heights,2.5Ba,"['furnished', 'apartment', 'w/d in unit', 'no ..."


In [12]:
len(sf)

2461
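
The row count falls from 3093 to 2461 because `keep=False` removes all copies of a repeated listing rather than leaving one behind. A minimal sketch of the difference on made-up data (the `toy` frame is illustrative only, not part of the scrape):

```python
import pandas as pd

# Toy frame: listing 'A' appears twice, listing 'B' once
toy = pd.DataFrame({
    'title': ['A', 'A', 'B'],
    'price': [100, 100, 200],
})

# keep=False: every row that has a duplicate is removed, so only 'B' survives
dropped_all = toy.drop_duplicates(keep=False)

# keep='first': one copy of each duplicated row is retained
kept_first = toy.drop_duplicates(keep='first')

print(len(dropped_all), len(kept_first))
```

If the goal is one row per listing, `keep='first'` would be the option to reach for; whether dropping both copies was intended here is not stated.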

## Price: data type conversion to integer

In [13]:
sf.price = sf.price.astype(int)

In [14]:
sf.dtypes

title         object
price          int64
brs          float64
sqft         float64
hood          object
bath          object
amenities     object
dtype: object

## Cleaning Bathrooms

In [15]:
sf.bath.unique()

array(['1Ba', '2Ba', '2.5Ba', nan, '3.5Ba', 'sharedBa', '0Ba', '3Ba',
       '1.5Ba', 'splitBa', '4Ba', '5.5Ba', '4.5Ba'], dtype=object)

In [16]:
shared_baths = sf_raw[(sf_raw.bath == 'sharedBa')]

shared_baths_links = list(shared_baths.link)

In [17]:
# number of listings with shared bath
len(shared_baths)

15

In [18]:
# Investigating a few links
for link in shared_baths_links[:5]:
    print(link)
    print("")

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-spoons-when-all-you-need/7206516011.html

https://sfbay.craigslist.org/sfc/apa/d/excelente-oportunidad-centrico-sf/7206512351.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-fabulous-mission-studio/7200480477.html

https://sfbay.craigslist.org/sfc/apa/d/1400-private-rm-not-share-for-rent/7205302134.html

https://sfbay.craigslist.org/sfc/apa/d/best-room-rate-in-north-beach/7203890620.html



Upon inspection, listings with `sharedBa` are not full apartment/housing options -- they are a shared living arrangement and should be dropped.

In [19]:
sf = sf[sf.bath != 'sharedBa']

In [20]:
sf.bath.unique()

array(['1Ba', '2Ba', '2.5Ba', nan, '3.5Ba', '0Ba', '3Ba', '1.5Ba',
       'splitBa', '4Ba', '5.5Ba', '4.5Ba'], dtype=object)

`splitBa` means the unit has a private bathroom whose sink is separate from the toilet/shower -- these will be converted to 1.

In [21]:
sf['bath'] = sf['bath'].replace('splitBa', '1Ba')

In [22]:
sf.bath.unique()

array(['1Ba', '2Ba', '2.5Ba', nan, '3.5Ba', '0Ba', '3Ba', '1.5Ba', '4Ba',
       '5.5Ba', '4.5Ba'], dtype=object)

What about `nan` and `0Ba`? 

In [23]:
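# Caution: `== np.nan` is always False, so only the '0Ba' rows are selected here;
# sf_raw.bath.isna() would be needed to pick up the NaN rows as well
missing_bath_info = sf_raw[(sf_raw.bath == np.nan) |
                          (sf_raw.bath == '0Ba')]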

len(missing_bath_info)

9
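
A note on the filter above: NaN never compares equal to anything, including itself, so an equality test against `np.nan` matches no rows. A small sketch on a made-up Series (the `bath` values below are illustrative, not taken from the scrape):

```python
import numpy as np
import pandas as pd

# Toy column with two missing values and one '0Ba'
bath = pd.Series(['1Ba', np.nan, '0Ba', np.nan])

# Equality against NaN is False everywhere
eq_mask = (bath == np.nan)

# isna() is the reliable way to find missing values
isna_mask = bath.isna()

# Mask covering both the missing values and the '0Ba' rows
combined = bath.isna() | (bath == '0Ba')

print(int(eq_mask.sum()), int(isna_mask.sum()), int(combined.sum()))
```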

In [24]:
for link in missing_bath_info.link:
    print(link)
    print('')

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-beautiful-house-for-rent/7206621152.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-sunny-victorian-flat/7206563979.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-two-bedroom-1200-sq-ft/7206561289.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-airy-bright-marina-1br/7206542254.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-7th-floor-corner-jr-1-bd/7191572212.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-2847-turk-street-san/7201729219.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-live-work-space-in-modern/7197876630.html

https://sfbay.craigslist.org/sfc/apa/d/huge-1-bed-hardwood-gas-eat-in-kit/7206260644.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-bed-1bathh-wd-free/7206132572.html



These all have 1 bathroom - so these values will be converted to `1`

In [25]:
sf['bath'] = sf['bath'].replace(np.nan, '1Ba')
sf['bath'] = sf['bath'].replace('0Ba', '1Ba')

In [26]:
sf.bath.unique()

array(['1Ba', '2Ba', '2.5Ba', '3.5Ba', '3Ba', '1.5Ba', '4Ba', '5.5Ba',
       '4.5Ba'], dtype=object)

Now to take care of the 'Ba' suffix and convert everything to a float

In [27]:
sf['bath'] = sf['bath'].str.replace('Ba', '').astype(float)

In [28]:
sf.head()

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
2245,"""INCENTIVE"" IN THE HEART OF THE DUBOCE TRIANGLE",3700,3.0,,castro / upper market,1.0,"['application fee details: $30 Per Applicant',..."
1980,"#107 Furnished Only Heaven by Views, Luxury, S...",6800,2.0,1600.0,pacific heights,2.0,"['furnished', 'apartment', 'w/d in unit', 'no ..."
2018,"#120 Furnished ONLY Gorgeous, Cozy Junior-One-...",3500,1.0,550.0,pacific heights,1.0,"['cats are OK - purrr', 'dogs are OK - wooof',..."
2095,#125 FURNISHED ONLY Beautiful 2BR in Marina,5100,2.0,1300.0,marina / cow hollow,1.0,"['EV charging', 'cats are OK - purrr', 'dogs a..."
2015,#132 Furnished ONLY Beautiful Mansion in Heart...,9000,3.0,3500.0,pacific heights,2.5,"['furnished', 'apartment', 'w/d in unit', 'no ..."


Clean bathrooms!
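
The bathroom rules worked out above (drop `sharedBa`; treat `splitBa`, `0Ba`, and missing values as 1; strip the `Ba` suffix) could also be gathered into a single function. `parse_bath` is a hypothetical helper, not something defined in this notebook:

```python
import math

def parse_bath(raw):
    """Return the bathroom count as a float, or None if the listing should be dropped."""
    # Missing values were found to be 1-bath listings on inspection
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return 1.0
    # Shared bathrooms indicate a room share, not a full rental
    if raw == 'sharedBa':
        return None
    # Split bathrooms and '0Ba' entries are both treated as one bathroom
    if raw in ('splitBa', '0Ba'):
        return 1.0
    # Strip the trailing 'Ba' suffix before converting, e.g. '2.5Ba' -> 2.5
    number = raw[:-2] if raw.endswith('Ba') else raw
    return float(number)

print(parse_bath('2.5Ba'), parse_bath('splitBa'), parse_bath('sharedBa'))
```

It could be applied with something like `sf['bath'].map(parse_bath)`, followed by dropping the rows that come back as `None`.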

## Dealing with missing square footage

Square footage is likely to be an important feature in determining rent price. Currently it's the column with the most missing values. I'll simply keep only the rows of `sf` that have a value for `sqft`.

In [29]:
sf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2448 entries, 2245 to 2495
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   title      2448 non-null   object 
 1   price      2448 non-null   int64  
 2   brs        1953 non-null   float64
 3   sqft       1122 non-null   float64
 4   hood       2423 non-null   object 
 5   bath       2448 non-null   float64
 6   amenities  2448 non-null   object 
dtypes: float64(3), int64(1), object(3)
memory usage: 153.0+ KB


In [30]:
sf = sf[sf['sqft'].notna()]

In [31]:
sf.head()

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
1980,"#107 Furnished Only Heaven by Views, Luxury, S...",6800,2.0,1600.0,pacific heights,2.0,"['furnished', 'apartment', 'w/d in unit', 'no ..."
2018,"#120 Furnished ONLY Gorgeous, Cozy Junior-One-...",3500,1.0,550.0,pacific heights,1.0,"['cats are OK - purrr', 'dogs are OK - wooof',..."
2095,#125 FURNISHED ONLY Beautiful 2BR in Marina,5100,2.0,1300.0,marina / cow hollow,1.0,"['EV charging', 'cats are OK - purrr', 'dogs a..."
2015,#132 Furnished ONLY Beautiful Mansion in Heart...,9000,3.0,3500.0,pacific heights,2.5,"['furnished', 'apartment', 'w/d in unit', 'no ..."
2013,#158 Furnished ONLY Jr 1 bedroom w/ Parking at...,3100,1.0,561.0,marina / cow hollow,1.0,"['cats are OK - purrr', 'dogs are OK - wooof',..."


In [32]:
sf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1122 entries, 1980 to 225
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   title      1122 non-null   object 
 1   price      1122 non-null   int64  
 2   brs        942 non-null    float64
 3   sqft       1122 non-null   float64
 4   hood       1101 non-null   object 
 5   bath       1122 non-null   float64
 6   amenities  1122 non-null   object 
dtypes: float64(3), int64(1), object(3)
memory usage: 70.1+ KB


Great! Now bedrooms is the limiting feature. Let's see if I can boost that number a little higher.
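
For comparison, the to-do list floats median imputation as an alternative to dropping rows. A minimal sketch of that option on made-up data (the `toy` frame and the `sqft_filled` column are illustrative names, not part of this notebook):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing square-footage value
toy = pd.DataFrame({
    'beds': [1, 1, 2, 2],
    'sqft': [500.0, np.nan, 900.0, 1100.0],
})

# Median of the observed values (NaN is skipped by default)
median_sqft = toy['sqft'].median()

# Fill the gaps with that single median value
toy['sqft_filled'] = toy['sqft'].fillna(median_sqft)

print(toy)
```

With fewer than half of the `sqft` values present (1122 of 2448), filling would hand most rows the same number, so dropping is arguably the more conservative choice here.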

## Cleaning Bedrooms

In [33]:
sf.brs.unique()

array([ 2.,  1.,  3.,  4., nan,  5.,  6.])

In [34]:
len(sf.brs.unique())

7

How many NaN values?

In [35]:
missing_brs = sf[sf.brs.isnull()]

In [36]:
missing_brs.head(3)

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
2276,#LiveTuring Fantastic Studios! 8 Weeks Free! T...,2495,,563.0,,1.0,['apartment']
2419,***LIKE NEW STUDIO that feels like 1 Bedroom A...,1880,,350.0,SOMA / south beach,1.0,"['application fee details: $20', 'apartment', ..."
795,**8 Weeks Free** Studio Apartment at 923 Folsom,2705,,509.0,SOMA / south beach,1.0,"['cats are OK - purrr', 'dogs are OK - wooof',..."


In [37]:
print("Number of missing bedroom rows: ", len(missing_brs))

Number of missing bedroom rows:  180


What are these nan values? Need to take a look at a few links to see what's happening.

In [38]:
br_nans = sf_raw[sf_raw['brs'].isnull()]

br_nan_links = list(br_nans.link)

In [39]:
for link in br_nan_links[:5]:
    print(link)
    print('')

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-huge-newly-remodeled/7206619643.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-cozy-edwardian-apartment/7194887293.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-fifth-floor-studio-in/7206615888.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-second-floor-studio/7202953088.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-1bd-1bath-california/7200175190.html



In [40]:
for link in br_nan_links[9:15]:
    print(link)
    print('')

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-wow-prices-you-cant-beat/7206609316.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-2-months-free-newly/7206609071.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-nob-hill-efficiency/7205168796.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-always-fresh-forever/7206603015.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-two-months-free-studio-in/7206586455.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-available-now-4-bedroom-2/7200487016.html



It seems like a mix -- some are studios, some have 1 or more bedrooms. The key is in the title. So I'll parse each title and, if it contains any key words, use them to fill in the bedroom count.

In [41]:
# Function to fill missing bedroom information
# by parsing through titles for key words: 

def replace_missing_brs(post_title):
    if 'studio' in post_title.lower():
        return 0
    elif '1br' in post_title.lower().replace(' ', ''):
        return 1
    elif '1bed' in post_title.lower().replace(' ', ''):
        return 1
    elif 'onebed' in post_title.lower().replace(' ', ''):
        return 1
    elif '2br' in post_title.lower().replace(' ', ''):
        return 2
    elif '3br' in post_title.lower().replace(' ', ''):
        return 3
    elif '4br' in post_title.lower().replace(' ', ''):
        return 4
    elif '4bd' in post_title.lower().replace(' ', ''):
        return 4
    elif '4bed' in post_title.lower().replace(' ', ''):
        return 4
    else:
        return None  # no keyword matched; leave the value missing
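
One caveat with matching on a space-stripped title: a title such as "#1 Bright flat" collapses to "#1brightflat", which contains `1br`. A possible variant using regular expressions with word boundaries is sketched below. `beds_from_title` and its patterns are hypothetical, and they also match more spellings (e.g. `2bd`, `3 bedrooms`) than the function above, so the results would not be identical:

```python
import re

# A single digit followed by br / bd / bed / bedroom(s), as a whole token
_DIGIT_PATTERN = re.compile(r'\b(\d)\s*(?:br|bd|bed(?:room)?s?)\b', re.IGNORECASE)

# Spelled-out counts such as 'One-Bedroom' or 'two bed'
_WORD_PATTERN = re.compile(r'\b(one|two|three|four)[\s-]*bed', re.IGNORECASE)
_WORD_VALUES = {'one': 1, 'two': 2, 'three': 3, 'four': 4}

def beds_from_title(title):
    """Guess the bedroom count from a listing title; None if nothing matches."""
    if 'studio' in title.lower():
        return 0
    match = _DIGIT_PATTERN.search(title)
    if match:
        return int(match.group(1))
    match = _WORD_PATTERN.search(title)
    if match:
        return _WORD_VALUES[match.group(1).lower()]
    return None

print(beds_from_title('4 BEDROOM APARTMENT IN THE HAIGHT'))
print(beds_from_title('#1 Bright flat'))
```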

In [42]:
sf['brs'] = sf['brs'].fillna('missing')

In [43]:
sf['beds'] = sf.apply(
    lambda row: replace_missing_brs(row['title']) if row['brs'] == 'missing' else row['brs'], axis=1)

In [44]:
sf.head()

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities,beds
1980,"#107 Furnished Only Heaven by Views, Luxury, S...",6800,2,1600.0,pacific heights,2.0,"['furnished', 'apartment', 'w/d in unit', 'no ...",2.0
2018,"#120 Furnished ONLY Gorgeous, Cozy Junior-One-...",3500,1,550.0,pacific heights,1.0,"['cats are OK - purrr', 'dogs are OK - wooof',...",1.0
2095,#125 FURNISHED ONLY Beautiful 2BR in Marina,5100,2,1300.0,marina / cow hollow,1.0,"['EV charging', 'cats are OK - purrr', 'dogs a...",2.0
2015,#132 Furnished ONLY Beautiful Mansion in Heart...,9000,3,3500.0,pacific heights,2.5,"['furnished', 'apartment', 'w/d in unit', 'no ...",3.0
2013,#158 Furnished ONLY Jr 1 bedroom w/ Parking at...,3100,1,561.0,marina / cow hollow,1.0,"['cats are OK - purrr', 'dogs are OK - wooof',...",1.0


In [45]:
missing_brs_end = sf[sf['beds'].isnull()]

In [46]:
print("Number of missing bedrooms at start: ", len(missing_brs))
print("Number of missing bedrooms after title parse: ",len(missing_brs_end))
print("Number of recovered bedrooms: ", len(missing_brs) - len(missing_brs_end))
print("Percent recovered: ", ((len(missing_brs) - len(missing_brs_end)) / len(missing_brs)) * 100)

Number of missing bedrooms at start:  180
Number of missing bedrooms after title parse:  46
Number of recovered bedrooms:  134
Percent recovered:  74.44444444444444


Good enough! Now to drop everything else that's missing.

In [47]:
sf = sf[['title', 'price', 'sqft', 'beds', 'bath', 'hood', 'amenities']]

In [48]:
sf = sf[sf['beds'].notna()]

In [49]:
sf.head()

Unnamed: 0,title,price,sqft,beds,bath,hood,amenities
1980,"#107 Furnished Only Heaven by Views, Luxury, S...",6800,1600.0,2.0,2.0,pacific heights,"['furnished', 'apartment', 'w/d in unit', 'no ..."
2018,"#120 Furnished ONLY Gorgeous, Cozy Junior-One-...",3500,550.0,1.0,1.0,pacific heights,"['cats are OK - purrr', 'dogs are OK - wooof',..."
2095,#125 FURNISHED ONLY Beautiful 2BR in Marina,5100,1300.0,2.0,1.0,marina / cow hollow,"['EV charging', 'cats are OK - purrr', 'dogs a..."
2015,#132 Furnished ONLY Beautiful Mansion in Heart...,9000,3500.0,3.0,2.5,pacific heights,"['furnished', 'apartment', 'w/d in unit', 'no ..."
2013,#158 Furnished ONLY Jr 1 bedroom w/ Parking at...,3100,561.0,1.0,1.0,marina / cow hollow,"['cats are OK - purrr', 'dogs are OK - wooof',..."


## Amenities

### Codebook: Fields by Category & Numerical Assignments

Values are assigned numerical values representing a rank based on perceived worth.

**Housing Type**
```
* apartment          ---> 0
* condo              ---> 0
* cottage/cabin      ---> 1
* duplex             ---> 2
* flat               ---> 0
* house              ---> 3
* in-law             ---> 1
* loft               ---> 0
* townhouse          ---> 2
* manufactured       ---> 0
* assisted living    ---> 0
```

**Laundry**
```
* w/d in unit          ---> 2
* w/d hookups          ---> 0
* laundry in bldg      ---> 1
* laundry on site      ---> 1
* no laundry on site   ---> 0
```    
 
**Parking**
```
* carport             ---> 2
* attached garage     ---> 3
* detached garage     ---> 2
* off-street parking  ---> 1
* street parking      ---> 0
* valet parking       ---> 3
* no parking          ---> 0
```

**Pets Allowed**
```
* none     ---> 0
* cats ok  ---> 1 
* dogs ok  ---> 2
* both     ---> 3
```

**Others (to be excluded)**
```
* furnished
* no smoking
* wheelchair accessible
* EV charging
```

First - I need to convert the values in the `amenities` column from a string to a list.

In [50]:
import ast

In [51]:
sf['amens_list'] = sf['amenities'].apply(ast.literal_eval)

In [52]:
sf.head()

Unnamed: 0,title,price,sqft,beds,bath,hood,amenities,amens_list
1980,"#107 Furnished Only Heaven by Views, Luxury, S...",6800,1600.0,2.0,2.0,pacific heights,"['furnished', 'apartment', 'w/d in unit', 'no ...","[furnished, apartment, w/d in unit, no smoking..."
2018,"#120 Furnished ONLY Gorgeous, Cozy Junior-One-...",3500,550.0,1.0,1.0,pacific heights,"['cats are OK - purrr', 'dogs are OK - wooof',...","[cats are OK - purrr, dogs are OK - wooof, fur..."
2095,#125 FURNISHED ONLY Beautiful 2BR in Marina,5100,1300.0,2.0,1.0,marina / cow hollow,"['EV charging', 'cats are OK - purrr', 'dogs a...","[EV charging, cats are OK - purrr, dogs are OK..."
2015,#132 Furnished ONLY Beautiful Mansion in Heart...,9000,3500.0,3.0,2.5,pacific heights,"['furnished', 'apartment', 'w/d in unit', 'no ...","[furnished, apartment, w/d in unit, no smoking..."
2013,#158 Furnished ONLY Jr 1 bedroom w/ Parking at...,3100,561.0,1.0,1.0,marina / cow hollow,"['cats are OK - purrr', 'dogs are OK - wooof',...","[cats are OK - purrr, dogs are OK - wooof, fur..."


In [53]:
amen_eg = sf.loc[0, 'amens_list']
amen_eg

['condo', 'w/d in unit', 'attached garage']

In [54]:
type(amen_eg)

list

In [55]:
'condo' in amen_eg

True

Looks like it worked - and now I can parse through the list to check for membership, creating columns for each type of amenity. 
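
`ast.literal_eval` only evaluates Python literals, so it will not run arbitrary code from a scraped string, but it does raise on malformed input. A defensive wrapper might look like the sketch below; `parse_amenities` is a hypothetical helper and the fallback to an empty list is an assumption about what would be wanted:

```python
import ast

def parse_amenities(raw):
    """Convert a stringified list into a real list; return [] if it cannot be parsed."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Truncated or otherwise malformed strings end up here
        return []
    # Guard against literals that parse but are not lists (e.g. a bare string)
    return value if isinstance(value, list) else []

print(parse_amenities("['condo', 'w/d in unit', 'attached garage']"))
print(parse_amenities("['broken"))
```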

In [56]:
sf = sf.drop(['amenities'], axis=1)

In [57]:
sf.head()

Unnamed: 0,title,price,sqft,beds,bath,hood,amens_list
1980,"#107 Furnished Only Heaven by Views, Luxury, S...",6800,1600.0,2.0,2.0,pacific heights,"[furnished, apartment, w/d in unit, no smoking..."
2018,"#120 Furnished ONLY Gorgeous, Cozy Junior-One-...",3500,550.0,1.0,1.0,pacific heights,"[cats are OK - purrr, dogs are OK - wooof, fur..."
2095,#125 FURNISHED ONLY Beautiful 2BR in Marina,5100,1300.0,2.0,1.0,marina / cow hollow,"[EV charging, cats are OK - purrr, dogs are OK..."
2015,#132 Furnished ONLY Beautiful Mansion in Heart...,9000,3500.0,3.0,2.5,pacific heights,"[furnished, apartment, w/d in unit, no smoking..."
2013,#158 Furnished ONLY Jr 1 bedroom w/ Parking at...,3100,561.0,1.0,1.0,marina / cow hollow,"[cats are OK - purrr, dogs are OK - wooof, fur..."


In [58]:
def laundry_parse(amen_list):
    if 'w/d in unit' in amen_list:
        return 2
    elif 'laundry in bldg' in amen_list:
        return 1
    elif 'laundry on site' in amen_list:
        return 1
    else:
        return 0

In [59]:
sf['laundry'] = sf['amens_list'].apply(laundry_parse)

In [60]:
def pets_allowed(amen_list):
    if 'dogs are OK - wooof' in amen_list and 'cats are OK - purrr' in amen_list:
        return 3
    elif 'dogs are OK - wooof' in amen_list:
        return 2
    elif 'cats are OK - purrr' in amen_list:
        return 1
    else:
        return 0

In [61]:
sf['pets'] = sf['amens_list'].apply(pets_allowed)

In [62]:
def housing_type(amen_list):
    if 'cottage/cabin' in amen_list:
        return 1
    elif 'duplex' in amen_list:
        return 2
    elif 'house' in amen_list:
        return 3
    elif 'in-law' in amen_list:
        return 1
    elif 'townhouse' in amen_list:
        return 2
    else:
        return 0

In [63]:
sf['housing_type'] = sf['amens_list'].apply(housing_type)

In [64]:
def parking_situation(amen_list):
    if 'attached garage' in amen_list:
        return 3
    elif 'valet parking' in amen_list:
        return 3
    elif 'carport' in amen_list:
        return 2
    elif 'detached garage' in amen_list:
        return 2
    elif 'off-street parking' in amen_list:
        return 1
    else:
        return 0

In [65]:
sf['parking'] = sf['amens_list'].apply(parking_situation)
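
The laundry and parking parsers share one pattern: look up each amenity's rank and keep the highest. A table-driven sketch of that idea follows. `rank_amenities` and the two dictionaries are hypothetical names, and the parking table lists only a subset of the codebook entries:

```python
# Rank tables copied from the codebook (parking is a partial list)
LAUNDRY_RANKS = {
    'w/d in unit': 2,
    'laundry in bldg': 1,
    'laundry on site': 1,
}
PARKING_RANKS = {
    'attached garage': 3,
    'valet parking': 3,
    'carport': 2,
    'off-street parking': 1,
}

def rank_amenities(amen_list, ranks):
    """Return the highest rank among the listed amenities, or 0 if none match."""
    return max((ranks[a] for a in amen_list if a in ranks), default=0)

example = ['condo', 'w/d in unit', 'attached garage']
print(rank_amenities(example, LAUNDRY_RANKS), rank_amenities(example, PARKING_RANKS))
```

Taking the maximum gives the same answer as an if/elif chain only when the chain checks the highest ranks first, which holds for laundry and parking but not for `housing_type` as written.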

In [66]:
sf = sf.drop(['amens_list'], axis=1)

In [67]:
sf.head()

Unnamed: 0,title,price,sqft,beds,bath,hood,laundry,pets,housing_type,parking
1980,"#107 Furnished Only Heaven by Views, Luxury, S...",6800,1600.0,2.0,2.0,pacific heights,2,0,0,3
2018,"#120 Furnished ONLY Gorgeous, Cozy Junior-One-...",3500,550.0,1.0,1.0,pacific heights,2,3,0,3
2095,#125 FURNISHED ONLY Beautiful 2BR in Marina,5100,1300.0,2.0,1.0,marina / cow hollow,2,3,0,0
2015,#132 Furnished ONLY Beautiful Mansion in Heart...,9000,3500.0,3.0,2.5,pacific heights,2,0,0,3
2013,#158 Furnished ONLY Jr 1 bedroom w/ Parking at...,3100,561.0,1.0,1.0,marina / cow hollow,0,3,0,0


## Neighborhoods

In [68]:
hood_list = list(sf.hood.unique())

In [69]:
print("Number of unique neighborhoods provided: ", len(hood_list))

Number of unique neighborhoods provided:  83


Some of these are NOT in the SF Craigslist location list:

In [70]:
cl_locations = ['alamo square / nopa', 'bayview', 'bernal heights', 
               'castro / upper market', 'cole valley / ashbury hts','downtown / civic / van ness',
               'excelsior / outer mission','financial district','glen park','haight ashbury','hayes valley',
               'ingleside / SFSU / CCSF','inner richmond','inner sunset / UCSF', 'laurel hts / presidio',
               'lower haight','lower nob hill','lower pac hts','marina / cow hollow','mission district',
               'nob hill','noe valley','north beach / telegraph hill','pacific heights','portola district',
               'potrero hill','richmond / seacliff', 'russian hill','SOMA / south beach','sunset / parkside',
               'tenderloin','treasure island','twin peaks / diamond hts','USF / panhandle','visitacion valley',
               'west portal / forest hill', 'western addition']

In [71]:
len(cl_locations)

37

In [72]:
extra_hood_count = 0
extra_hoods = []

for hood in hood_list:
    if hood not in cl_locations:
        extra_hoods.append(hood)
        extra_hood_count += 1
        
print("Number of extra hoods: ", extra_hood_count)
print(extra_hoods)        

Number of extra hoods:  47
[nan, 'Curtis Park (Heart of Sacramento)', 'Mountain View', 'Golden Gate Heights, San Francisco', 'San Francisco/SunnySide', 'San Franciso Richmond District', 'brisbane', 'Mission Bay', 'Mission District', 'Bayview', 'CA', 'San Francisco, CA', '1365 McCandless Drive Milpitas, CA', 'NOPA', 'San Francisco', 'South Park', 'South Beach', 'Hayes Valley', 'Rincon Hill, San Francisco', 'Nob Hill', 'Rincon Hill', 'Cupertino', 'Lower Nob Hill', 'Pacific Heights', 'The Castro', 'North Beach', 'SoMa', 'Fremont', 'Civic Center, Downtown, Van Ness', 'Telegraph Hill', 'Marina District', 'Embarcadero / North Waterfront', 'Lawai, Poipu Adjacent', 'Glen Park', '2026 Beach St, Concord, CA', 'Vallejo', 'San Francisco North Waterfront', 'Daly City', 'Lower Pacific Heights', 'Westlake', 'North Panhandle', 'Eureka Valley', 'SAN FRANCISCO', 'SOMA', 'West Sacramento', 'South of Market', 'Marina']


Ones that are not in SF should be dropped. Others are just free-text variants that can be mapped to a CL location, so they will be mapped.

But also, I need to reduce the number of CL neighborhoods.
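
A rough sketch of how the drop-or-map step could work. `normalize_hood` is a hypothetical helper, the alias pairings are my own guesses rather than decisions made in this notebook, and only a handful of locations are included to keep the example short:

```python
# Subset of cl_locations, repeated here so the sketch is self-contained
CL_LOCATIONS = ['alamo square / nopa', 'castro / upper market',
                'mission district', 'nob hill', 'SOMA / south beach']

# Hypothetical alias table: these pairings are assumptions
ALIASES = {
    'nopa': 'alamo square / nopa',
    'the castro': 'castro / upper market',
    'soma': 'SOMA / south beach',
    'south of market': 'SOMA / south beach',
}

# Cities outside San Francisco that appear in the scraped values
NOT_SF = {'mountain view', 'cupertino', 'fremont', 'vallejo', 'daly city'}

# Lowercased lookup so case-only differences resolve to the CL spelling
_CANONICAL = {loc.lower(): loc for loc in CL_LOCATIONS}

def normalize_hood(raw):
    """Return a CL location name, or None when the row should be dropped."""
    if not isinstance(raw, str):      # NaN arrives as a float
        return None
    key = raw.strip().lower()
    if key in NOT_SF:
        return None
    if key in _CANONICAL:             # e.g. 'Nob Hill' -> 'nob hill'
        return _CANONICAL[key]
    return ALIASES.get(key)           # None for anything unrecognized

print(normalize_hood('SoMa'), normalize_hood('Nob Hill'), normalize_hood('Cupertino'))
```

Applying it would be something like `sf['hood'].map(normalize_hood)`, then dropping the rows that come back as `None`.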