# Data Cleaning

**To Do:**
- [x] Drop date col
- [x] Drop link col
- [x] Drop dupes 
- [ ] BRs: NaN --> 1
- [x] BAs: Remove 'Ba' suffix; convert to float (half baths such as `2.5Ba` exist)
    * SplitBa --> 1
    * SharedBa --> DROP (means it's not a housing/apt rental)
- [ ] Hoods: create groups
- [ ] Amenities: create groups
- [ ] Deal with missing data (esp. sqft) --> *use Median*?


## Import Data

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
sf_raw = pd.read_csv('raw_sf_scrape.csv')

In [None]:
sf_raw.info()

In [3]:
sf_raw.head()

Unnamed: 0,date,title,link,price,brs,sqft,hood,bath,amenities
0,Oct 1,"3D Virtual Tour - 2 BR, 2 BA Condo 966 Sq. Ft....",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,3850,2.0,966.0,Mission Bay,2Ba,"['condo', 'w/d in unit', 'attached garage']"
1,Oct 1,Beautiful house for rent,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,900,1.0,,portola district,0Ba,['house']
2,Oct 1,4 BEDROOM APARTMENT IN THE HAIGHT,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,4995,4.0,,haight ashbury,2Ba,"['apartment', 'w/d in unit', 'street parking']"
3,Oct 1,4 BEDROOM APARTMENT IN THE HAIGHT,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,4995,4.0,,haight ashbury,2Ba,"['apartment', 'w/d in unit', 'street parking']"
4,Oct 1,ENJOY GOLDEN GATE PARK EVERYDAY,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,2800,2.0,700.0,USF / panhandle,1Ba,"['apartment', 'laundry in bldg', 'no smoking',..."


In [4]:
sf_raw.tail()

Unnamed: 0,date,title,link,price,brs,sqft,hood,bath,amenities
3088,Sep 30,1 Month Free Rent: Massive Modern Luxury Studio,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,1948,,500.0,SOMA / south beach,1Ba,['application fee details: $30 application fee...
3089,Sep 30,1 Month Free Rent: Massive Modern Luxury Studio,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,1948,,500.0,downtown / civic / van ness,1Ba,['application fee details: $30 application fee...
3090,Sep 30,"Large, Beautiful Classic Victorian 1bd/1ba",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,3395,1.0,800.0,laurel hts / presidio,1Ba,"['apartment', 'laundry in bldg']"
3091,Sep 30,3 bedroom in Nopa with Private Deck (Top Floor),https://sfbay.craigslist.org/sfc/apa/d/san-fra...,4995,3.0,1500.0,alamo square / nopa,2Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."
3092,Sep 30,Private 1 bed/1 bath/1 kitchen in-law unit ava...,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,1795,1.0,300.0,ingleside / SFSU / CCSF,1Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."


In [5]:
len(sf_raw)

3093

## Dealing with duplicate listings

In [6]:
sf = sf_raw.drop(['date', 'link'], axis=1)

In [7]:
sf.head()

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
0,"3D Virtual Tour - 2 BR, 2 BA Condo 966 Sq. Ft....",3850,2.0,966.0,Mission Bay,2Ba,"['condo', 'w/d in unit', 'attached garage']"
1,Beautiful house for rent,900,1.0,,portola district,0Ba,['house']
2,4 BEDROOM APARTMENT IN THE HAIGHT,4995,4.0,,haight ashbury,2Ba,"['apartment', 'w/d in unit', 'street parking']"
3,4 BEDROOM APARTMENT IN THE HAIGHT,4995,4.0,,haight ashbury,2Ba,"['apartment', 'w/d in unit', 'street parking']"
4,ENJOY GOLDEN GATE PARK EVERYDAY,2800,2.0,700.0,USF / panhandle,1Ba,"['apartment', 'laundry in bldg', 'no smoking',..."


In [8]:
sf.sort_values('title', inplace=True)

In [9]:
sf.head(3)

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
2240,"""INCENTIVE"" 2 BR + Bonus Room(Office/Den) - DU...",3400,2.0,,castro / upper market,1Ba,"['application fee details: $30 Per Applicant',..."
2243,"""INCENTIVE"" 2 BR + Bonus Room(Office/Den) - DU...",3400,2.0,,castro / upper market,1Ba,"['application fee details: $30 Per Applicant',..."
2245,"""INCENTIVE"" IN THE HEART OF THE DUBOCE TRIANGLE",3700,3.0,,castro / upper market,1Ba,"['application fee details: $30 Per Applicant',..."


In [10]:
sf.drop_duplicates(keep=False, inplace=True)

In [11]:
sf.head()

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
2245,"""INCENTIVE"" IN THE HEART OF THE DUBOCE TRIANGLE",3700,3.0,,castro / upper market,1Ba,"['application fee details: $30 Per Applicant',..."
1980,"#107 Furnished Only Heaven by Views, Luxury, S...",6800,2.0,1600.0,pacific heights,2Ba,"['furnished', 'apartment', 'w/d in unit', 'no ..."
2018,"#120 Furnished ONLY Gorgeous, Cozy Junior-One-...",3500,1.0,550.0,pacific heights,1Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."
2095,#125 FURNISHED ONLY Beautiful 2BR in Marina,5100,2.0,1300.0,marina / cow hollow,1Ba,"['EV charging', 'cats are OK - purrr', 'dogs a..."
2015,#132 Furnished ONLY Beautiful Mansion in Heart...,9000,3.0,3500.0,pacific heights,2.5Ba,"['furnished', 'apartment', 'w/d in unit', 'no ..."


In [12]:
len(sf)

2461
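
Note that `keep=False` removes every row that has a duplicate, including the first copy, which is why both rows of the first "INCENTIVE" listing are gone above. If the goal is to keep one copy of each listing, `keep='first'` does that instead. A minimal sketch on a made-up toy frame (not the scraped data):

```python
import pandas as pd

# Toy frame with made-up rows: rows 0 and 1 are exact duplicates.
toy = pd.DataFrame({
    'title': ['A', 'A', 'B'],
    'price': [100, 100, 200],
})

# keep=False drops every member of a duplicated group.
dropped_all = toy.drop_duplicates(keep=False)

# keep='first' retains one copy of each duplicated group.
kept_first = toy.drop_duplicates(keep='first')

print(len(dropped_all), len(kept_first))
```

Which behaviour is right depends on whether a repeated listing should count once or not at all.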

## Price: data type conversion to integer

In [13]:
sf.price = sf.price.astype(int)

In [14]:
sf.dtypes

title         object
price          int64
brs          float64
sqft         float64
hood          object
bath          object
amenities     object
dtype: object

## Cleaning Bathrooms

In [15]:
sf.bath.unique()

array(['1Ba', '2Ba', '2.5Ba', nan, '3.5Ba', 'sharedBa', '0Ba', '3Ba',
       '1.5Ba', 'splitBa', '4Ba', '5.5Ba', '4.5Ba'], dtype=object)

In [16]:
shared_baths = sf_raw[(sf_raw.bath == 'sharedBa')]

shared_baths_links = list(shared_baths.link)

In [17]:
# number of listings with shared bath
len(shared_baths)

15

In [18]:
# Investigating a few links
for link in shared_baths_links[:5]:
    print(link)
    print("")

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-spoons-when-all-you-need/7206516011.html

https://sfbay.craigslist.org/sfc/apa/d/excelente-oportunidad-centrico-sf/7206512351.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-fabulous-mission-studio/7200480477.html

https://sfbay.craigslist.org/sfc/apa/d/1400-private-rm-not-share-for-rent/7205302134.html

https://sfbay.craigslist.org/sfc/apa/d/best-room-rate-in-north-beach/7203890620.html



Upon inspection, listings with `sharedBa` are not full apartment/house rentals -- they are shared living arrangements, so they should be dropped.

In [19]:
sf = sf[sf.bath != 'sharedBa']

In [20]:
sf.bath.unique()

array(['1Ba', '2Ba', '2.5Ba', nan, '3.5Ba', '0Ba', '3Ba', '1.5Ba',
       'splitBa', '4Ba', '5.5Ba', '4.5Ba'], dtype=object)

`splitBa` means the unit has a private bathroom whose sink is simply separate from the toilet/shower -- these will be counted as 1 bathroom.

In [21]:
sf['bath'] = sf['bath'].replace('splitBa', '1Ba')

In [22]:
sf.bath.unique()

array(['1Ba', '2Ba', '2.5Ba', nan, '3.5Ba', '0Ba', '3Ba', '1.5Ba', '4Ba',
       '5.5Ba', '4.5Ba'], dtype=object)

What about `nan` and `0Ba`? 

In [23]:
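# Caution: `== np.nan` is always False (NaN never equals itself), so this mask
# only matches the '0Ba' rows; sf_raw.bath.isna() is needed to include NaN rows.
missing_bath_info = sf_raw[(sf_raw.bath == np.nan) |
                          (sf_raw.bath == '0Ba')]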

len(missing_bath_info)

9
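
A comparison with `np.nan` using `==` never matches, so the count of 9 most likely covers only the `0Ba` rows. A small sketch on a made-up Series shows the difference between `== np.nan` and `isna()`:

```python
import numpy as np
import pandas as pd

# Toy Series (made-up values) standing in for the bath column.
bath = pd.Series(['1Ba', np.nan, '0Ba', np.nan])

# Equality with NaN is always False, so this mask selects nothing.
eq_mask = bath == np.nan

# isna() is the reliable way to find missing values.
na_mask = bath.isna()

# Combined mask for "missing or zero" bathrooms.
missing_or_zero = na_mask | (bath == '0Ba')

print(eq_mask.sum(), na_mask.sum(), missing_or_zero.sum())
```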

In [24]:
for link in missing_bath_info.link:
    print(link)
    print('')

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-beautiful-house-for-rent/7206621152.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-sunny-victorian-flat/7206563979.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-two-bedroom-1200-sq-ft/7206561289.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-airy-bright-marina-1br/7206542254.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-7th-floor-corner-jr-1-bd/7191572212.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-2847-turk-street-san/7201729219.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-live-work-space-in-modern/7197876630.html

https://sfbay.craigslist.org/sfc/apa/d/huge-1-bed-hardwood-gas-eat-in-kit/7206260644.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-bed-1bathh-wd-free/7206132572.html



These all have 1 bathroom, so these values will be converted to `1`.

In [25]:
sf['bath'] = sf['bath'].replace(np.nan, '1Ba')
sf['bath'] = sf['bath'].replace('0Ba', '1Ba')

In [26]:
sf.bath.unique()

array(['1Ba', '2Ba', '2.5Ba', '3.5Ba', '3Ba', '1.5Ba', '4Ba', '5.5Ba',
       '4.5Ba'], dtype=object)

Now strip the `Ba` suffix and convert everything to a float.

In [27]:
sf['bath'] = sf['bath'].str.replace('Ba', '').astype(float)

In [28]:
sf.head()

Unnamed: 0,title,price,brs,sqft,hood,bath,amenities
2245,"""INCENTIVE"" IN THE HEART OF THE DUBOCE TRIANGLE",3700,3.0,,castro / upper market,1.0,"['application fee details: $30 Per Applicant',..."
1980,"#107 Furnished Only Heaven by Views, Luxury, S...",6800,2.0,1600.0,pacific heights,2.0,"['furnished', 'apartment', 'w/d in unit', 'no ..."
2018,"#120 Furnished ONLY Gorgeous, Cozy Junior-One-...",3500,1.0,550.0,pacific heights,1.0,"['cats are OK - purrr', 'dogs are OK - wooof',..."
2095,#125 FURNISHED ONLY Beautiful 2BR in Marina,5100,2.0,1300.0,marina / cow hollow,1.0,"['EV charging', 'cats are OK - purrr', 'dogs a..."
2015,#132 Furnished ONLY Beautiful Mansion in Heart...,9000,3.0,3500.0,pacific heights,2.5,"['furnished', 'apartment', 'w/d in unit', 'no ..."


Clean bathrooms!
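
For reuse, the bathroom rules in this section could be collected into one helper. This is only a sketch: `clean_bath` is a hypothetical name, and the toy frame is made up so the block runs on its own.

```python
import numpy as np
import pandas as pd

def clean_bath(df):
    """Apply the bathroom rules from this section to a copy of df."""
    # Drop shared-bath listings; .copy() avoids modifying a view of df.
    out = df[df['bath'] != 'sharedBa'].copy()
    out['bath'] = (
        out['bath']
        .replace({'splitBa': '1Ba', '0Ba': '1Ba'})  # split/zero -> 1
        .fillna('1Ba')                               # missing -> 1
        .str.replace('Ba', '', regex=False)          # strip the suffix
        .astype(float)
    )
    return out

# Toy frame with made-up values covering each case.
toy = pd.DataFrame(
    {'bath': ['1Ba', 'sharedBa', 'splitBa', np.nan, '0Ba', '2.5Ba']}
)
cleaned = clean_bath(toy)
print(cleaned['bath'].tolist())
```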

## Cleaning Bedrooms

In [29]:
sf.brs.unique()

array([ 3.,  2.,  1., nan,  4.,  5.,  6.])

In [30]:
len(sf.brs.unique())

7

How many nan values?
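
`br_nan` is not defined in the cells shown here. Since its output still includes the `date` and `link` columns, it was presumably filtered from `sf_raw`. A minimal sketch of that assumed definition, using a made-up stand-in frame so it runs on its own:

```python
import numpy as np
import pandas as pd

# Toy stand-in for sf_raw (made-up rows) so the sketch is self-contained.
sf_raw_toy = pd.DataFrame({
    'title': ['Sunny studio', '2BR flat', 'Large 1bd/1ba'],
    'brs': [np.nan, 2.0, np.nan],
})

# Assumed definition: rows whose bedroom count is missing.
br_nan_toy = sf_raw_toy[sf_raw_toy.brs.isna()]

print(len(br_nan_toy))
```

On the real data the equivalent would be `br_nan = sf_raw[sf_raw.brs.isna()]`.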

In [54]:
br_nan.head()

Unnamed: 0,date,title,link,price,brs,sqft,hood,bath,amenities
18,Oct 1,Huge Newly Remodeled Studio Downtown; Amazing ...,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,1995,,,lower nob hill,1Ba,"['cats are OK - purrr', 'dogs are OK - wooof',..."
31,Oct 1,Cozy Edwardian Apartment - 2 Bedroom (RH),https://sfbay.craigslist.org/sfc/apa/d/san-fra...,2395,,,russian hill,2.5Ba,['apartment']
33,Oct 1,"Fifth-floor studio in Theater District, PREVIEW",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,1600,,410.0,lower nob hill,1Ba,"['apartment', 'laundry in bldg', 'no smoking',..."
35,Oct 1,"Second-floor studio, preview for October move-...",https://sfbay.craigslist.org/sfc/apa/d/san-fra...,1800,,440.0,lower nob hill,1Ba,"['apartment', 'laundry in bldg', 'no smoking',..."
41,Oct 1,1BD/1Bath - California & Larkin - Pets Welcome!,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,2100,,,russian hill,2Ba,['apartment']


In [55]:
len(br_nan)

617

What are these nan values? 

In [56]:
br_nan_links = list(br_nan.link)

In [58]:
for link in br_nan_links[:9]:
    print(link)
    print('')

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-huge-newly-remodeled/7206619643.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-cozy-edwardian-apartment/7194887293.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-fifth-floor-studio-in/7206615888.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-second-floor-studio/7202953088.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-1bd-1bath-california/7200175190.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-quiet-remodeled-1-bedroom/7197456326.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-furnished-central-garden/7203461074.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-lg-stu-view-top-floor-off/7198666488.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-lg-remodeled-studio-top/7198668794.html



In [59]:
for link in br_nan_links[9:20]:
    print(link)
    print('')

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-wow-prices-you-cant-beat/7206609316.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-2-months-free-newly/7206609071.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-nob-hill-efficiency/7205168796.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-always-fresh-forever/7206603015.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-two-months-free-studio-in/7206586455.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-available-now-4-bedroom-2/7200487016.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-2-months-free-newly/7206598060.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-beautiful-modern-studios/7206593196.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-beautiful-modern-studios/7206592788.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-huge-studio-with-wall/7206597243.html

https://sfbay.craigslist.org/sfc/apa/d/san-francisco-large-studio-in-p

It seems like a mix -- some are studios, some have 1 or more bedrooms. The key is in the title, so I'll try to parse the title and map what it contains to a bedroom count:
```
* studio --> 0
* 1BR    --> 1
* 2BR    --> 2
```
and so on...

In [60]:
'studio' in 'this is a studio'

True

In [61]:
'1br' in 'ThIs has a 1bR'.lower()

True
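
Building on the two membership checks above, the title parsing could be sketched as follows. `bedrooms_from_title` and `BR_PATTERN` are hypothetical names, and the regular expression is an assumption about how titles spell bedroom counts (`1br`, `2 bd`, `3 bedroom`, ...); real titles will need more cases.

```python
import re

import numpy as np
import pandas as pd

# Assumed spellings: a number followed by br / bd / bed / bedroom(s).
BR_PATTERN = re.compile(r'(\d+)\s*-?\s*(?:br|bd|bed|bedroom)s?\b')

def bedrooms_from_title(title):
    """Guess a bedroom count from a listing title; NaN if unclear."""
    text = str(title).lower()
    match = BR_PATTERN.search(text)
    if match:                      # an explicit count wins
        return float(match.group(1))
    if 'studio' in text:           # studio --> 0
        return 0.0
    return np.nan

# Toy frame reusing a few titles from the outputs above.
toy = pd.DataFrame({
    'title': [
        'Fifth-floor studio in Theater District, PREVIEW',
        '1BD/1Bath - California & Larkin - Pets Welcome!',
        'Cozy Edwardian Apartment - 2 Bedroom (RH)',
        'ENJOY GOLDEN GATE PARK EVERYDAY',
    ],
    'brs': [np.nan, np.nan, np.nan, 2.0],
})

# Fill only the rows where the bedroom count is missing.
mask = toy.brs.isna()
toy.loc[mask, 'brs'] = toy.loc[mask, 'title'].map(bedrooms_from_title)
print(toy.brs.tolist())
```

Titles that match neither rule stay `nan` and would still need manual review.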

## Amenities

## Neighborhoods