# IBM Data Science Capstone Project - UWS Apartment Hunting

### Part 1 - Scraping Apartments.com

In [101]:
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import folium
from pandas import json_normalize  # top-level import since pandas 1.0; pandas.io.json.json_normalize is deprecated

These are some of the packages my scraping tool needs - notably, "pandas," used to create and manipulate dataframes, and "Beautiful Soup," which parses the HTML that requests downloads and is, in my opinion, the best scraping library there is.

In [2]:
base_url = "https://www.apartments.com/upper-west-side-new-york-ny/1-bedrooms/"
page2_url = base_url + "2/"
page3_url = base_url + "3/"
page4_url = base_url + "4/"
page5_url = base_url + "5/"

So, first things first - I need to scrape apartment listings from Apartments.com. Specifically, addresses and rent prices for 1-bedroom apartments in the Upper West Side (UWS) neighborhood of Manhattan, NYC.  As you can see here, Apartments.com is very logical in its URL construction. The /neighborhood is followed by filters, in this case /1-bedrooms, and the last part of the URL is the results page number.  Personally, 5 pages of results (25 listings per page * 5 = 125 search results) seemed like enough for a data science novice like me!

^^These are the page URLs - I'll scrape each one and combine the results into a single pandas dataframe further down.
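Since the page number is just the last path segment, the five URLs can also be generated rather than typed out. A minimal sketch, assuming page 1 is the bare URL and later pages append the page number (the names NUM_PAGES and page_urls are mine, not part of the notebook):

```python
base_url = "https://www.apartments.com/upper-west-side-new-york-ny/1-bedrooms/"

# Page 1 is the bare URL; pages 2..N append "<page number>/".
NUM_PAGES = 5
page_urls = [base_url] + [f"{base_url}{page}/" for page in range(2, NUM_PAGES + 1)]

for url in page_urls:
    print(url)
```

Changing NUM_PAGES is then the only edit needed to pull more or fewer result pages.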

In [3]:
def scraper(url):
    
    #makes me seem more human as I pull
    headers = ({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    #scrapes all addresses
    address = []
    location = soup.find_all("div",{"class":"location"})
    for addresses in location:
        address.append(addresses.get("title"))
    
    #scrapes all rent prices (some are ranges)
    rent_price = []
    rent = soup.find_all("span",{"class":"altRentDisplay"})
    for prices in rent:
        rent_price.append(prices.text[1:6].replace(',',''))
    
    #name of the apartment, which usually is just the street address
    name = []
    title = soup.find_all("a",{"class":"placardTitle js-placardTitle"})
    for titles in title:
        name.append(titles.get("title").split(',')[0])
    
    #link to the apartments.com listing in case I want to browse later!
    link = []
    href = soup.find_all("a",{"class":"placardTitle js-placardTitle"})
    for hrefs in href:
        link.append(hrefs.get("href"))
    
    #creating the column names for the dataframe - thank you Pandas
    data = {'Name':name,'Address':address,'Rent Price':rent_price,'Link':link}
    df = pd.DataFrame(data)

    return df

Okay, so I have created a function that takes a URL as input and returns a dataframe of the listings on that page.  There's probably a more efficient way to write these for loops but hey, I know how to fix the door, not necessarily build the house, ya know? Bottom line - it works!
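One fragile spot in the function is the slice prices.text[1:6], which assumes every price string starts with a dollar sign and that the first amount fits in five characters. Here is a minimal sketch of a more forgiving parser, assuming prices look like "$2,750" or "$2,750 - $3,100" (the function name parse_rent and the sample strings are hypothetical, not taken from Apartments.com):

```python
import re

def parse_rent(text):
    """Return the first dollar amount in `text` as an int, or None if there is none."""
    match = re.search(r"\$\s*([\d,]+)", text)
    if match is None:
        return None
    digits = match.group(1).replace(",", "")
    return int(digits) if digits else None

for sample in ["$2,750", "$2,750 - $3,100", "Call for Rent"]:
    print(sample, "->", parse_rent(sample))
```

For a range this keeps the lower bound, which matches what the slice does, and it returns None instead of a partial string when no price is present.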

In [4]:
df1 = scraper(base_url)
df2 = scraper(page2_url)
df3 = scraper(page3_url)
df4 = scraper(page4_url)
df5 = scraper(page5_url)

Since I'm working with 5 result pages, I just execute my function 5 times and make 5 different dataframes. Dataframe is denoted as df.

In [5]:
frames = [df1, df2, df3, df4, df5]
df = pd.concat(frames)

Now, I combine these dataframes by stacking them on top of each other - just like Legos.  Technically, this is a form of concatenation, which is done with pandas' concat function.  To check that I did this correctly, I will look at the shape of the master dataframe to see if there are 125 rows (25 search results * 5 pages = 125).

Before that actually, let's take a look at the data for a better understanding of what we are working with.

In [6]:
df

| | Name | Address | Rent Price | Link |
|---|---|---|---|---|
| 0 | Waterline Square Luxury Rentals | 675 W 59th St, New York, NY 10069 | 4119 | https://www.apartments.com/waterline-square-lu... |
| 1 | Park Towers South | 315 W 57th St, New York, NY 10019 | 2750 | https://www.apartments.com/park-towers-south-n... |
| 2 | The Max | 606 W 57th St, New York, NY 10019 | 2996 | https://www.apartments.com/the-max-new-york-ny... |
| 3 | Columbus Square | 808 Columbus Ave, New York, NY 10025 | 3295 | https://www.apartments.com/columbus-square-new... |
| 4 | The Helena | 601 W 57th St, New York, NY 10019 | 2884 | https://www.apartments.com/the-helena-new-york... |
| 5 | Hudson Park | 323 W 96th St, New York, NY 10025 | 2417 | https://www.apartments.com/hudson-park-new-yor... |
| 6 | FRANK 57 WEST | 600 W 58th St, New York, NY 10019 | 3438 | https://www.apartments.com/frank-57-west-new-y... |
| 7 | VIA 57 WEST | 625 W 57th St, New York, NY 10019 | 3270 | https://www.apartments.com/via-57-west-new-yor... |
| 8 | 1080 Amsterdam | 1080 Amsterdam Ave, New York, NY 10025 | 3269 | https://www.apartments.com/1080-amsterdam-new-... |
| 9 | Avalon Morningside Park | 1 Morningside Dr, New York, NY 10025 | 2810 | https://www.apartments.com/avalon-morningside-... |


Okay now, let's officially see if there are 125 rows...

In [7]:
df.shape

(125, 4)

Bingo. Next step.
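The scrape-then-stack step can also be written in one pass, with the index renumbered at the same time. A minimal sketch using two small stand-in frames in place of real scraper output (the frame contents are made up for illustration):

```python
import pandas as pd

# Stand-in frames; in the notebook each would come from scraper(url).
pages = [
    pd.DataFrame({"Name": ["A", "B"], "Rent Price": ["2750", "2996"]}),
    pd.DataFrame({"Name": ["C", "D"], "Rent Price": ["3295", "2884"]}),
]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0..k per page.
combined = pd.concat(pages, ignore_index=True)
print(combined.shape)
print(combined.index.tolist())
```

With the real scraper, the list could be built as [scraper(u) for u in ...] over the five page URLs, and the separate reset_index call would no longer be needed.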

In [8]:
df["Address"].head()

0       675 W 59th St, New York, NY 10069
1       315 W 57th St, New York, NY 10019
2       606 W 57th St, New York, NY 10019
3    808 Columbus Ave, New York, NY 10025
4       601 W 57th St, New York, NY 10019
Name: Address, dtype: object

Okay, here's where it gets tricky-ish.  Apartments.com puts the big commercial real estate complexes at the top of the list - because they pay a premium for the top space. Usually, the commercial guys name their building something legit (and cliche) like "The Lofts" or "Chase Apartments at Fancy Place" to imply they have luxury apartments.

Okay, so what's the problem? Well, as you can see above, the first 5 commercial listings have a complete street address - because they also have an actual name.  The little guys just list their street address as the name, which leaves the address field with only the city, state and zip code. You will notice this in the tail...

In [9]:
df["Address"].tail()

20    15 W 64th St, New York, NY 10023
21    33 W 63rd St, New York, NY 10023
22                  New York, NY 10023
23                  New York, NY 10023
24                  New York, NY 10023
Name: Address, dtype: object

See, not a *real* address.  But look at their names...

In [10]:
df["Name"].tail()

20                    15 w 64th St
21             33 West 63rd Street
22    15 Central Park West Unit 6H
23                    37 W 76th St
24            64 W 69th St Unit 1A
Name: Name, dtype: object


We've still got the data! We just need a way to make the address column uniform.  How? Well, first by separating the commercial apartments from the mom-and-pop ones, correcting the addresses, then putting them both back together.  Again, if we don't screw up, there should be 125 rows.  Let's give it a shot.

In [11]:
address_length = []
for length in df["Address"]:
    address_length.append((len(length)))

In [12]:
df["Address Length"] = address_length

The first thing I notice is that, despite the differences in zip code, the smaller apartments' addresses are all exactly the same length - 18 characters, e.g. "New York, NY 10023".  So, why not compute the length and use it to separate them from the longer ones? Check it out...

In [13]:
#resetting the index
df.reset_index(drop=True,inplace=True)

In [14]:
df["Address Length"].head(30)

0     33
1     33
2     33
3     36
4     33
5     33
6     33
7     33
8     38
9     36
10    38
11    33
12    18
13    33
14    33
15    18
16    18
17    18
18    18
19    18
20    18
21    18
22    18
23    18
24    18
25    18
26    18
27    18
28    18
29    18
Name: Address Length, dtype: int64
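The loop that builds address_length has a vectorized equivalent in pandas' string accessor. A minimal sketch on a two-row stand-in frame (the frame name demo is mine; the two addresses are taken from the output above):

```python
import pandas as pd

demo = pd.DataFrame({
    "Name": ["Waterline Square Luxury Rentals", "37 W 76th St"],
    "Address": ["675 W 59th St, New York, NY 10069", "New York, NY 10023"],
})

# Vectorized equivalent of looping over the column and calling len() on each value.
demo["Address Length"] = demo["Address"].str.len()
print(demo["Address Length"].tolist())
```

Unlike a plain len() loop, .str.len() yields NaN for a missing address instead of raising an error, which may be convenient if a listing ever comes back without one.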

The little guys (18 characters) are toward the bottom of the listings.  To focus only on them, we'll make a new dataframe using a Boolean mask - meaning, for each row, does the character length equal 18? True or False? Boolean, folks.

In [15]:
#using a Boolean mask to filter the dataframe (.copy() avoids pandas' SettingWithCopyWarning)
eighteens = df["Address Length"]==18
df18 = df[eighteens].copy()
df18.insert(0, "Street Address",df18["Name"] + ", " + df18["Address"])
del df18["Name"]
del df18["Address Length"]
del df18["Address"]

In [16]:
#rejects (.copy() so the edits below act on an independent dataframe)
rejects = df["Address Length"]!=18
df_r = df[rejects].copy()
del df_r["Name"]
del df_r["Address Length"]
df_r.columns = ['Street Address', 'Rent Price', 'Link']

We also handle the commercial guys, aka the rejects (maybe it's actually the reverse in reality haha), standardizing them separately: their address is already complete, so we drop the name and rename the columns to match.
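The length-18 test works here because every city/state/zip string happens to be the same length. If that ever stopped being true (a different city, say), a pattern check is one possible alternative. A rough sketch, assuming incomplete addresses always look like "City, ST 12345" (the names CITY_ONLY and full_address are hypothetical):

```python
import re

# Matches an address that is only "City, ST ZIP", e.g. "New York, NY 10023".
CITY_ONLY = re.compile(r"[A-Za-z .]+, [A-Z]{2} \d{5}")

def full_address(name, address):
    """Prefix the listing name when the scraped address has no street part."""
    if CITY_ONLY.fullmatch(address.strip()):
        return f"{name}, {address.strip()}"
    return address

print(full_address("37 W 76th St", "New York, NY 10023"))
print(full_address("The Max", "606 W 57th St, New York, NY 10019"))
```

Applied row by row, this would fix the addresses in place without splitting the dataframe in two and stitching it back together.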

In [17]:
frames = [df18, df_r]
df = pd.concat(frames)

Again, we combine the two dataframes as we did before.  Let's reset the index for good measure and check that we still have 125 rows...

In [18]:
#resetting the index again
df.reset_index(drop=True,inplace=True)
df.shape

(125, 3)

Yup! 125.

In [19]:
df["Street Address"].tail()

120     63 W 90th St, New York, NY 10024
121    220 W 71st St, New York, NY 10023
122     35 W 65th St, New York, NY 10023
123     15 W 64th St, New York, NY 10023
124     33 W 63rd St, New York, NY 10023
Name: Street Address, dtype: object

Beautiful. Next.

In [20]:
#making sure the rent price column is an integer, as I'll want to sort it later from cheapest to most expensive
df["Rent Price"] = df["Rent Price"].astype(int)

Here I am changing the rent price from a string to an integer so I can sort the values from cheapest to most expensive later.

In [21]:
#removing word "Unit" from address because it doesn't register with geopy when trying to find latitude and longitude
df["Street Address"] = [j.replace('Unit','') for j in df["Street Address"]]

Some of the smaller apartment listings include a unit number in the address. I am removing the word "Unit" (the unit identifier itself, e.g. "6H", stays in the string) in case geopy cannot resolve the address with it in the next step.
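Replacing only the word "Unit" leaves the unit identifier and a doubled space behind. If the geocoder turns out to struggle with those too (an assumption I have not verified), one option is to drop the whole designator. A minimal sketch with a hypothetical helper named strip_unit:

```python
import re

def strip_unit(address):
    """Remove a 'Unit <id>' designator that sits just before a comma or the end."""
    cleaned = re.sub(r"\s+Unit\s+\S+?(?=,|$)", "", address)
    # Collapse any doubled spaces left behind by earlier edits.
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(strip_unit("15 Central Park West Unit 6H, New York, NY 10023"))
print(strip_unit("37 W 76th St, New York, NY 10023"))
```

Addresses without a unit pass through unchanged, so the helper is safe to apply to the whole column.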

### Part 2 - Calculating Distance to Shuttle Stop based on Location Data

In [22]:
from geopy.extra.rate_limiter import RateLimiter
import geopy
from geopy.geocoders import Nominatim
from geopy import distance

Next, what I need is the distance from each apartment to the shuttle stop, which will aid me in my search.  The commute will already be 45 minutes on the shuttle and so, I'll need to live close to this stop if I want to keep it under an hour.  To do this, we will need the latitude and longitude of each apartment as well as of the shuttle stop.  Geopy is a great package for this.  It can convert street addresses into coordinates via "Nominatim" (a geocoder built on OpenStreetMap data) and compute the distance between two sets of coordinates via "distance."

In [23]:
locator = Nominatim(user_agent="myGeocoder")

In [24]:
#delays between geocoding calls
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)

#creates location column
df['location'] = df['Street Address'].apply(geocode)

#creates a (latitude, longitude, altitude) tuple from the location column, or None if geocoding failed
df['point'] = df['location'].apply(lambda loc: tuple(loc.point) if loc else None)

#splits point column into latitude, longitude and altitude columns
df[['latitude', 'longitude', 'altitude']] = pd.DataFrame(df['point'].tolist(), index=df.index)

In [25]:
#removing rows that were unable to produce a latitude and longitude - 7 total rows (125 before, 118 after)
df = df.dropna()

In [26]:
#resetting the index again
df.reset_index(drop=True,inplace=True)

Okay, we should see latitude and longitude now.  How many rows did we lose?

In [27]:
df.shape

(118, 8)
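Because the rate limiter waits a second between calls and several listings share a street address, geocoding each distinct address only once could shorten this step. A minimal sketch of the idea using functools.lru_cache, with a hypothetical fake_geocode standing in for the real (network-bound) geocoder:

```python
from functools import lru_cache

calls = []  # records every lookup so the saving is visible

def fake_geocode(address):
    """Hypothetical stand-in for a real geocoder call; returns a fixed point."""
    calls.append(address)
    return (40.77633955, -73.97788736083112)

@lru_cache(maxsize=None)
def cached_geocode(address):
    # Each distinct address string reaches the underlying geocoder once.
    return fake_geocode(address)

addresses = [
    "39 W 71st St, New York, NY 10023",
    "39 W 71st St, New York, NY 10023",
    "328 W 101st St, New York, NY 10025",
]
points = [cached_geocode(a) for a in addresses]
print(len(addresses), "addresses,", len(calls), "lookups")
```

In the notebook, the cached function would wrap the rate-limited geocode callable instead of the stand-in.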

In [28]:
df.head()

| | Street Address | Rent Price | Link | location | point | latitude | longitude | altitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 146-148 W 68th St, New York, NY 10023 | 2085 | https://www.apartments.com/146-148-w-68th-st-n... | (146, West 68th Street, Upper West Side, Manha... | (40.77568881632653, -73.98285832653062, 0.0) | 40.775689 | -73.982858 | 0.0 |
| 1 | 2783-2787 Broadway, New York, NY 10025 | 2750 | https://www.apartments.com/2783-2787-broadway-... | (2783, Broadway, Manhattan Valley, Manhattan, ... | (40.802426499999996, -73.967615, 0.0) | 40.802426 | -73.967615 | 0.0 |
| 2 | 39 W 71st St, New York, NY 10023 | 2225 | https://www.apartments.com/39-w-71st-st-new-yo... | (39, West 71st Street, Upper West Side, Manhat... | (40.77633955, -73.97788736083112, 0.0) | 40.77634 | -73.977887 | 0.0 |
| 3 | 39 W 71st St, New York, NY 10023 | 2375 | https://www.apartments.com/39-w-71st-st-new-yo... | (39, West 71st Street, Upper West Side, Manhat... | (40.77633955, -73.97788736083112, 0.0) | 40.77634 | -73.977887 | 0.0 |
| 4 | 328 W 101st St, New York, NY 10025 | 2350 | https://www.apartments.com/328-w-101st-st-new-... | (328, West 101st Street, Manhattan Valley, Man... | (40.798713899999996, -73.97200920159592, 0.0) | 40.798714 | -73.972009 | 0.0 |


In [29]:
shuttle_stop = (40.7876872, -73.9772022)

Here is the location of the shuttle stop - it's on the corner of 86th and Amsterdam, which I found via Google Maps (the coordinates are embedded in the URL).

I need the distance to appear as a separate column in the dataframe.  To do this, we need to iterate through the "point" column, apply geopy's distance function, and append each result to a list called "dist," which we can then assign as a new pandas dataframe column.

In [30]:
dist = []
for i in df['point']:
    dist.append(distance.distance(shuttle_stop, i).miles)
df['Distance to Stop'] = dist

In [31]:
df.head()

| | Street Address | Rent Price | Link | location | point | latitude | longitude | altitude | Distance to Stop |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 146-148 W 68th St, New York, NY 10023 | 2085 | https://www.apartments.com/146-148-w-68th-st-n... | (146, West 68th Street, Upper West Side, Manha... | (40.77568881632653, -73.98285832653062, 0.0) | 40.775689 | -73.982858 | 0.0 | 0.879474 |
| 1 | 2783-2787 Broadway, New York, NY 10025 | 2750 | https://www.apartments.com/2783-2787-broadway-... | (2783, Broadway, Manhattan Valley, Manhattan, ... | (40.802426499999996, -73.967615, 0.0) | 40.802426 | -73.967615 | 0.0 | 1.134539 |
| 2 | 39 W 71st St, New York, NY 10023 | 2225 | https://www.apartments.com/39-w-71st-st-new-yo... | (39, West 71st Street, Upper West Side, Manhat... | (40.77633955, -73.97788736083112, 0.0) | 40.77634 | -73.977887 | 0.0 | 0.783847 |
| 3 | 39 W 71st St, New York, NY 10023 | 2375 | https://www.apartments.com/39-w-71st-st-new-yo... | (39, West 71st Street, Upper West Side, Manhat... | (40.77633955, -73.97788736083112, 0.0) | 40.77634 | -73.977887 | 0.0 | 0.783847 |
| 4 | 328 W 101st St, New York, NY 10025 | 2350 | https://www.apartments.com/328-w-101st-st-new-... | (328, West 101st Street, Manhattan Valley, Man... | (40.798713899999996, -73.97200920159592, 0.0) | 40.798714 | -73.972009 | 0.0 | 0.808146 |


It worked! There is now a column called "Distance to Stop," which I can use to filter down apartment results.  If I just want to look at apartments that are less than 0.5 miles away, now I can.
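As a sanity check on the "Distance to Stop" values, the same distance can be approximated with the haversine formula using only the standard library. This is a swapped-in technique, not what geopy does: geopy's distance.distance computes a geodesic on an ellipsoid, while this sketch treats the Earth as a sphere, so small differences are expected. The function name and radius constant are mine:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; a spherical approximation

def haversine_miles(p1, p2):
    """Great-circle distance in miles between two (lat, lon[, alt]) points."""
    lat1, lon1 = map(radians, p1[:2])
    lat2, lon2 = map(radians, p2[:2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

shuttle_stop = (40.7876872, -73.9772022)
first_apartment = (40.77568881632653, -73.98285832653062, 0.0)
print(round(haversine_miles(shuttle_stop, first_apartment), 2))
```

For the first row of the table, the result should land close to the 0.879474 miles that geopy reported.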

In [32]:
#using a Boolean mask to filter the dataframe (.copy() makes df_shortlist independent of df)
shortlist = df["Distance to Stop"] <= 0.5
df_shortlist = df[shortlist].copy()

#getting rid of columns I don't need for this particular dataframe
del df_shortlist["location"]
del df_shortlist["point"]
del df_shortlist["altitude"]

In [102]:
#sorting by distance to stop (reassigning instead of inplace=True avoids the SettingWithCopyWarning)
df_shortlist = df_shortlist.sort_values(by=["Distance to Stop"])
df_shortlist.head(20)

| | Street Address | Rent Price | Link | latitude | longitude | Distance to Stop |
|---|---|---|---|---|---|---|
| 47 | 203 W 85th St, New York, NY 10024 | 2550 | https://www.apartments.com/203-w-85th-st-new-y... | 40.787406 | -73.975797 | 0.07622 |
| 45 | 203 W 85th St, New York, NY 10024 | 2350 | https://www.apartments.com/203-w-85th-st-new-y... | 40.787406 | -73.975797 | 0.07622 |
| 61 | 203 W 85th St, New York, NY 10024 | 2250 | https://www.apartments.com/203-w-85th-st-new-y... | 40.787406 | -73.975797 | 0.07622 |
| 90 | 265 W 87th St, New York, NY 10024 | 3099 | https://www.apartments.com/265-w-87th-st-new-y... | 40.789353 | -73.976607 | 0.119111 |
| 67 | 247 W 87th St, New York, NY 10024 | 4375 | https://www.apartments.com/247-w-87th-st-new-y... | 40.78888 | -73.97547 | 0.122581 |
| 66 | 247 W 87th St, New York, NY 10024 | 4175 | https://www.apartments.com/247-w-87th-st-new-y... | 40.78888 | -73.97547 | 0.122581 |
| 92 | 212 W 82nd St, New York, NY 10024 | 2588 | https://www.apartments.com/212-w-82nd-st-new-y... | 40.785267 | -73.977843 | 0.170365 |
| 84 | 208 W 82nd St, New York, NY 10024 | 2588 | https://www.apartments.com/208-w-82nd-st-new-y... | 40.785196 | -73.977675 | 0.173707 |
| 112 | 334 W 88th St, New York, NY 10024 | 2750 | https://www.apartments.com/334-west-88th-stree... | 40.79064 | -73.978473 | 0.21435 |
| 12 | 149 W 87th St, New York, NY 10024 | 2800 | https://www.apartments.com/149-w-87th-st-new-y... | 40.787915 | -73.972988 | 0.221581 |


With a shortlist of 20 apartments less than half a mile from the shuttle stop, I can see that rent runs from about 2250 (the cheapest listing shown above) up to 4600 per month, with the average probably being around 2500.  However, it would also be nice to have a visual of where these apartments are located.

In [41]:
# create map of UWS using latitude and longitude values of bus stop
map_uws = folium.Map(location=shuttle_stop, zoom_start=10)

# add markers to map
for lat, lng, rent, dist_miles in zip(df_shortlist['latitude'], df_shortlist['longitude'], df_shortlist['Rent Price'], df_shortlist['Distance to Stop']):
    # named dist_miles so the loop does not overwrite geopy's "distance" module imported earlier
    label = '{}, {}'.format(dist_miles, rent)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_uws)  

The cell above uses folium to create a map centered on the shuttle stop, with one marker per shortlisted apartment. The following code then zooms the map to fit the apartments' bounding box:

In [43]:
sw = df[['latitude', 'longitude']].min().values.tolist()
ne = df[['latitude', 'longitude']].max().values.tolist()
map_uws.fit_bounds([sw, ne])

In [44]:
map_uws
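The fit_bounds call takes a south-west and a north-east corner, which the cell above derives from the column-wise minimum and maximum of latitude and longitude. A minimal pure-Python sketch of the same calculation on a few points from the shortlist (the variable names are mine):

```python
points = [
    (40.787406, -73.975797),
    (40.789353, -73.976607),
    (40.785267, -73.977843),
    (40.787915, -73.972988),
]

lats = [lat for lat, _ in points]
lngs = [lng for _, lng in points]

# South-west corner = smallest latitude and longitude; north-east = largest.
sw = [min(lats), min(lngs)]
ne = [max(lats), max(lngs)]
print(sw, ne)
```

Note that the notebook computes the corners from df (every geocoded apartment), so the map fits all listings; using df_shortlist instead would zoom in on just the apartments near the stop.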

### Part 3 - Using the Foursquare API to analyze what venues are in the neighborhood

In [45]:
import os
#reading credentials from environment variables (names assumed) keeps them out of the notebook
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

First, to access the API, credentials need to be established via a client ID and secret key.

In [49]:
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    shuttle_stop[0], 
    shuttle_stop[1], 
    radius, 
    LIMIT)

shuttle_stop[0] = latitude, shuttle_stop[1] = longitude. Combined with the API URL, credentials, and limit/radius, we can call the results using the get command:
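The long format string can also be assembled from a dictionary, which keeps each parameter next to its name and handles URL encoding. A minimal sketch using the standard library's urlencode, with placeholder strings where the credentials would go:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

shuttle_stop = (40.7876872, -73.9772022)

params = {
    "client_id": "YOUR_CLIENT_ID",          # placeholder, not a real credential
    "client_secret": "YOUR_CLIENT_SECRET",  # placeholder, not a real credential
    "v": "20180605",
    "ll": f"{shuttle_stop[0]},{shuttle_stop[1]}",
    "radius": 500,
    "limit": 100,
}
url = "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Round-trip check: the query string decodes back to the values passed in.
decoded = parse_qs(urlsplit(url).query)
print(decoded["ll"][0])
```

requests can do the same encoding itself if the dictionary is passed as requests.get(endpoint, params=params).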

In [59]:
results = requests.get(url).json()

In [60]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [94]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues

| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Jacob's Pickles | Southern / Soul Food Restaurant | 40.786653 | -73.975622 |
| 1 | AMC Loews 84th Street 6 | Movie Theater | 40.786770 | -73.977608 |
| 2 | Maison Pickle | American Restaurant | 40.786990 | -73.977787 |
| 3 | Eléa | Greek Restaurant | 40.787531 | -73.976681 |
| 4 | Han Dynasty | Chinese Restaurant | 40.787620 | -73.976359 |
| 5 | Celeste | Italian Restaurant | 40.786689 | -73.975737 |
| 6 | Barnes & Noble | Bookstore | 40.786116 | -73.978645 |
| 7 | Barney Greengrass | Bagel Shop | 40.788008 | -73.974794 |
| 8 | Juice Generation | Juice Bar | 40.788209 | -73.976994 |
| 9 | The Mermaid Inn | Seafood Restaurant | 40.788744 | -73.974243 |


Great! Now we have an idea of what kind of places are in the neighborhood - cafes, bars, restaurants as well as a movie theatre.

In [104]:
uws_grouped = nearby_venues.groupby('categories').count().sort_values(["name"],ascending=False).reset_index()
uws_grouped.head(20)

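| | categories | name | lat | lng |
|---|---|---|---|---|
| 0 | Italian Restaurant | 5 | 5 | 5 |
| 1 | Coffee Shop | 4 | 4 | 4 |
| 2 | Wine Bar | 3 | 3 | 3 |
| 3 | Bakery | 3 | 3 | 3 |
| 4 | Bar | 3 | 3 | 3 |
| 5 | Indian Restaurant | 3 | 3 | 3 |
| 6 | American Restaurant | 2 | 2 | 2 |
| 7 | Ice Cream Shop | 2 | 2 | 2 |
| 8 | Vegetarian / Vegan Restaurant | 2 | 2 | 2 |
| 9 | Thai Restaurant | 2 | 2 | 2 |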


Finally, to dig a bit deeper and see specifically how many shops/restaurants/etc. there are in each category, the cell above groups the venues by category, counts them, and sorts the counts in descending order.
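The grouped table repeats the same count in the name, lat and lng columns because count() tallies every column. If only one count per category is wanted, a plain tally is enough. A minimal sketch with collections.Counter on a made-up list of category labels (not the actual API response):

```python
from collections import Counter

# Illustrative category labels, not the actual API response.
categories = [
    "Italian Restaurant", "Coffee Shop", "Italian Restaurant",
    "Wine Bar", "Coffee Shop", "Italian Restaurant",
]

# most_common() returns (category, count) pairs, largest count first.
counts = Counter(categories)
for category, count in counts.most_common():
    print(category, count)
```

Within pandas, nearby_venues['categories'].value_counts() gives the same kind of single-column tally, already sorted in descending order.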

And voila! I hope you found this code useful - I had fun putting it together for this project.