# Capstone Project: Find Similar Neighborhoods

By Yuxiao Gao






## Introduction

Takumen is a casual, contemporary izakaya-style Japanese restaurant in Long Island City, a neighborhood in the Queens borough of New York City. It serves ramen, poké bowls, sake, and craft beers. On Google Maps it holds a high rating of 4.5 out of 5.


Takumen's success may owe much to its location. Long Island City sits just across the East River from Manhattan, which has made it a popular place to live for young professionals. Rent is reasonable, and the area is less populated. It has fewer dining places and coffee shops than Manhattan in general. As a result, Takumen faces less fierce competition from other restaurants, yet neighborhood residents can afford to enjoy a casual, contemporary Japanese dining experience.

Now the owners of Takumen are ready to open their second restaurant in New York City. They want to build on Takumen's success by carefully picking a similar neighborhood in the city. Since Long Island City is already in Queens, and the Bronx is too far from most of the popular sites, the owners will specifically consider neighborhoods in Manhattan and northern Brooklyn.

In this project, I will use geospatial data and economic data to measure the similarity of neighborhoods in the following aspects:

1. Rent

   The average rent of a neighborhood is a good indicator of its economic condition. A wealthy neighborhood will have higher rent, and its residents may have more to spend on dining. On the other hand, the cost of opening a restaurant will be higher too. Thus, a level of rent similar to that of Long Island City is ideal.
   

2. Other restaurants

   How much competition will the new Takumen face in the neighborhood? How many of the existing restaurants are Asian restaurants, and thus more relevant competitors for Takumen?
   
   
3. Other popular venues

   Is it more of a residential neighborhood, a business and commercial neighborhood, or a shopping and dining neighborhood?

Finally, similar neighborhoods will be clustered and visualized on a map of New York City.

<p>&nbsp;</p>

## Data

CityRealty is a popular real estate website that collects rent and sale information. Conveniently, it lists average apartment rent by neighborhood for Manhattan and northern Brooklyn. The data is from February 2019. Since this project uses rent only to compare neighborhoods with one another, 2019 data is acceptable.

Foursquare location data provides venue locations, categories, and ratings. The dataset is updated constantly, so the information is up to date. In this project, Foursquare location data will be used to characterize restaurant competition and popular venues in each neighborhood for clustering. Since the rent data is from February 2019, I will use the February 2019 version of the Foursquare data.

In addition, neighborhood geolocation data is collected from the Spatial Data Repository of NYU and Google Maps. Since rent data is harder to find than geolocation data, I used the neighborhood partition defined by the rent dataset.

<p>&nbsp;</p>

## Method

#### Data preparation: Rent

A first look at the average rent of each neighborhood shows that studio and 3-bedroom rents are missing for some neighborhoods. 1-bedroom and 2-bedroom apartments are more common, so their data should be more reliable than the data for studios and 3-bedrooms. Thus, when comparing rent levels, the studio and 3-bedroom data are dropped. The remaining data is normalized.

After normalization, Long Island City appears as an affordable neighborhood among all the neighborhoods in the dataset, with normalized values of only 0.28 and 0.29 for 1-bedroom and 2-bedroom apartments. It sits above the 25th percentile but below the 50th percentile.
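
The rent preparation described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes min-max scaling (the report does not name the normalization method), and the column names, neighborhood names, and rent figures are all made up for the example.

```python
import pandas as pd

# Hypothetical rent table; names and dollar amounts are invented for illustration.
rent = pd.DataFrame(
    {
        "Neighborhood": ["Hood A", "Hood B", "Hood C", "Hood D"],
        "Studio": [2500.0, None, 3100.0, 2000.0],
        "1 Bedroom": [3000.0, 4200.0, 3600.0, 2600.0],
        "2 Bedroom": [4000.0, 6000.0, 5000.0, 3500.0],
        "3 Bedroom": [None, 9000.0, 7000.0, None],
    }
)


def prepare_rent(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the sparse studio / 3-bedroom columns and min-max scale the rest."""
    out = df.drop(columns=["Studio", "3 Bedroom"]).copy()
    for col in ["1 Bedroom", "2 Bedroom"]:
        lo, hi = out[col].min(), out[col].max()
        # Assumed normalization: rescale each column to the range [0, 1].
        out[col] = (out[col] - lo) / (hi - lo)
    return out


rent_norm = prepare_rent(rent)

# Percentile rank of each neighborhood's 1-bedroom rent within the dataset,
# useful for statements such as "above the 25th percentile".
rent_norm["1 Bedroom Pct Rank"] = rent_norm["1 Bedroom"].rank(pct=True)
print(rent_norm)
```

With min-max scaling, a value near 0.28 means the rent lies a little over a quarter of the way from the cheapest to the most expensive neighborhood in the table. Other scalings (for example z-scores) would give different numbers.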

#### Data preparation: Venues

1. Restaurant competitions

   I explored up to 30 restaurants in each neighborhood and relabeled all Asian restaurants into one group, "Asian Restaurants". Then the number of Asian restaurants, the total number of restaurants, and the ratio of the two are calculated for each neighborhood and stored in three separate columns, namely "Asian Restaurant Counts", "Total Restaurant Counts", and "Asian Restaurant PCT". Some neighborhoods do not have a single Asian restaurant. One neighborhood, Roosevelt Island, a small residential area between Manhattan and Queens, does not have any restaurants at all. Here, all missing data is filled with zero.
   
   Long Island City has 7 restaurants in total, 3 of which are Asian restaurants. Among all the neighborhoods, Long Island City has one of the highest ratios of Asian restaurants to total restaurants.

2. Most popular venues in neighborhoods

   I gathered 100 venues for each neighborhood and ranked the top 15 most popular venue categories. I one-hot encoded the categories and computed the frequency of each venue category in each neighborhood.
   
   From the frequencies of top venues, the differences between neighborhoods start to show. For example, Long Island City's top venues are hotels, coffee shops, bars, pizza places, and cafés. The neighborhood is relaxed and resident-friendly. In contrast, SoHo's top venues are clothing stores, Italian restaurants, boutiques, coffee shops, and cosmetics shops. It is much more commercial than Long Island City.
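
The two venue preparation steps above can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the `venues` table stands in for data already retrieved from Foursquare, its column names and rows are made up, the `ASIAN` category set is a hypothetical grouping, and a venue is treated as a restaurant when its category name contains "Restaurant".

```python
import pandas as pd

# Hypothetical venue records (one row per venue); not real Foursquare output.
venues = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "A", "A", "B", "B", "B", "C"],
        "Venue Category": [
            "Ramen Restaurant", "Italian Restaurant", "Thai Restaurant", "Coffee Shop",
            "Italian Restaurant", "Clothing Store", "Boutique",
            "Park",
        ],
    }
)

# Assumed set of categories to merge into one "Asian Restaurants" group.
ASIAN = {"Ramen Restaurant", "Thai Restaurant", "Sushi Restaurant", "Chinese Restaurant"}


def restaurant_competition(df: pd.DataFrame, all_neighborhoods: list) -> pd.DataFrame:
    """Count Asian and total restaurants per neighborhood and compute their ratio."""
    rest = df[df["Venue Category"].str.contains("Restaurant")].copy()
    rest["is_asian"] = rest["Venue Category"].isin(ASIAN)
    grouped = rest.groupby("Neighborhood")["is_asian"]
    out = pd.DataFrame(
        {
            "Asian Restaurant Counts": grouped.sum(),
            "Total Restaurant Counts": grouped.count(),
        }
    )
    out["Asian Restaurant PCT"] = (
        out["Asian Restaurant Counts"] / out["Total Restaurant Counts"]
    )
    # Neighborhoods without any restaurant are missing here; fill them with zero.
    return out.reindex(all_neighborhoods).fillna(0)


def venue_frequencies(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode venue categories and average them per neighborhood."""
    onehot = pd.get_dummies(df["Venue Category"], dtype=float)
    onehot.insert(0, "Neighborhood", df["Neighborhood"])
    return onehot.groupby("Neighborhood").mean()


def top_venues(freq: pd.DataFrame, n: int = 15) -> dict:
    """Return the n most frequent venue categories for each neighborhood."""
    return {hood: row.nlargest(n).index.tolist() for hood, row in freq.iterrows()}


competition = restaurant_competition(venues, ["A", "B", "C"])
frequencies = venue_frequencies(venues)
print(competition)
print(top_venues(frequencies, n=3))
```

Averaging the one-hot columns per neighborhood turns raw counts into shares, so neighborhoods with different numbers of venues remain comparable.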

#### Building model: K-Means Clustering

The K-Means clustering algorithm is used in this project. Since the goal is to find neighborhoods similar to Long Island City in terms of rent and venues, Long Island City is included as one of the neighborhoods. After clustering, neighborhoods in the same cluster as Long Island City will be considered candidate locations for the new Takumen restaurant.

After data preparation, two important dataframes are generated. "df_neighborhood" contains neighborhood data (neighborhood name, borough, latitude, longitude), normalized rent data (1-bedroom apartment rent, 2-bedroom apartment rent), restaurant competition data (number of Asian restaurants, total number of restaurants, percentage of Asian restaurants), and the 1st to 15th most common venues. "df_kmeans" is the one-hot encoded dataset, which includes the same neighborhood data, normalized rent data, and restaurant competition data, plus the frequency of each venue category in each neighborhood. It is the dataset from which the features for K-Means clustering are drawn.

I excluded the neighborhood data from the df_kmeans dataset to generate the feature dataset X, then fit K-Means models with the number of clusters ranging from 1 to 10. Euclidean distance is used to evaluate each model. The k values and distortions are listed below:

      
  
      1 : 7.677323229842826
      2 : 4.1560487790151015
      3 : 3.1277847349533032
      4 : 2.576973876467541
      5 : 1.9033142257362188
      6 : 1.730833637467119
      7 : 1.5774003581972293
      8 : 1.432883223329601
      9 : 1.3037567227002564
      10 : 1.175022228721088
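
A sweep like the one listed above can be sketched with scikit-learn. This is a hedged illustration: it assumes that "distortion" means the mean Euclidean distance from each neighborhood to its nearest cluster centroid (the report only states that Euclidean distance was used), and it runs on synthetic points rather than the project's feature matrix X.

```python
import numpy as np
from sklearn.cluster import KMeans


def elbow_distortions(X, k_values, seed=0):
    """Fit one K-Means model per k and record its distortion.

    Distortion is assumed here to be the mean Euclidean distance from each
    sample to its nearest centroid.
    """
    distortions = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        # transform() returns the distance of every sample to every centroid.
        distortions[k] = float(km.transform(X).min(axis=1).mean())
    return distortions


# Synthetic stand-in for the feature matrix: three well-separated blobs.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_demo = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in blob_centers])

distortions = elbow_distortions(X_demo, range(1, 11))
for k, d in distortions.items():
    print(k, ":", d)
```

The elbow is the value of k after which the distortion stops dropping sharply. In the list above, the decrease from k = 4 to k = 5 is about 0.67, while each later step is below 0.2.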



After examining the plot of distortion against k (Fig 1), k is set to 5. The cluster label for each neighborhood is added to the "df_neighborhood" dataframe.


|<img src='https://github.com/emmakai/IBM-Capstone/blob/927070bf8a42236f0a8acbe5f74fe00ca6c46add/C9-Analysis-Scoreeshot%202.png?raw=true' alt='Appendix 2' width=400 /> |
|:--:| 
| *Fig 1. The Elbow Method* |

Finally, neighborhoods are drawn onto a New York City map, with colors assigned to each neighborhood cluster.    
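
The labeling and color assignment steps can be sketched as follows. This is a minimal example on toy data, not the project's code: the "Cluster Labels" and "Color" column names, the palette, the feature names, and the neighborhood names are all hypothetical, and the toy run uses k = 2 only because it has six rows.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical palette; one color per cluster for the map markers.
PALETTE = ["skyblue", "red", "green", "orange", "purple"]


def label_and_find_peers(df_features, target, k=5, seed=0):
    """Cluster the neighborhoods, attach labels and colors, and list the
    neighborhoods that share a cluster with `target`."""
    # Exclude the identifying column so only numeric features are clustered.
    X = df_features.drop(columns=["Neighborhood"])
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)

    labeled = df_features.copy()
    labeled["Cluster Labels"] = km.labels_
    labeled["Color"] = labeled["Cluster Labels"].map(
        lambda c: PALETTE[c % len(PALETTE)]
    )

    target_label = labeled.loc[
        labeled["Neighborhood"] == target, "Cluster Labels"
    ].iloc[0]
    same_cluster = labeled["Cluster Labels"] == target_label
    peers = labeled[same_cluster & (labeled["Neighborhood"] != target)]
    return labeled, peers["Neighborhood"].tolist()


# Toy feature table with two obvious groups (values are made up).
toy = pd.DataFrame(
    {
        "Neighborhood": ["Target", "P1", "P2", "Q1", "Q2", "Q3"],
        "rent_1br": [0.10, 0.12, 0.08, 0.90, 0.88, 0.91],
        "asian_pct": [0.10, 0.09, 0.11, 0.90, 0.92, 0.87],
    }
)

labeled, peers = label_and_find_peers(toy, target="Target", k=2)
print(labeled[["Neighborhood", "Cluster Labels", "Color"]])
print("Same cluster as Target:", peers)
```

The "Color" column can then be passed to whichever mapping library draws the markers, so that neighborhoods in the same cluster share a color.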

  

<p>&nbsp;</p>

## Result

Three neighborhoods are clustered together with Long Island City: Roosevelt Island in Manhattan, and Red Hook and Windsor Terrace in Brooklyn. Their rent, restaurant competition, and top 5 most common venues are shown below.

|<img src="https://github.com/emmakai/IBM-Capstone/blob/927070bf8a42236f0a8acbe5f74fe00ca6c46add/C9-Analysis-Screenshot%201.png?raw=true" alt="Appendix 1" width="800"/>|
|:--:|
|*Fig 2. Cluster 3*|


A map of clustered neighborhoods (and a zoomed map) is presented below.

<table><tr>
<td> <img src='https://github.com/emmakai/IBM-Capstone/blob/927070bf8a42236f0a8acbe5f74fe00ca6c46add/C9-Analysis-Screenshot%204.png?raw=true' alt='Appendix 4' width='800'/> Fig 4.
<td> <img src='https://github.com/emmakai/IBM-Capstone/blob/927070bf8a42236f0a8acbe5f74fe00ca6c46add/C9-Analysis-Screenshot%203.png?raw=true' alt='Appendix 3' width='800'/> Fig 5.
</tr></table>   


Each dot represents one neighborhood, and neighborhoods with the same dot color are in the same cluster. Long Island City is the sky blue dot farthest to the right. The sky blue dot slightly above it is Roosevelt Island. Together with the two other sky blue dots at the bottom of the map, Red Hook and Windsor Terrace, they form cluster 3. These four neighborhoods should have similar levels of rent, restaurant competition, and popular venues.

<p>&nbsp;</p>

## Discussion

#### Roosevelt Island

If we refer to the table in Fig 2, we can see that both apartment rents in Roosevelt Island are slightly above those in Long Island City, and that there are no restaurants there. On the map, Roosevelt Island is the thin, isolated neighborhood located between Manhattan and Queens. This can be seen as an opportunity as well as a risk. The slightly higher rent may still be affordable for the new Takumen, and residents in the community may have more to spend on dining than Long Island City residents. Similarly, there is no restaurant in the neighborhood yet, so Takumen has no existing business to model itself after there. On the other hand, residents who choose to dine in their own neighborhood would have Takumen as their only option. I would recommend conducting further research on the dining behavior of residents in the neighborhood.

#### Red Hook and Windsor Terrace

Both neighborhoods are very similar to Long Island City. Judging from their common venues, however, Red Hook seems to have a younger, more artistic vibe. Its top 10 common venues include an art gallery, a flower shop, a wine shop, a park, and a bar, whereas Windsor Terrace is more residential (diner, grocery store, café). Red Hook so far has only 2 restaurants, and neither serves Asian cuisine. Windsor Terrace has more dining options, and its 2 Asian restaurants suggest that Asian cuisine is accepted there. As a next step, I would recommend looking at Red Hook residents' dining tastes and at Windsor Terrace's demand for Asian dining options (whether or not the 2 existing restaurants are enough to meet it).

#### About Data

1. Rent data

   In this project, I only used 1-bedroom and 2-bedroom apartment rent data. It can be a strong indicator of restaurant rent costs as well as residents' purchasing power, but I expect it to be an imperfect proxy for both. In follow-up research, I would recommend directly estimating restaurant costs in different locations, as well as using the household income of each neighborhood.


2. Restaurant competition

   The Foursquare API provides venue ratings and tips through premium calls. Due to limits on the types of calls I can make in this project, ratings and tips are not included. However, these data are also very relevant for reaching stronger conclusions. I would recommend looking into them going forward.

<p>&nbsp;</p>

## Conclusion

The Takumen restaurant in Long Island City is doing very well in its neighborhood. The owners are considering opening a new restaurant in a similar neighborhood to continue Takumen's success. The analysis looks at the average apartment rent, number of restaurants, and common venues of neighborhoods in Manhattan and northern Brooklyn. The project mainly uses the K-Means clustering algorithm, along with some descriptive data. At first glance, Roosevelt Island in Manhattan and Red Hook and Windsor Terrace in Brooklyn show similarities with Long Island City. As a next step, I would recommend looking deeper into residents' dining behavior, as well as directly estimating restaurant costs in the selected neighborhoods.