# Clustering
This file describes how to approach the challenge.

We can use different types of clustering algorithms:

- KMeans
- Hierarchical
- DBSCAN
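
As a rough illustration of how the three algorithm families above are called, here is a minimal sketch using scikit-learn on synthetic data. The data, the cluster count of 3, and the DBSCAN parameters (`eps`, `min_samples`) are assumptions chosen for the toy example, not values tuned for the NYC data.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three well-separated groups (illustration only).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Hierarchical": AgglomerativeClustering(n_clusters=3),
    # DBSCAN infers the number of clusters from density; eps is an assumed value.
    "DBSCAN": DBSCAN(eps=1.5, min_samples=5),
}

labels = {name: model.fit_predict(X) for name, model in models.items()}

for name, lab in labels.items():
    # DBSCAN marks noise points with the label -1, so exclude it from the count.
    n_clusters = len(set(lab) - {-1})
    print(f"{name}: {n_clusters} clusters")
```

Note that KMeans and hierarchical clustering need the number of clusters up front, while DBSCAN needs a neighborhood radius instead.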

## Segmentation of NYC neighborhoods

The goal of this project is to segment the neighborhoods of New York City into separate clusters and examine the information about them. For clustering, we can use any available information **except** demographic and economic indicators. We don't want to segment neighborhoods based on those indicators; we keep them for the **profiling of clusters**, to see whether there are any important economic differences between the resulting clusters.

### Feature Engineering

Feature engineering plays a crucial role in this problem. We have a limited number of attributes, so we need to create features that are informative for segmentation.

- Google Places, Yelp and Foursquare APIs: number of venues, density of venues per square mile, number of restaurants, top restaurant category...
- Uber: number of rides per day in the neighborhood
- Meetups: number of events
- etc...
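
The venue-based features listed above could be derived with a pandas `groupby`. The sketch below uses a toy table whose column names mirror the `poi` DataFrame; the values, the choice of `Zipcode` as the grouping key, and the `area_sq_mi` lookup are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the POI table; column names mirror `poi`, values are made up.
venues = pd.DataFrame({
    "Zipcode": [10466, 10466, 10466, 10304, 10304],
    "Category": ["restaurant", "restaurant", "cafe", "restaurant", "bar"],
    "Rating": [1.9, 1.7, None, 2.3, 2.1],
})

# Hypothetical land area per ZIP code in square miles (not part of the source data).
area_sq_mi = pd.Series({10466: 2.0, 10304: 4.0}, name="area_sq_mi")

# One row per ZIP code with count-based and summary features.
features = venues.groupby("Zipcode").agg(
    n_venues=("Category", "size"),
    n_restaurants=("Category", lambda s: (s == "restaurant").sum()),
    top_category=("Category", lambda s: s.mode().iloc[0]),
    mean_rating=("Rating", "mean"),  # NaN ratings are skipped by default
)

# Density of venues per square mile; the division aligns on the Zipcode index.
features["venues_per_sq_mi"] = features["n_venues"] / area_sq_mi

print(features)
```

Grouping by an actual neighborhood identifier would require mapping each venue's coordinates to a neighborhood first, which is not shown here.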

[Feature Engineering Article](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)

[Another Feature Engineering Article](https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b)

In [3]:
# Import modules
import pandas as pd
import numpy as np

In [14]:
poi = pd.read_csv('full_poi.csv')
uber = pd.read_csv('clean_data/uber_rides.csv')
rats = pd.read_csv('clean_data/rat_sightings_sample_cleaned.csv')

In [17]:
poi

| Unnamed: 0 | Name | Rating | Price | Latitude | Longitude | Borough | Zipcode | Category |
|---|---|---|---|---|---|---|---|---|
| 0 | Ripe Kitchen & Bar | 1.918605 | 2.50 | 40.898209 | -73.838855 | Mount Vernon | | restaurant |
| 1 | New China Garden | 1.686047 | 1.25 | 40.897919 | -73.853364 | Bronx | 10466.0 | restaurant |
| 2 | Dunkin' | 1.627907 | 1.25 | 40.890459 | -73.849089 | Bronx | 10466.0 | restaurant |
| 3 | Subway | 1.511628 | 1.25 | 40.890468 | -73.849152 | Bronx | 10466.0 | restaurant |
| 4 | Popeyes Louisiana Kitchen | 1.627907 | 1.25 | 40.889492 | -73.843383 | Bronx | 10466.0 | restaurant |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27217 | My Way Deli | | | 40.617311 | -74.081740 | Staten Island | 10304.0 | restaurant |
| 27218 | Campo Bello | | | 40.617311 | -74.081740 | Staten Island | 10304.0 | restaurant |
| 27219 | Al Baraka Restaurant | 2.325581 | | 40.617311 | -74.081740 | Staten Island | 10304.0 | restaurant |
| 27220 | Chicken R Us | 2.325581 | | 40.617311 | -74.081740 | Staten Island | 10304.0 | restaurant |


### Feature Selection / Dimensionality Reduction
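We need to try several selection techniques and compare them to find out which works best for our problem.

Should we cluster on the original features or on PCA components?

Don't forget to scale the features for KMeans: it relies on Euclidean distances, so unscaled features with large ranges would dominate the clustering.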
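
A minimal sketch of scaling before KMeans, assuming two hypothetical features on very different scales (rides per day and mean rating). Wrapping `StandardScaler` and `KMeans` in a pipeline keeps the two steps together; the data and `n_clusters=2` are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales (values are invented).
rides_per_day = np.concatenate([rng.normal(500, 50, 20), rng.normal(5000, 50, 20)])
mean_rating = rng.normal(2.0, 0.3, 40)
X = np.column_stack([rides_per_day, mean_rating])

# Standardize each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))

# Scaling and clustering chained in a single estimator.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
cluster_id = pipeline.fit_predict(X)
```

Other scalers such as `MinMaxScaler` follow the same pattern; which one suits the data best is a judgment call.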

Articles

[Feature Selection](https://machinelearningmastery.com/an-introduction-to-feature-selection/)

[Feature Selection 2](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/)

[Feature Selection w/ Scikit-learn 1](https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/)

[Feature Selection w/ Scikit-learn 2](https://scikit-learn.org/stable/modules/feature_selection.html)

[Scaling/Normalization](https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff)

[PCA](https://drscotthawley.github.io/blog/2019/12/21/PCA-From-Scratch.html)

[Regression](https://datatofish.com/statsmodels-linear-regression/)

#### Feature Selection

In [4]:
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Chi-squared test
# ANOVA
# Apply to all feature groups? (POIs, Uber, location (presumably the housing/NYC JSON data), rats)
# classification, categorical (ordinal, nominal data types)
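
Chi-squared and ANOVA tests score features against a label, which a clustering problem does not have. One possible workaround, shown here purely as an assumption, is to use a known grouping such as `Borough` as a proxy label and check which features vary across it. The sketch uses scikit-learn's `f_classif` (ANOVA F-test) on invented data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Proxy label (assumption): borough of each hypothetical neighborhood.
borough = np.repeat(["Bronx", "Staten Island"], 30)

# Invented features: one differs by borough, one is pure noise.
X = pd.DataFrame({
    "n_venues": np.concatenate([rng.normal(50, 5, 30), rng.normal(20, 5, 30)]),
    "noise": rng.normal(0, 1, 60),
})

# ANOVA F-statistic and p-value for each feature against the proxy label.
f_stat, p_value = f_classif(X, borough)
scores = pd.DataFrame({"F": f_stat, "p_value": p_value}, index=X.columns)
print(scores.sort_values("F", ascending=False))
```

A high F-statistic only says the feature differs across the proxy groups; it does not guarantee the feature is useful for the final segmentation.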

#### PCA
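
A minimal PCA sketch with scikit-learn, assuming synthetic data in which three columns are strongly correlated. The 95% explained-variance threshold is an illustrative choice, not a recommendation from the source.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: three noisy copies of one signal plus one independent column.
base = rng.normal(size=(100, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(100, 3)),
    rng.normal(size=(100, 1)),
])

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps enough components to reach that share of variance.
pca = PCA(n_components=0.95)
components = pca.fit_transform(X_scaled)

print(components.shape)
print(pca.explained_variance_ratio_)
```

The resulting `components` array can be passed to a clustering algorithm in place of the original features to compare both options.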

### Modeling

Use different attributes and clustering techniques and compare the created clusters:

- clustering only on restaurant features
- clustering only on Uber features
- clustering only on location
- combination of all
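
The comparison across feature groups above could be organized as a loop over named column subsets. This is a sketch on an invented neighborhood-level table: the column names, the values, and the use of KMeans with 2 clusters are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 60

# Hypothetical neighborhood-level table; all columns and values are invented.
df = pd.DataFrame({
    "n_restaurants": rng.poisson(30, n),
    "mean_rating": rng.normal(2.0, 0.3, n),
    "rides_per_day": rng.normal(1000, 300, n),
    "latitude": rng.uniform(40.5, 40.9, n),
    "longitude": rng.uniform(-74.25, -73.7, n),
})

feature_sets = {
    "restaurants": ["n_restaurants", "mean_rating"],
    "uber": ["rides_per_day"],
    "location": ["latitude", "longitude"],
}
# The combined set is the union of the three groups.
feature_sets["all"] = [c for cols in feature_sets.values() for c in cols]

results = {}
for name, cols in feature_sets.items():
    X = StandardScaler().fit_transform(df[cols])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    # Silhouette is computed in each subset's own feature space.
    results[name] = silhouette_score(X, labels)

print(pd.Series(results).sort_values(ascending=False))
```

Because each silhouette score is measured in a different feature space, the numbers are only a rough guide; profiling the clusters against the held-out economic indicators is the more meaningful comparison.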

**Questions:**
1. Which clustering is the best?
2. How are neighborhoods split when we select only 2 clusters?
3. Are there any differences in housing and rental costs in different clusters?

### Evaluation

1. Check the segmentation evaluation metrics:
    - inertia
    - silhouette score
2. How did you come up with the correct number of clusters?
3. Is there any relationship between the clusters and economic indicators? If yes, what does it mean?
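
Inertia and silhouette score can be tracked over a range of cluster counts to support the choice of k. The sketch below uses synthetic blobs and KMeans; the range of k values is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (illustration only).
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

inertias, silhouettes = {}, {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances
    silhouettes[k] = silhouette_score(X, model.labels_)

# Pick the k with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

Inertia tends to keep falling as k grows, so it is usually read from an elbow plot, whereas the silhouette score can be maximized directly.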

You are required to share the file containing all NYC neighborhoods together with cluster_id with LighthouseLabs.

[Silhouette Score Plot Article](https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam)

[Inertia Article](https://www.codecademy.com/learn/machine-learning/modules/dspath-clustering/cheatsheet)