## Final Project - Part 2 + 3
> #### Cecelia Shao

> February 2018


> From: https://github.com/ga-students/DAT-NYC-1.16.18/tree/master/projects/final-project/02-experiment-writeup


### Part 2: Project Design Writeup
**Requirements:**
- Well-articulated problem statement with "specific aim" and hypothesis, based on your lightning talk 
- An outline of any potential methods and models
- Detailed explanation of extant data available (ie: build a data dictionary or link to pre-built data dictionaries)
- Describe any outstanding questions, assumptions, risks, caveats
- Demonstrate domain knowledge, including specific features or relevant benchmarks from similar projects
- Define your goals and criteria, in order to explain what success looks like

**Bonus**

- Consider alternative hypotheses: if your project is a regression problem, is it possible to rewrite it as a classification problem?
- "Convert" your goal metric from a statistical one (like Mean Squared Error) and tie it to something non-data people can understand, like a cost/benefit analysis, etc.

### Project Problem and Hypothesis
**What's the project about? What problem are you solving?**
My project explores ways to predict gentrification and the social cues and factors that could help communities and individuals identify whether they are at risk of being displaced.

Gentrification is a very contentious issue in New York City, especially because it affects both the quality of life and the opportunities for advancement of people who live in NYC neighborhoods. Established residents affected by gentrification often find themselves economically and socially marginalized, and it is alarming that these changes frequently occur along racial and economic fault lines.


**Problem Statement:** Using geographically tagged social media data from Yelp, Foursquare, and Instagram, determine whether higher levels of social media activity are positively associated with the likelihood that an NYC neighborhood becomes gentrified (here proxied by rising property values, for both rentals and sales).

**What kind of impact do you think it could have?**
By having information earlier on about which areas/neighborhoods are most likely to undergo gentrification, government officials and community groups can begin preparing residents and adjusting policies, such as rent control/stabilization policies, to lessen the negative impacts of gentrification or to ensure that the rights and needs of established residents are better taken into account when development decisions are made in that area.

This analysis could also be used to promote or induce gentrification (i.e., to identify "hot areas" that should receive more investment and that businesses should move to).


**Where does this seem to reside as a machine learning problem? Are you predicting some continuous number, or predicting a binary value?**
While this project could predict either a continuous number (home values) or a binary value (yes/no on whether an area is at "risk" of being gentrified), I chose to combine the two. By modeling a "normal" growth rate for home values (one not affected by gentrification), I'll try to identify whether the growth rate of property values (both rental and sales) in an area is "normal" or a high outlier (i.e., affected by gentrification).
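
One way to make the "normal vs. high-outlier growth" idea concrete is a robust outlier rule. The sketch below is only an illustration: the growth figures are made up, and the median + 3 x MAD cutoff (and the helper name `flag_high_growth`) are assumptions, not the method the project has settled on.

```python
from statistics import median

def flag_high_growth(growth_by_area, k=3.0):
    """Return areas whose growth rate is a high outlier.

    An area is flagged when its rate exceeds median + k * MAD, where MAD is
    the median absolute deviation. Both the rule and k are assumptions.
    """
    rates = list(growth_by_area.values())
    center = median(rates)
    mad = median(abs(r - center) for r in rates)
    cutoff = center + k * mad
    return sorted(area for area, r in growth_by_area.items() if r > cutoff)

# Hypothetical year-over-year growth in median rent (fractions, not real data)
example_growth = {
    "Allerton": 0.02,
    "Baychester": 0.03,
    "Belmont": 0.025,
    "Bronxdale": 0.03,
    "Castle Hill": 0.12,
}
flagged = flag_high_growth(example_growth)
print(flagged)
```

Using the median and MAD instead of the mean and standard deviation keeps a handful of fast-growing areas from inflating the definition of "normal" growth.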

**What do you think will have the most impact in predicting the value you are interested in solving for?**
I'm specifically using social media data, along with the presence and price of Airbnb listings, to see whether those factors point to an area that is likely to undergo gentrification. One open question is the timeline of predicting vs. identifying in early stages: if an area is already showing these signs, does that mean gentrification has already begun?

### Datasets
- Description of data set available, at the field level (see table)

- If from an API, include a sample return (this is usually included in API documentation!) (if doing this in markdown, use the javascript code tag)

(see below in Part 3 for this information)

### Domain knowledge
- What experience do you already have around this area?
- Does it relate or help inform the project in any way?
- What other research efforts exist? See what approaches others have made, or talk with your colleagues if it is work related about previous attempts at similar problems.

> http://www.cam.ac.uk/research/news/predicting-gentrification-through-social-networking-data
> https://www.inverse.com/article/14168-predict-gentrification-through-social-media-data
> https://spatial.usc.edu/wp-content/uploads/2015/03/Schaefer_Bryan.pdf
> https://www.npr.org/sections/13.7/2017/08/29/546980178/what-does-it-take-to-see-gentrification-before-it-happens
> http://mappingideas.sdsu.edu/big-group2016/group4/
> https://datasmart.ash.harvard.edu/news/article/where-is-gentrification-happening-in-your-city-1055


- Include a benchmark, how other models have performed, even if you are unsure what the metric means.

### Project Concerns
- What questions do you have about your project? What are you not sure you quite yet understand? (The more honest you are about this, the easier your instructors can help).
- What are the assumptions and caveats to the problem?
- What data do you not have access to but wish you had?
- What is already implied about the observations in your data set? For example, if your primary data set is twitter data, it may not be representative of the whole sample (say, predicting who would win an election)
- What are the risks to the project?
- What's the cost of your model being wrong? (What's the benefit of your model being right?)
- Is any of the data incorrect? Could it be incorrect?

### Outcomes
- What do you expect the output to look like?
- What does your target audience expect the output to look like?
- What gain do you expect from your most important feature on its own?
- How complicated does your model have to be?
- How successful does your project have to be in order to be considered a "success"?
- What will you do if the project is a bust (this happens! but it shouldn't here)?

**Obtaining data from API:**
https://schoolofdata.org/2013/11/18/web-apis-for-non-programmers/

**Data Sets:**
Social Media Data – “features”
- Yelp reviews + businesses: https://www.kaggle.com/yelp-dataset/yelp-dataset
- Foursquare  API: https://developer.foursquare.com/ 
- Instagram API: https://www.instagram.com/developer/ 

**Using Property Value as a Proxy for Gentrification**
- Airbnb Listings: http://insideairbnb.com/get-the-data.html
- Zillow: https://www.zillow.com/research/data/
- StreetEasy: https://streeteasy.com/blog/download-data/


- Step 1: Get access to the social media APIs from Foursquare and Instagram (the Yelp reviews + business data set is already in a ready-to-use format) and confirm that geospatial data is available for all the sources
- Step 2: Do EDA and confirm data quality across all datasets
- Step 3: Determine thresholds for gentrified or not using property value (what does it mean for rent/property value to be out of the norm?)
- Step 4: Identify trends in growth (or lack thereof) of social media activity across NYC and categorize areas into tiers/groups/ranks
- Step 5: Determine whether the ranks of social media activity for a given area are predictive of changes in property value
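
Steps 4 and 5 could be prototyped roughly as follows. This is a minimal sketch under stated assumptions: the area names, growth figures, and gentrified labels are hypothetical, and the equal-sized tier split and the helper names `assign_tiers` and `gentrified_share_by_tier` are illustrative choices, not a settled design.

```python
from collections import defaultdict

def assign_tiers(activity_growth, n_tiers=3):
    """Rank areas by activity growth and split them into n_tiers groups.

    Tier 1 holds the lowest-growth areas; tier n_tiers the highest.
    """
    ranked = sorted(activity_growth, key=activity_growth.get)
    tiers = {}
    for position, area in enumerate(ranked):
        tiers[area] = position * n_tiers // len(ranked) + 1
    return tiers

def gentrified_share_by_tier(tiers, gentrified):
    """Share of areas in each tier that were labeled as gentrified (Step 3)."""
    counts = defaultdict(lambda: [0, 0])  # tier -> [gentrified, total]
    for area, tier in tiers.items():
        counts[tier][0] += 1 if area in gentrified else 0
        counts[tier][1] += 1
    return {tier: hits / total for tier, (hits, total) in sorted(counts.items())}

# Hypothetical inputs: growth in check-ins/reviews per area, and Step 3 labels
activity_growth = {"A": 0.05, "B": 0.10, "C": 0.40, "D": 0.02, "E": 0.55, "F": 0.20}
gentrified = {"C", "E"}

tiers = assign_tiers(activity_growth)
shares = gentrified_share_by_tier(tiers, gentrified)
print(tiers)
print(shares)
```

If the share of gentrified areas rises with the tier, that is a first hint (not proof) that social media activity carries signal about property-value changes.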



![image.png](attachment:image.png)

Helpful link: https://www.dataquest.io/blog/python-api-tutorial/

From Greg: If they don't have an SDK, then you need to make `http` calls in your language of choice. Python makes this easy, but I always struggle with the authentication pieces; some APIs make it more challenging than others. SDKs will make it more straightforward.
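
To illustrate what a plain HTTP call without an SDK involves, here is a minimal standard-library sketch. The endpoint `https://api.example.com/v1/search`, the `EXAMPLE_CLIENT_ID` variable, and the helper names are placeholders, not a real API; authentication details differ per provider.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen

def build_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    return base_url + "?" + urlencode(params)

def get_json(base_url, params, timeout=10):
    """Make a GET request and decode the JSON body (performs a network call)."""
    with urlopen(build_url(base_url, params), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Hypothetical endpoint and environment variable, for illustration only
params = {
    "client_id": os.environ.get("EXAMPLE_CLIENT_ID", "<client-id>"),
    "near": "New York, NY",
    "limit": 10,
}
url = build_url("https://api.example.com/v1/search", params)
print(url)
```

Printing the URL before sending the request is a cheap way to check that the query parameters are encoded the way the API documentation expects.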

### Instagram API
- Client ID: (redacted)

- Client Secret: (redacted; keep credentials out of the notebook, e.g. in an environment variable)

https://auth0.com/docs/connections/social/instagram

In [5]:
import os
# Read the secret from the environment instead of hard-coding it in the notebook
instagram_client_secret = os.environ.get("instagram_client_secret")

# Set environment variable + check (run in the shell before starting Jupyter)
# export instagram_client_secret=<your-client-secret>
# echo $instagram_client_secret

### Part 3: Exploratory Analysis

> From: https://github.com/ga-students/DAT-NYC-1.16.18/tree/master/projects/final-project/03-exploratory-analysis

**Requirements:**
- Review the data set and project with an EIR during office hours.
- Practice importing (potentially unformatted) data into clean matrices|data frames, and if necessary, export into a form that makes sense (text files or a database, for example).
- Explore the mathematical properties and visualize data through a python visualization tool (matplotlib and seaborn)
- Provide insight about the data set and any impact on a hypothesis.

**Detailed Breakdown:**
- A well organized iPython notebook with code and output
- At least one visual for each independent variable and, if possible, its relationship to your dependent variable.
- It's just as important to show what's not correlated as it is to show any actual correlations found.
- Visuals should be well labeled and intuitive based on the data types.

**Bonus:**
- Surface and share your analysis online. Jupyter makes this very simple and the setup should not take long.
- Try experimenting with other visualization languages; python/pandas-highcharts, shiny/r, or for a real challenge, d3 on its own. Interactive data analysis opens the doors for others to easily interpret your work and explore the data themselves!

### Predicting house prices using python

https://towardsdatascience.com/create-a-model-to-predict-house-prices-using-python-d34fe8fad88f
> https://github.com/Shreyas3108/house-price-prediction

### Airbnb Listings: http://insideairbnb.com/get-the-data.html

- listings
- calendar data (October 2017); data goes from 2017 back to January 2015
- neighborhoods



In [15]:
import pandas as pd

In [28]:
neighborhoods = pd.read_csv('../final_project/neighborhoods.csv')
neighborhoods.head()

Unnamed: 0,neighbourhood_group,neighbourhood
0,Bronx,Allerton
1,Bronx,Baychester
2,Bronx,Belmont
3,Bronx,Bronxdale
4,Bronx,Castle Hill


In [29]:
listings = pd.read_csv('../final_project/listings.csv')
listings.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,18461891,"Bright, comfortable 1B studio near everything!",916092,Connie Mae,Queens,Ditmars Steinway,40.774142,-73.916246,Entire home/apt,110,6,0,,,1,0
1,20702398,Quiet house on City Island,1457680,James,Bronx,City Island,40.849191,-73.786509,Private room,50,1,2,2017-10-01,2.0,1,169
2,6627449,Large 1 BDRM in Great location,13886510,Arlene,Bronx,City Island,40.849775,-73.786609,Entire home/apt,125,3,21,2017-09-26,0.77,1,363
3,19949243,Stay aboard a sailboat,1149260,MoMo,Bronx,City Island,40.848838,-73.782276,Entire home/apt,100,3,0,,,1,90
4,1886820,Quaint City Island Community.,9815788,Steve,Bronx,City Island,40.841144,-73.783052,Entire home/apt,300,7,0,,,1,365
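
Since the presence and price of Airbnb listings are meant to serve as neighborhood-level signals, a per-neighbourhood summary is a natural first step. The sketch below uses only the standard library and a few columns from the `listings.head()` rows shown above; the helper name `summarize_by_neighbourhood` is hypothetical, and the same aggregation could be done with a pandas `groupby`.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# A few columns from the listings.head() rows shown above
sample_csv = """neighbourhood_group,neighbourhood,room_type,price
Queens,Ditmars Steinway,Entire home/apt,110
Bronx,City Island,Private room,50
Bronx,City Island,Entire home/apt,125
Bronx,City Island,Entire home/apt,100
Bronx,City Island,Entire home/apt,300
"""

def summarize_by_neighbourhood(csv_text):
    """Return {neighbourhood: (listing_count, median_price)}."""
    prices = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        prices[row["neighbourhood"]].append(float(row["price"]))
    return {name: (len(p), median(p)) for name, p in prices.items()}

summary = summarize_by_neighbourhood(sample_csv)
print(summary)
```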


In [30]:
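# Read only the first 100 rows of the detailed listings file, skipping malformed lines
# (on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False)
listings = pd.read_csv('../final_project/listings.csv.gz', nrows=100, compression='gzip',
                       on_bad_lines='skip')
print(listings)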

          id                            listing_url       scrape_id  \
0   18461891  https://www.airbnb.com/rooms/18461891  20171002002103   
1   20702398  https://www.airbnb.com/rooms/20702398  20171002002103   
2    6627449   https://www.airbnb.com/rooms/6627449  20171002002103   
3   19949243  https://www.airbnb.com/rooms/19949243  20171002002103   
4    1886820   https://www.airbnb.com/rooms/1886820  20171002002103   
5    5557381   https://www.airbnb.com/rooms/5557381  20171002002103   
6   19609887  https://www.airbnb.com/rooms/19609887  20171002002103   
7    7949480   https://www.airbnb.com/rooms/7949480  20171002002103   
8   21057372  https://www.airbnb.com/rooms/21057372  20171002002103   
9   16042478  https://www.airbnb.com/rooms/16042478  20171002002103   
10   9147025   https://www.airbnb.com/rooms/9147025  20171002002103   
11   1936633   https://www.airbnb.com/rooms/1936633  20171002002103   
12  19758402  https://www.airbnb.com/rooms/19758402  20171002002103   
13  11

### Instagram: https://www.instagram.com/developer/authentication/

### Foursquare Places API:  https://developer.foursquare.com/docs/api/getting-started

In [7]:
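import os
import requests

url = 'https://api.foursquare.com/v2/venues/explore'

# Credentials are read from the environment instead of being hard-coded;
# the variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are placeholders
params = dict(
  client_id=os.environ['FOURSQUARE_CLIENT_ID'],
  client_secret=os.environ['FOURSQUARE_CLIENT_SECRET'],
  v='20170801',           # API version date
  ll='40.7243,-74.0018',  # latitude,longitude to search around
  query='coffee',
  limit=1
)
resp = requests.get(url=url, params=params)
data = resp.json()  # parse the JSON body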

### Client ID
(redacted; keep API credentials out of the notebook, e.g. in an environment variable)
### Client Secret
(redacted)

In [8]:
data

{u'meta': {u'code': 200, u'requestId': u'5a95c5e1db04f53652c3c3be'},
 u'response': {u'groups': [{u'items': [{u'reasons': {u'count': 0,
       u'items': [{u'reasonName': u'globalInteractionReason',
         u'summary': u'This spot is popular',
         u'type': u'general'}]},
      u'referralId': u'e-0-45e98bacf964a52080431fe3-0',
      u'tips': [{u'agreeCount': 3,
        u'canonicalUrl': u'https://foursquare.com/item/54e51f11498ef4039fd5157a',
        u'createdAt': 1424301841,
        u'disagreeCount': 0,
        u'id': u'54e51f11498ef4039fd5157a',
        u'likes': {u'count': 3, u'groups': [], u'summary': u'3 likes'},
        u'logView': True,
        u'photo': {u'createdAt': 1424301843,
         u'height': 1440,
         u'id': u'54e51f13498ecf2f32fbfb2e',
         u'prefix': u'https://igx.4sqi.net/img/general/',
         u'source': {u'name': u'Foursquare for iOS',
          u'url': u'https://foursquare.com/download/#/iphone'},
         u'suffix': u'/54904394_O0ODgxCxgXgr6Mf8Odchhs1

In [None]:
# https://developer.foursquare.com/docs/api/checkins/details

In [9]:
import os
import requests

# NOTE: the ID below is the tip id from the explore response above, not a
# check-in id, which is why the API answers "Invalid checkin id"
url = 'https://api.foursquare.com/v2/checkins/54e51f11498ef4039fd5157a'

params = dict(
  client_id=os.environ['FOURSQUARE_CLIENT_ID'],  # placeholder variable names
  client_secret=os.environ['FOURSQUARE_CLIENT_SECRET'],
  v='20170801',
  ll='40.7243,-74.0018',
  limit=1
)
resp = requests.get(url=url, params=params)
data = resp.json()

In [10]:
data

{u'meta': {u'code': 400,
  u'errorDetail': u'Invalid checkin id',
  u'errorType': u'param_error',
  u'requestId': u'5a95c6cd1ed2190ba7ee81a8'},
 u'response': {}}
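
Because the API reports failures inside the JSON body, it can help to check `meta.code` before touching `response`. The sketch below is built only from the payload shape shown above; `FoursquareError` and `unwrap_response` are hypothetical helper names, not part of any Foursquare library.

```python
class FoursquareError(Exception):
    """Raised when the API reports a non-200 status in its 'meta' block."""

def unwrap_response(payload):
    """Return payload['response'] or raise if meta.code is not 200."""
    meta = payload.get("meta", {})
    if meta.get("code") != 200:
        raise FoursquareError(
            "{}: {}".format(meta.get("errorType"), meta.get("errorDetail"))
        )
    return payload.get("response", {})

# The error payload shown above, rewritten as a plain dict
bad = {
    "meta": {"code": 400, "errorDetail": "Invalid checkin id",
             "errorType": "param_error",
             "requestId": "5a95c6cd1ed2190ba7ee81a8"},
    "response": {},
}
try:
    unwrap_response(bad)
except FoursquareError as err:
    message = str(err)
    print(message)
```

Failing loudly at this point avoids silently building an empty table from an error response later in the notebook.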

In [13]:
import os
import requests

url = 'https://api.foursquare.com/v2/venues/search'

params = dict(
  client_id=os.environ['FOURSQUARE_CLIENT_ID'],  # placeholder variable names
  client_secret=os.environ['FOURSQUARE_CLIENT_SECRET'],
  v='20170801',
  near='New York, NY',  # geocoded place name to search within
  limit=10
)
resp = requests.get(url=url, params=params)
data = resp.json()

In [14]:
data

{u'meta': {u'code': 200, u'requestId': u'5a95c831351e3d52faa40f93'},
 u'response': {u'confident': False,
  u'geocode': {u'feature': {u'cc': u'US',
    u'displayName': u'New York, NY, United States',
    u'geometry': {u'bounds': {u'ne': {u'lat': 40.882214, u'lng': -73.907},
      u'sw': {u'lat': 40.679548, u'lng': -74.047285}},
     u'center': {u'lat': 40.742185, u'lng': -73.992602}},
    u'highlightedName': u'<b>New York</b>, <b>NY</b>, United States',
    u'id': u'geonameid:5128581',
    u'longId': u'72057594043056517',
    u'matchedName': u'New York, NY, United States',
    u'name': u'New York',
    u'slug': u'new-york-city-new-york',
    u'woeType': 7},
   u'parents': [],
   u'what': u'',
   u'where': u'new york ny'},
  u'venues': [{u'allowMenuUrlEdit': True,
    u'beenHere': {u'lastCheckinExpiredAt': 0},
    u'categories': [{u'icon': {u'prefix': u'https://ss3.4sqi.net/img/categories_v2/food/deli_',
       u'suffix': u'.png'},
      u'id': u'4bf58dd8d48988d146941735',
      u'name':

In [16]:
foursquare_data_dict = dict(data)

In [17]:
#making dataframe from python data dictionary 
ny_df=pd.DataFrame.from_dict(foursquare_data_dict)

In [18]:
foursquare_data_dict.keys()

[u'meta', u'response']

In [29]:
# get one venue's information
# (index by key: dict ordering is not guaranteed, and .items() is not subscriptable in Python 3)
foursquare_data_dict['response']['venues'][0]

{u'allowMenuUrlEdit': True,
 u'beenHere': {u'lastCheckinExpiredAt': 0},
 u'categories': [{u'icon': {u'prefix': u'https://ss3.4sqi.net/img/categories_v2/food/deli_',
    u'suffix': u'.png'},
   u'id': u'4bf58dd8d48988d146941735',
   u'name': u'Deli / Bodega',
   u'pluralName': u'Delis / Bodegas',
   u'primary': True,
   u'shortName': u'Deli / Bodega'}],
 u'contact': {},
 u'delivery': {u'id': u'322016',
  u'provider': {u'icon': {u'name': u'/delivery_provider_seamless_20180129.png',
    u'prefix': u'https://igx.4sqi.net/img/general/cap/',
    u'sizes': [40, 50]},
   u'name': u'seamless'},
  u'url': u'https://www.seamless.com/menu/essen-chelsea-699-6th-ave-new-york/322016?affiliate=1131&utm_source=foursquare-affiliate-network&utm_medium=affiliate&utm_campaign=1131&utm_content=322016'},
 u'hasPerk': False,
 u'hereNow': {u'count': 1,
  u'groups': [{u'count': 1,
    u'items': [],
    u'name': u'Other people here',
    u'type': u'others'}],
  u'summary': u'One other person is here'},
 u'id': u

In [21]:
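# put the venue records into a DataFrame
# (passing the whole 'response' dict raises
#  "ValueError: Mixing dicts with non-Series may lead to ambiguous ordering."
#  because it mixes a nested dict, a scalar, and a list; the list of venues is tabular)
ny_df = pd.DataFrame(foursquare_data_dict['response']['venues'])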
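
Venue records are nested (for example `hereNow` and `categories` in the output above), so turning them into a table needs a flattening step. The following is a minimal sketch using only the standard library; the `flatten` helper and the trimmed `payload` are illustrative stand-ins, not the full API response.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with sep.

    Lists are kept as-is so each venue still maps to exactly one row.
    """
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Trimmed stand-in for a venues/search payload; only fields seen above are used
payload = {
    "meta": {"code": 200},
    "response": {
        "confident": False,
        "venues": [
            {"hasPerk": False,
             "hereNow": {"count": 1, "summary": "One other person is here"},
             "categories": [{"name": "Deli / Bodega"}]},
        ],
    },
}
rows = [flatten(venue) for venue in payload["response"]["venues"]]
print(rows[0])
```

Passing `rows` to `pd.DataFrame(rows)` would then give one row per venue with dotted column names; recent pandas versions also provide `pandas.json_normalize` for the same purpose.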