1. My data source is monthly median rent figures for each NYC neighborhood listed on StreetEasy.com since 2010
URL: https://streeteasy.com/blog/data-dashboard/?agg=Total&metric=Inventory&type=Sales&bedrooms=Any%20Bedrooms&property=Any%20Property%20Type&minDate=2010-01-01&maxDate=2020-01-01&area=Flatiron,Brooklyn%20Heights

Additional data:
Job data is available through the NYS Department of Labor: https://www.labor.ny.gov/stats/nyc/
Transportation data is available through NYC Open Data and Citibike: http://web.mta.info/developers/turnstile.html (this shows use of each station-- would love to learn how to scrape and aggregate some of this data if it's not too much) https://www.citibikenyc.com/system-data https://www1.nyc.gov/html/dot/html/bicyclists/bike-counts.shtml
Zoning and rent control are a more qualitative measure that I will research here: https://rentguidelinesboard.cityofnewyork.us/resources/faqs/rent-control/

2. Data quality issues include StreetEasy's bias -- it is certainly possible to find cheaper rentals by word of mouth or via Craigslist.com, Facebook Marketplace, or Facebook groups, so StreetEasy does not perfectly reflect the NYC housing market. To address this, and because this very problem prompted an interesting question for me, my project will investigate StreetEasy directly. As part of that, and to address another potential data quality issue, I will look into which neighborhoods started popping up on StreetEasy and when. This focuses the question on StreetEasy, rather than purporting to answer all questions about NYC housing using only StreetEasy's data. StreetEasy is still a good source for this project, since it reflects many buyers' and renters' choices.
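
One possible way to find when each neighborhood first shows up is to look for the first month with a non-missing rent in each row. The sketch below uses a small made-up table in the same wide layout as the StreetEasy file; the area names, values, and the `first_month` variable are illustrative assumptions, not taken from the real data.

```python
import pandas as pd

# Toy stand-in for the StreetEasy file: one row per area, one column per month.
# The area names and rent values are invented for illustration.
wide = pd.DataFrame(
    {
        "areaName": ["Area A", "Area B", "Area C"],
        "2010-01": [2000.0, None, None],
        "2010-02": [2050.0, 1500.0, None],
        "2010-03": [2100.0, 1550.0, None],
    }
)

# Treat every column whose name starts with a four-digit year as a month column.
month_cols = [c for c in wide.columns if c[:4].isdigit()]

# For each area, take the label of the first month that holds a non-missing value.
# Areas with no data at all come back as missing.
first_month = (
    wide.set_index("areaName")[month_cols]
    .apply(lambda row: row.first_valid_index(), axis=1)
)

print(first_month)
```

Counting how many areas share each first month (for example with `first_month.value_counts()`) would then show when coverage expanded.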

3. The data are formatted in CSV. I'd do the same if I had made it myself, since this is a format I find both easily human readable (since I'm very used to Excel spreadsheets) and easily workable within Python or R. 

4. Numbers are integral to real estate. That being said, Beer's data rationality -- the idea that speed is crucial to "good analytics" and will give a competitive edge -- doesn't come through very strongly to me in this dataset. The dataset gives monthly rental data on different neighborhoods from 2010 to January 2020, so it's already outdated by "data gaze" standards, and monthly seems like a reasonable granularity to collect in this context (since most renters pay on a monthly basis). It does give me pause that StreetEasy makes this data available, and I wonder what benefits they gain, if any. I'm sure they have teams of data analysts providing real-time insights into more detailed data behind closed doors, though.

In [1]:
import pandas as pd

In [2]:
import os
os.getcwd()

'/Users/evasibinga'

In [3]:
med_rent_data = pd.read_csv("WWDdata/medianAskingRent_All.csv")

In [4]:
med_rent_data.head()

Unnamed: 0,areaName,Borough,areaType,2010-01,2010-02,2010-03,2010-04,2010-05,2010-06,2010-07,...,2019-04,2019-05,2019-06,2019-07,2019-08,2019-09,2019-10,2019-11,2019-12,2020-01
0,All Downtown,Manhattan,submarket,3200.0,3200.0,3025.0,3100.0,3100.0,3200.0,3195.0,...,3950.0,4000.0,4095.0,4000.0,3995.0,4014.0,4095.0,4099.0,4081.0,4050.0
1,All Midtown,Manhattan,submarket,2875.0,2800.0,2800.0,2850.0,2895.0,2950.0,3000.0,...,3593.0,3643.0,3695.0,3718.0,3725.0,3711.0,3695.0,3740.0,3750.0,3754.0
2,All Upper East Side,Manhattan,submarket,2460.0,2450.0,2400.0,2500.0,2550.0,2550.0,2595.0,...,2995.0,3000.0,3050.0,3067.0,2995.0,3000.0,3125.0,3250.0,3348.0,3300.0
3,All Upper Manhattan,Manhattan,submarket,1836.0,1800.0,1795.0,1800.0,1823.0,1850.0,1875.0,...,2400.0,2450.0,2500.0,2500.0,2495.0,2441.0,2391.0,2350.0,2350.0,2395.0
4,All Upper West Side,Manhattan,submarket,2895.0,2800.0,2750.0,2800.0,2800.0,2795.0,2800.0,...,3479.0,3495.0,3478.0,3425.0,3539.0,3650.0,3650.0,3695.0,3629.0,3700.0


6. Data validity could be assessed through summary statistics such as the mean and median -- I know the general amounts to expect based on my familiarity with NYC rent prices, so checking the mean and median could reveal values that are clearly incorrect.

In [7]:
df = pd.DataFrame(med_rent_data)

In [17]:
df.mean(numeric_only=True)

2010-01    2303.238636
2010-02    2266.213483
2010-03    2250.160920
2010-04    2277.079545
2010-05    2327.710843
2010-06    2329.273810
2010-07    2403.162500
2010-08    2419.097561
2010-09    2562.126582
2010-10    2509.917647
2010-11    2439.804598
2010-12    2397.056180
2011-01    2399.579545
2011-02    2369.326087
2011-03    2341.638298
2011-04    2368.451613
2011-05    2378.115789
2011-06    2342.265306
2011-07    2410.063830
2011-08    2413.783505
2011-09    2517.152174
2011-10    2507.702128
2011-11    2496.187500
2011-12    2527.233333
2012-01    2516.717391
2012-02    2524.106383
2012-03    2505.031250
2012-04    2586.087912
2012-05    2545.145833
2012-06    2505.632653
              ...     
2017-08    2517.281690
2017-09    2528.000000
2017-10    2483.521429
2017-11    2450.105634
2017-12    2488.555556
2018-01    2475.035971
2018-02    2496.072993
2018-03    2508.426471
2018-04    2534.496350
2018-05    2530.517730
2018-06    2567.773050
2018-07    2575.222222
2018-08    

In [15]:
df.median(numeric_only=True)

2010-01    2092.5
2010-02    2125.0
2010-03    2100.0
2010-04    2100.0
2010-05    2200.0
2010-06    2250.0
2010-07    2400.0
2010-08    2425.0
2010-09    2530.0
2010-10    2400.0
2010-11    2200.0
2010-12    2200.0
2011-01    2299.5
2011-02    2225.0
2011-03    2275.0
2011-04    2350.0
2011-05    2301.0
2011-06    2187.5
2011-07    2387.5
2011-08    2400.0
2011-09    2475.0
2011-10    2497.5
2011-11    2412.5
2011-12    2462.5
2012-01    2387.5
2012-02    2421.5
2012-03    2499.5
2012-04    2500.0
2012-05    2497.5
2012-06    2480.0
            ...  
2017-08    2217.5
2017-09    2200.0
2017-10    2177.5
2017-11    2112.5
2017-12    2195.0
2018-01    2155.0
2018-02    2200.0
2018-03    2249.5
2018-04    2295.0
2018-05    2200.0
2018-06    2275.0
2018-07    2300.0
2018-08    2273.0
2018-09    2250.0
2018-10    2200.0
2018-11    2168.0
2018-12    2195.0
2019-01    2199.0
2019-02    2200.0
2019-03    2200.0
2019-04    2200.0
2019-05    2250.0
2019-06    2250.0
2019-07    2275.0
2019-08   
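
Beyond eyeballing the mean and median, a validity check could also count missing values and flag rents outside a plausible range. This is a minimal sketch on invented data; the bounds `LOW` and `HIGH` are my own assumed thresholds, and the toy table includes one deliberately implausible value.

```python
import pandas as pd

# Toy stand-in for the wide rent table. Values are invented, including one
# deliberately implausible rent (25.0) and one missing month.
wide = pd.DataFrame(
    {
        "areaName": ["Area A", "Area B", "Area C"],
        "2010-01": [2000.0, 1500.0, 3200.0],
        "2010-02": [2050.0, 25.0, None],
    }
)
month_cols = [c for c in wide.columns if c[:4].isdigit()]
rents = wide.set_index("areaName")[month_cols]

# Assumed plausibility bounds in dollars per month; adjust as needed.
LOW, HIGH = 500, 20000

# Share of areas with no value in each month.
missing_share = rents.isna().mean()

# Reshape to one row per (area, month) so flagged values keep their labels.
long = rents.reset_index().melt(
    id_vars="areaName", var_name="month", value_name="rent"
)
# Comparisons against missing values are False, so only real numbers get flagged.
flagged = long[(long["rent"] < LOW) | (long["rent"] > HIGH)]

print(rents.describe().loc[["min", "50%", "max"]])
print(missing_share)
print(flagged)
```

The per-month minimum and maximum are often more revealing than the mean here, since a single bad value barely moves an average across many neighborhoods.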

7. The first thing I want to visualize is the change in rent over time by neighborhood, ideally with each line colored by rental market/neighborhood so that I can compare rents across areas.

In [18]:
med_rent_data.head()

Unnamed: 0,areaName,Borough,areaType,2010-01,2010-02,2010-03,2010-04,2010-05,2010-06,2010-07,...,2019-04,2019-05,2019-06,2019-07,2019-08,2019-09,2019-10,2019-11,2019-12,2020-01
0,All Downtown,Manhattan,submarket,3200.0,3200.0,3025.0,3100.0,3100.0,3200.0,3195.0,...,3950.0,4000.0,4095.0,4000.0,3995.0,4014.0,4095.0,4099.0,4081.0,4050.0
1,All Midtown,Manhattan,submarket,2875.0,2800.0,2800.0,2850.0,2895.0,2950.0,3000.0,...,3593.0,3643.0,3695.0,3718.0,3725.0,3711.0,3695.0,3740.0,3750.0,3754.0
2,All Upper East Side,Manhattan,submarket,2460.0,2450.0,2400.0,2500.0,2550.0,2550.0,2595.0,...,2995.0,3000.0,3050.0,3067.0,2995.0,3000.0,3125.0,3250.0,3348.0,3300.0
3,All Upper Manhattan,Manhattan,submarket,1836.0,1800.0,1795.0,1800.0,1823.0,1850.0,1875.0,...,2400.0,2450.0,2500.0,2500.0,2495.0,2441.0,2391.0,2350.0,2350.0,2395.0
4,All Upper West Side,Manhattan,submarket,2895.0,2800.0,2750.0,2800.0,2800.0,2795.0,2800.0,...,3479.0,3495.0,3478.0,3425.0,3539.0,3650.0,3650.0,3695.0,3629.0,3700.0


And... I realize that I need to spend more time figuring out how to turn this data from wide to long format. I thought I knew how to do this, but I've been working on it for a while and am stumped at the moment with this particular dataset, given that it has SO many date columns (2010-01, 2010-02, and so on). Until I "melt" my dataset and get all the date values into a single column (which seems like it will make an impossibly large dataset), I can't visualize the data over time.
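
For reference, here is a hedged sketch of the wide-to-long step using `DataFrame.melt`, run on a small made-up table with the same column layout as `med_rent_data`. The values and the names `id_cols`, `month`, and `rent` are my own choices for illustration. On the real file, the `Unnamed: 0` column would probably need to be dropped first (or the CSV read with `index_col=0`) so that it is not melted along with the dates.

```python
import pandas as pd

# Toy stand-in with the same layout as the StreetEasy file (values invented).
wide = pd.DataFrame(
    {
        "areaName": ["Area A", "Area B"],
        "Borough": ["Manhattan", "Brooklyn"],
        "areaType": ["neighborhood", "neighborhood"],
        "2010-01": [2000.0, 1500.0],
        "2010-02": [2050.0, None],
        "2010-03": [2100.0, 1550.0],
    }
)

id_cols = ["areaName", "Borough", "areaType"]

# melt keeps the id columns and turns every other column into a
# (month, rent) pair, giving one row per area per month.
long = wide.melt(id_vars=id_cols, var_name="month", value_name="rent")

# Parse the "YYYY-MM" labels into real dates so they sort and plot in order.
long["month"] = pd.to_datetime(long["month"], format="%Y-%m")

# Drop months where an area has no rent yet, then order by area and date.
long = (
    long.dropna(subset=["rent"])
    .sort_values(["areaName", "month"])
    .reset_index(drop=True)
)

print(long.shape)
print(long.head())
```

The long table holds the same values as the wide one, just arranged as one row per area per month, so it is taller but not larger in any way that pandas would struggle with. From there, grouping by `areaName` and plotting `rent` against `month` gives one line per neighborhood.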