## Assignment 1

**Submission instructions**: Please submit this assignment as a Jupyter notebook on Canvas. Your submission should consist of this file with answers filled out in the notebook itself; do not submit any other files. Rename the file to "A1_YourNetID.ipynb". If you have two team members, please name it "A1_NetIDa_NetIDb.ipynb".

### Problem 1: Getting Data (5 points)

#### Go to [NYC Open Data](https://data.cityofnewyork.us/) and find a dataset which looks interesting to you. Describe its primary columns and describe what each row corresponds to. Brainstorm 3 questions you can answer with this dataset.

### [Hyperlocal Temperature Monitoring](https://data.cityofnewyork.us/dataset/Hyperlocal-Temperature-Monitoring/qdq3-9eqn)
Updated August 20, 2021 & Data Provided by Department of Health and Mental Hygiene (DOHMH)

**Tags:** _Climate Change_, _Heat Impact_, _Urban Climate_, _Safety_, _Resilience_


**Description of columns**

1. `Sensor.ID`: Unique identifier of the sensor
2. `AirTemp`: Average hourly air temperature, in Fahrenheit
3. `Day`: Date of the measurement
4. `Hour`: Hour of day
5. `Latitude`: Latitude of the sensor
6. `Longitude`: Longitude of the sensor
7. `Year`: Year
8. `Install.Type`: Type of mounting (tree/light)
9. `Borough`: Borough where the sensor is located
10. `ntacode`: Neighborhood Tabulation Area (NTA) code


**Three questions to answer with this dataset**

1. Where can you find a cool place in your neighbourhood during heatwaves?
2. How much heat is 'trapped' during the evening (e.g. the difference between an individual AirTemp reading and the average AirTemp within a certain radius)?
3. Which locations have the largest difference between their highest and lowest measurements?
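
As a rough illustration of how the third question could be approached, the sketch below computes the temperature range per sensor. It assumes rows shaped like dictionaries keyed by the column names listed above (`Sensor.ID`, `AirTemp`); the function name and the sample rows are hypothetical and not taken from the dataset.

```python
from collections import defaultdict


def temperature_range_by_sensor(rows):
    """Return {sensor_id: max AirTemp - min AirTemp} for the given rows."""
    readings = defaultdict(list)
    for row in rows:
        readings[row["Sensor.ID"]].append(float(row["AirTemp"]))
    return {sensor: max(temps) - min(temps) for sensor, temps in readings.items()}


# Made-up sample rows, for illustration only
sample_rows = [
    {"Sensor.ID": "A", "AirTemp": 70.0},
    {"Sensor.ID": "A", "AirTemp": 85.5},
    {"Sensor.ID": "B", "AirTemp": 72.0},
    {"Sensor.ID": "B", "AirTemp": 78.0},
]

ranges = temperature_range_by_sensor(sample_rows)
widest = max(ranges, key=ranges.get)  # sensor with the largest range
print(widest, ranges[widest])
```

Joining the result back to `Latitude` and `Longitude` would then give the locations with the widest swings.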



### Problem 2: Getting Data (20 points)

#### Identify the datasets on NYC Open Data that can answer the following questions. **You do not actually have to answer the questions; only the dataset names and links are needed.**

1. How many sidewalk cafe license applications were there in NYC in 2019 and 2020?
2. How did the number of street trees in NYC increase from 2005 to 2015?
3. On what date in 2020 did NYC have the highest positive rate of COVID-19 testing?
4. Which school district in NYC has the lowest percentage of families who prefer remote learning? 
5. Where are the residential buildings in NYC that have the highest and lowest median rent per bedroom?


**Datasets**

1. [Sidewalk Café Licenses and Applications](https://data.cityofnewyork.us/Business/Sidewalk-Caf-Licenses-and-Applications/qcdj-rwhu)
2. [2005 Street Tree Census](https://data.cityofnewyork.us/Environment/2005-Street-Tree-Census/29bw-z7pj) & [2015 Street Tree Census - Tree Data](https://data.cityofnewyork.us/Environment/2015-Street-Tree-Census-Tree-Data/uvpi-gqnh) 
3. [COVID-19 Daily Counts of Cases, Hospitalizations, and Deaths](https://data.cityofnewyork.us/Health/COVID-19-Daily-Counts-of-Cases-Hospitalizations-an/rc75-m7u3)
4. [Learning Preference City Remote Learning - as of Jan 4, 2021](https://data.cityofnewyork.us/Education/Learning-Preference-City-Remote-Learning-as-of-Jan/k5d2-tkrr)
5. [Housing New York Units by Building](https://data.cityofnewyork.us/Housing-Development/Housing-New-York-Units-by-Building/hg8x-zxpr)

### Problem 3: Data Analysis with NYC Open Data (30 points)

#### Use [Motor Vehicle Collisions - Crashes](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95) and [Motor Vehicle Collisions - Person](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Person/f55k-p6yu) to answer the following questions: 
1. Did the number of vehicle crashes increase or decrease after Covid-19 lockdown in March 2020? By how much 3 months later?
2. Where did the most serious vehicle crash (with the highest number of casualties) occur since 2010? 
3. Use matplotlib to make a plot of the death rate of pedestrians by age. Describe the trend. 

### Answers

1. The number of vehicle crashes **decreased** after the COVID-19 lockdown in March 2020. Three months later, in June 2020, there were 7141 motor vehicle crashes compared with 11074 in March 2020, a drop of 3933 crashes (about 36%).
2. Neither dataset records data before July 2012, so the most serious crash since 2010 may have happened between 2010 and July 2012 and be missing from the data. For this exercise we therefore look at the period from the earliest recorded date (07/01/2012) to the latest data point (09/03/2021). In this period, the crash with the highest number of casualties occurred at **WEST STREET/WEST HOUSTON STREET, MANHATTAN at 15:08 on 10/31/2017**. There were 8 casualties: 6 pedestrians and 2 cyclists. The motorist survived.
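
The comparison in answer 1 comes down to counting crashes per month and taking a relative change. The sketch below shows one way to do that on records shaped like the API rows. It assumes each record carries an ISO-style `crash_date` string (an assumption about the field name and format), and the helper names and sample records are hypothetical.

```python
from collections import Counter


def crashes_per_month(records):
    """Count records per 'YYYY-MM' month, assuming ISO-style `crash_date` strings."""
    return Counter(record["crash_date"][:7] for record in records)


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Made-up records, for illustration only
sample_records = [
    {"crash_date": "2020-03-01T00:00:00.000"},
    {"crash_date": "2020-03-15T00:00:00.000"},
    {"crash_date": "2020-06-02T00:00:00.000"},
]
monthly = crashes_per_month(sample_records)
print(monthly["2020-03"], monthly["2020-06"])

# Monthly totals quoted in answer 1
first_month_total, later_month_total = 11074, 7141
change = round(percent_change(first_month_total, later_month_total), 1)
print(change)  # -35.5
```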


In [18]:
# Import matplotlib, pandas, and NumPy
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


# Import the Socrata client for the NYC Open Data API
from sodapy import Socrata

# Passing None as the app token creates an unauthenticated client, which only works with public datasets.
# The API returns the first 2000 results as JSON; sodapy converts them to a Python list of dictionaries.
# How to use the NYC Open Data API: https://dev.socrata.com/foundry/data.cityofnewyork.us/f55k-p6yu

# Get data through the API from NYC Open Data
limitCall = 2000 # row limit for the API call
client = Socrata("data.cityofnewyork.us", None) # unauthenticated client
data = pd.DataFrame(client.get("f55k-p6yu", person_type="Pedestrian", limit=limitCall)) # pedestrian rows only, up to the limit


# Add a boolean column marking whether the person was killed
data['person_killed'] = np.where(data.person_injury=="Killed", True, False) # NumPy .where() applies the condition row by row

# Keep only the columns needed and sort by "person_age"
selectedData = data[["person_age", "person_injury", "person_killed"]].sort_values(by=["person_age"])

# Print the overall death ratio (mean of the boolean column across all rows)
print('Death ratio in: ' + str(selectedData['person_killed'].mean()))
print(selectedData.head())





Death ratio in: 0.0085
     person_age person_injury  person_killed
953           0       Injured          False
1463          0       Injured          False
1316          0       Injured          False
1165          0   Unspecified          False
1050          0   Unspecified          False
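
The ratio above covers all pedestrians at once, while question 3 asks for the death rate by age. A minimal sketch of the per-age aggregation follows. It assumes records shaped like the API rows, with `person_age` and `person_injury` fields; the function name, the 10-year bin width, the age cut-offs, and the sample records are illustrative choices, not part of the dataset.

```python
from collections import defaultdict


def death_rate_by_age_bin(records, bin_width=10):
    """Return {bin_start: share of records with person_injury == 'Killed'}."""
    totals = defaultdict(int)
    killed = defaultdict(int)
    for record in records:
        try:
            age = int(record["person_age"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable age
        if age < 0 or age > 120:
            continue  # skip implausible ages
        bin_start = (age // bin_width) * bin_width
        totals[bin_start] += 1
        if record.get("person_injury") == "Killed":
            killed[bin_start] += 1
    return {b: killed[b] / totals[b] for b in sorted(totals)}


# Made-up records, for illustration only
sample_records = [
    {"person_age": "5", "person_injury": "Injured"},
    {"person_age": "34", "person_injury": "Injured"},
    {"person_age": "37", "person_injury": "Killed"},
    {"person_age": "82", "person_injury": "Killed"},
    {"person_age": None, "person_injury": "Injured"},
]

rates = death_rate_by_age_bin(sample_records)
print(rates)  # {0: 0.0, 30: 0.5, 80: 1.0}
```

With matplotlib, `plt.bar(list(rates.keys()), list(rates.values()), width=10, align="edge")` would draw one bar per age bin. A sample of 2000 rows contains few deaths, so the per-bin rates are likely to be noisy unless more rows are requested.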


### Problem 4: Simple crawling with API (45 points)

Here, you will learn how to use the Twitter API to find tweets. To use the Twitter API, you will need to get an API key. 

##### Getting an API key
1. Register for a Twitter account and go to the [developer portal](https://developer.twitter.com/en/portal/petition/user-case). Apply for a *Student* developer account for this course. You may or may not get a developer account immediately.
2. On the Developer Portal, find "Projects & Apps". Create an app for this course. Remember to save your Consumer_key, Consumer_secret, Access_token, and Access_token_secret. If you didn't, don't worry: you can regenerate new ones.
3. The consumer_key, consumer_secret, access_token, and access_token_secret can be used in combination with the Tweepy library to access Twitter data. 

#### Create a tweepy.API object using your consumer_key, consumer_secret, access_token, and access_token_secret. Use the api.search method to search for 100 tweets with keyword "NY mayor". You should be able to do this in a single line of code. (10 points)


In [104]:
import os
import tweepy

# Twitter API credentials, read from environment variables so that secrets are not stored in the notebook
# (the variable names below are placeholders; use whatever names you export)
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

# instantiating the api
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# creating API object
api = tweepy.API(auth,wait_on_rate_limit=True)

# search for the latest 100 tweets matching the query "NY mayor"
results = api.search(q="NY mayor", count=100)

# print results
print(results)

[Status(_api=<tweepy.api.API object at 0x7fa8fa01c280>, _json={'created_at': 'Sun Sep 12 23:17:56 +0000 2021', 'id': 1437193790817714184, 'id_str': '1437193790817714184', 'text': 'RT @theRealKiyosaki: NY Mayor De Blasio just announced NY jails are at their lowest levels. He just released thousands of inmates. He also…', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'theRealKiyosaki', 'name': 'therealkiyosaki', 'id': 29856819, 'id_str': '29856819', 'indices': [3, 19]}], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 1011896766306836480, 'id_str': '1011896766306836480', 'name': 'Blundell James', 'screen_name': 'BlundellJame', 'location'

In response to your query, the tweepy library will return a list of length 100, with each element in the list corresponding to one tweet. Take the first element in this list and look at its _json field: this returns the data from your query in a JSON blob. 
#### Describe 5 fields in the JSON (5 points). 

* _created_at_: Creation time of the tweet, in UTC - `'created_at': 'Sun Sep 12 21:54:55 +0000 2021'`
* _profile_image_url_https_: HTTPS URL of the user's profile picture - `'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1197327700945506304/kWMTzTP1_normal.jpg'`
* _profile_background_color_: Hex color of the user's profile page background - `'profile_background_color': 'EDD607'`
* _source_: HTML link element pointing to the app the tweet was posted from - `'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>'`
* _geo_enabled_: Whether the user has enabled geotagging of their tweets - `'geo_enabled': True`
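
To illustrate how such fields can be read from the JSON blob, the sketch below parses `created_at` into a timezone-aware `datetime` and pulls a few user fields. The dictionary is a trimmed-down stand-in for one tweet's `_json`, with an illustrative screen name and location, and `summarize_tweet` is a hypothetical helper.

```python
from datetime import datetime

# Trimmed-down stand-in for one tweet's `_json` dictionary; values are illustrative
tweet_json = {
    "created_at": "Sun Sep 12 23:17:56 +0000 2021",
    "source": '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    "user": {"screen_name": "example_user", "location": "New York, NY", "geo_enabled": True},
}


def summarize_tweet(tweet):
    """Extract the creation time and a few user fields from a tweet dictionary."""
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    user = tweet.get("user", {})
    return {
        "created": created,
        "location": user.get("location") or None,  # empty string becomes None
        "geo_enabled": bool(user.get("geo_enabled", False)),
    }


summary = summarize_tweet(tweet_json)
print(summary["created"].isoformat())  # 2021-09-12T23:17:56+00:00
```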

One of the fields in the JSON provides the location information in the user's profile, if the user has chosen to make this public. 
#### Using this field, write a function to compute 1) what fraction of users are from New York state, 2) what fraction of users are not from New York state, and 3) what fraction of users either have no public location information or do not allow you to answer this question precisely. The three fractions should add up to 1. (15 points)


In [106]:
# Regex for state & USA location by Lea Pope https://towardsdatascience.com/filtering-tweets-by-location-baca601ae5cd
usa_states_regex = r',\s{1}(A[KLRZ]|C[AOT]|D[CE]|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])'

usa_states_fullname_regex = r'(ALABAMA|ALASKA|ARIZONA|ARKANSAS|CALIFORNIA|'\
                            r'COLORADO|CONNECTICUT|DELAWARE|FLORIDA|GEORGIA|HAWAII|'\
                            r'IDAHO|ILLINOIS|INDIANA|IOWA|KANSAS|KENTUCKY|'\
                            r'LOUISIANA|MAINE|MARYLAND|MASSACHUSETTS|MICHIGAN|'\
                            r'MINNESOTA|MISSISSIPPI|MISSOURI|MONTANA|'\
                            r'NEBRASKA|NEVADA|NEW\sHAMPSHIRE|NEW\sJERSEY|'\
                            r'NEW\sMEXICO|NEW\sYORK|NORTH\sCAROLINA|'\
                            r'NORTH\sDAKOTA|OHIO|OKLAHOMA|OREGON|PENNSYLVANIA|'\
                            r'RHODE\sISLAND|SOUTH\sCAROLINA|SOUTH\sDAKOTA|'\
                            r'TENNESSEE|TEXAS|UTAH|VERMONT|VIRGINIA|'\
                            r'WASHINGTON|WEST\sVIRGINIA|WISCONSIN|WYOMING|USA)'
                        

# declare a function that counts how many tweet locations match the string passed in as `loc`
def findLocation(res, loc):

    locMatch = 0
    locNoMatch = 0
    locUndefined = 0

    # loop over all tweets in results and print each user's profile location
    for tweet in res:

        tweetLocation = tweet._json["user"]['location'] # get user location from tweet
        print(tweetLocation)

        # `tweetLocation in loc` is True when the profile location is a substring of `loc`;
        # an empty location is a substring of every string, so it also counts as a match
        if tweetLocation in loc:
            locMatch += 1 # increment match counter
        elif tweetLocation not in loc:
            locNoMatch += 1 # increment no match counter
        else:
            # the two tests above are exhaustive, so this branch is never reached
            locUndefined += 1

    return [locMatch, locNoMatch, locUndefined]


findLocation(results, "NY")

Los Angeles, CA

New York, NY


AMERICA, USA
Pocatello, ID

DC area

under your bed.
Jefferson, LA
Florida, USA

Orange County, CA
Martinez, CA


New York State

 U.S.A

New York, NY



Midwest



NYC
Kanagawa-ken, Japan
Queens, NYC
rent free in your head
Closer than you think, but not
New Jersey, USA
New York, USA
USA

Brooklyn, NY
New York, NY


Figment, Imagination
Northern Ireland
NY
Soulsville, Memphis✊🏾🇺🇸
Chelsea, Manhattan
Fresno
Holland, MI

Australia 🇦🇺

Virginia Beach, VA
Canarsee Land
Sand Bar, just outside of TX
New York, NY
Hutt space
East Coast 
United States
Los Angeles

Iowa
Saskatchewan , Canada

New York, NY
Toronto, Ontario • Chester, PA
Miami, FL, New York




They/them
Big D's close enough, Texas 🤠


Geosynchronous orbit over D.C.

Philadelphia, PA
British Columbia, Canada


NYC
Atlanta, GA
Atlanta, GA
Atlanta, GA
Atlanta, GA
Forest Hills, NY


Forest Hills, NY
Brooklyn, NY
Ballston Spa, NY
Florida, USA
Oakland, CA
Texas

Cut and Shoot, TX




[38, 62, 0]
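
The result above is a list of counts in which empty profile locations are counted as matches and the third bucket stays at 0. One possible way to obtain three fractions that sum to 1 is sketched below. It is a heuristic under stated assumptions, not a complete gazetteer: the patterns, the borough names, the category labels, and the rule that a location naming both New York and another state is treated as imprecise are all illustrative choices. Locations outside the US fall into the "unknown" bucket unless a country list is added.

```python
import re
from collections import Counter

# Heuristic patterns (assumptions for illustration only)
NY_ABBREV = re.compile(r"\bNYC?\b")
NY_NAMES = re.compile(
    r"\b(new york|brooklyn|manhattan|queens|bronx|staten island)\b", re.IGNORECASE
)
OTHER_ABBREV = re.compile(
    r",\s*(A[KLRZ]|C[AOT]|D[CE]|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|"
    r"N[CDEHJMV]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])\b"
)
OTHER_NAMES = re.compile(
    r"\b(alabama|alaska|arizona|arkansas|california|colorado|connecticut|"
    r"delaware|florida|georgia|hawaii|idaho|illinois|indiana|iowa|kansas|"
    r"kentucky|louisiana|maine|maryland|massachusetts|michigan|minnesota|"
    r"mississippi|missouri|montana|nebraska|nevada|new hampshire|new jersey|"
    r"new mexico|north carolina|north dakota|ohio|oklahoma|oregon|"
    r"pennsylvania|rhode island|south carolina|south dakota|tennessee|texas|"
    r"utah|vermont|virginia|washington|west virginia|wisconsin|wyoming)\b",
    re.IGNORECASE,
)


def classify_location(location):
    """Return 'ny', 'not_ny', or 'unknown' for one profile location string."""
    text = (location or "").strip()
    if not text:
        return "unknown"  # no public location
    is_ny = bool(NY_ABBREV.search(text) or NY_NAMES.search(text))
    is_other = bool(OTHER_ABBREV.search(text) or OTHER_NAMES.search(text))
    if is_ny and not is_other:
        return "ny"
    if is_other and not is_ny:
        return "not_ny"
    return "unknown"  # ambiguous, or no recognizable US state


def location_fractions(locations):
    """Return the share of locations in each of the three categories."""
    labels = ("ny", "not_ny", "unknown")
    total = len(locations)
    if total == 0:
        return {label: 0.0 for label in labels}
    counts = Counter(classify_location(loc) for loc in locations)
    return {label: counts[label] / total for label in labels}


# A few locations copied from the printed output above, plus one empty entry
sample_locations = [
    "Los Angeles, CA", "", "New York, NY", "AMERICA, USA", "Kanagawa-ken, Japan",
    "Queens, NYC", "New Jersey, USA", "Miami, FL, New York", "Texas", "Brooklyn, NY",
]
fractions = location_fractions(sample_locations)
print(fractions)
```

With a tweepy result list, the input could be built as `[tweet._json["user"]["location"] for tweet in results]`.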

Another way to know where users are tweeting from is to use their geolocation data, which again is only made public for a fraction of users. 

#### Use the Tweepy API to search for 100 tweets on any topic (you can do this by using "*" for the q field in api.search) within one mile of Cornell Tech (40.7556,-73.9562). Read the Tweepy documentation carefully for the correct format to pass in geocode data (https://docs.tweepy.org/en/stable/api.html). (10 points)
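
As a hint on the geocode format: the Tweepy documentation for `api.search` describes the `geocode` value as a single string of the form "latitude,longitude,radius", with the radius given in `mi` or `km`. The sketch below only builds that string. `format_geocode` is a hypothetical helper, and the search call is left as a comment because it needs an authenticated client.

```python
def format_geocode(latitude, longitude, radius, unit="mi"):
    """Build a 'latitude,longitude,radius' string with the radius in mi or km."""
    if unit not in ("mi", "km"):
        raise ValueError("unit must be 'mi' or 'km'")
    return f"{latitude},{longitude},{radius}{unit}"


cornell_tech = format_geocode(40.7556, -73.9562, 1)
print(cornell_tech)  # 40.7556,-73.9562,1mi

# With an authenticated tweepy 3.x client named `api` (not created here), the search would be:
# nearby = api.search(q="*", geocode=cornell_tech, count=100)
```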

#### Write a couple sentences on the ethical considerations involved in using Twitter location data. What privacy concerns might we have? Given that relatively few Tweets have geolocation data, and that most people are not on Twitter, what concerns about data representativeness might we have? (5 points)