# DS 300 Mini Project 1

## Joseph Sepich and Hunter Dicicco


In [1]:
# Import packages
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import pandas as pd
import numpy as np

# Read in data set
airbnb_data = pd.read_csv("AB_NYC_2019.csv")
airbnb_data.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


# Dataset Analysis

List the column names in the dataset. For each column, identify the type of information (string, integer, etc) and present the basic descriptives (range of values, mean, frequencies, outliers or unique values, etc). Explain the utility of the data that you want to preserve after your anonymization process. For example, discuss what types of queries you want your anonymized data to support. Your anonymization approach should support your answer.

## Column Description

Note that in this data set, each row or case represents a property listed for rent.

In [2]:
# Get mean stats
airbnb_data.describe()

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,38843.0,48895.0,48895.0
mean,19017140.0,67620010.0,40.728949,-73.95217,152.720687,7.029962,23.274466,1.373221,7.143982,112.781327
std,10983110.0,78610970.0,0.05453,0.046157,240.15417,20.51055,44.550582,1.680442,32.952519,131.622289
min,2539.0,2438.0,40.49979,-74.24442,0.0,1.0,0.0,0.01,1.0,0.0
25%,9471945.0,7822033.0,40.6901,-73.98307,69.0,1.0,1.0,0.19,1.0,0.0
50%,19677280.0,30793820.0,40.72307,-73.95568,106.0,3.0,5.0,0.72,1.0,45.0
75%,29152180.0,107434400.0,40.763115,-73.936275,175.0,5.0,24.0,2.02,2.0,227.0
max,36487240.0,274321300.0,40.91306,-73.71299,10000.0,1250.0,629.0,58.5,327.0,365.0


In [3]:
# Cardinality
for column in airbnb_data.columns:
    print(column)
    print(airbnb_data[column].nunique())

id
48895
name
47905
host_id
37457
host_name
11452
neighbourhood_group
5
neighbourhood
221
latitude
19048
longitude
14718
room_type
3
price
674
minimum_nights
109
number_of_reviews
394
last_review
1764
reviews_per_month
937
calculated_host_listings_count
47
availability_365
366


In [4]:
# Range
range_cols = ['latitude', 'longitude', 'price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']
for col in range_cols:
    print(col)
    print(airbnb_data[col].max() - airbnb_data[col].min())

latitude
0.41326999999999003
longitude
0.5314299999999861
price
10000
minimum_nights
1249
number_of_reviews
629
reviews_per_month
58.49
calculated_host_listings_count
326
availability_365
365


| Column Name | Data Type | Mean | Range | Cardinality | Description |
|------------:|-----------|------|-------|-------------|-------------|
| id | Integer | | | 48895 | Listing Identifier |
| host_id | Integer | | | 37457 | Host Identifier |
| latitude | Double | 40.728949 | 0.41327 | 19048 | Latitude Location |
| longitude | Double | -73.952170 | 0.53143 | 14718 | Longitude Location |
| price | Integer | 152.720687 | 10000 | 674 | Nightly Rental Rate |
| minimum_nights | Integer | 7.029962 | 1249 | 109 | Rental Night Minimum |
| number_of_reviews | Integer | 23.274466 | 629 | 394 | Total Reviews for Property |
| reviews_per_month | Double | 1.373221 | 58.49 | 937 | Reviews each Month |
| calculated_host_listings_count | Integer | 7.143982 | 326 | 47 | Total Listings for Host |
| availability_365 | Integer | 112.781327 | 365 | 366 | Days Available in the Year |
| neighbourhood_group | String | | | 5 | Big Five Neighbourhood Location |
| neighbourhood | String | | | 221 | Sub Neighbourhood Location |
| name | String | | | 47905 | Name of Rental |
| host_name | String | | | 11452 | Name of Host |
| room_type | String | | | 3 | Type of Rental Available |
| last_review | Date | | | 1764 | Date of Last Review |
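
The values in the table above were collected from three separate cells. As a minimal sketch, they could also be gathered in one pass. `summarize_columns` is a hypothetical helper (not part of pandas), and the small `sample` frame only repeats four columns of the five rows printed by `head()` so that the block runs without the CSV file. Unlike the table, this helper would also report a mean and range for identifier-like numeric columns such as `id`, where those values are not meaningful.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Collect dtype, mean, range and cardinality for every column of df."""
    numeric = df.select_dtypes(include="number")
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "mean": numeric.mean(),
        "range": numeric.max() - numeric.min(),
        "cardinality": df.nunique(),
    })
    # Non-numeric columns have no mean or range, so those cells stay NaN.
    # reindex restores the original column order after index alignment.
    return summary.reindex(df.columns)

# The five rows shown by head(), restricted to four columns.
sample = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan", "Brooklyn", "Manhattan"],
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Entire home/apt", "Entire home/apt"],
    "price": [149, 225, 150, 89, 80],
    "minimum_nights": [1, 1, 3, 1, 10],
})
summary = summarize_columns(sample)
print(summary)
```

On the full data set the same call would be `summarize_columns(airbnb_data)`.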



## Utility Preservation Goals

In our anonymization approach we want to preserve queries that relate to aggregate statistics. This depends on maintaining the distribution of the data: the closer the anonymized distribution stays to the original, the more accurate these queries will be. We also want to keep fairly generic information available, such as the type of the room or the minimum length of a stay. Some example queries that we want to preserve are:

* SELECT AVG(price) FROM airbnb_listings
* SELECT AVG(price) FROM airbnb_listings WHERE minimum_nights < 3
* SELECT MIN(price), MAX(price) FROM airbnb_listings WHERE neighbourhood_group = 'Manhattan'
* SELECT MIN(price), MAX(price) FROM airbnb_listings WHERE room_type = 'Entire home/apt'
* SELECT room_type, minimum_nights FROM airbnb_listings WHERE neighbourhood_group = 'Brooklyn'


In [5]:
# Evaluate the example queries on the original data
print(airbnb_data['price'].mean())
print(airbnb_data.loc[airbnb_data['minimum_nights'] < 3, 'price'].mean())

# Minimum and maximum price for Manhattan and for entire homes/apartments
manhattan_prices = airbnb_data.loc[airbnb_data['neighbourhood_group'] == 'Manhattan', 'price']
print(f"{manhattan_prices.min()},{manhattan_prices.max()}")
entire_home_prices = airbnb_data.loc[airbnb_data['room_type'] == 'Entire home/apt', 'price']
print(f"{entire_home_prices.min()},{entire_home_prices.max()}")

print(airbnb_data.loc[airbnb_data['neighbourhood_group'] == 'Brooklyn', ['room_type', 'minimum_nights']])

152.7206871868289
144.0574623197903
0,10000
0,10000
             room_type  minimum_nights
0         Private room               1
3      Entire home/apt               1
6         Private room              45
12        Private room               4
15     Entire home/apt               2
...                ...             ...
48882     Private room              20
48884     Private room               7
48887  Entire home/apt               1
48890     Private room               2
48891     Private room               4

[20104 rows x 2 columns]
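
The values printed above can serve as a baseline for judging an anonymized copy of the data. A minimal sketch of such a comparison follows, assuming each query is wrapped in a function that takes a DataFrame and returns a single number. The `compare_queries` helper is hypothetical, the `sample` frame repeats only the five rows printed by `head()`, and capping `price` at 200 is purely a stand-in for a real anonymization step so that the two frames differ.

```python
import pandas as pd

def compare_queries(original, anonymized, queries):
    """Evaluate each scalar query on both frames and report the error."""
    rows = []
    for name, query in queries.items():
        true_value = float(query(original))
        anon_value = float(query(anonymized))
        abs_error = abs(anon_value - true_value)
        # Relative error is undefined when the true value is 0 (e.g. a minimum price of 0).
        rel_error = abs_error / abs(true_value) if true_value != 0 else float("nan")
        rows.append({"query": name, "original": true_value, "anonymized": anon_value,
                     "abs_error": abs_error, "rel_error": rel_error})
    return pd.DataFrame(rows).set_index("query")

# Three of the example queries, written as functions of a DataFrame.
queries = {
    "mean_price": lambda df: df["price"].mean(),
    "mean_price_short_stay": lambda df: df.loc[df["minimum_nights"] < 3, "price"].mean(),
    "max_price_manhattan": lambda df: df.loc[df["neighbourhood_group"] == "Manhattan", "price"].max(),
}

sample = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan", "Brooklyn", "Manhattan"],
    "price": [149, 225, 150, 89, 80],
    "minimum_nights": [1, 1, 3, 1, 10],
})
# Stand-in for an anonymized copy: prices capped at 200 (illustrative only).
stand_in = sample.assign(price=sample["price"].clip(upper=200))
report = compare_queries(sample, stand_in, queries)
print(report)
```

Reporting the absolute error next to the relative error matters here because the minimum price in the data set is 0, which makes a relative error undefined for that query.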


# Identify Sensitive Information

Based on the data analysis you did, identify the explicit identifiers, quasi-identifiers, sensitive attributes and non-sensitive attributes in the dataset. List these attributes with your reasons for your classification.

### Explicit Identifiers

* Id - unique and explicit index of the property for rent

### Quasi-Identifiers

* Host_id - Could identify a property by knowing the host and other information.
* Name - Could identify a property based off the name given. 
* Host_name - Could identify a property by knowing the host and other information.
* Latitude - Location can help to identify a property
* Longitude - Location can help to identify a property
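
How strongly a set of quasi-identifiers narrows down a listing can be estimated by counting how many rows share each combination of values. A group of size 1 means the combination singles out one listing, which is the quantity k-anonymity is concerned with. The sketch below does this with a `groupby`. Both helper names are hypothetical, and the `sample` frame repeats only the five rows printed by `head()` so that the block runs on its own.

```python
import pandas as pd

def equivalence_class_sizes(df, quasi_identifiers):
    """Number of rows sharing each combination of quasi-identifier values."""
    return df.groupby(list(quasi_identifiers), dropna=False).size()

def smallest_class(df, quasi_identifiers):
    """Smallest group size, i.e. the largest k for which df is k-anonymous on these columns."""
    return int(equivalence_class_sizes(df, quasi_identifiers).min())

sample = pd.DataFrame({
    "host_name": ["John", "Jennifer", "Elisabeth", "LisaRoxanne", "Laura"],
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan", "Brooklyn", "Manhattan"],
    "latitude": [40.64749, 40.75362, 40.80902, 40.68514, 40.79851],
    "longitude": [-73.97237, -73.98377, -73.9419, -73.95976, -73.94399],
})

print(smallest_class(sample, ["latitude", "longitude"]))  # exact coordinates
print(smallest_class(sample, ["neighbourhood_group"]))    # borough only
```

On these five rows the exact coordinates give a smallest class of 1, while the borough alone gives 2. Running the same helpers on `airbnb_data` would show how the full data set behaves.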

### Sensitive Attributes

* Availability_365 - unavailable dates can help reveal how the rental is used
* Last_Review - last review could indicate details about last person's trip
* Neighbourhood - can reveal a lot about a certain rental property
* Price - Reveals how valuable a property could be. Robbers would want to target richer properties.
* Calculated_host_listings_count - Reveals how many properties a person owns. 
* Number_of_reviews - Reveals popularity of a certain property.

### Non-sensitive Attributes

* Neighbourhood_group - very vague information about location
* Room_type - Not much information can be derived from this. This information could be inferred by location information.
* Reviews_per_month - mostly reveals usage of a property
* Minimum_nights - mostly reveals how much the host is willing to let the rental be rented

# Anonymization Approach

Develop a solution (coding) to anonymize the dataset. You can use any technique or combination of algorithms discussed in the class (k-anonymity, l-diversity, t-closeness, differential privacy, randomization,  cryptography, etc).

## Naive Anonymization

To start off any anonymization, it is good to first take the naive approach. By the naive approach I mean removing explicit identifiers as well as any quasi-identifiers that are not important to the utility of the data or are likely to disclose too much information. To complete this first part I would remove the explicit identifier **id** as well as the quasi-identifiers **latitude** and **longitude**. I would remove these two pieces of location data because they reveal the **exact** location of the property. This could lead to the disclosure of a lot of other information through publicly available auxiliary data. Furthermore, more vague yet still helpful location data already exists in the form of the neighbourhood and neighbourhood_group attributes.

### Example of QI being revealing

Let's look at the first row in the data set:

In [8]:
airbnb_data.head(1)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365


Just by looking up the latitude and longitude on Google, I am able to find this property's address: 805 Friel Pl, Brooklyn, NY 11218. From the whitepages I can also see that the host's full name is John Layyapilli and that he is in his 70s, and that most of the apartments in the building appear to be rented out, since they are not listed as having residents. Any attacker reasonably good at digging up publicly available information can easily use this to expose this record.

Going off this, the first step in our anonymization process will be removing these three columns:

In [11]:
redacted_airbnb = airbnb_data.drop(['id','latitude','longitude'], axis=1)
redacted_airbnb.head()

Unnamed: 0,name,host_id,host_name,neighbourhood_group,neighbourhood,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,Private room,149,1,9,2018-10-19,0.21,6,365
1,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,Private room,150,3,0,,,1,365
3,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


## HUNTER START HERE


## Differential Privacy Queries
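
As one possible starting point, the sketch below answers a mean query, such as the mean of `price`, with the Laplace mechanism. It is a sketch under stated assumptions, not a finished design: the number of rows is treated as public, neighbouring data sets differ in the value of a single row, and values are clipped to a range `[lower, upper]` that should be chosen without looking at the data. The function name `dp_mean` and the bound of 10000 in the usage lines are assumptions made for illustration.

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng=None):
    """Epsilon-differentially private mean via the Laplace mechanism."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = np.random.default_rng() if rng is None else rng
    # Clipping bounds the influence any single row can have on the result.
    clipped = np.clip(np.asarray(values, dtype=float), lower, upper)
    if clipped.size == 0:
        raise ValueError("values must not be empty")
    # Changing one clipped value moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / clipped.size
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(clipped.mean() + noise)

# Prices of the five rows printed by head(); the bound 10000 is an assumed public limit.
prices = [149, 225, 150, 89, 80]
rng = np.random.default_rng(0)
print(dp_mean(prices, lower=0, upper=10000, epsilon=1.0, rng=rng))
```

With only five rows the noise scale is 10000 / 5 = 2000, so the noise dominates the answer. On all 48895 listings the same bound and epsilon = 1 would give a scale of roughly 0.2, which is why this kind of mechanism suits aggregate queries over the whole data set better than queries over small subsets.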

# Utility Loss