<span style="font-size:2em;"> ****Data Science 300 Final Project**** </span>

<span style="font-size:1em;">**Anonymizing Airbnb (New York) 2019 listings database**</span>



<span style="font-size:1.5em;">First, let's import the following libraries:</span>

In [1]:
import pandas as pd
import numpy as np


<span style="font-size:1.5em;">Now we will use the pandas function **'.read_csv'** to load the CSV file into this notebook as a DataFrame:</span>

In [2]:
# You can change the path to where the file is on your computer by editing the dataset_path definition below:
dataset_path = "/Users/arko/Desktop/AB_NYC_2019.csv"

air_db = pd.read_csv(dataset_path) 


<span style="font-size:1.5em;">The first step of **k-anonymization** is to remove the **explicit identifiers** from the original database. Having already inspected the database, I know the explicit identifiers to be:</span>

* <span style="font-size:1.5em;">id</span>
* <span style="font-size:1.5em;">name</span>
* <span style="font-size:1.5em;">host_name</span>
* <span style="font-size:1.5em;">host_id</span><br>

<span style="font-size:1.5em;">We **remove these fields from the database** and print the first 5 rows of the new table.</span>

In [3]:
# Drop all the explicit identifiers from the table in one call
air_db = air_db.drop(columns=["id", "name", "host_name", "host_id"])

#Print first 5 rows to check new data
air_db.head()

Unnamed: 0,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


<span style="font-size:1.5em;">We selected *latitude*, *longitude* and *price* as **Quasi Identifiers** for this dataset, and hence must generalize these values. For **latitude** and **longitude** I found it very hard to generalize without removing most of their utility, so I **suppressed both** of them, keeping only the whole-degree part of each value.</span>

In [4]:
latitude = air_db['latitude']
air_db['latitude'] = air_db['latitude'].mask(latitude>40,"40.****")


In [5]:
longitude = air_db['longitude']
air_db['longitude'] = air_db['longitude'].mask(longitude < -73,"-73.****")


In [6]:
air_db['longitude'] = air_db['longitude'].mask(longitude<-74,"-74.****")
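<span style="font-size:1.5em;">To make the three masking cells above easier to follow, here is a minimal sketch on a two-row toy table. The `coords` frame and its values are made up for illustration; only the `.mask` calls mirror the notebook. Each `.mask(condition, value)` replaces the entries where the condition is true and leaves the rest alone, and the conditions are always evaluated on the original numeric columns.</span>

```python
import pandas as pd

# Hypothetical toy coordinates, not rows from the real dataset
coords = pd.DataFrame({
    "latitude": [40.64749, 40.80902],
    "longitude": [-73.97237, -74.00525],
})

# Keep the numeric originals so every condition is tested on numbers
lat = coords["latitude"]
lon = coords["longitude"]

# Replace every latitude above 40 with a whole-degree label
masked_lat = lat.mask(lat > 40, "40.****")

# First label everything west of -73, then relabel what lies west of -74
masked_lon = lon.mask(lon < -73, "-73.****").mask(lon < -74, "-74.****")

print(masked_lat.tolist())
print(masked_lon.tolist())
```

<span style="font-size:1.5em;">The order of the two longitude masks matters: applying the -74 mask first would be overwritten by the broader -73 mask.</span>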


<span style="font-size:1.5em;">Now we generalize price by using ranged labels for the price, and creating a new field called **price_range**. It will be appended as the last field of the table.</span>

In [7]:
labels = ["{0} - {1}".format(i, i + 49) for i in range(0, 10000, 50)]

#print(len(labels)) -> output -> 200 -> hence we created 200 labels

# New column called price_range

air_db["price_range"] = pd.cut(air_db.price, range(0, 10001, 50), right=False, labels=labels)

#Print first 5 rows to check new data

air_db.head()


Unnamed: 0,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,price_range
0,Brooklyn,Kensington,40.****,-73.****,Private room,149,1,9,2018-10-19,0.21,6,365,100 - 149
1,Manhattan,Midtown,40.****,-73.****,Entire home/apt,225,1,45,2019-05-21,0.38,2,355,200 - 249
2,Manhattan,Harlem,40.****,-73.****,Private room,150,3,0,,,1,365,150 - 199
3,Brooklyn,Clinton Hill,40.****,-73.****,Entire home/apt,89,1,270,2019-07-05,4.64,1,194,50 - 99
4,Manhattan,East Harlem,40.****,-73.****,Entire home/apt,80,10,9,2018-11-19,0.1,1,0,50 - 99
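<span style="font-size:1.5em;">The following sketch shows how the bin edges behave with `right=False`, using a few hand-picked prices (the `prices` series is an assumption for illustration). Each bin includes its left edge and excludes its right edge, so 149 and 150 land in different ranges, and a price of exactly 10000 falls outside the last bin and becomes NaN.</span>

```python
import pandas as pd

# Same labels and edges as the notebook cell above
labels = ["{0} - {1}".format(i, i + 49) for i in range(0, 10000, 50)]

# Hypothetical prices chosen to sit on bin boundaries
prices = pd.Series([0, 49, 50, 149, 150, 9999, 10000])

# right=False makes every bin closed on the left: [0, 50), [50, 100), ...
binned = pd.cut(prices, range(0, 10001, 50), right=False, labels=labels)

for price, label in zip(prices, binned):
    print(price, "->", label)
```

<span style="font-size:1.5em;">Rows whose price_range is NaN belong to no group, so they are silently left out of anything built from the groups later on.</span>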


<span style="font-size:1.5em;">Now we will change the position of **price_range** in the table to make it easier to observe. We will not drop the original **price** column just yet, as we need it in the last part of this code.</span>

In [8]:
price_range = air_db["price_range"]

air_db = air_db.drop("price_range", axis=1)

#air_db = air_db.drop("price", axis=1)


In [9]:
# Re-insert price_range as the third column (position 2)
air_db.insert(2, "price_range", price_range)
air_db.head()


Unnamed: 0,neighbourhood_group,neighbourhood,price_range,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,Brooklyn,Kensington,100 - 149,40.****,-73.****,Private room,149,1,9,2018-10-19,0.21,6,365
1,Manhattan,Midtown,200 - 249,40.****,-73.****,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,Manhattan,Harlem,150 - 199,40.****,-73.****,Private room,150,3,0,,,1,365
3,Brooklyn,Clinton Hill,50 - 99,40.****,-73.****,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,Manhattan,East Harlem,50 - 99,40.****,-73.****,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


<span style="font-size:1.5em;">Now we are going to group the dataset with the pandas function "**.groupby**", using **price_range** as the grouping key. Note that "**.groupby**" returns a DataFrameGroupBy object rather than a table, so we need to print its groups to see whether the grouping worked. I did this using a loop below. </span>

In [10]:
# observed=False keeps empty price ranges as groups, which the later cells rely on
g = air_db.groupby('price_range', observed=False)
g


<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc6d6720bb0>

In [11]:
for price_g, price_g_air_db in g:
    print(price_g)
    print(price_g_air_db)
    

0 - 49
      neighbourhood_group       neighbourhood price_range latitude longitude  \
28              Manhattan              Inwood      0 - 49  40.****  -73.****   
36               Brooklyn  Bedford-Stuyvesant      0 - 49  40.****  -73.****   
39              Manhattan     Lower East Side      0 - 49  40.****  -73.****   
58               Brooklyn          Greenpoint      0 - 49  40.****  -73.****   
149              Brooklyn         Fort Greene      0 - 49  40.****  -73.****   
...                   ...                 ...         ...      ...       ...   
48871           Manhattan              Harlem      0 - 49  40.****  -73.****   
48877            Brooklyn            Bushwick      0 - 49  40.****  -73.****   
48878              Queens            Elmhurst      0 - 49  40.****  -73.****   
48882            Brooklyn            Bushwick      0 - 49  40.****  -73.****   
48891            Brooklyn            Bushwick      0 - 49  40.****  -73.****   

          room_type  price  mini
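<span style="font-size:1.5em;">A more compact check than printing every group is to count the rows per group. This is a sketch on a hypothetical five-row table (`toy` and its six price ranges are assumptions); `observed=False` is passed explicitly so that empty categories show up with a count of zero.</span>

```python
import pandas as pd

# Hypothetical toy prices binned into six 50-wide ranges
toy = pd.DataFrame({"price": [20, 45, 120, 130, 260]})
toy_labels = ["{0} - {1}".format(i, i + 49) for i in range(0, 300, 50)]
toy["price_range"] = pd.cut(toy.price, range(0, 301, 50), right=False, labels=toy_labels)

# One row count per price range, including ranges with no rows
sizes = toy.groupby("price_range", observed=False).size()
print(sizes)

# Only the ranges that actually contain rows
print(sizes[sizes > 0])
```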

In [None]:
#g.groups
#for value in g.groups.items():
 #   print(value)


<span style="font-size:1.5em;">Now we will collect the groups by price range, and also make sure that empty price ranges do not get appended to the new dataframe. We will use an ordered dictionary to keep the groups in ascending order of price range. The resulting dataframe is '**generalDF**'.</span>

In [13]:
#we import ordered dictionary 
from collections import OrderedDict

myarr = OrderedDict()


In [14]:
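# g.groups maps each price-range label to the index labels of its rows,
# so the rows are looked up with .loc (labels) rather than .iloc (positions)
for key, value in g.groups.items():
    newdf = air_db.loc[list(value)]
    # Skip price ranges that contain no listings
    if len(newdf) == 0:
        continue
    myarr[key] = newdf
# Test: show one group
myarr['100 - 149']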


Unnamed: 0,neighbourhood_group,neighbourhood,price_range,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,Brooklyn,Kensington,100 - 149,40.****,-73.****,Private room,149,1,9,2018-10-19,0.21,6,365
10,Manhattan,Upper West Side,100 - 149,40.****,-73.****,Entire home/apt,135,5,53,2019-06-22,0.43,1,6
14,Manhattan,West Village,100 - 149,40.****,-74.****,Entire home/apt,120,90,27,2018-10-31,0.22,1,0
15,Brooklyn,Williamsburg,100 - 149,40.****,-73.****,Entire home/apt,140,2,148,2019-06-29,1.20,1,46
17,Manhattan,Chelsea,100 - 149,40.****,-73.****,Private room,140,1,260,2019-07-01,2.12,1,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...
48875,Manhattan,East Harlem,100 - 149,40.****,-73.****,Private room,140,1,0,,,1,180
48879,Brooklyn,Williamsburg,100 - 149,40.****,-73.****,Entire home/apt,120,20,0,,,1,22
48880,Brooklyn,Williamsburg,100 - 149,40.****,-73.****,Entire home/apt,120,1,0,,,3,365
48888,Manhattan,Hell's Kitchen,100 - 149,40.****,-73.****,Private room,125,4,0,,,1,31


In [15]:
print(len(list(g.groups.keys())))

print(len(list(myarr.keys())))


200
80


In [None]:
myarr.keys()


In [16]:
# Concatenate the non-empty groups in ascending price-range order
generalDF = pd.concat(list(myarr.values()))
        
    

In [17]:
generalDF


Unnamed: 0,neighbourhood_group,neighbourhood,price_range,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
28,Manhattan,Inwood,0 - 49,40.****,-73.****,Private room,44,3,108,2019-06-15,1.11,3,311
36,Brooklyn,Bedford-Stuyvesant,0 - 49,40.****,-73.****,Private room,35,60,0,,,1,365
39,Manhattan,Lower East Side,0 - 49,40.****,-73.****,Shared room,40,1,214,2019-07-05,1.81,4,188
58,Brooklyn,Greenpoint,0 - 49,40.****,-73.****,Private room,49,4,138,2019-06-04,1.19,3,320
149,Brooklyn,Fort Greene,0 - 49,40.****,-73.****,Private room,44,8,27,2019-06-29,1.05,5,280
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4377,Brooklyn,Clinton Hill,8000 - 8049,40.****,-73.****,Entire home/apt,8000,1,1,2016-09-15,0.03,11,365
30268,Manhattan,Tribeca,8500 - 8549,40.****,-74.****,Entire home/apt,8500,30,2,2018-09-18,0.18,1,251
6530,Manhattan,East Harlem,9950 - 9999,40.****,-73.****,Entire home/apt,9999,5,1,2015-01-02,0.02,1,0
12342,Manhattan,Lower East Side,9950 - 9999,40.****,-73.****,Private room,9999,99,6,2016-01-01,0.14,1,83
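<span style="font-size:1.5em;">As a hedged alternative to the dictionary and concatenation above, the same ordering can be sketched with a stable sort on the categorical column. The `toy` table is an assumption for illustration; this is not the notebook's method, only a shorter route to a frame ordered by price range with unbinned rows removed.</span>

```python
import pandas as pd

# Hypothetical toy prices; 300 falls outside the last bin and becomes NaN
toy = pd.DataFrame({"price": [120, 20, 300, 45, 130]})
toy_labels = ["{0} - {1}".format(i, i + 49) for i in range(0, 300, 50)]
toy["price_range"] = pd.cut(toy.price, range(0, 301, 50), right=False, labels=toy_labels)

# Drop rows without a price range, then sort by the category order.
# mergesort is stable, so rows keep their original order within a range.
ordered = toy.dropna(subset=["price_range"]).sort_values("price_range", kind="mergesort")
print(ordered)
```

<span style="font-size:1.5em;">Sorting a categorical column follows the order of its categories, not the alphabetical order of the labels, which is why '50 - 99' would come before '100 - 149'.</span>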


In [19]:
# Interquartile range to understand the distribution of price
Q1 = generalDF['price'].quantile(0.25)
Q3 = generalDF['price'].quantile(0.75)
IQR = Q3 - Q1
print(Q1)
print(Q3)
print(IQR)

69.0
175.0
106.0
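<span style="font-size:1.5em;">One way to read these three numbers is Tukey's rule of thumb, which treats values above Q3 + 1.5 × IQR as unusually high. That rule is an assumption added here and is not part of the original analysis; the quartiles are copied from the output above.</span>

```python
# Quartiles copied from the printed output above
q1 = 69.0
q3 = 175.0
iqr = q3 - q1

# Tukey's upper fence (an added rule of thumb, not the notebook's method)
upper_fence = q3 + 1.5 * iqr

print(iqr)          # 106.0
print(upper_fence)  # 334.0
```

<span style="font-size:1.5em;">Under that rule a price of 1000 is already about three times the fence, so listings at or above it sit deep in the tail of the distribution.</span>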


<span style="font-size:1.5em;"> 
Observing the data frame above, we can see that there are only 
three properties in the '9950 - 9999' price range. A group that small would cap k at 3, which is not good for anonymization.
The quartiles above (Q1 = 69, Q3 = 175) also show that most of the prices in this table are far below $1000. Hence we will 
split generalDF into two dataframes:

* subsetdataframe1 (for price < 1000 dollars)
* subsetdataframe2 (for price > 999 dollars)

</span>

In [21]:
#using the key array we split the Dataframe for price < 1000
subsetdataframe1 = generalDF[generalDF['price_range'].isin(['0 - 49', '50 - 99', '100 - 149', '150 - 199', '200 - 249', '250 - 299', '300 - 349', '350 - 399', '400 - 449', '450 - 499', '500 - 549', '550 - 599', '600 - 649', '650 - 699', '700 - 749', '750 - 799', '800 - 849', '850 - 899', '900 - 949', '950 - 999'])]
subsetdataframe1

Unnamed: 0,neighbourhood_group,neighbourhood,price_range,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
28,Manhattan,Inwood,0 - 49,40.****,-73.****,Private room,44,3,108,2019-06-15,1.11,3,311
36,Brooklyn,Bedford-Stuyvesant,0 - 49,40.****,-73.****,Private room,35,60,0,,,1,365
39,Manhattan,Lower East Side,0 - 49,40.****,-73.****,Shared room,40,1,214,2019-07-05,1.81,4,188
58,Brooklyn,Greenpoint,0 - 49,40.****,-73.****,Private room,49,4,138,2019-06-04,1.19,3,320
149,Brooklyn,Fort Greene,0 - 49,40.****,-73.****,Private room,44,8,27,2019-06-29,1.05,5,280
...,...,...,...,...,...,...,...,...,...,...,...,...,...
46291,Manhattan,West Village,950 - 999,40.****,-74.****,Entire home/apt,995,1,3,2019-07-05,3.00,1,344
46438,Manhattan,Flatiron District,950 - 999,40.****,-73.****,Entire home/apt,950,1,0,,,1,365
46596,Manhattan,West Village,950 - 999,40.****,-74.****,Private room,999,1,2,2019-07-05,2.00,5,321
46965,Manhattan,Civic Center,950 - 999,40.****,-74.****,Entire home/apt,950,3,0,,,1,69


In [22]:
#using the key array we split the Dataframe for price > 999
subsetdataframe2 = generalDF[generalDF['price_range'].isin(['1000 - 1049', '1050 - 1099', '1100 - 1149', '1150 - 1199', '1200 - 1249', '1250 - 1299', '1300 - 1349', '1350 - 1399', '1400 - 1449', '1450 - 1499', '1500 - 1549', '1550 - 1599', '1600 - 1649', '1650 - 1699', '1700 - 1749', '1750 - 1799', '1800 - 1849', '1850 - 1899', '1900 - 1949', '1950 - 1999', '2000 - 2049', '2100 - 2149', '2200 - 2249', '2250 - 2299', '2300 - 2349', '2350 - 2399', '2400 - 2449', '2500 - 2549', '2550 - 2599', '2600 - 2649', '2650 - 2699', '2750 - 2799', '2800 - 2849', '2850 - 2899', '2900 - 2949', '2950 - 2999', '3000 - 3049', '3200 - 3249', '3500 - 3549', '3600 - 3649', '3750 - 3799', '3800 - 3849', '3900 - 3949', '4000 - 4049', '4100 - 4149', '4150 - 4199', '4200 - 4249', '4500 - 4549', '5000 - 5049', '5100 - 5149', '5250 - 5299', '6000 - 6049', '6400 - 6449', '6500 - 6549', '6800 - 6849', '7500 - 7549', '7700 - 7749', '8000 - 8049', '8500 - 8549', '9950 - 9999'])]
subsetdataframe2

Unnamed: 0,neighbourhood_group,neighbourhood,price_range,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
1414,Manhattan,Upper West Side,1000 - 1049,40.****,-73.****,Entire home/apt,1000,30,44,2015-09-28,0.53,11,364
2215,Manhattan,Tribeca,1000 - 1049,40.****,-74.****,Entire home/apt,1000,1,25,2019-06-22,0.36,3,37
2355,Manhattan,Upper West Side,1000 - 1049,40.****,-73.****,Entire home/apt,1000,30,24,2016-01-27,0.33,11,364
2386,Manhattan,Hell's Kitchen,1000 - 1049,40.****,-73.****,Entire home/apt,1000,1,0,,,1,365
3345,Manhattan,Upper West Side,1000 - 1049,40.****,-73.****,Entire home/apt,1000,1,0,,,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4377,Brooklyn,Clinton Hill,8000 - 8049,40.****,-73.****,Entire home/apt,8000,1,1,2016-09-15,0.03,11,365
30268,Manhattan,Tribeca,8500 - 8549,40.****,-74.****,Entire home/apt,8500,30,2,2018-09-18,0.18,1,251
6530,Manhattan,East Harlem,9950 - 9999,40.****,-73.****,Entire home/apt,9999,5,1,2015-01-02,0.02,1,0
12342,Manhattan,Lower East Side,9950 - 9999,40.****,-73.****,Private room,9999,99,6,2016-01-01,0.14,1,83
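<span style="font-size:1.5em;">Because the numeric price column is still in the table at this point, the same split can also be sketched with a threshold on price instead of two long lists of labels. The `toy` rows below are made up for illustration, and this is an alternative sketch, not the cells actually run above.</span>

```python
import pandas as pd

# Hypothetical rows with both the numeric price and its range label
toy = pd.DataFrame({
    "price": [44, 995, 1000, 9999],
    "price_range": ["0 - 49", "950 - 999", "1000 - 1049", "9950 - 9999"],
})

# Boolean filters on the numeric column split the table in two
below_1000 = toy[toy["price"] < 1000]
from_1000 = toy[toy["price"] >= 1000]

print(below_1000)
print(from_1000)
```

<span style="font-size:1.5em;">The two filters are complementary, so every row lands in exactly one of the two frames.</span>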


<span style="font-size:1.5em;"> For subsetdataframe2 we want to replace every value in price_range with ">=1000". This is slightly tricky, because the numeric comparison I passed to ".mask" before needs a numeric column such as int64, and price_range is a categorical column of labels. So instead of doing this operation on price_range, I did it on the price field (remember that I have not removed price from the tables yet). I then dropped price_range from the table and inserted a copy of the modified 'price' column at the same position, under the same name.</span>
    
 

In [23]:
# Work on an explicit copy so the assignment below does not write into a slice of generalDF
subsetdataframe2 = subsetdataframe2.copy()
price = subsetdataframe2['price']
subsetdataframe2['price'] = subsetdataframe2['price'].mask(price > 999, ">=1000")


In [24]:
subsetdataframe2


Unnamed: 0,neighbourhood_group,neighbourhood,price_range,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
1414,Manhattan,Upper West Side,1000 - 1049,40.****,-73.****,Entire home/apt,>=1000,30,44,2015-09-28,0.53,11,364
2215,Manhattan,Tribeca,1000 - 1049,40.****,-74.****,Entire home/apt,>=1000,1,25,2019-06-22,0.36,3,37
2355,Manhattan,Upper West Side,1000 - 1049,40.****,-73.****,Entire home/apt,>=1000,30,24,2016-01-27,0.33,11,364
2386,Manhattan,Hell's Kitchen,1000 - 1049,40.****,-73.****,Entire home/apt,>=1000,1,0,,,1,365
3345,Manhattan,Upper West Side,1000 - 1049,40.****,-73.****,Entire home/apt,>=1000,1,0,,,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4377,Brooklyn,Clinton Hill,8000 - 8049,40.****,-73.****,Entire home/apt,>=1000,1,1,2016-09-15,0.03,11,365
30268,Manhattan,Tribeca,8500 - 8549,40.****,-74.****,Entire home/apt,>=1000,30,2,2018-09-18,0.18,1,251
6530,Manhattan,East Harlem,9950 - 9999,40.****,-73.****,Entire home/apt,>=1000,5,1,2015-01-02,0.02,1,0
12342,Manhattan,Lower East Side,9950 - 9999,40.****,-73.****,Private room,>=1000,99,6,2016-01-01,0.14,1,83


In [25]:
subsetdataframe2 = subsetdataframe2.drop("price_range", axis=1)
subsetdataframe2


Unnamed: 0,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
1414,Manhattan,Upper West Side,40.****,-73.****,Entire home/apt,>=1000,30,44,2015-09-28,0.53,11,364
2215,Manhattan,Tribeca,40.****,-74.****,Entire home/apt,>=1000,1,25,2019-06-22,0.36,3,37
2355,Manhattan,Upper West Side,40.****,-73.****,Entire home/apt,>=1000,30,24,2016-01-27,0.33,11,364
2386,Manhattan,Hell's Kitchen,40.****,-73.****,Entire home/apt,>=1000,1,0,,,1,365
3345,Manhattan,Upper West Side,40.****,-73.****,Entire home/apt,>=1000,1,0,,,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
4377,Brooklyn,Clinton Hill,40.****,-73.****,Entire home/apt,>=1000,1,1,2016-09-15,0.03,11,365
30268,Manhattan,Tribeca,40.****,-74.****,Entire home/apt,>=1000,30,2,2018-09-18,0.18,1,251
6530,Manhattan,East Harlem,40.****,-73.****,Entire home/apt,>=1000,5,1,2015-01-02,0.02,1,0
12342,Manhattan,Lower East Side,40.****,-73.****,Private room,>=1000,99,6,2016-01-01,0.14,1,83


In [26]:
price_range_2 = subsetdataframe2['price']

# Put the masked values back at position 2 under the name price_range
subsetdataframe2.insert(2, "price_range", price_range_2)
subsetdataframe2.head()



Unnamed: 0,neighbourhood_group,neighbourhood,price_range,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
1414,Manhattan,Upper West Side,>=1000,40.****,-73.****,Entire home/apt,>=1000,30,44,2015-09-28,0.53,11,364
2215,Manhattan,Tribeca,>=1000,40.****,-74.****,Entire home/apt,>=1000,1,25,2019-06-22,0.36,3,37
2355,Manhattan,Upper West Side,>=1000,40.****,-73.****,Entire home/apt,>=1000,30,24,2016-01-27,0.33,11,364
2386,Manhattan,Hell's Kitchen,>=1000,40.****,-73.****,Entire home/apt,>=1000,1,0,,,1,365
3345,Manhattan,Upper West Side,>=1000,40.****,-73.****,Entire home/apt,>=1000,1,0,,,1,0
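<span style="font-size:1.5em;">For comparison, here is a sketch of relabelling price_range directly, without the drop and re-insert. It assumes the column is categorical, as `pd.cut` produces, and converts it to plain Python objects first so that a label outside the existing categories can be written. The `toy` frame is hypothetical.</span>

```python
import pandas as pd

# Hypothetical rows: a numeric price plus a categorical range label
toy = pd.DataFrame({"price": [44, 1000, 9999]})
toy["price_range"] = pd.Categorical(["0 - 49", "1000 - 1049", "9950 - 9999"])

# Convert the categories to plain objects, then overwrite the top ranges
# using a condition evaluated on the numeric price column
toy["price_range"] = toy["price_range"].astype(object).mask(toy["price"] > 999, ">=1000")

print(toy)
```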


<span style="font-size:1.5em;">Now we will drop price from both subset dataframes and join them together to get the nearly anonymized dataset.  </span> 

In [27]:
subsetdataframe1 = subsetdataframe1.drop("price", axis=1)
subsetdataframe2 = subsetdataframe2.drop("price", axis=1)


In [28]:
nearly_anon_db = pd.concat([subsetdataframe1,subsetdataframe2])
nearly_anon_db


Unnamed: 0,neighbourhood_group,neighbourhood,price_range,latitude,longitude,room_type,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
28,Manhattan,Inwood,0 - 49,40.****,-73.****,Private room,3,108,2019-06-15,1.11,3,311
36,Brooklyn,Bedford-Stuyvesant,0 - 49,40.****,-73.****,Private room,60,0,,,1,365
39,Manhattan,Lower East Side,0 - 49,40.****,-73.****,Shared room,1,214,2019-07-05,1.81,4,188
58,Brooklyn,Greenpoint,0 - 49,40.****,-73.****,Private room,4,138,2019-06-04,1.19,3,320
149,Brooklyn,Fort Greene,0 - 49,40.****,-73.****,Private room,8,27,2019-06-29,1.05,5,280
...,...,...,...,...,...,...,...,...,...,...,...,...
4377,Brooklyn,Clinton Hill,>=1000,40.****,-73.****,Entire home/apt,1,1,2016-09-15,0.03,11,365
30268,Manhattan,Tribeca,>=1000,40.****,-74.****,Entire home/apt,30,2,2018-09-18,0.18,1,251
6530,Manhattan,East Harlem,>=1000,40.****,-73.****,Entire home/apt,5,1,2015-01-02,0.02,1,0
12342,Manhattan,Lower East Side,>=1000,40.****,-73.****,Private room,99,6,2016-01-01,0.14,1,83


<span style="font-size:1.5em;"> For the sensitive column **calculated_host_listings_count**, we will mask all the rows, as it can be used to identify individuals. This will be the last step, after which we will have a k-anonymized dataset grouped by price range.  </span>

In [29]:
calculated_host_listings_count = nearly_anon_db['calculated_host_listings_count']
nearly_anon_db['calculated_host_listings_count'] = nearly_anon_db['calculated_host_listings_count'].mask(calculated_host_listings_count>-1,"***")

In [31]:
K_Anon_DB = nearly_anon_db
K_Anon_DB

Unnamed: 0,neighbourhood_group,neighbourhood,price_range,latitude,longitude,room_type,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
28,Manhattan,Inwood,0 - 49,40.****,-73.****,Private room,3,108,2019-06-15,1.11,***,311
36,Brooklyn,Bedford-Stuyvesant,0 - 49,40.****,-73.****,Private room,60,0,,,***,365
39,Manhattan,Lower East Side,0 - 49,40.****,-73.****,Shared room,1,214,2019-07-05,1.81,***,188
58,Brooklyn,Greenpoint,0 - 49,40.****,-73.****,Private room,4,138,2019-06-04,1.19,***,320
149,Brooklyn,Fort Greene,0 - 49,40.****,-73.****,Private room,8,27,2019-06-29,1.05,***,280
...,...,...,...,...,...,...,...,...,...,...,...,...
4377,Brooklyn,Clinton Hill,>=1000,40.****,-73.****,Entire home/apt,1,1,2016-09-15,0.03,***,365
30268,Manhattan,Tribeca,>=1000,40.****,-74.****,Entire home/apt,30,2,2018-09-18,0.18,***,251
6530,Manhattan,East Harlem,>=1000,40.****,-73.****,Entire home/apt,5,1,2015-01-02,0.02,***,0
12342,Manhattan,Lower East Side,>=1000,40.****,-73.****,Private room,99,6,2016-01-01,0.14,***,83
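<span style="font-size:1.5em;">To see which k a table actually reaches, one can count how many rows share each combination of quasi-identifier values and take the smallest count. The sketch below does this on a hypothetical five-row table; the choice of `quasi` columns follows the quasi-identifiers named earlier, and the rows themselves are made up.</span>

```python
import pandas as pd

# Hypothetical anonymized rows, not taken from the real output
toy = pd.DataFrame({
    "price_range": ["0 - 49", "0 - 49", ">=1000", ">=1000", ">=1000"],
    "latitude": ["40.****"] * 5,
    "longitude": ["-73.****", "-73.****", "-73.****", "-73.****", "-74.****"],
})

# Columns treated as quasi-identifiers in this notebook
quasi = ["price_range", "latitude", "longitude"]

# Size of every equivalence class (rows sharing all quasi-identifier values)
class_sizes = toy.groupby(quasi).size()

# The table is k-anonymous for k equal to the smallest class
k = int(class_sizes.min())
print(class_sizes)
print("k =", k)
```

<span style="font-size:1.5em;">In this toy table the single '>=1000' row at longitude '-74.\*\*\*\*' forms a class of its own, so k is 1 even though the other classes hold two rows each.</span>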


<span style="font-size:1.5em;"> Finally, we will export the anonymized dataframe to a '.csv' file. </span>

In [32]:
K_Anon_DB.to_csv("/Users/arko/Desktop/K_Anon_DB.csv")
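<span style="font-size:1.5em;">One caveat worth sketching: by default `to_csv` also writes the DataFrame index, which here is the row number from the original file and could be used to match rows back to it. The example below is a hedged variant on a hypothetical two-row table that shuffles the rows and leaves the index out; it writes to an in-memory buffer instead of a file path.</span>

```python
import io
import pandas as pd

# Hypothetical anonymized rows that still carry original row numbers as index
toy = pd.DataFrame({"price_range": ["0 - 49", ">=1000"]}, index=[28, 12342])

# Shuffle the rows, then write without the index column
buffer = io.StringIO()
toy.sample(frac=1, random_state=0).to_csv(buffer, index=False)

csv_text = buffer.getvalue()
print(csv_text)
```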
