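## Introduction

This dataset describes the Airbnb listings that were online in New York City in 2019. Airbnb is a community marketplace for travel accommodation: users can publish and search holiday rental listings, and book them online, through the website or the mobile app.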
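Variables:
- id: listing ID
- name: listing name
- host_id: host ID
- host_name: host name
- neighbourhood_group: area (borough, such as Brooklyn or Manhattan)
- neighbourhood: neighbourhood
- latitude: latitude coordinate
- longitude: longitude coordinate
- room_type: room type
- price: price (USD)
- minimum_nights: minimum number of nights per booking
- number_of_reviews: number of reviews
- last_review: date of the most recent review
- reviews_per_month: number of reviews per month
- calculated_host_listings_count: number of listings the host has posted
- availability_365: number of days in the year the listing is available for booking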

## Loading the data

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
original_data=pd.read_csv("F:/数据实战/airbnb_NYC_2019.csv")

In [4]:
original_data.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


## Assessing the data

### Assessing tidiness

In [5]:
original_data.sample(10)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
3282,1958765,Stylish 1 BR Loft Apt Williamsburg,10123248,Sylvana,Brooklyn,Williamsburg,40.71688,-73.94584,Entire home/apt,250,2,88,2019-05-28,1.31,1,223
15243,12191010,Chic Spacious Beachfront House,65482154,Antoinette,Queens,Arverne,40.58734,-73.79971,Entire home/apt,299,3,4,2019-05-27,0.31,1,318
18709,14808890,Comfortable Twin Size Bed Near Airport,51550484,Zhur,Queens,Richmond Hill,40.69416,-73.84698,Private room,47,2,25,2019-06-10,1.13,2,152
10875,8388278,Bright Williamsburg Apartment,25370219,Sacha,Brooklyn,Williamsburg,40.71495,-73.96321,Private room,65,2,12,2017-01-31,0.26,1,0
42667,33123667,Perfect Place to just Sleep while you explore NYC,56349939,Oliver,Manhattan,Lower East Side,40.71728,-73.99085,Private room,84,2,18,2019-06-29,5.35,1,21
2856,1617443,Great Artistic Studio in Historic Building,2240143,Pavel,Manhattan,Harlem,40.81606,-73.94363,Shared room,85,3,38,2019-05-27,0.56,1,157
36339,28913324,6 Guests LUXURIOUS MANHATTAN Condo With ROOFTO...,96986507,Jason,Manhattan,Harlem,40.80702,-73.95334,Entire home/apt,270,2,12,2019-06-11,1.34,1,365
39410,30722712,Amazing 1 BR on Gramercy (Min 30 Days),178224519,Lisa,Manhattan,Kips Bay,40.74002,-73.97914,Entire home/apt,155,30,0,,,8,365
36911,29338559,Cozy Room for Rent very close to Manhattan,4666670,Jeanny,Queens,Astoria,40.76902,-73.91856,Private room,49,5,9,2019-06-12,1.05,4,318
29228,22434082,Cozy West Village Apartment,164318938,Noni,Manhattan,West Village,40.73248,-74.00053,Entire home/apt,225,1,29,2019-06-23,1.72,1,14


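In the sampled rows, each row is one observation, each column is one variable, and each cell holds a single value, so the data has no structural problems.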


### Assessing cleanliness

In [6]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

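The output shows 48,895 observations in total. The columns `name`, `host_name`, `last_review`, and `reviews_per_month` have missing values.
The columns `id`, `host_id`, `latitude`, and `longitude` should be stored as strings, so their data types need to be converted.
The column `last_review` should be stored as a date, so it also needs a type conversion.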
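The type conversions described above could look roughly like the following. This is a minimal sketch on a hypothetical three-row frame named `toy` (not the real dataset), and it only demonstrates the identifier and date columns.

```python
import pandas as pd

# Toy rows that mimic a few columns of the dataset (illustrative values only).
toy = pd.DataFrame({
    "id": [2539, 2595, 3647],
    "host_id": [2787, 2845, 4632],
    "last_review": ["2018-10-19", "2019-05-21", None],
})

# Identifiers are labels rather than quantities, so store them as strings.
toy["id"] = toy["id"].astype(str)
toy["host_id"] = toy["host_id"].astype(str)

# Parse the review date; missing entries become NaT.
toy["last_review"] = pd.to_datetime(toy["last_review"])

print(toy.dtypes)
```

The same `astype(str)` call would apply to any other column that should be treated as a label.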


### Assessing missing data

Since `last_review` has missing values, we extract the observations where it is missing.

In [7]:
original_data[original_data['last_review'].isnull()]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365
19,7750,Huge 2 BR Upper East Cental Park,17985,Sing,Manhattan,East Harlem,40.79685,-73.94872,Entire home/apt,190,7,0,,,2,249
26,8700,Magnifique Suite au N de Manhattan - vue Cloitres,26394,Claude & Sophie,Manhattan,Inwood,40.86754,-73.92639,Private room,80,4,0,,,1,0
36,11452,Clean and Quiet in Brooklyn,7355,Vt,Brooklyn,Bedford-Stuyvesant,40.68876,-73.94312,Private room,35,60,0,,,1,365
38,11943,Country space in the city,45445,Harriet,Brooklyn,Flatbush,40.63702,-73.96327,Private room,150,1,0,,,1,365
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2


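Our guess is that every observation missing `last_review` is also missing `reviews_per_month`.
To verify this, we add a second filter condition and look for rows where `last_review` is missing but `reviews_per_month` is not NaN.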

In [8]:
original_data[(original_data['last_review'].isnull())&(~original_data['reviews_per_month'].isnull())]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365


The filter returns no rows, so every observation missing `last_review` is also missing `reviews_per_month`. For these observations we can therefore set `reviews_per_month` to 0.
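One way to apply that rule is a boolean mask with `.loc`. The sketch below uses a hypothetical four-row frame named `toy`; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: two listings with reviews, two without (illustrative values only).
toy = pd.DataFrame({
    "number_of_reviews": [9, 0, 45, 0],
    "last_review": ["2018-10-19", None, "2019-05-21", None],
    "reviews_per_month": [0.21, np.nan, 0.38, np.nan],
})

# Rows with no recorded review date.
no_review = toy["last_review"].isnull()

# Set reviews_per_month to 0 on those rows only; other rows are untouched.
toy.loc[no_review, "reviews_per_month"] = 0

print(toy)
```

Masking on `last_review` rather than calling `fillna(0)` on the whole column keeps the assignment tied to the condition that was actually verified.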

Next, we extract the observations where `name` is missing.

In [9]:
original_data[original_data['name'].isnull()]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
2854,1615764,,6676776,Peter,Manhattan,Battery Park City,40.71239,-74.0162,Entire home/apt,400,1000,0,,,1,362
3703,2232600,,11395220,Anna,Manhattan,East Village,40.73215,-73.98821,Entire home/apt,200,1,28,2015-06-08,0.45,1,341
5775,4209595,,20700823,Jesse,Manhattan,Greenwich Village,40.73473,-73.99244,Entire home/apt,225,1,1,2015-01-01,0.02,1,0
5975,4370230,,22686810,Michaël,Manhattan,Nolita,40.72046,-73.9955,Entire home/apt,215,7,5,2016-01-02,0.09,1,0
6269,4581788,,21600904,Lucie,Brooklyn,Williamsburg,40.7137,-73.94378,Private room,150,1,0,,,1,0
6567,4756856,,1832442,Carolina,Brooklyn,Bushwick,40.70046,-73.92825,Private room,70,1,0,,,1,0
6605,4774658,,24625694,Josh,Manhattan,Washington Heights,40.85198,-73.93108,Private room,40,1,0,,,1,0
8841,6782407,,31147528,Huei-Yin,Brooklyn,Williamsburg,40.71354,-73.93882,Private room,45,1,0,,,1,0
11963,9325951,,33377685,Jonathan,Manhattan,Hell's Kitchen,40.76436,-73.98573,Entire home/apt,190,4,1,2016-01-05,0.02,1,0
12824,9787590,,50448556,Miguel,Manhattan,Harlem,40.80316,-73.95189,Entire home/apt,300,5,0,,,5,0


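Observations that lack a `name` also either lack `reviews_per_month` or have a low value for it. A listing without a name is far less attractive and is unlikely to see repeat guests, so we can delete these observations.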
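Dropping the unnamed listings can be sketched with `dropna(subset=...)`. The frame `toy` and the result name `kept` are hypothetical and only stand in for the real data.

```python
import pandas as pd

# Toy data: two listings without a name, one with a name.
toy = pd.DataFrame({
    "id": [1615764, 2232600, 2539],
    "name": [None, None, "Clean & quiet apt home by the park"],
    "price": [400, 200, 149],
})

# Drop rows whose name is missing; missing values in other columns are ignored.
kept = toy.dropna(subset=["name"]).reset_index(drop=True)

print(len(toy), "->", len(kept))
```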

Next, we extract the observations where `host_name` is missing.

In [10]:
original_data[original_data['host_name'].isnull()]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
360,100184,Bienvenue,526653,,Queens,Queens Village,40.72413,-73.76133,Private room,50,1,43,2019-07-08,0.45,1,88
2700,1449546,Cozy Studio in Flatbush,7779204,,Brooklyn,Flatbush,40.64965,-73.96154,Entire home/apt,100,30,49,2017-01-02,0.69,1,342
5745,4183989,SPRING in the City!! Zen-Style Tranquil Bedroom,919218,,Manhattan,Harlem,40.80606,-73.95061,Private room,86,3,34,2019-05-23,1.0,1,359
6075,4446862,Charming Room in Prospect Heights!,23077718,,Brooklyn,Crown Heights,40.67512,-73.96146,Private room,50,1,0,,,1,0
6582,4763327,"Luxurious, best location, spa inc'l",24576978,,Brooklyn,Greenpoint,40.72035,-73.95355,Entire home/apt,195,1,1,2015-10-20,0.02,1,0
8163,6292866,Modern Quiet Gem Near All,32722063,,Brooklyn,East Flatbush,40.65263,-73.93215,Entire home/apt,85,2,182,2019-06-19,3.59,2,318
8257,6360224,"Sunny, Private room in Bushwick",33134899,,Brooklyn,Bushwick,40.70146,-73.92792,Private room,37,1,1,2015-07-01,0.02,1,0
8852,6786181,R&S Modern Spacious Hideaway,32722063,,Brooklyn,East Flatbush,40.64345,-73.93643,Entire home/apt,100,2,157,2019-06-19,3.18,2,342
9138,6992973,1 Bedroom in Prime Williamsburg,5162530,,Brooklyn,Williamsburg,40.71838,-73.9563,Entire home/apt,145,1,0,,,1,0
9817,7556587,Sunny Room in Harlem,39608626,,Manhattan,Harlem,40.82929,-73.94182,Private room,28,1,1,2015-08-01,0.02,1,0


Although these observations lack `host_name`, some of them have a relatively high `reviews_per_month`, so we keep them for now.

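### Assessing duplicate data

`id` must be unique, whereas `name` may legitimately repeat across listings. `latitude` and `longitude` must not both repeat together in another row, so such duplicates need to be removed later.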
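Before deleting anything, it may help to count how many duplicates exist. The sketch below, on a hypothetical frame `toy`, counts repeated `id` values and repeated coordinate pairs with `duplicated`, then removes them with `drop_duplicates`.

```python
import pandas as pd

# Toy data: id 2 appears twice, and two coordinate pairs repeat.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "name": ["Cozy room", "Cozy room", "Cozy room", "Loft", "Studio"],
    "latitude": [40.70, 40.71, 40.71, 40.72, 40.72],
    "longitude": [-73.90, -73.91, -73.91, -73.92, -73.92],
})

# duplicated() marks every occurrence after the first as True.
dup_id = toy.duplicated(subset=["id"])
dup_coords = toy.duplicated(subset=["latitude", "longitude"])
print("duplicate ids:", int(dup_id.sum()))
print("duplicate coordinate pairs:", int(dup_coords.sum()))

# Keep the first occurrence of each id, then of each coordinate pair.
deduped = (
    toy.drop_duplicates(subset=["id"])
       .drop_duplicates(subset=["latitude", "longitude"])
)
print(deduped)
```

Passing both coordinate columns in one `subset` treats a row as a duplicate only when latitude and longitude match together, which is the condition stated above.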

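### Assessing inconsistent data

We found no case in these columns where two different names refer to the same thing, so no further analysis is needed here.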
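A quick way to look for inconsistent labels is to list the distinct values of the categorical columns. The following is only a sketch on a hypothetical frame `toy`; the helper names `normalised` and `same_count` are made up for illustration.

```python
import pandas as pd

# Toy data using category labels that appear in the dataset preview.
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan", "Queens"],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
})

# Listing each label with its frequency makes near-duplicate spellings easy to spot.
for column in ["neighbourhood_group", "room_type"]:
    print(toy[column].value_counts())

# If normalising case and whitespace reduces the number of distinct labels,
# some labels differ only in formatting.
normalised = toy["room_type"].str.strip().str.lower()
same_count = normalised.nunique() == toy["room_type"].nunique()
print("labels consistent:", same_count)
```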

### Assessing invalid or erroneous data

In [11]:
original_data.describe()

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,38843.0,48895.0,48895.0
mean,19017140.0,67620010.0,40.728949,-73.95217,152.720687,7.029962,23.274466,1.373221,7.143982,112.781327
std,10983110.0,78610970.0,0.05453,0.046157,240.15417,20.51055,44.550582,1.680442,32.952519,131.622289
min,2539.0,2438.0,40.49979,-74.24442,0.0,1.0,0.0,0.01,1.0,0.0
25%,9471945.0,7822033.0,40.6901,-73.98307,69.0,1.0,1.0,0.19,1.0,0.0
50%,19677280.0,30793820.0,40.72307,-73.95568,106.0,3.0,5.0,0.72,1.0,45.0
75%,29152180.0,107434400.0,40.763115,-73.936275,175.0,5.0,24.0,2.02,2.0,227.0
max,36487240.0,274321300.0,40.91306,-73.71299,10000.0,1250.0,629.0,58.5,327.0,365.0


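- `price` has a minimum of 0, which is not a plausible price; these observations need to be deleted, and prices should be kept as whole numbers with no decimals.
- When `number_of_reviews` is 0, `reviews_per_month` cannot be non-zero; any such observations need to be deleted.
- `availability_365` cannot be 0; these observations are invalid and need to be deleted.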
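The three rules can be written as boolean masks and combined. This is a minimal sketch on a hypothetical frame `toy`; the mask names are illustrative. Note that a missing `reviews_per_month` (NaN) compares as not greater than 0, so such rows are not flagged by the second rule.

```python
import numpy as np
import pandas as pd

# Toy data: row 0 is valid, rows 1-3 each break one rule.
toy = pd.DataFrame({
    "price": [149, 0, 225, 80],
    "number_of_reviews": [9, 0, 0, 12],
    "reviews_per_month": [0.21, np.nan, 0.5, 0.1],
    "availability_365": [365, 120, 30, 0],
})

# Rule 1: a price of 0 is treated as invalid.
zero_price = toy["price"] == 0
# Rule 2: zero reviews overall but a positive monthly rate is contradictory.
contradictory = (toy["number_of_reviews"] == 0) & (toy["reviews_per_month"] > 0)
# Rule 3: a listing that is never available is treated as invalid.
never_available = toy["availability_365"] == 0

invalid = zero_price | contradictory | never_available
valid = toy[~invalid]

print("invalid rows:", int(invalid.sum()))
print(valid)
```

Counting the flagged rows first shows how much data each rule would remove before anything is actually deleted.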


In [None]:
def check(a,b):

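## Cleaning the data

Based on the assessment above, we carry out the following cleaning steps:
- Convert `id`, `host_id`, `latitude`, and `longitude` to strings.
- Convert `last_review` to a date.
- For observations missing `last_review`, set `reviews_per_month` to 0.
- Delete observations missing `name`.
- Delete observations with a duplicated `id`.
- Delete observations where `latitude` and `longitude` are both duplicated together.
- Delete observations where `price` is 0, and keep prices as whole numbers.
- Delete observations where `number_of_reviews` is 0 but `reviews_per_month` is not 0.
- Delete observations where `availability_365` is 0.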



To keep the cleaned data separate from the raw data, we create a new variable, `cleaned_data`, as a copy of `original_data`. All subsequent cleaning steps are applied to `cleaned_data`.
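A minimal sketch of the copy step follows. Here `original_data` is a two-row toy stand-in rather than the real file, and the price edit exists only to show that the two frames are independent.

```python
import pandas as pd

# Toy stand-in for the raw data loaded earlier.
original_data = pd.DataFrame({"id": [2539, 2595], "price": [149, 225]})

# copy() returns an independent DataFrame (a deep copy by default).
cleaned_data = original_data.copy()

# Changing the copy leaves the original untouched.
cleaned_data.loc[0, "price"] = 999

print(original_data.loc[0, "price"], cleaned_data.loc[0, "price"])
```

A plain assignment such as `cleaned_data = original_data` would only create a second name for the same object, so edits would show up in both.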