# Pricing for a Home-Sharing Company

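A home-sharing company (similar to Airbnb) wants to set prices for different room requests, with the ultimate goal of maximizing revenue. To do this, it needs an engine that predicts whether a given request will be purchased at a given price. The company has historical data with the following fields: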

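- ID: record identifier
- Region: region the room belongs to (categorical variable, integer from 1 to 10)
- Date: date of the request (integer from 1 to 365; every request is assumed to be for a single day)
- Weekday: day of the week (integer from 1 to 7)
- Apartment/Room: whether the listing is an entire apartment or only a room (0/1 variable)
- #Beds: number of beds (integer from 1 to 4)
- Review: average historical review score (continuous variable, real number between 3 and 5)
- Pic Quality: photo quality score (continuous variable, real number between 0 and 1)
- Price: price used historically (continuous variable, positive real number)
- Accept: whether this request was accepted historically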


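Goal: find the model that best predicts the purchase outcome from the variables and the price. The Training Sheet contains the actual outcomes. The task is to predict the purchase outcomes for the Test Sheet (the Test Sheet also has actual outcomes; they are simply hidden in the data set). The predictions are scored under two separate criteria: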

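- Criterion 1: make a 0/1 prediction for every record in the Test Data (that is, fill in 0 or 1 in the yellow area). The score is the number of correct predictions, denoted S2.
- Criterion 2: predict a purchase probability for every record in the Test Data (that is, fill in a probability in the yellow area). The score is the log-likelihood of the predicted probabilities under the true outcomes. If the predicted probability for record i is pi and the true outcome is Xi (Xi = 1 if the request was purchased, Xi = 0 if it was not), the score is: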

\begin{equation}
S_3 = \sum_{i} \left( X_i \log(p_i) + (1 - X_i) \log(1 - p_i) \right)
\end{equation}
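As a rough illustration of the two criteria, the sketch below counts correct 0/1 predictions and evaluates the log-likelihood for a handful of made-up outcomes. The function names `count_correct` and `log_likelihood`, the clipping constant `eps`, and the toy values are assumptions of this example and are not part of the task statement.

```python
import math

def count_correct(y_true, y_pred):
    """Criterion 1: number of records whose 0/1 prediction matches the outcome."""
    return sum(1 for x, y in zip(y_true, y_pred) if x == y)

def log_likelihood(y_true, probs, eps=1e-15):
    """Criterion 2: sum of X_i*log(p_i) + (1 - X_i)*log(1 - p_i).

    Probabilities are clipped to [eps, 1 - eps] (an assumption of this sketch)
    so that a confident wrong prediction does not evaluate log(0).
    """
    total = 0.0
    for x, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += x * math.log(p) + (1 - x) * math.log(1.0 - p)
    return total

# Hypothetical outcomes and predicted probabilities, for illustration only.
y_true = [1, 0, 0, 1]
probs = [0.8, 0.3, 0.6, 0.4]
y_pred = [1 if p >= 0.5 else 0 for p in probs]

print(count_correct(y_true, y_pred))
print(log_likelihood(y_true, probs))
```

The log-likelihood is never positive, and values closer to 0 indicate better-calibrated probabilities.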




# Exploratory Data Analysis

In [14]:
# Load packages
import numpy as np
import pandas as pd                  
import matplotlib.pyplot as plt
from IPython.display import display
# Load the data
data = pd.read_csv('traintwoRAWDATA.csv')
data=data[:50000]
display(data.head(n=5))

Unnamed: 0,ID,Region,Date,Weekday,ApartmentRoom,Beds,Twobedroom,Review,PicQuality,Price,Accept
0,1,2,1,2,1,1,0,4.906215,0.777532,419.342677,0
1,2,7,1,2,1,4,0,4.927244,0.568539,511.447051,1
2,3,2,1,2,0,1,0,3.448766,0.937857,336.708906,0
3,4,5,1,2,1,1,0,3.797086,0.802913,317.400498,0
4,5,8,1,2,1,1,0,3.024098,0.984053,280.088862,0


In [13]:
# Accept rate
success=len(data['ID'][data['Accept']==1])/50000.00
# Reject rate
abort=1-success
print 'Accept rate',success,'Reject rate',abort

Accept rate 0.2924 Reject rate 0.7076


Rejected requests dominate this data set at just over 70%, while accepted requests make up only about 29%.
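With classes this imbalanced, a useful reference point is the accuracy of always predicting the majority class. The sketch below derives it from the accept rate reported above; `majority_baseline_accuracy` is a hypothetical helper introduced only for this illustration.

```python
def majority_baseline_accuracy(accept_rate):
    """Accuracy of a classifier that always predicts the more frequent class."""
    return max(accept_rate, 1.0 - accept_rate)

accept_rate = 0.2924  # accept rate printed by the cell above
baseline = majority_baseline_accuracy(accept_rate)
print(round(baseline, 4))
```

Any accuracy reported for these 50,000 rows can be read against this reference: roughly 0.71 is what a constant "not accepted" prediction already achieves.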

## Modeling on the Raw Data

A logistic regression serves as the baseline model, with no data cleaning or feature engineering. Its accuracy under 5-fold cross-validation is 0.733.
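To make the evaluation protocol concrete, here is a minimal pure-Python sketch of k-fold cross-validated accuracy. It uses plain contiguous folds, whereas scikit-learn's `cross_val_score` uses stratified folds for classifiers, so this is a simplification. `k_fold_indices`, `cross_val_accuracy`, and the toy majority-class "model" are assumptions of this example.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_accuracy(fit, predict, X, y, k=5):
    """Average held-out accuracy over k folds.

    fit(X_train, y_train) returns a model; predict(model, X_test) returns labels.
    Both are caller-supplied placeholders in this sketch.
    """
    scores = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        correct = sum(1 for i, p in zip(test_idx, preds) if y[i] == p)
        scores.append(correct / float(len(test_idx)))
    return sum(scores) / len(scores)

# Toy "model" that always predicts the majority class of its training fold.
def fit_majority(X_train, y_train):
    return 1 if sum(y_train) * 2 > len(y_train) else 0

def predict_majority(model, X_test):
    return [model] * len(X_test)

X = list(range(10))
y = [0, 0, 0, 1, 0, 0, 1, 0, 0, 1]
mean_acc = cross_val_accuracy(fit_majority, predict_majority, X, y, k=5)
print(mean_acc)
```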

In [18]:
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn import linear_model

accepts = data['Accept']
features = data.drop(['Accept','ID'], axis = 1)

logit = linear_model.LogisticRegression()
scores = cross_val_score(logit,features,accepts,cv=5,scoring='accuracy')

scores.mean()

0.73306661332266465

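The random forest does not perform well on the uncleaned data (0.641, which is below the 0.7076 obtained by always predicting rejection). This may be related to the nominal date variables in the data set not having been processed.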

In [26]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(min_samples_split=200) 
scores = cross_val_score(rf,features,accepts,cv=5,scoring='accuracy')
scores.mean()

0.64056811362272448

## Model 2: Logistic Regression (After Feature Engineering)

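This section rescales the numeric variables to the range [0, 1] with MinMaxScaler and one-hot encodes categorical variables such as Region and Date.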
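The two transformations can be illustrated on a toy frame. The sketch below applies the min-max formula directly in pandas, which is the same arithmetic `MinMaxScaler` performs with its default range, and then expands a category with `pd.get_dummies`. The toy values are made up; only the column names come from the data set.

```python
import pandas as pd

# Toy rows using column names from the data set; the values are made up.
toy = pd.DataFrame({
    'Region': [2, 7, 2],
    'Beds': [1, 4, 2],
    'Price': [300.0, 500.0, 400.0],
})

# Min-max scaling: (x - min) / (max - min) maps each column onto [0, 1].
numerical = ['Beds', 'Price']
col_min = toy[numerical].min()
col_max = toy[numerical].max()
toy[numerical] = (toy[numerical] - col_min) / (col_max - col_min)

# One-hot encoding: cast the category to str so get_dummies expands it
# into one indicator column per distinct value.
toy['Region'] = toy['Region'].astype(str)
encoded = pd.get_dummies(toy)

print(sorted(encoded.columns))
print(encoded['Price'].tolist())
```

Each distinct category value becomes its own column, which is why a variable with many levels (Date has up to 365) widens the feature matrix considerably.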

In [34]:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd 

accepts  = data['Accept']
features = data.drop(['Accept','ID'], axis = 1)

scaler = MinMaxScaler()
numerical = ['Price', 'PicQuality','Review','Beds']
features[numerical] = scaler.fit_transform(features[numerical])

features['Region']= features['Region'].astype('str')
features['Date']= features['Date'].astype('str')
features['Weekday']= features['Weekday'].astype('str')
features = pd.get_dummies(features)

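- After feature engineering, the logistic regression is fitted again.
- Accuracy is essentially unchanged (0.7323, against 0.7331 on the raw data).
- A likely reason for the lack of improvement is that one-hot encoding Date produces a large number of sparse columns, while Date itself is a weak variable that contributes little to the model.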

In [35]:
logit = linear_model.LogisticRegression()
scores = cross_val_score(logit,features,accepts,cv=5,scoring='accuracy')
scores.mean()

0.7323064612922584

- When Date is no longer one-hot encoded (it stays in the features as a plain numeric column), accuracy rises to 0.748.

In [36]:
# Model without one-hot encoding Date (Date stays as a numeric column)
accepts  = data['Accept']
features = data.drop(['Accept','ID'], axis = 1)

scaler = MinMaxScaler()
numerical = ['Price', 'PicQuality','Review','Beds']
features[numerical] = scaler.fit_transform(features[numerical])

features['Region']= features['Region'].astype('str')
#features['Date']= features['Date'].astype('str')
features['Weekday']= features['Weekday'].astype('str')
features = pd.get_dummies(features)

logit = linear_model.LogisticRegression()
scores = cross_val_score(logit,features,accepts,cv=5,scoring='accuracy')
scores.mean()

0.7481496299259851

## Model 3: Logistic Regression (Feature Engineering + Additional Features)

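This section derives additional variables from the existing ones, with the aim of improving the model by enlarging the feature set:
- DayNum: number of records on that day (accepted or not)
- Rate: number of accepted records on that day
- predictedPrice: price obtained by regressing Price on the other variables, i.e. the predicted price for a listing with those features
- difference: (prices - predictedprice)/prices, how far the actual price deviates from the predicted price
- Twobedroom: 1 if the listing is an entire apartment with two or more beds, otherwise 0
- LowReview: 1 if Review is at the lowest score (3), otherwise 0

- In 5-fold cross-validation this model reaches an accuracy of 0.749; a single 80/20 hold-out split gives 0.752.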
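The two per-day aggregates in the list above can be sketched with a pandas `groupby` and `transform`, which broadcasts each day's total back onto its rows without an explicit loop. This is an illustrative alternative on made-up rows, not necessarily how the notebook computes them.

```python
import pandas as pd

# Toy rows; only the columns needed for the two aggregates are included.
toy = pd.DataFrame({
    'Date':   [1, 1, 1, 2, 2],
    'Accept': [0, 1, 1, 0, 1],
})

by_day = toy.groupby('Date')['Accept']
toy['DayNum'] = by_day.transform('size')  # records on that day, accepted or not
toy['Rate'] = by_day.transform('sum')     # accepted records on that day

print(toy['DayNum'].tolist())
print(toy['Rate'].tolist())
```

Because `Rate` is derived from the `Accept` label itself, it may be worth computing it from the training folds only when estimating accuracy, so that the held-out labels do not feed into the features.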

In [41]:
data = pd.read_csv('trainingtwo.csv')
data=data[:50000]

data['DayNum']=0
data['Rate'] =0
for i in range(365):
    day = data['Date']==i+1
    # .loc avoids chained indexing, which triggers SettingWithCopyWarning
    data.loc[day,'DayNum']=day.sum()
    data.loc[day,'Rate']=(day & (data['Accept']==1)).sum()
display(data.head(n=5))


Unnamed: 0,ID,Region,Date,LowDate,HighDate,Weekday,ApartmentRoom,Beds,Review,LowReview,PicQuality,Price,LowPrice,HighPrice,Accept,DayNum,Rate
0,1,2,1,0,0,2,1,1,4.906215,0,0.777532,419.342677,0,0,0,150,41
1,2,7,1,0,0,2,1,4,4.927244,0,0.568539,511.447051,0,1,1,150,41
2,3,2,1,0,0,2,0,1,3.448766,0,0.937857,336.708906,0,0,0,150,41
3,4,5,1,0,0,2,1,1,3.797086,0,0.802913,317.400498,0,0,0,150,41
4,5,8,1,0,0,2,1,1,3.024098,0,0.984053,280.088862,0,0,0,150,41


In [84]:
from sklearn import svm
from pylightgbm.models import GBMClassifier
import os
from sklearn.cross_validation import train_test_split
execpath= "/Users/admin/code/LightGBM/lightgbm"


data = pd.read_csv('trainingtwo.csv')
from sklearn.ensemble import RandomForestClassifier 
data=data[:50000]

prices = data['Price']
features = data.drop(['Accept','ID','Price','Date'], axis = 1)
scaler = MinMaxScaler()
numerical = ['PicQuality','Review','Beds']
features[numerical] = scaler.fit_transform(features[numerical])



Accept = data['Accept']
features = data.drop(['Accept','ID'], axis = 1)
scaler = MinMaxScaler()
numerical = ['Price', 'PicQuality','Review','Beds']
features[numerical] = scaler.fit_transform(features[numerical])

features['Region']= features['Region'].astype('str')
#features['Date']= features['Date'].astype('str')
features['Weekday']= features['Weekday'].astype('str')
features = pd.get_dummies(features)
features = features.drop(['HighPrice','LowPrice'], axis = 1)
features = features.drop(['HighDate'], axis = 1)

display(features.head(n=5))
x_train, x_test, y_train, y_test = train_test_split(features, Accept, test_size=0.2, random_state = 0)

# Fit the logistic regression
logit = linear_model.LogisticRegression(C=2000,solver='newton-cg',penalty='l2')
logit.fit(x_train,y_train)
predictions = logit.predict(x_test)
scores = accuracy_score(y_test, predictions)
print scores

logit = linear_model.LogisticRegression(C=1000,solver='newton-cg',penalty='l2')
scores = cross_val_score(logit,features,Accept,cv=5,scoring='accuracy')
print scores.mean()

rf = RandomForestClassifier() 
rf.fit(x_train,y_train)
predictions = rf.predict(x_test)
scores = accuracy_score(y_test, predictions)
print scores

clf = GBMClassifier(exec_path=execpath, application='binary' ,boosting_type='gbdt',is_unbalance=True, verbose=False)
clf.fit(x_train,y_train)
predictions = clf.predict(x_test)
scores = accuracy_score(y_test, predictions)
print scores



Unnamed: 0,Date,LowDate,ApartmentRoom,Beds,Review,LowReview,PicQuality,Price,Region_1,Region_10,...,Region_7,Region_8,Region_9,Weekday_1,Weekday_2,Weekday_3,Weekday_4,Weekday_5,Weekday_6,Weekday_7
0,1,0,1,0,0.953111,0,0.77507,0.523818,0,0,...,0,0,0,0,1,0,0,0,0,0
1,1,0,1,1,0.963626,0,0.563758,0.665992,0,0,...,1,0,0,0,1,0,0,0,0,0
2,1,0,0,0,0.224384,0,0.937175,0.396262,0,0,...,0,0,0,0,1,0,0,0,0,0
3,1,0,1,0,0.398545,0,0.800733,0.366457,0,0,...,0,0,0,0,1,0,0,0,0,0
4,1,0,1,0,0.012049,0,0.983884,0.308862,0,0,...,0,1,0,0,1,0,0,0,0,0


0.7516
0.7494
0.7214
0.7008


In [90]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.grid_search import GridSearchCV

neigh = KNeighborsClassifier()
params = {'n_neighbors':[3,5,10,20]}
grid = GridSearchCV(neigh, params, scoring='accuracy', n_jobs=1, verbose=1)
grid = grid.fit(x_train, y_train)
print(grid.best_params_)

Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.4s


{'n_neighbors': 20}


[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:    4.7s finished


In [105]:
neigh =KNeighborsClassifier(n_neighbors=500)
neigh.fit(x_train, y_train) 
predictions_test=neigh.predict(x_test)
accuracy_score(y_test, predictions_test)

0.70899999999999996

In [None]:
clf = GBMClassifier(exec_path=execpath, application='binary' ,boosting_type='gbdt',is_unbalance=True, verbose=False)
clf.fit(x_train,y_train)
predictions = clf.predict(x_test)
scores = accuracy_score(y_test, predictions)
print scores

svc = svm.LinearSVC()
svc.fit(x_train,y_train)
svpredictions = svc.predict(x_test)
svscores = accuracy_score(y_test, svpredictions)

sv = svm.SVC(kernel='linear')
sv.fit(x_train,y_train)
svpredictions = sv.predict(x_test)
svscores = accuracy_score(y_test, svpredictions)
print svscores

In [21]:
sv = svm.SVC(C=2,kernel='linear')
sv.fit(x_train,y_train)
svpredictions = sv.predict(x_test)
svscores = accuracy_score(y_test, svpredictions)
print svscores

0.7493


In [19]:
clf = GBMClassifier(exec_path=execpath, application='binary' ,boosting_type='gbdt',is_unbalance=True, verbose=False,)
clf.fit(x_train,y_train)
predictions = clf.predict(x_test)
scores = accuracy_score(y_test, predictions)
print scores

0.7002


In [29]:
rf = RandomForestClassifier(min_samples_split=100) 
rf.fit(x_train,y_train)
predictions = rf.predict(x_test)
scores = accuracy_score(y_test, predictions)
print scores

0.7407


In [33]:
from sklearn.grid_search import GridSearchCV
param_test = {'max_depth':range(3,16,2), 'min_samples_split':range(50,201,20)}

gsearch2 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= 60, min_samples_leaf=20, random_state=10),param_grid = param_test, scoring='accuracy',iid=False, cv=5)

gsearch2.fit(features, Accept)
gsearch2.grid_scores_


In [34]:
gsearch2.grid_scores_

[mean: 0.70162, std: 0.01176, params: {'min_samples_split': 50, 'max_depth': 3},
 mean: 0.70176, std: 0.01148, params: {'min_samples_split': 70, 'max_depth': 3},
 mean: 0.70122, std: 0.01256, params: {'min_samples_split': 90, 'max_depth': 3},
 mean: 0.70122, std: 0.01256, params: {'min_samples_split': 110, 'max_depth': 3},
 mean: 0.70122, std: 0.01256, params: {'min_samples_split': 130, 'max_depth': 3},
 mean: 0.70122, std: 0.01256, params: {'min_samples_split': 150, 'max_depth': 3},
 mean: 0.70122, std: 0.01256, params: {'min_samples_split': 170, 'max_depth': 3},
 mean: 0.70122, std: 0.01256, params: {'min_samples_split': 190, 'max_depth': 3},
 mean: 0.70282, std: 0.01982, params: {'min_samples_split': 50, 'max_depth': 5},
 mean: 0.70224, std: 0.02044, params: {'min_samples_split': 70, 'max_depth': 5},
 mean: 0.70342, std: 0.01882, params: {'min_samples_split': 90, 'max_depth': 5},
 mean: 0.70336, std: 0.01976, params: {'min_samples_split': 110, 'max_depth': 5},
 mean: 0.70246, std: 0

In [None]:
sv = svm.SVC(kernel='linear')
sv.fit(x_train,y_train)
svpredictions = sv.predict(x_test)
svscores = accuracy_score(y_test, svpredictions)
print svscores

In [None]:
param_test1 = {'n_estimators':range(10,71,10)}
gsearch1 = GridSearchCV(estimator = RandomForestClassifier(min_samples_split=100,
                                  min_samples_leaf=20,max_depth=8,max_features='sqrt' ,random_state=10), 
                       param_grid = param_test1, scoring='roc_auc',cv=5)
gsearch1.fit(features, Accept)
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

In [154]:
import xgboost as xgb
data = pd.read_csv('trainingtwo2.csv')
data=data[:49990]
prices = data['Price']
#display(data.head(n=5))
features = data.drop(['Accept','ID','Price','Date','Region','Weekday','Twobedroom','LowReview','ApartmentRoom','L0','L1','L2','L3','L4','L5','L6','L7','L8','L9','L10'], axis = 1)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
#display(features.head(n=5))

acceptdata = pd.read_csv('accept3.csv')
acceptprices = acceptdata['Price']
acceptfeature = acceptdata.drop(['Price','ApartmentRoom','ID'], axis = 1)
#display(acceptfeature.head(n=5))
#display(features.head(n=5))
reg = linear_model.LinearRegression()
reg.fit(acceptfeature, acceptprices)
predictedprice=reg.predict(features)
data['predictedPrice']= predictedprice
data['difference']= (prices - predictedprice)/predictedprice


accepts  = data['Accept']
features = data.drop(['Accept','ID','Date'], axis = 1)
features['Region']= features['Region'].astype('str')
features['Weekday']= features['Weekday'].astype('str')
features = pd.get_dummies(features)

scaler = MinMaxScaler()
numerical =['Beds','PicQuality']
features[numerical] = scaler.fit_transform(features[numerical])
display(features.head(n=5))

#logit = linear_model.LogisticRegression()
#scores = cross_val_score(logit,features,accepts,cv=5,scoring='accuracy')#print scores.mean()

#rf = RandomForestClassifier() 
#rfscores=cross_val_score(rf,features,accepts,cv=5,scoring='accuracy')
#print rfscores.mean()

execpath= "/Users/admin/code/LightGBM/lightgbm"

x_train, x_test, y_train, y_test = train_test_split(features, accepts, test_size=0.2, random_state = 0)
clf = GBMClassifier(exec_path=execpath, boosting_type='dart',is_unbalance=True, verbose=False)
clf.fit(x_train,y_train)
predictions = clf.predict(x_test)
scores = accuracy_score(y_test, predictions)
print scores

clf = GBMClassifier(exec_path=execpath, num_iterations=1000, num_leaves=1024,max_bin=1000,application='binary' ,boosting_type='gbdt',is_unbalance=True, verbose=False)
clf.fit(x_train,y_train)
predictions = clf.predict(x_test)
scores = accuracy_score(y_test, predictions)
print scores


Unnamed: 0,ApartmentRoom,Beds,Twobedroom,Review,LowReview,PicQuality,Price,L0,L1,L2,...,Region_7,Region_8,Region_9,Weekday_1,Weekday_2,Weekday_3,Weekday_4,Weekday_5,Weekday_6,Weekday_7
0,1,0,0,4.906215,0,0.77507,419.342677,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,1,1,0,4.927244,0,0.563758,511.447051,0,0,0,...,1,0,0,0,1,0,0,0,0,0
2,0,0,0,3.448766,0,0.937175,336.708906,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,1,0,0,3.797086,0,0.800733,317.400498,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,1,0,0,3.024098,0,0.983884,280.088862,0,0,0,...,0,1,0,0,1,0,0,0,0,0


0.711142228446
0.722644528906


In [141]:
import xgboost as xgb
xg = xgb.XGBClassifier(n_estimators= 1000, max_depth= 13)
xg.fit(x_train,  y_train)
predictions = xg.predict(x_test)
scores = accuracy_score(y_test, predictions)
print scores

0.727445489098


In [143]:
import xgboost as xgb
# Use a separate name for the model so the xgb module is not shadowed
xg = xgb.XGBClassifier(n_estimators= 2000, max_depth= 8)
xg.fit(x_train,  y_train)
predictions = xg.predict(x_test)
scores = accuracy_score(y_test, predictions)
print scores

0.728045609122


In [112]:
display(acceptdata.head(n=5))

Unnamed: 0,ID,ApartmentRoom,Beds,Review,PicQuality,Price
0,9235,1,4,4.154231,0.897629,480.064853
1,18373,1,2,4.592062,0.508191,411.78732
2,22859,1,4,4.923909,0.89204,533.212973
3,44539,1,3,4.731298,0.695569,265.842812
4,14643,0,1,4.755759,0.105415,247.631903


## Model Output

In this section we fit Model 3 on the full training set (50,000 records) and output the 0/1 predictions and the predicted probabilities of Accept=1.

In [206]:
# Read the training and test sets together
data = pd.read_csv('totaltwo.csv')
data=data[:70000]

# Derived variables
data['DayNum']=0
data['Rate'] =0
for i in range(365):
    day = data['Date']==i+1
    # .loc avoids chained indexing, which triggers SettingWithCopyWarning
    data.loc[day,'DayNum']=day.sum()
    data.loc[day,'Rate']=(day & (data['Accept']==1)).sum()

# Predict the price by regression
prices = data['Price']
features = data.drop(['Accept','ID','Price','Date'], axis = 1)
scaler = MinMaxScaler()
numerical = ['PicQuality','Review','Beds','DayNum','Rate']
features[numerical] = scaler.fit_transform(features[numerical])

reg = linear_model.LinearRegression()
reg.fit(features, prices) 
predictedprice=reg.predict(features)

# Derived variables: predicted price and relative price difference
data['predictedPrice']= predictedprice
data['difference']= (prices - predictedprice)/prices

# Drop the predicted price and keep the difference, to avoid multicollinearity
accept = data['Accept'][:50000]
features = data.drop(['Accept','ID','predictedPrice'], axis = 1)

scaler = MinMaxScaler()
numerical = ['Price', 'PicQuality','Review','Beds','difference','DayNum','Rate']
features[numerical] = scaler.fit_transform(features[numerical])

# One-hot encoding
features['Region']= features['Region'].astype('str')
features['Weekday']= features['Weekday'].astype('str')
features = pd.get_dummies(features)
# Drop redundant variables
features = features.drop(['Date','Twobedroom'], axis = 1)

testfeatures =features[50000:]
features =features[:50000]

# Fit the logistic regression
logit = linear_model.LogisticRegression(C=2000,solver='newton-cg',penalty='l2')
logit.fit(features,accept)
predictions = logit.predict(testfeatures)
preprob = logit.predict_proba(testfeatures)[:,1]


In [209]:
# Write the 0/1 predictions to a csv file
import csv
# 'wb' is the Python 2 csv convention; on Python 3 use open(..., 'w', newline='')
with open('case_two_pred.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    for i in predictions:
        writer.writerow([i])

In [210]:
# Write the predicted probabilities to a csv file
with open('case_two_prob.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    for i in preprob:
        writer.writerow([i])
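For readers on Python 3, where the `csv` module expects a text-mode file opened with `newline=''`, the export could look roughly like the sketch below. It writes both outputs side by side in one file; the helper `write_predictions`, the header names, the output file name, and the sample values are all assumptions of this example, standing in for `predictions` and `preprob`.

```python
import csv
import os
import tempfile

def write_predictions(path, labels, probs):
    """Write one row per test record: 0/1 prediction and probability of Accept=1."""
    # On Python 3 the csv module expects a text-mode file opened with newline=''.
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(['prediction', 'probability'])
        for label, prob in zip(labels, probs):
            writer.writerow([label, prob])

# Hypothetical values standing in for `predictions` and `preprob`.
labels = [0, 1, 0]
probs = [0.12, 0.87, 0.45]

# Write to a temporary directory so the sketch leaves no files behind in the project.
out_path = os.path.join(tempfile.mkdtemp(), 'case_two_combined.csv')
write_predictions(out_path, labels, probs)

# Read the file back to confirm its layout.
with open(out_path, newline='') as handle:
    rows = list(csv.reader(handle))
print(rows)
```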