Duplicate house detection #1

Open
ddio opened this issue Jun 7, 2018 · 2 comments
Labels
data-correctness Anything that can improve quality of data

Comments

ddio commented Jun 7, 2018

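Known duplication scenarios so far

  1. Same poster, identical listing data, but a different price; this is used to keep viewers from discovering the past rent
  2. An owner hands the property to several agents to rent out, so the site ends up with multiple duplicate listings whose data is not necessarily identical

Proposed solution

  1. Starting from statistics and practical experience, define what counts as a "duplicate" listing, then find them
  2. Design a data format suited to expressing duplicate listings and release it separately, so the raw data is preserved and over-interpretation is avoided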
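To make point 2 of the solution more concrete, here is a minimal sketch of what a separately released duplicate file could look like. Everything in it is an assumption for illustration: the record layout, the names `representative_id` and `duplicate_ids`, and the choice of the smallest ID as the representative are hypothetical and not an agreed format. The point is only that the records refer to raw listings by ID, so the raw data itself is never edited.

```python
import json


def build_duplicate_records(groups):
    """Turn groups of raw listing IDs into standalone duplicate records.

    `groups` is an iterable of lists of listing IDs that were judged to be
    the same house. The field names below are hypothetical.
    """
    records = []
    for house_ids in groups:
        ordered = sorted(house_ids)
        if len(ordered) < 2:
            continue  # a listing without duplicates needs no record
        records.append(
            {
                # Assumption: the smallest ID stands for the whole group.
                "representative_id": ordered[0],
                "duplicate_ids": ordered[1:],
            }
        )
    return records


def to_json(records):
    """Serialize the records so they can be published next to the raw data."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

For example, `to_json(build_duplicate_records([[7, 3], [5], [9, 8, 10]]))` would describe two groups and skip listing 5, which has no duplicate.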
ddio added the data-correctness label on Jun 7, 2018
ddio commented Jul 27, 2018

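For now I compare the vast majority of fields and only count listings as duplicates when all of them match exactly, staying on the conservative side to start with.
Later on we will probably need someone who knows statistics or ML to look at what else can be done XD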

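Fields not used for matching

  1. Deal status (because it keeps changing)
  2. Author ID (not enough data yet; it is missing for data before 2018/07)
  3. 591's identity restrictions (still figuring out how to extract nested JSON XD)
  4. Title and description (ignoring them actually seems to help, since it catches some mysterious duplicate listings)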
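The conservative rule described above (exact match on every field except the excluded ones) can be sketched roughly as follows. This is not the code in the repository; the field names in `EXCLUDED_FIELDS` and in the sample data are hypothetical, since the real schema is not shown in this thread.

```python
from collections import defaultdict

# Hypothetical field names standing in for the excluded fields listed above,
# plus the primary key, which always differs between listings.
EXCLUDED_FIELDS = {
    "id",
    "deal_status",
    "author_id",
    "identity_restriction",
    "title",
    "description",
}


def dedup_key(house):
    """Build a hashable key from every field that is not excluded."""
    return tuple(
        sorted(
            (name, repr(value))  # repr() keeps nested values hashable
            for name, value in house.items()
            if name not in EXCLUDED_FIELDS
        )
    )


def group_duplicates(houses):
    """Group listing IDs whose remaining fields match exactly."""
    groups = defaultdict(list)
    for house in houses:
        groups[dedup_key(house)].append(house["id"])
    return list(groups.values())
```

Any group with more than one ID is a set of duplicates under this rule. Because the comparison is exact, a listing that differs in a single compared field ends up in its own group.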

After rerunning the May and June data, roughly 25% of listings are duplicates XD 200,000 -> 150,000

ddio commented Jul 29, 2018

The implementation can be found in tools.export_uniq_house, though the Django pgsql annotate needs to be fixed first XD
