# [San Francisco Crime Classification](https://www.kaggle.com/c/sf-crime)

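### Files used
* **_input/test.csv_** - Test set
* **_input/train.csv_** - Training set
* **_input/sampleSubmission.csv_** - Submission template for Kaggle

This dataset contains incidents derived from the SFPD Crime Incident Reporting system. The data ranges from 2003-01-01 to 2015-05-13. The training set and test set rotate every week (this seems to mean that each set only has data for every other week): weeks 1, 3, 5, 7, ... belong to the test set, and weeks 2, 4, 6, 8, ... belong to the training set.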
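
The weekly rotation described above can be illustrated with a minimal sketch. It assumes that weeks are consecutive 7-day windows counted from 2003-01-01; the exact week boundaries used by the competition are not stated here, so `START`, `week_index`, and `split_name` are hypothetical names for illustration only.

```python
from datetime import date

# Assumption: week 1 starts on the first day of the data range.
START = date(2003, 1, 1)

def week_index(d: date, start: date = START) -> int:
    """1-based index of the 7-day window that contains d, counted from start."""
    return (d - start).days // 7 + 1

def split_name(d: date) -> str:
    """Odd weeks go to the test set, even weeks to the training set."""
    return "test" if week_index(d) % 2 == 1 else "train"

for d in (date(2003, 1, 1), date(2003, 1, 8), date(2003, 1, 15)):
    print(d, week_index(d), split_name(d))
```

Under this assumption, any model trained on `train.csv` is always predicting for weeks it has never seen, but which sit directly next to weeks it has seen.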

### Data fields
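* **_Dates_** - Timestamp of the crime incident
* **_Category_** - Category of the crime incident (train.csv only). This is the target variable (dependent variable) to predict.
* **_Descript_** - Detailed description of the crime incident (train.csv only)
* **_DayOfWeek_** - Day of the week
* **_PdDistrict_** - Police department district with jurisdiction
* **_Resolution_** - How the crime incident was resolved (train.csv only)
* **_Address_** - Address where the crime incident occurred
* **_X_** - Longitude
* **_Y_** - Latitude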

In [1]:
import pandas as pd

In [2]:
train = pd.read_csv("input/train.csv", parse_dates=["Dates"])
print(train.shape)
train.head()

(878049, 9)


Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [3]:
# Number of distinct crime categories.
# test.csv has no Category column, so only train can be checked.
# print(train["Category"].nunique())

In [4]:
test = pd.read_csv("input/test.csv", parse_dates=["Dates"])
print(test.shape)
test.head()

(884262, 7)


Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [5]:
# Use the hour of day (0-23) from the timestamp as a feature.
test["Dates-hour"] = test["Dates"].dt.hour
train["Dates-hour"] = train["Dates"].dt.hour

## Train

In [6]:
feature_names = ["PdDistrict", "Dates-hour"]
feature_names

['PdDistrict', 'Dates-hour']

In [7]:
X_train = pd.get_dummies(train[feature_names])

print(X_train.shape)
X_train.head()

(878049, 11)


Unnamed: 0,Dates-hour,PdDistrict_BAYVIEW,PdDistrict_CENTRAL,PdDistrict_INGLESIDE,PdDistrict_MISSION,PdDistrict_NORTHERN,PdDistrict_PARK,PdDistrict_RICHMOND,PdDistrict_SOUTHERN,PdDistrict_TARAVAL,PdDistrict_TENDERLOIN
0,23,0,0,0,0,1,0,0,0,0,0
1,23,0,0,0,0,1,0,0,0,0,0
2,23,0,0,0,0,1,0,0,0,0,0
3,23,0,0,0,0,1,0,0,0,0,0
4,23,0,0,0,0,0,1,0,0,0,0


In [8]:
X_test = pd.get_dummies(test[feature_names])

print(X_test.shape)
X_test.head()

(884262, 11)


Unnamed: 0,Dates-hour,PdDistrict_BAYVIEW,PdDistrict_CENTRAL,PdDistrict_INGLESIDE,PdDistrict_MISSION,PdDistrict_NORTHERN,PdDistrict_PARK,PdDistrict_RICHMOND,PdDistrict_SOUTHERN,PdDistrict_TARAVAL,PdDistrict_TENDERLOIN
0,23,1,0,0,0,0,0,0,0,0,0
1,23,1,0,0,0,0,0,0,0,0,0
2,23,0,0,0,0,1,0,0,0,0,0
3,23,0,0,1,0,0,0,0,0,0,0
4,23,0,0,1,0,0,0,0,0,0,0
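
Calling `pd.get_dummies` separately on train and test works here because both sets contain the same ten districts, so both frames end up with the same 11 columns. If one set were missing a district, the column layouts would differ. The following is a minimal sketch of one way to guard against that, using toy frames (`toy_train`, `toy_test` are made up for illustration and are not the real data):

```python
import pandas as pd

# Toy stand-ins for train/test; "TARAVAL" is deliberately absent from the toy test set.
toy_train = pd.DataFrame({"PdDistrict": ["NORTHERN", "PARK", "TARAVAL"],
                          "Dates-hour": [23, 22, 1]})
toy_test = pd.DataFrame({"PdDistrict": ["PARK", "NORTHERN"],
                         "Dates-hour": [5, 6]})

X_toy_train = pd.get_dummies(toy_train, columns=["PdDistrict"])
X_toy_test = pd.get_dummies(toy_test, columns=["PdDistrict"])

# Reorder test columns to the training layout; columns missing in test are filled with 0.
X_toy_test = X_toy_test.reindex(columns=X_toy_train.columns, fill_value=0)

print(list(X_toy_train.columns) == list(X_toy_test.columns))
```

Reindexing against the training columns also drops any dummy column that appears only in the test set, which is usually what a model fitted on the training layout needs.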


In [9]:
label_name = "Category"

y_train = train[label_name]

print(y_train.shape)
y_train.head()

(878049,)


0          WARRANTS
1    OTHER OFFENSES
2    OTHER OFFENSES
3     LARCENY/THEFT
4     LARCENY/THEFT
Name: Category, dtype: object
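
`y_train` holds string labels. Depending on the installed XGBoost version, `XGBClassifier` may require integer class labels instead; this is an assumption about the environment, since the notebook above was run with strings. A minimal sketch of one way to encode the labels and keep the mapping back to category names, using a toy series (`toy_y` is made up for illustration):

```python
import pandas as pd

toy_y = pd.Series(["WARRANTS", "OTHER OFFENSES", "LARCENY/THEFT", "OTHER OFFENSES"],
                  name="Category")

# pd.Categorical infers the categories and sorts them, so code i maps to the
# i-th category name in alphabetical order.
encoded = pd.Categorical(toy_y)
codes = encoded.codes                 # integer labels in 0..n_classes-1
classes = list(encoded.categories)    # names, for labelling probability columns later

# Decode the integers back to the original names to confirm nothing was lost.
decoded = [classes[c] for c in codes]
print(classes)
print(list(codes))
print(decoded == list(toy_y))
```

Keeping `classes` around matters because the columns of `predict_proba` follow the integer order, not the original strings.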

In [None]:
import xgboost as xgb

model = xgb.XGBClassifier(n_estimators=15, nthread=4)
model



XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=15, nthread=4,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)

In [None]:
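from sklearn.model_selection import cross_val_score

# neg_log_loss is the negated log loss, so higher is better for scikit-learn.
%time score = cross_val_score(model, X_train, y_train, cv=5, scoring="neg_log_loss").mean()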

score = -1.0 * score

print("Score = {0:.5f}".format(score))
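
The `neg_log_loss` scorer returns the negated log loss, which is why the score is multiplied by -1 before printing. To make the metric concrete, here is a minimal NumPy sketch of multiclass log loss on toy values (`multiclass_log_loss`, `y_true`, and `proba` are illustrative names and numbers, not taken from the notebook's results):

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log of the probability assigned to each row's true class."""
    # Clip to avoid log(0), then renormalise so each row still sums to 1.
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)
    rows = np.arange(len(y_true))
    return -np.mean(np.log(proba[rows, y_true]))

# Three samples, three classes; y_true holds the index of the correct class.
y_true = np.array([0, 2, 1])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])

print(multiclass_log_loss(y_true, proba))
```

A useful reference point: predicting a uniform distribution over k classes gives a log loss of ln(k), so a model should score below that to be better than guessing.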

In [None]:
model.fit(X_train, y_train)

In [None]:
predictions = model.predict_proba(X_test)

print(predictions.shape)
predictions

## Submit 

In [None]:
submission = pd.read_csv("input/sampleSubmission.csv", index_col="Id")

print(submission.shape)
submission.head()

In [None]:
submission = pd.DataFrame(predictions, index=submission.index, columns=submission.columns)
print(submission.shape)
submission.head()
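
Building the submission this way assumes that the column order of `predictions` matches the column order of the sample submission. Before writing the file, it can help to check the frame against the template. The following is a minimal sketch with toy values; `check_submission` is a hypothetical helper, not part of pandas or Kaggle tooling:

```python
import numpy as np
import pandas as pd

def check_submission(frame, expected_columns, expected_rows):
    """Return a list of problems found in a probability submission (empty if none)."""
    problems = []
    if list(frame.columns) != list(expected_columns):
        problems.append("columns differ from the template")
    if len(frame) != expected_rows:
        problems.append("row count differs from the template")
    # Each row is a probability distribution over categories, so it should sum to 1.
    if not np.allclose(frame.sum(axis=1), 1.0, atol=1e-6):
        problems.append("some rows do not sum to 1")
    return problems

# Toy data: two rows, three categories.
toy_columns = ["LARCENY/THEFT", "OTHER OFFENSES", "WARRANTS"]
toy_predictions = np.array([[0.1, 0.3, 0.6],
                            [0.2, 0.2, 0.6]])
toy_submission = pd.DataFrame(toy_predictions,
                              index=pd.Index([0, 1], name="Id"),
                              columns=toy_columns)

print(check_submission(toy_submission, toy_columns, expected_rows=2))
```

In the notebook, the equivalent call would compare the new `submission` against the columns and length of the frame loaded from `input/sampleSubmission.csv`.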

In [None]:
submission.to_csv("output/baseline-script.csv")