# [San Francisco Crime Classification](https://www.kaggle.com/c/sf-crime)

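### Files used
* **_input/test.csv_** - Test set
* **_input/train.csv_** - Training set
* **_input/sampleSubmission.csv_** - Sample submission file in the format Kaggle expects

This dataset contains incidents derived from the SFPD Crime Incident Reporting system. The data ranges from 2003/1/1 to 2015/5/13. The training set and test set rotate every week: weeks 1, 3, 5, 7, ... belong to the test set and weeks 2, 4, 6, 8, ... belong to the training set. In other words, each set appears to contain data only for alternating weeks.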
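The weekly rotation can be sketched on a few dates. This is only an illustration: the helpers `week_number` and `assigned_set` are hypothetical, and counting weeks as 7-day blocks starting on 2003/1/1 is an assumption, not something the competition description confirms.

```python
from datetime import date

# Assumption: weeks are 7-day blocks counted from the first day of the data range.
START = date(2003, 1, 1)

def week_number(day, start=START):
    """1-based index of the 7-day block that `day` falls in (hypothetical helper)."""
    return (day - start).days // 7 + 1

def assigned_set(day):
    """Odd-numbered weeks go to the test set, even-numbered weeks to the training set."""
    return "test" if week_number(day) % 2 == 1 else "train"

for d in (date(2003, 1, 1), date(2003, 1, 8), date(2003, 1, 15)):
    print(d, week_number(d), assigned_set(d))
```

Under this counting, a model trained on the training set never sees any incident from a test week, so time-based features have to generalize across neighboring weeks.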

### Data fields
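* **_Dates_** - timestamp of the crime incident
* **_Category_** - category of the crime incident (only in train.csv). This is the target (dependent) variable to predict
* **_Descript_** - detailed description of the crime incident (only in train.csv)
* **_DayOfWeek_** - day of the week
* **_PdDistrict_** - police district with jurisdiction over the incident
* **_Resolution_** - how the crime incident was resolved (only in train.csv)
* **_Address_** - address where the crime incident occurred
* **_X_** - longitude
* **_Y_** - latitude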
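Since three of the fields exist only in train.csv, only the remaining columns can serve as model inputs. The sketch below checks this on empty stand-in frames built from the field list above. `toy_train` and `toy_test` are illustrative names, and the real files may contain additional columns not listed here.

```python
import pandas as pd

# Column names taken from the field list; the empty frames stand in for the real CSV files.
train_columns = ["Dates", "Category", "Descript", "DayOfWeek", "PdDistrict",
                 "Resolution", "Address", "X", "Y"]
train_only = ["Category", "Descript", "Resolution"]

toy_train = pd.DataFrame(columns=train_columns)
toy_test = pd.DataFrame(columns=[c for c in train_columns if c not in train_only])

# Columns present in both sets are the only candidates for features.
shared = [c for c in toy_train.columns if c in toy_test.columns]
missing_in_test = sorted(set(toy_train.columns) - set(toy_test.columns))

print(shared)
print(missing_in_test)
```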

In [1]:
import pandas as pd

In [2]:
train = pd.read_csv("input/train.csv")
print(train.shape)
train.head()

(878049, 9)


| | Dates | Category | Descript | DayOfWeek | PdDistrict | Resolution | Address | X | Y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015.5.13 23:53 | WARRANTS | WARRANT ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 1 | 2015.5.13 23:53 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 2 | 2015.5.13 23:33 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | VANNESS AV / GREENWICH ST | -122.424363 | 37.800414 |
| 3 | 2015.5.13 23:30 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | NORTHERN | NONE | 1500 Block of LOMBARD ST | -122.426995 | 37.800873 |
| 4 | 2015.5.13 23:30 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | PARK | NONE | 100 Block of BRODERICK ST | -122.438738 | 37.771541 |


In [None]:
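test = pd.read_csv("input/test.csv")
print(test.shape)
test.head()

In [None]:
def roundXY(x):
    # Round a coordinate to 2 decimal places.
    return round(x, 2)

# Coarsen longitude (X) and latitude (Y) in both sets so that
# nearby incidents share the same coordinate values.
test["X"] = test["X"].apply(roundXY)
test["Y"] = test["Y"].apply(roundXY)

train["X"] = train["X"].apply(roundXY)
train["Y"] = train["Y"].apply(roundXY)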
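Rounding to two decimal places snaps each incident onto a coarse grid. A step of 0.01 degrees of latitude is roughly 1.1 km, so nearby incidents end up with identical coordinates. As a sketch, the same rounding can also be written with the vectorized `DataFrame.round`, shown here on the coordinates from the `train.head()` output. For these values it should agree with the element-wise `roundXY` approach, though the two are not guaranteed to match on every floating-point edge case.

```python
import pandas as pd

# Coordinates copied from the train.head() output.
coords = pd.DataFrame({
    "X": [-122.425892, -122.424363, -122.426995, -122.438738],
    "Y": [37.774599, 37.800414, 37.800873, 37.771541],
})

# Vectorized rounding of both columns at once.
rounded = coords[["X", "Y"]].round(2)

# Element-wise rounding with Python's built-in round, for comparison.
elementwise = coords[["X", "Y"]].apply(lambda column: column.apply(lambda v: round(v, 2)))

print(rounded)
print((rounded - elementwise).abs().max())
```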

## Train

In [None]:
feature_names = ["X", "Y"]
feature_names

In [None]:
X_train = train[feature_names]

print(X_train.shape)
X_train.head()

In [None]:
X_test = test[feature_names]

print(X_test.shape)
X_test.head()

In [None]:
label_name = "Category"

y_train = train[label_name]

print(y_train.shape)
y_train.head()

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_jobs=-1)
model

In [None]:
from sklearn.model_selection import cross_val_score

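%time score = cross_val_score(model, X_train, y_train, cv=5, scoring="neg_log_loss").mean()

# scikit-learn reports log loss negated (higher is better), so flip the sign
# to get the positive log loss that Kaggle uses, where lower is better.
score = -1.0 * score

print("Score = {0:.5f}".format(score))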
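To make the metric concrete: multi-class log loss is the mean of the negative log of the probability the model assigned to each row's true class. The sketch below computes it by hand on made-up probabilities. `multiclass_log_loss` is an illustrative helper, not the scikit-learn implementation.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs):
    """Mean negative log of the probability assigned to each row's true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        total += -math.log(probs[label])
    return total / len(true_labels)

# Made-up predictions for three incidents (each row sums to 1).
true_labels = ["WARRANTS", "LARCENY/THEFT", "OTHER OFFENSES"]
predicted_probs = [
    {"WARRANTS": 0.7, "LARCENY/THEFT": 0.2, "OTHER OFFENSES": 0.1},
    {"WARRANTS": 0.1, "LARCENY/THEFT": 0.8, "OTHER OFFENSES": 0.1},
    {"WARRANTS": 0.3, "LARCENY/THEFT": 0.3, "OTHER OFFENSES": 0.4},
]

loss = multiclass_log_loss(true_labels, predicted_probs)

# A model that predicts every class with equal probability scores log(K).
uniform = [{name: 1.0 / 3 for name in predicted_probs[0]} for _ in true_labels]
uniform_loss = multiclass_log_loss(true_labels, uniform)

print("Score = {0:.5f}".format(loss))
print("Uniform baseline = {0:.5f}".format(uniform_loss))
```

The uniform baseline is a handy reference point: a score above log(K) for K classes means the model is doing worse than guessing every class equally.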

In [None]:
import xgboost as xgb

# n_jobs sets the number of parallel threads (it replaces the older nthread argument).
model = xgb.XGBClassifier(n_estimators=15, n_jobs=4)
model
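Depending on the installed XGBoost version, `XGBClassifier` may reject string labels such as `"WARRANTS"` and expect integer class ids from 0 to K-1 instead. One possible workaround, sketched here on made-up labels, is to encode the categories with `pd.factorize` before fitting and keep the class names for decoding. The variable names `codes` and `class_names` are illustrative.

```python
import pandas as pd

# Made-up labels standing in for train["Category"].
labels = pd.Series(["WARRANTS", "OTHER OFFENSES", "LARCENY/THEFT", "OTHER OFFENSES"])

# sort=True assigns ids in alphabetical order of the class names.
codes, class_names = pd.factorize(labels, sort=True)

print(codes)              # integer ids, one per row
print(list(class_names))  # position in this list is the id of each category

# Indexing the class names with the ids recovers the original strings.
decoded = class_names[codes]
print(list(decoded))
```

With this encoding, the integer array would be passed as the target when fitting, and `class_names` would supply the column labels for the predicted probabilities.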

In [None]:
%time score = cross_val_score(model, X_train, y_train, cv=5, scoring="neg_log_loss").mean()

score = -1.0 * score

print("Score = {0:.5f}".format(score))

In [None]:
model.fit(X_train, y_train)

In [None]:
predictions = model.predict_proba(X_test)

print(predictions.shape)
predictions

## Submit 

In [None]:
submission = pd.read_csv("input/sampleSubmission.csv", index_col="Id")

print(submission.shape)
submission.head()

In [None]:
submission = pd.DataFrame(predictions, index=submission.index, columns=submission.columns)
print(submission.shape)
submission.head()
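Building the frame with `columns=submission.columns` relies on the probability columns from `predict_proba` (ordered as in `model.classes_`) matching the column order of the sample submission. A minimal sketch of making that alignment explicit, using made-up class names and probabilities in place of the real model and file:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins: class order as a fitted model might report it,
# and a submission template whose columns are in a different order.
model_classes = ["ARSON", "ASSAULT", "WARRANTS"]
template_columns = ["WARRANTS", "ARSON", "ASSAULT"]
template_index = pd.Index([0, 1], name="Id")

predictions = np.array([[0.1, 0.6, 0.3],
                        [0.2, 0.2, 0.6]])

# Label the columns with the model's own class order first.
by_class = pd.DataFrame(predictions, index=template_index, columns=model_classes)

# Fail loudly if the two sets of class names differ, then reorder to the template.
if set(model_classes) != set(template_columns):
    raise ValueError("model classes and submission columns differ")
aligned = by_class[template_columns]

print(aligned)
```

When the two orders already agree, the reordering is a no-op, so the check costs little and guards against silently mislabeled probabilities.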

In [None]:
# Assumes the output/ directory already exists; to_csv does not create it.
submission.to_csv("output/baseline-script.csv")