## San Francisco Crime Classification Baseline Script

- Kaggle URL: [San Francisco Crime Classification](https://www.kaggle.com/c/sf-crime)
- Competition goal: use location data to analyze crimes that occurred at a given time, day of week, and police district, and predict the specific category of each crime

In [1]:
import pandas as pd

### Reload Dataset

**Train Dataset**

In [2]:
train = pd.read_csv('data/train.csv')
print(train.shape)
train.head()

(878049, 9)


|  | Dates | Category | Descript | DayOfWeek | PdDistrict | Resolution | Address | X | Y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-05-13 23:53:00 | WARRANTS | WARRANT ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 1 | 2015-05-13 23:53:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 2 | 2015-05-13 23:33:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | VANNESS AV / GREENWICH ST | -122.424363 | 37.800414 |
| 3 | 2015-05-13 23:30:00 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | NORTHERN | NONE | 1500 Block of LOMBARD ST | -122.426995 | 37.800873 |
| 4 | 2015-05-13 23:30:00 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | PARK | NONE | 100 Block of BRODERICK ST | -122.438738 | 37.771541 |


In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 9 columns):
Dates         878049 non-null object
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
dtypes: float64(2), object(7)
memory usage: 60.3+ MB


- Label: Category
- Feature: Dates, DayOfWeek, PdDistrict, Address, X, Y

**Test Dataset**

In [4]:
test = pd.read_csv('data/test.csv', index_col='Id')
print(test.shape)
test.head()

(884262, 6)


| Id | Dates | DayOfWeek | PdDistrict | Address | X | Y |
|---|---|---|---|---|---|---|
| 0 | 2015-05-10 23:59:00 | Sunday | BAYVIEW | 2000 Block of THOMAS AV | -122.399588 | 37.735051 |
| 1 | 2015-05-10 23:51:00 | Sunday | BAYVIEW | 3RD ST / REVERE AV | -122.391523 | 37.732432 |
| 2 | 2015-05-10 23:50:00 | Sunday | NORTHERN | 2000 Block of GOUGH ST | -122.426002 | 37.792212 |
| 3 | 2015-05-10 23:45:00 | Sunday | INGLESIDE | 4700 Block of MISSION ST | -122.437394 | 37.721412 |
| 4 | 2015-05-10 23:45:00 | Sunday | INGLESIDE | 4700 Block of MISSION ST | -122.437394 | 37.721412 |


### Train
#### Use X and Y as features

In [5]:
feature_names = ["X", "Y"]
feature_names

['X', 'Y']

In [6]:
X_train = train[feature_names]
print(X_train.shape)
X_train.head()

(878049, 2)


|  | X | Y |
|---|---|---|
| 0 | -122.425892 | 37.774599 |
| 1 | -122.425892 | 37.774599 |
| 2 | -122.424363 | 37.800414 |
| 3 | -122.426995 | 37.800873 |
| 4 | -122.438738 | 37.771541 |


In [7]:
X_test = test[feature_names]
print(X_test.shape)
X_test.head()

(884262, 2)


| Id | X | Y |
|---|---|---|
| 0 | -122.399588 | 37.735051 |
| 1 | -122.391523 | 37.732432 |
| 2 | -122.426002 | 37.792212 |
| 3 | -122.437394 | 37.721412 |
| 4 | -122.437394 | 37.721412 |


#### Use Category as the label

In [8]:
label_name = "Category"
y_train = train[label_name]
print(y_train.shape)
y_train.head()

(878049,)


0          WARRANTS
1    OTHER OFFENSES
2    OTHER OFFENSES
3     LARCENY/THEFT
4     LARCENY/THEFT
Name: Category, dtype: object

### Use Random Forest
- An extension of the decision tree; an ensemble (a method that combines several machine learning models)
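
To make the ensemble idea concrete, here is a minimal pure-Python sketch of one way a forest can combine its trees: each tree outputs a class-probability vector for a sample, and the forest averages those vectors. The per-tree probabilities and the helper name `average_tree_probabilities` are made up for illustration; scikit-learn's `RandomForestClassifier.predict_proba` performs a comparable averaging internally.

```python
def average_tree_probabilities(tree_probas):
    """Average per-tree class-probability vectors into one forest prediction."""
    n_trees = len(tree_probas)
    n_classes = len(tree_probas[0])
    return [
        sum(proba[j] for proba in tree_probas) / n_trees
        for j in range(n_classes)
    ]


# Hypothetical outputs of three trees for one sample and three classes.
tree_probas = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 0.0],
]

forest_proba = average_tree_probabilities(tree_probas)
print(forest_proba)
```

With only a handful of trees, a class that no tree votes for ends up with a probability of exactly 0, which is worth keeping in mind when the predictions are scored with log loss.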

In [9]:
from sklearn.ensemble import RandomForestClassifier

# n_estimators is the number of trees
# n_jobs controls parallelism; -1 uses all available cores (faster)
# random_state fixes the random seed so that results are reproducible
model = RandomForestClassifier(n_estimators=10,
                               n_jobs=-1,
                               random_state=39)
model

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                       oob_score=False, random_state=39, verbose=0,
                       warm_start=False)

### Model Validation
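- Model Validation: a way to check quantitatively whether the model's results are good
- Exploratory data analysis: hypothesis -> **(hypothesis) validation** -> prediction
    - Validation methods: 1) Hold-out Validation, 2) Cross Validation
- ```Hold-out Validation```: instead of fitting on all of the train data, split it at an arbitrary ratio, fit on one part, predict the remaining part, and compare the predictions with the actual labels. Fast (fits only once).
- ```Cross Validation```: split the train data into n pieces, fit on all pieces but one, and predict the held-out piece. Repeating this n times yields one prediction for every row of the train data, which can be compared with the actual labels. Slow (fits several times), but more accurate.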
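
The cross-validation splitting described above can be sketched in plain Python. This is only an illustration of the index bookkeeping, assuming contiguous, unshuffled folds; the helper name `k_fold_indices` is hypothetical and no model is fitted here.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs; each index is held out exactly once."""
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train_idx, valid_idx in folds:
    print("fit on", train_idx, "-> predict", valid_idx)

# Collect the held-out indices: together they cover every row exactly once.
held_out = sorted(i for _, valid_idx in folds for i in valid_idx)
```

In practice, scikit-learn's `KFold` and `cross_val_predict` (both in `sklearn.model_selection`) cover this use case, including shuffling, so a hand-written splitter like this is mainly useful for understanding what happens.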

### Evaluate
- Uses hold-out validation with log loss as the metric

In [10]:
from sklearn.model_selection import train_test_split

# Set the fraction of the train data held out for validation with test_size
X_train_kf, X_test_kf, y_train_kf, y_test_kf = train_test_split(X_train, y_train,
                                                                test_size=0.3,
                                                                random_state=39)
print(X_train_kf.shape, y_train_kf.shape)
print(X_test_kf.shape, y_test_kf.shape)

(614634, 2) (614634,)
(263415, 2) (263415,)


In [11]:
%time model.fit(X_train_kf, y_train_kf)

CPU times: user 12.1 s, sys: 288 ms, total: 12.4 s
Wall time: 3.22 s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                       oob_score=False, random_state=39, verbose=0,
                       warm_start=False)

In [12]:
# model.predict: predicted class labels
# model.predict_proba: predicted class probabilities (between 0 and 1)
y_test_predict_kf = model.predict_proba(X_test_kf)
print(y_test_predict_kf.shape)
y_test_predict_kf

(263415, 39)


array([[0.        , 0.02489514, 0.        , ..., 0.17206992, 0.        ,
        0.        ],
       [0.        , 0.0111371 , 0.        , ..., 0.09733456, 0.00819005,
        0.02350758],
       [0.        , 0.22891464, 0.        , ..., 0.10706815, 0.        ,
        0.        ],
       ...,
       [0.        , 0.07881159, 0.        , ..., 0.1799732 , 0.        ,
        0.        ],
       [0.        , 0.02094606, 0.        , ..., 0.10880375, 0.05389141,
        0.00643738],
       [0.00323462, 0.21289159, 0.        , ..., 0.        , 0.08175139,
        0.01639368]])

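**Using the formula provided by Kaggle**  
: lower is better, higher is worse

\begin{align}
\text{log loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})
\end{align}

Here $N$ is the number of samples, $M$ is the number of classes, $y_{ij}$ is 1 if sample $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that sample $i$ belongs to class $j$.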
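
As a small worked example of the formula: because $y_{ij}$ is 1 only for the true class, the inner sum reduces to the log of the probability assigned to the true class. The sketch below computes that by hand in plain Python. The function name `multiclass_log_loss`, the clipping constant `eps`, and the three-class toy data are assumptions made for illustration, not part of the original notebook; scikit-learn's `log_loss` has its own handling of zero probabilities, which may differ between versions.

```python
import math


def multiclass_log_loss(y_true, y_proba, classes, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, proba in zip(y_true, y_proba):
        p = proba[classes.index(label)]
        # Clip so that a predicted probability of 0 does not give log(0).
        p = min(max(p, eps), 1 - eps)
        total += math.log(p)
    return -total / len(y_true)


classes = ["ASSAULT", "LARCENY/THEFT", "WARRANTS"]  # illustrative subset
y_true = ["WARRANTS", "ASSAULT"]
y_proba = [
    [0.1, 0.2, 0.7],  # the true class (WARRANTS) gets 0.7
    [0.0, 0.9, 0.1],  # the true class (ASSAULT) gets 0.0
]

good = multiclass_log_loss(y_true[:1], y_proba[:1], classes)
bad = multiclass_log_loss(y_true, y_proba, classes)
print("first sample only: {:.5f}".format(good))
print("both samples:      {:.5f}".format(bad))
```

A single sample whose true class receives probability 0 dominates the average, which is one plausible reason a forest with few trees can score poorly on this metric even when many of its predictions are reasonable.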

In [13]:
from sklearn.metrics import log_loss

score = log_loss(y_test_kf, y_test_predict_kf)
print("Log Loss = {:.5f}".format(score))

Log Loss = 6.28793


### Predict

In [14]:
%time model.fit(X_train, y_train)

CPU times: user 17.9 s, sys: 430 ms, total: 18.4 s
Wall time: 4.75 s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                       oob_score=False, random_state=39, verbose=0,
                       warm_start=False)

In [None]:
prediction_list = model.predict_proba(X_test)
print(prediction_list.shape)
prediction_list

### Submit
- Build the submission in the format Kaggle expects
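
As a rough picture of that layout, the sketch below writes a tiny CSV with an `Id` column followed by one probability column per category, using only the standard library. The three category names and the probability values are made up for illustration; the real file has one column for every class in `model.classes_` and one row for every `Id` in the test set.

```python
import csv
import io

classes = ["ASSAULT", "LARCENY/THEFT", "WARRANTS"]  # illustrative subset
probabilities = [
    [0.1, 0.2, 0.7],
    [0.0, 0.9, 0.1],
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
# Header: the Id column, then one column per category.
writer.writerow(["Id"] + classes)
# One row per test sample: its Id, then the predicted probabilities.
for row_id, row in enumerate(probabilities):
    writer.writerow([row_id] + row)

submission_text = buffer.getvalue()
print(submission_text)
```

The order of the probability columns has to match the order of the header, which is why the cells below reuse `model.classes_` for the column names.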

In [None]:
sample_submission = pd.read_csv('data/sampleSubmission.csv', index_col='Id')
print(sample_submission.shape)
sample_submission.head()

In [None]:
model.classes_

In [None]:
submissions = pd.DataFrame(prediction_list,
                              index=sample_submission.index,
                              columns=model.classes_)
print(submissions.shape)
submissions.head()

In [None]:
submissions.to_csv('data/baseline-script.csv')