# [San Francisco Crime Classification | Kaggle](https://www.kaggle.com/c/sf-crime)

### [SF Crime Prediction with scikit-learn 을 따라해 본다. | Kaggle](https://www.kaggle.com/rhoslug/sf-crime-prediction-with-scikit-learn)

### Data fields
* 날짜  - 범죄 사건의 타임 스탬프
* 범주  - 범죄 사건 카테고리 (train.csv에만 해당) 이 변수를 예측하는 게 이 경진대회 과제임
* 설명  - 범죄 사건에 대한 자세한 설명 (train.csv에만 있음)
* DayOfWeek - 요일
* PdDistrict - 경찰서 구의 이름
* 해결 방법 - 범죄 사건이 어떻게 해결 되었는지 (train.csv에서만)
* 주소 - 범죄 사건의 대략적인 주소 
* X - 경도
* Y - 위도


* Dates - timestamp of the crime incident
* Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
* Descript - detailed description of the crime incident (only in train.csv)
* DayOfWeek - the day of the week
* PdDistrict - name of the Police Department District
* Resolution - how the crime incident was resolved (only in train.csv)
* Address - the approximate street address of the crime incident 
* X - Longitude 
* Y - Latitude 

In [2]:
import pandas as pd
import numpy as np

In [3]:
df_train = pd.read_csv('data/train.csv', parse_dates=['Dates'])
df_train.shape

(878049, 9)

In [4]:
df_train.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [5]:
# 'Descript', 'Dates', 'Resolution' 는 제거
df_train.drop(['Descript', 'Dates', 'Resolution'], axis=1, inplace=True)
df_train.shape

(878049, 6)

In [6]:
df_test = pd.read_csv('data/test.csv', parse_dates=['Dates'])
df_test.shape

(884262, 7)

In [7]:
df_test.head()

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [8]:
df_submit = pd.read_csv("data/sampleSubmission.csv")
df_submit.head()

Unnamed: 0,Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,...,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [11]:
df_train["Category"].value_counts()

LARCENY/THEFT                  174900
OTHER OFFENSES                 126182
NON-CRIMINAL                    92304
ASSAULT                         76876
DRUG/NARCOTIC                   53971
VEHICLE THEFT                   53781
VANDALISM                       44725
WARRANTS                        42214
BURGLARY                        36755
SUSPICIOUS OCC                  31414
MISSING PERSON                  25989
ROBBERY                         23000
FRAUD                           16679
FORGERY/COUNTERFEITING          10609
SECONDARY CODES                  9985
WEAPON LAWS                      8555
PROSTITUTION                     7484
TRESPASS                         7326
STOLEN PROPERTY                  4540
SEX OFFENSES FORCIBLE            4388
DISORDERLY CONDUCT               4320
DRUNKENNESS                      4280
RECOVERED VEHICLE                3138
KIDNAPPING                       2341
DRIVING UNDER THE INFLUENCE      2268
RUNAWAY                          1946
LIQUOR LAWS 

In [7]:
df_test.drop(['Dates'], axis=1, inplace=True)

In [8]:
df_test.head()

Unnamed: 0,Id,DayOfWeek,PdDistrict,Address,X,Y
0,0,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,1,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,3,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,4,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [9]:
# 트레이닝과 검증셋을 선택한다.
inds = np.arange(df_train.shape[0])
inds

array([     0,      1,      2, ..., 878046, 878047, 878048])

In [10]:
np.random.shuffle(inds)
df_train.shape[0]

878049

In [11]:
df_train.shape[0] * 0.8

702439.2000000001

In [12]:
# 트레인 셋
train_inds = inds[:int(0.8 * df_train.shape[0])]
print(train_inds.shape)
# 검증 셋
val_inds = inds[int(0.8) * df_train.shape[0]:]
print(val_inds.shape)

(702439,)
(878049,)


In [13]:
# 컬럼명을 추출한다.
col_names = np.sort(df_train['Category'].unique())
col_names

array(['ARSON', 'ASSAULT', 'BAD CHECKS', 'BRIBERY', 'BURGLARY',
       'DISORDERLY CONDUCT', 'DRIVING UNDER THE INFLUENCE',
       'DRUG/NARCOTIC', 'DRUNKENNESS', 'EMBEZZLEMENT', 'EXTORTION',
       'FAMILY OFFENSES', 'FORGERY/COUNTERFEITING', 'FRAUD', 'GAMBLING',
       'KIDNAPPING', 'LARCENY/THEFT', 'LIQUOR LAWS', 'LOITERING',
       'MISSING PERSON', 'NON-CRIMINAL', 'OTHER OFFENSES',
       'PORNOGRAPHY/OBSCENE MAT', 'PROSTITUTION', 'RECOVERED VEHICLE',
       'ROBBERY', 'RUNAWAY', 'SECONDARY CODES', 'SEX OFFENSES FORCIBLE',
       'SEX OFFENSES NON FORCIBLE', 'STOLEN PROPERTY', 'SUICIDE',
       'SUSPICIOUS OCC', 'TREA', 'TRESPASS', 'VANDALISM', 'VEHICLE THEFT',
       'WARRANTS', 'WEAPON LAWS'], dtype=object)

In [14]:
# 카테고리를 숫자로 변환해 준다.
df_train['Category'] = pd.Categorical(df_train['Category']).codes
df_train['DayOfWeek'] = pd.Categorical(df_train['DayOfWeek']).codes
df_train['PdDistrict'] = pd.Categorical(df_train['PdDistrict']).codes
df_test['DayOfWeek'] = pd.Categorical(df_test['DayOfWeek']).codes
df_test['PdDistrict'] = pd.Categorical(df_test['PdDistrict']).codes

In [15]:
df_train.head()

Unnamed: 0,Category,DayOfWeek,PdDistrict,Address,X,Y
0,37,6,4,OAK ST / LAGUNA ST,-122.425892,37.774599
1,21,6,4,OAK ST / LAGUNA ST,-122.425892,37.774599
2,21,6,4,VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,16,6,4,1500 Block of LOMBARD ST,-122.426995,37.800873
4,16,6,5,100 Block of BRODERICK ST,-122.438738,37.771541


In [16]:
df_test.head()

Unnamed: 0,Id,DayOfWeek,PdDistrict,Address,X,Y
0,0,3,0,2000 Block of THOMAS AV,-122.399588,37.735051
1,1,3,0,3RD ST / REVERE AV,-122.391523,37.732432
2,2,3,4,2000 Block of GOUGH ST,-122.426002,37.792212
3,3,3,2,4700 Block of MISSION ST,-122.437394,37.721412
4,4,3,2,4700 Block of MISSION ST,-122.437394,37.721412


In [17]:
from sklearn.feature_extraction.text import CountVectorizer
# text 빈도를 추출한다.
cvec = CountVectorizer()
cvec

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [18]:
bows_train = cvec.fit_transform(df_train['Address'].values)

In [19]:
bows_test = cvec.transform(df_test['Address'].values)

In [20]:
# 트레이닝과 검증셋을 나눈다.
df_val = df_train.iloc[val_inds]
df_val.head()

Unnamed: 0,Category,DayOfWeek,PdDistrict,Address,X,Y
844294,21,4,8,200 Block of RANDOLPH ST,-122.46486,37.714345
376559,21,4,7,MISSION ST / 11TH ST,-122.417105,37.774324
3683,20,1,7,100 Block of 9TH ST,-122.413981,37.775569
847531,13,4,1,500 Block of WASHINGTON ST,-122.402219,37.795713
22596,21,0,7,1000 Block of HARRISON ST,-122.405955,37.775861


In [21]:
df_val.shape

(878049, 6)

In [22]:
df_train = df_train.iloc[train_inds]
df_train.shape

(702439, 6)

In [23]:
df_train.head()

Unnamed: 0,Category,DayOfWeek,PdDistrict,Address,X,Y
844294,21,4,8,200 Block of RANDOLPH ST,-122.46486,37.714345
376559,21,4,7,MISSION ST / 11TH ST,-122.417105,37.774324
3683,20,1,7,100 Block of 9TH ST,-122.413981,37.775569
847531,13,4,1,500 Block of WASHINGTON ST,-122.402219,37.795713
22596,21,0,7,1000 Block of HARRISON ST,-122.405955,37.775861


In [24]:
from patsy import dmatrices, dmatrix
y_train, X_train = dmatrices('Category ~ X + Y + DayOfWeek + PdDistrict', df_train)

In [25]:
y_train.shape

(702439, 1)

In [26]:
# 벡터화 된 주소
X_train = np.hstack((X_train, bows_train[train_inds, :].toarray()))

In [27]:
X_train.shape

(702439, 2146)

In [28]:
y_val, X_val = dmatrices('Category ~ X + Y + DayOfWeek + PdDistrict', df_val)

In [29]:
X_val = np.hstack((X_val, bows_train[val_inds, :].toarray()))
X_test = dmatrix('X + Y + DayOfWeek + PdDistrict', df_test)

In [30]:
X_test = np.hstack((X_test, bows_test.toarray()))

In [31]:
# IncrementalPCA
from sklearn.decomposition import IncrementalPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [32]:
ipca = IncrementalPCA(n_components=4, batch_size=5)
ipca

IncrementalPCA(batch_size=5, copy=True, n_components=4, whiten=False)

In [33]:
X_train = ipca.fit_transform(X_train)

In [34]:
X_val = ipca.transform(X_val)

In [35]:
X_test = ipca.transform(X_test)

In [None]:
# 로지스틱 회귀를 생성하고 fit 시킨다.
logistic = LogisticRegression()
logistic.fit(X_train, y_train.ravel())

In [42]:
# 정확도를 본다.
print('Mean accuracy (Logistic):', logistic.score(X_val, y_val.ravel()))

Mean accuracy (Logistic): 0.20984250309492977


In [43]:
# 랜덤 포레스트로 fit 시키고 정확도를 본다.
randforest = RandomForestClassifier()
randforest.fit(X_train, y_train.ravel())

# 정확도를 본다.
print('Mean accuracy (Logistic):', logistic.score(X_val, y_val.ravel()))

Mean accuracy (Logistic): 0.20984250309492977


In [38]:
# Make predictions

predict_probs = logistic.predict_proba(X_test)

In [40]:
df_pred = pd.DataFrame(data=predict_probs, columns=col_names)
df_pred['Id'] = df_test['Id'].astype(int)
df_pred

Unnamed: 0,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,EMBEZZLEMENT,...,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS,Id
0,0.004529,0.111885,0.000697,0.000650,0.060622,0.003698,0.001908,0.018210,0.003156,0.001992,...,0.004727,0.001456,0.046903,0.000112,0.007827,0.073395,0.089612,0.027645,0.015829,0
1,0.001690,0.069454,0.000226,0.000340,0.003436,0.004071,0.006859,0.029056,0.004057,0.000247,...,0.004209,0.000254,0.025817,0.000097,0.001138,0.050124,0.119799,0.036432,0.014453,1
2,0.001925,0.099437,0.000520,0.000306,0.056984,0.005298,0.001311,0.049739,0.005113,0.001971,...,0.006031,0.000648,0.040351,0.000108,0.011008,0.055089,0.053575,0.046717,0.009409,2
3,0.002338,0.103010,0.000561,0.000389,0.059863,0.005012,0.001252,0.036751,0.004797,0.001951,...,0.006200,0.000652,0.040480,0.000112,0.010054,0.058667,0.061577,0.042184,0.010763,3
4,0.002338,0.103010,0.000561,0.000389,0.059863,0.005012,0.001252,0.036751,0.004797,0.001951,...,0.006200,0.000652,0.040480,0.000112,0.010054,0.058667,0.061577,0.042184,0.010763,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
884257,0.002129,0.099521,0.000620,0.000296,0.060803,0.005592,0.001029,0.048363,0.004390,0.001820,...,0.006701,0.000650,0.041442,0.000106,0.011491,0.053332,0.054004,0.049094,0.010010,884257
884258,0.001908,0.097043,0.000583,0.000261,0.054615,0.005763,0.001113,0.056758,0.004553,0.001727,...,0.006577,0.000629,0.040999,0.000104,0.011351,0.051588,0.051441,0.051973,0.009450,884258
884259,0.002322,0.100829,0.000631,0.000333,0.058137,0.005467,0.001056,0.042030,0.004284,0.001726,...,0.006788,0.000635,0.041244,0.000108,0.010474,0.055037,0.059041,0.047047,0.010810,884259
884260,0.004545,0.110369,0.000790,0.000561,0.058906,0.004060,0.001635,0.020955,0.002835,0.001768,...,0.005204,0.001435,0.048189,0.000109,0.008170,0.069483,0.086921,0.031026,0.016070,884260


In [41]:
df_pred.to_csv('output.csv', index=False)