# Mirae Asset Global Investments
### AI Innovation Division, AI Innovation Team: Pre-Interview Assignment

### Overview
- You are given a dataset of stocks measured over a fixed period.
- It is split into a train set and a test set.
- The goal is to train a model on the train set and produce predictions for the test set.

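### Data Description

######  Input Data
- The columns of the dataset consist of date information, stock information, and a feature set. The feature set has been anonymized (blurred).
- Each feature holds values judged to be meaningful for the given stock.

######  Target Data
- Two sets of labels are provided for the train set; no labels are provided for the test set.
- The labels are built from each sample's return over the next unit time (T), measured from the sample's timestamp.
    - Label 1: unit-time (T) return (train_target.csv) -> Regression
    - Label 2: at each point in time, the unit-time returns of the stocks (Label 1) are split into quintiles and used as a class target (train_target2.csv) -> Classification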
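
The relationship between the two labels can be illustrated with a small sketch. It assumes the quintiles are equal-count bins computed separately for each `td`, with 0 as the lowest-return bin; the sample rows shown later in this notebook are consistent with 0 being the lowest bin, but the exact binning rule is not stated in the assignment. The toy data and the helper `quintile_labels` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy returns for two time points (values are made up for illustration).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "td": ["T001"] * 10 + ["T002"] * 10,
    "code": [f"A{i:03d}" for i in range(10)] * 2,
    "target": rng.normal(0.0, 0.03, size=20),
})

def quintile_labels(returns: pd.Series) -> pd.Series:
    """Label the returns of one time point as 0 (lowest) .. 4 (highest)."""
    # Ranking with method="first" breaks ties, so qcut always gets 5 distinct bin edges.
    return pd.qcut(returns.rank(method="first"), 5, labels=False)

# Assumption: bins are formed within each time point, not over the whole sample.
toy["target_quintile"] = (
    toy.groupby("td")["target"].transform(quintile_labels).astype(int)
)

# Count how many stocks land in each quintile per time point.
print(toy.groupby("td")["target_quintile"].value_counts().unstack())
```

Binning within each time point makes the class target a cross-sectional ranking: a stock's class depends on how it did relative to the other stocks at that time, not on the overall market move.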
  

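### Details
- Depending on which label you train on, you can take a regression approach or a classification approach that predicts which quintile a stock falls into.
- Beyond these, you may take other approaches at your discretion, such as transforming the data.
    - For example, in the classification setting you may split returns into two quantiles instead of five and treat the task as binary classification. If you transform the data, please describe what you did.
    - <font color=RED> You only need to choose one approach, whichever it is. </font>
- You do not have to use all of the data for training.
    - For example, you may drop features you judge to be unnecessary, and you may create and use new features.
- <font color=RED> The process of building the model matters more than the final model results.</font>
- <font color=RED>Please explain in detail the reasoning behind your analysis (as comments/annotations or as a report). </font>
    - <font color=RED> Your comments/annotations or report carry more weight in the evaluation than the score. </font>
- <font color=RED>Please include at least one modeling step that uses a deep learning framework (TensorFlow, PyTorch, Keras, Theano, etc.).</font>
- There is no restriction on using additional models besides the deep learning model.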
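
The binary variant mentioned in the list above could be built roughly as follows. This is a minimal sketch that assumes "two quantiles" means splitting at the median return of each time point; the toy frame and the column name `target_binary` are hypothetical.

```python
import pandas as pd

# Toy stand-in for train_target.csv after set_index(['td', 'code']).
y = pd.DataFrame(
    {"target": [0.02, -0.01, 0.00, 0.03, -0.02, 0.01, 0.05, -0.04]},
    index=pd.MultiIndex.from_product(
        [["T001", "T002"], ["A001", "A002", "A003", "A004"]],
        names=["td", "code"],
    ),
)

# Median return of each time point; transform keeps the original row alignment.
period_median = y.groupby(level="td")["target"].transform("median")

# 1 = above the median of its time point, 0 = at or below it.
y_binary = (y["target"] > period_median).astype(int).rename("target_binary")
print(y_binary)
```

If a transformation like this is used, the assignment asks that it be documented, so the split rule (median per time point here) should be stated alongside the code.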

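### Submission Format
- Save the model's predictions on the test set as a CSV file and submit it together with the source code.
- The .ipynb (Jupyter notebook) format is preferred for source code; .py files are also accepted.
- Please submit a single compressed archive (.zip) that contains all deliverables.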

##### Further inquiries:
- Jungwook Ko, Manager, AI Innovation Team
- 02-3774-2262 (office hours: 9:00 ~ 18:00)
- jungwook.ko@miraeasset.com


- The source code below is a brief example meant to convey the submission format and the expected workflow.

# Data load

In [1]:
import pandas as pd
import numpy as np

In [2]:
X_train = pd.read_csv('./data/train_data.csv')  # training features
Y_train = pd.read_csv('./data/train_target.csv')  # training labels for regression
Y2_train = pd.read_csv('./data/train_target2.csv')  # training labels for classification
X_test = pd.read_csv('./data/test_data.csv')  # test features


In [3]:
X_train.tail()

Unnamed: 0,td,code,F001,F002,F003,F004,F005,F006,F007,F008,...,F037,F038,F039,F040,F041,F042,F043,F044,F045,F046
83559,T274,A791,5.411604,0.0,0.560828,-0.101983,-8.063605,0.029134,0.026586,-6.890195,...,-18.27876,-0.000252,-0.009729,4.0,-0.007442,-22.693267,14.001306,1.637208,1.333598,-3.507891
83560,T274,A793,82.562142,0.0,8.971392,-0.008897,22.222222,0.206389,0.033859,19.18437,...,64.794007,0.021994,-0.004664,2.0,0.11194,0.0,275.426621,1.005348,158.823529,8.893188
83561,T274,A794,0.186385,0.000893,2.374996,-0.052632,25.110132,0.029708,0.030444,9.984827,...,30.275229,0.093428,0.000383,5.0,-0.020436,13.72549,0.530973,0.872873,43.434343,6.393844
83562,T274,A795,-5.515355,-0.005224,4.898941,-0.037037,6.10119,0.097354,0.023531,-0.481948,...,-1.10957,0.031893,-0.009076,2.0,-0.00905,-4.255319,-12.083847,0.880508,-1.246537,3.634898
83563,T274,A796,3.056981,-0.005908,2.25273,-0.035714,-17.596567,0.140226,0.035148,-6.005738,...,-20.661157,-0.003084,-0.005149,2.0,0.054743,0.0,9.090909,1.28015,11.953353,-5.925237


In [4]:
Y_train.tail()

| | td | code | target |
|---|---|---|---|
| 83559 | T274 | A791 | 0.026318 |
| 83560 | T274 | A793 | -0.034091 |
| 83561 | T274 | A794 | 0.001761 |
| 83562 | T274 | A795 | 0.040673 |
| 83563 | T274 | A796 | 0.005208 |


In [5]:
Y2_train.tail()

| | td | code | target |
|---|---|---|---|
| 83559 | T274 | A791 | 3 |
| 83560 | T274 | A793 | 0 |
| 83561 | T274 | A794 | 2 |
| 83562 | T274 | A795 | 3 |
| 83563 | T274 | A796 | 2 |


In [6]:
X_test.tail()

Unnamed: 0,td,code,F001,F002,F003,F004,F005,F006,F007,F008,...,F037,F038,F039,F040,F041,F042,F043,F044,F045,F046
11613,T310,A790,0.407656,,9.867299,0.0,-6.334372,,0.036413,-7.040488,...,-25.454545,-0.024005,,7.0,,,1.920904,1.007055,-24.833333,-2.578208
11614,T310,A793,43.023917,-0.003035,9.985884,-0.008333,-3.834356,0.055146,0.040303,-2.480406,...,-9.913793,0.06561,-0.009397,2.0,-0.018386,-1.123596,144.921875,0.979869,59.137056,-1.383484
11615,T310,A794,-4.534972,0.0,1.751161,0.0,8.661417,-0.016977,0.02723,-0.698376,...,-1.895735,0.006033,-0.020032,5.0,0.014985,-3.921569,-14.46281,0.824189,-15.853659,3.262274
11616,T310,A795,-17.974223,0.0,4.192958,0.003632,11.420983,0.100086,0.025969,-8.893335,...,-23.308958,0.000907,0.001045,2.0,-0.014319,-6.666667,-42.217631,0.833315,-38.399413,5.205679
11617,T310,A796,-1.651783,-0.02964,2.452572,-0.433829,-4.824561,0.018919,0.040217,-5.178204,...,-21.090909,-0.048585,-0.012903,2.0,-0.046665,-10.769231,-6.060606,1.246159,-2.908277,-1.546951


In [7]:
X_train = X_train.set_index(['td', 'code'])
Y_train = Y_train.set_index(['td', 'code'])
Y2_train = Y2_train.set_index(['td', 'code'])
X_test = X_test.set_index(['td', 'code'])
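
Because the model is later fitted on `.values`, which discards the index, features and labels are matched purely by row order. A quick sanity check along the following lines can confirm that the `(td, code)` indices agree before training. The helper `check_alignment` and the toy frames are hypothetical.

```python
import pandas as pd

def check_alignment(X: pd.DataFrame, y: pd.DataFrame) -> None:
    """Raise if features and labels do not share the same unique (td, code) index."""
    if not X.index.is_unique:
        raise ValueError("duplicate (td, code) pairs in the feature index")
    if not X.index.equals(y.index):
        raise ValueError("feature and label rows are not in the same order")

# Toy frames standing in for X_train and Y_train after set_index.
idx = pd.MultiIndex.from_tuples(
    [("T001", "A001"), ("T001", "A002")], names=["td", "code"]
)
X_toy = pd.DataFrame({"F001": [0.1, 0.2]}, index=idx)
y_toy = pd.DataFrame({"target": [0.01, -0.02]}, index=idx)

check_alignment(X_toy, y_toy)  # passes silently when the indices match
```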

# EDA

In [8]:
X_train.shape, Y_train.shape, Y2_train.shape, X_test.shape

((83564, 46), (83564, 1), (83564, 1), (11618, 46))

In [9]:
X_train.describe()

Unnamed: 0,F001,F002,F003,F004,F005,F006,F007,F008,F009,F010,...,F037,F038,F039,F040,F041,F042,F043,F044,F045,F046
count,80147.0,72494.0,82138.0,83448.0,83448.0,69456.0,82518.0,82504.0,71432.0,70884.0,...,83448.0,80338.0,71058.0,83383.0,70049.0,67884.0,83448.0,83559.0,83448.0,83102.0
mean,7.89437,0.000915,3.212757,0.202291,2.075079,0.052969,0.024497,1.753955,0.909854,-2.7e-05,...,6.097276,0.003239,0.001637,3.776357,0.003051,0.678024,23.730528,1.000124,17.703937,0.49589
std,22.1013,0.03923,17.267118,2.020238,14.543002,0.069505,0.011561,8.342232,0.601707,0.016406,...,29.899738,0.328621,0.025129,1.994943,0.034708,8.502843,76.855077,0.269099,61.836535,4.649171
min,-52.092279,-3.832419,0.169382,-76.84,-59.385189,-0.866186,0.00569,-26.092628,-0.165837,-0.367137,...,-70.21978,-43.672986,-0.727923,1.0,-0.727923,-92.5,-92.017722,0.013118,-89.84329,-27.129617
25%,-6.103297,-0.001195,0.919447,-0.02,-5.553225,0.020651,0.017816,-4.073285,0.440529,-0.001739,...,-8.556635,-0.006791,-0.002727,2.0,-0.0054,-1.198204,-12.293946,0.864735,-11.29578,-2.745036
50%,3.184114,0.0,1.551248,0.0,0.440529,0.042782,0.022341,0.917061,0.8,0.0,...,1.724138,0.002463,0.0,4.0,0.0,0.0,5.0,0.985998,4.21326,0.234027
75%,16.197738,0.0004,3.090994,0.059686,7.606289,0.073355,0.028253,6.597034,1.234568,0.001397,...,14.653135,0.014523,0.002716,5.0,0.00732,2.083333,36.567964,1.102857,29.559748,3.478132
max,470.723992,4.133958,983.606913,345.8,691.925065,1.670955,0.383658,109.968661,10.0,2.307626,...,1563.043478,5.31708,1.160028,7.0,1.160612,207.692308,3125.524476,21.820324,2900.0,56.84502


In [10]:
X_test.describe()

Unnamed: 0,F001,F002,F003,F004,F005,F006,F007,F008,F009,F010,...,F037,F038,F039,F040,F041,F042,F043,F044,F045,F046
count,11086.0,9532.0,11562.0,11617.0,11617.0,9177.0,11390.0,11507.0,9470.0,9356.0,...,11617.0,11053.0,9360.0,11593.0,9234.0,8888.0,11617.0,11618.0,11617.0,11575.0
mean,5.547566,-0.001151,3.907227,0.902523,-0.122685,0.050943,0.027775,-1.136299,0.976132,-0.00094,...,-0.063615,0.001916,-0.001705,3.877081,-0.002398,-1.207978,28.062455,1.008501,18.076916,-0.338304
std,20.564398,0.020017,8.417691,14.744368,15.264813,0.068831,0.013069,7.346431,0.702352,0.011895,...,31.048006,0.036362,0.024202,2.048031,0.030882,8.916721,91.104617,0.280727,77.078929,4.557944
min,-29.721986,-0.97634,0.161449,-38.6,-49.50495,-0.825617,0.007313,-24.110365,0.01295,-0.50264,...,-68.425926,-0.927502,-1.025736,1.0,-1.025736,-97.878788,-83.402256,0.052986,-78.882818,-15.909403
25%,-7.672988,-0.001494,0.852407,-0.023139,-8.49359,0.018199,0.01974,-6.20657,0.409836,-0.002002,...,-14.608696,-0.007728,-0.003912,2.0,-0.008019,-3.225806,-15.384615,0.864336,-17.345873,-3.557173
50%,0.693708,0.0,1.727881,0.0,-1.408451,0.04022,0.025025,-2.023613,0.819672,-0.000167,...,-4.66563,0.001277,-0.000294,4.0,-0.001443,0.0,0.0,0.993835,-1.382488,-0.639357
75%,12.59485,0.000157,3.96713,0.014085,6.226415,0.070894,0.031825,2.922284,1.408451,0.000461,...,7.053942,0.010803,0.000951,6.0,0.002659,0.835084,33.781764,1.117372,22.991071,2.610749
max,196.038447,0.392012,189.233466,396.80783,273.619632,1.537437,0.124297,70.851211,11.111111,0.15701,...,791.625616,0.254544,0.341028,7.0,0.624205,80.28169,1058.244681,7.400282,900.0,26.269902


In [11]:
# imputation with -1 
X_train.fillna(-1, inplace = True)
X_test.fillna(-1, inplace = True)
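
Filling with -1 is the simplest option, but -1 lies inside the observed range of many features (F001, for instance, spans roughly -52 to 471 in the summary above), so a filled value cannot be told apart from a real one. One possible alternative, sketched here on toy data, is to impute with medians learned from the training rows only and to add explicit missing-value indicator columns. The toy frames are hypothetical, and this is not part of the original example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for X_train / X_test before any fillna call.
X_tr = pd.DataFrame({"F001": [1.0, np.nan, 3.0, 5.0],
                     "F002": [0.0, 0.2, np.nan, 0.4]})
X_te = pd.DataFrame({"F001": [np.nan, 2.0],
                     "F002": [0.1, np.nan]})

# Share of missing values per feature, useful when deciding what to drop.
missing_share = X_tr.isna().mean()
print(missing_share)

# Learn medians on the training rows only, then reuse them for the test rows.
# add_indicator=True appends one 0/1 column per feature that had missing
# values during fit, so the model can still see where data was absent.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_tr_imp = imputer.fit_transform(X_tr)
X_te_imp = imputer.transform(X_te)
print(X_te_imp)
```

Fitting the imputer on the training rows alone keeps test-set statistics out of the preprocessing step.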

# Modeling
- classification or regression

In [12]:
# use a classifier in this example
from sklearn.ensemble import ExtraTreesClassifier


In [13]:
model = ExtraTreesClassifier(n_estimators=100, max_depth = 20)

In [14]:
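# For a regression approach, use Y_train (with a regressor) instead.
# ravel() flattens the (n, 1) label column into the 1-D array scikit-learn expects.
model.fit(X_train.values, Y2_train.values.ravel())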


  


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=20, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [15]:
print(model.score(X_train.values, Y2_train.values.ravel()))  # accuracy on the training data itself

0.8002728447656886
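
The score above is measured on the same rows the model was trained on, so it says little about how the model would do on unseen periods. One way to get a more realistic estimate is to hold out the most recent time points for validation. The sketch below uses random toy data and assumes that `td` labels are zero-padded so that sorting them as strings matches time order; that is an assumption about the label format, not something the assignment states.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Toy data standing in for X_train / Y2_train indexed by (td, code).
rng = np.random.default_rng(0)
periods = [f"T{i:03d}" for i in range(1, 11)]
codes = [f"A{i:03d}" for i in range(1, 31)]
index = pd.MultiIndex.from_product([periods, codes], names=["td", "code"])
X = pd.DataFrame(rng.normal(size=(len(index), 4)), index=index,
                 columns=["F001", "F002", "F003", "F004"])
y = pd.Series(rng.integers(0, 5, size=len(index)), index=index, name="target")

# Hold out the last 2 periods as a validation set (assumes sortable td labels).
td_values = X.index.get_level_values("td")
valid_periods = sorted(td_values.unique())[-2:]
is_valid = td_values.isin(valid_periods)

model = ExtraTreesClassifier(n_estimators=50, max_depth=5, random_state=0)
model.fit(X[~is_valid].values, y[~is_valid].values)

train_acc = model.score(X[~is_valid].values, y[~is_valid].values)
valid_acc = model.score(X[is_valid].values, y[is_valid].values)
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {valid_acc:.3f}")
```

Splitting by period rather than at random keeps all stocks from a given time point on the same side of the split, which mirrors how the test set follows the train set in time.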


# Make prediction

In [16]:
pred = model.predict(X_test.values)  # pass a plain array, matching how the model was fitted

# Make submission

In [17]:
submission = pd.DataFrame(pred, columns = ['target'], index = X_test.index)

In [18]:
submission.head()

| td | code | target |
|---|---|---|
| T275 | A005 | 0 |
| T275 | A006 | 1 |
| T275 | A007 | 2 |
| T275 | A008 | 4 |
| T275 | A012 | 0 |
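
The submission format asks for the predictions as a CSV file. A minimal sketch of that last step follows, using a toy stand-in for the `submission` frame. The file name `submission.csv` and the temporary directory are assumptions; the assignment does not prescribe a name or location.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the `submission` frame built above.
index = pd.MultiIndex.from_tuples(
    [("T275", "A005"), ("T275", "A006"), ("T275", "A007")],
    names=["td", "code"],
)
submission = pd.DataFrame({"target": [0, 1, 2]}, index=index)

# 'submission.csv' is a hypothetical file name chosen for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")

# The index is written by default, so td and code become the first two columns.
submission.to_csv(out_path)

# Read the file back to confirm its layout.
check = pd.read_csv(out_path)
print(check.columns.tolist())  # ['td', 'code', 'target']
```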
