# EDA with Time

* This kernel looks at the time-related features.
* Time matters because the Data Description contains the following note:
> Malware detection is inherently a time-series problem, but it is made complicated by the introduction of new machines, machines that come online and offline, machines that receive patches, machines that receive new operating systems, etc. While the dataset provided here has been roughly split by time, the complications and sampling requirements mentioned above may mean you may see imperfect agreement between your cross validation, public, and private scores!
* In short: the organizers split the dataset into train and test roughly by time, and the public and private sets are also split roughly by time, so be careful.

## Libraries

In [1]:
import numpy as np
import pandas as pd
import warnings
import gc
warnings.filterwarnings("ignore")

In [2]:
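# Use the full option names: the short forms are ambiguous in recent pandas versions.
pd.set_option('display.max_rows', 150)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_columns', 500)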

## Load data

In [3]:
dtypes = {
        'MachineIdentifier':                                    'object',
        'ProductName':                                          'category',
        'EngineVersion':                                        'category',
        'AppVersion':                                           'category',
        'AvSigVersion':                                         'category',
        'IsBeta':                                               'int8',
        'RtpStateBitfield':                                     'float16',
        'IsSxsPassiveMode':                                     'int8',
        'DefaultBrowsersIdentifier':                            'float16',
        'AVProductStatesIdentifier':                            'float32',
        'AVProductsInstalled':                                  'float16',
        'AVProductsEnabled':                                    'float16',
        'HasTpm':                                               'int8',
        'CountryIdentifier':                                    'int16',
        'CityIdentifier':                                       'float32',
        'OrganizationIdentifier':                               'float16',
        'GeoNameIdentifier':                                    'float16',
        'LocaleEnglishNameIdentifier':                          'int8',
        'Platform':                                             'category',
        'Processor':                                            'category',
        'OsVer':                                                'category',
        'OsBuild':                                              'int16',
        'OsSuite':                                              'int16',
        'OsPlatformSubRelease':                                 'category',
        'OsBuildLab':                                           'category',
        'SkuEdition':                                           'category',
        'IsProtected':                                          'float16',
        'AutoSampleOptIn':                                      'int8',
        'PuaMode':                                              'category',
        'SMode':                                                'float16',
        'IeVerIdentifier':                                      'float16',
        'SmartScreen':                                          'category',
        'Firewall':                                             'float16',
        'UacLuaenable':                                         'float32',
        'Census_MDC2FormFactor':                                'category',
        'Census_DeviceFamily':                                  'category',
        'Census_OEMNameIdentifier':                             'float16',
        'Census_OEMModelIdentifier':                            'float32',
        'Census_ProcessorCoreCount':                            'float16',
        'Census_ProcessorManufacturerIdentifier':               'float16',
        'Census_ProcessorModelIdentifier':                      'float16',
        'Census_ProcessorClass':                                'category',
        'Census_PrimaryDiskTotalCapacity':                      'float32',
        'Census_PrimaryDiskTypeName':                           'category',
        'Census_SystemVolumeTotalCapacity':                     'float32',
        'Census_HasOpticalDiskDrive':                           'int8',
        'Census_TotalPhysicalRAM':                              'float32',
        'Census_ChassisTypeName':                               'category',
        'Census_InternalPrimaryDiagonalDisplaySizeInInches':    'float16',
        'Census_InternalPrimaryDisplayResolutionHorizontal':    'float16',
        'Census_InternalPrimaryDisplayResolutionVertical':      'float16',
        'Census_PowerPlatformRoleName':                         'category',
        'Census_InternalBatteryType':                           'category',
        'Census_InternalBatteryNumberOfCharges':                'float32',
        'Census_OSVersion':                                     'category',
        'Census_OSArchitecture':                                'category',
        'Census_OSBranch':                                      'category',
        'Census_OSBuildNumber':                                 'int16',
        'Census_OSBuildRevision':                               'int32',
        'Census_OSEdition':                                     'category',
        'Census_OSSkuName':                                     'category',
        'Census_OSInstallTypeName':                             'category',
        'Census_OSInstallLanguageIdentifier':                   'float16',
        'Census_OSUILocaleIdentifier':                          'int16',
        'Census_OSWUAutoUpdateOptionsName':                     'category',
        'Census_IsPortableOperatingSystem':                     'int8',
        'Census_GenuineStateName':                              'category',
        'Census_ActivationChannel':                             'category',
        'Census_IsFlightingInternal':                           'float16',
        'Census_IsFlightsDisabled':                             'float16',
        'Census_FlightRing':                                    'category',
        'Census_ThresholdOptIn':                                'float16',
        'Census_FirmwareManufacturerIdentifier':                'float16',
        'Census_FirmwareVersionIdentifier':                     'float32',
        'Census_IsSecureBootEnabled':                           'int8',
        'Census_IsWIMBootEnabled':                              'float16',
        'Census_IsVirtualDevice':                               'float16',
        'Census_IsTouchEnabled':                                'int8',
        'Census_IsPenCapable':                                  'int8',
        'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',
        'Wdft_IsGamer':                                         'float16',
        'Wdft_RegionIdentifier':                                'float16',
        'HasDetections':                                        'int8'
        }

In [4]:
train = pd.read_csv('./data/train.csv', dtype=dtypes)
test = pd.read_csv('./data/test.csv', dtype=dtypes)

## Time-related columns

### ProductName
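* The lineage of Windows' default EPP (endpoint protection platform) products could indirectly reveal time ordering.
* Not particularly meaningful: the distribution is almost identical in train and test.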

In [9]:
train.ProductName.value_counts(dropna=False)

win8defender     8826520
mse                94873
mseprerelease         53
scep                  22
windowsintune          8
fep                    7
Name: ProductName, dtype: int64

In [10]:
test.ProductName.value_counts(dropna=False)

win8defender     7797245
mse                55946
mseprerelease         34
scep                  16
fep                    7
windowsintune          5
Name: ProductName, dtype: int64
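
Raw counts are hard to compare because train and test differ in size. The sketch below puts relative frequencies side by side instead. `compare_shares` is a hypothetical helper, not part of the original kernel, and the toy Series are for illustration only.

```python
import pandas as pd


def _shares(col: pd.Series, name: str) -> pd.Series:
    """Relative frequency of each value, with a plain object index."""
    shares = col.value_counts(normalize=True, dropna=False).rename(name)
    # Category indexes with different categories can be awkward to align,
    # so fall back to a plain object index before concatenating.
    shares.index = shares.index.astype(object)
    return shares


def compare_shares(train_col: pd.Series, test_col: pd.Series) -> pd.DataFrame:
    """Return train/test shares per value and their difference (test - train)."""
    table = pd.concat(
        [_shares(train_col, "train"), _shares(test_col, "test")], axis=1
    ).fillna(0.0)
    table["diff"] = table["test"] - table["train"]
    return table.sort_values("train", ascending=False)


# Toy data for illustration only (not the real columns).
toy_train = pd.Series(["win8defender"] * 98 + ["mse"] * 2)
toy_test = pd.Series(["win8defender"] * 99 + ["mse"] * 1)
result = compare_shares(toy_train, toy_test)
print(result)
```

With the loaded data this would be called as `compare_shares(train.ProductName, test.ProductName)`. A `diff` close to zero for every value suggests the column carries little information about time.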

### EngineVersion
* Windows Defender and MSE share the same engine and signatures.
* Could the engine version naming give a hint about time?
* The test set has relatively higher engine versions!

In [32]:
train.EngineVersion.value_counts(dropna=False)[:10]

1.1.15200.1    3845067
1.1.15100.1    3675915
1.1.15000.2     265218
1.1.14901.4     212408
1.1.14600.4     160585
1.1.14800.3     136476
1.1.15300.6     120295
1.1.14104.0      93926
1.1.13504.0      70645
1.1.15300.5      68716
Name: EngineVersion, dtype: int64

In [33]:
test.EngineVersion.value_counts(dropna=False)[:10]

1.1.15300.6    3101305
1.1.15400.4    2106236
1.1.15400.5    1491273
1.1.15200.1     366085
1.1.15100.1     158036
1.1.14600.4     138514
1.1.14901.4      75808
1.1.14104.0      72795
1.1.15000.2      66904
1.1.14800.3      55992
Name: EngineVersion, dtype: int64
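
To check the "test has higher versions" observation numerically, the version strings need to be compared as numbers, not as text (as text, `1.1.9.0` sorts after `1.1.10.0`). The following is a minimal sketch. `version_key` and `weighted_median_version` are hypothetical helpers, and the dictionaries hold only the top three counts from the outputs above, so the result describes those rows only.

```python
from typing import Dict, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '1.1.15200.1' into (1, 1, 15200, 1) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def weighted_median_version(counts: Dict[str, int]) -> str:
    """Version at which the cumulative row count first reaches half the total."""
    total = sum(counts.values())
    running = 0
    for version in sorted(counts, key=version_key):
        running += counts[version]
        if running * 2 >= total:
            return version
    raise ValueError("counts must not be empty")


# Top-3 counts copied from the outputs above; all other versions are omitted.
train_top = {"1.1.15200.1": 3845067, "1.1.15100.1": 3675915, "1.1.15000.2": 265218}
test_top = {"1.1.15300.6": 3101305, "1.1.15400.4": 2106236, "1.1.15400.5": 1491273}

print(weighted_median_version(train_top))
print(weighted_median_version(test_top))
```

On the full columns, the same key could be applied with `train.EngineVersion.astype(str).map(version_key)`, assuming every value is a dot-separated list of integers.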

### AppVersion
* The app version should also allow time ordering to be inferred indirectly.
* The exact naming rule is unclear, but the version is related to time to some degree.

In [34]:
train.AppVersion.value_counts(dropna=False)[:10]

4.18.1807.18075     5139224
4.18.1806.18062      850929
4.12.16299.15        359871
4.10.209.0           272455
4.13.17134.1         257270
4.16.17656.18052     235032
4.13.17134.228       226501
4.8.10240.17443      205480
4.9.10586.1106       203525
4.14.17639.18041     194699
Name: AppVersion, dtype: int64

In [35]:
test.AppVersion.value_counts(dropna=False)[:10]

4.18.1809.2        2738721
4.18.1810.5        2129928
4.18.1807.18075     685600
4.12.16299.15       267102
4.13.17134.1        231117
4.8.10240.17443     193085
4.9.10586.1106      176442
4.13.17134.320      169995
4.10.209.0          155962
4.13.17134.228      118351
Name: AppVersion, dtype: int64
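
One possible reading of the values above is that, for `4.18.*` versions, the third component looks like a two-digit year followed by a month (`1807`, `1809`, `1810`). This is only an assumption drawn from the listed values, not a documented rule. The hypothetical helper below extracts a `(year, month)` pair under that assumption and returns `None` for anything that does not fit.

```python
from typing import Optional, Tuple


def app_version_year_month(app_version: str) -> Optional[Tuple[int, int]]:
    """Guess (year, month) from an AppVersion such as '4.18.1807.18075'.

    ASSUMPTION: for 4.18.* versions the third component is YYMM.
    Other version families (for example 4.12.16299.15) return None.
    """
    parts = app_version.split(".")
    if len(parts) != 4 or parts[:2] != ["4", "18"]:
        return None
    yymm = parts[2]
    if len(yymm) != 4 or not yymm.isdigit():
        return None
    year, month = 2000 + int(yymm[:2]), int(yymm[2:])
    if not 1 <= month <= 12:
        return None
    return year, month


for value in ["4.18.1807.18075", "4.18.1810.5", "4.12.16299.15"]:
    print(value, app_version_year_month(value))
```

If the assumption holds, the dominant train version maps to July 2018 and the dominant test versions to September and October 2018, which would agree with the rough time split.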

### AvSigVersion
* Related to time to some extent, but this column alone does not cleanly separate train from test.

In [36]:
train.AvSigVersion.value_counts(dropna=False)[:10]

1.273.1420.0    102317
1.263.48.0       98024
1.275.1140.0     97232
1.275.727.0      92448
1.273.371.0      86967
1.273.1826.0     86013
1.275.1244.0     78902
1.251.42.0       76837
1.275.1209.0     66393
1.273.810.0      65895
Name: AvSigVersion, dtype: int64

In [37]:
test.AvSigVersion.value_counts(dropna=False)[:10]

1.263.48.0      132624
1.277.515.0      80393
1.251.42.0       73723
1.279.102.0      73108
1.279.32.0       67893
1.277.96.0       57377
1.277.1044.0     54981
1.277.1102.0     52147
1.237.0.0        47158
1.281.261.0      46284
Name: AvSigVersion, dtype: int64
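
The individual signature versions are too fine-grained to compare directly, but the second component (`273`, `275` in train versus `277`, `279`, `281` in test) appears to move with time. A minimal sketch that extracts it follows. `sig_minor` is a hypothetical helper, and the sample values are taken from the outputs above.

```python
import pandas as pd


def sig_minor(av_sig_version: pd.Series) -> pd.Series:
    """Extract the second component of AvSigVersion ('1.273.1420.0' -> 273)."""
    minor = av_sig_version.astype(str).str.split(".").str[1]
    # Malformed values become NaN instead of raising an error.
    return pd.to_numeric(minor, errors="coerce")


# A few values from the outputs above, for illustration only.
train_sample = pd.Series(["1.273.1420.0", "1.263.48.0", "1.275.1140.0"])
test_sample = pd.Series(["1.263.48.0", "1.277.515.0", "1.279.102.0"])

print(sig_minor(train_sample).max())
print(sig_minor(test_sample).max())
```

Older values such as `1.263.48.0` are common in both sets, which is consistent with the note that this column does not separate train and test perfectly.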

### Platform
* `windows10` is the newest platform.
* `windows2016` is a Windows Server product.
* Nothing notable.

In [6]:
train.Platform.value_counts(dropna=False)

windows10      8618715
windows8        194508
windows7         93889
windows2016      14371
Name: Platform, dtype: int64

In [8]:
test.Platform.value_counts(dropna=False)

windows10      7675480
windows8        111547
windows7         55240
windows2016      10986
Name: Platform, dtype: int64

### OsVer
* Operating system version information.
* Not very meaningful.

In [10]:
train.OsVer.value_counts(dropna=False)[:5]

10.0.0.0    8632545
6.3.0.0      194447
6.1.1.0       93268
6.1.0.0         582
10.0.3.0        225
Name: OsVer, dtype: int64

In [12]:
test.OsVer.value_counts(dropna=False)[:5]

10.0.0.0    7686083
6.3.0.0      111520
6.1.1.0       54653
6.1.0.0         572
10.0.3.0        167
Name: OsVer, dtype: int64

### OsBuild
* Train and test look almost the same.

In [14]:
train.OsBuild.value_counts(dropna=False)[:10]

17134    3915521
16299    2503681
15063     780270
14393     730819
10586     411606
10240     270192
9600      194508
7601       93306
17692       3184
17738       2478
Name: OsBuild, dtype: int64

In [16]:
test.OsBuild.value_counts(dropna=False)[:10]

17134    3893188
16299    1690582
15063     616089
14393     575569
10586     367350
17763     280705
10240     248346
9600      111547
7601       54667
18252       3246
Name: OsBuild, dtype: int64

### OsPlatformSubRelease
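* Name of the sub-release.
* Only rs5 (`prers5`) looks meaningful: it is far more common in test (295,341 rows) than in train (20,992 rows).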

In [17]:
train.OsPlatformSubRelease.value_counts(dropna=False)

rs4           3915526
rs3           2503681
rs2            780270
rs1            730819
th2            411606
th1            270192
windows8.1     194508
windows7        93889
prers5          20992
Name: OsPlatformSubRelease, dtype: int64

In [19]:
test.OsPlatformSubRelease.value_counts(dropna=False)

rs4           3893189
rs3           1690582
rs2            616089
rs1            575569
th2            367350
prers5         295341
th1            248346
windows8.1     111547
windows7        55240
Name: OsPlatformSubRelease, dtype: int64
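
Because the sub-release names follow a release sequence, they can be turned into an ordinal feature. The order below is an assumption inferred from the matching `OsBuild` numbers shown earlier (for example `th1` and build 10240 have the same count). `RELEASE_ORDER` and `sub_release_rank` are hypothetical names.

```python
import pandas as pd

# ASSUMED chronological order, oldest first, inferred from the OsBuild numbers.
RELEASE_ORDER = [
    "windows7", "windows8.1", "th1", "th2",
    "rs1", "rs2", "rs3", "rs4", "prers5",
]


def sub_release_rank(col: pd.Series) -> pd.Series:
    """Map each sub-release name to its position in RELEASE_ORDER (unknown -> -1)."""
    ordered = pd.Categorical(col.astype(str), categories=RELEASE_ORDER, ordered=True)
    return pd.Series(ordered.codes, index=col.index)


sample = pd.Series(["rs4", "th1", "prers5", "unknown"])
ranks = sub_release_rank(sample)
print(ranks.tolist())
```

Comparing `sub_release_rank(train.OsPlatformSubRelease).mean()` with the same value for test would give a single number summarizing how much newer the test machines are.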

### OsBuildLab

In [22]:
train.OsBuildLab.value_counts(dropna=False)[:20]

17134.1.amd64fre.rs4_release.180410-1804                 3658199
16299.431.amd64fre.rs3_release_svc_escrow.180502-1908    1252674
16299.15.amd64fre.rs3_release.170928-1534                 961060
15063.0.amd64fre.rs2_release.170317-1834                  718033
17134.1.x86fre.rs4_release.180410-1804                    257074
16299.15.x86fre.rs3_release.170928-1534                   233449
14393.2189.amd64fre.rs1_release.180329-1711               193636
10240.17443.amd64fre.th1.170602-2340                      171990
10586.1176.amd64fre.th2_release_sec.170913-1848           148259
15063.0.x86fre.rs2_release.170317-1834                     62237
14393.0.amd64fre.rs1_release.160715-1616                   58292
9600.19101.amd64fre.winblue_ltsb_escrow.180718-1800        56036
9600.19067.amd64fre.winblue_ltsb_escrow.180619-2033        55853
16299.637.amd64fre.rs3_release_svc.180808-1748             44817
14393.2189.x86fre.rs1_release.180329-1711                  39392
10586.1176.x86fre.th2_rel

In [23]:
test.OsBuildLab.value_counts(dropna=False)[:20]

17134.1.amd64fre.rs4_release.180410-1804                 3628554
16299.15.amd64fre.rs3_release.170928-1534                 731698
15063.0.amd64fre.rs2_release.170317-1834                  570217
16299.431.amd64fre.rs3_release_svc_escrow.180502-1908     522601
16299.637.amd64fre.rs3_release_svc.180808-1748            277431
17763.1.amd64fre.rs5_release.180914-1434                  266323
17134.1.x86fre.rs4_release.180410-1804                    264521
14393.2189.amd64fre.rs1_release.180329-1711               163371
10240.17443.amd64fre.th1.170602-2340                      162781
16299.15.x86fre.rs3_release.170928-1534                   132176
10586.1176.amd64fre.th2_release_sec.170913-1848           131801
14393.0.amd64fre.rs1_release.160715-1616                   57739
9600.19153.amd64fre.winblue_ltsb.180908-0600               49398
15063.0.x86fre.rs2_release.170317-1834                     45868
10586.0.amd64fre.th2_release.151029-1700                   32668
14393.2189.x86fre.rs1_rel
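
The `OsBuildLab` strings above end in a pattern such as `180410-1804`, which looks like a `YYMMDD-HHMM` build timestamp. Assuming that reading is correct, the sketch below parses it into a `datetime`. `build_lab_timestamp` is a hypothetical helper, and strings without the trailing stamp return `None`.

```python
import re
from datetime import datetime
from typing import Optional

# Trailing '.YYMMDD-HHMM' at the end of the string.
_BUILD_STAMP = re.compile(r"\.(\d{6})-(\d{4})$")


def build_lab_timestamp(os_build_lab: str) -> Optional[datetime]:
    """Parse the trailing stamp of an OsBuildLab value.

    ASSUMPTION: the last component is a build timestamp in YYMMDD-HHMM form.
    """
    match = _BUILD_STAMP.search(os_build_lab)
    if match is None:
        return None
    try:
        return datetime.strptime("".join(match.groups()), "%y%m%d%H%M")
    except ValueError:
        # Digits that do not form a valid date, for example month 13.
        return None


print(build_lab_timestamp("17134.1.amd64fre.rs4_release.180410-1804"))
print(build_lab_timestamp("17763.1.amd64fre.rs5_release.180914-1434"))
```

On the full column this could be applied with `train.OsBuildLab.astype(str).map(build_lab_timestamp)`, which gives an explicit date to compare between train and test.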

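### Census_OSVersion
* Meaningful: the most common versions in test (for example `10.0.17134.345`) are newer than the most common versions in train (for example `10.0.17134.228`).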

In [28]:
train.Census_OSVersion.value_counts(dropna=False)[:10]

10.0.17134.228     1413627
10.0.17134.165      899711
10.0.16299.431      546546
10.0.17134.285      470280
10.0.16299.547      346853
10.0.17134.112      346410
10.0.16299.371      325267
10.0.17134.191      228254
10.0.14393.2189     223775
10.0.16299.611      216776
Name: Census_OSVersion, dtype: int64

In [29]:
test.Census_OSVersion.value_counts(dropna=False)[:10]

10.0.17134.345      1377565
10.0.17134.285       669674
10.0.17134.407       520122
10.0.17134.286       365793
10.0.16299.431       283978
10.0.17134.112       227059
10.0.10240.17443     195916
10.0.16299.371       195793
10.0.14393.2189      191389
10.0.10586.1176      164920
Name: Census_OSVersion, dtype: int64

### Census_OSBranch

In [31]:
train.Census_OSBranch.value_counts(dropna=False)[:10]

rs4_release               4009158
rs3_release               1237321
rs3_release_svc_escrow    1199767
rs2_release                797066
rs1_release                785534
th2_release                326655
th2_release_sec            266882
th1_st1                    195840
th1                         75764
rs5_release                 15324
Name: Census_OSBranch, dtype: int64

In [32]:
test.Census_OSBranch.value_counts(dropna=False)[:10]

rs4_release               3979144
rs3_release               1127648
rs2_release                626698
rs1_release                603919
rs3_release_svc_escrow     496021
rs5_release                290608
th2_release                240049
th2_release_sec            224555
th1_st1                    183721
th1                         67214
Name: Census_OSBranch, dtype: int64

### Census_OSBuildNumber

In [35]:
train.Census_OSBuildNumber.value_counts(dropna=False)[:10]

17134    4008881
16299    2443249
15063     797049
14393     785450
10586     593527
10240     271604
17692       3096
17738       3062
17744       2372
17758       1703
Name: Census_OSBuildNumber, dtype: int64

In [36]:
test.Census_OSBuildNumber.value_counts(dropna=False)[:10]

17134    3978878
16299    1627934
15063     626684
14393     603871
10586     464599
17763     285400
10240     250935
18252       3299
17758       2984
18282       1214
Name: Census_OSBuildNumber, dtype: int64

### Census_OSBuildRevision
* Important! The most frequent revisions differ clearly between train and test.

In [39]:
train.Census_OSBuildRevision.value_counts(dropna=False)[:10]

228     1413633
165      899712
431      546548
285      470280
547      346853
112      346488
371      325267
191      228255
2189     223775
611      216776
Name: Census_OSBuildRevision, dtype: int64

In [40]:
test.Census_OSBuildRevision.value_counts(dropna=False)[:10]

345      1377568
285       669675
407       520122
286       365795
431       283978
112       227122
17443     195916
371       195793
2189      191389
1         179005
Name: Census_OSBuildRevision, dtype: int64
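
One way to quantify how far a column has drifted is the share of test rows whose value never appears in train. The sketch below uses a hypothetical helper `unseen_share`. The toy values are for illustration only and say nothing about which revisions are actually missing from the full train set.

```python
import pandas as pd


def unseen_share(train_col: pd.Series, test_col: pd.Series) -> float:
    """Fraction of test rows whose value does not occur anywhere in train.

    Missing values in test are counted as unseen.
    """
    seen = set(train_col.dropna().unique())
    return float((~test_col.isin(seen)).mean())


# Toy data for illustration only (not the real columns).
toy_train = pd.Series([228, 165, 431, 285])
toy_test = pd.Series([345, 285, 407, 431])
print(unseen_share(toy_train, toy_test))
```

With the loaded data, `unseen_share(train.Census_OSBuildRevision, test.Census_OSBuildRevision)` would return a value between 0 and 1. A large value would mean a model trained on raw category values sees many revisions for the first time at test time.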

### Census_OSInstallTypeName

In [41]:
train.Census_OSInstallTypeName.value_counts(dropna=False)

UUPUpgrade        2608037
IBSClean          1650733
Update            1593308
Upgrade           1251559
Other              840121
Reset              649201
Refresh            205842
Clean               69073
CleanPCRefresh      53609
Name: Census_OSInstallTypeName, dtype: int64

In [42]:
test.Census_OSInstallTypeName.value_counts(dropna=False)

IBSClean          2110259
UUPUpgrade        1847191
Update            1109763
Upgrade            942760
Other              809406
Reset              630962
Refresh            256050
Clean               74920
CleanPCRefresh      71942
Name: Census_OSInstallTypeName, dtype: int64