Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [X] Choose your target. Which column in your tabular dataset will you predict?
- [X] Is your problem regression or classification?
- [X] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [X] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone is acceptable. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [X] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [X] Begin to clean and explore your data.
- [X] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.
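
For the random versus time-based split question in the checklist, the two options can be sketched as follows. This is a minimal illustration on a made-up dataframe: the column names `timestamp` and `y`, the helper names, and the 60/20/20 proportions are all assumptions, not part of the assignment.

```python
import pandas as pd

# Toy data: 10 rows with a hypothetical timestamp column.
df = pd.DataFrame({
    'timestamp': pd.date_range('2018-01-01', periods=10, freq='D'),
    'y': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

def _cut(frame, frac_train, frac_val):
    """Cut an already-ordered frame into train/val/test pieces."""
    n_train = round(len(frame) * frac_train)
    n_val = round(len(frame) * frac_val)
    return (frame.iloc[:n_train],
            frame.iloc[n_train:n_train + n_val],
            frame.iloc[n_train + n_val:])

def random_split(frame, frac_train=0.6, frac_val=0.2, seed=42):
    """Shuffle the rows, then cut them into three pieces."""
    return _cut(frame.sample(frac=1, random_state=seed), frac_train, frac_val)

def time_split(frame, time_col, frac_train=0.6, frac_val=0.2):
    """Sort by time so validation and test rows come after training rows."""
    return _cut(frame.sort_values(time_col), frac_train, frac_val)

train_r, val_r, test_r = random_split(df)
train_t, val_t, test_t = time_split(df, 'timestamp')
print(len(train_r), len(val_r), len(test_r))
```

A time-based split is usually the safer choice when the model will be used on observations collected after the training data.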

In [1]:
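import numpy as np
import pandas as pd
import dask.dataframe as dd
import random

# Use fully qualified option names; short aliases can be ambiguous in newer pandas
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 100)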

import sys
if not sys.warnoptions:
  import warnings
  warnings.simplefilter('ignore')

In [2]:
DATA_PATH = '../data/'

In [3]:
def reduce_mem_usage(df, verbose=True):
  """
  Iterate through all the columns of a dataframe and modify the data types
  to reduce memory usage.        
  """
  df = df.copy()
  
  start_mem = df.memory_usage().sum() / 1024**2
  start_mem_gb = start_mem / 1024
  
  # Include int8 so already-small integer columns are not recast as categories
  numeric_dtype = ['int8', 'int16', 'int32', 'int64',
                   'float16', 'float32', 'float64']
  
  for col in df:
    col_type = str(df[col].dtypes)
        
    if col_type in numeric_dtype:
      c_min = df[col].min()
      c_max = df[col].max()
      if col_type[:3] == 'int':
        if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
          df[col] = df[col].astype(np.int8)
        elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
          df[col] = df[col].astype(np.int16)
        elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
          df[col] = df[col].astype(np.int32)
        elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
          df[col] = df[col].astype(np.int64)  
      else:  # column is not int
        if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
          df[col] = df[col].astype(np.float16)
        elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
          df[col] = df[col].astype(np.float32)
        else:
          df[col] = df[col].astype(np.float64)
    else:  # any other dtype (e.g. object) becomes a category
      df[col] = df[col].astype('category')

  end_mem = df.memory_usage().sum() / 1024**2
  end_mem_gb = end_mem / 1024
  
  if verbose:
    print(f'Memory usage of dataframe is {start_mem:.2f} MB',
        f'/ {start_mem_gb:.2f} GB')
    print(f'Memory usage after optimization is: {end_mem:.2f} MB',
        f'/ {end_mem_gb:.2f} GB')
    mem_dec = 100 * (start_mem - end_mem) / start_mem
    print(f'Decreased by {mem_dec:.1f}%')
    
  return df


def import_data(file, dtypes=None):
  """
  Read a CSV with dask, then materialize it as a pandas dataframe.
  """

  ddf = dd.read_csv(file, dtype=dtypes)
  df = ddf.compute()
  return df

In [4]:
dtypes = {
        'MachineIdentifier':                                    'category',
        'ProductName':                                          'category',
        'EngineVersion':                                        'category',
        'AppVersion':                                           'category',
        'AvSigVersion':                                         'category',
        'IsBeta':                                               'int8',
        'RtpStateBitfield':                                     'float16',
        'IsSxsPassiveMode':                                     'int8',
        'DefaultBrowsersIdentifier':                            'float32',
        'AVProductStatesIdentifier':                            'float32',
        'AVProductsInstalled':                                  'float16',
        'AVProductsEnabled':                                    'float16',
        'HasTpm':                                               'int8',
        'CountryIdentifier':                                    'int16',
        'CityIdentifier':                                       'float32',
        'OrganizationIdentifier':                               'float16',
        'GeoNameIdentifier':                                    'float16',
        'LocaleEnglishNameIdentifier':                          'int16',
        'Platform':                                             'category',
        'Processor':                                            'category',
        'OsVer':                                                'category',
        'OsBuild':                                              'int16',
        'OsSuite':                                              'int16',
        'OsPlatformSubRelease':                                 'category',
        'OsBuildLab':                                           'category',
        'SkuEdition':                                           'category',
        'IsProtected':                                          'float16',
        'AutoSampleOptIn':                                      'int8',
        'PuaMode':                                              'category',
        'SMode':                                                'float16',
        'IeVerIdentifier':                                      'float16',
        'SmartScreen':                                          'category',
        'Firewall':                                             'float16',
        'UacLuaenable':                                         'float64',
        'Census_MDC2FormFactor':                                'category',
        'Census_DeviceFamily':                                  'category',
        'Census_OEMNameIdentifier':                             'float32',
        'Census_OEMModelIdentifier':                            'float32',
        'Census_ProcessorCoreCount':                            'float16',
        'Census_ProcessorManufacturerIdentifier':               'float16',
        'Census_ProcessorModelIdentifier':                      'float32',
        'Census_ProcessorClass':                                'category',
        'Census_PrimaryDiskTotalCapacity':                      'float64',
        'Census_PrimaryDiskTypeName':                           'category',
        'Census_SystemVolumeTotalCapacity':                     'float64',
        'Census_HasOpticalDiskDrive':                           'int8',
        'Census_TotalPhysicalRAM':                              'float32',
        'Census_ChassisTypeName':                               'category',
        'Census_InternalPrimaryDiagonalDisplaySizeInInches':    'float32',
        'Census_InternalPrimaryDisplayResolutionHorizontal':    'float32',
        'Census_InternalPrimaryDisplayResolutionVertical':      'float32',
        'Census_PowerPlatformRoleName':                         'category',
        'Census_InternalBatteryType':                           'category',
        'Census_InternalBatteryNumberOfCharges':                'float64',
        'Census_OSVersion':                                     'category',
        'Census_OSArchitecture':                                'category',
        'Census_OSBranch':                                      'category',
        'Census_OSBuildNumber':                                 'int16',
        'Census_OSBuildRevision':                               'int32',
        'Census_OSEdition':                                     'category',
        'Census_OSSkuName':                                     'category',
        'Census_OSInstallTypeName':                             'category',
        'Census_OSInstallLanguageIdentifier':                   'float16',
        'Census_OSUILocaleIdentifier':                          'int16',
        'Census_OSWUAutoUpdateOptionsName':                     'category',
        'Census_IsPortableOperatingSystem':                     'int8',
        'Census_GenuineStateName':                              'category',
        'Census_ActivationChannel':                             'category',
        'Census_IsFlightingInternal':                           'float16',
        'Census_IsFlightsDisabled':                             'float16',
        'Census_FlightRing':                                    'category',
        'Census_ThresholdOptIn':                                'float16',
        'Census_FirmwareManufacturerIdentifier':                'float16',
        'Census_FirmwareVersionIdentifier':                     'float32',
        'Census_IsSecureBootEnabled':                           'int8',
        'Census_IsWIMBootEnabled':                              'float16',
        'Census_IsVirtualDevice':                               'float16',
        'Census_IsTouchEnabled':                                'int8',
        'Census_IsPenCapable':                                  'int8',
        'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',
        'Wdft_IsGamer':                                         'float16',
        'Wdft_RegionIdentifier':                                'float16',
        'HasDetections':                                        'int8'
}

In [5]:
target = 'HasDetections'
data_id = 'MachineIdentifier'

In [6]:
# train = import_data(DATA_PATH + 'project/train.csv', dtypes=dtypes)
train = import_data(DATA_PATH + 'project/reduced_train.csv', dtypes=dtypes)
train = reduce_mem_usage(train)

Memory usage of dataframe is 580.71 MB / 0.57 GB
Memory usage after optimization is: 509.25 MB / 0.50 GB
Decreased by 12.3%
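
One caveat about the savings reported above: `reduce_mem_usage` picks `float16` whenever a column's minimum and maximum fit inside the float16 range, without checking precision. `float16` has an 11-bit significand, so not every integer above 2048 can be represented. The sketch below uses made-up values to show the effect, which matters most for identifier-like columns.

```python
import numpy as np
import pandas as pd

# Made-up identifier-like values stored as floats.
ids = pd.Series([2047.0, 2049.0, 53447.0])

as_f16 = ids.astype(np.float16)   # smallest, but lossy for larger integers
as_f32 = ids.astype(np.float32)   # exact for these values

print(as_f16.tolist())
print(as_f32.tolist())

# Bytes used by the values alone (index excluded).
print(ids.memory_usage(index=False), as_f16.memory_usage(index=False))
```

If exact identifier values matter downstream, keeping such columns at `float32` or converting them to categories first is a safer choice.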


In [7]:
# test = import_data(DATA_PATH + 'project/test.csv', dtypes=dtypes)
test = import_data(DATA_PATH + 'project/reduced_test.csv', dtypes=dtypes)
test = reduce_mem_usage(test)

Memory usage of dataframe is 518.57 MB / 0.51 GB
Memory usage after optimization is: 451.17 MB / 0.44 GB
Decreased by 13.0%


In [8]:
"""
def select_subset(df, data_id='MachineIdentifier', ratio=0.3):
    import random
    
    random.seed(42)
    unique_id = df[data_id].unique().tolist()
    sel_id_list = random.sample(unique_id, round(len(df)*ratio))
    reduced_df = df[df[data_id].isin(sel_id_list)]
    
    return reduced_df
"""

In [9]:
# reduced_train = select_subset(train)
# reduced_test = select_subset(test)

In [10]:
# reduced_train.to_csv(DATA_PATH + 'project/reduced_train.csv', index=False)
# reduced_test.to_csv(DATA_PATH + 'project/reduced_test.csv', index=False)

In [11]:
# train = import_data(DATA_PATH + 'project/reduced_train.csv', dtypes=dtypes)
# train = reduce_mem_usage(train)
print(train.shape)

(2676445, 83)


In [12]:
col_stats = []
for col in train.columns:
  # Number of unique values in the column
  unique_vals = train[col].nunique()
  
  # Percent of NaNs in the column
  nan_pct = train[col].isnull().sum() * 100 / train.shape[0]

  # Percent of rows covered by the column's most common value (NaN included)
  col_counts = train[col].value_counts(normalize=True, dropna=False)
  largest_pct = col_counts.values[0] * 100

  # Datatype of the column
  col_type = train[col].dtype

  col_stats.append((col, unique_vals, nan_pct, largest_pct, col_type))

df_header = ['Feature', 'Unique_values', '% of Missing Values',
             'Largest Category %', 'Type']
stats_df = pd.DataFrame(col_stats, columns=df_header)
stats_df.sort_values(['% of Missing Values', 'Largest Category %'],
                     ascending=False)

Unnamed: 0,Feature,Unique_values,% of Missing Values,Largest Category %,Type
28,PuaMode,2,99.975714,99.975714,category
41,Census_ProcessorClass,3,99.585644,99.585644,category
8,DefaultBrowsersIdentifier,1075,95.134068,95.134068,float16
68,Census_IsFlightingInternal,2,83.014857,83.014857,float16
52,Census_InternalBatteryType,44,70.997013,70.997013,category
71,Census_ThresholdOptIn,2,63.458879,63.458879,float16
75,Census_IsWIMBootEnabled,1,63.372683,63.372683,float16
31,SmartScreen,18,35.589336,48.378801,category
15,OrganizationIdentifier,46,30.874985,47.041393,float16
29,SMode,2,6.033339,93.924217,float16


In [13]:
# Keep only features with at most 50% NaNs and a dominant value covering at most 95% of rows
condition = ((stats_df['% of Missing Values'] <= 50) &
             (stats_df['Largest Category %'] <= 95))
stats_df[condition]

Unnamed: 0,Feature,Unique_values,% of Missing Values,Largest Category %,Type
0,MachineIdentifier,2676445,0.0,3.7e-05,category
2,EngineVersion,64,0.0,43.094926,category
3,AppVersion,102,0.0,57.646654,category
4,AvSigVersion,8145,0.0,1.15149,category
9,AVProductStatesIdentifier,14463,0.405613,65.26934,float32
10,AVProductsInstalled,7,0.405613,69.573221,float16
13,CountryIdentifier,222,0.0,4.449335,int16
14,CityIdentifier,74311,3.650776,3.650776,float32
15,OrganizationIdentifier,46,30.874985,47.041393,float16
16,GeoNameIdentifier,283,0.002279,17.152828,float16


In [14]:
stats_df_filtered = stats_df[condition]
# Take the relevant features from the 'Feature' column
relevant_features = stats_df_filtered['Feature'].tolist()

In [15]:
relevant_features

['MachineIdentifier',
 'EngineVersion',
 'AppVersion',
 'AvSigVersion',
 'AVProductStatesIdentifier',
 'AVProductsInstalled',
 'CountryIdentifier',
 'CityIdentifier',
 'OrganizationIdentifier',
 'GeoNameIdentifier',
 'LocaleEnglishNameIdentifier',
 'Processor',
 'OsBuild',
 'OsSuite',
 'OsPlatformSubRelease',
 'OsBuildLab',
 'SkuEdition',
 'IsProtected',
 'SMode',
 'IeVerIdentifier',
 'SmartScreen',
 'Census_MDC2FormFactor',
 'Census_OEMNameIdentifier',
 'Census_OEMModelIdentifier',
 'Census_ProcessorCoreCount',
 'Census_ProcessorManufacturerIdentifier',
 'Census_ProcessorModelIdentifier',
 'Census_PrimaryDiskTotalCapacity',
 'Census_PrimaryDiskTypeName',
 'Census_SystemVolumeTotalCapacity',
 'Census_HasOpticalDiskDrive',
 'Census_TotalPhysicalRAM',
 'Census_ChassisTypeName',
 'Census_InternalPrimaryDiagonalDisplaySizeInInches',
 'Census_InternalPrimaryDisplayResolutionHorizontal',
 'Census_InternalPrimaryDisplayResolutionVertical',
 'Census_PowerPlatformRoleName',
 'Census_InternalBat

In [16]:
train = train[relevant_features]
print(train.shape)

(2676445, 60)


In [17]:
# test = import_data(DATA_PATH + 'project/reduced_test.csv', dtypes=dtypes)
test[target] = np.nan
print(test.shape)

(2355976, 83)


In [18]:
# test = reduce_mem_usage(test)

In [19]:
test = test[relevant_features]
print(test.shape)

(2355976, 60)


In [20]:
num_features = ['Census_SystemVolumeTotalCapacity']

bin_features = ['Census_HasOpticalDiskDrive',
                'Census_IsAlwaysOnAlwaysConnectedCapable',
                'Census_IsSecureBootEnabled',
                'Census_IsTouchEnabled',
                'IsProtected',
                'SMode',
                'Wdft_IsGamer']

noncat_features = num_features + bin_features
# All the other features are categorical
cat_features = [i for i in relevant_features if i not in noncat_features]
cat_features.remove(target)

In [21]:
train.head()

Unnamed: 0,MachineIdentifier,EngineVersion,AppVersion,AvSigVersion,AVProductStatesIdentifier,AVProductsInstalled,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Processor,OsBuild,OsSuite,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,SMode,IeVerIdentifier,SmartScreen,Census_MDC2FormFactor,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryNumberOfCharges,Census_OSVersion,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_GenuineStateName,Census_ActivationChannel,Census_FlightRing,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsTouchEnabled,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections
0,000014a5f00daa18e76b81417eeb99fc,1.1.15100.1,4.18.1807.18075,1.273.1379.0,53447.0,1.0,18,37376.0,,277.0,75,x64,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1.0,0.0,137.0,RequireAdmin,Notebook,1443.0,331929.0,4.0,5.0,2500.0,476940.0,HDD,101900.0,0,6144.0,Portable,14.0,1366.0,768.0,Mobile,0.0,10.0.17134.191,amd64,rs4_release,17134,191,Core,CORE,Update,8.0,31,FullAuto,IS_GENUINE,Retail,Retail,355.0,19844.0,0,0,0.0,0.0,1.0,1
1,00001f26e9e5775277d6231fc6ac9e70,1.1.15100.1,4.18.1807.18075,1.273.1372.0,36429.0,2.0,80,7198.0,27.0,101.0,107,x64,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,0.0,137.0,,Desktop,2102.0,250496.0,4.0,5.0,2408.0,476940.0,HDD,456309.0,0,4096.0,Desktop,18.5,1366.0,768.0,Desktop,4294967000.0,10.0.17134.191,amd64,rs4_release,17134,191,Professional,PROFESSIONAL,UUPUpgrade,6.0,28,FullAuto,IS_GENUINE,OEM:DM,Retail,486.0,51601.0,1,0,0.0,1.0,3.0,0
2,00002b7454f06444e8d9f6083d8a9ebd,1.1.15300.6,4.18.1809.2,1.277.48.0,53447.0,1.0,178,136271.0,27.0,230.0,71,x64,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1.0,,137.0,,Notebook,666.0,264568.0,4.0,5.0,2664.0,477102.0,SSD,48602.0,0,8192.0,Laptop,13.296875,2560.0,1600.0,Mobile,0.0,10.0.17134.285,amd64,rs4_release,17134,285,Core,CORE,IBSClean,7.0,30,FullAuto,INVALID_LICENSE,Retail,Retail,152.0,45542.0,0,0,0.0,1.0,1.0,0
3,000037f84e21c83328ba6963cdac497b,1.1.15200.1,4.18.1807.18075,1.275.173.0,7945.0,2.0,12,110781.0,27.0,15.0,58,x64,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1.0,0.0,137.0,RequireAdmin,Notebook,2652.0,304618.0,4.0,5.0,2696.0,228936.0,SSD,228106.0,0,8192.0,Laptop,13.203125,1920.0,1080.0,Mobile,0.0,10.0.17134.228,amd64,rs4_release,17134,228,Core,CORE,UUPUpgrade,8.0,31,FullAuto,IS_GENUINE,OEM:DM,Retail,142.0,12463.0,0,0,0.0,1.0,2.0,0
4,000046e59c37136173428e560acbe3a2,1.1.15200.1,4.18.1807.18075,1.275.511.0,46669.0,2.0,141,60626.0,27.0,240.0,233,x64,15063,768,rs2,15063.0.amd64fre.rs2_release.170317-1834,Home,1.0,0.0,108.0,,Notebook,2668.0,171476.0,2.0,5.0,2096.0,476940.0,HDD,450063.0,0,2048.0,Notebook,13.898438,1366.0,768.0,Mobile,0.0,10.0.15063.1266,amd64,rs2_release,15063,1266,CoreSingleLanguage,CORE_SINGLELANGUAGE,Reset,9.0,34,Notify,IS_GENUINE,OEM:DM,Retail,628.0,13190.0,1,0,0.0,1.0,10.0,1


In [22]:
test.head()

Unnamed: 0,MachineIdentifier,EngineVersion,AppVersion,AvSigVersion,AVProductStatesIdentifier,AVProductsInstalled,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Processor,OsBuild,OsSuite,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,SMode,IeVerIdentifier,SmartScreen,Census_MDC2FormFactor,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryNumberOfCharges,Census_OSVersion,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_GenuineStateName,Census_ActivationChannel,Census_FlightRing,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsTouchEnabled,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections
0,00000574cefffeca83ec8adf9285b2bf,1.1.15400.4,4.18.1809.2,1.279.236.0,53447.0,1.0,171,124736.0,18.0,211.0,182,x64,16299,768,rs3,16299.15.amd64fre.rs3_release.170928-1534,Home,1.0,,117.0,RequireAdmin,Notebook,585.0,189538.0,4.0,5.0,3394.0,476940.0,HDD,461506.0,1,2048.0,Notebook,15.5,1366.0,768.0,Mobile,0.0,10.0.16299.371,amd64,rs3_release,16299,371,CoreSingleLanguage,CORE_SINGLELANGUAGE,Update,29.0,125,UNKNOWN,IS_GENUINE,Retail,Retail,556.0,63269.0,1,0,0.0,1.0,3.0,
1,000008f31610018d898e5f315cdf1bd1,1.1.15400.4,4.18.1810.5,1.279.1373.0,7945.0,2.0,29,112216.0,,35.0,171,x64,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1.0,,137.0,,Notebook,4144.0,120223.0,4.0,5.0,2382.0,953869.0,HDD,936631.0,0,4096.0,Notebook,15.5,1920.0,1080.0,Mobile,121.0,10.0.17134.345,amd64,rs4_release,17134,345,CoreSingleLanguage,CORE_SINGLELANGUAGE,Refresh,26.0,119,FullAuto,IS_GENUINE,OEM:DM,Retail,142.0,52312.0,1,0,0.0,1.0,10.0,
2,00000e658ce75c1e2a3bb47bcc3b08f3,1.1.15400.4,4.12.17007.18011,1.279.922.0,2558.0,2.0,203,143782.0,27.0,255.0,46,x64,16299,768,rs3,16299.431.amd64fre.rs3_release_svc_escrow.180502-1908,Home,1.0,0.0,117.0,ExistsNotSet,Notebook,4488.0,308579.0,4.0,5.0,2536.0,715404.0,HDD,681374.0,0,4096.0,Notebook,15.5,1920.0,1080.0,Mobile,0.0,10.0.16299.431,amd64,rs3_release_svc_escrow,16299,431,Core,CORE,Upgrade,39.0,160,FullAuto,IS_GENUINE,Retail,Retail,556.0,57180.0,1,1,0.0,0.0,7.0,
3,0000102ff65968bbdc04b69073434b05,1.1.15400.5,4.18.1810.5,1.281.638.0,53447.0,1.0,80,97985.0,27.0,101.0,107,x64,16299,256,rs3,16299.637.amd64fre.rs3_release_svc.180808-1748,Pro,1.0,,117.0,,Desktop,1443.0,256605.0,4.0,5.0,2426.0,122104.0,SSD,107289.0,0,4096.0,Desktop,18.5,1360.0,768.0,Desktop,4294967000.0,10.0.16299.785,amd64,rs3_release,16299,785,Professional,PROFESSIONAL,Other,6.0,28,UNKNOWN,IS_GENUINE,OEM:DM,Retail,355.0,7313.0,1,0,0.0,0.0,3.0,
4,00001dcfc3f82d68d6eae9cad4a3e07c,1.1.15400.4,4.18.1810.5,1.279.684.0,53447.0,1.0,108,75425.0,27.0,277.0,75,x64,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,,137.0,,Desktop,2206.0,252854.0,8.0,5.0,2776.0,1907729.0,HDD,147446.0,0,16384.0,Desktop,19.0,1280.0,1024.0,Desktop,4294967000.0,10.0.17134.345,amd64,rs4_release,17134,345,Professional,PROFESSIONAL,UUPUpgrade,8.0,31,FullAuto,IS_GENUINE,Retail,Retail,142.0,13887.0,0,0,0.0,0.0,11.0,


In [23]:
train['HasDetections'].value_counts(normalize=True)

0    0.500041
1    0.499959
Name: HasDetections, dtype: float64
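
The classes are almost perfectly balanced, so the checklist's rule of thumb (majority class frequency between 50% and 70%) suggests accuracy is a reasonable metric here. A small sketch of that rule follows; the function name is hypothetical and the frequencies are copied from the output above.

```python
def accuracy_is_reasonable(class_frequencies, low=0.5, high=0.7):
    """Rule of thumb: majority class frequency must lie in [low, high)."""
    majority = max(class_frequencies.values())
    return low <= majority < high

# Frequencies from train['HasDetections'].value_counts(normalize=True)
observed = {0: 0.500041, 1: 0.499959}
print(accuracy_is_reasonable(observed))             # True
print(accuracy_is_reasonable({0: 0.92, 1: 0.08}))   # False
```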

In [24]:
def impute_features(X):
  """
  Custom imputation for numerical, binary, and categorical features.
  """
  X = X.copy()
  
  # Numerical Features
  # Replace NaN values with -1
  for feature in num_features:
    X[feature] = X[feature].fillna(-1)
  
  # Binary Features
  # Replace NaN values with the column's mode
  for feature in bin_features:
    X[feature] = X[feature].fillna(X[feature].mode()[0])
  
  # Categorical Features
  # Replace all "numerical" categorical feature NaNs with -1
  # We need to handle all the other features by hand
  features_by_hand = []
  for feature in cat_features:
    # isinstance check replaces is_categorical_dtype, which newer pandas deprecates
    categorical = isinstance(X[feature].dtype, pd.CategoricalDtype)
    if not categorical:
      if feature != target:
        # Replace the NaNs with -1
        X[feature] = X[feature].fillna(-1)
        X[feature] = X[feature].astype('category')
    else:
      # Add to the list of features to correct by hand
      features_by_hand.append(feature)

  return X, features_by_hand

In [25]:
train, features_by_hand_train = impute_features(train)
test, features_by_hand_test = impute_features(test)

In [26]:
features_by_hand_train

['MachineIdentifier',
 'EngineVersion',
 'AppVersion',
 'AvSigVersion',
 'Processor',
 'OsPlatformSubRelease',
 'OsBuildLab',
 'SkuEdition',
 'SmartScreen',
 'Census_MDC2FormFactor',
 'Census_PrimaryDiskTypeName',
 'Census_ChassisTypeName',
 'Census_PowerPlatformRoleName',
 'Census_OSVersion',
 'Census_OSArchitecture',
 'Census_OSBranch',
 'Census_OSEdition',
 'Census_OSSkuName',
 'Census_OSInstallTypeName',
 'Census_OSWUAutoUpdateOptionsName',
 'Census_GenuineStateName',
 'Census_ActivationChannel',
 'Census_FlightRing']

In [31]:
def wrangle(X, features_by_hand):
  """
  Wrangle the train and test dataframes in the same way.
  In order to handle the remaining categorical features, we should:
    1. Convert all values to lowercase.
    2. Merge different spellings of the same value.
    3. Set NaNs to 'unknown'.
    4. Merge rare values or set them to 'unknown'.
    5. Label all values with numerics.
    6. Add high-cardinality features to the frequency-encoding list.
  """
  freq_encoding_list = []

  X = X.copy()

  # Convert all values to lowercase.
  for feature in features_by_hand:
    X[feature] = X[feature].str.lower()

  # Feature: MachineIdentifier
  # Nothing needed

  # Feature: EngineVersion
  freq_encoding_list.append('EngineVersion')

  # Feature: AppVersion
  freq_encoding_list.append('AppVersion')

  # Feature: AvSigVersion
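  # One raw value contains a stray character entity; repair it so it matches
  # the 1.273.x.x pattern seen elsewhere in the column
  X['AvSigVersion'] = X['AvSigVersion'].replace('1.2&#x17;3.1144.0', '1.273.1144.0')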
  freq_encoding_list.append('AvSigVersion')

  # Feature: Processor
  # Nothing needed

  # Feature: OsPlatformSubRelease
  # Nothing needed

  # Feature: OsBuildLab
  X['OsBuildLab'] = X['OsBuildLab'].astype('category')
  X['OsBuildLab'] = X['OsBuildLab'].cat.add_categories(['unknown'])
  X['OsBuildLab'] = X['OsBuildLab'].fillna('unknown')
  freq_encoding_list.append('OsBuildLab')

  # Feature: SkuEdition
  X['SkuEdition'] = X['SkuEdition'].astype(str)
  # Values were lowercased above, so compare against 'home' (not 'Home')
  X.loc[X['SkuEdition'] != 'home', 'SkuEdition'] = 'pro'
  X['SkuEdition'] = X['SkuEdition'].astype('category')
  X['SkuEdition'] = X['SkuEdition'].cat.remove_unused_categories()

  # Feature: SmartScreen
  X['SmartScreen'] = X['SmartScreen'].astype('category')
  X['SmartScreen'] = X['SmartScreen'].fillna('existsnotset')
  X['SmartScreen'] = X['SmartScreen'].astype(str)
  condition = ~X['SmartScreen'].isin(['requireadmin',
                                      'existsnotset',
                                      'off',
                                      'warn'])
  X.loc[condition, 'SmartScreen'] = 'prompt'
  X['SmartScreen'] = X['SmartScreen'].astype('category')
  X['SmartScreen'] = X['SmartScreen'].cat.remove_unused_categories()

  # Feature: Census_MDC2FormFactor
  X['Census_MDC2FormFactor_new'] = X['Census_MDC2FormFactor']
  features_by_hand.append('Census_MDC2FormFactor_new')
  X['Census_MDC2FormFactor_new'] = X['Census_MDC2FormFactor_new'].astype(str)
  X['Census_MDC2FormFactor_new'] = X['Census_MDC2FormFactor_new'].apply(rename_Census_MDC2FormFactor_new)
  X['Census_MDC2FormFactor_new'] = X['Census_MDC2FormFactor_new'].astype('category')
  X['Census_MDC2FormFactor_new'] = X['Census_MDC2FormFactor_new'].cat.remove_unused_categories()

  # Feature: Census_PrimaryDiskTypeName
  X['Census_PrimaryDiskTypeName'] = X['Census_PrimaryDiskTypeName'].astype('category')
  X['Census_PrimaryDiskTypeName'] = X['Census_PrimaryDiskTypeName'].replace('unspecified', 'unknown')
  X['Census_PrimaryDiskTypeName'] = X['Census_PrimaryDiskTypeName'].fillna('unknown')
  X['Census_PrimaryDiskTypeName'] = X['Census_PrimaryDiskTypeName'].cat.remove_unused_categories()

  # Feature: Census_ChassisTypeName
  X['Census_ChassisTypeName'] = X['Census_ChassisTypeName'].fillna('unknown')
  X['Census_ChassisTypeName'] = X['Census_ChassisTypeName'].astype(str)
  X['Census_ChassisTypeName'] = X['Census_ChassisTypeName'].apply(rename_Census_ChassisTypeName)
  X['Census_ChassisTypeName'] = X['Census_ChassisTypeName'].astype('category')
  X['Census_ChassisTypeName'] = X['Census_ChassisTypeName'].cat.remove_unused_categories()

  # Feature: Census_PowerPlatformRoleName
  X['Census_PowerPlatformRoleName'] = X['Census_PowerPlatformRoleName'].astype('category')
  X['Census_PowerPlatformRoleName'] = X['Census_PowerPlatformRoleName'].replace('unspecified', 'unknown')
  X['Census_PowerPlatformRoleName'] = X['Census_PowerPlatformRoleName'].fillna('unknown')
  X['Census_PowerPlatformRoleName'] = X['Census_PowerPlatformRoleName'].cat.remove_unused_categories()

  # Feature: Census_OSVersion
  freq_encoding_list.append('Census_OSVersion')

  # Feature: Census_OSArchitecture
  # Nothing needed

  # Feature: Census_OSBranch
  # Nothing needed

  # Feature: Census_OSEdition
  X['Census_OSEdition'] = X['Census_OSEdition'].astype('category')
  X['Census_OSEdition'] = X['Census_OSEdition'].cat.add_categories(['unknown'])
  X['Census_OSEdition'] = X['Census_OSEdition'].fillna('unknown')
  X['Census_OSEdition'] = X['Census_OSEdition'].astype(str)
  X['Census_OSEdition'] = X['Census_OSEdition'].apply(rename_Census_OSEdition)
  X['Census_OSEdition'] = X['Census_OSEdition'].astype('category')
  X['Census_OSEdition'] = X['Census_OSEdition'].cat.remove_unused_categories()

  # Feature: Census_OSSkuName
  X['Census_OSSkuName'] = X['Census_OSSkuName'].astype(str)
  X['Census_OSSkuName'] = X['Census_OSSkuName'].apply(rename_Census_OSEdition)
  X['Census_OSSkuName'] = X['Census_OSSkuName'].astype('category')
  X['Census_OSSkuName'] = X['Census_OSSkuName'].cat.remove_unused_categories()

  # Feature: Census_OSInstallTypeName
  # Nothing needed

  # Feature: Census_OSWUAutoUpdateOptionsName
  # Nothing needed

  # Feature: Census_GenuineStateName
  X['Census_GenuineStateName'] = X['Census_GenuineStateName'].astype('category')
  X['Census_GenuineStateName'] = X['Census_GenuineStateName'].replace('tampered', 'unknown')
  X['Census_GenuineStateName'] = X['Census_GenuineStateName'].fillna('unknown')
  X['Census_GenuineStateName'] = X['Census_GenuineStateName'].cat.remove_unused_categories()

  # Feature: Census_ActivationChannel
  X['Census_ActivationChannel'] = X['Census_ActivationChannel'].astype(str)
  X['Census_ActivationChannel'] = X['Census_ActivationChannel'].apply(rename_Census_ActivationChannel)
  X['Census_ActivationChannel'] = X['Census_ActivationChannel'].astype('category')

  # Feature: Census_FlightRing
  X['Census_FlightRing'] = X['Census_FlightRing'].astype('category')
  X['Census_FlightRing'] = X['Census_FlightRing'].replace('disabled', 'not_set')
  X['Census_FlightRing'] = X['Census_FlightRing'].replace(['osg', 'canary', 'invalid'], 'unknown')
  X['Census_FlightRing'] = X['Census_FlightRing'].fillna('unknown')
  X['Census_FlightRing'] = X['Census_FlightRing'].cat.remove_unused_categories()

  return X, freq_encoding_list

def rename_Census_MDC2FormFactor_new(x):
  x = x.lower()
  if 'server' in x:
    return 'server'
  elif 'tablet' in x:
    return 'tablet'                  
  else:
    return x

def rename_Census_ChassisTypeName(x):
  x = x.lower()
  if 'laptop' in x:
    return 'notebook'
  elif 'other' in x:
    return 'unknown'                  
  else:
    return x

def rename_Census_OSEdition(x):
  x = x.lower()
  if 'core' in x:
    return 'core'
  elif 'pro' in x:
    return 'pro'
  elif 'enterprise' in x:
    return 'enterprise'
  elif 'server' in x:
    return 'server'
  elif 'home' in x:
    return 'home'
  elif 'education' in x:
    return 'education'
  elif 'cloud' in x:
    return 'cloud'
  else:
    return x

def rename_Census_ActivationChannel(x):
  x = x.lower()
  if 'oem' in x:
    return 'oem'
  elif 'volume' in x:
    return 'volume'
  elif 'retail' in x:
    return 'retail'
  else:
    return x
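
Because these helpers match substrings in a fixed order, the first matching keyword wins. The self-contained sketch below re-expresses `rename_Census_OSEdition` as an ordered keyword list to make that precedence visible. It is an alternative formulation for illustration, not the notebook's code, and the example strings are partly assumed.

```python
# Keywords are checked in this order; the first one found wins.
OS_EDITION_KEYWORDS = ['core', 'pro', 'enterprise', 'server',
                       'home', 'education', 'cloud']

def collapse_os_edition(value):
    """Map an edition string to the first keyword it contains."""
    value = value.lower()
    for keyword in OS_EDITION_KEYWORDS:
        if keyword in value:
            return keyword
    return value  # no keyword matched: keep the lowercased value

for raw in ['CoreSingleLanguage', 'Professional',
            'ProfessionalEducation', 'EnterpriseS', 'nan']:
    print(raw, '->', collapse_os_edition(raw))
```

Note that a string such as `ProfessionalEducation` collapses to `pro` rather than `education`, because `pro` is checked first.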

In [28]:
%time train, freq_encoding_list = wrangle(train, features_by_hand_train)
%time test, freq_encoding_list = wrangle(test, features_by_hand_test)

Wall time: 15.9 s
Wall time: 14.1 s


In [29]:
freq_encoding_list

['EngineVersion',
 'AppVersion',
 'AvSigVersion',
 'OsBuildLab',
 'Census_OSVersion']
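
The notebook collects `freq_encoding_list` but does not show the encoding step itself. One common approach, sketched here on toy data, replaces each category with its relative frequency in the training set. The helper `frequency_encode` and the choice to map categories unseen in training to 0 are assumptions.

```python
import pandas as pd

def frequency_encode(train_col, other_col):
    """Replace categories by their relative frequency in train_col."""
    freqs = train_col.value_counts(normalize=True)
    encoded_train = train_col.map(freqs).astype(float)
    # Categories that never occur in train get frequency 0.
    encoded_other = other_col.map(freqs).astype(float).fillna(0.0)
    return encoded_train, encoded_other

toy_train = pd.Series(['1.1.15100.1', '1.1.15100.1',
                       '1.1.15200.1', '1.1.15300.6'])
toy_test = pd.Series(['1.1.15100.1', '1.1.15400.4'])

enc_train, enc_test = frequency_encode(toy_train, toy_test)
print(enc_train.tolist())
print(enc_test.tolist())
```

Learning the frequencies from the training set only keeps the encoding consistent between train and test.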

In [30]:
%time train.to_csv(DATA_PATH + 'project/reduced_train_clean.csv', index=False)
%time test.to_csv(DATA_PATH + 'project/reduced_test_clean.csv', index=False)

Wall time: 56.5 s
Wall time: 50.5 s
