Group members:
- Aya Abdelgawad
- Eric Ge
- Jay Acosta
- Vandana George

# The Problem
We will use classification algorithms to predict whether a program contains malicious code. A reasonably accurate classifier would be a useful asset for antivirus software, helping users protect their computers. More generally, this analysis should help reveal trends in the compromised files and identify which features are most effective for classifying code as suspicious.

The dataset we will be using was supplied by Max Secure Partner for a malware detection competition. It contains metadata on programs, each labeled as either legitimate or not (that is, containing malicious code). The metadata includes metrics such as the minimum, maximum, and mean entropy, the size of the code, the flags the program sets, and its MD5 hash.

In [2]:
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt

%matplotlib inline

# Data Cleaning

In [18]:
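# low_memory=False reads the file in one pass so column dtypes are inferred consistently
# TODO: ask Dr. Beasley in office hours about the low_memory setting
data = pd.read_csv('Kaggle-data.csv', low_memory=False)
pd.set_option('display.max_columns', None)  # show every column in the output
data.describe(include='all')
# TODO: also ask why the last column is named "Unnamed: 57"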

Unnamed: 0,ID,md5,Machine,SizeOfOptionalHeader,Characteristics,MajorLinkerVersion,MinorLinkerVersion,SizeOfCode,SizeOfInitializedData,SizeOfUninitializedData,AddressOfEntryPoint,BaseOfCode,BaseOfData,ImageBase,SectionAlignment,FileAlignment,MajorOperatingSystemVersion,MinorOperatingSystemVersion,MajorImageVersion,MinorImageVersion,MajorSubsystemVersion,MinorSubsystemVersion,SizeOfImage,SizeOfHeaders,CheckSum,Subsystem,DllCharacteristics,SizeOfStackReserve,SizeOfStackCommit,SizeOfHeapReserve,SizeOfHeapCommit,LoaderFlags,NumberOfRvaAndSizes,SectionsNb,SectionsMeanEntropy,SectionsMinEntropy,SectionsMaxEntropy,SectionsMeanRawsize,SectionsMinRawsize,SectionMaxRawsize,SectionsMeanVirtualsize,SectionsMinVirtualsize,SectionMaxVirtualsize,ImportsNbDLL,ImportsNb,ImportsNbOrdinal,ExportNb,ResourcesNb,ResourcesMeanEntropy,ResourcesMinEntropy,ResourcesMaxEntropy,ResourcesMeanSize,ResourcesMinSize,ResourcesMaxSize,LoadConfigurationSize,VersionInformationSize,legitimate,Unnamed: 57
count,216352.0,216352,216352.0,216352.0,216352.0,216351.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,216352.0,1.0
unique,,209329,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
top,,9873531e6ba01adecf7e1e0f68c2df1a,332.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
freq,,25,197618.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
mean,108176.5,,,225.390197,4658.018849,9.052688,4.297964,395385.7,582797.8,1332425.0,281219.8,92636.16,259750.6,682562200000000.0,8973.149,876.249815,5.891214,1.908686,65.019649,62.19417,4.915735,1.062089,830637.5,1444.321,119838700.0,2.211697,19125.945395,82265970.0,81352960.0,1375880.0,45967.53,92372.07,61608.88,4.907304,4.50014,2.046629,6.780452,162379.1,19088.53,558863.1,182494.3,20666.24,627925.8,7.375374,114.004788,4.702134,24.093205,21.311201,3.758643,2.391066,5.149387,99970.22,71396.24,252453.5,1023401.0,7.888492,0.348982,0.0
std,62455.587057,,,4.559983,7843.855241,71.522478,11.965284,19627750.0,28411060.0,73378090.0,12543270.0,9922827.0,6712844.0,1.121701e+17,719444.1,1362.854293,183.580174,227.045651,1163.764427,1153.224766,1.041145,144.727471,6859762.0,5878.291,496510300.0,0.500514,16258.49351,37821550000.0,37821560000.0,147876600.0,7974684.0,13707140.0,12420310.0,2.554187,1.121715,1.82534,1.049097,6451482.0,318612.2,25738650.0,3393885.0,318960.0,10730930.0,727.966935,137.024438,36.612958,267.169003,130.67709,1.305126,1.042133,1.864471,17182010.0,16815310.0,24336130.0,47725220.0,8.049384,0.476649,
min,1.0,,,176.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,65536.0,0.0,16.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,352.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,54088.75,,,224.0,258.0,7.0,0.0,25600.0,15360.0,0.0,12538.0,4096.0,24576.0,4194304.0,4096.0,512.0,4.0,0.0,0.0,0.0,4.0,0.0,122880.0,1024.0,33886.75,2.0,320.0,1048576.0,4096.0,1048576.0,4096.0,0.0,16.0,4.0,3.746777,0.020393,6.305549,13619.2,512.0,41472.0,25903.0,348.0,75957.0,3.0,41.0,0.0,0.0,2.0,3.362966,2.010185,3.594423,775.5,38.0,968.0,0.0,0.0,0.0,0.0
50%,108176.5,,,224.0,271.0,9.0,0.0,101888.0,119808.0,0.0,46612.5,4096.0,86016.0,4194304.0,4096.0,512.0,5.0,1.0,0.0,0.0,5.0,0.0,372736.0,1024.0,278334.5,2.0,32768.0,1048576.0,4096.0,1048576.0,4096.0,0.0,16.0,5.0,4.617155,2.133852,6.617805,57036.8,1536.0,178176.0,75448.2,2776.0,231642.5,4.0,90.0,0.0,0.0,6.0,3.671986,2.458492,5.217124,1601.917,48.0,7336.0,0.0,0.0,0.0,0.0
75%,162264.25,,,224.0,8450.0,10.0,0.0,122880.0,385024.0,0.0,76180.0,4096.0,126976.0,268435500.0,4096.0,512.0,5.0,1.0,6.0,0.0,5.0,1.0,532480.0,1024.0,560597.0,2.0,33088.0,1048576.0,4096.0,1048576.0,4096.0,0.0,16.0,5.0,5.61181,4.034605,7.96112,101068.8,9216.0,330752.0,109930.9,9384.0,352708.0,8.0,144.0,1.0,0.0,13.0,4.194799,3.003205,6.122045,3146.4,232.0,17005.0,72.0,15.0,1.0,0.0
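
The summary above points at three things worth checking before modeling: `Unnamed: 57` has a count of 1, `MajorLinkerVersion` has a count of 216351 (one fewer than the other columns), and `md5` has 209329 unique values across 216352 rows. The following is a minimal sketch of those checks on a small made-up frame. The column names come from the dataset, but the values, the `toy` frame, and the "at most one non-null value" threshold are assumptions chosen for illustration. A stray extra field in one row of the CSV is one possible source of an `Unnamed: N` column, though that has not been confirmed for this file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: column names come from the dataset,
# the values are invented for illustration only.
toy = pd.DataFrame({
    "md5": ["aa", "bb", "bb", "cc"],
    "MajorLinkerVersion": [8.0, np.nan, 9.0, 10.0],
    "legitimate": [1, 0, 0, 1],
    "Unnamed: 57": [np.nan, np.nan, np.nan, 0.0],
})

# Non-null count per column; nearly empty columns stand out here.
non_null = toy.notna().sum()

# Columns with at most one non-null value (the threshold is an arbitrary choice).
nearly_empty = non_null[non_null <= 1].index.tolist()

# Missing values per column, keeping only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0]

# Rows whose md5 hash already appeared earlier in the frame.
n_duplicate_hashes = int(toy["md5"].duplicated().sum())

print(nearly_empty)
print(missing)
print(n_duplicate_hashes)
```

Running the same three checks on `data` instead of `toy` would show whether the repeated hashes are exact duplicate rows or the same file recorded with different metadata, which affects whether they should be dropped.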


In [15]:
# columns to drop during preliminary cleaning
clean_out = ['ID', 'Machine', 'Characteristics', 'MinorLinkerVersion', 'BaseOfCode', 'BaseOfData', 'ImageBase', 'Unnamed: 57']
# list all the columns (kept as the last expression so the notebook displays it)
data.columns.tolist()

['ID',
 'md5',
 'Machine',
 'SizeOfOptionalHeader',
 'Characteristics',
 'MajorLinkerVersion',
 'MinorLinkerVersion',
 'SizeOfCode',
 'SizeOfInitializedData',
 'SizeOfUninitializedData',
 'AddressOfEntryPoint',
 'BaseOfCode',
 'BaseOfData',
 'ImageBase',
 'SectionAlignment',
 'FileAlignment',
 'MajorOperatingSystemVersion',
 'MinorOperatingSystemVersion',
 'MajorImageVersion',
 'MinorImageVersion',
 'MajorSubsystemVersion',
 'MinorSubsystemVersion',
 'SizeOfImage',
 'SizeOfHeaders',
 'CheckSum',
 'Subsystem',
 'DllCharacteristics',
 'SizeOfStackReserve',
 'SizeOfStackCommit',
 'SizeOfHeapReserve',
 'SizeOfHeapCommit',
 'LoaderFlags',
 'NumberOfRvaAndSizes',
 'SectionsNb',
 'SectionsMeanEntropy',
 'SectionsMinEntropy',
 'SectionsMaxEntropy',
 'SectionsMeanRawsize',
 'SectionsMinRawsize',
 'SectionMaxRawsize',
 'SectionsMeanVirtualsize',
 'SectionsMinVirtualsize',
 'SectionMaxVirtualsize',
 'ImportsNbDLL',
 'ImportsNb',
 'ImportsNbOrdinal',
 'ExportNb',
 'ResourcesNb',
 'ResourcesMeanEnt
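
The `clean_out` list names the columns to remove, but the removal itself is not shown. One way it could be done is with `DataFrame.drop`, sketched here on a small frame that uses a few of the real column names. The `toy` and `cleaned` names and the shortened `md5` strings are placeholders for illustration, not part of the original notebook.

```python
import pandas as pd

clean_out = ['ID', 'Machine', 'Characteristics', 'MinorLinkerVersion',
             'BaseOfCode', 'BaseOfData', 'ImageBase', 'Unnamed: 57']

# Toy frame with a handful of the dataset's columns; md5 values are placeholders.
toy = pd.DataFrame({
    "ID": [1, 2],
    "md5": ["aa", "bb"],
    "Machine": [332, 332],
    "SizeOfCode": [16896, 84480],
    "legitimate": [1, 1],
})

# errors="ignore" skips names in clean_out that are absent from the frame,
# so the same call works on the toy frame and on the full dataset.
cleaned = toy.drop(columns=clean_out, errors="ignore")

print(cleaned.columns.tolist())
```

Returning a new frame rather than dropping in place keeps the original `data` available, which is convenient if the choice of columns changes later.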

In [16]:
data.head()

Unnamed: 0,ID,md5,Machine,SizeOfOptionalHeader,Characteristics,MajorLinkerVersion,MinorLinkerVersion,SizeOfCode,SizeOfInitializedData,SizeOfUninitializedData,...,ResourcesMeanEntropy,ResourcesMinEntropy,ResourcesMaxEntropy,ResourcesMeanSize,ResourcesMinSize,ResourcesMaxSize,LoadConfigurationSize,VersionInformationSize,legitimate,Unnamed: 57
0,1,b69acb3bb133974e48229627663f96d4,332,224,8450,8.0,0,16896,8192,0,...,3.492126,3.492126,3.492126,864.0,864.0,864,72,0,1,
1,2,1cbee4b3725629bd0aa6ac2ff500925f,332,224,258,9.0,0,84480,25600,0,...,3.486827,3.486827,3.486827,892.0,892.0,892,72,0,1,
2,3,b7027cf0cd31c820928950cbfe7e91ef,332,224,8450,8.0,0,4608,3584,0,...,3.51727,3.51727,3.51727,952.0,952.0,952,72,0,1,
3,4,156a0bb069f94d1e7c2508318805f2a4,332,224,8450,10.0,0,108544,15872,0,...,3.270559,3.034188,3.506931,1032.0,972.0,1092,72,0,1,
4,5,c72bf851fed5542abba904b1f3944cd5,332,224,8226,48.0,0,513024,2048,0,...,3.420977,3.420977,3.420977,954.0,954.0,954,0,0,1,
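
The `legitimate` column in these rows is the label the classifier will predict. Its mean in the summary table is about 0.349, so roughly 35% of the rows are labeled legitimate. A quick way to look at class balance is sketched below on invented labels. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy labels, invented for illustration; 1 = legitimate, 0 = not legitimate.
toy = pd.DataFrame({"legitimate": [1, 0, 0, 1, 0, 0]})

# Fraction of rows in each class.
balance = toy["legitimate"].value_counts(normalize=True)

# For a 0/1 label, the share of 1s is the same number as the column mean.
share_legitimate = toy["legitimate"].mean()

print(balance)
print(share_legitimate)
```

Knowing the balance up front matters when judging accuracy, since a classifier that always predicts the majority class already matches the majority share.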
