# Detecting Fraud in Ad Impression Data

I have a file with one day of impression logs for selected IP addresses. The nine fields
are, in order:

* Timestamp
* IP address
* Detected browser type
* User agent string
* Host (URL)
* Whether the impression was in view (1.0 = yes, 0.0 = no)
* Number of plugins installed
* Browser window position and size (x, y, width, height)
* Network latency

My task is to identify hosts that receive a substantial amount of fraudulent
traffic. As part of this, I may also identify IP addresses belonging to machines that
are part of botnets, but this is not required. The definition of "substantial" is up to me --
the result may be a ranked list of all hosts, a list of hosts above a certain threshold, or a
simple classification of hosts as likely or unlikely to be experiencing high fraud, without
quantifying the amount. To get started, I have a list of hosts known
to receive substantial amounts of fraudulent traffic:

* featureball.com
* uvido.com
* sprungliving.com
* sweetboxgames.com
* mammabay.co.uk
* workingfathertv.com
* worsthorrorgame.com
* hourlyparent.com
* ulterior-movies.com
* myhomesdesign.com
* indoorlife.tv
* bumclub.info
* psychoworld.tv
* hunp.us
* rlinevideos.com


In [1]:
import re
import math
import timeit
import useragent
import numpy as np
print 'numpy', np.__version__
import pandas as pd
print 'pandas', pd.__version__
import seaborn as sns
print 'seaborn', sns.__version__
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

pd.options.display.max_columns = 25

tic = timeit.default_timer()
imp_df = pd.io.parsers.read_csv("ad_impression_data_set.tsv",sep='\t',
                                names=['Timestamp','IPadd','Browser','UserAgent','Host','Inview','Nplugins','Wpossize','Latency'],
                               header=None)
toc = timeit.default_timer()
print 'Seconds to load csv: ', toc - tic

imp_df.head(5)

numpy 1.10.1
pandas 0.17.0
seaborn 0.6.0
Seconds to load csv:  1.41146397591




Unnamed: 0,Timestamp,IPadd,Browser,UserAgent,Host,Inview,Nplugins,Wpossize,Latency
0,2014-08-25 00:00:00,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,http://www.domain.com.au,0.0,,"(0,0,1280,629)",0.0
1,2014-08-25 00:00:00,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,http://www.domain.com.au,0.0,,"(0,0,1280,629)",0.0
2,2014-08-25 00:00:00,325.441.386.395,Unknown,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,http://www.mangareader.net,,,,
3,2014-08-25 00:00:00,325.441.386.395,Unknown,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,http://www.mangareader.net,,,,
4,2014-08-25 00:00:00,325.441.386.395,Unknown,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,http://www.mangareader.net,,,,


In [2]:
print imp_df.describe()

              Inview       Nplugins         Latency
count  211811.000000  162304.000000   200283.000000
mean        0.431998      10.139442      345.200182
std         0.495355       7.637172    13292.816593
min         0.000000       1.000000        0.000000
25%         0.000000       2.000000        0.000000
50%         0.000000      10.000000       54.000000
75%         1.000000      15.000000      179.000000
max         1.000000      61.000000  5866142.000000


In [3]:
def clean_hosts(data=None):
    ''' A bunch of regular expression magic to reduce host URLs to domain name plus suffix'''
    if data is None:
        raise ValueError("Input 'data' to clean_hosts is None")
    # deal with co.uk and com.au, convert to .couk and .comau, respectively (and other variants)
    data["Host"]=data["Host"].apply(lambda x: re.sub(r'(\.co)\.([a-z][a-z])$',r'\1\2',x))
    data["Host"]=data["Host"].apply(lambda x: re.sub(r'(\.com)\.([a-z][a-z])$',r'\1\2',x))
    # select just the top level and suffix
    data["Host"]=data["Host"].apply(lambda x: '.'.join(str(x).split('.')[-2:]))
    # account for cases like http://mlb.com (where there is only one '.')
    data["Host"]=data["Host"].apply(lambda x: str(x).split('//')[-1])
    
    #Remove ERROR Hosts
    print '\nNumber of ERROR hosts to be removed:', sum(data["Host"]=="ERROR"), '\n'
    data=data[data["Host"]!="ERROR"]
    return data

print imp_df.info()
print imp_df['Host'].head(5)

tic = timeit.default_timer()
imp_df = clean_hosts(imp_df)
toc = timeit.default_timer()
print 'Seconds to clean hosts: ', toc - tic

print imp_df.info()
print imp_df['Host'].head(5)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 235083 entries, 0 to 235082
Data columns (total 9 columns):
Timestamp    235083 non-null object
IPadd        235083 non-null object
Browser      235083 non-null object
UserAgent    235083 non-null object
Host         235083 non-null object
Inview       211811 non-null float64
Nplugins     162304 non-null float64
Wpossize     118451 non-null object
Latency      200283 non-null float64
dtypes: float64(3), object(6)
memory usage: 17.9+ MB
None
0      http://www.domain.com.au
1      http://www.domain.com.au
2    http://www.mangareader.net
3    http://www.mangareader.net
4    http://www.mangareader.net
Name: Host, dtype: object

Number of ERROR hosts to be removed: 138 

Seconds to clean hosts:  4.70598196983
<class 'pandas.core.frame.DataFrame'>
Int64Index: 234945 entries, 0 to 235082
Data columns (total 9 columns):
Timestamp    234945 non-null object
IPadd        234945 non-null object
Browser      234945 non-null object
UserAgent    23494
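
To spot-check the host clean-up on individual strings, the same steps can be run outside the DataFrame. This is a minimal sketch, not the cell above verbatim: `normalize_host` is a hypothetical helper name, the two suffix regexes are merged into one, and (like `clean_hosts`) it only handles `.co.xx` / `.com.xx` suffixes rather than consulting a full public-suffix list.

```python
import re

def normalize_host(url):
    """Reduce a URL to 'name.suffix', mirroring the steps in clean_hosts."""
    host = str(url)
    # Collapse two-part country suffixes: '.co.uk' -> '.couk', '.com.au' -> '.comau'
    host = re.sub(r'(\.com?)\.([a-z][a-z])$', r'\1\2', host)
    # Keep only the last two dot-separated labels
    host = '.'.join(host.split('.')[-2:])
    # Drop a leading scheme for hosts with a single dot, e.g. 'http://mlb.com'
    return host.split('//')[-1]

examples = ['http://www.domain.com.au', 'http://www.mangareader.net', 'http://mlb.com']
print([normalize_host(u) for u in examples])
```

The collapsed suffix is why `mammabay.co.uk` has to be written as `mammabay.couk` when matching against the cleaned `Host` column.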

In [4]:
fraudtraffic=['featureball.com',
'uvido.com',
'sprungliving.com',
'sweetboxgames.com',
'mammabay.couk', #had to modify because of how I cleaned above
'workingfathertv.com',
'worsthorrorgame.com',
'hourlyparent.com',
'ulterior-movies.com',
'myhomesdesign.com',
'indoorlife.tv',
'bumclub.info',
'psychoworld.tv',
'hunp.us',
'rlinevideos.com']

print "Host counts:\n", imp_df[imp_df["Host"].isin(fraudtraffic)]["Host"].value_counts()#.describe()

Host counts:
uvido.com              692
featureball.com        560
sprungliving.com       228
mammabay.couk          123
workingfathertv.com    120
worsthorrorgame.com     94
hourlyparent.com        85
ulterior-movies.com     83
myhomesdesign.com       69
indoorlife.tv           60
bumclub.info            59
hunp.us                 58
psychoworld.tv          57
rlinevideos.com         55
Name: Host, dtype: int64


Yikes, we're missing one host, 'sweetboxgames.com'. By searching through the file, it appears there was a typo and the host name should be 'sweetxboxgames.com' (the 'x' was missing after 'sweet'). I will modify my `fraudtraffic` list to remedy this.
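
A check like this can be automated rather than done by eye. The sketch below uses the standard library's `difflib.get_close_matches` to suggest near matches for any expected host that is absent from the data. The helper name `find_missing_hosts` and the `cutoff=0.8` similarity threshold are my own assumptions, and the short host lists are illustrative stand-ins for `fraudtraffic` and `imp_df["Host"].unique()`.

```python
import difflib

def find_missing_hosts(expected, observed):
    """Return {missing_host: [close matches among the observed hosts]}."""
    observed = sorted(set(observed))
    suggestions = {}
    for host in expected:
        if host not in observed:
            # cutoff=0.8 is an arbitrary similarity threshold (assumption)
            suggestions[host] = difflib.get_close_matches(host, observed, n=3, cutoff=0.8)
    return suggestions

# Illustrative stand-ins for the known-fraud list and the cleaned Host column
expected_hosts = ['uvido.com', 'sweetboxgames.com']
observed_hosts = ['uvido.com', 'sweetxboxgames.com', 'mangareader.net']
print(find_missing_hosts(expected_hosts, observed_hosts))
```

An empty suggestion list would mean the host is simply not in the day's logs, as opposed to being misspelled.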

In [5]:
fraudtraffic=['featureball.com',
'uvido.com',
'sprungliving.com',
'sweetxboxgames.com', #fixed typo
'mammabay.couk', #had to modify because of how I cleaned above
'workingfathertv.com',
'worsthorrorgame.com',
'hourlyparent.com',
'ulterior-movies.com',
'myhomesdesign.com',
'indoorlife.tv',
'bumclub.info',
'psychoworld.tv',
'hunp.us',
'rlinevideos.com']

print "Host counts:\n", imp_df[imp_df["Host"].isin(fraudtraffic)]["Host"].value_counts()#.describe()

#add the fraud column to the data frame
imp_df['Fraud']=0
imp_df.loc[imp_df["Host"].isin(fraudtraffic),'Fraud']=1

Host counts:
uvido.com              692
featureball.com        560
sprungliving.com       228
sweetxboxgames.com     150
mammabay.couk          123
workingfathertv.com    120
worsthorrorgame.com     94
hourlyparent.com        85
ulterior-movies.com     83
myhomesdesign.com       69
indoorlife.tv           60
bumclub.info            59
hunp.us                 58
psychoworld.tv          57
rlinevideos.com         55
Name: Host, dtype: int64


In [6]:
#print imp_df.Browser.value_counts()
#print imp_df.Browser.head(5)
#print 'Chrome', imp_df.UserAgent.iloc[5]
#print 'Safari', imp_df.UserAgent.iloc[0]
#print 'IE', imp_df.UserAgent.iloc[20]
#print 'Firefox', imp_df.UserAgent.iloc[38]

In [7]:
#import useragent
def lookup_user_agent(ua_string):
    ua=useragent.detect(ua_string)
    if ua.browser.family != 'Other':
        return ua.browser.family+'_'+ua.browser.version
    else:
        return 'Other'
    
tic = timeit.default_timer()
imp_df['UA_Browser_Ver']=imp_df["UserAgent"].map(lambda x: lookup_user_agent(x))
toc = timeit.default_timer()
print 'Seconds to create UA_Browser_Ver: ', toc - tic
imp_df.head(5)

Seconds to create UA_Browser_Ver:  80.0083010197


Unnamed: 0,Timestamp,IPadd,Browser,UserAgent,Host,Inview,Nplugins,Wpossize,Latency,Fraud,UA_Browser_Ver
0,2014-08-25 00:00:00,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,domain.comau,0.0,,"(0,0,1280,629)",0.0,0,Safari_5.1.9
1,2014-08-25 00:00:00,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,domain.comau,0.0,,"(0,0,1280,629)",0.0,0,Safari_5.1.9
2,2014-08-25 00:00:00,325.441.386.395,Unknown,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,mangareader.net,,,,,0,Chrome_36.0.1985
3,2014-08-25 00:00:00,325.441.386.395,Unknown,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,mangareader.net,,,,,0,Chrome_36.0.1985
4,2014-08-25 00:00:00,325.441.386.395,Unknown,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,mangareader.net,,,,,0,Chrome_36.0.1985


In [8]:
def get_ua_browser(browver):
    if 'Chrome' in browver:
        return 'Chrome'
    elif 'Other' in browver:
        return 'Other'
    elif 'Firefox' in browver:
        return 'Firefox'
    elif 'IE' in browver:
        return 'Internet Explorer'
    elif 'Safari' in browver:
        return 'Safari/Webkit'
    elif 'Opera' in browver:
        return 'Opera'
    else:
        return 'Other'
    
print 'UA_Browser from UA_Browser_Ver:\n', imp_df["UA_Browser_Ver"].map(lambda x: x.split('_')[0]).fillna('X').value_counts()
print 'Browser:\n', imp_df["Browser"].value_counts()
tic = timeit.default_timer()
#Add column for browser from user agent string
imp_df['UA_Browser']=imp_df["UA_Browser_Ver"].map(lambda x: get_ua_browser(x))
toc = timeit.default_timer()
print 'Seconds to create UA_Browser: ', toc - tic
print 'UA_Browser:\n', imp_df["UA_Browser"].value_counts()
imp_df.loc[imp_df['Browser']=='Unknown','Browser'] = imp_df.loc[imp_df['Browser']=='Unknown','UA_Browser']
print 'Browser after filling Unknowns\n', imp_df["Browser"].value_counts()
imp_df.head(5)

UA_Browser from UA_Browser_Ver:
Chrome                          84221
Other                           33927
Firefox                         32919
IE                              32796
Safari                          20655
Mobile Safari                   17846
Chrome Mobile                    6217
Android                          4439
Chrome Mobile iOS                 805
IE Mobile                         231
Chrome Frame                      214
Firefox Mobile                    126
Opera                              89
Silk                               73
Maxthon                            68
Chromium                           63
Iceweasel                          54
SeaMonkey                          52
WebKit Nightly                     39
Blackberry WebKit                  21
Nintendo 3DS                       19
PlayStation                        17
Nokia Services (WAP) Browser       17
Opera Mobile                       10
Nokia OSS Browser                   3
RockMelt          

Unnamed: 0,Timestamp,IPadd,Browser,UserAgent,Host,Inview,Nplugins,Wpossize,Latency,Fraud,UA_Browser_Ver,UA_Browser
0,2014-08-25 00:00:00,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,domain.comau,0.0,,"(0,0,1280,629)",0.0,0,Safari_5.1.9,Safari/Webkit
1,2014-08-25 00:00:00,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,domain.comau,0.0,,"(0,0,1280,629)",0.0,0,Safari_5.1.9,Safari/Webkit
2,2014-08-25 00:00:00,325.441.386.395,Chrome,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,mangareader.net,,,,,0,Chrome_36.0.1985,Chrome
3,2014-08-25 00:00:00,325.441.386.395,Chrome,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,mangareader.net,,,,,0,Chrome_36.0.1985,Chrome
4,2014-08-25 00:00:00,325.441.386.395,Chrome,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,mangareader.net,,,,,0,Chrome_36.0.1985,Chrome


In [9]:
def timestamp_to_secs(x):
    return int(x[11:13])*3600  + int(x[-5:-3])*60 + int(x[-2:])

tic = timeit.default_timer()
#Add column that converts timestamp to time of day, in seconds
imp_df["Tsecs"]=imp_df["Timestamp"].map(lambda x: timestamp_to_secs(x))
toc = timeit.default_timer()
print 'Seconds to create Tsecs: ', toc - tic
imp_df.head(5)

Seconds to create Tsecs:  0.544656991959


Unnamed: 0,Timestamp,IPadd,Browser,UserAgent,Host,Inview,Nplugins,Wpossize,Latency,Fraud,UA_Browser_Ver,UA_Browser,Tsecs
0,2014-08-25 00:00:00,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,domain.comau,0.0,,"(0,0,1280,629)",0.0,0,Safari_5.1.9,Safari/Webkit,0
1,2014-08-25 00:00:00,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,domain.comau,0.0,,"(0,0,1280,629)",0.0,0,Safari_5.1.9,Safari/Webkit,0
2,2014-08-25 00:00:00,325.441.386.395,Chrome,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,mangareader.net,,,,,0,Chrome_36.0.1985,Chrome,0
3,2014-08-25 00:00:00,325.441.386.395,Chrome,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,mangareader.net,,,,,0,Chrome_36.0.1985,Chrome,0
4,2014-08-25 00:00:00,325.441.386.395,Chrome,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,mangareader.net,,,,,0,Chrome_36.0.1985,Chrome,0
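
The slicing in `timestamp_to_secs` relies on every timestamp having exactly the `YYYY-MM-DD HH:MM:SS` layout. A stricter variant, sketched here under that same format assumption, parses the string with `datetime.strptime`, so a malformed timestamp raises a `ValueError` instead of silently producing a wrong number. The function name is hypothetical.

```python
from datetime import datetime

def timestamp_to_secs_strict(ts):
    """Seconds since midnight for a 'YYYY-MM-DD HH:MM:SS' timestamp."""
    t = datetime.strptime(ts, '%Y-%m-%d %H:%M:%S')
    return t.hour * 3600 + t.minute * 60 + t.second

print(timestamp_to_secs_strict('2014-08-25 00:02:08'))  # 128
```

The trade-off is speed: `strptime` is noticeably slower per row than string slicing, which may matter on larger logs.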


In [10]:
def window_area(x):
    #Multiplies the width and height of the window
    val=np.nan
    if isinstance(x,str):
        [xp,yp,w,h]=x[1:-1].split(',')
        val=int(w)*int(h)
    return val

tic = timeit.default_timer()
#Add column that contains browser window area
imp_df["Warea"] = imp_df["Wpossize"].map(lambda x: window_area(x))
toc = timeit.default_timer()
print 'Seconds to create Warea: ', toc - tic


def parse_wpos(w,n):
    #Returns field n of the window string (0 = x, 1 = y, 2 = width, 3 = height)
    val=np.nan
    if isinstance(w,str):
        val=[int(x) for x in w[1:-1].split(',')][n]
    return val

tic = timeit.default_timer()
#Add columns that contain browser window position and size
imp_df['WXpos']=imp_df["Wpossize"].map(lambda x: parse_wpos(x,0))
imp_df['WYpos']=imp_df["Wpossize"].map(lambda x: parse_wpos(x,1))
imp_df['Wwidth']=imp_df["Wpossize"].map(lambda x: parse_wpos(x,2))
imp_df['Wheight']=imp_df["Wpossize"].map(lambda x: parse_wpos(x,3))
toc = timeit.default_timer()
print 'Seconds to create WXpos WYPos Wwidth Wheight: ', toc - tic
imp_df.head(5)

Seconds to create Warea:  0.43431019783
Seconds to create WXpos WYPos Wwidth Wheight:  2.16133499146


Unnamed: 0,Timestamp,IPadd,Browser,UserAgent,Host,Inview,Nplugins,Wpossize,Latency,Fraud,UA_Browser_Ver,UA_Browser,Tsecs,Warea,WXpos,WYpos,Wwidth,Wheight
0,2014-08-25 00:00:00,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,domain.comau,0.0,,"(0,0,1280,629)",0.0,0,Safari_5.1.9,Safari/Webkit,0,805120.0,0.0,0.0,1280.0,629.0
1,2014-08-25 00:00:00,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,domain.comau,0.0,,"(0,0,1280,629)",0.0,0,Safari_5.1.9,Safari/Webkit,0,805120.0,0.0,0.0,1280.0,629.0
2,2014-08-25 00:00:00,325.441.386.395,Chrome,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,mangareader.net,,,,,0,Chrome_36.0.1985,Chrome,0,,,,,
3,2014-08-25 00:00:00,325.441.386.395,Chrome,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,mangareader.net,,,,,0,Chrome_36.0.1985,Chrome,0,,,,,
4,2014-08-25 00:00:00,325.441.386.395,Chrome,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,mangareader.net,,,,,0,Chrome_36.0.1985,Chrome,0,,,,,
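
The cell above splits each `Wpossize` string five times (once for the area and once per field). A single-pass alternative is sketched below. `parse_window` is a hypothetical helper, and returning `None` for missing values is my choice here, whereas the notebook cell uses NaN.

```python
def parse_window(wpossize):
    """Parse '(x,y,width,height)' into a dict; return None for missing values."""
    if not isinstance(wpossize, str):
        return None  # NaN (a float) or None marks a missing measurement
    x, y, width, height = (int(v) for v in wpossize.strip('()').split(','))
    return {'x': x, 'y': y, 'width': width, 'height': height, 'area': width * height}

print(parse_window('(0,0,1280,629)'))
print(parse_window(float('nan')))
```

For the first row of the data, `(0,0,1280,629)`, this gives the same area as the `Warea` column (805120).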


In [54]:
host_counts = imp_df["Host"].value_counts()
print host_counts
print host_counts.describe()
sighost_counts = host_counts[host_counts>8]
keepHosts = sighost_counts.index.map(lambda x: str(x))
#print keepHosts
print len(keepHosts)
redhost_imp_df = imp_df[imp_df["Host"].isin(keepHosts)]

uvido.com              692
featureball.com        560
sprungliving.com       228
sweetxboxgames.com     150
mammabay.couk          123
workingfathertv.com    120
worsthorrorgame.com     94
hourlyparent.com        85
ulterior-movies.com     83
myhomesdesign.com       69
indoorlife.tv           60
bumclub.info            59
hunp.us                 58
psychoworld.tv          57
rlinevideos.com         55
Name: Host, dtype: int64
count     15.000000
mean     166.200000
std      193.938576
min       55.000000
25%       59.500000
50%       85.000000
75%      136.500000
max      692.000000
Name: Host, dtype: float64
['uvido.com' 'featureball.com' 'sprungliving.com' 'sweetxboxgames.com'
 'mammabay.couk' 'workingfathertv.com' 'worsthorrorgame.com'
 'hourlyparent.com' 'ulterior-movies.com' 'myhomesdesign.com'
 'indoorlife.tv' 'bumclub.info' 'hunp.us' 'psychoworld.tv'
 'rlinevideos.com']
15


In [36]:
dum_imp_df = pd.get_dummies(redhost_imp_df,columns=['Browser'])
dum_imp_df.head(5)

Unnamed: 0,Timestamp,IPadd,UserAgent,Host,Inview,Nplugins,Wpossize,Latency,Fraud,UA_Browser_Ver,UA_Browser,Tsecs,Warea,WXpos,WYpos,Wwidth,Wheight,Browser_Chrome,Browser_Firefox,Browser_Internet Explorer,Browser_Opera,Browser_Other,Browser_Safari/Webkit
0,2014-08-25 00:00:00,393.414.443.469,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,domain.comau,0.0,,"(0,0,1280,629)",0.0,0,Safari_5.1.9,Safari/Webkit,0,805120.0,0.0,0.0,1280.0,629.0,0,0,0,0,0,1
1,2014-08-25 00:00:00,393.414.443.469,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,domain.comau,0.0,,"(0,0,1280,629)",0.0,0,Safari_5.1.9,Safari/Webkit,0,805120.0,0.0,0.0,1280.0,629.0,0,0,0,0,0,1
2,2014-08-25 00:00:00,325.441.386.395,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,mangareader.net,,,,,0,Chrome_36.0.1985,Chrome,0,,,,,,1,0,0,0,0,0
3,2014-08-25 00:00:00,325.441.386.395,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,mangareader.net,,,,,0,Chrome_36.0.1985,Chrome,0,,,,,,1,0,0,0,0,0
4,2014-08-25 00:00:00,325.441.386.395,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,mangareader.net,,,,,0,Chrome_36.0.1985,Chrome,0,,,,,,1,0,0,0,0,0


In [293]:
#print imp_df[imp_df['Host'].isin(fraudtraffic)]["IPadd"].value_counts()[:10]
#plt.hist(imp_df['Tsecs'].iloc[imp_df["IPadd"]=='411.517.507.552'])
#plt.subplot(411)
#histdat = plt.hist(imp_df[imp_df["IPadd"]=='411.517.507.552']['Tsecs'].values,bins=24)
#times = imp_df[imp_df["Host"].isin(fraudtraffic)]['Tsecs'].values
times = imp_df[imp_df["Host"]=='foxsports.com']['Tsecs'].values
diffs = np.diff(times)  #gaps between consecutive impressions (vectorized form of the element-wise loop)
xcoords = diffs[:-1]
ycoords = diffs[1:]


plt.subplot(131)
plt.plot(xcoords+2,ycoords+2,'b.',alpha=.2)
#plt.xlim(0,10**3)
#plt.ylim(0,10**3)
plt.xscale('log')
plt.yscale('log')
plt.subplot(132)
#plt.figure()
myxlim = 30
plt.hist(xcoords[xcoords<=myxlim],range(0,myxlim),normed=True,cumulative=True)
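import matplotlib.mlab as mlab
#y=mlab.normpdf(range(0,myxlim),np.mean(xcoords[xcoords<=myxlim]),np.std(xcoords[xcoords<=myxlim])).cumsum()
#Cumulative normal curve with the same mean and standard deviation as the gaps
y=mlab.normpdf(range(0,np.max(xcoords)),np.mean(xcoords),np.std(xcoords)).cumsum()
y/=y[-1]
#Plot only the first myxlim points so the x and y arrays have equal length
l=plt.plot(range(0,len(y[:myxlim])),y[:myxlim],'k--')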
plt.subplot(133)
plt.hist(times,bins=96)
plt.yscale('log')
f = plt.gcf()
f.set_size_inches(14,6)
#import spectrum
#p = spectrum.Periodogram()

KeyboardInterrupt: 

<matplotlib.figure.Figure at 0x7f8a64409b90>
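
The first panel of the plot above is a lag plot of inter-arrival times: each gap between consecutive impressions is plotted against the next gap. The data preparation behind it can be sketched without NumPy as follows. `interarrival_pairs` is a hypothetical name, and sorting the times first is my assumption (the cell above relies on the log already being in time order).

```python
def interarrival_pairs(times):
    """Return (diffs, pairs): consecutive gaps and (gap[i], gap[i+1]) tuples."""
    ordered = sorted(times)  # assumption: tolerate out-of-order input
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    pairs = list(zip(diffs[:-1], diffs[1:]))
    return diffs, pairs

# Made-up seconds-of-day values for illustration
tsecs = [0, 30, 60, 90, 125]
diffs, pairs = interarrival_pairs(tsecs)
print(diffs)
print(pairs)
```

Points piling up on a single spot in such a plot would suggest impressions arriving at a fixed interval, which is one pattern worth looking for in scripted traffic.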

In [185]:
#Split the data by fraud condition
fdata = dum_imp_df[dum_imp_df["Host"].isin(fraudtraffic)]
nfdata = dum_imp_df[~dum_imp_df["Host"].isin(fraudtraffic)]
print fdata.head(3)
fdata['IPadd'].value_counts()

                Timestamp            IPadd  \
279   2014-08-25 00:02:08  574.491.567.341   
311   2014-08-25 00:02:23  476.494.399.426   
1186  2014-08-25 00:11:03  324.338.423.496   

                                              UserAgent                Host  \
279   Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7....     rlinevideos.com   
311   Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.3...    sprungliving.com   
1186  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...  sweetxboxgames.com   

      Inview  Nplugins        Wpossize  Latency  Fraud    UA_Browser_Ver  \
279        1       NaN  (0,0,1024,673)     3496      1             Other   
311        0         3  (0,0,1024,706)        0      1  Chrome_35.0.1916   
1186       0         1             NaN      NaN      1  Chrome_34.0.1847   

     UA_Browser  Tsecs   Warea  WXpos  WYpos  Wwidth  Wheight  Browser_Chrome  \
279       Other    128  689152      0      0    1024      673               0   
311      Chrome    143  722944 

324.338.423.496    344
496.529.325.519    219
574.491.567.341    208
476.494.399.426    186
529.366.487.475    156
496.437.522.387    139
574.452.484.501    136
489.462.542.447     83
496.325.356.347     76
438.562.383.508     66
476.391.537.495     58
527.358.505.356     57
324.525.562.389     54
476.516.393.518     49
529.324.449.376     47
544.552.576.426     42
446.488.377.562     42
445.435.337.514     38
445.438.339.329     33
376.394.517.438     33
476.356.528.363     33
529.330.395.356     32
412.498.354.369     31
441.378.464.557     31
518.368.391.462     30
476.565.329.537     28
525.537.550.349     27
438.378.504.483     24
525.470.434.544     19
445.503.427.560     17
                  ... 
496.361.565.416      9
378.564.404.345      8
445.555.342.420      7
441.361.494.512      7
452.434.392.353      6
376.346.326.421      5
476.516.393.572      5
476.356.392.517      5
324.453.396.518      4
527.500.570.335      4
496.438.406.528      3
508.513.388.432      3
445.566.570

In [37]:
dum_imp_df.groupby('Fraud').median()
#imp_df.Host.value_counts()

Fraud,Inview,Nplugins,Latency,Tsecs,Warea,WXpos,WYpos,Wwidth,Wheight,Browser_Chrome,Browser_Firefox,Browser_Internet Explorer,Browser_Opera,Browser_Other,Browser_Safari/Webkit
0,0,10,53,50461,925056,0,0,1296,730,0,0,0,0,0,0
1,1,3,127,49092,723968,0,0,1024,706,0,0,0,0,0,0


In [39]:
dum_imp_df.groupby('Fraud').mean()

Fraud,Inview,Nplugins,Latency,Tsecs,Warea,WXpos,WYpos,Wwidth,Wheight,Browser_Chrome,Browser_Firefox,Browser_Internet Explorer,Browser_Opera,Browser_Other,Browser_Safari/Webkit
0,0.428204,10.247782,283.620772,48853.426236,41199390000000.0,30754.749088,30747.681905,27810.803089,27278.966099,0.364143,0.135789,0.279755,0.000353,0.014416,0.205543
1,0.749683,5.516657,2381.751429,47430.848777,863500.1,-0.079012,-0.079012,1158.57037,701.706173,0.469314,0.067389,0.456077,0.0,0.002006,0.005215


In [65]:
fdata.groupby('Host').mean()

Host,Inview,Nplugins,Latency,Fraud,Tsecs,Warea,WXpos,WYpos,Wwidth,Wheight,Browser_Chrome,Browser_Firefox,Browser_Internet Explorer,Browser_Opera,Browser_Other,Browser_Safari/Webkit
bumclub.info,0.058824,,22.823529,1,52148.508475,950340.0,0.0,0.0,1348.0,705.0,0.0,0.0,1.0,0,0.0,0.0
featureball.com,0.957066,6.254973,2985.175141,1,41581.533929,761686.504604,0.0,0.0,1084.28361,678.399632,0.701786,0.126786,0.158929,0,0.0,0.0125
hourlyparent.com,0.455696,6.305556,2973.594937,1,43491.588235,807671.3125,0.0,0.0,1148.8125,688.4375,0.364706,0.0,0.635294,0,0.0,0.0
hunp.us,0.0,,0.0,1,41193.137931,0.0,0.0,0.0,0.0,0.0,0.87931,0.12069,0.0,0,0.0,0.0
indoorlife.tv,0.929825,2.804348,346.107143,1,39963.216667,1280151.018868,0.0,0.0,1423.075472,879.415094,0.0,0.0,0.95,0,0.05,0.0
mammabay.couk,0.452991,3.625,560.8,1,49841.934959,1007426.715517,-0.068966,-0.068966,1361.862069,738.112069,0.065041,0.02439,0.910569,0,0.0,0.0
myhomesdesign.com,0.650794,1.05,856.419355,1,52933.173913,1129792.390244,-1.170732,-1.170732,1497.268293,748.780488,0.608696,0.014493,0.362319,0,0.014493,0.0
psychoworld.tv,0.473684,3.0,321.487179,1,48530.894737,1947836.717949,0.0,0.0,1780.871795,1017.025641,0.0,0.0,1.0,0,0.0,0.0
rlinevideos.com,0.545455,3.45,1082.181818,1,66072.018182,455706.036364,0.0,0.0,631.127273,412.145455,0.145455,0.0,0.854545,0,0.0,0.0
sprungliving.com,0.634361,6.030769,682.125561,1,59687.820175,974671.718182,-0.472727,-0.472727,1311.695455,721.022727,0.179825,0.0,0.820175,0,0.0,0.0


In [67]:
#Hosts with no Nplugins readings get the mean of the other fraud hosts' per-host means
nplug_fillna_val = np.nanmean(fdata.groupby('Host')['Nplugins'].mean().values)
fhost_df = fdata.groupby('Host').mean()
fhost_df['Nplugins'] = fhost_df['Nplugins'].fillna(nplug_fillna_val)
fhost_df

Host,Inview,Nplugins,Latency,Fraud,Tsecs,Warea,WXpos,WYpos,Wwidth,Wheight,Browser_Chrome,Browser_Firefox,Browser_Internet Explorer,Browser_Opera,Browser_Other,Browser_Safari/Webkit
bumclub.info,0.058824,3.62342,22.823529,1,52148.508475,950340.0,0.0,0.0,1348.0,705.0,0.0,0.0,1.0,0,0.0,0.0
featureball.com,0.957066,6.254973,2985.175141,1,41581.533929,761686.504604,0.0,0.0,1084.28361,678.399632,0.701786,0.126786,0.158929,0,0.0,0.0125
hourlyparent.com,0.455696,6.305556,2973.594937,1,43491.588235,807671.3125,0.0,0.0,1148.8125,688.4375,0.364706,0.0,0.635294,0,0.0,0.0
hunp.us,0.0,3.62342,0.0,1,41193.137931,0.0,0.0,0.0,0.0,0.0,0.87931,0.12069,0.0,0,0.0,0.0
indoorlife.tv,0.929825,2.804348,346.107143,1,39963.216667,1280151.018868,0.0,0.0,1423.075472,879.415094,0.0,0.0,0.95,0,0.05,0.0
mammabay.couk,0.452991,3.625,560.8,1,49841.934959,1007426.715517,-0.068966,-0.068966,1361.862069,738.112069,0.065041,0.02439,0.910569,0,0.0,0.0
myhomesdesign.com,0.650794,1.05,856.419355,1,52933.173913,1129792.390244,-1.170732,-1.170732,1497.268293,748.780488,0.608696,0.014493,0.362319,0,0.014493,0.0
psychoworld.tv,0.473684,3.0,321.487179,1,48530.894737,1947836.717949,0.0,0.0,1780.871795,1017.025641,0.0,0.0,1.0,0,0.0,0.0
rlinevideos.com,0.545455,3.45,1082.181818,1,66072.018182,455706.036364,0.0,0.0,631.127273,412.145455,0.145455,0.0,0.854545,0,0.0,0.0
sprungliving.com,0.634361,6.030769,682.125561,1,59687.820175,974671.718182,-0.472727,-0.472727,1311.695455,721.022727,0.179825,0.0,0.820175,0,0.0,0.0


In [88]:
nfdata.groupby('Host').mean().head(3)

Host,Inview,Nplugins,Latency,Fraud,Tsecs,Warea,WXpos,WYpos,Wwidth,Wheight,Browser_Chrome,Browser_Firefox,Browser_Internet Explorer,Browser_Opera,Browser_Other,Browser_Safari/Webkit
1001jogos.pt,0.153846,16.0,230.75,0,38640.769231,,,,,,1.0,0.0,0.0,0,0,0.0
10beauty.com,0.181818,,199.636364,0,35338.272727,948000.0,0.0,0.0,1174.0,798.8,0.0,0.0,1.0,0,0,0.0
123greetings.com,0.261905,9.939394,467.352941,0,43512.680851,1128893.058824,-1.411765,1.176471,1436.294118,763.352941,0.255319,0.255319,0.340426,0,0,0.148936


In [89]:
nfdata.groupby('Host').mean().dropna().head(3)

Host,Inview,Nplugins,Latency,Fraud,Tsecs,Warea,WXpos,WYpos,Wwidth,Wheight,Browser_Chrome,Browser_Firefox,Browser_Internet Explorer,Browser_Opera,Browser_Other,Browser_Safari/Webkit
123greetings.com,0.261905,9.939394,467.352941,0,43512.680851,1128893.058824,-1.411765,1.176471,1436.294118,763.352941,0.255319,0.255319,0.340426,0,0,0.148936
14news.com,0.428571,11.0,350.166667,0,23959.5,814549.142857,0.0,0.0,1126.0,715.571429,0.8,0.0,0.0,0,0,0.2
192.com,0.428571,2.0,69.571429,0,60819.571429,870142.0,0.0,0.0,1366.0,637.0,0.0,0.0,1.0,0,0,0.0


Host,Inview,Nplugins,Latency,Fraud,Tsecs,Warea,WXpos,WYpos,Wwidth,Wheight,Browser_Chrome,Browser_Firefox,Browser_Internet Explorer,Browser_Opera,Browser_Other,Browser_Safari/Webkit
bumclub.info,0.117647,7.24684,45.647059,2,104297.016949,1900680.0,0.0,0.0,2696.0,1410.0,0.0,0.0,2.0,0,0.0,0.0
featureball.com,1.914132,12.509946,5970.350282,2,83163.067857,1523373.009208,0.0,0.0,2168.567219,1356.799263,1.403571,0.253571,0.317857,0,0.0,0.025
hourlyparent.com,0.911392,12.611111,5947.189873,2,86983.176471,1615342.625,0.0,0.0,2297.625,1376.875,0.729412,0.0,1.270588,0,0.0,0.0
hunp.us,0.0,7.24684,0.0,2,82386.275862,0.0,0.0,0.0,0.0,0.0,1.758621,0.241379,0.0,0,0.0,0.0
indoorlife.tv,1.859649,5.608696,692.214286,2,79926.433333,2560302.037736,0.0,0.0,2846.150943,1758.830189,0.0,0.0,1.9,0,0.1,0.0
mammabay.couk,0.905983,7.25,1121.6,2,99683.869919,2014853.431034,-0.137931,-0.137931,2723.724138,1476.224138,0.130081,0.04878,1.821138,0,0.0,0.0
myhomesdesign.com,1.301587,2.1,1712.83871,2,105866.347826,2259584.780488,-2.341463,-2.341463,2994.536585,1497.560976,1.217391,0.028986,0.724638,0,0.028986,0.0
psychoworld.tv,0.947368,6.0,642.974359,2,97061.789474,3895673.435897,0.0,0.0,3561.74359,2034.051282,0.0,0.0,2.0,0,0.0,0.0
rlinevideos.com,1.090909,6.9,2164.363636,2,132144.036364,911412.072727,0.0,0.0,1262.254545,824.290909,0.290909,0.0,1.709091,0,0.0,0.0
sprungliving.com,1.268722,12.061538,1364.251121,2,119375.640351,1949343.436364,-0.945455,-0.945455,2623.390909,1442.045455,0.359649,0.0,1.640351,0,0.0,0.0


In [111]:
print 'Fraud data shape: ',fhost_df.shape
nfhost_df = nfdata.groupby('Host').mean().dropna()
print 'Nonfraud data shape: ',nfhost_df.shape
allhost_df = pd.concat([nfhost_df, fhost_df])
print 'Combined data shape: ',allhost_df.shape

Fraud data shape:  (15, 16)
Nonfraud data shape:  (1939, 16)
Combined data shape:  (1954, 16)


In [133]:
from sklearn.ensemble import RandomForestClassifier
#from sklearn import preprocessing
#allhost_df = pd.concat([nfhost, fhost])
predictors = ['Inview','Nplugins','Latency','Tsecs','Warea',
              'WXpos','WYpos',
              'Wwidth','Wheight','Browser_Chrome','Browser_Firefox',
              'Browser_Internet Explorer','Browser_Opera','Browser_Other','Browser_Safari/Webkit']
this_split_df = pd.concat([nfhost_df[:13], fhost_df[:13]])
clf = RandomForestClassifier(warm_start=True)

In [153]:
model=[]

for x in range(10):
    print 'Training', x, 'model'
    #Fit a fresh classifier each round: refitting one warm_start classifier without
    #raising n_estimators adds no new trees, so every round would reuse the first fit
    clf = RandomForestClassifier()
    #Balance the classes: 15 random non-fraud hosts against the first 13 fraud hosts
    rows=np.random.choice(nfhost_df.index.values,15,replace=False)
    this_split_df = pd.concat([nfhost_df.loc[rows], fhost_df[:13]])
    clf_fit = clf.fit(this_split_df[predictors],this_split_df['Fraud'])
    #Score on another random draw of 15 non-fraud hosts
    rows=np.random.choice(nfhost_df.index.values,15,replace=False)
    print clf_fit.score(allhost_df[predictors].loc[rows], allhost_df['Fraud'].loc[rows])
    print clf_fit.predict(fhost_df[predictors])#.values[14])
    model.append(clf_fit)

    

Training 0 model
0.866666666667
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.]
Training 1 model
1.0
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.]
Training 2 model
0.933333333333
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.]
Training 3 model
0.933333333333
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.]
Training 4 model
0.866666666667
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.]
Training 5 model
0.8
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.]
Training 6 model
1.0
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.]
Training 7 model
0.866666666667
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.]
Training 8 model
0.933333333333
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.]
Training 9 model
1.0
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.]
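
A note on the loop above: `clf` is created with `warm_start=True` and `n_estimators` never increases, so as far as I can tell scikit-learn grows no new trees on the later `fit` calls, and `model` ends up holding ten references to the same object. That would explain why every prediction row is identical. Below is a minimal sketch of training independent models on balanced subsamples. It uses synthetic data in place of `nfhost_df` and `fhost_df`, the helper names `train_balanced_forests` and `mean_fraud_probability` are hypothetical, and it is written for current pandas and scikit-learn rather than the versions printed at the top of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_balanced_forests(nonfraud, fraud, predictors, n_models=10, seed=0):
    """Fit n_models separate forests, each on all fraud rows plus an
    equally sized random sample of non-fraud rows (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    models = []
    for i in range(n_models):
        rows = rng.choice(nonfraud.index.values, len(fraud), replace=False)
        split = pd.concat([nonfraud.loc[rows], fraud])
        # A new estimator per iteration, so each model is fitted from scratch.
        clf = RandomForestClassifier(n_estimators=10, random_state=seed + i)
        models.append(clf.fit(split[predictors], split['Fraud']))
    return models

def mean_fraud_probability(models, X):
    """Average P(Fraud = 1) over the ensemble, one value per row of X."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

# Synthetic stand-in data (assumption): two features, fraud shifted upward.
rng = np.random.RandomState(1)
predictors = ['Inview', 'Nplugins']
nonfraud = pd.DataFrame(rng.normal(0.0, 1.0, (200, 2)), columns=predictors)
nonfraud['Fraud'] = 0
fraud = pd.DataFrame(rng.normal(3.0, 1.0, (15, 2)), columns=predictors)
fraud['Fraud'] = 1
fraud.index = range(200, 215)

models = train_balanced_forests(nonfraud, fraud, predictors)
probs = mean_fraud_probability(models, pd.concat([nonfraud, fraud])[predictors])
print(len(set(id(m) for m in models)), probs.shape)
```

Averaging `predict_proba` gives a score per host that can be ranked or thresholded, which fits the task of listing hosts by likely fraud.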


In [163]:
from sklearn.metrics import confusion_matrix

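# Average the models' predictions for every host in a single call.
# DataFrame.apply hands over one column (1954 values) at a time, so calling
# m.predict inside apply raised:
#   ValueError: Number of features of the model must match the input.
#   Model n_features is 15 and input n_features is 1954
preds = np.mean([m.predict(allhost_df[predictors]) for m in model], axis=0)

#confusion_matrix(allhost_df['Fraud'], preds > 0.5)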

In [160]:
nfhost_df.loc[model[0].predict(nfhost_df[predictors]).astype(bool)]

Host,Inview,Nplugins,Latency,Fraud,Tsecs,Warea,WXpos,WYpos,Wwidth,Wheight,Browser_Chrome,Browser_Firefox,Browser_Internet Explorer,Browser_Opera,Browser_Other,Browser_Safari/Webkit
50connect.couk,0.730769,2.342105,301.205128,0,50263.666667,896810.933333,0.000000,0.000000,1343.666667,667.866667,0.012346,0.012346,0.938272,0,0.037037,0.000000
5mostpopular.com,0.214286,2.000000,1053.230769,0,52054.642857,922741.333333,0.000000,0.000000,1125.333333,807.833333,0.000000,0.000000,1.000000,0,0.000000,0.000000
advanceautoparts.com,0.230769,9.142857,874.875000,0,61069.206897,535928.250000,0.000000,0.000000,982.083333,469.583333,0.206897,0.000000,0.793103,0,0.000000,0.000000
advcm.com,0.000000,11.000000,1159.142857,0,43810.555556,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0,0.000000,0.000000
affhealth.com,0.777778,1.000000,1079.000000,0,41026.444444,1013165.714286,-1.142857,-1.142857,1436.571429,698.428571,0.777778,0.000000,0.222222,0,0.000000,0.000000
ahaber.comtr,0.333333,2.000000,697.571429,0,58776.222222,826119.000000,0.000000,0.000000,1269.000000,651.000000,0.000000,0.000000,1.000000,0,0.000000,0.000000
alldayfashions.com,0.181818,3.250000,1852.000000,0,41642.000000,812886.571429,-2.000000,-2.000000,1246.285714,626.000000,0.272727,0.090909,0.636364,0,0.000000,0.000000
americanprofile.com,0.518248,7.153846,1866.311111,0,54460.000000,971511.945946,-0.504505,-0.504505,1324.198198,711.117117,0.133803,0.014085,0.852113,0,0.000000,0.000000
artinstitutes.edu,0.166667,2.000000,233.916667,0,45196.083333,1028208.000000,-8.000000,-8.000000,1382.000000,744.000000,0.000000,0.000000,1.000000,0,0.000000,0.000000
avimoya.com,0.583333,3.000000,1349.166667,0,58844.750000,1245209.600000,-4.800000,-4.800000,1494.400000,821.600000,0.083333,0.000000,0.916667,0,0.000000,0.000000


In [101]:
from sklearn import cross_validation
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
scores = cross_validation.cross_val_score(alg, allhost_df[predictors], allhost_df["Fraud"], cv=8)
print scores
print scores.mean()

[ 0.99183673  0.99183673  0.99183673  0.99180328  0.99180328  0.99180328
  0.99180328  0.99588477]
0.992326011562
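
With 15 fraud hosts among 1954, a classifier that always predicts non-fraud already scores 1939/1954, about 0.992, which is roughly what the cross-validation above reports. Accuracy therefore says little here. The sketch below compares accuracy with recall on the fraud class. It uses synthetic data and the current `sklearn.model_selection` API, both of which are assumptions (this notebook uses the older `cross_validation` module).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in (assumption): 1939 non-fraud rows, 15 fraud rows.
rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0.0, 1.0, (1939, 3)), rng.normal(1.5, 1.0, (15, 3))])
y = np.array([0] * 1939 + [1] * 15)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
baseline = DummyClassifier(strategy='most_frequent')
forest = RandomForestClassifier(n_estimators=10, random_state=1)

# Accuracy of the "always non-fraud" baseline is already about 0.992.
base_acc = cross_val_score(baseline, X, y, cv=cv, scoring='accuracy').mean()
# Recall on the fraud class exposes what accuracy hides.
base_recall = cross_val_score(baseline, X, y, cv=cv, scoring='recall').mean()
forest_recall = cross_val_score(forest, X, y, cv=cv, scoring='recall').mean()
print(base_acc, base_recall, forest_recall)
```

Any model for this problem should be judged against that baseline on a metric that looks at the fraud class directly.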


In [95]:
clf_fit.score(allhost_df[predictors], allhost_df['Fraud'])

0.99795291709314227

In [118]:
kf_nf = cross_validation.KFold(allhost_df.shape[0], n_folds=3, random_state=1)
for train, test in kf_nf:
    print allhost_df['Fraud'].iloc[train].sum()
    print test[:5]

15.0
[0 1 2 3 4]
15.0
[652 653 654 655 656]
0.0
[1303 1304 1305 1306 1307]
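
The fold output above shows the problem with unshuffled `KFold` on this table: `allhost_df` has the fraud hosts appended at the end, so the third training split contains none of them and the first two test splits contain none. Here is a small sketch of how stratified folds spread the fraud rows, assuming the current `sklearn.model_selection` API and a synthetic label vector laid out the same way.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Labels laid out like allhost_df: 1939 non-fraud rows, then 15 fraud rows.
y = np.array([0] * 1939 + [1] * 15)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

plain = [int(y[train].sum()) for train, _ in KFold(n_splits=3).split(X)]
strat = [int(y[train].sum())
         for train, _ in StratifiedKFold(n_splits=3, shuffle=True,
                                         random_state=1).split(X, y)]
print('fraud rows per training split, KFold:', plain)
print('fraud rows per training split, StratifiedKFold:', strat)
```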


In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# The algorithms we want to ensemble. Both use the full predictor list.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), predictors],
    [LogisticRegression(random_state=1), predictors]
]

# Initialize stratified cross validation folds so that every training and
# test split contains some of the fraud hosts.
skf = cross_validation.StratifiedKFold(allhost_df['Fraud'], n_folds=3, shuffle=True, random_state=1)
predictions = []
for train, test in skf:
    train_target = allhost_df['Fraud'].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, alg_predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(allhost_df[alg_predictors].iloc[train,:], train_target)
        # Select and predict on the test fold.
        # The .astype(float) converts the dataframe to all floats and avoids an sklearn error.
        test_predictions = alg.predict_proba(allhost_df[alg_predictors].iloc[test,:].astype(float))[:,1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme: average the predicted probabilities to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is a 1 prediction, and anything at or below .5 is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)


In [50]:
final_df = dum_imp_df.groupby('Host').mean().dropna()
print final_df.shape
print final_df.describe()
print 'Number of orig fraud host: ', len(fraudtraffic)
print 'Number of fraud hosts now: ', sum(final_df['Fraud']==1)

(1952, 16)
            Inview     Nplugins       Latency        Fraud         Tsecs  \
count  1952.000000  1952.000000   1952.000000  1952.000000   1952.000000   
mean      0.457692     9.597534    379.542030     0.006660  48746.020906   
std       0.239451     5.585981    744.015352     0.081356  13205.936813   
min       0.000000     1.000000      0.000000     0.000000   2108.250000   
25%       0.285714     5.500000     97.904227     0.000000  41007.492898   
50%       0.444444     9.803813    181.245833     0.000000  49650.151030   
75%       0.624403    13.000000    364.135417     0.000000  57279.096154   
max       1.000000    37.000000  12660.235294     1.000000  84369.000000   

              Warea         WXpos         WYpos        Wwidth       Wheight  \
count  1.952000e+03  1.952000e+03  1.952000e+03  1.952000e+03  1.952000e+03   
mean   1.041620e+14  7.655938e+04  7.655125e+04  7.040713e+04  6.990065e+04   
std    4.586037e+15  3.359440e+06  3.359440e+06  3.071944e+06  3.07

In [10]:
cols_OI = ['Browser','Host','Inview','Nplugins','Latency','Warea','Tsecs','Fraud','WXpos','WYpos','Wwidth','Wheight']
int_imp_df = imp_df.loc[:,cols_OI]
print int_imp_df.describe()
int_imp_df.dropna(inplace=True)
print int_imp_df.describe()

              Inview       Nplugins         Latency         Warea  \
count  211810.000000  162303.000000   200279.000000  1.184500e+05   
mean        0.431996      10.139424      345.189685  3.765405e+13   
std         0.495355       7.637193    13292.948056  9.158285e+15   
min         0.000000       1.000000        0.000000  0.000000e+00   
25%         0.000000       2.000000        0.000000  7.144180e+05   
50%         0.000000      10.000000       54.000000  9.247820e+05   
75%         1.000000      15.000000      179.000000  1.262400e+06   
max         1.000000      61.000000  5866142.000000  2.228787e+18   

               Tsecs          Fraud         WXpos         WYpos        Wwidth  \
count  234945.000000  234945.000000  1.184500e+05  1.184500e+05  1.184500e+05   
mean    48780.596480       0.010611  2.810657e+04  2.809878e+04  2.552057e+04   
std     21935.404297       0.102462  6.709700e+06  6.709700e+06  6.136262e+06   
min         0.000000       0.000000 -4.000000e+04 -4.0

In [11]:
int_imp_df.Fraud.value_counts()

0    85571
1     1279
Name: Fraud, dtype: int64

In [None]:
axes1 = pd.tools.plotting.scatter_matrix(int_imp_df.loc[int_imp_df['Fraud']==1,:], color="brown")
f1 = plt.gcf()
f1.set_size_inches(10,8)
axes2 = pd.tools.plotting.scatter_matrix(int_imp_df.loc[int_imp_df['Fraud']==0,:], color="blue")
f2 = plt.gcf()
f2.set_size_inches(10,8)

In [None]:
imp_df['Fraud']=0
imp_df.loc[imp_df["Host"].isin(fraudtraffic),'Fraud']=1

In [None]:
fdata = imp_df[imp_df["Host"].isin(fraudtraffic)]
nfdata = imp_df[~imp_df["Host"].isin(fraudtraffic)]
#ax = fdata.plot(kind='scatter', x="UA_Browser_Ver", y="Warea", style="o")
#ax2 = nfdata.plot(kind='scatter', x='UA_Browser_Ver', y='Warea')
#f, axs = plt.subplots(1, 2)
#fdata["Nplugins"].value_counts(normalize=True).plot(kind='bar',ax=axs[0])
axes = fdata.boxplot(column="Nplugins",by="Browser")
axes.set_title('Nplugins (fraud)')
plt.ylim(0,35)
#nfdata["Nplugins"].value_counts(normalize=True).plot(kind='bar',ax=axs[1])
axes2 = nfdata.boxplot(column="Nplugins",by="Browser")
axes2.set_title('Nplugins (nonfraud)')
plt.ylim(0,35)

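plt.figure()
# Series.plot returns the Axes; plt.axis is a function and has no set_title
ax3 = fdata["Browser"].value_counts(normalize=True).sort_index().plot(kind='bar',color="brown")
ax3.set_title('Browser Counts (Fraud)')
plt.figure()
ax4 = nfdata["Browser"].value_counts(normalize=True).sort_index().plot(kind='bar',color="blue")
ax4.set_title('Browser Counts (Nonfraud)')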

#plt.figure()
#imp_df['Browser'].value_counts().plot()

In [None]:
pd.tools.plotting.scatter_matrix(fdata[['Nplugins','Warea','Browser','Inview']], color="brown")
pd.tools.plotting.scatter_matrix(nfdata[['Nplugins','Warea','Browser','Inview']], color="blue")

In [None]:
print imp_df.describe()
print fdata['Warea'].describe()
print nfdata['Warea'].describe()


### Assumptions
* The logs are correct. When one IP address is recorded against the same host 3 times in a second, I assume that is what occurred, although some entries may simply be duplicates caused by recording errors.
* ERROR hosts are to be disregarded. With no valid host, the only remaining information is the IP address and the browser/user agent info.
* The host name greatboxgames.com is assumed to be a typo for greatxboxgames.com.

### Cleaning and Feature Engineering
* Remove ERROR hosts
* Standardize hosts by removing a preceding 'http://www.|go.|video.'
* Convert timestamp to seconds in the day
* Create browser window area feature
* Create inter-impression interval (III) feature
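
The steps in this list could be sketched roughly as follows. This is not the notebook's actual cleaning code. The helper name `clean_and_add_features`, the timestamp format, the comma-separated `"x,y,width,height"` layout of `Wpossize`, and the sample rows are all assumptions made for illustration, and the code targets current pandas.

```python
import pandas as pd

def clean_and_add_features(df):
    """Hypothetical helper mirroring the cleaning and feature steps."""
    df = df[df['Host'] != 'ERROR'].copy()
    # Strip an optional scheme and a leading www./go./video. label.
    df['Host'] = df['Host'].str.replace(
        r'^(?:https?://)?(?:www\.|go\.|video\.)?', '', regex=True)
    # Seconds since midnight (assumes a parseable date-time string).
    ts = pd.to_datetime(df['Timestamp'])
    df['Tsecs'] = ts.dt.hour * 3600 + ts.dt.minute * 60 + ts.dt.second
    # Window geometry (assumes "x,y,width,height" in a single string).
    geom = df['Wpossize'].str.split(',', expand=True).astype(float)
    geom.columns = ['WXpos', 'WYpos', 'Wwidth', 'Wheight']
    df = df.join(geom)
    df['Warea'] = df['Wwidth'] * df['Wheight']
    # Inter-impression interval: seconds since the same IP's previous hit.
    df = df.sort_values(['IPadd', 'Tsecs'])
    df['III'] = df.groupby('IPadd')['Tsecs'].diff()
    return df

# Made-up sample rows, for illustration only.
sample = pd.DataFrame({
    'Timestamp': ['2015-01-01 00:00:10', '2015-01-01 00:00:13', '2015-01-01 01:00:00'],
    'IPadd': ['1.1.1.1', '1.1.1.1', '2.2.2.2'],
    'Host': ['http://www.example.com', 'video.example.com', 'ERROR'],
    'Wpossize': ['0,0,800,600', '0,0,1024,768', '0,0,10,10'],
})
out = clean_and_add_features(sample)
print(out[['Host', 'Tsecs', 'Warea', 'III']])
```

The first impression from each IP has no predecessor, so its interval is left as NaN rather than set to zero.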

In [None]:
fraudtraffic=["featureplay.com","uvidi.com","spryliving.com","greatxboxgames.com",
              "mmabay.co.uk","workingmothertv.com","besthorrorgame.com","dailyparent.com","superior-movies.com",
              "yourhousedesign.com","outdoorlife.tv","drumclub.info","cycleworld.tv","hmnp.us","nlinevideos.com"]

print "IP counts:\n", data[data["Host"].isin(fraudtraffic)]["IPadd"].value_counts().describe()
print "Host counts:\n", data[data["Host"].isin(fraudtraffic)]["Host"].value_counts().describe()

fdata = data[data["Host"].isin(fraudtraffic)]
nfdata = data[~data["Host"].isin(fraudtraffic)]

f, axs = plt.subplots(1, 2)
fdata["Tsecs"].value_counts(normalize=True).plot(kind='bar',ax=axs[0])
axs[0].set_title('Seconds (fraud)')
nfdata["Tsecs"].value_counts(normalize=True).plot(kind='bar',ax=axs[1])
axs[1].set_title('Seconds (nonfraud)')


In [None]:

#Reduce the data - Remove Hosts that are present 8 times or fewer
host_counts = data["Host"].value_counts()
host_counts
host_counts.describe()
sighost_counts = host_counts[host_counts>8]
keepHosts = sighost_counts.index.map(lambda x: str(x))
data = data[data["Host"].isin(keepHosts)]


#Reduce the data - Remove IPs that are present 17 times or fewer
IP_counts = data["IPadd"].value_counts()
IP_counts
IP_counts.describe()
sigIP_counts = IP_counts[IP_counts>17]
keepIPs = sigIP_counts.index.map(lambda x: str(x))
data = data[data["IPadd"].isin(keepIPs)]
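
The `__main__` block later in this notebook calls a `reduce_data` function that is not shown here. One possible shape for it, assuming it simply wraps the two filters above with the thresholds as parameters (the signature and the toy data are my own guesses):

```python
import pandas as pd

def reduce_data(df, min_host_count=8, min_ip_count=17):
    """Keep rows whose host appears more than min_host_count times and,
    after that filter, whose IP appears more than min_ip_count times."""
    host_counts = df['Host'].value_counts()
    df = df[df['Host'].isin(host_counts[host_counts > min_host_count].index)]
    ip_counts = df['IPadd'].value_counts()
    return df[df['IPadd'].isin(ip_counts[ip_counts > min_ip_count].index)]

# Toy data with small thresholds so the effect is visible.
toy = pd.DataFrame({'Host': ['a.com'] * 3 + ['b.com'],
                    'IPadd': ['1.1.1.1', '1.1.1.1', '2.2.2.2', '1.1.1.1']})
reduced = reduce_data(toy, min_host_count=2, min_ip_count=1)
print(reduced)
```

The IP counts are taken after the host filter, matching the order of the cell above, so an IP can drop below the threshold once rare hosts are removed.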



In [None]:
def get_host_counts(df,hostname,feat_name):
    '''
        Selects feature column out given hostname
    '''
    count = df[df["Host"]==hostname][feat_name].value_counts()
    return count 

def get_hosts_counts(df,hostnames,feat_name):
    '''
        Selects feature column out for list of hostnames
    '''
    count=df[df["Host"].isin(hostnames)][feat_name].value_counts()
    return count 

def naive_bayes_fraud(Fcounts,NFcounts,Tcounts):
    '''
        Runs naive bayes, or a modified version that returns a score, not a probability
    '''
    vocab = NFcounts.add(Fcounts,fill_value=0)
    Fprior = .625
    NFprior = .375
    prob_fraud = 0.0
    log_prob_fraud = 0.0
    prob_notfraud = 0.0
    log_prob_notfraud = 0.0    
    for ind, val in Tcounts.iteritems():
        if ind in vocab.index:
            p_value = (vocab[ind]+0.0)/vocab.sum()
            #print "Prob. of value: ", p_value
            #p_v_given_fraud = (Fcounts.get(ind,0.0)+0.0)/Fcounts.sum()
            p_v_given_fraud = (Fcounts.get(ind,0.0)+1.0)/(Fcounts.sum()+len(vocab))
            #print "Prob. of val | fraud: ", p_v_given_fraud
            #p_v_given_notfraud = (NFcounts.get(ind,0.0)+0.0)/NFcounts.sum()
            p_v_given_notfraud = (NFcounts.get(ind,0.0)+1.0)/(NFcounts.sum()+len(vocab))
            #print "Prob. of val | not fraud: ", p_v_given_notfraud
        else:
            p_value = 1.0/(vocab.sum()+1)
            p_v_given_fraud = 1.0/(Fcounts.sum()+len(vocab))
            p_v_given_notfraud = 1.0/(NFcounts.sum()+len(vocab))
        if p_v_given_fraud > 0:
            prob_fraud += (val * p_v_given_fraud) / p_value
            log_prob_fraud += math.log(val * p_v_given_fraud / p_value)
        if p_v_given_notfraud >0:
            prob_notfraud += (val * p_v_given_notfraud) / p_value
            log_prob_notfraud += math.log(val * p_v_given_notfraud / p_value)
    '''
    print "\nFraud Score:  ", (prob_fraud*Fprior)/(prob_fraud*Fprior+prob_notfraud*NFprior)
    print "SumProb. (fraud):  ", prob_fraud + Fprior
    print "SumProb. (not fraud):  ", prob_notfraud + NFprior
    print "LogScore (fraud):  ", log_prob_fraud + math.log(Fprior)
    print "LogScore (not fraud):  ", log_prob_notfraud + math.log(NFprior)
    print "Fscore : ", np.exp(log_prob_fraud+math.log(Fprior)-(log_prob_fraud+math.log(Fprior)+log_prob_notfraud+math.log(NFprior)))   
    exp_prob_fraud = np.exp(log_prob_fraud + math.log(Fprior))
    exp_prob_notfraud = np.exp(log_prob_notfraud + math.log(NFprior))
    print "Likelihood of Fraud(exp):  ", exp_prob_fraud/(exp_prob_fraud+exp_prob_notfraud)#(log_prob_fraud + math.log(Fprior))/(log_prob_fraud + math.log(Fprior)+log_prob_notfraud + math.log(NFprior))
    print "Likelihood of Fraud(log):  ", (log_prob_fraud + math.log(Fprior))/(log_prob_fraud + math.log(Fprior)+log_prob_notfraud + math.log(NFprior))
    '''
    return prob_fraud/(prob_fraud+prob_notfraud)

def run_fraudscore(df,host,feat_name):
    '''
        Runs naive_bayes_fraud for a host. Checks to see if it is in training set and removes it if True
    '''
    fraudtraffic=["featureplay.com","uvidi.com","spryliving.com","greatxboxgames.com",
              "mmabay.co.uk","workingmothertv.com","besthorrorgame.com","dailyparent.com","superior-movies.com",
              "yourhousedesign.com","outdoorlife.tv","drumclub.info","cycleworld.tv","hmnp.us","nlinevideos.com"]
    nfraudtraffic=["google.com","foxsports.com","washingtonpost.com","amazon.com",
               "nytimes.com","tvguide.com","pandora.com","youtube.com","cnn.com"]
    if host in fraudtraffic:
        fraudtraffic.remove(host)
    if host in nfraudtraffic:
        nfraudtraffic.remove(host)
    # Use the dataframe passed in rather than the global `data`
    F_count = get_hosts_counts(df,fraudtraffic,feat_name)
    NF_count = get_hosts_counts(df,nfraudtraffic,feat_name)
    T_count = get_host_counts(df,host,feat_name)
    return naive_bayes_fraud(F_count,NF_count,T_count)

def clean_plot(color='w',ax=None, leftAxisOn=True):
    # Make a cleaner, prettier plot
    # Look up the current axes at call time; a default of plt.gca() would be
    # evaluated only once, when the function is defined.
    if ax is None:
        ax = plt.gca()
    ax.set_axis_bgcolor('w')
    ax.spines['bottom'].set_color(color)
    ax.spines['top'].set_color(color)
    ax.spines['right'].set_color(color)
    ax.spines['left'].set_color(color)
    if leftAxisOn is True:
        ax.spines['left'].set_color((0.5,0.5,0.5))
        ax.yaxis.set_ticks_position('left')
        ax.get_yaxis().set_tick_params(direction='out',color=(0.5,0.5,0.5),length=3.5)


def runallhost_fraudscore(data,feat_name):
    # Run fraudscores for each host   
    allhosts = data["Host"].unique()
    # Float array for the scores (zeros_like on the host names would copy their object dtype)
    fraudscores = np.zeros(len(allhosts))
    for i,h in enumerate(allhosts):
        #print i, "out of", len(allhosts)
        fraudscores[i]=run_fraudscore(data,h,feat_name)

    host_score_prob = pd.DataFrame({'host' : allhosts,'fscore' : fraudscores})

    #sort the scores and reindex so that the index is the rank (1 = highest score)
    sort_host_score_prob = host_score_prob.sort_values(by='fscore',ascending=False)
    sort_host_score_prob.index=range(1,len(sort_host_score_prob)+1)

    fraudtraffic=["featureplay.com","uvidi.com","spryliving.com","greatxboxgames.com",
              "mmabay.co.uk","workingmothertv.com","besthorrorgame.com","dailyparent.com","superior-movies.com",
              "yourhousedesign.com","outdoorlife.tv","drumclub.info","cycleworld.tv","hmnp.us","nlinevideos.com"]
       
    #plot results
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes([.125, .2, .775, .7])
    plt.title('Fraud Scores from '+feat_name+' for known Fraudulent')
    plt.ylabel('Fraud Score Percentile', size=12)
    textFont  = {'family' : u'sans-serif',
         'size'   : 12,
         'style'  : u'italic' }		
    # Known fraud hosts that survived the data reduction, in rank order
    known = sort_host_score_prob[sort_host_score_prob['host'].isin(fraudtraffic)]
    rects = plt.bar(np.arange(len(known)),100-(known.index/(len(sort_host_score_prob)+0.0))*100,color='r')
    plt.xticks(np.arange(len(known))+0.5, known.host.values, rotation=60,
        horizontalalignment='right', **textFont)
    
    for rect in rects:
        height = int(rect.get_height())
        rankStr = str(height)
        xloc = rect.get_x()+rect.get_width()/2.0
        yloc = 0.95*height
        ax.text(xloc, yloc, rankStr, horizontalalignment='center',
                verticalalignment='center', color='white', weight='bold')
    clean_plot(ax=ax)
    
    # Write the ranked host list, one "rank host" pair per line
    with open('Integral_data_host_fraud_ranks'+feat_name+'_noreduction.txt', 'w') as f:
        for ind, hst in zip(sort_host_score_prob.index, sort_host_score_prob['host']):
            f.write(str(ind) + ' ' + hst + '\n')
    

if __name__ == "__main__":
    print 'Loading data'
    log_file = r"D:\Downloads\Integral_data_set.tsv"
    data = pd.io.parsers.read_csv(log_file,sep='\t',names=['Timestamp','IPadd','Browser','UserA','Host','Iinview','Nplugins','Bwinpossize','NetLat'],header=None)
    
    print 'Cleaning data'
    data = clean_data(data)
    print 'Adding features'
    data = add_features(data)
    print 'Reducing data'
    data = reduce_data(data)
    
    for f in ['Browser','UserA','Iinview','Nplugins','Bwinarea']:
        print 'Computing fraud scores for '+f
        runallhost_fraudscore(data,f)
        plt.savefig('KnownFraudulent'+f+'FraudScore.pdf')
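
To make the score returned by `naive_bayes_fraud` above concrete: for each value v that a host shows n_v times, it adds n_v * P(v | class) / P(v) to a fraud sum and a non-fraud sum, using add-one smoothing for P(v | class), and returns fraud / (fraud + nonfraud). The priors `Fprior` and `NFprior` do not enter the returned value. The compact restatement below uses made-up browser counts, a hypothetical function name `fraud_score`, and `Series.items()` from current pandas in place of `iteritems()`.

```python
import pandas as pd

def fraud_score(f_counts, nf_counts, t_counts):
    """Count-weighted sum of smoothed likelihood ratios, normalised to [0, 1]."""
    vocab = nf_counts.add(f_counts, fill_value=0)
    k = len(vocab)
    fraud = notfraud = 0.0
    for value, n in t_counts.items():
        if value in vocab.index:
            p_value = vocab[value] / vocab.sum()
        else:
            p_value = 1.0 / (vocab.sum() + 1)
        # Add-one (Laplace) smoothing over the shared vocabulary.
        p_given_f = (f_counts.get(value, 0.0) + 1.0) / (f_counts.sum() + k)
        p_given_nf = (nf_counts.get(value, 0.0) + 1.0) / (nf_counts.sum() + k)
        fraud += n * p_given_f / p_value
        notfraud += n * p_given_nf / p_value
    return fraud / (fraud + notfraud)

# Toy browser counts (made up for illustration).
f_counts = pd.Series({'IE': 8, 'Chrome': 2})
nf_counts = pd.Series({'IE': 3, 'Chrome': 7})
ie_host = fraud_score(f_counts, nf_counts, pd.Series({'IE': 10}))
chrome_host = fraud_score(f_counts, nf_counts, pd.Series({'Chrome': 10}))
print(ie_host)      # 9/13, about 0.692
print(chrome_host)  # 3/11, about 0.273
```

A host whose traffic looks like the known fraud hosts scores above 0.5 and one that looks like the reference non-fraud hosts scores below it, which is why the scores are only used for ranking.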
    
