In [1]:
%reload_ext ishbook
import pandas as pd
import plus
import datetime as dt
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

## Sample output pulled in from the existing CSOS EMEA ishbook

- This is taken from a sample run on 11/2/2016
- This is the normal output from the query/audit. I am adding a few steps to flag some of these accounts for removal


In [22]:
EMEA_ishbook_output = pd.read_csv('BIREQ_2032_test.csv')

In [76]:
# Preview the sample; each print() call takes a single argument
print('Here are the first 5 rows of the sample data set:\n')
print(EMEA_ishbook_output.head())
print('\nThis is the length of the sample dataframe: {}'.format(len(EMEA_ishbook_output)))

Here are the first 5 rows of the sample data set:

   Account Id Relationship  New User Id Date Assignment Starts  advertiser_id
0    67845114  SERVICE_REP         3271             11/02/2016         140420
1    86573442  SERVICE_REP         3271             11/02/2016         207496
2    92927998  SERVICE_REP         3271             11/02/2016         236360
3    96851770  SERVICE_REP         3271             11/02/2016         309638
4   117254481  SERVICE_REP         3271             11/02/2016         549041

This is the length of the sample dataframe: 34


## I had to pull in the advertiser IDs for the sample, because this field is not populated in the ishbook output

- This step won't be necessary if we run the claimed-sources check before finalizing the output
- I merge in these advertiser IDs so that I can use them in the PLUS call for "claimed sources"

In [35]:
# Look up the advertiser_id for each account id through PLUS
get_advertiser_id = plus.get_advertiser_info(EMEA_ishbook_output['Account Id'], intype='acctid')['advertiser_id'].reset_index()

In [39]:
# Left merge so that every account is kept, even if no advertiser_id is found for it
EMEA_ishbook_output = EMEA_ishbook_output.merge(get_advertiser_id, how='left', on='Account Id')

In [45]:
EMEA_ishbook_output.head()

   Account Id Relationship  New User Id Date Assignment Starts  advertiser_id
0    67845114  SERVICE_REP         3271             11/02/2016         140420
1    86573442  SERVICE_REP         3271             11/02/2016         207496
2    92927998  SERVICE_REP         3271             11/02/2016         236360
3    96851770  SERVICE_REP         3271             11/02/2016         309638
4   117254481  SERVICE_REP         3271             11/02/2016         549041
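
The lookup-and-merge step above can be sketched without the internal `plus` library. This is only a minimal sketch: the `lookup` frame is a hand-built stand-in for the PLUS result (its values are copied from the sample rows above), it assumes the lookup returns at most one row per account, and the `validate` argument assumes pandas 0.21 or later.

```python
import pandas as pd

# Stand-in for the ishbook output (values copied from the sample rows above).
accounts = pd.DataFrame({
    'Account Id': [67845114, 86573442, 92927998],
    'Relationship': ['SERVICE_REP'] * 3,
})

# Hypothetical stand-in for the PLUS lookup result: one row per account.
# The third account is left out on purpose to show an unmatched row.
lookup = pd.DataFrame({
    'Account Id': [67845114, 86573442],
    'advertiser_id': [140420, 207496],
})

# A left merge keeps every account, even when the lookup has no match.
# validate='many_to_one' raises if the lookup holds duplicate account ids,
# which would otherwise silently duplicate rows in the output.
merged = accounts.merge(lookup, how='left', on='Account Id',
                        validate='many_to_one')

# Accounts without a match end up with a null advertiser_id.
unmatched = merged[merged['advertiser_id'].isnull()]
```

Checking `unmatched` before the claimed-sources call is one way to notice accounts that would otherwise pass through the later filter without ever being checked.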


# **THE KEY THING TO REMEMBER FOR UPDATING THE ISHBOOK IS THAT THE STEPS BELOW CAN AND SHOULD BE DONE BEFORE FINALIZING THE OUTPUT - IT WOULD SIMPLY BE ADDING THESE STEPS BEFORE DOING THE FINAL CLEAN-UP**

## For the EMEA SCS B Audit sample, I pull in all of the claimed sources using a PLUS call

- There's an endpoint called *get_claimed_sources* that returns any claimed sources for a particular advertiser
- I pull in the *sitename* & *source_id* fields, which contain the "Claimed Sources" information required to exclude these accounts
- I reset the index to make the merge easier

In [46]:
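# Fetch the claimed sources per advertiser, keeping only sitename and source_id;
# resetting the index turns advertiser_id into a regular column for the merge
claimed_sources = plus.get_claimed_sources(EMEA_ishbook_output['advertiser_id'])[['sitename', 'source_id']].reset_index()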

## As you can see in the sample df below, some sitenames have the form "dradis_" + advertiser_id

- These are the advertisers that need to be removed before the output is finalized

In [47]:
claimed_sources

   advertiser_id                           sitename  source_id
0         140420  http://www.vacature-beveiliger.nl          0
1         207496                Sapienza Consulting      87997
2        2432765                               Quby    1631861
3        3053246                     dradis_3053246    2192947
4        6068338                     dradis_6068338    3653656
5        6790556                     dradis_6790556    3988811
6        8008487                     dradis_8008487    4571598
7        8138891                     dradis_8138891    4631845
8        9078401                     dradis_9078401    5031976
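
The flagged sitenames in this sample all follow the pattern "dradis_" + advertiser_id. A plain substring test on 'dradis' would also match a genuine sitename that happens to contain that text, so a stricter check may be worth considering. The helper below is hypothetical (it is not part of PLUS or the ishbook) and assumes the pattern seen in this sample holds in general.

```python
def is_dradis_placeholder(sitename, advertiser_id):
    """Return True when sitename is exactly 'dradis_' + advertiser_id.

    Hypothetical helper, written for illustration only.
    """
    return str(sitename) == 'dradis_{}'.format(advertiser_id)


# (advertiser_id, sitename) pairs copied from the sample output above.
rows = [
    (207496, 'Sapienza Consulting'),
    (2432765, 'Quby'),
    (3053246, 'dradis_3053246'),
    (6068338, 'dradis_6068338'),
]

# Collect the advertisers whose sitename matches the placeholder pattern.
flagged = [adv for adv, site in rows if is_dradis_placeholder(site, adv)]
print(flagged)  # prints [3053246, 6068338]
```

If the stricter rule turns out to be too narrow, the substring test remains the safer default, since it removes more rather than fewer accounts.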


## Merge the claimed sources information into the existing sample dataset, filling in null values since not all of the advertisers will have claimed sources

- You need this step in order to do the string filtering on any *sitename* containing 'dradis'
- In my sample run, these are the accounts whose sitename contains "dradis"

In [59]:
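# Left merge keeps accounts with no claimed source; their nulls are filled with
# 'NONE' so that .str.contains returns only True/False in the next step
emea_new_step = EMEA_ishbook_output.merge(claimed_sources, how='left', on='advertiser_id').fillna('NONE')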

In [63]:
# Show the accounts whose sitename contains 'dradis'
emea_new_step[emea_new_step['sitename'].str.contains('dradis')]

    Account Id Relationship  New User Id Date Assignment Starts  advertiser_id        sitename  source_id
 9   183712105  SERVICE_REP         3271             11/02/2016        3053246  dradis_3053246  2192950.0
16   225078169  SERVICE_REP         3271             11/02/2016        6068338  dradis_6068338  3653660.0
17   233378848  SERVICE_REP         3271             11/02/2016        6790556  dradis_6790556  3988810.0
20   247190244  SERVICE_REP         3271             11/02/2016        8008487  dradis_8008487  4571600.0
21   248614903  SERVICE_REP         3271             11/02/2016        8138891  dradis_8138891  4631840.0
25   258413454  SERVICE_REP         3271             11/02/2016        9078401  dradis_9078401  5031980.0
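
The `fillna('NONE')` call matters because `.str.contains` returns a null for a null sitename, and a mask containing nulls cannot be used for boolean indexing. It also fills every other column, which turns `source_id` into a mixed column. A possible alternative, sketched here on made-up ids, is to pass `na=False` to `.str.contains` so that the merged frame is left untouched.

```python
import pandas as pd

# Toy data; the ids are made up for illustration.
accounts = pd.DataFrame({'Account Id': [1, 2, 3],
                         'advertiser_id': [10, 20, 30]})
claimed = pd.DataFrame({'advertiser_id': [10, 20],
                        'sitename': ['Example Site', 'dradis_20'],
                        'source_id': [111, 222]})

merged = accounts.merge(claimed, how='left', on='advertiser_id')

# Advertiser 30 has no claimed source, so its sitename is null.
# na=False treats those nulls as "no match" instead of propagating them.
is_dradis = merged['sitename'].str.contains('dradis', na=False)

flagged = merged[is_dradis]   # accounts to remove
kept = merged[~is_dradis]     # accounts that stay in the output
```

Either approach gives the same split here; `na=False` simply avoids writing a placeholder string into numeric columns.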


## Here's the critical step: remove any and all accounts with "dradis" in their sitename

- This will yield what the new final output of the ishbook should be. After this step is completed, remove the *advertiser_id*, *sitename* and *source_id* fields from the final output, as they are not needed to do the assignment
- The very last dataframe is what the output SHOULD look like, with the accounts that were flagged as containing "dradis" excluded
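
The removal described in the bullets above can be sketched as a single helper. One case the row-level filter may not cover: if an advertiser has several claimed sources, the left merge yields one row per source, and only the "dradis" row would be dropped. The sample does not show whether that happens in practice, so this sketch assumes it might and excludes at the advertiser level. The function name and the toy ids are hypothetical; the column names follow the sample dataframe.

```python
import pandas as pd


def drop_dradis_accounts(merged):
    """Remove every account whose advertiser has a 'dradis' sitename.

    Hypothetical helper; column names follow the sample dataframe.
    """
    is_dradis = merged['sitename'].str.contains('dradis', na=False)
    flagged_ids = merged.loc[is_dradis, 'advertiser_id'].unique()
    kept = merged[~merged['advertiser_id'].isin(flagged_ids)]
    # Drop the helper columns by name, then collapse any duplicate rows
    # that a one-to-many merge may have produced.
    return (kept.drop(['advertiser_id', 'sitename', 'source_id'], axis=1)
                .drop_duplicates()
                .reset_index(drop=True))


# Toy data; ids are made up. Advertiser 20 has two claimed sources.
merged = pd.DataFrame({
    'Account Id': [1, 2, 2, 3],
    'Relationship': ['SERVICE_REP'] * 4,
    'advertiser_id': [10, 20, 20, 30],
    'sitename': ['Example Site', 'dradis_20', 'Other Site', None],
    'source_id': [111, 222, 333, None],
})

final_output = drop_dradis_accounts(merged)
print(final_output)
```

Dropping the helper columns by name rather than by position keeps the result stable if the column order of the ishbook output ever changes.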

In [64]:
# Keep only the rows whose sitename does not contain 'dradis'
final_output = emea_new_step[~emea_new_step['sitename'].str.contains('dradis')]

In [70]:
# Keep only the first four columns, which are all the assignment needs
final_output.iloc[:, 0:4]

Unnamed: 0,Account Id,Relationship,New User Id,Date Assignment Starts
0,67845114,SERVICE_REP,3271,11/02/2016
1,86573442,SERVICE_REP,3271,11/02/2016
2,92927998,SERVICE_REP,3271,11/02/2016
3,96851770,SERVICE_REP,3271,11/02/2016
4,117254481,SERVICE_REP,3271,11/02/2016
5,123533321,SERVICE_REP,3271,11/02/2016
6,148135895,SERVICE_REP,3271,11/02/2016
7,173687669,SERVICE_REP,3271,11/02/2016
8,180269461,SERVICE_REP,3271,11/02/2016
10,193632292,SERVICE_REP,3271,11/02/2016
