Input file considerations

Here are some considerations to keep in mind:

1. Because we are only interested in individual contributions, we only want records whose OTHER_ID field is empty. If the OTHER_ID field contains any other value, ignore the entire record and don't include it in any calculation

2. If TRANSACTION_DT is an invalid date (e.g., empty, malformed), you should still take the record into consideration when outputting the results of medianvals_by_zip.txt but completely ignore the record when calculating values for medianvals_by_date.txt

3. While the data dictionary has the ZIP_CODE occupying nine characters, for the purposes of the challenge, we only consider the first five characters of the field as the zip code

4. If ZIP_CODE is an invalid zipcode (i.e., empty, fewer than five digits), you should still take the record into consideration when outputting the results of medianvals_by_date.txt but completely ignore the record when calculating values for medianvals_by_zip.txt

5. If any line in the input file contains an empty CMTE_ID or TRANSACTION_AMT field, you should skip the record and not take it into consideration when making any calculations for the output files

6. Except for the considerations noted above with respect to CMTE_ID, ZIP_CODE, TRANSACTION_DT, TRANSACTION_AMT, and OTHER_ID, data in any of the other fields (whether valid, malformed, or empty) should not affect your processing. That is, as long as the four previously noted considerations apply, you should process the record as if it were a valid, newly arriving transaction. (For instance, campaigns sometimes retransmit transactions as amendments; for the purposes of this challenge, however, you can ignore that distinction and treat all of the lines as if they were new.)

7. For the purposes of this challenge, you can assume the input file follows the data dictionary noted by the FEC for the 2015-current election years

8. The transactions noted in the input file are not in any particular order, and in fact, can be out of order chronologically
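
The validity rules above can be condensed into one small function. This is only a minimal sketch, not the notebook's implementation: the name `classify_record` and its return convention are hypothetical, the field positions (0, 10, 13, 14, 15) are the ones used by the parsing code in this notebook, and the date format is assumed to be MMDDYYYY.

```python
from datetime import datetime

def valid_date(s):
    """True if s is an 8-digit MMDDYYYY string naming a real calendar date."""
    if len(s) != 8 or not s.isdigit():
        return False
    try:
        datetime.strptime(s, '%m%d%Y')
    except ValueError:
        return False
    return True

def classify_record(fields):
    """Return (use_for_zip, use_for_date) for one pipe-split input line.

    Hypothetical helper; `fields` is the list produced by line.split('|').
    """
    cmte_id, zip_code = fields[0], fields[10][:5]
    date, amount, other_id = fields[13], fields[14], fields[15]

    # Considerations 1 and 5: skip non-individual or incomplete records
    if other_id != '' or cmte_id == '' or amount == '':
        return (False, False)

    # Considerations 3 and 4: the first five characters must all be digits
    zip_ok = len(zip_code) == 5 and zip_code.isdigit()

    # Consideration 2: a bad date only excludes the record from the date file
    return (zip_ok, valid_date(date))
```

A record with a short zip code but a good date would return `(False, True)`, so it would count toward medianvals_by_date.txt only.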

**medianvals_by_zip.txt**

Each line of this file should contain these fields:

- recipient of the contribution (or CMTE_ID from the input file)

- 5-digit zip code of the contributor (or the first five characters of the ZIP_CODE field from the input file)

- running median of contributions received by recipient from the contributor's zip code streamed in so far. Median calculations should be rounded to the whole dollar (drop anything below \$.50 and round anything from \$.50 and up to the next dollar)

- total number of transactions received by recipient from the contributor's zip code streamed in so far

- total amount of contributions received by recipient from the contributor's zip code streamed in so far
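
One way to keep the running median, count, and total per (recipient, zip) pair without re-sorting the whole list on every line is a pair of heaps. The sketch below is an illustration under assumptions, not the approach taken in this notebook: the class name `RunningStats` is hypothetical, and amounts are assumed to be integers.

```python
import heapq

class RunningStats:
    """Running median, count, and total for one (recipient, zip) pair.

    Hypothetical helper, not part of the notebook.
    """
    def __init__(self):
        self.low = []    # max-heap (values negated): the smaller half
        self.high = []   # min-heap: the larger half
        self.count = 0
        self.total = 0

    def add(self, amount):
        # Push onto the lower half, then move its largest value up
        heapq.heappush(self.low, -amount)
        heapq.heappush(self.high, -heapq.heappop(self.low))
        # Keep len(low) >= len(high) so the median sits on top of low
        if len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))
        self.count += 1
        self.total += amount

    def median(self):
        if len(self.low) > len(self.high):
            return -self.low[0]
        # Even count: mean of the two middle values, rounded half up.
        # For integers a and b, (a + b + 1) // 2 is floor((a + b) / 2 + 0.5).
        return (-self.low[0] + self.high[0] + 1) // 2

stats = RunningStats()
for amount in (384, 250, 230):
    stats.add(amount)
    print(stats.median(), stats.count, stats.total)
# prints:
# 384 1 384
# 317 2 634
# 250 3 864
```

Each `add` costs O(log n), whereas recomputing the median from the full list costs at least O(n) per line.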

In [6]:
import statistics

In [178]:
# file_input = 'example1/itcont.txt'
# file_zip_output = "example1/medianvals_by_zip.txt"
# file_date_output = "example1/medianvals_by_date.txt"

file_input = 'example1/itcont.txt'
file_zip_output = "example1/medianvals_by_zip.txt"
file_date_output = "example1/medianvals_by_date.txt"


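# Checks if value is a date in MMDDYYYY form; False if: empty, malformed
def is_date(s):
    from datetime import datetime  # local import keeps this cell self-contained
    # isdigit() rejects signs, spaces, and underscores that int() would accept
    if len(s) != 8 or not s.isdigit():
        return False
    try:
        datetime.strptime(s, '%m%d%Y')  # rejects impossible dates such as 02302017
        return True
    except ValueError:
        return False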

# Checks if value is a zip code; False if: empty, fewer than five digits
def is_zip(s):
    # isdigit() rejects signs, spaces, and underscores that int() would accept
    return len(s) >= 5 and s.isdigit()

# Recursive binary search: returns the index of target in the sorted list lst,
# or the index at which target should be inserted to keep lst sorted
def search(lst, target):
    lo = 0
    hi = len(lst)
    mid = (lo + hi) // 2
    print(lst, target, mid, (lo, hi))
    if lo < hi:
        if lst[mid] == target:
            return mid
        elif lst[mid] < target:
            print(lst[mid], ' < ', target, ' avg = ', mid, 'min, max : ', lo, hi)
            return mid + 1 + search(lst[mid+1:], target)
        else:
            print(lst[mid], ' > ', target, ' avg = ', mid, 'min, max : ', lo, hi)
            return search(lst[:mid], target)

    return mid



class Insight:
    def __init__(self):
        medianvals_by_zip = open(file_zip_output,"w+")
        medianvals_by_date = open(file_date_output,"w+")

        # perhaps use only one master_d in the future, use these ones to check
        master_d_zip = {}
        master_l_date = []
        d_date = {}
        
        with open(file_input) as f:
            for line in f:
                mylist = line.split('|')  # this is always? of length 21
                if (len(mylist)==21):
                    d = {'CMTE_ID' : mylist[0], 'ZIP_CODE' : mylist[10][:5] # zip: first 5 digits only
                        ,'TRANSACTION_DT' : mylist[13], 'TRANSACTION_AMT' : mylist[14], 'OTHER_ID' : mylist[15]} 

                    if (d['CMTE_ID']!='' and d['TRANSACTION_AMT']!='' and d['OTHER_ID']==''):
                        # process this one (applies to both files)



                        # zip - write as you read in
                        if is_zip(d['ZIP_CODE']):
                            amounts = master_d_zip.setdefault((d['CMTE_ID'], d['ZIP_CODE']), [])
                            amounts.append(int(d['TRANSACTION_AMT']))
                            # round half up per the spec; built-in round() rounds halves to even
                            median = int((2 * statistics.median(amounts) + 1) // 2)
                            medianvals_by_zip.write('|'.join([d['CMTE_ID'], d['ZIP_CODE'], str(median),
                                                              str(len(amounts)), str(sum(amounts))]) + '\n')

                        # date - wait to write
                        if is_date(d['TRANSACTION_DT']):
                            # check which position to insert in the list, then insert.
#                             master_l_date = [('A', '2')]
#                             d_date = {('A', '2'): [21]}
                            spot = search(master_l_date,(d['CMTE_ID'],d['TRANSACTION_DT']))
                            print('search result: ', spot)
                            
                            if spot==len(master_l_date):
                                master_l_date.append((d['CMTE_ID'],d['TRANSACTION_DT']))
                                d_date[(d['CMTE_ID'],d['TRANSACTION_DT'])]=[int(d['TRANSACTION_AMT'])]
                                print('add to end of list', master_l_date, d_date)
                                
                            elif (master_l_date[spot] == (d['CMTE_ID'],d['TRANSACTION_DT'])):
                                d_date[(d['CMTE_ID'],d['TRANSACTION_DT'])].append(int(d['TRANSACTION_AMT']))
                                print ('Already in the list',(d['CMTE_ID'],d['TRANSACTION_DT']), '\n', d_date)
                            else:
                                master_l_date.insert(spot, (d['CMTE_ID'],d['TRANSACTION_DT'])) 
                                d_date[(d['CMTE_ID'],d['TRANSACTION_DT'])]=[int(d['TRANSACTION_AMT'])]
                                print('insert into list at position: ', spot)
#                                     min = max
#                                 elif (master_l_date[avg][0] < (d['CMTE_ID'],d['TRANSACTION_DT'])):
#                                     master_l_date.insert(avg,[(d['CMTE_ID'],d['TRANSACTION_DT']),[int(d['TRANSACTION_AMT'])]])avg + 1 + search(lst[avg+1:], target)
# #                             else:
# #                               return search(lst[:avg], target)

                            # avg may be a partial offset so no need to print it here
                            # print "The location of the number in the array is", avg 
#                             return avg
#                             search(master_l_date,(d['CMTE_ID'],d['TRANSACTION_DT']))
#                               
#                             master_l_date.setdefault((d['CMTE_ID'],d['TRANSACTION_DT']), []).append(int(d['TRANSACTION_AMT']))
     
        # write date files
        # sorted alphabetically by recipient and then chronologically by date.
        # master_l_date is ordered by the raw MMDDYYYY string, which is only
        # chronological within one year, so re-sort on (recipient, YYYYMMDD).
        for key in sorted(master_l_date, key=lambda k: (k[0], k[1][4:] + k[1][:4])):
            amounts = d_date[key]
            # round half up per the spec; built-in round() rounds halves to even
            median = int((2 * statistics.median(amounts) + 1) // 2)
            medianvals_by_date.write('|'.join([key[0], key[1], str(median),
                                               str(len(amounts)), str(sum(amounts))]) + '\n')

        medianvals_by_zip.close()
        medianvals_by_date.close()





In [179]:
my_Insight = Insight()

[] ('C00177436', '01312017') 0 (0, 0)
search result:  0
add to end of list [('C00177436', '01312017')] {('C00177436', '01312017'): [384]}
[('C00177436', '01312017')] ('C00384818', '01122017') 0 (0, 1)
('C00177436', '01312017')  <  ('C00384818', '01122017')  avg =  0 min, max :  0 1
[] ('C00384818', '01122017') 0 (0, 0)
search result:  1
add to end of list [('C00177436', '01312017'), ('C00384818', '01122017')] {('C00177436', '01312017'): [384], ('C00384818', '01122017'): [250]}
[('C00177436', '01312017'), ('C00384818', '01122017')] ('C00177436', '01312017') 1 (0, 2)
('C00384818', '01122017')  >  ('C00177436', '01312017')  avg =  1 min, max :  0 2
[('C00177436', '01312017')] ('C00177436', '01312017') 0 (0, 1)
search result:  0
Already in the list ('C00177436', '01312017') 
 {('C00177436', '01312017'): [384, 230], ('C00384818', '01122017'): [250]}
[('C00177436', '01312017'), ('C00384818', '01122017')] ('C00177436', '01312017') 1 (0, 2)
('C00384818', '01122017')  >  ('C00177436', '01312017

In [155]:
l = [('A', '2')]
t = ('C','3')
search(l,t)

[('A', '2')] ('C', '3') 0 (0, 1)
('A', '2')  <  ('C', '3')  avg =  0 min, max :  0 1
[] ('C', '3') 0 (0, 0)


1

In [85]:
int(2.5)

2
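
`int()` truncates toward zero, and Python 3's built-in `round()` rounds exact halves to the nearest even integer, so neither matches the rule of rounding \$.50 up to the next dollar. The sketch below compares the three; `round_half_up` is a hypothetical helper name, and it assumes the value is a median of integer amounts (so its fractional part is exactly .0 or .5).

```python
def round_half_up(value):
    """Round to the nearest integer, with exact halves going up.

    Hypothetical helper; (2 * value + 1) // 2 is floor(value + 0.5).
    """
    return int((2 * value + 1) // 2)

for value in (2.5, 3.5, 317.0):
    print(value, int(value), round(value), round_half_up(value))
# prints:
# 2.5 2 2 3
# 3.5 3 4 4
# 317.0 317 317 317
```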

In [101]:
# def search(lst, target):
#     min = 0
#     max = len(lst)-1
#     avg = int(round((min+max)/2))
#     # uncomment next line for traces
#     print (lst, target, avg ) 
#     while (min < max-1):
#         if (lst[avg] == target):
#             print('equal')
#             return (avg,True)
#         elif (lst[avg] < target):
#             print(lst[avg], ' < ', target, ' avg = ', avg, 'min, max : ', min, max)
#             min = avg
#             avg = int((min+max)/2)
#             print(' avg = ', avg, 'min, max : ', min, max)
#         else: 
#             print(lst[avg], ' > ', target, ' avg = ', avg, 'min, max : ', min, max)
#             max = avg
#             avg = int((min+max)/2)
#             print(' avg = ', avg, 'min, max : ', min, max)            
#     return (avg,False)


In [111]:
# Recursive binary search: returns the index of target in the sorted list lst,
# or the index at which target should be inserted to keep lst sorted
def search(lst, target):
    lo = 0
    hi = len(lst)
    mid = (lo + hi) // 2
    print(lst, target, mid, (lo, hi))
    if lo < hi:
        if lst[mid] == target:
            return mid
        elif lst[mid] < target:
            print(lst[mid], ' < ', target, ' avg = ', mid, 'min, max : ', lo, hi)
            return mid + 1 + search(lst[mid+1:], target)
        else:
            print(lst[mid], ' > ', target, ' avg = ', mid, 'min, max : ', lo, hi)
            return search(lst[:mid], target)

    return mid


In [154]:
l = [1,2,4,8,12,13]
# print(len(l))
# print(l[6])
t = 2
print(search(l, t))
print('Want: ', 1)

[1, 2, 4, 8, 12, 13] 2 3 (0, 6)
8  >  2  avg =  3 min, max :  0 6
[1, 2, 4] 2 1 (0, 3)
1
Want:  1


In [147]:
l = []
t = 4
print(search(l,t))
print('Want: ', 0)

[] 4 0 (0, 0)
0
Want:  0


In [114]:
search ([('C00177435', '01312017'),('C00177435', '01312017')], ('C00177436', '01312017'))

[('C00177435', '01312017'), ('C00177435', '01312017')] ('C00177436', '01312017') 1 (0, 2)
('C00177435', '01312017')  <  ('C00177436', '01312017')  avg =  1 min, max :  0 2
[] ('C00177436', '01312017') 0 (0, 0)


2

In [115]:
search([('C00177435','1'),('C00177439','1')],('C00177436','1'))

[('C00177435', '1'), ('C00177439', '1')] ('C00177436', '1') 1 (0, 2)
('C00177439', '1')  >  ('C00177436', '1')  avg =  1 min, max :  0 2
[('C00177435', '1')] ('C00177436', '1') 0 (0, 1)
('C00177435', '1')  <  ('C00177436', '1')  avg =  0 min, max :  0 1
[] ('C00177436', '1') 0 (0, 0)


1

In [116]:
l = [('1','1',[2]),('1','4',[1,3]), ('3','1',[]), ('5','5',[])]
t = ('2','3')
print(search(l, t))
print('Want: ', 2)


[('1', '1', [2]), ('1', '4', [1, 3]), ('3', '1', []), ('5', '5', [])] ('2', '3') 2 (0, 4)
('3', '1', [])  >  ('2', '3')  avg =  2 min, max :  0 4
[('1', '1', [2]), ('1', '4', [1, 3])] ('2', '3') 1 (0, 2)
('1', '4', [1, 3])  <  ('2', '3')  avg =  1 min, max :  0 2
[] ('2', '3') 0 (0, 0)
2
Want:  2
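
The standard library's `bisect` module provides the same "index of the match, or insertion point" lookup as the `search` function tested above, without copying list slices on each step. The sketch below is one possible arrangement, not the notebook's code: `insert_amount` is a hypothetical helper, with `keys` standing in for `master_l_date` and `amounts` for `d_date`.

```python
from bisect import bisect_left

def insert_amount(keys, amounts, key, amount):
    """Record amount under key, keeping the list `keys` sorted.

    Hypothetical helper: `keys` plays the role of master_l_date and
    `amounts` the role of d_date.
    """
    spot = bisect_left(keys, key)   # index of key, or where it belongs
    if spot == len(keys) or keys[spot] != key:
        keys.insert(spot, key)      # new key: insert in sorted position
        amounts[key] = []
    amounts[key].append(amount)
    return spot

keys, amounts = [], {}
insert_amount(keys, amounts, ('C00177436', '01312017'), 384)
insert_amount(keys, amounts, ('C00384818', '01122017'), 250)
insert_amount(keys, amounts, ('C00177436', '01312017'), 230)
print(keys)
print(amounts)
```

The three branches in the date-handling code (append at the end, already present, insert in the middle) collapse into a single check, because `list.insert` at `len(keys)` behaves like `append`.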
