In [1]:
'''PURPOSE

The purpose of this code is to attempt to merge the Starr account data located in Salesforce with that of the 
company information located in Capital IQ.

Data Sources  = Two Excel spreadsheets, one from Capital IQ and the other from Salesforce. 
Unique ID's   = Capital IQ will be the CIQ ID
                Salesforce will be the Ultimate Parent D&B number. 

Approach      = TBD

Questions   
1.) Does every company in our dataset have a CIQ and D&B number?
2.) Does every company in our dataset have a state, city and zip code?

Date:    02.10.2018
author:  Chris Cirelli
'''

In [2]:
# LOAD LIBRARIES

In [3]:
import os
import pandas as pd
import sys

os.chdir(r'C:\Users\Chris.Cirelli\Desktop\Python Programming Docs\GitHub\Starr-Project')
import Module_Starr_DataMerger as msd

In [4]:
# DEFINE LOCATION OF FILES

In [5]:
os.chdir(r'C:\Users\Chris.Cirelli\Desktop\Capital IQ Match w Salesforce')

In [6]:
# IMPORT FILES

In [7]:
# Capital IQ Data
df_CIQ = pd.read_excel('Private Company Target List 2062018.xls')

# Salesforce Data
df_SF = pd.read_excel('Salesforce Data Dump - Capital IQ Merger.xlsx')
# Drop the last 7 rows of the Salesforce export
df_SF = df_SF[:-7]

In [8]:
# DATA ANALYTICS TABLE (DAT) CIQ

In [9]:
'''
Purpose:  Limit the CIQ Dataframe to only those values needed to facilitate the matching
'''

DAT_CIQ = df_CIQ[['Excel Company ID', 'Company Name', 'Primary State', 'Primary City', 'Primary Zip Code/Postal Code']]

DAT_SF = df_SF[['Client Ultimate Parent DUNS Number', 'Company Name', 'Billing State/Province', 'Billing City', 
                'Billing Zip/Postal Code']]

In [12]:
# CALCULATE NONE VALUES

In [13]:
'''
Purpose:  Check whether any values are missing from our dataframes and need to be replaced or removed. 
Import:   Create & import the get_nanValues function from the module 'msd'.
'''

print('None Values in dataframe:  DAT_CIQ', '\n',  msd.get_nanValues(DAT_CIQ))
print('')
print('None Values in dataframe:  DAT_SF', '\n', msd.get_nanValues(DAT_SF))


None Values in dataframe:  DAT_CIQ 
 {'Excel Company ID': 0, 'Company Name': 0, 'Primary State': 0, 'Primary City': 0, 'Primary Zip Code/Postal Code': 0}

None Values in dataframe:  DAT_SF 
 {'Client Ultimate Parent DUNS Number': 0, 'Company Name': 0, 'Billing State/Province': 0, 'Billing City': 0, 'Billing Zip/Postal Code': 0}
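The source of msd.get_nanValues is not shown in this notebook. A minimal sketch of a function that would produce a dictionary in the same shape, assuming it simply counts missing entries per column (the name count_missing and the sample rows are hypothetical):

```python
import pandas as pd

def count_missing(df):
    """Return {column name: number of missing (NaN/None) values}."""
    # isna() flags NaN and None; sum() counts the flags per column.
    return {col: int(n) for col, n in df.isna().sum().items()}

# Hypothetical sample with two gaps, for illustration only.
example = pd.DataFrame({
    'Company Name': ['Mars, Incorporated', None, 'CHS, Inc.'],
    'Primary State': ['Virginia', 'Florida', None],
    'Primary City': ['McLean', 'Lakeland', 'Inver Grove Heights'],
})
missing = count_missing(example)
print(missing)
```

A result of all zeros, as above for DAT_CIQ and DAT_SF, answers Questions 1 and 2 from the header for the selected columns only.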


In [14]:
# GET FIRST AND SECOND COMPANY NAMES

In [37]:
'''The purpose of this code is to extract from the Company Name column in each dataset the first and second name of
    each company.  In addition, punctuation like a ',' and '.' will need to be removed. 
   
    Modules =  Create and import the get_company_name() function from the msd module. 
    Input   =  To generate the first and second name, the code needs to be run twice on the same dataframe.  Each time, 
              the user needs to identify the dataframe and then the name (First / Second) that they want to obtain. 
    Output  =  A list of every company name for either the first or second name. 
   
    date:   02.10.2018
    author: Chris Cirelli
'''

# CIQ Dataframe
DAT_CIQ_list_first_Name = msd.get_company_name(DAT_CIQ, 'First').copy()
DAT_CIQ_list_second_Name = msd.get_company_name(DAT_CIQ, 'Second').copy()

# SF Dataframe
DAT_SF_list_first_Name = msd.get_company_name(DAT_SF, 'First').copy()
DAT_SF_list_Second_Name = msd.get_company_name(DAT_SF, 'Second').copy()

# Error Check = Verify Lengths of Lists
# List lengths need to equal the length of the columns in the dataframe to properly append.

print(len(DAT_SF['Company Name']))
print(len(DAT_SF_list_first_Name))
print(len(DAT_SF_list_Second_Name))

38813
38813
38813
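The get_company_name() function lives in the msd module and is not reproduced here. As a rough sketch of the idea, assuming a name is split on whitespace and only commas are stripped (the merged output at the end of this notebook keeps periods, e.g. 'Inc.' and 'H.E.'), the helper split_company_name below is hypothetical:

```python
def split_company_name(name):
    """Return (first word, second word) of a company name with commas stripped."""
    words = [w.strip(',') for w in str(name).split()]
    words = [w for w in words if w]          # drop tokens that were only commas
    first = words[0] if words else ''
    second = words[1] if len(words) > 1 else ''
    return first, second

names = ['Mars, Incorporated', 'Publix Super Markets, Inc.', 'CHS, Inc.', 'Cargill']
first_names = [split_company_name(n)[0] for n in names]
second_names = [split_company_name(n)[1] for n in names]
```

Returning an empty string for a missing second word keeps both lists the same length as the input column, which is what the length check above relies on.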


In [20]:
# RECREATE DATAFRAMES WITH FIRST AND SECOND NAMES APPENDED. 

In [21]:
'''The purpose of this code is to append the first and second name lists that we created to the CIQ and SF Dataframes'''

DAT_SF['Company First Name'] = DAT_SF_list_first_Name
DAT_SF['Company Second Name'] = DAT_SF_list_Second_Name
DAT_CIQ['Company First Name'] = DAT_CIQ_list_first_Name
DAT_CIQ['Company Second Name'] = DAT_CIQ_list_second_Name

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
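The SettingWithCopyWarning arises because DAT_CIQ and DAT_SF were created by selecting columns from df_CIQ and df_SF, so pandas cannot tell whether the new column is meant for the subset or the parent. One common way to avoid it, sketched here on a small hypothetical frame, is to take an explicit copy when creating the subset:

```python
import pandas as pd

# Hypothetical stand-in for df_CIQ.
df = pd.DataFrame({
    'Company Name': ['Mars, Incorporated', 'CHS, Inc.'],
    'Primary State': ['Virginia', 'Minnesota'],
    'Other Column': [1, 2],
})

# .copy() makes the subset an independent DataFrame, so adding a column
# to it is unambiguous and leaves the parent frame untouched.
subset = df[['Company Name', 'Primary State']].copy()
subset['Company First Name'] = ['Mars', 'CHS']
```

Applied to this notebook, that would mean appending .copy() to the two column selections that define DAT_CIQ and DAT_SF.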

In [22]:
# HARMONIZE ZIP CODE

In [23]:
'''The purpose of this code is to harmonize the format of the zip codes between the two datasets. 

    Modules = create and import the clean_zip_code() function from the msd module. 
    Input   = a.) a string value of the dataframe (ex 'DAT_CIQ') to tell the module which dataframe to work with. 
             b.) the target dataframe. 
    Output  = A list with each zip code harmonized 
   
    date:   02.10.2018
    author: Chris Cirelli
'''

# Create list of harmonized zipCodes. 
DAT_CIQ_ZIP = msd.clean_zip_code('DAT_CIQ', DAT_CIQ)
DAT_SF_ZIP = msd.clean_zip_code('DAT_SF', DAT_SF)

# Append lists to the CIQ and SF dataframes. 
DAT_CIQ['Zip Code Clean'] = DAT_CIQ_ZIP
DAT_SF['Zip Code Clean'] = DAT_SF_ZIP

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
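The body of clean_zip_code() is not shown. A minimal sketch of one way to harmonize the two formats, assuming the goal is a five-digit string with any ZIP+4 extension dropped (harmonize_zip is a hypothetical name, and the zero padding is an assumption that the original function may not apply):

```python
def harmonize_zip(value):
    """Return a five-digit ZIP string, e.g. '22101-3881' -> '22101'."""
    text = str(value).strip()
    base = text.split('-')[0].split('.')[0]   # drop ZIP+4 suffix and any '.0' from floats
    digits = ''.join(ch for ch in base if ch.isdigit())
    # Excel often stores ZIPs as numbers, which loses leading zeros (03431 -> 3431).
    return digits.zfill(5) if digits else ''

examples = ['22101-3881', 3431, ' 60018 ', 3431.0]
cleaned = [harmonize_zip(z) for z in examples]
```

Whatever rule is used, it has to be applied identically to both dataframes, since the matching step compares the cleaned values directly.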


In [25]:
CIQ_head = DAT_CIQ.head(1)

In [35]:
def get_match_v2():
    '''The purpose of this code is to match records from the SF and CIQ dataframes. 
    Input  = The CIQ and SF Dataframes.  Requires that these dataframes were pre-cleaned by the code included in the
             Module_Starr_DataMerger file. 
    Output = The DAT_CIQ Dataframe with the matching values appended to each row. 
    
    Date:    02.10.2018
    author:  Chris Cirelli
    '''
    
    # Create a list of row tuples for each dataframe. 
    CIQ = [x for x in CIQ_head.itertuples()]         # Revert to DAT_CIQ when finished testing. 
    SF = [x for x in DAT_SF.itertuples()]
    
    # Loop over each row of the CIQ Dataframe. 
    for row_CIQ in CIQ:
        
        # Get the index label of the target CIQ row (itertuples exposes it as .Index;
        # .index is the tuple's built-in method).  Use this at end of code. 
        row_CIQ_index_value = row_CIQ.Index
        
        # Loop over each row of the SF Dataframe. 
        for row_SF in SF:
            if row_CIQ[8] in row_SF[8]:
            
                # Limit SF dataframe to only those records that have the CIQ zip code
                SF_limit = DAT_SF['Zip Code Clean'] == row_CIQ[8]
                # Define new SF Dataframe
                SF_limited_zip = DAT_SF[SF_limit]
                # Create a new SF tuple object from the SF limited dataframe. 
                SF_2 = [x for x in SF_limited_zip.itertuples()]
                
                # Iterate over new SF dataframe
                for row_SF2 in SF_2:
                    # See if the first name of the same company in question is in the SF dataframe
                    if row_CIQ[6] in row_SF2[6]:
                        
                        # Limit the SF Dataframe to only those records that have the CIQ first company name
                        SF_limit = SF_limited_zip['Company First Name'] == row_SF2[6]
                        # Define new SF Dataframe
                        SF_limited_firstName = SF_limited_zip[SF_limit]
                        # Create a new SF tuple object from the SF limited dataframe. 
                        SF_3 = [x for x in SF_limited_firstName.itertuples()]
                        
                        
                        # Iterate over new SF dataframe
                        for row_SF3 in SF_3:
                            
                            # Check to see if there is a match with the second name from our original CIQ dataframe
                            if row_CIQ[7] in row_SF3[7]:
                            
                                # Limit the SF Dataframe to only those records that have the CIQ second company name
                                SF_limit = SF_limited_firstName['Company Second Name'] == row_SF3[7]
                                # Define Final SF Dataframe
                                SF_Final = SF_limited_firstName[SF_limit]
                        
                                return SF_Final
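The nested loops above narrow the SF dataframe by zip code, then first name, then second name. As a possible alternative, a single merge on those three cleaned columns does comparable work for every CIQ row at once. This sketch is a simplification: it uses exact equality on all three keys, whereas the loops use substring ('in') checks before filtering, and the two small frames are hypothetical stand-ins for DAT_CIQ and DAT_SF:

```python
import pandas as pd

keys = ['Zip Code Clean', 'Company First Name', 'Company Second Name']

ciq = pd.DataFrame({
    'Excel Company ID': ['IQ184468', 'IQ201170'],
    'Company Name': ['Mars, Incorporated', 'Publix Super Markets, Inc.'],
    'Company First Name': ['Mars', 'Publix'],
    'Company Second Name': ['Incorporated', 'Super'],
    'Zip Code Clean': ['22101', '33811'],
})
sf = pd.DataFrame({
    'Client Ultimate Parent DUNS Number': [3250685],
    'Company Name': ['Mars, Incorporated'],
    'Company First Name': ['Mars'],
    'Company Second Name': ['Incorporated'],
    'Zip Code Clean': ['22101'],
})

# Left merge keeps every CIQ row; indicator=True adds a '_merge' column
# that is 'both' for matched rows and 'left_only' for unmatched ones.
matched = ciq.merge(sf, on=keys, how='left',
                    suffixes=('_CIQ', '_SF'), indicator=True)
n_matched = int((matched['_merge'] == 'both').sum())
```

If several SF records share the same three keys, the merge returns one output row per combination, so duplicates would need to be reviewed before treating a match as unique.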
                        
                    
                

In [36]:
Matching_record = get_match_v2()

<built-in method index of Pandas object at 0x00000239F172EF48>


In [28]:
Match_index_value = Matching_record.index

In [29]:
Matching_record.index = [0]

In [None]:
'''Requirements to merge

DAT_CIQ dataframe
Matching record from our get_match_v2() function
Matching record - set index to the same value as that of the DAT_CIQ record. 

'''

In [30]:
pd.merge(left = DAT_CIQ, 
         right = Matching_record, 
         left_index = True, 
         right_index = True, 
         how = 'outer')

Unnamed: 0,Excel Company ID,Company Name_x,Primary State,Primary City,Primary Zip Code/Postal Code,Company First Name_x,Company Second Name_x,Zip Code Clean_x,Client Ultimate Parent DUNS Number,Company Name_y,Billing State/Province,Billing City,Billing Zip/Postal Code,Company First Name_y,Company Second Name_y,Zip Code Clean_y
0,IQ184468,"Mars, Incorporated",Virginia,McLean,22101,Mars,Incorporated,22101,3250685.0,"Mars, Incorporated",VA,McLean,22101-3881,Mars,Incorporated,22101
1,IQ201170,"Publix Super Markets, Inc.",Florida,Lakeland,33811,Publix,Super,33811,,,,,,,,
2,IQ117946,"Cox Enterprises, Inc.",Georgia,Atlanta,30328,Cox,Enterprises,30328,,,,,,,,
3,IQ160810,"CHS, Inc.",Minnesota,Inver Grove Heights,55077,CHS,Inc.,55077,,,,,,,,
4,IQ160716,"C&S Wholesale Grocers, Inc.",New Hampshire,Keene,3431,C&S,Wholesale,3431,,,,,,,,
5,IQ897901,H.E. Butt Grocery Company,Texas,San Antonio,78204,H.E.,Butt,78204,,,,,,,,
6,IQ800735,"Reyes Holdings, LLC",Illinois,Rosemont,60018,Reyes,Holdings,60018,,,,,,,,
7,IQ4224685,Trinity Health Corporation,Michigan,Livonia,48152,Trinity,Health,48152,,,,,,,,
8,IQ162883,"Menard, Inc.",Wisconsin,Eau Claire,54703,Menard,Inc.,54703,,,,,,,,
9,IQ162388,"Dairy Farmers of America, Inc.",Missouri,Kansas City,64153,Dairy,Farmers,64153,,,,,,,,
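The index-based merge works because the matching SF record was relabelled with the index of its CIQ row before merging. A small sketch of that step, assuming the CIQ row label is known (the frames and the SF label 1234 are hypothetical):

```python
import pandas as pd

ciq = pd.DataFrame({
    'Excel Company ID': ['IQ184468', 'IQ201170', 'IQ117946'],
    'Company Name': ['Mars, Incorporated', 'Publix Super Markets, Inc.',
                     'Cox Enterprises, Inc.'],
})

# The matching SF record still carries its own row label (hypothetical value).
match = pd.DataFrame({
    'Client Ultimate Parent DUNS Number': [3250685],
    'Company Name': ['Mars, Incorporated'],
}, index=[1234])

ciq_label = 0                     # index label of the CIQ row that was matched
match.index = [ciq_label]         # relabel so the two frames align on index

merged = pd.merge(ciq, match, left_index=True, right_index=True,
                  how='left', suffixes=('_CIQ', '_SF'))
```

Without the relabelling, an outer merge on index would place the SF record on its own row (label 1234 here) instead of next to the CIQ row it matches.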
