# This notebook's process

1. Load in Crunchbase dataframes (4 merged CSVs created in `1_SS_EDA.ipynb`)
    - Organizations: `files/output/organizations_merged.csv`
    - Jobs: `files/output/p1_jobs.csv`
    - Investments: `files/output/p1_investments.csv`
    - Partner investments: `files/output/p1_investments_partner.csv`
2. Select date
3. Filter the dataframes by date
4. Save filtered dataframes as separate CSVs, and then load in as SFrames
    - Crunchbase network: `files/output/graph_temp/cb/{}_df.csv`
    - Pledge 1% network: `files/output/graph_temp/p1/{}_df.csv`
    - Not Pledge 1% network: `files/output/graph_temp/cb/{}_df.csv`
5. Load SFrames into graph
6. Reduce size of dataset by limiting degrees of freedom from Pledge 1%
7. Create random sample of non-P1 organizations, equal to number of P1 organizations
8. Load in updated dataframes with sample uuids
9. Save filtered dataframes as separate CSVs, and then load in as SFrames
    - Crunchbase network: `files/output/graph_model/cb/{}_df.csv`
    - Pledge 1% network: `files/output/graph_model/cb/{}_df.csv`
    - Not Pledge 1% network: `files/output/graph_model/cb/{}_df.csv`
    - Model network: `files/output/graph_model/model/{}_df.csv`
10. Load SFrames into graph
11. Graph feature calculations, save to CSV
    - Pagerank: `files/output/graph_model/model/pagerank.csv`

### To be explored further
- Applying weights to network based on edge `status`
- Calculate another useful graph feature to include in the model
- EDA on model graph

### Model
**p1_tag ~ `rank` + `total_funding_usd` + `employee_count` (ordinal) + `country` (nominal, 210 indicator columns) + `category_groups` (nominal, 46 indicator columns) + `days_since_founding` + ((GRAPH FEATURES))**
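The formula above mixes ordinal and nominal predictors. As a minimal sketch of how such columns could be encoded with pandas (the toy values and the bucket ordering for `employee_count` are hypothetical and are not taken from the Crunchbase data), nominal fields become one indicator column per level, while ordinal fields become integer codes that preserve their order:

```python
import pandas as pd

# Toy data; the real columns come from organizations_merged.csv
toy = pd.DataFrame({
    'country': ['USA', 'GBR', 'USA'],
    'employee_count': ['1-10', '11-50', '1-10'],
})

# Nominal: one 0/1 indicator column per distinct level
country_indicators = pd.get_dummies(toy['country'], prefix='country')

# Ordinal: integer codes that respect a stated ordering (hypothetical buckets)
bucket_order = ['1-10', '11-50', '51-100']
toy['employee_ord'] = pd.Categorical(
    toy['employee_count'], categories=bucket_order, ordered=True
).codes

print(list(country_indicators.columns))
print(toy['employee_ord'].tolist())
```

With 210 countries and 46 category groups, the same call would produce the indicator counts quoted in the formula.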

In [1]:
'''Importing basic data analysis packages'''
import numpy as np
import pandas as pd
import csv
import warnings
import os
import time
import math
from datetime import datetime
#datetime.today().strftime('%Y%m%d')
warnings.filterwarnings('ignore')

'''Graph'''
import networkx as nx
from pyvis.network import Network
from turicreate import SFrame, SGraph, pagerank, degree_counting, aggregate, visualization

'''Plotting packages'''
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
sns.set(style='white', font_scale=1.3)

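def reduce_mem_usage(df, verbose=True):
    '''
    Downcast each numeric column to the smallest integer/float dtype that holds its min and max.
    Modifies `df` in place and returns it. Note that float16 keeps only about 3 significant digits.
    '''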
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

def network_by_date(date, df_input, jobs_input, invest_input, invest_prtnr_input, model_uuids=[], skip_not_p1=True):
    '''
    This function filters down Crunchbase dataframes by date 
    to ensure that the companies/people/investments being used in modeling exist at a given time.

    INPUT:
        - `date`: string w/ format 'YEAR-MO-DY' (e.g. '2020-09-08')
        - `df_input`: pandas dataframe of Crunchbase organizations with necessary column fields:
            * `p1_date`, `founded_on`, `closed_on`
        - `jobs_input`: pandas dataframe of Crunchbase jobs with necessary column fields:
            * `p1_date`, `started_on`, `ended_on`
        - `invest_input`: pandas dataframe of Crunchbase investments with necessary column fields:
            * `p1_date`, `announced_on`
        - `invest_prtnr_input`: pandas dataframe of Crunchbase partner investments with necessary column fields:
            * `p1_date`, `announced_on`
        - `model_uuids`: list that contains the uuids of organizations that are used to construct the model graph
        - `skip_not_p1`: if False, the ~Pledge 1% neighborhood is also calculated and its shapes are printed
    
    OUTPUT:
        - Two lists of 12 dataframes each: 
            * [Crunchbase neighborhood dataframes], [Pledge 1% neighborhood dataframes] if `model_uuids` is empty
            * [Crunchbase neighborhood dataframes], [Model neighborhood dataframes] if `model_uuids` is supplied
            * The ~Pledge 1% neighborhood dataframes are printed but not returned
        - Each dataframe list contains dataframes that will be used in the next processing step:
            0. Companies
            1. Investors
            2. Investments
            3. Partner investments
            4. Current Jobs
            5. Former jobs
            6. Former affiliated's new jobs
            7. Partner investor's affiliation (if not in jobs dataframes)
            8. Partner investor's coworkers at the investing firm
            9. Partner investor's coworkers' partner investments
            10. Current affiliated's old jobs
            11. Organization nodes from edges in 2,3,6,7,9,10 if not already in 0 or 1
    '''
    
    
    # Work on copies so the caller's dataframes are not modified
    df = df_input.copy()
    jobs = jobs_input.copy()
    invest = invest_input.copy()
    invest_prtnr = invest_prtnr_input.copy()
    
    #*******************************************************************************************************
    # DATE PROCESSING
    
    # Convert date columns to datetime
    df['p1_date'] = pd.to_datetime(df['p1_date'], errors='coerce')
    df['founded_on'] = pd.to_datetime(df['founded_on'], errors='coerce')
    df['closed_on'] = pd.to_datetime(df['closed_on'], errors='coerce')
    jobs['p1_date'] = pd.to_datetime(jobs['p1_date'], errors='coerce')
    jobs['started_on'] = pd.to_datetime(jobs['started_on'], errors='coerce')
    jobs['ended_on'] = pd.to_datetime(jobs['ended_on'], errors='coerce')
    invest['p1_date'] = pd.to_datetime(invest['p1_date'], errors='coerce')
    invest['announced_on'] = pd.to_datetime(invest['announced_on'], errors='coerce')
    invest_prtnr['p1_date'] = pd.to_datetime(invest_prtnr['p1_date'], errors='coerce')
    invest_prtnr['announced_on'] = pd.to_datetime(invest_prtnr['announced_on'], errors='coerce')
    
    # Convert input date to datetime object
    date = pd.Timestamp(date)
    print('\nAS OF {}:\n'.format(date.strftime('%B %d, %Y').upper()))
    
    #*******************************************************************************************************
    # Create new column for tagging model companies (1 if the organization is in model_uuids, else 0)
    # Assigning the whole column avoids chained assignment, which pandas may apply to a copy
    df['add_to_model'] = df['uuid'].isin(model_uuids).astype(int)
    jobs['add_to_model'] = jobs['org_uuid'].isin(model_uuids).astype(int)
    invest['add_to_model'] = invest['org_uuid'].isin(model_uuids).astype(int)
    invest_prtnr['add_to_model'] = invest_prtnr['org_uuid'].isin(model_uuids).astype(int)
    
    #*******************************************************************************************************
    # COMPANY FILTER
    # Crunchbase company must be founded ON OR BEFORE date and closed AFTER date (or have no closing date)
    CB_companies = df[(df['founded_on']<=date) & 
                      ((df['closed_on']>date) | (pd.isnull(df['closed_on']))) & 
                      (df['primary_role']=='company')].reset_index(drop=True)
    
    #*******************************************************************************************************
    # INVESTOR FILTER:
    # Crunchbase investor must be founded ON OR BEFORE date and closed AFTER date (or have no closing date)
    CB_investors = df[(df['founded_on']<=date) & 
                      ((df['closed_on']>date) | (pd.isnull(df['closed_on']))) & 
                      (df['primary_role']=='investor')].reset_index(drop=True)
    
    #*******************************************************************************************************
    # INVESTMENT FILTER
    # Crunchbase investment must have taken place BEFORE date
    CB_investments = invest[(invest['announced_on']<=date) & 
                            (invest['investor_type']=='organization')].reset_index(drop=True)
    
    #*******************************************************************************************************
    # PARTNER INVESTMENT FILTER
    # Crunchbase partner investment must have taken place BEFORE date
    CB_investment_partners = invest_prtnr[invest_prtnr['announced_on']<=date].reset_index(drop=True)
    
    #*******************************************************************************************************
    # CURRENT JOB FILTER
    # Crunchbase job must have started BEFORE date and ended AFTER date (or date == NaT)
    CB_jobs = jobs[(jobs['job_type'].isin(['executive','board_member','advisor','board_observer'])) & 
                      (jobs['started_on']<=date) & 
                      ((jobs['ended_on']>date) | (pd.isnull(jobs['ended_on'])))].reset_index(drop=True)
    
    #*******************************************************************************************************
    # FORMER JOB FILTER
    # Crunchbase job must have ended BEFORE date or started AFTER date
    CB_jobs_former = jobs[(jobs['job_type'].isin(['executive','board_member','advisor','board_observer'])) & 
                          ((jobs['ended_on']<=date) | (jobs['started_on']>date))].reset_index(drop=True)
    
    #*******************************************************************************************************
    # COMBINE THESE 6 (or 7) INTO LIST OF FRAMES
    lst_of_frames = []
    
    # Crunchbase frames
    CB_frames = [CB_companies,CB_investors,CB_investments,CB_investment_partners,CB_jobs,CB_jobs_former]
    
    # Add to list of frames
    lst_of_frames.append(CB_frames)
    
    # If model_uuids are not supplied, calculate Pledge 1% neighborhood
    if model_uuids == []:
        P1_frames = []
        for frame in CB_frames:
            
            # Pledge 1% frames must have Crunchbase assumptions in addition to an earlier pledge date
            new_frame = frame[frame['p1_date']<=date].reset_index(drop=True).drop('add_to_model',axis=1)
            P1_frames.append(new_frame)
        # Add to list of frames
        lst_of_frames.append(P1_frames)
    
    # If model_uuids are supplied, calculate model neighborhood
    if model_uuids != []:
        model_frames = []
        for frame in CB_frames:
            
            # Include model dataframe if condition satisfied: either are a Pledge 1% company or tagged by model_uuids
            new_frame=frame[(frame['p1_date']<=date) | (frame['add_to_model']==1)].reset_index(drop=True).drop('add_to_model',axis=1)
            model_frames.append(new_frame)
        
        # Add to list of frames
        lst_of_frames.append(model_frames)
    
    # If this boolean value is False, calculate ~Pledge 1% neighborhood
    if skip_not_p1 is False:
        not_P1_frames = []
        for frame in CB_frames:
            
            # Non-Pledge 1% frames must have Crunchbase assumptions in addition to NaT pledge date or later pledge date
            new_frame = frame[(pd.isnull(frame['p1_date']) | (frame['p1_date']>date))].reset_index(drop=True).drop('add_to_model',axis=1)
            not_P1_frames.append(new_frame)
        
        # Add to list of frames
        lst_of_frames.append(not_P1_frames)
        
    # Remove extra column 'add_to_model'
    for idx,frame in enumerate(CB_frames):
        CB_frames[idx] = frame.drop('add_to_model',axis=1)

    #*******************************************************************************************************
    # FORMER NEW JOB FILTER
    print('CaLcUlAtInG... FORMER NEW JOB FILTER')
    
    for frame in lst_of_frames:
        
        # Where do the former affiliated work now?
        former_people = frame[5].person_uuid.unique() # Pull their IDs
        jobs_former_new = CB_frames[4][CB_frames[4].person_uuid.isin(former_people)] # Pull their current jobs from Crunchbase

        # Check they're not already in the current jobs dataframe
        # Combine into one temp data frame
        combined_jobs = pd.concat([frame[4], jobs_former_new]).reset_index(drop=True) 
        df_gpby = combined_jobs.groupby(list(combined_jobs.columns))
        
        # Only count non-duplicated rows
        idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
        
        # Reindex dataframe
        jobs_former_new = combined_jobs.reindex(idx)
        
        # Add to list of frames
        frame.append(jobs_former_new)
    
    #*******************************************************************************************************
    # PARTNER INVESTMENT JOB FILTER
    print('CaLcUlAtInG... PARTNER INVESTMENT JOB FILTER')
    
    for frame in lst_of_frames:
        
        # Are the partner investment jobs already in one of the jobs dataframes? If not, we should add them.
        
        # Create temporary dataframe and column to make checking the intersection between dataframes easier 
        # frame[4]: current jobs | frame[5]: former jobs | frame[6]: former new jobs
        jobs_combined = pd.concat([frame[4],frame[5],frame[6]])
        jobs_combined['person,company'] = jobs_combined['person_uuid'] + ',' + jobs_combined['org_uuid']
        
        # frame[3]: partner investments
        frame[3]['person,company'] = frame[3]['partner_uuid']+ ',' + frame[3]['investor_uuid']

        # Number of unique partner investments
        unique_PI = frame[3]['person,company'].unique()

        # Overlap between PI and combined J frames, create temporary jobs view
        # These PI are already found in J frames, so we do not need to include them
        jobs_already_in_J = jobs_combined[jobs_combined['person,company'].isin(unique_PI)] 

        # This will return non intersecting value of PI with temp J
        # These PI are not found in J, so we would like to include them
        PI_not_in_J = np.setdiff1d(unique_PI,jobs_already_in_J['person,company'].unique())

        # Need to create separate jobs dataframe for non intersecting PI/J person/company pairs
        grouped = frame[3][frame[3]['person,company'].isin(PI_not_in_J)].groupby(['partner_uuid','partner_name','investor_uuid','investor_name']).count()
        grouped_df = grouped.reset_index()[['partner_uuid','partner_name','investor_uuid','investor_name']]
        grouped_df['job_type'] = 'executive'
        
        # Add to list of frames
        frame.append(grouped_df)
    
    #*******************************************************************************************************
    # OTHER FIRM PARTNERS
    print('CaLcUlAtInG... OTHER FIRM PARTNER JOBS & INVESTMENTS FILTER')
    
    for frame in lst_of_frames:
        
        # OTHER FIRM PARTNERS - JOBS
        # Who are the other partners that work at the investment firms present in the neighborhood?
        
        # Get the unique investor uuids associated with the dataframes
        # frame[2]: from investments dataframe
        unique_investor_firm_A = list(frame[2]['investor_uuid'].unique())
        
        # frame[3]: from partner investments dataframe
        unique_investor_firm_B = list(frame[3]['investor_uuid'].unique())
        partners = list(frame[3]['partner_uuid'].unique())
        
        # Combine to get list of unique uuids of VC firms
        unique_firms = list(set(unique_investor_firm_A+unique_investor_firm_B))
        
        # Grab current jobs from Crunchbase for these investing firms
        # Exclude duplicate partner job (already represented by partners list calculated above)
        partner_jobs = CB_frames[4][(CB_frames[4]['org_uuid'].isin(unique_firms)) &  
                                    ~(CB_frames[4]['person_uuid'].isin(partners))].reset_index(drop=True)
        
        # Check they're not already in the current/former jobs dataframe
        # Combine into one temp data frame
        combined_jobs = pd.concat([frame[4], partner_jobs]).reset_index(drop=True) 
        df_gpby = combined_jobs.groupby(list(combined_jobs.columns))
        
        # Only count non-duplicated rows
        idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
        
        # Reindex dataframe
        partner_jobs = combined_jobs.reindex(idx)
        
        # Add to list of frames
        frame.append(partner_jobs)
        
        # OTHER FIRM PARTNERS - PARTNER INVESTMENTS
        # For these new partners, what companies are they invested in?
        # Get the unique partner uuids associated with the dataframes
        other_partners = partner_jobs['person_uuid'].unique()
        other_partner_investments = CB_frames[3][CB_frames[3]['partner_uuid'].isin(other_partners)]
        
        # Check they're not already in the partner investments dataframe
        # Combine into one temp data frame
        combined_jobs = pd.concat([frame[3], other_partner_investments]).reset_index(drop=True) 
        df_gpby = combined_jobs.groupby(list(combined_jobs.columns))
        
        # Only count non-duplicated rows
        idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
        
        # Reindex dataframe
        other_partner_investments = combined_jobs.reindex(idx)
        
        # Add to list of frames
        frame.append(other_partner_investments)
    
    #*******************************************************************************************************
    # CURRENT OLD JOB FILTER
    print('CaLcUlAtInG... CURRENT OLD JOB FILTER')
    
    for frame in lst_of_frames:
        
        # Where did the current affiliated work previously?
        current_people = frame[4].person_uuid.unique() # Pull their IDs
        jobs_current_old = CB_frames[5][CB_frames[5].person_uuid.isin(current_people)] # Pull their former jobs from Crunchbase

        # Check they're not already in the former jobs dataframe
        # Combine into one temp data frame
        combined_jobs = pd.concat([frame[5], jobs_current_old]).reset_index(drop=True) 
        df_gpby = combined_jobs.groupby(list(combined_jobs.columns))
        
        # Only count non-duplicated rows
        idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
        
        # Reindex dataframe
        jobs_current_old = combined_jobs.reindex(idx)
        
        # Add to list of frames
        frame.append(jobs_current_old)
        
    #*******************************************************************************************************
    # GET EXTRA ORG UUID ATTRIBUTES FROM INVESTMENTS & JOBS
    print('CaLcUlAtInG... EXTRA ORGANIZATION NODES')
    
    CB_orgs = pd.concat([CB_companies, CB_investors])
    
    for frame in lst_of_frames:
        
        unique_orgs = []
        
        # Investments
        unique_orgs.extend(list(frame[2]['investor_uuid'].unique()))
        
        # Partner investments
        unique_orgs.extend(list(frame[3]['investor_uuid'].unique()))
        
        # Former new jobs organizations
        unique_orgs.extend(list(frame[6]['org_uuid'].unique()))
        
        # Partner jobs organizations
        unique_orgs.extend(list(frame[7]['investor_uuid'].unique()))
        
        # Other partner investments organizations
        unique_orgs.extend(list(frame[9]['org_uuid'].unique()))
        
        # Current old jobs organizations
        unique_orgs.extend(list(frame[10]['org_uuid'].unique()))
        
        # Pull their organization information from Crunchbase
        new_org_nodes = CB_orgs[CB_orgs['uuid'].isin(list(set(unique_orgs)))]
        
        # Add to list of frames
        frame.append(new_org_nodes)
    
    #*******************************************************************************************************
    del df['add_to_model'], invest['add_to_model'], invest_prtnr['add_to_model'], jobs['add_to_model']
    
    # Output print statements
    print('\nCrunchbase Neighborhood')
    print('NODES | OUTPUT FRAME 0/CB_companies {}'.format(CB_frames[0].shape))
    print('NODES | OUTPUT FRAME 1/CB_investors {}'.format(CB_frames[1].shape))
    print('NODES&EDGES | OUTPUT FRAME 2/CB_investments {}'.format(CB_frames[2].shape))
    print('NODES&EDGES | OUTPUT FRAME 3/CB_investment_partners {}'.format(CB_frames[3].shape))
    print('NODES&EDGES | OUTPUT FRAME 4/CB_jobs {}'.format(CB_frames[4].shape))
    print('NODES&EDGES | OUTPUT FRAME 5/CB_jobs_former {}'.format(CB_frames[5].shape))
    print('NODES&EDGES | OUTPUT FRAME 6/CB_jobs_former_new {}'.format(CB_frames[6].shape))
    print('NODES&EDGES | OUTPUT FRAME 7/CB_jobs_partner {}'.format(CB_frames[7].shape))
    print('NODES&EDGES | OUTPUT FRAME 8/CB_jobs_other_partners {}'.format(CB_frames[8].shape))
    print('NODES&EDGES | OUTPUT FRAME 9/CB_invest_other_partners {}'.format(CB_frames[9].shape))
    print('NODES&EDGES | OUTPUT FRAME 10/CB_jobs_current_old {}'.format(CB_frames[10].shape))
    print('NODES | OUTPUT FRAME 11/CB_extra_org_nodes {}'.format(CB_frames[11].shape))
    
    if model_uuids != []:

        print('\nModel Neighborhood')
        print('NODES | OUTPUT FRAME 0/model_companies {}'.format(model_frames[0].shape))
        print('NODES | OUTPUT FRAME 1/model_investors {}'.format(model_frames[1].shape))
        print('NODES&EDGES | OUTPUT FRAME 2/model_investments {}'.format(model_frames[2].shape))
        print('NODES&EDGES | OUTPUT FRAME 3/model_investment_partners {}'.format(model_frames[3].shape))
        print('NODES&EDGES | OUTPUT FRAME 4/model_jobs {}'.format(model_frames[4].shape))
        print('NODES&EDGES | OUTPUT FRAME 5/model_jobs_former {}'.format(model_frames[5].shape))
        print('NODES&EDGES | OUTPUT FRAME 6/model_jobs_former_new {}'.format(model_frames[6].shape))
        print('NODES&EDGES | OUTPUT FRAME 7/model_jobs_partner {}'.format(model_frames[7].shape))
        print('NODES&EDGES | OUTPUT FRAME 8/model_jobs_other_partners {}'.format(model_frames[8].shape))
        print('NODES&EDGES | OUTPUT FRAME 9/model_invest_other_partners {}'.format(model_frames[9].shape))
        print('NODES&EDGES | OUTPUT FRAME 10/model_jobs_current_old {}'.format(model_frames[10].shape))
        print('NODES | OUTPUT FRAME 11/model_extra_org_nodes {}'.format(model_frames[11].shape))
        
        return CB_frames, model_frames
    
    print('\nPledge 1% Neighborhood')
    print('NODES | OUTPUT FRAME 0/P1_companies {}'.format(P1_frames[0].shape))
    print('NODES | OUTPUT FRAME 1/P1_investors {}'.format(P1_frames[1].shape))
    print('NODES&EDGES | OUTPUT FRAME 2/P1_investments {}'.format(P1_frames[2].shape))
    print('NODES&EDGES | OUTPUT FRAME 3/P1_investment_partners {}'.format(P1_frames[3].shape))
    print('NODES&EDGES | OUTPUT FRAME 4/P1_jobs {}'.format(P1_frames[4].shape))
    print('NODES&EDGES | OUTPUT FRAME 5/P1_jobs_former {}'.format(P1_frames[5].shape))
    print('NODES&EDGES | OUTPUT FRAME 6/P1_jobs_former_new {}'.format(P1_frames[6].shape))
    print('NODES&EDGES | OUTPUT FRAME 7/P1_jobs_partner {}'.format(P1_frames[7].shape))
    print('NODES&EDGES | OUTPUT FRAME 8/P1_jobs_other_partners {}'.format(P1_frames[8].shape))
    print('NODES&EDGES | OUTPUT FRAME 9/P1_invest_other_partners {}'.format(P1_frames[9].shape))
    print('NODES&EDGES | OUTPUT FRAME 10/P1_jobs_current_old {}'.format(P1_frames[10].shape))
    print('NODES | OUTPUT FRAME 11/P1_extra_org_nodes {}'.format(P1_frames[11].shape))
    
    # Skip Not P1 Calculations
    if skip_not_p1 is False:
        
        print('\n~Pledge 1% Neighborhood')
        print('NODES | OUTPUT FRAME 0/not_P1_companies {}'.format(not_P1_frames[0].shape))
        print('NODES | OUTPUT FRAME 1/not_P1_investors {}'.format(not_P1_frames[1].shape))
        print('NODES&EDGES | OUTPUT FRAME 2/not_P1_investments {}'.format(not_P1_frames[2].shape))
        print('NODES&EDGES | OUTPUT FRAME 3/not_P1_investment_partners {}'.format(not_P1_frames[3].shape))
        print('NODES&EDGES | OUTPUT FRAME 4/not_P1_jobs {}'.format(not_P1_frames[4].shape))
        print('NODES&EDGES | OUTPUT FRAME 5/not_P1_jobs_former {}'.format(not_P1_frames[5].shape))
        print('NODES&EDGES | OUTPUT FRAME 6/not_P1_jobs_former_new {}'.format(not_P1_frames[6].shape))
        print('NODES&EDGES | OUTPUT FRAME 7/not_P1_jobs_partner {}'.format(not_P1_frames[7].shape))
        print('NODES&EDGES | OUTPUT FRAME 8/not_P1_jobs_other_partners {}'.format(not_P1_frames[8].shape))
        print('NODES&EDGES | OUTPUT FRAME 9/not_P1_invest_other_partners {}'.format(not_P1_frames[9].shape))
        print('NODES&EDGES | OUTPUT FRAME 10/not_P1_jobs_current_old {}'.format(not_P1_frames[10].shape))
        print('NODES | OUTPUT FRAME 11/not_P1_extra_org_nodes {}'.format(not_P1_frames[11].shape))
    
    return CB_frames, P1_frames
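Several steps in `network_by_date` need rows that are not already present in another dataframe, and they get there by concatenating, grouping on every column and keeping groups of size one. As a point of comparison, here is a minimal sketch of a different technique, a left merge with `indicator=True`. It assumes the goal is "rows of the candidate frame that do not appear in the existing frame", which is narrower than what the grouping approach returns. The helper name `rows_not_in` and the toy frames are hypothetical.

```python
import pandas as pd

def rows_not_in(candidates, existing):
    """Return the rows of `candidates` that do not appear in `existing`.

    Both frames are assumed to share the same columns; the merge joins on all of them.
    """
    flagged = candidates.merge(existing.drop_duplicates(), how='left', indicator=True)
    new_only = flagged[flagged['_merge'] == 'left_only']
    return new_only.drop(columns='_merge').reset_index(drop=True)

# Toy job edges: (person, organization)
existing_jobs = pd.DataFrame({'person_uuid': ['a', 'b'], 'org_uuid': ['x', 'y']})
candidate_jobs = pd.DataFrame({'person_uuid': ['b', 'c'], 'org_uuid': ['y', 'z']})

extra_jobs = rows_not_in(candidate_jobs, existing_jobs)
print(extra_jobs)
```

One difference worth keeping in mind: `groupby` drops rows whose key columns contain missing values by default, whereas `merge` keeps them, so the two approaches can disagree on frames with many empty fields.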

# 1. Load in Crunchbase dataframes (merged CSVs created in `1_SS_EDA.ipynb`)

In [12]:
# Import CSVs as Pandas DataFrames
path = 'files/output/organizations_merged.csv'
df = pd.read_csv(path).drop(['Unnamed: 0'],axis=1)
print('INPUT df=p1+org FROM CSV: {}'.format(path))
print('ORGANIZATION/df cols: {}\nSHAPE: {}'.format(df.columns.to_list(), df.shape))
df = reduce_mem_usage(df, verbose=True)

path = 'files/output/p1_jobs.csv'
jobs = pd.read_csv(path)
print('\nINPUT jobs FROM CSV: {}'.format(path))
print('JOBS/jobs cols: {}\nSHAPE: {}'.format(jobs.columns.to_list(), jobs.shape))
jobs = reduce_mem_usage(jobs, verbose=True)

path = 'files/output/p1_investments.csv'
invest = pd.read_csv(path)
print('\nINPUT invest FROM CSV: {}'.format(path))
print('INVESTMENTS/invest cols: {}\nSHAPE: {}'.format(invest.columns.to_list(), invest.shape))
invest = reduce_mem_usage(invest, verbose=True)

path = 'files/output/p1_investments_partner.csv'
invest_prtnr = pd.read_csv(path)
print('\nINPUT invest_prtnr FROM CSV: {}'.format(path))
print('PARTNER INVESTMENTS/invest_prtnr cols: {}\nSHAPE: {}'.format(invest_prtnr.columns.to_list(), invest_prtnr.shape))
invest_prtnr = reduce_mem_usage(invest_prtnr, verbose=True)

print('\n\nPledge 1% UUID: {}'.format(df[df['name']=='Pledge 1%'].uuid.values[0]))

INPUT df=p1+org FROM CSV: files/output/organizations_merged.csv
ORGANIZATION/df cols: ['uuid', 'name', 'type', 'rank', 'roles', 'country_code', 'region', 'status', 'category_groups_list', 'total_funding_usd', 'founded_on', 'closed_on', 'employee_count', 'primary_role', 'p1_tag', 'p1_date']
SHAPE: (1131315, 16)
Mem. usage decreased to 121.00 Mb (12.0% reduction)

INPUT jobs FROM CSV: files/output/p1_jobs.csv
JOBS/jobs cols: ['job_uuid', 'person_uuid', 'person_name', 'org_uuid', 'org_name', 'started_on', 'ended_on', 'is_current', 'title', 'job_type', 'p1_tag', 'p1_date']
SHAPE: (1536376, 12)
Mem. usage decreased to 121.00 Mb (6.0% reduction)

INPUT invest FROM CSV: files/output/p1_investments.csv
INVESTMENTS/invest cols: ['investment_uuid', 'funding_round_uuid', 'investor_uuid', 'investor_name', 'investor_type', 'is_lead_investor', 'investment_type', 'announced_on', 'raised_amount_usd', 'post_money_valuation_usd', 'investor_count', 'lead_investor_uuids', 'lead_investor_count', 'org_uuid'
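
The memory reductions printed above come from `reduce_mem_usage`, which picks the narrowest dtype whose range covers each column's minimum and maximum. The same idea can be illustrated on a toy frame with pandas' built-in `pd.to_numeric(..., downcast=...)`. This is a sketch of the principle and not the notebook's function; the toy values are made up.

```python
import pandas as pd

# Toy frame with two integer columns of very different ranges
toy = pd.DataFrame({'rank': [1, 250, 70000], 'p1_tag': [0, 1, 0]})

# Downcast each integer column to the smallest signed integer dtype that fits
for col in toy.columns:
    toy[col] = pd.to_numeric(toy[col], downcast='integer')

downcast_dtypes = toy.dtypes.astype(str).to_dict()
print(downcast_dtypes)
```

Here `p1_tag` fits in 8 bits, while `rank` needs 32 bits because 70000 exceeds the int16 maximum of 32767.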

# 2. Select date

In [10]:
date = '2020-09-08'
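
The date is passed as a plain string and converted inside `network_by_date` with `pd.Timestamp`. A minimal sketch of that conversion, and of the banner the function prints, is shown here; the variable names `cutoff` and `banner` are hypothetical.

```python
import pandas as pd

date = '2020-09-08'

# Convert the 'YEAR-MO-DY' string to a Timestamp, as network_by_date does
cutoff = pd.Timestamp(date)

# Same formatting as the 'AS OF ...' banner printed by the function
banner = cutoff.strftime('%B %d, %Y').upper()
print(banner)

# Filters in the function use <=, so events on the cutoff day itself are included
same_day_included = pd.Timestamp('2020-09-08') <= cutoff
```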

# 3. Filter the dataframes by date

In [9]:
cb_frames,p1_frames = network_by_date(date, df, jobs, invest, invest_prtnr)

# List of Pledge 1% uuids (companies and investors)
p1_companies_uuid = []
p1_companies_uuid.extend(list(p1_frames[0]['uuid'].unique()))
p1_companies_uuid.extend(list(p1_frames[1]['uuid'].unique()))


AS OF SEPTEMBER 08, 2020:

CaLcUlAtInG... FORMER NEW JOB FILTER
CaLcUlAtInG... PARTNER INVESTMENT JOB FILTER
CaLcUlAtInG... OTHER FIRM PARTNER JOBS & INVESTMENTS FILTER
CaLcUlAtInG... CURRENT OLD JOB FILTER
CaLcUlAtInG... EXTRA ORGANIZATION NODES

Crunchbase Neighborhood
NODES | OUTPUT FRAME 0/CB_companies (825393, 16)
NODES | OUTPUT FRAME 1/CB_investors (31499, 16)
NODES&EDGES | OUTPUT FRAME 2/CB_investments (453058, 17)
NODES&EDGES | OUTPUT FRAME 3/CB_investment_partners (89926, 18)
NODES&EDGES | OUTPUT FRAME 4/CB_jobs (395270, 12)
NODES&EDGES | OUTPUT FRAME 5/CB_jobs_former (182483, 12)
NODES&EDGES | OUTPUT FRAME 6/CB_jobs_former_new (483629, 12)
NODES&EDGES | OUTPUT FRAME 7/CB_jobs_partner (11771, 5)
NODES&EDGES | OUTPUT FRAME 8/CB_jobs_other_partners (434370, 12)
NODES&EDGES | OUTPUT FRAME 9/CB_invest_other_partners (161382, 18)
NODES&EDGES | OUTPUT FRAME 10/CB_jobs_current_old (289847, 12)
NODES | OUTPUT FRAME 11/CB_extra_org_nodes (225184, 17)

Pledge 1% Neighborhood
NODES | OU
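
The core of the date filter is an "exists as of the date" test: an organization is kept if it was founded on or before the date and either closed after it or has no closing date. A minimal sketch on toy data follows; the helper name `active_as_of` and the rows are hypothetical, while the column names follow the organizations dataframe.

```python
import pandas as pd

def active_as_of(orgs, date):
    """Keep organizations founded on or before `date` and not yet closed on `date`."""
    date = pd.Timestamp(date)
    founded = pd.to_datetime(orgs['founded_on'], errors='coerce')
    closed = pd.to_datetime(orgs['closed_on'], errors='coerce')
    mask = (founded <= date) & ((closed > date) | closed.isna())
    return orgs[mask].reset_index(drop=True)

toy_orgs = pd.DataFrame({
    'uuid': ['a', 'b', 'c', 'd'],
    'founded_on': ['2010-01-01', '2021-01-01', '2015-06-01', None],
    'closed_on': [None, None, '2019-12-31', None],
})

active = active_as_of(toy_orgs, '2020-09-08')
print(active['uuid'].tolist())
```

Only `a` survives: `b` is founded after the date, `c` closed before it, and `d` has no founding date, so its comparison against the date is False.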

# 4. Save filtered dataframes as separate CSVs, and then load in as SFrames

### Save filtered dataframes as separate CSVs & load in nodes, edges as SFrames

<a href='https://apple.github.io/turicreate/docs/api/generated/turicreate.SFrame.html'>turicreate.SFrame</a>
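
The save step writes one CSV per dataframe, named by its position in the list. As a minimal sketch of that pattern (the helper `save_frames` is hypothetical, and a temporary directory stands in for `files/output/graph_temp/`), creating the output directory first avoids a failure in `to_csv` when the folder does not exist yet:

```python
import os
import tempfile
import pandas as pd

def save_frames(frames, out_dir):
    """Write each dataframe to '<out_dir>/<idx>_df.csv' and return the paths."""
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    paths = []
    for idx, frame in enumerate(frames):
        path = os.path.join(out_dir, '{}_df.csv'.format(idx))
        frame.to_csv(path, index=False)
        paths.append(path)
    return paths

toy_frames = [pd.DataFrame({'uuid': ['a', 'b']}), pd.DataFrame({'uuid': ['c']})]

with tempfile.TemporaryDirectory() as tmp:
    saved = save_frames(toy_frames, os.path.join(tmp, 'graph_temp', 'cb'))
    # Read the files back to confirm the round trip
    reloaded = [pd.read_csv(p) for p in saved]

print([os.path.basename(p) for p in saved])
```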

In [12]:
for idx, frame in enumerate(cb_frames):
    path = 'files/output/graph_temp/cb/{}_df.csv'.format(idx)
    print('SAVED TO CSV', path)
    frame.to_csv(path, index=False)
for idx, frame in enumerate(p1_frames):
    path = 'files/output/graph_temp/p1/{}_df.csv'.format(idx)
    print('SAVED TO CSV', path)
    frame.to_csv(path, index=False)

('SAVED TO CSV', 'files/output/graph_temp/cb/0_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/cb/1_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/cb/2_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/cb/3_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/cb/4_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/cb/5_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/cb/6_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/cb/7_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/cb/8_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/cb/9_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/cb/10_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/cb/11_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/p1/0_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/p1/1_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/p1/2_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/p1/3_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/p1/4_df.csv')
('SAVED TO CSV', 'files/output/graph_temp/p1/5

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,float,str,str,str,str,str,float,str,str,str,str,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,float,str,str,str,str,str,float,str,str,str,str,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,float,float,float,str,float,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,float,float,float,str,float,str,str,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,float,float,float,str,float,str,str,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,float,str,str,str,str,str,float,str,str,str,str,int,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,float,str,str,str,str,str,float,str,str,str,str,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,float,str,str,str,str,str,float,str,str,str,str,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,float,float,float,str,float,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,float,float,float,str,float,str,str,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,float,float,float,str,float,str,str,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,float,str,str,str,str,str,float,str,str,str,str,int,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [2]:
lst_of_frames = []
for val in ['cb','p1']:
    lst = []
    for idx in range(12):
        path = 'files/output/graph_temp/{}/{}_df.csv'.format(val, idx)
        lst.append(SFrame(data=path))
    lst_of_frames.append(lst)
cb_sframes,p1_sframes = lst_of_frames

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,float,str,str,str,str,str,float,str,str,str,str,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,float,str,str,str,str,str,float,str,str,str,str,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,float,float,float,str,float,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,float,float,float,str,float,str,str,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,float,float,float,str,float,str,str,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,float,str,str,str,str,str,float,str,str,str,str,int,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,float,str,str,str,str,str,float,str,str,str,str,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,float,str,str,str,str,str,float,str,str,str,str,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,float,float,float,str,float,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,float,float,float,str,float,str,str,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,float,float,float,str,float,str,str,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,float,str,str,str,str,str,float,str,str,str,str,int,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


# 5. Load SFrames into graph

### Create functions to format SFrames for loading into SGraph

#### Vertices: Person, Company, or Investor

Node attributes: `__id`, `__node_type`, `name`

#### Edges: Investment, Job

Edge attributes: `__src_id`, `__dst_id`, `__edge_type`, `status`, {`__id`}, {`investment_type`,`raised_amount_usd`, `investor_count`, `is_lead_investor`, `lead_investor_count`}, {`job_type`, `title`}
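
As a rough illustration of the schema above, the sketch below turns one jobs row into two vertex records and one edge record using plain dictionaries. The function name `job_row_to_records` and the sample row are hypothetical. The notebook itself builds these records with SFrame column renames rather than dictionaries.

```python
def job_row_to_records(row, status='primary'):
    """Map one jobs row to (person vertex, company vertex, job edge)."""
    person = {'__id': row['person_uuid'],
              '__node_type': 'person',
              'name': row['person_name']}
    company = {'__id': row['org_uuid'],
               '__node_type': 'company',
               'name': row['org_name']}
    edge = {'__id': row['job_uuid'],
            '__src_id': row['person_uuid'],   # edge runs Person --> Company
            '__dst_id': row['org_uuid'],
            '__edge_type': 'job',
            'status': status,
            'job_type': row['job_type'],
            'title': row['title']}
    return person, company, edge

# Hypothetical row, for illustration only (not taken from the Crunchbase data)
sample_row = {'job_uuid': 'job-1', 'person_uuid': 'person-1',
              'person_name': 'Example Person', 'org_uuid': 'org-1',
              'org_name': 'Example Org', 'job_type': 'executive',
              'title': 'CEO'}
person, company, edge = job_row_to_records(sample_row)
```

Investment edges would follow the same pattern, with the investment-specific attributes in place of `job_type` and `title`.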


In [3]:
import copy

def load_vertices(sframes, g):
    
    # Person vertices from the jobs dataframes
    for idx in [4,5,6,8,10]:
        frame_temp = sframes[idx][['person_uuid', 'person_name']].rename({'person_uuid':'__id', 'person_name':'name'})
        frame_temp['__node_type'] = 'person'
        frame_temp['p1_tag'] = 0
        g = g.add_vertices(vertices=frame_temp, vid_field='__id')
    
    # Company vertices from the investments, partner investments, and jobs dataframes
    for idx in [2,3,4,5,6,8,9,10]:
        frame_temp = sframes[idx][['org_uuid', 'org_name', 'p1_tag']].rename({'org_uuid':'__id', 'org_name':'name'})
        frame_temp['p1_tag'] = frame_temp['p1_tag'].apply(lambda x: 0 if (x=="" or x==0) else 1)
        frame_temp['p1_tag'] = frame_temp['p1_tag'].astype(int)
        frame_temp['__node_type'] = 'company'
        g = g.add_vertices(vertices=frame_temp, vid_field='__id')
    
    # Investor vertices from the investments and partner investments dataframes
    for idx in [2,3,7,9]:
        frame_temp = sframes[idx][['investor_uuid', 'investor_name']].rename({'investor_uuid':'__id', 'investor_name':'name'})
        frame_temp['__node_type'] = 'investor'
        frame_temp['p1_tag'] = 0
        g = g.add_vertices(vertices=frame_temp, vid_field='__id')
    
    # Partner (person) vertices from the partner investments dataframes
    for idx in [3,7,9]:
        frame_temp = sframes[idx][['partner_uuid', 'partner_name']].rename({'partner_uuid':'__id', 'partner_name':'name'})
        frame_temp['__node_type'] = 'person'
        frame_temp['p1_tag'] = 0
        g = g.add_vertices(vertices=frame_temp, vid_field='__id')
    
    # Organizations
    for idx in [0,1,11]:
        # Create id field in SFrame
        frame_temp = sframes[idx][['uuid', 'name', 'primary_role', 'p1_tag']].rename({'uuid':'__id', 'primary_role':'__node_type'})
        frame_temp['p1_tag'] = frame_temp['p1_tag'].apply(lambda x: 0 if (x=="" or x==0) else 1)
        frame_temp['p1_tag'] = frame_temp['p1_tag'].astype(int)
        g = g.add_vertices(vertices=frame_temp, vid_field='__id')
    
    return g

def find_p1_affiliations(p1_sframes):
    frames = copy.deepcopy(p1_sframes)
    
    # Combine company and investor Pledge 1% dataframes
    p1_affiliations = frames[0][['uuid']].append(frames[1][['uuid']])
    
    # Add edge connecting to Pledge 1% uuid
    p1_affiliations['p1_uuid'] = 'fd9e2d10-a882-c6f4-737e-fd388d4ffd7c'
    
    # Create id, source, destination fields in SFrame
    p1_affiliations = p1_affiliations.rename({'uuid':'src','p1_uuid':'dst'})
    p1_affiliations['p1_tag'] = 1
    
    return p1_affiliations

p1_aff = find_p1_affiliations(p1_sframes)

def load_edges(sframes, g, p1_affiliations=None, include_edges=(2, 3)):
    
    if isinstance(p1_affiliations, SFrame):
        # P1 Companies: Company/Investor --> Pledge 1%
        g = g.add_edges(edges=p1_affiliations, src_field='src', dst_field='dst')
    
    # Investments: Investor --> Company
    # Create id, source, destination fields in SFrame
    frame_temp = sframes[2][['investment_uuid','investor_uuid','org_uuid','investment_type','raised_amount_usd','investor_count','is_lead_investor','lead_investor_count']].rename({'investment_uuid':'__id','investor_uuid':'src','org_uuid':'dst'})
    frame_temp['__edge_type'] = 'investment'
    frame_temp['status'] = 'primary'
    g = g.add_edges(edges=frame_temp, src_field='src', dst_field='dst')
    
    # Partner Investments, Investments: Person --> Company
    # Create id, source, destination fields in SFrame
    frame_temp = sframes[3][['investment_uuid','partner_uuid','org_uuid','investment_type','raised_amount_usd','investor_count']].rename({'investment_uuid':'__id','partner_uuid':'src','org_uuid':'dst'})
    frame_temp['__edge_type'] = 'investment'
    frame_temp['status'] = 'primary'
    g = g.add_edges(edges=frame_temp, src_field='src', dst_field='dst')
    
    # Partner Investments, Investments: Investor --> Company
    # Create id, source, destination fields in SFrame
    frame_temp = sframes[3][['investor_uuid','org_uuid','investment_type','investor_count']].rename({'investor_uuid':'src','org_uuid':'dst'})
    frame_temp['__edge_type'] = 'investment'
    frame_temp['status'] = 'secondary'
    # Secondary relationships
    if 2 in include_edges:
        g = g.add_edges(edges=frame_temp, src_field='src', dst_field='dst')
    
    # Partner Investments, Jobs: Person (partner) --> Investor
    # Create id, source, destination fields in SFrame
    frame_temp = sframes[7][['partner_uuid','investor_uuid']].rename({'partner_uuid':'src','investor_uuid':'dst'})
    frame_temp['__edge_type'] = 'job'
    frame_temp['status'] = 'secondary'
    # Secondary relationships
    if 2 in include_edges:
        g = g.add_edges(edges=frame_temp, src_field='src', dst_field='dst')    
    
    # Other Partner Investments, Investments: Person --> Company
    # Create id, source, destination fields in SFrame
    frame_temp = sframes[9][['investment_uuid','partner_uuid','org_uuid','investment_type','raised_amount_usd','investor_count']].rename({'investment_uuid':'__id','partner_uuid':'src','org_uuid':'dst'})
    frame_temp['__edge_type'] = 'investment'
    frame_temp['status'] = 'tertiary'
    # Tertiary relationships
    if 3 in include_edges:
        g = g.add_edges(edges=frame_temp, src_field='src', dst_field='dst')
    
    # Jobs: Person --> Company
    for idx in [4,5,6,8,10]:
        # Create id, source, destination fields in SFrame
        frame_temp = sframes[idx][['job_uuid','person_uuid','org_uuid','job_type','title']].rename({'job_uuid':'__id','person_uuid':'src','org_uuid':'dst'})
        frame_temp['__edge_type'] = 'job'
        
        # Current jobs
        if idx == 4:
            frame_temp['status'] = 'primary'
            g = g.add_edges(edges=frame_temp, src_field='src', dst_field='dst')
            continue
        
        # Secondary relationships
        if 2 in include_edges:
            
            # Former jobs | Former new jobs | Current old jobs 
            if idx in [5,6,10]:
                frame_temp['status'] = 'secondary'
                g = g.add_edges(edges=frame_temp, src_field='src', dst_field='dst')
                continue
        
        # Tertiary relationships
        if 3 in include_edges:
            
            # Other partners at firm
            if idx == 8:
                frame_temp['status'] = 'tertiary'
                g = g.add_edges(edges=frame_temp, src_field='src', dst_field='dst')
                continue

    return g

cb = load_edges(cb_sframes, load_vertices(cb_sframes,SGraph()), p1_aff)

In [4]:
def load_edges2(sframes, g, p1_affiliations=None, include_edges=(2, 3), reverse=False):
    # Since it is a directed graph, need to include option for reverse direction
    # Forward
    source = 'src'
    destination = 'dst'
    # Reverse
    if reverse:
        source = 'dst'
        destination = 'src'
    if isinstance(p1_affiliations, SFrame):
        # P1 Companies: Company/Investor --> Pledge 1%
        g = g.add_edges(edges=p1_affiliations, src_field=source, dst_field=destination)
    # Investments: Investor --> Company
    # Create id, source, destination fields in SFrame
    frame_temp = sframes[2][['investment_uuid','investor_uuid','org_uuid','investment_type','raised_amount_usd','investor_count','is_lead_investor','lead_investor_count']].rename({'investment_uuid':'__id','investor_uuid':'src','org_uuid':'dst'})
    frame_temp['__edge_type'] = 'investment'
    frame_temp['status'] = 'primary'
    g = g.add_edges(edges=frame_temp, src_field=source, dst_field=destination)
    # Partner Investments, Investments: Person --> Company
    # Create id, source, destination fields in SFrame
    frame_temp = sframes[3][['investment_uuid','partner_uuid','org_uuid','investment_type','raised_amount_usd','investor_count']].rename({'investment_uuid':'__id','partner_uuid':'src','org_uuid':'dst'})
    frame_temp['__edge_type'] = 'investment'
    frame_temp['status'] = 'primary'
    g = g.add_edges(edges=frame_temp, src_field=source, dst_field=destination)
    # Partner Investments, Investments: Investor --> Company
    # Create id, source, destination fields in SFrame
    frame_temp = sframes[3][['investor_uuid','org_uuid','investment_type','investor_count']].rename({'investor_uuid':'src','org_uuid':'dst'})
    frame_temp['__edge_type'] = 'investment'
    frame_temp['status'] = 'secondary'
    # Secondary relationships, skip if not specified at input
    if 2 in include_edges:
        g = g.add_edges(edges=frame_temp, src_field=source, dst_field=destination)
    # Partner Investments, Jobs: Person (partner) --> Investor
    # Create id, source, destination fields in SFrame
    frame_temp = sframes[7][['partner_uuid','investor_uuid']].rename({'partner_uuid':'src','investor_uuid':'dst'})
    frame_temp['__edge_type'] = 'job'
    frame_temp['status'] = 'secondary'
    # Secondary relationships, skip if not specified at input
    if 2 in include_edges:
        g = g.add_edges(edges=frame_temp, src_field=source, dst_field=destination)    
    # Other Partner Investments, Investments: Person --> Company
    # Create id, source, destination fields in SFrame
    frame_temp = sframes[9][['investment_uuid','partner_uuid','org_uuid','investment_type','raised_amount_usd','investor_count']].rename({'investment_uuid':'__id','partner_uuid':'src','org_uuid':'dst'})
    frame_temp['__edge_type'] = 'investment'
    frame_temp['status'] = 'tertiary'
    # Tertiary relationships, skip if not specified at input
    if 3 in include_edges:
        g = g.add_edges(edges=frame_temp, src_field=source, dst_field=destination)
    # Jobs: Person --> Company
    for idx in [4,5,6,8,10]:
        # Create id, source, destination fields in SFrame
        frame_temp = sframes[idx][['job_uuid','person_uuid','org_uuid','job_type','title']].rename({'job_uuid':'__id','person_uuid':'src','org_uuid':'dst'})
        frame_temp['__edge_type'] = 'job'
        # Current jobs
        if idx == 4:
            frame_temp['status'] = 'primary'
            g = g.add_edges(edges=frame_temp, src_field=source, dst_field=destination)
            continue
        # Secondary relationships, skip if not specified at input
        if 2 in include_edges:
            # Former jobs | Former new jobs | Current old jobs 
            if idx in [5,6,10]:
                frame_temp['status'] = 'secondary'
                g = g.add_edges(edges=frame_temp, src_field=source, dst_field=destination)
                continue
        # Tertiary relationships, skip if not specified at input
        if 3 in include_edges:
            # Other partners at firm
            if idx == 8:
                frame_temp['status'] = 'tertiary'
                g = g.add_edges(edges=frame_temp, src_field=source, dst_field=destination)
                continue
    return g
# Load the Crunchbase graph with the relationships defined above (primary, secondary, tertiary),
# adding every edge once in the forward direction and once reversed
cb = load_edges2(cb_sframes, load_vertices(cb_sframes,SGraph()), p1_affiliations=[], include_edges=[2,3], reverse=False)
cb = load_edges2(cb_sframes, cb, p1_affiliations=[], include_edges=[2,3], reverse=True)
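
The two calls above add every edge twice, once per direction, so that the directed graph can be traversed both ways. A minimal sketch of that idea on a plain list of edge dictionaries follows. The helper name `add_both_directions` and the sample edge are hypothetical, and the notebook achieves the same effect by swapping `src_field` and `dst_field` instead.

```python
def add_both_directions(edges):
    """Return the edge list plus a mirrored copy of every edge."""
    # dict(e, src=..., dst=...) copies the edge and overrides both endpoints
    mirrored = [dict(e, src=e['dst'], dst=e['src']) for e in edges]
    return edges + mirrored

# Hypothetical edge, for illustration only
forward = [{'src': 'investor-1', 'dst': 'org-1', 'status': 'primary'}]
both = add_both_directions(forward)
```

Other edge attributes such as `status` are carried over unchanged to the mirrored copy.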

### Remove duplicate edges

From: <a href='https://github.com/turi-code/how-to/blob/master/remove_duplicate_edges.py'>Remove duplicate edges from SGraph</a>
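
The de-duplication idea can be sketched without SFrames: two edges count as duplicates only when every field matches, and one copy of each is kept. The helper `remove_duplicate_edges` and the sample edges below are hypothetical. The notebook cell does the equivalent with an SFrame `groupby` over all edge fields.

```python
def remove_duplicate_edges(edges):
    """Keep the first occurrence of each edge, comparing every field."""
    seen = set()
    unique = []
    for edge in edges:
        # Sorting the items makes the key independent of field order
        key = tuple(sorted(edge.items()))
        if key not in seen:
            seen.add(key)
            unique.append(edge)
    return unique

# Hypothetical edges, for illustration only
sample_edges = [
    {'__src_id': 'a', '__dst_id': 'b', 'status': 'primary'},
    {'__src_id': 'a', '__dst_id': 'b', 'status': 'primary'},    # exact duplicate
    {'__src_id': 'a', '__dst_id': 'b', 'status': 'secondary'},  # differs in status
]
deduped = remove_duplicate_edges(sample_edges)
```

Edges that share endpoints but differ in any attribute, such as `status`, are treated as distinct and both survive.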

In [7]:
# Get list of edge fields
graph_edge_fields = cb.get_edge_fields()

# Create temporary edge attribute that you'll use in aggregate function
cb.edges['combined'] = cb.edges['__id']+','+cb.edges['status']+','+cb.edges['__src_id']+','+cb.edges['__dst_id']

# Before comparison
before = cb.summary()
before_pri = cb.get_edges(fields={'status':'primary'}).shape[0]
before_sec = cb.get_edges(fields={'status':'secondary'}).shape[0]
before_ter = cb.get_edges(fields={'status':'tertiary'}).shape[0]

# Keep one row from each group of duplicated edges
cb = SGraph(cb.vertices, cb.edges.groupby(graph_edge_fields, {'combined': aggregate.SELECT_ONE('combined')}))

# After comparison
after = cb.summary()
after_pri = cb.get_edges(fields={'status':'primary'}).shape[0]
after_sec = cb.get_edges(fields={'status':'secondary'}).shape[0]
after_ter = cb.get_edges(fields={'status':'tertiary'}).shape[0]

# Output
print('Remove duplicates from Crunchbase graph')
print('\nNode change: {:,} --> {:,}'.format(before['num_vertices'], after['num_vertices']))
print('Edge change: {:,} --> {:,}'.format(before['num_edges'], after['num_edges']))
print('\nPRIMARY Edge change: {:,} --> {:,}'.format(before_pri,after_pri))
print('SECONDARY Edge change: {:,} --> {:,}'.format(before_sec,after_sec))
print('TERTIARY Edge change: {:,} --> {:,}'.format(before_ter,after_ter))

del cb.edges['combined']

Remove duplicates from Crunchbase graph

Node change: 1,290,346 --> 1,290,346
Edge change: 4,170,144 --> 4,170,144

PRIMARY Edge change: 1,876,400 --> 1,876,400
SECONDARY Edge change: 1,328,800 --> 1,328,800
TERTIARY Edge change: 964,944 --> 964,944


# 6. Reduce size of dataset by limiting degrees of separation from Pledge 1%

### Reduce the CB dataset

Retrieve the graph neighborhood around a set of vertices, ignoring edge directions.

<a href='https://apple.github.io/turicreate/docs/api/generated/turicreate.SGraph.get_neighborhood.html'>turicreate.SGraph.get_neighborhood</a>
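
To make "neighborhood within a radius, ignoring edge directions" concrete, here is a small breadth-first search over a plain edge list. The function `neighborhood` and the toy edges are hypothetical. This is only a sketch of the concept, not how `get_neighborhood` is implemented internally.

```python
from collections import deque

def neighborhood(edges, start, radius):
    """Vertices within `radius` hops of `start`, ignoring edge direction."""
    adjacency = {}
    for src, dst in edges:
        # Record each edge in both directions so direction is ignored
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == radius:
            continue  # do not expand beyond the radius
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[vertex] + 1
                queue.append(neighbor)
    return set(distance)

# Toy chain d -> c -> b -> a -> p1, for illustration only
toy_edges = [('a', 'p1'), ('b', 'a'), ('c', 'b'), ('d', 'c')]
within_two = neighborhood(toy_edges, 'p1', 2)
within_three = neighborhood(toy_edges, 'p1', 3)
```

With the toy chain, raising the radius from 2 to 3 adds exactly one more vertex, which mirrors how the `rad` parameter controls the size of the subgraph.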

In [6]:
# Define radius for calculating degrees of separation away from Pledge 1%
rad = 3

# Create subgraph
cb_smol = cb.get_neighborhood(ids='fd9e2d10-a882-c6f4-737e-fd388d4ffd7c', radius=rad, full_subgraph=True)

# Save dictionaries which store info about graph
before = cb.summary() # Full graph
after = cb_smol.summary() # Subgraph

# Output
print('Radius of the neighborhood: {} degrees of separation from Pledge 1% uuid'.format(rad))
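# Cast to float so the ratio is not truncated by integer division (as happens under Python 2)
print('Reduction in nodes: {:.2f}%'.format((1-(float(after['num_vertices'])/before['num_vertices']))*100))
print('Reduction in edges: {:.2f}%'.format((1-(float(after['num_edges'])/before['num_edges']))*100))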
print('\nNode change: {:,} --> {:,}'.format(before['num_vertices'], after['num_vertices']))
print('Edge change: {:,} --> {:,}'.format(before['num_edges'], after['num_edges']))

Radius of the neighborhood: 3 degrees of separation from Pledge 1% uuid
Reduction in nodes: 100.00%
Reduction in edges: 100.00%

Node change: 1,290,346 --> 44
Edge change: 4,170,144 --> 212
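
To make the radius idea concrete, here is a minimal pure-Python sketch of a radius-limited neighborhood search that ignores edge direction. It is not Turi Create's implementation; the function name `neighborhood` and the toy edge list are hypothetical and exist only for illustration.

```python
from collections import deque

def neighborhood(edges, seeds, radius):
    """Return vertices within `radius` hops of any seed, ignoring edge direction."""
    # Build an undirected adjacency map from directed (src, dst) pairs
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)

    # Breadth-first search that stops expanding once `radius` hops are used
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        v = queue.popleft()
        if dist[v] == radius:
            continue
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)

# Toy chain a -> b -> c -> d -> e (hypothetical data), seeded at 'c'
toy_edges = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e')]
print(sorted(neighborhood(toy_edges, ['c'], 1)))
```

Each extra unit of radius adds one more ring of neighbors, so a subgraph can grow quickly around a well-connected seed.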


# 7. Create random sample of non-P1 organizations, equal to number of P1 organizations

### Retrieve list of company vertices that can be sampled from in model

In [19]:
# Get subgraph vertices to sample from
cb_smol_ver = cb_smol.get_vertices()

# Append investors + companies together
cb_smol_ver_NEW = cb_smol_ver[cb_smol_ver['__node_type']=='investor']
cb_smol_ver_NEW = cb_smol_ver_NEW.append(cb_smol_ver[cb_smol_ver['__node_type']=='company'])

# Grab actual P1 companies, using output of find_p1_affiliations function
pos_labels = pd.DataFrame(p1_aff)['src'].to_list()

# Sample as many vertices as there are P1 companies (P1 IDs are not excluded from the draw)
neg_labels = pd.DataFrame(cb_smol_ver_NEW).sample(len(pos_labels), replace=False).__id.to_list()

# List of IDs to put into subgraph calculation
model_labels = pos_labels + neg_labels

# Don't forget Pledge 1%!
#model_labels.append('fd9e2d10-a882-c6f4-737e-fd388d4ffd7c')

print('Number of model vertices: {:,}'.format(len(model_labels)))

# NOTE: Some vertices in the 4deg subgraph have "None" as their node_type; this still needs investigating

Number of model vertices: 13,512


In [8]:
# Get vertices of the full Crunchbase graph to sample from
cb_smol_ver = cb.get_vertices()

# Append investors + companies together
cb_smol_ver_NEW = cb_smol_ver[cb_smol_ver['__node_type']=='investor']
cb_smol_ver_NEW = cb_smol_ver_NEW.append(cb_smol_ver[cb_smol_ver['__node_type']=='company'])

# Grab actual P1 companies, using output of find_p1_affiliations function
pos_labels = pd.DataFrame(p1_aff)['src'].to_list()

# Sample as many vertices as there are P1 companies (P1 IDs are not excluded from the draw)
neg_labels = pd.DataFrame(cb_smol_ver_NEW).sample(len(pos_labels), replace=False).__id.to_list()

# List of IDs to put into subgraph calculation
model_labels = pos_labels + neg_labels

# Don't forget Pledge 1%!
#model_labels.append('fd9e2d10-a882-c6f4-737e-fd388d4ffd7c')

print('Number of model vertices: {:,}'.format(len(model_labels)))

# NOTE: Some vertices in the 4deg subgraph have "None" as their node_type; this still needs investigating

Number of model vertices: 13,512
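
As written, the sampling cells above draw from all investor and company vertices, so a P1 organization could in principle also land in `neg_labels`. Below is a minimal sketch of one way to draw a balanced negative sample that first removes the positives. The helper `sample_negatives`, the fixed seed, and the toy IDs are assumptions for illustration, not part of the notebook.

```python
import random

def sample_negatives(candidate_ids, positive_ids, seed=0):
    """Draw as many non-positive IDs as there are positives, without replacement."""
    positives = set(positive_ids)
    # Exclude positives so that no ID ends up with both labels
    pool = sorted(set(candidate_ids) - positives)
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    return rng.sample(pool, len(positives))

# Hypothetical toy IDs standing in for Crunchbase uuids
toy_candidates = ['org{}'.format(n) for n in range(20)]
toy_pos = ['org1', 'org5', 'org7']
toy_neg = sample_negatives(toy_candidates, toy_pos)
toy_model_labels = toy_pos + toy_neg
print('Number of model vertices: {:,}'.format(len(toy_model_labels)))
```

Sorting the pool before sampling keeps the result independent of set ordering, so the same seed gives the same negatives on every run.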


# 8. Load in updated dataframes with sample uuids

###  Create subgraph within subgraph (No?)

In [None]:
# # Define radius for calculating degrees of separation away from model vertices
# rad = 2

# # Create subgraph
# model = cb_d4_p1.get_neighborhood(ids=model_labels, radius=rad, full_subgraph=True)

# # Save dictionaries which store info about graph
# model_summary = model.summary() # Full graph

# print('Radius of the neighborhood: {} degrees of separation from model uuids'.format(rad))
# print('Reduction in nodes: {:.2f}%'.format((1-(model_summary['num_vertices']/cb_d4_p1_summary['num_vertices']))*100))
# print('Reduction in edges: {:.2f}%'.format((1-(model_summary['num_edges']/cb_d4_p1_summary['num_edges']))*100))
# print('\nNode change: {:,} --> {:,}'.format(cb_d4_p1_summary['num_vertices'], model_summary['num_vertices']))
# print('Edge change: {:,} --> {:,}'.format(cb_d4_p1_summary['num_edges'], model_summary['num_edges']))

### Re-do earlier steps to construct new graph network

In [13]:
cb_frames,model_frames = network_by_date(date, df, jobs, invest, invest_prtnr, model_uuids=model_labels)


AS OF SEPTEMBER 08, 2020:

CaLcUlAtInG... FORMER NEW JOB FILTER
CaLcUlAtInG... PARTNER INVESTMENT JOB FILTER
CaLcUlAtInG... OTHER FIRM PARTNER JOBS & INVESTMENTS FILTER
CaLcUlAtInG... CURRENT OLD JOB FILTER
CaLcUlAtInG... EXTRA ORGANIZATION NODES

Crunchbase Neighborhood
NODES | OUTPUT FRAME 0/CB_companies (825393, 16)
NODES | OUTPUT FRAME 1/CB_investors (31499, 16)
NODES&EDGES | OUTPUT FRAME 2/CB_investments (453058, 17)
NODES&EDGES | OUTPUT FRAME 3/CB_investment_partners (89926, 18)
NODES&EDGES | OUTPUT FRAME 4/CB_jobs (395270, 12)
NODES&EDGES | OUTPUT FRAME 5/CB_jobs_former (182483, 12)
NODES&EDGES | OUTPUT FRAME 6/CB_jobs_former_new (483629, 12)
NODES&EDGES | OUTPUT FRAME 7/CB_jobs_partner (11771, 5)
NODES&EDGES | OUTPUT FRAME 8/CB_jobs_other_partners (434370, 12)
NODES&EDGES | OUTPUT FRAME 9/CB_invest_other_partners (161382, 18)
NODES&EDGES | OUTPUT FRAME 10/CB_jobs_current_old (289847, 12)
NODES | OUTPUT FRAME 11/CB_extra_org_nodes (225184, 17)

Model Neighborhood
NODES | OUTPUT

# 9. Save filtered dataframes as separate CSVs, and then load in as SFrames

### Save filtered dataframes as separate CSVs & load in nodes, edges as SFrames

In [14]:
for idx, frame in enumerate(model_frames):
    path = 'files/output/graph_model/model/{}_df.csv'.format(idx)
    print('SAVED TO CSV', path)
    frame.to_csv(path, index=False)

model_sframes = []
for idx in range(12):
    path = 'files/output/graph_model/model/{}_df.csv'.format(idx)
    model_sframes.append(SFrame(data=path))


('SAVED TO CSV', 'files/output/graph_model/model/0_df.csv')
('SAVED TO CSV', 'files/output/graph_model/model/1_df.csv')
('SAVED TO CSV', 'files/output/graph_model/model/2_df.csv')
('SAVED TO CSV', 'files/output/graph_model/model/3_df.csv')
('SAVED TO CSV', 'files/output/graph_model/model/4_df.csv')
('SAVED TO CSV', 'files/output/graph_model/model/5_df.csv')
('SAVED TO CSV', 'files/output/graph_model/model/6_df.csv')
('SAVED TO CSV', 'files/output/graph_model/model/7_df.csv')
('SAVED TO CSV', 'files/output/graph_model/model/8_df.csv')
('SAVED TO CSV', 'files/output/graph_model/model/9_df.csv')
('SAVED TO CSV', 'files/output/graph_model/model/10_df.csv')
('SAVED TO CSV', 'files/output/graph_model/model/11_df.csv')


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,float,str,str,str,str,str,float,str,str,str,str,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,float,str,str,str,str,str,float,str,str,str,str,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,float,float,float,str,float,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,float,float,float,str,float,str,str,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,float,float,float,str,float,str,str,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,float,str,str,str,str,str,float,str,str,str,str,int,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


# 10. Load SFrames into graph

In [36]:
# Don't include Pledge 1% 
model_g = load_edges(model_sframes, load_vertices(model_sframes,SGraph()))

graph_edge_fields = model_g.get_edge_fields()
model_g.edges['combined'] = model_g.edges['__id']+','+model_g.edges['status']+','+model_g.edges['__src_id']+','+model_g.edges['__dst_id']

# Before comparison
before = model_g.summary()
before_pri = model_g.get_edges(fields={'status':'primary'}).shape[0]
before_sec = model_g.get_edges(fields={'status':'secondary'}).shape[0]
before_ter = model_g.get_edges(fields={'status':'tertiary'}).shape[0]

# Remove duplicate edges: keep one edge per unique combination of the original edge fields (source, destination, 'status', etc.)
model_g = SGraph(model_g.vertices, model_g.edges.groupby(graph_edge_fields, {'combined': aggregate.SELECT_ONE('combined')}))

# After comparison
after = model_g.summary()
after_pri = model_g.get_edges(fields={'status':'primary'}).shape[0]
after_sec = model_g.get_edges(fields={'status':'secondary'}).shape[0]
after_ter = model_g.get_edges(fields={'status':'tertiary'}).shape[0]

# Output
print('Remove duplicates from model graph')
print('\nNode change: {:,} --> {:,}'.format(before['num_vertices'], after['num_vertices']))
print('Edge change: {:,} --> {:,}'.format(before['num_edges'], after['num_edges']))
print('\nPRIMARY Edge change: {:,} --> {:,}'.format(before_pri,after_pri))
print('SECONDARY Edge change: {:,} --> {:,}'.format(before_sec,after_sec))
print('TERTIARY Edge change: {:,} --> {:,}'.format(before_ter,after_ter))

Remove duplicates from model graph

Node change: 77,199 --> 77,199
Edge change: 152,349 --> 143,177

PRIMARY Edge change: 33,717 --> 33,707
SECONDARY Edge change: 50,862 --> 43,135
TERTIARY Edge change: 67,770 --> 66,335
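
The deduplication above groups edges on their fields and keeps one row per group. The same idea in plain Python, as a rough sketch: `dedupe_edges` and the toy edge dictionaries are hypothetical, and the real edges carry more fields than the three shown here.

```python
def dedupe_edges(edges, key_fields):
    """Keep the first edge seen for each unique combination of key_fields."""
    seen = set()
    unique = []
    for edge in edges:
        key = tuple(edge[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(edge)
    return unique

toy = [
    {'__src_id': 'a', '__dst_id': 'b', 'status': 'primary'},
    {'__src_id': 'a', '__dst_id': 'b', 'status': 'primary'},    # exact duplicate
    {'__src_id': 'a', '__dst_id': 'b', 'status': 'secondary'},  # same endpoints, different status
]
deduped = dedupe_edges(toy, ['__src_id', '__dst_id', 'status'])
print('Edge change: {:,} --> {:,}'.format(len(toy), len(deduped)))
```

Because `status` is part of the key, two edges between the same pair of vertices survive when their status differs.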


# 11. Graph feature calculations, save to CSV

### Pagerank

`pagerank.create()` computes the PageRank of each vertex in the graph and returns a `PagerankModel`, which holds the per-vertex PageRank values as well as the total PageRank. A vertex's PageRank value indicates its centrality in the graph.

<a href='https://apple.github.io/turicreate/docs/api/generated/turicreate.pagerank.create.html#turicreate.pagerank.create'>turicreate.pagerank.create</a> | <a href='https://apple.github.io/turicreate/docs/api/generated/turicreate.SGraph.get_vertices.html#turicreate.SGraph.get_vertices'>turicreate.SGraph.get_vertices</a>

Follow steps here? <a href='http://snap.stanford.edu/mlg2013/submissions/mlg2013_submission_7.pdf'>Article</a>
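
For intuition about what the PageRank value captures, here is a small textbook power-iteration sketch in plain Python. It is not Turi Create's implementation and its scaling may differ from the values `pagerank.create()` returns; the function name, the damping factor of 0.85 (a reset probability of 0.15), and the toy graph are assumptions.

```python
def pagerank_sketch(edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration on directed (src, dst) pairs."""
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by vertices without out-links is spread evenly over all vertices
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Every vertex passes its rank on in equal shares along its out-links
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank

# Hypothetical toy graph: three vertices point at 'hub', which points back at 'a'
toy_edges = [('a', 'hub'), ('b', 'hub'), ('c', 'hub'), ('hub', 'a')]
ranks = pagerank_sketch(toy_edges)
top = max(ranks, key=ranks.get)
print(top)
```

A vertex scores highly when many vertices, or a few highly ranked ones, point to it, which is why large, heavily connected organizations tend to dominate a ranking like this.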

In [32]:
pr = pagerank.create(model_g, verbose=False)
pr_out = pr['pagerank']
pr_out=pr_out.sort('pagerank', ascending=False)
for idx, uuid in enumerate(pr_out['__id']):
    if idx+1 < 51:
        print('{}. {}'.format(idx+1, model_g.get_vertices(ids=uuid)['name'][0]))
    else:
        break

pagerank_out = pd.DataFrame(pr_out)
pagerank_out = pagerank_out[pagerank_out['__id'].isin(model_labels)].reset_index(drop=True)

path = 'files/output/graph_model/model/pagerank_df_deg3.csv'
pagerank_out.to_csv(path, index=False)

1. Google
2. Microsoft
3. IBM
4. Yahoo
5. Requisite Technology
6. Puppet
7. O3b Networks
8. SAP
9. VMware
10. GeoCities
11. Deloitte
12. SendFriend
13. CVC Capital Partners
14. Apttus
15. Salesforce
16. Facebook
17. KPMG
18. Cisco
19. Warburg Pincus
20. E-contenta
21. Kohlberg Kravis Roberts
22. R3
23. Amazon
24. Intel
25. Flexport
26. Accenture
27. Investcorp
28. The Carlyle Group
29. Box
30. HarbourVest Partners
31. The Boston Consulting Group
32. Techstars
33. Oracle
34. Dropbox
35. KaiOS Technologies
36. Xevo
37. General Atlantic
38. SolarCity
39. DocuSign
40. Unacademy
41. Proteus Digital Health
42. Adams Street Partners
43. Samsung Electronics
44. Twilio
45. Mirantis
46. SalesLoft
47. AppNexus
48. Demandbase
49. Hello Heart
50. XANT.ai
<class 'turicreate.data_structures.sframe.SFrame'>
('\nSAVED TO CSV', 'files/output/graph_model/model/pagerank_df_deg3.csv')


<H3> Degrees </H3>

This model computes the inbound, outbound, and total degree for each vertex.
https://apple.github.io/turicreate/docs/api/turicreate.toolkits.graph_analytics.html#shortest-path
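
As a quick illustration of the three degree measures, the sketch below counts them by hand for a toy directed edge list. The helper `degree_counts` and the toy data are hypothetical and only mirror what the toolkit reports per vertex.

```python
from collections import Counter

def degree_counts(edges):
    """Return {vertex: (in_degree, out_degree, total_degree)} for directed (src, dst) pairs."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    vertices = set(out_deg) | set(in_deg)
    # A Counter returns 0 for missing keys, so vertices with no in- or out-edges are handled
    return {v: (in_deg[v], out_deg[v], in_deg[v] + out_deg[v]) for v in vertices}

toy_edges = [('a', 'b'), ('a', 'c'), ('b', 'c')]
toy_degrees = degree_counts(toy_edges)
for vertex in sorted(toy_degrees):
    print(vertex, toy_degrees[vertex])
```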

In [43]:
from turicreate import degree_counting
deg = degree_counting.create(model_g)
deg_graph = deg['graph'] # a new SGraph with degree data attached to each vertex
in_degree = pd.DataFrame(deg_graph.vertices[['__id', 'in_degree']])
out_degree = pd.DataFrame(deg_graph.vertices[['__id', 'out_degree']])
total_degree = pd.DataFrame(deg_graph.vertices[['__id','total_degree']])

path = 'files/output/graph_model/model/in_degree.csv'
print ('SAVED TO CSV', path)
in_degree.to_csv(path, index=False)

path = 'files/output/graph_model/model/out_degree.csv'
print ('SAVED TO CSV', path)
out_degree.to_csv(path, index=False)

path = 'files/output/graph_model/model/total_degree.csv'
print ('SAVED TO CSV', path)
total_degree.to_csv(path, index=False)

('SAVED TO CSV', 'files/output/graph_model/model/in_degree.csv')
('SAVED TO CSV', 'files/output/graph_model/model/out_degree.csv')
('SAVED TO CSV', 'files/output/graph_model/model/total_degree.csv')


<h3>Triangle Counting </h3>

Computes the number of triangles each vertex belongs to. 
https://apple.github.io/turicreate/docs/api/generated/turicreate.triangle_counting.create.html#turicreate.triangle_counting.create
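
To show what a per-vertex triangle count means, here is a minimal plain-Python sketch. It treats edges as undirected and ignores self-loops and repeated edges; whether Turi Create handles direction and duplicates in exactly the same way is an assumption, and `triangle_counts` is a hypothetical helper.

```python
from itertools import combinations

def triangle_counts(edges):
    """Number of triangles each vertex belongs to, treating edges as undirected."""
    adj = {}
    for src, dst in edges:
        if src != dst:  # ignore self-loops
            adj.setdefault(src, set()).add(dst)
            adj.setdefault(dst, set()).add(src)

    counts = {}
    for v, nbrs in adj.items():
        # A triangle through v is a pair of v's neighbors that are connected to each other
        counts[v] = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
    return counts

# Hypothetical toy graph: triangle a-b-c with an extra vertex d hanging off c
toy_edges = [('a', 'b'), ('b', 'c'), ('c', 'a'), ('c', 'd')]
toy_triangles = triangle_counts(toy_edges)
print(toy_triangles)
```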

In [51]:
from turicreate import triangle_counting
tc = triangle_counting.create(model_g)
triangle_count = pd.DataFrame(tc['triangle_count'])

path = 'files/output/graph_model/model/triangle_count.csv'
print ('SAVED TO CSV', path)
triangle_count.to_csv(path, index=False)

('SAVED TO CSV', 'files/output/graph_model/model/triangle_count.csv')


<h3>K-Core Decomposition</h3>

This method recursively removes vertices with degree less than k from the graph. The value of k at which a vertex is removed is its core ID.
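
The peeling process can be sketched in a few lines of plain Python. This version uses the standard core-number definition on an undirected view of the graph; Turi Create's exact numbering convention for `core_id` is not verified here, and `core_ids` is a hypothetical helper.

```python
def core_ids(edges):
    """Core number per vertex, treating (src, dst) pairs as undirected."""
    adj = {}
    for src, dst in edges:
        if src != dst:  # ignore self-loops
            adj.setdefault(src, set()).add(dst)
            adj.setdefault(dst, set()).add(src)

    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    core = {}
    k = 0
    while degree:
        # Vertices whose remaining degree is at most k cannot be in the (k+1)-core
        peel = [v for v, d in degree.items() if d <= k]
        if not peel:
            k += 1
            continue
        for v in peel:
            core[v] = k
            del degree[v]
        # Removing a vertex lowers the remaining degree of its surviving neighbors
        for v in peel:
            for w in adj[v]:
                if w in degree:
                    degree[w] -= 1
    return core

# Hypothetical toy graph: triangle a-b-c with a pendant vertex d attached to c
toy_edges = [('a', 'b'), ('b', 'c'), ('c', 'a'), ('c', 'd')]
toy_cores = core_ids(toy_edges)
print(toy_cores)
```

Vertices in densely interconnected groups survive more rounds of peeling and so receive a higher core number than vertices on the periphery.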

In [54]:
from turicreate import kcore
kc = kcore.create(model_g)
# Use a distinct name so the DataFrame does not shadow the imported kcore module
kcore_df = pd.DataFrame(kc['core_id'])

path = 'files/output/graph_model/model/kcore.csv'
print ('SAVED TO CSV', path)
kcore_df.to_csv(path, index=False)

('SAVED TO CSV', 'files/output/graph_model/model/kcore.csv')


<h3>Distance from Pledge 1%</h3>

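This feature measures each vertex's shortest-path distance to the nearest P1-tagged company, i.e. the minimum distance over all P1 source vertices, rather than the distance to the Pledge 1% organization itself.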
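
The cells below run one shortest-path computation per P1 company and keep the per-vertex minimum. For unweighted hop counts, a single breadth-first search seeded with every source at distance 0 gives the same minimum in one pass. The sketch below shows that idea in plain Python, assuming unit edge weights and that distances follow edge direction; `distance_to_nearest_source` and the toy graph are hypothetical.

```python
from collections import deque

def distance_to_nearest_source(edges, sources):
    """Hop count from the nearest source to each reachable vertex, following edge direction."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)

    # Start the search from every source at distance 0
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist  # vertices unreachable from every source are absent

# Hypothetical toy graph with two sources, 'p1' and 'p2'
toy_edges = [('p1', 'x'), ('x', 'y'), ('p2', 'y'), ('y', 'z')]
toy_dist = distance_to_nearest_source(toy_edges, ['p1', 'p2'])
print(toy_dist)
```

Here `y` is two hops from `p1` but one hop from `p2`, so its distance to the nearest source is 1.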

In [15]:
from turicreate import load_sgraph
from turicreate import shortest_path

In [122]:
#Taking only the values which have a P1 tag
p1_tag = model_g.vertices[model_g.vertices['p1_tag']==1]

In [124]:
from turicreate import SArray

# Running minimum of shortest-path distances across all P1 source vertices
distances = None

for n, i in enumerate(p1_tag['__id']):
    sp = shortest_path.create(model_g, source_vid=i, verbose=False)
    a = sp['distance']

    if distances is None:
        # First source: its distances initialize the running minimum
        distances = a
    else:
        # Keep the smaller distance for each vertex (assumes both SFrames share the same row order)
        distances['distance'] = np.where(a['distance'] < distances['distance'], a['distance'], distances['distance'])

    # Progress report every 500 sources
    if n % 500 == 0:
        print(str(n) + " P1 companies have been checked.")

0 P1 companies have been checked.
500 P1 companies have been checked.
1000 P1 companies have been checked.
1500 P1 companies have been checked.
2000 P1 companies have been checked.
2500 P1 companies have been checked.
3000 P1 companies have been checked.
3500 P1 companies have been checked.
4000 P1 companies have been checked.
4500 P1 companies have been checked.
5000 P1 companies have been checked.
5500 P1 companies have been checked.
6000 P1 companies have been checked.
6500 P1 companies have been checked.


In [125]:
distances2 = pd.DataFrame(distances)
path = 'files/output/graph_model/model/shortest_distance_to_p1_company.csv'
print ('SAVED TO CSV', path)
distances2.to_csv(path, index=False)

('SAVED TO CSV', 'files/output/graph_model/model/shortest_distance_to_p1_company.csv')


<h2> Running the same analysis but for the entire CB Network </h2>

In [143]:
p1_tag2 = model_g.vertices[model_g.vertices['p1_tag']==1]

In [141]:
from turicreate import SArray

# Running minimum of shortest-path distances across all P1 source vertices, on the full CB graph
distances2 = None

for n, i in enumerate(p1_tag2['__id']):
    sp = shortest_path.create(cb, source_vid=i, verbose=False)
    a = sp['distance']

    if distances2 is None:
        # First source: its distances initialize the running minimum
        distances2 = a
    else:
        # Keep the smaller distance for each vertex (assumes both SFrames share the same row order)
        distances2['distance'] = np.where(a['distance'] < distances2['distance'], a['distance'], distances2['distance'])

    # Progress report every 500 sources
    if n % 500 == 0:
        print(str(n) + " P1 companies have been checked.")

## Converting to NetworkX

In [24]:
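import networkx as nx

# Attributes are selected by key rather than by position, since dict key order is not guaranteed on older Python versions
edge_fields_list = cb.get_edge_fields()
edges = [(row['__src_id'], row['__dst_id'], {k: v for k, v in row.items() if k not in ('__src_id', '__dst_id')})
         for row in cb.edges[edge_fields_list]]
vertices_fields_list = cb.get_vertex_fields()
nodes = [(row['__id'], {k: v for k, v in row.items() if k != '__id'}) for row in cb.vertices[vertices_fields_list]]
# nx.Graph is undirected and keeps one edge per vertex pair
g = nx.Graph()
g.add_nodes_from(nodes)
g.add_edges_from(edges)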

<h2>Recommender Model</h2>

In [131]:
from turicreate import recommender
pruned_frame = cb_sframes[1].dropna()
m = recommender.item_content_recommender.create(pruned_frame, "uuid")

Applying transform:
Class             : AutoVectorizer

Model Fields
------------
Features          : ['name', 'type', 'rank', 'roles', 'country_code', 'region', 'status', 'category_groups_list', 'total_funding_usd', 'founded_on', 'closed_on', 'employee_count', 'primary_role', 'p1_tag', 'p1_date']
Excluded Features : ['uuid']

Column                Type   Interpretation  Transforms                         Output Type
--------------------  -----  --------------  ---------------------------------  -----------
name                  str    categorical     None                               str        
type                  str    categorical     None                               str        
rank                  float  numerical       None                               float      
roles                 str    categorical     None                               str        
country_code          str    categorical     None                               str        
region                str  

<h2> Turicreate Model </h2>

In [132]:
from turicreate import classifier
pruned_frame = cb_sframes[1].dropna()

model = classifier.create(pruned_frame, target='p1_tag',features=['category_groups_list', 'founded_on', 'type'])


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.


PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: LogisticClassifier              : 0.983606557377
PROGRESS: SVMClassifier                   : 0.983606557377
PROGRESS: ---------------------------------------------
PROGRESS: Selecting LogisticClassifier based on validation set performance.


In [135]:
# NOTE: pruned_frame is the same data the model was trained on, so these metrics are optimistic
results = model.evaluate(pruned_frame)
results

{'accuracy': 0.995049504950495,
 'auc': 0.984692179700499,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 3
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        0        |   6   |
 |      1       |        1        |   4   |
 |      0       |        0        |  1202 |
 +--------------+-----------------+-------+
 [3 rows x 3 columns],
 'f1_score': 0.5714285714285715,
 'log_loss': 0.017233967475003508,
 'precision': 1.0,
 'recall': 0.4,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 1001
 
 Data:
 +-----------+-----------------+-----+----+------+
 | threshold |       fpr       | tpr | p  |  n   |
 +-----------+-----------------+-----+----+------+
 |    0.0    |       1.0       | 1.0 | 10 | 1202 |
 |   0.001   |  0.158901830283 | 1.0 | 10 | 1202 |
 |   0.002   |  0.114808652246 | 1.0 | 10 | 1

In [146]:
len(pruned_frame[pruned_frame['p1_tag'] == 1])

10
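
The headline metrics can be recomputed by hand from the confusion matrix counts printed above, which makes the gap between accuracy and recall easier to see. This is a small worked check, assuming the missing (target 0, predicted 1) row means zero false positives.

```python
# Counts copied from the confusion matrix in the evaluation output
tp = 4     # target 1, predicted 1
fn = 6     # target 1, predicted 0
tn = 1202  # target 0, predicted 0
fp = 0     # assumption: no (target 0, predicted 1) row is listed, so taken as zero

accuracy = (tp + tn) / float(tp + tn + fp + fn)
precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('accuracy={:.4f} precision={:.2f} recall={:.2f} f1={:.4f}'.format(accuracy, precision, recall, f1))
```

With only 10 P1 rows against 1,202 non-P1 rows, accuracy is dominated by the majority class: the model is right about 99.5% of rows while recovering only 4 of the 10 P1 organizations.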