# Investment crowdfunding has little faith in sustainability! At least for the moment.

## Journal:

*Venture Capital*, 25:1, 91-115, 2023, [DOI: 10.1080/13691066.2022.2129510](https://doi.org/10.1080/13691066.2022.2129510)

## Authors:

Carmen Mendoza

Isabel María Parra Oller

Álvaro Rezola (@alvarorezola)

Nuria Suárez

In [1]:
# Import libraries
import pandas as pd
import locale
import numpy as np
from scipy import stats
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
from sklearn.linear_model import LogisticRegression
import math

In [2]:
# Function definition
def summary_with_stars(model):
    # get the summary table as a DataFrame
    summary_df = model.summary2().tables[1]
    
    # create a new column for stars
    summary_df['stars'] = '-'

    # add stars based on p-values
    summary_df.loc[summary_df['P>|z|'] < 0.001, 'stars'] = '***'
    summary_df.loc[(summary_df['P>|z|'] >= 0.001) & (summary_df['P>|z|'] < 0.01), 'stars'] = '**'
    summary_df.loc[(summary_df['P>|z|'] >= 0.01) & (summary_df['P>|z|'] < 0.05), 'stars'] = '*'

    # return the modified summary table
    return summary_df

def confidence_interval(X1, X2):
    #calculate degrees of freedom
    df = len(X1) + len(X2) - 2

    #calculate standard error
    s1 = np.var(X1, ddof=1)
    s2 = np.var(X2, ddof=1)
    n1 = len(X1)
    n2 = len(X2)
    standard_error = np.sqrt((s1/n1) + (s2/n2))

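    #calculate margin of error
    # note: t.ppf(0.95, df) is the one-sided 95% critical value, so the two-sided interval below has 90% coverage
    margin_of_error = stats.t.ppf(0.95, df) * standard_error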

    #calculate confidence interval
    lower_limit = (np.mean(X1) - np.mean(X2)) - margin_of_error
    upper_limit = (np.mean(X1) - np.mean(X2)) + margin_of_error

    return lower_limit, upper_limit

# compute the logit of the propensity score used for matching
def logit(p):
    logit_value = math.log(p/(1 - p))
    return logit_value

# standard error of the ATE
def ate_se(df):
    # number of observations in each group
    n_t = df[df.sustainable == 1].shape[0]
    n_c = df[df.sustainable == 0].shape[0]
    
    # variance of the outcome in each group
    s_t = df[df.sustainable == 1].exito.var()
    s_c = df[df.sustainable == 0].exito.var()
    
    # standard error of the ATE: sqrt(s_t/n_t + s_c/n_c)
    se = np.sqrt((s_t / n_t) + (s_c / n_c))
    
    return se
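
The `logit` helper above fails when a propensity score is exactly 0 or 1 (`math.log(0)` or a division by zero). A minimal sketch of a guarded variant follows; the name `clipped_logit` and the tolerance `eps` are assumptions for illustration and are not part of the original notebook.

```python
import math

def clipped_logit(p, eps=1e-6):
    """Log-odds of p, with p clipped to [eps, 1 - eps] (eps is an assumed tolerance)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

print(clipped_logit(0.5))                  # 0.0
print(math.isfinite(clipped_logit(1.0)))   # True: clipping avoids the division by zero
```

Clipping changes the value only for scores at the boundary, so interior scores keep the same log-odds as the original helper.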

In [3]:
# Import data
stata_dataset = "/home/alvaro/Desktop/MendozaEtAl2023-VC/Data/CROWD_SUSTAINABILITY_FINAL.dta"
df = pd.read_stata(stata_dataset)
df.drop(index=range(3679,len(df)), inplace=True)
df = df[(df["form_c"] == 1) & (df["deadline"] <= "2019-10-01")]
df[["deadline", "datestart", "dateincorporation"]] = df[["deadline", "datestart", "dateincorporation"]].apply(pd.to_datetime)
df["totalassetsmostrecent1"] += 1e-6
df["logtotalassetsmostrecent1"] = np.log(df["totalassetsmostrecent1"])
df_sust = df[df["sustainable"] == 1]
df_non_sust = df[df["sustainable"] == 0]

One or more strings in the dta file could not be decoded using utf-8, and
so the fallback encoding of latin-1 is being used.  This can happen when a file
has been incorrectly encoded by Stata or some other software. You should verify
the string values returned are correct.
  df = pd.read_stata(stata_dataset)
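
The import cell adds `1e-6` to total assets before taking logs so that firms reporting zero assets do not produce `log(0)`. The sketch below only illustrates what that offset does to the transformed values; the `log1p` line is shown for comparison and is an assumption, not something the notebook uses.

```python
import math

offset = 1e-6  # the constant added to total assets in the import cell

# A firm reporting zero assets maps to log(1e-6), far below any positive value.
print(math.log(0 + offset))      # about -13.82
print(math.log(22610 + offset))  # about 10.03 (22610 is the sample median of total assets)

# math.log1p(x) = log(1 + x) keeps zero at zero; shown only for comparison.
print(math.log1p(0))             # 0.0
```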


We found 1,853 investment crowdfunding campaigns issued under the Form C exemption from May 1st, 2016 until September 10th, 2019.

### Table 1 (a): Descriptive statistics
This table shows the descriptive statistics – mean, standard deviation, 25th percentile, median, 75th percentile – of the main variables of interest.

In [4]:
df[["exito",
    "quick75relative",
    "sustainable",
    "totalassetsmostrecent1",
    "employees",
    "loglife",                     # Age as in np.log(date_diff(start, incorporation))              
    "equity",                     
    "asked",
    "lagbranches",                 # Bank branches
    "lagvcfundraising",            # VC fundraising (not sure if then I need to do the np.log())
    "loglagnum_oper_por_platf_y",  # Number of offerings per platform
    ]].describe()

| | exito | quick75relative | sustainable | totalassetsmostrecent1 | employees | loglife | equity | asked | lagbranches | lagvcfundraising | loglagnum_oper_por_platf_y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 792.0 | 792.0 | 792.0 | 792.0 | 792.0 | 792.0 | 792.0 | 792.0 | 789.0 | 714.0 | 791.0 |
| mean | 0.343434 | 0.026515 | 0.156566 | 519350.8 | 5.295455 | 6.023776 | 0.30303 | 68751.25 | 4052.95057 | 8677.518555 | 59.369152 |
| std | 0.475156 | 0.160763 | 0.36362 | 7195351.0 | 8.932594 | 1.511196 | 0.459858 | 109203.0 | 2333.467647 | 11150.884766 | 56.845554 |
| min | 0.0 | 0.0 | 0.0 | 1e-06 | 0.0 | 0.69 | 0.0 | 1000.0 | 117.0 | 0.81 | 2.0 |
| 25% | 0.0 | 0.0 | 0.0 | 1e-06 | 1.0 | 5.095 | 0.0 | 10000.0 | 1633.0 | 90.800003 | 8.0 |
| 50% | 0.0 | 0.0 | 0.0 | 22610.0 | 3.0 | 6.245 | 0.0 | 25000.0 | 4270.0 | 777.429993 | 30.0 |
| 75% | 1.0 | 0.0 | 0.0 | 186310.8 | 6.0 | 7.13 | 1.0 | 87192.75 | 6728.0 | 20763.099609 | 110.0 |
| max | 1.0 | 1.0 | 1.0 | 201464600.0 | 111.0 | 9.56 | 1.0 | 1070000.0 | 6868.0 | 33174.179688 | 262.0 |


### Table 1 (b): Mean differences across subsamples
This table shows the mean values of the main variables across the two subsamples of offerings and the T-statistic for the mean differences. The T-statistics reported are obtained for the differences between the means across groups of offerings. All the variables are defined in Annex 1. ***, **, and * indicate statistical significance at 1, 5, and 10 percent, respectively.

In [5]:
data = {
    "success": (df_non_sust["exito"],df_sust["exito"]),
    "quick75relative": (df_non_sust["quick75relative"],df_sust["quick75relative"]),
    "totalassetsmostrecent1": (df_non_sust["totalassetsmostrecent1"],df_sust["totalassetsmostrecent1"]),
    "employees": (df_non_sust["employees"],df_sust["employees"]),
    "loglife": (df_non_sust["loglife"],df_sust["loglife"]),
    "equity": (df_non_sust["equity"],df_sust["equity"]),
    "asked": (df_non_sust["asked"],df_sust["asked"]),
    "loglagnum_oper_por_platf_y": (df_non_sust["loglagnum_oper_por_platf_y"],df_sust["loglagnum_oper_por_platf_y"]),
    "lagbranches": (df_non_sust["lagbranches"],df_sust["lagbranches"]),
    "lagvcfundraising": (df_non_sust["lagvcfundraising"],df_sust["lagvcfundraising"]),
}

results = {}

for key in data:
    group1 = data[key][0]
    group2 = data[key][1]

    # calculate t-statistic    
    t_statistic, p_value = stats.ttest_ind(group1, group2, nan_policy="omit")
    
    # Indicate statistical significance at different levels
    if p_value < 0.01:
        significance = "***"
    elif p_value < 0.05:
        significance = "**"
    elif p_value < 0.10:
        significance = "*"
    else:
        significance = "-"
        
    # calculate mean of each variable
    mean_group1 = np.mean(group1)
    mean_group2 = np.mean(group2)

    results[key]={'non sustainable':mean_group1,
                  'sustainable':mean_group2,
                  't-statistic':t_statistic,
                  'p-value':p_value,
                  'significance':significance}
# Display results
pd.DataFrame(results).T

# print(df_results)

| | non sustainable | sustainable | t-statistic | p-value | significance |
|---|---|---|---|---|---|
| success | 0.338323 | 0.370968 | -0.702378 | 0.482651 | - |
| quick75relative | 0.02994 | 0.008065 | 1.392413 | 0.164189 | - |
| totalassetsmostrecent1 | 545714.3125 | 377327.875 | 0.239184 | 0.811025 | - |
| employees | 5.348802 | 5.008065 | 0.389894 | 0.69672 | - |
| loglife | 5.986901 | 6.222419 | -1.595374 | 0.111029 | - |
| equity | 0.300898 | 0.314516 | -0.302673 | 0.762219 | - |
| asked | 70287.421875 | 60475.71875 | 0.918764 | 0.3585 | - |
| loglagnum_oper_por_platf_y | 59.30135 | 59.733871 | -0.077754 | 0.938043 | - |
| lagbranches | 4151.485714 | 3524.516129 | 2.758299 | 0.005945 | *** |
| lagvcfundraising | 8781.643555 | 8080.272949 | 0.597308 | 0.550492 | - |
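
The t-test loops in this notebook repeat the same mapping from p-values to significance stars. A minimal sketch of a shared helper follows, assuming the thresholds stated in the table notes (1, 5, and 10 percent); the name `significance_stars` is hypothetical and does not appear in the original code.

```python
def significance_stars(p_value):
    """Map a p-value to the star notation used in the table notes (1%, 5%, 10%)."""
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.10:
        return "*"
    return "-"

# p-values taken from the Table 1 (b) output above
print(significance_stars(0.005945))  # ***  (lagbranches)
print(significance_stars(0.111029))  # -    (loglife)
```

Inside each loop, the `if`/`elif` chain could then be replaced by a single call such as `significance = significance_stars(p_value)`.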


### Table 2: Propensity Score Matching 
This table shows the mean values of the main variables across the two subsamples of offerings and the t-statistics obtained for the differences between the means across groups of offerings, before matching and after matching using caliper, nearest 1-to-1 and nn-VBC methods. ***, ** and * indicate statistical significance at 1, 5, and 10%, respectively.

In [6]:
# Two probit models to validate the covariates
formula_treatment = "sustainable ~ logtotalassetsmostrecent1 + logemployees1 + logasked1"
summary_with_stars(smf.probit(formula=formula_treatment, data=df).fit())

Optimization terminated successfully.
         Current function value: 0.432126
         Iterations 5


| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] | stars |
|---|---|---|---|---|---|---|---|
| Intercept | -0.145039 | 0.530024 | -0.273646 | 0.784356 | -1.183866 | 0.893788 | - |
| logtotalassetsmostrecent1 | 0.001421 | 0.004994 | 0.284511 | 0.776019 | -0.008368 | 0.01121 | - |
| logemployees1 | -0.062212 | 0.069686 | -0.892752 | 0.37199 | -0.198795 | 0.07437 | - |
| logasked1 | -0.075253 | 0.04888 | -1.539545 | 0.123671 | -0.171056 | 0.02055 | - |


In [7]:
formula_outcome = "exito ~ logtotalassetsmostrecent1 + logemployees1 + logasked1"
summary_with_stars(smf.probit(formula=formula_outcome, data=df).fit())

Optimization terminated successfully.
         Current function value: 0.626499
         Iterations 5


| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] | stars |
|---|---|---|---|---|---|---|---|
| Intercept | 1.381336 | 0.459077 | 3.008945 | 0.002622 | 0.481563 | 2.28111 | ** |
| logtotalassetsmostrecent1 | 0.007188 | 0.004352 | 1.651797 | 0.098576 | -0.001341 | 0.015717 | - |
| logemployees1 | 0.086588 | 0.058852 | 1.471301 | 0.14121 | -0.028759 | 0.201935 | - |
| logasked1 | -0.187081 | 0.042941 | -4.356674 | 1.3e-05 | -0.271244 | -0.102918 | *** |


### Table 2 (A): Before Matching

In [8]:
df[["exito", "quick75relative", "logtotalassetsmostrecent1", "logemployees1", "logasked1"]].describe()
df_sust = df[df["sustainable"] == 1]
df_non_sust = df[df["sustainable"] == 0]

data = {
    "success": (df_sust["exito"], df_non_sust["exito"]),
    "quick75relative": (df_sust["quick75relative"], df_non_sust["quick75relative"]),
    "logtotalassetsmostrecent1": (df_sust["logtotalassetsmostrecent1"], df_non_sust["logtotalassetsmostrecent1"]),
    "logemployees1": (df_sust["logemployees1"], df_non_sust["logemployees1"]),
    "logasked1": (df_sust["logasked1"], df_non_sust["logasked1"])
}

results = {}

for key in data:
    group1 = data[key][0]
    group2 = data[key][1]

    # calculate t-statistic    
    t_statistic, p_value = stats.ttest_ind(group1, group2, nan_policy="omit")
    
    # Indicate statistical significance at different levels
    if p_value < 0.01:
        significance = "***"
    elif p_value < 0.05:
        significance = "**"
    elif p_value < 0.10:
        significance = "*"
    else:
        significance = "-"
        
    # calculate mean of each variable
    mean_group1 = np.mean(group1)
    mean_group2 = np.mean(group2)

    results[key]={'sustainable':mean_group1,
                  'non_sustainable':mean_group2,
                  "difference" :(mean_group1 - mean_group2),
                  't-statistic':t_statistic,
                  'p-value':p_value,
                  'significance':significance}
# Display results
pd.DataFrame(results).T

| | sustainable | non_sustainable | difference | t-statistic | p-value | significance |
|---|---|---|---|---|---|---|
| success | 0.370968 | 0.338323 | 0.032644 | 0.702378 | 0.482651 | - |
| quick75relative | 0.008065 | 0.02994 | -0.021876 | -1.392413 | 0.164189 | - |
| logtotalassetsmostrecent1 | 3.809409 | 3.914037 | -0.104628 | -0.093975 | 0.925153 | - |
| logemployees1 | 1.390724 | 1.444712 | -0.053989 | -0.667047 | 0.504937 | - |
| logasked1 | 10.288928 | 10.44759 | -0.158662 | -1.422105 | 0.15539 | - |


### Table 2 (B): After Matching

In [9]:
# Logit model to estimate the propensity score (ps)
model = LogisticRegression()
df = df.dropna(axis = 1) # drop all variables that have empty values

# Independent variables
X = df[["logtotalassetsmostrecent1",
        "logemployees1",
        "logasked1"]]

# Dependent variable (treatment group)
y = df["sustainable"]

# Model adjustment & predicted probabilities
model.fit(X, y) 
pred_prob = model.predict_proba(X)
df["ps"] = pred_prob[:, 1]

df["ps_logit"] = df.ps.apply(logit)

# Implementing the caliper match
def caliper_match(df, threshold):
    # sort the data by ps_logit; after reset_index(), orig_index holds each row's position in the sorted frame
    df_sorted = df.sort_values("ps_logit").reset_index()
    df_sorted["orig_index"] = df_sorted.index
    
    # empty lists to store the matched and unmatched indices
    matched_index = []
    unmatched_index = []
    
    # iterate over the rows of the sorted dataframe
    for i in range(len(df_sorted)):
        row = df_sorted.iloc[i]
        if i not in matched_index: # if the row has not been matched yet
            # candidate rows: opposite treatment status and a ps_logit difference no larger than the threshold
            potential_matches = df_sorted[(df_sorted.sustainable != row.sustainable) & (abs(df_sorted.ps_logit - row.ps_logit) <= threshold)]
            
            if len(potential_matches) > 0: # if there is at least one candidate
                # take the first candidate (lowest ps_logit within the caliper) as the match
                closest_match_index = potential_matches.iloc[0].orig_index
                
                # add both indices to the matched list
                matched_index.append(i)
                matched_index.append(closest_match_index)
                
            else:
                # no candidate within the caliper: add the index to the unmatched list
                unmatched_index.append(i)
    return matched_index, unmatched_index

caliper_matched, caliper_unmatched = caliper_match(df[["sustainable", "logtotalassetsmostrecent1", "logemployees1", "logasked1", "ps_logit"]], 0.1) # caliper (threshold) of 0.1 on ps_logit
caliper_df_matched = df.iloc[caliper_matched]

# mean of each variable in the treatment & control group
treatment_means_caliper = caliper_df_matched[caliper_df_matched["sustainable"] == 1]
control_means_caliper = caliper_df_matched[caliper_df_matched["sustainable"] == 0]

data = {
    "success": (treatment_means_caliper["exito"], control_means_caliper["exito"]),
    "quick75relative": (treatment_means_caliper["quick75relative"], control_means_caliper["quick75relative"]),
    "logtotalassetsmostrecent1": (treatment_means_caliper["logtotalassetsmostrecent1"], control_means_caliper["logtotalassetsmostrecent1"]),
    "logemployees1": (treatment_means_caliper["logemployees1"], control_means_caliper["logemployees1"]),
    "logasked1": (treatment_means_caliper["logasked1"], control_means_caliper["logasked1"])
}

results = {}

for key in data:
    group1 = data[key][0]
    group2 = data[key][1]

    # calculate t-statistic    
    t_statistic, p_value = stats.ttest_ind(group1, group2, nan_policy="omit")
    
    # Indicate statistical significance at different levels
    if p_value < 0.01:
        significance = "***"
    elif p_value < 0.05:
        significance = "**"
    elif p_value < 0.10:
        significance = "*"
    else:
        significance = "-"
        
    # calculate mean of each variable
    mean_group1 = np.mean(group1)
    mean_group2 = np.mean(group2)

    results[key]={'sustainable':mean_group1,
                  'non_sustainable':mean_group2,
                  "difference" :(mean_group1 - mean_group2),
                  't-statistic':t_statistic,
                  'p-value':p_value,
                  'significance':significance}
# Display results
pd.DataFrame(results).T

| | sustainable | non_sustainable | difference | t-statistic | p-value | significance |
|---|---|---|---|---|---|---|
| success | 0.529661 | 0.377976 | 0.151685 | 4.409956 | 1.1e-05 | *** |
| quick75relative | 0.004237 | 0.072173 | -0.067935 | -4.008367 | 6.4e-05 | *** |
| logtotalassetsmostrecent1 | 5.664698 | 4.947288 | 0.717411 | 0.942729 | 0.345964 | - |
| logemployees1 | 1.291606 | 1.433447 | -0.141841 | -2.570124 | 0.010257 | ** |
| logasked1 | 10.297965 | 10.524361 | -0.226396 | -2.803164 | 0.005122 | *** |
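
The caliper step can also be written as a greedy 1-to-1 match without replacement. The sketch below is a hypothetical variant, not the routine that produced the output above: it works on plain tuples instead of a DataFrame, picks the nearest control within the caliper rather than the first candidate, uses each control at most once, and returns the original unit identifiers. The name `greedy_caliper_match` and the toy scores are assumptions for illustration.

```python
def greedy_caliper_match(units, caliper):
    """Greedy 1-to-1 matching without replacement.

    units: list of (unit_id, treated, score) tuples, where score is the logit
    of the propensity score. Returns a list of (treated_id, control_id) pairs.
    """
    treated = sorted((u for u in units if u[1] == 1), key=lambda u: u[2])
    controls = [u for u in units if u[1] == 0]
    used = set()   # control ids that already have a partner
    pairs = []
    for t_id, _, t_score in treated:
        best = None  # (control_id, distance) of the nearest free control so far
        for c_id, _, c_score in controls:
            if c_id in used:
                continue
            dist = abs(c_score - t_score)
            if dist <= caliper and (best is None or dist < best[1]):
                best = (c_id, dist)
        if best is not None:
            used.add(best[0])
            pairs.append((t_id, best[0]))
    return pairs

# Toy data: three treated units (a, b, c) and four controls (d, e, f, g)
units = [
    ("a", 1, -1.60), ("b", 1, -1.20), ("c", 1, 0.50),
    ("d", 0, -1.55), ("e", 0, -1.58), ("f", 0, -1.25), ("g", 0, -0.40),
]
print(greedy_caliper_match(units, caliper=0.1))
# [('a', 'e'), ('b', 'f')]
```

Unit `c` stays unmatched because no control lies within the caliper. Returning identifiers rather than positions makes it straightforward to select the matched rows by label afterwards.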


### Table 3: Average Treatment Effect on the Treated (ATET) 
This table shows the average treatment effect on the treated individuals (ATET) for each method: caliper, nearest 1-to-1 and nn-VBC methods. ***, ** and * indicate statistical significance at 1, 5, and 10%, respectively.

In [11]:
caliper_ate_success = caliper_df_matched.groupby("sustainable")["exito"].mean().diff().iloc[-1]
caliper_ate_quick = caliper_df_matched.groupby("sustainable")["quick75relative"].mean().diff().iloc[-1]
caliper_se = ate_se(caliper_df_matched)

# run a two-sample independent t-test and obtain the p-value; the confidence interval is computed separately below
caliper_success_tstat, caliper_success_pvalue, caliper_success_desc = sms.ttest_ind(
    caliper_df_matched[caliper_df_matched.sustainable == 1].exito.values,
    caliper_df_matched[caliper_df_matched.sustainable == 0].exito.values,
    usevar="unequal",
    alternative="larger",
    value=0
)
lower_limit_success, upper_limit_success = confidence_interval(caliper_df_matched[caliper_df_matched.sustainable == 1].exito.values,
                                                               caliper_df_matched[caliper_df_matched.sustainable == 0].exito.values)

# run a two-sample independent t-test and obtain the p-value; the confidence interval is computed separately below
caliper_quick_tstat, caliper_quick_pvalue, caliper_quick_desc = sms.ttest_ind(
    caliper_df_matched[caliper_df_matched.sustainable == 1].quick75relative.values,
    caliper_df_matched[caliper_df_matched.sustainable == 0].quick75relative.values,
    usevar="unequal",
    alternative="larger",
    value=0
)

lower_limit_quick, upper_limit_quick = confidence_interval(caliper_df_matched[caliper_df_matched.sustainable == 1].quick75relative.values,
                                                           caliper_df_matched[caliper_df_matched.sustainable == 0].quick75relative.values)

data = {"Caliper 1-to-1 model": ["success", "quickrelative75"],
        "ATE": [caliper_ate_success, caliper_ate_quick],
        "Standard error": [caliper_se, caliper_se],
        "p-value": [caliper_success_pvalue, caliper_quick_pvalue],
        "[Confidence": [lower_limit_success, lower_limit_quick],
        "interval (95%)]": [upper_limit_success, upper_limit_quick]}

pd.DataFrame(data)

| | Caliper 1-to-1 model | ATE | Standard error | p-value | [Confidence | interval (95%)] |
|---|---|---|---|---|---|---|
| 0 | success | 0.151685 | 0.035145 | 1.1e-05 | 0.093843 | 0.209527 |
| 1 | quickrelative75 | -0.067935 | 0.035145 | 1.0 | -0.081489 | -0.054382 |
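
An interval can also be rebuilt from a point estimate and its standard error. The sketch below uses a normal approximation from the standard library rather than the t-distribution used in `confidence_interval`; the name `normal_ci` is hypothetical. Because `confidence_interval` calls `stats.t.ppf(0.95, df)`, the limits reported above are close to what this sketch returns with `level=0.90`, not `level=0.95`.

```python
from statistics import NormalDist

def normal_ci(estimate, std_error, level=0.95):
    """Two-sided confidence interval under a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * std_error, estimate + z * std_error

# Point estimate and standard error for success, taken from the output above
low, high = normal_ci(0.151685, 0.035145)
print(round(low, 4), round(high, 4))  # 0.0828 0.2206
```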


### Table 4: Sustainability and success
This table presents IV results examining the effect of the sustainable orientation of investment crowdfunding offerings on the probability of success. The dependent variable in columns (1) and (2) is the dummy that identifies sustainable offerings (Sustainable). The dependent variable in columns (3) and (5) is SUCCESS. QUICK75 is the dependent variable in columns (4) and (6). Variable definitions are reported in Annex 1. Year, industry-year and state-year fixed effects are included but not reported. T-statistics are in parentheses. ***, ** and * indicate statistical significance at 1, 5, and 10%, respectively.

### Table 5: Sustainability and success: the role of firm- and offering-level characteristics
This table presents results examining the effect of firm- and offering-level characteristics on the relationship between the sustainable orientation of investment crowdfunding offerings and the probability of success. The dependent variable is SUCCESS. Variable definitions are reported in Annex 1. Year, industry-year and state-year fixed effects are included but not reported. T-statistics are in parentheses. *** and ** indicate statistical significance at 1 and 5 percent, respectively.

### Table 6: Sustainability and success: the role of the financing environment
This table presents results examining the effect of the characteristics of the financing environment on the relationship between the sustainable orientation of investment crowdfunding offerings and the probability of success. The dependent variable is SUCCESS. Variable definitions are reported in Annex 1. Firm and offering control variables, year, industry-year and state-year fixed effects are included but not reported. T-statistics are in parentheses. ***, ** and * indicate statistical significance at 1, 5, and 10 percent, respectively.

### Table 7: Sustainability and success: robustness tests
This table presents a set of robustness tests for the relationship between the sustainable orientation of investment crowdfunding offerings and the probability of success. In column (1), we report the results for the second-stage regression for the Heckman (1979) method. In column (2), we find that the results do not vary when controlling for the characteristics of the team in terms of gender and size. In columns (3) to (5), we control for the funding history of the company. In column (6), we control for the cost structure defined by the funding portal. The dependent variable is SUCCESS. Variable definitions are reported in Annex 1. Firm and offering control variables, year, industry-year and state-year fixed effects are included but not reported. T-statistics are in parentheses. ***, ** and * indicate statistical significance at 1, 5, and 10 percent, respectively.