Y = β0 + β1·Debt + β2·Risk_Profile + β3·(Debt × Risk_Profile) + ϵ


1. Dependent Variable: Financial Planner’s Product Recommendation (High-risk vs Low-risk)
    • Variable Name: Product Recommendation
    • Description: This is a binary choice variable indicating whether the financial planner recommends a high-risk product (e.g., stocks, ETFs) or a low-risk product (e.g., mutual funds, bonds) to the client.
        ○ Possible Values: 1 for recommending high-risk products, 0 for recommending low-risk products.
2. Independent Variable: Financial Planner’s Personal Debt Level
    • Variable Name: Planner Debt Level
    • Description: This variable describes the financial planner's personal debt level. It can be either a continuous variable (e.g., total debt) or a categorical variable (e.g., high debt vs low debt).
        ○ Possible Values: If it's categorical, 1 could represent high debt, 0 could represent low debt.
3. Interaction Variable: Client’s Risk Tolerance
    • Variable Name: Client Risk Profile
    • Description: This variable describes the client’s risk tolerance, indicating whether the client prefers high-risk or low-risk investment products. It’s typically based on the client’s financial goals, investment experience, etc.
        ○ Possible Values: 1 for clients with high-risk tolerance, 0 for clients with low-risk tolerance.
    
4. Control Variables (To account for other potential influencing factors in your analysis)
    • Financial Planner Characteristics:
        1. Age: Financial planner's age (continuous variable).
        2. Gender: Planner’s gender (1 for male, 0 for female).
        3. Education Level: Planner’s educational background (e.g., bachelor's, master’s, etc. – can be categorical).
        4. Years of Experience: Number of years of work experience (continuous variable).
        5. Compensation Structure: Whether the planner’s compensation is commission-based or fixed salary (categorical variable).
    • Client Characteristics:
        1. Client Age: Client’s age (continuous variable).
        2. Client Income Level: Client’s income level (can be a categorical variable, e.g., high, medium, or low income).
        3. Client Investment Experience: Client’s investment experience (categorical variable, e.g., novice, intermediate, advanced).
        4. Client Investment Objective: Client’s investment goals (categorical variable, e.g., long-term growth, wealth preservation, etc.).
    
5. Potential Moderating or Interaction Terms
    • Financial Planner's Personal Debt × Client's Risk Tolerance Interaction Term
        ○ You can create an interaction variable combining the financial planner’s debt level and the client’s risk tolerance to capture how the debt affects the planner’s recommendation for different types of clients.
        ○ Variable Formula: Planner Debt × Client Risk Profile
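
Because the dependent variable is binary, the specification above can be sketched as a logit with an interaction term. The example below is only an illustration on synthetic data: the column names (`debt`, `risk_profile`, `recommend_high_risk`) and the coefficients used to simulate the outcome are assumptions, not values from the survey.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only; all column names are hypothetical
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "debt": rng.integers(0, 2, n),          # 1 = high planner debt
    "risk_profile": rng.integers(0, 2, n),  # 1 = high client risk tolerance
})

# Simulate a binary recommendation from an assumed latent index
latent = (-0.5 + 0.4 * toy["debt"] + 0.8 * toy["risk_profile"]
          + 0.3 * toy["debt"] * toy["risk_profile"])
toy["recommend_high_risk"] = (rng.random(n) < 1 / (1 + np.exp(-latent))).astype(int)

# "debt * risk_profile" expands to both main effects plus their interaction
logit_fit = smf.logit("recommend_high_risk ~ debt * risk_profile", data=toy).fit(disp=0)
print(logit_fit.params)
```

The coefficient labelled `debt:risk_profile` plays the role of β3. Control variables could be appended to the formula with `+`.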
1. For question 3, our group chose the following topic: how does a financial planner's personal debt level affect their ability to recommend high-risk versus low-risk investment products to clients with different risk profiles? Is this research topic feasible?

2. For the debt measure, is the Debt-to-Income Ratio (DIR) or the Credit Score (CS) the more reasonable choice?
3. For the binary recommendation outcome, we plan to use Logit or Probit models. Are there other models (e.g., multilevel or fixed-effects models) that would better capture the effect of debt level on recommendations for different clients?


1.DIR

In [None]:
import pandas as pd

# Load dataset
file_path = "clean_all.csv"
data = pd.read_csv(file_path)

# Calculate Debt-to-Income Ratio (DIR) using imputed income and debt values
data['DIR'] = data['debt_impute'] / data['income_impute']
# Zero income yields inf (or NaN for 0/0); both are set to 0 here, which makes
# those respondents indistinguishable from debt-free ones
data['DIR'] = data['DIR'].replace([float('inf'), float('nan')], 0)


data[['debt_impute', 'income_impute', 'DIR']].head()


Unnamed: 0,debt_impute,income_impute,DIR
0,580000.0,110000.0,5.272727
1,0.0,200000.0,0.0
2,295000.0,65000.0,4.538462
3,30000.0,150000.0,0.2
4,0.0,150000.0,0.0
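
One possible alternative to zero-filling is to leave the ratio missing wherever income is not positive, so that those rows can be dropped or flagged explicitly later. The sketch below uses a small made-up frame; the `DIR_undefined` flag is a hypothetical helper column, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the last two have zero income
toy = pd.DataFrame({
    "debt_impute": [580000.0, 0.0, 30000.0, 0.0],
    "income_impute": [110000.0, 200000.0, 0.0, 0.0],
})

# Keep income only where it is positive; other rows become NaN
income = toy["income_impute"].where(toy["income_impute"] > 0)
toy["DIR"] = toy["debt_impute"] / income

# Hypothetical flag marking rows where the ratio is undefined
toy["DIR_undefined"] = toy["DIR"].isna()
print(toy)
```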


risk level 

In [58]:
# Define scenario answer columns
scenario_columns = ['scn1a_answer', 'scn2a_answer', 'scn3a_answer', 'scn4a_answer']

# Define adjusted risk score dictionary with updated values
adjusted_risk_score_dict = {
    'high_risk': ['etf', 'high return mutual fund', 'stocks', 'segregated fund'],
    'medium_risk': ['rrsp', 'tfsa', 'mutual fund', 'ul policy', 'mortgage investment'],
    'low_risk': ['annuity', 'long term care insurance', 'gic', 'repay debt', 'bond', 'index linked gic']
}


risk score

In [59]:
# Function to score risk based on adjusted values
def adjusted_score_risk(text):
    if pd.isna(text):
        return 0  # No answer given
    lowered = text.lower()  # Lowercase once instead of once per keyword
    # Levels are checked from high to low, so the first match wins
    if any(keyword in lowered for keyword in adjusted_risk_score_dict['high_risk']):
        return 5  # High Risk Score (adjusted to 5)
    elif any(keyword in lowered for keyword in adjusted_risk_score_dict['medium_risk']):
        return 1  # Medium Risk Score (kept as 1)
    elif any(keyword in lowered for keyword in adjusted_risk_score_dict['low_risk']):
        return -1  # Low Risk Score (adjusted to -1)
    else:
        return 0  # Undefined or other, set to neutral 0

# Apply the adjusted scoring function to each scenario column
for col in scenario_columns:
    data[f'{col}_adjusted_risk_score'] = data[col].apply(adjusted_score_risk)

# Check adjusted risk scores for each scenario
data[[f'{col}_adjusted_risk_score' for col in scenario_columns]].head()


Unnamed: 0,scn1a_answer_adjusted_risk_score,scn2a_answer_adjusted_risk_score,scn3a_answer_adjusted_risk_score,scn4a_answer_adjusted_risk_score
0,-1,0,0,-1
1,1,0,0,5
2,1,0,0,1
3,1,-1,0,5
4,1,-1,0,-1
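
The scoring function above matches keywords as plain substrings, so a short keyword such as 'gic' would also match inside a word like "strategic". A stricter variant using word boundaries might look like the sketch below. The helper names `compile_keywords` and `risk_level` are made up for this example, and the keyword lists are copied from the dictionary above.

```python
import re

keywords_by_level = {
    'high_risk': ['etf', 'high return mutual fund', 'stocks', 'segregated fund'],
    'medium_risk': ['rrsp', 'tfsa', 'mutual fund', 'ul policy', 'mortgage investment'],
    'low_risk': ['annuity', 'long term care insurance', 'gic', 'repay debt', 'bond', 'index linked gic'],
}

def compile_keywords(keywords):
    # Longest phrases first; \b stops matches inside longer words
    ordered = sorted(keywords, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(k) for k in ordered) + r")\b")

patterns = {level: compile_keywords(words) for level, words in keywords_by_level.items()}

def risk_level(text):
    """Return the first matching level, checking high, then medium, then low."""
    if not isinstance(text, str):
        return None  # Missing or non-text answer
    lowered = text.lower()
    for level in ('high_risk', 'medium_risk', 'low_risk'):
        if patterns[level].search(lowered):
            return level
    return None

print(risk_level("A strategic plan"))  # no keyword matches as a whole word
print(risk_level("Buy a GIC"))
```

The trade-off is that inflected forms no longer match: with word boundaries, "ETFs" or "stock" would need their own keyword entries.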


In [60]:
# Calculate adjusted total risk score by summing up the individual adjusted scores
data['adjusted_total_risk_score'] = data[[f'{col}_adjusted_risk_score' for col in scenario_columns]].sum(axis=1)

# Preview the adjusted total risk scores
data[['adjusted_total_risk_score']].head()


Unnamed: 0,adjusted_total_risk_score
0,-2
1,6
2,2
3,5
4,-1


log-dir

In [61]:
import numpy as np

# Classify DIR into low, medium, high categories based on quantiles
data['DIR_category'] = pd.qcut(data['DIR'], q=3, labels=['Low', 'Medium', 'High'])

# Apply log transformation to DIR, handling zeros by adding a small constant
data['log_DIR'] = np.log(data['DIR'] + 1e-5)

# Check the DIR categories and log-transformed DIR
data[['DIR', 'DIR_category', 'log_DIR']].head()


Unnamed: 0,DIR,DIR_category,log_DIR
0,5.272727,High,1.66255
1,0.0,Low,-11.512925
2,4.538462,High,1.51259
3,0.2,Low,-1.609388
4,0.0,Low,-11.512925
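
Adding 1e-5 before taking the log maps every zero DIR to about -11.5, far from the rest of the distribution. A commonly used alternative is log(1 + DIR), which maps zero to zero. The comparison below uses the five DIR values shown above; which transform suits the analysis is a modelling choice, and this is only a sketch.

```python
import numpy as np

# DIR values from the first five rows of the output above
dir_values = np.array([5.272727, 0.0, 4.538462, 0.2, 0.0])

log_small_constant = np.log(dir_values + 1e-5)  # transform used in the cell above
log_one_plus = np.log1p(dir_values)             # log(1 + DIR); zero stays at zero

for raw, a, b in zip(dir_values, log_small_constant, log_one_plus):
    print(f"DIR={raw:.4f}  log(DIR+1e-5)={a:.4f}  log1p(DIR)={b:.4f}")
```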


control variables

In [62]:
# Check if control variables are available
control_variables = ['age', 'educ', 'work_experience']
print("Available columns:", data.columns)

# Display control variables
data[control_variables].head()


Available columns: Index(['respid', 'status', 'language', 'gender', 'age', 'province',
       'license_mutualfunds', 'license_insurance', 'license_securities',
       'educ',
       ...
       'time', 'reminder', 'DIR', 'scn1a_answer_adjusted_risk_score',
       'scn2a_answer_adjusted_risk_score', 'scn3a_answer_adjusted_risk_score',
       'scn4a_answer_adjusted_risk_score', 'adjusted_total_risk_score',
       'DIR_category', 'log_DIR'],
      dtype='object', length=211)


Unnamed: 0,age,educ,work_experience
0,26,"University certificate, diploma, degree above ...",2.0
1,47,"Bachelor's degree (e.g. B.A., B.Sc., LL.B.)",20.0
2,40,"Bachelor's degree (e.g. B.A., B.Sc., LL.B.)",1.0
3,41,"Bachelor's degree (e.g. B.A., B.Sc., LL.B.)",16.0
4,65,University certificate or diploma below the ba...,15.0


In [66]:
# Define a mapping for the education levels
education_mapping = {
    'Less than high school diploma or its equivalent': 1,
    'High school diploma or a high school equivalency certificate': 2,
    'Trade certificate or diploma': 3,
    'College, CEGEP or other non-university certificate or diploma (other than trades certificates or diplomas)': 4,
    'University certificate or diploma below the bachelor\'s level': 5,
    'Bachelor\'s degree (e.g. B.A., B.Sc., LL.B.)': 6,
    'University certificate, diploma, degree above the bachelor\'s level': 7
}

# Map the education levels in the data to the numeric values
data['educ_level'] = data['educ'].map(education_mapping)

# Check the transformed education levels
data[['educ', 'educ_level']].head()


Unnamed: 0,educ,educ_level
0,"University certificate, diploma, degree above ...",7
1,"Bachelor's degree (e.g. B.A., B.Sc., LL.B.)",6
2,"Bachelor's degree (e.g. B.A., B.Sc., LL.B.)",6
3,"Bachelor's degree (e.g. B.A., B.Sc., LL.B.)",6
4,University certificate or diploma below the ba...,5


In [69]:
# Convert DIR_category to dummy variables
X = pd.get_dummies(data['DIR_category'], drop_first=True)  # Drop "Low" as baseline
X['log_DIR'] = data['log_DIR']
X['age'] = data['age']
X['educ'] = data['educ_level']
X['work_experience'] = data['work_experience']

# Define dependent variable
y = data['adjusted_total_risk_score']

# Check the prepared variables for regression
X.head()


Unnamed: 0,Medium,High,log_DIR,age,educ,work_experience
0,False,True,1.66255,26,7,2.0
1,False,False,-11.512925,47,6,20.0
2,False,True,1.51259,40,6,1.0
3,False,False,-1.609388,41,6,16.0
4,False,False,-11.512925,65,5,15.0


regression

In [72]:
import statsmodels.api as sm

# Convert X and y to numeric, coercing errors to NaN
X = X.apply(pd.to_numeric, errors='coerce')
y = pd.to_numeric(y, errors='coerce')

# Drop rows with NaN values in either X or y, keeping the two aligned
valid_rows = X.notna().all(axis=1) & y.notna()
X = X.loc[valid_rows].astype(float)
y = y.loc[valid_rows].astype(float)

# Add a constant to the model (intercept)
X = sm.add_constant(X)

# Run the regression model
model = sm.OLS(y, X).fit()

# Display the regression results
print(model.summary())


                                OLS Regression Results                               
Dep. Variable:     adjusted_total_risk_score   R-squared:                       0.013
Model:                                   OLS   Adj. R-squared:                  0.007
Method:                        Least Squares   F-statistic:                     2.216
Date:                       Mon, 04 Nov 2024   Prob (F-statistic):             0.0395
Time:                               03:45:36   Log-Likelihood:                -2377.0
No. Observations:                        979   AIC:                             4768.
Df Residuals:                            972   BIC:                             4802.
Df Model:                                  6                                         
Covariance Type:                   nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------