# GEM Data Hackathon: Entrepreneur Profile Analysis

This notebook implements Section 1 of our analysis plan: Entrepreneur Backgrounds & Demographic Analysis.

## Objectives
- Create comprehensive entrepreneur personas based on demographic patterns
- Analyze intersectionality of demographic factors (gender, race, age, education, income)
- Compare profiles between new and established entrepreneurs
- Identify key background factors that predict entrepreneurial activity

## Setup and Data Loading

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import silhouette_score

# Set plot styling
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

In [None]:
# Load the GEM data
gem_data = pd.read_csv('../data/Hackathon_GEM_Data_FULL.csv')

# Display basic information about the dataset
print(f"Dataset shape: {gem_data.shape}")
gem_data.head()

## Data Overview and Cleaning

Let's examine the variables in our dataset and prepare it for analysis.

In [None]:
# Check variable names to understand what data we have
print("Columns in the dataset:")
print(gem_data.columns.tolist())

In [None]:
# Check data types and missing values
gem_data.info()

In [None]:
# Define the key demographic and entrepreneurship variables used throughout the analysis
demographic_vars = ['gender', 'age_range', 'race', 'education', 'household_income', 'region', 'household_size']
entrepreneurship_vars = ['new_entrepreneur', 'established_entrepreneur']

# Check missing values in key variables
missing_data = pd.DataFrame({
    'Missing Values': gem_data[demographic_vars + entrepreneurship_vars].isnull().sum(),
    'Percentage': 100 * gem_data[demographic_vars + entrepreneurship_vars].isnull().sum() / len(gem_data)
})

missing_data.sort_values('Percentage', ascending=False)

In [None]:
# Create a clean dataset for analysis with only complete cases for key variables
analysis_data = gem_data.dropna(subset=demographic_vars + entrepreneurship_vars)
print(f"Complete cases: {len(analysis_data)} out of {len(gem_data)} ({100*len(analysis_data)/len(gem_data):.1f}%)")

## 1. Weighted Demographic Analysis of Entrepreneurs

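Let's examine the demographic distributions of entrepreneurs, weighting each respondent by the survey weight (the `weight` column) so that the rates are estimates for the population rather than raw sample shares.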
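
As a minimal illustration of what the weighting changes, the sketch below computes a weighted "Yes" rate per group on a toy frame. The data values and the helper name `weighted_yes_rate` are hypothetical and are not taken from the GEM file; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy survey data (hypothetical values, not from the GEM file)
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female"],
    "new_entrepreneur": ["Yes", "No", "Yes", "No"],
    "weight": [1.0, 3.0, 2.0, 2.0],
})

def weighted_yes_rate(df, group_var, flag_var, weight_var="weight"):
    """Weighted share (%) of rows with flag == 'Yes' within each group."""
    # Keep the weight for 'Yes' rows and use 0 for everything else
    yes_weight = df[weight_var].where(df[flag_var] == "Yes", 0.0)
    yes_totals = yes_weight.groupby(df[group_var]).sum()
    totals = df.groupby(group_var)[weight_var].sum()
    return 100 * yes_totals / totals

rates = weighted_yes_rate(toy, "gender", "new_entrepreneur")
print(rates)
```

In this toy frame the unweighted rate is 50% for both groups, while the weighted rate differs because the "No" respondent in the first group carries three times the weight of the "Yes" respondent.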

In [None]:
# Function to calculate weighted percentage of entrepreneurs by a demographic variable
def weighted_entrepreneur_percentage(data, group_var):
    # For new entrepreneurs
    new_ent_by_group = pd.crosstab(
        index=data[group_var],
        columns=data['new_entrepreneur'],
        values=data['weight'],
        aggfunc='sum',
        normalize='index'
    ) * 100
    
    # For established entrepreneurs
    estab_ent_by_group = pd.crosstab(
        index=data[group_var],
        columns=data['established_entrepreneur'],
        values=data['weight'],
        aggfunc='sum',
        normalize='index'
    ) * 100
    
    # Combine into one dataframe
    if 'Yes' in new_ent_by_group.columns and 'Yes' in estab_ent_by_group.columns:
        result = pd.DataFrame({
            'New Entrepreneur (%)': new_ent_by_group['Yes'],
            'Established Entrepreneur (%)': estab_ent_by_group['Yes']
        })
        
        # Calculate the weighted counts for each group (for reference)
        group_counts = data.groupby(group_var)['weight'].sum()
        result['Weighted Count'] = group_counts
        result['Weighted Percentage'] = 100 * group_counts / group_counts.sum()
        
        return result
    else:
        # Fail loudly instead of returning None, which would cause a confusing error downstream
        raise ValueError("'Yes' category not found in one of the entrepreneur variables.")

In [None]:
# Gender analysis
gender_entrepreneurship = weighted_entrepreneur_percentage(analysis_data, 'gender')
gender_entrepreneurship

In [None]:
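# Visualize gender differences in entrepreneurship
# Note: the dashed reference lines show the simple (unweighted) mean of the group rates,
# which is not the same as the overall weighted rate for the population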
gender_entrepreneurship[['New Entrepreneur (%)', 'Established Entrepreneur (%)']].plot(kind='bar', figsize=(10, 6))
plt.title('Entrepreneurship Rates by Gender')
plt.ylabel('Percentage (%)')
plt.xlabel('Gender')
plt.xticks(rotation=0)
plt.axhline(y=gender_entrepreneurship['New Entrepreneur (%)'].mean(), color='blue', linestyle='--', 
           label=f"Average New Entrepreneur Rate: {gender_entrepreneurship['New Entrepreneur (%)'].mean():.1f}%")
plt.axhline(y=gender_entrepreneurship['Established Entrepreneur (%)'].mean(), color='orange', linestyle='--',
           label=f"Average Established Entrepreneur Rate: {gender_entrepreneurship['Established Entrepreneur (%)'].mean():.1f}%")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Age analysis
age_entrepreneurship = weighted_entrepreneur_percentage(analysis_data, 'age_range')
age_entrepreneurship

In [None]:
# Visualize age differences in entrepreneurship
age_entrepreneurship[['New Entrepreneur (%)', 'Established Entrepreneur (%)']].plot(kind='bar', figsize=(12, 6))
plt.title('Entrepreneurship Rates by Age Group')
plt.ylabel('Percentage (%)')
plt.xlabel('Age Range')
plt.xticks(rotation=0)
plt.axhline(y=age_entrepreneurship['New Entrepreneur (%)'].mean(), color='blue', linestyle='--', 
           label=f"Average New Entrepreneur Rate: {age_entrepreneurship['New Entrepreneur (%)'].mean():.1f}%")
plt.axhline(y=age_entrepreneurship['Established Entrepreneur (%)'].mean(), color='orange', linestyle='--',
           label=f"Average Established Entrepreneur Rate: {age_entrepreneurship['Established Entrepreneur (%)'].mean():.1f}%")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Race analysis
race_entrepreneurship = weighted_entrepreneur_percentage(analysis_data, 'race')
race_entrepreneurship

In [None]:
# Visualize race differences in entrepreneurship
race_entrepreneurship[['New Entrepreneur (%)', 'Established Entrepreneur (%)']].plot(kind='bar', figsize=(12, 6))
plt.title('Entrepreneurship Rates by Race')
plt.ylabel('Percentage (%)')
plt.xlabel('Race')
plt.xticks(rotation=0)
plt.axhline(y=race_entrepreneurship['New Entrepreneur (%)'].mean(), color='blue', linestyle='--', 
           label=f"Average New Entrepreneur Rate: {race_entrepreneurship['New Entrepreneur (%)'].mean():.1f}%")
plt.axhline(y=race_entrepreneurship['Established Entrepreneur (%)'].mean(), color='orange', linestyle='--',
           label=f"Average Established Entrepreneur Rate: {race_entrepreneurship['Established Entrepreneur (%)'].mean():.1f}%")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Education analysis
education_entrepreneurship = weighted_entrepreneur_percentage(analysis_data, 'education')
education_entrepreneurship

In [None]:
# Visualize education differences in entrepreneurship
# Pass figsize to DataFrame.plot directly; a separate plt.figure() call would leave an empty extra figure
education_entrepreneurship[['New Entrepreneur (%)', 'Established Entrepreneur (%)']].plot(kind='bar', figsize=(14, 7))
plt.title('Entrepreneurship Rates by Education Level')
plt.ylabel('Percentage (%)')
plt.xlabel('Education Level')
plt.xticks(rotation=45, ha='right')
plt.axhline(y=education_entrepreneurship['New Entrepreneur (%)'].mean(), color='blue', linestyle='--', 
           label=f"Average New Entrepreneur Rate: {education_entrepreneurship['New Entrepreneur (%)'].mean():.1f}%")
plt.axhline(y=education_entrepreneurship['Established Entrepreneur (%)'].mean(), color='orange', linestyle='--',
           label=f"Average Established Entrepreneur Rate: {education_entrepreneurship['Established Entrepreneur (%)'].mean():.1f}%")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Region analysis
region_entrepreneurship = weighted_entrepreneur_percentage(analysis_data, 'region')
region_entrepreneurship

In [None]:
# Visualize regional differences in entrepreneurship
# Pass figsize to DataFrame.plot directly; a separate plt.figure() call would leave an empty extra figure
region_entrepreneurship[['New Entrepreneur (%)', 'Established Entrepreneur (%)']].plot(kind='bar', figsize=(14, 7))
plt.title('Entrepreneurship Rates by US Region')
plt.ylabel('Percentage (%)')
plt.xlabel('Region')
plt.xticks(rotation=45, ha='right')
plt.axhline(y=region_entrepreneurship['New Entrepreneur (%)'].mean(), color='blue', linestyle='--', 
           label=f"Average New Entrepreneur Rate: {region_entrepreneurship['New Entrepreneur (%)'].mean():.1f}%")
plt.axhline(y=region_entrepreneurship['Established Entrepreneur (%)'].mean(), color='orange', linestyle='--',
           label=f"Average Established Entrepreneur Rate: {region_entrepreneurship['Established Entrepreneur (%)'].mean():.1f}%")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Household income analysis
income_entrepreneurship = weighted_entrepreneur_percentage(analysis_data, 'household_income')
income_entrepreneurship

In [None]:
# Visualize income differences in entrepreneurship
# Pass figsize to DataFrame.plot directly; a separate plt.figure() call would leave an empty extra figure
income_entrepreneurship[['New Entrepreneur (%)', 'Established Entrepreneur (%)']].plot(kind='bar', figsize=(16, 7))
plt.title('Entrepreneurship Rates by Household Income')
plt.ylabel('Percentage (%)')
plt.xlabel('Household Income')
plt.xticks(rotation=45, ha='right')
plt.axhline(y=income_entrepreneurship['New Entrepreneur (%)'].mean(), color='blue', linestyle='--', 
           label=f"Average New Entrepreneur Rate: {income_entrepreneurship['New Entrepreneur (%)'].mean():.1f}%")
plt.axhline(y=income_entrepreneurship['Established Entrepreneur (%)'].mean(), color='orange', linestyle='--',
           label=f"Average Established Entrepreneur Rate: {income_entrepreneurship['Established Entrepreneur (%)'].mean():.1f}%")
plt.legend()
plt.tight_layout()
plt.show()

## 2. Intersectionality Analysis

Let's examine how new-entrepreneur rates vary across combinations of demographic factors (gender and race, gender and age, race and education).
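
Intersection cells can contain very few respondents, which makes their weighted rates unstable. The sketch below shows one way to compute unweighted cell counts next to the weighted rates and hide cells below a minimum size. The toy data, the category labels, and the threshold `MIN_CELL_N` are hypothetical choices for illustration, not values from the analysis plan.

```python
import pandas as pd

# Toy data (hypothetical values and category labels)
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "race": ["A", "A", "B", "A", "B", "B"],
    "new_entrepreneur": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "weight": [1.0, 1.0, 2.0, 1.5, 0.5, 1.5],
})

MIN_CELL_N = 2  # hypothetical minimum number of respondents per cell

# Weight carried by 'Yes' respondents, 0 for everyone else
is_yes = toy["new_entrepreneur"] == "Yes"
grouped = toy.assign(yes_weight=toy["weight"].where(is_yes, 0.0)).groupby(["gender", "race"])

# Unweighted respondent count and weighted rate for each gender-race cell
cell_n = grouped.size().unstack()
rates = (100 * grouped["yes_weight"].sum() / grouped["weight"].sum()).unstack()

# Blank out rates that rest on too few respondents
masked_rates = rates.where(cell_n >= MIN_CELL_N)
print(masked_rates)
```

Passing `masked_rates` to `sns.heatmap` would leave the suppressed cells blank, since missing values are masked by default.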

In [None]:
# Gender and race intersection
gender_race_entrepreneurship = pd.crosstab(
    index=[analysis_data['gender'], analysis_data['race']],
    columns=analysis_data['new_entrepreneur'],
    values=analysis_data['weight'],
    aggfunc='sum',
    normalize='index'
) * 100

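# Display entrepreneurship rates for each gender-race combination.
# A bare expression inside an `if` block is not rendered by Jupyter, so call display() explicitly.
if 'Yes' in gender_race_entrepreneurship.columns:
    gender_race_rates = gender_race_entrepreneurship['Yes'].unstack()
    display(gender_race_rates)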

In [None]:
# Visualize gender-race intersection
plt.figure(figsize=(12, 7))
sns.heatmap(gender_race_rates, annot=True, fmt='.1f', cmap='viridis')
plt.title('New Entrepreneur Rates (%) by Gender and Race')
plt.tight_layout()
plt.show()

In [None]:
# Gender and age intersection for new entrepreneurs
gender_age_entrepreneurship = pd.crosstab(
    index=[analysis_data['gender'], analysis_data['age_range']],
    columns=analysis_data['new_entrepreneur'],
    values=analysis_data['weight'],
    aggfunc='sum',
    normalize='index'
) * 100

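# Display entrepreneurship rates for each gender-age combination.
# A bare expression inside an `if` block is not rendered by Jupyter, so call display() explicitly.
if 'Yes' in gender_age_entrepreneurship.columns:
    gender_age_rates = gender_age_entrepreneurship['Yes'].unstack()
    display(gender_age_rates)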

In [None]:
# Visualize gender-age intersection for new entrepreneurs
plt.figure(figsize=(12, 7))
sns.heatmap(gender_age_rates, annot=True, fmt='.1f', cmap='viridis')
plt.title('New Entrepreneur Rates (%) by Gender and Age')
plt.tight_layout()
plt.show()

In [None]:
# Race and education intersection for new entrepreneurs
race_education_entrepreneurship = pd.crosstab(
    index=[analysis_data['race'], analysis_data['education']],
    columns=analysis_data['new_entrepreneur'],
    values=analysis_data['weight'],
    aggfunc='sum',
    normalize='index'
) * 100

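# Display entrepreneurship rates for each race-education combination.
# A bare expression inside an `if` block is not rendered by Jupyter, so call display() explicitly.
if 'Yes' in race_education_entrepreneurship.columns:
    race_education_rates = race_education_entrepreneurship['Yes'].unstack()
    display(race_education_rates)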

In [None]:
# Visualize race-education intersection for new entrepreneurs
plt.figure(figsize=(15, 8))
sns.heatmap(race_education_rates, annot=True, fmt='.1f', cmap='viridis')
plt.title('New Entrepreneur Rates (%) by Race and Education')
plt.tight_layout()
plt.show()

## 3. Clustering Analysis to Identify Entrepreneur Personas

We'll use K-means clustering on one-hot encoded demographic variables to identify distinct personas among new entrepreneurs.
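
To make clear what K-means "sees" once the categories are one-hot encoded, the sketch below checks that the squared Euclidean distance between two encoded respondents equals twice the number of attributes on which they differ. The toy rows are hypothetical, and `pd.get_dummies` is used here only as a lightweight stand-in for the `OneHotEncoder` step (an assumption made for brevity; both produce 0/1 indicator columns).

```python
import numpy as np
import pandas as pd

# Three hypothetical respondents described by categorical attributes
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "education": ["College", "College", "High school"],
    "region": ["West", "West", "South"],
})

# One indicator column per category, as 0.0/1.0 values
encoded = pd.get_dummies(toy).to_numpy(dtype=float)

def squared_distance(i, j):
    """Squared Euclidean distance between encoded rows i and j."""
    diff = encoded[i] - encoded[j]
    return float(diff @ diff)

def n_mismatches(i, j):
    """Number of attributes on which respondents i and j differ."""
    return int((toy.iloc[i] != toy.iloc[j]).sum())

for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(i, j, squared_distance(i, j), 2 * n_mismatches(i, j))
```

Every variable therefore contributes equally to the distance regardless of how many categories it has, which is worth keeping in mind when reading the silhouette scores and cluster profiles.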

In [None]:
# Filter for new entrepreneurs only (established entrepreneurs are not part of the clustering)
entrepreneurs = analysis_data[analysis_data['new_entrepreneur'] == 'Yes'].copy()
print(f"Number of new entrepreneurs for clustering: {len(entrepreneurs)}")

In [None]:
# Select variables for clustering
cluster_vars = ['gender', 'age_range', 'race', 'education', 'region']

# Create a copy of the data with just the variables for clustering
cluster_data = entrepreneurs[cluster_vars].copy()

In [None]:
# Setup preprocessing pipeline
preprocessing = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), cluster_vars)
    ]
)

# Apply preprocessing to get encoded data
X_encoded = preprocessing.fit_transform(cluster_data)

In [None]:
# Determine optimal number of clusters using silhouette score
silhouette_scores = []
range_n_clusters = range(2, 8)

for n_clusters in range_n_clusters:
    # Initialize the clustering algorithm with n_clusters
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    
    # Fit the clustering model
    cluster_labels = kmeans.fit_predict(X_encoded)
    
    # Calculate silhouette score
    silhouette_avg = silhouette_score(X_encoded, cluster_labels)
    silhouette_scores.append(silhouette_avg)
    print(f"For n_clusters = {n_clusters}, the silhouette score is {silhouette_avg:.3f}")

# Plot silhouette scores
plt.figure(figsize=(10, 6))
plt.plot(range_n_clusters, silhouette_scores, 'o-')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs Number of Clusters')
plt.grid(True)
plt.show()

In [None]:
# Based on silhouette score, choose the optimal number of clusters
optimal_clusters = range_n_clusters[silhouette_scores.index(max(silhouette_scores))]
print(f"Optimal number of clusters: {optimal_clusters}")

# Apply K-means with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42, n_init=10)
entrepreneurs['cluster'] = kmeans.fit_predict(X_encoded)

In [None]:
# Analyze cluster profiles
cluster_profiles = pd.DataFrame()

# For each categorical variable, find the category with the largest total weight in each cluster
for var in cluster_vars:
    # Rows: clusters, columns: categories of `var`, values: summed survey weight
    weight_by_category = entrepreneurs.groupby(['cluster', var])['weight'].sum().unstack(fill_value=0)
    cluster_profiles[var] = weight_by_category.idxmax(axis=1)

# Add the cluster size and percentage
cluster_sizes = entrepreneurs.groupby('cluster')['weight'].sum()
cluster_profiles['Weighted Count'] = cluster_sizes
cluster_profiles['Percentage'] = 100 * cluster_sizes / cluster_sizes.sum()

# Display cluster profiles
cluster_profiles.sort_values('Percentage', ascending=False)

In [None]:
# Assign placeholder persona names; replace them with descriptive labels after reviewing the profiles above
cluster_names = [f"Persona {i+1}" for i in range(optimal_clusters)]

# Create a mapping from cluster number to persona name
cluster_to_persona = {i: name for i, name in enumerate(cluster_names)}

# Add persona names to the cluster profiles
cluster_profiles['Persona'] = cluster_profiles.index.map(cluster_to_persona)

# Display updated profiles with persona names
cluster_profiles[['Persona', 'gender', 'age_range', 'race', 'education', 'region', 'Percentage']]

In [None]:
# Visualize cluster distribution
plt.figure(figsize=(12, 6))
bars = plt.bar(cluster_profiles['Persona'], cluster_profiles['Percentage'])

# Add percentage labels on bars
for bar in bars:
    height = bar.get_height()
    plt.annotate(f'{height:.1f}%',
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3),  # 3 points vertical offset
                textcoords="offset points",
                ha='center', va='bottom')

plt.title('Distribution of Entrepreneur Personas')
plt.ylabel('Percentage of Entrepreneurs (%)')
plt.xlabel('Persona')
plt.tight_layout()
plt.show()

## 4. Comparison between New and Established Entrepreneurs

Let's compare the weighted demographic composition of new versus established entrepreneurs.
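
The comparison boils down to two weighted distributions and their difference in percentage points. A minimal sketch on toy data follows; the age labels, weights, and the helper name `weighted_share` are hypothetical. It also shows why a category missing from one group should count as 0% rather than as a missing value.

```python
import pandas as pd

def weighted_share(df, variable, weight_var="weight"):
    """Weighted percentage distribution of `variable` within df."""
    totals = df.groupby(variable)[weight_var].sum()
    return 100 * totals / totals.sum()

# Hypothetical respondents; note that each group lacks one age band
new_toy = pd.DataFrame({"age_range": ["18-24", "25-34", "25-34"],
                        "weight": [1.0, 1.0, 2.0]})
estab_toy = pd.DataFrame({"age_range": ["25-34", "45-54"],
                          "weight": [1.0, 3.0]})

# Aligning on the index leaves NaN where a category is absent; treat that as a 0% share
comparison = pd.DataFrame({
    "New (%)": weighted_share(new_toy, "age_range"),
    "Established (%)": weighted_share(estab_toy, "age_range"),
}).fillna(0)

# Positive values mean the category is more common among new entrepreneurs
comparison["Difference (pp)"] = comparison["New (%)"] - comparison["Established (%)"]
print(comparison)
```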

In [None]:
# Select entrepreneurs of both types for comparison
new_entrepreneurs = analysis_data[analysis_data['new_entrepreneur'] == 'Yes'].copy()
established_entrepreneurs = analysis_data[analysis_data['established_entrepreneur'] == 'Yes'].copy()

print(f"Number of new entrepreneurs: {len(new_entrepreneurs)}")
print(f"Number of established entrepreneurs: {len(established_entrepreneurs)}")

In [None]:
# Function to calculate weighted distribution of a variable for each entrepreneur type
def compare_entrepreneur_types(variable):
    # Get weighted distribution for new entrepreneurs
    new_dist = new_entrepreneurs.groupby(variable)['weight'].sum() / new_entrepreneurs['weight'].sum() * 100
    
    # Get weighted distribution for established entrepreneurs
    estab_dist = established_entrepreneurs.groupby(variable)['weight'].sum() / established_entrepreneurs['weight'].sum() * 100
    
    # Combine into one dataframe; a category absent from one group has a 0% share
    comparison = pd.DataFrame({
        'New (%)': new_dist,
        'Established (%)': estab_dist
    }).fillna(0)
    
    # Calculate difference between types
    comparison['Difference (pp)'] = comparison['New (%)'] - comparison['Established (%)']
    
    return comparison

In [None]:
# Compare age distributions
age_comparison = compare_entrepreneur_types('age_range')
age_comparison

In [None]:
# Visualize age comparison
# Pass figsize to DataFrame.plot directly; a separate plt.figure() call would leave an empty extra figure
age_comparison[['New (%)', 'Established (%)']].plot(kind='bar', figsize=(12, 6))
plt.title('Age Distribution: New vs. Established Entrepreneurs')
plt.ylabel('Percentage (%)')
plt.xlabel('Age Range')
plt.xticks(rotation=0)
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Compare gender distributions
gender_comparison = compare_entrepreneur_types('gender')
gender_comparison

In [None]:
# Compare race distributions
race_comparison = compare_entrepreneur_types('race')
race_comparison

In [None]:
# Visualize race comparison
# Pass figsize to DataFrame.plot directly; a separate plt.figure() call would leave an empty extra figure
race_comparison[['New (%)', 'Established (%)']].plot(kind='bar', figsize=(12, 6))
plt.title('Race Distribution: New vs. Established Entrepreneurs')
plt.ylabel('Percentage (%)')
plt.xlabel('Race')
plt.xticks(rotation=0)
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Compare education distributions
education_comparison = compare_entrepreneur_types('education')
education_comparison

In [None]:
# Visualize education comparison
# Pass figsize to DataFrame.plot directly; a separate plt.figure() call would leave an empty extra figure
education_comparison[['New (%)', 'Established (%)']].plot(kind='bar', figsize=(14, 7))
plt.title('Education Distribution: New vs. Established Entrepreneurs')
plt.ylabel('Percentage (%)')
plt.xlabel('Education Level')
plt.xticks(rotation=45, ha='right')
plt.legend()
plt.tight_layout()
plt.show()

## 5. Key Factors Predicting Entrepreneurial Activity

Let's examine attitudinal factors beyond demographics and how strongly each one is associated with new entrepreneurial activity.
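
The measure used in this section is a relative likelihood ratio: the weighted new-entrepreneur rate among "Yes" respondents divided by the rate among "No" respondents. The worked example below uses hypothetical toy values to show the arithmetic. It is a descriptive ratio only and does not adjust for other variables.

```python
import pandas as pd

# Hypothetical respondents (values are illustrative, not GEM results)
toy = pd.DataFrame({
    "knows_entrepreneur": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "new_entrepreneur": ["Yes", "Yes", "No", "Yes", "No", "No"],
    "weight": [1.0, 1.0, 2.0, 1.0, 2.0, 5.0],
})

# Weight carried by new entrepreneurs, 0 for everyone else
is_new = toy["new_entrepreneur"] == "Yes"
new_weight = toy["weight"].where(is_new, 0.0)

# Weighted new-entrepreneur rate (%) for each answer to the factor question
rate = 100 * (
    new_weight.groupby(toy["knows_entrepreneur"]).sum()
    / toy.groupby("knows_entrepreneur")["weight"].sum()
)

# Ratio of the 'Yes' rate to the baseline 'No' rate
relative_likelihood = rate["Yes"] / rate["No"]
print(rate)
print(f"Relative likelihood: {relative_likelihood:.2f}x")
```

A ratio above 1 means the rate is higher among "Yes" respondents, and a ratio below 1 means it is lower, which is how the dashed line at 1 in the final chart should be read.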

In [None]:
# Identify non-demographic factors that might predict entrepreneurship
attitudinal_vars = ['knows_entrepreneur', 'local_opportunity', 'entrepreneurial_skill', 
                    'fear_of_failure', 'wants_entrepreneurship', 'respects_entrepreneurship', 
                    'follows_entrepreneurship']

# Check availability of these variables in our dataset
available_factors = [var for var in attitudinal_vars if var in gem_data.columns]
print(f"Available predictive factors: {available_factors}")

In [None]:
# Function to analyze impact of a potential predictor on entrepreneurship rates
def factor_impact_on_entrepreneurship(data, factor):
    # For new entrepreneurs
    factor_impact = pd.crosstab(
        index=data[factor],
        columns=data['new_entrepreneur'],
        values=data['weight'],
        aggfunc='sum',
        normalize='index'
    ) * 100
    
    # Calculate weighted counts for context
    factor_counts = data.groupby(factor)['weight'].sum()
    
    if 'Yes' in factor_impact.columns:
        result = pd.DataFrame({
            'Entrepreneur Rate (%)': factor_impact['Yes'],
            'Weighted Count': factor_counts,
            'Weighted Percentage': 100 * factor_counts / factor_counts.sum()
        })
        
        # Calculate relative likelihood compared to baseline 'No' response
        if 'No' in result.index and 'Yes' in result.index:
            baseline = result.loc['No', 'Entrepreneur Rate (%)']
            result['Relative Likelihood'] = result['Entrepreneur Rate (%)'] / baseline
        
        return result
    else:
        return None

In [None]:
# Analyze the impact of knowing an entrepreneur
knows_impact = factor_impact_on_entrepreneurship(
    analysis_data.dropna(subset=['knows_entrepreneur']), 'knows_entrepreneur'
)
knows_impact

In [None]:
# Visualize impact of knowing an entrepreneur
plt.figure(figsize=(10, 6))
knows_impact['Entrepreneur Rate (%)'].plot(kind='bar')
plt.title('Impact of Knowing an Entrepreneur on Entrepreneurship Rate')
plt.ylabel('New Entrepreneur Rate (%)')
plt.xlabel('Knows an Entrepreneur')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Analyze the impact of having entrepreneurial skills
skills_impact = factor_impact_on_entrepreneurship(
    analysis_data.dropna(subset=['entrepreneurial_skill']), 'entrepreneurial_skill'
)
skills_impact

In [None]:
# Visualize impact of entrepreneurial skills
plt.figure(figsize=(10, 6))
skills_impact['Entrepreneur Rate (%)'].plot(kind='bar')
plt.title('Impact of Having Entrepreneurial Skills on Entrepreneurship Rate')
plt.ylabel('New Entrepreneur Rate (%)')
plt.xlabel('Has Entrepreneurial Skills')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Analyze the impact of fear of failure
fear_impact = factor_impact_on_entrepreneurship(
    analysis_data.dropna(subset=['fear_of_failure']), 'fear_of_failure'
)
fear_impact

In [None]:
# Analyze the impact of seeing local opportunities
opportunity_impact = factor_impact_on_entrepreneurship(
    analysis_data.dropna(subset=['local_opportunity']), 'local_opportunity'
)
opportunity_impact

In [None]:
# Compare impact of all factors side by side
predictor_impacts = {
    'Knows Entrepreneur': knows_impact.loc['Yes', 'Entrepreneur Rate (%)'] / knows_impact.loc['No', 'Entrepreneur Rate (%)'],
    'Has Skills': skills_impact.loc['Yes', 'Entrepreneur Rate (%)'] / skills_impact.loc['No', 'Entrepreneur Rate (%)'],
    'Sees Opportunity': opportunity_impact.loc['Yes', 'Entrepreneur Rate (%)'] / opportunity_impact.loc['No', 'Entrepreneur Rate (%)'],
    'Fears Failure': fear_impact.loc['Yes', 'Entrepreneur Rate (%)'] / fear_impact.loc['No', 'Entrepreneur Rate (%)']
}

impact_df = pd.DataFrame({
    'Factor': list(predictor_impacts.keys()),
    'Relative Impact': list(predictor_impacts.values())
}).sort_values('Relative Impact', ascending=False)

impact_df

In [None]:
# Visualize comparative impact of all factors
plt.figure(figsize=(12, 6))
bars = plt.bar(impact_df['Factor'], impact_df['Relative Impact'])

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.annotate(f'{height:.2f}x',
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3),  # 3 points vertical offset
                textcoords="offset points",
                ha='center', va='bottom')

plt.axhline(y=1, color='r', linestyle='--', alpha=0.7)
plt.title('Relative Impact of Various Factors on Entrepreneurship Likelihood')
plt.ylabel('Relative Likelihood Ratio (compared to baseline)')
plt.xlabel('Factor')
plt.ylim(bottom=0)
plt.tight_layout()
plt.show()

## Summary of Findings

Our analysis of entrepreneur backgrounds and demographic characteristics yields the following key findings:

1. **Demographic Patterns**:
   - Gender: Males have higher entrepreneurship rates than females
   - Age: Middle-aged adults (35-44) have the highest new entrepreneurship rates
   - Race: Black Americans have the highest new entrepreneurship rates but lower established business rates
   - Education: Higher education levels generally correlate with higher entrepreneurship rates
   - Region: [Regional patterns to be summarized based on results]

2. **Intersectionality**:
   - [Key findings about demographic intersections to be summarized]
   - Notable variation in entrepreneurship rates when combining factors like gender and race (no significance test was run)

3. **Entrepreneur Personas**:
   - [Description of the identified entrepreneur personas and their characteristics]
   - [Distribution and significance of each persona]

4. **New vs. Established Entrepreneurs**:
   - [Key differences between new and established entrepreneurs]
   - [Demographic shifts from new to established business ownership]

5. **Key Predictive Factors**:
   - Having entrepreneurial skills is the strongest predictor of entrepreneurship
   - Knowing other entrepreneurs significantly increases entrepreneurship likelihood
   - Seeing local opportunities is associated with higher entrepreneurship rates
   - Fear of failure has [impact to be described based on results]

These findings provide a foundation for understanding the diverse characteristics of American entrepreneurs and the factors that influence entrepreneurial activity.