## Strategy

- **Understand business relationships**: For example, `total_assets` should generally be greater than `current_assets`. `gross_profit` should be derived from `net_sales` and `cost_of_goods_sold`. These relationships should guide the generation process.
- **Introduce variability**: Randomly modify values but keep them within reasonable business ranges and relationships.
  
### Steps:
1. **Define Logical Relationships**:
   - `gross_profit = net_sales - cost_of_goods_sold`
   - `ebit = gross_profit - total_operating_expenses - depreciation_and_amortization`
   - `ebitda = ebit + depreciation_and_amortization`
   - `net_income` is `ebit` minus interest and tax charges
   - `total_liabilities` should be less than or equal to `total_assets`
   - `market_value` can depend on `net_income` or `ebitda` with some variability

2. **Generate Data**: Use `numpy` for generating values but apply business rules to maintain realistic relationships.
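
The relationships in step 1 can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the sample size, the distribution parameters, and the local `rng` generator are assumptions chosen for the example.

```python
import numpy as np

# Assumption: a tiny sample and illustrative distribution parameters.
rng = np.random.default_rng(0)
n = 5

net_sales = np.clip(rng.normal(2_500_000, 1_000_000, n), 1_000, None)
cost_of_goods_sold = np.clip(rng.normal(400_000, 150_000, n), 0, None)
total_operating_expenses = np.clip(rng.normal(500_000, 200_000, n), 0, None)
depreciation_and_amortization = rng.exponential(15_000, n)

# Derived quantities follow the identities listed above.
gross_profit = net_sales - cost_of_goods_sold
ebit = gross_profit - total_operating_expenses - depreciation_and_amortization
ebitda = ebit + depreciation_and_amortization

# Draw assets first, then liabilities as a fraction of assets,
# which guarantees total_liabilities <= total_assets.
total_assets = np.clip(rng.normal(600_000, 300_000, n), 50_000, None)
total_liabilities = total_assets * rng.uniform(0.3, 0.9, n)
```

Deriving a quantity from its inputs, rather than drawing it independently, is what keeps the identities exact.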

### Key Points:
1. **Relationships between variables**: 
   - `gross_profit = net_sales - cost_of_goods_sold`
   - `ebit = gross_profit - total_operating_expenses - depreciation_and_amortization`
   - `ebitda = ebit + depreciation_and_amortization`
   - `net_income` is `ebit` minus interest and tax charges
2. **Ranges**: Base values are drawn at random, then derived or clipped so that the relationships above hold.
3. **Realistic Variability**: Noise comes from normal, exponential, log-normal, and uniform draws, so no two records are identical while each stays within a plausible range.

### Outcome:
This approach generates synthetic data in which the relationships between financial variables such as `net_sales`, `gross_profit`, `ebit`, and `total_assets` are maintained, so each new dataset varies randomly while staying plausible from a business perspective.

## Data for Financial Performance Analysis

This section of the notebook generates a synthetic dataset describing the financial performance and attributes of healthy companies, including their assets, liabilities, sales, profits, and company-specific features like industry, size, and age. Below is a breakdown of the process:



### Libraries and Setup
- **Pandas and Numpy**: Essential for handling data structures and numerical computations.
- **Custom Functions**: Two custom functions are defined: `generate_company_size()` draws a size value from a band selected by net sales, and `generate_company_age()` draws an age from a band selected by size.

### Dataset Overview
- **Number of Records**: The dataset contains `15,000` records of healthy companies, spanning various states, industries, and statuses (public or private).
- **States and Industries**: The dataset covers 22 U.S. states and 12 industries, ensuring diversity across geographical locations and business sectors.
- **Random Seed**: A seed (`33`) is set for reproducibility, ensuring consistent results across runs.
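
The effect of a fixed seed can be illustrated with a short sketch. The notebook seeds NumPy's global state with `np.random.seed(33)`; the version below uses a local `Generator` instead, which is an assumption made for the example, and `draw_sample` is a hypothetical helper.

```python
import numpy as np

def draw_sample(seed, size=3):
    """Return `size` standard normal draws from a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=1.0, size=size)

first = draw_sample(33)
second = draw_sample(33)   # same seed, so the same values as `first`
other = draw_sample(34)    # different seed, so different values
```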

### Financial Data Analysis
1. **Assets and Liabilities**: 
   - `current_assets` and `total_assets` provide insights into the financial standing of companies.
   - `total_liabilities` is derived from `total_assets` and clipped so that it never exceeds `total_assets`.
   
2. **Net Sales and Profitability**:
   - `net_sales` reflect the company’s overall revenue performance.
   - `gross_profit` is calculated as net sales minus the cost of goods sold (COGS).
   - `EBIT` and `EBITDA` provide measures of operational profitability, adjusted for depreciation and amortization.

3. **Debt and Liabilities**:
   - `total_current_liabilities` are expressed as a percentage of `current_assets`, around 60%.
   - `total_long_term_debt` is tied to a fraction of `total_liabilities`, reflecting manageable debt levels.

4. **Other Metrics**:
   - Metrics like `market_value`, `retained_earnings`, `total_revenue`, and `inventory` offer additional insights into the company's financial health, based on net income and sales data.
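
One way to keep a liability near a target share of an asset while guaranteeing it stays below that asset is to draw the ratio rather than the amount. The sketch below is an assumption about how this could be done, not the notebook's code (which draws the amount itself from a normal distribution); the parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

# Illustrative asset values with a positive floor.
current_assets = np.clip(rng.normal(500_000, 200_000, n), 10_000, None)

# Draw the ratio, centred near 0.6, and clip it to [0, 1] so that
# current liabilities can never exceed current assets.
ratio = np.clip(rng.normal(0.6, 0.1, n), 0.0, 1.0)
total_current_liabilities = current_assets * ratio
```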

### Company Attributes
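- **Company Size**: `size` is an integer drawn from a small (1 to 50), medium (51 to 250), or large (251 to 1,000) band, selected by `net_sales` thresholds of 1.5 million and 3 million.
- **Company Age**: Age is drawn uniformly from a band set by size: 0 to 5 years for small, 6 to 10 for medium, and 11 to 30 for large companies.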
- **Industry**: Industries are represented with varying probabilities, reflecting a realistic distribution of common industries like Technology and Healthcare.
- **State**: State probabilities are drawn once from a Dirichlet distribution, so some states are represented more heavily than others.
- **Company Status**: The status (`Public` or `Private`) is determined using weighted probabilities, with most companies being private.
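
The two sampling styles described above, hand-specified weights and a Dirichlet draw, can be compared in a small sketch. The category lists and weights are hypothetical and shortened for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Hand-specified, skewed weights (hypothetical values), normalised to sum to 1.
industries = ["Technology", "Healthcare", "Finance", "Retail"]
weights = np.array([0.4, 0.3, 0.2, 0.1])
weights = weights / weights.sum()
industry = rng.choice(industries, size=n, p=weights)

# Dirichlet draw: a random probability vector that is generally not uniform.
states = ["Texas", "Ohio", "Utah"]
state_probs = rng.dirichlet(alpha=np.ones(len(states)))
state = rng.choice(states, size=n, p=state_probs)

# Empirical share of the most heavily weighted industry.
tech_share = np.mean(industry == "Technology")
```

With `alpha` set to all ones, every probability vector is equally likely, so a single draw can be far from balanced.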

### Final Dataset
The dataset is structured into a Pandas DataFrame, `df_healthy`, which contains the following fields:
- **Date Features**: `time`
- **Company Characteristics**: `industry`, `state`, `company_status`, `size`, and `age`
- **Financial Metrics**: `current_assets`, `total_assets`, `total_liabilities`, `net_sales`, `gross_profit`, `total_operating_expenses`, `depreciation_and_amortization`, `ebit`, `ebitda`, `net_income`, `market_value`, `retained_earnings`, `total_revenue`, `total_current_liabilities`, `cost_of_goods_sold`, `total_long_term_debt`, `inventory`, and `total_receivables`.
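
A quick consistency check on a frame with these columns might look like the sketch below. `check_invariants` is a hypothetical helper, not part of the notebook, and the toy values are made up for illustration.

```python
import pandas as pd

def check_invariants(df):
    """Return a dict mapping each rule name to the number of violating rows."""
    return {
        "liabilities_exceed_assets": int((df["total_liabilities"] > df["total_assets"]).sum()),
        "current_exceeds_total_assets": int((df["current_assets"] > df["total_assets"]).sum()),
        "negative_net_sales": int((df["net_sales"] < 0).sum()),
    }

# Toy frame with hypothetical values; the second row breaks the first rule.
toy = pd.DataFrame({
    "total_assets": [900.0, 500.0],
    "current_assets": [300.0, 200.0],
    "total_liabilities": [400.0, 650.0],
    "net_sales": [1_000.0, 800.0],
})
violations = check_invariants(toy)
```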

In [36]:
import pandas as pd
import numpy as np
from scipy import stats
from datetime import timedelta
from faker import Faker

from scipy.stats import beta
import matplotlib.pyplot as plt
import seaborn as sns


In [37]:
import pandas as pd
import numpy as np
# generate_company_size and generate_company_age are defined below in this cell

# Set the number of records to generate
num_records = 15000
states = [
    "Arizona", "Connecticut", "Colorado", "California",
    "Florida", "Delaware", "Georgia",
    "Hawaii","Louisiana",
    "Massachusetts", "Michigan", "Nebraska", "Nevada",
    "North Carolina", "New York", "New Jersey", "Ohio",
    "Oklahoma","Washington",  "South Carolina",
    "Texas","Utah"
]
industries = ['Technology', 'Healthcare', 'Finance', 'Retail', 'Manufacturing', 'Energy', 'Transportation', 
              'Real Estate', 'Utilities', 'Telecommunications', 'Education', 'Hospitality']

company_statuses = ['Public', 'Private']

# Define a random seed for reproducibility
np.random.seed(33)

# Helper functions: map net sales to a size value, and size to an age
def generate_company_size(net_sales):
    if net_sales < 15e5:
        return np.random.randint(1, 51)  # Small
    elif net_sales < 3e6:
        return np.random.randint(51, 251)  # Medium
    else:
        return np.random.randint(251, 1001)  # Large

def generate_company_age(size):
    if size <= 50:  # Small
        return np.random.randint(0, 6)  # Startup
    elif size <= 250:  # Medium
        return np.random.randint(6, 11)  # Growth
    else:  # Large
        return np.random.randint(11, 31)  # Established

# Generate random dates for healthy companies
def generate_random_dates(start_year, end_year, num_records):
    start_date = pd.Timestamp(f'{start_year}-01-01')
    end_date = pd.Timestamp(f'{end_year}-12-31')
    
    # Generate random days within the date range
    random_days = np.random.randint(0, (end_date - start_date).days, num_records)
    
    # Create the random dates
    random_dates = start_date + pd.to_timedelta(random_days, unit='D')
    
    return random_dates

# Generate data that represents healthy companies

random_dates = generate_random_dates(start_year=2000, end_year=2019, num_records=num_records)

# Assets and liabilities
current_assets = np.random.normal(50000, 200000, num_records)
total_assets = current_assets + np.random.normal(600000, 300000, num_records)

# Liabilities and financial health
total_liabilities = total_assets - np.random.normal(300000, 100000, num_records)
total_liabilities = np.clip(total_liabilities, 50000, total_assets)  # Ensure financial health

# Net sales with a floor to avoid zeros
net_sales = np.random.normal(2500000, 1000000, num_records)
net_sales = np.clip(net_sales, 1000, None)  # Minimum sales to avoid zeros

# Other financial metrics
gross_profit = net_sales - np.random.normal(400000, 150000, num_records)  # Gross profit
total_operating_expenses = np.random.normal(500000, 200000, num_records)
depreciation_and_amortization = np.random.exponential(15000, num_records)

# EBITDA and EBIT calculations
ebit = gross_profit - total_operating_expenses - depreciation_and_amortization
ebitda = ebit + depreciation_and_amortization

# Net income with a floor to avoid negative values
net_income = np.clip(ebit - np.random.normal(300000, 100000, num_records), 0, None)

# Current liabilities are centred on 60% of current assets; the spread is wide, so values can exceed current assets
total_current_liabilities = np.random.normal(loc=current_assets * 0.6, scale=abs(current_assets), size=num_records)

total_long_term_debt = np.random.normal(loc=total_liabilities * 0.5, scale= abs(total_liabilities * 0.1), size=num_records)  # Manageable debt

inventory = np.random.lognormal(mean=8, sigma=0.4, size=num_records) 
total_receivables = np.random.normal(loc=150000, scale=50000, size=num_records)

cost_of_goods_sold = net_sales - gross_profit  # Implied by gross_profit = net_sales - COGS, so the two stay consistent

# Market value and revenue calculations
market_value = np.abs(net_income) * np.random.uniform(5, 15, num_records)
total_revenue = net_sales * np.random.uniform(1.1, 1.5, num_records)
retained_earnings = np.random.uniform(0, net_income * 0.3, num_records)

# Draw a size value from the band implied by each company's net sales
size = np.array([generate_company_size(s) for s in net_sales])

# Company age: drawn uniformly from the age band implied by size
age = np.array([generate_company_age(s) for s in size])

# Industries with hand-specified, skewed probabilities
industry_probs = [0.25, 0.2, 0.15, 0.12, 0.1, 0.08, 0.05, 0.03, 0.02, 0.01, 0.02, 0.01]

# Normalize probabilities
industry_probs = [p / sum(industry_probs) for p in industry_probs]
industry = np.random.choice(industries, num_records, p=industry_probs)

# Generate states with a Dirichlet distribution (ensures probabilities sum to 1)
state_probs = np.random.dirichlet(alpha=[1] * len(states))
state = np.random.choice(states, num_records, p=state_probs)

status_probs = [0.37, 0.63]  # Order matches company_statuses (Public, Private): most companies are private
company_status = np.random.choice(company_statuses, num_records, p=status_probs)


# Create DataFrame
data = {
    'time': random_dates,
    'industry': industry,
    'state': state,
    'company_status': company_status,
    'current_assets': np.clip(current_assets, 0, None),
    'total_assets': np.clip(total_assets, 0, None),
    'total_liabilities': np.clip(total_liabilities, 0, None),
    'net_sales': net_sales,
    'gross_profit': np.clip(gross_profit, 0, None),
    'total_operating_expenses': np.clip(total_operating_expenses, 0, None),
    'depreciation_and_amortization': depreciation_and_amortization,
    'ebit': np.clip(ebit, 0, None),
    'ebitda': np.clip(ebitda, 0, None),
    'net_income': net_income,
    'market_value': np.clip(market_value, 0, None),
    'retained_earnings': np.clip(retained_earnings, 0, None),
    'total_revenue': np.clip(total_revenue, 0, None),
    'size': size,
    'age': age,
    'total_current_liabilities': np.clip(total_current_liabilities, 0, None),
    'cost_of_goods_sold': np.clip(cost_of_goods_sold, 0, None),
    'total_long_term_debt': np.clip(total_long_term_debt, 0, None),
    'inventory': np.clip(inventory, 0, None),
    'total_receivables': np.clip(total_receivables, 0, None)
}

df_healthy = pd.DataFrame(data)

df_healthy.head()


## Company Data for Financial Analysis

This section outlines the process of preparing a dataset with company attributes, financial periods, and key details such as industry, state, company size, and status.

### Key Steps:
- **Number of Companies**: The dataset includes 3,000 unique companies, each with a varying number of financial periods (between 20 and 35).
- **Industries and States**: Companies span 5 industries with typical gross margin ranges and 10 U.S. states.
- **Company Attributes**: Companies are categorized by size (small, medium, large), age (1 to 20 years), and status (public or private).
- **Company Size and Status**: 
  - Size is linked to company age, with younger companies more likely to be smaller.
  - Company status is based on size, with larger companies tending to be public.
- **Company Names**: Each company is assigned a unique name, reflecting realistic business scenarios.
- **Financial Periods**: Each company is associated with multiple financial periods, starting between 2000 and 2017, with quarterly reporting.
- **Final Dataset**: The data for all companies is combined into a comprehensive DataFrame, including fields such as `company_id`, `date`, `industry`, `state`, `company_status`, `size_category`, `age`, and `company_name`.
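
Generating the quarterly periods for one company can be sketched as below. `quarterly_periods` is a hypothetical helper and the start date is made up. Passing a `QuarterEnd` offset object instead of a frequency string is a choice made here so the example does not depend on which quarter alias a given pandas version accepts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

def quarterly_periods(start_date, num_periods):
    """Return `num_periods` quarter-end dates on or after `start_date`."""
    return pd.date_range(start=start_date, periods=num_periods,
                         freq=pd.offsets.QuarterEnd())

# Between 20 and 35 periods, inclusive (the upper bound of integers is exclusive).
num_periods = int(rng.integers(20, 36))
dates = quarterly_periods(pd.Timestamp("2004-02-15"), num_periods)
```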


In [38]:
# Set the number of unique companies
num_companies = 3000

# Random number of periods per company
min_periods = 20
max_periods = 35

# Define industries with typical gross margin ranges
industries = {
    'Technology': {'gross_margin': (0.40, 0.80)},
    'Finance': {'gross_margin': (0.50, 0.90)},
    'Healthcare': {'gross_margin': (0.30, 0.70)},
    'Retail': {'gross_margin': (0.10, 0.50)},
    'Manufacturing': {'gross_margin': (0.10, 0.60)},
    # Add more industries as needed
}

industry_list = list(industries.keys())

states = [
    'California', 'New York', 'Texas', 'Florida', 'Illinois',
    'Washington', 'Georgia', 'Pennsylvania', 'Ohio', 'North Carolina'
]
company_statuses = ['Public', 'Private']

# Generate company IDs
company_ids = np.arange(1, num_companies + 1)

# Assign attributes to companies
np.random.seed(69)  # For reproducibility of attributes

company_attributes = pd.DataFrame({
    'company_id': company_ids,
    # 'industry': np.random.choice(industry_list, size=num_companies),
    'state': np.random.choice(states, size=num_companies),
    # 'company_status': np.random.choice(company_statuses, size=num_companies),
    # 'size_category': np.random.choice(['Small', 'Medium', 'Large'], size=num_companies),
    'age': np.random.randint(1, 21, size=num_companies)  # Age between 1 and 20 years
})

def assign_size(age):
    if age < 5:
        return 'Small'
    elif age < 12:
        return np.random.choice(['Small', 'Medium'], p=[0.3, 0.7])
    else:
        return np.random.choice(['Medium', 'Large'], p=[0.4, 0.6])
company_attributes['size_category'] = company_attributes['age'].apply(assign_size)

def assign_status(size_category):
    if size_category == 'Small':
        return 'Private'
    elif size_category == 'Medium':
        return np.random.choice(['Private', 'Public'], p=[0.7, 0.3])
    else:
        return 'Public'

company_attributes['company_status'] = company_attributes['size_category'].apply(assign_status)

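# Generate unique company names using Faker
Faker.seed(69)  # Seed Faker as well so the names are reproducible
fake = Faker()
company_names = set()
while len(company_names) < num_companies:
    company_names.add(fake.company())

# Sort so the assignment does not depend on set iteration order
company_attributes['company_name'] = sorted(company_names)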

industry_probs = {
    'Technology': 0.2,
    'Finance': 0.15,
    'Healthcare': 0.15,
    'Retail': 0.27,
    'Manufacturing': 0.23
}
company_attributes['industry'] = np.random.choice(
    industry_list,
    size=num_companies,
    p=[industry_probs[name] for name in industry_list]  # Align weights with industry_list order
)

# Create a list to store date data for each company
data_list = []
overall_start_date = pd.to_datetime('2000-01-01')
overall_end_date = pd.to_datetime('2018-12-31')

def generate_dates(num_periods, start_date, frequency='Q'):
    return pd.date_range(start=start_date, periods=num_periods, freq=frequency)

for idx, row in company_attributes.iterrows():
    company_id = row['company_id']
    num_periods = np.random.randint(min_periods, max_periods + 1)
    start_date = overall_start_date + timedelta(days=np.random.randint(0, (overall_end_date - overall_start_date).days - 365))
    dates = generate_dates(num_periods, start_date)


    # Create a DataFrame for this company
    df_company = pd.DataFrame({
        'company_id': [company_id] * num_periods,
        'date': dates,
        'industry': [row['industry']] * num_periods,
        'state': [row['state']] * num_periods,
        'company_status': [row['company_status']] * num_periods,
        'size_category': [row['size_category']] * num_periods,
        'age': [row['age']] * num_periods,
        'company_name': [row['company_name']] * num_periods
    })

    data_list.append(df_company)

# Combine all company DataFrames
df = pd.concat(data_list, ignore_index=True)




## Financial Metrics Over Time for Companies

This section details the process of calculating financial metrics for companies over time, considering both healthy and distressed companies. Key attributes such as revenue, expenses, profit, and liabilities are computed for each financial period based on company-specific factors like industry and size.

### Key Steps:
- **Company Attributes**: The year of founding is assigned for each company, along with size categories (`Small`, `Medium`, `Large`), and industries, each with typical gross margin and operating expense parameters.
- **Financial Metrics**: 
  - Metrics include revenue, gross profit, operating expenses, EBIT, EBITDA, net income, assets, liabilities, equity, and market value.
  - These metrics are recomputed for every reporting period, accounting for factors such as economic crises (2008-2009, 2020-2021) and whether a company is distressed.
- **Beta Distributions**: Industry-specific gross margins and operating expenses are simulated using Beta distributions to ensure realistic ranges.
- **Distressed Companies**: Some companies are marked as distressed, leading to negative growth, higher operating expenses, and increased financial volatility.
- **Crisis Adjustments**: Financial crises like 2008 and 2020 are factored in, with lower revenues and margins during these periods.
- **Final Dataset**: The resulting DataFrame includes columns for revenue, net income, assets, liabilities, and various other financial metrics, all clipped to prevent negative values where appropriate.
- **Handling Missing Values**: Infinite values are replaced with `NaN` and subsequently filled with 0.
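
The revenue and margin steps above can be sketched in isolation. This is a simplified illustration under stated assumptions: `simulate_revenue` is a hypothetical helper, a single crisis shift is applied to all crisis years, and NumPy's `Generator.beta` stands in for `scipy.stats.beta.rvs`.

```python
import numpy as np

rng = np.random.default_rng(4)
CRISIS_YEARS = {2008, 2009, 2020, 2021}

def simulate_revenue(initial_revenue, years, mu=0.07, sigma=0.10):
    """Simulate one revenue value per entry in `years` using GBM steps."""
    revenue = initial_revenue
    path = []
    for year in years:
        step_mu, step_sigma = mu, sigma
        if year in CRISIS_YEARS:
            # Assumed crisis shift: lower drift, higher volatility.
            step_mu -= 0.15
            step_sigma += 0.10
        # GBM step: log-growth is normal with mean mu - sigma^2 / 2 and std sigma.
        revenue *= np.exp((step_mu - 0.5 * step_sigma**2)
                          + step_sigma * rng.normal())
        path.append(revenue)
    return np.array(path)

revenues = simulate_revenue(10_000_000, range(2005, 2012))

# Beta(2, 1) has mean a / (a + b) = 2/3 and support [0, 1],
# so every draw is a valid margin.
gross_margin = rng.beta(2, 1, size=len(revenues))
gross_profit = revenues * gross_margin
```

Because each step multiplies by an exponential, revenue stays strictly positive; because the margin lies in [0, 1], gross profit never exceeds revenue.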

In [39]:
from scipy.stats import beta

# Function to generate financial metrics for a company over time
def generate_financials(company, distressed=False):
    # Add distress label column
    company['bankrupt'] = distressed
    min_date = company['date'].min().year
    year_founded = np.random.randint(min_date - 30, min_date)  # Founded up to 30 years before the first record
    company['year_founded'] = year_founded

    num_periods = len(company)
    industry = company['industry'].iloc[0]
    size_category = company['size_category'].iloc[0]

    # Set base revenue ranges based on company size
    if size_category == 'Small':
        revenue_lower, revenue_upper = 500000, 5000000
    elif size_category == 'Medium':
        revenue_lower, revenue_upper = 5000000, 50000000
    else:  # Large companies
        revenue_lower, revenue_upper = 50000000, 500000000

    initial_revenue = np.random.uniform(revenue_lower, revenue_upper)

    # Initialize arrays for financial metrics
    revenues = []
    gross_profits = []
    cogs_list = []
    total_operating_expenses = []
    ebit_list = []
    ebitda_list = []
    net_income_list = []
    total_assets_list = []
    total_liabilities_list = []
    equity_list = []
    current_assets_list = []
    total_current_liabilities_list = []
    inventory_list = []
    total_receivables_list = []
    retained_earnings_list = []
    market_value_list = []
    depreciation_and_amortization_list = []
    total_long_term_debt_list = []

    # Industry parameters for Beta distributions
    industry_gross_margin_params = {
        'Technology': {'a': 2, 'b': 1},
        'Healthcare': {'a': 2, 'b': 1.5},
        'Finance': {'a': 2, 'b': 1},
        'Retail': {'a': 2, 'b': 1},  # Mean ~66%
        'Manufacturing': {'a': 2, 'b': 1.5}
    }

    industry_opex_params = {
        'Technology': {'a': 1, 'b': 3},
        'Healthcare': {'a': 1, 'b': 3},
        'Finance': {'a': 1, 'b': 2.5},
        'Retail': {'a': 1, 'b': 3},  # Mean ~25%
        'Manufacturing': {'a': 1.5, 'b': 3}
    }

    # Simulate financial metrics over the periods
    current_revenue = initial_revenue
    retained_earnings = 0
    for i, date in enumerate(company['date']):
        current_year = date.year

        # Set default mu and sigma based on company size
        if distressed:
            mu = -0.1  # Negative growth for distressed companies
            sigma = 0.2  # Higher volatility
        else:
            if size_category == 'Small':
                mu = 0.10
                sigma = 0.15
            elif size_category == 'Medium':
                mu = 0.07
                sigma = 0.10
            else:  # Large companies
                mu = 0.05
                sigma = 0.08

        # Adjust for economic crises
        crisis_adjustments = {
            (2008, 2009): {'mu_adjustment': -0.15, 'sigma_adjustment': 0.10},
            (2020, 2021): {'mu_adjustment': -0.20, 'sigma_adjustment': 0.15}
        }

        for years, adjustments in crisis_adjustments.items():
            if current_year in range(years[0], years[1]+1):
                mu += adjustments['mu_adjustment']
                sigma += adjustments['sigma_adjustment']
                break  # Apply only one crisis adjustment per year

        # Simulate revenue using geometric Brownian motion (GBM)
        dt = 1  # One time step per reporting period, so mu and sigma act as per-period rates
        growth_factor = np.exp((mu - 0.5 * sigma**2) * dt + sigma * np.random.normal() * np.sqrt(dt))
        current_revenue *= growth_factor
        revenues.append(current_revenue)

        # Simulate gross margins using Beta distribution
        if distressed or current_year in [2008, 2009, 2020, 2021]:
            # Lower gross margins during crises or for distressed companies
            a = industry_gross_margin_params[industry]['a'] * 0.75
            b = industry_gross_margin_params[industry]['b'] * 1.25
        else:
            a = industry_gross_margin_params[industry]['a']
            b = industry_gross_margin_params[industry]['b']
        gross_margin = beta.rvs(a, b)
        gross_profit = current_revenue * gross_margin
        gross_profits.append(gross_profit)
        cogs = current_revenue - gross_profit
        cogs_list.append(cogs)

        # Simulate operating expenses as a percentage of revenue
        if distressed or current_year in [2008, 2009, 2020, 2021]:
            # Increase opex percentage during crises or for distressed companies
            a = industry_opex_params[industry]['a'] * 1.25
            b = industry_opex_params[industry]['b'] * 0.75
        else:
            a = industry_opex_params[industry]['a']
            b = industry_opex_params[industry]['b']
        opex_pct = beta.rvs(a, b)
        operating_expense = current_revenue * opex_pct
        total_operating_expenses.append(operating_expense)

        # Depreciation and amortization
        depreciation_and_amortization = current_revenue * np.random.uniform(0.02, 0.05)
        depreciation_and_amortization_list.append(depreciation_and_amortization)

        # EBIT and EBITDA
        ebit = gross_profit - operating_expense - depreciation_and_amortization
        ebitda = ebit + depreciation_and_amortization
        ebit_list.append(ebit)
        ebitda_list.append(ebitda)

        # Interest expense
        if distressed:
            interest_expense = current_revenue * np.random.uniform(0.05, 0.1)
        else:
            interest_expense = current_revenue * np.random.uniform(0.01, 0.03)

        # Taxes
        pre_tax_income = ebit - interest_expense
        tax_rate = 0.21 if pre_tax_income > 0 else 0  # No taxes if no profit
        taxes = pre_tax_income * tax_rate

        # Net Income
        net_income = pre_tax_income - taxes
        net_income_list.append(net_income)

        # Retained Earnings
        retention_ratio = np.random.uniform(0.5, 0.8) if not distressed else np.random.normal(0.0, 0.3)
        retained_earnings += net_income * retention_ratio
        retained_earnings_list.append(retained_earnings)

        # Total Assets and Liabilities
        if i == 0:
            total_assets = current_revenue * np.random.uniform(0.8, 1.2)
        else:
            assets_growth_rate = 0.05 if not distressed else -0.05
            total_assets *= (1 + assets_growth_rate)
        total_assets_list.append(total_assets)

        if distressed:
            liabilities_pct = np.random.uniform(0.6, 0.8)
        else:
            liabilities_pct = np.random.uniform(0.3, 0.6)
        total_liabilities = total_assets * liabilities_pct
        total_liabilities_list.append(total_liabilities)

        # Equity
        equity = total_assets - total_liabilities
        equity_list.append(equity)

        # Current Assets and Liabilities
        current_assets = total_assets * np.random.uniform(0.4, 0.8)
        current_assets_list.append(current_assets)
        total_current_liabilities = total_liabilities * np.random.uniform(0.3, 0.6)
        total_current_liabilities_list.append(total_current_liabilities)

        # Inventory and Receivables
        inventory = current_revenue * np.random.uniform(0.1, 0.2)
        inventory_list.append(inventory)
        total_receivables = current_revenue * np.random.uniform(0.1, 0.2)
        total_receivables_list.append(total_receivables)

        # Market Value
        market_value = equity * np.random.uniform(1.0, 1.5)
        market_value_list.append(market_value)

        # Long-term Debt
        total_long_term_debt = total_liabilities - total_current_liabilities
        total_long_term_debt_list.append(total_long_term_debt)

    # Assign generated financials to the company dataframe
    company['total_revenue'] = revenues
    company['net_sales'] = revenues
    company['gross_profit'] = gross_profits
    company['cost_of_goods_sold'] = cogs_list
    company['total_operating_expenses'] = total_operating_expenses
    company['depreciation_and_amortization'] = depreciation_and_amortization_list
    company['ebit'] = ebit_list
    company['ebitda'] = ebitda_list
    company['net_income'] = net_income_list
    company['total_assets'] = total_assets_list
    company['total_liabilities'] = total_liabilities_list
    company['equity'] = equity_list
    company['current_assets'] = current_assets_list
    company['total_current_liabilities'] = total_current_liabilities_list
    company['inventory'] = inventory_list
    company['total_receivables'] = total_receivables_list
    company['retained_earnings'] = retained_earnings_list
    company['market_value'] = market_value_list
    company['total_long_term_debt'] = total_long_term_debt_list

    # Clip columns that must be non-negative
    non_negative_cols = [
        'total_revenue', 'net_sales', 'gross_profit', 'cost_of_goods_sold',
        'total_operating_expenses', 'depreciation_and_amortization',
        'ebitda', 'total_assets', 'total_liabilities', 'equity',
        'current_assets', 'total_current_liabilities', 'inventory',
        'total_receivables', 'market_value', 'total_long_term_debt'
    ]
    company[non_negative_cols] = company[non_negative_cols].clip(lower=0)

    # Net income, EBIT, and retained earnings can be negative

    return company

# Apply the function to each company, flagging a random 15% of companies as distressed for their whole history
distressed_company_ids = np.random.choice(company_attributes['company_id'], size=int(0.15 * num_companies), replace=False)
df = df.groupby('company_id', group_keys=False).apply(lambda x: generate_financials(x, distressed=x['company_id'].iloc[0] in distressed_company_ids))

# Remove time component from 'date' column
df['date'] = pd.to_datetime(df['date']).dt.date

# Replace inf and -inf with NaN
df.replace([np.inf, -np.inf], np.nan, inplace=True)

# Handle missing values
df.fillna(0, inplace=True)

# Display the first few rows
df.head()


## Financial Metrics Calculation for Companies

This section describes the calculation of financial metrics for companies over time, including both healthy and distressed companies. Key metrics such as revenue, profit, and liabilities are computed based on company-specific factors like industry and size, while considering economic crises and distress situations.

### Key Steps:
- **Company Attributes**: Each company is assigned a founding year, along with industry, size category (`Small`, `Medium`, `Large`), and potential distress status.
- **Financial Metrics**: 
  - Metrics include revenue, gross profit, operating expenses, EBIT, EBITDA, net income, assets, liabilities, equity, and market value.
  - These metrics are recomputed for every reporting period, with factors such as size and industry influencing financial performance.
- **Crisis Adjustments**: Financial crises (2008-2009, 2020-2021) are accounted for, applying negative growth rates and increased volatility during those periods.
- **Distress Logic**: Companies can become distressed based on consecutive periods of losses or equity thresholds. Distressed companies exhibit higher volatility and lower growth rates, but they can also recover after multiple periods of positive income.
- **Final Dataset**: Financial metrics are assigned to each company for each period, with appropriate handling of non-negative values. The dataset is then grouped and updated with financial details for further analysis.
- **Clean-Up Operations**: The `date` column is converted to a consistent format, and any infinite values are handled by replacing them with `NaN` and subsequently filling missing values with zeros.
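
The distress logic described above can be pictured as a small state update applied once per period. The sketch below is an assumption about its general shape, not the notebook's implementation: `update_distress`, the streak limits of four periods, and the zero-equity threshold are all hypothetical.

```python
def update_distress(distressed, loss_streak, profit_streak, net_income, equity,
                    loss_limit=4, recovery_limit=4):
    """Return the updated (distressed, loss_streak, profit_streak) for one period."""
    # Track consecutive losing and profitable periods.
    if net_income < 0:
        loss_streak, profit_streak = loss_streak + 1, 0
    else:
        loss_streak, profit_streak = 0, profit_streak + 1

    # Enter distress after enough consecutive losses or when equity is wiped out;
    # leave it after enough consecutive profitable periods with positive equity.
    if not distressed and (loss_streak >= loss_limit or equity <= 0):
        distressed = True
    elif distressed and profit_streak >= recovery_limit and equity > 0:
        distressed = False
    return distressed, loss_streak, profit_streak

# Hypothetical net income series: four losses, then four profitable periods.
state = (False, 0, 0)
flags = []
for net_income in [-1, -2, -1, -3, 5, 4, 6, 2]:
    state = update_distress(*state, net_income=net_income, equity=100)
    flags.append(state[0])
```

In this run the company becomes distressed at the fourth consecutive loss and recovers at the fourth consecutive profitable period.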

In [71]:
# Function to generate financial metrics for a company over time
def generate_financials(company, distressed=False):
    min_date = company['date'].min().year
    year_founded = np.random.randint(min_date - 30, min_date)  # Founded up to 30 years before the first record
    company['year_founded'] = year_founded

    industry = company['industry'].iloc[0]
    size_category = company['size_category'].iloc[0]

    # Set base revenue ranges based on company size
    if size_category == 'Small':
        revenue_lower, revenue_upper = 500000, 5000000
    elif size_category == 'Medium':
        revenue_lower, revenue_upper = 5000000, 50000000
    else:  # Large companies
        revenue_lower, revenue_upper = 50000000, 500000000

    initial_revenue = np.random.uniform(revenue_lower, revenue_upper)

    # Initialize arrays for financial metrics
    revenues = []
    gross_profits = []
    cogs_list = []
    total_operating_expenses = []
    ebit_list = []
    ebitda_list = []
    net_income_list = []
    total_assets_list = []
    total_liabilities_list = []
    equity_list = []
    current_assets_list = []
    total_current_liabilities_list = []
    inventory_list = []
    total_receivables_list = []
    retained_earnings_list = []
    market_value_list = []
    depreciation_and_amortization_list = []
    total_long_term_debt_list = []
    bankrupt_flags = []

    # Industry parameters for Beta distributions
    industry_gross_margin_params = {
        'Technology': {'a': 2, 'b': 1},
        'Healthcare': {'a': 2, 'b': 1.5},
        'Finance': {'a': 2, 'b': 1},
        'Retail': {'a': 2, 'b': 1},
        'Manufacturing': {'a': 2, 'b': 1.5}
    }

    industry_opex_params = {
        'Technology': {'a': 1, 'b': 3},
        'Healthcare': {'a': 1, 'b': 3},
        'Finance': {'a': 1, 'b': 2.5},
        'Retail': {'a': 1, 'b': 3},
        'Manufacturing': {'a': 1.5, 'b': 3}
    }

    # Crisis adjustment factors
    crisis_adjustments = {
        (2008, 2009): {'mu_adjustment': -0.15, 'sigma_adjustment': 0.10},
        (2020, 2021): {'mu_adjustment': -0.20, 'sigma_adjustment': 0.15}
    }

    # Initialize bankruptcy status
    bankrupt = False
    consecutive_loss_periods = 0
    retained_earnings = 0
    current_revenue = initial_revenue

    for i, date in enumerate(company['date']):
        current_year = date.year

        # Apply crisis adjustments
        mu_adjustment = 0
        sigma_adjustment = 0
        for years, adjustments in crisis_adjustments.items():
            if years[0] <= current_year <= years[1]:
                mu_adjustment = adjustments['mu_adjustment']
                sigma_adjustment = adjustments['sigma_adjustment']
                break

        # Default growth rates based on size
        if distressed:
            mu = -0.10
            sigma = 0.20
        else:
            if size_category == 'Small':
                mu = 0.10
                sigma = 0.15
            elif size_category == 'Medium':
                mu = 0.07
                sigma = 0.10
            else:
                mu = 0.05
                sigma = 0.08

        mu += mu_adjustment
        sigma += sigma_adjustment

        # Simulate revenue with one geometric Brownian motion (GBM) step per period;
        # the -0.5 * sigma**2 drift correction keeps expected growth at exp(mu)
        growth_factor = np.exp((mu - 0.5 * sigma**2) + sigma * np.random.normal())
        current_revenue *= growth_factor
        revenues.append(current_revenue)

        # Gross profit and COGS
        if distressed or current_year in [2008, 2009, 2020, 2021]:
            a = industry_gross_margin_params[industry]['a'] * 0.75
            b = industry_gross_margin_params[industry]['b'] * 1.25
        else:
            a = industry_gross_margin_params[industry]['a']
            b = industry_gross_margin_params[industry]['b']
        gross_margin = beta.rvs(a, b)
        gross_profit = current_revenue * gross_margin
        gross_profits.append(gross_profit)
        cogs = current_revenue - gross_profit
        cogs_list.append(cogs)

        # Operating expenses
        if distressed or current_year in [2008, 2009, 2020, 2021]:
            a_opex = industry_opex_params[industry]['a'] * 1.25
            b_opex = industry_opex_params[industry]['b'] * 0.75
        else:
            a_opex = industry_opex_params[industry]['a']
            b_opex = industry_opex_params[industry]['b']
        opex_pct = beta.rvs(a_opex, b_opex)
        operating_expense = current_revenue * opex_pct
        total_operating_expenses.append(operating_expense)

        # Depreciation and amortization
        depreciation_and_amortization = current_revenue * np.random.uniform(0.02, 0.05)
        depreciation_and_amortization_list.append(depreciation_and_amortization)

        # EBIT and EBITDA
        ebit = gross_profit - operating_expense - depreciation_and_amortization
        ebit_list.append(ebit)
        ebitda = ebit + depreciation_and_amortization
        ebitda_list.append(ebitda)

        # Interest and taxes (flat 21% rate on positive pre-tax income; losses are untaxed)
        interest_expense = current_revenue * (0.05 if distressed else np.random.uniform(0.01, 0.03))
        pre_tax_income = ebit - interest_expense
        tax_rate = 0.21 if pre_tax_income > 0 else 0
        taxes = pre_tax_income * tax_rate

        # Net income
        net_income = pre_tax_income - taxes
        net_income_list.append(net_income)

        # Retained earnings: distressed firms retain a normal draw floored at 0
        if distressed:
            retention_ratio = max(np.random.normal(0.0, 0.3), 0)
        else:
            retention_ratio = np.random.uniform(0.5, 0.8)
        retained_earnings += net_income * retention_ratio
        retained_earnings_list.append(retained_earnings)

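        # Assets and liabilities: seed assets from revenue in the first period,
        # then grow them 5% per period (or shrink them 5% if distressed)
        if i == 0:
            total_assets = current_revenue * np.random.uniform(0.8, 1.2)
        else:
            asset_growth = -0.05 if distressed else 0.05
            total_assets = total_assets * (1 + asset_growth)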
        total_assets_list.append(total_assets)

        liabilities_pct = np.random.uniform(0.6, 0.8) if distressed else np.random.uniform(0.3, 0.6)
        total_liabilities = total_assets * liabilities_pct
        total_liabilities_list.append(total_liabilities)

        # Equity
        equity = total_assets - total_liabilities
        equity_list.append(equity)

        # Current assets and liabilities
        current_assets = total_assets * np.random.uniform(0.4, 0.8)
        current_assets_list.append(current_assets)
        total_current_liabilities = total_liabilities * np.random.uniform(0.3, 0.6)
        total_current_liabilities_list.append(total_current_liabilities)

        # Inventory and receivables
        inventory = current_revenue * np.random.uniform(0.1, 0.2)
        inventory_list.append(inventory)
        total_receivables = current_revenue * np.random.uniform(0.1, 0.2)
        total_receivables_list.append(total_receivables)

        # Market value and long-term debt
        market_value = equity * np.random.uniform(1.0, 1.5)
        market_value_list.append(market_value)
        total_long_term_debt = total_liabilities - total_current_liabilities
        total_long_term_debt_list.append(total_long_term_debt)

        # Bankruptcy logic
        if net_income < 0:
            consecutive_loss_periods += 1
        else:
            consecutive_loss_periods = 0

        # Determine thresholds based on crisis periods
        if current_year in [2008, 2009, 2020, 2021]:
            loss_threshold = 1
            equity_threshold = total_assets * 0.1
        else:
            loss_threshold = 2
            equity_threshold = total_assets * 0.2

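        # With liabilities_pct drawn from [0.3, 0.8), equity stays above 20% of assets,
        # so in practice the loss-streak condition is what flags a company here
        if not bankrupt and (consecutive_loss_periods >= loss_threshold or equity < equity_threshold):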
            bankrupt = True
        elif bankrupt and net_income > 0:
            recovery_counter = 1 if 'recovery_counter' not in locals() else recovery_counter + 1
            if recovery_counter >= 3 and equity > equity_threshold:
                bankrupt = False
                consecutive_loss_periods = 0
                recovery_counter = 0

        bankrupt_flags.append(bankrupt)

    # Assign generated financials to the company dataframe
    company['total_revenue'] = revenues
    company['gross_profit'] = gross_profits
    company['cost_of_goods_sold'] = cogs_list
    company['total_operating_expenses'] = total_operating_expenses
    company['depreciation_and_amortization'] = depreciation_and_amortization_list
    company['ebit'] = ebit_list
    company['ebitda'] = ebitda_list
    company['net_income'] = net_income_list
    company['total_assets'] = total_assets_list
    company['total_liabilities'] = total_liabilities_list
    company['equity'] = equity_list
    company['current_assets'] = current_assets_list
    company['total_current_liabilities'] = total_current_liabilities_list
    company['inventory'] = inventory_list
    company['total_receivables'] = total_receivables_list
    company['retained_earnings'] = retained_earnings_list
    company['market_value'] = market_value_list
    company['total_long_term_debt'] = total_long_term_debt_list
    company['bankrupt'] = bankrupt_flags

    # Floor columns that should not be negative at zero
    # (ebit, net_income and retained_earnings are left unclipped)
    non_negative_cols = [
        'total_revenue', 'gross_profit', 'cost_of_goods_sold', 'total_operating_expenses',
        'depreciation_and_amortization', 'ebitda', 'total_assets', 'total_liabilities', 'equity',
        'current_assets', 'total_current_liabilities', 'inventory', 'total_receivables',
        'market_value', 'total_long_term_debt'
    ]
    company[non_negative_cols] = company[non_negative_cols].clip(lower=0)

    return company

# Mark a random 15% of companies as distressed, then generate financials per company
company_ids = df['company_id'].unique()
distressed_company_ids = np.random.choice(company_ids, size=int(0.15 * len(company_ids)), replace=False)
df = df.groupby('company_id', group_keys=False).apply(
    lambda x: generate_financials(x, distressed=x['company_id'].iloc[0] in distressed_company_ids)
)

# Clean-up operations
df['date'] = pd.to_datetime(df['date']).dt.date
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.fillna(0, inplace=True)

# Display the updated DataFrame
print(df.head())
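
The revenue update inside `generate_financials` can be looked at in isolation. The sketch below is a minimal, standalone version of that step, assuming one time step per reporting period and a seeded `numpy` `Generator` instead of the global `np.random` state used in the notebook. The helper name `simulate_revenue_path` and the example parameters are hypothetical.

```python
import numpy as np

def simulate_revenue_path(initial_revenue, mu, sigma, n_periods, rng):
    """Return a revenue path built from the same per-period GBM update (hypothetical helper)."""
    shocks = rng.normal(size=n_periods)
    # Per-period log growth: drift corrected by -0.5 * sigma^2, plus a scaled random shock.
    log_growth = (mu - 0.5 * sigma**2) + sigma * shocks
    # Cumulative sum of log growth gives the compounded growth factor for each period.
    return initial_revenue * np.exp(np.cumsum(log_growth))

rng = np.random.default_rng(seed=0)

# Example: a medium-sized company (mu=0.07, sigma=0.10) over 10 periods.
path = simulate_revenue_path(5_000_000, mu=0.07, sigma=0.10, n_periods=10, rng=rng)
print(path.round(0))

# Sanity check: the average one-period growth factor should be close to exp(mu).
factors = np.exp((0.07 - 0.5 * 0.10**2) + 0.10 * rng.normal(size=200_000))
print(factors.mean(), np.exp(0.07))
```

Because the growth factor is an exponential, revenue stays strictly positive however large `sigma` becomes during a crisis year.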

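To see what the Beta shape parameters imply, it can help to compare the expected gross margin per industry in normal years with the expected margin under the crisis or distress scaling (`a * 0.75`, `b * 1.25`). This is only a sketch using the closed-form mean `a / (a + b)`. The helper `beta_mean` and the `expected_margins` dictionary are hypothetical names, not part of the notebook.

```python
# Shape parameters copied from the gross-margin table in generate_financials.
industry_gross_margin_params = {
    'Technology': {'a': 2, 'b': 1},
    'Healthcare': {'a': 2, 'b': 1.5},
    'Finance': {'a': 2, 'b': 1},
    'Retail': {'a': 2, 'b': 1},
    'Manufacturing': {'a': 2, 'b': 1.5},
}

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

expected_margins = {}
for industry, p in industry_gross_margin_params.items():
    normal = beta_mean(p['a'], p['b'])
    # Crisis/distress scaling used in the function: a * 0.75, b * 1.25.
    stressed = beta_mean(p['a'] * 0.75, p['b'] * 1.25)
    expected_margins[industry] = (normal, stressed)
    print(f"{industry:<14} normal={normal:.3f}  stressed={stressed:.3f}")
```

For Technology, for example, the expected gross margin falls from about 0.67 to about 0.55 under the stressed parameters.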

In [70]:
# Re-apply the function to each company without the `distressed` argument, so its default applies
# distressed_company_ids = np.random.choice(company_attributes['company_id'], size=int(0.15 * num_companies), replace=False)
df = df.groupby('company_id', group_keys=False).apply(generate_financials)

# Remove time component from 'date' column
# df['date'] = pd.to_datetime(df['date']).dt.date

# Replace inf and -inf with NaN
df.replace([np.inf, -np.inf], np.nan, inplace=True)

# Handle missing values
df.fillna(0, inplace=True)

# Display the first few rows
df.head()
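
Since the goal is to keep accounting relationships intact, a quick check of the generated columns can be useful. The sketch below counts rows that violate three identities used in `generate_financials`. It runs on a made-up two-row frame, and `count_identity_violations` is a hypothetical helper. It also illustrates one side effect worth knowing about: `ebitda` is in `non_negative_cols` but `ebit` is not, so clipping can break `ebitda = ebit + depreciation_and_amortization` for loss-making rows.

```python
import numpy as np
import pandas as pd

def count_identity_violations(df):
    """Count rows violating each accounting identity (hypothetical helper)."""
    checks = {
        'gross_profit + cogs == revenue': np.isclose(
            df['gross_profit'] + df['cost_of_goods_sold'], df['total_revenue']),
        'ebit + d&a == ebitda': np.isclose(
            df['ebit'] + df['depreciation_and_amortization'], df['ebitda']),
        'assets - liabilities == equity': np.isclose(
            df['total_assets'] - df['total_liabilities'], df['equity']),
    }
    return {name: int((~ok).sum()) for name, ok in checks.items()}

# Two made-up rows; the second has a negative EBITDA before clipping.
toy = pd.DataFrame({
    'total_revenue': [100.0, 100.0],
    'gross_profit': [60.0, 20.0],
    'cost_of_goods_sold': [40.0, 80.0],
    'depreciation_and_amortization': [5.0, 5.0],
    'ebit': [30.0, -15.0],
    'ebitda': [35.0, -10.0],
    'total_assets': [120.0, 120.0],
    'total_liabilities': [50.0, 90.0],
    'equity': [70.0, 30.0],
})

before = count_identity_violations(toy)

# Apply the same kind of clipping the function applies to 'ebitda'.
clipped = toy.copy()
clipped[['ebitda']] = clipped[['ebitda']].clip(lower=0)
after = count_identity_violations(clipped)

print(before)
print(after)
```

Running the same helper on the real `df` would show how many rows are affected. Whether to clip `ebit` as well, or to leave `ebitda` unclipped, is a modelling choice this sketch does not settle.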


In [68]:
# df['date'] = pd.to_datetime(df['date'])

# df['year'] = df['date'].dt.year
# # Calculate bankruptcy rate per year
# bankruptcy_per_year = df.groupby('year')['bankrupt'].mean() * 100  # Convert to percentage
# import matplotlib.pyplot as plt
# import seaborn as sns

# # Plot overall bankruptcy rate per year
# plt.figure(figsize=(12, 6))
# sns.lineplot(x=bankruptcy_per_year.index, y=bankruptcy_per_year.values, marker='o')
# plt.title('Annual Bankruptcy Rate')
# plt.xlabel('Year')
# plt.ylabel('Bankruptcy Rate (%)')
# plt.axvspan(2008, 2009, color='red', alpha=0.1, label='Financial Crisis')
# plt.axvspan(2020, 2021, color='red', alpha=0.1)
# plt.legend()
# plt.show()



In [69]:


# # Select multiple sample companies
# sample_company_ids = df['company_id'].unique()[:5]  # First 5 companies

# plt.figure(figsize=(15, 10))
# for company_id in sample_company_ids:
#     sample = df[df['company_id'] == company_id]
#     plt.plot(sample['date'], sample['bankrupt'], marker='o', label=f'Company {company_id}')

# plt.title('Bankruptcy Status Over Time for Sample Companies')
# plt.xlabel('Date')
# plt.ylabel('Bankrupt Status (1=Yes, 0=No)')
# plt.legend()
# plt.ylim(-0.1, 1.1)
# plt.show()


In [44]:
# Select a few companies
# sample_companies = df['company_id'].unique()

In [45]:
# import matplotlib.pyplot as plt



# for company_id in sample_companies[:10]:
#     company_data = df[df['company_id'] == company_id]
#     plt.figure(figsize=(12, 6))
#     plt.plot(company_data['date'], company_data['net_income'], label='Net Income')
#     plt.plot(company_data['date'], company_data['ebit'], label='EBIT')
#     plt.title(f'Financial Metrics Over Time for Company {company_id}')
#     plt.xlabel('Date')
#     plt.ylabel('Amount')
#     plt.legend()
#     plt.show()


In [46]:
df.shape[0]


82410

In [47]:
(df['gross_profit'] < 1_000_000).sum()


np.int64(18738)

In [48]:
# df[df['net_sales'] <100000].value_counts().sum()

np.int64(4166)

In [49]:
# df[(df['net_sales'] >=200000) & (df['net_sales'] <500000)].value_counts().sum()

np.int64(3314)

In [53]:
# import matplotlib.pyplot as plt
# import seaborn as sns

# state = df["state"]

# state_counts = np.unique(state, return_counts=True)

# # Create a bar plot for state distribution
# plt.figure(figsize=(10, 6))
# sns.barplot(x=state_counts[0], y=state_counts[1], palette="viridis")
# plt.title("Distribution of States in Generated Data")
# plt.xlabel("States")
# plt.ylabel("Count")
# plt.xticks(rotation=45)
# plt.show()


In [64]:
df['bankrupt'].value_counts()

bankrupt
False    54116
True     28294
Name: count, dtype: int64
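
The counts above are easier to interpret as a share. The sketch below rebuilds a boolean column with the same counts (rather than using the real `df`) and computes the fraction of company-period rows flagged as bankrupt. The variable names are hypothetical.

```python
import pandas as pd

# Rebuild a flag column with the counts shown in the output above.
flags = pd.Series([False] * 54116 + [True] * 28294, name='bankrupt')

total_rows = len(flags)
bankrupt_share = flags.mean()  # mean of booleans = share of True values
print(f"rows={total_rows}, bankrupt share={bankrupt_share:.1%}")
# rows=82410, bankrupt share=34.3%
```

The total matches the 82410 rows reported by `df.shape[0]`. On the real data, `df['bankrupt'].mean()` or `df['bankrupt'].value_counts(normalize=True)` would give the same share directly.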