In [1]:
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from openai import OpenAI
import json

In [2]:
# Load the dataset
file_path = 'german_data.csv'  # Replace with your file path
df = pd.read_csv(file_path)

# Splitting the data into train and test sets
X = df.drop('Target', axis=1)  # Assuming 'Target' is your dependent variable
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Adding a constant to the model (statsmodels doesn't add it by default)
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

# Create and fit the model
model = sm.Logit(y_train, X_train)
result = model.fit()

# Summary of the model
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.480428
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                 Target   No. Observations:                  800
Model:                          Logit   Df Residuals:                      779
Method:                           MLE   Df Model:                           20
Date:                Fri, 24 Nov 2023   Pseudo R-squ.:                  0.2021
Time:                        07:32:24   Log-Likelihood:                -384.34
converged:                       True   LL-Null:                       -481.72
Covariance Type:            nonrobust   LLR p-value:                 1.222e-30
                                        coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
const                                -3.5050      1.150     -3
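
The Pseudo R-squ. value in the summary header can be reproduced from the two log-likelihoods printed next to it. A minimal sketch of that arithmetic, assuming the reported figure is McFadden's pseudo R-squared (1 - LL / LL-Null); the variable names are illustrative and the inputs are the rounded values shown above:

```python
# Values copied from the summary header above (rounded as printed)
log_likelihood = -384.34       # Log-Likelihood of the fitted model
log_likelihood_null = -481.72  # LL-Null: intercept-only model

# McFadden's pseudo R-squared: 1 - LL / LL-Null
pseudo_r2 = 1 - log_likelihood / log_likelihood_null

# Likelihood-ratio statistic, compared against a chi-squared
# distribution with Df Model (20) degrees of freedom
llr_statistic = 2 * (log_likelihood - log_likelihood_null)

print(f"Pseudo R-squared: {pseudo_r2:.3f}")
print(f"LLR statistic:    {llr_statistic:.2f}")
```

With the rounded inputs this gives about 0.202, consistent with the 0.2021 in the summary. The large likelihood-ratio statistic on 20 degrees of freedom is what drives the very small LLR p-value.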

In [3]:
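def table_to_markdown(table, header_row):
    """Convert a statsmodels summary table to a markdown table.

    The first row of `table` is used as the header, so for the general-info
    table the first data row ends up in the header position.
    `header_row` is kept for the existing calls but is currently unused.
    """
    headers = table[0]
    rows = table[1:]

    # Build a DataFrame so pandas can render markdown (needs the `tabulate` package)
    df = pd.DataFrame(rows, columns=headers)

    return df.to_markdown(index=False)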

In [4]:
# Create tables
general_info = table_to_markdown(result.summary().tables[0], 1)
variables_info = table_to_markdown(result.summary().tables[1], 1)

print(general_info)
print('\n---\n')
print(variables_info)

| Dep. Variable:   | Target           |   No. Observations:     |    800    |
|:-----------------|:-----------------|:------------------------|:----------|
| Model:           | Logit            | Df Residuals:           | 779       |
| Method:          | MLE              | Df Model:               | 20        |
| Date:            | Fri, 24 Nov 2023 | Pseudo R-squ.:          | 0.2021    |
| Time:            | 07:32:24         | Log-Likelihood:         | -384.34   |
| converged:       | True             | LL-Null:                | -481.72   |
| Covariance Type: | nonrobust        | LLR p-value:            | 1.222e-30 |

---

|                                   | coef       | std err   | z      | P>|z|   | [0.025   | 0.975]    |
|:----------------------------------|:-----------|:----------|:-------|:--------|:---------|:----------|
| const                             | -3.5050    | 1.150     | -3.048 | 0.002   | -5.759   | -1.251    |
| Account Balance                   | 0.5627     | 0.07
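
The `coef` column is on the log-odds scale, so exponentiating a coefficient gives the multiplicative change in the odds of `Target` being 1 for a one-unit increase in that variable. A small sketch using coefficients quoted in this notebook (0.5627 appears in the table above, the others in the analysis text); the dictionary and variable names are illustrative:

```python
import math

# Coefficients (log-odds) quoted in this notebook's output and analysis text
coefficients = {
    "Account Balance": 0.5627,
    "Length of Current Employment": 0.1861,
    "Instalment per cent": -0.3518,
}

# exp(coef) is the odds ratio for a one-unit increase in the variable
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    change = (ratio - 1) * 100  # percentage change in the odds
    print(f"{name}: odds ratio {ratio:.3f} ({change:+.1f}% odds per unit)")
```

For Account Balance the odds ratio is about 1.76, so each one-unit step up multiplies the odds by roughly that factor, holding the other variables fixed. With the fitted model, the same quantity for every variable could be obtained from `result.params` by exponentiating it.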

In [5]:
# Create an example of how the analysis should look
example = """
        The logistic model exhibits a moderate fit, indicated by a Pseudo R-squared of 0.2021,
        highlighting that while it captures some variability in the target variable, there's 
        scope for improvement. The model's overall statistical significance is confirmed by the
        significant LLR p-value.
        
        Key Variables Impacting the Target Variable:
        
        1. **Account Balance:** With a coefficient of 0.5627, this variable has a strong positive
           influence. As the account balance rises, so does the probability of the target outcome.
        
        2. **Payment Status of Previous Credit and Value Savings/Stocks:** These variables, like
           Account Balance, positively impact the target, indicating a higher likelihood of a 
           positive outcome with better credit history and more savings.
        
        3. **Length of Current Employment:** A coefficient of 0.1861 shows that longer employment
           tenure slightly increases the target's likelihood.
        
        4. **Instalment per cent:** This variable has a significant negative effect (-0.3518),
           suggesting higher installment percentages decrease the likelihood of the target being 1.
        
        Variables such as Purpose and Guarantors show no significant impact. Their high p-values
        indicate an effect that is indistinguishable from zero in this dataset. Sex & Marital
        Status, Most Valuable Available Asset, and Foreign Worker are on the borderline of
        significance. Their p-values, close to 0.05, suggest they might have a marginal effect.
        For example, Foreign Worker, with a coefficient of 1.2744, hints at a potentially
        significant influence, but the high p-value makes this inconclusive.
        
        In conclusion, the model's effectiveness is driven mainly by financial history and stability
        variables. In contrast, demographic and situational factors offer less clear contributions,
        pointing to potential areas for further data collection or research. The model appears to 
        have a reasonable fit and some strong significant predictors. 
        
        ### Recommendations:
        - Consider simplifying the model by removing non-significant variables. However, domain 
          knowledge and understanding of the context should guide whether to keep potentially 
          important predictors even if they are not statistically significant.
        - Evaluate multicollinearity, as correlated predictors can affect each other's significance
          levels.
        - Examine interactions between variables, as some variables might have a combined effect on
          the target.
        - Check the model against a validation set to assess its predictive performance and avoid 
          overfitting.
        - Assess the practical significance of coefficients in addition to statistical significance.
          Some variables with smaller coefficients might still have practical implications for the 
          outcome.
        
        """

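The example above recommends checking the model against a validation set, and cell 2 already holds out `X_test` and `y_test` without using them. A minimal sketch of such a check, assuming `Target` is coded 0/1 and a 0.5 cut-off is acceptable; `evaluate_holdout` is a hypothetical helper, and the toy inputs are not results from this model:

```python
def evaluate_holdout(y_true, y_prob, threshold=0.5):
    """Return accuracy and confusion counts for thresholded probabilities."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            counts["tp"] += 1
        elif predicted == 0 and actual == 0:
            counts["tn"] += 1
        elif predicted == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    total = sum(counts.values())
    accuracy = (counts["tp"] + counts["tn"]) / total if total else float("nan")
    return {"accuracy": accuracy, **counts}


# Toy inputs for illustration only; these are not results from the model above
toy_labels = [1, 0, 1, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.7, 0.6]
print(evaluate_holdout(toy_labels, toy_probs))
```

In this notebook the call would be along the lines of `evaluate_holdout(y_test, result.predict(X_test))`. A large gap between training fit and held-out accuracy would point to overfitting.
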
In [6]:
# Create an analysis
client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4-1106-preview",
  # response_format={ "type": "json_object" },
  messages=[
    # {"role": "system", "content": "You are a helpful assistant designed to output JSON."},
    {"role": "user", "content": "You are given the output of the logistic regression, which you have to analyse."},
    {"role": "user", "content": f"Present your answer using the spirit and formatting of this example: {example}"},
    {"role": "user", "content": "The answer should be in markdown format with row length below 100 characters"},
    {"role": "user", "content": f"General info about the model: {general_info}"},
    {"role": "user", "content": f"Specific information about individual risk drivers: {variables_info}"},
  ]
)

# Print the generated analysis from the first choice
print(response.choices[0].message.content)
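
The prompt asks for rows below 100 characters, but the model is not guaranteed to comply. A small sketch of a check that could be run on the returned text; `find_long_lines` is a hypothetical helper and not part of the OpenAI client:

```python
def find_long_lines(text, limit=100):
    """Return (line_number, length) pairs for lines of `limit` or more characters."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) >= limit
    ]


# Illustrative text only; in the notebook this would be the model's reply
sample = "short line\n" + "x" * 120 + "\nanother short line"
print(find_long_lines(sample))
```

Applied to `response.choices[0].message.content`, an empty list would mean every row is within the limit.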

```markdown
#### Logistic Regression Analysis Summary:

The logistic regression model provides insights into factors influencing the target variable.
Although the model is statistically significant, it presents a moderate fit to the data:

- **Model Fit**: A Pseudo R-squared value of 0.2021 indicates the model explains some, 
  but not all, variability in the target variable.
- **Significance**: The overall model is statistically significant, as evidenced by a 
  low LLR p-value (1.222e-30).

#### Key Risk Drivers:

1. **Account Balance**: Highly significant (p < 0.001) with a positive coefficient of 
   0.5627, suggesting a strong relationship where higher account balances are associated 
   with an increased probability of the target event occurring.
2. **Duration of Credit (month)**: Negative coefficient (-0.0223) with significance 
   (p = 0.020), indicating longer credit durations may marginally decrease the likelihood
   of the target event.
3. **Payment Status of Previous Credit