
Logistic Regression #31

Closed
ijyliu opened this issue Feb 29, 2024 · 28 comments

ijyliu commented Feb 29, 2024

Use all_data_fixed_quarters.parquet.

Predict credit rating. You should be able to use the financial features (you can use several, or just the variable Altman_Z), as well as the Sector variable, as these are in the data. We might also be able to add a few NLP features if those are done in time.

Evaluate accuracy of prediction. Create confusion matrix if time allows.
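The task described above might be sketched roughly like this (a minimal sketch: the column names `Altman_Z`, `Sector`, and `Rating` come from this thread, but the toy data is invented and stands in for all_data_fixed_quarters.parquet):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy stand-in for all_data_fixed_quarters.parquet
df = pd.DataFrame({
    'Altman_Z': [1.2, 3.5, 0.8, 2.9, 4.1, 0.5, 3.2, 1.1],
    'Sector':   ['Tech', 'Energy', 'Tech', 'Energy', 'Tech', 'Energy', 'Tech', 'Energy'],
    'Rating':   ['B', 'A', 'C', 'A', 'A', 'C', 'A', 'B'],
})

# One-hot encode the categorical Sector variable
X = pd.get_dummies(df[['Altman_Z', 'Sector']], columns=['Sector'], drop_first=True)
y = df['Rating']

# Multinomial logit (sklearn's LogisticRegression handles more than two classes)
model = LogisticRegression(max_iter=1000).fit(X, y)
acc = accuracy_score(y, model.predict(X))
print(f"in-sample accuracy: {acc:.2f}")
```

On the real data, the accuracy would of course be evaluated on a held-out split rather than in-sample.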

ijyliu commented Mar 2, 2024

@current12 you can start writing code for a multinomial logit for this. the variables will be the same as the existing file even if we swap out the earnings calls

current12 (Owner) commented:

> @current12 you can start writing code for a multinomial logit for this. the variables will be the same as the existing file even if we swap out the earnings calls

Sure!

ijyliu commented Mar 4, 2024

Comments on https://github.com/current12/Stat-222-Project/blob/main/Code/simple_regression.ipynb

print out all variables in the dataset at the top of the code for reference

You can use Rating because the rating is the rating on the fixed quarter date, and the earnings call and financial data are from before that. (you can do next rating also but that's more of an extra thing)

for the change prediction, i'd run with the upgrade v. downgrade v. constant variable rather than the number of notches of the change

for predictors, I'd run with just Altman_Z at first. Be careful with throwing too many variables in; a lot are collinear. if you do use a bunch of vars, also do a run setting penalty to 'l1' (LASSO penalty) and solver to 'liblinear'. print out the variables you do end up including.

for each prediction, show the share of the majority class as a baseline

on average, are our predictions too positive (predicted rating too high) or negative?
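The majority-class baseline mentioned above is just the share of the most common label, i.e. the accuracy of a model that always predicts that class (toy labels below are invented):

```python
import pandas as pd

# The baseline: accuracy you'd get by always predicting the most common rating
ratings = pd.Series(['A', 'A', 'B', 'A', 'C', 'B', 'A', 'A'])  # toy labels
majority_share = ratings.value_counts(normalize=True).iloc[0]
print(f"majority-class baseline accuracy: {majority_share:.2f}")  # 5/8 = 0.62
```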

ijyliu commented Mar 5, 2024

from sklearn.linear_model import LogisticRegression
lr_bal = LogisticRegression(random_state=42, class_weight='balanced')

https://analyticsindiamag.com/handling-imbalanced-data-with-class-weights-in-logistic-regression/
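For reference, class_weight='balanced' reweights each class by n_samples / (n_classes * n_samples_in_class), which is the heuristic sklearn documents; it can be reproduced directly (toy labels below are invented):

```python
import numpy as np

# Reproduce the 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # imbalanced toy labels
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)
print(dict(zip(classes.tolist(), weights.tolist())))  # minority class gets the larger weight
```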

ijyliu commented Mar 5, 2024

you can add Sector to the regression too as a categorical
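One way to add Sector as a categorical is dummy (one-hot) encoding; a minimal sketch (the sector names below are invented):

```python
import pandas as pd

# One-hot encode Sector so it can enter the logistic regression as dummies
df = pd.DataFrame({'Sector': ['Tech', 'Energy', 'Utilities', 'Tech']})  # toy data
dummies = pd.get_dummies(df['Sector'], prefix='Sector', drop_first=True)
print(list(dummies.columns))  # ['Sector_Tech', 'Sector_Utilities']
```

drop_first=True drops one reference category (here the alphabetically first, 'Energy') to avoid perfect collinearity with the intercept.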

ijyliu changed the title from "Regression for Presentation 3/5" to "Baseline Logistic Regression" on Mar 6, 2024

ijyliu commented Mar 8, 2024

@current12 i'd suggest continuing to work on this as you have time. definitely add a one-hot encoding of Sector. also, it'd be nice if we had section headings (created using ## in markdown) describing each model/section in the notebook so it's easy to scroll through and find stuff. finally, i'd keep adding combinations of the settings (with/without l1, weight balance, others) for all of the groupings of variables (different X vars, different Y vars) as appropriate

ijyliu commented Mar 12, 2024

consider writing a function to reduce code repetition and allow us to easily explore all relevant combinations of settings (l1 vs. l2, different X variable datasets, etc)
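Such a helper might look like this (a hypothetical sketch; the function name, arguments, and defaults are invented, not from the notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def fit_and_report(X, y, penalty='l2', C=1.0, class_weight=None):
    """Fit one logistic regression configuration and report its accuracy."""
    # liblinear is the solver that supports an l1 penalty here
    solver = 'liblinear' if penalty == 'l1' else 'lbfgs'
    model = LogisticRegression(penalty=penalty, C=C, class_weight=class_weight,
                               solver=solver, max_iter=1000).fit(X, y)
    acc = accuracy_score(y, model.predict(X))
    print(f"penalty={penalty}, C={C}, class_weight={class_weight}: accuracy={acc:.3f}")
    return model
```

Each model run then collapses to a single call, e.g. fit_and_report(X, y, penalty='l1', class_weight='balanced').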

current12 (Owner) commented:

done

ijyliu commented Mar 15, 2024

function looks good

you should add more to it, including confusion matrices. then move it to right above "2. Model". after this function, there will be a minimal amount of code in the rest of the notebook, just headings like this

[image: screenshot of example markdown section headings]

and then minimal code for function settings, printing the variable names if needed, setting arguments, then a function call

and then another section for the next model, etc.

ijyliu commented Mar 22, 2024

use variable 'train_test_80_20' as train-test split

#54

ijyliu commented Mar 25, 2024

I suggest adding a calculation of the share of cases where predicted rating is 1 or fewer ratings away from the actual one. And also, the share of cases that have a predicted rating in the same grade (A, B, C, D) as their actual one.
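Those two shares can be computed directly from a notch ordering (the rating scale below is an assumed example, not necessarily the project's exact one, and the actual/predicted labels are invented):

```python
# Share of predictions within one notch of the actual rating,
# and share landing in the same letter grade (A, B, C, D)
order = ['AAA', 'AA', 'A', 'BBB', 'BB', 'B', 'CCC', 'CC', 'C', 'D']  # assumed scale
notch = {r: i for i, r in enumerate(order)}

actual    = ['AA', 'BBB', 'B', 'CCC', 'A']   # toy labels
predicted = ['A',  'BB',  'B', 'C',   'AAA']

within_one = sum(abs(notch[a] - notch[p]) <= 1
                 for a, p in zip(actual, predicted)) / len(actual)
same_grade = sum(a[0] == p[0] for a, p in zip(actual, predicted)) / len(actual)
print(f"within one notch: {within_one:.2f}, same letter grade: {same_grade:.2f}")
# within one notch: 0.60, same letter grade: 1.00
```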

ijyliu commented Mar 31, 2024

Reminder to update to using new dataset

# Limit to items in the finalized dataset
# list of files in '../../../Data/All_Data/All_Data_with_NLP_Features'
import os
import pandas as pd
file_list = [f for f in os.listdir(r'../../../Data/All_Data/All_Data_with_NLP_Features') if f.endswith('.parquet')]
# read in all parquet files
df = pd.concat([pd.read_parquet(r'../../../Data/All_Data/All_Data_with_NLP_Features/' + f) for f in file_list])

ijyliu commented Mar 31, 2024

I actually suggest using grid search for a variety of parameter settings instead of doing the functions.

Example code attached.
Logistic Regression Grid Search Example Code.zip

ijyliu commented Apr 1, 2024

We also need insight into variable importance. So please add a permutation test (look at drop in accuracy when you randomly permute a feature), coefficient significance, or something else specific to logistic regression.
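For the permutation version, sklearn's permutation_importance measures the drop in accuracy when each feature is shuffled; a sketch on synthetic data (the data-generating setup is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

# Toy data standing in for the real feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)  # feature 0 matters most

model = LogisticRegression().fit(X, y)
result = permutation_importance(model, X, y, scoring='accuracy',
                                n_repeats=10, random_state=0)
# Mean drop in accuracy when each feature is shuffled; feature 0 should dominate
print(result.importances_mean)
```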

current12 (Owner) commented:

I just uploaded the latest version.
For the grid searching part, as the current model is only a baseline and the performance is not bad, I think we can omit the grid search to find the best parameter.
I'll work on coefficient significance tomorrow.

ijyliu commented Apr 1, 2024

fixed file paths and moved to https://github.com/current12/Stat-222-Project/tree/main/Code/Modelling/Logistic%20Regression

I do think grid search is important. I'm pretty sure they're expecting us to explain hyperparameter choices and the bias-variance tradeoff (they mentioned this in lecture several times), and for l1/l2/elasticnet logistic regression that means tuning C (the inverse of lambda). you can do it on SCF if it gets too slow; you don't even have to explicitly parallelize anything other than setting n_jobs=-1 (see below), and you can request as many CPUs as you want (you could also restrict the solver to 'saga'). below is basically the only code you need

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import joblib

hyperparameter_settings = [
    # Non-penalized
    {'solver': ['newton-cg', 'lbfgs', 'sag', 'saga'], 
     'penalty': [None], 
     'C': [1],  # C is irrelevant here but required as a placeholder
     'class_weight': [None, 'balanced'], 
     'multi_class': ['ovr', 'multinomial']},
    # ElasticNet penalty
    {'solver': ['saga'], 
     'penalty': ['elasticnet'], 
     'C': [0.001, 0.01, 0.1, 1, 10, 100], 
     'l1_ratio': [0.0, 0.25, 0.5, 0.75, 1.0], 
     'class_weight': [None, 'balanced'], 
     'multi_class': ['ovr', 'multinomial']}
]

# Fit model
# Perform grid search with 5 fold cross validation
lr = LogisticRegression(max_iter=1000) # higher to encourage convergence
gs = GridSearchCV(lr, hyperparameter_settings, scoring='accuracy', cv=5, n_jobs=-1).fit(X, y)

print("tuned hyperparameters: ", gs.best_params_)
print("accuracy: ", gs.best_score_)
print("best model: ", gs.best_estimator_)

# Dump the best model to a file
joblib.dump(gs.best_estimator_, 'Best Logistic Regression Model.joblib')

let's not do any train-test splitting in this code and use the 'train_test_split_80_20' variable always. if we decide to fix the split to include every class (#54) we will do that upstream

for each model, I suggest saving an Excel file with accuracy, precision, F1, etc. and also the plot of the confusion matrix for us to use in the writeups. for the table you can just use the classification_report built-in to sklearn

from sklearn.metrics import classification_report

you can output a table of the coefficient significance to Excel also
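Getting the classification report into Excel might look like this (toy labels are invented; the output path is hypothetical, and the write line is left commented since .xlsx output needs openpyxl installed):

```python
import pandas as pd
from sklearn.metrics import classification_report

y_true = ['A', 'B', 'A', 'C', 'B', 'A']  # toy labels
y_pred = ['A', 'B', 'B', 'C', 'B', 'A']

# output_dict=True makes the report easy to turn into a DataFrame
report = classification_report(y_true, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose()
# report_df.to_excel('classification_report.xlsx')  # hypothetical path; needs openpyxl
print(report_df)
```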

ijyliu changed the title from "Baseline Logistic Regression" to "Logistic Regression" on Apr 1, 2024
current12 (Owner) commented:

I've uploaded the latest one with grid search and results in the notebook

ijyliu commented Apr 1, 2024

it looks like your code doesn't have class D. did you git pull before running? be sure to do that so you have the latest version of the data.

it also looks like some of the model runs still use test_size. just delete the code that has the setting from the functions and fix any errors that arise

please do try to use the '../..' relative paths so that code is runnable on other people's machines (this might mean you have to run the notebook from the folder it is located in).

other than that i think we can go ahead and get set up to produce all the outputs: excel file of the classification report for each run, excel file of coefficient significance, png of confusion matrix. you can put these in Output/Modelling and create a Logistic Regression folder there.

ijyliu commented Apr 2, 2024

i produced this, which should be helpful to you in several ways.

https://github.com/current12/Stat-222-Project/blob/main/Code/Data%20Loading%20and%20Cleaning/All%20Data/Variable%20Index.xlsx

first, it has a match from the raw variable names to a nicely formatted version suitable for tables and figures.

second, it has information on how we should use each variable. you can use this to refine your variable groupings and decide what to include/exclude. don't include things that are disallowed and don't include Predicted - Change when you are predicting ratings and vice versa.

you can load it in as dataframe and/or create a dictionary mapping or whatever to make it easy to rename variables and pick variables for models.

# Load '../../Data Loading and Cleaning/All Data/Variable Index.xlsx'
import pandas as pd
var_index = pd.read_excel('../../Data Loading and Cleaning/All Data/Variable Index.xlsx')
# Keep column_name and 'Clean Column Name'
var_index = var_index[['column_name', 'Clean Column Name']]
# Create dictionary mapping column_name to Clean Column Name
var_index_dict = dict(zip(var_index['column_name'], var_index['Clean Column Name']))
var_index_dict

current12 (Owner) commented:

> it looks like your code doesn't have class D. did you git pull before running? be sure to do that so you have the latest version of the data.
>
> it also looks like some of the model runs still use test_size. just delete the code that has the setting from the functions and fix any errors that arise
>
> please do try to use the '../..' relative paths so that code is runnable on other people's machines (this might mean you have to run the notebook from the folder it is located in).
>
> other than that i think we can go ahead and get set up to produce all the outputs: excel file of the classification report for each run, excel file of coefficient significance, png of confusion matrix. you can put these in Output/Modelling and create a Logistic Regression folder there.

what is class D?

ijyliu commented Apr 2, 2024

I meant rating D

ijyliu commented Apr 2, 2024

@current12

  • create separate code file (.py or .ipynb to run in parallel on SCF). the code file should output everything possible and save to the folder - hyperparameters, classification report, all metrics, confusion matrices
  • add to this file something for feature importance on the most complicated model - coefficients and significance version, permutation and accuracy drop version
  • confusion matrix for the most complex model
  • changes model - confusion matrix

table output for reports (Excel + LaTeX) constructed in separate individual files: @OwenLin2001 and @ijyliu can assist after they do readme/cleaning step stuff + share other helpful report things

  • table with row for each model (including the change in rating model) and columns for accuracy, everything else in the last row (weighted avg) of classification report, and share 1 rating or less away from actual. plus one row with majority baseline and the number filled in for the accuracy column. (Isaac to mockup)
  • for the most complex model, the entire classification report cleaned up and in Excel, + row for majority baseline + row for share 1 or fewer away from correct
  • table of hyperparameters selected for most complex model (with one row)
  • changes model - table with one row with accuracy, f1, majority class baseline. then in your reporting, mention we could change objective to f1 or force rebalancing to stop predictions from focusing on majority class

ijyliu commented Apr 3, 2024

mockups attached and also in the folder

Table Mockups.xlsx

ijyliu commented Apr 3, 2024

@current12 let us know once you have run all the models on the most recent data and saved everything and we can start working on output

ijyliu commented Apr 3, 2024

see #31 (comment)

current12 (Owner) commented:

> @current12 let us know once you have run all the models on the most recent data and saved everything and we can start working on output

Owen helped me run the output; he will upload the result.

ijyliu commented Apr 3, 2024

just pushed a fix to the feature columns. (using rating_on_previous_fixed_quarter_date, not previous rating, which is the last rating in the raw credit data). this will probably improve accuracy.

please see the attached Variable Index.xlsx

we may also want to exclude the items that are Constructed for Tone, because those are PCA'd into TONE1 through a linear combination (at least PN, SW, AP, OU are; the others enter into these as ratios, so maybe there's value in keeping them)

the items marked as metadata are allowed but not super-well explored yet, probably don't want to add those right now.

ijyliu commented Apr 8, 2024

code completed, pending edits to underlying variables

@ijyliu ijyliu closed this as completed Apr 8, 2024