Logistic Regression #31
@current12 you can start writing code for a multinomial logit for this. The variables will be the same as the existing file even if we swap out the earnings calls.
Sure!
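A minimal multinomial-logit sketch in scikit-learn, on synthetic stand-in data (the real features and rating labels come from the project dataset; shapes and class counts here are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))         # stand-in for financial features
y = rng.integers(0, 3, size=300)      # stand-in for rating classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# the default lbfgs solver fits a true multinomial model for multiclass targets
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```

With three classes and four features, `clf.coef_` has one row of coefficients per class.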
Comments on https://github.com/current12/Stat-222-Project/blob/main/Code/simple_regression.ipynb print out all variables in the dataset at the top of the code for reference You can use for the change prediction, i'd run with the upgrade v. downgrade v. constant variable rather than the number of notches of the change for predictors, I'd run with just for each prediction, show the share of the majority class as a baseline on average, are our predictions too positive (predicted rating too high) or negative? |
https://analyticsindiamag.com/handling-imbalanced-data-with-class-weights-in-logistic-regression/
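For reference, scikit-learn has this built in via `class_weight='balanced'`; a small sketch on synthetic imbalanced data (the real class imbalance is in the rating labels, so the numbers here are only illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = (rng.random(400) < 0.1).astype(int)   # imbalanced: roughly 10% positives

# 'balanced' weights each class by n_samples / (n_classes * class_count),
# so the minority class gets the larger weight
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
clf = LogisticRegression(class_weight="balanced").fit(X, y)
print(dict(zip([0, 1], weights)))
```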
You can add Sector to the regression too as a categorical.
@current12 I'd suggest continuing to work on this as you have time. Definitely add a one-hot encoding of Sector. Also, it'd be nice if we had sector headings (created using
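One way to one-hot encode Sector is `pandas.get_dummies`; a sketch on a hypothetical toy frame (the real data's sector values will differ):

```python
import pandas as pd

# hypothetical stand-in rows; the real data has a 'Sector' column
df = pd.DataFrame({"Sector": ["Tech", "Energy", "Tech", "Utilities"],
                   "Altman_Z": [3.1, 1.8, 2.9, 2.2]})

# drop_first avoids the dummy-variable trap (one category becomes the baseline)
X = pd.get_dummies(df, columns=["Sector"], drop_first=True)
print(X.columns.tolist())
```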
Consider writing a function to reduce code repetition and allow us to easily explore all relevant combinations of settings (l1 vs. l2, different X variable datasets, etc.).
done
The function looks good. You should add more to it, including confusion matrices, then move it to right above "2. Model". After this function there will be a minimal amount of code in the rest of the notebook: just headings like this, then minimal code for function settings (printing the variable names if needed, setting arguments), then a function call, and then another section for the next model, etc.
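A sketch of what such a runner function could look like (the function name, settings, and synthetic data here are assumptions; the real call would take the project's feature sets and the agreed train/test split):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

def run_model(X_train, y_train, X_test, y_test, penalty="l2", C=1.0):
    """Fit a logistic regression with the given settings and report results."""
    # liblinear supports l1 but is one-vs-rest for multiclass;
    # for a multinomial l1 fit you'd switch to solver='saga'
    solver = "liblinear" if penalty == "l1" else "lbfgs"
    clf = LogisticRegression(penalty=penalty, C=C, solver=solver, max_iter=1000)
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, preds),
        "majority_baseline": np.bincount(y_test).max() / len(y_test),
        "confusion_matrix": confusion_matrix(y_test, preds),
    }

# illustrative call on synthetic data
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, 200)
res = run_model(X[:150], y[:150], X[150:], y[150:])
```

Each model section then reduces to setting the arguments and making one call.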
Use the variable 'train_test_80_20' as the train-test split.
I suggest adding a calculation of the share of cases where the predicted rating is one or fewer notches away from the actual one, and also the share of cases whose predicted rating is in the same grade (A, B, C, D) as the actual one.
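Both metrics are easy once the ratings are mapped to an ordinal index; a sketch with a hypothetical rating scale (the project's actual scale and notch ordering may differ):

```python
import numpy as np

# hypothetical ordinal rating scale, best to worst
notches = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "D"]
idx = {r: i for i, r in enumerate(notches)}
grade = {r: r[0] for r in notches}        # first letter = broad grade (A/B/C/D)

def within_one(actual, predicted):
    """Share of predictions at most one notch from the actual rating."""
    return np.mean([abs(idx[a] - idx[p]) <= 1 for a, p in zip(actual, predicted)])

def same_grade(actual, predicted):
    """Share of predictions landing in the same letter grade."""
    return np.mean([grade[a] == grade[p] for a, p in zip(actual, predicted)])

actual = ["AA", "A", "BBB", "D"]
predicted = ["A", "A", "BB", "CCC"]
print(within_one(actual, predicted))   # 1.0  (every prediction is <=1 notch off)
print(same_grade(actual, predicted))   # 0.75 (D vs. CCC crosses a grade boundary)
```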
Reminder to update to the new dataset.
|
I actually suggest using grid search over a variety of parameter settings instead of doing the functions. Example code attached.
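Along the lines of the attached example, a `GridSearchCV` sketch over penalty type and C on synthetic stand-in data (the grid values are illustrative, not the project's final settings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 3, 300)

# C is the inverse regularization strength; saga supports both l1 and l2
grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]}
search = GridSearchCV(
    LogisticRegression(solver="saga", max_iter=5000),
    grid, cv=3, scoring="accuracy", n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```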
We also need insight into variable importance, so please add a permutation test (look at the drop in accuracy when you randomly permute a feature), coefficient significance, or something else specific to logistic regression.
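scikit-learn ships the permutation approach directly as `permutation_importance`; a sketch on synthetic data where one feature is deliberately informative:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3))
# feature 0 drives the label, so it should show the largest importance
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
# mean drop in accuracy over 20 random shuffles of each feature column
result = permutation_importance(clf, X, y, n_repeats=20, random_state=0,
                                scoring="accuracy")
print(result.importances_mean)
```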
I just uploaded the latest version. |
Fixed file paths and moved to https://github.com/current12/Stat-222-Project/tree/main/Code/Modelling/Logistic%20Regression. I do think grid search is important: I'm pretty sure they're expecting us to explain hyperparameter choices and the bias-variance tradeoff (they mentioned this in lecture several times), and for l1/l2/elasticnet logistic regression that means setting C (the inverse of lambda). You can do it on SCF if it gets too slow; you don't even have to explicitly parallelize anything other than setting
Let's not do any train-test splitting in this code; use the 'train_test_split_80_20' variable always. If we decide to fix the split to include every class (#54), we will do that upstream.

For each model, I suggest saving an Excel file with accuracy, precision, F1, etc., and also the plot of the confusion matrix, for us to use in the writeups. For the table you can just use the classification_report built in to sklearn.
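A sketch of that export step: `classification_report(output_dict=True)` converts cleanly to a DataFrame that pandas can write to Excel, and `ConfusionMatrixDisplay` handles the plot (the y_true/y_pred arrays and file names here are placeholders; writing .xlsx requires openpyxl):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

# placeholder labels; in the notebook these come from the fitted model
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 2, 1, 1])

rep = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
acc = rep.pop("accuracy")          # scalar; keep separate from per-class rows
report = pd.DataFrame(rep).T       # rows: classes + macro/weighted averages
# report.to_excel("logit_report.xlsx")                       # uncomment to save
# ConfusionMatrixDisplay.from_predictions(y_true, y_pred) \
#     .figure_.savefig("confusion_matrix.png")               # uncomment to save
print(report)
```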
You can output a table of the coefficient significance to Excel also.
I've uploaded the latest one with grid search and results in the notebook |
It looks like your code doesn't have class D. Did you git pull before running? Be sure to do that so you have the latest version of the data. It also looks like some of the model runs still use absolute paths; please do try to use the '../..' relative paths so that the code is runnable on other people's machines (this might mean you have to run the notebook from the folder it is located in). Other than that, I think we can go ahead and get set up to produce all the outputs: an Excel file of the classification report for each run, an Excel file of coefficient significance, and a PNG of the confusion matrix. You can put these in
I produced this, which should be helpful to you in several ways. First, it has a match from the raw variable names to a nicely formatted version suitable for tables and figures. Second, it has information on how we should use each variable. You can use this to refine your variable groupings and decide what to include/exclude; don't include things that are disallowed. You can load it in as a dataframe and/or create a dictionary mapping or whatever to make it easy to rename variables and pick variables for models.
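A sketch of that loading step, using a hypothetical stand-in frame (the actual column names and rows of Variable Index.xlsx are assumptions; in practice you'd `pd.read_excel` the real file):

```python
import pandas as pd

# stand-in for Variable Index.xlsx; real file: var_index = pd.read_excel(...)
var_index = pd.DataFrame({
    "raw_name": ["Altman_Z", "total_debt", "mgmt_tone"],       # hypothetical
    "display_name": ["Altman Z-Score", "Total Debt", "Management Tone"],
    "use_in_model": [True, True, False],
})

# dict for renaming columns in tables/figures
rename_map = dict(zip(var_index["raw_name"], var_index["display_name"]))
# list of variables allowed into the models
model_vars = var_index.loc[var_index["use_in_model"], "raw_name"].tolist()
print(model_vars)
```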
|
What is class D?
Rating D, I meant.
Table output for reports (Excel + LaTeX) to be constructed in separate individual files. @OwenLin2001 and @ijyliu can assist after they do the readme/cleaning-step stuff, plus share other helpful report things.
|
Mockups attached, and also in the folder.
@current12 let us know once you have run all the models on the most recent data and saved everything, and we can start working on output.
see #31 (comment) |
Owen helped me run the output; he will upload the result.
Just pushed a fix to the feature columns (using the attached Variable Index.xlsx, please see it). We may also want to exclude the items marked as
code completed, pending edits to underlying variables |
Use 'all_data_fixed_quarters.parquet'.

Predict credit rating. You should be able to use the financial features (can do several, or just the variable 'Altman_Z'), as well as the 'Sector' variable, as these are on the data. We might also be able to add a few NLP features if these are done in time.

Evaluate accuracy of prediction. Create a confusion matrix if time allows.