## Instructions {-}

1. Please answer the following questions as part of your project proposal.

2. Write your answers in the *Markdown* cells of the Jupyter notebook. You don't need to write any code, but if you want to, you may use the *Code* cells.

3. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to print the *.ipynb* file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

4. The project proposal is worth 8 points, and is due on **18th April 2023 at 11:59 pm**. 

5. You must make one submission as a group, and not individually.

6. Maintaining a GitHub repository is optional, though encouraged for the project.

7. Share the link of your project's GitHub repository [here](https://docs.google.com/spreadsheets/d/1khao3unpj_vsx4kOSg_Zzo77YK1UWL2w73Oa0aAirOo/edit#gid=0) (optional).

# 1) Team name

*(0 points)*

**Ashley's Phone is Not Lost**

# 2) Member names

*(0 points)*

Luca Moretti, Kaylee Mo, Nicket Mauskar, Ashley Witarsa

# 3) Link to the GitHub repository (optional)

*(0 points)*

N/A

# 4) Topic

*(0 points)*

**Predicting individual credit scores from multiple credit-related predictors using multiple linear regression, decision trees, bagging, random forests, gradient boosting, and other modeling approaches.**

# 5) Problem statement

*(4 points)*

```python
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Assign each observation a random numeric rating, drawn uniformly from the
# range that corresponds to its Credit_Score class.
train['Credit_Rating'] = 0.0
train['Credit_Rating'] = np.where(train['Credit_Score'] == 'Good',
                          np.random.uniform(670, 730, size=len(train)),
                          train['Credit_Rating'])

train['Credit_Rating'] = np.where(train['Credit_Score'] == 'Standard',
                          np.random.uniform(580, 670, size=len(train)),
                          train['Credit_Rating'])

train['Credit_Rating'] = np.where(train['Credit_Score'] == 'Poor',
                          np.random.uniform(0, 580, size=len(train)),
                          train['Credit_Rating'])

y_true = train['Credit_Rating']

# Naive baseline: predict the mean rating for every observation.
mean_value = np.mean(y_true)
y_naive = [mean_value] * len(y_true)

# Take the square root of the MSE to obtain the RMSE. This avoids the
# `squared` argument, which recent scikit-learn releases deprecate.
naive_rmse = np.sqrt(mean_squared_error(y_true, y_naive))
```

The problem we aim to address in this research project is the accurate prediction/classification of credit scores based on customer credit information and history. Credit scores are a crucial factor in determining an individual's financial health, and lenders heavily rely on them to assess creditworthiness and risk. However, credit score calculations can be complex and multifaceted, taking into account a range of factors such as payment history, outstanding debts, credit utilization, and length of credit history. Our goal is to build a reliable model that can accurately classify credit scores based on customer information, providing lenders with a tool to make more informed decisions and ultimately helping individuals to secure better financial outcomes.

So far, we have only learned binary classification models, whereas the credit scores in our dataset fall into three classes: Poor, Standard, and Good. To create regression models with a continuous response, we are prepared to assign each observation a random credit rating within the range for its class: 0-580 for Poor, 580-670 for Standard, and 670-730 for Good. However, we are confident that the decision trees we are learning in class, along with material later in the course, will support classification with more than two classes. Thus, we expect to use classification rather than regression.

If we predict using classification, the False Positive Rate (FPR) and True Positive Rate (TPR) will be our key metrics: FPR captures the risk of lending money to someone who may not be able to repay it, and TPR captures how effectively the credit scoring system identifies reliable borrowers. We will therefore focus on these metrics and plot ROC curves with their AUC.
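Because the response has three classes rather than two, TPR and FPR need a convention for what counts as "positive." The sketch below assumes a one-vs-rest convention, in which each class in turn is treated as positive and the other two as negative. The helper `one_vs_rest_rates` and the labels are hypothetical and made up for illustration; they are not taken from the Kaggle data, and plain Python is used only to keep the example dependency-free.

```python
def one_vs_rest_rates(y_true, y_pred, positive):
    """Return (TPR, FPR), treating `positive` as the positive class
    and every other class as negative (one-vs-rest)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr


# Made-up labels for illustration only (assumption, not real data).
y_true = ["Good", "Good", "Standard", "Standard", "Standard", "Poor", "Poor", "Poor"]
y_pred = ["Good", "Standard", "Standard", "Good", "Standard", "Poor", "Standard", "Poor"]

rates = {
    label: one_vs_rest_rates(y_true, y_pred, label)
    for label in ("Poor", "Standard", "Good")
}

for label, (tpr, fpr) in rates.items():
    print(f"{label:>8}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

For a lender, the "Good" row is likely the most relevant: its FPR is the share of Poor or Standard borrowers that the model would wrongly label as Good.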

```python
print(f'RMSE of the naive model: {naive_rmse:.2f}')
```

```
RMSE of the naive model: 186.59
```


Because we still need to learn more about decision trees and other more complex models, we will use regression for now. We used [Investopedia's FICO scoring](https://www.investopedia.com/terms/f/ficoscore.asp) to determine the ranges for the "Poor," "Standard," and "Good" credit score ratings.

We will use RMSE as the error metric. The naive model, which predicts the mean rating for every observation, has an RMSE of 186.59, and we aim to reduce this. If we choose classification instead, we will use accuracy as our metric.
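As a small worked example of the two baselines: a model that always predicts the mean has an RMSE equal to the population standard deviation of the response, and the analogous classification baseline always predicts the most common class. The sketch below uses made-up numbers and labels (assumptions for illustration, not values from the dataset), and the names `rmse`, `toy_ratings`, and `toy_labels` are hypothetical.

```python
import math
import statistics
from collections import Counter


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up ratings for illustration only (assumption, not real data).
toy_ratings = [710.0, 695.0, 640.0, 600.0, 590.0, 450.0, 300.0, 120.0]

# Regression baseline: predict the mean rating for every observation.
mean_rating = statistics.fmean(toy_ratings)
toy_naive_rmse = rmse(toy_ratings, [mean_rating] * len(toy_ratings))

# The RMSE of the mean predictor equals the population standard deviation.
print(math.isclose(toy_naive_rmse, statistics.pstdev(toy_ratings)))

# Classification baseline: always predict the most common class.
toy_labels = ["Standard"] * 4 + ["Poor"] * 3 + ["Good"]
majority_label, majority_count = Counter(toy_labels).most_common(1)[0]
baseline_accuracy = majority_count / len(toy_labels)
print(majority_label, baseline_accuracy)
```

Any model we fit should beat the corresponding baseline: a lower RMSE than the mean predictor for regression, or a higher accuracy than the majority-class predictor for classification.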

# 6) Data

*(1 point)*

Our data come from an open dataset on Kaggle. The dataset provides 100,000 observations, with 18 continuous predictors and four categorical predictors that will be useful in predicting individual credit scores. The response is the column `Credit_Score`. Some of the predictors include `Credit_History_Age`, `new credit`, `Credit_Mix`, `Credit_Utilization_Ratio`, and `Num_of_Delayed_Payment`, among many others that will need to be assessed. As it currently stands, we plan to use only one dataset, though this may change. The Kaggle dataset is linked below:

https://www.kaggle.com/datasets/parisrohan/credit-score-classification

# 7) Existing solutions

*(1 point)*

There is one existing solution to our problem. It uses logistic regression, random forest, k-nearest neighbors, and decision trees. The highest model accuracy is 77.9%, achieved with random forest. We plan to improve and tune these models, and to use MARS, bagging, Lasso/Ridge/stepwise selection, and the other models we will learn in STAT 303-3, to achieve a higher accuracy than the existing solution.

# 8) Stakeholders

*(1 point)*

**Lenders**: By having a reliable model to classify credit scores, lenders can more accurately assess creditworthiness, reduce the risk of default, and make more informed lending decisions.

**Borrowers**: With a better understanding of their creditworthiness, borrowers may be able to secure better loan terms, including lower interest rates, higher credit limits, and more favorable repayment terms.

**Credit bureaus**: Accurate credit score models can help credit bureaus improve their credit scoring algorithms, resulting in more accurate credit scores for consumers.

**Regulators**: Regulators may benefit because a more accurate credit scoring model promotes greater transparency and fairness in lending practices.

# 9) Work-split

*(This question is answered for you)*

How do you plan to split the project work amongst individual team members?

We will learn to develop and tune the following models in the STAT303 sequence:

1. MARS

2. Decision trees with cost-complexity pruning

3. Bagging (Bagging MARS / decision trees)

4. Random Forests

5. AdaBoost

6. Gradient boosting

7. XGBoost

8. Lasso / Ridge / Stepwise selection 

Each team member is required to develop and tune at least one of the above models. In the end, the team will combine all the developed models to create a model more accurate than each of the individual models.
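The text above does not specify how the individual models will be combined. One simple possibility, shown here purely as an assumption, is majority voting over each model's predicted class. The function `majority_vote` and the per-model predictions below are hypothetical and made up for illustration.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine label predictions from several models by majority vote.

    `predictions_per_model` is a list with one prediction list per model,
    all of the same length. Ties go to the label that appears first among
    the models' votes for that observation.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Made-up predictions from three hypothetical models (assumption, not real output).
tree_preds = ["Good", "Poor", "Standard", "Poor"]
forest_preds = ["Good", "Standard", "Standard", "Poor"]
boosting_preds = ["Standard", "Standard", "Good", "Poor"]

ensemble_preds = majority_vote([tree_preds, forest_preds, boosting_preds])
print(ensemble_preds)
```

Voting is only one option. Averaging predicted probabilities, or fitting a second-stage model on the individual models' predictions, are alternatives the team could consider.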

*(0 points)*