# Example usage

To use `autopredictor` in a project:

In [1]:
import autopredictor

print(autopredictor.__version__)

0.1.0


In [2]:
from autopredictor.fit import fit
from autopredictor.show_all import show_all
from autopredictor.bestscore import display_best_score
from autopredictor.select_model import select_model
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes

## Introduction

In real-world settings such as healthcare, selecting candidate models, choosing relevant metrics, inspecting individual models, and settling on the most suitable one is time-consuming, and the result can influence decision-making and patient outcomes. The `autopredictor` package streamlines these steps for machine learning on continuous targets (regression), so that healthcare professionals, data scientists, and researchers can spend more time on data interpretation and decision-making. The diabetes dataset used in this example stands in for real-world health data, and the workflow of training, evaluating, and selecting models shown here mirrors the steps followed in actual research settings.

In [13]:
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## `fit`

In [15]:
model_scores = fit(X_train, X_test, y_train, y_test, return_train=True)

Linear Regression trained.
Linear Regression (L1) trained.
Linear Regression (L2) trained.
Linear Support Vector Machine trained.
Support Vector Machine trained.
Decision Tree trained.
Random Forest trained.
Gradient Boosting trained.
AdaBoost trained.


In [16]:
model_scores

({'Linear Regression': {'Mean Absolute Error': 42.79409467959994,
   'Mean Absolute Percentage Error': 0.3749982636756113,
   'R2 Score': 0.4526027629719195,
   'Mean Squared Error': 2900.1936284934814,
   'Root Mean Squared Error': 53.85344583676593},
  'Linear Regression (L1)': {'Mean Absolute Error': 49.73032753662261,
   'Mean Absolute Percentage Error': 0.47112563453406076,
   'R2 Score': 0.3575918767219115,
   'Mean Squared Error': 3403.5757216070733,
   'Root Mean Squared Error': 58.340172450954185},
  'Linear Regression (L2)': {'Mean Absolute Error': 46.13885766697452,
   'Mean Absolute Percentage Error': 0.42569291627271477,
   'R2 Score': 0.41915292635986556,
   'Mean Squared Error': 3077.4159388272296,
   'Root Mean Squared Error': 55.47446204180109},
  'Linear Support Vector Machine': {'Mean Absolute Error': 63.37274571076944,
   'Mean Absolute Percentage Error': 0.43105080659205713,
   'R2 Score': -0.27917999088950207,
   'Mean Squared Error': 6777.289705398665,
   'Root M
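
The five metrics reported for each model follow standard definitions. The block below is a minimal pure-Python sketch of those definitions; `regression_metrics` is a hypothetical helper written for illustration and is not part of `autopredictor`, which may well compute the same quantities through scikit-learn.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute five common regression metrics from two equal-length sequences.

    Hypothetical helper for illustration; assumes no true value is zero (for MAPE)
    and that the true values are not all identical (for R2).
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    # MAPE is expressed as a fraction, so 0.37 means 37%.
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mse = sum(e * e for e in errors) / n

    # R2 compares the model's squared error with that of always predicting the mean.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot

    return {
        "Mean Absolute Error": mae,
        "Mean Absolute Percentage Error": mape,
        "R2 Score": r2,
        "Mean Squared Error": mse,
        "Root Mean Squared Error": math.sqrt(mse),
    }


scores = regression_metrics([100, 200, 300], [110, 190, 330])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Two relationships are worth keeping in mind when reading the results: the Root Mean Squared Error is the square root of the Mean Squared Error, and the R2 Score becomes negative when a model fits worse than a constant prediction of the mean, as happens for the Linear Support Vector Machine here.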

## `show_all`

### Visualizing model scores in a table format

After `fit` has run, the scores are available in dictionary format; with `return_train=True`, as in the call above, both the testing and the training scores are returned. The `show_all` function presents these regression model scores in a readable form: it transforms the raw scores into a structured DataFrame and ensures compatibility with subsequent functions such as `display_best_score` and `select_model`.

By converting the dictionary into an organized format and sorting the results alphabetically by model name, `show_all` offers a quick and efficient means of comprehending and comparing regression model performance. The tabular presentation enhances readability and makes it easier to locate a specific model's scores.

While it's possible to achieve similar conversions using pandas manipulation, `show_all` is purpose-built for this package, ensuring the validity of scoring metrics in the dictionary.
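
For comparison, the plain-pandas route mentioned above might look like the following sketch. The helper `scores_to_frame` and the `ABBREVIATIONS` mapping are assumptions made for illustration (the short column names are taken from the tables in this example); this is not the implementation of `show_all`, and it performs none of its validation.

```python
import pandas as pd

# Assumed mapping from the long metric names in the scores dictionary
# to the short column names seen in the tables of this example.
ABBREVIATIONS = {
    "Mean Absolute Error": "MAE",
    "Mean Absolute Percentage Error": "MAPE",
    "R2 Score": "R2",
    "Mean Squared Error": "MSE",
    "Root Mean Squared Error": "RMSE",
}


def scores_to_frame(scores):
    """Turn {model: {metric: value}} into a DataFrame sorted by model name."""
    frame = pd.DataFrame.from_dict(scores, orient="index")
    frame = frame.rename(columns=ABBREVIATIONS)
    # Fix the column order explicitly, then sort the rows alphabetically.
    return frame[list(ABBREVIATIONS.values())].sort_index()


# Two entries copied from the test scores shown earlier, deliberately out of order.
raw_scores = {
    "Linear Regression (L1)": {
        "Mean Absolute Error": 49.73032753662261,
        "Mean Absolute Percentage Error": 0.47112563453406076,
        "R2 Score": 0.3575918767219115,
        "Mean Squared Error": 3403.5757216070733,
        "Root Mean Squared Error": 58.340172450954185,
    },
    "Linear Regression": {
        "Mean Absolute Error": 42.79409467959994,
        "Mean Absolute Percentage Error": 0.3749982636756113,
        "R2 Score": 0.4526027629719195,
        "Mean Squared Error": 2900.1936284934814,
        "Root Mean Squared Error": 53.85344583676593,
    },
}

table = scores_to_frame(raw_scores)
print(table)
```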


In [17]:
scores_train = show_all(model_scores[1])  # index 1 holds the training scores

|                               |     MAE |     MAPE |        R2 |      MSE |    RMSE |
|-------------------------------|---------|----------|-----------|----------|---------|
| AdaBoost                      | 41.4908 | 0.387053 |  0.623656 | 2286.82  | 47.8207 |
| Decision Tree                 |  0      | 0        |  1        |    0     |  0      |
| Gradient Boosting             | 25.3517 | 0.227698 |  0.835903 |  997.121 | 31.5772 |
| Linear Regression             | 43.4835 | 0.389199 |  0.527919 | 2868.55  | 53.5588 |
| Linear Regression (L1)        | 52.9588 | 0.495438 |  0.364631 | 3860.75  | 62.135  |
| Linear Regression (L2)        | 48.8052 | 0.450646 |  0.442403 | 3388.18  | 58.2081 |
| Linear Support Vector Machine | 70.6389 | 0.466293 | -0.35379  | 8226.16  | 90.6982 |
| Random Forest                 | 17.675  | 0.155916 |  0.921401 |  477.601 | 21.8541 |
| Support Vector Machine        | 58.6858 | 0.494582 |  0.166804 | 5062.83  | 71.1536 |


In [18]:
scores_test = show_all(model_scores[0])  # index 0 holds the testing scores

|                               |     MAE |     MAPE |         R2 |     MSE |    RMSE |
|-------------------------------|---------|----------|------------|---------|---------|
| AdaBoost                      | 45.1543 | 0.433353 |  0.434052  | 2998.48 | 54.7584 |
| Decision Tree                 | 57.3034 | 0.473769 | -0.0347132 | 5482.07 | 74.041  |
| Gradient Boosting             | 44.6178 | 0.400537 |  0.450899  | 2909.22 | 53.9372 |
| Linear Regression             | 42.7941 | 0.374998 |  0.452603  | 2900.19 | 53.8534 |
| Linear Regression (L1)        | 49.7303 | 0.471126 |  0.357592  | 3403.58 | 58.3402 |
| Linear Regression (L2)        | 46.1389 | 0.425693 |  0.419153  | 3077.42 | 55.4745 |
| Linear Support Vector Machine | 63.3727 | 0.431051 | -0.27918   | 6777.29 | 82.3243 |
| Random Forest                 | 44.5472 | 0.396708 |  0.436355  | 2986.28 | 54.6468 |
| Support Vector Machine        | 56.0237 | 0.490284 |  0.182114  | 4333.29 | 65.8277 |


### Error prevention in `show_all`

To avoid errors when using `show_all`, keep the following points in mind:

1. __Type Check:__ The `show_all` function expects the input `result` to be a dictionary. Passing a non-dictionary type results in a `TypeError`.
2. __Empty Dictionary:__ Ensure that the input argument contains scores for at least one model. Passing an empty dictionary results in a `ValueError`.
3. __Valid Scoring Metrics:__ This function expects the score dictionaries produced by the `fit` function, and it checks that the scoring metrics are valid and complete. Passing invalid scoring metrics as a dictionary value results in a `ValueError`.

By adhering to these guidelines, you can maximize the utility of the `show_all` function while preventing potential errors in its usage.
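
The three checks listed above can be sketched as follows. `validate_scores` and `REQUIRED_METRICS` are hypothetical names used only for illustration; the actual checks and error messages inside `show_all` may differ.

```python
# Assumed set of metric names, taken from the scores dictionary shown earlier.
REQUIRED_METRICS = {
    "Mean Absolute Error",
    "Mean Absolute Percentage Error",
    "R2 Score",
    "Mean Squared Error",
    "Root Mean Squared Error",
}


def validate_scores(result):
    """Raise TypeError or ValueError if `result` is not a usable scores dictionary."""
    if not isinstance(result, dict):
        raise TypeError("result must be a dictionary of model scores")
    if not result:
        raise ValueError("result must contain scores for at least one model")
    for model_name, metrics in result.items():
        # Every model needs exactly the expected metric names.
        if not isinstance(metrics, dict) or set(metrics) != REQUIRED_METRICS:
            raise ValueError(f"invalid or incomplete metrics for {model_name!r}")


# Each of these inputs violates one of the three rules.
for bad_input in ([], {}, {"Some Model": {"Mean Absolute Error": 1.0}}):
    try:
        validate_scores(bad_input)
    except (TypeError, ValueError) as exc:
        print(type(exc).__name__, "-", exc)
```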

## `display_best_score`

The `show_all` function produces a DataFrame of scoring metric results sorted alphabetically by model name. The `display_best_score` function takes this DataFrame together with the name of a regression scoring metric and reports the model it ranks as best on that metric, which saves scanning the full table by hand.

In [25]:
display_best_score(scores_test, 'MSE')

|                               |     MSE |
|-------------------------------|---------|
| Linear Support Vector Machine | 6777.29 |


|                               |         MSE |
|-------------------------------|-------------|
| Linear Support Vector Machine | 6777.289705 |


In [26]:
display_best_score(scores_train, 'MSE')

|                               |     MSE |
|-------------------------------|---------|
| Linear Support Vector Machine | 8226.16 |


|                               |         MSE |
|-------------------------------|-------------|
| Linear Support Vector Machine | 8226.164511 |
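
For error metrics such as MSE, lower values indicate a better fit, yet both outputs above report the model with the largest MSE. This may reflect how this version of `display_best_score` ranks error metrics, so a quick manual cross-check is worthwhile. The sketch below uses only the test MSE values copied from the `scores_test` table; it does not call `autopredictor`.

```python
# Test-set MSE values copied from the scores_test table above.
test_mse = {
    "AdaBoost": 2998.48,
    "Decision Tree": 5482.07,
    "Gradient Boosting": 2909.22,
    "Linear Regression": 2900.19,
    "Linear Regression (L1)": 3403.58,
    "Linear Regression (L2)": 3077.42,
    "Linear Support Vector Machine": 6777.29,
    "Random Forest": 2986.28,
    "Support Vector Machine": 4333.29,
}

# For an error metric, the best model is the one with the smallest value.
lowest = min(test_mse, key=test_mse.get)
highest = max(test_mse, key=test_mse.get)

print(f"Lowest test MSE:  {lowest} ({test_mse[lowest]})")
print(f"Highest test MSE: {highest} ({test_mse[highest]})")
```

With a DataFrame such as `scores_test`, the equivalent pandas check would be `scores_test['MSE'].idxmin()`.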


### Error prevention in `display_best_score`

__Input Type Check:__ Prior to utilizing the `display_best_score` function, validate that the input result is a DataFrame. Passing any other data type will raise a `TypeError`.

__Empty DataFrame:__ Ensure that the DataFrame provided as an argument contains at least one model's scoring metrics. Attempting to use an empty DataFrame will result in a `TypeError`.

__Valid Scoring Metrics:__ Verify that the scoring metrics provided are both valid and comprehensive. If an invalid scoring metric is passed or if the DataFrame lacks essential metrics, a `ValueError` will be raised.

Following these guidelines minimizes the chance of encountering errors when using `display_best_score`.

## `select_model`

After training the models and evaluating their performance, the `select_model` function allows the user to select a specific model and view its performance metrics. It checks that the inputs are of the correct types and that the specified model is present in the results. If the model is found, it returns the performance metrics for that model; otherwise, it provides a list of available models. This function is particularly useful for zooming in on a specific model's performance or for retrieving the performance of the best model based on a particular metric.

In the context of a diabetes dataset, for instance, an analyst may suspect that a certain model, such as a Random Forest Regressor, is particularly well suited to the data because it can model non-linear relationships and interactions between variables. The `select_model` function allows them to isolate and closely examine the performance of just this model. This is especially useful when many models have been trained and evaluated and there is a need to drill down into one of them without being overwhelmed by the full table.

Expanding beyond healthcare, this function has broad applicability in various fields. For example, in finance, an analyst might want to specifically evaluate the performance of a particular model in predicting stock prices or market trends. Similarly, in environmental science, a researcher could use this function to singularly assess a model's accuracy in forecasting climate patterns or pollution levels.

The ability to selectively examine a model is crucial when comparing models that might have different strengths and weaknesses depending on the context. This targeted approach enables a more thoughtful and focused analysis, allowing analysts to make more informed decisions about which model to deploy based on specific criteria relevant to their field or problem at hand. It's a tool that enhances precision in model selection.

In [27]:
select_model(scores_test, 'AdaBoost')

|          |       MAE |     MAPE |       R2 |         MSE |      RMSE |
|----------|-----------|----------|----------|-------------|-----------|
| AdaBoost | 45.154349 | 0.433353 | 0.434052 | 2998.478979 | 54.758369 |


In [28]:
select_model(scores_train, 'Linear Regression')

|                   |       MAE |     MAPE |       R2 |         MSE |      RMSE |
|-------------------|-----------|----------|----------|-------------|-----------|
| Linear Regression | 43.483504 | 0.389199 | 0.527919 | 2868.549703 | 53.558843 |


In [29]:
select_model(scores_test, 'Random Forest')

|               |       MAE |     MAPE |       R2 |         MSE |     RMSE |
|---------------|-----------|----------|----------|-------------|----------|
| Random Forest | 44.547191 | 0.396708 | 0.436355 | 2986.278196 | 54.64685 |


### Error prevention in `select_model`

The `select_model` function is useful not only for focusing on a specific model's performance but also for verifying that a model is present in the results. When an analyst specifies a model, the function checks whether that model is included in the DataFrame's index. If the model is not found, the function does not stop at an error message; it also lists the models that are included. This feature is helpful in several ways:

- **Error Prevention and Troubleshooting:** It helps in preventing errors that might occur from typos or incorrect model names. By returning an error message and a list of included models, it guides the user towards the correct model names, making the troubleshooting process more intuitive and less time-consuming.

- **Model Inventory Check:** It essentially acts as a quick inventory check, allowing users to confirm which models have been trained and evaluated. This is especially useful in collaborative environments where multiple team members might be working on the same dataset but focusing on different models. It ensures that everyone is aware of the models that are already included in the analysis, helping to avoid redundant work.

- **Informed Decision Making:** By providing a list of available models, it aids in informed decision-making. Analysts can quickly scan through the available models and decide which ones to focus on based on their specific criteria or hypothesis, without having to look through the entire DataFrame.

For example, an environmental scientist analyzing a dataset on air quality might be interested in a specific model's ability to predict pollution levels. If they aren't sure whether the model has been included in the analysis, they can use the `select_model` function. If the model is not found, the function's feedback not only informs them of this but also shows which models are available, allowing the scientist to decide whether to proceed with an available trained model or to train and evaluate the model of interest.
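
The lookup-with-fallback behaviour described above can be sketched in a few lines. `lookup_model` is a hypothetical stand-in that works on a plain dictionary rather than a DataFrame; the message wording mirrors the package output shown in this section, but the real `select_model` implementation may differ.

```python
def lookup_model(scores, model_name):
    """Return the metrics for `model_name`, or a message listing available models."""
    if model_name in scores:
        return scores[model_name]
    available = ", ".join(sorted(scores))
    return (
        f"Model '{model_name}' not found. "
        f"Here is the list of the models available: {available}."
    )


# A small subset of the test scores shown earlier in this example.
scores = {
    "Random Forest": {"MAE": 44.547191, "R2": 0.436355},
    "AdaBoost": {"MAE": 45.154349, "R2": 0.434052},
}

print(lookup_model(scores, "AdaBoost"))   # metrics for a model that exists
print(lookup_model(scores, "Rand Fore"))  # misspelled name triggers the fallback
```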

In [31]:
# Model Does Not Exist
selected_model_name = 'Other Regressor'
select_model(scores_test, selected_model_name)

"Model 'Other Regressor' not found. Here is the list of the models available: AdaBoost, Decision Tree, Gradient Boosting, Linear Regression, Linear Regression (L1), Linear Regression (L2), Linear Support Vector Machine, Random Forest, Support Vector Machine."

In [32]:
# Wrong Spelling
selected_model_name = 'Rand Fore'
select_model(scores_test, selected_model_name)

"Model 'Rand Fore' not found. Here is the list of the models available: AdaBoost, Decision Tree, Gradient Boosting, Linear Regression, Linear Regression (L1), Linear Regression (L2), Linear Support Vector Machine, Random Forest, Support Vector Machine."

## Conclusion

The `autopredictor` package provides a framework for model selection in regression tasks. It trains and evaluates a diverse set of models, reports a range of performance metrics for thorough assessment, and presents the results in a clear, user-friendly form that supports an informed choice of the most effective model. Its application to the diabetes dataset illustrates its relevance to real-world data and its potential value in healthcare analytics and other domains where modeling is key.