# Autopredictor Tutorial

In critical sectors like healthcare, choosing an accurate and efficient predictive model is crucial because it directly affects decision-making and outcomes. The `autopredictor` package is designed to simplify and speed up model selection and evaluation for regression problems, where the target variable is continuous. It is especially valuable for healthcare professionals, data scientists, and researchers, freeing up time for data interpretation and strategic decision-making. This tutorial demonstrates the use of `autopredictor` with a diabetes dataset, reflecting real-world health data scenarios.

## Setting Up and Version Checking
To begin, install and import the autopredictor package and check its version to ensure compatibility with your dataset and analysis requirements.

In [1]:
import autopredictor

print(autopredictor.__version__)

0.1.0
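
If the package may not be installed in every environment, the version can also be read without importing it. The following is a minimal sketch using the standard library's `importlib.metadata`. The helper name `installed_version` is illustrative, and the sketch assumes the distribution is published under the same name as the import, `autopredictor`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Assumption: the distribution name matches the import name.
found = installed_version("autopredictor")
print(found if found is not None else "autopredictor is not installed")
```

Returning `None` instead of raising keeps the check usable at the top of a notebook, where a missing package should produce a readable message rather than a traceback.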


## Importing Necessary Modules and Data
Import essential modules and the dataset. For this tutorial, the diabetes dataset from sklearn is used.

In [2]:
from autopredictor.fit import fit
from autopredictor.show_all import show_all
from autopredictor.bestscore import display_best_score
from autopredictor.select_model import select_model
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Model Fitting with `fit` 

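With `autopredictor`, you can fit multiple models to your data in a single call. The `fit` function trains each model and returns its performance scores. With `return_train=True`, as used here, the result contains both the test scores (index 0) and the training scores (index 1), which the later `show_all` calls rely on.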

In [15]:
model_scores = fit(X_train, X_test, y_train, y_test, return_train=True)

Linear Regression trained.
Linear Regression (L1) trained.
Linear Regression (L2) trained.
Linear Support Vector Machine trained.
Support Vector Machine trained.
Decision Tree trained.
Random Forest trained.
Gradient Boosting trained.
AdaBoost trained.


In [None]:
model_scores

## Evaluating all models with `show_all`

After executing the `fit` function, both the training and testing scores are available, each in dictionary format. The `show_all` function transforms these raw model scores into a structured DataFrame for quick inspection. Presenting the data in a readable table saves time and also ensures compatibility with other functions in the workflow, such as `display_best_score` and `select_model`. It is an efficient way to begin model evaluation, setting the stage for more detailed analysis.

By converting the dictionary into an organized format and sorting the results alphabetically by model name, `show_all` offers a quick and efficient means of comprehending and comparing regression model performance. The tabular presentation enhances readability and simplifies the process of identifying specific model scores. Consider a real-life application using the `load_diabetes` dataset from sklearn: a researcher can train multiple models to predict diabetes progression and then use `show_all` to get a user-friendly view of all the models and their respective metrics.
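
The metrics reported for each model in this tutorial (MAE, MAPE, R2, MSE, RMSE) are standard regression scores. The sketch below computes them in plain Python using their usual textbook definitions. Whether `autopredictor` computes them in exactly this way is an assumption, and `regression_metrics` is an illustrative name rather than part of the package.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MAPE, R2, MSE and RMSE using their usual definitions.

    Assumes y_true contains no zeros (for MAPE) and is not constant (for R2).
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    # MAPE is expressed as a fraction: 0.25 means 25%.
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    total_variation = sum((t - mean_true) ** 2 for t in y_true)
    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    r2 = 1 - (mse * n) / total_variation
    return {"MAE": mae, "MAPE": mape, "R2": r2, "MSE": mse, "RMSE": math.sqrt(mse)}


# Toy values, not taken from the diabetes dataset.
toy = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
print(toy)
```

Two properties are worth keeping in mind when reading the score tables: RMSE is the square root of MSE, and R2 can be negative when a model fits worse than always predicting the mean of the target.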

While it's possible to achieve similar conversions using pandas manipulation, `show_all` is purpose-built for this package, ensuring the validity of scoring metrics in the dictionary.
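
For comparison, the manual pandas route might look like the sketch below. It assumes the scores are shaped as a nested dictionary of the form `{model name: {metric: value}}`; the actual structure returned by `fit` is not shown in this tutorial, so treat that shape, and the name `raw_scores`, as hypothetical.

```python
import pandas as pd

# Hypothetical scores, shaped as {model name: {metric: value}}.
# The values are copied from two rows of the test-score table in this tutorial.
raw_scores = {
    "Random Forest": {"MAE": 44.5472, "R2": 0.436355, "MSE": 2986.28},
    "AdaBoost": {"MAE": 45.1543, "R2": 0.434052, "MSE": 2998.48},
}

# One row per model, one column per metric, rows sorted by model name.
manual_table = pd.DataFrame.from_dict(raw_scores, orient="index").sort_index()
print(manual_table)
```

This reproduces the layout only. It performs none of the input validation described for `show_all`.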

**Visualizing training scores with `show_all`**

In [17]:
scores_train = show_all(model_scores[1])  # index 1 holds the training scores

|                               |     MAE |     MAPE |        R2 |      MSE |    RMSE |
|-------------------------------|---------|----------|-----------|----------|---------|
| AdaBoost                      | 41.4908 | 0.387053 |  0.623656 | 2286.82  | 47.8207 |
| Decision Tree                 |  0      | 0        |  1        |    0     |  0      |
| Gradient Boosting             | 25.3517 | 0.227698 |  0.835903 |  997.121 | 31.5772 |
| Linear Regression             | 43.4835 | 0.389199 |  0.527919 | 2868.55  | 53.5588 |
| Linear Regression (L1)        | 52.9588 | 0.495438 |  0.364631 | 3860.75  | 62.135  |
| Linear Regression (L2)        | 48.8052 | 0.450646 |  0.442403 | 3388.18  | 58.2081 |
| Linear Support Vector Machine | 70.6389 | 0.466293 | -0.35379  | 8226.16  | 90.6982 |
| Random Forest                 | 17.675  | 0.155916 |  0.921401 |  477.601 | 21.8541 |
| Support Vector Machine        | 58.6858 | 0.494582 |  0.166804 | 5062.83  | 71.1536 |


**Visualizing test scores with `show_all`**

In [18]:
scores_test = show_all(model_scores[0])  # index 0 holds the test scores

|                               |     MAE |     MAPE |         R2 |     MSE |    RMSE |
|-------------------------------|---------|----------|------------|---------|---------|
| AdaBoost                      | 45.1543 | 0.433353 |  0.434052  | 2998.48 | 54.7584 |
| Decision Tree                 | 57.3034 | 0.473769 | -0.0347132 | 5482.07 | 74.041  |
| Gradient Boosting             | 44.6178 | 0.400537 |  0.450899  | 2909.22 | 53.9372 |
| Linear Regression             | 42.7941 | 0.374998 |  0.452603  | 2900.19 | 53.8534 |
| Linear Regression (L1)        | 49.7303 | 0.471126 |  0.357592  | 3403.58 | 58.3402 |
| Linear Regression (L2)        | 46.1389 | 0.425693 |  0.419153  | 3077.42 | 55.4745 |
| Linear Support Vector Machine | 63.3727 | 0.431051 | -0.27918   | 6777.29 | 82.3243 |
| Random Forest                 | 44.5472 | 0.396708 |  0.436355  | 2986.28 | 54.6468 |
| Support Vector Machine        | 56.0237 | 0.490284 |  0.182114  | 4333.29 | 65.8277 |


### Error prevention in `show_all`

To avoid errors when calling `show_all`, keep the following points in mind:

__Type Check:__ The `show_all` function expects the input `result` to be a dictionary. Providing an input of a different type, such as a list or string, will trigger a `TypeError`. Always verify the input type to ensure compatibility.

__Empty Dictionary:__ Ensure that the input argument contains scores for at least one model. Passing an empty dictionary would result in a `ValueError`. Before invoking `show_all`, check that your dictionary is populated with relevant data.

__Valid Scoring Metrics:__ This function expects the score dictionaries produced by the `fit` function, and it checks that the scoring metrics are valid and complete. Passing a dictionary whose values contain invalid scoring metrics will result in a `ValueError`.

By adhering to these guidelines, you can maximize the utility of the `show_all` function while preventing potential errors in its usage.
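
The first two checks above can also be made before the call. The sketch below uses a hypothetical helper, `check_scores`, that mirrors the described type and emptiness checks. It is not part of `autopredictor`, and it does not reproduce the metric validation that `show_all` performs internally.

```python
def check_scores(scores):
    """Raise early if `scores` is not a non-empty dictionary."""
    if not isinstance(scores, dict):
        raise TypeError(f"expected a dict of scores, got {type(scores).__name__}")
    if not scores:
        raise ValueError("the scores dictionary is empty")
    return scores


# A list is rejected before it would ever reach show_all.
try:
    check_scores(["not", "a", "dict"])
except TypeError as error:
    print(f"TypeError: {error}")
```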

## Selecting the best model with `display_best_score`

The `show_all` function returns a DataFrame of scoring metrics sorted alphabetically by model name. The `display_best_score` function takes that DataFrame together with a regression scoring metric and returns a single model's row for that metric, which avoids scanning the full table by hand when determining the optimal model.

**Selecting the best model based on the scoring metric MSE:**

In [40]:
display_best_score(scores_train,'MSE') # Based on Training Set

|                               |     MSE |
|-------------------------------|---------|
| Linear Support Vector Machine | 8226.16 |




In [39]:
display_best_score(scores_test,'MSE') # Based on Test Set

|                               |     MSE |
|-------------------------------|---------|
| Linear Support Vector Machine | 6777.29 |




In the context of the diabetes dataset, `display_best_score` reduces the full score table to a single model for predicting diabetes progression, according to the scoring metric you specify. This lets researchers and data scientists choose a model based on the metric that matters for their needs, a flexibility that is essential in healthcare and other fields where the choice of metric significantly impacts research outcomes and real-world applications. Note that lower values are better for error metrics such as MSE and RMSE, whereas the outputs in this section show the row with the highest value (Linear Support Vector Machine), so cross-check the result against the full `show_all` table before settling on a model.
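
A result can be cross-checked by hand against the full score table. The sketch below uses plain Python rather than `autopredictor`. The names `best_by_metric` and `LOWER_IS_BETTER` are illustrative, and the ranking direction (lower is better for error metrics, higher is better for R2) is the usual convention, stated here as an assumption.

```python
# Test-set MSE and R2 values copied from the `show_all` table in this tutorial.
test_scores = {
    "AdaBoost": {"MSE": 2998.48, "R2": 0.434052},
    "Decision Tree": {"MSE": 5482.07, "R2": -0.0347132},
    "Gradient Boosting": {"MSE": 2909.22, "R2": 0.450899},
    "Linear Regression": {"MSE": 2900.19, "R2": 0.452603},
    "Linear Regression (L1)": {"MSE": 3403.58, "R2": 0.357592},
    "Linear Regression (L2)": {"MSE": 3077.42, "R2": 0.419153},
    "Linear Support Vector Machine": {"MSE": 6777.29, "R2": -0.27918},
    "Random Forest": {"MSE": 2986.28, "R2": 0.436355},
    "Support Vector Machine": {"MSE": 4333.29, "R2": 0.182114},
}

# Assumption: error metrics are better when lower, R2 is better when higher.
LOWER_IS_BETTER = {"MAE", "MAPE", "MSE", "RMSE"}


def best_by_metric(scores, metric):
    """Return (model name, value) for the best model under `metric`."""
    pick = min if metric in LOWER_IS_BETTER else max
    name = pick(scores, key=lambda model: scores[model][metric])
    return name, scores[name][metric]


print(best_by_metric(test_scores, "MSE"))
print(best_by_metric(test_scores, "R2"))
```

On the test-set values, both metrics point to Linear Regression, which has the lowest MSE (2900.19) and the highest R2 (0.452603).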

**Selecting the best model based on the scoring metric RMSE:**

In [37]:
display_best_score(scores_train,'RMSE') # Based on Training Set

|                               |    RMSE |
|-------------------------------|---------|
| Linear Support Vector Machine | 90.6982 |




In [41]:
display_best_score(scores_test,'RMSE') # Based on Test Set

|                               |    RMSE |
|-------------------------------|---------|
| Linear Support Vector Machine | 82.3243 |




### Error prevention and troubleshooting in `display_best_score`

The following checks help prevent and troubleshoot errors when using the `display_best_score` function:

__Input Type Check:__ Prior to utilizing the `display_best_score` function, validate that the input result is a DataFrame. Passing any other data type will raise a `TypeError`.

__Empty DataFrame:__ Ensure that the DataFrame provided as an argument contains at least one model's scoring metrics. Attempting to use an empty DataFrame will result in a `TypeError`.

__Valid Scoring Metrics:__ Verify that the scoring metrics provided are both valid and comprehensive. If an invalid scoring metric is passed or if the DataFrame lacks essential metrics, a `ValueError` will be raised.

Following these guidelines reduces the chance of encountering errors when calling `display_best_score`.

## Inspecting a specific model with `select_model`

After running `show_all`, the `select_model` function lets the user select a specific model from the DataFrame and view its performance metrics. If the model is present in the DataFrame, it returns the performance metrics for that model; otherwise, it returns a message listing the available models. This function is particularly useful for zooming in on a specific model's performance or for retrieving the performance of the best model based on a particular metric.

In the context of a diabetes dataset, for instance, if an analyst suspects that a certain model, like a Random Forest Regressor, might be particularly well-suited to handling the complexities of diabetes data (due to its ability to model non-linear relationships and interactions between variables), the `select_model` function allows them to isolate and closely examine the performance of just this model. This is especially useful in situations where a multitude of models have been trained and evaluated, and there's a need to drill down into the specifics of one model without getting overwhelmed by the broader data.

**Viewing Random Forest model performance on the test set**

In [33]:
select_model(scores_test, 'Random Forest')

|               |       MAE |     MAPE |       R2 |         MSE |     RMSE |
|---------------|-----------|----------|----------|-------------|----------|
| Random Forest | 44.547191 | 0.396708 | 0.436355 | 2986.278196 | 54.64685 |


Beyond healthcare, this function has broad applicability in various fields. For example, in finance, an analyst might want to evaluate the performance of one particular model in predicting stock prices or market trends. Similarly, in environmental science, a researcher could use this function to assess a single model's accuracy in forecasting climate patterns or pollution levels.

The ability to selectively examine a model is crucial when comparing models that might have different strengths and weaknesses depending on the context. This targeted approach enables a more thoughtful and focused analysis, allowing analysts to make more informed decisions about which model to deploy based on specific criteria relevant to their field or problem at hand. It's a tool that enhances precision in model selection.

**Viewing Linear Regression model performance on the training set**

In [28]:
select_model(scores_train, 'Linear Regression')

|                   |       MAE |     MAPE |       R2 |         MSE |      RMSE |
|-------------------|-----------|----------|----------|-------------|-----------|
| Linear Regression | 43.483504 | 0.389199 | 0.527919 | 2868.549703 | 53.558843 |


### Error prevention in `select_model`

For a seamless and trouble-free experience, be sure to adhere to these tips:

__Input Type Check:__ Similar to `display_best_score`, before employing the `select_model` function, ensure that the input result is a DataFrame. Passing any other data type will raise a `TypeError`.

__Empty DataFrame:__ Make sure that the DataFrame provided contains at least one model and its respective scoring metrics. Attempting to use an empty DataFrame will result in a `TypeError`.

The `select_model` function is useful not only for focusing on a specific model's performance but also for verifying that a model is present in the scores DataFrame. When an analyst specifies a model, the function checks whether that model is included in the DataFrame's index. If the model is not found, the function does more than report the failure: it also lists the models that are included. This feature is helpful in two ways:

- __Model Inventory Check:__ It essentially acts as a quick inventory check, allowing users to confirm which models have been trained and evaluated. This is especially useful in collaborative environments where multiple team members might be working on the same dataset but focusing on different models. It ensures that everyone is aware of the models that are already included in the analysis, helping to avoid redundant work.

- __Informed Decision Making:__ By providing a list of available models, it aids in informed decision-making. Analysts can quickly scan through the available models and decide which ones to focus on based on their specific criteria or hypothesis, without having to look through the entire DataFrame.

In [42]:
selected_model_name = 'Other Regressor'
select_model(scores_test, selected_model_name)

"Model 'Other Regressor' not found. Here is the list of the models available: AdaBoost, Decision Tree, Gradient Boosting, Linear Regression, Linear Regression (L1), Linear Regression (L2), Linear Support Vector Machine, Random Forest, Support Vector Machine."

For example, in a real-life scenario, an environmental scientist analyzing a dataset on air quality might be interested in examining a specific model's ability to predict pollution levels. If they aren't sure whether the model has been included in the analysis, they can use the `select_model` function. If the model is not found, the function's feedback not only informs them of this but also shows which models are available, allowing the scientist to make an informed decision on whether to proceed with an available trained model or to train and evaluate the model of interest.
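
The lookup-with-fallback behavior described above can be sketched in plain Python. The helper `lookup_model` below is hypothetical and operates on an ordinary dictionary rather than a DataFrame. It illustrates the pattern only and is not the implementation of `select_model`.

```python
def lookup_model(scores, model_name):
    """Return the metrics for `model_name`, or a message listing known models."""
    if model_name in scores:
        return scores[model_name]
    available = ", ".join(sorted(scores))
    return f"Model '{model_name}' not found. Available models: {available}."


# Two rows copied from the test-score table in this tutorial.
scores = {
    "Random Forest": {"MAE": 44.5472, "R2": 0.436355},
    "Linear Regression": {"MAE": 42.7941, "R2": 0.452603},
}

print(lookup_model(scores, "Random Forest"))
print(lookup_model(scores, "Other Regressor"))
```

Returning a message instead of raising an exception keeps an exploratory notebook session running, at the cost of the caller having to check whether the result is a string.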

## Conclusion

The `autopredictor` package offers a streamlined approach to model training, evaluation, and selection, making it a valuable tool in fields where accuracy is crucial. It facilitates the training and evaluation of a diverse set of models, provides a range of performance metrics for thorough assessment, and presents results in a clear, user-friendly format that supports an informed choice of model. Its functionality allows for an efficient workflow, catering to data scientists of all levels as well as professionals in various domains. Whether in healthcare, finance, or environmental science, `autopredictor` simplifies the complex task of model selection, empowering users to make informed, data-driven decisions.

## Best practices

- Always split your dataset into training and testing sets to evaluate model performance.
- Consider the specific problem and dataset characteristics when choosing a regression model.
- Regularly check for updates to the autopredictor package for improved functionality and bug fixes.