### OCI Data Science - Useful Tips
<details>
<summary><font size="2">Check for Public Internet Access</font></summary>

```python
import requests

# A timeout keeps the cell from hanging when there is no outbound route
response = requests.get("https://oracle.com", timeout=10)
assert response.status_code == 200, "Internet connection failed"
```
</details>
<details>
<summary><font size="2">Helpful Documentation </font></summary>
<ul><li><a href="https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm">Data Science Service Documentation</a></li>
<li><a href="https://docs.cloud.oracle.com/iaas/tools/ads-sdk/latest/index.html">ADS documentation</a></li>
</ul>
</details>
<details>
<summary><font size="2">Typical Cell Imports and Settings for ADS</font></summary>

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

import ads
from ads.dataset.factory import DatasetFactory
from ads.automl.provider import OracleAutoMLProvider
from ads.automl.driver import AutoML
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import ADSData
from ads.explanations.explainer import ADSExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
from ads.catalog.model import ModelCatalog
from ads.common.model_artifact import ModelArtifact
```
</details>
<details>
<summary><font size="2">Useful Environment Variables</font></summary>

```python
import os
print(os.environ["NB_SESSION_COMPARTMENT_OCID"])
print(os.environ["PROJECT_OCID"])
print(os.environ["USER_OCID"])
print(os.environ["TENANCY_OCID"])
print(os.environ["NB_REGION"])
```
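
The cell above raises `KeyError` as soon as one variable is missing, for example when the notebook runs outside an OCI Data Science session. A minimal sketch of a more tolerant lookup follows; `read_session_env` is a hypothetical helper, not part of ADS or the OCI SDK, and the variable names are the ones listed above.

```python
import os

# Variable names copied from the cell above
SESSION_VARS = [
    "NB_SESSION_COMPARTMENT_OCID",
    "PROJECT_OCID",
    "USER_OCID",
    "TENANCY_OCID",
    "NB_REGION",
]


def read_session_env(names=SESSION_VARS, environ=None):
    """Return {name: value or None} without raising on missing variables.

    Hypothetical helper for illustration only.
    """
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in names}


# Print every variable, marking the ones that are not defined
for name, value in read_session_env().items():
    print(f"{name}: {value if value is not None else '<not set>'}")
```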
</details>

In [1]:
# Import the standard, publicly-available iris data set
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
y = pd.Series(data['target'])

In [2]:
# Split data into appropriate training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df, y, train_size=0.7, random_state=0
)

# Show the shapes of the resulting train and test feature sets
X_train.shape, X_test.shape

((105, 4), (45, 4))
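
The row counts in that output follow from `train_size=0.7` applied to the 150 rows of the iris data set. The sketch below reproduces the arithmetic under one stated assumption: the train count is the floor of `train_size * n_samples` and the test set takes the remaining rows. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, train_size):
    """Return (n_train, n_test) for a fractional train_size.

    Assumption: n_train = floor(train_size * n_samples) and the test set
    is whatever remains. Hypothetical helper for illustration only.
    """
    n_train = math.floor(train_size * n_samples)
    return n_train, n_samples - n_train


# 150 rows in total, 70% of them used for training
print(split_sizes(150, 0.7))
```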

In [3]:
# Set and initialize the AutoMLx engine
import automlx
from automlx import init

init(engine='local')

In [4]:
# Train a model using AutoMLx
est = automlx.Pipeline(task='classification')
est.fit(X_train, y_train)

[2025-10-21 20:45:50,274] [automlx.interface] Dataset shape: (105,4)
[2025-10-21 20:45:50,363] [automlx.data_transform] Running preprocessing. Number of features: 5
[2025-10-21 20:45:50,581] [automlx.data_transform] Preprocessing completed. Took 0.219 secs
[2025-10-21 20:45:50,593] [automlx.process] Running Model Generation
[2025-10-21 20:45:50,641] [automlx.process] Model Generation completed.
[2025-10-21 20:45:50,706] [automlx.model_selection] Running Model Selection
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005252 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 77
[LightGBM] [Info] Number of data points in the train set: 84, number of used features: 4
[LightGBM] [Info] Start training from score -1.106324
[LightGBM] [Info] Start training from score -0.992996
[LightGBM] [Info] Start training from score -1.208107
[LightGBM] [Info]

<automlx._interface.classifier.AutoClassifier at 0x7f8a3334e410>

In [13]:
# Wrap the AutoMLx estimator trained above in ads.model.framework.sklearn_model.SklearnModel
import tempfile

from ads.model.framework.sklearn_model import SklearnModel

sklearn_model = SklearnModel(
    # A plain scikit-learn estimator could be passed here instead of `est`
    estimator=est, artifact_dir=tempfile.mkdtemp()
)

# Autogenerate score.py, serialized model, runtime.yaml, input_schema.json and output_schema.json
sklearn_model.prepare(
    inference_conda_env="dbexp_p38_cpu_v1",
    X_sample=X_train,
    y_sample=y_train,
)

# Verify generated artifacts (disabled in this run)
# sklearn_model.verify(X_test)

# Register the model (disabled in this run)
# model_id = sklearn_model.save(display_name="Sklearn Model")

[2025-10-21 21:21:57,510] [ads.common] In the future model input will be serialized by `cloudpickle` by default. Currently, model input are serialized into a dictionary containing serialized input data and original data type information.Set `model_input_serializer="cloudpickle"` to use cloudpickle model input serializer.
[2025-10-21 21:21:57,511] [root] Cannot extract the hyperparameters from this model automatically.
[2025-10-21 21:22:12,199] [ADS] No commit found.


algorithm: AutoClassifier
artifact_dir:
  /tmp/tmp02pu9bms:
  - - output_schema.json
    - score.py
    - runtime.yaml
    - model.joblib
    - .model-ignore
    - input_schema.json
framework: scikit-learn
model_deployment_id: null
model_id: null
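
With `verify()` left disabled, a quick sanity check is to confirm that the files listed above actually exist in the artifact directory. This is only a sketch: `missing_artifacts` is a hypothetical helper, not part of ADS, and the expected file names are copied from the `artifact_dir` listing in this output.

```python
import os
import tempfile

# File names copied from the artifact_dir listing above
EXPECTED_ARTIFACTS = {
    "score.py",
    "runtime.yaml",
    "model.joblib",
    "input_schema.json",
    "output_schema.json",
}


def missing_artifacts(artifact_dir, expected=EXPECTED_ARTIFACTS):
    """Return the sorted expected file names that are absent from artifact_dir."""
    present = set(os.listdir(artifact_dir))
    return sorted(set(expected) - present)


# Demonstration on a throwaway directory that holds only score.py
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "score.py"), "w").close()
    print(missing_artifacts(demo_dir))
```

In the notebook, the directory to check would be the temporary path printed in the listing (`/tmp/tmp02pu9bms` in this run).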

In [12]:
import tempfile

import ads
from ads.model.framework.sklearn_model import SklearnModel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


ads.set_auth(auth="resource_principal")

# Load the dataset and prepare the train/test split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Train a LogisticRegression model
sklearn_estimator = LogisticRegression()
sklearn_estimator.fit(X_train, y_train)

# Instantiate ads.model.framework.sklearn_model.SklearnModel using the sklearn LogisticRegression model
sklearn_model = SklearnModel(
    estimator=sklearn_estimator, artifact_dir=tempfile.mkdtemp()
)

# Autogenerate score.py, serialized model, runtime.yaml, input_schema.json and output_schema.json
sklearn_model.prepare(
    inference_conda_env="dbexp_p38_cpu_v1",
    X_sample=X_train,
    y_sample=y_train,
)

# Verify generated artifacts
sklearn_model.verify(X_test)

# Register scikit-learn model
model_id = sklearn_model.save(display_name="Sklearn Model")

[2025-10-21 21:16:59,846] [ads.common] In the future model input will be serialized by `cloudpickle` by default. Currently, model input are serialized into a dictionary containing serialized input data and original data type information.Set `model_input_serializer="cloudpickle"` to use cloudpickle model input serializer.

ERROR - Exception
Traceback (most recent call last):
  File "/home/datascience/conda/automlx251_p311_cpu_x86_64_v2/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_6214/1573694836.py", line 27, in <module>
    sklearn_model.prepare(
  File "/home/datascience/conda/automlx251_p311_cpu_x86_64_v2/lib/python3.11/site-packages/ads/model/generic_model.py", line 1092, in prepare
    self.populate_metadata(
  File "/home/datascience/conda/automlx251_p311_cpu_x86_64_v2/lib/python3.11/site-packages/ads/model/model_metadata_mixin.py", line 328, in populate_metadata
    self._populate_provenance_metadata(
  File "/home/datascience/conda/automlx251_p311_cpu_x86_64_v2/lib/python3.11/site-packages/ads/model/model_metadata_mixin.py", line 245, in _populate_provenance_metadata
    ModelProvenanceMetadata.fetch_training_code_details(
  File "/home/datascience/conda/automlx251_p311_cpu_x86_6