# Dimensionality Reduction in DataRobot Using t-SNE

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a dimensionality reduction technique that projects high-dimensional data into a lower-dimensional space, typically two or three dimensions, where it can be visualized. Dimensionality reduction can improve machine learning results by reducing the computational complexity of the algorithms, lowering the risk of overfitting, and focusing on the most relevant features in the dataset. Note that t-SNE is computationally expensive, so it is best applied when the number of features is modest; for data with a very large number of features, the scikit-learn documentation recommends first reducing the dimensionality with another method such as PCA.

## Import libraries

In [None]:
import datarobot as dr
import pandas as pd
import seaborn as sns
from sklearn.manifold import TSNE

## Connect to DataRobot
Instructions for obtaining your endpoint and token are in the [DataRobot API documentation](https://docs.datarobot.com/en/docs/api/api-quickstart/index.html#configure-api-authentication).

In [3]:
# either directly pass in your endpoint/token, use a config file, or connect using DataRobot notebooks
dr.Client()

<datarobot.rest.RESTClientObject at 0x7f5f10312280>
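If you prefer to pass the endpoint and token explicitly, one option is to read them from environment variables. The sketch below is only an illustration: the helper names `datarobot_credentials` and `connect` are hypothetical, the default endpoint is an assumption that applies to the US managed cloud, and the variable names `DATAROBOT_ENDPOINT` and `DATAROBOT_API_TOKEN` follow the DataRobot client documentation.

```python
import os


def datarobot_credentials(env=os.environ):
    """Collect endpoint and token from environment variables (hypothetical helper)."""
    # Assumed default endpoint; replace it with the endpoint for your installation.
    endpoint = env.get("DATAROBOT_ENDPOINT", "https://app.datarobot.com/api/v2")
    token = env.get("DATAROBOT_API_TOKEN")
    if not token:
        raise ValueError("Set DATAROBOT_API_TOKEN before connecting")
    return {"endpoint": endpoint, "token": token}


def connect():
    """Create a DataRobot client from the collected credentials."""
    import datarobot as dr  # imported here so the helper above works without the package

    return dr.Client(**datarobot_credentials())
```

Calling `connect()` in place of the bare `dr.Client()` keeps the token out of the notebook itself.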

## Get dataset
This example uses data on the movement of a double pendulum. The dataset has already been loaded into DataRobot for this example; the source can be [found here](https://github.com/datarobot-community/tutorials-for-data-scientists/blob/master/Regression/Python/double_pendulum_with_eureqa/src/Double%20Pendulum%20with%20Eureqa%20Models.ipynb).

In [40]:
# replace this with the ID of your own dataset
ds_id = "62fbcdf583b30f0ef972dc31"

# get dataset from DataRobot
ds = dr.Dataset.get(ds_id)
df = ds.get_as_dataframe()
display(df)

,t,x1,x2,v1,v2,a1,a2
0,0.000000,2.36,3.14,-0.0100,-0.01000,-9.24,6.53
1,0.000862,2.36,3.14,-0.0180,-0.00437,-9.24,6.53
2,0.001720,2.36,3.14,-0.0259,0.00126,-9.24,6.53
3,0.002590,2.36,3.14,-0.0339,0.00689,-9.24,6.53
4,0.003450,2.36,3.14,-0.0418,0.01250,-9.24,6.53
...,...,...,...,...,...,...,...
2424,9.970000,-14.70,-22.40,1.1400,1.82000,6.94,-3.84
2425,9.980000,-14.70,-22.30,1.2000,1.79000,7.04,-3.64
2426,9.980000,-14.70,-22.30,1.2500,1.76000,7.12,-3.42
2427,9.990000,-14.70,-22.30,1.3100,1.73000,7.20,-3.19
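Before reducing the data, it can help to confirm that the feature columns are numeric and complete, because scikit-learn's `TSNE` rejects missing values. The following is a minimal sketch under stated assumptions: `summarize_for_tsne` is a hypothetical helper, and the small `sample` frame is an illustrative stand-in, not the pendulum data.

```python
import numpy as np
import pandas as pd


def summarize_for_tsne(frame, exclude_cols):
    """Report basic facts about the columns that would be passed to t-SNE."""
    features = frame.drop(columns=exclude_cols)
    return {
        "n_rows": len(features),
        "n_features": features.shape[1],
        # non-numeric columns would need encoding or removal first
        "non_numeric": features.select_dtypes(exclude="number").columns.tolist(),
        # total count of missing cells across the feature columns
        "missing_values": int(features.isna().sum().sum()),
    }


# Illustrative stand-in frame with one missing value
sample = pd.DataFrame(
    {
        "t": [0.0, 0.1, 0.2],
        "x1": [2.36, 2.36, np.nan],
        "x2": [3.14, 3.14, 3.14],
        "a2": [6.53, 6.53, 6.53],
    }
)
summary = summarize_for_tsne(sample, exclude_cols=["t", "a2"])
print(summary)
```

With the real data, you would pass `df` and the columns you plan to exclude from the reduction.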


## Reduce the number of features in the dataset

In [None]:
# columns to exclude from the reduction,
# such as target, ID, or time columns
exclude_cols = ["t", "a2"]

# n_components defaults to 2, so each row is reduced to two values
model = TSNE(learning_rate=100, random_state=42)
transformed = model.fit_transform(df.drop(exclude_cols, axis=1))

In [25]:
transformed

array([[  2.542573 , -80.301025 ],
       [  2.5057044, -80.29103  ],
       [  2.869162 , -80.113396 ],
       ...,
       [  9.5524645,  74.92201  ],
       [  9.630235 ,  74.90384  ],
       [  9.827253 ,  74.67084  ]], dtype=float32)
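t-SNE works from pairwise distances between rows, so columns with larger numeric ranges can dominate the result. The notebook passes the raw columns directly; the sketch below shows one possible variation that standardizes the features first and, when the feature count is large, compresses them with PCA before t-SNE. The `embed_2d` helper, the 50-dimension threshold, and the random demo data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def embed_2d(features, max_dims=50, perplexity=30.0, random_state=42):
    """Standardize, optionally compress with PCA, then run t-SNE (hypothetical helper)."""
    scaled = StandardScaler().fit_transform(np.asarray(features, dtype=float))
    if scaled.shape[1] > max_dims:
        # PCA only runs when there are more features than max_dims
        scaled = PCA(n_components=max_dims, random_state=random_state).fit_transform(scaled)
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,  # must be smaller than the number of rows
        learning_rate=100,
        random_state=random_state,
    )
    return tsne.fit_transform(scaled)


# Random stand-in data: 80 rows, 60 features (enough to trigger the PCA step)
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(80, 60))
embedding = embed_2d(demo_features, perplexity=10.0)
print(embedding.shape)
```

For the pendulum data, which has only a handful of feature columns, the PCA branch would be skipped and only the scaling step would differ from the cell above.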

## Create a new dataframe with the reduced columns and the previously excluded columns

In [39]:
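# build a dataframe from the t-SNE output
reduced_df = pd.DataFrame(transformed, columns=["tsne_x", "tsne_y"])

# join in the target and time columns from the original dataset
# (pd.concat with axis=1 aligns rows on the index, so both frames need the same index)
reduced_df = pd.concat([reduced_df, df[exclude_cols]], axis=1)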

display(reduced_df)

,tsne_x,tsne_y,t,a2
0,2.542573,-80.301025,0.000000,6.53
1,2.505704,-80.291031,0.000862,6.53
2,2.869162,-80.113396,0.001720,6.53
3,2.899721,-80.068108,0.002590,6.53
4,2.924986,-80.020332,0.003450,6.53
...,...,...,...,...
2424,9.658271,74.433037,9.970000,-3.84
2425,9.417135,74.999992,9.980000,-3.64
2426,9.552464,74.922012,9.980000,-3.42
2427,9.630235,74.903839,9.990000,-3.19
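Since t-SNE is mainly a visualization technique, a scatter plot of the two reduced columns is a natural next step. The sketch below uses seaborn's `scatterplot` on a random stand-in frame with the same column names as `reduced_df`; the stand-in data, the color palette, and the title are illustrative assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; omit this line inside a notebook

import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in with the same columns as reduced_df (not the real embedding)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    {
        "tsne_x": rng.normal(size=50),
        "tsne_y": rng.normal(size=50),
        "a2": rng.uniform(-10, 10, size=50),
    }
)

# Color each point by the excluded target column to see how it varies across the embedding
ax = sns.scatterplot(data=demo_df, x="tsne_x", y="tsne_y", hue="a2", palette="viridis")
ax.set_title("t-SNE embedding colored by a2")
```

In the notebook, you would replace `demo_df` with `reduced_df` and the plot renders inline.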


## Upload back to DataRobot

In [42]:
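# upload the reduced dataframe as a new dataset; `ds` now refers to the new dataset
# note: the source dataset name already ends in ".csv", so this file name ends in ".csv.csv"
ds = dr.Dataset.create_from_in_memory_data(
    data_frame=reduced_df, fname=f"{ds.name}.csv"
)
# rename the uploaded dataset to mark it as t-SNE reduced
ds.modify(name=f"{ds.name} t-SNE Reduced")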
ds

Dataset(name='Double Pendulum.csv.csv t-SNE Reduced', id='65a970bc040d9a438cdfb9de')
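The output above shows a doubled `.csv.csv` extension, because the source dataset name already ended in `.csv`. One possible way to avoid this is to strip the extension before building the file name and the display name. The helper `strip_csv_suffix` below is hypothetical and uses only string operations.

```python
def strip_csv_suffix(name):
    """Return name without a trailing '.csv' (case-insensitive), if present."""
    return name[:-4] if name.lower().endswith(".csv") else name


# Source dataset name as implied by the output above
base = strip_csv_suffix("Double Pendulum.csv")
fname = f"{base}.csv"
display_name = f"{base} t-SNE Reduced"
print(fname)         # Double Pendulum.csv
print(display_name)  # Double Pendulum t-SNE Reduced
```

These values could then be passed as `fname=fname` to `create_from_in_memory_data` and as `name=display_name` to `modify`.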