## HTML Scatterplot with Labels

This document shows how to visualise the Iris data in 2D by plotting two of the original features directly (no PCA), using an interactive scatterplot in which labels can be attached to the data points.

This post (https://www.kaggle.com/mehmetkasap/plotly-scatter-bar-pie-chart-and-word-cloud) shows how to attach labels to data points on a `plotly` scatterplot.

To run the notebook, JupyterLab can be installed with `conda install -c conda-forge jupyterlab`.

In [1]:
from sklearn.datasets import load_iris
import pandas as pd
import cufflinks as cf
import plotly.offline

# Switch cufflinks to offline mode so that plots are rendered locally.
cf.go_offline()

# The two commented-out lines below were required to make this example work outside Jupyter notebooks.
#cf.set_config_file(theme='ggplot',sharing='public',offline=True)
#setattr(plotly.offline, "__PLOTLY_OFFLINE_INITIALIZED", True) # works around an error in plotly


### Load and print iris data

In [2]:
data = load_iris()
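# remove the entries that make the output messy
data.pop("DESCR")
# "filename" is not present in every scikit-learn version, so pass a default to avoid a KeyError
data.pop("filename", None)
# the dictionary-like object returned by sklearn has the attributes: data, target, target_names, feature_names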
print(data)


{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
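Printing the whole object produces a long dump. As a minimal sketch, the same dictionary-like object can instead be inspected through its individual attributes. The variable name `iris` is only illustrative and is not used elsewhere in this document.

```python
from sklearn.datasets import load_iris

# Load a fresh copy; `iris` is an illustrative name chosen for this sketch.
iris = load_iris()

# Shape of the feature matrix: (number of samples, number of features).
print(iris.data.shape)
# Names of the features, in column order.
print(iris.feature_names)
# Names of the classes; the integer codes in iris.target index into this array.
print(iris.target_names.tolist())
```

This gives the same overview as `print(data)` without listing all rows.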
     

### Create a pandas data frame, df

From here on, we use `df` instead of `data`.

In [3]:
# we create a dataframe with the data stored in the attribute "data" of the object data
df = pd.DataFrame(data['data'], columns=data['feature_names'])
# add the target column to the data frame (this is numerical)
df['target'] = data['target']
# add the species column to the data frame (this one is categorical); pandas maps the numeric codes to names in a single call
df['species'] = pd.Categorical.from_codes(data.target, data.target_names)

# print the head of the current frame
print(df.head())


   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target species  
0       0  setosa  
1       0  setosa  
2       0  setosa  
3       0  setosa  
4       0  setosa  
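To make the `pd.Categorical.from_codes` step more concrete, here is a small sketch on hand-written codes. The lists `codes` and `names` are illustrative stand-ins for `data.target` and `data.target_names`.

```python
import pandas as pd

# Illustrative stand-ins for data.target and data.target_names.
codes = [0, 0, 1, 2, 1]
names = ["setosa", "versicolor", "virginica"]

# Each integer code is used as an index into `names`.
species = pd.Categorical.from_codes(codes, categories=names)

print(list(species))
# The original integer codes stay available on the categorical.
print(species.codes.tolist())
```

Each code selects the name at that position, so code `0` becomes `setosa`, `1` becomes `versicolor`, and `2` becomes `virginica`.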


We can now visualise the current data frame.

In [4]:
df.iplot(kind="scatter", categories="species", x="sepal length (cm)", y="sepal width (cm)")




The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead
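
The warning above does not affect the plot. If it is distracting, one possible approach is to filter it by its message with the standard `warnings` module. This is a sketch only: the names `PANDAS_NP_MESSAGE` and `silence_pandas_np_warning` are made up for this example, and the filter matches on the message text alone.

```python
import warnings

# Message prefix copied from the warning printed above. It is treated as a
# regular expression and matched against the start of the warning text.
PANDAS_NP_MESSAGE = r"The pandas\.np module is deprecated"


def silence_pandas_np_warning():
    """Install a filter that hides only the pandas.np deprecation warning."""
    warnings.filterwarnings("ignore", message=PANDAS_NP_MESSAGE)


# Call this before the first plotting call.
silence_pandas_np_warning()
```

Other warnings are still shown, because the filter only matches this one message.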





Next, we will construct a scatterplot with custom labels.

In [5]:
# We will create a new column called textlabels that will store our labels
# Method 1: apply a function to each row (axis=1); apply can work per column or per row, and the row-wise form is likely to be slow
# df['textlabels'] = df[df.columns].apply(lambda row: str("(") + '_'.join(row.values.astype(str)) + str(")"), axis=1)
# Method 2: column-wise string concatenation, which is faster and easier for many to understand
df['textlabels'] = "(" + df['sepal length (cm)'].astype(str) + "," \
                       + df['sepal width (cm)'].astype(str) + "," \
                       + df['petal length (cm)'].astype(str) + "," \
                       + df['petal width (cm)'].astype(str) + ")"

# print the head of the extended frame
print(df.head())


   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target species         textlabels  
0       0  setosa  (5.1,3.5,1.4,0.2)  
1       0  setosa  (4.9,3.0,1.4,0.2)  
2       0  setosa  (4.7,3.2,1.3,0.2)  
3       0  setosa  (4.6,3.1,1.5,0.2)  
4       0  setosa  (5.0,3.6,1.4,0.2)  
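
Method 2 can also be wrapped in a small helper so the column names are not repeated by hand. The sketch below is one possible way to do this with `Series.str.cat`. The helper name `make_labels` and the frame `toy` are hypothetical, and the helper expects at least two columns.

```python
import pandas as pd


def make_labels(frame, columns, sep=","):
    """Join the given columns into one '(a,b,...)' string per row."""
    # Convert every selected column to strings, one Series per column.
    parts = [frame[col].astype(str) for col in columns]
    # Concatenate column-wise; no Python-level loop over the rows is needed.
    joined = parts[0].str.cat(parts[1:], sep=sep)
    return "(" + joined + ")"


# Two rows copied from the head of the frame shown above.
toy = pd.DataFrame(
    {
        "sepal length (cm)": [5.1, 4.9],
        "sepal width (cm)": [3.5, 3.0],
        "petal length (cm)": [1.4, 1.4],
        "petal width (cm)": [0.2, 0.2],
    }
)
toy["textlabels"] = make_labels(toy, list(toy.columns))
print(toy["textlabels"].tolist())
```

On the full frame, the equivalent call would be `make_labels(df, data['feature_names'])`.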


Plot the final, interactive scatterplot.

In [6]:
df.iplot(kind="scatter", theme="white", x="sepal length (cm)", y="sepal width (cm)", categories="species", text="textlabels")



The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead





In [7]:
# print python version
import sys
print("Python version")
print(sys.version)
# print date and time
from datetime import datetime
now = datetime.now()
# dd/mm/YY H:M:S
print("Last compiled on: ", now.strftime("%d/%m/%Y %H:%M:%S"))


Python version
3.8.5 (default, Sep  4 2020, 07:30:14) 
[GCC 7.3.0]
Last compiled on:  16/02/2022 21:14:32
