# Data Exploration
- This notebook performs exploratory data analysis on the dataset.
- To expand on the analysis, attach this notebook to a cluster with runtime version **16.3.x-cpu-ml-scala2.12**,
edit [the options of ydata-profiling (formerly pandas-profiling)](https://pandas-profiling.ydata.ai/docs/master/rtd/pages/advanced_usage.html), and rerun it.
- Explore completed trials in the [MLflow experiment](#mlflow/experiments/4297320214106172).
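
As a rough illustration of what "editing the options" could look like, the sketch below builds a `correlations` mapping in which only selected measures are computed. The helper name `correlation_options` is hypothetical; the correlation names are the ones that appear in the profiling cell of this notebook.

```python
# Correlation names taken from the profiling cell in this notebook.
CORRELATION_NAMES = ("auto", "pearson", "spearman", "kendall", "phi_k", "cramers")


def correlation_options(enabled):
    """Hypothetical helper: build a `correlations` mapping with only `enabled` switched on."""
    unknown = set(enabled) - set(CORRELATION_NAMES)
    if unknown:
        raise ValueError(f"unknown correlation names: {sorted(unknown)}")
    return {name: {"calculate": name in enabled} for name in CORRELATION_NAMES}


options = correlation_options({"auto", "cramers"})
# The mapping could then be passed to the report, for example:
# ProfileReport(df, correlations=options, title="Profiling Report")
```

Switching off measures that are not needed is one way to shorten the rerun on larger datasets.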

In [0]:
%pip install --no-deps ydata-profiling==4.8.3 pandas==2.2.3 visions==0.7.6 tzdata==2024.2

Collecting ydata-profiling==4.8.3
  Using cached ydata_profiling-4.8.3-py2.py3-none-any.whl.metadata (20 kB)


Collecting pandas==2.2.3
  Using cached pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting visions==0.7.6
  Using cached visions-0.7.6-py3-none-any.whl.metadata (11 kB)


Collecting tzdata==2024.2
  Using cached tzdata-2024.2-py2.py3-none-any.whl.metadata (1.4 kB)
Using cached ydata_profiling-4.8.3-py2.py3-none-any.whl (359 kB)


Using cached pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)


Using cached visions-0.7.6-py3-none-any.whl (104 kB)
Using cached tzdata-2024.2-py2.py3-none-any.whl (346 kB)


Installing collected packages: ydata-profiling, visions, tzdata, pandas
  Attempting uninstall: ydata-profiling
    Found existing installation: ydata-profiling 4.9.0
    Not uninstalling ydata-profiling at /databricks/python3/lib/python3.12/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-685cd42f-327f-43bf-8b98-44b23c280f70
    Can't uninstall 'ydata-profiling'. No files were found to uninstall.


  Attempting uninstall: visions
    Found existing installation: visions 0.7.5
    Not uninstalling visions at /databricks/python3/lib/python3.12/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-685cd42f-327f-43bf-8b98-44b23c280f70
    Can't uninstall 'visions'. No files were found to uninstall.


  Attempting uninstall: pandas


    Found existing installation: pandas 1.5.3
    Not uninstalling pandas at /databricks/python3/lib/python3.12/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-685cd42f-327f-43bf-8b98-44b23c280f70
    Can't uninstall 'pandas'. No files were found to uninstall.


Successfully installed pandas-2.2.3 tzdata-2024.2 visions-0.7.6 ydata-profiling-4.8.3



[notice] A new release of pip is available: 24.0 -> 25.2
[notice] To update, run: pip install --upgrade pip


Note: you may need to restart the kernel to use updated packages.


In [0]:
import mlflow
import os
import uuid
import shutil
import pandas as pd
import databricks.automl_runtime

# Download input data from mlflow into a pandas DataFrame
# Create temporary directory to download data
temp_dir = os.path.join(os.environ["SPARK_LOCAL_DIRS"], "tmp", str(uuid.uuid4())[:8])
os.makedirs(temp_dir)

# Download the artifact and read it
training_data_path = mlflow.artifacts.download_artifacts(run_id="e1548ae7be39486e98423fb24429ad25", artifact_path="data", dst_path=temp_dir)
df = pd.read_parquet(os.path.join(training_data_path, "training_data"))

# Delete the temporary data
shutil.rmtree(temp_dir)

target_col = "Churn"

# Drop columns created by AutoML and user-specified sample weight column (if applicable) before pandas-profiling
df = df.drop(['_automl_split_col_0000'], axis=1)

Thu Aug  7 04:36:48 2025 Connection to spark from PID  67889
Thu Aug  7 04:36:48 2025 Initialized gateway on port 39799


Thu Aug  7 04:36:48 2025 Connected to spark.


Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]

## Profiling Results

In [0]:
from ydata_profiling import ProfileReport
df_profile = ProfileReport(df,
                           correlations={
                               "auto": {"calculate": True},
                               "pearson": {"calculate": True},
                               "spearman": {"calculate": True},
                               "kendall": {"calculate": True},
                               "phi_k": {"calculate": True},
                               "cramers": {"calculate": True},
                           }, title="Profiling Report", progress_bar=False, infer_dtypes=False)
profile_html = df_profile.to_html()

displayHTML(profile_html)



Dataset statistics,Value
Number of variables,8
Number of observations,7043
Missing cells,0
Missing cells (%),0.0%
Duplicate rows,548
Duplicate rows (%),7.8%
Total size in memory,440.3 KiB
Average record size in memory,64.0 B

Variable type,Count
Text,8

Alert,Type
Dataset has 548 (7.8%) duplicate rows,Duplicates
InternetService is highly overall correlated with Contract,High correlation
Contract is highly overall correlated with InternetService,High correlation

Reproduction,Value
Analysis started,2025-08-07 04:36:51.457374
Analysis finished,2025-08-07 04:36:54.371454
Duration,2.91 seconds
Software version,ydata-profiling v4.8.3
Download configuration,config.json

### Gender

Statistic,Value
Distinct,2
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,55.2 KiB

Statistic,Value
Max length,6.0
Median length,4.0
Mean length,4.990487
Min length,4.0

Statistic,Value
Total characters,35148
Distinct characters,6
Distinct categories,2
Distinct scripts,1
Distinct blocks,1

Statistic,Value
Unique,0
Unique (%),0.0%

Row,Value
1st row,Female
2nd row,Female
3rd row,Male
4th row,Female
5th row,Male

Value,Count,Frequency (%)
male,3555,50.5%
female,3488,49.5%

Value,Count,Frequency (%)
e,10531,30.0%
a,7043,20.0%
l,7043,20.0%
M,3555,10.1%
F,3488,9.9%
m,3488,9.9%

Value,Count,Frequency (%)
Lowercase Letter,28105,80.0%
Uppercase Letter,7043,20.0%

Value,Count,Frequency (%)
e,10531,37.5%
a,7043,25.1%
l,7043,25.1%
m,3488,12.4%

Value,Count,Frequency (%)
M,3555,50.5%
F,3488,49.5%

Value,Count,Frequency (%)
Latin,35148,100.0%

Value,Count,Frequency (%)
e,10531,30.0%
a,7043,20.0%
l,7043,20.0%
M,3555,10.1%
F,3488,9.9%
m,3488,9.9%

Value,Count,Frequency (%)
ASCII,35148,100.0%

Value,Count,Frequency (%)
e,10531,30.0%
a,7043,20.0%
l,7043,20.0%
M,3555,10.1%
F,3488,9.9%
m,3488,9.9%

### SeniorCitizen

Statistic,Value
Distinct,2
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,55.2 KiB

Statistic,Value
Max length,1
Median length,1
Mean length,1
Min length,1

Statistic,Value
Total characters,7043
Distinct characters,2
Distinct categories,1
Distinct scripts,1
Distinct blocks,1

Statistic,Value
Unique,0
Unique (%),0.0%

Row,Value
1st row,0
2nd row,0
3rd row,0
4th row,0
5th row,0

Value,Count,Frequency (%)
0,5901,83.8%
1,1142,16.2%

Value,Count,Frequency (%)
0,5901,83.8%
1,1142,16.2%

Value,Count,Frequency (%)
Decimal Number,7043,100.0%

Value,Count,Frequency (%)
0,5901,83.8%
1,1142,16.2%

Value,Count,Frequency (%)
Common,7043,100.0%

Value,Count,Frequency (%)
0,5901,83.8%
1,1142,16.2%

Value,Count,Frequency (%)
ASCII,7043,100.0%

Value,Count,Frequency (%)
0,5901,83.8%
1,1142,16.2%

### Partner

Statistic,Value
Distinct,2
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,55.2 KiB

Statistic,Value
Max length,3.0
Median length,2.0
Mean length,2.4830328
Min length,2.0

Statistic,Value
Total characters,17488
Distinct characters,5
Distinct categories,2
Distinct scripts,1
Distinct blocks,1

Statistic,Value
Unique,0
Unique (%),0.0%

Row,Value
1st row,Yes
2nd row,No
3rd row,No
4th row,No
5th row,No

Value,Count,Frequency (%)
no,3641,51.7%
yes,3402,48.3%

Value,Count,Frequency (%)
N,3641,20.8%
o,3641,20.8%
Y,3402,19.5%
e,3402,19.5%
s,3402,19.5%

Value,Count,Frequency (%)
Lowercase Letter,10445,59.7%
Uppercase Letter,7043,40.3%

Value,Count,Frequency (%)
o,3641,34.9%
e,3402,32.6%
s,3402,32.6%

Value,Count,Frequency (%)
N,3641,51.7%
Y,3402,48.3%

Value,Count,Frequency (%)
Latin,17488,100.0%

Value,Count,Frequency (%)
N,3641,20.8%
o,3641,20.8%
Y,3402,19.5%
e,3402,19.5%
s,3402,19.5%

Value,Count,Frequency (%)
ASCII,17488,100.0%

Value,Count,Frequency (%)
N,3641,20.8%
o,3641,20.8%
Y,3402,19.5%
e,3402,19.5%
s,3402,19.5%

### InternetService

Statistic,Value
Distinct,3
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,55.2 KiB

Statistic,Value
Max length,11.0
Median length,3.0
Mean length,6.3000142
Min length,2.0

Statistic,Value
Total characters,44371
Distinct characters,14
Distinct categories,3
Distinct scripts,2
Distinct blocks,1

Statistic,Value
Unique,0
Unique (%),0.0%

Row,Value
1st row,DSL
2nd row,Fiber optic
3rd row,Fiber optic
4th row,DSL
5th row,No

Value,Count,Frequency (%)
fiber,3096,30.5%
optic,3096,30.5%
dsl,2421,23.9%
no,1526,15.1%

Value,Count,Frequency (%)
i,6192,14.0%
o,4622,10.4%
F,3096,7.0%
b,3096,7.0%
e,3096,7.0%
r,3096,7.0%
(space),3096,7.0%
p,3096,7.0%
t,3096,7.0%
c,3096,7.0%

Value,Count,Frequency (%)
Lowercase Letter,29390,66.2%
Uppercase Letter,11885,26.8%
Space Separator,3096,7.0%

Value,Count,Frequency (%)
i,6192,21.1%
o,4622,15.7%
b,3096,10.5%
e,3096,10.5%
r,3096,10.5%
p,3096,10.5%
t,3096,10.5%
c,3096,10.5%

Value,Count,Frequency (%)
F,3096,26.0%
D,2421,20.4%
S,2421,20.4%
L,2421,20.4%
N,1526,12.8%

Value,Count,Frequency (%)
(space),3096,100.0%

Value,Count,Frequency (%)
Latin,41275,93.0%
Common,3096,7.0%

Value,Count,Frequency (%)
i,6192,15.0%
o,4622,11.2%
F,3096,7.5%
b,3096,7.5%
e,3096,7.5%
r,3096,7.5%
p,3096,7.5%
t,3096,7.5%
c,3096,7.5%
D,2421,5.9%

Value,Count,Frequency (%)
(space),3096,100.0%

Value,Count,Frequency (%)
ASCII,44371,100.0%

Value,Count,Frequency (%)
i,6192,14.0%
o,4622,10.4%
F,3096,7.0%
b,3096,7.0%
e,3096,7.0%
r,3096,7.0%
(space),3096,7.0%
p,3096,7.0%
t,3096,7.0%
c,3096,7.0%

### Contract

Statistic,Value
Distinct,3
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,55.2 KiB

Statistic,Value
Max length,14.0
Median length,14.0
Mean length,11.30115
Min length,8.0

Statistic,Value
Total characters,79594
Distinct characters,15
Distinct categories,4
Distinct scripts,2
Distinct blocks,1

Statistic,Value
Unique,0
Unique (%),0.0%

Row,Value
1st row,Month-to-month
2nd row,Month-to-month
3rd row,Month-to-month
4th row,Month-to-month
5th row,Two year

Value,Count,Frequency (%)
month-to-month,3875,37.9%
year,3168,31.0%
two,1695,16.6%
one,1473,14.4%

Value,Count,Frequency (%)
o,13320,16.7%
t,11625,14.6%
n,9223,11.6%
h,7750,9.7%
-,7750,9.7%
e,4641,5.8%
M,3875,4.9%
m,3875,4.9%
(space),3168,4.0%
y,3168,4.0%

Value,Count,Frequency (%)
Lowercase Letter,61633,77.4%
Dash Punctuation,7750,9.7%
Uppercase Letter,7043,8.8%
Space Separator,3168,4.0%

Value,Count,Frequency (%)
o,13320,21.6%
t,11625,18.9%
n,9223,15.0%
h,7750,12.6%
e,4641,7.5%
m,3875,6.3%
y,3168,5.1%
a,3168,5.1%
r,3168,5.1%
w,1695,2.8%

Value,Count,Frequency (%)
M,3875,55.0%
T,1695,24.1%
O,1473,20.9%

Value,Count,Frequency (%)
-,7750,100.0%

Value,Count,Frequency (%)
(space),3168,100.0%

Value,Count,Frequency (%)
Latin,68676,86.3%
Common,10918,13.7%

Value,Count,Frequency (%)
o,13320,19.4%
t,11625,16.9%
n,9223,13.4%
h,7750,11.3%
e,4641,6.8%
M,3875,5.6%
m,3875,5.6%
y,3168,4.6%
a,3168,4.6%
r,3168,4.6%

Value,Count,Frequency (%)
-,7750,71.0%
(space),3168,29.0%

Value,Count,Frequency (%)
ASCII,79594,100.0%

Value,Count,Frequency (%)
o,13320,16.7%
t,11625,14.6%
n,9223,11.6%
h,7750,9.7%
-,7750,9.7%
e,4641,5.8%
M,3875,4.9%
m,3875,4.9%
(space),3168,4.0%
y,3168,4.0%

### PaperlessBilling

Statistic,Value
Distinct,2
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,55.2 KiB

Statistic,Value
Max length,3.0
Median length,3.0
Mean length,2.5922192
Min length,2.0

Statistic,Value
Total characters,18257
Distinct characters,5
Distinct categories,2
Distinct scripts,1
Distinct blocks,1

Statistic,Value
Unique,0
Unique (%),0.0%

Row,Value
1st row,Yes
2nd row,Yes
3rd row,Yes
4th row,No
5th row,No

Value,Count,Frequency (%)
yes,4171,59.2%
no,2872,40.8%

Value,Count,Frequency (%)
Y,4171,22.8%
e,4171,22.8%
s,4171,22.8%
N,2872,15.7%
o,2872,15.7%

Value,Count,Frequency (%)
Lowercase Letter,11214,61.4%
Uppercase Letter,7043,38.6%

Value,Count,Frequency (%)
e,4171,37.2%
s,4171,37.2%
o,2872,25.6%

Value,Count,Frequency (%)
Y,4171,59.2%
N,2872,40.8%

Value,Count,Frequency (%)
Latin,18257,100.0%

Value,Count,Frequency (%)
Y,4171,22.8%
e,4171,22.8%
s,4171,22.8%
N,2872,15.7%
o,2872,15.7%

Value,Count,Frequency (%)
ASCII,18257,100.0%

Value,Count,Frequency (%)
Y,4171,22.8%
e,4171,22.8%
s,4171,22.8%
N,2872,15.7%
o,2872,15.7%

### PaymentMethod

Statistic,Value
Distinct,4
Distinct (%),0.1%
Missing,0
Missing (%),0.0%
Memory size,55.2 KiB

Statistic,Value
Max length,25.0
Median length,23.0
Mean length,18.570212
Min length,12.0

Statistic,Value
Total characters,130790
Distinct characters,23
Distinct categories,5
Distinct scripts,2
Distinct blocks,1

Statistic,Value
Unique,0
Unique (%),0.0%

Row,Value
1st row,Electronic check
2nd row,Electronic check
3rd row,Credit card (automatic)
4th row,Mailed check
5th row,Credit card (automatic)

Value,Count,Frequency (%)
check,3977,23.2%
automatic,3066,17.9%
electronic,2365,13.8%
mailed,1612,9.4%
bank,1544,9.0%
transfer,1544,9.0%
credit,1522,8.9%
card,1522,8.9%

Value,Count,Frequency (%)
c,17272,13.2%
a,12354,9.4%
t,11563,8.8%
e,11020,8.4%
(space),10109,7.7%
i,8565,6.5%
r,8497,6.5%
k,5521,4.2%
n,5453,4.2%
o,5431,4.2%

Value,Count,Frequency (%)
Lowercase Letter,107506,82.2%
Space Separator,10109,7.7%
Uppercase Letter,7043,5.4%
Open Punctuation,3066,2.3%
Close Punctuation,3066,2.3%

Value,Count,Frequency (%)
c,17272,16.1%
a,12354,11.5%
t,11563,10.8%
e,11020,10.3%
i,8565,8.0%
r,8497,7.9%
k,5521,5.1%
n,5453,5.1%
o,5431,5.1%
d,4656,4.3%

Value,Count,Frequency (%)
E,2365,33.6%
M,1612,22.9%
B,1544,21.9%
C,1522,21.6%

Value,Count,Frequency (%)
(space),10109,100.0%

Value,Count,Frequency (%)
(,3066,100.0%

Value,Count,Frequency (%)
),3066,100.0%

Value,Count,Frequency (%)
Latin,114549,87.6%
Common,16241,12.4%

Value,Count,Frequency (%)
c,17272,15.1%
a,12354,10.8%
t,11563,10.1%
e,11020,9.6%
i,8565,7.5%
r,8497,7.4%
k,5521,4.8%
n,5453,4.8%
o,5431,4.7%
d,4656,4.1%

Value,Count,Frequency (%)
(space),10109,62.2%
(,3066,18.9%
),3066,18.9%

Value,Count,Frequency (%)
ASCII,130790,100.0%

Value,Count,Frequency (%)
c,17272,13.2%
a,12354,9.4%
t,11563,8.8%
e,11020,8.4%
(space),10109,7.7%
i,8565,6.5%
r,8497,6.5%
k,5521,4.2%
n,5453,4.2%
o,5431,4.2%

### Churn

Statistic,Value
Distinct,2
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,55.2 KiB

Statistic,Value
Max length,3.0
Median length,2.0
Mean length,2.2653699
Min length,2.0

Statistic,Value
Total characters,15955
Distinct characters,5
Distinct categories,2
Distinct scripts,1
Distinct blocks,1

Statistic,Value
Unique,0
Unique (%),0.0%

Row,Value
1st row,No
2nd row,Yes
3rd row,No
4th row,No
5th row,No

Value,Count,Frequency (%)
no,5174,73.5%
yes,1869,26.5%

Value,Count,Frequency (%)
N,5174,32.4%
o,5174,32.4%
Y,1869,11.7%
e,1869,11.7%
s,1869,11.7%

Value,Count,Frequency (%)
Lowercase Letter,8912,55.9%
Uppercase Letter,7043,44.1%

Value,Count,Frequency (%)
o,5174,58.1%
e,1869,21.0%
s,1869,21.0%

Value,Count,Frequency (%)
N,5174,73.5%
Y,1869,26.5%

Value,Count,Frequency (%)
Latin,15955,100.0%

Value,Count,Frequency (%)
N,5174,32.4%
o,5174,32.4%
Y,1869,11.7%
e,1869,11.7%
s,1869,11.7%

Value,Count,Frequency (%)
ASCII,15955,100.0%

Value,Count,Frequency (%)
N,5174,32.4%
o,5174,32.4%
Y,1869,11.7%
e,1869,11.7%
s,1869,11.7%

Variable,Gender,SeniorCitizen,Partner,InternetService,Contract,PaperlessBilling,PaymentMethod,Churn
Gender,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SeniorCitizen,0.0,1.0,0.017,0.161,0.086,0.242,0.293,0.233
Partner,0.0,0.017,1.0,0.0,0.18,0.013,0.243,0.233
InternetService,0.0,0.161,0.0,1.0,0.505,0.231,0.324,0.196
Contract,0.0,0.086,0.18,0.505,1.0,0.107,0.277,0.252
PaperlessBilling,0.0,0.242,0.013,0.231,0.107,1.0,0.37,0.296
PaymentMethod,0.0,0.293,0.243,0.324,0.277,0.37,1.0,0.449
Churn,0.0,0.233,0.233,0.196,0.252,0.296,0.449,1.0
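
The scores in this matrix lie between 0 and 1, which is what an association measure for categorical pairs such as Cramér's V produces. As a minimal sketch, assuming that kind of measure, the function below computes the plain (uncorrected) Cramér's V in pure Python on toy data. It is not necessarily the exact computation ydata-profiling performs, and the function name `cramers_v` is chosen here for illustration.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Plain (uncorrected) Cramér's V between two equally long categorical sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("inputs must be non-empty and of equal length")
    joint = Counter(zip(xs, ys))   # observed counts per (x, y) cell
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        return 0.0                 # a constant column carries no association
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


# Toy data (not taken from the dataset): perfectly associated vs. independent.
perfect = cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"])
independent = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
```

On the toy inputs, the perfectly associated pair scores at the top of the range and the independent pair at the bottom, which is the scale on which values such as 0.505 for InternetService and Contract can be read.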

Index,Gender,SeniorCitizen,Partner,InternetService,Contract,PaperlessBilling,PaymentMethod,Churn
0,Female,0,Yes,DSL,Month-to-month,Yes,Electronic check,No
1,Female,0,No,Fiber optic,Month-to-month,Yes,Electronic check,Yes
2,Male,0,No,Fiber optic,Month-to-month,Yes,Credit card (automatic),No
3,Female,0,No,DSL,Month-to-month,No,Mailed check,No
4,Male,0,No,No,Two year,No,Credit card (automatic),No
5,Male,0,Yes,Fiber optic,One year,No,Credit card (automatic),No
6,Male,0,No,Fiber optic,Month-to-month,Yes,Electronic check,No
7,Female,0,No,No,One year,No,Mailed check,No
8,Female,0,No,Fiber optic,Month-to-month,Yes,Electronic check,No
9,Male,1,No,DSL,Month-to-month,Yes,Electronic check,Yes

Index,Gender,SeniorCitizen,Partner,InternetService,Contract,PaperlessBilling,PaymentMethod,Churn
7033,Male,0,Yes,Fiber optic,Month-to-month,Yes,Mailed check,Yes
7034,Male,0,No,Fiber optic,One year,Yes,Electronic check,No
7035,Male,0,No,DSL,Month-to-month,No,Mailed check,No
7036,Female,0,Yes,DSL,Two year,No,Bank transfer (automatic),No
7037,Male,0,No,Fiber optic,Month-to-month,Yes,Credit card (automatic),No
7038,Female,0,No,Fiber optic,Month-to-month,Yes,Credit card (automatic),Yes
7039,Male,0,Yes,DSL,One year,Yes,Mailed check,No
7040,Female,0,Yes,Fiber optic,One year,Yes,Credit card (automatic),No
7041,Female,0,Yes,DSL,Month-to-month,Yes,Electronic check,No
7042,Male,1,Yes,Fiber optic,Month-to-month,Yes,Mailed check,Yes

Index,Gender,SeniorCitizen,Partner,InternetService,Contract,PaperlessBilling,PaymentMethod,Churn,# duplicates
322,Male,0,No,Fiber optic,Month-to-month,Yes,Electronic check,Yes,150
47,Female,0,No,Fiber optic,Month-to-month,Yes,Electronic check,Yes,136
225,Female,1,No,Fiber optic,Month-to-month,Yes,Electronic check,Yes,84
347,Male,0,No,No,Month-to-month,No,Mailed check,No,84
46,Female,0,No,Fiber optic,Month-to-month,Yes,Electronic check,No,79
417,Male,0,Yes,Fiber optic,Month-to-month,Yes,Electronic check,Yes,76
321,Male,0,No,Fiber optic,Month-to-month,Yes,Electronic check,No,72
143,Female,0,Yes,Fiber optic,Month-to-month,Yes,Electronic check,Yes,69
495,Male,1,No,Fiber optic,Month-to-month,Yes,Electronic check,Yes,65
193,Female,0,Yes,No,Two year,No,Mailed check,No,64
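
A table like the one above groups identical rows and reports how often each combination occurs. As a minimal sketch of that idea, the pure-Python snippet below counts repeated rows in a toy list of tuples. The function name `duplicate_groups` and the sample rows are illustrative assumptions, and the report's own counting rules may differ. With pandas, `DataFrame.duplicated` offers a related starting point.

```python
from collections import Counter


def duplicate_groups(rows):
    """Return (row, count) pairs for rows occurring more than once, most frequent first."""
    counts = Counter(rows)
    groups = [(row, n) for row, n in counts.items() if n > 1]
    return sorted(groups, key=lambda item: item[1], reverse=True)


# Toy rows (illustrative only): (Gender, Partner, InternetService)
rows = [
    ("Male", "No", "Fiber optic"),
    ("Female", "No", "DSL"),
    ("Male", "No", "Fiber optic"),
    ("Male", "No", "Fiber optic"),
    ("Female", "No", "DSL"),
    ("Female", "Yes", "No"),
]
groups = duplicate_groups(rows)
```

With only eight low-cardinality columns, many customers share the same combination of values, so a high duplicate count does not by itself show that records were entered twice.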
