# Unsupervised discretization
Dataset: clean_adult (L)

Updated at: 15 March 2023

By: Sam

### About Dataset
Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
- 48842 instances, mix of continuous and discrete    (train=32561, test=16281)
- 45222 if instances with unknown values are removed (train=30162, test=15060)
- Duplicate or conflicting instances : 6
- Class probabilities for adult.all file
- Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
- Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)

ATTRIBUTE:
*Continuous attributes* : 6
- age: continuous.
- fnlwgt: continuous.
- education-num: continuous.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.

*Categorical attributes* : 8
- workclass
- education
- marital-status
- occupation
- relationship
- race
- sex
- native-country

In [1]:
# Load library
import pandas as pd
import numpy as np
import time
import timeit

In [2]:
from sklearn.preprocessing import KBinsDiscretizer as kbins # scikit-learn alternative for unsupervised binning (not used below)

In [3]:
from feature_engine.discretisation import EqualFrequencyDiscretiser as efd
from feature_engine.discretisation import EqualWidthDiscretiser as ewd

In [4]:
# Load dataset
data = pd.read_csv('clean_adult.csv')

In [5]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48842 non-null  object
 14  class           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [7]:
# get list of numeric attributes to discretize
num_col = data.select_dtypes(include=np.number).columns
num_col = num_col.tolist()

In [8]:
num_col


['age',
 'fnlwgt',
 'education-num',
 'capital-gain',
 'capital-loss',
 'hours-per-week']

In [10]:
# Import label encoder
from sklearn import preprocessing
  
# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
  
# Encode labels in column 'class' ('<=50K' -> 0, '>50K' -> 1).
data['class'] = label_encoder.fit_transform(data['class'])
  
data['class'].unique()

array([0, 1])
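
LabelEncoder assigns integers to the class labels in sorted order, which is why '<=50K' maps to 0 and '>50K' to 1. The following is a NumPy-only sketch of that sorted mapping on a toy sample (the `labels` array is made up for illustration and is not the real column):

```python
import numpy as np

labels = np.array(["<=50K", ">50K", "<=50K", "<=50K", ">50K"])   # toy sample

# np.unique returns the sorted distinct labels and each row's index into them
classes, codes = np.unique(labels, return_inverse=True)

print(classes.tolist())   # ['<=50K', '>50K']
print(codes.tolist())     # [0, 1, 0, 0, 1]
```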

## Equal Width Discretization

In [11]:
# Define function. Inputs: dataset, number of bins (k)

def ewd_disc(data, k):
    '''
    Discretise the numeric columns listed in num_col into k equal-width bins.

    Parameters of EqualWidthDiscretiser
    -----------------------------------
    bins : int, default=10
        Desired number of equal width intervals / bins.

    variables : list
        The list of numerical variables to transform. If None, the
        discretiser will automatically select all numerical type variables.

    return_object : bool, default=False
        Whether the numbers in the discrete variable should be returned as
        numeric or as object. The decision should be made by the user based on
        whether they would like to proceed the engineering of the variable as
        if it was numerical or categorical.

    return_boundaries: bool, default=False
        whether the output should be the interval boundaries. If True, it returns
        the interval boundaries. If False, it returns integers.
    '''
    ## set up the discretisation transformer
    # (named differently from the function so the function name is not shadowed)
    discretiser = ewd(bins=k, variables=num_col, return_boundaries=False)
    ## fit the transformer
    discretiser.fit(data)
    ## transform the data
    data_ewd = discretiser.transform(data)
    ## discretiser.binner_dict_ stores the interval limits identified for each variable
    return data_ewd  # return dataset after discretization
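
To make the idea of equal-width bins concrete, here is a minimal sketch using `numpy.linspace` and `pandas.cut` on five toy values. It only illustrates the technique; it is not claimed to match the internals of EqualWidthDiscretiser, and the `ages` values are made up.

```python
import numpy as np
import pandas as pd

ages = pd.Series([17, 25, 39, 50, 90])   # toy values, not the real column
k = 4

# k equal-width intervals need k + 1 evenly spaced edges between min and max
edges = np.linspace(ages.min(), ages.max(), k + 1)
width = edges[1] - edges[0]

# labels=False returns the integer bin index instead of an Interval
codes = pd.cut(ages, bins=k, labels=False)

print(edges.tolist())   # [17.0, 35.25, 53.5, 71.75, 90.0]
print(codes.tolist())   # [0, 0, 1, 1, 3]
```

Every interval has the same width (18.25 here), but nothing forces the intervals to hold the same number of values: bin 2 stays empty in this toy example.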

### EWD - Scenario 1: k = 4

In [12]:
# Perform discretization
k = 4
start = time.time() # Starting time
data_ewd1 = ewd_disc(data, k)
end = time.time()
ewd_t = end - start
print("Discretization time, EWD, k = ", k,":",ewd_t) # Total time execution

Discretization time, EWD, k =  4 : 0.048660993576049805
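
A single `time.time()` difference of a few hundredths of a second can be noisy. As a hedged sketch, repeating the call and reading `time.perf_counter` gives a steadier figure. `time_call` is a hypothetical helper that is not part of this notebook, and the sorting workload merely stands in for `ewd_disc(data, k)`.

```python
import time

def time_call(func, *args, repeats=5):
    """Return the fastest wall-clock duration (seconds) over several runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()   # monotonic, high-resolution clock
        func(*args)
        durations.append(time.perf_counter() - start)
    return min(durations)

# Toy workload standing in for ewd_disc(data, k)
best = time_call(sorted, list(range(10_000, 0, -1)))
print("Fastest of 5 runs (s):", best)
```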


In [13]:
# OUTPUT:
data_ewd1.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,1,State-gov,0,Bachelors,3,Never-married,Adm-clerical,Not-in-family,White,Male,0,0,1,United-States,0
1,1,Self-emp-not-inc,0,Bachelors,3,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,0,United-States,0
2,1,Private,0,HS-grad,2,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,1,United-States,0
3,1,Private,0,11th,1,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,1,United-States,0
4,0,Private,0,Bachelors,3,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,1,Cuba,0


In [14]:
## OUTPUT: Check the number of instances in each interval of data_ewd1
# With equal width discretisation, each bin does not necessarily 
# contain the same number of observations.
for col in num_col:
    print(col)
    print(data_ewd1.groupby(col)[col].count())

age
age
0    22346
1    19014
2     6732
3      750
Name: age, dtype: int64
fnlwgt
fnlwgt
0    46477
1     2301
2       54
3       10
Name: fnlwgt, dtype: int64
education-num
education-num
0     1794
1     4614
2    30324
3    12110
Name: education-num, dtype: int64
capital-gain
capital-gain
0    48511
1       87
3      244
Name: capital-gain, dtype: int64
capital-loss
capital-loss
0    46605
1     1896
2      330
3       11
Name: capital-loss, dtype: int64
hours-per-week
hours-per-week
0     5913
1    37494
2     4775
3      660
Name: hours-per-week, dtype: int64
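
In the capital-gain counts above, bin 2 is absent because no instance falls into it; `groupby(...).count()` only lists bins that actually occur. The sketch below, on toy bin indices rather than the real column, reports empty bins explicitly:

```python
import pandas as pd

k = 4
codes = pd.Series([0, 0, 0, 1, 3, 3])   # toy bin indices; bin 2 never occurs

# value_counts only sees observed bins; reindex adds the missing ones as 0
counts = codes.value_counts().reindex(range(k), fill_value=0)
print(counts.tolist())   # [3, 1, 0, 2]
```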


### EWD - Scenario 2: k = 7

In [15]:
# Perform discretization
k = 7
start = time.time() # Starting time
data_ewd2 = ewd_disc(data, k)
end = time.time()
ewd_t = end - start
print("Discretization time, EWD, k = ", k,":", ewd_t) # Total time execution

Discretization time, EWD, k =  7 : 0.04012799263000488


In [16]:
# OUTPUT:
data_ewd2.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,2,State-gov,0,Bachelors,5,Never-married,Adm-clerical,Not-in-family,White,Male,0,0,2,United-States,0
1,3,Self-emp-not-inc,0,Bachelors,5,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,0,United-States,0
2,2,Private,0,HS-grad,3,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,2,United-States,0
3,3,Private,1,11th,2,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,2,United-States,0
4,1,Private,1,Bachelors,5,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,2,Cuba,0


In [17]:
## OUTPUT: Check the number of instances in each interval of data_ewd2
# With equal width discretisation, each bin does not necessarily 
# contain the same number of observations.
for col in num_col:
    print(col)
    print(data_ewd2.groupby(col)[col].count())

age
age
0    12012
1    12962
2    12347
3     6943
4     3577
5      815
6      186
Name: age, dtype: int64
fnlwgt
fnlwgt
0    34670
1    13008
2     1010
3      116
4       25
5        9
6        4
Name: fnlwgt, dtype: int64
education-num
education-num
0      839
1     1711
2     3201
3    16441
4    12939
5     9626
6     4085
Name: education-num, dtype: int64
capital-gain
capital-gain
0    47894
1      695
2        9
6      244
Name: capital-gain, dtype: int64
capital-loss
capital-loss
0    46574
1       46
2      813
3     1346
4       50
5        4
6        9
Name: capital-loss, dtype: int64
hours-per-week
hours-per-week
0     2098
1     4053
2    28963
3     9830
4     3124
5      549
6      225
Name: hours-per-week, dtype: int64


### EWD - Scenario 3: k = 10

In [18]:
# Perform discretization
k = 10
start = time.time() # Starting time
data_ewd3 = ewd_disc(data, k)
end = time.time()
ewd_t = end - start
print("Discretization time, EWD, k = ", k,":", ewd_t) # Total time execution

Discretization time, EWD, k =  10 : 0.040617942810058594


In [19]:
# OUTPUT:
data_ewd3.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,3,State-gov,0,Bachelors,7,Never-married,Adm-clerical,Not-in-family,White,Male,0,0,3,United-States,0
1,4,Self-emp-not-inc,0,Bachelors,7,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,1,United-States,0
2,2,Private,1,HS-grad,5,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,3,United-States,0
3,4,Private,1,11th,3,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,3,United-States,0
4,1,Private,2,Bachelors,7,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,3,Cuba,0


In [20]:
## OUTPUT: Check the number of instances in each interval of data_ewd3
# With equal width discretisation, each bin does not necessarily 
# contain the same number of observations.
for col in num_col:
    print(col)
    print(data_ewd3.groupby(col)[col].count())

age
age
0    8432
1    8686
2    9120
3    9157
4    5965
5    3876
6    2456
7     777
8     277
9      96
Name: age, dtype: int64
fnlwgt
fnlwgt
0    19939
1    22790
2     5225
3      673
4      151
5       35
6       15
7        7
8        3
9        4
Name: fnlwgt, dtype: int64
education-num
education-num
0      330
1     1464
2      756
3     3201
4      657
5    26662
6     2061
7     9626
8     2657
9     1428
Name: education-num, dtype: int64
capital-gain
capital-gain
0    47708
1      753
2      128
3        6
4        3
9      244
Name: capital-gain, dtype: int64
capital-loss
capital-loss
0    46574
1       23
2       29
3      706
4     1169
5      307
6       21
7        2
8        8
9        3
Name: capital-loss, dtype: int64
hours-per-week
hours-per-week
0     1125
1     3328
2     3398
3    26639
4     8917
5     1582
6     2642
7      683
8      315
9      213
Name: hours-per-week, dtype: int64


## Equal Frequency Discretization - EFD
- Reference: https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/discretisation/EqualFrequencyDiscretiser.ipynb
- Parameter:
- q : int, default=10
    Desired number of equal frequency intervals / bins. In other words the
    number of quantiles in which the variables should be divided.

- variables : list
    The list of numerical variables that will be discretised. If None, the
    EqualFrequencyDiscretiser() will select all numerical variables.

- return_object : bool, default=False
    Whether the numbers in the discrete variable should be returned as
    numeric or as object. The decision is made by the user based on
    whether they would like to proceed the engineering of the variable as
    if it was numerical or categorical.

- return_boundaries: bool, default=False
    whether the output should be the interval boundaries. If True, it returns
    the interval boundaries. If False, it returns integers.

### Define function efd_disc: inputs include the dataset and the number of intervals (k)

In [21]:
def efd_disc(data, k):
    ## set up the discretisation transformer
    # (named differently from the function so the function name is not shadowed)
    discretiser = efd(q=k, variables=num_col)
    ## fit the transformer
    discretiser.fit(data)
    ## transform the data
    data_efd = discretiser.transform(data)
    ## discretiser.binner_dict_ stores the interval limits identified for each variable.
    return data_efd

### EFD - Scenario 1: k = 4

In [22]:
# Perform discretization
k = 4
start = time.time() # Starting time
data_efd1 = efd_disc(data, k)
end = time.time()
efd_t = end - start
print("Discretization time, EFD, k = ", k,":", efd_t) # Total time execution

Discretization time, EFD, k =  4 : 0.04237008094787598


In [23]:
## OUTPUT: Check the number of instances in each interval
for col in num_col:
    print(col)
    print(data_efd1.groupby(col)[col].count())

age
age
0    13292
1    11682
2    12347
3    11521
Name: age, dtype: int64
fnlwgt
fnlwgt
0    12211
1    12210
2    12210
3    12211
Name: fnlwgt, dtype: int64
education-num
education-num
0    22192
1    10878
2     3662
3    12110
Name: education-num, dtype: int64
capital-gain
capital-gain
0    48842
Name: capital-gain, dtype: int64
capital-loss
capital-loss
0    48842
Name: capital-loss, dtype: int64
hours-per-week
hours-per-week
0    34490
1     3651
2    10701
Name: hours-per-week, dtype: int64
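
All instances of capital-gain and capital-loss land in a single bin above, and hours-per-week gets three bins rather than four. A likely explanation is that heavily repeated values make several quantile edges coincide, and coinciding edges are merged. The sketch below reproduces the effect with `pandas.qcut` and `duplicates="drop"` on toy data; that this mirrors what EqualFrequencyDiscretiser does internally is an assumption, not something verified here.

```python
import pandas as pd

q = 4
spread = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])   # toy data, all values distinct
mostly_zero = pd.Series([0] * 9 + [5000])      # toy data, one dominant value

# duplicates="drop" merges quantile edges that fall on the same value
spread_codes = pd.qcut(spread, q=q, labels=False, duplicates="drop")
zero_codes = pd.qcut(mostly_zero, q=q, labels=False, duplicates="drop")

print(spread_codes.value_counts().sort_index().tolist())   # [2, 2, 2, 2]
print(zero_codes.nunique())                                 # 1
```

With distinct values the four bins are equally populated; with a dominant value the quartile edges collapse and only one bin remains.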


### EFD - Scenario 2: k = 7

In [24]:
# Perform discretization
k = 7
start = time.time() # Starting time
data_efd2 = efd_disc(data, k)
end = time.time()
efd_t = end - start
print("Discretization time, EFD, k = ", k,":", efd_t) # Total time execution

Discretization time, EFD, k =  7 : 0.04386782646179199


In [25]:
## OUTPUT
data_efd2.info()
## OUTPUT: Check the number of instances in each interval of data_efd2
for col in num_col:
    print(col)
    print(data_efd2.groupby(col)[col].count())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48842 non-null  object
 14  class           48842 non-null  int64 
dtypes: int64(7), object(8)
memory usage: 5.6+ MB
age
age
0    7226
1    7289
2    6494
3    7622
4    6764
5    6578
6    6869
Name: age, dtyp

### EFD - Scenario 3: k = 10

In [26]:
# Perform discretization
k = 10
start = time.time() # Starting time
data_efd3 = efd_disc(data, k)
end = time.time()
efd_t = end - start
print("Discretization time, EFD, k = ", k,":", efd_t) # Total time execution

Discretization time, EFD, k =  10 : 0.050167083740234375


In [27]:
## OUTPUT
data_efd3.info()
## OUTPUT: Check the number of instances in each interval of data_efd3
for col in num_col:
    print(col)
    print(data_efd3.groupby(col)[col].count())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48842 non-null  object
 14  class           48842 non-null  int64 
dtypes: int64(7), object(8)
memory usage: 5.6+ MB
age
age
0    5897
1    4883
2    5013
3    3913
4    5268
5    4892
6    4432
7    5613
8    

## Fixed Frequency Discretization - FFD

### Define function ffd_disc: reuse the EFD transformer with a modified input
Inputs: dataset, interval frequency (m). The number of bins is set to round(n/m), where n is the number of instances.

In [28]:
def ffd_disc(data, m):
    n = len(data)  # number of instances
    ## set up the discretisation transformer
    # (named differently from the function so the function name is not shadowed)
    discretiser = efd(q=round(n/m), variables=num_col) # number of bins = n/m
    ## fit the transformer
    discretiser.fit(data)
    ## transform the data
    data_ffd = discretiser.transform(data)
    ## discretiser.binner_dict_ stores the interval limits identified for each variable.
    return data_ffd
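
For reference, the number of quantile bins that `ffd_disc` requests follows directly from `round(n/m)` with n = 48842. The arithmetic for the values of m used in the scenarios is sketched here:

```python
n = 48842                      # number of instances reported by data.info()
frequencies = [10, 30, 60, 100]

# Requested number of bins for each target interval frequency m
requested_bins = {m: round(n / m) for m in frequencies}
print(requested_bins)   # {10: 4884, 30: 1628, 60: 814, 100: 488}
```

These are requested counts only. A column with few distinct values cannot fill that many quantile bins, which would be consistent with the age outputs in the scenarios listing only a few dozen bins.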

### FFD - Scenario 1: m = 10

In [29]:
# Perform discretization
m = 10
start = time.time() # Starting time
data_ffd1 = ffd_disc(data, m)
end = time.time()
ffd_t = end - start
print("Discretization time, FFD,  m = ", m, ":", ffd_t) # Total time execution

Discretization time, FFD,  m =  10 : 0.19707417488098145


In [30]:
## OUTPUT
data_ffd1.info()
## OUTPUT: Check the number of instances in each interval
for col in num_col:
    print(col)
    print(data_ffd1.groupby(col)[col].count())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48842 non-null  object
 14  class           48842 non-null  int64 
dtypes: int64(7), object(8)
memory usage: 5.6+ MB
age
age
0     1457
1     1053
2     1113
3     1096
4     1178
      ... 
72      11
73      

### FFD - Scenario 2: m = 30

In [31]:
# Perform discretization
m = 30
start = time.time() # Starting time
data_ffd2 = ffd_disc(data, m)
end = time.time()
ffd_t = end - start
print("Discretization time, FFD, m = ", m, ":", ffd_t) # Total time execution

Discretization time, FFD, m =  30 : 0.1091609001159668


In [32]:
## OUTPUT
data_ffd2.info()
## OUTPUT: Check the number of instances in each interval
for col in num_col:
    print(col)
    print(data_ffd2.groupby(col)[col].count())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48842 non-null  object
 14  class           48842 non-null  int64 
dtypes: int64(7), object(8)
memory usage: 5.6+ MB
age
age
0     1457
1     1053
2     1113
3     1096
4     1178
      ... 
63      38
64      

### FFD - Scenario 3: m = 60

In [33]:
# Perform discretization
m = 60
start = time.time() # Starting time
data_ffd3 = ffd_disc(data, m)
end = time.time()
ffd_t = end - start
print("Discretization time, FFD, m = ", m, ":", ffd_t) # Total time execution

Discretization time, FFD, m =  60 : 0.08628606796264648


In [34]:
## OUTPUT
data_ffd3.info()
## OUTPUT: Check the number of instances in each interval
for col in num_col:
    print(col)
    print(data_ffd3.groupby(col)[col].count())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48842 non-null  object
 14  class           48842 non-null  int64 
dtypes: int64(7), object(8)
memory usage: 5.6+ MB
age
age
0     1457
1     1053
2     1113
3     1096
4     1178
      ... 
61      34
62      

### FFD - Scenario 4: m = 100

In [35]:
# Perform discretization
m = 100
start = time.time() # Starting time
data_ffd4 = ffd_disc(data, m)
end = time.time()
ffd_t = end - start
print("Discretization time, FFD, m = ", m, ":", ffd_t) # Total time execution

Discretization time, FFD, m =  100 : 0.07551789283752441


In [36]:
## OUTPUT
data_ffd4.info()

## OUTPUT: Check the number of instances in each interval
for col in num_col:
    print(col)
    print(data_ffd4.groupby(col)[col].count())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48842 non-null  object
 14  class           48842 non-null  int64 
dtypes: int64(7), object(8)
memory usage: 5.6+ MB
age
age
0     1457
1     1053
2     1113
3     1096
4     1178
      ... 
58      72
59     1

### Export discretized datasets

In [37]:
# EWD datasets:
data_ewd1.to_csv('adult_ewd1.csv', index=False) # k=4
data_ewd2.to_csv('adult_ewd2.csv', index=False) # k=7
data_ewd3.to_csv('adult_ewd3.csv', index=False) # k=10

In [38]:
# EFD datasets:
data_efd1.to_csv('adult_efd1.csv', index=False) # k=4
data_efd2.to_csv('adult_efd2.csv', index=False) # k=7
data_efd3.to_csv('adult_efd3.csv', index=False) # k=10


In [39]:
# FFD datasets:
data_ffd1.to_csv('adult_ffd1.csv', index=False) # m=10
data_ffd2.to_csv('adult_ffd2.csv', index=False) # m=30
data_ffd3.to_csv('adult_ffd3.csv', index=False) # m=60
data_ffd4.to_csv('adult_ffd4.csv', index=False) # m=100
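
The repeated `to_csv` calls could also be driven from a dictionary that maps file names to frames. The sketch below shows the pattern on toy frames written to a temporary folder; the names `toy_ewd1` and `toy_efd1` and their values are hypothetical stand-ins for the discretized datasets.

```python
import os
import tempfile
import pandas as pd

# Toy stand-ins for the discretized frames (hypothetical names and values)
frames = {
    "toy_ewd1": pd.DataFrame({"age": [0, 1, 3], "class": [0, 0, 1]}),
    "toy_efd1": pd.DataFrame({"age": [0, 2, 3], "class": [0, 0, 1]}),
}

out_dir = tempfile.mkdtemp()   # temporary folder so nothing is overwritten
for name, frame in frames.items():
    frame.to_csv(os.path.join(out_dir, f"{name}.csv"), index=False)

# Read one file back to confirm the integer bin codes survive the round trip
written = sorted(os.listdir(out_dir))
reloaded = pd.read_csv(os.path.join(out_dir, "toy_ewd1.csv"))
print(written)   # ['toy_efd1.csv', 'toy_ewd1.csv']
```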