# Data description

We took the data from https://archive.ics.uci.edu/dataset/373/drug+consumption+quantified

The database contains records for 1885 respondents. For each respondent, 12 attributes are known: personality measurements, which include NEO-FFI-R (neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness), BIS-11 (impulsivity), and ImpSS (sensation seeking), as well as level of education, age, gender, country of residence, and ethnicity. All input attributes were originally categorical and have been quantified, so the values of all input features can be treated as real-valued. In addition, participants were asked about their use of 18 legal and illegal drugs (alcohol, amphetamines, amyl nitrite, benzodiazepine, cannabis, chocolate, cocaine, caffeine, crack, ecstasy, heroin, ketamine, legal highs, LSD, methadone, mushrooms, nicotine, and volatile substance abuse) and one fictitious drug (Semeron), which was introduced to identify over-claimers. For each drug, they had to select one of the answers: never used the drug, used it over a decade ago, or used it in the last decade, year, month, week, or day.
The database poses 18 classification problems, one per real drug; the targets table has 19 columns because it also includes the Semeron control. Each label variable takes one of seven classes: "Never Used", "Used over a Decade Ago", "Used in Last Decade", "Used in Last Year", "Used in Last Month", "Used in Last Week", and "Used in Last Day".
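
The role of the fictitious drug can be illustrated with a minimal sketch. It assumes the targets use the `CL0` to `CL6` coding shown later in this notebook; the `y_sample` frame and its values are made up for illustration, and dropping flagged respondents is only one possible treatment, not a step this notebook performs.

```python
import pandas as pd

# Toy stand-in for the real targets table (hypothetical values, same 'CLx' coding)
y_sample = pd.DataFrame({
    'cannabis': ['CL0', 'CL6', 'CL3', 'CL2'],
    'semer':    ['CL0', 'CL0', 'CL2', 'CL0'],
})

# Anyone reporting any use of the fictitious drug is a potential over-claimer
is_overclaimer = y_sample['semer'] != 'CL0'
n_overclaimers = int(is_overclaimer.sum())

# One possible treatment (an assumption, not done in this notebook): drop those rows
y_clean = y_sample.loc[~is_overclaimer].reset_index(drop=True)

print(f"Flagged {n_overclaimers} of {len(y_sample)} respondents")
```

Whether to exclude such respondents or only flag them is an analysis decision that depends on the downstream task.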

# Load the data

In [10]:
# Load the dependencies
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype
import matplotlib.pyplot as plt

In [2]:
# Let's import the dataset using the UCI Machine Learning Repository's Python package
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
drug_consumption_quantified = fetch_ucirepo(id=373) 
  
# data (as pandas dataframes) 
X = drug_consumption_quantified.data.features 
y = drug_consumption_quantified.data.targets 
  
# metadata 
print(drug_consumption_quantified.metadata) 

# variable information 
print(drug_consumption_quantified.variables) 


{'uci_id': 373, 'name': 'Drug Consumption (Quantified)', 'repository_url': 'https://archive.ics.uci.edu/dataset/373/drug+consumption+quantified', 'data_url': 'https://archive.ics.uci.edu/static/public/373/data.csv', 'abstract': 'Classify type of drug consumer by personality data', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 1885, 'num_features': 12, 'feature_types': ['Real'], 'demographics': ['Age', 'Gender', 'Education Level', 'Nationality', 'Ethnicity'], 'target_col': ['alcohol', 'amphet', 'amyl', 'benzos', 'caff', 'cannabis', 'choc', 'coke', 'crack', 'ecstasy', 'heroin', 'ketamine', 'legalh', 'lsd', 'meth', 'mushrooms', 'nicotine', 'semer', 'vsa'], 'index_col': ['id'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2015, 'last_updated': 'Fri Mar 08 2024', 'dataset_doi': '10.24432/C5TC7S', 'creators': ['Elaine Fehrman', 'Vincent Egan', 'Evgeny Mirkes'], 'intro_paper': {'ID': 413, 

Independent variables

In [4]:
X.head()

Unnamed: 0,age,gender,education,country,ethnicity,nscore,escore,oscore,ascore,cscore,impuslive,ss
0,0.49788,0.48246,-0.05921,0.96082,0.126,0.31287,-0.57545,-0.58331,-0.91699,-0.00665,-0.21712,-1.18084
1,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,-0.14277,-0.71126,-0.21575
2,0.49788,-0.48246,-0.05921,0.96082,-0.31685,-0.46725,0.80523,-0.84732,-1.6209,-1.0145,-1.37983,0.40148
3,-0.95197,0.48246,1.16365,0.96082,-0.31685,-0.14882,-0.80615,-0.01928,0.59042,0.58489,-1.37983,-1.18084
4,0.49788,0.48246,1.98437,0.96082,-0.31685,0.73545,-1.6334,-0.45174,-0.30172,1.30612,-0.21712,-0.21575


In [6]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1885 entries, 0 to 1884
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   age        1885 non-null   float64
 1   gender     1885 non-null   float64
 2   education  1885 non-null   float64
 3   country    1885 non-null   float64
 4   ethnicity  1885 non-null   float64
 5   nscore     1885 non-null   float64
 6   escore     1885 non-null   float64
 7   oscore     1885 non-null   float64
 8   ascore     1885 non-null   float64
 9   cscore     1885 non-null   float64
 10  impuslive  1885 non-null   float64
 11  ss         1885 non-null   float64
dtypes: float64(12)
memory usage: 176.8 KB


## Decoding X for EDA clarity

In [11]:
# The X features are already standardized and, in the case of categorical ones, quantified. However, let's decode them to make the EDA easier to read.

# Create a copy to decode, leaving the original X untouched
X_decoded = X.copy()

# ---------------------------------------------------------
# Define Mappings (rounded to match dataset precision)
# ---------------------------------------------------------

age_map = {
    -0.95197: '18-24',
    -0.07854: '25-34',
    0.49788:  '35-44',
    1.09449:  '45-54',
    1.82213:  '55-64',
    2.59171:  '65+'
}

# Ordered from lowest to highest
age_order = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']

education_map = {
    -2.43591: 'Left school before 16 years',
    -1.73790: 'Left school at 16 years',
    -1.43719: 'Left school at 17 years',
    -1.22751: 'Left school at 18 years',
    -0.61113: 'Some college or university, no certificate or degree',
    -0.05921: 'Professional certificate/ diploma',
    0.45468:  'University degree',
    1.16365:  'Masters degree',
    1.98437:  'Doctorate degree'
}

# Ordered from lowest to highest
edu_order = [
    'Left school before 16 years',
    'Left school at 16 years',
    'Left school at 17 years',
    'Left school at 18 years',
    'Some college or university, no certificate or degree',
    'Professional certificate/ diploma',
    'University degree',
    'Masters degree',
    'Doctorate degree'
]

gender_map = {0.48246: 'Female', -0.48246: 'Male'}

country_map = {
    -0.09765: 'Australia', 0.24923: 'Canada', -0.46841: 'New Zealand',
    -0.28519: 'Other', 0.21128: 'Republic of Ireland', 0.96082: 'UK', -0.57009: 'USA'
}

ethnicity_map = {
    -0.50212: 'Asian', -1.10702: 'Black', 1.90725: 'Mixed-Black/Asian',
    0.12600: 'Mixed-White/Asian', -0.22166: 'Mixed-White/Black', 0.11440: 'Other', -0.31685: 'White'
}

# ---------------------------------------------------------
# Apply Transformations
# ---------------------------------------------------------

# Map the values (Rounding to 5 decimals is essential for matching)
X_decoded['age'] = X['age'].round(5).map(age_map)
X_decoded['education'] = X['education'].round(5).map(education_map)
X_decoded['gender'] = X['gender'].round(5).map(gender_map)
X_decoded['country'] = X['country'].round(5).map(country_map)
X_decoded['ethnicity'] = X['ethnicity'].round(5).map(ethnicity_map)

# Enforce Ordinal Types
X_decoded['age'] = pd.Categorical(X_decoded['age'], categories=age_order, ordered=True)
X_decoded['education'] = pd.Categorical(X_decoded['education'], categories=edu_order, ordered=True)

# Enforce Nominal Types (No order)
X_decoded['gender'] = X_decoded['gender'].astype('category')
X_decoded['country'] = X_decoded['country'].astype('category')
X_decoded['ethnicity'] = X_decoded['ethnicity'].astype('category')

# ---------------------------------------------------------
# Verification of Ordinality
# ---------------------------------------------------------
# We verify by checking each level's position in the ordered category list
doc_code = X_decoded['education'].cat.categories.get_loc("Doctorate degree")
mast_code = X_decoded['education'].cat.categories.get_loc("Masters degree")

print("Encoded Dataframe Info:")
print(X_decoded.info())

print(f"\nCorrect Proof of Ordinality:")
print(f"Doctorate Rank Index: {doc_code}")
print(f"Masters Rank Index:   {mast_code}")
print(f"Is Doctorate > Masters? {doc_code > mast_code}")

Encoded Dataframe Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1885 entries, 0 to 1884
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   age        1885 non-null   category
 1   gender     1885 non-null   category
 2   education  1885 non-null   category
 3   country    1885 non-null   category
 4   ethnicity  1885 non-null   category
 5   nscore     1885 non-null   float64 
 6   escore     1885 non-null   float64 
 7   oscore     1885 non-null   float64 
 8   ascore     1885 non-null   float64 
 9   cscore     1885 non-null   float64 
 10  impuslive  1885 non-null   float64 
 11  ss         1885 non-null   float64 
dtypes: category(5), float64(7)
memory usage: 113.8 KB
None

Correct Proof of Ordinality:
Doctorate Rank Index: 8
Masters Rank Index:   7
Is Doctorate > Masters? True
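
The decoding above relies on `round(5)` producing floats that match the dictionary keys exactly; any value that fails to match silently becomes `NaN`. As a hedged alternative, the sketch below matches each value to its nearest key within a tolerance and raises if nothing is close enough. The helper name `decode_quantified` and the tolerance `tol` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

def decode_quantified(series, mapping, tol=1e-4):
    """Map each float to the label of its nearest key; fail loudly if none is within tol."""
    keys = np.array(list(mapping.keys()), dtype=float)
    labels = np.array(list(mapping.values()), dtype=object)
    values = series.to_numpy(dtype=float)

    # Absolute distance from every value to every key, shape (n_values, n_keys)
    dist = np.abs(values[:, None] - keys[None, :])
    nearest = dist.argmin(axis=1)

    # Written as a negation so that NaN inputs are also reported as unmapped
    unmapped = ~(dist.min(axis=1) <= tol)
    if unmapped.any():
        raise ValueError(f"Unmapped values: {np.unique(values[unmapped])}")

    return pd.Series(labels[nearest], index=series.index, name=series.name)

gender_map = {0.48246: 'Female', -0.48246: 'Male'}
sample = pd.Series([0.48246, -0.48246, 0.482460001], name='gender')
decoded = decode_quantified(sample, gender_map)
print(decoded.tolist())
```

The exact-match approach used in the cell above is simpler and works for this dataset, as the non-null counts show; the tolerance-based variant only makes failures explicit.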


In [12]:
X_decoded.head(5)

Unnamed: 0,age,gender,education,country,ethnicity,nscore,escore,oscore,ascore,cscore,impuslive,ss
0,35-44,Female,Professional certificate/ diploma,UK,Mixed-White/Asian,0.31287,-0.57545,-0.58331,-0.91699,-0.00665,-0.21712,-1.18084
1,25-34,Male,Doctorate degree,UK,White,-0.67825,1.93886,1.43533,0.76096,-0.14277,-0.71126,-0.21575
2,35-44,Male,Professional certificate/ diploma,UK,White,-0.46725,0.80523,-0.84732,-1.6209,-1.0145,-1.37983,0.40148
3,18-24,Female,Masters degree,UK,White,-0.14882,-0.80615,-0.01928,0.59042,0.58489,-1.37983,-1.18084
4,35-44,Female,Doctorate degree,UK,White,0.73545,-1.6334,-0.45174,-0.30172,1.30612,-0.21712,-0.21575
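
Storing `age` and `education` as ordered categoricals pays off during EDA: comparisons, `min`/`max`, and sorting all follow the declared order rather than alphabetical order. The following is a small sketch on made-up values (the `ages` series is hypothetical), reusing the same `age_order` list as above.

```python
import pandas as pd

age_order = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']

# Hypothetical sample using the same ordered categories as X_decoded['age']
ages = pd.Series(
    pd.Categorical(['35-44', '25-34', '35-44', '18-24', '35-44'],
                   categories=age_order, ordered=True)
)

# Comparisons against a category label respect the declared order
under_35 = ages < '35-44'
print(under_35.tolist())

# min/max and sort_index also follow the category order
youngest, oldest = ages.min(), ages.max()
counts = ages.value_counts().sort_index()
print(youngest, oldest)
print(counts)
```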


Target variables

In [13]:
y.head()

Unnamed: 0,alcohol,amphet,amyl,benzos,caff,cannabis,choc,coke,crack,ecstasy,heroin,ketamine,legalh,lsd,meth,mushrooms,nicotine,semer,vsa
0,CL5,CL2,CL0,CL2,CL6,CL0,CL5,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL2,CL0,CL0
1,CL5,CL2,CL2,CL0,CL6,CL4,CL6,CL3,CL0,CL4,CL0,CL2,CL0,CL2,CL3,CL0,CL4,CL0,CL0
2,CL6,CL0,CL0,CL0,CL6,CL3,CL4,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL1,CL0,CL0,CL0
3,CL4,CL0,CL0,CL3,CL5,CL2,CL4,CL2,CL0,CL0,CL0,CL2,CL0,CL0,CL0,CL0,CL2,CL0,CL0
4,CL4,CL1,CL1,CL0,CL6,CL3,CL6,CL0,CL0,CL1,CL0,CL0,CL1,CL0,CL0,CL2,CL2,CL0,CL0


In [14]:
y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1885 entries, 0 to 1884
Data columns (total 19 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   alcohol    1885 non-null   object
 1   amphet     1885 non-null   object
 2   amyl       1885 non-null   object
 3   benzos     1885 non-null   object
 4   caff       1885 non-null   object
 5   cannabis   1885 non-null   object
 6   choc       1885 non-null   object
 7   coke       1885 non-null   object
 8   crack      1885 non-null   object
 9   ecstasy    1885 non-null   object
 10  heroin     1885 non-null   object
 11  ketamine   1885 non-null   object
 12  legalh     1885 non-null   object
 13  lsd        1885 non-null   object
 14  meth       1885 non-null   object
 15  mushrooms  1885 non-null   object
 16  nicotine   1885 non-null   object
 17  semer      1885 non-null   object
 18  vsa        1885 non-null   object
dtypes: object(19)
memory usage: 279.9+ KB


## Decoding y for EDA clarity

In [15]:
# Create a copy to keep the original y safe
y_decoded = y.copy()

# ---------------------------------------------------------
# Define the Ordinal Mapping
# ---------------------------------------------------------
# Based on the variable description provided earlier:
# CL0 = Never Used -> 0
# CL6 = Used in Last Day -> 6
drug_usage_map = {
    'CL0': 0,  # Never Used
    'CL1': 1,  # Used over a Decade Ago
    'CL2': 2,  # Used in Last Decade
    'CL3': 3,  # Used in Last Year
    'CL4': 4,  # Used in Last Month
    'CL5': 5,  # Used in Last Week
    'CL6': 6   # Used in Last Day
}

# ---------------------------------------------------------
# Apply Transformation to All Target Columns
# ---------------------------------------------------------
# We iterate through every column in y_decoded and map the values
for column in y_decoded.columns:
    y_decoded[column] = y_decoded[column].map(drug_usage_map)

    # Ensure they are stored as integers (not floats)
    y_decoded[column] = y_decoded[column].astype(int)

# ---------------------------------------------------------
# Verification
# ---------------------------------------------------------
print("Transformed Target Data (First 5 Rows):")
display(y_decoded.head())

print("\nData Types:")
print(y_decoded.dtypes)

# Check one column distribution to ensure mapping worked
print("\nExample Distribution for 'Cannabis':")
print(y_decoded['cannabis'].value_counts().sort_index())

Transformed Target Data (First 5 Rows):


Unnamed: 0,alcohol,amphet,amyl,benzos,caff,cannabis,choc,coke,crack,ecstasy,heroin,ketamine,legalh,lsd,meth,mushrooms,nicotine,semer,vsa
0,5,2,0,2,6,0,5,0,0,0,0,0,0,0,0,0,2,0,0
1,5,2,2,0,6,4,6,3,0,4,0,2,0,2,3,0,4,0,0
2,6,0,0,0,6,3,4,0,0,0,0,0,0,0,0,1,0,0,0
3,4,0,0,3,5,2,4,2,0,0,0,2,0,0,0,0,2,0,0
4,4,1,1,0,6,3,6,0,0,1,0,0,1,0,0,2,2,0,0



Data Types:
alcohol      int64
amphet       int64
amyl         int64
benzos       int64
caff         int64
cannabis     int64
choc         int64
coke         int64
crack        int64
ecstasy      int64
heroin       int64
ketamine     int64
legalh       int64
lsd          int64
meth         int64
mushrooms    int64
nicotine     int64
semer        int64
vsa          int64
dtype: object

Example Distribution for 'Cannabis':
cannabis
0    413
1    207
2    266
3    211
4    140
5    185
6    463
Name: count, dtype: int64
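
If a label outside `CL0` to `CL6` ever appeared, `map` would produce `NaN` and the later `astype(int)` would fail with a less helpful message. A minimal sketch of a stricter variant follows; the helper name `decode_targets` and the toy `y_sample` frame are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

drug_usage_map = {f'CL{i}': i for i in range(7)}  # 'CL0' -> 0, ..., 'CL6' -> 6

def decode_targets(y):
    """Map 'CLx' labels to integers, raising early on any unexpected label."""
    observed = set(pd.unique(y.to_numpy().ravel()))
    unexpected = observed - set(drug_usage_map)
    if unexpected:
        raise ValueError(f"Unexpected labels: {sorted(map(str, unexpected))}")
    # Apply the mapping column by column, then store the result as integers
    return y.apply(lambda col: col.map(drug_usage_map)).astype(int)

# Hypothetical two-row sample with the same coding as the real targets
y_sample = pd.DataFrame({'alcohol': ['CL5', 'CL6'], 'cannabis': ['CL0', 'CL4']})
y_sample_decoded = decode_targets(y_sample)
print(y_sample_decoded)
```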


## Check for missing values in both datasets

In [16]:
print(X_decoded.isnull().any())
print(y_decoded.isnull().any())

age          False
gender       False
education    False
country      False
ethnicity    False
nscore       False
escore       False
oscore       False
ascore       False
cscore       False
impuslive    False
ss           False
dtype: bool
alcohol      False
amphet       False
amyl         False
benzos       False
caff         False
cannabis     False
choc         False
coke         False
crack        False
ecstasy      False
heroin       False
ketamine     False
legalh       False
lsd          False
meth         False
mushrooms    False
nicotine     False
semer        False
vsa          False
dtype: bool
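
Because `Series.map` returns `NaN` for any value without a matching key, this check also confirms that every quantified value was decoded. For wider tables, a more compact report can be easier to scan; the sketch below, on made-up frames (`X_sample` and `y_sample` are hypothetical), prints only the columns that actually contain missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: one with a deliberately missing value, one complete
frames = {
    'X_sample': pd.DataFrame({'age': ['35-44', np.nan], 'nscore': [0.31287, -0.67825]}),
    'y_sample': pd.DataFrame({'cannabis': [0, 4]}),
}

# Total number of missing cells per frame
missing_per_frame = {name: int(df.isna().sum().sum()) for name, df in frames.items()}

# Per-column breakdown, restricted to columns with at least one missing value
for name, df in frames.items():
    missing = df.isna().sum()
    print(name, missing[missing > 0].to_dict())
```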


## Decoded datasets concatenation

In [17]:
# Let's temporarily concatenate X and y into a single frame for joint exploration
df_decoded = pd.concat([X_decoded, y_decoded], axis=1)
df_decoded.head()

Unnamed: 0,age,gender,education,country,ethnicity,nscore,escore,oscore,ascore,cscore,...,ecstasy,heroin,ketamine,legalh,lsd,meth,mushrooms,nicotine,semer,vsa
0,35-44,Female,Professional certificate/ diploma,UK,Mixed-White/Asian,0.31287,-0.57545,-0.58331,-0.91699,-0.00665,...,0,0,0,0,0,0,0,2,0,0
1,25-34,Male,Doctorate degree,UK,White,-0.67825,1.93886,1.43533,0.76096,-0.14277,...,4,0,2,0,2,3,0,4,0,0
2,35-44,Male,Professional certificate/ diploma,UK,White,-0.46725,0.80523,-0.84732,-1.6209,-1.0145,...,0,0,0,0,0,0,1,0,0,0
3,18-24,Female,Masters degree,UK,White,-0.14882,-0.80615,-0.01928,0.59042,0.58489,...,0,0,2,0,0,0,0,2,0,0
4,35-44,Female,Doctorate degree,UK,White,0.73545,-1.6334,-0.45174,-0.30172,1.30612,...,1,0,0,1,0,0,2,2,0,0
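
`pd.concat(..., axis=1)` aligns rows on the index rather than on position. Here both frames share the same `RangeIndex`, so the result is as expected, but a quick check guards against silent misalignment. The sketch below uses small made-up frames (`X_part`, `y_part`, and `shifted` are hypothetical) to show what goes wrong when the indices differ.

```python
import pandas as pd

X_part = pd.DataFrame({'nscore': [0.3, -0.6, -0.4]})
y_part = pd.DataFrame({'cannabis': [0, 4, 3]})

# Safe case: identical indices, so rows line up one to one
assert X_part.index.equals(y_part.index)
joined = pd.concat([X_part, y_part], axis=1)

# Unsafe case: a shifted index yields extra rows padded with NaN
shifted = y_part.set_axis([1, 2, 3], axis=0)
misaligned = pd.concat([X_part, shifted], axis=1)

print(joined.shape, misaligned.shape)
print(int(misaligned.isna().sum().sum()), "missing cells after misaligned concat")
```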


# Save the data

In [18]:
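import os

# Create the output folder if it does not exist yet, so to_csv does not fail
os.makedirs('Data', exist_ok=True)

X.to_csv('Data/X.csv', index=False)
y.to_csv('Data/y.csv', index=False)
X_decoded.to_csv('Data/X_decoded.csv', index=False)
y_decoded.to_csv('Data/y_decoded.csv', index=False)
df_decoded.to_csv('Data/df_decoded.csv', index=False)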
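
One caveat when reading these files back: CSV stores only text, so the ordered categorical dtypes set above are not preserved. The following sketch, which uses an in-memory buffer instead of the `Data/` folder and a made-up two-row frame, shows one way to restore the ordering at load time by passing a `CategoricalDtype` to `read_csv`.

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

age_order = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']
age_dtype = CategoricalDtype(categories=age_order, ordered=True)

# Hypothetical two-row frame standing in for X_decoded
df = pd.DataFrame({
    'age': pd.Categorical(['35-44', '18-24'], categories=age_order, ordered=True),
    'nscore': [0.31287, -0.14882],
})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# A plain read loses the categorical dtype
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing the dtype explicitly restores categories and their order
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={'age': age_dtype})

print(plain['age'].dtype, restored['age'].dtype)
```

The same `dtype=` argument could be extended with an entry for `education` and its `edu_order` list.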