
Quantified drug consumption using personality tests and demographic data.




Quantified drug consumption

⚠️ This Readme is a short summary of our project and does not reflect the whole process. To get more details about our approach, please refer to the following documents:



The objective of this project is to carry out a complete data science project on an assigned dataset: Drug Consumption (Quantified), from the UCI Machine Learning Repository.

The dataset is derived from an online survey conducted between 2011 and 2012 among 1885 respondents aged 18 years and older from English-speaking countries. It collects demographic information, the results of three personality tests:

  • NEO-FFI-R: The Big Five personality test measures the five personality factors that psychologists have determined are core to our personality makeup.
    • Nscore: Neuroticism - How sensitive a person is to stress and negative emotional triggers.
    • Escore: Extraversion - How much a person is energized by the outside world.
    • Oscore: Openness - How open a person is to new ideas and experiences.
    • Ascore: Agreeableness - How much a person puts others' interests and needs ahead of their own.
    • Cscore: Conscientiousness - How goal-directed, persistent, and organized a person is.
  • BIS11: assesses the personality/behavioral construct of impulsiveness.
  • ImpSS: assesses various personality characteristics and behaviors related to impulsivity and sensation seeking.

and the reported use of 19 central nervous system psychoactive drugs, each answered with one of the following classes:

  • Never Used
  • Used over a Decade Ago
  • Used in Last Decade
  • Used in Last Year
  • Used in Last Month
  • Used in Last Week
  • Used in Last Day

The authors of the survey showed that there is a relationship between the risk of drug addiction and personality attributes.

From this dataset, we chose to address the following question:

How can we model the risk of addiction to a drug based on personality and demographic data?

To answer it, we selected the following drugs and grouped the answers into the following binary classes:

  • For Alcohol, Cannabis, Nicotine, Amphet, Benzos, Coke, Ecstasy, Legalh, LSD, Mushrooms: Not addicted / Addicted
  • For Amyl, Crack, Heroin, Ketamine, Meth, VSA: Never Used / Used
  • For Caff: Not daily addicted / Daily addicted

This repository contains:

  • The Drug Consumption Quantified dataset (in case the web link stops working)
  • A requirements file
  • The Python notebook (ipynb and html)
  • A Python web application (Django) and API that predict addiction to each drug
  • An example of API usage
  • The PowerPoint presentation of the project
  • The final trained models for each drug

Getting Started


  1. Clone the repository
git clone
  2. Install the Python libraries
pip install -r requirements.txt


Then simply launch the notebook in Jupyter.

Website & API

  1. Go to the api directory
cd PATH_TO_API/api 
  2. Launch the server (the standard Django entry point is manage.py)
python manage.py runserver


Then simply browse the website



| Feature | Format |
| --- | --- |
| age | int >= 18 |
| gender | {"Man", "Woman"} |
| education | {"Left school before 16 years", "Left school at 16 years", "Left school at 17 years", "Left school at 18 years", "Some college or university, no certificate or degree", "Professional certificate/ diploma", "University degree", "Master's degree", "Doctoral degree"} |
| country | {"Australia", "Canada", "New Zealand", "Other", "Republic of Ireland", "UK", "USA"} |
| ethnicity | {"Asian", "Black", "Mixed-Black/Asian", "Mixed-White/Asian", "Mixed-White/Black", "Other", "White"} |
| nscore | 12 <= int <= 60 |
| escore | 16 <= int <= 59 |
| oscore | 24 <= int <= 60 |
| ascore | 12 <= int <= 60 |
| cscore | 17 <= int <= 57 |
| impulsivity | 1 <= int <= 10 |
| SS | 1 <= int <= 11 |
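
Since the API expects values within these formats, it can help to check a request before sending it. The following is a minimal client-side sketch, not part of the repository: the helper name `validate_features` and the two lookup dictionaries are hypothetical, and the allowed values are simply copied from the table above.

```python
# Allowed values and ranges copied from the Feature/Format table.
# All names below are hypothetical helpers, not part of the project's API.
CATEGORICAL = {
    "gender": {"Man", "Woman"},
    "education": {
        "Left school before 16 years", "Left school at 16 years",
        "Left school at 17 years", "Left school at 18 years",
        "Some college or university, no certificate or degree",
        "Professional certificate/ diploma", "University degree",
        "Master's degree", "Doctoral degree",
    },
    "country": {"Australia", "Canada", "New Zealand", "Other",
                "Republic of Ireland", "UK", "USA"},
    "ethnicity": {"Asian", "Black", "Mixed-Black/Asian", "Mixed-White/Asian",
                  "Mixed-White/Black", "Other", "White"},
}
INT_RANGES = {
    "nscore": (12, 60), "escore": (16, 59), "oscore": (24, 60),
    "ascore": (12, 60), "cscore": (17, 57),
    "impulsivity": (1, 10), "SS": (1, 11),
}


def validate_features(features):
    """Return a list of problems found; an empty list means the input is valid."""
    problems = []
    age = features.get("age")
    if not isinstance(age, int) or age < 18:
        problems.append("age must be an int >= 18")
    for name, allowed in CATEGORICAL.items():
        if features.get(name) not in allowed:
            problems.append(f"{name} must be one of {sorted(allowed)}")
    for name, (low, high) in INT_RANGES.items():
        value = features.get(name)
        if not isinstance(value, int) or not low <= value <= high:
            problems.append(f"{name} must be an int in [{low}, {high}]")
    return problems
```

Calling `validate_features` on a dictionary of the twelve features returns an empty list when every value is within the documented format.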


| Result | Description |
| --- | --- |
| Addicted | Used in the last month |
| Not addicted | Not used in the last month |
| Used | Used |
| Never used | Never used |
| Daily addicted | Used yesterday |
| Not daily addicted | Not used yesterday |

You can use the following Python example:

import requests

# Example feature values (see the Feature/Format table for the valid ranges)
age = 18
gender = "Man"
education = "Left school before 16 years"
country = "Australia"
ethnicity = "Asian"
nscore = 12
escore = 16
oscore = 24
ascore = 12
cscore = 17
impulsivity = 1
SS = 1

parameters = [age, gender, education, country, ethnicity, nscore, escore, oscore, ascore, cscore, impulsivity, SS]
parameter_names = ["age", "gender", "education", "country", "ethnicity", "nscore", "escore", "oscore", "ascore", "cscore", "impulsivity", "SS"]

# Base URL of the API endpoint (left blank here; set it to your running server)
url = ''

# Let requests build and URL-encode the query string (values contain spaces)
r = requests.get(url, params=dict(zip(parameter_names, parameters)))

# Print one prediction per drug when the request succeeds
if r.status_code == 200:
    for name, result in r.json().items():
        print(f"{name}: {result}")

Data Visualization


The following graph shows the distribution of the number of drugs used per respondent. We can clearly identify two groups of people:

  • people who have tried fewer than 6 drugs (47.53%)
  • people who have tried more than 7 drugs (52.47%)


The following graph shows, for each drug, the distribution of respondents across the classes. As expected, the classes are imbalanced: for most drugs, the majority of respondents did not consume the drug.



  1. Feature selection
    Drop the feature that will not be used for modelling: ID

  2. Feature encoding
    The original dataset is already encoded, so we only changed the encoding of country and ethnicity (to one-hot) because they are not ordinal variables.
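
One-hot encoding replaces a nominal variable with one 0/1 flag per category, so no artificial order is implied between countries or ethnicities. Below is a minimal pure-Python sketch of the idea; the function name `one_hot`, the sample values, and the category order are assumptions made for illustration, and the notebook may well use a library encoder instead.

```python
def one_hot(values, categories):
    """Encode each value as a list of 0/1 flags, one flag per category."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]


# Category order is an arbitrary choice for this sketch.
countries = ["Australia", "Canada", "New Zealand", "Other",
             "Republic of Ireland", "UK", "USA"]

# Hypothetical sample of three respondents.
encoded = one_hot(["UK", "USA", "Other"], countries)
print(encoded[0])  # [0, 0, 0, 0, 0, 1, 0]
```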


  1. Target selection
    Drop the drugs that will not be used for modelling: Choc (model performance is too low because almost everyone consumes it) and Semer (a fictitious drug)

  2. Target encoding

  • First we keep the 7 classes
  • Then we group the classes into binary classes
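
The grouping into binary classes can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the names `USAGE_LEVELS`, `THRESHOLDS`, `DRUG_GROUPS` and `binarize` are hypothetical, the raw dataset may store the classes as codes rather than these labels, and the cut-offs are our reading of the Result/Description table ("Addicted" means used in the last month or more recently, "Used" means anything but never used, "Daily addicted" means used in the last day).

```python
# The seven original classes, ordered from least to most recent use.
USAGE_LEVELS = [
    "Never Used", "Used over a Decade Ago", "Used in Last Decade",
    "Used in Last Year", "Used in Last Month", "Used in Last Week",
    "Used in Last Day",
]

# Index of the first level counted as the positive class (assumed cut-offs).
THRESHOLDS = {"addiction": 4, "usage": 1, "daily": 6}

DRUG_GROUPS = {
    "addiction": ["Alcohol", "Cannabis", "Nicotine", "Amphet", "Benzos",
                  "Coke", "Ecstasy", "Legalh", "LSD", "Mushrooms"],
    "usage": ["Amyl", "Crack", "Heroin", "Ketamine", "Meth", "VSA"],
    "daily": ["Caff"],
}


def binarize(drug, level):
    """Return 1 for the positive class of this drug's group, 0 otherwise."""
    for group, drugs in DRUG_GROUPS.items():
        if drug in drugs:
            return int(USAGE_LEVELS.index(level) >= THRESHOLDS[group])
    raise KeyError(f"unknown drug: {drug}")


print(binarize("Cannabis", "Used in Last Week"))     # 1 (Addicted)
print(binarize("Heroin", "Used over a Decade Ago"))  # 1 (Used)
print(binarize("Caff", "Used in Last Week"))         # 0 (Not daily addicted)
```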



  • Accuracy
  • Balanced accuracy
  • Confusion matrix
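
Balanced accuracy is the mean of the per-class recalls, which makes it more informative than plain accuracy when one class dominates. The sketch below uses made-up labels (not project results) to show how a model that always predicts the majority class scores well on accuracy but only 0.5 on balanced accuracy.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall obtained on each class present in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Made-up imbalanced example: 9 negatives, 1 positive, constant prediction.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
```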

Techniques used (imbalanced data)

  • Weighting
  • Sampling: SMOTE
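
To illustrate the weighting idea, the sketch below computes class weights that are inversely proportional to class frequency, using the formula n_samples / (n_classes * class_count). This is the heuristic behind scikit-learn's `class_weight="balanced"` option; whether the notebook uses exactly this scheme is an assumption, and the function name and labels are hypothetical. SMOTE, which instead synthesizes new minority-class samples, is not shown here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Made-up target with 8 negatives and 2 positives.
labels = [0] * 8 + [1] * 2
print(balanced_class_weights(labels))  # {0: 0.625, 1: 2.5}
```

With these weights, an error on the rare class costs four times as much as an error on the frequent class during training.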


  • Initial classes
    We keep the original classes and see what we get.

    • Original models and data
    • Weighting features
    • Sampling features
  • New classes
    We create our own classes to obtain better results.

    • Original models and data
    • Weighting features
    • Sampling features
  • Tuning hyperparameters

Finally, we obtained the following balanced accuracy scores on evaluation:

Selected models

| Drug | Model |
| --- | --- |
| Alcohol | Logistic Regression |
| Amphet | Logistic Regression |
| Amyl | Logistic Regression |
| Benzos | Logistic Regression |
| Caff | Logistic Regression |
| Cannabis | Logistic Regression |
| Coke | SVC |
| Crack | Logistic Regression |
| Ecstasy | SVC |
| Heroin | SVC |
| Ketamine | SVC |
| Legalh | SVC |
| Meth | Logistic Regression |
| Mushrooms | BernoulliNB |
| Nicotine | BernoulliNB |
| VSA | Logistic Regression |


In the end, the best method for overcoming the class imbalance in the targets was weighting. With it, we achieved a mean balanced accuracy of 71.24%.
To conclude, the major difficulties we encountered were:

  • Lack of data
  • Imbalanced classes for the target
  • Preprocessed features
  • Biased dataset:
    • young population
    • most respondents come from the UK


