Practice data scientist technical assessments — fully offline.
You can either use UV or pip to set up the project for both the CLI and WebApp.
For UV use the following:
# 1. Sync the project using UV. This also creates a virtual environment.
uv sync
# 2. Run the CLI or WebApp
ds-trainer train
# OR
uv run python web/main.py# 1. Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# 2. Install (editable, so you can add your own questions later)
pip install -e .
# 3. Run a default 5-question session across all domains
ds-trainer train
# 4. Focus on what you need
ds-trainer train --domain sql --difficulty hard --count 3
ds-trainer train --type fill_in_code --domain pythonNo install? If you skip
pip install -e ., you can still run the tool from inside thelogicdirectory using:python -m ds_trainer trainNote: the dependencies (
rich,pandas,numpy) must still be installed for this to work.
You can also practice and manage questions using the interactive visual web application.
# Start the FastAPI web server
uv run python web/main.pyThen navigate to http://127.0.0.1:8000 in your browser.
From the web UI, you can:
- Configure Training Sessions: Select your domain, difficulty, and question types.
- Interactive Practice: Write code and SQL with syntax highlighting, and instantly grade your solutions against the test cases.
- Database Management: From the bottom of the Setup View, you can access tools to manage the database:
- Add Question: Opens a form to securely evaluate and add new custom questions directly to the SQLite database (
ds_trainer/questions.db). IDs are generated automatically! - Reset Database: Wipes all custom questions and safely restores the database back to its original core question set.
- Add Question: Opens a form to securely evaluate and add new custom questions directly to the SQLite database (
Companies that hire data scientists typically test these areas:
| Domain | Topics covered |
|---|---|
| SQL | JOINs, aggregations, HAVING, window functions (RANK, ROW_NUMBER, moving average), CTEs, correlated subqueries |
| Python / Pandas | Boolean indexing, missing data imputation, groupby/transform, rolling windows, pivot_table, MultiIndex |
| Statistics | p-values, CLT, A/B test design, two-sample t-test, multiple comparisons, bootstrap CI, bias-variance tradeoff |
| ML | Overfitting, class imbalance, cross-validation, feature importance, gradient boosting, Ridge + GridSearchCV, data leakage |
| Algorithms | Two-sum, sliding window, binary search, group anagrams, max profit (single pass) |
| Case Studies | North Star metric, DAU drops, funnel analysis, churn prediction lifecycle, RFM segmentation, cold-start |
| Probability | Bayes' theorem, base-rate fallacy, expected value, independence, Monty Hall, birthday problem, Markov chains, law of total expectation |
Difficulty levels: easy / medium / hard (filter with --difficulty)
| Type | How to answer |
|---|---|
multiple_choice |
Type a letter: A, B, C, or D |
fill_in_code |
Paste a complete function body; submit with a blank line |
sql_challenge |
Write a SQL query; submit with a blank line |
explain_concept |
Write your answer in plain text; submit with a blank line |
take_home |
A dataset CSV is generated locally; work in your editor, press Enter when done |
Python code and SQL queries are evaluated automatically. Concept questions and take-home projects are self-graded — compare your answer to the model solution shown after you submit.
During any session:
| Key | Action |
|---|---|
h |
Show the next available hint |
s |
Skip this question (counts as wrong) |
q |
Quit the session early (summary is still shown) |
ds-trainer train [--domain DOMAIN] [--difficulty LEVEL] [--type TYPE]
[--count N] [--no-shuffle]
ds-trainer list [--domain DOMAIN] [--difficulty LEVEL] [--type TYPE]
ds-trainer stats
ds-trainer --version
Options:
| Flag | Values | Default |
|---|---|---|
--domain, -d |
sql python statistics ml algorithms case_studies probability all |
all |
--difficulty, -l |
easy medium hard all |
all |
--type, -t |
multiple_choice fill_in_code explain_concept sql_challenge take_home all |
all |
--count, -n |
integer | 5 |
--no-shuffle |
flag | (shuffle by default) |
Examples:
# See what questions are available before committing
ds-trainer list --domain ml --difficulty medium
# Check how many questions exist per domain × difficulty
ds-trainer stats
# Pure coding interview prep
ds-trainer train --type fill_in_code --count 10
# SQL-only hard grind
ds-trainer train --domain sql --difficulty hard --no-shuffleds_trainer/
├── cli.py # argparse entry point; dispatches train / list / stats
├── models.py # Question, Session, SessionResult dataclasses + enums
├── registry.py # load_all(), filter_questions(), sample_questions()
├── runner.py # interactive loop, render_question(), evaluate(), show_summary()
├── evaluators.py # eval_code() and eval_sql() — sandboxed execution engines
├── domains/
│ ├── sql.py # 12 questions
│ ├── python_pandas.py # 10 questions
│ ├── statistics.py # 10 questions
│ ├── ml.py # 10 questions
│ ├── algorithms.py # 8 questions
│ └── case_studies.py # 9 questions
└── data/
└── generators.py # CSV dataset generators for take-home exercises
Key design decisions:
- Single flat
Questiondataclass — no inheritance hierarchy; theExerciseTypeenum carries the type tag and__post_init__validates required fields per type viamatch/case. - Sandboxed code execution —
eval_codeusesexec()with a whitelist of ~30 safe builtins (noopen, no__import__). Athreading.Threadwith a 10-second timeout prevents infinite loops. - In-memory SQLite —
eval_sqlruns both the user query and the model answer against a freshsqlite3.connect(":memory:")and compares result sets asfrozenset[tuple](order-independent). - Marker dicts — pandas DataFrames and sklearn arrays in test cases are stored as serialisable dicts (
{"__pandas_df__": True, "data": {...}}) and resolved to real objects just before evaluation. - SQLite Database — questions are stored in
ds_trainer/questions.dbmaking it fully portable and easy to query or extend.
pip install -e ".[dev]"
# Run all tests
pytest tests/ -v
# With coverage
pytest tests/ --cov=ds_trainer --cov-report=term-missingPython 3.10+ required (uses match/case throughout).
There are two ways to add custom questions to your test prep environment.
The easiest way to add new questions dynamically is through the visual web application, which persists data directly to ds_trainer/questions.db.
- Run
uv run python web/main.pyand openhttp://127.0.0.1:8000. - In the Database Management section at the bottom of the setup screen, click Add Question.
- Fill out the form fields. The system adapts based on the exercise type and automatically generates your Question ID!
- Click Test Question. This securely evaluates your model answer against your test cases (or SQL schema) to guarantee the question works.
- Once the test passes, click Add to Database.
If you prefer managing questions as code, you can define them in the core python files. Note: Questions added here can be populated to the web app's database later by using the "Reset Database" utility.
- Open the relevant domain file under
ds_trainer/domains/(e.g.,sql.py,ml.py). - Add a
Question(...)literal to theQUESTIONSlist. - Follow the ID convention:
{domain_abbrev}_{difficulty_abbrev}_{serial}— e.g.,sql_m_007. - Run
ds-trainer statsto confirm the new question appears. - Run
pytest tests/— the domain smoke tests will verify that your model answer passes successfully.
For FILL_IN_CODE questions, always include test_cases so the answer can be auto-evaluated. The model_answer field is shown to the user after they submit and is also used by the test suite to verify correctness.
For TAKE_HOME questions, add a generator to ds_trainer/data/generators.py decorated with @register("your_key") and reference it via dataset_generator="your_key".
MIT