# Data Science Practical Test (Python + Theory)

**Structure & Weighting**

- **Section A – Data Science Theory (20%)**
- **Section B – Hands-on Python (70%)**
- **Section C – Statistics Theory (10%)**

**Instructions**

- Use this notebook as your answer sheet.
- For **code questions**, write your code in the provided cells (you may add new cells if needed).
- For **theory questions**, answer in **markdown cells** directly under each question.
- You may use external documentation (e.g. pandas / scikit-learn docs), but **do not copy-paste full solutions**.
- Make sure all cells run **top-to-bottom without errors** before submitting.

## Setup & Dataset

Run the cell below to import libraries and create a synthetic dataset that you will use for most questions.

The dataset simulates a **customer churn** problem with the following columns:

- `customer_id` – Unique identifier
- `age` – Age of customer (years)
- `monthly_charges` – Current monthly charges (USD)
- `tenure_months` – Number of months the customer has been with the company
- `contract_type` – Categorical: `"month-to-month"`, `"one-year"`, `"two-year"`
- `support_tickets_last_6m` – Number of support tickets opened in the last 6 months
- `avg_call_minutes` – Average call minutes per month
- `is_active_app_user` – 1 if the customer uses the mobile app, 0 otherwise
- `region` – Categorical: `"North"`, `"South"`, `"East"`, `"West"`
- `churned` – **Target variable**: 1 if customer churned, 0 otherwise

In [4]:
import numpy as np
import pandas as pd

np.random.seed(42)

n = 1000

customer_id = np.arange(1, n + 1)
age = np.random.randint(18, 80, size=n)
tenure_months = np.random.randint(1, 72, size=n).astype(float)

contract_type = np.random.choice(["month-to-month", "one-year", "two-year"], size=n, p=[0.6, 0.25, 0.15])
region = np.random.choice(["North", "South", "East", "West"], size=n, p=[0.25, 0.25, 0.25, 0.25])

monthly_charges = (
    20
    + (age - 18) * 0.3
    + np.where(contract_type == "month-to-month", 10, 0)
    + np.where(contract_type == "two-year", -5, 0)
    + np.random.normal(0, 10, size=n)
)
monthly_charges = np.clip(monthly_charges, 10, None)

support_tickets_last_6m = np.random.poisson(lam=2, size=n)
avg_call_minutes = np.random.normal(loc=300, scale=80, size=n)
avg_call_minutes = np.clip(avg_call_minutes, 30, None)

is_active_app_user = np.random.binomial(1, p=0.7, size=n)

# Latent churn probability (logistic-style)
logit = (
    -2.0
    + 0.03 * (age - 40)
    + 0.05 * (support_tickets_last_6m)
    - 0.02 * (tenure_months)
    + 0.0005 * (monthly_charges - 50) ** 2 / 10
    - 0.5 * is_active_app_user
    + np.where(contract_type == "month-to-month", 0.6, 0)
    + np.where(contract_type == "two-year", -0.4, 0)
)

prob_churn = 1 / (1 + np.exp(-logit))
churned = np.random.binomial(1, prob_churn)

df = pd.DataFrame({
    "customer_id": customer_id,
    "age": age,
    "monthly_charges": monthly_charges.round(2),
    "tenure_months": tenure_months,
    "contract_type": contract_type,
    "support_tickets_last_6m": support_tickets_last_6m,
    "avg_call_minutes": avg_call_minutes.round(1),
    "is_active_app_user": is_active_app_user,
    "region": region,
    "churned": churned
})

df.head()

|   | customer_id | age | monthly_charges | tenure_months | contract_type | support_tickets_last_6m | avg_call_minutes | is_active_app_user | region | churned |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56 | 65.89 | 11.0 | month-to-month | 2 | 313.5 | 1 | East | 0 |
| 1 | 2 | 69 | 18.54 | 26.0 | two-year | 0 | 375.8 | 1 | East | 0 |
| 2 | 3 | 46 | 27.64 | 63.0 | month-to-month | 2 | 340.8 | 1 | North | 0 |
| 3 | 4 | 32 | 22.6 | 59.0 | month-to-month | 3 | 341.3 | 1 | North | 0 |
| 4 | 5 | 60 | 39.64 | 27.0 | month-to-month | 4 | 292.4 | 0 | West | 1 |
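
Before starting the questions, it can help to confirm that the generated frame matches the column description above. The following is a minimal sketch, not part of the test itself: the helper name `check_churn_frame` is hypothetical, and the value ranges are read off the generating code (ages 18 to 79, tenure 1 to 71 months, a 10 USD floor on charges, a 30-minute floor on call minutes). It is demonstrated on two rows copied from the preview.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "customer_id", "age", "monthly_charges", "tenure_months", "contract_type",
    "support_tickets_last_6m", "avg_call_minutes", "is_active_app_user",
    "region", "churned",
]


def check_churn_frame(frame):
    """Return a list of problems found; an empty list means the frame looks sane.

    Hypothetical helper: the bounds mirror the synthetic-data code in the setup cell.
    """
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    if frame["customer_id"].duplicated().any():
        problems.append("customer_id is not unique")
    if not frame["age"].between(18, 79).all():
        problems.append("age outside 18-79")
    if not frame["tenure_months"].between(1, 71).all():
        problems.append("tenure_months outside 1-71")
    if (frame["monthly_charges"] < 10).any():
        problems.append("monthly_charges below the 10 USD floor")
    if (frame["avg_call_minutes"] < 30).any():
        problems.append("avg_call_minutes below the 30-minute floor")
    if not frame["contract_type"].isin(["month-to-month", "one-year", "two-year"]).all():
        problems.append("unexpected contract_type value")
    if not frame["region"].isin(["North", "South", "East", "West"]).all():
        problems.append("unexpected region value")
    if not frame["is_active_app_user"].isin([0, 1]).all():
        problems.append("is_active_app_user is not binary")
    if not frame["churned"].isin([0, 1]).all():
        problems.append("churned is not binary")
    return problems


# Two rows copied from the preview above (index 0 and index 4).
sample = pd.DataFrame({
    "customer_id": [1, 5],
    "age": [56, 60],
    "monthly_charges": [65.89, 39.64],
    "tenure_months": [11.0, 27.0],
    "contract_type": ["month-to-month", "month-to-month"],
    "support_tickets_last_6m": [2, 4],
    "avg_call_minutes": [313.5, 292.4],
    "is_active_app_user": [1, 0],
    "region": ["East", "West"],
    "churned": [0, 1],
})

broken = sample.assign(age=[17, 60])  # deliberately invalid age

print(check_churn_frame(sample))
print(check_churn_frame(broken))
```

In the notebook, calling `check_churn_frame(df)` after the setup cell should return an empty list if the cell ran unmodified.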


---

## Section A – Data Science Theory (20%)

Answer the following questions in **markdown** cells. Be concise but clear (3–6 sentences per question is usually sufficient).

### Q1. Supervised vs. Unsupervised Learning

1. Define **supervised learning** and **unsupervised learning**.
2. For each paradigm, give **two example tasks** (e.g. classification, clustering) and **one real-world example**.
3. Identify whether the churn problem in this notebook is supervised or unsupervised, and explain why.

**Your answer for Q1 (edit this cell or add a new markdown cell):**

### Q2. Model Evaluation & Bias–Variance Tradeoff

1. Explain the **bias–variance tradeoff** in your own words.
2. Describe how **underfitting** and **overfitting** relate to bias and variance.
3. In the context of this churn dataset, describe **one concrete action** you could take during modeling to reduce overfitting.

**Your answer for Q2 (edit this cell or add a new markdown cell):**

---

## Section B – Hands-on Python (70%)

Use `df` (the dataset created above) to answer the following questions.  
Where code is requested, write your solution in the provided code cell (you may add more cells if needed).

### Q3. Exploratory Data Analysis (EDA)

a. Show the shape of the dataset and basic information about each column (types, non-null counts).  
b. Compute basic descriptive statistics for the numeric variables.  
c. Check for missing values in the dataset.  
d. Compute and display the **churn rate** (percentage of rows where `churned == 1`).

Write code to perform all the above tasks.

In [None]:
# Q3 – YOUR CODE HERE

# a) Shape & basic info



# b) Descriptive statistics for numeric variables



# c) Missing values check



# d) Churn rate (percentage)

### Q4. Univariate & Bivariate Analysis

a. Produce a **histogram or density plot** for `tenure_months`. Briefly interpret the distribution in a markdown cell (e.g. are most customers recent or long-term?).  
b. Produce a **bar plot** showing churn rate by `contract_type`.  
c. Based on the plot, which contract type appears to have the **highest churn rate**?

> You may use `matplotlib` or `seaborn` (if installed).

In [None]:
# Q4 – YOUR CODE HERE

import matplotlib.pyplot as plt

# a) Distribution of tenure_months




# b) Churn rate by contract_type

**Brief interpretation for Q4 (c):**  
_Write 2–4 sentences describing what you observe from the churn rate by contract type._

### Q5. Data Preparation & Train/Test Split

We will build models to predict `churned`.

a. Create a new DataFrame `features` that includes all columns of `df` **except** the identifier `customer_id` and the target `churned`.  
b. Encode categorical variables appropriately (you may use `pd.get_dummies` or `sklearn` encoders).  
c. Create `X` (features) and `y` (target).  
d. Split the data into **train** and **test** sets (e.g. 80% train, 20% test) using `train_test_split` from `sklearn.model_selection`. Use `random_state=42`.

Show your code and the shapes of the resulting matrices.
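
For orientation only, here is a sketch of how the two calls mentioned above behave on a separate toy frame. The frame, its column names (`plan`, `usage_hours`, `label`), and the `toy_` variable names are all hypothetical and unrelated to `df`; `drop_first=True` and `stratify` are illustrative choices, not requirements of the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one categorical column, one numeric column, one binary label.
toy = pd.DataFrame({
    "plan": ["basic", "pro"] * 5,
    "usage_hours": [1.0, 5.5, 2.0, 7.0, 0.5, 6.0, 1.5, 8.0, 2.5, 4.0],
    "label": [0, 1] * 5,
})

# One-hot encode the categorical column; drop_first avoids a redundant dummy column.
toy_features = pd.get_dummies(toy.drop(columns="label"), columns=["plan"], drop_first=True)
toy_X = toy_features
toy_y = toy["label"]

# 80/20 split; stratify keeps the class balance similar in both parts.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(list(toy_features.columns))
print(toy_X_train.shape, toy_X_test.shape)
```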

In [None]:
# Q5 – YOUR CODE HERE

from sklearn.model_selection import train_test_split

# a) & b) Feature dataframe and encoding



# c) X (features) and y (target)



# d) Train-test split

### Q6. Baseline Model – Logistic Regression

Using the train/test split from Q5:

a. Fit a **Logistic Regression** model to predict `churned`.  
b. Evaluate the model on the **test set** using at least the following metrics:
   - Accuracy  
   - Precision  
   - Recall  
   - F1-score  
c. Display the **confusion matrix**.  
d. Briefly interpret the results: is the model performing reasonably well? Any tradeoffs you notice between precision and recall?

> Hint: Use `sklearn.linear_model.LogisticRegression` and metrics from `sklearn.metrics`.
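
To make the metric definitions concrete, the sketch below applies the `sklearn.metrics` functions to a small hand-written pair of label vectors. The vectors and the `toy_` names are made up for illustration and involve no model; with 2 true positives, 1 false positive, 2 false negatives and 3 true negatives, the values can be checked by hand.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-written labels (hypothetical): 1 = positive class, 0 = negative class.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 0, 0, 1, 1, 0]

toy_accuracy = accuracy_score(toy_true, toy_pred)    # (TP + TN) / total = 5/8
toy_precision = precision_score(toy_true, toy_pred)  # TP / (TP + FP) = 2/3
toy_recall = recall_score(toy_true, toy_pred)        # TP / (TP + FN) = 2/4
toy_f1 = f1_score(toy_true, toy_pred)                # harmonic mean of the two = 4/7

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
toy_cm = confusion_matrix(toy_true, toy_pred)

print(toy_accuracy, toy_precision, toy_recall, toy_f1)
print(toy_cm)
```

Here precision exceeds recall: most predicted positives are correct, but half of the actual positives are missed.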

In [None]:
# Q6 – YOUR CODE HERE

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

# a) Fit Logistic Regression



# b) & c) Evaluate and show confusion matrix

**Interpretation for Q6 (d):**  
_Write 3–5 sentences commenting on the model performance (e.g. accuracy level, precision vs recall, what that means for a churn prediction problem)._

### Q7. Tree-Based Model & Comparison

a. Train a **Random Forest** classifier on the same training data.  
b. Evaluate it on the test set using the **same metrics** as in Q6.  
c. Compare the performance of the Random Forest to the Logistic Regression. Which performs better, and by how much (for at least two metrics)?  
d. Print out the **top 5 most important features** according to the Random Forest model.

> Hint: Use `sklearn.ensemble.RandomForestClassifier`.
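
As a sketch of how feature importances can be read from a fitted forest, the example below uses synthetic toy data in which only one column carries signal. The column names (`signal`, `noise_a`, `noise_b`), the `toy_` variables, and the hyperparameters are assumptions chosen for illustration, not a recommended configuration for the churn data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: the label depends only on the "signal" column.
toy_X = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
toy_y = (toy_X["signal"] > 0).astype(int)

toy_forest = RandomForestClassifier(n_estimators=50, random_state=0)
toy_forest.fit(toy_X, toy_y)

# feature_importances_ is aligned with the column order of the training data
# and sums to 1; pairing it with the column names makes it sortable.
toy_importances = pd.Series(
    toy_forest.feature_importances_, index=toy_X.columns
).sort_values(ascending=False)

print(toy_importances.head(5))
```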

In [None]:
# Q7 – YOUR CODE HERE

from sklearn.ensemble import RandomForestClassifier

# a) Train Random Forest



# b) Evaluate on test set



# d) Feature importance

**Brief comparison for Q7 (c):**  
_Write 3–5 sentences comparing the two models and discussing which you would choose for deployment and why._

### Q8. Cross-Validation

a. Use **cross-validation** (e.g. 5-fold) to estimate the performance of your Logistic Regression model on the whole dataset.  
b. Report the **mean** and **standard deviation** of the cross-validated accuracy scores.  
c. Briefly explain why we use cross-validation instead of a single train/test split.

> Hint: Use `sklearn.model_selection.cross_val_score`.

In [None]:
# Q8 – YOUR CODE HERE

from sklearn.model_selection import cross_val_score

# a) Cross-validation for Logistic Regression



# b) Report mean and std

**Short explanation for Q8 (c):**  
_Write 2–3 sentences on the purpose and benefit of cross-validation._

### Q9. Modeling Pipeline (Optional but Strongly Recommended)

Create a **scikit-learn Pipeline** that:

1. Handles categorical encoding and any scaling you consider appropriate.
2. Trains either a Logistic Regression or Random Forest model.
3. Evaluates the model using cross-validation.

Show your pipeline definition and print the cross-validated accuracy.

> Hint: Use `sklearn.pipeline.Pipeline` and `sklearn.compose.ColumnTransformer`.
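
The sketch below shows how these pieces can fit together on a separate toy frame. The column names (`plan`, `usage_hours`), the step names, and the `toy_` variables are hypothetical, and the choice of `StandardScaler` plus `LogisticRegression` is one option among several, not the expected answer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)

# Hypothetical toy data: one categorical and one numeric predictor.
toy_frame = pd.DataFrame({
    "plan": rng.choice(["basic", "pro"], size=100),
    "usage_hours": rng.normal(5, 2, size=100),
})
toy_score = toy_frame["usage_hours"] + 2 * (toy_frame["plan"] == "pro") + rng.normal(0, 1, size=100)
toy_label = (toy_score > 6).astype(int)

# Route each column type to its own transformer.
toy_preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
    ("num", StandardScaler(), ["usage_hours"]),
])

# Preprocessing is refit inside every CV fold, so no information leaks
# from the held-out fold into the encoder or scaler.
toy_pipeline = Pipeline([
    ("preprocess", toy_preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

toy_scores = cross_val_score(toy_pipeline, toy_frame, toy_label, cv=5, scoring="accuracy")
print(f"CV accuracy: {toy_scores.mean():.3f} +/- {toy_scores.std():.3f}")
```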

In [None]:
# Q9 – YOUR CODE HERE (optional but recommended)

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Define preprocessing & pipeline here

---

## Section C – Statistics Theory (10%)

Answer the following questions in **markdown**.

### Q10. Correlation vs. Causation & Confounding

1. Explain the difference between **correlation** and **causation**.  
2. Give an example (not necessarily from this dataset) where two variables are correlated but not causally related.  
3. In the churn dataset, describe one possible **confounder** that might affect the relationship between `age` and `churned`.

**Your answer for Q10 (edit this cell or add a new markdown cell):**

### Q11. Confidence Intervals & Proportions

Suppose the sample churn rate in this dataset is 0.25 (25%).

1. Explain in words what a **95% confidence interval** for the true churn rate means.  
2. Briefly outline (no formulas needed) how you would construct such a confidence interval for the churn rate.

You do **not** need to compute the confidence interval numerically—just explain the concepts.

**Your answer for Q11 (edit this cell or add a new markdown cell):**