A Python test for a Staff Data Scientist position should evaluate a candidate's ability to work with data, perform analysis, and build predictive models. Here’s a sample test covering each of these areas:

### Python Test for Data Analysis and Modeling Skills

#### Instructions
- You have 90 minutes to complete the test.
- You may use any libraries or tools you are comfortable with.
- Submit your code and a brief report summarizing your findings and conclusions.

#### Dataset
You will work with a fictional retail dataset named `retail_data.csv`. It contains the following columns:
- `order_id`: Unique identifier for each order
- `customer_id`: Unique identifier for each customer
- `order_date`: Date when the order was placed
- `product_id`: Unique identifier for each product
- `quantity`: Number of items purchased
- `price`: Price of each item
- `category`: Category of the product (e.g., Electronics, Clothing, Groceries)
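
As a rough illustration of the schema above, the following sketch loads a CSV and checks that every documented column is present. The two-row `SAMPLE_CSV` string, the `EXPECTED_COLUMNS` list, and the `load_retail_data` helper are hypothetical names introduced here; in the real test you would pass the path to `retail_data.csv` instead.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for retail_data.csv (assumption).
SAMPLE_CSV = """order_id,customer_id,order_date,product_id,quantity,price,category
ORD000001,CUST17,2022-03-05,PROD42,3,19.99,Clothing
ORD000002,CUST88,2023-11-20,PROD7,1,249.50,Electronics
"""

# Column names taken from the dataset description.
EXPECTED_COLUMNS = [
    "order_id", "customer_id", "order_date",
    "product_id", "quantity", "price", "category",
]


def load_retail_data(source):
    """Read the CSV and fail fast if a documented column is missing."""
    df = pd.read_csv(source)
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Parse dates only after confirming the column exists.
    df["order_date"] = pd.to_datetime(df["order_date"])
    return df


sample_df = load_retail_data(io.StringIO(SAMPLE_CSV))
print(sample_df.dtypes)
```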

#### Tasks

1. **Data Cleaning (20 minutes)**
   - Load the dataset and check for missing values. Handle any missing data appropriately.
   - Convert the `order_date` column to a datetime object and extract the year and month into new columns.

2. **Exploratory Data Analysis (EDA) (30 minutes)**
   - Generate summary statistics for numerical columns.
   - Create visualizations to answer the following questions (here, sales means `quantity * price`):
     - What is the distribution of sales by product category?
     - How do sales vary over time (monthly sales trend)?
     - Identify the top 5 products by total sales.

3. **Feature Engineering (15 minutes)**
   - Create a new feature called `total_sales` which is calculated as `quantity * price`.
   - Create any additional features that you think might be relevant for predicting future sales.

4. **Modeling (25 minutes)**
   - Split the data into training and testing sets (80% training, 20% testing).
   - Build a regression model to predict `total_sales` based on the features you engineered.
   - Evaluate your model using appropriate metrics (e.g., RMSE, R²).

5. **Reporting (no separate time allotment)**
   - Provide a brief report (1-2 pages) summarizing your analysis, modeling approach, results, and any recommendations based on your findings.
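
The feature engineering task (task 3) could be approached along these lines. This is only a sketch on a tiny hand-made frame: `total_sales` comes from the task description, while `order_dayofweek`, `is_weekend`, `customer_order_count`, and the `add_features` helper are illustrative assumptions, not required features.

```python
import pandas as pd

# Tiny hypothetical sample with the columns the features depend on.
orders = pd.DataFrame({
    "customer_id": ["CUST1", "CUST1", "CUST2"],
    "order_date": pd.to_datetime(["2023-01-07", "2023-02-15", "2023-02-18"]),
    "quantity": [2, 1, 4],
    "price": [10.0, 99.5, 5.25],
})


def add_features(df):
    """Return a copy of df with total_sales plus a few example features."""
    out = df.copy()
    # Required by the task: revenue per order line.
    out["total_sales"] = out["quantity"] * out["price"]
    # Calendar features (Monday=0 ... Sunday=6).
    out["order_dayofweek"] = out["order_date"].dt.dayofweek
    out["is_weekend"] = out["order_dayofweek"] >= 5
    # How many order lines each customer has in the data.
    out["customer_order_count"] = (
        out.groupby("customer_id")["order_date"].transform("count")
    )
    return out


featured = add_features(orders)
print(featured)
```

Aggregate features such as `customer_order_count` are computed here over the whole frame for brevity; in a real submission they would normally be derived from the training split only, to avoid leaking test information.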

#### Submission
- Submit your Python code as a Jupyter Notebook or Python script.
- Include visualizations and a summary report in your submission.

### Evaluation Criteria
- **Data Cleaning**: Appropriateness of handling missing values and data types.
- **EDA**: Depth of analysis and quality of visualizations.
- **Feature Engineering**: Creativity and relevance of features created.
- **Modeling**: Correctness of the model, choice of algorithms, and evaluation metrics.
- **Clarity of Reporting**: Ability to communicate findings clearly and effectively.

### Notes
- You may use libraries such as Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn.
- Make sure your code is well-documented with comments explaining your thought process.

This test provides a comprehensive evaluation of a candidate’s data analysis and modeling skills, relevant for a Staff Data Scientist role. Let me know if you need any modifications or additional details!

In [1]:
# Generate the synthetic retail dataset

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Parameters for generating the dataset
num_records = 1000
categories = ['Electronics', 'Clothing', 'Groceries', 'Home & Garden', 'Health & Beauty']
start_date = datetime(2022, 1, 1)
end_date = datetime(2024, 10, 1)

# Generate random data
data = {
    'order_id': [f'ORD{str(i).zfill(6)}' for i in range(1, num_records + 1)],
    'customer_id': [f'CUST{np.random.randint(1, 201)}' for _ in range(num_records)],
    'order_date': [start_date + timedelta(days=np.random.randint(0, (end_date - start_date).days)) for _ in range(num_records)],
    'product_id': [f'PROD{np.random.randint(1, 101)}' for _ in range(num_records)],
    'quantity': np.random.randint(1, 10, size=num_records),
    'price': np.round(np.random.uniform(5.0, 500.0, size=num_records), 2),
    'category': [np.random.choice(categories) for _ in range(num_records)]
}

# Create DataFrame
retail_data = pd.DataFrame(data)

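# Save to CSV, creating the output directory first if it does not exist
import os
csv_file_path = './data/retail_data.csv'
os.makedirs(os.path.dirname(csv_file_path), exist_ok=True)
retail_data.to_csv(csv_file_path, index=False)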

csv_file_path


'./data/retail_data.csv'

In [None]:
## Data Cleaning (20 minutes)

## Load the dataset and check for missing values. Handle any missing data appropriately.
df = pd.read_csv("./data/retail_data.csv")
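
One possible way to handle missing values is sketched below. The generated dataset above contains no gaps, so the `raw` frame here is a hypothetical sample with gaps injected by hand, and the policy (drop rows without a quantity, fill price with the median, label missing categories as "Unknown") is an assumption rather than the expected answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values injected by hand.
raw = pd.DataFrame({
    "order_id": ["ORD1", "ORD2", "ORD3", "ORD4"],
    "quantity": [2, np.nan, 5, 1],
    "price": [10.0, 20.0, np.nan, 30.0],
    "category": ["Clothing", "Electronics", "Groceries", None],
})

# Step 1: count missing values per column.
missing_counts = raw.isna().sum()
print(missing_counts)

# Step 2: apply an explicit, documented policy (assumption, not the only option).
cleaned = raw.dropna(subset=["quantity"]).copy()  # quantity cannot be inferred
cleaned["price"] = cleaned["price"].fillna(cleaned["price"].median())
cleaned["category"] = cleaned["category"].fillna("Unknown")
print(cleaned)
```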




In [None]:
# Convert the order_date column to a datetime object and extract the year and month into new columns
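
A minimal sketch of the date conversion step, assuming ISO-formatted date strings. The `dates` frame and the `order_year` and `order_month` column names are illustrative; `errors="coerce"` turns unparseable values into `NaT` instead of raising, which makes bad rows easy to count.

```python
import pandas as pd

# Hypothetical sample, including one value that is not a valid date.
dates = pd.DataFrame({"order_date": ["2022-01-15", "2023-07-04", "not a date"]})

# Invalid strings become NaT rather than raising an exception.
dates["order_date"] = pd.to_datetime(dates["order_date"], errors="coerce")

# Year and month are NaN wherever the date could not be parsed.
dates["order_year"] = dates["order_date"].dt.year
dates["order_month"] = dates["order_date"].dt.month
print(dates)
```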


In [None]:
# Exploratory Data Analysis (EDA) (30 minutes)

## Generate summary statistics for numerical columns.


## Create visualizations to answer the following questions:
import matplotlib.pyplot as plt
import seaborn as sns

### What is the distribution of sales by product category?
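
The category question could be answered with a grouped sum and a bar chart, roughly as follows. The `sales` frame is a hypothetical sample, and the `Agg` backend line is an assumption that lets the sketch run outside a notebook; in Jupyter it can be removed.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend (assumption); not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample of order lines.
sales = pd.DataFrame({
    "category": ["Electronics", "Clothing", "Electronics", "Groceries"],
    "quantity": [1, 3, 2, 10],
    "price": [200.0, 25.0, 150.0, 4.0],
})
sales["total_sales"] = sales["quantity"] * sales["price"]

# Total revenue per category, largest first.
sales_by_category = (
    sales.groupby("category")["total_sales"].sum().sort_values(ascending=False)
)

fig, ax = plt.subplots(figsize=(6, 4))
sales_by_category.plot(kind="bar", ax=ax)
ax.set_xlabel("Category")
ax.set_ylabel("Total sales")
ax.set_title("Total sales by product category")
fig.tight_layout()
```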



In [None]:
### How do sales vary over time (monthly sales trend)?

### Identify the top 5 products by total sales.
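
Both questions in this cell reduce to a grouped sum. The sketch below uses a hypothetical `orders` frame that already has a `total_sales` column; the resulting `monthly_sales` series can be passed to a line plot, and `nlargest(5)` returns fewer rows when fewer than five products exist.

```python
import pandas as pd

# Hypothetical sample with total_sales already computed.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-02-11", "2023-03-02"]
    ),
    "product_id": ["PROD1", "PROD2", "PROD1", "PROD3"],
    "total_sales": [100.0, 50.0, 25.0, 80.0],
})

# Monthly trend: group by calendar month (a pandas Period such as 2023-01).
monthly_sales = (
    orders.groupby(orders["order_date"].dt.to_period("M"))["total_sales"].sum()
)

# Top products by revenue, largest first.
top_products = orders.groupby("product_id")["total_sales"].sum().nlargest(5)

print(monthly_sales)
print(top_products)
```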


In [None]:
# Modeling (25 minutes)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


## Split the data into training and testing sets (80% training, 20% testing).


## Build a regression model to predict total_sales based on the features you engineered.


## Evaluate your model using appropriate metrics (e.g., RMSE, R²).
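
The three modeling steps could be combined into a single scikit-learn pipeline, sketched below on synthetic data generated the same way as the dataset above. The feature choice, the `preprocess` and `model` names, and the fixed random seeds are assumptions. Note that `total_sales` is exactly `quantity * price`, so any model given both columns is approximating a known formula; a candidate would be expected to comment on that.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in for the retail data (assumption), seeded for repeatability.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "quantity": rng.integers(1, 10, size=n),
    "price": rng.uniform(5.0, 500.0, size=n).round(2),
    "category": rng.choice(["Electronics", "Clothing", "Groceries"], size=n),
})
df["total_sales"] = df["quantity"] * df["price"]

X = df[["quantity", "price", "category"]]
y = df["total_sales"]

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), ["quantity", "price"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
])
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
model.fit(X_train, y_train)

# Evaluate on the held-out 20%.
y_pred = model.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
r2 = float(r2_score(y_test, y_pred))
print(f"RMSE: {rmse:.2f}, R^2: {r2:.3f}")
```

Fitting the scaler and encoder inside the pipeline means they only ever see the training split, which keeps the evaluation honest.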