**Python Test for Staff Data Scientist Position**

This test assesses your skills in data analysis, data science, and machine learning at an intermediate/advanced level using Python. Please provide clear and concise code, along with explanations and interpretations where necessary.

**Dataset:**

For this test, you'll be working with a synthetic dataset simulating sales data for a retailer. You can generate this dataset yourself using the following code:

In [1]:
import pandas as pd
import numpy as np

# Set seed for reproducibility
np.random.seed(42)

# Generate synthetic data
n_rows = 10000
df = pd.DataFrame({
    'Store_ID': np.random.randint(1, 100, n_rows),
    'Date': pd.to_datetime(np.random.choice(pd.date_range('2023-01-01', '2024-03-31'), n_rows)),
    'Department': np.random.choice(['Electronics', 'Clothing', 'Grocery', 'Home Goods'], n_rows),
    'Product_Category': np.random.choice(['TVs', 'Shirts', 'Produce', 'Furniture', 'Laptops', 'Pants', 'Dairy', 'Kitchenware'], n_rows),
    'Units_Sold': np.random.randint(1, 50, n_rows),
    'Price_per_Unit': np.random.uniform(5, 200, n_rows),
    'Promotion': np.random.choice([True, False], n_rows),
    'Holiday': np.random.choice([True, False], n_rows, p=[0.1, 0.9]),
    'Temperature': np.random.uniform(20, 90, n_rows),
    'Fuel_Price': np.random.uniform(2.5, 5, n_rows)
})

# Calculate total sales
df['Total_Sales'] = df['Units_Sold'] * df['Price_per_Unit']

# Save the dataset
df.to_csv('sales_data.csv', index=False)

**Tasks:**

1. **Data Loading and Exploration (10 points)**
    * Load the dataset into a pandas DataFrame.
    * Display the first 5 rows, with all columns visible, using `df.head()`.
    * Show the columns and their data types using `df.info()`.
    * Provide descriptive statistics of the numerical features.
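
As a rough illustration of the loading and exploration steps, here is a minimal sketch. It assumes a small hypothetical stand-in for `sales_data.csv`, written to an in-memory buffer so the snippet runs on its own; with the real file you would pass the path to `pd.read_csv` instead.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for sales_data.csv (assumption, not the real file).
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "Store_ID": rng.integers(1, 100, 50),
    "Date": pd.date_range("2023-01-01", periods=50, freq="D"),
    "Department": rng.choice(["Electronics", "Clothing"], 50),
    "Units_Sold": rng.integers(1, 50, 50),
    "Price_per_Unit": rng.uniform(5, 200, 50),
})
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Load the CSV; parse_dates keeps 'Date' as datetime instead of plain strings.
df = pd.read_csv(buffer, parse_dates=["Date"])

# Make sure pandas does not truncate wide frames when printing.
pd.set_option("display.max_columns", None)

print(df.head())   # first 5 rows, all columns
df.info()          # column names, dtypes, non-null counts

# Descriptive statistics restricted to the numerical columns.
numeric_summary = df.select_dtypes("number").describe()
print(numeric_summary)
```

Passing `parse_dates` at load time is a design choice: the CSV round trip otherwise turns `Date` into strings, which would have to be converted before any time-based analysis.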


2. **Data Cleaning and Preprocessing (15 points)**
    * Check for missing values in each column. If any are present, handle them with an appropriate imputation technique.
    * Convert categorical variables ('Department', 'Product_Category', 'Promotion', 'Holiday') into numerical representations using one-hot encoding.
    * Split the data into features (X) and target variable (y), where 'Total_Sales' is the target.
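
The cleaning steps above could be sketched as follows. This is one possible approach, not the expected answer: it assumes a small hypothetical sample with two missing prices injected on purpose, uses median imputation as an example technique, and encodes with `pd.get_dummies`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same categorical columns as the test dataset.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Department": rng.choice(["Electronics", "Clothing", "Grocery", "Home Goods"], n),
    "Product_Category": rng.choice(["TVs", "Shirts", "Produce", "Furniture"], n),
    "Promotion": rng.choice([True, False], n),
    "Holiday": rng.choice([True, False], n, p=[0.1, 0.9]),
    "Units_Sold": rng.integers(1, 50, n),
    "Price_per_Unit": rng.uniform(5, 200, n),
})
df["Total_Sales"] = df["Units_Sold"] * df["Price_per_Unit"]
# Inject missing values purely to demonstrate imputation (assumption for the demo).
df.loc[[3, 10], "Price_per_Unit"] = np.nan

# 1. Count missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# 2. Median-impute numeric features, leaving the target untouched.
feature_num_cols = df.select_dtypes("number").columns.drop("Total_Sales")
df[feature_num_cols] = df[feature_num_cols].fillna(df[feature_num_cols].median())

# 3. One-hot encode; drop_first avoids perfectly collinear dummy columns.
categorical = ["Department", "Product_Category", "Promotion", "Holiday"]
encoded = pd.get_dummies(df, columns=categorical, drop_first=True, dtype=int)

# 4. Separate features and target.
X = encoded.drop(columns="Total_Sales")
y = encoded["Total_Sales"]
```

Dropping the first level of each category matters mostly for linear models; tree-based models are generally insensitive to it, so keeping all levels is also defensible.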


3. **Exploratory Data Analysis (25 points)**
    * Analyze the distribution of 'Total_Sales'.
    * Explore the relationship between 'Total_Sales' and categorical features ('Department', 'Product_Category', 'Promotion', 'Holiday') using visualizations.
    * Investigate the correlation between 'Total_Sales' and numerical features ('Units_Sold', 'Price_per_Unit', 'Temperature', 'Fuel_Price').
    * Identify any trends or patterns in sales over time ('Date').
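
A minimal sketch of the numeric side of this analysis is shown below, assuming a hypothetical sample generated the same way as the test dataset. It computes the summaries that the plots (for example a histogram, box plots per category, a correlation heatmap, and a monthly line chart) would visualize; the plotting calls themselves are left out to keep the snippet dependency-free.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the structure of the test dataset.
rng = np.random.default_rng(1)
n = 1000
dates = pd.date_range("2023-01-01", "2024-03-31")
df = pd.DataFrame({
    "Date": dates[rng.integers(0, len(dates), n)],
    "Department": rng.choice(["Electronics", "Clothing", "Grocery", "Home Goods"], n),
    "Units_Sold": rng.integers(1, 50, n),
    "Price_per_Unit": rng.uniform(5, 200, n),
    "Temperature": rng.uniform(20, 90, n),
    "Fuel_Price": rng.uniform(2.5, 5, n),
})
df["Total_Sales"] = df["Units_Sold"] * df["Price_per_Unit"]

# Distribution of the target: summary statistics and skewness.
print(df["Total_Sales"].describe())
skewness = df["Total_Sales"].skew()

# Target versus a categorical feature: per-group summary.
by_department = df.groupby("Department")["Total_Sales"].agg(["mean", "median", "count"])

# Pearson correlation of each numerical feature with the target.
numeric_cols = ["Total_Sales", "Units_Sold", "Price_per_Unit", "Temperature", "Fuel_Price"]
corr_with_sales = df[numeric_cols].corr()["Total_Sales"].drop("Total_Sales")

# Sales over time: total per calendar month ("MS" = month-start bins).
monthly_sales = df.set_index("Date")["Total_Sales"].resample("MS").sum()

print(skewness, by_department, corr_with_sales, monthly_sales, sep="\n\n")
```

Because `Total_Sales` is defined as `Units_Sold * Price_per_Unit`, strong correlations with those two columns are expected by construction, while the independently generated columns should show correlations near zero.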


4. **Feature Engineering (15 points)**
    * Create new features that you believe could improve model performance. Justify your choices.
    * Examples:
        * `Weekend`: Whether the date is on a weekend.
        * `Month`: Month of the year from the date.
        * `Lagged_Sales`: Sales from the previous week or month.
        * `Price_per_Unit_Discount`: Discount percentage when `Promotion` is True. The generated dataset has no list-price or discount column, so state any assumption you make.
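
The first three example features could be derived roughly as follows. This is a sketch on a tiny hand-made frame; the helper columns `Week` and `Weekly_Sales` are hypothetical names introduced for illustration, and the lag is defined here as the store's total sales in its previous observed week. The discount feature is omitted because the dataset contains no list price to compute it from.

```python
import pandas as pd

# Tiny hand-made frame (assumption) so the results are easy to verify by eye.
df = pd.DataFrame({
    "Store_ID": [1, 1, 1, 2, 2],
    "Date": pd.to_datetime(
        ["2023-01-02", "2023-01-07", "2023-01-10", "2023-01-03", "2023-01-15"]
    ),
    "Total_Sales": [100.0, 50.0, 80.0, 30.0, 60.0],
})

# Weekend flag: dayofweek is 0 for Monday through 6 for Sunday.
df["Weekend"] = df["Date"].dt.dayofweek >= 5

# Month of the year (1-12).
df["Month"] = df["Date"].dt.month

# Lagged sales: aggregate to store-week totals, then shift within each store.
# Weeks run Monday to Sunday; 'Week' holds the Monday that starts each week.
df["Week"] = df["Date"].dt.to_period("W").dt.start_time
weekly = (
    df.groupby(["Store_ID", "Week"], as_index=False)["Total_Sales"]
    .sum()
    .rename(columns={"Total_Sales": "Weekly_Sales"})
)
# shift(1) takes the previous *observed* week for that store, which may not be
# the previous calendar week if a store has gaps.
weekly["Lagged_Sales"] = weekly.groupby("Store_ID")["Weekly_Sales"].shift(1)

df = df.merge(
    weekly[["Store_ID", "Week", "Lagged_Sales"]],
    on=["Store_ID", "Week"],
    how="left",
)
print(df)
```

Rows in a store's first week have no history, so `Lagged_Sales` is missing there and needs to be imputed or dropped before modeling. Computing lags from earlier weeks only also helps avoid leaking the target into the features.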


5. **Model Building and Evaluation (25 points)**
    * Choose an appropriate machine learning model for predicting 'Total_Sales'. Consider regression models like Linear Regression, Decision Tree, Random Forest, or Gradient Boosting.
    * Split the data into training and testing sets.
    * Train the chosen model and tune hyperparameters using cross-validation or a similar technique.
    * Evaluate the model's performance on the test set using relevant metrics (e.g., R-squared, Mean Squared Error, Root Mean Squared Error).
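
One way the split, tune, and evaluate steps fit together is sketched below. The choice of a random forest, the small parameter grid, and the hypothetical three-column feature set are all assumptions made to keep the example quick; they are not a recommended final configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical features; the target follows the dataset's own definition.
rng = np.random.default_rng(42)
n = 600
X = pd.DataFrame({
    "Units_Sold": rng.integers(1, 50, n),
    "Price_per_Unit": rng.uniform(5, 200, n),
    "Temperature": rng.uniform(20, 90, n),
})
y = X["Units_Sold"] * X["Price_per_Unit"]

# Hold out a test set that is never seen during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Tune hyperparameters with 3-fold cross-validation on the training set only.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid={"max_depth": [4, 8, None], "min_samples_leaf": [1, 5]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set.
pred = search.predict(X_test)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, pred)
print(search.best_params_)
print(f"MSE={mse:.1f}  RMSE={rmse:.1f}  R2={r2:.3f}")
```

Since `Total_Sales` is an exact product of two feature columns, a very high score mainly confirms that the model recovered that identity. A candidate may want to discuss whether both columns would be known at prediction time.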


6. **Model Interpretability and Insights (10 points)**
    * If applicable, interpret the model's results. Which features are most important for predicting 'Total_Sales'?
    * Provide actionable insights, based on your analysis and model predictions, that could be useful to the retailer.
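
Feature importance can be estimated in several ways. The sketch below uses permutation importance on held-out data as one example, assuming a random forest and a hypothetical four-column feature set; impurity-based `feature_importances_` or other attribution methods would be equally valid choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical features; two drive the target, two are pure noise.
rng = np.random.default_rng(7)
n = 600
X = pd.DataFrame({
    "Units_Sold": rng.integers(1, 50, n),
    "Price_per_Unit": rng.uniform(5, 200, n),
    "Temperature": rng.uniform(20, 90, n),
    "Fuel_Price": rng.uniform(2.5, 5, n),
})
y = X["Units_Sold"] * X["Price_per_Unit"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7
)
model = RandomForestRegressor(n_estimators=100, random_state=7).fit(X_train, y_train)

# Shuffle one column at a time on the test set and measure the drop in score.
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=7)
importances = pd.Series(result.importances_mean, index=X.columns).sort_values(
    ascending=False
)
print(importances)
```

Measuring importance on the test set rather than the training set gives a view of which features matter for generalization, at the cost of some extra variance from the smaller sample.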


**Bonus (10 points)**

* Deploy your trained model as a simple API endpoint using a framework like Flask or FastAPI. This allows for model predictions on new data.

**Submission:**

Please submit your completed test as a Jupyter Notebook or a Python script. Include your code, explanations, visualizations, and interpretations.

This test is designed to be challenging but also to showcase your abilities. Good luck!

In [None]:
## Load the dataset into a pandas DataFrame.

## Display the first 5 rows and all columns by df.head().

## Show columns and their types by df.info().

## Provide descriptive statistics of the numerical features.


In [None]:
## Check for missing values in each column. If any are present, handle them with an appropriate imputation technique.

## Convert categorical variables ('Department', 'Product_Category', 'Promotion', 'Holiday') into numerical representations using one-hot encoding.

## Split the data into features (X) and target variable (y), where 'Total_Sales' is the target.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## Analyze the distribution of 'Total_Sales'.

## Explore the relationship between 'Total_Sales' and categorical features ('Department', 'Product_Category', 'Promotion', 'Holiday') using visualizations.

## Investigate the correlation between 'Total_Sales' and numerical features ('Units_Sold', 'Price_per_Unit', 'Temperature', 'Fuel_Price').


## Identify any trends or patterns in sales over time ('Date').


## Create new features that you believe could improve model performance. Justify your choices.




In [None]:
## Choose an appropriate machine learning model for predicting 'Total_Sales'. Consider regression models like Linear Regression, Decision Tree, Random Forest, or Gradient Boosting.

## Split the data into training and testing sets.

## Train the chosen model and tune hyperparameters using cross-validation or a similar technique.

## Evaluate the model's performance on the test set using relevant metrics (e.g., R-squared, Mean Squared Error, Root Mean Squared Error).