# Project 1 - Building & Evaluating ML Algorithms

In this project, you will work with a supermarket sales dataset. You will implement both regression and classification tasks to report on a set of questions.

The goals of this assignment include:

1. Carry out exploratory data analysis to gather knowledge from data
2. Apply data visualization techniques
3. Build transformation pipelines for data preprocessing and data cleaning
4. Select machine learning algorithms for regression and classification tasks
5. Design pipelines for hyperparameter tuning and model selection
6. Implement performance evaluation metrics and evaluate results
7. Report observations and propose business-centric solutions and mitigation strategies

## Deliverables

As part of this project, you should deliver the following materials:

1. [**4-page IEEE-format paper**](https://www.ieee.org/conferences/publishing/templates.html). Write a paper of no more than 4 pages addressing the questions posed below. When writing this report, consider a business-oriented person as your reader (e.g., your PhD advisor or your internship manager). Tell *the story* for each dataset's goal and propose solutions by addressing (at least) the questions posed below.

2. **Python Code**. Create two separate Notebooks: (1) "training.ipynb", used for training and hyperparameter tuning, and (2) "test.ipynb", used for evaluating the final trained model on the test set. The "test.ipynb" notebook should load all trained objects and simply evaluate their performance, so don't forget to **push the trained models** to your repository to allow us to run it.

All of your code should run without any errors and be well-documented. 
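
The hand-off between the two notebooks can be sketched as follows. This is a minimal illustration on synthetic data, assuming `joblib` is used for persistence; the directory and the file name `gross_income_model.joblib` are hypothetical placeholders, not required names.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real notebooks would use supermarket_sales.csv.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# --- training.ipynb: fit the full pipeline and save it as one object ---
model = Pipeline([("scale", StandardScaler()), ("reg", LinearRegression())])
model.fit(X, y)

# Hypothetical location; in a repository this could be a tracked folder.
model_dir = Path(tempfile.mkdtemp())
model_path = model_dir / "gross_income_model.joblib"  # hypothetical file name
joblib.dump(model, model_path)

# --- test.ipynb: load the saved pipeline and evaluate without refitting ---
loaded = joblib.load(model_path)
test_r2 = loaded.score(X, y)  # in practice, score on the held-out test set
print(f"R^2 of the reloaded model: {test_r2:.3f}")
```

Saving the whole pipeline (preprocessing plus estimator) rather than the estimator alone means the test notebook does not need to repeat any fitting of scalers or encoders.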

3. **README.md file**. Edit the readme.md file in your repository to explain how to use your code. If there are user-defined parameters, your readme.md file must clearly indicate so and demonstrate how to use them.

This is an **individual assignment**. 

These deliverables are **due Tuesday, October 11 @ 11:59pm**. Late submissions will not be accepted, so please plan accordingly.

---

# About the Dataset

Supermarkets are growing in number in the most populated cities, and market competition is high. This dataset contains the historical sales of a supermarket company, recorded at 3 different branches over 3 months. The supermarket sales dataset is available in the ```supermarket_sales.csv``` file.

### Attribute Description

1. **Invoice id**: Computer generated sales slip invoice identification number.

2. **Branch**: Branch of supercenter (3 branches are available identified by A, B and C).

3. **City**: Location of supercenters.

4. **Customer type**: Type of customer, recorded as ```Member``` for customers using a member card and ```Normal``` for customers without one.

5. **Gender**: Gender type of customer.

6. **Product line**: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel.

7. **Unit price**: Price of each product in US dollars.

8. **Quantity**: Number of products purchased by customer.

9. **Total**: Total price including tax.

10. **Date**: Date of purchase (record available from January 2019 to March 2019).

11. **Time**: Purchase time (10am to 9pm).

12. **Payment**: Payment used by customer for purchase (3 methods are available - ```Cash```, ```Credit``` card and ```Ewallet```).

13. **COGS**: Cost of goods sold.

14. **Gross margin percentage**: Gross margin percentage.

15. **Gross income**: Supercenter gross income in US dollars.

16. **Rating**: Customer satisfaction rating of their overall shopping experience (on a scale of 1 to 10).

# Assignment

1. Apply the necessary data preprocessing using ```scikit-learn``` pipelines. Justify all choices. The only requirements regarding attribute encoding are:

    1. Encode the attribute ```Date``` with the respective day of the week (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday).
    2. Encode the attribute ```Time``` into 4 categories: Morning (10:00 - 11:59), Afternoon (12:00 - 17:00), Evening (17:01 - 19:00) and Night (19:01 - 21:00).
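
The two required encodings can be sketched as a custom pipeline step. This is one possible approach, not the required one: it assumes `Date` is stored as `MM/DD/YYYY` text and `Time` as 24-hour `HH:MM` text, and the output column names `Day` and `Time slot` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def encode_date_time(df: pd.DataFrame) -> pd.DataFrame:
    """Replace Date with the day of the week and Time with a 4-level time slot."""
    out = df.copy()

    # Day of the week (Monday ... Sunday); the date format is an assumption.
    out["Day"] = pd.to_datetime(out["Date"], format="%m/%d/%Y").dt.day_name()

    # Minutes since midnight; the time format is an assumption.
    t = pd.to_datetime(out["Time"], format="%H:%M")
    minutes = t.dt.hour * 60 + t.dt.minute

    # Right-inclusive bins: (599, 719] = 10:00-11:59, (719, 1020] = 12:00-17:00,
    # (1020, 1140] = 17:01-19:00, (1140, 1260] = 19:01-21:00.
    out["Time slot"] = pd.cut(
        minutes,
        bins=[599, 719, 1020, 1140, 1260],
        labels=["Morning", "Afternoon", "Evening", "Night"],
    ).astype(str)

    return out.drop(columns=["Date", "Time"])


# Wrapping the function lets it sit as the first step of a scikit-learn pipeline.
date_time_encoder = FunctionTransformer(encode_date_time)

# Toy rows that probe the slot boundaries.
sample = pd.DataFrame(
    {
        "Date": ["01/05/2019", "03/08/2019", "01/07/2019", "01/01/2019"],
        "Time": ["10:00", "17:00", "17:01", "21:00"],
    }
)
encoded = date_time_encoder.fit_transform(sample)
print(encoded)
```

The resulting text columns can then be passed to a categorical encoder such as `OneHotEncoder` in a later pipeline step.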

For what follows, use the coefficient of determination, $r^2$, as one of your metrics of success and report its 95% confidence interval. Carry out any necessary hyperparameter tuning with pipelines. Choose the best CV strategy and report on the best hyperparameter settings.
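
One way to obtain a 95% confidence interval for $r^2$ is to collect cross-validated scores and build a Student-t interval around their mean. The sketch below uses synthetic data and assumes repeated 5-fold CV; because fold scores are not fully independent, the interval should be read as approximate, and other choices (for example, bootstrapping) are equally defensible.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic regression data standing in for the preprocessed sales features.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Repeated K-fold yields 5 x 10 = 50 r^2 scores (the CV scheme is an assumption).
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, scoring="r2", cv=cv)

# Student-t interval around the mean score, using the standard error of the mean.
mean_r2 = scores.mean()
sem = stats.sem(scores)
ci_low, ci_high = stats.t.interval(0.95, len(scores) - 1, loc=mean_r2, scale=sem)

print(f"r^2 = {mean_r2:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```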

2. Train a multiple linear regression with and without Lasso regularization to **predict ```gross income```**.

    1. How is the gross income affected by unit price, quantity, and other variables like day, time slot, and product line in general?
    
    2. When using the Lasso regularizer, which value of the hyperparameter $\lambda$ works best for this dataset? Which features were excluded from this model, if any?

3. Train a multiple linear regression with and without Lasso regularization to **predict ```Unit price```**.

    1. How is the unit price affected by gross income, quantity, and other variables like day, time slot, and product line in general?
    
    2. When using the Lasso regularizer, which value of the hyperparameter $\lambda$ works best for this dataset? Which features were excluded from this model, if any?
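
Tuning $\lambda$ and reading off the excluded features can be sketched as below. Note that scikit-learn's `Lasso` names this hyperparameter `alpha`. The data is synthetic, the feature names `f0` to `f4` are hypothetical, and the search grid is only an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only f0 and f1 actually influence the target.
rng = np.random.default_rng(0)
feature_names = np.array(["f0", "f1", "f2", "f3", "f4"])  # hypothetical names
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Scaling inside the pipeline keeps the penalty comparable across features
# and avoids leaking validation-fold statistics into training.
pipe = Pipeline([("scale", StandardScaler()), ("lasso", Lasso(max_iter=10_000))])

# Candidate regularization strengths (lambda in the text, alpha in scikit-learn).
param_grid = {"lasso__alpha": np.logspace(-3, 1, 20)}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="r2", cv=cv)
search.fit(X, y)

best_alpha = search.best_params_["lasso__alpha"]
coefs = search.best_estimator_.named_steps["lasso"].coef_

# Features whose coefficient was shrunk exactly to zero are excluded by Lasso.
excluded = feature_names[coefs == 0.0]

print(f"best alpha: {best_alpha:.4f}")
print("coefficients:", dict(zip(feature_names, coefs.round(3))))
print("excluded features:", list(excluded))
```

Fitting the same pipeline with `LinearRegression` in place of `Lasso` gives the unregularized baseline for comparison.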

---

#### Questions 4 and 5 are required for the EEL 5934 section only. Students in EEL 4930 are welcome to solve these tasks, but no extra credit will be awarded.

4. Train a logistic regression to **classify gender** and study the relationship between attributes. Namely, explain the relationship between gender, product line, payment and gross income for branch C. To study this relationship, consider all the interaction terms of degree 2 (see the ```interaction_only``` parameter in [```PolynomialFeatures```](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)).

    1. For ```Gender=male``` customers, plot the parameter values for all attributes (and their 2nd-order interactions).
    2. Which attributes are the most informative?

5. Train a logistic regression to **classify customer type** and study the relationship between attributes. Namely, explain the relationship between customer type, gender, day and time slot for branch C. To study this relationship, consider all the interaction terms of degree 2 (see the ```interaction_only``` parameter in [```PolynomialFeatures```](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)).

    1. For ```Customer type = Normal``` customers, plot the parameter values for all attributes (and their 2nd-order interactions).
    2. Which attributes are the most informative?

---

For what follows, use accuracy as one of your metrics of success and report its 95% confidence interval. Carry out any necessary hyperparameter tuning with pipelines. Choose the best CV strategy and report on the best hyperparameter settings.

6. Train a classifier to **predict the day of purchase** (Monday, Tuesday, etc.).

    1. Select at least 2 classifiers.
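
Comparing two classifiers on a 7-class problem under the same CV splits can be sketched as below. The data is synthetic, and the two models (logistic regression and a random forest) and the repeated stratified 5-fold scheme are arbitrary choices made for illustration, not recommendations.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 7-class problem standing in for the day-of-week target.
X, y = make_classification(
    n_samples=350,
    n_features=8,
    n_informative=5,
    n_classes=7,
    n_clusters_per_class=1,
    random_state=0,
)

# Two candidate classifiers, evaluated on identical splits.
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Stratification keeps the class proportions similar in every fold.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
    mean_acc = scores.mean()
    # Approximate 95% Student-t interval over the fold scores.
    low, high = stats.t.interval(0.95, len(scores) - 1, loc=mean_acc, scale=stats.sem(scores))
    results[name] = (mean_acc, low, high)
    print(f"{name}: accuracy = {mean_acc:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With seven roughly balanced classes, chance-level accuracy is about 1/7, which is a useful baseline when interpreting the reported intervals.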

---

# Submit Your Solution

Confirm that you've successfully completed the assignment.

Along with the Notebooks, include a PDF export of each notebook with your solutions.

```add``` and ```commit``` the final version of your work, and ```push``` your code to your GitHub repository.

Submit the URL of your GitHub Repository as your assignment submission on Canvas.

---