# Final project


You are a team of data scientists and ML engineers working for a recipe website.

Every day your website receives recipe submissions from users. The recipes are analyzed by a lab to ascertain their nutritional content. Then the recipe is posted on your website along with its nutrition information.  Your company makes money by running ads along with the recipes.

Recently, ad space on high-protein recipes has been in high demand, so the company is interested in maximizing the use of the high-protein recipes that users submit. The problem is that the lab generally can't analyze recipes as quickly as they are submitted, so its queue is always full.

The good news is that the lab will let you push recipes to the front of the queue so that they will get processed sooner.  If you suspect a recipe will, after analysis, be labeled "high protein", then it would be a good idea to push it to the front of the queue.

Your team's job is to build a model that predicts whether a recipe is high protein based on features of the recipe (e.g., the ingredients). Each new recipe will be presented to your model. If your model predicts that it will be high protein, then that recipe will be pushed to the front of the lab's queue. Otherwise it will be pushed to the back of the lab's queue.

# Data

The data set is available under Module 8 as `final_data_set.csv`. The data are from recipes that have already been analyzed by the lab. The columns labeled `calories` and `protein` are the lab's output. We'll say that a recipe is "high protein" if `protein / calories > .10`.  All of the other columns are features of the recipe.
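
The labeling rule above can be sketched in pandas as follows. This is a minimal illustration, not a required implementation: the helper name `add_high_protein_label`, the output column name `high_protein`, and the tiny example frame are made up for the example. Labeling rows with zero calories as not high protein is also an assumption, since the rule above does not say how to handle them.

```python
import pandas as pd

HIGH_PROTEIN_THRESHOLD = 0.10  # from the rule: protein / calories > .10


def add_high_protein_label(df):
    """Return a copy of df with a 0/1 column `high_protein` (hypothetical name)."""
    out = df.copy()
    ratio = out["protein"] / out["calories"]
    # Assumption: rows with zero calories have no meaningful ratio, so label them 0.
    has_calories = out["calories"] > 0
    out["high_protein"] = ((ratio > HIGH_PROTEIN_THRESHOLD) & has_calories).astype(int)
    return out


# Made-up rows standing in for the real data set.
example = pd.DataFrame({
    "calories": [200.0, 500.0, 0.0],
    "protein": [30.0, 20.0, 0.0],
})
labeled = add_high_protein_label(example)
print(labeled["high_protein"].tolist())  # [1, 0, 0]
```

With the real file, the same helper could be applied to `pd.read_csv("final_data_set.csv")`.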

The ordering of the rows is immaterial.

# Model

Your model should predict whether a recipe is high protein based on the recipe's features.


As you develop your model, be sure to justify each decision you make.

# Dollars

Predictions are great -- as a starting point. But no business ever got to IPO by saying, "We have the best mean-squared-error in our industry!" Instead, businesses survive and thrive by earning money.

Since your team is part of a business, you'll need to understand the dollar value of your model: How much more money could your company earn if they deployed your model?

Fortunately, you have a simulator that can turn model predictions into ad revenue estimates. The simulator simulates
- the users' recipe submissions
- pushing recipes onto the front or back of the lab's queue based on your model's predictions
- the lab's daily recipe analysis
- daily ad revenue from high- and low-protein ads

The simulator's not perfect, but it can be used to evaluate your model in dollar terms. Doing this can help compare the value of your model to the value of other teams' efforts to improve the business -- whether they are building models, writing code, selling ads, designing web pages, or doing whatever else makes your business work.

The input to the simulator is `hp_predictions_and_actuals`, a list of tuples `(prediction, actual)`. `actual` should be 1 if the recipe is high protein and 0 if the recipe is not high protein. `prediction` should be your model's prediction of `actual`, based on a recipe's features. (N.B.: `simulate()` doesn't use the recipe's features.)

`hp_predictions_and_actuals` might look like:
```
hp_predictions_and_actuals = [
    (1, 1),  # predict high-protein, actually high-protein
    (0, 1),  # predict low-protein, actually high-protein
    (0, 0),  # predict low-protein, actually low-protein
    (1, 1),  # predict high-protein, actually high-protein
    .
    .
    .
]
```

You could construct `hp_predictions_and_actuals` with `hp_predictions_and_actuals = list(zip(predictions, actuals))` if you had a list (or an ndarray or a Pandas Series) of predictions and another of actuals.
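
Because `simulate()` tests `p == 1`, each prediction has to be exactly 0 or 1. If a model produces probabilities instead, they need to be thresholded before zipping. The sketch below illustrates one way to do that with NumPy. The helper name `make_pairs`, the example values, and the 0.5 cutoff are assumptions chosen for illustration, not part of the assignment.

```python
import numpy as np


def make_pairs(probabilities, actuals, cutoff=0.5):
    """Turn predicted probabilities into (prediction, actual) tuples of plain ints."""
    predictions = (np.asarray(probabilities) >= cutoff).astype(int)
    actuals = np.asarray(actuals).astype(int)
    if len(predictions) != len(actuals):
        raise ValueError("need exactly one prediction per actual")
    # int() converts NumPy integers to built-in ints.
    return [(int(p), int(a)) for p, a in zip(predictions, actuals)]


probabilities = [0.9, 0.2, 0.4, 0.7]  # made-up model outputs
actuals = [1, 1, 0, 1]                # made-up lab results
hp_predictions_and_actuals = make_pairs(probabilities, actuals)
print(hp_predictions_and_actuals)  # [(1, 1), (0, 1), (0, 0), (1, 1)]
```

The cutoff is a modeling decision in its own right, so 0.5 should be treated as a starting point rather than a recommendation.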

To use `simulate()`, you'll probably create one entry in `hp_predictions_and_actuals` for each row in the data frame, then call `simulate(hp_predictions_and_actuals)`. This call will return an estimate of one year's revenue. Please see the comments in `simulate()` for more information.

Evaluate your model in dollar terms using `simulate()`. `simulate()`'s output is a random variable, so please report the standard error of your model's dollar value, too.
- How much better is your model than random guessing?
- How much worse is your model than the best possible model? (Hint: What would be the predictions of the best possible model?)
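
One way to report a dollar value with its standard error is to call the simulator repeatedly and summarize the results. The sketch below is an illustration under stated assumptions, not a required method: the helper names `estimate_revenue` and `random_guess_pairs`, the choice of 30 runs, the made-up labels and predictions, and the stand-in `toy_simulate` (used only to keep the example self-contained) are all hypothetical. In the notebook you would pass the real `simulate` in place of `toy_simulate`.

```python
import random
import statistics


def estimate_revenue(simulate_fn, make_pairs, n_runs=30):
    """Return (mean, standard error of the mean) over n_runs simulator calls."""
    # make_pairs is called once per run so that randomized predictions are redrawn.
    revenues = [simulate_fn(make_pairs()) for _ in range(n_runs)]
    mean = statistics.mean(revenues)
    std_err = statistics.stdev(revenues) / n_runs ** 0.5
    return mean, std_err


def random_guess_pairs(actuals, rng):
    """Pair each actual with a coin-flip prediction (random-guessing baseline)."""
    return [(rng.randint(0, 1), a) for a in actuals]


def toy_simulate(pairs):
    """Hypothetical stand-in for simulate(); NOT the course simulator."""
    # Counts correctly flagged high-protein recipes and adds noise so runs differ.
    hits = sum(1 for p, a in pairs if p == 1 and a == 1)
    return hits + random.gauss(0, 1)


rng = random.Random(0)
actuals = [1, 0, 0, 1, 0, 1, 0, 0]      # made-up labels
predictions = [1, 0, 1, 1, 0, 0, 0, 0]  # made-up model output
model_pairs = list(zip(predictions, actuals))

model_mean, model_se = estimate_revenue(toy_simulate, lambda: model_pairs)
guess_mean, guess_se = estimate_revenue(
    toy_simulate, lambda: random_guess_pairs(actuals, rng)
)
lift = model_mean - guess_mean
lift_se = (model_se ** 2 + guess_se ** 2) ** 0.5  # assumes independent runs

print(f"model:  {model_mean:.2f} +/- {model_se:.2f}")
print(f"random: {guess_mean:.2f} +/- {guess_se:.2f}")
print(f"lift:   {lift:.2f} +/- {lift_se:.2f}")
```

Other reference predictions can be evaluated the same way by passing a different `make_pairs` callable.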

```
import random

import numpy as np


def simulate(hp_predictions_and_actuals):
    """Estimate one year's ad revenue from a list of (prediction, actual) tuples."""
    high_protein_ad_revenue = 1
    low_protein_ad_revenue = .25

    ad_revenue = 0
    analysis_queue = []
    active_recipes = []
    i_data = 0
    for _ in range(365):  # for one year
        # 50-99 recipe submissions/day (randint's upper bound is exclusive)
        num_submissions = np.random.randint(50, 100)
        for __ in range(num_submissions):
            if i_data < len(hp_predictions_and_actuals):
                p, a = hp_predictions_and_actuals[i_data]
                if p == 1:  # go to the front of the queue
                    analysis_queue.insert(0, a)
                else:  # go to the back of the queue
                    analysis_queue.append(a)
                i_data += 1

        # can analyze only 25 recipes/day
        for __ in range(25):
            if len(analysis_queue) == 0:
                break
            actual = analysis_queue.pop(0)
            active_recipes.append(actual)

        # run 500-999 ads/day (randint's upper bound is exclusive)
        num_ads_today = np.random.randint(500, 1000)
        for a in random.choices(active_recipes, k=num_ads_today):
            # a==1 if recipe is high protein
            if a == 1:
                ad_revenue += high_protein_ad_revenue
            else:
                ad_revenue += low_protein_ad_revenue

    return ad_revenue
```

# Presentation

Your team will give a 20-minute presentation in class. (Note: The syllabus said 30 minutes. Make it 20.)

Your presentation should explain your process of building and evaluating your model, inform us of the expected dollar impact on the business, and provide justification for your claims.

# Model evaluation

Your model will be evaluated on a held-out (and hidden) set of data using `simulate()`.

For this purpose, please supply a Python function that implements your model with this signature:
```
def predict(df):
```
`predict()` should return a list (or ndarray or Pandas Series) of predictions, one for each row of the input dataframe, `df`.  `df` will look just like the dataframe in `final_data_set.csv`, except the columns `calories` and `protein` will be absent.
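
Before submitting, it may help to confirm that `predict()` honors this contract. The sketch below is a hypothetical sanity check, not part of the grading procedure: `check_predict` is a made-up helper, the placeholder `predict()` always answers 0 and merely stands in for a real model, and the feature column `num_ingredients` is invented for the example.

```python
import pandas as pd


def predict(df):
    """Placeholder model: predicts 'not high protein' for every row."""
    return [0] * len(df)


def check_predict(predict_fn, df):
    """Raise AssertionError if predict_fn breaks the expected contract."""
    # The held-out data will not contain the lab's output columns.
    features = df.drop(columns=["calories", "protein"], errors="ignore")
    predictions = list(predict_fn(features))
    assert len(predictions) == len(features), "need one prediction per row"
    assert all(p in (0, 1) for p in predictions), "predictions must be 0 or 1"
    return predictions


example = pd.DataFrame({
    "num_ingredients": [5, 12, 8],  # invented feature column
    "calories": [200, 500, 350],
    "protein": [30, 20, 10],
})
print(check_predict(predict, example))  # [0, 0, 0]
```

Dropping `calories` and `protein` before calling the model guards against accidentally relying on columns that will be absent at evaluation time.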

There are two parts to the evaluation:

1. Does your model do better than random guessing?
2. Does your model do better than the other team's model?

For part (2): I will run an A/B test to compare the two teams' models on `simulate()`'s output on the held-out data set. The team whose model performs better wins. Each member of the winning team gets to drop their lowest homework grade.
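
The details of the A/B test are not specified here. Purely as an illustration of one possibility, two models could be compared by running the simulator several times for each and computing a two-sample z statistic on the mean revenues. Everything in the sketch below is an assumption made for the example: the helper name `compare_models`, the choice of 30 runs, the stand-in `toy_simulate`, and the made-up predictions.

```python
import math
import random
import statistics


def compare_models(simulate_fn, pairs_a, pairs_b, n_runs=30):
    """Return (mean revenue difference A - B, z statistic) over repeated runs."""
    rev_a = [simulate_fn(pairs_a) for _ in range(n_runs)]
    rev_b = [simulate_fn(pairs_b) for _ in range(n_runs)]
    diff = statistics.mean(rev_a) - statistics.mean(rev_b)
    # Standard error of the difference between two independent sample means.
    se = math.sqrt((statistics.variance(rev_a) + statistics.variance(rev_b)) / n_runs)
    if se == 0:
        # No run-to-run variation: the z statistic is degenerate.
        z = 0.0 if diff == 0 else math.copysign(math.inf, diff)
    else:
        z = diff / se
    return diff, z


def toy_simulate(pairs):
    """Hypothetical stand-in for simulate(); NOT the course simulator."""
    hits = sum(1 for p, a in pairs if p == 1 and a == 1)
    return hits + random.gauss(0, 1)


random.seed(0)
actuals = [1, 0, 0, 1, 0, 1, 0, 0]                           # made-up labels
team_a_pairs = list(zip([1, 0, 0, 1, 0, 1, 0, 1], actuals))  # made-up predictions
team_b_pairs = list(zip([1, 1, 0, 0, 0, 0, 0, 0], actuals))  # made-up predictions

diff, z = compare_models(toy_simulate, team_a_pairs, team_b_pairs)
print(f"difference (A - B): {diff:.2f}, z = {z:.2f}")
```

Under the usual normal approximation, an absolute z value above about 1.96 corresponds to a difference that is significant at the 5% level in a two-sided test.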

# Tips

Some techniques that might be helpful:

- linear regression
- logistic / GLM regression
- hypothesis testing
- regularization, including hyperparameter tuning
- bootstrap sampling
- data transformations / feature engineering
- ask your instructor for clarification when you think you need it

You might not need all of them. You might also find techniques helpful that are not listed here.



# Notes

- Each team should submit a single Jupyter notebook.
- The notebook must contain the `predict()` function.
- There will be two teams, one with 3 students and one with 4 students.