# AiCore Regression Project: Project Hawthorne

## Introduction
This project is a continuation of Project Hawthorne, for which I wrote a Python web scraper to collect drink recipes from user-submitted recipe sites. The ultimate aim of Project Hawthorne is to create a recipe-generator bot which will invent cocktail recipes and then publish them to user-submitted recipe sites.

For this regression project I applied supervised learning to solve a regression problem based on my own data. Several regression models were evaluated. The full evaluation process, including regularisation and hyperparameter tuning, is documented in the Jupyter notebooks in this repo. Results are displayed below.

## Project Brief & Deliverables

The project brief was to:
* Identify an industry-relevant prediction problem.
* Develop a solution to this problem.
* Present the results.

Deliverables were:
* A GitHub repo containing all code.
* A presentation in two parts:
    1. **Non-technical presentation** highlighting the problem and the solution at a high level. This part explained the results that were attained and how they will drive business value.
    2. **Technical presentation** giving details of the techniques applied during data processing and modelling.

## Planning
### Identifying the Problem

The key criteria for excellence in cocktail making have been widely discussed by professional bartenders and mixologists. Among these criteria are:
- An eye-catching or descriptive name.
- Quality ingredients in a unique formulation.
- A memorable story or interesting inspiration.
- The look.
- The purpose.

However, in the world of user-submitted recipe sites, the criteria may be quite different. I hypothesise that:
- The simplest recipes are the most popular.
- Drinks containing familiar ingredients are more popular than those with unusual ingredients.
- Alcoholic drinks are more popular than non-alcoholic drinks.

### Stakeholders

Stakeholders for this project include:
- **Myself.** I plan to create a recipe-generator bot which will invent and publish cocktail recipes. The insights generated in this regression analysis will be vital in creating popular recipes.
- **Drinks manufacturers.** Many drinks manufacturers provide recipes to inspire and encourage the customer to buy their products. Understanding the driving forces behind recipe popularity will aid their recipe writing and may boost sales.

### Defining Success

This project investigates the hypotheses:
- The simplest recipes are the most popular.
- Drinks containing familiar ingredients are more popular than those with unusual ingredients.
- Alcoholic drinks are more popular than non-alcoholic drinks.

Using regression analysis, I will investigate recipe complexity, degree of familiarity, and alcohol content, and the impact each of these has on recipe popularity as measured by star rating.

Success will be measured by:
- Identifying the most powerful of these influences on star rating.
- Selecting a model which can generalise well to unseen data.
- Being able to predict star rating of unseen data using recipe complexity, degree of familiarity, and alcohol content.


In [32]:
import data_cleaning
import random
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

random.seed(2021)


# Importing the data
Recipe data is stored in a Postgres database on AWS RDS. We can get the data as a pandas DataFrame:

In [33]:
df = data_cleaning.get_data()
df.head()

| id | name | url | description | star_rating | n_ratings | prep_time | ingredient_0 | ingredient_1 | ingredient_2 | ingredient_3 | ... | step_1 | step_2 | step_3 | step_4 | step_5 | step_6 | step_7 | step_8 | step_9 | step_10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Citrus and Mint Punch | http://allrecipes.co.uk/recipe/5711/citrus-and... | On a hot summer day, nothing hits the spot qui... | 5.0 | 5.0 | 20 min | 600ml (1 pint) boiling water | 12 sprigs fresh mint | 4 ordinary tea bags | 200g (7 oz) caster sugar, or to taste | ... | Add the sugar and stir until it has dissolved.... | Serve over ice cubes, garnished with mint leav... |  |  |  |  |  |  |  |  |
| 2 | Coconut, mango and pineapple smoothie | http://allrecipes.co.uk/recipe/25444/coconut--... | This creamy fruit smoothie transports you to t... | 4.5 | 10.0 | 5 min | 1 ripe mango, cubed | 1 small banana | 150ml coconut milk | 100g pineapple pieces | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | Groovy green smoothie | http://allrecipes.co.uk/recipe/5115/groovy-gre... | A great way to get your little ones to eat spi... | 4.5 | 409.0 | 10 min | 1 banana, sliced | 150g (5 oz) green grapes | 1 (200g) tub vanilla yoghurt | 1/2 apple, cored and chopped | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | Banana blast | http://allrecipes.co.uk/recipe/878/banana-blas... | This banana smoothie is a lovely drink on a ni... | 4.5 | 214.0 | 5 min | 2 bananas | 225ml (8 fl oz) semi-skimmed milk | 4 tablespoons water | 2 tablespoons brown sugar | ... |  |  |  |  |  |  |  |  |  |  |
| 5 | B and L's Strawberry Smoothie | http://allrecipes.co.uk/recipe/819/b-and-l-s-s... | This icy cold strawberry smoothie is healthy, ... | 4.5 | 1059.0 | 5 min | 8 strawberries, hulled | 110ml (4 fl oz) skimmed milk | 120g (4 oz) low-fat plain yoghurt | 3 tablespoons demerara sugar | ... |  |  |  |  |  |  |  |  |  |  |
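The body of `data_cleaning.get_data()` is not shown here. As a rough idea of what such a helper might do, the sketch below reads a table into a DataFrame indexed by `id` using `pandas.read_sql_query`. An in-memory SQLite database stands in for the Postgres instance, and the table name `recipes`, the reduced column set, and the function name `get_data_sketch` are all assumptions made for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the real database: an in-memory SQLite table.
# The table name "recipes" and the reduced column set are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recipes (id INTEGER PRIMARY KEY, name TEXT, star_rating REAL)"
)
conn.executemany(
    "INSERT INTO recipes (id, name, star_rating) VALUES (?, ?, ?)",
    [(1, "Citrus and Mint Punch", 5.0), (4, "Banana blast", 4.5)],
)
conn.commit()


def get_data_sketch(connection):
    """Read the whole (hypothetical) recipes table into a DataFrame indexed by id."""
    return pd.read_sql_query("SELECT * FROM recipes", connection, index_col="id")


toy_df = get_data_sketch(conn)
print(toy_df.head())
```

With a real Postgres database the connection object would come from a database driver or SQLAlchemy engine instead of `sqlite3`, but the `read_sql_query` call would look much the same.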


# Hypothesis 1: The simplest recipes are the most popular.



To test this hypothesis, we compute the number of ingredients and the number of steps in each recipe. These counts will be used as features in the regression analysis.

## Data cleaning
The first step is to compute the number of ingredients and steps in each recipe.

In [34]:
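# Count the non-null ingredient columns (ingredient_0, ingredient_1, ...) in each row
ingredient_df = df.filter(regex=r"^ingredient_\d+$", axis=1)
df['n_ingredients'] = ingredient_df.notnull().sum(axis=1)

# Count the non-null method columns (step_0, step_1, ...) in each row
method_df = df.filter(regex=r"^step_\d+$", axis=1)
df['n_steps'] = method_df.notnull().sum(axis=1)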

simple_df = df[['name', 'star_rating', 'n_ingredients', 'n_steps']]
simple_df.head()

| id | name | star_rating | n_ingredients | n_steps |
|---|---|---|---|---|
| 1 | Citrus and Mint Punch | 5.0 | 10 | 3 |
| 2 | Coconut, mango and pineapple smoothie | 4.5 | 5 | 1 |
| 3 | Groovy green smoothie | 4.5 | 5 | 1 |
| 4 | Banana blast | 4.5 | 5 | 1 |
| 5 | B and L's Strawberry Smoothie | 4.5 | 6 | 1 |


In [35]:
y = simple_df[['star_rating']]
X = simple_df.drop(['star_rating', 'name'], axis=1)

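# Hold out 20% for testing, then 20% of the remainder for validation.
# random_state fixes the shuffle; random.seed() alone does not affect scikit-learn.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2021)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=2021)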
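Calling `train_test_split` twice with `test_size=0.2` does not give an 80/20 split overall, because the second call takes its 20% from the rows left over after the first. The sketch below checks the resulting proportions on a toy array of 100 rows. The toy data and the `random_state=0` value are illustrative assumptions, not part of the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with two features, standing in for the real X and y.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# First split: hold out 20% of all rows as the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
# Second split: hold out 20% of the remaining 80 rows as the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.2, random_state=0)

split_sizes = {"train": len(X_tr), "validation": len(X_va), "test": len(X_te)}
print(split_sizes)
```

On 100 rows this leaves 64 for training, 16 for validation and 20 for testing, so the model is fitted on roughly 64% of the data.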

## Setting the Baseline

We are about to compare multiple regression models on how well they fit the training set and generalise to the validation set. To give that comparison a reference point, we first fit a simple model to use as our performance baseline.

### Linear Regression


In [36]:
linear_regression_model = LinearRegression()
linear_regression_model.fit(X_train, y_train)

print(f'Score on the training set is: {linear_regression_model.score(X_train, y_train)}')
print(f'Score on the validation set is: {linear_regression_model.score(X_val, y_val)}')
print(f'Linear regression coefficients are: {linear_regression_model.coef_}')

Score on the training set is: 0.04843485033676265
Score on the validation set is: 0.0006717248343860449
Linear regression coefficients are: [[-0.00159365 -0.35852548]]


### What does this tell us?

On the training set, our model explains only 4.8% of the variation in star rating (R² ≈ 0.048). On the validation set it explains less than 0.1% (R² ≈ 0.0007), which is barely better than always predicting the mean rating.
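The score printed above is the coefficient of determination, R², which is what `LinearRegression.score` returns. To make the scale concrete, the sketch below computes R² by hand as 1 - SS_res / SS_tot and shows the two reference points: a predictor that always outputs the mean rating scores 0, and a perfect predictor scores 1. The ratings used are illustrative values, not project data, and `r_squared` is a helper written for this example.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model leaves unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Illustrative ratings, not taken from the project data.
ratings = np.array([5.0, 4.5, 4.5, 4.0, 3.5])

# A "model" that always predicts the mean rating explains none of the variation.
mean_prediction = np.full_like(ratings, ratings.mean())
r2_mean = r_squared(ratings, mean_prediction)

# A perfect model explains all of it.
r2_perfect = r_squared(ratings, ratings)

print(r2_mean, r2_perfect)
```

Read against these reference points, a validation score of about 0.0007 sits almost exactly at the mean-predictor level.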

Both coefficients are negative, so the fitted model associates more ingredients and more steps with a lower star rating. In other words, more complex recipes seem slightly less popular. Each additional step is associated with a drop of about 0.36 stars, whereas each additional ingredient is associated with a drop of only about 0.002 stars, so number of steps carries far more weight than number of ingredients. However, a coefficient measures the change in rating per unit of its feature, not the strength of the relationship. It is the very low R² on both sets that indicates the relationship between complexity and popularity is weak.
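Because a raw coefficient is expressed per unit of its feature, comparing magnitudes across features can mislead when the features have different spreads. One common remedy, which is not applied in the analysis above, is to standardise the features before fitting so that each coefficient is expressed per standard deviation. The sketch below illustrates this on synthetic data. The feature ranges, the true coefficients, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two features (not the project data):
# ingredient counts span a wider range than step counts.
n_ingredients = rng.integers(2, 15, size=200)
n_steps = rng.integers(1, 4, size=200)
X_syn = np.column_stack([n_ingredients, n_steps]).astype(float)
# Synthetic target built from assumed coefficients plus Gaussian noise.
y_syn = 4.5 - 0.05 * n_ingredients - 0.3 * n_steps + rng.normal(0, 0.2, size=200)

# Raw coefficients: change in rating per one extra ingredient or step.
raw_coef = LinearRegression().fit(X_syn, y_syn).coef_

# Standardised coefficients: change in rating per one standard deviation.
scaler = StandardScaler().fit(X_syn)
std_coef = LinearRegression().fit(scaler.transform(X_syn), y_syn).coef_

print("raw:", raw_coef)
print("standardised:", std_coef)
```

For ordinary least squares with an intercept, each standardised coefficient equals the raw coefficient multiplied by that feature's standard deviation, so a feature with a small raw coefficient but a wide spread can still matter.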