# Technical Notebook

___

### Goal: Manage ad placement on social recipe app Spoonacular
We aim to target ads to Spoonacular app users more effectively. To do so, we predict which recipes will be popular, and therefore whether to place an ad on a recipe's webpage.

### Data: [Spoonacular API](https://spoonacular.com/food-api/docs)
We retrieved 1000 *unique* recipes through the Spoonacular API. Each recipe contains extensive information, from nutritional content to dietary classifications, along with a number of "Likes". In lieu of webpage traffic, we use Likes as our dependent variable. From the recipe information, we developed or reused a total of 35 numerical predictors (independent variables) of Likes. The numerical data is standardized below.

As a note for future work, natural language processing may be useful for clustering text such as ingredient types, for example to predict the most visited recipe webpages with respect to the use of "avocado".

### Model: Binary Logistic Regression
#### Categorization of "Likes"
We use the median number of Likes as the cutoff between well-liked recipes and those receiving few Likes. This helps ensure that our model trains on balanced numbers of well-liked and lesser-liked recipes.
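
The median cutoff can be sketched in a few lines. This is only an illustration: `binarize_at_median` and the toy Like counts are hypothetical, and the project's own `categorize_likes` helper may differ (for example, in how it treats values equal to the median).

```python
import statistics

def binarize_at_median(likes):
    """Label each value 1 if it is strictly above the median, else 0."""
    cutoff = statistics.median(likes)
    return [1 if value > cutoff else 0 for value in likes]

# Made-up Like counts, not taken from the dataset
toy_likes = [3, 10, 250, 4000]
labels = binarize_at_median(toy_likes)
print(labels)
```

When many recipes share the median value, a strict cutoff like this one can leave the two classes slightly uneven, so checking the class counts afterwards is a reasonable safeguard.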
#### Assumptions
- Dependent variable (Likes) is binary.
- Large sample size. 
 - Our data fulfills the general guideline of at least 10 cases of the least frequent outcome per independent variable. Here p(outcome) = 0.5, so of 1000 total cases, 500 belong to each outcome, and we *could in some cases* use a maximum of 50 independent variables in the model.
- Observations (recipes) are independent of each other. 
 - While recipes often reference one another (e.g. a frosting recipe within a cake recipe), we do not observe this on Spoonacular.
- Little or no multicollinearity among the independent variables. 
 - Macronutrients and micronutrients are likely to be highly collinear, so we test for this.
- Linearity of independent variables and log odds. 
 - In our [exploratory data analysis](https://github.com/alexwcheng/recipe-strategy/blob/master/Logistic_Regression_Final/Exploratory%20Data%20Analysis.ipynb), we see that the dependent variable, Likes, has an exponential distribution. While a linear relationship exists between the log odds and some predictors, this does not hold for the majority of them.
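
The sample-size guideline in the assumptions above reduces to simple arithmetic. A minimal sketch, where the function name `max_predictors` and its parameters are hypothetical:

```python
def max_predictors(n_cases, minority_share, events_per_variable=10):
    """Upper bound on predictors under the events-per-variable rule of thumb."""
    # Number of cases with the least frequent outcome
    minority_cases = int(n_cases * minority_share)
    return minority_cases // events_per_variable

# 1000 recipes with a 50/50 outcome split, 10 cases per predictor
limit = max_predictors(n_cases=1000, minority_share=0.5)
print(limit)
```

With 35 predictors in use, the model stays under this bound.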

___

## Import Packages, Data, and Functions

In [1]:
import sys
sys.path.append("..")

In [2]:
# Import necessary python packages and functions
from Python_Files.imports import *
%matplotlib inline

In [3]:
%run ../Python_Files/max_range.py

In [4]:
# Read in data
df = pd.read_csv('../Data/Recipes_raw.csv', index_col=0)

In [5]:
# Ignore unnecessary warnings
import warnings
warnings.filterwarnings('ignore')

___

## Prepare Data

Scale numerical predictors using StandardScaler and copy non-numerical predictors from original dataframe.

In [6]:
from Python_Files.data_cleaning import *
#scale_num_vars, categorize_likes, produce_roc_curve

In [7]:
# Create new dataframe containing both scaled and unscaled predictors
df_ss = scale_num_vars(df)
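
For reference, standard scaling subtracts each column's mean and divides by its population standard deviation, which is what `StandardScaler` computes per column. The sketch below shows that calculation on one made-up column; `standardize` is a hypothetical helper, and the internals of the project's `scale_num_vars` are not shown in this notebook.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population std (ddof=0)
    return [(value - mean) / std for value in values]

# Made-up calorie values, not taken from the dataset
toy_calories = [120.0, 450.0, 300.0, 610.0]
scaled = standardize(toy_calories)
print(scaled)
```

After scaling, the column has mean 0 and standard deviation 1, which puts predictors measured in different units on a comparable footing.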

Categorize Likes as a binary outcome, using the median as the cutoff so that both classes are equally represented in training.

In [8]:
# Create new column for outcome
df_ss['high_likes'] = categorize_likes(df_ss, 'aggregateLikes')
# Ensure proportional binary outcome
df_ss['high_likes'].value_counts()

1    500
0    500
Name: high_likes, dtype: int64

In [9]:
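# Separate the outcome from the predictors
y = df_ss['high_likes']
# Drop the outcome, the text columns, and the predictors excluded from the model.
# aggregateLikes (the column high_likes is derived from) is not dropped here.
X = df_ss.drop(columns=['high_likes', 'num_steps_instructions', 'cookingMinutes', 'preparationMinutes', 'ingredients_list', 'ingredient_types', 'title', 'spoonacularSourceUrl'])
X.info()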

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 35 columns):
num_ingredients             1000 non-null float64
pricePerServing             1000 non-null float64
readyInMinutes              1000 non-null float64
servings                    1000 non-null float64
weightWatcherSmartPoints    1000 non-null float64
Calories                    1000 non-null float64
Fat                         1000 non-null float64
Saturated_Fat               1000 non-null float64
Carbohydrates               1000 non-null float64
Sugar                       1000 non-null float64
Cholesterol                 1000 non-null float64
Sodium                      1000 non-null float64
Protein                     1000 non-null float64
Vitamin_K                   1000 non-null float64
Vitamin_A                   1000 non-null float64
Vitamin_C                   1000 non-null float64
Manganese                   1000 non-null float64
Folate                      1000 non-null fl

Note: The Statsmodels library raised the error "PerfectSeparationError: Perfect separation detected, results not available." The coefficient of the predictor "num_steps_instructions" could not be estimated, so we removed that predictor.
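
Perfect separation means that some predictor splits the two outcomes without error. In that situation the logistic log-likelihood keeps rising as the coefficient grows, so no finite maximum-likelihood estimate exists. A minimal sketch on made-up, perfectly separated data, assuming a one-predictor model with no intercept (the function and data are hypothetical):

```python
import math

def log_likelihood(beta, xs, ys):
    """Log-likelihood of a one-predictor logistic model without intercept."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = beta * x
        if y == 1:
            total += -math.log1p(math.exp(-z))  # log(sigmoid(z))
        else:
            total += -math.log1p(math.exp(z))   # log(1 - sigmoid(z))
    return total

# Toy data: every negative x is class 0 and every positive x is class 1
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

betas = [1, 5, 25, 125]
lls = [log_likelihood(beta, xs, ys) for beta in betas]
for beta, ll in zip(betas, lls):
    print(beta, ll)
```

The log-likelihood approaches 0 from below as the coefficient increases and never peaks, which is why the optimizer cannot return a finite coefficient.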

In [10]:
# The default test_size of 0.25 gives a 750/250 split; stratify keeps both classes balanced
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2020, stratify=y)

In [11]:
X_train.head()

Unnamed: 0,num_ingredients,pricePerServing,readyInMinutes,servings,weightWatcherSmartPoints,Calories,Fat,Saturated_Fat,Carbohydrates,Sugar,Cholesterol,Sodium,Protein,Vitamin_K,Vitamin_A,Vitamin_C,Manganese,Folate,Fiber,Copper,Magnesium,Phosphorus,Vitamin_B6,Potassium,Vitamin_B1,Iron,Vitamin_B2,Vitamin_E,Zinc,Vitamin_B5,Vitamin_B3,Calcium,Selenium,num_words_instructions,aggregateLikes
558,-0.756978,0.988085,-0.290526,0.024401,-0.666891,-0.496993,-0.267642,-0.088512,-1.522822,-0.774579,1.497412,-0.277766,1.034904,-0.801643,-0.520135,0.099482,-1.404039,-0.698207,-1.171685,-0.493402,-0.856788,0.218438,1.292988,-0.276896,0.092803,-0.420974,0.671428,-0.623324,-0.610651,0.653512,0.978889,-0.936896,1.645923,-1.309141,2941
391,0.644314,-0.166652,0.17518,0.824448,0.238227,-0.181521,-0.149239,-0.287632,0.224522,0.190769,-0.406466,-0.055959,-0.75837,0.364664,-0.192068,0.8606,-0.706732,-0.366194,-0.3084,-0.170389,-0.520102,-0.571703,0.180163,0.489292,-0.407484,-0.135234,-0.636832,-0.032592,-0.589552,-0.369726,-0.326463,-0.557071,-0.652058,0.403409,4105
520,-0.756978,1.411525,0.17518,0.024401,0.057203,0.653958,0.345171,0.56174,-1.582657,-0.958958,3.559123,-0.041057,3.668923,-0.624947,-0.206755,-0.406721,-1.453847,-1.136354,-1.174894,-0.72945,-0.50828,2.031707,2.674427,0.129457,3.094526,-0.248339,0.438849,-0.537996,2.089974,0.322645,1.564098,-0.132102,2.593715,-0.373919,3017
574,-0.476719,-0.928668,-0.290526,0.024401,-0.485867,-0.750721,-0.890055,-0.692095,-0.079599,-0.90611,-0.406466,-0.479082,-0.613523,0.54559,-0.163326,-0.227141,-0.358079,-0.787858,-0.983944,0.376248,-0.120419,-0.277441,-0.66405,-1.053334,-0.836301,-0.398651,-0.578687,-0.810389,0.486478,-0.627067,-0.282462,-0.533363,-0.648571,-0.14315,533
102,-0.196461,-0.429804,-0.383668,-0.375622,-0.304844,-0.613557,-0.214041,-0.315633,-0.546577,-0.893192,-0.406466,-0.479318,-0.803241,-0.321339,-0.601168,-0.714322,-0.544858,-0.148403,-0.289144,-0.505826,-0.26034,-0.560658,-0.625677,-0.668636,-0.657627,-0.269175,-0.49147,-0.176993,-0.406697,-0.179784,-0.474598,-0.929585,-0.606282,-0.689708,607


In [12]:
# Fit the scaler on the training set only, then apply the same transform to the test set
ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)
X_test_ss = ss.transform(X_test)

___

## SVM Classifier

In [13]:
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')

In [14]:
from sklearn.model_selection import cross_validate
cv = cross_validate(svclassifier,
                    X_train,
                    y_train,
                    cv=5,
                    scoring='roc_auc',
                    return_estimator=True,
                    return_train_score=True,
                    n_jobs=-1)

In [15]:
print(cv['train_score'])
print(cv['test_score'])

[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
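
The `roc_auc` score reported above can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below computes it that way in plain Python; the function name and toy scores are illustrative assumptions, and scikit-learn's own implementation is not reproduced here.

```python
def pairwise_roc_auc(scores, labels):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Made-up classifier scores
partial = pairwise_roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
perfect = pairwise_roc_auc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
print(partial, perfect)
```

Under this reading, a score of 1.0 on every fold means each well-liked recipe was ranked above each lesser-liked recipe in that fold.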


In [16]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Use the estimator fitted on the second cross-validation fold (index 1)
estimator = cv['estimator'][1]
y_pred = estimator.predict(X_test)

print(classification_report(y_test, y_pred))
print(f"The accuracy score is {accuracy_score(y_test, y_pred)}")

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       125
           1       1.00      1.00      1.00       125

    accuracy                           1.00       250
   macro avg       1.00      1.00      1.00       250
weighted avg       1.00      1.00      1.00       250

The accuracy score is 1.0


In [19]:
y_pred = y_pred.reshape(-1,1)
y_test = np.array(y_test).reshape(-1,1)

In [20]:
print(y_pred.shape)
print(y_test.shape)

(250, 1)
(250, 1)


In [21]:
np.unique(y_test, return_counts=True)

(array([0, 1]), array([125, 125]))

In [22]:
np.unique(y_pred, return_counts=True)

(array([0, 1]), array([125, 125]))

In [23]:
# Define the confusion matrix
conf_matrix = pd.DataFrame(confusion_matrix(y_test, y_pred),
                           index=['actual 0', 'actual 1'],
                           columns=['predicted 0', 'predicted 1'])

# Display the confusion matrix
conf_matrix

| | predicted 0 | predicted 1 |
|---|---|---|
| actual 0 | 125 | 0 |
| actual 1 | 0 | 125 |
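
Perfect scores on every fold and on the hold-out set are unusual for real data and can indicate that a predictor encodes the outcome. The `X_train.head()` output above includes `aggregateLikes`, the column from which `high_likes` was derived, so this may be worth checking. A minimal sketch of such a check, using a hypothetical helper `columns_that_separate_labels` and made-up toy values (not taken from the dataset):

```python
def columns_that_separate_labels(columns, labels):
    """Return names of columns whose values alone split the two classes."""
    flagged = []
    for name, values in columns.items():
        negatives = [v for v, y in zip(values, labels) if y == 0]
        positives = [v for v, y in zip(values, labels) if y == 1]
        # A single threshold separates the classes if their ranges do not overlap
        if max(negatives) < min(positives) or max(positives) < min(negatives):
            flagged.append(name)
    return flagged

# Toy feature columns and labels for illustration only
toy_columns = {
    "Calories": [0.3, -1.2, -0.8, 0.4],
    "aggregateLikes": [900, 12, 1500, 40],
}
toy_labels = [1, 0, 1, 0]

flagged = columns_that_separate_labels(toy_columns, toy_labels)
print(flagged)
```

If a check like this flags a column on the real data, refitting without that column would show whether the classifier still performs well on the remaining predictors.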


___

## K-Nearest Neighbors Classifier

___

## AdaBoost Classifier