# Statistical Data Analysis

## Loading Modules

In [1]:
import pandas as pd
import numpy as np
from numpy.random import seed
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from scipy.stats import norm
from scipy.stats import t
from scipy.stats import ttest_ind_from_stats

## Loading Data

In [2]:
recipes_df=pd.read_csv('recipes_df.csv')
recipes_df.set_index('id', inplace=True)
reviews_df=pd.read_csv('reviews_df.csv')
reviews_df.set_index('recipe_id', inplace=True)
tags_matrix=pd.read_csv('tags_matrix.csv')
tags_matrix.set_index('id', inplace=True)
ingredients_matrix=pd.read_csv('ingredients_matrix.csv')
ingredients_matrix.set_index('id', inplace=True)

## T-tests of Recipe Attributes

After performing the EDA on the data we have the following findings: <br>
 1. The best ingredient is garlic. It has the average recipe ranking of 4.453
 2. The worst ingredient is baking powder. It has the average recipe ranking of 4.388
 3. The best tag is 'beijing'. It has the average recipe ranking of 5.000
 4. The worst tag is 'pressure-canning'. It has the average recipe ranking of 2.981
 5. Successful recipes on average take longer to make and include 10 or more steps
 6. In terms of nutritional value, successful recipes have more total fats, sugars and carbohydrates, but less sodium and saturated fats, and slightly less protein. 

Below we will perform t-tests to confirm which of our EDA findings are statistically significant. More specifically: <br><br>
1) We will perform a t-test on two independent samples: the recipes with a specific attribute and the recipes without it <br><br>

2) We will test the null hypothesis $H_0$ that the mean recipe rating for the two samples is identical.<br><br>
2) The alternative hypothesis $H_a$ would be that the means are different (e.g. the mean recipe rating is indeed affected by the specific recipe attribute). <br><br>
3) If the t-test results are statistically significant (e.g. p-value > $a, a$ = 0.05), then we will reject the $H_0$ and accept the $H_a$.

### Garlic

1. Define the function for a t-test and apply it

In [3]:
def myttest(x,y):
    mean_x = np.mean(x)
    std_x = np.std(x)
    n_x = len(x)
    mean_y = np.mean(y)
    std_y = np.std(y)
    n_y = len(y)
    return ttest_ind_from_stats(mean_x, std_x, n_x, mean_y, std_y, n_y)

2. Define the two samples for recipes with and without garlic

In [4]:
a = reviews_df.rating[reviews_df.index.isin(ingredients_matrix.index[ingredients_matrix.garlic==1])]
b = reviews_df.rating[reviews_df.index.isin(ingredients_matrix.index[ingredients_matrix.garlic==0])]

3. Calculate t-test statistic and p-value

In [5]:
myttest(a,b)

Ttest_indResult(statistic=8.580416896899402, pvalue=9.475783077096418e-18)

The p-value of 9.475783077096418e-18 is very small, so we can reject the $H_0$ and confirm that garlic in recipes contribute to a higher recipe rating

### Baking Powder

In [6]:
c = reviews_df.rating[reviews_df.index.isin(ingredients_matrix.index[ingredients_matrix['baking powder']==1])]
d = reviews_df.rating[reviews_df.index.isin(ingredients_matrix.index[ingredients_matrix['baking powder']==0])]

In [7]:
myttest(c,d)

Ttest_indResult(statistic=-13.271590757508033, pvalue=3.429445436924159e-40)

The p-value of 9.475783077096418e-18 is very small, so we can reject the $H_0$ and confirm that baking powder in recipes contribute to a lower recipe rating

### 'beijing' Recipe Tag

In [8]:
e = reviews_df.rating[reviews_df.index.isin(tags_matrix.index[tags_matrix[" 'beijing'"]==1])]
f = reviews_df.rating[reviews_df.index.isin(tags_matrix.index[tags_matrix[" 'beijing'"]==0])]
myttest(e,f)

Ttest_indResult(statistic=1.5223040043497773, pvalue=0.12793347240670874)

The p-value of 0.13 > $a$ = 0.05, so we can reject the $H_a$. This means that 'beijing' tag positive effect on the recipe rating is statistically not significant.

### 'pressure-canning' Recipe Tag

In [9]:
g = reviews_df.rating[reviews_df.index.isin(tags_matrix.index[tags_matrix[" 'pressure-canning'"]==1])]
h = reviews_df.rating[reviews_df.index.isin(tags_matrix.index[tags_matrix[" 'pressure-canning'"]==0])]
myttest(g,h)

Ttest_indResult(statistic=-6.309979344310214, pvalue=2.792736995754129e-10)

The p-value is very small, so we can reject the $H_o$. This means that 'pressure-canning' tag negative effect on the recipe rating is statistically significant.

### Time to make a recipe

Time to make the recipe is represented by the 'minutes' attribute of the recipes dataframe. We know from EDA that the higher rated recipes on average require more time to make than the lower rated recipes. For the t-test we will take a sample of recipes that take less than an hour to make and a sample of recipes that take more than an hour to make.

In [10]:
i = reviews_df.rating[reviews_df.index.isin(recipes_df.index[recipes_df.minutes <= 60])]
j = reviews_df.rating[reviews_df.index.isin(recipes_df.minutes[recipes_df.minutes > 60])]
myttest(i,j)

Ttest_indResult(statistic=1.2950403454784183, pvalue=0.195307054244819)

The p-value of 0.20 > $a$ = 0.05, so we can reject the $H_a$. This means that time to make a recipe does not impact the recipe rating significantly.

### Number of steps to make a recipe

The number of steps to make the recipe (or complexity of the recipe) is represented by the 'n_steps' attribute of the recipes dataframe. We know from EDA that the higher rated recipes have 10 or more steps. For the t-test we will take a sample of recipes with less than 10 steps and a sample of recipes with more than 10 steps.

In [11]:
k = reviews_df.rating[reviews_df.index.isin(recipes_df.index[recipes_df.n_steps < 10])]
l = reviews_df.rating[reviews_df.index.isin(recipes_df.minutes[recipes_df.n_steps >= 10])]
myttest(k,l)

Ttest_indResult(statistic=1.6296614415681554, pvalue=0.10317400160552949)

The p-value of 0.10 > $a$ = 0.05, so we can reject the $H_a$. This means that the number of steps over 10 does not impact the recipe rating significantly.

### Total Fats Content

We know from EDA that the 'Total Fat' attribute of the recipes dataframe has a posiive effect on the recipe rating.  For the t-test we will take a sample of recipes with total fat less than 25 steps and a sample of recipes with total fat over 25.

In [12]:
m = reviews_df.rating[reviews_df.index.isin(recipes_df.index[recipes_df['Total Fat'] < 25])]
n = reviews_df.rating[reviews_df.index.isin(recipes_df.minutes[recipes_df['Total Fat'] >= 25])]
myttest(m,n)

Ttest_indResult(statistic=1.6504603313321327, pvalue=0.09884978527865239)

The p-value of 0.10 > $a$ = 0.05, so we can reject the $H_a$. This means that the total fat content does not impact the recipe rating significantly.

### Sugar Content

We know from EDA that the sugar content has a posiive effect on the recipe rating.  For the t-test we will take a sample of recipes with sugar less than 30 and a sample of recipes with sugar over 30.

In [13]:
o = reviews_df.rating[reviews_df.index.isin(recipes_df.index[recipes_df['Sugars'] < 20])]
p = reviews_df.rating[reviews_df.index.isin(recipes_df.minutes[recipes_df['Sugars'] >= 20])]
myttest(o,p)

Ttest_indResult(statistic=0.14791774616466175, pvalue=0.8824077946155509)

The p-value of 0.88 > $a$ = 0.05, so we can reject the $H_a$. This means that the sugar content does not impact the recipe rating significantly.

### Sodium Content

We know from EDA that the sodium content has a negative effect on the recipe rating.  For the t-test we will take a sample of recipes with sugar less than 30 and a sample of recipes with sugar over 30.

In [14]:
q = reviews_df.rating[reviews_df.index.isin(recipes_df.index[recipes_df['Sodium'] < 10])]
r = reviews_df.rating[reviews_df.index.isin(recipes_df.minutes[recipes_df['Sodium'] >= 10])]
myttest(q,r)

Ttest_indResult(statistic=1.8366578448832098, pvalue=0.06626201500112164)

The p-value of 0.07 > $a$ = 0.05, so we can reject the $H_a$. This means that the sodium content does not impact the recipe rating significantly.

### Carbohydrates Content

We know from EDA that the carbs content has a negative effect on the recipe rating.  For the t-test we will take the carbs content threshold of 5 for the samples split.

In [15]:
q = reviews_df.rating[reviews_df.index.isin(recipes_df.index[recipes_df['Total Carbohydrate'] < 5])]
r = reviews_df.rating[reviews_df.index.isin(recipes_df.minutes[recipes_df['Total Carbohydrate'] >= 5])]
myttest(q,r)

Ttest_indResult(statistic=1.3984372828546618, pvalue=0.16198366772171705)

The p-value of 0.07 > $a$ = 0.05, so we can reject the $H_a$. This means that the sodium content does not impact the recipe rating significantly.

### Protein Content

We know from EDA that the protein content has a slightly negative effect on the recipe rating.  For the t-test we will take the carbs content threshold of 15 for the samples split.

In [16]:
s = reviews_df.rating[reviews_df.index.isin(recipes_df.index[recipes_df['Protein'] < 10])]
t = reviews_df.rating[reviews_df.index.isin(recipes_df.minutes[recipes_df['Protein'] >= 10])]
myttest(s,t)

Ttest_indResult(statistic=1.7088986054895212, pvalue=0.08747134962599898)

The p-value of 0.09 > $a$ = 0.05, so we can reject the $H_a$. This means that the sodium content does not impact the recipe rating significantly.

### Saturated Fats Content

We know from EDA that the saturated fats content has a negative effect on the recipe rating.  For the t-test we will take the saturated fats content threshold of 25 for the samples split.

In [17]:
z = reviews_df.rating[reviews_df.index.isin(recipes_df.index[recipes_df['Saturated Fat'] < 25])]
w = reviews_df.rating[reviews_df.index.isin(recipes_df.minutes[recipes_df['Saturated Fat'] >= 25])]
myttest(z,w)

Ttest_indResult(statistic=2.249021930838759, pvalue=0.02451185160533414)

The p-value of 0.02 < $a$ = 0.05, so we can reject the $H_0$. This means that the saturated fats content has a statistically significant negative effect on the recipe rating.

# Summary

To summarize our findings: <br>
 -  Garlic is the best ingredient
 -  Baking powder is the worst ingredient
 - 'pressure-canning' is indeed the worst recipe tag
 - Time and the number of steps to make the recipe do not significantly change the recipe rating
 - Saturated Fats content negatively affects the recipe rating