# Fast-Food Restaurants and Nutrition

Having studied whether the nutritional properties of the average product per area are correlated with the well-being features, we now study the correlation between nutrient composition and the number of fast-food restaurants per area.

The fast-food dataset comes from the [London Datastore](https://data.london.gov.uk/), the same source as the well-being features.

In [None]:
# Imports

import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score
from IPython.display import display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

%matplotlib inline
sns.set_theme()

## I) Data Preparation 

### A) Data imports

In [None]:
# Data Imports

fast_food = pd.read_excel(
    "data/fast_food_ward.xlsx", sheet_name="Ward Data", header=[3], usecols="E,G")
display(fast_food.head())
print(fast_food.shape)

wellbeing_grocery = pd.read_pickle("data/wellbeing_grocery.pkl")
display(wellbeing_grocery.head())
print(wellbeing_grocery.shape)

grocery_analysis = pd.read_pickle("data/grocery_nutripoints.pkl")
display(grocery_analysis.head())
print(grocery_analysis.shape)

### B) Merging

We compare the number of rows in the grocery dataset and the fast-food dataset and count how many wards they have in common. We then merge the two datasets. We also check that there are no null or NA values.

In [None]:
# check if area id is a unique id
is_area_unique = grocery_analysis["area_id"].is_unique
print(is_area_unique)

In [None]:
nr_wards_grocery = len(set(grocery_analysis["area_id"].values))
print(nr_wards_grocery)

In [None]:
grocery_analysis.isna().any()

In [None]:
nr_wards_fast_food = len(set(fast_food["2015 Ward code"].values))
print(nr_wards_fast_food)

In [None]:
fast_food.isna().any()

In [None]:
fast_food.isnull().any()

In [None]:
# calculate number of rows both datasets have in common

nr_wards = len(set(fast_food["2015 Ward code"].values)
    & set(grocery_analysis["area_id"].values))

print(nr_wards)

We lose 70 values by merging.
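
To see which wards drop out, and not only how many, one option is an outer merge with `indicator=True`. The sketch below runs on two tiny hypothetical frames: the ward codes and values are invented for illustration, and only the column names `area_id`, `2015 Ward code`, `nutripoints` and `Count of outlets` come from this notebook.

```python
import pandas as pd

# Hypothetical toy data: ward codes and values are made up for illustration.
grocery_toy = pd.DataFrame(
    {"area_id": ["E05000026", "E05000027", "E05000028"],
     "nutripoints": [7.1, 6.8, 7.4]})
fast_food_toy = pd.DataFrame(
    {"2015 Ward code": ["E05000027", "E05000028", "E05000029"],
     "Count of outlets": [12, 3, 40]})

# An outer merge keeps every ward; `_merge` records where each row came from.
merged = pd.merge(
    left=grocery_toy, right=fast_food_toy,
    left_on="area_id", right_on="2015 Ward code",
    how="outer", indicator=True)

# Count wards present in both datasets, only in grocery, or only in fast food.
merge_counts = merged["_merge"].value_counts()
print(merge_counts)

# Wards that an inner merge would silently drop.
dropped = merged[merged["_merge"] != "both"]
print(dropped[["area_id", "2015 Ward code", "_merge"]])
```

Applied to the real frames, the `left_only` and `right_only` rows would list the wards lost on each side of the merge.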

In [None]:
fastfood_grocery = pd.merge(
    left=grocery_analysis, right=fast_food, left_on='area_id', right_on="2015 Ward code")
fastfood_grocery = fastfood_grocery.drop("2015 Ward code", axis=1)
display(fastfood_grocery.head())
print(fastfood_grocery.shape)

We also merge the well-being and fast-food datasets to check the correlation between the well-being features and the number of fast-food outlets. We have some prior assumptions about how they are linked, and confirming them would support the reliability of our data.

In [None]:
fastfood_wellbeing = pd.merge(
    left=wellbeing_grocery, right=fast_food, left_on='area_id', right_on="2015 Ward code")
fastfood_wellbeing = fastfood_wellbeing.drop("2015 Ward code", axis=1)
display(fastfood_wellbeing.head())
print(fastfood_wellbeing.shape)

We lose 33 values between the well-being and grocery datasets because `wellbeing_grocery` had already been merged with `year_grocery`, which left out the areas for which we had no well-being values.

## II) Understanding the data

### A) Distribution of the values: describe, boxplot, histogram

In [None]:
# Understanding better how the values are distributed
fastfood_grocery.describe()

In [None]:
fig = plt.figure(figsize=(4, 5))

sns.boxplot(y=fastfood_grocery["Count of outlets"])

There are many outliers in the number of outlets, reaching up to 147 whereas the median is 11. Because of these outliers, the mean is quite high even though 75% of the values lie between 1 and 18.
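
As a small illustration of why the mean sits above the median here, the sketch below applies the usual boxplot rule (points beyond 1.5 times the interquartile range from the quartiles are drawn as outliers) to a hypothetical series of outlet counts. The numbers are invented and only mimic the shape described above.

```python
import pandas as pd

# Hypothetical outlet counts: mostly small values plus a few very large ones.
outlets = pd.Series([1, 3, 5, 8, 10, 11, 12, 15, 18, 60, 147])

q1, q3 = outlets.quantile(0.25), outlets.quantile(0.75)
iqr = q3 - q1
upper_whisker_limit = q3 + 1.5 * iqr

# Points above this limit are the ones a boxplot draws as individual outliers.
outliers = outlets[outlets > upper_whisker_limit]

print(f"median = {outlets.median()}, mean = {outlets.mean():.1f}")
print(f"outliers above {upper_whisker_limit}: {outliers.tolist()}")
```

Two extreme values are enough to pull the mean far above the median, which is why the median is the more robust summary for this variable.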

In [None]:
# List of columns of interest in the nutritional dataset
COLUMNS_GROCERY = [
    'energy_fat',
    'energy_saturate',
    'energy_sugar',
    'energy_protein',
    'energy_carb',
    'energy_fibre',
    'energy_alcohol',
    'energy_tot',
    'h_nutrients_calories',
    'nutripoints',
    'Count of outlets'
]


# Selection of the numerical columns of interest in the fastfood_grocery dataset
fastfood_grocery_analysis = fastfood_grocery[COLUMNS_GROCERY].copy()

In [None]:
fig, ax = plt.subplots(4, 3, figsize=(16, 8), sharey=False)
axes = ax.flatten()

# One histogram per column of interest
for i, column in enumerate(COLUMNS_GROCERY):
    sbplt = axes[i]

    sns.histplot(data=fastfood_grocery_analysis[column], ax=sbplt)
    sbplt.set_xlabel('')
    sbplt.set_ylabel('')
    sbplt.set_title(column, wrap=True)

# Hide the axes left empty by the 4x3 grid (11 columns for 12 slots)
for unused in axes[len(COLUMNS_GROCERY):]:
    unused.set_visible(False)

fig.tight_layout()
fig.subplots_adjust(top=0.9)

fig.suptitle('Histogram of each column', fontsize=18)

Most of the nutritional variables appear to be roughly normally distributed. The number of outlets, on the other hand, is strongly right-skewed: most areas have between 5 and 10 fast-food outlets, but the count goes up to more than 140.
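
One common way to handle such a right-skewed count is a log transform. The sketch below is an assumption on our side and not part of the analysis: it draws hypothetical counts from a log-normal distribution and compares the skewness before and after applying `log1p`.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed counts drawn from a log-normal distribution.
rng = np.random.default_rng(seed=0)
counts = pd.Series(rng.lognormal(mean=2.3, sigma=0.9, size=500)).round()

# log1p(x) = log(1 + x) stays defined for wards with zero outlets.
log_counts = np.log1p(counts)

print(f"skewness before: {counts.skew():.2f}")
print(f"skewness after:  {log_counts.skew():.2f}")
```

Rank-based measures such as the Spearman coefficient are unaffected by this kind of monotonic transform, so it matters mainly for methods that assume a roughly symmetric variable.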

### B) Correlation between the number of fast foods and the different nutritional variables

In [None]:
# Heatmap to visualize the correlation between the variables
fig = plt.figure(figsize=(10, 6))
sns.heatmap(fastfood_grocery_analysis.corr())

The correlation with the number of outlets (the last column or row) seems really low, as the colours are mainly red, corresponding to values around 0. We display the correlation table for the count of outlets to better gauge the strength of its correlation with the nutritional variables.
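
Note that `DataFrame.corr()` computes the Pearson coefficient by default, whereas the table that follows uses `method="spearman"`. The two can differ when a relationship is monotonic but not linear, as this sketch on hypothetical data shows (the frame `toy` and its columns are invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows monotonically but not linearly with x.
toy = pd.DataFrame({"x": np.arange(1, 21, dtype=float)})
toy["y"] = np.exp(toy["x"] / 3)

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# Spearman is the Pearson correlation computed on the ranks of the values.
spearman_by_hand = toy.rank().corr(method="pearson").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because Spearman only looks at ranks, it is also less sensitive to the extreme outlet counts seen in the boxplot.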

In [None]:
correlation = fastfood_grocery_analysis.corr(method="spearman")
display(correlation[["Count of outlets"]])

In [None]:
plt.figure(figsize=(14, 3))
correlation["Count of outlets"].plot.bar(
    x=None, y=None, width=0.8, legend=None)
plt.ylabel("Spearman R")
plt.title("Correlation fast food outlets")
plt.show()

**The number of fast-food outlets is not correlated at all with the Nutripoints.**  
Furthermore, there is almost no correlation with the other variables. It is therefore hard to justify predicting the nutritional information of the average product of an area from its number of fast-food outlets.

We will nevertheless investigate this relationship further.

### C) Correlation between the number of fast food and the well-being measures

In [None]:
# List of columns of interest in the wellbeing dataset
COLUMNS_SCORES = [
    'Life Expectancy',
    'Childhood Obesity',
    'Incapacity Benefit rate',
    'Unemployment rate',
    'Crime rate - Index',
    'Deliberate Fires',
    'Average Capped GCSE and Equivalent Point Score Per Pupil',
    'Unauthorised Absence in All Schools (%)',
    'Dependent children in out-of-work families',
    'Public Transport Accessibility',
    'Homes with access to open space & nature, and % greenspace',
    'Subjective well-being average score',
    'Index Score 2013',
    'nutripoints',
    'Count of outlets'
]

# Selection of the numerical columns of interest in the wellbeing_grocery dataset
fastfood_wellbeing_analysis = fastfood_wellbeing[COLUMNS_SCORES].copy()

In [None]:
correlation = fastfood_wellbeing_analysis.corr(method="spearman")
display(correlation[["Count of outlets"]])

For the plot, we keep only the well-being variables whose Spearman coefficient with the number of outlets exceeds 0.2 in absolute value, together with the Nutripoints.
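
The selection could also be derived from the correlation table instead of being listed by hand. The following sketch uses a small hypothetical frame (all column names except `Count of outlets` are invented) and assumes the threshold is applied to the absolute value, so that strong negative correlations are kept.

```python
import pandas as pd

# Hypothetical data: column names and values are invented for illustration.
toy = pd.DataFrame({
    "positively_linked": [1, 3, 2, 4, 6, 5, 7, 8],
    "negatively_linked": [8, 7, 5, 6, 4, 2, 3, 1],
    "unrelated": [5, 2, 7, 4, 8, 1, 6, 3],
    "Count of outlets": [1, 2, 3, 4, 5, 6, 7, 8],
})

THRESHOLD = 0.2
spearman = toy.corr(method="spearman")["Count of outlets"]

# Use the absolute value so that strong negative correlations are kept too.
kept = spearman[spearman.abs() > THRESHOLD].index.tolist()
toy_selected = toy[kept]

print(spearman.round(3))
print(kept)
```

The target column always passes the filter because its correlation with itself is 1.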

In [None]:
# Columns excluded from the plot
COLUMNS_TO_DROP = [
    'Incapacity Benefit rate',
    'Deliberate Fires',
    'Average Capped GCSE and Equivalent Point Score Per Pupil',
    'Unauthorised Absence in All Schools (%)',
    'Dependent children in out-of-work families',
    'Homes with access to open space & nature, and % greenspace',
    'Subjective well-being average score',
]
fastfood_wellbeing_analysis = fastfood_wellbeing_analysis.drop(
    columns=COLUMNS_TO_DROP)
fastfood_wellbeing_analysis.to_pickle("plot_data/fastfood_wellbeing.pkl")

In [None]:
# Heatmap to visualize the correlation between the variables
fig = plt.figure(figsize=(10, 6))
sns.heatmap(fastfood_wellbeing_analysis.corr())

In [None]:
correlation_shorten = fastfood_wellbeing_analysis.corr(method="spearman")

In [None]:
plt.figure(figsize=(14, 3))
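# Correlation of each remaining variable with the count of outlets
# (selected by name; the count's correlation with itself is left out)
correlation_shorten["Count of outlets"].drop("Count of outlets").plot.bar(
    width=0.8)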
plt.ylabel("Spearman R")
plt.title("Correlation fast food outlets-wellbeing")
plt.show()

We naturally have some prior assumptions about how fast-food outlets relate to the well-being variables. Unsurprisingly, the number of outlets is well correlated with Public Transport Accessibility: good transport links make restaurants easier to reach and bring in more customers. Crime rate, on the other hand, should be low, since customers need to feel safe when visiting a restaurant; as we can see here, it is indeed negatively correlated with the number of fast-food outlets. However, we would have expected a positive correlation with Childhood Obesity and Unemployment rate, which the data do not show. Because fast food is cheap, we assumed that chains would target people looking for affordable comfort food and not wealthier customers. Moreover, fast food is mostly rich in fat and sugar, which contributes to obesity, so we expected a positive correlation with Childhood Obesity as well.

Some results are therefore predictable while others are not, which makes it difficult to draw a firm conclusion about the link between fast-food outlets and the Nutripoints of the average product of an area. We now run a regression analysis to see whether it is justifiable to predict Nutripoints from the number of fast-food outlets in an area.

## III) Regression Analysis

### A) Linear Regression of Nutripoints on the number of fast-food outlets

In [None]:
# Understanding the linear regression between the two variables
Y = fastfood_grocery[["nutripoints"]]
X1 = fastfood_grocery[["Count of outlets"]]
X = sm.add_constant(X1)  # adding a constant

model = sm.OLS(Y, X).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)

The R-squared is very small (0.005), meaning that the number of fast-food outlets explains only 0.5% of the variance in Nutripoints. Furthermore, its coefficient is negligible (-0.0137), so the intercept carries most of the weight in this regression. We can therefore conclude that linear regression is not suited to predicting the Nutripoints of an area from its number of fast-food outlets. This makes sense: since the two variables are essentially uncorrelated, it is difficult to predict one from the other.
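
To make the link between R-squared and correlation explicit: for a regression with a single predictor and an intercept, R-squared equals the squared Pearson correlation. The sketch below checks this on hypothetical data (the slope, intercept and noise level are arbitrary choices and are not fitted to our dataset), using `numpy` instead of `statsmodels` to keep it short.

```python
import numpy as np

# Hypothetical data: a weak linear effect buried in noise.
rng = np.random.default_rng(seed=42)
x = rng.uniform(0, 100, size=400)
y = 7.0 - 0.01 * x + rng.normal(scale=2.0, size=400)

# Ordinary least squares fit of y = intercept + slope * x.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x

# R-squared: share of the variance of y explained by the fitted line.
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# With a single predictor, R-squared equals the squared Pearson correlation.
pearson_r = np.corrcoef(x, y)[0, 1]
print(f"R-squared = {r_squared:.4f}, Pearson r squared = {pearson_r ** 2:.4f}")
```

By the same identity, an R-squared of 0.005 corresponds to a Pearson correlation of about 0.07 in absolute value.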

### B) Gradient boosting Regressor

We know that there is no correlation and that the linear regression is not representative at all. However, we still wanted to explore whether another type of regression, gradient boosting, would be more conclusive.

In [None]:
# train a gradient boosting regressor
gradboost = GradientBoostingRegressor()

In [None]:
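y = fastfood_grocery["nutripoints"]
# Use X1 (outlet counts only): the constant column added for OLS carries no information for a tree-based model
predicted_y = cross_val_predict(gradboost, X1, y, cv=5)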

In [None]:
# Plot the results
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(Y, predicted_y, edgecolors=(0, 0, 0))
ax.set_xlabel('Original')
ax.set_ylabel('Predicted')
plt.show()

If the predictions matched the Nutripoints, the points would lie along a diagonal line, which is not what we observe at all. For a given Nutripoints value, the value predicted by the gradient boosting regressor can be well above or well below the original one.

To confirm this observation, we compute the mean squared error and the R-squared value:

In [None]:
r2 = r2_score(Y, predicted_y)
mse = mean_squared_error(Y, predicted_y)
print(r2, mse)

The R-squared is negative, which means that the model does worse than a horizontal line at the mean Nutripoints value. Furthermore, the mean squared error is very high. This regression is therefore not representative of our Nutripoints variable either.
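
A negative R-squared can look surprising, so here is a minimal sketch of the definition used by `r2_score`, namely 1 - SS_res / SS_tot. The helper `r2` and the numbers are hypothetical and written only for this illustration.

```python
import numpy as np


def r2(y_true, y_pred):
    """R-squared as 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


# Hypothetical target values, invented for illustration.
y_true = np.array([6.0, 7.0, 8.0, 7.0, 6.0, 8.0])

# Baseline: always predict the mean (a horizontal line) -> R-squared of 0.
r2_mean = r2(y_true, np.full_like(y_true, y_true.mean()))

# Predictions unrelated to the target do worse than the mean -> negative.
r2_bad = r2(y_true, np.array([8.0, 6.0, 6.0, 8.0, 8.0, 6.0]))

print(r2_mean, r2_bad)
```

A negative score therefore signals that predicting the mean for every area would have produced a smaller squared error than the model did.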

## Conclusion 

As the correlation analysis suggested, we could not find a regression that predicts the Nutripoints well from the number of fast-food outlets in an area. The two variables are almost uncorrelated, which makes it difficult to obtain meaningful coefficients in our regressions.

Working with the average product per area makes it difficult to capture how healthily the inhabitants of an area actually eat. A high Nutripoints score does not necessarily mean that all consumers in the area buy more fatty or sugary products. Furthermore, an unhealthy average product is not linked to a high number of fast-food outlets. It could nevertheless be interesting to check whether the richest areas have fewer fast-food outlets, under the assumption that their inhabitants prefer to go to traditional restaurants.