# Fast-food Analysis

Having studied whether the nutritional information of the average product per area is correlated with the wellbeing features, we now want to study the correlation between nutrient composition and the number of fast-food outlets per area.

We found the dataset on the London data website, the same source as for the wellbeing features.

In [1]:
# Imports
from utils import calculate_nutripoints
from sklearn.cluster import KMeans, DBSCAN
from statsmodels.stats import diagnostic
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy import stats
from scipy.stats import pearsonr
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict, train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, auc, roc_curve, r2_score
from sklearn.feature_selection import RFE
import math
from IPython.display import display
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

%matplotlib inline
sns.set_theme()

## I) Data Preparation 

### A) Data imports

In [2]:
# Data Imports

#year_grocery = pd.read_csv("data/year_osward_grocery.csv")
#display(year_grocery.head())

fast_food = pd.read_excel(
    "data/fast_food_ward.xlsx", sheet_name="Ward Data", header=[3], usecols="E,G")
display(fast_food.head())
print(fast_food.shape)

grocery_analysis = pd.read_pickle("data/grocery_nutripoints.pkl")
display(grocery_analysis.head())
print(grocery_analysis.shape)


### B) Merging

We compare the number of distinct areas in the grocery and fast-food datasets and count how many they have in common. We then merge the two datasets.

In [None]:
# True if area_id contains duplicated values (credit to @Carsten)
has_duplicates = not grocery_analysis["area_id"].is_unique
print(has_duplicates)

In [None]:
len(set(grocery_analysis["area_id"].values))

In [None]:
len(set(fast_food["2015 Ward code"].values))

In [None]:
len(set(fast_food["2015 Ward code"].values)
    & set(grocery_analysis["area_id"].values))

TO REVIEW: Since the fast-food dataset lists all the fast-food outlets in England, and the merged dataset has only 4 rows fewer than grocery_analysis, we can say that the merge loses very little information.

In [None]:
fastfood_grocery = pd.merge(
    left=grocery_analysis, right=fast_food, left_on='area_id', right_on="2015 Ward code")
fastfood_grocery = fastfood_grocery.drop("2015 Ward code", axis=1)
display(fastfood_grocery.head())
print(fastfood_grocery.shape)
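
One way to check which areas are dropped by the merge is a left merge with `indicator=True`. The sketch below uses two small made-up tables (`grocery_toy` and `fast_food_toy` are hypothetical stand-ins, not the real datasets) that reuse the column names from above.

```python
import pandas as pd

# Toy stand-ins for grocery_analysis and fast_food (values are made up)
grocery_toy = pd.DataFrame({
    "area_id": ["E05000026", "E05000027", "E05000028"],
    "nutripoints": [10.0, 12.5, 9.0],
})
fast_food_toy = pd.DataFrame({
    "2015 Ward code": ["E05000026", "E05000027", "E05009999"],
    "Count of outlets": [7, 12, 3],
})

# A left merge keeps every grocery row; indicator=True adds a "_merge" column
# that says whether a match was found in the fast-food table
merge_check = pd.merge(
    left=grocery_toy, right=fast_food_toy,
    left_on="area_id", right_on="2015 Ward code",
    how="left", indicator=True)

# Grocery areas with no fast-food counterpart
unmatched_ids = merge_check.loc[
    merge_check["_merge"] == "left_only", "area_id"].tolist()
print(len(unmatched_ids), unmatched_ids)
```

Applied to the real tables, the length of `unmatched_ids` should correspond to the number of rows lost by the inner merge.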

## II) Comprehension of the data

### A) Distribution of the values: describe, boxplot, distplot

In [None]:
# Check that there are no null values
fastfood_grocery.isnull().any()

In [None]:
# Summary statistics to better understand how the values are distributed
fastfood_grocery.describe()

In [None]:
columns_grocery = [
    'energy_fat',
    'energy_saturate',
    'energy_sugar',
    'energy_protein',
    'energy_carb',
    'energy_fibre',
    'energy_alcohol',
    'energy_tot',
    'h_nutrients_calories',
    'nutripoints',
    'Count of outlets'
]

column_boxplot = columns_grocery

fastfood_grocery_analysis = fastfood_grocery[column_boxplot].copy()
fig, ax = plt.subplots(4, 3, figsize=(16, 8), sharey=False)

for i, column in enumerate(column_boxplot):
    sbplt = ax[i // 3, i % 3]  # row and column of the subplot grid

    sns.boxplot(data=fastfood_grocery_analysis[column], ax=sbplt)
    sbplt.set_xlabel('')
    sbplt.set_ylabel('')
    sbplt.set_title(column, loc='center', wrap=True)

fig.tight_layout()
fig.subplots_adjust(top=0.9)

fig.suptitle('boxplot for each column', fontsize=18)

We observe some outliers, mainly in the number of fast-food outlets. We will try to visualize them better later.
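
To put a number on what the boxplots show, one can count the points that fall outside the whiskers. The sketch below assumes the common 1.5 × IQR rule; `count_iqr_outliers` is a hypothetical helper and the data is made up.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())


# Made-up outlet counts with one extreme value
toy_outlets = pd.Series([5, 6, 7, 8, 9, 10, 140])
n_outliers = count_iqr_outliers(toy_outlets)
print(n_outliers)
```

On the real data, `fastfood_grocery_analysis.apply(count_iqr_outliers)` would give one count per column.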

In [None]:
fig, ax = plt.subplots(4, 3, figsize=(16, 8), sharey=False)

for i, column in enumerate(column_boxplot):
    sbplt = ax[i // 3, i % 3]  # row and column of the subplot grid

    sns.histplot(data=fastfood_grocery_analysis[column], ax=sbplt)
    sbplt.set_xlabel('')
    sbplt.set_ylabel('')
    sbplt.set_title(column, wrap=True)

fig.tight_layout()
fig.subplots_adjust(top=0.9)

fig.suptitle('histplot for each column', fontsize=18)

Most of the nutritional variables seem to be roughly normally distributed. The number of outlets, on the other hand, is right-skewed and looks closer to a log-normal shape: most areas have between 5 and 10 fast-food outlets, but the count goes up to 140!
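
A quick way to quantify this asymmetry is the sample skewness, before and after a log transform. This is only a sketch on made-up counts; using `np.log1p` is an assumption on our side, chosen because it stays defined for areas with zero outlets.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up right-skewed counts (not the ward data)
toy_counts = pd.Series([5, 6, 6, 7, 8, 8, 9, 10, 25, 140])

skew_raw = stats.skew(toy_counts)
# log1p(x) = log(1 + x)
skew_log = stats.skew(np.log1p(toy_counts))

print(f"skewness raw: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skewness close to 0 indicates a roughly symmetric distribution; a large positive value indicates a long right tail.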

### B) Correlation between the different variables

In [None]:
# Heatmap to visualize the correlation between the variables
fig = plt.figure(figsize=(10, 6))
sns.heatmap(fastfood_grocery_analysis.corr())

The correlation with the number of outlets (the last column or row) seems really low, as the colours are mainly red, corresponding to values around 0. We display the correlation table to better understand the strength of the correlation between the different variables.

In [None]:
correlation = fastfood_grocery_analysis.corr(method="spearman")
display(correlation)

In [None]:
plt.figure(figsize=(14, 3))
display(correlation["Count of outlets"])
correlation["Count of outlets"].plot.bar(
    x=None, y=None, width=0.8, legend=None)
plt.ylabel("Spearman R")
plt.title("Correlation fast food outlets")
plt.show()
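
`DataFrame.corr` only returns the coefficients. To also get a p-value for a given pair of variables, `scipy.stats.spearmanr` can be used. The sketch below runs on synthetic, independent data (the `toy_*` arrays are made up), so the coefficient is expected to be close to 0.

```python
import numpy as np
from scipy import stats

# Synthetic, independent variables (not the ward data)
rng = np.random.default_rng(0)
toy_outlets = rng.poisson(lam=8, size=200)
toy_nutripoints = rng.normal(loc=0.0, scale=1.0, size=200)

# Spearman rank correlation and its two-sided p-value
rho, p_value = stats.spearmanr(toy_outlets, toy_nutripoints)
print(f"Spearman rho = {rho:.3f}, p-value = {p_value:.3f}")
```

On the real data the call would be `stats.spearmanr(fastfood_grocery["Count of outlets"], fastfood_grocery["nutripoints"])`.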

In [None]:
Y = fastfood_grocery[["nutripoints"]]
X = fastfood_grocery[["Count of outlets"]]
X = sm.add_constant(X)  # adding a constant

model = sm.OLS(Y, X).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)

In [None]:
## linear regression ##
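lin_reg = LinearRegression()  # create the model (it fits its own intercept)
# train it on the outlet count only: the 'const' column from sm.add_constant is redundant here
lin_reg.fit(X[["Count of outlets"]], Y["nutripoints"])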

In [None]:
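# np.ravel handles both 1D and 2D coef_/intercept_ shapes; the slope is the last coefficient
slope = np.ravel(lin_reg.coef_)[-1]
intercept = np.ravel(lin_reg.intercept_)[0]
print(f"nutripoints = {slope} * Count of outlets + {intercept}")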

In [None]:
# Gradient boosting regressor; it is evaluated with cross-validation in the next cells
gradboost = GradientBoostingRegressor()

In [None]:
# Pass the target as a 1D array to avoid scikit-learn's column-vector warning
predicted_y = cross_val_predict(gradboost, X, Y.values.ravel(), cv=5)

In [None]:
# Plot the results
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(Y, predicted_y, edgecolors=(0, 0, 0))
ax.set_xlabel('Original')
ax.set_ylabel('Predicted')
plt.show()

In [None]:
r2 = r2_score(Y, predicted_y)
mse = mean_squared_error(Y, predicted_y)
print(f"R²: {r2:.3f}, MSE: {mse:.3f}")
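
To interpret these scores, it can help to compare them with a baseline that always predicts the training mean. The sketch below is an assumption on our side (it is not part of the analysis above) and runs on synthetic data with scikit-learn's `DummyRegressor`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

# Synthetic feature and target (not the ward data)
rng = np.random.default_rng(0)
toy_X = rng.poisson(lam=8, size=(200, 1))
toy_y = rng.normal(size=200)

# Baseline: predict the mean of the training folds, ignoring the feature
baseline = DummyRegressor(strategy="mean")
baseline_pred = cross_val_predict(baseline, toy_X, toy_y, cv=5)
baseline_r2 = r2_score(toy_y, baseline_pred)
print(f"baseline cross-validated R²: {baseline_r2:.3f}")
```

The cross-validated R² of this baseline is at most 0, so a model whose R² is around 0 or below does no better than predicting the mean.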