## PCA on Pizza Dataset

The data used here is the pizza dataset from <a href="https://data.world/sdhilip/pizza-datasets">data.world</a>. It has the following attributes:

- brand -- Pizza brand (class label)
- id -- Sample analysed
- mois -- Amount of water per 100 grams in the sample
- prot -- Amount of protein per 100 grams in the sample
- fat -- Amount of fat per 100 grams in the sample
- ash -- Amount of ash per 100 grams in the sample
- sodium -- Amount of sodium per 100 grams in the sample
- carb -- Amount of carbohydrates per 100 grams in the sample
- cal -- Amount of calories per 100 grams in the sample

Here is a preview of the data:

In [1]:
import pandas as pd

In [3]:
pizza = pd.read_csv('Pizza.csv')
pizza.head()

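| | brand | id | mois | prot | fat | ash | sodium | carb | cal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A | 14069 | 27.82 | 21.43 | 44.87 | 5.11 | 1.77 | 0.77 | 4.93 |
| 1 | A | 14053 | 28.49 | 21.26 | 43.89 | 5.34 | 1.79 | 1.02 | 4.84 |
| 2 | A | 14025 | 28.35 | 19.99 | 45.78 | 5.08 | 1.63 | 0.8 | 4.95 |
| 3 | A | 14016 | 30.55 | 20.15 | 43.13 | 4.79 | 1.61 | 1.38 | 4.74 |
| 4 | A | 14005 | 30.49 | 21.28 | 41.65 | 4.82 | 1.64 | 1.76 | 4.67 |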


## Data Exploration

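Averaging each attribute by brand shows how the brands differ. Brands I and B have the highest moisture content (means of about 54.6 and 51.3 per 100 g), while brand A has by far the highest sodium level (mean of about 1.66 per 100 g). Brand G's sodium level is among the lowest, although it has the highest carbohydrate content. The `id` column is only a sample identifier, so its mean carries no meaning.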

In [5]:
pizza.groupby(['brand']).mean()

| brand | id | mois | prot | fat | ash | sodium | carb | cal |
|---|---|---|---|---|---|---|---|---|
| A | 20632.0 | 29.966207 | 20.107241 | 43.446897 | 5.014483 | 1.656207 | 1.486897 | 4.773793 |
| B | 19881.16129 | 51.307742 | 13.63871 | 27.620323 | 3.463548 | 0.984839 | 3.969677 | 3.190968 |
| C | 20001.703704 | 49.477407 | 26.025556 | 19.171111 | 3.283333 | 0.464815 | 2.046296 | 2.848889 |
| D | 19696.53125 | 47.67125 | 22.23125 | 21.645312 | 4.315938 | 0.715 | 4.13625 | 3.003438 |
| E | 20158.321429 | 36.083214 | 7.732857 | 15.115714 | 1.476071 | 0.449286 | 39.592143 | 3.253929 |
| F | 21085.533333 | 29.404333 | 7.898 | 16.424667 | 1.473667 | 0.462 | 44.787333 | 3.596 |
| G | 19934.551724 | 28.241034 | 8.236552 | 15.643793 | 1.446897 | 0.443793 | 46.431724 | 3.595172 |
| H | 20131.575758 | 35.825152 | 7.894545 | 14.291515 | 1.406061 | 0.416061 | 40.583939 | 3.224545 |
| I | 22001.689655 | 54.592759 | 10.383103 | 13.06069 | 2.098276 | 0.487241 | 19.865517 | 2.384138 |
| J | 24682.53125 | 46.035 | 10.56625 | 16.324062 | 2.364688 | 0.614375 | 24.735938 | 2.878437 |
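Reading the leading brand per attribute off a wide table by eye is error-prone, so it can help to let pandas do it. The sketch below is illustrative only: rather than loading `Pizza.csv`, it rebuilds a small frame (the hypothetical name `brand_means`) from the `mois` and `sodium` means in the table above, rounded, and then asks for the top brand in each column.

```python
import pandas as pd

# Brand means for two attributes, copied (rounded) from the groupby table.
# `brand_means` is a stand-in built by hand so the example runs without Pizza.csv.
brand_means = pd.DataFrame(
    {
        "mois": [29.97, 51.31, 49.48, 47.67, 36.08, 29.40, 28.24, 35.83, 54.59, 46.04],
        "sodium": [1.656, 0.985, 0.465, 0.715, 0.449, 0.462, 0.444, 0.416, 0.487, 0.614],
    },
    index=list("ABCDEFGHIJ"),
)

# Brand with the largest mean in each column
top_brand = brand_means.idxmax()

# The two brands with the highest mean moisture
top_two_mois = brand_means["mois"].nlargest(2).index.tolist()

print(top_brand)
print(top_two_mois)
```

On the full dataset, the same idea should work by calling `.idxmax()` on the result of `pizza.groupby('brand').mean(numeric_only=True)`.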


## Dataset Description

The data consists of 300 pizza samples across 10 brands, each described by seven measured nutrient attributes.

In [7]:
pizza.describe()

| | id | mois | prot | fat | ash | sodium | carb | cal |
|---|---|---|---|---|---|---|---|---|
| count | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 |
| mean | 20841.04 | 40.903067 | 13.373567 | 20.229533 | 2.633233 | 0.6694 | 22.864767 | 3.271 |
| std | 6962.917222 | 9.552987 | 6.434392 | 8.975658 | 1.269724 | 0.370358 | 18.029722 | 0.620034 |
| min | 14003.0 | 25.0 | 6.98 | 4.38 | 1.17 | 0.25 | 0.51 | 2.18 |
| 25% | 14093.75 | 30.9025 | 8.06 | 14.77 | 1.45 | 0.45 | 3.4675 | 2.91 |
| 50% | 24020.5 | 43.3 | 10.44 | 17.135 | 2.225 | 0.49 | 23.245 | 3.215 |
| 75% | 24110.25 | 49.115 | 20.0225 | 21.435 | 3.5925 | 0.7025 | 41.3375 | 3.52 |
| max | 34045.0 | 57.22 | 28.48 | 47.2 | 5.43 | 1.79 | 48.64 | 5.08 |


## Pizza Count by Brand

The samples are fairly evenly distributed across brands, ranging from 27 (brand C) to 33 (brand H).

In [4]:
pizza.brand.value_counts()

H    33
J    32
D    32
B    31
F    30
G    29
I    29
A    29
E    28
C    27
Name: brand, dtype: int64

## Standardizing the Data for PCA

Before running PCA, it is important to standardize the attributes. Recall that PCA finds components in order of maximal variance. Without standardization, an attribute with a large variance (here `carb`, with a standard deviation of about 18, compared with about 0.37 for `sodium`) would dominate the leading components simply because of its scale. Scaling every column to unit variance gives all attributes equal weight at the start, so the resulting axes reflect how the attributes vary together rather than the scale they were measured on.
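The effect of scale on the leading component can be seen in a small synthetic experiment. The sketch below does not use the pizza data: it generates two correlated columns (hypothetical names `small` and `large`) whose standard deviations differ by roughly two orders of magnitude, then computes the first principal component as the top eigenvector of the covariance matrix, before and after standardizing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two correlated synthetic columns on very different scales
small = rng.normal(0.0, 1.0, n)
large = 100.0 * (0.5 * small + rng.normal(0.0, 1.0, n))
X = np.column_stack([small, large])


def first_component(data):
    """Unit eigenvector of the covariance matrix with the largest eigenvalue."""
    cov = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
    return eigvecs[:, -1]


# Without standardization, the large-scale column dominates the loadings
raw_pc1 = first_component(X)

# After standardization, both columns contribute equally (in absolute value)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_pc1 = first_component(X_std)

print("raw loadings:         ", np.abs(raw_pc1))
print("standardized loadings:", np.abs(std_pc1))
```

With only two standardized columns, the loadings are equal in magnitude by construction; with seven attributes, as in the pizza data, the loadings would instead reflect the correlation structure.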

In [41]:
from sklearn.preprocessing import StandardScaler

# Nutrient columns used as PCA inputs (brand and id are excluded)
features = ['mois', 'prot', 'fat', 'ash', 'sodium', 'carb', 'cal']

# Rescale each column to mean 0 and unit variance, keeping the column names
scaler = StandardScaler()
X_norm = pd.DataFrame(scaler.fit_transform(pizza[features]), columns=features)

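After standardization, every column has mean 0 (up to floating-point error). The variances below read 1.003344 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `var()` defaults to the sample variance (`ddof=1`). The ratio between the two is 300/299 ≈ 1.003344.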

In [44]:
X_norm.mean()

mois     -2.849572e-17
prot      3.482400e-16
fat      -4.296563e-16
ash      -9.505359e-16
sodium   -2.634929e-16
carb     -2.701543e-17
cal       2.505403e-16
dtype: float64

In [43]:
X_norm.var()

mois      1.003344
prot      1.003344
fat       1.003344
ash       1.003344
sodium    1.003344
carb      1.003344
cal       1.003344
dtype: float64
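The value 1.003344 comes from the difference between population and sample variance, and this can be checked without the dataset. The sketch below uses a synthetic series of 300 values (the mean and spread are arbitrary choices) and standardizes it by hand with the population standard deviation, which is the convention `StandardScaler` follows, then compares the two variance definitions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300  # same number of rows as the pizza data

# Synthetic column; the location and scale are arbitrary
x = pd.Series(rng.normal(40.0, 9.5, n))

# Standardize with the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

pop_var = z.var(ddof=0)  # population variance: 1 up to rounding
sample_var = z.var()     # pandas default is ddof=1, giving n / (n - 1)

print(pop_var, sample_var, n / (n - 1))
```

Passing `ddof=0` to `X_norm.var()` should therefore report values of 1 up to floating-point error.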

## PCA