# Cereal ratings

Suppose you have the following [dataset](https://docs.google.com/spreadsheets/d/1OyFwsZ77RnGiwSSkTPx6j9V9qyOqitysQij4WW4nhMU/edit?usp=sharing),
a list of roughly 80 cereals (the file loaded below has 77 rows) with the following fields:

```
    mfr: Manufacturer of cereal
        A = American Home Food Products
        G = General Mills
        K = Kelloggs
        N = Nabisco
        P = Post
        Q = Quaker Oats
        R = Ralston Purina
    type: type of cereal (stored as a one-letter code, e.g. C for cold)
        cold
        hot
    calories: calories per serving
    protein: grams of protein per serving
    fat: grams of fat per serving
    sodium: milligrams of sodium
    fiber: grams of dietary fiber
    carbs: grams of complex carbohydrates
    sugars: grams of sugars
    potass: milligrams of potassium
    vitamins: vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended
    shelf: display shelf (1, 2, or 3, counting from the floor)
    weight: weight in ounces of one serving
    cups: number of cups in one serving
    rating: a rating of the cereals (Possibly from Consumer Reports?)
```

Given the above, can you build a model in Python to predict the cereal rating? We'll be creating a multiple linear regression (several predictors, one outcome) as a solution for premium users.

[Dataset source](https://www.kaggle.com/crawford/80-cereals)
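
As a rough illustration of what a multiple linear regression estimates, the sketch below models an outcome as an intercept plus a weighted sum of predictors and solves for the weights by least squares with NumPy. The helper names `fit_linear` and `predict_linear` and the synthetic data are assumptions made for this example; they are not part of the solution that follows, which uses statsmodels.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares coefficients for y ~ X plus an intercept (stored last)."""
    X = np.asarray(X, dtype=float)
    # Append a column of ones so the last coefficient acts as the intercept.
    design = np.column_stack([X, np.ones(len(X))])
    coefs, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coefs


def predict_linear(coefs, X):
    """Apply fitted slopes and intercept to new predictor rows."""
    X = np.asarray(X, dtype=float)
    return X @ coefs[:-1] + coefs[-1]


# Synthetic, illustrative data only (not taken from the cereal dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_coefs = np.array([2.0, -1.0, 0.5, 10.0])  # three slopes, then intercept
y = X @ true_coefs[:-1] + true_coefs[-1]       # noiseless linear outcome

coefs = fit_linear(X, y)
print(np.round(coefs, 3))
print(np.round(predict_linear(coefs, X[:2]), 3))
```

Because the synthetic outcome is an exact linear function of the predictors, the fitted coefficients should match `true_coefs` up to floating-point error. Real data would add noise, and the fit would then only approximate the underlying weights.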

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn import preprocessing
%matplotlib inline

filename = 'q164_data_cereals.csv'
df = pd.read_csv(filename)
print('shape', df.shape)
print(df.columns)
df.head()


shape (77, 16)
Index(['name', 'mfr', 'type', 'calories', 'protein', 'fat', 'sodium', 'fiber',
       'carbs', 'sugars', 'potass', 'vitamins', 'shelf', 'weight', 'cups',
       'rating'],
      dtype='object')


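| | name | mfr | type | calories | protein | fat | sodium | fiber | carbs | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 1 | 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.0 | 33.983679 |
| 2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| 3 | All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| 4 | Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |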
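
The `potass` value of -1 for Almond Delight in the preview above suggests that missing measurements may be encoded as -1 rather than left blank, in which case `dropna()` would not remove them. The sketch below shows one possible way to convert such sentinels to `NaN` first. Treating -1 as a missing-value marker, and limiting the check to `carbs`, `sugars`, and `potass`, are assumptions worth verifying against the data; the two-row frame reuses values from the preview and the name `sentinel_cols` is made up for this example.

```python
import pandas as pd

# Tiny illustrative frame built from two rows of the preview above.
sample = pd.DataFrame({
    'name': ['100% Bran', 'Almond Delight'],
    'carbs': [5.0, 14.0],
    'sugars': [6, 8],
    'potass': [280, -1],
})

# Assumption: -1 marks a missing measurement in these columns.
sentinel_cols = ['carbs', 'sugars', 'potass']

cleaned = sample.copy()
for col in sentinel_cols:
    # mask() replaces entries where the condition holds with NaN.
    cleaned[col] = cleaned[col].mask(cleaned[col] == -1)

print(cleaned.isna().sum())    # missing values per column after conversion
print(sample.dropna().shape)   # sentinel row survives dropna() untouched
print(cleaned.dropna().shape)  # sentinel row is dropped once converted
```

Applying the same conversion before fitting would change which rows enter the regression, so the coefficients could differ from a fit that keeps the -1 values.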


In [2]:
def pretty_print_ols_summary(fit):
    """ concise alternative to sm.OLS().fit().summary() """
    # use dir(fit) to list properties
    formula = fit.model.endog_names + ' ~ ' + ' + '.join(fit.model.exog_names)
    print(formula)
    print(f'--- adj r2 = {fit.rsquared_adj:.2f}')
    for param, beta in fit.params.items():
        print(f'{param}: beta = {beta:.2f}, p = {fit.pvalues.get(param):.2f}')


def ols_fit(df, outcome_colname, predictor_colnames, add_intercept=True):
    """ ordinary least squares fit on dataframe """
    data = df[predictor_colnames].copy() # copy so adding a column does not trigger a setting-on-copy warning
    if add_intercept:
        data['intercept'] = 1 # constant column so the model estimates an intercept
    linreg = sm.OLS(df[outcome_colname], data)
    return linreg.fit()


def normalize_columns(df, cols):
    """ normalize the provided columns: mean 0, sd 1 """
    ddf = df.copy()
    for col in cols:
        ddf[col] = (ddf[col] - ddf[col].mean()) / ddf[col].std()
    return ddf


df['is_cold'] = (df['type'] == 'C').astype(int)
num_features = [
    'calories','protein','fat','sodium',
    'fiber','carbs','sugars','potass','vitamins',
    'weight','cups'
]
categ_features = ['is_cold']
label = 'rating'
ddf = df[num_features + categ_features + [label]].dropna()
ddf = normalize_columns(ddf, num_features)
fit = ols_fit(ddf, label, num_features + categ_features)
pretty_print_ols_summary(fit)

rating ~ calories + protein + fat + sodium + fiber + carbs + sugars + potass + vitamins + weight + cups + is_cold + intercept
--- adj r2 = 1.00
calories: beta = -4.34, p = 0.00
protein: beta = 3.58, p = 0.00
fat: beta = -1.70, p = 0.00
sodium: beta = -4.57, p = 0.00
fiber: beta = 8.21, p = 0.00
carbs: beta = 4.67, p = 0.00
sugars: beta = -3.22, p = 0.00
potass: beta = -2.42, p = 0.00
vitamins: beta = -1.14, p = 0.00
weight: beta = -0.00, p = 0.51
cups: beta = 0.00, p = 0.34
is_cold: beta = 0.00, p = 0.86
intercept: beta = 42.67, p = 0.00
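
Because the numeric predictors were normalized to mean 0 and standard deviation 1 before fitting, each beta can be read as the change in rating per one-standard-deviation change in that predictor, which makes the magnitudes roughly comparable. As a small sketch of that reading, the snippet below ranks the printed coefficients by absolute size. The values are copied from the output above; `is_cold` and the intercept are left out because they were not normalized, and the variable names are chosen for this example only.

```python
# Standardized coefficients copied from the fit output above.
betas = {
    'calories': -4.34, 'protein': 3.58, 'fat': -1.70, 'sodium': -4.57,
    'fiber': 8.21, 'carbs': 4.67, 'sugars': -3.22, 'potass': -2.42,
    'vitamins': -1.14, 'weight': -0.00, 'cups': 0.00,
}

# Sort by absolute magnitude, largest first; the sign shows the direction.
ranked = sorted(betas.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, beta in ranked:
    print(f'{name}: {beta:+.2f}')
```

On this reading, fiber has the largest positive association with rating, while sodium and calories have the largest negative ones. Since the predictors are correlated with each other, individual coefficients are best interpreted with some caution.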
