
Bella Chen

How well can the calories be predicted based on the Manufacturer, cereal type, grams of fat, grams of sugars, and weight in ounces per one serving of the cereal?  

You will need to:

Define features (X) and target (y).

Train/test split the data to prepare for machine learning.

Identify each feature as numerical, ordinal, or nominal. (Please provide this answer in a text cell in your Colab notebook).

Use pipelines and column transformers to complete the following tasks:

- Impute any missing values. 
  - Use ‘mean’ strategy for numeric columns and ‘most_frequent’ strategy for categorical columns
- One-hot encode the nominal features.
- Scale the numeric columns.

All preprocessing steps should be contained within a single preprocessing object.

Use your preprocessing object to transform your data appropriately, avoiding data leakage, to make it ready for modeling. 

Show the resulting NumPy array.


In [63]:
# Imports
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn import set_config
set_config(display='diagram')

In [64]:
# Load the data from the Excel file and work on a copy so the original stays untouched
cereal = "/content/drive/MyDrive/Coding Dojo/Data Sets/Cereal with missing values.xlsx"
cereal_df_original = pd.read_excel(cereal)
cereal_df_1 = cereal_df_original.copy(deep=True)
print(cereal_df_1)

                       name                 Manufacturer  type  \
0   Apple Cinnamon Cheerios                General Mills  Cold   
1                   Basic 4                General Mills  Cold   
2                  Cheerios                General Mills  Cold   
3     Cinnamon Toast Crunch                General Mills  Cold   
4                  Clusters                General Mills  Cold   
..                      ...                          ...   ...   
72                Rice Chex               Ralston Purina  Cold   
73               Wheat Chex               Ralston Purina  Cold   
74                    Maypo  American Home Food Products   Hot   
75   Cream of Wheat (Quick)                      Nabisco   Hot   
76           Quaker Oatmeal                  Quaker Oats   Hot   

    calories per serving  grams of protein  grams of fat  \
0                  110.0               2.0           2.0   
1                  130.0               3.0           2.0   
2                    NaN   

In [65]:
#Checking for Duplicates
cereal_df_1.duplicated().sum()

0

In [66]:
#check for missing data
cereal_df_1.isna().sum()

name                                               0
Manufacturer                                       0
type                                               9
calories per serving                               7
grams of protein                                   0
grams of fat                                       8
milligrams of sodium                               1
grams of dietary fiber                             0
grams of complex carbohydrates                     0
grams of sugars                                    9
milligrams of potassium                            0
vitamins and minerals (% of FDA recommendation)    1
Display shelf                                      0
Weight in ounces per one serving                   0
Number of cups in one serving                      0
Rating of cereal                                   0
dtype: int64
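
The target column, calories per serving, has 7 missing values. The imputers in a preprocessing object only act on X, so those rows would still carry a NaN target into modeling. One common option is to drop rows with a missing target before splitting. The sketch below shows this on a small made-up frame; `toy`, `toy_clean`, `X_toy`, and `y_toy` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cereal data; the values are made up for illustration.
toy = pd.DataFrame({
    "type": pd.Series(["Cold", None, "Hot", "Cold"], dtype="object"),
    "grams of fat": [2.0, np.nan, 1.0, 0.0],
    "calories per serving": [110.0, 130.0, np.nan, 100.0],
})

# Keep only rows where the target is known.
# Missing feature values are left in place for the imputers to handle later.
toy_clean = toy.dropna(subset=["calories per serving"])

X_toy = toy_clean.drop(columns=["calories per serving"])
y_toy = toy_clean["calories per serving"]

print(len(toy), "rows before,", len(toy_clean), "rows after")
print(y_toy.isna().sum(), "missing target values remain")
```

Dropping is only one choice; whether it is acceptable depends on how much data is lost and why the target is missing.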

In [67]:
cereal_df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 16 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   name                                             77 non-null     object 
 1   Manufacturer                                     77 non-null     object 
 2   type                                             68 non-null     object 
 3   calories per serving                             70 non-null     float64
 4   grams of protein                                 77 non-null     float64
 5   grams of fat                                     69 non-null     float64
 6   milligrams of sodium                             76 non-null     float64
 7   grams of dietary fiber                           77 non-null     float64
 8   grams of complex carbohydrates                   77 non-null     float64
 9   grams of sugars                   

In [68]:
# Separate data into features matrix (X) and target vector (y)
# Defining X and y; the target is calories per serving
X = cereal_df_1.drop(columns=["calories per serving"])
y = cereal_df_1["calories per serving"]
# Train/test split the data, using random_state=42 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [69]:
# verifying split
len(X_train)

57
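
The question names five predictors, while the split above keeps every column except the target, including `name`, which is different for nearly every cereal. Here is a minimal sketch of selecting only the named features, assuming the column labels shown in the `isna()` output. The `feature_cols` list, the toy rows, and the `X_sel`/`y_sel` names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Column labels taken from the notebook's isna() output.
feature_cols = [
    "Manufacturer",
    "type",
    "grams of fat",
    "grams of sugars",
    "Weight in ounces per one serving",
]
target_col = "calories per serving"

# Hypothetical rows, made up purely to exercise the selection.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "Manufacturer": ["General Mills", "Nabisco", "Quaker Oats", "Nabisco"],
    "type": ["Cold", "Hot", "Hot", "Cold"],
    "grams of fat": [2.0, 0.0, 2.0, 1.0],
    "grams of sugars": [10.0, 0.0, 1.0, 6.0],
    "Weight in ounces per one serving": [1.0, 1.0, 1.0, 1.3],
    "calories per serving": [110.0, 100.0, 100.0, 120.0],
})

# Select only the predictors named in the question; `name` is left out.
X_sel = toy[feature_cols]
y_sel = toy[target_col]

X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y_sel, random_state=42)
print(list(X_sel.columns))
print(len(X_tr), "training rows,", len(X_te), "testing rows")
```

With fewer input columns, the preprocessed array would have fewer columns than the 77 shown in this notebook's output.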

Feature types:

Nominal: 
- name
- Manufacturer
- type

Numeric:
- grams of protein
- grams of fat
- milligrams of sodium
- grams of dietary fiber
- grams of complex carbohydrates
- grams of sugars
- milligrams of potassium 
- vitamins and minerals (% of FDA recommendation) 
- Weight in ounces per one serving 
- Number of cups in one serving
- Rating of cereal

Ordinal:
- Display shelf

Target (numeric, not a feature):
- calories per serving

In [87]:
#pipelines and column transformers to impute missing data - "mean" for numerical and "most frequent" for categorical
#impute
mean_imputer = SimpleImputer(strategy = "mean")
most_frequent_imputer = SimpleImputer(strategy = "most_frequent")

# create selectors (categorical and numerical)
cat_selector = make_column_selector(dtype_include="object")
num_selector = make_column_selector(dtype_include="number")
# dense one-hot output; scikit-learn >= 1.2 names this argument sparse_output (older versions: sparse)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
scaler = StandardScaler()

#numeric pipeline
numeric_pipe = make_pipeline(mean_imputer, scaler)

#categorical pipeline
cat_pipe = make_pipeline(most_frequent_imputer, ohe)

# create tuples of (pipeline, selector) for each datatype
num_tuple = (numeric_pipe, num_selector)
cat_tuple = (cat_pipe, cat_selector)

#column transformers
column_transformer = make_column_transformer(num_tuple, cat_tuple)
column_transformer

#fitting column transformer
column_transformer.fit(X_train)

#transforming training and testing data (results in NumPy Array)
X_train_imputed =  column_transformer.transform(X_train)
X_test_imputed = column_transformer.transform(X_test)

#checking for missing values and scaling and ohe
print(np.isnan(X_train_imputed).sum(), "missing values in training data")
print(np.isnan(X_test_imputed).sum(), "missing values in testing data")
print("\n")
print("All data in X_train_imputed are", X_train_imputed.dtype)
print("All data in X_test_imputed are", X_test_imputed.dtype)
print("\n")
print("shape of data is", X_train_imputed.shape)
print("\n")
X_train_imputed

0 missing values in training data
0 missing values in testing data


All data in X_train_imputed are float64
All data in X_test_imputed are float64


shape of data is (57, 77)




array([[-1.30301442, -0.97467943,  0.56162348, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.40438378,  0.        ,  0.68120871, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.40438378, -0.97467943,  1.99664622, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [ 1.25808288,  1.94935887, -0.03630266, ...,  1.        ,
         1.        ,  0.        ],
       [ 0.40438378,  0.97467943, -0.15588789, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.40438378,  0.        ,  0.08328257, ...,  0.        ,
         1.        ,  0.        ]])