<a href="https://colab.research.google.com/github/dvisionst/Pipelines_Exercise/blob/main/Pipelines_Activity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pipelines Activity 

- Jose Flores
- 22 July 2022

In [1]:
# Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn import set_config
set_config(display='diagram')




For this task, you will use the cereals dataset. This dataset shows popular cereals by brand and manufacturer along with nutrition facts.  The machine learning question is: 

*How well can the calories be predicted based on the Manufacturer, cereal type, grams of fat, grams of sugars, and weight in ounces per one serving of the cereal?*  

At this point, you are just completing the pre-processing steps for this assignment.

You will need to:

* Define features (X) and target (y).
X should only include the Manufacturer, cereal type, grams of fat, grams of sugars, and weight in ounces columns.



In [2]:
data = '/content/Cereal with missing values.xlsx - Sheet 1 - cereal.csv'
df = pd.read_csv(data)
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 16 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   name                                             77 non-null     object 
 1   Manufacturer                                     77 non-null     object 
 2   type                                             68 non-null     object 
 3   calories per serving                             70 non-null     float64
 4   grams of protein                                 77 non-null     int64  
 5   grams of fat                                     69 non-null     float64
 6   milligrams of sodium                             76 non-null     float64
 7   grams of dietary fiber                           77 non-null     float64
 8   grams of complex carbohydrates                   77 non-null     float64
 9   grams of sugars                   

In [3]:
# Split the data into the target and the features matrix. The target is
# calories per serving. The features matrix includes only Manufacturer,
# cereal type, grams of fat, grams of sugars, and weight in ounces per serving.

y = df['calories per serving']
X = df.drop(columns=['name', 
             'grams of protein', 
             'calories per serving', 
             'milligrams of sodium',
             'grams of dietary fiber', 
             'grams of complex carbohydrates',
             'milligrams of potassium',
             'vitamins and minerals (% of FDA recommendation)', 
             'Display shelf',
             'Number of cups in one serving',
             'Rating of cereal'])

X.head()


Unnamed: 0,Manufacturer,type,grams of fat,grams of sugars,Weight in ounces per one serving
0,General Mills,Cold,2.0,10.0,1.0
1,General Mills,Cold,2.0,,1.33
2,General Mills,Cold,2.0,1.0,1.0
3,General Mills,Cold,3.0,9.0,1.0
4,General Mills,Cold,2.0,7.0,1.0
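
An alternative to dropping the unwanted columns is to select the five wanted columns by name, which keeps the feature list visible in one place. The following is a minimal sketch on a made-up stand-in frame; `toy_df`, `feature_cols`, `target_col`, `X_toy`, and `y_toy` are hypothetical names and the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the cereal DataFrame; values are made up for illustration.
toy_df = pd.DataFrame({
    "name": ["Cereal A", "Cereal B"],
    "Manufacturer": ["Kelloggs", "Post"],
    "type": ["Cold", "Hot"],
    "calories per serving": [110.0, 100.0],
    "grams of fat": [1.0, 0.0],
    "grams of sugars": [6.0, 3.0],
    "Weight in ounces per one serving": [1.0, 1.0],
    "Rating of cereal": [40.0, 60.0],
})

# Columns named in the assignment
feature_cols = [
    "Manufacturer",
    "type",
    "grams of fat",
    "grams of sugars",
    "Weight in ounces per one serving",
]
target_col = "calories per serving"

# Select by name instead of dropping everything else
X_toy = toy_df[feature_cols]
y_toy = toy_df[target_col]

print(X_toy.columns.tolist())
```

Selecting by name also raises a `KeyError` if a column is misspelled, which can catch mistakes early.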


In [4]:
# Check for duplicate rows

df.duplicated().sum()

0

In [5]:
# Check for missing values in each column
df.isna().sum()

name                                               0
Manufacturer                                       0
type                                               9
calories per serving                               7
grams of protein                                   0
grams of fat                                       8
milligrams of sodium                               1
grams of dietary fiber                             0
grams of complex carbohydrates                     0
grams of sugars                                    9
milligrams of potassium                            0
vitamins and minerals (% of FDA recommendation)    1
Display shelf                                      0
Weight in ounces per one serving                   0
Number of cups in one serving                      0
Rating of cereal                                   0
dtype: int64
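
The counts above show 7 missing values in the target column, calories per serving. The preprocessing steps in this notebook only impute features, so rows with a missing target would still need handling before a model is fit. One possible approach, sketched below on made-up data, is to drop those rows before defining X and y. This is an assumption about a later modeling step, not something the assignment asks for; `toy_df` and `clean_df` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing target value and one missing feature value
toy_df = pd.DataFrame({
    "calories per serving": [110.0, np.nan, 90.0, 120.0],
    "grams of fat": [1.0, 2.0, np.nan, 0.0],
})

# Drop only the rows where the target is missing; missing features are
# left in place so that an imputer can fill them later
clean_df = toy_df.dropna(subset=["calories per serving"])

print(clean_df.isna().sum())
```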

## Train test split the data to prepare for machine learning.

In [None]:
# Split into training and test sets (the default test size is 25%)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
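
To see what the default split does with a dataset of this size, the sketch below splits a made-up frame with 77 rows, the same row count reported by `df.info()`. The names `X_toy`, `y_toy`, `X_tr`, `X_te`, `y_tr`, and `y_te` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with the same number of rows as the cereal dataset
X_toy = pd.DataFrame({"feature": np.arange(77)})
y_toy = pd.Series(np.arange(77), name="target")

# With no test_size given, train_test_split holds out 25% of the rows
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

print(X_tr.shape, X_te.shape)  # (57, 1) (20, 1)
```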

## Identify each feature as numerical, ordinal, or nominal. (Please provide this answer in a text cell in your Colab notebook).

*The Manufacturer and type columns are nominal (categorical).
The remaining three columns (grams of fat, grams of sugars, and
weight in ounces) are numerical.
There are no ordinal features in the features matrix.*
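
The grouping above can be checked against the column dtypes. The following sketch uses a made-up frame with the same column names; `X_toy`, `numeric_cols`, and `nominal_cols` are hypothetical names. Note that dtypes only separate numeric from non-numeric columns; deciding whether a text column is nominal or ordinal still takes a look at its values.

```python
import numpy as np
import pandas as pd

# Toy features matrix; values are made up for illustration
X_toy = pd.DataFrame({
    "Manufacturer": ["Kelloggs", "Post", "Quaker Oats"],
    "type": ["Cold", np.nan, "Hot"],
    "grams of fat": [1.0, np.nan, 2.0],
    "grams of sugars": [6.0, 3.0, np.nan],
    "Weight in ounces per one serving": [1.0, 1.33, 1.0],
})

# Numeric columns versus everything else
numeric_cols = X_toy.select_dtypes(include="number").columns.tolist()
nominal_cols = X_toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("nominal:", nominal_cols)
```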

## Use pipelines and column transformers to complete the following tasks:
* Impute any missing values. Use the ‘mean’ strategy for numeric columns and the ‘most_frequent’ strategy for categorical columns.
* One-hot encode the nominal features.
* Scale the numeric columns.


In [None]:
# Instantiating column selectors
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')
# Imputers for categorical and numerical data
freq_imputer = SimpleImputer(strategy='most_frequent')
mean_imputer = SimpleImputer(strategy='mean')
# scaler for numerical
scaler = StandardScaler()
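
As a quick illustration of what the two imputation strategies do, the sketch below applies each one to a tiny made-up column. The names `num_toy`, `cat_toy`, `mean_filled`, and `freq_filled` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# One numeric and one categorical column, each with a missing value
num_toy = pd.DataFrame({"grams of fat": [1.0, np.nan, 3.0]})
cat_toy = pd.DataFrame({"type": ["Cold", np.nan, "Cold", "Hot"]})

# 'mean' fills the gap with the column mean: (1.0 + 3.0) / 2 = 2.0
mean_filled = SimpleImputer(strategy="mean").fit_transform(num_toy)

# 'most_frequent' fills the gap with the most common category: 'Cold'
freq_filled = SimpleImputer(strategy="most_frequent").fit_transform(cat_toy)

print(mean_filled.ravel())
print(freq_filled.ravel())
```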

## All preprocessing steps should be contained within a single preprocessing object.


In [None]:
# Instantiating numeric pipeline

numeric_pipe = make_pipeline(mean_imputer, scaler)
numeric_pipe

In [None]:
# Instantiating categorical pipeline (imputation only)
categorical_pipe = make_pipeline(freq_imputer)
categorical_pipe

In [None]:
# Instantiating the column transformer

# Tuples for the column transformer
num_tuple = (numeric_pipe, num_selector)
cat_tuple = (categorical_pipe, cat_selector)

# column transformer

preprocessor = make_column_transformer(num_tuple, cat_tuple)
preprocessor
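
The categorical pipeline above only imputes, so the nominal columns stay as strings after transformation. The task list also asks for one-hot encoding. The following is a minimal sketch of how an encoder could be added as a second step, run on made-up data; `X_toy`, `nominal_pipe`, `toy_numeric_pipe`, and `toy_preprocessor` are hypothetical names. Passing `sparse_threshold=0` to `make_column_transformer` is a choice made here so that the result is always a dense NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy features matrix; values are made up for illustration
X_toy = pd.DataFrame({
    "Manufacturer": ["Kelloggs", "Post", "Kelloggs", "Quaker Oats"],
    "type": ["Cold", np.nan, "Cold", "Hot"],
    "grams of fat": [1.0, np.nan, 0.0, 2.0],
    "grams of sugars": [6.0, 3.0, np.nan, 10.0],
})

# Numeric columns: mean imputation, then scaling
toy_numeric_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

# Nominal columns: most-frequent imputation, then one-hot encoding.
# handle_unknown='ignore' encodes unseen test categories as all zeros.
nominal_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

toy_preprocessor = make_column_transformer(
    (toy_numeric_pipe, make_column_selector(dtype_include="number")),
    (nominal_pipe, make_column_selector(dtype_exclude="number")),
    sparse_threshold=0,  # always return a dense array
)

X_toy_processed = toy_preprocessor.fit_transform(X_toy)
print(X_toy_processed.shape)  # (4, 7): 2 numeric + 3 manufacturers + 2 types
```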

## Use your preprocessing object to transform your data appropriately, avoiding data leakage, to make it ready for modeling. Show the resulting Numpy array.

In [None]:
# Fit the preprocessor on the training set only, to avoid data leakage

preprocessor.fit(X_train)

# Transform the training and test sets with the fitted preprocessor

X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)
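
Fitting on the training set only means that every learned statistic (imputation values, scaling means and standard deviations) comes from training rows, and the test rows are transformed with those same statistics. The sketch below shows this with a scaler on made-up numbers; `train_toy`, `test_toy`, and `toy_scaler` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test values for a single numeric feature
train_toy = np.array([[1.0], [2.0], [3.0]])
test_toy = np.array([[10.0]])

# The scaler learns its mean and standard deviation from the training data only
toy_scaler = StandardScaler().fit(train_toy)
print(toy_scaler.mean_)  # [2.]

# Both sets are transformed with the training statistics
train_scaled = toy_scaler.transform(train_toy)
test_scaled = toy_scaler.transform(test_toy)

# The training column is centered on zero; the test value is not,
# because it is measured against the training mean
print(train_scaled.mean(), test_scaled.ravel())
```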

In [None]:
# Data inspection

X_train_processed



array([[-0.9746794344808966, 0.9944816473415153, -0.13276489651437107,
        'Kelloggs', 'Cold'],
       [0.0, 1.2219191541326242, 2.038807019516324, 'Kelloggs', 'Cold'],
       [-0.9746794344808966, -0.8250184069873556, -0.13276489651437107,
        'Kelloggs', 'Cold'],
       [0.0, 1.676794167714842, 3.1574955823200144, 'General Mills',
        'Cold'],
       [0.0, -0.142705886614029, -0.13276489651437107, 'Quaker Oats',
        'Cold'],
       [0.0, -0.37014339340513785, -0.13276489651437107, 'Post', 'Cold'],
       [-0.9746794344808966, 0.08473162017707987, -0.13276489651437107,
        'Kelloggs', 'Cold'],
       [0.0, 0.7670441405504065, -0.13276489651437107, 'General Mills',
        'Cold'],
       [0.0, 2.0200508536226356e-16, -0.13276489651437107, 'Quaker Oats',
        'Cold'],
       [-0.9746794344808966, -0.8250184069873556, -0.13276489651437107,
        'Kelloggs', 'Cold'],
       [0.0, -0.8250184069873556, -0.13276489651437107, 'Ralston Purina',
        'Cold'],
      