# Preliminary Data Analytic Pipeline 
Provide a coded solution for each area below.  Where appropriate show output and explanations/insights.  Make sure it runs properly.
You will need to install if required just

!pip install < lib >
* [pandas](https://pandas.pydata.org/)
* [numpy](https://numpy.org/) 
* [sklearn](https://scikit-learn.org/stable) 
* [ydata_profiling](https://ydata-profiling.ydata.ai/docs/master/pages/getting_started/overview.html)

In [1]:
import pandas as pd
import numpy as np
import os
import sklearn
from sklearn.preprocessing import MinMaxScaler
# from ydata_profiling import ProfileReport
# from ydata_profiling.utils.cache import cache_file

## Data Integration
Select a data store for **categorization** noting multiple input (X) features and one or more label(s) (y). For example if it is a fraud data set having multiple input features such as age, salary, transaction cost, etc and a label indicating fraudulent or not.  If you are ahead of the curve, you make work on the data set for your final project but it is not necessary.  One option is select the data from [kaggle.com](https://www.kaggle.com/datasets). Bring the data into a pandas dataframe. Do not select unstructured, free-form text or graphic data for this assignment. 
* Note why you selected this data set.

In [2]:
# data integration
#This dataset contains nutritional data on 80 cereal products, including calorie, protein, carbohydrates, and fat contents.
#Ratings are based on consumer preference.
#This code categorizes cereal brands by calories,protein, sugars, and fat content into a scale of customer ratings
#Dataset source: https://www.kaggle.com/datasets/crawford/80-cereals

df=pd.read_csv('cereal.csv')
print(df)

                         name mfr type  calories  protein  fat  sodium  fiber  \
0                   100% Bran   N    C        70        4    1     130   10.0   
1           100% Natural Bran   Q    C       120        3    5      15    2.0   
2                    All-Bran   K    C        70        4    1     260    9.0   
3   All-Bran with Extra Fiber   K    C        50        4    0     140   14.0   
4              Almond Delight   R    C       110        2    2     200    1.0   
..                        ...  ..  ...       ...      ...  ...     ...    ...   
72                    Triples   G    C       110        2    1     250    0.0   
73                       Trix   G    C       110        1    1     140    0.0   
74                 Wheat Chex   R    C       100        3    1     230    3.0   
75                   Wheaties   G    C       100        3    1     200    3.0   
76        Wheaties Honey Gold   G    C       110        2    1     200    1.0   

    carbo  sugars  potass  

## Format and Type
Determine the format of the file and the types of each feature.

In [3]:
## FORMAT AND TYPE

#Determine the file extension using the os library
split_tuple = os.path.splitext('cereal.csv')
print(split_tuple)
  
file_name = split_tuple[0]
file_format = split_tuple[1]
  
print("File Name: ", file_name)
print("File Extension: ", file_format)

#Determine types for each of the 16 features in cereal.csv
df.dtypes

('cereal', '.csv')
File Name:  cereal
File Extension:  .csv


name         object
mfr          object
type         object
calories      int64
protein       int64
fat           int64
sodium        int64
fiber       float64
carbo       float64
sugars        int64
potass        int64
vitamins      int64
shelf         int64
weight      float64
cups        float64
rating      float64
dtype: object

## Analysis
Determine the dynamics of each feature (int/float - math stats, text - categorical or not)

In [4]:
# ANALYSIS
# We convert objects to strings. Although, objects are "by default" strings in python, we convert to the pandas "string" data type as recommended
# The reason is objects have a broader scope than pandas strings and we want to ensure cleanliness
df['name'] = df['name'].astype("string")
df['mfr'] = df['mfr'].astype("string")
df['type'] = df['type'].astype("string")


#Nutritional features will be converted to uint8 for efficiency
df['calories'] = df['calories'].astype("uint8")
df['protein'] = df['protein'].astype("uint8")
df['fat'] = df['fat'].astype("uint8")
df['sodium'] = df['sodium'].astype("uint8")
df['sugars'] = df['sugars'].astype("uint8")
df['potass'] = df['potass'].astype("uint8")
df['vitamins'] = df['vitamins'].astype("uint8")
df['shelf'] = df['shelf'].astype("uint8")


#float64 data types can be converted to uint8 (by converting first to int64 to avoid errors), except fiber, weight and cups
#'rating' will also be converted to uint8 because for our categorization, we do not need the decimal precision when we normalize the data
df['carbo'] = df['carbo'].round().astype("int64")
df['carbo'] = df['carbo'].astype("uint8")

df['rating'] = df['rating'].round().astype("int64")
df['rating'] = df['rating'].astype("uint8")


print(df.dtypes)
print(df)

name         string
mfr          string
type         string
calories      uint8
protein       uint8
fat           uint8
sodium        uint8
fiber       float64
carbo         uint8
sugars        uint8
potass        uint8
vitamins      uint8
shelf         uint8
weight      float64
cups        float64
rating        uint8
dtype: object
                         name mfr type  calories  protein  fat  sodium  fiber  \
0                   100% Bran   N    C        70        4    1     130   10.0   
1           100% Natural Bran   Q    C       120        3    5      15    2.0   
2                    All-Bran   K    C        70        4    1       4    9.0   
3   All-Bran with Extra Fiber   K    C        50        4    0     140   14.0   
4              Almond Delight   R    C       110        2    2     200    1.0   
..                        ...  ..  ...       ...      ...  ...     ...    ...   
72                    Triples   G    C       110        2    1     250    0.0   
73                

## Clean up
* Find and List number of blank entries and outliers/errors
* Take corrective actions and provide justification
* Remove unnecessary features
* If a categorical approach breakout the input features (X) from the output features (y)

In [5]:
# CLEAN UP

#Count and print blank data entries, if any
blank_entries_count = df.isna().sum()
print("Count of blank entries per column\n",blank_entries_count)

#Replace any blank data entries with zeros
df.fillna(0,inplace=True)

#Breakout inputs(X) from outputs(y) and remove unnecessary data
df_inputs = df[['name','calories','protein','fat','sugars']].copy()
print(df_inputs)

df_outputs = df[['rating']].copy()
print(df_outputs)


Count of blank entries per column
 name        0
mfr         0
type        0
calories    0
protein     0
fat         0
sodium      0
fiber       0
carbo       0
sugars      0
potass      0
vitamins    0
shelf       0
weight      0
cups        0
rating      0
dtype: int64
                         name  calories  protein  fat  sugars
0                   100% Bran        70        4    1       6
1           100% Natural Bran       120        3    5       8
2                    All-Bran        70        4    1       5
3   All-Bran with Extra Fiber        50        4    0       0
4              Almond Delight       110        2    2       8
..                        ...       ...      ...  ...     ...
72                    Triples       110        2    1       3
73                       Trix       110        1    1      12
74                 Wheat Chex       100        3    1       3
75                   Wheaties       100        3    1       3
76        Wheaties Honey Gold       110       

## Normalize
Dont worry about text features but you must normalize the numeric features. 
* Provide rationale as to why the particular normalization feature was selected.

In [6]:
#NORMALIZE THE NUMERIC FEATURES
#Normalization done using the min/max method to scale all numeric values within 0 and 1, and no negative values.

df_inputs['calories'] = (df_inputs['calories'] - df_inputs['calories'].min()) / (df_inputs['calories'].max() - df_inputs['calories'].min())   
df_inputs['protein'] = (df_inputs['protein'] - df_inputs['protein'].min()) / (df_inputs['protein'].max() - df_inputs['protein'].min())   
df_inputs['fat'] = (df_inputs['fat'] - df_inputs['fat'].min()) / (df_inputs['fat'].max() - df_inputs['fat'].min())   
df_inputs['sugars'] = (df_inputs['sugars'] - df_inputs['sugars'].min()) / (df_inputs['sugars'].max() - df_inputs['sugars'].min())   
print(df_inputs)

df_outputs['rating'] = (df_outputs['rating'] - df_outputs['rating'].min()) / (df_outputs['rating'].max() - df_outputs['rating'].min())
print(df_outputs)

                         name  calories  protein  fat    sugars
0                   100% Bran  0.181818      0.6  0.2  0.023529
1           100% Natural Bran  0.636364      0.4  1.0  0.031373
2                    All-Bran  0.181818      0.6  0.2  0.019608
3   All-Bran with Extra Fiber  0.000000      0.6  0.0  0.000000
4              Almond Delight  0.545455      0.2  0.4  0.031373
..                        ...       ...      ...  ...       ...
72                    Triples  0.545455      0.2  0.2  0.011765
73                       Trix  0.545455      0.0  0.2  0.047059
74                 Wheat Chex  0.454545      0.4  0.2  0.011765
75                   Wheaties  0.454545      0.4  0.2  0.011765
76        Wheaties Honey Gold  0.545455      0.2  0.2  0.031373

[77 rows x 5 columns]
      rating
0   0.657895
1   0.210526
2   0.539474
3   1.000000
4   0.210526
..       ...
72  0.276316
73  0.131579
74  0.421053
75  0.447368
76  0.236842

[77 rows x 1 columns]


## Feature and Label Selection
Down select from your data, the input features and label(s)

In [7]:
# Set up the input features (X) and the assocated label(s) (y)
#Inputs
x0=df_inputs['name']
x1=df_inputs['calories']
x2=df_inputs['protein']
x3=df_inputs['fat']
x4=df_inputs['sugars']

#Labels
y1=df_outputs['rating']

## Split into 3 data sets for training, validation, and test (Explain your % for each)

In [8]:
# SPLIT
# 70% Training / 15% Validation / 15% Test
# Following standard best practices, 70% of the data set will be assigned for training the model. It is important to train the model with a diverse set of inputs and scenarios,
# therefore, the majority of the data is designated for training.
# The remaining data will be split evenly among validation and testing.

training_length=54
validation_length=12
test_lenght=len(df.index)-training_length-validation_length

#Training
df_inputs_training = df_inputs.loc[0:training_length-1]
df_outputs_training = df_outputs.loc[0:training_length-1]

#Validation
df_inputs_validation = df_inputs.loc[training_length:training_length+validation_length-1]
df_outputs_validation = df_outputs.loc[training_length:training_length+validation_length-1]

#Test
df_inputs_test = df_inputs.loc[training_length+validation_length:len(df.index)-1]
df_outputs_test = df_outputs.loc[training_length+validation_length:len(df.index)-1]

print(df_inputs_training)
print(df_outputs_training)

print(df_inputs_validation)
print(df_outputs_validation)

print(df_inputs_test)
print(df_outputs_test)


                                      name  calories  protein  fat    sugars
0                                100% Bran  0.181818      0.6  0.2  0.023529
1                        100% Natural Bran  0.636364      0.4  1.0  0.031373
2                                 All-Bran  0.181818      0.6  0.2  0.019608
3                All-Bran with Extra Fiber  0.000000      0.6  0.0  0.000000
4                           Almond Delight  0.545455      0.2  0.4  0.031373
5                  Apple Cinnamon Cheerios  0.545455      0.2  0.4  0.039216
6                              Apple Jacks  0.545455      0.2  0.0  0.054902
7                                  Basic 4  0.727273      0.4  0.4  0.031373
8                                Bran Chex  0.363636      0.2  0.2  0.023529
9                              Bran Flakes  0.363636      0.4  0.0  0.019608
10                            Cap'n'Crunch  0.636364      0.0  0.4  0.047059
11                                Cheerios  0.545455      1.0  0.4  0.003922

## Summary
# Provide your thoughts on the quality, amount, trustworthiness, diffencencies, timeliness, and available documentation on the data you selected.  This can be written and/or code to demonstrate your conclusions.
* Determine if the data selected is suitable for a machine learning ingest.
* Note there are other prepossessing steps depending on the data such as graphics, free form text, and graphs and/or the type of model such as a time series model.  These topics are covered in the upcoming modules.

In [9]:
# SUMMARY AND QUALITY CHECK
# The features obtained in this data set are of high quality. No information is missing and the fields were properly formatted.
# The quantity was relatively deficient, since 80 rows of information make it difficult to have large enough data sets after splitting between training and validation.
# The data is not spare. All the features fall within the overarching "cereals" category and the numeric features are within expected range.
# Althought the data set was created 6 years ago, it is unlikely that much has changed in terms of cereal nutrional content. However, a more recent data set can provide
# information about brands newly introduced to the market.
# For additional documentation on the data set selected for this exercise, please refer to:
#Dataset source: https://www.kaggle.com/datasets/crawford/80-cereals

# Quality Check
After your analysis provide details on the following qualities of your selected data.
* Overall Quality of the data
* Sufficient amount of the data
* Spareness of any data categories (eg. no young adults)
* Trustworthiness of the data (Is it true?)
* Timeliness of the data (is it recent?)  What might be the problem if it is not?
* Note difficenties
* Available document on the data types, how the data was collected, how it was verified?

Provide your answers here for the quality check...

# Extra Credit
**Describe and demonstrate**  an interesting, useful, and unusual feature of one of the data listed libraries worthy of sharing with your class. **Or** do the same with a useful feature from a data library not used here.

In [10]:
# Extra Credit