<a href="https://www.kaggle.com/code/dmquindoza/cereals-exploratory-data-analysis?scriptVersionId=143307776" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Cereals! 🥣 Exploratory Data Analysis

**This notebook follows the six practices of EDA from the Google Analytics Course. The practices come in no fixed order, since we sometimes need to repeat certain steps after validating our data.**

![PACE.png](https://i.ibb.co/BTT0JSZ/EDA.png)


In [None]:
#Import Libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
filepath = '/kaggle/input/80-cereals/cereal.csv'

cereals = pd.read_csv(filepath)

# **Discovering**
**In this phase we familiarize ourselves with our data**

In [None]:
#View the first rows to see what we are working with
cereals.head()

In [None]:
#View dataset summary: shape is (rows, columns)
print("Dataset rows and columns:", cereals.shape)
print("Dataset size:", cereals.size)

In [None]:
cereals.info()

In [None]:
#Some Summary statistics
cereals.describe()

In [None]:
#Let's see the overall distribution of our data

plt.figure(figsize=(8,6))
plt.title("Overall Cereals Data Distribution")
sns.histplot(data = cereals)

**Our data appears to be right-skewed, indicating a positive skew in the data distribution. In the following sections of our notebook, we will create histograms for each variable to gain a deeper understanding of our data.**
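
The skew seen in the plot can also be put into numbers. The following is a minimal sketch, assuming we only care about numeric columns; the helper name `numeric_skewness` and the toy frame are hypothetical and only there for illustration. In the notebook, `cereals` would be passed instead.

```python
import pandas as pd

def numeric_skewness(df: pd.DataFrame) -> pd.Series:
    """Return the sample skewness of every numeric column, largest first."""
    return df.select_dtypes(include="number").skew().sort_values(ascending=False)

# Toy data for illustration only (not taken from the cereals dataset).
toy = pd.DataFrame({
    "right_skewed": [1, 1, 1, 2, 2, 3, 10],
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
})

skew = numeric_skewness(toy)
print(skew)
```

A positive value points to a right skew, a negative value to a left skew, and a value near zero to a roughly symmetric column.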

# Structuring/Joining

**Our dataset appears to be well-structured, as evident from its Kaggle datacard page. It is easily organized, searchable, and suitable for analysis. In this notebook, we will further examine it for any missing data, duplicates, or outliers. So far, there are no issues with typos, column names, or data types.**

# Cleaning

**In this phase, we will now look for missing data, duplicates, or outliers.**

In [None]:
#Check for missing data
cereals.isnull().any()

**The isnull() check shows no missing values in any of our columns. Now let's check for duplicates.**
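
One caveat: `isnull()` only detects `NaN`. If a dataset encodes missing measurements with a placeholder such as -1 (an assumption worth checking, not something established above), those rows would pass the check unnoticed. Since nutritional quantities cannot be negative, a rough sketch is to count negative entries per numeric column. The helper name `count_negative_values` and the toy frame are hypothetical.

```python
import pandas as pd

def count_negative_values(df: pd.DataFrame) -> pd.Series:
    """Count negative entries per numeric column, keeping only affected columns."""
    numeric = df.select_dtypes(include="number")
    counts = (numeric < 0).sum()
    return counts[counts > 0]

# Toy data for illustration only; in the notebook, pass `cereals` instead.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "sugars": [6, -1, 3],
    "fiber": [2.0, 0.0, 1.5],
})

negatives = count_negative_values(toy)
print(negatives)
```

An empty result would mean no column holds negative values.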

In [None]:
cereals.duplicated().any()

**No duplicates were found in the dataset. Next, we will generate histograms to identify any unusual outliers within our variables.**

In [None]:
# Let's create a helper function so that we don't need to retype the plotting code for each column
def plot_histogram(column_data, column_name):
    plt.figure(figsize=(5, 3))
    plt.title(f"Distribution of {column_name}")
    sns.histplot(column_data, kde=True) 
    plt.show()


In [None]:
plot_histogram(cereals['mfr'], 'Manufacturer')

**The plot reveals that Kellogg's and General Mills are the most frequent cereal manufacturers in our dataset. Because manufacturer is a categorical variable, the two tallest bars are best read as counts per category rather than as a bimodal distribution; they highlight the dominance of these two manufacturers in our product range.**
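
For categorical columns such as `mfr` or `type`, a frequency table gives the same information as the plot in exact numbers. This is a small sketch; the helper name `category_summary` and the toy values are hypothetical.

```python
import pandas as pd

def category_summary(column: pd.Series) -> pd.DataFrame:
    """Return the count and the share of each category in a column."""
    counts = column.value_counts()
    share = column.value_counts(normalize=True).round(3)
    return pd.DataFrame({"count": counts, "share": share})

# Toy data for illustration only; in the notebook, pass cereals['mfr'].
toy = pd.Series(["K", "K", "G", "P"], name="mfr")

summary = category_summary(toy)
print(summary)
```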

In [None]:
plot_histogram(cereals['type'], 'Cold or Hot Types of Cereal')

**Our analysis suggests that the majority of our cereals are designed to be served cold, with fewer options suitable for hot consumption.**

In [None]:
plot_histogram(cereals['calories'], 'Calories')

**The distribution of calories in our dataset exhibits a normal distribution, characterized by a bell-shaped curve. Most cereals in our dataset provide around 100-120 calories per serving.**

In [None]:
plot_histogram(cereals['protein'], 'Protein')

**The distribution of protein content among our products predominantly falls within the range of 2-3 grams, as indicated by our right-skewed histogram.**

In [None]:
plot_histogram(cereals['fat'], 'Fat')

**The distribution of fat content in grams also exhibits a right-skewed pattern, with the majority of servings containing 0-1 gram. Some cereals have up to 5 grams of fat per serving, which, although higher, does not qualify as an extreme outlier and does not significantly impact our analysis.**

In [None]:
plot_histogram(cereals['sodium'], 'Sodium')

**Sodium content appears to be approximately normally distributed, with most cereals containing 150-250 milligrams of sodium per serving.**

In [None]:
plot_histogram(cereals['fiber'], 'Fiber')

**The distribution of dietary fiber content in our cereals exhibits a positive skew in the histogram. Most cereals contain 0 to 6 grams of dietary fiber per serving, with some outliers providing as much as 14 grams of fiber.**

In [None]:
plot_histogram(cereals['carbo'], 'Carbohydrates')

**Regarding the distribution of carbohydrates in the dataset, it exhibits a left-skewed distribution or negative skewness. The majority of cereals offer approximately 10-20 grams of complex carbohydrates.**  

In [None]:
plot_histogram(cereals['sugars'], 'Sugar')

**In contrast, the distribution of sugar content in grams shows a uniform distribution, indicating an even distribution of sugar content across our cereals.**

In [None]:
plot_histogram(cereals['potass'], 'Potassium')

**The distribution of potassium content in milligrams reveals a right-skewed distribution, with the majority of cereals containing 0-150 milligrams of potassium. However, there are some outliers with values as high as 300 milligrams.**

In [None]:
plot_histogram(cereals['vitamins'], 'Vitamins')

**In terms of vitamins, the vast majority of cereals contain either 0, 25, or 100 units. These values align with the typical recommended percentages outlined by the FDA on the data card.**

In [None]:
plot_histogram(cereals['cups'], 'Cups')

**When we examine the distribution of the number of cups per serving, our data reveals a bimodal distribution. This implies that the majority of cereals typically offer either 0.8 or 1 cup per serving.**

In [None]:
plot_histogram(cereals['rating'], 'Consumer Rating')

**The dataset reveals that the majority of cereals have ratings falling within the 30 to 40 point range. However, there are noteworthy outliers with ratings as high as 90 points, resulting in a right-skewed distribution.**

#### Outliers

**So far, our histograms have revealed the presence of outliers. However, it's important to note that these outliers do not represent extreme values and can be considered valid observations within our dataset. As a result, there is currently no immediate need to address them.** 
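
To complement the visual inspection, outliers can also be flagged with a rule of thumb. The sketch below uses the common 1.5 × IQR rule, which is a technique added here for illustration and not the method used in the cells above; the helper name `iqr_outliers` and the toy values are hypothetical.

```python
import pandas as pd

def iqr_outliers(column: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = column.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return column[(column < lower) | (column > upper)]

# Toy data for illustration only; in the notebook, pass e.g. cereals['fiber'].
toy = pd.Series([1, 2, 2, 3, 3, 3, 4, 14], name="fiber")

outliers = iqr_outliers(toy)
print(outliers)
```

Values flagged this way still need a judgment call: as noted above, a flagged value can be a perfectly valid observation.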

# Validation 

**Given the absence of missing values, duplicates, or extreme values in our dataset, which is already structured, there is no need for further checks in this phase of validation.** 

# Presenting and Showing Visualizations

**For this cereal dataset, let's create some questions that we might want to answer using visualizations to make it easier for us to convey our findings to the stakeholders.** 

- Which brand has the highest rating?
- Are there any relationships between sugar content and rating?
- Is there a relationship between fat content and calories?
- What are the most influential factors or features that contribute to the ratings of cereals in the Cereals80 dataset?

## Which brand has the highest rating?

In [None]:
# Sort the DataFrame by rating in descending order
cereals_sorted = cereals.sort_values(by='rating', ascending=False)

plt.figure(figsize=(14, 7))
plt.title("Brands and Their Ratings")
plt.xticks(rotation=90)
sns.barplot(data=cereals_sorted, x=cereals_sorted['name'], y=cereals_sorted['rating'])

**Based on our barplot, it seems that the top five brands with the highest ratings are:**

1. All-Bran with Extra Fiber
2. Shredded Wheat 'n' Bran
3. Shredded Wheat Spoon Size
4. 100% Bran
5. Shredded Wheat
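
With this many bars on one axis, reading the top entries off the plot is error-prone. A short sketch of retrieving them directly follows; the helper name `top_rated` and the toy rows are hypothetical, while the column names `name` and `rating` are the ones used in the notebook.

```python
import pandas as pd

def top_rated(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the n rows with the highest rating, best first."""
    return df.nlargest(n, "rating")[["name", "rating"]].reset_index(drop=True)

# Toy data for illustration only; in the notebook, pass `cereals` instead.
toy = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C", "Cereal D"],
    "rating": [40.0, 75.5, 60.2, 33.1],
})

top = top_rated(toy, n=2)
print(top)
```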


## Are there any relationships between sugar content and rating?



In [None]:
plt.figure(figsize=(10, 6))
plt.title('Relationship between Sugar Content and Rating')
plt.xticks(rotation=90)
sns.regplot(data=cereals, x=cereals['sugars'], y=cereals['rating'])

**According to our regression plot, there appears to be a negative correlation between sugar content and rating: cereals with less sugar tend to have higher ratings.**
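
The direction and strength of this relationship can be checked with the Pearson coefficient. This is a minimal sketch; the helper name `pearson_r` and the toy values are hypothetical.

```python
import pandas as pd

def pearson_r(df: pd.DataFrame, x: str, y: str) -> float:
    """Return the Pearson correlation coefficient between two columns."""
    return df[x].corr(df[y], method="pearson")

# Toy data for illustration only; in the notebook, pass `cereals` instead.
toy = pd.DataFrame({
    "sugars": [0, 3, 6, 9, 12],
    "rating": [70, 60, 45, 40, 30],
})

r = pearson_r(toy, "sugars", "rating")
print(f"Pearson r between sugars and rating: {r:.2f}")
```

A coefficient below zero matches a downward-sloping regression line.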

## Is there a relationship between fat content and calories?

In [None]:
plt.figure(figsize=(10, 6))
plt.title('Relationship between Fat Content and Rating')
plt.xticks(rotation=90)
sns.regplot(data=cereals, x=cereals['fat'], y=cereals['rating'])

**Our regression plot of fat content against rating suggests a negative correlation: cereals with lower fat content tend to have higher ratings. This observation implies that consumers may prioritize cereals with lower fat content when making purchasing decisions.**
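
The heading of this section asks about fat and calories, while the plot relates fat to rating. One way to look at both at once is a small correlation table over the three columns. This is only a sketch; the helper name `pairwise_correlations` and the toy values are hypothetical.

```python
import pandas as pd

def pairwise_correlations(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return the Pearson correlation matrix for the selected columns."""
    return df[list(columns)].corr(method="pearson")

# Toy data for illustration only; in the notebook, pass `cereals` instead.
toy = pd.DataFrame({
    "fat": [0, 1, 1, 2, 3],
    "calories": [90, 100, 110, 120, 150],
    "rating": [60, 50, 45, 40, 30],
})

corr = pairwise_correlations(toy, ["fat", "calories", "rating"])
print(corr.round(2))
```

The `fat`/`calories` cell answers the question in the heading, and the `fat`/`rating` cell corresponds to the regression plot.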

## What are the most influential factors or features that contribute to the ratings of cereals in the Cereals80 dataset?

### Evaluating Linear Relationships between Features and Ratings using the Pearson Correlation Method 

In [None]:
# Exclude non-numeric columns ('name', 'type', 'mfr') as well as 'shelf', 'cups', and 'weight'
numeric_columns = cereals.drop(columns=['name', 'type', 'mfr','shelf', 'cups', 'weight'])

# Create a correlation matrix for numeric columns
correlation_matrix = numeric_columns.corr(method = 'pearson')

# Create a heatmap for correlation with 'rating'
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix[['rating']].sort_values(by='rating', ascending=False), annot=True)
plt.title("Correlation Heatmap with Ratings")
plt.show()


**Our heatmap reveals significant insights into the factors influencing consumer ratings. Fiber, protein, potassium, and carbohydrates exhibit a positive linear relationship with ratings. Conversely, vitamins, sodium, fat, calories, and sugars display a negative correlation, suggesting that consumers tend to favor cereals with lower calorie, fat, and sugar content, while showing a preference for those rich in fiber, protein, potassium, and carbohydrates.**

**Among these factors, sugars emerge as the most influential for consumers, bearing the strongest negative correlation. In contrast, fiber stands out with the highest positive correlation, indicating a strong inclination toward cereals with higher fiber content.**
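
Since influence here is judged by the strength of the correlation regardless of sign, it can help to rank the features by absolute value while keeping the sign visible. This is a sketch under that assumption; the helper name `rank_by_abs_correlation` and the toy values are hypothetical.

```python
import pandas as pd

def rank_by_abs_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Rank numeric features by the absolute Pearson correlation with target."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data for illustration only; in the notebook, pass `cereals` instead.
toy = pd.DataFrame({
    "sugars": [0, 3, 6, 9, 12],
    "fiber": [5, 1, 4, 0, 2],
    "rating": [70, 60, 45, 40, 30],
})

ranked = rank_by_abs_correlation(toy, "rating")
print(ranked.round(2))
```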

**In our next analysis, we will explore which feature plays the most critical role in predicting cereal ratings using a Random Forest model.**

### Using the Random Forests to check for Feature Importance 

In [None]:
#Let's drop the non-numerical features, the target ('rating'), and 'shelf', 'cups', and 'weight' for our X variable
X = cereals.drop(columns=['name', 'type', 'mfr', 'rating','shelf', 'cups', 'weight'])
y = cereals['rating']

#Fit our model
model = RandomForestRegressor(random_state=42)  # fixed seed so the importances are reproducible
model.fit(X, y)

#Assign our important features for visualization
feature_importances = model.feature_importances_

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x=feature_importances, y=X.columns)
plt.xlabel("Feature Importance")
plt.ylabel("Features")
plt.title("Feature Importances")
plt.show()


**Using a Random Forest Regressor, we gain insight into which features have the greatest influence on predicting consumer ratings. The bar chart shows sugars and calories as the two most important factors for forecasting ratings. Therefore, if we ever develop a machine learning model, sugars and calories emerge as the best candidates for predictor variables. With this, we can ready our data for the first step of Feature Engineering, which is Feature Selection.**
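
Impurity-based importances from a single Random Forest fit can shift from one random seed to another, especially on a small dataset. One possible way to get a steadier ranking is to average the importances over several seeds. This is a sketch of that idea, not the approach used in the cells above; the helper name `averaged_importances`, the chosen seeds, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def averaged_importances(X: pd.DataFrame, y: pd.Series, seeds=(0, 1, 2)) -> pd.Series:
    """Average RandomForestRegressor feature importances over several seeds."""
    rows = []
    for seed in seeds:
        model = RandomForestRegressor(n_estimators=100, random_state=seed)
        model.fit(X, y)
        rows.append(model.feature_importances_)
    mean_importance = np.mean(rows, axis=0)
    return pd.Series(mean_importance, index=X.columns).sort_values(ascending=False)

# Synthetic data for illustration only: the target depends on 'signal', not on 'noise'.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_toy = 3 * X_toy["signal"] + rng.normal(scale=0.1, size=200)

importances = averaged_importances(X_toy, y_toy)
print(importances)
```

In the notebook, `X` and `y` from the cell above would take the place of the synthetic inputs.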

---

*Thank you for taking the time to explore this notebook! I hope you found the insights and analysis useful. If you have any questions or feedback, please feel free to reach out. :)*

---
