# Data Analysis

## <a id="index">Table of Contents:</a>
* [Intro](#intro)
* [Load Data](#load-data)
* [Data Validation](#validation)
* [Exploratory Data Analysis](#eda)
    * [Category Analysis](#category)
    * [Servings Analysis](#servings)
    * [High-Traffic Column Analysis](#target-analysis)
    * [Handling Missing Values](#missing-values)
    * [Other Numeric Features Analysis](#numeric-features)

## <a id="intro">Intro</a> <font size='2'>[Table of contents🔝](#index)</font>


The primary goal of this notebook is to prepare our dataset for subsequent analysis and modeling. In doing so, we will conduct data preprocessing tasks to enhance data quality, remove duplicates, handle missing values, and address outliers. Additionally, we will perform exploratory analysis to gain insights into the dataset's characteristics.

As part of our data preprocessing journey, we will introduce the concept of a data preprocessing pipeline. This pipeline will help streamline and organize the various data preparation tasks. 

Let's get started!

## <a id="load-data">Load Data</a> <font size='2'>[Table of contents🔝](#index)</font>

Imports and function definitions

In [None]:
import matplotlib.pyplot as plt 
import numpy as np

import os
import pandas as pd
import plotly.express as px
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# import warnings
# # Ignore FutureWarnings
# warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
def get_dataTypes_and_missingValues(df):
    """Summarize each column's dtype, number of unique values, and number of missing values."""
    info = pd.DataFrame()
    info['data_types'] =  df.dtypes
    info['unique_values'] = df.nunique()
    info['missing_values'] = df.isna().sum()
    return info

In [None]:
data_dir = os.path.join("..","data")
print(f"The files contained in the data directory are: {', '.join(os.listdir(data_dir))}")

In [None]:
data_path = os.path.join(data_dir, "recipe_site_traffic.csv")
raw_df = pd.read_csv(data_path)

## <a id="validation">Data Validation</a> <font size='2'>[Table of contents🔝](#index)</font>

In [None]:
print(f"There are {raw_df.shape[0]} rows and {raw_df.shape[1]} columns")

In [None]:
raw_df.head()

There are some missing values, and the recipe column is a unique identifier for each recipe/row.

The recipe column can be dropped and the other columns explored further.

In [None]:
raw_df.drop("recipe", axis=1, inplace=True)
raw_df.head()

The high_traffic column appears to hold the target values. The missing values in the target need to be handled before further analysis is performed.

In [None]:
print(f"The unique values in the target values are {raw_df['high_traffic'].unique()}")

Since the target contains 'High' for popular recipes and that is its only unique value, the missing values represent recipes that were not popular and will be set to 'Low'.

In [None]:
raw_df['high_traffic'] = raw_df['high_traffic'].fillna('Low')

We visualize the distribution of traffic to get more insight.

In [None]:
raw_df

In [None]:
explode = (0.1, 0)
colors = ["#17becf", "#e41a1c"]
raw_df['high_traffic'].value_counts().plot.pie(
    autopct='%.2f%%', startangle=135, explode=explode, shadow=True, colors=colors
)
plt.title('Distribution of Recipe Traffic')
plt.axis('equal')
plt.show()

Before preprocessing the data, we will check for duplicates and then split the data to avoid **data leakage**.

In [None]:
print(f"There are {raw_df.duplicated().sum()} duplicates in the dataframe")

Removing the duplicates from the dataframe below

In [None]:
raw_df.drop_duplicates(inplace=True)
print(f"After dropping the duplicates there are {len(raw_df)} observations")

Since the duplicates have been removed the data can be split.

In [None]:
X = raw_df.drop('high_traffic', axis=1)
y = raw_df['high_traffic']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.15, random_state=42, shuffle=True)

In [None]:
print("After the split the observations in the data are")
print(f"Train: {len(X_train)}, validation: {len(X_val)}, test: {len(X_test)}")
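The two-step split above takes 20% of the rows for testing and then 15% of the remaining 80% for validation, which works out to roughly 68% train, 12% validation, and 20% test. The following is a minimal sketch of that arithmetic on synthetic data; the `*_demo` names and the generated values are assumptions for illustration only, not the recipe dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target (hypothetical data).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"calories": rng.uniform(0, 500, size=100)})
y_demo = pd.Series(rng.choice(["High", "Low"], size=100), name="high_traffic")

# Same two-step split as in the notebook: 20% test, then 15% of the rest for validation.
X_tmp, X_test_demo, y_tmp, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, shuffle=True
)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_tmp, y_tmp, test_size=0.15, random_state=42, shuffle=True
)

# Report each split as an absolute count and as a fraction of the full data.
split_sizes = {
    "train": len(X_train_demo),
    "validation": len(X_val_demo),
    "test": len(X_test_demo),
}
split_fractions = {name: size / len(X_demo) for name, size in split_sizes.items()}
print(split_sizes)
print(split_fractions)
```

If the class balance matters, `train_test_split` also accepts a `stratify` argument (for example `stratify=y_demo`) to keep the High/Low ratio similar across splits; the notebook itself does not use it.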

## <a id="eda">Exploratory Analysis</a> <font size='2'>[Table of contents🔝](#index)</font>

Let's get a better overview of the data using visualizations and complete the data validation from the previous step, preparing the data for model training.

In [None]:
train_df = pd.concat([X_train, y_train], axis=1).reset_index(drop=True)
train_df.head()

### <a id="category">Category Analysis</a> <font size='2'>[Table of contents🔝](#index)</font>

Taking a closer look at the category column

In [None]:
train_df['high_traffic'].unique()

In [None]:
sns.histplot(train_df, x='category', hue='high_traffic', multiple='stack')
plt.title('Recipe Food Categories')
plt.xticks(rotation=90)
plt.show()

Vegetable and potato recipes seem to generate the most high traffic relative to low traffic, whereas beverages and breakfast items generate the least high traffic relative to low traffic.

In [None]:
categories = train_df['category'].unique()
print(f"The {len(categories)} categories in the train features are:\n {', '.join(categories)}")

The category column must contain only the following 10 possible categories:

'Lunch/Snacks', 'Beverages', 'Potato', 'Vegetable', 'Meat', 'Chicken', 'Pork', 'Dessert', 'Breakfast', 'One Dish Meal'

However, there are 11 categories: an extra category, 'Chicken Breast', was added. This category needs to be converted to 'Chicken'.
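A check like this can also be done programmatically by comparing the observed categories against the allowed set. The sketch below uses a small made-up series (`category_demo`) rather than the real column, and the names `allowed_categories` and `unexpected` are introduced here purely for illustration.

```python
import pandas as pd

# The 10 allowed categories listed above.
allowed_categories = {
    "Lunch/Snacks", "Beverages", "Potato", "Vegetable", "Meat",
    "Chicken", "Pork", "Dessert", "Breakfast", "One Dish Meal",
}

# Hypothetical sample of category values, including one unexpected label.
category_demo = pd.Series(["Chicken", "Chicken Breast", "Pork", "Dessert"])

# Any observed value that is not in the allowed set needs cleaning.
unexpected = sorted(set(category_demo.unique()) - allowed_categories)
print(f"Unexpected categories: {unexpected}")
```

The same comparison can be rerun after cleaning to confirm that the list of unexpected categories is empty.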

In [None]:
train_df['category'] = train_df['category'].replace('Chicken Breast', 'Chicken')

In [None]:
print("After replacing the occurrences of Chicken Breast with Chicken")
categories = train_df['category'].unique()
print(f"The {len(categories)} categories in the train features are:\n {', '.join(categories)}")

Visualizing the Food Categories

In [None]:
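# Use one pastel color per category so wedge colors are not repeated
colors = sns.color_palette('pastel', n_colors=train_df['category'].nunique())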

train_df['category'].value_counts().plot.pie(colors=colors,
                                            autopct='%.2f%%', 
                                            shadow=True, startangle=140)
plt.title("Distribution of the category of food items")
plt.axis('equal')
plt.show()

The chicken category is the largest category in the dataframe.

In [None]:
sns.histplot(train_df, x='category', hue='high_traffic', multiple='stack')
plt.title('Recipe Food Categories')
plt.xticks(rotation=90)
plt.show()

After merging the 2 chicken categories, chicken generates the most traffic overall (high and low), but chicken-based dishes seem to generate nearly as much high traffic as low traffic.

**Encoding Categorical Values**

Converting the categories to numeric representations. We will opt for creating a unique column for each category (one-hot encoding) as there is no inherent order in the categories.

In [None]:
cat_features = ['category']
enc = OneHotEncoder(sparse_output=False)
enc.fit(train_df[cat_features])
converted_categories = enc.get_feature_names_out().tolist()
train_df[converted_categories] = enc.transform(train_df[cat_features])
train_df.drop('category', axis=1, inplace=True)

In [None]:
train_df.head()

After encoding, the target column `high_traffic` should be the only intentionally non-numeric column left. The check below lists every column that is still non-numeric.

In [None]:
non_numeric_cols = train_df.select_dtypes(exclude=np.number).columns.values
print(f"The non numeric columns are {', '.join(non_numeric_cols)}")

### <a id="servings">Servings Analysis</a> <font size='2'>[Table of contents🔝](#index)</font>

In [None]:
train_df.info()

In [None]:
train_df['servings'].sort_values()

The `servings` feature appears to contain only numeric values. However, the column's data type is `object`, so either it contains some invalid values or it was simply stored with the wrong data type.

Below we will try converting the column to the int data type.

In [None]:
try:
    train_df['servings'].astype(int)
except (ValueError, TypeError):
    print("The column contains non-numeric characters")

Let us try to find the non-integer values contained in the servings column.

In [None]:
mask = train_df['servings'].astype(str).str.contains(r'\D', regex=True)
non_numeric_values = train_df[mask]

print("Non-numeric values in servings")
non_numeric_values

There is one occurrence where servings does not contain only numeric characters; however, this entry can be converted to the numeric value 4. The category is snack, and the entry records 4 servings taken as a snack. This is probably a data entry error.

In [None]:
train_df.loc[mask, 'servings'] = 4
train_df['servings'] = train_df['servings'].astype(int)

non_numeric_cols = train_df.select_dtypes(exclude=np.number).columns.values
print(f"The non numeric columns are {', '.join(non_numeric_cols)}")
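The fix above hard-codes the value 4 for the single offending row. If more entries of this kind appeared (for example in the validation or test split), one possible generalization is to extract the leading digits instead. This is only a sketch on made-up values; `servings_demo` and the string `"4 as a snack"` are assumptions for illustration, not values confirmed from the dataset.

```python
import pandas as pd

# Hypothetical servings values stored as strings, one with trailing text.
servings_demo = pd.Series(["4", "6", "4 as a snack", "1", "2"], dtype="object")

# Keep only the leading run of digits from each entry.
leading_digits = servings_demo.astype(str).str.extract(r"^\s*(\d+)", expand=False)

# Convert to a nullable integer type so unparseable entries become <NA> instead of raising.
servings_clean = pd.to_numeric(leading_digits, errors="coerce").astype("Int64")
print(servings_clean.tolist())
```

Entries without any leading digits end up as missing values, which keeps them visible for manual review rather than silently assigning a number.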

### <a id="target-analysis">High-Traffic Column Analysis</a> <font size='2'>[Table of contents🔝](#index)</font>

Visualizing the target column

In [None]:
sns.countplot(train_df, y='servings', hue='high_traffic')
plt.title('Histogram of Serving Size')
plt.show()

Recipes with a serving size of 4 generate the most traffic; however, there is no trend indicating that more servings translate to higher traffic. On its own, the number of servings might not be a good indicator of whether a recipe gets high traffic.

In [None]:
sns.violinplot(train_df, x='servings', y='high_traffic', scale='count')
plt.title('Serving Sizes vs Traffic Volume')
plt.show()

In the violin plot, the high-traffic recipes have a wider violin, indicating that there are more high-traffic recipes than low-traffic recipes when they are distributed by servings.

Some of the serving sizes (3 and 5) are missing; this might be a result of the split. **Could this pose a challenge?**

### <a id="missing-values">Handling Missing Values</a> <font size='2'>[Table of contents🔝](#index)</font>

In [None]:
get_dataTypes_and_missingValues(train_df)

There are still quite a few missing values. Let us explore them further.

In [None]:
missing_indices = train_df[train_df['calories'].isnull()].index
# missing_indices holds index labels, so select with label-based .loc
train_df.loc[missing_indices]

There are 4 columns with 14 missing values each; however, all 14 missing values fall in the same rows, which are spread across different categories and servings. The missing data would be difficult to recreate or predict and does not make up a significant share of the data for this analysis, so these rows will be dropped.
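The claim that the missing values coincide can be verified directly: if a row is missing any nutrient value exactly when it is missing all of them, the gaps line up. The sketch below demonstrates the check on a small made-up frame; `demo`, `nutrient_cols`, and the values in it are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where the nutrient columns are missing in the same rows.
demo = pd.DataFrame({
    "calories": [120.0, np.nan, 300.5, np.nan],
    "carbohydrate": [10.0, np.nan, 42.1, np.nan],
    "sugar": [1.5, np.nan, 8.0, np.nan],
    "protein": [3.0, np.nan, 20.2, np.nan],
    "servings": [1, 2, 4, 6],
})
nutrient_cols = ["calories", "carbohydrate", "sugar", "protein"]

missing = demo[nutrient_cols].isna()
rows_any_missing = missing.any(axis=1)   # at least one nutrient missing
rows_all_missing = missing.all(axis=1)   # every nutrient missing

# The gaps coincide if both masks select exactly the same rows.
missing_coincide = bool((rows_any_missing == rows_all_missing).all())
share_dropped = rows_any_missing.mean()  # fraction of rows that would be dropped
print(missing_coincide, share_dropped)
```

Running the same two masks on `train_df` would confirm both that the gaps coincide and what share of the training rows is removed by dropping them.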

In [None]:
train_df.drop(missing_indices, inplace=True)
print(f"After cleaning up the dataframe there are {train_df.isna().values.sum()} missing values")

### <a id="numeric-features">Other Numeric Features Analysis</a> <font size='2'>[Table of contents🔝](#index)</font>

So far we have looked at 2 of the 6 original feature columns. Further analysis will focus on the other 4 columns and explore their relationship to the target.

In [None]:
numeric_cols = ['calories', 'carbohydrate', 'sugar', 'protein', 'servings']
train_df.loc[:,numeric_cols].describe()

The code above gives the summary statistics of the unexplored columns plus the servings column. 

The ranges are below
* Calories 0.14 - 2906.0
* Carbohydrate 0.05 - 530.4 grams
* Sugar 0.01 - 131.39 grams
* Protein 0.00 - 239.57 grams
* Servings 1 - 6

A graphical visualization might provide insights that are easier to digest.

**Calories**

In [None]:
fig = px.histogram(train_df,
                   x='calories',
                   marginal='box',
                   color='high_traffic',
                   color_discrete_sequence=['green','grey'],
                   title='Distribution of Calories by Traffic')
fig.update_layout(bargap=0.1)
fig.update_layout(width=700, height=500)
fig.show()

The data is right-skewed: most recipes sit in the left part of the graph, with a few outliers toward the right, i.e. toward higher calorie counts.

However, higher traffic seems to be the norm regardless of the calories.

**Carbohydrates**

In [None]:
fig = px.histogram(train_df,
                   x='carbohydrate',
                   marginal='box',
                   color='high_traffic',
                   color_discrete_sequence=['green','grey'],
                   title='Distribution of Carbohydrates by Traffic')
fig.update_layout(bargap=0.1)
fig.update_layout(width=700, height=500)
fig.show()

The trend is similar to that of the calories: 
* Right-skewed data and 
* Higher traffic regardless of the grams of carbohydrates

However, the skew appears to be stronger here.
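Comparisons of skew like this one can be backed up numerically with `DataFrame.skew()`, where a positive value indicates a right-skewed column and a larger value a stronger skew. The sketch below uses synthetic log-normal data as a stand-in; the column values are generated, not taken from the recipe dataset, and the distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed columns (hypothetical values, not the recipe data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "calories": rng.lognormal(mean=5.0, sigma=0.3, size=2000),
    "carbohydrate": rng.lognormal(mean=3.0, sigma=1.0, size=2000),
})

# Sample skewness per column: > 0 means a longer tail on the right.
skewness = demo.skew()
print(skewness.sort_values(ascending=False))
```

Applied to the notebook's data, the equivalent call would be `train_df[numeric_cols].skew()`, which gives one number per feature to compare alongside the histograms.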

**Sugar**

In [None]:
fig = px.histogram(train_df,
                   x='sugar',
                   marginal='box',
                   color='high_traffic',
                   color_discrete_sequence=['green','grey'],
                   title='Distribution of Sugar by Traffic')
fig.update_layout(bargap=0.1)
fig.update_layout(width=700, height=500)
fig.show()

**Protein**

In [None]:
fig = px.histogram(train_df,
                   x='protein',
                   marginal='box',
                   color='high_traffic',
                   color_discrete_sequence=['green','grey'],
                   title='Distribution of Protein by Traffic')
fig.update_layout(bargap=0.1)
fig.update_layout(width=700, height=500)
fig.show()

The sugar and protein columns show a similar trend to the calories: 
* Right skewed data and 
* Higher traffic regardless of the grams

#### Visualization of all the numeric columns

In [None]:
pair_plot = sns.pairplot(train_df[numeric_cols + ['high_traffic']], hue='high_traffic', diag_kind='kde')

pair_plot.fig.suptitle('Pairplot of Numeric Columns', y=1.02)
pair_plot.fig.tight_layout()

plt.show()

There is no noticeable trend between any pair of numerical columns (calories, carbohydrate, protein, sugar, servings) and the traffic generated.

In [None]:
y_train = train_df['high_traffic']
X_train =  train_df.drop('high_traffic', axis=1)

At this point we are ready to train the model. Further feature engineering could be included in future steps.

But before the model is trained, we will put all the preprocessing steps into a pipeline to ensure reproducibility, maintainability, and standardization of steps.