# Importing the necessary libraries

import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt 
sb.set()

# Cleaning the dataset

## Importing the dataset

In [None]:
chocoRaw = pd.read_csv('chocolate.csv')
chocoTaste = pd.read_csv('chocolate_taste_dataset.csv')

## Examining basic information

In [None]:
chocoRaw.info()

In [None]:
chocoTaste

## Dropping redundant columns

Columns such as "ref", "company", "review_date" and "company_location" describe things that cannot be controlled when creating a chocolate bar, so they are removed. "specific_bean_origin_or_bar_name" is also removed because it is too specific and has too many rarely occurring values. The "Unnamed: 0" and "beans" columns are dropped as well.

In [None]:
# drop() already returns a new DataFrame, so no extra pd.DataFrame() wrapper is needed
choc = chocoRaw.drop(columns=['Unnamed: 0', 'ref', 'company', 'company_location',
                              'review_date', 'beans', 'specific_bean_origin_or_bar_name'])

## Changing the values of ingredients to integer

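We replace the labels in the ingredient columns with numeric labels so that they can be used by machine learning models later. The new labels are stored as the categories '1' and '0':
- "have_<ingredient>" becomes 1
- "have_not_<ingredient>" becomes 0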

In [None]:
# Suffix used in each column's labels (the vanilla labels are spelled 'vanila' in the data)
label_suffix = {
    'cocoa_butter': 'cocoa_butter',
    'vanilla': 'vanila',
    'lecithin': 'lecithin',
    'salt': 'salt',
    'sugar': 'sugar',
    'sweetener_without_sugar': 'sweetener_without_sugar',
}

for col, suffix in label_suffix.items():
    # 'have_<suffix>' -> '1', 'have_not_<suffix>' -> '0', stored as a categorical column
    choc[col] = choc[col].replace({f'have_{suffix}': '1', f'have_not_{suffix}': '0'}).astype('category')
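
The cell above stores the new labels as string categories. If true integer columns are preferred, one possible variant is sketched below on a small made-up frame. The names `toy` and `encoded` and the rows themselves are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy data (hypothetical rows) using the same label pattern as the dataset.
toy = pd.DataFrame({
    'salt': ['have_salt', 'have_not_salt', 'have_not_salt'],
    'sugar': ['have_sugar', 'have_sugar', 'have_not_sugar'],
})

# A label starting with 'have_not' means the ingredient is absent (0);
# any other label means it is present (1).
encoded = toy.apply(lambda col: (~col.str.startswith('have_not')).astype(int))

print(encoded)
print(encoded.dtypes)
```

Because this checks only the `have_not` prefix, it does not depend on how the rest of each label is spelled.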

Checking how many bars do and do not contain each ingredient

In [None]:
# Count the 1 (has) and 0 (has not) labels for each ingredient
print("cocoa butter:")
print(choc["cocoa_butter"].value_counts())
print()
print("vanilla:")
print(choc["vanilla"].value_counts())
print()
print("lecithin:")
print(choc["lecithin"].value_counts())
print()
print("salt:")
print(choc["salt"].value_counts())
print()
print("sugar:")
print(choc["sugar"].value_counts())
print()
print("sweetener w/o sugar:")
print(choc["sweetener_without_sugar"].value_counts())
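
`value_counts()` reports raw counts. To see proportions directly, `normalize=True` can be passed, as in this small sketch. The `salt` series here is made-up data standing in for one ingredient column.

```python
import pandas as pd

# Toy column (hypothetical values) standing in for one ingredient column.
salt = pd.Series(['0', '0', '0', '1'], name='salt')

# normalize=True returns each label's share of the rows instead of raw counts.
shares = salt.value_counts(normalize=True)
print(shares)
```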

## Reclassifying rating 

In [None]:
rating = pd.DataFrame(choc['rating'])
sb.catplot(y = "rating", data = rating, kind = "count", height = 8)

As the plot shows, the ratings are heavily imbalanced, with a strong skew towards the higher ratings. We therefore regroup them into 3 categories, which reduces class imbalance without requiring heavy oversampling or undersampling:
- rating <= 2.75: "low"
- 2.75 < rating <= 3.25: "mid"
- 3.25 < rating <= 4: "high"

Once again, we assign numeric labels to the new categories:
- "high" = 2
- "mid" = 1
- "low" = 0
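
The same thresholds can also be expressed with `pd.cut`, which is an alternative to the `np.select` approach used in this notebook. The sketch below runs on a few made-up ratings; `ratings`, `binned` and `codes` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy ratings (hypothetical values) that include the bin edges.
ratings = pd.Series([1.0, 2.75, 3.0, 3.25, 3.5, 4.0])

# With the default right=True each bin includes its upper edge:
# (-inf, 2.75] -> low, (2.75, 3.25] -> mid, (3.25, 4] -> high.
binned = pd.cut(ratings, bins=[-np.inf, 2.75, 3.25, 4], labels=['low', 'mid', 'high'])

# The categories are ordered, so their integer codes are low=0, mid=1, high=2.
codes = binned.cat.codes
print(pd.DataFrame({'rating': ratings, 'category': binned, 'code': codes}))
```

With these bins, a rating above 4 would become a missing value instead of being assigned to a category.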

In [None]:
conditions = [(choc['rating'] <= 2.75),
              (choc['rating'] > 2.75) & (choc['rating'] <= 3.25),
              (choc['rating'] > 3.25) & (choc['rating'] <= 4)]

values = ['0', '1', '2']

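# Rows matching no condition fall back to '0'; a string default keeps the dtype consistent with the string choices
choc['rating_category'] = np.select(conditions, values, default='0')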
swap = choc.pop('rating_category')
choc.insert(5,'rating_category', swap)
choc['rating_category'] = choc['rating_category'].astype('category')

sb.catplot(y = 'rating_category', data = choc, kind = 'count', height = 8)

The different ratings are now much more balanced.

## Adding the "number_of_taste" column

The dataset already records the number of ingredients, so we add a matching count of tastes. For each bar, we subtract the number of "no taste" entries from 4.
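
As a minimal sketch of this counting idea, the example below uses a made-up frame with four taste slots. The column names `first_taste` to `fourth_taste` and the rows are assumptions for illustration only.

```python
import pandas as pd

# Toy frame (hypothetical columns and rows) with four taste slots per bar.
toy = pd.DataFrame({
    'first_taste': ['cocoa', 'nutty', 'sweet'],
    'second_taste': ['fruity', 'no taste', 'no taste'],
    'third_taste': ['no taste', 'no taste', 'no taste'],
    'fourth_taste': ['no taste', 'no taste', 'no taste'],
})
taste_cols = list(toy.columns)

# Count the 'no taste' placeholders in each row and subtract from the number of slots.
n_tastes = len(taste_cols) - toy[taste_cols].eq('no taste').sum(axis=1)
print(n_tastes)
```

Restricting the comparison to the taste columns avoids scanning unrelated columns and replaces the row-wise `apply` with a vectorized operation.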

In [None]:
# For each row, count the 'no taste' entries across all columns and subtract from the 4 taste slots
choc['number_of_taste'] = choc.apply(lambda row: 4 - (row == 'no taste').sum(), axis=1).astype('category')
choc

In [None]:
choc.info()