# Chocolate Bar Ratings

Imagine you are a data scientist working on a project to analyze chocolate bar ratings. You've sourced a comprehensive dataset from Kaggle that includes expert ratings for over 1,700 different chocolate bars. This dataset provides a wealth of information, including the regional origin of the chocolate, the percentage of cocoa content, the variety of chocolate bean used, and the geographical location where the beans were grown.

The dataset is available on Kaggle at the following link: [Chocolate Bar Ratings Dataset](https://www.kaggle.com/datasets/rtatman/chocolate-bar-ratings). For more detailed information about the data, you can visit the "Overview" page on Kaggle or explore the [Flavors of Cacao](http://flavorsofcacao.com/index.html) website. The Flavors of Cacao website provides insights into the rating system used in this dataset, an overview of the factors that contribute to the flavor of chocolate, and other interesting information about chocolate.

The dataset includes the following attributes:

| ***Attribute*** | ***Description*** |
| ---: | :--- |
| **Company (Maker-if known)** | Name of the company manufacturing the bar |
| **Specific Bean Origin or Bar Name** | The specific geo-region of origin for the bar |
| **REF** | Reference number, a value linked to when the review was entered in the database. Higher = more recent. |
| **Review Date** | Date of publication of the review |
| **Cocoa Percent** | Cocoa percentage (darkness) of the chocolate bar being reviewed |
| **Company Location** | Manufacturer base country |
| **Rating** | Expert rating for the bar |
| **Bean Type** | The variety of bean used, if provided |
| **Broad Bean Origin** | The broad geo-region of origin for the bean |

Aim: To uncover patterns and trends in chocolate ratings, identify the key factors that contribute to higher ratings, and perhaps even discover new insights into what makes certain chocolates more appealing to experts.

Steps followed:

1. Load the data into a DataFrame
2. Inspect and clean the data
3. Feature engineering
4. Exploratory data analysis (EDA)

In [1]:
import numpy as np
import pandas as pd
choco_data = pd.read_csv("flavors_of_cacao.csv")
choco_data

Unnamed: 0,Company \n(Maker-if known),Specific Bean Origin\nor Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin
0,A. Morin,Agua Grande,1876,2016,63%,France,3.75,,Sao Tome
1,A. Morin,Kpime,1676,2015,70%,France,2.75,,Togo
2,A. Morin,Atsane,1676,2015,70%,France,3.00,,Togo
3,A. Morin,Akata,1680,2015,70%,France,3.50,,Togo
4,A. Morin,Quilla,1704,2015,70%,France,3.50,,Peru
...,...,...,...,...,...,...,...,...,...
1790,Zotter,Peru,647,2011,70%,Austria,3.75,,Peru
1791,Zotter,Congo,749,2011,65%,Austria,3.00,Forastero,Congo
1792,Zotter,Kerala State,749,2011,65%,Austria,3.50,Forastero,India
1793,Zotter,Kerala State,781,2011,62%,Austria,3.25,,India


There are 1795 records with 9 columns

In [2]:
choco_data['Rating'].dtype # float

dtype('float64')

In [3]:
choco_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1795 entries, 0 to 1794
Data columns (total 9 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Company 
(Maker-if known)         1795 non-null   object 
 1   Specific Bean Origin
or Bar Name  1795 non-null   object 
 2   REF                               1795 non-null   int64  
 3   Review
Date                       1795 non-null   int64  
 4   Cocoa
Percent                     1795 non-null   object 
 5   Company
Location                  1795 non-null   object 
 6   Rating                            1795 non-null   float64
 7   Bean
Type                         1794 non-null   object 
 8   Broad Bean
Origin                 1794 non-null   object 
dtypes: float64(1), int64(2), object(6)
memory usage: 126.3+ KB


In [4]:
choco_data.describe()

Unnamed: 0,REF,Review\nDate,Rating
count,1795.0,1795.0,1795.0
mean,1035.904735,2012.325348,3.185933
std,552.886365,2.92721,0.478062
min,5.0,2006.0,1.0
25%,576.0,2010.0,2.875
50%,1069.0,2013.0,3.25
75%,1502.0,2015.0,3.5
max,1952.0,2017.0,5.0


In [5]:
choco_data.shape

(1795, 9)

In [6]:
# Look at the header row of the DataFrame: print the column names and check how they are formatted.
# The "Bean Type" column is investigated separately further down.
choco_data.columns


Index(['Company \n(Maker-if known)', 'Specific Bean Origin\nor Bar Name',
       'REF', 'Review\nDate', 'Cocoa\nPercent', 'Company\nLocation', 'Rating',
       'Bean\nType', 'Broad Bean\nOrigin'],
      dtype='object')

In [7]:
# Update column names, assign names that are easy to work with. 
# You can update all column names or only some of them, and you can assign new names.
choco_data.columns = ['company', 'bar_name',
       'REF', 'review_date', 'cocoa_per', 'location', 'Rating',
       'bean_type', 'bean_origin']
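
As an alternative to typing every new name by hand, the headers can also be normalized programmatically. The following is a minimal sketch on a toy frame whose headers only mimic the raw CSV; the variable names `raw` and `cleaned` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame whose headers mimic the raw CSV (embedded newlines, stray spaces).
raw = pd.DataFrame(
    [["A. Morin", "63%", "Sao Tome"]],
    columns=["Company \n(Maker-if known)", "Cocoa\nPercent", "Broad Bean\nOrigin"],
)

# Collapse every run of whitespace (including "\n") into one underscore, then lowercase.
cleaned = (
    raw.columns
    .str.strip()
    .str.replace(r"\s+", "_", regex=True)
    .str.lower()
)
print(list(cleaned))
```

The generated names are longer than the hand-picked ones above, so explicit assignment is still the better choice when short names matter.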

In [8]:
choco_data

Unnamed: 0,company,bar_name,REF,review_date,cocoa_per,location,Rating,bean_type,bean_origin
0,A. Morin,Agua Grande,1876,2016,63%,France,3.75,,Sao Tome
1,A. Morin,Kpime,1676,2015,70%,France,2.75,,Togo
2,A. Morin,Atsane,1676,2015,70%,France,3.00,,Togo
3,A. Morin,Akata,1680,2015,70%,France,3.50,,Togo
4,A. Morin,Quilla,1704,2015,70%,France,3.50,,Peru
...,...,...,...,...,...,...,...,...,...
1790,Zotter,Peru,647,2011,70%,Austria,3.75,,Peru
1791,Zotter,Congo,749,2011,65%,Austria,3.00,Forastero,Congo
1792,Zotter,Kerala State,749,2011,65%,Austria,3.50,Forastero,India
1793,Zotter,Kerala State,781,2011,62%,Austria,3.25,,India


In [9]:
# Explore the "Bean Type" column. How many entries are there in this column? Are there any empty values?

# Many bean_type cells look empty in the table, yet info() reports only one null value.

# List the unique values of bean_type to see what the seemingly empty cells actually contain.
choco_data['bean_type'].unique()

array(['\xa0', 'Criollo', 'Trinitario', 'Forastero (Arriba)', 'Forastero',
       'Forastero (Nacional)', 'Criollo, Trinitario',
       'Criollo (Porcelana)', 'Blend', 'Trinitario (85% Criollo)',
       'Forastero (Catongo)', 'Forastero (Parazinho)',
       'Trinitario, Criollo', 'CCN51', 'Criollo (Ocumare)', 'Nacional',
       'Criollo (Ocumare 61)', 'Criollo (Ocumare 77)',
       'Criollo (Ocumare 67)', 'Criollo (Wild)', 'Beniano', 'Amazon mix',
       'Trinitario, Forastero', 'Forastero (Arriba) ASS', 'Criollo, +',
       'Amazon', 'Amazon, ICS', 'EET', 'Blend-Forastero,Criollo',
       'Trinitario (Scavina)', 'Criollo, Forastero', 'Matina',
       'Forastero(Arriba, CCN)', 'Nacional (Arriba)',
       'Forastero (Arriba) ASSS', 'Forastero, Trinitario',
       'Forastero (Amelonado)', nan, 'Trinitario, Nacional',
       'Trinitario (Amelonado)', 'Trinitario, TCGA', 'Criollo (Amarru)'],
      dtype=object)

In [10]:
choco_data['bean_origin'].unique()

array(['Sao Tome', 'Togo', 'Peru', 'Venezuela', 'Cuba', 'Panama',
       'Madagascar', 'Brazil', 'Ecuador', 'Colombia', 'Burma',
       'Papua New Guinea', 'Bolivia', 'Fiji', 'Mexico', 'Indonesia',
       'Trinidad', 'Vietnam', 'Nicaragua', 'Tanzania',
       'Dominican Republic', 'Ghana', 'Belize', '\xa0', 'Jamaica',
       'Grenada', 'Guatemala', 'Honduras', 'Costa Rica',
       'Domincan Republic', 'Haiti', 'Congo', 'Philippines', 'Malaysia',
       'Dominican Rep., Bali', 'Venez,Africa,Brasil,Peru,Mex', 'Gabon',
       'Ivory Coast', 'Carribean', 'Sri Lanka', 'Puerto Rico', 'Uganda',
       'Martinique', 'Sao Tome & Principe', 'Vanuatu', 'Australia',
       'Liberia', 'Ecuador, Costa Rica', 'West Africa', 'Hawaii',
       'St. Lucia', 'Cost Rica, Ven', 'Peru, Madagascar',
       'Venezuela, Trinidad', 'Trinidad, Tobago',
       'Ven, Trinidad, Ecuador', 'South America, Africa', 'India',
       'Africa, Carribean, C. Am.', 'Tobago', 'Ven., Indonesia, Ecuad.',
       'Trinidad-Tobago

In [11]:
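# '\xa0' (a non-breaking space) is what the cells that look empty actually contain.
# We can replace it with NaN values.

# replace() returns a new Series; choco_data itself is not modified here.
q = choco_data['bean_type'].replace('\xa0', np.nan)
choco_data['bean_type'][2]  # still '\xa0' in the original DataFrame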

'\xa0'

In [12]:
choco_data  # unchanged: the result of replace() was stored in q, not written back

Unnamed: 0,company,bar_name,REF,review_date,cocoa_per,location,Rating,bean_type,bean_origin
0,A. Morin,Agua Grande,1876,2016,63%,France,3.75,,Sao Tome
1,A. Morin,Kpime,1676,2015,70%,France,2.75,,Togo
2,A. Morin,Atsane,1676,2015,70%,France,3.00,,Togo
3,A. Morin,Akata,1680,2015,70%,France,3.50,,Togo
4,A. Morin,Quilla,1704,2015,70%,France,3.50,,Peru
...,...,...,...,...,...,...,...,...,...
1790,Zotter,Peru,647,2011,70%,Austria,3.75,,Peru
1791,Zotter,Congo,749,2011,65%,Austria,3.00,Forastero,Congo
1792,Zotter,Kerala State,749,2011,65%,Austria,3.50,Forastero,India
1793,Zotter,Kerala State,781,2011,62%,Austria,3.25,,India
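
The behaviour seen in the last two cells can be reproduced on a tiny example. This is a sketch on made-up data (the four values are not taken from the dataset): `replace()` hands back a new Series, and the DataFrame only changes once that result is assigned to the column.

```python
import numpy as np
import pandas as pd

# Toy data: "\xa0" (a non-breaking space) stands in for a visually blank cell.
df = pd.DataFrame({"bean_type": ["\xa0", "Criollo", "\xa0", "Blend"]})

# replace() returns a new Series; df itself is untouched at this point.
q = df["bean_type"].replace("\xa0", np.nan)
unchanged = df["bean_type"].iloc[0]

# Assigning the result back is what actually updates the DataFrame.
df["bean_type"] = q
missing = int(df["bean_type"].isna().sum())
print(repr(unchanged), missing)
```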


In [13]:
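# Assign the result back to each column instead of relying on inplace=True on a column selection.
choco_data['bean_type'] = choco_data['bean_type'].replace('\xa0', np.nan, regex=True)
choco_data['bean_origin'] = choco_data['bean_origin'].replace('\xa0', np.nan, regex=True)
choco_data.info()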

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1795 entries, 0 to 1794
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   company      1795 non-null   object 
 1   bar_name     1795 non-null   object 
 2   REF          1795 non-null   int64  
 3   review_date  1795 non-null   int64  
 4   cocoa_per    1795 non-null   object 
 5   location     1795 non-null   object 
 6   Rating       1795 non-null   float64
 7   bean_type    907 non-null    object 
 8   bean_origin  1721 non-null   object 
dtypes: float64(1), int64(2), object(6)
memory usage: 126.3+ KB


In [14]:
# '\xa0' has now been replaced with NaN, which is why the null counts for bean_type and bean_origin increased.

### Find the record with the highest chocolate rating. What company produces the chocolate bar and what country do the beans originate from?

In [15]:
highest_rating = choco_data['Rating'].max()

In [16]:
highest_rating_record = choco_data[choco_data['Rating'] == highest_rating]
highest_rating_record

Unnamed: 0,company,bar_name,REF,review_date,cocoa_per,location,Rating,bean_type,bean_origin
78,Amedei,Chuao,111,2007,70%,Italy,5.0,Trinitario,Venezuela
86,Amedei,Toscano Black,40,2006,70%,Italy,5.0,Blend,


In [17]:
choco_data.loc[choco_data.Rating.idxmax()]

company            Amedei
bar_name            Chuao
REF                   111
review_date          2007
cocoa_per             70%
location            Italy
Rating                5.0
bean_type      Trinitario
bean_origin     Venezuela
Name: 78, dtype: object

### Amedei, using beans from Venezuela, has the highest rating
As you can see, the highest rating of 5 was given to the chocolate bar Chuao from the company Amedei, a 70% cocoa bar made from Venezuelan beans.
However, there are actually two chocolate bars, both from Amedei, that share the highest rating of 5; `idxmax()` returns only the first of them.

In [18]:
# The `index` attribute returns only the index labels of both records:
choco_data[choco_data['Rating'] == highest_rating].index

Int64Index([78, 86], dtype='int64')
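
The difference between the two lookups is easy to miss, so here is a minimal sketch on invented ratings (company names `A` to `D` are placeholders): `idxmax()` reports the first label at which the maximum occurs, while a boolean mask keeps every tied row.

```python
import pandas as pd

# Toy ratings with a tie for first place (rows 1 and 3).
ratings = pd.DataFrame(
    {"company": ["A", "B", "C", "D"], "Rating": [3.5, 5.0, 4.0, 5.0]}
)

# idxmax() returns only the first label at which the maximum occurs.
first_max_label = ratings["Rating"].idxmax()

# A boolean mask against the maximum keeps all tied rows.
all_max = ratings[ratings["Rating"] == ratings["Rating"].max()]
tied_labels = all_max.index.tolist()

print(first_max_label, tied_labels)
```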

### Beans from what country are the most frequently used in the chocolate bars? We are looking for a top 10 countries of origin.

In [19]:
# Count the rows per country of origin, most frequent first (the top 10 are extracted in the next cell)
choco_data['bean_origin'].value_counts()

Venezuela                214
Ecuador                  193
Peru                     165
Madagascar               145
Dominican Republic       141
                        ... 
Peru, Belize               1
Peru, Mad., Dom. Rep.      1
PNG, Vanuatu, Mad          1
Trinidad, Ecuador          1
Venezuela, Carribean       1
Name: bean_origin, Length: 99, dtype: int64

In [20]:
choco_data['bean_origin'].value_counts().head(10)

Venezuela             214
Ecuador               193
Peru                  165
Madagascar            145
Dominican Republic    141
Nicaragua              60
Brazil                 58
Bolivia                57
Belize                 49
Papua New Guinea       42
Name: bean_origin, dtype: int64

### Find the region of origin (column bean_origin) for the chocolate bars with the highest rating. Display the top 10 results, sorted by the rating and cocoa percent (column cocoa_per).

In [21]:
# Check whether `bean_origin` has any null values:
choco_data['bean_origin'].isnull().sum()
# Yes, there are 74 missing values

74

In [22]:
# Inspect cocoa percent (column cocoa_per) before fixing its type
choco_data[['cocoa_per']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1795 entries, 0 to 1794
Data columns (total 1 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   cocoa_per  1795 non-null   object
dtypes: object(1)
memory usage: 14.1+ KB


In [23]:
# The values in this column are strings such as '70%'; convert them to numeric (float) so they can be averaged and sorted.
choco_data['cocoa_per'] = choco_data['cocoa_per'].str.rstrip('%').astype(float)
choco_data.head()

Unnamed: 0,company,bar_name,REF,review_date,cocoa_per,location,Rating,bean_type,bean_origin
0,A. Morin,Agua Grande,1876,2016,63.0,France,3.75,,Sao Tome
1,A. Morin,Kpime,1676,2015,70.0,France,2.75,,Togo
2,A. Morin,Atsane,1676,2015,70.0,France,3.0,,Togo
3,A. Morin,Akata,1680,2015,70.0,France,3.5,,Togo
4,A. Morin,Quilla,1704,2015,70.0,France,3.5,,Peru
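
One practical wrinkle with this kind of conversion is that re-running the cell fails once the column is already numeric. A possible guard is sketched below; `percent_to_float` is a hypothetical helper written for illustration, not something the notebook defines, and the sample values are invented.

```python
import pandas as pd


def percent_to_float(col: pd.Series) -> pd.Series:
    """Convert strings like '70%' to floats; leave numeric columns as floats."""
    if pd.api.types.is_numeric_dtype(col):
        # Already converted (for example, the cell was run twice).
        return col.astype(float)
    return col.str.rstrip("%").astype(float)


cocoa = pd.Series(["63%", "70%", "72.5%"], name="cocoa_per")
once = percent_to_float(cocoa)
twice = percent_to_float(once)  # safe to apply again
print(once.tolist())
```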


In [24]:
# Group by bean_origin, take the mean of Rating and cocoa_per, sort by both (descending), display the top 10
(choco_data.groupby('bean_origin')[['Rating', 'cocoa_per']]
    .mean()
    .sort_values(by=['Rating', 'cocoa_per'], ascending=False)
    .head(10))

Unnamed: 0_level_0,Rating,cocoa_per
bean_origin,Unnamed: 1_level_1,Unnamed: 2_level_1
"Guat., D.R., Peru, Mad., PNG",4.0,88.0
"Dom. Rep., Madagascar",4.0,70.0
"Gre., PNG, Haw., Haiti, Mad",4.0,70.0
"Ven, Bolivia, D.R.",4.0,70.0
"Venezuela, Java",4.0,70.0
"Peru, Dom. Rep",4.0,67.0
"Peru, Belize",3.75,75.0
"Ven.,Ecu.,Peru,Nic.",3.75,75.0
"DR, Ecuador, Peru",3.75,70.0
"Dominican Rep., Bali",3.75,70.0


#### The results show that the origins with the highest mean rating are all blends of beans sourced from several countries. Most of them sit at 70% cocoa (the range is 67% to 88%), and entries such as "Peru, Belize" occur only once in the dataset, so these means may rest on very few bars.
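
A group mean says little when the group holds a single bar. One way to make that visible is to carry the group size along with the means, sketched here on invented rows; the `MIN_BARS` threshold and the column names `mean_rating`, `mean_cocoa` and `n_bars` are assumptions chosen for the example.

```python
import pandas as pd

# Toy data: one single-bar blend and two origins with several bars each.
bars = pd.DataFrame({
    "bean_origin": ["Venezuela", "Venezuela", "Venezuela", "Peru, Belize", "Peru", "Peru"],
    "Rating": [3.5, 4.0, 3.0, 3.75, 3.0, 3.5],
    "cocoa_per": [70.0, 72.0, 75.0, 75.0, 70.0, 68.0],
})

# Named aggregation: means plus the number of bars behind each mean.
summary = (
    bars.groupby("bean_origin")
    .agg(
        mean_rating=("Rating", "mean"),
        mean_cocoa=("cocoa_per", "mean"),
        n_bars=("Rating", "size"),
    )
    .sort_values(["mean_rating", "mean_cocoa"], ascending=False)
)

MIN_BARS = 2  # hypothetical cut-off for a group to count as meaningful
reliable = summary[summary["n_bars"] >= MIN_BARS]
print(summary)
print(reliable)
```

With the size column in place, the single-bar blend still tops the raw ranking but drops out once the threshold is applied.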

### What countries are the top 10 chocolate producers, based on the variety of chocolate bars produced?
### And what countries produce the highest rated chocolate bars (top 10)?

In [25]:
# Count the rows in the dataset per manufacturer country (the `location` column):
choco_data.location.value_counts().head(10)

U.S.A.         764
France         156
Canada         125
U.K.            96
Italy           63
Ecuador         54
Australia       49
Belgium         40
Switzerland     38
Germany         35
Name: location, dtype: int64

By number of rated bars in the dataset, the U.S.A. is by far the largest producer, followed by France, Canada, the U.K. and Italy.

In [26]:
# Highest single rating achieved per manufacturer country, top 10
choco_data.groupby('location')['Rating'].max().sort_values(ascending=False).head(10)

location
Italy          5.0
U.S.A.         4.0
Spain          4.0
Germany        4.0
Scotland       4.0
Sao Tome       4.0
Switzerland    4.0
U.K.           4.0
Ecuador        4.0
France         4.0
Name: Rating, dtype: float64

#### Ranked by their single best bar, Italy leads with the only 5.0 rating; the other nine locations in the top 10 (among them the U.S.A., Spain and Germany) are tied at 4.0.
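
Ranking locations by their maximum rating rewards a single outstanding bar. A complementary view, sketched below on invented numbers, puts the maximum, the mean and the bar count side by side; the two rankings need not agree.

```python
import pandas as pd

# Toy data: Italy has one exceptional bar, France is more consistent.
bars = pd.DataFrame({
    "location": ["Italy", "Italy", "Italy", "France", "France"],
    "Rating": [5.0, 3.0, 2.5, 4.0, 3.75],
})

# Several statistics per location in one pass.
by_location = bars.groupby("location")["Rating"].agg(["max", "mean", "count"])

top_by_max = by_location["max"].idxmax()
top_by_mean = by_location["mean"].idxmax()
print(by_location)
print(top_by_max, top_by_mean)
```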