## Data Scraping -- Chocolate Database 

The data originally comes from https://flavorsofcacao.com/chocolate_database.html. That page is built with the Spry Framework and loads the required table from a separate HTML page, so the table is not in the page's own source. Watching the XHR requests in the browser's developer tools (Brave, in this case) during a page reload revealed the underlying page, http://flavorsofcacao.com/database_w_REF.html, and that is the page the data is scraped and parsed from.

In [1]:
#Import libraries needed for scraping
from bs4 import BeautifulSoup
import requests
import csv

#Import libraries needed for cleaning
import pandas as pd

In [2]:
#Requesting the HTML page that holds the table
#Using BeautifulSoup to parse the response so its tags can be searched

requesting_page = requests.get("http://flavorsofcacao.com/database_w_REF.html")
soup = BeautifulSoup(requesting_page.content, "html.parser")

In [3]:
#Writing the scraped chocolate table to a csv file with csv.writer:
#the header row comes from the "th" tags, the data rows from the "td" tags

with open('chocolate_database.csv', 'w', newline='', encoding='utf-8') as chocolate_data:
    csv_chocolate = csv.writer(chocolate_data)
    csv_chocolate.writerow([th.get_text(strip=True) for th in soup.table.tr.find_all('th')])

    for tr in soup.table.find_all("tr")[1:]:
        csv_chocolate.writerow([td.get_text(strip=True) for td in tr.find_all('td')])
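
The same header and row extraction can be tried without touching the website. The following is a minimal sketch, not the notebook's own code: `SAMPLE_HTML` and `table_to_rows` are hypothetical names introduced for illustration, and the inline table is made-up stand-in data that only mimics the shape of the real page.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: one header row and two data rows.
SAMPLE_HTML = """
<table>
  <tr><th>REF</th><th>Company Location</th><th>Rating</th></tr>
  <tr><td>2454</td><td> U.S.A. </td><td>3.25</td></tr>
  <tr><td>2458</td><td>U.S.A.</td><td>3.5</td></tr>
</table>
"""


def table_to_rows(html):
    """Return the first table in `html` as a list of rows (header first)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found")
    rows = []
    for tr in table.find_all("tr"):
        # Header cells are <th> and data cells are <td>; accept both.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:  # skip rows that contain no cells
            rows.append(cells)
    return rows


rows = table_to_rows(SAMPLE_HTML)

# Write to an in-memory buffer instead of a file so the sketch has no side effects.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Checking for a missing table before indexing into it is a design choice of this sketch: if the page layout changes, it fails with a clear message instead of an `AttributeError` on `None`.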

## Data Cleaning -- Chocolate Database 

The chocolate dataset has 2530 rows and 10 columns, and 87 rows have NaN values in the Ingredients column. There are two ways to handle this. The first is to delete the rows that contain missing values using "df.dropna()", which leads to a loss of information. The second is to replace the missing values with a blank space, which avoids deleting whole rows.
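
The difference between the two options can be seen on a small example. This is a sketch under stated assumptions: `toy_df` is hypothetical made-up data with a single missing Ingredients value, not a sample of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one of three rows is missing its Ingredients value.
toy_df = pd.DataFrame({
    "REF": [1, 2, 3],
    "Ingredients": ["3- B,S,C", np.nan, "2- B,S"],
    "Rating": [3.25, 3.5, 3.0],
})

# Option 1: drop every row that has a NaN in the Ingredients column.
dropped = toy_df.dropna(subset=["Ingredients"])

# Option 2: keep all rows and replace the NaN with a blank space.
filled = toy_df.fillna({"Ingredients": " "})

print(len(dropped))                          # rows left after dropping
print(len(filled))                           # rows left after filling
print(filled["Ingredients"].isnull().sum())  # NaN values left after filling
```

Dropping removes the whole row, including its REF and Rating values, while filling keeps those values and only changes the Ingredients cell.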

In [4]:
#Reading the dataset

chocolate_df = pd.read_csv(r"C:\Users\Toshiba\Desktop\projects\chocolate_bar\chocolate_database.csv")
chocolate_df.head(5)

Unnamed: 0,REF,Company (Manufacturer),Company Location,Review Date,Country of Bean Origin,Specific Bean Origin or Bar Name,Cocoa Percent,Ingredients,Most Memorable Characteristics,Rating
0,2454,5150,U.S.A.,2019,Tanzania,"Kokoa Kamili, batch 1",76%,"3- B,S,C","rich cocoa, fatty, bready",3.25
1,2458,5150,U.S.A.,2019,Dominican Republic,"Zorzal, batch 1",76%,"3- B,S,C","cocoa, vegetal, savory",3.5
2,2454,5150,U.S.A.,2019,Madagascar,"Bejofo Estate, batch 1",76%,"3- B,S,C","cocoa, blackberry, full body",3.75
3,2542,5150,U.S.A.,2021,Fiji,"Matasawalevu, batch 1",68%,"3- B,S,C","chewy, off, rubbery",3.0
4,2546,5150,U.S.A.,2021,Venezuela,"Sur del Lago, batch 1",72%,"3- B,S,C","fatty, earthy, moss, nutty,chalky",3.0


In [5]:
#Counting number of rows and columns
n_rows = chocolate_df.shape[0]
n_columns = chocolate_df.shape[1]
print(f'Rows: {n_rows}')
print(f'Columns: {n_columns}')
print()

#Counting the NaN values in each column
n_null = chocolate_df.isnull().sum()
print(n_null)

Rows: 2530
Columns: 10

REF                                  0
Company (Manufacturer)               0
Company Location                     0
Review Date                          0
Country of Bean Origin               0
Specific Bean Origin or Bar Name     0
Cocoa Percent                        0
Ingredients                         87
Most Memorable Characteristics       0
Rating                               0
dtype: int64


In [6]:
#Replacing NaN with a blank space and storing the result in a new DataFrame
chocolate_df_filled = chocolate_df.fillna(" ")

#Checking that the Ingredients column of the new DataFrame has no NaN left
print(chocolate_df_filled['Ingredients'].isnull().sum())

#Saving the cleaned DataFrame to a new csv file (the original csv is left unchanged)
chocolate_df_filled.to_csv('chocolate_dataset_cleaned.csv', index=False)

0
