# Establishing the Data

**Name**: Angel Lai

**Computing ID**: Bpy2nc


## Coffee Quality Dataset

**Source**: https://www.kaggle.com/datasets/fatihb/coffee-quality-data-cqi

This dataset is a table of coffee samples (mostly Arabica) in which each row is one coffee. The columns include taste scores (Aroma, Flavor, Aftertaste, Acidity, Body, Balance, Sweetness, Overall, and Total Cup Points) plus background information such as country of origin and production and processing details.

The scores were originally produced by the Coffee Quality Institute (CQI): trained graders taste ("cup") the coffee using CQI’s system and enter the scores into CQI’s database. The Kaggle uploader (fatihb / Fatih Boyar) then built this dataset by scraping or downloading those CQI database pages and saving them as CSV files. The Kaggle page describes it as a May 2023 snapshot.

In [3]:
import pandas as pd
import numpy as np

In [2]:
# Load the cleaned Arabica CSV and display it (pandas truncates long frames to the first and last rows)
df = pd.read_csv("df_arabica_clean.csv")
df

Unnamed: 0.1,Unnamed: 0,ID,Country of Origin,Farm Name,Lot Number,Mill,ICO Number,Company,Altitude,Region,...,Total Cup Points,Moisture Percentage,Category One Defects,Quakers,Color,Category Two Defects,Expiration,Certification Body,Certification Address,Certification Contact
0,0,0,Colombia,Finca El Paraiso,CQU2022015,Finca El Paraiso,,Coffee Quality Union,1700-1930,"Piendamo,Cauca",...,89.33,11.8,0,0,green,3,"September 21st, 2023",Japan Coffee Exchange,"〒413-0002 静岡県熱海市伊豆山１１７３−５８ 1173-58 Izusan, Ata...",松澤　宏樹　Koju Matsuzawa - +81(0)9085642901
1,1,1,Taiwan,Royal Bean Geisha Estate,"The 2022 Pacific Rim Coffee Summit,T037",Royal Bean Geisha Estate,,Taiwan Coffee Laboratory,1200,Chiayi,...,87.58,10.5,0,0,blue-green,0,"November 15th, 2023",Taiwan Coffee Laboratory 台灣咖啡研究室,"QAHWAH CO., LTD 4F, No. 225, Sec. 3, Beixin Rd...","Lin, Jen-An Neil 林仁安 - 886-289116612"
2,2,2,Laos,OKLAO coffee farms,"The 2022 Pacific Rim Coffee Summit,LA01",oklao coffee processing plant,,Taiwan Coffee Laboratory,1300,Laos Borofen Plateau,...,87.42,10.4,0,0,yellowish,2,"November 15th, 2023",Taiwan Coffee Laboratory 台灣咖啡研究室,"QAHWAH CO., LTD 4F, No. 225, Sec. 3, Beixin Rd...","Lin, Jen-An Neil 林仁安 - 886-289116612"
3,3,3,Costa Rica,La Cumbre,CQU2022017,La Montana Tarrazu MIll,,Coffee Quality Union,1900,"Los Santos,Tarrazu",...,87.17,11.8,0,0,green,0,"September 21st, 2023",Japan Coffee Exchange,"〒413-0002 静岡県熱海市伊豆山１１７３−５８ 1173-58 Izusan, Ata...",松澤　宏樹　Koju Matsuzawa - +81(0)9085642901
4,4,4,Colombia,Finca Santuario,CQU2023002,Finca Santuario,,Coffee Quality Union,1850-2100,"Popayan,Cauca",...,87.08,11.6,0,2,yellow-green,2,"March 5th, 2024",Japan Coffee Exchange,"〒413-0002 静岡県熱海市伊豆山１１７３−５８ 1173-58 Izusan, Ata...",松澤　宏樹　Koju Matsuzawa - +81(0)9085642901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
202,202,202,Brazil,Fazenda Conquista,019/22,Dry Mill,,Ipanema Coffees,950,Sul de Minas,...,80.08,11.4,0,0,green,4,"February 2nd, 2024",Brazil Specialty Coffee Association,"Rua Gaspar Batista Paiva, 416 – Santa Luiza Va...",Chris Allen - 55 35 3212-4705
203,203,203,Nicaragua,Finca San Felipe,017-053-0155,Beneficio Atlantic Sébaco,017-053-0155,Exportadora Atlantic S.A,1200,Matagalpa,...,80.00,10.4,0,2,green,12,"March 2nd, 2024",Asociación de Cafés Especiales de Nicaragua,"Del Hotel Seminole 2 C al lago, 1 C arriba.",Maria Ines Benavidez Toval - 011-(505)-8396 4717
204,204,204,Laos,-,105/3/VL7285-005,DRY MILL,105/3/VL7285-005,Marubeni Corporation,1300,Bolaven Plateau,...,79.67,11.6,0,9,green,11,"November 11th, 2023",Japan Coffee Exchange,"〒413-0002 静岡県熱海市伊豆山１１７３−５８ 1173-58 Izusan, Ata...",松澤　宏樹　Koju Matsuzawa - +81(0)9085642901
205,205,205,El Salvador,"Rosario de Maria II, Area de La Pila",0423A01,"Optimum Coffee, San Salvador, El Salvador",,Aprentium Enterprises LLC,1200,"Volcan de San Vicente, La Paz, El Salvador",...,78.08,11.0,0,12,bluish-green,13,"March 7th, 2024",Salvadoran Coffee Council,"Final 1a. Av. Norte y 13 Calle Pte., dentro de...",Tomas Bonilla - (503) 2505-6600
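The `Unnamed: 0` column in the output above looks like a row index that was written into the CSV when it was saved. Below is a minimal sketch of one way to avoid carrying it along as a data column. It uses a tiny made-up CSV in place of `df_arabica_clean.csv`; the `sample_csv` and `sample_df` names and the two sample rows are illustrative assumptions, not the real file.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the real CSV; the leading unnamed
# column mimics an index that was saved along with the data.
sample_csv = io.StringIO(
    ",ID,Country of Origin,Total Cup Points\n"
    "0,0,Colombia,89.33\n"
    "1,1,Taiwan,87.58\n"
)

# index_col=0 tells pandas to use the first column as the row index
# instead of loading it as a data column named "Unnamed: 0".
sample_df = pd.read_csv(sample_csv, index_col=0)

print(sample_df.columns.tolist())
```

If the same approach were applied to the real file, the data dictionary built later would no longer contain an undescribed `Unnamed: 0` row.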


In [5]:
# Build a data dictionary: one row per column, with its dtype and an
# empty description that the assignments below fill in
cols = pd.DataFrame({
    "Column Name": df.columns,
    "Type": df.dtypes.astype(str),
    "Description": ""
})

cols.loc[cols["Column Name"] == "ID", "Description"] = "Unique identifier for each coffee record in the dataset"
cols.loc[cols["Column Name"] == "Country of Origin", "Description"] = "Country where the coffee was grown"
cols.loc[cols["Column Name"] == "Farm Name", "Description"] = "Name of the farm where the coffee was grown"
cols.loc[cols["Column Name"] == "Lot Number", "Description"] = "Producer or exporter batch identifier for this coffee lot"
cols.loc[cols["Column Name"] == "Mill", "Description"] = "Wet or dry mill where the coffee was processed"
cols.loc[cols["Column Name"] == "ICO Number", "Description"] = "International Coffee Organization identifier associated with this batch"
cols.loc[cols["Column Name"] == "Company", "Description"] = "Company or organization that submitted or owns the coffee batch"
cols.loc[cols["Column Name"] == "Altitude", "Description"] = "Growing altitude of the coffee"
cols.loc[cols["Column Name"] == "Region", "Description"] = "Region where the coffee was produced"
cols.loc[cols["Column Name"] == "Producer", "Description"] = "Producer, cooperative, or farmer responsible for the coffee"
cols.loc[cols["Column Name"] == "Number of Bags", "Description"] = "Number of bags represented by this row of data"
cols.loc[cols["Column Name"] == "Bag Weight", "Description"] = "Weight of each bag"
cols.loc[cols["Column Name"] == "In-Country Partner", "Description"] = "CQI-affiliated partner organization that coordinated evaluation in the producing country"
cols.loc[cols["Column Name"] == "Harvest Year", "Description"] = "Year or harvest season when the coffee was harvested"
cols.loc[cols["Column Name"] == "Grading Date", "Description"] = "Date on which this coffee batch was graded"
cols.loc[cols["Column Name"] == "Owner", "Description"] = "Owner of the coffee batch"
cols.loc[cols["Column Name"] == "Variety", "Description"] = "Botanical coffee variety"
cols.loc[cols["Column Name"] == "Status", "Description"] = "Status of the sample or certificate (all completed)"
cols.loc[cols["Column Name"] == "Processing Method", "Description"] = "Post-harvest processing method used (e.g., washed, natural, honey)"
cols.loc[cols["Column Name"] == "Aroma", "Description"] = "Cupping score for aroma of the coffee"
cols.loc[cols["Column Name"] == "Flavor", "Description"] = "Cupping score for flavor of the coffee"
cols.loc[cols["Column Name"] == "Aftertaste", "Description"] = "Cupping score for aftertaste of the coffee"
cols.loc[cols["Column Name"] == "Acidity", "Description"] = "Cupping score for acidity of the coffee"
cols.loc[cols["Column Name"] == "Body", "Description"] = "Cupping score for body or mouthfeel"
cols.loc[cols["Column Name"] == "Balance", "Description"] = "Cupping score for how well the different attributes harmonize"
cols.loc[cols["Column Name"] == "Uniformity", "Description"] = "Score representing consistency across multiple cups of the same sample"
cols.loc[cols["Column Name"] == "Clean Cup", "Description"] = "Score for absence of off-flavors and presence of a clean cup"
cols.loc[cols["Column Name"] == "Sweetness", "Description"] = "Score for sweetness in the cup"
cols.loc[cols["Column Name"] == "Overall", "Description"] = "Overall impression score given by the cupper"
cols.loc[cols["Column Name"] == "Defects", "Description"] = "Total defect-related deductions or notes associated with the lot"
cols.loc[cols["Column Name"] == "Total Cup Points", "Description"] = "Total cupping score (sum of sensory attributes minus defect deductions)"
cols.loc[cols["Column Name"] == "Moisture Percentage", "Description"] = "Measured moisture content of the green coffee beans, in percent"
cols.loc[cols["Column Name"] == "Category One Defects", "Description"] = "Count of primary (Category 1) green coffee defects found in the sample"
cols.loc[cols["Column Name"] == "Quakers", "Description"] = "Number of quakers (underdeveloped beans) detected in the roasted sample"
cols.loc[cols["Column Name"] == "Color", "Description"] = "Color or appearance description of the green coffee beans"
cols.loc[cols["Column Name"] == "Category Two Defects", "Description"] = "Count of secondary (Category 2) green coffee defects found in the sample"
cols.loc[cols["Column Name"] == "Expiration", "Description"] = "Expiration date of the evaluation or Q certificate validity"
cols.loc[cols["Column Name"] == "Certification Body", "Description"] = "Name of the organization or laboratory that certified or evaluated the coffee"
cols.loc[cols["Column Name"] == "Certification Address", "Description"] = "Address of the certifying organization or laboratory"
cols.loc[cols["Column Name"] == "Certification Contact", "Description"] = "Contact person or contact details for the certifying body"
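
The repeated `cols.loc[...]` assignments above could also be written as a single lookup. The sketch below shows that alternative on a small made-up frame; `demo_df`, `demo_cols`, and `descriptions` are hypothetical names introduced for illustration, and only two of the descriptions are copied over. It is one possible way to get the same table, not a required change.

```python
import pandas as pd

# Small made-up frame standing in for the real df.
demo_df = pd.DataFrame({
    "ID": [0, 1],
    "Country of Origin": ["Colombia", "Taiwan"],
    "Altitude": ["1700-1930", "1200"],
})

# Column name -> description. Columns missing from this dict
# end up with an empty description.
descriptions = {
    "ID": "Unique identifier for each coffee record in the dataset",
    "Country of Origin": "Country where the coffee was grown",
}

demo_cols = pd.DataFrame({
    "Column Name": demo_df.columns,
    "Type": demo_df.dtypes.astype(str),
})

# map() looks up each column name in the dict; fillna("") covers the misses.
demo_cols["Description"] = demo_cols["Column Name"].map(descriptions).fillna("")

print(demo_cols)
```

Keeping the descriptions in one dict makes it easier to spot a column that has no entry yet.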

In [6]:
cols

Unnamed: 0,Column Name,Type,Description
Unnamed: 0,Unnamed: 0,int64,
ID,ID,int64,Unique identifier for each coffee record in th...
Country of Origin,Country of Origin,object,Country where the coffee was grown
Farm Name,Farm Name,object,Name of the farm where the coffee was grown
Lot Number,Lot Number,object,Producer or exporter batch identifier for this...
Mill,Mill,object,Wet or dry mill where the coffee was processed
ICO Number,ICO Number,object,International Coffee Organization identifier a...
Company,Company,object,Company or organization that submitted or owns...
Altitude,Altitude,object,Growing altitude of the coffee
Region,Region,object,Region where the coffee was produced
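
`Altitude` shows up as `object` because some entries are ranges such as `1700-1930` while others are single values such as `1200`. The sketch below is one possible way to turn such strings into numbers. It assumes every entry is either a single number or a `low-high` range; anything else (units, free text) would become `NaN`. The `altitude` series holds a few values copied from the rows shown earlier plus one missing value, and `low`, `high`, and `midpoint` are hypothetical names.

```python
import pandas as pd

# A few altitude strings in the two formats seen in the data, plus a missing value.
altitude = pd.Series(["1700-1930", "1200", "1850-2100", None])

# Split on the first "-" into at most two parts. reindex guarantees that
# column 1 exists even if no entry contains a range.
parts = altitude.str.split("-", n=1, expand=True).reindex(columns=[0, 1])

# Convert to numbers; unparseable or missing pieces become NaN.
low = pd.to_numeric(parts[0], errors="coerce")
high = pd.to_numeric(parts[1], errors="coerce").fillna(low)  # single value: high = low

# Use the midpoint of the range as one numeric altitude per coffee.
midpoint = (low + high) / 2

print(midpoint)
```

Using the midpoint is only one choice; keeping `low` and `high` as separate columns would preserve more information.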
