# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before you start coding, provide the link to your dataset below.

My dataset:

Import the necessary libraries and create your dataframe(s).

In [None]:
# I'm importing pandas and numpy to help with the data cleanup.
import pandas as pd
import numpy as np

# Now that I've moved the file into this folder, I can load it directly.
df = pd.read_csv('flavors_of_cacao.csv')

# The column names are really messy, so I'm renaming them to simple words.
# This makes it much easier to write the rest of my cleaning code.
df.columns = ['company', 'bean_origin', 'ref', 'review_date', 'cocoa_percent', 
              'company_location', 'rating', 'bean_type', 'broad_bean_origin']

# Turning the cocoa percentage into a number so I can do math with it later.
df['cocoa_percent'] = df['cocoa_percent'].str.replace('%', '').astype(float)

# Just checking the first few rows to make sure it loaded right.
df.head()

Unnamed: 0,company,bean_origin,ref,review_date,cocoa_percent,company_location,rating,bean_type,broad_bean_origin
0,A. Morin,Agua Grande,1876,2016,63.0,France,3.75,,Sao Tome
1,A. Morin,Kpime,1676,2015,70.0,France,2.75,,Togo
2,A. Morin,Atsane,1676,2015,70.0,France,3.0,,Togo
3,A. Morin,Akata,1680,2015,70.0,France,3.5,,Togo
4,A. Morin,Quilla,1704,2015,70.0,France,3.5,,Peru


## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [4]:
# First, I'll check for any obvious null values using isnull.
print(df.isnull().sum())

# My thought process: isnull() only counts true NaN values, so it misses entries that are
# just empty strings or spaces, which I noticed in the 'bean_type' and 'broad_bean_origin' columns.
# I don't want to delete these rows because I'd lose too much data.
# Instead, I'm filling those empty spots with 'Unknown' so the data is consistent.

df['bean_type'] = df['bean_type'].fillna('Unknown')
df['bean_type'] = df['bean_type'].apply(lambda x: 'Unknown' if str(x).strip() == "" else x)

df['broad_bean_origin'] = df['broad_bean_origin'].fillna('Unknown')
df['broad_bean_origin'] = df['broad_bean_origin'].apply(lambda x: 'Unknown' if str(x).strip() == "" else x)

company              0
bean_origin          0
ref                  0
review_date          0
cocoa_percent        0
company_location     0
rating               0
bean_type            1
broad_bean_origin    1
dtype: int64
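
To see why the null count alone can be misleading, here is a minimal sketch on a small made-up sample (the values below are hypothetical, not rows from my dataset). It counts true nulls and whitespace-only entries separately, and then does the same 'Unknown' fill with `where` instead of `apply`.

```python
import pandas as pd

# Hypothetical sample: one true missing value, one space, one empty string.
sample = pd.DataFrame({
    'bean_type': ['Criollo', None, ' ', '', 'Trinitario'],
})

# isnull() only sees the true missing value.
true_nulls = sample['bean_type'].isnull().sum()

# Strip whitespace so blank-looking entries become empty strings.
stripped = sample['bean_type'].fillna('').astype(str).str.strip()

# Everything that is empty after stripping, minus the true nulls, was "hidden".
hidden_blanks = (stripped == '').sum() - true_nulls
print(f"True nulls: {true_nulls}, hidden blanks: {hidden_blanks}")

# Keep real values, replace nulls and blanks with 'Unknown'.
cleaned = sample['bean_type'].where(stripped != '', 'Unknown')
print(cleaned.tolist())
```

Running a count like this before and after the fill is one way to confirm that no blank entries are left behind.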


## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [3]:
# I'm using the IQR (Interquartile Range) method to check for any ratings that are way outside the normal range.
Q1 = df['rating'].quantile(0.25)
Q3 = df['rating'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# This helps me see if there are any bars with really extreme ratings.
outliers = df[(df['rating'] < lower_bound) | (df['rating'] > upper_bound)]
print(f"Number of outliers found: {len(outliers)}")

# My thought process: Even though these are technically "outliers" (like a 1.0 rating), 
# I'm deciding to keep them. In chocolate tasting, some bars are just really bad, 
# and that's important information for my analysis of quality.

Number of outliers found: 19
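
To make the IQR rule easier to follow, here is a small worked sketch on ten made-up ratings (hypothetical values, not taken from my dataset). Since I decided to keep the outliers, it also shows one possible way to flag them in a new column instead of dropping them; the `is_outlier` column name is just my own choice.

```python
import pandas as pd

# Hypothetical ratings, already sorted so the quartiles are easy to check by hand.
ratings = pd.Series([1.0, 2.75, 3.0, 3.0, 3.25, 3.25, 3.5, 3.5, 3.75, 5.0])

q1 = ratings.quantile(0.25)   # 3.0
q3 = ratings.quantile(0.75)   # 3.5
iqr = q3 - q1                 # 0.5

lower = q1 - 1.5 * iqr        # 2.25
upper = q3 + 1.5 * iqr        # 4.25

# Flag instead of drop, so the rows stay available for analysis.
sample = pd.DataFrame({'rating': ratings})
sample['is_outlier'] = (sample['rating'] < lower) | (sample['rating'] > upper)

flagged = sample.loc[sample['is_outlier'], 'rating'].tolist()
print(f"Bounds: {lower} to {upper}, flagged ratings: {flagged}")
```

In this sample only 1.0 and 5.0 fall outside the bounds; 2.75 is low but still inside the lower bound of 2.25.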


## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address them as needed. Make sure to use code comments to illustrate your thought process.

In [5]:
# I'm checking to see if there are any identical rows that got entered twice.
duplicates = df.duplicated().sum()
print(f"Duplicates found: {duplicates}")

# My thought process: Since I didn't find any exact duplicates, I don't need to drop rows. 
# I'm also keeping the 'ref' and 'review_date' columns for now because they might be 
# useful if I want to see if chocolate quality has changed over the years.

Duplicates found: 0
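
`duplicated()` with no arguments only catches rows that match in every column. As a possible follow-up check, this sketch uses the `subset` and `keep` parameters to look for rows that share the same identifying columns but differ elsewhere. The sample rows and the choice of key columns are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample: the first two rows describe the same bar with different ratings.
sample = pd.DataFrame({
    'company': ['Maker A', 'Maker A', 'Maker B'],
    'bean_origin': ['Origin X', 'Origin X', 'Origin Y'],
    'review_date': [2015, 2015, 2016],
    'rating': [2.75, 3.0, 3.5],
})

# No row is identical in every column, so this is 0.
exact_duplicates = sample.duplicated().sum()

# Assumed key columns; keep=False marks every row in a duplicated group.
key_cols = ['company', 'bean_origin', 'review_date']
partial = sample[sample.duplicated(subset=key_cols, keep=False)]

print(f"Exact duplicates: {exact_duplicates}")
print(partial)
```

Rows found this way are not necessarily errors (the same bar could be reviewed twice), so I would inspect them before deciding to drop anything.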


## Inconsistent Data

Check for inconsistent data and address any that arise. As always, use code comments to illustrate your thought process.

In [7]:
# I noticed that some country names are written differently, like 'U.S.A.' and 'USA'. 
# I'm creating a small function to fix these so they all group together properly.

def fix_countries(name):
    name = str(name).strip()
    if name == 'U.S.A.' or name == 'USA':
        return 'United States'
    if name == 'U.K.':
        return 'United Kingdom'
    if name == 'Domincan Republic': # Fixing a typo I spotted in the original file
        return 'Dominican Republic'
    return name

# Applying the fix to both location columns to keep it clean.
df['company_location'] = df['company_location'].apply(fix_countries)
df['broad_bean_origin'] = df['broad_bean_origin'].apply(fix_countries)

# Finally, I'm saving this clean version as a new CSV for my next assignments.
df.to_csv('chocolate_bars_cleaned.csv', index=False)
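
An alternative to the if-statements in `fix_countries` is a lookup dictionary passed to `Series.replace`. This is only a sketch of the same idea on a made-up list of locations; the `COUNTRY_FIXES` name is my own, and the mapping only covers the variants I spotted above.

```python
import pandas as pd

# Same fixes as fix_countries, written as a lookup table.
COUNTRY_FIXES = {
    'U.S.A.': 'United States',
    'USA': 'United States',
    'U.K.': 'United Kingdom',
    'Domincan Republic': 'Dominican Republic',  # typo as it appears in the file
}

# Hypothetical sample, including one value with a stray leading space.
locations = pd.Series([' U.S.A.', 'USA', 'U.K.', 'France', 'Domincan Republic'])

# Strip whitespace first, then swap in the standardized names.
standardized = locations.str.strip().replace(COUNTRY_FIXES)
print(standardized.tolist())
```

Adding a new spelling later only means adding one line to the dictionary, and values that are not in the dictionary pass through unchanged.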

## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset?

Not quite all four, but I found three of them. I had missing data in the bean type and broad bean origin columns (hidden as blanks), inconsistent country names like "U.S.A." vs. "USA," and 19 irregular ratings that counted as statistical outliers. I also checked for unnecessary data (duplicates); there weren't any exact duplicates, but it was important to verify.

2. Did the process of cleaning your data give you new insights into your dataset?

Definitely. I realized that "missing data" isn't always a null value; sometimes people just leave it blank or hit the space bar. It also showed me just how global the chocolate industry is when I had to standardize all the different country names.

3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations?

Converting the cocoa percentage from a string to a float was the most important step. In my last checkpoint, I couldn't do any real math with that column, but now that it's a number, I can actually run statistics and see the trends clearly in my charts.
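
As a small illustration of why the conversion matters, here is a sketch on made-up values (not rows from my dataset). It uses `pd.to_numeric` with `errors='coerce'` as a more forgiving variation on the `astype(float)` call I used, so a malformed entry becomes NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values, including one entry that is not a percentage.
raw = pd.Series(['63%', '70%', '72.5%', 'n/a'])

# Remove the trailing percent sign, then convert; bad entries become NaN.
cocoa = pd.to_numeric(raw.str.rstrip('%'), errors='coerce')

print(cocoa.tolist())
print(f"Entries that failed to convert: {cocoa.isna().sum()}")

# Numeric columns support statistics; NaN values are skipped by default.
print(f"Mean cocoa percent: {cocoa.mean()}")
```

Counting the NaN values after the conversion is a quick way to check whether anything in the column did not parse.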