In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('cafe_sales.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction ID    10000 non-null  object
 1   Item              9667 non-null   object
 2   Quantity          9862 non-null   object
 3   Price Per Unit    9821 non-null   object
 4   Total Spent       9827 non-null   object
 5   Payment Method    7421 non-null   object
 6   Location          6735 non-null   object
 7   Transaction Date  9841 non-null   object
dtypes: object(8)
memory usage: 625.1+ KB


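The `info()` output above shows that every column has the `object` dtype, which is incorrect. Quantity, Price Per Unit, and Total Spent should be numeric, Transaction Date should be a datetime, and the remaining columns, including Location, should be strings.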

In [4]:
df.dtypes

Transaction ID      object
Item                object
Quantity            object
Price Per Unit      object
Total Spent         object
Payment Method      object
Location            object
Transaction Date    object
dtype: object

## Why We Replaced "ERROR" and "UNKNOWN" Entries with Missing Values
As part of the Data Cleaning & Processing phase, we made the decision to treat "ERROR", "UNKNOWN", and blank (NaN) values as invalid entries across several key columns, including Location, Payment Method, and Item.

### These values were dropped or flagged for the following reasons:
- They do not represent meaningful or analyzable categories: including them in analysis (like counts, trends, or visualizations) would distort results.
- They introduce ambiguity: it's unclear whether "UNKNOWN" means the customer didn't provide info, the data was lost, or it was never collected at all.
- They prevent accurate grouping or aggregation, especially in columns where we're measuring trends by category (e.g. sales by location or payment method).
- They signal data quality issues, and retaining them without clarity could lead to misleading insights.
- They cannot be imputed reliably: for categorical fields like Location or Payment Method, we lack sufficient context to fill in reasonable values without introducing bias.

I chose to flag these rows before replacing, to maintain transparency and allow for optional analysis of how many records were impacted by invalid or missing data.
- In total, this resulted in a notable reduction in the dataset size, but improved overall data quality and trustworthiness of the insights drawn from it.
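
To see how many records are affected before changing anything, the placeholder strings and the true missing values can be counted per column. The sketch below uses a small made-up DataFrame (`sample`) rather than the real cafe data, and the `PLACEHOLDERS` name is only an illustration:

```python
import pandas as pd

# Toy stand-in for the cafe data; the values are invented for illustration.
sample = pd.DataFrame({
    "Item": ["Coffee", "ERROR", "Cake", None],
    "Payment Method": ["Cash", "UNKNOWN", None, "Card"],
    "Location": ["UNKNOWN", "In-store", "Takeaway", "ERROR"],
})

PLACEHOLDERS = ["ERROR", "UNKNOWN"]

# Count placeholder strings and true missing values separately for each column.
summary = pd.DataFrame({
    "placeholder": sample.isin(PLACEHOLDERS).sum(),
    "missing": sample.isna().sum(),
})
summary["invalid_total"] = summary["placeholder"] + summary["missing"]
print(summary)
```

Running the same two counts on `df` would give the per-column impact for the real dataset.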



In [5]:
# Treat the placeholder strings as missing, then convert the column to a numeric dtype
df['Quantity'] = df['Quantity'].replace('UNKNOWN', np.nan)
df['Quantity'] = df['Quantity'].replace('ERROR', np.nan)
df['Quantity'] = pd.to_numeric(df['Quantity'])


In [6]:
df['Price Per Unit'] = df['Price Per Unit'].replace('UNKNOWN', np.nan)
df['Price Per Unit'] = df['Price Per Unit'].replace('ERROR', np.nan)
df['Price Per Unit'] = pd.to_numeric(df['Price Per Unit'])

In [7]:
df['Total Spent'] = df['Total Spent'].replace('UNKNOWN', np.nan)
df['Total Spent'] = df['Total Spent'].replace('ERROR', np.nan)
df['Total Spent'] = pd.to_numeric(df['Total Spent'])
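
The three numeric conversions above follow the same pattern, so they could also be written as one loop. This is only a sketch on made-up data: the helper name `to_clean_numeric` is hypothetical, and it swaps the explicit `replace` calls for `pd.to_numeric(..., errors="coerce")`. One difference to be aware of: `errors="coerce"` silently turns any unparseable text into NaN, not just "ERROR" and "UNKNOWN".

```python
import pandas as pd

def to_clean_numeric(series: pd.Series) -> pd.Series:
    """Convert a text column to numbers; entries that cannot be parsed become NaN."""
    return pd.to_numeric(series, errors="coerce")

# Toy stand-in for the cafe data; the values are invented for illustration.
sample = pd.DataFrame({
    "Quantity": ["2", "ERROR", "5", None],
    "Price Per Unit": ["3.0", "1.5", "UNKNOWN", "4.0"],
})

for col in ["Quantity", "Price Per Unit"]:
    sample[col] = to_clean_numeric(sample[col])

print(sample.dtypes)
```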

The columns above are numeric; the columns below are strings (`object` dtype).


In [8]:
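# Pattern for text columns: replace placeholders with NaN, then cast to str.
# Note: astype(str) can turn NaN into the literal text 'nan'.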
df['Transaction ID'] = df['Transaction ID'].replace('UNKNOWN', np.nan)
df['Transaction ID'] = df['Transaction ID'].replace('ERROR', np.nan)
df['Transaction ID'] = df['Transaction ID'].astype(str)

In [9]:
df['Item'] = df['Item'].replace('UNKNOWN', np.nan)
df['Item'] = df['Item'].replace('ERROR', np.nan)
df['Item'] = df['Item'].astype(str)

In [10]:
df['Payment Method'] = df['Payment Method'].replace('UNKNOWN', np.nan)
df['Payment Method'] = df['Payment Method'].replace('ERROR', np.nan)
df['Payment Method'] = df['Payment Method'].astype(str)
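
One caveat with the `astype(str)` calls above: on the pandas version that produced this notebook's output, missing values end up as the text `nan`, so they no longer count as missing. A possible alternative, shown here as a sketch on made-up data, is the nullable `"string"` dtype, which keeps missing values missing:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Item column; the values are invented for illustration.
items = pd.Series(["Coffee", "UNKNOWN", "Cake", "ERROR"])
items = items.replace(["UNKNOWN", "ERROR"], np.nan)

# The nullable string dtype keeps the two replaced entries as missing (<NA>).
as_string = items.astype("string")

print(as_string)
print(as_string.isna().sum())
```

With this dtype, `isna()`, `dropna()`, and `value_counts()` treat the replaced entries as missing rather than as a category of their own.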

### 🚩 What Does "Flagging" Mean?
Flagging means creating a new column that marks or identifies certain rows in your dataset based on a condition, like rows where the value is "UNKNOWN" or "ERROR".
- You're not deleting or changing anything; you're just labeling them.

### 📌 Why Flag?
* Count how many errors you had
* See patterns (e.g. do most "UNKNOWN" locations happen with cash payments?)
* Filter or exclude those rows later, without losing track of them
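
As a small illustration of the second point, a flag column can be cross-tabulated against another column to look for patterns. The data below is made up, so the counts say nothing about the real cafe dataset:

```python
import pandas as pd

# Toy stand-in for the cafe data; the values are invented for illustration.
sample = pd.DataFrame({
    "Location": ["UNKNOWN", "In-store", None, "Takeaway", "ERROR", "In-store"],
    "Payment Method": ["Cash", "Cash", "Cash", "Credit Card", "Credit Card", "Cash"],
})

# Flag first, so the information survives any later replacement.
sample["Location_was_unavailable"] = (
    sample["Location"].isin(["UNKNOWN", "ERROR"]) | sample["Location"].isna()
)

# How many rows were flagged, and how do they split by payment method?
flag_count = int(sample["Location_was_unavailable"].sum())
by_payment = pd.crosstab(sample["Payment Method"], sample["Location_was_unavailable"])

print(flag_count)
print(by_payment)
```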



In [11]:
# Step 1: Flag the rows
df['Location_was_unavailable'] = df['Location'].isin(['UNKNOWN', 'ERROR']) | df['Location'].isna()

# Step 2: Replace them with missing values
df['Location'] = df['Location'].replace(['UNKNOWN', 'ERROR'], np.nan)

df['Location'] = df['Location'].astype(str)  # Cast to str; note that NaN becomes the literal text 'nan'

In [12]:
print(df['Location_was_unavailable'].value_counts())


Location_was_unavailable
False    6039
True     3961
Name: count, dtype: int64


In [13]:
df.dtypes

Transaction ID               object
Item                         object
Quantity                    float64
Price Per Unit              float64
Total Spent                 float64
Payment Method               object
Location                     object
Transaction Date             object
Location_was_unavailable       bool
dtype: object

In [14]:
print("Location column missing values:")
print(df['Location'].value_counts(dropna=False))  # shows NaNs too


Location column missing values:
Location
nan         3961
Takeaway    3022
In-store    3017
Name: count, dtype: int64


In [15]:
df['Transaction Date'] = pd.to_datetime(df['Transaction Date'], errors='coerce')

# Add new time columns
# df['Day of Week'] = df['Transaction Date'].dt.day_name()
# df['Month'] = df['Transaction Date'].dt.month_name()
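
The commented-out time columns above could look like the following once enabled. This is a sketch on made-up dates, and it assumes the dates are in `YYYY-MM-DD` form, which may not match the real file:

```python
import pandas as pd

# Toy stand-in for the Transaction Date column; the values are invented.
dates = pd.Series(["2023-09-08", "ERROR", "2023-12-25", None])

# errors="coerce" turns anything that is not a valid date into NaT.
parsed = pd.to_datetime(dates, errors="coerce")

features = pd.DataFrame({
    "Transaction Date": parsed,
    "Day of Week": parsed.dt.day_name(),
    "Month": parsed.dt.month_name(),
})
print(features)
```

Rows with an invalid date get NaT in the date column and a missing value in both derived columns.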


In [16]:
df.to_csv("cleaned_cafe_data.csv", index=False)
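
Note that a CSV file does not store dtypes, so the datetime column has to be parsed again when the cleaned file is read back. A minimal sketch, using an in-memory buffer and made-up rows (the `TXN_1` style IDs are hypothetical) instead of the real file:

```python
import io
import pandas as pd

# Toy stand-in for the cleaned data; the values are invented for illustration.
cleaned = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2"],
    "Quantity": [2.0, None],
    "Transaction Date": pd.to_datetime(["2023-09-08", None]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype; numeric columns are inferred again.
reloaded = pd.read_csv(buffer, parse_dates=["Transaction Date"])
print(reloaded.dtypes)
```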
