# Data Cleaning Exercises

These exercises will help you practice cleaning and analyzing data using pandas.

## Exercise 1: Checking for Missing Data
Check whether the dataset contains missing values by counting how many there are in each column.

In [12]:
import pandas as pd
df = pd.read_csv("ecommerce_data.csv")  
missing_values = df.isnull().sum()
print(missing_values)

OrderID            0
CustomerID         0
ProductCategory    0
Quantity           3
Price              1
ShippingCost       4
OrderDate          0
DeliveredOnTime    0
dtype: int64
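
To try the same check without the CSV file, here is a minimal sketch on a small made-up frame. The `toy` data and the `missing_share` name are illustrative assumptions, not part of the exercise dataset. It also computes the share of missing values per column, which can be easier to judge than raw counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for ecommerce_data.csv
toy = pd.DataFrame({
    "Quantity": [1.0, np.nan, 4.0, np.nan],
    "Price": [10.0, 20.0, np.nan, 40.0],
    "OrderDate": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"],
})

# isnull() marks each cell True/False; sum() counts the True values per column
missing_counts = toy.isnull().sum()

# mean() of the same boolean mask gives the fraction of missing cells per column
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```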


## Exercise 2: Dropping Missing Values
Drop any rows with missing values from the dataset.

In [4]:
# Use .dropna() to drop rows with missing values
print(df.shape)
df_cleaned = df.dropna()
print(df_cleaned.shape)

(100, 8)
(92, 8)
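
Dropping every incomplete row can discard more data than necessary. As a hedged sketch on made-up data (the `toy` frame is an assumption for illustration), `dropna(subset=...)` limits the check to the columns that matter for a given analysis:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the exercise dataset
toy = pd.DataFrame({
    "Quantity": [1.0, np.nan, 4.0, 2.0],
    "Price": [10.0, 20.0, np.nan, 40.0],
    "ShippingCost": [5.0, 6.0, 7.0, np.nan],
})

# Default behavior: drop a row if any column is missing
all_complete = toy.dropna()

# subset= restricts the check to the listed columns
price_known = toy.dropna(subset=["Price"])

print(all_complete.shape)  # rows with no missing values at all
print(price_known.shape)   # rows where Price is present
```

Neither call modifies `toy`; both return new DataFrames, as `df.dropna()` does in the exercise.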


## Exercise 3: Replacing Missing Values
Replace any missing values in the "Quantity" column with the mean of that column.

In [7]:
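# Use .fillna() to replace missing values with the mean of "Quantity"
mean_quantity = df['Quantity'].mean()
# Column-specific form that matches the exercise:
# df['Quantity'] = df['Quantity'].fillna(mean_quantity)
# Note: the DataFrame-wide call below fills missing values in every column
# (including "Price" and "ShippingCost") with the mean of "Quantity".
df.fillna(mean_quantity, inplace=True)
# Verify that no missing values remain
print(df.isnull().sum())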

OrderID            0
CustomerID         0
ProductCategory    0
Quantity           0
Price              0
ShippingCost       0
OrderDate          0
DeliveredOnTime    0
dtype: int64
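
Calling `fillna` on the whole DataFrame fills missing values in every column, so changing only "Quantity" needs a column-level call. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the exercise dataset
toy = pd.DataFrame({
    "Quantity": [2.0, np.nan, 4.0],
    "Price": [10.0, np.nan, 30.0],
})

# mean() skips NaN, so this is the mean of the observed quantities
mean_quantity = toy["Quantity"].mean()

# Assign the filled column back; other columns keep their missing values
toy["Quantity"] = toy["Quantity"].fillna(mean_quantity)

print(toy.isnull().sum())
```

Assigning the result back avoids calling `inplace=True` on a selected column, which recent pandas versions warn about.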


## Exercise 4: Removing Duplicates
Remove any duplicate rows from the dataset.

In [15]:
# Count fully duplicated rows before removing them
print(df.duplicated().sum())
# Use .drop_duplicates() to remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates.shape)

1
(100, 8)
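
By default a row counts as a duplicate only when every column matches an earlier row. The sketch below, on made-up data (the `toy` frame and variable names are assumptions for illustration), shows that default next to the `subset` and `keep` parameters:

```python
import pandas as pd

# Hypothetical toy data; not the exercise dataset
toy = pd.DataFrame({
    "OrderID": [1, 2, 2, 3],
    "CustomerID": [1102, 1435, 1435, 1435],
    "Price": [128.51, 55.62, 55.62, 19.99],
})

# duplicated() flags a row only when all columns match an earlier row
n_full_duplicates = toy.duplicated().sum()

# Drop exact duplicate rows, keeping the first occurrence (the default)
deduped = toy.drop_duplicates()

# subset= compares only the listed columns; keep="last" retains the final match
one_per_customer = toy.drop_duplicates(subset=["CustomerID"], keep="last")

print(n_full_duplicates, deduped.shape, one_per_customer.shape)
```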


## Exercise 5: Renaming Columns
Rename the "ProductCategory" column to "Category".

In [17]:
# Use .rename() to rename the "ProductCategory" column
df_renamed = df.rename(columns={'ProductCategory': 'Category'})
print(df_renamed.head())

   OrderID  CustomerID  Category  Quantity   Price  ShippingCost   OrderDate  \
0        1        1102      Home       1.0  128.51         31.59  2023-01-01   
1        2        1435  Clothing       4.0   55.62          6.37  2023-01-02   
2        3        1860  Clothing       4.0  449.64          6.68  2023-01-03   
3        4        1270  Clothing       5.0  451.20         42.02  2023-01-04   
4        5        1106      Home       7.0  320.22         21.21  2023-01-05   

   DeliveredOnTime  
0            False  
1             True  
2             True  
3             True  
4            False  
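
Note that `rename()` returns a new DataFrame and leaves the original untouched, so `df` still has a "ProductCategory" column after this exercise. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data; not the exercise dataset
toy = pd.DataFrame({
    "ProductCategory": ["Home", "Books"],
    "Price": [128.51, 12.0],
})

# rename() returns a new DataFrame by default
renamed = toy.rename(columns={"ProductCategory": "Category"})

print(list(toy.columns))      # the original keeps its column names
print(list(renamed.columns))  # only the returned copy has the new name
```

To keep the new name going forward, assign the result back to the same variable.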


## Exercise 6: Summary Statistics
Calculate the mean, median, and mode for the "Price" column.

In [18]:
# Use .mean(), .median(), and .mode() to calculate summary statistics for "Price"
mean_price = df['Price'].mean()
median_price = df['Price'].median()
# .mode() returns a Series (ties are possible), so take the first value
mode_price = df['Price'].mode().iloc[0]
print(f"Mean: {mean_price}, Median: {median_price}, Mode: {mode_price}")

Mean: 274.3462, Median: 291.005, Mode: 150.41
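
Unlike `.mean()` and `.median()`, `.mode()` returns a Series, because several values can tie for most frequent. A small sketch with made-up prices (the `prices` Series is an assumption for illustration):

```python
import pandas as pd

# Hypothetical prices in which two values tie for most frequent
prices = pd.Series([10.0, 10.0, 25.0, 25.0, 40.0], name="Price")

modes = prices.mode()       # Series holding every tied mode, sorted ascending
first_mode = modes.iloc[0]  # a single scalar, convenient for printing

print(f"Modes: {modes.tolist()}, first mode: {first_mode}")
```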


## Exercise 7: Creating New Columns
Create a new column called "TotalCost" which is the sum of "Price" and "ShippingCost".

In [20]:
# Use simple arithmetic to create a new column "TotalCost"
df['TotalCost'] = df['Price'] + df['ShippingCost']
print(df.head())

   OrderID  CustomerID ProductCategory  Quantity   Price  ShippingCost  \
0        1        1102            Home       1.0  128.51         31.59   
1        2        1435        Clothing       4.0   55.62          6.37   
2        3        1860        Clothing       4.0  449.64          6.68   
3        4        1270        Clothing       5.0  451.20         42.02   
4        5        1106            Home       7.0  320.22         21.21   

    OrderDate  DeliveredOnTime  TotalCost  
0  2023-01-01            False     160.10  
1  2023-01-02             True      61.99  
2  2023-01-03             True     456.32  
3  2023-01-04             True     493.22  
4  2023-01-05            False     341.43  
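
Column arithmetic propagates missing values: if either operand is missing, so is the sum. The sketch below uses made-up data (the `toy` frame and the `TotalCostFilled` column are assumptions for illustration) to contrast plain `+` with `Series.add(..., fill_value=0)`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the exercise dataset
toy = pd.DataFrame({
    "Price": [100.0, 50.0, np.nan],
    "ShippingCost": [5.0, np.nan, 7.0],
})

# Plain + propagates NaN: a missing operand makes the sum missing
toy["TotalCost"] = toy["Price"] + toy["ShippingCost"]

# add(..., fill_value=0) treats a single missing operand as 0 instead
toy["TotalCostFilled"] = toy["Price"].add(toy["ShippingCost"], fill_value=0)

print(toy)
```

Whether a missing shipping cost should count as zero is a modeling decision, so the second form is only appropriate when that assumption holds.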


## Exercise 8: Counting Categories
Count the number of orders in each product category (the "ProductCategory" column).

In [22]:
# Use .value_counts() to count the occurrences of each category in "ProductCategory"
category_counts = df['ProductCategory'].value_counts()
print(category_counts)

ProductCategory
Books          32
Electronics    26
Home           25
Clothing       18
Name: count, dtype: int64
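
`value_counts()` can also report proportions and include missing values. A minimal sketch on a made-up Series (the `categories` data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical categories with one missing entry
categories = pd.Series(
    ["Books", "Home", "Books", None, "Clothing", "Books"],
    name="ProductCategory",
)

# Missing values are excluded by default
counts = categories.value_counts()

# normalize=True returns fractions of the non-missing rows
shares = categories.value_counts(normalize=True)

# dropna=False gives missing values their own row
counts_with_missing = categories.value_counts(dropna=False)

print(counts)
print(shares)
print(counts_with_missing)
```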


## Exercise 9: Aggregating Data
Group the data by "DeliveredOnTime" and calculate the average "Price" for each group.

In [23]:
# Use .groupby() to group by "DeliveredOnTime" and calculate the mean of "Price"
average_price_by_delivery = df.groupby('DeliveredOnTime')['Price'].mean()
print(average_price_by_delivery)

DeliveredOnTime
False    256.152955
True     288.640893
Name: Price, dtype: float64
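
A group mean is easier to interpret next to the group size and median. As a hedged sketch on made-up data (the `toy` frame and the output column names are assumptions for illustration), `agg()` with named outputs computes several statistics per group in one call:

```python
import pandas as pd

# Hypothetical toy data; not the exercise dataset
toy = pd.DataFrame({
    "DeliveredOnTime": [True, False, True, False, True],
    "Price": [100.0, 40.0, 200.0, 60.0, 300.0],
})

# Each keyword becomes an output column; its value names the aggregation
summary = toy.groupby("DeliveredOnTime")["Price"].agg(
    mean_price="mean",
    median_price="median",
    n_orders="count",
)

print(summary)
```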
