# Example usage

To use `wrangle_in_py` in a project:

# Introduction

In this tutorial you'll learn how to use the functions in `wrangle_in_py`, which handle common data wrangling and tidying tasks alongside the pandas Python package. These steps prepare a dataframe for whatever analysis comes next, including exploratory data analysis, formal progress reports, and machine learning.
For this tutorial we will create our own messy dataframe to work with as an example. To keep it simple, it has just 18 rows of data.

To walk through the process of using our `wrangle_in_py` package we'll introduce Susie, who has just opened her first ice cream shop and is eager to expand into more locations. Before she gets ahead of herself, she wants to collect some simple data about the employees in her shop to use later on. For now the data looks pretty messy, so let's walk through cleaning up the dataframe with her!

In [3]:
import pandas as pd
import wrangle_in_py

print(wrangle_in_py.__version__)

0.1.0


To begin, we'll load Susie's dataframe. We can see she already has quite a few employees, along with the task each one was assigned and their shift times.

In [4]:
data_dict = {
    "Shift Time": [
        "2025-01-01 09:00:00", "2025-01-01 17:00:00", "2025-01-04 01:00:00", 
        "2025-01-02 09:00:00", "2025-01-02 17:00:00", "2025-01-03 01:00:00", 
        "2025-01-05 09:00:00", "2025-01-03 17:00:00", "2025-01-04 01:00:00", 
        "2025-01-04 09:00:00", "2025-01-04 17:00:00", "2025-01-05 01:00:00", 
        "2025-01-05 09:00:00", "2025-01-05 17:00:00", "2025-01-06 01:00:00",
        "2025-01-06 01:00:00", "2025-01-06 01:00:00", "2025-01-06 01:00:00"
    ],
    "Employee ID": [101, 102, 103, 104, 105, 106, 107, 102, 103, 108, 109, 110, 101, 111, 112, 112, 112, 112],
    "Shop ID": [1] * 18,
    "Task Assigned": [
        "Inventory", "Cashier", None, "Cleaning", "Cashier", "Stocking", 
        "Inventory", None, "Cleaning", "Cashier", "Stocking", "Inventory", 
        "Cashier", None, "Cleaning", "Cleaning", "Cleaning", "Cleaning"
    ]
}

df = pd.DataFrame(data_dict)
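
Before cleaning anything, it can help to quantify how messy the data is. The sketch below uses only plain pandas (nothing from `wrangle_in_py`). `messiness_report` and `susie_df` are hypothetical names introduced for illustration, and the data is rebuilt so the cell runs on its own.

```python
import pandas as pd

# Susie's data from above, rebuilt compactly so this cell is self-contained.
susie_df = pd.DataFrame({
    "Shift Time": [
        "2025-01-01 09:00:00", "2025-01-01 17:00:00", "2025-01-04 01:00:00",
        "2025-01-02 09:00:00", "2025-01-02 17:00:00", "2025-01-03 01:00:00",
        "2025-01-05 09:00:00", "2025-01-03 17:00:00", "2025-01-04 01:00:00",
        "2025-01-04 09:00:00", "2025-01-04 17:00:00", "2025-01-05 01:00:00",
        "2025-01-05 09:00:00", "2025-01-05 17:00:00", "2025-01-06 01:00:00",
        "2025-01-06 01:00:00", "2025-01-06 01:00:00", "2025-01-06 01:00:00",
    ],
    "Employee ID": [101, 102, 103, 104, 105, 106, 107, 102, 103,
                    108, 109, 110, 101, 111, 112, 112, 112, 112],
    "Shop ID": [1] * 18,
    "Task Assigned": [
        "Inventory", "Cashier", None, "Cleaning", "Cashier", "Stocking",
        "Inventory", None, "Cleaning", "Cashier", "Stocking", "Inventory",
        "Cashier", None, "Cleaning", "Cleaning", "Cleaning", "Cleaning",
    ],
})


def messiness_report(frame):
    """Hypothetical helper: count missing values, duplicate rows, and constant columns."""
    return {
        "rows": len(frame),
        # Number of missing cells in each column
        "missing_per_column": frame.isna().sum().to_dict(),
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(frame.duplicated().sum()),
        # Columns that hold a single value in every row
        "constant_columns": [
            col for col in frame.columns if frame[col].nunique(dropna=False) == 1
        ],
    }


report = messiness_report(susie_df)
print(report)
```

On this data the report shows 18 rows, three missing `Task Assigned` values, three fully duplicated rows, and `Shop ID` as the only constant column. These are the problems the following sections work through.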

# function 1 - rename columns

explain..

set-up

In [None]:
#code

In [None]:
# more code

In [None]:
# more code

conclusion

# function 2 - Remove duplicates

introduction...

more info

In [None]:
# code

In [None]:
# more code

conclusion..

# function 3 - remove columns based on missing-data threshold / low coefficient of variation

Next, Susie wants to simplify things for later by removing any columns that aren't useful to her. She's decided that any column missing more than 10% of its data should be removed, since it won't help when she's deciding how many employees a new shop needs. She also wants to remove any columns that hold redundant information, meaning the same value in every cell, since they add nothing to her dataset and she finds purposeless columns distracting.

Thus Susie wants to remove the columns "task_assigned" (for missing data) and "shop_id" (for redundant info). She could do this by examining her dataframe, noting down the columns to remove, then deleting them with the pandas `.drop()` method. But there's a better way! The `column_drop_threshold` function in this package calculates the proportion of missing data in every column and removes the columns that exceed a threshold specified by Susie (or you, the user)! The function also has an optional argument that, when included, removes any columns with a coefficient of variation lower than the value specified. This way, a column that stores the same value in every row can be removed without examining the dataframe to find it.

As such, Susie has decided to remove any columns with more than 10% missing data and any columns with a coefficient of variation lower than 0.001.
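
For comparison, the manual route described above might look like the following sketch in plain pandas. It assumes the coefficient of variation is the standard deviation divided by the absolute mean of each numeric column; how `column_drop_threshold` computes it internally is not shown in this tutorial and may differ. `manual_column_drop` and the small `demo` frame (with already-standardized column names) are hypothetical stand-ins.

```python
import pandas as pd

# Small stand-in frame; the column names are assumed to be standardized already.
demo = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    "shop_id": [1] * 10,
    "task_assigned": ["Inventory", "Cashier", None, "Cleaning", "Cashier",
                      "Stocking", "Inventory", None, "Cleaning", "Cashier"],
})


def manual_column_drop(frame, threshold, variance=None):
    """Hypothetical helper: drop columns by missing share and, optionally, low variation."""
    # Share of missing cells per column (0.0 to 1.0)
    missing_share = frame.isna().mean()
    to_drop = set(missing_share[missing_share > threshold].index)

    if variance is not None:
        # Assumed definition: std / |mean|, computed on numeric columns only
        numeric = frame.select_dtypes(include="number")
        coeff_of_variation = numeric.std() / numeric.mean().abs()
        to_drop |= set(coeff_of_variation[coeff_of_variation < variance].index)

    return frame.drop(columns=sorted(to_drop))


cleaned = manual_column_drop(demo, threshold=0.1, variance=0.001)
print(list(cleaned.columns))  # ['employee_id']
```

Here `task_assigned` is dropped because 2 of its 10 values (20%) are missing, and `shop_id` is dropped because a constant column has a coefficient of variation of 0.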

In [None]:
# Assumes column_drop_threshold has already been imported from wrangle_in_py.
# The default for variance is None, so it can be omitted if we don't want columns removed on that basis.

column_drop_threshold(df, threshold=0.1, variance=0.001)

In [None]:
# more code

Wonderful! Now we can see the columns we wanted removed are gone. Rather than going through the entire dataframe and manually calculating missing-data proportions and coefficients of variation, we let the function do it and saved some time! Now Susie can spend her free time finding great locations for a new shop and training more employees!

# function 4 - date time

introduction..

more info..

In [None]:
# code

In [None]:
# more code

function 4 conclusion

general summary

overall conclusion of full example