# Data Binning and Case Statements

## Data Binning
Data binning is a data preprocessing technique used to group continuous numeric values into intervals or "bins." The goals of data binning are:
* Reduce noise or small variations in the data,
* Simplify models or analyses,
* Facilitate visual and statistical interpretation.

Example: Product prices between 0 and 1 million rupiah can be divided into 3 bins:
* Low (0 – 300,000)
* Medium (300,000–700,000)
* High (700,000–1 million rupiah)
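
As a minimal sketch of the example above, the three price bins could be built with `pd.cut`. The sample prices and the names `prices`, `edges`, and `labels` are made up for illustration. Because bins are right-closed by default, a price of exactly 300,000 counts as Low here.

```python
import pandas as pd

# Hypothetical product prices in rupiah (illustrative values only)
prices = pd.Series([50_000, 300_000, 450_000, 700_000, 950_000])

# Bin edges from the example: 0, 300,000, 700,000 and 1 million
edges = [0, 300_000, 700_000, 1_000_000]
labels = ['Low', 'Medium', 'High']

# Bins are right-closed by default, so 300,000 belongs to 'Low';
# include_lowest=True also keeps a price of exactly 0 in the first bin
price_bin = pd.cut(prices, bins=edges, labels=labels, include_lowest=True)
print(price_bin.tolist())
```
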
## Case Statements
Case statements are conditional logic used to group or classify data based on certain conditions. This concept is similar to the CASE WHEN statement in SQL and is useful for:
* Creating new categories from numeric or categorical data,
* Combining if-else logic into column-based operations,
* Increasing the flexibility of data grouping.
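
One common way to write CASE WHEN-style logic on a DataFrame column is `np.select`. The sketch below is illustrative only: the DataFrame `df_demo`, its `quantity` values, and the labels are assumptions, not part of the dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical order quantities (illustrative values only)
df_demo = pd.DataFrame({'quantity': [-2, 1, 8, 40]})

# Conditions are checked in order, like successive WHEN clauses;
# the first condition that is True decides the label
conditions = [
    df_demo['quantity'] < 0,
    df_demo['quantity'] < 10,
]
choices = ['Return', 'Small order']

# default plays the role of ELSE
df_demo['order_type'] = np.select(conditions, choices, default='Bulk order')
print(df_demo)
```

Because the first matching condition wins, the second rule does not need to repeat `quantity >= 0`.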

# Import packages

In [40]:
import pandas as pd
import numpy as np

In [41]:
import os
os.getcwd()

'C:\\Users\\LENOVO\\Python\\Intermediate'

# Import data from CSV to DataFrame

In [42]:
data = pd.read_csv('C:/Users/LENOVO/Python/Online Retail Data.csv', header=0)
data = data[data['price'].between(0, 100, inclusive='right')].reset_index(drop=True)
data

Unnamed: 0,order_id,product_code,product_name,quantity,order_date,price,customer_id
0,493410,TEST001,This is a test product.,5,2010-01-04 09:24:00,4.50,12346.0
1,C493411,21539,RETRO SPOTS BUTTER DISH,-1,2010-01-04 09:43:00,4.25,14590.0
2,493412,TEST001,This is a test product.,5,2010-01-04 09:53:00,4.50,12346.0
3,493413,21724,PANDA AND BUNNIES STICKER SHEET,1,2010-01-04 09:54:00,0.85,
4,493413,84578,ELEPHANT TOY WITH BLUE T-SHIRT,1,2010-01-04 09:54:00,3.75,
...,...,...,...,...,...,...,...
457466,539991,21618,4 WILDFLOWER BOTANICAL CANDLES,1,2010-12-23 16:49:00,1.25,
457467,539991,72741,GRAND CHOCOLATECANDLE,4,2010-12-23 16:49:00,1.45,
457468,539992,21470,FLOWER VINE RAFFIA FOOD COVER,1,2010-12-23 17:41:00,3.75,
457469,539992,22258,FELT FARM ANIMAL RABBIT,1,2010-12-23 17:41:00,1.25,


# Cleaning dataframe

In [43]:
data.isna().sum()

order_id            0
product_code        0
product_name        0
quantity            0
order_date          0
price               0
customer_id     96858
dtype: int64

In [44]:
df = data.dropna()

In [45]:
df.isna().sum()

order_id        0
product_code    0
product_name    0
quantity        0
order_date      0
price           0
customer_id     0
dtype: int64

In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 360613 entries, 0 to 457442
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      360613 non-null  object 
 1   product_code  360613 non-null  object 
 2   product_name  360613 non-null  object 
 3   quantity      360613 non-null  int64  
 4   order_date    360613 non-null  object 
 5   price         360613 non-null  float64
 6   customer_id   360613 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 22.0+ MB


# Creating binned data in DataFrame

## Binning with a certain number of bins and the length of each bin is the same

At this stage, prices are binned into three categories: 'Cheap', 'Medium', and 'Expensive'. Passing an integer (3) to pd.cut() instead of explicit bin boundaries divides the observed price range into three equal-width bins. Category labels are then assigned for easier interpretation. The result is a new column, bin1, that holds the price category of each row.

In [47]:
df1 = df.copy()
category = ['Cheap','Medium','Expensive']
df1['bin1'] = pd.cut(df1['price'], 3, labels=category, include_lowest=True)
df1

Unnamed: 0,order_id,product_code,product_name,quantity,order_date,price,customer_id,bin1
0,493410,TEST001,This is a test product.,5,2010-01-04 09:24:00,4.50,12346.0,Cheap
1,C493411,21539,RETRO SPOTS BUTTER DISH,-1,2010-01-04 09:43:00,4.25,14590.0,Cheap
2,493412,TEST001,This is a test product.,5,2010-01-04 09:53:00,4.50,12346.0,Cheap
6,493414,21844,RETRO SPOT MUG,36,2010-01-04 10:28:00,2.55,14590.0,Cheap
7,493414,21533,RETRO SPOT LARGE MILK JUG,12,2010-01-04 10:28:00,4.25,14590.0,Cheap
...,...,...,...,...,...,...,...,...
457438,539988,84380,SET OF 3 BUTTERFLY COOKIE CUTTERS,1,2010-12-23 16:06:00,1.25,18116.0,Cheap
457439,539988,84849D,HOT BATHS SOAP HOLDER,1,2010-12-23 16:06:00,1.69,18116.0,Cheap
457440,539988,84849B,FAIRY SOAP SOAP HOLDER,1,2010-12-23 16:06:00,1.69,18116.0,Cheap
457441,539988,22854,CREAM SWEETHEART EGG HOLDER,2,2010-12-23 16:06:00,4.95,18116.0,Cheap


Next, we aggregate the bin1 column to calculate summary statistics for each category. Some of the metrics calculated are the number of rows (row_cnt), the minimum price (min_price), and the maximum price (max_price) in each bin. The difference between the maximum and minimum prices (bin_interval) is also calculated to assess price distribution within each category. These results help us understand the distribution and range of prices within each price group.

In [48]:
bin1_summary = df1.groupby('bin1', observed=True).agg(
    row_cnt=('product_code','count'),
    min_price=('price','min'),
    max_price=('price','max')
)
bin1_summary['bin_interval'] = bin1_summary['max_price'] - bin1_summary['min_price']
bin1_summary

bin1,row_cnt,min_price,max_price,bin_interval
Cheap,360163,0.001,32.69,32.689
Medium,360,34.95,65.0,30.05
Expensive,90,66.98,100.0,33.02


In bin1, the data is automatically divided into three equal price intervals, each labeled 'Cheap', 'Medium', and 'Expensive'. Although the interval lengths for each bin are roughly equal (approximately 30–33 price units), the distribution of products is highly skewed. There are 360,163 products in the 'Cheap' category, compared to only 360 for 'Medium' and 90 for 'Expensive'. This indicates that most products are in the low-price range, so dividing by fixed intervals does not guarantee a balanced number of products in each category.
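
The same effect can be reproduced on a small scale. The sketch below uses made-up prices (`toy_prices` is a hypothetical name) to show how a single large value stretches equal-width bins so that almost every row lands in the first one.

```python
import pandas as pd

# Hypothetical, heavily skewed prices: many small values and one large one
toy_prices = pd.Series([1, 2, 2, 3, 3, 4, 5, 6, 90])

# Three equal-width bins over the range 1 to 90
toy_bins = pd.cut(toy_prices, 3, labels=['Cheap', 'Medium', 'Expensive'])

# Count rows per bin, keeping the category order
counts = toy_bins.value_counts(sort=False)
print(counts)
```
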

The next approach uses a similar binning technique, but with greater control over the boundaries between bins. Using np.linspace(), the price range is explicitly divided into three equal parts, resulting in four boundary points. The labels 'Cheap', 'Medium', and 'Expensive' are again used for each bin, and the results are stored in the bin2 column. This provides more flexibility in adjusting the price range to suit the needs of the analysis.

In [49]:
df2 = df1.copy()
bins = np.linspace(df2['price'].min(), df2['price'].max(), 4)
category = ['Cheap','Medium','Expensive']
df2['bin2'] = pd.cut(df2['price'], bins, labels=category, include_lowest=True)
df2

Unnamed: 0,order_id,product_code,product_name,quantity,order_date,price,customer_id,bin1,bin2
0,493410,TEST001,This is a test product.,5,2010-01-04 09:24:00,4.50,12346.0,Cheap,Cheap
1,C493411,21539,RETRO SPOTS BUTTER DISH,-1,2010-01-04 09:43:00,4.25,14590.0,Cheap,Cheap
2,493412,TEST001,This is a test product.,5,2010-01-04 09:53:00,4.50,12346.0,Cheap,Cheap
6,493414,21844,RETRO SPOT MUG,36,2010-01-04 10:28:00,2.55,14590.0,Cheap,Cheap
7,493414,21533,RETRO SPOT LARGE MILK JUG,12,2010-01-04 10:28:00,4.25,14590.0,Cheap,Cheap
...,...,...,...,...,...,...,...,...,...
457438,539988,84380,SET OF 3 BUTTERFLY COOKIE CUTTERS,1,2010-12-23 16:06:00,1.25,18116.0,Cheap,Cheap
457439,539988,84849D,HOT BATHS SOAP HOLDER,1,2010-12-23 16:06:00,1.69,18116.0,Cheap,Cheap
457440,539988,84849B,FAIRY SOAP SOAP HOLDER,1,2010-12-23 16:06:00,1.69,18116.0,Cheap,Cheap
457441,539988,22854,CREAM SWEETHEART EGG HOLDER,2,2010-12-23 16:06:00,4.95,18116.0,Cheap,Cheap


Finally, the second binning result (bin2) is aggregated in the same manner as before. The calculated summary statistics include the number of data points, the minimum price, the maximum price, and the price intervals within each category. By comparing this summary to the previous result (bin1), we can evaluate whether the binning method affects the distribution and interpretation of the data.

In [51]:
bin2_summary = df2.groupby('bin2', observed=True).agg(
    row_cnt=('product_code','count'), 
    min_price=('price','min'), 
    max_price=('price','max'))
bin2_summary['bin_interval'] = bin2_summary['max_price'] - bin2_summary['min_price']
bin2_summary

bin2,row_cnt,min_price,max_price,bin_interval
Cheap,360163,0.001,32.69,32.689
Medium,360,34.95,65.0,30.05
Expensive,90,66.98,100.0,33.02


The bin2 method also produces identical results to bin1, because the bin boundaries used (np.linspace) produce three equidistant intervals from the minimum to the maximum price. The category labels used are also identical. Consequently, the distribution of product counts and other summary statistics (min, max, bin interval) for each category are also identical to bin1.
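
To check this equivalence directly, `pd.cut` can return the edges it computed via `retbins=True`. The following is a small sketch on made-up prices, not the notebook's data. When given an integer, `pd.cut` lowers the first edge slightly so that the minimum value is included, while the remaining edges match `np.linspace`.

```python
import numpy as np
import pandas as pd

# Hypothetical prices (illustrative values only)
toy_prices = pd.Series([0.5, 2.0, 4.0, 7.5, 10.0])

# Let pd.cut choose three equal-width bins and return the edges it used
_, auto_edges = pd.cut(toy_prices, 3, retbins=True)

# Build the edges by hand: four points give three intervals
manual_edges = np.linspace(toy_prices.min(), toy_prices.max(), 4)

print(auto_edges)    # first edge sits slightly below the minimum
print(manual_edges)  # first edge equals the minimum
```
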

Although their approaches differ slightly, both bin1 and bin2 divide the price data into three groups with equal price interval lengths. Both therefore produce highly unbalanced group sizes, because the original prices are not evenly distributed. If the goal is to divide the data into groups with roughly equal numbers of rows, quantile-based binning (pd.qcut) is more appropriate.
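
As a rough illustration of that alternative, the sketch below applies `pd.qcut` to made-up, skewed prices (`toy_prices` is hypothetical). With no repeated values, each of the three bins receives the same number of rows even though the price ranges differ widely.

```python
import pandas as pd

# Hypothetical skewed prices with no repeated values
toy_prices = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 90])

# Quantile-based bins: boundaries are placed so each bin holds a third of the rows
toy_bins = pd.qcut(toy_prices, q=3, labels=['Cheap', 'Medium', 'Expensive'])

counts = toy_bins.value_counts(sort=False)
print(counts)
```
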

## Binning with a certain length for each interval

In this section, prices are divided into fixed-width intervals of 10 price units each. pd.interval_range creates the bin boundaries, stepping from the minimum price toward the maximum in increments of 10 (freq=10). pd.cut then assigns each row to its price bin. This gives a more detailed view of how rows are distributed across price ranges.

In [52]:
df3 = df2.copy()
interval_range = pd.interval_range(start=df3['price'].min(), freq=10, end=df3['price'].max())
df3['bin3'] = pd.cut(df3['price'], bins=interval_range, include_lowest=True)
df3

Unnamed: 0,order_id,product_code,product_name,quantity,order_date,price,customer_id,bin1,bin2,bin3
0,493410,TEST001,This is a test product.,5,2010-01-04 09:24:00,4.50,12346.0,Cheap,Cheap,"(0.001, 10.001]"
1,C493411,21539,RETRO SPOTS BUTTER DISH,-1,2010-01-04 09:43:00,4.25,14590.0,Cheap,Cheap,"(0.001, 10.001]"
2,493412,TEST001,This is a test product.,5,2010-01-04 09:53:00,4.50,12346.0,Cheap,Cheap,"(0.001, 10.001]"
6,493414,21844,RETRO SPOT MUG,36,2010-01-04 10:28:00,2.55,14590.0,Cheap,Cheap,"(0.001, 10.001]"
7,493414,21533,RETRO SPOT LARGE MILK JUG,12,2010-01-04 10:28:00,4.25,14590.0,Cheap,Cheap,"(0.001, 10.001]"
...,...,...,...,...,...,...,...,...,...,...
457438,539988,84380,SET OF 3 BUTTERFLY COOKIE CUTTERS,1,2010-12-23 16:06:00,1.25,18116.0,Cheap,Cheap,"(0.001, 10.001]"
457439,539988,84849D,HOT BATHS SOAP HOLDER,1,2010-12-23 16:06:00,1.69,18116.0,Cheap,Cheap,"(0.001, 10.001]"
457440,539988,84849B,FAIRY SOAP SOAP HOLDER,1,2010-12-23 16:06:00,1.69,18116.0,Cheap,Cheap,"(0.001, 10.001]"
457441,539988,22854,CREAM SWEETHEART EGG HOLDER,2,2010-12-23 16:06:00,4.95,18116.0,Cheap,Cheap,"(0.001, 10.001]"


In [53]:
interval_range

IntervalIndex([  (0.001, 10.001],  (10.001, 20.001],  (20.001, 30.001],
                (30.001, 40.001],  (40.001, 50.001],  (50.001, 60.001],
                (60.001, 70.001],  (70.001, 80.001],  (80.001, 90.001],
               (90.001, 100.001]],
              dtype='interval[float64, right]')

In [54]:
bin3_summary = df3.groupby('bin3', observed=True).agg(
    row_cnt=('product_code','count'), 
    min_price=('price','min'), 
    max_price=('price','max'))
bin3_summary['bin_interval'] = bin3_summary['max_price'] - bin3_summary['min_price']
bin3_summary

bin3,row_cnt,min_price,max_price,bin_interval
"(0.001, 10.001]",350070,0.01,10.0,9.99
"(10.001, 20.001]",9822,10.08,20.0,9.92
"(20.001, 30.001]",252,20.3,30.0,9.7
"(30.001, 40.001]",242,31.78,39.95,8.17
"(40.001, 50.001]",102,41.12,50.0,8.88
"(50.001, 60.001]",17,50.05,60.0,9.95
"(60.001, 70.001]",15,62.0,70.0,8.0
"(70.001, 80.001]",55,71.61,79.95,8.34
"(80.001, 90.001]",5,80.7,88.5,7.8
"(90.001, 100.001]",20,92.03,100.0,7.97


The price distribution is heavily right-skewed: about 97% of rows are priced at 10 or below, and more than 99% at 20 or below. As prices increase, the number of rows drops sharply. This indicates that the product catalog is dominated by low-cost items, with only a few expensive ones. Fixed-width binning makes this uneven distribution easy to see. Note that the row counts above add up to 360,600, which is 13 fewer than the 360,613 rows in df, and the lowest price in the first bin is 0.01 rather than 0.001. The intervals are open on the left, so rows priced exactly at the minimum (0.001) are not assigned to any bin; include_lowest does not appear to apply when bins is an IntervalIndex.
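
One detail worth checking with `pd.interval_range` bins is what happens at the left edge of the first interval. The sketch below uses made-up prices to show that a value equal to that edge is left unassigned, and that passing plain edges with `include_lowest=True` is one possible way to keep it. The names `toy_prices`, `by_interval`, and `by_edges` are illustrative.

```python
import pandas as pd

# Hypothetical prices; 0.0 sits exactly on the left edge of the first bin
toy_prices = pd.Series([0.0, 4.5, 10.0, 25.0])

# Right-closed intervals (0, 10], (10, 20], (20, 30]: 0.0 lies outside all of them
interval_bins = pd.interval_range(start=0, end=30, freq=10)
by_interval = pd.cut(toy_prices, bins=interval_bins)

# Plain edges with include_lowest=True keep the minimum in the first bin
by_edges = pd.cut(toy_prices, bins=[0, 10, 20, 30], include_lowest=True)

print(by_interval.isna().tolist())
print(by_edges.isna().tolist())
```
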

## Binning with the length of each bin determined independently according to needs

In this process, binning uses manually defined price boundaries rather than fixed interval widths. DataFrame df4 is created as a copy of df3 to preserve the previous data. The bins_edges variable contains four boundary points (0, 30, 80, and 100), which form three categories: Cheap, Medium, and Expensive.

In [55]:
df4 = df3.copy()
bins_edges = [0, 30, 80, 100]
category = ['Cheap','Medium','Expensive']
df4['bin4'] = pd.cut(df4['price'], bins=bins_edges, labels=category, include_lowest=True)
df4

Unnamed: 0,order_id,product_code,product_name,quantity,order_date,price,customer_id,bin1,bin2,bin3,bin4
0,493410,TEST001,This is a test product.,5,2010-01-04 09:24:00,4.50,12346.0,Cheap,Cheap,"(0.001, 10.001]",Cheap
1,C493411,21539,RETRO SPOTS BUTTER DISH,-1,2010-01-04 09:43:00,4.25,14590.0,Cheap,Cheap,"(0.001, 10.001]",Cheap
2,493412,TEST001,This is a test product.,5,2010-01-04 09:53:00,4.50,12346.0,Cheap,Cheap,"(0.001, 10.001]",Cheap
6,493414,21844,RETRO SPOT MUG,36,2010-01-04 10:28:00,2.55,14590.0,Cheap,Cheap,"(0.001, 10.001]",Cheap
7,493414,21533,RETRO SPOT LARGE MILK JUG,12,2010-01-04 10:28:00,4.25,14590.0,Cheap,Cheap,"(0.001, 10.001]",Cheap
...,...,...,...,...,...,...,...,...,...,...,...
457438,539988,84380,SET OF 3 BUTTERFLY COOKIE CUTTERS,1,2010-12-23 16:06:00,1.25,18116.0,Cheap,Cheap,"(0.001, 10.001]",Cheap
457439,539988,84849D,HOT BATHS SOAP HOLDER,1,2010-12-23 16:06:00,1.69,18116.0,Cheap,Cheap,"(0.001, 10.001]",Cheap
457440,539988,84849B,FAIRY SOAP SOAP HOLDER,1,2010-12-23 16:06:00,1.69,18116.0,Cheap,Cheap,"(0.001, 10.001]",Cheap
457441,539988,22854,CREAM SWEETHEART EGG HOLDER,2,2010-12-23 16:06:00,4.95,18116.0,Cheap,Cheap,"(0.001, 10.001]",Cheap


In [56]:
bin4_summary = df4.groupby('bin4', observed=True).agg(
    row_cnt=('product_code','count'), 
    min_price=('price','min'), 
    max_price=('price','max'))
bin4_summary['bin_interval'] = bin4_summary['max_price'] - bin4_summary['min_price']
bin4_summary

bin4,row_cnt,min_price,max_price,bin_interval
Cheap,360157,0.001,30.0,29.999
Medium,431,31.78,79.95,48.17
Expensive,25,80.7,100.0,19.3


Price distribution is highly uneven, with the majority of products in the low-price range (Cheap), while only a few products are in the Medium and Expensive segments. This suggests that the pricing or inventory strategy may be focused more on low-cost products.
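
To express such an imbalance as proportions rather than raw counts, `value_counts(normalize=True)` can be applied to the binned column. This is a small sketch on made-up prices using the same edges (0, 30, 80, 100); the values and the name `share` are illustrative.

```python
import pandas as pd

# Hypothetical prices (illustrative values only)
toy_prices = pd.Series([1, 2, 3, 5, 8, 12, 29, 45, 79, 95])

# Same manual edges as above: (0, 30], (30, 80], (80, 100]
toy_bins = pd.cut(toy_prices, bins=[0, 30, 80, 100],
                  labels=['Cheap', 'Medium', 'Expensive'],
                  include_lowest=True)

# Fraction of rows in each bin, in category order
share = toy_bins.value_counts(normalize=True, sort=False)
print(share)
```
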

## Binning with a certain number of bins and the number of members in each bin is the same

At this stage, a binning process is performed using the quantile method (qcut) to divide the price data into three groups (bins): Cheap, Medium, and Expensive, each with a relatively equal number of members (products). Unlike cut, which divides based on a fixed value range, qcut automatically determines interval boundaries so that each bin covers the same proportion of data. Category labels are used to intuitively describe each bin.

In [57]:
df5 = df4.copy()
category = ['Cheap','Medium','Expensive']
df5['bin5'] = pd.qcut(df5['price'], q=3, labels=category)
df5

Unnamed: 0,order_id,product_code,product_name,quantity,order_date,price,customer_id,bin1,bin2,bin3,bin4,bin5
0,493410,TEST001,This is a test product.,5,2010-01-04 09:24:00,4.50,12346.0,Cheap,Cheap,"(0.001, 10.001]",Cheap,Expensive
1,C493411,21539,RETRO SPOTS BUTTER DISH,-1,2010-01-04 09:43:00,4.25,14590.0,Cheap,Cheap,"(0.001, 10.001]",Cheap,Expensive
2,493412,TEST001,This is a test product.,5,2010-01-04 09:53:00,4.50,12346.0,Cheap,Cheap,"(0.001, 10.001]",Cheap,Expensive
6,493414,21844,RETRO SPOT MUG,36,2010-01-04 10:28:00,2.55,14590.0,Cheap,Cheap,"(0.001, 10.001]",Cheap,Medium
7,493414,21533,RETRO SPOT LARGE MILK JUG,12,2010-01-04 10:28:00,4.25,14590.0,Cheap,Cheap,"(0.001, 10.001]",Cheap,Expensive
...,...,...,...,...,...,...,...,...,...,...,...,...
457438,539988,84380,SET OF 3 BUTTERFLY COOKIE CUTTERS,1,2010-12-23 16:06:00,1.25,18116.0,Cheap,Cheap,"(0.001, 10.001]",Cheap,Cheap
457439,539988,84849D,HOT BATHS SOAP HOLDER,1,2010-12-23 16:06:00,1.69,18116.0,Cheap,Cheap,"(0.001, 10.001]",Cheap,Medium
457440,539988,84849B,FAIRY SOAP SOAP HOLDER,1,2010-12-23 16:06:00,1.69,18116.0,Cheap,Cheap,"(0.001, 10.001]",Cheap,Medium
457441,539988,22854,CREAM SWEETHEART EGG HOLDER,2,2010-12-23 16:06:00,4.95,18116.0,Cheap,Cheap,"(0.001, 10.001]",Cheap,Expensive


In [58]:
bin5_summary = df5.groupby('bin5', observed=True).agg(
    row_cnt=('product_code','count'), 
    min_price=('price','min'), 
    max_price=('price','max'))
bin5_summary['bin_interval'] = bin5_summary['max_price'] - bin5_summary['min_price']
bin5_summary

bin5,row_cnt,min_price,max_price,bin_interval
Cheap,122747,0.001,1.25,1.249
Medium,136197,1.27,2.95,1.68
Expensive,101669,2.98,100.0,97.02


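qcut divided the rows into three groups of roughly similar size (122,747, 136,197, and 101,669 rows) but with very different price ranges. The counts are not exactly equal because many rows share the same price, and identical prices always land in the same bin. The Expensive group spans a far wider price range (2.98 to 100) than the other two.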
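
The effect of repeated prices on `pd.qcut` can be seen in a small sketch with made-up values (`toy_prices` is hypothetical). Four rows share the price 1, so they all fall into the first bin and the group sizes come out unequal.

```python
import pandas as pd

# Hypothetical prices where the value 1 is repeated four times
toy_prices = pd.Series([0.5, 1, 1, 1, 1, 2, 3, 4, 5])

# Ask for three quantile-based bins
toy_bins = pd.qcut(toy_prices, q=3, labels=['Cheap', 'Medium', 'Expensive'])

# Tied values cannot be split across bins, so counts differ
counts = toy_bins.value_counts(sort=False)
print(counts)
```
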

# Grouping data with case statements
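The code below uses np.select to classify products by type (tea or coffee, based on product_name) and price range. Conditions are evaluated in order, and rows that match none of them receive the default label 'Undetermined'. This kind of grouping supports analyses such as product segmentation, price comparisons between categories, or further visualization.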

In [59]:
df6 = df.copy()
df6['price_class'] = np.select(
    [(df6['product_name'].str.lower().str.contains('tea')) & (df6['price']<=2),
     (df6['product_name'].str.lower().str.contains('tea')) & (df6['price'].between(2, 4, inclusive='right')),
     (df6['product_name'].str.lower().str.contains('tea')) & (df6['price']>4),
     (df6['product_name'].str.lower().str.contains('coffee')) & (df6['price']<=3),
     (df6['product_name'].str.lower().str.contains('coffee')) & (df6['price'].between(3, 5, inclusive='right')),
     (df6['product_name'].str.lower().str.contains('coffee')) & (df6['price']>5)],
    ['Cheap Tea', 'Medium Tea', 'Expensive Tea', 'Cheap Coffee', 'Medium Coffee', 'Expensive Coffee'],
    default='Undetermined'
)
df6

Unnamed: 0,order_id,product_code,product_name,quantity,order_date,price,customer_id,price_class
0,493410,TEST001,This is a test product.,5,2010-01-04 09:24:00,4.50,12346.0,Undetermined
1,C493411,21539,RETRO SPOTS BUTTER DISH,-1,2010-01-04 09:43:00,4.25,14590.0,Undetermined
2,493412,TEST001,This is a test product.,5,2010-01-04 09:53:00,4.50,12346.0,Undetermined
6,493414,21844,RETRO SPOT MUG,36,2010-01-04 10:28:00,2.55,14590.0,Undetermined
7,493414,21533,RETRO SPOT LARGE MILK JUG,12,2010-01-04 10:28:00,4.25,14590.0,Undetermined
...,...,...,...,...,...,...,...,...
457438,539988,84380,SET OF 3 BUTTERFLY COOKIE CUTTERS,1,2010-12-23 16:06:00,1.25,18116.0,Undetermined
457439,539988,84849D,HOT BATHS SOAP HOLDER,1,2010-12-23 16:06:00,1.69,18116.0,Undetermined
457440,539988,84849B,FAIRY SOAP SOAP HOLDER,1,2010-12-23 16:06:00,1.69,18116.0,Undetermined
457441,539988,22854,CREAM SWEETHEART EGG HOLDER,2,2010-12-23 16:06:00,4.95,18116.0,Undetermined


Summary statistics for each price class

In [60]:
price_class_summary = df6.groupby('price_class').agg(
    row_cnt=('product_code','count'), 
    min_price=('price','min'), 
    max_price=('price','max'))
price_class_summary['price_interval'] = price_class_summary['max_price'] - price_class_summary['min_price']
price_class_summary

price_class,row_cnt,min_price,max_price,price_interval
Cheap Coffee,1531,0.85,2.95,2.1
Cheap Tea,5500,0.08,2.0,1.92
Expensive Coffee,127,5.45,60.0,54.55
Expensive Tea,3737,4.14,24.95,20.81
Medium Coffee,2,3.95,3.95,0.0
Medium Tea,4318,2.1,4.0,1.9
Undetermined,345398,0.001,100.0,99.999


Notes on the results:
* Most rows (345,398) fall into the Undetermined category: their product_name contains neither "tea" nor "coffee", so none of the rules apply.
* Among tea products, Cheap Tea is the largest group (5,500 rows), followed by Medium Tea (4,318) and Expensive Tea (3,737).
* Among coffee products, Cheap Coffee is the largest group (1,531 rows), followed by Expensive Coffee (127). Medium Coffee has only two rows, so this category is very rare.
* The price intervals for Expensive Coffee (54.55) and Expensive Tea (20.81) are very wide, indicating high price variation within those categories.
* The two Medium Coffee rows share the same price (3.95, interval 0), so this category is not representative and may need to be redefined.
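
Because the rules match "tea" as a substring, they also match names in which those letters appear inside a longer word. Whether that is desirable depends on the analysis. As a hedged sketch, a whole-word match with a regular expression is one alternative; the product name 'STEAM TRAIN' below is made up for illustration, and the names `substring_match` and `word_match` are hypothetical.

```python
import pandas as pd

# Two names taken from the output above plus one made-up name ('STEAM TRAIN')
names = pd.Series(['MOROCCAN TEA GLASS', 'STEAM TRAIN', 'VINTAGE RED TEATIME MUG'])

# Plain substring match, as in the rules above: every name contains 'tea'
substring_match = names.str.lower().str.contains('tea')

# Whole-word match: \b marks a word boundary, case=False ignores case
word_match = names.str.contains(r'\btea\b', case=False, regex=True)

print(substring_match.tolist())
print(word_match.tolist())
```

Note that the whole-word version also excludes "TEATIME", so the choice of pattern changes which rows end up in the tea categories.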

Filtering rows by specific criteria

In [62]:
df6[(df6['price_class']=='Undetermined') & (df6['price']==-53594.36)]

Unnamed: 0,order_id,product_code,product_name,quantity,order_date,price,customer_id,price_class


In [63]:
df6[df6['price_class']=='Cheap Tea']

Unnamed: 0,order_id,product_code,product_name,quantity,order_date,price,customer_id,price_class
22,493427,79000,MOROCCAN TEA GLASS,12,2010-01-04 10:43:00,0.85,13287.0,Cheap Tea
112,493433,84991,60 TEATIME FAIRY CAKE CASES,24,2010-01-04 12:51:00,0.55,14709.0,Cheap Tea
455,493467,84991,60 TEATIME FAIRY CAKE CASES,24,2010-01-04 14:29:00,0.55,13004.0,Cheap Tea
698,493573,21067,VINTAGE RED TEATIME MUG,36,2010-01-05 10:18:00,0.38,18145.0,Cheap Tea
831,493667,21067,VINTAGE RED TEATIME MUG,72,2010-01-05 12:13:00,0.38,13253.0,Cheap Tea
...,...,...,...,...,...,...,...,...
456558,539836,21067,VINTAGE RED TEATIME MUG,3,2010-12-22 13:46:00,1.25,17894.0,Cheap Tea
456623,539920,21069,VINTAGE BILLBOARD TEA MUG,6,2010-12-23 11:06:00,1.25,14085.0,Cheap Tea
456627,539920,84946,ANTIQUE SILVER TEA GLASS ETCHED,12,2010-12-23 11:06:00,1.25,14085.0,Cheap Tea
456678,539933,84991,60 TEATIME FAIRY CAKE CASES,24,2010-12-23 11:24:00,0.55,15235.0,Cheap Tea
