### Lesson 4: Manipulation and grouping of data

### Part 2.4.1: What are the unique products?
- In this lecture, we will focus on answering business questions such as:
    - How do we replace a value in a column with something else?
    - Which unique product sectors are available in the data?
    - What is the proportion of each product sector?
    - Is a product of a particular brand listed for sale or not?

In [1]:
import pandas as pd

In [2]:
# Loading the data set which does not contain any missing values
pos_data = pd.read_csv('POS_CleanData.csv')
pos_data.head()

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic
0,SKU1029,05-01-21,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,0,0,0
1,SKU1054,05-08-21,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
2,SKU1068,01-08-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0
3,SKU1056,11-05-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
4,SKU1061,12-10-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0


In [3]:
# quickly check its shape
pos_data.shape

(31057, 10)

#### How to replace a value in a column with some other value?
- Let us replace 'Toothpaste' in the Category column with 'Tooth Paste'.

In [4]:
# let us make the replacement in the 'Category' column
pos_data['Category'] = pos_data['Category'].replace('Toothpaste', 'Tooth Paste')
pos_data.head()

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic
0,SKU1029,05-01-21,Synergix solutions,Oral Care,Tooth Paste,Whitening Toothpaste,Close-up,0,0,0
1,SKU1054,05-08-21,Synergix solutions,Oral Care,Tooth Paste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
2,SKU1068,01-08-22,Synergix solutions,Oral Care,Tooth Paste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0
3,SKU1056,11-05-22,Synergix solutions,Oral Care,Tooth Paste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
4,SKU1061,12-10-22,Synergix solutions,Oral Care,Tooth Paste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0
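
One detail of the replacement above is easy to miss: with plain strings, `replace()` on a column matches whole cell values only. The sketch below illustrates this on a small made-up column (the values are hypothetical, not taken from the POS file) and contrasts it with a dictionary mapping and with the substring-based `.str.replace()`.

```python
import pandas as pd

# Hypothetical toy column, used only for illustration
toy = pd.DataFrame({'Category': ['Toothpaste', 'Toothpaste Gel', 'Mouthwash']})

# replace() with plain strings matches whole cell values only,
# so 'Toothpaste Gel' is left untouched
whole = toy['Category'].replace('Toothpaste', 'Tooth Paste')

# a dictionary replaces several values in one call
mapped = toy['Category'].replace({'Toothpaste': 'Tooth Paste',
                                  'Mouthwash': 'Mouth Wash'})

# .str.replace() works on substrings inside each value
partial = toy['Category'].str.replace('Toothpaste', 'Tooth Paste', regex=False)

print(whole.tolist())
print(mapped.tolist())
print(partial.tolist())
```

Which variant is appropriate depends on whether the values to be changed are complete labels or fragments of longer labels.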


#### What are the different sectors present in the sales data?


In [5]:
pos_data['Sector'].unique()

array(['Oral Care', 'Fabric Care', 'Beauty and Personal Care'],
      dtype=object)

In [6]:
# use unique() method and display the result as a list
print("Different sectors of products available are:\n", list(pos_data['Sector'].unique()))

Different sectors of products available are:
 ['Oral Care', 'Fabric Care', 'Beauty and Personal Care']


#### How many products belong to each of these sectors?

In [7]:
pos_data['Sector'].value_counts()

Fabric Care                 15456
Beauty and Personal Care     9250
Oral Care                    6351
Name: Sector, dtype: int64

In [8]:
# use value_counts() method
print('Following is the list of sectors along with their count:\n', pos_data['Sector'].value_counts())

Following is the list of sectors along with their count:
 Fabric Care                 15456
Beauty and Personal Care     9250
Oral Care                    6351
Name: Sector, dtype: int64


#### What is the proportion of each sector in the sales data?

In [9]:
pos_data['Sector'].value_counts(normalize=True)

Fabric Care                 0.497666
Beauty and Personal Care    0.297839
Oral Care                   0.204495
Name: Sector, dtype: float64

In [10]:
# use value_counts() method
print('Following is the list of sectors along with their proportion:')
round(pos_data['Sector'].value_counts(normalize=True) * 100, 2)

Following is the list of sectors along with their proportion:


Fabric Care                 49.77
Beauty and Personal Care    29.78
Oral Care                   20.45
Name: Sector, dtype: float64

**Explanation:**
- The `value_counts()` method with the argument `normalize=True` returns the relative frequency of each value (a fraction between 0 and 1) instead of the raw count.
- Multiplying these fractions by 100 gives percentages.
- The `round()` function limits the result to the desired number of decimal places (2 here).
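
The three steps in the explanation can be traced on a tiny made-up series (hypothetical values, chosen so that the arithmetic is easy to check by hand). This is only a sketch of the same idea, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical toy series: three rows of one sector, one row of another
sectors = pd.Series(['Fabric Care', 'Fabric Care', 'Oral Care', 'Fabric Care'],
                    name='Sector')

counts = sectors.value_counts()                     # raw counts
fractions = sectors.value_counts(normalize=True)    # counts divided by the total
percent = (fractions * 100).round(2)                # percentages with 2 decimals

print(counts)
print(percent)
```

The fractions always add up to 1, which is a quick sanity check on the result.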

#### How to check whether a particular categorical value exists in a column?
- Let us check whether the 'Lakme' brand exists in the sales data.

In [11]:
# List out all the unique brands in the data
brand_list = pos_data['Brand'].unique().tolist()
print("Various brands available are:\n",brand_list)

Various brands available are:
 ['Close-up', "Tom's of Maine", 'Himalaya Herbals', 'Colgate', 'Sensodyne', 'Philips', 'Oral-B', 'Listerine', 'Scope', 'Crest', 'Downy', 'Gain', 'Bounce', 'comfort', 'Ariel', 'Tide', 'Olay', 'Clinique', 'Aveeno', 'Cetaphil', 'Neutrogena', 'Dove', 'Pantene']


In [12]:
if 'Lakme' in brand_list:
    print('Lakme is one of the brands in POS data')
else:
    print('Lakme is not in POS data')

Lakme is not in POS data


In [13]:
if 'Tide' in brand_list:
    print('Tide is present')
else:
    print('Tide is absent')

Tide is present


In [14]:
# alternatively, we can filter the rows with isin() and check whether any row is left

df=pos_data[pos_data['Brand'].isin(['Colgate'])]

if df.shape[0] != 0:
    print('Colgate is one of the brands in POS data')
else:
    print('Colgate is not in POS data')

Colgate is one of the brands in POS data
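
The membership checks above can also be written without building a list or a filtered dataframe. The sketch below uses a small made-up series of brands (hypothetical data) and also shows a common pitfall: applying `in` directly to a pandas Series tests the index labels, not the values.

```python
import pandas as pd

# Hypothetical toy series of brands
brands = pd.Series(['Tide', 'Colgate', 'Dove', 'Tide'], name='Brand')

# eq() / isin() give a boolean Series; any() is True if at least one row matches
has_tide = brands.eq('Tide').any()
has_lakme = brands.isin(['Lakme']).any()

# pitfall: `in` on a Series looks at the index labels (0, 1, 2, 3 here)
in_series = 'Tide' in brands

# converting to a set (or list) checks the values instead
in_values = 'Tide' in set(brands)

print(has_tide, has_lakme, in_series, in_values)
```

This is why the lecture first converts the unique brands to a list before using `in`.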


### Part 2.4.2: How much did we sell per category?
- To answer such questions, we use the `groupby()` method.
- We will be able to answer several business questions, such as finding the total revenue per sector, the average revenue per category, or the average number of visitors to a particular brand's page on the retailer's website.

#### What is the total revenue per sector?
- Use the `groupby()` method with `sum()` to get a dataframe containing the sum of every numeric column for each sector.
- Then display only the revenue column of this dataframe.

In [15]:
pos_data.head()

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic
0,SKU1029,05-01-21,Synergix solutions,Oral Care,Tooth Paste,Whitening Toothpaste,Close-up,0,0,0
1,SKU1054,05-08-21,Synergix solutions,Oral Care,Tooth Paste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
2,SKU1068,01-08-22,Synergix solutions,Oral Care,Tooth Paste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0
3,SKU1056,11-05-22,Synergix solutions,Oral Care,Tooth Paste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
4,SKU1061,12-10-22,Synergix solutions,Oral Care,Tooth Paste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0


In [16]:
# groupby() is used with sum() to get the total per sector
# numeric_only=True makes sure that only the numeric columns of the dataframe are summed
df = pos_data.groupby('Sector').sum(numeric_only=True)
df

Sector,Revenue($),Units_sold,Page_traffic
Beauty and Personal Care,133734179,6517589,18931575
Fabric Care,220755928,10802487,31590093
Oral Care,92021092,4465306,13206428


In [17]:
#display only the total revenue per sector in the descending order
df['Revenue($)'].sort_values(ascending = False)

Sector
Fabric Care                 220755928
Beauty and Personal Care    133734179
Oral Care                    92021092
Name: Revenue($), dtype: int64
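
When only one column is of interest, it can be selected before aggregating, so that the other numeric columns are never summed. The following is a minimal sketch on made-up numbers (the dataframe `toy` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical toy data with the same column names as the POS data
toy = pd.DataFrame({
    'Sector': ['Oral Care', 'Fabric Care', 'Oral Care', 'Fabric Care'],
    'Revenue($)': [100, 400, 50, 250],
    'Units_sold': [10, 40, 5, 25],
})

# select the revenue column first, then sum and sort in one chain
revenue_per_sector = (
    toy.groupby('Sector')['Revenue($)']
       .sum()
       .sort_values(ascending=False)
)
print(revenue_per_sector)
```

The result is a Series indexed by sector, the same shape as the output of the cell above.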

#### What are the top 3 selling brands?

In [18]:
# first group the data by brand, and then sort the results in descending order
df=pos_data.groupby('Brand').sum(numeric_only=True)     
df['Revenue($)'].sort_values(ascending = False)[0:3]   #showing only top 3 using slicing

Brand
Gain     79016458
Ariel    44421501
Downy    30870829
Name: Revenue($), dtype: int64
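
Sorting followed by slicing can also be expressed with `nlargest()`, which returns the n largest values directly. A small sketch with made-up brand names and revenues (all hypothetical):

```python
import pandas as pd

# Hypothetical toy data; brand names and numbers are invented
toy = pd.DataFrame({
    'Brand': ['BrandA', 'BrandB', 'BrandC', 'BrandD', 'BrandA'],
    'Revenue($)': [500, 300, 200, 100, 250],
})

# total revenue per brand, then keep the three largest totals
top3 = toy.groupby('Brand')['Revenue($)'].sum().nlargest(3)
print(top3)
```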

#### What are the average sales for each category?

In [19]:
# group by category, compute mean and round off to 2 decimal points

df=pos_data.groupby('Category').mean(numeric_only=True).round(2)   

In [20]:
df

Category,Revenue($),Units_sold,Page_traffic
Fabric Softeners,14494.72,696.5,2050.56
Haircare,14173.31,692.27,2014.07
Laundry Detergents,14172.14,700.19,2040.38
Mouthwash,14607.62,702.25,2084.02
Skincare,14639.89,712.51,2067.53
Tooth Paste,14666.36,710.38,2099.78
Toothbrushes,14336.55,699.07,2065.45


In [21]:
df['Revenue($)']

Category
Fabric Softeners      14494.72
Haircare              14173.31
Laundry Detergents    14172.14
Mouthwash             14607.62
Skincare              14639.89
Tooth Paste           14666.36
Toothbrushes          14336.55
Name: Revenue($), dtype: float64

In [22]:
# sort them to see the categories having top revenues
df['Revenue($)'].sort_values(ascending = False)

Category
Tooth Paste           14666.36
Skincare              14639.89
Mouthwash             14607.62
Fabric Softeners      14494.72
Toothbrushes          14336.55
Haircare              14173.31
Laundry Detergents    14172.14
Name: Revenue($), dtype: float64

***Conclusion:***
- Average revenue differs little across categories: it ranges from about 14,172 (Laundry Detergents) to about 14,666 (Tooth Paste), a spread of roughly 3.5%.


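#### What is the average revenue for each brand? How many distinct products are there in each brand?
- To answer this, we need two different aggregation functions: the mean of revenue and the count of SKU ID.
- Hence, we will apply the `agg()` method on top of `groupby()`.
- Note that `'count'` counts the rows recorded for each brand; if the same SKU ID appears on several dates, `'nunique'` would give the number of distinct SKUs instead.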

In [23]:
df=pos_data.groupby('Brand').agg({'Revenue($)': 'mean', 'SKU ID':'count'}).round(2)

In [24]:
df

Brand,Revenue($),SKU ID
Ariel,14399.19,3085
Aveeno,14126.16,859
Bounce,14731.49,551
Cetaphil,15111.62,1095
Clinique,14854.7,1717
Close-up,14519.82,342
Colgate,14631.63,2075
Crest,14728.45,432
Dove,14374.06,1796
Downy,14700.39,2100


In [25]:
df=pos_data.groupby('Brand').agg({'Revenue($)': 'mean', 'SKU ID':'count'}).round(2).reset_index()

In [26]:
df

Unnamed: 0,Brand,Revenue($),SKU ID
0,Ariel,14399.19,3085
1,Aveeno,14126.16,859
2,Bounce,14731.49,551
3,Cetaphil,15111.62,1095
4,Clinique,14854.7,1717
5,Close-up,14519.82,342
6,Colgate,14631.63,2075
7,Crest,14728.45,432
8,Dove,14374.06,1796
9,Downy,14700.39,2100


In [27]:
df = df.rename(columns={'Revenue($)':'Avg Revenue($)', 'SKU ID':'No. of SKU'})
df

Unnamed: 0,Brand,Avg Revenue($),No. of SKU
0,Ariel,14399.19,3085
1,Aveeno,14126.16,859
2,Bounce,14731.49,551
3,Cetaphil,15111.62,1095
4,Clinique,14854.7,1717
5,Close-up,14519.82,342
6,Colgate,14631.63,2075
7,Crest,14728.45,432
8,Dove,14374.06,1796
9,Downy,14700.39,2100
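
The aggregate, reset-index and rename steps above can be combined using named aggregation, where each output column is declared as `new_name=(column, function)`. The sketch below runs on made-up data (brands, SKU IDs and revenues are hypothetical) and also contrasts `'count'` (number of rows) with `'nunique'` (number of distinct values).

```python
import pandas as pd

# Hypothetical toy data: SKU1 appears twice for BrandA
toy = pd.DataFrame({
    'Brand': ['BrandA', 'BrandA', 'BrandA', 'BrandB'],
    'SKU ID': ['SKU1', 'SKU1', 'SKU2', 'SKU3'],
    'Revenue($)': [100, 200, 300, 50],
})

summary = (
    toy.groupby('Brand')
       .agg(avg_revenue=('Revenue($)', 'mean'),   # mean revenue per brand
            n_rows=('SKU ID', 'count'),           # rows per brand
            n_skus=('SKU ID', 'nunique'))         # distinct SKU IDs per brand
       .round(2)
       .reset_index()
)
print(summary)
```

The output column names must be valid Python identifiers in this form, which is why they differ from the labels used in the rename step above.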


#### What is the best selling segment?

In [28]:
df=pos_data.groupby('Segment').sum(numeric_only=True)     

In [29]:
df

Segment,Revenue($),Units_sold,Page_traffic
Acne,8030417,399199,1125542
Alcohol-Free Mouthwash,6455108,309350,887083
Anti-aging,50130741,2460511,7138576
Breath-Freshening Mouthwash,5615592,274774,825741
Conditioners,14522060,715094,2075982
Dryer Sheets,15705592,760220,2248320
Electric Toothbrushes,15786214,764807,2266883
Fluoride Mouthwash,6378728,302813,919294
Fluoride-Free Toothpaste,9316420,459912,1343900
Kids Toothbrushes,14841313,723733,2102428


In [30]:
df['Revenue($)'].sort_values(ascending = False)

Segment
Liquid                         103536572
Powder                          87893932
Anti-aging                      50130741
Shampoo                         36657767
Suncreens                       24393194
Electric Toothbrushes           15786214
Dryer Sheets                    15705592
Manual Toothbrushes             15048718
Kids Toothbrushes               14841313
Conditioners                    14522060
Pods                            13619832
Whitening Toothpaste             9408829
Fluoride-Free Toothpaste         9316420
Sensitivity Toothpaste           9170170
Acne                             8030417
Alcohol-Free Mouthwash           6455108
Fluoride Mouthwash               6378728
Breath-Freshening Mouthwash      5615592
Name: Revenue($), dtype: int64

***Explanation:***
- We can see that *Liquid* is the best-selling segment.
- However, it is useful to know which category and which sector this Liquid segment falls under.
- To find out, we need to group the data by more than one attribute. We will see such tasks in the next lecture.

### Part 2.4.3: How much did we sell per sector per category?
- In the previous lecture, we saw how to group the data by a single attribute and aggregate the values.
- We can also group the data by more than one attribute, to answer questions such as "how much is the revenue per category within each sector?".

In [31]:
# view the dataframe
pos_data.head()

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic
0,SKU1029,05-01-21,Synergix solutions,Oral Care,Tooth Paste,Whitening Toothpaste,Close-up,0,0,0
1,SKU1054,05-08-21,Synergix solutions,Oral Care,Tooth Paste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
2,SKU1068,01-08-22,Synergix solutions,Oral Care,Tooth Paste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0
3,SKU1056,11-05-22,Synergix solutions,Oral Care,Tooth Paste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
4,SKU1061,12-10-22,Synergix solutions,Oral Care,Tooth Paste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0


#### What is the best selling segment?
- Earlier, we found that 'Liquid' is the best selling segment.
- However, we do not know which category and sector the segment 'Liquid' belongs to.
- To understand this, we need to group the data based on more than one attribute.
- We must first group the data by sector, then by category within each sector, and then by segment within each category.
- We can pass the sequence of attributes to group by as a list to the `groupby()` method.

In [32]:
# note the sequence of attributes within groupby
df = pos_data.groupby(['Sector','Category', 'Segment']).sum(numeric_only=True).round(2)
df            #the result is not yet sorted to show the best selling segment. 

Sector,Category,Segment,Revenue($),Units_sold,Page_traffic
Beauty and Personal Care,Haircare,Conditioners,14522060,715094,2075982
Beauty and Personal Care,Haircare,Shampoo,36657767,1784679,5196813
Beauty and Personal Care,Skincare,Acne,8030417,399199,1125542
Beauty and Personal Care,Skincare,Anti-aging,50130741,2460511,7138576
Beauty and Personal Care,Skincare,Suncreens,24393194,1158106,3394662
Fabric Care,Fabric Softeners,Dryer Sheets,15705592,760220,2248320
Fabric Care,Fabric Softeners,Liquid,61188916,2934686,8629916
Fabric Care,Laundry Detergents,Liquid,42347656,2074022,6116001
Fabric Care,Laundry Detergents,Pods,13619832,684884,1982919
Fabric Care,Laundry Detergents,Powder,87893932,4348675,12612937


In [33]:
# drop the unwanted attributes
df=df.drop(['Units_sold', 'Page_traffic'], axis=1)
df

Sector,Category,Segment,Revenue($)
Beauty and Personal Care,Haircare,Conditioners,14522060
Beauty and Personal Care,Haircare,Shampoo,36657767
Beauty and Personal Care,Skincare,Acne,8030417
Beauty and Personal Care,Skincare,Anti-aging,50130741
Beauty and Personal Care,Skincare,Suncreens,24393194
Fabric Care,Fabric Softeners,Dryer Sheets,15705592
Fabric Care,Fabric Softeners,Liquid,61188916
Fabric Care,Laundry Detergents,Liquid,42347656
Fabric Care,Laundry Detergents,Pods,13619832
Fabric Care,Laundry Detergents,Powder,87893932


In [34]:
# sort the data based on revenue
df=df.sort_values(by=['Revenue($)'], ascending=False)
df

Sector,Category,Segment,Revenue($)
Fabric Care,Laundry Detergents,Powder,87893932
Fabric Care,Fabric Softeners,Liquid,61188916
Beauty and Personal Care,Skincare,Anti-aging,50130741
Fabric Care,Laundry Detergents,Liquid,42347656
Beauty and Personal Care,Haircare,Shampoo,36657767
Beauty and Personal Care,Skincare,Suncreens,24393194
Oral Care,Toothbrushes,Electric Toothbrushes,15786214
Fabric Care,Fabric Softeners,Dryer Sheets,15705592
Oral Care,Toothbrushes,Manual Toothbrushes,15048718
Oral Care,Toothbrushes,Kids Toothbrushes,14841313


***Explanation:***
- We can now see that *Powder*, and not *Liquid*, is the best-selling segment once the hierarchy is taken into account.
- This is because *Liquid* is a segment listed under two different categories, viz. *Fabric Softeners* and *Laundry Detergents*, so grouping by segment alone adds the two together.
- This result shows that, as data analysts, we should keep the big picture in mind rather than read a single grouped table in isolation.
- Whenever there is a hierarchical relationship between the attributes (in this dataset, the hierarchy is Sector -> Category -> Segment -> Brand), we must use appropriate domain knowledge to interpret the outcomes.
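
The effect described above can be reproduced on a three-row made-up example (category names and revenues are hypothetical). Grouping by segment alone merges the two *Liquid* rows, while grouping by category and segment keeps them apart. The last step is one possible way to detect segment labels that occur under more than one category.

```python
import pandas as pd

# Hypothetical toy data: 'Liquid' appears under two categories
toy = pd.DataFrame({
    'Category': ['Softeners', 'Detergents', 'Detergents'],
    'Segment': ['Liquid', 'Liquid', 'Powder'],
    'Revenue($)': [60, 40, 90],
})

by_segment = toy.groupby('Segment')['Revenue($)'].sum()
by_cat_segment = toy.groupby(['Category', 'Segment'])['Revenue($)'].sum()

# idxmax() returns the index label of the largest value
best_flat = by_segment.idxmax()
best_hier = by_cat_segment.idxmax()

# segments that are listed under more than one category
n_categories = toy.groupby('Segment')['Category'].nunique()
shared = n_categories[n_categories > 1].index.tolist()

print(best_flat, best_hier, shared)
```

Running such a check before ranking helps to decide whether a single-level grouping is safe to interpret.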

#### How many units were sold per category within each sector?
- We must first group the data by sector and then by category within each sector.
- We can pass the sequence of attributes to group by as a list to the `groupby()` method.

In [35]:
# note that the order of the attributes in the list is important

Units_perCat_perSect =pos_data.groupby(['Sector','Category']).sum(numeric_only=True).round(2)
Units_perCat_perSect

Sector,Category,Revenue($),Units_sold,Page_traffic
Beauty and Personal Care,Haircare,51179827,2499773,7272795
Beauty and Personal Care,Skincare,82554352,4017816,11658780
Fabric Care,Fabric Softeners,76894508,3694906,10878236
Fabric Care,Laundry Detergents,143861420,7107581,20711857
Oral Care,Mouthwash,18449428,886937,2632118
Oral Care,Tooth Paste,27895419,1351135,3993773
Oral Care,Toothbrushes,45676245,2227234,6580537


In [36]:
# As we are interested only in number of units sold, let us drop the other two columns
Units_perCat_perSect=Units_perCat_perSect.drop(['Revenue($)', 'Page_traffic'], axis=1)
Units_perCat_perSect

Sector,Category,Units_sold
Beauty and Personal Care,Haircare,2499773
Beauty and Personal Care,Skincare,4017816
Fabric Care,Fabric Softeners,3694906
Fabric Care,Laundry Detergents,7107581
Oral Care,Mouthwash,886937
Oral Care,Tooth Paste,1351135
Oral Care,Toothbrushes,2227234


#### What percentage of units are sold per category within each sector?
- To answer this, let us use the result of the previous question, i.e. the dataframe *Units_perCat_perSect*


In [37]:
# as the code is long, a line-continuation character \ is used
# Python treats the line after \ as a continuation of the current statement

Units_perCat_perSect['% of Units'] = (100 * Units_perCat_perSect['Units_sold'] / \
                                      Units_perCat_perSect.groupby('Sector')['Units_sold'].transform('sum')).round(2)
Units_perCat_perSect

Sector,Category,Units_sold,% of Units
Beauty and Personal Care,Haircare,2499773,38.35
Beauty and Personal Care,Skincare,4017816,61.65
Fabric Care,Fabric Softeners,3694906,34.2
Fabric Care,Laundry Detergents,7107581,65.8
Oral Care,Mouthwash,886937,19.86
Oral Care,Tooth Paste,1351135,30.26
Oral Care,Toothbrushes,2227234,49.88
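
The key step in the cell above is `transform('sum')`: unlike `sum()`, it returns one value per original row, namely the total of that row's group, so it can be used directly as the denominator. A minimal sketch with made-up sectors and categories (all names and numbers are hypothetical):

```python
import pandas as pd

# Hypothetical toy data indexed by (Sector, Category), like the grouped result above
units = pd.DataFrame({
    'Sector': ['S1', 'S1', 'S2', 'S2', 'S2'],
    'Category': ['a', 'b', 'c', 'd', 'e'],
    'Units_sold': [30, 70, 20, 30, 150],
}).set_index(['Sector', 'Category'])

# one total per row: 100 for every S1 row, 200 for every S2 row
sector_total = units.groupby('Sector')['Units_sold'].transform('sum')

units['% of Units'] = (100 * units['Units_sold'] / sector_total).round(2)
print(units)
```

Within each sector the percentages add up to 100, which is a useful check on the result.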


**Sometimes, we may need to present the data in a different shape, where the column headers are the values of some attribute. In such cases, we can use the `pivot_table()` function.**

In [38]:
import numpy as np
# aggfunc=np.mean averages each metric per sector; the string 'mean' is an equivalent choice
df = pd.pivot_table(pos_data, columns='Sector', values=['Revenue($)', 'Units_sold', 'Page_traffic'],
                    aggfunc=np.mean).round(2)
df

Sector,Beauty and Personal Care,Fabric Care,Oral Care
Page_traffic,2046.66,2043.87,2079.42
Revenue($),14457.75,14282.86,14489.23
Units_sold,704.6,698.92,703.09


***Explanation:***
- In the above code, we have imported a package called *numpy* (short for Numerical Python).
- NumPy is a Python package that provides functionality for numerical computing, such as arrays, vectors and linear algebra.
- Some pandas functions accept NumPy functions as argument values, as with `aggfunc=np.mean` above.
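
To see what `pivot_table()` does, it helps to compare it with an equivalent `groupby()`. The sketch below uses made-up data (hypothetical values) and passes the aggregation as the string `'mean'` instead of a NumPy function; the pivoted table matches the transposed result of a group-wise mean.

```python
import pandas as pd

# Hypothetical toy data with two sectors
toy = pd.DataFrame({
    'Sector': ['Oral Care', 'Oral Care', 'Fabric Care', 'Fabric Care'],
    'Revenue($)': [100, 200, 300, 500],
    'Units_sold': [10, 20, 30, 50],
})

# sectors become the column headers, metrics become the rows
pivoted = pd.pivot_table(toy, columns='Sector',
                         values=['Revenue($)', 'Units_sold'],
                         aggfunc='mean')

# the same numbers via groupby, transposed so that sectors are columns
grouped = toy.groupby('Sector')[['Revenue($)', 'Units_sold']].mean().T

print(pivoted)
print(grouped)
```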