___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" 
alt="CLRSWY"></p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#9d4f8c; font-size:100%; text-align:center; border-radius:10px 10px;">WAY TO REINVENT YOURSELF</p>

<img src=https://i.ibb.co/6gCsHd6/1200px-Pandas-logo-svg.png width="700" height="200">

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:200%; text-align:center; border-radius:10px 10px;">Data Analysis with Python</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:150%; text-align:center; border-radius:10px 10px;">Session - 05</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#4d77cf; font-size:200%; text-align:center; border-radius:10px 10px;">Groupby & Useful Operations - Part 1</p>

<a id="toc"></a>

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Content</p>

* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#0)
* [BASIC AGGREGATION METHODS](#1)
* [GROUPBY & AGGREGATION](#2)
    * [DataFrame.groupby()](#2.1)
* [USEFUL OPERATIONS](#3)
    * [.aggregate()/agg()](#3.1)
        * [DataFrame.agg()](#3.1.1)
        * [DataFrame.groupby().agg()](#3.1.2)
    * [.filter()](#3.2)
        * [DataFrame.groupby().filter()](#3.2.1)
    * [.transform()](#3.3)
        * [DataFrame.groupby().transform()](#3.3.1)
* [THE END OF THE SESSION - 05](#4)

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Importing Libraries Needed in This Notebook</p>

<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Once NumPy, pandas, and seaborn are installed, you can import them as libraries:

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Basic Aggregation Methods</p>

<a id="1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

One of the most basic analysis functions is grouping and aggregating data. In some cases, this level of analysis is sufficient to answer real-world or business questions; in others, it is the first step in a more complex data science analysis. In pandas, the `groupby` method can be combined with one or more aggregation functions to summarize data quickly. The concept is deceptively simple, and most new pandas users grasp it easily, yet they may be surprised at how useful more complex aggregation functions are for supporting sophisticated analysis [Source](https://pbpython.com/groupby-agg.html).

An essential piece of analysis of large data is efficient summarization: computing aggregations, such as ``sum()``, ``mean()``, ``median()``, ``min()``, and ``max()``, in which a single number gives insight into the nature of a potentially large dataset. The ``aggregate()`` method applies a function, or a list of function names, along one axis of the DataFrame; the default is 0, the index (row) axis [Source](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html).

Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question. Perhaps the most common summary statistics are the mean and standard deviation, which allow you to summarize the "typical" values in a dataset, but other aggregates are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).

NumPy has fast built-in aggregation functions for working on arrays, and pandas DataFrames provide methods of the same kind [Pandas Official Documentation](https://pandas.pydata.org/docs/reference/frame.html); we'll discuss and demonstrate some of them here:

* ``count()`` ==> Counts non-NA cells for each column or row.
* ``mean()`` ==> Returns the mean of the values over the requested axis.
* ``median()`` ==> Returns the median of the values over the requested axis.
* ``min()`` ==> Returns the minimum of the values over the requested axis.
* ``max()`` ==> Returns the maximum of the values over the requested axis.
* ``std()`` ==> Returns sample standard deviation over requested axis.
* ``var()`` ==> Returns unbiased variance over requested axis.
* ``sum()`` ==> Returns the sum of the values over the requested axis.
* ``idxmin()`` ==> Returns index of first occurrence of minimum over requested axis.
* ``idxmax()`` ==> Returns index of first occurrence of maximum over requested axis.
* ``corr()`` ==> Computes pairwise correlation of columns, excluding NA/null values.

To sum up, in this session, we'll explore aggregations in Pandas, from simple operations akin to what we've seen on NumPy arrays, to more sophisticated operations based on the concept of a groupby.
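As a quick illustration of the methods listed above, here is a minimal sketch on a small hand-made DataFrame. The name `demo` and its values are made up for this example so that the results are reproducible, unlike the random data used in the cells that follow.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one missing value shows that NA cells are skipped.
demo = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                     "b": [10.0, 20.0, 30.0, 40.0]})

counts = demo.count()        # non-NA cells per column
means = demo.mean()          # column means; NaN is skipped by default
row_sums = demo.sum(axis=1)  # axis=1 aggregates across columns, one value per row
first_min = demo.idxmin()    # index label of the first minimum in each column

print(counts, means, row_sums, first_min, sep="\n\n")
```

With the default `axis=0` each method returns one value per column, while `axis=1` returns one value per row.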

In [2]:
df = pd.DataFrame(np.random.randint(0, 100, size=(7, 5)), 
                  columns=["x1", "x2", "x3", "x4", "x5"])
df

Unnamed: 0,x1,x2,x3,x4,x5
0,46,83,27,93,31
1,86,84,83,87,62
2,48,24,35,45,77
3,23,51,59,79,28
4,68,3,74,89,7
5,6,52,68,52,29
6,75,65,31,24,6


In [3]:
df.sum()

x1    352
x2    362
x3    377
x4    469
x5    240
dtype: int64

In [4]:
df.count(axis=1)

0    5
1    5
2    5
3    5
4    5
5    5
6    5
dtype: int64

In [5]:
df["x1"].count()

7

In [6]:
df.mean()

x1    50.285714
x2    51.714286
x3    53.857143
x4    67.000000
x5    34.285714
dtype: float64

In [7]:
df.x2.mean()

51.714285714285715

In [8]:
df.min()

x1     6
x2     3
x3    27
x4    24
x5     6
dtype: int32

In [9]:
df.x4.min()

24

The pandas `DataFrame.idxmin()` method returns the index label of the first occurrence of the minimum over the requested axis. NA/null values are excluded from the search.
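The difference between `idxmin()` and `argmin()` is easy to miss when the index is the default 0, 1, 2, ... range, because both then return the same number. A minimal sketch with a hypothetical Series `s` that has string labels and one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with string labels and one missing value.
s = pd.Series([30.0, np.nan, 10.0, 20.0], index=["a", "b", "c", "d"])

label = s.idxmin()     # index label of the minimum; the NaN at "b" is skipped
position = s.argmin()  # integer position of the same element

print(label, position)
```

`idxmin()` is the one to pair with `.loc`, and `argmin()` with `.iloc`.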

In [10]:
df.idxmin()

x1    5
x2    4
x3    0
x4    6
x5    6
dtype: int64

In [11]:
df.x5.idxmin()

6

In [12]:
df.x5.argmin()

6

In [13]:
df.loc[df.idxmin(), "x2" ]

5    52
4     3
0    83
6    65
6    65
Name: x2, dtype: int32

In [14]:
df.std(axis=0)

x1    28.663067
x2    29.831272
x3    22.660013
x4    26.652079
x5    26.506064
dtype: float64

In [15]:
df[["x1", "x2"]].std()

x1    28.663067
x2    29.831272
dtype: float64

In [16]:
df[["x1", "x2"]]

Unnamed: 0,x1,x2
0,46,83
1,86,84
2,48,24
3,23,51
4,68,3
5,6,52
6,75,65


In [17]:
df.var()

x1    821.571429
x2    889.904762
x3    513.476190
x4    710.333333
x5    702.571429
dtype: float64

In [18]:
df[["x1", "x2"]].var()

x1    821.571429
x2    889.904762
dtype: float64

In [19]:
df.sum(axis=1)

0    280
1    402
2    229
3    240
4    241
5    207
6    201
dtype: int64

In [20]:
df.x1.sum()

352

In [21]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
x1,7.0,50.285714,28.663067,6.0,34.5,48.0,71.5,86.0
x2,7.0,51.714286,29.831272,3.0,37.5,52.0,74.0,84.0
x3,7.0,53.857143,22.660013,27.0,33.0,59.0,71.0,83.0
x4,7.0,67.0,26.652079,24.0,48.5,79.0,88.0,93.0
x5,7.0,34.285714,26.506064,6.0,17.5,29.0,46.5,77.0


## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Groupby & Aggregation</p>

<a id="2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In Exploratory Data Analysis (EDA), we often want to analyze data by category. In SQL, the GROUP BY statement groups rows that have the same category values into summary rows. In pandas, the equivalent operation is performed using the similarly named **``groupby()``** method, which splits data into separate groups so that computations can be performed on each group [Source](https://towardsdatascience.com/all-pandas-groupby-you-should-know-for-grouping-data-and-performing-operations-2a8ec1327b5).

In this part of the session, you'll learn the "group by" process (split-apply-combine) and how to use the pandas ``groupby()`` method to group data and perform operations.
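To make the split-apply-combine idea concrete, the three steps can be written out by hand and compared with a single `groupby()` call. This is only a sketch on hypothetical data (`toy`, `key`, and `value` are names invented for the example), not a description of how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the DataFrames used in this notebook.
toy = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: one sub-DataFrame per distinct key.
pieces = {k: toy[toy["key"] == k] for k in toy["key"].unique()}
# Apply: reduce each piece to a single number.
sums = {k: piece["value"].sum() for k, piece in pieces.items()}
# Combine: assemble the per-group results into one Series.
manual = pd.Series(sums, name="value").rename_axis("key")

# groupby() performs the same three steps in one call.
grouped = toy.groupby("key")["value"].sum()

print(manual, grouped, sep="\n\n")
```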

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">DataFrame.groupby()</p>

<a id="2.1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

The **``groupby()``** method groups a DataFrame using a mapper or by a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups [Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html).

In other words, the **``groupby()``** method allows you to group rows of data together and call aggregate functions.

**``DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)``**
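Of the parameters in the signature, `as_index` is one that changes the shape of the result in a visible way. The sketch below uses hypothetical data (`toy`, `team`, `score`) and only the `by` and `as_index` parameters; the exact set of other parameters differs between pandas versions.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"team": ["x", "y", "x", "y"], "score": [1, 2, 3, 4]})

# Default as_index=True: the group keys become the index of the result.
indexed = toy.groupby("team")["score"].sum()

# as_index=False: the group keys stay as an ordinary column.
flat = toy.groupby("team", as_index=False)["score"].sum()

print(indexed, flat, sep="\n\n")
```

The flat form is often more convenient when the result is going to be merged or plotted afterwards.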

In [22]:
data = {'Company':['GOOG', 'GOOG', 'MSFT', 'MSFT', 'GOOG', 'MSFT', 'GOOG', 'MSFT'],
        'Department':['HR', 'IT', 'IT', 'HR', 'HR', 'IT', 'IT', 'HR'],
        'Person':['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah', 'Tom', 'Terry'],
        'Age':[30, 28, 35, 40, 42, 25, 32, 48],
        'Sales':[200, 120, 340, 124, 243, 350, 180, 220]}

In [23]:
df1 = pd.DataFrame(data)
df1

Unnamed: 0,Company,Department,Person,Age,Sales
0,GOOG,HR,Sam,30,200
1,GOOG,IT,Charlie,28,120
2,MSFT,IT,Amy,35,340
3,MSFT,HR,Vanessa,40,124
4,GOOG,HR,Carl,42,243
5,MSFT,IT,Sarah,25,350
6,GOOG,IT,Tom,32,180
7,MSFT,HR,Terry,48,220


In [24]:
df1.groupby("Company")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001FC04E299F0>

In [25]:
df1.groupby("Company").mean(numeric_only = True)

Company,Age,Sales
GOOG,33.0,185.75
MSFT,37.0,258.5


In [26]:
df1.groupby("Company")[["Sales"]].mean(numeric_only = True)

Company,Sales
GOOG,185.75
MSFT,258.5


In [27]:
df1.groupby("Company")["Company"].count()

Company
GOOG    4
MSFT    4
Name: Company, dtype: int64

In [28]:
df1["Company"].value_counts()

GOOG    4
MSFT    4
Name: Company, dtype: int64

In [29]:
df1.groupby(["Company", "Department"]).mean(numeric_only = True)

Company,Department,Age,Sales
GOOG,HR,36.0,221.5
GOOG,IT,30.0,150.0
MSFT,HR,44.0,172.0
MSFT,IT,30.0,345.0


In [30]:
df1.groupby(["Company", "Department"])["Sales"].mean(numeric_only = True)

Company  Department
GOOG     HR            221.5
         IT            150.0
MSFT     HR            172.0
         IT            345.0
Name: Sales, dtype: float64

In [31]:
df1.groupby(["Company", "Department"])["Sales"].describe()

Company,Department,count,mean,std,min,25%,50%,75%,max
GOOG,HR,2.0,221.5,30.405592,200.0,210.75,221.5,232.25,243.0
GOOG,IT,2.0,150.0,42.426407,120.0,135.0,150.0,165.0,180.0
MSFT,HR,2.0,172.0,67.882251,124.0,148.0,172.0,196.0,220.0
MSFT,IT,2.0,345.0,7.071068,340.0,342.5,345.0,347.5,350.0


**Calling ``.groupby()`` on a column name returns a DataFrameGroupBy object rather than a result; the object can be stored in a variable and aggregated later. For instance, let's group based on Company:**

In [32]:
df1.groupby("Company")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001FC04E2A290>

In [33]:
by_comp = df1.groupby("Company")

In [34]:
by_comp.mean(numeric_only=True)

Company,Age,Sales
GOOG,33.0,185.75
MSFT,37.0,258.5


## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Useful Operations</p>

<a id="3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

- ### `.aggregate()`
- ### `.filter()`
- ### `.transform()`
- ### `.apply()`
- ### `.applymap()`
- ### `.map()`
- ### `.pivot() & .pivot_table()`
- ### `.stack() & .unstack()`

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:center; border-radius:10px 10px;">.aggregate() / agg()</p>

<a id="3.1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

#### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">DataFrame.agg()</p>

<a id="3.1.1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

**``DataFrame.aggregate(func=None, axis=0, *args, **kwargs)``**

Returns: scalar, Series or DataFrame

The return type depends on the caller and on how many functions are passed:
- scalar : when Series.agg is called with a single function
- Series : when DataFrame.agg is called with a single function
- DataFrame : when DataFrame.agg is called with several functions
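The three return types can be checked directly. A minimal sketch on a hypothetical two-column DataFrame named `frame`:

```python
import pandas as pd

# Hypothetical toy data.
frame = pd.DataFrame({"p": [1, 2, 3], "q": [4, 5, 6]})

scalar_result = frame["p"].agg("sum")     # Series.agg + one function -> scalar
series_result = frame.agg("sum")          # DataFrame.agg + one function -> Series
frame_result = frame.agg(["sum", "min"])  # DataFrame.agg + several functions -> DataFrame

print(scalar_result, series_result, frame_result, sep="\n\n")
```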

The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0).
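The difference in defaults can be seen side by side. In this sketch `arr_2d` and `frame` are hypothetical names holding the same four numbers:

```python
import numpy as np
import pandas as pd

arr_2d = np.array([[1, 2], [3, 4]])
frame = pd.DataFrame(arr_2d, columns=["c0", "c1"])

flat_mean = np.mean(arr_2d)            # NumPy default: mean of the flattened array
col_mean_np = np.mean(arr_2d, axis=0)  # NumPy, explicit axis: one mean per column
col_mean_pd = frame.mean()             # pandas default (axis=0): one mean per column
row_mean_pd = frame.mean(axis=1)       # pandas, axis=1: one mean per row

print(flat_mean, col_mean_np, col_mean_pd, row_mean_pd, sep="\n\n")
```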

**agg()** is an **alias for aggregate()**, and the documentation recommends using the alias [Official Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.aggregate.html).

In [35]:
df2 = pd.DataFrame({'groups': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'var1': [10, 23, 33, 22, 11, 99, 76, 84, 45],
                   'var2': [100, 253, 333, 262, 111, 969, 405, 578, 760]})
df2

Unnamed: 0,groups,var1,var2
0,A,10,100
1,B,23,253
2,C,33,333
3,A,22,262
4,B,11,111
5,C,99,969
6,A,76,405
7,B,84,578
8,C,45,760


In [36]:
df2.agg(["sum", "min"])

Unnamed: 0,groups,var1,var2
sum,ABCABCABC,403,3771
min,A,10,100


In [37]:
df2[["var1", "var2"]].agg(["sum", "min"])

Unnamed: 0,var1,var2
sum,403,3771
min,10,100


In [38]:
df2.agg({"var1": ["sum"], "var2": ["min"]})

Unnamed: 0,var1,var2
sum,403.0,
min,,100.0


In [39]:
df2.agg({"var1": ["sum", "mean"], "var2": ["min", "max"]})

Unnamed: 0,var1,var2
sum,403.0,
mean,44.777778,
min,,100.0
max,,969.0


#### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">DataFrame.groupby().agg()</p>

<a id="3.1.2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

**``DataFrameGroupBy.agg(arg, *args, **kwargs)``**

Aggregates using one or more operations over the specified axis [Pandas Official Documentation](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.core.groupby.DataFrameGroupBy.agg.html).

[SOURCE01](https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/) & [SOURCE02](https://www.analyticsvidhya.com/blog/2020/03/groupby-pandas-aggregating-data-python/)
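Besides lists and dictionaries of functions, grouped aggregation also accepts keyword arguments of the form `new_name=(column, function)`, usually called named aggregation, which yields flat and readable column names. A minimal sketch on hypothetical data (`toy`, `grp`, `val`, and the output names are invented for the example):

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"grp": ["a", "a", "b", "b"], "val": [1, 3, 10, 30]})

# Each keyword names an output column; the tuple gives (input column, function).
summary = toy.groupby("grp").agg(
    val_min=("val", "min"),
    val_mean=("val", "mean"),
)

print(summary)
```

This avoids the two-level column header that a list or dictionary of functions produces.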

In [40]:
df2

Unnamed: 0,groups,var1,var2
0,A,10,100
1,B,23,253
2,C,33,333
3,A,22,262
4,B,11,111
5,C,99,969
6,A,76,405
7,B,84,578
8,C,45,760


In [41]:
df2.groupby('groups').agg(["min", "median", "max"])

,var1,var1,var1,var2,var2,var2
groups,min,median,max,min,median,max
A,10,22.0,76,100,262.0,405
B,11,23.0,84,111,253.0,578
C,33,45.0,99,333,760.0,969


In [42]:
df2.groupby('groups').agg({'var1': ('min', 'max'), 'var2': 'median'})

,var1,var1,var2
groups,min,max,median
A,10,76,262.0
B,11,84,253.0
C,33,99,760.0


In [43]:
df2.groupby('groups')[["var1"]].agg(['min', 'max'])

,var1,var1
groups,min,max
A,10,76
B,11,84
C,33,99


In [44]:
df2.groupby('groups')["var1"].agg(['min', 'max'])

groups,min,max
A,10,76
B,11,84
C,33,99


### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">.filter()</p>

<a id="3.2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

#### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">DataFrame.filter()</p>

<a id="3.2.1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Pandas **``DataFrame.filter()``** is a built-in method that subsets the rows or columns of a DataFrame according to the specified index labels. **Note that** this routine does **NOT** filter a DataFrame on its contents; the filter is applied to the labels of the index [Pandas Official Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.filter.html).

[SOURCE01](https://www.sharpsightlabs.com/blog/pandas-filter/) & [SOURCE02](https://appdividend.com/2020/03/19/pandas-filter-pandas-dataframe-filter-in-python-example/)
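The distinction between filtering on labels and filtering on contents can be sketched as follows. The DataFrame `toy` and its column names are hypothetical; value-based selection is shown with ordinary boolean indexing, since `filter()` never looks at the cell values.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"height_cm": [150, 180],
                    "weight_kg": [50, 80],
                    "name": ["Ann", "Bob"]})

# Label-based: keep columns whose *name* contains an underscore.
by_label = toy.filter(like="_", axis=1)

# Label-based: keep columns whose *name* ends in "cm".
by_regex = toy.filter(regex="cm$", axis=1)

# Content-based: keep rows whose *values* satisfy a condition.
by_content = toy[toy["height_cm"] > 160]

print(by_label, by_regex, by_content, sep="\n\n")
```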

In [45]:
df2

Unnamed: 0,groups,var1,var2
0,A,10,100
1,B,23,253
2,C,33,333
3,A,22,262
4,B,11,111
5,C,99,969
6,A,76,405
7,B,84,578
8,C,45,760


In [46]:
df2.filter(["groups", "var1"])

Unnamed: 0,groups,var1
0,A,10
1,B,23
2,C,33
3,A,22
4,B,11
5,C,99
6,A,76
7,B,84
8,C,45


In [47]:
df2[["groups", "var1"]]

Unnamed: 0,groups,var1
0,A,10
1,B,23
2,C,33
3,A,22
4,B,11
5,C,99
6,A,76
7,B,84
8,C,45


In [48]:
df2.filter(regex="^var", axis=1)

Unnamed: 0,var1,var2
0,10,100
1,23,253
2,33,333
3,22,262
4,11,111
5,99,969
6,76,405
7,84,578
8,45,760


In [49]:
df2.filter(like="var", axis=1)

Unnamed: 0,var1,var2
0,10,100
1,23,253
2,33,333
3,22,262
4,11,111
5,99,969
6,76,405
7,84,578
8,45,760


In [50]:
df2.filter(like="1", axis=0)

Unnamed: 0,groups,var1,var2
1,B,23,253


#### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">DataFrame.groupby().filter()</p>

<a id="3.2.1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

**``DataFrameGroupBy.filter(func, dropna=True, *args, **kwargs)``**
- Returns a copy of a DataFrame excluding filtered elements.
- Elements from groups are filtered if they do not satisfy the boolean criterion specified by func [Official Pandas Document](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.filter.html).
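A common use of the group-wise filter is to drop groups that are too small. In the sketch below, `toy`, `grp`, and `val` are hypothetical names; the function passed to `filter()` receives each group as a DataFrame and must return a single `True` or `False`.

```python
import pandas as pd

# Hypothetical toy data: group "b" has only one row.
toy = pd.DataFrame({"grp": ["a", "a", "b", "c", "c", "c"],
                    "val": [1, 2, 3, 4, 5, 6]})

# Keep every row that belongs to a group with at least two rows.
kept = toy.groupby("grp").filter(lambda g: len(g) >= 2)

print(kept)
```

Whole groups are kept or dropped together, and the surviving rows keep their original index labels.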

In [51]:
df2.groups.unique()

array(['A', 'B', 'C'], dtype=object)

In [52]:
df2.groupby("groups").mean(numeric_only=True)

groups,var1,var2
A,36.0,255.666667
B,39.333333,314.0
C,59.0,687.333333


In [53]:
def filter_func(x) :
    return x["var1"].mean() > 39

In [54]:
df2

Unnamed: 0,groups,var1,var2
0,A,10,100
1,B,23,253
2,C,33,333
3,A,22,262
4,B,11,111
5,C,99,969
6,A,76,405
7,B,84,578
8,C,45,760


In [55]:
df2.groupby("groups").filter(filter_func)

Unnamed: 0,groups,var1,var2
1,B,23,253
2,C,33,333
4,B,11,111
5,C,99,969
7,B,84,578
8,C,45,760


In [56]:
df2.groupby("groups").sum()

groups,var1,var2
A,108,767
B,118,942
C,177,2062


In [57]:
df2.groupby("groups")[["var2"]].sum()

groups,var2
A,767
B,942
C,2062


In [58]:
lambda x : x["var2"].sum() < 800

<function __main__.<lambda>(x)>

In [59]:
df2.groupby("groups")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001FC04EB9780>

In [60]:
print(*df2.groupby("groups"), sep="\n")

('A',   groups  var1  var2
0      A    10   100
3      A    22   262
6      A    76   405)
('B',   groups  var1  var2
1      B    23   253
4      B    11   111
7      B    84   578)
('C',   groups  var1  var2
2      C    33   333
5      C    99   969
8      C    45   760)


In [61]:
(lambda x : x["var2"].sum() < 800)(df2.groupby("groups"))

groups
A     True
B    False
C    False
Name: var2, dtype: bool

In [62]:
df2.groupby("groups")["var2"].sum() < 800

groups
A     True
B    False
C    False
Name: var2, dtype: bool

In [63]:
df2.groupby("groups").filter(lambda x : x["var2"].sum() < 800)

Unnamed: 0,groups,var1,var2
0,A,10,100
3,A,22,262
6,A,76,405


### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">.transform()</p>

<a id="3.3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

#### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">DataFrame.transform()</p>

<a id="3.3.1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

**``DataFrame.transform(func, axis=0, *args, **kwargs)``**

- Returns DataFrame
- Call func on self producing a DataFrame that must have the same length as self.

The pandas **``transform()``** method calls the function passed as its parameter and returns a DataFrame of transformed values that has the same length as the original. It is mostly used in conjunction with groupby (one of the most useful operations in pandas), and it is a powerful tool for **feature engineering**, that is, for deriving new features from existing ones. Let's see how it works with the help of an example.

[SOURCE01](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transform.html) & [SOURCE02](https://www.analyticsvidhya.com/blog/2020/03/understanding-transform-function-python/)
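The "same length as self" requirement is what separates `transform()` from a reducing aggregation. A minimal sketch on a hypothetical DataFrame `toy`:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"m": [1, 2, 3], "n": [10, 20, 30]})

# agg with a reducing function collapses each column to one value.
reduced = toy.agg("mean")

# transform keeps one output row per input row.
shifted = toy.transform(lambda col: col - col.mean())

print(reduced, shifted, sep="\n\n")
```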

In [64]:
df2

Unnamed: 0,groups,var1,var2
0,A,10,100
1,B,23,253
2,C,33,333
3,A,22,262
4,B,11,111
5,C,99,969
6,A,76,405
7,B,84,578
8,C,45,760


In [65]:
df_num = df2.iloc[:, 1:3]
df_num

Unnamed: 0,var1,var2
0,10,100
1,23,253
2,33,333
3,22,262
4,11,111
5,99,969
6,76,405
7,84,578
8,45,760


In [66]:
df_num.transform(lambda x : x+ 10)

Unnamed: 0,var1,var2
0,20,110
1,33,263
2,43,343
3,32,272
4,21,121
5,109,979
6,86,415
7,94,588
8,55,770


In [67]:
df_num + 10

Unnamed: 0,var1,var2
0,20,110
1,33,263
2,43,343
3,32,272
4,21,121
5,109,979
6,86,415
7,94,588
8,55,770


In [68]:
df_num.var1.transform(np.sqrt)

0    3.162278
1    4.795832
2    5.744563
3    4.690416
4    3.316625
5    9.949874
6    8.717798
7    9.165151
8    6.708204
Name: var1, dtype: float64

In [69]:
np.sqrt(df_num.var1)

0    3.162278
1    4.795832
2    5.744563
3    4.690416
4    3.316625
5    9.949874
6    8.717798
7    9.165151
8    6.708204
Name: var1, dtype: float64

In [70]:
df_num.var1.agg(np.sqrt)

0    3.162278
1    4.795832
2    5.744563
3    4.690416
4    3.316625
5    9.949874
6    8.717798
7    9.165151
8    6.708204
Name: var1, dtype: float64

In [71]:
df_num.var1.transform([np.sqrt, np.exp])

Unnamed: 0,sqrt,exp
0,3.162278,22026.47
1,4.795832,9744803000.0
2,5.744563,214643600000000.0
3,4.690416,3584913000.0
4,3.316625,59874.14
5,9.949874,9.889030000000001e+42
6,8.717798,1.0148e+33
7,9.165151,3.025077e+36
8,6.708204,3.493427e+19


In [72]:
df_num.var1.agg([np.sqrt, np.exp])

Unnamed: 0,sqrt,exp
0,3.162278,22026.47
1,4.795832,9744803000.0
2,5.744563,214643600000000.0
3,4.690416,3584913000.0
4,3.316625,59874.14
5,9.949874,9.889030000000001e+42
6,8.717798,1.0148e+33
7,9.165151,3.025077e+36
8,6.708204,3.493427e+19


In [73]:
df_num.transform(lambda x: (x-x.mean() / x.std()))

Unnamed: 0,var1,var2
0,8.660179,98.584011
1,21.660179,251.584011
2,31.660179,331.584011
3,20.660179,260.584011
4,9.660179,109.584011
5,97.660179,967.584011
6,74.660179,403.584011
7,82.660179,576.584011
8,43.660179,758.584011


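As seen above, transform comes in handy for feature engineering. Note that in the last cell the division is evaluated before the subtraction, so the expression computes `x - (x.mean() / x.std())` and only shifts each column by a constant; a z-score standardization needs the parentheses placed as `(x - x.mean()) / x.std()`.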
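For reference, a z-score standardization written with explicit parentheses can be sketched as follows; `toy` and its values are hypothetical and chosen so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"u": [1.0, 2.0, 3.0], "v": [10.0, 20.0, 60.0]})

# Subtract the column mean first, then divide by the sample standard deviation.
zscores = toy.transform(lambda col: (col - col.mean()) / col.std())

print(zscores)
```

After this transformation each column has mean 0 and sample standard deviation 1, up to floating-point error.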

#### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">DataFrame.groupby().transform()</p>

<a id="3.3.1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

**``DataFrameGroupBy.transform(func, *args, engine=None, engine_kwargs=None, **kwargs)``**
- Call function producing a like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values. [Official Pandas Document](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html).
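Because the result is aligned with the original rows, a grouped transform can be combined directly with the original columns to build new features. The sketch below uses hypothetical data (`toy`, `grp`, `val`) to compute each row's deviation from its group mean and its share of the group total.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"grp": ["a", "a", "b", "b"],
                    "val": [1.0, 3.0, 10.0, 30.0]})

# One group mean per original row, in the original row order.
group_mean = toy.groupby("grp")["val"].transform("mean")

# New features derived from the broadcast group statistics.
toy["val_centered"] = toy["val"] - group_mean
toy["share"] = toy["val"] / toy.groupby("grp")["val"].transform("sum")

print(toy)
```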

In [74]:
df2

Unnamed: 0,groups,var1,var2
0,A,10,100
1,B,23,253
2,C,33,333
3,A,22,262
4,B,11,111
5,C,99,969
6,A,76,405
7,B,84,578
8,C,45,760


In [75]:
df2.groupby("groups")["var1"].mean()

groups
A    36.000000
B    39.333333
C    59.000000
Name: var1, dtype: float64

In [76]:
df2.groupby("groups")["var1"].transform("mean")

0    36.000000
1    39.333333
2    59.000000
3    36.000000
4    39.333333
5    59.000000
6    36.000000
7    39.333333
8    59.000000
Name: var1, dtype: float64

In [77]:
df2 = pd.DataFrame({'groups': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'var1': [10, 23, 33, 22, 11, 99, 76, 84, 45],
                   'var2': [100, 253, 333, 262, 111, 969, 405, 578, 760]})
df2

Unnamed: 0,groups,var1,var2
0,A,10,100
1,B,23,253
2,C,33,333
3,A,22,262
4,B,11,111
5,C,99,969
6,A,76,405
7,B,84,578
8,C,45,760


In [78]:
df2["var1_median_transform"] = df2.groupby("groups")["var1"].transform("median")
df2

Unnamed: 0,groups,var1,var2,var1_median_transform
0,A,10,100,22.0
1,B,23,253,23.0
2,C,33,333,45.0
3,A,22,262,22.0
4,B,11,111,23.0
5,C,99,969,45.0
6,A,76,405,22.0
7,B,84,578,23.0
8,C,45,760,45.0


In [79]:
df2.groupby("groups").var1.median()

groups
A    22.0
B    23.0
C    45.0
Name: var1, dtype: float64

In [80]:
df2["var1_max_transform"] = df2.groupby("groups")["var1"].transform("max")
df2

Unnamed: 0,groups,var1,var2,var1_median_transform,var1_max_transform
0,A,10,100,22.0,76
1,B,23,253,23.0,84
2,C,33,333,45.0,99
3,A,22,262,22.0,76
4,B,11,111,23.0,84
5,C,99,969,45.0,99
6,A,76,405,22.0,76
7,B,84,578,23.0,84
8,C,45,760,45.0,99


In [81]:
df2

Unnamed: 0,groups,var1,var2,var1_median_transform,var1_max_transform
0,A,10,100,22.0,76
1,B,23,253,23.0,84
2,C,33,333,45.0,99
3,A,22,262,22.0,76
4,B,11,111,23.0,84
5,C,99,969,45.0,99
6,A,76,405,22.0,76
7,B,84,578,23.0,84
8,C,45,760,45.0,99


## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#9d4f8c; font-size:150%; text-align:center; border-radius:10px 10px;">The End of The Session - 05</p>

<a id="4"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>


________