# The GroupBy Object

In [1]:
import pandas as pd

In [2]:
fortune = pd.read_csv("fortune1000.csv")

## The Fortune 1000 Dataset
- The **Fortune 1000** is a listing of the 1000 largest American companies as ranked by Fortune magazine.
- The **DataFrame** includes each company's name, sector, industry, revenue, profits, and number of employees.

In [3]:
# The Rank column can serve as the index for this dataset
fortune.head()
fortune = fortune.set_index("Rank")
fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


## The groupby Method
- **Grouping** organizes the rows of a dataset into categories based on a column's values.
- The `groupby` method returns a **DataFrameGroupBy** object. It resembles a group/collection of **DataFrames** in a dictionary-like structure.
- The **DataFrameGroupBy** object can perform aggregate operations on *each* group within it.
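The dictionary-like structure can be inspected on a small frame. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not part of the Fortune 1000 data.

```python
import pandas as pd

# Hypothetical miniature dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E"],
    "Sector": ["Retailing", "Energy", "Retailing", "Technology", "Energy"],
    "Revenue": [100, 80, 60, 40, 20],
})

groups = toy.groupby("Sector")

# ngroups counts the distinct values found in the grouping column.
print(groups.ngroups)

# .groups maps each group label to the index labels of the rows in that group.
for label, row_labels in groups.groups.items():
    print(label, list(row_labels))
```

Each key is a sector name and each value lists the rows that belong to it, which is why the object behaves much like a dictionary of DataFrames.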

In [4]:
fortune.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 1 to 1000
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Company    1000 non-null   object
 1   Sector     1000 non-null   object
 2   Industry   1000 non-null   object
 3   Revenue    1000 non-null   int64 
 4   Profits    1000 non-null   int64 
 5   Employees  1000 non-null   int64 
dtypes: int64(3), object(3)
memory usage: 54.7+ KB


In [5]:
# One way is to filter manually and then call the sum method
fortune[fortune["Sector"]=="Retailing"]["Revenue"].sum()
# However, writing such a query for each sector is not practical,
# and the set of sectors could change if the dataset changes in the future

np.int64(1465076)

#### Behind the scenes, the `groupby` method creates nested DataFrames grouped by the specified column
`groupby()` follows the <mark>"split-apply-combine"</mark> paradigm:<br>
- Split: The DataFrame is split into groups based on one or more columns.
- Apply: A function is applied to each group independently.
- Combine: The results of the function applications are combined into a new DataFrame or Series.
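The three steps can be spelled out by hand and compared with `groupby`. This is only a sketch on a made-up `toy` frame; it mirrors the idea of split-apply-combine and is not how pandas implements it internally.

```python
import pandas as pd

# Hypothetical miniature dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E"],
    "Sector": ["Retailing", "Energy", "Retailing", "Technology", "Energy"],
    "Revenue": [100, 80, 60, 40, 20],
})

# Split: one sub-DataFrame per distinct sector.
pieces = {label: toy[toy["Sector"] == label] for label in toy["Sector"].unique()}

# Apply: compute the total revenue of each piece independently.
totals = {label: piece["Revenue"].sum() for label, piece in pieces.items()}

# Combine: collect the per-group results into a single Series.
by_hand = pd.Series(totals, name="Revenue").sort_index()

# The same computation expressed with groupby.
with_groupby = toy.groupby("Sector")["Revenue"].sum()

print(by_hand)
print(with_groupby)
```

Both Series hold the same totals per sector; `groupby` simply does the bookkeeping in one call.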

In [6]:
sector = fortune.groupby("Sector")
sector

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026951DA1BE0>

#### Note that the output of the `groupby` method is a `DataFrameGroupBy` object, meaning that it is a collection of DataFrames, one for each *sector* in the 'Sector' column.

In [7]:
# The DataFrameGroupBy object holds 21 sub-DataFrames, one per unique value in the Sector column
len(sector)

21

In [8]:
# The size method shows the number of rows in each group
# So there are 20 rows in the dataset for the sector 'Aerospace & Defense', 15 rows for 'Apparel', and so on
sector.size()

Sector
Aerospace & Defense              20
Apparel                          15
Business Services                51
Chemicals                        30
Energy                          122
Engineering & Construction       26
Financials                      139
Food and Drug Stores             15
Food, Beverages & Tobacco        43
Health Care                      75
Hotels, Resturants & Leisure     25
Household Products               28
Industrials                      46
Materials                        43
Media                            25
Motor Vehicles & Parts           24
Retailing                        80
Technology                      102
Telecommunications               15
Transportation                   36
Wholesalers                      40
dtype: int64

In [9]:
# Finding average revenue by sector
sector["Revenue"].mean()
# Finding average profits by sector (only this last expression is displayed)
sector["Profits"].mean()

Sector
Aerospace & Defense             1437.100000
Apparel                          549.066667
Business Services                553.470588
Chemicals                        754.266667
Energy                          -602.024590
Engineering & Construction       204.000000
Financials                      1872.007194
Food and Drug Stores            1117.266667
Food, Beverages & Tobacco       1195.744186
Health Care                     1414.853333
Hotels, Resturants & Leisure     827.880000
Household Products               515.285714
Industrials                      451.391304
Materials                        102.976744
Media                            973.880000
Motor Vehicles & Parts          1079.083333
Retailing                        597.875000
Technology                      1769.343137
Telecommunications              3242.466667
Transportation                  1226.916667
Wholesalers                      205.825000
Name: Profits, dtype: float64

## Retrieve a Group with the get_group Method
- The `get_group` method on the **DataFrameGroupBy** object retrieves a nested **DataFrame** belonging to a specific group/category.

In [10]:
sector.get_group("Apparel")
sector.get_group("Media")
sector.get_group("Energy")

Rank,Company,Sector,Industry,Revenue,Profits,Employees
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
14,Chevron,Energy,Petroleum Refining,131118,4587,61500
30,Phillips 66,Energy,Petroleum Refining,87169,4227,14000
32,Valero Energy,Energy,Petroleum Refining,81824,3990,10103
42,Marathon Petroleum,Energy,Petroleum Refining,64566,2852,45440
...,...,...,...,...,...,...
981,WPX Energy,Energy,"Mining, Crude-Oil Production",1958,-1727,1040
983,Adams Resources & Energy,Energy,Petroleum Refining,1944,-1,809
995,EP Energy,Energy,"Mining, Crude-Oil Production",1908,-3748,665
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646


##### `get_group` returns the sub-DataFrame for the requested group from the DataFrameGroupBy object
##### NOTE: Multiple groups **CANNOT** be fetched at once. 
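Because `get_group` accepts a single key, rows for several groups have to be gathered another way. The sketch below shows two possible workarounds on a made-up `toy` frame (the data and the `wanted` list are hypothetical): filtering with `isin`, or concatenating individual `get_group` results.

```python
import pandas as pd

# Hypothetical miniature dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E"],
    "Sector": ["Retailing", "Energy", "Retailing", "Technology", "Energy"],
    "Revenue": [100, 80, 60, 40, 20],
})

sector_groups = toy.groupby("Sector")

# A single group comes back as an ordinary DataFrame.
energy = sector_groups.get_group("Energy")

wanted = ["Energy", "Retailing"]

# Option 1: filter the original DataFrame on the grouping column.
subset = toy[toy["Sector"].isin(wanted)]

# Option 2: fetch the groups one by one and stack them.
combined = pd.concat([sector_groups.get_group(name) for name in wanted])

print(subset)
print(combined)
```

Option 1 preserves the original row order, while Option 2 orders the rows group by group.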

In [11]:
type(sector.get_group("Apparel"))

pandas.core.frame.DataFrame

## Methods on the GroupBy Object
- Use square brackets on the **DataFrameGroupBy** object to "extract" a column from the original **DataFrame**.
- The resulting **SeriesGroupBy** object will have aggregation methods available on it.
- Pandas will perform the calculation on *every* group within the collection.
- For example, the `sum` method will add up the **Revenue** values of every row in each group/category.
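As a small illustration of the points above, the sketch below extracts one column from a grouped `toy` frame (made-up data) and computes several aggregations in one call by passing a list of method names to `agg`.

```python
import pandas as pd

# Hypothetical miniature dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E"],
    "Sector": ["Retailing", "Energy", "Retailing", "Technology", "Energy"],
    "Revenue": [100, 80, 60, 40, 20],
})

# Square brackets on the grouped object yield a SeriesGroupBy.
revenue_by_sector = toy.groupby("Sector")["Revenue"]
print(type(revenue_by_sector).__name__)

# A list of aggregation names gives one result column per aggregation.
stats = revenue_by_sector.agg(["sum", "mean", "max"])
print(stats)
```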

In [12]:
print(sector["Revenue"].sum())
print("==============")
print(sector["Revenue"].sum().loc["Apparel"])
print("==============")
print(sector.get_group("Apparel")["Revenue"].sum())
print("==============")
print(fortune[fortune["Sector"]=="Apparel"]["Revenue"].sum())

Sector
Aerospace & Defense              357940
Apparel                           95968
Business Services                272195
Chemicals                        243897
Energy                          1517809
Engineering & Construction       153983
Financials                      2217159
Food and Drug Stores             483769
Food, Beverages & Tobacco        555967
Health Care                     1614707
Hotels, Resturants & Leisure     169546
Household Products               234737
Industrials                      497581
Materials                        259145
Media                            220764
Motor Vehicles & Parts           482540
Retailing                       1465076
Technology                      1377600
Telecommunications               461834
Transportation                   408508
Wholesalers                      444800
Name: Revenue, dtype: int64
95968
95968
95968


In [13]:
type(sector["Revenue"].sum())

pandas.core.series.Series

##### The cells above arrive at the same result using different approaches
##### NOTE that `groupby` offers an efficient way of computing the result for ALL sectors AT ONCE


In [14]:
# Another example -- to find total employees working in each sector
sector["Employees"].sum()

Sector
Aerospace & Defense              968057
Apparel                          346397
Business Services               1361050
Chemicals                        463651
Energy                          1188927
Engineering & Construction       406708
Financials                      3359948
Food and Drug Stores            1395398
Food, Beverages & Tobacco       1211632
Health Care                     2678289
Hotels, Resturants & Leisure    2484245
Household Products               646038
Industrials                     1545229
Materials                        638123
Media                            550314
Motor Vehicles & Parts          1082560
Retailing                       6227629
Technology                      3578949
Telecommunications               832468
Transportation                  1536793
Wholesalers                      525597
Name: Employees, dtype: int64

In [15]:
# To find the max amount of profit made in each sector
sector["Profits"].max()

Sector
Aerospace & Defense              7608
Apparel                          3273
Business Services                6328
Chemicals                        7685
Energy                          16150
Engineering & Construction        803
Financials                      24442
Food and Drug Stores             5237
Food, Beverages & Tobacco        7351
Health Care                     18108
Hotels, Resturants & Leisure     5920
Household Products               7036
Industrials                      4833
Materials                         991
Media                            8382
Motor Vehicles & Parts           9687
Retailing                       14694
Technology                      53394
Telecommunications              17879
Transportation                   7610
Wholesalers                      1472
Name: Profits, dtype: int64

In [16]:
# One more: find the average revenue of each company, largest first
(fortune.groupby("Company"))["Revenue"].mean().sort_values(ascending=False)

Company
Walmart                       482130.0
Exxon Mobil                   246204.0
Apple                         233715.0
Berkshire Hathaway            210821.0
McKesson                      181241.0
                                ...   
EP Energy                       1908.0
New York Community Bancorp      1902.0
Portland General Electric       1898.0
Wendy’s                         1896.0
Briggs & Stratton               1895.0
Name: Revenue, Length: 996, dtype: float64

In [17]:
fortune["Industry"].value_counts()

Industry
Specialty Retailers: Other             42
Utilities: Gas and Electric            41
Chemicals                              30
Mining, Crude-Oil Production           28
Commercial Banks                       28
                                       ..
Waste Management                        5
Computer Peripherals                    4
Education                               3
Mail, Package, and Freight Delivery     2
Advertising, marketing                  2
Name: count, Length: 73, dtype: int64

#### On a GroupBy object, methods can be run on multiple columns at once. The result will be a DataFrame.
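A minimal sketch of selecting several columns, using a made-up `toy` frame (the `Profits` values are invented): passing a list of column names inside the square brackets keeps a DataFrame-shaped result.

```python
import pandas as pd

# Hypothetical miniature dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Sector": ["Retailing", "Energy", "Retailing", "Technology", "Energy"],
    "Revenue": [100, 80, 60, 40, 20],
    "Profits": [10, -4, 6, 8, 2],
})

# A list of columns (note the double brackets) selects several columns at once.
selected = toy.groupby("Sector")[["Revenue", "Profits"]]

means = selected.mean()   # one row per sector, one column per selected column
maxes = selected.max()

print(means)
print(maxes)
```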


In [21]:
sector[["Revenue","Profits"]].mean()
sector[["Revenue","Profits"]].max()

Sector,Revenue,Profits
Aerospace & Defense,96114,7608
Apparel,30601,3273
Business Services,19330,6328
Chemicals,48778,7685
Energy,246204,16150
Engineering & Construction,18114,803
Financials,210821,24442
Food and Drug Stores,153290,5237
"Food, Beverages & Tobacco",67702,7351
Health Care,181241,18108


## Grouping by Multiple Columns
- Pass a list of columns to the **groupby** method to group by pairings of values across columns.
- Target a column to retrieve the **SeriesGroupBy** object, then perform an aggregation with a method.
- Pandas will return a **MultiIndex** **Series** where the levels will be the original groups.
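The resulting MultiIndex can be queried level by level. The sketch below uses a made-up `toy` frame (sector and industry names are only placeholders) to show a full-key lookup, a partial lookup on the outer level, and `reset_index` as one way to flatten the result back into ordinary columns.

```python
import pandas as pd

# Hypothetical miniature dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Sector": ["Energy", "Energy", "Energy", "Retailing", "Retailing"],
    "Industry": ["Refining", "Refining", "Utilities", "Apparel", "Grocery"],
    "Revenue": [100, 50, 30, 20, 10],
})

totals = toy.groupby(["Sector", "Industry"])["Revenue"].sum()
print(totals)

# A tuple addresses one exact (Sector, Industry) pairing.
print(totals.loc[("Energy", "Refining")])

# A single outer label returns every industry inside that sector.
print(totals.loc["Energy"])

# reset_index turns the index levels back into regular columns.
flat = totals.reset_index()
print(flat)
```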

In [26]:
# Adding layers to the groupby
sector2 = fortune.groupby(["Sector","Industry"])
sector2.size()


Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            20
Apparel              Apparel                                          15
Business Services    Advertising, marketing                            2
                     Diversified Outsourcing Services                 14
                     Education                                         3
                                                                      ..
Transportation       Trucking, Truck Leasing                           9
Wholesalers          Miscellaneous                                     1
                     Wholesalers: Diversified                         25
                     Wholesalers: Electronics and Office Equipment     8
                     Wholesalers: Food and Grocery                     6
Length: 79, dtype: int64

In [31]:
# finding total revenue in each Sector , further grouped by Industry 
sector2["Revenue"].sum()

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            357940
Apparel              Apparel                                           95968
Business Services    Advertising, marketing                            22748
                     Diversified Outsourcing Services                  64829
                     Education                                          7485
                                                                       ...  
Transportation       Trucking, Truck Leasing                           35950
Wholesalers          Miscellaneous                                      8982
                     Wholesalers: Diversified                         176138
                     Wholesalers: Electronics and Office Equipment    147906
                     Wholesalers: Food and Grocery                    111774
Name: Revenue, Length: 79, dtype: int64

In [32]:
# finding average revenue in each Sector , further grouped by Industry 
sector2["Revenue"].mean()

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            17897.000000
Apparel              Apparel                                           6397.866667
Business Services    Advertising, marketing                           11374.000000
                     Diversified Outsourcing Services                  4630.642857
                     Education                                         2495.000000
                                                                          ...     
Transportation       Trucking, Truck Leasing                           3994.444444
Wholesalers          Miscellaneous                                     8982.000000
                     Wholesalers: Diversified                          7045.520000
                     Wholesalers: Electronics and Office Equipment    18488.250000
                     Wholesalers: Food and Grocery                    18629.000000
Name: Revenue, Length: 79, dtype: float64

In [42]:
# Accessing a specific group and sub-group from the grouped DataFrame
sector2.get_group(("Business Services","Education"))

Rank,Company,Sector,Industry,Revenue,Profits,Employees
737,Graham Holdings,Business Services,Education,2984,-101,11585
820,Apollo Education Group,Business Services,Education,2591,30,23400
993,DeVry Education Group,Business Services,Education,1910,140,11770


## The agg Method
- The `agg` method applies different aggregation methods on different columns.
- Invoke the `agg` method directly on the **DataFrameGroupBy** object.
- Pass the method a dictionary where the keys are the columns and the values are the aggregation operations.
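The dictionary form can be tried on a small frame. The sketch below uses made-up data and also shows pandas' named-aggregation form of `agg`, which is an alternative to the dictionary when custom output column names are wanted; the output names (`total_revenue`, `avg_profit`, `top_profit`) are hypothetical.

```python
import pandas as pd

# Hypothetical miniature dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Sector": ["Retailing", "Energy", "Retailing", "Technology", "Energy"],
    "Revenue": [100, 80, 60, 40, 20],
    "Profits": [10, -4, 6, 8, 2],
})

grouped = toy.groupby("Sector")

# Dictionary form: column name -> aggregation to apply to that column.
summary = grouped.agg({"Revenue": "sum", "Profits": "mean"})
print(summary)

# Named aggregation: output_name=(column, aggregation).
named = grouped.agg(
    total_revenue=("Revenue", "sum"),
    avg_profit=("Profits", "mean"),
    top_profit=("Profits", "max"),
)
print(named)
```

Named aggregation also allows the same column to be aggregated more than once, as `Profits` is here.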

In [45]:
sector.agg({"Revenue":"sum","Profits":"mean"})

Sector,Revenue,Profits
Aerospace & Defense,357940,1437.1
Apparel,95968,549.066667
Business Services,272195,553.470588
Chemicals,243897,754.266667
Energy,1517809,-602.02459
Engineering & Construction,153983,204.0
Financials,2217159,1872.007194
Food and Drug Stores,483769,1117.266667
"Food, Beverages & Tobacco",555967,1195.744186
Health Care,1614707,1414.853333


In [47]:
# Applies to multi-column groupby as well
sector2.agg({"Revenue":"sum","Profits":"mean"})

Sector,Industry,Revenue,Profits
Aerospace & Defense,Aerospace and Defense,357940,1437.100000
Apparel,Apparel,95968,549.066667
Business Services,"Advertising, marketing",22748,774.500000
Business Services,Diversified Outsourcing Services,64829,307.500000
Business Services,Education,7485,23.000000
...,...,...,...
Transportation,"Trucking, Truck Leasing",35950,212.222222
Wholesalers,Miscellaneous,8982,17.000000
Wholesalers,Wholesalers: Diversified,176138,207.720000
Wholesalers,Wholesalers: Electronics and Office Equipment,147906,232.125000


## Iterating through Groups 
- The **DataFrameGroupBy** object supports the `apply` method (just like a **Series** and a **DataFrame** do).
- The `apply` method invokes a function on every nested **DataFrame** in the **DataFrameGroupBy** object.
- It captures the return values of the functions and collects them in a new **DataFrame** (the return value).
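Besides `apply`, a grouped object can also be looped over directly: each iteration yields the group label together with its sub-DataFrame. The sketch below demonstrates this on a made-up `toy` frame; the `largest_employer` dictionary is just an illustrative container.

```python
import pandas as pd

# Hypothetical miniature dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E", "F"],
    "Sector": ["Energy", "Energy", "Energy", "Retailing", "Retailing", "Retailing"],
    "Employees": [500, 900, 700, 300, 100, 200],
})

largest_employer = {}

# Iterating yields (group label, sub-DataFrame) pairs.
for name, group in toy.groupby("Sector"):
    # idxmax gives the index label of the row with the most employees.
    top_row = group.loc[group["Employees"].idxmax()]
    largest_employer[name] = top_row["Company"]
    print(f"{name}: {len(group)} rows, largest employer is {top_row['Company']}")
```

An explicit loop is handy for inspection or debugging, while `apply` is more convenient when the per-group results should be combined into one object.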

In [50]:
# Goal: find the 2 companies with the most employees in each sector
# The nlargest method returns the rows with the n largest values in the given column
fortune.nlargest(n=2,columns="Employees")
# Now we want to apply this to each grouped sector

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
218,Yum Brands,"Hotels, Resturants & Leisure",Food Services,13105,1293,505000


##### Now, it is important to decide which parameter the function should accept. This is easy to work out: the function runs on each group of the *Sector* DataFrameGroupBy object, so it receives one sub-DataFrame at a time.
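The same pattern can be checked on a small made-up frame. In this sketch the function's parameter is named `group` to stress that it receives one sub-DataFrame per call, and the columns are selected explicitly before `apply` so that the function only sees non-grouping columns; that selection is a choice made for this example, not something the notebook requires.

```python
import pandas as pd

# Hypothetical miniature dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E", "F"],
    "Sector": ["Energy", "Energy", "Energy", "Retailing", "Retailing", "Retailing"],
    "Employees": [500, 900, 700, 300, 100, 200],
})

def top2_employers(group):
    # `group` is the sub-DataFrame holding the rows of a single sector.
    return group.nlargest(n=2, columns="Employees")

# apply calls the function once per sector and stacks the returned frames.
result = toy.groupby("Sector")[["Company", "Employees"]].apply(top2_employers)
print(result)
```

The result carries a two-level index: the sector label followed by the original row label of each selected company.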

In [51]:
def top2_employers(sector):
    return sector.nlargest(n=2,columns="Employees")

In [52]:
sector.apply(top2_employers)


Sector,Rank,Company,Sector,Industry,Revenue,Profits,Employees
Aerospace & Defense,45,United Technologies,Aerospace & Defense,Aerospace and Defense,61047,7608,197200
Aerospace & Defense,24,Boeing,Aerospace & Defense,Aerospace and Defense,96114,5176,161400
Apparel,448,Hanesbrands,Apparel,Apparel,5732,429,65300
Apparel,231,VF,Apparel,Apparel,12377,1232,64000
Business Services,199,Aramark,Business Services,Diversified Outsourcing Services,14329,236,216500
Business Services,744,Convergys,Business Services,Diversified Outsourcing Services,2951,169,130000
Chemicals,101,DuPont,Chemicals,Chemicals,27940,1953,52000
Chemicals,56,Dow Chemical,Chemicals,Chemicals,48778,7685,49495
Energy,2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
Energy,117,Halliburton,Energy,"Oil and Gas Equipment, Services",23633,-671,65000
