# The GroupBy Object

In [38]:
import pandas as pd
import numpy as np

## The Fortune 1000 Dataset
- The **Fortune 1000** is a listing of the 1000 largest American companies as ranked by Fortune magazine.
- The **DataFrame** includes each company's name, sector, industry, revenue, profits, and number of employees.

In [5]:
fortune = pd.read_csv("fortune1000.csv", index_col="Rank")
fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


## The groupby Method
- **Grouping** splits the rows of a **DataFrame** into categories based on the values in a column.
- The `groupby` method returns a **DataFrameGroupBy** object. It behaves like a dictionary-like collection of **DataFrames**, one per distinct value in the grouping column.
- The **DataFrameGroupBy** object can perform aggregate operations on *each* group within it.
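
As a minimal sketch of the points above, the following uses a small made-up table (the company names and figures are invented for illustration, not taken from `fortune1000.csv`) to show what a grouped object exposes before any aggregation runs.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Fortune 1000 data.
companies = pd.DataFrame({
    "Company": ["Acme", "Bolt", "Core", "Dyna", "Echo"],
    "Sector": ["Retailing", "Energy", "Technology", "Energy", "Retailing"],
    "Revenue": [500, 200, 300, 150, 400],
})

by_sector = companies.groupby("Sector")

# One group exists for each distinct Sector value.
group_count = len(by_sector)
print(group_count)

# .groups maps each group label to the index labels of its rows.
group_labels = sorted(by_sector.groups)
print(group_labels)

# .size() counts the rows in each group.
sizes = by_sector.size()
print(sizes)
```

The grouped object itself holds no aggregated values; it only records which rows belong to which label until a method such as `size` or `sum` is called.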

In [8]:
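# Manual approach: filter one sector, then sum its revenue.
fortune[fortune["Sector"] == "Retailing"]["Revenue"].sum()

# groupby builds a group for every sector in one step.
fortune_group = fortune.groupby("Sector")
fortune_group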

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x763de37417f0>

In [14]:
len(fortune_group)
fortune_group.size()
fortune_group.describe()
fortune_group.first()
fortune_group.last()

Sector,Company,Industry,Revenue,Profits,Employees
Aerospace & Defense,Delta Tucker Holdings,Aerospace and Defense,1923,-133,12000
Apparel,Guess,Apparel,2204,82,13500
Business Services,DeVry Education Group,Education,1910,140,11770
Chemicals,H.B. Fuller,Chemicals,2084,87,4425
Energy,Portland General Electric,Utilities: Gas and Electric,1898,172,2646
Engineering & Construction,MDC Holdings,Homebuilders,1909,66,1225
Financials,New York Community Bancorp,Commercial Banks,1902,-47,3448
Food and Drug Stores,Fred’s,Food and Drug Stores,2151,-7,7103
"Food, Beverages & Tobacco",Alliance One International,Tobacco,2066,-15,6835
Health Care,Providence Service,Health Care: Pharmacy and Other Services,1987,84,9072


## Retrieve a Group with the get_group Method
- The `get_group` method on the **DataFrameGroupBy** object returns a **DataFrame** containing only the rows that belong to the specified group/category.
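
A short sketch of `get_group`, again on invented data rather than the real dataset: it should select the same rows that a boolean filter on the grouping column would, and asking for a label that does not exist raises a `KeyError`.

```python
import pandas as pd

# Hypothetical sample data; names and numbers are made up.
companies = pd.DataFrame({
    "Company": ["Acme", "Bolt", "Core", "Dyna", "Echo"],
    "Sector": ["Retailing", "Energy", "Technology", "Energy", "Retailing"],
    "Revenue": [500, 200, 300, 150, 400],
})

by_sector = companies.groupby("Sector")

# Pull out the rows for a single group label.
energy = by_sector.get_group("Energy")

# The equivalent manual filter on the original DataFrame.
filtered = companies[companies["Sector"] == "Energy"]

print(list(energy["Company"]))
print(list(energy.index) == list(filtered.index))

# An unknown label is an error rather than an empty result.
try:
    by_sector.get_group("Mining")
    missing_label_raised = False
except KeyError:
    missing_label_raised = True
```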

In [20]:
fortune[fortune["Sector"] == "Retailing"]["Revenue"].sum()

sectors = fortune.groupby("Sector")

fortune.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 1 to 1000
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Company    1000 non-null   object
 1   Sector     1000 non-null   object
 2   Industry   1000 non-null   object
 3   Revenue    1000 non-null   int64 
 4   Profits    1000 non-null   int64 
 5   Employees  1000 non-null   int64 
dtypes: int64(3), object(3)
memory usage: 54.7+ KB


In [19]:
sectors.get_group("Technology")


## Methods on the GroupBy Object
- Use square brackets on the **DataFrameGroupBy** object to "extract" a column from the original **DataFrame**.
- The resulting **SeriesGroupBy** object will have aggregation methods available on it.
- Pandas will perform the calculation on *every* group within the collection.
- For example, calling the `sum` method on the **Revenue** column adds up the revenue of all rows within each group/category.
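
The relationship between column selection and per-group aggregation can be sketched on a small invented table (not the real dataset): the per-group total should match what a manual filter-then-sum produces for any single group.

```python
import pandas as pd

# Hypothetical sample data; names and numbers are made up.
companies = pd.DataFrame({
    "Company": ["Acme", "Bolt", "Core", "Dyna", "Echo"],
    "Sector": ["Retailing", "Energy", "Technology", "Energy", "Retailing"],
    "Revenue": [500, 200, 300, 150, 400],
})

by_sector = companies.groupby("Sector")

# Square brackets select one column and yield a SeriesGroupBy object.
revenue_groups = by_sector["Revenue"]
print(type(revenue_groups).__name__)

# The aggregation runs once per group and returns a Series
# indexed by the group labels.
totals = revenue_groups.sum()
print(totals)

# The same number for one sector, computed by hand.
manual_energy = companies.loc[companies["Sector"] == "Energy", "Revenue"].sum()
print(totals["Energy"] == manual_energy)
```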

In [22]:
fortune[fortune["Sector"] == "Retailing"]["Revenue"].sum()

sectors = fortune.groupby("Sector")



In [24]:
sectors["Revenue"].sum()

Sector
Aerospace & Defense              357940
Apparel                           95968
Business Services                272195
Chemicals                        243897
Energy                          1517809
Engineering & Construction       153983
Financials                      2217159
Food and Drug Stores             483769
Food, Beverages & Tobacco        555967
Health Care                     1614707
Hotels, Resturants & Leisure     169546
Household Products               234737
Industrials                      497581
Materials                        259145
Media                            220764
Motor Vehicles & Parts           482540
Retailing                       1465076
Technology                      1377600
Telecommunications               461834
Transportation                   408508
Wholesalers                      444800
Name: Revenue, dtype: int64

In [25]:
sectors["Employees"].sum()

Sector
Aerospace & Defense              968057
Apparel                          346397
Business Services               1361050
Chemicals                        463651
Energy                          1188927
Engineering & Construction       406708
Financials                      3359948
Food and Drug Stores            1395398
Food, Beverages & Tobacco       1211632
Health Care                     2678289
Hotels, Resturants & Leisure    2484245
Household Products               646038
Industrials                     1545229
Materials                        638123
Media                            550314
Motor Vehicles & Parts          1082560
Retailing                       6227629
Technology                      3578949
Telecommunications               832468
Transportation                  1536793
Wholesalers                      525597
Name: Employees, dtype: int64

In [30]:
sectors["Profits"].max()
sectors["Employees"].mean().astype("int")

Sector
Aerospace & Defense             48402
Apparel                         23093
Business Services               26687
Chemicals                       15455
Energy                           9745
Engineering & Construction      15642
Financials                      24172
Food and Drug Stores            93026
Food, Beverages & Tobacco       28177
Health Care                     35710
Hotels, Resturants & Leisure    99369
Household Products              23072
Industrials                     33591
Materials                       14840
Media                           22012
Motor Vehicles & Parts          45106
Retailing                       77845
Technology                      35087
Telecommunications              55497
Transportation                  42688
Wholesalers                     13139
Name: Employees, dtype: int64

## Grouping by Multiple Columns
- Pass a list of columns to the `groupby` method to group by pairings of values across columns.
- Target a column to retrieve the **SeriesGroupBy** object, then perform an aggregation with a method.
- Pandas will return a **Series** with a **MultiIndex** that has one level per grouping column.
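
To illustrate how the resulting **MultiIndex** can be used, here is a sketch on a tiny invented table (sector and industry labels are placeholders). It shows selecting by a full pair, selecting by the first level only, and flattening the result back into ordinary columns.

```python
import pandas as pd

# Hypothetical sample data; names and numbers are made up.
companies = pd.DataFrame({
    "Company": ["Acme", "Bolt", "Core", "Dyna", "Echo"],
    "Sector": ["Energy", "Energy", "Energy", "Retailing", "Retailing"],
    "Industry": ["Utilities", "Utilities", "Pipelines", "Apparel", "Grocery"],
    "Revenue": [100, 200, 50, 300, 400],
})

revenue = companies.groupby(["Sector", "Industry"])["Revenue"].sum()

# The index levels are named after the grouping columns.
level_names = list(revenue.index.names)
print(level_names)

# A full (Sector, Industry) tuple selects a single value.
energy_utilities = revenue.loc[("Energy", "Utilities")]
print(energy_utilities)

# A first-level label alone selects every industry in that sector.
energy_only = revenue.loc["Energy"]
print(energy_only)

# reset_index turns the index levels back into ordinary columns.
flat = revenue.reset_index()
print(flat)
```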

In [32]:
fortune[fortune["Sector"] == "Retailing"]["Revenue"].sum()

sectors = fortune.groupby(["Sector", "Industry"])

fortune

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400
...,...,...,...,...,...,...
996,New York Community Bancorp,Financials,Commercial Banks,1902,-47,3448
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
999,Wendy’s,"Hotels, Resturants & Leisure",Food Services,1896,161,21200


In [34]:
sectors.size()

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            20
Apparel              Apparel                                          15
Business Services    Advertising, marketing                            2
                     Diversified Outsourcing Services                 14
                     Education                                         3
                                                                      ..
Transportation       Trucking, Truck Leasing                           9
Wholesalers          Miscellaneous                                     1
                     Wholesalers: Diversified                         25
                     Wholesalers: Electronics and Office Equipment     8
                     Wholesalers: Food and Grocery                     6
Length: 79, dtype: int64

In [40]:
np.ceil(sectors["Revenue"].mean())

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            17897.0
Apparel              Apparel                                           6398.0
Business Services    Advertising, marketing                           11374.0
                     Diversified Outsourcing Services                  4631.0
                     Education                                         2495.0
                                                                       ...   
Transportation       Trucking, Truck Leasing                           3995.0
Wholesalers          Miscellaneous                                     8982.0
                     Wholesalers: Diversified                          7046.0
                     Wholesalers: Electronics and Office Equipment    18489.0
                     Wholesalers: Food and Grocery                    18629.0
Name: Revenue, Length: 79, dtype: float64

## The agg Method
- The `agg` method applies different aggregation methods on different columns.
- Invoke the `agg` method directly on the **DataFrameGroupBy** object.
- Pass the method a dictionary where the keys are the columns and the values are the aggregation operations.
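
The dictionary form described above can be sketched on invented data as follows. The second call uses pandas' named-aggregation syntax, an alternative not covered in this section, which lets the output columns carry custom names; the names `total_revenue` and `average_profit` are illustrative choices.

```python
import pandas as pd

# Hypothetical sample data; numbers are made up.
companies = pd.DataFrame({
    "Sector": ["Energy", "Energy", "Retailing", "Retailing"],
    "Revenue": [100, 200, 300, 400],
    "Profits": [10, 30, -20, 60],
    "Employees": [50, 80, 500, 900],
})

by_sector = companies.groupby("Sector")

# Dictionary form: each key is a column, each value is its aggregation.
summary = by_sector.agg({
    "Revenue": "sum",
    "Profits": "mean",
    "Employees": "max",
})
print(summary)

# Named aggregation: keyword = (column, aggregation) controls output names.
named = by_sector.agg(
    total_revenue=("Revenue", "sum"),
    average_profit=("Profits", "mean"),
)
print(named)
```

Columns left out of the dictionary (or the keyword list) do not appear in the result.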

In [41]:
fortune[fortune["Sector"] == "Retailing"]["Revenue"].sum()

sectors = fortune.groupby(["Sector", "Industry"])



In [43]:
# Dictionary argument: keys are column names, values are aggregation operations.
sectors.agg({
    "Revenue": "sum",
    "Profits": "mean",
    "Employees": "max",
})

Sector,Industry,Revenue,Profits,Employees
Aerospace & Defense,Aerospace and Defense,357940,1437.100000,197200
Apparel,Apparel,95968,549.066667,65300
Business Services,"Advertising, marketing",22748,774.500000,74900
Business Services,Diversified Outsourcing Services,64829,307.500000,216500
Business Services,Education,7485,23.000000,23400
...,...,...,...,...
Transportation,"Trucking, Truck Leasing",35950,212.222222,33100
Wholesalers,Miscellaneous,8982,17.000000,9200
Wholesalers,Wholesalers: Diversified,176138,207.720000,39600
Wholesalers,Wholesalers: Electronics and Office Equipment,147906,232.125000,78500


## Iterating through Groups 
- The **DataFrameGroupBy** object supports the `apply` method (just like a **Series** and a **DataFrame** do).
- The `apply` method invokes a function on every nested **DataFrame** in the **DataFrameGroupBy** object.
- It collects the function's return values, one per group, and combines them into a new **DataFrame** (the return value).

In [44]:
fortune[fortune["Sector"] == "Retailing"]["Revenue"].sum()

sectors = fortune.groupby(["Sector", "Industry"])



In [54]:
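# Baseline: the two largest employers across the whole dataset.
fortune.nlargest(2, "Employees")

def top_two(group):
    # Return the two rows with the most employees in this group.
    return group.nlargest(2, "Employees")

# Run top_two on every (Sector, Industry) group and save the combined result.
sectors.apply(top_two, include_groups=False).to_csv("sectors.csv")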
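
As a sketch of what `apply` does behind the scenes, the example below first walks the groups with an explicit `for` loop and then obtains a per-group result with `apply`. The data is invented, and selecting the columns before calling `apply` is an assumption made here so the sketch does not depend on the `include_groups` parameter, which only newer pandas releases accept.

```python
import pandas as pd

# Hypothetical sample data; names and numbers are made up.
companies = pd.DataFrame({
    "Company": ["Acme", "Bolt", "Dyna", "Echo", "Flux", "Gala"],
    "Sector": ["Retailing", "Energy", "Energy", "Retailing", "Energy", "Retailing"],
    "Employees": [500, 120, 80, 900, 200, 100],
})

by_sector = companies.groupby("Sector")

# Iterating yields (group label, sub-DataFrame) pairs, one per group.
largest_employer = {}
for sector_name, sector_frame in by_sector:
    top_row = sector_frame.nlargest(1, "Employees")
    largest_employer[sector_name] = top_row["Company"].iloc[0]
print(largest_employer)


def top_two(group):
    # Keep the two rows with the most employees in this group.
    return group.nlargest(2, "Employees")


# apply calls top_two once per group and stitches the pieces together.
top_two_per_sector = by_sector[["Company", "Employees"]].apply(top_two)
print(top_two_per_sector)
```

The explicit loop is handy when each group needs side effects (printing, writing separate files), while `apply` is the more compact choice when the per-group results should end up in a single object.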