# The GroupBy Object

In [1]:
import pandas as pd

## The Fortune 1000 Dataset
- The **Fortune 1000** is a list of the 1,000 largest American companies, as ranked by Fortune magazine.
- The **DataFrame** includes each company's name, sector, industry, revenue, profits, and number of employees.

In [7]:
fort100 = pd.read_csv("fortune1000.csv", index_col="Rank")
fort100

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400
...,...,...,...,...,...,...
996,New York Community Bancorp,Financials,Commercial Banks,1902,-47,3448
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
999,Wendy’s,"Hotels, Resturants & Leisure",Food Services,1896,161,21200


## The groupby Method
- **Grouping** organizes the rows of a **DataFrame** into categories based on the values in a column.
- The `groupby` method returns a **DataFrameGroupBy** object, which behaves like a dictionary-like collection of **DataFrames**, one per unique value in the grouping column.
- The **DataFrameGroupBy** object can perform aggregate operations on *each* group within it.

In [4]:
sectors = fort100.groupby("Sector")
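
To make the dictionary-like structure more concrete, here is a minimal sketch on a five-row stand-in for the dataset. The `companies` DataFrame and the variable names are assumptions made for illustration (the rows are copied from the outputs in this notebook); `groups` and `size` are standard pandas GroupBy members.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Fortune 1000 data.
companies = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "McKesson",
                    "UnitedHealth Group", "Portland General Electric"],
        "Sector": ["Retailing", "Energy", "Health Care", "Health Care", "Energy"],
        "Revenue": [482130, 246204, 181241, 157107, 1898],
    }
)

by_sector = companies.groupby("Sector")

# Number of distinct groups (unique Sector values).
n_groups = len(by_sector)

# Mapping of group label -> index labels of the rows in that group.
group_index = {label: list(rows) for label, rows in by_sector.groups.items()}

# Number of rows in each group.
group_sizes = by_sector.size()

print(n_groups)
print(group_index)
print(group_sizes)
```

Nothing is aggregated at this point: the GroupBy object only records which rows belong to which group.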

## Retrieve a Group with the get_group Method
- The `get_group` method on the **DataFrameGroupBy** object retrieves the **DataFrame** of rows that belong to a specific group/category.

In [9]:
sectors.get_group("Health Care")

Rank,Company,Sector,Industry,Revenue,Profits,Employees
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400
6,UnitedHealth Group,Health Care,Health Care: Insurance and Managed Care,157107,5813,200000
12,AmerisourceBergen,Health Care,Wholesalers: Health Care,135962,-135,17000
21,Cardinal Health,Health Care,Wholesalers: Health Care,102531,1215,34500
22,Express Scripts Holding,Health Care,Health Care: Pharmacy and Other Services,101752,2476,25900
...,...,...,...,...,...,...
935,VCA,Health Care,Health Care: Medical Facilities,2134,211,12700
960,PharMerica,Health Care,Health Care: Pharmacy and Other Services,2029,35,5200
965,Bio-Rad Laboratories,Health Care,Medical Products and Equipment,2019,113,7770
977,Hill-Rom Holdings,Health Care,Medical Products and Equipment,1988,48,10000
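
One way to think about `get_group` is as a shortcut for filtering the original **DataFrame** on the grouping column. The sketch below illustrates this on a hypothetical five-row `companies` DataFrame (an assumption for illustration, built from rows shown in this notebook). It also shows that asking for a label that is not a group raises a `KeyError`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Fortune 1000 data.
companies = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "McKesson",
                    "UnitedHealth Group", "Portland General Electric"],
        "Sector": ["Retailing", "Energy", "Health Care", "Health Care", "Energy"],
        "Revenue": [482130, 246204, 181241, 157107, 1898],
    },
    index=pd.Index([1, 2, 5, 6, 997], name="Rank"),
)

by_sector = companies.groupby("Sector")

# Rows of one group, retrieved through the GroupBy object.
health_care = by_sector.get_group("Health Care")

# The same rows, selected with a boolean filter on the original DataFrame.
filtered = companies[companies["Sector"] == "Health Care"]

# A label that does not exist as a group raises KeyError.
try:
    by_sector.get_group("Utilities")
    missing_raises = False
except KeyError:
    missing_raises = True

print(health_care)
print(list(health_care.index) == list(filtered.index))
print(missing_raises)
```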


In [11]:
sectors["Revenue"].sum()

Sector
Aerospace & Defense              357940
Apparel                           95968
Business Services                272195
Chemicals                        243897
Energy                          1517809
Engineering & Construction       153983
Financials                      2217159
Food and Drug Stores             483769
Food, Beverages & Tobacco        555967
Health Care                     1614707
Hotels, Resturants & Leisure     169546
Household Products               234737
Industrials                      497581
Materials                        259145
Media                            220764
Motor Vehicles & Parts           482540
Retailing                       1465076
Technology                      1377600
Telecommunications               461834
Transportation                   408508
Wholesalers                      444800
Name: Revenue, dtype: int64

## Methods on the GroupBy Object
- Use square brackets on the **DataFrameGroupBy** object to select a column from the original **DataFrame**.
- The resulting **SeriesGroupBy** object has aggregation methods such as `sum` available on it.
- Pandas performs the calculation separately on *every* group within the collection.
- For example, `sectors["Revenue"].sum()` adds up the **Revenue** values of all rows within each group/category.
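
The same pattern works for other aggregations. The following is a small sketch on a hypothetical five-row `companies` DataFrame (an assumption for illustration, using rows shown in this notebook), so the per-group results can be checked by hand.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Fortune 1000 data.
companies = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "McKesson",
                    "UnitedHealth Group", "Portland General Electric"],
        "Sector": ["Retailing", "Energy", "Health Care", "Health Care", "Energy"],
        "Revenue": [482130, 246204, 181241, 157107, 1898],
    }
)

# Square brackets on the DataFrameGroupBy object yield a SeriesGroupBy object.
revenue_by_sector = companies.groupby("Sector")["Revenue"]

# Each aggregation is computed once per group and returns a Series
# indexed by the group labels.
totals = revenue_by_sector.sum()
averages = revenue_by_sector.mean()
largest = revenue_by_sector.max()

print(type(revenue_by_sector).__name__)
print(totals)
print(averages)
print(largest)
```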

## Grouping by Multiple Columns
- Pass a list of columns to the `groupby` method to group by each unique combination of values across those columns.
- Target a column to retrieve the **SeriesGroupBy** object, then perform an aggregation with a method.
- Pandas will return a **Series** with a **MultiIndex** whose levels correspond to the grouping columns.

In [14]:
multi_sectors = fort100.groupby(["Sector","Industry"])

multi_sectors.size()

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            20
Apparel              Apparel                                          15
Business Services    Advertising, marketing                            2
                     Diversified Outsourcing Services                 14
                     Education                                         3
                                                                      ..
Transportation       Trucking, Truck Leasing                           9
Wholesalers          Miscellaneous                                     1
                     Wholesalers: Diversified                         25
                     Wholesalers: Electronics and Office Equipment     8
                     Wholesalers: Food and Grocery                     6
Length: 79, dtype: int64

In [17]:
multi_sectors.get_group(("Aerospace & Defense", "Aerospace and Defense"))

Rank,Company,Sector,Industry,Revenue,Profits,Employees
24,Boeing,Aerospace & Defense,Aerospace and Defense,96114,5176,161400
45,United Technologies,Aerospace & Defense,Aerospace and Defense,61047,7608,197200
60,Lockheed Martin,Aerospace & Defense,Aerospace and Defense,46132,3605,126000
88,General Dynamics,Aerospace & Defense,Aerospace and Defense,31469,2965,99900
118,Northrop Grumman,Aerospace & Defense,Aerospace and Defense,23526,1990,65000
120,Raytheon,Aerospace & Defense,Aerospace and Defense,23247,2074,61000
209,Textron,Aerospace & Defense,Aerospace and Defense,13423,697,35000
245,L-3 Communications,Aerospace & Defense,Aerospace and Defense,11554,-240,38000
282,Precision Castparts,Aerospace & Defense,Aerospace and Defense,10056,1530,30106
378,Huntington Ingalls Industries,Aerospace & Defense,Aerospace and Defense,7020,404,35995
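
As a rough sketch of how the **MultiIndex** result can be used, the example below groups a hypothetical five-row `companies` DataFrame (an assumption for illustration, using rows shown in this notebook) by two columns, aggregates one column, and then selects values by a full (sector, industry) pair or by sector alone.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Fortune 1000 data.
companies = pd.DataFrame(
    {
        "Company": ["Exxon Mobil", "McKesson", "UnitedHealth Group",
                    "AmerisourceBergen", "Cardinal Health"],
        "Sector": ["Energy", "Health Care", "Health Care",
                   "Health Care", "Health Care"],
        "Industry": [
            "Petroleum Refining",
            "Wholesalers: Health Care",
            "Health Care: Insurance and Managed Care",
            "Wholesalers: Health Care",
            "Wholesalers: Health Care",
        ],
        "Revenue": [246204, 181241, 157107, 135962, 102531],
    }
)

by_sector_industry = companies.groupby(["Sector", "Industry"])

# Aggregating one column gives a Series with a two-level MultiIndex.
totals = by_sector_industry["Revenue"].sum()

# A full (Sector, Industry) tuple selects a single value.
wholesalers_total = totals[("Health Care", "Wholesalers: Health Care")]

# The first level alone selects every industry within that sector.
health_care_totals = totals.loc["Health Care"]

# get_group also takes the tuple of grouping values.
wholesalers = by_sector_industry.get_group(("Health Care", "Wholesalers: Health Care"))

print(totals)
print(wholesalers_total)
print(health_care_totals)
print(wholesalers)
```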


## The agg Method
- The `agg` method can apply a different aggregation to each column.
- Invoke the `agg` method directly on the **DataFrameGroupBy** object.
- Pass the method a dictionary where the keys are the column names and the values are the aggregation operations to apply to them.

In [18]:
sectors.agg({"Revenue": "sum", "Profits": "max"})

Sector,Revenue,Profits
Aerospace & Defense,357940,7608
Apparel,95968,3273
Business Services,272195,6328
Chemicals,243897,7685
Energy,1517809,16150
Engineering & Construction,153983,803
Financials,2217159,24442
Food and Drug Stores,483769,5237
"Food, Beverages & Tobacco",555967,7351
Health Care,1614707,18108
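
The dictionary form can be checked by hand on a small example. The sketch below uses a hypothetical five-row `companies` DataFrame (an assumption for illustration, using rows shown in this notebook). It also shows pandas' keyword-based "named aggregation" form of `agg`, which is not used elsewhere in this notebook, as one possible way to choose the output column names.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Fortune 1000 data.
companies = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "McKesson",
                    "UnitedHealth Group", "Portland General Electric"],
        "Sector": ["Retailing", "Energy", "Health Care", "Health Care", "Energy"],
        "Revenue": [482130, 246204, 181241, 157107, 1898],
        "Profits": [14694, 16150, 1476, 5813, 172],
    }
)

by_sector = companies.groupby("Sector")

# Dictionary form: column name -> aggregation to apply to that column.
summary = by_sector.agg({"Revenue": "sum", "Profits": "max"})

# Named aggregation: output column name = (input column, aggregation).
named = by_sector.agg(
    total_revenue=("Revenue", "sum"),
    top_profit=("Profits", "max"),
    n_companies=("Company", "count"),
)

print(summary)
print(named)
```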


In [23]:
sectors.get_group("Aerospace & Defense").sort_values(by="Employees", ascending=False)

Rank,Company,Sector,Industry,Revenue,Profits,Employees
45,United Technologies,Aerospace & Defense,Aerospace and Defense,61047,7608,197200
24,Boeing,Aerospace & Defense,Aerospace and Defense,96114,5176,161400
60,Lockheed Martin,Aerospace & Defense,Aerospace and Defense,46132,3605,126000
88,General Dynamics,Aerospace & Defense,Aerospace and Defense,31469,2965,99900
118,Northrop Grumman,Aerospace & Defense,Aerospace and Defense,23526,1990,65000
120,Raytheon,Aerospace & Defense,Aerospace and Defense,23247,2074,61000
245,L-3 Communications,Aerospace & Defense,Aerospace and Defense,11554,-240,38000
378,Huntington Ingalls Industries,Aerospace & Defense,Aerospace and Defense,7020,404,35995
209,Textron,Aerospace & Defense,Aerospace and Defense,13423,697,35000
282,Precision Castparts,Aerospace & Defense,Aerospace and Defense,10056,1530,30106


## Iterating through Groups
- The **DataFrameGroupBy** object supports the `apply` method (just like a **Series** and a **DataFrame** do).
- The `apply` method invokes a function once per group, passing it that group's **DataFrame**.
- It collects the return values of those calls into a single new object: a **Series** or a **DataFrame**, depending on what the function returns.

In [28]:
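def double_employees_number(df: pd.DataFrame) -> pd.Series:
    return df["Employees"] * 2

# include_groups=False (pandas 2.2+) leaves the grouping column out of each group passed to the function.
sectors.apply(double_employees_number, include_groups=False)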

Sector               Rank
Aerospace & Defense  24      322800
                     45      394400
                     60      252000
                     88      199800
                     118     130000
                              ...  
Wholesalers          780      11678
                     808      25056
                     837       6732
                     875       7600
                     991       3200
Name: Employees, Length: 1000, dtype: int64
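
For comparison, here is a sketch of two ways to visit every group: an explicit `for` loop over the GroupBy object, and `apply` with a function that returns a **DataFrame**. The `companies` DataFrame and the helper `top_by_revenue` are hypothetical names introduced for illustration. Selecting the columns before calling `apply` is one way to keep the grouping column out of each group without relying on `include_groups`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Fortune 1000 data.
companies = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "McKesson",
                    "UnitedHealth Group", "Portland General Electric"],
        "Sector": ["Retailing", "Energy", "Health Care", "Health Care", "Energy"],
        "Revenue": [482130, 246204, 181241, 157107, 1898],
    },
    index=pd.Index([1, 2, 5, 6, 997], name="Rank"),
)

# Explicit iteration: each step yields the group label and its DataFrame.
top_rank_by_sector = {}
for sector, group in companies.groupby("Sector"):
    # idxmax returns the index label (the Rank) of the largest Revenue.
    top_rank_by_sector[sector] = group["Revenue"].idxmax()


def top_by_revenue(group: pd.DataFrame) -> pd.DataFrame:
    """Return the single row with the highest Revenue in the group."""
    return group.nlargest(1, "Revenue")


# apply calls the function once per group and stitches the results together,
# adding the group label as an outer index level.
top = companies.groupby("Sector")[["Company", "Revenue"]].apply(top_by_revenue)

print(top_rank_by_sector)
print(top)
```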