# The GroupBy Object

In [1]:
import pandas as pd

## The Fortune 1000 Dataset
- The **Fortune 1000** is a listing of the 1000 largest American companies as ranked by Fortune magazine.
- The **DataFrame** includes each company's name, sector, industry, revenue, profits, and number of employees.

In [22]:
fortune = pd.read_csv("fortune1000.csv", index_col="Rank")

## The groupby Method
- **Grouping** organizes the rows of a **DataFrame** into categories based on a column's values.
- The `groupby` method returns a **DataFrameGroupBy** object. It behaves like a dictionary-like collection of **DataFrames**, one for each distinct value in the grouping column.
- The **DataFrameGroupBy** object can perform aggregate operations on *each* group within it.
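A minimal sketch of what `groupby` returns. The `sample` DataFrame here is a hand-built stand-in that reuses a few rows from this notebook's output; it is an assumption for illustration, not the full `fortune1000.csv` file.

```python
import pandas as pd

# Hand-built stand-in for the full dataset (rows copied from this notebook's output).
sample = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Chevron"],
        "Sector": ["Retailing", "Energy", "Technology", "Energy"],
        "Revenue": [482130, 246204, 233715, 131118],
    },
    index=pd.Index([1, 2, 3, 14], name="Rank"),
)

groups = sample.groupby("Sector")

print(type(groups).__name__)  # class name of the returned object
print(groups.ngroups)         # number of distinct sectors
print(groups.size())          # number of rows in each sector
```

Nothing is computed when `groupby` is called; the aggregation happens only once a method such as `size` or `max` is invoked on the grouped object.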

In [7]:
fortune

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400
...,...,...,...,...,...,...
996,New York Community Bancorp,Financials,Commercial Banks,1902,-47,3448
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
999,Wendy’s,"Hotels, Resturants & Leisure",Food Services,1896,161,21200


In [23]:
sectors = fortune.groupby("Sector")

In [19]:
sectors.max()

Sector,Company,Industry,Revenue,Profits,Employees
Aerospace & Defense,Woodward,Aerospace and Defense,96114,7608,197200
Apparel,Wolverine World Wide,Apparel,30601,3273,65300
Business Services,Western Union,Waste Management,19330,6328,216500
Chemicals,Westlake Chemical,Chemicals,48778,7685,52000
Energy,Xcel Energy,Utilities: Gas and Electric,246204,16150,75600
Engineering & Construction,Tutor Perini,Homebuilders,18114,803,92000
Financials,Zions Bancorp.,Securities,210821,24442,331000
Food and Drug Stores,Whole Foods Market,Food and Drug Stores,153290,5237,431000
"Food, Beverages & Tobacco",WhiteWave Foods,Tobacco,67702,7351,263000
Health Care,inVentiv Health,Wholesalers: Health Care,181241,18108,203500


## Retrieve a Group with the get_group Method
- The `get_group` method on the **DataFrameGroupBy** object retrieves the **DataFrame** that belongs to one specific group, identified by its key (for example, a sector name).
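As a rough illustration, the sketch below pulls one sector out of a grouped object. The `sample` DataFrame is a hand-built assumption that reuses rows from this notebook's output in place of the full dataset.

```python
import pandas as pd

# Hand-built stand-in for the full dataset (rows copied from this notebook's output).
sample = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Chevron"],
        "Sector": ["Retailing", "Energy", "Technology", "Energy"],
        "Revenue": [482130, 246204, 233715, 131118],
    },
    index=pd.Index([1, 2, 3, 14], name="Rank"),
)

# get_group takes the group's key, i.e. a value from the grouping column.
energy = sample.groupby("Sector").get_group("Energy")
print(energy)
```

The key must match a value in the grouping column exactly; a key that does not exist raises a `KeyError`.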

## Methods on the GroupBy Object
- Use square brackets on the **DataFrameGroupBy** object to "extract" a column from the original **DataFrame**.
- The resulting **SeriesGroupBy** object will have aggregation methods available on it.
- Pandas will perform the calculation on *every* group within the collection.
- For example, the `sum` method adds up the **Revenue** values of all rows in each group/category.
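The points above can be sketched as follows. The `sample` DataFrame is a hand-built assumption that reuses rows from this notebook's output, not the full dataset.

```python
import pandas as pd

# Hand-built stand-in for the full dataset (rows copied from this notebook's output).
sample = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Chevron"],
        "Sector": ["Retailing", "Energy", "Technology", "Energy"],
        "Revenue": [482130, 246204, 233715, 131118],
    },
    index=pd.Index([1, 2, 3, 14], name="Rank"),
)

# Square brackets on the grouped object select a column and give a SeriesGroupBy.
revenue_groups = sample.groupby("Sector")["Revenue"]
print(type(revenue_groups).__name__)

# Each aggregation runs once per sector and returns a Series indexed by sector.
revenue_sum = revenue_groups.sum()
revenue_mean = revenue_groups.mean()
print(revenue_sum)
print(revenue_mean)
```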

## Grouping by Multiple Columns
- Pass a list of columns to the `groupby` method to group by pairings of values across columns.
- Target a column to retrieve the **SeriesGroupBy** object, then perform an aggregation with a method.
- Pandas will return a **Series** with a **MultiIndex** whose levels correspond to the grouping columns.
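A small sketch of grouping by two columns. The `sample` DataFrame is a hand-built assumption that reuses rows from this notebook's output, not the full dataset.

```python
import pandas as pd

# Hand-built stand-in for the full dataset (rows copied from this notebook's output).
sample = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Chevron", "Portland General Electric"],
        "Sector": ["Retailing", "Energy", "Energy", "Energy"],
        "Industry": [
            "General Merchandisers",
            "Petroleum Refining",
            "Petroleum Refining",
            "Utilities: Gas and Electric",
        ],
        "Revenue": [482130, 246204, 131118, 1898],
    },
    index=pd.Index([1, 2, 14, 997], name="Rank"),
)

# One group per (Sector, Industry) pairing that occurs in the data.
totals = sample.groupby(["Sector", "Industry"])["Revenue"].sum()
print(totals)

# Index the MultiIndex Series with a tuple of (sector, industry).
print(totals.loc[("Energy", "Petroleum Refining")])
```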

In [38]:
fortune.nlargest(2, "Employees")

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
218,Yum Brands,"Hotels, Resturants & Leisure",Food Services,13105,1293,505000


## The agg Method
- The `agg` method applies different aggregation methods to different columns in a single call.
- Invoke the `agg` method directly on the **DataFrameGroupBy** object.
- Pass the method a dictionary where the keys are the columns and the values are the aggregation operations.
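The dictionary form described above might look like this. The `sample` DataFrame is a hand-built assumption that reuses rows from this notebook's output, and the choice of operations per column is only illustrative.

```python
import pandas as pd

# Hand-built stand-in for the full dataset (rows copied from this notebook's output).
sample = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Chevron"],
        "Sector": ["Retailing", "Energy", "Energy"],
        "Revenue": [482130, 246204, 131118],
        "Profits": [14694, 16150, 4587],
        "Employees": [2300000, 75600, 61500],
    },
    index=pd.Index([1, 2, 14], name="Rank"),
)

# Keys are column names; values name the aggregation to run on that column.
summary = sample.groupby("Sector").agg(
    {"Revenue": "sum", "Profits": "max", "Employees": "mean"}
)
print(summary)
```

The result has one row per sector and one column per dictionary key.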

In [52]:
sectors.apply(func=lambda df: df.nlargest(2, "Revenue"))

Sector,Rank,Company,Sector,Industry,Revenue,Profits,Employees
Aerospace & Defense,24,Boeing,Aerospace & Defense,Aerospace and Defense,96114,5176,161400
Aerospace & Defense,45,United Technologies,Aerospace & Defense,Aerospace and Defense,61047,7608,197200
Apparel,91,Nike,Apparel,Apparel,30601,3273,62600
Apparel,231,VF,Apparel,Apparel,12377,1232,64000
Business Services,144,ManpowerGroup,Business Services,Temporary Help,19330,419,27000
Business Services,186,Omnicom Group,Business Services,"Advertising, marketing",15134,1094,74900
Chemicals,56,Dow Chemical,Chemicals,Chemicals,48778,7685,49495
Chemicals,101,DuPont,Chemicals,Chemicals,27940,1953,52000
Energy,2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
Energy,14,Chevron,Energy,Petroleum Refining,131118,4587,61500


In [50]:
import numpy as np

In [51]:
# np.product was removed in NumPy 2.0; np.prod is the supported name.
# The product of a sector's revenues overflows int64, so the values below
# have wrapped around and are not true products.
sectors.agg(revenue=pd.NamedAgg("Revenue", np.prod))

Sector,revenue
Aerospace & Defense,3341121054704140288
Apparel,1780803565882703872
Business Services,5620492334958379008
Chemicals,2861219657572941824
Energy,0
Engineering & Construction,2723841810714591232
Financials,0
Food and Drug Stores,-6918596647498661888
"Food, Beverages & Tobacco",-9151314442816847872
Health Care,-9223372036854775808


## Iterating through Groups 
- The **DataFrameGroupBy** object supports the `apply` method (just like a **Series** and a **DataFrame** do).
- The `apply` method invokes a function on every nested **DataFrame** in the **DataFrameGroupBy** object.
- It captures the return value of each function call and combines the results into a new **DataFrame**, which `apply` returns.
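The sketch below shows two ways to visit every group: a plain `for` loop over the grouped object, and `apply`. The `sample` DataFrame is a hand-built assumption that reuses rows from this notebook's output. Selecting the columns before calling `apply` is a choice made here so that the function never touches the grouping column, which some pandas versions warn about.

```python
import pandas as pd

# Hand-built stand-in for the full dataset (rows copied from this notebook's output).
sample = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Chevron"],
        "Sector": ["Retailing", "Energy", "Technology", "Energy"],
        "Revenue": [482130, 246204, 233715, 131118],
    },
    index=pd.Index([1, 2, 3, 14], name="Rank"),
)

# Iterating yields (group key, DataFrame for that group) pairs.
sizes = {}
for sector, frame in sample.groupby("Sector"):
    sizes[sector] = len(frame)
print(sizes)

# apply calls the function once per group and stitches the results together.
top = sample.groupby("Sector")[["Company", "Revenue"]].apply(
    lambda frame: frame.nlargest(1, "Revenue")
)
print(top)
```

The result of `apply` is indexed by the group key first and by the original `Rank` index second.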