## Group By

With Group By, we can split one dataframe into several separate dataframes, one per group, without creating each of them manually (comparing, assigning, etc.).
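As a rough illustration of the difference, the sketch below compares a manual boolean filter with a single groupby call. The `toy` dataframe and its values are made up for the example and are not taken from fortune1000.csv.

```python
import pandas as pd

# Hypothetical toy data (not from fortune1000.csv)
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D"],
    "Sector": ["Energy", "Retail", "Energy", "Retail"],
    "Revenue": [10, 20, 30, 40],
})

# Manual approach: one boolean filter has to be written per sector
energy_manual = toy[toy["Sector"] == "Energy"]

# Group By approach: every sector is split out in a single call
by_sector = toy.groupby("Sector")
energy_grouped = by_sector.get_group("Energy")

# Both approaches select the same companies
print(list(energy_manual["Company"]))
print(list(energy_grouped["Company"]))
print(len(by_sector))  # number of distinct sectors
```

With only two sectors the manual filter is manageable, but the groupby version stays one line no matter how many sectors the data contains.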

In [None]:
import pandas as pd

In [None]:
fortune = pd.read_csv("fortune1000.csv", index_col="Rank") # The 1,000 largest U.S. companies, ranked by revenue
sectors = fortune.groupby("Sector")
fortune.head(3)

In [None]:
fortune.groupby("Sector")

In [None]:
type(fortune)

In [None]:
type(sectors)

In [None]:
sectors

## The groupby() Method

In [None]:
len(fortune) # Number of rows
len(sectors) # Number of groups


In [None]:
fortune.Sector.nunique()

In [None]:
sectors.size()

The size of each group matches the output of value_counts(). The only difference is the ordering: value_counts() sorts by count (largest first), while size() sorts by group name.

In [None]:
fortune["Sector"].value_counts()
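The ordering difference between size() and value_counts() can be checked on a small example. This is a minimal sketch on a made-up `toy` dataframe, not on the Fortune data.

```python
import pandas as pd

# Hypothetical toy data (not from fortune1000.csv)
toy = pd.DataFrame(
    {"Sector": ["Retail", "Energy", "Retail", "Tech", "Retail", "Energy"]}
)

sizes = toy.groupby("Sector").size()    # ordered by group name
counts = toy["Sector"].value_counts()   # ordered by count, largest first

print(list(sizes.index))
print(list(counts.index))

# The numbers themselves are identical; only the order differs
same_numbers = sizes.to_dict() == counts.to_dict()
print(same_numbers)
```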

In [None]:
fortune.head(3)

first() returns the first row of each group (more precisely, the first non-null value of each column within the group).

In [None]:
sectors.first()
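The distinction between "first row" and "first non-null value" only matters when a group starts with missing data. The sketch below uses a made-up `toy` dataframe with a deliberate NaN to show it, and contrasts first() with head(1), which returns the actual first row of each group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing value in the first Energy row
toy = pd.DataFrame({
    "Sector": ["Energy", "Energy", "Retail"],
    "Profits": [np.nan, 5.0, 7.0],
})
grouped = toy.groupby("Sector")

first_values = grouped.first()  # first non-null Profits per sector
first_rows = grouped.head(1)    # first row per sector, NaN included

print(first_values)
print(first_rows)
```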

In [None]:
fortune.tail(3)

In [None]:
sectors.last()

In [None]:
sectors.groups

In [None]:
fortune.loc[24]
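The groups attribute maps each group name to the index labels of its rows, so any of those labels can be passed to .loc to look up the row. A minimal sketch on a made-up `toy` dataframe (the company names and ranks are hypothetical):

```python
import pandas as pd

# Hypothetical toy data indexed by Rank, mirroring the layout of fortune
toy = pd.DataFrame(
    {
        "Company": ["A", "B", "C", "D"],
        "Sector": ["Energy", "Retail", "Energy", "Retail"],
    },
    index=pd.Index([1, 2, 3, 4], name="Rank"),
)

groups = toy.groupby("Sector").groups    # group name -> index labels
energy_labels = list(groups["Energy"])   # Rank values of the Energy rows

print(energy_labels)
print(toy.loc[energy_labels])            # the rows behind those labels
```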

## Retrieve a Group with the .get_group() Method

In [None]:
fortune = pd.read_csv("fortune1000.csv", index_col="Rank") # The 1,000 largest U.S. companies, ranked by revenue
sectors = fortune.groupby("Sector")
fortune.head(3)

In [None]:
sectors.get_group("Energy")
sectors.get_group("Technology")
sectors.get_group("Apparel")
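get_group() raises a KeyError when the requested group does not exist. One possible way to guard against that is to check the groups mapping first. The helper name `safe_get_group` and the `toy` data below are assumptions made for this sketch, not part of pandas or of the original notebook.

```python
import pandas as pd

# Hypothetical toy data (not from fortune1000.csv)
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Sector": ["Energy", "Retail", "Energy"],
})
by_sector = toy.groupby("Sector")


def safe_get_group(grouped, key):
    """Return the group for `key`, or None if there is no such group."""
    if key in grouped.groups:
        return grouped.get_group(key)
    return None


energy = safe_get_group(by_sector, "Energy")    # a dataframe with 2 rows
missing = safe_get_group(by_sector, "Apparel")  # None, no KeyError raised
print(energy)
print(missing)
```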

## Methods on the Groupby Object and Dataframe Columns

In [None]:
fortune = pd.read_csv("fortune1000.csv", index_col="Rank") # The 1,000 largest U.S. companies, ranked by revenue
sectors = fortune.groupby("Sector")
fortune.head(3)

In [None]:
sectors.max()

In [None]:
sectors.min()

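sum() and mean() only make sense for numeric columns. Older pandas versions dropped the non-numeric columns automatically; since pandas 2.0 they have to be excluded explicitly with numeric_only=True.

In [None]:
sectors.sum(numeric_only=True)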

In [None]:
sectors["Revenue"].sum()

In [None]:
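sectors.get_group("Apparel")["Revenue"].sum() # Same value as the "Apparel" entry of sectors["Revenue"].sum()
# "Apparel" row of the sectors.sum() output: Revenue 95968, Profits 8236, Employees 346397
sectors.get_group("Apparel")["Revenue"].mean()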


In [None]:
sectors.mean(numeric_only=True) # Without numeric_only=True, pandas 2.0+ raises a TypeError on the text columns

In [None]:
sectors["Employees"].sum()

In [None]:
sectors["Profits"].max()
sectors["Profits"].min()
sectors["Employees"].mean()

sectors[["Profits", "Revenue"]].sum()
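The cells above select columns from the groupby object before aggregating. As a small sketch of what each form returns (on a made-up `toy` dataframe, not the Fortune data): single brackets give a Series, double brackets give a dataframe, and numeric_only=True aggregates every numeric column while skipping text columns.

```python
import pandas as pd

# Hypothetical toy data (not from fortune1000.csv)
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D"],
    "Sector": ["Energy", "Retail", "Energy", "Retail"],
    "Revenue": [10, 20, 30, 40],
    "Profits": [1, 2, 3, 4],
})
by_sector = toy.groupby("Sector")

revenue_total = by_sector["Revenue"].sum()               # Series, one value per sector
both_totals = by_sector[["Profits", "Revenue"]].sum()    # DataFrame with two columns
numeric_totals = by_sector.sum(numeric_only=True)        # all numeric columns, Company skipped

print(type(revenue_total).__name__)
print(type(both_totals).__name__)
print(list(numeric_totals.columns))
```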


## Grouping by Multiple Columns

In [None]:
fortune = pd.read_csv("fortune1000.csv", index_col="Rank") # The 1,000 largest U.S. companies, ranked by revenue
sectors = fortune.groupby(["Sector", "Industry"])
fortune.head(3)

Grouping by two columns makes size() return a Series with a MultiIndex (Sector, Industry).

In [None]:
sectors.size()

In [None]:
sectors.sum(numeric_only=True) # numeric_only=True keeps pandas 2.0+ from concatenating the text columns

In [None]:
sectors["Revenue"].sum()
sectors["Employees"].mean()
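Results grouped by two columns carry a two-level index, which changes how values are looked up. The sketch below, on a made-up `toy` dataframe with hypothetical sector and industry names, shows lookups by full key and by outer level, and how reset_index() turns the levels back into ordinary columns.

```python
import pandas as pd

# Hypothetical toy data (not from fortune1000.csv)
toy = pd.DataFrame({
    "Sector": ["Energy", "Energy", "Energy", "Retail"],
    "Industry": ["Mining", "Refining", "Mining", "Apparel"],
    "Revenue": [10, 20, 30, 40],
})

revenue = toy.groupby(["Sector", "Industry"])["Revenue"].sum()

one_value = revenue.loc[("Energy", "Mining")]  # full (Sector, Industry) key
energy_only = revenue.loc["Energy"]            # all industries within one sector
flat = revenue.reset_index()                   # index levels become columns again

print(one_value)
print(energy_only)
print(flat)
```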

## The .agg() Method

agg() performs several aggregations on a groupby object in a single call, for example a different operation for each column.

In [None]:
fortune = pd.read_csv("fortune1000.csv", index_col="Rank") # The 1,000 largest U.S. companies, ranked by revenue
sectors = fortune.groupby("Sector")
fortune.head(3)

In [None]:
sectors["Employees"].mean()

In [None]:
sectors.agg({
            "Revenue" : "sum",
            "Profits" : "sum",
            "Employees" : "mean"
            })

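Applying more than one operation to each numeric column (the columns are selected explicitly because "mean" fails on text columns in pandas 2.0+)

In [None]:
sectors[["Revenue", "Profits", "Employees"]].agg(["size", "sum", "mean"])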

In [None]:
sectors.agg({
            "Revenue" : ["sum", "mean"],
            "Profits" : "sum",
            "Employees" : "mean"
            })
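When a dictionary maps one column to a list of operations, the result gets two-level column names. If flat, self-describing column names are preferred, pandas also supports named aggregation, where each keyword argument names an output column. The sketch below uses a made-up `toy` dataframe; the output column names are chosen for the example.

```python
import pandas as pd

# Hypothetical toy data (not from fortune1000.csv)
toy = pd.DataFrame({
    "Sector": ["Energy", "Retail", "Energy", "Retail"],
    "Revenue": [10, 20, 30, 40],
    "Employees": [100, 200, 300, 400],
})

# Each keyword is an output column: (input column, operation)
summary = toy.groupby("Sector").agg(
    total_revenue=("Revenue", "sum"),
    mean_revenue=("Revenue", "mean"),
    mean_employees=("Employees", "mean"),
)

print(summary)
```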

## Iterating Through Groups

In [65]:
fortune = pd.read_csv("fortune1000.csv", index_col="Rank") # The 1,000 largest U.S. companies, ranked by revenue
sectors = fortune.groupby("Sector")
fortune.head(3)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000


Creating an empty DataFrame to append data to

In [68]:
df = pd.DataFrame(columns = fortune.columns)
df

,Company,Sector,Industry,Location,Revenue,Profits,Employees


In [70]:
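for sector, data in sectors:
    # data is the sub-dataframe holding all rows of the current sector
    highest_revenue_company_in_group = data.nlargest(1, "Revenue")
    # DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job
    df = pd.concat([df, highest_revenue_company_in_group])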

In [71]:
df

,Company,Sector,Industry,Location,Revenue,Profits,Employees
24,Boeing,Aerospace & Defense,Aerospace and Defense,"Chicago, IL",96114,5176,161400
91,Nike,Apparel,Apparel,"Beaverton, OR",30601,3273,62600
144,ManpowerGroup,Business Services,Temporary Help,"Milwaukee, WI",19330,419,27000
56,Dow Chemical,Chemicals,Chemicals,"Midland, MI",48778,7685,49495
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
155,Fluor,Engineering & Construction,"Engineering, Construction","Irving, TX",18114,413,38758
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
7,CVS Health,Food and Drug Stores,Food and Drug Stores,"Woonsocket, RI",153290,5237,199000
41,Archer Daniels Midland,"Food, Beverages & Tobacco",Food Production,"Chicago, IL",67702,1849,32300
5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400
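Growing a dataframe one row at a time inside the loop copies it on every iteration. Two possible alternatives are sketched below on a made-up `toy` dataframe (not the Fortune data): collecting the per-group rows in a list and concatenating once, or skipping the loop entirely with idxmax(), which returns the index label of the largest value in each group.

```python
import pandas as pd

# Hypothetical toy data indexed by Rank, mirroring the layout of fortune
toy = pd.DataFrame(
    {
        "Company": ["A", "B", "C", "D"],
        "Sector": ["Retail", "Energy", "Energy", "Retail"],
        "Revenue": [50, 40, 30, 20],
    },
    index=pd.Index([1, 2, 3, 4], name="Rank"),
)
by_sector = toy.groupby("Sector")

# Option 1: iterate, collect the rows, concatenate once at the end
top_rows = [data.nlargest(1, "Revenue") for _, data in by_sector]
top_by_loop = pd.concat(top_rows)

# Option 2: no loop, look up the row labels of the per-sector maximum
top_by_idxmax = toy.loc[by_sector["Revenue"].idxmax()]

print(top_by_loop)
print(top_by_idxmax)
```

Both options return one row per sector, ordered by sector name. Note that the two can differ when a group contains ties or missing revenue values, so this is a sketch rather than a drop-in replacement.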


In [73]:
cities = fortune.groupby("Location")
df = pd.DataFrame(columns = fortune.columns)
cities

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002575281A910>