# Optimize Custom Aggregation Function
In this challenge, the goal is to find the fastest solution to the problem using only the pandas library.

### The Challenge
The `college_pop` dataset contains the name, state, and population of every higher-ed institution in the US and its territories. For each state, find the percentage of that state's total college population accounted for by its 5 largest colleges.
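
To make the target quantity concrete, here is a minimal sketch of the calculation for a single made-up state. The enrollment figures are hypothetical and do not come from `college_pop`; the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical enrollment figures for one made-up state
pops = pd.Series([100.0, 50.0, 25.0, 15.0, 5.0, 3.0, 2.0])

top5_total = pops.nlargest(5).sum()   # 100 + 50 + 25 + 15 + 5 = 195
state_total = pops.sum()              # 200
pct = top5_total / state_total * 100  # 97.5
```

A state with five or fewer colleges would score 100 under this definition, because its top 5 are all of its colleges.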

### Import pandas and build DataFrame from CSV file

In [1]:
import pandas as pd
college = pd.read_csv('https://raw.githubusercontent.com/DunderData/Pandas-Challenges/master/data/college_pop.csv')
college.head()

,name,state,pop
0,Alabama A & M University,AL,4206.0
1,University of Alabama at Birmingham,AL,11383.0
2,Amridge University,AL,291.0
3,University of Alabama in Huntsville,AL,5451.0
4,Alabama State University,AL,4811.0


### Basic info for the DataFrame

In [2]:
college.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7535 entries, 0 to 7534
Data columns (total 3 columns):
name     7535 non-null object
state    7535 non-null object
pop      6874 non-null float64
dtypes: float64(1), object(2)
memory usage: 176.7+ KB


### Basic summary stats

In [3]:
college.describe()

,pop
count,6874.0
mean,2356.83794
std,5474.275871
min,0.0
25%,117.0
50%,412.5
75%,1929.5
max,151558.0


### Remove null values for the 'pop' column

In [4]:
# Drop only the rows whose 'pop' value is missing
college = college.dropna(subset=['pop'])
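
As a side note, `dropna()` with no arguments drops a row if any column is null, while `subset` restricts the check to the named columns. In this dataset only `pop` has nulls, so both give the same rows. The sketch below shows the difference on hypothetical rows that are not from `college_pop`.

```python
import pandas as pd

# Hypothetical rows, not from college_pop
toy = pd.DataFrame({
    "name": ["A", None, "C"],
    "state": ["XX", "XX", "YY"],
    "pop": [10.0, 20.0, None],
})

all_cols = toy.dropna()                # drops rows 1 and 2 (a null in any column)
pop_only = toy.dropna(subset=["pop"])  # drops only row 2 (null 'pop')
```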

### Updated info for DataFrame

In [5]:
college.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6874 entries, 0 to 7163
Data columns (total 3 columns):
name     6874 non-null object
state    6874 non-null object
pop      6874 non-null float64
dtypes: float64(1), object(2)
memory usage: 214.8+ KB


### Data Wrangling

In [6]:
# I referenced "Python for Data Analysis" - Ch 10, pg 304
def top(df, n=5, column='pop'):
    return df.sort_values(by=column, ascending=False)[:n]

In [7]:
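# Select 'pop' before apply so only the numeric column is passed to `top`
st_top_5 = college.groupby('state')[['pop']].apply(top)
# Sum each state's top 5 over the 'state' index level; label the index 'ST'
st_top_5_total = st_top_5.groupby(level='state').sum().rename_axis('ST')
# Divide by each state's total population to get the top-5 share
decimal = st_top_5_total / college.groupby('state')[['pop']].sum()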
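
Because the challenge is about speed, it may be worth trying a variant that avoids `groupby.apply`, which calls the Python-level `top` function once per state. The sketch below sorts once and uses `groupby(...).head(5)` instead. This is an alternative technique, not the approach used in this notebook, and the `toy` frame is a hypothetical stand-in for `college`.

```python
import pandas as pd

# Hypothetical stand-in for `college`; the values are made up
toy = pd.DataFrame({
    "name": list("ABCDEFGH"),
    "state": ["XX"] * 7 + ["YY"],
    "pop": [100.0, 50.0, 25.0, 15.0, 5.0, 3.0, 2.0, 40.0],
})

# Sort once by population, then keep the first 5 rows of each state
top5 = toy.sort_values("pop", ascending=False).groupby("state").head(5)

# Sum the top-5 rows and all rows per state, then divide
top5_total = top5.groupby("state")["pop"].sum()
state_total = toy.groupby("state")["pop"].sum()
pct = top5_total / state_total * 100
```

Whether this is actually faster on the real data would need to be measured.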

### My solution:

In [8]:
percentages = decimal * 100
percentages.rename(columns={'pop': 'top_5_pct_of_total_college_pop'}, inplace=True)
percentages

ST,top_5_pct_of_total_college_pop
AK,96.157549
AL,37.076013
AR,42.267468
AS,100.0
AZ,55.148634
CA,7.655917
CO,37.846264
CT,29.667878
DC,75.505618
DE,85.531375
