#### <b>Compiled by Kevin Mugo Mwaniki</b>
#### <b>Contact: +254726279800</b>
#### <b>Email: kevmwaniki254@gmail.com</b>


# <u>Chapter 6: Index Alignment</u>
## <u>Recipes</u>
* [Examining the Index object](#Examining-the-index)
* [Producing Cartesian products](#Producing-Cartesian-products)
* [Exploding indexes](#Exploding-Indexes)
* [Filling values with unequal indexes](#Filling-values-with-unequal-indexes)
* [Appending columns from different DataFrames](#Appending-columns-from-different-DataFrames)
* [Highlighting the maximum value from each column](#Highlighting-maximum-value-from-each-column)
* [Replicating idxmax with method chaining](#Replicating-idxmax-with-method-chaining)
* [Finding the most common maximum](#Finding-the-most-common-maximum)

## <u>Introduction</u>
<p>When multiple Series or DataFrames are combined in some way, each dimension of the data automatically aligns on each axis 
before any computation happens. This silent and automatic alignment of axes can cause tremendous confusion for the 
uninitiated, but it gives great flexibility to the power user. This chapter explores the Index object in depth before 
showcasing a variety of recipes that take advantage of its automatic alignment.</p>
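
As a minimal sketch of the alignment described above, the two hand-made Series below (illustrative values, not taken from the chapter's datasets) hold the same labels in a different order. Addition pairs values by label rather than by position:

```python
import pandas as pd

# Two illustrative Series with the same labels in a different order
left = pd.Series([1, 2], index=['a', 'b'])
right = pd.Series([10, 20], index=['b', 'a'])

# Values are matched on their index labels before the addition happens
total = left + right
print(total)
```

Label `a` pairs 1 with 20 and label `b` pairs 2 with 10, whatever the row order.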

## <u>Examining the Index object</u>
<p>Each axis of Series and DataFrames has an Index object that labels the values. There are many different types of Index objects, but they all share common behavior. All Index objects, except for the special MultiIndex, are single-dimensional data structures that combine the functionality and implementation of Python sets and NumPy ndarrays.</p>
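
The dual nature of an Index can be sketched on a small hand-made example. The labels below are illustrative and only borrow the naming style of the college columns:

```python
import pandas as pd

# A small illustrative Index; the labels are chosen only for this example
idx = pd.Index(['CITY', 'STABBR', 'HBCU', 'UGDS'])

# Array-like behaviour: positional selection and vectorised operations
first = idx[0]                # scalar selection by integer position
subset = idx[[0, 2]]          # list selection returns a new Index
lowered = idx.str.lower()     # vectorised string method

# Set-like behaviour: membership tests and set operations
has_city = 'CITY' in idx
common = idx.intersection(pd.Index(['HBCU', 'UGDS', 'PCTPELL']))

print(first, list(subset), list(lowered), has_city, sorted(common))
```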

In [127]:
import pandas as pd
import numpy as np
#We will explore the columns Index of the college dataset to understand its functionality
college = pd.read_csv('data/college.csv')

In [128]:
#Assign the index to a variable then output it:
columns = college.columns
columns

Index(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL',
       'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS', 'UGDS_WHITE',
       'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI',
       'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF', 'CURROPER', 'PCTPELL',
       'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP'],
      dtype='object')

In [129]:
college.head()

,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


In [3]:
#Use the values attribute to access the underlying NumPy array:
columns.values

array(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY',
       'RELAFFIL', 'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS',
       'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
       'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF',
       'CURROPER', 'PCTPELL', 'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10',
       'GRAD_DEBT_MDN_SUPP'], dtype=object)

In [123]:
type(columns)

pandas.core.indexes.base.Index

In [5]:
#Select items from the index by integer location with scalars, lists, or slices:
columns[5]

'WOMENONLY'

In [6]:
columns[[1, 8, 10]]

Index(['CITY', 'SATMTMID', 'UGDS'], dtype='object')

In [131]:
college.dtypes

INSTNM                 object
CITY                   object
STABBR                 object
HBCU                  float64
MENONLY               float64
WOMENONLY             float64
RELAFFIL                int64
SATVRMID              float64
SATMTMID              float64
DISTANCEONLY          float64
UGDS                  float64
UGDS_WHITE            float64
UGDS_BLACK            float64
UGDS_HISP             float64
UGDS_ASIAN            float64
UGDS_AIAN             float64
UGDS_NHPI             float64
UGDS_2MOR             float64
UGDS_NRA              float64
UGDS_UNKN             float64
PPTUG_EF              float64
CURROPER                int64
PCTPELL               float64
PCTFLOAN              float64
UG25ABV               float64
MD_EARN_WNE_P10        object
GRAD_DEBT_MDN_SUPP     object
dtype: object

In [7]:
#Indexes share many methods with Series and DataFrames:
columns.min(), columns.max(), columns.isnull().sum()

('CITY', 'WOMENONLY', 0)

In [9]:
#Basic arithmetic and comparison on Index objects:
columns + '_A'

Index(['INSTNM_A', 'CITY_A', 'STABBR_A', 'HBCU_A', 'MENONLY_A', 'WOMENONLY_A',
       'RELAFFIL_A', 'SATVRMID_A', 'SATMTMID_A', 'DISTANCEONLY_A', 'UGDS_A',
       'UGDS_WHITE_A', 'UGDS_BLACK_A', 'UGDS_HISP_A', 'UGDS_ASIAN_A',
       'UGDS_AIAN_A', 'UGDS_NHPI_A', 'UGDS_2MOR_A', 'UGDS_NRA_A',
       'UGDS_UNKN_A', 'PPTUG_EF_A', 'CURROPER_A', 'PCTPELL_A', 'PCTFLOAN_A',
       'UG25ABV_A', 'MD_EARN_WNE_P10_A', 'GRAD_DEBT_MDN_SUPP_A'],
      dtype='object')

In [10]:
#Indexes are immutable, so changing a value after creation fails with a TypeError:
#columns[1] = 'city'

In [135]:
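#For comparison with built-in types: a list is mutable, whereas a tuple, like an Index, is immutable
l = [66, 65, 64]
t = (66, 65, 64)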

In [134]:
type(l)

list

In [12]:
#Indexes support set operations: union, intersection, difference, and symmetric difference:
c1 = columns[:4]
c1

Index(['INSTNM', 'CITY', 'STABBR', 'HBCU'], dtype='object')

In [None]:
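l1 = [66, 65, 64]
l2 = [66, 65, 67]

#Union: every distinct value found in either list
union = sorted(set(l1) | set(l2))                 #[64, 65, 66, 67]

#Intersection: values found in both lists
intersection = sorted(set(l1) & set(l2))          #[65, 66]

#Difference: values in l1 that are not in l2
difference = sorted(set(l1) - set(l2))            #[64]

#Symmetric difference: values found in exactly one of the lists
symmetric_difference = sorted(set(l1) ^ set(l2))  #[64, 67]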

In [13]:
c2 = columns[2 : 6]
c2

Index(['STABBR', 'HBCU', 'MENONLY', 'WOMENONLY'], dtype='object')

In [14]:
c1.union(c2) #Older pandas versions also accepted c1 | c2; newer releases deprecate the operator form for set operations

Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR', 'WOMENONLY'], dtype='object')

In [15]:
c1.symmetric_difference(c2) #Older pandas versions also accepted c1 ^ c2; newer releases deprecate the operator form for set operations

Index(['CITY', 'INSTNM', 'MENONLY', 'WOMENONLY'], dtype='object')

<p>Indexes share some operations with Python sets, and they resemble sets in another important way: they are (usually) 
implemented using hash tables, which make for extremely fast access when selecting rows or columns from a DataFrame. 
Because of this implementation, the values of an Index object must be immutable (hashable), such as a string, integer, 
or tuple, just like the keys of a Python dictionary. Indexes do support duplicate values, but if any Index contains a 
duplicate, a hash table can no longer be used for its implementation, and object access becomes much slower.
</p>
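
The immutability and duplicate-label behaviour described above can be checked with a short sketch. The two Index objects below are made up for illustration:

```python
import pandas as pd

unique_idx = pd.Index(['a', 'b', 'c'])
dup_idx = pd.Index(['a', 'a', 'b'])

# Index objects reject item assignment, much like a tuple does
try:
    unique_idx[0] = 'z'
    mutation_failed = False
except TypeError:
    mutation_failed = True

# is_unique reports whether every label occurs exactly once
print(mutation_failed, unique_idx.is_unique, dup_idx.is_unique)

# Selecting a duplicated label returns every matching row
s = pd.Series([10, 20, 30], index=dup_idx)
matches = s.loc['a'].tolist()
print(matches)
```

Checking `is_unique` before combining objects is a cheap way to anticipate both slower lookups and the Cartesian products covered in the next recipe.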

## <u>Producing Cartesian products</u>
<p>Whenever two or more Series or DataFrames operate with one another, their indexes (for both rows and columns) align 
before any computation is carried out. A Cartesian product is silently created between the index labels unless the indexes 
are unique or contain exactly the same labels in the same order. Here, we will add together two Series whose indexes 
overlap but are not identical, which produces a surprising result.</p>

In [16]:
s1 = pd.Series(index = list('aaab'), data = np.arange(4))
s1

a    0
a    1
a    2
b    3
dtype: int32

In [139]:
s2 = pd.Series(index = list('cababb'), data = np.arange(6))
s2

c    0
a    1
b    2
a    3
b    4
b    5
dtype: int32

In [18]:
#Add the two Series together to form a cartesian product:
s1 + s2

a    1.0
a    3.0
a    2.0
a    4.0
a    3.0
a    5.0
b    5.0
b    7.0
b    8.0
c    NaN
dtype: float64

In [19]:
#When the indexes hold the same labels in the same order, a Cartesian product does not occur:
s1 = pd.Series(index = list('aaabb'), data = np.arange(5))
s2 = pd.Series(index = list('aaabb'), data = np.arange(5))

In [20]:
s1 + s2

a    0
a    2
a    4
b    6
b    8
dtype: int32

In [21]:
#When the indexes hold the same labels in a different order, a Cartesian product occurs:
s1 = pd.Series(index = list('aaabb'), data = np.arange(5))
s2 = pd.Series(index = list('bbaaa'), data = np.arange(5))

In [22]:
s1 + s2

a    2
a    3
a    4
a    3
a    4
a    5
a    4
a    5
a    6
b    3
b    4
b    4
b    5
dtype: int32
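
Whether two indexes will align one to one can be tested up front. The sketch below reuses the label patterns from the cells above and relies on `Index.equals`, which is true only when both indexes hold the same labels in the same order:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(5), index=list('aaabb'))
same_order = pd.Series(np.arange(5), index=list('aaabb'))
other_order = pd.Series(np.arange(5), index=list('bbaaa'))

# equals compares the labels position by position
aligned = s1.index.equals(same_order.index)
misaligned = s1.index.equals(other_order.index)
print(aligned, misaligned)

# Identical indexes keep the length; reordered duplicates multiply it
print(len(s1 + same_order), len(s1 + other_order))
```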

## <u>Exploding indexes</u>
<p>The process above can produce comically different results when dealing with larger data. Here, we will add two Series
whose indexes share a few unique values but hold them in different orders. The result explodes the number of values in the 
index.</p>

In [23]:
#Read in the employee data and set the index equal to the race column:
employee = pd.read_csv('data/employee.csv', index_col = 'RACE')

In [24]:
employee.head()

RACE,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
Hispanic/Latino,0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Full Time,Female,Active,2006-06-12,2012-10-13
Hispanic/Latino,1,LIBRARY ASSISTANT,Library,26125.0,Full Time,Female,Active,2000-07-19,2010-09-18
White,2,POLICE OFFICER,Houston Police Department-HPD,45279.0,Full Time,Male,Active,2015-02-03,2015-02-03
White,3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,Full Time,Male,Active,1982-02-08,1991-05-25
White,4,ELECTRICIAN,General Services Department,56347.0,Full Time,Male,Active,1989-06-19,1994-10-22


In [25]:
#Select BASE_SALARY twice and check whether both names refer to the same object
salary1 = employee['BASE_SALARY']
salary2 = employee['BASE_SALARY']

In [140]:
salary1

RACE
American Indian or Alaskan Native    78355.0
American Indian or Alaskan Native    26125.0
American Indian or Alaskan Native    98536.0
American Indian or Alaskan Native        NaN
American Indian or Alaskan Native    55461.0
                                      ...   
NaN                                  40000.0
NaN                                  28024.0
NaN                                  28766.0
NaN                                      NaN
NaN                                  28024.0
Name: BASE_SALARY, Length: 2000, dtype: float64

In [141]:
salary2

RACE
Hispanic/Latino              121862.0
Hispanic/Latino               26125.0
White                         45279.0
White                         63166.0
White                         56347.0
                               ...   
White                         43443.0
Black or African American     66523.0
White                         43443.0
Asian/Pacific Islander        55461.0
Hispanic/Latino               51194.0
Name: BASE_SALARY, Length: 2000, dtype: float64

In [26]:
salary1 is salary2

True

In [27]:
#Because both names refer to the same object, any change to one Series affects the other
#To receive a brand new copy of the data, use the copy method:
#Each copy is then a separate object
salary1 = employee.BASE_SALARY.copy()
salary2 = employee.BASE_SALARY.copy()
salary1 is salary2

False

In [28]:
#Let us change the order of the index by sorting it:
salary1 = salary1.sort_index()

In [29]:
salary1.index.value_counts()

Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
American Indian or Alaskan Native     11
Others                                 2
Name: RACE, dtype: int64

In [142]:
employee.index.value_counts()  #RACE is the index of employee here, so count its labels directly

Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
American Indian or Alaskan Native     11
Others                                 2
Name: RACE, dtype: int64

In [30]:
salary1.head()

RACE
American Indian or Alaskan Native    78355.0
American Indian or Alaskan Native    26125.0
American Indian or Alaskan Native    98536.0
American Indian or Alaskan Native        NaN
American Indian or Alaskan Native    55461.0
Name: BASE_SALARY, dtype: float64

In [31]:
salary2.head()

RACE
Hispanic/Latino    121862.0
Hispanic/Latino     26125.0
White               45279.0
White               63166.0
White               56347.0
Name: BASE_SALARY, dtype: float64

In [153]:
salary1.index[0:6]

Index(['American Indian or Alaskan Native',
       'American Indian or Alaskan Native',
       'American Indian or Alaskan Native',
       'American Indian or Alaskan Native',
       'American Indian or Alaskan Native',
       'American Indian or Alaskan Native'],
      dtype='object', name='RACE')

In [154]:
salary2.index[0:6]

Index(['Hispanic/Latino', 'Hispanic/Latino', 'White', 'White', 'White',
       'Black or African American'],
      dtype='object', name='RACE')

In [32]:
#Add the two salary Series together:
salary_add = salary1 + salary2
salary_add.head()

RACE
American Indian or Alaskan Native    138702.0
American Indian or Alaskan Native    156710.0
American Indian or Alaskan Native    176891.0
American Indian or Alaskan Native    159594.0
American Indian or Alaskan Native    127734.0
Name: BASE_SALARY, dtype: float64

In [146]:
salary_add.shape

(1175424,)

In [147]:
salary1.shape

(2000,)

In [148]:
salary2.shape

(2000,)

In [33]:
#Adding salary1 to itself keeps 2,000 values because the indexes are identical, whereas salary1 + salary2 explodes from 2,000 to over 1 million:
salary_add1 = salary1 + salary1
len(salary1), len(salary2), len(salary_add), len(salary_add1)

(2000, 2000, 1175424, 2000)

In [157]:
#The size of the result can be verified by summing the squares of the individual index label counts.
#Even missing values in the index produce Cartesian products with themselves:
index_vc = salary1.index.value_counts(dropna = False)
index_vc

Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
NaN                                   35
American Indian or Alaskan Native     11
Others                                 2
Name: RACE, dtype: int64

In [35]:
index_vc.pow(2).sum()

1175424
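
The sum of squares works here because both Series carry the same label counts. A more general sketch follows, using a hypothetical helper named `expected_length` that is not part of pandas and assumes neither index contains missing labels. Each label present in both indexes contributes the product of its two counts, and a label present in only one index keeps its own rows:

```python
from collections import Counter

import numpy as np
import pandas as pd


def expected_length(left, right):
    """Predict len(left + right) from the label counts of both indexes.

    Hypothetical helper; assumes neither index contains missing labels.
    """
    # Identical indexes (same labels, same order) align one to one
    if left.index.equals(right.index):
        return len(left)
    left_counts, right_counts = Counter(left.index), Counter(right.index)
    total = 0
    for label in left_counts.keys() | right_counts.keys():
        n_left, n_right = left_counts[label], right_counts[label]
        # A shared label contributes every pairing; an unshared label
        # keeps its own rows (they hold NaN in the result)
        total += n_left * n_right if n_left and n_right else n_left + n_right
    return total


s1 = pd.Series(np.arange(4), index=list('aaab'))
s2 = pd.Series(np.arange(6), index=list('cababb'))
print(expected_length(s1, s2), len(s1 + s2))
```

For the two small Series from the Cartesian products recipe, label `a` contributes 3 × 2 rows, `b` contributes 1 × 3, and `c` contributes 1, which matches the ten rows shown earlier.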

## <u>Filling values with unequal indexes</u>
<p>When two Series are added together with the plus operator and an index label in one is absent from the other, the result 
holds a missing value for that label. The add method solves this problem through its fill_value parameter. Here, we will add 
Series from the baseball datasets with unequal indexes, using fill_value to ensure that there are no missing values in the 
output.
</p>

In [36]:
baseball_14 = pd.read_csv('data/baseball14.csv', index_col='playerID')
baseball_15 = pd.read_csv('data/baseball15.csv', index_col='playerID')
baseball_16 = pd.read_csv('data/baseball16.csv', index_col='playerID')

In [37]:
baseball_14.head()

playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
altuvjo01,2014,1,HOU,AL,158,660,85,225,47,3,...,59.0,56.0,9.0,36,53.0,7.0,5.0,1.0,5.0,20.0
cartech02,2014,1,HOU,AL,145,507,68,115,21,1,...,88.0,5.0,2.0,56,182.0,6.0,5.0,0.0,4.0,12.0
castrja01,2014,1,HOU,AL,126,465,43,103,21,2,...,56.0,1.0,0.0,34,151.0,1.0,9.0,1.0,3.0,11.0
corpoca01,2014,1,HOU,AL,55,170,22,40,6,0,...,19.0,0.0,0.0,14,37.0,0.0,3.0,1.0,2.0,3.0
dominma01,2014,1,HOU,AL,157,564,51,121,17,0,...,57.0,0.0,1.0,29,125.0,2.0,5.0,2.0,7.0,23.0


In [38]:
baseball_15.head()

playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
altuvjo01,2015,1,HOU,AL,154,638,86,200,40,4,...,66.0,38.0,13.0,33,67.0,8.0,9.0,3.0,6.0,17.0
cartech02,2015,1,HOU,AL,129,391,50,78,17,0,...,64.0,1.0,2.0,57,151.0,1.0,6.0,0.0,5.0,5.0
castrja01,2015,1,HOU,AL,104,337,38,71,19,0,...,31.0,0.0,0.0,33,115.0,1.0,2.0,0.0,3.0,5.0
congeha01,2015,1,HOU,AL,73,201,25,46,11,0,...,33.0,0.0,1.0,23,63.0,0.0,2.0,1.0,2.0,6.0
correca01,2015,1,HOU,AL,99,387,52,108,22,1,...,68.0,14.0,4.0,40,78.0,2.0,1.0,0.0,4.0,10.0


In [39]:
baseball_16.head()

playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
altuvjo01,2016,1,HOU,AL,161,640,108,216,42,5,...,96.0,30.0,10.0,60,70.0,11.0,7.0,3.0,7.0,15.0
bregmal01,2016,1,HOU,AL,49,201,31,53,13,3,...,34.0,2.0,0.0,15,52.0,0.0,0.0,0.0,1.0,1.0
castrja01,2016,1,HOU,AL,113,329,41,69,16,3,...,32.0,2.0,1.0,45,123.0,0.0,1.0,1.0,0.0,9.0
correca01,2016,1,HOU,AL,153,577,76,158,36,3,...,96.0,13.0,3.0,75,139.0,5.0,5.0,0.0,3.0,12.0
gattiev01,2016,1,HOU,AL,128,447,58,112,19,0,...,72.0,2.0,1.0,43,127.0,6.0,4.0,0.0,5.0,12.0


In [40]:
baseball_14.shape

(16, 21)

In [41]:
baseball_15.shape

(15, 21)

In [42]:
baseball_16.shape

(16, 21)

In [43]:
#Use the difference method to find the index labels that are in baseball_14 but not in baseball_15 (swap the two to go the other way).
#Pass the other DataFrame's index: passing the DataFrame itself compares against its column names and returns every player.
baseball_14.index.difference(baseball_15.index)

In [44]:
#Pass baseball_16.index, not the DataFrame, so that player IDs are compared rather than column names:
baseball_14.index.difference(baseball_16.index)

In [45]:
#How many hits each player has over the three year period
hits_14 = baseball_14['H']
hits_15 = baseball_15['H']
hits_16 = baseball_16['H']

In [46]:
hits_14.head()

playerID
altuvjo01    225
cartech02    115
castrja01    103
corpoca01     40
dominma01    121
Name: H, dtype: int64

In [47]:
hits_15.head()

playerID
altuvjo01    200
cartech02     78
castrja01     71
congeha01     46
correca01    108
Name: H, dtype: int64

In [162]:
hits_16.sample(5)

playerID
gourryu01     34
correca01    158
tuckepr01     22
rasmuco01     76
castrja01     69
Name: H, dtype: int64

In [179]:
#Add the 2014 and 2015 Series with the plus operator; a player missing from either year yields NaN:
total_hits = (hits_14 + hits_15).head()
total_hits.sample(5)

playerID
congeha01      NaN
cartech02    193.0
castrja01    174.0
corpoca01      NaN
altuvjo01    425.0
Name: H, dtype: float64

In [None]:
#The indexes here are unique, so no Cartesian product forms; the result holds the union of both indexes
#A label present in only one Series yields NaN, because there is no matching value to add to it

In [161]:
total_hits.shape

(5,)

In [49]:
#Checking the number of missing values of the elements above:
total_hits.isnull().sum()

2

In [50]:
#To avoid the missing values, the add method with fill_value could have been used instead:
hits_14.add(hits_15, fill_value = 0).head()

playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01     46.0
corpoca01     40.0
Name: H, dtype: float64

In [51]:
#Chain the add method to find the sum of the hits for the three years:
three_yr_hits = hits_14.add(hits_15, fill_value = 0)\
                       .add(hits_16, fill_value = 0)
three_yr_hits.head()

playerID
altuvjo01    641.0
bregmal01     53.0
cartech02    193.0
castrja01    243.0
congeha01     46.0
Name: H, dtype: float64

In [52]:
#Confirming that there are no missing values:
three_yr_hits.isnull().sum()

0

In [53]:
#Alternatively:
three_yr_hits.hasnans

False
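
With more than a few Series, chaining `add` by hand becomes repetitive. One possible alternative, sketched here with made-up hit totals rather than the baseball files, folds a list of Series together with `functools.reduce`:

```python
from functools import reduce

import pandas as pd

# Made-up hit totals for three seasons; the player labels are illustrative
year1 = pd.Series({'p1': 10, 'p2': 20})
year2 = pd.Series({'p2': 5, 'p3': 7})
year3 = pd.Series({'p1': 1, 'p3': 2, 'p4': 4})

# Fold the list pairwise, treating a player absent from a season as 0 hits
total = reduce(lambda acc, s: acc.add(s, fill_value=0), [year1, year2, year3])
print(total.sort_index())
```

The result matches what the chained `add` calls would give, and the list can grow without the expression changing.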

In [54]:
#Let us select a few columns from the 2014 baseball dataset:
df_14 = baseball_14[['G', 'AB', 'R', 'H']]
df_14.head()

playerID,G,AB,R,H
altuvjo01,158,660,85,225
cartech02,145,507,68,115
castrja01,126,465,43,103
corpoca01,55,170,22,40
dominma01,157,564,51,121


In [55]:
#Let us select a few features from the 2015 baseball dataset:
df_15 = baseball_15[['AB', 'R', 'H', 'HR']]
df_15.head()

playerID,AB,R,H,HR
altuvjo01,638,86,200,15
cartech02,391,50,78,24
castrja01,337,38,71,11
congeha01,201,25,46,11
correca01,387,52,108,22


In [56]:
#Adding the two DataFrames and highlighting the null elements:
#Only cells whose playerID and column both appear in both DataFrames will be non-missing
(df_14 + df_15).head(10).style.highlight_null('brown')

playerID,AB,G,H,HR,R
altuvjo01,1298.0,,425.0,,171.0
cartech02,898.0,,193.0,,118.0
castrja01,802.0,,174.0,,81.0
congeha01,,,,,
corpoca01,,,,,
correca01,,,,,
dominma01,,,,,
fowlede01,,,,,
gattiev01,,,,,
gomezca01,,,,,


In [57]:
#Even with fill_value, a cell stays missing when its combination of row and column exists in neither input DataFrame
#AB, H, and R are the only columns that appear in both DataFrames
#For example, take the intersection of playerID congeha01 and column G:
#he appears only in the 2015 data, which lacks the G column, so there is no value to fill
df_14.add(df_15, fill_value = 0).head(10).style.highlight_null('red')

playerID,AB,G,H,HR,R
altuvjo01,1298.0,158.0,425.0,15.0,171.0
cartech02,898.0,145.0,193.0,24.0,118.0
castrja01,802.0,126.0,174.0,11.0,81.0
congeha01,201.0,,46.0,11.0,25.0
corpoca01,170.0,55.0,40.0,,22.0
correca01,387.0,,108.0,22.0,52.0
dominma01,564.0,157.0,121.0,,51.0
fowlede01,434.0,116.0,120.0,,61.0
gattiev01,566.0,,139.0,27.0,66.0
gomezca01,149.0,,36.0,4.0,19.0
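
The same behaviour can be reproduced on two tiny hand-made DataFrames (the labels and numbers are illustrative only). The sketch counts the missing cells produced by the plus operator and by `add` with `fill_value`:

```python
import pandas as pd

df_a = pd.DataFrame({'G': [10, 20], 'H': [1, 2]}, index=['p1', 'p2'])
df_b = pd.DataFrame({'H': [3, 4], 'HR': [5, 6]}, index=['p2', 'p3'])

plain = df_a + df_b
filled = df_a.add(df_b, fill_value=0)

# The plus operator leaves a value only where row and column exist in both
print(plain.isna().sum().sum())

# fill_value rescues cells present in one input; cells absent from both stay NaN
print(filled.isna().sum().sum())
print(filled)
```

Only `('p3', 'G')` and `('p1', 'HR')` remain missing after `add`, because neither input DataFrame holds a value at those positions.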


## <u>Appending columns from different DataFrames</u>
<p>The indexes must align before two or more Series or DataFrames are combined, and the same holds when a Series is 
assigned as a new DataFrame column. Here, we will use the employee dataset to append a new column containing the maximum 
salary in each employee's department.</p>

In [58]:
employee = pd.read_csv('data/employee.csv')
employee.head()

,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
0,0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13
1,1,LIBRARY ASSISTANT,Library,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18
2,2,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03
3,3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25
4,4,ELECTRICIAN,General Services Department,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22


In [59]:
#Selecting the DEPARTMENT and BASE_SALARY columns
dept_sal = employee[['DEPARTMENT', 'BASE_SALARY']]

In [60]:
dept_sal.head()

,DEPARTMENT,BASE_SALARY
0,Municipal Courts Department,121862.0
1,Library,26125.0
2,Houston Police Department-HPD,45279.0
3,Houston Fire Department (HFD),63166.0
4,General Services Department,56347.0


In [61]:
#Sorting DEPARTMENT alphabetically and, within each department, BASE_SALARY from the largest to the smallest:
dept_sal = dept_sal.sort_values(['DEPARTMENT', 'BASE_SALARY'], ascending=[True, False])
dept_sal.head(20)

,DEPARTMENT,BASE_SALARY
1494,Admn. & Regulatory Affairs,140416.0
237,Admn. & Regulatory Affairs,130416.0
1679,Admn. & Regulatory Affairs,103776.0
988,Admn. & Regulatory Affairs,72741.0
693,Admn. & Regulatory Affairs,66825.0
1868,Admn. & Regulatory Affairs,65000.0
971,Admn. & Regulatory Affairs,62129.0
1070,Admn. & Regulatory Affairs,57221.0
1983,Admn. & Regulatory Affairs,55172.0
379,Admn. & Regulatory Affairs,48755.0


In [62]:
#Use the drop_duplicates method to keep the first row of each department, which after the sort holds its maximum salary
max_dept_sal = dept_sal.drop_duplicates(subset='DEPARTMENT')
max_dept_sal.head()

,DEPARTMENT,BASE_SALARY
1494,Admn. & Regulatory Affairs,140416.0
149,City Controller's Office,64251.0
236,City Council,100000.0
647,Convention and Entertainment,38397.0
1500,Dept of Neighborhoods (DON),89221.0


In [63]:
max_dept_sal['DEPARTMENT'].value_counts()

Admn. & Regulatory Affairs        1
City Controller's Office          1
Public Works & Engineering-PWE    1
Planning & Development            1
Parks & Recreation                1
Municipal Courts Department       1
Mayor's Office                    1
Library                           1
Legal Department                  1
Human Resources Dept.             1
Houston Police Department-HPD     1
Houston Information Tech Svcs     1
Houston Fire Department (HFD)     1
Houston Emergency Center (HEC)    1
Houston Airport System (HAS)      1
Housing and Community Devp.       1
Health & Human Services           1
General Services Department       1
Fleet Management Department       1
Finance                           1
Dept of Neighborhoods (DON)       1
Convention and Entertainment      1
City Council                      1
Solid Waste Management            1
Name: DEPARTMENT, dtype: int64

In [64]:
dept_sal['DEPARTMENT'].value_counts()

Houston Police Department-HPD     638
Houston Fire Department (HFD)     384
Public Works & Engineering-PWE    343
Health & Human Services           110
Houston Airport System (HAS)      106
Parks & Recreation                 74
Solid Waste Management             43
Fleet Management Department        36
Library                            36
Admn. & Regulatory Affairs         29
Municipal Courts Department        28
Human Resources Dept.              24
Houston Emergency Center (HEC)     23
Housing and Community Devp.        22
General Services Department        22
Dept of Neighborhoods (DON)        17
Legal Department                   17
City Council                       11
Finance                            10
Houston Information Tech Svcs       9
Planning & Development              7
City Controller's Office            5
Mayor's Office                      5
Convention and Entertainment        1
Name: DEPARTMENT, dtype: int64

In [66]:
#Put the department column into the index of both DataFrames
max_dept_sal = max_dept_sal.set_index('DEPARTMENT')
employee = employee.set_index('DEPARTMENT')

In [67]:
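#Both indexes now hold department names, so the assignment aligns on them:
#each employee row receives the maximum salary of its own department
employee['MAX_DEPT_SALARY'] = max_dept_sal['BASE_SALARY']
employee.sample(30)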

DEPARTMENT,UNIQUE_ID,POSITION_TITLE,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE,MAX_DEPT_SALARY
Houston Police Department-HPD,1382,SENIOR POLICE OFFICER,66614.0,White,Full Time,Male,Active,1972-11-13,2002-01-05,199596.0
Houston Police Department-HPD,1904,JAIL ATTENDANT,37211.0,Black or African American,Full Time,Female,Active,2008-05-05,2008-05-05,199596.0
Housing and Community Devp.,1049,FINANCIAL ANALYST IV,70832.0,Asian/Pacific Islander,Full Time,Female,Active,2010-04-05,2014-12-20,98536.0
Houston Police Department-HPD,1995,POLICE OFFICER,43443.0,White,Full Time,Male,Active,2014-06-09,2015-06-09,199596.0
Houston Police Department-HPD,89,SENIOR POLICE OFFICER,66614.0,Hispanic/Latino,Full Time,Male,Active,1985-05-28,2002-09-14,199596.0
Public Works & Engineering-PWE,1858,ADMINISTRATIVE ASSISTANT,38750.0,Black or African American,Full Time,Female,Active,2014-10-27,2015-09-26,178331.0
Houston Police Department-HPD,765,POLICE SERGEANT,81239.0,Hispanic/Latino,Full Time,Male,Active,1994-11-07,2002-02-28,199596.0
Public Works & Engineering-PWE,1297,SENIOR INVENTORY MANAGEMENT CLERK,37357.0,Black or African American,Full Time,Female,Active,1993-10-02,2011-07-23,178331.0
Houston Police Department-HPD,814,SENIOR POLICE OFFICER,,Black or African American,Full Time,Male,Active,1984-01-09,2004-06-05,199596.0
Health & Human Services,918,FINANCIAL ANALYST IV,70762.0,Black or African American,Full Time,Female,Active,1998-09-08,2008-08-23,180416.0


In [68]:
employee.query('BASE_SALARY > MAX_DEPT_SALARY')

Unnamed: 0_level_0,UNIQUE_ID,POSITION_TITLE,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE,MAX_DEPT_SALARY
DEPARTMENT,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1


In [69]:
employee['MAX_DEPT_SALARY'].sample(15)

DEPARTMENT
Houston Police Department-HPD     199596.0
Public Works & Engineering-PWE    178331.0
Houston Fire Department (HFD)     210588.0
Public Works & Engineering-PWE    178331.0
Public Works & Engineering-PWE    178331.0
Public Works & Engineering-PWE    178331.0
Houston Airport System (HAS)      186192.0
Houston Police Department-HPD     199596.0
Houston Fire Department (HFD)     210588.0
Houston Airport System (HAS)      186192.0
Public Works & Engineering-PWE    178331.0
Public Works & Engineering-PWE    178331.0
Houston Fire Department (HFD)     210588.0
Human Resources Dept.             110547.0
Houston Airport System (HAS)      186192.0
Name: MAX_DEPT_SALARY, dtype: float64
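
The assignment in `In [67]` relies on index alignment: each row of `employee` picks up the value whose `DEPARTMENT` label matches its own, so the department maximum is repeated for every employee in that department. Below is a minimal sketch of the same idea on made-up data; the names `staff` and `dept_max` and all the values are hypothetical, not taken from the employee dataset.

```python
import pandas as pd

# Hypothetical toy data: three employees in two departments
staff = pd.DataFrame(
    {"BASE_SALARY": [50.0, 70.0, 40.0]},
    index=pd.Index(["Police", "Police", "Fire"], name="DEPARTMENT"),
)

# One maximum salary per department (unique index labels)
dept_max = pd.Series(
    [70.0, 40.0],
    index=pd.Index(["Police", "Fire"], name="DEPARTMENT"),
)

# The Series is aligned to staff's index by label before assignment,
# so the "Police" value is repeated for both Police rows
staff["MAX_DEPT_SALARY"] = dept_max
print(staff)
```

The sketch assumes the right-hand Series has unique index labels, as `max_dept_sal` appears to have in the recipe above.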

## <u>Highlighting the maximum value from each column</u>
<p>This recipe finds the school with the maximum value for each numeric column and styles the DataFrame to highlight those values, 
so that the result is easy to read.</p>

In [70]:
college = pd.read_csv('data/college.csv', index_col = 'INSTNM')
college.head()

Unnamed: 0_level_0,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


In [71]:
college.dtypes

CITY                   object
STABBR                 object
HBCU                  float64
MENONLY               float64
WOMENONLY             float64
RELAFFIL                int64
SATVRMID              float64
SATMTMID              float64
DISTANCEONLY          float64
UGDS                  float64
UGDS_WHITE            float64
UGDS_BLACK            float64
UGDS_HISP             float64
UGDS_ASIAN            float64
UGDS_AIAN             float64
UGDS_NHPI             float64
UGDS_2MOR             float64
UGDS_NRA              float64
UGDS_UNKN             float64
PPTUG_EF              float64
CURROPER                int64
PCTPELL               float64
PCTFLOAN              float64
UG25ABV               float64
MD_EARN_WNE_P10        object
GRAD_DEBT_MDN_SUPP     object
dtype: object

In [72]:
college.dtypes.value_counts()

float64    20
object      4
int64       2
dtype: int64

In [73]:
college.shape

(7535, 26)

In [74]:
#Apart from CITY and STABBR, every column appears to be numeric.
#However, the MD_EARN_WNE_P10 and GRAD_DEBT_MDN_SUPP columns have the object dtype rather than a numeric one.
#Inspect a sample value to confirm this:
college['MD_EARN_WNE_P10'].iloc[0]

'30300'

In [75]:
college['GRAD_DEBT_MDN_SUPP'].iloc[0]

'33888'

In [76]:
#The values are strings, but we would like them to be numeric.
#Sorting confirms it: the order below is lexicographic ('10400' comes before '104800'), not numeric:
college.MD_EARN_WNE_P10.sort_values(ascending = True).head()

INSTNM
Associated Beth Rivkah Schools                       10100
Rosemead Beauty School                               10100
Adrian's College of Beauty Turlock                   10400
University of California-Hastings College of Law    104800
University of Texas Southwestern Medical Center     106900
Name: MD_EARN_WNE_P10, dtype: object

In [77]:
#Some schools have privacy concerns about the data, so these columns may contain entries that are not numbers.
#To force these columns to be numeric, we can use the pd.to_numeric function.
#errors = 'coerce' turns any value that cannot be parsed into NaN instead of raising an error:
cols = ['MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP']
for col in cols:
    college[col] = pd.to_numeric(college[col], errors = 'coerce')    

In [78]:
college.dtypes.loc[cols]

MD_EARN_WNE_P10       float64
GRAD_DEBT_MDN_SUPP    float64
dtype: object
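
To make the effect of `errors = 'coerce'` concrete, here is a small sketch on a made-up Series. The placeholder string `'suppressed'` is an assumption for illustration only; the actual non-numeric entries in the college data are not shown in this notebook.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as strings plus one non-numeric entry
raw = pd.Series(["30300", "39700", "suppressed"])

# Values that cannot be parsed become NaN; the result has a float dtype
converted = pd.to_numeric(raw, errors="coerce")

print(converted)
print(converted.isna().sum())  # number of entries that could not be parsed
```

Counting the missing values before and after the conversion is one way to see how many entries were coerced.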

In [79]:
college.MD_EARN_WNE_P10.sort_values(ascending = False).head()

INSTNM
Medical College of Wisconsin                            233100.0
West Virginia School of Osteopathic Medicine            219900.0
A T Still University of Health Sciences                 219800.0
Albany Medical College                                  214400.0
University of Massachusetts Medical School Worcester    213600.0
Name: MD_EARN_WNE_P10, dtype: float64

In [80]:
college.describe(include = [np.number])

Unnamed: 0,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
count,7164.0,7164.0,7164.0,7535.0,1185.0,1196.0,7164.0,6874.0,6874.0,6874.0,...,6874.0,6874.0,6874.0,6853.0,7535.0,6849.0,6849.0,6718.0,5591.0,5993.0
mean,0.014238,0.009213,0.005304,0.190975,522.819409,530.76505,0.005583,2356.83794,0.510207,0.189997,...,0.02395,0.016086,0.045181,0.226639,0.923291,0.530643,0.522211,0.410021,32918.315149,16850.66853
std,0.118478,0.095546,0.072642,0.393096,68.578862,73.469767,0.074519,5474.275871,0.286958,0.224587,...,0.031288,0.050172,0.09344,0.24647,0.266146,0.225544,0.283616,0.228939,14621.845375,8401.582396
min,0.0,0.0,0.0,0.0,290.0,310.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9500.0,1409.0
25%,0.0,0.0,0.0,0.0,475.0,482.0,0.0,117.0,0.2675,0.036125,...,0.0,0.0,0.0,0.0,1.0,0.3578,0.3329,0.2415,23900.0,9500.0
50%,0.0,0.0,0.0,0.0,510.0,520.0,0.0,412.5,0.5557,0.10005,...,0.0175,0.0,0.0143,0.1504,1.0,0.5215,0.5833,0.40075,30700.0,14500.0
75%,0.0,0.0,0.0,0.0,555.0,565.0,0.0,1929.5,0.747875,0.2577,...,0.0339,0.0117,0.0454,0.3769,1.0,0.7129,0.745,0.572275,38800.0,24547.5
max,1.0,1.0,1.0,1.0,765.0,785.0,1.0,151558.0,1.0,1.0,...,0.5333,0.9286,0.9027,1.0,1.0,1.0,1.0,1.0,233100.0,49750.0


In [81]:
#Use the select_dtypes method to filter for only numeric columns.
#This will exclude STABBR and CITY columns, where a maximum value does not make sense with this problem:
college_n = college.select_dtypes(include = [np.number])
college_n.head()

Unnamed: 0_level_0,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Alabama A & M University,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300.0,33888.0
University of Alabama at Birmingham,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700.0,21941.5
Amridge University,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100.0,23370.0
University of Alabama in Huntsville,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500.0,24097.0
Alabama State University,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600.0,33118.5


In [82]:
#Flag the columns that have exactly two unique values; these are likely to be binary:
criteria = college_n.nunique() == 2
criteria.head()

HBCU          True
MENONLY       True
WOMENONLY     True
RELAFFIL      True
SATVRMID     False
dtype: bool

In [83]:
binary_cols = college_n.columns[criteria].tolist()
binary_cols

['HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL', 'DISTANCEONLY', 'CURROPER']
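
The binary-column filter can be tried on a toy frame. This is only a sketch with hypothetical column names (`flag`, `score`); it also shows that `nunique` ignores missing values by default, which is why a 0/1 column with gaps still counts as having two unique values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one 0/1 column with a missing value, one continuous column
toy = pd.DataFrame(
    {
        "flag": [0.0, 1.0, np.nan, 1.0],
        "score": [424.0, 570.0, 595.0, np.nan],
    }
)

# nunique() skips NaN by default, so "flag" has exactly two unique values
is_binary = toy.nunique() == 2

# Use the boolean Series to pick the matching column labels
binary_cols = toy.columns[is_binary].tolist()
non_binary = toy.drop(labels=binary_cols, axis="columns")

print(binary_cols)
print(non_binary.columns.tolist())
```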

In [84]:
#Drop the binary columns with the drop method:
college_n2 = college_n.drop(labels = binary_cols, axis = 'columns')
college_n2.head()

Unnamed: 0_level_0,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
Alabama A & M University,424.0,420.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,0.7356,0.8284,0.1049,30300.0,33888.0
University of Alabama at Birmingham,570.0,565.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,0.346,0.5214,0.2422,39700.0,21941.5
Amridge University,,,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,0.6801,0.7795,0.854,40100.0,23370.0
University of Alabama in Huntsville,595.0,590.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,0.3072,0.4596,0.264,45500.0,24097.0
Alabama State University,425.0,430.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,0.7347,0.7554,0.127,26600.0,33118.5


In [85]:
college_n2.shape

(7535, 18)

In [86]:
#Use the idxmax method to find the index label of the maximum value for each column:
max_cols = college_n2.idxmax(axis = 0)
max_cols

SATVRMID                             California Institute of Technology
SATMTMID                             California Institute of Technology
UGDS                                      University of Phoenix-Arizona
UGDS_WHITE                       Mr Leon's School of Hair Design-Moscow
UGDS_BLACK                           Velvatex College of Beauty Culture
UGDS_HISP                       Thunderbird School of Global Management
UGDS_ASIAN                          Cosmopolitan Beauty and Tech School
UGDS_AIAN                             Haskell Indian Nations University
UGDS_NHPI                                       Palau Community College
UGDS_2MOR                                                 LIU Brentwood
UGDS_NRA               California University of Management and Sciences
UGDS_UNKN             Le Cordon Bleu College of Culinary Arts-San Fr...
PPTUG_EF                        Thunderbird School of Global Management
PCTPELL                                        MTI Business Coll

In [87]:
#Get an ndarray of the unique institution names (the index labels returned by idxmax):
unique_max_cols = max_cols.unique()
unique_max_cols

array(['California Institute of Technology',
       'University of Phoenix-Arizona',
       "Mr Leon's School of Hair Design-Moscow",
       'Velvatex College of Beauty Culture',
       'Thunderbird School of Global Management',
       'Cosmopolitan Beauty and Tech School',
       'Haskell Indian Nations University', 'Palau Community College',
       'LIU Brentwood',
       'California University of Management and Sciences',
       'Le Cordon Bleu College of Culinary Arts-San Francisco',
       'MTI Business College Inc', 'ABC Beauty College Inc',
       'Dongguk University-Los Angeles', 'Medical College of Wisconsin',
       'Southwest University of Visual Arts-Tucson'], dtype=object)
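
The two steps above (find the label of each column maximum, then keep each label once) can be sketched on made-up data. The labels `A`, `B`, `C` and the column names are hypothetical. Note that `idxmax` skips missing values and, when several rows tie, reports the first one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing value and a tie in "pct"
toy = pd.DataFrame(
    {"sat": [500.0, 765.0, np.nan], "pct": [1.0, 0.2, 1.0]},
    index=["A", "B", "C"],
)

# Index label of the maximum in each column (first occurrence on ties)
max_labels = toy.idxmax()

# Each label once, in order of appearance
unique_labels = max_labels.unique()

# Select only the rows that hold at least one column maximum
subset = toy.loc[unique_labels]
print(subset)
```

Row `C` also equals the maximum of `pct`, but it is not selected because `idxmax` only reports the first occurrence.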

In [88]:
college_n2.loc[unique_max_cols].style.highlight_max()

Unnamed: 0_level_0,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
California Institute of Technology,765.0,785.0,983.0,0.2787,0.0153,0.1221,0.4385,0.001,0.0,0.057,0.0875,0.0,0.0,0.1126,0.2303,0.0082,77800.0,11812.5
University of Phoenix-Arizona,,,151558.0,0.3098,0.1555,0.076,0.0082,0.0042,0.005,0.1131,0.0131,0.3152,0.0,0.6009,0.592,,,33000.0
Mr Leon's School of Hair Design-Moscow,,,16.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.625,0.625,0.2,,15710.0
Velvatex College of Beauty Culture,,,25.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.7692,0.0,0.52,,
Thunderbird School of Global Management,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,118900.0,
Cosmopolitan Beauty and Tech School,,,110.0,0.0091,0.0,0.0182,0.9727,0.0,0.0,0.0,0.0,0.0,0.3182,0.7761,0.1244,0.9545,,
Haskell Indian Nations University,430.0,440.0,805.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0224,0.8396,0.0,0.2089,22800.0,
Palau Community College,,,602.0,0.0,0.0017,0.0,0.0,0.0,0.9983,0.0,0.0,0.0,0.3887,0.856,0.0,0.2616,24700.0,
LIU Brentwood,,,15.0,0.0,0.1333,0.2667,0.0,0.0,0.0,0.5333,0.0,0.0667,0.4,0.5652,0.7826,0.7826,44600.0,25499.0
California University of Management and Sciences,,,98.0,0.0102,0.0204,0.0,0.0408,0.0,0.0,0.0,0.9286,0.0,0.0,0.0926,0.0556,0.6852,,


## <u>Replicating idxmax with method chaining</u>
<p>idxmax is a challenging method to replicate using only the methods covered in this notebook. This recipe chains together
basic methods to find all the row index labels that contain a maximum column value.</p>

In [89]:
college.head()

Unnamed: 0_level_0,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300.0,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700.0,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100.0,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500.0,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600.0,33118.5


In [90]:
#Convert the two object columns of interest to numeric:
cols = ['MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP']
for col in cols:
    college[col] = pd.to_numeric(college[col], errors = 'coerce')

In [91]:
college_n = college.select_dtypes(include = [np.number])
college_n.head()

Unnamed: 0_level_0,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Alabama A & M University,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300.0,33888.0
University of Alabama at Birmingham,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700.0,21941.5
Amridge University,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100.0,23370.0
University of Alabama in Huntsville,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500.0,24097.0
Alabama State University,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600.0,33118.5


In [92]:
criteria = college_n.nunique() == 2
criteria.head()

HBCU          True
MENONLY       True
WOMENONLY     True
RELAFFIL      True
SATVRMID     False
dtype: bool

#criteria is indexed by column name, so it must be applied along the columns with loc:
f = college_n.loc[:, criteria]
f

In [93]:
binary_cols = college_n.columns[criteria].tolist()
binary_cols

['HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL', 'DISTANCEONLY', 'CURROPER']

In [94]:
college_n = college_n.drop(labels = binary_cols, axis = 'columns')

In [95]:
#Finding the maximum for each column:
college_n.max().head()

SATVRMID         765.0
SATMTMID         785.0
UGDS          151558.0
UGDS_WHITE         1.0
UGDS_BLACK         1.0
dtype: float64

In [96]:
#Test each value with its column max
college_n.eq(college_n.max()).head()

Unnamed: 0_level_0,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
Alabama A & M University,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
University of Alabama at Birmingham,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Amridge University,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
University of Alabama in Huntsville,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Alabama State University,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
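
The comparison in `In [96]` is itself an example of index alignment: `college_n.max()` is a Series indexed by column name, and `eq` matches those labels to the DataFrame's columns before comparing. A minimal sketch on made-up data (the labels and values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; "pct" has two rows tied for the maximum
toy = pd.DataFrame(
    {"sat": [500.0, 765.0, 600.0], "pct": [1.0, 0.2, 1.0]},
    index=["A", "B", "C"],
)

col_max = toy.max()        # Series indexed by column name
is_max = toy.eq(col_max)   # Series labels are aligned with the columns

# True for every row that holds at least one column maximum
rows_with_max = is_max.any(axis="columns")

print(is_max)
print(rows_with_max.sum())
```

Because of the tie in `pct`, the number of flagged rows here is larger than the number of columns.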


In [97]:
#Use the any method to find rows that have at least one True value:
has_row_max = college_n.eq(college_n.max()).any(axis = 'columns')
has_row_max

INSTNM
Alabama A & M University                                  False
University of Alabama at Birmingham                       False
Amridge University                                        False
University of Alabama in Huntsville                       False
Alabama State University                                  False
                                                          ...  
SAE Institute of Technology  San Francisco                False
Rasmussen College - Overland Park                         False
National Personal Training Institute of Cleveland         False
Bay Area Medical Academy - San Jose Satellite Location    False
Excel Learning Center-San Antonio South                   False
Length: 7535, dtype: bool

In [98]:
#There are 18 columns, so we would expect at most 18 True values in has_row_max.
#Let us count how many there actually are:
has_row_max.sum()

401

<p>This was a bit unexpected, but it turns out that there are columns with many rows that equal the maximum value. This is 
common with the percentage columns that have a maximum of 1. idxmax, by contrast, returns only the first occurrence of the maximum value. 
Let us back up a bit, remove the any method, and look at the output above. Let us run the cumsum method instead to accumulate
all the True values. The first and last few rows are shown:</p>

In [99]:
#cumsum accumulates all True values:
college_n.eq(college_n.max()).cumsum()

Unnamed: 0_level_0,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
Alabama A & M University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
University of Alabama at Birmingham,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Amridge University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
University of Alabama in Huntsville,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Alabama State University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,1,1,1,109,28,136,1,2,1,1,1,1,44,66,55,12,1,2
Rasmussen College - Overland Park,1,1,1,109,28,136,1,2,1,1,1,1,44,66,55,12,1,2
National Personal Training Institute of Cleveland,1,1,1,109,28,136,1,2,1,1,1,1,44,66,55,12,1,2
Bay Area Medical Academy - San Jose Satellite Location,1,1,1,109,28,136,1,2,1,1,1,1,44,66,55,12,1,2


<p>Some columns, like SATVRMID and SATMTMID, have a single maximum, while others, like UGDS_WHITE, have many: 109 schools have 
100% of their undergraduates as white. After the first cumsum, every row from the first maximum onward holds a value of at least 1. 
If we chain the cumsum method one more time, the value 1 therefore appears only once in each column, at the first occurrence of the maximum:</p>

In [100]:
college_n.eq(college_n.max()).cumsum().cumsum()

Unnamed: 0_level_0,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
Alabama A & M University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
University of Alabama at Birmingham,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Amridge University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
University of Alabama in Huntsville,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Alabama State University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,7305,7305,415,379750,73107,341103,983,11382,3316,5056,1076,7276,113649,172944,170273,36183,3445,10266
Rasmussen College - Overland Park,7306,7306,416,379859,73135,341239,984,11384,3317,5057,1077,7277,113693,173010,170328,36195,3446,10268
National Personal Training Institute of Cleveland,7307,7307,417,379968,73163,341375,985,11386,3318,5058,1078,7278,113737,173076,170383,36207,3447,10270
Bay Area Medical Academy - San Jose Satellite Location,7308,7308,418,380077,73191,341511,986,11388,3319,5059,1079,7279,113781,173142,170438,36219,3448,10272


In [101]:
#We can now test the equality of each value against 1 with the eq method and then use the any method to find rows that have at least one True value:
has_row_max2 = college_n.eq(college_n.max()) \
                        .cumsum() \
                        .cumsum() \
                        .eq(1) \
                        .any(axis='columns')

In [102]:
#Test that has_row_max2 has no more True values than the number of columns:
has_row_max2.sum()

16
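
The double `cumsum` step used above is easier to follow on a single toy column. This is only a sketch with a hypothetical boolean Series in which three rows tie for the maximum.

```python
import pandas as pd

# Hypothetical result of comparing one column with its maximum:
# rows "b", "d" and "e" all equal the maximum
is_max = pd.Series([False, True, False, True, True], index=list("abcde"))

once = is_max.cumsum()    # running count of maxima seen so far: 0, 1, 1, 2, 3
twice = once.cumsum()     # running total of that count:         0, 1, 2, 4, 7

# The value 1 can only occur at the first maximum, because every later
# row adds at least 1 to a total that is already 1
first_max = twice.eq(1)

print(first_max[first_max].index.tolist())
```

The single `cumsum` is not enough on its own, since the value 1 would persist on row `c` until the next maximum is reached.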

In [103]:
#We need all the institutions where has_row_max2 is True.
#We can simply use boolean indexing on the Series itself:
idxmax_cols = has_row_max2[has_row_max2].index
idxmax_cols

Index(['Thunderbird School of Global Management',
       'Southwest University of Visual Arts-Tucson', 'ABC Beauty College Inc',
       'Velvatex College of Beauty Culture',
       'California Institute of Technology',
       'Le Cordon Bleu College of Culinary Arts-San Francisco',
       'MTI Business College Inc', 'Dongguk University-Los Angeles',
       "Mr Leon's School of Hair Design-Moscow",
       'Haskell Indian Nations University', 'LIU Brentwood',
       'Medical College of Wisconsin', 'Palau Community College',
       'California University of Management and Sciences',
       'Cosmopolitan Beauty and Tech School', 'University of Phoenix-Arizona'],
      dtype='object', name='INSTNM')

In [104]:
#We can check whether they are the same as the ones found with the idxmax method:
set(college_n.idxmax().unique()) == set(idxmax_cols)

True

In [105]:
#The whole process can be chained and timed as follows:
%timeit college_n.eq(college_n.max()) \
                 .cumsum() \
                 .cumsum() \
                 .eq(1) \
                 .any(axis='columns') \
                 [lambda x: x].index

6.95 ms ± 578 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [106]:
%timeit college_n.idxmax().values

3.44 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## <u>Finding the most common maximum</u>
<p>We will find the race with the highest undergraduate population for each school. Thereafter, we can find the distribution of
the result across the entire dataset. This can help us answer questions like <i>What percentage of institutions have more white
students than any other race?</i></p>

In [107]:
college.head()

Unnamed: 0_level_0,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300.0,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700.0,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100.0,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500.0,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600.0,33118.5


In [108]:
#Select the race percentage columns, whose names all contain 'UGDS_':
col_ugds = college.filter(like = 'UGDS_')
col_ugds.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


In [109]:
#Use idxmax to get the column name with the highest race percentage for each row:
highest_percentage_race = col_ugds.idxmax(axis = 'columns')
highest_percentage_race

INSTNM
Alabama A & M University                                  UGDS_BLACK
University of Alabama at Birmingham                       UGDS_WHITE
Amridge University                                        UGDS_BLACK
University of Alabama in Huntsville                       UGDS_WHITE
Alabama State University                                  UGDS_BLACK
                                                             ...    
SAE Institute of Technology  San Francisco                       NaN
Rasmussen College - Overland Park                                NaN
National Personal Training Institute of Cleveland                NaN
Bay Area Medical Academy - San Jose Satellite Location           NaN
Excel Learning Center-San Antonio South                          NaN
Length: 7535, dtype: object

In [110]:
highest_percentage_race.value_counts()

UGDS_WHITE    4608
UGDS_BLACK    1042
UGDS_HISP      890
UGDS_UNKN      161
UGDS_ASIAN      83
UGDS_AIAN       42
UGDS_NRA        28
UGDS_NHPI       12
UGDS_2MOR        8
dtype: int64

In [111]:
#Use the value_counts method with normalize = True to get the distribution of the maximum occurrences as fractions:
highest_percentage_race.value_counts(normalize = True)

UGDS_WHITE    0.670352
UGDS_BLACK    0.151586
UGDS_HISP     0.129473
UGDS_UNKN     0.023422
UGDS_ASIAN    0.012074
UGDS_AIAN     0.006110
UGDS_NRA      0.004073
UGDS_NHPI     0.001746
UGDS_2MOR     0.001164
dtype: float64

In [112]:
highest_percentage_race.value_counts(normalize = True)*100

UGDS_WHITE    67.035205
UGDS_BLACK    15.158569
UGDS_HISP     12.947338
UGDS_UNKN      2.342159
UGDS_ASIAN     1.207448
UGDS_AIAN      0.610998
UGDS_NRA       0.407332
UGDS_NHPI      0.174571
UGDS_2MOR      0.116381
dtype: float64
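
The row-wise `idxmax` followed by `value_counts` can be sketched on a few made-up schools. The school labels `S1` to `S4` and all percentages are hypothetical; the sketch also avoids rows that are entirely missing, which the real data contains.

```python
import pandas as pd

# Hypothetical toy data: race percentages for four schools
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.6, 0.1, 0.5, 0.5],
        "UGDS_BLACK": [0.3, 0.8, 0.4, 0.4],
        "UGDS_HISP": [0.1, 0.1, 0.1, 0.1],
    },
    index=["S1", "S2", "S3", "S4"],
)

# Column label of the largest value in each row
top = toy.idxmax(axis="columns")

# Share of schools for which each column is the largest
share = top.value_counts(normalize=True)
print(share)
```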

In [114]:
#For the schools with more black students than any other race, what is the distribution for the second highest race percentage?
college_black = col_ugds[highest_percentage_race == 'UGDS_BLACK']
college_black.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137
Concordia College Alabama,0.028,0.8758,0.0373,0.0093,0.0,0.0,0.0031,0.0466,0.0
South University-Montgomery,0.3046,0.6054,0.0153,0.0153,0.0153,0.0096,0.0,0.0019,0.0326


In [115]:
college_black.shape

(1042, 9)
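
One way to answer the second-highest question posed in `In [114]` is to drop the column that is already known to be the largest and apply `idxmax` again. The following is a sketch on made-up data, not the notebook's actual result; the school labels and percentages are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: three schools where UGDS_BLACK is the largest share
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.03, 0.30, 0.02],
        "UGDS_BLACK": [0.94, 0.42, 0.92],
        "UGDS_HISP": [0.01, 0.01, 0.05],
    },
    index=["S1", "S2", "S3"],
)

top = toy.idxmax(axis="columns")
black_first = toy[top == "UGDS_BLACK"]

# With the top column removed, idxmax returns the runner-up for each row
second = black_first.drop("UGDS_BLACK", axis="columns").idxmax(axis="columns")
print(second.value_counts(normalize=True))
```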

In [117]:
#Drop the UGDS_BLACK column so that idxmax finds the second highest race percentage:
college_black2 = college_black.drop('UGDS_BLACK', axis = 'columns')

In [118]:
college_black2.idxmax(axis = 'columns').value_counts(normalize = True)