---
# Index Alignment
---

## Examining the Index object
All Index objects, except for the MultiIndex, are single-dimensional data structures that combine the functionality of Python sets and NumPy ndarrays.

Examine the column index of the college dataset and explore its main functionality.
Read in the college dataset and create a variable `cols` that holds the column index:

In [1]:
import numpy as np
import pandas as pd

In [2]:
college = pd.read_csv('./college.csv')
cols = college.columns
cols

Index(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL',
       'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS', 'UGDS_WHITE',
       'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI',
       'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF', 'CURROPER', 'PCTPELL',
       'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP'],
      dtype='object')

Use the `.values` attribute to access the underlying NumPy array

In [3]:
cols.values

array(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY',
       'RELAFFIL', 'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS',
       'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
       'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF',
       'CURROPER', 'PCTPELL', 'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10',
       'GRAD_DEBT_MDN_SUPP'], dtype=object)

Select items from the index by position with a scalar, list, or slice

In [4]:
cols[5]

'WOMENONLY'

In [5]:
cols[[1, 8, -1]]

Index(['CITY', 'SATMTMID', 'GRAD_DEBT_MDN_SUPP'], dtype='object')

In [6]:
cols[2:6]

Index(['STABBR', 'HBCU', 'MENONLY', 'WOMENONLY'], dtype='object')

Indexes share many of the same methods as Series and DataFrames:

In [7]:
cols.min(), cols.max(), cols.isnull().sum(), cols.value_counts().sum()

('CITY', 'WOMENONLY', 0, 27)

Index objects support basic arithmetic and comparison operators:

In [8]:
cols + '_A'

Index(['INSTNM_A', 'CITY_A', 'STABBR_A', 'HBCU_A', 'MENONLY_A', 'WOMENONLY_A',
       'RELAFFIL_A', 'SATVRMID_A', 'SATMTMID_A', 'DISTANCEONLY_A', 'UGDS_A',
       'UGDS_WHITE_A', 'UGDS_BLACK_A', 'UGDS_HISP_A', 'UGDS_ASIAN_A',
       'UGDS_AIAN_A', 'UGDS_NHPI_A', 'UGDS_2MOR_A', 'UGDS_NRA_A',
       'UGDS_UNKN_A', 'PPTUG_EF_A', 'CURROPER_A', 'PCTPELL_A', 'PCTFLOAN_A',
       'UG25ABV_A', 'MD_EARN_WNE_P10_A', 'GRAD_DEBT_MDN_SUPP_A'],
      dtype='object')

In [9]:
cols > 'G'

array([ True, False,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True])

In [10]:
uniq, cnt = np.unique(cols > 'G', return_counts=True)
dict(zip(uniq, cnt))

{False: 3, True: 24}
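
Because a boolean array sums to its number of True values, the same tally can be obtained without `np.unique`. The sketch below uses a small hypothetical Index rather than the college columns, so its labels are assumptions for illustration only. It also shows the mask being reused to select the matching labels.

```python
import pandas as pd

# Hypothetical four-label Index standing in for the college columns
idx = pd.Index(['INSTNM', 'CITY', 'STABBR', 'DISTANCEONLY'])

# Comparison is element-wise and lexicographic for strings
mask = idx > 'G'

n_true = int(mask.sum())       # labels sorting after 'G'
n_false = int((~mask).sum())   # labels sorting before 'G'

# The boolean mask can also filter the Index itself
selected = idx[mask]
```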

Trying to change an Index value after its creation fails. Indexes are immutable
objects:

In [11]:
# cols[1] = 'city'  # raises TypeError: Index does not support mutable operations
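
To see the failure without stopping the notebook, the assignment can be wrapped in a `try`/`except`. This is a minimal sketch on a hypothetical three-label Index; the `renamed` step shows one possible way to get a changed Index, which is to build a new one rather than modify the existing one.

```python
import pandas as pd

# Hypothetical stand-in for the first few college columns
idx = pd.Index(['INSTNM', 'CITY', 'STABBR'])

try:
    idx[1] = 'city'            # item assignment is not supported
    mutated = True
except TypeError as err:
    mutated = False
    message = str(err)         # keep the error text for inspection

# Build a new Index instead of changing the old one in place
renamed = idx.map(lambda label: 'city' if label == 'CITY' else label)
```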

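Indexes support the set operations union, intersection, difference, and symmetric difference. The operator forms `|` and `^` shown below acted as set operations in older pandas releases; they are no longer set operations in pandas 2.0 and later, so prefer the named methods: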

In [12]:
c1 = cols[:4]
c1

Index(['INSTNM', 'CITY', 'STABBR', 'HBCU'], dtype='object')

In [13]:
c2 = cols[2:6]
c2

Index(['STABBR', 'HBCU', 'MENONLY', 'WOMENONLY'], dtype='object')

In [14]:
c1.union(c2)

Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR', 'WOMENONLY'], dtype='object')

In [15]:
c1 | c2

Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR', 'WOMENONLY'], dtype='object')

In [16]:
c1.symmetric_difference(c2)

Index(['CITY', 'INSTNM', 'MENONLY', 'WOMENONLY'], dtype='object')

In [17]:
c1 ^ c2

Index(['CITY', 'INSTNM', 'MENONLY', 'WOMENONLY'], dtype='object')
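
The cells above cover union and symmetric difference; the remaining two set operations work the same way. The sketch below rebuilds `c1` and `c2` by hand so that it runs without the college file. The ordering of the returned labels can differ between methods and pandas versions, so the results are best compared as sets.

```python
import pandas as pd

# Same labels as c1 and c2 above, typed out so the block is self-contained
c1 = pd.Index(['INSTNM', 'CITY', 'STABBR', 'HBCU'])
c2 = pd.Index(['STABBR', 'HBCU', 'MENONLY', 'WOMENONLY'])

common = c1.intersection(c2)   # labels present in both
only_c1 = c1.difference(c2)    # labels in c1 but not in c2
only_c2 = c2.difference(c1)    # difference is not symmetric

# Symmetric difference is the union of the two one-sided differences
sym = only_c1.union(only_c2)
```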

## Producing Cartesian products
Construct two Series that have indexes that are different but contain some of the
same values:

In [18]:
s1 = pd.Series(data=list(range(4)), index=list('aaab'))
s1

a    0
a    1
a    2
b    3
dtype: int64

In [19]:
s2 = pd.Series(data=list(range(6)), index=list('cababb'))
s2

c    0
a    1
b    2
a    3
b    4
b    5
dtype: int64

Add the two Series together to produce a Cartesian product. Each `a` label in s1 is
paired with every `a` label in s2, and likewise for `b`. The label `c` appears only in s2, so its result is missing:

In [20]:
s1 + s2

a    1.0
a    3.0
a    2.0
a    4.0
a    3.0
a    5.0
b    5.0
b    7.0
b    8.0
c    NaN
dtype: float64

In [21]:
(s1+s2).apply(type).unique()

array([<class 'float'>], dtype=object)
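
The length of the result can be predicted from the label counts alone. In this sketch, a label found in both indexes contributes the product of its two counts, and a label found in only one contributes its own count as missing values. It assumes the two indexes are not identical, since pandas aligns identical indexes element by element instead. The single unmatched label is also why the whole result is float: NaN forces the integer data to be upcast.

```python
import pandas as pd

s1 = pd.Series(range(4), index=list('aaab'))
s2 = pd.Series(range(6), index=list('cababb'))

# Occurrences of each label on each side
n1 = s1.index.value_counts()
n2 = s2.index.value_counts()

# Shared labels: count1 * count2. One-sided labels: count * 1.
rows_per_label = n1.mul(n2, fill_value=1)
expected_len = int(rows_per_label.sum())

actual_len = len(s1 + s2)
```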

## Exploding indexes
Add two larger Series whose indexes have only a few unique values but in
different orders. The result contains far more values than either input.

Read in the employee data and set the index to the RACE column:

In [22]:
employee = pd.read_csv('./employee.csv', index_col='RACE')
employee.head(3)

| RACE | UNIQUE_ID | POSITION_TITLE | DEPARTMENT | BASE_SALARY | EMPLOYMENT_TYPE | GENDER | EMPLOYMENT_STATUS | HIRE_DATE | JOB_DATE |
|---|---|---|---|---|---|---|---|---|---|
| Hispanic/Latino | 0 | ASSISTANT DIRECTOR (EX LVL) | Municipal Courts Department | 121862.0 | Full Time | Female | Active | 2006-06-12 | 2012-10-13 |
| Hispanic/Latino | 1 | LIBRARY ASSISTANT | Library | 26125.0 | Full Time | Female | Active | 2000-07-19 | 2010-09-18 |
| White | 2 | POLICE OFFICER | Houston Police Department-HPD | 45279.0 | Full Time | Male | Active | 2015-02-03 | 2015-02-03 |


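Select the BASE_SALARY column twice and assign each selection to its own variable. Check
whether the two selections are actually distinct objects. If `is` returns True, both names refer to the same Series and a `.copy()` is needed to get an independent one: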

In [23]:
salary_1 = employee['BASE_SALARY']
salary_2 = employee['BASE_SALARY']
salary_1 is salary_2

True

In [24]:
salary_2 = employee['BASE_SALARY'].copy()
salary_1 is salary_2

False
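
The `is` operator tests object identity, not equality of values. A minimal sketch on a hypothetical two-row DataFrame shows that a copy holds equal values at first, yet changes to it leave the original untouched:

```python
import pandas as pd

# Hypothetical two-row frame standing in for the employee data
df = pd.DataFrame({'BASE_SALARY': [100.0, 200.0]})

original = df['BASE_SALARY']
duplicate = original.copy()

same_object = original is duplicate           # identity check
same_values = original.equals(duplicate)      # value check

# Modifying the copy does not affect the original
duplicate.iloc[0] = -1.0
```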

Let's change the order of the index for one of the Series by sorting it:

In [25]:
salary_1 = salary_1.sort_index()
salary_1.head(3)

RACE
American Indian or Alaskan Native    78355.0
American Indian or Alaskan Native    26125.0
American Indian or Alaskan Native    98536.0
Name: BASE_SALARY, dtype: float64

In [27]:
salary_2.head(3)

RACE
Hispanic/Latino    121862.0
Hispanic/Latino     26125.0
White               45279.0
Name: BASE_SALARY, dtype: float64

Let's add these salary Series together:

In [28]:
salary_add = salary_1 + salary_2
salary_add.head()

RACE
American Indian or Alaskan Native    138702.0
American Indian or Alaskan Native    156710.0
American Indian or Alaskan Native    176891.0
American Indian or Alaskan Native    159594.0
American Indian or Alaskan Native    127734.0
Name: BASE_SALARY, dtype: float64

Let's create one more Series, `salary_1` added to itself, and then output the
length of each Series:

In [29]:
salary_add_1 = salary_1 + salary_1
len(salary_1), len(salary_2), len(salary_add), len(salary_add_1)

(2000, 2000, 1175424, 2000)
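
The lengths above follow a simple pattern. When a Series is added to a reordered copy of itself, each label contributes the square of its count; when the two indexes are identical, pandas skips the Cartesian product and adds element by element. The sketch below reproduces both cases on a hypothetical four-element Series rather than the employee data.

```python
import pandas as pd

# Hypothetical Series with two labels, each appearing twice
s = pd.Series([1, 2, 3, 4], index=list('baba'))
s_sorted = s.sort_index()

same = s + s             # identical indexes: no explosion
crossed = s + s_sorted   # same labels, different order: Cartesian product

# Each label contributes count ** 2 rows to the crossed result
counts = s.index.value_counts()
predicted = int((counts ** 2).sum())
```

Applying the same sum of squared label counts to the RACE index should account for the much larger `salary_add` length reported above.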

## Filling values with unequal indexes
Add together multiple Series from the baseball dataset with unequal (but
unique) indexes, using the `.add` method with the `fill_value` parameter to ensure that
there are no missing values in the result.

Read in the three baseball datasets and set playerID as the index:

In [30]:
baseball_14 = pd.read_csv('./baseball14.csv', index_col='playerID')
baseball_15 = pd.read_csv('./baseball15.csv', index_col='playerID')
baseball_16 = pd.read_csv('./baseball16.csv', index_col='playerID')

In [31]:
baseball_14.head()

| playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| altuvjo01 | 2014 | 1 | HOU | AL | 158 | 660 | 85 | 225 | 47 | 3 | 7 | 59.0 | 56.0 | 9.0 | 36 | 53.0 | 7.0 | 5.0 | 1.0 | 5.0 | 20.0 |
| cartech02 | 2014 | 1 | HOU | AL | 145 | 507 | 68 | 115 | 21 | 1 | 37 | 88.0 | 5.0 | 2.0 | 56 | 182.0 | 6.0 | 5.0 | 0.0 | 4.0 | 12.0 |
| castrja01 | 2014 | 1 | HOU | AL | 126 | 465 | 43 | 103 | 21 | 2 | 14 | 56.0 | 1.0 | 0.0 | 34 | 151.0 | 1.0 | 9.0 | 1.0 | 3.0 | 11.0 |
| corpoca01 | 2014 | 1 | HOU | AL | 55 | 170 | 22 | 40 | 6 | 0 | 6 | 19.0 | 0.0 | 0.0 | 14 | 37.0 | 0.0 | 3.0 | 1.0 | 2.0 | 3.0 |
| dominma01 | 2014 | 1 | HOU | AL | 157 | 564 | 51 | 121 | 17 | 0 | 16 | 57.0 | 0.0 | 1.0 | 29 | 125.0 | 2.0 | 5.0 | 2.0 | 7.0 | 23.0 |


Use the .difference method on the index to discover which index labels are in
baseball_14 and not in baseball_15, and vice versa

In [34]:
baseball_14.index.difference(baseball_15.index)

Index(['corpoca01', 'dominma01', 'fowlede01', 'grossro01', 'guzmaje01',
       'hoeslj01', 'krausma01', 'preslal01', 'singljo02'],
      dtype='object', name='playerID')

In [35]:
baseball_15.index.difference(baseball_14.index)


Index(['congeha01', 'correca01', 'gattiev01', 'gomezca01', 'lowrije01',
       'rasmuco01', 'tuckepr01', 'valbulu01'],
      dtype='object', name='playerID')

There are quite a few players unique to each index. Let's find out how many hits each
player has in total over the three-year period. The H column contains the number of
hits:

In [36]:
hits_14 = baseball_14['H']
hits_15 = baseball_15['H']
hits_16 = baseball_16['H']
hits_14.head()

playerID
altuvjo01    225
cartech02    115
castrja01    103
corpoca01     40
dominma01    121
Name: H, dtype: int64

Let's first add together two Series using the plus operator

In [37]:
(hits_14 + hits_15).head()

playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01      NaN
corpoca01      NaN
Name: H, dtype: float64

Player congeha01 appears only in the 2015 data and corpoca01 only in the 2014 data,
so each has a value in one Series but a missing result in the sum. Let's use the
`.add` method with the `fill_value` parameter to avoid missing values:

In [38]:
hits_14.add(hits_15, fill_value=0).head()

playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01     46.0
corpoca01     40.0
Name: H, dtype: float64

Add hits from 2016 by chaining the add method once more:

In [39]:
hits_total = hits_14.add(hits_15, fill_value=0).add(hits_16, fill_value=0)
hits_total.head()

playerID
altuvjo01    641.0
bregmal01     53.0
cartech02    193.0
castrja01    243.0
congeha01     46.0
Name: H, dtype: float64

Check for missing values in the result

In [41]:
hits_total.hasnans

False
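
The chained `.add` calls can be tried on a few hypothetical players to see how `fill_value` treats labels missing from one side. As an alternative that is not used in the steps above, the sketch also stacks the Series with `pd.concat` and sums per label, which gives the same totals and keeps the integer dtype because no missing value is ever introduced.

```python
import pandas as pd

# Hypothetical hit counts for three seasons with partly overlapping players
hits_a = pd.Series({'p1': 10, 'p2': 5})
hits_b = pd.Series({'p2': 7, 'p3': 3})
hits_c = pd.Series({'p1': 1, 'p4': 2})

# A label missing from one side is treated as 0 before adding
chained = hits_a.add(hits_b, fill_value=0).add(hits_c, fill_value=0)

# Alternative: stack every season, then sum the rows sharing a label
stacked = pd.concat([hits_a, hits_b, hits_c]).groupby(level=0).sum()
```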