---
# Chapter 2: Essential DataFrame Operations
---

In [1]:
import numpy as np
import pandas as pd

In [2]:
movies = pd.read_csv('./movie.csv')
movies.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,,,,12.0,7.1,,0


## Selecting multiple DataFrame columns


In [3]:
# Select all the actor and director columns
movie_actor_director = movies[['director_name', 
                               'actor_1_name', 
                               'actor_2_name', 
                               'actor_3_name'
                               ]]
movie_actor_director.head()

Unnamed: 0,director_name,actor_1_name,actor_2_name,actor_3_name
0,James Cameron,CCH Pounder,Joel David Moore,Wes Studi
1,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport
2,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman
3,Christopher Nolan,Tom Hardy,Christian Bale,Joseph Gordon-Levitt
4,Doug Walker,Doug Walker,Rob Walker,


Sometimes only one column of a DataFrame needs to be selected. The index
operation can return either a Series or a DataFrame. If we pass in a list
with a single item, we get back a DataFrame. If we pass in just a string with
the column name, we get back a Series:

In [4]:
type(movie_actor_director['director_name'])

pandas.core.series.Series

In [5]:
type(movie_actor_director[['director_name']])

pandas.core.frame.DataFrame

We can also use `.loc` to pull out a column by name. Because this index operation requires a row selector first, we use a colon (`:`) to indicate a slice that selects all of the rows. This can also return either a Series or a DataFrame:

In [6]:
type(movie_actor_director.loc[:, 'director_name'])

pandas.core.series.Series

In [7]:
type(movie_actor_director.loc[:, ['director_name']])

pandas.core.frame.DataFrame

Passing a long list inside the indexing operator might cause readability issues. To help with
this, you may save all your column names to a list variable first.

In [8]:
# Create a list that contains the column names
cols = ['director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name']
movie_actor_director = movies[cols]

## Selecting columns with methods

Some DataFrame methods make column selection easier: `.select_dtypes` selects columns by data type, and `.filter` selects columns by name.

Read in the movie dataset. Shorten the column names for display. Use the `.value_counts` method to output the number of columns with each specific data type:

In [9]:
def shorten(col):
  return (
      str(col)
      .replace("facebook_likes", "fb")
      .replace("_for_reviews", "")
  )

movies.rename(columns=shorten, inplace=True)

In [10]:
# dtypes counts
movies.dtypes.value_counts()

float64    13
object     12
int64       3
dtype: int64

Use the `.select_dtypes` method to select only the integer columns:

In [11]:
movies.select_dtypes(include='int').head(3)

Unnamed: 0,num_voted_users,cast_total_fb,movie_fb
0,886204,4834,33000
1,471220,48350,0
2,275868,11700,85000


If you would like to select all the numeric columns, you may pass the string `'number'` to the `include` parameter:

In [12]:
movies.select_dtypes(include='number').head()

Unnamed: 0,num_critic,duration,director_fb,actor_3_fb,actor_1_fb,gross,num_voted_users,cast_total_fb,facenumber_in_poster,num_user,budget,title_year,actor_2_fb,imdb_score,aspect_ratio,movie_fb
0,723.0,178.0,0.0,855.0,1000.0,760505847.0,886204,4834,0.0,3054.0,237000000.0,2009.0,936.0,7.9,1.78,33000
1,302.0,169.0,563.0,1000.0,40000.0,309404152.0,471220,48350,0.0,1238.0,300000000.0,2007.0,5000.0,7.1,2.35,0
2,602.0,148.0,0.0,161.0,11000.0,200074175.0,275868,11700,1.0,994.0,245000000.0,2015.0,393.0,6.8,2.35,85000
3,813.0,164.0,22000.0,23000.0,27000.0,448130642.0,1144337,106759,0.0,2701.0,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,,131.0,,131.0,,8,143,0.0,,,,12.0,7.1,,0


If we wanted integer and string columns we could do the following:

In [13]:
movies.select_dtypes(include=['int', 'object']).head(3)

Unnamed: 0,color,director_name,actor_2_name,genres,actor_1_name,movie_title,num_voted_users,cast_total_fb,actor_3_name,plot_keywords,movie_imdb_link,language,country,content_rating,movie_fb
0,Color,James Cameron,Joel David Moore,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,English,USA,PG-13,33000
1,Color,Gore Verbinski,Orlando Bloom,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,English,USA,PG-13,0
2,Color,Sam Mendes,Rory Kinnear,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,English,UK,PG-13,85000


To exclude only the floating-point columns, do the following:

In [14]:
movies.select_dtypes(exclude='float').head(3)


Unnamed: 0,color,director_name,actor_2_name,genres,actor_1_name,movie_title,num_voted_users,cast_total_fb,actor_3_name,plot_keywords,movie_imdb_link,language,country,content_rating,movie_fb
0,Color,James Cameron,Joel David Moore,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,English,USA,PG-13,33000
1,Color,Gore Verbinski,Orlando Bloom,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,English,USA,PG-13,0
2,Color,Sam Mendes,Rory Kinnear,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,English,UK,PG-13,85000


An alternative way to select columns is the `.filter` method. It is flexible and searches column names (or index labels) according to which parameter is used. Here, we use the `like` parameter to find all the Facebook columns, that is, the names that contain the exact string `fb`. The `like` parameter checks for substrings in column names:

In [15]:
movies.filter(like='fb').head()

Unnamed: 0,director_fb,actor_3_fb,actor_1_fb,cast_total_fb,actor_2_fb,movie_fb
0,0.0,855.0,1000.0,4834,936.0,33000
1,563.0,1000.0,40000.0,48350,5000.0,0
2,0.0,161.0,11000.0,11700,393.0,85000
3,22000.0,23000.0,27000.0,106759,23000.0,164000
4,131.0,,131.0,143,12.0,0


In [16]:
cols

['director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name']

In [17]:
movies.filter(items=cols).head(3)

Unnamed: 0,director_name,actor_1_name,actor_2_name,actor_3_name
0,James Cameron,CCH Pounder,Joel David Moore,Wes Studi
1,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport
2,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman


The `.filter` method allows columns to be searched with regular expressions using the regex parameter. Here, we search for all columns that have a digit somewhere in their name:

In [18]:
movies.filter(regex=r'\d').head(3)

Unnamed: 0,actor_3_fb,actor_2_name,actor_1_fb,actor_1_name,actor_3_name,actor_2_fb
0,855.0,Joel David Moore,1000.0,CCH Pounder,Wes Studi,936.0
1,1000.0,Orlando Bloom,40000.0,Johnny Depp,Jack Davenport,5000.0
2,161.0,Rory Kinnear,11000.0,Christoph Waltz,Stephanie Sigman,393.0
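
The three `.filter` parameters (`items`, `like`, and `regex`) can be compared side by side. The following is a minimal sketch on a small made-up DataFrame; the name `toy` and its values are assumptions for illustration and are not part of the movie dataset:

```python
import pandas as pd

# Hypothetical toy frame; column names mimic the shortened movie columns
toy = pd.DataFrame({
    "director_name": ["A"],
    "actor_1_name": ["B"],
    "actor_1_fb": [10],
    "movie_fb": [20],
})

# items: exact labels; labels that do not exist are skipped silently
by_items = toy.filter(items=["director_name", "not_a_column"])
# like: plain substring match on each column name
by_like = toy.filter(like="fb")
# regex: pattern searched within each column name
by_regex = toy.filter(regex=r"^actor_\d")

print(list(by_items.columns))
print(list(by_like.columns))
print(list(by_regex.columns))
```

Only one of the three parameters can be used per call, so combining criteria means chaining calls or building a single regular expression.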


## Ordering Column Names
**Guideline:**  

- Classify each column as either categorical or continuous
- Group common columns within the categorical and continuous columns
- Place the most important groups of columns first, with categorical columns before continuous ones

In [19]:
# Output all the column names and scan for similar categorical and continuous
# columns
movies.columns

Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
       'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
       'movie_fb'],
      dtype='object')

In [20]:
cat_core = ["movie_title", "title_year", "content_rating", "genres",]
movies[cat_core].head(1)

Unnamed: 0,movie_title,title_year,content_rating,genres
0,Avatar,2009.0,PG-13,Action|Adventure|Fantasy|Sci-Fi


In [21]:
cat_people = ["director_name", "actor_1_name", "actor_2_name", "actor_3_name"]
movies[cat_people].head(1)

Unnamed: 0,director_name,actor_1_name,actor_2_name,actor_3_name
0,James Cameron,CCH Pounder,Joel David Moore,Wes Studi


In [22]:
cat_other = ["color", "country", "language", "plot_keywords", "movie_imdb_link",]
movies[cat_other].head(1)

Unnamed: 0,color,country,language,plot_keywords,movie_imdb_link
0,Color,USA,English,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...


In [23]:
cont_fb = ["director_fb", "actor_1_fb", "actor_2_fb", "actor_3_fb",
           "cast_total_fb", "movie_fb",]
movies[cont_fb].head(3)

Unnamed: 0,director_fb,actor_1_fb,actor_2_fb,actor_3_fb,cast_total_fb,movie_fb
0,0.0,1000.0,936.0,855.0,4834,33000
1,563.0,40000.0,5000.0,1000.0,48350,0
2,0.0,11000.0,393.0,161.0,11700,85000


In [24]:
cont_finance = ["budget", "gross"]
movies[cont_finance].head(3)

Unnamed: 0,budget,gross
0,237000000.0,760505847.0
1,300000000.0,309404152.0
2,245000000.0,200074175.0


In [25]:
cont_num_reviews = ["num_voted_users", "num_user", "num_critic",]
movies[cont_num_reviews].head(3)

Unnamed: 0,num_voted_users,num_user,num_critic
0,886204,3054.0,723.0
1,471220,1238.0,302.0
2,275868,994.0,602.0


In [26]:
cont_other = ["imdb_score", "duration", "aspect_ratio", "facenumber_in_poster",]
movies[cont_other].head(3)

Unnamed: 0,imdb_score,duration,aspect_ratio,facenumber_in_poster
0,7.9,178.0,1.78,0.0
1,7.1,169.0,2.35,0.0
2,6.8,148.0,2.35,1.0


Concatenate all the lists together to get the final column order. Also, ensure that this
list contains all the columns from the original:

In [27]:
new_col_order = (
    cat_core
    + cat_people
    + cat_other
    + cont_fb
    + cont_finance
    + cont_num_reviews
    + cont_other

)
set(movies.columns) == set(new_col_order)

True
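
One caveat with the set comparison is that sets drop duplicates, so a label listed twice by accident would still pass. A slightly stricter check might look like the sketch below; the lists `original` and `new_order` are hypothetical stand-ins, not the movie columns:

```python
original = ["a", "b", "c"]
# Hypothetical new order that accidentally lists "a" twice
new_order = ["c", "a", "a", "b"]

same_labels = set(original) == set(new_order)          # True: sets ignore duplicates
no_duplicates = len(new_order) == len(set(new_order))  # False: "a" appears twice
is_valid_reorder = same_labels and no_duplicates

print(same_labels, no_duplicates, is_valid_reorder)
```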

Pass the list with the new column order to the indexing operator of the DataFrame to
reorder the columns:

In [28]:
movies[new_col_order].head()

Unnamed: 0,movie_title,title_year,content_rating,genres,director_name,actor_1_name,actor_2_name,actor_3_name,color,country,language,plot_keywords,movie_imdb_link,director_fb,actor_1_fb,actor_2_fb,actor_3_fb,cast_total_fb,movie_fb,budget,gross,num_voted_users,num_user,num_critic,imdb_score,duration,aspect_ratio,facenumber_in_poster
0,Avatar,2009.0,PG-13,Action|Adventure|Fantasy|Sci-Fi,James Cameron,CCH Pounder,Joel David Moore,Wes Studi,Color,USA,English,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,0.0,1000.0,936.0,855.0,4834,33000,237000000.0,760505847.0,886204,3054.0,723.0,7.9,178.0,1.78,0.0
1,Pirates of the Caribbean: At World's End,2007.0,PG-13,Action|Adventure|Fantasy,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport,Color,USA,English,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,563.0,40000.0,5000.0,1000.0,48350,0,300000000.0,309404152.0,471220,1238.0,302.0,7.1,169.0,2.35,0.0
2,Spectre,2015.0,PG-13,Action|Adventure|Thriller,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Color,UK,English,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,0.0,11000.0,393.0,161.0,11700,85000,245000000.0,200074175.0,275868,994.0,602.0,6.8,148.0,2.35,1.0
3,The Dark Knight Rises,2012.0,PG-13,Action|Thriller,Christopher Nolan,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Color,USA,English,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,22000.0,27000.0,23000.0,23000.0,106759,164000,250000000.0,448130642.0,1144337,2701.0,813.0,8.5,164.0,2.35,0.0
4,Star Wars: Episode VII - The Force Awakens,,,Documentary,Doug Walker,Doug Walker,Rob Walker,,,,,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,131.0,131.0,12.0,,143,0,,,8,,,7.1,,,0.0


## Summarizing a DataFrame

In [29]:
movies.size

137648

In [30]:
movies.ndim

2

In [31]:
len(movies)

4916
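
These three values are related: `.size` is the number of rows times the number of columns (here 4916 × 28 = 137648), and `len` returns the number of rows. A small sketch on a made-up frame (the name `toy` is an assumption for illustration) shows the same relationship through `.shape`:

```python
import pandas as pd

# Hypothetical toy frame with one missing value
toy = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, None, 6.0]})

n_rows, n_cols = toy.shape  # (rows, columns)
print(toy.shape, toy.size, toy.ndim, len(toy))
```

Note that `.size` counts every cell, including missing ones.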

In [32]:
movies.count()

color                   4897
director_name           4814
num_critic              4867
duration                4901
director_fb             4814
actor_3_fb              4893
actor_2_name            4903
actor_1_fb              4909
gross                   4054
genres                  4916
actor_1_name            4909
movie_title             4916
num_voted_users         4916
cast_total_fb           4916
actor_3_name            4893
facenumber_in_poster    4903
plot_keywords           4764
movie_imdb_link         4916
num_user                4895
language                4904
country                 4911
content_rating          4616
budget                  4432
title_year              4810
actor_2_fb              4903
imdb_score              4916
aspect_ratio            4590
movie_fb                4916
dtype: int64
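
Because `.count` reports only the non-missing values in each column, subtracting it from the number of rows gives the number of missing values. The sketch below illustrates this on a made-up frame; `toy` and its values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy frame: "x" and "y" each have one missing value
toy = pd.DataFrame({
    "x": [1.0, None, 3.0],
    "y": ["a", "b", None],
    "z": [1, 2, 3],
})

missing_from_count = len(toy) - toy.count()  # rows minus non-missing values
missing_from_isna = toy.isna().sum()         # direct count of missing values

print(missing_from_count)
```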

In [33]:
movies.min()

num_critic                                                              1
duration                                                                7
director_fb                                                             0
actor_3_fb                                                              0
actor_1_fb                                                              0
gross                                                                 162
genres                                                             Action
movie_title                                                       #Horror
num_voted_users                                                         5
cast_total_fb                                                           0
facenumber_in_poster                                                    0
movie_imdb_link         http://www.imdb.com/title/tt0006864/?ref_=fn_t...
num_user                                                                1
budget                                

In [34]:
movies.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
num_critic,4867.0,137.9889,120.2394,1.0,49.0,108.0,191.0,813.0
duration,4901.0,107.0908,25.28602,7.0,93.0,103.0,118.0,511.0
director_fb,4814.0,691.0145,2832.954,0.0,7.0,48.0,189.75,23000.0
actor_3_fb,4893.0,631.2763,1625.875,0.0,132.0,366.0,633.0,23000.0
actor_1_fb,4909.0,6494.488,15106.99,0.0,607.0,982.0,11000.0,640000.0
gross,4054.0,47644510.0,67372550.0,162.0,5019656.25,25043962.0,61108412.75,760505800.0
num_voted_users,4916.0,82644.92,138322.2,5.0,8361.75,33132.5,93772.75,1689764.0
cast_total_fb,4916.0,9579.816,18164.32,0.0,1394.75,3049.0,13616.75,656730.0
facenumber_in_poster,4903.0,1.37732,2.023826,0.0,0.0,1.0,2.0,43.0
num_user,4895.0,267.6688,372.9348,1.0,64.0,153.0,320.5,5060.0


In [35]:
movies.describe(percentiles=[.01, .3, .99]).T

Unnamed: 0,count,mean,std,min,1%,30%,50%,99%,max
num_critic,4867.0,137.9889,120.2394,1.0,2.0,60.0,108.0,546.68,813.0
duration,4901.0,107.0908,25.28602,7.0,43.0,95.0,103.0,189.0,511.0
director_fb,4814.0,691.0145,2832.954,0.0,0.0,11.0,48.0,16000.0,23000.0
actor_3_fb,4893.0,631.2763,1625.875,0.0,0.0,176.0,366.0,11000.0,23000.0
actor_1_fb,4909.0,6494.488,15106.99,0.0,6.08,694.0,982.0,44920.0,640000.0
gross,4054.0,47644510.0,67372550.0,162.0,8474.8,7914068.6,25043962.0,326412800.0,760505800.0
num_voted_users,4916.0,82644.92,138322.2,5.0,53.0,11864.5,33132.5,681584.6,1689764.0
cast_total_fb,4916.0,9579.816,18164.32,0.0,6.0,1684.5,3049.0,62413.9,656730.0
facenumber_in_poster,4903.0,1.37732,2.023826,0.0,0.0,0.0,1.0,8.0,43.0
num_user,4895.0,267.6688,372.9348,1.0,1.94,80.0,153.0,1999.24,5060.0


In [36]:
movies.min(skipna=False)

num_critic                                                            NaN
duration                                                              NaN
director_fb                                                           NaN
actor_3_fb                                                            NaN
actor_1_fb                                                            NaN
gross                                                                 NaN
genres                                                             Action
movie_title                                                       #Horror
num_voted_users                                                         5
cast_total_fb                                                           0
facenumber_in_poster                                                  NaN
movie_imdb_link         http://www.imdb.com/title/tt0006864/?ref_=fn_t...
num_user                                                              NaN
budget                                
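
The effect of `skipna` is easier to see on a tiny example. In this sketch (the frame `toy` is hypothetical), a column with a missing value returns its smallest non-missing value by default, and `NaN` when `skipna=False`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one complete column, one with a gap
toy = pd.DataFrame({
    "complete": [3.0, 1.0, 2.0],
    "has_gap": [3.0, np.nan, 2.0],
})

skipped = toy.min()                 # default skipna=True ignores NaN
propagated = toy.min(skipna=False)  # any NaN makes the result NaN

print(skipped)
print(propagated)
```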

## Chaining DataFrame methods

In [39]:
# count all the missing values in each column of the movie dataset
movies.isnull().sum()

color                    19
director_name           102
num_critic               49
duration                 15
director_fb             102
actor_3_fb               23
actor_2_name             13
actor_1_fb                7
gross                   862
genres                    0
actor_1_name              7
movie_title               0
num_voted_users           0
cast_total_fb             0
actor_3_name             23
facenumber_in_poster     13
plot_keywords           152
movie_imdb_link           0
num_user                 21
language                 12
country                   5
content_rating          300
budget                  484
title_year              106
actor_2_fb               13
imdb_score                0
aspect_ratio            326
movie_fb                  0
dtype: int64

In [42]:
movies.isnull().sum().sum()

2654

In [43]:
# A way to determine whether there are any missing values in the DataFrame 
# is to use the .any method twice in succession
movies.isnull().any().any()

True
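
Each link in these chains reduces the data by one dimension: DataFrame to Series, then Series to scalar. The sketch below names each intermediate step on a made-up frame; the variable names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a single missing value
toy = pd.DataFrame({"x": [1.0, np.nan], "y": ["a", "b"]})

mask = toy.isnull()         # DataFrame of booleans, same shape as toy
per_column = mask.sum()     # Series: missing count per column
total = per_column.sum()    # scalar: missing count overall
has_any = mask.any().any()  # scalar: is anything missing at all?

print(per_column)
print(total, has_any)
```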

Most of the columns in the movie dataset with the object data type contain missing values.
By default, in the pandas version used here, aggregation methods (`.min`, `.max`, and `.sum`) do not
return anything for these object columns (newer pandas releases may raise a `TypeError` instead):

In [45]:
c = ['color', 'director_name', 'genres']
movies[c].max()

Series([], dtype: float64)

To force pandas to return something for each column, we must fill in the missing values. Here,
we choose an empty string:

In [46]:
movies[c].fillna("").max()

color                    Color
director_name    Étienne Faure
genres                 Western
dtype: object

In [49]:
(
    movies.select_dtypes(include=['object'])
    .fillna("")
    .max()
)

color                                                          Color
director_name                                          Étienne Faure
actor_2_name                                           Zubaida Sahar
genres                                                       Western
actor_1_name                                           Óscar Jaenada
movie_title                                                 Æon Flux
actor_3_name                                           Óscar Jaenada
plot_keywords                                    zombie|zombie spoof
movie_imdb_link    http://www.imdb.com/title/tt5574490/?ref_=fn_t...
language                                                        Zulu
country                                                 West Germany
content_rating                                                     X
dtype: object

## DataFrame Operations
Arithmetic and comparison operators act on every element of a DataFrame. If the DataFrame does
not contain homogeneous data, then the operation is likely to fail.

In [50]:
colleges = pd.read_csv('./college.csv')
colleges.head()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


In [52]:
# If we try to add 5 to the colleges DataFrame, we get an error:
# TypeError: can only concatenate str (not "int") to str
# colleges + 5
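
The failure can be reproduced on a tiny mixed-type frame without loading the college data. This is a sketch under assumptions: `toy` is a made-up frame, and the string column is given an explicit `object` dtype to mirror how the strings are stored in this notebook:

```python
import pandas as pd

# Hypothetical toy frame mixing strings (object dtype) and floats
toy = pd.DataFrame({
    "name": pd.Series(["a", "b"], dtype=object),
    "frac": [0.25, 0.5],
})

try:
    toy + 5  # str + int is undefined, so the whole operation fails
    failed = False
except TypeError:
    failed = True

# Selecting homogeneous (numeric) data first makes the operator work
numeric_plus_5 = toy.select_dtypes(include="number") + 5

print(failed)
print(numeric_plus_5)
```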

To successfully use an operator with a DataFrame, first select homogeneous data. For this recipe, we will select all the columns that begin with **'UGDS_'**. These columns represent the fraction of undergraduate students by race. To get started, we select the columns we want and then use the institution name as the label for our index:

In [64]:
college_ugds = colleges.filter(like='UGDS_')


In [65]:
college_ugds.set_index(colleges.INSTNM, inplace=True)

In [66]:
college_ugds.head()

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


pandas uses banker's rounding: numbers that are exactly halfway between two candidates
are rounded to the even side. Look at what happens to the `UGDS_BLACK` value of this Series when
we round it to two decimal places:

In [68]:
name = "Northwest-Shoals Community College"
name in college_ugds.index

True

In [70]:
college_ugds.loc[name]

UGDS_WHITE    0.7912
UGDS_BLACK    0.1250
UGDS_HISP     0.0339
UGDS_ASIAN    0.0036
UGDS_AIAN     0.0088
UGDS_NHPI     0.0006
UGDS_2MOR     0.0012
UGDS_NRA      0.0033
UGDS_UNKN     0.0324
Name: Northwest-Shoals Community College, dtype: float64

In [71]:
college_ugds.loc[name].round(2)

UGDS_WHITE    0.79
UGDS_BLACK    0.12
UGDS_HISP     0.03
UGDS_ASIAN    0.00
UGDS_AIAN     0.01
UGDS_NHPI     0.00
UGDS_2MOR     0.00
UGDS_NRA      0.00
UGDS_UNKN     0.03
Name: Northwest-Shoals Community College, dtype: float64

If we add .0001 before rounding, the value rounds up instead:

In [72]:
(college_ugds.loc[name] + .0001).round(2)

UGDS_WHITE    0.79
UGDS_BLACK    0.13
UGDS_HISP     0.03
UGDS_ASIAN    0.00
UGDS_AIAN     0.01
UGDS_NHPI     0.00
UGDS_2MOR     0.00
UGDS_NRA      0.00
UGDS_UNKN     0.03
Name: Northwest-Shoals Community College, dtype: float64
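
Adding a small offset is one workaround. Another option, sketched below as an assumption rather than as part of this recipe, is Python's `decimal` module with `ROUND_HALF_UP`. The helper `round_half_up` is a hypothetical name, and the example values are chosen because they are exactly representable in binary and are therefore true ties:

```python
from decimal import Decimal, ROUND_HALF_UP

import pandas as pd

# 0.125 and 0.375 are exact in binary floating point, so both are true ties
ties = pd.Series([0.125, 0.375])

# Series.round sends ties to the even neighbor
half_even = ties.round(2)

def round_half_up(value, places="0.01"):
    """Hypothetical helper: round half up via the value's decimal string form."""
    quantized = Decimal(str(value)).quantize(Decimal(places), rounding=ROUND_HALF_UP)
    return float(quantized)

half_up = ties.map(round_half_up)

print(half_even.tolist())
print(half_up.tolist())
```

The `decimal` route is much slower than vectorized arithmetic, so it is better suited to spot checks than to large DataFrames.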

To begin our rounding adventure with operators,
we will first add .00501 to each value of `college_ugds`:

In [73]:
college_ugds + .00501

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.03831,0.94031,0.01051,0.00691,0.00741,0.00691,0.00501,0.01091,0.01881
University of Alabama at Birmingham,0.59721,0.26501,0.03331,0.05681,0.00721,0.00571,0.04181,0.02291,0.01501
Amridge University,0.30401,0.42421,0.01191,0.00841,0.00501,0.00501,0.00501,0.00501,0.27651
University of Alabama in Huntsville,0.70381,0.13051,0.04321,0.04261,0.01931,0.00521,0.02221,0.03821,0.04001
Alabama State University,0.02081,0.92581,0.01711,0.00691,0.00601,0.00561,0.01481,0.02931,0.01871
...,...,...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,,,,,,,,,
Rasmussen College - Overland Park,,,,,,,,,
National Personal Training Institute of Cleveland,,,,,,,,,
Bay Area Medical Academy - San Jose Satellite Location,,,,,,,,,


Use the floor division operator, `//`, to round down to the nearest whole number
percentage:

In [74]:
(college_ugds + .00501) // 0.01

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,3.0,94.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
University of Alabama at Birmingham,59.0,26.0,3.0,5.0,0.0,0.0,4.0,2.0,1.0
Amridge University,30.0,42.0,1.0,0.0,0.0,0.0,0.0,0.0,27.0
University of Alabama in Huntsville,70.0,13.0,4.0,4.0,1.0,0.0,2.0,3.0,4.0
Alabama State University,2.0,92.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0
...,...,...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,,,,,,,,,
Rasmussen College - Overland Park,,,,,,,,,
National Personal Training Institute of Cleveland,,,,,,,,,
Bay Area Medical Academy - San Jose Satellite Location,,,,,,,,,


To complete the rounding exercise, divide by 100:

In [75]:
college_ugds_op_round = (
    (college_ugds + 0.00501) // 0.01 / 100
)

In [76]:
college_ugds_op_round.head()

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.03,0.94,0.01,0.0,0.0,0.0,0.0,0.01,0.01
University of Alabama at Birmingham,0.59,0.26,0.03,0.05,0.0,0.0,0.04,0.02,0.01
Amridge University,0.3,0.42,0.01,0.0,0.0,0.0,0.0,0.0,0.27
University of Alabama in Huntsville,0.7,0.13,0.04,0.04,0.01,0.0,0.02,0.03,0.04
Alabama State University,0.02,0.92,0.01,0.0,0.0,0.0,0.01,0.02,0.01


Now use the `.round` DataFrame method to do the rounding automatically for us. Because
of banker's rounding, we add a small fraction before rounding:

In [77]:
college_ugds_round = (college_ugds + 0.00001).round(2)
college_ugds_round.head()

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.03,0.94,0.01,0.0,0.0,0.0,0.0,0.01,0.01
University of Alabama at Birmingham,0.59,0.26,0.03,0.05,0.0,0.0,0.04,0.02,0.01
Amridge University,0.3,0.42,0.01,0.0,0.0,0.0,0.0,0.0,0.27
University of Alabama in Huntsville,0.7,0.13,0.04,0.04,0.01,0.0,0.02,0.03,0.04
Alabama State University,0.02,0.92,0.01,0.0,0.0,0.0,0.01,0.02,0.01


Use the `.equals` DataFrame method to test the equality of two DataFrames:

In [79]:
college_ugds_round.equals(college_ugds_op_round)

True

## Comparing Missing Values
pandas uses the NumPy NaN (`np.nan`) object to represent a missing value. This is an
unusual object with interesting mathematical properties. For instance, it is not equal to
itself. By contrast, even Python's `None` object evaluates as `True` when compared to itself:

In [80]:
np.nan == np.nan

False

In [81]:
np.nan > np.nan

False

In [82]:
np.nan > np.inf

False

In [83]:
np.nan < np.inf

False

In [84]:
None == None

True

In [86]:
None == np.nan

False

In [87]:
np.nan != np.inf

True
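
Since `==` cannot detect NaN, a dedicated check is needed. The sketch below gathers a few standard options (`math.isnan`, `np.isnan`, and `pd.isna`) next to the self-inequality property; the variable names are assumptions for illustration:

```python
import math

import numpy as np
import pandas as pd

value = np.nan

equal_to_itself = value == value       # False: NaN never equals anything
unequal_to_itself = value != value     # True only for NaN
via_math = math.isnan(value)           # works on Python and NumPy floats
via_numpy = bool(np.isnan(value))      # also works element-wise on arrays
via_pandas = pd.isna(value)            # treats both NaN and None as missing
none_is_missing = pd.isna(None)

print(equal_to_itself, unequal_to_itself, via_math, via_numpy, via_pandas, none_is_missing)
```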

### Getting Ready
To get an idea of how the equality operator (`==`) works, let's compare each element to
a scalar value:

In [88]:
college_ugds == 0.0019

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,False,False,False,True,False,True,False,False,False
University of Alabama at Birmingham,False,False,False,False,False,False,False,False,False
Amridge University,False,False,False,False,False,False,False,False,False
University of Alabama in Huntsville,False,False,False,False,False,False,False,False,False
Alabama State University,False,False,False,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,False,False,False,False,False,False,False,False,False
Rasmussen College - Overland Park,False,False,False,False,False,False,False,False,False
National Personal Training Institute of Cleveland,False,False,False,False,False,False,False,False,False
Bay Area Medical Academy - San Jose Satellite Location,False,False,False,False,False,False,False,False,False


This works as expected but becomes problematic whenever you attempt to compare DataFrames with missing values. You may be tempted to use the equality operator (`==`) to compare two DataFrames with one another on an element-by-element basis.
Take, for instance, `college_ugds` compared against itself, as follows:

In [90]:
college_self_compare = college_ugds == college_ugds
college_self_compare.head()

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,True,True,True,True,True,True,True,True,True
University of Alabama at Birmingham,True,True,True,True,True,True,True,True,True
Amridge University,True,True,True,True,True,True,True,True,True
University of Alabama in Huntsville,True,True,True,True,True,True,True,True,True
Alabama State University,True,True,True,True,True,True,True,True,True


In [91]:
college_self_compare.all()

UGDS_WHITE    False
UGDS_BLACK    False
UGDS_HISP     False
UGDS_ASIAN    False
UGDS_AIAN     False
UGDS_NHPI     False
UGDS_2MOR     False
UGDS_NRA      False
UGDS_UNKN     False
dtype: bool

This happens because missing values do not compare as equal to one another.
If you tried to count missing values by comparing against `np.nan` with the equality
operator and summing the Boolean columns, you would get zero for each one:

In [92]:
(college_ugds == np.nan).sum()

UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

Instead of using `==` to find missing values, use the `.isna` method:

In [93]:
(college_ugds.isna().sum())

UGDS_WHITE    661
UGDS_BLACK    661
UGDS_HISP     661
UGDS_ASIAN    661
UGDS_AIAN     661
UGDS_NHPI     661
UGDS_2MOR     661
UGDS_NRA      661
UGDS_UNKN     661
dtype: int64
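
The contrast between `==` and `.isna` can be reproduced on a tiny frame. The sketch below uses made-up data (the `toy` DataFrame is hypothetical, not part of the college dataset) and relies on the floating-point rule that NaN is never equal to anything, including itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({'a': [0.5, np.nan], 'b': [np.nan, 0.25]})

# NaN is not equal to itself, so this comparison is False
nan_equals_itself = np.nan == np.nan

# Comparing against np.nan therefore never finds anything
found_with_eq = (toy == np.nan).sum()

# .isna() is the reliable way to locate missing values
found_with_isna = toy.isna().sum()

print(nan_equals_itself)
print(found_with_eq)
print(found_with_isna)
```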

The correct way to compare two entire DataFrames with one another is not with the equals operator (`==`) but with the `.equals` method. This method treats **NaNs** that are in the same location as equal (note that the `.eq` method is the equivalent of `==`):

In [94]:
college_ugds.equals(college_ugds)

True
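
As a small illustration of the difference, the sketch below compares a made-up frame (`toy`, hypothetical data) with a copy of itself in three ways. It also calls `pd.testing.assert_frame_equal`, a pandas testing helper that likewise treats NaNs in the same location as equal and raises an `AssertionError` when two frames differ; that helper is an addition here and not something used in this recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values
toy = pd.DataFrame({'a': [0.5, np.nan], 'b': [np.nan, 0.25]})
same = toy.copy()

elementwise = toy == same        # cells holding NaN compare as False
via_eq = toy.eq(same)            # method form of ==, same result
whole_frame = toy.equals(same)   # NaNs in matching positions count as equal

print(elementwise.all().all())
print(whole_frame)

# Passes silently because the two frames match, NaNs included
pd.testing.assert_frame_equal(toy, same)
```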

## Transposing the direction of a DataFrame operation
Many DataFrame methods have an `axis` parameter. This parameter controls the direction
in which the operation takes place. It can be set to *'index'* (or 0) or *'columns'*
(or 1).

In [95]:
college_ugds.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


Because every column of `college_ugds` holds the same kind of value, operations can be
sensibly done both vertically and horizontally. The `.count` method returns the
number of non-missing values. By default, its axis parameter is set to 0:

In [96]:
college_ugds.count()

UGDS_WHITE    6874
UGDS_BLACK    6874
UGDS_HISP     6874
UGDS_ASIAN    6874
UGDS_AIAN     6874
UGDS_NHPI     6874
UGDS_2MOR     6874
UGDS_NRA      6874
UGDS_UNKN     6874
dtype: int64

Changing the axis parameter to 'columns' changes the direction of the operation
so that we get back a count of non-missing items in each row:

In [97]:
college_ugds.count(axis='columns')

INSTNM
Alabama A & M University                                  9
University of Alabama at Birmingham                       9
Amridge University                                        9
University of Alabama in Huntsville                       9
Alabama State University                                  9
                                                         ..
SAE Institute of Technology  San Francisco                0
Rasmussen College - Overland Park                         0
National Personal Training Institute of Cleveland         0
Bay Area Medical Academy - San Jose Satellite Location    0
Excel Learning Center-San Antonio South                   0
Length: 7535, dtype: int64
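
To make the two directions concrete, here is a minimal sketch on a made-up frame (`toy` and its labels are hypothetical). With the default axis, one result is produced per column; with `axis='columns'`, one result is produced per row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three rows, two columns, some values missing
toy = pd.DataFrame(
    {'x': [1.0, np.nan, 3.0], 'y': [np.nan, np.nan, 6.0]},
    index=['r1', 'r2', 'r3'],
)

# Default (axis='index' or 0): walk down the rows, one count per column
per_column = toy.count()

# axis='columns' (or 1): walk across the columns, one count per row
per_row = toy.count(axis='columns')

print(per_column)
print(per_row)
```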

Instead of counting non-missing values, we can sum all the values in each row. Each
row of percentages should add up to 1. The `.sum` method may be used to verify this:

In [98]:
college_ugds.sum(axis='columns').head()

INSTNM
Alabama A & M University               1.0000
University of Alabama at Birmingham    0.9999
Amridge University                     1.0000
University of Alabama in Huntsville    1.0000
Alabama State University               1.0000
dtype: float64
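
Rounding in the source data means some rows total 0.9999 rather than exactly 1, so an exact equality test would be too strict. One possible check, sketched here on made-up data (`toy` is hypothetical, and the tolerance of 0.001 is an arbitrary assumption), uses `np.isclose`. Note that `.sum` skips missing values by default, so a row that is entirely missing sums to 0 rather than NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: an exact row, a rounded row, and an all-missing row
toy = pd.DataFrame(
    {'g1': [0.5, 0.5922, np.nan], 'g2': [0.5, 0.4077, np.nan]},
    index=['s1', 's2', 's3'],
)

row_totals = toy.sum(axis='columns')

# Assumed tolerance of 0.001 to absorb rounding in the published figures
close_to_one = np.isclose(row_totals, 1.0, atol=1e-3)

print(row_totals)
print(close_to_one)
```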

To get an idea of the distribution of each column, the `.median` method can be used:

In [99]:
college_ugds.median(axis='index').head()

UGDS_WHITE    0.55570
UGDS_BLACK    0.10005
UGDS_HISP     0.07140
UGDS_ASIAN    0.01290
UGDS_AIAN     0.00260
dtype: float64

The `.cumsum` method with `axis='columns'` (equivalently, *axis=1*) accumulates the race percentages across each row.
It gives a slightly different view of the data. For example, it is easy to read off the combined
percentage of *white* and *black* students for each school from the UGDS_BLACK column of the result:

In [100]:
college_ugds.cumsum(axis='columns').head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9686,0.9741,0.976,0.9784,0.9803,0.9803,0.9862,1.0
University of Alabama at Birmingham,0.5922,0.8522,0.8805,0.9323,0.9345,0.9352,0.972,0.9899,0.9999
Amridge University,0.299,0.7182,0.7251,0.7285,0.7285,0.7285,0.7285,0.7285,1.0
University of Alabama in Huntsville,0.6988,0.8243,0.8625,0.9001,0.9144,0.9146,0.9318,0.965,1.0
Alabama State University,0.0158,0.9366,0.9487,0.9506,0.9516,0.9522,0.962,0.9863,1.0
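
A minimal sketch of the same idea on made-up data (`toy` and its column names are hypothetical): each cell of the result holds the running total of its row up to and including that column, so the last column equals the row sum.

```python
import pandas as pd

# Hypothetical toy data: each row of shares adds up to 1
toy = pd.DataFrame(
    {'white': [0.25, 0.5], 'black': [0.5, 0.25], 'other': [0.25, 0.25]},
    index=['s1', 's2'],
)

# Running total from left to right within each row
running = toy.cumsum(axis='columns')

# The 'black' column now holds white + black for each school
print(running)
```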


## Determining college campus diversity
The US News top 10 most diverse colleges, with their Diversity Index, are given
as follows:

In [101]:
pd.read_csv('./college_diversity.csv', index_col='School')

Unnamed: 0_level_0,Diversity Index
School,Unnamed: 1_level_1
"Rutgers University--Newark Newark, NJ",0.76
"Andrews University Berrien Springs, MI",0.74
"Stanford University Stanford, CA",0.74
"University of Houston Houston, TX",0.74
"University of Nevada--Las Vegas Las Vegas, NV",0.74
"University of San Francisco San Francisco, CA",0.74
"San Francisco State University San Francisco, CA",0.73
"University of Illinois--Chicago Chicago, IL",0.73
"New Jersey Institute of Technology Newark, NJ",0.72
"Texas Woman's University Denton, TX",0.72


Our college dataset classifies race into nine different categories. When trying to quantify something without an obvious definition, such as diversity, it helps to start with something simple. In this recipe, our diversity metric will equal the number of races that make up at least 15% of the student population.

Many of these colleges have missing values for all their race columns. We can count
all the missing values for each row and sort the resulting Series from the highest
to lowest.

In [106]:
(
    college_ugds.isnull()
    .sum(axis='columns')
    .sort_values(ascending=False)
    .head()
)


INSTNM
Excel Learning Center-San Antonio South         9
Philadelphia College of Osteopathic Medicine    9
Assemblies of God Theological Seminary          9
Episcopal Divinity School                       9
Phillips Graduate Institute                     9
dtype: int64

Now that we have seen the colleges that are missing all their race columns, we can use the `.dropna` method to drop all rows that have all nine race percentages missing. We can then count the remaining missing values:

In [109]:
college_ugds = college_ugds.dropna(how='all')
college_ugds.isnull().sum()

UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64
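
The `how` argument decides how aggressive `.dropna` is. The sketch below, on made-up data (`toy` and its row labels are hypothetical), contrasts `how='all'`, which removes only rows where every value is missing, with the default `how='any'`, which removes rows with at least one missing value. By default `.dropna` returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a complete row, a partly missing row, an empty row
toy = pd.DataFrame(
    {'a': [0.5, np.nan, np.nan], 'b': [0.5, 0.25, np.nan]},
    index=['full', 'partial', 'empty'],
)

kept_all = toy.dropna(how='all')   # drops only the 'empty' row
kept_any = toy.dropna(how='any')   # drops 'partial' and 'empty'

print(kept_all.index.tolist())
print(kept_any.index.tolist())
print(toy.shape)                   # the original frame is unchanged
```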

There are no missing values left in the dataset. We can now calculate our diversity metric. To get started, we will use the greater than or equal DataFrame method, `.ge`, to return a DataFrame with a Boolean value for each cell:

In [110]:
college_ugds.ge(0.15).head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,False,True,False,False,False,False,False,False,False
University of Alabama at Birmingham,True,True,False,False,False,False,False,False,False
Amridge University,True,True,False,False,False,False,False,False,True
University of Alabama in Huntsville,True,False,False,False,False,False,False,False,False
Alabama State University,False,True,False,False,False,False,False,False,False


From here, we can use the `.sum` method to count the True values for each college.
Notice that a Series is returned:

In [111]:
diversity_metric = college_ugds.ge(.15).sum(axis='columns')
diversity_metric.head()

INSTNM
Alabama A & M University               1
University of Alabama at Birmingham    2
Amridge University                     3
University of Alabama in Huntsville    1
Alabama State University               1
dtype: int64

To get an idea of the distribution, we will use the `.value_counts` method on this
Series:

In [112]:
diversity_metric.value_counts()

1    3042
2    2884
3     876
4      63
0       7
5       2
dtype: int64
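
The whole metric can be traced by hand on a small made-up frame (`toy`, its labels, and its values are hypothetical). Summing Booleans counts the `True` cells because `True` is treated as 1. The sketch also shows that `.ge` includes values exactly at the threshold, whereas `.gt` does not.

```python
import pandas as pd

# Hypothetical toy data: three schools, three groups, rows add up to 1
toy = pd.DataFrame(
    {'g1': [0.90, 0.40, 0.15], 'g2': [0.05, 0.40, 0.15], 'g3': [0.05, 0.20, 0.70]},
    index=['s1', 's2', 's3'],
)

# Count, per school, the groups with a share of at least 15%
metric = toy.ge(0.15).sum(axis='columns')

# A strict comparison excludes shares of exactly 15%
strict_metric = toy.gt(0.15).sum(axis='columns')

print(metric)
print(strict_metric)
print(metric.value_counts())
```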

Amazingly, two schools have at least 15% in five different race categories. Let's
sort the `diversity_metric` Series to find out which ones they are:

In [113]:
diversity_metric.sort_values(ascending=False).head()

INSTNM
Regency Beauty Institute-Austin          5
Central Texas Beauty College-Temple      5
Sullivan and Cogliano Training Center    4
Ambria College of Nursing                4
Berkeley College-New York                4
dtype: int64

It seems a little suspicious that schools can be that diverse. Let's look at the raw percentages from these top two schools.

In [114]:
college_ugds.loc[['Regency Beauty Institute-Austin', 'Central Texas Beauty College-Temple']]

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Regency Beauty Institute-Austin,0.1867,0.2133,0.16,0.0,0.0,0.0,0.1733,0.0,0.2667
Central Texas Beauty College-Temple,0.1616,0.2323,0.2626,0.0202,0.0,0.0,0.1717,0.0,0.1515


It appears that several categories were aggregated into the unknown and two or more
races columns. Regardless of this, they both appear to be quite diverse. We can see
how the top five US News schools fared with this basic diversity metric:

In [115]:
us_news_top = ['Rutgers University-Newark',
                  'Andrews University',
                  'Stanford University',
                  'University of Houston',
                  'University of Nevada-Las Vegas']

diversity_metric.loc[us_news_top]

INSTNM
Rutgers University-Newark         4
Andrews University                3
Stanford University               3
University of Houston             3
University of Nevada-Las Vegas    3
dtype: int64

Alternatively, we can find the schools that are least diverse by ordering them by their
maximum race percentage:

In [117]:
(
    college_ugds.max(axis='columns')
    .sort_values(ascending=False)
    .head()
)

INSTNM
Dewey University-Manati                     1.0
Yeshiva and Kollel Harbotzas Torah          1.0
Mr Leon's School of Hair Design-Lewiston    1.0
Dewey University-Bayamon                    1.0
Shepherds Theological Seminary              1.0
dtype: float64
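
The maximum alone does not say which category dominates. As a possible extension, not part of the recipe itself, `.idxmax(axis='columns')` returns the column label of the largest value in each row. The sketch uses made-up data (`toy` and its labels are hypothetical).

```python
import pandas as pd

# Hypothetical toy data: two schools, two groups
toy = pd.DataFrame(
    {'g1': [1.0, 0.4], 'g2': [0.0, 0.6]},
    index=['s1', 's2'],
)

largest_share = toy.max(axis='columns')      # size of the biggest group per school
largest_group = toy.idxmax(axis='columns')   # label of the biggest group per school

print(largest_share)
print(largest_group)
```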

We can also determine if any school has all nine race categories exceeding 1%:

In [118]:
(college_ugds > .01).all(axis=1).any()

True
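
The chained expression above reads from the inside out: the comparison yields a Boolean DataFrame, `.all(axis=1)` reduces each row to a single Boolean, and `.any()` asks whether at least one row passed. A minimal sketch on made-up data (`toy` and its labels are hypothetical) shows each intermediate step.

```python
import pandas as pd

# Hypothetical toy data: s1 has every group above 1%, s2 does not
toy = pd.DataFrame(
    {'g1': [0.5, 0.98], 'g2': [0.3, 0.02], 'g3': [0.2, 0.0]},
    index=['s1', 's2'],
)

above = toy > 0.01                           # Boolean DataFrame, one cell per value
every_category = above.all(axis='columns')   # True only if the whole row is True
at_least_one_school = every_category.any()   # True if any school qualifies

print(every_category)
print(at_least_one_school)
```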