# <u>Chapter 2: Essential DataFrame Operations</u>

## <u>Recipes</u>
* [Selecting multiple DataFrame columns](#Selecting-multiple-DataFrame-columns)
* [Selecting columns with methods](#Selecting-columns-with-methods)
* [Ordering column names sensibly](#Ordering-column-names-sensibly)
* [Operating on the entire DataFrame](#Operating-on-the-entire-DataFrame)
* [Chaining DataFrame methods together](#Chaining-DataFrame-methods-together)
* [Working with operators on a DataFrame](#Working-with-operators-on-a-DataFrame)
* [Comparing missing values](#Comparing-missing-values)
* [Transposing the direction of a DataFrame operation](#Transposing-the-direction-of-a-DataFrame-operation)
* [Determining college campus diversity](#Determining-college-campus-diversity)

In [54]:
import numpy as np
import pandas as pd

In [55]:
pd.set_option('display.max_rows', 20, 'display.max_columns', 20)

In [56]:
pwd

'C:\\Users\\User\\Desktop\\PandasCookbook'

In [57]:
movie = pd.read_csv('data/movie.csv')

## <u>Selecting multiple DataFrame columns</u>

In [58]:
#Pass a list of column names to the indexing operator. Without the inner brackets, pandas treats the names as a single tuple key and raises a KeyError:
movie_actor_director = movie[['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']]

In [59]:
movie_actor_director.head()

Unnamed: 0,actor_1_name,actor_2_name,actor_3_name,director_name
0,CCH Pounder,Joel David Moore,Wes Studi,James Cameron
1,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski
2,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes
3,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan
4,Doug Walker,Rob Walker,,Doug Walker


In [60]:
#Getting one column from a dataset in the form of a Series:
movie['director_name'].head() 

0        James Cameron
1       Gore Verbinski
2           Sam Mendes
3    Christopher Nolan
4          Doug Walker
Name: director_name, dtype: object

In [61]:
#Getting one column in the form of a DataFrame:
#Note the double brackets: a one-item list uses the same syntax as fetching multiple columns
movie[['director_name']].head()

Unnamed: 0,director_name
0,James Cameron
1,Gore Verbinski
2,Sam Mendes
3,Christopher Nolan
4,Doug Walker


In [62]:
#A long list inside the indexing operator hurts readability. Storing the column names in a separate list variable first is recommended:
cols = ['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']

In [63]:
movie_actor_director = movie[cols]
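
The bracket rules above can be checked on a small frame. This is a minimal sketch that assumes a hypothetical three-column `toy` DataFrame standing in for `movie.csv`; the names `toy` and `raised` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for data/movie.csv
toy = pd.DataFrame({
    'director_name': ['James Cameron', 'Gore Verbinski'],
    'actor_1_name': ['CCH Pounder', 'Johnny Depp'],
    'imdb_score': [7.9, 7.1],
})

one_col_series = toy['director_name']       # single brackets give a Series
one_col_frame = toy[['director_name']]      # a one-item list gives a DataFrame
two_cols = toy[['director_name', 'actor_1_name']]

# Without the inner brackets the two names form one tuple key,
# which is not a column label, so pandas raises a KeyError
try:
    toy['director_name', 'actor_1_name']
    raised = False
except KeyError:
    raised = True

print(type(one_col_series).__name__, type(one_col_frame).__name__, raised)
```

Keeping the list in its own variable, as with `cols` above, also makes it easy to reuse the same selection later.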

## <u>Selecting columns with methods</u>

In [64]:
#Counting the number of features with a specific data type:
movie.dtypes.value_counts()

float64    13
object     12
int64       3
dtype: int64

In [65]:
#Getting a dataset with only integers out of the initial dataset:
integers = movie.select_dtypes(include = ['int64']).head()
integers.head()

Unnamed: 0,num_voted_users,cast_total_facebook_likes,movie_facebook_likes
0,886204,4834,33000
1,471220,48350,0
2,275868,11700,85000
3,1144337,106759,164000
4,8,143,0


In [66]:
#Getting a dataset with only floats out of the initial dataset:
floats = movie.select_dtypes(include=['float']).head()
floats

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio
0,723.0,178.0,0.0,855.0,1000.0,760505847.0,0.0,3054.0,237000000.0,2009.0,936.0,7.9,1.78
1,302.0,169.0,563.0,1000.0,40000.0,309404152.0,0.0,1238.0,300000000.0,2007.0,5000.0,7.1,2.35
2,602.0,148.0,0.0,161.0,11000.0,200074175.0,1.0,994.0,245000000.0,2015.0,393.0,6.8,2.35
3,813.0,164.0,22000.0,23000.0,27000.0,448130642.0,0.0,2701.0,250000000.0,2012.0,23000.0,8.5,2.35
4,,,131.0,,131.0,,0.0,,,,12.0,7.1,


In [67]:
#Getting a dataset with only numbers out of the initial dataset:
numbers = movie.select_dtypes(include=['number']).head()
numbers

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,num_voted_users,cast_total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,723.0,178.0,0.0,855.0,1000.0,760505847.0,886204,4834,0.0,3054.0,237000000.0,2009.0,936.0,7.9,1.78,33000
1,302.0,169.0,563.0,1000.0,40000.0,309404152.0,471220,48350,0.0,1238.0,300000000.0,2007.0,5000.0,7.1,2.35,0
2,602.0,148.0,0.0,161.0,11000.0,200074175.0,275868,11700,1.0,994.0,245000000.0,2015.0,393.0,6.8,2.35,85000
3,813.0,164.0,22000.0,23000.0,27000.0,448130642.0,1144337,106759,0.0,2701.0,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,,131.0,,131.0,,8,143,0.0,,,,12.0,7.1,,0


In [68]:
#Getting a dataset with only objects out of the initial dataset:
objects = movie.select_dtypes(include = ['object']).head()
objects

Unnamed: 0,color,director_name,actor_2_name,genres,actor_1_name,movie_title,actor_3_name,plot_keywords,movie_imdb_link,language,country,content_rating
0,Color,James Cameron,Joel David Moore,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,Wes Studi,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,English,USA,PG-13
1,Color,Gore Verbinski,Orlando Bloom,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,Jack Davenport,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,English,USA,PG-13
2,Color,Sam Mendes,Rory Kinnear,Action|Adventure|Thriller,Christoph Waltz,Spectre,Stephanie Sigman,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,English,UK,PG-13
3,Color,Christopher Nolan,Christian Bale,Action|Thriller,Tom Hardy,The Dark Knight Rises,Joseph Gordon-Levitt,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,English,USA,PG-13
4,,Doug Walker,Rob Walker,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens,,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,


In [69]:
#Getting a dataset with only complex numbers out of the initial dataset (there are none, so no columns are returned):
complex_data = movie.select_dtypes(include = ['complex'])
complex_data.shape

(4916, 0)

In [70]:
complex_data.head()

0
1
2
3
4


In [71]:
#Selecting the columns whose names contain a given substring anywhere in the name:
facebook = movie.filter(like = 'facebook').head()
facebook

Unnamed: 0,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,cast_total_facebook_likes,actor_2_facebook_likes,movie_facebook_likes
0,0.0,855.0,1000.0,4834,936.0,33000
1,563.0,1000.0,40000.0,48350,5000.0,0
2,0.0,161.0,11000.0,11700,393.0,85000
3,22000.0,23000.0,27000.0,106759,23000.0,164000
4,131.0,,131.0,143,12.0,0


In [72]:
likes = movie.filter(like = 'likes')
likes.head()

Unnamed: 0,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,cast_total_facebook_likes,actor_2_facebook_likes,movie_facebook_likes
0,0.0,855.0,1000.0,4834,936.0,33000
1,563.0,1000.0,40000.0,48350,5000.0,0
2,0.0,161.0,11000.0,11700,393.0,85000
3,22000.0,23000.0,27000.0,106759,23000.0,164000
4,131.0,,131.0,143,12.0,0


In [73]:
likes.shape

(4916, 6)

In [74]:
movie.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

In [75]:
user = movie.filter(like = 'user').head()
user

Unnamed: 0,num_voted_users,num_user_for_reviews
0,886204,3054.0
1,471220,1238.0
2,275868,994.0
3,1144337,2701.0
4,8,


In [76]:
#Searching for all columns with a digit in their name:
movie_digits = movie.filter(regex = r'\d').head()
movie_digits

Unnamed: 0,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,actor_1_name,actor_3_name,actor_2_facebook_likes
0,855.0,Joel David Moore,1000.0,CCH Pounder,Wes Studi,936.0
1,1000.0,Orlando Bloom,40000.0,Johnny Depp,Jack Davenport,5000.0
2,161.0,Rory Kinnear,11000.0,Christoph Waltz,Stephanie Sigman,393.0
3,23000.0,Christian Bale,27000.0,Tom Hardy,Joseph Gordon-Levitt,23000.0
4,,Rob Walker,131.0,Doug Walker,,12.0
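
The selection methods in this recipe can be compared side by side on a small frame. The sketch below assumes a hypothetical four-column `toy` DataFrame; it also uses the `items` parameter of `filter`, which this notebook does not otherwise show and which selects exact column names.

```python
import pandas as pd

# Hypothetical toy data with a mix of dtypes and naming patterns
toy = pd.DataFrame({
    'actor_1_name': ['CCH Pounder', 'Johnny Depp'],
    'actor_1_facebook_likes': [1000.0, 40000.0],
    'movie_facebook_likes': [33000, 0],
    'imdb_score': [7.9, 7.1],
})

numeric = toy.select_dtypes(include='number')   # ints and floats
floats = toy.select_dtypes(include='float')     # floats only

by_items = toy.filter(items=['imdb_score', 'actor_1_name'])  # exact names
by_like = toy.filter(like='facebook')                        # substring match
by_regex = toy.filter(regex=r'\d')                           # names containing a digit

print(list(by_regex.columns))
```

Only one of `items`, `like`, and `regex` can be passed in a single `filter` call.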


## <u>Ordering column names sensibly</u>

<p>There is no specific order for organizing columns in a dataset. The following is a simple guideline to ordering columns:</p>
<ul>
    <li>Classify each column as either discrete or continuous</li>
    <li>Group common columns within the discrete or continuous columns</li>
    <li>Place the most important groups of columns first, with discrete (categorical) columns before continuous ones</li>
</ul>

In [77]:
#Output all the column names:
movie.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

In [78]:
#There is no logical ordering of the column names. You can organize them as follows:
disc_core = ['movie_title', 'title_year', 'content_rating', 'genres']

In [79]:
disc_people = ['director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name']

In [80]:
disc_other = ['color', 'country', 'language', 'plot_keywords', 'movie_imdb_link']

In [81]:
cont_fb = ['director_facebook_likes', 'actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes', 'cast_total_facebook_likes', 'movie_facebook_likes']

In [82]:
cont_finance = ['budget', 'gross']

In [83]:
cont_num_reviews = ['num_voted_users', 'num_user_for_reviews', 'num_critic_for_reviews']

In [84]:
cont_other = ['imdb_score', 'duration', 'aspect_ratio', 'facenumber_in_poster']

In [85]:
#Concatenate all of these lists together to get a new column order
new_col_order = disc_core + disc_people + \
                disc_other + cont_fb + \
                cont_finance + cont_num_reviews + \
                cont_other

In [86]:
new_col_order

['movie_title',
 'title_year',
 'content_rating',
 'genres',
 'director_name',
 'actor_1_name',
 'actor_2_name',
 'actor_3_name',
 'color',
 'country',
 'language',
 'plot_keywords',
 'movie_imdb_link',
 'director_facebook_likes',
 'actor_1_facebook_likes',
 'actor_2_facebook_likes',
 'actor_3_facebook_likes',
 'cast_total_facebook_likes',
 'movie_facebook_likes',
 'budget',
 'gross',
 'num_voted_users',
 'num_user_for_reviews',
 'num_critic_for_reviews',
 'imdb_score',
 'duration',
 'aspect_ratio',
 'facenumber_in_poster']

In [87]:
set(movie.columns) == set(new_col_order)

True

In [88]:
movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [89]:
movie2 = movie[new_col_order]

In [90]:
movie2.head()

Unnamed: 0,movie_title,title_year,content_rating,genres,director_name,actor_1_name,actor_2_name,actor_3_name,color,country,...,movie_facebook_likes,budget,gross,num_voted_users,num_user_for_reviews,num_critic_for_reviews,imdb_score,duration,aspect_ratio,facenumber_in_poster
0,Avatar,2009.0,PG-13,Action|Adventure|Fantasy|Sci-Fi,James Cameron,CCH Pounder,Joel David Moore,Wes Studi,Color,USA,...,33000,237000000.0,760505847.0,886204,3054.0,723.0,7.9,178.0,1.78,0.0
1,Pirates of the Caribbean: At World's End,2007.0,PG-13,Action|Adventure|Fantasy,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport,Color,USA,...,0,300000000.0,309404152.0,471220,1238.0,302.0,7.1,169.0,2.35,0.0
2,Spectre,2015.0,PG-13,Action|Adventure|Thriller,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Color,UK,...,85000,245000000.0,200074175.0,275868,994.0,602.0,6.8,148.0,2.35,1.0
3,The Dark Knight Rises,2012.0,PG-13,Action|Thriller,Christopher Nolan,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Color,USA,...,164000,250000000.0,448130642.0,1144337,2701.0,813.0,8.5,164.0,2.35,0.0
4,Star Wars: Episode VII - The Force Awakens,,,Documentary,Doug Walker,Doug Walker,Rob Walker,,,,...,0,,,8,,,7.1,,,0.0


<p>Hadley Wickham's paper on tidy data suggests a different ordering: fixed variables first, followed by measured variables. Under that scheme, it may make sense to place each actor's Facebook likes directly after that actor's name. If the data comes from a relational database, the first column is often the primary key.</p>
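
As a rough illustration of that alternative ordering, the sketch below places each person's likes right after the person's name. The `toy` frame and the split into `fixed`, `people`, and `measured` lists are assumptions made for this example, not a classification taken from the paper.

```python
import pandas as pd

# Hypothetical one-row frame with columns in an arbitrary order
toy = pd.DataFrame({
    'actor_1_facebook_likes': [1000.0],
    'movie_title': ['Avatar'],
    'actor_1_name': ['CCH Pounder'],
    'gross': [760505847.0],
    'director_name': ['James Cameron'],
    'director_facebook_likes': [0.0],
})

fixed = ['movie_title']
people = ['director_name', 'director_facebook_likes',
          'actor_1_name', 'actor_1_facebook_likes']
measured = ['gross']
new_order = fixed + people + measured

# Set equality alone would miss a duplicated name, so compare lengths too
assert len(new_order) == len(toy.columns)
assert set(new_order) == set(toy.columns)

toy2 = toy[new_order]
print(list(toy2.columns))
```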

## <u>Operating on the entire DataFrame</u>

In [91]:
movie.shape

(4916, 28)

In [92]:
movie.size

137648

In [93]:
movie.ndim

2

In [94]:
len(movie)

4916

In [95]:
movie.count()

color                      4897
director_name              4814
num_critic_for_reviews     4867
duration                   4901
director_facebook_likes    4814
                           ... 
title_year                 4810
actor_2_facebook_likes     4903
imdb_score                 4916
aspect_ratio               4590
movie_facebook_likes       4916
Length: 28, dtype: int64

In [96]:
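#Restrict the calculation to numeric columns; recent pandas versions raise an error on object columns otherwise:
movie.std(numeric_only = True)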

num_critic_for_reviews       1.202394e+02
duration                     2.528602e+01
director_facebook_likes      2.832954e+03
actor_3_facebook_likes       1.625875e+03
actor_1_facebook_likes       1.510699e+04
gross                        6.737255e+07
num_voted_users              1.383222e+05
cast_total_facebook_likes    1.816432e+04
facenumber_in_poster         2.023826e+00
num_user_for_reviews         3.729348e+02
budget                       1.002427e+08
title_year                   1.245398e+01
actor_2_facebook_likes       4.011300e+03
imdb_score                   1.127802e+00
aspect_ratio                 1.402940e+00
movie_facebook_likes         1.920602e+04
dtype: float64

In [97]:
movie.describe()

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,num_voted_users,cast_total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
count,4867.0,4901.0,4814.0,4893.0,4909.0,4054.0,4916.0,4916.0,4903.0,4895.0,4432.0,4810.0,4903.0,4916.0,4590.0,4916.0
mean,137.988905,107.090798,691.014541,631.276313,6494.488491,47644510.0,82644.92,9579.815907,1.37732,267.668846,36547490.0,2002.447609,1621.923516,6.437429,2.222349,7348.294142
std,120.239379,25.286015,2832.954125,1625.874802,15106.986884,67372550.0,138322.2,18164.31699,2.023826,372.934839,100242700.0,12.453977,4011.299523,1.127802,1.40294,19206.016458
min,1.0,7.0,0.0,0.0,0.0,162.0,5.0,0.0,0.0,1.0,218.0,1916.0,0.0,1.6,1.18,0.0
25%,49.0,93.0,7.0,132.0,607.0,5019656.0,8361.75,1394.75,0.0,64.0,6000000.0,1999.0,277.0,5.8,1.85,0.0
50%,108.0,103.0,48.0,366.0,982.0,25043960.0,33132.5,3049.0,1.0,153.0,19850000.0,2005.0,593.0,6.6,2.35,159.0
75%,191.0,118.0,189.75,633.0,11000.0,61108410.0,93772.75,13616.75,2.0,320.5,43000000.0,2011.0,912.0,7.2,2.35,2000.0
max,813.0,511.0,23000.0,23000.0,640000.0,760505800.0,1689764.0,656730.0,43.0,5060.0,4200000000.0,2016.0,137000.0,9.5,16.0,349000.0


In [98]:
#Pandas skips missing values by default. To change this:
movie.min(skipna = False)

num_critic_for_reviews                                                     NaN
duration                                                                   NaN
director_facebook_likes                                                    NaN
actor_3_facebook_likes                                                     NaN
actor_1_facebook_likes                                                     NaN
gross                                                                      NaN
genres                                                                  Action
movie_title                                                            #Horror
num_voted_users                                                              5
cast_total_facebook_likes                                                    0
facenumber_in_poster                                                       NaN
movie_imdb_link              http://www.imdb.com/title/tt0006864/?ref_=fn_t...
num_user_for_reviews                                
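
The effect of `skipna` is easier to see on a frame small enough to check by hand. This sketch assumes a hypothetical two-column `toy` DataFrame with one missing duration; it also shows how `size`, `shape`, and `count` relate.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value in 'duration'
toy = pd.DataFrame({
    'duration': [178.0, np.nan, 148.0],
    'imdb_score': [7.9, 7.1, 6.8],
})

default_min = toy.min()              # missing values are skipped
strict_min = toy.min(skipna=False)   # a single NaN makes the result NaN
non_missing = toy.count()            # number of non-missing values per column

# size is the total number of cells: rows times columns
cells = toy.shape[0] * toy.shape[1]

print(default_min['duration'], strict_min['duration'], toy.size == cells)
```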

## <u>Chaining DataFrame methods together</u>

In [99]:
#Flagging each missing value with True (first five rows shown):
movie.isnull().head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,True,False,True,True,False,True,False,False,True,False,...,True,True,True,True,True,True,False,False,True,False


In [100]:
#Checking whether any column consists entirely of missing values:
golf = movie.isnull().sum(axis = 0) == len(movie)
golf

color                      False
director_name              False
num_critic_for_reviews     False
duration                   False
director_facebook_likes    False
                           ...  
title_year                 False
actor_2_facebook_likes     False
imdb_score                 False
aspect_ratio               False
movie_facebook_likes       False
Length: 28, dtype: bool

In [101]:
golf.value_counts()

False    28
dtype: int64

In [102]:
movie.shape

(4916, 28)

In [103]:
movie.isnull().sum()

color                       19
director_name              102
num_critic_for_reviews      49
duration                    15
director_facebook_likes    102
                          ... 
title_year                 106
actor_2_facebook_likes      13
imdb_score                   0
aspect_ratio               326
movie_facebook_likes         0
Length: 28, dtype: int64

In [104]:
#Getting the total number of missing values as a scalar value:
movie.isnull().sum().sum()

2654

In [105]:
movie.isnull().dtypes.value_counts()

bool    28
dtype: int64

In [106]:
#Object columns do not support every aggregation. Here color contains missing values, so it is dropped and only movie_title appears in the result:
movie[['color', 'movie_title', 'color']].max()

movie_title    Æon Flux
dtype: object
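
The chain used in this recipe reduces one dimension at a time: a DataFrame of booleans, then a Series of counts, then a scalar. A minimal sketch on a hypothetical `toy` frame follows; the `any().any()` and `all()` variants are common companions to `sum().sum()` and are added here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values in total
toy = pd.DataFrame({
    'color': ['Color', np.nan, 'Color'],
    'duration': [178.0, np.nan, 148.0],
    'imdb_score': [7.9, 7.1, 6.8],
})

missing_per_column = toy.isnull().sum()   # DataFrame -> Series of counts
total_missing = toy.isnull().sum().sum()  # Series -> scalar
has_missing = toy.isnull().any().any()    # True if at least one value is missing
fully_missing = toy.isnull().all()        # True for columns with no data at all

print(total_missing, has_missing)
```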

## <u>Working with operators on a DataFrame</u>

In [107]:
#If the DataFrame does not hold homogeneous data, an arithmetic operation is likely to fail
college = pd.read_csv('data/college.csv')

In [108]:
#We will work with homogeneous data by selecting the columns whose names contain UGDS_:
college_ugds = college.filter(like = 'UGDS_')
college_ugds.head()

Unnamed: 0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
1,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
2,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
3,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
4,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


In [109]:
#First step of rounding to the nearest hundredth: add .00501 so that the floor division in the next cell rounds half up:
college_ugds + .00501

Unnamed: 0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
0,0.03831,0.94031,0.01051,0.00691,0.00741,0.00691,0.00501,0.01091,0.01881
1,0.59721,0.26501,0.03331,0.05681,0.00721,0.00571,0.04181,0.02291,0.01501
2,0.30401,0.42421,0.01191,0.00841,0.00501,0.00501,0.00501,0.00501,0.27651
3,0.70381,0.13051,0.04321,0.04261,0.01931,0.00521,0.02221,0.03821,0.04001
4,0.02081,0.92581,0.01711,0.00691,0.00601,0.00561,0.01481,0.02931,0.01871
...,...,...,...,...,...,...,...,...,...
7530,,,,,,,,,
7531,,,,,,,,,
7532,,,,,,,,,
7533,,,,,,,,,


In [110]:
#Use the floor division to round to the nearest whole number percentage:
(college_ugds + .00501) // .01

Unnamed: 0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
0,3.0,94.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
1,59.0,26.0,3.0,5.0,0.0,0.0,4.0,2.0,1.0
2,30.0,42.0,1.0,0.0,0.0,0.0,0.0,0.0,27.0
3,70.0,13.0,4.0,4.0,1.0,0.0,2.0,3.0,4.0
4,2.0,92.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0
...,...,...,...,...,...,...,...,...,...
7530,,,,,,,,,
7531,,,,,,,,,
7532,,,,,,,,,
7533,,,,,,,,,


In [111]:
#Dividing by 100 turns the whole-number percentages back into fractions:
college_ugds_op_round = (college_ugds + .00501) // .01 /100
college_ugds_op_round

Unnamed: 0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
0,0.03,0.94,0.01,0.00,0.00,0.0,0.00,0.01,0.01
1,0.59,0.26,0.03,0.05,0.00,0.0,0.04,0.02,0.01
2,0.30,0.42,0.01,0.00,0.00,0.0,0.00,0.00,0.27
3,0.70,0.13,0.04,0.04,0.01,0.0,0.02,0.03,0.04
4,0.02,0.92,0.01,0.00,0.00,0.0,0.01,0.02,0.01
...,...,...,...,...,...,...,...,...,...
7530,,,,,,,,,
7531,,,,,,,,,
7532,,,,,,,,,
7533,,,,,,,,,


In [112]:
#Now use the round DataFrame method to do the rounding automatically for us.
#NumPy rounds values that lie exactly halfway to the nearest even digit. To match the round-half-up result above, we add a small fraction before rounding:
college_ugds_round = (college_ugds + .00001).round(2)
college_ugds_round.head()

Unnamed: 0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
0,0.03,0.94,0.01,0.0,0.0,0.0,0.0,0.01,0.01
1,0.59,0.26,0.03,0.05,0.0,0.0,0.04,0.02,0.01
2,0.3,0.42,0.01,0.0,0.0,0.0,0.0,0.0,0.27
3,0.7,0.13,0.04,0.04,0.01,0.0,0.02,0.03,0.04
4,0.02,0.92,0.01,0.0,0.0,0.0,0.01,0.02,0.01


In [113]:
college_ugds_op_round.equals(college_ugds_round)

True

In [114]:
#Operators can be replaced with their method equivalents:
college_ugds_op_round_methods = college_ugds.add(.00501) \
                                .floordiv(.01) \
                                .div(100)

In [115]:
college_ugds_op_round_methods.equals(college_ugds_op_round)

True
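
Two points from this recipe can be verified in isolation: NumPy's round-half-to-even behaviour, and the equivalence of operators and their method counterparts. The sketch assumes a hypothetical two-row `toy` frame that reuses values from the first two rows of `college_ugds`.

```python
import numpy as np
import pandas as pd

# Exactly-halfway values go to the nearest even integer
half_cases = np.round([0.5, 1.5, 2.5]).tolist()

# Hypothetical toy frame with two of the UGDS_ columns
toy = pd.DataFrame({
    'UGDS_WHITE': [0.0333, 0.5922],
    'UGDS_BLACK': [0.9353, 0.2600],
})

with_operators = (toy + .00501) // .01 / 100
with_methods = toy.add(.00501).floordiv(.01).div(100)

same = with_operators.equals(with_methods)
matches_round = np.allclose(with_operators, toy.round(2))

print(half_cases, same, matches_round)
```

The method form takes extra parameters such as `fill_value` and `axis`, which the plain operators cannot express.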

## <u>Comparing missing values</u>

In [116]:
#np.nan is not equal to itself, but None is:
np.nan == np.nan

False

In [117]:
None == None

True

In [118]:
#Every comparison against np.nan returns False, except not equal to:
np.nan != 5

True

In [119]:
#Comparing two DataFrames with == works element by element and returns a DataFrame of booleans:
college_self_compare = college_ugds == college_ugds

In [120]:
college_self_compare.head()

Unnamed: 0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
0,True,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True


In [121]:
college_self_compare.tail()

Unnamed: 0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
7530,False,False,False,False,False,False,False,False,False
7531,False,False,False,False,False,False,False,False,False
7532,False,False,False,False,False,False,False,False,False
7533,False,False,False,False,False,False,False,False,False
7534,False,False,False,False,False,False,False,False,False


In [122]:
#The all method shows that no column is entirely True, even though the DataFrame was compared with itself.
#This is because a missing value never compares equal to another missing value:
college_self_compare.all()

UGDS_WHITE    False
UGDS_BLACK    False
UGDS_HISP     False
UGDS_ASIAN    False
UGDS_AIAN     False
UGDS_NHPI     False
UGDS_2MOR     False
UGDS_NRA      False
UGDS_UNKN     False
dtype: bool

In [123]:
college_ugds.isnull().sum()

UGDS_WHITE    661
UGDS_BLACK    661
UGDS_HISP     661
UGDS_ASIAN    661
UGDS_AIAN     661
UGDS_NHPI     661
UGDS_2MOR     661
UGDS_NRA      661
UGDS_UNKN     661
dtype: int64

In [124]:
#The equals method treats missing values in the same location as equal, so it is the correct way to compare two DataFrames:
college_ugds.equals(college_ugds)

True

In [125]:
#By contrast, == still returns False wherever both values are missing:
college_ugds == college_ugds

Unnamed: 0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
0,True,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...
7530,False,False,False,False,False,False,False,False,False
7531,False,False,False,False,False,False,False,False,False
7532,False,False,False,False,False,False,False,False,False
7533,False,False,False,False,False,False,False,False,False
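
The difference between `==` and `equals` can be reproduced on a tiny frame. This is a sketch on a hypothetical `toy` DataFrame with one fully missing row; `pandas.testing.assert_frame_equal` is included as a further option that also treats aligned missing values as equal.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical toy frame: the second row is entirely missing
toy = pd.DataFrame({
    'UGDS_WHITE': [0.03, np.nan],
    'UGDS_BLACK': [0.94, np.nan],
})

elementwise = toy == toy                         # False wherever a NaN is compared
all_equal_elementwise = elementwise.all().all()
frames_equal = toy.equals(toy.copy())            # NaNs in the same place count as equal

# Comparing against np.nan never finds anything; use isnull instead
found_with_eq = (toy == np.nan).sum().sum()
found_with_isnull = toy.isnull().sum().sum()

# Raises AssertionError if the frames differ; passes silently here
assert_frame_equal(toy, toy.copy())

print(all_equal_elementwise, frames_equal, found_with_eq, found_with_isnull)
```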


## <u>Transposing the direction of a DataFrame operation</u>
<p>The <b>axis</b> parameter controls the direction in which an operation occurs. It takes either 0 (or 'index') or 1 (or 'columns').</p>
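
A small frame makes the two directions concrete. The sketch below assumes a hypothetical `toy` DataFrame with three schools, one of which has no data. With `axis='index'` the operation runs down the rows and yields one result per column; with `axis='columns'` it runs across the columns and yields one result per row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; 'School C' has no data
toy = pd.DataFrame(
    {'UGDS_WHITE': [0.25, 0.5, np.nan],
     'UGDS_BLACK': [0.75, 0.5, np.nan]},
    index=['School A', 'School B', 'School C'],
)

per_column = toy.count(axis='index')   # same as axis=0, the default
per_row = toy.count(axis='columns')    # same as axis=1
row_totals = toy.sum(axis=1)           # an all-missing row sums to 0.0 by default
running = toy.cumsum(axis=1)           # keeps the original shape

print(per_column.tolist(), per_row.tolist(), row_totals.tolist())
```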

In [126]:
#The UGDS_ columns hold the fraction of undergraduate students of each race.
#Setting the institution name as the index makes the rows easier to reference:
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds = college.filter(like='UGDS_')
college_ugds.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


In [127]:
college.head()

Unnamed: 0_level_0,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


In [128]:
college.shape

(7535, 26)

In [129]:
#Counting the number of non-missing values. By default, axis is 0 ('index'):
college_ugds.count()

UGDS_WHITE    6874
UGDS_BLACK    6874
UGDS_HISP     6874
UGDS_ASIAN    6874
UGDS_AIAN     6874
UGDS_NHPI     6874
UGDS_2MOR     6874
UGDS_NRA      6874
UGDS_UNKN     6874
dtype: int64

In [130]:
college_ugds.count(axis = 'index')

UGDS_WHITE    6874
UGDS_BLACK    6874
UGDS_HISP     6874
UGDS_ASIAN    6874
UGDS_AIAN     6874
UGDS_NHPI     6874
UGDS_2MOR     6874
UGDS_NRA      6874
UGDS_UNKN     6874
dtype: int64

In [131]:
#Transposing the operation of the cell above:
college_ugds.count(axis = 'columns')

INSTNM
Alabama A & M University                                  9
University of Alabama at Birmingham                       9
Amridge University                                        9
University of Alabama in Huntsville                       9
Alabama State University                                  9
                                                         ..
SAE Institute of Technology  San Francisco                0
Rasmussen College - Overland Park                         0
National Personal Training Institute of Cleveland         0
Bay Area Medical Academy - San Jose Satellite Location    0
Excel Learning Center-San Antonio South                   0
Length: 7535, dtype: int64

In [132]:
college_ugds.count(axis='columns').head()

INSTNM
Alabama A & M University               9
University of Alabama at Birmingham    9
Amridge University                     9
University of Alabama in Huntsville    9
Alabama State University               9
dtype: int64

In [133]:
#Instead of counting non-missing values, we can sum all the values in each row.
#Each row of percentages should add up to 1. The sum method may be used to verify this:
college_ugds.sum(axis = 'columns').head()

INSTNM
Alabama A & M University               1.0000
University of Alabama at Birmingham    0.9999
Amridge University                     1.0000
University of Alabama in Huntsville    1.0000
Alabama State University               1.0000
dtype: float64

In [134]:
#You can also get an idea of the distribution of each column:
college_ugds.median(axis = 'index')

UGDS_WHITE    0.55570
UGDS_BLACK    0.10005
UGDS_HISP     0.07140
UGDS_ASIAN    0.01290
UGDS_AIAN     0.00260
UGDS_NHPI     0.00000
UGDS_2MOR     0.01750
UGDS_NRA      0.00000
UGDS_UNKN     0.01430
dtype: float64

In [135]:
college_ugds_cumsum = college_ugds.cumsum(axis = 1)
college_ugds_cumsum.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9686,0.9741,0.976,0.9784,0.9803,0.9803,0.9862,1.0
University of Alabama at Birmingham,0.5922,0.8522,0.8805,0.9323,0.9345,0.9352,0.972,0.9899,0.9999
Amridge University,0.299,0.7182,0.7251,0.7285,0.7285,0.7285,0.7285,0.7285,1.0
University of Alabama in Huntsville,0.6988,0.8243,0.8625,0.9001,0.9144,0.9146,0.9318,0.965,1.0
Alabama State University,0.0158,0.9366,0.9487,0.9506,0.9516,0.9522,0.962,0.9863,1.0


In [136]:
college_ugds.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


In [137]:
college_ugds_cumsum.shape

(7535, 9)

In [138]:
college_ugds.shape

(7535, 9)
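
The two shapes match because `cumsum` accumulates values rather than aggregating them. As a minimal sketch on made-up data (the column names `a`, `b`, `c` are placeholders), the last column of a row-wise cumulative sum should equal the plain row sum:

```python
import pandas as pd

# Hypothetical toy data with values that add up to 1 in each row
toy = pd.DataFrame({"a": [0.5, 0.25], "b": [0.25, 0.25], "c": [0.25, 0.5]})

running = toy.cumsum(axis=1)    # running total across each row
row_sum = toy.sum(axis=1)       # single total per row

same_shape = running.shape == toy.shape
last_matches = bool((running.iloc[:, -1] == row_sum).all())

print(running)
print(same_shape, last_matches)
```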

## <u>Determining college campus diversity</u>

In [139]:
coll = pd.read_csv('data/college_diversity.csv', index_col = 'School')

In [140]:
coll.head()

Unnamed: 0_level_0,Diversity Index
School,Unnamed: 1_level_1
"Rutgers University--Newark Newark, NJ",0.76
"Andrews University Berrien Springs, MI",0.74
"Stanford University Stanford, CA",0.74
"University of Houston Houston, TX",0.74
"University of Nevada--Las Vegas Las Vegas, NV",0.74


In [141]:
coll2 = pd.read_csv('data/college_diversity.csv')

In [142]:
coll2.head()

Unnamed: 0,School,Diversity Index
0,"Rutgers University--Newark Newark, NJ",0.76
1,"Andrews University Berrien Springs, MI",0.74
2,"Stanford University Stanford, CA",0.74
3,"University of Houston Houston, TX",0.74
4,"University of Nevada--Las Vegas Las Vegas, NV",0.74


In [143]:
coll.shape

(10, 1)

<p>In this recipe, the diversity metric is the number of race categories that each make up at least 15% of a school's student population.</p>
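
Before applying the metric to the real data, it may help to see it on a tiny made-up example. The schools and percentages below are hypothetical; the sketch assumes the threshold is inclusive, so a share of exactly 15% counts:

```python
import pandas as pd

# Hypothetical toy data: each row adds up to 1
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.70, 0.30],
        "UGDS_BLACK": [0.15, 0.30],
        "UGDS_HISP": [0.10, 0.25],
        "UGDS_ASIAN": [0.05, 0.15],
    },
    index=["School A", "School B"],
)

# ge(0.15) gives True where the share is >= 15%;
# summing the booleans across each row counts the True values
toy_metric = toy.ge(0.15).sum(axis=1)

print(toy_metric)
```

School A has two categories at or above the threshold, and School B has four.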

In [144]:
college = pd.read_csv('data/college.csv', index_col = 'INSTNM')
college_ugds = college.filter(like = 'UGDS_')

In [145]:
college_ugds.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


In [146]:
#Counting the missing values in each row and sorting the resulting Series in descending order
(college_ugds.isnull()
            .sum(axis = 1)
            .sort_values(ascending = False)
            .head())

INSTNM
Excel Learning Center-San Antonio South              9
Western State College of Law at Argosy University    9
Albany Law School                                    9
Albany Medical College                               9
A T Still University of Health Sciences              9
dtype: int64

In [147]:
#We can drop the rows in which every value is missing
#If how is set to 'any', rows with at least one missing value are dropped instead
college_ugds = college_ugds.dropna(how = 'all')
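
The difference between the two `how` settings can be sketched on a small made-up DataFrame (the labels `full`, `partial`, and `empty` are placeholders chosen to describe each row):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one complete row, one partly missing, one fully missing
toy = pd.DataFrame(
    {"x": [1.0, np.nan, np.nan], "y": [2.0, 3.0, np.nan]},
    index=["full", "partial", "empty"],
)

dropped_all = toy.dropna(how="all")   # removes only the fully missing row
dropped_any = toy.dropna(how="any")   # removes every row with a missing value

print(dropped_all.index.tolist())
print(dropped_any.index.tolist())
```

With `how = 'all'` the partly missing row survives, which is why the check in the next cell is still worth running.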

In [148]:
college_ugds.isnull().sum()

UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

In [149]:
#Calculating the diversity metric. ge(.15) turns each value into a boolean: True where it is greater than or equal to 15%
college_ugds.ge(.15)

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,False,True,False,False,False,False,False,False,False
University of Alabama at Birmingham,True,True,False,False,False,False,False,False,False
Amridge University,True,True,False,False,False,False,False,False,True
University of Alabama in Huntsville,True,False,False,False,False,False,False,False,False
Alabama State University,False,True,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
Hollywood Institute of Beauty Careers-West Palm Beach,True,True,True,False,False,False,False,False,False
Hollywood Institute of Beauty Careers-Casselberry,False,True,True,False,False,False,False,False,False
Coachella Valley Beauty College-Beaumont,True,False,True,False,False,False,False,False,False
Dewey University-Mayaguez,False,False,True,False,False,False,False,False,False


In [150]:
#Using the sum method to count the True values for each college
#Two schools have at least 15% in five different race categories:
diversity_metric = college_ugds.ge(0.15).sum(axis = 1).sort_values(ascending = False)
diversity_metric

INSTNM
Central Texas Beauty College-Temple                               5
Regency Beauty Institute-Austin                                   5
Westwood College-O'Hare Airport                                   4
Regency Beauty Institute-Pasadena                                 4
Soma Institute-The National School of Clinical Massage Therapy    4
                                                                 ..
Professional Business College                                     0
Education and Technology Institute                                0
Taft University System                                            0
Prince Institute-Rocky Mountains                                  0
Spanish-American Institute                                        0
Length: 6874, dtype: int64

In [151]:
#Getting the distribution of the diversity metric with value_counts:
diversity_metric.value_counts()

1    3042
2    2884
3     876
4      63
0       7
5       2
dtype: int64

In [152]:
#Viewing the entire data of the top two schools:
college_ugds.loc[['Regency Beauty Institute-Austin', 'Central Texas Beauty College-Temple']]

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Regency Beauty Institute-Austin,0.1867,0.2133,0.16,0.0,0.0,0.0,0.1733,0.0,0.2667
Central Texas Beauty College-Temple,0.1616,0.2323,0.2626,0.0202,0.0,0.0,0.1717,0.0,0.1515


In [153]:
#We can view how the top five schools in the US News diversity ranking fared with this basic diversity metric:
us_news_top = ['Rutgers University-Newark',
'Andrews University',
'Stanford University',
'University of Houston',
'University of Nevada-Las Vegas']

In [154]:
diversity_metric.loc[us_news_top]

INSTNM
Rutgers University-Newark         4
Andrews University                3
Stanford University               3
University of Houston             3
University of Nevada-Las Vegas    3
dtype: int64

In [155]:
#We can find the least diverse schools by sorting on each school's largest single race percentage:
college_ugds.max(axis = 1).sort_values(ascending = False).head(10)

INSTNM
Caribbean University-Ponce                                        1.0
Brighton Institute of Cosmetology                                 1.0
Mesivta Torah Vodaath Rabbinical Seminary                         1.0
Rabbinical College Telshe                                         1.0
University of Puerto Rico-Mayaguez                                1.0
Haskell Indian Nations University                                 1.0
Lake Career and Technical Center                                  1.0
Leon Studio One School of Hair Design & Career Training Center    1.0
Dewey University-Hato Rey                                         1.0
Columbia Central University-Caguas                                1.0
dtype: float64
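
The maximum alone does not say which category dominates. One possible follow-up, sketched here on made-up data rather than on `college_ugds`, is to pair `max` with `idxmax`, which returns the column label of each row's largest value:

```python
import pandas as pd

# Hypothetical toy data: two schools, two race-share columns
toy = pd.DataFrame(
    {"UGDS_WHITE": [0.9, 0.4], "UGDS_HISP": [0.1, 0.6]},
    index=["School A", "School B"],
)

largest_share = toy.max(axis=1)      # the highest percentage in each row
largest_group = toy.idxmax(axis=1)   # the column that holds that percentage

print(largest_share)
print(largest_group)
```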

In [156]:
#We can also determine if any school has all nine categories exceeding 1%:
(college_ugds > 0.01).all(axis = 1).any()

True
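
The chain above only reports whether such a school exists. A minimal sketch of how the intermediate boolean Series could be reused to list the matching schools, again on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data: School A has one category at 0%, School B has none
toy = pd.DataFrame(
    {"a": [0.98, 0.4], "b": [0.02, 0.3], "c": [0.0, 0.3]},
    index=["School A", "School B"],
)

# True for a row only when every column in that row exceeds 1%
mask = (toy > 0.01).all(axis=1)

exists = bool(mask.any())             # is there at least one such school?
matching = toy.index[mask].tolist()   # which schools are they?

print(exists, matching)
```

Here `all(axis = 1)` collapses the columns of each row into one boolean, and `any()` then collapses that Series into a single value.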