# Course Solutions

1. [Pandas Intro](#1.-Pandas-Intro)
1. [Selecting Subsets of Series Data](#2.-Selecting-Subsets-of-Series-Data)
1. [Series - Boolean Indexing](#3.-Series---Boolean-Indexing)
1. [Case Study - Calculating Normality of Stock Market Returns](#4.-Case-Study---Calculating-Normality-of-Stock-Market-Returns)
1. [DataFrame Basics](#5.-DataFrame-Basics)
1. [DataFrame Boolean Indexing](#6.-DataFrame-Boolean-Indexing)
1. [Case Study - Do Employees with more Experience make more Money?](#7.-Case-Study---Do-Employees-with-more-Experience-make-more-Money?)
1. [More on DataFrames](#8.-More-on-DataFrames)

# 1. Pandas Intro

In [1]:
import pandas as pd
import numpy as np

pd.options.display.max_columns = 40
movie = pd.read_csv('data/movie.csv')

### Problem 1
<span  style="color:green; font-size:16px">Select the column **`gross`** (the total revenue) from the movie dataset and store it in a variable with the same name. Output its first 10 values.</span>

In [4]:
gross = movie['gross']
gross.head(10)

0    760505847.0
1    309404152.0
2    200074175.0
3    448130642.0
4            NaN
5     73058679.0
6    336530303.0
7    200807262.0
8    458991599.0
9    301956980.0
Name: gross, dtype: float64

### Problem 2
<span  style="color:green; font-size:16px">What type of object is **`gross`**?</span>

In [5]:
type(gross)

pandas.core.series.Series

### Problem 3
<span  style="color:green; font-size:16px">What type of object is returned from the values of the index of a Series?</span>

In [6]:
# The values of the index of a series are a numpy ndarray
type(gross.index.values)

numpy.ndarray

### Problem 4
<span  style="color:green; font-size:16px">Divide each value in **`gross`** by one million and save it back to **`gross`**</span>

In [12]:
gross = gross / 1000000

### Problem 5
<span  style="color:green; font-size:16px">What was the most revenue generated?</span>

In [13]:
gross.max()

760.50584700000002

### Problem 6
<span  style="color:green; font-size:16px">Select the **`actor_1_name`** column and find the actor who has appeared in the most movies. Make sure to append the **`head`** method to suppress long output.</span>

In [14]:
movie['actor_1_name'].value_counts().head()

Robert De Niro       48
Johnny Depp          36
Nicolas Cage         32
Denzel Washington    29
J.K. Simmons         29
Name: actor_1_name, dtype: int64

### Problem 7
<span  style="color:green; font-size:16px">Write a function that accepts a single argument. The argument will be a Series. Have the function return the difference between the largest and smallest Series value. Run your function with the **`gross`** Series</span>

In [15]:
def min_max(series):
    return series.max() - series.min()

min_max(gross)

760.50568499999997

# 2. Selecting Subsets of Series Data

### Problem 1
<span  style="color:green; font-size:16px">Create a 3 element pandas Series using the Series constructor with characters as the index and numbers as the values. Output the Series.</span>

In [8]:
s = pd.Series(data=[4, 9, 100], index=['b', 'c', 'g'])
s

b      4
c      9
g    100
dtype: int64

### Problem 2
<span  style="color:green; font-size:16px">Another way to create a series is to pass a dictionary to the pandas series constructor. The keys of the dictionary become the Series index and the dictionary values become the Series values. Create a dictionary with at least 3 elements and use it to create a series. Output the Series.</span>

In [16]:
d = {'Houston':'South', 'Dallas':'North', 'El Paso':'West'}
s = pd.Series(d)
s

Dallas     North
El Paso     West
Houston    South
dtype: object

### Problem 3
<span  style="color:green; font-size:16px">Use the **`read_csv`** function to read in the movie dataset and set the index to the title of the movie. Output the first 10 rows.</span>

In [18]:
movie = pd.read_csv('data/movie.csv')
movie = movie.set_index('movie_title')
movie.head()

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,,,,12.0,7.1,,0


### Problem 4
<span  style="color:green; font-size:16px">Select the director column into a variable with the same name.</span>

In [19]:
director = movie['director_name']

### Problem 5
<span  style="color:green; font-size:16px">Output to the screen the first 10 values in the **`director`** Series. Remember to only use **`.loc`** and **`.iloc`** when accessing Series elements.</span>

In [20]:
director.iloc[:10]

movie_title
Avatar                                            James Cameron
Pirates of the Caribbean: At World's End         Gore Verbinski
Spectre                                              Sam Mendes
The Dark Knight Rises                         Christopher Nolan
Star Wars: Episode VII - The Force Awakens          Doug Walker
John Carter                                      Andrew Stanton
Spider-Man 3                                          Sam Raimi
Tangled                                            Nathan Greno
Avengers: Age of Ultron                             Joss Whedon
Harry Potter and the Half-Blood Prince              David Yates
Name: director_name, dtype: object

### Problem 6
<span  style="color:green; font-size:16px">Output **`director`** elements at location 40, 50 and 99</span>

In [22]:
director.iloc[[40, 50, 99]]

movie_title
TRON: Legacy                         Joseph Kosinski
The Great Gatsby                        Baz Luhrmann
The Hobbit: An Unexpected Journey      Peter Jackson
Name: director_name, dtype: object

### Problem 7
<span  style="color:green; font-size:16px">Output the last ten values of the **`director`** Series.</span>

In [23]:
director.iloc[-10:]

movie_title
Primer                        Shane Carruth
Cavite                     Neill Dela Llana
El Mariachi                Robert Rodriguez
The Mongol King             Anthony Vallone
Newlyweds                      Edward Burns
Signed Sealed Delivered         Scott Smith
The Following                           NaN
A Plague So Pleasant       Benjamin Roberds
Shanghai Calling                Daniel Hsia
My Date with Drew                  Jon Gunn
Name: director_name, dtype: object

### Problem 8
<span  style="color:green; font-size:16px">Select the directors from the movies **The Fast and the Furious** and **Batman Begins**</span>

In [28]:
director.loc[['The Fast and the Furious', 'Batman Begins']]

movie_title
The Fast and the Furious            Rob Cohen
Batman Begins               Christopher Nolan
Name: director_name, dtype: object

### Problem 9
<span  style="color:green; font-size:16px">Think of a movie you have seen and try to select its director</span>

In [29]:
director.loc['My Big Fat Greek Wedding']

'Joel Zwick'

### Problem 10
<span  style="color:green; font-size:16px">If two Series are added with no indices in common, what will be the outcome? Check your answer by coding this situation.</span>

In [31]:
# all missing values
s1 = pd.Series(data=[1,2,3], index=['a','b','c'])
s2 = pd.Series(data=[1,2,3], index=['d','e','f'])

s1 + s2

a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
f   NaN
dtype: float64
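
The all-missing result comes from index alignment: each label exists in only one of the two Series. As a minimal sketch of one way around it (not part of the original exercise), the `add` method accepts a `fill_value` that stands in for the side that lacks a label:

```python
import pandas as pd

s1 = pd.Series(data=[1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series(data=[1, 2, 3], index=['d', 'e', 'f'])

# Plain addition aligns on the union of labels; every label is missing on one side
plain = s1 + s2

# add() with fill_value=0 treats the missing side as 0 before adding
filled = s1.add(s2, fill_value=0)
print(filled)
```

The same `fill_value` argument is available on `sub`, `mul`, and `div`.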

### Problem 11
<span  style="color:green; font-size:16px">What if the two Series from problem 10 were subtracted, multiplied or divided instead?</span>

In [32]:
# same thing would happen
s1 / s2

a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
f   NaN
dtype: float64

### Problem 12
<span  style="color:green; font-size:16px">Create two Series that have 3 elements each and, when added together, yield a Series with 4 total elements, none of which are missing.</span>

In [33]:
s1 = pd.Series(data=[1,2,3], index=list('aab'))
s2 = pd.Series(data=[1,2,3], index=list('abb'))

s1 + s2

a    2
a    3
b    5
b    6
dtype: int64
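
The result has four rows because alignment on duplicate labels pairs every matching row from one Series with every matching row from the other. The sketch below reuses the same two Series and counts the rows per label; the variable names `result` and `rows_per_label` are only illustrative.

```python
import pandas as pd

s1 = pd.Series(data=[1, 2, 3], index=list('aab'))
s2 = pd.Series(data=[1, 2, 3], index=list('abb'))

result = s1 + s2

# Label 'a': 2 rows in s1 x 1 row in s2 = 2 rows
# Label 'b': 1 row in s1 x 2 rows in s2 = 2 rows
rows_per_label = result.groupby(level=0).size()
print(rows_per_label)
```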

# 3. Series - Boolean Indexing

In [34]:
movie = pd.read_csv('data/movie.csv', index_col='movie_title')

### Problem 1
<span  style="color:green; font-size:16px">Select the column **`movie_facebook_likes`** as a Series, save it to a variable with the same name and output its first 10 values.</span>

In [35]:
movie_facebook_likes = movie['movie_facebook_likes']
movie_facebook_likes.head(10)

movie_title
Avatar                                         33000
Pirates of the Caribbean: At World's End           0
Spectre                                        85000
The Dark Knight Rises                         164000
Star Wars: Episode VII - The Force Awakens         0
John Carter                                    24000
Spider-Man 3                                       0
Tangled                                        29000
Avengers: Age of Ultron                       118000
Harry Potter and the Half-Blood Prince         10000
Name: movie_facebook_likes, dtype: int64

### Problem 2
<span  style="color:green; font-size:16px">Use boolean indexing to select all movies with 0 movie facebook likes.</span>

In [37]:
movie_facebook_likes[movie_facebook_likes == 0].head()

movie_title
Pirates of the Caribbean: At World's End      0
Star Wars: Episode VII - The Force Awakens    0
Spider-Man 3                                  0
Superman Returns                              0
Quantum of Solace                             0
Name: movie_facebook_likes, dtype: int64

### Problem 3
<span  style="color:green; font-size:16px">Use boolean indexing to select all movies that don't have 0 movie facebook likes.</span>

In [39]:
movie_facebook_likes[movie_facebook_likes != 0].head()

movie_title
Avatar                    33000
Spectre                   85000
The Dark Knight Rises    164000
John Carter               24000
Tangled                   29000
Name: movie_facebook_likes, dtype: int64

### Problem 4
<span  style="color:green; font-size:16px">Use boolean indexing to select all movies with more than 50,000 likes but less than 100,000</span>

In [41]:
criteria = (movie_facebook_likes > 50000) & (movie_facebook_likes < 100000)
movie_facebook_likes[criteria].head()

movie_title
Spectre                                        85000
Pirates of the Caribbean: On Stranger Tides    58000
The Hobbit: The Battle of the Five Armies      65000
The Amazing Spider-Man                         56000
The Hobbit: The Desolation of Smaug            83000
Name: movie_facebook_likes, dtype: int64

### Problem 5
<span  style="color:green; font-size:16px">Sort the values of the output from problem 4 and output the head and tail to validate the boolean indexing.</span>

In [44]:
criteria = (movie_facebook_likes > 50000) & (movie_facebook_likes < 100000)
sorted4 = movie_facebook_likes[criteria].sort_values()
sorted4.head()

movie_title
The Raid: Redemption                     51000
Exodus: Gods and Kings                   51000
The Rock                                 51000
Me Before You                            51000
The Hunger Games: Mockingjay - Part 1    52000
Name: movie_facebook_likes, dtype: int64

In [45]:
sorted4.tail()

movie_title
Guardians of the Galaxy            96000
This Is the End                    97000
Prometheus                         97000
Abraham Lincoln: Vampire Hunter    98000
The Big Short                      99000
Name: movie_facebook_likes, dtype: int64

### Problem 6
<span  style="color:green; font-size:16px">Use boolean indexing to select movies with facebook likes less than 10,000 or greater than 100,000.</span>

In [51]:
criteria = (movie_facebook_likes < 10000) | (movie_facebook_likes > 100000)
movie_facebook_likes[criteria].head(10)

movie_title
Pirates of the Caribbean: At World's End           0
The Dark Knight Rises                         164000
Star Wars: Episode VII - The Force Awakens         0
Spider-Man 3                                       0
Avengers: Age of Ultron                       118000
Batman v Superman: Dawn of Justice            197000
Superman Returns                                   0
Quantum of Solace                                  0
Pirates of the Caribbean: Dead Man's Chest      5000
Man of Steel                                  118000
Name: movie_facebook_likes, dtype: int64

### Problem 7
<span  style="color:green; font-size:16px">Use boolean indexing to select movies with facebook likes less than 10,000 but greater than 0, or greater than 100,000.</span>

In [53]:
criteria = ((movie_facebook_likes > 0) & (movie_facebook_likes < 10000)) | (movie_facebook_likes > 100000)
movie_facebook_likes[criteria].head(10)

movie_title
The Dark Knight Rises                                 164000
Avengers: Age of Ultron                               118000
Batman v Superman: Dawn of Justice                    197000
Pirates of the Caribbean: Dead Man's Chest              5000
Man of Steel                                          118000
The Avengers                                          123000
Jurassic World                                        150000
World War Z                                           129000
The Great Gatsby                                      115000
Indiana Jones and the Kingdom of the Crystal Skull      5000
Name: movie_facebook_likes, dtype: int64

### Problem 8
<span  style="color:green; font-size:16px">Use boolean indexing to select movies with facebook likes less than 10,000 but greater than 0, or greater than 100,000 but less than 120,000. Think about breaking up the boolean expression into two different criteria before combining them together.</span>

In [54]:
criteria1 = ((movie_facebook_likes > 0) & (movie_facebook_likes < 10000)) 
criteria2 = ((movie_facebook_likes > 100000) & (movie_facebook_likes < 120000))
criteria_all = criteria1 | criteria2
movie_facebook_likes[criteria_all].head(10)

movie_title
Avengers: Age of Ultron                               118000
Pirates of the Caribbean: Dead Man's Chest              5000
Man of Steel                                          118000
The Great Gatsby                                      115000
Indiana Jones and the Kingdom of the Crystal Skull      5000
Evan Almighty                                           2000
Inside Out                                            118000
The Lovers                                               677
Transformers                                            8000
Night at the Museum: Battle of the Smithsonian          2000
Name: movie_facebook_likes, dtype: int64

### Problem 9
<span  style="color:green; font-size:16px">Reverse the boolean selection from problem 8.</span>

In [55]:
criteria1 = ((movie_facebook_likes > 0) & (movie_facebook_likes < 10000)) 
criteria2 = ((movie_facebook_likes > 100000) & (movie_facebook_likes < 120000))
criteria_all = criteria1 | criteria2
movie_facebook_likes[~criteria_all].head(10)

movie_title
Avatar                                         33000
Pirates of the Caribbean: At World's End           0
Spectre                                        85000
The Dark Knight Rises                         164000
Star Wars: Episode VII - The Force Awakens         0
John Carter                                    24000
Spider-Man 3                                       0
Tangled                                        29000
Harry Potter and the Half-Blood Prince         10000
Batman v Superman: Dawn of Justice            197000
Name: movie_facebook_likes, dtype: int64
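
The `~` operator flips every boolean in the criteria. As a small sketch on made-up like counts (the values below are not taken from the dataset), inverting the combined criteria gives the same mask as inverting each part and joining them with `&`:

```python
import pandas as pd

# Hypothetical like counts chosen to hit each branch of the criteria
likes = pd.Series([0, 5000, 33000, 118000, 164000])

criteria1 = (likes > 0) & (likes < 10000)
criteria2 = (likes > 100000) & (likes < 120000)

inverted = ~(criteria1 | criteria2)

# De Morgan's law: not (A or B) is the same as (not A) and (not B)
de_morgan = ~criteria1 & ~criteria2

print(likes[inverted].tolist())
```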

### Problem 10
<span  style="color:green; font-size:16px">Use the **`between`** method to replicate problem 4.</span>

In [57]:
criteria = movie_facebook_likes.between(50000, 100000, inclusive=False)
movie_facebook_likes[criteria].head()

movie_title
Spectre                                        85000
Pirates of the Caribbean: On Stranger Tides    58000
The Hobbit: The Battle of the Five Armies      65000
The Amazing Spider-Man                         56000
The Hobbit: The Desolation of Smaug            83000
Name: movie_facebook_likes, dtype: int64
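
A quick check on made-up values shows how `between` relates to the two-comparison version. This sketch assumes pandas 1.3 or newer, where `inclusive` takes the strings `'both'`, `'neither'`, `'left'`, or `'right'`; the boolean form used in the cell above comes from older releases.

```python
import pandas as pd

# Hypothetical like counts that sit on and inside the boundaries
likes = pd.Series([50000, 51000, 85000, 99000, 100000])

manual = (likes > 50000) & (likes < 100000)

# inclusive='neither' excludes both endpoints (assumes pandas >= 1.3)
exclusive = likes.between(50000, 100000, inclusive='neither')

# The default includes both endpoints
default = likes.between(50000, 100000)

print(exclusive.tolist())
print(default.tolist())
```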

### Problem 11
<span  style="color:green; font-size:16px">How many movies have more than 100,000 facebook likes?</span>

In [58]:
len(movie_facebook_likes[movie_facebook_likes > 100000])

43

# 4. Case Study - Calculating Normality of Stock Market Returns

### Problem 1
<span  style="color:green; font-size:16px">Use pandas-datareader to return data for the automaker Tesla (TSLA). Output the first 10 rows and save the closing price to **`tsla_close`**</span>

In [60]:
import pandas_datareader as pdr
tsla = pdr.DataReader('tsla', 'google')
tsla.head(10)

Date,Open,High,Low,Close,Volume
2010-06-29,19.0,25.0,17.54,23.89,18783276
2010-06-30,25.79,30.42,23.3,23.83,17194394
2010-07-01,25.0,25.92,20.27,21.96,8229863
2010-07-02,23.0,23.1,18.71,19.2,5141807
2010-07-06,20.0,20.0,15.83,16.11,6879296
2010-07-07,16.4,16.63,14.98,15.8,6924914
2010-07-08,16.14,17.52,15.57,17.46,7719539
2010-07-09,17.58,17.9,16.55,17.4,4058606
2010-07-12,17.95,18.07,17.0,17.05,2203570
2010-07-13,17.39,18.64,16.9,18.14,2680060


In [61]:
tsla_close = tsla['Close']

### Problem 2
<span  style="color:green; font-size:16px">Use one line of code (with method chaining) to find the daily percentage returns of TSLA and drop any missing values. Save the result to **`tsla_change`**.</span>

In [63]:
tsla_change = tsla_close.pct_change().dropna()
tsla_change.head()

Date
2010-06-30   -0.002512
2010-07-01   -0.078473
2010-07-02   -0.125683
2010-07-06   -0.160937
2010-07-07   -0.019243
Name: Close, dtype: float64

### Problem 3
<span  style="color:green; font-size:16px">Find the mean daily return for Tesla. Then select the first and last closing prices. Also get the number of trading days. Store all four of these values into variables.</span>

In [65]:
mean = tsla_change.mean()
first = tsla_close.iloc[0]
last = tsla_close.iloc[-1]
n = tsla_change.size
mean, first, last, n

(0.002003997942126599, 23.890000000000001, 329.92000000000002, 1777)

### Problem 4
<span  style="color:green; font-size:16px">If Tesla returned its mean percentage return every single day since the first day you have data, what would its last closing price be? Is it the same as the actual last closing price? You need to use all the variables calculated from problem 3.</span>

In [69]:
first * (mean + 1) ** n

837.98553717868447

In [67]:
last

329.92000000000002
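
The projected price is far above the actual one because compounding the arithmetic mean of daily returns overstates growth whenever the returns vary. The sketch below uses made-up prices (not market data) to show the gap, and that the geometric mean return recovers the last price.

```python
import pandas as pd

# Hypothetical closing prices, not real market data
close = pd.Series([100.0, 150.0, 75.0, 90.0])
change = close.pct_change().dropna()

first = close.iloc[0]
last = close.iloc[-1]
n = change.size

# Compounding the arithmetic mean return overshoots the actual last price
arithmetic = change.mean()
projected = first * (1 + arithmetic) ** n

# The geometric mean return reproduces the last price
geometric = (last / first) ** (1 / n) - 1
recovered = first * (1 + geometric) ** n

print(projected, recovered, last)
```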

### Problem 5
<span  style="color:green; font-size:16px">Find the raw number of standard deviations away from the mean for the Tesla daily returns. Save this to a variable **`z_score_raw`**. What are the maximum and minimum scores?</span>

In [77]:
std = tsla_change.std()
z_score_raw = tsla_change.sub(mean).div(std)
z_score_raw.head()

Date
2010-06-30   -0.138667
2010-07-01   -2.471363
2010-07-02   -3.921158
2010-07-06   -5.003791
2010-07-07   -0.652468
Name: Close, dtype: float64

In [78]:
z_score_raw.max(), z_score_raw.min()

(7.4299662146296868, -5.9968267401356767)

### Problem 6
<span  style="color:green; font-size:16px">What percentage did Tesla stock increase on the day it had its maximum raw z-score?</span>

In [81]:
tsla_change[z_score_raw == z_score_raw.max()]

Date
2013-05-09    0.243951
Name: Close, dtype: float64

### Problem 7
<span  style="color:green; font-size:16px">Create a function that accepts a stock ticker symbol (amzn for example) and returns the percentage of daily returns within 1, 2, and 3 standard deviations from the mean. Use your function to return results for different stocks (tsla, fb, slb, gm, etc...)</span>

In [84]:
def stock_pct_finder(symbol):
    prices = pdr.DataReader(symbol, 'google')
    close = prices['Close']
    close_change = close.pct_change().dropna()
    
    mean = close_change.mean()
    std = close_change.std()
    z_score = close_change.sub(mean).div(std).abs()
    
    pct_within1 = z_score.lt(1).mean()
    pct_within2 = z_score.lt(2).mean()
    pct_within3 = z_score.lt(3).mean()

    return pct_within1, pct_within2, pct_within3

In [85]:
stock_pct_finder('amzn')

(0.78714436248682829, 0.95626975763962063, 0.9847207586933614)

In [86]:
stock_pct_finder('fb')

(0.80322828593389706, 0.95695618754804002, 0.98923904688701003)

In [87]:
stock_pct_finder('slb')

(0.74236037934668075, 0.94573234984193888, 0.98630136986301364)

In [88]:
stock_pct_finder('tsla')

(0.7861564434440067, 0.95385481148002249, 0.98424310635903212)
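
For comparison, a normal distribution puts roughly 68%, 95%, and 99.7% of values within 1, 2, and 3 standard deviations. The sketch below applies the same z-score logic to simulated normal returns instead of downloaded prices. The helper name `pct_within`, the seed, and the scale are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def pct_within(changes):
    """Return the share of values within 1, 2, and 3 standard deviations of the mean."""
    z_score = changes.sub(changes.mean()).div(changes.std()).abs()
    # The mean of a boolean Series is the fraction of True values
    return tuple(z_score.lt(k).mean() for k in (1, 2, 3))

# Simulated returns drawn from a normal distribution (an assumption, not market data)
rng = np.random.default_rng(0)
simulated = pd.Series(rng.normal(loc=0.0, scale=0.02, size=100_000))

within1, within2, within3 = pct_within(simulated)
print(within1, within2, within3)
```

The stock results above show more returns within 1 standard deviation and fewer within 3 than the simulated data, which suggests the daily returns are not normally distributed.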

### Problem 8
<span  style="color:green; font-size:16px"> How many days did Tesla close above 100 and below 150?</span>

In [89]:
criteria = (tsla_close > 100) & (tsla_close < 150)

#inspect data
tsla_close[criteria].head(15)

Date
2013-05-28    110.33
2013-05-29    104.63
2013-05-30    104.95
2013-06-07    102.04
2013-06-10    100.05
2013-06-14    100.30
2013-06-17    102.20
2013-06-18    103.39
2013-06-19    104.68
2013-06-20    100.65
2013-06-24    101.49
2013-06-25    102.40
2013-06-26    105.72
2013-06-27    109.25
2013-06-28    107.36
Name: Close, dtype: float64

In [90]:
tsla_close[criteria].size

90

In [91]:
# or using sum method
criteria.sum()

90

In [93]:
# one line
((tsla_close > 100) & (tsla_close < 150)).sum()

90

### Problem 9
<span  style="color:green; font-size:16px"> How many days did Tesla close below 50 or above 200?</span>

In [94]:
((tsla_close < 50) | (tsla_close > 200)).sum()

1447

### Problem 10
<span  style="color:green; font-size:16px"> Look up the definition for interquartile range and slice Tesla closing prices so it contains just the interquartile range. There are multiple ways to do this. Check the **`quantile`** method.</span>

In [96]:
# a few ways to do this
n = tsla_close.size
first_q = n // 4
third_q = n // 4 * 3

tsla_close.sort_values().iloc[first_q:third_q].head()

Date
2011-12-12    30.41
2012-05-30    30.41
2010-12-13    30.55
2012-07-23    30.66
2012-09-24    30.66
Name: Close, dtype: float64

In [100]:
# can use the quantile method
q1 = tsla_close.quantile(.25)
q3 = tsla_close.quantile(.75)

criteria = (tsla_close >= q1) & (tsla_close <= q3)

tsla_close[criteria].sort_values().head()

Date
2011-12-12    30.41
2012-05-30    30.41
2010-12-13    30.55
2012-09-24    30.66
2012-07-23    30.66
Name: Close, dtype: float64
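
The quantile approach is easier to verify by hand on a tiny example. The sketch below uses made-up prices and `between`, which includes both endpoints by default, to keep the values from the first through the third quartile.

```python
import pandas as pd

# Hypothetical prices, evenly spaced so the quartiles are easy to check
prices = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0])

q1 = prices.quantile(.25)
q3 = prices.quantile(.75)
iqr = q3 - q1

# between() keeps values from q1 to q3, endpoints included
middle = prices[prices.between(q1, q3)]
print(q1, q3, iqr)
print(middle.tolist())
```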

### Problem 11
<span  style="color:green; font-size:16px">Use the **`idxmax`** method to find the index label of the highest closing price. Find out how many trading days it has been since Tesla recorded its highest closing price.</span>

In [101]:
highest_index = tsla_close.idxmax()
highest_index

Timestamp('2017-06-23 00:00:00')

In [103]:
tsla_close.loc[highest_index:].size

19
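
Note that a label slice with `.loc` includes its starting label, so the count above includes the day of the high itself. A small sketch with made-up prices shows the difference; whether to subtract one depends on how "since" is read.

```python
import pandas as pd

# Hypothetical closing prices on five trading days
dates = pd.to_datetime(['2017-06-21', '2017-06-22', '2017-06-23',
                        '2017-06-26', '2017-06-27'])
close = pd.Series([376.4, 382.6, 383.5, 377.5, 362.4], index=dates)

peak = close.idxmax()

# .loc slicing includes the peak day itself
from_peak = close.loc[peak:].size

# Trading days strictly after the peak
days_since = from_peak - 1
print(peak, from_peak, days_since)
```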

# 5. DataFrame Basics

In [104]:
import pandas as pd
import numpy as np
movie = pd.read_csv('data/movie.csv', index_col='movie_title')

### Problem 1
<span  style="color:green; font-size:16px">Use the **`describe`** method on the movie dataset and include only the object columns. Transpose the output.</span>

In [107]:
# np.object was removed in NumPy 1.24; the built-in object works in all versions
movie.describe(include=[object]).T

,count,unique,top,freq
color,4897,2,Color,4693
director_name,4814,2397,Steven Spielberg,26
actor_2_name,4903,3030,Morgan Freeman,18
genres,4916,914,Drama,233
actor_1_name,4909,2095,Robert De Niro,48
actor_3_name,4893,3519,Steve Coogan,8
plot_keywords,4764,4756,based on novel,4
movie_imdb_link,4916,4916,http://www.imdb.com/title/tt0387877/?ref_=fn_t...,1
language,4904,47,English,4582
country,4911,65,USA,3710


### Problem 2
<span  style="color:green; font-size:16px">Use the **`type`** function to output the object type for both the index and the columns.</span>

In [108]:
type(movie.columns)

pandas.core.indexes.base.Index

In [109]:
type(movie.index)

pandas.core.indexes.base.Index

### Problem 3
<span  style="color:green; font-size:16px">Select all three actor name columns.</span>

In [110]:
movie[['actor_1_name', 'actor_2_name', 'actor_3_name']].head()

movie_title,actor_1_name,actor_2_name,actor_3_name
Avatar,CCH Pounder,Joel David Moore,Wes Studi
Pirates of the Caribbean: At World's End,Johnny Depp,Orlando Bloom,Jack Davenport
Spectre,Christoph Waltz,Rory Kinnear,Stephanie Sigman
The Dark Knight Rises,Tom Hardy,Christian Bale,Joseph Gordon-Levitt
Star Wars: Episode VII - The Force Awakens,Doug Walker,Rob Walker,


### Problem 4
<span  style="color:green; font-size:16px">Select the content rating column as a Series and then as a DataFrame</span>

In [111]:
movie['content_rating'].head()

movie_title
Avatar                                        PG-13
Pirates of the Caribbean: At World's End      PG-13
Spectre                                       PG-13
The Dark Knight Rises                         PG-13
Star Wars: Episode VII - The Force Awakens      NaN
Name: content_rating, dtype: object

In [112]:
movie[['content_rating']].head()

movie_title,content_rating
Avatar,PG-13
Pirates of the Caribbean: At World's End,PG-13
Spectre,PG-13
The Dark Knight Rises,PG-13
Star Wars: Episode VII - The Force Awakens,


### Problem 5
<span  style="color:green; font-size:16px">Select the rows at integer locations 3 and 5 (the 4th and 6th rows) from the movie dataset</span>

In [114]:
movie.iloc[[3,5]]

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
John Carter,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,Daryl Sabara,212204,1873,Polly Walker,1.0,alien|american civil war|male nipple|mars|prin...,http://www.imdb.com/title/tt0401729/?ref_=fn_t...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000


### Problem 6
<span  style="color:green; font-size:16px">Select the columns at integer locations 3 and 5 (the 4th and 6th columns) from the movie dataset</span>

In [116]:
movie.iloc[:, [3,5]].head()

movie_title,duration,actor_3_facebook_likes
Avatar,178.0,855.0
Pirates of the Caribbean: At World's End,169.0,1000.0
Spectre,148.0,161.0
The Dark Knight Rises,164.0,23000.0
Star Wars: Episode VII - The Force Awakens,,


### Problem 7
<span  style="color:green; font-size:16px">Select the first 5 rows and the last 5 columns</span>

In [117]:
movie.iloc[:5, -5:]

movie_title,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,2007.0,5000.0,7.1,2.35,0
Spectre,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,12.0,7.1,,0


### Problem 8
<span  style="color:green; font-size:16px">Select the movie 'The Dark Knight Rises'</span>

In [120]:
movie.loc['The Dark Knight Rises'].head()

color                                  Color
director_name              Christopher Nolan
num_critic_for_reviews                   813
duration                                 164
director_facebook_likes                22000
Name: The Dark Knight Rises, dtype: object

### Problem 9
<span  style="color:green; font-size:16px">The values of the index are stored in a numpy array. Numpy arrays use only integer location for selection. Output the movie titles at integer locations 50 through 99 using the values of the index. </span>

In [122]:
movie.index.values[50:100]

array(['The Great Gatsby', 'Prince of Persia: The Sands of Time',
       'Pacific Rim', 'Transformers: Dark of the Moon',
       'Indiana Jones and the Kingdom of the Crystal Skull',
       'The Good Dinosaur', 'Brave', 'Star Trek Beyond', 'WALL·E',
       'Rush Hour 3', '2012', 'A Christmas Carol', 'Jupiter Ascending',
       'The Legend of Tarzan',
       'The Chronicles of Narnia: The Lion, the Witch and the Wardrobe',
       'X-Men: Apocalypse', 'The Dark Knight', 'Up', 'Monsters vs. Aliens',
       'Iron Man', 'Hugo', 'Wild Wild West',
       'The Mummy: Tomb of the Dragon Emperor', 'Suicide Squad',
       'Evan Almighty', 'Edge of Tomorrow', 'Waterworld',
       'G.I. Joe: The Rise of Cobra', 'Inside Out', 'The Jungle Book',
       'Iron Man 2', 'Snow White and the Huntsman', 'Maleficent',
       'Dawn of the Planet of the Apes', 'The Lovers', '47 Ronin',
       'Captain America: The Winter Soldier', 'Shrek Forever After',
       'Tomorrowland', 'Big Hero 6', 'Wreck-It Ralph', 'T

### Problem 10
<span  style="color:green; font-size:16px">Select the first 3 rows and 3 columns using **`.loc`**</span>

In [123]:
movie.loc[:'Spectre', :'num_critic_for_reviews']

movie_title,color,director_name,num_critic_for_reviews
Avatar,Color,James Cameron,723.0
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0
Spectre,Color,Sam Mendes,602.0
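
The slice above works because `.loc` includes the stop label, unlike integer slicing. A small sketch with hypothetical data shows that a label slice ending at the third row and third column matches `.iloc[:3, :3]`.

```python
import pandas as pd

# Hypothetical toy data; labels and values are made up for illustration
toy = pd.DataFrame({'color': ['Color', 'Color', 'Color', 'Color'],
                    'director': ['A', 'B', 'C', 'D'],
                    'reviews': [7, 3, 6, 8],
                    'duration': [100, 110, 120, 130]},
                   index=['Film A', 'Film B', 'Film C', 'Film D'])

by_label = toy.loc[:'Film C', :'reviews']   # both stop labels are included
by_position = toy.iloc[:3, :3]              # stop positions are excluded

print(by_label.equals(by_position))
```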


### Problem 11
<span  style="color:green; font-size:16px">Select a single scalar value using **`.loc`**, then do the same thing with **`.at`**. Use **`%timeit`** to compare their speed.</span>

In [125]:
movie.loc['My Big Fat Greek Wedding', 'gross']

241437427.0

In [126]:
movie.at['My Big Fat Greek Wedding', 'gross']

241437427.0

In [124]:
%timeit movie.loc['My Big Fat Greek Wedding', 'gross']

9.24 µs ± 308 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [127]:
%timeit movie.at['My Big Fat Greek Wedding', 'gross']

5.94 µs ± 43.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
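
`%timeit` only works inside IPython or Jupyter. Outside a notebook, a rough equivalent can be sketched with the standard-library `timeit` module. The toy DataFrame is hypothetical, and the timings will vary by machine and pandas version, so none are quoted here.

```python
import timeit
import pandas as pd

# Hypothetical toy data with a unique index so both calls return a scalar
toy = pd.DataFrame({'gross': [1.0, 2.0, 3.0]},
                   index=['Film A', 'Film B', 'Film C'])

value_loc = toy.loc['Film B', 'gross']
value_at = toy.at['Film B', 'gross']

# Total seconds for 1000 repetitions of each lookup
loc_seconds = timeit.timeit(lambda: toy.loc['Film B', 'gross'], number=1000)
at_seconds = timeit.timeit(lambda: toy.at['Film B', 'gross'], number=1000)

print(value_loc == value_at, loc_seconds, at_seconds)
```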


# 6. DataFrame Boolean Indexing

### Problem 1
<span  style="color:green; font-size:16px">Select all Asian employees who make more than 100,000 dollars.</span>

In [128]:
employee = pd.read_csv('data/employee.csv')

In [130]:
employee[(employee['RACE'] == 'Asian/Pacific Islander') & (employee['BASE_SALARY'] > 100000)].head()

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
217,8745,GIS MANAGER,Houston Information Tech Svcs,102019.0,Asian/Pacific Islander,Full Time,Male,Active,2006-11-13,2013-07-06
237,2438,ASSISTANT DIRECTOR (EXECUTIVE LEVEL),Admn. & Regulatory Affairs,130416.0,Asian/Pacific Islander,Full Time,Female,Active,2002-05-24,2013-07-20
719,8947,DEPUTY DIRECTOR (EXECUTIVE LEVEL),Houston Police Department-HPD,163228.0,Asian/Pacific Islander,Full Time,Male,Active,1977-07-20,1998-08-08
1090,5019,SUPERVISING ENGINEER,Public Works & Engineering-PWE,102297.0,Asian/Pacific Islander,Full Time,Male,Active,2008-10-20,2012-11-24
1166,4502,IRM MANAGER,Public Works & Engineering-PWE,124861.0,Asian/Pacific Islander,Full Time,Male,Active,1999-12-20,2009-11-14


### Problem 2
<span  style="color:green; font-size:16px">What percentage of Asian employees make more than 100,000 dollars?</span>

In [133]:
asian_salary = employee.loc[(employee['RACE'] == 'Asian/Pacific Islander'), 'BASE_SALARY']
asian_salary.head()

6      71680.0
38     67499.0
96     59077.0
111    31554.0
134    77076.0
Name: BASE_SALARY, dtype: float64

In [135]:
asian_salary.gt(100000).mean()

0.084112149532710276

In [136]:
asian_salary.gt(100000).value_counts(normalize=True)

False    0.915888
True     0.084112
Name: BASE_SALARY, dtype: float64
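
The mean of a boolean Series is the share of `True` values, which is why `.gt(100000).mean()` answers the question. One caveat worth sketching with hypothetical salaries: a missing salary compares as `False`, so it stays in the denominator unless it is dropped first.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries, one of them missing
salary = pd.Series([50000.0, 120000.0, np.nan, 150000.0])

share_all = salary.gt(100000).mean()             # NaN counts as False here
share_known = salary.dropna().gt(100000).mean()  # only rows with a known salary

print(share_all, share_known)
```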

### Problem 3
<span  style="color:green; font-size:16px">What is the ratio of males to females? What is the ratio of males to females for those that make more than 100,000? How about for those that make less than 30,000?</span>

In [138]:
employee['GENDER'].value_counts(normalize=True)

Male      0.6985
Female    0.3015
Name: GENDER, dtype: float64

In [139]:
employee.loc[employee['BASE_SALARY'] > 100000, 'GENDER'].value_counts(normalize=True)

Male      0.631579
Female    0.368421
Name: GENDER, dtype: float64

In [140]:
employee.loc[employee['BASE_SALARY'] < 30000, 'GENDER'].value_counts(normalize=True)

Male      0.59542
Female    0.40458
Name: GENDER, dtype: float64
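
The outputs above are proportions rather than ratios. A ratio of males to females is the male count (or proportion) divided by the female one; for the overall figures above that is 0.6985 / 0.3015, or roughly 2.3 males per female. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical gender column
gender = pd.Series(['Male', 'Male', 'Male', 'Female', 'Female'])

counts = gender.value_counts()
ratio = counts['Male'] / counts['Female']   # males per female

print(ratio)
```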

### Problem 4
<span  style="color:green; font-size:16px">What is the distribution of race? What is the distribution of race for those that make over 100,000?</span>

In [149]:
employee['RACE'].value_counts(normalize=True)

Black or African American            0.356234
White                                0.338422
Hispanic/Latino                      0.244275
Asian/Pacific Islander               0.054453
American Indian or Alaskan Native    0.005598
Others                               0.001018
Name: RACE, dtype: float64

In [151]:
employee.loc[employee['BASE_SALARY'] > 100000, 'RACE'].value_counts(normalize=True)

White                        0.543860
Black or African American    0.210526
Asian/Pacific Islander       0.157895
Hispanic/Latino              0.087719
Name: RACE, dtype: float64

### Problem 5
<span  style="color:green; font-size:16px">Save the two distributions you found in Problem 4 to variables. They should be Series. Divide the over-100k distribution by the overall distribution.</span>

In [155]:
dist_all = employee['RACE'].value_counts(normalize=True)
dist_100k = employee.loc[employee['BASE_SALARY'] > 100000, 'RACE'].value_counts(normalize=True)

In [156]:
dist_100k / dist_all

American Indian or Alaskan Native         NaN
Asian/Pacific Islander               2.899656
Black or African American            0.590977
Hispanic/Latino                      0.359101
Others                                    NaN
White                                1.607044
Name: RACE, dtype: float64
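
The `NaN` values appear because Series division aligns on the index, and two categories are absent from the over-100k distribution. If a ratio of 0 is preferred for absent categories, one option (a sketch on hypothetical distributions, not part of the original solution) is to reindex before dividing.

```python
import pandas as pd

# Hypothetical distributions; category 'C' is missing from the second one
dist_all_toy = pd.Series({'A': 0.5, 'B': 0.4, 'C': 0.1})
dist_high_toy = pd.Series({'A': 0.75, 'B': 0.25})

raw = dist_high_toy / dist_all_toy   # aligned on the index, so 'C' becomes NaN
filled = dist_high_toy.reindex(dist_all_toy.index, fill_value=0) / dist_all_toy

print(raw)
print(filled)
```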

### Problem 6
<span  style="color:green; font-size:16px">Select all females who are part of the Houston Police Department and all males who are in the Library department. Select only the DEPARTMENT and GENDER columns.</span>

In [163]:
criteria1 = (employee['DEPARTMENT'] == 'Houston Police Department-HPD') & (employee['GENDER'] == 'Female')
criteria2 = (employee['DEPARTMENT'] == 'Library') & (employee['GENDER'] == 'Male')
employee.loc[criteria1 | criteria2, ['DEPARTMENT', 'GENDER']].head(16)

Unnamed: 0,DEPARTMENT,GENDER
55,Houston Police Department-HPD,Female
67,Houston Police Department-HPD,Female
113,Houston Police Department-HPD,Female
123,Houston Police Department-HPD,Female
136,Houston Police Department-HPD,Female
137,Houston Police Department-HPD,Female
179,Houston Police Department-HPD,Female
185,Houston Police Department-HPD,Female
188,Houston Police Department-HPD,Female
196,Houston Police Department-HPD,Female


### Problem 7
<span  style="color:green; font-size:16px">Select all the White, Black and Hispanic employees who are in the Houston Police Department, the Houston Fire Department or the Parks & Recreation department. Select only the RACE and DEPARTMENT columns.</span>

In [167]:
races = ['White', 'Black or African American', 'Hispanic/Latino']
depts = ['Houston Police Department-HPD', 'Houston Fire Department (HFD)', 'Parks & Recreation']
criteria = employee['RACE'].isin(races) & employee['DEPARTMENT'].isin(depts)
employee.loc[criteria, ['RACE', 'DEPARTMENT']].head()

Unnamed: 0,RACE,DEPARTMENT
2,White,Houston Police Department-HPD
3,White,Houston Fire Department (HFD)
5,Black or African American,Houston Police Department-HPD
10,Hispanic/Latino,Houston Fire Department (HFD)
14,Black or African American,Houston Police Department-HPD
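
`isin` is shorthand for chaining several equality tests with `|`. A short sketch on a hypothetical Series confirms that the two forms give the same boolean mask.

```python
import pandas as pd

# Hypothetical race column
race = pd.Series(['White', 'Others', 'Hispanic/Latino', 'White'])

with_isin = race.isin(['White', 'Hispanic/Latino'])
with_or = (race == 'White') | (race == 'Hispanic/Latino')

print(with_isin.equals(with_or))
```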


### Problem 8
<span  style="color:green; font-size:16px">What is the most common department for Black females? How about for Hispanic males?</span>

In [171]:
criteria = (employee['RACE'] == 'Black or African American') & (employee['GENDER'] == 'Female')
employee.loc[criteria, 'DEPARTMENT'].value_counts().head()

Houston Police Department-HPD     76
Public Works & Engineering-PWE    61
Health & Human Services           43
Parks & Recreation                15
Houston Airport System (HAS)      14
Name: DEPARTMENT, dtype: int64

In [172]:
criteria = (employee['RACE'] == 'Hispanic/Latino') & (employee['GENDER'] == 'Male')
employee.loc[criteria, 'DEPARTMENT'].value_counts().head()

Houston Police Department-HPD     116
Houston Fire Department (HFD)      91
Public Works & Engineering-PWE     49
Houston Airport System (HAS)       18
Parks & Recreation                 15
Name: DEPARTMENT, dtype: int64

In [173]:
# programmatically get the top department
# (criteria still holds the Hispanic male filter from the previous cell)
dept_counts = employee.loc[criteria, 'DEPARTMENT'].value_counts()
dept_counts.index[0]

'Houston Police Department-HPD'
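
Taking `index[0]` works because `value_counts` sorts counts in descending order by default. An alternative that does not depend on the sort order is `idxmax`; the sketch below uses hypothetical department names.

```python
import pandas as pd

# Hypothetical department column
dept = pd.Series(['Police', 'Fire', 'Police', 'Library', 'Police', 'Fire'])

counts = dept.value_counts()
top_by_position = counts.index[0]   # relies on the descending sort
top_by_idxmax = counts.idxmax()     # label of the largest count

print(top_by_position, top_by_idxmax)
```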

### Problem 9
<span  style="color:green; font-size:16px">Who makes more money on average, 'Black or African American' females or White males?</span>

In [174]:
criteria = (employee['RACE'] == 'Black or African American') & (employee['GENDER'] == 'Female')
black_female = employee.loc[criteria, 'BASE_SALARY']

criteria = (employee['RACE'] == 'White') & (employee['GENDER'] == 'Male')
white_male = employee.loc[criteria, 'BASE_SALARY']

black_female.mean(), white_male.mean()

(48915.42123287671, 63940.38811881188)

### Problem 10
<span  style="color:green; font-size:16px">Set the index to be **`UNIQUE_ID`** and save the result to a new variable. Then use **`.loc`** to select employees 2440 and 480 and columns DEPARTMENT through GENDER.</span>

In [178]:
employee_idx = employee.set_index('UNIQUE_ID')

In [180]:
employee_idx.loc[[2440, 480], 'DEPARTMENT':'GENDER']

UNIQUE_ID,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER
2440,Houston Police Department-HPD,66614.0,Black or African American,Full Time,Male
480,Houston Police Department-HPD,27914.0,Black or African American,Full Time,Female


# 7. Case Study - Do Employees with more Experience make more Money?

In [181]:
import pandas as pd
import numpy as np

employee = pd.read_csv('data/employee.csv', parse_dates=['HIRE_DATE', 'JOB_DATE'])
employee['NEW_CONSTANT_COLUMN'] = 5
employee['RACE_GENDER'] = employee['RACE'] + '-' + employee['GENDER']

np.random.seed(123)
n = len(employee)
employee['RANDOM_BONUS'] = np.random.rand(n) * .1

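# The 'Y' unit of pd.Timedelta is rejected by newer pandas versions; 365.2425 days is the same average year length
employee['YEARS_EXPERIENCE'] = (pd.Timestamp('2016-12-1') - employee['HIRE_DATE']) / pd.Timedelta(days=365.2425)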
employee['EXPERIENCE_LEVEL'] =  pd.cut(employee['YEARS_EXPERIENCE'], 
                                       bins=[0, 5, 15, 100], 
                                       labels=['Novice', 'Experienced', 'Senior'])
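
The experience columns come from two steps: dividing a timedelta by a one-year `Timedelta` to get a float number of years, then binning with `pd.cut`. The sketch below repeats both steps on hypothetical hire dates, assuming an average year of 365.2425 days.

```python
import pandas as pd

# Hypothetical hire dates
hire = pd.Series(pd.to_datetime(['2006-12-01', '2013-12-01']))

elapsed = pd.Timestamp('2016-12-01') - hire       # Series of timedeltas
years = elapsed / pd.Timedelta(days=365.2425)     # Series of floats

# Bins are (0, 5], (5, 15] and (15, 100]
levels = pd.cut(years, bins=[0, 5, 15, 100],
                labels=['Novice', 'Experienced', 'Senior'])

print(years)
print(levels)
```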

### Problem 1
<span  style="color:green; font-size:16px">Create new columns **`BONUS`** and **`TOTAL_COMP`**. Use column **`RANDOM_BONUS`** to calculate the bonus.</span>

In [183]:
employee['BONUS'] = employee['RANDOM_BONUS'] * employee['BASE_SALARY']
employee['TOTAL_COMP'] = employee['BASE_SALARY'] + employee['BONUS']

employee.iloc[:3, -5:]

Unnamed: 0,RANDOM_BONUS,YEARS_EXPERIENCE,EXPERIENCE_LEVEL,BONUS,TOTAL_COMP
0,0.069647,10.472494,Experienced,8487.31279,130349.31279
1,0.028614,16.369946,Senior,747.539013,26872.539013
2,0.022685,1.826184,Novice,1027.160697,46306.160697


### Problem 2
<span  style="color:green; font-size:16px">Use the **`EXPERIENCE_LEVEL`** column to determine if more experienced employees make more money.</span>

In [184]:
novice = employee.loc[employee['EXPERIENCE_LEVEL'] == 'Novice', 'BASE_SALARY']
exper = employee.loc[employee['EXPERIENCE_LEVEL'] == 'Experienced', 'BASE_SALARY']
senior = employee.loc[employee['EXPERIENCE_LEVEL'] == 'Senior', 'BASE_SALARY']

novice.mean(), exper.mean(), senior.mean()

(44987.484, 55264.92867981791, 63638.224209078406)

# 8. More on DataFrames

In [185]:
college = pd.read_csv('data/college.csv', index_col='INSTNM')

### Problem 1
<span  style="color:green; font-size:16px">Re-read the college.csv file into the variable **`college2`**. Use the documentation of the **`read_csv`** function to set INSTNM as the index column on read and to skip the first 20 rows while keeping the header.</span>

In [186]:
college2 = pd.read_csv('data/college.csv', index_col='INSTNM', header=0, skiprows=range(1,21))
college2.head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
George C Wallace State Community College-Hanceville,Hanceville,AL,0.0,0.0,0.0,0,,,0.0,4920.0,0.863,0.0612,0.0362,0.0065,0.0089,0.0,0.0,0.0059,0.0183,0.4203,1,0.5026,0.4192,0.3229,28800,11186
George C Wallace State Community College-Selma,Selma,AL,0.0,0.0,0.0,0,,,0.0,1513.0,0.1956,0.7449,0.0026,0.004,0.0013,0.0,0.0033,0.004,0.0443,0.384,1,0.7645,0.0,0.3318,24200,PrivacySuppressed
Herzing University-Birmingham,Birmingham,AL,0.0,0.0,0.0,0,,,0.0,302.0,0.3543,0.5265,0.0166,0.0066,0.0,0.0,0.0563,0.0,0.0397,0.5497,1,0.6541,0.7736,0.7813,42300,23216.5
Huntingdon College,Montgomery,AL,0.0,0.0,0.0,1,510.0,490.0,0.0,1149.0,0.6388,0.1993,0.0252,0.0078,0.0122,0.0017,0.0261,0.0061,0.0827,0.2097,1,0.3982,0.7153,0.1937,36500,26230
Heritage Christian University,Florence,AL,0.0,0.0,0.0,1,,,0.0,62.0,0.7419,0.1129,0.0484,0.0,0.0323,0.0161,0.0,0.0161,0.0323,0.4355,1,0.6087,0.4493,0.5942,PrivacySuppressed,PrivacySuppressed
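
Passing a range that starts at 1 to `skiprows` skips data lines while leaving line 0, the header, in place. A self-contained sketch on a small in-memory CSV (hypothetical contents) skips the first two data rows the same way.

```python
import io
import pandas as pd

# Hypothetical CSV text: one header line followed by four data rows
csv_text = "name,score\na,1\nb,2\nc,3\nd,4\n"

# Lines 1 and 2 (rows 'a' and 'b') are skipped; line 0 stays as the header
skipped = pd.read_csv(io.StringIO(csv_text), index_col='name',
                      header=0, skiprows=range(1, 3))

print(skipped)
```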


### Problem 2
<span  style="color:green; font-size:16px">Take a close look at the **`min`** and **`max`** columns. Many columns range from 0 to 1. What kind of data do you think they represent?</span>

They are probably either binary indicator variables (0 or 1) or proportions stored as real numbers between 0 and 1.
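
The problem presumably refers to the `min` and `max` columns of a transposed `describe` summary (an assumption, since that output is not shown here). Under that assumption, a sketch on hypothetical columns finds the 0-to-1 columns and then checks which of them hold only 0s and 1s.

```python
import pandas as pd

# Hypothetical columns: an indicator, a proportion, and a count
toy = pd.DataFrame({'flag': [0, 1, 0, 1],
                    'share': [0.0, 0.25, 0.5, 1.0],
                    'enrollment': [10, 200, 3000, 40]})

summary = toy.describe().T
unit_range = summary[(summary['min'] == 0) & (summary['max'] == 1)]

# True where every value is exactly 0 or 1, i.e. a binary indicator
is_binary = toy[unit_range.index].isin([0, 1]).all()

print(unit_range[['min', 'max']])
print(is_binary)
```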

### Problem 3
<span  style="color:green; font-size:16px">Sort first by **`STABBR`** ascending and then by **`CITY`** descending. Read the docs on **`sort_values`** to learn how to sort two columns at the same time.</span>

In [187]:
college.sort_values(by=['STABBR', 'CITY'], ascending=[True, False]).head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alaska Christian College,Soldotna,AK,0.0,0.0,0.0,1,,,0.0,68.0,0.0588,0.0,0.0147,0.0,0.7794,0.0,0.0147,0.0,0.1324,0.0735,1,0.8868,0.6792,0.2264,,PrivacySuppressed
AVTEC-Alaska's Institute of Technology,Seward,AK,0.0,0.0,0.0,0,,,0.0,889.0,0.5388,0.0112,0.0427,0.0157,0.1879,0.0112,0.0529,0.0,0.1395,0.6817,1,0.0737,0.0664,0.7127,33500.0,PrivacySuppressed
Alaska Bible College,Palmer,AK,0.0,0.0,0.0,1,,,0.0,27.0,0.8519,0.0,0.037,0.0,0.0741,0.0,0.037,0.0,0.0,0.1481,1,0.3571,0.2857,0.4286,,PrivacySuppressed
University of Alaska Southeast,Juneau,AK,0.0,0.0,0.0,0,,,0.0,1428.0,0.4748,0.0119,0.0623,0.0357,0.1029,0.0147,0.0686,0.0049,0.2241,0.5112,1,0.1769,0.1996,0.555,37400.0,16875
University of Alaska Fairbanks,Fairbanks,AK,0.0,0.0,0.0,0,,,0.0,5536.0,0.4259,0.021,0.0522,0.0126,0.1284,0.0027,0.0401,0.011,0.306,0.3887,1,0.2263,0.255,0.4519,36200.0,19355
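
`ascending` accepts one boolean per column in `by`, matched by position. The sketch below applies the same call to a few hypothetical rows so the mixed ordering is easy to see.

```python
import pandas as pd

# Hypothetical state/city pairs
toy = pd.DataFrame({'state': ['TX', 'AK', 'TX', 'AK'],
                    'city': ['Austin', 'Juneau', 'Waco', 'Anchorage']})

# State A to Z, then city Z to A within each state
ordered = toy.sort_values(by=['state', 'city'], ascending=[True, False])

print(ordered)
```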


### Problem 4
<span  style="color:green; font-size:16px">Rename column **`HBCU`** to **`HISTORICALLY_BLACK`**, **`STABBR`** to **`STATE_ABBR`** and index **`Alabama State University`** to **`ASU`** all in one line of code. </span>

In [188]:
college.rename(columns={'HBCU':'HISTORICALLY_BLACK', 'STABBR':'STATE_ABBR'}, 
               index={'Alabama State University':'ASU'}).head()

INSTNM,CITY,STATE_ABBR,HISTORICALLY_BLACK,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
ASU,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5
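
Note that `rename` returns a new DataFrame and leaves the original untouched unless the result is assigned. A brief sketch with hypothetical labels:

```python
import pandas as pd

# Hypothetical two-row frame
toy = pd.DataFrame({'HBCU': [1, 0]}, index=['School A', 'School B'])

renamed = toy.rename(columns={'HBCU': 'HISTORICALLY_BLACK'},
                     index={'School A': 'SA'})

print(toy.columns.tolist(), renamed.columns.tolist())
```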


### Problem 5
<span  style="color:green; font-size:16px">Sort the index in-place. Output the head of the DataFrame.</span>

In [189]:
college.sort_index(inplace=True)

In [190]:
college.head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
A & W Healthcare Educators,New Orleans,LA,0.0,0.0,0.0,0,,,0.0,40.0,0.0,0.975,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.125,1,0.7018,0.8596,0.6667,,19022.5
A T Still University of Health Sciences,Kirksville,MO,0.0,0.0,0.0,0,,,0.0,,,,,,,,,,,,1,,,,219800,PrivacySuppressed
ABC Beauty Academy,Garland,TX,0.0,0.0,0.0,0,,,0.0,30.0,0.0,0.0333,0.0333,0.9333,0.0,0.0,0.0,0.0,0.0,0.0,0,0.7857,0.0,0.8286,,PrivacySuppressed
ABC Beauty College Inc,Arkadelphia,AR,0.0,0.0,0.0,0,,,0.0,38.0,0.2895,0.6579,0.0526,0.0,0.0,0.0,0.0,0.0,0.0,0.2105,1,0.9815,1.0,0.4688,PrivacySuppressed,16500
AI Miami International University of Art and Design,Miami,FL,0.0,0.0,0.0,0,,,0.0,2778.0,0.0324,0.0198,0.4773,0.0018,0.0,0.0,0.0018,0.0025,0.4644,0.2185,1,0.5507,0.6966,0.3262,29900,31000
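
With `inplace=True` the method modifies the DataFrame and returns `None`, which is why the solution calls `head` in a separate cell. A quick sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical frame with an unsorted index
toy = pd.DataFrame({'x': [1, 2, 3]}, index=['c', 'a', 'b'])

result = toy.sort_index(inplace=True)   # sorts toy itself

print(result)
print(toy.index.tolist())
```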


### Problem 6
<span  style="color:green; font-size:16px">Use the **`max`** method to find the largest value in each row of DataFrame **`college_ugds`**. Take the results and apply the pandas **`cut`** function to create a Series with 3 category labels on how 'diverse' the school is.</span>

In [191]:
college_ugds = college.filter(like='UGDS_')

In [192]:
max_race = college_ugds.max(axis='columns')

diversity = pd.cut(max_race, bins=[0, .4, .7, 1], labels=['High', 'Medium', 'Low'])

diversity.head(15)

INSTNM
A & W Healthcare Educators                                Low
A T Still University of Health Sciences                   NaN
ABC Beauty Academy                                        Low
ABC Beauty College Inc                                 Medium
AI Miami International University of Art and Design    Medium
AIB College of Business                                Medium
AOMA Graduate School of Integrative Medicine              NaN
ASA College                                              High
ASI Career Institute                                   Medium
ASM Beauty World Academy                                  Low
ATA Career Education                                   Medium
ATA College                                              High
ATEP at IVC                                               NaN
ATI College-Norwalk                                    Medium
ATS Institute of Technology                               Low
dtype: category
Categories (3, object): [High < Medium < Low]
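
By default `pd.cut` builds right-closed bins, so the first bin here is (0, 0.4] and a value of exactly 0 would come out as `NaN`, just like a missing value. Whether that matters for this dataset is not checked here; the sketch below only illustrates the edge behavior on hypothetical values, along with `include_lowest=True` as one way to capture 0.

```python
import pandas as pd

# Hypothetical maximum shares, including both bin edges
max_share = pd.Series([0.0, 0.4, 0.41, 1.0])

default = pd.cut(max_share, bins=[0, .4, .7, 1],
                 labels=['High', 'Medium', 'Low'])
with_lowest = pd.cut(max_share, bins=[0, .4, .7, 1],
                     labels=['High', 'Medium', 'Low'], include_lowest=True)

print(default.tolist())
print(with_lowest.tolist())
```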

### Problem 7
<span  style="color:green; font-size:16px">Use the **`select_dtypes`** method on the **`college`** DataFrame to select only the numeric columns. Save this DataFrame to **`college_num`**.</span>

In [193]:
college_num = college.select_dtypes(include=[np.number])

college_num.head()

INSTNM,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV
A & W Healthcare Educators,0.0,0.0,0.0,0,,,0.0,40.0,0.0,0.975,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.125,1,0.7018,0.8596,0.6667
A T Still University of Health Sciences,0.0,0.0,0.0,0,,,0.0,,,,,,,,,,,,1,,,
ABC Beauty Academy,0.0,0.0,0.0,0,,,0.0,30.0,0.0,0.0333,0.0333,0.9333,0.0,0.0,0.0,0.0,0.0,0.0,0,0.7857,0.0,0.8286
ABC Beauty College Inc,0.0,0.0,0.0,0,,,0.0,38.0,0.2895,0.6579,0.0526,0.0,0.0,0.0,0.0,0.0,0.0,0.2105,1,0.9815,1.0,0.4688
AI Miami International University of Art and Design,0.0,0.0,0.0,0,,,0.0,2778.0,0.0324,0.0198,0.4773,0.0018,0.0,0.0,0.0018,0.0025,0.4644,0.2185,1,0.5507,0.6966,0.3262
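
`MD_EARN_WNE_P10` and `GRAD_DEBT_MDN_SUPP` are missing from `college_num`. The likely reason is the `PrivacySuppressed` strings seen in earlier outputs, which would make pandas read those columns as `object`. If numeric versions are wanted, one option (a sketch on hypothetical values, not part of the original solution) is `pd.to_numeric` with `errors='coerce'`.

```python
import pandas as pd

# Hypothetical column mixing numeric strings with a text marker
earn = pd.Series(['28800', 'PrivacySuppressed', '42300'])

# Values that cannot be parsed become NaN
numeric = pd.to_numeric(earn, errors='coerce')

print(earn.dtype, numeric.dtype)
print(numeric)
```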


### Problem 8
<span  style="color:green; font-size:16px">Use **`filter`** to slim your DataFrame down to the **SAT** columns. Then look up how to use the **`dropna`** method and return a DataFrame that has no missing values. Style your DataFrame with the **`bar`** method on the top 10 rows of this DataFrame.</span>

In [194]:
sat = college.filter(like='SAT').dropna()
sat.head(10).style.bar()

INSTNM,SATVRMID,SATMTMID
Abilene Christian University,530,545
Abraham Baldwin Agricultural College,465,460
Adams State University,475,509
Adelphi University,550,565
Adrian College,500,490
Adventist University of Health Sciences,473,453
Alabama A & M University,424,420
Alabama State University,425,430
Alaska Pacific University,555,503
Albany College of Pharmacy and Health Sciences,555,610
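
`filter(like='SAT')` keeps columns whose names contain the substring, and `dropna()` with its defaults removes any row that has at least one missing value. A sketch on hypothetical scores:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with gaps, plus one unrelated column
toy = pd.DataFrame({'SATVRMID': [500, np.nan, 450],
                    'SATMTMID': [510, 480, np.nan],
                    'UGDS': [100, 200, 300]})

sat_toy = toy.filter(like='SAT').dropna()

print(sat_toy)
```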


### Problem 9
<span  style="color:green; font-size:16px">How many colleges have more than 10,000 students and are religiously affiliated?</span>

In [195]:
college[(college.RELAFFIL == 1) & (college.UGDS > 10000)]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Baylor University,Waco,TX,0.0,0.0,0.0,1,610.0,620.0,0.0,13801.0,0.6402,0.0733,0.1413,0.0625,0.0036,0.0005,0.0451,0.0309,0.0024,0.0162,1,0.2135,0.452,0.0245,48200,25131
Brigham Young University-Idaho,Rexburg,ID,0.0,0.0,0.0,1,515.0,505.0,0.0,23865.0,0.8011,0.0048,0.0303,0.0094,0.0035,0.0044,0.0569,0.0659,0.0238,0.3462,1,0.4733,0.2138,0.371,38800,11000
Brigham Young University-Provo,Provo,UT,0.0,0.0,0.0,1,630.0,630.0,0.0,27163.0,0.832,0.005,0.0563,0.0195,0.0037,0.0058,0.0344,0.0314,0.0118,0.0981,1,0.3702,0.1921,0.122,57200,11000
DePaul University,Chicago,IL,0.0,0.0,0.0,1,,,0.0,15858.0,0.5518,0.0832,0.1756,0.0778,0.0008,0.0017,0.0388,0.0292,0.0411,0.1438,1,0.3504,0.578,0.2019,50300,23500
Indiana Wesleyan University-Marion,Marion,IN,0.0,0.0,0.0,1,530.0,525.0,0.0,10218.0,0.7531,0.1825,0.0307,0.0065,0.0024,0.0008,0.0206,0.0023,0.0012,0.0762,1,0.3816,0.7019,0.6919,46300,24160
Kennesaw State University,Kennesaw,GA,0.0,0.0,0.0,1,545.0,535.0,0.0,23058.0,0.6082,0.1904,0.0755,0.0339,0.0023,0.0016,0.042,0.0186,0.0273,0.2397,1,0.4067,0.5462,0.2518,40000,22750
Liberty University,Lynchburg,VA,0.0,0.0,0.0,1,525.0,510.0,0.0,49340.0,0.5121,0.155,0.0166,0.0093,0.0059,0.0022,0.0227,0.0135,0.2626,0.4458,1,0.4984,0.6648,0.6265,35600,23250
Loyola University Chicago,Chicago,IL,0.0,0.0,0.0,1,575.0,580.0,0.0,10042.0,0.6028,0.0376,0.132,0.1105,0.001,0.0021,0.0604,0.0352,0.0183,0.0833,1,0.2817,0.6092,0.0804,50700,25000
Saint Leo University,Saint Leo,FL,0.0,0.0,0.0,1,,,0.0,11976.0,0.3823,0.3696,0.1104,0.0124,0.005,0.0022,0.0145,0.0256,0.0781,0.3059,1,0.4828,0.6032,0.7228,42100,25000
St John's University-New York,Queens,NY,0.0,0.0,0.0,1,540.0,560.0,0.0,10878.0,0.338,0.189,0.1452,0.182,0.0024,0.0033,0.0448,0.053,0.0423,0.0297,1,0.3009,0.5599,0.038,52700,25910


In [196]:
# final answer
college[(college.RELAFFIL == 1) & (college.UGDS > 10000)].shape[0]

10
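
Since only a count is needed, summing the boolean mask gives the same number without building the filtered DataFrame. A sketch on hypothetical rows, using bracket column access:

```python
import pandas as pd

# Hypothetical colleges
toy = pd.DataFrame({'RELAFFIL': [1, 0, 1, 1],
                    'UGDS': [12000, 15000, 500, 20000]})

mask = (toy['RELAFFIL'] == 1) & (toy['UGDS'] > 10000)

count_shape = toy[mask].shape[0]   # rows in the filtered DataFrame
count_sum = int(mask.sum())        # each True counts as 1

print(count_shape, count_sum)
```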