# Course Solutions

1. [Pandas Intro](#1.-Pandas-Intro)
1. [Selecting Subsets of Data - DataFrames](#2.-Selecting-Subsets-of-Data---DataFrames)
1. [Selecting Subsets of Data - Series](#3.-Selecting-Subsets-of-Data---Series)
1. [Boolean Indexing - DataFrames](#4.-Boolean-Indexing---DataFrames)
1. [Boolean Indexing More](#5.-Boolean-Indexing-More)
1. [Series Attributes and Statistical Methods](#6.-Series-Attributes-and-Statistical-Methods)
1. [Series Methods More](#7.-Series-Methods-More)
1. [String Series Methods](#8.-String-Series-Methods)
1. [DataFrame Attributes and Methods](#9.-DataFrame-Attributes-and-Methods)
1. [DataFrame Methods More](#10.-DataFrame-Methods-More)

# 1. Pandas Intro

In [1]:
import pandas as pd
import numpy as np

pd.options.display.max_columns = 40
bikes = pd.read_csv('data/bikes.csv')

### Problem 1
<span  style="color:green; font-size:16px">Select the column **`events`**, the type of weather that was recorded and assign it to a variable with the same name. Output the first 10 values of it.</span>

In [2]:
events = bikes['events']
events.head(10)

0    mostlycloudy
1    partlycloudy
2    mostlycloudy
3    mostlycloudy
4    partlycloudy
5    mostlycloudy
6          cloudy
7          cloudy
8          cloudy
9    mostlycloudy
Name: events, dtype: object

### Problem 2
<span  style="color:green; font-size:16px">What type of object is **`events`**?</span>

In [3]:
# it's a Series
type(events)

pandas.core.series.Series

### Problem 3
<span  style="color:green; font-size:16px">Select the last 2 rows of the **`bikes`** DataFrame and assign it to the variable **`bikes_last_2`**. What type of object is **`bikes_last_2`**?</span>

In [4]:
# it's a DataFrame
bikes_last_2 = bikes.tail(2)
type(bikes_last_2)

pandas.core.frame.DataFrame

### Problem 4
<span  style="color:green; font-size:16px">What type of object is returned from the **`dtypes`** attribute?</span>

In [5]:
# a Series
type(bikes.dtypes)

pandas.core.series.Series

# 2. Selecting Subsets of Data - DataFrame

In [6]:
movie = pd.read_csv('data/movie.csv', index_col='title')

### Problem 1
<span  style="color:green; font-size:16px">Select the column with the director's name as a Series</span>

In [7]:
movie['director_name'].head()

title
Avatar                                            James Cameron
Pirates of the Caribbean: At World's End         Gore Verbinski
Spectre                                              Sam Mendes
The Dark Knight Rises                         Christopher Nolan
Star Wars: Episode VII - The Force Awakens          Doug Walker
Name: director_name, dtype: object

### Problem 2
<span  style="color:green; font-size:16px">Select the column with the director's name and number of Facebook likes.</span>

In [8]:
movie[['director_name', 'director_fb']].head()

Unnamed: 0_level_0,director_name,director_fb
title,Unnamed: 1_level_1,Unnamed: 2_level_1
Avatar,James Cameron,0.0
Pirates of the Caribbean: At World's End,Gore Verbinski,563.0
Spectre,Sam Mendes,0.0
The Dark Knight Rises,Christopher Nolan,22000.0
Star Wars: Episode VII - The Force Awakens,Doug Walker,131.0


### Problem 3
<span  style="color:green; font-size:16px">Select all columns for the movie 'The Dark Knight Rises'.</span>

In [9]:
movie.loc['The Dark Knight Rises']

year                                                            2012
color                                                          Color
content_rating                                                 PG-13
duration                                                         164
director_name                                      Christopher Nolan
director_fb                                                    22000
actor1                                                     Tom Hardy
actor1_fb                                                      27000
actor2                                                Christian Bale
actor2_fb                                                      23000
actor3                                          Joseph Gordon-Levitt
actor3_fb                                                      23000
gross                                                    4.48131e+08
genres                                               Action|Thriller
num_reviews                       

### Problem 4
<span  style="color:green; font-size:16px">Select all columns for the movies 'Tangled' and 'Avatar'.</span>

In [10]:
movie.loc[['Tangled', 'Avatar']]

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Tangled,2010.0,Color,PG,100.0,Nathan Greno,15.0,Brad Garrett,799.0,Donna Murphy,553.0,M.C. Gainey,284.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,324.0,294810,17th century|based on fairy tale|disney|flower...,English,USA,260000000.0,7.8
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9


### Problem 5
<span  style="color:green; font-size:16px">What year was 'Tangled' and 'Avatar' made and what was their IMBD scores?</span>

In [11]:
movie.loc[['Tangled', 'Avatar'], ['year', 'imdb_score']]

Unnamed: 0_level_0,year,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1
Tangled,2010.0,7.8
Avatar,2009.0,7.9


### Problem 6
<span  style="color:green; font-size:16px">Select the rows with integer location 10, 5, and 1</span>

In [12]:
movie.iloc[[10, 5, 1]]

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Batman v Superman: Dawn of Justice,2016.0,Color,PG-13,183.0,Zack Snyder,0.0,Henry Cavill,15000.0,Lauren Cohan,4000.0,Alan D. Purwin,2000.0,330249062.0,Action|Adventure|Sci-Fi,673.0,371639,based on comic book|batman|sequel to a reboot|...,English,USA,250000000.0,6.9
John Carter,2012.0,Color,PG-13,132.0,Andrew Stanton,475.0,Daryl Sabara,640.0,Samantha Morton,632.0,Polly Walker,530.0,73058679.0,Action|Adventure|Sci-Fi,462.0,212204,alien|american civil war|male nipple|mars|prin...,English,USA,263700000.0,6.6
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1


### Problem 7
<span  style="color:green; font-size:16px">Select the columns with integer location 10, 5, and 1</span>

In [13]:
movie.iloc[:, [10, 5, 1]].head()

Unnamed: 0_level_0,actor3,director_fb,color
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Avatar,Wes Studi,0.0,Color
Pirates of the Caribbean: At World's End,Jack Davenport,563.0,Color
Spectre,Stephanie Sigman,0.0,Color
The Dark Knight Rises,Joseph Gordon-Levitt,22000.0,Color
Star Wars: Episode VII - The Force Awakens,,131.0,


# 3. Selecting Subsets of Data - Series

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the index to be the title. Select `actor1` as a Series. Who is the `actor1` for 'My Big Fat Greek Wedding'?</span>

In [14]:
movie = pd.read_csv('data/movie.csv', index_col='title')
actor1 = movie['actor1']

In [15]:
actor1.loc['My Big Fat Greek Wedding']

'Nia Vardalos'

### Problem 2
<span  style="color:green; font-size:16px">Find `actor1` for your favorite two movies?</span>

In [16]:
actor1.loc[['Titanic', 'Blood Diamond']]

title
Titanic          Leonardo DiCaprio
Blood Diamond    Leonardo DiCaprio
Name: actor1, dtype: object

### Problem 3
<span  style="color:green; font-size:16px">Select the last 10 values from `actor1` using two different ways?</span>

In [17]:
actor1.iloc[-10:]

title
Primer                       Shane Carruth
Cavite                         Ian Gamazon
El Mariachi                Carlos Gallardo
The Mongol King             Richard Jewell
Newlyweds                      Kerry Bishé
Signed Sealed Delivered        Eric Mabius
The Following                  Natalie Zea
A Plague So Pleasant           Eva Boehnke
Shanghai Calling                 Alan Ruck
My Date with Drew              John August
Name: actor1, dtype: object

In [18]:
actor1.tail(10)

title
Primer                       Shane Carruth
Cavite                         Ian Gamazon
El Mariachi                Carlos Gallardo
The Mongol King             Richard Jewell
Newlyweds                      Kerry Bishé
Signed Sealed Delivered        Eric Mabius
The Following                  Natalie Zea
A Plague So Pleasant           Eva Boehnke
Shanghai Calling                 Alan Ruck
My Date with Drew              John August
Name: actor1, dtype: object

# 4. Boolean Indexing - DataFrames

In [20]:
movie = pd.read_csv('data/movie.csv', index_col='title')
movie.head()

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,,,,Documentary,,8,,,,,7.1


### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the index to be the title. Select all movies that have Tom Hanks as `actor1`. How many of these movies has he starred in?</span>

In [21]:
filt = movie['actor1'] == 'Tom Hanks'
hanks_movies = movie[filt]
hanks_movies.head()

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Toy Story 3,2010.0,Color,G,103.0,Lee Unkrich,125.0,Tom Hanks,15000.0,John Ratzenberger,1000.0,Don Rickles,721.0,414984497.0,Adventure|Animation|Comedy|Family|Fantasy,453.0,544884,college|day care|escape|teddy bear|toy,English,USA,200000000.0,8.3
The Polar Express,2004.0,Color,G,100.0,Robert Zemeckis,0.0,Tom Hanks,15000.0,Eddie Deezen,726.0,Peter Scolari,267.0,665426.0,Adventure|Animation|Family|Fantasy,188.0,120798,boy|christmas|christmas eve|north pole|train,English,USA,165000000.0,6.6
Angels & Demons,2009.0,Color,PG-13,146.0,Ron Howard,2000.0,Tom Hanks,15000.0,Ayelet Zurer,745.0,Armin Mueller-Stahl,294.0,133375846.0,Mystery|Thriller,298.0,207839,conclave|illuminati|murder|reference to bernin...,English,USA,150000000.0,6.7
The Da Vinci Code,2006.0,Color,PG-13,174.0,Ron Howard,2000.0,Tom Hanks,15000.0,Seth Gabel,574.0,Jürgen Prochnow,362.0,217536138.0,Mystery|Thriller,294.0,314253,based on supposedly true story|holy grail|mary...,English,USA,125000000.0,6.6
Cloud Atlas,2012.0,Color,R,172.0,Tom Tykwer,670.0,Tom Hanks,15000.0,Jim Sturgess,5000.0,Jim Broadbent,1000.0,27098580.0,Drama|Sci-Fi,511.0,284825,composer|future|letter|nonlinear timeline|nurs...,English,Germany,102000000.0,7.5


He's starred in 24 movies

In [22]:
hanks_movies.shape

(24, 21)

### Problem 2
<span  style="color:green; font-size:16px">Select movies with and IMDB score greater than 9.</span>

In [23]:
filt = movie['imdb_score'] > 9
movie[filt]

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
The Shawshank Redemption,1994.0,Color,R,142.0,Frank Darabont,0.0,Morgan Freeman,11000.0,Jeffrey DeMunn,745.0,Bob Gunton,461.0,28341469.0,Crime|Drama,199.0,1689764,escape from prison|first person narration|pris...,English,USA,25000000.0,9.3
Towering Inferno,,Color,,65.0,John Blanchard,0.0,Martin Short,770.0,Andrea Martin,179.0,Joe Flaherty,176.0,,Comedy,,10,,English,Canada,,9.5
Dekalog,,Color,TV-MA,55.0,,,Krystyna Janda,20.0,Olaf Lubaszenko,3.0,Olgierd Lukaszewicz,2.0,447093.0,Drama,53.0,12590,meaning of life|moral challenge|morality|searc...,Polish,Poland,,9.1
The Godfather,1972.0,Color,R,175.0,Francis Ford Coppola,0.0,Al Pacino,14000.0,Marlon Brando,10000.0,Robert Duvall,3000.0,134821952.0,Crime|Drama,208.0,1155770,crime family|mafia|organized crime|patriarch|r...,English,USA,6000000.0,9.2
Kickboxer: Vengeance,2016.0,,,90.0,John Stockwell,134.0,Matthew Ziff,260000.0,T.J. Storm,454.0,Sam Medina,354.0,,Action,2.0,246,,,USA,17000000.0,9.1


### Problem 3
<span  style="color:green; font-size:16px">Select all movies from the 1970s.</span>

In [24]:
filt1 = movie['year'] >= 1970
filt2 = movie['year'] <= 1979
filt_all = filt1 & filt2

movie[filt_all].head()

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
All That Jazz,1979.0,Color,R,123.0,Bob Fosse,189.0,Roy Scheider,813.0,Ben Vereen,388.0,Max Wright,87.0,,Comedy|Drama|Music|Musical,84.0,19228,dancer|editing|stand up comedian|surgery|vomiting,English,USA,,7.8
Superman,1978.0,Color,PG,188.0,Richard Donner,503.0,Marlon Brando,10000.0,Margot Kidder,593.0,Ned Beatty,467.0,134218018.0,Action|Adventure|Drama|Romance|Sci-Fi,169.0,126357,1970s|clark kent|planet|superhero|year 1978,English,USA,55000000.0,7.3
Solaris,1972.0,Black and White,PG,115.0,Andrei Tarkovsky,0.0,Donatas Banionis,29.0,Anatoliy Solonitsyn,29.0,Natalya Bondarchuk,12.0,,Drama|Mystery|Sci-Fi,144.0,54057,hallucination|ocean|psychologist|scientist|spa...,Russian,Soviet Union,1000000.0,8.1
Mean Streets,1973.0,Color,R,112.0,Martin Scorsese,17000.0,Robert De Niro,22000.0,David Carradine,926.0,David Proval,354.0,32645.0,Crime|Drama|Romance|Thriller,112.0,67797,bar|catholic guilt|epilepsy|italian american|m...,English,USA,500000.0,7.4
Star Trek: The Motion Picture,1979.0,Color,PG,143.0,Robert Wise,338.0,Leonard Nimoy,12000.0,Nichelle Nichols,664.0,Walter Koenig,643.0,82300000.0,Adventure|Mystery|Sci-Fi,134.0,63330,alien|space|space station|spacecraft|warp speed,English,USA,35000000.0,6.4


### Problem 4
<span  style="color:green; font-size:16px">Select all movies from the 1970s that had IMDB scores greater than 8</span>

In [25]:
filt1 = movie['year'] >= 1970
filt2 = movie['year'] <= 1979 
filt3 = movie['imdb_score'] > 8
filt_all = filt1 & filt2 & filt3

movie[filt_all].head()

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Solaris,1972.0,Black and White,PG,115.0,Andrei Tarkovsky,0.0,Donatas Banionis,29.0,Anatoliy Solonitsyn,29.0,Natalya Bondarchuk,12.0,,Drama|Mystery|Sci-Fi,144.0,54057,hallucination|ocean|psychologist|scientist|spa...,Russian,Soviet Union,1000000.0,8.1
Apocalypse Now,1979.0,Color,R,289.0,Francis Ford Coppola,0.0,Harrison Ford,11000.0,Marlon Brando,10000.0,Robert Duvall,3000.0,78800000.0,Drama|War,261.0,450676,army|green beret|insanity|jungle|vietnam,English,USA,31500000.0,8.5
The Deer Hunter,1978.0,Color,R,183.0,Michael Cimino,517.0,Robert De Niro,22000.0,Meryl Streep,11000.0,John Savage,652.0,,Drama|War,140.0,232577,escape|friend|party|pittsburgh steelers|vietnam,English,UK,15000000.0,8.2
The Godfather: Part II,1974.0,Color,R,220.0,Francis Ford Coppola,0.0,Robert De Niro,22000.0,Al Pacino,14000.0,Robert Duvall,3000.0,57300000.0,Crime|Drama,149.0,790926,1950s|corrupt politician|lake tahoe nevada|mel...,English,USA,13000000.0,9.0
Star Wars: Episode IV - A New Hope,1977.0,Color,PG,125.0,George Lucas,0.0,Harrison Ford,11000.0,Peter Cushing,1000.0,Kenny Baker,504.0,460935665.0,Action|Adventure|Fantasy|Sci-Fi,282.0,911097,death star|empire|galactic war|princess|rebellion,English,USA,11000000.0,8.7


### Problem 5
<span  style="color:green; font-size:16px">Select movies that were rated either R, PG-13, or PG.</span>

In [26]:
ratings = ['R', 'PG-13', 'PG']
filt = movie['content_rating'].isin(ratings)

movie[filt].head()

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
John Carter,2012.0,Color,PG-13,132.0,Andrew Stanton,475.0,Daryl Sabara,640.0,Samantha Morton,632.0,Polly Walker,530.0,73058679.0,Action|Adventure|Sci-Fi,462.0,212204,alien|american civil war|male nipple|mars|prin...,English,USA,263700000.0,6.6


### Problem 6
<span  style="color:green; font-size:16px">Select movies that are either rated PG-13 or were made after 2010.</span>

In [27]:
filt1 = movie['content_rating'] == 'PG-13'
filt2 = movie['year'] > 2010
filt_all = filt1 | filt2

movie[filt_all].head()

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
John Carter,2012.0,Color,PG-13,132.0,Andrew Stanton,475.0,Daryl Sabara,640.0,Samantha Morton,632.0,Polly Walker,530.0,73058679.0,Action|Adventure|Sci-Fi,462.0,212204,alien|american civil war|male nipple|mars|prin...,English,USA,263700000.0,6.6


### Problem 7
<span  style="color:green; font-size:16px">Reverse the condition from problem 6. In words, what have you selected.</span>

The following selects non-PG-13 movies made in the year 2010 or before.

In [28]:
filt1 = movie['content_rating'] == 'PG-13'
filt2 = movie['year'] > 2010
filt_all = filt1 | filt2
movie[~filt_all].head()

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,,,,Documentary,,8,,,,,7.1
Tangled,2010.0,Color,PG,100.0,Nathan Greno,15.0,Brad Garrett,799.0,Donna Murphy,553.0,M.C. Gainey,284.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,324.0,294810,17th century|based on fairy tale|disney|flower...,English,USA,260000000.0,7.8
Harry Potter and the Half-Blood Prince,2009.0,Color,PG,153.0,David Yates,282.0,Alan Rickman,25000.0,Daniel Radcliffe,11000.0,Rupert Grint,10000.0,301956980.0,Adventure|Family|Fantasy|Mystery,375.0,321795,blood|book|love|potion|professor,English,UK,250000000.0,7.5
The Chronicles of Narnia: Prince Caspian,2008.0,Color,PG,150.0,Andrew Adamson,80.0,Peter Dinklage,22000.0,Pierfrancesco Favino,216.0,Damián Alcázar,201.0,141614023.0,Action|Adventure|Family|Fantasy,258.0,149922,brother brother relationship|brother sister re...,English,USA,225000000.0,6.6
Alice in Wonderland,2010.0,Color,PG,108.0,Tim Burton,13000.0,Johnny Depp,40000.0,Alan Rickman,25000.0,Anne Hathaway,11000.0,334185206.0,Adventure|Family|Fantasy,451.0,306320,alice in wonderland|mistaking reality for drea...,English,USA,200000000.0,6.5


# 5. Boolean Indexing More

### Problem 1
<span  style="color:green; font-size:16px">Select the wind speed column a a Series and assign it to a variable. Are there any negative wind speeds?</span>

In [29]:
wind = bikes['wind_speed']
wind.head()

0    12.7
1     6.9
2    16.1
3    16.1
4    17.3
Name: wind_speed, dtype: float64

Yes, there is really strong negative wind! Or maybe its just bad data...

In [30]:
filt = wind < 0
wind[filt].head()

22990   -9999.0
27168   -9999.0
28368   -9999.0
29308   -9999.0
29309   -9999.0
Name: wind_speed, dtype: float64

### Problem 2
<span  style="color:green; font-size:16px">Select the events and gender columns for all trip durations longer than 1,000 seconds.</span>

In [31]:
filt = bikes['tripduration'] > 1000
cs = ['events', 'gender']
bikes.loc[filt, cs].head()

Unnamed: 0,events,gender
2,mostlycloudy,Male
8,cloudy,Male
10,mostlycloudy,Male
11,mostlycloudy,Male
12,partlycloudy,Male


### Problem 3
<span  style="color:green; font-size:16px">Read in the movie dataset with the title as the index. We will use this DataFrame for the rest of the problems. Select all the movies such that the Facebook likes for actor 2 are greater than those for actor 1.</span>

In [32]:
movie = pd.read_csv('data/movie.csv', index_col='title')
movie.head()

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,,,,Documentary,,8,,,,,7.1


There are none!

In [33]:
filt = movie['actor2_fb'] > movie['actor2_fb']
movie[filt]

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1


### Problem 4
<span  style="color:green; font-size:16px">Select the year, content rating, and IMDB score columns for movies from the year 2016 with IMDB score less than 4.</span>

In [34]:
filt1 = movie['year'] == 2016
filt2 = movie['imdb_score'] < 4
filt_all = filt1 & filt2
cs = ['year', 'content_rating', 'imdb_score']

movie.loc[filt_all, cs]

Unnamed: 0_level_0,year,content_rating,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fifty Shades of Black,2016.0,R,3.5
Cabin Fever,2016.0,Not Rated,3.7
God's Not Dead 2,2016.0,PG,3.4


# 6. Series Attributes and Statistical Methods

In [35]:
import pandas as pd

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset with the title as the index and assign the `imdb_score` as a Series to variable `score`. Output the first 5 values.</span>

In [36]:
movie = pd.read_csv('data/movie.csv', index_col='title')
movie.head()

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,,,,Documentary,,8,,,,,7.1


In [37]:
score = movie['imdb_score']
score.head()

title
Avatar                                        7.9
Pirates of the Caribbean: At World's End      7.1
Spectre                                       6.8
The Dark Knight Rises                         8.5
Star Wars: Episode VII - The Force Awakens    7.1
Name: imdb_score, dtype: float64

### Problem 2
<span  style="color:green; font-size:16px">What is the maximum and minimum score?</span>

In [38]:
score.max()

9.5

In [39]:
score.min()

1.6

### Problem 3
<span  style="color:green; font-size:16px">How many missing values are there in the `score`?</span>

In [40]:
len(score) - score.count()

0

### Problem 4
<span  style="color:green; font-size:16px">How many movies have scores greater than 6? (Remember that True/False evaluates to 1/0)</span>

In [41]:
(score > 6).sum()

3368

### Problem 5
<span  style="color:green; font-size:16px">Find the difference between the median and mean of the scores.</span>

In [42]:
score.median() - score.mean()

0.1625711960943912

# 7. Series Methods More

### Problem 1
<span  style="color:green; font-size:16px">What percentage of actor 1 Facebook likes are missing?</span>

In [43]:
movie = pd.read_csv('data/movie.csv', index_col='title')
movie.head()

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,,,,Documentary,,8,,,,,7.1


In [44]:
movie['actor1_fb'].isna().mean()

0.0014239218877135883

### Problem 2
<span  style="color:green; font-size:16px">Use the `notna` method to find the number of non-missing values in the actor 1 Facebook like column. Verify this number is the same as the `count` method.</span>

In [45]:
movie['actor1_fb'].notna().sum()

4909

In [46]:
movie['actor1_fb'].count()

4909

### Problem 3
<span  style="color:green; font-size:16px">How many unique directors are there? Look up the `unique` and `nunique` methods</span>

In [47]:
movie['director_name'].nunique()

2397

### Problem 4
<span  style="color:green; font-size:16px">Select the `year` column, sort it, and drop any duplicates? Look up the `drop_duplicates` method.</span>

In [48]:
movie['year'].sort_values().drop_duplicates()

title
Intolerance: Love's Struggle Throughout the Ages           1916.0
Over the Hill to the Poorhouse                             1920.0
The Big Parade                                             1925.0
Metropolis                                                 1927.0
The Broadway Melody                                        1929.0
Hell's Angels                                              1930.0
A Farewell to Arms                                         1932.0
42nd Street                                                1933.0
It Happened One Night                                      1934.0
Top Hat                                                    1935.0
The Charge of the Light Brigade                            1936.0
The Prisoner of Zenda                                      1937.0
You Can't Take It with You                                 1938.0
Gone with the Wind                                         1939.0
Rebecca                                                    1940.0
How 

# 8. String Series Methods

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the title as the index. Assign the actor 1 column to its own Series variable. Make sure to drop missing values from this Series before assigning it.

Which actor 1 has appeared in the most movies? Can you write an expression that returns this actors name as a string?</span>

In [49]:
movie = pd.read_csv('data/movie.csv', index_col='title')
actor1 = movie['actor1'].dropna()
actor1.value_counts().head()

Robert De Niro       48
Johnny Depp          36
Nicolas Cage         32
Denzel Washington    29
Matt Damon           29
Name: actor1, dtype: int64

In [50]:
actor1.value_counts().index[0]

'Robert De Niro'

### Problem 2
<span  style="color:green; font-size:16px">What percent of movies have the top 100 most frequent actor 1's appeared in?</span>

In [51]:
actor1.value_counts(normalize=True).iloc[:100].sum()

0.325117131798737

### Problem 3
<span  style="color:green; font-size:16px">How many actor 1's have appeared in exactly one movie?</span>

In [52]:
(actor1.value_counts() == 1).sum()

1379

### Problem 4
<span  style="color:green; font-size:16px">How many actor 1's have more than 3 e's in their name? Output a unique array of just these actor names so we can manually verify them.</span>

In [53]:
more_than_3e = actor1.str.count('e') > 3
more_than_3e.sum()

101

In [54]:
actor1[more_than_3e].unique()

array(['Jennifer Lawrence', 'Keanu Reeves', 'Seychelle Gabriel',
       'Jeremy Renner', 'Amber Stevens West', 'Peter Greene',
       'Steven Anthony Lawrence', 'Cedric the Entertainer',
       'Sean Pertwee', 'Xander Berkeley', 'Kathleen Freeman',
       'Pierre Perrier', 'Catherine Deneuve', 'George Kennedy',
       'Leighton Meester', 'Steve Guttenberg', 'Emmanuelle Seigner',
       'Jurnee Smollett-Bell', 'Steve Oedekerk',
       'Johannes Silberschneider', 'Bernadette Peters',
       'Jacqueline McKenzie', 'Dee Bradley Baker', 'Jennifer Freeman',
       'Gene Tierney', 'Roscoe Lee Browne', 'Phoebe Legere',
       'Eric Sheffer Stevens', 'Michael Greyeyes', 'Steven Weber',
       'George Newbern', 'Florence Henderson', 'Michelle Simone Miller',
       'Chemeeka Walker', 'Fereshteh Sadre Orafaiy'], dtype=object)

### Problem 5
<span  style="color:green; font-size:16px">Get a unique list of all actors that have the name 'Johnson' as part of their name. Note: When using the </span>

In [55]:
actor1[actor1.str.contains('Johnson').values].unique()

array(['Don Johnson', 'Dwayne Johnson', 'Richard Johnson', 'Eric Johnson',
       'Bill Johnson', 'Nicole Randall Johnson', 'R. Brandon Johnson'],
      dtype=object)

### Problem 6
<span  style="color:green; font-size:16px">How many actor 1 names end in 'x'?</span>

In [56]:
actor1.str.endswith('x').sum()

28

### Problem 7
<span  style="color:green; font-size:16px">The Pandas string methods overlap with the builtin Python string methods. Find all the public method names that are in-common to both. Then find the public methods that are unique to each.</span>

In [57]:
python_str = {method for method in dir(str) if method[0] != '_'}
pandas_str = {method for method in dir(actor1.str) if method[0] != '_'}

In [58]:
python_str & pandas_str

{'capitalize',
 'center',
 'count',
 'encode',
 'endswith',
 'find',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'islower',
 'isnumeric',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill'}

In [59]:
python_str - pandas_str

{'casefold',
 'expandtabs',
 'format',
 'format_map',
 'isidentifier',
 'isprintable',
 'maketrans',
 'splitlines'}

In [60]:
pandas_str - python_str

{'cat',
 'contains',
 'decode',
 'extract',
 'extractall',
 'findall',
 'get',
 'get_dummies',
 'len',
 'match',
 'normalize',
 'pad',
 'repeat',
 'slice',
 'slice_replace',
 'wrap'}

# 9. DataFrame Attributes and Methods

In [61]:
import pandas as pd
college = pd.read_csv('data/college.csv', index_col='instnm')
college_race = college.loc[:, 'ugds_white':'ugds_unkn']
movie = pd.read_csv('data/movie.csv', index_col='title')
pd.options.display.max_columns = 100

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset and calculate the mean of each actor Facebook like column. Which actor (1, 2, or 3) has the highest mean?</span>

In [62]:
movie = pd.read_csv('data/movie.csv', index_col='title')
movie.head()

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,,,,Documentary,,8,,,,,7.1


actor 1 has the most FB likes.

In [63]:
movie[['actor1_fb', 'actor2_fb', 'actor3_fb']].mean()

actor1_fb    6494.488491
actor2_fb    1621.923516
actor3_fb     631.276313
dtype: float64

It's also a good idea to assign just the actor FB likes to their own DataFrame:

In [64]:
actors_fb = movie[['actor1_fb', 'actor2_fb', 'actor3_fb']]
actors_fb.mean()

actor1_fb    6494.488491
actor2_fb    1621.923516
actor3_fb     631.276313
dtype: float64

### Problem 2
<span  style="color:green; font-size:16px">Calculate the total Facebook likes of all three actors for each movie</span>

In [65]:
actors_fb.sum(axis='columns').head()

title
Avatar                                         2791.0
Pirates of the Caribbean: At World's End      46000.0
Spectre                                       11554.0
The Dark Knight Rises                         73000.0
Star Wars: Episode VII - The Force Awakens      143.0
dtype: float64

### Problem 3
<span  style="color:green; font-size:16px">Find the median gross revenue in millions of dollars for the movies that have more than 10,000 total actor FB likes. Do the same for movies with 10,000 or less total actor FB likes.</span>

In [66]:
filt = actors_fb.sum(axis='columns') > 10000
movie.loc[filt, 'gross'].median() / 10 ** 6

42.3919155

In [67]:
movie.loc[~filt, 'gross'].median() / 10 ** 6

16.8157525

### Problem 4
<span  style="color:green; font-size:16px">For each movies made in the year 2016, what is the median of the total actor FB likes?</span>

In [68]:
criteria = movie['year'] == 2016
cols = ['actor1_fb', 'actor2_fb', 'actor3_fb']
movie.loc[criteria, cols].sum(axis='columns').median()

3571.5

### Problem 5
<span  style="color:green; font-size:16px">Using the **college** dataset, find the number of non-missing values in each column and again for each row.</span>

In [69]:
college = pd.read_csv('data/college.csv')

non-missing for each column

In [70]:
college.count()

instnm                7535
city                  7535
stabbr                7535
hbcu                  7164
menonly               7164
womenonly             7164
relaffil              7535
satvrmid              1185
satmtmid              1196
distanceonly          7164
ugds                  6874
ugds_white            6874
ugds_black            6874
ugds_hisp             6874
ugds_asian            6874
ugds_aian             6874
ugds_nhpi             6874
ugds_2mor             6874
ugds_nra              6874
ugds_unkn             6874
pptug_ef              6853
curroper              7535
pctpell               6849
pctfloan              6849
ug25abv               6718
md_earn_wne_p10       6413
grad_debt_mdn_supp    7503
dtype: int64

non-missing values for the rows

In [71]:
college.count(axis='columns').head()

0    27
1    27
2    25
3    27
4    27
dtype: int64

### Problem 6
<span  style="color:green; font-size:16px">What is the average number of missing values for each row?</span>

In [72]:
college.count(axis='columns').mean()

23.70763105507631

# 10. DataFrame Methods More

In [74]:
import pandas as pd
college = pd.read_csv('data/college.csv', index_col='instnm')

### Problem 1
<span  style="color:green; font-size:16px">Find the number of missing values for each row.</span>

In [75]:
college.isna().sum(axis='columns').head(10)

instnm
Alabama A & M University               0
University of Alabama at Birmingham    0
Amridge University                     2
University of Alabama in Huntsville    0
Alabama State University               0
The University of Alabama              0
Central Alabama Community College      2
Athens State University                2
Auburn University at Montgomery        0
Auburn University                      0
dtype: int64

### Problem 2
<span  style="color:green; font-size:16px">What percentage of rows have more than 5 missing values?</span>

In [76]:
(college.isna().sum(axis='columns') > 5).mean()

0.09011280690112806

### Problem 3
<span  style="color:green; font-size:16px">Read the documentation on the `dropna` method. What is the shape of the returned DataFrame when called with the defaults? Call it again except only drop rows if `ugds` is missing. What is the shape of this DataFrame?</span>

In [77]:
college.dropna().shape

(1171, 26)

Pandas forces you to use a list with the **`subset`** parameter. This is unfortunately inconsistent behavior.

In [78]:
college.dropna(subset=['ugds']).shape

(6874, 26)

### Problem 4
<span  style="color:green; font-size:16px">Create a new boolean column in the college named 'Verbal Higher' that is True for every college that has a verbal than math SAT score. Find the mean of this new column. Why does this number look suspiciously low?</span>

In [79]:
college['Verbal Higher'] = college['satvrmid'] > college['satmtmid']

In [80]:
college['Verbal Higher'].mean()

0.048042468480424684

One reason it is so low is that there are mostly missing values for the SAT columns and the comparison operators return False when comparing missing values. Notice that 84% of the values are missing for both SAT columns.

In [81]:
college[['satmtmid', 'satvrmid']].isna().mean()

satmtmid    0.841274
satvrmid    0.842734
dtype: float64

### Problem 5
<span  style="color:green; font-size:16px">Find the real percentage of schools with higher verbal than math SAT scores.</span>

Drop the rows with missing SAT values first.

In [82]:
sat = college[['satmtmid', 'satvrmid']].dropna()
sat.head()

Unnamed: 0_level_0,satmtmid,satvrmid
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1
Alabama A & M University,420.0,424.0
University of Alabama at Birmingham,565.0,570.0
University of Alabama in Huntsville,590.0,595.0
Alabama State University,430.0,425.0
The University of Alabama,565.0,555.0


In [83]:
(sat['satvrmid'] > sat['satmtmid']).mean()

0.30574324324324326

Can also find all those school with equal scores in both subjects.

In [84]:
(sat['satvrmid'] == sat['satmtmid']).mean()

0.08699324324324324

# 11. Split-Apply-Combine Aggregation Basics

In [85]:
import pandas as pd
nyc = pd.read_csv('data/nyc_deaths.csv')
nyc.head()

Unnamed: 0,year,cause,sex,race,deaths
0,2007,Accidents,F,Asian,32
1,2007,Accidents,F,Black,87
2,2007,Accidents,F,Hispanic,71
3,2007,Accidents,F,White,162
4,2007,Accidents,M,Asian,53


### Problem 1
<span  style="color:green; font-size:16px">What year had the most deaths?</span>

In [5]:
year_deaths = nyc.groupby('Year').agg({'Deaths':'sum'})
year_deaths.idxmax()

Deaths    2008
dtype: int64

In [6]:
# one line
nyc.groupby('Year').agg({'Deaths':'sum'}).idxmax()

Deaths    2008
dtype: int64

### Problem 2
<span  style="color:green; font-size:16px">Find the total number of deaths by race and sort by most to least.</span>

In [16]:
nyc.groupby('Race').agg({'Deaths':'sum'}).sort_values('Deaths', ascending=False)

Unnamed: 0_level_0,Deaths
Race,Unnamed: 1_level_1
White,206487
Black,111116
Hispanic,74802
Asian and Pacific Islander,26355
Unknown,6238


### Use the employee dataset for the remaining problems

In [91]:
emp = pd.read_csv('data/employee.csv')
emp.head()

Unnamed: 0,title,dept,salary,race,gender,experience
0,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Male,1
1,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Male,34
2,SENIOR POLICE OFFICER,Houston Police Department-HPD,66614.0,Black,Male,32
3,ENGINEER,Public Works & Engineering-PWE,71680.0,Asian,Male,4
4,CARPENTER,Houston Airport System (HAS),42390.0,White,Male,3


### Problem 3
<span  style="color:green; font-size:16px">Find the maximum salary for each gender.</span>

In [18]:
emp.groupby('GENDER').agg({'BASE_SALARY':'max'})

Unnamed: 0_level_0,BASE_SALARY
GENDER,Unnamed: 1_level_1
Female,178331.0
Male,275000.0


### Problem 4
<span  style="color:green; font-size:16px">Find the median salary for each department.</span>

In [20]:
emp.groupby('DEPARTMENT').agg({'BASE_SALARY':'median'}).head()

Unnamed: 0_level_0,BASE_SALARY
DEPARTMENT,Unnamed: 1_level_1
Admn. & Regulatory Affairs,37710.0
City Controller's Office,57054.0
City Council,54000.0
Convention and Entertainment,38397.0
Dept of Neighborhoods (DON),43742.0


### Problem 5
<span  style="color:green; font-size:16px">Find the average salary for each race. Return a DataFrame with the race as a column.</span>

In [23]:
emp.groupby('RACE', as_index=False).agg({'BASE_SALARY':'mean'})

Unnamed: 0,RACE,BASE_SALARY
0,American Indian or Alaskan Native,60272.1
1,Asian/Pacific Islander,61660.304762
2,Black or African American,50137.801493
3,Hispanic/Latino,52345.562771
4,Others,51278.0
5,White,64419.799012


# 12. Grouping and Aggregating with Multiple Columns

In [92]:
emp = pd.read_csv('data/employee.csv')
emp.head()

Unnamed: 0,title,dept,salary,race,gender,experience
0,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Male,1
1,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Male,34
2,SENIOR POLICE OFFICER,Houston Police Department-HPD,66614.0,Black,Male,32
3,ENGINEER,Public Works & Engineering-PWE,71680.0,Asian,Male,4
4,CARPENTER,Houston Airport System (HAS),42390.0,White,Male,3


### Problem 1
<span  style="color:green; font-size:16px">For each department and gender find the number of unique position titles, the total number of employees and the average salary. Make sure there is no multi-index for the index or columns.</span>

In [93]:
data = emp.groupby(['dept', 'gender'], as_index=False).agg({'title':['nunique','size'],
                                                            'salary':'mean'})

data.columns = ['dept', 'gender', 'Num Unique Positions', 'Size', 'Mean Salary']
data.head(10)

Unnamed: 0,dept,gender,Num Unique Positions,Size,Mean Salary
0,Health & Human Services,Female,45,77,48652.376623
1,Health & Human Services,Male,22,26,59240.0
2,Houston Airport System (HAS),Female,23,35,53730.114286
3,Houston Airport System (HAS),Male,36,68,54124.323529
4,Houston Fire Department (HFD),Female,13,21,52853.047619
5,Houston Fire Department (HFD),Male,24,344,60394.322674
6,Houston Police Department-HPD,Female,37,144,52432.840278
7,Houston Police Department-HPD,Male,26,426,63131.586854
8,Parks & Recreation,Female,11,17,41043.117647
9,Parks & Recreation,Male,17,36,38662.583333


### Problem 2
<span  style="color:green; font-size:16px">For each department, race and gender find the maximum years of experience and salary.</span>

In [95]:
emp.groupby(['dept','race','gender']).agg({'experience': 'max',
                                           'salary': 'max'}).head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,experience,salary
dept,race,gender,Unnamed: 3_level_1,Unnamed: 4_level_1
Health & Human Services,Asian,Female,23,94149.0
Health & Human Services,Asian,Male,25,70864.0
Health & Human Services,Black,Female,34,103270.0
Health & Human Services,Black,Male,29,180416.0
Health & Human Services,Hispanic,Female,25,65589.0
Health & Human Services,Hispanic,Male,14,58406.0
Health & Human Services,Native American,Female,17,58855.0
Health & Human Services,White,Female,33,100791.0
Health & Human Services,White,Male,8,120799.0
Houston Airport System (HAS),Asian,Female,23,32157.0


## Use the college dataset for the rest of the problems

In [96]:
college = pd.read_csv('data/college.csv')
college.head()

Unnamed: 0,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


### Problem 3
<span  style="color:green; font-size:16px">Which city name appears the most frequently. Do this in two different ways. Do it once with and once without the `groupby` method?</span>

In [99]:
size = college.groupby('city').size()
size.head()

city
ARTESIA     1
Aberdeen    3
Abilene     5
Abingdon    2
Abington    1
dtype: int64

In [100]:
size.sort_values(ascending=False).head()

city
New York       87
Chicago        78
Houston        72
Los Angeles    56
Miami          51
dtype: int64

### Without groupby
Just use **`value_counts`**! Much easier

In [47]:
college['CITY'].value_counts().head()

New York       87
Chicago        78
Houston        72
Los Angeles    56
Miami          51
Name: CITY, dtype: int64

### Problem 4
<span  style="color:green; font-size:16px">Find the maximum undergraduate population for each state?</span>

In [101]:
college.groupby('stabbr').agg({'ugds': 'max'}).head(10)

Unnamed: 0_level_0,ugds
stabbr,Unnamed: 1_level_1
AK,12865.0
AL,29851.0
AR,21405.0
AS,1276.0
AZ,151558.0
CA,44744.0
CO,25873.0
CT,18016.0
DC,10433.0
DE,18222.0


### Problem 5
<span  style="color:green; font-size:16px">Do distance only schools tend to have more or less student population than non-distance-only schools?</span>

In [102]:
# They have more
college.groupby('distanceonly').agg({'ugds': 'mean'})

Unnamed: 0_level_0,ugds
distanceonly,Unnamed: 1_level_1
0.0,2334.648135
1.0,6245.74359


### Problem 6
<span  style="color:green; font-size:16px">What state has the lowest percentage of currently operating schools of those that have religious affiliation?</span>

In [106]:
filt = college['relaffil'] == 1
cr = college[filt]
rel_oper_mean = cr.groupby(['stabbr']).agg({'curroper': 'mean'})
rel_oper_mean.head()

Unnamed: 0_level_0,curroper
stabbr,Unnamed: 1_level_1
AK,1.0
AL,0.916667
AR,0.944444
AZ,0.444444
CA,0.585366


In [107]:
# Utah. Answer makes sense.
rel_oper_mean.sort_values('curroper').head()

Unnamed: 0_level_0,curroper
stabbr,Unnamed: 1_level_1
UT,0.4
AZ,0.444444
NV,0.5
CA,0.585366
CT,0.647059
