# 6. DataFrame Descriptive Statistic Methods
DataFrames have the same [descriptive statistical methods][1] as Series. Again, we distinguish between methods that aggregate and those that do not. When called on a Series, a method that performs an aggregation returns a **single** number that summarizes the data. Examples of methods that aggregate are:

* `sum`
* `min`
* `max`
* `mean`
* `median`
* `std` - standard deviation
* `var` - variance
* `count` - returns number of non-na values
* `describe` - returns most of the above aggregations in one Series
* `quantile` - returns given percentile of distribution

Any other method that does not return a single value is not an aggregation. Some examples of these methods are:
* `abs` - takes absolute value
* `round` - round to the nearest given decimal
* `cummin` - cumulative minimum
* `cummax` - cumulative maximum
* `cumsum` - cumulative sum
* `rank` - rank values in a variety of different ways
* `diff` - difference between one element and another
* `pct_change` - percent change from one element to another
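
To make the distinction concrete, here is a minimal sketch on a small made-up Series (the values are illustrative and not taken from the college data). The aggregation collapses the values to one number, while the non-aggregation returns an object of the same length.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
s = pd.Series([3, 1, 4, 1, 5])

total = s.sum()        # aggregation: a single number
running = s.cumsum()   # non-aggregation: one value per original element

print(total)
print(running.tolist())
```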

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-stats

In [1]:
import pandas as pd
pd.options.display.max_columns = 50
college = pd.read_csv('../data/college.csv')
college.head()

Unnamed: 0,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


# Major Differences between DataFrame and Series Methods
When calling one of the above methods on a DataFrame, it is applied to each individual column by default. For instance, if we call the **`sum`** method, each column will be summed individually. Calling the **`sum`** method on a Series produces a single scalar value, while a DataFrame produces a sum for each column.
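
The difference can be seen on a tiny made-up DataFrame (the column names `a` and `b` are placeholders, not columns from the college data). This is only a sketch of the behavior described above.

```python
import pandas as pd

# Hypothetical toy DataFrame
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

col_sum = df['a'].sum()   # Series method: returns a scalar
frame_sum = df.sum()      # DataFrame method: returns a Series, one value per column

print(col_sum)
print(frame_sum)
```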

### Select numeric columns
Many of the statistical methods above work only for numeric columns. We will select all the columns that have undergraduate race proportion data. These columns are located together, starting at **`ugds_white`** and ending at **`ugds_unkn`**.

In [2]:
college_race = college.loc[:, 'ugds_white':'ugds_unkn']
college_race.head()

Unnamed: 0,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn
0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
1,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
2,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
3,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
4,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


### Take the mean of each column
Let's demonstrate calling the **`mean`** aggregation method on each column.

In [3]:
college_race.mean()

ugds_white    0.510207
ugds_black    0.189997
ugds_hisp     0.161635
ugds_asian    0.033544
ugds_aian     0.013813
ugds_nhpi     0.004569
ugds_2mor     0.023950
ugds_nra      0.016086
ugds_unkn     0.045181
dtype: float64

## Did you notice what type of object was returned?
Pandas takes the mean of each column and returns a Series. The new Series has the column names as the index and the mean as the values.

Let's see a couple more aggregations:

In [4]:
college_race.max()

ugds_white    1.0000
ugds_black    1.0000
ugds_hisp     1.0000
ugds_asian    0.9727
ugds_aian     1.0000
ugds_nhpi     0.9983
ugds_2mor     0.5333
ugds_nra      0.9286
ugds_unkn     0.9027
dtype: float64

In [5]:
college_race.std()

ugds_white    0.286958
ugds_black    0.224587
ugds_hisp     0.221854
ugds_asian    0.073777
ugds_aian     0.070196
ugds_nhpi     0.033125
ugds_2mor     0.031288
ugds_nra      0.050172
ugds_unkn     0.093440
dtype: float64

## Changing the Direction of the Operation
Since DataFrames are two-dimensional, you might be interested in performing an operation across the rows, such as summing up each row.

## The `axis` parameter controls the direction of the operation.
Nearly all DataFrame methods have an **`axis`** parameter. This is a crucial parameter to understand, as it controls the direction of the operation. By default, operations happen down each column.

## Referencing each axis by number and by label
DataFrames are two-dimensional and therefore have two axes. The rows are referenced by the number 0 and also by the label 'index'. The columns are referenced by the number 1 and also by the label 'columns'.

## Default value of `axis` is 0
The default value for the **`axis`** parameter is 0. You can also refer to it as 'index'. Let's take the mean again for each column, but instead use the string 'index' for the value of the **`axis`** parameter.

In [6]:
college_race.mean(axis='index')

ugds_white    0.510207
ugds_black    0.189997
ugds_hisp     0.161635
ugds_asian    0.033544
ugds_aian     0.013813
ugds_nhpi     0.004569
ugds_2mor     0.023950
ugds_nra      0.016086
ugds_unkn     0.045181
dtype: float64

This is the exact same thing as **`axis=0`**, which is the default:

In [7]:
college_race.mean(axis=0)

ugds_white    0.510207
ugds_black    0.189997
ugds_hisp     0.161635
ugds_asian    0.033544
ugds_aian     0.013813
ugds_nhpi     0.004569
ugds_2mor     0.023950
ugds_nra      0.016086
ugds_unkn     0.045181
dtype: float64

## Use `axis='columns'`
Let's sum each row by changing the direction of the operation: set the **`axis`** parameter to **`'columns'`**. Each total should be very close to 1, since each row holds the full race distribution of a single school.

In [9]:
college_race.head()

Unnamed: 0,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn
0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
1,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
2,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
3,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
4,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


In [10]:
college_race.sum(axis='columns').head()

0    1.0000
1    0.9999
2    1.0000
3    1.0000
4    1.0000
dtype: float64

You can also use **`axis=1`**

In [11]:
college_race.sum(axis=1).head()

0    1.0000
1    0.9999
2    1.0000
3    1.0000
4    1.0000
dtype: float64

## I always use `axis='columns'`
I always use **`axis='columns'`** and never **`axis=1`**. The reason for this is that the string 'columns' is much more descriptive than the integer 1. I also always use **`axis='index'`** instead of **`axis=0`** for the same reason.

## Confusion between string 'index' and 'columns'
It can be confusing and difficult to remember which direction the operation takes. A little trick that helps me remember: when using **`axis='columns'`**, the result has the same length as a **column** of the DataFrame, that is, one value per row.

![][2]

[2]: images/df_axis.jpg

In [12]:
college_race.shape

(7535, 9)

In [13]:
len(college_race.sum(axis=1))

7535

### Summary of `axis`
* axis 0 - default axis for all DataFrame methods. Its preferred reference label is 'index'. The operations happen vertically, up and down columns. **`df.sum()`** finds the sum of each column individually.
* axis 1 - Its preferred reference label is 'columns'. The operations happen horizontally, left to right. **`df.sum(axis='columns')`** sums each row individually.
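
The summary can be checked on a small made-up DataFrame. This sketch uses placeholder column names (`x`, `y`, `z`) and simply confirms that the integer and string forms of **`axis`** give the same result.

```python
import pandas as pd

# Hypothetical toy DataFrame: 2 rows, 3 columns
df = pd.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]})

down = df.sum(axis='index')       # one value per column
across = df.sum(axis='columns')   # one value per row

# The integer and string forms are interchangeable
assert down.equals(df.sum(axis=0))
assert across.equals(df.sum(axis=1))

print(down.tolist())    # sums of x, y and z
print(across.tolist())  # sums of row 0 and row 1
```

Note that `across` has as many values as the DataFrame has rows, which matches the memory trick described earlier.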

# Non-Aggregation DataFrame methods
The non-aggregation DataFrame methods keep the shape of the DataFrame but can change each value. Let's round all the values to two digits.

In [14]:
college_race.round(2).head()

Unnamed: 0,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn
0,0.03,0.94,0.01,0.0,0.0,0.0,0.0,0.01,0.01
1,0.59,0.26,0.03,0.05,0.0,0.0,0.04,0.02,0.01
2,0.3,0.42,0.01,0.0,0.0,0.0,0.0,0.0,0.27
3,0.7,0.13,0.04,0.04,0.01,0.0,0.02,0.03,0.04
4,0.02,0.92,0.01,0.0,0.0,0.0,0.01,0.02,0.01


## Some of the methods don't have an `axis` parameter
Methods such as **`round`** work independently of the axis and therefore do not have an **`axis`** parameter. Other methods, such as **`cumsum`**, do have an **`axis`** parameter.

Let's call **`cumsum`** in both directions.

In [15]:
college_race.cumsum().head()

Unnamed: 0,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn
0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
1,0.6255,1.1953,0.0338,0.0537,0.0046,0.0026,0.0368,0.0238,0.0238
2,0.9245,1.6145,0.0407,0.0571,0.0046,0.0026,0.0368,0.0238,0.2953
3,1.6233,1.74,0.0789,0.0947,0.0189,0.0028,0.054,0.057,0.3303
4,1.6391,2.6608,0.091,0.0966,0.0199,0.0034,0.0638,0.0813,0.344


In [16]:
college_race.cumsum(axis='columns').head()

Unnamed: 0,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn
0,0.0333,0.9686,0.9741,0.976,0.9784,0.9803,0.9803,0.9862,1.0
1,0.5922,0.8522,0.8805,0.9323,0.9345,0.9352,0.972,0.9899,0.9999
2,0.299,0.7182,0.7251,0.7285,0.7285,0.7285,0.7285,0.7285,1.0
3,0.6988,0.8243,0.8625,0.9001,0.9144,0.9146,0.9318,0.965,1.0
4,0.0158,0.9366,0.9487,0.9506,0.9516,0.9522,0.962,0.9863,1.0


## Get Summary Statistics for all columns with the `describe` method
The describe method calculates several summary statistics for each column and is a great way to inspect all of your data at once. Notice that a DataFrame is returned with the name of each summary statistic in the **index**.

In [17]:
college_race.describe()

Unnamed: 0,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn
count,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0
mean,0.510207,0.189997,0.161635,0.033544,0.013813,0.004569,0.02395,0.016086,0.045181
std,0.286958,0.224587,0.221854,0.073777,0.070196,0.033125,0.031288,0.050172,0.09344
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.2675,0.036125,0.0276,0.0025,0.0,0.0,0.0,0.0,0.0
50%,0.5557,0.10005,0.0714,0.0129,0.0026,0.0,0.0175,0.0,0.0143
75%,0.747875,0.2577,0.198875,0.0327,0.0073,0.0025,0.0339,0.0117,0.0454
max,1.0,1.0,1.0,0.9727,1.0,0.9983,0.5333,0.9286,0.9027


### The `describe` method with non-numeric columns
The **`college_race`** DataFrame from above contains only numeric columns. If **`describe`** is called on a DataFrame containing a mix of numeric and non-numeric columns, then summary statistics for just the numeric columns will be returned. The others will be ignored.

The original **`college`** DataFrame contains a mix of data types. Let's use describe on it. Notice how the number of columns after calling **`describe`** decreased.

In [18]:
college.shape

(7535, 27)

In [19]:
college.describe()

Unnamed: 0,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv
count,7164.0,7164.0,7164.0,7535.0,1185.0,1196.0,7164.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6853.0,7535.0,6849.0,6849.0,6718.0
mean,0.014238,0.009213,0.005304,0.190975,522.819409,530.76505,0.005583,2356.83794,0.510207,0.189997,0.161635,0.033544,0.013813,0.004569,0.02395,0.016086,0.045181,0.226639,0.923291,0.530643,0.522211,0.410021
std,0.118478,0.095546,0.072642,0.393096,68.578862,73.469767,0.074519,5474.275871,0.286958,0.224587,0.221854,0.073777,0.070196,0.033125,0.031288,0.050172,0.09344,0.24647,0.266146,0.225544,0.283616,0.228939
min,0.0,0.0,0.0,0.0,290.0,310.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,475.0,482.0,0.0,117.0,0.2675,0.036125,0.0276,0.0025,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.3578,0.3329,0.2415
50%,0.0,0.0,0.0,0.0,510.0,520.0,0.0,412.5,0.5557,0.10005,0.0714,0.0129,0.0026,0.0,0.0175,0.0,0.0143,0.1504,1.0,0.5215,0.5833,0.40075
75%,0.0,0.0,0.0,0.0,555.0,565.0,0.0,1929.5,0.747875,0.2577,0.198875,0.0327,0.0073,0.0025,0.0339,0.0117,0.0454,0.3769,1.0,0.7129,0.745,0.572275
max,1.0,1.0,1.0,1.0,765.0,785.0,1.0,151558.0,1.0,1.0,1.0,0.9727,1.0,0.9983,0.5333,0.9286,0.9027,1.0,1.0,1.0,1.0,1.0


In [20]:
college.describe().shape

(8, 22)

### Calling `describe` on non-numeric columns
The **`describe`** method also works with non-numeric columns. Pass the **`include`** parameter a string naming the data type you would like to summarize. Let's summarize the string (`object`) columns of the **`bikes`** DataFrame. The datetime columns are not of type `object`, so they are left out. Notice that the summary statistics are very different.

In [21]:
bikes = pd.read_csv('../data/bikes.csv', parse_dates=['starttime', 'stoptime'])
bikes.describe(include='object')

Unnamed: 0,usertype,gender,from_station_name,to_station_name,events
count,50089,50089,50089,50089,50089
unique,3,2,600,595,11
top,Subscriber,Male,Clinton St & Washington Blvd,Clinton St & Washington Blvd,partlycloudy
freq,50079,37654,879,905,16998
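
The same idea can be tried without the bikes file. The sketch below uses a made-up two-column frame (the names `city` and `size` are placeholders) and also shows `include='all'`, which summarizes every column at once. The text column is given an explicit `object` dtype so that `include='object'` selects it regardless of how a given pandas version infers strings.

```python
import pandas as pd

# Hypothetical toy data: one text column and one numeric column
df = pd.DataFrame({
    'city': pd.Series(['Austin', 'Boston', 'Austin'], dtype='object'),
    'size': [100, 250, 175],
})

text_summary = df.describe(include='object')  # only the object column
full_summary = df.describe(include='all')     # every column, numeric or not

print(text_summary)
print(full_summary)
```

With `include='all'`, statistics that do not apply to a column (for example `mean` for a text column) show up as missing values.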


## Transposing a DataFrame with the `T` attribute
Transposing a DataFrame 'turns' the data 90 degrees. The columns and the rows switch places: the first column becomes the first row, and so on.

The **`.T`** attribute transposes the DataFrame. I find this useful after running the **`describe`** method with long output.

In [22]:
college.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
hbcu,7164.0,0.014238,0.118478,0.0,0.0,0.0,0.0,1.0
menonly,7164.0,0.009213,0.095546,0.0,0.0,0.0,0.0,1.0
womenonly,7164.0,0.005304,0.072642,0.0,0.0,0.0,0.0,1.0
relaffil,7535.0,0.190975,0.393096,0.0,0.0,0.0,0.0,1.0
satvrmid,1185.0,522.819409,68.578862,290.0,475.0,510.0,555.0,765.0
satmtmid,1196.0,530.76505,73.469767,310.0,482.0,520.0,565.0,785.0
distanceonly,7164.0,0.005583,0.074519,0.0,0.0,0.0,0.0,1.0
ugds,6874.0,2356.83794,5474.275871,0.0,117.0,412.5,1929.5,151558.0
ugds_white,6874.0,0.510207,0.286958,0.0,0.2675,0.5557,0.747875,1.0
ugds_black,6874.0,0.189997,0.224587,0.0,0.036125,0.10005,0.2577,1.0
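
As a quick check of what transposing does, here is a minimal sketch on a made-up DataFrame (placeholder column names `a` and `b`). The shape is reversed, and the old column labels become the new index.

```python
import pandas as pd

# Hypothetical toy DataFrame: 3 rows, 2 columns
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

flipped = df.T  # rows and columns switch places

print(df.shape, flipped.shape)
print(list(flipped.index))
```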


# Exercises

In [23]:
import pandas as pd
college = pd.read_csv('../data/college.csv', index_col='instnm')
college_race = college.loc[:, 'ugds_white':'ugds_unkn']
movie = pd.read_csv('../data/movie.csv', index_col='title')

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset and calculate the mean of each actor Facebook like column. Which actor (1, 2, or 3) has the highest mean?</span>

In [36]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv', index_col='title')
movie.head()

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,,,,Documentary,,8,,,,,7.1


In [33]:
movie.describe().head()

Unnamed: 0,year,duration,director_fb,actor1_fb,actor2_fb,actor3_fb,gross,num_reviews,num_voted_users,budget,imdb_score
count,4810.0,4901.0,4814.0,4909.0,4903.0,4893.0,4054.0,4867.0,4916.0,4432.0,4916.0
mean,2002.447609,107.090798,691.014541,6494.488491,1621.923516,631.276313,47644510.0,137.988905,82644.924939,36547490.0,6.437429
std,12.453977,25.286015,2832.954125,15106.986884,4011.299523,1625.874802,67372550.0,120.239379,138322.162547,100242700.0,1.127802
min,1916.0,7.0,0.0,0.0,0.0,0.0,162.0,1.0,5.0,218.0,1.6
25%,1999.0,93.0,7.0,607.0,277.0,132.0,5019656.0,49.0,8361.75,6000000.0,5.8


In [38]:
actor_fb = movie.loc[:,'actor1_fb':'actor3_fb':2]
actor_fb.head()

Unnamed: 0_level_0,actor1_fb,actor2_fb,actor3_fb
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Avatar,1000.0,936.0,855.0
Pirates of the Caribbean: At World's End,40000.0,5000.0,1000.0
Spectre,11000.0,393.0,161.0
The Dark Knight Rises,27000.0,23000.0,23000.0
Star Wars: Episode VII - The Force Awakens,131.0,12.0,


In [35]:
actor_fb.mean()

actor1_fb    6494.488491
actor2_fb    1621.923516
actor3_fb     631.276313
dtype: float64

### Problem 2
<span  style="color:green; font-size:16px">Calculate the total Facebook likes of all three actors for each movie</span>

In [45]:
total_likes = actor_fb.sum(axis='columns')
total_likes.head()

title
Avatar                                         2791.0
Pirates of the Caribbean: At World's End      46000.0
Spectre                                       11554.0
The Dark Knight Rises                         73000.0
Star Wars: Episode VII - The Force Awakens      143.0
dtype: float64

### Problem 3
<span  style="color:green; font-size:16px">What percentage of movies have more than 10,000 total actor FB likes?</span>

In [53]:
pct10000 = (actor_fb.sum(axis=1) > 10000)
pct10000.mean()*100

29.82099267697315

### Problem 4
<span  style="color:green; font-size:16px">Find the median gross revenue in millions of dollars for the movies that have more than 10,000 total actor FB likes. Do the same for movies with 10,000 or fewer total actor FB likes.</span>

In [64]:
total = pct10000  # already a boolean Series, so comparing with True is unnecessary
movie.loc[total, 'gross'].median() / 10**6

42.3919155

In [67]:
movie.loc[~total, 'gross'].median() / 10**6

16.8157525

### Problem 5
<span  style="color:green; font-size:16px">From problem 4, it appears that movies with more than 10,000 total actor FB likes gross 2.5 times as much. This may be due to the fact that newer movies have more actors that are recognized by FB users. Find the median year produced for both groups.</span>

In [68]:
movie.loc[total, 'year'].median()

2006.0

In [69]:
movie.loc[~total, 'year'].median()

2005.0

### Problem 6
<span  style="color:green; font-size:16px">For the movies made in the year 2016, what is the median of the total actor FB likes?</span>

In [78]:
year = movie['year']==2016
cols = ['actor1_fb', 'actor2_fb', 'actor3_fb']
movie.loc[year, cols].sum(axis='columns').median()

3571.5

### Problem 7
<span  style="color:green; font-size:16px">Write a function that has a single parameter, `year`. Have it return the median of the total actor FB likes for the given year. Test your function with the year 2016 and verify the result with problem 6.</span>

In [97]:
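def func_likes_year(year):
    # Boolean mask of the movies released in the given year
    is_year = movie['year'] == year
    cols = ['actor1_fb', 'actor2_fb', 'actor3_fb']
    # Total likes per movie, then the median across those movies
    return movie.loc[is_year, cols].sum(axis='columns').median()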

In [98]:
func_likes_year(2016)

3571.5

### Problem 8
<span  style="color:green; font-size:16px">Write a loop to print out the year and median total actor FB likes for that year from 1990 to 2016</span>

In [100]:
for year in range(1990,2017):
    print(year, func_likes_year(year))

1990 2017.0
1991 2436.0
1992 2147.5
1993 2018.0
1994 2368.5
1995 2612.0
1996 2692.5
1997 1964.0
1998 2482.0
1999 2595.0
2000 2378.0
2001 2424.0
2002 2146.0
2003 2019.0
2004 2298.0
2005 2072.0
2006 2359.0
2007 2002.5
2008 2400.0
2009 2145.0
2010 2411.0
2011 2818.5
2012 2426.0
2013 2420.0
2014 2084.0
2015 2063.0
2016 3571.5


In [101]:
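# print returns None, so this set comprehension collects only None; the plain loop above is the clearer choice
lol = {print(year, func_likes_year(year)) for year in range(1990,2017)}
lol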

1990 2017.0
1991 2436.0
1992 2147.5
1993 2018.0
1994 2368.5
1995 2612.0
1996 2692.5
1997 1964.0
1998 2482.0
1999 2595.0
2000 2378.0
2001 2424.0
2002 2146.0
2003 2019.0
2004 2298.0
2005 2072.0
2006 2359.0
2007 2002.5
2008 2400.0
2009 2145.0
2010 2411.0
2011 2818.5
2012 2426.0
2013 2420.0
2014 2084.0
2015 2063.0
2016 3571.5


{None}

### Problem 9
<span  style="color:green; font-size:16px">Using the **college** dataset, find the number of non-missing values in each column and again for each row.</span>

In [109]:
college.head(3)

Unnamed: 0_level_0,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0


In [110]:
college.count().head() #non missing values in columns

city         7535
stabbr       7535
hbcu         7164
menonly      7164
womenonly    7164
dtype: int64

In [111]:
college.count(axis=1).head() #non missing values in rows

instnm
Alabama A & M University               26
University of Alabama at Birmingham    26
Amridge University                     24
University of Alabama in Huntsville    26
Alabama State University               26
dtype: int64

### Problem 10
<span  style="color:green; font-size:16px">What is the average number of missing values for each row?</span>

In [115]:
college.isna().mean(axis='columns').head() # fraction of missing values in each row

instnm
Alabama A & M University               0.000000
University of Alabama at Birmingham    0.000000
Amridge University                     0.076923
University of Alabama in Huntsville    0.000000
Alabama State University               0.000000
dtype: float64
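
The question can also be read as asking for a single number: the mean count of missing values per row. The sketch below shows both readings on a made-up frame (placeholder columns `a` and `b`), so the two results can be compared. It is one possible interpretation, not the only valid answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, 6.0]})

missing_per_row = df.isna().sum(axis='columns')  # count of missing values in each row
avg_missing = missing_per_row.mean()             # one number: mean count across rows
frac_missing = df.isna().mean(axis='columns')    # fraction missing in each row

print(missing_per_row.tolist())
print(avg_missing)
print(frac_missing.tolist())
```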

### Problem 11
<span  style="color:green; font-size:16px">The `ugds` column of the college dataset contains the total undergraduate population. What is the least number of colleges it would take to have a total of more than 5 million students?</span>

In [124]:
ugds_cmsm = college['ugds'].sort_values(ascending=False).cumsum()
ugds_cmsm.head()

instnm
University of Phoenix-Arizona    151558.0
Ivy Tech Community College       229215.0
Miami Dade College               290685.0
Lone Star College System         350605.0
Houston Community College        408689.0
Name: ugds, dtype: float64

In [146]:
(ugds_cmsm < 5000000).sum()

184

In [147]:
ugds_sort = college['ugds'].sort_values(ascending=False)
ugds_sort.head()

instnm
University of Phoenix-Arizona    151558.0
Ivy Tech Community College        77657.0
Miami Dade College                61470.0
Lone Star College System          59920.0
Houston Community College         58084.0
Name: ugds, dtype: float64

In [148]:
ugds_sort.iloc[:184].sum()

4989478.0

In [149]:
ugds_sort.iloc[:185].sum()

5007289.0
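
The outputs above show that the 184 largest colleges fall just short of 5 million students, while 185 exceed it. The count can also be computed in one step. The sketch below does this on made-up enrollment numbers with a hypothetical threshold of 100, and it assumes the grand total does exceed the threshold.

```python
import pandas as pd

# Hypothetical enrollments and threshold, used only for illustration
sizes = pd.Series([50, 10, 30, 20, 40])
threshold = 100

# Largest first, then a running total
running = sizes.sort_values(ascending=False).cumsum()

# Entries still at or below the threshold, plus the one that pushes the total over it
n_needed = int((running <= threshold).sum()) + 1

print(running.tolist())
print(n_needed)
```

Here the running totals are 50, 90, 120, 140 and 150, so three values are needed to pass 100.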