# 09. DataFrame Attributes and Methods


## DataFrames and Series - Many attributes and methods in common
The good news is that DataFrames and Series have most of their attributes and methods in common, so there is not much more to remember when moving from one to the other.

In [None]:
import pandas as pd

Use a set comprehension to get all public methods for each type:
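One way to do this is sketched below. The variable names (`series_public`, `df_public`, and so on) are hypothetical, and "public" is assumed to mean any name that does not start with an underscore. Note that `dir` lists attributes as well as methods.

```python
import pandas as pd

# Public names: everything that does not start with an underscore
series_public = {name for name in dir(pd.Series) if not name.startswith('_')}
df_public = {name for name in dir(pd.DataFrame) if not name.startswith('_')}

common = series_public & df_public        # available on both types
only_df = df_public - series_public       # DataFrame only
only_series = series_public - df_public   # Series only

print(len(common), len(only_df), len(only_series))
```

The exact counts depend on the installed pandas version, but the shared set should be by far the largest of the three.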

## View the API for complete list of functionality
Just as we did for Series, it can be helpful to see the entire list of attributes and methods for a DataFrame. Please visit the [DataFrame section][1] of the API.

## Minimally Sufficient Pandas
I can't stress enough how important it is to stick with a minimal subset of pandas when doing an analysis. Using more obscure methods does not make you a better analyst.

## Bikes Dataset
We will use the bikes dataset to introduce the core attributes and methods of DataFrames.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#dataframe

In [1]:
import pandas as pd
var_b = pd.read_csv('data/bikes.csv', parse_dates=['starttime', 'stoptime'])
var_b.head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy


## Core DataFrame Attributes

See the complete list of [DataFrame attributes][1]

* **`index`**
* **`columns`**
* **`values`**
* **`dtypes`**
* **`shape`**

Let's further explore these attributes:

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#attributes-and-underlying-data

In [3]:
var_b.index

RangeIndex(start=0, stop=50089, step=1)

## The columns are an Index object
The columns are always going to be some kind of Index object just like the index. You can more or less think of the Index objects as an array of data.

In [2]:
var_b.columns

Index(['trip_id', 'usertype', 'gender', 'starttime', 'stoptime',
       'tripduration', 'from_station_name', 'latitude_start',
       'longitude_start', 'dpcapacity_start', 'to_station_name',
       'latitude_end', 'longitude_end', 'dpcapacity_end', 'temperature',
       'visibility', 'wind_speed', 'precipitation', 'events'],
      dtype='object')

In [4]:
type(var_b.columns)

pandas.core.indexes.base.Index

### `values` returns a 2-D NumPy array
The **`values`** attribute returns a 2-D NumPy array.

In [5]:
var_b.values

array([[7147, 'Subscriber', 'Male', ..., 12.7, -9999.0, 'mostlycloudy'],
       [7524, 'Subscriber', 'Male', ..., 6.9, -9999.0, 'partlycloudy'],
       [10927, 'Subscriber', 'Male', ..., 16.1, -9999.0, 'mostlycloudy'],
       ..., 
       [17534972, 'Subscriber', 'Male', ..., 16.1, -9999.0, 'partlycloudy'],
       [17535645, 'Subscriber', 'Female', ..., 11.5, -9999.0,
        'partlycloudy'],
       [17536246, 'Subscriber', 'Male', ..., 15.0, -9999.0, 'partlycloudy']], dtype=object)
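The array above has `dtype=object` because the columns hold different types. A small sketch on a made-up two-row DataFrame (`toy` is a hypothetical name, not part of the bikes data) shows the difference between mixed and purely numeric columns. It also uses `to_numpy`, a method available in recent pandas versions that returns the same kind of array.

```python
import pandas as pd

# Hypothetical toy data: one integer column and one string column
toy = pd.DataFrame({'trip_id': [7147, 7524],
                    'usertype': ['Subscriber', 'Customer']})

mixed = toy.values                      # mixed column types -> object array
numeric = toy[['trip_id']].to_numpy()   # a single integer column keeps an integer dtype

print(mixed.dtype, mixed.shape)
print(numeric.dtype, numeric.shape)
```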

### `dtypes` returns a Series of data types
The **`dtypes`** attribute returns a Series of data types, where the index of the Series holds the column names and the values are the data types themselves.

In [6]:
var_b.dtypes

trip_id                       int64
usertype                     object
gender                       object
starttime            datetime64[ns]
stoptime             datetime64[ns]
tripduration                  int64
from_station_name            object
latitude_start              float64
longitude_start             float64
dpcapacity_start            float64
to_station_name              object
latitude_end                float64
longitude_end               float64
dpcapacity_end              float64
temperature                 float64
visibility                  float64
wind_speed                  float64
precipitation               float64
events                       object
dtype: object

### `shape` returns a tuple of the number of rows and columns

In [7]:
var_b.shape

(50089, 19)

### The `len` function returns the number of rows
The built-in Python **`len`** function returns the number of rows.

In [8]:
len(var_b)

50089

# The `info` method

In [9]:
var_b.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50089 entries, 0 to 50088
Data columns (total 19 columns):
trip_id              50089 non-null int64
usertype             50089 non-null object
gender               50089 non-null object
starttime            50089 non-null datetime64[ns]
stoptime             50089 non-null datetime64[ns]
tripduration         50089 non-null int64
from_station_name    50089 non-null object
latitude_start       50083 non-null float64
longitude_start      50083 non-null float64
dpcapacity_start     50083 non-null float64
to_station_name      50089 non-null object
latitude_end         50077 non-null float64
longitude_end        50077 non-null float64
dpcapacity_end       50077 non-null float64
temperature          50089 non-null float64
visibility           50089 non-null float64
wind_speed           50089 non-null float64
precipitation        50089 non-null float64
events               50089 non-null object
dtypes: datetime64[ns](2), float64(10), int64(2), 

# Basic Arithmetic Operations with a DataFrame
We now cover what happens when we use the basic mathematical operators **`+, -, *, /, **, //`** on a DataFrame.

### Attempt to add 5 to bikes
If we try to add 5 to the bikes DataFrame, we get an error because it has a mix of numeric, object, and datetime columns.

In [10]:
try:
    var_b + 5
except Exception as e:
    print(type(e), e)

<class 'ValueError'> Cannot add integral value to Timestamp without freq.


### Select just numeric data with `select_dtypes`
DataFrames have a method of their own, **`select_dtypes`**, which returns the subset of columns whose data type matches the one you pass. Pass the data type you want as a string: 'int', 'float', 'bool', 'object', 'datetime', 'timedelta', or 'category'.

Let's see some examples:

In [11]:
var_b.select_dtypes('int').head()

Unnamed: 0,trip_id,tripduration
0,7147,993
1,7524,623
2,10927,1040
3,12907,667
4,13168,130


In [12]:
var_b.select_dtypes('float').head()

Unnamed: 0,latitude_start,longitude_start,dpcapacity_start,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation
0,41.88105,-87.61697,11.0,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0
1,41.88338,-87.64117,31.0,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0
2,41.909592,-87.653497,15.0,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0
3,41.894556,-87.653449,19.0,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0
4,41.909396,-87.677692,19.0,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0


In [13]:
var_b.select_dtypes('datetime').head()

Unnamed: 0,starttime,stoptime
0,2013-06-28 19:01:00,2013-06-28 19:17:00
1,2013-06-28 22:53:00,2013-06-28 23:03:00
2,2013-06-30 14:43:00,2013-06-30 15:01:00
3,2013-07-01 10:05:00,2013-07-01 10:16:00
4,2013-07-01 11:16:00,2013-07-01 11:18:00


In [14]:
var_b.select_dtypes('object').head()

Unnamed: 0,usertype,gender,from_station_name,to_station_name,events
0,Subscriber,Male,Lake Shore Dr & Monroe St,Michigan Ave & Oak St,mostlycloudy
1,Subscriber,Male,Clinton St & Washington Blvd,Wells St & Walton St,partlycloudy
2,Subscriber,Male,Sheffield Ave & Kingsbury St,Dearborn St & Monroe St,mostlycloudy
3,Subscriber,Male,Carpenter St & Huron St,Clark St & Randolph St,mostlycloudy
4,Subscriber,Male,Damen Ave & Pierce Ave,Damen Ave & Pierce Ave,partlycloudy


#### Use the string 'number' to select all numeric data
This selects all int and float columns.

In [16]:
func_number_var_b = var_b.select_dtypes('number')
func_number_var_b.head()

Unnamed: 0,trip_id,tripduration,latitude_start,longitude_start,dpcapacity_start,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation
0,7147,993,41.88105,-87.61697,11.0,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0
1,7524,623,41.88338,-87.64117,31.0,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0
2,10927,1040,41.909592,-87.653497,15.0,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0
3,12907,667,41.894556,-87.653449,19.0,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0
4,13168,130,41.909396,-87.677692,19.0,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0
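`select_dtypes` also accepts a list of types through its `include` parameter and can drop types through `exclude`. The sketch below uses a small made-up DataFrame (`toy` and the result names are hypothetical) rather than the bikes data.

```python
import pandas as pd

# Hypothetical toy data with int, float, string, and datetime columns
toy = pd.DataFrame({
    'trip_id': [7147, 7524],
    'temperature': [73.9, 69.1],
    'gender': ['Male', 'Female'],
    'starttime': pd.to_datetime(['2013-06-28 19:01', '2013-06-28 22:53']),
})

# Keep numeric and datetime columns together
num_and_dates = toy.select_dtypes(include=['number', 'datetime'])

# Keep everything that is not numeric
not_numeric = toy.select_dtypes(exclude='number')

print(list(num_and_dates.columns))
print(list(not_numeric.columns))
```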


## Try adding 5 to `func_number_var_b`
Let's try adding 5 to the **`func_number_var_b`** DataFrame, which consists of only numeric columns:

In [17]:
(func_number_var_b + 5).head()

Unnamed: 0,trip_id,tripduration,latitude_start,longitude_start,dpcapacity_start,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation
0,7152,998,46.88105,-82.61697,16.0,46.90096,-82.623777,20.0,78.9,15.0,17.7,-9994.0
1,7529,628,46.88338,-82.64117,36.0,46.89993,-82.63443,24.0,74.1,15.0,11.9,-9994.0
2,10932,1045,46.909592,-82.653497,20.0,46.88132,-82.629521,28.0,78.0,15.0,21.1,-9994.0
3,12912,672,46.894556,-82.653449,24.0,46.884576,-82.63189,36.0,77.0,15.0,21.1,-9994.0
4,13173,135,46.909396,-82.677692,24.0,46.909396,-82.677692,24.0,78.0,15.0,22.3,-9994.0


### Other numeric operators
All the other numeric operators work in the same manner. They all apply the operation to every value in the DataFrame. For instance, the following does floor division to each value:

In [18]:
(func_number_var_b // 17).head()

Unnamed: 0,trip_id,tripduration,latitude_start,longitude_start,dpcapacity_start,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation
0,420,58,2.0,-6.0,0.0,2.0,-6.0,0.0,4.0,0.0,0.0,-589.0
1,442,36,2.0,-6.0,1.0,2.0,-6.0,1.0,4.0,0.0,0.0,-589.0
2,642,61,2.0,-6.0,0.0,2.0,-6.0,1.0,4.0,0.0,0.0,-589.0
3,759,39,2.0,-6.0,1.0,2.0,-6.0,1.0,4.0,0.0,0.0,-589.0
4,774,7,2.0,-6.0,1.0,2.0,-6.0,1.0,4.0,0.0,1.0,-589.0


## Comparison Operators with DataFrames
The comparison operators work similarly to the mathematical ones and will return a DataFrame of all boolean columns:

In [19]:
(func_number_var_b > 5).head()

Unnamed: 0,trip_id,tripduration,latitude_start,longitude_start,dpcapacity_start,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation
0,True,True,True,False,True,True,False,True,True,True,True,False
1,True,True,True,False,True,True,False,True,True,True,True,False
2,True,True,True,False,True,True,False,True,True,True,True,False
3,True,True,True,False,True,True,False,True,True,True,True,False
4,True,True,True,False,True,True,False,True,True,True,True,False
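A boolean DataFrame like this is often summed, since `True` counts as 1. The following sketch, on made-up values with hypothetical names, counts how many entries in each column pass the comparison.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'tripduration': [993, 623, 130],
                    'wind_speed': [12.7, 6.9, 17.3]})

over_500 = toy > 500       # DataFrame of booleans with the same shape as toy
counts = over_500.sum()    # True counts as 1, so this counts matches per column

print(counts)
```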


# Data Dictionaries
A data dictionary is a very important element of a data analysis and at a minimum gives us the column name and description of each column. Other information on each column can be kept in it such as the data type of each column or number of missing values.

We have a data dictionary for the college dataset.

In [20]:
var_c = pd.read_csv('data/college.csv', index_col='instnm')
var_c.head()

Unnamed: 0_level_0,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,...,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


In [None]:
pd.read_csv('data/college_data_dictionary.csv')

# Descriptive Statistics Methods for DataFrames
DataFrames have the same [descriptive statistical methods][1] as Series. Again, we distinguish between methods that aggregate and those that do not.

A method that performs an aggregation returns a **single** number to represent the description. Examples of methods that aggregate are:
* `sum`
* `min`
* `max`
* `mean`
* `median`
* `std` - standard deviation
* `var` - variance
* `count` - returns number of non-na values
* `describe` - returns most of the above aggregations at once (as a Series when called on a Series, as a DataFrame when called on a DataFrame)
* `quantile` - returns given percentile of distribution

Any other method that does not return a single value is not an aggregation. Some examples of these methods are:
* `abs` - takes absolute value
* `round` - round to the nearest given decimal
* `cummin` - cumulative minimum
* `cummax` - cumulative maximum
* `cumsum` - cumulative sum
* `rank` - rank values in a variety of different ways
* `diff` - difference between one element and another
* `pct_change` - percent change from one element to another

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-stats

# Major Differences between DataFrame and Series Methods
When calling one of the above methods on a DataFrame, it is applied to each individual column by default. For instance, if we call the **`sum`** method, each column will be summed individually. Calling the **`sum`** method on a Series produces a single scalar value, while a DataFrame produces a sum for each column.
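A minimal sketch of this difference, using a made-up DataFrame (`toy` and the result names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

col_total = toy['a'].sum()   # Series.sum -> a single scalar
all_totals = toy.sum()       # DataFrame.sum -> a Series with one sum per column

print(col_total)
print(all_totals)
```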

### Select some numeric columns
Many of the statistical methods above work only for numeric columns. We will select all the columns that hold undergraduate race proportion data. These columns are located together, starting at **`ugds_white`** and ending at **`ugds_unkn`**.

In [21]:
func_collrace = var_c.loc[:, 'ugds_white':'ugds_unkn']
func_collrace.head()

Unnamed: 0_level_0,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


### Take the mean of each column
Let's demonstrate calling the **`mean`** aggregation method on each column.

In [22]:
func_collrace.mean()

ugds_white    0.510207
ugds_black    0.189997
ugds_hisp     0.161635
ugds_asian    0.033544
ugds_aian     0.013813
ugds_nhpi     0.004569
ugds_2mor     0.023950
ugds_nra      0.016086
ugds_unkn     0.045181
dtype: float64

In [23]:
func_collrace.max()

ugds_white    1.0000
ugds_black    1.0000
ugds_hisp     1.0000
ugds_asian    0.9727
ugds_aian     1.0000
ugds_nhpi     0.9983
ugds_2mor     0.5333
ugds_nra      0.9286
ugds_unkn     0.9027
dtype: float64

In [24]:
func_collrace.std()

ugds_white    0.286958
ugds_black    0.224587
ugds_hisp     0.221854
ugds_asian    0.073777
ugds_aian     0.070196
ugds_nhpi     0.033125
ugds_2mor     0.031288
ugds_nra      0.050172
ugds_unkn     0.093440
dtype: float64

## Changing the Direction of the Operation
Since DataFrames are two-dimensional, you might be interested in doing an operation that happens across the rows, such as summing up each row.

## The `axis` parameter controls the direction of the operation.
Nearly all DataFrame methods have an **`axis`** parameter. It is crucial to understand this parameter.

## Referencing each axis by number and by label
DataFrames are two-dimensional and therefore have two axes. The rows are referenced by the number 0 and also by the label 'index'. The columns are referenced by the number 1 and also by the label 'columns'.

## Default value of `axis` is 0
The default value for the **`axis`** parameter is 0. You can also refer to it as 'index'. Let's take the mean again for each column, but instead use the string 'index' for the value of the **`axis`** parameter.

In [25]:
func_collrace.mean(axis='index')

ugds_white    0.510207
ugds_black    0.189997
ugds_hisp     0.161635
ugds_asian    0.033544
ugds_aian     0.013813
ugds_nhpi     0.004569
ugds_2mor     0.023950
ugds_nra      0.016086
ugds_unkn     0.045181
dtype: float64

This is the exact same thing as **`axis=0`**, which is the default:

In [26]:
func_collrace.mean(axis=0)

ugds_white    0.510207
ugds_black    0.189997
ugds_hisp     0.161635
ugds_asian    0.033544
ugds_aian     0.013813
ugds_nhpi     0.004569
ugds_2mor     0.023950
ugds_nra      0.016086
ugds_unkn     0.045181
dtype: float64

## Use `axis='columns'`
Let's sum each row by changing the direction of the operation, setting the **`axis`** parameter equal to **`'columns'`**. Each total should be close to 1, since each row contains the full race distribution of a single school.

In [27]:
func_collrace.sum(axis='columns').head()

instnm
Alabama A & M University               1.0000
University of Alabama at Birmingham    0.9999
Amridge University                     1.0000
University of Alabama in Huntsville    1.0000
Alabama State University               1.0000
dtype: float64

You can also use **`axis=1`**

In [28]:
func_collrace.sum(axis=1).head()

instnm
Alabama A & M University               1.0000
University of Alabama at Birmingham    0.9999
Amridge University                     1.0000
University of Alabama in Huntsville    1.0000
Alabama State University               1.0000
dtype: float64

## I always use `axis='columns'`
I always use **`axis='columns'`** and never **`axis=1`**. The reason for this is that the string 'columns' is much more descriptive than the integer 1. I also always use **`axis='index'`** instead of **`axis=0`** for the same reason.

## Confusion between string 'index' and 'columns'
It's definitely confusing and difficult to remember which direction the operation is going to happen. A little trick that helps me remember is that when using **`axis='columns'`** the result is going to be the same length as a **column** in the DataFrame. 

In [29]:
func_collrace.shape

(7535, 9)

In [30]:
len(func_collrace.sum(axis=1))

7535

### Summary of `axis`
* axis 0 - default axis for all DataFrame methods. Its preferred reference label is 'index'. The operations happen vertically, up and down columns. **`df.sum()`** finds the sum of each column individually.
* axis 1 - Its preferred reference label is 'columns'. The operations happen horizontally, left to right. **`df.sum(axis='columns')`** sums each row individually.
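The two directions can be seen side by side on a small made-up DataFrame. The names `toy`, `down`, and `across` are hypothetical and used only for this sketch.

```python
import pandas as pd

# Hypothetical toy data with labeled rows
toy = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]},
                   index=['r1', 'r2', 'r3'])

down = toy.sum(axis='index')       # one value per column: x=6, y=60
across = toy.sum(axis='columns')   # one value per row: r1=11, r2=22, r3=33

print(down)
print(across)
```

The result of `axis='columns'` has one entry per row label, which matches the trick above: it is the same length as a column.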

# Non-Aggregation DataFrame methods
The non-aggregation DataFrame methods keep the shape of the DataFrame but can change each value. Let's round all the values to two digits.

In [31]:
func_collrace.round(2).head()

Unnamed: 0_level_0,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.03,0.94,0.01,0.0,0.0,0.0,0.0,0.01,0.01
University of Alabama at Birmingham,0.59,0.26,0.03,0.05,0.0,0.0,0.04,0.02,0.01
Amridge University,0.3,0.42,0.01,0.0,0.0,0.0,0.0,0.0,0.27
University of Alabama in Huntsville,0.7,0.13,0.04,0.04,0.01,0.0,0.02,0.03,0.04
Alabama State University,0.02,0.92,0.01,0.0,0.0,0.0,0.01,0.02,0.01


## Some of the methods don't have an `axis` parameter
Methods such as **`round`** work independently of the axis and therefore do not have an **`axis`** parameter. Other methods however, such as **`cumsum`**, do have an **`axis`** parameter.

Let's call **`cumsum`** in both directions.

In [32]:
func_collrace.cumsum().head()

Unnamed: 0_level_0,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.6255,1.1953,0.0338,0.0537,0.0046,0.0026,0.0368,0.0238,0.0238
Amridge University,0.9245,1.6145,0.0407,0.0571,0.0046,0.0026,0.0368,0.0238,0.2953
University of Alabama in Huntsville,1.6233,1.74,0.0789,0.0947,0.0189,0.0028,0.054,0.057,0.3303
Alabama State University,1.6391,2.6608,0.091,0.0966,0.0199,0.0034,0.0638,0.0813,0.344


In [None]:
func_collrace.cumsum(axis='columns').head()
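Since the cell above has no output shown, here is a small sketch of `cumsum` in both directions on made-up data (the names are hypothetical). The values are chosen so the direction of accumulation is easy to read off.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'a': [1, 2], 'b': [10, 20], 'c': [100, 200]})

down = toy.cumsum()                  # accumulate down each column
across = toy.cumsum(axis='columns')  # accumulate left to right within each row

print(down)
print(across)
```

Both results keep the shape of the original DataFrame, which is what makes `cumsum` a non-aggregating method.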

## Get Summary Statistics for all columns with the `describe` method
The describe method calculates several summary statistics for each column and is a great way to inspect all of your data at once. Notice that a DataFrame is returned with the name of each summary statistic in the **index**.

In [33]:
func_collrace.describe()

Unnamed: 0,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn
count,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0
mean,0.510207,0.189997,0.161635,0.033544,0.013813,0.004569,0.02395,0.016086,0.045181
std,0.286958,0.224587,0.221854,0.073777,0.070196,0.033125,0.031288,0.050172,0.09344
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.2675,0.036125,0.0276,0.0025,0.0,0.0,0.0,0.0,0.0
50%,0.5557,0.10005,0.0714,0.0129,0.0026,0.0,0.0175,0.0,0.0143
75%,0.747875,0.2577,0.198875,0.0327,0.0073,0.0025,0.0339,0.0117,0.0454
max,1.0,1.0,1.0,0.9727,1.0,0.9983,0.5333,0.9286,0.9027


### Calling `describe` on non-numeric columns
The **`describe`** method also works with non-numeric columns. Pass the **`include`** parameter a string naming the data type you would like to summarize. Let's see the summary for the string (object) columns of the bikes DataFrame, **`var_b`**. Notice that the summary statistics are very different.

In [34]:
var_b.describe(include='object')

Unnamed: 0,usertype,gender,from_station_name,to_station_name,events
count,50089,50089,50089,50089,50089
unique,3,2,600,595,11
top,Subscriber,Male,Clinton St & Washington Blvd,Clinton St & Washington Blvd,partlycloudy
freq,50079,37654,879,905,16998


## Transposing a DataFrame with the `T` attribute
Transposing a DataFrame switches the places of the columns and the rows. The first column becomes the first row, and so on.

The **`.T`** attribute transposes the DataFrame. I find this useful after running the **`describe`** method with long output.

In [35]:
var_c.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
hbcu,7164.0,0.014238,0.118478,0.0,0.0,0.0,0.0,1.0
menonly,7164.0,0.009213,0.095546,0.0,0.0,0.0,0.0,1.0
womenonly,7164.0,0.005304,0.072642,0.0,0.0,0.0,0.0,1.0
relaffil,7535.0,0.190975,0.393096,0.0,0.0,0.0,0.0,1.0
satvrmid,1185.0,522.819409,68.578862,290.0,475.0,510.0,555.0,765.0
satmtmid,1196.0,530.76505,73.469767,310.0,482.0,520.0,565.0,785.0
distanceonly,7164.0,0.005583,0.074519,0.0,0.0,0.0,0.0,1.0
ugds,6874.0,2356.83794,5474.275871,0.0,117.0,412.5,1929.5,151558.0
ugds_white,6874.0,0.510207,0.286958,0.0,0.2675,0.5557,0.747875,1.0
ugds_black,6874.0,0.189997,0.224587,0.0,0.036125,0.10005,0.2577,1.0


# Exercises

In [38]:
import pandas as pd
var_c = pd.read_csv('data/college.csv', index_col='instnm')
func_collrace = var_c.loc[:, 'ugds_white':'ugds_unkn']
var_m = pd.read_csv('data/movie.csv', index_col='title')

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset and calculate the mean of each actor Facebook like column. Which actor (1, 2, or 3) has the highest mean?</span>

In [40]:
var_m.head() # read movie dataset

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,...,,,Documentary,,8,,,,,7.1


In [44]:
# Select the three actor Facebook like columns
var_fl = var_m[['actor1_fb', 'actor2_fb', 'actor3_fb']]
var_fl.head()

Unnamed: 0_level_0,actor1_fb,actor2_fb,actor3_fb
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Avatar,1000.0,936.0,855.0
Pirates of the Caribbean: At World's End,40000.0,5000.0,1000.0
Spectre,11000.0,393.0,161.0
The Dark Knight Rises,27000.0,23000.0,23000.0
Star Wars: Episode VII - The Force Awakens,131.0,12.0,


In [45]:
# Mean of each actor Facebook like column; actor 1 has the highest mean
func_avg_fl = var_fl.mean()
func_avg_fl.head()

actor1_fb    6494.488491
actor2_fb    1621.923516
actor3_fb     631.276313
dtype: float64

### Problem 2
<span  style="color:green; font-size:16px">Calculate the total Facebook likes of all three actors for each movie</span>

In [49]:
# Sum across the three actor columns to get one total per movie
var_fl.sum(axis='columns').head()

title
Avatar                                         2791.0
Pirates of the Caribbean: At World's End      46000.0
Spectre                                       11554.0
The Dark Knight Rises                         73000.0
Star Wars: Episode VII - The Force Awakens      143.0
dtype: float64

### Problem 4
<span  style="color:green; font-size:16px">Find the median gross revenue in millions of dollars for the movies that have more than 10,000 total actor FB likes. Do the same for movies with 10,000 or fewer total actor FB likes.</span>

In [69]:
# Total FB likes of the three actors for each movie
var_fl.sum(axis='columns').head()

title
Avatar                                         2791.0
Pirates of the Caribbean: At World's End      46000.0
Spectre                                       11554.0
The Dark Knight Rises                         73000.0
Star Wars: Episode VII - The Force Awakens      143.0
dtype: float64

In [71]:
# Boolean mask: True for movies with more than 10,000 total actor FB likes
func_filt_by_rev = var_fl.sum(axis='columns') > 10000
# Median gross of those movies, in millions of dollars
var_m.loc[func_filt_by_rev, 'gross'].median() / 10 ** 6

42.3919155

In [72]:
var_m.loc[~func_filt_by_rev, 'gross'].median() / 10 ** 6

16.8157525

### Problem 6
<span  style="color:green; font-size:16px">For the movies made in the year 2016, what is the median of the total actor FB likes?</span>

In [77]:
# Boolean mask for movies made in 2016
func_criteria = var_m['year'] == 2016
var_col = ['actor1_fb', 'actor2_fb', 'actor3_fb']
# Total actor FB likes per 2016 movie, then the median of those totals
var_m.loc[func_criteria, var_col].sum(axis='columns').median()

3571.5

### Problem 9
<span  style="color:green; font-size:16px">Using the **college** dataset, find the number of non-missing values in each column and again for each row.</span>

In [78]:
var_c = pd.read_csv('data/college.csv')
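The cell above only reads the data. One possible approach to Problem 9 is to call `count` once for each axis. The sketch below shows the idea on a small made-up DataFrame (`toy` and its values are hypothetical, not the college data); the same two calls should apply to `var_c`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values
toy = pd.DataFrame({
    'satvrmid': [424.0, np.nan, 595.0],
    'ugds': [4206.0, 291.0, np.nan],
    'city': ['Normal', 'Montgomery', 'Huntsville'],
})

per_column = toy.count()               # non-missing values in each column
per_row = toy.count(axis='columns')    # non-missing values in each row

print(per_column)
print(per_row)
```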

### Problem 10
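<span  style="color:green; font-size:16px">What is the average number of missing values for each row?</span>

Note that **`count`** tallies non-missing values, so the cells below give the average number of **non-missing** values per row. The number of missing values in each row is the number of columns minus that count, or equivalently **`var_c.isna().sum(axis='columns')`**.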

In [81]:
var_c.count(axis='columns').mean()

23.70763105507631

In [80]:
var_c.count(axis='columns').head()

0    27
1    27
2    25
3    27
4    27
dtype: int64
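To count missing rather than non-missing values, `isna` can be combined with `sum`. This is a sketch on made-up data with hypothetical names; it also checks that subtracting the non-missing count from the number of columns gives the same answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 has 1 missing value, row 1 has 2, row 2 has 0
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                    'b': [np.nan, np.nan, 6.0]})

missing_per_row = toy.isna().sum(axis='columns')   # True counts as 1
avg_missing = missing_per_row.mean()

# Equivalent: number of columns minus the non-missing count per row
alt_avg = (toy.shape[1] - toy.count(axis='columns')).mean()

print(missing_per_row.tolist(), avg_missing, alt_avg)
```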