## Analysis background ##

We have a data file containing records corresponding to surgical cases. For each case we know some basic information about the patient's scheduled case including an urgency code, 
the date the case was scheduled, insurance status, surgical service, and the number of days prior to surgery in which the case was scheduled. In data model terms, the first 4 variables 
are dimensions and the last variable is a measure. Of course, we could certainly use things like SQL or Excel Pivot Tables to do very useful things with this data. 
In a [previous tutorial](http://hselab.org/getting-started-R-group-by.html) I showed how R can be used to do the same things as well as to do some things that are much more difficult using SQL or Excel. In this tutorial
I'll do the same things as in the R tutorial but will use Python instead. Since many business analytics folks end up using both R and Python, it is important to be able to switch your brain between R and Python modes.

This tutorial will just scratch the surface of getting started with Pandas. There are [numerous introductory tutorials](http://pandas.pydata.org/pandas-docs/dev/tutorials.html) and I encourage you to explore them. A word of warning: the Pandas book, *Python for Data Analysis* (linked in the Preliminaries section), is starting to get a little outdated with respect to some details. So, the [official documentation](http://pandas.pydata.org/pandas-docs/stable/) is also a very good resource to become familiar with.


## Preliminaries ##
Python is a major force in the scientific and analytic computing worlds. It's an easy language to learn, is very list-centric, dynamically typed, and has many
contributed modules to facilitate the kinds of computing we do for business analytics. It's great for rapid development, algorithm testing, interactive
data and model exploration, and as a "glue" or scripting language (like Perl or Ruby). It also has some similarities to R
and to interactive environments like Matlab or Mathematica. There is a huge community of Python developers creating 3rd party modules for all kinds of computing problems. For
example, [adodbapi](http://adodbapi.sourceforge.net/) is a module for connecting to databases using ADO from Python. I use it in my new Python-based Hillmaker.
Python is open source, free, and cross-platform. It is [fun to program in Python](http://xkcd.com/353/). For this tutorial, we'll use the IPython Notebook shell.

IPython has something called [magic functions](http://ipython.org/ipython-doc/dev/interactive/tutorial.html) which start with the % sign and facilitate common tasks like interacting with the OS, controlling the IPython session, timing
code, profiling code, getting a history of your session and many, many more. Just type %magic at the IPython prompt to get an overview. Hit <esc> to get back to the prompt. The '!' will let you 
enter a system command (such as `ls` in Linux or `dir` in Windows). While !cd should work, I've had problems getting it to work in Windows. However, using the magic function %cd seems to work
just fine.

If you aren't already in the directory containing the data file for this tutorial, you can use `!cd` to get there. If that doesn't work, try the `%cd` magic.

Import some modules we'll need.

- pandas - Python for data analysis
- numpy - the underlying array engine
- matplotlib - for graphing (we'll load this later)

Pandas is a Python module for data analysis. It focuses on data structures that make it efficient and easy to do "group by", time series and other data analytics types
of things. It's not a stat package. Wes McKinney developed pandas while working as a quant in the financial services industry and recently published 
[Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) (a must have). His [blog](http://blog.wesmckinney.com/) has tons of great stuff
on pandas and the computational aspects of data analytics. Pandas uses numpy as its underlying numerical array engine.

In [1]:
import pandas as pd
import numpy as np


In [2]:
# Let's check what versions of pandas and numpy we are using
print ("pandas version ", pd.__version__)
print ("numpy version ", np.__version__)


pandas version  0.18.1
numpy version  1.11.1


Read the data and learn about the basic data structures
-------------------------------------------------------

One of the main data structures in pandas is the `DataFrame`. It's similar to a data frame in R and you can think of it as just a data table with field names and an index. An index is just a set of row labels and by default will just be an integer sequence starting at 0. However, Pandas takes great pains to be smart in the way it uses these indexes. Here's a quote from the documentation:

> Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken  unless done so explicitly by you.

This tenet will make it easy for us to do things like joining separate data structures on their indexes and automatically aligning data structures even if one of them has "missing rows".
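
To make the alignment idea concrete, here's a tiny sketch with made-up case counts (the names `cases_2011` and `cases_2012` and the numbers are just for illustration, they are not from our data file). Notice that the two `Series` are in different orders and one of them is missing a row.

```python
import pandas as pd

# Hypothetical case counts for two years, indexed by surgical service.
cases_2011 = pd.Series([120, 45, 80], index=['Cardiology', 'Podiatry', 'Urology'])
cases_2012 = pd.Series([50, 130], index=['Podiatry', 'Cardiology'])

# Addition lines the values up by index label, not by position.
total = cases_2011 + cases_2012
print(total)

# Urology has no match in cases_2012, so its total is missing (NaN).
# Series.add() with fill_value treats the missing entry as 0 instead.
total_filled = cases_2011.add(cases_2012, fill_value=0)
print(total_filled)
```

The plain `+` leaves a missing value where a label has no partner, which is usually what you want to see first before deciding how to fill it.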

A really nice tutorial for learning about Pandas data structures, written for those coming with a SQL background is [http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/](http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/). You should definitely take a look at this tutorial (and ideally create a notebook and work along with it).


Ok, here we go. A simple way to create a data frame is to read a csv file into pandas with the `read_csv` function. Like the `csv` module in Python or the `read.csv` function in R, there are numerous parameters for doing things like specifying the delimiter, dealing with quoted text, indicating whether or not a header line exists, or passing in a list of desired column names. The [IO Tools section of the documentation](http://pandas-docs.github.io/pandas-docs-travis/io.html) has all the details.

In [3]:
sched_df = pd.read_csv('data/SchedDaysAdv.csv')

In [4]:
# Typing the name of the dataframe will either get you some rows or info about the structure.
sched_df

Unnamed: 0,ID,SurgeryDate,Service,ScheduledDaysInAdvance,Urgency,InsuranceStatus
0,0,2012-07-05 00:00:00,Cardiology,9,Routine,Private
1,1,2009-10-08 00:00:00,Podiatry,34,Routine,Private
2,2,2009-06-11 00:00:00,Oral-Maxillofacial Surg,22,Routine,Private
3,3,2011-02-18 00:00:00,General Surgery,1,Urgent,Private
4,4,2012-08-20 00:00:00,Orthopedic Surgery,14,Routine,Private
5,5,2010-12-16 00:00:00,General Surgery,38,Routine,Medicare
6,6,2009-12-21 00:00:00,Orthopedic Surgery,10,Routine,Private
7,7,2009-03-24 00:00:00,GYN Surgery,34,Routine,Private
8,8,2009-11-25 00:00:00,Urology/GU Surgery,25,Routine,Private
9,9,2009-10-01 00:00:00,General Surgery,12,Routine,Medicaid


We can check out the structure of this dataframe with the `info()` method. Do a `help(sched_df.info)` to learn more
about this handy method.

In [5]:
sched_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 6 columns):
ID                        20000 non-null int64
SurgeryDate               20000 non-null object
Service                   20000 non-null object
ScheduledDaysInAdvance    20000 non-null int64
Urgency                   20000 non-null object
InsuranceStatus           20000 non-null object
dtypes: int64(2), object(4)
memory usage: 937.6+ KB


Notice that a default index was created using an integer sequence from 0 to 19999. As the first line of the `info()` output shows, the index itself is an instance of a pandas `RangeIndex` object.

Just like an R `data.frame` is really just a collection of R `vector` objects (each having their own data type), a pandas `DataFrame` is a collection of `Series` objects. Each `Series` can have its own data type and a `Series` also has an index. So, a `DataFrame` is really a collection of `Series` objects with a shared index.

Also, it looks like the datetime field has been interpreted by pandas as an object (that is, as plain strings). There's plenty of trickiness and numerous details in working with dates and times in pandas. For now, let's just take a look at the dataframe and learn some basic techniques for referencing columns and rows. It's very similar to what we will do with R `data.frame` objects.
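
If you do want real dates, here's a minimal sketch of two common ways to get them. It reads a couple of made-up rows from a string (via `io.StringIO`) so that it runs without the tutorial's data file; the rows and the names `csv_text`, `raw_df` and `parsed_df` are just for illustration.

```python
import io
import pandas as pd

# A few made-up rows laid out like the tutorial's CSV file.
csv_text = """ID,SurgeryDate,Service,ScheduledDaysInAdvance
0,2012-07-05 00:00:00,Cardiology,9
1,2009-10-08 00:00:00,Podiatry,34
"""

# Default behavior: the date column is read in as strings, not dates.
raw_df = pd.read_csv(io.StringIO(csv_text))
print(raw_df['SurgeryDate'].dtype)

# Option 1: ask read_csv to parse the column while reading.
parsed_df = pd.read_csv(io.StringIO(csv_text), parse_dates=['SurgeryDate'])
print(parsed_df['SurgeryDate'].dtype)

# Option 2: convert an existing column after the fact.
raw_df['SurgeryDate'] = pd.to_datetime(raw_df['SurgeryDate'])
years = raw_df['SurgeryDate'].dt.year.tolist()
print(years)
```

Once the column is a datetime type, the `.dt` accessor gives you pieces like year, month and day of week, which make handy group by fields.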

In [None]:
# Check out the first few rows of sched_df
sched_df.head() 

In [None]:
# Check out the last few rows.
sched_df.tail() 

In [None]:
# Number of rows and columns?
sched_df.shape

Selecting columns and rows
--------------------------

There are multiple ways of selecting subsets of a pandas `DataFrame`. 


### Rows

We saw above that `head` and `tail` return 5 rows from the beginning and end of the data frame. To get a specific number:

In [6]:
# First 10
sched_df.head(10)

Unnamed: 0,ID,SurgeryDate,Service,ScheduledDaysInAdvance,Urgency,InsuranceStatus
0,0,2012-07-05 00:00:00,Cardiology,9,Routine,Private
1,1,2009-10-08 00:00:00,Podiatry,34,Routine,Private
2,2,2009-06-11 00:00:00,Oral-Maxillofacial Surg,22,Routine,Private
3,3,2011-02-18 00:00:00,General Surgery,1,Urgent,Private
4,4,2012-08-20 00:00:00,Orthopedic Surgery,14,Routine,Private
5,5,2010-12-16 00:00:00,General Surgery,38,Routine,Medicare
6,6,2009-12-21 00:00:00,Orthopedic Surgery,10,Routine,Private
7,7,2009-03-24 00:00:00,GYN Surgery,34,Routine,Private
8,8,2009-11-25 00:00:00,Urology/GU Surgery,25,Routine,Private
9,9,2009-10-01 00:00:00,General Surgery,12,Routine,Medicaid


In [7]:
# Last 7
sched_df.tail(7)

Unnamed: 0,ID,SurgeryDate,Service,ScheduledDaysInAdvance,Urgency,InsuranceStatus
19993,19993,2011-01-12 00:00:00,Gastroenterology,52,Routine,Medicaid
19994,19994,2010-10-15 00:00:00,Oral-Maxillofacial Surg,11,Routine,Private
19995,19995,2011-04-26 00:00:00,GYN Surgery,36,Routine,Private
19996,19996,2009-07-17 00:00:00,Oral-Maxillofacial Surg,22,Routine,Private
19997,19997,2010-11-02 00:00:00,GYN Surgery,50,Routine,
19998,19998,2012-12-20 00:00:00,General Surgery,21,Routine,Medicare
19999,19999,2012-07-26 00:00:00,Obstetrics,1,Emergency,Private


You can also use Python "index slicing" just as we did with Numpy arrays. Remember your slicing rules... Think about what the next command will do before trying it. Also notice the output style difference when using `print`.

In [8]:
print (sched_df[1:4])

   ID          SurgeryDate                  Service  ScheduledDaysInAdvance  \
1   1  2009-10-08 00:00:00                 Podiatry                      34   
2   2  2009-06-11 00:00:00  Oral-Maxillofacial Surg                      22   
3   3  2011-02-18 00:00:00          General Surgery                       1   

   Urgency InsuranceStatus  
1  Routine         Private  
2  Routine         Private  
3   Urgent         Private  


### Columns

You can select columns by name. As you might expect, if you select a single column you get a `Series` object and if you select more than one column you are getting a `DataFrame` object. To select multiple columns, pass in a list of column names.

In [None]:
sched_df['ScheduledDaysInAdvance'].head()

In [None]:
# What is the data type of a single column from a data frame?
type(sched_df['ScheduledDaysInAdvance'])

In [None]:
# See data type for a data frame
type(sched_df)

In [None]:
# You can also reference column names like attributes, and yes, tab completion works.
sched_df.ScheduledDaysInAdvance.head()

In [None]:
# To select two columns, use a list of column names.
sched_df[['ID','ScheduledDaysInAdvance']].head()

In [None]:
# Combining row and column selection using the above methods looks like this. Notice
# that the syntax is df[rows][cols].
sched_df[0:4][['ID','ScheduledDaysInAdvance']]

Note, this is different from the way we referenced elements in a 2D numpy array - x[rows to select,cols to select]. A `DataFrame` is not a 2D array (which it can't be of course - why?). It's really more like a dictionary of `Series` objects with the key being the column name.
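
Here's a small sketch of that dictionary-like behavior, built from two made-up `Series` (the values and the name `df` are just for illustration).

```python
import pandas as pd

# Hypothetical columns: each Series has its own dtype, both use the default index.
service = pd.Series(['Cardiology', 'Podiatry', 'Urology'])
days = pd.Series([9, 34, 25])

# Build the DataFrame from a dictionary of Series keyed by column name.
df = pd.DataFrame({'Service': service, 'ScheduledDaysInAdvance': days})

# Column names behave like dictionary keys.
has_service = 'Service' in df        # membership test looks at column names
column_names = list(df.keys())       # same names as df.columns
print(has_service, column_names)

# Looking up a key returns the Series stored under that column name.
for name in column_names:
    print(name, type(df[name]).__name__, df[name].dtype)
```

This is also a hint for the "why?" above: each column can hold a different data type, which a single 2D numpy array is not designed for.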

### Boolean indexing

A very powerful technique is to select rows using logical vectors. We will use this quite a bit in both Python and R. If you've had MIS 546 with me, you'll see that this is a little like using Boolean arrays inside of Excel array formulas. For example, to select all the rows for which Service is 'Cardiology':

In [None]:
sched_df[sched_df.Service == 'Cardiology']

You can combine conditions using the `&` (and) and `|` (or) operators. **IMPORTANT** Use parentheses around the logical clauses!

In [None]:
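# This fails: & binds more tightly than ==, so without parentheses Python
# tries to evaluate 'Podiatry' & sched_df.Urgency first.
sched_df[sched_df.Service == 'Podiatry' & sched_df.Urgency == 'Urgent']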

In [None]:
sched_df[(sched_df.Service == 'Podiatry') & (sched_df.Urgency == 'Urgent')]

In [None]:
# You try it

# Select the ID, Service, InsuranceStatus and Urgency columns for those rows in 
# which ScheduledDaysInAdvance is greater than 60.

This is just the tip of the iceberg in terms of indexing and selecting subsets of `DataFrame` and `Series` objects. Pandas provides indexers such as `loc` for label based selection and `iloc` for positional selection. The pandas version used here also has `ix` for mixed label and positional selection, but `ix` was later deprecated and then removed, so prefer `loc` and `iloc`. These make very sophisticated selection possible, especially in the case of hierarchical indexes. See the [Indexing and Selecting chapter of the official documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html). We'll return to this topic as needed. It is one of the strengths of Pandas.
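
As a quick taste of `loc` and `iloc`, here's a sketch on a small made-up data frame. I've given it a non-default index (101 to 104, purely hypothetical) so that labels and positions are different things.

```python
import pandas as pd

# Hypothetical data with a non-default index so labels differ from positions.
df = pd.DataFrame(
    {'Service': ['Cardiology', 'Podiatry', 'Urology', 'Podiatry'],
     'ScheduledDaysInAdvance': [9, 34, 25, 70]},
    index=[101, 102, 103, 104],
    columns=['Service', 'ScheduledDaysInAdvance'])

# loc selects by label; label slices INCLUDE the end point (101, 102, 103).
by_label = df.loc[101:103, ['Service', 'ScheduledDaysInAdvance']]

# iloc selects by integer position; slices EXCLUDE the end point (0, 1, 2).
by_position = df.iloc[0:3, [0, 1]]

# loc also takes a Boolean mask for the rows plus a list of columns,
# which does row and column selection in one step.
long_lead = df.loc[df['ScheduledDaysInAdvance'] > 30, ['Service']]
print(long_lead)
```

Both `by_label` and `by_position` end up with the same three rows. The single-step `loc` form is generally a safer habit than the chained `df[rows][cols]` form, especially once you start assigning values to a selection.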

## Analysis ##

We will start with creating a new computed column, do some basic summary stats, move on to more complex calculations and finish up with some basic graphing.


 


In [None]:
'aaa'.upper()

### Create a new computed column
This can actually be a little bit tricky. For example, let's assume we wanted to create a new column called `Service_short` that contained just the first three letters, all in caps, of the `Service` field. You would think that something like the following should work:

```
sched_df['Service_short'] = sched_df['Service'][0:3].upper()
```

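It doesn't work. See below. We want our transformation to work "element-wise" but instead it's trying to work with the entire series object: `[0:3]` slices the first three rows of the `Series` (not the first three characters of each value) and a `Series` has no `upper()` method.

In [None]:
# This won't work - it raises an AttributeError because a Series has no upper() method
sched_df['Service_short'] = sched_df['Service'][0:3].upper()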

So, the basic pattern is to use a combination of the `map` function and what is known as a *lambda function* (sometimes called an *anonymous function*). A *lambda function* is an inline function that has no name. Sounds useless but ends up being pretty useful. Here's the basic pattern applied to this example:

In [None]:
# Approach 1
sched_df['Service_short_1'] = sched_df['Service'].map(lambda x: x[0:3].upper())
print(sched_df['Service_short_1'].head())

We could also encapsulate the lambda logic in a separate user defined function and just call that. Think carefully about the difference and when you might want to apply each of these two approaches.

In [None]:
# Approach 2
def abbrev_upper(longstring, abbrev_len):
    return longstring[0:abbrev_len].upper()

sched_df['Service_short_2'] = sched_df['Service'].map(lambda x: abbrev_upper(x, 3))
print(sched_df['Service_short_2'].head())
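
A third option worth knowing about is the `.str` accessor, which applies string operations element-wise without a lambda. Here's a minimal sketch on a made-up `Series` of service names (the name `services` is just for illustration) showing that it gives the same result as the `map` approach.

```python
import pandas as pd

# Hypothetical service names standing in for sched_df['Service'].
services = pd.Series(['Cardiology', 'Podiatry', 'General Surgery'])

# map + lambda, as in Approach 1
short_map = services.map(lambda x: x[0:3].upper())

# Vectorized string methods: slice each value, then upper-case each value
short_str = services.str[0:3].str.upper()

print(short_str.tolist())
```

One practical difference: the `.str` methods pass missing values through as missing, whereas the lambda would raise an error if it hit a missing (NaN) value.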

### Basic summary stats ###

Let's start with some basic summary statistics regarding lead time by various dimensions.

Since ScheduledDaysInAdvance is the only measure, we'll do a bunch of descriptive statistics on it.

In [None]:
sched_df['ScheduledDaysInAdvance'].describe()

This is similar to R's `summary()` function. A few questions:

* What pandas data type is returned by `describe`? 
* What shape is the object returned by `describe`?
* What does `describe` do for a column like InsuranceStatus?
* What happens if you use `describe` on a data frame?

How about some percentiles?

In [None]:
p05_leadtime = sched_df.ScheduledDaysInAdvance.quantile(0.05)
p05_leadtime

In [None]:
p95_leadtime = sched_df['ScheduledDaysInAdvance'].quantile(0.95)
p95_leadtime

### Histogram and box plot ###

A popular plotting module for Python is [matplotlib](http://matplotlib.org/). The project homepage has many links to resources for learning it, with a very good place to start
being the official [documentation](http://matplotlib.org/contents.html) and the [gallery of graphs](http://matplotlib.org/gallery.html). There are a few modes of using matplotlib. There is a **pyplot** mode which is particularly well suited for interactive plotting
in a Python shell like [IPython](http://ipython.org/) (much like one would work in Mathematica or MATLAB). You can also use matplotlib within Python scripts either with
the pyplot commands or via an object-oriented API (similar to [ggplot2](http://ggplot2.org/) for plotting in [R](http://www.r-project.org/)).

Here is a basic histogram for ScheduledDaysInAdvance. For a more API based approach, see [this example from the matplotlib page](http://matplotlib.org/examples/statistics/histogram_demo_features.html) as
well as the next version of our histogram below.

In [None]:
import numpy as np
import matplotlib
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from IPython.display import display  # lets us redisplay a saved figure in a later cell

In [None]:
print ("matplotlib version ", matplotlib.__version__)

In [None]:
%matplotlib inline

In [None]:
data = sched_df['ScheduledDaysInAdvance']
print(len(data))

We'll use Sturges' Rule to determine the number of bins to use for our histogram. While this rule has some shortcomings, it's good enough for this example. See https://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width.

In [None]:
import math

In [None]:
# Let's figure out how to compute this
num_bins = ???
print(num_bins)

In [None]:
"""
Demo of the histogram (hist) function with a few features.

In addition to the basic histogram, this demo shows a few optional features:

    * Setting the number of data bins
    * The ``normed`` flag, which normalizes bin heights so that the integral of
      the histogram is 1. The resulting histogram is a probability density.
    * Setting the face color of the bars
    * Setting the opacity (alpha value).

"""

# the histogram of the data
# normed=1 plots a probability density instead of counts (matplotlib 2.1 and later call this
# density=True, and normed was removed in 3.1); alpha in [0,1] is transparency level (RGBA colors)
plt.hist(data, num_bins, normed=1, facecolor='green', alpha=0.5)
plt.xlabel('Days')
plt.ylabel('Probability')
plt.title('Histogram of Schedule Lead Time')
plt.axis([0, 200, 0, 0.06])
plt.grid(True)
plt.show()


In [None]:
"""
Same as demo above but modified number of bins, color of bars and transparency level

In addition to the basic histogram, this demo shows a few optional features:

    * Setting the number of data bins
    * The ``normed`` flag, which normalizes bin heights so that the integral of
      the histogram is 1. The resulting histogram is a probability density.
    * Setting the face color of the bars
    * Setting the opacity (alpha value).

"""

# the histogram of the data
# normed=1 plots probs instead of counts, alpha in [0,1] is transparency level (RGBA colors)
plt.hist(data, 50, normed=1, facecolor='blue', alpha=0.75)
plt.xlabel('Days')
plt.ylabel('Probability')
plt.title('Histogram of Schedule Lead Time')
plt.axis([0, 200, 0, 0.06])
plt.grid(True)
plt.show()

Now let's make the bars grey so we can start to see how to reference plot parts and modify them. This takes a while to learn and I make heavy use
of Stack Overflow - [http://stackoverflow.com/questions/14088687/python-changing-plot-background-color-in-matplotlib](http://stackoverflow.com/questions/14088687/python-changing-plot-background-color-in-matplotlib).
The matplotlib [hist() documentation](http://matplotlib.org/api/pyplot_api.html?highlight=hist#matplotlib.pyplot.hist) is essential. A useful [demo](http://matplotlib.org/examples/pylab_examples/histogram_demo_extended.html) is 
included in the matplotlib [examples area](http://matplotlib.org/examples/index.html).

The `pyplot.hist()` function actually returns a tuple (*n, bins, patches*) where *n* is an array of y values (the bar heights), *bins* is an array of bin edges on the x-axis (so it has one more element than there are bars), and *patches* is a
list of `Patch` objects (the bars in this case). By creating the histogram and capturing the return values, we can make it easier to make changes to the graph. In addition, notice that
we are saving the `Figure` and `Axes` objects. If we were writing code in a Python script, we could iteratively update a graph with a sequence of pyplot commands since they always
correspond to the *current figure*. However, in an IPython/Jupyter notebook, the reference to the current figure is lost every time a cell is evaluated. By creating the variables `fig1` and 
`ax1` below, we can get around this problem. Remember, we can still create a plot with a sequence of pyplot commands as long as we do it all within one cell (just as we did above
with the first histogram).

In [None]:
fig1 = plt.figure()
ax1 = fig1.add_subplot(1,1,1)
n, bins, patches = plt.hist(sched_df['ScheduledDaysInAdvance'], 50, normed=1, facecolor='grey', alpha=0.75)

Now let's change the bars to blue and the plot area background to a light grey. For both, we will be setting a `facecolor` property. For the bars it will be the facecolor of the 
`Patch` objects in the patches variable and for the plot area background it will be the facecolor of the `Axes` object associated with this plot. So, plots are housed inside figures. A figure could
contain multiple subplots (more on this soon). Each plot contains an `Axes` object that knows all about the axes of the plot. Time spent perusing examples and checking the documentation
at the matplotlib site is time well spent if you want to learn matplotlib.

In [None]:
ax1.patch.set_facecolor('#F0F0F0')
ax1.set_title('Histogram of Scheduled Lead Time')
ax1.set_xlabel('Days')
ax1.set_ylabel('Probability')
ax1.grid(True, color='k')
# Loop over the bars (Patch objects) and recolor each one.
for axp in ax1.patches:
    axp.set_facecolor('blue')

# Seems like there should be a simpler way. Of course, it's easy 
# to just rerun the plt.hist() with the desired color.
# However, this fine level of control makes it possible to set individual
# bar colors based on some condition. Inside the loop you could do all kinds
# of logical tests to pick the color of each bar.
display(fig1)
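
Here's a rough sketch of what coloring bars based on a condition might look like. It uses randomly generated lead times rather than our data, and the 60 day cutoff and the name `lead_times` are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up lead times so the sketch runs on its own.
rng = np.random.RandomState(0)
lead_times = rng.exponential(scale=20, size=500)

fig, ax = plt.subplots()
n, bins, patches = ax.hist(lead_times, bins=20, facecolor='grey', alpha=0.75)

# bins holds 21 edges for the 20 bars, so pair each bar with its left edge
# and recolor the bars that start at 60 days or more.
for left_edge, patch in zip(bins[:-1], patches):
    if left_edge >= 60:
        patch.set_facecolor('red')

ax.set_xlabel('Days')
ax.set_ylabel('Count')
```

The same loop works with any test you like, for example comparing each bar's height in `n` to a threshold.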


Admittedly, matplotlib can be tricky to work with because of its very detailed API. The pandas project provides some higher level wrapper functions to matplotlib to make
it a little easier to create standard plots. This is an area that pandas is actively working to expand. Let's recreate the first histogram with pandas and overlay a density plot
on it as well. Let's also truncate the x-axis at 100 so we can see the details a little better.

In [None]:
sched_df['ScheduledDaysInAdvance'].hist(bins=50, color='k', alpha=0.3, normed=True)
sched_df['ScheduledDaysInAdvance'].plot(kind='kde', style='k--', xlim=[0,100], title='Histo of Sched Lead Time (using pandas)')

### Box plots ###

We'll start with a basic box plot of lead time grouped by insurance status. The [matplotlib boxplot demo](http://matplotlib.org/examples/pylab_examples/boxplot_demo.html) and 
[more advanced demo](http://matplotlib.org/examples/pylab_examples/boxplot_demo2.html) are quite helpful. In particular,
we see that side-by-side box plots are possible by passing in a list of data vectors to be summarized. So, we could do the group by data shaping ourselves to create the list to pass in to
the matplotlib `boxplot()` function. Seems like a job for pandas.  

In [None]:

bp = sched_df.boxplot(column='ScheduledDaysInAdvance', by='InsuranceStatus')
fig2 = plt.gcf() # 'g'et 'c'urrent 'f'igure so we can use it later
ax2 = plt.gca()  # 'g'et 'c'urrent 'a'xes so we can use it later
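
For comparison, here's a sketch of the do-it-yourself route mentioned above: shape the data with a group by and hand the resulting list of vectors to matplotlib's `boxplot()`. The data frame is made up and the names `labels` and `data_vectors` are just for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for sched_df.
df = pd.DataFrame({
    'InsuranceStatus': ['Private', 'Medicare', 'Private', 'Medicaid', 'Medicare', 'Private'],
    'ScheduledDaysInAdvance': [9, 38, 14, 12, 21, 34],
})

# Iterating over a grouped column yields (group key, Series of values) pairs.
labels = []
data_vectors = []
for status, lead_times in df.groupby('InsuranceStatus')['ScheduledDaysInAdvance']:
    labels.append(status)
    data_vectors.append(lead_times.values)

# One box per vector, labeled with the group keys.
fig, ax = plt.subplots()
ax.boxplot(data_vectors)
ax.set_xticklabels(labels)
ax.set_ylabel('Days')
```

It's more typing than the pandas one-liner, but you end up holding the `Figure` and `Axes` objects from the start, which makes later tweaks easier.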

Let's try rotating x-axis labels. Much searching and trying leads to learning how to set x-axis label rotation. It's easier if you are creating the plot rather than modifying an existing plot.

In [None]:
labels = ax2.get_xticklabels()
for label in labels:
    label.set_rotation(90)
display(fig2)

Still hard to read, let's recreate the boxplot as a horizontally oriented set of plots.

In [None]:
bp = sched_df.boxplot(column='ScheduledDaysInAdvance', by='InsuranceStatus', vert=False)

### Group by summaries ###
Everything we've done so far (except for the box plots) has not considered any of the dimensions (factors, group by fields, etc.). Pandas is well suited for grouping and
aggregation. We'll do means and 95th percentiles of ScheduledDaysInAdvance by Urgency.

Start by creating a `GroupBy` object. A pandas `GroupBy` object doesn't actually do any aggregate math yet - it just creates a data structure that is "ready" for such operations.

In [None]:
sched_df_grp1 = sched_df.groupby(['Urgency'])
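
If you're curious about what's inside a `GroupBy` object, here's a small sketch on made-up data (the names `df`, `grp` and `group_sizes` are just for illustration). It shows that the object knows its groups and can hand them back one at a time, even though no summary statistic has been computed.

```python
import pandas as pd

# Made-up rows standing in for sched_df.
df = pd.DataFrame({
    'Urgency': ['Routine', 'Urgent', 'Routine', 'Emergency'],
    'ScheduledDaysInAdvance': [10, 1, 20, 0],
})

grp = df.groupby('Urgency')

# How many distinct groups were found?
print(grp.ngroups)

# Iterating yields (group key, sub-DataFrame) pairs.
group_sizes = {}
for urgency, sub_df in grp:
    group_sizes[urgency] = len(sub_df)
print(group_sizes)

# Pull out the rows for a single group.
routine = grp.get_group('Routine')
print(routine)
```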

Now we can use it to compute whatever summary stats we'd like.

In [None]:
# counts by Urgency
counts_by_urgency = sched_df_grp1['ScheduledDaysInAdvance'].size()
print(counts_by_urgency)

# Let's see what data structure the expression above actually is. What do you think it is?
type(counts_by_urgency)

In [None]:
# mean of ScheduledDaysInAdvance by Urgency
sched_df_grp1['ScheduledDaysInAdvance'].mean()

In [None]:
# 95th percentile of ScheduledDaysInAdvance by Urgency
sched_df_grp1['ScheduledDaysInAdvance'].quantile(0.95)

Now group by Urgency and InsuranceStatus and recompute the summary statistics.

In [None]:
sched_df_grp2 = sched_df.groupby(['Urgency','InsuranceStatus'])
sched_df_grp2['ScheduledDaysInAdvance'].mean()

In [None]:
sched_df_grp2['ScheduledDaysInAdvance'].quantile(0.95)
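
Two things that often come in handy at this point are computing several statistics at once with `agg` and reshaping a two-level result into a table with `unstack`. Here's a minimal sketch on made-up data; the helper `p95` is a hypothetical function I wrote for this example, not something pandas provides.

```python
import pandas as pd

# Made-up rows standing in for sched_df.
df = pd.DataFrame({
    'Urgency': ['Routine', 'Routine', 'Urgent', 'Urgent', 'Routine', 'Urgent'],
    'InsuranceStatus': ['Private', 'Medicare', 'Private', 'Private', 'Private', 'Medicare'],
    'ScheduledDaysInAdvance': [10, 30, 1, 3, 20, 5],
})

def p95(x):
    """95th percentile of a Series (hypothetical helper for this sketch)."""
    return x.quantile(0.95)

grouped = df.groupby(['Urgency', 'InsuranceStatus'])['ScheduledDaysInAdvance']

# Several statistics in one call; rows carry a two-level (hierarchical) index
# and the function name p95 becomes a column name.
summary = grouped.agg(['count', 'mean', p95])
print(summary)

# Move the InsuranceStatus level from the rows to the columns.
mean_table = grouped.mean().unstack('InsuranceStatus')
print(mean_table)
```

The unstacked table is laid out like an Excel pivot table, with Urgency down the side and InsuranceStatus across the top.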

### Histograms revisited ###
Let's just see how easy it is to do a matrix of histograms - something that is no fun at all in Excel.

In [None]:
sched_df['ScheduledDaysInAdvance'].hist(bins=50, facecolor='k', alpha=0.3, normed=True, by=sched_df['Urgency'])

The output of the `hist()` function with the `by` keyword is a 2D array of `AxesSubplot` objects, one for each small histogram. Each row of the array contains two elements. If you want to
suppress that Out[] message above the graph, just end the In[] line with a `;`.

It looks like a pandas bug in that the histogram properties (such as color) specified aren't having any effect. Perhaps they can be changed after the plotting. Something to explore later.



## Related Links ##

- [matplotlib api date demo with tick and label formatting](http://matplotlib.org/examples/api/date_demo.html)
- [matplotlib tutorial](http://www.loria.fr/~rougier/teaching/matplotlib/#tutorials)


## Solutions to Challenges

### Questions about `describe`

A few questions:

* What pandas data type is returned by `describe`? 
* What shape is the object returned by `describe`?
* What does `describe` do for a column like InsuranceStatus?
* What happens if you use `describe` on a data frame?

In [None]:
# What pandas data type is returned by describe?
type(sched_df.ScheduledDaysInAdvance.describe())

In [None]:
# What shape is the object returned by describe?
sched_df.ScheduledDaysInAdvance.describe().shape

In [None]:
# What does describe do for a column like InsuranceStatus?
sched_df.InsuranceStatus.describe()

In [None]:
# What happens if you use describe on a data frame?
sched_df.describe()

### Selecting multiple columns and using boolean indexing for rows

In [None]:
# You try it

# Select the ID, Service, InsuranceStatus and Urgency columns for those rows in 
# which ScheduledDaysInAdvance is greater than 60.

data_1 = sched_df[sched_df.ScheduledDaysInAdvance > 60][['ID','Service','InsuranceStatus','Urgency']]

In [None]:
data_1

### Sturges' rule

In [None]:
# Sturges' rule: number of bins = ceil(log2(n)) + 1, where n is the number of data points
num_bins = math.ceil(math.log2(len(sched_df))) + 1
print(num_bins)