In [47]:
from IPython.core.display import HTML

def set_css_style(css_file_path):
    """
    Read the custom CSS file and load it into Jupyter.
    Pass the file path to the CSS file.
    """
    # Use a context manager so the file is closed after reading
    with open(css_file_path, "r") as css_file:
        styles = css_file.read()
    return HTML(styles)

set_css_style('styles/custom.css')

### Python Libraries

- Python packages are collections of Python files (called modules) that provide functionality not built into Python.

  - Functions are defined in modules, which are just files, organized into packages (folders).
  
![](images/packages_modules.png)

### Types of Python Packages

There are three types of modules and packages in Python:

1. Modules bundled with the Python distribution but not available for use until imported.
  - Ex. `datetime`, `random`, `collections`, etc.

2. Packages not bundled with Python itself but installed by the Conda installer. Those must also be imported before use.
  - Ex. `numpy`, `pandas`, `matplotlib`, etc.

3. Other specialized packages that need to be installed manually.
  - Ex. `BioPython`, `astropy`, `tensorflow`, etc.

### Working with Modules
- An external module (functionality not built into Python) must be imported before it can be used
  - Modules are imported using:
  
```python
import some_module
from some_module import some_functionality
```
- The first statement imports __all__ the functionality in `some_module`


- The second statement imports only `some_functionality`


- Both give access to the same functionality. The difference is how it is then called, and the choice is largely one of preference.


- To call the functionality in the first example, you use 

```python
some_module.some_functionality
```
- To call the functionality in the second example, you use 

```python
some_functionality
```

### Example of imports

- The example below imports the `math` module and all its functionality

```python
>>> import math
>>> math.ceil(10.3)
11

>>> math.cos(2*math.pi)
1.0
```

- The example below imports only the `log` function and the constant `e`
  - we can separate the objects we import from a module using a comma

- The example computes $\log(e^3)$

```python
>>> from math import log, e
>>> log(e**3)
3.0
```
- In the above, since we did not explicitly import `cos`, calling it without the `math.` prefix does not work

```python
>>> cos(2*math.pi)

-------------------------------------------------
NameError       Traceback (most recent call last)
...
NameError: name 'cos' is not defined

```
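
The two import styles described above can be compared side by side. This is a minimal sketch using only the standard-library `math` module; the variable names `rounded_up` and `natural_log` are illustrative and not part of the original material.

```python
import math                 # style 1: import the whole module
from math import log, e     # style 2: import only selected names

# Style 1: names are reached through the module prefix
rounded_up = math.ceil(10.3)

# Style 2: the imported names are used directly, without a prefix
natural_log = log(e ** 3)

print(rounded_up)              # 11
print(round(natural_log, 6))   # 3.0
```

Both styles can coexist in the same notebook; the prefix form makes it easier to see where a function comes from.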



### Brief Introduction to Pandas

- `pandas` is the de facto package for working with tabular data
  - Think of it as Excel on steroids
    
- Supports a plethora of file formats (Excel, CSV, TSV, JSON, HDF5, ...)
- Supports date and time data natively


![](images/pandas_architecture.png)


### Pandas `Series` and `DataFrames`

- The `Conda` installer includes `pandas`.

- `pandas` relies principally on two types of data structures:

  1. `Series`: list-like objects that store data in a given order.

    - It helps to think of a `Series` as a column (or a row) in MS Excel.

  2. `DataFrames`: spreadsheet-like tables that contain one or more `Series`.

    - It helps to think of `DataFrames` as tables (or spreadsheets) in MS Excel.

    - Similar to R's `data.frame`.


### `Series`


- A `Series` is similar to a Python `list`.

  - It is an ordered collection of items, usually of the same datatype (a `Series` holding mixed types gets the generic `object` dtype).
  
- A `Series` also contains an array of labels, one associated with each data entry
![](images/series.png)



### About the Data

- We will be working with the data in the `data/spending.csv` file. The data contains a subset of the cost of drugs prescribed to Medicare patients, listed by prescribing doctor.

- The complete dataset is publicly available from the Centers for Medicare & Medicaid Services ([`CMS` website](https://www.cms.gov/OpenPayments/Explore-the-Data/Dataset-Downloads.html)).

- This toy dataset contains the following columns:

| Column |Description|
|:----------|-----------|
| `unique_id`| A unique identifier for a Medicare claim to CMS |
| `doctor_id` | The unique identifier of the doctor who <br/> prescribed the medicine  |
| `specialty` | The specialty of the doctor who prescribed the medicine |
| `medication` | The medication prescribed |
| `nb_beneficiaries` | The number of beneficiaries the <br/> medicine was prescribed to  |
| `spending` | The total cost of the medicine prescribed <br/>for the CMS |


![](images/medicare_data.png)

### Example of Series as Rows and Columns

- It helps to think of a `pandas` `Series` as either a row or a column of an Excel spreadsheet

![](images/possible_series.png)




### Creating a `pandas` `Series`


- To work with `pandas`, you need to import it first

```python
import pandas
```

To create a `pandas` `Series`, call the `Series` constructor and pass it a `list` of values and a `list` of labels

```python

>>> s = pandas.Series(data=[1234, 'DIAZEPAM', 3, '$32'],
...                   index=['doctor_id', 'medication',
...                          'nb_beneficiaries', 'spending'])

>>> s
doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
spending                 $32
dtype: object
```


In [2]:
import pandas
s =  pandas.Series( data= [1234, 'DIAZEPAM', 3, '$32'], index= ['doctor_id', 'medication', 'nb_beneficiaries', 'spending'])
s

doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
spending                 $32
dtype: object

### Indexing `Series`

- You can access the data in a `Series` by index (the position of the object), using the same approach seen in list indexing

```python
>>> s[1]
'DIAZEPAM'

>>> s[-1]
'$32'
```

- You can also access a value in the `Series` using its index label

```python
>>> s["medication"]
'DIAZEPAM'

>>> s["spending"]
'$32'
```
- Based on the above, it is fair to think of a `Series` as a hybrid between lists and dictionaries

In [11]:
print(s[1])
print(s["medication"])

DIAZEPAM
DIAZEPAM
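
The list-like and dictionary-like access described above can also be written with the explicit indexers `iloc` (by position) and `loc` (by label). This is a sketch on the same example `Series`. To our understanding, recent `pandas` releases warn about plain integer access such as `s[1]` when the labels are not integers, so the explicit form is likely the safer habit.

```python
import pandas

s = pandas.Series(
    data=[1234, "DIAZEPAM", 3, "$32"],
    index=["doctor_id", "medication", "nb_beneficiaries", "spending"],
)

by_position = s.iloc[1]          # second entry, as in list indexing
by_label = s.loc["medication"]   # entry labeled "medication", as with a dict key
last_entry = s.iloc[-1]          # negative positions count from the end

print(by_position)   # DIAZEPAM
print(by_label)      # DIAZEPAM
print(last_entry)    # $32
```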


### Subsetting `Series`

- As with lists, subsetting a `Series` can be carried out through the range operator `:`

```python
>>> s[0:3]
doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
dtype: object
```

- Subsetting `Series` can also be done with lists of indices

```python
>>> s[[0,2,1]]
doctor_id               1234
nb_beneficiaries           3
medication          DIAZEPAM
dtype: object
```

- Note that the line above contains two sets of square brackets `[[ ]]`: the outer set is the indexing operator, and the inner set is the delimiter for the list.
    

In [18]:
s[0:3]

doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
dtype: object

In [19]:
s[[0,2]]


doctor_id           1234
nb_beneficiaries       3
dtype: object

### Subsetting Series - Cont'd

- Subsetting a `Series` can also be done through lists of labels

```python

>>> s[["doctor_id", "nb_beneficiaries"]]
doctor_id           1234
nb_beneficiaries       3
dtype: object
```



- Note that the line above also contains two sets of square brackets `[[ ]]`: the outer set is the indexing operator, and the inner set is the delimiter for the list.
    
    

In [5]:
s[["doctor_id", "nb_beneficiaries"]]

doctor_id           1234
nb_beneficiaries       3
dtype: object
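
The subsetting options above (a range, a list of positions, a list of labels) can be checked against each other. The sketch below uses the explicit `iloc` and `loc` indexers on the same example `Series`; the variable names are illustrative.

```python
import pandas

s = pandas.Series(
    data=[1234, "DIAZEPAM", 3, "$32"],
    index=["doctor_id", "medication", "nb_beneficiaries", "spending"],
)

by_slice = s.iloc[0:3]                                  # first three entries
by_positions = s.iloc[[0, 2]]                           # list of positions
by_labels = s.loc[["doctor_id", "nb_beneficiaries"]]    # list of labels

# The position list and the label list select the same two entries
print(by_positions.equals(by_labels))   # True
print(by_slice.index.tolist())
```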

### `DataFrames` - Cont'd


- It helps to think of `DataFrames` as spreadsheet-like tables made of sets of ordered `Series`


![](images/df_as_series.png)


- The example above illustrates column series, but the analogy holds for row `Series`

### Reading a `DataFrame` From a File

- A `DataFrame` can be created by reading data from a file

  - `pandas` supports many input formats, including Excel, CSV, TSV, SAS, STATA, etc.

- We can read the example tab-delimited (TSV) spending table using:
    
```python
>>> spending_df  = pandas.read_table("data/spending.csv")
```

- Jupyter beautifies `DataFrame`s by displaying them as `HTML` tables.

  - Using `print` prints them as plain text instead. To get the table rendering, put the name of the `DataFrame` as the last statement in the cell.
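
Reading tab-delimited data can be tried without any file on disk. The sketch below feeds `read_table` an in-memory text buffer holding the first two rows of the spending table; `tsv_text` and `sample_df` are illustrative names, and using `io.StringIO` as a stand-in for the file is an assumption made only to keep the example self-contained.

```python
import io
import pandas

# Two rows in the layout of the spending table, as tab-separated text
tsv_text = (
    "unique_id\tdoctor_id\tspecialty\tmedication\tnb_beneficiaries\tspending\n"
    "AB789982\t1952310666\tPsychiatry\tCLONAZEPAM\t226\t$1,848.88\n"
    "AV967778\t1952310666\tPsychiatry\tDIAZEPAM\t103\t$662.87\n"
)

# read_table splits on tabs by default
sample_df = pandas.read_table(io.StringIO(tsv_text))

print(sample_df.columns.tolist())
print(len(sample_df))   # 2
```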
  

In [26]:
spending_df  = pandas.read_table("data/spending.csv")

spending_df

| | unique_id | doctor_id | specialty | medication | nb_beneficiaries | spending |
|:--|:--|:--|:--|:--|:--|:--|
| 0 | AB789982 | 1952310666 | Psychiatry | CLONAZEPAM | 226 | \$1,848.88 |
| 1 | AV967778 | 1952310666 | Psychiatry | DIAZEPAM | 103 | \$662.87 |
| 2 | CC128705 | 1298765423 | Cardiology | NADOLOL | 13 | |
| 3 | GH890091 | 1346358827 | Family | HYDROCODONE | 331 | \$8,511.14 |
| 4 | YY219322 | 1548247315 | Psychiatry | ALPRAZOLAM | 28 | \$1,964.49 |
| 5 | YY190561 | 1548247315 | Psychiatry | GABAPENTIN | 86 | \$1,807.16 |
| 6 | YY572610 | 1548247315 | Psychiatry | MIRTAZAPINE | 191 | \$3,131.96 |
| 7 | PL346720 | 1326175365 | Family | OXYCODONE HCL | 87 | \$12,881.04 |
| 8 | GZ129032 | 1518970284 | Hemato-oncology | DIGOXIN | 54 | \$3,766.34 |


### Dimensions of the DataFrame 

- A useful property of the `DataFrame` is `shape`.

  - This property describes the number of rows and the number of columns in your `DataFrame`.
  
- `shape` returns a tuple: a sequence of elements, here the number of rows followed by the number of columns.

  - Tuples are delimited by `( )` rather than `[ ]`

```python
>>> spending_df.shape
(9, 6)
```

- The above indicates that the `spending_df` `DataFrame` has `9` rows (table entries) and `6` columns.
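
Because `shape` is a tuple, its two elements can be unpacked into separate variables. The sketch below uses a small hypothetical `DataFrame` (`toy_df`) built from a few values of the spending table rather than the full file.

```python
import pandas

# Hypothetical three-row table for illustration
toy_df = pandas.DataFrame({
    "medication": ["CLONAZEPAM", "DIAZEPAM", "NADOLOL"],
    "nb_beneficiaries": [226, 103, 13],
})

n_rows, n_cols = toy_df.shape   # tuple unpacking: rows first, then columns

print(toy_df.shape)   # (3, 2)
print(n_rows)         # 3
print(n_cols)         # 2
```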

### DataFrame Indexes and Columns

- As opposed to `Series`, which only have an index, `DataFrame`s have both row (index) labels and positions and column labels and positions.
  - In the example below, since a row index was not explicitly provided, the row labels are the same as the row positions, i.e., 0 through 8.


- They can, therefore, be indexed by row position, by row label, by column position, by column label, or by a combination of these


![](images/df_index_cols.png)



- Row indexing by position can be carried out by passing the `iloc` (integer location) operator a single index or a list of indexes.

  -  The `iloc` operator uses `[]` instead of the `()` that methods and functions use.


```python    
>>> spending_df.iloc[3]
```

or

```python    
>>> spending_df.iloc[[1,5]]
```

- When a single index is given, `pandas` returns a `Series`
  - Remember that a row is simply a Series
  
- When a list of indices is given, `pandas` returns another `DataFrame` 
  - The returned `DataFrame` is a subset of the original
  

In [30]:
# Returns row, or Series
spending_df.iloc[3]

unique_id              GH890091
doctor_id            1346358827
specialty                Family
medication          HYDROCODONE
nb_beneficiaries            331
spending              $8,511.14
Name: 3, dtype: object

In [31]:
# Returns two rows as a DataFrame

spending_df.iloc[[1,5]]

| | unique_id | doctor_id | specialty | medication | nb_beneficiaries | spending |
|:--|:--|:--|:--|:--|:--|:--|
| 1 | AV967778 | 1952310666 | Psychiatry | DIAZEPAM | 103 | \$662.87 |
| 5 | YY190561 | 1548247315 | Psychiatry | GABAPENTIN | 86 | \$1,807.16 |


### Indexing by Row and Column Indexes

- You can pass `iloc` a combination of row and column indexes using the following construct

```python
 data_frame.iloc[ row_index_info, column_index_info]
```
- `row_index_info` and `column_index_info` can be either a single index or a list of indexes.

Example:

```python
>>> spending_df.iloc[3, 3]
'HYDROCODONE'

>>> spending_df.iloc[2, [1, 3]]
doctor_id     1298765423
medication       NADOLOL
Name: 2, dtype: object

>>> spending_df.iloc[[2, 4], [1, 3]]
    doctor_id  medication
2  1298765423     NADOLOL
4  1548247315  ALPRAZOLAM
```
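
The three combinations above (single index with single index, single index with a list, list with a list) return three different kinds of objects. The sketch below shows this on a small hypothetical `DataFrame` (`toy_df`) built from a few values of the spending table.

```python
import pandas

# Hypothetical three-row table for illustration
toy_df = pandas.DataFrame({
    "doctor_id": [1952310666, 1952310666, 1298765423],
    "medication": ["CLONAZEPAM", "DIAZEPAM", "NADOLOL"],
    "nb_beneficiaries": [226, 103, 13],
})

single_value = toy_df.iloc[2, 1]          # one row, one column   -> a single value
one_row = toy_df.iloc[2, [0, 1]]          # one row, two columns  -> Series
sub_table = toy_df.iloc[[0, 2], [1, 2]]   # two rows, two columns -> DataFrame

print(single_value)                # NADOLOL
print(type(one_row).__name__)      # Series
print(type(sub_table).__name__)    # DataFrame
```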

### Reading Tables with Index Labels

- Rather than using the default integer index label created by `pandas`, a `DataFrame` can be indexed using one or more columns of data.

- The index need not be an integer and can consist of any type.

##### Recall

```python
spending_df  = pandas.read_table("data/spending.csv")
```

![](images/default_index.png)

### Specifying Index Labels

- We can use any column of the data to label the index by passing the label of that column to the `pandas` `read_table()` function through the `index_col` parameter.
  - For example, we can use the `unique_id` column as index labels 

```python
spending_df  = pandas.read_table( "data/spending.csv", 
                                  index_col=["unique_id"] )
```

![](images/custom_index.png)
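
The effect of `index_col` can be seen by reading the same text twice, once without it and once with it. The sketch below uses an in-memory buffer with a reduced, three-column version of the first two rows; `tsv_text`, `default_df` and `labeled_df` are illustrative names, and `io.StringIO` stands in for the file.

```python
import io
import pandas

tsv_text = (
    "unique_id\tmedication\tnb_beneficiaries\n"
    "AB789982\tCLONAZEPAM\t226\n"
    "AV967778\tDIAZEPAM\t103\n"
)

# Default: pandas creates integer row labels 0, 1, ...
default_df = pandas.read_table(io.StringIO(tsv_text))

# With index_col: the unique_id values become the row labels
labeled_df = pandas.read_table(io.StringIO(tsv_text), index_col="unique_id")

print(default_df.index.tolist())   # [0, 1]
print(labeled_df.index.tolist())   # ['AB789982', 'AV967778']
print(labeled_df.shape)            # (2, 2)
```

Note that the column used as the index no longer counts as a data column, which is why the labeled table has one column fewer.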


### Inspecting DataFrames
- When a `DataFrame` contains a large number of rows, it's common to use the method `head` or `tail` to display the first or last five entries, respectively, of the `DataFrame`.
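
Both methods also accept the number of rows to return. The sketch below uses a hypothetical one-column `DataFrame` (`toy_df`) holding a few `nb_beneficiaries` values.

```python
import pandas

# Hypothetical seven-row table for illustration
toy_df = pandas.DataFrame({"nb_beneficiaries": [226, 103, 13, 331, 28, 86, 191]})

first_rows = toy_df.head()    # first five rows by default
last_two = toy_df.tail(2)     # last two rows

print(len(first_rows))                          # 5
print(last_two["nb_beneficiaries"].tolist())    # [86, 191]
```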


In [35]:
# before specifying the index column

spending_df.head()

| unique_id | doctor_id | specialty | medication | nb_beneficiaries | spending |
|:--|:--|:--|:--|:--|:--|
| AB789982 | 1952310666 | Psychiatry | CLONAZEPAM | 226 | \$1,848.88 |
| AV967778 | 1952310666 | Psychiatry | DIAZEPAM | 103 | \$662.87 |
| CC128705 | 1298765423 | Cardiology | NADOLOL | 13 | |
| GH890091 | 1346358827 | Family | HYDROCODONE | 331 | \$8,511.14 |
| YY219322 | 1548247315 | Psychiatry | ALPRAZOLAM | 28 | \$1,964.49 |


In [None]:
spending_df.tail()

In [37]:

spending_df  = pandas.read_table( "data/spending.csv", 
                                  index_col=["unique_id"] )

# after specifying the index column
# note that index is now unique_id
spending_df.head()

| unique_id | doctor_id | specialty | medication | nb_beneficiaries | spending |
|:--|:--|:--|:--|:--|:--|
| AB789982 | 1952310666 | Psychiatry | CLONAZEPAM | 226 | \$1,848.88 |
| AV967778 | 1952310666 | Psychiatry | DIAZEPAM | 103 | \$662.87 |
| CC128705 | 1298765423 | Cardiology | NADOLOL | 13 | |
| GH890091 | 1346358827 | Family | HYDROCODONE | 331 | \$8,511.14 |
| YY219322 | 1548247315 | Psychiatry | ALPRAZOLAM | 28 | \$1,964.49 |


### Indexing DataFrames Columns using Labels

- Accessing data in the `DataFrame` can also be done using index or column labels.


- Indexing columns takes a single column label or a list of column labels

```python
spending_df["nb_beneficiaries"]
```
or 
```python
spending_df[["doctor_id", "nb_beneficiaries"]]
```

- When a single label is given, `pandas` returns a `Series`
  - Remember that a column is simply a `Series`
  
  
- When a list of labels is given, `pandas` returns another `DataFrame`

  - The returned `DataFrame` is a subset of the original

In [38]:
spending_df["nb_beneficiaries"]

unique_id
AB789982    226
AV967778    103
CC128705     13
GH890091    331
YY219322     28
YY190561     86
YY572610    191
PL346720     87
GZ129032     54
Name: nb_beneficiaries, dtype: int64

In [39]:
spending_df[["doctor_id", "nb_beneficiaries"]]

| unique_id | doctor_id | nb_beneficiaries |
|:--|:--|:--|
| AB789982 | 1952310666 | 226 |
| AV967778 | 1952310666 | 103 |
| CC128705 | 1298765423 | 13 |
| GH890091 | 1346358827 | 331 |
| YY219322 | 1548247315 | 28 |
| YY190561 | 1548247315 | 86 |
| YY572610 | 1548247315 | 191 |
| PL346720 | 1326175365 | 87 |
| GZ129032 | 1518970284 | 54 |


### Indexing `DataFrames` Row Using Labels

- Row indexing using labels can be carried out using the `loc` (label location) operator


  -  The `loc` operator uses `[]` instead of the `()` that methods and functions use.
 
-  `loc` can take a single label or a list of labels 

```python    
spending_df.loc["AV967778"]
```

or

```python    
spending_df.loc[["AV967778","YY219322"]]
```

- When a single label is given, `pandas` returns a `Series`.
  - Remember that a row is simply a `Series`.
  
- When a list of labels is given, `pandas` returns another `DataFrame`
  - The returned `DataFrame` is a subset of the original
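
The two cases above, plus the combination of a row label with a column label, can be tried on a small table. The sketch below builds a hypothetical `DataFrame` (`toy_df`) whose row labels are three `unique_id` values from the spending table.

```python
import pandas

# Hypothetical three-row table labeled by unique_id
toy_df = pandas.DataFrame(
    {
        "medication": ["CLONAZEPAM", "DIAZEPAM", "NADOLOL"],
        "nb_beneficiaries": [226, 103, 13],
    },
    index=["AB789982", "AV967778", "CC128705"],
)

one_row = toy_df.loc["AV967778"]                   # single label -> Series
two_rows = toy_df.loc[["AB789982", "CC128705"]]    # list of labels -> DataFrame
one_cell = toy_df.loc["CC128705", "medication"]    # row label and column label -> value

print(type(one_row).__name__)     # Series
print(type(two_rows).__name__)    # DataFrame
print(one_cell)                   # NADOLOL
```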

In [40]:
spending_df.loc["AV967778"]

doctor_id           1952310666
specialty           Psychiatry
medication            DIAZEPAM
nb_beneficiaries           103
spending               $662.87
Name: AV967778, dtype: object

In [42]:
spending_df.loc[["AV967778","YY219322"]]

| unique_id | doctor_id | specialty | medication | nb_beneficiaries | spending |
|:--|:--|:--|:--|:--|:--|
| AV967778 | 1952310666 | Psychiatry | DIAZEPAM | 103 | \$662.87 |
| YY219322 | 1548247315 | Psychiatry | ALPRAZOLAM | 28 | \$1,964.49 |


### Subsetting by Range on Labels

- Although not especially useful or intuitive in this case, the range operator also works with index and column labels.


- Therefore, the following lines are all valid

```python    
spending_df.loc["AB789982", "specialty":"spending"]    

spending_df.loc["AB789982":"GZ129032", "specialty"]    

spending_df.loc["AB789982":"GZ129032","specialty":"spending"]    
```

- As opposed to ranges in lists, the upper limit of a label range is included in the result
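
The difference in how the upper limit is handled can be seen by slicing the same data three ways. The sketch below uses a hypothetical four-entry `Series` with single-letter labels; none of these names or values come from the spending data.

```python
import pandas

letters = ["a", "b", "c", "d"]
s = pandas.Series([10, 20, 30, 40], index=letters)

list_slice = letters[0:2]        # plain list: upper limit excluded
position_slice = s.iloc[0:2]     # positions: upper limit excluded
label_slice = s.loc["a":"c"]     # labels: upper limit included

print(list_slice)                 # ['a', 'b']
print(position_slice.tolist())    # [10, 20]
print(label_slice.tolist())       # [10, 20, 30]
```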


In [45]:
spending_df.loc["AB789982":"GZ129032", "specialty":"nb_beneficiaries"]

| unique_id | specialty | medication | nb_beneficiaries |
|:--|:--|:--|:--|
| AB789982 | Psychiatry | CLONAZEPAM | 226 |
| AV967778 | Psychiatry | DIAZEPAM | 103 |
| CC128705 | Cardiology | NADOLOL | 13 |
| GH890091 | Family | HYDROCODONE | 331 |
| YY219322 | Psychiatry | ALPRAZOLAM | 28 |
| YY190561 | Psychiatry | GABAPENTIN | 86 |
| YY572610 | Psychiatry | MIRTAZAPINE | 191 |
| PL346720 | Family | OXYCODONE HCL | 87 |
| GZ129032 | Hemato-oncology | DIGOXIN | 54 |



##### Indexing review:

| Syntax                |   Meaning    |
|:--------------------------|:-------------------|
| `dataframe["col_name"]`    |  Returns the "col_name" column as a `Series` |
| `dataframe[["col_name_1", "col_name_2"]]`    |  Returns a `DataFrame` with columns "col_name_1" and "col_name_2" |
| `dataframe.loc["label"]`| Returns the row labeled "label" |
| `dataframe.loc["label", ["col_3", "col_5"]]`| Returns the row labeled "label", restricted to columns "col_3" and "col_5" |
| `dataframe.loc[["label_1", "label_2"], ["col_3", "col_5"]]`| Returns the rows labeled "label_1" and "label_2", restricted to columns "col_3" and "col_5" |
| `dataframe.iloc[23, [0, 1]]`| Returns the row at position 23, restricted to the columns at positions 0 and 1 |
| `dataframe.iloc[[1, 2], [0, 1]]`| Returns the rows at positions 1 and 2, restricted to the columns at positions 0 and 1 |


### Practical

- Start with a Jupyter Notebook

- Read the file `intro_pandas_practical.tsv` located in the data folder into a new `pandas` DataFrame called `spending_practical_df`.

  - Make sure you import the appropriate library first.
  
- Use an appropriate `pandas` method to display the first five lines of `spending_practical_df`


- How many rows and columns does the DataFrame contain?

- Write a statement that returns the `spending` Series (column) of  `spending_practical_df`.

- Write a single statement that returns the 1st, 5th and 10th lines of `spending_practical_df`.

- Write a single statement that returns the columns `specialty` and `spending` for the rows with labels 12 and 21 of `spending_practical_df`.

