In [4]:
from IPython.core.display import HTML
import pandas as pd 

pd.set_option('display.float_format', lambda x: '%.3f' % x)

def set_css_style(css_file_path):
    """
    Read the custom CSS file and load it into Jupyter.
    Pass the file path to the CSS file.
    """
    # Use a context manager so the file handle is closed after reading
    with open(css_file_path, "r") as css_file:
        styles = css_file.read()
    return HTML(styles)

set_css_style('styles/custom.css')


### About the Data
- We will be working with the data in the `spending_10k.tsv` file.
- The file contains a larger version (`10,000` entries) of the `spending.tsv` dataset explored in earlier modules.
- The data has the same column names.


In [5]:
spending_df = pd.read_table('data/spending_10k.tsv', index_col="unique_id", dtype={"doctor_id":"object"})
spending_df.head(10)

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
NX531425,1255626040,FAMILY PRACTICE,METFORMIN HCL,30,135.24
QG879256,1699761833,FAMILY PRACTICE,ALLOPURINOL,30,715.76
FW363228,1538148804,INTERNAL MEDICINE,LOSARTAN POTASSIUM,146,1056.47
WD733417,1730200619,PSYCHIATRY,OLANZAPINE,13,28226.97
XW149832,1023116894,FAMILY PRACTICE,PRAVASTATIN SODIUM,348,8199.48
QT485324,1952359671,FAMILY PRACTICE,HYDROCHLOROTHIAZIDE,57,247.01
NA293426,1841235223,FAMILY PRACTICE,SEVELAMER CARBONATE,11,4869.32
IF945618,1326095662,INTERNAL MEDICINE,FLUTICASONE/SALMETEROL,20,7832.46
PH384257,1821126830,HEMATOLOGY/ONCOLOGY,ZOLPIDEM TARTRATE,14,65.21
JY407340,1710986088,INTERNAL MEDICINE,MECLIZINE HCL,47,861.67


### Overview


- In this section, we will tackle the handy `groupby` method.


- We also cover the split-apply-combine scheme to:
  - Aggregate data in each group
  - Transform data in each group
  - Filter the data in each group
  - Thin the data in each group


### `groupby` and DataFrame Groups

- The `groupby()` method is used to group the data using the values in one or more columns.

- `groupby` takes as input one or more column labels, which it uses to group the data.

```python
df_1.groupby("X")
```


![](images/groupby.png)



### Identifying Groups from a GroupBy Object


```python
spending_df.groupby('specialty')
```

![](images/group_by_specialty.png)

- The `groupby` method returns an object of type `DataFrameGroupBy`.
  - This is not a `DataFrame` and therefore does not have the `DataFrame` methods discussed in previous modules.
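
As a minimal sketch, the grouped object can be inspected without running any aggregation. The `toy_df` frame is an illustrative stand-in built from a few rows shown in this notebook, not the full dataset:

```python
import pandas as pd

# Toy frame: a few rows copied from the outputs shown in this notebook.
toy_df = pd.DataFrame({
    "specialty": ["ADDICTION MEDICINE", "ALLERGY/IMMUNOLOGY", "ADDICTION MEDICINE"],
    "spending": [82.62, 52389.61, 817.88],
})

grouped = toy_df.groupby("specialty")

print(type(grouped).__name__)   # name of the returned type
print(grouped.ngroups)          # number of distinct groups
print(list(grouped.groups))     # the group keys
```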




### Extracting Groups From `DataFrameGroupBy` Object

- The `DataFrameGroupBy` object has a handy method called `get_group`, which returns all the entries of a specified group.
  - The entries of a group are the rows that match on the key used in the `groupby` operation.


- The example below returns all the entries of the `"ADDICTION MEDICINE"` group.
  - Those are the rows for which the `specialty` is `"ADDICTION MEDICINE"`.

In [6]:
spending_by_specialty = spending_df.groupby('specialty')

spending_by_specialty.get_group("ADDICTION MEDICINE")

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
VG585760,1801032297,ADDICTION MEDICINE,LAMOTRIGINE,11,82.62
GJ278932,1134139991,ADDICTION MEDICINE,BUSPIRONE HCL,49,817.88
TX420809,1801032297,ADDICTION MEDICINE,LORAZEPAM,14,19.56


### `groupby` and Group-Specific Processing


- Extracting a single group can just as easily be done with boolean indexing, for instance:

```python
spending_df[spending_df["specialty"] == "ADDICTION MEDICINE"]
```
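
A quick check on a toy frame suggests that the two approaches select the same rows. This is only a sketch: `toy_df` is an illustrative stand-in built from a few rows shown in this notebook, and the comparison is restricted to the `spending` column.

```python
import pandas as pd

# Toy frame: a few rows copied from the outputs shown in this notebook.
toy_df = pd.DataFrame(
    {
        "specialty": ["ADDICTION MEDICINE", "ALLERGY/IMMUNOLOGY", "ADDICTION MEDICINE",
                      "ALLERGY/IMMUNOLOGY", "ADDICTION MEDICINE"],
        "spending": [82.62, 52389.61, 817.88, 29153.71, 19.56],
    },
    index=pd.Index(["VG585760", "XY715196", "GJ278932", "DL492570", "TX420809"],
                   name="unique_id"),
)

# Same rows, obtained two ways
via_get_group = toy_df.groupby("specialty")["spending"].get_group("ADDICTION MEDICINE")
via_mask = toy_df.loc[toy_df["specialty"] == "ADDICTION MEDICINE", "spending"]

print(via_get_group.equals(via_mask))
```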

- An ideal use-case for `groupby` consists of applying operations to each group independently.

- For instance, to compute the total spending by `specialty`, we need to:
  - Split the data by `specialty`
  - Sum the total `spending` for each group
  - Combine the sums for each group into a new `DataFrame`




![](images/example_group_by_2.png)


### Split Apply Combine Paradigm

- `groupby()` is often applied in the context of the data processing paradigm called "split-apply-combine"

  - Split: you need to split the data into chunks defined using one or more columns
  - Apply: apply some operation on the chunks generated. 
    - Ex. count the number of rows in each chunk, average the values, etc.
  - Combine: combine the results of the applied operation into a new `DataFrame`
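
The three steps can be spelled out by hand and compared with the chained `groupby` call. This is a sketch on an illustrative `toy_df` built from a few rows shown in this notebook; the variable names are hypothetical.

```python
import pandas as pd

# Toy frame: a few rows copied from the outputs shown in this notebook.
toy_df = pd.DataFrame({
    "specialty": ["ADDICTION MEDICINE", "ALLERGY/IMMUNOLOGY", "ADDICTION MEDICINE",
                  "ALLERGY/IMMUNOLOGY", "ADDICTION MEDICINE"],
    "spending": [82.62, 52389.61, 817.88, 29153.71, 19.56],
})

# Split: one sub-frame per specialty
pieces = {key: group for key, group in toy_df.groupby("specialty")}

# Apply: reduce each piece to a single number
totals = {key: group["spending"].sum() for key, group in pieces.items()}

# Combine: assemble the per-group results into one object
manual = pd.Series(totals, name="spending").rename_axis("specialty")

# The chained call performs the same three steps
builtin = toy_df.groupby("specialty")["spending"].sum()

print(manual)
```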




### Split-Apply-Combine Example

![](images/split_apply_combine_example.png)

- The class of Split-Apply-Combine applied here is referred to as aggregation.
  - Aggregations refer to any operation that reduces data to a single value.

### The 3 (or 3 $\frac{1}{2}$) Classes of Operations on Groups


- There are 3 (or 3 $\frac{1}{2}$) classes of split-apply-combine operations that can be applied to grouped data.


1\.$~~$__Aggregations__ generate a single value for each group
  
2\.$~~$ __Transformations__ convert the data and generate a group of the same size as the original group.

3\.$~~$ __Filters__ retain or discard a group based on group-specific boolean computations.


3$\frac{1}{2}$\.$~$"__Thinning__" drops entries in a group based on some defined logic.
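
The four classes can be contrasted on a small made-up frame. This is a sketch only: the columns `X` and `Y` and all values are hypothetical, chosen to make the shape of each result easy to see.

```python
import pandas as pd

# Hypothetical data: group "a" sums to 6, group "b" sums to 30
toy_df = pd.DataFrame({"X": ["a", "a", "a", "b", "b"], "Y": [1, 2, 3, 10, 20]})
by_x = toy_df.groupby("X")["Y"]

# Aggregation: one value per group
aggregated = by_x.sum()

# Transformation: same number of entries as the input
transformed = by_x.transform(lambda y: y / y.sum())

# Filter: whole groups are kept or dropped
filtered = toy_df.groupby("X").filter(lambda g: g["Y"].sum() >= 10)

# Thinning: a subset of the entries of each group
thinned = by_x.nlargest(1)

print(len(aggregated), len(transformed), len(filtered), len(thinned))
```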


### Aggregations

- __Aggregations__ aggregate the data in each group, i.e., they reduce the data to a single value. This includes, for instance, computing group sums, means, maximums, minimums, _etc_.


- The diagram below illustrates grouping by values of column `X` and computing the average over the values of column `Y`.




![](./images/aggregate.png)



__Transformations__ modify the data in a way that is group-specific.

- Ex. for each specialty, we may want to transform the column `nb_beneficiaries` into the values small, large, or medium, depending on whether the `nb_beneficiaries` value is, respectively, much smaller than the mean, much larger than the mean, or close to the mean.


- The number of entries per group resulting from a transformation is the same as the number of entries in the group before the transformation.



- The diagram below shows an example where the data in column `Y` is transformed by dividing it by the group mean.

![](./images/transform.png)


__Filtering__ a group consists of dropping or retaining that group based on a group-specific computation that returns `True` or `False`.

- For instance, we can filter out specialties that don't have enough entries or for which the mean `spending` is below a certain threshold.
  - Groups are either retained or discarded. Groups that are retained are unmodified.


- The diagram below shows an example where groups are filtered if their sum for column `Y` is less than 10.

![](images/filter.png)

__Thinning__ the data consists of reducing the number of entries using a group-specific operation. Thinning can be useful to sub-sample the data at the group level, to return the top `n` entries in each group, etc.

  - As opposed to aggregating functions, thinning does not have to reduce the group to a single entry, although it could.

    
![](images/thin.png) 


##### Aggregating the Data Using `groupby`

- Aggregation is commonly used to compute summary statistics on each of the groups.

- Some of the most useful summary aggregation methods of `DataFrameGroupBy` objects are:


|Methods| Description|
|:----------|:----------------|
| `mean`, `median` | Compute the mean and the median in each group|
| `min`, `max` | Compute the min and max in each group|
| `size` | Computes the number of rows in each group|




##### Aggregating the Data Using `groupby` Cont'd 


- The functions above all use the same syntax, which leverages method chaining:
 
```python
spending_df.groupby('specialty').sum()
# or
spending_df.groupby('specialty').min()
```
- You can modify the behavior of the aggregation methods using their parameters.
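
As a sketch of the chained syntax and of one such parameter, the snippet below runs a few summary methods on an illustrative `toy_df` built from rows shown in this notebook. Passing `numeric_only=True` asks `mean` to skip the text column `medication`.

```python
import pandas as pd

# Toy frame: a few rows copied from the outputs shown in this notebook.
toy_df = pd.DataFrame({
    "specialty": ["ADDICTION MEDICINE", "ALLERGY/IMMUNOLOGY", "ADDICTION MEDICINE",
                  "ALLERGY/IMMUNOLOGY", "ADDICTION MEDICINE"],
    "medication": ["LAMOTRIGINE", "FLUTICASONE/SALMETEROL", "BUSPIRONE HCL",
                   "OMALIZUMAB", "LORAZEPAM"],
    "spending": [82.62, 52389.61, 817.88, 29153.71, 19.56],
})

by_specialty = toy_df.groupby("specialty")

sizes = by_specialty.size()                   # rows per group
means = by_specialty.mean(numeric_only=True)  # numeric columns only
maxima = by_specialty["spending"].max()       # largest spending per group

print(means)
```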


In [7]:
spending_df.groupby('specialty').sum(numeric_only= True).head()


specialty,nb_beneficiaries,spending
ADDICTION MEDICINE,74,920.06
ALLERGY/IMMUNOLOGY,1063,189174.06
ANESTHESIOLOGY,1673,142804.73
CARDIAC ELECTROPHYSIOLOGY,1041,225543.62
CARDIAC SURGERY,33,12432.92



### Applying Functions to Group Columns

- The method `agg` can be used where complex or custom aggregation logic is required.
  - `agg` takes a function (or a list of functions) and uses it (them) to aggregate each group.

- For example, we can define a function `sum_spending_CAD` that returns the sum of the spending in Canadian dollars.



```python
def sum_spending_CAD(x):
    return x.sum() * 1.26

spending_by_specialty['spending'].agg(sum_spending_CAD)
```
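
The function can be checked by hand on a single group. Using the three `ADDICTION MEDICINE` entries shown earlier, 82.62 + 817.88 + 19.56 = 920.06, and 920.06 × 1.26 ≈ 1159.276. The sketch below repeats that arithmetic; the 1.26 rate is simply the constant hard-coded in `sum_spending_CAD`.

```python
import pandas as pd

# Spending values of the three ADDICTION MEDICINE entries shown earlier
addiction_spending = pd.Series(
    [82.62, 817.88, 19.56],
    index=["VG585760", "GJ278932", "TX420809"],
    name="spending",
)

def sum_spending_CAD(x):
    # Sum the group, then convert using the fixed rate from the example
    return x.sum() * 1.26

total_cad = sum_spending_CAD(addiction_spending)
print(round(total_cad, 3))
```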


- Note that `agg` can also take a dictionary that maps column names to the functions used to aggregate them.

```python 
spending_by_specialty.agg({'nb_beneficiaries': sum,
                           'spending': max})
```

In [13]:
def sum_spending_CAD(x):
    return x.sum() * 1.26

spending_by_specialty[['nb_beneficiaries', 'spending']].agg(
                                    {  'nb_beneficiaries': sum, 
                                       'spending': sum_spending_CAD
                                    }).head()

specialty,nb_beneficiaries,spending
ADDICTION MEDICINE,74,1159.276
ALLERGY/IMMUNOLOGY,1063,238359.316
ANESTHESIOLOGY,1673,179933.96
CARDIAC ELECTROPHYSIOLOGY,1041,284184.961
CARDIAC SURGERY,33,15665.479


In [14]:
spending_by_specialty['spending'].agg([sum, min, max]).head()

specialty,sum,min,max
ADDICTION MEDICINE,920.06,19.56,817.88
ALLERGY/IMMUNOLOGY,189174.06,109.8,52389.61
ANESTHESIOLOGY,142804.73,35.33,34073.91
CARDIAC ELECTROPHYSIOLOGY,225543.62,69.85,89101.54
CARDIAC SURGERY,12432.92,442.91,11990.01


In [15]:
spending_by_specialty.agg({'nb_beneficiaries' :sum,
                           'spending' : max}).head()


specialty,nb_beneficiaries,spending
ADDICTION MEDICINE,74,817.88
ALLERGY/IMMUNOLOGY,1063,52389.61
ANESTHESIOLOGY,1673,34073.91
CARDIAC ELECTROPHYSIOLOGY,1041,89101.54
CARDIAC SURGERY,33,11990.01


##### Transforming the Data in `groupby`

- As opposed to aggregations, which reduce the data to a single value, transformations modify the data but don't change the shape of the groups.

- Transformations are useful for applying operations that are group-specific.
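
The difference in shape can be seen by running the same reduction both ways. This is a sketch on an illustrative `toy_df` built from a few rows shown in this notebook; passing the string `"sum"` to `transform` broadcasts each group total back onto the rows of that group.

```python
import pandas as pd

# Toy frame: a few rows copied from the outputs shown in this notebook.
toy_df = pd.DataFrame({
    "specialty": ["ADDICTION MEDICINE", "ALLERGY/IMMUNOLOGY", "ADDICTION MEDICINE",
                  "ALLERGY/IMMUNOLOGY", "ADDICTION MEDICINE"],
    "spending": [82.62, 52389.61, 817.88, 29153.71, 19.56],
})

by_specialty = toy_df.groupby("specialty")["spending"]

group_totals = by_specialty.sum()           # aggregation: one entry per group
row_totals = by_specialty.transform("sum")  # transformation: one entry per row

print(len(group_totals), len(row_totals))
```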



##### Transforming the Data in `groupby` Cont'd


- The example below computes the percent contribution of each entry to each specialty by applying a transformation that normalizes the entry's spending over the total spending in that specialty. 

![](images/transform_spending.png)


In [16]:
spending_by_specialty["spending"].get_group("ADDICTION MEDICINE")

unique_id
VG585760    82.620
GJ278932   817.880
TX420809    19.560
Name: spending, dtype: float64

### Applying a Transformation

- Applying a transformation is done using the method called `transform`.


- The method `transform` takes as input a function, which it calls on each group of the `DataFrameGroupBy` object.

In [17]:
def my_function(x):
    return (x   / x.sum() ) * 100
    


# Below, we transform the spending into a percentage and save it in a new column
# called `spending_pct`
spending_df["spending_pct"] = spending_by_specialty['spending'].transform(my_function)


spending_df[spending_df['specialty'] == "ADDICTION MEDICINE"]


unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending,spending_pct
VG585760,1801032297,ADDICTION MEDICINE,LAMOTRIGINE,11,82.62,8.98
GJ278932,1134139991,ADDICTION MEDICINE,BUSPIRONE HCL,49,817.88,88.894
TX420809,1801032297,ADDICTION MEDICINE,LORAZEPAM,14,19.56,2.126


In [18]:
# The code below allows us to sort and view the most important drugs
# by spending in each specialty. Note, however, that drugs can still be
# duplicated (ex. ALBUTEROL SULFATE), since the same drug can occur in more than one entry
spending_df.sort_values(['specialty', 'spending_pct'], ascending=[True, False]).head()

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending,spending_pct
GJ278932,1134139991,ADDICTION MEDICINE,BUSPIRONE HCL,49,817.88,88.894
VG585760,1801032297,ADDICTION MEDICINE,LAMOTRIGINE,11,82.62,8.98
TX420809,1801032297,ADDICTION MEDICINE,LORAZEPAM,14,19.56,2.126
XY715196,1376691626,ALLERGY/IMMUNOLOGY,FLUTICASONE/SALMETEROL,102,52389.61,27.694
DL492570,1962588053,ALLERGY/IMMUNOLOGY,OMALIZUMAB,12,29153.71,15.411


#### More Complex Transformations

- As noted above, drugs are still duplicated across `doctor_id` values within the same `specialty`.

- To see the percent spending by `medication`, we need to group on both the `specialty` and the `medication` columns and sum the `spending_pct` computed previously.

```python
medication_spendng_pct =  spending_df.groupby(["specialty", "medication"])["spending_pct"].sum()
```



#### More Complex Transformations Cont'd

- Since we are grouping on two columns, the index of the resulting `Series` has two levels (`specialty` and `medication`).


- We need to move these index levels back into regular columns using the method `reset_index` before we can sort on `specialty` and `spending_pct` as we did above.


```python
medication_spendng_pct = spending_df.groupby(["specialty", "medication"])["spending_pct"].sum().reset_index()


medication_spendng_pct.sort_values(["specialty", "spending_pct"], ascending=[True, False])
```
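
The effect of `reset_index` is easier to see on a small made-up frame. In this sketch the specialties `A` and `B`, the medications `x` and `y`, and all percentages are hypothetical.

```python
import pandas as pd

# Hypothetical data: medication "x" appears twice within specialty "A"
toy_df = pd.DataFrame({
    "specialty": ["A", "A", "A", "B"],
    "medication": ["x", "x", "y", "x"],
    "spending_pct": [40.0, 25.0, 35.0, 100.0],
})

# Grouping on two columns gives a Series with a two-level index
summed = toy_df.groupby(["specialty", "medication"])["spending_pct"].sum()
print(summed.index.nlevels)

# reset_index turns both index levels back into regular columns
flat = summed.reset_index()
print(list(flat.columns))

# Sorting on those columns is now possible
ranked = flat.sort_values(["specialty", "spending_pct"], ascending=[True, False])
print(ranked)
```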

In [21]:
medication_spendng_pct = spending_df.groupby(["specialty", "medication"])["spending_pct"].sum().reset_index()

medication_spendng_pct.sort_values(["specialty", "spending_pct"], ascending=[True, False]).head(5)


Unnamed: 0,specialty,medication,spending_pct
0,ADDICTION MEDICINE,BUSPIRONE HCL,88.894
1,ADDICTION MEDICINE,LAMOTRIGINE,8.98
2,ADDICTION MEDICINE,LORAZEPAM,2.126
12,ALLERGY/IMMUNOLOGY,FLUTICASONE/SALMETEROL,41.899
16,ALLERGY/IMMUNOLOGY,MOMETASONE FUROATE,18.141


##### Filtering Groups

- Filtering a group is done using the method called `filter`.


- The method `filter` takes as input a function, which it calls on each group of the `DataFrameGroupBy` object.
  - The function must return either `True` or `False`.
  - Groups for which the function returns `False` are dropped.


- The resulting `DataFrame` has its entries in the same order as the original `DataFrame`.
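
A small made-up example illustrates both points: groups are dropped as a whole, and the surviving rows keep their original order. The specialties `a`, `b`, `c`, the values, and the helper name `total_at_least_10` are hypothetical.

```python
import pandas as pd

# Hypothetical data: totals are a=13, b=30, c=5
toy_df = pd.DataFrame({
    "specialty": ["b", "a", "b", "c", "a"],
    "spending": [10, 6, 20, 5, 7],
})

def total_at_least_10(group):
    # Must return a single True or False for the whole group
    return group["spending"].sum() >= 10

kept = toy_df.groupby("specialty").filter(total_at_least_10)

# Rows of group "c" are gone; the others appear in their original order
print(kept)
```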
 


#### Unique


- `unique` is a very useful `Series` method.
  - The method `unique` returns the unique values in a `Series`, in order of first appearance.


In [43]:
spending_df['specialty'].unique()

array(['FAMILY PRACTICE', 'INTERNAL MEDICINE', 'PSYCHIATRY',
       'HEMATOLOGY/ONCOLOGY', 'OPHTHALMOLOGY', 'NEUROLOGY',
       'NURSE PRACTITIONER', 'NEPHROLOGY', 'DENTIST', 'SPECIALIST',
       'GENERAL PRACTICE', 'INTERVENTIONAL CARDIOLOGY',
       'OBSTETRICS/GYNECOLOGY', 'PHYSICIAN ASSISTANT', 'CARDIOLOGY',
       'ENDOCRINOLOGY', 'RHEUMATOLOGY', 'OPTOMETRY',
       'STUDENT IN AN ORGANIZED HEALTH CARE EDUCATION/TRAINING PROGRAM',
       'PULMONARY DISEASE', 'DERMATOLOGY',
       'INTERVENTIONAL PAIN MANAGEMENT', 'PSYCHIATRY & NEUROLOGY',
       'GASTROENTEROLOGY', 'GERIATRIC MEDICINE', 'UROLOGY',
       'MEDICAL ONCOLOGY', 'PHYSICAL MEDICINE AND REHABILITATION',
       'EMERGENCY MEDICINE', 'ORTHOPEDIC SURGERY',
       'CARDIAC ELECTROPHYSIOLOGY', 'OTOLARYNGOLOGY', 'ALLERGY/IMMUNOLOGY',
       'PODIATRY', 'CERTIFIED CLINICAL NURSE SPECIALIST',
       'INFECTIOUS DISEASE', 'UNKNOWN PHYSICIAN SPECIALTY CODE',
       'ANESTHESIOLOGY', 'PEDIATRIC MEDICINE', 'PAIN MANAGEMENT',
       

In [44]:

def filter_on_spending(x):
    # Keep a specialty only if its total spending exceeds 50,000
    return x['spending'].sum() > 50000

high_spending_df = spending_df[["specialty", 'spending']].groupby('specialty').filter(filter_on_spending)

# Specialties that remain after filtering
high_spending_df['specialty'].unique() 



array(['FAMILY PRACTICE', 'INTERNAL MEDICINE', 'PSYCHIATRY',
       'HEMATOLOGY/ONCOLOGY', 'OPHTHALMOLOGY', 'NEUROLOGY',
       'NURSE PRACTITIONER', 'NEPHROLOGY', 'GENERAL PRACTICE',
       'INTERVENTIONAL CARDIOLOGY', 'OBSTETRICS/GYNECOLOGY',
       'PHYSICIAN ASSISTANT', 'CARDIOLOGY', 'ENDOCRINOLOGY',
       'RHEUMATOLOGY', 'OPTOMETRY', 'PULMONARY DISEASE', 'DERMATOLOGY',
       'INTERVENTIONAL PAIN MANAGEMENT', 'PSYCHIATRY & NEUROLOGY',
       'GASTROENTEROLOGY', 'GERIATRIC MEDICINE', 'UROLOGY',
       'MEDICAL ONCOLOGY', 'PHYSICAL MEDICINE AND REHABILITATION',
       'EMERGENCY MEDICINE', 'ORTHOPEDIC SURGERY',
       'CARDIAC ELECTROPHYSIOLOGY', 'ALLERGY/IMMUNOLOGY', 'PODIATRY',
       'CERTIFIED CLINICAL NURSE SPECIALIST', 'INFECTIOUS DISEASE',
       'ANESTHESIOLOGY', 'PEDIATRIC MEDICINE', 'PAIN MANAGEMENT',
       'HEMATOLOGY', 'GENERAL SURGERY', 'DIAGNOSTIC RADIOLOGY'], dtype=object)

##### Thinning the Data

- Thinning the data consists of reducing the number of entries using a group operation.
- As opposed to aggregating functions, thinning does not have to reduce the group to a single entry, although it could.
- Also, a thinning function does not have to return `True` or `False`, nor does it have to return a group of the same size as the original.

- Thinning can be used, for instance, to return only the top 3 entries in each category, or to randomly sample a small subset from each category.



### Thinning Methods and `apply`

- `pandas` offers a few methods for thinning the data.
  - Ex. `nlargest`, `nsmallest`, etc.
    
    
- However, thinning is most often carried out using a method called `apply`.



- The method `apply` takes as input a function, which it calls on each group of the `DataFrameGroupBy` object.
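
As a sketch of thinning with `apply`, the hypothetical helper `top_two` below keeps the two largest `spending` values of each group. On the illustrative `toy_df` (a few rows shown in this notebook) it gives the same values as the built-in `nlargest(2)`.

```python
import pandas as pd

# Toy frame: a few rows copied from the outputs shown in this notebook.
toy_df = pd.DataFrame(
    {
        "specialty": ["ADDICTION MEDICINE", "ALLERGY/IMMUNOLOGY", "ADDICTION MEDICINE",
                      "ALLERGY/IMMUNOLOGY", "ADDICTION MEDICINE"],
        "spending": [82.62, 52389.61, 817.88, 29153.71, 19.56],
    },
    index=pd.Index(["VG585760", "XY715196", "GJ278932", "DL492570", "TX420809"],
                   name="unique_id"),
)

def top_two(spending):
    # `spending` is the spending Series of a single group
    return spending.sort_values(ascending=False).head(2)

by_specialty = toy_df.groupby("specialty")["spending"]

thinned = by_specialty.apply(top_two)   # custom thinning logic
builtin = by_specialty.nlargest(2)      # built-in equivalent for this case

print(thinned)
```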


In [54]:
spending_by_specialty['spending'].nlargest(2)

specialty                  unique_id
ADDICTION MEDICINE         GJ278932      817.880
                           VG585760       82.620
ALLERGY/IMMUNOLOGY         XY715196    52389.610
                           DL492570    29153.710
ANESTHESIOLOGY             WD732008    34073.910
                           ZJ839161    33127.750
CARDIAC ELECTROPHYSIOLOGY  XZ523373    89101.540
                           RR251593    59935.970
CARDIAC SURGERY            YC312951    11990.010
                           FK638917      442.910
Name: spending, dtype: float64

In [24]:
spending_by_specialty['spending'].nsmallest(3)

specialty                  unique_id
ADDICTION MEDICINE         TX420809     19.560
                           VG585760     82.620
                           GJ278932    817.880
ALLERGY/IMMUNOLOGY         HQ120242    109.800
                           HN843226    173.050
                           LE617956    190.120
ANESTHESIOLOGY             IS925171     35.330
                           XZ351859     38.960
                           HY359879     56.860
CARDIAC ELECTROPHYSIOLOGY  XR445715     69.850
Name: spending, dtype: float64

### Sampling a DataFrame

- Another interesting use case for thinning is to subsample a `DataFrame`.

- Sampling within each group is necessary to maintain the group composition of the data.

- This can be achieved using the `DataFrame` method called `sample`.

  - Two parameters are relevant in this scenario: `n`, the number of samples to randomly select, or `frac`, the fraction of the data to return.
  - We are interested in the latter.

```python
spending_df.sample(frac=0.001)
```
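
The relation between `frac`, `n`, and the number of returned rows can be sketched on a stand-in frame. Here `toy_df` is a hypothetical 10,000-row frame, and `random_state` is an optional argument used only to make the draw reproducible.

```python
import pandas as pd

# Hypothetical stand-in with the same number of rows as spending_df
toy_df = pd.DataFrame({"spending": range(10_000)})

by_frac = toy_df.sample(frac=0.01, random_state=0)  # 1% of the rows
by_n = toy_df.sample(n=10, random_state=0)          # exactly 10 rows

print(len(by_frac), len(by_n))  # 100 10
```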


In [25]:
# return 1% of the data, i.e., 100 entries (only the first 5 are displayed)
spending_df.sample(frac=0.01).head()


unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending,spending_pct
UC393942,1942272653,FAMILY PRACTICE,CLONAZEPAM,25,129.94,0.002
IA801487,1962438861,INTERNAL MEDICINE,LIDOCAINE,16,4473.19,0.046
PR105009,1891765079,PHYSICAL MEDICINE AND REHABILITATION,GABAPENTIN,101,2071.38,1.332
AK980177,1245284090,INTERNAL MEDICINE,AZITHROMYCIN,58,367.42,0.004
OM839383,1285694505,INTERNAL MEDICINE,"INSULIN GLARGINE,HUM.REC.ANLOG",54,17829.88,0.183


In [None]:
# return exactly 10 randomly selected entries
spending_df.sample(n=10) 

In [27]:
# We sample only 10% of the data in each category

def sample_10p(x):
    return x.sample(frac=0.1)
    
spending_by_specialty.apply(sample_10p).head()

specialty,unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending,spending_pct
ALLERGY/IMMUNOLOGY,OR379110,1861499741,ALLERGY/IMMUNOLOGY,ALBUTEROL SULFATE,63,4437.24,2.346
ALLERGY/IMMUNOLOGY,HQ120242,1811919988,ALLERGY/IMMUNOLOGY,IRBESARTAN,12,109.8,0.058
ANESTHESIOLOGY,GK755216,1427168574,ANESTHESIOLOGY,LEVOTHYROXINE SODIUM,12,179.88,0.126
ANESTHESIOLOGY,EQ176933,1760454805,ANESTHESIOLOGY,SUMATRIPTAN SUCCINATE,25,581.18,0.407
ANESTHESIOLOGY,WD732008,1538235213,ANESTHESIOLOGY,OXYCODONE HCL,56,34073.91,23.86


In [76]:
print(spending_by_specialty.get_group("CARDIAC ELECTROPHYSIOLOGY").shape)

print(spending_by_specialty.get_group("ANESTHESIOLOGY").shape)

print(spending_by_specialty.get_group("CARDIOLOGY").shape)


(20, 6)
(30, 6)
(445, 6)


In [29]:
subsampled_spending_df = spending_by_specialty.apply(sample_10p)


print(subsampled_spending_df.loc["CARDIAC ELECTROPHYSIOLOGY"].shape)

print(subsampled_spending_df.loc["ANESTHESIOLOGY"].shape)

print(subsampled_spending_df.loc["CARDIOLOGY"].shape)




(2, 6)
(3, 6)
(44, 6)
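
Each group shrinks to roughly a tenth of its size, so the relative weight of the specialties is preserved. The sketch below reproduces this on made-up data, using the `sample` method of the grouped object as an alternative to `apply(sample_10p)`. It assumes pandas 1.1 or later, and the group labels and sizes are hypothetical.

```python
import pandas as pd

# Hypothetical groups of 20, 30, and 450 rows
toy_df = pd.DataFrame({
    "specialty": ["A"] * 20 + ["B"] * 30 + ["C"] * 450,
    "spending": range(500),
})

# Draw 10% of the rows within each group
sampled = toy_df.groupby("specialty").sample(frac=0.1, random_state=0)

before = toy_df.groupby("specialty").size()
after = sampled.groupby("specialty").size()

print(pd.DataFrame({"before": before, "after": after}))
```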


In [30]:
subsampled_spending_df.head()

specialty,unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending,spending_pct
ALLERGY/IMMUNOLOGY,KL206491,1124085436,ALLERGY/IMMUNOLOGY,AZELASTINE HCL,15,1379.06,0.729
ALLERGY/IMMUNOLOGY,OR379110,1861499741,ALLERGY/IMMUNOLOGY,ALBUTEROL SULFATE,63,4437.24,2.346
ANESTHESIOLOGY,WD732008,1538235213,ANESTHESIOLOGY,OXYCODONE HCL,56,34073.91,23.86
ANESTHESIOLOGY,OQ632520,1558305185,ANESTHESIOLOGY,DULOXETINE HCL,59,6072.36,4.252
ANESTHESIOLOGY,QS118214,1871688952,ANESTHESIOLOGY,TRAMADOL HCL,15,148.2,0.104



### Practical

- Start with a new Jupyter Notebook.


- Read the file `data/grouping_practical.tsv` located in the data folder into a new `pandas` DataFrame called `spending_practical_df`.

  - Make sure you import the appropriate module first.


- Filter out any specialties that have less than 200 records or for which the total number of beneficiaries is less than 15,00.

  - Print your results as a sorted `DataFrame`. The sort order should include specialty (Ascending), nb_beneficiaries (descending), spending (descending).

  - How many specialties pass this filtering?


- We covered the code below in this module. Do you remember what it does?

```python 

def my_function(x):
    return (x   / x.sum() ) * 100
    
# Group the practical DataFrame itself (not the module's spending_df)
spending_practical_df["spending_pct"] = spending_practical_df.groupby('specialty')['spending'].transform(my_function)
spending_practical_df.head()

medication_spending_pct = spending_practical_df.groupby(["specialty", "medication"])["spending_pct"].sum().reset_index()
```

- Copy and paste the code into a cell. Run the cell and print the first five rows of `medication_spending_pct` `DataFrame` using the method `head`. 

- Group `medication_spending_pct` on specialty and discard the specialties for which the sum of the `spending_pct` of the top 2 medications is < 80%. For instance, the sum of the `spending_pct` for the highest 2 entries for `"ADDICTION MEDICINE"` is 88.89% + 8.98% = 97.87%. Therefore, we should retain this specialty. However, the sum of the top 2 medications in `"ALLERGY/IMMUNOLOGY"` is 41.90% + 18.14% = 60.04%; therefore, we should discard this specialty.

- Print only the top two entries in each group. For instance, in `"ADDICTION MEDICINE"`, you should only print the following two lines:

```bash
0    ADDICTION MEDICINE    BUSPIRONE HCL      88.894203
1    ADDICTION MEDICINE    LAMOTRIGINE         8.979849
```




