In [27]:
from IPython.core.display import HTML
import pandas as pd 

pd.set_option('display.float_format', lambda x: '%.3f' % x)

def set_css_style(css_file_path):
    """
    Read the custom CSS file and load it into Jupyter.
    Pass the file path to the CSS file.
    """
    with open(css_file_path, "r") as css_file:
        styles = css_file.read()
    return HTML(styles)

set_css_style('styles/custom.css')


# Operations on multiple `DataFrame` and Handling Missing Values

### Specifying Columns' Data Types

- Before proceeding, we need to read in the processed `DataFrame` we generated during the previous session.
  - We specify the type of the `doctor_id` column while reading the file so that it is kept as a string (`object`) rather than being inferred as an integer.

- The other column types should be inferred appropriately by `pandas` from the data.

```python
import pandas as pd
spending_df = pd.read_csv(  "data/spending_correc_data_t.csv", 
                            index_col="unique_id", 
                            dtype={'doctor_id': "object"})
```
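
To see why the `dtype` argument matters, the sketch below reads a tiny in-memory CSV twice. The sample data and the names `sample_csv`, `inferred_df` and `typed_df` are made up for illustration; only `pd.read_csv` with its `index_col` and `dtype` parameters comes from the text above.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real file.
sample_csv = (
    "unique_id,doctor_id,spending\n"
    "AA000001,0012345678,10.5\n"
    "AA000002,1952310666,20.0\n"
)

# Without dtype, pandas infers doctor_id as an integer and the leading zeros are lost.
inferred_df = pd.read_csv(io.StringIO(sample_csv), index_col="unique_id")

# With dtype, doctor_id is kept as text, exactly as written in the file.
typed_df = pd.read_csv(
    io.StringIO(sample_csv),
    index_col="unique_id",
    dtype={"doctor_id": "object"},
)

print(inferred_df["doctor_id"].tolist())
print(typed_df["doctor_id"].tolist())
```

Identifiers are labels rather than quantities, so keeping them as text also prevents accidental arithmetic on them.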

In [74]:
import pandas as pd
spending_df = pd.read_csv("data/spending_correc_data_t.csv", index_col="unique_id", dtype={'doctor_id': "object"})
spending_df.dtypes

doctor_id            object
specialty            object
medication           object
nb_beneficiaries      int64
spending            float64
dtype: object

In [29]:
spending_df.iloc[6:]

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
AV967778,1952310666,Psychiatry,DIAZEPAM,103,662.87
AB789982,1952310666,Psychiatry,CLONAZEPAM,226,1848.88
CC128705,1298765423,Cardiology,NADOLOL,13,


### Missing Values (NaN)

- Recall that `shape` returns the number of rows in a `Series`, and the number of rows and the number of columns in a `DataFrame`


- The number of rows can be different from the number of values. For instance, the code below shows that the `spending` `Series` (column) has 9 rows.


```python
>>> spending_df["spending"].shape
(9,)
```
- The method `count()`, which counts the number of non-missing values in a `Series`, shows that there are only 8 values.

```python
>>> spending_df["spending"].count()
8
```

- The discrepancy arises because the last row of `spending_df` contains a missing value.

  - Missing values appear as __`NaN`__ (not a number) in the DataFrame.
    - Depending on the context, they may also be displayed as `nan` or `NAN`.
  - Missing values are not counted as values in `pandas`, even if they occupy a cell.
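
The difference between rows and values can be reproduced on a small made-up `Series`. The name `toy_spending` and its values are hypothetical; `shape`, `count()` and `isnull()` are the same attributes and methods used in this section.

```python
import numpy as np
import pandas as pd

# Made-up values; the last entry is missing.
toy_spending = pd.Series([662.87, 1848.88, np.nan], index=["a", "b", "c"])

n_rows = toy_spending.shape[0]            # every row, including the NaN
n_values = toy_spending.count()           # only the non-missing values
n_missing = toy_spending.isnull().sum()   # True counts as 1 when summed

print(n_rows, n_values, n_missing)
```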



### Other Methods For Summarizing Data

- In addition to `describe` and `count`, `DataFrames` have many other methods to summarize their data.


- Methods for computing summary statistics in `pandas` can automatically:

  - Handle missing values.
  - Restrict the computation to columns of a compatible type.


- Therefore, computing the mean of `spending_df` will automatically:
  - Exclude the `NaN` value observed in the `spending` column
  - Only execute the operation on the columns for which the computation does not generate an error
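
As a minimal sketch of both behaviors, the made-up frame below has one text column and one numeric column with a missing value. The names `toy_df`, `label` and `amount` are hypothetical; `skipna` and `numeric_only` are existing parameters of `mean`.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "label": ["x", "y", "z"],
    "amount": [10.0, np.nan, 30.0],
})

# By default the NaN is skipped: (10 + 30) / 2
mean_skipping_nan = toy_df["amount"].mean()

# With skipna=False the NaN propagates to the result.
mean_with_nan = toy_df["amount"].mean(skipna=False)

# numeric_only=True leaves the text column out of the computation.
numeric_means = toy_df.mean(numeric_only=True)

print(mean_skipping_nan, mean_with_nan)
print(numeric_means)
```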


In [30]:
# numeric_only=True is needed to skip `doctor_id`: it is stored as strings, and
# averaging it would either fail or produce a meaningless value.

spending_df.mean(numeric_only=True)

# Alternatively, select the numeric columns explicitly
spending_df[["nb_beneficiaries", "spending"]].mean()


nb_beneficiaries    124.333
spending           4321.735
dtype: float64

Similarly, you can compute many other summary statistics on the table, such as the minimum (`min`), maximum (`max`), variance (`var`), standard deviation (`std`), etc. See the table below for other useful summary statistics.

##### Other Useful Summary Statistics

| Method                |   Description     |
|:--------------------------|:--------------|
| `min`, `max`  | Computes the numeric (for numeric values) or alphanumeric (for object values) minimum and maximum of a `Series` or of each column of a `DataFrame` |
| `idxmin`, `idxmax`  | Returns the index label of the minimum and of the maximum for numeric columns in a `Series` or `DataFrame` |
| `sum`, `mean`, `std`, `var`   |  Computes the sum, mean, standard deviation and variance of a `Series` or of each column of a `DataFrame` |
| `count` |  Returns the number of non-`NaN` values in a `Series` or in each column of a `DataFrame` |
| `value_counts` |  Returns the frequency of each value in a `Series` |
| `describe` | Computes a set of summary statistics for a `Series` or for each column of a `DataFrame` |
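
A few of these methods can be tried on a small made-up frame. The data and the name `toy_df` are hypothetical; the values loosely echo the spending table used in this session.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "specialty": ["Family", "Psychiatry", "Psychiatry"],
        "spending": [8511.14, 662.87, 1848.88],
    },
    index=["r1", "r2", "r3"],
)

largest = toy_df["spending"].max()             # the largest value itself
largest_label = toy_df["spending"].idxmax()    # the index label where it occurs
specialty_freq = toy_df["specialty"].value_counts()

print(largest, largest_label)
print(specialty_freq)
```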



In [31]:
spending_df["specialty"].describe()

count              9
unique             4
top       Psychiatry
freq               5
Name: specialty, dtype: object

In [32]:
spending_df.count()

doctor_id           9
specialty           9
medication          9
nb_beneficiaries    9
spending            8
dtype: int64

In [33]:
spending_df["specialty"].value_counts()

Psychiatry         5
Family             2
Hemato-oncology    1
Cardiology         1
Name: specialty, dtype: int64

### Arithmetic  Operations and Data Alignment - 1

- Executing an arithmetic operation between two `Series` first aligns them by their matching indices.
  - A new index is created from the union of the indexes of both `Series`.

- Positions whose index label is present in only one of the `Series` are filled with missing values (`NaN`) in the result.

```python
df_1["AA"] + df_2["AA"]
```

![](images/alignment_arithmetic_col.png)
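
Alignment can be sketched with two small made-up `Series` whose indexes only partly overlap. The names `s1` and `s2` are hypothetical. The last line uses `Series.add` with `fill_value`, an existing alternative to `+` that treats a label missing from one operand as 0 instead of producing `NaN`.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# The result is indexed by the union a, b, c, d.
# "a" and "d" exist in only one operand, so they become NaN.
total = s1 + s2

# Same operation, but a missing label is treated as 0 before adding.
total_filled = s1.add(s2, fill_value=0)

print(total)
print(total_filled)
```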

### Arithmetic  Operations and Data Alignment - 2

- The logic is identical when dealing with row Series

```python
df_1["A"] + df_2["D"]
```
![](images/alignment_arithmetic_row.png)

### Vectorization

- The arithmetic operations discussed in the Python intro (`+`, `-`, `*`, `/`, `**`) are applied implicitly in a pairwise manner between the operands.
    - I.e., we don't need to write a `for` loop that applies the operation to each pair of elements.
  
- This is referred to as vectorization and works seamlessly between `Series` with matching indexes.
  
- For example, to compute the average spending per beneficiary, we can simply divide the `spending` column by the `nb_beneficiaries` column. 


```python
spending_df["spending"] / spending_df["nb_beneficiaries"]
```
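
For comparison, here is a sketch of the explicit loop that vectorization replaces, run on a made-up three-row frame (`toy_df` and its values are hypothetical). Both versions produce the same numbers.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "spending": [100.0, 90.0, 50.0],
    "nb_beneficiaries": [10, 30, 5],
})

# Explicit loop: divide each pair of elements one at a time.
per_beneficiary_loop = []
for spending, n in zip(toy_df["spending"], toy_df["nb_beneficiaries"]):
    per_beneficiary_loop.append(spending / n)

# Vectorized: a single expression applied to every pair of elements.
per_beneficiary = toy_df["spending"] / toy_df["nb_beneficiaries"]

print(per_beneficiary_loop)
print(per_beneficiary.tolist())
```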



### Alignment with a Scalar Value

- On the other hand, arithmetic operations between `Series` and a scalar require expanding the scalar into a `Series` of the same dimension as the other operand.

  - This is called _broadcasting_

```python
spending_df["spending"] / 1.2
```

![](images/alignment.png)

### Comparison Operations 

- Comparison operations (`<`, `>`, `==`, `>=`, `<=`, `!=`) are applied the same way as arithmetic operations.

- They can only be applied to identically-labeled `Series` objects or to a `Series` and a scalar.
  - The scalar is first broadcast to a compatible shape with identical labels (similar to the scalar division above).


In [35]:
spending_df["nb_beneficiaries"] > 150 

unique_id
YY572610     True
YY219322    False
YY190561    False
PL346720    False
GZ129032    False
GH890091     True
AV967778    False
AB789982     True
CC128705    False
Name: nb_beneficiaries, dtype: bool

### Comparison Operations  and Indexing

- Comparison operators are ideal for querying and subsetting a `DataFrame` since, as seen before, we can subset a `Series` using a list (or a `Series`) of `Boolean`s. 
Since the output of a `Series` comparison is a `Series` of `Boolean`s, we can subset a `DataFrame` using the output of comparison operators.

- Ex. to select only the rows where `nb_beneficiaries` > 150, we can write the following:

```python
rows_gt_150 =  spending_df["nb_beneficiaries"] > 150 
spending_df[rows_gt_150]
```
- It's common to bypass the need for the intermediate variable and simply write:

```python
spending_df[  spending_df["nb_beneficiaries"] > 150  ]
```



In [115]:
true_false_rows =  spending_df["nb_beneficiaries"] > 150 
print(true_false_rows)
spending_df[true_false_rows]


unique_id
YY572610     True
YY219322    False
YY190561    False
PL346720    False
GZ129032    False
GH890091     True
AV967778    False
AB789982     True
CC128705    False
Name: nb_beneficiaries, dtype: bool


unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
YY572610,1548247315,Psychiatry,MIRTAZAPINE,191,3131.96
GH890091,1346358827,Family,HYDROCODONE,331,8511.14
AB789982,1952310666,Psychiatry,CLONAZEPAM,226,1848.88


### Subset and Compatible Shapes

- The outcome of the code above is, in fact, equivalent to:

```python
    spending_df[[True, False, False, False, False, True, False, True, False]] 
```
![](images/filter_dataframe.png )

- The above returns an error if the `Boolean` list (or `Series`) does not have the same length as the data it is indexing.


![](images/bool_indexing_error.png)
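
The length requirement can be checked on a small made-up frame. This sketch assumes the mismatch surfaces as a `ValueError`, which is what recent `pandas` releases raise for a Boolean list of the wrong length. The names `toy_df`, `good_mask` and `bad_mask` are hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "b", "c"])

# One Boolean per row: rows "a" and "c" are kept.
good_mask = [True, False, True]
selected = toy_df[good_mask]

# Only two Booleans for three rows: pandas refuses to guess.
bad_mask = [True, False]
try:
    toy_df[bad_mask]
    error_raised = False
except ValueError as err:
    error_raised = True
    print("Indexing failed:", err)

print(selected)
```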


In [118]:
spending_df[[True, False, False, False, False, True, False, True, False]]


unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
YY572610,1548247315,Psychiatry,MIRTAZAPINE,191,3131.96
GH890091,1346358827,Family,HYDROCODONE,331,8511.14
AB789982,1952310666,Psychiatry,CLONAZEPAM,226,1848.88


### Composing Conditional Expressions

- `pandas` uses a different set of operators for joining conditional expressions than those built into Python. 

- Python's `and` is replaced with `&`, and `or` is replaced with `|`.
- For instance, to filter based on `nb_beneficiaries` and `spending`, we write:

    
```python
# The outer parentheses let the expression span several lines
true_false_rows = (
    (spending_df["nb_beneficiaries"] > 150)
    &
    (spending_df["spending"] < 8000.00)
)
                   
print(true_false_rows)

spending_df[true_false_rows]
```
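
The same pattern works with `|`. The sketch below uses a made-up four-row frame (`toy_df` is hypothetical) to contrast the two operators. Each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `>` and `<` in Python, so omitting the parentheses changes how the expression is grouped.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "nb_beneficiaries": [191, 28, 331, 87],
        "spending": [3131.96, 1964.49, 8511.14, 12881.04],
    },
    index=["r1", "r2", "r3", "r4"],
)

# True only where BOTH conditions hold.
both = (toy_df["nb_beneficiaries"] > 150) & (toy_df["spending"] < 8000.00)

# True where AT LEAST ONE condition holds.
either = (toy_df["nb_beneficiaries"] > 150) | (toy_df["spending"] < 8000.00)

print(both.tolist())
print(either.tolist())
```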


In [123]:
true_false_rows =  (spending_df["nb_beneficiaries"] > 150) &  (spending_df["spending"]  < 8000.00)
print(true_false_rows)
spending_df[true_false_rows]

unique_id
YY572610     True
YY219322    False
YY190561    False
PL346720    False
GZ129032    False
GH890091    False
AV967778    False
AB789982     True
CC128705    False
dtype: bool


unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
YY572610,1548247315,Psychiatry,MIRTAZAPINE,191,3131.96
AB789982,1952310666,Psychiatry,CLONAZEPAM,226,1848.88


### Negating Booleans - 1

- The `pandas` equivalent of Python's `not` operator is `~`.
- The `~` operator (read "not") takes one or more Booleans and computes their complement (inverse).
- For instance, if
```python
>>> true_false_rows
0     True
1    False
2    False
3    False
....
Name: nb_beneficiaries, dtype: bool
```
then
```python
>>> ~ true_false_rows
0    False
1     True
2     True
3     True
```

### Negating Booleans - 2

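Therefore, to negate a condition, we can simply precede it with `~`. Note that `~` binds more tightly than `&`, so the expression below negates only the first condition; to negate the whole expression, wrap it in an extra pair of parentheses before applying `~`.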
```python
true_false_rows =  ~(spending_df["nb_beneficiaries"] > 150) &  (spending_df["spending"]  < 8000.00)
print(true_false_rows)
spending_df[true_false_rows]
```
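
The effect of operator precedence on negation can be checked with two small made-up Boolean `Series` (`a` and `b` are hypothetical names standing in for the two conditions).

```python
import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

# ~ applies to `a` only, then the result is combined with `b`.
first_only = ~a & b

# The extra parentheses make ~ apply to the whole expression.
whole = ~(a & b)

print(first_only.tolist())
print(whole.tolist())
```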

In [36]:
true_false_rows =  ~(spending_df["nb_beneficiaries"] > 150) &  (spending_df["spending"]  < 8000.00)
print(true_false_rows)
spending_df[true_false_rows]


unique_id
YY572610    False
YY219322     True
YY190561     True
PL346720    False
GZ129032     True
GH890091    False
AV967778     True
AB789982    False
CC128705    False
dtype: bool


unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
YY219322,1548247315,Psychiatry,ALPRAZOLAM,28,1964.49
YY190561,1548247315,Psychiatry,GABAPENTIN,86,1807.16
GZ129032,1518970284,Hemato-oncology,DIGOXIN,54,3766.34
AV967778,1952310666,Psychiatry,DIAZEPAM,103,662.87


### Working with Missing Data

- Given the pervasiveness of missing values in real data, `pandas` provides easy-to-use functionality for handling them.

- The overall approach for working with missing values in `pandas` is similar to that adopted in R, S and other statistical packages.

- When working with missing data, the objectives can be boiled down to:
  - Identifying missing values
  - Filling missing values
  - Filtering out rows or columns with missing values


##### 1. Identifying Missing Values

- This is typically achieved with the  `isnull()` method.
- This method returns `True` if a cell contains a `NaN` value, and returns `False` otherwise.

- When applied to `spending_df`, only the `spending` value for `unique_id` CC128705 evaluates to `True`.


In [38]:
spending_df.isnull()

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
YY572610,False,False,False,False,False
YY219322,False,False,False,False,False
YY190561,False,False,False,False,False
PL346720,False,False,False,False,False
GZ129032,False,False,False,False,False
GH890091,False,False,False,False,False
AV967778,False,False,False,False,False
AB789982,False,False,False,False,False
CC128705,False,False,False,False,True


##### 2. Filtering Out Missing Values - 1

- Various approaches can be used to filter out missing values.


- For instance, you can discard missing values using subsetting.


- For example, you can filter out missing values in the `spending` `Series` using:

```python
spending_df [ ~ spending_df["spending"].isnull() ]
```

- Other methods that aggregate `Booleans` across rows or columns (such as `any` and `all`) can also be applied to the output of `isnull()`.
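
As a minimal sketch of such aggregations, the made-up frame below has one missing value in each column. The names `toy_df`, `a` and `b` are hypothetical; `sum` and `any` are existing `DataFrame` methods.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", None],
})

# Number of missing values per column (True counts as 1).
missing_per_column = toy_df.isnull().sum()

# For each row: is at least one of its cells missing?
rows_with_missing = toy_df.isnull().any(axis="columns")

# Keep only the rows with no missing value at all.
complete_rows = toy_df[~rows_with_missing]

print(missing_per_column)
print(complete_rows)
```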



##### 2. Filtering Out Missing Values - 2

- `pandas` also has the `dropna` method to drop missing values
  - `na` in `dropna` is short for "not available", the convention used in R
    
- `dropna` has an optional parameter `axis` that determines whether the rows or the columns containing `NaN` are dropped
  - `axis= 0` or `axis= "rows"` can be used interchangeably to drop rows containing `NaN`s
  - `axis= 1` or `axis= "columns"` can be used interchangeably to drop columns containing `NaN`s


- The operation does not overwrite the original data but instead returns a new `DataFrame` with the `NaN`s dropped

![](images/axis_drop.png)

In [42]:
spending_df.dropna(axis='columns')

unique_id,doctor_id,specialty,medication,nb_beneficiaries
YY572610,1548247315,Psychiatry,MIRTAZAPINE,191
YY219322,1548247315,Psychiatry,ALPRAZOLAM,28
YY190561,1548247315,Psychiatry,GABAPENTIN,86
PL346720,1326175365,Family,OXYCODONE HCL,87
GZ129032,1518970284,Hemato-oncology,DIGOXIN,54
GH890091,1346358827,Family,HYDROCODONE,331
AV967778,1952310666,Psychiatry,DIAZEPAM,103
AB789982,1952310666,Psychiatry,CLONAZEPAM,226
CC128705,1298765423,Cardiology,NADOLOL,13


### DataFrame Axes


- This concept of `axis` is recurrent throughout `pandas`.



- Many of the operations we've seen earlier have the parameter `axis`.



- For instance, the methods `sum()`, `min()`, `max()`, etc. can all be applied row- or column-wise.



- It helps to think about the operation as being carried across the axis.


In [43]:
test_df = pd.DataFrame(
               { "Col_A":[1,2,3,4,5,6], 
                 "Col_B": [2,3,4,5,6,7], 
                 "Col_C": [3,4,5,6,7,8]
               }, 
                 index=["row_0", "row_1", "row_2", "row_3", "row_4", "row_5"] )
                
test_df

Unnamed: 0,Col_A,Col_B,Col_C
row_0,1,2,3
row_1,2,3,4
row_2,3,4,5
row_3,4,5,6
row_4,5,6,7
row_5,6,7,8



![](images/axis_example.png)

In [46]:
print(test_df.sum(axis='rows'))

print("------------------------")

print(test_df.sum(axis='columns'))


Col_A    21
Col_B    27
Col_C    33
dtype: int64
------------------------
row_0     6
row_1     9
row_2    12
row_3    15
row_4    18
row_5    21
dtype: int64


### Dropping `NaN` Based on Conditions

- In addition to the parameter `axis`, `dropna` has other useful parameters that we can use to customize the way we drop rows or columns from DataFrames.

| Parameter | Description |
|:----------:|:------------|
| `how` | (`any`) drops a row or a column if any of its values are `NaN`. <br/> (`all`) drops a row or a column if all of its values are `NaN` | 
| `thresh` | Defines the minimum number of non-`NaN` values a row or a column must contain to be kept. <br/> Useful for dropping `variables` (columns) with too many `NaN`s |
|`subset`| Defines a list of labels along the other axis to consider, e.g. the columns to inspect when dropping rows. |
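
The parameters above can be compared side by side on a small made-up frame (`toy_df` and its column names are hypothetical). Each call returns a new `DataFrame` and leaves `toy_df` unchanged.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan],
    "c": [1.0, 2.0, 3.0],
})

# how="any": drop a row as soon as one of its values is NaN.
drop_any = toy_df.dropna(how="any")

# how="all": drop a row only if every value is NaN (none here).
drop_all = toy_df.dropna(how="all")

# thresh=2: keep rows with at least 2 non-NaN values.
keep_rows = toy_df.dropna(thresh=2)

# Same threshold applied to columns instead of rows.
keep_cols = toy_df.dropna(axis="columns", thresh=2)

# subset: only look at column "b" when deciding which rows to drop.
subset_b = toy_df.dropna(subset=["b"])

print(keep_rows)
print(keep_cols)
```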


In [54]:


import numpy as np

temp_spending_df = spending_df.copy()

# np.nan represents the constant NaN value,
# similar to how math.pi or math.e represent the mathematical constants pi and e
# (the older pd.np alias is no longer available in recent pandas releases)

temp_spending_df.loc[["YY572610","PL346720","GH890091","AB789982"],'specialty'] = np.nan

temp_spending_df.loc[["YY572610","AV967778", "YY219322"],'nb_beneficiaries'] = np.nan

temp_spending_df


unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
YY572610,1548247315,,MIRTAZAPINE,,3131.96
YY219322,1548247315,Psychiatry,ALPRAZOLAM,,1964.49
YY190561,1548247315,Psychiatry,GABAPENTIN,86.0,1807.16
PL346720,1326175365,,OXYCODONE HCL,87.0,12881.04
GZ129032,1518970284,Hemato-oncology,DIGOXIN,54.0,3766.34
GH890091,1346358827,,HYDROCODONE,331.0,8511.14
AV967778,1952310666,Psychiatry,DIAZEPAM,,662.87
AB789982,1952310666,,CLONAZEPAM,226.0,1848.88
CC128705,1298765423,Cardiology,NADOLOL,13.0,


In [55]:
temp_spending_df.dropna(axis ='rows', subset=["medication", "spending"])

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
YY572610,1548247315,,MIRTAZAPINE,,3131.96
YY219322,1548247315,Psychiatry,ALPRAZOLAM,,1964.49
YY190561,1548247315,Psychiatry,GABAPENTIN,86.0,1807.16
PL346720,1326175365,,OXYCODONE HCL,87.0,12881.04
GZ129032,1518970284,Hemato-oncology,DIGOXIN,54.0,3766.34
GH890091,1346358827,,HYDROCODONE,331.0,8511.14
AV967778,1952310666,Psychiatry,DIAZEPAM,,662.87
AB789982,1952310666,,CLONAZEPAM,226.0,1848.88


In [57]:
# keep a column only if it has at least `thresh` non-NaN values;
# otherwise, drop it
temp_spending_df.dropna(axis=1, thresh=2)

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
YY572610,1548247315,,MIRTAZAPINE,,3131.96
YY219322,1548247315,Psychiatry,ALPRAZOLAM,,1964.49
YY190561,1548247315,Psychiatry,GABAPENTIN,86.0,1807.16
PL346720,1326175365,,OXYCODONE HCL,87.0,12881.04
GZ129032,1518970284,Hemato-oncology,DIGOXIN,54.0,3766.34
GH890091,1346358827,,HYDROCODONE,331.0,8511.14
AV967778,1952310666,Psychiatry,DIAZEPAM,,662.87
AB789982,1952310666,,CLONAZEPAM,226.0,1848.88
CC128705,1298765423,Cardiology,NADOLOL,13.0,


##### Filling In Missing Values 

- There are two conventional approaches for filling missing values:
  - Filling the value with a constant 
  - Filling the value dynamically with something computed on the fly.


- Both approaches can be carried out using the `fillna` method. The difference is the value we pass to `fillna()`.


##### 1. `fillna` with Static Values 

- `fillna` can take either:
  - A scalar constant which replaces all missing values of a `DataFrame`
  - A dictionary with specific values for each column


```python
temp_spending_df.fillna(0)
temp_spending_df.fillna( { "specialty": "UNKNOWN", 
                          "nb_beneficiaries": 0, 
                          "spending": 0 } )

```

In [225]:
temp_spending_df.fillna( { "specialty": "UNKNOWN", 
                          "nb_beneficiaries": 0, 
                          "spending": 0 } )

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
YY572610,1548247315,UNKNOWN,MIRTAZAPINE,0.0,3131.96
YY219322,1548247315,Psychiatry,ALPRAZOLAM,0.0,1964.49
YY190561,1548247315,Psychiatry,GABAPENTIN,86.0,1807.16
PL346720,1326175365,UNKNOWN,OXYCODONE HCL,87.0,12881.04
GZ129032,1518970284,Hemato-oncology,DIGOXIN,54.0,3766.34
GH890091,1346358827,UNKNOWN,HYDROCODONE,331.0,8511.14
AV967778,1952310666,Psychiatry,DIAZEPAM,0.0,662.87
AB789982,1952310666,UNKNOWN,CLONAZEPAM,226.0,1848.88
CC128705,1298765423,Cardiology,NADOLOL,13.0,0.0


##### 2. `fillna` with Dynamic Values

- Dynamic filling of missing values requires passing values that depend on the data
- A simple strategy would, for instance, fill missing values with a column's mean or median value

```python
# Mean of the non-missing values in the column
average_beneficiaries = temp_spending_df["nb_beneficiaries"].mean()
temp_spending_df["nb_beneficiaries"].fillna(average_beneficiaries)
```


- `NaN` values can also be interpolated using more sophisticated schemes (ex. regression, modeling, randomly from a parameterized distribution, etc...)
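
Mean and median filling can be compared on a small made-up `Series` (`toy_counts` is a hypothetical name). Both statistics are computed on the non-missing values only, so the `NaN`s do not influence the fill value.

```python
import numpy as np
import pandas as pd

toy_counts = pd.Series([13.0, np.nan, 86.0, 87.0, np.nan])

# Mean of the available values: (13 + 86 + 87) / 3
filled_mean = toy_counts.fillna(toy_counts.mean())

# Median of the available values: the middle one, 86
filled_median = toy_counts.fillna(toy_counts.median())

print(filled_mean.tolist())
print(filled_median.tolist())
```

The median is less sensitive to extreme values than the mean, which can matter for skewed columns such as spending amounts.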
  


In [64]:
temp_spending_df.fillna( { "specialty": "UNKNOWN", 
                          "nb_beneficiaries": temp_spending_df["nb_beneficiaries"].mean(), 
                          "spending": temp_spending_df["spending"].mean() } )



unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
YY572610,1548247315,UNKNOWN,MIRTAZAPINE,132.833,3131.96
YY219322,1548247315,Psychiatry,ALPRAZOLAM,132.833,1964.49
YY190561,1548247315,Psychiatry,GABAPENTIN,86.0,1807.16
PL346720,1326175365,UNKNOWN,OXYCODONE HCL,87.0,12881.04
GZ129032,1518970284,Hemato-oncology,DIGOXIN,54.0,3766.34
GH890091,1346358827,UNKNOWN,HYDROCODONE,331.0,8511.14
AV967778,1952310666,Psychiatry,DIAZEPAM,132.833,662.87
AB789982,1952310666,UNKNOWN,CLONAZEPAM,226.0,1848.88
CC128705,1298765423,Cardiology,NADOLOL,13.0,4321.735


##### 2. `fillna` with Dynamic Values - cont'd

- Instead of manually building a dictionary, we can pass `fillna` the `Series` returned by a summary method such as `mean`.

```python
mean_values = temp_spending_df.mean(numeric_only=True)
temp_spending_df.fillna(mean_values)
```
- The statement above works because:
  - `fillna` can take a dictionary or a `Series`
  - `mean_values` is a `Series`, which behaves much like a dictionary: its index labels play the role of the keys
    - `pandas` matches each index label in `mean_values` to the column of the same name


In [68]:

# numeric_only=True skips the text columns, for which a mean is undefined
temp_spending_df.fillna(temp_spending_df.mean(numeric_only=True))

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
YY572610,1548247315,,MIRTAZAPINE,132.833,3131.96
YY219322,1548247315,Psychiatry,ALPRAZOLAM,132.833,1964.49
YY190561,1548247315,Psychiatry,GABAPENTIN,86.0,1807.16
PL346720,1326175365,,OXYCODONE HCL,87.0,12881.04
GZ129032,1518970284,Hemato-oncology,DIGOXIN,54.0,3766.34
GH890091,1346358827,,HYDROCODONE,331.0,8511.14
AV967778,1952310666,Psychiatry,DIAZEPAM,132.833,662.87
AB789982,1952310666,,CLONAZEPAM,226.0,1848.88
CC128705,1298765423,Cardiology,NADOLOL,13.0,4321.735


##### 2. `fillna` with Dynamic Values - cont'd

- `fillna` can also take parameters that modify its behavior to back fill or forward fill `NaN` values
  - Back filling fills a `NaN` value using the one that comes immediately after it

  ```python
  temp_spending_df.fillna(method='bfill')
  ```
  - Forward filling fills a `NaN` value using the one that comes immediately before it
  ```python
  temp_spending_df.fillna(method='ffill')
  ```
  
- The above is very useful for filling time series data
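
`pandas` also exposes the dedicated methods `ffill()` and `bfill()`, which perform the same fills; recent releases deprecate the `method=` argument of `fillna` in their favor. The sketch below applies them to a made-up daily time series (`toy_series` and its dates are hypothetical).

```python
import numpy as np
import pandas as pd

toy_series = pd.Series(
    [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan],
    index=pd.date_range("2020-01-01", periods=6, freq="D"),
)

# Forward fill: carry the last known value forward in time.
# The first entry stays NaN because nothing precedes it.
forward = toy_series.ffill()

# Backward fill: pull the next known value backward in time.
# The last entry stays NaN because nothing follows it.
backward = toy_series.bfill()

print(forward.tolist())
print(backward.tolist())
```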

In [66]:
temp_spending_df.fillna(method='bfill') 

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
YY572610,1548247315,Psychiatry,MIRTAZAPINE,86.0,3131.96
YY219322,1548247315,Psychiatry,ALPRAZOLAM,86.0,1964.49
YY190561,1548247315,Psychiatry,GABAPENTIN,86.0,1807.16
PL346720,1326175365,Hemato-oncology,OXYCODONE HCL,87.0,12881.04
GZ129032,1518970284,Hemato-oncology,DIGOXIN,54.0,3766.34
GH890091,1346358827,Psychiatry,HYDROCODONE,331.0,8511.14
AV967778,1952310666,Psychiatry,DIAZEPAM,226.0,662.87
AB789982,1952310666,Cardiology,CLONAZEPAM,226.0,1848.88
CC128705,1298765423,Cardiology,NADOLOL,13.0,


In [67]:
temp_spending_df.fillna(method='ffill') 

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
YY572610,1548247315,,MIRTAZAPINE,,3131.96
YY219322,1548247315,Psychiatry,ALPRAZOLAM,,1964.49
YY190561,1548247315,Psychiatry,GABAPENTIN,86.0,1807.16
PL346720,1326175365,Psychiatry,OXYCODONE HCL,87.0,12881.04
GZ129032,1518970284,Hemato-oncology,DIGOXIN,54.0,3766.34
GH890091,1346358827,Hemato-oncology,HYDROCODONE,331.0,8511.14
AV967778,1952310666,Psychiatry,DIAZEPAM,331.0,662.87
AB789982,1952310666,Psychiatry,CLONAZEPAM,226.0,1848.88
CC128705,1298765423,Cardiology,NADOLOL,13.0,1848.88


### Custom Missing Data

- Important: some datasets and/or domains use custom characters or strings to encode missing data. 

  - For example, in genomics, a missing genotype is often encoded as "NN".
  - Also, some statistical applications don't have constants that represent missing values (such as `pandas`' `NaN` or `R`'s `NA`).
    - Such applications often use very small or very large out-of-range values. 
    - Ex. age = 999 or income = $-999999999.99, etc.
  - Custom replacement code should be used to identify these values and replace them with `NaN`s. This will be the subject of the practical.
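
One possible way to do this replacement is `DataFrame.replace` with a nested dictionary that maps, for each column, the sentinel value to `NaN`. The frame `toy_df`, its columns and the sentinel codes are made up for illustration and are not taken from the practical's dataset.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "age": [34, 999, 51],              # 999 is assumed to mean "missing"
    "genotype": ["AG", "NN", "CC"],    # "NN" is assumed to mean "missing"
})

# Outer keys are column names; inner dictionaries map old value -> new value.
cleaned = toy_df.replace({
    "age": {999: np.nan},
    "genotype": {"NN": np.nan},
})

print(cleaned)
print(cleaned.isnull().sum())
```

Restricting each sentinel to its own column avoids wiping out a legitimate 999 that might appear in a different column.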

##### Final Note on The `axis` Parameter

- Sometimes, it helps to think about the parameter `axis` in terms of the shape of the resulting `DataFrame`.


- For functions that generate new values, `axis = "rows"` produces another row, whereas `axis = "columns"` produces another column.


- For functions that drop values (such as `dropna`),  `axis = "rows"` drops one or more rows, whereas `axis = "columns"` drops one or more columns.



### Practical #

- We will be using the dataset provided in ....

- Read the dataset into a DataFrame called XYZ. 
  - When reading in the file, indicate that the data type of W is X and that the data type of Y is Z.
    
    

- Which columns have the most missing values?

- Drop all rows for which the value of X is something and the value of Y is lower than something.


- Drop the columns that don't have at least 9k values

- Delete rows that have more than `X` missing values

- Replace the remaining missing values with the median of the column in which they occur.




In [73]:
spending_practical_df = pd.read_table("data/exploring_data_practical.tsv")


Unnamed: 0,unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
0,BK982218,1750389599,INTERNAL MEDICINE,AZITHROMYCIN,12,77.26
1,CG916968,1952344418,CARDIOLOGY,SIMVASTATIN,85,767.83
2,SA964720,1669522744,INTERNAL MEDICINE,INSULIN DETEMIR,14,5409.29
3,TR390895,1639597115,STUDENT IN AN ORGANIZED HEALTH CARE EDUCATION/...,LOSARTAN POTASSIUM,11,65.62
4,JA436080,1073781571,NEUROLOGY,LAMOTRIGINE,12,8873.7
