## Context
- **Restructuring data**
    - pd.melt()
    - pd.pivot()
    - pd.pivot_table()
- **Dealing with Missing Values**
    - None and nan values
    - isna() and isnull()
    - dropna()
    - fillna()
     

### Importing our data

- For this topic, we will use **data on a few drugs** being developed by **Pfizer**


Link: https://drive.google.com/file/d/173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ/view?usp=sharing

In [None]:
# !gdown 173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ -O Pfizer_1.csv

#### What is the data about?

- Temperature (K)
- Pressure (P)

are recorded at **1-hour intervals** every day to monitor drug stability in a drug development test

==> These data points are then used to **identify the optimal parameter values** for the stability of the drugs

#### Now, let's explore this dataset


In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('../Pfizer_1.csv')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       18 non-null     object 
 1   Drug_Name  18 non-null     object 
 2   Parameter  18 non-null     object 
 3   1:30:00    16 non-null     float64
 4   2:30:00    16 non-null     float64
 5   3:30:00    12 non-null     float64
 6   4:30:00    14 non-null     float64
 7   5:30:00    16 non-null     float64
 8   6:30:00    18 non-null     int64  
 9   7:30:00    16 non-null     float64
 10  8:30:00    14 non-null     float64
 11  9:30:00    16 non-null     float64
 12  10:30:00   18 non-null     int64  
 13  11:30:00   16 non-null     float64
 14  12:30:00   18 non-null     int64  
dtypes: float64(9), int64(3), object(3)
memory usage: 2.2+ KB


In [4]:
data.head()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
0,15-10-2020,diltiazem hydrochloride,Temperature,23.0,22.0,,21.0,21.0,22,23.0,21.0,22.0,20,20.0,21
1,15-10-2020,diltiazem hydrochloride,Pressure,12.0,13.0,,11.0,13.0,14,16.0,16.0,24.0,18,19.0,20
2,15-10-2020,docetaxel injection,Temperature,,17.0,18.0,,17.0,18,,,23.0,23,25.0,25
3,15-10-2020,docetaxel injection,Pressure,,22.0,22.0,,22.0,23,,,27.0,26,29.0,28
4,15-10-2020,ketamine hydrochloride,Temperature,24.0,,,27.0,,26,25.0,24.0,23.0,22,21.0,20


In [5]:
data.tail()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
13,17-10-2020,diltiazem hydrochloride,Pressure,3.0,4.0,4.0,4.0,6.0,8,9.0,,9.0,11,13.0,14
14,17-10-2020,docetaxel injection,Temperature,12.0,13.0,14.0,15.0,16.0,17,18.0,19.0,20.0,21,22.0,23
15,17-10-2020,docetaxel injection,Pressure,20.0,22.0,22.0,22.0,22.0,23,25.0,26.0,27.0,28,29.0,28
16,17-10-2020,ketamine hydrochloride,Temperature,13.0,14.0,15.0,16.0,17.0,18,19.0,20.0,21.0,22,23.0,24
17,17-10-2020,ketamine hydrochloride,Pressure,8.0,9.0,10.0,11.0,11.0,12,12.0,11.0,12.0,13,14.0,15


<!-- #### Let's check the shape of this dataset -->

### Melting in Pandas

As we saw earlier, the dataset has 18 rows and 15 columns

If you notice further, you'll see:

- The **columns are `1:30:00`, `2:30:00`, `3:30:00`, and so on**

- `Temperature` and `Pressure` **for each date** are **in separate rows**

#### Can we restructure our data into a better format?

<!-- Maybe do something more intuitive -->

Maybe we can have a column for `time`, with `timestamps` as the column value

**Where will the Temperature/Pressure values go?**

We can similarly create one column containing the values of these parameters




==> **"Melt" timestamp columns into two columns** - timestamp and corresponding values
<!-- Something like our DataFrame will have columns [`ID`, `Date`, `Parameter`, `time`, `result`] -->




#### How can we restructure our data into having every row corresponding to a single reading?


In [6]:
pd.melt(data, id_vars=['Date', 'Parameter', 'Drug_Name'])

Unnamed: 0,Date,Parameter,Drug_Name,variable,value
0,15-10-2020,Temperature,diltiazem hydrochloride,1:30:00,23.0
1,15-10-2020,Pressure,diltiazem hydrochloride,1:30:00,12.0
2,15-10-2020,Temperature,docetaxel injection,1:30:00,
3,15-10-2020,Pressure,docetaxel injection,1:30:00,
4,15-10-2020,Temperature,ketamine hydrochloride,1:30:00,24.0
...,...,...,...,...,...
211,17-10-2020,Pressure,diltiazem hydrochloride,12:30:00,14.0
212,17-10-2020,Temperature,docetaxel injection,12:30:00,23.0
213,17-10-2020,Pressure,docetaxel injection,12:30:00,28.0
214,17-10-2020,Temperature,ketamine hydrochloride,12:30:00,24.0


This converts our data from `wide` to `long` format


Notice that `id_vars` is the set of columns that remain unmelted

#### How does `pd.melt()` work?


- Pass in the **DataFrame**
- Pass in the **column names to not melt** <!--**that we DON'T want to change** -->
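
The two steps above can be sketched on a tiny frame. This is only an illustration: the frame `wide` and the drug name `drug_a` are made up and are not part of the Pfizer dataset.

```python
import pandas as pd

# Hypothetical toy frame: one row per (drug, parameter), one column per timestamp
wide = pd.DataFrame({
    "Drug_Name": ["drug_a", "drug_a"],
    "Parameter": ["Temperature", "Pressure"],
    "1:30:00": [23.0, 12.0],
    "2:30:00": [22.0, 13.0],
})

# id_vars stay as they are; every other column is melted into two columns
long = pd.melt(wide, id_vars=["Drug_Name", "Parameter"])
print(long)
```

With no extra arguments, the two new columns get the default names `variable` and `value`.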




But we can provide better names to these new columns

#### How can we rename the columns "variable" and "value" as per our original dataframe?

In [9]:
data_melt = pd.melt(data,id_vars = ['Date', 'Drug_Name', 'Parameter'],
            var_name = "time",
            value_name = 'reading')

data_melt

Unnamed: 0,Date,Drug_Name,Parameter,time,reading
0,15-10-2020,diltiazem hydrochloride,Temperature,1:30:00,23.0
1,15-10-2020,diltiazem hydrochloride,Pressure,1:30:00,12.0
2,15-10-2020,docetaxel injection,Temperature,1:30:00,
3,15-10-2020,docetaxel injection,Pressure,1:30:00,
4,15-10-2020,ketamine hydrochloride,Temperature,1:30:00,24.0
...,...,...,...,...,...
211,17-10-2020,diltiazem hydrochloride,Pressure,12:30:00,14.0
212,17-10-2020,docetaxel injection,Temperature,12:30:00,23.0
213,17-10-2020,docetaxel injection,Pressure,12:30:00,28.0
214,17-10-2020,ketamine hydrochloride,Temperature,12:30:00,24.0


**Conclusion**

<!-- - Columns from `1:30:00` to `12:30:00` are conviniently **melted to a single column `time`** -->
- The labels of the timestamp columns are conveniently **melted into a single column**, `time`
- All the values are retained in the column `reading`

--------------

- The labels of columns such as `1:30:00`, `2:30:00` have now become categories of the `variable` column (which we renamed to `time`)
- The **values from the columns we are melting** are stored in the `value` column (which we renamed to `reading`)
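
A quick way to check a melt is to count rows: every original row contributes one row per melted column. For our dataset that is 18 rows × 12 timestamp columns = 216 rows. The sketch below checks the same arithmetic on a made-up frame (the names `drug_a`, `drug_b`, `drug_c` are hypothetical) and confirms that missing readings are carried over rather than dropped.

```python
import pandas as pd

# Hypothetical toy frame with two missing readings
wide = pd.DataFrame({
    "Drug_Name": ["drug_a", "drug_b", "drug_c"],
    "1:30:00": [23.0, None, 24.0],
    "2:30:00": [22.0, 17.0, None],
})

long = pd.melt(wide, id_vars=["Drug_Name"], var_name="time", value_name="reading")

n_melted_cols = wide.shape[1] - 1              # every column except the id_vars
expected_rows = wide.shape[0] * n_melted_cols  # rows x melted columns

print(len(long), expected_rows)
print(long["reading"].isna().sum(), wide.isna().sum().sum())
```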


### Pivot

Now suppose we want to convert our data back to **wide format**

The reason could be to maintain the structure for storing or some other purpose.

Notice:

- The variables `Date`, `Drug_Name` and `Parameter` will remain the same

- The column names will be extracted from the column `time`

- The values will be extracted from the column `reading`

#### How can we restructure our data back to the original wide format, before it was melted?




In [12]:
data_melt.pivot(index=['Date','Drug_Name','Parameter'],  # Column to use to make new frame’s index
                columns = 'time',                  # Column to use to make new frame’s columns
                values='reading')                   # Columns to use for populating new frame’s values.

Unnamed: 0_level_0,Unnamed: 1_level_0,time,10:30:00,11:30:00,12:30:00,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00
Date,Drug_Name,Parameter,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
15-10-2020,diltiazem hydrochloride,Pressure,18.0,19.0,20.0,12.0,13.0,,11.0,13.0,14.0,16.0,16.0,24.0
15-10-2020,diltiazem hydrochloride,Temperature,20.0,20.0,21.0,23.0,22.0,,21.0,21.0,22.0,23.0,21.0,22.0
15-10-2020,docetaxel injection,Pressure,26.0,29.0,28.0,,22.0,22.0,,22.0,23.0,,,27.0
15-10-2020,docetaxel injection,Temperature,23.0,25.0,25.0,,17.0,18.0,,17.0,18.0,,,23.0
15-10-2020,ketamine hydrochloride,Pressure,9.0,9.0,11.0,8.0,,,7.0,,9.0,10.0,11.0,10.0
15-10-2020,ketamine hydrochloride,Temperature,22.0,21.0,20.0,24.0,,,27.0,,26.0,25.0,24.0,23.0
16-10-2020,diltiazem hydrochloride,Pressure,24.0,,27.0,18.0,19.0,20.0,21.0,22.0,23.0,24.0,25.0,25.0
16-10-2020,diltiazem hydrochloride,Temperature,40.0,,42.0,34.0,35.0,36.0,36.0,37.0,38.0,37.0,38.0,39.0
16-10-2020,docetaxel injection,Pressure,28.0,29.0,30.0,23.0,24.0,,25.0,26.0,27.0,28.0,29.0,28.0
16-10-2020,docetaxel injection,Temperature,56.0,57.0,58.0,46.0,47.0,,48.0,48.0,49.0,50.0,52.0,55.0


Notice,

We are getting a **multi-level index** (`MultiIndex`) here

#### How can we reset this to a single-index dataframe?

In [11]:
data_melt.pivot(index=['Date','Drug_Name','Parameter'],
                columns = 'time',
                values='reading').reset_index()

time,Date,Drug_Name,Parameter,10:30:00,11:30:00,12:30:00,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00
0,15-10-2020,diltiazem hydrochloride,Pressure,18.0,19.0,20.0,12.0,13.0,,11.0,13.0,14.0,16.0,16.0,24.0
1,15-10-2020,diltiazem hydrochloride,Temperature,20.0,20.0,21.0,23.0,22.0,,21.0,21.0,22.0,23.0,21.0,22.0
2,15-10-2020,docetaxel injection,Pressure,26.0,29.0,28.0,,22.0,22.0,,22.0,23.0,,,27.0
3,15-10-2020,docetaxel injection,Temperature,23.0,25.0,25.0,,17.0,18.0,,17.0,18.0,,,23.0
4,15-10-2020,ketamine hydrochloride,Pressure,9.0,9.0,11.0,8.0,,,7.0,,9.0,10.0,11.0,10.0
5,15-10-2020,ketamine hydrochloride,Temperature,22.0,21.0,20.0,24.0,,,27.0,,26.0,25.0,24.0,23.0
6,16-10-2020,diltiazem hydrochloride,Pressure,24.0,,27.0,18.0,19.0,20.0,21.0,22.0,23.0,24.0,25.0,25.0
7,16-10-2020,diltiazem hydrochloride,Temperature,40.0,,42.0,34.0,35.0,36.0,36.0,37.0,38.0,37.0,38.0,39.0
8,16-10-2020,docetaxel injection,Pressure,28.0,29.0,30.0,23.0,24.0,,25.0,26.0,27.0,28.0,29.0,28.0
9,16-10-2020,docetaxel injection,Temperature,56.0,57.0,58.0,46.0,47.0,,48.0,48.0,49.0,50.0,52.0,55.0



==> `pivot()` is the inverse of `melt()`: it converts the data from `long` back to `wide` format

#### How does `pivot()` work?

- The values of column `time` become the new column labels, with one row for each combination of `Date`, `Drug_Name` and `Parameter`
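
The round trip can be sketched on a small made-up frame (`drug_a` and `drug_b` are hypothetical names). It also shows why `10:30:00` appears before `1:30:00` in the output above: `pivot()` sorts the new column labels, and as strings `"10:30:00"` sorts before `"1:30:00"`.

```python
import pandas as pd

# Hypothetical toy frame in wide format
wide = pd.DataFrame({
    "Drug_Name": ["drug_a", "drug_b"],
    "1:30:00": [23.0, 17.0],
    "10:30:00": [20.0, 23.0],
})

# wide -> long
long = wide.melt(id_vars="Drug_Name", var_name="time", value_name="reading")

# long -> wide again
back = long.pivot(index="Drug_Name", columns="time", values="reading").reset_index()
back.columns.name = None  # drop the leftover "time" label on the column axis

print(list(back.columns))
print(back[list(wide.columns)])  # reorder columns to compare with the original
```

The values survive the round trip; only the column order needs to be restored by hand.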





In [15]:
data_melt.head()

Unnamed: 0,Date,Drug_Name,Parameter,time,reading
0,15-10-2020,diltiazem hydrochloride,Temperature,1:30:00,23.0
1,15-10-2020,diltiazem hydrochloride,Pressure,1:30:00,12.0
2,15-10-2020,docetaxel injection,Temperature,1:30:00,
3,15-10-2020,docetaxel injection,Pressure,1:30:00,
4,15-10-2020,ketamine hydrochloride,Temperature,1:30:00,24.0


Now if you notice,

We are **using 2 rows** to log readings for a single experiment.


#### Can we restructure our data further by splitting the Parameter column into Temperature and Pressure?

A format like:

`Date | time | Drug_Name | Pressure | Temperature`

would be really suitable

- We want to **split one single column into multiple columns**

#### How can we divide the Parameter column again?

In [21]:
data_tidy = data_melt.pivot(index=['Date','time', 'Drug_Name'],
                                        columns = 'Parameter',
                                        values='reading')

data_tidy

Unnamed: 0_level_0,Unnamed: 1_level_0,Parameter,Pressure,Temperature
Date,time,Drug_Name,Unnamed: 3_level_1,Unnamed: 4_level_1
15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0
15-10-2020,10:30:00,docetaxel injection,26.0,23.0
15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0
15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0
15-10-2020,11:30:00,docetaxel injection,29.0,25.0
...,...,...,...,...
17-10-2020,8:30:00,docetaxel injection,26.0,19.0
17-10-2020,8:30:00,ketamine hydrochloride,11.0,20.0
17-10-2020,9:30:00,diltiazem hydrochloride,9.0,13.0
17-10-2020,9:30:00,docetaxel injection,27.0,20.0


<!-- Notice that a **multi-index** df has been created

We change this using **`reset_index()`** -->

We can use `reset_index()` to remove the multi-index

In [34]:
data_tidy = data_tidy.reset_index()
data_tidy

None,index,Date,time,Drug_Name,Pressure,Temperature
0,0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0
1,1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0
2,2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0
3,3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0
4,4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0
...,...,...,...,...,...,...
103,103,17-10-2020,8:30:00,docetaxel injection,26.0,19.0
104,104,17-10-2020,8:30:00,ketamine hydrochloride,11.0,20.0
105,105,17-10-2020,9:30:00,diltiazem hydrochloride,9.0,13.0
106,106,17-10-2020,9:30:00,docetaxel injection,27.0,20.0


The column axis still carries the name `Parameter`, left over from the pivot. We can overwrite this name; here we set it to the string `'None'`

In [35]:
data_tidy.columns

Index(['index', 'Date', 'time', 'Drug_Name', 'Pressure', 'Temperature'], dtype='object', name='None')

In [36]:
data_tidy.columns.name = 'None'

In [37]:
data_tidy.head()

None,index,Date,time,Drug_Name,Pressure,Temperature
0,0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0
1,1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0
2,2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0
3,3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0
4,4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0
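
The cell above stores the string `'None'` as the label, which is why `None` still shows up in the header of the output. If the goal is to remove the label altogether, one option is to assign the `None` object instead. A minimal sketch on made-up data (the frame `long` below is hypothetical):

```python
import pandas as pd

# Hypothetical long-format frame with one timestamp and two parameters
long = pd.DataFrame({
    "time": ["1:30:00", "1:30:00"],
    "Parameter": ["Temperature", "Pressure"],
    "reading": [23.0, 12.0],
})

tidy = long.pivot(index="time", columns="Parameter", values="reading").reset_index()
print(tidy.columns.name)  # the label left over from the pivot

tidy.columns.name = None  # the None object, not the string 'None'
print(tidy.columns.name)
print(tidy)
```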


### Pivot_table



Now suppose we want to find some insights, like **mean temperature day wise**

#### Can we use pivot to find the day-wise mean value of temperature for each drug?

In [38]:
data_tidy.pivot(index=['Drug_Name'],
                columns = 'Date',
                values=['Temperature'])

ValueError: Index contains duplicate entries, cannot reshape

#### Why did we get an error?

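- We need to find the **average** of temperature values throughout a day

- If you notice, the error mentions **duplicate entries**

`pivot()` needs each combination of index and column values to identify exactly one row. Here, every drug has many readings on the same date, so `pivot()` cannot decide which value to place in the cell.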

#### What can we do to get our required mean values then?


In [39]:
pd.pivot_table(data_tidy, index='Drug_Name', columns='Date', values=['Temperature'], aggfunc='mean')

None,Temperature,Temperature,Temperature
Date,15-10-2020,16-10-2020,17-10-2020
Drug_Name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
diltiazem hydrochloride,21.454545,37.454545,15.636364
docetaxel injection,20.75,51.454545,17.5
ketamine hydrochloride,23.555556,11.5,18.5


This function is similar to `pivot()`, with the extra feature of an aggregator

#### How does `pivot_table` work?
- The initial parameters are the same as in `pivot()`
- As an extra parameter, we pass the **aggregation function** (`aggfunc`)
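
The contrast between the two functions can be sketched on a small made-up frame with two readings per drug on the same date (all names and values below are hypothetical):

```python
import pandas as pd

# Hypothetical tidy frame: two temperature readings per drug on one date
tidy = pd.DataFrame({
    "Drug_Name": ["drug_a", "drug_a", "drug_b", "drug_b"],
    "Date": ["15-10-2020", "15-10-2020", "15-10-2020", "15-10-2020"],
    "Temperature": [20.0, 24.0, 17.0, 19.0],
})

# pivot() has no way to combine the duplicate (Drug_Name, Date) pairs
try:
    tidy.pivot(index="Drug_Name", columns="Date", values="Temperature")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table() resolves the duplicates by aggregating them
means = pd.pivot_table(tidy, index="Drug_Name", columns="Date",
                       values="Temperature", aggfunc="mean")
print(means)
```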

Note:

- We could have done this using `groupby` too

- In fact, `pivot_table` uses `groupby` in the backend to group the data and perform the aggregation

- The only difference is in the type of output we get using both functions
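
To see the difference in output, here is a minimal sketch on made-up data (hypothetical drug names and values). `groupby` returns a Series with a two-level index, while `pivot_table` returns a wide table; unstacking the `groupby` result gives the same layout.

```python
import pandas as pd

# Hypothetical tidy frame: two readings per drug per date
tidy = pd.DataFrame({
    "Drug_Name": ["drug_a"] * 4 + ["drug_b"] * 4,
    "Date": ["15-10-2020", "15-10-2020", "16-10-2020", "16-10-2020"] * 2,
    "Temperature": [20.0, 24.0, 36.0, 38.0, 17.0, 19.0, 48.0, 50.0],
})

# groupby: long output, one row per (Drug_Name, Date) pair
grouped = tidy.groupby(["Drug_Name", "Date"])["Temperature"].mean()
print(grouped)

# pivot_table: wide output, drugs as rows and dates as columns
table = pd.pivot_table(tidy, index="Drug_Name", columns="Date",
                       values="Temperature", aggfunc="mean")
print(table)

# Moving the Date level into the columns gives the same layout as pivot_table
print(grouped.unstack("Date"))
```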



#### Similarly, what if we want to find the minimum values of temperature and pressure on a particular date?

In [40]:
pd.pivot_table(data_tidy, index='Drug_Name', columns='Date', values=['Temperature', 'Pressure'], aggfunc='min')

None,Pressure,Pressure,Pressure,Temperature,Temperature,Temperature
Date,15-10-2020,16-10-2020,17-10-2020,15-10-2020,16-10-2020,17-10-2020
Drug_Name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
diltiazem hydrochloride,11.0,18.0,3.0,20.0,34.0,10.0
docetaxel injection,22.0,23.0,20.0,17.0,46.0,12.0
ketamine hydrochloride,7.0,12.0,8.0,20.0,8.0,13.0


### Handling Missing Values



If you notice, there are many "NaN" values in our data

In [41]:
data_tidy.head()

None,index,Date,time,Drug_Name,Pressure,Temperature
0,0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0
1,1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0
2,2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0
3,3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0
4,4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0


#### What are these "NaN" values?
They are basically **missing values**

#### What are missing values?
A Missing Value signifies an **empty cell/no data**

There can be 2 kinds of missing values:

  1. `None`
  2. `NaN` (short for Not a Number)

#### What's the difference between "None" and "NaN"?

The difference mainly lies in their datatype

In [42]:
type(None)

NoneType

In [43]:
type(np.nan)

float

**None type** is for missing values in a column with **non-number entries**
- E.g.-strings

**NaN** occurs for columns with **number entries**

Note:

Pandas uses these values nearly **interchangeably**, converting between them where appropriate, based on column datatype

In [44]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

For **numerical** types, Pandas converts **None to NaN**


In [45]:
pd.Series(["1", "np.nan", "2", None])

0         1
1    np.nan
2         2
3      None
dtype: object

In [46]:
pd.Series(["1", "np.nan", "2", np.nan])

0         1
1    np.nan
2         2
3       NaN
dtype: object

For **object** type, the **None is preserved** and not changed to NaN
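
One practical consequence of the datatype difference is how the two values compare. A small sketch (the Series below is made up) showing why we rely on `isna()` instead of `==` to find missing values:

```python
import numpy as np
import pandas as pd

print(None == None)      # None is a single object, so it equals itself
print(np.nan == np.nan)  # NaN is a float that never compares equal, even to itself

# An object Series that keeps both kinds of missing value
values = pd.Series([1.0, np.nan, None], dtype="object")

# isna() treats None and NaN alike
mask = values.isna()
print(mask.tolist())
```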

Now we have the basic idea about missing values

#### How to know the count of missing values for each row/column?


In [47]:
data.isna().head()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
0,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
2,False,False,False,True,False,False,True,False,False,True,True,False,False,False,False
3,False,False,False,True,False,False,True,False,False,True,True,False,False,False,False
4,False,False,False,False,True,True,False,True,False,False,False,False,False,False,False


We can also use `isnull()` to get the same results

In [48]:
data.isnull().head()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
0,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
2,False,False,False,True,False,False,True,False,False,True,True,False,False,False,False
3,False,False,False,True,False,False,True,False,False,True,True,False,False,False,False
4,False,False,False,False,True,True,False,True,False,False,False,False,False,False,False


#### But, why do we have two methods, "isna" and "isnull" for the same operation?

isnull() is just an alias for isna()

In [49]:
pd.isnull

<function pandas.core.dtypes.missing.isna(obj: 'object') -> 'bool | npt.NDArray[np.bool_] | NDFrame'>

In [50]:
pd.isna

<function pandas.core.dtypes.missing.isna(obj: 'object') -> 'bool | npt.NDArray[np.bool_] | NDFrame'>

As we can see, both names refer to the same function, `isna`

`isna()` returns a **boolean dataframe**, with each cell as a boolean value

This value corresponds to **whether the cell has a missing value**

On top of this, we can use `.sum()` to find the count

In [51]:
data.isna().sum()

Date         0
Drug_Name    0
Parameter    0
1:30:00      2
2:30:00      2
3:30:00      6
4:30:00      4
5:30:00      2
6:30:00      0
7:30:00      2
8:30:00      4
9:30:00      2
10:30:00     0
11:30:00     2
12:30:00     0
dtype: int64

This gives us the total number of missing values in each column

#### Can we also get the number of missing values in each row?

In [52]:
data.isna().sum(axis=1)

0     1
1     1
2     4
3     4
4     3
5     3
6     1
7     1
8     1
9     1
10    2
11    2
12    1
13    1
14    0
15    0
16    0
17    0
dtype: int64

Note:

By default, `sum()` uses `axis=0`
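
The effect of the `axis` argument can be sketched on a small made-up frame (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three missing readings
df = pd.DataFrame({
    "1:30:00": [23.0, np.nan, 24.0],
    "2:30:00": [22.0, np.nan, np.nan],
})

per_column = df.isna().sum()     # axis=0 (default): one count per column
per_row = df.isna().sum(axis=1)  # axis=1: one count per row
total = df.isna().sum().sum()    # summing twice gives the overall count

print(per_column)
print(per_row)
print(total)
```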


#### We have identified the null count, but how do we deal with them?

We have two options:
- delete the rows/columns containing the null values
- fill the missing values with some data/estimate

Let's first look at deleting the rows

#### How can we drop rows containing null values?

In [53]:
data.dropna()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
14,17-10-2020,docetaxel injection,Temperature,12.0,13.0,14.0,15.0,16.0,17,18.0,19.0,20.0,21,22.0,23
15,17-10-2020,docetaxel injection,Pressure,20.0,22.0,22.0,22.0,22.0,23,25.0,26.0,27.0,28,29.0,28
16,17-10-2020,ketamine hydrochloride,Temperature,13.0,14.0,15.0,16.0,17.0,18,19.0,20.0,21.0,22,23.0,24
17,17-10-2020,ketamine hydrochloride,Pressure,8.0,9.0,10.0,11.0,11.0,12,12.0,11.0,12.0,13,14.0,15


Rows with **even a single missing value** have been deleted

#### What if we want to delete the columns having missing value?


In [54]:
data.dropna(axis=1)

Unnamed: 0,Date,Drug_Name,Parameter,6:30:00,10:30:00,12:30:00
0,15-10-2020,diltiazem hydrochloride,Temperature,22,20,21
1,15-10-2020,diltiazem hydrochloride,Pressure,14,18,20
2,15-10-2020,docetaxel injection,Temperature,18,23,25
3,15-10-2020,docetaxel injection,Pressure,23,26,28
4,15-10-2020,ketamine hydrochloride,Temperature,26,22,20
5,15-10-2020,ketamine hydrochloride,Pressure,9,9,11
6,16-10-2020,diltiazem hydrochloride,Temperature,38,40,42
7,16-10-2020,diltiazem hydrochloride,Pressure,23,24,27
8,16-10-2020,docetaxel injection,Temperature,49,56,58
9,16-10-2020,docetaxel injection,Pressure,27,28,30


=> Every column which had even a single missing value has been deleted

#### But what are the problems with deleting rows/columns?

One of the major problems:
- loss of data
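
The loss can be measured by comparing row counts before and after dropping. The sketch below uses a made-up frame and also shows two `dropna()` parameters, `thresh` and `subset`, which drop fewer rows. They are not used elsewhere in this lecture and are shown only as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: only the first row is complete
df = pd.DataFrame({
    "Drug_Name": ["drug_a", "drug_b", "drug_c", "drug_d"],
    "1:30:00": [23.0, np.nan, 24.0, np.nan],
    "2:30:00": [22.0, 17.0, np.nan, np.nan],
})

strict = df.dropna()                       # drop a row if any value is missing
lenient = df.dropna(thresh=2)              # keep rows with at least 2 non-missing values
by_column = df.dropna(subset=["1:30:00"])  # only check one column for missing values

print(len(df), len(strict), len(lenient), len(by_column))
```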

Instead of dropping, it would be better to **fill the missing values with some data**

#### How can we fill the missing values with some data?

In [55]:
data.fillna(0).head()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
0,15-10-2020,diltiazem hydrochloride,Temperature,23.0,22.0,0.0,21.0,21.0,22,23.0,21.0,22.0,20,20.0,21
1,15-10-2020,diltiazem hydrochloride,Pressure,12.0,13.0,0.0,11.0,13.0,14,16.0,16.0,24.0,18,19.0,20
2,15-10-2020,docetaxel injection,Temperature,0.0,17.0,18.0,0.0,17.0,18,0.0,0.0,23.0,23,25.0,25
3,15-10-2020,docetaxel injection,Pressure,0.0,22.0,22.0,0.0,22.0,23,0.0,0.0,27.0,26,29.0,28
4,15-10-2020,ketamine hydrochloride,Temperature,24.0,0.0,0.0,27.0,0.0,26,25.0,24.0,23.0,22,21.0,20


**What is fillna(0) doing?**

It fills all missing values with 0

We can do the same on a particular column too

In [56]:
data['2:30:00'].fillna(0)

0     22.0
1     13.0
2     17.0
3     22.0
4      0.0
5      0.0
6     35.0
7     19.0
8     47.0
9     24.0
10     9.0
11    12.0
12    19.0
13     4.0
14    13.0
15    22.0
16    14.0
17     9.0
Name: 2:30:00, dtype: float64


#### What other values can we use to fill the missing values ?

We can use some **kind of estimator** too
- An estimator like **mean or median**
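
The choice of estimator matters when a column has extreme values. A minimal sketch on a made-up Series with one unusually high reading:

```python
import numpy as np
import pandas as pd

# Hypothetical readings: one missing value and one unusually high value
readings = pd.Series([20.0, 21.0, 22.0, np.nan, 57.0])

filled_mean = readings.fillna(readings.mean())      # the mean is pulled up by 57.0
filled_median = readings.fillna(readings.median())  # the median is not

print(filled_mean.tolist())
print(filled_median.tolist())
```

Here the mean of the known values is 30.0 while the median is 21.5, so the two estimators fill the gap quite differently.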

#### How would you calculate the mean of the column `2:30:00`?


In [57]:
data['2:30:00'].mean()

18.8125

Now let's fill the NaN values with the mean value of the column

In [58]:
data['2:30:00'].fillna(data['2:30:00'].mean())

0     22.0000
1     13.0000
2     17.0000
3     22.0000
4     18.8125
5     18.8125
6     35.0000
7     19.0000
8     47.0000
9     24.0000
10     9.0000
11    12.0000
12    19.0000
13     4.0000
14    13.0000
15    22.0000
16    14.0000
17     9.0000
Name: 2:30:00, dtype: float64

But this doesn't feel right. What could be wrong with this?

#### Can we use the mean of all compounds as average for our estimator?

- **Different drugs** have **different characteristics**
- We can't simply do an average and fill the null values

**Then what could be a solution here?**

We could fill the null values of **respective compounds with their respective means**







<!-- #### How can we find mean temperature drug-wise? -->

In [None]:
# data_tidy.groupby("Drug_Name")["Temperature"].mean()

#### How can we form a column with mean temperature of respective compounds?

We can use `apply` that we learnt earlier

Let's first create a function to calculate the mean

In [59]:
def temp_mean(x):
  x['Temperature_avg'] = x['Temperature'].mean() # We will name the new col Temperature_avg
  return x

Now we can form a new column based on the average values of temperature for each drug

In [60]:
data_tidy = data_tidy.groupby("Drug_Name", group_keys=False).apply(temp_mean)  # group_keys=False keeps the original row index
data_tidy



None,index,Date,time,Drug_Name,Pressure,Temperature,Temperature_avg
0,0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0,24.848485
1,1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0,30.387097
2,2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0,17.709677
3,3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0,24.848485
4,4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0,30.387097
...,...,...,...,...,...,...,...
103,103,17-10-2020,8:30:00,docetaxel injection,26.0,19.0,30.387097
104,104,17-10-2020,8:30:00,ketamine hydrochloride,11.0,20.0,17.709677
105,105,17-10-2020,9:30:00,diltiazem hydrochloride,9.0,13.0,24.848485
106,106,17-10-2020,9:30:00,docetaxel injection,27.0,20.0,30.387097


Now we fill the null values in Temperature using this new column!

In [61]:
data_tidy['Temperature'] = data_tidy['Temperature'].fillna(data_tidy["Temperature_avg"])  # assign back instead of inplace=True on a selected column
data_tidy

None,index,Date,time,Drug_Name,Pressure,Temperature,Temperature_avg
0,0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0,24.848485
1,1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0,30.387097
2,2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0,17.709677
3,3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0,24.848485
4,4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0,30.387097
...,...,...,...,...,...,...,...
103,103,17-10-2020,8:30:00,docetaxel injection,26.0,19.0,30.387097
104,104,17-10-2020,8:30:00,ketamine hydrochloride,11.0,20.0,17.709677
105,105,17-10-2020,9:30:00,diltiazem hydrochloride,9.0,13.0,24.848485
106,106,17-10-2020,9:30:00,docetaxel injection,27.0,20.0,30.387097


In [62]:
data_tidy.isna().sum()

None
index               0
Date                0
time                0
Drug_Name           0
Pressure           13
Temperature         0
Temperature_avg     0
dtype: int64

Great!!

We have removed the null values of our Temperature column

Let's do the same for Pressure

In [63]:
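def pr_mean(x):
  x['Pressure_avg'] = x['Pressure'].mean() # mean pressure of the drug in this group
  return x

# group_keys=False keeps the original row index instead of adding Drug_Name to it
data_tidy = data_tidy.groupby("Drug_Name", group_keys=False).apply(pr_mean)
# Assign the result back; fillna(inplace=True) on a selected column may not update data_tidy
data_tidy['Pressure'] = data_tidy['Pressure'].fillna(data_tidy["Pressure_avg"])
data_tidy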



None,index,Date,time,Drug_Name,Pressure,Temperature,Temperature_avg,Pressure_avg
0,0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0,24.848485,15.424242
1,1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0,30.387097,25.483871
2,2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0,17.709677,11.935484
3,3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0,24.848485,15.424242
4,4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0,30.387097,25.483871
...,...,...,...,...,...,...,...,...
103,103,17-10-2020,8:30:00,docetaxel injection,26.0,19.0,30.387097,25.483871
104,104,17-10-2020,8:30:00,ketamine hydrochloride,11.0,20.0,17.709677,11.935484
105,105,17-10-2020,9:30:00,diltiazem hydrochloride,9.0,13.0,24.848485,15.424242
106,106,17-10-2020,9:30:00,docetaxel injection,27.0,20.0,30.387097,25.483871


In [64]:
data_tidy.isna().sum()

None
index              0
Date               0
time               0
Drug_Name          0
Pressure           0
Temperature        0
Temperature_avg    0
Pressure_avg       0
dtype: int64
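
The same group-wise fill can also be sketched without a helper column, using `groupby(...).transform("mean")`. This is an alternative to the `apply` approach used above, shown on made-up data (hypothetical drug names and values):

```python
import numpy as np
import pandas as pd

# Hypothetical tidy frame with one missing temperature per drug
tidy = pd.DataFrame({
    "Drug_Name": ["drug_a", "drug_a", "drug_a", "drug_b", "drug_b", "drug_b"],
    "Temperature": [20.0, np.nan, 24.0, 50.0, 54.0, np.nan],
})

# transform returns one value per original row: the mean of that row's drug
group_mean = tidy.groupby("Drug_Name")["Temperature"].transform("mean")

# Fill each gap with the mean of its own drug
tidy["Temperature"] = tidy["Temperature"].fillna(group_mean)
print(tidy)
```

Because `transform` keeps the original row order and index, the result lines up with the column being filled.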

This gives us a **basic idea** about working with missing values

We will learn more about this in later lectures on **feature engineering**