### Outline
  - **Restructuring data**
    - pd.melt()
    - pd.pivot()
    - pd.pivot_table()
    - pd.cut()
  - Dealing with missing values
    - None and NaN values
    - isna() and isnull()
  - String methods in pandas
  - Handling datetime
  - **Writing to a file**



### Today's Agenda

- Today, we'll continue looking at the **Pandas** library

- We'll cover **Restructuring Data using Pandas**

- We'll also look at **handling missing values** and **understanding DateTime datatype**



### Importing our data

- For this topic we will be using **data on a few drugs** being developed by **Pfizer**


Link: https://drive.google.com/file/d/173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ/view?usp=sharing

In [5]:
!gdown 173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ

'gdown' is not recognized as an internal or external command,
operable program or batch file.


#### What is the data about?

- Temperature (K)
- Pressure (P)

are recorded at **1-hour intervals** every day to monitor drug stability during a drug development test

==> These data points are thus used to **identify the optimal set of values of parameters** for the stability of the drugs

#### Now, let's explore this dataset


In [6]:
import pandas as pd
import numpy as np

In [7]:
data = pd.read_csv('Pfizer.csv')

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       18 non-null     object 
 1   Drug_Name  18 non-null     object 
 2   Parameter  18 non-null     object 
 3   1:30:00    16 non-null     float64
 4   2:30:00    16 non-null     float64
 5   3:30:00    12 non-null     float64
 6   4:30:00    14 non-null     float64
 7   5:30:00    16 non-null     float64
 8   6:30:00    18 non-null     int64  
 9   7:30:00    16 non-null     float64
 10  8:30:00    14 non-null     float64
 11  9:30:00    16 non-null     float64
 12  10:30:00   18 non-null     int64  
 13  11:30:00   16 non-null     float64
 14  12:30:00   18 non-null     int64  
dtypes: float64(9), int64(3), object(3)
memory usage: 2.2+ KB


In [9]:
data.head()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
0,15-10-2020,diltiazem hydrochloride,Temperature,23.0,22.0,,21.0,21.0,22,23.0,21.0,22.0,20,20.0,21
1,15-10-2020,diltiazem hydrochloride,Pressure,12.0,13.0,,11.0,13.0,14,16.0,16.0,24.0,18,19.0,20
2,15-10-2020,docetaxel injection,Temperature,,17.0,18.0,,17.0,18,,,23.0,23,25.0,25
3,15-10-2020,docetaxel injection,Pressure,,22.0,22.0,,22.0,23,,,27.0,26,29.0,28
4,15-10-2020,ketamine hydrochloride,Temperature,24.0,,,27.0,,26,25.0,24.0,23.0,22,21.0,20


In [10]:
data.tail()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
13,17-10-2020,diltiazem hydrochloride,Pressure,3.0,4.0,4.0,4.0,6.0,8,9.0,,9.0,11,13.0,14
14,17-10-2020,docetaxel injection,Temperature,12.0,13.0,14.0,15.0,16.0,17,18.0,19.0,20.0,21,22.0,23
15,17-10-2020,docetaxel injection,Pressure,20.0,22.0,22.0,22.0,22.0,23,25.0,26.0,27.0,28,29.0,28
16,17-10-2020,ketamine hydrochloride,Temperature,13.0,14.0,15.0,16.0,17.0,18,19.0,20.0,21.0,22,23.0,24
17,17-10-2020,ketamine hydrochloride,Pressure,8.0,9.0,10.0,11.0,11.0,12,12.0,11.0,12.0,13,14.0,15


<!-- #### Let's check the shape of this dataset -->

### Melting in Pandas

As we saw earlier, the dataset has 18 rows and 15 columns

If you notice further, you'll see:

- The **columns are `1:30:00`, `2:30:00`, `3:30:00`, ... so on**

- `Temperature` and `Pressure` **of each date** is **in a separate row**

#### Can we restructure our data into a better format?

<!-- Maybe do something more intuitive -->

Maybe we can have a column for `time`, with `timestamps` as the column value

**Where will the Temperature/Pressure values go?**

We can similarly create one column containing the values of these parameters




==> **"Melt" timestamp columns into two columns** - timestamp and corresponding values
<!-- Something like our DataFrame will have columns [`ID`, `Date`, `Parameter`, `time`, `result`] -->




#### How can we restructure our data into having every row corresponding to a single reading?


In [11]:
pd.melt(data, id_vars=['Date', 'Parameter', 'Drug_Name'])                

Unnamed: 0,Date,Parameter,Drug_Name,variable,value
0,15-10-2020,Temperature,diltiazem hydrochloride,1:30:00,23.0
1,15-10-2020,Pressure,diltiazem hydrochloride,1:30:00,12.0
2,15-10-2020,Temperature,docetaxel injection,1:30:00,
3,15-10-2020,Pressure,docetaxel injection,1:30:00,
4,15-10-2020,Temperature,ketamine hydrochloride,1:30:00,24.0
...,...,...,...,...,...
211,17-10-2020,Pressure,diltiazem hydrochloride,12:30:00,14.0
212,17-10-2020,Temperature,docetaxel injection,12:30:00,23.0
213,17-10-2020,Pressure,docetaxel injection,12:30:00,28.0
214,17-10-2020,Temperature,ketamine hydrochloride,12:30:00,24.0


This converts our data from `wide` to `long` format


Notice that `id_vars` is the set of columns that remain unmelted

#### How does `pd.melt()` work?


- Pass in the **DataFrame**
- Pass in the **column names to not melt** <!--**that we DON'T want to change** -->
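To see the mechanics on something small, here is a minimal sketch with a made-up two-row frame (the `wide` and `long` names and the values are illustrative, not taken from the Pfizer file). Every column not listed in `id_vars` is melted, so the result has (rows × melted columns) rows; for our dataset that is 18 × 12 = 216.

```python
import pandas as pd

# Hypothetical toy data (not the Pfizer file): one row per drug, one column per time.
wide = pd.DataFrame({
    "Drug_Name": ["A", "B"],
    "1:30:00": [23.0, 17.0],
    "2:30:00": [22.0, 18.0],
})

# Drug_Name is kept as-is; every other column is melted into
# a (variable, value) pair, one row per original cell.
long = pd.melt(wide, id_vars=["Drug_Name"])
print(long)
```

The default names of the two new columns are `variable` and `value`.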




But we can provide better names to these new columns

#### How can we rename the columns "variable" and "value" as per our original dataframe?

In [12]:
data_melt = pd.melt(data,id_vars = ['Date', 'Drug_Name', 'Parameter'], 
            var_name = "time",      
            value_name = 'reading')  

data_melt

Unnamed: 0,Date,Drug_Name,Parameter,time,reading
0,15-10-2020,diltiazem hydrochloride,Temperature,1:30:00,23.0
1,15-10-2020,diltiazem hydrochloride,Pressure,1:30:00,12.0
2,15-10-2020,docetaxel injection,Temperature,1:30:00,
3,15-10-2020,docetaxel injection,Pressure,1:30:00,
4,15-10-2020,ketamine hydrochloride,Temperature,1:30:00,24.0
...,...,...,...,...,...
211,17-10-2020,diltiazem hydrochloride,Pressure,12:30:00,14.0
212,17-10-2020,docetaxel injection,Temperature,12:30:00,23.0
213,17-10-2020,docetaxel injection,Pressure,12:30:00,28.0
214,17-10-2020,ketamine hydrochloride,Temperature,12:30:00,24.0


**Conclusion**

<!-- - Columns from `1:30:00` to `12:30:00` are conviniently **melted to a single column `time`** -->
- The labels of the timestamp columns are conveniently **melted into a single column** - `time`
- All the values are retained in the column `reading`

--------------

- The labels of columns such as `1:30:00`, `2:30:00` have now become the values of the `variable` column (which we renamed to `time`)
- The **values from the columns we are melting** are stored in the `value` column (which we renamed to `reading`)


### Pivot

Now suppose we want to convert our data back to **wide format**

The reason could be to maintain the structure for storing or some other purpose.

Notice:

- The variables `Date`, `Drug_Name` and `Parameter` will remain same

- The column names will be extracted from the column `time`

- The values will be extracted from the column `reading`

#### How can we restructure our data back to the original wide format, before it was melted?




In [13]:
data_melt.pivot(index=['Date','Drug_Name','Parameter'],  # Column to use to make new frame’s index
                columns = 'time',                  # Column to use to make new frame’s columns
                values='reading')                   # Columns to use for populating new frame’s values.

Unnamed: 0_level_0,Unnamed: 1_level_0,time,10:30:00,11:30:00,12:30:00,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00
Date,Drug_Name,Parameter,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
15-10-2020,diltiazem hydrochloride,Pressure,18.0,19.0,20.0,12.0,13.0,,11.0,13.0,14.0,16.0,16.0,24.0
15-10-2020,diltiazem hydrochloride,Temperature,20.0,20.0,21.0,23.0,22.0,,21.0,21.0,22.0,23.0,21.0,22.0
15-10-2020,docetaxel injection,Pressure,26.0,29.0,28.0,,22.0,22.0,,22.0,23.0,,,27.0
15-10-2020,docetaxel injection,Temperature,23.0,25.0,25.0,,17.0,18.0,,17.0,18.0,,,23.0
15-10-2020,ketamine hydrochloride,Pressure,9.0,9.0,11.0,8.0,,,7.0,,9.0,10.0,11.0,10.0
15-10-2020,ketamine hydrochloride,Temperature,22.0,21.0,20.0,24.0,,,27.0,,26.0,25.0,24.0,23.0
16-10-2020,diltiazem hydrochloride,Pressure,24.0,,27.0,18.0,19.0,20.0,21.0,22.0,23.0,24.0,25.0,25.0
16-10-2020,diltiazem hydrochloride,Temperature,40.0,,42.0,34.0,35.0,36.0,36.0,37.0,38.0,37.0,38.0,39.0
16-10-2020,docetaxel injection,Pressure,28.0,29.0,30.0,23.0,24.0,,25.0,26.0,27.0,28.0,29.0,28.0
16-10-2020,docetaxel injection,Temperature,56.0,57.0,58.0,46.0,47.0,,48.0,48.0,49.0,50.0,52.0,55.0


Notice,

We are getting a **multi-level index** here (`Date`, `Drug_Name`, `Parameter`)

#### How can we reset this to a single-index dataframe?

In [14]:
data_melt.pivot(index=['Date','Drug_Name','Parameter'], 
                columns = 'time',                  
                values='reading').reset_index()

time,Date,Drug_Name,Parameter,10:30:00,11:30:00,12:30:00,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00
0,15-10-2020,diltiazem hydrochloride,Pressure,18.0,19.0,20.0,12.0,13.0,,11.0,13.0,14.0,16.0,16.0,24.0
1,15-10-2020,diltiazem hydrochloride,Temperature,20.0,20.0,21.0,23.0,22.0,,21.0,21.0,22.0,23.0,21.0,22.0
2,15-10-2020,docetaxel injection,Pressure,26.0,29.0,28.0,,22.0,22.0,,22.0,23.0,,,27.0
3,15-10-2020,docetaxel injection,Temperature,23.0,25.0,25.0,,17.0,18.0,,17.0,18.0,,,23.0
4,15-10-2020,ketamine hydrochloride,Pressure,9.0,9.0,11.0,8.0,,,7.0,,9.0,10.0,11.0,10.0
5,15-10-2020,ketamine hydrochloride,Temperature,22.0,21.0,20.0,24.0,,,27.0,,26.0,25.0,24.0,23.0
6,16-10-2020,diltiazem hydrochloride,Pressure,24.0,,27.0,18.0,19.0,20.0,21.0,22.0,23.0,24.0,25.0,25.0
7,16-10-2020,diltiazem hydrochloride,Temperature,40.0,,42.0,34.0,35.0,36.0,36.0,37.0,38.0,37.0,38.0,39.0
8,16-10-2020,docetaxel injection,Pressure,28.0,29.0,30.0,23.0,24.0,,25.0,26.0,27.0,28.0,29.0,28.0
9,16-10-2020,docetaxel injection,Temperature,56.0,57.0,58.0,46.0,47.0,,48.0,48.0,49.0,50.0,52.0,55.0



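==> `pivot()` is essentially the inverse of `melt()` (the time columns come back sorted as strings, which is why `10:30:00` appears before `1:30:00`)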

#### How does `pivot()` work?

- The values of column `time` become the new column headers, with one row for each combination of `Date`, `Drug_Name` and `Parameter`
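The three arguments can be seen at work in a minimal sketch on made-up long-format data (the frame and its values are illustrative assumptions, not the Pfizer file):

```python
import pandas as pd

# Hypothetical long-format toy data (not the Pfizer file).
long = pd.DataFrame({
    "Drug_Name": ["A", "A", "B", "B"],
    "time": ["1:30:00", "2:30:00", "1:30:00", "2:30:00"],
    "reading": [23.0, 22.0, 17.0, 18.0],
})

# index   -> row labels of the new frame
# columns -> its column headers
# values  -> the cell contents
wide = long.pivot(index="Drug_Name", columns="time", values="reading")

# reset_index() turns the index back into an ordinary column.
wide = wide.reset_index()
print(wide)
```

Each (`Drug_Name`, `time`) pair occurs exactly once here, which is what lets `pivot()` place every reading in its own cell.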





In [15]:
data_melt.head()

Unnamed: 0,Date,Drug_Name,Parameter,time,reading
0,15-10-2020,diltiazem hydrochloride,Temperature,1:30:00,23.0
1,15-10-2020,diltiazem hydrochloride,Pressure,1:30:00,12.0
2,15-10-2020,docetaxel injection,Temperature,1:30:00,
3,15-10-2020,docetaxel injection,Pressure,1:30:00,
4,15-10-2020,ketamine hydrochloride,Temperature,1:30:00,24.0


Now if you notice,

We are **using 2 rows** to log readings for a single experiment. 


#### Can we further restructure our data by splitting the Parameter column into separate Temperature and Pressure columns?

A format like:

`Date | time | Drug_Name | Pressure | Temperature`

 would be really suitable

- We want to **split one single column into multiple columns**

#### How can we divide the Parameter column again?

In [16]:
data_tidy = data_melt.pivot(index=['Date','time', 'Drug_Name'], 
                                        columns = 'Parameter',  
                                        values='reading') 

data_tidy

Unnamed: 0_level_0,Unnamed: 1_level_0,Parameter,Pressure,Temperature
Date,time,Drug_Name,Unnamed: 3_level_1,Unnamed: 4_level_1
15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0
15-10-2020,10:30:00,docetaxel injection,26.0,23.0
15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0
15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0
15-10-2020,11:30:00,docetaxel injection,29.0,25.0
...,...,...,...,...
17-10-2020,8:30:00,docetaxel injection,26.0,19.0
17-10-2020,8:30:00,ketamine hydrochloride,11.0,20.0
17-10-2020,9:30:00,diltiazem hydrochloride,9.0,13.0
17-10-2020,9:30:00,docetaxel injection,27.0,20.0


<!-- Notice that a **multi-index** df has been created

We change this using **`reset_index()`** -->

We can use `reset_index()` to remove the multi-index

In [17]:
data_tidy = data_tidy.reset_index()
data_tidy

Parameter,Date,time,Drug_Name,Pressure,Temperature
0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0
1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0
2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0
3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0
4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0
...,...,...,...,...,...
103,17-10-2020,8:30:00,docetaxel injection,26.0,19.0
104,17-10-2020,8:30:00,ketamine hydrochloride,11.0,20.0
105,17-10-2020,9:30:00,diltiazem hydrochloride,9.0,13.0
106,17-10-2020,9:30:00,docetaxel injection,27.0,20.0


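The columns axis still carries the name `Parameter`. We can rename it; here we set it to the string `'None'` (assigning the Python value `None` instead would remove the name altogether)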

In [18]:
data_tidy.columns.name = 'None'

In [19]:
data_tidy.head()

None,Date,time,Drug_Name,Pressure,Temperature
0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0
1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0
2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0
3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0
4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0


### Pivot_table



Now suppose we want to find some insights, like **mean temperature day wise**

#### Can we use pivot to find the day-wise mean value of temperature for each drug?

In [20]:
data_tidy.pivot(index=['Drug_Name'],  
                columns = 'Date',
                values=['Temperature']) 

ValueError: Index contains duplicate entries, cannot reshape

#### Why did we get an error?

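- We need the **average** of the temperature values recorded throughout a day

- If you notice, the error mentions **duplicate entries**

`pivot()` only reshapes; it cannot aggregate. Each combination of index and column values must therefore identify exactly one row, but here every drug has several readings on each date.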
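The failure is easy to reproduce on a small made-up frame (illustrative values, not the Pfizer data). Counting rows per (`Drug_Name`, `Date`) pair shows where two readings compete for one cell:

```python
import pandas as pd

# Hypothetical toy data: drug "A" has two readings on the same date.
df = pd.DataFrame({
    "Drug_Name": ["A", "A", "B"],
    "Date": ["15-10-2020", "15-10-2020", "15-10-2020"],
    "Temperature": [20.0, 22.0, 30.0],
})

# How many rows compete for each (Drug_Name, Date) cell?
print(df.groupby(["Drug_Name", "Date"]).size())

try:
    df.pivot(index="Drug_Name", columns="Date", values="Temperature")
except ValueError as err:
    # pivot() has no rule for combining 20.0 and 22.0 into a single cell.
    print("pivot failed:", err)
```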

#### What can we do to get our required mean values then?


In [21]:
pd.pivot_table(data_tidy, index='Drug_Name', columns='Date', values=['Temperature'], aggfunc='mean')  # 'mean' avoids the warning newer pandas raises for np.mean

None,Temperature,Temperature,Temperature
Date,15-10-2020,16-10-2020,17-10-2020
Drug_Name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
diltiazem hydrochloride,21.454545,37.454545,15.636364
docetaxel injection,20.75,51.454545,17.5
ketamine hydrochloride,23.555556,11.5,18.5


This function is similar to pivot, with an extra feature of an aggregator

#### How does `pivot_table` work?
- The initial parameters are same as how we do in `pivot()`
- As an extra parameter, we pass the **type of aggregator**

Note:

- We could have done this using `groupby` too

- In fact, `pivot_table` uses `groupby` in the backend to group the data and perform the aggregation

- The only difference is in the type of output we get using both functions
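To make the comparison concrete, here is a minimal sketch on made-up data (illustrative values only) that computes the same means both ways. `groupby` returns a Series with a two-level index; calling `unstack()` on it spreads the inner level into columns, giving the same wide layout as `pivot_table`.

```python
import pandas as pd

# Hypothetical toy data (not the Pfizer file).
df = pd.DataFrame({
    "Drug_Name": ["A", "A", "B", "B"],
    "Date": ["15-10-2020", "15-10-2020", "15-10-2020", "16-10-2020"],
    "Temperature": [20.0, 22.0, 30.0, 40.0],
})

# Wide table: one row per drug, one column per date.
table = pd.pivot_table(df, index="Drug_Name", columns="Date",
                       values="Temperature", aggfunc="mean")

# Same numbers via groupby: a Series indexed by (Drug_Name, Date) ...
grouped = df.groupby(["Drug_Name", "Date"])["Temperature"].mean()

# ... which unstack() reshapes into the same wide layout.
print(table)
print(grouped.unstack())
```

Drug "A" has no reading on 16-10-2020 in this toy frame, so that cell is `NaN` in both outputs.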



#### Similarly, what if we want to find the minimum values of temperature and pressure on a particular date?

In [22]:
pd.pivot_table(data_tidy, index='Drug_Name', columns='Date', values=['Temperature', 'Pressure'], aggfunc='min')

None,Pressure,Pressure,Pressure,Temperature,Temperature,Temperature
Date,15-10-2020,16-10-2020,17-10-2020,15-10-2020,16-10-2020,17-10-2020
Drug_Name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
diltiazem hydrochloride,11.0,18.0,3.0,20.0,34.0,10.0
docetaxel injection,22.0,23.0,20.0,17.0,46.0,12.0
ketamine hydrochloride,7.0,12.0,8.0,20.0,8.0,13.0


### Handling Missing Values



If you notice, there are many "NaN" values in our data

In [23]:
data_tidy.head()

None,Date,time,Drug_Name,Pressure,Temperature
0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0
1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0
2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0
3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0
4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0


#### What are these "NaN" values?
They are basically **missing values**

#### What are missing values?
A Missing Value signifies an **empty cell/no data**

There can be 2 kinds of missing values:

  1. `None`
  2. `NaN` (short for Not a Number)

#### What's the difference between "None" and "NaN"?

The difference mainly lies in their datatype

In [24]:
type(None)

NoneType

In [25]:
type(np.nan)

float

**None type** is for missing values in a column with **non-number entries** 
- E.g.-strings

**NaN** occurs for columns with **number entries**

Note:

Pandas uses these values nearly **interchangeably**, converting between them where appropriate, based on column datatype

In [26]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

For **numerical** types, Pandas changes **None to NaN** type


In [27]:
pd.Series(["1", "np.nan", "2", None])

0         1
1    np.nan
2         2
3      None
dtype: object

In [28]:
pd.Series(["1", "np.nan", "2", np.nan])

0         1
1    np.nan
2         2
3       NaN
dtype: object

For **object** type, the **None is preserved** and not changed to NaN
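One practical consequence of the type difference, shown as a small sketch: `NaN` is a float that never compares equal to anything, including itself, so `==` cannot be used to find missing values. `isna()` treats both `None` and `NaN` as missing.

```python
import numpy as np
import pandas as pd

print(None == None)      # True: None is a single shared object
print(np.nan == np.nan)  # False: NaN is never equal to anything, even itself

# An object column holding both kinds of missing value.
s = pd.Series(["1", None, np.nan], dtype=object)

# isna() flags both of them as missing.
missing = s.isna()
print(missing.tolist())
```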

Now we have the basic idea about missing values

#### How to know the count of missing values for each row/column?


In [29]:
data.isna().head()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
0,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
2,False,False,False,True,False,False,True,False,False,True,True,False,False,False,False
3,False,False,False,True,False,False,True,False,False,True,True,False,False,False,False
4,False,False,False,False,True,True,False,True,False,False,False,False,False,False,False


We can also use `isnull()` to get the same results

In [30]:
data.isnull().head()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
0,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
2,False,False,False,True,False,False,True,False,False,True,True,False,False,False,False
3,False,False,False,True,False,False,True,False,False,True,True,False,False,False,False
4,False,False,False,False,True,True,False,True,False,False,False,False,False,False,False


#### But, why do we have two methods, "isna" and "isnull" for the same operation?

isnull() is just an alias for isna()

In [31]:
pd.isnull

<function pandas.core.dtypes.missing.isna(obj)>

In [32]:
pd.isna

<function pandas.core.dtypes.missing.isna(obj)>

As we can see, both names point to the same underlying function

`isna()` returns a **boolean dataframe**, with each cell as a boolean value

This value corresponds to **whether the cell has a missing value**

On top of this, we can use `.sum()` to find the count

In [33]:
data.isna().sum()

Date         0
Drug_Name    0
Parameter    0
1:30:00      2
2:30:00      2
3:30:00      6
4:30:00      4
5:30:00      2
6:30:00      0
7:30:00      2
8:30:00      4
9:30:00      2
10:30:00     0
11:30:00     2
12:30:00     0
dtype: int64

This gives us the total number of missing values in each column

#### Can we also get the number of missing values in each row?

In [34]:
data.isna().sum(axis=1)

0     1
1     1
2     4
3     4
4     3
5     3
6     1
7     1
8     1
9     1
10    2
11    2
12    1
13    1
14    0
15    0
16    0
17    0
dtype: int64

Note:

By default, `sum()` uses `axis=0`, which adds up the values down each column; `axis=1` adds across each row
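A small sketch on a made-up frame (values are illustrative) shows the two axes side by side, plus the grand total obtained by summing twice:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with three missing cells.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})
mask = df.isna()

per_column = mask.sum()      # axis=0 (default): one count per column
per_row = mask.sum(axis=1)   # axis=1: one count per row
total = mask.sum().sum()     # all missing cells in the frame

print(per_column)
print(per_row)
print(total)
```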


#### We have identified the null count, but how do we deal with them?

We have two options:
- delete the rows/columns containing the null values
- fill the missing values with some data/estimate

Let's first look at deleting the rows

#### How can we drop rows containing null values?

In [35]:
data.dropna()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
14,17-10-2020,docetaxel injection,Temperature,12.0,13.0,14.0,15.0,16.0,17,18.0,19.0,20.0,21,22.0,23
15,17-10-2020,docetaxel injection,Pressure,20.0,22.0,22.0,22.0,22.0,23,25.0,26.0,27.0,28,29.0,28
16,17-10-2020,ketamine hydrochloride,Temperature,13.0,14.0,15.0,16.0,17.0,18,19.0,20.0,21.0,22,23.0,24
17,17-10-2020,ketamine hydrochloride,Pressure,8.0,9.0,10.0,11.0,11.0,12,12.0,11.0,12.0,13,14.0,15


Rows with **even a single missing value** have been deleted

#### What if we want to delete the columns having missing value?


In [36]:
data.dropna(axis=1)

Unnamed: 0,Date,Drug_Name,Parameter,6:30:00,10:30:00,12:30:00
0,15-10-2020,diltiazem hydrochloride,Temperature,22,20,21
1,15-10-2020,diltiazem hydrochloride,Pressure,14,18,20
2,15-10-2020,docetaxel injection,Temperature,18,23,25
3,15-10-2020,docetaxel injection,Pressure,23,26,28
4,15-10-2020,ketamine hydrochloride,Temperature,26,22,20
5,15-10-2020,ketamine hydrochloride,Pressure,9,9,11
6,16-10-2020,diltiazem hydrochloride,Temperature,38,40,42
7,16-10-2020,diltiazem hydrochloride,Pressure,23,24,27
8,16-10-2020,docetaxel injection,Temperature,49,56,58
9,16-10-2020,docetaxel injection,Pressure,27,28,30


=> Every column which had even a single missing value has been deleted

#### But what are the problems with deleting rows/columns?

One of the major problems: 
- loss of data

Instead of dropping, it would be better to **fill the missing values with some data**

#### How can we fill the missing values with some data?

In [37]:
data.fillna(0).head()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
0,15-10-2020,diltiazem hydrochloride,Temperature,23.0,22.0,0.0,21.0,21.0,22,23.0,21.0,22.0,20,20.0,21
1,15-10-2020,diltiazem hydrochloride,Pressure,12.0,13.0,0.0,11.0,13.0,14,16.0,16.0,24.0,18,19.0,20
2,15-10-2020,docetaxel injection,Temperature,0.0,17.0,18.0,0.0,17.0,18,0.0,0.0,23.0,23,25.0,25
3,15-10-2020,docetaxel injection,Pressure,0.0,22.0,22.0,0.0,22.0,23,0.0,0.0,27.0,26,29.0,28
4,15-10-2020,ketamine hydrochloride,Temperature,24.0,0.0,0.0,27.0,0.0,26,25.0,24.0,23.0,22,21.0,20


**What is fillna(0) doing?**

It fills all missing values with 0

We can do the same on a particular column too

In [38]:
data['2:30:00'].fillna(0)

0     22.0
1     13.0
2     17.0
3     22.0
4      0.0
5      0.0
6     35.0
7     19.0
8     47.0
9     24.0
10     9.0
11    12.0
12    19.0
13     4.0
14    13.0
15    22.0
16    14.0
17     9.0
Name: 2:30:00, dtype: float64


#### What other values can we use to fill the missing values ?

We can use some **kind of estimator** too
- An estimator like **mean or median**

#### How would you calculate the mean of the column `2:30:00`?


In [39]:
data['2:30:00'].mean()

18.8125

Now let's fill the NaN values with the mean value of the column

In [40]:
data['2:30:00'].fillna(data['2:30:00'].mean())

0     22.0000
1     13.0000
2     17.0000
3     22.0000
4     18.8125
5     18.8125
6     35.0000
7     19.0000
8     47.0000
9     24.0000
10     9.0000
11    12.0000
12    19.0000
13     4.0000
14    13.0000
15    22.0000
16    14.0000
17     9.0000
Name: 2:30:00, dtype: float64

But this doesn't feel right. What could be wrong with this?

#### Can we use the mean across all compounds as our estimate?

- **Different drugs** have **different characteristics** 
- We can't simply do an average and fill the null values

**Then what could be a solution here?**

We could fill the null values of **respective compounds with their respective means**







<!-- #### How can we find mean temperature drug-wise? -->

In [41]:
# data_tidy.groupby("Drug_Name")["Temperature"].mean()

#### How can we form a column with mean temperature of respective compounds?

We can use `apply` that we learnt earlier

Let's first create a function to calculate the mean

In [42]:
def temp_mean(x):
  x['Temperature_avg'] = x['Temperature'].mean() # We will name the new col Temperature_avg
  return x

Now we can form a new column based on the average values of temperature for each drug

In [43]:
data_tidy=data_tidy.groupby(["Drug_Name"]).apply(temp_mean)
data_tidy

None,Date,time,Drug_Name,Pressure,Temperature,Temperature_avg
0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0,24.848485
1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0,30.387097
2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0,17.709677
3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0,24.848485
4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0,30.387097
...,...,...,...,...,...,...
103,17-10-2020,8:30:00,docetaxel injection,26.0,19.0,30.387097
104,17-10-2020,8:30:00,ketamine hydrochloride,11.0,20.0,17.709677
105,17-10-2020,9:30:00,diltiazem hydrochloride,9.0,13.0,24.848485
106,17-10-2020,9:30:00,docetaxel injection,27.0,20.0,30.387097


Now we fill the null values in Temperature using this new column!

In [44]:
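# Assign the result back; inplace=True on a selected column is unreliable in newer pandas
data_tidy['Temperature'] = data_tidy['Temperature'].fillna(data_tidy["Temperature_avg"])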
data_tidy

None,Date,time,Drug_Name,Pressure,Temperature,Temperature_avg
0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0,24.848485
1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0,30.387097
2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0,17.709677
3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0,24.848485
4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0,30.387097
...,...,...,...,...,...,...
103,17-10-2020,8:30:00,docetaxel injection,26.0,19.0,30.387097
104,17-10-2020,8:30:00,ketamine hydrochloride,11.0,20.0,17.709677
105,17-10-2020,9:30:00,diltiazem hydrochloride,9.0,13.0,24.848485
106,17-10-2020,9:30:00,docetaxel injection,27.0,20.0,30.387097


In [45]:
data_tidy.isna().sum()

None
Date                0
time                0
Drug_Name           0
Pressure           13
Temperature         0
Temperature_avg     0
dtype: int64

Great!! 

We have removed the null values of our Temperature column

Let's do the same for Pressure

In [46]:
def pr_mean(x):
  x['Pressure_avg'] = x['Pressure'].mean()
  return x
data_tidy=data_tidy.groupby(["Drug_Name"]).apply(pr_mean)
data_tidy['Pressure'] = data_tidy['Pressure'].fillna(data_tidy["Pressure_avg"])
data_tidy

None,Date,time,Drug_Name,Pressure,Temperature,Temperature_avg,Pressure_avg
0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0,24.848485,15.424242
1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0,30.387097,25.483871
2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0,17.709677,11.935484
3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0,24.848485,15.424242
4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0,30.387097,25.483871
...,...,...,...,...,...,...,...
103,17-10-2020,8:30:00,docetaxel injection,26.0,19.0,30.387097,25.483871
104,17-10-2020,8:30:00,ketamine hydrochloride,11.0,20.0,17.709677,11.935484
105,17-10-2020,9:30:00,diltiazem hydrochloride,9.0,13.0,24.848485,15.424242
106,17-10-2020,9:30:00,docetaxel injection,27.0,20.0,30.387097,25.483871


In [47]:
data_tidy.isna().sum()

None
Date               0
time               0
Drug_Name          0
Pressure           0
Temperature        0
Temperature_avg    0
Pressure_avg       0
dtype: int64

This gives us a **basic idea** about working with missing values

We will further learn more on this during later lectures of **feature engineering**
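The per-drug fill can also be written without a helper function or extra average columns. The sketch below is an alternative to the `apply` approach used in this lecture, shown on a small made-up frame (not the Pfizer data). It relies on `groupby(...).transform("mean")`, which returns one value per original row, so the result lines up with the column being filled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two drugs, each with one missing reading.
df = pd.DataFrame({
    "Drug_Name": ["A", "A", "A", "B", "B"],
    "Temperature": [20.0, np.nan, 24.0, 50.0, np.nan],
})

# For every row, the mean temperature of that row's drug (NaN is skipped).
group_mean = df.groupby("Drug_Name")["Temperature"].transform("mean")

# Fill each gap with its own drug's mean and assign the result back.
df["Temperature"] = df["Temperature"].fillna(group_mean)
print(df)
```

Drug "A" is filled with 22.0 and drug "B" with 50.0, so the estimate for one drug is not influenced by the readings of another.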

### Pandas Cut




Sometimes, we want our data in **categorical form instead of continuous values**.

#### What do we mean by converting continuous data into categorical data?

Let's say that, instead of knowing the specific reading from a test, I want to know which category it falls into

#### What could be the types?

It depends on the level of granularity we want, e.g. Low, Medium, High, Very High

We could have defined more (or fewer) categories

#### But how can bucketisation of continuous data help?

- We can get the count of readings in each category
- We can get an idea of which category (range of values) most of the temperature values lie in

#### What function can we use to convert continuous data to categorical data?

 - We will use `pd.cut()`
 - We need to provide:
  - the continuous data
  - bin edges (array of numbers) to "cut" the entire range
  - labels corresponding to every bin


Let's try to use this on our `Temperature` column to categorise the data into bins

But, to define categories, let's first check the min and max temperature values

In [48]:
data_tidy

None,Date,time,Drug_Name,Pressure,Temperature,Temperature_avg,Pressure_avg
0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0,24.848485,15.424242
1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0,30.387097,25.483871
2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0,17.709677,11.935484
3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0,24.848485,15.424242
4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0,30.387097,25.483871
...,...,...,...,...,...,...,...
103,17-10-2020,8:30:00,docetaxel injection,26.0,19.0,30.387097,25.483871
104,17-10-2020,8:30:00,ketamine hydrochloride,11.0,20.0,17.709677,11.935484
105,17-10-2020,9:30:00,diltiazem hydrochloride,9.0,13.0,24.848485,15.424242
106,17-10-2020,9:30:00,docetaxel injection,27.0,20.0,30.387097,25.483871


In [49]:
print(data_tidy['Temperature'].min(), data_tidy['Temperature'].max())

8.0 58.0


Min value = 8, Max value = 58.

- Let's keep some buffer for future values and take the range 5-60 (instead of 8-58)
- Let's divide this range into 4 bins, each 10 to 15 units wide

In [50]:
temp_points = [5, 20, 35, 50, 60]
temp_labels = ['low','medium','high','very_high'] # Here labels define the severity of the resultant output of the test
data_tidy['temp_cat'] = pd.cut(data_tidy['Temperature'], bins=temp_points, labels=temp_labels)
data_tidy.head()

None,Date,time,Drug_Name,Pressure,Temperature,Temperature_avg,Pressure_avg,temp_cat
0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0,24.848485,15.424242,low
1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0,30.387097,25.483871,medium
2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0,17.709677,11.935484,medium
3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0,24.848485,15.424242,low
4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0,30.387097,25.483871,medium


In [51]:
data_tidy['temp_cat'].value_counts()

low          50
medium       38
high         15
very_high     5
Name: temp_cat, dtype: int64
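Notice in the output above that a temperature of exactly 20.0 was labelled `low`, not `medium`. By default, `pd.cut()` builds intervals that are open on the left and closed on the right, i.e. (5, 20], (20, 35] and so on. The sketch below uses a few made-up values with the same bin edges to show the behaviour at the edges; `include_lowest=True` is an optional argument of `pd.cut()` that is not used in this lecture.

```python
import pandas as pd

# Hypothetical values chosen to sit on or near the bin edges.
temps = pd.Series([5.0, 20.0, 20.5, 60.0])
bins = [5, 20, 35, 50, 60]
labels = ["low", "medium", "high", "very_high"]

# Default: right edge included, left edge excluded.
# 5.0 lies outside (5, 20] and therefore becomes NaN.
default_cut = pd.cut(temps, bins=bins, labels=labels)
print(default_cut.tolist())

# include_lowest=True also includes the left edge of the first bin.
closed_cut = pd.cut(temps, bins=bins, labels=labels, include_lowest=True)
print(closed_cut.tolist())
```

Any value outside the overall range of the bins is also returned as `NaN`, which is one reason to leave a buffer around the observed min and max.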

### String functions and motivation for datetime


 

#### What kind of questions can we use string methods for?

Find rows which contains a particular string

Say,

#### How can you filter rows containing "hydrochloride" in their drug name?


In [52]:
data_tidy.loc[data_tidy['Drug_Name'].str.contains('hydrochloride')].head()

None,Date,time,Drug_Name,Pressure,Temperature,Temperature_avg,Pressure_avg,temp_cat
0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0,24.848485,15.424242,low
2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0,17.709677,11.935484,medium
3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0,24.848485,15.424242,low
5,15-10-2020,11:30:00,ketamine hydrochloride,9.0,21.0,17.709677,11.935484,medium
6,15-10-2020,12:30:00,diltiazem hydrochloride,20.0,21.0,24.848485,15.424242,medium



So in general, we will be using the following format:

     > Series.str.function()

Series.str can be used to **access the values of the series as strings** and apply several methods to it.
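A few more `Series.str` methods, sketched on a small made-up Series (illustrative values; the `None` entry is there to show how missing names behave):

```python
import pandas as pd

# Hypothetical drug names, including one missing entry.
names = pd.Series(["diltiazem hydrochloride", "docetaxel injection", None])

# Each method is applied element by element.
print(names.str.upper().tolist())
print(names.str.len().tolist())

# na=False makes missing names count as "no match",
# so the result can be used directly as a row filter.
mask = names.str.contains("hydrochloride", na=False)
print(names[mask].tolist())

# .str[0] picks the first item of each list produced by split().
first_word = names.str.split(" ").str[0]
print(first_word.tolist())
```

The same `.str[i]` indexing can replace an `apply` with a lambda when picking one piece out of a split string.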


Now suppose we want to form a new column based on the year of the experiments

#### What can we do to form a column containing the year?


In [53]:
data_tidy['Date'].str.split('-')

0      [15, 10, 2020]
1      [15, 10, 2020]
2      [15, 10, 2020]
3      [15, 10, 2020]
4      [15, 10, 2020]
            ...      
103    [17, 10, 2020]
104    [17, 10, 2020]
105    [17, 10, 2020]
106    [17, 10, 2020]
107    [17, 10, 2020]
Name: Date, Length: 108, dtype: object

To extract the year we need to select the last element of each list

In [54]:
data_tidy['Date'].str.split('-').apply(lambda x:x[2])

0      2020
1      2020
2      2020
3      2020
4      2020
       ... 
103    2020
104    2020
105    2020
106    2020
107    2020
Name: Date, Length: 108, dtype: object

But there are certain problems with this approach:

- The **dtype of the output is still object** (strings), whereas we would prefer a numeric type
- The date format will **not always be day-month-year**; it can vary between datasets
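The first problem can be patched with string methods alone, as the sketch below suggests, but the second cannot: the code still assumes the year is the last piece after splitting on `-`. The sample dates are assumptions for illustration.

```python
import pandas as pd

# Hypothetical day-month-year strings, like the Date column
dates = pd.Series(['15-10-2020', '16-10-2020', '17-10-2020'])

# .str[-1] picks the last element of each split list; astype(int) makes it numeric
year = dates.str.split('-').str[-1].astype(int)

print(year.tolist())  # [2020, 2020, 2020]

# This silently gives wrong answers if a file uses year-month-day instead,
# which is why a real datetime type is preferable.
```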

Thus, to work with such date-time type of data, we can use a special method of pandas

### Datetime

Let's start with understanding a date-time type of data

#### How can we handle date-time data types?

  - We can do this using the `to_datetime()` function of pandas
  - It takes as input:
    - Array/Scalars with values having proper date/time format
    - `dayfirst`: Indicating if the day comes first in the date format used
    - `yearfirst`: Indicates if year comes first in the date format
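To see why `dayfirst` matters, consider an ambiguous date string. This is a minimal sketch with made-up values; the `format` argument shown at the end is another option `to_datetime()` accepts for spelling out the layout explicitly.

```python
import pandas as pd

ambiguous = '10-11-2020'

# Default: the first number is read as the month
month_first = pd.to_datetime(ambiguous)

# dayfirst=True: the first number is read as the day
day_first = pd.to_datetime(ambiguous, dayfirst=True)

print(month_first)  # 2020-10-11 00:00:00
print(day_first)    # 2020-11-10 00:00:00

# An explicit format string removes the guesswork for a whole column
raw = pd.Series(['15-10-2020 10:30:00', '16-10-2020 11:30:00'])
parsed = pd.to_datetime(raw, format='%d-%m-%Y %H:%M:%S')
print(parsed.dt.day.tolist())  # [15, 16]
```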

Let's first merge our ```Date``` and ```time``` columns into a new timestamp column




In [55]:
data_tidy['timestamp'] = data_tidy['Date']+ " "+ data_tidy['time']

In [56]:
data_tidy.drop(['Date', 'time'], axis=1, inplace=True)

In [57]:
data_tidy.head()

None,Drug_Name,Pressure,Temperature,Temperature_avg,Pressure_avg,temp_cat,timestamp
0,diltiazem hydrochloride,18.0,20.0,24.848485,15.424242,low,15-10-2020 10:30:00
1,docetaxel injection,26.0,23.0,30.387097,25.483871,medium,15-10-2020 10:30:00
2,ketamine hydrochloride,9.0,22.0,17.709677,11.935484,medium,15-10-2020 10:30:00
3,diltiazem hydrochloride,19.0,20.0,24.848485,15.424242,low,15-10-2020 11:30:00
4,docetaxel injection,29.0,25.0,30.387097,25.483871,medium,15-10-2020 11:30:00


Let's convert our `timestamp` column now

In [58]:
data_tidy['timestamp'] = pd.to_datetime(data_tidy['timestamp'], dayfirst=True)  # dates are day-month-year; passing an explicit `format=` string is left for you to explore

data_tidy

None,Drug_Name,Pressure,Temperature,Temperature_avg,Pressure_avg,temp_cat,timestamp
0,diltiazem hydrochloride,18.0,20.0,24.848485,15.424242,low,2020-10-15 10:30:00
1,docetaxel injection,26.0,23.0,30.387097,25.483871,medium,2020-10-15 10:30:00
2,ketamine hydrochloride,9.0,22.0,17.709677,11.935484,medium,2020-10-15 10:30:00
3,diltiazem hydrochloride,19.0,20.0,24.848485,15.424242,low,2020-10-15 11:30:00
4,docetaxel injection,29.0,25.0,30.387097,25.483871,medium,2020-10-15 11:30:00
...,...,...,...,...,...,...,...
103,docetaxel injection,26.0,19.0,30.387097,25.483871,low,2020-10-17 08:30:00
104,ketamine hydrochloride,11.0,20.0,17.709677,11.935484,low,2020-10-17 08:30:00
105,diltiazem hydrochloride,9.0,13.0,24.848485,15.424242,low,2020-10-17 09:30:00
106,docetaxel injection,27.0,20.0,30.387097,25.483871,low,2020-10-17 09:30:00


In [59]:
data_tidy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 108 entries, 0 to 107
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Drug_Name        108 non-null    object        
 1   Pressure         108 non-null    float64       
 2   Temperature      108 non-null    float64       
 3   Temperature_avg  108 non-null    float64       
 4   Pressure_avg     108 non-null    float64       
 5   temp_cat         108 non-null    category      
 6   timestamp        108 non-null    datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(4), object(1)
memory usage: 10.3+ KB


The **type of `timestamp` column** has been **changed to `datetime`** from `object`

Now, Let's look at a single timestamp using Pandas

#### How can we **extract information** from a single **timestamp** using Pandas?

In [60]:
ts = data_tidy['timestamp'][0]
ts

Timestamp('2020-10-15 10:30:00')

#### Now how can we extract the year from this date ?


In [61]:
ts.year

2020

Similarly we can also access the month and day using the `month` and `day` attributes

In [62]:
ts.month

10

In [63]:
ts.day

15

#### But what if we want to know the name of the month or the day of the week on that date?
  - We can find it using the `month_name()` and `day_name()` methods

In [64]:
ts.month_name()

'October'

In [65]:
ts.day_name()

'Thursday'

In [66]:
ts.dayofweek

3

In [67]:
ts.hour

10

In [68]:
ts.minute

30

... and so on

We can similarly extract the seconds using the `second` attribute


#### Parsing strings into date-time values makes the data easier to work with

We can access these properties for a whole column at once using the ```.dt``` accessor

In [69]:
data_tidy['timestamp'].dt

<pandas.core.indexes.accessors.DatetimeProperties object at 0x0000029DFF8D2B50>

- **`dt` gives properties of values in a column**

- From the **`DatetimeProperties` of column `'timestamp'`**, we can **extract `year`**

In [70]:
data_tidy['timestamp'].dt.year

0      2020
1      2020
2      2020
3      2020
4      2020
       ... 
103    2020
104    2020
105    2020
106    2020
107    2020
Name: timestamp, Length: 108, dtype: int64

These extracted values can be **assigned to new columns**, for example `data_tidy['year'] = data_tidy['timestamp'].dt.year`
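A slightly fuller sketch of building columns from the `.dt` accessor is given here. The two timestamps and the new column names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical two-row frame with an already-parsed timestamp column
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2020-10-15 10:30:00', '2020-10-17 08:30:00'])
})

df['year'] = df['timestamp'].dt.year               # integer year
df['month_name'] = df['timestamp'].dt.month_name() # e.g. 'October'
df['day_of_week'] = df['timestamp'].dt.dayofweek   # Monday=0 ... Sunday=6
df['hour'] = df['timestamp'].dt.hour

print(df[['year', 'month_name', 'day_of_week', 'hour']])
```

Note that attributes such as `year` are accessed without parentheses, while `month_name()` and `day_name()` are methods and need them.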

We can also use `strftime`, short for "string format time", to control how a datetime is formatted. Note that it returns a string, not a datetime.

Let's learn this with the help of a few examples

In [71]:
data_tidy['timestamp'][0]

Timestamp('2020-10-15 10:30:00')

In [72]:
print(data_tidy['timestamp'][0].strftime('%Y')) # Formatter for year

2020


In [73]:
print(data_tidy['timestamp'][0].strftime('%m')) # Formatter for month

10


In [74]:
print(data_tidy['timestamp'][0].strftime('%d')) # Formatter for day

15


In [75]:
print(data_tidy['timestamp'][0].strftime('%H')) # Formatter for hour

10


In [76]:
print(data_tidy['timestamp'][0].strftime('%M')) # Formatter for minutes

30


In [77]:
print(data_tidy['timestamp'][0].strftime('%S')) # Formatter for seconds

00


Similarly, we can combine the format codes to modify the date-time format as per our convenience

In [78]:
data_tidy['timestamp'][0].strftime('%m-%d')

'10-15'
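The same format codes can be applied to a whole column through `.dt.strftime()`. The sketch below uses made-up timestamps rather than the real `data_tidy` frame.

```python
import pandas as pd

# Hypothetical timestamps standing in for the timestamp column
stamps = pd.Series(pd.to_datetime(['2020-10-15 10:30:00', '2020-10-17 08:30:00']))

# Each element is formatted to a string using the given codes
month_day = stamps.dt.strftime('%m-%d')
readable = stamps.dt.strftime('%d/%m/%Y %H:%M')

print(month_day.tolist())  # ['10-15', '10-17']
print(readable.tolist())   # ['15/10/2020 10:30', '17/10/2020 08:30']
```

Because the result holds strings, the `.dt` accessor is no longer available on it; keep the original datetime column if further date arithmetic is needed.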

### Writing to file


#### How can we write our dataframe to a csv file?


- We have to **provide the path and file name** where we want to store the data

In [79]:
data_tidy.to_csv('pfizer_tidy.csv', sep=",") 
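By default `to_csv()` also writes the row index as an unnamed first column. The sketch below, on a small made-up frame and a temporary file (both assumptions), shows how `index=False` skips it and how `parse_dates` restores the datetime dtype when reading the file back.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature version of the tidy data
df = pd.DataFrame({
    'Drug_Name': ['diltiazem hydrochloride', 'docetaxel injection'],
    'Pressure': [18.0, 26.0],
    'timestamp': pd.to_datetime(['2020-10-15 10:30:00', '2020-10-15 10:30:00']),
})

# Write to a temporary location so the example leaves no stray files behind
path = os.path.join(tempfile.mkdtemp(), 'pfizer_tidy_demo.csv')
df.to_csv(path, index=False)  # index=False: do not write the row index

# CSV stores dates as text, so ask read_csv to parse them again
restored = pd.read_csv(path, parse_dates=['timestamp'])
print(restored.dtypes)
```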