
<h1><img align="right" width="20%" src="../multimedia/kite_logo.png"> CXLP </h1>

# 01 - Basics

## Objective
This notebook shows basic functionality of the `CXLP` package.

## Preliminaries

In [1]:
import pandas as pd
import pathlib

import cxlp

In [2]:
current_timestamp = pd.Timestamp.now()
print(f"Execution of this notebook started on {current_timestamp}")

Execution of this notebook started on 2023-10-17 16:52:46.590399


In [3]:
path_data = pathlib.Path('.', 'data')
path_results = pathlib.Path('.', 'results')

# Make sure paths exist.
for path_ in [path_data, path_results]:    
    if not path_.exists():
        print(f"Directory {path_} does not exist. Creating...", flush=True, end='')
        path_.mkdir(parents=True)
        print("\tDONE!")
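
The explicit existence check above can also be folded into the `mkdir` call itself. The following is a minimal sketch of that alternative. It uses a temporary directory (the `base` variable, an assumption for illustration) so that it does not touch the real project tree.

```python
import pathlib
import tempfile

# Temporary base directory, a stand-in for the notebook's working directory.
base = pathlib.Path(tempfile.mkdtemp())

path_data = base / 'data'
path_results = base / 'results'

for path_ in [path_data, path_results]:
    # exist_ok=True makes the call a no-op when the directory is already there.
    path_.mkdir(parents=True, exist_ok=True)

# Calling it a second time does not raise an error.
path_data.mkdir(parents=True, exist_ok=True)
print(path_data.is_dir(), path_results.is_dir())
```

This variant is shorter, but it drops the progress messages printed by the loop above.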

In [4]:
product = 'clp'

## Read data
Set `path_to_updated_file` to the correct path on your machine
(i.e., replace `my_user_name` with your actual user name).

In [5]:
path_to_updated_file = pathlib.Path(r'C:\Users\my_user_name\Gilead Sciences\TCF04 MSAT Data Science - Process Files')
path_to_updated_file = pathlib.Path(r'C:\Users\amoncadatorres\Gilead Sciences\TCF04 MSAT Data Science - Process Files')

# path_new_file = cxlp.copy_most_recent_data_file(product, 'working', path_to_updated_file, path_data)
path_new_file = './data/2023-08-25 CLP Working Data.xlsx'
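
One way to avoid hard-coding the user name is to build the path from the current user's home directory. This is only a sketch: it assumes the synced folder sits directly under the home directory, as in the paths above, which may not hold on every machine.

```python
import pathlib

# Path.home() resolves to the current user's home directory
# (for example C:\Users\<user> on Windows), so no user name is hard-coded.
# Assumption: the synced folder lives directly under the home directory.
path_to_updated_file = (
    pathlib.Path.home()
    / 'Gilead Sciences'
    / 'TCF04 MSAT Data Science - Process Files'
)

print(path_to_updated_file)
```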

In [6]:
df = cxlp.read_excel_file_raw(path_new_file)

Reading YESCARTA file 2023-08-25 CLP Working Data.xlsx... 	DONE!


## Pre-processing
Pre-processing steps can be done individually...

> Remember that renaming columns should always be the first 
> pre-processing step.


In [7]:
path_dictionary = '../dictionaries/CLP Data Dictionary v2.3.xlsx'
df_renamed = cxlp.rename_columns(df, product, path_dictionary)
df_preprocessed = cxlp.clean_dataframe(df_renamed, product, path_dictionary)

Renaming columns...	DONE!
Cleaning columns...
+ Cleaning categorical column batch_id...	 DONE!
+ Cleaning categorical column cell_order...	 DONE!
+ Cleaning categorical column subject_id...	 DONE!
+ Cleaning categorical column apheresis_site...	 DONE!
+ Cleaning categorical column apheresis_country...	 DONE!
+ Cleaning categorical column suite...	 DONE!
+ Cleaning categorical column facility_intermediate...	 DONE!
+ Cleaning categorical column facility_final_product...	 DONE!
+ Cleaning categorical column item_number...	 DONE!
+ Cleaning categorical column indication...	 DONE!
+ Cleaning categorical column line_therapy...	 DONE!
+ Cleaning categorical column run_type...	 DONE!
+ Cleaning categorical column apheresis_type...	 DONE!
+ Cleaning categorical column process_type...	 DONE!
+ Cleaning categorical column cell_order_reclassified_frozen...	 DONE!
- Column start_day0_date will not be cleaned and left as is.
- Column harvest_date will not be cleaned and left as is.
- Column termina

...or all of them together in a single line. 

In [8]:
df_preprocessed2 = cxlp.preprocess_raw_dataframe(df, product, path_dictionary)

Renaming columns...	DONE!
Cleaning columns...
+ Cleaning categorical column batch_id...	 DONE!
+ Cleaning categorical column cell_order...	 DONE!
+ Cleaning categorical column subject_id...	 DONE!
+ Cleaning categorical column apheresis_site...	 DONE!
+ Cleaning categorical column apheresis_country...	 DONE!
+ Cleaning categorical column suite...	 DONE!
+ Cleaning categorical column facility_intermediate...	 DONE!
+ Cleaning categorical column facility_final_product...	 DONE!
+ Cleaning categorical column item_number...	 DONE!
+ Cleaning categorical column indication...	 DONE!
+ Cleaning categorical column line_therapy...	 DONE!
+ Cleaning categorical column run_type...	 DONE!
+ Cleaning categorical column apheresis_type...	 DONE!
+ Cleaning categorical column process_type...	 DONE!
+ Cleaning categorical column cell_order_reclassified_frozen...	 DONE!
- Column start_day0_date will not be cleaned and left as is.
- Column harvest_date will not be cleaned and left as is.
- Column termina

Notice how `df_preprocessed` and `df_preprocessed2` are the same.
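
Rather than comparing the two DataFrames by eye, the equality can be checked programmatically with pandas. The sketch below uses two small toy DataFrames (`df_a` and `df_b`, made up for illustration) in place of `df_preprocessed` and `df_preprocessed2`.

```python
import pandas as pd

# Toy stand-ins for df_preprocessed and df_preprocessed2.
df_a = pd.DataFrame({
    'batch_id': ['b1', 'b2'],
    'harvest_day': [7.0, float('nan')],
})
df_b = df_a.copy()

# DataFrame.equals checks shape, dtypes, and values;
# missing values in the same position count as equal.
are_equal = df_a.equals(df_b)
print(are_equal)

# assert_frame_equal raises an AssertionError describing the difference
# if the two DataFrames do not match.
pd.testing.assert_frame_equal(df_a, df_b)
```

In the notebook, the same two calls could be applied directly to `df_preprocessed` and `df_preprocessed2`.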

## Feature engineering
Certain columns are not part of the CLP/XLP data but are quite useful.
We add those columns to the DataFrame here.

As with pre-processing, this can be done individually...

In [9]:
df_production_date = cxlp.add_production_date(df_preprocessed)
df_production_dates = cxlp.add_production_date_columns(df_production_date)
df_harvested = cxlp.add_harvested(df_production_dates)
df_harvest_day = cxlp.add_harvest_day(df_harvested)
df_attempts = cxlp.add_attempt_number_type(df_harvest_day)
df_verified = cxlp.add_verified(df_attempts)
df_engineered = cxlp.add_success(df_verified)

Columns harvest_date and start_day0_date are present.
Adding column `production_date`...	DONE!
Adding column `production_date_numeric`...	DONE!
Adding column `production_date_formatted`...	DONE!
Adding column `production_date_q_formatted`...	DONE!
Adding column `production_date_q`...	DONE!
Adding column `production_date_quarter`...	DONE!
Adding column `production_date_month`...	DONE!
Adding column `production_date_year`...	DONE!
Adding column `production_date_short`...	DONE!
Adding column `production_date_short_formatted`...	DONE!
Adding column `harvested`...	DONE!
Adding column `harvest_day`...	DONE!
Adding columns `attempt_number` and `attempt_type`...	DONE!
Adding column `verified`...	DONE!
Adding column `success`...	DONE!


...or all of them at once. Note that this only works if the DataFrame
uses the column names defined by `rename_columns`.

In [10]:
df_engineered2 = cxlp.add_engineered_columns(df_preprocessed)

Columns harvest_date and start_day0_date are present.
Adding column `production_date`...	DONE!
Adding column `production_date_numeric`...	DONE!
Adding column `production_date_formatted`...	DONE!
Adding column `production_date_q_formatted`...	DONE!
Adding column `production_date_q`...	DONE!
Adding column `production_date_quarter`...	DONE!
Adding column `production_date_month`...	DONE!
Adding column `production_date_year`...	DONE!
Adding column `production_date_short`...	DONE!
Adding column `production_date_short_formatted`...	DONE!
Adding column `harvested`...	DONE!
Adding column `harvest_day`...	DONE!
Adding columns `attempt_number` and `attempt_type`...	DONE!
Adding column `verified`...	DONE!
Adding column `success`...	DONE!


Once again, notice how `df_engineered` and `df_engineered2` are the same.

## Filtering
Filtering is straightforward with [`pandas.DataFrame.query`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html).

Every time we use `.query`, it is good practice to add `.copy()` at
the end so that the result is explicitly an independent DataFrame that
can be modified safely.

In [11]:
df_tcf03 = df_engineered.query('facility_final_product == "tcf03"').copy()

print(df_tcf03['facility_final_product'].unique())

['tcf03']
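
To illustrate why the explicit `.copy()` is useful, the sketch below modifies the filtered result and confirms that the original DataFrame is unchanged. The data (including the `harvest_day` values) is made up for illustration and is not taken from the CLP file.

```python
import pandas as pd

# Toy data standing in for df_engineered.
df = pd.DataFrame({
    'facility_final_product': ['tcf03', 'tcf04', 'tcf03'],
    'harvest_day': [7, 8, 9],
})

df_tcf03 = df.query('facility_final_product == "tcf03"').copy()

# Modifying the copy leaves the original DataFrame untouched.
df_tcf03['harvest_day'] = 0

print(df['harvest_day'].tolist())        # [7, 8, 9]
print(df_tcf03['harvest_day'].tolist())  # [0, 0]
```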


We can also query using a variable by adding `@` in the query.

In [12]:
site_interest = 'tcf04'
df_tcf04 = df_engineered.query('facility_final_product == @site_interest').copy()

print(df_tcf04['facility_final_product'].unique())

['tcf04']
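
The `@` syntax also works with lists, and conditions can be combined in a single query. The following sketch uses a small toy DataFrame. The `tcf05` value and the `sites_interest` variable are made up for illustration.

```python
import pandas as pd

# Toy data standing in for df_engineered.
df = pd.DataFrame({
    'facility_final_product': ['tcf03', 'tcf04', 'tcf05', 'tcf04'],
    'harvested': [True, False, True, True],
})

sites_interest = ['tcf03', 'tcf04']

# `in @variable` keeps rows whose value appears in the list;
# `and` combines it with a second condition.
df_subset = df.query(
    'facility_final_product in @sites_interest and harvested == True'
).copy()

print(df_subset['facility_final_product'].tolist())  # ['tcf03', 'tcf04']
```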
