# Create a new project

This notebook shows how to define your project before the data analysis. You create an entity named "project" that contains the full configuration of your project: project name, folder names, data file names, datasets, clinical groups and so on. The entity "project" is usually created once, saved in a JSON file on your computer and then reused for different data analyses. You can regenerate it when you wish to add new datasets and/or clinical groups.

In [1]:
# Imports
from src.statgenex.entity import Project, Dataset

## Project name and location

First, define the project name and its location on your computer in "project_options":
* **name**: project name
* **root_dir**: path to the folder in which the project will be created

In [2]:
project_options = {
    'name': 'LysOnc',
    'root_dir': 'C:/WORK/PROJECTS/',
    }

project = Project(**project_options)
print(project)

Project [name = LysOnc, root_dir = C:/WORK/PROJECTS/, project_dir = C:/WORK/PROJECTS/LysOnc/, data_dir = C:/WORK/PROJECTS/LysOnc/data/, results_dir = C:/WORK/PROJECTS/LysOnc/results/]


By default, the project entity defines the following folders:
* **project_dir**: project location
* **data_dir**: folder that contains data files
* **results_dir**: folder that will contain future outputs of the statistical analyses (results and figures)

If you need to customize "data_dir" and "results_dir", please define them in the "project_options".
```json
{
    "data_dir": "path to the customized data folder",
    "results_dir": "path to the customized results folder"
}
```
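
As a rough illustration of how the default folders relate to "name" and "root_dir", the sketch below rebuilds the paths shown in the printed output and lets "data_dir" and "results_dir" be overridden. The helper `default_dirs` is hypothetical and is not part of the `Project` API; the custom path in the example is also made up.

```python
def default_dirs(name, root_dir, data_dir=None, results_dir=None):
    """Hypothetical helper: derive the project folders from name and root_dir."""
    project_dir = f"{root_dir.rstrip('/')}/{name}/"
    return {
        'project_dir': project_dir,
        # Fall back to the default sub-folders when no custom path is given
        'data_dir': data_dir or f"{project_dir}data/",
        'results_dir': results_dir or f"{project_dir}results/",
    }

dirs = default_dirs('LysOnc', 'C:/WORK/PROJECTS/')
print(dirs['data_dir'])  # C:/WORK/PROJECTS/LysOnc/data/

# Overriding only the data folder (illustrative path)
custom = default_dirs('LysOnc', 'C:/WORK/PROJECTS/', data_dir='D:/DATA/')
print(custom['data_dir'])  # D:/DATA/
```
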

In [3]:
# Generate the project folders
project.dump()

This creates the project file (project.json) on your computer, along with the corresponding folders. At this stage the project is empty: no dataset is defined yet. The file contains:
```json
{
    "name": "LysOnc",
    "root_dir": "C:/WORK/PROJECTS/",
    "project_dir": "C:/WORK/PROJECTS/LysOnc/",
    "data_dir": "C:/WORK/PROJECTS/LysOnc/data/",
    "results_dir": "C:/WORK/PROJECTS/LysOnc/results/",
    "json_file": "C:/WORK/PROJECTS/LysOnc/project.json",
    "datasets": {}
}
```
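
Because the project is stored as plain JSON, it can be inspected with the standard `json` module. The sketch below writes a reduced copy of the content to a temporary folder and reads it back. The temporary file is only a stand-in for the real "project.json"; how the library itself reloads a project is not shown here.

```python
import json
import tempfile
from pathlib import Path

# Reduced copy of the project content (stand-in for the real file)
project_config = {
    "name": "LysOnc",
    "root_dir": "C:/WORK/PROJECTS/",
    "datasets": {},
}

with tempfile.TemporaryDirectory() as tmp:
    json_file = Path(tmp) / "project.json"
    json_file.write_text(json.dumps(project_config, indent=4), encoding="utf-8")
    # Read the file back as a regular Python dictionary
    loaded = json.loads(json_file.read_text(encoding="utf-8"))

print(loaded["name"], "has", len(loaded["datasets"]), "dataset(s)")
# LysOnc has 0 dataset(s)
```
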

## Add a new dataset to the project

Put the data files in the data folder of the project (as defined in "data_dir").

At least two files are expected:
* **data_filename**: the name of the file containing gene expression data
* **expgroup_filename**: the name of the file containing bio-clinical and demographic annotations of the samples (experimental grouping) 
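
Before declaring the dataset, it can help to check that both files are really present in the data folder. This is a minimal sketch using `pathlib`; the helper `missing_files` and the file name "expression.csv" are hypothetical, and a temporary folder stands in for "data_dir".

```python
import tempfile
from pathlib import Path

def missing_files(data_dir, filenames):
    """Return the file names that are not present in data_dir."""
    return [name for name in filenames if not (Path(data_dir) / name).is_file()]

with tempfile.TemporaryDirectory() as data_dir:
    # Only the annotation file exists in this toy data folder
    (Path(data_dir) / "clinical_TCGA-BRCA.csv").touch()
    missing = missing_files(data_dir, ["expression.csv", "clinical_TCGA-BRCA.csv"])

print(missing)  # ['expression.csv']
```
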

In [4]:
dataset_options = {
    'name': 'TCGA-BRCA',
    'data_filename': 'expression_data_tcga_brca_TCGA-BRCA_log_fpkm_1226_samples_42851_genes.csv',
    'expgroup_filename': 'clinical_TCGA-BRCA.csv',
    }

dataset = Dataset(**dataset_options)
project.add_dataset(dataset)
project.print_summary()

Project LysOnc C:/WORK/PROJECTS/LysOnc/
Dataset TCGA-BRCA: no groups defined


## Add groups to the dataset

Provide rules to define the groups of samples. Four types of filters are available:
* **categorical filters** for categorical variables in the "expgroup" file
* **quantitative filters** for numerical variables in the "expgroup" file
* **expression filters** to identify groups of low and high expression for a gene in the "data" file
* **secondary filters** to create intersections between the previously defined groups

In [5]:
categorical_filters = {
    'NT': [{'tissue_status': ['normal']}], # Non tumour (NT) breast
    'All-tumours': [{'tissue_status': ['tumoral']}],
    'Stage-I': [{'ajcc_pathologic_tumor_stage_shared_stage_tnm_categories': ['Stage I', 'Stage IA', 'Stage IB']}],
    'Stage-II': [{'ajcc_pathologic_tumor_stage_shared_stage_tnm_categories': ['Stage II', 'Stage IIA', 'Stage IIB']}],
    'Stage-III': [{'ajcc_pathologic_tumor_stage_shared_stage_tnm_categories': ['Stage III', 'Stage IIIA', 'Stage IIIB', 'Stage IIIC']}],
    'Stage-IV': [{'ajcc_pathologic_tumor_stage_shared_stage_tnm_categories': ['Stage IV']}],
    'Luminal-A': [{'tissue_status': ['tumoral']}, {'pam50': ['luminal-A']}],
    'Luminal-B': [{'tissue_status': ['tumoral']}, {'pam50': ['luminal-B']}],
    'HER2-enriched': [{'tissue_status': ['tumoral']}, {'pam50': ['HER2-enriched']}],
    'Basal-like': [{'tissue_status': ['tumoral']}, {'pam50': ['basal-like']}],
    'T1N0': [{'ajcc_tumor_pathologic_pt_shared_stage_pathologic_categories': ['T1', 'T1a', 'T1b', 'T1c']}, {'ajcc_nodes_pathologic_pn_shared_stage_pathologic_m': ['N0', 'N0 (i-)', 'N0 (i+)']}, {'diagnoses_1_ajcc_pathologic_m': ['M0', 'cM0 (i+)']}],
    'N0': [{'ajcc_nodes_pathologic_pn_shared_stage_pathologic_m': ['N0', 'N0 (i-)', 'N0 (i+)', 'N0 (mol+)']}, {'diagnoses_1_ajcc_pathologic_m': ['M0', 'cM0 (i+)']}],
    'N1': [{'ajcc_nodes_pathologic_pn_shared_stage_pathologic_m': ['N1', 'N1a', 'N1b', 'N1c', 'N1mi']}, {'diagnoses_1_ajcc_pathologic_m': ['M0', 'cM0 (i+)']}],
    'N2': [{'ajcc_nodes_pathologic_pn_shared_stage_pathologic_m': ['N2', 'N2a', 'N2b', 'N2c', 'N2mi']}, {'diagnoses_1_ajcc_pathologic_m': ['M0', 'cM0 (i+)']}],
    'N3': [{'ajcc_nodes_pathologic_pn_shared_stage_pathologic_m': ['N3', 'N3a', 'N3b', 'N3c', 'N3mi']}, {'diagnoses_1_ajcc_pathologic_m': ['M0', 'cM0 (i+)']}],
    'M1': [{'diagnoses_1_ajcc_pathologic_m': ['M1']}],
    'Claudin-low': [{'tissue_status': ['tumoral']}, {'claudin_low': [1]}],
}
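
One plausible reading of the rule structure above is that a sample belongs to a group when it satisfies every rule in the list, and a rule is satisfied when the column holds any of the listed values. The sketch below implements that reading on toy samples. It is an assumption about the semantics, not the library's implementation, and `select_samples` is a hypothetical helper.

```python
def select_samples(samples, rules):
    """Keep samples matching every rule; within a rule, any listed value matches."""
    selected = []
    for sample in samples:
        if all(sample.get(column) in allowed
               for rule in rules
               for column, allowed in rule.items()):
            selected.append(sample['id'])
    return selected

# Toy annotations (made up for illustration)
samples = [
    {'id': 's1', 'tissue_status': 'tumoral', 'pam50': 'luminal-A'},
    {'id': 's2', 'tissue_status': 'tumoral', 'pam50': 'basal-like'},
    {'id': 's3', 'tissue_status': 'normal', 'pam50': None},
]

luminal_a = select_samples(samples, [{'tissue_status': ['tumoral']}, {'pam50': ['luminal-A']}])
print(luminal_a)  # ['s1']
```
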

In [6]:
quantitative_filters = {
    'Young_N_and_T': [{'age_min' : [0, 60]}], # {column: [min, max]}
    'Old_N_and_T': [{'age_min' : [60, 150]}],
}
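
The two ranges above share the boundary value 60, so it matters whether the bounds are inclusive. The sketch below assumes `min <= value < max`, which puts a 60-year-old in the "Old" range only. This convention is an assumption made for illustration and should be checked against the library's behaviour; `in_range` is a hypothetical helper and the ages are toy values.

```python
def in_range(value, bounds):
    """Assumed convention: lower bound inclusive, upper bound exclusive."""
    low, high = bounds
    return value is not None and low <= value < high

# Toy ages; s4 has no recorded age and falls into neither group
ages = {'s1': 45, 's2': 60, 's3': 72, 's4': None}

young = [s for s, age in ages.items() if in_range(age, [0, 60])]
old = [s for s, age in ages.items() if in_range(age, [60, 150])]
print(young, old)  # ['s1'] ['s2', 's3']
```
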

In [7]:
expression_filters = {
    'SMYD2-': {'gene': 'SMYD2', 'ref_group': 'All-tumours', 'threshold_type': 'median', 'class': 'low'}, 
    'SMYD2+': {'gene': 'SMYD2', 'ref_group': 'All-tumours', 'threshold_type': 'median', 'class': 'high'},
    'BCAR3-': {'gene': 'BCAR3', 'ref_group': 'All-tumours', 'threshold_type': 'median', 'class': 'low'}, 
    'BCAR3+': {'gene': 'BCAR3', 'ref_group': 'All-tumours', 'threshold_type': 'median', 'class': 'high'},
    }
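
A median split can be sketched as follows: the threshold is the median expression of the gene within the reference group, and each sample of that group is classed as "low" or "high" relative to it. Placing samples equal to the median in the "low" class is an assumption, as are the helper `split_by_median` and the toy expression values.

```python
from statistics import median

def split_by_median(expression, ref_group):
    """Split the reference group into low/high classes around its median."""
    threshold = median(expression[s] for s in ref_group)
    low = [s for s in ref_group if expression[s] <= threshold]  # ties go to "low" (assumed)
    high = [s for s in ref_group if expression[s] > threshold]
    return threshold, low, high

# Toy expression values for one gene; 'n1' is outside the reference group
expression = {'n1': 1.0, 't1': 2.1, 't2': 5.4, 't3': 3.3, 't4': 7.8, 't5': 4.0}
ref_group = ['t1', 't2', 't3', 't4', 't5']

threshold, low, high = split_by_median(expression, ref_group)
print(threshold, low, high)  # 4.0 ['t1', 't3', 't5'] ['t2', 't4']
```

With an odd number of samples and no ties, this convention yields one more "low" sample than "high" samples.
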

In [8]:
secondary_filters = {
    'Young': ['All-tumours', 'Young_N_and_T'], # intersection of the selected groups
    'Old': ['All-tumours', 'Old_N_and_T'],
}
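
Since a secondary filter is the intersection of previously defined groups, its effect can be sketched with Python sets. The helper `intersect_groups` and the sample identifiers are hypothetical; the library may store group membership differently.

```python
def intersect_groups(groups, names):
    """Return the samples that belong to every named group."""
    members = set(groups[names[0]])
    for name in names[1:]:
        members &= set(groups[name])
    return sorted(members)

# Toy group membership (made up for illustration)
groups = {
    'All-tumours': ['s1', 's2', 's3'],
    'Young_N_and_T': ['s2', 's3', 's4'],
}

young = intersect_groups(groups, ['All-tumours', 'Young_N_and_T'])
print(young)  # ['s2', 's3']
```
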

In [9]:
# The generation of the groups may take some time. Please wait.
dataset.generate_groups(categorical_filters=categorical_filters, 
                        quantitative_filters=quantitative_filters, 
                        expression_filters=expression_filters,
                        secondary_filters=secondary_filters)
project.print_summary()

Project LysOnc C:/WORK/PROJECTS/LysOnc/
Dataset TCGA-BRCA: NT (113), All-tumours (1113), Stage-I (179), Stage-II (619), Stage-III (244), Stage-IV (18), Luminal-A (547), Luminal-B (202), HER2-enriched (82), Basal-like (193), T1N0 (151), N0 (449), N1 (295), N2 (98), N3 (50), M1 (22), Claudin-low (33), Young_N_and_T (641), Old_N_and_T (566), SMYD2- (557), SMYD2+ (556), BCAR3- (557), BCAR3+ (556), Young (577), Old (520)


## Save the project

Save the complete project, including its datasets and groups, to the "project.json" file on your computer.

In [10]:
project.dump()