# RARE
RARE (Relevant Association Rare-variant-bin Evolver) is an evolutionary algorithm for binning rare variants, intended as a tool for rare variant association analysis.
## What is RARE? 
RARE is an evolutionary algorithm that constructs bins of rare variant features with relevant association to class (univariate and/or multivariate interactions) through the following steps:
1. Random bin initialization or expert knowledge input
2. Repeated evolutionary cycles consisting of:
    - Candidate bin evaluation with univariate scoring (chi-square test) or Relief-based scoring (MultiSURF algorithm); note: new scoring options currently under testing
    - Genetic operations (parent selection, crossover, and mutation) to generate the next generation of candidate bins
3. Final bin evaluation and summary of top bins
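
To make the bin evaluation step more concrete, here is a minimal sketch of univariate scoring in plain Python. It assumes that a bin is collapsed into one value per instance by summing its features, and that the score is the Pearson chi-square statistic of that value against the class. The helper names (`bin_values`, `chi_square_score`) and the toy data are hypothetical, and RARE's internal scoring may differ in detail.

```python
from collections import Counter

def bin_values(rows, bin_features):
    """Collapse one bin into a single value per instance by summing its features (assumed)."""
    return [sum(row[f] for f in bin_features) for row in rows]

def chi_square_score(values, labels):
    """Pearson chi-square statistic for the contingency table of bin value vs. class."""
    n = len(values)
    observed = Counter(zip(values, labels))   # joint counts; missing cells count as 0
    value_totals = Counter(values)            # row totals
    label_totals = Counter(labels)            # column totals
    stat = 0.0
    for v, v_total in value_totals.items():
        for c, c_total in label_totals.items():
            expected = v_total * c_total / n
            stat += (observed[(v, c)] - expected) ** 2 / expected
    return stat

# Toy data, made up for illustration: feature "A" tracks the class, "B" does not.
rows = [
    {"A": 1, "B": 0, "Class": 1},
    {"A": 1, "B": 1, "Class": 1},
    {"A": 0, "B": 0, "Class": 0},
    {"A": 0, "B": 1, "Class": 0},
]
labels = [r["Class"] for r in rows]
score_a = chi_square_score(bin_values(rows, ["A"]), labels)
score_b = chi_square_score(bin_values(rows, ["B"]), labels)
print(score_a, score_b)
```

In this toy case the bin containing only `A` scores higher than the bin containing only `B`, which is the ordering an evolutionary cycle would use when choosing parents.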

### What this notebook contains:  
1. An example of a dataset RARE can be run on
2. How to run RARE 
3. How to interpret RARE's results

In [1]:
import pandas as pd  # imported explicitly because pd is used below
from skrare.methods import *
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')

In [2]:
sim_data = pd.read_csv('Experiment1.csv')

For this example, the data was created with 1000 instances, 50 total rare variant features, and 10 predictive rare variant features. 

In [3]:
display(sim_data)

Unnamed: 0,P_1,P_2,P_3,P_4,P_5,P_6,P_7,P_8,P_9,P_10,...,R_32,R_33,R_34,R_35,R_36,R_37,R_38,R_39,R_40,Class
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
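
The feature values shown above are 0.0, 1.0, and 2.0, which is consistent with minor-allele counts per instance. Under that assumption (it is not stated in the notebook), the sketch below shows how a minor allele frequency (MAF) could be computed per feature and compared against a cutoff such as the `rare_variant_MAF_cutoff` parameter described below. The function names, the toy columns, and the choice of a strict `<` comparison are all hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF of one feature, assuming each value is a minor-allele count (0, 1, or 2)
    in a diploid sample, so there are 2 alleles per instance."""
    return sum(genotypes) / (2 * len(genotypes))

def split_by_maf(columns, cutoff):
    """Split features into rare, common, and never-observed groups by MAF."""
    rare, common, absent = {}, {}, []
    for name, genotypes in columns.items():
        maf = minor_allele_frequency(genotypes)
        if maf == 0:
            absent.append(name)      # variant never observed in this sample
        elif maf < cutoff:           # strict comparison is an assumption
            rare[name] = maf
        else:
            common[name] = maf
    return rare, common, absent

# Toy columns with 20 instances each, made up for illustration.
columns = {
    "X_rare": [1] + [0] * 19,        # 1 minor allele out of 40
    "X_common": [1, 2, 1, 0] * 5,    # 20 minor alleles out of 40
    "X_absent": [0] * 20,
}
rare, common, absent = split_by_maf(columns, cutoff=0.05)
print(rare, common, absent)
```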


### Parameters for RARE:
1. given_starting_point: whether or not expert knowledge is being inputted (True or False)
2. amino_acid_start_point: if RARE is starting with expert knowledge, input the list of features here; otherwise None
3. amino_acid_bins_start_point: if RARE is starting with expert knowledge, input the list of bins of features here; otherwise None
4. iterations: the number of evolutionary cycles RARE will run
5. original_feature_matrix: the dataset
6. label_name: label for the class/endpoint column in the dataset (e.g., 'Class')
7. rare_variant_MAF_cutoff: the minor allele frequency cutoff separating common features from rare variant features
8. set_number_of_bins: the population size of candidate bins
9. min_features_per_group: the minimum number of features in a bin
10. max_number_of_groups_with_feature: the maximum number of bins containing a feature
11. scoring_method: 'Univariate', 'Relief', or 'Relief only on bin and common features'
12. score_based_on_sample: if Relief scoring is used, whether or not bin evaluation is done based on a sample of instances rather than the whole dataset
13. score_with_common_variables: if Relief scoring is used, whether or not common features should be used as context for evaluating rare variant bins
14. instance_sample_size: if bin evaluation is done based on a sample of instances, input the sample size here
15. crossover_probability: the probability that each feature in an offspring bin crosses over to the paired offspring bin (recommendation: 0.5 to 0.8)
16. mutation_probability: the probability that each feature in a bin is deleted; a proportionate probability is automatically applied to each feature outside the bin to be added (recommendation: 0.05 to 0.5, depending on the situation and the number of iterations run)
17. elitism_parameter: the proportion of elite bins in the current generation to be preserved for the next evolutionary cycle (recommendation: 0.2 to 0.8 depending on conservativeness of approach and number of iterations run)
18. random_seed: the seed for the random number generator that drives crossover and mutation
19. bin_size_variability_constraint: caps the bin size of each child at n times the size of its sibling (recommendation: 2; with larger or smaller values, the population trends heavily toward small or large bins without exploring the search space)
20. max_features_per_bin: sets a max value for the number of features per bin
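
To illustrate how `crossover_probability`, `mutation_probability`, and `elitism_parameter` might act on bins, here is a minimal sketch in plain Python. It is not RARE's implementation: the function names are hypothetical, bins are modeled as sets of feature names, and the "proportionate" addition probability is assumed to be scaled so that the expected bin size stays roughly constant. The `bin_size_variability_constraint` and `max_features_per_bin` limits are not modeled.

```python
import random

def crossover(parent_a, parent_b, crossover_probability, rng):
    """Uniform crossover: each feature moves to the sibling bin with the given probability."""
    child_a, child_b = set(), set()
    for feature in sorted(parent_a):          # sorted for reproducible iteration order
        (child_b if rng.random() < crossover_probability else child_a).add(feature)
    for feature in sorted(parent_b):
        (child_a if rng.random() < crossover_probability else child_b).add(feature)
    return child_a, child_b

def mutate(bin_features, all_features, mutation_probability, rng):
    """Delete each feature in the bin with mutation_probability, and add each outside
    feature with a scaled probability (the scaling rule is an assumption)."""
    outside = sorted(set(all_features) - bin_features)
    kept = {f for f in sorted(bin_features) if rng.random() >= mutation_probability}
    add_probability = 0.0
    if outside:
        add_probability = min(1.0, mutation_probability * len(bin_features) / len(outside))
    added = {f for f in outside if rng.random() < add_probability}
    return kept | added

def select_elites(scored_bins, elitism_parameter):
    """Return the names of the top-scoring proportion of bins to carry forward unchanged."""
    ranked = sorted(scored_bins, key=lambda pair: pair[1], reverse=True)
    n_elite = int(len(ranked) * elitism_parameter)
    return [name for name, _ in ranked[:n_elite]]

rng = random.Random(10)                       # plays the role of random_seed
features = [f"F_{i}" for i in range(1, 11)]   # hypothetical feature names
parent_a, parent_b = set(features[:5]), set(features[5:])
child_a, child_b = crossover(parent_a, parent_b, 0.8, rng)
mutant = mutate(child_a, features, 0.1, rng)
elites = select_elites(
    [("bin_1", 0.2), ("bin_2", 0.9), ("bin_3", 0.5), ("bin_4", 0.1), ("bin_5", 0.7)],
    elitism_parameter=0.4,
)
print(child_a, child_b, mutant, elites)
```

With five scored bins and an `elitism_parameter` of 0.4, two bins are preserved; the remaining slots in the population would be filled by offspring.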

In [None]:
%%capture
# Single replicate; widen the range to repeat the experiment.
for replicate in range(0, 1):
    print('Experiment:')
    # Positional arguments, assumed to follow the order of the parameter list above.
    (bin_feature_matrix, common_features_and_bins_matrix, amino_acid_bins,
     amino_acid_bin_scores, rare_feature_MAF_dict, common_feature_MAF_dict,
     rare_feature_df, common_feature_df, MAF_0_features) = RARE_v2(
        False, None, None, 100, sim_data, 'Class', 0.05, 50, 5, 25,
        'Relief', True, False, 500, 0.8, 0.1, 0.4, 10, 2, None)
        
    # Summary of the best bins
    summary = Top_Rare_Variant_Bins_Summary(rare_feature_df, 'Class', amino_acid_bins, amino_acid_bin_scores, rare_feature_MAF_dict, 5, bin_feature_matrix)
    
    #Exporting summary to a CSV file
    summary.to_csv('Experiment1Results.csv', index=False)

### Results
The bins are ranked by their MultiSURF or univariate score, with higher scores corresponding to better bins. In this example, we use Relief-based scoring with the MultiSURF algorithm. This algorithm evaluates a given bin relative to all other bins in the population, selecting features in a way that is sensitive to feature interactions.

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
display(summary)

# Scikit-learn Package
RARE has also been implemented as a scikit-learn compatible package.

In [None]:
data = pd.read_csv('Experiment1.csv')

In [None]:
from sklearn.pipeline import Pipeline
from skrare.rare import RARE

In [None]:
pipe_RARE = Pipeline(steps=[("RARE", RARE(
    given_starting_point=False, amino_acid_start_point=None,
    amino_acid_bins_start_point=None, iterations=1000, label_name="Class",
    rare_variant_MAF_cutoff=0.05, set_number_of_bins=50, min_features_per_group=5,
    max_number_of_groups_with_feature=25, scoring_method='Relief',
    score_based_on_sample=True, score_with_common_variables=False,
    instance_sample_size=500, crossover_probability=0.8, mutation_probability=0.1,
    elitism_parameter=0.4, random_seed=None, bin_size_variability=None))])