# Descriptive and Exploratory Answers

The cells below define some magic numbers and column indices. They offer a rudimentary view of which columns map to which data. They also define a handful of helper functions.
**Before running, change `PATH_TO_DATA` to a directory on your computer that contains the data file.**

In [65]:
# I've gone ahead and used "=" in place of "<-" for assignment because the
# latest R doesn't care; it also makes the code more readable (Pythonic) IMHO


# Legend of MAGIC Numbers, File Paths and Arguments
CORRELATION_STRATEGY = "pearson"
PATH_TO_DATA = "~/Documents/datascience/"
PRIMARY_FILE = "Data_Adults_1.csv"
INFO_START = 2
INFO_END = 14
DISORDER_START = 15
DISORDER_END = 76
SURVEY_START = 77
SURVEY_END = 378
RCBF_RAW_START = 379
RCBF_RAW_END = 636
RCBF_SCALED_START = 637
RCBF_SCALED_END = 754

In [169]:
# Computes the correlation between two quantities and returns a correlation matrix.
# Param: quant_a : The first data frame to correlate
# Param: quant_b : The second data frame to correlate with (optional)
# Return: A matrix of column correlations
correlate = function(quant_a, quant_b = NULL) {
    # is.null() is used here because is.na() on a data frame returns a whole
    # matrix of values, which is not a valid single condition for if()
    if (is.null(quant_b)) {
        cor_val = cor(quant_a, method = CORRELATION_STRATEGY)
    } else {
        cor_val = cor(quant_a, quant_b, method = CORRELATION_STRATEGY)
    }
    return (cor_val)
}


# This function returns a matrix with 2 rows, one with n smallest values and the other with n largest.
# Param: vec : The vector from which the values need to be extracted
# Param: n : The number of values needed
# Param: req_plot : A switch which will optionally plot the data if required
# Return: A row bound matrix of n smallest and largest values respectively
sort_plot = function(vec, n, req_plot = FALSE) {
    smallest_n = sort(vec)[1:n]
    largest_n = sort(vec, TRUE)[1:n]
    if (req_plot) {
        plot(vec)
    }
    return (rbind(smallest_n, largest_n))
}


# Function prefixes the patient ID column to a data frame so that we can find
# out which patient each row belongs to. It reads the global cleansed_data.
# Param: start_index : The beginning of the columns to bind the patient ID
# ahead of
# Param: end_index : The end of the columns
# Return: a data frame with the required columns prefixed by patient ID 
prefix_id = function(start_index, end_index) {
    return (cbind(cleansed_data[2], cleansed_data[, start_index : end_index]))
}


# Function prints the min, max and mean of each column in a data set (the first column, the patient ID, is skipped).
# Param: dataset : The data set to crunch
min_max_mean_extract = function(dataset) {
    print("Minimum Values")
    print(sapply(dataset[-1], min))
    print("Maximum Values")
    print(sapply(dataset[-1], max))
    print("Mean Values")
    print(sapply(dataset[-1], mean))
}


## Function to discern levels of nominal variables.
# Param: quantity : The quantity to potentially factorize
discern_levels = function(quantity) {
    char_vals = quantity[sapply(quantity, function(quantity) !is.numeric(quantity))]
    apply(char_vals, 2, unique)
}
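
The correlate-then-rank idea behind `correlate` and `sort_plot` can be sketched on toy data. This is only an illustration under stated assumptions: the data frames `toy_a` and `toy_b`, their column names and their values are all made up, and the pair labels such as `x1~y1` are a convention chosen here, not something the notebook produces.

```r
# Toy stand-ins for two column blocks of the real dataset (hypothetical values).
set.seed(42)
toy_a = data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy_b = data.frame(y1 = toy_a$x1 + rnorm(50, sd = 0.1),  # strongly tied to x1
                   y2 = rnorm(50))                       # unrelated noise

# Cross-correlation matrix: rows are columns of toy_a, columns are columns of toy_b.
cross_cor = cor(toy_a, toy_b, method = "pearson")

# Flatten to a named vector (column-major order) so values can be ranked.
cor_vec = setNames(as.vector(cross_cor),
                   paste(rep(rownames(cross_cor), times = ncol(cross_cor)),
                         rep(colnames(cross_cor), each = nrow(cross_cor)),
                         sep = "~"))

# Keep the n smallest and n largest correlations, each with its own pair labels.
n = 2
extremes = list(smallest = head(sort(cor_vec), n),
                largest = head(sort(cor_vec, decreasing = TRUE), n))
print(extremes)
```

Returning a list rather than an `rbind` keeps the pair labels attached to each value; when two named vectors are row-bound, the column names come from only one of them.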

## Descriptive Answers
<hr>

### Descriptive Question - 1
What is the size of our training data? What is the length of each feature vector and how many features does each vector have?

### Descriptive Question - 2
What does the presence of Null values indicate? How should they be dealt with?

In [170]:
spect_data = data.frame(read.csv(paste0(PATH_TO_DATA, PRIMARY_FILE), as.is = TRUE))
cleansed_data = na.omit(spect_data) # deal with Null values
cat("The dataset as is has", nrow(spect_data), "rows and", ncol(spect_data), "columns\n")
cat("The downsampled dataset, omitting rows that have NA values has", nrow(cleansed_data), 
    "rows and", ncol(cleansed_data), "columns\n")

The dataset as is has 7674 rows and 754 columns
The downsampled dataset, omitting rows that have NA values has 2796 rows and 754 columns


**Answer 1**: As the output above shows, the raw dataset has 7674 rows and 754 columns, and even when we throw out every row that contains an NA value we still have a sizeable dataset of 2796 rows. This does not mean that the discarded data will go unused. We first downsample until we have a complete yet sizeable dataset and draw correlation conclusions from it. Once we have these theories, we will apply them to less aggressively downsampled datasets to see whether we can fill in missing data and whether the theories still hold.

**Answer 2**: As stated in the answer above, rows containing NA values were thrown out with `na.omit`. We still had a considerably large dataset after doing so (2796 of the original 7674 rows).
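
Before dropping rows, it can help to see which columns account for the missing values. The following is a minimal sketch on a made-up data frame (`toy` and its columns are hypothetical and not taken from the dataset); it counts NA values per column and confirms that filtering on `complete.cases` keeps the same rows as `na.omit`.

```r
# Hypothetical toy data with a few missing values.
toy = data.frame(id = 1:5,
                 age = c(34, NA, 51, 47, 29),
                 score = c(2.1, 3.4, NA, NA, 1.8))

# Number of NA values in each column.
na_per_column = colSums(is.na(toy))
print(na_per_column)

# Logical vector marking rows with no NA in any column.
complete_rows = complete.cases(toy)
cleansed = toy[complete_rows, ]

cat("Kept", nrow(cleansed), "of", nrow(toy), "rows\n")
```

Columns with a very high NA count are candidates for being dropped before the rows are filtered, which would retain more rows than a blanket `na.omit`.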

### Descriptive Question - 3
What do the features in the vector indicate? What are they for? How are the feature values arrived at?

In [171]:
cleansed_data_patient_info = cleansed_data[,INFO_START : INFO_END]
cleansed_data_disorder_info = prefix_id(DISORDER_START, DISORDER_END)
cleansed_data_survey_info = prefix_id(SURVEY_START, SURVEY_END)
cleansed_data_RCBF_raw_info = prefix_id(RCBF_RAW_START, RCBF_RAW_END)
cleansed_data_RCBF_scaled_info = prefix_id(RCBF_SCALED_START, RCBF_SCALED_END)

**Answer 3**: By manually examining the data, we realized that the dataset consists of distinct blocks of columns, each of which can be examined independently of the others. We noted the column indices that mark the boundaries of these blocks and split the dataset accordingly. This also makes correlation experiments easier.

- The columns in *patient_info* represent information about the patient such as gender, age, location etc.
- The columns in *disorder_info* represent a series of boolean values that indicate whether the patient suffers
  from a particular disorder
- The columns in *survey_info* represent the patient's responses to the BSC, GSC and LDS survey questions (assuming they are surveys)
- The columns in *RCBF_raw* represent the raw cerebral blood flow values
- The columns in *RCBF_scaled* represent the scaled cerebral blood flow values
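
Because the split relies on hand-recorded indices, a quick sanity check of the ranges is worthwhile. The sketch below copies the start and end values from the constants defined at the top of the notebook and checks that the blocks are contiguous and non-overlapping; the `blocks` table itself is a helper introduced here for illustration.

```r
# Start and end indices copied from the constants at the top of the notebook.
blocks = data.frame(
    name  = c("patient_info", "disorder_info", "survey_info", "RCBF_raw", "RCBF_scaled"),
    start = c(2, 15, 77, 379, 637),
    end   = c(14, 76, 378, 636, 754)
)

# Number of columns in each block.
blocks$width = blocks$end - blocks$start + 1

# Each block should begin exactly one column after the previous block ends.
gaps_ok = all(blocks$start[-1] == blocks$end[-nrow(blocks)] + 1)

print(blocks)
cat("Contiguous:", gaps_ok, " Total columns covered:", sum(blocks$width), "\n")
```

The blocks cover 753 of the 754 columns; column 1 is not assigned to any block by the constants above.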

## Exploratory Answers
<hr/>

### Exploratory Question - 1
What are the mean, maximum and minimum of the various numerical features? Also, how do these features scale?

In [172]:
min_max_mean_extract(cleansed_data_patient_info)
min_max_mean_extract(cleansed_data_disorder_info)
min_max_mean_extract(cleansed_data_survey_info)
min_max_mean_extract(cleansed_data_RCBF_raw_info)
min_max_mean_extract(cleansed_data_RCBF_scaled_info)

[1] "Minimum Values"
                  STUDY_NAME                     GROUP_ID 
                   "BigLove"                          "2" 
                  group_name                          Age 
            "Adults        "                          "9" 
                   Age_Group                   Gendername 
                 "Adult    "                    "Female " 
                   Gender_id                      race_id 
                         "1"                         "-1" 
                    RaceName                          doa 
"African American          "                "13141612800" 
                locationname                  location_id 
             "Atlanta      "                         "-1" 
[1] "Maximum Values"
                  STUDY_NAME                     GROUP_ID 
                   "BigLove"                          "3" 
                  group_name                          Age 
            "Healthy Brains"                         "88" 
              

In mean.default(X[[i]], ...): argument is not numeric or logical: returning NA

  STUDY_NAME     GROUP_ID   group_name          Age    Age_Group   Gendername 
          NA 2.000358e+00           NA 4.032368e+01           NA           NA 
   Gender_id      race_id     RaceName          doa locationname  location_id 
1.395565e+00 1.938126e+00           NA 1.348470e+10           NA 3.363376e+00 
[1] "Minimum Values"
                          Adjustment_Disorder 
                                          "0" 
                              AnxietyDisorder 
                                          "0" 
                                     dementia 
                                          "0" 
                           Amnestic_disorders 
                                          "0" 
                     Dementia_Alzheimers_Type 
                                          "0" 
                              Dementia_due_to 
                                          "0" 
                    Other_cognitive_disorders 
                                          "0" 
     

In mean.default(X[[i]], ...): argument is not numeric or logical: returning NA

                          Adjustment_Disorder 
                                 0.0171673820 
                              AnxietyDisorder 
                                 0.6266094421 
                                     dementia 
                                 0.0554363376 
                           Amnestic_disorders 
                                 0.0000000000 
                     Dementia_Alzheimers_Type 
                                 0.0007153076 
                              Dementia_due_to 
                                 0.0082260372 
                    Other_cognitive_disorders 
                                 0.0464949928 
                            Vascular_dementia 
                                 0.0010729614 
                            ChildHoodDisorder 
                                 0.5797567954 
        Attention_Deficit_Disruptive_Behavior 
                                 0.5683118741 
                       Communication_Disorder 
             

In mean.default(X[[i]], ...): argument is not numeric or logical: returning NA

BSC_Respondent GSC_Respondent LDS_Respondent          BSC_1          BSC_2 
            NA             NA             NA     1.89234621     2.51180258 
         BSC_3          BSC_4          BSC_5          BSC_6          BSC_7 
    2.09406295     2.37804006     2.38018598     2.26466381     2.07331903 
         BSC_8          BSC_9         BSC_10         BSC_11         BSC_12 
    2.54899857     2.46065808     2.05078684     2.12732475     2.00679542 
        BSC_13         BSC_14         BSC_15         BSC_16         BSC_17 
    1.34835479     1.61123033     2.15844063     2.44849785     2.51430615 
        BSC_18         BSC_19         BSC_20         BSC_21         BSC_22 
    2.06974249     1.91738197     1.29291845     0.38483548     0.43776824 
        BSC_23         BSC_24         BSC_25         BSC_26         BSC_27 
    1.26895565     1.18276109     1.34763948     1.91523605     1.42489270 
        BSC_28         BSC_29         BSC_30         BSC_31         BSC_32 
    1.714234

**Answer 1**: Although maximum and minimum values don't really make sense for data such as boolean disorder flags, we compute them anyway in case we missed something.
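
The quoted minimum and maximum values and the `mean.default` warnings in the output above both come from character columns being mixed in with numeric ones: `sapply` coerces a mixed result to a character vector, and `mean` returns NA for non-numeric input. One possible way around this, sketched here on a made-up data frame (`toy` and its values are hypothetical), is to restrict the summary to numeric columns first.

```r
# Hypothetical toy data: an ID column, one character column, two numeric columns.
toy = data.frame(id = 1:4,
                 site = c("A", "B", "A", "B"),
                 age = c(9, 40, 88, 25),
                 flag = c(0, 1, 0, 0),
                 stringsAsFactors = FALSE)

# Drop the ID column, then keep only numeric columns.
features = toy[-1]
numeric_cols = features[sapply(features, is.numeric)]

# One column per feature, with rows for min, max and mean.
summary_table = sapply(numeric_cols,
                       function(col) c(min = min(col), max = max(col), mean = mean(col)))
print(summary_table)
```

The result is a numeric matrix, so the values stay unquoted and no warning is raised for the character column.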

### Exploratory Question - 2
Is there a metric that can be used to separate healthy people from unhealthy ones?

In [173]:
unhealthy_vector = apply(cleansed_data_disorder_info[-1], 1, function(row) any(row[] == 1 ))
cat("There are ", sum(unhealthy_vector == FALSE), "healthy people")

There are  50 healthy people

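**Answer 2**: We treat a patient as healthy when none of their disorder flags is set to 1. By that metric there are 50 healthy people in the cleansed dataset.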
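
The same "no disorder flag set" rule can also be expressed with `rowSums` over the numeric flag columns only, which avoids comparing character columns such as `ADHD_Type` against 1. This is a sketch on made-up data, not the notebook's own computation; the rows and values in `toy` are hypothetical.

```r
# Hypothetical toy data: an ID column, two 0/1 disorder flags, one nominal column.
toy = data.frame(id = 1:4,
                 AnxietyDisorder = c(0, 1, 0, 0),
                 dementia = c(0, 0, 0, 1),
                 ADHD_Type = c("Asymptomatic", "Inattentive", "Asymptomatic", "Asymptomatic"),
                 stringsAsFactors = FALSE)

# Drop the ID column and keep only the numeric flag columns.
flags = toy[-1]
flags = flags[sapply(flags, is.numeric)]

# A patient is marked unhealthy if at least one flag equals 1.
unhealthy = rowSums(flags == 1) > 0
healthy_count = sum(!unhealthy)

cat("There are", healthy_count, "healthy people\n")
```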

### Exploratory Question - 3
What is the range of levels nominal values can take?

In [174]:
discern_levels(cleansed_data_patient_info)
discern_levels(cleansed_data_disorder_info)
discern_levels(cleansed_data_survey_info)
discern_levels(cleansed_data_RCBF_raw_info)
discern_levels(cleansed_data_RCBF_scaled_info)

ADHD_Type
Mostly Inattentive
Asymptomatic
Inattentive
Combined Type
Mostly Impulsive
Hyperactive


BSC_Respondent,GSC_Respondent,LDS_Respondent
Self,Self,Self
Other,Other,Other
Spouse,Spouse,Spouse
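
As a variation on `discern_levels`, the levels can be collected with `lapply`, which always returns one entry per column. `apply(..., 2, unique)` returns a matrix when every column happens to have the same number of distinct values and a list otherwise, so the shape of its result depends on the data. A minimal sketch on made-up data (`toy` and its values are hypothetical):

```r
# Hypothetical toy data with two nominal columns and one numeric column.
toy = data.frame(id = 1:4,
                 respondent = c("Self", "Other", "Self", "Spouse"),
                 adhd_type = c("Inattentive", "Inattentive", "Asymptomatic", "Inattentive"),
                 score = c(1.5, 2.0, 0.5, 3.0),
                 stringsAsFactors = FALSE)

# Keep the non-numeric columns, as discern_levels does.
char_cols = toy[sapply(toy, function(col) !is.numeric(col))]

# Named list: one vector of distinct values per nominal column.
levels_by_column = lapply(char_cols, unique)
print(levels_by_column)
```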
