### University of Virginia 
### DS 5110: Big Data Systems
### Assignment: Tools for Supervised Learning
### Last Updated: March 16, 2022
---  

**Instructions**  
In this assignment, you will code functions to support supervised learning tasks.  The outline is provided below.  The value *None* is used as a placeholder. For random sampling, use seed=314 throughout.

TOTAL POINTS: 10

### MODULES

In [1]:
import os

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("data preprocessing") \
    .config("spark.executor.memory", '8g') \
    .config('spark.executor.cores', '4') \
    .config('spark.cores.max', '4') \
    .config("spark.driver.memory",'8g') \
    .getOrCreate()

sc = spark.sparkContext

### PARAMETERS

In [3]:
# class = 2 for benign (negative class), 4 for malignant (positive class)
target = 'class'
positive_label = 4
negative_label = 2

SEED = 314

In [4]:
brca = spark.read.csv('breast_cancer_wisconsin.csv', header=True, inferSchema=True)

In [5]:
# print the schema
brca.printSchema()

root
 |-- id: integer (nullable = true)
 |-- clump_thickness: integer (nullable = true)
 |-- uniformity_cell_size: integer (nullable = true)
 |-- uniformity_cell_shape: integer (nullable = true)
 |-- marginal_adhesion: integer (nullable = true)
 |-- single_epithelial_cell_size: integer (nullable = true)
 |-- bare_nuclei: string (nullable = true)
 |-- bland_chromatin: integer (nullable = true)
 |-- normal_nucleoli: integer (nullable = true)
 |-- mitoses: integer (nullable = true)
 |-- class: integer (nullable = true)



In [6]:
brca.count()

699

In [7]:
# compute distribution of target variable
brca.groupBy(target).count().show()


+-----+-----+
|class|count|
+-----+-----+
|    4|  241|
|    2|  458|
+-----+-----+



### Task 1:  Balancing a DataFrame with Downsampling  
i) (**4 PTS**) Write a function to implement downsampling.  Enter code into the cell containing the `downsample` function.  

INPUTS  
* df               - Spark dataframe  
* target           - string, target variable  
* positive_label   - integer, value of positive label  
* negative_label   - integer, value of negative label  

OUTPUT  
balanced spark dataframe  

Downsampling = sample from larger class to match smaller class size.  

**Example:**  

INITIAL STATE  
Smaller class has 100 records  
Larger class has 400 records

ACTION  
Sample 100 records from larger class, with replacement  
Retain all records from smaller class

END STATE    
This produces a balanced dataset containing 100 records from each class
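
The arithmetic behind the example can be sketched in plain Python. This is only an illustration of how a sampling fraction might be derived from the two class sizes; the helper name `sampling_fraction` is hypothetical and is not part of the assignment outline or of the Spark API.

```python
def sampling_fraction(n_smaller, n_larger):
    """Fraction of the larger class to keep so it roughly matches the smaller class.

    Hypothetical helper for illustration only.
    """
    if n_larger <= 0:
        raise ValueError("larger class must be non-empty")
    # Cap at 1.0 in case the caller passes the classes in the wrong order.
    return min(1.0, n_smaller / n_larger)


# Sizes from the example above: 100 records vs. 400 records.
print(sampling_fraction(100, 400))  # 0.25

# Class counts printed earlier for this dataset: 241 vs. 458.
print(round(sampling_fraction(241, 458), 3))  # 0.526
```

A fraction computed this way is the kind of value one would pass to a sampling call on the larger class only, while keeping every record of the smaller class.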

In [1]:
def downsample(df, target, positive_label, negative_label):
    """
    df              spark dataframe
    target          str, target variable
    positive_label  int, value of positive label
    negative_label  int, value of negative label
    
    """

    ### ENTER CODE HERE
    
    return df_b

ii) **(1 PT)** Print the target distribution from this balanced dataset, to show the label counts nearly match.

#### IMPORTANT NOTE:
Sampling won't produce the exact fraction you request. In order to sample efficiently, Spark uses Bernoulli sampling: each row is independently assigned a probability of being included. If you request a 10% sample, each row individually has a 10% chance of being included, but this does not guarantee an exact 10% sample (it should be close, however).
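
To see why the sample size is only approximate, the idea can be simulated in plain Python. This is a rough sketch of per-row Bernoulli sampling, not Spark's actual implementation; the function name `bernoulli_sample` is made up for this illustration.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`.

    Illustrative simulation only; Spark's internal sampler is not reproduced here.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


# 458 stand-in rows for the larger class; request roughly 241 of them.
rows = list(range(458))
sample = bernoulli_sample(rows, 241 / 458, seed=314)

# The size is close to 241 on average, but usually not exactly 241.
print(len(sample))
```

Because every row is an independent coin flip, the number of rows kept varies around the requested size, which is why the balanced label counts only nearly match.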

In [10]:
# Call your downsample function here, and show the count by label


### Task 2:  Univariate AUC Measurement  

In this exercise, you will measure (in a particular sense) the individual predictive power of the following variables:  
* clump_thickness
* uniformity_cell_size
* uniformity_cell_shape
* marginal_adhesion
* single_epithelial_cell_size

**(5 PTS)** Define a function called `compute_univariate_aucs`  
The function needs to do the following:

* Split the dataset into training and testing sets (*60% / 40%*, respectively)  
* For each variable v:  
    * train a logistic regression classifier with intercept, using variable *v* as the only predictor
    * classify each record in the test set  
    * measure the area under the ROC curve  
* Return a dataframe containing each variable, its model weight (coefficient), and its Univariate AUC, sorted by Univariate AUC in descending order.
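
As a reminder of what the area under the ROC curve measures, here is a small plain-Python sketch. It uses the pairwise-ranking interpretation of AUC (the probability that a randomly chosen positive record scores higher than a randomly chosen negative one, with ties counted as one half). The function name `auc_from_scores` is hypothetical, and this is not the evaluator you are expected to use in Spark.

```python
def auc_from_scores(scores, labels):
    """AUC via pairwise comparison of positive and negative scores.

    Assumes labels are 1 (positive) or 0 (negative). Quadratic in the
    number of records, so suitable only for small illustrations.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative record")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy scores: three of the four positive/negative pairs are ranked correctly.
print(auc_from_scores([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is what makes it usable for comparing single-variable models.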

INPUTS  
* df, Spark dataframe 
* target variable as string
* training_fraction  
* max_iterations=10  
* seed=314  

OUTPUTS  
dataframe containing three columns: variable name, weight, AUROC  

#### IMPORTANT NOTES:   
1) If you use the RDD API, LabeledPoint requires that positive label = 1, negative label = 0  
2) Do NOT use the downsampling function
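
Note 1 implies a relabeling step, since this dataset codes the classes as 4 and 2. The mapping can be sketched in plain Python as below; the helper name `to_binary_label` is hypothetical, and in Spark the same mapping would be applied per row or per column rather than with this function.

```python
def to_binary_label(value, positive_label=4, negative_label=2):
    """Map the dataset's class codes to 1 (positive) and 0 (negative).

    Hypothetical helper for illustration; the defaults mirror the
    PARAMETERS cell (4 = malignant, 2 = benign).
    """
    if value == positive_label:
        return 1
    if value == negative_label:
        return 0
    raise ValueError(f"unexpected class value: {value!r}")


print([to_binary_label(v) for v in [2, 4, 4, 2]])  # [0, 1, 1, 0]
```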

Write the function definition in the cell below

Call `compute_univariate_aucs` and print the results from the returned dataframe. Remember not to downsample.