Make sure you run this notebook in a virtual environment! Otherwise the packages installed below may conflict with the ones already on your system.

Look up how to create a virtual environment if you don't know how. Alternatively, you can just run this notebook in Google Colab.
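
If you are not sure whether a virtual environment is active, a quick check like the one below can help. This is only a sketch: the helper name `in_virtual_environment` is made up for illustration, and the check only recognizes environments created with `venv` or `virtualenv` (a conda environment will report False).

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when Python is running inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtual_environment())
```

If this prints False and you are not using conda or Google Colab, activate your environment before running the install commands.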

Then in your virtual environment run the following lines:

***

pip install jupyter

pip install ipykernel

python -m ipykernel install --user --name=myenv --display-name "Python (myenv)"

pip install tensorflow

pip install shap

***

These commands install the packages and libraries that you need for this notebook. I may have forgotten some, in which case you can likely just run pip install "name of package" in the same format as above.

Also, make sure to download all the files used here, including every file in the 'datafiles_for_data_construction' folder. Ideally you would just download the whole GitHub repository, but note that some of the CSV files are rather large.

***

Before we begin, we import some common Python libraries that we will use throughout this notebook.

In [1]:
import pandas as pd
import time

There was no existing dataset that contained the data needed for this project. Thus first we must generate a synthetic dataset. The dataset will be generated based on a variety of real data, mappings between datasets, and artificially generated lists. 

First we import the Data class which contains all the data needed to generate the synthetic dataset.

Next we import the DataGenerator class for the CPU. Note that a version does exist that runs on the GPU.

In [2]:
from datafiles_for_data_construction.data import Data
from data_generation.data_generation_CPU import DataGenerator

Next, we instantiate the Data and DataGenerator classes. The Data class allows us to access all the data needed to generate the synthetic dataset and the DataGenerator class allows us to use the functions needed to generate the synthetic dataset.

In [3]:
data = Data()
data_generator = DataGenerator(data)

What does the data look like? Some of the data is a list of values. Some lists were generated synthetically, others were pulled from various sources. More information can be found in the README file. Here is a list of learning styles:

In [4]:
data.learning_style()["learning_style_list"]

['Visual', 'Auditory', 'Read/Write', 'Kinesthetic']

Some of the data is a dictionary. Some dictionaries map different lists together while others map lists to demographic statistics on how common each item is. This dictionary maps the learning styles to the percentage of people that have said style.

In [5]:
data.learning_style()["learning_style"]

{'Visual': 27.27, 'Auditory': 23.56, 'Read/Write': 21.16, 'Kinesthetic': 28.01}
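
To get a feel for how such a percentage dictionary can drive data generation, here is a minimal sketch of weighted sampling. The function `sample_learning_styles` is hypothetical and is not necessarily how the DataGenerator class does it; only the percentages are taken from the output above.

```python
import random

# Percentages copied from the notebook output above.
learning_style_weights = {
    "Visual": 27.27,
    "Auditory": 23.56,
    "Read/Write": 21.16,
    "Kinesthetic": 28.01,
}


def sample_learning_styles(weights, k, seed=None):
    """Draw k learning styles with probability proportional to their weights."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    styles = list(weights)
    return rng.choices(styles, weights=[weights[s] for s in styles], k=k)


print(sample_learning_styles(learning_style_weights, k=5, seed=0))
```

With a large enough k, the share of each style in the sample should approach the percentages in the dictionary.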

Now we use the generate_synthetic_dataset function to create a dataset from all the data. This function has two inputs:
- number of samples (an integer), which tells the function how many 'students' we want in our dataset
- batch size (an integer), which tells the function how to split up the work to prevent overloading the computer.

You can change the values if you want to generate more or less data. Be careful, as higher values for the number of samples will lead to a longer runtime.
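
To illustrate what splitting the work into batches means, here is a small sketch. The helper `batch_sizes` is made up for this example, and the real generator may organize its batches differently.

```python
def batch_sizes(num_samples, batch_size):
    """Return the size of each batch needed to cover num_samples rows."""
    if num_samples < 0 or batch_size <= 0:
        raise ValueError("num_samples must be >= 0 and batch_size must be > 0")
    full_batches, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full_batches
    if remainder:
        sizes.append(remainder)  # the last batch holds the leftover rows
    return sizes


print(batch_sizes(100, 10))  # ten batches of 10
print(batch_sizes(25, 10))   # [10, 10, 5]
```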

In [6]:
num_samples = 100 # You can change these values if you want
batch_size = 10 # Batch size should be about 1/10 of the number of samples

Now we call the function. Use the time library to see how long the generator takes.

In [7]:
start_time = time.time()
synthetic_data = data_generator.generate_synthetic_dataset(num_samples, batch_size)
end_time = time.time()
runtime = end_time - start_time
print(runtime)

12.857694864273071
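
The same start/stop timing pattern is used for several steps in this notebook. If you prefer, it can be wrapped in a small context manager. This is just an optional sketch; the name `timed` is made up and is not part of this project.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took to run."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.3f} s")


# Example usage with a stand-in workload.
with timed("Example step"):
    sum(range(1_000_000))
```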


'generate_synthetic_dataset' outputs a pandas dataframe. Let's look at the first 5 rows of the dataframe. You can look back at the README file to get a better sense of what each column contains and how it was generated.

In [8]:
synthetic_data.head(n=5) # Change n to larger numbers to see more rows of the dataframe

Unnamed: 0,first name,last name,ethnoracial group,gender,international status,socioeconomic status,learning style,gpa,student semester,major,previous courses,course types,course subjects,subjects of interest,extracurricular activities,career aspirations,future topics
0,Nayleen,Liscombe,European American or white,Nonbinary,Domestic,Lower-middle income,[Visual],2.61,9,[Psychology],"[Data Science Discovery, College Physics: Mech...","[[Lecture, Discussion/Recitation, Laboratory],...","[ARCH, CS, CPSC, PHYS, ACCY, CHEM, HIST, RHET,...","[Psychology, Medicine, Education, Computer Sci...","[Pre-Pharmacy Club, Nursing Students Associati...","[Registered Nurse, Physician Assistant, Elemen...","[Psychology, Sociology, Cognitive Science, Neu..."
1,Kindyl,Neale,European American or white,Female,Domestic,In poverty,[Kinesthetic],2.61,11,[Electrical Engineering],"[Introduction to Computer Science II, Engineer...","[[Laboratory, Laboratory-Discussion, Lecture],...","[TE, ASTR, CLE, CS, FR, PHYS, ATMS, EPOL, SPED...","[Engineering, Nuclear Engineering, Computer Sc...","[Computer Science Club, Robotics Club, Electri...","[Software Developer, Application and System So...","[Mechanical Engineering, Computer Engineering,..."
2,Nala,Burgwin,African American or Black,Female,Domestic,Near poverty,"[Read/Write, Visual]",3.26,2,"[Engineering And Industrial Management, Nutrit...","[Introduction to Game Studies and Design, Data...","[[Online], [Lecture, Laboratory-Discussion], [...","[BCS, ATMS, DANC, HIST, GSD, SPAN, EPSY, CS]","[Nutrition, Food Science and Human Nutrition, ...","[Engineering Society, Culinary Arts Club, Indu...","[Sale Representative, Wholesale and Manufactur...","[Nutrition, Public Health, Biochemistry, Kines..."
3,Sahra,Pahr,European American or white,Male,Domestic,Middle income,[Auditory],2.84,5,[Composition And Rhetoric],"[Principles of Research, Molec and Cellular Ba...","[[Online], [Discussion/Recitation, Lecture-Dis...","[TAM, EDUC, KIN, CHEM, SE, GSD, RHET, CMN, CS,...","[Communications, Rhetoric and Composition, Lit...","[Fraternity Council, Men's Club, Broadcasting ...","[Secretary and Administrative Assistant, Broad...","[English, Journalism, Media and Cinema Studies..."
4,Gautham,Boitnott,African American or Black,Female,Domestic,Middle income,[Read/Write],2.64,5,[Genetics],"[Intro to British Literature, MEP Mentoring, I...","[[Online], [Discussion/Recitation, Laboratory,...","[ENG, PHYS, CHEM, GSD, SPAN, ENGL, MATH, CS, L...","[Literature, Comparative World Literature, Eng...","[Journalism Club, Biology Club, Creative Writi...","[Writer and Author, Life, Physical, and Social...","[World Literatures, Molecular and Cellular Bio..."


Notice that we have columns that are lists and columns that are strings. Machine learning models need the input data to be numerical. Thus some data preprocessing is required.

We import the PreProcessing class to do the preprocessing work.

In [9]:
from data_preprocessing.preprocessing import PreProcessing

Inside the PreProcessing class there are two functions that do the main preprocessing work:
- 'stringlist_to_binarylist': converts a list of strings into a binary list
- 'string_list_to_numberedlist': converts a list of strings into a numbered list.

Imagine the full set of available options is ['alice', 'bob', 'charlie'].
Then for the entry ['alice', 'charlie'] we get:
- [1,0,1] for 'stringlist_to_binarylist'
- [0,2] for 'string_list_to_numberedlist'
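
The two encodings can be sketched in a few lines of plain Python. The functions `to_binary_list` and `to_numbered_list` below are hypothetical stand-ins written for this example; they are not the PreProcessing methods themselves, which may handle edge cases differently.

```python
def to_binary_list(entry, options):
    """One slot per option: 1 if the option appears in entry, else 0."""
    present = set(entry)
    return [1 if option in present else 0 for option in options]


def to_numbered_list(entry, options):
    """Replace each string in entry with its index in options."""
    index_of = {option: i for i, option in enumerate(options)}
    return [index_of[item] for item in entry]


options = ["alice", "bob", "charlie"]
entry = ["alice", "charlie"]
print(to_binary_list(entry, options))    # [1, 0, 1]
print(to_numbered_list(entry, options))  # [0, 2]
```

Note that the binary list always has the same length as the list of options, while the numbered list has the same length as the entry.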

When we instantiate the class and call the 'preprocess_dataset' function, both of the above functions are called on certain columns: 'stringlist_to_binarylist' is called on 'learning style' and 'string_list_to_numberedlist' is called on all the other list columns.

In [11]:
preprocessor = PreProcessing(data)
start_time = time.time()
preprocessed_data = preprocessor.preprocess_dataset(synthetic_data)
end_time = time.time()
runtime = end_time - start_time
print(runtime)

TypeError: eval() arg 1 must be a string, bytes or code object

'preprocess_dataset' outputs a pandas dataframe. Let's look at the first 5 rows of the dataframe.

In [None]:
preprocessed_data.head(n=5) # Change n to larger numbers to see more rows of the dataframe

Now that the data has been preprocessed we must privatize the data to keep it safe.

We import the Privatizer class to do this.

In [10]:
from data_privatization.privatization import Privatizer

There are a variety of privatization methods you can try:
- Basic Differential Privacy (Laplace noise addition)
- Uniform Noise Differential Privacy (uniform noise addition)
- Shuffling

Both Differential Privacy types can be done with or without list lengthening. This means that list columns like 'previous courses' can be lengthened according to the noise addition function. More details can be found in the README file. Let's try basic differential privacy with list lengthening.
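
To make the three ideas more concrete, here is a rough standard-library sketch applied to a single numeric column. Everything in it is an assumption made for illustration: the function names, the `sensitivity` and `epsilon` values, and the noise width are not taken from the Privatizer class, which may work quite differently.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centred at 0."""
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def add_laplace_noise(values, sensitivity, epsilon, rng):
    """Laplace mechanism: the noise scale is sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


def add_uniform_noise(values, half_width, rng):
    """Add noise drawn uniformly from [-half_width, half_width]."""
    return [v + rng.uniform(-half_width, half_width) for v in values]


def shuffle_column(values, rng):
    """Return the same values in a random order, breaking row alignment."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)
gpas = [2.61, 2.61, 3.26, 2.84, 2.64]  # 'gpa' values from the rows shown earlier
print(add_laplace_noise(gpas, sensitivity=1.0, epsilon=0.5, rng=rng))
print(add_uniform_noise(gpas, half_width=0.5, rng=rng))
print(shuffle_column(gpas, rng))
```

A smaller `epsilon` means a larger noise scale, so more privacy but less utility.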

In [None]:
privatization_type = 'basic differential privacy'
# Other 'privatization_type' options: 'uniform', 'shuffle', 'full shuffle' (full shuffle shuffles all of the rows)
privatizer = Privatizer(data, style=privatization_type, list_length=True)
# Set 'list_length' to False if you don't want to allow the list sizes to change

Now we call 'privatize_dataset'. Use the time library to see how long the privatizer takes.

In [None]:
start_time = time.time()
privatized_data = privatizer.privatize_dataset(preprocessed_data)
end_time = time.time()
runtime = end_time - start_time
print(runtime)

'privatize_dataset' outputs a pandas dataframe. Let's look at the first 5 rows of the dataframe.

In [None]:
privatized_data.head(n=5) # Change n to larger numbers to see more rows of the dataframe

We still have the problem of long lists. The 'previous courses' list can be over 30 elements long! Thus we call a new function from the PreProcessing class, 'create_RNN_models'. Three different recurrent neural network models are used to reduce the dimension of each list to 1 element. The three networks are: Simple, GRU (Gated Recurrent Unit), and LSTM (Long Short-Term Memory).
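
To see how a recurrent network can turn a list of any length into a single number, here is a toy version of the simple RNN recurrence in plain Python. It uses one hidden unit with fixed, untrained weights chosen arbitrarily for this example, so it only illustrates the mechanics; the models built by 'create_RNN_models' are assumed to be trained TensorFlow networks and will behave differently.

```python
import math


def simple_rnn_reduce(sequence, w_x=0.1, w_h=0.5, b=0.0):
    """Fold a variable-length numeric list into one number with a scalar RNN."""
    h = 0.0  # hidden state before any element has been seen
    for x in sequence:
        # Simple RNN step: mix the current input with the previous hidden state.
        h = math.tanh(w_x * x + w_h * h + b)
    return h  # the final hidden state is the 1-element summary


short_list = [3, 17, 42]
long_list = list(range(30))
print(simple_rnn_reduce(short_list))
print(simple_rnn_reduce(long_list))
```

Both lists end up as a single value between -1 and 1, regardless of their length. GRU and LSTM networks follow the same idea but add gates that control how much of the earlier elements is remembered.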

Since 'create_RNN_models' takes in a dataframe, there is no need to create a new instance of the PreProcessing class. We call it three times, on:
- 'privatized_data': to reduce the dimensionality of the privatized data
- 'preprocessed_data': to give a non-privatized baseline for comparison at the end
- 'preprocessed_data' with 'utility=True': to reduce the dimensionality of the utility columns

Let's also calculate and compare the runtimes.

In [None]:
start_time = time.time()
privatized_data_reduced = preprocessor.create_RNN_models(privatized_data)
end_time = time.time()
runtime = end_time - start_time
print(f'Privatized data runtime: {runtime}')

start_time = time.time()
nonprivatized_data_reduced = preprocessor.create_RNN_models(preprocessed_data)
end_time = time.time()
runtime = end_time - start_time
print(f'Nonprivatized data runtime: {runtime}')

start_time = time.time()
utility_cols_reduced = preprocessor.create_RNN_models(preprocessed_data, utility=True)
end_time = time.time()
runtime = end_time - start_time
print(f'Utility columns runtime: {runtime}')

'create_RNN_models' outputs a pandas dataframe. Let's look at the first 5 rows of each of the dataframes.

In [None]:
print(privatized_data_reduced.head(n=5))
print(nonprivatized_data_reduced.head(n=5))
print(utility_cols_reduced.head(n=5))
# Change n to larger numbers to see more rows of the dataframe

The goal of balancing the data privatization is to maximize the utility of the dataset while minimizing its privacy loss. Perfectly private data would have no utility, and perfectly useful data would have no privacy.

Since our private columns are distinct classes, the privacy loss will be measured with accuracy where we want a low accuracy to keep the data safe. Meanwhile, after the RNNs our utility columns are essentially continuous. Thus utility gain will be measured with error where we want a low error to keep the data useful.
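
As a small illustration of the two kinds of measurement, here is a sketch with made-up labels and values. Accuracy is computed in the usual way; mean absolute error is used as one possible error measure, which is an assumption, since the text above does not say which error the project uses.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true class."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def mean_absolute_error(true_values, predicted_values):
    """Average absolute difference between true and predicted values."""
    total = sum(abs(t - p) for t, p in zip(true_values, predicted_values))
    return total / len(true_values)


# Made-up private-column labels: lower accuracy means less privacy loss.
true_groups = ["A", "B", "A", "C"]
predicted_groups = ["A", "A", "A", "C"]
print(accuracy(true_groups, predicted_groups))  # 0.75

# Made-up utility-column values: lower error means more utility kept.
true_utility = [0.2, 0.5, 0.9]
predicted_utility = [0.1, 0.5, 1.0]
print(mean_absolute_error(true_utility, predicted_utility))
```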

Our first test for this is the decision tree classifier. Since it is a classifier, we will use it to test privacy loss for the private columns. Import the DTClassifier class from 'decision_tree_classifier'.

In [12]:
from calculating_tradeoffs.decision_tree_classifier.decision_tree_classifier import DTClassifier

ModuleNotFoundError: No module named 'shap'

We will create two instances of the class: one for the privatized data and one for the non-privatized data. We will only run them for the 'Simple1' RNN model to keep things simple. Ba Dum Tss! Similarly, we will only look at the 'ethnoracial group' target.

In [None]:
RNN_model = 'Simple1' # You can change it to 'GRU1' or 'LSTM1' if you like
target = 'ethnoracial group' # You can change it to 'gender', 'international status', or 'socioeconomic status' if you like

private_classifier = DTClassifier(privatization_type, RNN_model, target)
private_classifier.read_data(100)
private_classifier.split_data()

Now that we have instantiated the class, we need to find the best model. We will use a technique called cost complexity pruning to decide how to prune our tree to prevent overfitting.
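
The idea behind cost complexity pruning is to score each pruned version of the tree as its error plus ccp_alpha times its number of leaves, and keep the version with the lowest score. The sketch below shows this trade-off with made-up numbers. It is not the DTClassifier implementation; the name ccp_alpha matches the scikit-learn parameter of the same name, but whether DTClassifier wraps scikit-learn is an assumption.

```python
# Each candidate is a pruned version of the same tree, written as
# (number of leaves, misclassification rate). The values are made up.
candidate_subtrees = [
    (12, 0.05),
    (6, 0.10),
    (3, 0.18),
    (1, 0.40),
]


def cost_complexity(n_leaves, error, ccp_alpha):
    """Error plus a penalty that grows with the size of the tree."""
    return error + ccp_alpha * n_leaves


def best_subtree(candidates, ccp_alpha):
    """Pick the candidate with the lowest cost-complexity score."""
    return min(candidates, key=lambda c: cost_complexity(c[0], c[1], ccp_alpha))


for alpha in (0.0, 0.02, 0.2):
    print(alpha, best_subtree(candidate_subtrees, alpha))
```

With ccp_alpha = 0 the largest tree wins because only the error counts; as ccp_alpha grows, smaller trees are preferred.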

In [None]:
best_model, ccp_alpha = private_classifier.get_best_model(make_graphs=False) # Leave make_graphs=False to prevent the graphs from being produced
print(best_model)

What does this best model look like? Let's print out the decision tree to see for ourselves.

In [None]:
private_classifier.plotter(model=best_model, show_fig=True)

Now that we have the best ccp_alpha, let us run that model and get the classification report.

In [None]:
private_classifier.run_model(ccp_alpha=ccp_alpha, print_report=True, save_files=False, plot_files=False, get_shap=False) # Leave as is to prevent the graphs from being produced