Make sure you are running this in a virtual environment! Otherwise you may cause problems to your main device. 

Look up how to create a virtual environment if you don't know how. Otherwise you can just run this in google colab. 

Then in your virtual environment run the following lines:

***

pip install jupyter

pip install ipykernel

python -m ipykernel install --user --name=myenv --display-name "Python (myenv)"

pip install tensorflow

pip install shap

***

The lines will install different packages and libraries that you need for this notebook. I may have forgotten some in which case you can likely just pip install "name of package" in the same format as above.

Also, make sure to download all the files called here and every file in the 'data_files_for_data_construction' folder. Ideally you just download the whole github, but some of the CSV files are rather large.

***

Before we begin we have some standard python libraries to import that we will use throughout this notebook.

In [1]:
import pandas as pd
import time

There was no existing dataset that contained the data needed for this project. Thus first we must generate a synthetic dataset. The dataset will be generated based on a variety of real data, mappings between datasets, and artificially generated lists. 

First we import the Data class which contains all the data needed to generate the synthetic dataset.

Next we import the DataGenerator class for the CPU. Note that a version does exist that runs on the GPU.

In [2]:
from datafiles_for_data_construction.data import Data
from data_generation.data_generation_CPU import DataGenerator

Next, we instantiate the Data and DataGenerator classes. The Data class allows us to access all the data needed to generate the synthetic dataset and the DataGenerator class allows us to use the functions needed to generate the synthetic dataset.

In [3]:
data = Data()
data_generator = DataGenerator(data)

What does the data look like? Some of the data is a list of values. Some lists were generated synthetically, others were pulled from various sources. More information can be found in the README file. Here is a list of learning styles:

In [4]:
data.learning_style()["learning_style_list"]

['Visual', 'Auditory', 'Read/Write', 'Kinesthetic']

Some of the data is a dictionary. Some dictionaries map different lists together while others map lists to demographic statistics on how common each item is. This dictionary maps the learning styles to the percentage of people that have said style.

In [5]:
data.learning_style()["learning_style"]

{'Visual': 27.27, 'Auditory': 23.56, 'Read/Write': 21.16, 'Kinesthetic': 28.01}

Now we use the generate_synthetic_dataset function to create a dataset from all the data. This function has two inputs:
- number of samples (an integer) which tell the function how many 'students' we want in our dataset
- batch size (an integer) which tells the function how to split up the work to prevent overloading the computer.
You can change the values if you want to generate more or less data. Be careful as higher values for number of samples will lead to a longer runtime.

In [6]:
num_samples = 100 # You can change these values if you want
batch_size = 10 # Batch size should be about 1/10 of the number of samples

Now we call the function. Use the time library to see how long the generator takes.

In [7]:
start_time = time.time()
synthetic_data = data_generator.generate_synthetic_dataset(num_samples, batch_size)
end_time = time.time()
runtime = end_time - start_time
print(runtime)

16.887696743011475


'generate_synthetic_dataset' outputs a pandas dataframe. Lets look at the top 5 elements of the dataframe. You can look back at the README file to get a better sense of what each column contains and how it was generated.

In [8]:
synthetic_data.head(n=5) # Change n to larger numbers to see more rows of the dataframe

Unnamed: 0,first name,last name,ethnoracial group,gender,international status,socioeconomic status,learning style,gpa,student semester,major,previous courses,course types,course subjects,subjects of interest,extracurricular activities,career aspirations,future topics
0,Melton,Onley,European American or white,Female,Domestic,Lower-middle income,[Auditory],3.84,4,[],"[CHBE Profession, Macroeconomic Principles, Sa...","[[Lecture-Discussion], [Discussion/Recitation,...","[SOCW, JS, STAT, EDUC, MILS, MUSC, CHBE, FSHN,...","[Engineering, Nuclear Engineering, Social Work...","[Women's Student Association, Electrical Engin...","[Engineer, Electrical and Electronic Engineer,...","[Environmental Engineering, Electrical Enginee..."
1,Tuf,Hamet,Latino/a/x American,Female,Domestic,Near poverty,[Auditory],3.26,5,[Biological Engineering],"[Mythology of Greece and Rome, Undergraduate O...","[[Lecture-Discussion], [Lecture, Laboratory], ...","[AE, RST, AGCM, ECE, MCB, PS, MATH, ARTD, CLCV...","[Environmental Science, Education, Chemistry, ...","[Relay for Life, Biology Club, Engineering Soc...","[Industrial Engineer, including Health and Saf...","[Material Science, Biochemistry, Chemistry, Ch..."
2,Long,Bartelson,European American or white,Female,Domestic,Near poverty,[Read/Write],3.61,8,[General Social Sciences],"[Undergraduate Open Seminar, Leadership Labora...","[[Laboratory, Lecture-Discussion], [Lecture-Di...","[LING, RST, PHYS, ARTS, EDPR, STAT, ENVS, ARTF...","[Food Science and Human Nutrition, Art, Histor...","[European Student Association, Multicultural S...","[Musician, Singer, and Related Worker, Dietici...","[Anthropology, Cultural Studies, Political Sci..."
3,Mattalynn,Hochstetter,African American or Black,Male,Domestic,Middle income,[Kinesthetic],2.28,12,[Food Science],"[College Physics: Mech and Heat, Intro Asian A...","[[Lecture-Discussion], [Independent Study, Lec...","[JS, MCB, PHYS, HORT, CPSC, GEOL, MUS, CLE, MU...","[Environmental Science, Geology, Environmental...","[Environmental Science Club, Food Science Club...","[Dietician and Nutritionist, Agricultural and ...","[Food Science and Human Nutrition, Medicine, B..."
4,Wale,Sherf,Multiracial,Female,Domestic,Higher income,[Visual],2.93,9,"[International Relations, Miscellaneous Social...","[Intro to Academic Writing II, Beginning Germa...","[[Laboratory, Lecture, Online Lab, Online Lect...","[RST, AE, PHYS, HORT, GER, CPSC, JOUR, MUS, ST...","[History, Languages, International Relations, ...","[Sorority Council, Women's Student Association...","[Secretary and Administrative Assistant, Other...","[Educational Psychology, Education, Sociology,..."


Notice that we have columns that are lists and columns that are strings. Machine learning models need the input data to be numerical. Thus some data preprocessing is required.

We import the Preprocessing class to do the preprocessing work.

In [9]:
from data_preprocessing.preprocessing import PreProcessing

Inside the Preprocessing class there are two functions that do the main preprocessing work:
- 'stringlist_to_binarylist': converts lists of strings into a binary list
- 'string_list_to_numberedlist': converts lits of strings into a numbered list.

Imagine the full options available are ['alice', 'bob', 'charlie']
Thus for the entry ['alice', 'charlie'] we get:
[1,0,1] for 'stringlist_to_binarylist'
[0,2] for 'string_list_to_numberedlist'

When we instantiate the class and call the 'preprocess_dataset' function both of the above functions will be called on certain columns. 'stringlist_to_binarylist' is called on 'learning styles' and 'string_list_to_numberedlist' is called on all the other lists.

In [10]:
preprocessor = PreProcessing(data)
start_time = time.time()
preprocessed_data = preprocessor.preprocess_dataset(synthetic_data)
end_time = time.time()
runtime = end_time - start_time
print(runtime)

0.15332794189453125


'preprocess_dataset' outputs a pandas dataframe. Lets look at the top 5 elements of the dataframe.

In [11]:
preprocessed_data.head(n=5) # Change n to larger numbers to see more rows of the dataframe

Unnamed: 0,learning style,gpa,student semester,major,previous courses,course types,course subjects,subjects of interest,extracurricular activities,career aspirations,future topics
0,"[0, 1, 0, 0]",3.84,4,[],"[482, 2122, 2764, 298, 714, 1832, 3461, 3191, ...","[0, 5, 10, 1, 2, 3, 6, 20]","[156, 113, 160, 67, 134, 138, 45, 86, 61, 206,...","[18, 93, 117, 67, 108, 14, 8]","[11, 28, 62, 48, 19]","[51, 49, 105, 87, 146, 27, 70, 85]","[55, 51, 145, 123, 50]"
1,"[0, 1, 0, 0]",3.26,5,[17],"[2280, 3338, 1313, 1312, 3450, 1312, 1692, 126...","[0, 4, 5, 10, 1, 2, 3, 17]","[6, 154, 10, 64, 129, 228, 127, 20, 52, 118, 7...","[15, 22, 3, 53]","[149, 82, 62, 48, 23, 212, 5]","[70, 27, 23, 85, 87, 82, 52, 51, 105, 49, 24]","[108, 21, 30, 29, 128]"
2,"[0, 0, 1, 0]",3.61,8,[74],"[3338, 2041, 1911, 556, 816, 1346, 3461, 2220,...","[9, 0, 4, 5, 10, 1, 2, 3, 11, 6, 20, 17]","[124, 154, 147, 25, 66, 160, 74, 22, 138, 30, ...","[67, 7, 4, 72, 9]","[22, 34, 107, 71, 90]","[94, 45, 130, 50, 112, 138, 114, 62, 101, 26]","[9, 37, 132, 191, 84]"
3,"[0, 0, 0, 1]",2.28,12,[66],"[607, 1732, 606, 3338, 1077, 2443, 606, 482, 2...","[9, 0, 4, 13, 5, 10, 1, 2, 3, 7, 6, 20, 14]","[113, 129, 147, 102, 55, 88, 137, 53, 138, 216...","[15, 16, 63, 96, 67, 0, 53, 128, 18]","[79, 97, 81, 55]","[45, 4, 35, 79, 65, 108, 121, 12, 53, 54, 100]","[69, 114, 21, 134, 124]"
4,"[1, 0, 0, 0]",2.93,9,"[94, 125]","[1754, 395, 1310, 1346, 816, 1675, 1528, 270, ...","[0, 4, 5, 10, 1, 2, 3, 11, 6, 20, 19]","[154, 6, 147, 102, 89, 55, 112, 137, 160, 138,...","[4, 24, 129, 12, 72]","[9, 11, 107, 110, 60, 71, 27, 98, 169, 84, 43]","[131, 101, 26, 61, 75, 24, 40, 20, 107, 73, 14...","[49, 48, 148, 191, 65]"


Now that the data has been preprocessed we must privatize the data to keep it safe.

We import the Privatizer class to do this.

In [12]:
from data_privatization.privatization import Privatizer

There are a variety of privatization methods you can try:
- Basic Differential Privacy (laplace noise addition)
- Uniform Noise Differential Privacy (uniform noise addition)
- Shuffling
Both Differential Privacy types can be done with or without list lengthening. This means the list columns like 'previous courses' could be lengthened according to the noise addition function. More details can be found in the README file. Let's try basic differential privacy with list lengthening.

In [13]:
privatization_type = 'basic differential privacy'
# Other 'privatization_type' options: 'uniform', 'shuffle', 'full shuffle' (full shuffle shuffles all of the rows)
privatizer = Privatizer(data, style=privatization_type, list_length=True)
# Can set 'list_length' to false if you don't want to allow the list sizes to change

Now we call 'privatize_dataset'. Use the time library to see how long the privatizer takes.

In [14]:
start_time = time.time()
privatized_data = privatizer.privatize_dataset(preprocessed_data)
end_time = time.time()
runtime = end_time - start_time
print(runtime)

0.020509958267211914


'preprocess_dataset' outputs a pandas dataframe. Lets look at the top 5 elements of the dataframe.

In [15]:
privatized_data.head(n=5) # Change n to larger numbers to see more rows of the dataframe

Unnamed: 0,learning style,gpa,student semester,major,previous courses,course types,course subjects,subjects of interest,extracurricular activities
0,"[0, 0, 1, 0]",3.05,12,[],"[425, 1708, 1356, 469, 832, 1144, 3381, 2390, ...","[10, 3, 8, 7, 9, 9, 13, 13, 17, 15, 5, 2, 14, ...","[194, 26, 158, 47, 121, 164, 65]","[5, 123, 63, 109, 115, 102, 108, 101, 94]","[262, 31, 25]"
1,"[1, 1, 1, 0]",3.28,5,[],"[2392, 1787, 86, 3386, 2891, 1637, 2540, 564, ...","[13, 17, 4, 7, 2, 4, 13, 9, 13, 17, 0, 3, 6, 1...","[137, 91, 141, 117, 146, 158, 172, 7, 90, 3, 1...","[24, 129, 101, 117, 117, 34, 50, 120, 84, 73, ...","[138, 230, 218, 265, 277, 246, 88, 73, 151, 18..."
2,"[0, 0, 0, 1]",3.47,0,"[121, 77]","[321, 667, 402, 2561, 620, 1930, 2771, 2990, 5...","[1, 8, 6, 21, 18, 1, 21, 10, 7, 3, 19, 4, 18, ...","[30, 212, 156, 17, 67, 54, 0, 196, 122, 94, 52...","[29, 42, 9, 73, 128, 72, 128, 65, 25, 28, 63, ...","[295, 254, 289, 22, 104, 97, 151, 231, 284, 21..."
3,"[0, 1, 1, 1]",3.73,11,[],"[1391, 2944, 1, 2850, 1837, 947, 2758, 2570, 2...","[12, 19, 8]","[155, 2, 151, 22, 181, 234, 11, 114, 190, 94]","[99, 92, 11, 11, 136, 85, 35, 83, 51, 90, 117,...","[251, 303, 185, 258, 195, 29, 179, 85, 5, 255,..."
4,"[1, 0, 1, 1]",3.73,7,[],"[2251, 241, 425, 1697, 482, 2450, 2029, 1599, ...","[20, 15, 19, 2, 9, 13, 20, 10, 9, 13, 15, 13, ...","[32, 109, 55, 123, 176, 7, 5, 9, 4, 231, 45, 1...","[40, 101, 8, 132, 102, 35, 5]","[122, 239, 80, 37, 28, 239, 86, 147, 161, 198,..."


We still have the problem of long lists. The 'previous courses list' can be over 30 elements long! Thus we call a new function from the Preprocessor class, 'create_RNN_models'. Three different recurrent neural network models are used to reduce the dimension of each list to 1 element. The three networks are: Simple, GRU (Gated Recurrent Units), and LSTM (Long Term Short Memory).

Since 'create_RNN_models' takes in a dataframe, there is no need to create a new instance of the Preprocessor class. Thus we should call:
- 'privatized_data': reduce dimensionality
- 'preprocessed_data': give a null for comparison at the end
- 'preprocessed_data' with 'utility=True': reduce dimensionality of the utility columns

Let's also calculate and compare the runtimes.

Let us just use the simple RNN dimensionality reduction. Though this can be switch by changing 'simple' to False and a different method to True.

In [16]:
# Only let one of simple, LSTM, and GRU be equal to true.

start_time = time.time()
privatized_data_reduced = preprocessor.create_RNN_models(privatized_data, simple=True, LSTM=False, GRU=False)
end_time = time.time()
runtime_pd = end_time - start_time

start_time = time.time()
nonprivatized_data_reduced = preprocessor.create_RNN_models(preprocessed_data, simple=True, LSTM=False, GRU=False)
end_time = time.time()
runtime_npd = end_time - start_time

start_time = time.time()
utility_cols_reduced = preprocessor.create_RNN_models(preprocessed_data, utility=True, simple=True, LSTM=False, GRU=False)
end_time = time.time()
runtime_uc = end_time - start_time

print(f'Privatized data runtime: {runtime_pd}')
print(f'Nonprivatized data runtime: {runtime_npd}')
print(f'Utility columns runtime: {runtime_uc}')

2024-07-13 15:45:35.840766: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M2
2024-07-13 15:45:35.840798: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 8.00 GB
2024-07-13 15:45:35.840805: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 2.67 GB
2024-07-13 15:45:35.841177: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-07-13 15:45:35.841194: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


Epoch 1/10


2024-07-13 15:45:36.634402: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.


[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 76ms/step - loss: 0.3076
Epoch 2/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step - loss: 0.1170
Epoch 3/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 36ms/step - loss: 0.1015
Epoch 4/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 39ms/step - loss: 0.0834
Epoch 5/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 36ms/step - loss: 0.0954
Epoch 6/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 35ms/step - loss: 0.0808
Epoch 7/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step - loss: 0.0724
Epoch 8/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step - loss: 0.0764
Epoch 9/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step - loss: 0.0832
Epoch 10/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step - loss: 0.0772
[1m4/4[0m [32m━━━━━━━━━━━━━



[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 137ms/step



[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 162ms/step
Epoch 1/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 290ms/step - loss: 0.3533
Epoch 2/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 263ms/step - loss: 0.1042
Epoch 3/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 262ms/step - loss: 0.0774
Epoch 4/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 264ms/step - loss: 0.0715
Epoch 5/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 262ms/step - loss: 0.0604
Epoch 6/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 267ms/step - loss: 0.0485
Epoch 7/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 309ms/step - loss: 0.0470
Epoch 8/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 266ms/step - loss: 0.0411
Epoch 9/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 261ms/step - loss: 0.0374
Epoch 10/10
[1m4/4[0m [32m━━━━━━━

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 2s/step - loss: 0.3457
Epoch 2/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 2s/step - loss: 0.0764
Epoch 3/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 1s/step - loss: 0.0913
Epoch 4/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 1s/step - loss: 0.0513
Epoch 5/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 1s/step - loss: 0.0541
Epoch 6/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 1s/step - loss: 0.0389
Epoch 7/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 1s/step - loss: 0.0384
Epoch 8/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 1s/step - loss: 0.0355
Epoch 9/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 1s/step - loss: 0.0241
Epoch 10/10
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 1s/step - loss: 0.0232
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m

Since this code can be run to produce many dataframes the output is a list. We only made one dataframe so let's just take the first element from the list.

In [17]:
privatized_data_reduced = privatized_data_reduced[0]

nonprivatized_data_reduced = nonprivatized_data_reduced[0]

utility_cols_reduced = utility_cols_reduced[0]

'create_RNN_models' outputs a pandas dataframe. Lets look at the top 5 elements for each of the dataframes

In [18]:
print(privatized_data_reduced.head(n=5))
print(nonprivatized_data_reduced.head(n=5))
print(utility_cols_reduced.head(n=5))
# Change n to larger numbers to see more rows of the dataframe

  learning style   gpa  student semester         major previous courses  \
0   [0.51466924]  3.05                12  [0.57751447]      [0.5620219]   
1    [0.4353882]  3.28                 5  [0.57751447]      [0.7609531]   
2    [0.5776786]  3.47                 0  [0.53115463]    [0.017651517]   
3   [0.53910905]  3.73                11  [0.57751447]     [0.28403044]   
4    [0.5401469]  3.73                 7  [0.57751447]      [0.7765268]   

   course types course subjects subjects of interest  \
0   [0.1339442]     [0.5637837]         [0.51197636]   
1  [0.56017077]     [0.4102107]         [0.37886214]   
2    [0.631463]     [0.5462269]         [0.51299524]   
3    [0.589439]      [0.642634]          [0.8003199]   
4   [0.6601572]    [0.27176663]         [0.50578284]   

  extracurricular activities  
0               [0.58821774]  
1                [0.4986311]  
2               [0.17438826]  
3                [0.6991976]  
4                [0.5451047]  
  learning style   gpa  st

Now we have to preprocess the private columns so that machine learning models can train on them. To do this we import the PrivateColumns from the 'preprocessing_private_columns' file.

In [19]:
from data_preprocessing.processing_private_columns import PrivateColumns

Now we can instantiate the class.

In [20]:
private_columns_data = PrivateColumns(data)

Get the privacy columns from the nonprivatized dataset as we only need them once.

In [21]:
privacy_columns = private_columns_data.get_private_cols(synthetic_data)

Now we use pandas to combine back together the feature columns, the private columns, and the utility columns. We do this for both the privatized and nonprivatized dataset

In [22]:
privatized_combined = pd.concat([privatized_data_reduced, utility_cols_reduced, privacy_columns])
nonprivatized_combined = pd.concat([nonprivatized_data_reduced, utility_cols_reduced, privacy_columns])

Let's look at the top five elements of both the combined dataframes.

In [23]:
print(privatized_combined.head(n=5))
print(nonprivatized_combined.head(n=5))
# Change n to larger numbers to see more rows of the dataframe

  learning style   gpa  student semester         major previous courses  \
0   [0.51466924]  3.05              12.0  [0.57751447]      [0.5620219]   
1    [0.4353882]  3.28               5.0  [0.57751447]      [0.7609531]   
2    [0.5776786]  3.47               0.0  [0.53115463]    [0.017651517]   
3   [0.53910905]  3.73              11.0  [0.57751447]     [0.28403044]   
4    [0.5401469]  3.73               7.0  [0.57751447]      [0.7765268]   

   course types course subjects subjects of interest  \
0   [0.1339442]     [0.5637837]         [0.51197636]   
1  [0.56017077]     [0.4102107]         [0.37886214]   
2    [0.631463]     [0.5462269]         [0.51299524]   
3    [0.589439]      [0.642634]          [0.8003199]   
4   [0.6601572]    [0.27176663]         [0.50578284]   

  extracurricular activities career aspirations future topics  \
0               [0.58821774]                NaN           NaN   
1                [0.4986311]                NaN           NaN   
2               [

Now that we have our combined datasets we are ready to begin testing!

***

The reason for balancing the data privatization is to maximize the utility of the dataset while minimizing the privacy loss of the dataset. Perfectly private data would have no utility and vice versa.

Since our private columns are distinct classes, the privacy loss will be measured with accuracy where we want a low accuracy to keep the data safe. Meanwhile, after the RNNs our utility columns are essentially continuous. Thus utility gain will be measured with error where we want a low error to keep the data useful.

***

Our first test for this is the classifier decision tree. Since it is a classifier we will be using it to test privacy loss for the private columns. Import the DTClassifier class from 'decision_tree_classifier'.

In [24]:
from calculating_tradeoffs.decision_tree_classifier.decision_tree_classifier import DTClassifier

First we instantiate 2 classes. One for the privatized data and one for the non privatized data. We will only run them for the 'Simple1' RNN model to keep things simple. Ba Dum Tss! Similarly we will only look at the 'ethnoracial group' target.

In [25]:
RNN_model = 'Simple1' # You can change it to 'GRU1' or 'LSTM1' if you like
target = 'ethnoracial group' # You can change it to 'gender', 'international status', or 'socioeconomic status' if you like
privatization_type = 'Basic_DP' # You can change it to 'NoPrivatization', 'Basic_DP_LLC', 'Uniform', 'Uniform_LLC', 'Shuffling', or 'Complete_Shuffling' if you like

private_classifier = DTClassifier(privatization_type, RNN_model, target, privatized_combined)

Now that we have instantiated the class we need to find the best model. We will use a metric called cost complexity pruning to decide how to prune our tree to prevent overfitting.

In [26]:
best_model, ccp_alpha = private_classifier.get_best_model(make_graphs=False) # Leave this input to prevent the graphs from being produced
print(best_model)

TypeError: 'float' object is not subscriptable

What does this best model look like? Let's print out the decision tree to see for ourselves.

In [None]:
private_classifier.plotter(model=best_model, show_fig=True)

Now that we have the best ccp alpha let us run that model and get the classification report.

In [None]:
private_classifier.run_model(ccp_alpha=ccp_alpha, print_report=True, save_files=False, plot_files=False, get_shap=False) # Leave as is to prevent the graphs from being produced