In [117]:
import pandas as pd
import numpy as np

# a) Generate a Dataset

*Randomly generate a dataset (dataframe) with eight columns and 50,000 rows. Each column should be a categorical variable (of arbitrary name) with three levels (of arbitrary names) in approximately equal proportions.*

## i. Random Generation

In [118]:
examples=50000
features=8
num_categories=3

#Encode each of the three categories as integers
#Generate random floats from 0-1. Multiply by num_categories to get random numbers over the right range, then round down to get integers.
rand_arr = np.floor(np.random.rand(examples,features)*num_categories).astype(int)
rand_arr

array([[0, 0, 2, ..., 0, 2, 0],
       [0, 0, 2, ..., 2, 2, 2],
       [1, 2, 1, ..., 0, 1, 2],
       ...,
       [0, 0, 2, ..., 0, 2, 2],
       [1, 2, 0, ..., 2, 2, 2],
       [1, 1, 0, ..., 1, 0, 1]])
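
For comparison, the same kind of array can be drawn directly as integers with NumPy's `Generator` interface. This is only a sketch of an alternative, not the method used above; the seed value is an arbitrary assumption added for reproducibility.

```python
import numpy as np

examples = 50000
features = 8
num_categories = 3

# Seeded generator so the dataset is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(seed=0)

# Draw integers uniformly from {0, 1, 2}; the upper bound `high` is exclusive
rand_arr = rng.integers(low=0, high=num_categories, size=(examples, features))
print(rand_arr.shape)
```

Drawing integers directly avoids the multiply-and-floor step and makes the intended range explicit.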

## ii. Convert to a Pandas Dataframe

In [119]:
#Name the columns for clarity, deriving the number of columns from `features` rather than hard-coding 8
col_names = [f"Column {col}" for col in range(features)]

#Pass the names straight to the constructor instead of renaming afterwards
df = pd.DataFrame(rand_arr, columns=col_names)

df


,Column 0,Column 1,Column 2,Column 3,Column 4,Column 5,Column 6,Column 7
0,0,0,2,0,2,0,2,0
1,0,0,2,2,0,2,2,2
2,1,2,1,1,2,0,1,2
3,0,0,0,0,0,2,1,1
4,2,2,2,0,1,2,1,1
...,...,...,...,...,...,...,...,...
49995,1,0,2,0,1,2,1,2
49996,1,0,1,0,1,1,1,2
49997,0,0,2,0,0,0,2,2
49998,1,2,0,1,0,2,2,2


# b) Verify Distribution

*Verify that the proportions of each value are similar for each of the eight columns.*

In [120]:
#for each of the integer encoded labels, create a series with the number of instances of that label for each column; each series becomes one row of a new dataframe
#the rows are collected in a list because DataFrame.append is unavailable from pandas 2.0 onwards
df_counts = pd.DataFrame([df[df == val].count() for val in range(num_categories)])

df_counts.index.set_names('Category Label', inplace=True)
print("Count of Category Labels by Column:")
df_counts

Count of Category Labels by Column:


Category Label,Column 0,Column 1,Column 2,Column 3,Column 4,Column 5,Column 6,Column 7
0,16804.0,16631.0,16637.0,16756.0,16552.0,16734.0,16603.0,16618.0
1,16592.0,16693.0,16676.0,16576.0,16675.0,16753.0,16693.0,16752.0
2,16604.0,16676.0,16687.0,16668.0,16773.0,16513.0,16704.0,16630.0


The proportions of each label are similar across all eight columns: every count is within about 1% of the expected 50,000 / 3 ≈ 16,667, which is consistent with the sampling variation we would see from random generation.
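
One way to put a number on "similar" is to compare each column's label proportions with the ideal 1/3 and with the binomial standard error. The sketch below regenerates its own data (with an arbitrary seed and a hypothetical `demo_df` name) so that it runs on its own; it is an illustration rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

examples, features, num_categories = 50000, 8, 3
rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only
demo_df = pd.DataFrame(
    rng.integers(0, num_categories, size=(examples, features)),
    columns=[f"Column {i}" for i in range(features)],
)

# Proportion of each label within each column (rows: labels, columns: features)
proportions = demo_df.apply(lambda col: col.value_counts(normalize=True)).sort_index()

# Largest absolute departure from the ideal proportion of 1/3
max_deviation = (proportions - 1 / num_categories).abs().to_numpy().max()

# Binomial standard error of a single proportion, for scale
std_error = np.sqrt((1 / 3) * (2 / 3) / examples)

print(proportions.round(4))
print(f"max deviation: {max_deviation:.4f}, one standard error: {std_error:.4f}")
```

Deviations of a few standard errors or less are what random sampling alone would produce.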

# c) Unique Combinations

*How many unique rows (i.e., permutations of category levels) are possible?*

In [121]:
print("There are {} unique permutations.".format(num_categories**features))

There are 6561 unique permutations.
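
The figure can be cross-checked by enumerating every combination explicitly. This is a small sketch using the standard library; the variable names mirror those above.

```python
from itertools import product

num_categories = 3
features = 8

# Every ordered choice of one level per column
all_permutations = list(product(range(num_categories), repeat=features))

print(len(all_permutations))  # 6561
```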


# d) Permutation Frequency

*Produce a table and appropriate graph which show the frequencies (numbers of groups) by permutation group sizes up to group size of 12. That is, how many groups are unique combinations (group size = 1), how many groups are made up of a pair of matching combinations (group size = 2), how many groups are made up of three the same, etc?*


## i) Create Table

Count the number of instances of each permutation:

In [122]:
s_perm_frequency = df.pivot_table(index=col_names, aggfunc='size').rename("Count")
s_perm_frequency


Column 0  Column 1  Column 2  Column 3  Column 4  Column 5  Column 6  Column 7
0         0         0         0         0         0         0         0            9
                                                                      1           12
                                                                      2            8
                                                            1         0           10
                                                                      1           10
                                                                                  ..
2         2         2         2         2         2         1         1            8
                                                                      2            6
                                                            2         0            9
                                                                      1            9
                                                                      2

Group by the permutation count and count the size of the groups:

In [125]:
#count
s_permutation_count = s_perm_frequency.value_counts().sort_index()
#convert to dataframe, naming the column explicitly because the default name differs between pandas versions
df_permutation_count = s_permutation_count.to_frame(name="Count")
#name the index
df_permutation_count.index.set_names("Group Size", inplace=True)
# Cut off after group size of 12
max_group_size = 12
df_permutation_count = df_permutation_count.loc[df_permutation_count.index <= max_group_size]
df_permutation_count

Group Size,Count
1,16
2,98
3,244
4,472
5,654
6,913
7,945
8,889
9,743
10,612


## ii) Plot

In [124]:
import plotly.express as px

px.line(df_permutation_count.reset_index(), x = "Group Size", y = "Count", title="Unique Permutations")

# e) Distribution

*Comment Upon the Distribution of Group Sizes in d)*

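The group sizes follow a roughly bell-shaped distribution centred near the mean group size of 50,000 / 6,561 ≈ 7.6, with the mode at 7. Because every permutation is equally likely, the number of times any one permutation occurs is binomial (approximately Poisson with mean 7.6), which looks close to normal but has a slight right skew. This is what would be expected from randomly generated data.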
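
As a rough check on the shape of the distribution, the expected number of permutations at each group size can be estimated from a Poisson approximation with mean 50,000 / 6,561. The sketch below uses only the standard library; the Poisson model is an assumption (an approximation to the exact binomial), and `poisson_pmf` is a helper defined here for illustration.

```python
import math

examples = 50000
num_permutations = 3 ** 8  # 6561 possible rows
mean_group_size = examples / num_permutations  # about 7.62


def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Expected number of permutations that occur exactly k times, for k = 1..12
expected = {k: num_permutations * poisson_pmf(k, mean_group_size) for k in range(1, 13)}

for k, count in expected.items():
    print(f"{k:>2}: {count:7.1f}")
```

The expected counts peak at a group size of 7, in line with the observed table.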

# f) Privacy Implications

*If your random variables were, in fact, meaningful information on individuals, which group sizes are of most concern from a privacy perspective?*

Assuming that the individual category labels are not personally identifying, privacy becomes a concern when combinations of these anonymous pieces of information become identifying. It is therefore the small group sizes in the above dataset that are of most concern, with a group size of 1 (a unique combination) being the most serious.

For example, suppose this was an equality monitoring survey (which should be completely anonymous) conducted in a workplace. If we had some prior information on the employees, we might be able to identify the survey responses of a target individual and gain access to what should be anonymous information about them. This is clearly undesirable from a privacy perspective. The smaller the group size, the fewer pieces of prior information we need to identify individuals.
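
To make this concrete, rows that fall into small groups can be flagged by attaching each row's group size to it. The sketch below uses toy data; the column names and the `min_group_size` threshold are hypothetical choices for illustration.

```python
import pandas as pd

# Toy data: two hypothetical categorical columns
toy = pd.DataFrame({"a": [0, 0, 0, 1, 1, 1], "b": [0, 0, 1, 1, 1, 1]})
columns = ["a", "b"]
min_group_size = 2  # hypothetical threshold below which a row is flagged

# Size of the group each row belongs to, aligned with the original rows
group_size = toy.groupby(columns)[columns[0]].transform("size")

# Rows whose combination of values is shared by too few other rows
at_risk = toy[group_size < min_group_size]
print(at_risk)
```

Here only the row with the combination (0, 1) is flagged, because no other row shares it.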

# g) Missing Data

*Consider the effect of missing data in the dataset you created in Part a).  How might this complicate the production of a frequency table of group sizes in Part d)?*
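
One complication can be illustrated with pandas' `groupby`, which by default excludes rows that have a missing value in any grouping column. The sketch below uses toy data and `groupby` rather than the `pivot_table` call from Part d), so how that call treats missing values would need to be checked separately.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": ["x", "x", "x", "x"], "b": ["y", np.nan, np.nan, "y"]})

# Default behaviour: rows with a missing key are silently excluded from every group
default_sizes = toy.groupby(["a", "b"]).size()

# dropna=False keeps missing values as their own level (pandas 1.1 or later)
with_missing = toy.groupby(["a", "b"], dropna=False).size()

print(default_sizes.sum(), with_missing.sum())
```

With the default setting, the group sizes no longer add up to the number of rows, so a frequency table built from them would undercount. Treating a missing value as its own level keeps every row, but whether two rows with the same missing field really "match" is a judgment call.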

# h) Deployment

*Imagine the code that you wrote for Part d) was to be deployed in an automated system that Mirador’s customers could use independently, on potentially large volumes of their own data. Describe how you might deploy the code, and what additional considerations you might have or any changes to the code you might make.*

- The code is already largely dynamic: it will account for any number of columns, examples, or categories. It will probably work with different numbers of categories per column, but this would need to be tested to be sure.
- Categorical variables would need to be one-hot encoded.
- Group size cutoff: this is currently specified manually, but it could be calculated from the distribution to account for different datasets (e.g. mean + 2 standard deviations).
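
As a starting point for deployment, the Part d) logic could be wrapped in a single function that takes a customer's dataframe. This is a minimal sketch under stated assumptions: the function name `group_size_table` and its parameters are hypothetical, it uses `groupby` rather than `pivot_table`, and it keeps missing values as their own level.

```python
import pandas as pd


def group_size_table(df, columns=None, max_group_size=12):
    """Return how many distinct row combinations occur once, twice, and so on."""
    columns = list(df.columns) if columns is None else list(columns)

    # Number of rows sharing each observed combination of values
    combination_counts = df.groupby(columns, dropna=False).size()

    # Number of combinations at each group size, smallest group size first
    table = (
        combination_counts.value_counts()
        .sort_index()
        .rename_axis("Group Size")
        .to_frame(name="Count")
    )
    return table.loc[table.index <= max_group_size]


# Usage on toy data: two combinations occur once and two occur twice
toy = pd.DataFrame({"a": [0, 0, 1, 2, 2, 3], "b": [0, 0, 0, 0, 0, 0]})
table = group_size_table(toy)
print(table)
```

Packaging the logic this way makes it easier to test in isolation and to call from a batch job or a service. For very large inputs, memory use and chunked processing would still need to be considered.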

