# Assignment 2 specification

The purpose of this assignment is to use clustering and classification to predict various aspects of internet users based on data collected from a survey.

The survey has a large number of columns (features) so you will need to choose a suitable subset both for clustering and for classification.

The rest of this notebook provides basic help with preparing the data for analysis.

## Background

In [23]:
import pandas as pd
import numpy as np

See https://www.openml.org/d/372 for a description of the dataset and https://www.openml.org/data/download/52407/internet_usage.arff for the data file itself. The data comprises mostly binary columns and some categorical (multi-valued) columns, with just 2 numeric columns, all relating to internet users circa 1997.

The first thing to do is to download the data and to load it into memory. The first cell below saves the file in the same directory as this notebook, which is where the loading code expects to find it.

In [31]:
# Download the data file from the internet so that a local copy is available

import requests

data_URL = "https://www.openml.org/data/download/52407/internet_usage.arff"
r = requests.get(data_URL, allow_redirects=True)

open('internet_usage.arff', 'wb').write(r.content)

2816656

In [32]:
from scipy.io import arff
filePath = 'internet_usage.arff'
data, meta = arff.loadarff(filePath)
df = pd.DataFrame(data)
df.head()

Unnamed: 0,Actual_Time,Age,Community_Building,Community_Membership_Family,Community_Membership_Hobbies,Community_Membership_None,Community_Membership_Other,Community_Membership_Political,Community_Membership_Professional,Community_Membership_Religious,...,Web_Page_Creation,Who_Pays_for_Access_Dont_Know,Who_Pays_for_Access_Other,Who_Pays_for_Access_Parents,Who_Pays_for_Access_School,Who_Pays_for_Access_Self,Who_Pays_for_Access_Work,Willingness_to_Pay_Fees,Years_on_Internet,who
0,b'Consultant',b'41',b'Equally',b'0',b'0',b'1',b'0',b'0',b'0',b'0',...,b'Yes',b'0',b'0',b'0',b'0',b'1',b'0',b'Other_sources',b'1-3_yr',b'93819'
1,b'College_Student',b'28',b'Equally',b'0',b'0',b'0',b'0',b'0',b'0',b'0',...,b'No',b'0',b'0',b'0',b'0',b'1',b'0',b'Already_paying',b'Under_6_mo',b'95708'
2,b'Other',b'25',b'More',b'1',b'1',b'0',b'0',b'0',b'1',b'0',...,b'Yes',b'0',b'0',b'0',b'0',b'1',b'1',b'Other_sources',b'1-3_yr',b'97218'
3,b'Salesperson',b'28',b'More',b'0',b'0',b'0',b'1',b'0',b'0',b'0',...,b'Yes',b'0',b'0',b'0',b'0',b'1',b'0',b'Already_paying',b'1-3_yr',b'91627'
4,b'K-12_Student',b'17',b'More',b'0',b'0',b'0',b'0',b'1',b'1',b'0',...,b'Yes',b'0',b'0',b'0',b'0',b'1',b'0',b'Already_paying',b'1-3_yr',b'49906'


As can be seen, the data is loaded into a dataframe, but every value is a Python byte string (note the b'...' prefix). We decode these into ordinary strings, as they are much easier to handle.

In [33]:
for col in df.columns:
  df[col] = df[col].apply(lambda x: x.decode("utf-8"))
df.head()

Unnamed: 0,Actual_Time,Age,Community_Building,Community_Membership_Family,Community_Membership_Hobbies,Community_Membership_None,Community_Membership_Other,Community_Membership_Political,Community_Membership_Professional,Community_Membership_Religious,...,Web_Page_Creation,Who_Pays_for_Access_Dont_Know,Who_Pays_for_Access_Other,Who_Pays_for_Access_Parents,Who_Pays_for_Access_School,Who_Pays_for_Access_Self,Who_Pays_for_Access_Work,Willingness_to_Pay_Fees,Years_on_Internet,who
0,Consultant,41,Equally,0,0,1,0,0,0,0,...,Yes,0,0,0,0,1,0,Other_sources,1-3_yr,93819
1,College_Student,28,Equally,0,0,0,0,0,0,0,...,No,0,0,0,0,1,0,Already_paying,Under_6_mo,95708
2,Other,25,More,1,1,0,0,0,1,0,...,Yes,0,0,0,0,1,1,Other_sources,1-3_yr,97218
3,Salesperson,28,More,0,0,0,1,0,0,0,...,Yes,0,0,0,0,1,0,Already_paying,1-3_yr,91627
4,K-12_Student,17,More,0,0,0,0,1,1,0,...,Yes,0,0,0,0,1,0,Already_paying,1-3_yr,49906


The dataframe looks more standard now, but we notice that there is an anonymised user code 'who' which is a candidate for the dataframe's index. We check that each row has a unique 'who' value:

In [34]:
numRows = df.shape[0]
numUniq = len(df['who'].unique().tolist())
print(numRows-numUniq)

0


It does, so we set 'who' as the index and it no longer appears in the list of columns, which we can check below.

In [35]:
if 'who' in df.columns:
  df.set_index('who', inplace=True)

print(df.columns)

Index(['Actual_Time', 'Age', 'Community_Building',
       'Community_Membership_Family', 'Community_Membership_Hobbies',
       'Community_Membership_None', 'Community_Membership_Other',
       'Community_Membership_Political', 'Community_Membership_Professional',
       'Community_Membership_Religious', 'Community_Membership_Support',
       'Country', 'Disability_Cognitive', 'Disability_Hearing',
       'Disability_Motor', 'Disability_Not_Impaired', 'Disability_Not_Say',
       'Disability_Vision', 'Education_Attainment',
       'Falsification_of_Information', 'Gender', 'Household_Income',
       'How_You_Heard_About_Survey_Banner',
       'How_You_Heard_About_Survey_Friend',
       'How_You_Heard_About_Survey_Mailing_List',
       'How_You_Heard_About_Survey_Others',
       'How_You_Heard_About_Survey_Printed_Media',
       'How_You_Heard_About_Survey_Remebered',
       'How_You_Heard_About_Survey_Search_Engine',
       'How_You_Heard_About_Survey_Usenet_News',
       'How_You_Heard

As can be seen, we have ensured that the 'who' column is no longer available as a feature. Also note that all columns are stored with dtype 'object', effectively as strings. For your convenience, I have classified the columns for you as numeric, 'boolean' or string (see below). The code below also converts the numeric columns to numbers and maps the 'boolean' columns from the strings '0' and '1' to the integers 0 and 1, so that they are ready for analysis.

In [36]:
numericCols = ['Age', 'Opinions_on_Censorship']
boolCols = ['Community_Membership_Family', 'Community_Membership_Hobbies',
       'Community_Membership_None', 'Community_Membership_Other',
       'Community_Membership_Political', 'Community_Membership_Professional',
       'Community_Membership_Religious', 'Community_Membership_Support',
       'Disability_Cognitive', 'Disability_Hearing',
       'Disability_Motor', 'Disability_Not_Impaired', 'Disability_Not_Say',
       'Disability_Vision', 'How_You_Heard_About_Survey_Banner',
       'How_You_Heard_About_Survey_Friend',
       'How_You_Heard_About_Survey_Mailing_List',
       'How_You_Heard_About_Survey_Others',
       'How_You_Heard_About_Survey_Printed_Media',
       'How_You_Heard_About_Survey_Remebered',
       'How_You_Heard_About_Survey_Search_Engine',
       'How_You_Heard_About_Survey_Usenet_News',
       'How_You_Heard_About_Survey_WWW_Page', 'Not_Purchasing_Bad_experience',
       'Not_Purchasing_Bad_press', 'Not_Purchasing_Cant_find',
       'Not_Purchasing_Company_policy', 'Not_Purchasing_Easier_locally',
       'Not_Purchasing_Enough_info', 'Not_Purchasing_Judge_quality',
       'Not_Purchasing_Never_tried', 'Not_Purchasing_No_credit',
       'Not_Purchasing_Not_applicable', 'Not_Purchasing_Not_option',
       'Not_Purchasing_Other', 'Not_Purchasing_Prefer_people',
       'Not_Purchasing_Privacy', 'Not_Purchasing_Receipt',
       'Not_Purchasing_Security', 'Not_Purchasing_Too_complicated',
       'Not_Purchasing_Uncomfortable', 'Not_Purchasing_Unfamiliar_vendor',
           'Who_Pays_for_Access_Dont_Know',
       'Who_Pays_for_Access_Other', 'Who_Pays_for_Access_Parents',
       'Who_Pays_for_Access_School', 'Who_Pays_for_Access_Self',
       'Who_Pays_for_Access_Work']
strCols = ['Actual_Time', 'Community_Building', 'Country', 'Education_Attainment', 'Falsification_of_Information',
           'Gender', 'Household_Income', 'Major_Geographical_Location', 'Major_Occupation', 'Marital_Status',
           'Most_Import_Issue_Facing_the_Internet', 'Primary_Computing_Platform', 'Primary_Language',
           'Primary_Place_of_WWW_Access', 'Race', 'Registered_to_Vote', 'Sexual_Preference', 'Web_Ordering',
           'Web_Page_Creation', 'Willingness_to_Pay_Fees', 'Years_on_Internet' ]


for col in numericCols:
  df[col] = pd.to_numeric(df[col], errors='coerce')

for col in boolCols:
  df[col] = df[col].map({'0': 0, '1': 1})


According to the data description, the original internet_usage data had 2699 missing values in the 'Primary_Computing_Platform' column. In this version of the dataset, the missing values have already been replaced with '?', see below, so no further action is needed.

In [37]:
col = 'Primary_Computing_Platform'
df[col].value_counts()

Win95        4359
?            2699
Macintosh    1466
Windows       581
NT            450
Unix          212
Dont_Know      87
OS2            84
PC_Unix        76
DOS            54
Other          33
VT100           7
Name: Primary_Computing_Platform, dtype: int64
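
Leaving '?' as its own category is one option. If you would rather have pandas treat it as a genuinely missing value, something like the following might work. This is only a sketch on a small hand-made series (`platform` is a stand-in, not the survey column itself):

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for the 'Primary_Computing_Platform' column
platform = pd.Series(["Win95", "?", "Macintosh", "?", "Unix"],
                     name="Primary_Computing_Platform")

# Share of rows where the platform is recorded as '?'
missing_share = (platform == "?").mean()
print(missing_share)

# Optional: turn '?' into a real missing value so that isna(), dropna(), etc. see it
platform_nan = platform.replace("?", np.nan)
print(platform_nan.isna().sum())
```

Whether to keep '?' as a category or to drop or impute it depends on the algorithm you choose, so be ready to justify the decision.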

Most scikit-learn clustering and classification algorithms require _all_ features to be numeric. The `boolCols` features are already encoded as numeric 0,1 but the `strCols` need further processing. The code below shows how to _One Hot Encode_ a selection of features, and how to combine them in a dataframe. Note that we create a Python `dict` of `OneHotEncoder`s rather than reusing a single instance, so that we have the option of transforming back to the original string labels later.

In [38]:
from sklearn.preprocessing import OneHotEncoder
import random

ohe = dict()

# Choose a seed so that this code is repeatable, and select some features for the model 
random.seed(42)
originalStrCols = random.sample(strCols,k=4)
print(originalStrCols)
sampledBoolCols = random.sample(boolCols,k=5)
print(sampledBoolCols)

# Create an empty dataframe
dfSub = pd.DataFrame()

for col in originalStrCols:
  # sparse_output needs scikit-learn >= 1.2; older versions call this argument sparse
  ohe[col] = OneHotEncoder(sparse_output=False)
  X = ohe[col].fit_transform(df[col].values.reshape(-1,1))
  # See https://stackoverflow.com/a/4843172
  dfOneHot = pd.DataFrame(X, columns = [col+'-'+str(i) for i in range(X.shape[1])])
  dfSub = pd.concat([dfSub, dfOneHot], axis=1)

# Assign the index so that it matches that of the original df
# (set_axis returns a new dataframe; its inplace argument was removed in pandas 2.0)
dfSub = dfSub.set_axis(df.index, axis='index')

# Add in the sampledBoolcols
dfSub = pd.concat([dfSub, df[sampledBoolCols]], axis=1)

# The following is the matrix of samples x features
dfSub.head(10)

['Years_on_Internet', 'Education_Attainment', 'Actual_Time', 'Major_Occupation']
['How_You_Heard_About_Survey_Friend', 'How_You_Heard_About_Survey_Banner', 'Disability_Cognitive', 'Who_Pays_for_Access_Work', 'Community_Membership_Religious']


who,Years_on_Internet-0,Years_on_Internet-1,Years_on_Internet-2,Years_on_Internet-3,Years_on_Internet-4,Education_Attainment-0,Education_Attainment-1,Education_Attainment-2,Education_Attainment-3,Education_Attainment-4,...,Major_Occupation-0,Major_Occupation-1,Major_Occupation-2,Major_Occupation-3,Major_Occupation-4,How_You_Heard_About_Survey_Friend,How_You_Heard_About_Survey_Banner,Disability_Cognitive,Who_Pays_for_Access_Work,Community_Membership_Religious
93819,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0,0,0,0,0
95708,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0,1,0,0,0
97218,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0,0,0,1,0
91627,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0,0,0,0,0
49906,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0,0,0,0,0
89941,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0,0,0,1,1
96052,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0,0,0,0,0
90393,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0,0,0,0,0
90848,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0,0,0,0,0
91074,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0,1,0,0,0
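
The numbered column names above (for example `Years_on_Internet-0`) do not show which category each column stands for. As a minimal sketch, assuming a small hand-made column (`demo`) in place of the survey data, the encoder's `categories_` attribute can be used to build more readable names:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hand-made stand-in for one string-valued survey column
demo = pd.DataFrame(
    {"Years_on_Internet": ["1-3_yr", "Under_6_mo", "1-3_yr", "4-6_yr"]},
    index=["a", "b", "c", "d"],
)

enc = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
X = enc.fit_transform(demo[["Years_on_Internet"]]).toarray()

# categories_ holds the sorted labels of each input column, in output-column order
labels = enc.categories_[0]
names = [f"Years_on_Internet-{label}" for label in labels]
encoded = pd.DataFrame(X, columns=names, index=demo.index)
print(encoded)
```

Because the names still start with the original column name followed by '-', they remain easy to group by source column later on.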


In [39]:
# Add in the column to be predicted (the label).
# For the (string) label to predict, we use the 'Major_Occupation' column.
# Note: the random sample above also picked 'Major_Occupation' as a feature, so
# drop the 'Major_Occupation-*' columns before fitting a classifier to this label.

# Depending on the algorithm used, the label might also need to be binarized.
# By convention, we use the LabelBinarizer for this.
from sklearn import preprocessing

binarizeLabel = True
y = df['Major_Occupation']
if binarizeLabel:
  lbY = preprocessing.LabelBinarizer(sparse_output=False)
  yBin = lbY.fit_transform(y)
  dfY = pd.DataFrame(data=yBin, index=df.index, columns=lbY.classes_)
  dfSub = pd.concat([dfSub, dfY], axis=1)
else:
  dfSub = pd.concat([dfSub, y], axis=1)

# The following matrix contains features and y
dfSub.head(10)

who,Years_on_Internet-0,Years_on_Internet-1,Years_on_Internet-2,Years_on_Internet-3,Years_on_Internet-4,Education_Attainment-0,Education_Attainment-1,Education_Attainment-2,Education_Attainment-3,Education_Attainment-4,...,How_You_Heard_About_Survey_Friend,How_You_Heard_About_Survey_Banner,Disability_Cognitive,Who_Pays_for_Access_Work,Community_Membership_Religious,Computer,Education,Management,Other,Professional
93819,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,1
95708,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0,1,0,0,0,0,1,0,0,0
97218,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,1,0,0,0,0
91627,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
49906,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0,0,0,0,0,0,1,0,0,0
89941,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,1,0,0,0,0,1
96052,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,1,0
90393,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,1
90848,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,1,0
91074,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0,1,0,0,0,0,0,0,1,0
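
A binarized label can be mapped back to its string form with the same `LabelBinarizer` instance. The following sketch uses a short hand-made label array (an assumption for illustration, not the survey column):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hand-made stand-in for a multi-class string label
y_demo = np.array(["Computer", "Education", "Other", "Computer"])

lb = LabelBinarizer()
y_bin = lb.fit_transform(y_demo)   # one 0/1 column per class
print(lb.classes_)
print(y_bin)

# inverse_transform picks the class of the largest entry in each row
recovered = lb.inverse_transform(y_bin)
print(recovered)
```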


Lastly, if you wish to transform the one-hot-encoded features back to their original form, you can use something like:

In [40]:
dx = dict()
for col in originalStrCols:
  # Match on the prefix so that only the columns derived from this feature are selected
  derivedCol = [s for s in dfSub.columns if s.startswith(col+'-')]
  dx[col] = ohe[col].inverse_transform(dfSub[derivedCol])
print(dx)

{'Years_on_Internet': array([['1-3_yr'],
       ['Under_6_mo'],
       ['1-3_yr'],
       ...,
       ['1-3_yr'],
       ['4-6_yr'],
       ['Over_7_yr']], dtype=object), 'Education_Attainment': array([['Masters'],
       ['Some_College'],
       ['College'],
       ...,
       ['Special'],
       ['College'],
       ['Masters']], dtype=object), 'Actual_Time': array([['Consultant'],
       ['College_Student'],
       ['Other'],
       ...,
       ['Self_Employed'],
       ['Programmer'],
       ['Civil_Servant']], dtype=object), 'Major_Occupation': array([['Professional'],
       ['Education'],
       ['Computer'],
       ...,
       ['Management'],
       ['Computer'],
       ['Other']], dtype=object)}


## Your Tasks

### Clustering

1. Reviewing the data and its data dictionary, choose candidate feature subsets (with a maximum of 10 features per subset) that might be used to cluster the internet usage data described above. Justify your choice of feature subsets.
2. For each candidate feature subset, use hierarchical clustering, k-means, GMM and DBSCAN algorithms on it to identify possible groups of internet users.
3. Hence, discuss the advantages and disadvantages of the different clustering techniques for this task.
4. For bonus marks: Which combinations of features appear to perform best and why?
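
As a rough starting point for task 2, the four algorithm families can be run on the same feature matrix as sketched below. The data is synthetic (two made-up groups of users with different probabilities for each binary feature), and every parameter value (`n_clusters`, `eps`, `min_samples`, `covariance_type`) is an assumption that you would need to tune and justify for the real data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for a one-hot feature matrix: two groups of users with
# different probabilities of each binary feature being 1
probs_a = np.array([0.9, 0.9, 0.1, 0.1, 0.5, 0.5])
probs_b = np.array([0.1, 0.1, 0.9, 0.9, 0.5, 0.5])
X = np.vstack([
    rng.binomial(1, probs_a, size=(100, 6)),
    rng.binomial(1, probs_b, size=(100, 6)),
]).astype(float)

labels = {
    "hierarchical": AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X),
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    "GMM": GaussianMixture(n_components=2, covariance_type="diag",
                           random_state=0).fit_predict(X),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5).fit_predict(X),
}

for name, lab in labels.items():
    # DBSCAN marks noise points with the label -1
    n_clusters = len(set(lab)) - (1 if -1 in lab else 0)
    print(name, "clusters:", n_clusters, "noise points:", int((lab == -1).sum()))
```

Note that DBSCAN chooses the number of clusters itself, so its result depends heavily on `eps` and `min_samples`, whereas the other three are told how many groups to find.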

### Classification

1. For prediction, you should use 'Major_Occupation' and 'Education_Attainment' as the variables to predict, the same candidate feature subsets as for clustering, and at least 3 of the classification techniques that were taught in class:

   a. For each technique, find suitable configuration parameters (specific to that classification technique) by splitting the data into training and test sets and finding the settings that maximise the prediction accuracy and related scores.
   
   b. Given the best settings for each technique, use 10-fold cross-validation on each technique to rank the algorithms in terms of their performance on a given combination of features and predicted variable.
   
   c. Comment on the performance differences when predicting 'Major_Occupation' vs 'Education_Attainment'.
2. Repeat this exercise using the reduced-dimension data from PCA. Comment on any differences you find.
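
One possible way to organise step (a) is sketched below. It uses synthetic data from `make_classification` in place of the encoded survey features, a decision tree as an example technique (an assumption, since any of the techniques taught in class could be used), and a small hypothetical parameter grid. The search uses cross-validation within the training set only, and the test set is kept aside for a final check:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features and a multi-class label
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Search a small, hypothetical grid of tree settings using only the training data
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 20]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

test_accuracy = grid.score(X_test, y_test)
print("best settings:", grid.best_params_)
print("held-out accuracy:", test_accuracy)
```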

__Important__

Your attempt should be submitted as a Jupyter notebook. Use plenty of comments to explain your reasoning and to discuss what you find. Output should be presented inline using formatted print statements and/or plots.

# Clustering: a possible approach

1. Choose a set of features (columns), probably more than 10 (to have rich enough data) and fewer than 20 (to make it easier to work with).
2. The features should not "overlap" (if two or more carry the same information, just choose one of them).
3. The features should capture "interesting" aspects of internet users and usage. That is where your insight/creativity is needed!
4. 10 or more features might be difficult to visualise/interpret, so it might be worth using a dimensionality reduction technique like PCA (see the Practical where it was applied to the MNIST digits data to reduce the dimensions from 64 to 41).
5. A selection of One-Hot-Encoded features should reduce well, say from 15 to 5 transformed dimensions (these numbers are just used as a guide!)
6. You can apply hierarchical clustering to either the original features or the transformed features.
7. You are looking for nicely distributed clusters in the dendrogram. You can play around with linkage and other choices but if you are still unsatisfied, it might be time to try a different set of features. You can change the level of dimensionality reduction (if used), or even add/remove candidate features, and try again.
8. There is no such thing as perfect data for clustering, so after a few tries, pick the best combination of features and suggest some partitional clustering algorithms and their settings.
9. If you can try several clustering schemes, so much the better - you can then see whether particular users cluster together consistently, regardless of the algorithm used to search for such clusters.
10. Using the cluster memberships, try to interpret what the clusters might be, e.g., one cluster might be "experienced internet users, using the internet mostly for work, based in the US".
11. If you transformed the features, you will definitely find it easier to interpret the clusters based on the original features, so you would just need to add the cluster membership labels (derived from the transformed features) row-by-row to the dataframe with the original feature subset.
12. Be sure to discuss and justify your findings.
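
Steps 4 to 11 above can be sketched as follows. This is only an illustration on random binary data (the column names `f0` to `f14`, the 15-to-5 reduction, Ward linkage and the cut into 3 clusters are all assumptions, not recommended settings):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic stand-in for 15 one-hot/binary columns (not the survey data)
cols = [f"f{i}" for i in range(15)]
original = pd.DataFrame(rng.integers(0, 2, size=(120, 15)), columns=cols)

# Reduce 15 columns to 5 transformed dimensions (numbers are a guide only)
pca = PCA(n_components=5, random_state=0)
reduced = pca.fit_transform(original)
print("variance retained:", pca.explained_variance_ratio_.sum())

# Ward-linkage hierarchical clustering on the transformed features
Z = linkage(reduced, method="ward")
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree in a notebook

# Cut the tree into at most 3 clusters and attach the labels, row by row,
# to the dataframe holding the original features
with_labels = original.assign(cluster=fcluster(Z, t=3, criterion="maxclust"))
print(with_labels["cluster"].value_counts())
```

Attaching the labels to `original` rather than to `reduced` is what makes the later interpretation in terms of the original features straightforward.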

# Classification: a possible approach

1. Again, it is advisable to select a subset of attributes, this time with a view to predicting `Major_Occupation` and `Education_Attainment`. The attributes should be non-redundant and "interesting".
2. You will need to choose 3 classification techniques and apply them to the original and transformed dimensions and the two labeled "response variables".
3. Since you are interested in comparing the classifiers and not just maximising the prediction accuracy for a specific classifier, I recommend you use cross-validation to estimate the prediction accuracy _and_ its uncertainty.
4. You then need to discuss the performance of your classifiers, both in terms of sensitivity to the settings of their own parameters, and how they compare to their peers.
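
The comparison described above might look something like the sketch below, which reports the mean and standard deviation of the 10-fold accuracy for three classifiers on both the original and the PCA-reduced features. The data is synthetic, and the three classifiers and `n_components=5` are assumptions chosen purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the encoded features and a multi-class label
X, y = make_classification(n_samples=500, n_features=15, n_informative=6,
                           n_classes=3, random_state=0)

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "naive Bayes": GaussianNB(),
}

results = {}
for name, clf in classifiers.items():
    # 10-fold accuracy on the original features
    raw = cross_val_score(clf, X, y, cv=10)
    # PCA sits inside the pipeline, so it is refitted on the training part of
    # each fold and never sees that fold's test rows
    reduced = cross_val_score(make_pipeline(PCA(n_components=5), clf), X, y, cv=10)
    results[name] = (raw.mean(), raw.std(), reduced.mean(), reduced.std())
    print(f"{name}: original {raw.mean():.3f} +/- {raw.std():.3f}, "
          f"PCA {reduced.mean():.3f} +/- {reduced.std():.3f}")
```

The standard deviation across folds gives a rough sense of whether a difference between two classifiers is larger than the fold-to-fold variation.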

# More comments (and tips!)

1. This notebook suggests one way to investigate the data (a recipe, of sorts...) but it is just one possible way to approach the assignment. Also, finding a "good" set of features does have an element of luck - you could find them on the first go or it could take several attempts. Even if your favoured feature set is not as good as you wish, the marks are assigned based on the process/journey, not on the "destination". In other words, show your working, describe your reasoning (why something worked well or why it did not), justify your assumptions, etc.
2. For clusters, there are techniques for representing higher-dimensional (> 3) data, e.g., you might find parallel coordinate plots helpful. However, they do require some interpretation. Dendrograms are distance-based, so they are independent of dimension. There are many other techniques, but they are beyond the scope of this module. If you reduce the dimensions to 3 after getting the clusters and then plot them, you may get a sense of what the clusters look like. Students are not required to provide visualisations of clusters with dimensions > 3!
3. Interpreting clusters is more difficult if you are unable to visualise them in a 2-D scatterplot. Remember, you will get marks for deriving the clusters and for explaining and justifying your decisions. There are additional marks to be awarded for interpreting the clusters in terms of the original features. One idea is to determine whether certain combinations of feature values appear particularly common. You could derive the frequency with which a particular value arises in cluster 1, say. For example, you might find that > 90% of observations in cluster 1 have a particular income level, or that < 10% of observations in cluster 1 live in the USA. If a particular feature has little predictive power, we might expect its values to be evenly distributed (~50% for each of the 2 levels of a binary-valued feature, say). A combination of these "rules of thumb" might provide a meaningful description of the type of internet users in a particular cluster. Indeed, marketing analysts use this type of reasoning when segmenting their customers, often from survey data like that used in this assignment. This is just an example of what you could do, so if you use another technique and can justify it, that is good for marks too!
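
The frequency-based "rules of thumb" in tip 3 can be computed with a few lines of pandas. The sketch below uses a tiny hand-made dataframe (`demo`, with made-up cluster labels and values) rather than real clustering output:

```python
import pandas as pd

# Hand-made stand-in: cluster labels plus two original (un-encoded) features
demo = pd.DataFrame({
    "cluster": [1, 1, 1, 1, 2, 2, 2, 2],
    "Country": ["USA", "USA", "USA", "Canada", "UK", "UK", "USA", "Canada"],
    "Who_Pays_for_Access_Work": [1, 1, 1, 1, 0, 1, 0, 0],
})

# Row-normalised table: share of each Country value within each cluster
country_share = pd.crosstab(demo["cluster"], demo["Country"], normalize="index")
print(country_share)

# For a 0/1 feature, the mean within a cluster is the share of 1s
work_share = demo.groupby("cluster")["Who_Pays_for_Access_Work"].mean()
print(work_share)
```

Values close to 0 or 1 in these tables point to features that characterise a cluster, while values near the overall average suggest the feature does little to distinguish it.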

In [41]:
# Save the dataframe locally so that Appendices B and C can load it
df.to_csv(r'internet_usage.csv')