# Assignment 2 specification

The purpose of this assignment is to use clustering and classification to predict various aspects of internet users based on data collected from a survey.

The survey has a large number of columns (features) so you will need to choose a suitable subset both for clustering and for classification.

The rest of this notebook provides basic help with preparing the data for analysis.

## Background

In [406]:
import pandas as pd

See https://www.openml.org/d/372 for a description of the dataset and https://www.openml.org/data/download/52407/internet_usage.arff for the data file itself. The data comprises mostly binary and some categorical (multi-valued) columns, with just 2 numeric columns, all relating to internet users circa 1997.

The first thing to do is to load the data.

In [407]:
from scipy.io import arff
filePath = 'internet_usage.arff'
data, meta = arff.loadarff(filePath)
df = pd.DataFrame(data)
df.head()

Unnamed: 0,Actual_Time,Age,Community_Building,Community_Membership_Family,Community_Membership_Hobbies,Community_Membership_None,Community_Membership_Other,Community_Membership_Political,Community_Membership_Professional,Community_Membership_Religious,...,Web_Page_Creation,Who_Pays_for_Access_Dont_Know,Who_Pays_for_Access_Other,Who_Pays_for_Access_Parents,Who_Pays_for_Access_School,Who_Pays_for_Access_Self,Who_Pays_for_Access_Work,Willingness_to_Pay_Fees,Years_on_Internet,who
0,b'Consultant',b'41',b'Equally',b'0',b'0',b'1',b'0',b'0',b'0',b'0',...,b'Yes',b'0',b'0',b'0',b'0',b'1',b'0',b'Other_sources',b'1-3_yr',b'93819'
1,b'College_Student',b'28',b'Equally',b'0',b'0',b'0',b'0',b'0',b'0',b'0',...,b'No',b'0',b'0',b'0',b'0',b'1',b'0',b'Already_paying',b'Under_6_mo',b'95708'
2,b'Other',b'25',b'More',b'1',b'1',b'0',b'0',b'0',b'1',b'0',...,b'Yes',b'0',b'0',b'0',b'0',b'1',b'1',b'Other_sources',b'1-3_yr',b'97218'
3,b'Salesperson',b'28',b'More',b'0',b'0',b'0',b'1',b'0',b'0',b'0',...,b'Yes',b'0',b'0',b'0',b'0',b'1',b'0',b'Already_paying',b'1-3_yr',b'91627'
4,b'K-12_Student',b'17',b'More',b'0',b'0',b'0',b'0',b'1',b'1',b'0',...,b'Yes',b'0',b'0',b'0',b'0',b'1',b'0',b'Already_paying',b'1-3_yr',b'49906'


As can be seen, the data is loaded into a dataframe, but every value is a Python bytes object (note the b'...' prefix). We decode these bytes into ordinary strings, as they are much easier to handle.

In [408]:
for col in df.columns:
  df[col] = df[col].apply(lambda x: x.decode("utf-8"))
df.head()

Unnamed: 0,Actual_Time,Age,Community_Building,Community_Membership_Family,Community_Membership_Hobbies,Community_Membership_None,Community_Membership_Other,Community_Membership_Political,Community_Membership_Professional,Community_Membership_Religious,...,Web_Page_Creation,Who_Pays_for_Access_Dont_Know,Who_Pays_for_Access_Other,Who_Pays_for_Access_Parents,Who_Pays_for_Access_School,Who_Pays_for_Access_Self,Who_Pays_for_Access_Work,Willingness_to_Pay_Fees,Years_on_Internet,who
0,Consultant,41,Equally,0,0,1,0,0,0,0,...,Yes,0,0,0,0,1,0,Other_sources,1-3_yr,93819
1,College_Student,28,Equally,0,0,0,0,0,0,0,...,No,0,0,0,0,1,0,Already_paying,Under_6_mo,95708
2,Other,25,More,1,1,0,0,0,1,0,...,Yes,0,0,0,0,1,1,Other_sources,1-3_yr,97218
3,Salesperson,28,More,0,0,0,1,0,0,0,...,Yes,0,0,0,0,1,0,Already_paying,1-3_yr,91627
4,K-12_Student,17,More,0,0,0,0,1,1,0,...,Yes,0,0,0,0,1,0,Already_paying,1-3_yr,49906


The dataframe looks more standard now, but we notice that there is an anonymised user code 'who' which is a candidate for the dataframe's index. We check that each row has a unique 'who' value:

In [409]:
numRows = df.shape[0]
numUniq = df['who'].nunique()
print(numRows-numUniq)

0


It does, so we set 'who' as the index and it no longer appears in the list of columns, which we can check below.

In [410]:
if 'who' in df.columns:
  df.set_index('who', inplace=True)

print(df.columns)

Index(['Actual_Time', 'Age', 'Community_Building',
       'Community_Membership_Family', 'Community_Membership_Hobbies',
       'Community_Membership_None', 'Community_Membership_Other',
       'Community_Membership_Political', 'Community_Membership_Professional',
       'Community_Membership_Religious', 'Community_Membership_Support',
       'Country', 'Disability_Cognitive', 'Disability_Hearing',
       'Disability_Motor', 'Disability_Not_Impaired', 'Disability_Not_Say',
       'Disability_Vision', 'Education_Attainment',
       'Falsification_of_Information', 'Gender', 'Household_Income',
       'How_You_Heard_About_Survey_Banner',
       'How_You_Heard_About_Survey_Friend',
       'How_You_Heard_About_Survey_Mailing_List',
       'How_You_Heard_About_Survey_Others',
       'How_You_Heard_About_Survey_Printed_Media',
       'How_You_Heard_About_Survey_Remebered',
       'How_You_Heard_About_Survey_Search_Engine',
       'How_You_Heard_About_Survey_Usenet_News',
       'How_You_Heard

As can be seen, we have ensured that the 'who' column is no longer available as a feature. Also note that all columns are treated as 'object', effectively as strings. For your convenience, I have classified the columns for you, see below. I have also changed the types of numeric and 'boolean' (0,1)-valued columns. The latter are then binarised and ready for analysis.

According to the data description, the original internet_usage data had 2699 missing values in the 'Primary_Computing_Platform' column. In this version of the dataset, the missing values have already been replaced with '?', see below, so no further action is needed.

In [411]:
col = 'Primary_Computing_Platform'
df[col].value_counts()

Win95        4359
?            2699
Macintosh    1466
Windows       581
NT            450
Unix          212
Dont_Know      87
OS2            84
PC_Unix        76
DOS            54
Other          33
VT100           7
Name: Primary_Computing_Platform, dtype: int64
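
If you want to check whether other columns use the same '?' placeholder, a count per column is a quick test. The following is a minimal sketch on a made-up toy frame (the rows and the 'Gender' values are illustrative, not taken from the dataset); converting the placeholder to NaN is optional and only useful if you want pandas' NaN-aware functions to apply.

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration only
toy = pd.DataFrame({
    "Primary_Computing_Platform": ["Win95", "?", "Macintosh", "?"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# Number of '?' placeholders in each column
placeholder_counts = (toy == "?").sum()
print(placeholder_counts)

# Optional: turn '?' into NaN so that isna(), dropna() etc. recognise it
with_nan = toy.mask(toy == "?")
print(with_nan.isna().sum())
```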

## Clustering Task 1

1. Reviewing the data and its data dictionary, choose candidate feature subsets (with a maximum of 10 features per subset) that might be used to cluster the internet usage data described above. Justify your choice of feature subsets.


### Not Purchasing Feature Subset
The Not Purchasing features were selected to explore whether users cluster by their reasons for not making purchases online.
They are grouped in the list notPurchasingFeatures below:

In [412]:
notPurchasingFeatures = ['Not_Purchasing_Judge_quality', 'Not_Purchasing_Bad_experience', 'Not_Purchasing_Too_complicated', 'Not_Purchasing_Unfamiliar_vendor', 'Not_Purchasing_Cant_find', 'Not_Purchasing_Prefer_people', 'Not_Purchasing_Receipt', 'Not_Purchasing_Privacy', 'Not_Purchasing_Security', 'Not_Purchasing_Easier_locally']

### Disability Feature Subset
The disability-related features were selected to explore clustering trends in internet usage among users with disabilities.
They are grouped in the list disabilityFeatures below:

In [413]:
disabilityFeatures = ['Disability_Cognitive', 'Disability_Hearing', 'Disability_Motor', 'Disability_Not_Impaired', 'Disability_Not_Say', 'Disability_Vision']

### Pay for Access Feature Subset
The who-pays-for-access features were selected to explore clustering from the perspective of who pays for the user's internet access.
They are grouped in the list whoPaysFeatures below:

In [414]:
whoPaysFeatures = ['Who_Pays_for_Access_Dont_Know','Who_Pays_for_Access_Other', 'Who_Pays_for_Access_Parents',
'Who_Pays_for_Access_School', 'Who_Pays_for_Access_Self','Who_Pays_for_Access_Work', 'Willingness_to_Pay_Fees']

### Age Related Feature Subset
The age-related features were selected to explore clustering trends in categories of internet usage that are based on, or influenced by, age.
They are grouped in the list ageFeatures below:

In [415]:
ageFeatures = ['Actual_Time', 'Age', 'Registered_to_Vote', 'Years_on_Internet', 'Web_Ordering', 'Major_Occupation', 'Education_Attainment']

### Societal Feature Subset
The societal features were selected to explore clustering trends in societal categorisations associated with internet usage, including race, gender, location and relationship status.
They are grouped in the list societyFeatures below:

In [416]:
societyFeatures = ['Gender', 'Sexual_Preference', 'Race', 'Country', 'Marital_Status']

### Age / Experience Related Purchasing Feature Subset
The age- and experience-related purchasing features were selected to explore how online purchasing and non-purchasing habits cluster with age and internet experience. They are grouped in the list agePurchasingFeatures below:

In [417]:
agePurchasingFeatures = ['Age', 'Years_on_Internet', 'Web_Ordering', 'Registered_to_Vote', 'Country', 'Not_Purchasing_Too_complicated', 'Not_Purchasing_Prefer_people', 'Not_Purchasing_Privacy', 'Not_Purchasing_Security', 'Not_Purchasing_Easier_locally']
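
Since each subset is limited to 10 features and a misspelled column name only fails later, a quick sanity check on the lists can help. The sketch below is not part of the original notebook: `check_subsets` and `MAX_FEATURES` are hypothetical names, and the column and subset lists are shortened toy examples.

```python
# Hypothetical helper, not part of the original notebook
MAX_FEATURES = 10

def check_subsets(subsets, available_columns):
    """Return {subset name: list of problems}; an empty list means no problem found."""
    available = set(available_columns)
    problems = {}
    for name, features in subsets.items():
        issues = []
        if len(features) > MAX_FEATURES:
            issues.append(f"{len(features)} features exceeds the limit of {MAX_FEATURES}")
        missing = [f for f in features if f not in available]
        if missing:
            issues.append(f"not in dataframe: {missing}")
        problems[name] = issues
    return problems

# Toy stand-ins for df.columns and the feature lists defined above
columns = ["Age", "Gender", "Country", "Race", "Marital_Status"]
subsets = {
    "societyFeatures": ["Gender", "Race", "Country", "Marital_Status"],
    "typoExample": ["Age", "Gendr"],  # deliberately misspelled
}
result = check_subsets(subsets, columns)
print(result)
```

In the notebook itself, the call would presumably take the real lists and `df.columns` instead of the toy values.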

## Clustering Task 2
For each candidate feature subset, use hierarchical clustering, k-means, GMM and DBSCAN algorithms on it to identify possible groups of internet users.
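
As a rough illustration of how the four algorithms are called in scikit-learn, here is a minimal sketch on synthetic data (two well-separated blobs), not on the survey data. The cluster counts and the DBSCAN parameters `eps` and `min_samples` are assumptions chosen to suit the toy data; they would need tuning for each real feature subset.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for an encoded feature subset: two well-separated blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
])

labels = {
    "hierarchical": AgglomerativeClustering(n_clusters=2).fit_predict(X_demo),
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo),
    "GMM": GaussianMixture(n_components=2, random_state=0).fit_predict(X_demo),
    # eps and min_samples are illustrative; DBSCAN marks noise points with -1
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5).fit_predict(X_demo),
}

for name, lab in labels.items():
    n_found = len(set(lab) - {-1})  # ignore the DBSCAN noise label
    print(name, n_found)
```

Note that DBSCAN chooses the number of clusters itself, whereas the other three need it as an input.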

### Data Encoding
This section prepares the data using one-hot encoding, so that all categorical column values become numeric and can be used by the clustering algorithms.

In [418]:
from sklearn.preprocessing import OneHotEncoder
import random

featureStringCols = ['Years_on_Internet', 'Web_Ordering', 'Country']
featureBoolCols = ['Not_Purchasing_Too_complicated', 'Not_Purchasing_Prefer_people',
                                'Not_Purchasing_Privacy', 'Not_Purchasing_Security', 'Not_Purchasing_Easier_locally']
featureIntCols = ['Age']

for col in featureIntCols:
  df[col] = pd.to_numeric(df[col], errors='coerce')

for col in featureBoolCols:
  df[col] = df[col].map({'0': 0, '1': 1})

ohe = dict()

# Choose a seed so that this code is repeatable. Because k equals the length of each list,
# random.sample keeps every column and only shuffles the order; lower k to pick a subset.
random.seed(42)
originalStrCols = random.sample(featureStringCols,k=3)
print(originalStrCols)
sampledBoolCols = random.sample(featureBoolCols,k=5)
print(sampledBoolCols)
sampledIntCols = random.sample(featureIntCols,k=1)
print(sampledIntCols)

# Create an empty dataframe
featureSub = pd.DataFrame()

for col in originalStrCols:
  # Dense output; on scikit-learn versions before 1.2 this argument is named sparse
  ohe[col] = OneHotEncoder(sparse_output=False)
  X = ohe[col].fit_transform(df[col].values.reshape(-1,1))
  # See https://stackoverflow.com/a/4843172
  dfOneHot = pd.DataFrame(X, columns = [col+'-'+str(int(i)) for i in range(X.shape[1])])
  featureSub = pd.concat([featureSub, dfOneHot], axis=1)

# Assign the index so that it matches that of the original df
# (set_axis(..., inplace=True) is not available in pandas 2.0 and later)
featureSub.index = df.index

# Add in the sampledBoolcols
featureSub = pd.concat([featureSub, df[sampledBoolCols]], axis=1)

# Add in the sampledIntcols
featureSub = pd.concat([featureSub, df[sampledIntCols]], axis=1)

# The following is the matrix of samples x features
featureSub.head(10)

['Country', 'Years_on_Internet', 'Web_Ordering']
['Not_Purchasing_Privacy', 'Not_Purchasing_Prefer_people', 'Not_Purchasing_Too_complicated', 'Not_Purchasing_Easier_locally', 'Not_Purchasing_Security']
['Age']


who,Country-0,Country-1,Country-2,Country-3,Country-4,Country-5,Country-6,Country-7,Country-8,Country-9,...,Years_on_Internet-4,Web_Ordering-0,Web_Ordering-1,Web_Ordering-2,Not_Purchasing_Privacy,Not_Purchasing_Prefer_people,Not_Purchasing_Too_complicated,Not_Purchasing_Easier_locally,Not_Purchasing_Security,Age
93819,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0,0,0,0,0,41.0
95708,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0,0,0,0,1,28.0
97218,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1,0,0,1,0,25.0
91627,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0,0,0,0,0,28.0
49906,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0,0,0,0,0,17.0
89941,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0,1,0,1,0,55.0
96052,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0,0,0,0,0,53.0
90393,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,1,0,0,0,1,25.0
90848,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1,1,0,1,1,32.0
91074,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0,0,0,0,1,65.0
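
The encoded columns above are numbered (Country-0, Country-1, ...), which makes clusters harder to interpret. One possible alternative, sketched here on a made-up toy frame, is `pd.get_dummies`, which names each new column after the category it represents and keeps the dataframe's index. The 'Web_Ordering' category values below are illustrative assumptions, not read from the dataset.

```python
import pandas as pd

# Toy stand-in for df; category values are invented for illustration
toy = pd.DataFrame(
    {
        "Web_Ordering": ["Yes", "No", "Dont_know"],
        "Not_Purchasing_Privacy": [0, 1, 0],
    },
    index=pd.Index(["93819", "95708", "97218"], name="who"),
)

# One-hot encode only the listed columns; other columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["Web_Ordering"], dtype=float)
print(encoded.columns.tolist())
print(encoded)
```

A trade-off is that `get_dummies` does not remember the categories it saw, so the fitted `OneHotEncoder` objects kept in `ohe` remain the better choice if the same encoding has to be applied to new data later.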


### k-means

In [419]:
import numpy as np
from sklearn.cluster import KMeans

# Fill NaN values (such as Age entries that pd.to_numeric coerced to NaN) with the column mean
featureSub.fillna(featureSub.mean(), inplace=True)

# Cluster on every feature except Age; Age is kept aside for comparison
X = np.array(featureSub.drop(columns=['Age']).astype(float))
y = np.array(featureSub['Age'])

kmeans = KMeans(n_clusters=30)
kmeans.fit(X)

# Fraction of rows whose cluster label happens to equal the user's Age.
# Cluster labels are arbitrary integers (0 to 29), so this number is not a
# measure of clustering quality and is expected to be close to zero.
prediction = kmeans.predict(X)
correct = np.sum(prediction == y)

print(correct/len(X))


0.009893153937475268
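
Because cluster labels are arbitrary, comparing them with Age does not indicate how good the clustering is. An internal measure such as the silhouette score is one common alternative, and it can also guide the choice of the number of clusters. The sketch below uses synthetic data with three blobs rather than the survey data; the range of k values tried is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix: three well-separated blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=(0.0, 5.0), scale=0.3, size=(50, 2)),
])

# Silhouette score for each candidate k (higher means tighter, better separated clusters)
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    scores[k] = silhouette_score(X_demo, km.labels_)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On the real data the score is unlikely to peak as sharply as it does on this toy example, so it is worth reading alongside other evidence such as the inertia curve.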
