## Module 3 Worksheet - Chapters 5 & 6

The three checkpoints included in this worksheet need to be completed and marked during your lab session.

### Checkpoint 1 - California Housing Price Regression

Here we use the California Housing dataset available within scikit-learn.

This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.

The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

This dataset was obtained from the StatLib repository. https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

Here we first load the dataset, and print the number of instances and features.

```python
from sklearn import datasets
california = datasets.fetch_california_housing()
print('california housing data shape: '+ str(california.data.shape) + \
      '\nfeature names: ' + str(california.feature_names) + \
      '\ntarget name: ' + str(california.target_names))
```

From this, we can see that the dataset contains 20,640 instances (census block groups) and 8 features.
We can also print <code>california.DESCR</code> to get more details about the dataset and the definitions of its features.

Train a KNN regressor to predict the median house value for a block group from the other available features. Compare the effect of changing the following hyperparameters:
- The number of neighbours considered (k=3, k=30, k=300)
- The distance metric used by a distance-weighted KNN ('euclidean', 'cosine', 'manhattan')

**Tips:**
- All of the values and labels stored in this dataset are numerical, so there is no need to encode any of the features.
- The dataset needs to be split into training and test sets.
- Stratifying the train/test split data only works for categorical labels, and so does not need to be performed for regression tasks.
- The feature values need to be normalised/standardised.
- Think about which measure you will use to evaluate the effectiveness of your trained models (hint: the coefficient of determination).
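
The coefficient of determination mentioned in the last tip is R^2 = 1 - SS_res / SS_tot, which is also what the `score` method of scikit-learn regressors returns. The following is a minimal sketch of that calculation on made-up values; the helper name `r_squared` and the numbers are illustrative assumptions, not part of the worksheet.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained by the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]           # made-up target values
perfect = r_squared(y_true, y_true)     # predictions equal to the targets
baseline = r_squared(y_true, [2.5] * 4) # always predicting the mean of y_true
print(perfect, baseline)
```

A model that always predicts the mean of the targets scores 0, a perfect model scores 1, and a model worse than the mean baseline scores below 0.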

In [1]:
# Enter your code for Checkpoint 1 here
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression


# Check datasets
california = datasets.fetch_california_housing()
print('california housing data shape: '+ str(california.data.shape) + \
      '\nfeature names: ' + str(california.feature_names) + \
      '\ntarget name: ' + str(california.target_names))


california housing data shape: (20640, 8)
feature names: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
target name: ['MedHouseVal']


In [2]:
# Set variable to train, test dataset
X_train, X_test, y_train, y_test = train_test_split(california.data, california.target, random_state= 42, test_size= 0.2)

# Standardise the features (z-score); the scaler is fitted on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train

# lr = LinearRegression()
# lr.fit(X_train_scaled, y_train)
# print(lr.score(X_test_scaled, y_test))


array([[   3.2596    ,   33.        ,    5.0176565 , ...,    3.6918138 ,
          32.71      , -117.03      ],
       [   3.8125    ,   49.        ,    4.47354497, ...,    1.73809524,
          33.77      , -118.16      ],
       [   4.1563    ,    4.        ,    5.64583333, ...,    2.72321429,
          34.66      , -120.48      ],
       ...,
       [   2.9344    ,   36.        ,    3.98671727, ...,    3.33206831,
          34.03      , -118.38      ],
       [   5.7192    ,   15.        ,    6.39534884, ...,    3.17889088,
          37.58      , -121.96      ],
       [   2.5755    ,   52.        ,    3.40257649, ...,    2.10869565,
          37.77      , -122.42      ]])

In [3]:
from sklearn.neighbors import KNeighborsRegressor



# Case 1 : k=3, euclidean, distance
knn = KNeighborsRegressor(n_neighbors=3, weights='distance', metric='euclidean')
knn.fit(X_train_scaled, y_train)
knn.score(X_test_scaled, y_test)

0.6443459296530696

In [4]:
# Case 1 : k=3, cosine, distance
knn = KNeighborsRegressor(n_neighbors=3, weights='distance', metric='cosine')
knn.fit(X_train_scaled, y_train)
knn.score(X_test_scaled, y_test)

0.6209197121675905

In [5]:
# Case 1 : k=3, manhattan, distance
knn = KNeighborsRegressor(n_neighbors=3, weights='distance', metric='manhattan')
knn.fit(X_train_scaled, y_train)
knn.score(X_test_scaled, y_test)

0.6833643014653173

In [6]:
# Case 2: k=30, euclidean, distance
knn = KNeighborsRegressor(n_neighbors=30, weights='distance', metric='euclidean')
knn.fit(X_train_scaled, y_train)
knn.score(X_test_scaled, y_test)

0.6770326078148411

In [7]:
# Case 2 : k=30, cosine, distance
knn = KNeighborsRegressor(n_neighbors=30, weights='distance', metric='cosine')
knn.fit(X_train_scaled, y_train)
knn.score(X_test_scaled, y_test)

0.6703612076109324

In [8]:
# Case 2 : k=30, manhattan, distance
knn = KNeighborsRegressor(n_neighbors=30, weights='distance', metric='manhattan')
knn.fit(X_train_scaled, y_train)
knn.score(X_test_scaled, y_test)

0.7126910553331182

In [9]:
# Case 3: k=300, euclidean, distance
knn = KNeighborsRegressor(n_neighbors=300, weights='distance', metric='euclidean')
knn.fit(X_train_scaled, y_train)
knn.score(X_test_scaled, y_test)

0.614592954929857

In [10]:
# Case 3 : k=300, cosine, distance
knn = KNeighborsRegressor(n_neighbors=300, weights='distance', metric='cosine')
knn.fit(X_train_scaled, y_train)
knn.score(X_test_scaled, y_test)

0.6196913356146718

In [11]:
# Case 3 : k=300, manhattan, distance
knn = KNeighborsRegressor(n_neighbors=300, weights='distance', metric='manhattan')
knn.fit(X_train_scaled, y_train)
knn.score(X_test_scaled, y_test)

0.6402348832641214
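
The nine runs above repeat the same three lines with different settings, so the comparison could also be written as a single loop. The sketch below assumes synthetic data from `make_regression` as a stand-in for the scaled California features (so its scores will not match the ones above); the `scores` dictionary is an illustrative name.

```python
from itertools import product

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 1000 samples with 8 features
X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Evaluate every combination of k and distance metric
scores = {}
for k, metric in product([3, 30, 300], ['euclidean', 'cosine', 'manhattan']):
    knn = KNeighborsRegressor(n_neighbors=k, weights='distance', metric=metric)
    knn.fit(X_train_scaled, y_train)
    scores[(k, metric)] = knn.score(X_test_scaled, y_test)  # R^2 on the test set

for (k, metric), r2 in scores.items():
    print(f'k={k:<3} metric={metric:<9} R^2={r2:.4f}')
```

Collecting the results in one dictionary makes it easier to compare the settings side by side than reading them from separate cells.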

### Checkpoint 2 - Student Performance Prediction

Here we use a dataset of student performance in secondary education (high school).
More information about this dataset can be found at (https://archive.ics.uci.edu/dataset/320/student+performance).

Download the "student-mat.csv" file provided on FLO and import the information contained within this csv into a pandas dataframe object (refer to your code from last week's worksheet if you are unsure how to do this).
The dataset should have 395 rows and 33 columns.

Split the dataset into our labels ("G3" column) and input features (everything else).

Implement train-test splitting with a test size of 0.2 (random_state=100), to produce the following four variables (X_train, X_test, y_train, y_test). Printing the shapes of these four variables should give the following output:

```
X_train_shape: (316, 32)
X_test_shape: (79, 32)
y_train_shape: (316,)
y_test_shape: (79,)
```

---

As our dataset features contain categorical feature values, here is the part of our program where we would need to encode our features. Last week we applied a default ordinal encoder to our features, although in hindsight there were two issues with this:
1) The ordering of the categories within each feature was not specified (the default is to order the categories alphabetically).
2) The features present within our dataset were nominal (not ordinal) so we should have used one-hot-encoding.
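
To illustrate the first issue, the sketch below encodes a made-up `size` feature (an assumption, not a column of this dataset) with scikit-learn's `OrdinalEncoder`, once with the default alphabetical ordering and once with the ordering passed explicitly through the `categories` parameter.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up ordinal feature (assumption): one column, four instances
sizes = np.array([['medium'], ['small'], ['large'], ['small']])

# Default: categories are sorted alphabetically -> large=0, medium=1, small=2
default_codes = OrdinalEncoder().fit_transform(sizes).ravel()

# Explicit ordering: small=0, medium=1, large=2
explicit_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
explicit_codes = explicit_encoder.fit_transform(sizes).ravel()

print(default_codes)
print(explicit_codes)
```

With the default ordering, "large" receives a smaller code than "small", which reverses the natural ranking of the values.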

Looking at the features in our student performance dataset more closely, we see that many of the categorical values are binary but some also contain more than two possible values (e.g., Mjob and Fjob). For these features there is no obvious ordering/ranking of the potential values, so the advised procedure is to apply one-hot-encoding.

The simplest way to produce a one-hot-encoded version of a dataset is using the get_dummies function from the pandas library:

```python
import pandas as pd
X_train_encoded = pd.get_dummies(X_train)
X_test_encoded = pd.get_dummies(X_test)
print(X_train_encoded)
print(X_test_encoded)
```

Note that encoding the training and testing sets separately with get_dummies assumes that both contain the same set of category values for each feature. This may not hold if a categorical feature has rare values or if new category values appear later. In that case the encoded training and testing datasets will have different numbers of dimensions (columns), and you will need to add or remove columns so that both datasets contain exactly the same columns. For this dataset, however, the encoded training and testing datasets should have the same number of columns.
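
One possible way to handle such a mismatch is to reindex the encoded test set onto the training columns. The sketch below uses a toy pair of dataframes (an assumption, not the student dataset) in which each split contains a category value the other lacks; the variable names are illustrative.

```python
import pandas as pd

# Toy data (assumption): 'teacher' and 'health' occur only in the training split,
# 'services' occurs only in the test split
toy_train = pd.DataFrame({'age': [15, 16, 17], 'Mjob': ['teacher', 'other', 'health']})
toy_test = pd.DataFrame({'age': [18, 15], 'Mjob': ['other', 'services']})

toy_train_encoded = pd.get_dummies(toy_train)
toy_test_encoded = pd.get_dummies(toy_test)

# Give the test set exactly the training columns:
# columns missing from the test set are added and filled with 0,
# columns never seen during training are dropped
toy_test_aligned = toy_test_encoded.reindex(columns=toy_train_encoded.columns, fill_value=0)

print(list(toy_train_encoded.columns))
print(list(toy_test_aligned.columns))
```

Aligning on the training columns keeps the test set compatible with a scaler or model that was fitted on the training set.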

---

Finally, standardise your encoded data, train a Linear Regression model, and evaluate its effectiveness.

Compare the performance of the default Linear Regression model against versions involving L1 and L2 regularisation:
- Lasso (L1 regularisation) - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
- Ridge (L2 regularisation) - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
- ElasticNet (L1 + L2 regularisation) - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html
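
The practical difference between the penalties is that L1 regularisation can set coefficients exactly to zero, whereas L2 regularisation shrinks them without removing them. The sketch below illustrates this on synthetic data from `make_regression` (an assumption standing in for the student dataset); the `alpha` values and the `summary` dictionary are illustrative choices, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 20 features, only 5 of them informative
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    'LinearRegression': LinearRegression(),
    'Lasso': Lasso(alpha=1.0),
    'Ridge': Ridge(alpha=1.0),
    'ElasticNet': ElasticNet(alpha=1.0, l1_ratio=0.5),
}

# Record the test R^2 and the number of non-zero coefficients for each model
summary = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    summary[name] = {
        'r2': model.score(X_test_scaled, y_test),
        'n_nonzero': int(np.sum(model.coef_ != 0)),
    }
    print(f"{name:<16} R^2={summary[name]['r2']:.4f} non-zero coefficients={summary[name]['n_nonzero']}")
```

Comparing the number of non-zero coefficients alongside the scores shows how much of a difference in performance comes from the model ignoring features altogether.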

In [12]:
# Enter your code for Checkpoint 2 here
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd


df = pd.read_csv('student-mat.csv')

print(df.shape) 
print(df.head())


(395, 33)
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  ...  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher  ...   
1     GP   F   17       U     GT3       T     1     1  at_home     other  ...   
2     GP   F   15       U     LE3       T     1     1  at_home     other  ...   
3     GP   F   15       U     GT3       T     4     2   health  services  ...   
4     GP   F   16       U     GT3       T     3     3    other     other  ...   

  famrel freetime  goout  Dalc  Walc health absences  G1  G2  G3  
0      4        3      4     1     1      3        6   5   6   6  
1      5        3      3     1     1      3        4   5   5   6  
2      4        3      2     2     3      3       10   7   8  10  
3      3        2      2     1     1      5        2  15  14  15  
4      4        3      2     1     2      5        4   6  10  10  

[5 rows x 33 columns]


In [13]:
#  Split datasets 
X = df.drop(columns=["G3"])
y = df["G3"]

In [14]:
# Split to train and test
X_train, X_test, y_train, y_test = train_test_split( X, y, random_state=100, test_size= 0.2)

print("X_train shape:", X_train.shape)  
print("X_test shape:", X_test.shape)    
print("y_train shape:", y_train.shape)  
print("y_test shape:", y_test.shape)    

X_train shape: (316, 32)
X_test shape: (79, 32)
y_train shape: (316,)
y_test shape: (79,)


In [15]:
# One-Hot to expand string column to more feature columns
X_train_encoded = pd.get_dummies(X_train)
X_test_encoded = pd.get_dummies(X_test)
print(list(X_train_encoded.columns))
print(X_train_encoded)
print(X_test_encoded)

['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'school_GP', 'school_MS', 'sex_F', 'sex_M', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'schoolsup_no', 'schoolsup_yes', 'famsup_no', 'famsup_yes', 'paid_no', 'paid_yes', 'activities_no', 'activities_yes', 'nursery_no', 'nursery_yes', 'higher_no', 'higher_yes', 'internet_no', 'internet_yes', 'romantic_no', 'romantic_yes']
     age  Medu  Fedu  traveltime  studytime  failures  famrel  freetime  \
372   17     2     2           1          3         0       3         4   
136   17     3     4           3          2         0       5  

In [16]:

# Normalisation
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_encoded)
X_test_scaled = scaler.transform(X_test_encoded)

In [17]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet

lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
lr.score(X_test_scaled, y_test)

0.7332295233396583

In [18]:
# Lasso
lasso = Lasso(alpha=0.1)  # alpha sets the L1 penalty strength (the default is 1.0)
lasso.fit(X_train_scaled, y_train)
lasso.score(X_test_scaled, y_test)

0.7567667021007066

In [19]:
# Ridge
ridge = Ridge(alpha=1)  # alpha sets the L2 penalty strength (1 is the default)
ridge.fit(X_train_scaled, y_train)
ridge.score(X_test_scaled, y_test)

0.7336358409711199

In [20]:
# ElasticNet (L1 + L2); the default l1_ratio=0.5 weights both penalties equally
elasticnet = ElasticNet(alpha=1)
elasticnet.fit(X_train_scaled, y_train)
elasticnet.score(X_test_scaled, y_test)


0.6977540094624292

### Checkpoint 3 - Genome Prediction

1) Use the appropriate reader function from pandas to read the "GenomicData.csv" file provided on FLO. The dataset should have 172 rows and 21053 columns.
2) Drop the first two columns as they have redundant information.
3) Call <code>np.random.seed(42)</code>. This seeds NumPy's global random number generator, so any function that draws from it gives reproducible results across multiple runs.
4) Drop any rows that contain a missing (NaN) value. How many instances does this reduce our dataset down to?
5) Use the ClassType column as the label (y), and the rest as features (X).
6) Split the data into test and train sets, with a 0.25 test size.
7) Standardize the features.
8) Train two logistic regression models with the L1 penalty (use the ‘liblinear’ solver), one with C = 0.05 and one with C = 10. What are their accuracies on the test data?
9) How many features are “selected” (i.e., have non-zero coefficients) by the L1-penalised logistic regression for C = 0.05 and for C = 10? Hint: you can get the coefficient values of a trained LogisticRegression model from its <code>.coef_</code> attribute.
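
In scikit-learn, C is the inverse of the regularisation strength, so a smaller C means a stronger L1 penalty and typically fewer non-zero coefficients. Steps 6 to 9 can be sketched as follows, assuming synthetic data from `make_classification` in place of the genomic dataset (so the numbers it prints are not the answers to the questions above); the `results` dictionary is an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): many more features than samples, few informative
X, y = make_classification(n_samples=150, n_features=200, n_informative=10,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

results = {}
for C in (0.05, 10):
    logr = LogisticRegression(penalty='l1', C=C, solver='liblinear', random_state=42)
    logr.fit(X_train_scaled, y_train)
    results[C] = {
        'acc': logr.score(X_test_scaled, y_test),      # test accuracy
        'n_selected': int(np.sum(logr.coef_ != 0)),    # non-zero coefficients
    }
    print(f"C={C}: Test accuracy={results[C]['acc']:.4f}, "
          f"Non-zero features={results[C]['n_selected']}")
```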

---

Think about whether it makes more sense to drop instances or features with missing values for this dataset.
Try both approaches out and see which produces the better result.
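
The two approaches differ only in the `axis` argument of pandas' `dropna`. The sketch below shows the effect on a toy dataframe (an assumption, with made-up column names and values) that has a single missing value.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): 4 instances, 3 features plus the label, one missing value
toy = pd.DataFrame({
    'gene_a': [1.0, 2.0, 3.0, 4.0],
    'gene_b': [0.5, np.nan, 0.7, 0.9],
    'gene_c': [9.0, 8.0, 7.0, 6.0],
    'ClassType': [0, 1, 0, 1],
})

rows_dropped = toy.dropna(axis=0)  # drop instances (rows) containing a NaN
cols_dropped = toy.dropna(axis=1)  # drop features (columns) containing a NaN

print(toy.isna().sum())    # number of missing values per column
print(rows_dropped.shape)  # one instance removed
print(cols_dropped.shape)  # one feature removed
```

Counting the missing values per column and per row first shows how much data each approach would discard.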

In [21]:
# Enter your code for Checkpoint 3 here

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd


df = pd.read_csv('GenomicData.csv')

print(df.shape) 
print(df.head())

(172, 21053)
   Unnamed: 0.1  Unnamed: 0       DDR1      RFC2     HSPA6      PAX8  \
0             0  GSM1019138  11.696575  9.498098  6.926517  8.119192   
1             1  GSM1019139  10.324599  8.463703  7.473955  7.874865   
2             2  GSM1019140  11.416296  7.519654  6.387836  7.983135   
3             3  GSM1019141  10.621342  8.555675  8.714281  7.824748   
4             4  GSM1019142  10.227293  8.057257  8.236758  8.188167   

     GUCA1A      UBA7      THRA    PTPN21  ...  LOC100505794 /// LOC100509111  \
0  3.412547  7.605915  5.287406  5.144362  ...                       4.265834   
1  3.328430  8.129946  5.352852  5.608019  ...                       4.213007   
2  3.646711  8.583098  5.611202  5.319620  ...                       3.806343   
3  4.590441  7.437547  4.943707  5.927861  ...                       4.538299   
4  4.062502  8.115143  5.536319  5.310099  ...                       4.568268   

   LOC100505562  LOC388210     GALR3    NUS1P3     ITIH4  C1orf175 

In [22]:
# Drop the first two columns (row index and sample ID), which carry redundant information
df = df.drop(df.columns[:2], axis=1)
print(df.shape) 
print(df.head())

(172, 21051)
        DDR1      RFC2     HSPA6      PAX8    GUCA1A      UBA7      THRA  \
0  11.696575  9.498098  6.926517  8.119192  3.412547  7.605915  5.287406   
1  10.324599  8.463703  7.473955  7.874865  3.328430  8.129946  5.352852   
2  11.416296  7.519654  6.387836  7.983135  3.646711  8.583098  5.611202   
3  10.621342  8.555675  8.714281  7.824748  4.590441  7.437547  4.943707   
4  10.227293  8.057257  8.236758  8.188167  4.062502  8.115143  5.536319   

     PTPN21       CCL5    CYP2E1  ...  LOC100505794 /// LOC100509111  \
0  5.144362   8.563300  4.454118  ...                       4.265834   
1  5.608019   8.014143  4.797404  ...                       4.213007   
2  5.319620   8.858539  4.705669  ...                       3.806343   
3  5.927861   9.663748  4.409423  ...                       4.538299   
4  5.310099  10.588755  4.421362  ...                       4.568268   

   LOC100505562  LOC388210     GALR3    NUS1P3     ITIH4  C1orf175 /// TTC4  \
0      4.959345   

In [23]:
# Seed NumPy's global random number generator to get the same output every time (similar in purpose to random_state=42)
np.random.seed(42)

In [24]:
# Drop features (columns) that contain NaN; axis=0 would drop instances (rows) instead, as in step 4
df = df.dropna(axis=1)
print(df)
print(df.shape)


          DDR1       RFC2     HSPA6      PAX8    GUCA1A      UBA7      THRA  \
0    11.696575   9.498098  6.926517  8.119192  3.412547  7.605915  5.287406   
1    10.324599   8.463703  7.473955  7.874865  3.328430  8.129946  5.352852   
2    11.416296   7.519654  6.387836  7.983135  3.646711  8.583098  5.611202   
3    10.621342   8.555675  8.714281  7.824748  4.590441  7.437547  4.943707   
4    10.227293   8.057257  8.236758  8.188167  4.062502  8.115143  5.536319   
..         ...        ...       ...       ...       ...       ...       ...   
167  10.840880   9.234886  7.520512  8.311986  4.281912  7.634406  6.194104   
168  11.275180   8.549333  7.547901  7.876449  3.608414  8.082584  5.974399   
169  10.531975   8.044530  6.712938  8.032822  3.934922  7.494316  5.985384   
170  11.271373  10.572225  8.287879  8.162460  3.548527  7.578741  5.501066   
171  10.702283   9.237732  7.694889  8.628865  3.804980  8.148709  5.936384   

       PTPN21       CCL5    CYP2E1  ...  LOC1005057

In [25]:
#  Set up label 'Class Type'
y = df["ClassType"]
X = df.drop(columns=["ClassType"])


In [26]:
# Split train and test
X_train, X_test, y_train, y_test = train_test_split( X, y, random_state=100, test_size= 0.25)

# Normalisation
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [27]:
# L1-penalised logistic regression with C=0.05

results = {}
C = 0.05
logr = LogisticRegression(penalty='l1', C=C, solver='liblinear')  # liblinear supports the L1 penalty and suits small datasets
logr.fit(X_train_scaled, y_train)
acc = logr.score(X_test_scaled, y_test)  # test accuracy
print(acc)

n_selected = np.sum(logr.coef_ != 0)
results[C] = {'acc': acc, 'n_selected': n_selected}

print(n_selected)
print(f'C={C}: Test accuracy={acc:.4f}, Non-zero features={n_selected}')

0.9534883720930233
9
C=0.05: Test accuracy=0.9535, Non-zero features=9


In [28]:
# L1-penalised logistic regression with C=10 (reuses the results dict from the previous cell)
C = 10
logr = LogisticRegression(penalty='l1', C=C, solver='liblinear')
logr.fit(X_train_scaled, y_train)
acc = logr.score(X_test_scaled, y_test)  # test accuracy
print(acc)

n_selected = np.sum(logr.coef_ != 0)
results[C] = {'acc': acc, 'n_selected': n_selected}

print(n_selected)
print(f'C={C}: Test accuracy={acc:.4f}, Non-zero features={n_selected}')

0.9767441860465116
208
C=10: Test accuracy=0.9767, Non-zero features=208
