# Homework 5 Supplemental Notebook

## DSC 40A, Fall 2021

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

## Problem 1 – Transformation Tuesday

Here are the same helper functions we gave you in Homework 4.

In [None]:
def solve_normal_equations(X, y):
    '''Returns the optimal parameter vector, w*, given a design matrix X and observation vector y.'''
    return np.linalg.solve(X.T @ X, X.T @ y)

def create_design_matrix(df, columns, intercept=True):
    '''Creates a design matrix by taking the specified columns from the DataFrame df.
       Adds a column of all 1s as the first column if intercept is True, which is the default.
       The argument columns should be a list.
    '''
    df = df.copy()
    df['1'] = 1
    if intercept:
        return df[['1'] + columns].values
    else:
        return df[columns].values
    
def mean_squared_error(X, y, w):
    '''Returns the mean squared error of the predictions Xw and observations y.'''
    return np.mean((y - X @ w)**2)
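
As a quick sanity check, here is a minimal sketch of how these helpers fit together on made-up data. The names `x_toy`, `y_toy`, `X_toy`, `w_toy`, and `mse_toy` are hypothetical and not part of the homework. To keep the sketch free of pandas, it builds the design matrix by hand with `np.column_stack`, which should match what `create_design_matrix` returns when `intercept=True`.

```python
import numpy as np

def solve_normal_equations(X, y):
    '''Returns the optimal parameter vector, w*, given a design matrix X and observation vector y.'''
    return np.linalg.solve(X.T @ X, X.T @ y)

def mean_squared_error(X, y, w):
    '''Returns the mean squared error of the predictions Xw and observations y.'''
    return np.mean((y - X @ w)**2)

# Made-up data that lies exactly on the line y = 2 + 3x.
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = 2 + 3 * x_toy

# Design matrix: a column of 1s (intercept) followed by the feature column.
X_toy = np.column_stack([np.ones(len(x_toy)), x_toy])

w_toy = solve_normal_equations(X_toy, y_toy)
mse_toy = mean_squared_error(X_toy, y_toy, w_toy)

print(w_toy)    # intercept and slope, approximately 2 and 3
print(mse_toy)  # approximately 0, since the data is perfectly linear
```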

Here are two new functions – the logistic function, `sigma`, and its inverse, `sigma_inv`. Note that both functions are **vectorized**, meaning that if you give them an array as input, you will get an array back as output.

In [None]:
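def sigma(x):
    '''Logistic function. Works elementwise on arrays.'''
    return 1 / (1 + np.exp(-x))

def sigma_inv(x):
    '''Inverse of the logistic function (the logit), defined for 0 < x < 1. Works elementwise on arrays.'''
    return np.log(x / (1 - x))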
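
To see what "vectorized" means here, the sketch below applies both functions to a small made-up array (`t`, `p`, and `recovered` are hypothetical names). It redefines the two functions so that it runs on its own, using `np.exp(-x)`, which computes the same quantity as `np.e**(-x)`.

```python
import numpy as np

def sigma(x):
    return 1 / (1 + np.exp(-x))

def sigma_inv(x):
    return np.log(x / (1 - x))

t = np.array([-2.0, 0.0, 3.0])

# One call transforms every element; no loop is needed.
p = sigma(t)

# Applying the inverse should give back the original inputs (up to rounding).
recovered = sigma_inv(p)

print(p)          # three values strictly between 0 and 1; the middle one is 0.5
print(recovered)  # approximately [-2, 0, 3]
```

Note that `sigma_inv` is only defined for inputs strictly between 0 and 1.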

Now, let's load in our data.

In [None]:
logistic_data = pd.read_csv('data/logistic_data.csv')
logistic_data

Here's a scatter plot of it:

In [None]:
px.scatter(logistic_data, x='x', y='y')

### Part C

<p style="color:red"><b>Your Job</b></p>

Assign `X` and `z` below to the design matrix and observation vector such that `solve_normal_equations(X, z)` will give us our optimal parameter vector. In your PDF writeup, provide a screenshot of the code you wrote as well as of the resulting visualization.

_Hint: Remember that `sigma` and `sigma_inv` are vectorized, as noted above, and that we've defined `create_design_matrix` for you. Each definition should only take one line. **Don't use a `for`-loop!**_

In [None]:
X = ... # TODO
X

In [None]:
z = ... # TODO
z

Once you've defined `X` and `z` above, you should be able to run the following two cells to see the value of $\vec{w}^*$ and a plot of the fitted prediction rule.

In [None]:
w_logistic = solve_normal_equations(X, z)
w_logistic

In [None]:
x_range = np.linspace(-11, 5)

fig = go.Figure()
fig.add_trace(go.Scatter(x = logistic_data['x'], y = logistic_data['y'], mode = 'markers', name = 'actual'))
fig.add_trace(go.Scatter(x = x_range, 
                         y = sigma(w_logistic[0] + w_logistic[1] * x_range), 
                         name = 'prediction rule', 
                         line=dict(color='red')))

fig.update_layout(xaxis_title = 'x', yaxis_title = 'y')

Nice!

## Problem 2 – What do you k-mean?

### Part B

**Note that you don't need to write any code yourself in this question.**

Run the cell below to load in our dataset, which contains many (209) attributes for each country, taken from [here](https://databank.worldbank.org/source/world-development-indicators#).

In [None]:
world_bank = pd.read_csv('data/world_bank_data.csv') \
                .set_index('country') \
                .fillna(0) # There are null values in the dataset; .fillna(0) replaces them with 0s
world_bank

Let's try to cluster this dataset. Note that there's no easy way to visualize the dataset before clustering, since it is 209-dimensional. We will use the `KMeans` class built into `scikit-learn`, a popular Python library for machine learning.

In [None]:
from sklearn.cluster import KMeans

Run the cell below to perform the clustering and create a DataFrame with the results. (`random_state=42` ensures that we get the same results every time we run this cell.)

In [None]:
world_bank_clustering = KMeans(n_clusters=4, random_state=42).fit(world_bank)
countries_and_labels = pd.DataFrame(index=world_bank.index) \
                         .assign(cluster=world_bank_clustering.labels_.astype(str))

In [None]:
countries_and_labels

Now we have a cluster for each country, but it's kind of hard to tell which country is in which cluster. Run the cell below to see a map.

In [None]:
px.choropleth(countries_and_labels.reset_index(),
              locations = 'country',
              locationmode = 'country names',
              color = 'cluster',
              hover_name = 'country',
              title = 'Country Clusters',
              category_orders = {'cluster': ['0', '1', '2', '3']}
)

If you aren't the greatest at geography, don't worry – you can hover over a country to see its name.

If you'd prefer, you can look at the cluster assignments in text form below.

In [None]:
# Run this cell.
def show_all_clusters(cluster_df):
    for v in sorted(np.unique(cluster_df.get('cluster'))):
        print(f'Cluster {v}:', list(cluster_df[cluster_df.get('cluster') == v].get('cluster').index), '\n')
        
show_all_clusters(countries_and_labels)

You may notice an issue – the vast majority of countries are in the same cluster! That's not all that useful.
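
One way to quantify how lopsided a clustering is: count the number of points in each cluster. The sketch below does this with `np.unique` on a made-up label array; `toy_labels` is hypothetical and is not the actual output of the clustering above.

```python
import numpy as np

# Hypothetical cluster labels for 10 points (strings, like the labels above).
toy_labels = np.array(['0', '0', '0', '0', '1', '0', '2', '0', '3', '0'])

# return_counts=True gives the size of each cluster alongside its label.
clusters, sizes = np.unique(toy_labels, return_counts=True)

# Fraction of all points that fall in the largest cluster.
largest_share = sizes.max() / sizes.sum()

for c, s in zip(clusters, sizes):
    print(f'Cluster {c}: {s} points')
print(largest_share)
```

With the real data, the same idea could be applied to the `cluster` column of `countries_and_labels`.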

It turns out that if we **standardize** our data, as we do when computing standardized regression coefficients, the clustering results change. Run the cells below to standardize our dataset and re-run k-means clustering.

In [None]:
world_bank_standardized = (world_bank - np.mean(world_bank, axis=0)) / np.std(world_bank, axis=0)
world_bank_standardized
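
To check what this formula does, the sketch below applies the same computation to a small made-up array whose two columns are on very different scales. The names `toy`, `toy_standardized`, `col_means`, and `col_stds` are hypothetical. It assumes the population standard deviation, which is NumPy's default for `np.std` (`ddof=0`).

```python
import numpy as np

# Made-up data: 4 rows, 2 features on very different scales.
toy = np.array([[1.0, 1000.0],
                [2.0, 3000.0],
                [3.0, 2000.0],
                [4.0, 6000.0]])

# Subtract each column's mean, then divide by each column's standard deviation.
toy_standardized = (toy - np.mean(toy, axis=0)) / np.std(toy, axis=0)

# After standardizing, every column should have mean 0 and standard deviation 1.
col_means = toy_standardized.mean(axis=0)
col_stds = toy_standardized.std(axis=0)

print(col_means)  # approximately [0, 0]
print(col_stds)   # approximately [1, 1]
```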

In [None]:
world_bank_standardized_clustering = KMeans(n_clusters=4, random_state=42).fit(world_bank_standardized)
countries_and_labels_standardized = pd.DataFrame(index=world_bank_standardized.index) \
                                      .assign(cluster=world_bank_standardized_clustering.labels_.astype(str))

We can draw the same map as before:

In [None]:
px.choropleth(countries_and_labels_standardized.reset_index(),
              locations = 'country',
              locationmode = 'country names',
              color = 'cluster',
              hover_name = 'country',
              title = 'Country Clusters After Standardization',
              category_orders = {'cluster': ['0', '1', '2', '3']}
)

Note that in this new map, the clusters are a bit more "even" in terms of the number of countries in them. (The cluster numbers and colors are arbitrary, meaning that purple in the previous map is not related to purple in this map.)
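
Because the cluster numbers are arbitrary, two labelings can describe exactly the same grouping while using different numbers. The sketch below shows one way to check this on made-up labels; `same_partition` and the `labels_*` arrays are hypothetical helpers written for this illustration and are not part of `scikit-learn`.

```python
import numpy as np

def same_partition(labels_a, labels_b):
    '''Returns True if the two labelings group the points identically,
       ignoring which name each cluster was given.'''
    a, b = labels_a.tolist(), labels_b.tolist()
    pairs = set(zip(a, b))
    # The groupings match exactly when each label in a is paired with
    # exactly one label in b, and vice versa.
    return len(pairs) == len(set(a)) == len(set(b))

labels_a = np.array(['0', '0', '1', '2', '2'])
labels_b = np.array(['2', '2', '0', '1', '1'])  # same groups, renamed
labels_c = np.array(['0', '1', '1', '2', '2'])  # second point moved to another group

print(same_partition(labels_a, labels_b))  # True
print(same_partition(labels_a, labels_c))  # False
```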

In [None]:
show_all_clusters(countries_and_labels_standardized)

<p style="color:red"><b>Your Job</b></p>

In your PDF writeup, write the answers to the following two questions:

(i) Come up with a simple description for each of the 4 clusters created after standardizing our features.

(ii) Why do you think standardizing our features before clustering had such an impact? (_Hint: Scroll up and compare the values in `world_bank` to those in `world_bank_standardized`._)