# Quiz 4

In [3]:
import numpy as np
import pandas as pd
import scipy
from vega_datasets import data
from IPython.display import display, SVG
from lets_plot import *
from sklearn.linear_model import LinearRegression
from lets_plot.mapping import as_discrete
import seaborn as sns
LetsPlot.setup_html()

# Part 1: Fixing models

The following question uses this dataset of measurements of penguin bills. One goal of this dataset is to determine the species of a penguin from its bill. In the visualization below, we can see that the 3 species are relatively well separated by these measurements. In practice though, we may not know which penguin belongs to which species. In this case we may want to try to find species by *clustering* our measurements. 

*This data was collected by Dr. Kristen Gorman at the Palmer Station in Antarctica.*

In [2]:
penguins = sns.load_dataset("penguins")
penguins = penguins[(penguins['sex'] == 'Male')][['body_mass_g', 'bill_length_mm', 'species']].dropna()

ggplot(penguins, aes(x='body_mass_g', y='bill_length_mm', shape=as_discrete('species', order=1), color=as_discrete('species', order=1))) + \
    geom_point() +\
    scale_color_brewer(palette="Set2")

## Q1: Issues with K-means

The code below runs k-means on the penguins dataset, this time using the variables `bill_length_mm` and `body_weight_g`. Notice that despite the 3 species seeming to be well-separated in the visualization above, K-means does a poor job at recovering the 3 clusters.  

In [3]:
from sklearn.metrics import pairwise_distances
measurements_only = penguins[['bill_length_mm', 'body_mass_g']]

num_clusters = 3
np.random.seed(55)
measurements_with_clusters = measurements_only.copy()
measurements_with_clusters['cluster'] = np.random.randint(0, num_clusters, (measurements_only.shape[0],))

for iteration in range(10):
    centroids = measurements_with_clusters.groupby('cluster').mean()
    distances = pairwise_distances(measurements_only, centroids)
    measurements_with_clusters['cluster'] = distances.argmin(axis=1)


ggplot(measurements_with_clusters, aes(x='body_mass_g', y='bill_length_mm', shape=as_discrete('cluster', order=1), color=as_discrete('cluster', order=1))) + \
    geom_point() +\
    scale_color_brewer(palette="Set2")

Identify the problem, and choose the **best** approach to resolve this issue. (*Feel free to test your approach!*)

- **Standardize** the data.
- Use the **k-medians** algorithm instead of k-means.
- Increase the **number of clusters**.
- Run k-means **multiple times**.
- Take a **subset** of the data.

# Part 2: Preprocessing

For the next few questions, we'll return to the Gapminder dataset that we looked at in previous assignments. As a reminder, this dataset tracks various economic, health and human developtment metrics for coutries around the world. In this case, we'll look at how location and economic indicators affect life expectency.

## Q2: Factor data

We'll start with a very rough approach. Each country is given a label of one of four regions: `Africa`, `Americas`, `Asia` and `Europe`. Let's look how this region and average income predicts life expectency for the year 2019.

Using the data in `lex_data` below, fit a regression model that predicts `life exp` **from** `region` and `income`. Under this model, what is the *change in life expectency* when changing a country's region from **Africa** to **Europe** (holding income constant)?

What would the difference be if our model did **not** account for `income`?

In [113]:
countries = pd.read_csv('data/countries.csv')[['name', 'four_regions']].set_index('name').rename(columns={'four_regions': 'region'})
countries['region'] = countries.region.astype('category')
life_exp = pd.read_csv('data/lex.csv')[['country', '2019']].set_index('country').rename(columns={'2019': 'life exp'})
income = pd.read_csv('data/income.csv')[['country', '2019']].set_index('country').rename(columns={'2019': 'income'})
lex_data = countries.join(life_exp).join(income).dropna()

ggplot(lex_data.reset_index(), aes(x='income', y='life exp', color='region')) + geom_point(tooltips=layer_tooltips(['name']))

In [None]:
y = lex_data['life exp']

# Create a dataframe with only the appropriate, pre-processed inputs
inputs = 
x = inputs.values.astype(float)

model = LinearRegression().fit(x, y)
print(list(zip(inputs.columns, model.coef_)))

# Determine the answer to the question

### Answer: 

Using the code below, we see that our model predicts that someone in Europe will live **7.29** years longer on average than someone in Africa. Without accounting for income, we would expect that someone in Europe would live **13.24** years longer.

In [99]:
y = lex_data['life exp']
inputs = pd.get_dummies(lex_data[['income', 'region']])
x = inputs.values.astype(float)

model = LinearRegression().fit(x, y)
print(list(zip(inputs.columns, model.coef_)))

[('income', np.float64(0.1698422298358171)), ('region_africa', np.float64(-4.759524551712193)), ('region_americas', np.float64(2.4407878796344997)), ('region_asia', np.float64(-0.21874771041487953)), ('region_europe', np.float64(2.537484382492568))]


In [112]:
print('Difference incl. income:', model.coef_[-1] - model.coef_[1])

europe_mean = lex_data.groupby('region').mean().loc['europe']['life exp']
africa_mean = lex_data.groupby('region').mean().loc['africa']['life exp']
print('Difference without income:', europe_mean - africa_mean)

Difference incl. income: 7.2970089342047615
Difference without income: 13.242725366876329


  europe_mean = lex_data.groupby('region').mean().loc['europe']['life exp']
  africa_mean = lex_data.groupby('region').mean().loc['africa']['life exp']


## Q3 Preprocessing variables

Look back to the `income` vs. `life exp` visualization from `q2`. 

Based on this plot what would you expect to be the most useful transformation of the `income` variable for our linear regression? (*Feel free to test the transforms in code!*)

- Standardization
- Imputation
- Reciprocal transform
- Log transform

**Discussion:** Do you think that applying this transformation in the visualization above would aid or harm the interpret-ability of the plot?

## Q4: Missing data

We'll now try a more sophisticated model with more data, focusing on economic indicators. In this case we'll model `life exp` as a function of average `income`, human development `hdi`, and income inequality (measured by the `gini` index). We'll take 30 years of data from the from 1990 until 2019. The code below loads a dataframe with this data. Unfortunately we don't actually have measurements of these 3 variables for every country for every year! 

In [134]:
life_exp = pd.read_csv('data/lex.csv').melt(id_vars='country').rename(columns={'variable': 'year', 'value': 'life exp'}).set_index(['country', 'year'])
income = pd.read_csv('data/income.csv').melt(id_vars='country').rename(columns={'variable': 'year', 'value': 'income'})
gini = pd.read_csv('data/gini.csv').melt(id_vars='country').rename(columns={'variable': 'year', 'value': 'gini'})
hdi = pd.read_csv('data/hdi_human_development_index.csv').melt(id_vars='country').rename(columns={'variable': 'year', 'value': 'hdi'})

data = income.merge(life_exp, on=['country', 'year']).merge(gini, on=['country', 'year']).merge(hdi, on=['country', 'year']).set_index(['country', 'year'])


Which variable of the 3 inputs in `data` has missing values?

Based on the data, is it reasonable to assume that values are *missing completely at random*?

- Yes
- No

*Remove observations* with missing values from the dataset and fit a linear regression model. What is `hdi` coefficient of this model?

Fill in missing values with *mean imputation* and fit a linear regression model. What is `hdi` coefficient of this model?



In [None]:
fixed_data = 

x = fixed_data[['hdi', 'gini', 'income']].values.astype(float)
y = fixed_data['life exp'].values.astype(float)
fixed_data

# Part 2: Tree-based models

**Outline** Fill in the entropy formula for a recursive implementation of descision tree fitting. 

What is the depth of a tree that perfectly separates the data?

What percentage of held-out data is correctly classified for the *tree* and for *logistic regression*?

In [1]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
heart_disease = fetch_ucirepo(id=45) 
  
# data (as pandas dataframes) 
X = heart_disease.data.features 
y = heart_disease.data.targets 
  
# metadata 
print(heart_disease.metadata) 
  
# variable information 
print(heart_disease.variables) 

{'uci_id': 45, 'name': 'Heart Disease', 'repository_url': 'https://archive.ics.uci.edu/dataset/45/heart+disease', 'data_url': 'https://archive.ics.uci.edu/static/public/45/data.csv', 'abstract': '4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 303, 'num_features': 13, 'feature_types': ['Categorical', 'Integer', 'Real'], 'demographics': ['Age', 'Sex'], 'target_col': ['num'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1989, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C52P4X', 'creators': ['Andras Janosi', 'William Steinbrunn', 'Matthias Pfisterer', 'Robert Detrano'], 'intro_paper': {'title': 'International application of a new probability algorithm for the diagnosis of coronary artery disease.', 'authors': 'R. Detrano, A. Jánosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sa

In [2]:
data = heart_disease.data.features.join(heart_disease.data.targets)[['age', 'trestbps', 'chol', 'thalach', 'num']]
data['num'] = (features['num'] > 0).astype(int)


thalach = data['thalach'].values.astype(float)
thalach[np.random.rand(*thalach.shape) < 0.2] = np.nan
data['thalach'] = thalach



NameError: name 'features' is not defined

In [263]:
data

Unnamed: 0,age,trestbps,chol,thalach,num
0,63,145,233,150.0,0
1,67,160,286,108.0,1
2,67,120,229,129.0,1
3,37,130,250,,0
4,41,130,204,172.0,0
...,...,...,...,...,...
298,45,110,264,132.0,1
299,68,144,193,141.0,1
300,57,130,131,115.0,1
301,57,130,236,174.0,1


In [258]:
from sklearn.linear_model import LogisticRegression

fixed_data = data.dropna().fillna(features.median())

x = fixed_data[['age', 'trestbps', 'chol', 'thalach']].values.astype(float)
y = fixed_data['num'].values

In [259]:
model = LogisticRegression().fit(x, y)
model.score(x, y)

0.7090163934426229

In [260]:
features.dropna().values

array([[ 63., 145., 233., 150.,   0.],
       [ 67., 160., 286., 108.,   1.],
       [ 37., 130., 250., 187.,   0.],
       ...,
       [ 68., 144., 193., 141.,   1.],
       [ 57., 130., 131., 115.,   1.],
       [ 57., 130., 236., 174.,   1.]])