---
title: "Topics in Econometrics and Data Science: Tutorial 10"

---

#### General Note

You will very likely find the solution to these exercises online. We, however, strongly encourage you to work on these exercises without doing so. Understanding someone else’s solution is very different from coming up with your own. Use the lecture notes and try to solve the exercises independently.

## Exercise 1: Wine classification

Load the wine dataset, which contains information on the chemical composition of wines. You can load the data via:

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
dataset = load_wine()

1. Familiarize yourself with the data. How many different wine types are contained in the sample? How many features and observations are included?

In [2]:
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['Type'] = dataset.target
df.head()

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


In [3]:
print(np.shape(df))         # (observations, columns); the columns include the added 'Type'
print(dataset.target)       # class label of every observation
print(df['Type'].unique())  # distinct wine types

(178, 14)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
[0 1 2]
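
The counts asked for in question 1 can also be read off the raw arrays directly. The following is a minimal sketch, not part of the original solution. It uses `np.bincount` to count the observations per wine type; the variable names `n_obs`, `n_features`, and `class_counts` are chosen here for illustration. Note that `df` above has 14 columns only because `Type` was appended to the 13 features.

```python
import numpy as np
from sklearn.datasets import load_wine

dataset = load_wine()

# Number of observations and features (the target is not counted as a feature)
n_obs, n_features = dataset.data.shape

# Observations per wine type; the target labels are the integers 0, 1, 2
class_counts = np.bincount(dataset.target)

print("Observations:", n_obs)
print("Features:", n_features)
for name, count in zip(dataset.target_names, class_counts):
    print(name, count)
```

The three classes are not perfectly balanced, which is worth keeping in mind when interpreting error rates on a small test sample.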


2. Use the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function (`test_size = 0.2`, `random_state=0`) to split your data into a training and a testing sample.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2, random_state=0)

3. Try to classify your data with $k$-nearest neighbor classification using the [`neighbors.KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) function. Try different weights and numbers of neighbors to minimize your empirical error rate.

In [5]:
from sklearn import neighbors

# Weights: uniform
neighb = neighbors.KNeighborsClassifier(n_neighbors = 10, weights = 'uniform')
model = neighb.fit(X_train, y_train)
print('Error rate:', np.mean(np.not_equal(model.predict(X_test),y_test)))

Error rate: 0.2777777777777778


In [6]:
# Weights: distance
neighb = neighbors.KNeighborsClassifier(n_neighbors = 10, weights = 'distance')
model = neighb.fit(X_train, y_train)
print('Error rate:', np.mean(np.not_equal(model.predict(X_test),y_test)))

Error rate: 0.25
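
Instead of trying values by hand, the search over the number of neighbors and the weighting scheme could be automated. The sketch below is one possible approach and not the solution used above: it assumes 5-fold cross-validation on the training sample via `GridSearchCV`, and it additionally standardizes the features with `StandardScaler`. The scaling step is an addition made here, motivated by the fact that the wine features live on very different scales (compare `proline` with `hue`) and $k$-nearest neighbors relies on distances. The grid of 1 to 20 neighbors is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

dataset = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=0
)

# Assumption: standardize the features before computing distances
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate tuning parameters (the range 1..20 is chosen for illustration)
param_grid = {
    "knn__n_neighbors": list(range(1, 21)),
    "knn__weights": ["uniform", "distance"],
}

# Select the parameters by cross-validation on the training sample only
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

# Evaluate the selected model once on the held-out test sample
test_error = np.mean(search.predict(X_test) != y_test)
print("Best parameters:", search.best_params_)
print("Test error rate:", test_error)
```

Choosing the parameters on the training sample keeps the test sample untouched, so the reported test error is not biased by the tuning itself.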


4. Try to improve on your result by using random forests and the function [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [7]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

randforest = RandomForestClassifier(max_depth=5, random_state=0)
model = randforest.fit(X_train, y_train)
print('Error rate:', np.mean(np.not_equal(model.predict(X_test),y_test)))

Error rate: 0.027777777777777776


5. Can you determine the two most important features (as measured by the attribute [`.feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.feature_importances_)) for separating the wine types?

In [8]:
importance=pd.DataFrame([model.feature_importances_],index=["importance"], columns=dataset.feature_names)
importance

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| importance | 0.103923 | 0.031157 | 0.012805 | 0.020604 | 0.019559 | 0.059636 | 0.156013 | 0.019817 | 0.036109 | 0.174127 | 0.066472 | 0.14339 | 0.156389 |
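
To pick out the two most important features without scanning the table by eye, the importances can be sorted. This is a small sketch under the same settings as above (`max_depth=5`, `random_state=0`); storing the importances in a `pd.Series` and the name `top_two` are choices made here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

dataset = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=0
)

model = RandomForestClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

# One importance value per feature, labelled with the feature name
importance = pd.Series(model.feature_importances_, index=dataset.feature_names)

# Sort in descending order and keep the two largest values
top_two = importance.sort_values(ascending=False).head(2)
print(top_two)
```

In the output shown above, `color_intensity` and `proline` have the largest importances, with `flavanoids` a very close third, so the ranking of the second and third feature should be read with caution.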


## Exercise 2: Digits classification

The (famous) MNIST dataset contains 70,000 images of handwritten digits. Each observation consists of $784$ features (grey levels), which correspond to the pixels of a $28\times28$ image. The MNIST dataset is very popular for training and testing machine learning algorithms (see [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/)).


![Example, MNIST.](MNIST_1.png)

Due to time constraints, we work with a much smaller digits dataset with a reduced number of features ($8\times8$ images, i.e. $64$ features). First, load the digits dataset.

In [9]:
import pandas as pd
from sklearn.datasets import load_digits
dataset = load_digits()

df = pd.DataFrame(dataset.data)
df['Digit'] = dataset.target
df[0:3]

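|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | Digit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 5.0 | 13.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 6.0 | 13.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | 0.0 | 0.0 | 0.0 | 12.0 | 13.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 16.0 | 10.0 | 0.0 | 0.0 | 1 |
| 2 | 0.0 | 0.0 | 0.0 | 4.0 | 15.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 16.0 | 9.0 | 0.0 | 2 |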


1. Again, familiarize yourself with the data. How many features and observations are included?

In [10]:
import numpy as np

np.shape(df)

(1797, 65)
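
The 65 columns of `df` are the 64 pixel features plus the appended `Digit` column. As a short sketch (not part of the original solution), each row of `dataset.data` can be reshaped back into its $8\times8$ image; the name `first_image` is chosen here for illustration. The dataset also ships the same pixels in image form as `dataset.images`.

```python
from sklearn.datasets import load_digits

dataset = load_digits()

# dataset.data holds one flattened image per row (64 grey levels)
n_obs, n_features = dataset.data.shape
print(n_obs, n_features)

# Reshape the first row back into an 8x8 grid of grey levels
first_image = dataset.data[0].reshape(8, 8)
print(first_image)

# The reshaped row matches the image stored in dataset.images
print((first_image == dataset.images[0]).all())
```

Printing the grid makes the structure of the features visible: the non-zero entries trace the outline of the handwritten digit.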

2. Use the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function (`test_size = 0.2`, `random_state=0`) to split your data into a training and a testing sample.


In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(dataset.data,dataset.target, test_size=0.2, random_state=0)

3. Try to classify your data with support vector machines using the function [`svm.SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). Try different kernels and other tuning parameters to minimize your empirical error rate.

In [12]:
from sklearn import svm

# Use linear kernel
supvecma = svm.SVC(kernel = 'linear', C = 2)
model = supvecma.fit(X_train, y_train)
print('Error rate linear kernel:', np.mean(np.not_equal(model.predict(X_test),y_test)))

Error rate linear kernel: 0.022222222222222223


In [13]:
# Use polynomial kernel 
supvecma = svm.SVC(kernel = 'poly',degree =2, C = 2, gamma = 'auto')
model = supvecma.fit(X_train, y_train)
print('Error rate polynomial kernel:', np.mean(np.not_equal(model.predict(X_test),y_test)))

Error rate polynomial kernel: 0.011111111111111112
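
Several kernels can be compared in one loop rather than one cell at a time. The sketch below repeats the two settings used above and, as an additional candidate chosen here for illustration, a radial basis function kernel with `gamma='scale'`. The dictionary names `candidates` and `error_rates` are hypothetical.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

dataset = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=0
)

# Candidate classifiers; the rbf entry is an extra illustration
candidates = {
    "linear, C=2": svm.SVC(kernel="linear", C=2),
    "poly (degree 2), C=2": svm.SVC(kernel="poly", degree=2, C=2, gamma="auto"),
    "rbf, C=2": svm.SVC(kernel="rbf", C=2, gamma="scale"),
}

# Fit each candidate and record its empirical error rate on the test sample
error_rates = {}
for label, clf in candidates.items():
    clf.fit(X_train, y_train)
    error_rates[label] = np.mean(clf.predict(X_test) != y_test)
    print("Error rate", label, ":", error_rates[label])
```

Picking the kernel with the lowest test error gives an optimistic estimate, because the test sample is then used for tuning. Selecting the parameters by cross-validation on the training sample would be a cleaner alternative.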
