# A Basic Introduction to ML

Last updated 2 June 2020 by Vanessa Meschke (vmeschke@mymail.mines.edu)

This notebook is a demo of the machine learning algorithms available through Scikit-Learn.

In [7]:
# imports
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
%matplotlib inline 
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

## Reading in Data

We'll be using the same dataset from TEDesign Lab that we used for the basic plotting demo before.

In [6]:
# Read in the data!
data_path = 'tedesignlab.csv' # This is where your data is located. 
all_data = pd.read_csv(data_path) # Turn CSV into pandas DataFrame
all_data.head() # See first 5 rows of data set

Unnamed: 0,compound,Space Group,Band Gap (eV),Lattice Thermal Conductivity (W/m K),Hole mobility (cm^2 / V s),Electron mobility (cm^2 / V s),beta (p-type),beta (n-type),Number of Atoms per Unit Cell,density (g/cm^3),volume of unit cell (A^3),Volume per Atom (A^3),Bulk Modulus (GPa),Average Coordination Number
0,La1O4V1,14,3.5,9.82,1.07,0.5,0.87,0.96,24,4.82,349.6,14.566667,89.8,1.92
1,La1O4V1,141,3.15,12.41,2.69,7.35,2.25,0.67,12,4.42,190.9,15.908333,97.3,3.92
2,In1O3Y1,185,2.02,6.27,3.45,411.06,2.85,3.85,30,5.67,442.3,14.743333,137.2,4.53
3,Ba2In2O5,46,0.93,8.07,8.5,678.87,0.99,5.69,18,6.1,318.3,17.683333,89.4,3.56
4,K6Mg1O4,186,1.71,2.2,0.48,35.84,0.47,2.63,22,2.53,424.3,19.286364,31.7,5.59


## Supervised and Unsupervised Learning Methods


### Supervised Learning



### Unsupervised Learning

## Classification and Regression Tasks

### Classification


### Regression

In [17]:
# Regression - predict lattice thermal conductivity using bulk modulus, electron/hole mobility,
# volume per atom, and density as features
# Read in data
kappa = all_data['Lattice Thermal Conductivity (W/m K)'].values # Get thermal cond & convert to NumPy array 
bm = all_data['Bulk Modulus (GPa)'].values
bm = bm.reshape((len(bm), 1))
mu_h = all_data['Hole mobility (cm^2 / V s)'].values
mu_h = mu_h.reshape((len(mu_h), 1))
mu_e = all_data['Electron mobility (cm^2 / V s)'].values
mu_e = mu_e.reshape((len(mu_e), 1))
vol_per_atom = all_data['Volume per Atom (A^3)'].values
vol_per_atom = vol_per_atom.reshape((len(vol_per_atom), 1))
density = all_data['density (g/cm^3)'].values
density = density.reshape((len(density), 1))

# Stack the feature columns side by side into an (n_samples, 5) matrix
X = np.hstack([bm, mu_e, mu_h, vol_per_atom, density])

# Set inputs to model - we'll use random forest here
rf = RandomForestRegressor()
rf.fit(X, kappa)

# PbTe features in the same column order as X: BM, mu_e, mu_h, volume per atom, density
pbte = np.asarray([38.51, 184.53, 267.5, 35.23, 7.89]).reshape(1, -1)

pbte_predicted = rf.predict(pbte)
print(pbte_predicted)

[3.562]




## Assessing Model Accuracy

### Classification Accuracy


### Regression Accuracy
There are several metrics that can gauge the performance of a regression model. The ones we'll cover from Scikit-Learn are:
- Mean Absolute Error: $\frac{1}{n}\sum_{i = 1}^{n}|predicted_{i} - actual_{i}|$ (No Error has MAE = 0)
- Mean Squared Error: $\frac{1}{n}\sum_{i = 1}^{n}(predicted_{i} - actual_{i})^{2}$ (No Error has MSE = 0)
- R$^{2}$ Score: $1 - \frac{\sum_{i = 1}^{n}(actual_{i} - predicted_{i})^{2}}{\sum_{i = 1}^{n}(actual_{i} - \overline{actual})^{2}}$, where $\overline{actual}$ is the mean of the actual values (No Error has R$^{2}$ = 1)

I also find the root mean squared error, normalized by the standard deviation of the actual values, useful for assessing model accuracy.
- Normalized Root Mean Squared Error: $\frac{\sqrt{\frac{1}{n}\sum_{i = 1}^{n}(predicted_{i} - actual_{i})^{2}}}{StDev(actual)}$ (No Error has Normalized RMSE = 0)
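
To make the formulas above concrete, here is a minimal sketch that computes each metric by hand with NumPy and compares the first three against Scikit-Learn. The `actual` and `predicted` arrays are made-up numbers for illustration only; they are not taken from the TE Design Lab data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration only (assumption, not from the dataset)
actual = np.array([2.0, 4.0, 6.0, 8.0])
predicted = np.array([2.5, 3.5, 6.5, 7.0])

errors = predicted - actual

# Each line mirrors one formula from the list above
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
r2 = 1 - np.sum((actual - predicted) ** 2) / np.sum((actual - actual.mean()) ** 2)
nrmse = np.sqrt(mse) / np.std(actual)

# The hand-computed values should agree with Scikit-Learn's
print("MAE:", mae, mean_absolute_error(actual, predicted))
print("MSE:", mse, mean_squared_error(actual, predicted))
print("R^2:", r2, r2_score(actual, predicted))
print("Normalized RMSE:", nrmse)
```

Note that `np.std` uses the population standard deviation (dividing by n) by default, which is also what the cell below uses.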

In [24]:
# Actual value of kappa for PbTe
pbte_actual = 6.09

# MAE
pbte_mae = mean_absolute_error([pbte_actual], [pbte_predicted])

# MSE
pbte_mse = mean_squared_error([pbte_actual], [pbte_predicted])

# R^2 (Note: Scikit-Learn needs at least two samples for R^2, so this single-sample call returns nan)
pbte_r2 = r2_score([pbte_actual], [pbte_predicted])

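# Normalized RMSE (RMSE divided by the standard deviation of the kappa values in the data set)
kappa_std = np.std(kappa)
print(kappa_std)
pbte_nrmse = math.sqrt(pbte_mse)/kappa_std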

# Print
print("Mean Absolute Error: " + str(pbte_mae))
print("Mean Squared Error: " + str(pbte_mse))
print("R^2 Score: " + str(pbte_r2))
print("Normalized Root Mean Squared Error: " + str(pbte_nrmse))

28.63329468991136
Mean Absolute Error: 2.5279999999999996
Mean Squared Error: 6.390783999999998
R^2 Score: nan
Normalized Root Mean Squared Error: 0.08828882695398353




## More Accurate Representations of Model Accuracy Using Cross Validation

### Leave One Out

### k-Fold
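In k-fold cross validation, we split the data into k similarly sized subsets (folds). Each fold is used once as the test set while the remaining k - 1 folds are used for training, so every sample is tested on exactly once.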

In [None]:
from sklearn.model_selection import KFold
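
As a rough sketch of how the k-fold loop could look, the cell below runs 5-fold cross validation with a random forest. It uses synthetic stand-in data (`X_demo`, `y_demo`) so that it runs on its own; those names, the choice of 5 folds, and the `n_estimators` and `random_state` settings are assumptions for illustration. With the real data you would pass `X` and `kappa` from the regression cell instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption): 50 samples with 5 features each
rng = np.random.default_rng(0)
X_demo = rng.random((50, 5))
y_demo = X_demo @ np.array([1.0, 2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 50)

# Shuffle before splitting so each fold gets a random subset of the rows
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_mae = []
for train_idx, test_idx in kf.split(X_demo):
    # Train a fresh model on the k - 1 training folds
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])

    # Score it on the held-out fold
    preds = model.predict(X_demo[test_idx])
    fold_mae.append(mean_absolute_error(y_demo[test_idx], preds))

print("MAE per fold:", fold_mae)
print("Mean MAE across folds:", np.mean(fold_mae))
```

Averaging the per-fold error gives a steadier estimate of model accuracy than scoring a single prediction, and because each held-out fold contains several samples, metrics like R$^{2}$ can be computed per fold as well.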