# <div style="text-align: center">Machine Learning</div>

<div style="text-align: center"> <sub>ENCN404 - Modern Modelling Practices in Civil Engineering</sub></div>

$\,$

<div style="text-align: center"> University of Canterbury </div>

$\,$

<img src="img/ml.png" alt="Drawing" style="width: 600px;"/>

### Notebook instructions

Run cells containing code by clicking on them and hitting **Ctrl+Enter** or by Cell>Run Cells in the drop-down menu.

For queries, contact the course instructor or notebook author (David Dempsey).

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

## 1. Data exploration with Pandas

Work through the examples below during the lecture

In [None]:
# The most important object is the DataFrame. Think of this like a table in a spreadsheet.
data={'time': [30, 60, 90, 120, 150], 'rainfall': [4, 11, 32, 8, 0], 'runoff': [0, 0, 1.7, 8.6, 3.1]}

# create the dataframe from a dictionary of data
df=pd.DataFrame(data)

# look at the dataframe
df.head()

In [None]:
# display the column names
df.columns

In [None]:
# display the row and column counts
df.shape

In [None]:
# Dataframes have indices. These are like the indices of an array or list, e.g., 0, 1, 2, …, N-1.
# By default they are populated following the Python convention (counting from 0). They can be accessed from the 'index' attribute.
df.index

In [None]:
# Indices don't have to be integers. We can change them to something else. 
# A popular choice is some kind of measure of time, in which case we are working with time series data.
df.set_index('time', inplace=True)

In [None]:
# We can use indices to get access to parts of the dataframe.
print(df.loc[30])
print(df.loc[90,'rainfall'])
print(df.loc[120:,'runoff'])
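The difference between label-based and position-based access is easy to mix up once the index is no longer 0, 1, 2, …. The sketch below rebuilds the same small dataframe and compares `.loc` with `.iloc`; the variable names are only for illustration.

```python
import pandas as pd

# Same data as the lecture example, with 'time' as the index
df = pd.DataFrame(
    {'time': [30, 60, 90, 120, 150],
     'rainfall': [4, 11, 32, 8, 0],
     'runoff': [0, 0, 1.7, 8.6, 3.1]}
).set_index('time')

# .loc selects by index LABEL: the row whose time is 90
by_label = df.loc[90, 'rainfall']

# .iloc selects by integer POSITION: third row, first column
by_position = df.iloc[2, 0]

# label slices include both end points, unlike ordinary Python slices
label_slice = df.loc[60:120, 'runoff']

print(by_label, by_position)
print(label_slice)
```

Both lookups return the same rainfall value here, and the label slice keeps the rows at times 60, 90 and 120.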

In [None]:
# extract a series (one column) from the larger dataframe
rain=df['rainfall']

# summarize aspects of the series
print(rain.max())             # or min, mean, std, sum
print(rain.describe())
print(rain.unique())          # sort_values, value_counts

In [None]:
# With matplotlib, we can also generate plots
rain.plot(kind='hist')   # or line, box, pie…
plt.show()

In [None]:
# We'll use the dataframe as a variable on which to do calculations (like a spreadsheet). 
# For example, calculate new columns
df['rnf_rnd']=df['rainfall'].round()
df.head()

In [None]:
# or calculate a summary row
df.max()

In [None]:
# We can write dataframes out to files, and read them back in again. We'll generally use CSV files.
df.to_csv('rainfall.csv')
df2=pd.read_csv('rainfall.csv')
df2.head()
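Note that reading the file back with default settings returns `time` as an ordinary column rather than as the index. A minimal sketch of one way to restore it, using the `index_col` argument of `read_csv`, is given below. It writes to an in-memory buffer instead of a file on disk, which is an assumption made only to keep the example self-contained.

```python
import io
import pandas as pd

df = pd.DataFrame(
    {'time': [30, 60, 90], 'rainfall': [4, 11, 32]}
).set_index('time')

# write to an in-memory text buffer (stand-in for 'rainfall.csv')
buffer = io.StringIO()
df.to_csv(buffer)

# default read: 'time' comes back as an ordinary column
buffer.seek(0)
df_default = pd.read_csv(buffer)

# index_col restores 'time' as the index
buffer.seek(0)
df_indexed = pd.read_csv(buffer, index_col='time')

print(df_default.columns.tolist())
print(df_indexed.index.name)
```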

In [None]:
# rolling window calculations are a useful series operation
df['avg_rain']=df['rainfall'].rolling(3).mean()
df.head()
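A 3-point window has nothing to average until three values are available, so the first two entries of the rolling mean are NaN. The sketch below shows this on the rainfall values, and one possible alternative using the `min_periods` argument of `rolling()`.

```python
import pandas as pd

rain = pd.Series([4, 11, 32, 8, 0], index=[30, 60, 90, 120, 150])

# a 3-point window needs 3 values, so the first two results are NaN
avg_full = rain.rolling(3).mean()

# min_periods=1 lets the window use however many values are available
avg_partial = rain.rolling(3, min_periods=1).mean()

print(avg_full)
print(avg_partial)
```

Whether partial windows are acceptable depends on the analysis: they avoid gaps, but the early averages are based on fewer points.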

In [None]:
# rolling() can be chained with apply() to evaluate any function you like over each window, e.g., sum of squares
def sum_of_squares(x):
    return np.sum(x**2)
df['ss_runoff']=df['runoff'].rolling(3).apply(sum_of_squares)
df.head()

In [None]:
# let's look now at some categorical data in buildings.csv
df=pd.read_csv('buildings.csv')
df

In [None]:
# we can sort on a particular column
df.sort_values('Cost', ascending=False)

# (Note: sort_values() OUTPUTS a new sorted dataframe. It does not sort the original
#  dataframe unless you set inplace=True)
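The note above can be checked directly. The sketch below uses a small made-up table (hypothetical values, not the contents of buildings.csv) to show that the original dataframe keeps its order until the sorted result is assigned back.

```python
import pandas as pd

# hypothetical costs for illustration; not the contents of buildings.csv
df = pd.DataFrame({'Type': ['office', 'school', 'house'],
                   'Cost': [5.0, 9.0, 1.0]})

# sort_values() returns a new dataframe and leaves df untouched
sorted_df = df.sort_values('Cost', ascending=False)

print(df['Cost'].tolist())         # original order
print(sorted_df['Cost'].tolist())  # descending order

# to keep the sorted result, assign it back to the same name
df = df.sort_values('Cost', ascending=False)
```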

In [None]:
# we can group according to a category and calculate summaries of those groups
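# (numeric_only=True skips text columns, which recent versions of pandas refuse to take a median of)
df.groupby('Type').median(numeric_only=True)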
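Since buildings.csv is not reproduced here, the sketch below illustrates grouping on a small made-up table. The `Name` column and all values are hypothetical; only `Type` and `Cost` are column names taken from the examples above.

```python
import pandas as pd

# hypothetical data; the real columns of buildings.csv may differ
df = pd.DataFrame({
    'Type': ['office', 'office', 'school', 'school', 'school'],
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'Cost': [5.0, 7.0, 2.0, 9.0, 4.0],
})

# numeric_only=True skips text columns such as 'Name'
medians = df.groupby('Type').median(numeric_only=True)

# agg() computes several summaries of one column at once
summary = df.groupby('Type')['Cost'].agg(['median', 'max', 'count'])

print(medians)
print(summary)
```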

In [None]:
# we can use pandas to find and replace outliers, for example in the data below
df=pd.DataFrame({'disp': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.2, 10, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8], 'load': [10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38]}) 
plt.plot(df['load'], df['disp'], 'kx'); plt.show()

In [None]:
# calculate a new column that is the z-score (deviation from the mean, in units of standard deviation)
df['zsc']=(df['disp']-df['disp'].mean())/df['disp'].std()
df

In [None]:
# find outliers based on large absolute z-scores (where() keeps values that pass the test and sets the rest to NaN)
no_outliers=df['disp'].where(df['zsc'].abs()<3)

# replace the outliers with linear interpolation
df['disp']=no_outliers.interpolate(method='linear') 
plt.plot(df['load'], df['disp'], 'kx'); plt.show()
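The replacement step relies on two series methods working together: `where()` blanks out the values that fail a test, and `interpolate()` fills the resulting gaps from their neighbours. A minimal sketch on a five-value series is given below; the fixed threshold of 5 is an assumption standing in for the z-score test used above.

```python
import pandas as pd

s = pd.Series([1.0, 1.2, 10.0, 1.6, 1.8])

# keep values that pass the test; everything else becomes NaN
# (a simple threshold is assumed here in place of the z-score test)
gaps = s.where(s < 5)

# fill each NaN on a straight line between its neighbours
filled = gaps.interpolate(method='linear')

print(gaps.tolist())
print(filled.tolist())
```

The outlier at position 2 is replaced by the midpoint of 1.2 and 1.6.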

## 2. Feature engineering

## 3. Unsupervised learning and clustering

## 4. Hypothesis testing

Work through the examples below during the lecture

In [41]:
from scipy import stats
# Data: compressive strength measurements of two concrete types
concrete_type_A = [30, 32, 31, 33, 29, 28, 30, 31, 32, 30]
concrete_type_B = [35, 34, 33, 36, 32, 31, 33, 34, 35, 33]

# Perform an independent two-sample t-test
t_statistic, p_value = stats.ttest_ind(concrete_type_A, concrete_type_B)

# Set significance level (alpha)
alpha = 0.05
if p_value < alpha: 
    print('different means')

different means
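By default `ttest_ind` assumes the two samples have equal variances. If that assumption is in doubt, one option is Welch's t-test, selected with `equal_var=False`. The sketch below runs both versions on the same concrete data and prints the statistics rather than only the verdict; it is an illustration, not a recommendation for which test suits a given dataset.

```python
from scipy import stats

concrete_type_A = [30, 32, 31, 33, 29, 28, 30, 31, 32, 30]
concrete_type_B = [35, 34, 33, 36, 32, 31, 33, 34, 35, 33]

# Student's t-test (default): assumes equal variances
t_student, p_student = stats.ttest_ind(concrete_type_A, concrete_type_B)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(concrete_type_A, concrete_type_B,
                                   equal_var=False)

alpha = 0.05
for name, t, p in [('Student', t_student, p_student),
                   ('Welch', t_welch, p_welch)]:
    verdict = 'different means' if p < alpha else 'no significant difference'
    print(name, t, p, verdict)
```

The negative t statistic reflects that type A has the lower sample mean.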


In [44]:
# Data: stage and flow measurements on a river
stage = [1.2, 1.5, 1.8, 2.0, 2.3, 2.6, 2.9, 3.2, 3.5, 3.8]
flow = [11.0, 11.3, 12.0, 12.3, 18.0, 18.5, 19.8, 25.3, 28.3, 28.2]

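# Calculate Kendall's tau and p-value using all ten measurements
tau, p_value = stats.kendalltau(stage, flow)

# - A positive tau indicates a positive correlation (as stage height increases, river flow tends to increase).
# - The p-value tells us if the correlation is statistically significant.
# Repeat with every third measurement only (4 points). This overwrites the result above:
# tau is still positive, but with so few points the p-value is much larger.
tau, p_value = stats.kendalltau(stage[::3], flow[::3])
print(tau, p_value)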

1.0 0.08333333333333333
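The cell above computes tau twice but prints only the second result, from the subsampled data. The sketch below keeps both results under separate names (the `_full` and `_sub` suffixes are illustrative) so the effect of sample size on the p-value can be compared side by side.

```python
from scipy import stats

stage = [1.2, 1.5, 1.8, 2.0, 2.3, 2.6, 2.9, 3.2, 3.5, 3.8]
flow = [11.0, 11.3, 12.0, 12.3, 18.0, 18.5, 19.8, 25.3, 28.3, 28.2]

# all ten measurements
tau_full, p_full = stats.kendalltau(stage, flow)

# every third measurement (4 points)
tau_sub, p_sub = stats.kendalltau(stage[::3], flow[::3])

alpha = 0.05
print('full:      ', tau_full, p_full, p_full < alpha)
print('subsampled:', tau_sub, p_sub, p_sub < alpha)
```

With only four points, even a perfectly monotonic relationship (tau of 1) does not reach significance at the 0.05 level, whereas the full dataset does.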


In [None]:
# Perform the Mann-Whitney U test (a non-parametric alternative to the t-test, using the concrete data defined above)
statistic, p_value = stats.mannwhitneyu(concrete_type_A, concrete_type_B, alternative='two-sided')

# Set significance level (alpha)
alpha = 0.05
if p_value < alpha: 
    print('different medians')

## 5. Supervised learning

## 6. Performance metrics

## 7. Cross validation