# Regression - Additional Notes
We created a notebook called "Regression, regression, regression" comparing different regression techniques. That notebook became rather long and drifted off topic, so this notebook holds some of the additional material taken out of it to keep it focused.

In section 2 we discovered a relationship between child_mortality and fertility. We plot it here to check that it is in fact correct.

## Plotting a linear regression for child mortality and fertility

Having learnt from our Lasso regression that child mortality has a stronger relationship to fertility than life expectancy does, it is worth plotting the linear regression for child mortality and fertility to see how close this relationship is.

In [None]:
## Fit & predict - Linear Regression ##

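# Import the libraries used in this cell
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Load the data (the same file that is read in the PCA cell further down)
lf = pd.read_csv('Data/gm_2008_region.csv', encoding="ISO-8859-1", sep=',', index_col='Unnamed: 0')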

def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    # Compute correlation matrix: corr_mat
    corr_mat=np.corrcoef(x,y)
    # Return entry [0,1]
    return corr_mat[0,1]

# Create arrays for features and target variable
y = lf['fertility'].values
X_mortality = lf['child_mortality'].values

print("Pearson's Coefficient : " + (str(pearson_r(y,X_mortality))))

# Reshape X and y
y = y.reshape(-1,1)
X_mortality = X_mortality.reshape(-1,1)

# Create the regressor: reg
reg = LinearRegression()

# Create the prediction space
prediction_space = np.linspace(min(X_mortality), max(X_mortality)).reshape(-1,1)

# Fit the model to the data
reg.fit(X_mortality, y)

# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)

# Print R^2 (coefficient of determination on the fitted data)
print("R^2 score : " + str(reg.score(X_mortality, y)))

# Plot data points
plt.scatter(X_mortality, y,marker='.',color='green')

# Plot regression line
plt.plot(prediction_space, y_pred, color='purple', linewidth=2)

# Label Graph
_ = plt.title('Supervised Learning - Child mortality and Fertility')
_ = plt.xlabel('Child Mortality')
_ = plt.ylabel('Fertility')

#show plot
plt.show()

The Pearson's coefficient for life expectancy and fertility was 0.78, but for child mortality and fertility it is higher at 0.91, which suggests that there is a stronger relationship between child mortality and fertility than there is between life expectancy and fertility.
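
As a rough check on how these coefficients relate to the score printed by the cell above: for a one-feature least-squares fit with an intercept, R^2 equals the square of Pearson's r, so r = 0.91 corresponds to an R^2 of about 0.83 and r = 0.78 to about 0.61. The sketch below illustrates this on synthetic data; the data and the variable names are assumptions for illustration, not the Gapminder values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration, not the Gapminder values)
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)
y = 1.5 + 0.04 * x + rng.normal(scale=0.5, size=200)

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

# R^2 of a one-feature least-squares fit, scored on the same data
reg = LinearRegression().fit(x.reshape(-1, 1), y)
r_squared = reg.score(x.reshape(-1, 1), y)

print(f"r = {r:.4f}, r squared = {r ** 2:.4f}, R^2 = {r_squared:.4f}")
```

The two printed values on the right should agree up to floating-point error, which is why comparing Pearson coefficients and comparing R^2 scores rank the two relationships the same way here.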

So while there was little difference between the linear regression plots generated by least squares and supervised learning for the life expectancy and fertility relationship, the supervised learning Lasso regression produced a far more insightful result.
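
To illustrate how Lasso coefficients can point to the more informative feature, here is a minimal sketch on synthetic data. The feature names, the coefficients used to generate the target and `alpha=0.1` are all assumptions for illustration; this is not the model from the original notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic features (hypothetical): one strong, one weak, one pure noise
rng = np.random.default_rng(1)
n = 300
strong = rng.normal(size=n)
weak = rng.normal(size=n)
noise_feature = rng.normal(size=n)
X = np.column_stack([strong, weak, noise_feature])
y = 3.0 * strong + 0.5 * weak + rng.normal(scale=0.5, size=n)

# Standardize so the coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X)

# The L1 penalty shrinks coefficients and pushes uninformative ones towards zero
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

for name, coef in zip(["strong", "weak", "noise"], lasso.coef_):
    print(f"{name}: {coef:.3f}")
```

Features with larger absolute coefficients contribute more to the prediction, which is the sense in which a Lasso fit can rank child mortality above life expectancy.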

# Dimension Reduction : Unsupervised Learning

While linear regression belongs to supervised rather than unsupervised learning, you can use unsupervised learning to identify principal components and to reduce dimensions.

## Dimension Reduction

Dimension reduction seeks to simplify data by looking for patterns in the data, and then using these patterns to re-express the data in a compressed form. This makes subsequent computation with the data much more efficient, which can be very useful for larger datasets/big data.

Dimension reduction also seeks to remove less-informative "noise" features, which cause problems for prediction tasks such as classification and regression.

## Principal Component Analysis (PCA)

The most fundamental dimension reduction technique is Principal Component Analysis.

Principal Component Analysis consists of two steps: "decorrelation", which doesn't change the dimensions of the data at all, and dimension reduction.
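
The two steps can be sketched on a small synthetic dataset. The data and variable names below are assumptions for illustration, separate from the Gapminder cell that follows.

```python
import numpy as np
from sklearn.decomposition import PCA

# Two strongly correlated synthetic features (an assumption for illustration)
rng = np.random.default_rng(2)
a = rng.normal(size=500)
b = 0.8 * a + rng.normal(scale=0.3, size=500)
data = np.column_stack([a, b])

# Step 1: decorrelation. The data is rotated onto the principal axes,
# so it keeps the same shape but the new features are uncorrelated.
decorrelated = PCA().fit_transform(data)
corr_before = np.corrcoef(data[:, 0], data[:, 1])[0, 1]
corr_after = np.corrcoef(decorrelated[:, 0], decorrelated[:, 1])[0, 1]

# Step 2: dimension reduction. Keep only the highest-variance component.
reduced = PCA(n_components=1).fit_transform(data)

print("shape after decorrelation:", decorrelated.shape)
print(f"correlation before: {corr_before:.3f}, after: {corr_after:.3f}")
print("shape after reduction:", reduced.shape)
```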

In [None]:
## Principal Component Analysis (PCA) ##

# import libraries
import pandas as pd
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# import the data
lf = pd.read_csv('Data/gm_2008_region.csv', encoding="ISO-8859-1", sep=',', index_col='Unnamed: 0')

# drop the columns that are not used in this analysis
samples = lf.drop(['Region', 'population', 'GDP'], axis=1)

# create labels
from sklearn.cluster import KMeans
model = KMeans(n_clusters=4)
model.fit(samples)
labels = model.predict(samples)

# plot original dataset
plt.figure(1)
plt.scatter((samples['fertility']),(samples['life']),c=labels)

# Set the y-axis range
plt.ylim(45,85)

plt.show()

#import library
from sklearn.decomposition import PCA

# instantiate PCA model and .fit() data to it
model = PCA()
model.fit(samples)

# .transform() the data
transformed = model.transform(samples)

# plot transformed dataset
transformeddf=pd.DataFrame(data=transformed[0:,0:])
plt.figure(2)
plt.scatter(transformeddf[0],transformeddf[1],c=labels)
plt.axhline(0, color='black')
plt.axvline(0, color='black')
plt.show()

# Calculate the Pearson correlation of fertility and life expectancy
correlation, pvalue = pearsonr(samples['fertility'], samples['life'])

# Display the correlation
print("Pearson's correlation : " + str(correlation))

# Calculate the Pearson correlation of the first two principal components
correlation, pvalue = pearsonr(transformeddf[0], transformeddf[1])

# Display the correlation
print("Pearson's correlation for Principal Components : " + str(correlation))

# Plotting the variances of PCA features

In [None]:
# compute the intrinsic dimension
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Create scaler: scaler
scaler = StandardScaler()
# Create a PCA instance: pca
pca = PCA()
# Create pipeline: pipeline
pipeline = make_pipeline(scaler,pca)
# Fit the pipeline to 'samples'
pipeline.fit(samples)
# Plot the explained variances
plt.figure(1)
features = range(pca.n_components_)
plt.bar(features,pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()
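
One way to turn a variance plot like this into a number is to count how many components are needed to reach a chosen share of the total variance. The sketch below does this on synthetic data built from two underlying factors; the data, the fixed mixing matrix and the 95% threshold are all assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Six observed features generated from two latent factors plus a little noise
rng = np.random.default_rng(3)
latent = rng.normal(size=(400, 2))
mixing = np.array([[1.0, 0.8, -0.6, 0.5, 1.2, -0.9],
                   [0.4, -1.0, 0.9, 1.1, -0.5, 0.7]])
data = latent @ mixing + rng.normal(scale=0.05, size=(400, 6))

# Standardize, then fit PCA, as in the pipeline above
pca = PCA()
make_pipeline(StandardScaler(), pca).fit(data)

# Cumulative share of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching the (assumed) 95% threshold
n_needed = int(np.argmax(cumulative >= 0.95) + 1)

print("cumulative explained variance:", np.round(cumulative, 3))
print("components needed for 95%:", n_needed)
```

Because the synthetic features come from two factors, two components carry almost all of the variance, which is what a sharp drop in the bar chart after the second bar would indicate.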

## The Boston house price linear regression example with Keras

In [None]:
## multi feature Linear regression with Tensorflow and Keras ##
## Sample from BOSTON house price Sample ##

import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
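# Note: keras.wrappers.scikit_learn is only available in older Keras releases;
# newer setups typically use the separate SciKeras package instead.
from keras.wrappers.scikit_learn import KerasRegressor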
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# load dataset
dataframe = pandas.read_csv("Data/boston.csv", sep=',',index_col='Unnamed: 0')
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:13]
Y = dataset[:,13]

# define base model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
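# evaluate model (the inputs are used as loaded; no standardization step is applied here)
estimator = KerasRegressor(build_fn=baseline_model, epochs=100, batch_size=5, verbose=0)

# shuffle=True is required when random_state is set in recent scikit-learn versions
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)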
results = cross_val_score(estimator, X, Y, cv=kfold)
print("Results: %.2f (%.2f) MSE" % (results.mean(), results.std()))
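
The cell above imports `StandardScaler` and `Pipeline` but does not use them. As a minimal sketch of how standardization could be included inside the cross-validation, the example below wraps a scaler and a model in a `Pipeline`. It uses scikit-learn's `LinearRegression` as a stand-in for the Keras model and synthetic data in place of the Boston file, so both are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (an assumption for illustration)
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y = X @ np.array([2.0, 0.3, 0.01]) + rng.normal(scale=0.5, size=200)

# Putting the scaler inside the pipeline means it is re-fitted on the
# training part of every fold, so no information leaks from the held-out part.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),  # stand-in for the Keras regressor
])

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(pipeline, X, y, cv=kfold,
                         scoring="neg_mean_squared_error")

# scikit-learn reports negated MSE so that higher is better; flip the sign
mse = -scores
print("MSE: %.2f (%.2f)" % (mse.mean(), mse.std()))
```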
