# Imports and Globals

In [None]:
!pip install palmerpenguins

In [None]:
#imports
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn import metrics,datasets,neighbors
from palmerpenguins import load_penguins

### Load sample data and poke around to understand it

For this tutorial, we'll use a classic sample dataset called the Palmer Penguins. There will be more details below.

In [None]:
penguinsdf = load_penguins()

In [None]:
#drop rows with missing values; head() in the next cell previews the cleaned data
penguinsdf.dropna(inplace = True)

In [None]:
penguinsdf.head()

We will also work with the iris dataset again, this time loading it directly from sklearn.datasets, where it is stored in a slightly different format.

In [None]:
iris = datasets.load_iris()
type(iris)

What's an sklearn Bunch? Looking at the documentation for the sklearn iris dataset (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html), we see that a Bunch is a dictionary-like object: a Python dictionary whose keys can also be read as attributes. Let's try interacting with it as we would a dictionary and see what the keys are.
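
As a minimal sketch of what "dictionary-like" means here, the snippet below reads the same entry by key and by attribute. The name `iris_bunch` is only illustrative, and the snippet assumes a standard scikit-learn install.

```python
from sklearn import datasets

iris_bunch = datasets.load_iris()  # illustrative name, kept separate from `iris`

# A Bunch subclasses dict, so the usual dictionary operations work on it
print(isinstance(iris_bunch, dict))

# Keys are also exposed as attributes, so both lookups return the same array
same_array = iris_bunch['data'] is iris_bunch.data
print(same_array, iris_bunch.data.shape)
```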

In [None]:
iris.keys()

Great! We see a handful of keys (the exact set depends on your scikit-learn version). Let's take a look at the description.

In [None]:
#index the iris dictionary to see the description
print(iris['DESCR'])

This gives a nice description and background of the classic iris dataset. Now, let's poke around and see what the other things in the iris dictionary are.

In [None]:
#print some of the remaining values in the iris dictionary by indexing the dictionary with the keys
#dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
print(iris['feature_names'])
print(iris['target_names'])
print(iris['filename'])

Okay, it's clear these values of the iris dataset appear as we expected from the documentation. Let's store the data in a pandas dataframe to more easily manipulate it.

In [None]:
#create a pandas dataframe to store the data and a pandas series to store the targets or y labels
irisdf = pd.DataFrame(data=iris['data'], index=range(1,151,1), columns=iris['feature_names'])
ylabels = pd.Series(data=iris['target'], index=range(1,151,1), name="iris type")
irisdf['label'] = ylabels
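
As a possible shortcut, recent scikit-learn versions (0.23 and later, to my knowledge) can build the pandas objects for you through the `as_frame` argument. The sketch below is an alternative to the manual construction above, not a replacement for it, and `iris_frame` is an illustrative name.

```python
from sklearn import datasets

# as_frame=True asks scikit-learn to build the pandas objects itself
iris_frame = datasets.load_iris(as_frame=True).frame  # illustrative name

# Four feature columns plus a 'target' column holding the integer labels
print(iris_frame.shape)
print(iris_frame.columns.tolist())
```

Note that this frame is indexed from 0 and names the label column `target`, unlike the `irisdf` built above.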

In [None]:
#let's check that our dataframe looks as we expect
irisdf

In [None]:
#now let's check the series
ylabels

## Linear Regression

Let's look at some visualizations!

In [None]:
pairplot_figure = sns.pairplot(irisdf, hue='label',palette='colorblind')
pairplot_figure.fig.set_size_inches(9, 6.5)

Which correlation looks most linear? Let's plot the petal length against petal width. 

In [None]:
data_columns = ["petal length (cm)"]
target_column = "petal width (cm)"
_ = sns.scatterplot(data=irisdf, x=data_columns[0], y=target_column)
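
To put a number on which pair looks most linear, one option is the Pearson correlation between each pair of features. This is a small sketch that reloads the iris measurements so it runs on its own; `features`, `corr_matrix`, and `petal_corr` are illustrative names.

```python
import pandas as pd
from sklearn import datasets

iris_data = datasets.load_iris()
features = pd.DataFrame(iris_data['data'], columns=iris_data['feature_names'])

# Pearson correlation between every pair of feature columns
corr_matrix = features.corr()
print(corr_matrix.round(2))

# Correlation for the petal length / petal width pair
petal_corr = corr_matrix.loc['petal length (cm)', 'petal width (cm)']
print(round(petal_corr, 2))
```

Values close to 1 or -1 indicate a strong linear relationship; values near 0 indicate little linear relationship.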

Now let's try building a linear regression model that uses petal length to predict petal width. First we need to split the data into train and test sets using a 50/50 split.

In [None]:
def splitdata(data, testratio):
    #set seed so train and test will always split the same
    np.random.seed(42)
    shuffindices = np.random.permutation(len(data))
    testsize = int(len(data) * testratio)
    testindices = shuffindices[:testsize]
    trainindices = shuffindices[testsize:]
    return data.iloc[trainindices], data.iloc[testindices]
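
For comparison, scikit-learn ships a helper, `train_test_split`, that does the same kind of shuffled split as `splitdata`. The sketch below uses a small made-up frame (`toy_df`) so it runs on its own. It will not select the same rows as `splitdata`, because the shuffling is implemented differently.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Small stand-in frame so the sketch runs on its own
toy_df = pd.DataFrame({'x': np.arange(10), 'y': np.arange(10) * 2})

# test_size plays the role of testratio; random_state fixes the shuffle
toy_train, toy_test = train_test_split(toy_df, test_size=0.5, random_state=42)
print(len(toy_train), len(toy_test))
```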

In [None]:
#split the test/train set
iristrain, iristest = splitdata(irisdf, 0.5)
y_train = iristrain['petal width (cm)']
y_test = iristest['petal width (cm)']

In [None]:
#check that the training target has the expected number of rows
y_train

In [None]:
#fit the linear model
reg = LinearRegression().fit(iristrain.loc[:,["petal length (cm)"]], y_train)

R^2 is one of the most commonly used metrics to evaluate a linear regression fit: https://en.wikipedia.org/wiki/Coefficient_of_determination
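
To make the definition concrete, here is a minimal sketch that computes R^2 by hand as 1 - SS_res / SS_tot, the same definition that `LinearRegression.score` uses. The helper `r_squared` and the numbers fed to it are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up numbers, only to exercise the function
r2_example = r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
print(round(r2_example, 2))
```

A perfect prediction gives 1, and always predicting the mean of the targets gives 0.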

In [None]:
#how good is the fit based on the R^2 coefficient of determination
reg.score(iristrain.loc[:,["petal length (cm)"]], y_train)

Predict the held-out test data using the linear regression model fit on the training set.

In [None]:
predictions = reg.predict(iristest.loc[:,["petal length (cm)"]])
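
Under the hood, a fitted `LinearRegression` with one feature is just a slope (`coef_`) and an intercept (`intercept_`). The sketch below checks this on synthetic data that lies exactly on a line; `x_demo`, `y_demo`, and `demo_reg` are illustrative names, not part of the tutorial's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data on the exact line y = 2x + 1, used only for illustration
x_demo = np.arange(5, dtype=float).reshape(-1, 1)  # one feature column
y_demo = 2.0 * x_demo.ravel() + 1.0

demo_reg = LinearRegression().fit(x_demo, y_demo)

# predict() evaluates intercept_ + coef_ * x for each row
manual = demo_reg.intercept_ + demo_reg.coef_[0] * x_demo.ravel()
print(demo_reg.coef_[0], demo_reg.intercept_)
print(np.allclose(manual, demo_reg.predict(x_demo)))
```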

Now let's visualize how well the linear regression did in predicting the held-out test set.

In [None]:
ax = sns.scatterplot(x = predictions, y = y_test)
ax.set(xlabel='prediction', ylabel='target')
#R^2 on the held out test set
r2 = reg.score(iristest.loc[:,["petal length (cm)"]], y_test)
ax.text(.05, .8, 'R^2={:.2f}'.format(r2),
            transform=ax.transAxes)
x0, x1 = ax.get_xlim()
y0, y1 = ax.get_ylim()
lims = [max(x0, y0), min(x1, y1)]
ax.plot(lims, lims, '-r')
plt.show()

### Activity

Now try the same process we just went through on the Palmer Penguins dataset. Can you identify features that are linearly correlated? Can one feature help predict the target feature?

In [None]:
#plot a visualization of the palmer penguins dataset, identify linearly related features


In [None]:
#Scatter plot of the features you identified.


In [None]:
#split the test/train set, you can directly use the defined splitdata function


In [None]:
#fit the linear model


In [None]:
#how good is the fit based on the R^2 coefficient of determination


In [None]:
#predict the held out test set


In [None]:
#compute r2 and plot prediction against ground truth of test set


## PCA Analysis

PCA, or principal component analysis, is a commonly used dimensionality reduction algorithm: https://en.wikipedia.org/wiki/Principal_component_analysis
Each principal component is a linear combination of the original features.
PCA projects each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible.
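
The description above can be sketched with NumPy alone. This is one common way to compute PCA (centering followed by a singular value decomposition), shown on random toy data; it is meant to illustrate the idea, not to mirror scikit-learn's internals exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 points, 3 features, with the second feature tied to the first
toy = rng.normal(size=(100, 3))
toy[:, 1] = 2.0 * toy[:, 0] + 0.1 * toy[:, 1]

# 1. Center each feature at zero
centered = toy - toy.mean(axis=0)

# 2. The rows of vt are the principal directions, ordered by variance explained
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# 3. Share of the total variance captured by each component
variance_ratio = singular_values ** 2 / np.sum(singular_values ** 2)

# 4. Project onto the first two components to get 2-D data
projected = centered @ vt[:2].T
print(variance_ratio.round(3), projected.shape)
```

Each row of `vt` holds the weights of one linear combination of the original features, which is what "principal component" refers to.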

In [None]:
#take numerical columns out of df
numerics = ['float16', 'float32', 'float64']
numdf = irisdf.select_dtypes(include=numerics)
numdf

In [None]:
#or, for the iris dataset that's stored as an sklearn Bunch, use the data array directly
iris.data

In [None]:
#fit PCA model
pca = PCA(n_components=4)
pca.fit(numdf)
X = pca.transform(numdf)
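
One way to judge how many components are worth keeping is the fitted model's `explained_variance_ratio_` attribute. The sketch below refits PCA on the raw iris measurements so it runs on its own; `iris_features`, `pca_check`, and `ratios` are illustrative names.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris_features = datasets.load_iris().data  # the same four measurements as numdf

pca_check = PCA(n_components=4).fit(iris_features)

# Fraction of the total variance captured by each principal component
ratios = pca_check.explained_variance_ratio_
print(ratios.round(3))

# Running total: how much variance the first k components keep together
print(ratios.cumsum().round(3))
```

If the first one or two ratios already account for most of the variance, a 2-D scatter plot of PC1 against PC2 loses little information.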

In [None]:
#Construct PCA output into a dataframe
pca_df = pd.DataFrame(data=X,columns = ['PC1','PC2','PC3','PC4'])
pca_df['label'] = iris.target
pca_df.head()

In [None]:
#visualize PCA with label information
sns.scatterplot(x ="PC1",y="PC2",data=pca_df,hue='label',palette='colorblind')
plt.show()

### Activity

Try to replicate the PCA analysis on the Palmer Penguins dataset.

In [None]:
#take numerical columns out of df


In [None]:
#your code here


In [None]:
#Construct PCA output into a dataframe


Also try plotting different pairs of principal components (for example PC1 against PC3) and see how that changes the way the data is presented.