[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/asakura000/csc_448_Sakurai/blob/main/Atsuko_Assignment_3.ipynb)

# Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

import seaborn as sns
sns.set()
import matplotlib.pyplot as plt
%matplotlib inline

## Part 1. Loading the dataset

In [2]:
csv_url = 'https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv'

In [3]:
# Using pandas load the dataset (load remotely, not locally)
df = pd.read_csv(csv_url)

In [4]:
# Output the first 15 rows of the data
df.head(15)

Unnamed: 0,Temperature °C,Mols KCL,Size nm^3
0,469,647,624474.3
1,403,694,577961.0
2,302,975,619684.7
3,779,916,1460449.0
4,901,18,43257.26
5,545,637,712463.4
6,660,519,700696.0
7,143,869,271826.0
8,89,461,89198.03
9,294,776,477021.0


In [5]:
# Display a summary of the table information (number of datapoints, etc.)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        1000 non-null   int64  
 2   Size nm^3       1000 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 23.6 KB


In [6]:
df.shape

(1000, 3)

In [7]:
df.describe()

Unnamed: 0,Temperature °C,Mols KCL,Size nm^3
count,1000.0,1000.0,1000.0
mean,500.5,471.53,508611.1
std,288.819436,288.482872,447483.8
min,1.0,1.0,16.11429
25%,250.75,226.75,129826.7
50%,500.5,459.5,382718.2
75%,750.25,710.25,760321.1
max,1000.0,1000.0,1972127.0


In [8]:
# see how things are correlated:
# sns.pairplot(data=df)

## Part 2. Splitting the dataset

In [9]:
df.columns

Index(['Temperature °C', 'Mols KCL', 'Size nm^3'], dtype='object')

In [10]:
# Take the pandas dataset and split it into our features (X) and label (y)
my_features = ['Temperature °C', 'Mols KCL']
target = 'Size nm^3'
X = df[my_features].values
y = df[target].values

In [11]:
# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
print('Length of our Training data:', X_train.shape, '\nLength of our Testing data:', y_test.shape)

Length of our Training data: (900, 2) 
Length of our Testing data: (100,)


## Part 3. Perform a Linear Regression

In [12]:
# Use sklearn to train a model on the training set

# create an empty Linear Regression model 
model = LinearRegression()

# train the model 
model.fit(X_train, y_train)

LinearRegression()

In [13]:
# Create a sample datapoint and predict the output of that sample with the trained model

# Sample data
test = [[898, 75]]
my_prediction = model.predict(test)
print('Model predicted: ', my_prediction)

Model predicted:  [445860.12961101]


In [14]:
model.score(X_test, y_test)

0.8552472077276096

In [16]:
# Report on the score for that model, in your own words (markdown, not code) explain what the score means

### Interpretation of the score of 85.5%:
- The score function returns the coefficient of determination (R squared), here evaluated on the held-out test set.
- The R-squared value tells us what proportion of the variance in the target (y) is explained by the model's predictions from the features (X).
- In this case, we can say that approximately 86% of the variance in the size of the slime can be explained by a linear combination of temperature and moles of KCl.
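
As a rough illustration of what the score measures, R squared can be computed by hand as 1 - SS_res / SS_tot and compared with scikit-learn's result. The data below is synthetic and the names `X_demo`, `y_demo`, and `demo_model` are made up for this sketch; it does not use the slime dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (an assumption for illustration only, not the slime dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 1.0, size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)
y_hat = demo_model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# All three values should agree up to floating-point error
print(r2_manual, demo_model.score(X_demo, y_demo), r2_score(y_demo, y_hat))
```

A value near 1 means the residuals are small relative to the overall spread of y; a value near 0 means the model does little better than always predicting the mean.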

In [17]:
# y_pred = model.predict(X_test)
# df_preds = pd.DataFrame(y_pred, columns=['predictions'])
# df_preds['actual'] = y_test
# df_preds['abs_error'] = abs(df_preds['predictions'] - df_preds['actual'])
# df_preds

### Extract the Coefficients

In [18]:
# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX
coefficient_values = model.coef_
coefficient_values

array([ 866.14641337, 1032.69506649])

In [19]:
intercept_val = model.intercept_
intercept_val

-409391.47958340833

In [21]:
# cols = df[my_features]
# df_coeff = pd.DataFrame(model.coef_, cols.columns,
#              columns=['coefficients'])  
# df_coeff

### Equation for h(x):
###  $$ h(x) = -409391.5 + 866.15x_{1} + 1032.7x_{2} $$
### where $x_1$ = temperature (°C) and $x_2$ = mols of potassium chloride (KCl)
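
As a quick check, the equation can be evaluated by hand for the sample point used in Part 3 (temperature 898, 75 mols). The sketch below hard-codes the intercept and coefficients printed above; the function name `h` and the variable names are illustrative only.

```python
# Values copied from model.intercept_ and model.coef_ as printed above
intercept = -409391.47958340833
coef_temperature = 866.14641337
coef_mols = 1032.69506649

def h(temperature_c, mols_kcl):
    """Evaluate the fitted linear hypothesis h(x) by hand."""
    return intercept + coef_temperature * temperature_c + coef_mols * mols_kcl

# Same sample point as the model.predict call in Part 3
manual_prediction = h(898, 75)
print(manual_prediction)
```

The result should agree with the value returned by `model.predict([[898, 75]])` up to the rounding of the printed coefficients.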

## Part 4. Cross Validation

In [22]:
# Use the cross_val_score function to repeat your experiment across many shuffles of the data
# Report on their finding and their significance

In [23]:
from sklearn.model_selection import cross_val_score

In [24]:
scores = cross_val_score(model, X, y, cv=5)
scores

array([0.83918826, 0.87051239, 0.85871066, 0.87202623, 0.84364641])

In [25]:
# this is the average (and SD) of the scores array above
# (the scores are R^2 values, not classification accuracy)
print("%0.2f mean R^2 with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.86 mean R^2 with a standard deviation of 0.01


In [26]:
from sklearn.model_selection import ShuffleSplit

In [27]:
X.shape[0]

1000

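In [30]:
# Shuffle-and-split cross-validation: 5 random 90/10 train/test splits
cv = ShuffleSplit(n_splits=5, test_size=0.1, random_state=0)
shuffle_scores = cross_val_score(model, X, y, cv=cv)
shuffle_scores

array([0.87616468, 0.86951566, 0.83708494, 0.86963943, 0.84945355])

In [31]:
# summarize the ShuffleSplit scores (not the earlier K-fold `scores` array)
print("%0.2f mean R^2 with a standard deviation of %0.2f" % (shuffle_scores.mean(), shuffle_scores.std()))

0.86 mean R^2 with a standard deviation of 0.01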


### Interpret results from CV scores

1. For the first implementation: cross_val_score(model, X, y, cv=5)
    - With an integer cv and a regression model, scikit-learn uses a K-fold cross-validator by default. This means the data is split into k consecutive folds.
    - There is no shuffling.
    - cv=5 means that the model is fit 5 different times, each time holding out a different fold as the test set.
    - Therefore the output is an array of 5 R-squared scores, each corresponding to one split/fit iteration.
    - In this case, we got scores of 83.9%, 87.1%, 85.9%, 87.2%, and 84.4% respectively.
2. For the second implementation, where I added this line: cv = ShuffleSplit(n_splits=5, test_size=0.1, random_state=0)
    - Before each of the 5 iterations, the data is shuffled at random.
    - Only after the shuffling does the train/test split happen (90% train, 10% test), so the test sets of different iterations can overlap.
    - The individual scores differ from the K-fold scores, but their mean (0.86) and standard deviation (0.01) are the same to two decimal places. This suggests the model's performance does not depend much on how the data is split.
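
The difference between the two splitting strategies is easier to see on a tiny array. The sketch below uses a made-up 10-sample array (`X_small`) and prints only the test indices each splitter produces; it is an illustration, not part of the experiment above.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

# Ten dummy samples (hypothetical data, used only to show the indices)
X_small = np.arange(10).reshape(-1, 1)

# K-fold without shuffling: consecutive, non-overlapping test folds
kfold_tests = [test_idx.tolist()
               for _, test_idx in KFold(n_splits=5).split(X_small)]

# ShuffleSplit: each iteration draws a fresh random test set,
# so the same sample may appear in more than one test set
shuffle_splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
shuffle_tests = [test_idx.tolist()
                 for _, test_idx in shuffle_splitter.split(X_small)]

print("KFold test indices:       ", kfold_tests)
print("ShuffleSplit test indices:", shuffle_tests)
```

With K-fold every sample is tested exactly once, while ShuffleSplit gives no such guarantee but lets the test size and number of iterations be chosen independently.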
     

## Part 5. Using Polynomial Regression

In [132]:
from sklearn.preprocessing import PolynomialFeatures

In [133]:
# augment current dataset with polynomial features
poly_features = PolynomialFeatures(degree=2)

In [139]:
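# turn it into a new dataframe
# (get_feature_names_out assumes scikit-learn >= 1.0; older releases call it get_feature_names)
df_poly = pd.DataFrame(
    data = poly_features.fit_transform(X),
    columns = poly_features.get_feature_names_out(my_features)
)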

In [143]:
#df_poly.head()

In [141]:
# append the original column for the thing we are trying to predict (size of the slime)
df_poly['Size nm^3'] = df['Size nm^3']

In [142]:
# take a look at the new dataframe with polynomial features added:
df_poly.head()

Unnamed: 0,1,Temperature °C,Mols KCL,Temperature °C^2,Temperature °C Mols KCL,Mols KCL^2,Size nm^3
0,1.0,469.0,647.0,219961.0,303443.0,418609.0,6.244743e+05
1,1.0,403.0,694.0,162409.0,279682.0,481636.0,5.779610e+05
2,1.0,302.0,975.0,91204.0,294450.0,950625.0,6.196847e+05
3,1.0,779.0,916.0,606841.0,713564.0,839056.0,1.460449e+06
4,1.0,901.0,18.0,811801.0,16218.0,324.0,4.325726e+04
...,...,...,...,...,...,...,...
995,1.0,894.0,847.0,799236.0,757218.0,717409.0,1.545661e+06
996,1.0,327.0,982.0,106929.0,321114.0,964324.0,6.737041e+05
997,1.0,791.0,213.0,625681.0,168483.0,45369.0,3.477543e+05
998,1.0,769.0,553.0,591361.0,425257.0,305809.0,8.684794e+05


In [144]:
df_poly.columns

Index(['1', 'Temperature °C', 'Mols KCL', 'Temperature °C^2',
       'Temperature °C Mols KCL', 'Mols KCL^2', 'Size nm^3'],
      dtype='object')

In [154]:
my_poly_feat = ['1', 'Temperature °C', 'Mols KCL', 'Temperature °C^2',
       'Temperature °C Mols KCL', 'Mols KCL^2']
target_poly = 'Size nm^3'

In [155]:
# re-define my X and y
X = df_poly[my_poly_feat].values
y = df_poly[target_poly].values

In [156]:
# train, test, and split again
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
print('Length of our Training data:', X_train.shape, '\nLength of our Testing data:', y_test.shape)

Length of our Training data: (900, 6) 
Length of our Testing data: (100,)


In [157]:
# create an empty Linear Regression model 
model_p = LinearRegression()

# train the model 
model_p.fit(X_train, y_train)

LinearRegression()

In [158]:
model_p.score(X_test, y_test)

1.0

In [159]:
model_p.coef_

array([ 0.00000000e+00,  1.20000000e+01, -1.27196461e-07,  1.26477356e-11,
        2.00000000e+00,  2.85714287e-02])

In [160]:
model_p.intercept_

2.0477862562984228e-05

### What polynomial features do:
The original features $$ [x_1, x_2] $$ have been transformed to: $$ [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2] $$
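
A minimal sketch of this transformation on a single made-up sample (x1 = 2, x2 = 3), using the same degree-2 setting as above:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical sample with x1 = 2 and x2 = 3
sample = np.array([[2.0, 3.0]])

# Degree-2 expansion: [1, x1, x2, x1^2, x1*x2, x2^2]
expanded = PolynomialFeatures(degree=2).fit_transform(sample)
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]
```

The leading 1 is a bias column. Because LinearRegression fits its own intercept, that column ends up with a coefficient of 0.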

### Polynomial Equation for h(x):
###  $$ h_p(x) = 2.05 \times 10^{-5} + 12x_1 - 1.27 \times 10^{-7}x_2 + 1.26 \times 10^{-11}x_1^2 + 2.0x_1 x_2 + 2.86 \times 10^{-2}x_2^2 $$
### - where $x_1$ = temperature (°C) and $x_2$ = mols of potassium chloride (KCl)

### - the new R squared of 1.0 means that, on the test set, the degree-2 polynomial model explains essentially 100% of the variance in slime size from temperature and mols of KCl
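
One way to sanity-check a perfect score is to drop the coefficients that are numerically close to zero and see whether the remaining terms alone reproduce the data. The sketch below assumes the dominant terms are 12·x1 + 2·x1·x2 + 0.0285714287·x2² (read off `model_p.coef_`) and compares them with three rows printed by `df.head` in Part 1. The function name is made up, and this is an observation about the fitted model, not a confirmed description of how the dataset was generated.

```python
import math

def size_from_dominant_terms(temperature_c, mols_kcl):
    """Evaluate only the non-negligible terms of the fitted polynomial."""
    return (12.0 * temperature_c
            + 2.0 * temperature_c * mols_kcl
            + 0.0285714287 * mols_kcl ** 2)

# (temperature, mols, size) rows copied from the df.head output in Part 1
printed_rows = [(469, 647, 624474.3), (901, 18, 43257.26), (89, 461, 89198.03)]

estimates = [size_from_dominant_terms(t, m) for t, m, _ in printed_rows]
for (t, m, size), estimate in zip(printed_rows, estimates):
    print(t, m, size, round(estimate, 2))
```

If the estimates match the printed sizes to within display rounding, the near-zero terms in h_p(x) can be read as numerical noise rather than real effects.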