<a href="https://colab.research.google.com/github/profmcnich/example_notebook/blob/main/a3_sample_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [8]:
# Imports section
import pandas as pd # needed for dataframe
from sklearn.model_selection import train_test_split #used to split dataset
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures

## Part 1. Loading the dataset

In [2]:
# Using pandas load the dataset (load remotely, not locally)
# Output the first 15 rows of the data
# Display a summary of the table information (number of datapoints, etc.)

# load the data into a dataframe
dataframe = pd.read_csv("https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv")

# this will give us the first 15 rows
dataframe.head(15)

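| | Temperature °C | Mols KCL | Size nm^3 |
|---|---|---|---|
| 0 | 469 | 647 | 624474.3 |
| 1 | 403 | 694 | 577961.0 |
| 2 | 302 | 975 | 619684.7 |
| 3 | 779 | 916 | 1460449.0 |
| 4 | 901 | 18 | 43257.26 |
| 5 | 545 | 637 | 712463.4 |
| 6 | 660 | 519 | 700696.0 |
| 7 | 143 | 869 | 271826.0 |
| 8 | 89 | 461 | 89198.03 |
| 9 | 294 | 776 | 477021.0 |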


In [4]:
# Get the summary of the table information
dataframe.info()
dataframe.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        1000 non-null   int64  
 2   Size nm^3       1000 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 23.6 KB


| | Temperature °C | Mols KCL | Size nm^3 |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 500.5 | 471.53 | 508611.1 |
| std | 288.819436 | 288.482872 | 447483.8 |
| min | 1.0 | 1.0 | 16.11429 |
| 25% | 250.75 | 226.75 | 129826.7 |
| 50% | 500.5 | 459.5 | 382718.2 |
| 75% | 750.25 | 710.25 | 760321.1 |
| max | 1000.0 | 1000.0 | 1972127.0 |


## Part 2. Splitting the dataset

In [6]:
# Take the pandas dataset and split it into our features (X) and label (y)
# get the first two feature columns and store them in X, get the size column and store it in y
X = dataframe[["Temperature °C", "Mols KCL"]]
y = dataframe["Size nm^3"]

# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
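
As a quick sanity check on the 90/10 split, the sketch below applies the same `train_test_split` call to a stand-in dataset of 1000 rows. The arrays `X_demo` and `y_demo` are hypothetical placeholders, not the real data; only the row counts matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 1000-row dataset: two feature columns, one label.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 1001, size=(1000, 2))
y_demo = rng.random(1000)

# Same arguments as the split above: 10% of the rows are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)

print(X_tr.shape, X_te.shape)  # (900, 2) (100, 2)
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell yields the same train and test rows.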

## Part 3. Perform a Linear Regression

In [18]:
# Use sklearn to train a model on the training set
linear_model = LinearRegression().fit(X_train, y_train)

# Create a sample datapoint and predict the output of that sample with the trained model
x_sample_point = np.array([[200, 420]])
y_sample_pred = linear_model.predict(x_sample_point)[0]
print(f"The predicted result of the sample datapoint is: {y_sample_pred:.5f}")

# Report on the score for that model, in your own words (markdown, not code) explain what the score means
train_score = linear_model.score(X_train, y_train)
test_score = linear_model.score(X_test, y_test)
print(f"The (train, test) score for our linear model is ({train_score:.5f}, {test_score:.5f}).")

The predicted result of the sample datapoint is: 197569.73102
The (train, test) score for our linear model is (0.86101, 0.85525).




- The train score is the coefficient of determination ($R^2$) between the model's predictions and the actual labels in `y_train`. At 0.86101, the model explains about 86% of the variance in the training labels, which is a reasonably good fit.
- The test score estimates the model's performance on data it has not seen. At 0.85525 it is very close to the train score, so the model generalizes well and shows no sign of overfitting.
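
To make the meaning of the score concrete, here is a minimal sketch of how $R^2$ is computed: one minus the ratio of the residual sum of squares to the total sum of squares. The helper `r2_score_manual` and the toy values are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the fit
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_toy = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_manual(y_toy, y_toy)        # predictions match exactly
baseline = r2_score_manual(y_toy, [2.5] * 4)   # always predict the mean (2.5)

print(perfect)   # 1.0
print(baseline)  # 0.0
```

A score of 0.86 therefore sits much closer to a perfect fit than to the predict-the-mean baseline.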

In [20]:
# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX

# get the weights and bias; np.round returns a copy, so the fitted model's coef_ is left unmodified
w = np.round(linear_model.coef_, 5)
b = round(linear_model.intercept_, 5)
print(f"The coefficients are {w}, the intercept is {b}")

The coefficients are [ 866.14641 1032.69507], the intercept is -409391.47958


From these coefficients and the intercept, we can write our equation as follows:
$y = 866.14641x_1 + 1032.69507x_2 - 409391.47958$
<br>
where $x_1$ is the first feature (temperature in °C) and $x_2$ is the second feature (Mols KCL).
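
As a check on the equation, the sketch below evaluates it by hand at the sample point (200, 420) used in Part 3. The function name `h` is an illustrative choice; the constants are the rounded values printed above, so the result matches the model's prediction of 197569.73102 only up to rounding.

```python
# Rounded coefficients and intercept copied from the output above.
w1, w2, b = 866.14641, 1032.69507, -409391.47958

def h(x1, x2):
    """Linear hypothesis: weighted sum of the two features plus the intercept."""
    return w1 * x1 + w2 * x2 + b

print(f"{h(200, 420):.2f}")  # 197569.73
```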

## Part 4. Use Cross Validation

In [24]:
# Use the cross_val_score function to repeat your experiment across many shuffles of the data
# Report on their finding and their significance

k_folds_5 = cross_val_score(linear_model, X_train, y_train, cv = 5)
avg = k_folds_5.mean()
print(k_folds_5)
print(f"The average score is: {avg:.5f}")

[0.86226163 0.81982226 0.88938198 0.86663176 0.85729958]
The average score is: 0.85908


- We chose 5 folds. Each fold produced a slightly different score, ranging from 0.81982 to 0.88938. The second fold scored lowest, which most likely reflects ordinary variation in which samples land in the validation fold rather than a problem with the model. The mean of 0.85908 is close to the train and test scores from Part 3.
- Cross-validation matters because it estimates how the model performs on data it was not trained on. It splits the training data into 5 folds, trains on 4 of them, and validates on the held-out fold, rotating until every fold has served as the validation set once. Averaging over the folds gives a more reliable estimate than a single train/validation split.
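
The fold rotation described above can be sketched by hand with `KFold`. The data here is synthetic (`X_demo`, `y_demo` are hypothetical stand-ins), and the point is only that the manual loop reproduces what `cross_val_score` does with `cv=5` for a regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical noisy linear data standing in for the training set.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 1, size=200)

# Manual rotation: train on 4 folds, score (R^2) on the held-out fold.
manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

# For a regressor, an integer cv uses the same unshuffled KFold splitting.
library_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)

print(np.allclose(manual_scores, library_scores))  # True
```

Note that the folds are contiguous blocks rather than random draws; in this notebook the rows were already shuffled by `train_test_split`, so the blocks are effectively random samples.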

## Part 5. Using Polynomial Regression

In [25]:
# Using the PolynomialFeatures library perform another regression on an augmented dataset of degree 2
poly_f = PolynomialFeatures(degree=2)
X_poly_train = poly_f.fit_transform(X_train)
# only transform the test set: the transformer is fitted on the training data alone
X_poly_test = poly_f.transform(X_test)
poly_model = LinearRegression().fit(X_poly_train, y_train)

# Report on the metrics and output the resultant equation as you did in Part 3.
train_score = poly_model.score(X_poly_train, y_train)
test_score = poly_model.score(X_poly_test, y_test)
print(f"The (train, test) score for our polynomial model is ({train_score:.5f}, {test_score:.5f}).")

The (train, test) score for our polynomial model is (1.00000, 1.00000).


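- Both the training and test scores round to 1.00000, so the degree-2 model explains essentially all of the variance in the labels. A fit this exact suggests that the size is a noise-free quadratic function of the two features, which the linear model in Part 3 could only approximate.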
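
The sketch below illustrates why a degree-2 fit can reach a score of 1 when a plain linear fit cannot. It assumes a made-up, noise-free quadratic target (the coefficients 5.0, 3.0, and 0.5 are arbitrary, not taken from the dataset) and chains the feature expansion and the regression with `make_pipeline`, so the expansion is fitted on the training rows only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noise-free quadratic target with arbitrary coefficients.
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 100, size=(300, 2))
y_demo = (5.0 * X_demo[:, 0]
          + 3.0 * X_demo[:, 0] * X_demo[:, 1]
          + 0.5 * X_demo[:, 1] ** 2)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)

# Plain linear fit versus degree-2 expansion followed by a linear fit.
linear_fit = LinearRegression().fit(X_tr, y_tr)
quadratic_fit = make_pipeline(
    PolynomialFeatures(degree=2), LinearRegression()
).fit(X_tr, y_tr)

linear_r2 = linear_fit.score(X_te, y_te)
quadratic_r2 = quadratic_fit.score(X_te, y_te)
print(f"linear R^2:    {linear_r2:.5f}")
print(f"quadratic R^2: {quadratic_r2:.5f}")
```

The pipeline also removes the need to call `transform` on the test set by hand, which avoids accidentally refitting the transformer on test data.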

In [27]:
# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX

# get the weights and bias (np.round replaces np.round_, which was removed in NumPy 2.0)
w = np.round(poly_model.coef_, decimals=5)
b = round(poly_model.intercept_, 5)
print(f"The coefficients are {w}, the intercept is {b}")

The coefficients are [ 0.      12.      -0.       0.       2.       0.02857], the intercept is 2e-05


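Matching the coefficients to the degree-2 feature order $[1, x_1, x_2, x_1^2, x_1x_2, x_2^2]$ and dropping the terms whose coefficients round to zero (including the negligible intercept of 0.00002), we can write our equation as:
$y = 12x_1 + 2x_1x_2 + 0.02857x_2^2$
<br>
where $x_1$ is the temperature and $x_2$ is Mols KCL.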