# **Assignment 1. Concrete Strength Regression**

This assignment requires you to fit a multiple-variable linear model to a civil engineering dataset. In doing this assignment, you will learn to:

* Load data from a `csv` file using the `pandas` package
* Fit a multiple variable model using the `sklearn` package
* Evaluate the fit.

### **Step 1: load the packages you will need.**

In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import linear_model
import pandas as pd

### **Step 2: Download Data**

Concrete is one of the most basic construction materials.  In this exercise, you will download a simple dataset for predicting the strength of concrete from its attributes.  The dataset comes from this
[Kaggle dataset page](https://www.kaggle.com/maajdl/yeh-concret-data).  Kaggle has many excellent datasets for your project.  
You can download the data with the following command.  After running this command, you should have the file `data.csv` in your local folder.

In [None]:
import urllib.request
import os

fn_src = 'https://raw.githubusercontent.com/sdrangan/introml/master/unit03_mult_lin_reg/Concrete_Data_Yeh.csv'
fn_dst = 'data.csv'

if os.path.isfile(fn_dst):
    print('File %s is already downloaded' % fn_dst)
else:
    urllib.request.urlretrieve(fn_src, fn_dst)
    print('File %s downloaded' % fn_dst)


The `pandas` package has excellent methods for loading `csv` files.  The following command loads the `csv` file into a dataframe `df`.

In [None]:
df = pd.read_csv('data.csv')

Use `df.head()` to print the first few rows of the dataframe.  The cell below also prints the total number of rows.

In [None]:
print(len(df))
df.head(20)

### **Step 3: Exploring the data.**

In [None]:
df.describe()
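
Beyond `df.describe()`, it can help to check the column types and look for missing values before fitting. The sketch below runs on a small made-up dataframe (`df_demo`, with hypothetical column names and values), not on the concrete data. The same calls should apply to `df`.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only; not taken from the concrete dataset
df_demo = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10, 20, 30, 40],
})

# Data type of each column
print(df_demo.dtypes)

# Number of missing entries in each column
n_missing = df_demo.isna().sum()
print(n_missing)
```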

**Step 3: Create the list of attribute names**

In this exercise, the target variable will be the concrete strength in Megapascals, `csMPa`.  We will use the other 8 attributes as predictors to predict the strength.  

Create a list called `xnames` of the 8 names of the predictors.  You can do this as follows:
* Get the list of names of the columns from `df.columns.tolist()`.  
* Remove the last item from the list using indexing.

Print `xnames`.

In [None]:
names = df.columns.tolist()
print(names)
# Drop the last column (the target `csMPa`) to keep the 8 predictors
xnames = names[:-1]
print(xnames)
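
Indexing by position relies on the target being the last column. As an alternative sketch, the predictors can be selected by excluding the target name. The dataframe `df_demo` below is a made-up stand-in: only the name `csMPa` comes from this assignment, and the other column names and all values are hypothetical.

```python
import pandas as pd

# Made-up stand-in for the real dataframe; only `csMPa` is a real column name
df_demo = pd.DataFrame({
    "pred_a": [300.0, 250.0, 200.0],
    "pred_b": [180.0, 190.0, 170.0],
    "pred_c": [7, 28, 90],
    "csMPa": [20.0, 35.0, 40.0],
})

target = "csMPa"

# Keep every column name except the target, preserving column order
xnames_demo = [c for c in df_demo.columns if c != target]
print(xnames_demo)
```

This gives the same list as positional indexing whenever the target is the last column, and still works if the column order changes.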

**Step 4. Get the data matrix `X` and target vector `y` from the dataframe `df`.**  

Recall that to get the items from a dataframe, you can use syntax such as

    X = df.iloc[:,:-1]  

which gets the data of all columns except the last and puts it into `X`.  Similarly, the syntax

    y = df.iloc[:,-1]  

gets the data of the last column `csMPa` and puts it into `y`.


In [None]:
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

#print(X.shape)
#print(y.shape)

**Step 5. Split the Data into Training and Test**

Split the data into training and test sets.  Use 30% for test and 70% for training.
You can do the split manually with NumPy array indexing, as in the demo, or use the `train_test_split` function from the `sklearn` package.
Store the training data in `Xtr,ytr` and the test data in `Xts,yts`.


In [None]:
from sklearn.model_selection import train_test_split

# TODO
# Xtr, Xts, ytr, yts = train_test_split(...)

Xtr, Xts, ytr, yts = train_test_split(X, y, train_size=0.7, random_state=None, stratify=None)
print(Xtr.shape, ytr.shape, sep=" ")
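
The manual split mentioned above can be sketched with a random permutation of the row indices. This is a minimal illustration on synthetic arrays (`X_demo`, `y_demo`, and the fixed seed are assumptions made for the example), not the demo's exact code.

```python
import numpy as np

# Synthetic data for illustration: 10 samples, 3 predictors
rng = np.random.default_rng(0)
n = 10
X_demo = rng.normal(size=(n, 3))
y_demo = rng.normal(size=n)

# Shuffle the row indices, then take the first 70% for training
perm = rng.permutation(n)
ntr = int(round(0.7 * n))
Itr, Its = perm[:ntr], perm[ntr:]

Xtr_demo, ytr_demo = X_demo[Itr], y_demo[Itr]
Xts_demo, yts_demo = X_demo[Its], y_demo[Its]
print(Xtr_demo.shape, Xts_demo.shape)
```

With a pandas dataframe, the same index arrays could be used through `df.iloc[Itr]` and `df.iloc[Its]`.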

**Step 6. Fit a Linear Model**

Create a linear regression model object `reg` and fit the model on the training data.


In [None]:
# TODO
# reg = ...
# reg.fit(...)

reg = linear_model.LinearRegression()
reg.fit(Xtr, ytr)
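
`LinearRegression` fits an ordinary least-squares model with an intercept. As a rough sketch of that computation (not of sklearn's internals), the same kind of fit can be obtained with `np.linalg.lstsq` after adding a column of ones. The data, coefficients, and names below are synthetic and hypothetical.

```python
import numpy as np

# Synthetic, noiseless data generated from known (hypothetical) coefficients
rng = np.random.default_rng(0)
n, p = 50, 3
X_demo = rng.normal(size=(n, p))
slopes_true = np.array([2.0, -1.0, 0.5])
intercept_true = 3.0
y_demo = intercept_true + X_demo @ slopes_true

# Prepend a column of ones so the first coefficient acts as the intercept
A = np.column_stack([np.ones(n), X_demo])

# Least-squares solution of A @ beta = y
beta_hat, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
print(beta_hat)
```

Because the synthetic data has no noise, `beta_hat` should match the intercept and slopes used to generate it up to rounding. For a fitted sklearn model, the corresponding values are stored in `reg.intercept_` and `reg.coef_`.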

**Step 7. Compute the predicted values `yhat_tr` on the training data and print the `R^2` value on the training data.**

In [None]:
# TODO
# yhat_tr = ...
# rsq_tr = ...

# Predict on the training data
yhat_tr = reg.predict(Xtr)

# R^2 = 1 - (mean squared error) / (variance of the target)
nmse_tr = np.mean((yhat_tr - ytr)**2) / (np.std(ytr)**2)
rsq_tr = 1 - nmse_tr

print("R^2 = {0:f}".format(rsq_tr))

**Step 8. Compute the predicted values `yhat_val` on the validation (test) data `Xts` and print the `R^2` value on that data.**

In [None]:
# TODO
# yhat_val = ...
# rsq_val = ...

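# Predict on the held-out test data
yhat_val = reg.predict(Xts)

# R^2 = 1 - (mean squared error) / (variance of the target)
nmse_val = np.mean((yhat_val - yts)**2) / (np.std(yts)**2)
rsq_val = 1 - nmse_val

print("R^2 = {0:f}".format(rsq_val))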
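
The `R^2` value used in Steps 7 and 8 is one minus the mean squared error divided by the variance of the target, which equals one minus the ratio of the residual sum of squares to the total sum of squares. The helper `r_squared` below is a hypothetical name introduced for illustration, checked on made-up numbers.

```python
import numpy as np

def r_squared(y, yhat):
    """Return R^2 = 1 - RSS/TSS for true values y and predictions yhat."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    rss = np.sum((y - yhat)**2)         # residual sum of squares
    tss = np.sum((y - np.mean(y))**2)   # total sum of squares
    return 1.0 - rss / tss

# Made-up values for illustration
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
yhat_demo = np.array([1.5, 1.5, 3.5, 3.5])

r2_sums = r_squared(y_demo, yhat_demo)

# Same quantity written as 1 - MSE / variance
r2_means = 1 - np.mean((yhat_demo - y_demo)**2) / (np.std(y_demo)**2)
print(r2_sums, r2_means)
```

A perfect prediction gives `R^2 = 1`, and always predicting the mean of `y` gives `R^2 = 0`. For a fitted sklearn regressor, `reg.score(X, y)` reports the same statistic.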

**Step 9. Create a scatter plot of the actual vs. predicted values of `y` on the validation data.**

In [None]:
# TODO

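# Scatter plot of actual vs. predicted strength on the validation data
plt.scatter(yts, yhat_val)
# Reference line y = x: points on this line are predicted exactly
plt.plot([0, 100], [0, 100], color='red')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.grid()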