# Module 7 - Exercise - Linear Regression with `Python`

This exercise is a little different: instead of guiding you question by question, we are going to let you construct a linear model in your choice of either `R` or `Python`, whichever you prefer.

The prediction problem is to predict `height` from the `'/dsa/data/all_datasets/stature-hand-foot/stature-hand-foot.csv'` dataset. You can use any variable (or combination of variables) in order to predict `height`.

You will not be graded on the performance of the model itself, but please approach this as an actual prediction problem. That means you should split the data into training and testing sets, train your model on the training set, and assess its performance on the testing set. Be sure to predict some output with your testing inputs.

The purpose of this assignment is to demonstrate your ability to use regression to develop a machine learning model. Feel free to include anything that demonstrates your understanding of model development and model refinement, including data exploration and a written description of your reasoning.

As always, feel free to ask questions along the way if you get stuck at any point. We are more than happy to help!

To add execution cells, click in this cell.
Then, in the notebook menu: `Insert > Insert Cell Below`

In [98]:
#Importing libraries
import pandas as pd
import numpy as np
from sklearn import linear_model
from scipy.stats import pearsonr,spearmanr

In [87]:
#Data acquisition
df = pd.read_csv('/dsa/data/all_datasets/stature-hand-foot/stature-hand-foot.csv') #read_csv accepts the path directly
print(df.head())

   gender  height  hand length   foot length
0       1  1760.2        208.6         269.6
1       1  1730.1        207.6         251.3
2       1  1659.6        173.2         193.6
3       1  1751.3        258.0         223.8
4       1  1780.6        212.3         282.1
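
Before modeling, a quick exploration step can show which inputs track `height` most closely and whether any rows look suspicious (in the preview above, row 3 has a hand length larger than its foot length, which may deserve a second look). The sketch below is only an illustration: it runs on a synthetic stand-in frame, and the column names `hand_length` and `foot_length`, the coefficients, and the noise levels are assumptions, not values taken from the real file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real CSV (ASSUMPTION: same four kinds of columns,
# made-up coefficients and noise; not the actual stature-hand-foot data).
rng = np.random.default_rng(0)
n = 200
gender = rng.integers(1, 3, size=n)              # coded 1 or 2
foot = rng.normal(250, 15, size=n)
hand = 0.7 * foot + rng.normal(15, 8, size=n)
height = 3.5 * foot + 2.0 * hand + rng.normal(400, 30, size=n)
demo = pd.DataFrame({"gender": gender, "height": height,
                     "hand_length": hand, "foot_length": foot})

# Summary statistics show scale and spread, and help spot implausible values.
summary = demo.describe()

# Pearson correlation of each input with the target, strongest first.
corr_with_height = (demo.corr()["height"]
                    .drop("height")
                    .sort_values(ascending=False))

print(summary.loc[["mean", "std", "min", "max"]])
print(corr_with_height)
```

On the real data, the same two calls (`describe()` and `corr()`) could be applied to `df` directly.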


In [96]:
#Data preprocessing
gender = pd.get_dummies(df['gender']) #Converting gender into dummy (indicator) columns
new_df = pd.concat([gender,df],axis=1) #Dummy columns come first, followed by the original columns
train = new_df.sample(frac=0.6,random_state=1) #60:40 training:testing split ratio
test = new_df.drop(train.index)
train_X = np.asarray(train.iloc[:,[0,1,4,5]]) #Inputs: gender dummies (0,1), hand length and foot length (4,5)
train_y = np.asarray(train[['height']])
test_X = np.asarray(test.iloc[:,[0,1,4,5]])
test_y = np.asarray(test[['height']])
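
The preprocessing above selects inputs by position and keeps every gender dummy column. One alternative, sketched here under assumptions, is to select columns by name, drop one redundant dummy with `drop_first=True`, and split with scikit-learn's `train_test_split`. The frame `demo` is synthetic, and its column names and generating coefficients are hypothetical stand-ins for the real file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (ASSUMPTION: hypothetical column names and values).
rng = np.random.default_rng(1)
n = 150
demo = pd.DataFrame({
    "gender": rng.integers(1, 3, size=n),        # coded 1 or 2
    "hand_length": rng.normal(190, 12, size=n),
    "foot_length": rng.normal(250, 15, size=n),
})
demo["height"] = (3.0 * demo["foot_length"] + 2.5 * demo["hand_length"]
                  + rng.normal(450, 30, size=n))

# drop_first=True keeps a single indicator column; the dropped level
# becomes the baseline absorbed by the intercept.
features = pd.get_dummies(demo[["gender", "hand_length", "foot_length"]],
                          columns=["gender"], drop_first=True, dtype=float)
target = demo["height"]

# 60:40 split, reproducible through random_state.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.4, random_state=1)

print(features.columns.tolist())
print(X_train.shape, X_test.shape)
```

Selecting by name keeps the feature list readable and avoids silent breakage if the column order changes; whether dropping a dummy matters for prediction here is a modeling choice, since ordinary least squares in scikit-learn will still fit with both columns present.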

In [103]:
#Model fitting and prediction
regr = linear_model.LinearRegression()
regr.fit(train_X,train_y)
print('R-Squared: {}'.format(regr.score(train_X, train_y))) #Training set
print('R-Squared: {}'.format(regr.score(test_X, test_y))) #Testing set
train_pred = regr.predict(train_X)
test_pred = regr.predict(test_X)
print(pearsonr(train_y,train_pred)) #Training set: (correlation, p-value)
print(pearsonr(test_y,test_pred)) #Testing set: (correlation, p-value)
#Both the Pearson correlation and R-squared are higher on the testing set than on the training set.

R-Squared: 0.8650774035876797
R-Squared: 0.8865903282766516
(array([0.93009537]), array([2.34869598e-41]))
(array([0.94860038]), array([1.13155723e-31]))
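
A testing score slightly above the training score, as seen here, can happen with a single random split, so it may be worth checking how much the score varies across splits and reporting an error in the units of `height`. The following is a minimal sketch, assuming synthetic data in place of the real file; the features, coefficients, and noise level are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (ASSUMPTION: two made-up length features and a
# made-up linear relationship; not the real stature-hand-foot values).
rng = np.random.default_rng(2)
n = 150
X = np.column_stack([rng.normal(190, 12, size=n),    # hypothetical hand length
                     rng.normal(250, 15, size=n)])   # hypothetical foot length
y = 2.5 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(450, 30, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)
model = LinearRegression().fit(X_train, y_train)

# Root mean squared error on the held-out set, in the same units as y.
residuals = y_test - model.predict(X_test)
rmse = float(np.sqrt(np.mean(residuals ** 2)))

# 5-fold cross-validated R-squared on the training portion shows how much
# the score moves from one split to another.
cv_r2 = cross_val_score(LinearRegression(), X_train, y_train,
                        cv=5, scoring="r2")

print(f"Test RMSE: {rmse:.1f}")
print(f"CV R-squared: mean {cv_r2.mean():.3f}, std {cv_r2.std():.3f}")
```

If the spread of the cross-validated scores is comparable to the gap between the training and testing R-squared values, that gap is probably not meaningful on its own.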


# Save your notebook, then `File > Close and Halt`