# Exercise questions on Linear Regression

## Q1

Imagine you are modelling the relationship between consumer spending, income, and the price of goods. You suspect the relationship is not linear: as an economist, you expect spending to be directly proportional to the log of income and inversely proportional to price.

Fit a model to the data in 'consumers.csv' and find the relevant coefficients.
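
One way to write the hypothesised relationship is shown below. The intercept $\beta_0$ and the additive error term $\varepsilon$ are assumptions added here for illustration; the additive form matches the two transformed features (log of income and reciprocal of price) that the notebook builds before fitting.

$$
\text{Spending} = \beta_0 + \beta_1 \log(\text{Income}) + \beta_2 \, \frac{1}{\text{Price}} + \varepsilon
$$

Written this way, the model is nonlinear in the raw variables but linear in the coefficients $\beta_0, \beta_1, \beta_2$, which is why ordinary linear regression can still be used after transforming the inputs.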

Import required libraries:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score

Define a helper that prints the evaluation metrics:

In [3]:
def metrics(y_test, predictions):
    """Print MAE, RMSE and R2 for predictions against the held-out targets."""
    mae = mean_absolute_error(y_test, predictions)
    rmse = mean_squared_error(y_test, predictions) ** 0.5  # square root of the MSE
    r2 = r2_score(y_test, predictions)

    print(f'\nThe Mean Absolute Error on the test data is: {mae:.3f}\n')
    print(f'\nThe RMSE on the test data is: {rmse:.3f}\n')
    print(f'\nThe R2 value is: {r2:.3f}\n')
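
For reference, the three quantities printed by this helper are the standard definitions below, where $y_i$ are the test targets, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the test targets, and $n$ the number of test rows (the symbols are notation introduced here, not taken from the notebook):

$$
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert,
\qquad
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

MAE and RMSE are in the units of the target, so they are only comparable between targets measured on the same scale; $R^2$ is unitless.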

Load the data:

In [12]:
df=pd.read_csv('consumers.csv')
df.head()

,Income,Price,Spending
0,4370.86107,4.111489,273.088771
1,9556.428758,64.004631,280.607753
2,7587.945476,32.121242,259.492918
3,6387.926358,51.348498,210.312997
4,2404.167764,90.849081,164.197755


Are there any missing values?

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Income    100 non-null    float64
 1   Price     100 non-null    float64
 2   Spending  100 non-null    float64
dtypes: float64(3)
memory usage: 2.5 KB
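
The `df.info()` output above reports 100 non-null entries for each of the three columns over 100 rows, so nothing is missing. An explicit per-column count can be easier to read on wider tables. The sketch below uses a tiny hypothetical frame (`toy`, not the consumers data) with one value deliberately set to NaN:

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame (not the consumers data) with one deliberately missing value
toy = pd.DataFrame({
    'Income': [4000.0, np.nan, 7500.0],
    'Price': [4.0, 64.0, 32.0],
})

# Number of NaN entries in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
any_missing = bool(toy.isna().any().any())
print(f'Any missing values: {any_missing}')
```

Applied to the notebook's `df`, the same `isna().sum()` call would be expected to report zero for every column, given the `info()` output.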


Fit a suitable model, and check its performance:

In [15]:
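# Transform the features as the hypothesis suggests: log(Income) and 1/Price
x1 = np.log(df['Income'].values).reshape(-1, 1)
x2 = (1 / df['Price']).values.reshape(-1, 1)
X = np.hstack((x1, x2))
y = df['Spending'].values.reshape(-1, 1)

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit ordinary least squares on the transformed features and evaluate on the test set
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
metrics(y_test, predictions)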


The Mean Absolute Error on the test data is: 23.252


The RMSE on the test data is: 27.406


The R2 value is: 0.855
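
The question also asks for the coefficients, which a fitted `LinearRegression` exposes as `coef_` and `intercept_`. The following is a minimal sketch on synthetic data, since 'consumers.csv' is not reproduced here: the generating values (5, 30, 40), the noise level, and the names `income`, `price`, `spending`, and `demo_model` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for consumers.csv (hypothetical values, not the real data)
rng = np.random.default_rng(0)
income = rng.uniform(1000, 10000, size=200)
price = rng.uniform(1, 100, size=200)
spending = 5.0 + 30.0 * np.log(income) + 40.0 / price + rng.normal(0, 0.5, size=200)

# Same feature transformation as in the exercise: log(Income) and 1/Price
X_demo = np.column_stack((np.log(income), 1.0 / price))
demo_model = LinearRegression().fit(X_demo, spending)

# coef_ follows the column order of X_demo; intercept_ is the constant term
coef_log_income, coef_inv_price = demo_model.coef_
print(f'Intercept:            {demo_model.intercept_:.3f}')
print(f'Coef. on log(Income): {coef_log_income:.3f}')
print(f'Coef. on 1/Price:     {coef_inv_price:.3f}')
```

In the notebook itself `y` is reshaped to a column, so `model.coef_` has shape `(1, 2)` and `model.intercept_` has shape `(1,)`; the two coefficients are then `model.coef_[0]`.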



## Q2 - Multivariate Linear Regression

'wine_data.csv' contains attributes of red wine: acidity, sugar, chloride and sulphate content, and density. We want to model alcohol content and quality from these attributes.
Build such a model and comment on its fit. Assume linear relationships between the dependent and independent variables.

Load the data:

In [16]:
data=pd.read_csv('wine_data.csv')

In [17]:
data.head()

,fixed acidity,residual sugar,chlorides,Sulphates,density,alcohol,quality
0,5.488135,13.556331,0.000312,0.906555,13.556331,47.266304,-43.775853
1,7.151894,5.400159,0.000696,0.774047,5.400159,26.183714,-4.191843
2,6.027634,14.70388,0.000378,0.333145,14.70388,56.56418,-50.673865
3,5.448832,19.243771,0.00018,0.081101,19.243771,68.564888,-58.720789
4,4.236548,4.975063,2.5e-05,0.407241,4.975063,22.498105,-8.607858
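
In the five rows shown above, the `density` values are identical to the `residual sugar` values. Whether this holds for the whole file cannot be told from the preview, but it is worth checking before interpreting coefficients: when two feature columns are exact copies, least-squares predictions are unaffected, yet the individual coefficients on those two columns are not uniquely determined. A small sketch of such a check, run only on the five previewed rows (the names `head` and `identical_pairs` are illustrative):

```python
from itertools import combinations

import numpy as np
import pandas as pd

# The five feature rows shown in the preview above (not the full dataset)
head = pd.DataFrame({
    'fixed acidity': [5.488135, 7.151894, 6.027634, 5.448832, 4.236548],
    'residual sugar': [13.556331, 5.400159, 14.70388, 19.243771, 4.975063],
    'chlorides': [0.000312, 0.000696, 0.000378, 0.00018, 2.5e-05],
    'Sulphates': [0.906555, 0.774047, 0.333145, 0.081101, 0.407241],
    'density': [13.556331, 5.400159, 14.70388, 19.243771, 4.975063],
})

# Collect every pair of columns whose values match exactly
identical_pairs = [
    (a, b)
    for a, b in combinations(head.columns, 2)
    if np.array_equal(head[a].to_numpy(), head[b].to_numpy())
]
print(identical_pairs)  # [('residual sugar', 'density')]
```

On the full data, the same loop could be run over `data.drop(['alcohol', 'quality'], axis=1)`; `DataFrame.corr()` is a softer alternative that also reveals near-duplicates.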


Build and test the model:

In [18]:
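# Features: every column except the two targets; targets: alcohol and quality
X = data.drop(['alcohol', 'quality'], axis=1).values
y = data[['alcohol', 'quality']].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# LinearRegression fits both target columns at once when y is 2-D
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)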


How is it performing?

In [19]:
print('The metrics for alcohol are as follows: \n')
metrics(y_test[:,0],predictions[:,0])

The metrics for alcohol are as follows: 


The Mean Absolute Error on the test data is: 0.798


The RMSE on the test data is: 0.992


The R2 value is: 0.997



In [20]:

print('The metrics for quality are as follows: \n')
metrics(y_test[:,1],predictions[:,1])

The metrics for quality are as follows: 


The Mean Absolute Error on the test data is: 3.447


The RMSE on the test data is: 4.054


The R2 value is: 0.972
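
On the held-out 30% of rows, R2 is 0.997 for alcohol and 0.972 for quality, so the linear model accounts for nearly all of the test-set variance in both targets. The MAE and RMSE figures are in each target's own units and should not be compared across the two targets.

The per-target numbers can also be obtained in one call by passing `multioutput='raw_values'` to the scikit-learn metric functions. The sketch below uses synthetic two-target data; `X_demo`, `Y_demo`, `W_true`, and `multi_model` are hypothetical names and values, not the wine data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: 3 features, 2 targets that are linear in the features plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
W_true = np.array([[2.0, -1.0], [0.5, 3.0], [-1.5, 0.0]])
Y_demo = X_demo @ W_true + rng.normal(scale=0.1, size=(200, 2))

X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, test_size=0.3, random_state=42)
multi_model = LinearRegression().fit(X_tr, Y_tr)
Y_pred = multi_model.predict(X_te)

# multioutput='raw_values' returns one score per target column instead of an average
mae_per_target = mean_absolute_error(Y_te, Y_pred, multioutput='raw_values')
r2_per_target = r2_score(Y_te, Y_pred, multioutput='raw_values')

for name, mae, r2 in zip(['target 0', 'target 1'], mae_per_target, r2_per_target):
    print(f'{name}: MAE = {mae:.3f}, R2 = {r2:.3f}')
```

Without `multioutput='raw_values'`, these functions average over the targets by default, which would hide any difference in fit between alcohol and quality.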

