# Regression

## Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
pd.options.display.max_columns = None # to remove the limit on the number of df columns shown in output
pd.options.display.precision = 3 # to have 3 decimal points as a global option in dataframes
pd.options.display.float_format = '{:,.3f}'.format # 3 decimals and ',' for larger floats in pd dataframes
np.set_printoptions(precision = 3) # to have 3 decimal points as a global option in simple print outputs
np.set_printoptions(suppress = True) # to avoid scientific notation in the outputs
np.set_printoptions(formatter={'float': lambda x: "{:,.3f}".format(x)}) # 3 decimals and ',' for larger floats in np arrays

In [None]:
Housing = pd.read_csv("https://raw.githubusercontent.com/monahatami1/monogram1/master/USA_Housing.csv")

In [None]:
Housing.head()

In [None]:
Housing.isnull().sum()

I don't like column headers that are so long! So I am going to shorten them to one or two words each. Shorter names will be easier to handle.

In [None]:
ColNames = "Income Age Rooms BedroomArea Population Price Address".split()

In [None]:
Housing = pd.read_csv("https://raw.githubusercontent.com/monahatami1/monogram1/master/USA_Housing.csv", 
                      skiprows = 1,
                     names = ColNames)
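
As an alternative to reading the file a second time, the headers of an already loaded dataframe can be replaced in place. The sketch below uses a tiny made-up frame (the long header names and the values are illustrative assumptions, not taken from the CSV) to show that assigning a list to `.columns` renames the columns by position.

```python
import pandas as pd

# Tiny stand-in frame with long headers; names and values are made up for illustration.
demo = pd.DataFrame({
    "Avg. Area Income": [79545.46, 79248.64],
    "Avg. Area House Age": [5.68, 6.00],
    "Price": [1059033.56, 1505890.91],
})

# Assigning a list to .columns renames positionally; its length must match the column count.
demo.columns = ["Income", "Age", "Price"]
print(demo.columns.tolist())
```

With this approach the list must contain exactly one name per column, in the same order as the original headers.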

In [None]:
pd.options.display.float_format = '{:,.2f}'.format # 2 decimal places and ',' as thousands separator for floats in pandas dataframes
Housing.head()

In [None]:
Housing.info()

In [None]:
Housing.describe()

## EDA

In [None]:
sns.pairplot(Housing)

In [None]:
sns.histplot(data = Housing['Price'], 
             bins = 20,
             kde = True,
             color = "red",
             fill = True, # fill expects a boolean; the bar color is set by the color argument
             line_kws = {'color': 'red', 
                         'linewidth': 2,
                        'linestyle': "-."})
plt.show()

## Training
We need predictors and a target. In this case, the target is the house price and the other numeric variables are used as predictors. Address is left out, on the assumption that it adds no information about Price. It might, if we extracted zip codes or street names, but for now this column is out!
1. Divide the dataset into train and test sets using the **`train_test_split`** function from the `model_selection` module of `scikit-learn`
2. Create a regression model using the **`LinearRegression`** class from the `linear_model` module of `scikit-learn`
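
The two steps above can be sketched on synthetic data before applying them to the housing set. Everything in this block is an illustrative assumption: the names `X_demo`, `y_demo`, and `model`, and the true relationship `y = 3 + 2*x1 - 1*x2` plus a little noise. The point is only to show that the split sizes follow `test_size` and that the fitted coefficients land close to the values used to generate the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 200 rows, 2 predictors, known linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Step 1: hold out 40% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101
)

# Step 2: fit an ordinary least squares model on the training rows only.
model = LinearRegression().fit(X_train, y_train)

print(X_train.shape, X_test.shape)
print(model.intercept_, model.coef_)
```

Because the noise is small, the printed intercept and coefficients should be close to 3, 2, and -1.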

In [None]:
X = Housing.iloc[:, 0:5] # Defining the predictor variables
y = Housing['Price'] # Defining the target variable

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [None]:
TrainX, TestX, Trainy, Testy = train_test_split(X, y, test_size = 0.4, random_state = 101)

In [None]:
lm = LinearRegression() # create the estimator object under a short name
lm.fit(TrainX, Trainy)  # to fit the regression model to the training dataset
print(lm.intercept_)
pd.DataFrame(data = [lm.intercept_] + list(lm.coef_), 
             index = ['Intercept'] + list(X.columns), 
             columns = ['Coefficient'])

## Testing and Evaluation
To test the model, we use the fitted model to predict target values on the test set and then compare the predictions with the observed values to measure the error. Several metrics (loss functions) can be used to evaluate the fit. For all of them, lower values indicate a model that predicts better.
- **Mean Absolute Error** (MAE): the average absolute error, and the easiest to understand
- **Mean Squared Error** (MSE): more popular than MAE because it "punishes" larger errors, which tends to be useful in the real world
- **Root Mean Squared Error** (RMSE): even more popular than MSE because it is expressed in the same units as `y`
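
To make the three metrics concrete, here is a small worked example computed by hand with NumPy. The `observed` and `predicted` arrays are made-up numbers, chosen so that one prediction is far off and its effect on MSE is easy to see.

```python
import numpy as np

# Made-up observed and predicted values, for illustration only.
observed = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 300.0, 440.0])

errors = observed - predicted      # [-10, 10, 0, -40]

mae = np.mean(np.abs(errors))      # (10 + 10 + 0 + 40) / 4 = 15
mse = np.mean(errors ** 2)         # (100 + 100 + 0 + 1600) / 4 = 450
rmse = np.sqrt(mse)                # sqrt(450), about 21.21, in the units of y

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', round(rmse, 2))
```

The single error of 40 contributes 1600 of the 1800 total squared error, which is why RMSE (about 21.21) ends up noticeably larger than MAE (15).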

In [None]:
Predictions = lm.predict(TestX)

In NumPy, the method `.reshape(1, -1)` transforms a 1-dimensional array (`np.array()`) into a 2-dimensional array with 1 row and as many columns as the original array has elements. This is often needed before prediction, because `scikit-learn` estimators expect 2-dimensional input of shape (samples, features), so a single house has to be passed as one row. **The -1 argument tells `reshape` to calculate the number of columns automatically from the length of the original array.**
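
A minimal sketch of what the reshape does to the array's shape. The variable names `features`, `row`, and `column` are hypothetical, and the `(-1, 1)` variant is shown only for contrast.

```python
import numpy as np

features = np.array([68500, 5.9, 6.9, 3.4, 3600])
print(features.shape)              # (5,)   one dimension, five elements

row = features.reshape(1, -1)      # 1 row, number of columns inferred
print(row.shape)                   # (1, 5) one sample with five features

column = features.reshape(-1, 1)   # number of rows inferred, 1 column
print(column.shape)                # (5, 1) five samples with one feature each
```

For a single house described by five predictors, the `(1, 5)` form is the one a fitted model's `predict` expects.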

In [None]:
# Let's predict the price of a specific house with certain values for its predictors
SampleHouse = np.array([68500, 5.9, 6.9, 3.4, 3600]).reshape(1,-1)
SampleHouseprice = lm.predict(SampleHouse)
print('Predicted price for the sample house is ${0}'.format(round(SampleHouseprice[0], 0)))

In [None]:
plt.scatter(Testy, Predictions, 
            marker = '.', 
            c = "red", 
            s = 1)
plt.title('Predicted vs. Observed House Prices')
plt.xlabel('Observed')
plt.ylabel('Predicted')
plt.show()

In [None]:
sns.histplot((Testy - Predictions), bins = 50);

In [None]:
from sklearn import metrics

In [None]:
print('MAE:', round(metrics.mean_absolute_error(Testy, Predictions), 2))
print('MSE:', round(metrics.mean_squared_error(Testy, Predictions), 2))
print('RMSE:', round(np.sqrt(metrics.mean_squared_error(Testy, Predictions)), 2))