# Regression

## Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

**`%matplotlib notebook`**

Unlike `%matplotlib inline`, which displays static images of plots, `%matplotlib notebook` provides an interactive interface with the plots, allowing for zooming, panning, and other interactive features. It can be useful for exploring and analyzing data in a more interactive manner. However, it may be slower and more resource-intensive than `%matplotlib inline`.

In [None]:
pd.options.display.max_columns = None # to remove the limit on the number of df columns shown in output
pd.options.display.precision = 3 # to have 3 decimal points as a global option in dataframes
pd.options.display.float_format = '{:,.3f}'.format # 3 decimals and ',' for larger floats in pd dataframes
np.set_printoptions(precision = 3) # to have 3 decimal points as a global option in simple print outputs
np.set_printoptions(suppress = True) # to avoid scientific notation in the outputs
np.set_printoptions(formatter={'float': lambda x: "{:,.3f}".format(x)}) # 3 decimals and ',' for larger floats in np arrays

In [None]:
Housing = pd.read_csv("https://raw.githubusercontent.com/monahatami1/monogram1/master/USA_Housing.csv")

In [None]:
Housing.head()

In [None]:
Housing.isnull().sum() # checking whether there are any null values in the dataset

I don't like column headers that are so long! So I am going to shorten them to one or two words at most. Shorter names will be easier to handle.

In [None]:
ColNames = "Income Age Rooms BedroomArea Population Price Address".split()

In [None]:
Housing = pd.read_csv("https://raw.githubusercontent.com/monahatami1/monogram1/master/USA_Housing.csv", 
                      skiprows = 1,
                     names = ColNames)

In [None]:
pd.options.display.float_format = '{:,.2f}'.format # 2 decimal places and ',' for larger floats in Pandas dataframes
Housing.head()

In [None]:
Housing.info()

In [None]:
Housing.describe()

Just to have some fun with categorical variables as well, I am going to create some categorical variables out of the existing ones. Let's make a binary and a multi-class variable. I will use 40,000 as the population threshold to call an area an MPO, and this becomes an indicator variable with Yes and No as its values. 
Similarly, for Rooms, let's call them Low (fewer than 4 rooms), Medium (4-7 rooms) and High (more than 7 rooms). Note that the `pd.cut` cell below does not hard-code these cutoffs: it splits the observed range of Rooms into three equal-width bins.

In [None]:
Housing['MPO'] = Housing['Population'].apply(lambda x: "Yes" if x >= 40000 else "No")

`.apply()` is a pandas method. Called on a single column (a Series), as above, it applies a function to each value; called on a DataFrame, it applies a function along an axis. I used it above to apply a lambda function to each value of the Population column. The lambda takes a single argument x, which represents one population value. If x is greater than or equal to 40000 it returns "Yes", otherwise it returns "No", and the results are stored in the new MPO column.
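As a small illustration of the element-wise behavior described above, here is a sketch on made-up population values (not taken from the dataset). It also shows `np.where` as one possible vectorized alternative; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy population values, invented for illustration only
population = pd.Series([12000, 40000, 55000, 39999])

# Element-wise apply: the lambda is called once per value
flag_apply = population.apply(lambda x: "Yes" if x >= 40000 else "No")

# Vectorized alternative: the condition is evaluated on the whole column at once
flag_where = pd.Series(np.where(population >= 40000, "Yes", "No"),
                       index=population.index)

print(flag_apply.tolist())
print((flag_apply == flag_where).all())  # both approaches should agree
```

On a column of a few thousand rows either approach is fine; the vectorized form tends to matter more on much larger data.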

I could simply use 1 and 0 instead of Yes and No, but I chose the labels so that I can practice the `.cat.codes` attribute below.
This attribute in pandas can be used to get the integer codes of the categories in a categorical column. **This attribute is available only for pandas categorical columns, which are created using the `pd.Categorical()` function or by calling the `.astype('category')` method on an existing column.** 

For practice purposes I will create a new column with the codes this time. Note that the new MPO column is not categorical yet, so it has to be converted to a categorical type before its codes can be read.
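A minimal sketch of how the codes are assigned, using toy labels rather than the Housing data. By default pandas sorts the categories, so "No" comes before "Yes"; the explicit `map` at the end is just an alternative shown for comparison.

```python
import pandas as pd

# Toy Yes/No labels, invented for illustration only
labels = pd.Series(["No", "Yes", "Yes", "No"])

# Convert to a categorical column first; .cat is only available afterwards
as_cat = labels.astype("category")

print(list(as_cat.cat.categories))  # categories in sorted order
print(as_cat.cat.codes.tolist())    # integer position of each value's category

# An explicit mapping gives the same result and makes the coding visible
mapped = labels.map({"No": 0, "Yes": 1})
print(mapped.tolist())
```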

In [None]:
Housing['MPO01'] = Housing['MPO'].astype('category').cat.codes

In [None]:
RoomGroups = 'Low Medium High'.split() # define groups to put numbers in
Room_bins = np.linspace(Housing["Rooms"].min(), Housing["Rooms"].max(), 4) # 4 equally spaced edges between min and max give 3 bins
Housing['RoomGroup'] = pd.cut(Housing['Rooms'], 
                                bins = Room_bins, 
                                right = True, # intervals are closed on the right, so the maximum falls in the last bin
                                include_lowest = True, # also include the lowest edge, so the minimum falls in the first bin
                                labels = RoomGroups)
# Housing[['Rooms', 'RoomGroup']].head(20)
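If the fixed cutoffs mentioned in the text (4 and 7 rooms) are preferred over equal-width bins, `pd.cut` also accepts explicit bin edges. The sketch below uses made-up room counts. How a value of exactly 4 or exactly 7 is assigned is an assumption here: with `right = False`, 4 goes to Medium and 7 goes to High.

```python
import numpy as np
import pandas as pd

# Toy room counts, invented for illustration only
rooms = pd.Series([3.2, 4.0, 5.5, 6.9, 7.0, 9.8])

# Explicit edges; infinite outer edges guarantee every value falls in some bin
edges = [-np.inf, 4, 7, np.inf]

# right=False gives the intervals [-inf, 4), [4, 7), [7, inf)
groups = pd.cut(rooms, bins=edges, right=False, labels=["Low", "Medium", "High"])

print(pd.DataFrame({"Rooms": rooms, "RoomGroup": groups}))
```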

In [None]:
Housing.head()

## EDA

In [None]:
sns.histplot(data = Housing['Price'], 
             bins = 20,
             kde = True,
             color = "red",
             fill = True, # fill is a boolean; the bar color is set by color
             line_kws = {'color': 'red', 
                         'linewidth': 2,
                        'linestyle': "-."})
plt.show()

In [None]:
# simple scatter plot of price vs one predictor variable
Housing.plot(kind = 'scatter', 
             x = 'Income', 
             y = 'Price', 
             s = 1, 
             c = "orange")
plt.show()

In [None]:
# a side-by-side boxplot for the Price grouped by the levels of the created binary MPO variable 
sns.boxplot(x = 'MPO01', y = 'Price', data = Housing)

In [None]:
# Scatter plots of all pairs of variables
sns.pairplot(Housing)

In [None]:
# let's take a look at the correlation between variables
corrmatrix = Housing.iloc[:,0:5].corr()
corrmatrix

Let's look at the correlation matrix in a heat map. Some of the options for the color map are viridis, coolwarm, magma, inferno, plasma, Greys, Blues, Greens, Reds, Oranges.

In [None]:
sns.heatmap(corrmatrix, 
            cmap = 'Reds', # color map
            annot = True, # add numeric values
            linewidths = 0.1, # width of the lines separating the cells
            linecolor = "black"
           ) 
plt.show()

## Training
We need predictors and a target. In this case, the target is the house price and the other variables are used as predictors. Address is not used, on the assumption that it does not provide any additional information about the Price. It might, if we could extract zip codes or street names, but for now this column is out!

**1.** Divide the dataset into train and test sets using the **`train_test_split`** function from the `model_selection` module of the `scikit-learn` library.

The syntax is `train_test_split(X, y, test_size = None, random_state = None, shuffle = True, stratify = None)`. 
- `test_size` is the proportion of the dataset to include in the test split. With the default of `None` (and no `train_size` given), 0.25 is used.
- `random_state` is the seed for the random shuffling; fixing it to an integer makes the split reproducible.
- `shuffle` controls whether the data is shuffled before splitting.
- `stratify` ensures that each class is represented proportionally in the train and test datasets. This is useful when dealing with imbalanced datasets or when the target variable is categorical. Here is an example: 

`X_train, X_test, y_train, y_test = train_test_split(data.drop('class', axis=1), data['class'], test_size=0.3, random_state=42, stratify=data['class'])`
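To see what `stratify` does in practice, here is a runnable sketch on a made-up imbalanced dataset (80 rows of class "a", 20 of class "b"). The DataFrame and the `toy_` variable names are hypothetical and unrelated to the Housing data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data, invented for illustration only
data = pd.DataFrame({"feature": range(100),
                     "class": ["a"] * 80 + ["b"] * 20})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    data.drop("class", axis=1), data["class"],
    test_size=0.3, random_state=42, stratify=data["class"])

# Both splits should keep roughly the original 80/20 class balance
print(toy_y_train.value_counts(normalize=True))
print(toy_y_test.value_counts(normalize=True))
```

Without `stratify`, the share of the minority class in the test set would vary with the random seed.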


**2.** Create a regression model using the **`LinearRegression`** class from the `linear_model` module of the `scikit-learn` library.

In [None]:
Housing.head(2)

In [None]:
ColIDs = list(range(0,5)) + [8] # including the binary MPO01 variable alongside Population is redundant (done for coding practice)
X = Housing.iloc[:, ColIDs] # Defining the predictor variables
y = Housing['Price'] # Defining the target variable

In [None]:
# pd.get_dummies(Housing['RoomGroup']) # If I wanted to use the RoomGroup I made in my analysis
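For reference, a sketch of what one-hot encoding a three-level variable such as RoomGroup could look like. The values are made up, and the choices of `drop_first = True` (to avoid a redundant column next to the intercept) and `dtype = int` are assumptions, not part of the analysis above.

```python
import pandas as pd

# Toy categorical column with an explicit category order, invented for illustration
room_group = pd.Series(
    pd.Categorical(["Low", "High", "Medium", "Medium"],
                   categories=["Low", "Medium", "High"]),
    name="RoomGroup")

# One indicator column per category; drop_first removes the first level ("Low"),
# which then acts as the baseline
dummies = pd.get_dummies(room_group, prefix="RoomGroup", drop_first=True, dtype=int)
print(dummies)
```

The resulting columns could then be joined to the other predictors with `pd.concat(..., axis = 1)`.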

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression # import the class directly so it is simpler to write

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 101)
TestData = pd.concat([X_test, y_test], axis = 1)
TrainData = pd.concat([X_train, y_train], axis = 1)

In [None]:
TrainData.head()

In [None]:
# Looking at the Price vs. Income for train and test data visually to see how it looks!
plt.scatter(X_train['Income'], y_train, 
            label = "Train Data",
            color = "r", 
            marker = '.',
            s = 2)

plt.scatter(X_test['Income'], y_test, 
            label = "Test Data",
            color = "b", 
            marker = '.',
            s = 2,
            alpha = 0.6)
plt.legend()
plt.title("Price vs. Income")
plt.show()

In [None]:
lm = LinearRegression()
lm.fit(X_train, y_train)  # to fit the regression model to the training dataset
print(lm.intercept_)
pd.DataFrame(data = [lm.intercept_] + list(lm.coef_), 
             index = ['Intercept'] + list(X.columns), 
             columns = ['Coefficient'])

## Testing and Evaluation
To test the model, we use the fitted model to predict target values on the test set and then compare the predictions with the observed values to measure the error. Several metrics (loss functions) can be used to evaluate the fit; lower values indicate a model that predicts better.
- **Mean Absolute Error**, (the average error / the easiest to understand)
- **Mean Squared Error**, (more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in real world)
- **Root Mean Squared Error** (even more popular than MSE, because RMSE is interpretable in the "y" units)
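The three metrics listed above can be computed by hand, which makes the definitions concrete: MAE is the mean of the absolute errors, MSE is the mean of the squared errors, and RMSE is the square root of MSE. The numbers below are toy values, not model output, and the comparison with `sklearn.metrics` is only a sanity check.

```python
import numpy as np
from sklearn import metrics

# Toy observed and predicted values, invented for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

errors = y_true - y_pred           # per-observation errors

mae = np.mean(np.abs(errors))      # Mean Absolute Error
mse = np.mean(errors ** 2)         # Mean Squared Error
rmse = np.sqrt(mse)                # Root Mean Squared Error, in the units of y

print("MAE:", mae, "MSE:", mse, "RMSE:", round(rmse, 3))

# The hand-computed values should match scikit-learn's functions
print(np.isclose(mae, metrics.mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, metrics.mean_squared_error(y_true, y_pred)))
```

Note how the single error of 30 dominates MSE, which is the "punishing larger errors" effect mentioned in the list.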

In [None]:
Predictions = lm.predict(X_test)

In ML, the method `.reshape(1, -1)` is used to transform a 1-dimensional numpy array (`np.array()`) into a 2-dimensional array with 1 row and a number of columns equal to the length of the original array. This is often used to prepare data for model training or prediction, as many ML algorithms expect 2-dimensional arrays as input. **The -1 argument in the reshape method is used to automatically calculate the number of columns based on the length of the original array.**
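A quick sketch of the shapes involved, using the same six values as the sample house below. The variable names are hypothetical.

```python
import numpy as np

# One observation with six feature values, stored as a 1-dimensional array
features = np.array([68500, 5.9, 6.9, 3.4, 3600, 1])
print(features.shape)          # 1-dimensional: six elements

# One row, with the number of columns inferred from the array length
row = features.reshape(1, -1)
print(row.shape)

# For contrast: one column, with the number of rows inferred instead
column = features.reshape(-1, 1)
print(column.shape)
```

For a single observation with several features, the one-row form is the one scikit-learn's `predict` expects.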

In [None]:
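# Let's predict the price of a specific house with certain values for its predictors
# The last value is MPO01; it is 0 because a population of 3600 is below the 40,000 threshold
# Wrapping the row in a DataFrame with the training column names avoids scikit-learn's feature-name warning
SampleHouse = pd.DataFrame(np.array([68500, 5.9, 6.9, 3.4, 3600, 0]).reshape(1,-1), columns = X.columns)
SampleHouseprice = lm.predict(SampleHouse)
print('Predicted price for the sample house is ${:,.0f}'.format(SampleHouseprice[0]))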

In [None]:
plt.scatter(y_test, Predictions, 
            marker = '.', 
            c = "red", 
            s = 1)
plt.title('Predicted vs. Observed House Prices')
plt.xlabel('Observed')
plt.ylabel('Predicted')
plt.show()

In [None]:
sns.histplot((y_test - Predictions), bins = 50);

In [None]:
from sklearn import metrics

In [None]:
print('MAE:', round(metrics.mean_absolute_error(y_test, Predictions), 2))
print('MSE:', round(metrics.mean_squared_error(y_test, Predictions), 2))
print('RMSE:', round(np.sqrt(metrics.mean_squared_error(y_test, Predictions)), 2))

In addition, we can use the **R<sup>2</sup>** value as a metric for goodness of fit. The `score()` method of the fitted `LinearRegression` model returns it.

In [None]:
lm.score(X_test, y_test)
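For a regressor, the value returned by `score()` is the coefficient of determination, R<sup>2</sup> = 1 - SS<sub>res</sub> / SS<sub>tot</sub>. The sketch below reproduces it by hand on synthetic data; the data, the coefficients used to generate it, and all variable names are assumptions made for illustration and are unrelated to the Housing results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from a known linear relationship plus noise (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 3.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X_toy, y_toy)
pred = model.predict(X_toy)

ss_res = np.sum((y_toy - pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

# The hand-computed value should agree with model.score
print(np.isclose(r2_manual, model.score(X_toy, y_toy)))
```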