# Tutorial 5 - Cameron O'Dell

### Activity-I

**In the Tutorial Completion Document, describe your findings.**

Fit/train the model with `sepal_width` and `sepal_length` and then predict the `petal_length`. Plot the actual and predicted values using Plotly.

In [56]:
import pandas as pd
import plotly.express as px
from sklearn.linear_model import LinearRegression

# Read and create dataframe from CSV
fpath = "C:/Users/14175/Desktop/CS490/Tutorial 5/data/input/"
df = pd.read_csv(fpath + 'iris.csv')

# Distinguishing target variable, generating regression, making prediction
X = df[['sepal_width', 'sepal_length']]
y = df['petal_length']

reg = LinearRegression().fit(X, y)

y_pred = reg.predict(X)

# Create new dataframe
results_df = pd.DataFrame({'Actual': y, 'Predicted': y_pred})

# Plotting
fig = px.scatter(results_df, x='Actual', y='Predicted', title='Actual vs Predicted Petal Length',
                 trendline='ols',  # Ordinary Least Square (OLS) with plotly.express
                 trendline_color_override='red').update_traces(marker=dict(color='blue'))

fig.update_layout(title='<b>Iris Dataset</b>',
                  autosize=False,
                  width=700,
                  height=500,
                  margin=dict(
                      l=0,
                      r=0,
                      b=0,
                  ), )

fig.show()

**Activity I findings.**

Using the 'sepal_width' and 'sepal_length' columns as the inputs ('X') and the 'petal_length' column as the target ('y'), I fitted a linear regression model from the scikit-learn library and used it to predict petal length from the inputs. The predictions and the actual target values are stored in a new Pandas dataframe called 'results_df'.

The graph, created with plotly.express, plots the predicted values against the actual values. It also includes an OLS trendline fitted to those points, which summarizes the overall relationship between predicted and actual values. The predicted values tend to increase as the actual values increase, and the points do not deviate substantially from the trendline. This suggests that the linear regression model is reasonably effective at predicting petal length from the two sepal measurements. Note that the predictions here are made on the same rows the model was fitted on, so the agreement is likely somewhat optimistic.
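One way to put a number on how closely the predictions track the actual values is the coefficient of determination, R². The sketch below is an illustration rather than part of the original notebook. It assumes that scikit-learn's bundled copy of the iris data is an acceptable stand-in for `iris.csv`, and it renames the bundled columns to the names used above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

# Assumption: scikit-learn's bundled iris data stands in for iris.csv.
iris = load_iris(as_frame=True).frame.rename(columns={
    'sepal length (cm)': 'sepal_length',
    'sepal width (cm)': 'sepal_width',
    'petal length (cm)': 'petal_length',
})

X = iris[['sepal_width', 'sepal_length']]
y = iris['petal_length']

reg = LinearRegression().fit(X, y)

# score() returns R^2: 1.0 is a perfect fit, and 0.0 is no better
# than always predicting the mean of y.
r2 = reg.score(X, y)

print(f'R^2 on the fitted rows: {r2:.3f}')
print('Coefficients:', dict(zip(X.columns, reg.coef_)))
print('Intercept:', reg.intercept_)
```

Because the score is computed on the rows used for fitting, it describes how well the model fits this data and not how well it would generalize to unseen flowers.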

## TASK I

**In the Tutorial Completion Document, describe your findings.**

Fit/train the model with the number of `bedrooms` and `bathrooms`. And then, predict the **house price** if the number of **bedrooms is three** and the number of **bathrooms is two**. Plot the actual and predicted values using Plotly.

<font color='red'>Note: You need to print the house price (single value) when bedrooms=3 and bathrooms=2</font>

In [57]:
import pandas as pd
import plotly.express as px
from sklearn.linear_model import LinearRegression

# Reading the CSV file
fpath = "C:/Users/14175/Desktop/CS490/Tutorial 5/data/input/"
df = pd.read_csv(fpath + 'HousePrices.csv')

#  Distinguishing target variable, generating regression, making prediction
X = df[['bedrooms', 'bathrooms']]
y = df['price']

reg = LinearRegression().fit(X, y)

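# Build the query as a DataFrame so it carries the same feature names used in fit()
house = pd.DataFrame([[3, 2]], columns=['bedrooms', 'bathrooms'])
price_pred = reg.predict(house)
print('Predicted price for 3 bedrooms and 2 bathrooms:', price_pred[0])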

# Creating a new dataframe with actual and predicted values
results_df = pd.DataFrame({'Actual': y, 'Predicted': reg.predict(X)})

# Plotting actual and predicted values using Plotly
fig = px.scatter(results_df, x='Actual', y='Predicted',
                 trendline='ols',  # Ordinary Least Square (OLS) with plotly.express
                 trendline_color_override='red').update_traces(marker=dict(color='blue'))

fig.update_layout(title='<b>House Price Dataset</b>',
                  autosize=False,
                  width=700,
                  height=500,
                  margin=dict(
                      l=0,
                      r=0,
                      b=0,
                  ), )

fig.show()



**Task I findings.**

I used the 'bedrooms' and 'bathrooms' columns as input features (X) and the 'price' column as the target variable (y) to fit a linear regression model using the scikit-learn library. This model was then used to predict the price of a hypothetical 3-bedroom, 2-bathroom house, which was saved as 'price_pred'.

To evaluate the model, I created a new Pandas dataframe called 'results_df' that contains both the predicted and actual house prices. Using plotly.express, I drew a scatter plot of predicted against actual prices, with an OLS trendline fitted to the points. The predicted prices increase as the actual prices increase. However, the points deviate widely from the trendline, which indicates that a linear regression on bedroom and bathroom counts alone is not a reliable predictor of house price.
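The single-house prediction can be sketched on its own. scikit-learn emits an "X does not have valid feature names" warning when a model fitted on a DataFrame is asked to predict from a plain list, so this sketch passes a one-row DataFrame instead. The toy data is hypothetical and generated from an exactly linear rule, which is an assumption made only so the example is self-contained.

```python
import warnings

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: price = 20000 + 50000 * bedrooms + 30000 * bathrooms.
toy = pd.DataFrame({
    'bedrooms':  [1, 2, 3, 4, 2, 3],
    'bathrooms': [1, 1, 2, 3, 2, 1],
})
toy['price'] = 20000 + 50000 * toy['bedrooms'] + 30000 * toy['bathrooms']

features = ['bedrooms', 'bathrooms']
reg = LinearRegression().fit(toy[features], toy['price'])

# A one-row DataFrame with the same column names that were used in fit().
house = pd.DataFrame([[3, 2]], columns=features)

with warnings.catch_warnings():
    # Raise an exception if the feature-name warning is emitted.
    warnings.filterwarnings(
        'error', message='X does not have valid feature names')
    price_pred = reg.predict(house)

# predict() returns an array; index 0 is the single predicted price.
print(f'Predicted price: {price_pred[0]:,.2f}')
```

Because the toy prices follow the linear rule exactly, the fitted model recovers a price of approximately 230000 for the 3-bedroom, 2-bathroom query. Real data will not fit this cleanly.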

### Activity-II

**In the Tutorial Completion Document, describe your findings.**

Report and discuss the three regression evaluation metrics (MAE, MSE, and RMSE) while predicting the `petal_length` using `sepal_width` and `sepal_length` with 75% train and 25% test data. Plot the actual and predicted values.

In [58]:
import pandas as pd
import plotly.express as px
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Reading the CSV file
df = pd.read_csv(fpath + 'iris.csv')

# Splitting the data into training and testing datasets
X = df[['sepal_length', 'sepal_width']]
y = df['petal_length']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Training the Linear Regression model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predicting the petal_length for the test data
y_pred = lr.predict(X_test)

# Regression evaluation metrics
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # square root of MSE; avoids the deprecated squared=False argument

print('Mean Absolute Error (MAE):', mae)
print('Mean Squared Error (MSE):', mse)
print('Root Mean Squared Error (RMSE):', rmse)

# Plotting the actual and predicted values
fig = px.scatter(x=y_test, y=y_pred,
                 trendline='ols',  # Ordinary Least Square (OLS) with plotly.express
                 trendline_color_override='red').update_traces(marker=dict(color='blue'))

fig.update_layout(title='<b>Iris Dataset</b>',
                  xaxis_title='Actual Petal Length',
                  yaxis_title='Predicted Petal Length',
                  autosize=False,
                  width=700,
                  height=500,
                  margin=dict(
                      l=0,
                      r=0,
                      b=0,
                  ), )

fig.show()

Mean Absolute Error (MAE): 0.45447426433682603
Mean Squared Error (MSE): 0.31544568447952925
Root Mean Squared Error (RMSE): 0.5616455149643138


**Activity II findings.**

I fitted a linear regression on the iris dataset, which contains the sepal and petal lengths and widths of iris flowers. The dataset is split into training and testing sets using a 75/25 split. The model is trained on the training data and then used to predict the petal length for the testing data.

On the test set, the mean absolute error (MAE) is about 0.45, the mean squared error (MSE) is about 0.32, and the root mean squared error (RMSE) is about 0.56. MAE and RMSE are in the units of 'petal_length', while MSE is in squared units. The RMSE is larger than the MAE because squaring gives more weight to the larger errors.

The scatter plot shows the actual versus predicted petal lengths, with an OLS trendline fitted to those points. The positive relationship between actual and predicted values suggests that the model captures much of the variation in petal length from sepal length and width.
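The three metrics can also be written out by hand, which makes their relationship explicit. The sketch below uses only the standard library. The helper name `regression_errors` and the sample values are hypothetical and are not taken from the iris data.

```python
import math


def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError('actual and predicted must be non-empty and equal length')
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n   # mean of absolute errors
    mse = sum(r * r for r in residuals) / n    # mean of squared errors
    rmse = math.sqrt(mse)                      # back in the target's units
    return mae, mse, rmse


# Hypothetical values chosen so the arithmetic is easy to follow.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 5.0]

mae, mse, rmse = regression_errors(actual, predicted)
print(f'MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}')
# prints: MAE=0.500  MSE=0.375  RMSE=0.612
```

The single error of 1.0 contributes most of the MSE, which is why the RMSE (0.612) ends up above the MAE (0.500) in this example.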

## TASK II

**In the Tutorial Completion Document, describe your findings.**

Report and discuss the three regression evaluation metrics (MAE, MSE, and RMSE) while predicting the **house price** using the number of **bedrooms** and the number of **bathrooms** with 80% train and 20% test data. Plot the actual and predicted values.

In [55]:
import pandas as pd
import plotly.express as px
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Reading the CSV file
fpath = "C:/Users/14175/Desktop/CS490/Tutorial 5/data/input/"
df = pd.read_csv(fpath + 'HousePrices.csv')

# Extracting the features and target variable
X = df[['bedrooms', 'bathrooms']]
y = df['price']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predicting the target variable using the test data
y_pred = lr.predict(X_test)

# Regression Evaluation Metrics
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # square root of MSE; avoids the deprecated squared=False argument

print(f'Mean Absolute Error: {mae:.2f}')
print(f'Mean Squared Error: {mse:.2f}')
print(f'Root Mean Squared Error: {rmse:.2f}')

# Scatter plot
fig = px.scatter(x=y_test, y=y_pred, labels={'x': 'Actual Price for "House Price"', 'y': 'Predicted Price for "House Price"'},
                 trendline='ols',  # Ordinary Least Square (OLS) with plotly.express
                 trendline_color_override='red').update_traces(marker=dict(color='blue'))

fig.update_traces(marker=dict(size=8, opacity=0.6))

fig.update_layout(title='<b>House Price Dataset</b>',
                  autosize=False,
                  width=700,
                  height=500,
                  margin=dict(
                      l=0,
                      r=0,
                      b=0,
                  ), )


fig.show()

Mean Absolute Error: 16965.54
Mean Squared Error: 517047432.50
Root Mean Squared Error: 22738.68


**Task II findings.**


I fitted a linear regression on the House Prices dataset to predict a house's price from its number of bedrooms and bathrooms. The dataset is read from a CSV file, and the features and target variable are extracted. The data is then split into training and testing sets using an 80/20 split. The model is trained on the training set and used to predict prices for the test set, and the regression evaluation metrics are computed on those predictions: the mean absolute error (MAE) is 16,965.54, the mean squared error (MSE) is 517,047,432.50, and the root mean squared error (RMSE) is 22,738.68. MAE and RMSE are in the same units as price, so a typical prediction misses the actual price by roughly 17,000 to 23,000.
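The 80/20 split described above can be checked on a small stand-in array. This is a sketch with hypothetical data (100 rows, 2 feature columns), not the house price file. It shows the resulting set sizes and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the house data: 100 rows, 2 feature columns.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # prints: 80 20

# Repeating the call with the same random_state yields the same test rows.
_, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test_again))  # prints: True
```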

The scatter plot shows the actual versus predicted house prices, with an OLS trendline fitted to the points. There is a positive relationship between actual and predicted prices. However, many points lie far from the trendline, which indicates that a linear regression on bedroom and bathroom counts alone is not a reliable predictor of house price.