# DTSC670: Foundations of Machine Learning Models

## Assignment 1: Johnny Likes Pie

#### Name: Arsenika Merenkov

### CodeGrade
Note that this assignment will be automatically graded through CodeGrade and you will have unlimited submission attempts.  When submitting to CodeGrade, your notebook should be named `assignment1.ipynb` and there should be no errors in the file or CodeGrade will not be able to grade it.  Before submitting, I suggest that you restart your kernel and attempt to run all cells again to ensure that there will be no errors when CodeGrade runs your script.

### Details

First, make sure that you watch the video titled "Should You Play Golf Today" in the "Preparation for Assignment 1" section of Brightspace.  The sole purpose of this assignment is to walk you through some basic steps in Scikit-Learn so that you get used to working with it.

The following data describes features of different types of pie, along with a positive or negative classification of each pie based on whether or not Johnny likes it.  A positive classification means Johnny likes that pie; a negative classification means Johnny does not like that pie.

<img src="JohnnyPies.png" width="600" />

### Import Data

Let's start by importing some standard libraries.

In [None]:
# Importing numpy and pandas library
import numpy as np
import pandas as pd

# Set the maximum number of columns to None to display all columns in the DataFrame
# Do not change these options; This allows the CodeGrade auto grading to function correctly
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore') 

Next you should place the data file called `JohnnyPiesData.csv` and this Jupyter notebook in the same directory.  Use the [read_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function to read in the data from the comma-separated values (csv) file to a Pandas DataFrame called `pie_df` and output the data to take a look.

In [None]:
# Store the file name of the data in a variable
fileName = "JohnnyPiesData.csv"

# Read the csv file and store it in a dataframe
pie_df = pd.read_csv(fileName)
pie_df

## Prepare Data for Linear Regression

- Drop the `Example` column from the `pie_df` DataFrame, because it offers no information.

- Encode all categorical data into numeric data via the "One Hot Encoding" technique provided by the Pandas [get_dummies()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function.  

  - Since we are performing ordinary least squares linear regression, we will want to drop one of the newly created Boolean-valued features for each categorical variable (output from the `get_dummies()` function).  Keeping all of them would make the dummy columns of each variable sum to one, which introduces perfect correlation among the features.  Include `drop_first = True` as an argument to the `get_dummies()` function.

- Store the final features in a DataFrame called `features`.  The one-hot-encoded columns must go in the same order as the original data so that the linear regression coefficients match what CodeGrade is expecting.

- Store the positive class labels in a DataFrame called `response`.  The `response` data must be a DataFrame and not a Series or some of the code towards the end of this notebook may not function correctly and your output might be slightly different than what CodeGrade is expecting.
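The encoding steps above can be sketched on a small made-up table. This is only an illustration: the `toy` DataFrame, its column names, and its values are assumptions and are not part of the pie data.

```python
import pandas as pd

# Hypothetical toy data; the column names and values are made up for illustration.
toy = pd.DataFrame({
    "Shape": ["circle", "square", "triangle", "circle"],
    "Class": ["pos", "neg", "neg", "pos"],
})

# drop_first=True drops the first category (in sorted order) of each column,
# so "Shape" yields Shape_square and Shape_triangle, and "Class" yields Class_pos.
toy_encoded = pd.get_dummies(toy, columns=["Shape", "Class"], drop_first=True)

# Separate the predictors from the response; double brackets keep a DataFrame.
toy_features = toy_encoded.drop(columns="Class_pos")
toy_response = toy_encoded[["Class_pos"]]

print(list(toy_encoded.columns))
print(type(toy_response).__name__)
```

Selecting with double brackets (`[["Class_pos"]]`) returns a DataFrame, while single brackets would return a Series.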

**Note:** Since we are not concerned with generalization error in this assignment, we will not split our data into training and test sets. In 'real-world' projects, you would want to split your data to see how your model performs with data that it has never seen before.
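For reference only, a train/test split might look like the sketch below. It uses scikit-learn's `train_test_split` on made-up arrays; the names `toy_X` and `toy_y`, the 80/20 ratio, and the `random_state` value are assumptions, and none of this is required for the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, made up for illustration.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```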

In [None]:
# Remove the Example column from the dataframe
pie_df = pie_df.drop('Example', axis = 1)
pie_df

In [None]:
# One-hot encode the categorical columns. drop_first=True drops the first dummy
# column of each variable, leaving one fewer column than it has unique values.
encoded = pd.get_dummies(pie_df, columns = ['Crust Shape', 'Crust Size', 'Crust Shade', 'Filling Size', 'Filling Shade', 'Class'], drop_first=True)

# Keep the response column 'Class_pos' out of the features so that the model
# is not trained on the value it is supposed to predict.
features = encoded.drop(columns = 'Class_pos')
features

In [None]:
# Extract the column 'Class_pos' as the response (dependent variable) for the linear regression model.
# The double brackets keep the result a DataFrame rather than a Series.
response = encoded[['Class_pos']]
response

## Perform Linear Regression Model Fitting

1. Import the [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) class from the `sklearn.linear_model` library. 

2. Instantiate an object of the `LinearRegression` class called `reg_model`.

3. Train the model by invoking the `fit()` method of the `reg_model` object and passing it `features` and `response`.

In [1]:
# Import the LinearRegression class and create an instance of it
from sklearn.linear_model import LinearRegression
reg_model = LinearRegression()
# Fit the model with the features and response
reg_model.fit(features, response)


## Examine Linear Regression Model Parameters

View the trained model parameters by using the `coef_` and `intercept_` attributes of the trained model.
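As a rough illustration of what these attributes hold, the sketch below fits a model to made-up data that follows `y = 2*a + 3*b + 1` exactly. The names `toy_X`, `toy_y`, and `toy_model` are assumptions used only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data following y = 2*a + 3*b + 1 exactly.
toy_X = pd.DataFrame({"a": [0, 1, 0, 1, 2], "b": [0, 0, 1, 1, 1]})
toy_y = pd.DataFrame({"y": [1, 3, 4, 6, 8]})

toy_model = LinearRegression().fit(toy_X, toy_y)

# With a 2D (DataFrame) target, coef_ is 2D and intercept_ is a 1-element array.
print(toy_model.coef_.shape)
print(toy_model.intercept_.shape)

# The fitted parameters recover the slopes (2 and 3) and the intercept (1).
print(np.round(toy_model.coef_, 6), np.round(toy_model.intercept_, 6))
```

When the target is passed as a DataFrame, `coef_` has one row per target column, which is one reason the shape of `response` affects the output.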

In [None]:
# Print the coefficients of the Linear Regression model
reg_model.coef_

In [2]:
# Print the intercept of the Linear Regression model
reg_model.intercept_


## Making Predictions Using the Linear Regression Model

Generate the model's predictions on the training data set by invoking the `predict()` method and passing `features` to it.  Save this output as `preds`.


In [None]:
# Use the Linear Regression model to predict the response for the given features
preds = reg_model.predict(features)

Below are the results from the linear regression model:

The column "Class_pos" holds the actual positive or negative classification of each pie.  The column "Regression_Predictions" holds the raw predictions made by the linear regression model.  The column "Predicted_Responses" holds the adjusted predictions after applying a cut-off at 0.5: a raw prediction greater than 0.5 becomes 1, and any other value becomes 0.
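A minimal sketch of the cut-off rule, using made-up raw outputs rather than the model's real predictions (the array `toy_raw` is an assumption for illustration):

```python
import numpy as np

# Hypothetical raw regression outputs, made up for illustration.
# Linear regression is not bounded, so values below 0 or above 1 can occur.
toy_raw = np.array([-0.1, 0.2, 0.5, 0.51, 0.9, 1.2])

# Strictly greater than 0.5 maps to 1; everything else (including 0.5) maps to 0.
toy_labels = np.where(toy_raw > 0.5, 1, 0)

print(toy_labels)  # [0 0 0 1 1 1]
```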

Note:  Make sure that your `response` is a DataFrame and not a Series or some of the code below may not function correctly.

In [None]:
# resp_comp = Response Comparison
resp_comp = response.copy()
# Raw model outputs for every row, flattened to a 1D array
reg_outputs = reg_model.predict(features).ravel()
# Apply the 0.5 cut-off: 1 if the output is greater than 0.5, otherwise 0
predicted_resp = np.where(reg_outputs > 0.5, 1, 0)
resp_comp = resp_comp.assign(Regression_Predictions = reg_outputs)
resp_comp = resp_comp.assign(Predicted_Responses = predicted_resp)
resp_comp

## Calculate Model Accuracy

Use the [accuracy_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) function to calculate the accuracy score of the model.  Save the accuracy score as `acc_score`.
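As a quick illustration of what `accuracy_score()` computes, the sketch below compares two short made-up label lists; `toy_true`, `toy_pred`, and `toy_acc` are assumed names and are not the assignment's variables.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels, made up for illustration.
toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 0]

# Accuracy is the fraction of positions where the two lists agree: 4 of 5 here.
toy_acc = accuracy_score(toy_true, toy_pred)

print(toy_acc)  # 0.8
```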

In [None]:
# Import the accuracy_score function from sklearn library
from sklearn.metrics import accuracy_score
acc_score = accuracy_score(resp_comp['Class_pos'], resp_comp['Predicted_Responses'])
print("accuracy score: ", acc_score)