[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/<your-org>/<your-repo-name>/blob/<your-branch-name>/<path-to-your-notebook>/<your-notebook.ipynb>)

# Your First Machine Learning Model: Predicting Eaten By Bears!

Important! Before running this notebook, open a terminal window and run the following command:

`pip install names && python ./bearStatsGenerator.py --rows 32000 --outfile ./EatenByBearData.csv`

The above command will generate some synthetic data for us to use in our model training.

If you are running this notebook, you must be worried about BEAR ATTACKS!  In this notebook, we will use the power of machine learning to build a model to detect if we are about to be eaten by bears!

First, we will add a Python package that lets us visualize our data before training our model:

In [None]:
%pip install seaborn

Next, we import the Python packages we need into our runtime environment.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from matplotlib import pyplot as plt
import seaborn as sns

Next, we load our data file into a pandas DataFrame. This allows us to work with the data in memory. We then display summary information and the first few rows to see what we are working with.

In [None]:
df = pd.read_csv('EatenByBearData.csv')
df.info()
df.head()

We also check for duplicate rows and, if any are present, drop them from our data frame.

In [3]:
# Print the count explicitly; a notebook cell only displays its last expression
print(df.duplicated().sum())
df.drop_duplicates(inplace=True)
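
As a small illustration of what the cell above does, the sketch below uses a made-up four-row frame (the rows are hypothetical; only the column names are borrowed from this notebook). `duplicated()` flags each repeat of an earlier row, so its sum is the number of rows that `drop_duplicates()` would remove.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first
toy = pd.DataFrame({
    "hasBearSpray": [True, False, True, True],
    "eatenByBear": [False, True, False, True],
})

# duplicated() marks repeats only; the first occurrence is not flagged
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() without inplace=True returns a new frame
deduped = toy.drop_duplicates()

print(n_duplicates, len(deduped))
```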

Some of our data is categorical, which is to say a column (or feature) whose values come from a discrete list. The `colorsWorn` column packs several such values into a single comma-separated string, so we split it into separate columns, one per color.

In [4]:
df[['colorOne', 'colorTwo', 'colorThree']] = df['colorsWorn'].str.split(',', expand=True)
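
To see what `str.split(',', expand=True)` produces, here is a minimal sketch on two made-up rows. It assumes, as the cell above does, that every row holds exactly three colors separated by commas with no spaces; the color values themselves are hypothetical.

```python
import pandas as pd

# Hypothetical rows in the same comma-separated format as colorsWorn
toy = pd.DataFrame({"colorsWorn": ["red,green,blue", "black,white,red"]})

# expand=True returns one column per piece instead of a column of lists
parts = toy["colorsWorn"].str.split(",", expand=True)
parts.columns = ["colorOne", "colorTwo", "colorThree"]

print(parts)
```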

Now that we have split the colors worn into their own columns, we can prepare those string values for numeric encoding by converting each column to the pandas `category` type. Our chosen machine learning model, Logistic Regression, requires numeric inputs.

In [5]:
df[['colorOne', 'colorTwo', 'colorThree']] = df[['colorOne', 'colorTwo', 'colorThree']].apply(lambda x: x.astype('category'))

Next, we do a similar transformation on our boolean values, changing True/False to the more computation-friendly 1/0.

In [6]:
df.combatTraining = df.combatTraining.replace({True: 1, False: 0})
df.hasBearSpray = df.hasBearSpray.replace({True: 1, False: 0})
df.eatenByBear = df.eatenByBear.replace({True: 1, False: 0})
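
An equivalent way to do this conversion, assuming the columns were read in as genuine booleans rather than strings, is `astype(int)`. The frame below is a made-up example, not the notebook's data.

```python
import pandas as pd

# Hypothetical boolean column with the same name as one in the notebook
toy = pd.DataFrame({"hasBearSpray": [True, False, True]})

# Casting bool to int maps True -> 1 and False -> 0
toy["hasBearSpray"] = toy["hasBearSpray"].astype(int)

print(toy["hasBearSpray"].tolist())
```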

Checking through our data, there are several columns that don't seem to add much value to our model, so we drop them before training.

In [None]:
df.drop(['name', 'birthDate', 'colorsWorn'], axis=1, inplace=True, errors='ignore')

For our colors, we can use the pandas categorical accessor (`.cat.codes`) to encode those categories as numerical values.

In [8]:
df["colorOne"] = df["colorOne"].cat.codes
df["colorTwo"] = df["colorTwo"].cat.codes
df["colorThree"] = df["colorThree"].cat.codes
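
The sketch below shows how `.cat.codes` assigns integers, using a made-up series of colors. For string data converted with `astype('category')`, pandas sorts the categories, so the codes follow alphabetical order. Because each column is encoded on its own, the same color may receive different codes in different columns if the columns do not contain the same set of values.

```python
import pandas as pd

# Hypothetical color values
colors = pd.Series(["red", "blue", "green", "blue"]).astype("category")

# Each value is replaced by the position of its category
codes = colors.cat.codes

# Recover the code -> label mapping for later interpretation
mapping = dict(enumerate(colors.cat.categories))

print(codes.tolist(), mapping)
```

Note that integer codes imply an ordering between colors that has no real meaning; one-hot encoding (for example with `pd.get_dummies`) is a common alternative when that matters.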

Let's check our data again, now that we have done our feature engineering, to see what we are working with.

In [None]:
df.head()

Next, we identify which columns in our tabular data set are the X values, or features, that we will use for training, and which is our y value, the one we are trying to predict.

In [10]:
X = df.iloc[:,df.columns != 'eatenByBear']
y = df.eatenByBear

Next, we split both our X and y data into training and test sets. Training data is used by our model to learn what we want to predict. Test data is never seen during training; it simulates what live prediction requests would look like.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=5, stratify=y)
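
The `stratify=y` argument keeps the proportion of each class the same in both splits. A minimal sketch with made-up labels (20% positive) shows the effect; the array contents are hypothetical and stand in for the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 20 positives and 80 negatives
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=5, stratify=y_toy)

# Both splits keep the original 20% positive rate
print(y_tr.mean(), y_te.mean())
```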

In [12]:
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
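
The cell above standardizes each feature to zero mean and unit variance, using statistics learned from the training data only. The sketch below, on made-up numbers, illustrates the usual pattern: fit the scaler on the training set, then reuse the same fitted scaler on any data passed to the model later, including the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Learn mean and standard deviation from the training data only
toy_scaler = StandardScaler().fit(train)

train_scaled = toy_scaler.transform(train)
# The test point is scaled with the training statistics, not its own
test_scaled = toy_scaler.transform(test)

print(train_scaled.ravel(), test_scaled.ravel())
```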

Here we plot our dataset a few different ways to see how each feature is distributed and how the different columns/features relate to each other.

In [None]:
import matplotlib.colors as mcolors
colors = list(mcolors.CSS4_COLORS.keys())[10:]
def draw_histograms(dataframe, features, rows, cols):
    fig=plt.figure(figsize=(20,20))
    for i, feature in enumerate(features):
        ax=fig.add_subplot(rows,cols,i+1)
        dataframe[feature].hist(bins=20,ax=ax,facecolor=colors[i])
        ax.set_title(feature+" Histogram",color=colors[35])
        ax.set_yscale('log')
    fig.tight_layout() 
    plt.savefig('Histograms.png')
    plt.show()
draw_histograms(df,df.columns,8,4)

In [None]:
plt.figure(figsize = (38,16))
sns.heatmap(df.corr(), annot = True)
plt.savefig('heatmap.png')
plt.show()

In [15]:
model = LogisticRegression()

Above, we choose the type of model we want to use to solve our problem: logistic regression, a common algorithm for binary classification. Below, we call the `.fit` function to start training our model.

In [None]:
model.fit(X_train_scaled, y_train)
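
To make concrete what a fitted binary logistic regression computes, the sketch below trains one on made-up data and rebuilds its predicted probabilities by hand: a weighted sum of the features passed through the sigmoid function. The data and the name `toy_model` are hypothetical, and the comparison assumes scikit-learn's default settings for a two-class problem.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical two-feature data with a simple linear rule for the label
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Linear score: w . x + b
z = X_toy @ toy_model.coef_.ravel() + toy_model.intercept_[0]
# Sigmoid squashes the score into a probability between 0 and 1
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
sklearn_proba = toy_model.predict_proba(X_toy)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```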

Now, with an (albeit quickly) trained model, we check how accurate it is, first on the training data and then on the held-out test data. Since this is synthetic random data, we do not expect accuracy to be much better than chance (~50%).

In [None]:
train_acc = model.score(X_train_scaled, y_train)
print("The Accuracy for Training Set is {}".format(train_acc*100))

In [None]:
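# Scale the test features with the scaler fitted on the training data,
# so they match what the model saw during training
y_pred = model.predict(scaler.transform(X_test))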
test_acc = accuracy_score(y_test, y_pred)
print("The Accuracy for Test Set is {}".format(test_acc*100))
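
An accuracy figure is easier to judge next to a trivial baseline. One hedged way to get such a baseline is scikit-learn's `DummyClassifier`, which here always predicts the most frequent class. The labels below are made up (70% negative) and are not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 70 negatives, 30 positives
y_true_toy = np.array([0] * 70 + [1] * 30)
# The dummy model ignores features, so a placeholder column is enough
X_placeholder = np.zeros((100, 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true_toy)
baseline_acc = accuracy_score(y_true_toy, baseline.predict(X_placeholder))

print("Baseline accuracy is {}".format(baseline_acc * 100))
```

A trained model that scores no higher than this baseline has not learned anything useful from the features.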

Lastly, we check our classification report and confusion matrix to see how the model's predictions on our test data turned out: how many did we get right, and how many did we get wrong?

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
cm=confusion_matrix(y_test,y_pred)
plt.figure(figsize=(12,6))
plt.title("Confusion Matrix")
sns.heatmap(cm, annot=True,fmt='d', cmap='Blues')
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.savefig('confusion_matrix.png')
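
The numbers in the classification report can be derived from the confusion matrix. The sketch below uses made-up labels and predictions to show how precision and recall for the positive class follow from the four cells. In scikit-learn's layout, rows are actual values and columns are predicted values, matching the axis labels in the plot above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical actual labels and predictions
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
# For binary labels, ravel() yields the cells in this order
tn, fp, fn, tp = cm_toy.ravel()

# Precision: of the rows predicted positive, how many were truly positive
precision = tp / (tp + fp)
# Recall: of the truly positive rows, how many were predicted positive
recall = tp / (tp + fn)

print(cm_toy)
print(precision, recall)
```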