<a href="https://colab.research.google.com/github/cagBRT/Machine-Learning/blob/master/LogisticRegression_NB2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Predict if a glass type is household or window based on the amount of aluminum the glass contains.

In [None]:
# glass identification dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sns

In [None]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
col_names = ['id','ri','na','mg','al','si','k','ca','ba','fe','glass_type']
glass = pd.read_csv(url, names=col_names, index_col='id')
glass.sort_values('al', inplace=True)
glass.head()

In [None]:
sns.set(style="white")

# sizes= and palette= only take effect together with size= and hue=, so they are omitted here
sns.relplot(x="glass_type", y="al", alpha=.5, height=6, data=glass)
plt.show()

The dataset contains 6 glass types (no type 4 samples appear in it):<br>
>types 1, 2, 3 are window glass<br>
types 5, 6, 7 are household glass<br>

In [None]:
# examine glass_type
glass.glass_type.value_counts().sort_index()

Create a binary household label with two categories of glass:<br>
>household = 0: window glass (types 1, 2, 3)<br>
household = 1: household glass (types 5, 6, 7)<br>

In [None]:
# types 1, 2, 3 are window glass
# types 5, 6, 7 are household glass
glass['household'] = glass.glass_type.map({1:0, 2:0, 3:0, 5:1, 6:1, 7:1})
glass.head()

In [None]:
sns.set(style="white")

# sizes= and palette= only take effect together with size= and hue=, so they are omitted here
sns.relplot(x="household", y="al", alpha=.5, height=6, data=glass)
plt.show()

**Create the Logistic Regression model**

In [None]:
# create the logistic regression model (it is fit in a later cell)
logreg = LogisticRegression()

The only feature we are using is aluminum ('al').

In [None]:
feature_cols = ['al']
X = glass[feature_cols]
y = glass.household

In [None]:
sns.set(style="white")

# sizes= and palette= only take effect together with size= and hue=, so they are omitted here
sns.relplot(x="al", y="household", alpha=.5, height=6, data=glass)
plt.show()

Show the first rows of the aluminum feature column

In [None]:
X.head()

Print the class labels (0 = window glass, 1 = household glass)

In [None]:
print(y)

Use the al column and the household column to train the model. <br>
Store the predicted classes in glass['household_pred_class'].

In [None]:
logreg.fit(X, y)
glass['household_pred_class'] = logreg.predict(X)
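
The fit-then-predict step can also be tried without downloading the glass data. The following is a minimal sketch on synthetic values, not the notebook's actual data: the arrays al and household, and the names X_demo, model, pred and train_accuracy, are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the 'al' feature and the 'household' label
# (assumed values, chosen only so the two classes overlap a little).
rng = np.random.default_rng(0)
al = np.concatenate([rng.normal(1.2, 0.3, 80), rng.normal(2.0, 0.4, 40)])
household = np.concatenate([np.zeros(80, dtype=int), np.ones(40, dtype=int)])

# scikit-learn expects a 2-D feature matrix, even for a single feature
X_demo = al.reshape(-1, 1)

model = LogisticRegression()
model.fit(X_demo, household)

pred = model.predict(X_demo)                      # array of 0/1 class labels
train_accuracy = model.score(X_demo, household)   # fraction predicted correctly
print(pred[:10], train_accuracy)
```

Because the predictions here are made on the same rows used for fitting, the accuracy is likely to be somewhat optimistic compared with held-out data.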

In [None]:
glass.head()

In [None]:
cnf_matrix = metrics.confusion_matrix(glass['household'],glass['household_pred_class'])
cnf_matrix
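
For 0/1 labels, scikit-learn's confusion matrix puts the true classes in rows and the predicted classes in columns, so flattening it yields the true negatives, false positives, false negatives and true positives in that order. The sketch below shows how accuracy, precision and recall follow from those four numbers. The counts are hypothetical and are not the output of the cell above.

```python
import numpy as np

# Hypothetical counts for illustration only (rows = true class, columns = predicted class)
cm = np.array([[150, 13],
               [20, 31]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all samples classified correctly
precision = tp / (tp + fp)        # of the samples predicted household, share that are household
recall = tp / (tp + fn)           # of the true household samples, share that were found

print(accuracy, precision, recall)
```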

In [None]:
sns.set(style="white")

# color each point by its predicted class
sns.relplot(x="glass_type", y="al", hue="household_pred_class",
            alpha=.5, palette="bright",
            height=6, data=glass)
plt.show()

# **Predicting Probabilities**

What if we wanted the predicted probabilities instead of just the class predictions, to understand how confident we are in a given prediction?

In [None]:
# store the predicted probabilities of class 1
glass['household_pred_prob'] = logreg.predict_proba(X)[:, 1]
glass.shape

In [None]:
glass.head()

In [None]:
# plot the predicted probabilities
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_prob, color='red')
plt.xlabel('al')
plt.ylabel('household')

In the array returned by predict_proba, the first column is the predicted probability of class 0 and the second column is the predicted probability of class 1. Each row sums to 1.

In [None]:
# X holds the single feature column 'al'
# each row of proba is [P(class 0), P(class 1)] for one sample
proba = logreg.predict_proba(X)
print(proba[1, :])
print(proba[2, :])
print(proba[213, :])
print(proba[200, :])
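
To make the link between the two columns and the class prediction concrete, here is a minimal NumPy sketch of what a binary logistic regression computes. The intercept and coefficient are illustrative values, not read from the fitted model; after fitting, the real ones are available as logreg.intercept_ and logreg.coef_.

```python
import numpy as np

# Illustrative values only (assumptions, not the fitted parameters)
intercept, coef = -7.7, 4.18

al = np.array([0.5, 1.5, 2.0, 3.0])

# The sigmoid of the linear score gives P(household = 1)
p1 = 1.0 / (1.0 + np.exp(-(intercept + coef * al)))

# Same layout as predict_proba: column 0 is P(class 0), column 1 is P(class 1)
proba = np.column_stack([1.0 - p1, p1])

# The predicted class is 1 when P(class 1) exceeds 0.5
pred_class = (p1 > 0.5).astype(int)

print(proba)
print(pred_class)
```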

In [None]:
feature_cols, logreg.coef_[0]

In [None]:
logreg.predict_proba(X)[:, 1]

In [None]:
plt.plot(X,logreg.predict_proba(X)[:, 1])
plt.show()

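Interpretation: A 1 unit increase in 'al' is associated with a 4.18 unit increase in the log-odds of 'household'. Here 4.18 is the coefficient value quoted for this example; the coefficient printed above depends on the regularization setting of LogisticRegression and may differ.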

In [None]:
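# increasing al by 1 (from al=2 to al=3) increases the log-odds by the coefficient, 4.18
# 0.64689603 is the log-odds at al=2 used in this example
logodds = 0.64689603 + 4.1804038614510901
odds = np.exp(logodds)       # convert log-odds to odds
prob = odds/(1 + odds)       # convert odds to a probability
prob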
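
The same arithmetic can be read as an odds ratio: adding the coefficient to the log-odds multiplies the odds by exp(coefficient). The sketch below reuses the two numbers from the cell above and assumes, as the comment there implies, that 0.64689603 is the log-odds at al=2. The variable names are introduced here for illustration.

```python
import math

coef = 4.1804038614510901      # coefficient quoted in the example above
logodds_before = 0.64689603    # assumed to be the log-odds at al=2
logodds_after = logodds_before + coef

odds_before = math.exp(logodds_before)
odds_after = math.exp(logodds_after)

# A one-unit increase in 'al' multiplies the odds by exp(coef)
odds_ratio = odds_after / odds_before

prob_before = odds_before / (1 + odds_before)
prob_after = odds_after / (1 + odds_after)

print(odds_ratio, math.exp(coef))
print(prob_before, prob_after)
```

The odds change by a constant factor per unit of 'al', while the change in probability depends on where on the curve the starting point lies.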

**Logistic Regression and Categorical Features**

Create a binary feature, high_ba, that is 1 when the barium content ('ba') is greater than 0.5 and 0 otherwise.

In [None]:
glass['ba']

In [None]:
glass['high_ba'] = np.where(glass.ba > 0.5, 1, 0)
glass['high_ba']

In [None]:
# original (continuous) feature
fig = sns.lmplot(x='ba', y='household', data=glass, ci=None, logistic=True)
fig.set(xlim=(-0.5,3))
plt.show()

In [None]:
# categorical feature
sns.lmplot(x='high_ba', y='household', data=glass, ci=None, logistic=True)

In [None]:
# categorical feature, with jitter added
sns.lmplot(x='high_ba', y='household', data=glass, ci=None, logistic=True, x_jitter=0.05, y_jitter=0.05)

In [None]:
# fit a logistic regression model
feature_cols = ['high_ba']
X = glass[feature_cols]
y = glass.household
logreg.fit(X, y)


In [None]:
# examine the coefficient for high_ba
feature_cols, logreg.coef_[0]
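
For a single 0/1 feature, the coefficient has a direct reading: it is the difference in log-odds between the two groups. The sketch below computes that quantity from a 2x2 table of counts. The counts are hypothetical, not taken from the glass data, and the names counts and odds are introduced for illustration. With no regularization the fitted coefficient would equal this log odds ratio; scikit-learn's default L2 penalty shrinks the coefficient toward zero, so the value printed above will generally be smaller in magnitude.

```python
import math

# Hypothetical counts, for illustration only.
# Outer key: high_ba (0 or 1); inner key: household (0 or 1)
counts = {
    0: {0: 160, 1: 25},
    1: {0: 3, 1: 26},
}

def odds(group):
    """Odds of household = 1 within one high_ba group."""
    return counts[group][1] / counts[group][0]

odds_ratio = odds(1) / odds(0)          # how much higher the odds are when high_ba = 1
log_odds_ratio = math.log(odds_ratio)   # comparable to an unregularized coefficient

print(odds_ratio, log_odds_ratio)
```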