# Exploring the Mechanics of Machine Learning

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

## Part 1: Import Training Dataset

A dummy dataset is created below as a learning tool. The training dataset has only a "messy" outlet name and a "clean" outlet name. The only "features" we can use from this dataset are the substrings obtained by splitting the "messy" outlet name on spaces.

Importing the array into a Pandas DataFrame helps with visualizing the dataset.

In [18]:
train_outlet = [["Bet Buys", "Best Buy"], ["Best Buys", "Best Buy"], ["BetsBuys","Best Buy"], 
                ["Walmert", "Wal-Mart"], ["Walmart", "Wal-Mart"], ["Wal mrt", "Wal-Mart"], ["Wall mrt", "Wal-Mart"],
                ["Tar gat", "Target"], ["Targ get", "Target"]]

df_train_outlet = pd.DataFrame(data=train_outlet, columns = ['messy', 'clean'])
df_train_outlet

| | messy | clean |
|---|---|---|
| 0 | Bet Buys | Best Buy |
| 1 | Best Buys | Best Buy |
| 2 | BetsBuys | Best Buy |
| 3 | Walmert | Wal-Mart |
| 4 | Walmart | Wal-Mart |
| 5 | Wal mrt | Wal-Mart |
| 6 | Wall mrt | Wal-Mart |
| 7 | Tar gat | Target |
| 8 | Targ get | Target |


## Part 2: Building a Model from Training Set

### Building the Features

I used <b>CountVectorizer()</b>. With its default settings it lowercases the "messy" outlet name, splits it on spaces and punctuation, and treats each resulting token of two or more characters as a "feature". This is only <b>one</b> way to extract "features" from text data. Another is <b>TfidfTransformer</b>, which reweights the counts that CountVectorizer produces, so the two are not mutually exclusive.

The <b>fit()</b> function learns the vocabulary (the list of features) from the training strings, and the <b>transform()</b> function turns each string into a row of a sparse matrix of feature counts.

In [19]:
outlet_vector = CountVectorizer()
outlet_train_fit = outlet_vector.fit(df_train_outlet['messy'])
outlet_train_tf = outlet_train_fit.transform(df_train_outlet['messy'])
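
To see what the tokenizer does to a single string, the sketch below calls `build_analyzer()`, which returns the function `CountVectorizer` applies to every input. It assumes the default settings used in this notebook; the sample strings are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
analyze = vectorizer.build_analyzer()  # preprocessing + tokenizing used by fit/transform

tokens_messy = analyze("Wal mrt")      # split on the space and lowercased
tokens_clean = analyze("Wal-Mart")     # the hyphen is not a word character, so it also splits
tokens_short = analyze("A Best Buy")   # one-character tokens are dropped by default

print(tokens_messy)
print(tokens_clean)
print(tokens_short)

# fit() returns the vectorizer itself, so both names point at one object
fitted = vectorizer.fit(["Bet Buys", "Best Buys"])
print(fitted is vectorizer)
```

Because `fit()` returns the same object, calling `fit_transform()` on the training strings should give the same matrix as the separate `fit()` and `transform()` calls.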

To build intuition for what <b>CountVectorizer()</b> does and what <b>transform()</b> returns:
1. Print the feature names
2. Print the shape of the transform matrix
3. Print the transform matrix elements

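The feature names are the variables. Each variable holds the number of times that feature appears in the "messy" outlet name, which in this dataset is always 1 or 0. <i>Both outlet_vector and outlet_train_fit report feature names because fit() returns the vectorizer itself, so the two names refer to the same object; outlet_train_tf is a sparse matrix of counts and carries no feature names.</i>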

In [20]:
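# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
print(outlet_vector.get_feature_names())
print(outlet_train_fit.get_feature_names())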
print(outlet_train_tf.shape)
print(outlet_train_tf.toarray())

['best', 'bet', 'betsbuys', 'buys', 'gat', 'get', 'mrt', 'tar', 'targ', 'wal', 'wall', 'walmart', 'walmert']
['best', 'bet', 'betsbuys', 'buys', 'gat', 'get', 'mrt', 'tar', 'targ', 'wal', 'wall', 'walmart', 'walmert']
(9, 13)
[[0 1 0 1 0 0 0 0 0 0 0 0 0]
 [1 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 1 0 0 1 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 1 0 0]
 [0 0 0 0 1 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 1 0 0 0 0]]
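
The raw matrix is easier to read with row and column labels. The following is a minimal sketch that rebuilds the same counts and wraps them in a DataFrame; the names `columns` and `labeled` are introduced here for illustration. The column names are recovered from `vocabulary_` rather than from a feature-name method, so the snippet should not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

messy = ["Bet Buys", "Best Buys", "BetsBuys", "Walmert", "Walmart",
         "Wal mrt", "Wall mrt", "Tar gat", "Targ get"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messy)  # fit and transform in one call

# vocabulary_ maps token -> column index; sorting by index recovers the column order
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# One row per messy name, one column per feature
labeled = pd.DataFrame(counts.toarray(), index=messy, columns=columns)
print(labeled)
```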


### Building the Model

Three logistic regressions are stored in <b>model</b>, one per category (other models are available, such as MultinomialNB, a naive Bayes classifier). Each of the three models calculates the probability that the input belongs to its category rather than to either of the others; the categories here are "Best Buy", "Wal-Mart", and "Target".

The values for our regressors are stored in <b>outlet_train_tf</b> and we are trying to fit them to <b>df_train_outlet['clean']</b>, which is the "clean" outlet name from our training dataset. The results will be coefficients for each of our feature variables.

$$\operatorname{Pr}(\text{Outlet Name}) = \frac{\exp(\alpha + \beta_{1}\mathrm{best}_{i}+ \beta_{2}\mathrm{bet}_{i}+ \beta_{3}\mathrm{betsbuys}_{i} + \beta_{4}\mathrm{buys}_{i} + \dots + \beta_{13}\mathrm{walmert}_{i})}{1+\exp(\alpha + \beta_{1}\mathrm{best}_{i}+ \beta_{2}\mathrm{bet}_{i}+ \beta_{3}\mathrm{betsbuys}_{i} + \beta_{4}\mathrm{buys}_{i} + \dots + \beta_{13}\mathrm{walmert}_{i})}$$

In [21]:
model = LogisticRegression()
model.fit(outlet_train_tf, df_train_outlet['clean'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
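
The notebook imports `MultinomialNB` but never fits it. As a rough sketch of how that alternative could be swapped in, the block below trains it on the same count matrix. The variable names and the three test strings are chosen for illustration, and no claim is made here about which model predicts better.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train = [["Bet Buys", "Best Buy"], ["Best Buys", "Best Buy"], ["BetsBuys", "Best Buy"],
         ["Walmert", "Wal-Mart"], ["Walmart", "Wal-Mart"], ["Wal mrt", "Wal-Mart"],
         ["Wall mrt", "Wal-Mart"], ["Tar gat", "Target"], ["Targ get", "Target"]]
messy = [row[0] for row in train]
clean = [row[1] for row in train]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messy)

# Same fit/predict interface as LogisticRegression
nb_model = MultinomialNB().fit(X, clean)
nb_pred = nb_model.predict(vectorizer.transform(["Better Buys", "Wall Cart", "Tar rat"]))
print(list(nb_pred))
```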

For intuition, print the coefficients and the intercepts.

In [22]:
print(model.coef_, model.intercept_)

[[ 0.35293808  0.35293808  0.49053479  0.70587616 -0.27027493 -0.27027493
  -0.47529205 -0.27027493 -0.27027493 -0.23764602 -0.23764602 -0.31662451
  -0.31662451]
 [-0.20653576 -0.20653576 -0.26971548 -0.41307152  0.45447395  0.45447395
  -0.41307152  0.45447395  0.45447395 -0.20653576 -0.20653576 -0.26971548
  -0.26971548]
 [-0.27878693 -0.27878693 -0.37912528 -0.55757385 -0.31994963 -0.31994963
   0.61588609 -0.31994963 -0.31994963  0.30794304  0.30794304  0.42330551
   0.42330551]] [-0.45267998 -0.72634159 -0.11410127]
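
To connect these numbers to the formula, the sketch below applies the printed coefficients by hand with NumPy. It assumes the rows follow scikit-learn's sorted class order ("Best Buy", "Target", "Wal-Mart") and that the model is the one-vs-rest fit shown in the output above, where each class gets its own sigmoid and the three results are rescaled to sum to 1. The helper `ovr_proba` is a name made up for this example.

```python
import numpy as np

features = ['best', 'bet', 'betsbuys', 'buys', 'gat', 'get', 'mrt',
            'tar', 'targ', 'wal', 'wall', 'walmart', 'walmert']
classes = ['Best Buy', 'Target', 'Wal-Mart']  # assumed row order (sorted labels)

# Coefficients and intercepts copied from the printed output above
coef = np.array([
    [0.35293808, 0.35293808, 0.49053479, 0.70587616, -0.27027493, -0.27027493,
     -0.47529205, -0.27027493, -0.27027493, -0.23764602, -0.23764602,
     -0.31662451, -0.31662451],
    [-0.20653576, -0.20653576, -0.26971548, -0.41307152, 0.45447395, 0.45447395,
     -0.41307152, 0.45447395, 0.45447395, -0.20653576, -0.20653576,
     -0.26971548, -0.26971548],
    [-0.27878693, -0.27878693, -0.37912528, -0.55757385, -0.31994963, -0.31994963,
     0.61588609, -0.31994963, -0.31994963, 0.30794304, 0.30794304,
     0.42330551, 0.42330551],
])
intercept = np.array([-0.45267998, -0.72634159, -0.11410127])

def ovr_proba(x):
    """One-vs-rest probabilities: a sigmoid per class, rescaled to sum to 1."""
    scores = coef @ x + intercept             # alpha + sum of beta_j * x_j, per class
    per_class = 1.0 / (1.0 + np.exp(-scores)) # same as exp(s) / (1 + exp(s))
    return per_class / per_class.sum()

x = np.zeros(len(features))
x[features.index('buys')] = 1  # a name whose only known token is "buys"

proba = ovr_proba(x)
for name, p in zip(classes, proba):
    print(f"{name}: {p:.4f}")
# approximately: Best Buy 0.4923, Target 0.2120, Wal-Mart 0.2957
```

The large positive coefficient on `buys` in the first row is what pushes this input toward "Best Buy".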


## Part 3: Predicting New Outlet Names

Create an array with the outlet names we want to fix. We are only doing this for learning purposes, so I am only using names close to "Best Buy", "Wal-Mart", and "Target".

In [23]:
new_outlet = ["Better Buys", "Wall Cart", "Tar rat"]
outlet_new = outlet_vector.transform(new_outlet)
outlet_new.toarray()

array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]], dtype=int64)
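
Each row above has a single 1 because `transform()` silently ignores tokens that were not in the training vocabulary. The sketch below makes that visible by splitting each new name into known and ignored tokens; the `report` dictionary is a helper introduced only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

messy = ["Bet Buys", "Best Buys", "BetsBuys", "Walmert", "Walmart",
         "Wal mrt", "Wall mrt", "Tar gat", "Targ get"]
new_outlet = ["Better Buys", "Wall Cart", "Tar rat"]

vectorizer = CountVectorizer().fit(messy)
analyze = vectorizer.build_analyzer()

report = {}
for name in new_outlet:
    tokens = analyze(name)
    # Tokens in the learned vocabulary become features; the rest are dropped
    known = [t for t in tokens if t in vectorizer.vocabulary_]
    ignored = [t for t in tokens if t not in vectorizer.vocabulary_]
    report[name] = (known, ignored)
    print(f"{name!r}: known={known}, ignored={ignored}")
```

So the model only ever sees "buys", "wall", and "tar" for these three inputs, which is why the predicted probabilities further down are not very confident.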

Running the model on each new outlet returns the suggested outlet name with the highest probability. The probability for each outlet choice can be printed as well; the values are listed in the order given by <b>model.classes_</b>, which here is "Best Buy", "Target", "Wal-Mart".

In [24]:
predicted = model.predict(outlet_new)
prob = model.predict_proba(outlet_new)

In [25]:
for p, i in zip(predicted, prob):
    print(p,i)

Best Buy [ 0.49231012  0.21200295  0.29568693]
Wal-Mart [ 0.28675713  0.24243399  0.47080888]
Target [ 0.28354443  0.37527521  0.34118036]
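
The bare arrays do not say which probability belongs to which outlet. A minimal sketch of one way to label them uses `model.classes_` as column names. Note that it refits the model from scratch, and newer scikit-learn versions use different `LogisticRegression` defaults than the ones shown earlier, so the exact probabilities may differ from the output above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train = [["Bet Buys", "Best Buy"], ["Best Buys", "Best Buy"], ["BetsBuys", "Best Buy"],
         ["Walmert", "Wal-Mart"], ["Walmart", "Wal-Mart"], ["Wal mrt", "Wal-Mart"],
         ["Wall mrt", "Wal-Mart"], ["Tar gat", "Target"], ["Targ get", "Target"]]
messy = [row[0] for row in train]
clean = [row[1] for row in train]
new_outlet = ["Better Buys", "Wall Cart", "Tar rat"]

vectorizer = CountVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(messy), clean)

X_new = vectorizer.transform(new_outlet)
proba = model.predict_proba(X_new)

# Columns of predict_proba follow model.classes_ (labels in sorted order)
table = pd.DataFrame(proba, index=new_outlet, columns=model.classes_)
print(table.round(3))

# The predicted label is the class with the largest probability in each row
best = model.classes_[proba.argmax(axis=1)]
print(list(best))
```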
