# Tutorial 3: Classification



In the previous two tutorials we explored how to manipulate csv datasets, first as a <i>dictionary</i> then as a <i>list</i>. In this tutorial, we will step it up a notch by briefly exploring a third data structure called a <i>data frame</i> by leveraging the <i>pandas</i> Python library. We will also explore how to use the <i>scikit-learn</i> library to do basic classification. Our task is to build a model that can predict the country where a purchase is being made from. In reality, this is a pretty low value task for this dataset, but we will use it to demonstrate how classification works. Classification is one of the essential elements of machine learning, and is the backbone of many artificial intelligence technologies.

We will start by importing a new library: pandas. As before, we begin by loading the csv data into our Python environment, but this time we will use pandas to do it. You can learn more about pandas here: https://pandas.pydata.org/

In [1]:
import pandas as pd #import pandas as an object pd

ec = pd.read_csv('data.csv', encoding="ISO-8859-1", dtype={'CustomerID': str,'InvoiceNo': str}) #we will name our dataframe ec
ec.head() #see the first five entries

|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |


Data frames look nice in Jupyter! The pandas dataframe is a two-dimensional data structure which makes it much easier to process and manage data. Not only does it look nice, it also comes with a number of built in features designed to make our life easy. For instance, we can easily observe the shape of our data.

In [2]:
ec.shape

(541909, 8)

This tells us that there are 541909 rows and 8 columns. We are not just limited to descriptions however. If we want to quickly process our data to drop null values, we can do that as well.

In [3]:
ec.dropna(inplace = True) #drop the null values
ec.shape

(406829, 8)
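To see what `dropna` does on a small scale, here is a minimal sketch on a made-up three-row frame (the values are invented for illustration and are not taken from data.csv). It also shows `fillna`, which is one way to represent nulls rather than drop them.

```python
import pandas as pd

# Toy frame, invented for illustration; None marks a missing value.
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "HOT WATER BOTTLE"],
    "CustomerID": ["17850", "17850", None],
})

dropped = toy.dropna()          # keep only rows with no nulls in any column
filled = toy.fillna("UNKNOWN")  # alternatively, replace nulls with a placeholder

print(dropped.shape)  # (1, 2)
print(filled.shape)   # (3, 2)
```

Without `inplace=True`, both calls return a new frame and leave `toy` untouched.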

It is often good to eliminate or represent null values in your dataset. In this case, we opted to drop any rows that had a null in any of the columns. We can also perform more advanced operations like counting values in the dataset. For instance, we can observe the breakdown of the countries represented in the data.

In [4]:
print(ec['Country'].value_counts())

United Kingdom          361878
Germany                   9495
France                    8491
EIRE                      7485
Spain                     2533
Netherlands               2371
Belgium                   2069
Switzerland               1877
Portugal                  1480
Australia                 1259
Norway                    1086
Italy                      803
Channel Islands            758
Finland                    695
Cyprus                     622
Sweden                     462
Austria                    401
Denmark                    389
Japan                      358
Poland                     341
USA                        291
Israel                     250
Unspecified                244
Singapore                  229
Iceland                    182
Canada                     151
Greece                     146
Malta                      127
United Arab Emirates        68
European Community          61
RSA                         58
Lebanon                     45
Lithuani

Our data is rather imbalanced, with the majority of the orders from the United Kingdom. This may create problems for us later, if we are going to build a classification algorithm. Finally, we can also observe our data types. 

In [5]:
ec.dtypes

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID      object
Country         object
dtype: object

It would seem that only two of our columns are numerical: Quantity and UnitPrice. Classification algorithms (like most machine learning algorithms) operate on values that are represented numerically. Though there may be good information contained in the descriptions, we would have to add additional analysis to process the text data. For the purposes of this tutorial, we will only focus on classification; the proper processing of textual data is a live research question in natural language processing!
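If you would rather not read the numeric columns off the `dtypes` listing by eye, pandas can pick them out for you. A minimal sketch on an invented toy frame that mirrors three of our columns:

```python
import pandas as pd

# Toy frame mirroring three of the tutorial's columns; values are illustrative.
toy = pd.DataFrame({
    "Quantity": [6, 8],
    "UnitPrice": [2.55, 2.75],
    "Country": ["United Kingdom", "United Kingdom"],
})

# select_dtypes keeps only columns whose dtype matches; "number" covers ints and floats.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['Quantity', 'UnitPrice']
```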

We can cut our dataframe down easily by taking a subset. Let's take a subset with just the numerical values and the country values, and see if we can build a predictive model on those.

In [6]:
new_ec = ec[['Quantity', 'UnitPrice', 'Country']]

In [7]:
new_ec.head()

|   | Quantity | UnitPrice | Country |
|---|---|---|---|
| 0 | 6 | 2.55 | United Kingdom |
| 1 | 6 | 3.39 | United Kingdom |
| 2 | 8 | 2.75 | United Kingdom |
| 3 | 6 | 3.39 | United Kingdom |
| 4 | 6 | 3.39 | United Kingdom |


## Logistic Regression: Round 1
Let's start by exploring a basic predictive algorithm. Logistic regression is a regression model that estimates the probability of each class from the input values. If you are familiar with statistics or economics, you probably already understand how it works; we are just using the logistic regression implementation built into scikit-learn. We will use this to start our classification analysis. Let's begin by importing the model from scikit-learn. We will save the model in the variable clf, as per scikit-learn's conventions.

In [8]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial') #the classification model
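For readers who have not met logistic regression before, here is a rough sketch of the idea in the two-class case: the model computes a weighted sum of the inputs and squashes it through the logistic (sigmoid) function to get a probability. The function names, weights and bias below are made up for illustration; in practice the `fit` step learns the weights from the training data, and scikit-learn's own implementation is more involved.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(quantity, unit_price, weights=(0.05, -0.1), bias=0.2):
    """Probability of the positive class for one order (hypothetical weights)."""
    z = weights[0] * quantity + weights[1] * unit_price + bias
    return sigmoid(z)

p = predict_proba(quantity=6, unit_price=2.55)
label = "positive" if p >= 0.5 else "negative"
print(round(p, 3), label)
```

The 0.5 cut-off is the usual default: anything the model considers more likely than not is assigned to the positive class.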

The classification models we will use here belong in the category of <i>supervised learning</i>. Supervised learning algorithms need both data and labels to learn from. In our case, we will need to take part of our data as a <i>training set</i>. Though there are many ways that you can do this, one simple method is to take the majority of the dataset for training, and save a minority for testing. The author of this tutorial is lazy, so he will just take the first 300 000 rows, which is roughly 3/4 of this dataset. We will use the last 1/4 for testing.

In [9]:
train = new_ec[:300000] #take the first 300 000 and save it in train
test = new_ec[300000:] #take what remains and save it in test

print("Train: " + str(len(train)) + " Test: " + str(len(test)))

Train: 300000 Test: 106829
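Taking the first 300 000 rows is simple, but it assumes the rows are in no particular order. A common alternative is scikit-learn's `train_test_split`, which shuffles before splitting and can keep the class proportions similar in both parts via `stratify`. The sketch below runs on an invented toy frame and is not the approach used in the rest of this tutorial:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 8 UK rows and 4 German rows.
toy = pd.DataFrame({
    "Quantity": list(range(12)),
    "Country": ["United Kingdom"] * 8 + ["Germany"] * 4,
})

# Hold out 25% for testing; stratify keeps the 2:1 class ratio in both parts.
train_toy, test_toy = train_test_split(
    toy, test_size=0.25, random_state=0, stratify=toy["Country"]
)

print(len(train_toy), len(test_toy))  # 9 3
```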


The next step is to <i>fit</i> the model. We saved the model as the variable clf, so it's just a matter of fitting our training data to the model. Typically, you fit the model by specifying first the data that the model is assessing, followed by the labels. We will do this by telling it to observe the columns that are not 'Country' for the model, while using the 'Country' values as labels. <u>Note: this may take a minute or two on some computers.</u>

In [10]:
clf.fit(train.loc[:, train.columns != 'Country'], train['Country']) #the 'non-country' columns are inputs while 'country' is the label



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=None, penalty='l2', random_state=0, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

The model is trained! We can now tell the model to predict values based on inputs. Let's save the predictions as the preds variable. We will print some of the output to make sure that it is working.

In [11]:
preds = clf.predict(test.loc[:, test.columns != 'Country'])
print(preds[1])

United Kingdom


In machine learning research, there are multiple measures that you can use to determine whether an algorithm is good. One of the most common measures is the algorithm's <i>accuracy</i>, which can be defined as the ratio of correct predictions to the total number of predictions. We can import an accuracy_score function from scikit-learn to make this easy.

In [12]:
from sklearn.metrics import accuracy_score
accuracy_score(preds, test['Country']) #measure accuracy of preds versus the test values

0.9079463441574853

<b>90 percent accuracy! These are amazing results! <i>Or are they?</i></b>

One problem with accuracy measures is that there could be an underlying factor that drives high accuracy results. Earlier we noticed that the data was weighted heavily toward one country. Let's look closer at our predictions.

In [13]:
import collections
collections.Counter(preds)

Counter({'United Kingdom': 106774,
         'Germany': 7,
         'Netherlands': 13,
         'Singapore': 17,
         'Saudi Arabia': 10,
         'EIRE': 8})

<b>Our model seems to have simply classified most of the data as "United Kingdom"</b>. Let's see how close that was to reality.

In [14]:
collections.Counter(test['Country'])

Counter({'United Kingdom': 97036,
         'Germany': 1932,
         'EIRE': 1734,
         'Spain': 653,
         'Italy': 349,
         'France': 2104,
         'Netherlands': 552,
         'Austria': 131,
         'Singapore': 4,
         'Portugal': 249,
         'Cyprus': 132,
         'Belgium': 469,
         'Denmark': 100,
         'USA': 67,
         'Switzerland': 351,
         'Japan': 56,
         'Finland': 135,
         'Norway': 357,
         'Sweden': 98,
         'Iceland': 58,
         'Poland': 74,
         'Australia': 45,
         'Channel Islands': 81,
         'Malta': 23,
         'Czech Republic': 3,
         'Greece': 36})

Clearly there was more variance than the algorithm discovered... and a six year old could have come up with this solution. Of the 106 829 values in the test dataset, 97 036 are 'United Kingdom', so simply answering 'United Kingdom' every time would already be right about 90.8% of the time. Our 90% accuracy is an illusion: the results are no better than always guessing the majority. This is a very common issue with imbalanced datasets, as the algorithms used might detect a simple solution: select the majority. If we are going to have meaningful results, we should consider digging deeper and rebalancing the data.
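The majority-class baseline for this test set can be checked with a few lines of arithmetic, using the counts printed above. This is a back-of-the-envelope sketch rather than a scikit-learn call, and the variable names are our own:

```python
# Counts copied from the outputs above.
uk_in_test = 97036                   # test rows whose true label is 'United Kingdom'
test_size = 106829                   # total rows in the test set
model_accuracy = 0.9079463441574853  # accuracy_score reported in In [12]

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = uk_in_test / test_size

print(round(baseline_accuracy, 4))         # 0.9083
print(model_accuracy > baseline_accuracy)  # False
```

Comparing against a baseline like this is a quick way to tell whether a high accuracy figure reflects anything the model has actually learned.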

## Logistic Regression: Round 2
If we want to eventually develop some sort of predictive algorithm, we should consider balancing the dataset. One simple way for us to do that is to cut down the number of UK values to match the number of values from the other countries. We can observe the number of instances that were not 'United Kingdom' by using the value_counts function below.

In [15]:
(new_ec['Country'] != 'United Kingdom').value_counts() # how many times values other than 'United Kingdom' appear

False    361878
True      44951
Name: Country, dtype: int64

With this we can further process our data by dividing it between the "uk" subset and the "not uk" subset. We do this because we only want to cut down the rows that have the label "United Kingdom" and leave the rest alone. With pandas this is really easy; we just specify the subset conditions.

In [16]:
ec_uk = new_ec[new_ec['Country'] == 'United Kingdom']
ec_others = new_ec[new_ec['Country'] != 'United Kingdom']

With a separate data frame, we can use the sample function that is contained in the pandas dataframe class. We can thus take a random sample. 

In [17]:
ec_uk_under = ec_uk.sample(44951) # the number of values for not 'United Kingdom'
new_ec = pd.concat([ec_uk_under, ec_others]) #bring the disparate data together
collections.Counter(new_ec['Country']) #show the countries

Counter({'United Kingdom': 44951,
         'France': 8491,
         'Australia': 1259,
         'Netherlands': 2371,
         'Germany': 9495,
         'Norway': 1086,
         'EIRE': 7485,
         'Switzerland': 1877,
         'Spain': 2533,
         'Poland': 341,
         'Portugal': 1480,
         'Italy': 803,
         'Belgium': 2069,
         'Lithuania': 35,
         'Japan': 358,
         'Iceland': 182,
         'Channel Islands': 758,
         'Denmark': 389,
         'Cyprus': 622,
         'Sweden': 462,
         'Austria': 401,
         'Israel': 250,
         'Finland': 695,
         'Greece': 146,
         'Singapore': 229,
         'Lebanon': 45,
         'United Arab Emirates': 68,
         'Saudi Arabia': 10,
         'Czech Republic': 30,
         'Canada': 151,
         'Unspecified': 244,
         'Brazil': 32,
         'USA': 291,
         'European Community': 61,
         'Bahrain': 17,
         'Malta': 127,
         'RSA': 58})
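Note that `sample` draws a different random subset on each run, so the exact rows selected (and the accuracies further down) will vary from run to run. Here is a minimal sketch of the same undersampling step on an invented toy frame, with a fixed `random_state` so that the draw is repeatable; the seed value is arbitrary.

```python
import pandas as pd

# Toy data, invented for illustration: 6 majority rows and 2 minority rows.
toy = pd.DataFrame({
    "Quantity": [1, 2, 3, 4, 5, 6, 7, 8],
    "Country": ["United Kingdom"] * 6 + ["France"] * 2,
})

majority = toy[toy["Country"] == "United Kingdom"]
minority = toy[toy["Country"] != "United Kingdom"]

# Draw as many majority rows as there are minority rows; seed fixed for repeatability.
majority_under = majority.sample(n=len(minority), random_state=0)
balanced = pd.concat([majority_under, minority])

print(balanced["Country"].value_counts().to_dict())
```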

We can further simplify our task by reducing the number of classes to two. Many classification algorithms (such as support vector machines) are designed to be binary classifiers, so are optimized for exactly two classes. One way we can do this is to distinguish domestic orders from foreign orders. We can do this by changing all foreign orders to 'Other Country'. The way to do this in pandas is to use the .loc feature.

In [18]:
new_ec.loc[new_ec['Country'] != 'United Kingdom', 'Country'] = 'Other Country' #select the values other than United Kingdom and make them one value 

In [19]:
collections.Counter(new_ec['Country']) #list all of the country data

Counter({'United Kingdom': 44951, 'Other Country': 44951})

In [20]:
new_ec.shape #shape of the new data frame

(89902, 3)

Finally, when we concatenated the two halves of our dataset, we essentially added our 'Other Country' set to the end of our reduced 'United Kingdom' set. We should shuffle them before beginning classification, or else our results will be biased by the order of the rows.

In [21]:
rand_ec = new_ec.sample(frac=1) #take a random fraction of 100% of the data frame. We could use frac = 0.1 to take a random 10%
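A quick way to convince yourself that `sample(frac=1)` only reorders the rows, without adding or losing any, is to try it on a tiny frame. The sketch below uses invented data; the `random_state` and the `reset_index` call are optional extras that are not used in the cell above.

```python
import pandas as pd

# Toy data, invented for illustration: three rows of each label, in order.
toy = pd.DataFrame({"Country": ["United Kingdom"] * 3 + ["Other Country"] * 3})

# frac=1 returns every row in random order; reset_index discards the old row labels.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

print(len(shuffled) == len(toy))                              # True
print(sorted(shuffled["Country"]) == sorted(toy["Country"]))  # True
```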

<b>We're now ready to try Logistic Regression again!</b> As before, we will divide the data into a training set and a test set before fitting the algorithm. Let's try this again and see how we fare.

In [22]:
train = rand_ec[:60000] #60000, approximately 2/3 of the data
test = rand_ec[60000:]

print("Train: " + str(len(train)) + " Test: " + str(len(test)))

Train: 60000 Test: 29902


In [23]:
clf.fit(train.loc[:, train.columns != 'Country'], train['Country'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=None, penalty='l2', random_state=0, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

We are now ready to make predictions, as before. Let's test it on the test dataset and record the accuracy.

In [24]:
preds = clf.predict(test.loc[:, test.columns != 'Country'])

In [25]:
accuracy_score(preds, test['Country'])

0.5627382783760284

56 percent accuracy: terrible, and only slightly better than random chance on a balanced two-class problem. Let's break this down a bit more using the confusion matrix. The confusion matrix will show the number of items classified as 'Other Country' on the left, followed by those which were classified as 'United Kingdom' on the right. The items that were actually 'Other Country' are on the top, while those which were actually 'United Kingdom' are on the bottom. This is very useful for seeing the breakdown of our classifier and what went wrong. In our case, it is clear that our classifier identified far too many as 'United Kingdom'.

In [26]:
from sklearn.metrics import confusion_matrix
confusion_matrix(test['Country'], preds) #shows the confusion matrix

array([[ 4956,  9994],
       [ 3081, 11871]], dtype=int64)

In [27]:
collections.Counter(preds) #show the collection of predicted countries

Counter({'Other Country': 8037, 'United Kingdom': 21865})
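The confusion matrix also lets us work out how well each class is recovered on its own. The sketch below copies the four numbers from the output above and computes the per-class recall by hand; the variable names are our own, and scikit-learn offers ready-made functions for the same purpose.

```python
# Confusion matrix copied from the output above: rows are true labels,
# columns are predicted labels, both ordered ['Other Country', 'United Kingdom'].
matrix = [[4956, 9994],
          [3081, 11871]]
labels = ["Other Country", "United Kingdom"]

# Recall for a class = correctly predicted rows / all rows that truly belong to it.
recall = {
    label: matrix[i][i] / sum(matrix[i])
    for i, label in enumerate(labels)
}

# Overall accuracy = diagonal total / grand total.
accuracy = (matrix[0][0] + matrix[1][1]) / sum(sum(row) for row in matrix)

for label, value in recall.items():
    print(label, round(value, 3))
print("accuracy", round(accuracy, 4))
```

With these numbers, recall comes out at roughly 0.33 for 'Other Country' and 0.79 for 'United Kingdom', which confirms that the classifier leans heavily toward the domestic label.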

Let's try some other techniques before calling it a day.

### Naive Bayes
A second technique we can try is called Naive Bayes. This is a probabilistic classifier based on Bayes' theorem that treats the input features as independent of one another given the class. It is often used in text analysis, but can be used in our context as well. Let's train this classifier using the same code as before. We end up with an even worse result using Naive Bayes.

In [28]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(train.loc[:, train.columns != 'Country'], train['Country'])

GaussianNB(priors=None, var_smoothing=1e-09)

In [29]:
preds = clf.predict(test.loc[:, test.columns != 'Country'])
accuracy_score(preds, test['Country'])

0.5179586649722426

In [30]:
confusion_matrix(test['Country'], preds)

array([[  713, 14237],
       [  177, 14775]], dtype=int64)

In [31]:
collections.Counter(preds)

Counter({'United Kingdom': 29012, 'Other Country': 890})

### Random Forest
A third technique we can try is called random forest. This classifier is built from decision trees, each of which learns a sequence of splits on the input values chosen to separate the classes as cleanly as possible. It is called a 'forest' because its prediction combines the predictions of many such trees, each trained on a random sample of the data. Using random forest, we get 65% accuracy, which is a significant improvement. Though nothing to write home about, this is on the path to usefulness.

In [32]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(train.loc[:, train.columns != 'Country'], train['Country'])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [33]:
preds = clf.predict(test.loc[:, test.columns != 'Country'])
accuracy_score(preds, test['Country'])

0.6513611129690322

In [34]:
confusion_matrix(test['Country'], preds)

array([[11389,  3561],
       [ 6864,  8088]], dtype=int64)

In [35]:
collections.Counter(preds)

Counter({'Other Country': 18253, 'United Kingdom': 11649})
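To make the 'many trees' idea concrete, the sketch below fits a small forest on invented toy data and checks that the forest's class probabilities are the average of its individual trees' probabilities. The data and the choice of 10 trees are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data, invented for illustration: two features, two classes.
X = np.array([[1, 1.0], [2, 1.5], [3, 0.5], [10, 9.0], [12, 8.5], [11, 9.5]])
y = np.array(["A", "A", "A", "B", "B", "B"])

forest = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=0)
forest.fit(X, y)

# Each fitted tree is available in estimators_; average their class probabilities.
tree_probas = [tree.predict_proba(X) for tree in forest.estimators_]
averaged = np.mean(tree_probas, axis=0)

print(len(forest.estimators_))                         # 10
print(np.allclose(averaged, forest.predict_proba(X)))  # True
```

The predicted label is then simply the class with the highest averaged probability.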

### Support Vector Machines
The fourth technique that we will try is called support vector machines. This classifier treats the data as points in space and looks for the boundary (a hyperplane) that separates the two classes with the widest possible margin. A kernel function, specified by the user, lets that boundary be non-linear in the original inputs. We will use the default radial basis function (RBF) kernel for classification. Using RBF, we attain a classification accuracy of 67%, which is much closer to useful.

With this in hand, we have a working (if not terribly good) predictive algorithm that can determine with 67% accuracy whether an order is from the United Kingdom or a foreign country. This brings us to the end of the classification tutorial.

In [36]:
from sklearn import svm
clf = svm.SVC(gamma='scale')
clf.fit(train.loc[:, train.columns != 'Country'], train['Country'])

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [37]:
preds = clf.predict(test.loc[:, test.columns != 'Country'])
accuracy_score(preds, test['Country'])

0.6706240385258511

In [38]:
collections.Counter(preds)

Counter({'Other Country': 19679, 'United Kingdom': 10223})

## Challenge Question
In this tutorial, we explored machine learning techniques that are often called "shallow learning". With the hype around deep learning, <i>neural networks</i> appear to be all the rage. How would we implement a neural network using scikit-learn?