<a href="https://colab.research.google.com/github/dnzengou/pantapa/blob/master/pantapa_VisualizingData_v1.ipynb" 
target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Visualizing Data for Classification

In the previous lab, you explored the automotive price dataset to understand the relationships for a regression problem. In this lab you will explore the German bank credit dataset to understand the relationships for a **classification** problem. The difference is that in classification problems the label is a categorical variable.

In other labs you will use what you learn through visualization to create a solution that predicts which customers have bad credit. For now, the focus of this lab is on visually exploring the data to determine which features may be useful in predicting customers' bad credit.

Visualization for classification problems shares much in common with visualization for regression problems. Collinear features should be identified so they can be eliminated or otherwise dealt with. However, for classification problems you are looking for features that help **separate the label categories**. Separation is achieved when there are distinctive feature values for each label category. Good separation results in a low classification error rate.

## Load and prepare the data set 

### Prepare data to a manageable format

***Processing bson files***

source:
- [Kaggle](https://www.kaggle.com/inversion/processing-bson-files?select=category_names.csv)
- [Access and process nested objects, arrays or JSON](https://hackersandslackers.com/extract-data-from-complex-json-python/)

In [3]:
## Unzipping (file on Linux)
#unzip pantapa_api_development.zip

In [4]:
## Convert bson files to JSON, optionally pretty-printing the output documents
#bsondump --pretty --outFile collection.json collection.bson
## OR via https://json-bson-converter.appspot.com/

In [99]:
## List all the converted MongoDB .json files in the dedicated folder

import os

json_arr = os.listdir('data/data-pantapa_bson2json')
print(json_arr)

['vouchertypes.json', 'scans.json', 'brands.json', 'appinfos.json', 'voucherurls.json', 'vouchers.json', 'vouchertypeurls.json', 'organizations.json', 'materialtypes.json', 'companies.json', 'stations.json', 'prescans.json']


In [78]:
## Alternatively, proceed as below
## E.g. list all the .bson input data files contained in the pantapa_api_development directory

from subprocess import check_output
print(check_output(['ls', 'data/data-pantapa_bson']).decode('utf8'))

# Any results written to the current directory are saved as output.

appinfos.bson
appinfos.metadata.json
brands.bson
brands.metadata.json
codenotfounds.metadata.json
companies.bson
companies.metadata.json
materialtypes.bson
materialtypes.metadata.json
modulehashes.metadata.json
organizations.bson
organizations.metadata.json
packages.metadata.json
prescans.bson
prescans.metadata.json
scans.bson
scans.metadata.json
sessiontokens.bson
stations.bson
stations.metadata.json
tokens.bson
tokens.metadata.json
userinformations.metadata.json
vouchers.bson
vouchertypes.bson
vouchertypes.metadata.json
vouchertypeurls.bson
voucherurls.bson
voucherurls.metadata.json



In [126]:
## Convert files from json to csv, for ease of processing and visualization
import pandas as pd
import json

In [115]:
## Read a JSON file from the directory and print its content
## Let's start with companies

# Open the existing JSON file and load it into a variable
with open('data/data-pantapa_bson2json/companies.json') as json_file:
  companies = json.load(json_file)  # parse the JSON file into a Python dict

print(companies)

{'_id': {'machine': -1768797184, 'inc': 299119782, 'time': 1576742726}, 'data': {'name': 'Test', 'active': True, 'alreadyConnected': True, 'show_popup_notification': False}, 'meta': {'timestamp': {'createdAt': 1576742726344, 'updatedAt': 1576742726344}}, 'local': {'sv': {'name': 'Test'}}, '__v': 0}


In [116]:
## Or in pretty json
print(json.dumps(companies, indent=4, sort_keys=True))

{
    "__v": 0,
    "_id": {
        "inc": 299119782,
        "machine": -1768797184,
        "time": 1576742726
    },
    "data": {
        "active": true,
        "alreadyConnected": true,
        "name": "Test",
        "show_popup_notification": false
    },
    "local": {
        "sv": {
            "name": "Test"
        }
    },
    "meta": {
        "timestamp": {
            "createdAt": 1576742726344,
            "updatedAt": 1576742726344
        }
    }
}


In [123]:
## Note. pd.read_json loads the same content as above, this time as a DataFrame (one row per JSON line)
companies = pd.read_json('data/data-pantapa_bson2json/companies.json', lines=True)

In [124]:
## Let's convert companies into csv format. We will do the same for the other json files
companies.to_csv (r'data/data-pantapa_json2csv/companies.csv', index = None)

In [125]:
## Let's check the structure of this newly converted csv file
companies_csv = pd.read_csv('data/data-pantapa_json2csv/companies.csv')

companies_csv.head()

Unnamed: 0,_id,data,meta,local,__v
0,"{'machine': -1768797184, 'inc': 299119782, 'ti...","{'name': 'Test', 'active': True, 'alreadyConne...","{'timestamp': {'createdAt': 1576742726344, 'up...",{'sv': {'name': 'Test'}},0


Inspecting the structure of a few of these objects shows that the csv files are not well suited to visualization: each cell still holds a nested dictionary (dict). We will work with the JSON format instead. An easier route would have been to load the bson files into MongoDB and then select the data subsets of interest for further analysis; we go straight to that step with the queries below (in the processing section).
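To make the nesting problem concrete, here is a minimal sketch that flattens one record into dotted column names. The helper name `flatten` and the dotted-key convention are assumptions chosen for illustration, not part of the original workflow; the record is copied from the `companies.json` output shown earlier.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten a nested dict into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Record copied from the companies.json output shown above
company = {
    "_id": {"machine": -1768797184, "inc": 299119782, "time": 1576742726},
    "data": {"name": "Test", "active": True, "alreadyConnected": True,
             "show_popup_notification": False},
    "meta": {"timestamp": {"createdAt": 1576742726344,
                           "updatedAt": 1576742726344}},
    "local": {"sv": {"name": "Test"}},
    "__v": 0,
}

flat_company = flatten(company)
print(flat_company["data.name"])
print(flat_company["meta.timestamp.createdAt"])
```

Each flattened key could then become one csv column. `pandas.json_normalize` provides comparable behaviour for lists of records, should a DataFrame be preferred.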

In [8]:
## Let's proceed with brands file: read json (already done above) and convert to csv
#brands = pd.read_json('data/data-pantapa_bson2json/brands.json', lines=True)
brands.to_csv (r'data/data-pantapa_json2csv/brands.csv', index = None)

In [122]:
print(brands)
#print(json.dumps(brands, indent=4, sort_keys=True))

{'_id': {'machine': 1762676284, 'inc': 1074367085, 'time': 1560947259}, 'data': {'name': 'Apoteket AB', 'image': {'key': 'development/brands-image/5d0a2a3b69104e3c40098a6d-apoteket_logo_png', 'source': 'https://panta-pasen.s3.amazonaws.com/development/brands-image/5d0a2a3b69104e3c40098a6d-apoteket_logo_png'}, 'active': True, 'company_id': None, 'country_code': ['SE'], 'deep_link': 'https://app.pantapa.com/N3AKBENTKSpAgZeq9'}, 'meta': {'timestamp': {'createdAt': 1566215629675, 'updatedAt': 1591180045366}}, 'local': {'sv': {'order': 10, 'company_name': None, 'company_address': None, 'post_address': None, 'vat_nr': None, 'contact_person': {}, 'how_it_works': {'package_name': 'Apoteket ABs plastpåsar', 'image_link': {'key': 'development/brands-image/5d0a2a3b69104e3c40098a6d_package-Apoteket_Bags_Green_png', 'source': 'https://panta-pasen.s3.amazonaws.com/development/brands-image/5d0a2a3b69104e3c40098a6d_package-Apoteket_Bags_Green_png', '__typename': 'AwsImage'}, 'text_line1': 'See availab

In [9]:
materialtypes = pd.read_json('data/data-pantapa_bson2json/materialtypes.json', lines=True)
materialtypes.to_csv (r'data/data-pantapa_json2csv/materialtypes.csv', index = None)

In [13]:
organizations = pd.read_json('data/data-pantapa_bson2json/organizations.json', lines=True)
organizations.to_csv (r'data/data-pantapa_json2csv/organizations.csv', index = None)

In [15]:
prescans = pd.read_json('data/data-pantapa_bson2json/prescans.json', lines=True)
prescans.to_csv (r'data/data-pantapa_json2csv/prescans.csv', index = None)

In [16]:
scans = pd.read_json('data/data-pantapa_bson2json/scans.json', lines=True)
scans.to_csv (r'data/data-pantapa_json2csv/scans.csv', index = None)

In [17]:
stations = pd.read_json('data/data-pantapa_bson2json/stations.json', lines=True)
stations.to_csv (r'data/data-pantapa_json2csv/stations.csv', index = None)

In [19]:
vouchertypes = pd.read_json('data/data-pantapa_bson2json/vouchertypes.json', lines=True)
vouchertypes.to_csv (r'data/data-pantapa_json2csv/vouchertypes.csv', index = None)

In [20]:
vouchertypeurls = pd.read_json('data/data-pantapa_bson2json/vouchertypeurls.json', lines=True)
vouchertypeurls.to_csv (r'data/data-pantapa_json2csv/vouchertypeurls.csv', index = None)

In [21]:
voucherurls = pd.read_json('data/data-pantapa_bson2json/voucherurls.json', lines=True)
voucherurls.to_csv (r'data/data-pantapa_json2csv/voucherurls.csv', index = None)

Let's check the list of converted csv files:

In [100]:
from subprocess import check_output
print(check_output(['ls', 'data/data-pantapa_json2csv']).decode('utf8'))

brands.csv
companies.csv
materialtypes.csv
organizations.csv
prescans.csv
scans.csv
stations.csv
vouchers.csv
vouchertypes.csv
vouchertypeurls.csv
voucherurls.csv



### Extract objects from nested JSON
#### and explore datasets


In [128]:
## Inspect content of the scans dictionary
print(scans['data']['enums']['location']['coordinates'])

[18.0373788, 59.3313148]


Lat = 59.3313148, Long = 18.0373788
![station0](img/Latitude-Longitude_Point0.png)
[source](https://getlatlong.net/)
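Note that the printed pair is in longitude-first order, which is the GeoJSON convention used for `Point` locations, whereas most map tools expect latitude first. A minimal sketch of the swap follows; the helper name `to_lat_lon` is hypothetical.

```python
def to_lat_lon(coordinates):
    """Return (latitude, longitude) from a [longitude, latitude] pair."""
    longitude, latitude = coordinates
    return latitude, longitude


scan_coordinates = [18.0373788, 59.3313148]  # value printed above
lat, lon = to_lat_lon(scan_coordinates)
print(f"Lat = {lat}, Long = {lon}")
```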

In [43]:
print(scans['data']['enums']['name'])

Apoteket stor grå påse


Let's proceed the same way with the other dictionaries obtained above from reading json files.

In [44]:
print(prescans['data']['enums']['location']['coordinates'])

[27.5285912, 53.9204432]


In [45]:
print(prescans['data']['enums']['status'])

PENDING


In [46]:
print(prescans['data']['enums']['name'])

Brunchägg L 12-p inbur HP


In [59]:
print(vouchers['data']['redeem_date'])

1556777805897
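The value above looks like a Unix timestamp in milliseconds. This is an assumption, based on its 13 digits and on the `createdAt` fields seen earlier. Under that assumption, a minimal sketch for turning it into a readable date could be as follows; the helper name `from_epoch_ms` is hypothetical.

```python
from datetime import datetime, timezone


def from_epoch_ms(ms):
    """Convert a Unix timestamp in milliseconds to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


redeem_date = from_epoch_ms(1556777805897)  # value printed above
print(redeem_date.isoformat())
```

Read this way, the voucher was redeemed on 2 May 2019 (UTC).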


In [60]:
print(vouchers['data']['coupon']['validTo'])

2019-11-14T00:00:00


In [55]:
print(vouchers['data']['coupon']['name'])

Panta Påsen Test


In [57]:
print(vouchers['data']['coupon']['htmlLink'])

http://p.kupong.se/LY2Ujv64yF


In [58]:
print(vouchers['data']['coupon']['couponCode'])

LY2Ujv64yF


We observe that there is only one brand in this file. That is not enough to reveal any pattern or trend, yet it is interesting to explore some variables of interest in depth, for information purposes and to get to know the data better.

In [63]:
print(brands['data']['name'])

Apoteket AB


In [65]:
print(brands['data']['image']['source'])

https://panta-pasen.s3.amazonaws.com/development/brands-image/5d0a2a3b69104e3c40098a6d-apoteket_logo_png


![source](https://panta-pasen.s3.amazonaws.com/development/brands-image/5d0a2a3b69104e3c40098a6d-apoteket_logo_png)
[apoteket_logo](https://panta-pasen.s3.amazonaws.com/development/brands-image/5d0a2a3b69104e3c40098a6d-apoteket_logo_png)

In [67]:
print(brands['data']['country_code'])

['SE']


In [73]:
print(brands['local']['sv']['how_it_works']['image_link']['source'])

https://panta-pasen.s3.amazonaws.com/development/brands-image/5d0a2a3b69104e3c40098a6d_package-Apoteket_Bags_Green_png


![bags_green](https://panta-pasen.s3.amazonaws.com/development/brands-image/5d0a2a3b69104e3c40098a6d_package-Apoteket_Bags_Green_png)
[Bags_Green](https://panta-pasen.s3.amazonaws.com/development/brands-image/5d0a2a3b69104e3c40098a6d_package-Apoteket_Bags_Green_png)

In [74]:
print(brands['local']['sv']['how_it_works']['text_line1_url'])

https://www.apoteket.se/globalassets/om-apoteket/hallbar-utveckling/apotek-med-pantbara-pasar-2019_apoteket_se.pdf


In [75]:
print(brands['local']['sv']['how_it_works']['package_name'])

Apoteket ABs plastpåsar


In [82]:
print(brands['local']['en']['how_it_works']['description'])

['Plastic bags sold in selected stores in Sweden.', 'Up to SEK 2 per scanned bag']


In [129]:
## Following the "read and print JSON file" steps above

# Equivalent to opening the existing JSON file and loading it into a variable:
#with open('data/data-pantapa_bson2json/stations.json') as json_file:
#  stations = json.load(json_file)

## Let's pretty-print the JSON data
print(json.dumps(stations, indent=4, sort_keys=True))

{
    "__v": 0,
    "_id": {
        "inc": -2072099102,
        "machine": 1962819352,
        "time": 1545242828
    },
    "data": {
        "address": "Borgarfjordsgatan 8, 164 40 Kista, Sweden",
        "country_code": "SE",
        "description": "Laudantium et dignissimos voluptate eos. Dolorum quo voluptas corporis id aliquid magni voluptas. Soluta ducimus voluptas vel aut nihil ullam. Debitis consequatur vitae. Culpa voluptates tempora aut. Voluptatem occaecati voluptatem.",
        "disabled": false,
        "location": {
            "coordinates": [
                17.9472797,
                59.4067509
            ],
            "type": "Point"
        },
        "name": "Test 1",
        "point_type": "STATION",
        "public": true,
        "scan_distance": 100,
        "store": "Eriksson - Nilsson",
        "type_id": []
    },
    "meta": {
        "created": 1545242828280,
        "owner": "PP",
        "timestamp": {
            "updatedAt": 1597910340625
        },

As usual, let's look closer at the data

In [87]:
print(stations['data']['address'])

Borgarfjordsgatan 8, 164 40 Kista, Sweden


In [89]:
print(stations['data']['location']['coordinates'])

[17.9472797, 59.4067509]


Lat = 59.4067509, Long = 17.9472797
![station1](img/Latitude-Longitude_Point1.png)

In [91]:
print(stations['data']['point_type'])

STATION


In [92]:
print(stations['data']['scan_distance'])

100


In [93]:
print(stations['data']['store'])

Eriksson - Nilsson


In [98]:
print(organizations['_id']['machine'])

1443409920


In [130]:
print(organizations['data']['name'])

Ika


***
<code>As the datasets do not contain enough products to visualize, extract patterns from and/or predict future behaviours (such as articles often bought, i.e. scanned, together), we will do the predictive analytics work on a dummy dataset of choice. For this purpose, we will put ourselves in the situation where ALL products are scannable. The method we will implement is called <b>Market Basket Analysis</b></code>
<br>

## Market Basket Analysis (MBA)

In the first part, we briefly explain the MBA basics and illustrate them with a case study of items scanned in a supermarket. In the second part, we implement this technique in Python using a public [dataset](https://raw.githubusercontent.com/limchiahooi/market-basket-analysis/master/BreadBasket_DMS.csv), modelled on source code available on GitHub.<br>
References at the end of this project.

***
# <a name="understanding-mba">Understanding MBA</a> 
In this hypothetical case study, we use the **Apriori algorithm** for frequent pattern mining to perform a Market Basket Analysis. According to [Xavier Vivancos García](https://www.kaggle.com/xvivancos/market-basket-analysis), "MBA is a technique used by large retailers to *uncover associations between items*. It works by looking for combinations of items that occur together frequently in transactions, providing information to understand the purchase behavior. The outcome of this type of technique is, in simple terms, a set of rules that can be understood as “if this, then that”."

Another source ([limchiahooi](https://github.com/limchiahooi/market-basket-analysis)) describes Market basket analysis (MBA), also known as **association-rule mining**, as "a method of discovering *customer purchasing patterns* by extracting *associations or co-occurrences* from stores' transactional databases.  It is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. For example, if you are in a supermarket and you buy a loaf of Bread, you are more likely to buy a packet of Butter at the same time than somebody who didn't buy the Bread. (...)" <br>
The same principle can in theory be applied to *scanned items*, as the scanning process is an integral part of the purchasing process.

***
### Applications ###
There are many real-life applications of MBA:
- **Recommendation engine** – showing related products as "Customers Who Bought This Item Also Bought" or “Frequently bought together”. It can also be applied to recommend videos and news articles by analyzing the videos or news articles that are often watched or read together in a user session.
<br>
<br>
- **Cross-sell / bundle products** – selling associated products as a "bundle" instead of individual items. For example, transaction data may show that customers often buy a new phone together with a screen protector. Phone retailers can then package a new phone with a high-margin screen protector and sell them as a bundle, thereby increasing their sales.
<br>
<br>
- **Arrangement of items in retail stores** – associated items can be placed closer to each other, thereby encouraging "impulse buying". For example, it may be uncovered that customers who buy Barbie dolls also buy candy at the same time. Retailers can thus place high-margin candy near the Barbie doll display, tempting customers to buy them together.
<br>
<br>
Etc.


***
### Case Study ###
We analyze the hypothetical scanning case of two items, Bread and Butter. We want to know if there is any evidence suggesting that scanning Bread leads to scanning Butter. Note. We will use "scanning" and "transaction" interchangeably.

**Problem Statement:** Does the scanning of Bread lead to the scanning of Butter?<br><br>
**Hypothesis:** There is significant evidence to show that scanning Bread leads to scanning Butter. (As much as buying Bread leads to buying Butter)


Bread => Butter

Antecedent => Consequent

Let's consider a supermarket which generates **1,000 transactions monthly**, of which **Bread was purchased in 150 transactions, Butter in 130 transactions, and both together in 50 transactions**.



### Analysis and Findings ###
We can use MBA to extract the association rule between Bread and Butter. There are *three metrics* or criteria to evaluate the strength or quality of an association rule: **support**, **confidence** and **lift**. (*Conviction* is an additional metric used in some cases.)<br>
More about this [here](https://medium.com/datadriveninvestor/product-recommendation-using-association-rule-mining-for-a-grocery-store-7e7feb6cd0f9)

In short,
- Support measures the percentage of transactions containing a particular combination of items relative to the total number of transactions. <br>In our example: *Support (antecedent (Bread) and consequent (Butter)) = Number of transactions having both items / Total transactions* = 50 / 1,000 = 5%. <br>Result: **The support value of 5% means 5% of all transactions have this combination of Bread and Butter scanned together**. Since the value is above the threshold of 1%, there is indeed **support** for this association, which thus *satisfies the first criterion*.
![alt text](img/support.jpg "Support")

- Confidence measures the probability of finding the consequent in a transaction whenever the antecedent is bought. <br> *Confidence (antecedent i.e. Bread and consequent i.e. Butter) = P (Consequent (Butter) is bought GIVEN antecedent (Bread) is bought)* = 50 / 150 ≈ 33.3%. <br> Result: **The confidence value of 33.3% is above the threshold of 25%**, indicating we can be **confident** that Butter tends to be scanned when Bread is scanned, which thus *satisfies the second criterion*.
![alt text](img/confidence.jpg "Confidence")

- Lift measures how much the transactions for the antecedent and the consequent influence each other. <br>We want to know which is higher, P(Butter) or P(Butter | Bread) (a conditional probability). If the scanning of Butter is influenced by that of Bread, then the *ratio of P(Butter | Bread) over P(Butter) is greater than 1*.<br> Result: **The lift value of 33.3% / 13% ≈ 2.56 is greater than 1**, so the transaction for Butter is indeed **influenced** by the one for Bread, which *satisfies the third criterion*. This also means that a Bread transaction makes a Butter purchase 2.56 times more likely.
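The three metrics can be recomputed from the counts given in the case study. The following is a minimal sketch; the variable names are chosen for illustration only.

```python
# Counts from the case study above
total_transactions = 1000
n_bread = 150    # transactions containing Bread
n_butter = 130   # transactions containing Butter
n_both = 50      # transactions containing both Bread and Butter

support = n_both / total_transactions                # P(Bread and Butter)
confidence = n_both / n_bread                        # P(Butter | Bread)
lift = confidence / (n_butter / total_transactions)  # P(Butter | Bread) / P(Butter)

print(f"support    = {support:.1%}")     # 5.0%
print(f"confidence = {confidence:.1%}")  # 33.3%
print(f"lift       = {lift:.2f}")        # 2.56
```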


***
### Takeaways ###
Based on the findings above, we

    a) Have the support of 5% transactions for Bread and Butter in the same basket
    b) Have 33.3% confidence that a Butter scan happens whenever Bread is scanned.
    c) Know that a Butter transaction is 2.56 times more likely when Bread is involved than when Butter is considered alone.

Therefore, we can support our initial hypothesis by concluding that there is indeed evidence to suggest that the *transaction for Bread leads to the one for Butter*. This is a valuable insight to guide decision-making.
<br>Possible next steps include retail stores placing bread and butter close to each other, knowing that customers are likely to "impulsively" scan (and ultimately purchase) them together.

***
## <a name="implementation-in-python">Implementation in Python</a> ##
On a large dataset, using Python libraries with a ready-made algorithm is more efficient than calculating support, confidence and lift in MS Excel. Furthermore, as the popular scikit-learn library does not support the *Apriori algorithm* for extracting frequent item sets, we use another library instead: [MLxtend (machine learning extensions)](http://rasbt.github.io/mlxtend/) by Sebastian Raschka. [Chris Moffitt](http://pbpython.com/market-basket-analysis.html) also provides a tutorial on using MLxtend.


Note. If you are using Jupyter Notebook, the MLxtend library does not come pre-installed with Anaconda (which I am using right now). You can easily install this package with conda by running one of the following in your Anaconda Prompt:<br><br>
`conda install -c conda-forge mlxtend`<br>
`conda install -c conda-forge/label/gcc7 mlxtend`<br><br>
Or with pip:<br><br>
`!pip install mlxtend`<br>(the "!" is needed when the cell is run from the notebook)

### Dataset
The [dataset](https://github.com/dnzengou/pantapa/data/MBA/MBA.csv) we use in this case study is inspired by a publicly available one, initially from Kaggle and now hosted on [github](https://github.com/limchiahooi/market-basket-analysis/blob/master/BreadBasket_DMS.csv), which contains the transaction data of a bakery from 30/10/2016 to 09/04/2017. The original data belongs to a real bakery called "The Bread Basket", located in the historic center of Edinburgh, which serves coffee, bread, muffins, cookies, etc.
<br>
<br>

### Import libraries

In [133]:
#!pip install mlxtend

In [134]:
# import the libraries required
%matplotlib inline
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

### Load data

In [None]:
# load the data into a pandas dataframe and take a look at the first 10 rows
bread = pd.read_csv("https://raw.githubusercontent.com/limchiahooi/market-basket-analysis/master/BreadBasket_DMS.csv")
bread.head(10)

### Apriori algorithm
#### for frequent pattern mining
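As an illustration of the level-wise idea behind Apriori, here is a minimal plain-Python sketch. It is not MLxtend's implementation: the function name `apriori_sketch`, the toy baskets and the `min_support` value of 0.4 are all assumptions made up for this example.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for itemsets meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Share of baskets that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items
    items = {item for basket in baskets for item in basket}
    current = keep_frequent({frozenset([item]) for item in items})
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Made-up toy baskets, for illustration only
toy_transactions = [
    ["Bread", "Butter", "Coffee"],
    ["Bread", "Butter"],
    ["Bread", "Coffee"],
    ["Coffee", "Tea"],
    ["Bread", "Butter", "Tea"],
]

frequent_itemsets = apriori_sketch(toy_transactions, min_support=0.4)
for itemset, s in sorted(frequent_itemsets.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{s:.0%}")
```

The pruning step is what keeps the search tractable: an itemset is only counted if all of its smaller subsets are already frequent. The `apriori` function imported from MLxtend fills the same role, but it operates on a one-hot encoded DataFrame rather than on lists of items.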

There is a lot of information in these plots. The key to interpreting these plots is comparing the proportion of the categories for each of the label values. If these proportions are distinctly different for each label category, the feature is likely to be useful in separating the label.  

There are several cases evident in these plots:
1. Some features such as checking_account_status and credit_history have significantly different distributions of categories between the label categories. 
2. Other features such as gender_status and telephone show small differences, but these differences are unlikely to be significant. 
3. Other features like other_signators, foreign_worker, home_ownership, and job_category have a dominant category with very few cases of other categories. These features will likely have very little power to separate the cases.  

Notice that only a few of these categorical features will be useful in separating the cases. 

## Summary

In this lab you have performed exploration and visualization to understand the relationships in a classification dataset. Specifically:
1. Examined the imbalance in the label cases using a frequency table. 
2. Found numeric or categorical features that separate the cases using visualization.

***
### References ###
- Amir, A. (2019, February 3). Association Rule(Apriori and Eclat Algorithms) with Practical Implementation. *Medium*. Retrieved from https://medium.com/machine-learning-researcher/association-rule-apriori-and-eclat-algorithm-4e963fa972a4
- Kaushik, D. (2019, January 15). Product Recommendation Case Study Using Apriori Algorithm for a Grocery Store. *Medium*. Retrieved from https://medium.com/datadriveninvestor/product-recommendation-using-association-rule-mining-for-a-grocery-store-7e7feb6cd0f9
- Madalina, C. (2019, June 8). An introduction to frequent pattern mining research. Summary of Apriori, Eclat and FP tree algorithms. *Medium*. Retrieved from https://medium.com/@ciortanmadalina/an-introduction-to-frequent-pattern-mining-research-564f239548e
- Andrewngai (2020, March 17). Understand and Build FP-Growth Algorithm in Python. Frequency Pattern Mining using FP-tree and conditional FP-tree in Python. *Towards Data Science*. Retrieved from https://towardsdatascience.com/understand-and-build-fp-growth-algorithm-in-python-d8b989bab342
- Vivancos García, X. (2020, May). Market Basket Analysis. *Kaggle*. Retrieved from https://www.kaggle.com/xvivancos/market-basket-analysis


