# Shopify Analysis Hand-In Notebook

This notebook contains sample code and text fields relating to the practical project of Principles of Machine Learning.

The targeted dataset consists of a scraped dump of Shopify store descriptions and associated metadata (https://www.kaggle.com/datasets/shopgram/shopify-stores-by-shopgramio).



# Data Exploration

The following cells contain code to connect a Google Drive folder and to explore the dataset. 

## Data Loading

The first step consists of loading the dataset we downloaded from Kaggle. 
It is also available on the Teams channel for the practical work. 

The cells below download the file, import the pandas library with the alias `pd`, and load the `CSV` file into a `dataframe`.

In [1]:
# Only necessary if we load from Google Drive
from google.colab import drive
drive.mount('/content/drive')

MessageError: ignored

In [2]:
!wget -O stores_data_UTF8.csv  https://drive.switch.ch/index.php/s/mvOKN47CFmlSi2S/download

--2022-12-20 14:26:14--  https://drive.switch.ch/index.php/s/mvOKN47CFmlSi2S/download
Resolving drive.switch.ch (drive.switch.ch)... 86.119.34.137, 86.119.34.138, 2001:620:5ca1:1ee::11, ...
Connecting to drive.switch.ch (drive.switch.ch)|86.119.34.137|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 260322878 (248M) [text/csv]
Saving to: ‘stores_data_UTF8.csv’


2022-12-20 14:26:35 (15.6 MB/s) - ‘stores_data_UTF8.csv’ saved [260322878/260322878]



In [13]:
import pandas as pd
df = pd.read_csv('./stores_data_UTF8.csv', lineterminator="\n")

## Dataframe Exploration

It is usually a good idea to explore the `dataframe` after loading in order to better understand the data, and to make sure the loading worked as expected.

In [4]:
# This shows us the column labels in the dataframe.
df.columns

Index(['Unnamed: 0', 'store_title', 'store_description', 'store_collections',
       'store_labels'],
      dtype='object')

In [5]:
# The value counts function is a good way to get an overview of the values contained in a column by frequency (the most common and least common values are shown below).
df['store_title'].value_counts()

Home                                                                                    816
Create an Ecommerce Website and Sell Online! Ecommerce Software by Shopify              173
Home Page                                                                               117
Welcome                                                                                 106
403 Forbidden                                                                           105
                                                                                       ... 
       Original Ink Cartridges at Low Prices with FREE Delivery!  — Cartridge King        1
Holeshot Moto                                                                             1
Up to 70% off | Huge Discounts | Everything Kitchen                                       1
#FYMP! For Your Mind's Peace !                                                            1
Authenticity50 Honest Made in USA Sheets, Pillows, Comforters & Towels          

In [6]:
# The describe function prints some basic descriptive stats on the columns in the dataframe (e.g. number of values, number of unique values).
df.describe()

Unnamed: 0.1,Unnamed: 0,store_title,store_description,store_collections,store_labels
count,620863.0,618575,593365,539469,617969
unique,619971.0,603975,580417,506295,170216
top,,Home,[],frontpage,[]
freq,361.0,816,251,10530,195740


In [7]:
# The head() function shows us the first entries in the dataframe. The tail() function shows us the last entries.
df.head()

Unnamed: 0.1,Unnamed: 0,store_title,store_description,store_collections,store_labels
0,0,Easestudio,,"festive 18, reverie festive 2019, spring summe...",[]
1,1,Mason - Super Thin iPhone Cases,The original super thin iPhone cases that perf...,"iphone 11, iphone xs max, iphone x, iphone xr,...","['case', 'iphone', 'leather']"
2,2,Chictypeaccessoires | accessoires mode homme b...,La boutique Chic Type est votre boutique d'acc...,"noeud papillon, ceinture, les accessoires util...","['bracelet', 'boutique']"
3,3,Le Corps Fitness,Le Corps Fitness,"trainers, short sets, yoga sets, reine, one pi...",['fitness']
4,4,Womens fashions,Charming Lilly,"charming glambam, charming accessories, swim c...",[]


# Training Dataset Creation

## Data Pre-Processing 

It is often necessary to do some form of pre-processing of the data.

This can become necessary in order to handle `null` values, deal with wrong data types, or make sure that data is encoded consistently in the same format. 
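
The three pre-processing tasks named above can be sketched on a small made-up dataframe. This is only an illustration: the `toy` frame, the `label_list` column and the `parse_labels` helper are hypothetical names, and the sketch assumes that `store_labels` holds the string form of a Python list (as the sample rows in this notebook suggest).

```python
import ast

import pandas as pd

# Toy frame mimicking two columns of the dataset (values are made up).
toy = pd.DataFrame({
    "store_description": ["  Thin iPhone Cases", None, "Yoga sets"],
    "store_labels": ["['case', 'iphone']", "[]", None],
})

# 1. Null handling: replace missing text with an empty string.
toy["store_description"] = toy["store_description"].fillna("")

# 2. Data types: parse the string form of a list into a real list.
def parse_labels(value):
    """Hypothetical helper: turn "['a', 'b']" into ['a', 'b']."""
    if not isinstance(value, str) or not value.strip():
        return []  # missing or empty values become an empty list
    return ast.literal_eval(value)

toy["label_list"] = toy["store_labels"].apply(parse_labels)

# 3. Consistent encoding: strip whitespace and lower-case the free text.
toy["store_description"] = toy["store_description"].str.strip().str.lower()

print(toy[["store_description", "label_list"]])
```

Whether lower-casing is useful depends on the later steps; `CountVectorizer` lower-cases its input by default, so in this notebook it would mostly matter for keyword matching.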

In [8]:
print(f"There are {df['store_title'].isnull().sum()} null values in the 'store_title' column and {df['store_description'].isnull().sum()} null values in the 'store_description' column")

There are 2288 null values in the 'store_title' column and 27498 null values in the 'store_description' column


In [9]:
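# This replaces the null values with an empty string.
# Assigning the result back avoids the chained-assignment pitfalls of inplace=True on a column.
df['store_labels'] = df['store_labels'].fillna("")
df['store_description'] = df['store_description'].fillna("")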

## Semi-Supervised Label Creation

The first step to build a classifier for our targeted categories consists of creating labels. 

In order to train our machine learning system that is expected to be able to classify shops based on their `store_description` or the content of the `store_collections` field, we need to label the rows in the dataframe.

The cell below shows a simple approach to do this based on using the existing `store_labels`.
* We define our set of `ml categories` and assign an integer to each of these categories
* We collect keywords that we believe to match each of our target categories
* If the `store_labels` (or `store_collections`) of a shop contain one of these keywords, we assign the respective integer label of the category.
* Finally, we assign a label representing all unmatched rows.
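
The four steps above can be sketched as a small reusable function. This is a minimal sketch on made-up data, not the labelling used for the results in this notebook: `CATEGORY_KEYWORDS`, `OTHER_LABEL` and `label_rows` are hypothetical names, and the keywords are illustrative only.

```python
import re

import pandas as pd

# Hypothetical mapping: integer label -> keywords (illustrative only).
CATEGORY_KEYWORDS = {
    1: ["ring", "necklace", "bracelet"],
    2: ["iphone", "samsung", "android"],
}
OTHER_LABEL = 100  # label for rows that match no category


def label_rows(labels: pd.Series) -> pd.Series:
    """Assign an integer label based on keywords found in `labels`."""
    result = pd.Series(OTHER_LABEL, index=labels.index, dtype=int)
    for label, keywords in CATEGORY_KEYWORDS.items():
        # \b...\b matches whole words only, so "ring" does not match "earring".
        pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
        mask = labels.str.contains(pattern, case=False, na=False)
        # A later category overwrites an earlier one if both match.
        result[mask] = label
    return result


toy = pd.Series(["['ring', 'gold']", "['case', 'iphone']", "['fitness']", None])
print(label_rows(toy).tolist())
```

Escaping the keywords with `re.escape` keeps characters such as `&` or `+` from being read as regular-expression syntax; the word boundaries are a design choice that trades recall for fewer accidental matches.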

In [14]:
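import pandas as pd
# Reload the raw data so that labelling starts from a fresh dataframe.
# Note that this discards the fillna() changes made in the cell above.
df = pd.read_csv('./stores_data_UTF8.csv', lineterminator="\n")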
# In this example I assign the following ints to each category
# board_games = 1
# phone_accessories = 2

# I use the following (very limited) selection of keywords for the two categories.
# The class 1 keywords are regular expressions; \b marks a word boundary.
labels_class1 = [r"\bad&d", r"\bmonopoly"]
labels_class2 = ['case', 'iphone', 'accessories', 'samsung', 'android', 'sleeve']


# In order to match these categories I use the following code. If any of the
# listed keywords is contained in the searched field (store_collections for
# class 1, store_labels for class 2), the row is labeled with the category in
# the new column ml_labels.
df.loc[df.store_collections.str.contains('|'.join(labels_class1), case = False, na = False), 'ml_labels'] = 1
df.loc[df.store_labels.str.contains('|'.join(labels_class2), case = False, na = False), 'ml_labels'] = 2


# After having labeled all our matching rows we label all remaining rows with an
# int representing all the other categories.
df['ml_labels'] = df['ml_labels'].where((df['ml_labels'].isin([1,2])), other=100)

# To have a quick check we can use the head() or tail() command and see if we
# have matches.
df.head(n=20)

Unnamed: 0.1,Unnamed: 0,store_title,store_description,store_collections,store_labels,ml_labels
0,0,Easestudio,,"festive 18, reverie festive 2019, spring summe...",[],100.0
1,1,Mason - Super Thin iPhone Cases,The original super thin iPhone cases that perf...,"iphone 11, iphone xs max, iphone x, iphone xr,...","['case', 'iphone', 'leather']",2.0
2,2,Chictypeaccessoires | accessoires mode homme b...,La boutique Chic Type est votre boutique d'acc...,"noeud papillon, ceinture, les accessoires util...","['bracelet', 'boutique']",100.0
3,3,Le Corps Fitness,Le Corps Fitness,"trainers, short sets, yoga sets, reine, one pi...",['fitness'],100.0
4,4,Womens fashions,Charming Lilly,"charming glambam, charming accessories, swim c...",[],100.0
5,5,American Jewel,"American Jewel Accessories, Bags & Beauty for ...","sale, yummy gummy purses, party bags, jewels, ...","['beauty', 'bag']",100.0
6,6,"Eurofit Autocentres Tyres, Brakes, MOT, Servic...",Eurofit Autocentres are a vehicle repair group...,,[],100.0
7,7,Anita B Spa Store,,"serum, toners, treatments, eyes, correctives, ...",[],100.0
8,8,Ana illueca. Ceramics with Valencia character....,"Ceramics with Valencia character. Plates, vase...",,['ceramic'],100.0
9,9,Speezys,Speezys Stylish Kaftan Wear. Dutch brand Speez...,speezys kaftan onesize s m l,[],100.0


In [15]:
# To see overall results of your labeling efforts, you can use the value_counts() function again.
df['ml_labels'].value_counts()

100.0    609514
2.0       11286
1.0          63
Name: ml_labels, dtype: int64

Semi-supervised approaches to labelling machine learning training data are a rapidly developing field and are big business. 

One of the fastest growing companies specialising in this is https://snorkel.ai/, a spin-off from Stanford University. 


### Analysing your Automatically Labeled Data

In [16]:
# If you want to look at the individual values for a label you can use the command below.
df.loc[df['ml_labels'] == 1.0]

Unnamed: 0.1,Unnamed: 0,store_title,store_description,store_collections,store_labels,ml_labels
2281,2262,Soft Toys | Pocket Money Toys | Board Games | PDK,"Buy fundraising supplies, board games, soft to...","childrens jigsaw puzzles, dominoes, childrens ...","['puzzle', 'supply', 'game', 'charity', 'garde...",1.0
24945,24838,Archie's Toys,We specalize in all licensed toys ranging from...,"bodysocks, marvel, wwe, board games, call of d...","['game', 'toy', 'doll', 'comic', 'wheel']",1.0
26011,25902,K-Swiss México,"Tenis, gorras y mochilas","botas caballero, tenis caballero 1, lo mas ven...",[],1.0
43088,42927,Winning Moves Shop - The Worlds Coolest Board ...,Welcome to the home of the worlds coolest game...,"monopoly, skyrim, top trumps specials, monopol...","['card', 'game', 'junior', 'board']",1.0
48394,48217,Kidmoro - Shop Your Child's Favourite Toys,Online Toy Shopping at Kidmoro Singapore ♥ ✪ E...,"monopoly, kitchen cooking and cut foods game c...","['sport', 'puzzle', 'play', 'game', 'toy', 'pa...",1.0
...,...,...,...,...,...,...
595276,593260,"Canvas Art: Motivational, Inspirational & Mode...","Ikonick modern art canvases give homes, gyms, ...","eric the hip hop preacher, travel, culture, tr...","['modern', 'bundle', 'art', 'canvas']",1.0
595429,593413,Játéksziget - Játékbolt és Játék Webáruház,"A Játéksziget.hu egy online játékbolt, ahol a ...","bebijatek, brainbox es gemker tarsasok, kerti ...","['sport', 'puzzle', 'wheel', 'toy', 'party', '...",1.0
595860,593843,Platinum Prints,Platinum Prints LLC,"monopoly collection canvas prints, definition ...","['print', 'canvas']",1.0
609025,606968,"Midas Touch Toys, Games And Collectables","Midas Touch Toys, Games And Collectables","spongebob squarepants, monster high, models of...","['car', 'game', 'toy', 'power', 'comic']",1.0


In [17]:
# If you want to see the full output you can write it out to a file.
# This will write to the temporary file space of a Google Colab notebook if you run the notebook there.
df.loc[df['ml_labels'] == 1.0].to_csv(r'matched_label_1.txt', header=None, index=None, sep=',')

## Creating the Training Data Structures

Now that we have labelled data, it is time to create the training set. 
In order to train our classifier we need the set of:
* Samples `X` and
* the set of labels for these samples `y`

Both `X` and `y` have to be encoded in numerical form. 
In order to transform the text of the descriptions we make use of the `CountVectorizer`.

For the vector holding the labels we just have to ensure it is an `int`.


In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

X = count_vect.fit_transform(df['store_description'].apply(lambda x: np.str_(x)))
y = df['ml_labels'].astype(int)
X.shape
y.shape


(620863,)

In [None]:
# This shows you how the samples have been transformed from text into a numerical 
# representation. If you want to see the full output you have to write it to a file.
X[1:10].todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

### Exercise: Train and Test Sets
Take the input X and y and create a train and test version that we can use to create the first classifiers. 

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=54)

# Training and Scoring the Classifier Model

After all the hard work we have done above, the actual training and scoring of the classifier is very simple. 

As shown below, all we have to do is to call the `fit()` method with the training samples `X_train` and their labels `y_train`. 

Afterwards we measure the accuracy on the held-out test set by calling the `score()` method. 


In [None]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train, y_train)
clf.score(X_test, y_test)

0.9485475908611373
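
For classifiers such as `MultinomialNB`, `score()` returns the mean accuracy, i.e. the share of test samples whose predicted label equals the true label. The sketch below illustrates this on a made-up toy corpus; all variable names and documents are assumptions for the example and unrelated to the result above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data, only to illustrate the API.
train_docs = ["iphone case", "samsung case", "gold ring", "silver necklace"]
train_y = [2, 2, 1, 1]
test_docs = ["android case", "gold necklace"]
test_y = [2, 1]

vec = CountVectorizer()
clf_toy = MultinomialNB().fit(vec.fit_transform(train_docs), train_y)

# Reuse the fitted vocabulary for the test documents (transform, not fit_transform).
X_test_toy = vec.transform(test_docs)
pred = clf_toy.predict(X_test_toy)

# score() and accuracy_score() report the same number.
print(clf_toy.score(X_test_toy, test_y), accuracy_score(test_y, pred))
```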

# Practical Part Deliverables

As part of the grading for Principles of Machine Learning, each group has to complete the following steps:

### 1. Create a labelled dataset

Following the schema shown above, map your chosen categories to keywords and create labels in an ml_labels column.

After you have finished this process export your resulting dataframe as a `csv` file (example code for this is contained in this notebook). 

### 2. Train a classifier

If you want, you can use the sample code contained in this notebook to train your first classifier based on your labelling. You can also look at the scikit-learn documentation and try your hand at other classifiers (this is not required from a grading perspective).

### 3. Measure correctly

In the above example code we train on the training set and measure on the held-out test set. This is already a good approach and, depending on the size of your test data, can give meaningful results.
Explore another way to measure by applying cross-validation. 
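
As a starting point, k-fold cross-validation with scikit-learn's `cross_val_score` could look like the sketch below. It runs on a made-up toy corpus; `docs`, `labels` and `model` are hypothetical names, and the choice of 5 folds is an assumption, not a requirement.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus standing in for the descriptions and ml_labels.
docs = ["iphone case", "samsung sleeve", "android case",
        "gold ring", "silver necklace", "gold bracelet"] * 5
labels = [2, 2, 2, 1, 1, 1] * 5

# With the vectorizer inside the pipeline, the vocabulary is re-fitted
# on the training part of every fold.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# One accuracy value per fold; for classifiers an integer cv uses stratified folds.
scores = cross_val_score(model, docs, labels, cv=5)
print(scores, scores.mean())
```

With the data structures from this notebook, a call along the lines of `cross_val_score(MultinomialNB(), X, y, cv=5)` on the already vectorised `X` should work as well. In that variant the vocabulary has been built from all rows, which is a simplification.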

### 4. Interpret the process and your results

Provide a short (not more than 10 sentences) written interpretation of the observed result and the process that we have used to create our classifier. 

Do you see any potential problems in the semi-supervised keyword labelling shortcut we have used?

How would you interpret the observed results? Could the classifier be used for the discussed purpose of doing some preliminary competitive analysis (e.g. what is my competition in this area, how many shops exist that serve the products I target)?


## Interpretation of the process and our results

Write your input below.

## 3. Measuring Correctly

Provide your baseline measurements:
- Train/test split based measurements:
    - Split Ratio:
    - Accuracy:
- Cross-Validation measurements:
    - Number of Folds:
    - Accuracy:


## 4. Interpret the process and your results

- What is your general interpretation of the results? Are they good, bad, mediocre? Describe briefly.

...

- Do you see any potential problems in the semi-supervised keyword labelling shortcut that was used?


...

- How would you interpret the observed results? Could the classifier be used for your intended purpose? 

...


- What could be done to improve the performance of your classifier?

...

## Hand-in Procedure

The hand-in for each group consists of the CSV file containing your labelled version of the dataframe, and your version of this notebook.

Place the notebook file and the CSV file in a zip file and upload it to the Teams channel for Part II of the practical project. 

The name of the zip file should contain the last names of the group members. 

