# Data Loading

The first step consist of loading the dataset we downloaded from Kaggle. 
It is available on the teams channel for the practical work project and via the link in the next cell. 

The lines below import the pandas library with the alias `pd`, and loads the `CSV` file into a `dataframe`.

In [None]:
!wget -O stores_data_UTF8.zip  https://drive.switch.ch/index.php/s/sfrwmwUmFAAfeCt/download

In [2]:
import pandas as pd
df = pd.read_csv('drive/MyDrive/stores_data_UTF8.csv', lineterminator="\n")

In [1]:
# Data Exploration

It is usually a good idea to explore the `dataframe` after loading in order to better understand the data, and to make sure the loading worked as expected.

SyntaxError: invalid syntax (<ipython-input-1-93590b0a615c>, line 3)

In [3]:
# This shows us the column labels in the dataframe.
df.columns

Index(['Unnamed: 0', 'store_title', 'store_description', 'store_collections',
       'store_labels'],
      dtype='object')

In [6]:
# The value counts function is a good way to get an overview of the values contained in a column by frequency (the most common and least common values are shown below).
df['store_title'].value_counts()

Home                                                                          816
Create an Ecommerce Website and Sell Online! Ecommerce Software by Shopify    173
Home Page                                                                     117
Welcome                                                                       106
403 Forbidden                                                                 105
                                                                             ... 
Huellas Personalizadas para Mascotas - Tienda en Línea                          1
Funny Maternity T Shirts Pregnancy Announcement Gifts Baby Bump Tees            1
Jo Gordon | Modern Luxury Knitwear | Handcrafted in Scotland                    1
Designer Soft Leather Gold Plated Dog Collars With Genuine Gemstones            1
Size 9 Boutique                                                                 1
Name: store_title, Length: 603975, dtype: int64

In [7]:
# The describe function prints tsome basic descriptive stats on the columns in the dataframe (e.g. number of values, number of unique values. )
df.describe()

Unnamed: 0.1,Unnamed: 0,store_title,store_description,store_collections,store_labels
count,620863.0,618575,593365,539469,617969
unique,619971.0,603975,580417,506295,170216
top,,Home,[],frontpage,[]
freq,361.0,816,251,10530,195740


In [None]:
# The tail() function allows us to look at the last entries in the dataframe. The head() function shows us the top entries in the dataframe.
df.tail()

# Data Pre-Processing 

It is often necessary to do some form of pre-procesing of the data.

This can become necesary in order to handle `null` values, deal with wrong data types, or make sure that data is encoded consistently in the same format. 

In [9]:
print(f"There are {df['store_title'].isnull().sum()} null values in the 'store_title' column and {df['store_description'].isnull().sum()} null values in the 'store_description' column")

There are 2288 null values in the 'store_title' column and 27498 null values in the 'store_description' column


In [10]:
# This replaces the null values with an empty string.
df['store_labels'].fillna("", inplace=True)
df['store_description'].fillna("", inplace=True)

# Label Creation

The first step to build a classifier for our targeted categories consists of creating labels. 

In order to train our machine learning system that is expected to be able to classify shops based on their `store_description` or the content of the `store_collections` field, we need to label the rows in the dataframe.

The cell below shows a simple approach to do this based on using the existing `store_labels`.
* We define our set of `ml categories` and assign an integer to each of these categories
* We collect keywords that we believe to match each of our target categories
* If the `store_labels` of a shop contain one of these keywords we will assign the respective integer label of the category.
* Finally, we assign a label representing all unmatched rows.

In [28]:

# In this example I assign the following ints to each category
# jewelry = 1
# phone_accessories = 2

# I use the following (very limited) selection of keywords for the two categories
phone_acc_labels = ['case', 'iphone', 'accessories', 'samsung', 'android']
jewelry_labels = ['bracelet', 'necklace']

# In order to match these categories I use the following code. If any of the 
# listed keywords is contained in the store_labels field it will be labeled with 
# the category in the new column ml_labels
df.loc[df.store_labels.str.contains('|'.join(jewelry_labels), case = False, na = False), 'ml_labels'] = 1
df.loc[df.store_labels.str.contains('|'.join(phone_acc_labels), case = False, na = False), 'ml_labels'] = 2


# After having labeled all our matching rows we label all remaining rows with an
# int representing all the other categories.
df['ml_labels'] = df['ml_labels'].where((df['ml_labels'].isin([1,2])), other=100)

# To have a quick check we can use the head() or tail() command and see if we
# have matches.
df.head()

Unnamed: 0.1,Unnamed: 0,store_title,store_description,store_collections,store_labels,ml_labels
0,0,Easestudio,,"festive 18, reverie festive 2019, spring summe...",[],100.0
1,1,Mason - Super Thin iPhone Cases,The original super thin iPhone cases that perf...,"iphone 11, iphone xs max, iphone x, iphone xr,...","['case', 'iphone', 'leather']",2.0
2,2,Chictypeaccessoires | accessoires mode homme b...,La boutique Chic Type est votre boutique d'acc...,"noeud papillon, ceinture, les accessoires util...","['bracelet', 'boutique']",1.0
3,3,Le Corps Fitness,Le Corps Fitness,"trainers, short sets, yoga sets, reine, one pi...",['fitness'],100.0
4,4,Womens fashions,Charming Lilly,"charming glambam, charming accessories, swim c...",[],100.0


In [27]:
# To see overall results of your labeling efforts, you can use the value_counts() function again.
df['ml_labels'].value_counts()

10.0    604662
1.0       9040
2.0       7161
Name: ml_labels, dtype: int64

In [14]:
# If you want to look at the individual values for a label you can use the command below.
df.loc[df['ml_labels'] == 2.0]

Unnamed: 0.1,Unnamed: 0,store_title,store_description,store_collections,store_labels,ml_labels
1,1,Mason - Super Thin iPhone Cases,The original super thin iPhone cases that perf...,"iphone 11, iphone xs max, iphone x, iphone xr,...","['case', 'iphone', 'leather']",2.0
236,236,100% Organic Cotton Luxury Bedding and Bed Lin...,Strawberry & Cream is a new luxury bedding bra...,"super king pillow cases, super king fitted she...","['duvet', 'sheet', 'bed', 'pillow', 'organic',...",2.0
445,442,ICE Computer Cafe & Repair Center | Computer ...,Computer Repair | iPhone Repair | iPad Repair ...,"apple watch repair, samsung galaxy s8 repair, ...","['samsung', 'iphone']",2.0
487,484,ALASKA TECHNOLOGY FOR ALL,"MOBILE PHONES, MOBILE PARTS, CELLPHONES, IPHON...","home health, xiaomi, smartwatches, iphones, ip...",['iphone'],2.0
507,504,Handmade Designer Lighter Sleeves & Lighter Cases,Custom lighter sleeves & lighter cases made fr...,"hfl lighter sleeves, louis vuitton gucci goyar...","['case', 'sleeve']",2.0
...,...,...,...,...,...,...
620208,618113,Smartphone Mounting For An Active Lifestyle™,Quad Lock® is an Australian business producing...,"iphone 11 pro, iphone 11, shop samsung, iphone...",['iphone'],2.0
620257,618162,Olde Time Mac,Canada’s best source for Certified Pre-Owned A...,"ipad pro 12 9, macbook pro 13 retina, iphone x...","['iphone', 'macbook']",2.0
620285,618190,Yangtze Store,"Online store for scarves, silk pillowcases, si...","long oblong silk scarves, pearl earrings, wool...","['sleep', 'silk', 'wrap', 'pillowcase', 'pearl']",2.0
620361,618266,Brakitty Premium Hardshell Womens Travel Essen...,Brakitty case organizes & protects your linger...,"brakitty classic, bra case travel accessory su...","['bra', 'travel', 'case', 'bikini']",2.0


In [25]:
# If you want to see the full output you can write it out to a file.
# This will write to the temporary file space of a Google collab notebook if you run the notebook there.
df.loc[df['ml_labels'] == 1.0].to_csv(r'matched_label_1.txt', header=None, index=None, sep=',')

# Creating the Training Data Set

Now that we have labelled data, it is time to create the training set. 
In order to train our classifier we need the set of:
* Samples `X` and
* the set of labels for these samples `y`

Both `X` and `y` have to be encoded in numerical form. 
In order to transform the text of the descriptions we make use of the `CountVectorizer`.

For the vector holding the labels we just have to ensure it is an `int`.


In [37]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

X = count_vect.fit_transform(df['store_description'].apply(lambda x: np.str_(x)))
y = df['ml_labels'].astype(int)
X.shape
y.shape


(620863,)

In [33]:
# This shows you how the samples have been transformed from text into a numerical 
# represenation. If you want to see the full output you have to write it to a file.
X[1:10].todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

# Training and Scoring the Classifier Model

After the all the hard work we have done above, the actual training and scoring of the classifier is very simple. 

As shown below, all we have to do is to call the `fit()` method with the input of the samples `X` and their labels `y`. 

Afterwards we measure by calling the `score()` method. 


In [54]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(x_train, y_train)
clf.score(x_test, y_test)

0.9452495974235104

# Practical Part Deliverables

As part of the grading for Principles of  Machine Learning each group has to complete the following steps:

### 1. Create a labelled dataset

Following the schema shown above, map your chosen categories to keywords and create labels in an ml_labels column.

After you have finished this process export your resulting dataframe as a `csv` file (example code for this is contained in this notebook). 

### 2. Train a classifier

If you want you can use the same sample code contained in this notebook to train your first classifier based on your labelling. You can also look at the sci-kit learn documentation and try your hand at other classifiers (this is not required from a grading perspective).

### 3. Measure correctly

In the above example code we train and measure on the same data. 
As we have discussed in class this is not a good practice.
Add a cell that makes use of the `from sklearn.model_selection import train_test_split` method to create a X_train, y_train, X_test, y_test portion and train and measure with those.

### 4. Interprete the process and your results

Provide a short (not more than 10 sentences) written interpretation of the observed result and the process that we have used to create our classifier. 

Do you see any potential problems in the semi_supervised keyword labeling shortcut we have used?

How would you interpret the observed results? Could the classifier be used for the discussed purpose of doing some preliminary competitive analysis (e.g. what is my competition in this area, how many shops exist that serve the products I target)

Write your input into the next text cell. 

# Hand-in Procedure

The hand-in for each group consists of the CSV file containing your labelled version of the dataframe, and your version of this notebook.

Place the notebook file and the CSV file in a Zip file and upload it to the teams channel for Part II of the practical project. 

The name of the zip file should contain the last names of the group members. 



# Interpretation of the process and our results

Write your input here.