# Lazada Product Classifier

### Audhi Aprilliant

For background on text pre-processing, see [this article](https://medium.com/@ksnugroho/dasar-text-preprocessing-dengan-python-a4fa52608ffe).

## 1 Import Modules

In [108]:
import pandas as pd                   # DataFrame manipulation
import numpy as np                    # Numerical operations
import re                             # Regular expressions
import itertools                      # Iterator utilities
import collections                    # Container data types
import string                         # String constants such as string.punctuation
from collections import OrderedDict   # Ordered dictionary
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory                            # Indonesian stemmer
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory    # Indonesian stopword remover
from nltk.tokenize import word_tokenize                                               # Word tokenizer

## 2 Load the Data

In [109]:
data_lazada = pd.read_csv('Datasets/lazada-products.csv',usecols=['title','category'])

In [110]:
print('Dimension of the data:\n{} rows and {} columns'.format(*data_lazada.shape))
data_lazada.head()

Dimension of the data:
7020 rows and 2 columns


|   | title | category |
|---|-------|----------|
| 0 | [Lazada Exclusive] Infinix Smart 4 2/32GB - Du... | handphone |
| 1 | [Lazada Special Edition] Infinix Hot 8 3/32GB ... | handphone |
| 2 | Realme C2 2/32 Hp Murah 4.000 mAh Battery, Oct... | handphone |
| 3 | Vivo Y12 hp 3GB 32GB/64GB All Screen 6.35 inch... | handphone |
| 4 | Realme 5i hp 4GB/64GB 4GB/128GB Qualcomm Snapd... | handphone |


In [111]:
# info() prints its summary directly and returns None, so it needs no print()
data_lazada.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7020 entries, 0 to 7019
Data columns (total 2 columns):
title       7020 non-null object
category    7020 non-null object
dtypes: object(2)
memory usage: 109.8+ KB


## 3 Tokenization

### Duplicated Data Removal

In [112]:
data_lazada = data_lazada.drop_duplicates(subset=['title']).reset_index(drop=True)
print('Dimension of non-duplicated data:\n{} rows and {} columns'.format(*data_lazada.shape))

Dimension of non-duplicated data:
6735 rows and 2 columns

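The row count drops from 7020 to 6735, so 285 rows shared a title with an earlier row. To make the effect of `drop_duplicates(subset=['title'])` concrete, here is a minimal plain-Python sketch of the same idea: keep the first row seen for each title and discard later repeats. The helper name `drop_duplicate_titles` and the sample rows are made up for illustration; they are not part of the notebook or of pandas.

```python
def drop_duplicate_titles(rows):
    """Keep the first (title, category) pair seen for each title."""
    seen = set()
    unique_rows = []
    for title, category in rows:
        if title not in seen:
            seen.add(title)
            unique_rows.append((title, category))
    return unique_rows


# Made-up sample rows; the third repeats the title of the first
sample_rows = [
    ('phone a 32gb', 'handphone'),
    ('phone b 64gb', 'handphone'),
    ('phone a 32gb', 'handphone'),
]
unique_rows = drop_duplicate_titles(sample_rows)
print(len(sample_rows), 'rows before,', len(unique_rows), 'rows after')
```

Duplicates are judged on `title` alone, so two rows with the same title but different categories would also collapse into one.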

### 3.1 Case Folding

In [113]:
# Lowercase every title in one vectorized call
data_lazada['title'] = data_lazada['title'].str.lower()

In [114]:
data_lazada.head()

|   | title | category |
|---|-------|----------|
| 0 | [lazada exclusive] infinix smart 4 2/32gb - du... | handphone |
| 1 | [lazada special edition] infinix hot 8 3/32gb ... | handphone |
| 2 | realme c2 2/32 hp murah 4.000 mah battery, oct... | handphone |
| 3 | vivo y12 hp 3gb 32gb/64gb all screen 6.35 inch... | handphone |
| 4 | realme 5i hp 4gb/64gb 4gb/128gb qualcomm snapd... | handphone |


### 3.2 Number Removal

In [115]:
def number_removal(data_text):
    # Remove every run of digits; the raw string keeps \d from being read as a string escape
    return re.sub(r'\d+', '', data_text)

In [116]:
data_lazada['title'] = data_lazada['title'].apply(number_removal)

In [117]:
data_lazada.head()

|   | title | category |
|---|-------|----------|
| 0 | [lazada exclusive] infinix smart /gb - dual c... | handphone |
| 1 | [lazada special edition] infinix hot /gb - tr... | handphone |
| 2 | realme c / hp murah . mah battery, octa-core h... | handphone |
| 3 | vivo y hp gb gb/gb all screen . inch mp helio ... | handphone |
| 4 | realme i hp gb/gb gb/gb qualcomm snapdragon a... | handphone |

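The effect of the digit-stripping regex is easiest to see on a single string. The sketch below applies the same `re.sub` call to the visible part of one lowercased title from the table in section 3.1; the variable names `title` and `stripped` are illustrative only.

```python
import re


def number_removal(data_text):
    # Delete every run of one or more digits
    return re.sub(r'\d+', '', data_text)


title = 'realme c2 2/32 hp murah 4.000 mah battery'
stripped = number_removal(title)
print(stripped)  # realme c / hp murah . mah battery
```

Digits are removed wherever they occur, so model names lose their numbers (`c2` becomes `c`) and `4.000` leaves a lone `.` behind. The leftover punctuation is handled by the next step.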

### 3.3 Punctuation Removal

In [118]:
# Translation table that maps every punctuation character to a single space
punctuation_to_space = str.maketrans(string.punctuation, ' '*len(string.punctuation))
data_lazada['title'] = data_lazada['title'].str.translate(punctuation_to_space)

In [119]:
data_lazada.head()

|   | title | category |
|---|-------|----------|
| 0 | lazada exclusive infinix smart gb dual c... | handphone |
| 1 | lazada special edition infinix hot gb tr... | handphone |
| 2 | realme c hp murah mah battery octa core h... | handphone |
| 3 | vivo y hp gb gb gb all screen inch mp helio ... | handphone |
| 4 | realme i hp gb gb gb gb qualcomm snapdragon a... | handphone |

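As a small illustration of the translation-table approach used above, the sketch below builds the same table with `str.maketrans` and applies it to a fragment of one title. The names `punctuation_to_space`, `title`, and `cleaned` are chosen here for readability and are not taken from the notebook.

```python
import string

# Translation table: each ASCII punctuation character -> one space
punctuation_to_space = str.maketrans(string.punctuation,
                                     ' ' * len(string.punctuation))

title = 'realme c / hp murah . mah battery, octa-core'
cleaned = title.translate(punctuation_to_space)

print(repr(cleaned))    # runs of spaces remain where punctuation used to be
print(cleaned.split())  # the tokens that survive
```

Replacing punctuation with a space rather than deleting it keeps `octa-core` as two tokens instead of fusing it into `octacore`. The price is extra whitespace, which the next step removes.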

### 3.4 Extra Space Removal

In [120]:
# Split on any whitespace, then rejoin with single spaces
data_lazada['title'] = data_lazada['title'].str.split().str.join(' ')

In [121]:
data_lazada.head()

|   | title | category |
|---|-------|----------|
| 0 | lazada exclusive infinix smart gb dual camera ... | handphone |
| 1 | lazada special edition infinix hot gb triple c... | handphone |
| 2 | realme c hp murah mah battery octa core helio ... | handphone |
| 3 | vivo y hp gb gb gb all screen inch mp helio p ... | handphone |
| 4 | realme i hp gb gb gb gb qualcomm snapdragon ai... | handphone |

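Steps 3.1 to 3.4 can also be read as one function applied to each title. The following is a minimal sketch under that assumption; `preprocess_title` and `PUNCTUATION_TO_SPACE` are hypothetical names that do not appear in the notebook, and the input is the visible part of one raw title from section 2.

```python
import re
import string

PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation,
                                     ' ' * len(string.punctuation))


def preprocess_title(title):
    """Apply steps 3.1-3.4 to a single title string."""
    text = title.lower()                          # 3.1 case folding
    text = re.sub(r'\d+', '', text)               # 3.2 number removal
    text = text.translate(PUNCTUATION_TO_SPACE)   # 3.3 punctuation -> spaces
    return ' '.join(text.split())                 # 3.4 collapse extra whitespace


result = preprocess_title('Realme C2 2/32 Hp Murah 4.000 mAh Battery')
print(result)  # realme c hp murah mah battery
```

With a function like this, the whole cleaning stage could be applied in one pass, for example with `data_lazada['title'].apply(preprocess_title)`.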

## 4 Normalization

### 4.1 Stemming with Nazief and Adriani Algorithm

In [122]:
factory = StemmerFactory()
stemmer = factory.create_stemmer()
# Stem every title
data_lazada['title'] = data_lazada['title'].apply(stemmer.stem)

In [123]:
data_lazada.head()

|   | title | category |
|---|-------|----------|
| 0 | lazada exclusive infinix smart gb dual camera ... | handphone |
| 1 | lazada special edition infinix hot gb triple c... | handphone |
| 2 | realme c hp murah mah battery octa core hio p ... | handphone |
| 3 | vivo y hp gb gb gb all screen inch mp hio p tr... | handphone |
| 4 | realme i hp gb gb gb gb qualcomm snapdragon ai... | handphone |


## 5 Stopwords Removal

In [124]:
factory = StopWordRemoverFactory()
stopword = factory.create_stop_word_remover()
# Remove Indonesian stopwords from every title
data_lazada['title'] = data_lazada['title'].apply(stopword.remove)

In [125]:
data_lazada.head()

|   | title | category |
|---|-------|----------|
| 0 | lazada exclusive infinix smart gb dual camera ... | handphone |
| 1 | lazada special edition infinix hot gb triple c... | handphone |
| 2 | realme c hp murah mah battery octa core hio p ... | handphone |
| 3 | vivo y hp gb gb gb all screen inch mp hio p tr... | handphone |
| 4 | realme i hp gb gb gb gb qualcomm snapdragon ai... | handphone |

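The visible part of the first five titles is the same before and after this step. A plain-Python sketch of token-level stopword filtering suggests one possible reason: a title passes through untouched when none of its tokens are stopwords. The word list `SAMPLE_STOPWORDS` and the helper `remove_stopwords` below are hypothetical stand-ins for illustration; Sastrawi uses its own default list and its own implementation.

```python
# Hypothetical mini stopword list for illustration only;
# this is NOT the list that Sastrawi ships with.
SAMPLE_STOPWORDS = {'dan', 'yang', 'di', 'untuk'}


def remove_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Drop every whitespace-separated token found in `stopwords`."""
    return ' '.join(token for token in text.split() if token not in stopwords)


with_stopwords = remove_stopwords('baju tunik untuk wanita dan remaja')
without_stopwords = remove_stopwords('realme c hp murah mah battery')

print(with_stopwords)     # baju tunik wanita remaja
print(without_stopwords)  # realme c hp murah mah battery
```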

## 6 Term and Its Frequency

In [130]:
term_freq = pd.Series(' '.join(data_lazada['title']).split(' ')).value_counts()
print(term_freq[:10])

tunik         2973
gb            1791
wanita        1470
mascara       1429
murah          895
garansi        760
baju           756
resmi          726
atas           655
waterproof     609
dtype: int64

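The same term count can be sketched without pandas using `collections.Counter`. This is an alternative shown for illustration, not the method used in the notebook, and the three sample titles are made up.

```python
from collections import Counter

# Made-up pre-processed titles
titles = ['tunik wanita murah', 'mascara waterproof', 'tunik wanita']

# Join all titles into one string, split into tokens, and count them
term_counts = Counter(' '.join(titles).split())

for term, count in term_counts.most_common(3):
    print(term, count)
```

One difference to keep in mind: `split()` with no argument ignores repeated spaces, whereas `split(' ')` as used in the cell above yields empty-string tokens wherever two spaces occur in a row.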

## 7 Save the Data after Pre-processing

In [127]:
# The whole data
data_lazada.to_csv('Datasets/interim/1 Lazada-product after Preprocessing.csv',index=False)

In [131]:
# Term frequency data
term_freq.to_csv('Datasets/interim/0 Term Frequency.csv')

  
