In [59]:
import seaborn as sns
import csv,json,os,statistics,time
import datetime as dt
import numpy as np
import pandas as pd
import requests
import matplotlib.pyplot as plt
import praw
import re
import nltk
from operator import itemgetter

# Feature Engineering

In [60]:
df=pd.read_csv('product.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   name            215 non-null    object
 1   money           215 non-null    object
 2   money_discount  215 non-null    object
 3   total_sell      203 non-null    object
 4   link            215 non-null    object
dtypes: object(5)
memory usage: 8.5+ KB


## Get total sales
- We extract each product's sales count by stripping unwanted characters and keeping only the number.
- A **'k'** suffix in the sales figure means thousands, so 20k means 20,000 units sold.
- Products with no sales record are assigned 0.

In [61]:
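def isNaN(num):
    # NaN is the only value that is not equal to itself
    return num != num

def getSales(sales):
    # Missing sales records count as 0
    if isNaN(sales):
        return 0
    # The number is the third space-separated token of the raw text
    numSales = sales.split(' ')[2]
    # Treat a comma as the decimal separator
    numSales = numSales.replace(',', '.')
    if 'k' in numSales:
        # 'k' means thousands, e.g. 20k -> 20000
        numSales = float(numSales.replace('k', '')) * 1000
    else:
        numSales = float(numSales)
    # round() avoids truncating values that float arithmetic leaves just below an integer
    return int(round(numSales))

df['total_sell'] = df['total_sell'].apply(getSales)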
df['total_sell']

0      40100
1      30100
2      37100
3      21400
4      14800
       ...  
210        5
211        0
212        0
213        0
214        0
Name: total_sell, Length: 215, dtype: int64
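
`getSales` relies on the number always being the third space-separated token. As a rough alternative, the sketch below finds the number with a regular expression instead. The function name `parse_sales` and the sample strings are assumptions made for illustration (the raw `total_sell` text is not shown in this notebook), so treat it as a starting point rather than a drop-in replacement.

```python
import re

def parse_sales(text):
    """Hypothetical regex-based variant of getSales (not part of the original notebook)."""
    # NaN and other non-string values are treated as "no sales record"
    if not isinstance(text, str):
        return 0
    # First number in the text, with an optional decimal part and optional 'k' suffix
    match = re.search(r'(\d+(?:[.,]\d+)?)\s*(k?)', text.lower())
    if match is None:
        return 0
    # As in getSales, a comma is read as the decimal separator
    number = float(match.group(1).replace(',', '.'))
    if match.group(2) == 'k':
        number *= 1000
    return int(round(number))

# Sample strings are assumed, not copied from the scraped data
print(parse_sales('Đã bán 40,1k'))
print(parse_sales('Đã bán 5'))
print(parse_sales(float('nan')))
```

Because it does not depend on token position, this version keeps working if the text before the number changes length.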

## Get category of product
- The Shopee page I scraped is in Vietnamese, but you don't need to worry about that: the function below maps the Vietnamese keywords to English category names.
- If you want to verify a product's category, open its URL from the `link` column and check for yourself.

In [62]:
def getCategory(name):
    name=name.lower()
    if('áo' in name):
        return 'shirt'
    elif ('quần' in name):
        return 'pant'
    elif (('mũ' in name) or ('nón' in name)):
        return 'hat'
    elif (('vớ' in name) or ('tất' in name)):
        return 'sock'
    else:
        return 'other'

In [63]:
df['category']=df['name'].apply(getCategory)
print(df['category'].value_counts().sort_values(ascending=False))
df[df['category']=='shirt']

shirt    109
pant      74
other     11
sock      11
hat       10
Name: category, dtype: int64


| | name | money | money_discount | total_sell | link | category |
|---|---|---|---|---|---|---|
| 0 | Áo thun nam Cotton Compact phiên bản Premium c... | 259000 | 149000 | 40100 | https://shopee.vn//Áo-thun-nam-Cotton-Compact-... | shirt |
| 5 | Áo thun nam 100% Cotton USA Coolmate Basics th... | 119000 | 115000 | 32200 | https://shopee.vn//Áo-thun-nam-100-Cotton-USA-... | shirt |
| 7 | Áo thun nam DÀI TAY Cotton Compact Premium chố... | 269000 | 129000 | 14600 | https://shopee.vn//Áo-thun-nam-DÀI-TAY-Cotton-... | shirt |
| 9 | Áo sát nách thể thao nam Dri-Breathe thoáng má... | 189000 | 159000 | 5000 | https://shopee.vn//Áo-sát-nách-thể-thao-nam-Dr... | shirt |
| 10 | Áo Polo thể thao nam ProMax-S1 Logo thương hiệ... | 159000-199000 | 159000-199000 | 14600 | https://shopee.vn//Áo-Polo-thể-thao-nam-ProMax... | shirt |
| ... | ... | ... | ... | ... | ... | ... |
| 207 | Áo thun nam Cotton Compact phiên bản Premium c... | 259000 | 259000 | 812 | https://shopee.vn//Áo-thun-nam-Cotton-Compact-... | shirt |
| 211 | Áo thun thể thao nam Recycle Active V1 thoáng ... | 169000 | 169000 | 0 | https://shopee.vn//Áo-thun-thể-thao-nam-Recycl... | shirt |
| 212 | Áo Polo thể thao nam Recycle Active V1 thoáng ... | 249000 | 249000 | 0 | https://shopee.vn//Áo-Polo-thể-thao-nam-Recycl... | shirt |
| 213 | Áo Polo thể thao nam Recycle Active V2 thoáng ... | 249000 | 249000 | 0 | https://shopee.vn//Áo-Polo-thể-thao-nam-Recycl... | shirt |
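
One way to check categories by hand is to pull a few random rows per category and open their links. The sketch below shows the idea on a toy frame. The helper name `spot_check`, the toy product names, and the `example.com` links are placeholders invented for illustration, not rows from `product.csv`.

```python
import pandas as pd

# Toy rows standing in for the real data; names and links are made up
sample = pd.DataFrame({
    'name': ['Áo thun nam', 'Quần short nam', 'Tất cổ ngắn'],
    'link': ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'],
    'category': ['shirt', 'pant', 'sock'],
})

def spot_check(frame, category, n=2, seed=0):
    """Hypothetical helper: return up to n random rows of one category for manual review."""
    subset = frame[frame['category'] == category]
    if subset.empty:
        return subset[['name', 'link']]
    # Never ask for more rows than the category contains
    picked = subset.sample(n=min(n, len(subset)), random_state=seed)
    return picked[['name', 'link']]

print(spot_check(sample, 'shirt'))
```

On the real data you would call it as `spot_check(df, 'other')` and open each printed link in a browser.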


**Data Insight**: As you can see, most of Coolmate's products are shirts or pants (183 of 215). The brand specifically targets young men in Vietnam who value minimalist, comfortable clothes over fashionable ones.

## Get subcategory of product
Within the **'pant'** and **'shirt'** categories, we can further divide products into subcategories for more detail.

In [64]:
def getSubcategory(name):
    name=name.lower()
    if('áo' in name):
        if('áo thun' in name):
            return 'tshirt'
        elif('sơ mi' in name):
            return 'shirt'
        elif('polo' in name):
            return 'polo'
        else:
            return 'other'
    elif('quần' in name):
        if('short' in name):
            return 'short'
        elif('jogger' in name):
            return 'jogger'
        elif('quần lót' in name):
            return 'underwear'
        elif('jean' in name):
            return 'jean'
        else:
            return 'other'
    else:
        return 'none'

df['subcategory']=df['name'].apply(getSubcategory)

In [65]:
df['subcategory'].value_counts().sort_values(ascending=False)

tshirt       84
underwear    38
none         32
short        26
other        16
polo         12
jean          3
shirt         3
jogger        1
Name: subcategory, dtype: int64
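
A quick consistency check is to cross-tabulate category against subcategory: every product outside 'shirt' and 'pant' should land in 'none'. In the outputs above this holds, since the 32 'none' rows match 11 'other' + 11 'sock' + 10 'hat'. The sketch below shows the check with `pd.crosstab` on a small made-up frame (the `toy` rows are illustrative, not taken from the data).

```python
import pandas as pd

# Made-up rows that mimic the two engineered columns
toy = pd.DataFrame({
    'category':    ['shirt', 'shirt', 'pant', 'sock', 'hat'],
    'subcategory': ['tshirt', 'polo', 'short', 'none', 'none'],
})

# Rows are categories, columns are subcategories, cells are product counts
table = pd.crosstab(toy['category'], toy['subcategory'])
print(table)

# Products outside 'shirt' and 'pant' should all have subcategory 'none'
outside = toy[~toy['category'].isin(['shirt', 'pant'])]
print((outside['subcategory'] == 'none').all())
```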

## Get sale type of product
Some products are sold as a bundle (a bundle usually packs several items, of the same or different products, and offers them at a lower price).

In [66]:
def getSaleType(name):
    name=name.lower()
    if('combo' in name or 'set' in name):
        return 'bundle'
    else:
        return 'single'
df['sale_type']=df['name'].apply(getSaleType)

In [67]:
df['sale_type'].value_counts().sort_values(ascending=False)

single    168
bundle     47
Name: sale_type, dtype: int64
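
To read these counts as shares, `value_counts` accepts `normalize=True`. The sketch below rebuilds a Series from the two counts printed above (168 single, 47 bundle) so it can run without the CSV file; on the real data the same call on `df['sale_type']` should give the same proportions.

```python
import pandas as pd

# Series rebuilt from the counts printed above: 168 single, 47 bundle
sale_type = pd.Series(['single'] * 168 + ['bundle'] * 47)

# normalize=True returns fractions of the total instead of raw counts
share = sale_type.value_counts(normalize=True)
print(share.round(3))
```

The bundle share is 47/215, a little under 22%.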

**Data Insight**: You may be wondering why over a fifth of Coolmate's products are sold as bundles rather than single items. Why would they sell items as a package rather than individually for profit? <br>
**Here are my 2 cents**: Trendiness usually isn't what men look for when shopping for clothes, so they end up buying only essential items such as shirts, pants, and socks in different colors. A bundle strategy lets Coolmate offer its products at a lower price and increase sales, so it is no surprise that Coolmate's products are often sold as bundles.

## Keep only a single price per product

A few products in our data have multiple prices (separated by a hyphen). We keep only the lowest one, which is the value before the hyphen.

In [68]:
df['money'][:10]

0          259000
1          279000
2          299000
3    98000-129000
4          229000
5          119000
6          189000
7          269000
8          199000
9          189000
Name: money, dtype: object

In [69]:
def removePrice(price):
    price=price.split('-')
    return price[0]

df['money']=df['money'].apply(removePrice)
df['money_discount']=df['money_discount'].apply(removePrice)
df['money'][:10]

0    259000
1    279000
2    299000
3     98000
4    229000
5    119000
6    189000
7    269000
8    199000
9    189000
Name: money, dtype: object
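
Note that the prices are still strings (`dtype: object`) at this point. If numeric prices are needed later, one possible follow-up is sketched below: it takes the minimum of the range explicitly instead of relying on the order around the hyphen, converts to integers, and derives a discount percentage. The helper `lowest_price` and the column `discount_pct` are hypothetical names, and the three rows reuse values shown earlier for products 0, 5, and 10.

```python
import pandas as pd

def lowest_price(price):
    """Hypothetical helper: smallest value in a 'low-high' price string, as an int."""
    return min(int(part) for part in str(price).split('-'))

# Raw values for products 0, 5 and 10 as shown in the shirt table earlier
prices = pd.DataFrame({
    'money': ['259000', '119000', '159000-199000'],
    'money_discount': ['149000', '115000', '159000-199000'],
})

for col in ['money', 'money_discount']:
    prices[col] = prices[col].apply(lowest_price)

# Percentage saved relative to the list price (discount_pct is an assumed column name)
prices['discount_pct'] = (1 - prices['money_discount'] / prices['money']) * 100
print(prices.round(1))
```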

In [72]:
df.to_csv('cleaned_product.csv',encoding='utf-8-sig')

# Conclusion

As you can see, the product name alone lets us create several useful features for our analysis and reporting. This is the end of part 1. I hope you have enjoyed it so far; thanks for reading.