# Mercari Price Data Cleaning

![](https://cdn-images-1.medium.com/max/1600/1*jX6Gwn1rt4da7e-yUj84IQ.png)

- Mercari, Japan’s biggest community-powered shopping app, knows the pricing problem well. It would like to __offer pricing suggestions to sellers__, but this is hard because sellers can list just about anything, or any bundle of things, on Mercari's marketplace.

In [1]:
import numpy as np
import pandas as pd

import string
import re

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight') 

import gc
gc.enable()

import warnings
warnings.filterwarnings('ignore')

## Load the data

In [2]:
PATH = 'data/'
test = pd.read_table(PATH + 'test.tsv',  engine='c')
train = pd.read_table(PATH + 'train.tsv',  engine='c')

## Word statistics

In [3]:
from collections import Counter
import time

In [4]:

start_time = time.time()
total_counts = Counter()
merge = pd.concat([train, test])  # combined frame; not used by the count below
# Count lower-cased, space-separated tokens in the test descriptions only
for description in test['item_description']:
    total_counts.update(description.lower().split(' '))
print('[{}] Sec Finished!'.format(round(time.time() - start_time, 3)))

[18.744] Sec Finished!


## What is actually in item_description?
`most_common`: returns the words sorted from most to least frequent
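
As a small illustration of how `Counter.most_common` orders its output, the sketch below counts words in three made-up descriptions. The `docs` list is invented for this example and is not taken from the dataset.

```python
from collections import Counter

# Toy descriptions, invented for illustration only
docs = ["brand new size 7", "new with tags", "brand new never worn"]

counts = Counter()
for doc in docs:
    # Same tokenization as the notebook: lower-case, split on single spaces
    counts.update(doc.lower().split(' '))

# Passing n returns only the n most frequent (word, count) pairs
top_two = counts.most_common(2)
print(top_two)
```

Calling `most_common()` with no argument, as the next cell does, returns every word, which is why the output there is so long.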

In [5]:
total_counts.most_common()

[('and', 392764),
 ('the', 302872),
 ('a', 263210),
 ('for', 252534),
 ('in', 230538),
 ('to', 221503),
 ('with', 215514),
 ('is', 201711),
 ('size', 185275),
 ('new', 180537),
 ('of', 161513),
 ('no', 141492),
 ('i', 132754),
 ('on', 124583),
 ('brand', 116848),
 ('free', 111454),
 ('you', 107807),
 ('or', 96961),
 ('are', 94798),
 ('it', 92581),
 ('this', 86853),
 ('[rm]', 79646),
 ('my', 76569),
 ('will', 74101),
 ('-', 73060),
 ('never', 72228),
 ('but', 71237),
 ('all', 69196),
 ('shipping', 68169),
 ('great', 67796),
 ('worn', 66158),
 ('have', 65478),
 ('not', 65165),
 ('used', 64296),
 ('from', 63872),
 ('black', 59503),
 ('your', 59246),
 ('&', 57079),
 ('price', 55202),
 ('condition', 54178),
 ('be', 53794),
 ('as', 51394),
 ('2', 51276),
 ('has', 51139),
 ('one', 50848),
 ('like', 49800),
 ('please', 48248),
 ('only', 47662),
 ('good', 46954),
 ('bundle', 46859),
 ('can', 45995),
 ('1', 45904),
 ('pink', 45629),
 ('if', 43049),
 ('very', 41932),
 ('description', 41166),
 ('s

### __Important Note__:

### The most frequent words are almost all **stop words**.
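
One way to see past the stop words is to filter them out before ranking. The following is a minimal sketch: `STOP_WORDS` is a hypothetical, hand-picked set (a real list would be much longer) and `toy_counts` holds invented numbers rather than the counts above.

```python
from collections import Counter

# Hypothetical, hand-picked stop-word set; an assumption for this example
STOP_WORDS = {"and", "the", "a", "for", "in", "to", "with", "is", "of"}

# Toy counts, invented for illustration
toy_counts = Counter({"and": 5, "the": 4, "size": 3, "new": 2, "a": 2, "pink": 1})

# Keep only the words that are not in the stop-word set
content_counts = Counter(
    {word: n for word, n in toy_counts.items() if word not in STOP_WORDS}
)
top_content = content_counts.most_common(3)
print(top_content)
```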

## Statistics:
1. Total number of words
2. Number of unique words
3. Number of words that appear only once

To loop over a dictionary, use `items`.

In [6]:
text = []
total = 0
appear_once = 0
for k, v in total_counts.items(): # k for keys, v for values
    text.append(k)
    total += v
    if v == 1:
        appear_once += 1
print("Total words: {}".format(total))
print("Unique words: {}".format(len(set(text))))
print("Numbers of appear once words: {}".format(appear_once))

Total words: 17793403
Unique words: 367621
Numbers of appear once words: 214911
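
The three statistics above can also be computed without an explicit loop, since a `Counter` is a dictionary whose keys are already unique. This is only an equivalent sketch on a toy `Counter` (`toy_counts` is invented here), not a change to the notebook's cell.

```python
from collections import Counter

# Toy word counts, invented for illustration
toy_counts = Counter("new pink case new case new".split())

total = sum(toy_counts.values())  # total number of words
unique = len(toy_counts)          # keys of a Counter are unique words
appear_once = sum(1 for n in toy_counts.values() if n == 1)

print("Total words: {}".format(total))
print("Unique words: {}".format(unique))
print("Numbers of appear once words: {}".format(appear_once))
```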


In [7]:
text[:100] # show only the first 100 entries of the list to save memory

['size',
 '7',
 '25',
 'pcs',
 'new',
 '7.5"x12"',
 'kraft',
 'bubble',
 'mailers',
 'lined',
 'with',
 'wrap',
 'for',
 'protection',
 'self',
 'sealing',
 '(peel-and-seal),',
 'adhesive',
 'keeps',
 'contents',
 'secure',
 'and',
 'tamper',
 'proof',
 'durable',
 'lightweight',
 'material',
 'helps',
 'save',
 'on',
 'postage',
 'approved',
 'by',
 'ups,',
 'fedex,',
 'usps.',
 'brand',
 'coach',
 'bag.',
 'bought',
 '[rm]',
 'at',
 'a',
 'outlet.',
 '-floral',
 'kimono',
 '-never',
 'worn',
 '-lightweight',
 'perfect',
 'hot',
 'weather',
 'rediscovering',
 'life',
 'after',
 'the',
 'loss',
 'of',
 'loved',
 'one',
 'tony',
 'cooke.',
 'paperback',
 'in',
 'good',
 'condition',
 '2003.',
 '❤',
 'bundle',
 'save!',
 'book,',
 'death,',
 'grief,',
 'bereavement',
 'shlf.sw.5.15',
 'absolut',
 'vodka',
 'pink',
 'iphone',
 '6',
 'plus',
 'also',
 'fits',
 '6s',
 'iphone.',
 'these',
 'are',
 'made',
 'flexible',
 'rubber',
 'material.',
 'case',
 'have',
 'size:',
 'free',
 'shipping'

## What does `join` do?
- PS: scroll down to the next cell and you will see many odd symbols; this is what the raw data really looks like
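
In short, `str.join` glues the items of an iterable into one string, placing the separator between each pair. A minimal sketch with a made-up `words` list:

```python
# Toy word list, invented for illustration
words = ["size", "7", "kraft", "bubble", "mailers"]

# The string before .join is the separator placed between items
joined = ' '.join(words)
print(joined)

# Splitting on the same separator recovers the original list
round_trip = joined.split(' ')
```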

In [10]:
' '.join(text)[:10000] # show only the first 10,000 characters to save memory

'size 7 25 pcs new 7.5"x12" kraft bubble mailers lined with wrap for protection self sealing (peel-and-seal), adhesive keeps contents secure and tamper proof durable lightweight material helps save on postage approved by ups, fedex, usps. brand coach bag. bought [rm] at a outlet. -floral kimono -never worn -lightweight perfect hot weather rediscovering life after the loss of loved one tony cooke. paperback in good condition 2003. ❤ bundle save! book, death, grief, bereavement shlf.sw.5.15 absolut vodka pink iphone 6 plus also fits 6s iphone. these are made flexible rubber material. case have size: free shipping two vintage cameo pieces. 1. silver metal locket pendant filigree, green background ivory cameo. it is 2". 2. goldtone metal, mirrored gray pinback loop chain. shipping. price firm no trades box included 1yr warranty card dust cloth. metal: stainless steel finish: polished width: 12mm length: 8" can be adjusted smaller will not rust, tarnish, change colors or turn skin green. wa

### Important Note:

⚡❌⭕️◼️⚫❤✅✳️❇️⭐✔✔️☑️✨❗️❣️♥️❤️✴️⏺️⬅️⏬➡️♦️⛔♠️♣️⚠️◇♧♤▶ ......**more**
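- There are a huge number of odd symbols and emoji, plus nonsensical strings; the data is very dirty

- Text and emoji that are stuck together are treated as a single word
   - Example: ❤️iphone is counted as one token, not as ❤️ and iphone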
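
The effect can be reproduced on a short made-up string. The sketch below first splits on spaces as the notebook does, then shows one possible fix: padding every run of non-word, non-space characters with spaces before splitting. The sample string and the regex are assumptions for illustration, not the notebook's method.

```python
import re

# Made-up sample: U+2764 is a heart, U+2728 is sparkles
raw = "\u2764iphone case \u2728new\u2728"

# Splitting on spaces keeps the symbols glued to the words
naive_tokens = raw.split(' ')

# Surround each run of symbols (not a word character, not whitespace) with spaces
spaced = re.sub(r'([^\w\s]+)', r' \1 ', raw)
tokens = spaced.split()

print(naive_tokens)
print(tokens)
```

Note that this pattern also separates ordinary punctuation, so a token such as `[rm]` would be broken apart as well.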



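# Exercise time:
## Your job is to be a competent cleaner
## Write a function that manually replaces or removes the odd symbols
### There is no single correct answer to this exercise!

- You can use regex, or simply replace the symbols.
- Example:

```
import re
text = re.sub('A', 'a', text)
```
- function name = `preprocessing`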


In [None]:
# Exercise:

def preprocessing(text):
    text = str(text).lower()

    # Your code goes here
    # Input and output are both single strings:
    # the function is applied to each row via Series.apply


    return " ".join(text.split())
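
For orientation only, here is one possible shape such a function could take. It is a deliberately blunt sketch under a different, hypothetical name (`preprocessing_sketch`) so that it does not replace your own `preprocessing`; it simply turns every character that is not a letter, digit, underscore, or whitespace into a space.

```python
import re

def preprocessing_sketch(text):
    """One possible cleaner; a sketch, not a reference solution."""
    text = str(text).lower()
    # Replace anything that is not a word character or whitespace with a space
    text = re.sub(r'[^\w\s]', ' ', text)
    # Collapse runs of whitespace into single spaces
    return " ".join(text.split())

# Made-up sample description (U+2764 heart, U+2728 sparkles)
cleaned = preprocessing_sketch("\u2764iPhone 6 Plus case!! Brand-new, FREE shipping \u2728")
print(cleaned)
```

The trade-off is that this also strips meaningful punctuation: `[rm]` becomes `rm` and `7.5"x12"` becomes `7 5 x12`, so a more selective rule may work better for this data.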

### After writing `preprocessing`, run this cell



In [None]:
train['item_description'] = train['item_description'].apply(preprocessing)
test['item_description'] = test['item_description'].apply(preprocessing)

## After cleaning, check
1. Total words
2. Unique words
3. Numbers of appear once words



In [None]:
start_time = time.time()
total_counts = Counter()
merge = pd.concat([train, test])  # combined frame; not used by the count below
# Recount tokens in the cleaned test descriptions only
for description in test['item_description']:
    total_counts.update(description.lower().split(' '))
print('[{}] Sec Finished!'.format(round(time.time() - start_time, 3)))

text = []
total = 0
appear_once = 0
for k, v in total_counts.items():
    text.append(k)
    total += v
    if v == 1:
        appear_once += 1
print("Total words: {}".format(total))
print("Unique words: {}".format(len(set(text))))
print("Numbers of appear once words: {}".format(appear_once))

### All three numbers above should decrease. There is no definitive answer to this exercise, so no solution is provided.