# Prepping Scraped Data for NLP Cleaning

First, I import my packages.

In [1]:
import pandas as pd
import regex as re
import numpy as np
import string
import pickle

Next, I load the CSV file of my scraped data. I will pickle it so it can be reloaded later without re-reading the CSV.

In [3]:
raw_scraped_df = pd.read_csv("/Users/caitlinsanderson/Documents/ironhack_course_work/median-market-price/scraping/sold1000.csv")

In [4]:
with open("scraped_pickle","wb") as f:
    pickle.dump(raw_scraped_df, f)

In [5]:
with open("scraped_pickle","rb") as f:
    raw_scraped_df = pickle.load(f)
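
The dump-then-load round trip above can be sketched on its own with a small stand-in object. The `records` list and the temporary directory are made up for illustration; the notebook pickles the full dataframe into the working directory instead.

```python
import os
import pickle
import tempfile

# Made-up stand-in for the scraped dataframe
records = [{"brand": "Nike × Vintage", "sold_price": "$115"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scraped_pickle")
    # Write the object to disk in binary mode
    with open(path, "wb") as f:
        pickle.dump(records, f)
    # Read it back; the restored object should equal the original
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == records)
```

Pickle files can execute code when loaded, so this approach is only appropriate for files I created myself.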

## Exploring my data

In [6]:
raw_scraped_df.head()

Unnamed: 0,brand,title,sold_price,when_sold,item_links
0,Harrington × Polo Ralph Lauren × Vintage,Polo Golf 90s 00s Corduroy Velvet Check Jacket...,$91,Sold 5 minutes ago,/listings/16896292-harrington-x-polo-ralph-lau...
1,Ysl Pour Homme × Yves Saint Laurent,YSL Yves Saint Laurent Short Sleeve T-Shirt Si...,$60,Sold 6 minutes ago,/listings/18764205-ysl-pour-homme-x-yves-saint...
2,Walter Van Beirendonck,Walter van Beirendonck A/W 2011 Hand on Heart ...,$165,Sold 8 minutes ago,/listings/17713042-walter-van-beirendonck-walt...
3,Massimo Osti,Archive 90ies Tracksuit Jacket,$150,Sold 10 minutes ago,/listings/17739704-massimo-osti-archive-90ies-...
4,Nike × Vintage,Nike Vintage Puffer Down Jacket Fleece,$115,Sold 11 minutes ago,/listings/18182447-nike-x-vintage-nike-vintage...


In [7]:
raw_scraped_df.shape

(8900, 5)

In [8]:
raw_scraped_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8900 entries, 0 to 8899
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   brand       8900 non-null   object
 1   title       8900 non-null   object
 2   sold_price  8900 non-null   object
 3   when_sold   8900 non-null   object
 4   item_links  8900 non-null   object
dtypes: object(5)
memory usage: 347.8+ KB


In [9]:
raw_scraped_df.dtypes

brand         object
title         object
sold_price    object
when_sold     object
item_links    object
dtype: object

In [10]:
raw_scraped_df['brand'].isnull().sum()

0

In [11]:
raw_scraped_df['title'].isnull().sum()

0

I see several things I need to do to my dataframe right away:
<ol>
    <li>clean up the text to make it easier to work with</li>
    <li>convert sold_price to an integer, then convert it from dollars to euros</li>
    <li>convert when_sold to a datetime, based on the time the CSV was created</li>
</ol>
I will leave the links alone for now. I scraped them in case we want to use them in the future.

In [12]:
# Function to clean the text columns of the df.
# Many thanks to Alice Zhao (adashofdata) for (most of) this function. She also introduced me to pickle :)
#   https://github.com/adashofdata/nlp-in-python-tutorial
def clean_text(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # take out anything in square brackets
    # remove ASCII punctuation; this already includes the $ before the prices
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'[‘’“”…\']', '', text)  # remove curly quotes and ellipses that string.punctuation misses
    text = " ".join(text.split())  # trim the ends and collapse runs of whitespace
    return text

cleaner = clean_text  # alias used in the .apply() calls
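
As a quick check of what these cleaning steps do, here is a minimal sketch that applies the same substitutions to a few strings shaped like the scraped values. It uses the standard-library `re` rather than the `regex` package, on the assumption that both behave the same for these simple patterns; the sample strings are made up. One thing it shows is that stripping punctuation would also remove a decimal point, which does not matter for the whole-dollar prices seen so far but would for a value like `$91.50`.

```python
import re
import string

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                              # bracketed text
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # ASCII punctuation
    text = re.sub(r'[‘’“”…\']', '', text)                            # curly quotes, ellipsis
    return " ".join(text.split())                                    # normalise whitespace

# Made-up samples shaped like the scraped columns
samples = ["$91", "Nike × Vintage", "YSL [NEW] Short Sleeve T-Shirt", "$91.50"]
cleaned = [clean_text(s) for s in samples]
print(cleaned)
```

The `×` separator survives because it is not part of `string.punctuation`, which is what makes the later brand split possible.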

In [13]:
raw_scraped_df['brand'] = raw_scraped_df['brand'].apply(cleaner)
raw_scraped_df['title'] = raw_scraped_df['title'].apply(cleaner)
raw_scraped_df['sold_price'] = raw_scraped_df['sold_price'].apply(cleaner)
raw_scraped_df['when_sold'] = raw_scraped_df['when_sold'].apply(cleaner)

In [14]:
cleaner_df = raw_scraped_df.copy()  # copy, so later edits do not also change raw_scraped_df

cleaner_df.head()

Unnamed: 0,brand,title,sold_price,when_sold,item_links
0,harrington × polo ralph lauren × vintage,polo golf 90s 00s corduroy velvet check jacket...,91,sold 5 minutes ago,/listings/16896292-harrington-x-polo-ralph-lau...
1,ysl pour homme × yves saint laurent,ysl yves saint laurent short sleeve tshirt siz...,60,sold 6 minutes ago,/listings/18764205-ysl-pour-homme-x-yves-saint...
2,walter van beirendonck,walter van beirendonck aw 2011 hand on heart s...,165,sold 8 minutes ago,/listings/17713042-walter-van-beirendonck-walt...
3,massimo osti,archive 90ies tracksuit jacket,150,sold 10 minutes ago,/listings/17739704-massimo-osti-archive-90ies-...
4,nike × vintage,nike vintage puffer down jacket fleece,115,sold 11 minutes ago,/listings/18182447-nike-x-vintage-nike-vintage...


## Turning the prices into integers and converting to euros

In [15]:
cleaner_df['sold_price'].dtype

dtype('O')

In [16]:
cleaner_df['sold_price'] = cleaner_df['sold_price'].astype(str).astype(int)

In [17]:
cleaner_df['sold_price'].dtype

dtype('int64')

In [18]:
cleaner_df['sold_price'] = (cleaner_df['sold_price'] * 0.8).round(2)  # 0.8 is an approximate dollar-to-euro rate
cleaner_df['sold_price']

0        72.8
1        48.0
2       132.0
3       120.0
4        92.0
        ...  
8895    168.0
8896    138.4
8897     28.0
8898     68.8
8899     36.0
Name: sold_price, Length: 8900, dtype: float64
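
The conversion above can be sketched with the rate pulled out into a named constant, which makes it easier to update later. The name `USD_TO_EUR` and the helper `usd_to_eur` are hypothetical, and 0.8 is simply the approximate rate used in this notebook, not a live exchange rate.

```python
# Approximate rate used in this notebook; an assumption, not a live quote
USD_TO_EUR = 0.8

def usd_to_eur(price_usd, rate=USD_TO_EUR):
    """Convert a dollar price to euros, rounded to cents."""
    return round(price_usd * rate, 2)

# First five sold prices from the dataframe preview
prices_usd = [91, 60, 165, 150, 115]
prices_eur = [usd_to_eur(p) for p in prices_usd]
print(prices_eur)
```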

## Next, I will convert when_sold to the approximate datetime the item was sold

In [19]:
import dateparser

In [20]:
cleaner_df['when_sold'] = cleaner_df['when_sold'].apply(lambda x: re.sub(r'^sold\s', '', x))

In [21]:
# dateparser resolves "5 minutes ago" relative to the moment this cell runs, not the scrape time
cleaner_df['when_sold'] = cleaner_df['when_sold'].apply(dateparser.parse)
cleaner_df['when_sold']

0      2020-12-17 11:46:39.800985
1      2020-12-17 11:45:39.811749
2      2020-12-17 11:43:39.816702
3      2020-12-17 11:41:39.822449
4      2020-12-17 11:40:39.825097
                  ...            
8895   2020-12-06 11:51:56.595229
8896   2020-12-06 11:51:56.596596
8897   2020-12-06 11:51:56.597962
8898   2020-12-06 11:51:56.599334
8899   2020-12-06 11:51:56.600700
Name: when_sold, Length: 8900, dtype: datetime64[ns]
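
Because the parsed timestamps are measured from the moment the cell runs, they drift if the notebook is re-run long after the scrape. A minimal sketch of anchoring the same kind of relative phrase to a fixed reference time, using only the standard library, looks like this. The `scrape_time` value and the `parse_relative` helper are hypothetical; I believe dateparser also offers a `RELATIVE_BASE` setting for the same purpose, but I have not used it here.

```python
import re
from datetime import datetime, timedelta

# Seconds per unit; months and years are left out because their length varies
UNIT_SECONDS = {"minute": 60, "hour": 3600, "day": 86400, "week": 604800}
RELATIVE_RE = re.compile(r"(\d+)\s+(minute|hour|day|week)s?\s+ago")

def parse_relative(text, base):
    """Return base minus the offset described by text, or None if unrecognised."""
    match = RELATIVE_RE.fullmatch(text.strip())
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    return base - timedelta(seconds=amount * UNIT_SECONDS[unit])

# Hypothetical time at which the CSV was scraped
scrape_time = datetime(2020, 12, 17, 11, 51, 0)
sold_at = parse_relative("5 minutes ago", scrape_time)
print(sold_at)
```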

Let's take a look at our cleaner dataframe.

In [22]:
cleaner_df.head()

Unnamed: 0,brand,title,sold_price,when_sold,item_links
0,harrington × polo ralph lauren × vintage,polo golf 90s 00s corduroy velvet check jacket...,72.8,2020-12-17 11:46:39.800985,/listings/16896292-harrington-x-polo-ralph-lau...
1,ysl pour homme × yves saint laurent,ysl yves saint laurent short sleeve tshirt siz...,48.0,2020-12-17 11:45:39.811749,/listings/18764205-ysl-pour-homme-x-yves-saint...
2,walter van beirendonck,walter van beirendonck aw 2011 hand on heart s...,132.0,2020-12-17 11:43:39.816702,/listings/17713042-walter-van-beirendonck-walt...
3,massimo osti,archive 90ies tracksuit jacket,120.0,2020-12-17 11:41:39.822449,/listings/17739704-massimo-osti-archive-90ies-...
4,nike × vintage,nike vintage puffer down jacket fleece,92.0,2020-12-17 11:40:39.825097,/listings/18182447-nike-x-vintage-nike-vintage...


There are now 2 more things I want to do:
<ol>
    <li>break out the brands into separate columns when more than one brand is associated with a product</li>
    <li>take out any numbers from the title column</li>
</ol>
<br>
I will then deal more extensively with the title text in order to pull out the words that tell me what type of item was sold and, ultimately, match that to our own type categories. <br>

My ultimate goal is to create a dataframe that contains 3 columns: brand, type-category, and sold_price.

## I will start with the brand column

In [23]:
brand_counts_df = cleaner_df['brand'].value_counts().to_frame()
brand_counts_df.head(25)

Unnamed: 0,brand
nike,200
nike × vintage,189
gucci,174
prada,169
maison margiela,160
supreme,150
acne studios,137
stone island,135
vintage,108
rick owens,104


Grailed allows up to 3 brands to be associated with an item, and the brands are separated by " × ". I will use this separator to divide my brands into 3 separate columns, and thereby get a better sense of how often certain brands are sold.
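
The idea can be sketched in one pass: split on the separator and pad the result to three slots. The helper `split_brands` is hypothetical, and it uses `None` for missing brands where the notebook uses NaN.

```python
def split_brands(brand, max_brands=3, sep=" × "):
    """Split a combined brand string into exactly max_brands slots."""
    parts = brand.split(sep)[:max_brands]
    # Pad with None so every row has the same number of columns
    return parts + [None] * (max_brands - len(parts))

rows = [
    split_brands("harrington × polo ralph lauren × vintage"),
    split_brands("nike × vintage"),
    split_brands("massimo osti"),
]
for row in rows:
    print(row)
```

In pandas, `Series.str.split(" × ", expand=True)` should give a similar three-column result directly.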

In [24]:
#Many thanks to Jonny Fox and his Medium article breaking down RegEx syntax. 
    #https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
first_brand = []
re_pattern1 = r"^.*(?=\s×\s)"

for brand in cleaner_df['brand'].tolist():
    if re.match(re_pattern1, brand):
        first_brand.append(brand.split(" × ")[0])
    else:
        first_brand.append(brand)
        
first_brand

['harrington',
 'ysl pour homme',
 'walter van beirendonck',
 'massimo osti',
 'nike',
 'streetwear',
 'streetwear',
 'nudie jeans',
 'the north face',
 'custom',
 'burberry',
 'japanese brand',
 'arcteryx',
 'evisu',
 'carhartt wip',
 'soccer jersey',
 'malcolm mclaren',
 'carhartt',
 'yeezy season',
 'fjallraven',
 'marcelo burlon',
 'nike',
 'stone island',
 'bape',
 'brooklyn we go hard',
 'asap rocky',
 'prada',
 'yves saint laurent',
 '11 by boris bidjan saberi',
 'our legacy',
 'kangol',
 'jh design',
 'nike',
 'bottega veneta',
 'galliano',
 'raf simons',
 'amiri',
 'suitsupply',
 'kapital',
 'nike',
 'rick owens',
 'carhartt wip',
 'evisu',
 'vintage',
 'offwhite',
 'marlboro classics',
 'prada',
 'visvim',
 'jordan brand',
 'nike',
 'goyard',
 'stussy',
 'hype',
 'gmbh',
 '999 club',
 'nike',
 'common projects',
 'bape',
 'arcteryx',
 'true religion',
 'moncler',
 'tom ford',
 'prada',
 'carhartt',
 'maison margiela',
 'carhartt',
 'nike',
 'gruniforma',
 'nike acg',
 ...]

In [25]:
second_brand = []
re_pattern2 = r"^.*(?=\s×\s).*"

for brand in cleaner_df['brand'].tolist():
    if re.match(re_pattern2, brand):
        second_brand.append(brand.split(" × ")[1])
    else:
        second_brand.append(np.nan)
        
second_brand

['polo ralph lauren',
 'yves saint laurent',
 nan,
 nan,
 'vintage',
 'supreme',
 'young thug',
 nan,
 nan,
 'streetwear',
 nan,
 'vintage',
 'arcteryx veilance',
 'japanese brand',
 nan,
 'umbro',
 'seditionaries',
 'vintage',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'guess',
 nan,
 nan,
 nan,
 nan,
 'vintage',
 'racing',
 'vintage',
 nan,
 'john galliano',
 nan,
 nan,
 nan,
 'kapital kountry',
 nan,
 nan,
 'custom',
 'japanese brand',
 nan,
 nan,
 nan,
 nan,
 nan,
 'nike',
 'streetwear',
 'hypebeast',
 nan,
 'streetwear',
 nan,
 'streetwear',
 nan,
 nan,
 'timberland',
 nan,
 nan,
 'vintage',
 nan,
 nan,
 'carhartt wip',
 nan,
 'carhartt wip',
 'vintage',
 'gosha rubchinskiy',
 nan,
 'wu tang clan',
 'carhartt wip',
 'wu tang clan',
 nan,
 nan,
 nan,
 'iron maiden',
 nan,
 'uniqlo',
 nan,
 nan,
 'pro player',
 'custom',
 'carhartt wip',
 nan,
 'leather jacket',
 'uniqlo',
 nan,
 'vintage',
 nan,
 'streetwear',
 nan,
 nan,
 nan,
 nan,
 'vintage',
 'carhartt wip',
 'ronnie fieg',
 ...]

In [26]:
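third_brand = []
# Variable-length lookbehind: supported by the regex package imported as re, not by the standard-library re
re_pattern3 = r"^.*(?<=\s×\s.*\s×\s).*$"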

for brand in cleaner_df['brand'].tolist():
    if re.match(re_pattern3, brand):
        third_brand.append(brand.split(" × ")[2])
    else:
        third_brand.append(np.nan)
        
third_brand

['vintage',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'vintage',
 nan,
 nan,
 'vivienne westwood',
 'workers',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'custom made',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'vlone',
 nan,
 'vlone',
 nan,
 nan,
 'undefeated',
 nan,
 nan,
 nan,
 nan,
 nan,
 'custom made',
 nan,
 'vintage',
 nan,
 nan,
 nan,
 nan,
 'vintage',
 'wu wear',
 nan,
 nan,
 nan,
 'vintage',
 nan,
 nan,
 nan,
 nan,
 nan,
 'stussy',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'vintage',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'vintage',
 nan,
 'kids see ghosts',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'vintage',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'vintage',
 nan,
 nan,
 nan,
 ...]

I will now turn my 3 lists into a new dataframe. This will be the start of the final dataframe that I will reference for price recommendations.
I will pickle it to preserve it for future use.

In [27]:
# Build from a dict: np.column_stack would coerce the NaN entries into the string 'nan'
clean_df = pd.DataFrame({'brand1': first_brand, 'brand2': second_brand, 'brand3': third_brand})
clean_df

Unnamed: 0,brand1,brand2,brand3
0,harrington,polo ralph lauren,vintage
1,ysl pour homme,yves saint laurent,
2,walter van beirendonck,,
3,massimo osti,,
4,nike,vintage,
...,...,...,...
8895,moncler,,
8896,maison margiela,,
8897,burberry,vintage,
8898,prada,,


In [28]:
with open("clean_df_1","wb") as f:
    pickle.dump(clean_df, f)

I now want to create a list of the top 50 brands, which I will then use to eliminate those words from my item titles. I know that many titles include the brand and, in my quest to match as many item titles as possible to our type categories, I would like to eliminate as many other words as possible from them. <br>
<br>
I'm also only going to use the top 50 brands from my "brand1" column, on the assumption that this will be representative enough for this POC. I will, of course, eventually need to build a more robust way of searching the title column for the key words that indicate the item type and then matching them to our type category.

In [29]:
clean_df_values = clean_df['brand1'].value_counts().to_frame()
clean_df_values

Unnamed: 0,brand1
nike,569
adidas,240
vintage,227
gucci,221
burberry,218
...,...
silent by damir doma,1
isaia,1
sagittaire a,1
barba napoli,1


In [30]:
clean_df_values.reset_index(level=0, inplace=True)
clean_df_values.head(50)

Unnamed: 0,index,brand1
0,nike,569
1,adidas,240
2,vintage,227
3,gucci,221
4,burberry,218
5,stone island,196
6,prada,190
7,band tees,189
8,supreme,175
9,maison margiela,160


In [31]:
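big_brand_list = []
i = 0

# Note: the check runs after the append, so `i > 50` stops at 51 brands (use `i >= 50` for exactly 50)
for brand in clean_df_values['index']:
    big_brand_list.append(brand)
    i += 1
    if i > 50:
        break

big_brand_list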

['nike',
 'adidas',
 'vintage',
 'gucci',
 'burberry',
 'stone island',
 'prada',
 'band tees',
 'supreme',
 'maison margiela',
 'carhartt',
 'acne studios',
 'rick owens',
 'saint laurent paris',
 'moncler',
 'jordan brand',
 'luxury',
 'levis',
 'polo ralph lauren',
 'custom',
 'dolce gabbana',
 'dior',
 'the north face',
 'american vintage',
 'dries van noten',
 'offwhite',
 'bape',
 'balenciaga',
 'flannel',
 'arcteryx',
 'vivienne westwood',
 'louis vuitton',
 'versace',
 'streetwear',
 'evisu',
 'rick owens drkshdw',
 'comme des garcons',
 'our legacy',
 'designer',
 'yves saint laurent',
 'palm angels',
 'raf simons',
 'patagonia',
 'nudie jeans',
 'ysl pour homme',
 'lacoste',
 'helmut lang',
 'barbour',
 'jewelry',
 'japanese brand',
 'kith']
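
The count-then-take-the-top-N step can also be sketched with `collections.Counter`, which makes the cutoff explicit. The helper `top_brands` and the sample list are made up for illustration; the notebook itself works from `value_counts()`.

```python
from collections import Counter

def top_brands(brands, n):
    """Return the n most frequent brand names, most frequent first."""
    return [brand for brand, _ in Counter(brands).most_common(n)]

# Made-up sample of first-brand values
sample = ["nike", "gucci", "nike", "prada", "gucci", "nike"]
top_two = top_brands(sample, 2)
print(top_two)
```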

I notice there are a few brands that are either generic descriptions of the item type, such as 'jewelry', or have a common type of clothing in the brand name, such as 'nudie jeans'. Since I will be using this list to eliminate words from my titles, I will take these entries out of my brand list so the words remain in the titles. I will also pickle the list, since I anticipate running my NLP prep in a separate notebook. <br><br>
This makes me curious about how NLP vectorization and/or clustering could be used as a more robust way of detecting item categories. That will be for a future iteration.

In [32]:
# Drop entries that describe an item type, so those words stay in the titles
type_like_brands = {'band tees', 'nudie jeans', 'jewelry'}
strip_brand_list = [brand for brand in big_brand_list if brand not in type_like_brands]
strip_brand_list

['nike',
 'adidas',
 'vintage',
 'gucci',
 'burberry',
 'stone island',
 'prada',
 'supreme',
 'maison margiela',
 'carhartt',
 'acne studios',
 'rick owens',
 'saint laurent paris',
 'moncler',
 'jordan brand',
 'luxury',
 'levis',
 'polo ralph lauren',
 'custom',
 'dolce gabbana',
 'dior',
 'the north face',
 'american vintage',
 'dries van noten',
 'offwhite',
 'bape',
 'balenciaga',
 'flannel',
 'arcteryx',
 'vivienne westwood',
 'louis vuitton',
 'versace',
 'streetwear',
 'evisu',
 'rick owens drkshdw',
 'comme des garcons',
 'our legacy',
 'designer',
 'yves saint laurent',
 'palm angels',
 'raf simons',
 'patagonia',
 'ysl pour homme',
 'lacoste',
 'helmut lang',
 'barbour',
 'japanese brand',
 'kith']

In [33]:
with open("strip_brand_list","wb") as f:
    pickle.dump(strip_brand_list, f)
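
To illustrate how a brand list like this might later be used to remove brand words from a title, here is a minimal sketch. The helper `strip_brands` is hypothetical and is not the method used in the follow-up notebook; it matches whole words only and tries longer brand names first, so that 'rick owens drkshdw' is removed as a unit rather than leaving 'drkshdw' behind.

```python
import re

def strip_brands(title, brands):
    """Remove whole-word brand phrases from a title and tidy the whitespace."""
    # Longest phrases first, so multi-word brands win over their prefixes
    ordered = sorted(brands, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(b) for b in ordered) + r")\b")
    return " ".join(pattern.sub(" ", title).split())

brands = ["nike", "vintage", "rick owens", "rick owens drkshdw", "supreme"]
print(strip_brands("nike vintage puffer down jacket fleece", brands))
print(strip_brands("rick owens drkshdw cargo pants", brands))
```

The word boundaries keep a brand such as 'supreme' from matching inside a longer word.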

## I now turn my attention to the title column

I will first eliminate the emojis I have seen in some of the titles, using a pattern I found in a comment on a GitHub gist. I will then eliminate any numbers in the column using regex.

In [34]:
# Ref: https://gist.github.com/Alex-Just/e86110836f3f93fe7932290526529cd1#gistcomment-3208085
# Ref: https://en.wikipedia.org/wiki/Unicode_block
EMOJI_PATTERN = re.compile(
    "(["
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F700-\U0001F77F"  # alchemical symbols
    "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
    "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\U0001FA00-\U0001FA6F"  # Chess Symbols
    "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
    "\U00002702-\U000027B0"  # Dingbats
    "\U000024C2-\U0001F251"  # very broad catch-all range; also spans CJK and other non-emoji blocks
    "])"
    )

def strip_emoji(text):
    # .apply() passes one title at a time, so no loop over the column is needed
    return EMOJI_PATTERN.sub('', text)

In [35]:
cleaner_df['title'] = cleaner_df['title'].apply(strip_emoji)
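
One property of the pattern is worth seeing in isolation: its last range, U+24C2 to U+1F251, is wide enough to cover CJK characters as well as symbols. The sketch below compares that single range with a narrower set of emoji blocks on a made-up title. Both pattern names are hypothetical and neither is the full pattern from the gist.

```python
import re

# A few emoji-only blocks taken from the pattern
NARROW_EMOJI = re.compile("[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF]")
# The single catch-all range from the end of the pattern
BROAD_RANGE = re.compile("[\U000024C2-\U0001F251]")

title = "日本 denim jacket 🔥"
narrow = NARROW_EMOJI.sub("", title)  # removes the emoji, keeps the Japanese text
broad = BROAD_RANGE.sub("", title)    # removes the Japanese text, keeps this emoji
print(repr(narrow))
print(repr(broad))
```

For this POC that is probably acceptable, since the goal is to keep only English item-type words, but it would matter if non-Latin titles needed to be preserved.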

In [36]:
cleaner_df['title'] = cleaner_df['title'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))  # drop every token that contains a digit
cleaner_df.head()

Unnamed: 0,brand,title,sold_price,when_sold,item_links
0,harrington × polo ralph lauren × vintage,polo golf corduroy velvet check jacket,72.8,2020-12-17 11:46:39.800985,/listings/16896292-harrington-x-polo-ralph-lau...
1,ysl pour homme × yves saint laurent,ysl yves saint laurent short sleeve tshirt siz...,48.0,2020-12-17 11:45:39.811749,/listings/18764205-ysl-pour-homme-x-yves-saint...
2,walter van beirendonck,walter van beirendonck aw hand on heart sweat...,132.0,2020-12-17 11:43:39.816702,/listings/17713042-walter-van-beirendonck-walt...
3,massimo osti,archive tracksuit jacket,120.0,2020-12-17 11:41:39.822449,/listings/17739704-massimo-osti-archive-90ies-...
4,nike × vintage,nike vintage puffer down jacket fleece,92.0,2020-12-17 11:40:39.825097,/listings/18182447-nike-x-vintage-nike-vintage...
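
The digit pattern removes whole tokens such as '90s' or '2011', not just the digits, and it leaves the surrounding spaces behind. A minimal sketch of the same substitution followed by a whitespace clean-up is shown here; the helper name `drop_numeric_tokens` is hypothetical and the standard-library `re` is assumed to behave like `regex` for this pattern.

```python
import re

def drop_numeric_tokens(title):
    """Remove tokens containing a digit, then collapse the leftover spaces."""
    without_numbers = re.sub(r'\w*\d\w*', '', title)
    return " ".join(without_numbers.split())

raw = re.sub(r'\w*\d\w*', '', "polo golf 90s 00s corduroy velvet check jacket")
tidy = drop_numeric_tokens("polo golf 90s 00s corduroy velvet check jacket")
print(repr(raw))   # gaps remain where the tokens were
print(repr(tidy))
```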


In [37]:
with open("cleaner_df","wb") as f:
    pickle.dump(cleaner_df, f)

I will dump all of my titles into one long string so I can work with it more easily using some NLP tools. I am ultimately looking for as many type or item words as possible so I can start building my dictionary, which is what I will use to match our type categories with the scraped items. Later, I will either build up my dictionary over time or learn more sophisticated NLP techniques that do this for me, in which case I can use this dictionary to help train a model for our specific needs.<br>
<br>
I will then pickle the dump and take it over to a second Jupyter notebook to continue.

In [38]:
title_word_dump = cleaner_df.title.tolist()
title_word_dump = " ".join(title_word_dump)
title_word_dump


'polo golf   corduroy velvet check jacket  ysl yves saint laurent short sleeve tshirt size s m walter van beirendonck aw  hand on heart sweater new archive  tracksuit jacket nike vintage puffer down jacket fleece supreme long sleeve mike kelley big logo young thug x hm thugger hoodie rare s nudie jeans thin finn back  black black men jeans  the north face face  down vest l brand new long moon woman tshirt travis scott style burberry embossed metallic bronze leather notebook white warm vintage longsleeve shirt arcteryx polartec fleece jacket m size vintage evisu custom made denim jeans  nwt carhartt wip active jacket soft teal size m derby county home   adult shirt umbro jersey top  destroy bondage muslin vintage carhartt active work jacket yeezy slide bone wmns keb trousers size  marcelo burlon wings print tshirt size m dunk high shirt jacket overshirt  a bathing ape logo hoodie till black friday crazy deals shirt denim embroidery aap rocky striped tee wool knit roll neck sweater size 

In [39]:
with open("title_pickle_str" + ".txt","wb") as f:
    pickle.dump(title_word_dump, f)
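
As a small preview of why one long string is convenient, here is a sketch that joins a few titles and counts word frequencies with `collections.Counter`. The three titles come from the dataframe preview; the counting step is only an illustration and is not necessarily what the follow-up notebook does.

```python
from collections import Counter

titles = [
    "polo golf corduroy velvet check jacket",
    "archive tracksuit jacket",
    "nike vintage puffer down jacket fleece",
]

# Join into one string, then count each whitespace-separated word
title_word_dump = " ".join(titles)
word_counts = Counter(title_word_dump.split())
print(word_counts.most_common(3))
```

Frequent words such as 'jacket' are the kind of item-type terms the dictionary is meant to collect.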

# End here.  Please continue in NLP of Product Titles Notebook