This notebook walks through updating taylor_stitch_info.csv, the text file used for model training.

In [2]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

In [21]:
# connect to drive
#from google.colab import drive
#drive.mount('/content/drive',force_remount=True)

data = pd.read_csv('taylor_stitch_info.csv')
data.head()

Unnamed: 0,Name,Description,Material
0,The Waylons in Chestnut,"Built by our buddies at VALLON, The Waylons is...","The Waylons’ frame is made with lightweight, d..."
1,The Sashiko Denim Repair Kit,"Sure, you could hand off your well-worn denim ...",The Sashiko Denim Repair Kit includes an assor...
2,The Camp Candle in Shoreline,The Camp Candle was poured by hand in small ba...,"Made with natural, renewable soy wax. Soy wax ..."
3,The Cotton Hemp Tee in Charcoal Open Road,This exclusive edition of The Cotton Hemp Tee ...,Organic cotton offers all the benefits of the ...
4,The Cotton Hemp Tee in Navy Give to Get,Regenerative agriculture is all about reciproc...,We love organic cotton because it offers all o...


In [22]:
data.shape

(1193, 3)

In [7]:
#archive_url = 'https://www.taylorstitch.com/collections/mens-archive'
other_url = 'https://www.taylorstitch.com/collections/2022-summer-sale-archive-bring-back?sorted=best-selling-sales-count'
shirts_url = 'https://www.taylorstitch.com/collections/mens-shirts'
bottoms_url = 'https://www.taylorstitch.com/collections/mens-bottoms'
knits_url = 'https://www.taylorstitch.com/collections/mens-knits'
outerwear_url = 'https://www.taylorstitch.com/collections/mens-outerwear'
last_call_url = 'https://www.taylorstitch.com/collections/mens-last-call'
base_url = 'https://www.taylorstitch.com'

url_list = [shirts_url,bottoms_url,knits_url,outerwear_url,last_call_url]
#url_list = [shirts_url]

In [8]:
product_urls = [] # list of all product pages to get later

for url in url_list:
    site = requests.get(url)

    soup = BeautifulSoup(site.text, 'html.parser')

    products = soup.find_all('ul',{'class':'product matrix'})
    products = products[0].find_all('a',href=True) # a tags hold products here

    for product in products:
        product_urls.append(base_url+product['href'])
        
# look at each product and pull the relevant info.
# store all info in lists, which will be converted to a pandas df later
product_title = []
product_description = []
product_material = []


for product in product_urls:
    product_page = requests.get(product)
    product_soup = BeautifulSoup(product_page.text, 'html.parser')
    
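    title_info = product_soup.find('h1')['data-title']
    # regex=False matches the title literally, so characters such as
    # '(' or '+' in a product name are not read as a regex pattern
    if data['Name'].str.contains(title_info, regex=False).any():
        # product is already in the dataset, so skip it
        continue
    print('adding', title_info)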

    description_info = product_soup.find_all('div',
                                             {'id':'collapsible-description'})
    material_info = product_soup.find_all('div',
                                             {'id':'collapsible-material'})
    #if len(description_info)==0:
    #   continue
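    # IndexError: the section div is missing from the page
    # AttributeError: the div exists but has no <p> tag
    try:
        description = description_info[0].find('p').text
        material = material_info[0].find('p').text
    except (IndexError, AttributeError):
        print('could not add info for', title_info)
        continue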
        
    product_title.append(title_info)
    product_description.append(description)
    product_material.append(material)

    
all_info = pd.DataFrame(list(zip(product_title, 
                                 product_description,
                                 product_material)),
               columns =['Name', 'Description', 'Material'])


adding The Jack in Walnut Cord
adding The Jack in Dark Navy Cord
adding The Jack in Birch Cord
adding The Western Shirt in Wheat Selvage Denim
adding The Weekend Pant in Coal Double Knit
adding The Weekend Hoodie in Coal Double Knit
adding The Weekend Pant in Coal Double Knit
adding The Bomber Jacket in Sierra Suede
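
The loop above skips any product whose title already appears in the Name column. A minimal sketch, using only the standard library, of why that kind of check is safer as a literal comparison than as a regular-expression match: characters such as ( or + in a title have special meaning inside a pattern. The titles below are hypothetical examples, not rows from the dataset.

```python
import re

# Hypothetical names standing in for the existing 'Name' column
existing_names = ['The Jack in Walnut Cord', 'The Tee (2-Pack) in White']
title = 'The Tee (2-Pack) in White'

# Read as a regex, the parentheses form a group, so the literal
# "(" and ")" in the stored name are never matched.
regex_hit = any(re.search(title, name) for name in existing_names)

# Escaping the title, or testing for a plain substring, matches literally.
escaped_hit = any(re.search(re.escape(title), name) for name in existing_names)
literal_hit = any(title in name for name in existing_names)

print(regex_hit, escaped_hit, literal_hit)  # False True True
```

With a pattern-based check, a title like this would be reported as new on every run and appended again.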


In [16]:
all_info = all_info.drop_duplicates()
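
Duplicate rows can also be avoided earlier, before any product page is requested. The repeated "adding The Weekend Pant in Coal Double Knit" line in the scraping output suggests the same product link was collected more than once, possibly because a product is listed in several collections. A minimal sketch of removing repeated URLs while keeping first-seen order; the helper name dedupe_preserve_order and the sample URLs are hypothetical.

```python
def dedupe_preserve_order(urls):
    """Return urls with repeats removed, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))


# Hypothetical product links, with one repeat
sample_urls = [
    'https://example.com/products/weekend-pant',
    'https://example.com/products/weekend-hoodie',
    'https://example.com/products/weekend-pant',
]

unique_urls = dedupe_preserve_order(sample_urls)
print(len(sample_urls), len(unique_urls))  # 3 2
```

Applied to product_urls before the second loop, this would mean each product page is fetched only once.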

In [17]:
for item in all_info['Description']:
  print(item,'\n')

Timeless tailoring, rugged build quality, and unmatched versatility have earned The Jack both legendary status, and a permanent place in our wardrobes. This time around, we’ve upped the ante in the texture department with a balanced 100% organic cotton cord that’ll stand up to whatever you throw at it, and develop amazing character as the years take their toll. From that backwoods fishing trip to a fancy dinner date, The Jack in Walnut Cord is as dependable as they come.  

Timeless tailoring, rugged build quality, and unmatched versatility have earned The Jack both legendary status, and a permanent place in our wardrobes. This time around, we’ve upped the ante in the texture department with a balanced 100% organic cotton cord that’ll stand up to whatever you throw at it and develop amazing character as the years take their toll. From that backwoods fishing trip to a fancy dinner date, The Jack in Dark Navy Cord is as dependable as they come.  

Timeless tailoring, rugged build quality

In [18]:
data = pd.concat([data,all_info])

In [19]:
data.shape

(1193, 3)
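
Because notebook cells can be rerun in any order, it can help to confirm that an append changed the row count as expected. A minimal sketch on toy data; the frames existing and new_rows are hypothetical stand-ins for data and all_info, and de-duplicating on Name is an assumption about what counts as the same product.

```python
import pandas as pd

# Hypothetical stand-ins for `data` and `all_info`
existing = pd.DataFrame({
    'Name': ['A', 'B'],
    'Description': ['d1', 'd2'],
    'Material': ['m1', 'm2'],
})
new_rows = pd.DataFrame({
    'Name': ['B', 'C'],
    'Description': ['d2', 'd3'],
    'Material': ['m2', 'm3'],
})

rows_before = len(existing)

# ignore_index=True renumbers the rows instead of repeating 0, 1, ...
combined = pd.concat([existing, new_rows], ignore_index=True)
# Assumption: two rows with the same Name describe the same product
combined = combined.drop_duplicates(subset='Name', keep='first')

rows_added = len(combined) - rows_before
print(rows_added)  # 1
```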

In [20]:
data.to_csv('taylor_stitch_info.csv',index=False,header=True)

In [10]:
from collections import Counter

In [11]:
all_descriptions = data['Description'].unique().tolist()

In [12]:
all_words = [description.split() for description in all_descriptions]
all_words = [word.lower() for sublist in all_words for word in sublist]

In [13]:
# use a separate name so the Counter class is not overwritten
word_counts = Counter(all_words)

In [14]:
most_common = dict(word_counts.most_common(1000))
  
print(most_common)

{'the': 3336, 'a': 2061, 'and': 1754, 'to': 1633, 'of': 1316, 'for': 817, 'with': 785, 'is': 764, 'our': 723, 'in': 694, 'this': 667, 'it': 544, 'you': 477, 'as': 474, 'that': 455, 'we': 405, 'on': 391, 'your': 390, 'from': 330, 'but': 323, 'its': 294, 'an': 273, 'built': 245, 'some': 245, 'up': 235, 'all': 229, 'at': 226, 'be': 209, 'or': 204, 'shirt': 199, 'when': 199, 'so': 187, 'classic': 175, 'no': 175, 'jacket': 173, 'it’s': 172, 'out': 167, 'cotton': 167, 'we’ve': 163, 'organic': 158, 'just': 158, 'has': 157, 'new': 157, 'will': 155, 'take': 150, 'short': 149, 'like': 149, 'are': 130, 'signature': 130, 'long': 129, 'keep': 124, 'can': 121, 'perfect': 121, 'rugged': 116, 'features': 115, 'jack': 113, 'iteration': 111, 'construction': 109, 'by': 108, 'tailored': 108, 'heavy': 107, 'look': 99, 'pair': 98, 'sleeve': 98, 'cut': 97, 'time': 96, 'have': 95, '100%': 95, 'that’s': 95, 'one': 95, 'tee': 93, 'looks': 92, 'back': 89, 'get': 86, 'been': 86, 'over': 85, 'cozy': 85, 'while': 8

In [15]:
word_stats = pd.DataFrame.from_dict(most_common, orient='index')

In [25]:
word_stats.loc['timeless']

0    75
Name: timeless, dtype: int64
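
The counts above come from splitting on whitespace only, so a word followed by a comma or period is counted separately from the bare word. A minimal sketch of trimming surrounding punctuation before counting; the helper name tokenize and the sample sentences are hypothetical, and this is one possible normalization rather than the one used for the numbers above.

```python
import string
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    # Also trim the curly quotes that appear in the scraped copy
    strip_chars = string.punctuation + '’‘“”'
    tokens = (word.strip(strip_chars) for word in text.lower().split())
    # Drop tokens that were punctuation only
    return [token for token in tokens if token]


# Hypothetical sample descriptions
sample = [
    'Timeless tailoring, rugged build quality.',
    'A timeless classic; rugged, too.',
]

word_counts = Counter(token for text in sample for token in tokenize(text))
print(word_counts['timeless'], word_counts['rugged'])  # 2 2
```

Only leading and trailing characters are trimmed, so contractions such as it’s stay intact, while a token like 100% becomes 100.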