# Introduction

## Problem

Eduardo and Marcelo are two Brazilians, friends and business partners. After several
successful ventures, they are planning to enter the US fashion market with an
e-commerce business model.

The initial idea is to enter the market with a single product for a specific audience:
jeans for men. The goal is to keep operating costs low and to scale as they win
customers.
However, even with the entry product and the audience defined, the two partners have no
experience in the fashion market and therefore do not know how to define basics such as
the price, the type of pants, and the material used to make each piece.

So the two partners hired a Data Science consultancy to answer the following
questions:

1. What is the best selling price for the pants?

2. How many types of pants, and in which colors, should the initial product line include?

3. Which raw materials are needed to make the pants?

The main competitors of the company, Start Jeans, are the American retailers H&M and Macy's.

## Extraction

Columns to extract from H&M:

- id 
- product_name
- product_type 
- product_color
- price 
- product_composition 

The product id is the concatenation of two ids:
- Style id 
- Color id
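
The split between the two ids can be sketched as below. This is a minimal illustration, assuming the last three characters of the article code are the color id (the same slicing used later in this notebook). The helper name `split_product_id` is hypothetical, and the sample code is taken from the output table at the end of this section.

```python
def split_product_id(product_id: str) -> tuple:
    """Split an article code into (style_id, color_id).

    Assumption: the color id is always the last three characters.
    """
    return product_id[:-3], product_id[-3:]


style_id, color_id = split_product_id("0985197006")
print(style_id, color_id)  # 0985197 006
```

The id is kept as a string on purpose: converting it to an integer would drop the leading zero.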

In [3]:
from bs4 import BeautifulSoup
import requests 
import pandas as pd
from datetime import datetime
import numpy as np

# Practice I

In [4]:
url = 'https://www2.hm.com/en_us/men/products/jeans.html'

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
page = requests.get(url , headers = headers)

In [5]:
soup_pagination = BeautifulSoup(page.text, 'html.parser')

In [6]:
total_items = soup_pagination.find('h2', class_ = 'load-more-heading').get('data-total')

In [7]:
url_total = url + '?page-size=' + str(total_items)
page = requests.get(url_total , headers = headers)
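
The pagination step above asks the listing page for every item at once by appending a `page-size` query parameter. A small sketch of that URL construction follows, using only the standard library. The helper name `build_listing_url` and the item count `"143"` are hypothetical; the real count comes from the `data-total` attribute.

```python
from urllib.parse import urlencode


def build_listing_url(base_url, total_items):
    """Return the listing URL that requests all items on a single page."""
    # int() fails loudly if the scraped attribute is missing or not numeric
    return base_url + "?" + urlencode({"page-size": int(total_items)})


url = "https://www2.hm.com/en_us/men/products/jeans.html"
url_all_items = build_listing_url(url, "143")  # "143" is a made-up count
print(url_all_items)
```

Converting the count with `int()` is a design choice here: if the heading is not found, the failure happens at this step instead of producing a malformed URL.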

In [8]:
soup = BeautifulSoup(page.text, 'html.parser')
product_list = soup.find('ul', class_ = 'products-listing small')

In [9]:
# getting product id and type in article HTML tag
pg_products_article = product_list.find_all('article', class_ = 'hm-product-item')
product_id = [p.get('data-articlecode') for p in pg_products_article]
product_type = [p.get('data-category') for p in pg_products_article]

In [10]:
# getting product name in link HTML tag
pg_products_link = product_list.find_all('a', class_ = 'link')
product_name = [p.get_text() for p in pg_products_link]

In [11]:
# getting product price in span HTML tag
pg_products_span = product_list.find_all('span', class_ = 'price regular')
product_price = [p.get_text() for p in pg_products_span]
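
The scraped prices are strings such as `$ 19.99`, so they need to be converted before any price analysis. One possible conversion is sketched below; the helper name `parse_price` is hypothetical, and it assumes US-style formatting (a dollar sign, an optional thousands comma, and a decimal point).

```python
def parse_price(raw):
    """Convert a scraped price string such as '$ 19.99' to a float."""
    # Assumption: '$' prefix, ',' as thousands separator, '.' as decimal mark
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return float(cleaned)


print(parse_price("$ 19.99"))
print(parse_price(" $ 34.99 "))
```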

In [12]:
# build a DataFrame from the scraped product attributes
df_products = pd.DataFrame([product_id, product_type, product_name, product_price]).T

In [13]:
df_products.columns = ['product_id', 'product_type', 'product_name', 'product_price']

In [14]:
# Add column describing the datetime 
df_products['scrapy_datetime'] = datetime.now().strftime('%y-%m-%d, %H:%M:%S')
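
The format string used above writes a two-digit year followed by a comma. The sketch below applies it to a fixed, illustrative timestamp and compares it with a four-digit-year variant, which is offered only as a possible alternative and is not what this notebook uses.

```python
from datetime import datetime

# Fixed value for illustration; the notebook uses datetime.now()
sample = datetime(2021, 7, 29, 16, 9, 22)

short_stamp = sample.strftime('%y-%m-%d, %H:%M:%S')  # format used in the notebook
long_stamp = sample.strftime('%Y-%m-%d %H:%M:%S')    # alternative with a 4-digit year

print(short_stamp)  # 21-07-29, 16:09:22
print(long_stamp)   # 2021-07-29 16:09:22
```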

In [61]:
df_products['style_id'] = df_products['product_id'].apply(lambda x : x[:-3])
df_products['color_id'] = df_products['product_id'].apply(lambda x : x[-3:])

# Practice III

In [54]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

df_default = pd.DataFrame(columns= ['Art. No.', 'Composition', 'Fit', 'Product safety', 'color_id', 'style_id'])
df_details = pd.DataFrame()

In [59]:
for i in range(len(df_products)): 
    # requesting the product page
    url_product = 'https://www2.hm.com/en_us/productpage.' + df_products.iloc[i]['product_id'] + '.html'
    page = requests.get(url_product , headers = headers)
    soup = BeautifulSoup(page.text, 'html.parser')

    # scraping product color and id
    pg_product_list = soup.find_all('a', class_ = 'filter-option miniature')

    product_color_name = [p.get('data-color') for p in pg_product_list]
    product_id =  [p.get('data-articlecode') for p in pg_product_list]

    # creating df of products colors 
    df_color = pd.DataFrame([product_id, product_color_name]).T
    df_color.columns = ['product_id', 'product_color']

    df_color['style_id'] = df_color['product_id'].apply(lambda x : x[:-3])
    df_color['color_id'] = df_color['product_id'].apply(lambda x : x[-3:])

    # scraping fit and composition
    pg_description_list = soup.find_all('div', class_ = 'pdp-description-list-item')
    product_composition = [list(filter(None,p.get_text().split('\n'))) for p in pg_description_list]
    product_composition = product_composition[1:]

    # creating df of product composition 
    df_composition = pd.DataFrame(product_composition).T
    df_composition.columns = df_composition.iloc[0]
    df_composition = df_composition.iloc[1:].ffill()
    
    # creating new columns, style id and color id 
    df_composition['style_id'] = df_composition['Art. No.'].apply(lambda x : x[:-3])
    df_composition['color_id'] = df_composition['Art. No.'].apply(lambda x : x[-3:])
    
    df_composition = pd.concat([df_composition, df_default], axis= 0)
    
    # merge df color + df composition 
    df_color_composition = pd.merge(df_color, df_composition[['style_id', 'Fit', 'Product safety', 'Composition']], on = 'style_id') 
    df_details = pd.concat([df_details, df_color_composition], axis = 0 )
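
The loop turns each description block into a list whose first element is the field name and whose remaining elements are that field's values. The sketch below shows that splitting step in isolation. The helper name `split_description_item` and the sample text are hypothetical; the text is only shaped like what `get_text()` might return for a composition block.

```python
def split_description_item(text):
    """Split a description block into its non-empty lines.

    The first element is the field name, the rest are its values.
    """
    # filter(None, ...) drops the empty strings left by leading,
    # trailing, or repeated newlines
    return list(filter(None, text.split('\n')))


# Hypothetical sample text
sample_text = (
    "\nComposition\n"
    "Shell: Cotton 99%, Elastane 1%\n"
    "Pocket lining: Polyester 65%, Cotton 35%\n"
)
parts = split_description_item(sample_text)
print(parts)
```

Because a field such as `Composition` can carry several values, transposing these lists gives columns of unequal length, which is why the loop forward-fills the missing cells afterwards.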

In [63]:
df_product_raw = pd.merge(df_products, df_details, on = 'style_id', how = 'left')

In [64]:
df_product_raw

Unnamed: 0,product_id_x,product_type,product_name,product_price,scrapy_datetime,style_id,color_id_x,product_id_y,product_color,color_id_y,Composition,Fit,Product safety
0,0985197006,men_jeans_slim,Slim Jeans,$ 19.99,"21-07-29, 16:09:22",0985197,006,0985197001,Black,001,"Pocket lining: Polyester 65%, Cotton 35%",,
1,0985197006,men_jeans_slim,Slim Jeans,$ 19.99,"21-07-29, 16:09:22",0985197,006,0985197001,Black,001,"Shell: Cotton 99%, Elastane 1%",,
2,0985197006,men_jeans_slim,Slim Jeans,$ 19.99,"21-07-29, 16:09:22",0985197,006,0985197002,Midnight blue,002,"Pocket lining: Polyester 65%, Cotton 35%",,
3,0985197006,men_jeans_slim,Slim Jeans,$ 19.99,"21-07-29, 16:09:22",0985197,006,0985197002,Midnight blue,002,"Shell: Cotton 99%, Elastane 1%",,
4,0985197006,men_jeans_slim,Slim Jeans,$ 19.99,"21-07-29, 16:09:22",0985197,006,0985197003,Denim blue,003,"Pocket lining: Polyester 65%, Cotton 35%",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7426,0664647004,men_jeans_skinny,Skinny Jeans,$ 34.99,"21-07-29, 16:09:22",0664647,004,0664647035,Dark blue/washed,035,"Cotton 59%, Polyester 39%, Elastane 2%",,\t\t\t FOR CHILD’S SAFE...
7427,0664647004,men_jeans_skinny,Skinny Jeans,$ 34.99,"21-07-29, 16:09:22",0664647,004,0664647036,Denim blue,036,"Cotton 59%, Polyester 39%, Elastane 2%",,\r
7428,0664647004,men_jeans_skinny,Skinny Jeans,$ 34.99,"21-07-29, 16:09:22",0664647,004,0664647036,Denim blue,036,"Cotton 59%, Polyester 39%, Elastane 2%",,\t\t\t FOR CHILD’S SAFE...
7429,0664647004,men_jeans_skinny,Skinny Jeans,$ 34.99,"21-07-29, 16:09:22",0664647,004,0664647037,Black/coated,037,"Cotton 59%, Polyester 39%, Elastane 2%",,\r
