# `MGMT 590: Web Data Analytics`
## **Final Group Project - Web Data 7**
<br>
<div style="text-align: justify"> Adidas Group recently divested Reebok, selling the underperforming shoe brand at a steep loss. Reebok has stayed faithful to its cause of revolutionary footwear despite several changes in ownership and underwhelming marketing efforts over the years, during which the company's market share dropped. To get Reebok back in the game, we want to understand its current positioning in the market and identify ways to restore its lost stature by improving customer satisfaction. </div>

### Purpose of this Notebook:
* Collect the RunRepeat search URLs to be accessed for Adidas and Reebok
* Extract high-level attributes such as Product Price, Product Score, Category, and Product Name
* Extract the User Rating by visiting each individual product page
* Clean the data into a usable format for analysis in R
<br>

### Version History:
| Author | Description | Execution Duration (mins) |
| --- | --- | --- |
| Deepa Narayanan | Web Scraping to Extract Data for Adidas | 17 |
| Deepa Narayanan | Web Scraping to Extract Data for Reebok | 12 |
| Deepa Narayanan | Data Pre-Processing | 1 |



#### Importing Libraries

In [3]:
import requests
import random
import pandas as pd
from bs4 import BeautifulSoup as bs

#### Defining Header List

In [9]:
headers_list = [

    # Chrome 107.0 Win10 (WOW64)
    {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.google.com/",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1"
    },
    # Chrome 107.0 Win10 (Win64)
    {
    "Connection": "keep-alive",
    "DNT": "1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Dest": "document",
    "Referer": "https://www.google.com/",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"
    },
    # Firefox 106.0 Win10
    {
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-User": "?1",
    "Sec-Fetch-Dest": "document",
    "Referer": "https://www.google.com/",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9"
    }
]

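# Pick one header set at random and use it for every request in this session
headers = random.choice(headers_list)
r = requests.Session()
# update() keeps the session's header container instead of replacing it with a plain dict
r.headers.update(headers)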
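
The cell above picks one header set for the whole session. As a minimal sketch of how a fresh header set could be drawn for each request instead, assuming a hypothetical helper `pick_headers` and two shortened stand-in header sets in place of the full `headers_list`:

```python
import random

# Shortened stand-ins for the notebook's headers_list (illustration only)
sample_headers = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/107.0.0.0",
     "Accept-Language": "en-US,en;q=0.5"},
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0",
     "Accept-Language": "en-US,en;q=0.9"},
]

def pick_headers(headers_list, rng=random):
    """Return a copy of one randomly chosen header set (hypothetical helper)."""
    return dict(rng.choice(headers_list))

# A seeded generator makes the choice reproducible while testing
rng = random.Random(0)
chosen = pick_headers(sample_headers, rng)
print(chosen["User-Agent"])
```

With the session from the cell above, a call such as `r.get(url, headers=pick_headers(headers_list))` would send a different set on each request; whether that is needed for RunRepeat is not established here.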

#### Extracting URLs for Adidas

In [5]:
# Page 1 has no page parameter; pages 2 to 25 add "&page=N"
adidas_base = "https://runrepeat.com/search?q=adidas+women&order_by=score"
adidas_urls = [adidas_base]
for page in range(2, 26):
    adidas_urls.append(adidas_base + "&page=" + str(page))
adidas_urls

['https://runrepeat.com/search?q=adidas+women&order_by=score',
 'https://runrepeat.com/search?q=adidas+women&order_by=score&page=2',
 'https://runrepeat.com/search?q=adidas+women&order_by=score&page=3',
 'https://runrepeat.com/search?q=adidas+women&order_by=score&page=4',
 'https://runrepeat.com/search?q=adidas+women&order_by=score&page=5',
 'https://runrepeat.com/search?q=adidas+women&order_by=score&page=6',
 'https://runrepeat.com/search?q=adidas+women&order_by=score&page=7',
 'https://runrepeat.com/search?q=adidas+women&order_by=score&page=8',
 'https://runrepeat.com/search?q=adidas+women&order_by=score&page=9',
 'https://runrepeat.com/search?q=adidas+women&order_by=score&page=10',
 'https://runrepeat.com/search?q=adidas+women&order_by=score&page=11',
 'https://runrepeat.com/search?q=adidas+women&order_by=score&page=12',
 'https://runrepeat.com/search?q=adidas+women&order_by=score&page=13',
 'https://runrepeat.com/search?q=adidas+women&order_by=score&page=14',
 'https://runrepeat.co

#### Extracting URLs for Reebok

In [6]:
# Page 1 has no page parameter; pages 2 to 11 add "&page=N"
reebok_base = "https://runrepeat.com/search?q=reebok+women&order_by=score"
reebok_urls = [reebok_base]
for page in range(2, 12):
    reebok_urls.append(reebok_base + "&page=" + str(page))
reebok_urls

['https://runrepeat.com/search?q=reebok+women&order_by=score',
 'https://runrepeat.com/search?q=reebok+women&order_by=score&page=2',
 'https://runrepeat.com/search?q=reebok+women&order_by=score&page=3',
 'https://runrepeat.com/search?q=reebok+women&order_by=score&page=4',
 'https://runrepeat.com/search?q=reebok+women&order_by=score&page=5',
 'https://runrepeat.com/search?q=reebok+women&order_by=score&page=6',
 'https://runrepeat.com/search?q=reebok+women&order_by=score&page=7',
 'https://runrepeat.com/search?q=reebok+women&order_by=score&page=8',
 'https://runrepeat.com/search?q=reebok+women&order_by=score&page=9',
 'https://runrepeat.com/search?q=reebok+women&order_by=score&page=10',
 'https://runrepeat.com/search?q=reebok+women&order_by=score&page=11']
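
The two URL-building cells above differ only in the brand and the number of pages. A possible way to fold them into one function is sketched below; `build_search_urls` and `BASE_URL` are hypothetical names, and the page counts (25 and 11) are taken from the loops above.

```python
from urllib.parse import urlencode

BASE_URL = "https://runrepeat.com/search"

def build_search_urls(brand, n_pages):
    """Return the search URL for page 1 plus pages 2..n_pages (hypothetical helper)."""
    # urlencode turns the space in "adidas women" into "+"
    query = urlencode({"q": f"{brand} women", "order_by": "score"})
    first_page = f"{BASE_URL}?{query}"
    return [first_page] + [f"{first_page}&page={page}" for page in range(2, n_pages + 1)]

adidas_urls = build_search_urls("adidas", 25)
reebok_urls = build_search_urls("reebok", 11)
print(len(adidas_urls), len(reebok_urls))
```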

#### Combining URLs for Adidas and Reebok

In [7]:
shoe_list = adidas_urls + reebok_urls
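
Concatenating the two lists discards the information about which brand each URL belongs to, so the scraping cell has to count positions to find where Reebok starts. One alternative, shown here only as a sketch with shortened stand-in lists and the hypothetical name `tagged_urls`, is to keep the brand next to each URL:

```python
# Shortened stand-ins for adidas_urls and reebok_urls
adidas_urls = [
    "https://runrepeat.com/search?q=adidas+women&order_by=score",
    "https://runrepeat.com/search?q=adidas+women&order_by=score&page=2",
]
reebok_urls = ["https://runrepeat.com/search?q=reebok+women&order_by=score"]

# Pair every URL with its brand so no position counter is needed later
tagged_urls = ([("Adidas", url) for url in adidas_urls]
               + [("Reebok", url) for url in reebok_urls])

for brand_name, url in tagged_urls:
    print(brand_name, url)
```

A loop over `tagged_urls` would then receive the brand directly, which stays correct if the number of pages per brand changes.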

#### Web Scraping

In [10]:
# Lists that collect one entry per scraped product
product_names = []
product_scores = []
review_counts = []
product_prices = []
brand = []
category = []
link_product = []

# Categories appended to every search URL.
# NOTE: 'training-shoes' is listed twice, so those pages are requested
# (and their products recorded) twice.
categories = ['running-shoes','sneakers','basketball-shoes','hiking-boots','training-shoes','hiking-shoes','training-shoes','walking-shoes','tennis-shoes','track-and-field-shoes','soccer-cleats','football-cleats','golf-shoes','cycling-shoes','climbing-shoes','mountaineering-boots','hiking-sandals']

brand_name = "Adidas"
i = 1
for url in shoe_list:
    # The first 25 URLs in shoe_list are Adidas; Reebok starts at the 26th
    if i == 26:
        brand_name = "Reebok"
        print('Changed Brand')

    for url_category in categories:
        url_appended = url + '&category=' + url_category
        print(url_appended)
        html = r.get(url_appended).text
        soup = bs(html, 'html.parser')
        for html_class in soup.find_all('li', class_="row product_list"):
            product_name = html_class.find("div", "product-name")
            product_score = html_class.find("div", "product-score")
            review_count = html_class.find("span", "reviews-count")
            product_names.append(product_name.text)
            product_scores.append(product_score.text.split(' ')[0])
            review_counts.append(review_count.text)
            brand.append(brand_name)
            category.append(url_category)
            link_product.append('https://runrepeat.com' + html_class.find('a')['href'])
            # Some listings have no price; record 'NA' so all lists stay the same length
            price = html_class.find("span", "price")
            product_prices.append(price.text if price is not None else 'NA')
    i = i + 1

https://runrepeat.com/search?q=adidas+women&order_by=score&category=running-shoes
https://runrepeat.com/search?q=adidas+women&order_by=score&category=sneakers
https://runrepeat.com/search?q=adidas+women&order_by=score&category=basketball-shoes
https://runrepeat.com/search?q=adidas+women&order_by=score&category=hiking-boots
https://runrepeat.com/search?q=adidas+women&order_by=score&category=training-shoes
https://runrepeat.com/search?q=adidas+women&order_by=score&category=hiking-shoes
https://runrepeat.com/search?q=adidas+women&order_by=score&category=training-shoes
https://runrepeat.com/search?q=adidas+women&order_by=score&category=walking-shoes
https://runrepeat.com/search?q=adidas+women&order_by=score&category=tennis-shoes
https://runrepeat.com/search?q=adidas+women&order_by=score&category=track-and-field-shoes
https://runrepeat.com/search?q=adidas+women&order_by=score&category=soccer-cleats
https://runrepeat.com/search?q=adidas+women&order_by=score&category=football-cleats
https://r
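
The category list in the scraping cell contains `'training-shoes'` twice, which is why that URL appears twice in the output above. A small sketch of removing repeated categories while keeping their order follows; the list is shortened for illustration and `unique_categories` is a hypothetical name.

```python
# Shortened category list that still contains the repeated entry
categories = ['running-shoes', 'sneakers', 'training-shoes', 'hiking-shoes', 'training-shoes']

# dict.fromkeys keeps the first occurrence of each key in insertion order
unique_categories = list(dict.fromkeys(categories))

base = "https://runrepeat.com/search?q=adidas+women&order_by=score"
category_urls = [f"{base}&category={name}" for name in unique_categories]

for category_url in category_urls:
    print(category_url)
```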

In [12]:
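product_customer_rating = []
for link in link_product:
    html = r.get(link).text
    soup = bs(html, 'html.parser')
    print(link)
    # Take the first user-vote block and record 'NA' when a page has none,
    # so this list stays aligned with link_product
    vote = soup.find('div', class_="corescore-big__users-vote")
    product_customer_rating.append(vote.text.split('/')[0] if vote is not None else 'NA')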

https://runrepeat.com/adidas-ultra-4d-50
https://runrepeat.com/adidas-adizero-adios-4
https://runrepeat.com/adidas-x90004d
https://runrepeat.com/adidas-adizero-adios-5
https://runrepeat.com/adidas-zx-2k-4d
https://runrepeat.com/adidas-4dfwd-pulse
https://runrepeat.com/adidas-adizero-pro
https://runrepeat.com/adidas-ultraboost-guard
https://runrepeat.com/adidas-terrex-agravic-boa
https://runrepeat.com/adidas-energy-falcon-x
https://runrepeat.com/adidas-ultra-4d
https://runrepeat.com/adidas-ultraboost-dna-10
https://runrepeat.com/adidas-adizero-boston-9
https://runrepeat.com/adidas-ultra-boost-20
https://runrepeat.com/adidas-ultraboost-22
https://runrepeat.com/adidas-solarglide-4
https://runrepeat.com/adidas-ultraboost-dna
https://runrepeat.com/adidas-solarglide-5
https://runrepeat.com/adidas-adizero-boston-8
https://runrepeat.com/adidas-duramo-9
https://runrepeat.com/adidas-adizero-adios-pro-20
https://runrepeat.com/adidas-adizero-takumi-sen-6
https://runrepeat.com/adidas-terrex-two-flo
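
The rating cell keeps the part of the text before the `/`. The sketch below isolates that parsing step and stores each result under its product link, so a page without a rating cannot shift the values of later products. The rating text format (`"4.5/5"`), the example links, and the names `parse_rating` and `ratings` are assumptions made for illustration.

```python
def parse_rating(text):
    """Turn text such as '4.5/5' into 4.5; return None when it cannot be parsed."""
    if not text:
        return None
    try:
        return float(text.split('/')[0].strip())
    except ValueError:
        return None

# Hypothetical scraped values: link -> raw rating text (None = element not found)
raw = {
    "https://runrepeat.com/example-shoe-a": "4.5/5",
    "https://runrepeat.com/example-shoe-b": None,
    "https://runrepeat.com/example-shoe-c": " 3.9 / 5",
}

# Keyed by link, so every rating stays attached to its own product
ratings = {link: parse_rating(text) for link, text in raw.items()}
print(ratings)
```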

### Combining Lists to Write to CSV

In [13]:
import csv

# One row per product, in the column order of `fields`.
# zip stops at the shortest list, so all lists must have the same length.
combined_list = [
    list(row)
    for row in zip(brand, product_names, product_scores, review_counts,
                   product_prices, category, product_customer_rating)
]

fields = ['Brand', 'ProductName', 'ProductScore', 'ReviewCount', 'ProductPrice', 'Category', 'UserReview']
with open('AdidasReebok.csv', 'w', newline='', encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(fields)
    writer.writerows(combined_list)
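
Because the rows are assembled from parallel lists, a length mismatch between them would either fail or misalign the columns. A minimal sketch of checking the lengths before writing is given below; `write_products` is a hypothetical helper, the two sample rows are invented, and an in-memory buffer stands in for the CSV file.

```python
import csv
import io

def write_products(columns, out):
    """Write equally long columns (dict of name -> list) as CSV rows."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    writer = csv.writer(out)
    writer.writerow(columns.keys())
    # zip(*...) turns the columns into one tuple per product
    writer.writerows(zip(*columns.values()))

# Hypothetical two-row sample
columns = {
    "Brand": ["Adidas", "Reebok"],
    "ProductName": ["Shoe A", "Shoe B"],
    "UserReview": ["4.5", "4.1"],
}
buffer = io.StringIO()
write_products(columns, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```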

### Data Preprocessing

In [15]:
# NOTE: despite its name, df_reebok holds the rows for both brands
df_reebok = pd.read_csv('C:/AR-Fmly/DN/GRE/Purdue West Lafayette/11 Fall/01 Web Data Analytics/AdidasReebok.csv')
df_reebok['UserReview'] = df_reebok['UserReview'].astype('float')
# Drop products without a price ('NA' in the CSV is read back as NaN)
df_reebok = df_reebok[df_reebok['ProductPrice'].notna()].copy()
# regex=False treats '$' as a literal character rather than a regex anchor
df_reebok['ProductPrice'] = df_reebok.ProductPrice.str.replace('$', '', regex=False).astype(int)
df_reebok['ReviewCount'] = df_reebok.ReviewCount.str.replace(' reviews', '', regex=False)
df_reebok['ReviewCount'] = df_reebok.ReviewCount.str.replace(',', '', regex=False).astype('int')
df_reebok.to_csv('Reebok.csv')
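
The string clean-up rules in the cell above can be checked on single values before they are applied to whole columns. The sketch below restates them as plain functions; the raw formats (`"$85"`, `"3,380 reviews"`) are inferred from the replacements in the cell, and `clean_price` and `clean_review_count` are hypothetical names.

```python
def clean_price(raw):
    """Strip the dollar sign and convert to int, e.g. '$85' -> 85."""
    # str.replace on a Python string is always literal, like regex=False in pandas
    return int(raw.replace('$', '').strip())

def clean_review_count(raw):
    """Strip the ' reviews' suffix and thousands separators, e.g. '3,380 reviews' -> 3380."""
    return int(raw.replace(' reviews', '').replace(',', '').strip())

print(clean_price("$85"))
print(clean_review_count("3,380 reviews"))
```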



In [17]:
df_reebok

|  | Brand | ProductName | ProductScore | ReviewCount | ProductPrice | Category | UserReview |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adidas | Adidas Adizero Adios 4 | 90 | 3380 | 85 | running-shoes | 4.5 |
| 1 | Adidas | Adidas X90004D | 87 | 67 | 100 | running-shoes | 4.5 |
| 2 | Adidas | Adidas Adizero Adios 5 | 87 | 1312 | 60 | running-shoes | 4.3 |
| 3 | Adidas | Adidas ZX 2K 4D | 68 | 21 | 100 | running-shoes | 4.2 |
| 4 | Adidas | Adidas 4DFWD Pulse New | 71 | 68 | 80 | running-shoes | 3.9 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1405 | Reebok | Reebok Classic Leather MU | 90 | 1025 | 84 | sneakers | 4.5 |
| 1406 | Reebok | Reebok Instapump Fury Zone | 90 | 9 | 170 | sneakers | 5.0 |
| 1408 | Reebok | Reebok Sole Fury Floatride | 69 | 38 | 61 | sneakers | 3.9 |
| 1410 | Reebok | Reebok Classic Leather Vector | 92 | 5639 | 80 | sneakers | 4.6 |
