# r/Sneakers Scraper
### Out of the 630 shoes models we scraped from StockX we have created a list of the model names. This script takes the list and grabs a random sample of a specified amount (in our case 30). From there we use PRAW to grab posts from the subreddit "r/Sneakers" using the shoe model name as a keyword. We also grab when the post was created as well as the post ID. Then we are able to use the individual post IDs to scrape the comments section. Up to 30 comments per post were gathered. In our original sample of 30 shoes, there were only 19 which had comments.



In [9]:
# install PRAW
# !pip install praw

In [234]:
import praw
import pandas as pd
import datetime as dt
import random
import numpy as np
import time
from praw.models import MoreComments
import json

In [4]:
# pass in keys 
%run ./reddit_keys.ipynb

In [12]:
# import sneakers_df.csv
path = r'/Users/gabbyvinco/Desktop/Sneaker_Info_data.csv'
sneaker_info_df = pd.read_csv(path, index_col=None, header=0)

In [14]:
sneaker_info_df.head(10)

Unnamed: 0,ID,Brand,Colorway,ReleaseDate,RetailPrice,Name,Volatility,ChangePercent,Gender
0,7154866b-6f46-4525-b7d3-f7f91ba78fab,New Balance,Munsell White/Holly Green,2021-02-27,130.0,New Balance 327,0.093883,0.28355,men
1,de62db29-a612-4824-bfa2-24a757233c17,New Balance,Yellow/White-Black,0,150.0,New Balance Vision Racer,0.151835,0.585,men
2,545efd57-816c-4cd5-8d4a-18deb4af035e,New Balance,White/Grey/Black,0,90.0,New Balance 327,0.697667,-0.044444,men
3,9dd76318-1e22-42fa-a8cf-2228a8a570c1,Crocs,Black,2019-12-10,60.0,Crocs Duet Max Clog,0.545088,-0.179856,men
4,85bf50a1-c610-4d9a-9993-d82a8435a296,Reebok,Dynamic Pink/Dynamic Pink/Clear,2021-02-05,100.0,Reebok Club C Cardi,0.224116,0.84,women
5,bf0e2d1f-cb4f-4103-baea-3a1951eeb990,BAPE,Green Camo/White,2020-01-25,457.0,A Bathing Ape Bapesta,0.117895,0.492537,men
6,a6d3c2bd-fc96-41f8-b6ac-ddba77474550,Yeezy,Earth Brown/Earth Brown/Earth Brown,2020-04-16,55.0,Yeezy Slide,0.122437,-0.16,men
7,685143e6-a965-4afe-a500-96d403167813,New Balance,White/Silver,2020-01-07,175.0,New Balance 992,0.259089,0.14876,men
8,4ce73ae5-1d15-402b-8306-e790512347f9,New Balance,White/Blue,0,110.0,New Balance 550,0.23329,-0.048172,men
9,74cee2cd-43cb-4ea6-a1bc-41ace69bdc9d,New Balance,Black/Black/Black,0,175.0,New Balance 990v5,0.127637,-0.085091,men


In [15]:
# get only unique sneaker names
unique_sneaker_models = sneaker_info_df["Name"].unique()

In [18]:
# take random selection of 30 out of 630
sneakers_to_search_30 = np.random.choice(unique_sneaker_models,30)

In [82]:
print(sneakers_to_search_30)

['Gucci Off The Grid High Top' 'OFF-WHITE Odsy 1000' 'Puma Clyde Hardwood'
 'Yeezy QNTM BSKTBL' 'Adilette 2' 'Yeezy QNTM BSKTBL'
 'Balenciaga Speed Trainer Lace Up' 'Puma RS-Dreamer' 'Jordan 8 Retro'
 'Air Max 95 SE' 'Reebok Question Mid' 'React Phantom Run Flyknit 2'
 'Cosmic Unity' 'Clarks Sandford' 'Saucony Grid 8000'
 'Puma Clyde Hardwood' 'Air Foamposite One' 'New Balance 997 OG'
 'Dunk Low Disrupt' 'Crocs Classic Clog' 'Vans Sk8-Hi Reissue CA'
 'Vans Comfycush Authentic' 'Arrow Canvas Mid Top' 'Futurecraft 4D 2021'
 'Reebok Club C 85' 'NMD CS2' 'Gucci Slide' 'Crocs Classic Clog'
 'Converse Jack Purcell Chukka Mid' 'Air Force 1 Low Pixel']


In [21]:
# specify the subreddit
sneakers = reddit.subreddit('Sneakers')

In [22]:
# create a dictionary to add the lists to
all_sneaker_posts = []
sneaker_count = 0

In [24]:
# finding posts on the subreddit by keyword in the sneakers_to_search_30 list

for model_name in sneakers_to_search_30:
    # create/clear the medium level for each new loop of the group of shoes
    info_by_shoe = []
    
    # pass the model name into the loop as a query
    for i in sneakers.search(model_name, limit=50):
        # assign post attributes to variables
        title = i.title
        id_num = i.id
        created = i.created
        comments_count = i.num_comments
        
        # place each post attributes in a list (the smallest level)
        output = [title,id_num,created,comments_count]
        # add the smallest level into the medium level
        info_by_shoe.append(output)
        
    #add one to sneaker count to keep track of progress    
    sneaker_count += 1
    # add the medium level to the large level which will contain all the posts for all the shoes in our random sample
    all_sneaker_posts.append(info_by_shoe)
    print("Posts added for shoe number {}.".format(sneaker_count))
    print("Please wait 2 minutes in between requests.")
    # set the loop to chill for 120 seconds (the limit should be 1 request/second, so this should be plenty of time)
    time.sleep(120)
    
print("-----------------------------------")
print("Random sample of 30 sneakers scrape complete.")

Posts added for shoe number 2.
Please wait 2 minutes in between requests.
Posts added for shoe number 3.
Please wait 2 minutes in between requests.
Posts added for shoe number 4.
Please wait 2 minutes in between requests.
Posts added for shoe number 5.
Please wait 2 minutes in between requests.
Posts added for shoe number 6.
Please wait 2 minutes in between requests.
Posts added for shoe number 7.
Please wait 2 minutes in between requests.
Posts added for shoe number 8.
Please wait 2 minutes in between requests.
Posts added for shoe number 9.
Please wait 2 minutes in between requests.
Posts added for shoe number 10.
Please wait 2 minutes in between requests.
Posts added for shoe number 11.
Please wait 2 minutes in between requests.
Posts added for shoe number 12.
Please wait 2 minutes in between requests.
Posts added for shoe number 13.
Please wait 2 minutes in between requests.
Posts added for shoe number 14.
Please wait 2 minutes in between requests.
Posts added for shoe number 15.
P

In [25]:
print(all_sneaker_posts)

[[], [], [['My first Off-white anything [ODSY-1000]', 'f53rvc', 1581945615.0, 15], ['Hypebeast couples', 'ehl57e', 1577735358.0, 12]], [['Puma Clyde Hardwood Natural, originally got these to ball in but these definitely looks good for casual use too.', 'ki4kfi', 1608669448.0, 0], ['The Puma Clyde Hardwood! Can’t wait to break these out on court when things ease up!', 'glmuya', 1589776644.0, 3], ['Puma Clyde Hardwoods on hardwood.', 'g5ds54', 1587498573.0, 5], ['Puma Clyde Advice', 'kavgr2', 1607688835.0, 5], ['Any love for Puma? Such an underrated shoe. No swoosh. No Hype 😂', 'i4d2nv', 1596687829.0, 28], ['Started Hooping in the Clyde’s and moved onto Uproars', 'gp8y97', 1590284048.0, 5]], [['Yeezy QNTM BSKTBL ‘OG’', 'lvheia', 1614654285.0, 7], ['Who can guess which is the Yeezy BSKTBL and which is the QNTM?', 'kje2tf', 1608838905.0, 11], ['QNTMs looking so pretty in the sunlight \U0001f972', 'lreoja', 1614207177.0, 19]], [['AJ1 Sizing Help', 'lb0j78', 1612314080.0, 13], ['Anyone ever 

In [173]:
# create a dictionary from all_sneaker_posts
reddit_sneaker_info = {}
count = 1

In [174]:
# adding lists to dictionary so refrenceable by shoe name
for shoe_name in sneakers_to_search_30:
    reddit_sneaker_info[shoe_name] = all_sneaker_posts[count]
    count += 1
    if count == 31:
        break
print('done')

done


In [175]:
# makes the nested lists into a dictionary
for shoe in reddit_sneaker_info:
    print(shoe)
    for post in range(0,len(reddit_sneaker_info[shoe])):
#         print(post)
        title = reddit_sneaker_info[shoe][post][0]
        post_id = reddit_sneaker_info[shoe][post][1]
        creation = reddit_sneaker_info[shoe][post][2]
        num_of_comments = reddit_sneaker_info[shoe][post][3]
#         print(title)
        reddit_sneaker_info[shoe][post] = {"Post_title": title,
                                                   "Post_id": post_id,
                                                   "Created_at": creation,
                                                   "Num_comments": num_of_comments}

Gucci Off The Grid High Top
OFF-WHITE Odsy 1000


KeyError: 0

In [211]:
reddit_sneaker_info['Puma Clyde Hardwood'][0]

{'Post_title': 'Puma Clyde Hardwood Natural, originally got these to ball in but these definitely looks good for casual use too.',
 'Post_id': 'ki4kfi',
 'Created_at': 1608669448.0,
 'Num_comments': 0}

In [None]:
# write reddit_sneaker_info to json
with open('Sneaker_posts.json', 'w') as json_file:
    json.dump(reddit_sneaker_info, json_file)

In [187]:
# start a new count to keep track of testing
new_count = 0

In [188]:
# testing the script to grab the comment

submission = reddit.submission(id=id_from_a_post)
submission.comments.replace_more(limit=None)
for comment in submission.comments.list():
    new_count +=1
    print(new_count)
    print(comment.body)

1
Man I personally don’t know

For slippers I wear these dollar store slippers. They’re sole has this stiff jelly material. It’s awesome
2
Cloudfoam is legit just as comfy as Boost.
3
No boost but I own a pair and they’re comfy as hell
4
I am intrigued. What kind of material? I’ve been looking for new indoor comfy slippers since the crocs I bought from Christmas is slowly decaying. Plus they stink when I wear them bare feet in a sunny day. Which is the worst part... I’ve tried washing them with not as much luck in these hot days. :(

Thanks for your response!
5
That’s good to hear. I might get them then if I can. I haven’t found them for retail yet which confuses me. Didn’t know they were so popular.

Thanks :)
6
I’ll get them if I can them. 

Thanks man man
7
Dude my doctor wears these super comfortable crocs. It’s got these nub things on the heel. He swears by them. He comes in wearing his dress shoes. Then switches to the crocs

Go look for em. But wear them with socks LOL 
8
Lmao m

In [219]:
# create counts to track progress while scraping

comment_count = 0
post_count = 0

In [220]:
# create lists for the information to get passed into
comments_list = []
post_id_list = []
shoe_name_list = []

In [221]:
# get comments by specific post 
#loop through each sneaker in the entire set
for shoe in reddit_sneaker_info:
    # for each sneaker extract the id number of the post
    for post in reddit_sneaker_info[shoe]:
        number_of_comments = post["Num_comments"]
        post_id = post["Post_id"]
        comment_count = 0
        if number_of_comments != 0:
            # get the comments by post_id
            submission = reddit.submission(id=post_id)
            submission.comments.replace_more(limit=None)
            for comment in submission.comments.list():
                comment_count += 1
                # prevent the script from grabbing more than 30 comments
                if comment_count < 30:
                    comment_by_post = comment.body
                    shoe_name_list.append(shoe)
                    post_id_list.append(post_id)
                    # append each comment text to list
                    comments_list.append(comment_by_post)
                    print('comment added')
                    
                elif comment_count == 30:
                    print('limit of 30 comments reached')
                    break
        elif number_of_comments == 0:
            pass
        post_count += 1
        print("please wait 45 seconds")
        # hopefully this is enough time in between requests
        time.sleep(45)
    print("Comments grabbed for post number {}".format(post_count))
print("scraping complete!") 


Comments grabbed for shoe number 0
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
Comments grabbed for shoe number 2
please wait 45 seconds
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment 

comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
limit of 30 comments reached
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
please wait 45 seconds
comment added
comment added
comment added
comment

Comments grabbed for shoe number 130
comment added
please wait 45 seconds
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
please wait 45 seconds
comment added
please wait 45 seconds
please wait 45 seconds
Comments grabbed for shoe number 138
Comments grabbed for shoe number 138
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
please wait 45 seconds
comment added
comment added
c

comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 4

comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
please wait 45 seconds
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added

comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added

comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
please wait 45 seconds
comment added
comment added
comment added
comment added
please wait 45 seconds
Comments grabbed for shoe number 356
please wait 45 seconds
Comments grabbed for shoe number 357
Comments grabbed for shoe number 357


In [235]:
# comments_list

In [236]:
# post_id_list

In [237]:
# shoe_name_list

In [226]:
# create df for comments
comments_df = pd.DataFrame()

In [227]:
# add the shoe names that correspond with each comment
comments_df["ShoeName"] = shoe_name_list

In [228]:
# add the post id that corresponds with each comment
comments_df["PostID"] = post_id_list

In [229]:
# add the comments
comments_df["Comments"] = comments_list

In [230]:
comments_df

Unnamed: 0,ShoeName,PostID,Comments
0,OFF-WHITE Odsy 1000,f53rvc,Love it. How heavy are these?
1,OFF-WHITE Odsy 1000,f53rvc,Post more pics of these.
2,OFF-WHITE Odsy 1000,f53rvc,Actually really like these. The black colorway...
3,OFF-WHITE Odsy 1000,f53rvc,The topographic versions of these go crazy
4,OFF-WHITE Odsy 1000,f53rvc,Nice pick up
...,...,...,...
2914,Gucci Slide,5lvxnl,I've honestly never owned Nike or adidas slide...
2915,Gucci Slide,6o0o9b,Adidas adilette Supercloud Plus Slides. $35\nC...
2916,Gucci Slide,6o0o9b,Adidas and Nike are the most common and I'm pr...
2917,Gucci Slide,6o0o9b,Under Armour surprisingly


In [232]:
# just to see how many shoes we have comments for out of the 30
comments_df["ShoeName"].value_counts()

NMD CS2                             550
Jordan 8 Retro                      449
Air Foamposite One                  441
Gucci Slide                         369
Reebok Club C 85                    281
Saucony Grid 8000                   183
Reebok Question Mid                 171
Puma RS-Dreamer                     105
Futurecraft 4D 2021                  55
Puma Clyde Hardwood                  47
New Balance 997 OG                   46
Air Max 95 SE                        46
Cosmic Unity                         39
Yeezy QNTM BSKTBL                    37
Adilette 2                           32
OFF-WHITE Odsy 1000                  27
React Phantom Run Flyknit 2          27
Dunk Low Disrupt                     10
Balenciaga Speed Trainer Lace Up      4
Name: ShoeName, dtype: int64

In [233]:
# save the dataframe to csv
comments_df.to_csv (r'/Users/gabbyvinco/Desktop/Comments_on_30sample.csv', index = False, header=True)