# Scrape Quantopian's Community Forum

Quantopian is a great place to share ideas, and we wanted to learn what its users are most interested in. To find out, we scraped the community discussion forum so that the data can be analyzed further. This notebook walks through every step we followed to collect the data.

Import the libraries.

In [0]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
from urllib.request import urlopen
import numpy as np

In [0]:
my_url = "https://www.quantopian.com/posts?page=1"

In [0]:
req = requests.get(my_url)

In [0]:
soup = BeautifulSoup(req.text, 'lxml')

Check a single tag first. Since it works, we can use the same approach to find all items with that tag.

In [5]:
soup.find('a', class_="user-link")

<a class="user-link" href="/users/55a2a24cd9946b6c4a0002f3">Antony Jackson</a>

In [6]:
for i in soup.select(selector='.post-title a'):
  print(i['href'])

/posts/$10k-third-party-challenge-design-a-factor-for-a-large-us-corporate-pension
/posts/conference-opportunity-odsc-east-2020-quantopian-discount
/posts/quantopian-strategic-pivot
/posts/new-strategy-presenting-the-quality-companies-in-an-uptrend-model-1
/posts/alphalens-questions-thread
/posts/new-feature-challenge-winner-badge
/posts/quantpedia-trading-strategy-series-reversals-in-the-pead
/posts/new-video-learn-from-the-experts-ep-1-full-algorithm-creation-with-vedran-rusman
/posts/when-to-combine-factors-and-when-not-to
/posts/new-tearsheet-challenge-insider-transactions-dataset-5000-dollars-in-prizes
/posts/end-of-paper-trading
/posts/learning-sdes-in-python
/posts/big-news-for-the-community-more-opportunities-to-license-your-algorithms
/posts/the-next-quantopian-based-paper-on-uncovering-momentum
/posts/forum-update-improved-thread-load-times
/posts/upgrading-to-python-3
/posts/futures-data-now-available-in-research
/posts/general-update-on-quantopian-oss
/posts/how-to-get-an-a
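To make the difference between finding one tag and selecting all matching tags concrete, here is a minimal sketch run against a small hand-written HTML fragment. The fragment and its nesting are assumptions made for illustration (only the class names `user-link` and `post-title` come from the cells above), so the real forum markup may differ. It uses Python's built-in `html.parser` so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names are taken from the notebook.
sample_html = """
<div class="post">
  <div class="post-title"><a href="/posts/first-post">First post</a></div>
  <a class="user-link" href="/users/1">Alice</a>
</div>
<div class="post">
  <div class="post-title"><a href="/posts/second-post">Second post</a></div>
  <a class="user-link" href="/users/2">Bob</a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns only the first matching tag (or None if nothing matches).
first_author = sample_soup.find("a", class_="user-link")

# find_all() returns every matching tag.
all_authors = [tag.text for tag in sample_soup.find_all("a", class_="user-link")]

# select() takes a CSS selector: here, every <a> nested inside .post-title.
hrefs = [tag["href"] for tag in sample_soup.select(".post-title a")]

print(first_author.text)
print(all_authors)
print(hrefs)
```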

Getting the total views for each post. We can use this to analyze which posts are the most popular.

In [7]:
for i in soup.select(selector='.views span'):
  print(i.text)

9k
Views
71
Views
331
Views
19.5k
Views
13.6k
Views
196
Views
19.9k
Views
1.3k
Views
1.2k
Views
10.4k
Views
5.7k
Views
7.5k
Views
4.4k
Views
3.5k
Views
180
Views
3.4k
Views
24.6k
Views
321
Views
7.8k
Views
16.7k
Views


Getting the number of replies for each article. This can indicate which articles produced the most discussion.

In [8]:
for i in soup.select(selector='.replies span'):
  print(i.text)

259
Replies
0
Replies
1
Reply
258
Replies
126
Replies
0
Replies
38
Replies
16
Replies
11
Replies
210
Replies
54
Replies
5
Replies
37
Replies
7
Replies
1
Reply
18
Replies
57
Replies
1
Reply
43
Replies
123
Replies
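Both outputs above show that each post contributes two spans: a count followed by its label. A minimal sketch of one way to keep only the counts is to take every other item. This assumes the count/label order seen above always holds; the sample list is copied from the first rows of the replies output.

```python
# First six items of the replies output above: count, label, count, label, ...
reply_spans = ["259", "Replies", "0", "Replies", "1", "Reply"]

# Items at even positions are the counts; odd positions are the labels.
reply_counts = reply_spans[0::2]
reply_labels = reply_spans[1::2]

print(reply_counts)  # ['259', '0', '1']
print(reply_labels)  # ['Replies', 'Replies', 'Reply']
```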


This is the main function that scrapes all the data. It takes page_num as an argument and returns the data for that page.


In [0]:
def get_quantopian(page_num):

  my_url = "https://www.quantopian.com/posts?page="+str(page_num)

  req = requests.get(my_url)

  soup = BeautifulSoup(req.text, 'lxml')


  links = []
  title = []
  author = []
  views = []
  replies = []

  for i in soup.select(selector='.post-title a'):
    links.append(i['href'])
  
  for i in soup.select(selector='.post-title a'):
    title.append(i.text)
  for i in soup.select(selector='.views span'):
    views.append(i.text)
  for i in soup.select(selector='.replies span'):
    replies.append(i.text)

  for i in soup.select(selector='.user-link'):
    author.append(i.text)

  return title, author, views, replies, links
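
A minimal sketch of a more defensive fetch step that a function like `get_quantopian` could call instead of using `requests.get` directly. The helper names `page_url` and `fetch_page` and the 10-second timeout are assumptions for illustration and are not part of the original notebook.

```python
import requests

BASE_URL = "https://www.quantopian.com/posts?page="


def page_url(page_num):
    """Build the forum listing URL for one page (hypothetical helper)."""
    return BASE_URL + str(page_num)


def fetch_page(page_num, timeout=10):
    """Return the HTML of one listing page, or raise if the request fails."""
    response = requests.get(page_url(page_num), timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.text


print(page_url(33))  # https://www.quantopian.com/posts?page=33
```

Failing loudly on a bad response is one way to avoid silently appending empty lists for a page that did not load.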





Testing the function on page 33.

In [0]:
t, a, v, r, l = get_quantopian(page_num=33)

The replies and views data come with extra label strings that we don't need, so we remove them from our data.

In [0]:
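# Remove the labels 'Reply' and 'Replies' from r.
# np.unique returns sorted values, and these labels sort after the digit
# strings, so they are the last two entries (assuming both occur on the page).
f1 = np.unique(r)[-1]
f2 = np.unique(r)[-2]
r = list(filter(lambda x: x != f1, r))
r = list(filter(lambda x: x != f2, r))

# Remove the label 'Views' from v (it sorts last for the same reason)
f3 = np.unique(v)[-1]
v = list(filter(lambda x: x != f3, v))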

# Storing the data in a data frame
quantopian_df = pd.DataFrame({'title':t,
              'author':a,
              'views':v,
              'replies':r,
              'links':l})
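
The `np.unique` approach works because the label strings sort after the digit strings, but it assumes that both `Reply` and `Replies` occur on the page. A hedged alternative is to filter on the known labels explicitly. The helper name `drop_labels` is hypothetical, and the sample lists are made up for illustration.

```python
# Labels seen in the scraped output above.
REPLY_LABELS = {"Reply", "Replies"}
VIEW_LABELS = {"Views"}


def drop_labels(items, labels):
    """Return items with every label string removed, keeping order."""
    return [item for item in items if item not in labels]


# Made-up sample in which the singular "Reply" never occurs.
sample_replies = ["4", "Replies", "18", "Replies"]
sample_views = ["663", "Views", "14.3k", "Views"]

print(drop_labels(sample_replies, REPLY_LABELS))  # ['4', '18']
print(drop_labels(sample_views, VIEW_LABELS))     # ['663', '14.3k']
```

On a sample like this one, `np.unique(sample_replies)[-2]` would be `'4'`, so filtering by position would drop a real count.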

We can see that the data has been saved in the DataFrame below.

In [12]:
# Hide the authors' names.
# We don't want to display them on GitHub, where this
# notebook will be posted.

quantopian_df.drop(columns='author')

| | title | views | replies | links |
|---|---|---|---|---|
| 0 | Happy 3rd Birthday to Our Community! | 663 | 4 | /posts/happy-3rd-birthday-to-our-community |
| 1 | Curriculum of study/books | 14.3k | 18 | /posts/curriculum-of-study-slash-books |
| 2 | Warren Buffett Market Crash Predictions: The C... | 1k | 1 | /posts/warren-buffett-market-crash-predictions... |
| 3 | The Quantopian Lecture Series: Notebooks, Back... | 3k | 3 | /posts/the-quantopian-lecture-series-notebooks... |
| 4 | UPDATE: Front Running S&P 500 Index Funds | 2.7k | 0 | /posts/update-front-running-s-and-p-500-index-... |
| 5 | Brent/WTI Spread Fetcher Example | 17.7k | 14 | /posts/brent-slash-wti-spread-fetcher-example |
| 6 | Research Cheat Sheet: easily move between the ... | 4.9k | 5 | /posts/research-cheat-sheet-easily-move-betwee... |
| 7 | Seeking Feedback on Trading based on Sentiment... | 4.1k | 13 | /posts/seeking-feedback-on-trading-based-on-se... |
| 8 | Mebane Faber Relative Strength Strategy with M... | 41.9k | 23 | /posts/mebane-faber-relative-strength-strategy... |
| 9 | Automated Leverage System | 1.2k | 1 | /posts/automated-leverage-system |


Now we can do the same for all 33 pages. We will write a for loop to get all the data, pausing for a random interval before each request (1 to 3 seconds, since the upper bound of np.random.randint is exclusive).

In [14]:
t1 = []
a1 = []
v1 = []
r1 = []
l1 = []

import time

for i in range(1,34):

  # Pause 1 to 3 seconds before each request (high is exclusive)
  sleep_t = int(np.random.randint(low=1,high=4))
  time.sleep(sleep_t)

  t, a, v, r, l = get_quantopian(i)

  t1.append(t)
  a1.append(a)
  v1.append(v)
  r1.append(r)
  l1.append(l)
  print(f"Finished getting data for page {i}")

Finished getting data for page 1
Finished getting data for page 2
Finished getting data for page 3
Finished getting data for page 4
Finished getting data for page 5
Finished getting data for page 6
Finished getting data for page 7
Finished getting data for page 8
Finished getting data for page 9
Finished getting data for page 10
Finished getting data for page 11
Finished getting data for page 12
Finished getting data for page 13
Finished getting data for page 14
Finished getting data for page 15
Finished getting data for page 16
Finished getting data for page 17
Finished getting data for page 18
Finished getting data for page 19
Finished getting data for page 20
Finished getting data for page 21
Finished getting data for page 22
Finished getting data for page 23
Finished getting data for page 24
Finished getting data for page 25
Finished getting data for page 26
Finished getting data for page 27
Finished getting data for page 28
Finished getting data for page 29
Finished getting data f
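
One detail in the loop above: `np.random.randint(low=1, high=4)` excludes the upper bound, so the pauses are 1, 2 or 3 seconds. A minimal sketch of an equivalent delay using only the standard library follows. The helper name `polite_delay` is hypothetical; note that `random.randint` includes both bounds.

```python
import random
import time


def polite_delay(low=1, high=3, sleep=time.sleep):
    """Pause for a random whole number of seconds in [low, high]."""
    seconds = random.randint(low, high)  # both bounds are inclusive here
    sleep(seconds)
    return seconds


# Pass a no-op sleeper to try the helper without actually waiting.
waited = polite_delay(sleep=lambda s: None)
print(waited)
```

Inside the scraping loop, calling `polite_delay()` with the default sleeper would take the place of the two sleep lines.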

Checking the length of all the lists. They are all the same length, but each one is a nested list with one sublist per page. We will flatten them using the function below.

In [15]:
len(t1), len(a1), len(l1), len(v1), len(r1)

(33, 33, 33, 33, 33)

In [0]:
def flatten(your_list):

  flat_list = []
  
  for sublist in your_list:
    for item in sublist:
      flat_list.append(item)
  return flat_list
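
For comparison, a minimal sketch of the same flattening with the standard library's `itertools.chain.from_iterable`. The sample nested list is made up for illustration.

```python
from itertools import chain

# Made-up nested list standing in for the per-page results.
nested = [["a", "b"], ["c"], [], ["d", "e"]]

# chain.from_iterable walks each sublist in order, like the nested loops above.
flat = list(chain.from_iterable(nested))

print(flat)  # ['a', 'b', 'c', 'd', 'e']
```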


In [0]:
flat_a1 = flatten(a1)
flat_t1 = flatten(t1)
flat_l1 = flatten(l1)
flat_v1 = flatten(v1)
flat_r1 = flatten(r1)

Now all our lists are flattened. We still need to remove the extra strings from the replies and views data, just as we did above.

In [18]:
len(flat_a1), len(flat_t1), len(flat_l1), len(flat_v1), len(flat_r1)

(656, 656, 656, 1312, 1312)

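As we can see, the lists are of unequal length: the views and replies lists are twice as long as the others because they still contain the label strings. We need to remove those extra strings.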

In [19]:
# Remove the str(Replies | Reply)

f1 = np.unique(flat_r1)[-1]
f2 = np.unique(flat_r1)[-2]
f1,f2

('Reply', 'Replies')

In [0]:
# Remove the replies keyword

In [0]:
flat_r1 = list(filter(lambda x: x!=f2,flat_r1))
flat_r1 = list(filter(lambda x: x!=f1,flat_r1))

In [0]:
# Remove word Views from v
f3 = np.unique(flat_v1)[-1]
flat_v1 = list(filter(lambda x: x != f3, flat_v1))

In [23]:
len(flat_a1), len(flat_t1), len(flat_l1), len(flat_v1), len(flat_r1)

(656, 656, 656, 656, 656)

Now our lists are the same length, so let's make our DataFrame.

In [0]:
quantopian_df_large = pd.DataFrame({'title':flat_t1,
              'author':flat_a1,
              'views':flat_v1,
              'replies':flat_r1,
              'links':flat_l1})

Convert the views column to float, expanding the k suffix to thousands (for example, 19.5k becomes 19500.0).

In [0]:
quantopian_df_large['views_float'] = (quantopian_df_large['views'].replace(r'[kK]+$','',regex=True).astype(float) * \
 quantopian_df_large['views'].str.extract(r'[\d\.]+([kK]+)', expand = False).fillna(1).replace(['k'], [1000]).astype(int))
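
The same conversion can be written as a small pure-Python helper, which may be easier to read and test. This is a sketch under the assumption that the only suffix is `k` (in either case); the helper name `parse_views` is hypothetical.

```python
def parse_views(text):
    """Convert a view count such as '19.5k' or '331' to a float."""
    text = text.strip()
    if text.lower().endswith("k"):
        # Drop the suffix and scale by one thousand.
        return float(text[:-1]) * 1000
    return float(text)


for raw in ["9k", "71", "19.5k"]:
    print(raw, parse_views(raw))
```

It could be applied to the column with `quantopian_df_large['views'].map(parse_views)`.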

In [26]:
quantopian_df_large.drop(columns='author')

| | title | views | replies | links | views_float |
|---|---|---|---|---|---|
| 0 | $10K Third-Party Challenge: Design a Factor fo... | 9k | 259 | /posts/$10k-third-party-challenge-design-a-fac... | 9000.0 |
| 1 | Conference Opportunity: ODSC East 2020 Quantop... | 71 | 0 | /posts/conference-opportunity-odsc-east-2020-q... | 71.0 |
| 2 | Quantopian Strategic Pivot | 331 | 1 | /posts/quantopian-strategic-pivot | 331.0 |
| 3 | New Strategy - Presenting the “Quality Compani... | 19.5k | 258 | /posts/new-strategy-presenting-the-quality-com... | 19500.0 |
| 4 | Alphalens Questions Thread | 13.6k | 126 | /posts/alphalens-questions-thread | 13600.0 |
| ... | ... | ... | ... | ... | ... |
| 651 | Backtest a unique news and blog dataset from A... | 662 | 0 | /posts/backtest-a-unique-news-and-blog-dataset... | 662.0 |
| 652 | discuss the sample algorithm | 68.2k | 23 | /posts/discuss-the-sample-algorithm-1 | 68200.0 |
| 653 | Run Summary | 4.2k | 16 | /posts/run-summary | 4200.0 |
| 654 | Ranking and Trading on "Days to Cover" | 36.2k | 5 | /posts/ranking-and-trading-on-days-to-cover | 36200.0 |
