# Top Repositories for GitHub Topics

## Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

Outline:
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic we'll get topic title, topic page URL and topic description.
- For each topic, we'll get the top 25 repositories in the topic from the topic page.
- For each repository, we'll grab the repo name, username, stars and repo URL.
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js/
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
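
As a rough illustration of the target format, the sketch below writes the two sample rows with Python's built-in `csv` module. The `rows_to_csv` helper and the `HEADER` constant are hypothetical names used only here; the notebook itself relies on pandas to write the files later on.

```python
import csv
import io

# Column order taken from the sample CSV above.
HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]

def rows_to_csv(rows):
    """Serialize (repo_name, username, stars, repo_url) tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()

sample_rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js/"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]
print(rows_to_csv(sample_rows))
```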

## Use the requests library to download web pages

In [68]:
!pip install requests --upgrade --quiet

In [3]:
import requests

In [4]:
topics_url = 'https://github.com/topics'

In [5]:
response = requests.get(topics_url)

In [6]:
response.status_code # Response code from 200-299 means successful response.

200
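
The 200-299 rule mentioned in the comment above can be written as a small check. This is only a sketch: `is_success` is a hypothetical helper name, not part of the `requests` library.

```python
def is_success(status_code):
    """Return True for HTTP status codes in the 2xx (success) range."""
    return 200 <= status_code <= 299

# Try the check on a few common status codes.
for code in (200, 204, 301, 404, 500):
    print(code, is_success(code))
```

In the notebook it could be called as `is_success(response.status_code)` before the page text is used.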

In [7]:
len(response.text)

151608

In [8]:
page_contents = response.text

In [9]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-719f1193e0c0.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-0c343b529849.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="htt

In [10]:
# Save a local copy of the downloaded page
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

## Use Beautiful Soup to parse and extract information

In [11]:
!pip install beautifulsoup4 --upgrade --quiet

In [12]:
from bs4 import BeautifulSoup

In [13]:
# BeautifulSoup(html_doc, 'html.parser')

In [14]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [15]:
type(doc)

bs4.BeautifulSoup

In [16]:
p_tags = doc.find_all('p')

In [17]:
len(p_tags)

67

In [18]:
p_tags[:5]

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Babel
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Babel is a compiler for writing next generation JavaScript, today.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Scala
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Scala is an object-oriented programming language.</p>]

In [19]:
# Our objective here is to get the p_tags corresponding to the topic name.
# By using this particular class we are able to find all p_tags of the topics.
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': selection_class})

In [20]:
len(topic_title_tags) # Filtering by class narrows the 67 <p> tags down to the topic titles

30
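
One thing worth knowing about the lookup above: when Beautiful Soup is given a string with several space-separated classes, it compares it against the exact text of the `class` attribute, so the order of the classes matters. The sketch below uses a made-up HTML snippet (not GitHub's real markup) to contrast that with a CSS selector, which matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<p class="f3 Link--primary">First</p>
<p class="Link--primary f3">Second</p>
<p class="f5">Third</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match: only the tag whose class attribute is written in this order.
exact = [p.text for p in soup.find_all("p", {"class": "f3 Link--primary"})]

# CSS selector: any <p> carrying both classes, regardless of order.
by_selector = [p.text for p in soup.select("p.f3.Link--primary")]

print(exact)
print(by_selector)
```

If a site reorders its class names, a selector-based lookup is less likely to silently return an empty list.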

In [21]:
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [22]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [23]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class': desc_selector}) # topic description tags

In [24]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [25]:
topic_title_tag0 = topic_title_tags[0]

In [26]:
topic_title_tag0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [27]:
div_tag = topic_title_tag0.parent # Parent of the first topic title tag (an <a> element, despite the variable name)

In [28]:
div_tag

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>

In [29]:
topic_link_tags = doc.find_all('a', {'class' : 'no-underline flex-grow-0'})

In [30]:
len(topic_link_tags)

30

In [31]:
topic_link_tags[0]['href']

'/topics/3d'

In [32]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
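
String concatenation works here because every `href` starts with `/`. As an alternative sketch, the standard library's `urljoin` resolves relative links and leaves absolute ones untouched; `absolute_url` and `BASE_URL` are hypothetical names introduced for this example.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def absolute_url(href, base=BASE_URL):
    """Resolve a relative or absolute href against the site's base URL."""
    return urljoin(base, href)

print(absolute_url("/topics/3d"))
print(absolute_url("https://github.com/topics/ajax"))
```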


In [33]:
topic_title_tags[0].text

'3D'

In [34]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [35]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
    
print(topic_descs)    

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source hardware and software company and maker community.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud

In [36]:
topic_descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [37]:
topic_urls = []
base_url = 'https://github.com'
for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [38]:
!pip install pandas --quiet

In [39]:
import pandas as pd

In [40]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url': topic_urls
}

In [41]:
topics_df = pd.DataFrame(topics_dict) # Convert the dictionary into a dataframe. 
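
Building a DataFrame from a dictionary of lists only works when all lists have the same length, which can fail if one of the class selectors stops matching. A minimal sketch of a check that could run first is shown below; `check_equal_lengths` is a hypothetical helper and the sample data is made up.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in a column dictionary differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return lengths

# Made-up sample data with matching lengths.
good = {"title": ["3D", "Ajax"], "url": ["/topics/3d", "/topics/ajax"]}
print(check_equal_lengths(good))
```

Calling it as `check_equal_lengths(topics_dict)` would report which column came back short before pandas raises its own error.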

In [42]:
topics_df 

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [43]:
topics_df.to_csv('topics.csv', index=False) # Leave out the row index column

## Getting information out of a topic page

In [44]:
import requests
from bs4 import BeautifulSoup

def get_topic_page(topic_url):
    """
    Download the web page at the given URL and return it as a parsed BeautifulSoup document.
    Libraries needed:
        requests, bs4 (BeautifulSoup)
    :param topic_url: address of the page to scrape
    :return: BeautifulSoup doc
    """

    # download the page
    r = requests.get(topic_url)
    # check the response
    if r.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # parse the HTML with Beautiful Soup
    topic_doc = BeautifulSoup(r.text, 'html.parser')
    return topic_doc
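
Since `get_topic_page` raises an exception when a download fails, a caller may want to retry after a short pause. The following is a minimal sketch of that idea, not part of the original notebook: `fetch_with_retries` is a hypothetical helper, and the demo uses a stand-in fetcher so it runs without network access.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay_seconds=1.0):
    """Call fetch(url), retrying after a pause when it raises an exception."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error

# Demo with a stand-in fetcher that fails twice before succeeding.
attempt_log = []

def flaky_fetch(url):
    attempt_log.append(url)
    if len(attempt_log) < 3:
        raise ConnectionError("temporary failure")
    return "contents of " + url

print(fetch_with_retries(flaky_fetch, "https://github.com/topics", delay_seconds=0))
```

With the real function, the call would look like `fetch_with_retries(get_topic_page, 'https://github.com/topics')`.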

In [45]:
url = 'https://github.com/topics'
doc = get_topic_page(url)
type(doc)

bs4.BeautifulSoup

In [46]:
doc.find('a')

<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>

In [47]:
def get_topic_title(doc):
    """
    Retrieve the titles of the topics listed on the page.
    :param doc: BeautifulSoup object for the topics page
    :return: a list of topic title strings
    """

    selected_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    # find_all is the current name; findAll is a deprecated alias
    topic_title_tags = doc.find_all('p', class_=selected_class)

    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)

    return topic_titles

In [48]:
titles = get_topic_title(doc)
len(titles), titles[:30]

(30,
 ['3D',
  'Ajax',
  'Algorithm',
  'Amp',
  'Android',
  'Angular',
  'Ansible',
  'API',
  'Arduino',
  'ASP.NET',
  'Atom',
  'Awesome Lists',
  'Amazon Web Services',
  'Azure',
  'Babel',
  'Bash',
  'Bitcoin',
  'Bootstrap',
  'Bot',
  'C',
  'Chrome',
  'Chrome extension',
  'Command line interface',
  'Clojure',
  'Code quality',
  'Code review',
  'Compiler',
  'Continuous integration',
  'COVID-19',
  'C++'])

In [49]:
def get_topic_descs(doc):
    """
    Retrieve the descriptions of the topics listed on the page.
    :param doc: BeautifulSoup object for the topics page
    :return: a list of topic description strings
    """

    selected_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', class_=selected_class)
    topic_descs = []
    for tag in topic_desc_tags:
        # strip the surrounding whitespace and newlines
        topic_descs.append(tag.text.strip())

    return topic_descs

In [50]:
descs = get_topic_descs(doc)
descs[:30]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'Azu

In [51]:
def get_topic_url(doc):
    """
    Retrieve the URLs of the topics listed on the page.
    :param doc: BeautifulSoup object for the topics page
    :return: a list of absolute topic URLs
    """

    selected_class = 'no-underline flex-grow-0'
    topic_link_tags = doc.find_all('a', {'class': selected_class})
    topic_urls = []
    base_url = 'https://github.com'

    for tag in topic_link_tags:
        # each href is relative, e.g. '/topics/3d'
        topic_urls.append(base_url + tag['href'])

    return topic_urls

In [52]:
topic_urls = get_topic_url(doc)
topic_urls[:30]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [53]:
import pandas as pd

def scrape_topics(url):
    # Download and parse the page at `url` instead of relying on a global `doc`;
    # get_topic_page raises an exception if the request fails
    doc = get_topic_page(url)

    topics_dict = {
        'title': get_topic_title(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_url(doc)
    }
    return pd.DataFrame(topics_dict)

In [54]:
topics_df = scrape_topics('https://github.com/topics')
topics_df.head()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


In [55]:
type(topics_df)

pandas.core.frame.DataFrame

In [57]:
topic_doc = get_topic_page('https://github.com/topics')

In [58]:
topic_doc


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-719f1193e0c0.css" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-0c343b529849.css" media="all" rel="stylesheet"><link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://github.githubassets.com/assets/dark_dimmed-f22da508b62a.c

In [59]:
def parse_star_count(stars_str):
    # Convert a star label such as '86.8k' or '512' to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [60]:
# Retrieve 3D topic
r = requests.get('https://github.com/topics/3d')
topic_doc = BeautifulSoup(r.text, 'html.parser')

# Define a selection tag
a_selection_class = 'Counter js-social-count'
star_tags = topic_doc.find_all('span', {'class': a_selection_class})
star_tags[0].text.strip()

'86.8k'

In [61]:
parse_star_count(star_tags[0].text.strip())

86800
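
The conversion can be checked on a few hand-written labels. The variant below is a sketch under two assumptions that the page output above does not confirm: that a label might use an uppercase `K`, and that it might contain thousands separators such as `1,234`. `stars_to_int` is a hypothetical name, and it uses `round` rather than `int` as a guard against floating-point products that land just below a whole number.

```python
def stars_to_int(text):
    """Convert a star label such as '86.8k' or '1,234' to an integer."""
    # Normalize whitespace, case and thousands separators (assumed formats).
    cleaned = text.strip().lower().replace(",", "")
    if cleaned.endswith("k"):
        # '86.8k' means 86.8 thousand
        return round(float(cleaned[:-1]) * 1000)
    return int(cleaned)

for label in ("86.8k", " 512 ", "1,234", "3K"):
    print(repr(label), stars_to_int(label))
```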

In [62]:
def get_repo_info(h3_tag, star_tag):
    # Return the username, repo name, repo URL and star count for one repository.
    # h3_tag holds two links: the first is the user, the second is the repository.
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = 'https://github.com' + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars

In [63]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', class_ = h3_selection_class)
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', 86800)

In [64]:
def get_topic_repos(topic_doc):

    # <h3> tags holding the username and repository links
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

    # <span> tags holding the star counts
    a_selection_class = 'Counter js-social-count'
    star_tags = topic_doc.find_all('span', {'class': a_selection_class})

    topic_repo_dic = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}

    # Get repo info, pairing each repo tag with the star tag at the same position
    for i in range(len(repo_tags)):
        username, repo_name, repo_url, stars = get_repo_info(repo_tags[i], star_tags[i])
        topic_repo_dic['username'].append(username)
        topic_repo_dic['repo_name'].append(repo_name)
        topic_repo_dic['repo_url'].append(repo_url)
        topic_repo_dic['stars'].append(stars)

    return pd.DataFrame(topic_repo_dic)

## Create CSV file(s) with the extracted information

In [65]:
import os

def scrape_topic(topic_url, path):
    # Skip topics whose CSV file has already been written
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)

In [66]:
import os

def scrape_topics_repos(url):
    print('Scraping list of topics')
    topics_df = scrape_topics(url)
    folder_name = "Scraped_csv"
    try:
        os.makedirs(folder_name, exist_ok = True)
    except OSError:
        print ("Creation of the directory %s failed" % folder_name)
    else:
        print ("Successfully created the directory %s " % folder_name)

    for index, row in topics_df.iterrows():
        print('Scraping top repositories for {}'.format(row['title']))
        scrape_topic(row['url'], folder_name + '/{}.csv'.format(row['title']))
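
Because the topic title is used directly as the file name, a title containing a character such as `/` would be read as a path separator. A possible safeguard is sketched below; `safe_filename` is a hypothetical helper, and the set of allowed characters is an arbitrary choice for this example.

```python
import re

def safe_filename(title, extension=".csv"):
    """Build a filesystem-friendly file name from a topic title."""
    # Replace every run of characters outside the allowed set with one underscore.
    slug = re.sub(r"[^A-Za-z0-9._+-]+", "_", title.strip())
    # Fall back to a placeholder if nothing usable is left.
    slug = slug.strip("_") or "untitled"
    return slug + extension

for title in ("Chrome extension", "C++", "ASP.NET", "a/b"):
    print(title, "->", safe_filename(title))
```

The path passed to `scrape_topic` could then be built with `os.path.join(folder_name, safe_filename(row['title']))`.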

In [67]:
scrape_topics_repos(url)

Scraping list of topics
Successfully created the directory Scraped_csv 
Scraping top repositories for 3D
Scraping top repositories for Ajax
Scraping top repositories for Algorithm
Scraping top repositories for Amp
Scraping top repositories for Android
Scraping top repositories for Angular
Scraping top repositories for Ansible
Scraping top repositories for API
Scraping top repositories for Arduino
Scraping top repositories for ASP.NET
Scraping top repositories for Atom
Scraping top repositories for Awesome Lists
Scraping top repositories for Amazon Web Services
Scraping top repositories for Azure
Scraping top repositories for Babel
Scraping top repositories for Bash
Scraping top repositories for Bitcoin
Scraping top repositories for Bootstrap
Scraping top repositories for Bot
Scraping top repositories for C
Scraping top repositories for Chrome
Scraping top repositories for Chrome extension
Scraping top repositories for Command line interface
Scraping top repositories for Clojure
Scrapin

## Document and share your work