## OUTLINE:

- We'll scrape https://github.com/topics.
- We'll inspect the page for HTML tags.
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description.
- For each topic, we'll get the top repositories listed on the topic page (30 per page in the output shown below).
- For each repository, we'll grab the repo name, username, stars and repo URL.
- We'll build a DataFrame from this information.


In [1]:
!pip install requests pandas beautifulsoup4



In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

## Getting the HTML of the GitHub Topics Page

In [2]:
page_url = "https://github.com/topics"

In [3]:
page_html = requests.get(page_url).text

In [4]:
soup = BeautifulSoup(page_html, 'html.parser')

## Finding the Topic Titles

In [5]:
topic_title_tags = soup.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')

In [6]:
topic_title_tags[0].text

'3D'
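
Passing a multi-class string to `class_` makes Beautiful Soup compare it against the exact value of the `class` attribute, so the match depends on the order of the classes. A minimal sketch of an order-independent alternative using a CSS selector follows; the HTML snippet is made up for illustration and is not GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes, written in a different order.
sample_html = """
<p class="f3 Link--primary">3D</p>
<p class="Link--primary f3">Ajax</p>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# A multi-class string only matches the exact attribute value.
exact = sample_soup.find_all('p', class_='f3 Link--primary')
# A CSS selector matches any element that carries both classes.
any_order = sample_soup.select('p.f3.Link--primary')

print([tag.text for tag in exact])
print([tag.text for tag in any_order])
```

If GitHub reorders or adds utility classes, the selector form is more likely to keep working, although any scraper tied to class names can still break when the page changes.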

## Finding the Topic Descriptions

In [7]:
topic_desc_tags = soup.find_all('p', class_='f5 color-fg-muted mb-0 mt-1')

## Printing the Topic URL

In [8]:
topic_url_tags = soup.find_all('a', class_='no-underline flex-1 d-flex flex-column')

In [9]:
base_url = 'https://github.com'
print(base_url + topic_url_tags[0]['href'])

https://github.com/topics/3d
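
Building the URL by string concatenation only works when `base_url` holds just the scheme and host, because each `href` already starts with `/topics/`. As a hedged alternative, the standard library's `urljoin` resolves a root-relative link against the page it came from. The `href` value below is a sample taken from the output in this notebook.

```python
from urllib.parse import urljoin

page_url = 'https://github.com/topics'
href = '/topics/3d'  # root-relative link, as found in the anchor tags

# urljoin resolves the href against the page URL, so a root-relative
# path replaces the page's own path instead of being appended to it.
topic_url = urljoin(page_url, href)
print(topic_url)
```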


## Converting the Tags into Lists

In [10]:
topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)


topic_descs = []
for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
    
topic_urls = []
base_url = 'https://github.com'
for tag in topic_url_tags:
    topic_urls.append(base_url + tag['href'])


In [11]:
print(topic_urls)   
print(topic_titles)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t

In [12]:
topic_descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

## Making a DataFrame for Title, Description and URL

In [13]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url': topic_urls
}

In [14]:
topics_df = pd.DataFrame(topics_dict)
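
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which would happen here if a topic card were missing one of its three elements. A small sketch of a check that could run first is shown below; `check_equal_lengths` is a hypothetical helper, and the sample values are copied from the outputs above.

```python
def check_equal_lengths(columns):
    """Return the length of each column, or raise ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

# Sample data taken from the first two topics scraped above.
sample = {
    'title': ['3D', 'Ajax'],
    'description': [
        '3D modeling is the process of virtually developing the surface and structure of a 3D object.',
        'Ajax is a technique for creating interactive web applications.',
    ],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}

print(check_equal_lengths(sample))
```

The error message lists the length of every column, which points directly at the selector that returned too few or too many tags.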

## DataFrame with Topic Titles, Descriptions and URLs

In [15]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [None]:
#topics_df.to_csv('Topic_info.csv')

In [40]:
topics_df.title.values

['3D' 'Ajax' 'Algorithm' 'Amp' 'Android' 'Angular' 'Ansible' 'API'
 'Arduino' 'ASP.NET' 'Atom' 'Awesome Lists' 'Amazon Web Services' 'Azure'
 'Babel' 'Bash' 'Bitcoin' 'Bootstrap' 'Bot' 'C' 'Chrome'
 'Chrome extension' 'Command line interface' 'Clojure' 'Code quality'
 'Code review' 'Compiler' 'Continuous integration' 'COVID-19' 'C++']


# Finding Details of the Top Repos in Each Topic

In [16]:
topic_page_url = topic_urls[0]
# Download the HTML of the first topic page and parse it
topic_page_html = requests.get(topic_page_url).text
soup2 = BeautifulSoup(topic_page_html, 'html.parser')

In [17]:
repo_tags = soup2.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')

In [18]:
repo_tags

[<h3 class="f3 color-fg-muted text-normal lh-condensed">
 <a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
   
             mrdoob
 
   
 </a>          /
           <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-h

In [19]:
a_tags = repo_tags[0].find_all('a')

## URL of the Repository

In [20]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


## Star Count of the Repos

In [21]:
star_tags = soup2.find_all('span', class_='Counter js-social-count')

In [22]:
len(star_tags)

30

In [23]:
star_tags[0].text.strip()

'81.7k'

In [24]:
def star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [25]:
star_count(star_tags[0].text.strip())

81700
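
`star_count` covers the `k` suffix seen above. A hedged variant is sketched below, assuming counts could also appear with an `m` suffix or with thousands separators (neither is shown in this notebook); `parse_count` is a hypothetical name. It uses `Decimal` so the multiplication is exact rather than subject to float rounding.

```python
from decimal import Decimal

# Multipliers for abbreviated counts. The 'm' entry is an assumption;
# only the 'k' suffix appears in the outputs of this notebook.
SUFFIXES = {'k': 1_000, 'm': 1_000_000}

def parse_count(text):
    """Convert strings such as '81.7k' or '1,234' to an integer."""
    text = text.strip().lower().replace(',', '')
    if text and text[-1] in SUFFIXES:
        # Decimal keeps '81.7' exact, so the product is exactly 81700
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)

print(parse_count('81.7k'))  # 81700
```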

## Defining a Function to Find the Username, Repo Name, Star Count and Repo URL

In [26]:
def get_repo_info(repo_tag, star_tag):
    # Returns all the required info about a single repository.
    # The h3 tag holds two links: the owner first, then the repository.
    a_tags = repo_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [27]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 81700, 'https://github.com/mrdoob/three.js')

## Putting the result into a Dictionary

In [28]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}


for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])

In [29]:
print(topic_repos_dict['username'])

['mrdoob', 'libgdx', 'pmndrs', 'BabylonJS', 'aframevr', 'ssloy', 'lettier', 'FreeCAD', 'metafizzy', 'CesiumGS', 'timzhang642', 'a1studmuffin', 'isl-org', 'blender', 'domlysz', 'spritejs', 'openscad', 'jagenjo', 'tensorspace-team', 'YadiraF', 'AaronJackson', 'google', 'ssloy', 'mosra', 'FyroxEngine', 'gfxfundamentals', 'tengbao', 'cleardusk', 'jasonlong', 'cnr-isti-vclab']


In [30]:
print(topic_repos_dict['repo_name'])

['three.js', 'libgdx', 'react-three-fiber', 'Babylon.js', 'aframe', 'tinyrenderer', '3d-game-shaders-for-beginners', 'FreeCAD', 'zdog', 'cesium', '3D-Machine-Learning', 'SpaceshipGenerator', 'Open3D', 'blender', 'BlenderGIS', 'spritejs', 'openscad', 'webglstudio.js', 'tensorspace', 'PRNet', 'vrn', 'model-viewer', 'tinyraytracer', 'magnum', 'Fyrox', 'webgl-fundamentals', 'vanta', '3DDFA', 'isometric-contributions', 'meshlab']


In [31]:
print(topic_repos_dict['stars'])

[81700, 20000, 17800, 16700, 14100, 13600, 12800, 11300, 9100, 8600, 7900, 7100, 6700, 5500, 5200, 4900, 4800, 4600, 4600, 4600, 4400, 4300, 4200, 4000, 3600, 3600, 3500, 3200, 3200, 2900]


## Defining Functions to Get a pandas DataFrame with Repository Info

In [32]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    soup3 = BeautifulSoup(response.text, 'html.parser')
    return soup3

def get_repo_info(repo_tag, star_tag):
    # Returns all the required info about a single repository.
    # The h3 tag holds two links: the owner first, then the repository.
    a_tags = repo_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(soup3):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = soup3.find_all('h3', class_=h3_selection_class)
    # Get star tags
    star_tags = soup3.find_all('span', class_='Counter js-social-count')

    topic_repos_dict = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}

    # Get repo info; repo_tags[i] and star_tags[i] belong to the same repository
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)

In [33]:
get_topic_page('https://github.com/topics/ansible')


<!DOCTYPE html>

<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-21555afbe856.css" integrity="sha512-IVVa++hW3DBYJnNsmMMiUwt96BJ1mjUpGNDRWeui5BY1iA04E58M5NujgomnZU9R9DB+H99IlE7a+9b5XlO25g==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-d4a90c367f0c.css" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+solP87jByEvY/g4BmoxLihRogKcX1obPnf4Yp7

In [34]:
get_topic_repos(get_topic_page('https://github.com/topics/ansible'))

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | ansible | ansible | 53000 | https://github.com/ansible/ansible |
| 1 | trailofbits | algo | 25400 | https://github.com/trailofbits/algo |
| 2 | bregman-arie | devops-exercises | 23600 | https://github.com/bregman-arie/devops-exercises |
| 3 | StreisandEffect | streisand | 22800 | https://github.com/StreisandEffect/streisand |
| 4 | kubernetes-sigs | kubespray | 12200 | https://github.com/kubernetes-sigs/kubespray |
| 5 | MichaelCade | 90DaysOfDevOps | 11800 | https://github.com/MichaelCade/90DaysOfDevOps |
| 6 | ansible | awx | 10900 | https://github.com/ansible/awx |
| 7 | easzlab | kubeasz | 8000 | https://github.com/easzlab/kubeasz |
| 8 | geerlingguy | ansible-for-devops | 5700 | https://github.com/geerlingguy/ansible-for-devops |
| 9 | khuedoan | homelab | 5700 | https://github.com/khuedoan/homelab |


In [42]:
topics_df.title.values

array(['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible',
       'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists',
       'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin',
       'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension',
       'Command line interface', 'Clojure', 'Code quality', 'Code review',
       'Compiler', 'Continuous integration', 'COVID-19', 'C++'],
      dtype=object)

In [44]:
topic_selected = '3D'

In [52]:
topic_arr = topics_df.loc[topics_df['title'] == topic_selected].url.values

In [54]:
topic_arr[0]

'https://github.com/topics/3d'
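
`topic_arr[0]` raises an `IndexError` when no title matches `topic_selected`. A minimal sketch of a lookup that fails with a clearer message is shown below, using a plain dictionary built from the title and URL lists. `find_topic_url` is a hypothetical helper, the sample values come from the outputs above, and matching case-insensitively is a design choice that differs from the exact comparison used in the cell above.

```python
def find_topic_url(title, titles, urls):
    """Return the URL for `title`, matching case-insensitively."""
    lookup = {t.lower(): u for t, u in zip(titles, urls)}
    try:
        return lookup[title.lower()]
    except KeyError:
        raise ValueError('Unknown topic: {!r}'.format(title)) from None

# Sample values copied from the scraped topic list.
sample_titles = ['3D', 'Ajax', 'Ansible']
sample_urls = [
    'https://github.com/topics/3d',
    'https://github.com/topics/ajax',
    'https://github.com/topics/ansible',
]

print(find_topic_url('3d', sample_titles, sample_urls))
```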