Outline:
- We are going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll extract the topic title, topic page URL, and topic description.
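- For each topic, we'll get the top repositories listed on the topic page (the first page returned 20 in this run). For each repository, we'll extract the repo name, owner username, and star count.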

## Importing the Dependencies

In [1]:
import requests

In [2]:
topics_url = 'https://github.com/topics'

In [3]:
response = requests.get(topics_url)

In [4]:
response.status_code # 200 if successful

200
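
The manual status check above can be folded into a small helper. The sketch below is one possible shape, assuming a hypothetical `fetch_page` function that is not part of this notebook. It relies on `raise_for_status()` and the `timeout` argument from `requests`, and takes an optional session-like object so it can be exercised without network access.

```python
import requests


def fetch_page(url, session=None, timeout=10):
    """Download `url` and return the response body as text.

    `fetch_page` is a hypothetical helper, not part of the notebook.
    `session` may be a requests.Session (or any object with a compatible
    `get` method); by default the module-level requests.get is used.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently handing back an error page.
    response.raise_for_status()
    return response.text
```

With this in place, `fetch_page(topics_url)` would stand in for the separate `requests.get` and `status_code` cells.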

In [5]:
response.text

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="true">\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-b92e9647318f.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-5d486a4ede8e.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" m

In [6]:
len(response.text)

166238

In [7]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
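
Saving the page also makes it possible to re-run the parsing steps without downloading again. A minimal sketch of that idea, assuming a hypothetical `load_or_fetch` helper (the name and its parameters are illustrative only):

```python
from pathlib import Path


def load_or_fetch(path, fetch):
    """Return cached HTML from `path`, calling `fetch()` only on a cache miss.

    Hypothetical helper: `fetch` is any zero-argument callable that
    returns the page HTML as a string.
    """
    cache = Path(path)
    if cache.exists():
        return cache.read_text(encoding='utf-8')
    html = fetch()
    cache.write_text(html, encoding='utf-8')
    return html
```

In this notebook it could be called as `load_or_fetch('webpage.html', lambda: requests.get(topics_url).text)`.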

## Use Beautiful Soup to Parse and Extract Information

In [8]:
from bs4 import BeautifulSoup

In [9]:
soup = BeautifulSoup(response.text, 'html.parser')

In [10]:
soup


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-b92e9647318f.css" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-5d486a4ede8e.css" media="all" rel="stylesheet"><link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://github.githubassets.com/

In [11]:
type(soup)

bs4.BeautifulSoup

In [12]:
topic_title_tags = soup.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')

In [13]:
len(topic_title_tags)

30

In [14]:
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [15]:
topic_description_tags = soup.find_all('p', class_='f5 color-fg-muted mb-0 mt-1')

In [16]:
len(topic_description_tags)

30

In [17]:
topic_description_tags

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Angular is an open source web application platform.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ansible is a simple and powerful automation engine.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           An API (App

In [18]:
topic_title_tags[0].parent['href']

'/topics/3d'
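
Reading `.parent['href']` works because each title `<p>` sits directly inside the `<a>` that links to the topic. The sketch below reproduces that on a small, made-up HTML fragment (the markup in `SAMPLE_HTML` is an assumption that only imitates the real page). It matches on the single class `Link--primary` rather than the full class string, which would only match when the classes appear in exactly that order.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup imitating two topic cards.
SAMPLE_HTML = """
<a href="/topics/3d" class="no-underline">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    3D refers to three-dimensional graphics.
  </p>
</a>
<a href="/topics/ajax" class="no-underline">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    Ajax is a technique for creating interactive web applications.
  </p>
</a>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

topics = []
for title_tag in sample_soup.find_all('p', class_='Link--primary'):
    link = title_tag.find_parent('a')  # nearest enclosing <a>
    topics.append((title_tag.get_text(strip=True), link['href']))

print(topics)  # [('3D', '/topics/3d'), ('Ajax', '/topics/ajax')]
```

Collecting the title and link in the same loop keeps each pair together, instead of relying on two separate lists lining up.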

In [19]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

In [20]:
topic_titles

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [21]:
topic_descriptions = []

for tag in topic_description_tags:
    topic_descriptions.append(tag.text.strip())

In [22]:
topic_descriptions

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source platform for building electronic devices.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'Azure 
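
`strip()` removes the leading and trailing whitespace seen in the raw tags, but leaves any line breaks inside a description untouched. If that ever matters, one option is to collapse all whitespace runs; `clean_text` below is a hypothetical helper, not something the notebook defines.

```python
def clean_text(raw):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return ' '.join(raw.split())


print(clean_text('\n     Ajax is a\n  technique.  \n'))  # Ajax is a technique.
```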

In [23]:
topic_urls = []

for tag in topic_title_tags:
    topic_urls.append('https://www.github.com' + tag.parent['href'])

In [24]:
topic_urls

['https://www.github.com/topics/3d',
 'https://www.github.com/topics/ajax',
 'https://www.github.com/topics/algorithm',
 'https://www.github.com/topics/amphp',
 'https://www.github.com/topics/android',
 'https://www.github.com/topics/angular',
 'https://www.github.com/topics/ansible',
 'https://www.github.com/topics/api',
 'https://www.github.com/topics/arduino',
 'https://www.github.com/topics/aspnet',
 'https://www.github.com/topics/atom',
 'https://www.github.com/topics/awesome',
 'https://www.github.com/topics/aws',
 'https://www.github.com/topics/azure',
 'https://www.github.com/topics/babel',
 'https://www.github.com/topics/bash',
 'https://www.github.com/topics/bitcoin',
 'https://www.github.com/topics/bootstrap',
 'https://www.github.com/topics/bot',
 'https://www.github.com/topics/c',
 'https://www.github.com/topics/chrome',
 'https://www.github.com/topics/chrome-extension',
 'https://www.github.com/topics/cli',
 'https://www.github.com/topics/clojure',
 'https://www.github.co
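
String concatenation works here because every `href` starts with `/`. A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library, which also copes with hrefs that are already absolute. `absolute_url` and `BASE_URL` are illustrative names, and the base is taken from `topics_url`, so the host comes out as `github.com` rather than `www.github.com`.

```python
from urllib.parse import urljoin

# Same address as topics_url earlier in the notebook.
BASE_URL = 'https://github.com/topics'


def absolute_url(href, base=BASE_URL):
    """Resolve a page-relative href against the page it was found on."""
    return urljoin(base, href)


print(absolute_url('/topics/3d'))  # https://github.com/topics/3d
```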

## Create CSV file with the Extracted Information

In [25]:
import pandas as pd

In [26]:
df = pd.DataFrame({'Title': topic_titles, 'URL': topic_urls, 'Description': topic_descriptions})
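
The dictionary form pairs the three lists by position, so it only gives correct rows when the lists have the same length and order. A small guard can make that assumption explicit before the DataFrame is built; `build_records` is a hypothetical helper shown as a sketch.

```python
def build_records(titles, urls, descriptions):
    """Zip three parallel lists into row dictionaries, refusing mismatched lengths."""
    if not (len(titles) == len(urls) == len(descriptions)):
        raise ValueError(
            f'length mismatch: {len(titles)} titles, '
            f'{len(urls)} URLs, {len(descriptions)} descriptions'
        )
    return [
        {'Title': title, 'URL': url, 'Description': description}
        for title, url, description in zip(titles, urls, descriptions)
    ]
```

The result can be passed straight to pandas, for example `pd.DataFrame(build_records(topic_titles, topic_urls, topic_descriptions))`.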

In [27]:
df

| | Title | URL | Description |
|---|---|---|---|
| 0 | 3D | https://www.github.com/topics/3d | 3D refers to the use of three-dimensional grap... |
| 1 | Ajax | https://www.github.com/topics/ajax | Ajax is a technique for creating interactive w... |
| 2 | Algorithm | https://www.github.com/topics/algorithm | Algorithms are self-contained sequences that c... |
| 3 | Amp | https://www.github.com/topics/amphp | Amp is a non-blocking concurrency library for ... |
| 4 | Android | https://www.github.com/topics/android | Android is an operating system built by Google... |
| 5 | Angular | https://www.github.com/topics/angular | Angular is an open source web application plat... |
| 6 | Ansible | https://www.github.com/topics/ansible | Ansible is a simple and powerful automation en... |
| 7 | API | https://www.github.com/topics/api | An API (Application Programming Interface) is ... |
| 8 | Arduino | https://www.github.com/topics/arduino | Arduino is an open source platform for buildin... |
| 9 | ASP.NET | https://www.github.com/topics/aspnet | ASP.NET is a web framework for building modern... |


In [28]:
df.to_csv('topics.csv', index=False)

## Extracting Information from a Topic Page

In [29]:
topic_page_url = topic_urls[0]
topic_page_url

'https://www.github.com/topics/3d'

In [30]:
response = requests.get(topic_page_url)

In [32]:
response.status_code

200

In [33]:
len(response.text)

484075

In [31]:
soup = BeautifulSoup(response.text, 'html.parser')

In [43]:
h3_tags = soup.find_all('h3', class_="f3 color-fg-muted text-normal lh-condensed")

In [45]:
len(h3_tags)

20

In [49]:
h3_tags[0].find('a')['href'][1:]

'mrdoob'

In [50]:
usernames = []

for tag in h3_tags:
    usernames.append(tag.find('a')['href'][1:])

In [51]:
usernames

['mrdoob',
 'pmndrs',
 'libgdx',
 'BabylonJS',
 'ssloy',
 'lettier',
 'aframevr',
 'FreeCAD',
 'CesiumGS',
 'MonoGame',
 'metafizzy',
 'blender',
 'isl-org',
 'timzhang642',
 'a1studmuffin',
 'nerfstudio-project',
 'domlysz',
 'FyroxEngine',
 'google',
 'openscad']

In [52]:
len(usernames)

20

In [55]:
h3_tags[0].find_all('a')[1].text.strip()

'three.js'
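
The owner and repository name are read from two separate links inside each `<h3>`. If the repository link's `href` has the form `/owner/repo` (an assumption; the notebook only prints the owner link's `href`), both values could instead be recovered from that single string. `split_repo_path` is a hypothetical helper sketching this.

```python
def split_repo_path(href):
    """Split a path such as '/owner/repo' into ('owner', 'repo')."""
    parts = href.strip('/').split('/')
    if len(parts) != 2 or not all(parts):
        raise ValueError(f'not an owner/repo path: {href!r}')
    return parts[0], parts[1]


print(split_repo_path('/mrdoob/three.js'))  # ('mrdoob', 'three.js')
```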

In [56]:
repo_names = []

for tag in h3_tags:
    repo_names.append(tag.find_all('a')[1].text.strip())

In [57]:
repo_names

['three.js',
 'react-three-fiber',
 'libgdx',
 'Babylon.js',
 'tinyrenderer',
 '3d-game-shaders-for-beginners',
 'aframe',
 'FreeCAD',
 'cesium',
 'MonoGame',
 'zdog',
 'blender',
 'Open3D',
 '3D-Machine-Learning',
 'SpaceshipGenerator',
 'nerfstudio',
 'BlenderGIS',
 'Fyrox',
 'model-viewer',
 'openscad']

In [58]:
len(repo_names)

20

In [61]:
star_tags = soup.find_all('span', {'id': 'repo-stars-counter-star'})

In [62]:
len(star_tags)

20

In [67]:
float(star_tags[0].text[:-1])

95.3

In [72]:
stars_count = []

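for tag in star_tags:
    text = tag.text.strip()
    # '95.3k' means 95,300; counts below 1,000 carry no 'k' suffix.
    stars_count.append(round(float(text[:-1]) * 1000) if text.endswith('k') else int(text))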

In [73]:
stars_count

[95300,
 24300,
 22100,
 21600,
 18200,
 16200,
 15700,
 15500,
 11200,
 10200,
 10000,
 9900,
 9600,
 9200,
 7500,
 6900,
 6800,
 6700,
 6100,
 6000]
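
The star counts on the page are abbreviated (`95.3k`), which is why the last character is dropped before converting. Repositories with fewer than 1,000 stars show a plain number, so slicing off the last character unconditionally would corrupt those values. A reusable parser might look like the sketch below; `parse_star_count` is a hypothetical helper, and the comma handling is a precaution rather than something observed in this output.

```python
def parse_star_count(text):
    """Convert a displayed star count such as '95.3k' or '856' to an int."""
    text = text.strip().lower()
    if text.endswith('k'):
        # round() avoids off-by-one results from floating-point error.
        return round(float(text[:-1]) * 1000)
    return int(text.replace(',', ''))


print(parse_star_count('95.3k'))  # 95300
print(parse_star_count('856'))    # 856
```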

## Creating Pandas DataFrame

In [74]:
df = pd.DataFrame({'Repo Name': repo_names, 'Username': usernames, 'Stars': stars_count})

In [75]:
df

| | Repo Name | Username | Stars |
|---|---|---|---|
| 0 | three.js | mrdoob | 95300 |
| 1 | react-three-fiber | pmndrs | 24300 |
| 2 | libgdx | libgdx | 22100 |
| 3 | Babylon.js | BabylonJS | 21600 |
| 4 | tinyrenderer | ssloy | 18200 |
| 5 | 3d-game-shaders-for-beginners | lettier | 16200 |
| 6 | aframe | aframevr | 15700 |
| 7 | FreeCAD | FreeCAD | 15500 |
| 8 | cesium | CesiumGS | 11200 |
| 9 | MonoGame | MonoGame | 10200 |


In [81]:
def get_topic_repo(topic_url):
    """Return a DataFrame of the repositories listed on a GitHub topic page."""
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topic_url}')
    
    soup = BeautifulSoup(response.text, 'html.parser')
    
    repo_info = {
        'repo_names': [],
        'usernames': [],
        'stars': []
    }
    
    # Each <h3> holds two links: the owner first, then the repository.
    h3_tags = soup.find_all('h3', class_="f3 color-fg-muted text-normal lh-condensed")
    stars_tags = soup.find_all('span', {'id': 'repo-stars-counter-star'})
    
    for tag in h3_tags:
        a_tags = tag.find_all('a')
        repo_info['usernames'].append(a_tags[0]['href'][1:])  # drop the leading '/'
        repo_info['repo_names'].append(a_tags[1].text.strip())
        
    for tag in stars_tags:
        text = tag.text.strip()
        # '95.3k' means 95,300; counts below 1,000 carry no 'k' suffix.
        if text.endswith('k'):
            repo_info['stars'].append(round(float(text[:-1]) * 1000))
        else:
            repo_info['stars'].append(int(text))
        
    df = pd.DataFrame({'Repo Name': repo_info['repo_names'], 'Username': repo_info['usernames'], 'Stars': repo_info['stars']})
    
    return df

In [83]:
df = get_topic_repo(topic_urls[0])

In [84]:
df

| | Repo Name | Username | Stars |
|---|---|---|---|
| 0 | three.js | mrdoob | 95300 |
| 1 | react-three-fiber | pmndrs | 24300 |
| 2 | libgdx | libgdx | 22100 |
| 3 | Babylon.js | BabylonJS | 21600 |
| 4 | tinyrenderer | ssloy | 18200 |
| 5 | 3d-game-shaders-for-beginners | lettier | 16200 |
| 6 | aframe | aframevr | 15700 |
| 7 | FreeCAD | FreeCAD | 15500 |
| 8 | cesium | CesiumGS | 11200 |
| 9 | MonoGame | MonoGame | 10200 |
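
The outline calls for repeating this step for every topic. A minimal sketch of that loop follows, assuming two hypothetical helpers, `topic_slug` and `scrape_all_topics`, that are not part of the notebook. The scraping function is passed in as a parameter, the one-second pause between requests is an arbitrary courtesy value, and failures are collected rather than stopping the whole run.

```python
import time


def topic_slug(topic_url):
    """Return the last path segment of a topic URL, e.g. '3d'."""
    return topic_url.rstrip('/').rsplit('/', 1)[-1]


def scrape_all_topics(topic_urls, scrape_one, delay_seconds=1.0):
    """Call scrape_one(url) for each URL, pausing between requests.

    Returns (results, failures): results maps topic slug to whatever
    scrape_one returned; failures maps URL to the exception raised.
    """
    results, failures = {}, {}
    for i, url in enumerate(topic_urls):
        if i > 0:
            time.sleep(delay_seconds)  # be polite to the server
        try:
            results[topic_slug(url)] = scrape_one(url)
        except Exception as exc:
            failures[url] = exc
    return results, failures
```

In this notebook the call would be `results, failures = scrape_all_topics(topic_urls, get_topic_repo)`, after which each DataFrame could be saved with `results[slug].to_csv(f'{slug}.csv', index=False)`.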
