## Top Repositories on GitHub

**Problem Statement:**

GitHub is a widely used platform for sharing and discovering open-source projects. The platform organizes repositories into topics, making it easier to explore related projects. However, GitHub does not provide a direct, structured dataset of these topics and their associated repositories. This project automates the extraction of that data from the GitHub Topics page, allowing users to analyze and explore repositories by topic. By scraping this data, we can create CSV files that include key repository details such as name, author, stars, and URL for each topic.

**Project Outline:**
  
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, page URL, and description
- For each topic, we'll get the top repositories from the topic page
- For each repository, we'll grab the repo name, username, stars, and repo URL
- For each topic we'll create a CSV file in the following format:
```
Repo name,Username,Stars,Repo URL
```
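
As a minimal sketch of the target format, the standard-library `csv` module can write that header followed by one row per repository. The `rows_to_csv` helper and the sample row are illustrative assumptions and not part of the notebook's final code, which uses pandas instead.

```python
import csv
import io

HEADER = ['Repo name', 'Username', 'Stars', 'Repo URL']

def rows_to_csv(rows):
    """Return CSV text: the header line followed by one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical sample row, for illustration only
sample = [('example-repo', 'example-user', 1200,
           'https://github.com/example-user/example-repo')]
print(rows_to_csv(sample))
```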

In [84]:
!pip install requests



In [85]:
import requests

In [86]:
topics_url = 'https://github.com/topics'

In [87]:
response = requests.get(topics_url)

In [88]:
response.status_code

200

In [89]:
len(response.text)

188297

In [90]:
page_contents = response.text

In [91]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  \n  >\n\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-6215e264aa81.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light_high_contrast-0d1726fbc5ce.css" /><link crossorigin="anonym

In [92]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

## Use Beautiful Soup to parse and extract information

In [93]:
!pip install beautifulsoup4



In [94]:
from bs4 import BeautifulSoup

In [95]:
doc = BeautifulSoup(page_contents,'html.parser')

In [96]:
doc


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-6215e264aa81.css" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/light_high_contrast-0d1726fbc5ce.css" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-be3560533

In [97]:
p_tags = doc.find_all('p')

In [98]:
len(p_tags)

41

In [99]:
p_tags[:5]

[<p>We read every piece of feedback, and take your input very seriously.</p>,
 <p class="text-small color-fg-muted">
             To see all available qualifiers, see our <a class="Link--inTextBlock" href="https://docs.github.com/search-github/github-code-search/understanding-github-code-search-syntax">documentation</a>.
           </p>,
 <p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         JavaScript
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">JavaScript (JS) is a lightweight interpreted programming language with first-class functions.</p>]

In [100]:
#way1
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class' : selection_class})

In [101]:
len(topic_title_tags)

16

In [102]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Chrome</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Code quality</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Compiler</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">CSS</p>]

In [103]:
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Chrome</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Code quality</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Compiler</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">CSS</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Database</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Front end</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">JavaScript</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Node.js</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">npm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Project management</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Python</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">React</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">React Native</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Scala</p>,
 <p c

In [104]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p',{'class': desc_selector})

In [105]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           An awesome list is a list of awesome things curated by the community.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Chrome is a web browser from the tech company Google.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Automate your code review with style, quality, security, and test‑coverage checks when you need them.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Compilers are software that translate higher-level programming languages to lower-level languages (e.g. machine code).
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Cascading Style Sheets (CSS) is a language used most often to style and improve upon the appearance of views.
         </p>]

In [106]:
topic_title_tag0 = topic_title_tags[0]

In [107]:
div_tag = topic_title_tag0.parent

In [108]:
div_tag

<a class="no-underline flex-1 d-flex flex-column" href="/topics/awesome">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          An awesome list is a list of awesome things curated by the community.
        </p>
</a>

In [109]:
#class of a tag
topic_link_tags = doc.find_all('a',{'class' : 'no-underline flex-1 d-flex flex-column'})

In [110]:
len(topic_link_tags)

16

In [111]:
topic_link_tags[0]['href']

'/topics/awesome'

In [112]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/awesome


In [113]:
topic_title_tags[0].text

'Awesome Lists'

In [114]:
topic_titles = []

for tag  in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)

['Awesome Lists', 'Chrome', 'Code quality', 'Compiler', 'CSS', 'Database', 'Front end', 'JavaScript', 'Node.js', 'npm', 'Project management', 'Python', 'React', 'React Native', 'Scala', 'TypeScript']


In [166]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())

topic_descs[:5]

['An awesome list is a list of awesome things curated by the community.',
 'Chrome is a web browser from the tech company Google.',
 'Automate your code review with style, quality, security, and test‑coverage checks when you need them.',
 'Compilers are software that translate higher-level programming languages to lower-level languages (e.g. machine code).',
 'Cascading Style Sheets (CSS) is a language used most often to style and improve upon the appearance of views.']

In [115]:
topic_urls = []
base_url = 'https://github.com'

for tag  in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

print(topic_urls)

['https://github.com/topics/awesome', 'https://github.com/topics/chrome', 'https://github.com/topics/code-quality', 'https://github.com/topics/compiler', 'https://github.com/topics/css', 'https://github.com/topics/database', 'https://github.com/topics/frontend', 'https://github.com/topics/javascript', 'https://github.com/topics/nodejs', 'https://github.com/topics/npm', 'https://github.com/topics/project-management', 'https://github.com/topics/python', 'https://github.com/topics/react', 'https://github.com/topics/react-native', 'https://github.com/topics/scala', 'https://github.com/topics/typescript']


In [116]:
# pandas is used to build a DataFrame from the lists and write it out as a CSV
import pandas as pd

In [117]:
#create pandas dataframe from lists
topics_dict = {
    'title' : topic_titles,
    'description' : topic_descs,
    'url' : topic_urls
}

In [118]:
topics_df = pd.DataFrame(topics_dict)
topics_df

Unnamed: 0,title,description,url
0,Awesome Lists,An awesome list is a list of awesome things cu...,https://github.com/topics/awesome
1,Chrome,Chrome is a web browser from the tech company ...,https://github.com/topics/chrome
2,Code quality,"Automate your code review with style, quality,...",https://github.com/topics/code-quality
3,Compiler,Compilers are software that translate higher-l...,https://github.com/topics/compiler
4,CSS,Cascading Style Sheets (CSS) is a language use...,https://github.com/topics/css
5,Database,A database is a structured set of data held in...,https://github.com/topics/database
6,Front end,Front end is the programming and layout that p...,https://github.com/topics/frontend
7,JavaScript,JavaScript (JS) is a lightweight interpreted p...,https://github.com/topics/javascript
8,Node.js,Node.js is a tool for executing JavaScript in ...,https://github.com/topics/nodejs
9,npm,npm is a package manager for JavaScript includ...,https://github.com/topics/npm


## Create CSV file(s) of extracted information

In [119]:
topics_df.to_csv('topics.csv',index=None)

## Getting information out of a topic page

In [120]:
topic_page_url = topic_urls[0]

In [121]:
topic_page_url

'https://github.com/topics/awesome'

In [122]:
response = requests.get(topic_page_url)

In [123]:
response.status_code

200

In [124]:
len(response.text)

483657

In [125]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [126]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})

In [127]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":170270,"originating_url":"https://github.com/topics/awesome","user_id":null}}' data-hydro-click-hmac="677df237bb167b584adb9b473d6574bd660cf38a371ef02e9a75f8087b3e4c59" data-turbo="false" data-view-component="true" href="/sindresorhus">sindresorhus</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":21737465,"originating_url":"https://github.com/topics/awesome","user_id":null}}' data-hydro-click-hmac="f52b7f08337a25eb8c939fb3a85f1e8932089fbcdbd9c1d7c5d410bf0a3c872e" data-turbo="false" data-view-component=

In [128]:
repo_tags

[<h3 class="f3 color-fg-muted text-normal lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":170270,"originating_url":"https://github.com/topics/awesome","user_id":null}}' data-hydro-click-hmac="677df237bb167b584adb9b473d6574bd660cf38a371ef02e9a75f8087b3e4c59" data-turbo="false" data-view-component="true" href="/sindresorhus">sindresorhus</a>          /
           <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":21737465,"originating_url":"https://github.com/topics/awesome","user_id":null}}' data-hydro-click-hmac="f52b7f08337a25eb8c939fb3a85f1e8932089fbcdbd9c1d7c5d410bf0a3c872e" data-turbo="false" data-view-compone

In [129]:
len(repo_tags)

20

In [130]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":170270,"originating_url":"https://github.com/topics/awesome","user_id":null}}' data-hydro-click-hmac="677df237bb167b584adb9b473d6574bd660cf38a371ef02e9a75f8087b3e4c59" data-turbo="false" data-view-component="true" href="/sindresorhus">sindresorhus</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":21737465,"originating_url":"https://github.com/topics/awesome","user_id":null}}' data-hydro-click-hmac="f52b7f08337a25eb8c939fb3a85f1e8932089fbcdbd9c1d7c5d410bf0a3c872e" data-turbo="false" data-view-component=

In [131]:
a_tags = repo_tags[0].find_all('a')

In [132]:
a_tags

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":170270,"originating_url":"https://github.com/topics/awesome","user_id":null}}' data-hydro-click-hmac="677df237bb167b584adb9b473d6574bd660cf38a371ef02e9a75f8087b3e4c59" data-turbo="false" data-view-component="true" href="/sindresorhus">sindresorhus</a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":21737465,"originating_url":"https://github.com/topics/awesome","user_id":null}}' data-hydro-click-hmac="f52b7f08337a25eb8c939fb3a85f1e8932089fbcdbd9c1d7c5d410bf0a3c872e" data-turbo="false" data-view-component="true" href="/sindresorhus/awesome">awesome</a>]

In [133]:
a_tags[0].text.strip()

'sindresorhus'

In [134]:
a_tags[1].text.strip()

'awesome'

In [135]:
base_url = 'https://github.com'
a_tags[1]['href']

'/sindresorhus/awesome'

In [136]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/sindresorhus/awesome


In [137]:
star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

In [138]:
len(star_tags)

20

In [139]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)
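
`parse_star_count` handles plain integers and a trailing `k`. A slightly more tolerant variant might look like the sketch below. It assumes, without confirmation from the page, that counts could also appear with an `m` suffix or with thousands separators; `parse_star_count_flexible` is a hypothetical name. It uses `round` before `int` to guard against floating-point products that land just below the intended integer.

```python
def parse_star_count_flexible(stars_str):
    """Convert strings such as '404k', '1.2m', '1,234' or '987' to an int."""
    cleaned = stars_str.strip().lower().replace(',', '')
    # 'k' is what the topic page shows; 'm' is an assumed extension
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if cleaned and cleaned[-1] in multipliers:
        return int(round(float(cleaned[:-1]) * multipliers[cleaned[-1]]))
    return int(cleaned)

print(parse_star_count_flexible('403.4k'))
```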

In [140]:
parse_star_count(star_tags[0].text.strip())

404000

In [141]:
stars_str = '403.4k'

In [142]:
stars_str[-1]

'k'

In [143]:
stars_str[:-1]

'403.4'

In [144]:
int(float(stars_str[:-1]) * 1000)

403400

In [145]:
h3_selection_class

'f3 color-fg-muted text-normal lh-condensed'

In [146]:
def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [147]:
star_tags

[<span aria-label="403888 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="403,888">404k</span>,
 <span aria-label="262220 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="262,220">262k</span>,
 <span aria-label="250080 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="250,080">250k</span>,
 <span aria-label="188189 users starred this repository" class="Counter js-social-count" da

In [148]:
get_repo_info(repo_tags[0], star_tags[0])

('sindresorhus', 'awesome', 404000, 'https://github.com/sindresorhus/awesome')
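
The value 404000 comes from the abbreviated text `404k`, while the `star_tags` output above shows each span also carrying the exact figure in its `title` attribute (`title="403,888"`). The sketch below prefers that attribute and falls back to the abbreviated text. It assumes every star span exposes `title` in that comma-separated form; `exact_star_count` is a hypothetical helper, written against plain strings so it runs on its own.

```python
def exact_star_count(title_attr, fallback_text):
    """Prefer the exact count from the title attribute; else parse the short text."""
    if title_attr:
        digits = title_attr.replace(',', '').strip()
        if digits.isdigit():
            return int(digits)
    text = fallback_text.strip().lower()
    if text.endswith('k'):
        return int(round(float(text[:-1]) * 1000))
    return int(text)

# With a BeautifulSoup tag this would be called as:
#   exact_star_count(star_tag.get('title'), star_tag.text)
print(exact_star_count('403,888', '404k'))
```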

In [149]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
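
The loop above indexes `repo_tags` and `star_tags` in parallel, which relies on both lists having the same length and order. One way to make that assumption explicit is sketched below: check the lengths first, then pair the items with `zip`. The `pair_up` helper is hypothetical, and plain strings stand in for the tags.

```python
def pair_up(repo_items, star_items):
    """Pair each repository entry with its star entry, refusing mismatched lengths."""
    if len(repo_items) != len(star_items):
        raise ValueError(
            'Found {} repositories but {} star counters'.format(
                len(repo_items), len(star_items))
        )
    return list(zip(repo_items, star_items))

# Plain strings stand in for the h3 and span tags here
pairs = pair_up(['repo-a', 'repo-b'], ['12k', '3k'])
print(pairs)
```

In the notebook, the loop body could then read `for h3_tag, star_tag in pair_up(repo_tags, star_tags):`.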

In [150]:
topic_repos_dict

{'username': ['sindresorhus',
  'vinta',
  'awesome-selfhosted',
  'trimstray',
  'avelino',
  '521xueweihan',
  'papers-we-love',
  'Hack-with-Github',
  'jaywcjlove',
  'MunGell',
  'DopplerHQ',
  'enaqx',
  'fffaraz',
  'binhnguyennus',
  'prakhar1989',
  'sindresorhus',
  'Solido',
  'wasabeef',
  'rust-unofficial',
  'tiimgreen'],
 'repo_name': ['awesome',
  'awesome-python',
  'awesome-selfhosted',
  'the-book-of-secret-knowledge',
  'awesome-go',
  'HelloGitHub',
  'papers-we-love',
  'Awesome-Hacking',
  'awesome-mac',
  'awesome-for-beginners',
  'awesome-interview-questions',
  'awesome-react',
  'awesome-cpp',
  'awesome-scalability',
  'awesome-courses',
  'awesome-nodejs',
  'awesome-flutter',
  'awesome-android-ui',
  'awesome-rust',
  'github-cheat-sheet'],
 'stars': [404000,
  262000,
  250000,
  188000,
  154000,
  130000,
  98700,
  98500,
  89000,
  78200,
  77500,
  70100,
  66700,
  65700,
  63300,
  63000,
  57400,
  54000,
  52900,
  52700],
 'repo_url': ['https:

In [151]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [152]:
topic_repos_df

Unnamed: 0,username,repo_name,stars,repo_url
0,sindresorhus,awesome,404000,https://github.com/sindresorhus/awesome
1,vinta,awesome-python,262000,https://github.com/vinta/awesome-python
2,awesome-selfhosted,awesome-selfhosted,250000,https://github.com/awesome-selfhosted/awesome-...
3,trimstray,the-book-of-secret-knowledge,188000,https://github.com/trimstray/the-book-of-secre...
4,avelino,awesome-go,154000,https://github.com/avelino/awesome-go
5,521xueweihan,HelloGitHub,130000,https://github.com/521xueweihan/HelloGitHub
6,papers-we-love,papers-we-love,98700,https://github.com/papers-we-love/papers-we-love
7,Hack-with-Github,Awesome-Hacking,98500,https://github.com/Hack-with-Github/Awesome-Ha...
8,jaywcjlove,awesome-mac,89000,https://github.com/jaywcjlove/awesome-mac
9,MunGell,awesome-for-beginners,78200,https://github.com/MunGell/awesome-for-beginners


## Final Code

In [199]:
import os
def get_topic_page(topic_url):
    #Download the page
    response = requests.get(topic_url)
    #check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    #parse using BS
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc
    
def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
    
def get_topic_repos(topic_doc):
    #get h3 tags containing repo title,repo url and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})
    
    #get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    
    #get repo info
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url,path):
    # Skip topics whose CSV already exists, so a rerun after an error does not download them again
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
        
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path , index = None)
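
`get_topic_page` raises on any non-200 response, so a single transient failure stops the whole run. A small retry wrapper is one possible mitigation. The sketch below is an assumption rather than part of the original notebook: `fetch_with_retries` is a hypothetical helper that takes the fetching function as an argument, so it can be tried without network access.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay_seconds=1.0):
    """Call fetch(url) until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # get_topic_page raises a plain Exception
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt
                time.sleep(delay_seconds * attempt)
    raise last_error

# Usage in the notebook would look like:
#   topic_doc = fetch_with_retries(get_topic_page, topic_url)
```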

In [154]:
topic_urls

['https://github.com/topics/awesome',
 'https://github.com/topics/chrome',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/compiler',
 'https://github.com/topics/css',
 'https://github.com/topics/database',
 'https://github.com/topics/frontend',
 'https://github.com/topics/javascript',
 'https://github.com/topics/nodejs',
 'https://github.com/topics/npm',
 'https://github.com/topics/project-management',
 'https://github.com/topics/python',
 'https://github.com/topics/react',
 'https://github.com/topics/react-native',
 'https://github.com/topics/scala',
 'https://github.com/topics/typescript']

In [155]:
url4 = topic_urls[4]

In [156]:
url4

'https://github.com/topics/css'

In [157]:
topic4_doc = get_topic_page(url4)

In [158]:
topic4_repos = get_topic_repos(topic4_doc)

In [159]:
topic4_repos

Unnamed: 0,username,repo_name,stars,repo_url
0,twbs,bootstrap,173000,https://github.com/twbs/bootstrap
1,Chalarangelo,30-seconds-of-code,125000,https://github.com/Chalarangelo/30-seconds-of-...
2,electron,electron,119000,https://github.com/electron/electron
3,microsoft,Web-Dev-For-Beginners,92100,https://github.com/microsoft/Web-Dev-For-Begin...
4,tailwindlabs,tailwindcss,90400,https://github.com/tailwindlabs/tailwindcss
5,florinpop17,app-ideas,86200,https://github.com/florinpop17/app-ideas
6,animate-css,animate.css,82300,https://github.com/animate-css/animate.css
7,FortAwesome,Font-Awesome,75700,https://github.com/FortAwesome/Font-Awesome
8,thedaviddias,Front-End-Checklist,71400,https://github.com/thedaviddias/Front-End-Chec...
9,juliangarnier,anime,64400,https://github.com/juliangarnier/anime


In [160]:
#all in one
get_topic_repos(get_topic_page(topic_urls[4]))

Unnamed: 0,username,repo_name,stars,repo_url
0,twbs,bootstrap,173000,https://github.com/twbs/bootstrap
1,Chalarangelo,30-seconds-of-code,125000,https://github.com/Chalarangelo/30-seconds-of-...
2,electron,electron,119000,https://github.com/electron/electron
3,microsoft,Web-Dev-For-Beginners,92100,https://github.com/microsoft/Web-Dev-For-Begin...
4,tailwindlabs,tailwindcss,90400,https://github.com/tailwindlabs/tailwindcss
5,florinpop17,app-ideas,86200,https://github.com/florinpop17/app-ideas
6,animate-css,animate.css,82300,https://github.com/animate-css/animate.css
7,FortAwesome,Font-Awesome,75700,https://github.com/FortAwesome/Font-Awesome
8,thedaviddias,Front-End-Checklist,71400,https://github.com/thedaviddias/Front-End-Chec...
9,juliangarnier,anime,64400,https://github.com/juliangarnier/anime


In [161]:
topic_urls[6]

'https://github.com/topics/frontend'

In [162]:
get_topic_repos(get_topic_page(topic_urls[6]))

Unnamed: 0,username,repo_name,stars,repo_url
0,facebook,react,239000,https://github.com/facebook/react
1,vuejs,vue,210000,https://github.com/vuejs/vue
2,vitejs,vite,75500,https://github.com/vitejs/vite
3,thedaviddias,Front-End-Checklist,71400,https://github.com/thedaviddias/Front-End-Chec...
4,ionic-team,ionic-framework,52100,https://github.com/ionic-team/ionic-framework
5,dypsilon,frontend-dev-bookmarks,45000,https://github.com/dypsilon/frontend-dev-bookm...
6,LeCoupa,awesome-cheatsheets,44400,https://github.com/LeCoupa/awesome-cheatsheets
7,expo,expo,43700,https://github.com/expo/expo
8,NaiboWang,EasySpider,42700,https://github.com/NaiboWang/EasySpider
9,GitHubDaily,GitHubDaily,42300,https://github.com/GitHubDaily/GitHubDaily


In [163]:
get_topic_repos(get_topic_page(topic_urls[6])).to_csv('frontend.csv',index=None)

## Write a single function to:

1. Get the list of topics from the topics page
2. Get the list of top repos from the individual topic pages
3. For each topic, create a CSV of the top repos for the topic

In [175]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text.strip())
    return topic_titles


def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs


def get_topic_urls(doc):
    topic_urls = []
    base_url = 'https://github.com'
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls


def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)

    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))

    doc = BeautifulSoup(response.text, 'html.parser')

    titles = get_topic_titles(doc)
    descs = get_topic_descs(doc)
    urls = get_topic_urls(doc)

    # sanity check to avoid length mismatch error
    print(f"titles: {len(titles)}, descs: {len(descs)}, urls: {len(urls)}")

    topics_dict = {
        'title': titles,
        'description': descs,
        'url': urls
    }

    return pd.DataFrame(topics_dict)

In [178]:
scrape_topics()

titles: 16, descs: 16, urls: 16


Unnamed: 0,title,description,url
0,Awesome Lists,An awesome list is a list of awesome things cu...,https://github.com/topics/awesome
1,Chrome,Chrome is a web browser from the tech company ...,https://github.com/topics/chrome
2,Code quality,"Automate your code review with style, quality,...",https://github.com/topics/code-quality
3,Compiler,Compilers are software that translate higher-l...,https://github.com/topics/compiler
4,CSS,Cascading Style Sheets (CSS) is a language use...,https://github.com/topics/css
5,Database,A database is a structured set of data held in...,https://github.com/topics/database
6,Front end,Front end is the programming and layout that p...,https://github.com/topics/frontend
7,JavaScript,JavaScript (JS) is a lightweight interpreted p...,https://github.com/topics/javascript
8,Node.js,Node.js is a tool for executing JavaScript in ...,https://github.com/topics/nodejs
9,npm,npm is a package manager for JavaScript includ...,https://github.com/topics/npm


In [181]:
topics_df

Unnamed: 0,title,description,url
0,Awesome Lists,An awesome list is a list of awesome things cu...,https://github.com/topics/awesome
1,Chrome,Chrome is a web browser from the tech company ...,https://github.com/topics/chrome
2,Code quality,"Automate your code review with style, quality,...",https://github.com/topics/code-quality
3,Compiler,Compilers are software that translate higher-l...,https://github.com/topics/compiler
4,CSS,Cascading Style Sheets (CSS) is a language use...,https://github.com/topics/css
5,Database,A database is a structured set of data held in...,https://github.com/topics/database
6,Front end,Front end is the programming and layout that p...,https://github.com/topics/frontend
7,JavaScript,JavaScript (JS) is a lightweight interpreted p...,https://github.com/topics/javascript
8,Node.js,Node.js is a tool for executing JavaScript in ...,https://github.com/topics/nodejs
9,npm,npm is a package manager for JavaScript includ...,https://github.com/topics/npm


In [184]:
#loop over rows
for index, row in topics_df.iterrows():
    print(row['title'],row['url'])

Awesome Lists https://github.com/topics/awesome
Chrome https://github.com/topics/chrome
Code quality https://github.com/topics/code-quality
Compiler https://github.com/topics/compiler
CSS https://github.com/topics/css
Database https://github.com/topics/database
Front end https://github.com/topics/frontend
JavaScript https://github.com/topics/javascript
Node.js https://github.com/topics/nodejs
npm https://github.com/topics/npm
Project management https://github.com/topics/project-management
Python https://github.com/topics/python
React https://github.com/topics/react
React Native https://github.com/topics/react-native
Scala https://github.com/topics/scala
TypeScript https://github.com/topics/typescript


In [200]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()

    os.makedirs('data', exist_ok=True)

    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}" '.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

In [201]:
scrape_topics_repos()

Scraping list of topics
titles: 16, descs: 16, urls: 16
Scraping top repositories for "Awesome Lists" 
Scraping top repositories for "Chrome" 
Scraping top repositories for "Code quality" 
Scraping top repositories for "Compiler" 
Scraping top repositories for "CSS" 
Scraping top repositories for "Database" 
Scraping top repositories for "Front end" 
Scraping top repositories for "JavaScript" 
Scraping top repositories for "Node.js" 
Scraping top repositories for "npm" 
Scraping top repositories for "Project management" 
Scraping top repositories for "Python" 
Scraping top repositories for "React" 
Scraping top repositories for "React Native" 
Scraping top repositories for "Scala" 
Scraping top repositories for "TypeScript" 
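
The CSV files are named after the topic titles, which produces names such as `data/Front end.csv`, and a title containing a path separator would break the path. A hedged sketch of one way to derive safer file names follows. `topic_filename` is a hypothetical helper, and the `'topic'` fallback for titles with no letters or digits is an arbitrary choice.

```python
import re

def topic_filename(title):
    """Turn a topic title into a lowercase, filesystem-friendly CSV file name."""
    slug = re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')
    return (slug or 'topic') + '.csv'

for title in ['Awesome Lists', 'Node.js', 'Front end']:
    print(topic_filename(title))
```

The call in `scrape_topics_repos` could then build the path as `'data/' + topic_filename(row['title'])`.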
