# Top Repositories for GitHub Topics



## Pick a website and describe your objective
 - Browse through different sites and pick on to scrape. Check the "Project Ideas" section for inspiration.
 - Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
 - Summarize your project idea and outline your strategy in a Juptyer notebook. Use the "New" button above.

## Project Outline:
- We're going to scrape  https://github.com/topics
- We'll get a list of topics. For each topic we'll get topic title,topic page URL and topic description
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format: 
 
 '''
 Repo Name,Username,Stars,Repo URL
 three.js,mrdoob,69700,https://github.com/mrdoob/three.js
 libgdx,libgdx,18300,https://github.com/libgdx/libgdx
 '''


# Use the requests library to download web pages

In [2]:
import requests

In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url)

In [5]:
len(response.text)

188989

In [6]:
page_contents = response.text
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-dkuYFW+ra8yYSt342e5pJEeslPSjMcrMvNxlYZMyM/X+/WJHDPvoCuGq3LFojI7B0dQWwZNRiPMnbi9IfUgTaA==" rel="stylesheet" href="https://github.githubassets.com/assets/light-764b98156fab6bcc984addf8d9ee6924.css" /><link crossorigin="anonymous" media="all" integrity="sha512-UrAu23+eyncWvaQFwsLbgSKtmLb2aH1bcT4hJnnRdkaPuY1eu9bumt33FyHHFDX8hskTUNWNkIsMCz7F

In [7]:
with open('webpage.html','w') as f:
     f.write(page_contents)

# Use Beautiful Soup to parse and extract information



In [8]:
from bs4 import BeautifulSoup

In [9]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [22]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class':selection_class})

In [23]:
len(topic_title_tags)

30

In [25]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [26]:
desc_selector ='f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class': desc_selector})

In [27]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [28]:
topic_title_tag0 =topic_title_tags[0]

In [31]:
div_tag = topic_title_tag0.parent

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>

In [33]:
topic_link_tags = doc.find_all('a', {'class':'no-underline flex-grow-0'})

In [34]:
len(topic_link_tags)

30

In [38]:
topic0_url = 'https://github.com'+ topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d


In [40]:
topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [41]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
    
topic_descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [45]:
topic_urls = []
base_url = 'https://github.com'
for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

# Create CSV file(s) with the extracted information

In [46]:
import pandas as pd

In [47]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url':topic_urls
              }


In [48]:
topics_df = pd.DataFrame(topics_dict)

In [49]:
topics_df

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [52]:
topics_df.to_csv('topics.csv', index = None)

### Getting information out of a topic page

In [53]:
topic_page_url = topic_urls[0]


In [54]:
topic_page_url

'https://github.com/topics/3d'

In [56]:
response = requests.get(topic_page_url)
response.status_code

200

In [57]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [58]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

In [61]:
len(repo_tags)

30

In [64]:
a_tags = repo_tags[0].find_all('a')

In [66]:
a_tags[0].text.strip()

'mrdoob'

In [67]:
a_tags[1].text.strip()

'three.js'

In [69]:
repo_url = base_url + a_tags[1]['href'] 

In [71]:
print(repo_url)

https://github.com/mrdoob/three.js


In [75]:
star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})

In [76]:
len(star_tags)

30

In [77]:
star_tags[0].text.strip()

'79.4k'

In [84]:
def get_repo_info(h3_tag, star_tag):
    #returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href'] 
    stars = star_tags[0].text.strip()
    return username, repo_name,stars,repo_url

In [85]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', '79.4k', 'https://github.com/mrdoob/three.js')

In [90]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range (len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])

In [93]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [110]:
def get_topic_page(topic_url):
         response = requests.get(topic_url)
         if response.status_code != 200:
            raise Exception('Failed to load page {}'.format(topic_url))
         
         topic_doc = BeautifulSoup(response.text, 'html.parser')
         
         return topic_doc

def get_repo_info(h3_tag, star_tag):
    #returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href'] 
    stars = star_tags[0].text.strip()
    return username, repo_name,stars,repo_url
    
def get_topic_repos(topic_doc):
   
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})
    
    topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
    }
    
    #get repo info
    for i in range (len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)

In [111]:
url4 = topic_urls[4]
topic4_doc = get_topic_page(url4)
topic4_repos = get_topic_repos(topic4_doc)

In [112]:
topic4_repos

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,79.4k,https://github.com/flutter/flutter
1,justjavac,free-programming-books-zh_CN,79.4k,https://github.com/justjavac/free-programming-...
2,Genymobile,scrcpy,79.4k,https://github.com/Genymobile/scrcpy
3,Hack-with-Github,Awesome-Hacking,79.4k,https://github.com/Hack-with-Github/Awesome-Ha...
4,google,material-design-icons,79.4k,https://github.com/google/material-design-icons
5,wasabeef,awesome-android-ui,79.4k,https://github.com/wasabeef/awesome-android-ui
6,square,okhttp,79.4k,https://github.com/square/okhttp
7,android,architecture-samples,79.4k,https://github.com/android/architecture-samples
8,square,retrofit,79.4k,https://github.com/square/retrofit
9,Solido,awesome-flutter,79.4k,https://github.com/Solido/awesome-flutter


# Document and share your work