### Top Repositories for GitHub Topics



### Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.

- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.

- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

#### Project outline
- We're going to scrape https://github.com/topics

- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description

- For each topic, we'll get the top 25 repositories in the topic from the topic page

- For each repository, we'll grab the repo name, username, stars and repo URL

- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
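
As a rough sketch of that layout, the standard-library `csv` module can produce the same header and rows. The in-memory buffer and the `rows` list are illustrative assumptions; the notebook itself writes its files with pandas later on.

```python
import csv
import io

# Sample rows taken from the format shown above.
rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]

# Write to an in-memory buffer so the sketch leaves no files behind.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```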

### Use the requests library to download web pages

In [1]:
!pip install requests --upgrade --quiet

In [2]:
import requests

In [3]:
topics_url = "https://github.com/topics"

In [4]:
# download this url
response = requests.get(topics_url)

In [5]:
# the status code indicates whether the request succeeded (200 means OK); look up the list of HTTP status codes for other values
response.status_code

200
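
To look up what a status code means without leaving the notebook, the standard library's `http.HTTPStatus` enum can help. This is a minimal sketch; the helper names `describe_status` and `is_success` are made up for illustration.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '200 OK' for a known HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        return "{} (unknown status code)".format(code)
    return "{} {}".format(status.value, status.phrase)

def is_success(code):
    """Codes in the 2xx range indicate that the request succeeded."""
    return 200 <= code < 300

print(describe_status(200))
print(describe_status(404))
print(is_success(200), is_success(404))
```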

In [6]:
# response.text holds the full content of the downloaded page as a string
len(response.text)

141057

In [7]:
page_contents = response.text

In [8]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" rel="stylesheet" href="https://github.githubassets.com/assets/light-92c7d381038e.css" /><link crossorigin="anonymous" media="all" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+solP87jByEvY/g4BmoxLihRogKcX

In [9]:
# let's save it to a file
with open('webpage.html', "w", encoding="utf-8") as f:
    f.write(page_contents)
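
Saving the page also makes it possible to avoid downloading it again on every run. The sketch below is one way to do that and is not part of the original notebook: `load_or_download` and its `download` argument are hypothetical names, and in this notebook `download` could be something like `lambda: requests.get(topics_url).text`.

```python
import os
import tempfile

def load_or_download(path, download):
    """Return the cached text at `path`, calling `download()` only if the file is missing."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    text = download()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text

# Demonstration with a fake download function, so no network access is needed.
calls = []

def fake_download():
    calls.append(1)
    return "<html>demo</html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "webpage.html")
    first = load_or_download(cache_path, fake_download)   # downloads and saves
    second = load_or_download(cache_path, fake_download)  # reads the saved copy

print(first == second, len(calls))
```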

### Use Beautiful Soup to parse and extract information

In [10]:
!pip install beautifulsoup4 --upgrade --quiet

In [11]:
from bs4 import BeautifulSoup

In [12]:
# parse page_contents with the HTML parser, since web pages are HTML; Beautiful Soup can also parse XML with a suitable parser
doc = BeautifulSoup(page_contents, 'html.parser')

In [13]:
doc


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-92c7d381038e.css" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-d4a90c367f0c.css" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+sol

In [14]:
type(doc)

bs4.BeautifulSoup

In [15]:
# let's find all the <p> tags
p_tags = doc.find_all('p')

In [16]:
len(p_tags)

67

In [17]:
p_tags[:5]

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Spring Boot
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Spring Boot is a coding and configuration model for Java applications.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         MySQL
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">MySQL is an open source relational database management system.</p>]

In [18]:
# let's find the <p> tags of the topic titles by inspecting one of them on the page, e.g. 3D
selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
p_tags = doc.find_all('p', {'class' : selection_class})

In [19]:
len(p_tags)

30

In [20]:
p_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

That means we have the 30 p tags that hold the topic titles.

In [21]:
# so let's rename the p-tags to topic_title_tags
topic_title_tags = p_tags.copy()

In [22]:
len(topic_title_tags)

30

In [23]:
topic_title_tags[0]

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [24]:
topic_title_tags[0].text

'3D'

In [25]:
# let's now find the topic desc tags. we inspect on the first topic description(3D)
desc_selector = "f5 color-fg-muted mb-0 mt-1"
topic_desc_tags = doc.find_all('p', {'class' : desc_selector})

In [26]:
len(topic_desc_tags)

30

In [27]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [28]:
topic_desc_tags[0].text

'\n          3D modeling is the process of virtually developing the surface and structure of a 3D object.\n        '

In [29]:
topic_title_tags0 = topic_title_tags[0]

In [30]:
topic_title_tags0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [31]:
topic_title_tags0.parent.parent.a

<a class="no-underline flex-grow-0" href="/topics/3d">
<div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
</a>

In [32]:
# .parent gives the element that contains this p tag; going up two levels reaches the block that also holds the topic's link (the a tag)
topic_title_tags0.parent.parent.a['href']

'/topics/3d'

#### Or

In [33]:
# alternatively, find the link tags directly using the class copied from the browser inspector
topic_link_tags = doc.find_all('a', {'class' : "no-underline flex-grow-0"})

In [34]:
len(topic_link_tags)

30

In [35]:
topic_link_tags[:5]

[<a class="no-underline flex-grow-0" href="/topics/3d">
 <div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
             #
           </div>
 </a>,
 <a class="no-underline flex-grow-0" href="/topics/ajax">
 <img alt="ajax" class="rounded mr-3" height="64" src="https://raw.githubusercontent.com/github/explore/8be26d91eb231fec0b8856359979ac09f27173fd/topics/ajax/ajax.png" width="64"/>
 </a>,
 <a class="no-underline flex-grow-0" href="/topics/algorithm">
 <div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
             #
           </div>
 </a>,
 <a class="no-underline flex-grow-0" href="/topics/amphp">
 <img alt="amphp" class="rounded mr-3" height="64" src="https://raw.githubusercontent.com/github/explore/99fe59c0f4fb5d6545311440b4ce89a0d82b0804/topics/amphp/amphp.png" width="64"/>
 </a>,
 <a class

In [36]:
topic_link_tags[0]['href']

'/topics/3d'

In [37]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
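
String concatenation works here because every `href` on this page starts with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves relative and absolute links alike; the sample `href` values below are copied from the output above.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# A relative link is resolved against the base URL.
relative_result = urljoin(base_url, "/topics/3d")

# A link that is already absolute is returned unchanged.
absolute_result = urljoin(base_url, "https://github.com/topics/ajax")

topic_paths = ["/topics/3d", "/topics/ajax", "/topics/algorithm"]
full_urls = [urljoin(base_url, path) for path in topic_paths]
print(full_urls)
```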


In [38]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
    
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [39]:
topic_descriptions = []

for tag in topic_desc_tags:
    topic_descriptions.append(tag.text)
    
print(topic_descriptions[:5])

['\n          3D modeling is the process of virtually developing the surface and structure of a 3D object.\n        ', '\n          Ajax is a technique for creating interactive web applications.\n        ', '\n          Algorithms are self-contained sequences that carry out a variety of tasks.\n        ', '\n          Amp is a non-blocking concurrency library for PHP.\n        ', '\n          Android is an operating system built by Google designed for mobile devices.\n        ']
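
The descriptions still carry the newlines and indentation from the HTML source. Here is a small sketch of cleaning them with `str.strip()`; the sample strings are taken from the output above, and `clean_descriptions` is an illustrative name.

```python
raw_descriptions = [
    '\n          Ajax is a technique for creating interactive web applications.\n        ',
    '\n          Amp is a non-blocking concurrency library for PHP.\n        ',
]

# strip() removes leading and trailing whitespace, including newlines.
clean_descriptions = [text.strip() for text in raw_descriptions]
print(clean_descriptions)
```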


In [40]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

### Create CSV file(s) with the extracted information

In [41]:
import pandas as pd

In [42]:
topics_dict = {'title' : topic_titles,
               'description' :topic_descriptions,
               'url': topic_urls              
              }

In [43]:
topics_df = pd.DataFrame(topics_dict)

In [44]:
topics_df

Unnamed: 0,title,description,url
0,3D,\n 3D modeling is the process of virt...,https://github.com/topics/3d
1,Ajax,\n Ajax is a technique for creating i...,https://github.com/topics/ajax
2,Algorithm,\n Algorithms are self-contained sequ...,https://github.com/topics/algorithm
3,Amp,\n Amp is a non-blocking concurrency ...,https://github.com/topics/amphp
4,Android,\n Android is an operating system bui...,https://github.com/topics/android
5,Angular,\n Angular is an open source web appl...,https://github.com/topics/angular
6,Ansible,\n Ansible is a simple and powerful a...,https://github.com/topics/ansible
7,API,\n An API (Application Programming In...,https://github.com/topics/api
8,Arduino,\n Arduino is an open source hardware...,https://github.com/topics/arduino
9,ASP.NET,\n ASP.NET is a web framework for bui...,https://github.com/topics/aspnet


In [45]:
topics_df.to_csv('topics.csv', index=None)

### Getting information out of a topic page

In [46]:
# click on the first topic to get the topic info page. in this case click the 3D topic.
topic_urls[0]

'https://github.com/topics/3d'

In [47]:
# let's get a single topic url
topic_page_url = topic_urls[0]

In [48]:
topic_page_url

'https://github.com/topics/3d'

In [49]:
response = requests.get(topic_page_url)

In [50]:
response.status_code

200

In [51]:
len(response.text)

641576

In [52]:
topic_doc = BeautifulSoup(response.text, 'html.parser' )

In [53]:
h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
repo_tags = topic_doc.find_all('h3', {'class' : h3_selection_class})

In [54]:
len(repo_tags)

30

In [55]:
repo_tags[1]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":509841,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="760dcd7b253cb1a27d9b1a8675e86db885295be4e0d8d9fa7397adf923075d36" data-turbo="false" data-view-component="true" href="/libgdx">
            libgdx
</a>          /
          <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":5373551,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hm

In [56]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac

In [57]:
a_tags = repo_tags[0].find_all('a')

In [58]:
len(a_tags)

2

In [59]:
a_tags[0].text

'\n            mrdoob\n'

In [60]:
a_tags[0].text.strip()

'mrdoob'

In [61]:
a_tags[1].text.strip()

'three.js'

In [62]:
a_tags[1]['href']

'/mrdoob/three.js'

In [63]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [64]:
star_tags = topic_doc.find_all('span', {'class' : 'Counter js-social-count'})

In [65]:
len(star_tags)

30

In [66]:
star_tags[0]

<span aria-label="82780 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-view-component="true" id="repo-stars-counter-star" title="82,780">82.8k</span>

In [67]:
star_tags[0].text

'82.8k'

In [68]:
type(star_tags[0].text)

str

In [69]:
# let's convert this into a number
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    
    return int(stars_str)
    

In [70]:
stars_str = '82.6k'

In [71]:
stars_str[-1] #gives the last character

'k'

In [72]:
stars_str[:-1] # gives everything except the last character

'82.6'

In [73]:
float(stars_str[:-1]) # converts the string to a float

82.6

In [74]:
int(float(stars_str[:-1]) * 1000)

82600

In [75]:
a =  "10.5k"

In [76]:
star_tags[0].text.strip()

'82.8k'

In [77]:
parse_star_count(a)

10500

In [78]:
parse_star_count(star_tags[0].text.strip())

82800
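
`parse_star_count` covers the `k` suffix seen on this page. The variant below is a sketch under the assumption that counts may also appear with an `m` suffix or with thousands separators (as in the `title="82,780"` attribute shown earlier); `parse_count` is a hypothetical name, not something from the notebook.

```python
def parse_count(text):
    """Convert strings such as '82.8k', '1.2m', '82,780' or '950' to an int."""
    text = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        # round() avoids truncating products that land just below a whole number.
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(parse_count("82.8k"))
print(parse_count("82,780"))
```

Using `round` instead of `int` here is a design choice: `int` truncates, so a floating-point product that falls slightly short of the intended value would lose one.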

In [79]:
#let's put it in one function
def get_repo_info(h3_tag, star_tag):
    #returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [80]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 82800, 'https://github.com/mrdoob/three.js')

In [81]:
topic_repos_dict = {
    'username' : [],
    'repo_name' : [],
    'stars' : [],
    'repo_url' : []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
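
The index-based loop works, but it relies on `repo_tags` and `star_tags` lining up one to one. Below is a sketch of the same idea with `zip` and tuple unpacking, using plain values in place of the parsed tags so that it runs on its own; `build_columns` is an illustrative name.

```python
def build_columns(rows):
    """Turn (username, repo_name, stars, repo_url) tuples into a dict of columns."""
    columns = {"username": [], "repo_name": [], "stars": [], "repo_url": []}
    for username, repo_name, stars, repo_url in rows:
        columns["username"].append(username)
        columns["repo_name"].append(repo_name)
        columns["stars"].append(stars)
        columns["repo_url"].append(repo_url)
    return columns

# Stand-ins for the parsed values, copied from outputs in this notebook.
names = [("mrdoob", "three.js"), ("libgdx", "libgdx")]
star_counts = [82800, 20100]

# zip pairs the two lists item by item and stops at the shorter one.
rows = [
    (user, repo, count, "https://github.com/{}/{}".format(user, repo))
    for (user, repo), count in zip(names, star_counts)
]
columns = build_columns(rows)
print(columns["repo_url"])
```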

In [82]:
topic_repos_dict

{'username': ['mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'aframevr',
  'ssloy',
  'lettier',
  'FreeCAD',
  'metafizzy',
  'CesiumGS',
  'timzhang642',
  'a1studmuffin',
  'isl-org',
  'blender',
  'domlysz',
  'spritejs',
  'openscad',
  'jagenjo',
  'tensorspace-team',
  'YadiraF',
  'AaronJackson',
  'google',
  'ssloy',
  'FyroxEngine',
  'mosra',
  'tengbao',
  'gfxfundamentals',
  'cleardusk',
  'jasonlong',
  'cnr-isti-vclab'],
 'repo_name': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'aframe',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'zdog',
  'cesium',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'Open3D',
  'blender',
  'BlenderGIS',
  'spritejs',
  'openscad',
  'webglstudio.js',
  'tensorspace',
  'PRNet',
  'vrn',
  'model-viewer',
  'tinyraytracer',
  'Fyrox',
  'magnum',
  'vanta',
  'webgl-fundamentals',
  '3DDFA',
  'isometric-contributions',
  'meshlab'],
 'stars': [82800,
  20100,
  18400,
  17600,
  1430

In [83]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [84]:
topic_repos_df

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,82800,https://github.com/mrdoob/three.js
1,libgdx,libgdx,20100,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,18400,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,17600,https://github.com/BabylonJS/Babylon.js
4,aframevr,aframe,14300,https://github.com/aframevr/aframe
5,ssloy,tinyrenderer,13900,https://github.com/ssloy/tinyrenderer
6,lettier,3d-game-shaders-for-beginners,13100,https://github.com/lettier/3d-game-shaders-for...
7,FreeCAD,FreeCAD,11500,https://github.com/FreeCAD/FreeCAD
8,metafizzy,zdog,9200,https://github.com/metafizzy/zdog
9,CesiumGS,cesium,8800,https://github.com/CesiumGS/cesium


This was the data for the 3D topic.

In [85]:
# let's make a function for that
# def get_topic_repos(topic_url):
#     # download the page
#     response = requests.get(topic_url)
#     # check for a successful response
#     if response.status_code != 200:
#         raise Exception('Failed to load page {}'.format(topic_url))
#     # parse using Beautiful Soup
#     topic_doc = BeautifulSoup(response.text, 'html.parser')
#     # get the h3 tags containing repo title, repo url and username
#     h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
#     repo_tags = topic_doc.find_all('h3', {'class' : h3_selection_class})
#     # get star tags
#     star_tags = topic_doc.find_all('span', {'class' : 'Counter js-social-count'})
#
#     # get repo information
#     topic_repos_dict = {
#         'username' : [],
#         'repo_name' : [],
#         'stars' : [],
#         'repo_url' : []
#     }
#
#     for i in range(len(repo_tags)):
#         repo_info = get_repo_info(repo_tags[i], star_tags[i])
#         topic_repos_dict['username'].append(repo_info[0])
#         topic_repos_dict['repo_name'].append(repo_info[1])
#         topic_repos_dict['stars'].append(repo_info[2])
#         topic_repos_dict['repo_url'].append(repo_info[3])
#
#     return pd.DataFrame(topic_repos_dict)

This function is rather long, so let's divide it into several smaller functions.

In [86]:
def get_topic_page(topic_url):
    # download the page
    response = requests.get(topic_url)
    # check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser' )
    return topic_doc


def get_repo_info(h3_tag, star_tag):
    #returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
     #get the h3 tags containing repo title, repo url and username
    h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
    repo_tags = topic_doc.find_all('h3', {'class' : h3_selection_class})
    # get star tags
    star_tags = topic_doc.find_all('span', {'class' : 'Counter js-social-count'})
    
    # get repo information
    topic_repos_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)
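
`get_topic_page` gives up on the first response that is not 200. When scraping many pages, one option is to retry a few times with a pause in between. The sketch below is an assumption about how that could look and is not part of the original notebook: `fetch_with_retries`, `fetch`, `attempts` and `delay` are illustrative names, and `fetch` stands in for a call such as `get_topic_page`.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying after `delay` seconds whenever it raises an Exception."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay)
    # Every attempt failed, so surface the last error to the caller.
    raise last_error

# Demonstration with a fake fetch function that fails twice before succeeding.
attempt_log = []

def flaky_fetch(url):
    attempt_log.append(url)
    if len(attempt_log) < 3:
        raise Exception("Failed to load page {}".format(url))
    return "page for {}".format(url)

result = fetch_with_retries(flaky_fetch, "https://github.com/topics/3d", delay=0)
print(result, len(attempt_log))
```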
    
    

In [87]:
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

In [88]:
url4 = topic_urls[4]
url4

'https://github.com/topics/android'

In [89]:
topic4_doc = get_topic_page(url4)

In [90]:
topic4_repos = get_topic_repos(topic4_doc)

In [91]:
topic4_repos

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,142000,https://github.com/flutter/flutter
1,justjavac,free-programming-books-zh_CN,93300,https://github.com/justjavac/free-programming-...
2,Genymobile,scrcpy,66400,https://github.com/Genymobile/scrcpy
3,Hack-with-Github,Awesome-Hacking,51400,https://github.com/Hack-with-Github/Awesome-Ha...
4,google,material-design-icons,46000,https://github.com/google/material-design-icons
5,wasabeef,awesome-android-ui,42800,https://github.com/wasabeef/awesome-android-ui
6,square,okhttp,42300,https://github.com/square/okhttp
7,Solido,awesome-flutter,41000,https://github.com/Solido/awesome-flutter
8,android,architecture-samples,41000,https://github.com/android/architecture-samples
9,square,retrofit,40100,https://github.com/square/retrofit


We can as well do this in a single line

In [92]:
get_topic_repos(get_topic_page(topic_urls[4]))

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,142000,https://github.com/flutter/flutter
1,justjavac,free-programming-books-zh_CN,93300,https://github.com/justjavac/free-programming-...
2,Genymobile,scrcpy,66400,https://github.com/Genymobile/scrcpy
3,Hack-with-Github,Awesome-Hacking,51400,https://github.com/Hack-with-Github/Awesome-Ha...
4,google,material-design-icons,46000,https://github.com/google/material-design-icons
5,wasabeef,awesome-android-ui,42800,https://github.com/wasabeef/awesome-android-ui
6,square,okhttp,42300,https://github.com/square/okhttp
7,Solido,awesome-flutter,41000,https://github.com/Solido/awesome-flutter
8,android,architecture-samples,41000,https://github.com/android/architecture-samples
9,square,retrofit,40100,https://github.com/square/retrofit


In [93]:
# save it to csv
get_topic_repos(get_topic_page(topic_urls[4])).to_csv('android.csv', index = None)

In [94]:
import os
def get_topic_page(topic_url):
    # download the page
    response = requests.get(topic_url)
    # check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser' )
    return topic_doc


def get_repo_info(h3_tag, star_tag):
    #returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
     #get the h3 tags containing repo title, repo url and username
    h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
    repo_tags = topic_doc.find_all('h3', {'class' : h3_selection_class})
    # get star tags
    star_tags = topic_doc.find_all('span', {'class' : 'Counter js-social-count'})
    
    # get repo information
    topic_repos_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, topic_name):
    fname = topic_name + ".csv"  # skip topics whose CSV file was already downloaded, e.g. android
    if os.path.exists(fname):
        print("The file {} already exists. Skipping...".format(fname))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(fname, index = None)
    
    

***Write a function to:***

* Get the list of topics from the topics page
* Get the list of top repos from the individual topic pages
* For each topic, create a CSV of the top repos for the topic

In [95]:
def get_topic_titles(doc):
    selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = doc.find_all('p', {'class' : selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_desc(doc):
     # let's now find the topic desc tags. we inspect on the first topic description(3D)
    desc_selector = "f5 color-fg-muted mb-0 mt-1"
    topic_desc_tags = doc.find_all('p', {'class' : desc_selector})
    topic_descriptions = []
    for tag in topic_desc_tags:
        topic_descriptions.append(tag.text)
    return topic_descriptions
        
def get_topic_url(doc):
     # we can directly copy the class by inspection
    topic_link_tags = doc.find_all('a', {'class' : "no-underline flex-grow-0"})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
    
        
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    
    doc = BeautifulSoup(response.text, 'html.parser' )
    topics_dict = {
        "title": get_topic_titles(doc),
        "description": get_topic_desc(doc),
        "url": get_topic_url(doc)   
         }

    return pd.DataFrame(topics_dict)
    
    


In [96]:
scrape_topics()

Unnamed: 0,title,description,url
0,3D,\n 3D modeling is the process of virt...,https://github.com/topics/3d
1,Ajax,\n Ajax is a technique for creating i...,https://github.com/topics/ajax
2,Algorithm,\n Algorithms are self-contained sequ...,https://github.com/topics/algorithm
3,Amp,\n Amp is a non-blocking concurrency ...,https://github.com/topics/amphp
4,Android,\n Android is an operating system bui...,https://github.com/topics/android
5,Angular,\n Angular is an open source web appl...,https://github.com/topics/angular
6,Ansible,\n Ansible is a simple and powerful a...,https://github.com/topics/ansible
7,API,\n An API (Application Programming In...,https://github.com/topics/api
8,Arduino,\n Arduino is an open source hardware...,https://github.com/topics/arduino
9,ASP.NET,\n ASP.NET is a web framework for bui...,https://github.com/topics/aspnet


In [97]:
def scrape_topics_repos():
    print("Scrapping list of topics")
    topics_df = scrape_topics()
    for index, row in topics_df.iterrows():
        print("Scrapping top repositories for {}".format(row["title"]))
        scrape_topic(row["url"], row["title"])
              
              

In [99]:
scrape_topics_repos()

Scrapping list of topics
Scrapping top repositories for 3D
The file 3D.csv already exists. Skipping...
Scrapping top repositories for Ajax
The file Ajax.csv already exists. Skipping...
Scrapping top repositories for Algorithm
The file Algorithm.csv already exists. Skipping...
Scrapping top repositories for Amp
The file Amp.csv already exists. Skipping...
Scrapping top repositories for Android
The file Android.csv already exists. Skipping...
Scrapping top repositories for Angular
The file Angular.csv already exists. Skipping...
Scrapping top repositories for Ansible
The file Ansible.csv already exists. Skipping...
Scrapping top repositories for API
The file API.csv already exists. Skipping...
Scrapping top repositories for Arduino
The file Arduino.csv already exists. Skipping...
Scrapping top repositories for ASP.NET
The file ASP.NET.csv already exists. Skipping...
Scrapping top repositories for Atom
The file Atom.csv already exists. Skipping...
Scrapping top repositories for Awesome Li

**Let's put all the data in one folder**

We make some changes to the scrape_topic() and scrape_topics_repos() functions:

In [101]:
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index = None)

In [105]:
def scrape_topics_repos():
    print("Scrapping list of topics")
    topics_df = scrape_topics()
    
    os.makedirs("data", exist_ok = True)
    for index, row in topics_df.iterrows():
        print("Scrapping top repositories for {}".format(row["title"]))
        scrape_topic(row["url"], 'data/{}.csv'.format(row['title']))
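
Topic titles go straight into file names here. That works for the titles listed above, but it could be fragile for a title containing a character such as `/`. Below is a hedged sketch of a helper that derives a safer file name; `topic_filename` is a hypothetical name, and the replacement rule is just one possible choice.

```python
import os
import re

def topic_filename(title, folder="data"):
    """Build a CSV path from a topic title, replacing risky characters with '-'."""
    # Keep letters, digits, dots, plus signs, underscores and hyphens; replace anything else.
    safe = re.sub(r"[^A-Za-z0-9.+_-]+", "-", title.strip()).strip("-")
    return os.path.join(folder, safe + ".csv")

print(topic_filename("Awesome Lists"))
print(topic_filename("C++"))
print(topic_filename("ASP.NET"))
```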
              

In [106]:
scrape_topics_repos()

Scrapping list of topics
Scrapping top repositories for 3D
Scrapping top repositories for Ajax
Scrapping top repositories for Algorithm
Scrapping top repositories for Amp
Scrapping top repositories for Android
Scrapping top repositories for Angular
Scrapping top repositories for Ansible
Scrapping top repositories for API
Scrapping top repositories for Arduino
Scrapping top repositories for ASP.NET
Scrapping top repositories for Atom
Scrapping top repositories for Awesome Lists
Scrapping top repositories for Amazon Web Services
Scrapping top repositories for Azure
Scrapping top repositories for Babel
Scrapping top repositories for Bash
Scrapping top repositories for Bitcoin
Scrapping top repositories for Bootstrap
Scrapping top repositories for Bot
Scrapping top repositories for C
Scrapping top repositories for Chrome
Scrapping top repositories for Chrome extension
Scrapping top repositories for Command line interface
Scrapping top repositories for Clojure
Scrapping top repositories for