# Pick a website and describe your objective

Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

## Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
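
As a minimal sketch of the target format, the sample rows above could be written with Python's built-in `csv` module. The `write_repos_csv` helper and the in-memory buffer are illustrative assumptions and not part of the notebook, which uses pandas for this step later on.

```python
import csv
import io

def write_repos_csv(rows, out):
    """Write (repo_name, username, stars, repo_url) tuples in the target format."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['Repo Name', 'Username', 'Stars', 'Repo URL'])
    writer.writerows(rows)

# Sample rows copied from the format shown above
sample_rows = [
    ('three.js', 'mrdoob', 69700, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 18300, 'https://github.com/libgdx/libgdx'),
]

buffer = io.StringIO()  # stands in for a real file opened with open(path, 'w')
write_repos_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```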

In [1]:
! pip install requests --upgrade



In [2]:
import requests

In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url)

In [5]:
response.status_code
#https://developer.mozilla.org/en-US/docs/Web/HTTP/Status

    #Informational responses (100 – 199)
    #Successful responses (200 – 299)
    #Redirection messages (300 – 399)
    #Client error responses (400 – 499)
    #Server error responses (500 – 599)


200
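
The ranges in the comments above can be turned into a small lookup. This is only a sketch: `status_category` is a hypothetical helper name, and it relies on the first digit of the code identifying the category.

```python
def status_category(status_code):
    """Map an HTTP status code to one of the categories listed above."""
    categories = {
        1: 'Informational response',
        2: 'Successful response',
        3: 'Redirection message',
        4: 'Client error response',
        5: 'Server error response',
    }
    # Integer division by 100 keeps only the leading digit (200 -> 2)
    return categories.get(status_code // 100, 'Unknown')

print(status_category(200))
```

With `requests`, `response.raise_for_status()` is another way to stop early on 4xx and 5xx responses.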

In [6]:
# Every response contains content

In [7]:
len(response.text)

152740

In [8]:
page_contents = response.text

In [9]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-719f1193e0c0.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-0c343b529849.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="https:/

# Use the requests library to download web pages

In [10]:
# Save the downloaded HTML so it can be inspected in a browser or editor
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

# Use Beautiful Soup to parse and extract information

In [11]:
! pip install beautifulsoup4 --upgrade --quiet

In [12]:
from bs4 import BeautifulSoup

In [13]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [14]:
#find using queries

In [15]:
p_tags = doc.find_all('p')

In [16]:
len(p_tags)

67

In [17]:
p_tags[:5]

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         PICO-8
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">PICO-8 is a fantasy console for making, sharing and playing tiny games and other computer programs in Lua.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Kubernetes
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Kubernetes is an open source system for automating deployment, scaling, and management of containerized applications.</p>]

In [18]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p', {'class': selection_class})

In [19]:
len(topic_title_tags)

30
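
Passing the whole class string to `find_all`, as far as the Beautiful Soup documentation describes it, matches the `class` attribute as an exact string, so the same classes in a different order (like the `text-center` variant in the `p_tags` output above) would not be found. The sketch below uses the standard library's `html.parser` instead of Beautiful Soup, purely to illustrate order-independent matching on a set of classes. `ParagraphCollector` is a hypothetical name, and the sample HTML is a shortened stand-in for the real page.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text of <p> tags that carry all of the required classes."""

    def __init__(self, required_classes):
        super().__init__()
        self.required = set(required_classes)
        self.inside = False  # True while we are inside a matching <p>
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != 'p':
            return
        classes = set((dict(attrs).get('class') or '').split())
        # Subset test: the order of classes in the attribute does not matter
        self.inside = self.required <= classes
        if self.inside:
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data

sample_html = (
    '<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>'
    '<p class="f5 color-fg-muted mb-0 mt-1">A description</p>'
    '<p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">\n PICO-8\n</p>'
)

collector = ParagraphCollector(['f3', 'Link--primary'])
collector.feed(sample_html)
sample_titles = [text.strip() for text in collector.texts]
print(sample_titles)
```

In Beautiful Soup itself, a CSS selector such as `doc.select('p.f3.Link--primary')` is one way to get similar order-independent matching.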

In [20]:
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [21]:
type(doc)

bs4.BeautifulSoup

# Topic Descriptions

In [22]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'

topic_desc_tags = doc.find_all('p', {'class': desc_selector})

In [23]:
len(topic_desc_tags)

30

In [24]:
topic_desc_tags

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Angular is an open source web application platform.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ansible is a simple and powerful automation engine.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           An API (App

# URL

In [25]:
topic_link_tags = doc.find_all('a',{'class': 'no-underline flex-1 d-flex flex-column'})

In [26]:
len(topic_link_tags)

30

In [27]:
topic_link_tags[0]['href']

'/topics/3d'

In [28]:
topic0_url = 'http://github.com' + topic_link_tags[0]['href']
print(topic0_url)

http://github.com/topics/3d
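
String concatenation works here because every `href` starts with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves relative links against a base URL. The `absolute_url` helper is a hypothetical name, and this sketch uses the `https` scheme as an assumption, while the cell above uses `http`.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'  # assumption: https rather than http

def absolute_url(href):
    """Resolve a site-relative href such as '/topics/3d' against base_url."""
    return urljoin(base_url, href)

print(absolute_url('/topics/3d'))
```

An `href` that is already absolute passes through unchanged, which plain concatenation would not handle.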


# Now

In [29]:
topic_title_tags[0].text

'3D'

In [30]:
topic_desc_tags[0].text

'\n          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.\n        '

In [31]:
print(topic0_url)

http://github.com/topics/3d


# Let's make them a list

In [32]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)


['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [33]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text)

print(topic_descs)

['\n          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.\n        ', '\n          Ajax is a technique for creating interactive web applications.\n        ', '\n          Algorithms are self-contained sequences that carry out a variety of tasks.\n        ', '\n          Amp is a non-blocking concurrency library for PHP.\n        ', '\n          Android is an operating system built by Google designed for mobile devices.\n        ', '\n          Angular is an open source web application platform.\n        ', '\n          Ansible is a simple and powerful automation engine.\n        ', '\n          An API (Application Programming Interface) is a collection of protocols and subroutines for building software.\n        ', '\n          Arduino is an open source platform for building electronic devices.\n        ', '\n          ASP.NET is a web framework for building modern web apps and services.\n        ', '\n          Atom is a open sour

In [34]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())

print(topic_descs)

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud co

In [35]:
topic_urls = []

for tag in topic_link_tags:
    topic_urls.append(tag['href'])

print(topic_urls)

['/topics/3d', '/topics/ajax', '/topics/algorithm', '/topics/amphp', '/topics/android', '/topics/angular', '/topics/ansible', '/topics/api', '/topics/arduino', '/topics/aspnet', '/topics/atom', '/topics/awesome', '/topics/aws', '/topics/azure', '/topics/babel', '/topics/bash', '/topics/bitcoin', '/topics/bootstrap', '/topics/bot', '/topics/c', '/topics/chrome', '/topics/chrome-extension', '/topics/cli', '/topics/clojure', '/topics/code-quality', '/topics/code-review', '/topics/compiler', '/topics/continuous-integration', '/topics/covid-19', '/topics/cpp']


In [36]:
topic_urls = []
base_url = 'http://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

print(topic_urls)

['http://github.com/topics/3d', 'http://github.com/topics/ajax', 'http://github.com/topics/algorithm', 'http://github.com/topics/amphp', 'http://github.com/topics/android', 'http://github.com/topics/angular', 'http://github.com/topics/ansible', 'http://github.com/topics/api', 'http://github.com/topics/arduino', 'http://github.com/topics/aspnet', 'http://github.com/topics/atom', 'http://github.com/topics/awesome', 'http://github.com/topics/aws', 'http://github.com/topics/azure', 'http://github.com/topics/babel', 'http://github.com/topics/bash', 'http://github.com/topics/bitcoin', 'http://github.com/topics/bootstrap', 'http://github.com/topics/bot', 'http://github.com/topics/c', 'http://github.com/topics/chrome', 'http://github.com/topics/chrome-extension', 'http://github.com/topics/cli', 'http://github.com/topics/clojure', 'http://github.com/topics/code-quality', 'http://github.com/topics/code-review', 'http://github.com/topics/compiler', 'http://github.com/topics/continuous-integration

In [37]:
! pip install pandas --quiet

In [38]:
import pandas as pd

In [39]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url':topic_urls
}

In [40]:
topics_dict

{'title': ['3D',
  'Ajax',
  'Algorithm',
  'Amp',
  'Android',
  'Angular',
  'Ansible',
  'API',
  'Arduino',
  'ASP.NET',
  'Atom',
  'Awesome Lists',
  'Amazon Web Services',
  'Azure',
  'Babel',
  'Bash',
  'Bitcoin',
  'Bootstrap',
  'Bot',
  'C',
  'Chrome',
  'Chrome extension',
  'Command line interface',
  'Clojure',
  'Code quality',
  'Code review',
  'Compiler',
  'Continuous integration',
  'COVID-19',
  'C++'],
 'description': ['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
  'Ajax is a technique for creating interactive web applications.',
  'Algorithms are self-contained sequences that carry out a variety of tasks.',
  'Amp is a non-blocking concurrency library for PHP.',
  'Android is an operating system built by Google designed for mobile devices.',
  'Angular is an open source web application platform.',
  'Ansible is a simple and powerful automation engine.',
  'An API (Application Programming Interface) is a 

In [41]:
topics_df = pd.DataFrame(topics_dict)

In [42]:
topics_df

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,http://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,http://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,http://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,http://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,http://github.com/topics/android
5,Angular,Angular is an open source web application plat...,http://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,http://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,http://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,http://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,http://github.com/topics/aspnet


# Create CSV file(s) with the extracted information

In [43]:
topics_df.to_csv('topics.csv', index = None)

# Getting information out of a topic page

In [44]:
topic_page_url = topic_urls[0]

In [45]:
topic_page_url

'http://github.com/topics/3d'

In [46]:
response = requests.get(topic_page_url)

In [47]:
response.status_code

200

In [48]:
len(response.text)

456115

In [49]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

### Find username tags 2

In [50]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags =  topic_doc.find_all('h3',{'class':h3_selection_class })

In [51]:
len(repo_tags)

20

In [52]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js

In [53]:
a_tags = repo_tags[0].find_all('a')

In [54]:
a_tags

[<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [55]:
a_tags[0]

<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>

In [56]:
a_tags[0].text.strip()

'mrdoob'

In [57]:
a_tags[1]

<a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
            three.js
</a>

In [58]:
a_tags[1].text.strip()

'three.js'

In [59]:
repo_url = a_tags[1]["href"]

In [60]:
repo_url

'/mrdoob/three.js'

In [61]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]["href"]
print(repo_url)

https://github.com/mrdoob/three.js
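
The `href` of the second link already contains both the username and the repository name, so they could also be read from it directly. A minimal sketch, assuming every repository link has the shape `/<username>/<repo>`; `split_repo_path` is a hypothetical helper.

```python
def split_repo_path(href):
    """Split an href like '/mrdoob/three.js' into (username, repo_name)."""
    parts = href.strip('/').split('/')
    if len(parts) != 2:
        raise ValueError('Expected "/<username>/<repo>", got {!r}'.format(href))
    return parts[0], parts[1]

print(split_repo_path('/mrdoob/three.js'))
```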


In [62]:
star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})

In [63]:
len(star_tags)

20

In [64]:
star_tags[0]

<span aria-label="89402 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="89,402">89.4k</span>

In [65]:
star_tags[0].text.strip()

'89.4k'

### To convert string into number 2

In [66]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        stars_str[:-1]


In [67]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        float(stars_str[:-1]) * 1000

In [68]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
       int( float(stars_str[:-1]) * 1000)

In [69]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [70]:
parse_star_count(star_tags[0].text.strip())

89400
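
Binary floats cannot represent most decimal fractions exactly, so in principle the product could fall slightly below the intended value before `int()` truncates it. The sketch below is a variant that uses `decimal.Decimal` to avoid that. `parse_star_text` is a hypothetical name, and the `m` suffix is an assumption, since only `k` appears in this notebook.

```python
from decimal import Decimal

# Suffix multipliers; 'm' is an assumption, only 'k' appears in the notebook
SUFFIXES = {'k': 1000, 'm': 1000000}

def parse_star_text(stars_str):
    """Convert counts such as '89.4k', '89,402' or '512' to an int."""
    stars_str = stars_str.strip().lower().replace(',', '')
    if stars_str and stars_str[-1] in SUFFIXES:
        # Decimal keeps '89.4' exact, so the product is exactly 89400
        return int(Decimal(stars_str[:-1]) * SUFFIXES[stars_str[-1]])
    return int(stars_str)

print(parse_star_text('89.4k'))
```

Because commas are stripped, the same function also accepts the exact count stored in the tag's `title` attribute (`"89,402"` in the output above).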

In [71]:
# for the first line

def get_repo_info(h3_tag, star_tag):
    # Returns all the required info about a repository
    a_tags = h3_tag.find_all('a')

    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']

    stars = parse_star_count(star_tag.text.strip())

    return username, repo_name, stars, repo_url


In [72]:
get_repo_info(repo_tags[0], star_tags[0]) 

('mrdoob', 'three.js', 89400, 'https://github.com/mrdoob/three.js')

In [73]:
# for the whole data
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars' : [],
    'repo_url': []
}

for i in range (len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
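
The loop above indexes `repo_tags` and `star_tags` in parallel, which quietly assumes both lists have the same length and order. A minimal sketch of making that assumption explicit; `pair_up` is a hypothetical helper, and the strings below are stand-ins for the real tags.

```python
def pair_up(repo_tags, star_tags):
    """Pair each repository tag with its star tag, refusing mismatched lists."""
    if len(repo_tags) != len(star_tags):
        raise ValueError('Found {} repo tags but {} star tags'.format(
            len(repo_tags), len(star_tags)))
    return list(zip(repo_tags, star_tags))

# Stand-in values instead of real Beautiful Soup tags
for repo, stars in pair_up(['three.js', 'libgdx'], ['89.4k', '21.2k']):
    print(repo, stars)
```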




In [74]:
topic_repos_dict

{'username': ['mrdoob',
  'pmndrs',
  'libgdx',
  'BabylonJS',
  'ssloy',
  'aframevr',
  'lettier',
  'FreeCAD',
  'CesiumGS',
  'metafizzy',
  'timzhang642',
  'isl-org',
  'blender',
  'a1studmuffin',
  'domlysz',
  'FyroxEngine',
  'openscad',
  'google',
  'spritejs',
  'jagenjo'],
 'repo_name': ['three.js',
  'react-three-fiber',
  'libgdx',
  'Babylon.js',
  'tinyrenderer',
  'aframe',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'cesium',
  'zdog',
  '3D-Machine-Learning',
  'Open3D',
  'blender',
  'SpaceshipGenerator',
  'BlenderGIS',
  'Fyrox',
  'openscad',
  'model-viewer',
  'spritejs',
  'webglstudio.js'],
 'stars': [89400,
  21600,
  21200,
  19400,
  16200,
  15100,
  14600,
  13300,
  10000,
  9600,
  8700,
  8100,
  7800,
  7300,
  6100,
  5900,
  5400,
  5400,
  5100,
  4900],
 'repo_url': ['https://github.com/mrdoob/three.js',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/libgdx/libgdx',
  'https://github.com/BabylonJS/Babylon.js',
  '

In [75]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [76]:
topic_repos_df

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,89400,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,21600,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21200,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,19400,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,16200,https://github.com/ssloy/tinyrenderer
5,aframevr,aframe,15100,https://github.com/aframevr/aframe
6,lettier,3d-game-shaders-for-beginners,14600,https://github.com/lettier/3d-game-shaders-for...
7,FreeCAD,FreeCAD,13300,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,10000,https://github.com/CesiumGS/cesium
9,metafizzy,zdog,9600,https://github.com/metafizzy/zdog


# FINAL CODE

In [162]:
import os

def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    # Returns all the required info about a repository
    a_tags = h3_tag.find_all('a')

    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']

    stars = parse_star_count(star_tag.text.strip())

    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags =  topic_doc.find_all('h3',{'class':h3_selection_class })
    
    # Get star tags
    star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    # Get repo infos
    for i in range (len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)

#one more step
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index = None)


In [163]:
topic_urls

['http://github.com/topics/3d',
 'http://github.com/topics/ajax',
 'http://github.com/topics/algorithm',
 'http://github.com/topics/amphp',
 'http://github.com/topics/android',
 'http://github.com/topics/angular',
 'http://github.com/topics/ansible',
 'http://github.com/topics/api',
 'http://github.com/topics/arduino',
 'http://github.com/topics/aspnet',
 'http://github.com/topics/atom',
 'http://github.com/topics/awesome',
 'http://github.com/topics/aws',
 'http://github.com/topics/azure',
 'http://github.com/topics/babel',
 'http://github.com/topics/bash',
 'http://github.com/topics/bitcoin',
 'http://github.com/topics/bootstrap',
 'http://github.com/topics/bot',
 'http://github.com/topics/c',
 'http://github.com/topics/chrome',
 'http://github.com/topics/chrome-extension',
 'http://github.com/topics/cli',
 'http://github.com/topics/clojure',
 'http://github.com/topics/code-quality',
 'http://github.com/topics/code-review',
 'http://github.com/topics/compiler',
 'http://github.com/to

In [164]:
url4 = topic_urls[4]

In [165]:
url4

'http://github.com/topics/android'

In [166]:
topic4_doc = get_topic_page(url4)

In [167]:
topic4_repos = get_topic_repos(topic4_doc)

In [168]:
topic4_repos

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,151000,https://github.com/flutter/flutter
1,facebook,react-native,108000,https://github.com/facebook/react-native
2,justjavac,free-programming-books-zh_CN,100000,https://github.com/justjavac/free-programming-...
3,Genymobile,scrcpy,77300,https://github.com/Genymobile/scrcpy
4,Hack-with-Github,Awesome-Hacking,61900,https://github.com/Hack-with-Github/Awesome-Ha...
5,google,material-design-icons,47600,https://github.com/google/material-design-icons
6,wasabeef,awesome-android-ui,45600,https://github.com/wasabeef/awesome-android-ui
7,Solido,awesome-flutter,45600,https://github.com/Solido/awesome-flutter
8,square,okhttp,43600,https://github.com/square/okhttp
9,android,architecture-samples,42300,https://github.com/android/architecture-samples


In [169]:
topic_repos_df

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,89400,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,21600,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21200,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,19400,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,16200,https://github.com/ssloy/tinyrenderer
5,aframevr,aframe,15100,https://github.com/aframevr/aframe
6,lettier,3d-game-shaders-for-beginners,14600,https://github.com/lettier/3d-game-shaders-for...
7,FreeCAD,FreeCAD,13300,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,10000,https://github.com/CesiumGS/cesium
9,metafizzy,zdog,9600,https://github.com/metafizzy/zdog


In [170]:
#just for Android
get_topic_repos(get_topic_page(topic_urls[4]))

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,151000,https://github.com/flutter/flutter
1,facebook,react-native,108000,https://github.com/facebook/react-native
2,justjavac,free-programming-books-zh_CN,100000,https://github.com/justjavac/free-programming-...
3,Genymobile,scrcpy,77300,https://github.com/Genymobile/scrcpy
4,Hack-with-Github,Awesome-Hacking,61900,https://github.com/Hack-with-Github/Awesome-Ha...
5,google,material-design-icons,47600,https://github.com/google/material-design-icons
6,wasabeef,awesome-android-ui,45600,https://github.com/wasabeef/awesome-android-ui
7,Solido,awesome-flutter,45600,https://github.com/Solido/awesome-flutter
8,square,okhttp,43600,https://github.com/square/okhttp
9,android,architecture-samples,42300,https://github.com/android/architecture-samples


In [171]:
# just for Ansible (topic_urls[6])
get_topic_repos(get_topic_page(topic_urls[6])).to_csv("ansible.csv",index=None)

# Putting it together

In [172]:



def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})

    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs
    
def get_topic_urls(doc):    
    topic_link_tags = doc.find_all('a',{'class': 'no-underline flex-1 d-flex flex-column'})
    
    topic_urls = []
    base_url = 'http://github.com'

    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
    




def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)

    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))

    # Parse the page we just downloaded instead of relying on a global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)
        



In [173]:
scrape_topics()

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,http://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,http://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,http://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,http://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,http://github.com/topics/android
5,Angular,Angular is an open source web application plat...,http://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,http://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,http://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,http://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,http://github.com/topics/aspnet


In [174]:
#iterate rows in pandas DF
for index, row in topics_df.iterrows():
    print (row['title'],row['url'])

3D http://github.com/topics/3d
Ajax http://github.com/topics/ajax
Algorithm http://github.com/topics/algorithm
Amp http://github.com/topics/amphp
Android http://github.com/topics/android
Angular http://github.com/topics/angular
Ansible http://github.com/topics/ansible
API http://github.com/topics/api
Arduino http://github.com/topics/arduino
ASP.NET http://github.com/topics/aspnet
Atom http://github.com/topics/atom
Awesome Lists http://github.com/topics/awesome
Amazon Web Services http://github.com/topics/aws
Azure http://github.com/topics/azure
Babel http://github.com/topics/babel
Bash http://github.com/topics/bash
Bitcoin http://github.com/topics/bitcoin
Bootstrap http://github.com/topics/bootstrap
Bot http://github.com/topics/bot
C http://github.com/topics/c
Chrome http://github.com/topics/chrome
Chrome extension http://github.com/topics/chrome-extension
Command line interface http://github.com/topics/cli
Clojure http://github.com/topics/clojure
Code quality http://github.com/topics/

In [175]:
#def scrape_topics_repos():
    #print('Scraping list of topics from GitHub')
    
    #topics_df = scrape_topics()
    
    #for index, row in topics_df.iterrows():
     #   print ('Scraping top repositories for "{}"'.format(row['title']))
        
      #  scrape_topic(row['url'], row['title'])

In [176]:
#scrape_topics_repos()

Putting the CSV files into a folder

###### import os

help(os.makedirs)
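
A minimal sketch of what the `exist_ok` argument of `os.makedirs` changes. It runs inside a temporary directory (an assumption made so nothing is left behind on disk).

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, 'data')
    os.makedirs(data_dir, exist_ok=True)  # creates the folder
    os.makedirs(data_dir, exist_ok=True)  # folder exists: no error
    created = os.path.isdir(data_dir)
    try:
        os.makedirs(data_dir)             # default is exist_ok=False
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)
```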

In [177]:
def scrape_topics_repos():
    print('Scraping list of topics from GitHub')
    
    topics_df = scrape_topics()
    
    #Create a folder here inside folder data **
    os.makedirs('data', exist_ok = True )
    
    
    
    for index, row in topics_df.iterrows():
        
        print ('Scraping top repositories for "{}"'.format(row['title']))
        
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
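
Topic titles such as `C++` or `Command line interface` end up directly in file names here. One hedged alternative is to name each file after the last segment of the topic URL, which is already lowercase and free of spaces in the URLs listed above. `topic_csv_path` is a hypothetical helper, not something the notebook defines.

```python
import posixpath
from urllib.parse import urlparse

def topic_csv_path(topic_url, folder='data'):
    """Build 'data/<slug>.csv' from a URL like 'http://github.com/topics/cpp'."""
    # The slug is whatever follows the last '/' in the URL path
    slug = urlparse(topic_url).path.rstrip('/').rsplit('/', 1)[-1]
    return posixpath.join(folder, slug + '.csv')

print(topic_csv_path('http://github.com/topics/chrome-extension'))
```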

In [178]:
scrape_topics_repos()

Scraping list of topics from GitHub
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Cloj