# webscraping-github

### Picking a website and general overview of the project
- Scrape the GitHub topics page: https://github.com/topics
- Get the list of topics. For each topic, get the topic title, topic page URL and description.
- For each topic, get the top repositories listed on the topic page (30 in the run shown here).
- For each repository, grab the repository name, username, stars and repository URL.
- For each topic, create a CSV file in the following format:
``````
Repo Name,Username,Stars,Repo URL
infinite-scroll,metafizzy,7100,https://github.com/metafizzy/infinite-scroll
Blog,ljianshu,7000,https://github.com/ljianshu/Blog
``````
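
A minimal sketch of how rows in this format could be written with Python's standard `csv` module. The helper name `repos_to_csv` and the in-memory buffer are illustrative assumptions, not part of the notebook, which uses pandas for this step.

```python
import csv
import io

def repos_to_csv(rows):
    """Return CSV text for (repo_name, username, stars, repo_url) rows.

    Hypothetical helper for illustration only.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
    writer.writerows(rows)
    return buffer.getvalue()

# Sample rows taken from the format example above.
sample_rows = [
    ("infinite-scroll", "metafizzy", 7100, "https://github.com/metafizzy/infinite-scroll"),
    ("Blog", "ljianshu", 7000, "https://github.com/ljianshu/Blog"),
]
csv_text = repos_to_csv(sample_rows)
print(csv_text)
```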

In [1]:
# importing requests library
import requests

In [2]:
# webpage selected for scraping
topics_url = 'https://github.com/topics'

In [3]:
response = requests.get(topics_url)

In [4]:
response.status_code

200
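
A status code of 200 means the request succeeded; any other code (for example 404) means `response.text` may not contain the expected page. A minimal sketch of a guard, assuming a hypothetical helper `ensure_ok` and stand-in response objects so the example runs without network access:

```python
from types import SimpleNamespace

def ensure_ok(response, url):
    """Raise if the response-like object does not carry HTTP status 200.

    Hypothetical helper; works with any object exposing `status_code`.
    """
    if response.status_code != 200:
        raise RuntimeError(f"Failed to load {url} (status {response.status_code})")
    return response

# Stand-ins for real `requests` responses, used only for illustration.
ok = SimpleNamespace(status_code=200, text="<html></html>")
missing = SimpleNamespace(status_code=404, text="Not Found")

ensure_ok(ok, "https://github.com/topics")
try:
    ensure_ok(missing, "https://github.com/topics/does-not-exist")
except RuntimeError as err:
    error_message = str(err)
    print(error_message)
```

With a real `requests` response, `response.raise_for_status()` performs a similar check for 4xx and 5xx codes.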

In [5]:
len(response.text)

141849

In [6]:
page_contents = response.text

In [7]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" rel="stylesheet" href="https://github.githubassets.com/assets/light-92c7d381038e.css" /><link crossorigin="anonymous" media="all" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+solP87jByEvY/g4BmoxLihRogKcX

In [8]:
# creating a local copy of the webpage on the server
# an explicit utf-8 encoding avoids platform-dependent encoding errors
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

### Using Beautiful Soup for parsing and extracting information

In [9]:
!pip install beautifulsoup4 --upgrade --quiet

In [10]:
from bs4 import BeautifulSoup

In [11]:
# parsing information using html.parser
doc = BeautifulSoup(page_contents, 'html.parser')

In [12]:
# getting the topic title tags; the class string was found by inspecting the page in the browser
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary' 

topic_title_tags = doc.find_all('p', {'class': selection_class})

In [13]:
len(topic_title_tags)

30
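
Passing a multi-class string to `find_all` matches the exact value of the `class` attribute, so the search is sensitive to class order and to any class being added or removed. A small sketch on made-up HTML (the markup below is an assumption for illustration, not GitHub's) compares this with a CSS selector via `select`, which matches the classes in any order:

```python
from bs4 import BeautifulSoup

# Made-up markup: same classes, different order in the second tag.
html = """
<p class="f3 Link--primary">3D</p>
<p class="Link--primary f3">Ajax</p>
<p class="f5">A description</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: only the first <p> matches.
exact = soup.find_all("p", {"class": "f3 Link--primary"})

# CSS selector: every <p> carrying both classes matches, in any order.
flexible = soup.select("p.f3.Link--primary")

print([tag.text for tag in exact])
print([tag.text for tag in flexible])
```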

In [14]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [15]:
# getting the topic description tags; the class string was found by inspecting the page in the browser
desc_selector = 'f5 color-fg-muted mb-0 mt-1'

topic_desc_tags = doc.find_all('p', {'class': desc_selector})

In [16]:
len(topic_desc_tags)

30

In [17]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [18]:
# getting the topic title tag at the 0th element which is 3D
topic_title_tag0 = topic_title_tags[0]

In [19]:
topic_title_tag0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [20]:
# getting the parent of the tag above, i.e. the tag that encloses it (an 'a' tag, despite the variable name)
div_tag = topic_title_tag0.parent

In [21]:
div_tag

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>

In [22]:
# getting the link URL of the topics from the a tag (parent of the p tags for topic titles and descriptions)
topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column' })

In [23]:
len(topic_link_tags)

30

In [24]:
# printing URL of the first topic (3D) on the GitHub page
topic0_url = 'https://github.com' + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
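
String concatenation works here because each `href` is a root-relative path. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with hrefs that are already absolute; the sample hrefs below are illustrative:

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# Root-relative href, as found on the topics page.
relative_result = urljoin(base_url, "/topics/3d")
print(relative_result)

# An already absolute href is returned unchanged instead of being doubled.
absolute_result = urljoin(base_url, "https://github.com/topics/ajax")
print(absolute_result)
```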


In [25]:
# printing only the title of the topics
topic_titles = []  # empty list to collect the topic titles
for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [26]:
# parsing description of the topics
topic_descs = []
for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip()) # strip removes the leading and trailing whitespace around the text
topic_descs

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'Azu

In [27]:
# parsing topic URLs
topic_urls = []
base_url = 'https://github.com'
for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

### Creating CSV file/s with the extracted information

In [28]:
!pip install pandas --quiet

In [29]:
import pandas as pd

In [30]:
# creating topics dictionary to display the data in a table using pandas library
topics_dict = {'Topic': topic_titles, # 'Topic'-name of the column, topic_titles-info of that attribute to be stored in the column 
               'Description': topic_descs,
               'URL': topic_urls }

In [31]:
topics_df = pd.DataFrame(topics_dict)
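
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which would happen if one selector matched more tags than another. A minimal sketch of a pre-check, using a hypothetical helper `check_equal_lengths` and shortened sample data:

```python
def check_equal_lengths(columns):
    """Raise ValueError if the column lists differ in length (hypothetical helper)."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths

# Two topics only, to keep the example short.
sample = {
    "Topic": ["3D", "Ajax"],
    "Description": [
        "3D modeling is the process of virtually developing the surface and structure of a 3D object.",
        "Ajax is a technique for creating interactive web applications.",
    ],
    "URL": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
print(check_equal_lengths(sample))
```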

In [32]:
topics_df

Unnamed: 0,Topic,Description,URL
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [73]:
# creating CSV file of topics
# index=False keeps the row index out of the file
topics_df.to_csv('topics.csv', index=False)

### Getting information out of a topic page


In [34]:
# getting the URL of the first topic page (3D)
topic_page_url = topic_urls[0]

In [35]:
topic_page_url

'https://github.com/topics/3d'

In [36]:
# getting the response for the topic page
response = requests.get(topic_page_url)

In [37]:
response.status_code

200

In [38]:
# checking the number of characters in the HTML of the 3D topic page
len(response.text)

644758

In [39]:
doc = BeautifulSoup(response.text, 'html.parser')

In [40]:
# finding the h3 tags that hold the username and repo name of the 30 repositories on the 3D topic page
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed' 

repo_tags = doc.find_all('h3', {'class': h3_selection_class})

In [41]:
repo_tags

[<h3 class="f3 color-fg-muted text-normal lh-condensed">
 <a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>          /
           <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click

In [42]:
len(repo_tags)

30

In [43]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac

In [65]:
# finding all 'a' tags in the first repo tag
a_tags = repo_tags[0].find_all('a')

In [66]:
# parsing username of the first repository from the 3D page using the 'a' tag
a_tags[0].text.strip()

'mrdoob'

In [67]:
# parsing the repo name of the first repo using the 'a' tag
a_tags[1].text.strip()

'three.js'

In [68]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']

In [69]:
# printing the URL of the first repo from the 3D topic webpage
print(repo_url)

https://github.com/mrdoob/three.js


In [70]:
# getting the number of stars of each repository
star_tags = doc.find_all('span', {'class': 'Counter js-social-count'})

In [71]:
len(star_tags)

30

In [72]:
star_tags[0].text.strip()

'83.1k'

In [52]:
# function to convert stars string to integer
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [53]:
parse_star_count(star_tags[0].text.strip())

83100
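
Multiplying a float by 1000 and truncating with `int` can lose a unit to binary rounding for some inputs. A hedged variant using `decimal.Decimal` avoids that. The name `parse_star_count_exact` is hypothetical, and the `m` suffix and thousands separators handled below are assumptions about formats the page might show, not something observed in this notebook:

```python
from decimal import Decimal

# Multipliers for suffixes; 'm' is an assumed format, not seen in this notebook.
SUFFIX_MULTIPLIERS = {"k": 1_000, "m": 1_000_000}

def parse_star_count_exact(stars_str):
    """Convert strings such as '83.1k' or '1,234' to an int without float rounding."""
    text = stars_str.strip().lower().replace(",", "")
    multiplier = SUFFIX_MULTIPLIERS.get(text[-1:], 1)
    if multiplier != 1:
        text = text[:-1]  # drop the suffix before converting
    return int(Decimal(text) * multiplier)

print(parse_star_count_exact("83.1k"))
print(parse_star_count_exact("987"))
```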

In [54]:
# function to return info about a specific repository
def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars

In [55]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', 83100)
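
A plain tuple has to be read by position, so the caller must remember the field order. A minimal sketch of an alternative using `typing.NamedTuple`; the `RepoInfo` class is a hypothetical name, and the values are the ones shown in the output above:

```python
from typing import NamedTuple

class RepoInfo(NamedTuple):
    """Hypothetical record type for one repository."""
    username: str
    repo_name: str
    repo_url: str
    stars: int

info = RepoInfo(
    username="mrdoob",
    repo_name="three.js",
    repo_url="https://github.com/mrdoob/three.js",
    stars=83100,
)

# Fields are read by name, so the caller does not depend on their position.
print(info.repo_url, info.stars)

# It still behaves like a tuple when needed.
print(tuple(info))
```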

In [56]:
# collecting repo info for all the repos on the page (30 in this case)
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'repo_url': [],
    'stars': []
}
for i in range(len(repo_tags)):
    # get_repo_info returns (username, repo_name, repo_url, stars)
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['repo_url'].append(repo_info[2])
    topic_repos_dict['stars'].append(repo_info[3])

In [57]:
import os
# function to return topic page
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup 
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

# function to return info about a specific repository
def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    # Get repo info; get_repo_info returns (username, repo_name, stars, repo_url)
    for i in range(len(repo_tags)):
        username, repo_name, stars, repo_url = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)

    return pd.DataFrame(topic_repos_dict)

In [58]:
# checking if the function is working
url4 = topic_urls[4]

In [59]:
url4

'https://github.com/topics/android'

In [60]:
topic4_doc = get_topic_page(url4)

In [61]:
topic4_repos = get_topic_repos(topic4_doc)

In [62]:
topic4_repos 

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,142000,https://github.com/flutter/flutter
1,justjavac,free-programming-books-zh_CN,93800,https://github.com/justjavac/free-programming-...
2,Genymobile,scrcpy,66900,https://github.com/Genymobile/scrcpy
3,Hack-with-Github,Awesome-Hacking,52000,https://github.com/Hack-with-Github/Awesome-Ha...
4,google,material-design-icons,46000,https://github.com/google/material-design-icons
5,wasabeef,awesome-android-ui,42900,https://github.com/wasabeef/awesome-android-ui
6,square,okhttp,42400,https://github.com/square/okhttp
7,Solido,awesome-flutter,41200,https://github.com/Solido/awesome-flutter
8,android,architecture-samples,41000,https://github.com/android/architecture-samples
9,square,retrofit,40100,https://github.com/square/retrofit


In [63]:
# chaining both function calls in a single line
get_topic_repos(get_topic_page(topic_urls[4]))

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,142000,https://github.com/flutter/flutter
1,justjavac,free-programming-books-zh_CN,93800,https://github.com/justjavac/free-programming-...
2,Genymobile,scrcpy,66900,https://github.com/Genymobile/scrcpy
3,Hack-with-Github,Awesome-Hacking,52000,https://github.com/Hack-with-Github/Awesome-Ha...
4,google,material-design-icons,46000,https://github.com/google/material-design-icons
5,wasabeef,awesome-android-ui,42900,https://github.com/wasabeef/awesome-android-ui
6,square,okhttp,42400,https://github.com/square/okhttp
7,Solido,awesome-flutter,41200,https://github.com/Solido/awesome-flutter
8,android,architecture-samples,41000,https://github.com/android/architecture-samples
9,square,retrofit,40100,https://github.com/square/retrofit


In [64]:
# creating CSV file of the topic page
get_topic_repos(get_topic_page(topic_urls[4])).to_csv('android.csv', index=False)
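
The file name above is typed by hand. To repeat this step for every topic, the name could be derived from the topic URL. A minimal sketch, assuming a hypothetical helper `topic_csv_name` and that the last path segment of the URL identifies the topic:

```python
from urllib.parse import urlparse

def topic_csv_name(topic_url):
    """Build a CSV file name from the last path segment of a topic URL.

    Hypothetical helper for illustration.
    """
    path = urlparse(topic_url).path.rstrip("/")
    slug = path.rsplit("/", 1)[-1]
    return slug + ".csv"

print(topic_csv_name("https://github.com/topics/android"))
print(topic_csv_name("https://github.com/topics/chrome-extension"))
```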