# **Web Scraping Using bs4**


In [1]:
import requests
from bs4 import BeautifulSoup

## **Requesting Pages using requests**


In [2]:
url = "https://github.com/topics"
response = requests.get(url)

# response.ok is True for status codes below 400
if response.ok:
    print("Access Accepted")
else:
    print("Access Rejected")

Access Accepted
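
A `requests.Response` is truthy exactly when its `ok` property is true, which is the case for status codes below 400, so testing the response object directly and testing `response.ok` are equivalent. The sketch below illustrates this on `Response` objects built locally (no network call is made); the helper name `describe` is hypothetical and only serves the illustration.

```python
import requests


def describe(status_code: int) -> str:
    """Return the message the check above would print for a given status code."""
    # Hypothetical helper: builds a bare Response locally instead of sending a request.
    response = requests.Response()
    response.status_code = status_code
    return "Access Accepted" if response.ok else "Access Rejected"


for code in (200, 302, 404, 503):
    print(code, describe(code))
```

For scripts that should stop on a failed request, `response.raise_for_status()` raises an `HTTPError` for the same 4xx and 5xx codes.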


In [3]:
# Getting content of the page
content = response.text

# number of characters in the content
print(len(content))

165659


In [4]:
# saving the content to an HTML file (explicit encoding avoids platform-dependent defaults)
with open("topic.html", "w", encoding="utf-8") as f:
    f.write(content)
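
Once the page is saved, later runs can parse the local copy instead of requesting it again. The following is a minimal sketch of that idea, assuming a hypothetical helper `load_or_fetch` and a stand-in `fake_fetch` function that plays the role of `requests.get(url).text`; a temporary directory is used so the example leaves no files behind.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_or_fetch(path: Path, fetch: Callable[[], str]) -> str:
    """Return the cached page at `path`, calling `fetch` only when no copy exists."""
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch()
    path.write_text(html, encoding="utf-8")
    return html


# Stand-in for a real download; it records how often it is called.
calls = []


def fake_fetch() -> str:
    calls.append(1)
    return "<html><body>GitHub Topics demo page</body></html>"


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "topic.html"
    first = load_or_fetch(cache_file, fake_fetch)   # fetches and writes the file
    second = load_or_fetch(cache_file, fake_fetch)  # read back from disk

print(first == second, len(calls))
```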

# **Parsing using bs4**


In [5]:
# making a bs4 object
doc = BeautifulSoup(content, "html.parser")

In [6]:
# pretty-printing the parsed HTML, one tag per line with nested indentation
print(doc.prettify())

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-a11y-link-underlines="false" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
  <link href="https://avatars.githubusercontent.com" rel="preconnect"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-8cafbcbd78f4.css" media="all" rel="stylesheet">
   <link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-31dc14e38457.css" media="all" rel="stylesheet">
    <link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https:/

In [7]:
topics = []
for i in doc.find_all("p", class_="f3 lh-condensed mb-0 mt-1 Link--primary"):
    topics.append(i.string)
    print(i.string)

3D
Ajax
Algorithm
Amp
Android
Angular
Ansible
API
Arduino
ASP.NET
Atom
Awesome Lists
Amazon Web Services
Azure
Babel
Bash
Bitcoin
Bootstrap
Bot
C
Chrome
Chrome extension
Command line interface
Clojure
Code quality
Code review
Compiler
Continuous integration
COVID-19
C++
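
Passing a multi-class string to `class_` matches the full `class` attribute value as written, so the lookup depends on the order in which the page lists its classes. A CSS selector passed to `select` checks each class independently. The snippet below is a small sketch on a hand-written HTML fragment (not GitHub's actual markup) that shows the difference.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration; the second <p> lists the same classes in another order.
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f3 mb-0 lh-condensed mt-1 Link--primary">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match: only the element whose class attribute is written in this order.
exact = [
    p.get_text(strip=True)
    for p in soup.find_all("p", class_="f3 lh-condensed mb-0 mt-1 Link--primary")
]

# CSS selector: any <p> carrying both classes, regardless of order.
by_selector = [p.get_text(strip=True) for p in soup.select("p.f3.Link--primary")]

print(exact)
print(by_selector)
```

Either way, the class names are tied to the page's current styling and may need updating when the site changes its markup.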


In [64]:
topic_urls = []
base = "https://github.com"
for i in doc.find_all("a", class_="no-underline flex-1 d-flex flex-column"):
    topic_urls.append(base + i.get("href"))
    print(base + i.get("href"))

https://github.com/topics/3d
https://github.com/topics/ajax
https://github.com/topics/algorithm
https://github.com/topics/amphp
https://github.com/topics/android
https://github.com/topics/angular
https://github.com/topics/ansible
https://github.com/topics/api
https://github.com/topics/arduino
https://github.com/topics/aspnet
https://github.com/topics/atom
https://github.com/topics/awesome
https://github.com/topics/aws
https://github.com/topics/azure
https://github.com/topics/babel
https://github.com/topics/bash
https://github.com/topics/bitcoin
https://github.com/topics/bootstrap
https://github.com/topics/bot
https://github.com/topics/c
https://github.com/topics/chrome
https://github.com/topics/chrome-extension
https://github.com/topics/cli
https://github.com/topics/clojure
https://github.com/topics/code-quality
https://github.com/topics/code-review
https://github.com/topics/compiler
https://github.com/topics/continuous-integration
https://github.com/topics/covid-19
https://github.com/
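
Concatenating `base + href` assumes every `href` is root-relative (starts with `/`). As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves relative and absolute links against the base URL. The `href` values below are illustrative examples, not values scraped from the page.

```python
from urllib.parse import urljoin

base = "https://github.com"

# Illustrative href values: root-relative, relative without a leading slash, and absolute.
hrefs = ["/topics/3d", "topics/ajax", "https://github.com/topics/cpp"]

resolved = [urljoin(base, href) for href in hrefs]
for link in resolved:
    print(link)
```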

In [9]:
descriptions = []
for i in doc.find_all("p", class_="f5 color-fg-muted mb-0 mt-1"):
    descriptions.append(i.string.strip())
    print(i.string.strip())

3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
Ajax is a technique for creating interactive web applications.
Algorithms are self-contained sequences that carry out a variety of tasks.
Amp is a non-blocking concurrency library for PHP.
Android is an operating system built by Google designed for mobile devices.
Angular is an open source web application platform.
Ansible is a simple and powerful automation engine.
An API (Application Programming Interface) is a collection of protocols and subroutines for building software.
Arduino is an open source platform for building electronic devices.
ASP.NET is a web framework for building modern web apps and services.
Atom is a open source text editor built with web technologies.
An awesome list is a list of awesome things curated by the community.
Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.
Azure is a cloud computing service created by Microsoft.
Bab

# **Making a Dataframe and saving in CSV format**


In [10]:
# importing the library
import pandas as pd

In [65]:
# making a dataframe out of the scraped data
df = pd.DataFrame({"Topic": topics, "Description": descriptions, "URL": topic_urls})

# saving as a CSV file (without the index column)
df.to_csv("topics.csv", index=False)

# viewing dataframe
df

Unnamed: 0,Topic,Description,URL
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [22]:
# number of rows (topics) in the dataframe
print(f"The scraped CSV contains information on {df.shape[0]} topics")

The scraped CSV contains information on 30 topics


# **Scraping topic-wise Top Repositories**


In [89]:
def enter_topic_url(URL: str):
    # topic url, e.g. "https://github.com/topics/3d"
    topic_url = URL

    # the topic name is the last path segment of the url
    topic_name = topic_url.split('/')[-1]

    # making request
    response = requests.get(topic_url)

    # getting content
    content = response.text

    # making a bs4 object
    soup = BeautifulSoup(content, "html.parser")

    # initializing empty lists to store information
    usernames = []
    repository_names = []
    username_acc_url = []
    repository_url = []
    base_url = "https://github.com"
    stars = []

    # parsing useful information: the first link is the owner, the second the repository
    for i in soup.find_all("h3", class_="f3 color-fg-muted text-normal lh-condensed"):
        links = i.find_all("a")
        usernames.append(links[0].string.strip())
        repository_names.append(links[1].string.strip())
        username_acc_url.append(base_url + links[0].get("href"))
        repository_url.append(base_url + links[1].get("href"))

    # star counts such as "59.6k" are abbreviated; text without a "k" suffix is taken as-is
    for i in soup.find_all('span', class_="Counter js-social-count"):
        text = i.text.strip().lower()
        if text.endswith('k'):
            stars.append(round(float(text[:-1]) * 1000))
        else:
            stars.append(int(text.replace(',', '')))

    # making a dataframe
    df = pd.DataFrame({"Topic": topic_name, "Username": usernames, "Repo_name": repository_names,
                       'username_url': username_acc_url, 'repo_url': repository_url, 'star': stars})

    return df

In [90]:
enter_topic_url("https://github.com/topics/atom")

Unnamed: 0,Topic,Username,Repo_name,username_url,repo_url,star
0,atom,atom,atom,https://github.com/atom,https://github.com/atom/atom,59600
1,atom,miniflux,v2,https://github.com/miniflux,https://github.com/miniflux/v2,5400
2,atom,themerdev,themer,https://github.com/themerdev,https://github.com/themerdev/themer,5200
3,atom,shd101wyy,markdown-preview-enhanced,https://github.com/shd101wyy,https://github.com/shd101wyy/markdown-preview...,3900
4,atom,nteract,hydrogen,https://github.com/nteract,https://github.com/nteract/hydrogen,3900
5,atom,atom,teletype,https://github.com/atom,https://github.com/atom/teletype,2400
6,atom,mmcdole,gofeed,https://github.com/mmcdole,https://github.com/mmcdole/gofeed,2200
7,atom,NaiboWang,CommandlineConfig,https://github.com/NaiboWang,https://github.com/NaiboWang/CommandlineConfig,2000
8,atom,feeddd,feeds,https://github.com/feeddd,https://github.com/feeddd/feeds,1900
9,atom,mehcode,awesome-atom,https://github.com/mehcode,https://github.com/mehcode/awesome-atom,1900
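
The star column comes from abbreviated counter text such as `59.6k`. Factoring that conversion into a small function makes it easy to check in isolation. The sketch below assumes a hypothetical helper `parse_star_count` and treats text without a `k` suffix as a plain integer, which is an assumption about how small counts are displayed.

```python
def parse_star_count(text: str) -> int:
    """Convert counter text such as '59.6k' or '950' to an integer."""
    cleaned = text.strip().lower().replace(",", "")
    if cleaned.endswith("k"):
        # "59.6k" -> 59.6 * 1000; round() guards against floating-point residue
        return round(float(cleaned[:-1]) * 1000)
    # Assumption: text without a suffix is already a whole number
    return int(cleaned)


for raw in ["59.6k", " 5.4k ", "950", "1,234"]:
    print(repr(raw), parse_star_count(raw))
```

Multiplying every value by 1000 without checking for the suffix would turn a count like `950` into `950000`, which is why the suffix is tested first.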


# **Scraping all the topics at once**

In [92]:
# scraping every topic page and stacking the results into one dataframe
full_df = pd.concat([enter_topic_url(i) for i in topic_urls])

full_df


Unnamed: 0,Topic,Username,Repo_name,username_url,repo_url,star
0,3d,mrdoob,three.js,https://github.com/mrdoob,https://github.com/mrdoob/three.js,94000
1,3d,pmndrs,react-three-fiber,https://github.com/pmndrs,https://github.com/pmndrs/react-three-fiber,23500
2,3d,libgdx,libgdx,https://github.com/libgdx,https://github.com/libgdx/libgdx,21800
3,3d,BabylonJS,Babylon.js,https://github.com/BabylonJS,https://github.com/BabylonJS/Babylon.js,21200
4,3d,ssloy,tinyrenderer,https://github.com/ssloy,https://github.com/ssloy/tinyrenderer,17600
...,...,...,...,...,...,...
15,cpp,dragonflydb,dragonfly,https://github.com/dragonflydb,https://github.com/dragonflydb/dragonfly,20800
16,cpp,ethereum,solidity,https://github.com/ethereum,https://github.com/ethereum/solidity,20700
17,cpp,gabime,spdlog,https://github.com/gabime,https://github.com/gabime/spdlog,19500
18,cpp,microsoft,vcpkg,https://github.com/microsoft,https://github.com/microsoft/vcpkg,19400
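
In the output above the row labels restart for each topic (the last `cpp` rows are numbered 15 to 18), because `pd.concat` keeps each frame's own index. If a single running index is preferred, `ignore_index=True` renumbers the rows. A minimal sketch with two small stand-in frames (the names `frame_3d` and `frame_cpp` are illustrative):

```python
import pandas as pd

# Two small per-topic frames standing in for the results of two scraped pages.
frame_3d = pd.DataFrame({"Topic": "3d", "Repo_name": ["three.js", "libgdx"]})
frame_cpp = pd.DataFrame({"Topic": "cpp", "Repo_name": ["spdlog", "vcpkg"]})

# Default: each frame keeps its own row labels.
stacked = pd.concat([frame_3d, frame_cpp])

# ignore_index=True: rows are renumbered from 0 across the combined frame.
renumbered = pd.concat([frame_3d, frame_cpp], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```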


# **Merging Topics & Repositories**

In [109]:
# making a copy of topic dataframe
df_copy=df.copy()

In [110]:
# creating a dummy column for merging with full_df
df_copy['topic_name_from_url']=df.URL.str.split('/').str[-1]
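
Splitting on `/` and taking the last piece works for URLs shaped like the ones scraped here, but it returns an empty string when a URL ends with a slash and keeps any query string. The following sketch shows a slightly more tolerant variant using the standard library; the helper name `topic_slug` and the example URLs are assumptions for illustration.

```python
from urllib.parse import urlparse


def topic_slug(url: str) -> str:
    """Return the last non-empty path segment of a topic URL."""
    path = urlparse(url).path  # drops the scheme, host and query string
    return path.rstrip("/").rsplit("/", 1)[-1]


examples = [
    "https://github.com/topics/3d",
    "https://github.com/topics/cpp/",
    "https://github.com/topics/api?o=desc",
]
slugs = [topic_slug(u) for u in examples]
print(slugs)
```

With pandas, such a function could be applied per row via `df.URL.map(topic_slug)`.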

In [121]:
# merged data
complete_dataframe = (
    full_df.merge(df_copy, left_on='Topic', right_on='topic_name_from_url')
    .drop(columns=['Topic_x', 'topic_name_from_url'])
    .rename(columns={'Topic_y': 'Topic'})
)

In [122]:
# viewing the final dataframe
complete_dataframe.head()

Unnamed: 0,Username,Repo_name,username_url,repo_url,star,Topic,Description,URL
0,mrdoob,three.js,https://github.com/mrdoob,https://github.com/mrdoob/three.js,94000,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,pmndrs,react-three-fiber,https://github.com/pmndrs,https://github.com/pmndrs/react-three-fiber,23500,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
2,libgdx,libgdx,https://github.com/libgdx,https://github.com/libgdx/libgdx,21800,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
3,BabylonJS,Babylon.js,https://github.com/BabylonJS,https://github.com/BabylonJS/Babylon.js,21200,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
4,ssloy,tinyrenderer,https://github.com/ssloy,https://github.com/ssloy/tinyrenderer,17600,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
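
The `Topic_x` and `Topic_y` names in the merge step arise because both tables have a `Topic` column that is not the join key on both sides, so pandas appends its default suffixes. A small sketch on stand-in data (the frames `repos` and `topics_demo` and the column `slug` are illustrative, not the scraped tables):

```python
import pandas as pd

# Illustrative stand-ins for the repository table and the topic table.
repos = pd.DataFrame({
    "Topic": ["3d", "3d", "cpp"],
    "Repo_name": ["three.js", "libgdx", "spdlog"],
})
topics_demo = pd.DataFrame({"Topic": ["3D", "C++"], "slug": ["3d", "cpp"]})

# Join the URL-style topic name on the left to the slug on the right.
merged = repos.merge(topics_demo, left_on="Topic", right_on="slug")
print(list(merged.columns))

# Keep the display name from the topic table and drop the helper columns.
tidy = merged.drop(columns=["Topic_x", "slug"]).rename(columns={"Topic_y": "Topic"})
print(tidy)
```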


In [124]:
# saving into a csv
complete_dataframe.to_csv('complete_scraping.csv', index=False)