# git-scraper

# Problem Statement
Scrape github.com for topics:

1. Get a list of topics. For each topic, get the topic title, topic page URL and topic description.
2. For each topic, get the top 25 repositories from the topic page.
3. For each repository, grab the repo name, username, stars and repo URL.
4. For each topic, create a CSV file in the following format:
```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx 
```
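
The target format can be produced with the standard `csv` module. The following is a minimal sketch, not the approach used later in this notebook (which relies on pandas); the helper name `repos_to_csv` and the sample rows are assumptions for illustration.

```python
import csv
import io

# Hypothetical sample rows; the values mirror the example above.
repos = [
    {"Repo Name": "three.js", "Username": "mrdoob", "Stars": 69700,
     "Repo URL": "https://github.com/mrdoob/three.js"},
    {"Repo Name": "libgdx", "Username": "libgdx", "Stars": 18300,
     "Repo URL": "https://github.com/libgdx/libgdx"},
]

def repos_to_csv(rows):
    """Serialize repository rows to CSV text with the header shown above."""
    fieldnames = ["Repo Name", "Username", "Stars", "Repo URL"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

print(repos_to_csv(repos))
```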

# Using requests to download web pages

In [2]:
!pip install requests --upgrade

Collecting requests
  Downloading requests-2.25.1-py2.py3-none-any.whl (61 kB)
     |████████████████████████████████| 61 kB 3.2 MB/s
Installing collected packages: requests
  Attempting uninstall: requests
    Found existing installation: requests 2.24.0
    Uninstalling requests-2.24.0:
      Successfully uninstalled requests-2.24.0
Successfully installed requests-2.25.1


In [3]:
import requests

In [75]:
base_url = "https://github.com"
topics_url = 'https://github.com/topics'

In [5]:
response = requests.get(topics_url)

In [6]:
response.status_code

200
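
A status code of 200 means the page was downloaded successfully. The check can be wrapped in a small helper, sketched below. The name `download_page` and the `get` parameter are assumptions: `get` stands for any callable that behaves like `requests.get`, returning an object with `status_code` and `text` attributes.

```python
def download_page(url, get, timeout=10):
    """Fetch `url` with the callable `get` and return the page text.

    `get` is expected to behave like `requests.get`: it accepts a URL and
    a `timeout` keyword and returns an object with `status_code` and
    `text` attributes.
    """
    response = get(url, timeout=timeout)
    # Anything other than 200 is treated as a failure in this sketch
    if response.status_code != 200:
        raise RuntimeError(
            "Failed to load page {} (status {})".format(url, response.status_code)
        )
    return response.text
```

With the names used in this notebook, a call would look like `page_contents = download_page(topics_url, requests.get)`.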

In [7]:
page_contents = response.text
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-7KjiGvJiLLy6LJPGf3m67ejAdgQsgDdnxZYoaI6+Agd0ZxHKTCjoKZgaf3PgUjURCcVceAwySJJJWgitRskDiA==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-eca8e21af2622cbcba2c93c67f79baed.css" />\n    <link crossorigin="anonymous" media="all" integrity="sha512-dDsAoT3mMaA8gyLZkshXL3vrnDAuIv4cNq2iN06+o44rOFIngYNNiTehUUzNuMoBXMaDg0MLhEaZNumoCiLJkw==" rel="stylesheet" href="https://github.githubassets.com/assets/behaviors-743b00a13de631a03c8322d992c8572f.css" />\n    \n    \n    \n    <link crossorigin="anonymous" media="all" integrity="sha512-Rzg

In [8]:
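# Write the page as UTF-8 so non-ASCII characters do not depend on the platform default
with open("webpage.html", 'w', encoding='utf-8') as f:
    f.write(page_contents)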

# Using Beautiful Soup to parse and extract information

In [10]:
!pip install beautifulsoup4 --upgrade --quiet

In [11]:
from bs4 import BeautifulSoup

In [13]:
doc = BeautifulSoup(page_contents,'html.parser')

In [14]:
type(doc)

bs4.BeautifulSoup

In [22]:
selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
topic_title_tags = doc.find_all('p',class_=selection_class)

In [41]:
topic_titles = [tag.text for tag in topic_title_tags]
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']
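
To make the idea of selecting tags by their `class` attribute concrete, here is a sketch that does the same kind of extraction with the standard library's `html.parser` instead of Beautiful Soup. The class `ClassTextCollector` and the HTML snippet are hypothetical; the snippet only imitates the structure of the topics page.

```python
from html.parser import HTMLParser

class ClassTextCollector(HTMLParser):
    """Collect the text of every `tag` whose class attribute equals `class_name`."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.results = []
        self._buffer = None  # text pieces gathered while inside a matching tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and dict(attrs).get("class") == self.class_name:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._buffer is not None:
            self.results.append("".join(self._buffer).strip())
            self._buffer = None

# Hypothetical snippet imitating the structure of the topics page
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-text-secondary mb-0 mt-1">
  3D modeling is the process of virtually developing the surface of a 3D object.
</p>
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
"""

collector = ClassTextCollector("p", "f3 lh-condensed mb-0 mt-1 Link--primary")
collector.feed(sample_html)
print(collector.results)
```

Only the two `p` tags carrying the title class are collected; the description paragraph has a different class string and is skipped.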


In [120]:
topic_desc_class = "f5 color-text-secondary mb-0 mt-1"
topic_desc_tags = doc.find_all('p',{'class':topic_desc_class})

topic_descriptions = [tag.text.strip() for tag in topic_desc_tags]


print(topic_descriptions[:5])

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency framework for PHP.', 'Android is an operating system built by Google designed for mobile devices.']


![.](https://imgur.com/VtXeMhv.png)

In [119]:
topic_link_tags = doc.find_all('a',{'class':'d-flex no-underline'})

topic_urls = [base_url + tag['href'] for tag in topic_link_tags]

print(topic_urls[:5])

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android']
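
The URLs above are built by string concatenation, which works because every `href` on the page is a path starting with `/`. An alternative sketch uses `urllib.parse.urljoin`, which also copes with links that are already absolute. The `hrefs` list is made up for illustration.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# Hypothetical href values like those found on the topics page
hrefs = ["/topics/3d", "/topics/ajax", "https://github.com/topics/algorithm"]

# urljoin resolves relative paths against the base and leaves absolute URLs alone
topic_urls = [urljoin(base_url, href) for href in hrefs]
print(topic_urls)
```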


In [47]:
!pip install pandas 



In [48]:
import pandas as pd

In [52]:
topics_dict = {'title':topic_titles,
               'description':topic_descriptions,
               'url':topic_urls
}
topics_df = pd.DataFrame(topics_dict)

In [53]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


# Creating a CSV file with the information

In [56]:
topics_df.to_csv('topics.csv', index=False)

# Getting information from topic pages

In [57]:
topic_page_url = topic_urls[0] 

In [58]:
response = requests.get(topic_page_url)

In [59]:
response.status_code

200

In [60]:
len(response.text)

580471

In [61]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [65]:
repo_tags = topic_doc.find_all("h1",{'class':'f3 color-text-secondary text-normal lh-condensed'})

In [99]:
len(repo_tags)

30

In [77]:
a_tags = repo_tags[0].find_all('a')
repo_url = base_url + a_tags[1]['href']
repo_url

'https://github.com/mrdoob/three.js'

In [79]:
star_tags = topic_doc.find_all('a',{'class':'social-count float-none'})

In [80]:
len(star_tags)

30
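
Both lists contain 30 tags, so the repository at position `i` can be matched with the star count at position `i`. If the page layout changed and the counts differed, pairing by index would silently mismatch or fail. A small guard, sketched here under the hypothetical name `pair_tags`, makes that assumption explicit.

```python
def pair_tags(repo_tags, star_tags):
    """Pair each repository tag with its star tag, refusing mismatched lists."""
    if len(repo_tags) != len(star_tags):
        raise ValueError(
            "Found {} repo tags but {} star tags".format(len(repo_tags), len(star_tags))
        )
    return list(zip(repo_tags, star_tags))

# Plain strings stand in for the parsed tags in this example
print(pair_tags(["three.js", "libgdx"], ["69.9k", "18.3k"]))
```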

In [95]:
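def parse_star_count(stars_str):
    # Convert strings such as '69.9k' or '830' to an integer
    stars_str = stars_str.strip()
    if stars_str[-1].isalpha():
        # A trailing letter is assumed to be 'k' (thousands);
        # round() guards against floating-point error before truncating
        return int(round(float(stars_str[:-1]) * 1000))
    return int(float(stars_str))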

In [97]:
parse_star_count(star_tags[0].text)

69900
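
The function above treats any trailing letter as "thousands". A slightly more general sketch is shown below; it uses `Decimal` to avoid floating-point rounding and also accepts thousands separators. The name `parse_abbreviated_count` is hypothetical, and the `m` suffix is an assumption: only `k` appears in the output scraped here.

```python
from decimal import Decimal

# Assumed suffixes: "k" is seen on the scraped pages, "m" is a guess
_MULTIPLIERS = {"k": 1_000, "m": 1_000_000}

def parse_abbreviated_count(count_str):
    """Convert strings such as '69.9k' or '1,234' to an integer."""
    text = count_str.strip().lower().replace(",", "")
    multiplier = 1
    if text and text[-1] in _MULTIPLIERS:
        multiplier = _MULTIPLIERS[text[-1]]
        text = text[:-1]
    # Decimal keeps '69.9' exact, so the product is exactly 69900
    return int(Decimal(text) * multiplier)

print(parse_abbreviated_count(" 69.9k "))
print(parse_abbreviated_count("1,234"))
```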

In [111]:
def get_repo_info(h1_tag, star_tag):
    # The h1 tag holds two links: the first is the user, the second the repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [112]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 69900, 'https://github.com/mrdoob/three.js')

In [106]:
topic_repos_dict = {
    'username':[],
    'repo_name':[],
    'stars':[],
    'repo_url':[]
}
# Walk the repository tags and star tags in parallel
for repo_tag, star_tag in zip(repo_tags, star_tags):
    username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
    topic_repos_dict['username'].append(username)
    topic_repos_dict['repo_name'].append(repo_name)
    topic_repos_dict['stars'].append(stars)
    topic_repos_dict['repo_url'].append(repo_url)

In [108]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [109]:
topic_repos_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 69900 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 18300 | https://github.com/libgdx/libgdx |
| 2 | BabylonJS | Babylon.js | 13900 | https://github.com/BabylonJS/Babylon.js |
| 3 | pmndrs | react-three-fiber | 13000 | https://github.com/pmndrs/react-three-fiber |
| 4 | aframevr | aframe | 12700 | https://github.com/aframevr/aframe |
| 5 | ssloy | tinyrenderer | 10500 | https://github.com/ssloy/tinyrenderer |
| 6 | FreeCAD | FreeCAD | 9200 | https://github.com/FreeCAD/FreeCAD |
| 7 | metafizzy | zdog | 8400 | https://github.com/metafizzy/zdog |
| 8 | lettier | 3d-game-shaders-for-beginners | 8400 | https://github.com/lettier/3d-game-shaders-for... |
| 9 | CesiumGS | cesium | 6900 | https://github.com/CesiumGS/cesium |


# Final version

In [135]:
import os 

def get_topic_page(topic_url):
    
    # Downloading the page 
    response = requests.get(topic_url)
    # Checking response status
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topic_url))
    # Parsing using Beautiful Soup
    topic_doc = BeautifulSoup(response.text,'html.parser') 
    return topic_doc

def get_repo_info(h1_tag,star_tag):
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username,repo_name,stars,repo_url

def get_topic_repos(topic_url):
    topic_doc = get_topic_page(topic_url)
    # The h1 tags hold the username and repo name; the a tags hold the star counts
    repo_tags = topic_doc.find_all("h1",{'class':'f3 color-text-secondary text-normal lh-condensed'})
    star_tags = topic_doc.find_all('a',{'class':'social-count float-none'})
    
    topic_repos_dict = {
        'username':[],
        'repo_name':[],
        'stars':[],
        'repo_url':[] 
    }
    # Get repo's info
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)
    
    return pd.DataFrame(topic_repos_dict)
    
def scrape_topic(topic_url, topic_name, path):
    fname = topic_name + '.csv'
    # Skip topics whose CSV file was already written by an earlier run
    if os.path.exists(path):
        print("The file {} already exists.".format(fname))
        return
    topic_df = get_topic_repos(topic_url)
    topic_df.to_csv(path, index=False)
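
`scrape_topic` skips a topic when its output file is already present, so re-running the scraper does not download the same pages again. The pattern can be tried in isolation with a temporary directory. This is a sketch only: `write_once` is a hypothetical helper, not part of the notebook.

```python
import os
import tempfile

def write_once(path, text):
    """Write `text` to `path` unless the file already exists.

    Returns True if the file was written and False if it was skipped.
    """
    if os.path.exists(path):
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return True

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "3D.csv")
    first = write_once(csv_path, "username,repo_name,stars,repo_url\n")
    second = write_once(csv_path, "this would overwrite the file\n")
    print(first, second)
```

The first call creates the file and the second one leaves it untouched.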

In [118]:
get_topic_repos(topic_urls[6]).to_csv('ansible.csv', index=False)

Actions:

1. Get the list of topics from the topics page
2. Get the list of top repos from each topic page
3. Create a CSV file of the top repos for each topic
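
The three actions above can be sketched as a loop over topics. All names in this block (`run_pipeline` and the `fake_*` functions) are hypothetical stand-ins with made-up data, so the example runs without network access.

```python
def run_pipeline(get_topics, get_repos, save_csv):
    """Run the three actions in order using caller-supplied functions.

    get_topics() returns a list of (title, url) pairs,
    get_repos(url) returns a list of repository rows,
    save_csv(title, rows) returns a description of the file it wrote.
    """
    written = []
    for title, url in get_topics():             # 1. list of topics
        rows = get_repos(url)                   # 2. top repos for the topic
        written.append(save_csv(title, rows))   # 3. one CSV file per topic
    return written

# Stand-in functions with made-up data
def fake_topics():
    return [("3D", "https://github.com/topics/3d")]

def fake_repos(url):
    return [("mrdoob", "three.js", 69900, "https://github.com/mrdoob/three.js")]

def fake_save(title, rows):
    return "{}.csv ({} rows)".format(title, len(rows))

print(run_pipeline(fake_topics, fake_repos, fake_save))
```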

In [125]:
def get_topic_details(doc):
    
    selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = doc.find_all('p',class_=selection_class)
    topic_desc_class = "f5 color-text-secondary mb-0 mt-1"
    topic_desc_tags = doc.find_all('p',{'class':topic_desc_class})
    topic_link_tags = doc.find_all('a',{'class':'d-flex no-underline'})
    
    topic_descriptions = [tag.text.strip() for tag in topic_desc_tags]
    topic_urls = [base_url + tag['href'] for tag in topic_link_tags]
    topic_titles = [tag.text for tag in topic_title_tags]
    
    return topic_titles, topic_descriptions, topic_urls

def scrape_topics():
    response = requests.get('https://github.com/topics')
    if response.status_code != 200:
        raise Exception("Failed to load the topics page")
    doc = BeautifulSoup(response.text,'html.parser')
    topic_titles, topic_descriptions, topic_urls = get_topic_details(doc)
    topics_dict = {
        'title':topic_titles,
        'description':topic_descriptions,
        'url':topic_urls
    }
    return pd.DataFrame(topics_dict)

In [126]:
scrape_topics()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [140]:
def scrape_topics_repos():
    print('Scraping the list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data',exist_ok=True)
    for index, row in topics_df.iterrows():
        print("Scraping top repositories for {}".format(row['title']))
        scrape_topic(row['url'], row['title'], os.path.join('data', row['title'] + '.csv'))
    

In [141]:
scrape_topics_repos()

Scraping the list of topics
Scraping top repositories for 3D
The file 3D.csv already exists.
Scraping top repositories for Ajax
The file Ajax.csv already exists.
Scraping top repositories for Algorithm
The file Algorithm.csv already exists.
Scraping top repositories for Amp
The file Amp.csv already exists.
Scraping top repositories for Android
The file Android.csv already exists.
Scraping top repositories for Angular
The file Angular.csv already exists.
Scraping top repositories for Ansible
The file Ansible.csv already exists.
Scraping top repositories for API
The file API.csv already exists.
Scraping top repositories for Arduino
The file Arduino.csv already exists.
Scraping top repositories for ASP.NET
The file ASP.NET.csv already exists.
Scraping top repositories for Atom
The file Atom.csv already exists.
Scraping top repositories for Awesome Lists
The file Awesome Lists.csv already exists.
Scraping top repositories for Amazon Web Services
The file Amazon Web Services.csv already exi
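
The output above shows that topic titles are used directly as file names, including titles with spaces such as "Awesome Lists". If that is not wanted, the title can be normalized first. The helper below is a sketch under the hypothetical name `topic_file_name`; which characters count as safe is an assumption.

```python
import re

def topic_file_name(title):
    """Build a CSV file name from a topic title, replacing unsafe characters."""
    # Keep letters, digits, dots, plus signs and hyphens; replace any other run with "_"
    safe = re.sub(r"[^A-Za-z0-9.+-]+", "_", title.strip())
    return safe + ".csv"

print(topic_file_name("Awesome Lists"))
print(topic_file_name("ASP.NET"))
print(topic_file_name("C++"))
```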

In [142]:
import jovian

In [143]:
# Execute this to save new versions of the notebook
jovian.commit(project="git-scraper")

[jovian] Attempting to save notebook..
[jovian] Updating notebook "cherau/git-scraper" on https://jovian.ai
[jovian] Uploading notebook..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/cherau/git-scraper


'https://jovian.ai/cherau/git-scraper'