# Web Scraping - Top 30 repositories of featured topics on GitHub

## Scraping top 30 repositories for top 30 featured topics on GitHub


1. Scraping https://github.com/topics to collect a list of topics from GitHub.
2. Extracting the topic title, topic page URL and topic description for each topic.
3. Extracting the top 30 repositories for each topic from its topic page.
4. Collecting repository name, username, stars and repository URL for each repository.
5. Creating a separate CSV file for each topic in the following format:

   username,repo_name,stars,repo_url
   mrdoob,three.js,79400,https://github.com/mrdoob/three.js
   libgdx,libgdx,19700,https://github.com/libgdx/libgdx
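
As a quick illustration of how a file in this format can be consumed later, the sketch below reads the two sample rows back with the standard `csv` module and picks the most-starred repository. The names `sample_csv` and `load_repos` are hypothetical and exist only for this example.

```python
import csv
import io

# Two sample rows in the format shown above.
sample_csv = """username,repo_name,stars,repo_url
mrdoob,three.js,79400,https://github.com/mrdoob/three.js
libgdx,libgdx,19700,https://github.com/libgdx/libgdx
"""

def load_repos(csv_file):
    """Read rows from an open CSV file and convert the stars column to int."""
    repos = []
    for row in csv.DictReader(csv_file):
        row['stars'] = int(row['stars'])
        repos.append(row)
    return repos

repos = load_repos(io.StringIO(sample_csv))
top = max(repos, key=lambda r: r['stars'])
print(top['repo_name'], top['stars'])  # three.js 79400
```

For a real file, `load_repos(open('data/3D.csv', newline='', encoding='utf-8'))` would work the same way, assuming the file was written in this format.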
   

## Steps followed for scraping:
- Use the requests module to download the page for scraping.
- Use bs4 to parse and extract the data.
- Convert the collected data into a Pandas DataFrame and save it as a CSV file.

- Importing the required modules.

In [1]:
import requests
from bs4 import BeautifulSoup

### Creating a function for grabbing the topics page

In [2]:
def get_topic_page():
    topic_url="https://github.com/topics"
    response = requests.get(topic_url)
    # Raise an error if the response is not successful or the page fails to load.
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topic_url}')
    # Parse the HTML of the downloaded page.
    doc=BeautifulSoup(response.text,'html.parser')
    return doc

In [3]:
doc=get_topic_page()

In [4]:
'''Writing a function for collecting all the 'p' tags with
class "f3 lh-condensed mb-0 mt-1 Link--primary" to grab the topic titles.'''

def get_topic_titles(doc):
    title_class="f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags=doc.find_all('p',{'class':title_class})
    topic_titles=[]
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

In [5]:
titles=get_topic_titles(doc)

In [6]:
len(titles)

30

We have extracted the titles from the document.

In [7]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [8]:
'''Writing a function for collecting all the 'p' tags with
class "f5 color-fg-muted mb-0 mt-1" to grab the description for each topic.'''

def get_topic_desc(doc):
    desc_class="f5 color-fg-muted mb-0 mt-1"
    topic_desc_tags=doc.find_all('p',{'class':desc_class})
    topic_descriptions=[]
    for descriptions in topic_desc_tags:
        topic_descriptions.append(descriptions.text.strip())
    return topic_descriptions

In [9]:
descriptions=get_topic_desc(doc)

In [10]:
len(descriptions)

30

We have extracted the descriptions from the document.

In [11]:
descriptions[:3]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.']

In [12]:
'''Writing a function for collecting all the 'a' tags with
class "no-underline flex-1 d-flex flex-column" to grab the link for each topic page.'''

def get_topic_urls(doc):
    topic_link_class="no-underline flex-1 d-flex flex-column"
    topic_link_tags=doc.find_all('a',{'class':topic_link_class})
    topic_urls=[]
    base_url="https://github.com"
    for urls in topic_link_tags:
        topic_urls.append(base_url+urls['href'])
    return topic_urls

In [13]:
topic_urls=get_topic_urls(doc)

In [14]:
len(topic_urls)

30

In [15]:
topic_urls[:3]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm']
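
The three functions above build three independent lists, which only line up if every topic has a title, a description and a link. A minimal alternative sketch is shown below: it walks each topic link once and reads the title and description from inside it, using CSS selectors. The inline `sample_html` and the function name `extract_topics` are assumptions made for this example, as is the nesting of the two `p` tags inside the `a` tag; the real page markup may differ and can change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the class names used above.
sample_html = """
<div>
  <a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
    <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
    <p class="f5 color-fg-muted mb-0 mt-1">3D modeling description.</p>
  </a>
  <a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
    <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
  </a>
</div>
"""

def extract_topics(doc):
    """Return one dict per topic link, keeping title, description and URL together."""
    topics = []
    for link in doc.select('a.no-underline.d-flex.flex-column'):
        title_tag = link.select_one('p.f3')
        desc_tag = link.select_one('p.f5')
        topics.append({
            'title': title_tag.text.strip() if title_tag else None,
            'description': desc_tag.text.strip() if desc_tag else None,
            'url': 'https://github.com' + link['href'],
        })
    return topics

topics = extract_topics(BeautifulSoup(sample_html, 'html.parser'))
for topic in topics:
    print(topic)
```

Because each record is built from a single link, a topic with a missing description yields `None` for that field instead of shifting every later description by one position.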

### Creating a single function which will return the title, description and link for the topics as a pandas dataframe.

In [16]:
import pandas as pd

In [17]:
def scrape_topics():
    topics_url='https://github.com/topics'
    response=requests.get(topics_url)
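    if response.status_code != 200:
        raise Exception(f'Failed to load page {topics_url}')
    # Parse the page downloaded here instead of relying on the global `doc`.
    doc=BeautifulSoup(response.text,'html.parser')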
    topics_dict={
        'title':get_topic_titles(doc),
        'description':get_topic_desc(doc),
        'url':get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [18]:
topics_df=scrape_topics()

In [19]:
len(topics_df)

30

In [20]:
topics_df.head()

,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


## Creating a function to grab the topic page for each topic.

In [21]:
def get_topic_page(topic_url):
    # Note: this redefinition replaces the earlier get_topic_page(), which took no arguments.
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc
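
Since this function is called once per topic, a single temporary failure (for example a rate-limit response) stops the whole run. One possible way to soften that is a small retry wrapper, sketched below. The name `fetch_with_retry` and its parameters are hypothetical; `fetch` is any callable that takes a URL and returns an object with a `status_code` attribute, such as `requests.get`.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) until it returns status 200 or the attempts run out."""
    last_status = None
    for attempt in range(retries):
        response = fetch(url)
        if response.status_code == 200:
            return response
        last_status = response.status_code
        # Wait a little longer after each failed attempt, except the last one.
        if attempt < retries - 1:
            time.sleep(delay * (attempt + 1))
    raise Exception(f'Failed to load page {url} (last status {last_status})')
```

Inside `get_topic_page`, the download line could then read `response = fetch_with_retry(requests.get, topic_url)`, assuming `requests` is imported as in the first cell.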

In [22]:
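'''Writing a function for collecting the 'a' tags inside a repository's
'h1' tag and grabbing the information for that repository.'''
# base_url was only defined inside get_topic_urls, so define it at module level.
base_url = 'https://github.com'

def parse_star_count(stars_str):
    # Assumed helper (not defined in the notebook): '79.4k' -> 79400, '987' -> 987.
    if stars_str.lower().endswith('k'):
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str.replace(',', ''))

def get_repo_info(h1_tag, star_tag):
    # Returns all the required info about a repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url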

In [23]:
def get_topic_repos(topic_doc):
    # Get the h1 tags containing repo title, repo URL and username
    h1_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h1', {'class': h1_selection_class} )
    # Get star_tags
    star_tags = topic_doc.find_all('a', { 'class': 'social-count float-none'})
    ''' Creating a dictionary of the information of
        repositories for converting it into a dataframe.'''
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

In [24]:
import os

In [25]:
#Writing a function to save every dataframe as a csv.
def scrape_topic(topic_url,path):
    #Adding a condition to skip the file if it already exists.
    if os.path.exists(path):
        print(f"The file {path} already exists. Skipping...")
        return
    topic_df=get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)
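
One consequence of the existence check is that a file left half-written by an interrupted run would be skipped on the next run, and existing files are never refreshed. The sketch below shows one way to address both, using only the standard library: write to a temporary file first, rename it when the write is complete, and allow an explicit overwrite. The function name `save_rows` and the `overwrite` parameter are hypothetical and not part of the notebook.

```python
import csv
import os
import tempfile

def save_rows(rows, path, overwrite=False):
    """Write rows to a CSV at path; return False if the file exists and is kept."""
    if os.path.exists(path) and not overwrite:
        return False
    tmp_path = path + '.tmp'
    with open(tmp_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['username', 'repo_name', 'stars', 'repo_url'])
        writer.writerows(rows)
    # Rename only after the write has finished, so `path` is never half-written.
    os.replace(tmp_path, path)
    return True

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, '3D.csv')
    rows = [('mrdoob', 'three.js', 79400, 'https://github.com/mrdoob/three.js')]
    print(save_rows(rows, path))                  # True: file created
    print(save_rows(rows, path))                  # False: skipped
    print(save_rows(rows, path, overwrite=True))  # True: refreshed
```

With a DataFrame, the same idea would be `topic_df.to_csv(tmp_path, index=False)` followed by `os.replace(tmp_path, path)`.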

## Creating a single function which will create a separate CSV file for every topic.

In [26]:
def scrape_topics_repos():
    print("Scraping list of topics")
    topics_df=scrape_topics()
    os.makedirs('data', exist_ok=True)
    for index,row in topics_df.iterrows():
        print(f"Scraping top repositories for {row['title']}")
        scrape_topic(row['url'],"data/{}.csv".format(row['title']))
    print('\nDone!!!')
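
The file name above is built directly from the topic title, which works for titles such as `3D` or `ASP.NET` but would fail for a title containing a path separator. A small, hypothetical helper like `safe_filename` below could guard against that; the set of allowed characters is an assumption chosen for this sketch.

```python
import re

def safe_filename(title):
    """Replace characters outside letters, digits, '.', '_', '+', '#' and '-' with '_'."""
    cleaned = re.sub(r'[^A-Za-z0-9._+#-]+', '_', title.strip())
    return cleaned or 'untitled'

print(safe_filename('ASP.NET'))  # ASP.NET
print(safe_filename('CI/CD'))    # CI_CD
```

The path would then be built as `"data/{}.csv".format(safe_filename(row['title']))`.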

In [28]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for 3D
The file data/3D.csv already exists. Skipping...
Scraping top repositories for Ajax
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for Algorithm
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for Amp
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for Android
The file data/Android.csv already exists. Skipping...
Scraping top repositories for Angular
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for Ansible
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for API
The file data/API.csv already exists. Skipping...
Scraping top repositories for Arduino
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for ASP.NET
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for Atom
The file data/Atom.csv already exists. Skipping...