# Scraping-Repositories-of-Top-GitHub-Topics

`Web scraping` is the process of gathering information from websites in an automated fashion with the help of a computer program and presenting it in a meaningful way. It's a useful technique for creating datasets for research and learning.

For this project, we will scrape the repositories of the top topics available on `GitHub`. `GitHub` is a platform that lets us host our code in the cloud for collaboration and version control. It lets people work together on a project, and it hosts both public and private repositories.

To do this, we will write code in Python using the libraries `requests`, `bs4` and `pandas`, and then save the generated output in a `csv` file for each topic.

## Setting up the environment by installing the required Python modules
 - `Requests` allows us to interact with websites and download pages
 - `Beautiful Soup` allows us to `parse` HTML documents
 - `Pandas` allows us to create a `DataFrame` and store information

In [4]:
!pip install requests --upgrade --quiet
!pip install pandas --upgrade --quiet
!pip install bs4 --upgrade --quiet

## Import the packages:

In [5]:
import requests                          # use requests to download the page
from bs4 import BeautifulSoup            # use BeautifulSoup to parse the page
import os                                # use os to make a directory to store downloaded files
import pandas as pd                      # use Pandas to create DataFrame

### Read the topics page from the site `https://github.com/topics`

![Imgur](https://imgur.com/Uzw1ClS.jpg)

## Action items

 - get the data from the website (title, description and url)
 - convert the returned web page into a Beautiful Soup object
 - build a Python dictionary containing the titles, descriptions and urls
 - create a DataFrame from the Python dictionary
 
After the web scraping process, the resulting dataset will be saved in a `.csv` file.

In [21]:
'''
function get_topics_page
downloads the topics page from the website
and converts it into a BeautifulSoup document
'''

# get topics page
def get_topics_page():
    topics_url = 'https://github.com/topics'
    page_response = requests.get(topics_url)
    if page_response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(page_response.text, 'html.parser')
    return doc

'''
function get_topic_titles
gets the topic titles, which are available in `p` tags on the website,
and returns them as a list
params:
    doc = BeautifulSoup document of the topics page
'''
# get topic titles
def get_topic_titles(doc):
    title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    title_topic_tag = doc.find_all('p', {'class': title_class})
    topic_titles = []                                   # storing titles in a list
    for tag in title_topic_tag:
        topic_titles.append(tag.text)
    return topic_titles

'''
function get_topic_descs
gets the topic descriptions, which are also available in `p` tags on the website,
and returns them as a list
params:
    doc = BeautifulSoup document of the topics page
'''
# get topic descriptions
def get_topic_descs(doc):
    desc_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tag = doc.find_all('p', {'class': desc_class})
    topic_descs = []
    for desc in topic_desc_tag:
        topic_descs.append(desc.text.strip())
    return topic_descs

'''
function get_topic_urls
gets the topic urls, which are available in `a` tags on the website,
and returns them as a list
params:
    doc = BeautifulSoup document of the topics page
'''

# get topic urls
def get_topic_urls(doc):
    topic_url_class = 'no-underline flex-1 d-flex flex-column'
    topic_url_tags = doc.find_all('a', {'class': topic_url_class})
    topic_urls = []
    for tag in topic_url_tags:
        # hrefs on the page are relative, so prepend the base url
        topic_urls.append('https://github.com' + tag['href'])
    return topic_urls

def scrape_topics():
    # download and parse the topics page, then collect the three columns
    doc = get_topics_page()
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)
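
The functions above pass the full `class` string to `find_all`. As a hedged illustration (the HTML snippet below is made up and is not GitHub's actual markup), Beautiful Soup treats a multi-word class string as an exact match on the attribute value, so the same classes in a different order are not found. A CSS selector passed to `select` matches each class independently:

```python
from bs4 import BeautifulSoup

# Made-up markup: two paragraphs carry the same classes in different order
html = '''
<div>
  <p class="f3 lh-condensed">3D</p>
  <p class="lh-condensed f3">Ajax</p>
  <p class="f5">not a title</p>
</div>
'''
doc = BeautifulSoup(html, 'html.parser')

# Exact string match on the class attribute: order matters
exact = [p.text for p in doc.find_all('p', {'class': 'f3 lh-condensed'})]

# CSS selector: each class is matched on its own, in any order
by_selector = [p.text for p in doc.select('p.f3.lh-condensed')]

print(exact)
print(by_selector)
```

If a page changes the order of its classes or adds a new one, the exact-match form returns an empty list, so the selector form may be more tolerant of small markup changes.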

In [22]:
'''
view the data using Pandas
'''
topics_df = scrape_topics()
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |
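
The DataFrame above is built from three lists that are scraped separately. `pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one of the class names stops matching. A minimal sketch of a check that could run before building the DataFrame; `check_equal_lengths` is a hypothetical helper and the sample data is made up:

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

# Made-up sample standing in for the scraped lists
sample = {
    'title': ['3D', 'Ajax'],
    'description': ['3D modeling ...', 'Ajax is ...'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
lengths = check_equal_lengths(sample)
print(lengths)
```

The error message lists the length of every column, which points directly at the selector that returned too few results.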


### Saving the DataFrame into a `csv` file

In [23]:
# leave out the row numbers by passing `index=None`

topics_df.to_csv('topics.csv', index=None)
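
To illustrate why the index is left out, here is a small sketch with made-up rows, written to an in-memory buffer instead of a file. When the row numbers are written, reading the CSV back produces an extra column that pandas names `Unnamed: 0`. The sketch uses `index=False`, the boolean value documented for this parameter:

```python
import io
import pandas as pd

df = pd.DataFrame({
    'title': ['3D', 'Ajax'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

# Default behaviour: the row numbers are written as the first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: the row numbers are left out
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```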

## Get the top 25 repositories from each topic page

### Action items

 - We will create a function to get the list of topic urls
 - We will convert them into a Pandas DataFrame
 - Then we will create a CSV file for each topic's repositories

In [31]:
'''
function get_topic_page(topic_url)
downloads the page for a topic url taken from the scraped data
and converts it into a BeautifulSoup document
'''
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc
    
'''
function get_repo_info()
gets repository information: username, repository name, stars and repo url
'''

def get_repo_info(h3_tag, star_tag):
    # The h3 tag contains two a tags: the first holds the username, the second the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    base_url = 'https://github.com'
    repo_url = base_url + a_tags[1]['href']
    # Convert the star text to a number with the parse_star_count helper
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


'''
function get_topic_repos()
gets the repo tags and star tags from the BeautifulSoup object `topic_doc`, saves the extracted values into a dictionary
and then converts the dictionary into a DataFrame
'''
def get_topic_repos(topic_doc):
    # Get the h3 tags that contain the repo title, repo url and username
    repo_tags = topic_doc.find_all('h3', {'class': 'f3 color-fg-muted text-normal lh-condensed'})
    # Get the star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    # Collect all repo information in a dictionary
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
        }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)        # Convert the dictionary into a DataFrame


# We have built a helper function that creates a DataFrame by chaining `get_topic_page()` and `get_topic_repos()`.
# get_topic_repos(get_topic_page(topic_url)) means: from `topic_url` we get the page with `get_topic_page`,
# and from the page we get the repositories with `get_topic_repos`

def scrape_topic(topic_url, topic_name):
    topic_df = get_topic_repos(get_topic_page(topic_url))
    # saving into the csv file
    topic_df.to_csv(topic_name + '.csv', index=None)
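
`get_repo_info` above calls `parse_star_count`, which is not defined anywhere in this notebook. A minimal sketch of what such a helper might look like, assuming star counts are displayed either as plain integers (`'987'`) or abbreviated with a `k` suffix (`'12.3k'`). The display format is an assumption, not something the notebook confirms:

```python
def parse_star_count(stars_str):
    """Convert star text such as '987' or '12.3k' to an integer.

    Assumes the only abbreviation used is a trailing 'k' for thousands.
    """
    stars_str = stars_str.strip().replace(',', '')
    if stars_str[-1:].lower() == 'k':
        # round() guards against floating point results like 4299.999...
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str)

print(parse_star_count('12.3k'))
print(parse_star_count('987'))
```

A helper like this needs to be defined before `get_topic_repos` is called, otherwise the call raises a `NameError`.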

In [32]:
'''
redefine scrape_topic to skip already downloaded files
'''

def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)
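
The skip logic can be tried without any network access. A small sketch, assuming a hypothetical `save_once` helper and a stand-in `fetch` function in place of the real scraping call:

```python
import os
import tempfile

def save_once(path, fetch):
    """Write fetch() to path unless the file already exists.

    Returns True if the file was written, False if it was skipped.
    """
    if os.path.exists(path):
        return False
    with open(path, 'w') as f:
        f.write(fetch())
    return True

calls = []

def fetch():
    # Stand-in for downloading and parsing a topic page
    calls.append(1)
    return 'username,repo_name,stars,repo_url\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, '3D.csv')
    first = save_once(path, fetch)    # file does not exist yet, so it is written
    second = save_once(path, fetch)   # file exists, so fetch is not called again

print(first, second, len(calls))
```

Checking for the file before downloading means an interrupted run can be restarted without requesting pages that were already saved.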

In [33]:
'''
saving all repositories
'''
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
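
Topic titles are used directly as file names here. A title containing a path separator such as `/` would be interpreted as a subdirectory of `data`. A minimal sketch of a hypothetical `safe_filename` helper that replaces such characters; the title `'C/C++'` is a made-up example:

```python
import re

def safe_filename(title):
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|]+', '_', title).strip()
    return cleaned or 'untitled'

path = 'data/{}.csv'.format(safe_filename('C/C++'))
print(path)
print(safe_filename('ASP.NET'))
```

Titles without problematic characters, such as `ASP.NET`, pass through unchanged.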

In [34]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin

## Summary

In this project we collected information about the top topics and their repositories available on GitHub from https://github.com/topics.

We took the following steps using the Python libraries requests, BeautifulSoup, and pandas.
 - scraped the website, gathering topic names, descriptions, and topic links
 - for each topic we
     - opened the topic page by following its topic link
     - used the requests library to download the page
     - used the Beautiful Soup library to parse the data from the returned web page (user_name, repo_name, stars)
 - created a DataFrame for each dataset and finally saved it as a .csv file named after the topic.
     

## Future work
 - Create a dataset of popular books in different genres
 - For other web scraping projects, use Selenium to scrape websites with dynamically changing data

## References

[Information about web scraping](https://en.wikipedia.org/wiki/Web_scraping)

[Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

[Pandas documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)
