# Scraping Top Repositories for GitHub Topics
- Scraping the top repositories for the 30 topics listed on the first page of GitHub's topics page
- Tools and Libraries used:
  * Python
  * Pandas
  * Requests
  * Beautiful Soup

## Steps:
* We are going to scrape https://github.com/topics.
* We will get a list of 30 topics, and for each topic we will get its title, URL and description.
* For each topic we will get the top 20 repositories.
* For each repository we will get the repo name, its owner, its URL and the number of stars.
* For each topic we will create a CSV file and store the above data.
* Example of the general CSV format (the files produced here use the columns username, repo_name, url and stars instead):
                Year,Make,Model
                1997,Ford,E350
                2000,Mercury,Cougar
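
As a rough sketch of what one per-topic file could look like, the snippet below builds a one-row data frame with the column names used later in this notebook and renders it as CSV text. The row values are placeholders invented for illustration, not scraped data.

```python
import pandas as pd

# Placeholder row (hypothetical values, not real scraped data)
sample_df = pd.DataFrame({
    'username': ['example-owner'],
    'repo_name': ['example-repo'],
    'url': ['https://github.com/example-owner/example-repo'],
    'stars': ['1.2k'],
})

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = sample_df.to_csv(index=False)
print(csv_text)
```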

## Importing necessary libraries

In [1]:
import pandas as pd
import requests as req
from bs4 import BeautifulSoup
import os

## Scraping the list of topics from Github
- Use requests to get the web page
- Use Beautiful Soup to parse and extract information from the HTML
- Convert it into a pandas DataFrame

### Using requests library to get the web page

In [2]:
topic_url = 'https://github.com/topics'
res = req.get(topic_url)
page_content=res.text
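
The request above neither sets a timeout nor checks whether it succeeded. A minimal sketch of a more defensive fetch follows. The helper name `fetch_html`, the 10-second timeout and the injectable `get` parameter are assumptions made for illustration and are not part of the original notebook.

```python
import requests

def fetch_html(url, timeout=10, get=requests.get):
    """Return the body of `url` as text, raising on HTTP error statuses."""
    # A timeout keeps the call from hanging indefinitely on a stalled connection
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text

# Usage (performs a real network request):
# page_content = fetch_html('https://github.com/topics')
```

The `get` parameter exists only so the function can be exercised without network access, for example by passing a stub that returns a canned response.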

### Using Beautiful Soup to parse the page and get the tags containing the required information

In [3]:
parsed_page=BeautifulSoup(page_content,'html.parser')

topic_title_tags=parsed_page.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})

title_desc_tags=parsed_page.find_all('p',{'class':'f5 color-fg-muted mb-0 mt-1'})

title_links_tags=parsed_page.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})
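
These lookups depend on GitHub's current class names, and a multi-class string passed as the `class` value matches only that exact string. As a hedged alternative, a CSS selector matches elements that carry the listed classes in any order. The HTML fragment below is a made-up stand-in for the real page, and the choice of `f3` and `Link--primary` as the identifying classes is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up fragment: same classes as the title tags above, in a different order
html = '''
<p class="Link--primary f3 lh-condensed mb-0 mt-1">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">Some description</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# Exact-string match: finds nothing here because the class order differs
exact = soup.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: matches any <p> carrying both classes, regardless of order
flexible = soup.select('p.f3.Link--primary')

print(len(exact), [tag.text for tag in flexible])
```

Either approach still breaks if GitHub renames the classes, so the selectors may need updating over time.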

### Converting the information in the tags to lists

In [4]:
topic_links=[]
for link in title_links_tags:
    topic_links.append("https://github.com"+link['href'])

topic_text=[]
for txt in topic_title_tags:
    topic_text.append(txt.text)

topic_desc=[]
for desc in title_desc_tags:
    topic_desc.append(desc.text.strip())
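
Because the three lists are collected independently, they only line up row by row if each selector matched the same number of tags. A small check like the following could catch a mismatch before the data frame is built. `check_equal_lengths` is a hypothetical helper, not part of the original notebook, and the sample lists are placeholders.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named lists do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

# Placeholder lists standing in for topic_text, topic_desc and topic_links
lengths = check_equal_lengths(
    title=['3D', 'Ajax'],
    description=['first description', 'second description'],
    url=['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(lengths)
```

In the notebook this could be called as `check_equal_lengths(title=topic_text, description=topic_desc, url=topic_links)`. `pd.DataFrame` also raises a `ValueError` for lists of unequal length, but an explicit check reports which column came up short.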

### Converting the lists to a data frame

In [5]:
topic_dict={
    'title':topic_text,
    'description':topic_desc,
    'url':topic_links
}

topics_df = pd.DataFrame(topic_dict)

topics_df.head()

|   | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


## Scraping the top 20 repositories for each topic
Defining functions for:
   - Getting the web page through requests and parsing it with Beautiful Soup
   - Getting the necessary information from the page
   - Converting the information to a data frame
   - Storing each data frame as a CSV file in a separate folder using the os module

### Getting the web page and parsing it

In [6]:
def get_topic_repo(url):
    response=req.get(url)
    if response.status_code != 200:
        raise Exception('Error getting the page {} (status code {})'.format(url, response.status_code))
    web_page=BeautifulSoup(response.text,'html.parser')
    repo_headings=web_page.find_all('h3',{'class':'f3 color-fg-muted text-normal lh-condensed'})
    stars_tags=web_page.find_all('span',{'id':'repo-stars-counter-star'})
    return repo_headings,stars_tags

### Getting repository information

In [7]:
def get_repo_info(h3_tag, star_tag):
    # Each repo heading holds two links: the owner first, then the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = "https://github.com" + a_tags[1]['href']
    # Star count as displayed on the page (kept as text)
    stars = star_tag.text.strip()
    return username, repo_name, repo_url, stars
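
The star count is kept as the text shown on the page. If numeric sorting or filtering is needed later, the text could be converted to an integer. The sketch below assumes GitHub abbreviates large counts with a `k` suffix (for example `68.5k`). That format, and the helper name `parse_star_count`, are assumptions rather than something the notebook confirms.

```python
def parse_star_count(text):
    """Convert a displayed star count such as '68.5k' or '1,234' to an int."""
    cleaned = text.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if cleaned and cleaned[-1] in multipliers:
        # round() guards against floating-point error before converting to int
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)

print(parse_star_count('68.5k'))
print(parse_star_count('987'))
```

Abbreviated counts are rounded on the page, so the converted value is only approximate for large repositories.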

### Converting the information into a DataFrame

In [8]:
def get_dataframe(url):
    topic_repo_dict = {
    'username':[],
    'repo_name':[],
    'url':[],
    'stars':[]
    }
    repo_headings,stars_tags=get_topic_repo(url)
    # zip stops at the shorter list, so a missing star counter cannot raise an IndexError
    for heading, star_tag in zip(repo_headings, stars_tags):
        username, repo_name, repo_url, stars = get_repo_info(heading, star_tag)
        topic_repo_dict['username'].append(username)
        topic_repo_dict['repo_name'].append(repo_name)
        topic_repo_dict['url'].append(repo_url)
        topic_repo_dict['stars'].append(stars)

    return pd.DataFrame(topic_repo_dict)

### Saving all the dataframes as CSVs in a separate folder

In [9]:
def scrape_topic_repos():
    os.makedirs('topic-wise_datasets',exist_ok=True)
    for index,row in topics_df.iterrows():
        path='topic-wise_datasets/'+row['title']+'.csv'
        if os.path.exists(path):
            print("File {} already exists. Skipping....".format(row['title']))
            continue
        print('Scraping the repo "{}"...'.format(row['title']))
        df=get_dataframe(row['url'])
        df.to_csv(path, index=False)
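
The file path is built directly from the topic title, so a title containing a path separator or another character that is invalid in file names would break the write. A possible safeguard is sketched below. The helper `safe_filename` and its character set are assumptions, and the title `CI/CD` is a hypothetical example rather than one of the scraped topics.

```python
import os
import re

def safe_filename(title):
    """Replace characters that are unsafe in file names with underscores."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

# Build a path in the same folder the notebook uses
path = os.path.join('topic-wise_datasets', safe_filename('CI/CD') + '.csv')
print(path)
```

Titles such as `C++` or `ASP.NET` pass through unchanged, so existing file names would stay the same.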

In [10]:
scrape_topic_repos()

Scraping the repo "3D"...
Scraping the repo "Ajax"...
Scraping the repo "Algorithm"...
Scraping the repo "Amp"...
Scraping the repo "Android"...
Scraping the repo "Angular"...
Scraping the repo "Ansible"...
Scraping the repo "API"...
Scraping the repo "Arduino"...
Scraping the repo "ASP.NET"...
Scraping the repo "Atom"...
Scraping the repo "Awesome Lists"...
Scraping the repo "Amazon Web Services"...
Scraping the repo "Azure"...
Scraping the repo "Babel"...
Scraping the repo "Bash"...
Scraping the repo "Bitcoin"...
Scraping the repo "Bootstrap"...
Scraping the repo "Bot"...
Scraping the repo "C"...
Scraping the repo "Chrome"...
Scraping the repo "Chrome extension"...
Scraping the repo "Command line interface"...
Scraping the repo "Clojure"...
Scraping the repo "Code quality"...
Scraping the repo "Code review"...
Scraping the repo "Compiler"...
Scraping the repo "Continuous integration"...
Scraping the repo "COVID-19"...
Scraping the repo "C++"...
