<span style="font-size: 35px; font-family: Times New Roman, sans-serif;"><b>Scraping Top Repositories for Topics on GitHub</b></span></br></br>
 
<span style="font-size: 24px; font-family: Times New Roman, sans-serif;"><b>Introduction</b></span></br>
    
<span style="font-size: 17px; font-family: Times New Roman, sans-serif;"><b>Web scraping</b> is a powerful technique for extracting information from websites, enabling us to gather data for analysis or other purposes.</span>
<span style="font-size: 17px; font-family: Times New Roman, sans-serif;">In this project, I am focusing on scraping top repositories for various topics on GitHub.</span></br> 
<span style="font-size: 17px; font-family: Times New Roman, sans-serif;"><b>GitHub</b> is a widely used platform for hosting and collaborating on software projects. The goal is to extract information about topics, including their titles, URLs, and descriptions, and then scrape the top repositories for each topic.</span></br></br>

<span style="font-size: 24px; font-family: Times New Roman, sans-serif;"><b>Problem Statement</b></span></br>
<span style="font-size: 17px; font-family: Times New Roman, sans-serif;">GitHub provides a dedicated page for exploring different topics (https://github.com/topics). The challenge is to efficiently gather information about these topics and then extract details about the top repositories within each topic.</span></br></br>

<span style="font-size: 24px; font-family: Times New Roman, sans-serif;"><b>Tools Used</b></span>  
<span style="font-size: 17px; font-family: Times New Roman, sans-serif;">
&#8226; Python</br>
&#8226; Requests library for making HTTP requests</br>
&#8226; BeautifulSoup (BS4) library for HTML parsing</br>
&#8226; Pandas for data manipulation</br>
&#8226; OS for handling file operations</br>
</span>

<span style="font-size: 24px; font-family: Times New Roman, sans-serif;"><b>Steps Followed</b></span></br> 

<span style="font-size: 17px; font-family: Times New Roman, sans-serif;"><b>1. Importing Libraries</b></span>  
<span style="font-size: 17px; font-family: Times New Roman, sans-serif;">Let's import the necessary libraries:</span>  



In [37]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os


<span style="font-size: 17px; font-family: Times New Roman, sans-serif;"><b>2. Scrape the list of topics from GitHub</b></span></br>
<span style="font-size: 17px; font-family: Times New Roman, sans-serif;">
To achieve this, the following steps are performed:
- Use the `requests` library to download the page.</br>
- Use `BeautifulSoup` to parse and extract information.</br>
- Convert the extracted data to a Pandas dataframe.</span>  


In [38]:
def get_topics_page():
    # Download the GitHub topics page and return it as a parsed document
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

Explanation: The `get_topics_page` function downloads the GitHub topics page and returns the parsed document.

In [72]:
doc = get_topics_page()
print(type(doc))

<class 'bs4.BeautifulSoup'>


### Let's create some helper functions to parse information from the page

To get the topic titles, we can pick the `p` tags with the `class` ...


![](https://i.imgur.com/xwjEDUD.png)

In [40]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

Explanation: The `get_topic_titles` function extracts the titles of topics from the parsed document.

In [41]:
# get_topic_titles can be used to get the list of titles 

titles = get_topic_titles(doc)
len(titles)

30

In [42]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

Similarly, let's define functions for the descriptions and URLs.

In [73]:
def get_topic_descs(doc):
    # Function to extract topic descriptions
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descriptions = []
    for desc in topic_desc_tags:
        topic_descriptions.append(desc.text.strip())
    return topic_descriptions

Explanation: The `get_topic_descs` function extracts the descriptions of topics from the parsed document.

In [44]:
# Get the list of topic descriptions
descriptions = get_topic_descs(doc)
print(len(descriptions))
descriptions[:5]

30


['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [45]:
def get_topic_urls(doc):
    # Function to extract topic URLs
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-grow-0'})
    topic_urls = []
    base_url = "https://github.com"
    for url in topic_link_tags:
        topic_urls.append(base_url + url['href'])
    return topic_urls

Explanation: The `get_topic_urls` function extracts the URLs of topics from the parsed document.

In [46]:
# Get the list of topic URLs
urls = get_topic_urls(doc)
print(len(urls))
urls[:5]

30


['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

In [47]:
# Let's put it all together into a single function

def scrape_topics():
    # Function to scrape topics and their details
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    
    topics_dict = {
        'topic_title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

Explanation: The `scrape_topics` function integrates the three functions to scrape topics along with their titles, descriptions, and URLs.
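
One fragile point in this approach is that titles, descriptions, and URLs are collected in three separate lists. If one topic lacked a description, the lists would have different lengths and `pd.DataFrame` would raise an error. A minimal sketch of an alternative, assuming each topic sits in its own container element, is to parse one container at a time. The `topic-card` class and the sample HTML below are made up for illustration and do not reflect GitHub's actual markup; `parse_topic_cards` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one <div class="topic-card"> per topic.
SAMPLE_HTML = """
<div class="topic-card">
  <a href="/topics/3d">
    <p class="f3 Link--primary">3D</p>
    <p class="f5 color-fg-muted"> 3D graphics and modeling. </p>
  </a>
</div>
<div class="topic-card">
  <a href="/topics/ajax">
    <p class="f3 Link--primary">Ajax</p>
  </a>
</div>
"""

BASE_URL = "https://github.com"


def parse_topic_cards(doc):
    """Return one dict per topic card so the fields of a topic stay together."""
    rows = []
    for card in doc.find_all("div", class_="topic-card"):
        # class_ matches a single CSS class even when the tag has several.
        title_tag = card.find("p", class_="Link--primary")
        desc_tag = card.find("p", class_="color-fg-muted")
        link_tag = card.find("a", href=True)
        if title_tag is None or link_tag is None:
            continue  # skip cards that lack the essential fields
        rows.append({
            "topic_title": title_tag.get_text(strip=True),
            "description": desc_tag.get_text(strip=True) if desc_tag else "",
            "url": BASE_URL + link_tag["href"],
        })
    return rows


rows = parse_topic_cards(BeautifulSoup(SAMPLE_HTML, "html.parser"))
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)`. The second sample card has no description, and it still produces a complete row with an empty string in that column.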

In [48]:
# Example
topics_df = scrape_topics()
topics_df.head()

Unnamed: 0,topic_title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


<span style="font-size: 17px; font-family: Times New Roman, sans-serif;"><b>3. Scrape Top Repositories for Each Topic</b></span></br>
<span style="font-size: 17px; font-family: Times New Roman, sans-serif;">
For each topic, we need to get the top repositories. This involves downloading the topic page, parsing it, and extracting the relevant information.</span>  


In [49]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [50]:
# Let's try this function now
doc = get_topic_page('https://github.com/topics/3d')

In [51]:
# Let's convert the star count string to a number
def parse_star_count(stars_str):
    # Function to parse star count
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k': 
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

Explanation: The `parse_star_count` function converts abbreviated star counts such as '1.2k' to integers.
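
The version above covers the `k` suffix seen in the notebook. A slightly more defensive sketch is shown below, under stated assumptions: the `m` suffix and comma-separated counts such as `1,234` are guesses about formats the page might use, not something observed in this notebook. `parse_star_count_safe` is a hypothetical name. It uses `Decimal` so that the multiplication happens in exact decimal arithmetic before the conversion to `int`.

```python
from decimal import Decimal

# Suffix multipliers; 'm' is an assumption, the notebook only shows 'k'.
SUFFIX_MULTIPLIERS = {"k": 1_000, "m": 1_000_000}


def parse_star_count_safe(stars_str):
    """Hypothetical, more defensive variant of parse_star_count."""
    text = stars_str.strip().lower().replace(",", "")
    if not text:
        raise ValueError("empty star count")
    # Look at the last character to decide whether a suffix is present.
    multiplier = SUFFIX_MULTIPLIERS.get(text[-1], 1)
    if multiplier != 1:
        text = text[:-1]
    # Decimal avoids binary floating-point rounding before truncation.
    return int(Decimal(text) * multiplier)


examples = ["1.2k", "96.6k", "1,234", "987", "2.5M"]
parsed = [parse_star_count_safe(s) for s in examples]
print(parsed)  # [1200, 96600, 1234, 987, 2500000]
```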

In [52]:
# Example
stars = parse_star_count('1.2k')
print(stars)

1200


In [53]:
base_url = "https://github.com"

def get_repo_info(h3_tag, star_tag):
    #return all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


Explanation: The `get_repo_info` function extracts information about a repository from the provided `h3` and star tags. 

In [55]:
def get_topic_repos(topic_doc):
    # Get the h3 tags containing the repo title, repo URL and username
    h3_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_class})
    # Get the star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    # Collect the repo info column by column
    
    topic_repos_dict = {
        'username': [], 
        'repo_name': [], 
        'stars': [], 
        'repo_url': []
    }
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)
 

Explanation: The `get_topic_repos` function extracts information about top repositories for a given topic.
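
The loop in `get_topic_repos` pairs `repo_tags[i]` with `star_tags[i]`, which silently assumes both lists have the same length and order. If the page ever yields fewer star tags than repository tags, the loop fails with an `IndexError`. If it yields more, the rows may be misaligned without any error. A minimal sketch of a guard, using a hypothetical helper `pair_tags`, could look like this:

```python
def pair_tags(repo_tags, star_tags):
    """Pair each repository tag with its star tag, refusing mismatched lists."""
    if len(repo_tags) != len(star_tags):
        raise ValueError(
            "Found {} repository tags but {} star tags".format(
                len(repo_tags), len(star_tags)
            )
        )
    return list(zip(repo_tags, star_tags))


# Plain strings stand in for BeautifulSoup tags in this illustration.
pairs = pair_tags(["three.js", "libgdx"], ["96.6k", "22.3k"])
print(pairs)
```

Inside `get_topic_repos`, the loop could then iterate over `pair_tags(repo_tags, star_tags)` instead of indexing both lists.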

In [17]:
def scrape_topic(topic_url, path):
    # The CSV is written to path + '.csv', so check for that same file name
    csv_path = path + '.csv'
    if os.path.exists(csv_path):
        print("The file {} already exists. Skipping...".format(csv_path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(csv_path, index=False)

Explanation: The `scrape_topic` function scrapes a specific topic, creates a dataframe, and saves it to a CSV file. 
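
Topic titles are used directly as file names, which works for the titles shown in this notebook but could break for a title containing a path separator or other special characters. A small sketch of a sanitising helper follows. `topic_csv_path` is a hypothetical function, and the title `CI/CD` is an invented example, not a topic taken from the scraped page.

```python
import os
import re


def topic_csv_path(folder, topic_title):
    """Build a CSV path for a topic; a hypothetical helper, not part of the notebook."""
    # Replace anything other than letters, digits, dot, dash and underscore.
    safe_name = re.sub(r"[^A-Za-z0-9._-]+", "_", topic_title.strip())
    return os.path.join(folder, safe_name + ".csv")


for title in ["3D", "Awesome Lists", "ASP.NET", "CI/CD"]:
    print(topic_csv_path("Github_topics", title))
```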

## Putting it all together

- We already have a function to get the list of topics.
- We also have a function to create a CSV file of the repositories scraped from a topic page.
- Let's create a function to put them together.

In [63]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()

    os.makedirs('Github_topics', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['topic_title']))
        # scrape_topic writes the CSV itself and returns nothing
        scrape_topic(row['url'], 'Github_topics/{}'.format(row['topic_title']))

Explanation: The `scrape_topics_repos` function orchestrates the entire scraping process for all topics.
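
Because `get_topic_page` raises an exception when a page fails to load, a single bad topic stops the whole run. The sketch below shows one way to isolate failures so the remaining topics are still processed. `scrape_all` and `fake_scrape` are hypothetical names, and `fake_scrape` is only a stand-in for `scrape_topic` so that the example runs without network access.

```python
def scrape_all(topics, scrape_one):
    """Run scrape_one for every (title, url) pair and keep going after failures."""
    succeeded, failed = [], {}
    for title, url in topics:
        try:
            scrape_one(url, title)
            succeeded.append(title)
        except Exception as error:  # broad on purpose: one bad topic should not stop the run
            failed[title] = str(error)
    return succeeded, failed


# Stand-in for scrape_topic that fails for one topic, for illustration only.
def fake_scrape(url, title):
    if title == "Ajax":
        raise RuntimeError("Failed to load page {}".format(url))


topics = [
    ("3D", "https://github.com/topics/3d"),
    ("Ajax", "https://github.com/topics/ajax"),
]
succeeded, failed = scrape_all(topics, fake_scrape)
print(succeeded)
print(failed)
```

Returning the failed topics together with their error messages makes it easy to retry just those topics later.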

### The final step is to run the scraping process for all topics and store the results in CSV files.

In [65]:
#Putting all together
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin

In [70]:
# Load one of the saved CSV files to check the result
df = pd.read_csv(r'C:\Users\aakru\Github_topics\3D.csv')
df

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,96600,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,24900,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,22300,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,21800,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,18600,https://github.com/ssloy/tinyrenderer
5,lettier,3d-game-shaders-for-beginners,16500,https://github.com/lettier/3d-game-shaders-for...
6,FreeCAD,FreeCAD,16100,https://github.com/FreeCAD/FreeCAD
7,aframevr,aframe,15900,https://github.com/aframevr/aframe
8,CesiumGS,cesium,11400,https://github.com/CesiumGS/cesium
9,blender,blender,10500,https://github.com/blender/blender


## Summary 

In this web scraping project, I used the GitHub Topics page to collect data about various programming topics and their most popular repositories. Using Python with the Requests, BeautifulSoup, and Pandas libraries, I accomplished the following:

1. Topic Information Retrieval:
- Extracted topic titles, descriptions, and URLs from the GitHub Topics page.
- Utilized the Requests library to download the page, and BeautifulSoup for parsing and extraction.
- Organized the data into a Pandas DataFrame for further analysis.

2. Top Repositories Scraping:
- For each topic, scraped the top repositories, capturing repository names, usernames, stars, and URLs.
- Implemented functions for efficient extraction and parsing of relevant information.

3. Automation and Scalability:
- Developed a comprehensive function, scrape_topics_repos(), to automate the entire scraping process for multiple topics.
- Scraped data for all 30 topics listed on the first page of GitHub Topics.

4. Data Storage:
- Stored the scraped data in CSV files, creating a structured format for easy access and analysis.

## Next Steps and Future Work

While this project provides a solid foundation for scraping GitHub topics, there are opportunities for enhancement and expansion:

- Pagination Handling: Extend the scraping process to cover multiple pages of GitHub topics.
- Error Handling: Implement robust error handling and retries to ensure reliability in data extraction.
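
As a starting point for these two items, here is a minimal sketch rather than a finished implementation. `fetch_with_retries` retries any callable with exponential backoff, and `topics_page_urls` builds page URLs. Both names are hypothetical, and the `?page=N` query parameter is an assumption about how the topics listing is paginated that should be verified against the live site. `flaky_fetch` simulates a failing download so that the example runs without network access.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Wait 1x, 2x, 4x ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


def topics_page_urls(num_pages):
    """Assumed pagination scheme: a ?page=N query parameter on the topics URL."""
    return [
        "https://github.com/topics?page={}".format(n)
        for n in range(1, num_pages + 1)
    ]


# Demonstration with a fake fetch that fails twice before succeeding.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


delays = []  # record the delays instead of actually sleeping
result = fetch_with_retries(flaky_fetch, topics_page_urls(2)[1], sleep=delays.append)
print(result, delays)
```

In the notebook, `fetch` could be a small wrapper around `requests.get` that passes a `timeout` and raises on a non-200 status, so that slow or failed responses trigger a retry.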