## Scraping Top Repositories for Topics on GitHub
### TODO:
- Introduction about web scraping
- Introduction about GitHub and the problem statement
- Mention the tools: Python, requests, Beautiful Soup, pandas

### Here are the steps we will follow:
- We are going to scrape https://github.com/topics
- We will get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic, we'll create a CSV file in the following format:
- Repo Name,Username,Stars,Repo URL
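To make the target file layout concrete, here is a minimal sketch that writes one row in that format with Python's built-in `csv` module. The repository values are made-up placeholders, not scraped data, and the notebook itself writes its files with pandas later on.

```python
import csv
import io

# Header taken from the outline above
HEADER = ['Repo Name', 'Username', 'Stars', 'Repo URL']

# Placeholder row for illustration only (not real scraped data)
sample_row = ['example-repo', 'example-user', 1200,
              'https://github.com/example-user/example-repo']

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(HEADER)
writer.writerow(sample_row)

csv_text = buffer.getvalue()
print(csv_text)
```

Note that the functions later in this notebook name the columns `username`, `repo_name`, `stars` and `repo_url`, so the header in the generated files differs slightly from the outline.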

## Scrape the list of topics from GitHub
### Explain your approach
- Use requests to download the page
- Use Beautiful Soup (bs4) to parse the page and extract information
- Convert the results to a pandas DataFrame

Let's write a function to download the page.

In [4]:
import requests
import pandas as pd  # used later to build DataFrames and write CSV files
from bs4 import BeautifulSoup

def get_topics_page():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)

    # Check for a successful page load
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topics_url}')

    # Parse the HTML into a BeautifulSoup document
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [5]:
doc=get_topics_page()

In [6]:
type(doc)

bs4.BeautifulSoup

In [7]:
doc.find('a')

<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" data-skip-target-assigned="false" href="#start-of-content">Skip to content</a>
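`find` returns only the first matching tag, which is why we got the "Skip to content" link. The sketch below contrasts it with `find_all` on a tiny hand-written page, used here in place of the live GitHub HTML so that it runs offline.

```python
from bs4 import BeautifulSoup

# Tiny hand-written page, used here instead of the live GitHub HTML
html = '<div><a href="#start">Skip to content</a><a href="/topics/3d">3D</a></div>'
sample_doc = BeautifulSoup(html, 'html.parser')

first_link = sample_doc.find('a')       # first matching tag, or None if absent
all_links = sample_doc.find_all('a')    # list-like result with every match
missing = sample_doc.find('table')      # no <table> in the page, so this is None

print(first_link.text)
print([a['href'] for a in all_links])
print(missing)
```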

### Let's create some helper functions to parse information from the page
### To get topic titles, we can pick 'p' tags with the 'class'...
![](https://i.imgur.com/SBIH1gw.png)

In [9]:
# Function to extract topic titles from the HTML document
def get_topic_titles(doc):
    selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    # Find all <p> tags with the specified class for topic titles
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    # Extract and store the text of each topic title
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles
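Passing the whole class string to `find_all` only matches tags whose `class` attribute is exactly that string, so the lookup is sensitive to small markup changes. As a hedged alternative, the sketch below matches on a subset of the classes with a CSS selector via `select`. The HTML is hand-written to mimic the page structure, and `get_topic_titles_css` is a hypothetical name, not part of this notebook.

```python
from bs4 import BeautifulSoup

# Hand-written snippet that mimics the topic cards on the page
html = '''
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">Description text</p>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</div>
'''
sample_doc = BeautifulSoup(html, 'html.parser')

def get_topic_titles_css(doc):
    # Match <p> tags carrying both classes, regardless of class order
    return [tag.text.strip() for tag in doc.select('p.f3.Link--primary')]

titles = get_topic_titles_css(sample_doc)
print(titles)
```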

`get_topic_titles` can be used to get the list of titles.

In [11]:
titles=get_topic_titles(doc)

In [12]:
len(titles)

30

### Similarly, we have defined functions for finding descriptions and URLs

In [14]:
# Function to extract topic descriptions from the HTML document
def get_topic_descs(doc):
    desc_selector = "f5 color-fg-muted mb-0 mt-1"
    # Find all <p> tags with the specified class for topic descriptions
    topic_description_tags = doc.find_all('p', {'class': desc_selector})
    topic_descriptions = []
    # Extract and store the text of each description
    for tag in topic_description_tags:
        topic_descriptions.append(tag.text.strip())
    return topic_descriptions

### TODO:
### Example and Explanation
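As a small illustration of why `get_topic_descs` calls `.strip()`, the sketch below runs the same lookup on a hand-written snippet. The description text is a placeholder, not content taken from GitHub.

```python
from bs4 import BeautifulSoup

# Hand-written snippet; the text inside the tag is surrounded by whitespace
html = '''
<p class="f5 color-fg-muted mb-0 mt-1">
    Placeholder description of a topic.
</p>
'''
sample_doc = BeautifulSoup(html, 'html.parser')

tag = sample_doc.find('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
raw_text = tag.text             # still contains newlines and indentation
clean_text = tag.text.strip()   # the form that get_topic_descs stores

print(repr(raw_text))
print(repr(clean_text))
```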

In [16]:
# Function to extract URLs for each topic from the HTML document
def get_topic_urls(doc):
    # Find all <a> tags with the specified class for topic links
    topic_link_tags = doc.find_all('a', {'class': "no-underline flex-grow-0"})
    topic_urls = []
    # Construct the full URL and store it
    for tag in topic_link_tags:
        topic_urls.append("https://github.com" + tag['href'])
    return topic_urls
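String concatenation works here because the `href` values are site-relative paths. A slightly more defensive sketch uses `urljoin` from the standard library, which also leaves already-absolute URLs untouched. `to_absolute_url` is a hypothetical helper name used only for this illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def to_absolute_url(href):
    # Resolve a site-relative href against the base URL
    return urljoin(BASE_URL, href)

print(to_absolute_url('/topics/3d'))
print(to_absolute_url('https://github.com/topics/ajax'))
```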

### Let's put this all together into a single function

In [18]:
# Function to scrape the list of topics (title, description, URL) from the GitHub Topics page
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    
    # Check for a successful page load
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topics_url}')
    
    doc = BeautifulSoup(response.text, 'html.parser')
    
    # Create a dictionary to store the scraped data
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    
    # Return the data as a DataFrame
    return pd.DataFrame(topics_dict)
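Building a DataFrame from a dictionary of lists only works when all the lists have the same length, so a selector that silently stops matching would surface as an error at this step. The sketch below makes that check explicit. `build_topics_df` is a hypothetical helper and the values are placeholders.

```python
import pandas as pd

def build_topics_df(titles, descriptions, urls):
    # pd.DataFrame needs every column to have the same length
    if not (len(titles) == len(descriptions) == len(urls)):
        raise ValueError(
            f'Column lengths differ: {len(titles)} titles, '
            f'{len(descriptions)} descriptions, {len(urls)} urls'
        )
    return pd.DataFrame({'title': titles, 'description': descriptions, 'url': urls})

# Placeholder values for illustration only
df = build_topics_df(
    ['3D', 'Ajax'],
    ['Placeholder description A', 'Placeholder description B'],
    ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(df.shape)
```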

## Get the top 25 repositories in the topic from the topic page
### TODO-Explanation and step

In [20]:
def get_topic_page(topic_url):
    ## Download the page
    response = requests.get(topic_url)
    
    ## Check successful response
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topic_url}')
    
    ## Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


In [21]:
doc=get_topic_page('https://github.com/topics/3d')

 TODO-talk about the h3 tags

In [37]:
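def parse_star_count(stars_str):
    # Convert a star label such as '81.2k' or '950' to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        # round() guards against floating-point error before the int conversion
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str)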
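Star counts on the page are abbreviated with a `k` suffix, so they need converting before they can be stored as numbers. The sketch below restates the helper (with rounding before the integer conversion) and runs it on a few illustrative labels, which are not values scraped from GitHub.

```python
def parse_star_count(stars_str):
    # Same idea as the notebook helper, with rounding before the int conversion
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str)

# Illustrative inputs, including surrounding whitespace
for label in ['950', '1k', '81.2k', ' 5.7k\n']:
    print(repr(label), '->', parse_star_count(label))
```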

In [33]:
def get_repo_info(h3_tag, star_tag):
    # Returns all the required info about a repository
    base_url = 'https://github.com'  # hrefs are site-relative, e.g. '/user/repo'
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


TODO-show an example
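Here is a minimal sketch of the extraction steps on a single repository card. The markup is hand-written under the assumption that each `h3` holds two links (user first, repository second) next to a star counter `span`, and every name in it is a placeholder.

```python
from bs4 import BeautifulSoup

# Hand-written markup that mimics one repository card; names are placeholders
html = '''
<h3 class="f3 color-fg-muted text-normal lh-condensed">
  <a href="/example-user">example-user</a> /
  <a href="/example-user/example-repo">example-repo</a>
</h3>
<span class="Counter js-social-count">1.5k</span>
'''
card = BeautifulSoup(html, 'html.parser')
h3_tag = card.find('h3')
star_tag = card.find('span', class_='Counter js-social-count')

# The first link is the user, the second is the repository
a_tags = h3_tag.find_all('a')
username = a_tags[0].text.strip()
repo_name = a_tags[1].text.strip()
repo_url = 'https://github.com' + a_tags[1]['href']
stars_label = star_tag.text.strip()

print(username, repo_name, stars_label, repo_url)
```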

In [25]:
def get_topic_repos(topic_doc):
    
    ## Get the h3 tags containing repo title, repo URL, username
    repo_tags = topic_doc.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')
    
    ## Get star tags
    star_tags = topic_doc.find_all('span', class_='Counter js-social-count')
    
    ## Get repo info
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)


TODO-Show an example
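To show the shape of the result, the sketch below builds the same four-column DataFrame from a list of tuples like the ones `get_repo_info` returns. `repos_to_df` is a hypothetical helper and the rows are placeholders rather than scraped data.

```python
import pandas as pd

def repos_to_df(repo_infos):
    # repo_infos: iterable of (username, repo_name, stars, repo_url) tuples
    columns = ['username', 'repo_name', 'stars', 'repo_url']
    return pd.DataFrame(list(repo_infos), columns=columns)

# Placeholder tuples standing in for get_repo_info results
rows = [
    ('example-user', 'example-repo', 1500,
     'https://github.com/example-user/example-repo'),
    ('another-user', 'another-repo', 950,
     'https://github.com/another-user/another-repo'),
]
topic_df = repos_to_df(rows)
print(topic_df)
```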

In [27]:
def scrape_topic(topic_url,topic_name):
    topic_df=get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(topic_name+'.csv',index=None)

TODO-Show an example
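Since the topic title is used directly as the file name, a title containing a path separator would make the write fail or land in an unexpected place. The sketch below shows one possible way to sanitize it. `safe_filename` is a hypothetical helper, and `'TCP/IP'` is an invented title used only to exercise the replacement.

```python
import re

def safe_filename(topic_name):
    # Replace path separators and other risky characters with underscores
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', topic_name).strip()
    return cleaned + '.csv'

print(safe_filename('Command-line interface'))
print(safe_filename('C++'))
print(safe_filename('TCP/IP'))
```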

## Putting it all together
- We have a function to get the list of topics
- We have a function to create a CSV file from scraped repos from a topic page
- Let's create a function to put them together

In [30]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df=scrape_topics()
    for index,row in topics_df.iterrows():
        print("Scraping top repositories for {}".format(row['title']))
        scrape_topic(row['url'],row['title'])
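The loop above sends one request per topic in quick succession and stops at the first failure. A minimal sketch of a gentler variant pauses between topics and records failures instead of aborting. `scrape_with_delay` and its parameters are hypothetical, and `fake_scrape` is a stub standing in for `scrape_topic` so that the example runs without network access.

```python
import time

def scrape_with_delay(topics, scrape_one, delay_seconds=1.0):
    # topics: iterable of (url, title) pairs; scrape_one: callable doing the work
    failed = []
    for url, title in topics:
        try:
            scrape_one(url, title)
        except Exception as error:
            # Record the failure and keep going with the remaining topics
            failed.append((title, str(error)))
        time.sleep(delay_seconds)
    return failed

# Stub in place of scrape_topic so the sketch runs without network access
def fake_scrape(url, title):
    if title == 'Broken':
        raise Exception(f'Failed to load page {url}')

failed = scrape_with_delay(
    [('https://github.com/topics/3d', '3D'),
     ('https://github.com/topics/broken', 'Broken')],
    fake_scrape,
    delay_seconds=0,
)
print(failed)
```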

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics.

In [39]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for 3D
Scraping top repositories for Ajax
Scraping top repositories for Algorithm
Scraping top repositories for Amp
Scraping top repositories for Android
Scraping top repositories for Angular
Scraping top repositories for Ansible
Scraping top repositories for API
Scraping top repositories for Arduino
Scraping top repositories for ASP.NET
Scraping top repositories for Awesome Lists
Scraping top repositories for Amazon Web Services
Scraping top repositories for Azure
Scraping top repositories for Babel
Scraping top repositories for Bash
Scraping top repositories for Bitcoin
Scraping top repositories for Bootstrap
Scraping top repositories for Bot
Scraping top repositories for C
Scraping top repositories for Chrome
Scraping top repositories for Chrome extension
Scraping top repositories for Command-line interface
Scraping top repositories for Clojure
Scraping top repositories for Code quality
Scraping top repositories for Code review
Scraping top repositories for Compiler
Scraping top repositories for Continuous integration
Scraping top repositories for C++
Scraping top repositories for Cryptocurrency
Scraping top repositories for Crystal

We can check whether the CSV files were created properly by reading one of them back and displaying it with pandas.
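A minimal sketch of that check is shown below. So that it runs on its own, it first writes a small placeholder file to a temporary directory. In the notebook you would instead point `pd.read_csv` at one of the generated files, such as `3D.csv`.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small placeholder CSV so the example is self-contained
    path = os.path.join(tmp_dir, '3D.csv')
    pd.DataFrame({
        'username': ['example-user'],
        'repo_name': ['example-repo'],
        'stars': [1500],
        'repo_url': ['https://github.com/example-user/example-repo'],
    }).to_csv(path, index=False)

    # Read the file back and inspect it
    check_df = pd.read_csv(path)

print(check_df.head())
```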

## References and Future Work
- Summary of what we did
- References to links you found useful
- Ideas for future work