# Scraping Top Repositories for Topics on GitHub

#### Introduction:
- GitHub is a platform for hosting and sharing source code. This project gathers and organizes information about GitHub topics and the repositories listed under them. For each topic it collects the title, description, and URL; for each repository it collects the owner's username, the repository name, the number of stars, and the repository URL. The data is saved as CSV files for further study and analysis, helping us understand how topics are organized on GitHub and which repositories are the most popular. All of this is done using a process called web scraping.

#### Web Scraping:
- Web scraping, also known as web harvesting or web data extraction, is the process of automatically extracting data from websites. It involves parsing the HTML or other structured data on a web page and then collecting, transforming, and storing the desired information for various purposes.

##### Here are the steps we'll follow:

1. We'll initiate the web scraping process by accessing the GitHub Topics page (https://github.com/topics).

2. Next, we will retrieve a list of topics. For each topic, we aim to collect the topic title, the URL of the topic page, and a brief description of the topic.

3. Within each topic, we'll delve further and retrieve information about the top 25 repositories within that topic, using the respective topic page.

4. For each repository in the topic, we are interested in capturing the owner's username, the repository's name, the number of stars it has garnered, and its URL.

5. Lastly, for each of the topics, we will organize the gathered information into a CSV (Comma-Separated Values) file. The CSV file will be structured in a tabular format, similar to this example:

   ```
   username,repo_name,stars,repo_url
   user1,repo1,100,https://github.com/user1/repo1
   user2,repo2,250,https://github.com/user2/repo2
   ...
   ```


### Scrape the list of Topics from GitHub
1. Fetching the GitHub Topics Page:
   The script begins by fetching the GitHub "Topics" page with an HTTP GET request made through the `requests` library.
2. Checking the Response and Parsing with `BeautifulSoup`:
   The script then checks that the response was successful (HTTP status code 200) and parses the HTML content of the page using `BeautifulSoup`.
3. Extracting GitHub Topic Titles, Descriptions, and URLs:
   The script contains functions to extract the titles, descriptions, and URLs of GitHub topics.
4. Organizing Data with `pandas`:
   The extracted data (titles, descriptions, and URLs) is then organized into a DataFrame using `pandas`.
5. Scraping Repository Data for Each Topic:
   The script also includes functions to scrape data for the repositories related to each topic. This involves making additional requests to the individual topic pages, extracting repository information, and saving it.

#### Importing Libraries:
- os: This library allows you to interact with the operating system, particularly for tasks like file and directory operations.
- pandas as pd: The pandas library is used for data manipulation and analysis, including working with structured data using DataFrames and Series.
- requests: This library is used for making HTTP requests, enabling the script to fetch data from websites.
- BeautifulSoup (from bs4): BeautifulSoup is a Python library for parsing and navigating HTML and XML documents. It's commonly used in web scraping to extract specific data from web pages.

In [None]:
import os
import pandas as pd
import requests
from bs4 import BeautifulSoup

We define the URL of the GitHub "Topics" page, send an HTTP GET request to it with `requests.get()`, and read the HTTP status code of the response from `resp.status_code`.

In [None]:
url="https://github.com/topics"
resp=requests.get(url)
resp.status_code
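
The cell above only inspects the status code. As a minimal sketch, assuming we would like failed requests to raise an exception and slow requests to time out, the fetch could be wrapped as shown below. `fetch_html` and its `session` and `timeout` parameters are hypothetical names that do not appear in the original script; `raise_for_status()` and the `timeout` argument are standard `requests` features.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text (hypothetical helper, not in the original script)."""
    # Reuse a caller-supplied session if given; otherwise create a fresh one.
    session = session or requests.Session()
    # `timeout` (in seconds) stops the call from waiting forever on a stalled server.
    resp = session.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of returning an error page.
    resp.raise_for_status()
    return resp.text
```

A call such as `fetch_html("https://github.com/topics")` would then return the same text as `resp.text` above, or raise if the page could not be loaded.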

In [None]:
len(resp.text)

Here we take the text content of the HTTP response we received from the GitHub "Topics" page and display the first 1000 characters of that content.

In [None]:
page_cont=resp.text
page_cont[:1000]

By using BeautifulSoup to create the `doc` object, we convert the raw HTML content in `page_cont` into a structured format, simplifying the process of exploring and extracting data from the web page. This parsed representation of the HTML content facilitates tasks such as web scraping and webpage analysis.

In [None]:
doc =  BeautifulSoup(page_cont, 'html.parser')
type(doc)
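
To get a feel for the parsed object without touching the network, here is a small sketch that parses a made-up HTML fragment. The markup and the `title` class are invented for illustration and are not GitHub's real markup; `find` returns the first match and `find_all` returns every match.

```python
from bs4 import BeautifulSoup

# Invented HTML fragment standing in for a downloaded page.
sample_html = """
<div>
  <a href="/topics/3d"><p class="title">3D</p></a>
  <a href="/topics/ajax"><p class="title">Ajax</p></a>
</div>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# find() returns only the first matching tag.
first_link = sample_doc.find("a")
print(first_link["href"])  # /topics/3d

# find_all() returns a list of every matching tag.
sample_titles = [p.text for p in sample_doc.find_all("p", class_="title")]
print(sample_titles)  # ['3D', 'Ajax']
```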

- Let's create a function to encapsulate the code for parsing HTML content using BeautifulSoup.

In [None]:
def get_topics_page():
    topics_url='https://github.com/topics'
    resp=requests.get(topics_url)
    if resp.status_code != 200:
        raise Exception("Failed to load page {}".format(topics_url))
    doc = BeautifulSoup(resp.text,'html.parser')
    return doc

In [None]:
doc = get_topics_page()
type(doc)
doc.find('a')

- Now, we are using the BeautifulSoup object doc to find and extract the titles of topics from the parsed GitHub web page.

In [None]:
topic_titl_tags=doc.find_all('p',{'class':"f3 lh-condensed mb-0 mt-1 Link--primary"})
print(len(topic_titl_tags))
topic_titl_tags[:5]
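
One detail worth knowing about the lookup above: when the `class` value passed to `find_all` contains spaces, BeautifulSoup compares it against the exact string of the tag's `class` attribute, so a change in class order or an extra class on the page makes the search return nothing. The sketch below, on invented markup, contrasts this with a CSS selector via `select`, which only requires the listed classes to be present. Which classes are stable on the real GitHub page is an assumption you would need to check in the browser.

```python
from bs4 import BeautifulSoup

# Two invented tags: same classes, but the second has a different order and an extra class.
sample_html = (
    '<p class="f3 mb-0 Link--primary">3D</p>'
    '<p class="mb-0 f3 Link--primary extra">Ajax</p>'
)
sample_doc = BeautifulSoup(sample_html, "html.parser")

# Exact-string match on the class attribute: only the first tag qualifies.
exact_matches = sample_doc.find_all("p", {"class": "f3 mb-0 Link--primary"})

# CSS selector: any <p> carrying both classes, in any order, qualifies.
selector_matches = sample_doc.select("p.f3.Link--primary")

print([t.text for t in exact_matches])     # ['3D']
print([t.text for t in selector_matches])  # ['3D', 'Ajax']
```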

- We extract the text content (the topic title) of each element in `topic_titl_tags` and store it in the `topic_titles` list. We can then use this list for further processing or analysis.

In [None]:
topic_titles=[]

for i in topic_titl_tags:
    topic_titles.append(i.text)

print(topic_titles)

- Next, we use the same BeautifulSoup object `doc` to find the HTML elements that hold the topic descriptions.

In [None]:
topic_desc_tags=doc.find_all('p',{'class':"f5 color-fg-muted mb-0 mt-1"})
print(len(topic_desc_tags))
topic_desc_tags[:5]

- We collect the text of each description tag, strip the surrounding whitespace, and store the results in the `topic_desc` list.

In [None]:
topic_desc=[]

for i in topic_desc_tags:
    topic_desc.append(i.text.strip())
    
print(topic_desc)

- Now we are extracting topic links from a GitHub web page. `topic_link_tags` will contain a list of matching `<a>` elements.


In [None]:
topic_link_tags=doc.find_all('a',{'class':"no-underline flex-1 d-flex flex-column"})
print(len(topic_link_tags))
topic_link_tags[:5]

- We build the full URL of each topic from the `href` of each tag in `topic_link_tags` and store it in the `topic_urls` list.

In [None]:
topic_urls=[]
base_url= 'https://github.com'
for i in topic_link_tags:
    topic_urls.append(base_url+i['href'])

print(topic_urls)
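
String concatenation works here because every `href` on the page is a path starting with `/`. As an alternative sketch, the standard library's `urllib.parse.urljoin` handles both relative and already-absolute links. The sample `href` values below are made up for illustration.

```python
from urllib.parse import urljoin

site_root = "https://github.com"
# Made-up href values: one relative path and one link that is already absolute.
sample_hrefs = ["/topics/3d", "https://github.com/topics/ajax"]

# urljoin resolves relative paths against the root and leaves absolute URLs untouched.
joined_urls = [urljoin(site_root, href) for href in sample_hrefs]
print(joined_urls)
# ['https://github.com/topics/3d', 'https://github.com/topics/ajax']
```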

##### Now, we can create functions to encapsulate the code for extracting topic titles, descriptions, and URLs from a web page.
1. `get_topic_titles()` for topic titles
2. `get_topic_desc()` for description of the topic
3. `get_topic_urls()` for the URLs of the topics

1. To get topic titles, we can pick `p` tags with the `class` ...

![alt text](abcd.png)


In [None]:
def get_topic_titles(doc):
    topic_titl_tags=doc.find_all('p',{'class':"f3 lh-condensed mb-0 mt-1 Link--primary"})
    topic_titles=[]
    for i in topic_titl_tags:
        topic_titles.append(i.text)
    return topic_titles

In [None]:
titles = get_topic_titles(doc)
print(len(titles))
titles[:5]

2. To get topic description, we can pick `p` tags with the `class` ...

![alt text](Topic_desc.png)


In [None]:
def get_topic_desc(doc):
    topic_desc_tags=doc.find_all('p',{'class':"f5 color-fg-muted mb-0 mt-1"})
    topic_desc=[]
    for i in topic_desc_tags:
        topic_desc.append(i.text.strip())
    return topic_desc

In [None]:
desc = get_topic_desc(doc)
print(len(desc))
desc[:5]

3. To get topic URLs, we can pick `a` tags with the `class` ... `no-underline flex-1 d-flex flex-column`

In [None]:
def get_topic_urls(doc):
    topic_link_tags=doc.find_all('a',{'class':"no-underline flex-1 d-flex flex-column"})
    topic_urls=[]
    base_url= 'https://github.com'
    for i in topic_link_tags:
        topic_urls.append(base_url+i['href'])
    return topic_urls

In [None]:
urls = get_topic_urls(doc)
print(len(urls))
urls[:5]

- Let's consolidate all of this into a single function, which will structure the data and return it as a DataFrame.

In [None]:
def scrape_topics():
    # Download and parse the topics page (raises if the request fails)
    doc = get_topics_page()
    # Each key becomes a DataFrame column; the three lists must have equal length
    topics_dict = {
        'titles' : get_topic_titles(doc),
        'description' : get_topic_desc(doc),
        'url' : get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)
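
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which is what happens if one of the class selectors stops matching the page. A minimal sketch of an explicit check with a clearer error message follows; `build_topics_frame` is a hypothetical helper and not part of the original script.

```python
import pandas as pd


def build_topics_frame(titles, descriptions, urls):
    """Hypothetical helper: verify column lengths before building the DataFrame."""
    lengths = {
        "titles": len(titles),
        "description": len(descriptions),
        "url": len(urls),
    }
    if len(set(lengths.values())) != 1:
        # A mismatch usually means one of the selectors no longer matches the page.
        raise ValueError("Column lengths differ: {}".format(lengths))
    return pd.DataFrame({"titles": titles, "description": descriptions, "url": urls})


sample_topics = build_topics_frame(
    ["3D"], ["Example description"], ["https://github.com/topics/3d"]
)
print(sample_topics.shape)  # (1, 3)
```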

## Get the top repositories from a topic page

1. `get_topic_page(topic_url)`: This function downloads and parses a web page related to a specified topic URL.

- It sends an HTTP GET request to the topic URL.
- Checks for a successful response (status code 200).
- Parses the HTML content of the page using BeautifulSoup with the 'html.parser' parser.
- Returns the parsed document as a BeautifulSoup object.

In [None]:
def get_topic_page(topic_url):
    #Download the page
    resp = requests.get(topic_url)
    #Check successful response
    if resp.status_code != 200:
        raise Exception("Failed to load page {}".format(topic_url))
    #Parse using Beautiful soup
    topic_doc=BeautifulSoup(resp.text,'html.parser')
    return topic_doc


In [None]:
doc = get_topic_page('https://github.com/topics/3d')
doc

2. `parse_star_count(stars_str)`: This function parses star counts from strings.

- Removes leading and trailing spaces from the input string.
- Checks if the string ends with 'k' (indicating thousands).
- Converts the string to an integer (e.g., "1.5k" to 1500) and returns the result.

In [None]:
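def parse_star_count(stars_str):
    # Remove surrounding whitespace and newlines
    stars_str=stars_str.strip()
    # A trailing 'k' means thousands, e.g. "1.5k" -> 1500
    if stars_str.endswith('k'):
        # round() instead of int() so float error cannot truncate the result by one
        return round(float(stars_str[:-1])*1000)
    return int(stars_str)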
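
The function above covers plain integers and the `k` suffix. As a hedged sketch, a slightly more tolerant variant could also accept thousands separators and an `m` suffix. Whether GitHub ever renders star counts in those forms is an assumption, and `parse_count` is a hypothetical name.

```python
def parse_count(text):
    """Hypothetical variant of parse_star_count; commas and the 'm' suffix are assumed formats."""
    multipliers = {"k": 1_000, "m": 1_000_000}
    # Normalize: trim whitespace, lower-case the suffix, drop thousands separators.
    cleaned = text.strip().lower().replace(",", "")
    if cleaned and cleaned[-1] in multipliers:
        # round() rather than int() so floating-point error cannot drop the result by one.
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)


print(parse_count("1.5k"))      # 1500
print(parse_count(" 12,345 "))  # 12345
print(parse_count("2M"))        # 2000000
```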

Example:

In [None]:
star_tags=doc.find_all('span',{'class':"Counter js-social-count"})
print(len(star_tags))
parse_star_count(star_tags[0].text)

3. `get_repo_info(h3_tag, star_tag)`: This function returns information about a repository.

- Extracts the username, repository name, star count, and repository URL from the provided HTML tags.
- Returns these details as separate values.

In [None]:
base_url="https://github.com"
def get_repo_info(h3_tag,star_tag):
    #return all the required info about a repository
    a_tags=h3_tag.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url = base_url+a_tags[1]['href']
    stars=parse_star_count(star_tag.text.strip())
    return username, repo_name,stars,repo_url
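
To make the expected tag structure concrete, here is a sketch that runs the same extraction steps on a made-up repository card. The markup is a simplified assumption about what a topic page contains (an `h3` with two links, owner first and repository second, plus a star counter `span`); the owner and repository names are invented.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating one repository card; the real page carries more attributes.
card_html = """
<h3 class="f3 color-fg-muted text-normal lh-condensed">
  <a href="/octo-org">octo-org</a> /
  <a href="/octo-org/demo-repo">demo-repo</a>
</h3>
<span class="Counter js-social-count">1.5k</span>
"""
card = BeautifulSoup(card_html, "html.parser")
h3_tag = card.find("h3")
star_tag = card.find("span", {"class": "Counter js-social-count"})

a_tags = h3_tag.find_all("a")
username = a_tags[0].text.strip()    # first link: the owner
repo_name = a_tags[1].text.strip()   # second link: the repository
repo_url = "https://github.com" + a_tags[1]["href"]
stars_text = star_tag.text.strip()   # raw text, still to be parsed into a number

print(username, repo_name, stars_text, repo_url)
# octo-org demo-repo 1.5k https://github.com/octo-org/demo-repo
```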

4. `get_topic_repos(topic_doc)`: This function extracts information about repositories related to a specific topic.

- Finds and stores H3 tags containing repository titles, URLs, and usernames.
- Finds star count tags for the repositories.
- Creates a dictionary (topic_repos_dict) to store username, repository name, stars, and repository URL.
- Iterates through the found tags to extract repository information and populates topic_repos_dict.
- Returns the information as a Pandas DataFrame.

In [None]:
def get_topic_repos(topic_doc):
    #Get the h3 tags containing repo title, repo URL and username
    repo_tags=topic_doc.find_all('h3',{'class':"f3 color-fg-muted text-normal lh-condensed"})
    #Get star tags
    star_tags=topic_doc.find_all('span',{'class':"Counter js-social-count"})
    #get repo info
    topic_repos_dict={
        'username':[],
        'repo_name':[],
        'stars':[],
        'repo_url':[]
        }
    
    # Pair each repository heading with its star counter
    for repo_tag,star_tag in zip(repo_tags,star_tags):
        repo_info=get_repo_info(repo_tag,star_tag)
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)

5. `scrape_topic(topic_url, path)`: This function is used to scrape and save topic-related repository data.

- Checks if a file at the specified path already exists. If it does, it skips further processing.
- Calls get_topic_repos to retrieve repository information from the topic page.
- Writes the information to a CSV file at the specified path.

In [None]:
def scrape_topic(topic_url,path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df=get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path,index=False)
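
To show what ends up in each file, the sketch below builds a one-row DataFrame with the same four columns and renders it as CSV text. The row is invented sample data, not a scraped result. Calling `to_csv` without a path returns the text instead of writing a file.

```python
import pandas as pd

# Invented sample row with the same columns that get_topic_repos produces.
sample_df = pd.DataFrame({
    "username": ["octo-org"],
    "repo_name": ["demo-repo"],
    "stars": [1500],
    "repo_url": ["https://github.com/octo-org/demo-repo"],
})

# With no path argument, to_csv returns the CSV content as a string.
csv_text = sample_df.to_csv(index=False)
csv_lines = csv_text.splitlines()
print(csv_lines[0])  # username,repo_name,stars,repo_url
print(csv_lines[1])  # octo-org,demo-repo,1500,https://github.com/octo-org/demo-repo
```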

The `scrape_topics_repos()` function automates the whole workflow. It first collects the topic data into a DataFrame. Then it iterates through the topics, scraping the top repositories for each one and saving them as a CSV file in a `data` directory.

In [None]:
def scrape_topics_repos():
    print("Scraping list of topics")
    topics_df=scrape_topics()
    
    os.makedirs('data',exist_ok=True)
    for index,row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['titles']))
        scrape_topic(row['url'],'data/{}.csv'.format(row['titles']))

In [None]:
scrape_topics_repos()
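
The file name for each topic is built directly from the topic title. If a title ever contained a path separator or other awkward characters, the write could fail or land in an unexpected place. As a hedged sketch, a small sanitizer could be applied to the title first; `safe_filename` is a hypothetical helper, and the set of allowed characters is an arbitrary choice.

```python
import re


def safe_filename(title):
    """Hypothetical helper: turn a topic title into a filesystem-friendly name."""
    # Replace each run of characters other than letters, digits, '.', '_', '+', '-' with '_'.
    cleaned = re.sub(r"[^A-Za-z0-9._+-]+", "_", title.strip())
    # Fall back to a fixed name if nothing usable is left.
    return cleaned.strip("_") or "topic"


print(safe_filename("3D"))                     # 3D
print(safe_filename(" Amazon Web Services "))  # Amazon_Web_Services
print(safe_filename("a/b"))                    # a_b
```

The path passed to `scrape_topic` could then be built as `'data/{}.csv'.format(safe_filename(row['titles']))`.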

#### Summary:

This project involves creating a program to collect information about topics and top repositories on GitHub. Here's what the code does in simple terms:

1. It starts by gathering a list of GitHub topics, including their titles, descriptions, and URLs.

2. It then goes on to scrape data about the top repositories for each of these topics. The data includes the username, repository name, number of stars, and repository URL.

3. The scraped data is organized and saved in CSV files. Each CSV file contains information about a specific topic's top repositories.

4. The program automates the entire process, making it easier to collect and store GitHub data for analysis and further use.

In essence, it's a tool to fetch and save information about GitHub topics and their popular repositories.