<a href="https://colab.research.google.com/github/Unseen-Elder/Web_Scraping_of_Github_Topics/blob/main/main/Web_scraping_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scraping Top Repositories for Topics on GitHub




Web Scraping:
- Web scraping is the process of automatically collecting information from websites. It involves using software or programming tools to extract data from web pages, which can then be analyzed, stored, or used in other applications.

- Web scraping can be used for a variety of purposes, such as market research, price monitoring, content aggregation, and more. However, it's important to note that not all websites allow web scraping, and some may require permission or have legal restrictions on the use of their data.

GitHub:
- GitHub is like a virtual filing cabinet where programmers can store and manage their code. It provides a centralized location for developers to collaborate on code, share their work with others, and track changes made by different contributors. GitHub also offers tools for managing issues, tracking bugs, and organizing code repositories.

- GitHub is widely used in the software development community and is an essential tool for open source projects. It lets developers work together on projects, contribute code to other projects, and collaborate with others around the world. GitHub also hosts code: anyone can browse and download the code in public repositories for free, which makes it a valuable resource for learning and education.



The tools we will be using are:
- **Python**
- **Selenium**
- **BeautifulSoup** 
- **Pandas**



## Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description
- For each topic, we'll get the top 40 repositories from the topic page
- For each repository, we'll grab the repo name, username, stars, and repo URL
- For each topic we'll create a CSV file in the following format:

| Repo Name | Username | Stars | Repo URL                            |
|-----------|----------|-------|-------------------------------------|
| three.js  | mrdoob   | 69.7k | https://github.com/mrdoob/three.js  |
| libgdx    | libgdx   | 18.3k | https://github.com/libgdx/libgdx    |



## 1. Downloading important libraries like Selenium and BeautifulSoup

In [None]:
%%shell
# Ubuntu no longer distributes chromium-browser outside of snap
# Proposed solution: https://askubuntu.com/questions/1204571/how-to-install-chromium-without-snap

# Add debian buster
cat > /etc/apt/sources.list.d/debian.list <<'EOF'
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
EOF

# Add keys
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A

apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg

# Prefer debian repo for chromium* packages only
# Note the double-blank lines between entries
cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
Package: *
Pin: release a=eoan
Pin-Priority: 500


Package: *
Pin: origin "deb.debian.org"
Pin-Priority: 300


Package: chromium*
Pin: origin "deb.debian.org"
Pin-Priority: 700
EOF

# Install chromium and chromium-driver
apt-get update
apt-get install -y chromium-browser --quiet
apt-get install -y chromium chromium-driver --quiet

# Install selenium
pip install selenium --quiet

# Install Beautifulsoup
pip install beautifulsoup4 --quiet

Executing: /tmp/apt-key-gpghome.d6yB2qBIKw/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
gpg: key DCC9EFBF77E11517: public key "Debian Stable Release Key (10/buster) <debian-release@lists.debian.org>" imported
gpg: Total number processed: 1
gpg:               imported: 1
Executing: /tmp/apt-key-gpghome.oHlBxTHVOg/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
gpg: key DC30D7C23CBBABEE: public key "Debian Archive Automatic Signing Key (10/buster) <ftpmaster@debian.org>" imported
gpg: Total number processed: 1
gpg:               imported: 1
Executing: /tmp/apt-key-gpghome.0HE6kIcLu3/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A
gpg: key 4DFAB270CAA96DFA: public key "Debian Security Archive Automatic Signing Key (10/buster) <ftpmaster@debian.org>" imported
gpg: Total number processed: 1
gpg:               imported: 1
Get:1 http://deb.debian.org/debian buster InRelease [122 kB]
Get:2 http://security.ubuntu.com/ubuntu



## 2. Using Selenium to load the full page and BeautifulSoup to extract information

In [None]:
# Importing Essential Libraries

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import pandas as pd
import os

In [None]:
# Setting up Webdriver for Google Colab

options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--disable-gpu')
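options.add_argument('--disable-dev-shm-usage')

# Selenium 4 takes the driver path through a Service object;
# the executable_path keyword of webdriver.Chrome is deprecated and has been removed in newer releases
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(executable_path='/usr/bin/chromedriver'), options=options)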




## 3. Writing Helper Functions

In [None]:
# function to create CSV file 

def CSV_Convertor(name,input_dict):
  CSV_df=pd.DataFrame(input_dict)
  CSV_df.to_csv(name+'.csv',index=False)
  return None
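
To make the expected input shape concrete, here is a minimal sketch of the same conversion done with Python's built-in csv module instead of pandas, so it runs without extra dependencies. The helper name `columns_to_csv_text` is hypothetical and not part of this notebook; the sample values are the two example repositories from the table at the top, with the star counts written as integers.

```python
import csv
import io

def columns_to_csv_text(columns):
    """Turn a dict of equal-length column lists into CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    # The dict keys become the header row.
    writer.writerow(columns.keys())
    # zip(*...) transposes the column lists into one tuple per row.
    writer.writerows(zip(*columns.values()))
    return buffer.getvalue()

sample_columns = {
    'username': ['mrdoob', 'libgdx'],
    'repo_name': ['three.js', 'libgdx'],
    'stars_count': [69700, 18300],
    'repo_url': ['https://github.com/mrdoob/three.js',
                 'https://github.com/libgdx/libgdx'],
}

csv_text = columns_to_csv_text(sample_columns)
print(csv_text)
```

The dictionaries returned by the scraping functions in this notebook have this same shape: one key per column, each holding a list with one entry per row.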

In [None]:
# function to press the "Load more" button up to `times` times

def press_button(driver, times):

  for _ in range(times):
    try:
      # wait until the "Load more" button is clickable, then click it
      load_more_button = WebDriverWait(driver, 10).until(
          EC.element_to_be_clickable((By.CLASS_NAME, 'ajax-pagination-btn')))
      load_more_button.click()
      # give the newly requested items time to load
      time.sleep(5)
    except Exception:
      # no clickable button (for example, everything is already loaded): stop trying
      break

  return None

In [None]:
# converting star_counts into an integer

def parse_star_count(stars_str):
  if stars_str[-1]=='k':
    return int(float(stars_str[:-1])*1000)
  
  return int(stars_str)
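
As a quick check of the conversion, the sketch below shows a slightly more defensive variant. The name `parse_star_count_safe` is hypothetical, and the extra input forms it accepts (surrounding whitespace, an uppercase 'K', thousands separators) are assumptions about what a page might show, not something observed in this notebook.

```python
def parse_star_count_safe(stars_str):
    """Convert strings such as '69.7k' or '1,234' to an integer (hypothetical variant)."""
    text = stars_str.strip().lower().replace(',', '')
    if text.endswith('k'):
        # round() guards against a floating-point product landing just below
        # the intended integer, which int() would truncate downwards.
        return round(float(text[:-1]) * 1000)
    return int(text)

for raw in ['69.7k', '18.3k', '987', '1,234', ' 2K ']:
    print(repr(raw), '->', parse_star_count_safe(raw))
```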

In [None]:
# function to get all topics and its corresponding info present on github/topics and save it into a dictionary

def get_topics():
  
  driver.get('https://github.com/topics')

  press_button(driver,5)

  topic_name,topic_desc,topic_url=get_topic_info(BeautifulSoup(driver.page_source, 
                                                          'html.parser'))
  
  topic_dict={
      'topic_name':topic_name,
      'topic_description':topic_desc,
      'topic_url':topic_url
  }

  return topic_dict

In [None]:
# function to retrieve repo info from the parsed html page and return it to 'get_topic_repos'

def get_repo_info(html_page):
  
  h3_class='f3 color-fg-muted text-normal lh-condensed'
  star_class='Counter js-social-count'
  repo_info=html_page.find_all('h3',{'class':h3_class})
  star_info=html_page.find_all('span',{'class':star_class})

  username=[]
  repo_name=[]
  repo_url=[]
  star_count=[]

  for i in range(len(repo_info)):
    a_tags=repo_info[i].find_all('a')
    username.append(a_tags[0].text.strip())
    repo_name.append(a_tags[1].text.strip())
    repo_url.append('https://github.com' + a_tags[1]['href'])
    star_count.append(parse_star_count(star_info[i].text.strip()))
    
  return username,repo_name,star_count,repo_url

In [None]:
# function to get each topic webpage and save its info as a dictionary

def get_topic_repos(topic_url):

  driver.get(topic_url)

  press_button(driver,1)

  username,repo_name,star_count,repo_url=get_repo_info(BeautifulSoup(driver.page_source, 
                                                          'html.parser'))
  info_dict={
      'username':username,
      'repo_name':repo_name,
      'stars_count':star_count,
      'repo_url':repo_url
  }

  return info_dict

In [None]:
# function to retrieve topic info from the parsed html page and return it to 'get_topics'

def get_topic_info(html_page):

  title_class='f3 lh-condensed mb-0 mt-1 Link--primary'
  desc_class='f5 color-fg-muted mb-0 mt-1'
  url_class='no-underline flex-1 d-flex flex-column'

  title_info=html_page.find_all('p',{'class':title_class})
  desc_info=html_page.find_all('p',{'class':desc_class})
  url_info=html_page.find_all('a',{'class':url_class})


  topic_name=[]
  topic_desc=[]
  topic_url=[]

  for i in range(len(title_info)):
    topic_name.append(title_info[i].text.strip())
    topic_desc.append(desc_info[i].text.strip())
    topic_url.append('https://github.com'+url_info[i]['href'])

  return topic_name,topic_desc,topic_url


In [None]:
# function that generates all the info: the topics CSV, the per-topic CSVs and the merged CSV

def generate_info():
  folder1_name = 'topic_info'
  folder2_name = 'repo_info'
  folder1_path = '/content/' + folder1_name
  folder2_path=folder1_path + '/' + folder2_name

  if not os.path.exists(folder1_path):
    os.makedirs(folder1_path)

  CSV_Convertor('topics',get_topics())
  os.replace('/content/topics.csv', folder1_path+'/'+'topics.csv')

  CSV_Convertor('merged_file',repo_generator())
  os.replace('/content/merged_file.csv', folder1_path+'/'+'merged_file.csv')

  return None
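
The write-then-move pattern above can also be replaced by building the final path first and writing straight into it. The following is only a sketch under assumptions: `topic_csv_path` is a hypothetical helper, a temporary directory stands in for '/content', and replacing path separators in topic names is an added precaution rather than something this notebook does.

```python
import os
import tempfile

def topic_csv_path(base_dir, topic_name):
    """Return the CSV path for a topic, creating the folders if needed (hypothetical helper)."""
    repo_dir = os.path.join(base_dir, 'topic_info', 'repo_info')
    # exist_ok=True makes the call safe when the folder already exists.
    os.makedirs(repo_dir, exist_ok=True)
    # Replace path separators so a topic name cannot point outside the folder.
    safe_name = topic_name.replace('/', '_').replace(os.sep, '_')
    return os.path.join(repo_dir, safe_name + '.csv')

base_dir = tempfile.mkdtemp()  # stand-in for '/content' on Colab
path = topic_csv_path(base_dir, 'C++')

# Write directly to the final location; no os.replace() step is needed.
with open(path, 'w') as f:
    f.write('username,repo_name,stars_count,repo_url\n')

print(path)
```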

In [None]:
# function to make a merged dictionary and save different topics in given folders

def repo_generator():
  
  dict_topic=[]
  dict_username=[]
  dict_repo_name=[]
  dict_star_count=[]
  dict_repo_url=[]
  
  if not os.path.exists('/content/topic_info/repo_info/'):
    os.makedirs('/content/topic_info/repo_info/')
  
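  # scrape the topics page once and reuse the result for both columns
  topics=get_topics()

  for topic_name, topic_url in zip(topics['topic_name'], topics['topic_url']):
    # scrape each topic page once and reuse the result for the CSV and the merged lists
    d=get_topic_repos(topic_url)
    CSV_Convertor(topic_name,d)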
    os.replace('/content/'+topic_name+'.csv', '/content/topic_info/repo_info/'+topic_name+'.csv')
    for i in range(len(d['username'])):
      dict_topic.append(topic_name)
      dict_username.append(d['username'][i])
      dict_repo_name.append(d['repo_name'][i])
      dict_star_count.append(d['stars_count'][i])
      dict_repo_url.append(d['repo_url'][i])

  merged_dict={
      'topic':dict_topic,
      'username':dict_username,
      'repo_name':dict_repo_name,
      'star_count':dict_star_count,
      'repo_url':dict_repo_url
  }

  return merged_dict
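
The merging step can be looked at in isolation, without any scraping. Below is a minimal sketch of the same flattening on hand-written data. The function name `merge_topic_repos`, the topic names and the repository values are all made up for illustration.

```python
def merge_topic_repos(repos_by_topic):
    """Flatten {topic: column dict} into one column dict with an extra 'topic' column."""
    merged = {'topic': [], 'username': [], 'repo_name': [],
              'star_count': [], 'repo_url': []}
    for topic, cols in repos_by_topic.items():
        # Repeat the topic name once per repository so all columns stay aligned.
        merged['topic'].extend([topic] * len(cols['username']))
        merged['username'].extend(cols['username'])
        merged['repo_name'].extend(cols['repo_name'])
        # The per-topic dict uses 'stars_count'; the merged dict uses 'star_count'.
        merged['star_count'].extend(cols['stars_count'])
        merged['repo_url'].extend(cols['repo_url'])
    return merged

# Illustrative input only; these are not real scraping results.
example = {
    'topic-a': {'username': ['alice', 'bob'], 'repo_name': ['r1', 'r2'],
                'stars_count': [1200, 300],
                'repo_url': ['https://github.com/alice/r1', 'https://github.com/bob/r2']},
    'topic-b': {'username': ['carol'], 'repo_name': ['r3'],
                'stars_count': [50],
                'repo_url': ['https://github.com/carol/r3']},
}

merged = merge_topic_repos(example)
print(merged['topic'])
```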

## 4. Just Click The Button

In [None]:
generate_info()

## 5. Closing the driver and its associated window

In [None]:
driver.quit()

## 6. Common Problems and Rough Notes

1. **Error message: "Unknown error: no Chrome binary found at C:\Program Files\Google\Chrome\Application\chrome.exe."**

- Solution: You cannot simply call "driver = webdriver.Chrome()" because we are using Google Colab instead of a local IDE, and the Colab instance has no Chrome installation to start. To resolve this issue, install the Chromium browser and its corresponding webdriver on the Colab instance by executing the following commands:

  apt-get update  
  apt-get install -y chromium-browser --quiet  
  apt-get install -y chromium chromium-driver --quiet
___

2. **Error message : HTTPConnectionPool(host='localhost', port=46371): Max retries exceeded with url: /session/efd08405139d53f9d9d5731b446f7b65/url (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8be84e7640>: Failed to establish a new connection: [Errno 111] Connection refused'))**
- Solution: This error means that the client could not connect to the driver process on the given host and port. It can occur if you try to open a URL through the webdriver before the driver was set up correctly with webdriver.Chrome(), or if you call the driver after driver.quit() has already been run. To resolve this issue, create the webdriver before using it, and create a new one if you need it again after quitting.
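
One way to avoid calling a driver that has already quit is to tie its lifetime to a `with` block. The sketch below is an illustration, not this notebook's method: `managed_driver` is a hypothetical helper, and `FakeDriver` is a stand-in class so the example runs without a browser.

```python
from contextlib import contextmanager

@contextmanager
def managed_driver(make_driver):
    """Create a driver, hand it to the with-block, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()

class FakeDriver:
    """Stand-in for a Selenium driver; it only records calls."""
    def __init__(self):
        self.closed = False
        self.visited = []

    def get(self, url):
        if self.closed:
            raise RuntimeError('driver has already quit')
        self.visited.append(url)

    def quit(self):
        self.closed = True

with managed_driver(FakeDriver) as fake:
    fake.get('https://github.com/topics')

print(fake.closed)
```

With Selenium, the factory passed in would be a function that returns webdriver.Chrome(...) configured as in section 2.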
___

3. **Error message: WebDriverException: unknown error: cannot find Chrome binary error with Selenium in Python for older versions of Google Chrome**
- Solution: This error means that Selenium could not find a Chrome binary that works with the installed driver. The pre-installed Chromium browser at /usr/lib/chromium-browser is version 90.0, but our Chromium driver needs Chromium version 111 or higher. To resolve this issue, install a Chromium browser that is compatible with the current driver version (see the installation cell in section 1) and pass the driver path '/usr/bin/chromedriver' when creating the webdriver. The driver can then start the newly installed browser, and the error should be resolved.
___

4. **Chrome options used with webdriver.Chrome() and their purpose**
- **options.add_argument('--no-sandbox')**: The sandbox is a Chrome security feature that isolates web page rendering and execution in a separate process, so that malicious pages cannot reach system files and other sensitive data. This flag turns the sandbox off. It is needed here because Colab runs the notebook as root, and Chrome refuses to start as root while the sandbox is enabled.
- **options.add_argument('--headless')**: Runs the Chrome browser in headless mode. In headless mode the browser has no graphical user interface: it works entirely in the background without displaying any windows. Colab has no display, so this is required here, and in automated runs it also reduces the resources the browser needs.
- **options.add_argument('--disable-gpu')**: Chrome will not use the GPU for rendering web pages. This is useful when there are compatibility issues between the browser and the GPU, or when the system has no dedicated GPU.
- **options.add_argument('--disable-dev-shm-usage')**: Chrome normally keeps its shared memory files in /dev/shm. In containers and virtual machines this partition is often too small, which can make the browser crash. With this flag Chrome writes those files to /tmp instead. Note the spelling: a misspelled flag such as '--disable-dve-shm-uage' is silently ignored.
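
Keeping the flags in one list makes typos easier to spot and the set easier to reuse. This is a small sketch under assumptions: `CHROME_FLAGS` and `apply_flags` are hypothetical names, the last flag uses the spelling `--disable-dev-shm-usage` that Chromium recognizes, and `FakeOptions` is a stand-in so the example runs without Selenium installed.

```python
# Flags discussed above, collected in one place (hypothetical constant).
CHROME_FLAGS = [
    '--no-sandbox',
    '--headless',
    '--disable-gpu',
    '--disable-dev-shm-usage',
]

def apply_flags(options, flags=CHROME_FLAGS):
    """Add every flag to an options object that offers add_argument()."""
    for flag in flags:
        options.add_argument(flag)
    return options

class FakeOptions:
    """Stand-in for webdriver.ChromeOptions; it only records the flags."""
    def __init__(self):
        self.arguments = []

    def add_argument(self, flag):
        self.arguments.append(flag)

opts = apply_flags(FakeOptions())
print(opts.arguments)
```

In the notebook itself, the object passed to `apply_flags` would be webdriver.ChromeOptions().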
___
5. **Error message: TimeoutException: Message: Stacktrace:#0**  
- Solution: A TimeoutException is raised when the wait does not find what it is looking for in time. This can happen for two reasons. First, the topic_url passed to the 'get_topic_repos' function may not exist. For example, the URL of the C++ topic cannot be generated from the base_url and topic_name. Second, the "load more" button may be missing. For instance, 'WordPlate' has only 12 repositories, so its page has no load more button.
To resolve this issue, wrap the wait-and-click code inside the loop in a try-except block. If the exception is raised for either of the reasons above, it is caught and the program carries on instead of terminating, so the data that did load can still be retrieved.
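
The control flow of that try-except loop can be sketched without a browser. Everything here is a stand-in: `click_load_more` and `make_finder` are hypothetical names, and Python's built-in TimeoutError plays the role of Selenium's TimeoutException.

```python
def click_load_more(find_button, times):
    """Click up to `times` times; stop early once no button can be found."""
    clicks = 0
    for _ in range(times):
        try:
            button = find_button()
        except TimeoutError:  # stand-in for Selenium's TimeoutException
            break
        button()
        clicks += 1
    return clicks

def make_finder(available_clicks):
    """Simulate a page whose button disappears after a number of clicks."""
    state = {'left': available_clicks}

    def find_button():
        if state['left'] == 0:
            raise TimeoutError('no load more button on the page')

        def click():
            state['left'] -= 1
        return click

    return find_button

# A page with no button at all, like a topic with only a few repositories.
print(click_load_more(make_finder(0), 5))
# A page whose button disappears after two clicks.
print(click_load_more(make_finder(2), 5))
```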
___
6. **Error: Not able to get all repositories even though the load more button was pressed**  
- Solution: This happens when the page is read before the newly requested repositories have finished loading. The click on "load more" returns immediately, but the new items only arrive a moment later. To resolve this issue, pause with time.sleep() after each click so the page has enough time to load the data before it is read from driver.page_source. This way all the repositories are present when we retrieve them.
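
A fixed pause is simple but either wastes time or is too short on a slow connection. An alternative, sketched here as an assumption rather than as this notebook's method, is to poll until the number of loaded items stops changing. `wait_until_stable` and `count_items` are hypothetical names; with Selenium, `count_items` could count the repository headings in the current page.

```python
import time

def wait_until_stable(count_items, timeout=10.0, interval=0.5):
    """Poll count_items() until two successive readings match or the timeout passes."""
    deadline = time.monotonic() + timeout
    previous = count_items()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = count_items()
        if current == previous:
            return current
        previous = current
    return previous

# Simulated readings: the page grows from 20 to 40 items and then stops.
readings = iter([20, 30, 40, 40])
result = wait_until_stable(lambda: next(readings), timeout=1.0, interval=0.01)
print(result)
```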
___