
### Project: Scraping Top Repositories for Topics on GitHub

Web scraping is the process of extracting content and data from a website. Unlike screen scraping, which only copies the pixels displayed on screen, web scraping extracts the underlying HTML and, with it, the data embedded in that HTML, which often originates from a database.


When the scraping code runs, it sends a request to the URL we specified. The server responds with the HTML or XML of the page. The code then parses that response, finds the required data, and extracts it.

Steps to extract data using web scraping with Python

1. Find the URL that you want to scrape
2. Inspect the page
3. Find the data you want to extract
4. Write the code
5. Run the code and extract the data
6. Store the data in the required format 
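
The steps above can be sketched on a small inline HTML snippet, which avoids any network access. The markup, the class names `topic-title` and `topic-desc`, and the helper `extract_topics` are hypothetical and only illustrate the find, extract, and store pattern; the real GitHub page uses different classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
SAMPLE_HTML = """
<div class="topic">
  <a href="/topics/3d">
    <p class="topic-title">3D</p>
    <p class="topic-desc"> 3D modeling uses software to create shapes. </p>
  </a>
</div>
<div class="topic">
  <a href="/topics/ajax">
    <p class="topic-title">Ajax</p>
    <p class="topic-desc"> Ajax is a technique for creating interactive web applications. </p>
  </a>
</div>
"""

def extract_topics(html):
    """Parse the HTML and return one dict per topic block."""
    soup = BeautifulSoup(html, 'html.parser')
    topics = []
    for block in soup.find_all('div', class_='topic'):
        topics.append({
            'title': block.find('p', class_='topic-title').text.strip(),
            'description': block.find('p', class_='topic-desc').text.strip(),
            'url': 'https://github.com' + block.find('a')['href'],
        })
    return topics

sample_topics = extract_topics(SAMPLE_HTML)
```

Each dict in `sample_topics` holds the title, description, and absolute URL of one topic block, which is the same shape the notebook later builds column by column.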

Importing libraries

In [None]:
# Web scraping packages
from bs4 import BeautifulSoup
import requests

### To Scrape GitHub Topic Page

#### To download the webpage

In [None]:
topic_url = 'https://github.com/topics'

In [None]:
response = requests.get(topic_url)   #to download the webpage

In [None]:
# check that the request was successful (200 is the HTTP status code for success)
response.status_code
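
Instead of inspecting the status code by hand, the check can be folded into a small helper. This is a minimal sketch, not the notebook's own method: the name `fetch_html` and the `http` parameter (any object with a `requests`-style `get`, such as a `requests.Session`) are assumptions made for illustration, and the 10-second timeout is an arbitrary choice.

```python
import requests

def fetch_html(url, http=requests, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = http.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

A call such as `fetch_html('https://github.com/topics')` would then either return the page text or raise. The timeout keeps the call from waiting indefinitely on a server that never answers.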

In [None]:
#content of the webpage
page_contents = response.text

In [None]:
len(page_contents)

In [None]:
# print the first 1000 characters
page_contents[:1000]

In [None]:
# save the HTML above to a file (explicit encoding avoids platform-dependent defaults)
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

#### Use Beautiful Soup to parse and extract information

In [None]:
doc = BeautifulSoup(page_contents,'html.parser')

In [None]:
#to check the type
type(doc)

#### To get the topic title

In [None]:
topic_title_tags = doc.find_all('p',{'class':"f3 lh-condensed mb-0 mt-1 Link--primary"})

In [None]:
len(topic_title_tags)    

In [None]:
# top 5 topic title
topic_title_tags[:5]

In [None]:
topic_title_tags[0].text
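
One caveat with the lookup above: when the `class` value passed to `find_all` contains several space-separated names, Beautiful Soup compares it against the tag's full `class` attribute string, so the match depends on the exact names and their order. The following is a minimal sketch of a looser alternative that uses a CSS selector on a subset of the classes. The inline HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: same classes, different order in the second tag
html = '''
<p class="f3 lh-condensed">3D</p>
<p class="lh-condensed f3">Ajax</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# Multi-class string: compared against the whole attribute value
exact = soup.find_all('p', {'class': 'f3 lh-condensed'})

# CSS selector: matches any <p> carrying both classes, in any order
loose = soup.select('p.f3.lh-condensed')

exact_titles = [tag.text for tag in exact]
loose_titles = [tag.text for tag in loose]
```

The selector form may survive small changes to the page's class list, although any scraper tied to presentation classes can still break when the site is redesigned.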

#### To get the topic description

In [None]:
topic_desc_tags = doc.find_all('p',{'class':'f5 color-fg-muted mb-0 mt-1'})

In [None]:
len(topic_desc_tags)

In [None]:
#first 5 topic description
topic_desc_tags[:5]

In [None]:
topic_desc_tags[0].text.strip()           #strip() - remove whitespaces

#### To find the topic url

In [None]:
topic_title_tag0 = topic_title_tags[0]
topic_title_tag0

In [None]:
#to check the parent of the p tag
topic_title_tag0.parent

In [None]:
#from above we got the class of topic url
topic_link_tags = doc.find_all('a',{'class':"no-underline flex-1 d-flex flex-column"})

In [None]:
len(topic_link_tags)

In [None]:
#url of the first topic
base_url = 'https://github.com'
topic0_url = base_url + topic_link_tags[0]['href']
print(topic0_url)
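
String concatenation works here because every `href` starts with a slash. As a hedged alternative, the standard library's `urllib.parse.urljoin` also handles a trailing slash on the base and an `href` that is already absolute. The path `/topics/3d` is used only as an example value.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# urljoin copes with a leading slash on the path and a trailing slash on the base
relative = urljoin(base_url, '/topics/3d')
trailing = urljoin(base_url + '/', '/topics/3d')
# an href that is already absolute is returned unchanged
absolute = urljoin(base_url, 'https://github.com/topics/3d')
```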

In [None]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
    
print(topic_titles)    

In [None]:
topic_descriptions = []

for tag in topic_desc_tags:
    topic_descriptions.append(tag.text.strip())

print(topic_descriptions)

In [None]:
topic_descriptions[0]

In [None]:
topic_url = []

for tag in topic_link_tags:
    topic_url.append(base_url + tag['href'])

print(topic_url)

In [None]:
topic_url[0]

#### To create a csv file

In [None]:
import pandas as pd

In [None]:
topics_dict = {
    'title' : topic_titles,
    'description' : topic_descriptions,
    'url' : topic_url
}

In [None]:
# convert the dictionary into a DataFrame
topics_df = pd.DataFrame(topics_dict)

In [None]:
topics_df.head()

In [None]:
# To create CSV file with the extracted information
topics_df.to_csv('topics.csv', index=False)

### Getting information out of a topic page

In [None]:
topic_page_url = topic_url[0]

In [None]:
topic_page_url

In [None]:
response = requests.get(topic_page_url)

In [None]:
#status code check
response.status_code

In [None]:
len(response.text)

In [None]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

#### To get the repo user name

In [None]:
repo_tags = topic_doc.find_all('h3' , {'class': 'f3 color-fg-muted text-normal lh-condensed'}) 

In [None]:
len(repo_tags)

In [None]:
# each h3 tag contains information about one repo
repo_tags[0]

In [None]:
# to get the username, which is in the first 'a' tag
a_tag = repo_tags[0].find_all('a')
a_tag

In [None]:
a_tag[0]

In [None]:
# username of the repo owner
a_tag[0].text.strip()

In [None]:
#to get the name of the repo
a_tag[1].text.strip()

#### To get the repo url

In [None]:
base_url = 'https://github.com'

In [None]:
a_tag[1]['href']

In [None]:
repo_url = base_url +  a_tag[1]['href']
print(repo_url)

#### To get the total number of stars achieved

In [None]:
star_tags = topic_doc.find_all('span',{'class': 'Counter js-social-count'})

In [None]:
len(star_tags)

In [None]:
star_tags[0].text

In [None]:
#function to convert it into a number

def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [None]:
parse_star_count(star_tags[0].text.strip())
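
The function above handles a trailing `k` and plain integers. A slightly more defensive variant is sketched below under a few assumptions: the name `parse_star_count_safe` is hypothetical, the `m` suffix and comma separators are guesses about formats the page might use, and `round()` replaces `int()` so that a product landing just below a whole number in floating point is not truncated.

```python
def parse_star_count_safe(stars_str):
    """Convert strings such as '1.2k' or '1,234' to an integer."""
    multipliers = {'k': 1_000, 'm': 1_000_000}  # 'm' is an assumed suffix
    text = stars_str.strip().lower().replace(',', '')
    if not text:
        raise ValueError('empty star count')
    if text[-1] in multipliers:
        # round() avoids truncating values that float arithmetic lands just below
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)
```

For example, `parse_star_count_safe('12.5k')` gives `12500` and `parse_star_count_safe('1,234')` gives `1234`.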

#### To get all the info about a repository

In [None]:
def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    # (the repo heading is an h3 tag, as found above)
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [None]:
get_repo_info(repo_tags[0],star_tags[0])

In [None]:
topic_repos_dict = {
    'username' : [],
    'repo_name' : [],
    'stars' : [],
    'repo_url' : []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
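
The loop above appends each field by position. The same row-to-column step can be sketched as a small helper that also checks row length. The name `rows_to_columns` and the sample rows are made up for illustration and are not scraped data.

```python
def rows_to_columns(rows, column_names):
    """Turn a list of equal-length tuples into a dict of column lists."""
    columns = {name: [] for name in column_names}
    for row in rows:
        if len(row) != len(column_names):
            raise ValueError(
                'row has {} values, expected {}'.format(len(row), len(column_names))
            )
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns

# Made-up rows in the shape returned by get_repo_info
sample_rows = [
    ('alice', 'demo-repo', 1200, 'https://github.com/alice/demo-repo'),
    ('bob', 'other-repo', 987, 'https://github.com/bob/other-repo'),
]
sample_columns = rows_to_columns(
    sample_rows, ['username', 'repo_name', 'stars', 'repo_url']
)
```

The resulting dict can be passed to `pd.DataFrame` in the same way as `topic_repos_dict`.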

In [None]:
topic_repos_dict

In [None]:
# convert the extracted data into a dataframe
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [None]:
topic_repos_df.head()

The steps above cover the first topic (here it is '3d'). The functions below generalize them to all the topics.

In [None]:
import os

def get_topic_page(topic_url):
    # download the page
    response = requests.get(topic_url)

    # check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc
    
def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    
    #get h3 tags containing repo title, repo url, username
    repo_tags = topic_doc.find_all('h3' , {'class': 'f3 color-fg-muted text-normal lh-condensed'}) 
    #get star tags
    star_tags = topic_doc.find_all('span',{'class': 'Counter js-social-count'})
    
    topic_repos_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : []
    }
    
    #get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    
    #convert it into a dataframe
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, topic_name):
    file_name = topic_name + '.csv'
    if os.path.exists(file_name):
        print('The file {} already exists. Skipping...'.format(file_name))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))

    topic_df.to_csv(file_name, index=False)
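
Because the file name is built directly from the topic title, a title containing spaces, slashes, or other special characters could produce an awkward or invalid path. Below is a minimal sketch of a sanitizing helper. The name `safe_file_name` and the allowed character set are assumptions, not part of the original notebook.

```python
import re

def safe_file_name(topic_name, extension='.csv'):
    """Replace characters outside letters, digits, '.', '_' and '-' with '_'."""
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', topic_name.strip()).strip('_')
    # Fall back to a fixed name if nothing usable is left
    return (cleaned or 'topic') + extension
```

For example, `safe_file_name('Amazon Web Services')` yields `Amazon_Web_Services.csv`. Note that this scheme maps distinct titles such as `C` and `C++` to the same file name, so a real script may need a different rule for those cases.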

Test the created functions

In [None]:
url4=topic_url[4]
url4

In [None]:
topic4_doc=get_topic_page(url4)

In [None]:
topic4_repos = get_topic_repos(topic4_doc)

In [None]:
topic4_repos.head()

Let's do it in a single line of code

In [None]:
get_topic_repos(get_topic_page(topic_url[4]))

In [None]:
# check another topic and save it to a CSV file
get_topic_repos(get_topic_page(topic_url[5])).to_csv('angular.csv',index=None)

Putting it all together:
1. Get the list of topics from the topic page
2. Get the list of top repos from the individual topic page


In [None]:
doc = BeautifulSoup(page_contents,'html.parser')

1. Get the list of topics from the topic page

In [None]:
def get_topic_title(doc):
    topic_title_tags = doc.find_all('p',{'class':"f3 lh-condensed mb-0 mt-1 Link--primary"})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles
    
    
def get_topic_description(doc):
    topic_desc_tags = doc.find_all('p',{'class':'f5 color-fg-muted mb-0 mt-1'})   
    topic_description = []
    for tag in topic_desc_tags:
        topic_description.append(tag.text.strip())
    return topic_description

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a',{'class':"no-underline flex-1 d-flex flex-column"})
    topic_url = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_url.append(base_url + tag['href'])
    return topic_url


def scrape_topics():
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # parse the freshly downloaded page instead of relying on a global document
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title' : get_topic_title(doc),
        'description' : get_topic_description(doc),
        'url' : get_topic_urls(doc)
    }

    return pd.DataFrame(topics_dict)
 

In [None]:
scrape_topics().head()

2. Get the list of top repos from the individual topic page

In [None]:
def scrape_topics_repos():
    print('Scraping list of topics')
    # iterate over the DataFrame returned here, not over a global variable
    topics_df = scrape_topics()
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], row['title'])

In [None]:
# how the loop above works: iterrows() yields each row of the DataFrame in turn
for index, row in topics_df.iterrows():
    print(row['title'], row['url'])

In [None]:
scrape_topics_repos()
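
When many topic pages are requested in a row, it can help to pause between requests and to keep going when a single page fails. The sketch below shows one way to do this. The name `scrape_all`, the one-second default delay, and the choice to collect failed titles are all assumptions made for illustration.

```python
import time

def scrape_all(topics, scrape_one, delay_seconds=1.0):
    """Call scrape_one(url, title) for each (title, url) pair.

    Returns the titles that failed so they can be retried later.
    """
    failed = []
    for position, (title, url) in enumerate(topics):
        if position > 0:
            time.sleep(delay_seconds)  # pause between requests
        try:
            scrape_one(url, title)
        except Exception as error:
            print('Failed to scrape "{}": {}'.format(title, error))
            failed.append(title)
    return failed
```

With the notebook's own names, a call might look like `scrape_all(zip(topics_df['title'], topics_df['url']), scrape_topic)`.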