# GitHub Topics Repositories

### Project Outline

- Scrape https://github.com/topics
- Get a list of topics. For each topic, get the title, the topic page URL, and the description
- For each topic, get the top 25 repositories from its topic page
- For each repository, get the repo name, username, star count, and repo URL
- For each topic, create a CSV file

### Download the web pages with requests

In [1]:
import requests

In [2]:
topic_url = 'https://github.com/topics'

In [3]:
response = requests.get(topic_url)

In [4]:
#response.status_code
page_contents = response.text

### Parse and extract necessary data with Beautiful Soup

In [5]:
from bs4 import BeautifulSoup

In [6]:
doc = BeautifulSoup(page_contents, 'html.parser')

#### Extracting p tags for topic title

In [7]:
topic_class_selector = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class' : topic_class_selector})

In [8]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]
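Passing the full class string to `find_all` matches only tags whose `class` attribute is exactly that string, so the lookup may stop working if GitHub adds or reorders a class. A minimal sketch of a looser match with a CSS selector follows. The HTML snippet is a hand-written stand-in for the real page, and treating `f3` plus `Link--primary` as the identifying classes is an assumption.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for part of the topics page (not the real markup).
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 extra-class">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description, not a title.</p>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact-string match: only the first <p> has exactly this class attribute.
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: any <p> carrying both classes, in any order.
loose = sample_doc.select('p.f3.Link--primary')

exact_titles = [tag.text for tag in exact]
loose_titles = [tag.text for tag in loose]
print(exact_titles)
print(loose_titles)
```

The selector form trades precision for resilience: it keeps working when unrelated utility classes change, but it could also pick up other `p` tags that happen to share both classes.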

#### Extracting p tags for topic description

In [9]:
topic_desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class' : topic_desc_selector})

In [10]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

#### Extracting a tags for topic URL

In [11]:
topic_url_selector = 'no-underline flex-1 d-flex flex-column'
topic_url_tags = doc.find_all('a', {'class' : topic_url_selector})

In [12]:
topic0_url = 'https://github.com' + topic_url_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d


#### Storing titles, descriptions, and URLs in lists

In [13]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [14]:
topic_desc = []

for tag in topic_desc_tags:
    topic_desc.append(tag.text.strip())

print(topic_desc[:5])

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.']


In [15]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_url_tags:
    topic_urls.append(base_url + tag['href'])
    
print(topic_urls[:5])

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android']
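String concatenation works here because every `href` on the page is a path starting with `/`. As a hedged alternative, the standard library's `urljoin` also copes with an `href` that is already absolute. The list of sample `href` values below is made up for illustration.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# Made-up href values: two relative paths and one already-absolute URL.
hrefs = ['/topics/3d', '/topics/ajax', 'https://github.com/topics/android']

# urljoin resolves relative paths against base_url and leaves absolute URLs alone.
joined = [urljoin(base_url, href) for href in hrefs]
print(joined)
```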


#### Creating a DataFrame

In [16]:
import pandas as pd

In [17]:
topic_dict = {
    'Title' : topic_titles,
    'Description' : topic_desc,
    'URLs' : topic_urls
}

In [18]:
topic_df = pd.DataFrame(topic_dict)

In [19]:
topic_df.head()

|   | Title | Description | URLs |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
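The titles, descriptions, and URLs are collected by three independent `find_all` calls, so a selector that misses one tag would leave the lists with different lengths. A small sanity check before building the DataFrame could look like the sketch below. `check_equal_lengths` is a hypothetical helper, not part of the notebook, and the sample values are placeholders.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length (hypothetical helper)."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

# Placeholder data shaped like topic_dict.
sample = {
    'Title': ['3D', 'Ajax'],
    'Description': ['first description', 'second description'],
    'URLs': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
print(check_equal_lengths(sample))
```

Equal lengths do not prove the rows line up, so this only catches the most obvious kind of selector failure.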


#### Creating CSV File

In [20]:
topic_df.to_csv('Topics.csv', index=False)

### Getting data from the repository pages

In [21]:
# Convert a star count string such as '81.8k' or '57' to an integer
def star_count(star_str):
    if star_str[-1] == 'k':
        return int(float(star_str[:-1]) * 1000)
    # Counts below 1000 carry no suffix; convert them to int as well
    return int(star_str)
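A slightly more defensive variant of the same idea is sketched below. It assumes star strings may also arrive with surrounding whitespace, an upper-case `K`, or thousands separators, none of which is confirmed by the page output shown here. `parse_star_count` is a hypothetical name, and `round` is used instead of `int` so that a floating-point product just below a whole number is not truncated.

```python
def parse_star_count(star_str):
    """Convert strings like '238k', '81.8k', '1,234' or '57' to an int."""
    # Normalise whitespace, letter case, and thousands separators first.
    cleaned = star_str.strip().lower().replace(',', '')
    if cleaned.endswith('k'):
        # round() guards against products such as 2299.9999999999995.
        return round(float(cleaned[:-1]) * 1000)
    return int(cleaned)

for text in ['238k', '81.8k', ' 1,234 ', '57']:
    print(text, '->', parse_star_count(text))
```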

### Defining the functions

In [22]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [23]:
def get_repo_info(h3_tag, star_tag):
    # Return the username, repo name, repo URL, and star count for one repository
    a_tags = h3_tag.find_all('a')
    user_name = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url_name = base_url + a_tags[1]['href']
    number_stars = star_count(star_tag.text.strip())
    return user_name, repo_name, repo_url_name, number_stars

In [24]:
def get_topic_repos(topic_doc):
    # Get the h3 tags containing username, repo name, and repo URL
    h3_tag_selector = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class' : h3_tag_selector})
    
    # Get the span tags containing the number of stars
    star_tag_selector = 'Counter js-social-count'
    star_tags = topic_doc.find_all('span', {'class' : star_tag_selector})
    
    topics_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : []
    }

    # Collect the info for each repository, pairing each h3 tag with its star tag
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topics_dict['username'].append(repo_info[0])
        topics_dict['repo_name'].append(repo_info[1])
        topics_dict['repo_url'].append(repo_info[2])
        topics_dict['stars'].append(repo_info[3])
        
    return pd.DataFrame(topics_dict)

In [25]:
get_topic_repos(get_topic_page(topic_urls[2]))

|   | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | jwasham | coding-interview-university | 238000 | https://github.com/jwasham/coding-interview-un... |
| 1 | CyC2018 | CS-Notes | 158000 | https://github.com/CyC2018/CS-Notes |
| 2 | trekhleb | javascript-algorithms | 154000 | https://github.com/trekhleb/javascript-algorithms |
| 3 | TheAlgorithms | Python | 148000 | https://github.com/TheAlgorithms/Python |
| 4 | yangshun | tech-interview-handbook | 81800 | https://github.com/yangshun/tech-interview-han... |
| 5 | kdn251 | interviews | 58600 | https://github.com/kdn251/interviews |
| 6 | azl397985856 | leetcode | 49800 | https://github.com/azl397985856/leetcode |
| 7 | TheAlgorithms | Java | 48900 | https://github.com/TheAlgorithms/Java |
| 8 | algorithm-visualizer | algorithm-visualizer | 40200 | https://github.com/algorithm-visualizer/algori... |
| 9 | youngyangyang04 | leetcode-master | 33200 | https://github.com/youngyangyang04/leetcode-ma... |


In [26]:
get_topic_repos(get_topic_page(topic_urls[1])).to_csv('Ajax.csv', index=False)

In [27]:
get_topic_repos(get_topic_page(topic_urls[4])).to_csv('Android.csv', index=False)
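The two cells above write one CSV per topic by hand. To cover every topic, as the outline calls for, the file name could be derived from the topic title. The sketch below shows one way to do that. `topic_csv_name` is a hypothetical helper, and the rule for which characters to keep is an assumption rather than anything GitHub or pandas requires.

```python
import re

def topic_csv_name(title):
    """Turn a topic title into a CSV file name (hypothetical helper)."""
    # Keep letters, digits, '+', '.' and '-'; collapse everything else to '_'.
    safe = re.sub(r'[^A-Za-z0-9+.\-]+', '_', title.strip()).strip('_')
    return safe + '.csv'

for title in ['Ajax', 'Awesome Lists', 'C++', 'COVID-19']:
    print(topic_csv_name(title))

# Assumed usage with the notebook's own names (not run here, needs network access):
# for title, url in zip(topic_titles, topic_urls):
#     get_topic_repos(get_topic_page(url)).to_csv(topic_csv_name(title), index=False)
```

When looping over all topics, a short pause between requests (for example with `time.sleep`) is a reasonable courtesy to the server.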