# Top GitHub Repositories by Topic

### Pick a website and describe your objective
- Browse through different sites and pick one to scrape.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook.

#### Project Outline:
- We are going to scrape the GitHub topics page: https://github.com/topics
- Get a list of topics. For each topic, we will collect the topic title, the topic page URL (which includes the topic 'id'), and the topic description.
- For each topic, we will get the top 25 repositories.
- For each repo, we will collect the repo name, username, stars, and repo URL.
- Each topic will have a CSV file in the following format:
```
Repo Name,Username,Stars,Repo URL
free-programming-books-zh_CN,justjavac,102000,https://github.com/justjavac/free-programming-books-zh_CN
kotlin,JetBrains,44600,https://github.com/JetBrains/kotlin
```
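
The outline notes that the topic page URL includes the topic 'id'. A minimal sketch of recovering it with the standard library, assuming URLs of the form `https://github.com/topics/<id>`; `topic_id_from_url` is a hypothetical helper, not part of the notebook:

```python
from urllib.parse import urlparse


def topic_id_from_url(url: str) -> str:
    """Return the last path segment of a topic URL, e.g. 'amphp'."""
    path = urlparse(url).path  # e.g. '/topics/amphp'
    return path.rstrip('/').rsplit('/', 1)[-1]


print(topic_id_from_url('https://github.com/topics/amphp'))
```

The id can differ from the displayed title: in this notebook's output, the topic titled `Amp` has the URL `https://github.com/topics/amphp`.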

### Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
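
The last bullet can be sketched as a small helper. This is only an illustration, not the notebook's implementation: `download_topic_page`, its `fetch` parameter, and the default `pages` directory are hypothetical names. `fetch` is any callable that takes a URL and returns the page text, which keeps the helper independent of a particular HTTP library.

```python
from pathlib import Path
from typing import Callable

BASE_URL = 'https://github.com/topics'


def download_topic_page(topic_id: str, fetch: Callable[[str], str],
                        out_dir: str = 'pages') -> Path:
    """Download one topic page and save it as <out_dir>/<topic_id>.html."""
    url = f'{BASE_URL}/{topic_id}'
    html = fetch(url)
    out_path = Path(out_dir) / f'{topic_id}.html'
    # Create the output directory on first use.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(html, encoding='utf-8')
    return out_path
```

With the requests library, `fetch` could be `lambda url: requests.get(url).text`.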

In [121]:
! python3 -m pip install --upgrade pip --user
! python3 -m pip install requests --upgrade

Defaulting to user installation because normal site-packages is not writeable


In [122]:
! python3 -m pip install pandas --upgrade

Defaulting to user installation because normal site-packages is not writeable


In [123]:
import requests
import pandas as pd

In [124]:
topics_url = 'https://github.com/topics'
response = requests.get(topics_url)
# A status code in the 200-299 range means the request succeeded.
# See https://developer.mozilla.org/en-US/docs/Web/HTTP/Status for more on status codes.
response.status_code

200
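
As a rough illustration of the status-code check described in the comment above, the sketch below classifies a few codes using only the standard library. `is_success` is a hypothetical helper, not part of requests.

```python
from http import HTTPStatus


def is_success(status_code: int) -> bool:
    """Return True for any 2xx status code."""
    return 200 <= status_code < 300


for code in (200, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Not Found".
    print(code, HTTPStatus(code).phrase, is_success(code))
```

requests also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.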

In [125]:
page_contents = response.text
with open('topics-page.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

### Use Beautiful Soup to parse and extract information
- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.


#### Topic information

In [126]:
! python3 -m pip install beautifulsoup4 --upgrade

Defaulting to user installation because normal site-packages is not writeable


In [127]:
from bs4 import BeautifulSoup

In [128]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [129]:
topic_tag = 'p'
topic_link_tag = 'a'

topic_title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_description_class = 'f5 color-fg-muted mb-0 mt-1'
topic_link_class = 'no-underline flex-1 d-flex flex-column'

topic_title_tags = doc.find_all(topic_tag, {'class': topic_title_class})
topic_description_tags = doc.find_all(topic_tag, {'class': topic_description_class})
topic_link_tags = doc.find_all(topic_link_tag, {'class': topic_link_class})
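
Passing a full class string to `find_all` matches only when the tag's `class` attribute is exactly that string, in that order, so the lookup breaks if the page adds or reorders a class. One possible alternative, sketched here on a made-up HTML snippet (not GitHub's real markup), is a CSS selector via `select`, which matches on a subset of classes in any order.

```python
from bs4 import BeautifulSoup

# Made-up snippet: same classes as the topic title tag, but reordered.
html = '''
<p class="Link--primary f3 mt-1 mb-0 lh-condensed">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# Exact-string match finds nothing because the class order differs.
exact = soup.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: any <p> that carries both classes, in any order.
by_selector = soup.select('p.f3.Link--primary')

print(len(exact), [tag.text for tag in by_selector])
```

Which classes are stable enough to select on is a judgment call that depends on the page being scraped.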

In [130]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

topic_titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [131]:
topic_descriptions = []

for tag in topic_description_tags:
    topic_descriptions.append(tag.text.strip())

topic_descriptions[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [132]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

#### Topic page information

##### Data request

In [133]:
topic_page_url = topic_urls[0]
response = requests.get(topic_page_url)
response.status_code

200

In [134]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [135]:
repo_tag = 'h3'
star_tag = 'span'

repo_class = 'f3 color-fg-muted text-normal lh-condensed'
star_class = 'Counter js-social-count'

repo_tags = topic_doc.find_all(repo_tag, {'class': repo_class})
star_tags = topic_doc.find_all(star_tag, {'class': star_class})

##### Deciding parse methodology

In [136]:
test_repo_tag = repo_tags[0]
repo_user_name = test_repo_tag.find_all('a')
repo_user = repo_user_name[0].text.strip()
repo_name = repo_user_name[1].text.strip()
repo_url = base_url + repo_user_name[1]['href']

print(repo_user)
print(repo_name)
print(repo_url)

mrdoob
three.js
https://github.com/mrdoob/three.js


In [137]:
repo_star = star_tags[0].text.strip()
print(repo_star)

91.6k


In [138]:
def parse_star_count(stars_str):
    if stars_str[-1] == 'k':
        # round() guards against floating-point artifacts before converting to int
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [139]:
repo_star_parse = parse_star_count(repo_star)
repo_star_parse

91600
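
`parse_star_count` handles counts such as `91.6k`. A slightly more defensive variant might also accept thousands separators and an `m` suffix, and use `Decimal` to sidestep floating-point rounding. This is a sketch under those assumptions: `parse_count` is a hypothetical name, and this notebook does not confirm that topic pages ever render counts like `1,234` or `1.2m`.

```python
from decimal import Decimal

# Multipliers for the suffixes this sketch assumes may appear.
SUFFIXES = {'k': 1_000, 'm': 1_000_000}


def parse_count(text: str) -> int:
    """Convert strings like '91.6k', '1,234' or '1.2m' to an integer."""
    cleaned = text.strip().lower().replace(',', '')
    multiplier = SUFFIXES.get(cleaned[-1:], 1)
    if multiplier != 1:
        cleaned = cleaned[:-1]  # drop the suffix character
    # Decimal represents '91.6' exactly, so the product is exactly 91600.
    return int(Decimal(cleaned) * multiplier)


print(parse_count('91.6k'), parse_count('1,234'), parse_count('1.2m'))
```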

##### Parsing

In [140]:
def get_repo_info(h3_tag, star_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [141]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 91600, 'https://github.com/mrdoob/three.js')

In [142]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])

##### General functions for each topic

In [143]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    repo_tag = 'h3'
    star_tag = 'span'
    repo_class = 'f3 color-fg-muted text-normal lh-condensed'
    star_class = 'Counter js-social-count'
    repo_tags = topic_doc.find_all(repo_tag, {'class': repo_class})
    star_tags = topic_doc.find_all(star_tag, {'class': star_class})
    topics_repo_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
    }
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topics_repo_dict['username'].append(repo_info[0])
        topics_repo_dict['repo_name'].append(repo_info[1])
        topics_repo_dict['stars'].append(repo_info[2])
        topics_repo_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topics_repo_dict)

In [144]:
url4 = topic_urls[4]
topic4_doc = get_topic_page(url4)
topic4_repos = get_topic_repos(topic4_doc)
print(url4)
topic4_repos

https://github.com/topics/android


| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | flutter | flutter | 153000 | https://github.com/flutter/flutter |
| 1 | facebook | react-native | 110000 | https://github.com/facebook/react-native |
| 2 | justjavac | free-programming-books-zh_CN | 102000 | https://github.com/justjavac/free-programming-... |
| 3 | Genymobile | scrcpy | 83400 | https://github.com/Genymobile/scrcpy |
| 4 | Hack-with-Github | Awesome-Hacking | 64800 | https://github.com/Hack-with-Github/Awesome-Ha... |
| 5 | google | material-design-icons | 48000 | https://github.com/google/material-design-icons |
| 6 | Solido | awesome-flutter | 46700 | https://github.com/Solido/awesome-flutter |
| 7 | wasabeef | awesome-android-ui | 46300 | https://github.com/wasabeef/awesome-android-ui |
| 8 | square | okhttp | 44000 | https://github.com/square/okhttp |
| 9 | android | architecture-samples | 42600 | https://github.com/android/architecture-samples |


In [145]:
url5 = topic_urls[5]
topic5_doc = get_topic_page(url5)
topic5_repos = get_topic_repos(topic5_doc)
print(url5)
topic5_repos

https://github.com/topics/angular


| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | justjavac | free-programming-books-zh_CN | 102000 | https://github.com/justjavac/free-programming-... |
| 1 | angular | angular | 88100 | https://github.com/angular/angular |
| 2 | storybookjs | storybook | 78500 | https://github.com/storybookjs/storybook |
| 3 | leonardomso | 33-js-concepts | 56100 | https://github.com/leonardomso/33-js-concepts |
| 4 | ionic-team | ionic-framework | 49000 | https://github.com/ionic-team/ionic-framework |
| 5 | prettier | prettier | 45600 | https://github.com/prettier/prettier |
| 6 | Asabeneh | 30-Days-Of-JavaScript | 35700 | https://github.com/Asabeneh/30-Days-Of-JavaScript |
| 7 | SheetJS | sheetjs | 32800 | https://github.com/SheetJS/sheetjs |
| 8 | angular | angular-cli | 26100 | https://github.com/angular/angular-cli |
| 9 | angular | components | 23500 | https://github.com/angular/components |


##### General function that:

1. Gets the list of topics from the topics page
2. Gets the list of top repos in each topic
3. Creates a CSV of the top repos for each topic

In [146]:
def get_topic_titles(doc):
    topic_title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': topic_title_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text.strip())
    return topic_titles

def get_topic_descriptions(doc):
    topic_description_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_description_tags = doc.find_all('p', {'class': topic_description_class})
    topic_descriptions = []
    for tag in topic_description_tags:
        topic_descriptions.append(tag.text.strip())
    return topic_descriptions

def get_topic_urls(doc):
    topic_link_class = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', {'class': topic_link_class})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

def scrape_topics():
    topics_url = 'https://github.com/topics'
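    response = requests.get(topics_url)
    # Fail early on a bad response, mirroring get_topic_page
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    page_contents = response.text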
    doc = BeautifulSoup(page_contents, 'html.parser')

    topic_titles = get_topic_titles(doc)
    topic_descriptions = get_topic_descriptions(doc)
    topic_urls = get_topic_urls(doc)

    topics_dict = {
        'title': topic_titles,
        'description': topic_descriptions,
        'url': topic_urls
    }
    return pd.DataFrame(topics_dict)

In [147]:
scrape_topics()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


### Create CSV file(s) with the extracted information
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
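
The read-back check in the last bullet can be sketched as a write-then-read round trip. To keep the example dependency-free it uses the standard-library `csv` module rather than pandas (with pandas, the read step would be `pd.read_csv(path)`). The single row is taken from this notebook's 3D topic output; the temporary directory and the file name `3D.csv` are illustrative choices.

```python
import csv
import tempfile
from pathlib import Path

rows = [
    {'username': 'mrdoob', 'repo_name': 'three.js', 'stars': 91600,
     'repo_url': 'https://github.com/mrdoob/three.js'},
]
fieldnames = ['username', 'repo_name', 'stars', 'repo_url']

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / '3D.csv'
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back; the csv module returns every field as a string.
    with open(path, newline='', encoding='utf-8') as f:
        read_back = list(csv.DictReader(f))

print(read_back[0]['repo_name'], int(read_back[0]['stars']))
```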

##### Testing methods

In [148]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descriptions,
    'url': topic_urls
}
topic_df = pd.DataFrame(topics_dict)
topic_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [149]:
topic_repos_df = pd.DataFrame(topic_repos_dict)
topic_repos_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 91600 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 22500 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 21500 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 20600 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 16900 | https://github.com/ssloy/tinyrenderer |
| 5 | aframevr | aframe | 15300 | https://github.com/aframevr/aframe |
| 6 | lettier | 3d-game-shaders-for-beginners | 15200 | https://github.com/lettier/3d-game-shaders-for... |
| 7 | FreeCAD | FreeCAD | 14000 | https://github.com/FreeCAD/FreeCAD |
| 8 | CesiumGS | cesium | 10400 | https://github.com/CesiumGS/cesium |
| 9 | metafizzy | zdog | 9700 | https://github.com/metafizzy/zdog |


In [150]:
topic_df.to_csv('topics_example.csv', index=None)

##### Final function combining the general functions

In [151]:
from pathlib import Path

def scrape_topic(topic_url, topic_name):
    filepath = Path('topic_csvs/' + topic_name + '.csv')
    filepath.parent.mkdir(parents=True, exist_ok=True)
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(filepath, index=False)
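
`scrape_topic` uses the topic title directly as the file name. Titles such as `ASP.NET` or `Chrome extension` work as they are, but a title containing a path separator would point at a different directory. One possible safeguard is to sanitize the title first; `safe_filename` is a hypothetical helper, and `CI/CD` is a made-up title used only to show the effect.

```python
import re


def safe_filename(title: str) -> str:
    """Replace characters that are unsafe in file names with underscores."""
    # Keep letters, digits, dot, plus, hash and hyphen; collapse anything else.
    cleaned = re.sub(r'[^A-Za-z0-9.+#-]+', '_', title.strip())
    return cleaned.strip('_') or 'untitled'


print(safe_filename('Chrome extension'), safe_filename('CI/CD'))
```

The result could then be passed as `topic_name` in place of the raw title.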

In [152]:
def scrape_topic_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], row['title'])

In [153]:
scrape_topic_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin

### Document and share your work
- Add proper headings and documentation in your Jupyter notebook.
- Publish your Jupyter notebook to your GitHub account.
- (Optional) Write a blog post about your project and share it online.