# Top GitHub Repositories by Topic

### Pick a website and describe your objective
- Browse through different sites and pick one to scrape.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook.

#### Project Outline:
- We are going to scrape https://github.com/topics
- Get a list of topics. For each topic, we will record the topic title, the topic page URL (which includes the topic 'id'), and the topic description.
- For each topic, we will get the top 25 repositories.
- For each repo, we will record the repo name, username, stars, and repo URL.
- Each topic will have a CSV file in the following format:
```
Repo Name,Username,Stars,Repo URL
free-programming-books-zh_CN,justjavac,102000,https://github.com/justjavac/free-programming-books-zh_CN
kotlin,JetBrains,44600,https://github.com/JetBrains/kotlin
```
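A minimal sketch of how rows in this format could be written. It uses Python's standard `csv` module rather than pandas (which this notebook uses further down), and the helper name `write_repos_csv` is an assumption made for illustration. The sample rows are the two shown above.

```python
import csv
import io

HEADER = ['Repo Name', 'Username', 'Stars', 'Repo URL']

def write_repos_csv(rows, out):
    """Write the header and one line per repository to a file-like object."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(HEADER)
    writer.writerows(rows)

rows = [
    ('free-programming-books-zh_CN', 'justjavac', 102000,
     'https://github.com/justjavac/free-programming-books-zh_CN'),
    ('kotlin', 'JetBrains', 44600, 'https://github.com/JetBrains/kotlin'),
]

# An in-memory buffer stands in for a real file here
buffer = io.StringIO()
write_repos_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, 'w', newline='', encoding='utf-8')`; the `csv` documentation recommends `newline=''` so that line endings are not translated twice.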

### Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
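The last step above could be sketched roughly as follows. The names `topic_filename` and `save_page`, the `pages` folder, and the `fetch` parameter are all assumptions for illustration: `fetch` is any callable that takes a URL and returns an object with `status_code` and `text`, which is the shape of what `requests.get` returns.

```python
from pathlib import Path
from urllib.parse import urlparse

def topic_filename(topic_url):
    """Derive a local file name such as '3d.html' from a topic URL."""
    slug = urlparse(topic_url).path.rstrip('/').split('/')[-1]
    return slug + '.html'

def save_page(topic_url, fetch, folder='pages'):
    """Download topic_url with fetch and save the HTML under folder."""
    response = fetch(topic_url)
    # Treat anything outside the 200-299 range as a failed download
    if not 200 <= response.status_code < 300:
        raise RuntimeError(f'Failed to load page {topic_url}')
    target = Path(folder) / topic_filename(topic_url)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(response.text, encoding='utf-8')
    return target
```

In a notebook with `requests` imported, a call might look like `save_page('https://github.com/topics/3d', requests.get)`. Passing the downloader in as an argument keeps the function easy to try out with a stand-in that never touches the network.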

In [71]:
! python3 -m pip install --upgrade pip --user
! python3 -m pip install requests --upgrade

Defaulting to user installation because normal site-packages is not writeable


In [72]:
import requests

In [73]:
topics_url = 'https://github.com/topics'
response = requests.get(topics_url)
# A status code in the 200-299 range means the request succeeded
# See https://developer.mozilla.org/en-US/docs/Web/HTTP/Status for more on status codes
response.status_code

200

In [74]:
page_contents = response.text
with open('topics-page.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

### Use Beautiful Soup to parse and extract information
- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.


#### Topic information

In [75]:
! python3 -m pip install beautifulsoup4 --upgrade

Defaulting to user installation because normal site-packages is not writeable


In [76]:
from bs4 import BeautifulSoup

In [77]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [78]:
topic_tag = 'p'
topic_link_tag = 'a'

topic_title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_description_class = 'f5 color-fg-muted mb-0 mt-1'
topic_link_class = 'no-underline flex-1 d-flex flex-column'

topic_title_tags = doc.find_all(topic_tag, {'class': topic_title_class})
topic_description_tags = doc.find_all(topic_tag, {'class': topic_description_class})
topic_link_tags = doc.find_all(topic_link_tag, {'class': topic_link_class})

In [79]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

topic_titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [80]:
topic_descriptions = []

for tag in topic_description_tags:
    topic_descriptions.append(tag.text.strip())

topic_descriptions[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [81]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

#### Topic page information

##### Data request

In [82]:
topic_page_url = topic_urls[0]
response = requests.get(topic_page_url)
response.status_code

200

In [83]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [84]:
repo_tag = 'h3'
star_tag = 'span'

repo_class = 'f3 color-fg-muted text-normal lh-condensed'
star_class = 'Counter js-social-count'

repo_tags = topic_doc.find_all(repo_tag, {'class': repo_class})
star_tags = topic_doc.find_all(star_tag, {'class': star_class})

##### Deciding parse methodology

In [85]:
test_repo_tag = repo_tags[0]
repo_user_name = test_repo_tag.find_all('a')
repo_user = repo_user_name[0].text.strip()
repo_name = repo_user_name[1].text.strip()
repo_url = base_url + repo_user_name[1]['href']

print(repo_user)
print(repo_name)
print(repo_url)

mrdoob
three.js
https://github.com/mrdoob/three.js


In [86]:
repo_star = star_tags[0].text.strip()
print(repo_star)

91.6k


In [87]:
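def parse_star_count(stars_str):
    # Convert a display string such as '91.6k' or '850' to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        # round() guards against float error that int() would truncate
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)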

In [88]:
repo_star_parse = parse_star_count(repo_star)
repo_star_parse

91600
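The parser above covers the `k` suffix seen on this page. As a hedged sketch, a slightly more defensive variant might also accept thousands separators and an `m` suffix. Both of those formats are assumptions, since the notebook only observes values like `91.6k`, and the name `parse_count` is hypothetical.

```python
# Multipliers for the suffixes this sketch understands ('m' is an assumption)
SUFFIXES = {'k': 1_000, 'm': 1_000_000}

def parse_count(text):
    """Convert strings such as '91.6k', '1,234' or '850' to an integer."""
    cleaned = text.strip().lower().replace(',', '')
    if not cleaned:
        raise ValueError('empty star count')
    multiplier = SUFFIXES.get(cleaned[-1])
    if multiplier is None:
        # No known suffix: the string should be a plain integer
        return int(cleaned)
    # round() avoids losing a unit to floating-point error, which int() would truncate
    return round(float(cleaned[:-1]) * multiplier)

for sample in ['91.6k', '9.7k', '850', '1,234', '1.2m']:
    print(sample, '->', parse_count(sample))
```

Anything that is neither a plain integer nor a number with a known suffix raises `ValueError`, which makes an unexpected page format visible instead of silently producing a wrong count.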

##### Parsing

In [89]:
def get_repo_info(h3_tag, star_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [90]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 91600, 'https://github.com/mrdoob/three.js')

In [91]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
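The index-based loop above relies on `repo_tags` and `star_tags` having the same length. A minimal sketch of an alternative, assuming each repository has already been reduced to a `(username, repo_name, stars, repo_url)` tuple in the order `get_repo_info` returns; the helper name `rows_to_columns` is hypothetical, and the sample rows are taken from this topic's results.

```python
COLUMNS = ('username', 'repo_name', 'stars', 'repo_url')

def rows_to_columns(rows, columns=COLUMNS):
    """Turn a list of per-repo tuples into a dict of column lists."""
    result = {name: [] for name in columns}
    for row in rows:
        # Reject malformed rows instead of building ragged columns
        if len(row) != len(columns):
            raise ValueError(f'expected {len(columns)} fields, got {len(row)}')
        for name, value in zip(columns, row):
            result[name].append(value)
    return result

rows = [
    ('mrdoob', 'three.js', 91600, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 21500, 'https://github.com/libgdx/libgdx'),
]
topic_repos = rows_to_columns(rows)
print(topic_repos['repo_name'])
```

In the notebook, the rows could come from `[get_repo_info(h3, star) for h3, star in zip(repo_tags, star_tags)]`. Note that `zip` stops at the shorter list, so a missing star counter drops a repository rather than raising an `IndexError`.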

##### General function for each topic

In [None]:
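def get_topic_repos(topic_url):
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    repo_tags = topic_doc.find_all(repo_tag, {'class': repo_class})
    star_tags = topic_doc.find_all(star_tag, {'class': star_class})
    return [get_repo_info(h3, star) for h3, star in zip(repo_tags, star_tags)]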

### Create CSV file(s) with the extracted information
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.

In [None]:
! python3 -m pip install pandas --upgrade

Defaulting to user installation because normal site-packages is not writeable


In [None]:
import pandas as pd

In [None]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descriptions,
    'url': topic_urls
}
topic_df = pd.DataFrame(topics_dict)
topic_df

,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [94]:
topic_repos_df = pd.DataFrame(topic_repos_dict)
topic_repos_df

,username,repo_name,stars,repo_url
0,mrdoob,three.js,91600,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,22500,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21500,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,20600,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,16900,https://github.com/ssloy/tinyrenderer
5,aframevr,aframe,15300,https://github.com/aframevr/aframe
6,lettier,3d-game-shaders-for-beginners,15200,https://github.com/lettier/3d-game-shaders-for...
7,FreeCAD,FreeCAD,14000,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,10400,https://github.com/CesiumGS/cesium
9,metafizzy,zdog,9700,https://github.com/metafizzy/zdog


In [None]:
topic_df.to_csv('topics.csv', index=False)

### Document and share your work
- Add proper headings and documentation in your Jupyter notebook.
- Publish your Jupyter notebook to your GitHub profile.
- (Optional) Write a blog post about your project and share it online.