# Scrape Top Repositories from GitHub Topics 

## Pick a website and mark your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

Outline:

- We're going to scrape https://github.com/topics.
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description.
- For each topic, we'll get the top 25 repositories from the topic page.
- For each repository, we'll grab the repo name, username, stars and repo URL.
- For each topic, we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

## Use the requests library to download web pages

In [1]:
import requests

In [2]:
topics_url = "https://github.com/topics"

In [3]:
response = requests.get(topics_url)

In [4]:
response.status_code

200
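
A status code of 200 means the download succeeded. The sketch below shows one way to guard against failed downloads before parsing. The helper name `get_page_text` is hypothetical and not part of the original notebook. It accepts any object with `status_code` and `text` attributes, so a stand-in object is used here to keep the example runnable offline. In the notebook you would pass the `response` returned by `requests.get`. The requests library also offers `response.raise_for_status()` for a similar purpose.

```python
from types import SimpleNamespace


def get_page_text(response):
    """Return the page text if the download succeeded, otherwise raise.

    `response` is any object with `status_code` and `text` attributes,
    such as the object returned by requests.get (hypothetical helper).
    """
    # Status codes from 200 to 299 indicate success
    if not 200 <= response.status_code <= 299:
        raise RuntimeError(f"Failed to load page, status code {response.status_code}")
    return response.text


# Stand-in for a real response, so the sketch runs without network access
fake_response = SimpleNamespace(status_code=200, text="<html></html>")
page_text = get_page_text(fake_response)
print(len(page_text))
```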

In [5]:
len(response.text)

128545

In [6]:
page_contents = response.text

In [7]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-B/jj6qcXmAwGUh/FG7mfpfFSb0lW1UpGiufFhzIeC+u3lXE5VDEJQzVxZ3gquw8xjZBNQ6CgWDSgCgjRzqPUgw==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-07f8e3eaa717980c06521fc51bb99fa5.css" />\n  \n    <link crossorigin="anonymous" media="all" integrity="sha512-5q2K3HE6SpFCTmTQaW6z9/MX/PVxQ/IRcjqNDVDesJQA/LKzwLxWf+kCGVvI7zkNBhMEJnV3OZKT79Swh03xfw==" rel="stylesheet" href="https://github.githubassets.com/assets/behaviors-e6ad8adc713a4a91424e64d0696eb3f7.css" />\n    \n    \

In [8]:
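# Save the page locally; an explicit UTF-8 encoding avoids errors on
# platforms whose default encoding cannot represent every character
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)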

## Use Beautiful Soup to parse and extract information



In [19]:
# !pip install beautifulsoup4 --upgrade --quiet

In [10]:
from bs4 import BeautifulSoup

In [11]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [12]:
type(doc)

bs4.BeautifulSoup

In [13]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p', {'class': selection_class})

In [14]:
len(topic_title_tags)

30

In [15]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [16]:
description_class = 'f5 color-text-secondary mb-0 mt-1'

topic_description = doc.find_all('p', {'class': description_class})

In [17]:
len(topic_description)

30

In [18]:
topic_description[:5]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]
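
Passing the full class string to `find_all` matches the `class` attribute exactly, so the lookup would return nothing if GitHub added, removed, or reordered a class. A possible alternative, sketched below on made-up HTML (not the real page), is a CSS selector via `select`, which only requires the listed classes to be present. Which classes are stable enough to select on is an assumption you would need to check against the live page.

```python
from bs4 import BeautifulSoup

# Illustrative HTML: the second <p> lists the same classes in a different order
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="mt-1 f3 lh-condensed mb-0 Link--primary">Ajax</p>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact-string match on the class attribute: only the first tag matches
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: matches any <p> carrying both classes, in any order
flexible = sample_doc.select('p.f3.lh-condensed')

print([tag.text for tag in exact])
print([tag.text for tag in flexible])
```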

In [29]:
topic_link_tags = doc.find_all('a',{'class': 'd-flex no-underline'})

In [30]:
len(topic_link_tags)

30

In [38]:
#topic_link_tags[0]

In [33]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
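
The same idea extends from the first link to all of them. The sketch below uses `urllib.parse.urljoin` from the standard library instead of string concatenation. The helper name `to_absolute_urls` and the sample hrefs are illustrative assumptions. In the notebook, the hrefs would come from `topic_link_tags`.

```python
from urllib.parse import urljoin

base_url = "https://github.com"


def to_absolute_urls(hrefs, base=base_url):
    """Turn relative hrefs such as '/topics/3d' into full URLs."""
    return [urljoin(base, href) for href in hrefs]


# In the notebook the hrefs would come from the link tags:
#   hrefs = [tag['href'] for tag in topic_link_tags]
sample_hrefs = ['/topics/3d', '/topics/ajax']  # illustrative values
topic_urls = to_absolute_urls(sample_hrefs)
print(topic_urls)
```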


In [34]:
topic_titles = []

for topic in topic_title_tags:
    topic_titles.append(topic.text)

In [35]:
len(topic_titles)

30

In [37]:
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [41]:
topic_descs = []

for desc in topic_description:
    topic_descs.append(desc.text.strip())
    
print(topic_descs[:5])

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency framework for PHP.', 'Android is an operating system built by Google designed for mobile devices.']
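
With titles, descriptions and URLs held in separate lists, it can help to combine them into one record per topic. The following is a minimal sketch, assuming a hypothetical helper `build_topic_records` and assumed dictionary keys. The second sample URL is illustrative rather than taken from the page.

```python
def build_topic_records(titles, descriptions, urls):
    """Combine three parallel lists into one list of dictionaries."""
    # zip would silently drop items if the lists differed in length, so check first
    if not (len(titles) == len(descriptions) == len(urls)):
        raise ValueError("titles, descriptions and urls must have the same length")
    return [
        {'title': title, 'description': description, 'url': url}
        for title, description, url in zip(titles, descriptions, urls)
    ]


records = build_topic_records(
    ['3D', 'Ajax'],
    ['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
     'Ajax is a technique for creating interactive web applications.'],
    ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],  # second URL is illustrative
)
print(records[0]['title'], records[0]['url'])
```

The length check matters because a missing description on the page would otherwise shift every later description onto the wrong topic without raising an error.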


## Create CSV file(s) with the extracted information
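
One way to approach this step is the standard library's `csv` module. The sketch below is an assumption about how the topic data might be saved, not the notebook's own method. The helper name `write_topics_csv`, the column names, and the file name in the comment are all hypothetical.

```python
import csv
import io


def write_topics_csv(out_file, records):
    """Write topic records (dicts) as CSV rows to an open text file."""
    fieldnames = ['title', 'description', 'url']  # assumed column names
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)


sample_records = [
    {'title': '3D',
     'description': '3D modeling is the process of virtually developing the surface and structure of a 3D object.',
     'url': 'https://github.com/topics/3d'},
]

# Writing to an in-memory buffer here; in the notebook you could instead use
#   with open('topics.csv', 'w', newline='', encoding='utf-8') as f:
buffer = io.StringIO()
write_topics_csv(buffer, sample_records)
print(buffer.getvalue())
```

Using `csv.DictWriter` rather than joining strings by hand means that descriptions containing commas or quotes are escaped correctly.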

## Document and share your work