# Top Repositories for GitHub Topics

## Pick a website and describe your objective

- Browse through different sites and pick one to scrape.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook.

## Outline

- Scrape https://github.com/topics
- Get a list of topics. For each topic, get the topic title, topic URL, and description.
- For each topic, get the top 25 repositories.
- For each repository, grab the repo name, username, stars, and repo URL.
- Each topic should have its own CSV file, in this format:

Repo Name,Username,Stars,Repo URL
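
As a rough illustration of the target format, the sketch below writes that header and one row with the standard library's `csv` module. The helper name `write_repos_csv` and the row values are made-up placeholders for illustration, not scraped data.

```python
import csv
import io

# Column order taken from the format line above.
FIELDNAMES = ['Repo Name', 'Username', 'Stars', 'Repo URL']

def write_repos_csv(rows, file_obj):
    """Write repository rows (a list of dicts) to an open text file object."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

# Placeholder row, invented purely to show the shape of the data.
placeholder_rows = [
    {
        'Repo Name': 'example-repo',
        'Username': 'example-user',
        'Stars': 1200,
        'Repo URL': 'https://github.com/example-user/example-repo',
    },
]

# An in-memory buffer stands in for a per-topic file.
buffer = io.StringIO()
write_repos_csv(placeholder_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, the same helper could be called with `open('some-topic.csv', 'w', newline='', encoding='utf-8')` in place of the buffer.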

## Use the requests library to download web pages

In [3]:
!pip install requests --upgrade



In [4]:
import requests

In [5]:
topics_url = 'https://github.com/topics'

In [6]:
response = requests.get(topics_url)

In [8]:
response.status_code  # 200 means the request succeeded

200

In [9]:
len(response.text)

146935

In [10]:
page_contents = response.text

In [15]:
with open('webpage.html','w',encoding="utf-8") as f:
    f.write(page_contents)
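
The download steps above can be wrapped in one function so that a failed request is noticed immediately. This is a minimal sketch: the function name `download_page` and the `timeout` value are assumptions added for illustration, not part of the original notebook.

```python
import requests

def download_page(url, timeout=10):
    """Download a page and return its HTML text, failing loudly on a bad status."""
    response = requests.get(url, timeout=timeout)
    # Anything other than 200 means the page was not fetched as expected.
    if response.status_code != 200:
        raise Exception('Failed to load page {} (status {})'.format(url, response.status_code))
    return response.text
```

Calling `download_page('https://github.com/topics')` would then return the same text that `response.text` holds above.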

## Use Beautiful Soup to parse and extract information

In [16]:
!pip install beautifulsoup4 --upgrade

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.10.0-py3-none-any.whl (97 kB)
Installing collected packages: beautifulsoup4
  Attempting uninstall: beautifulsoup4
    Found existing installation: beautifulsoup4 4.9.3
    Uninstalling beautifulsoup4-4.9.3:
      Successfully uninstalled beautifulsoup4-4.9.3
Successfully installed beautifulsoup4-4.10.0


In [17]:
from bs4 import BeautifulSoup

In [18]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [25]:
selec_class = "f3 lh-condensed mb-0 mt-1 Link--primary"

topic_title_tags = doc.find_all('p', {'class': selec_class})  # Find the <p> tags that hold the topic names

In [26]:
len(topic_title_tags)

30

In [27]:
topic_title_tags[:10]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>]

In [28]:
desc_class = "f5 color-text-secondary mb-0 mt-1"

topic_desc_tags = doc.find_all('p', {'class': desc_class})  # Find the <p> tags that hold the topic descriptions

In [29]:
len(topic_desc_tags)

30

In [30]:
topic_desc_tags[:10]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Angular is an open source web application platform.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ansible is a simple and powerful automat

In [31]:
url_class = 'd-flex no-underline'

url_tags = doc.find_all('a',{'class':url_class})

In [32]:
len(url_tags)

30

In [35]:
url_tags[10]['href']

'/topics/atom'
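
The three queries above return parallel lists, and the `href` values are site-relative paths such as `/topics/atom`. As an alternative sketch, the code below reads all three fields from one container per topic and resolves the link with the standard library's `urljoin`. It assumes each topic's title and description `<p>` tags sit inside its `<a class="d-flex no-underline">` link; that is an assumption about the page layout, mirrored by the simplified hand-written `sample_html` rather than taken from the live page. The names `parse_topics` and `sample_html` are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

TOPICS_URL = 'https://github.com/topics'
TITLE_CLASS = 'f3 lh-condensed mb-0 mt-1 Link--primary'
DESC_CLASS = 'f5 color-text-secondary mb-0 mt-1'
LINK_CLASS = 'd-flex no-underline'

# Simplified, hand-written sample; the real page has more markup around these tags.
sample_html = """
<a class="d-flex no-underline" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-text-secondary mb-0 mt-1">
    3D modeling is the process of virtually developing the surface and structure of a 3D object.
  </p>
</a>
<a class="d-flex no-underline" href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
  <p class="f5 color-text-secondary mb-0 mt-1">
    Ajax is a technique for creating interactive web applications.
  </p>
</a>
"""

def parse_topics(html, page_url=TOPICS_URL):
    """Return one dict per topic with its title, description, and absolute URL."""
    doc = BeautifulSoup(html, 'html.parser')
    topics = []
    for link in doc.find_all('a', {'class': LINK_CLASS}):
        title_tag = link.find('p', {'class': TITLE_CLASS})
        desc_tag = link.find('p', {'class': DESC_CLASS})
        if title_tag is None or desc_tag is None:
            continue  # Skip links that do not look like a topic entry.
        topics.append({
            'title': title_tag.text.strip(),
            'description': desc_tag.text.strip(),
            # urljoin resolves the site-relative href against the page URL.
            'url': urljoin(page_url, link['href']),
        })
    return topics

topics = parse_topics(sample_html)
```

Reading the fields per container keeps a title from being paired with the wrong description if one of the tags is missing for some topic.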

Now let's clean the data by extracting the text from each tag.

In [36]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

In [37]:
topic_titles #List of all topic titles

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [40]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())  # strip() removes leading and trailing whitespace

In [41]:
topic_descs #List of all topic descriptions

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'A

In [44]:
topic_urls = []
base_url = "https://github.com"  # Each href already starts with /topics/, so prepend only the site root

for tag in url_tags:
    topic_urls.append(base_url + tag['href'])

In [45]:
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cl

### Now use pandas to create a DataFrame

In [46]:
import pandas as pd

In [47]:
topic_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url': topic_urls
}

In [49]:
topic_df = pd.DataFrame(topic_dict)

In [51]:
topic_df.head()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


In [52]:
topic_df.to_csv('topics.csv', index=False)
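
The `index` argument in the call above decides whether the DataFrame's row labels are written as an extra first column. The sketch below illustrates the difference by writing a two-row frame to in-memory buffers and reading it back. The variable names and the use of `io.StringIO` in place of a file are choices made for this illustration.

```python
import io

import pandas as pd

# A two-row frame using topic data shown earlier in the notebook.
sample_df = pd.DataFrame({
    'title': ['3D', 'Ajax'],
    'description': [
        '3D modeling is the process of virtually developing the surface and structure of a 3D object.',
        'Ajax is a technique for creating interactive web applications.',
    ],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

# Default behaviour: the row labels 0, 1, ... become an unnamed first column.
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the row labels out, so only the three data columns remain.
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Leaving the index out keeps `topics.csv` limited to the `title`, `description`, and `url` columns when it is loaded again later.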