##                  TOP REPOSITORIES FOR GITHUB TOPICS
                              Steps:


#### 1.Pick a website and describe your objective
- Browse through different sites and pick on to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Juptyer notebook. Use the "New" button above.

#### 2.Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the ***requests*** library.
- Create a function to automate downloading for different topics/search queries.


#### 3.Use Beautiful Soup to parse and extract information
- Parse and explore the structure of downloaded web pages using ***Beautiful soup***.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.


#### 4.Create CSV file(s) with the extracted information
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.


#### 5.Document and share your work
- Add proper headings and documentation in your Jupyter notebook, in Datalore
- Publish your Jupyter notebook to your Jovian profile
- (Optional) Write a blog post about your project and share it online.


# Project outline of work by steps

**STEP 1**
   - Scraping website: https://github.com/topics
   - Create a list of topics available on website. For each topic, scrap the following data: topic name, topic page URL and topic description
   - Scrap data only for top 25 repositories per topic
   - Scrap the following data for each repository by topic: repo name, user name, stars and repo URL
   - For each topic, create a CSV file with data about repos inside it
  

**STEP 2**



In [274]:
# request is pre-installed in datalore
import requests 
topics_url = 'https://github.com/topics'

# download the webpage
response = requests.get(topics_url)
# reguest library has opened this url and downloaded it

response.status_code
# checking status of http requests

len(response.text)
# number of character

155964

In [275]:
page_content = response.text
page_content[:1000]
# show content of downloaded webpage - html code/file

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-0946cdc16f15.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-3946c959759a.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="h

In [276]:
with open('webpage.html', 'w') as f:
    f.write(page_content)
# creating local file with html code of webpage

**STEP 3**

In [277]:
# tool for extracting data from html code is library Beutiful soup which is pre-installed on Datalore
from bs4 import BeautifulSoup
# that's way for importing

doc = BeautifulSoup(page_content,'html.parser')
# parsing of html code and saving it in string variable doc
# doc variable is now beutifulsoup object
# type(doc)

In [278]:
# now we can query and find data we need in that html code
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_name_tags = doc.find_all('p', {'class': selection_class})
# find all p tags in html code with specific class
# we don't need all p tags

len(topic_name_tags)
# how many p tag we've found

30

In [279]:
topic_name_tags[:10]
# seeing the data
# here we have topic name

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>]

In [280]:
# extracting topic desctiption data from html code
selection_class_description = 'f5 color-fg-muted mb-0 mt-1'
topic_description_tags = doc.find_all('p', {'class': selection_class_description})

topic_description_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [281]:
# extracting each topic page URL data from html code
selection_class_urltopic = 'no-underline flex-1 d-flex flex-column'
topic_url_tags = doc.find_all('a', {'class': selection_class_urltopic})
topic_url_tags[0]['href']

'/topics/3d'

In [282]:
#creating lists for topic name, desc and url
topic_names = []


for a in topic_name_tags:
    topic_names.append(a.text)
print(topic_names)


['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [283]:
topic_desc = []
for a in topic_description_tags:
    topic_desc.append(a.text.strip())
print(topic_desc)
 

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud co

In [284]:
topic_urls = []
base = 'https://github.com'
for a in topic_url_tags:
    topic_urls.append(base + a['href'])
print(topic_urls)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t

In [285]:
# creating CSV file with topics using pandas library and his DataFrame
import  pandas as pd
topic_df = pd.DataFrame(list(zip(topic_names,topic_desc,topic_urls)), 
                        columns=['Topic', 'Description', 'Web page']) 
topic_df

Unnamed: 0,Topic,Description,Web page
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


**STEP 4**

In [286]:
topic_df.to_csv('topics.csv', index=None)

In [287]:
# scaping webpage for every topic that is previously extracted
topic_page_url = topic_urls[0]
response = requests.get(topic_page_url)

# first topic webpage for scraping 
#topic_page_url    =  'https://github.com/topics/3d'


In [288]:
response.status_code

200

In [289]:
len(response.text)

465582

In [290]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [291]:
h1_selection = 'f3 color-fg-muted text-normal lh-condensed' 
repo_tags = topic_doc.find_all('h3',{'class' : h1_selection})

 

In [292]:
a_tags = repo_tags[0].find_all('a')
a_tags


[<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [293]:
a_tags[0].text.strip()

'mrdoob'

In [294]:
a_tags[1].text.strip()

'three.js'

In [295]:
repo_url = base + a_tags[1]['href']
repo_url

'https://github.com/mrdoob/three.js'

In [296]:
star_selection = 'Counter js-social-count'
star_tags = topic_doc.find_all('span',{'class' : star_selection})
star_tags[0].text.strip()
# stars = []
# for a in star_tags:
#     stars.append(a.text)
# stars
# maybe convert this string list to integer list with some defined function

'92.5k'

In [297]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)
        

In [298]:
parse_star_count(star_tags[0].text.strip())

92500

In [299]:
def get_repo_info(h3_tag, star_tag):
    # returns all the data about repository [user name, name of repo, url, stars]
    h3_tags = h3_tag.find_all('a')
    username = h3_tags[0].text.strip()
    repo_name = h3_tags[1].text.strip()
    repo_url = base + h3_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars,repo_url

In [300]:
# try function on first element 
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 92500, 'https://github.com/mrdoob/three.js')

In [301]:
# # run function on whole list of repos
# create dictionary
topic_repos_dict = {
    'username' : [],
    'repo name' : [],
    'stars': [],
    'repo url' : []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo url'].append(repo_info[3])

topic_repos_dict

{'username': ['mrdoob',
  'pmndrs',
  'libgdx',
  'BabylonJS',
  'ssloy',
  'lettier',
  'aframevr',
  'FreeCAD',
  'CesiumGS',
  'metafizzy',
  'isl-org',
  'timzhang642',
  'blender',
  'a1studmuffin',
  'domlysz',
  'FyroxEngine',
  'google',
  'openscad',
  'nerfstudio-project',
  'spritejs'],
 'repo name': ['three.js',
  'react-three-fiber',
  'libgdx',
  'Babylon.js',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'aframe',
  'FreeCAD',
  'cesium',
  'zdog',
  'Open3D',
  '3D-Machine-Learning',
  'blender',
  'SpaceshipGenerator',
  'BlenderGIS',
  'Fyrox',
  'model-viewer',
  'openscad',
  'nerfstudio',
  'spritejs'],
 'stars': [92500,
  22800,
  21600,
  20800,
  17100,
  15500,
  15400,
  14200,
  10500,
  9800,
  9000,
  8900,
  8700,
  7400,
  6400,
  6200,
  5700,
  5700,
  5300,
  5200],
 'repo url': ['https://github.com/mrdoob/three.js',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/libgdx/libgdx',
  'https://github.com/BabylonJS/Babylon.j

In [302]:
topic_repos_df = pd.DataFrame(topic_repos_dict)
topic_repos_df

Unnamed: 0,username,repo name,stars,repo url
0,mrdoob,three.js,92500,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,22800,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21600,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,20800,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,17100,https://github.com/ssloy/tinyrenderer
5,lettier,3d-game-shaders-for-beginners,15500,https://github.com/lettier/3d-game-shaders-for...
6,aframevr,aframe,15400,https://github.com/aframevr/aframe
7,FreeCAD,FreeCAD,14200,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,10500,https://github.com/CesiumGS/cesium
9,metafizzy,zdog,9800,https://github.com/metafizzy/zdog


In [303]:
# creating a functions for automatization of this task; scrap data for every topic 
def get_topic_page(topic_url):
     # dowloand a html code of webpage of topic
    response = requests.get(topic_url)
    # check status of http
    if response.status_code != 200:
        raise  Exception('Failed to load page {}'.format(topic_url))
    
    # parse to html using beautiful soup library
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc

# ***FINAL CODE***  

In [310]:
import os
def get_topic_repos(topic_doc):
    
    # list of h3 tags where is username, repo name and url
    h1_selection = 'f3 color-fg-muted text-normal lh-condensed' 
    repo_tags = topic_doc.find_all('h3',{'class' : h1_selection})
    
    # list of span tags where is stars for every repo
    star_selection = 'Counter js-social-count'
    star_tags = topic_doc.find_all('span',{'class' : star_selection})

    # function for getting all this date automatically for one topic
    topic_repos_dict = {
            'username' : [],
            'repo name' : [],
            'stars': [],
            'repo url' : []
                    }
        
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo url'].append(repo_info[3]) 
    return  pd.DataFrame(topic_repos_dict)

# helping function
def scrape_topic(topic_url, path):
    
    if os.path.exists(path):
        return
    print('File  {} already exists. Skipping...'.format(path))
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index = None)

In [311]:
# FUNCTION for automation and creating csv from result
get_topic_repos(get_topic_page(topic_urls[1])).to_csv()
 

',username,repo name,stars,repo url\n0,ljianshu,Blog,7600,https://github.com/ljianshu/Blog\n1,metafizzy,infinite-scroll,7300,https://github.com/metafizzy/infinite-scroll\n2,developit,unfetch,5600,https://github.com/developit/unfetch\n3,olifolkerd,tabulator,5500,https://github.com/olifolkerd/tabulator\n4,jquery-form,form,5200,https://github.com/jquery-form/form\n5,Studio-42,elFinder,4400,https://github.com/Studio-42/elFinder\n6,elbywan,wretch,4000,https://github.com/elbywan/wretch\n7,dwyl,learn-to-send-email-via-google-script-html-no-server,3000,https://github.com/dwyl/learn-to-send-email-via-google-script-html-no-server\n8,ded,reqwest,2900,https://github.com/ded/reqwest\n9,LeaVerou,bliss,2400,https://github.com/LeaVerou/bliss\n10,wendux,ajax-hook,2300,https://github.com/wendux/ajax-hook\n11,noelboss,featherlight,2100,https://github.com/noelboss/featherlight\n12,craftpip,jquery-confirm,1800,https://github.com/craftpip/jquery-confirm\n13,ptaoussanis,sente,1700,https://github.com/ptaoussa

In [312]:
topics_url

'https://github.com/topics'

In [313]:
def get_topic_names(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_name_tags = doc.find_all('p', {'class': selection_class})
    topic_names = []
    for a in topic_name_tags:
        topic_names.append(a.text)
    return topic_names
    
def get_topic_descs(doc):
    selection_class_description = 'f5 color-fg-muted mb-0 mt-1'
    topic_description_tags = doc.find_all('p', {'class': selection_class_description})  
    topic_desc = []
    for a in topic_description_tags:
        topic_desc.append(a.text.strip())
    return topic_desc
    
def get_topic_urls(doc):
    selection_class_urltopic = 'no-underline flex-1 d-flex flex-column'
    topic_url_tags = doc.find_all('a', {'class': selection_class_urltopic})
    topic_urls = []
    base = 'https://github.com'
    for a in topic_url_tags:
        topic_urls.append(base + a['href'])
    return  topic_urls   

def scrap_topics():
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise  Exception('Failed to load page {}'.format(topics_url))
    topic_dict  =  {
        'name' : get_topic_names(doc),
        'description' : get_topic_descs(doc),
        'url' :  get_topic_urls(doc)
    }
    return pd.DataFrame(topic_dict,index=None)
# scrap_topics

In [314]:
def scrap_topics_repos():
    print('Scraping list of topics from github')
    topics_df = scrap_topics()
    
    # create folder and put all csv file in it when they are created
    os.makedirs('data', exist_ok=True)

    for index,row in topics_df.iterrows():
        print('Scraping top repos for "{}"'.format(row['name']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['name']))


In [316]:
scrap_topics_repos()

Scraping list of topics from github
Scraping top repos for "3D"
File  data/3D.csv already exists. Skipping...
Scraping top repos for "Ajax"
File  data/Ajax.csv already exists. Skipping...
Scraping top repos for "Algorithm"
File  data/Algorithm.csv already exists. Skipping...
Scraping top repos for "Amp"
File  data/Amp.csv already exists. Skipping...
Scraping top repos for "Android"
File  data/Android.csv already exists. Skipping...
Scraping top repos for "Angular"
File  data/Angular.csv already exists. Skipping...
Scraping top repos for "Ansible"
File  data/Ansible.csv already exists. Skipping...
Scraping top repos for "API"
File  data/API.csv already exists. Skipping...
Scraping top repos for "Arduino"
File  data/Arduino.csv already exists. Skipping...
Scraping top repos for "ASP.NET"
File  data/ASP.NET.csv already exists. Skipping...
Scraping top repos for "Atom"
File  data/Atom.csv already exists. Skipping...
Scraping top repos for "Awesome Lists"
File  data/Awesome Lists.csv alread