## Web Scraping

Web scraping is the process of extracting specific data from targeted web pages. In other words, we write code that fetches data from websites in an automated fashion.

#### Web crawling:

Web crawling works like a search engine: a crawler follows links from page to page without a specific extraction goal.
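
To make the contrast with scraping concrete, here is a minimal sketch of a crawler's core loop. It walks a small in-memory link graph breadth-first and visits every reachable page once, without extracting anything specific. `LINKS` is hypothetical data standing in for real pages, and `crawl` is an illustrative name.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to (stands in for real HTTP fetches).
LINKS = {
    "/": ["/topics", "/about"],
    "/topics": ["/topics/3d", "/topics/ajax"],
    "/about": ["/"],
    "/topics/3d": [],
    "/topics/ajax": ["/topics"],
}

def crawl(start, links):
    """Visit every page reachable from `start`, breadth-first, each exactly once."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

visited = crawl("/", LINKS)
print(visited)
```

A scraper, by contrast, would start from a known page and pull out specific fields from it.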

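Websites with dynamic content (content that changes with user interaction) usually cannot be scraped with BeautifulSoup alone, because the data is often loaded by JavaScript after the initial HTML arrives. One way to scrape a dynamic website is to use Selenium.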
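
A minimal sketch of why this happens, using only the standard library's `html.parser` and a hypothetical page source (`STATIC_HTML`). The HTML a server returns for a dynamic page often contains an empty container that JavaScript fills in later, so a parser that only sees the downloaded source finds no data.

```python
from html.parser import HTMLParser

# Hypothetical page source as a plain HTTP download would return it.
# The list is empty; a browser would fill it in by running the script.
STATIC_HTML = """
<html><body>
  <ul id="repos"></ul>
  <script>loadRepos('#repos');</script>
</body></html>
"""

class ListItemCounter(HTMLParser):
    """Count the <li> elements present in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.items = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.items += 1

counter = ListItemCounter()
counter.feed(STATIC_HTML)
print(counter.items)  # 0: the items only exist after JavaScript runs in a browser
```

A browser-automation tool such as Selenium runs the page's scripts first and then exposes the rendered result.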

#### 1. Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

#### 2. Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
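
The last step above can be sketched as a small helper. This is a sketch under assumptions: `fetch` is any callable that takes a URL and returns the page's HTML as text (with the requests library, something like `lambda u: requests.get(u).text`), and the names `topic_page_url` and `download_topic` and the file-naming scheme are illustrative.

```python
import os
import tempfile

BASE_URL = "https://github.com"

def topic_page_url(topic):
    """Build a topic page URL, e.g. 'https://github.com/topics/3d'."""
    return f"{BASE_URL}/topics/{topic.strip().lower()}"

def download_topic(topic, fetch, folder):
    """Fetch one topic page with `fetch` and save it as '<topic>.html' in `folder`."""
    html = fetch(topic_page_url(topic))
    path = os.path.join(folder, f"{topic.strip().lower()}.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path

# Offline demonstration with a stand-in fetcher instead of a real HTTP request.
fake_fetch = lambda url: f"<html><body>{url}</body></html>"
with tempfile.TemporaryDirectory() as folder:
    saved = download_topic("3D", fake_fetch, folder)
    with open(saved, encoding="utf-8") as f:
        saved_text = f.read()
    saved_name = os.path.basename(saved)
print(saved_name)
```

Topic slugs do not always match the displayed title (the page for "Amp" lives at `/topics/amphp`), so in practice the URL is better taken from the scraped `href` than derived from the title.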

#### 3. Use Beautiful Soup to parse and extract information
- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
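
For the optional last step, GitHub's REST API exposes repository details as JSON at `https://api.github.com/repos/{owner}/{repo}`, including an exact `stargazers_count`. A minimal sketch, assuming a response body like the hypothetical `SAMPLE_RESPONSE` below (a real response has many more fields, and the star number here is made up):

```python
import json

def repo_api_url(owner, repo):
    """Build the REST API URL for one repository."""
    return f"https://api.github.com/repos/{owner}/{repo}"

# Hypothetical, heavily trimmed response body; the values are illustrative only.
SAMPLE_RESPONSE = (
    '{"full_name": "mrdoob/three.js", '
    '"html_url": "https://github.com/mrdoob/three.js", '
    '"stargazers_count": 93412}'
)

def extract_repo_fields(body):
    """Keep only the fields this project needs from the JSON body."""
    data = json.loads(body)
    return {
        "name": data["full_name"],
        "url": data["html_url"],
        "stars": data["stargazers_count"],
    }

api_url = repo_api_url("mrdoob", "three.js")
fields = extract_repo_fields(SAMPLE_RESPONSE)
print(api_url)
print(fields)
```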

#### 4. Create CSV file(s) with the extracted information
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
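
The write-then-verify step can be sketched with the standard library's `csv` module (the project itself reads the files back with Pandas). The `rows` are sample values from the 3D topic, an in-memory buffer stands in for a file on disk, and `write_csv` and `read_csv` are illustrative names.

```python
import csv
import io

FIELDS = ["repo_name", "username", "stars", "repo_url"]
rows = [
    {"repo_name": "three.js", "username": "mrdoob", "stars": 93400,
     "repo_url": "https://github.com/mrdoob/three.js"},
    {"repo_name": "react-three-fiber", "username": "pmndrs", "stars": 23300,
     "repo_url": "https://github.com/pmndrs/react-three-fiber"},
]

def write_csv(rows, out):
    """Write a header line followed by one line per repository."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

def read_csv(src):
    """Read the rows back, converting the star count to an integer again."""
    return [dict(r, stars=int(r["stars"])) for r in csv.DictReader(src)]

buffer = io.StringIO()  # stands in for an open file
write_csv(rows, buffer)
buffer.seek(0)
read_back = read_csv(buffer)
print(read_back == rows)
```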

#### 5. Document and share your work

### Scraping-Github-topics-repositories

### Problem Statement:

Find the top 20 repositories in each GitHub topic (3D, Ajax, etc.).

Before jumping directly into coding, it is better to prepare a sample of the output by hand in a spreadsheet (for example, sheets.new). That gives a basic idea of what the final result should look like.

#### 1. Pick a website and describe your objective

#### Project Outline:
- We're going to scrape GitHub topics: https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic URL, and topic description.
- For each topic, we'll pick the top 20 repositories from the topic page.
- For each repository, we'll grab the repo name, username, stars, and URL.
- For each topic, we'll create a CSV file in the following format:

```
Repo Name,username,stars,repo url
three.js,mrdoob,93400,https://github.com/mrdoob/three.js
react-three-fiber,pmndrs,23300,https://github.com/pmndrs/react-three-fiber
```

#### 2. Use the requests library to download web pages

In [1]:
url='https://github.com/topics'

In [2]:
pip install requests --upgrade

Note: you may need to restart the kernel to use updated packages.


Use '--quiet' to hide the output:

In [3]:
!pip install requests --upgrade --quiet

In [4]:
import requests

In [5]:
response=requests.get(url)

In [6]:
response.status_code
# Check whether the request succeeded.
# Successful responses have status codes in the 200-299 range.
# Look up "HTTP response status codes" for more details.

200
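
The range check described in the comments can be written out explicitly. A small sketch using the standard library's `http.HTTPStatus` for readable names; `is_success` and `describe_status` are illustrative helpers, not part of requests.

```python
from http import HTTPStatus

def is_success(status_code):
    """Return True for the 2xx family of status codes."""
    return 200 <= status_code <= 299

def describe_status(status_code):
    """Return a short description such as '200 OK (success)'."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        phrase = "Unknown"
    outcome = "success" if is_success(status_code) else "failure"
    return f"{status_code} {phrase} ({outcome})"

print(describe_status(200))
print(describe_status(404))
```

With requests itself, calling `response.raise_for_status()` raises an exception for 4xx and 5xx responses, which avoids checking the number by hand.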

In [7]:
len(response.text)

164895

In [8]:
page_contents=response.text

In [9]:
page_contents[:1000]
#written in HTML

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-8cafbcbd78f4.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-31dc14e38457.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="

In [10]:
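# Pass an explicit encoding so non-ASCII characters are saved the same way on every platform.
with open('webpage.html','w',encoding='utf-8') as f:
    f.write(page_contents)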

#### 3.Use Beautiful Soup to parse and extract information

In [11]:
pip install beautifulsoup4 lxml --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


In [12]:
from bs4 import BeautifulSoup

In [13]:
soup=BeautifulSoup(page_contents,'lxml')

In [14]:
soup

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-8cafbcbd78f4.css" media="all" rel="stylesheet"/><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-31dc14e38457.css" media="all" rel="stylesheet"/><link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://github.githubassets.com/assets/dark_dimmed-71414d661fe2.c

In [15]:
type(soup)

bs4.BeautifulSoup

In [16]:
p_tags=soup.find_all('p',class_='f3 lh-condensed mb-0 mt-1 Link--primary')

In [17]:
p_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [18]:
main_topics=[i.text for i in p_tags]
len(main_topics)

30

In [19]:
p_tags_descr=soup.find_all('p',class_='f5 color-fg-muted mb-0 mt-1')

In [20]:
p_tags_descr[0]

<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>

In [21]:
topic_des=[i.text[1:-10] for i in p_tags_descr]
topic_des[:5]

['          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries',
 '          Ajax is a technique for creating interactive web applications',
 '          Algorithms are self-contained sequences that carry out a variety of tasks',
 '          Amp is a non-blocking concurrency library for PHP',
 '          Android is an operating system built by Google designed for mobile devices']

In [22]:
# To move up from a topic's <p> tag to the elements that enclose it, use .parent

In [23]:
p_tags[0].parent.parent

<div class="py-4 border-bottom d-flex flex-justify-between">
<a class="no-underline flex-grow-0" href="/topics/3d">
<div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>
<div class="flex-grow-0">
<div class="d-block" data-view-component="true">
<a aria-label="You must be signed in to star a repository" class="tooltipped tooltipped-s btn-sm btn" data-hydro-click='{"event_type":"authentication.click","payload":{"location_in_page":"star button","repository_id":null,"auth_type":"LOG_IN","originating_url":"https://github.com/topics","user_id":null}}' data-hydro-click-hmac="5

In [24]:
main_topic_3d=soup.find_all('a',class_='no-underline flex-grow-0')

In [25]:
main_topic_3d[0]

<a class="no-underline flex-grow-0" href="/topics/3d">
<div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
</a>

In [26]:
main_topic_3d[0]['href']

'/topics/3d'

In [27]:
main_topic_3d_url='https://github.com'+main_topic_3d[0]['href']
print(main_topic_3d_url)

https://github.com/topics/3d
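
String concatenation works here because every `href` starts with `/` and the base has no trailing slash. A sketch of a more forgiving alternative is `urllib.parse.urljoin` from the standard library, which yields the same URL whether or not the base ends with a slash (`absolute_url` is an illustrative helper name):

```python
from urllib.parse import urljoin

def absolute_url(href, base="https://github.com/"):
    """Resolve a site-relative href against the base URL."""
    return urljoin(base, href)

href = "/topics/3d"
print(absolute_url(href))                        # base with a trailing slash
print(absolute_url(href, "https://github.com"))  # base without a trailing slash
print("https://github.com/" + href)              # plain concatenation doubles the slash
```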


In [28]:
p_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [29]:
p_tags[0].text

'3D'

In [30]:
topicc_title=[i.text for i in p_tags]
topicc_title

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [31]:
topic_titles=[]
for i in p_tags:
    topic_titles.append(i.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [32]:
topic_des[:5]

['          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries',
 '          Ajax is a technique for creating interactive web applications',
 '          Algorithms are self-contained sequences that carry out a variety of tasks',
 '          Amp is a non-blocking concurrency library for PHP',
 '          Android is an operating system built by Google designed for mobile devices']

In [33]:
p_tags_descr[0]

<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>

In [34]:
topic_des=[]
for i in p_tags_descr:
    topic_des.append(i.text.strip())  # strip() removes leading and trailing whitespace
print(topic_des[:5])

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.']
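
The earlier attempt with a fixed slice (`text[1:-10]`) left the leading spaces in place and cut off the final period, while `strip()` gives clean text. A small illustration with a string shaped like the page's description text; `clean_text` is an illustrative helper that also collapses internal runs of whitespace.

```python
raw = "\n          Ajax is a technique for creating interactive web applications.\n        "

sliced = raw[1:-10]     # assumes exactly 1 leading and 10 trailing characters of padding
stripped = raw.strip()  # removes whatever whitespace surrounds the text

def clean_text(text):
    """Strip the ends and collapse internal runs of whitespace to single spaces."""
    return " ".join(text.split())

print(repr(sliced))
print(repr(stripped))
print(repr(clean_text("  Code \n   review  ")))
```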


In [35]:
topic_url=soup.find_all('a',class_='no-underline flex-1 d-flex flex-column')

In [36]:
topic_url[:2]

[<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>
 </a>]

In [37]:
topic_url[0]['href']

'/topics/3d'

In [38]:
topic_url_3d="https://github.com"+topic_url[0]['href']
topic_url_3d  # use print() to show the URL as a clickable link

'https://github.com/topics/3d'

In [39]:
print(topic_url_3d)

https://github.com/topics/3d


In [40]:
topic_urls=[]
for i in topic_url:
    topic_urls.append('https://github.com'+i['href'])
print(topic_urls)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t

In [41]:
topic_urls=[]
base_url='https://github.com'
for i in topic_url:
    topic_urls.append(base_url+i['href'])
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [42]:
len(topic_urls)

30

In [43]:
import pandas as pd

In [44]:
topic_dataset=pd.DataFrame({'Title':topic_titles,'Description':topic_des,'Url':topic_urls})

In [45]:
topic_dataset.head()

Unnamed: 0,Title,Description,Url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


In [46]:
topic_dataset.to_csv('Topic.csv',index=False)  # index=False keeps the row numbers out of the file

#### Getting information out of a topic_url

#### 3D

In [47]:
Topic_page_url=topic_urls[0]

In [48]:
Topic_page_url

'https://github.com/topics/3d'

In [49]:
response=requests.get(Topic_page_url)

In [50]:
topic_3d_soup=BeautifulSoup(response.content,'lxml')

In [51]:
username=topic_3d_soup.find_all('h3',class_='f3 color-fg-muted text-normal lh-condensed')

In [52]:
username[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href

In [53]:
a_tag=username[0].find_all('a')

In [54]:
a_tag

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [55]:
a_tag[0].text.strip()

'mrdoob'

In [56]:
a_tag[1].text.strip()

'three.js'

In [57]:
a_tag[1]

<a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
            three.js
</a>

In [58]:
a_tag[1]['href']

'/mrdoob/three.js'

In [59]:
base_url='https://github.com'
repo_url=base_url+a_tag[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [60]:
repo_url

'https://github.com/mrdoob/three.js'

In [61]:
star_3d=topic_3d_soup.find_all('span',class_='Counter js-social-count')

In [62]:
star_3d[0].text.strip()

'93.4k'

In [63]:
def parse_star_count(star_str):
    star_str=star_str.strip()
    if star_str[-1]=='k':
        return int(float(star_str[:-1])*1000)
    return int(star_str)

In [64]:
parse_star_count(star_3d[0].text)

93400
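
`parse_star_count` handles the `k` suffix seen on this page. A slightly more defensive sketch follows, assuming counts might also appear with an `m` suffix, in upper case, or with thousands separators. Those formats are assumptions and were not observed in this notebook; `parse_count` is an illustrative name.

```python
def parse_count(text):
    """Convert strings such as '93.4k', '1.2M' or '1,234' into an integer."""
    text = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        # round() rather than int(), so floating-point error cannot truncate the result
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(parse_count("93.4k"))
print(parse_count("1,234"))
```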

In [65]:
def repo_info(username,stars):
    a_tag=username.find_all('a')
    username=a_tag[0].text.strip()
    repo_name=a_tag[1].text.strip()
    repo_url=base_url+a_tag[1]['href']
    star_rating=parse_star_count(stars.text.strip())
    return username,repo_name,star_rating,repo_url

In [66]:
repo_info(username[1],star_3d[1])

('pmndrs',
 'react-three-fiber',
 23300,
 'https://github.com/pmndrs/react-three-fiber')
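
`repo_info` returns a plain tuple, so callers have to remember which position holds which field. A sketch of a named alternative using `typing.NamedTuple` (the `RepoInfo` type is illustrative and not part of the notebook):

```python
from typing import NamedTuple

class RepoInfo(NamedTuple):
    username: str
    repo_name: str
    stars: int
    url: str

info = RepoInfo("pmndrs", "react-three-fiber", 23300,
                "https://github.com/pmndrs/react-three-fiber")

print(info.stars)      # access by name instead of info[2]
print(info._asdict())  # converts to a dict, handy for building rows
```

Named fields make it harder to swap two columns by accident when the order of the returned values changes between functions.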

In [67]:
topic_data={'username':[],'repo_name':[],'star':[],'url':[]}
for i in range(len(username)):
    repo_info_3d=repo_info(username[i],star_3d[i])
    topic_data['username'].append(repo_info_3d[0])
    topic_data['repo_name'].append(repo_info_3d[1])
    topic_data['star'].append(repo_info_3d[2])
    topic_data['url'].append(repo_info_3d[3])
topic_data

{'username': ['mrdoob',
  'pmndrs',
  'libgdx',
  'BabylonJS',
  'ssloy',
  'lettier',
  'aframevr',
  'FreeCAD',
  'CesiumGS',
  'metafizzy',
  'isl-org',
  'blender',
  'timzhang642',
  'a1studmuffin',
  'domlysz',
  'FyroxEngine',
  'nerfstudio-project',
  'google',
  'openscad',
  'spritejs'],
 'repo_name': ['three.js',
  'react-three-fiber',
  'libgdx',
  'Babylon.js',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'aframe',
  'FreeCAD',
  'cesium',
  'zdog',
  'Open3D',
  'blender',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'BlenderGIS',
  'Fyrox',
  'nerfstudio',
  'model-viewer',
  'openscad',
  'spritejs'],
 'star': [93400,
  23300,
  21700,
  21100,
  17500,
  15800,
  15600,
  14600,
  10700,
  10000,
  9200,
  9000,
  9000,
  7400,
  6500,
  6400,
  5900,
  5800,
  5700,
  5200],
 'url': ['https://github.com/mrdoob/three.js',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/libgdx/libgdx',
  'https://github.com/BabylonJS/Babylon.js',
 

In [68]:
import pandas as pd

In [69]:
dataset=pd.DataFrame(topic_data)

In [70]:
dataset

Unnamed: 0,username,repo_name,star,url
0,mrdoob,three.js,93400,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,23300,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21700,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,21100,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,17500,https://github.com/ssloy/tinyrenderer
5,lettier,3d-game-shaders-for-beginners,15800,https://github.com/lettier/3d-game-shaders-for...
6,aframevr,aframe,15600,https://github.com/aframevr/aframe
7,FreeCAD,FreeCAD,14600,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,10700,https://github.com/CesiumGS/cesium
9,metafizzy,zdog,10000,https://github.com/metafizzy/zdog


In [71]:
dataset.to_csv('3d_topic.csv',index=False)  # index=False keeps the row numbers out of the file

#### ajax

In [72]:
topic_urls[1]

'https://github.com/topics/ajax'

In [73]:
topic_ajax=topic_urls[1]

In [74]:
import requests

In [75]:
response=requests.get(topic_ajax)

In [76]:
from bs4 import BeautifulSoup

In [77]:
Topic_ajax=BeautifulSoup(response.content,'lxml')

In [78]:
ajax_tag=Topic_ajax.find_all('h3',class_='f3 color-fg-muted text-normal lh-condensed')

In [79]:
ajax_tag[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":36322912,"originating_url":"https://github.com/topics/ajax","user_id":null}}' data-hydro-click-hmac="9074acd8ed677416751ed4a6a8764f616627f58efec6a83519e500c62fe3c738" data-turbo="false" data-view-component="true" href="/ljianshu">
            ljianshu
</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":137582912,"originating_url":"https://github.com/topics/ajax","user_id":null}}' data-hydro-click-hmac="5f285461b513fd5a20b16ea1e5d33b4a0a01a278ae2357647e16b2f4938e9dfc" data-turbo="false" data-view-compone

In [80]:
ajax_topic=ajax_tag[0].find_all('a')
ajax_topic

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":36322912,"originating_url":"https://github.com/topics/ajax","user_id":null}}' data-hydro-click-hmac="9074acd8ed677416751ed4a6a8764f616627f58efec6a83519e500c62fe3c738" data-turbo="false" data-view-component="true" href="/ljianshu">
             ljianshu
 </a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":137582912,"originating_url":"https://github.com/topics/ajax","user_id":null}}' data-hydro-click-hmac="5f285461b513fd5a20b16ea1e5d33b4a0a01a278ae2357647e16b2f4938e9dfc" data-turbo="false" data-view-component="true" href="/ljianshu/Blog">
             Blog
 </a>]

In [81]:
ajax_username=ajax_topic[0].text.strip()
ajax_username

'ljianshu'

In [82]:
ajax_repo_name=ajax_topic[1].text.strip()
ajax_repo_name

'Blog'

In [83]:
ajax_topic[1]

<a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":137582912,"originating_url":"https://github.com/topics/ajax","user_id":null}}' data-hydro-click-hmac="5f285461b513fd5a20b16ea1e5d33b4a0a01a278ae2357647e16b2f4938e9dfc" data-turbo="false" data-view-component="true" href="/ljianshu/Blog">
            Blog
</a>

In [84]:
ajax_topic[1]['href']

'/ljianshu/Blog'

In [85]:
base_url='https://github.com'  # no trailing slash: the href already starts with '/'
repo_url=base_url+ajax_topic[1]['href']
repo_url

'https://github.com/ljianshu/Blog'

In [86]:
stars=Topic_ajax.find_all('span',class_='Counter js-social-count')

In [87]:
stars[0].text

'7.7k'

In [88]:
def star_rating(star_str):
    star_str=star_str.strip()  # strip() returns a new string, so keep the result
    if star_str[-1]=='k':
        return int(float(star_str[:-1])*1000)
    return int(star_str)

In [89]:
star=star_rating(stars[1].text)
star

7300

In [90]:
def topic_info(ajax_tag,stars):
    ajax_topic=ajax_tag.find_all('a')
    username=ajax_topic[0].text.strip()
    repo_name=ajax_topic[1].text.strip()
    url='https://github.com'+ajax_topic[1]['href']  # no trailing slash: the href already starts with '/'
    star=star_rating(stars.text)
    return username,repo_name,url,star

In [91]:
topic_info(ajax_tag[0],stars[0])

('ljianshu', 'Blog', 'https://github.com/ljianshu/Blog', 7700)

In [92]:
Ajax_topic={'username':[],'repo_name':[],'star_rating':[],'url':[]}
for i in range(len(ajax_tag)):
    Topic_info=topic_info(ajax_tag[i],stars[i])
    Ajax_topic['username'].append(Topic_info[0])
    Ajax_topic['repo_name'].append(Topic_info[1])
    Ajax_topic['star_rating'].append(Topic_info[3])
    Ajax_topic['url'].append(Topic_info[2])
Ajax_topic

{'username': ['ljianshu',
  'metafizzy',
  'developit',
  'olifolkerd',
  'jquery-form',
  'Studio-42',
  'elbywan',
  'dwyl',
  'ded',
  'wendux',
  'LeaVerou',
  'noelboss',
  'craftpip',
  'taoensso',
  'nette',
  'loadingio',
  'joaomilho',
  'k8w',
  'codingforentrepreneurs',
  'u014427391'],
 'repo_name': ['Blog',
  'infinite-scroll',
  'unfetch',
  'tabulator',
  'form',
  'elFinder',
  'wretch',
  'learn-to-send-email-via-google-script-html-no-server',
  'reqwest',
  'ajax-hook',
  'bliss',
  'featherlight',
  'jquery-confirm',
  'sente',
  'tracy',
  'css-spinner',
  'Enterprise',
  'tsrpc',
  'eCommerce',
  'jeeplatform'],
 'star_rating': [7700,
  7300,
  5600,
  5600,
  5200,
  4500,
  4100,
  3000,
  2900,
  2400,
  2400,
  2100,
  1800,
  1700,
  1700,
  1600,
  1600,
  1600,
  1400,
  1400],
 'url': ['https://github.com/ljianshu/Blog',
  'https://github.com/metafizzy/infinite-scroll',
  'https://github.com/developit/unfetch',
  'https://github.com/olifolkerd/tabulator

In [93]:
ajax_dataset=pd.DataFrame(Ajax_topic)

In [94]:
ajax_dataset

Unnamed: 0,username,repo_name,star_rating,url
0,ljianshu,Blog,7700,https://github.com/ljianshu/Blog
1,metafizzy,infinite-scroll,7300,https://github.com/metafizzy/infinite-scroll
2,developit,unfetch,5600,https://github.com/developit/unfetch
3,olifolkerd,tabulator,5600,https://github.com/olifolkerd/tabulator
4,jquery-form,form,5200,https://github.com/jquery-form/form
5,Studio-42,elFinder,4500,https://github.com/Studio-42/elFinder
6,elbywan,wretch,4100,https://github.com/elbywan/wretch
7,dwyl,learn-to-send-email-via-google-script-html-no-...,3000,https://github.com/dwyl/learn-to-send-email-vi...
8,ded,reqwest,2900,https://github.com/ded/reqwest
9,wendux,ajax-hook,2400,https://github.com/wendux/ajax-hook


In [95]:
ajax_dataset.to_csv('ajax.csv',index=False)  # index=False keeps the row numbers out of the file

#### algorithm

In [96]:
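def get_info(topic_tag,stars):
    inner_topic_tag=topic_tag.find_all('a')
    username=inner_topic_tag[0].text.strip()
    repo_name=inner_topic_tag[1].text.strip()
    url='https://github.com'+inner_topic_tag[1]['href']
    star=star_rating(stars.text.strip())  # use the tag passed in, not the globals star_tag and i
    return username,repo_name,url,star

In [97]:
# Download and parse the Algorithm topic page first, so that topic_tag and stars refer to it.
algo_page=BeautifulSoup(requests.get(topic_urls[2]).content,'lxml')
topic_tag=algo_page.find_all('h3',class_='f3 color-fg-muted text-normal lh-condensed')
stars=algo_page.find_all('span',class_='Counter js-social-count')
get_info(topic_tag[0],stars[0])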

In [None]:
Algo_topic={'username':[],'repo_name':[],'star_rating':[],'url':[]}
for i in range(len(topic_tag)):
    Topic_info=get_info(topic_tag[i],stars[i])
    Algo_topic['username'].append(Topic_info[0])
    Algo_topic['repo_name'].append(Topic_info[1])
    Algo_topic['star_rating'].append(Topic_info[3])
    Algo_topic['url'].append(Topic_info[2])
Algo_topic

In [None]:
algo_dataset=pd.DataFrame(Algo_topic)
algo_dataset

In [None]:
algo_dataset.to_csv('Algorithm.csv',index=False)  # index=False keeps the row numbers out of the file

In [None]:
topic_urls[3]

In [None]:
url=topic_urls[3]

In [102]:
import requests

In [103]:
response=requests.get(url)

In [104]:
from bs4 import BeautifulSoup

In [105]:
topic=BeautifulSoup(response.content,'lxml')

In [None]:
h_tag=topic.find_all('h3',class_='f3 color-fg-muted text-normal lh-condensed')
h_tag[:1]

In [None]:
a_tag=h_tag[0].find_all('a')
a_tag

In [None]:
username=a_tag[0].text.strip()

In [None]:
repo_name=a_tag[1].text.strip()

In [None]:
a_tag[1]['href']

In [None]:
url=a_tag[1]['href']
repo_url='https://github.com'+a_tag[1]['href']
repo_url

In [None]:
star_tag=topic.find_all('span',class_='Counter js-social-count')

In [None]:
star_tag[-1].text

In [None]:
def get_stars(star_str):
    star_str=star_str.strip()  # strip() returns a new string, so keep the result
    if star_str[-1]=='k':
        return int(float(star_str[:-1])*1000)
    return int(star_str)

In [None]:
star_count=get_stars(star_tag[1].text)  # not named star_rating, which would overwrite the function defined earlier

In [None]:
star_count

In [116]:
def get_topic_col(tag,star_tag):
    a_tag=tag.find_all('a')
    username=a_tag[0].text.strip()
    repo_name=a_tag[1].text.strip()
    repo_url='https://github.com'+a_tag[1]['href']
    star_rating=get_stars(star_tag.text.strip())  # get_stars expects a string, not a tag
    return username,repo_name,repo_url,star_rating

In [117]:
get_topic_col(h_tag[0],star_tag[0])  # h_tag and star_tag come from the cells above, which must be run first

In [None]:
cols=get_topic_col(h_tag[1],star_tag[1])

In [None]:
cols[0]

In [None]:
cols[1]

In [None]:
Topics={'username':[],'repo_name':[],'repo_url':[],'star':[]}
for i in range(len(h_tag)):
    cols=get_topic_col(h_tag[i],star_tag[i])
    Topics['username'].append(cols[0])
    Topics['repo_name'].append(cols[1])
    Topics['repo_url'].append(cols[2])
    Topics['star'].append(cols[3])

In [None]:
topic_urls[:5]

In [None]:
topic_urls[5]

In [None]:
url=topic_urls[0]

In [None]:
url

In [143]:
def topic_doc(url):
    # Download a topic page and parse it; fail loudly on a non-200 response.
    response=requests.get(url)
    if response.status_code!=200:
        raise Exception('Failed to load page {}'.format(url))
    data=BeautifulSoup(response.content,'lxml')
    return data


def get_stars(star_tag):
    # Convert a star string such as '93.4k' into an integer.
    star_tag=star_tag.strip()
    if star_tag[-1]=='k':
        return int(float(star_tag[:-1])*1000)
    return int(star_tag)


def get_topic_col(h_tag,star_tag):
    a_tag=h_tag.find_all('a')
    username=a_tag[0].text.strip()
    repo_name=a_tag[1].text.strip()
    repo_url='https://github.com'+a_tag[1]['href']
    star_rating=get_stars(star_tag.text.strip())
    return username,repo_name,repo_url,star_rating


def topic_link(data):
    h_tag=data.find_all('h3',class_='f3 color-fg-muted text-normal lh-condensed')
    # Search the page that was passed in, not the global `topic`, so the stars match the repositories.
    star_tag=data.find_all('span',class_='Counter js-social-count')

    Topics={'username':[],'repo_name':[],'rating':[],'repo_url':[]}
    for i in range(len(h_tag)):
        cols=get_topic_col(h_tag[i],star_tag[i])
        Topics['username'].append(cols[0])
        Topics['repo_name'].append(cols[1])
        Topics['repo_url'].append(cols[2])
        Topics['rating'].append(cols[3])
    return pd.DataFrame(Topics)

In [147]:
url=topic_urls[1]

In [148]:
page_ajax=topic_doc(url)

In [149]:
topic_link(page_ajax)

Unnamed: 0,username,repo_name,rating,repo_url
0,ljianshu,Blog,7700,https://github.com/ljianshu/Blog
1,metafizzy,infinite-scroll,7300,https://github.com/metafizzy/infinite-scroll
2,developit,unfetch,5600,https://github.com/developit/unfetch
3,olifolkerd,tabulator,5600,https://github.com/olifolkerd/tabulator
4,jquery-form,form,5200,https://github.com/jquery-form/form
5,Studio-42,elFinder,4500,https://github.com/Studio-42/elFinder
6,elbywan,wretch,4100,https://github.com/elbywan/wretch
7,dwyl,learn-to-send-email-via-google-script-html-no-...,3000,https://github.com/dwyl/learn-to-send-email-vi...
8,ded,reqwest,2900,https://github.com/ded/reqwest
9,wendux,ajax-hook,2400,https://github.com/wendux/ajax-hook


In [175]:
main_url=topic_urls[0]

In [176]:
main_page=topic_doc(main_url)

In [177]:
topic_link(main_page)

Unnamed: 0,username,repo_name,rating,repo_url
0,mrdoob,three.js,93400,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,23300,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21700,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,21100,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,17500,https://github.com/ssloy/tinyrenderer
5,lettier,3d-game-shaders-for-beginners,15800,https://github.com/lettier/3d-game-shaders-for...
6,aframevr,aframe,15600,https://github.com/aframevr/aframe
7,FreeCAD,FreeCAD,14600,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,10700,https://github.com/CesiumGS/cesium
9,metafizzy,zdog,10000,https://github.com/metafizzy/zdog


In [183]:
# Redefine topic_link to return the raw tags instead of a DataFrame,
# so the next cell can build the dictionary step by step.
def topic_link(data):
    h_tag=data.find_all('h3',class_='f3 color-fg-muted text-normal lh-condensed')
    star_tag=data.find_all('span',class_='Counter js-social-count')
    return h_tag,star_tag

In [185]:
h_tag,star_tag=topic_link(main_page)  # unpack the tags before looping over them
Topics={'username':[],'repo_name':[],'rating':[],'repo_url':[]}
for i in range(len(h_tag)):
    cols=get_topic_col(h_tag[i],star_tag[i])
    Topics['username'].append(cols[0])
    Topics['repo_name'].append(cols[1])
    Topics['repo_url'].append(cols[2])
    Topics['rating'].append(cols[3])
Topics
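
The end-to-end step from the project outline (one CSV file per topic) can be sketched as below. This is an outline under assumptions rather than the notebook's final code: `fetch_rows` is a hypothetical callable that returns the rows for one topic URL (in this notebook it would wrap `topic_doc` and the tag-parsing helpers), and `scrape_topics` and the file-naming scheme are illustrative.

```python
import csv
import os
import tempfile

FIELDS = ["username", "repo_name", "rating", "repo_url"]

def scrape_topics(topic_urls, fetch_rows, folder):
    """Write one CSV per topic URL and return the paths that were written."""
    written = []
    for url in topic_urls:
        # 'https://github.com/topics/3d' -> '3d'
        name = url.rstrip("/").rsplit("/", 1)[-1]
        path = os.path.join(folder, f"{name}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(fetch_rows(url))
        written.append(path)
    return written

# Offline demonstration with a stand-in for the real download-and-parse step.
def fake_fetch_rows(url):
    return [{"username": "mrdoob", "repo_name": "three.js", "rating": 93400,
             "repo_url": "https://github.com/mrdoob/three.js"}]

with tempfile.TemporaryDirectory() as folder:
    paths = scrape_topics(
        ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
        fake_fetch_rows,
        folder,
    )
    file_names = [os.path.basename(p) for p in paths]
    with open(paths[0], newline="", encoding="utf-8") as f:
        first_rows = list(csv.DictReader(f))
print(file_names)
```

Passing the fetching step in as a function keeps the file-writing logic testable without network access; when scraping many topics for real, a short pause between requests is a reasonable courtesy to the site.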