 # Scraping Top Repositories of Topics from GitHub 
 
### About Web Scraping

"Web Scraping is an automatic method to obtain large amounts of data from websites. Most of this data is unstructured data in an HTML format which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications. There are many different ways to perform web scraping to obtain data from websites. These include using online services, particular APIs or even creating your code for web scraping from scratch."
 

### Project Outline

- I will scrape https://github.com/topics.
- I will get a list of topics. For each topic, I'll fetch the topic's title, page URL, and description.
- For each topic, I'll get the top repositories from the topic page (the first page listed 30 repositories in the run shown below).
- For each repository, I'll grab the repo name, username, stars, and repo URL.
- For each topic, I'll create a CSV file in the following format:
  - `Repo Name,Username,Stars,Repo URL`
  - `three.js,mrdoob,69700,https://github.com/mrdoob/three.js`
  - `libgdx,libgdx,18300,https://github.com/libgdx/libgdx`
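
The CSV layout described above can be sketched with Python's standard `csv` module. This is only an illustration under stated assumptions: the two sample rows are copied from the outline, the `rows_to_csv` helper name is hypothetical, and the output goes to an in-memory buffer rather than a real file.

```python
import csv
import io

# Column order taken from the project outline above.
HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]

def rows_to_csv(rows):
    """Serialize repository rows to CSV text (hypothetical helper for illustration)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()

sample_rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

The notebook itself writes its files through pandas; the `csv` module is used here only to keep the sketch free of third-party dependencies.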
    

## Modus Operandi

- I have done this project in two parts:
  - First part: a detailed, step-by-step walkthrough of parsing data from the different layers of a web page.
  - Second part: the main code, where I define functions that do the same work more concisely.
- I will be using Python, Requests, Beautiful Soup, pandas, and the `os` module.

# Part - 1

## Sub Part - 1.(A)

###   a) Used the "requests" library to download the web page

In [1]:
!pip install requests --upgrade --quiet

In [2]:
import requests

In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url)

In [5]:
response.status_code

200

In [6]:
len(response.text)

134501

In [7]:
page_contents = response.text

In [8]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-+z3z7w/QKK6v7DS9Y7YG7e3neIfYqIJaOykTRwMq4TdhAIQ7h3n7TCXttcuZDvdnaWPJV44oKM5vmkLhHO2ZHA==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-fb3df3ef0fd028aeafec34bd63b606ed.css" />\n  \n    <link crossorigin="anonymous" media="all" integrity="sha512-fT2DQxqNhDV1nNx3O++a8GI7qYYEW9SXVa1DFqueH7oHuDL9whJKb3TOAvI6HA8fH6nRcesnmlOqZEkTo1i0Ig==" rel="stylesheet" href="https://github.githubassets.com/assets/behaviors-7d3d83431a8d8435759cdc773bef9af0.css" />\n    \n    \

In [9]:
with open('webpage.html','w',encoding="utf-8") as f:
    f.write(page_contents)
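
Saving the page makes it possible to re-run the parsing steps without downloading it again. A minimal sketch of that idea, assuming a hypothetical `load_or_fetch` helper and a stand-in `fetch` callable (in the notebook, that would be a small function wrapping `requests.get`):

```python
from pathlib import Path
import tempfile

def load_or_fetch(path, fetch):
    """Return cached HTML from `path` if it exists; otherwise call `fetch()` and cache it.

    `fetch` is any zero-argument callable returning the page text (an assumption
    made so the sketch runs without network access).
    """
    path = Path(path)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch()
    path.write_text(html, encoding="utf-8")
    return html

# Demonstration with a fake fetcher and a temporary directory.
calls = []

def fake_fetch():
    calls.append(1)
    return "<html><body>topics</body></html>"

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "webpage.html"
    first = load_or_fetch(cache_file, fake_fetch)   # calls fake_fetch and writes the file
    second = load_or_fetch(cache_file, fake_fetch)  # served from the saved file
```

The fake fetcher is called only once; the second call reads the saved file instead.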

###   b)  Used  Beautiful Soup to parse and extract information

In [10]:
from bs4 import BeautifulSoup

In [11]:
parsed_doc = BeautifulSoup(page_contents, 'html.parser')

In [12]:
type(parsed_doc)

bs4.BeautifulSoup

###   c) Selects the Titles from GitHub

In [13]:
selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
topic_title_tags = parsed_doc.find_all('p', {'class': selection_class})

In [14]:
len(topic_title_tags)

30
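
Passing the full class string to `find_all` matches only tags whose `class` attribute is exactly that string, so the selection can silently return nothing if GitHub adds, removes, or reorders a class. A hedged alternative is a CSS selector that requires only some of the classes. The HTML snippet below is made up for illustration, and treating `f3` and `lh-condensed` as the stable classes is an assumption:

```python
from bs4 import BeautifulSoup

# Made-up snippet: the second tag has an extra class and a different class order.
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="lh-condensed f3 mb-0 mt-1 Link--primary extra">Ajax</p>
<p class="f5 color-text-secondary mb-0 mt-1">A description</p>
"""
doc = BeautifulSoup(html, "html.parser")

# Exact class-string match, as in the cell above: only the first tag qualifies.
exact = doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})

# CSS selector: any <p> carrying both classes, in any order.
flexible = doc.select("p.f3.lh-condensed")

exact_titles = [tag.text for tag in exact]
flexible_titles = [tag.text for tag in flexible]
print(exact_titles, flexible_titles)
```

Either approach still depends on GitHub's markup, so checking the number of matched tags after each selection remains a useful sanity test.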

In [15]:
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

###   d) Selects the Description of the Topic/Title

In [16]:
desc_selector = "f5 color-text-secondary mb-0 mt-1"
topic_desc_tags = parsed_doc.find_all('p', {'class': desc_selector})

In [17]:
topic_desc_tags[:5]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

###  e)  Selects the Link of the Titles Tags

In [18]:
link_selector = 'd-flex no-underline'
topic_link_tags = parsed_doc.find_all('a', {'class': link_selector })

In [19]:
len(topic_link_tags)

30

In [20]:
topic_link_tags[0]

<a class="d-flex no-underline" data-ga-click="Explore, go to 3d, location:All featured topics" href="/topics/3d">
<div class="color-bg-info f4 color-text-tertiary text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
<div class="d-sm-flex flex-auto">
<div class="flex-auto">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-text-secondary mb-0 mt-1">
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            </p>
</div>
<div class="d-inline-block js-toggler-container starring-container">
<a aria-label="You must be signed in to star a topic" class="btn btn-sm d-flex flex-items-center" data-ga-click="Explore, click star button when signed out,
        action:topics#index;
        text:Star" href="/login?return_to=%2Ftopics%2F3d" title="You must be signed in to star a topic">
<svg aria-hidden="true" class="octicon octicon-star

In [21]:
topic_link_tags[0]['href']

'/topics/3d'

In [22]:
topic0_url = 'https://github.com' + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
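
String concatenation works here because every `href` on the page is a root-relative path. As a slightly more defensive sketch, the standard library's `urljoin` also copes with an `href` that is already absolute. The `absolute_url` helper name is hypothetical:

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def absolute_url(href, base=BASE_URL):
    """Resolve a possibly relative href against the site root."""
    return urljoin(base, href)

relative = absolute_url("/topics/3d")
already_absolute = absolute_url("https://github.com/topics/ajax")
print(relative)
print(already_absolute)
```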


###  f)  Creates a list of Titles

In [23]:
topic_titles= []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


###  g)  Creates a list of descriptions of Titles

In [24]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())

topic_descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']
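
`strip()` removes only leading and trailing whitespace. If a description ever wraps across several lines inside the tag, the inner newlines and indentation would survive. A small sketch of a stricter clean-up, using a made-up multi-line string and a hypothetical `clean_text` helper:

```python
def clean_text(raw):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(raw.split())

raw_desc = "\n               Ajax is a technique for creating\n   interactive web applications.\n             "
cleaned = clean_text(raw_desc)
print(cleaned)
```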

###   h) Creates a list of urls of Titles

In [25]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [26]:
import pandas as pd

###  i)  Creates a dictionary of Title name ,its description and its url

In [27]:
topic_dict = {
        'Title':topic_titles,
        'Description': topic_descs,
        'Url': topic_urls
}

In [28]:
topic_dict

{'Title': ['3D',
  'Ajax',
  'Algorithm',
  'Amp',
  'Android',
  'Angular',
  'Ansible',
  'API',
  'Arduino',
  'ASP.NET',
  'Atom',
  'Awesome Lists',
  'Amazon Web Services',
  'Azure',
  'Babel',
  'Bash',
  'Bitcoin',
  'Bootstrap',
  'Bot',
  'C',
  'Chrome',
  'Chrome extension',
  'Command line interface',
  'Clojure',
  'Code quality',
  'Code review',
  'Compiler',
  'Continuous integration',
  'COVID-19',
  'C++'],
 'Description': ['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
  'Ajax is a technique for creating interactive web applications.',
  'Algorithms are self-contained sequences that carry out a variety of tasks.',
  'Amp is a non-blocking concurrency framework for PHP.',
  'Android is an operating system built by Google designed for mobile devices.',
  'Angular is an open source web application platform.',
  'Ansible is a simple and powerful automation engine.',
  'An API (Application Programming Interface) is a coll

In [29]:
topics_df = pd.DataFrame(topic_dict)
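
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one of the selectors misses a tag. A minimal sketch of a check that reports which column is short before the frame is built; the `check_equal_lengths` name and the sample data are assumptions for illustration:

```python
def check_equal_lengths(columns):
    """Raise a descriptive error if the column lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return lengths

good = {"Title": ["3D", "Ajax"], "Description": ["d1", "d2"], "Url": ["u1", "u2"]}
bad = {"Title": ["3D", "Ajax"], "Description": ["d1"], "Url": ["u1", "u2"]}

good_lengths = check_equal_lengths(good)
try:
    check_equal_lengths(bad)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(good_lengths)
print(error_message)
```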

###  j)  Creates a  Dataframe of Titles/topics

In [30]:
topics_df

| | Title | Description | Url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


###  k) Creates a CSV file containing the list of titles with description and their urls

In [31]:
topics_df.to_csv('github.csv', index= None)
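
The `index` argument controls whether the row labels are written as an extra first column. The sketch below uses a made-up two-row frame and in-memory buffers instead of `github.csv`; it shows that a written index comes back as an `Unnamed: 0` column when the CSV is read again:

```python
import io
import pandas as pd

df = pd.DataFrame({
    "Title": ["3D", "Ajax"],
    "Url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the row labels out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```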

## Sub Part - 1.(B)

##  Getting Information out of a Title/Topic page

###   a)  Uses first topics page url and parses it

In [32]:
topic_page_url = topic_urls[0]

In [33]:
topic_page_url

'https://github.com/topics/3d'

In [34]:
response = requests.get(topic_page_url)

In [35]:
response.status_code

200

In [36]:
len(response.text)

614872

In [37]:
parsed_topic_doc = BeautifulSoup(response.text, 'html.parser')

###  b)  Parsing its Repository Tag

In [38]:
information_class ="f3 color-text-secondary text-normal lh-condensed"

repstry_tags = parsed_topic_doc.find_all( 'h3', {'class': information_class})

In [39]:
len(repstry_tags)

30

In [40]:
repstry_tags[0]

<h3 class="f3 color-text-secondary text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904

In [41]:
a_tags = repstry_tags[0].find_all('a')

In [42]:
a_tags[0]

<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>

In [43]:
a_tags[0].text.strip()

'mrdoob'

In [44]:
a_tags[1].text.strip()

'three.js'

###  c)  Parses its repository url

In [45]:
base_url = 'https://github.com'
repstry_url = base_url + a_tags[1]['href']

In [46]:
repstry_url

'https://github.com/mrdoob/three.js'

In [47]:
print(repstry_url)

https://github.com/mrdoob/three.js


###  d)  Parses the repositories' star ratings

In [48]:
star_rating_class = "social-count float-none"
star_rating_tags = parsed_topic_doc.find_all('a',{'class': star_rating_class})

In [49]:
len(star_rating_tags)

30

In [50]:
star_rating_tags[0].text.strip()

'73.1k'

###  e)  Converts star ratings from string to number

In [51]:
def parse_star_count(stars_str):
    # Convert a star label such as '73.1k' or '987' into an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    else:
        return int(stars_str)

In [52]:
parse_star_count(star_rating_tags[0].text)

73100
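
The function above covers the `k` suffix seen on this page. A more defensive sketch might also accept thousands separators, reject empty labels explicitly, and use `Decimal` so the multiplication is exact before the value is converted to an integer. The `m` suffix and the `parse_count` name are assumptions, not something the notebook encounters:

```python
from decimal import Decimal

# Multipliers for the suffixes handled here; 'm' is an assumption,
# the notebook itself only handles 'k'.
SUFFIXES = {"k": 1_000, "m": 1_000_000}

def parse_count(label):
    """Convert labels such as '73.1k', '1,234' or '987' to an integer."""
    text = label.strip().lower().replace(",", "")
    if not text:
        raise ValueError("empty count label")
    multiplier = SUFFIXES.get(text[-1])
    if multiplier is not None:
        # Decimal keeps '73.1' exact, so the product is exactly 73100.
        return int(Decimal(text[:-1]) * multiplier)
    return int(text)

examples = {label: parse_count(label) for label in ["73.1k", "1,234", " 987 ", "1.2m"]}
print(examples)
```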

###  f)  Function to fetch a repository's information for the first Topic/Title

In [53]:
def get_repstry_info(repstry_tags, star_rating_tags):
    # returns all the required info about a repository
    a_tags = repstry_tags.find_all('a')
    username = a_tags[0].text.strip()
    repstry_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_rating_tags.text.strip())
    return username, repstry_name, stars, repo_url
    

In [54]:
get_repstry_info(repstry_tags[0], star_rating_tags[0])

('mrdoob', 'three.js', 73100, 'https://github.com/mrdoob/three.js')
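
The function returns a plain tuple, so callers have to remember the position of each field. One possible refinement, shown here only as a sketch with a hypothetical `RepoInfo` type, is a named tuple that still behaves like a tuple but exposes the fields by name:

```python
from collections import namedtuple

# Field order mirrors the tuple returned by get_repstry_info.
RepoInfo = namedtuple("RepoInfo", ["username", "repo_name", "stars", "repo_url"])

info = RepoInfo("mrdoob", "three.js", 73100, "https://github.com/mrdoob/three.js")
as_dict = info._asdict()

print(info.stars)   # access by name
print(info[1])      # positional access still works
```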

### g)  Builds a dictionary containing the list of repositories and their information for the first Topic/Title

In [55]:
topic_repstry_dict = {
    'Username': [],
    'Repstry_Name': [],
    'Stars': [],
    'Repstry_url': []
}

for i in range(len(repstry_tags)):
    repstry_info = get_repstry_info(repstry_tags[i], star_rating_tags[i])
    topic_repstry_dict['Username'].append(repstry_info[0])
    topic_repstry_dict['Repstry_Name'].append(repstry_info[1])
    topic_repstry_dict['Stars'].append(repstry_info[2])
    topic_repstry_dict['Repstry_url'].append(repstry_info[3])
    

In [56]:
topic_repstry_dict

{'Username': ['mrdoob',
  'libgdx',
  'BabylonJS',
  'pmndrs',
  'aframevr',
  'ssloy',
  'lettier',
  'FreeCAD',
  'metafizzy',
  'CesiumGS',
  'timzhang642',
  'a1studmuffin',
  'isl-org',
  'spritejs',
  'tensorspace-team',
  'jagenjo',
  'YadiraF',
  'AaronJackson',
  'openscad',
  'domlysz',
  'ssloy',
  'mosra',
  'google',
  'gfxfundamentals',
  'blender',
  'cleardusk',
  'jasonlong',
  'antvis',
  'cnr-isti-vclab',
  'pissang'],
 'Repstry_Name': ['three.js',
  'libgdx',
  'Babylon.js',
  'react-three-fiber',
  'aframe',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'zdog',
  'cesium',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'Open3D',
  'spritejs',
  'tensorspace',
  'webglstudio.js',
  'PRNet',
  'vrn',
  'openscad',
  'BlenderGIS',
  'tinyraytracer',
  'magnum',
  'model-viewer',
  'webgl-fundamentals',
  'blender',
  '3DDFA',
  'isometric-contributions',
  'L7',
  'meshlab',
  'claygl'],
 'Stars': [73100,
  18700,
  14500,
  14100,
  12900,
 

In [57]:
topic_repstry_df = pd.DataFrame(topic_repstry_dict)
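
The index-based loop that fills the dictionary can also be expressed as a transposition of rows into columns. A sketch with plain tuples standing in for the values that `get_repstry_info` returns; the `rows_to_columns` name is hypothetical:

```python
COLUMNS = ["Username", "Repstry_Name", "Stars", "Repstry_url"]

def rows_to_columns(rows, column_names=COLUMNS):
    """Turn a list of row tuples into a dict of column lists."""
    columns = {name: [] for name in column_names}
    for row in rows:
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns

# Stand-ins for what get_repstry_info returns for each repository.
rows = [
    ("mrdoob", "three.js", 73100, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18700, "https://github.com/libgdx/libgdx"),
]
columns = rows_to_columns(rows)
print(columns["Stars"])
```

In the notebook, the rows could be produced with `[get_repstry_info(r, s) for r, s in zip(repstry_tags, star_rating_tags)]`, which avoids indexing both lists by position.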

###  h)  Creates Dataframe containing repositories of the first Title/Topic

In [58]:
topic_repstry_df

| | Username | Repstry_Name | Stars | Repstry_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 73100 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 18700 | https://github.com/libgdx/libgdx |
| 2 | BabylonJS | Babylon.js | 14500 | https://github.com/BabylonJS/Babylon.js |
| 3 | pmndrs | react-three-fiber | 14100 | https://github.com/pmndrs/react-three-fiber |
| 4 | aframevr | aframe | 12900 | https://github.com/aframevr/aframe |
| 5 | ssloy | tinyrenderer | 11000 | https://github.com/ssloy/tinyrenderer |
| 6 | lettier | 3d-game-shaders-for-beginners | 10800 | https://github.com/lettier/3d-game-shaders-for... |
| 7 | FreeCAD | FreeCAD | 9600 | https://github.com/FreeCAD/FreeCAD |
| 8 | metafizzy | zdog | 8600 | https://github.com/metafizzy/zdog |
| 9 | CesiumGS | cesium | 7300 | https://github.com/CesiumGS/cesium |


# Part - 2

#  The Concise Code

###  a) Functions to scrape the list of repositories and their details for different Titles/Topics

In [67]:
import os


def get_topics_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
        
    # Parse using Beautiful soup
    parsed_topic_doc = BeautifulSoup(response.text, 'html.parser')
    return parsed_topic_doc

def get_repstry_info(repstry_tags, star_rating_tags):
    # returns all the required info about a repository
    a_tags = repstry_tags.find_all('a')
    username = a_tags[0].text.strip()
    repstry_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_rating_tags.text.strip())
    return username, repstry_name, stars, repo_url    



def get_topics_repstry(parsed_topic_doc):  
    # Get h3 tags containing repository title, repository URL and Username
    information_class ="f3 color-text-secondary text-normal lh-condensed"
    repstry_tags = parsed_topic_doc.find_all( 'h3', {'class': information_class})
   
    # Get star tags
    star_rating_class = "social-count float-none"
    star_rating_tags = parsed_topic_doc.find_all('a',{'class': star_rating_class})
    
    # Create a dictionary of column lists
    topic_repstry_dict = {
        'Username': [],
        'Repstry_Name': [],
        'Stars': [],
        'Repstry_url': []
    }
    
    # Get Repository Information
    for i in range(len(repstry_tags)):
        repstry_info = get_repstry_info(repstry_tags[i], star_rating_tags[i])
        topic_repstry_dict['Username'].append(repstry_info[0])
        topic_repstry_dict['Repstry_Name'].append(repstry_info[1])
        topic_repstry_dict['Stars'].append(repstry_info[2])
        topic_repstry_dict['Repstry_url'].append(repstry_info[3])
    
    return pd.DataFrame(topic_repstry_dict)

def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topics_repstry(get_topics_page(topic_url))
    # Leave out the index so the file matches the CSV format in the project outline
    topic_df.to_csv(path, index=None)
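
`get_topics_page` raises as soon as a response is not 200, so a single temporary failure would stop a long scraping run. A hedged sketch of a retry wrapper follows; the `fetch_with_retries` name, the linear backoff, and the injectable `fetch` and `sleep` callables are all assumptions chosen so the example runs without network access:

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times, sleeping between failures."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:  # in real use, catch a narrower exception type
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds * attempt)  # simple linear backoff
    raise last_error

# Demonstration with a fake fetcher that fails twice, then succeeds.
state = {"calls": 0}

def flaky_fetch(url):
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("temporary failure")
    return "<html>{}</html>".format(url)

result = fetch_with_retries(flaky_fetch, "https://github.com/topics/3d", sleep=lambda s: None)
print(result, state["calls"])
```

In the notebook, `fetch` could be `get_topics_page` itself, since it already raises on a failed download.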
    

In [68]:
topic_urls[2]

'https://github.com/topics/algorithm'

In [69]:
get_topics_repstry( get_topics_page(topic_urls[2]))

| | Username | Repstry_Name | Stars | Repstry_url |
|---|---|---|---|---|
| 0 | jwasham | coding-interview-university | 188000 | https://github.com/jwasham/coding-interview-un... |
| 1 | CyC2018 | CS-Notes | 136000 | https://github.com/CyC2018/CS-Notes |
| 2 | trekhleb | javascript-algorithms | 115000 | https://github.com/trekhleb/javascript-algorithms |
| 3 | TheAlgorithms | Python | 114000 | https://github.com/TheAlgorithms/Python |
| 4 | yangshun | tech-interview-handbook | 55800 | https://github.com/yangshun/tech-interview-han... |
| 5 | kdn251 | interviews | 53200 | https://github.com/kdn251/interviews |
| 6 | azl397985856 | leetcode | 43300 | https://github.com/azl397985856/leetcode |
| 7 | algorithm-visualizer | algorithm-visualizer | 35100 | https://github.com/algorithm-visualizer/algori... |
| 8 | crossoverJie | JCSprout | 26300 | https://github.com/crossoverJie/JCSprout |
| 9 | donnemartin | interactive-coding-challenges | 23200 | https://github.com/donnemartin/interactive-cod... |


###  Saves a topic's repositories and their details to a CSV file

In [70]:
get_topics_repstry( get_topics_page(topic_urls[2])).to_csv('algorith.csv', index = None)

## Write a single function to:

1. Get the list of topics from the topics page.
2. Get the list of top repositories from the individual topic pages.
3. For each topic, create a CSV of the top repositories of that particular topic.

###   b)  Functions to scrape details of all the topics

In [71]:
def get_topic_titles(parsed_doc):
    selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = parsed_doc.find_all('p', {'class': selection_class})
    ## Parsing Topics Titles
    topic_titles= []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(parsed_doc):
    desc_selector = "f5 color-text-secondary mb-0 mt-1"
    topic_desc_tags = parsed_doc.find_all('p', {'class': desc_selector})
    ## Parsing Topics Description
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(parsed_doc):
    link_selector = 'd-flex no-underline'
    topic_link_tags = parsed_doc.find_all('a', {'class': link_selector })
    ## Parsing Topics Url
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
    
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the freshly downloaded page instead of relying on a global parsed_doc
    parsed_doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'Title': get_topic_titles(parsed_doc),
        'Description': get_topic_descs(parsed_doc),
        'url': get_topic_urls(parsed_doc)
    }
    return pd.DataFrame(topics_dict)

In [72]:
def scrape_topics_repos():
    print("Scraping list of topics")
    topics_df = scrape_topics()
    
    ## Creating a folder
    os.makedirs('Github_Scrapped_Data', exist_ok = True)
    for index, row in topics_df.iterrows():
        print('scraping top repositories for "{}"'.format(row['Title']))
        scrape_topic(row['url'], 'Github_Scrapped_Data/{}.csv'.format(row['Title']))
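
Using the display title as a file name works for the topics listed here, but titles contain spaces and punctuation, and a title with a path separator would break the output path. One possible alternative, sketched below with hypothetical `topic_slug` and `csv_path_for` helpers, is to name each file after the last segment of the topic URL, which is already URL-safe:

```python
import os
from urllib.parse import urlparse

def topic_slug(topic_url):
    """Return the last path segment of a topic URL, e.g. 'chrome-extension'."""
    return urlparse(topic_url).path.rstrip("/").split("/")[-1]

def csv_path_for(topic_url, folder="Github_Scrapped_Data"):
    """Build the output path from the URL slug instead of the display title."""
    return os.path.join(folder, topic_slug(topic_url) + ".csv")

slug = topic_slug("https://github.com/topics/chrome-extension")
path = csv_path_for("https://github.com/topics/aspnet")
print(slug)
print(path)
```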

###  c)  Finally, scrapes every topic and writes the CSV files

In [73]:
scrape_topics_repos()

Scraping list of topics
scraping top repositories for "3D"
scraping top repositories for "Ajax"
scraping top repositories for "Algorithm"
scraping top repositories for "Amp"
scraping top repositories for "Android"
scraping top repositories for "Angular"
scraping top repositories for "Ansible"
scraping top repositories for "API"
scraping top repositories for "Arduino"
scraping top repositories for "ASP.NET"
scraping top repositories for "Atom"
scraping top repositories for "Awesome Lists"
scraping top repositories for "Amazon Web Services"
scraping top repositories for "Azure"
scraping top repositories for "Babel"
scraping top repositories for "Bash"
scraping top repositories for "Bitcoin"
scraping top repositories for "Bootstrap"
scraping top repositories for "Bot"
scraping top repositories for "C"
scraping top repositories for "Chrome"
scraping top repositories for "Chrome extension"
scraping top repositories for "Command line interface"
scraping top repositories for "Clojure"
scrapin

# THE END