
#### Pick a website and describe your objective
Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
Summarize your project idea and outline your strategy in a Jupyter notebook.

## PROJECT OUTLINE
--We're going to scrape https://github.com/topics

--We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description

--For each topic, we'll get the top 25 repositories in the topic from the topic page

--For each repository, we'll grab the repo name, username, stars and repo URL

--For each topic we'll create a CSV file in the following format:

     Repo Name,Username,Stars,Repo URL
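
As a rough illustration of that target format, the sketch below writes rows under the same header with Python's standard `csv` module. The `write_repos_csv` helper and the sample row are made up for illustration; they are not part of the notebook.

```python
import csv
import io


def write_repos_csv(rows, file_obj):
    """Write repo rows under the header described in the outline."""
    writer = csv.writer(file_obj)
    writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
    writer.writerows(rows)


# Hypothetical sample row, only to show the layout.
sample_rows = [
    ("example-repo", "example-user", 1200,
     "https://github.com/example-user/example-repo"),
]

buffer = io.StringIO()
write_repos_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the example side-effect free; passing an open file object instead would produce the per-topic CSV file.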

In [None]:
import requests

In [None]:
from bs4 import BeautifulSoup

In [None]:
github_topics = "https://github.com/topics"

-- GET THE HTML CONTENTS

In [None]:
r = requests.get(github_topics)

In [None]:
html_content = r.text

In [None]:
# A successful request returns status code 200
r.status_code

200
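
Checking the status code by eye works in a notebook, but a script would need to stop on failure. The sketch below wraps the check in a hypothetical `fetch_html` helper. It takes the `get` callable as an argument (for example `requests.get`) so the logic can be exercised without network access. The helper name, the timeout value, and the `FakeResponse` stand-in are assumptions for illustration.

```python
def fetch_html(url, get, timeout=10):
    """Return the page text, or raise if the status code is not 200."""
    response = get(url, timeout=timeout)
    if response.status_code != 200:
        raise RuntimeError(f"Failed to load {url}: status {response.status_code}")
    return response.text


class FakeResponse:
    """Stand-in for a response object, used only for this offline demo."""

    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text


def fake_get(url, timeout=None):
    # Pretend the topics page loads and every other URL is missing.
    if url == "https://github.com/topics":
        return FakeResponse(200, "<html></html>")
    return FakeResponse(404, "")


page = fetch_html("https://github.com/topics", fake_get)
print(page)
```

In the notebook, the equivalent call would be `fetch_html(github_topics, requests.get)`.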

#### PARSE THE HTML CONTENT USING BEAUTIFULSOUP

In [None]:
doc = BeautifulSoup(html_content, 'html.parser')

In [None]:
type(doc)

bs4.BeautifulSoup

### Scraping the topic title tags

In [None]:
topic_selection = "f3 lh-condensed mb-0 mt-1 Link--primary"
topic_title_tag = doc.find_all('p', class_=topic_selection)

In [None]:
topic_title_tag[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]
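
Passing the full class string to `class_` only matches when the `class` attribute is exactly that string, in that order. A CSS selector passed to `select` matches the classes in any order, which may hold up better if the page reorders them. The HTML snippet below is a hand-written stand-in for the real page, not fetched output.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for part of the topics page
# (the second tag has its classes reordered on purpose).
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# Exact class-string match: finds only the tag whose attribute matches verbatim.
exact = sample_doc.find_all("p", class_="f3 lh-condensed mb-0 mt-1 Link--primary")

# CSS selector: matches any <p> carrying both classes, regardless of order.
flexible = sample_doc.select("p.f3.Link--primary")

exact_titles = [tag.text for tag in exact]
flexible_titles = [tag.text for tag in flexible]
print(exact_titles)
print(flexible_titles)
```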

### Scraping the description of each topic

In [None]:
desc_selection = "f5 color-fg-muted mb-0 mt-1"
topic_desc_tags = doc.find_all('p', class_=desc_selection)

In [None]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

### Scraping the link of the topic

In [None]:
link_selection = "no-underline flex-grow-0"
topic_link_tags = doc.find_all('a', class_=link_selection)


In [None]:
"https://www.github.com" +topic_link_tags[0]['href']

'https://www.github.com/topics/3d'
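
String concatenation works here because every `href` starts with `/`. As an alternative sketch, `urllib.parse.urljoin` from the standard library builds the same absolute URL and also leaves an `href` that is already absolute unchanged. The `hrefs` list below is illustrative, not scraped.

```python
from urllib.parse import urljoin

base_url = "https://www.github.com"

# Illustrative hrefs: one relative (as on the topics page), one already absolute.
hrefs = ["/topics/3d", "https://www.github.com/topics/ajax"]

absolute_urls = [urljoin(base_url, href) for href in hrefs]
print(absolute_urls)
```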

In [None]:
# Collect the text of each title tag
topic_title = [tag.text for tag in topic_title_tag]
print(topic_title)


['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [None]:
# Strip the surrounding whitespace from each description
topic_desc = [tag.text.strip() for tag in topic_desc_tags]

In [None]:
topic_desc[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [None]:
# Build the absolute URL of each topic page
base_url = "https://www.github.com"
topic_url = [base_url + tag['href'] for tag in topic_link_tags]
print(topic_url)

['https://www.github.com/topics/3d', 'https://www.github.com/topics/ajax', 'https://www.github.com/topics/algorithm', 'https://www.github.com/topics/amphp', 'https://www.github.com/topics/android', 'https://www.github.com/topics/angular', 'https://www.github.com/topics/ansible', 'https://www.github.com/topics/api', 'https://www.github.com/topics/arduino', 'https://www.github.com/topics/aspnet', 'https://www.github.com/topics/atom', 'https://www.github.com/topics/awesome', 'https://www.github.com/topics/aws', 'https://www.github.com/topics/azure', 'https://www.github.com/topics/babel', 'https://www.github.com/topics/bash', 'https://www.github.com/topics/bitcoin', 'https://www.github.com/topics/bootstrap', 'https://www.github.com/topics/bot', 'https://www.github.com/topics/c', 'https://www.github.com/topics/chrome', 'https://www.github.com/topics/chrome-extension', 'https://www.github.com/topics/cli', 'https://www.github.com/topics/clojure', 'https://www.github.com/topics/code-quality', 

## Create CSV File from extracted data

In [None]:
import pandas as pd

In [None]:
topic_dict = {
    'Title': topic_title,
    'Description': topic_desc,
    'Topic link': topic_url
}
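
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one selector misses a tag. A small check before building the frame, sketched below with a hypothetical `check_equal_lengths` helper and toy data, makes that failure easier to diagnose.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the column lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Toy data standing in for topic_title, topic_desc and topic_url.
toy_columns = {
    "Title": ["3D", "Ajax"],
    "Description": [
        "3D modeling is the process of virtually developing the surface and structure of a 3D object.",
        "Ajax is a technique for creating interactive web applications.",
    ],
    "Topic link": [
        "https://www.github.com/topics/3d",
        "https://www.github.com/topics/ajax",
    ],
}

lengths = check_equal_lengths(toy_columns)
print(lengths)
```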

In [None]:
topic_df = pd.DataFrame(topic_dict)

In [None]:
topic_df

Unnamed: 0,Title,Description,Topic link
0,3D,3D modeling is the process of virtually develo...,https://www.github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://www.github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://www.github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://www.github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://www.github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://www.github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://www.github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://www.github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://www.github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://www.github.com/topics/aspnet


In [None]:
# Save the topics to a CSV file without the DataFrame index column
topic_df.to_csv('topics.csv', index=False)
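
To confirm that the data survives a round trip, one option is to read the CSV back with `pd.read_csv`. The sketch below uses an in-memory buffer and a two-row toy frame instead of the real file, so the data and variable names are illustrative.

```python
import io

import pandas as pd

# Toy frame standing in for topic_df.
toy_df = pd.DataFrame({
    "Title": ["3D", "Ajax"],
    "Topic link": [
        "https://www.github.com/topics/3d",
        "https://www.github.com/topics/ajax",
    ],
})

buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)  # index=False keeps row numbers out of the file
buffer.seek(0)

restored = pd.read_csv(buffer)
restored_columns = restored.columns.tolist()
print(restored_columns)
```

If the index were written, reading the file back would show an extra unnamed leading column.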