### Project: Create a dataset of popular GitHub topics by scraping the site

Web scraping is the process of extracting content and data from a website. Unlike screen scraping, which only copies the pixels displayed onscreen, web scraping works on the underlying HTML and, with it, the data the server has rendered into the page.


When we run the scraping code, it sends a request to the URL we specify. The server responds with the page's HTML or XML. The code then parses that response, finds the data, and extracts it.

Steps to extract data using web scraping with Python:

1. Find the URL that you want to scrape
2. Inspect the page
3. Find the data you want to extract
4. Write the code
5. Run the code and extract the data
6. Store the data in the required format 
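
The steps above can be sketched end to end on a small inline page. The HTML string, the `topic` class name, and the `extract_topics` helper are made up for illustration and stand in for a real downloaded page; only the parsing and extraction steps are shown.

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for a downloaded response (steps 1-2).
SAMPLE_HTML = """
<html><body>
  <div class="topic"><a href="/topics/3d"><p>3D</p></a></div>
  <div class="topic"><a href="/topics/ajax"><p>Ajax</p></a></div>
</body></html>
"""

def extract_topics(html):
    """Parse the HTML and return one dict per topic block (steps 3-5)."""
    doc = BeautifulSoup(html, 'html.parser')
    rows = []
    for block in doc.find_all('div', class_='topic'):
        link = block.find('a')
        rows.append({'title': link.get_text(strip=True), 'href': link['href']})
    return rows

topics = extract_topics(SAMPLE_HTML)
print(topics)
```

Returning a list of dicts keeps each topic's fields together, which makes the final storage step (step 6) a single conversion.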

Importing libraries

In [None]:
#--Web scraping packages
from bs4 import BeautifulSoup
import requests

## To Scrape the GitHub Topics Page

#### To download the webpage

In [None]:
topic_url = 'https://github.com/topics'

In [None]:
response = requests.get(topic_url)   #to download the webpage

In [None]:
# check that the request was successful (200 is the HTTP status code for OK)
response.status_code
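
The status check can also be done in code rather than by eye. The sketch below wraps the download in a hypothetical `fetch_html` helper. The helper name and the 10-second timeout are assumptions for illustration, not part of the notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on a 4xx/5xx status."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text

# Example (needs network access):
# page_contents = fetch_html('https://github.com/topics')
```

Failing early on a bad status keeps an error page from being parsed as if it were the topics page.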

In [None]:
#content of the webpage
page_contents = response.text

In [None]:
len(page_contents)

In [None]:
# show the first 1000 characters
page_contents[:1000]

In [None]:
#to save the above html code as a file
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)
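
Reading the saved file back lets later cells parse the page without downloading it again. A minimal sketch, assuming the content is written and read as UTF-8; the sample string and the temporary directory are stand-ins for `page_contents` and the working directory.

```python
import tempfile
from pathlib import Path

# Hypothetical content standing in for page_contents.
sample_html = '<html><body><p>caf\u00e9 \u2713</p></body></html>'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'webpage.html'
    # An explicit encoding avoids platform-dependent defaults.
    path.write_text(sample_html, encoding='utf-8')
    reloaded = path.read_text(encoding='utf-8')

print(reloaded == sample_html)
```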

#### Use Beautiful Soup to parse and extract information

In [None]:
doc = BeautifulSoup(page_contents,'html.parser')

In [None]:
#to check the type
type(doc)

#### To get the topic title

In [None]:
topic_title_tags = doc.find_all('p',{'class':"f3 lh-condensed mb-0 mt-1 Link--primary"})

In [None]:
len(topic_title_tags)    

In [None]:
# first 5 topic title tags
topic_title_tags[:5]

In [None]:
topic_title_tags[0].text
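
Matching on the full class string works only while the class names stay in that exact order. The sketch below contrasts this with a CSS selector on hypothetical markup. The second `<p>` with reordered classes is invented to show the difference, and the choice of `f3` and `Link--primary` as identifying classes is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same classes, different order in the second tag.
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="lh-condensed f3 mb-0 mt-1 Link--primary">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
"""
doc = BeautifulSoup(html, 'html.parser')

# Exact class-string match: sensitive to the order of the class names.
exact = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: matches any <p> carrying both classes, in any order.
by_selector = doc.select('p.f3.Link--primary')

print(len(exact), len(by_selector))
```

A selector tied to fewer classes tends to survive small markup changes, at the cost of possibly matching more elements than intended.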

#### To get the topic description

In [None]:
topic_desc_tags = doc.find_all('p',{'class':'f5 color-fg-muted mb-0 mt-1'})

In [None]:
len(topic_desc_tags)

In [None]:
# first 5 topic description tags
topic_desc_tags[:5]

In [None]:
topic_desc_tags[0].text.strip()
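
`.strip()` trims only the ends of the text, so line breaks and indentation inside a description survive. A small sketch on made-up markup shows one way to collapse them as well:

```python
from bs4 import BeautifulSoup

# Hypothetical description tag with line breaks inside the text.
html = '<p class="f5">\n      Ajax is a technique\n      for dynamic pages.\n    </p>'
tag = BeautifulSoup(html, 'html.parser').p

stripped = tag.text.strip()              # trims the ends only
via_get_text = tag.get_text(strip=True)  # same result for a single text node
collapsed = ' '.join(tag.text.split())   # also collapses inner runs of whitespace

print(repr(stripped))
print(repr(collapsed))
```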

#### To find the topic url

In [None]:
topic_title_tag0 = topic_title_tags[0]
topic_title_tag0

In [None]:
#to check the parent of the p tag
topic_title_tag0.parent

In [None]:
# the parent is an <a> tag; use its class to find every topic link
topic_link_tags = doc.find_all('a',{'class':"no-underline flex-1 d-flex flex-column"})

In [None]:
len(topic_link_tags)

In [None]:
#url of the first topic
base_url = 'https://github.com'
topic0_url = base_url + topic_link_tags[0]['href']
print(topic0_url)
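
String concatenation works when every `href` is a path starting with `/`. As a sketch of a more forgiving alternative, the standard library's `urljoin` handles both relative and already-absolute links. The two `href` values below are illustrative, not taken from the page.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# A relative href is resolved against the base URL.
relative = urljoin(base_url, '/topics/3d')

# An href that is already absolute is returned unchanged.
absolute = urljoin(base_url, 'https://github.com/topics/ajax')

print(relative)
print(absolute)
```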

In [None]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
    
print(topic_titles)    

In [None]:
topic_descriptions = []

for tag in topic_desc_tags:
    topic_descriptions.append(tag.text.strip())
    
print(topic_descriptions)    

In [None]:
topic_descriptions[0]

In [None]:
# use a new name so the topic_url string defined earlier is not overwritten
topic_urls = []

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
print(topic_urls)    

In [None]:
topic_urls[0]

#### To create a CSV file

In [None]:
import pandas as pd

In [None]:
topics_dict = {
    'title' : topic_titles,
    'description' : topic_descriptions,
    'url' : topic_urls
}

In [None]:
# convert the dictionary into a DataFrame
topics_df = pd.DataFrame(topics_dict)

In [None]:
topics_df.head()
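
`pd.DataFrame` raises a `ValueError` when the lists in the dict have different lengths, and lists of equal length can still be misaligned if one topic lacks a description. One way to guard against both is to build one record per topic link. The sketch assumes each link tag wraps both the title and the description paragraphs. The `.parent` inspection above suggests this for the title, but it is otherwise unverified. The class names in the sample markup are shortened, hypothetical stand-ins.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup: the second topic has no description.
html = """
<a class="topic-link" href="/topics/3d">
  <p class="title">3D</p>
  <p class="desc">3D modeling and rendering.</p>
</a>
<a class="topic-link" href="/topics/ajax">
  <p class="title">Ajax</p>
</a>
"""
doc = BeautifulSoup(html, 'html.parser')
base_url = 'https://github.com'

records = []
for link in doc.find_all('a', class_='topic-link'):
    # Search inside each link so title, description and URL stay together.
    title_tag = link.find('p', class_='title')
    desc_tag = link.find('p', class_='desc')
    records.append({
        'title': title_tag.text.strip() if title_tag else None,
        'description': desc_tag.text.strip() if desc_tag else None,
        'url': base_url + link['href'],
    })

sample_df = pd.DataFrame(records)
print(sample_df)
```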

In [None]:
# To create CSV file with the extracted information
topics_df.to_csv('topics.csv', index=False)
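
A quick way to check the export is to write the CSV and read it back. The sketch below does this in memory with a two-row sample frame, so it does not touch `topics.csv`. The sample rows are illustrative.

```python
import io
import pandas as pd

# Hypothetical sample standing in for topics_df.
sample_df = pd.DataFrame({
    'title': ['3D', 'Ajax'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)   # index=False leaves out the row labels
csv_text = buffer.getvalue()

buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(csv_text)
```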