# Python-web-scraping-project

Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. We will build a web scraping project from scratch using Python and its libraries, following these steps:

#### 1. Pick a website and describe your objective
    
- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

#### 2. Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.

#### 3. Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.

#### 4. Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.


#### 5. Document and share your work

- Add proper headings and documentation in your Jupyter notebook.
- Publish your Jupyter notebook to your Jovian profile.
- (Optional) Write a blog post about your project and share it online.



#### Project Outline

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
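A minimal sketch of producing this format with Python's built-in `csv` module. The two rows are the sample values shown above, and writing to an in-memory buffer instead of a file is only for illustration.

```python
import csv
import io

# Sample rows taken from the format example above
rows = [
    {'Repo Name': 'three.js', 'Username': 'mrdoob', 'Stars': 69700,
     'Repo URL': 'https://github.com/mrdoob/three.js'},
    {'Repo Name': 'libgdx', 'Username': 'libgdx', 'Stars': 18300,
     'Repo URL': 'https://github.com/libgdx/libgdx'},
]

# Write to an in-memory buffer so the sketch leaves no files behind
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=['Repo Name', 'Username', 'Stars', 'Repo URL'],
    lineterminator='\n',
)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```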

## Scrape the list of topics from GitHub

We'll do it in three steps:

- Use requests to download the page
- Use Beautiful Soup (BS4) to parse and extract information
- Convert the result to a Pandas DataFrame

Let's write a function to download the page.

In [51]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

def get_topics_page():
    # URL of the page listing GitHub topics
    topics_url = 'https://github.com/topics'
    # Download the page
    response = requests.get(topics_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the HTML using Beautiful Soup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

`get_topics_page` downloads the topics page and returns it as a parsed Beautiful Soup document.

In [52]:
doc = get_topics_page()

Let's create some helper functions to parse information from the page.

To get topic titles, we can pick `p` tags with the `class` ...

![](https://i.imgur.com/OnzIdyP.png)

In [53]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

`get_topic_titles` returns the list of topic titles found on the page.

In [54]:
titles = get_topic_titles(doc)

In [55]:
len(titles)

30

Getting the first 5 topics from the list:

In [56]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

Since the list holds `30` titles (all the titles on the first page), the slice below returns the last 5 topics:

titles[25:]
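The same slice can be written with a negative index, which avoids hard-coding the list length. A small sketch with a stand-in list (the placeholder titles are made up for illustration):

```python
# Stand-in for the 30 scraped titles; the names are placeholders
titles = ['topic-{}'.format(i) for i in range(30)]

last_five = titles[25:]        # explicit start index, relies on len(titles) == 30
also_last_five = titles[-5:]   # negative index, works for any list length

print(last_five == also_last_five)
```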

Next, we define a function to get the topic descriptions.

In [57]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

<b>Note:</b> `.text` gets the text of the description from the `p` tag, and `.strip()` removes the surrounding whitespace.
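To see the effect of `.strip()`, here is a small sketch on a hand-written HTML fragment. The fragment and its text are made up; only the class name is taken from the function above.

```python
from bs4 import BeautifulSoup

# Hand-written fragment, not real GitHub markup
html = '''
<p class="f5 color-fg-muted mb-0 mt-1">
    3D modeling is the process of creating 3D objects.
</p>
'''
doc = BeautifulSoup(html, 'html.parser')
tag = doc.find('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})

raw_text = tag.text             # keeps the newlines and indentation around the text
clean_text = tag.text.strip()   # surrounding whitespace removed

print(repr(raw_text))
print(repr(clean_text))
```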

Similarly, we define a function to get the topic URLs.

In [58]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

The page uses relative URLs, so we turn each one into an absolute URL by prepending the base URL to `tag['href']`.
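Plain concatenation works when the `href` starts with `/`. As an alternative sketch, the standard library's `urllib.parse.urljoin` does the same job; the example path below is illustrative.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'
relative_url = '/topics/3d'   # example of a relative href

# Combine the base URL and the relative path into an absolute URL
absolute_url = urljoin(base_url, relative_url)
print(absolute_url)
```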

Let's put this all together into a single function

In [59]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

##### Committing the code to save it on the Jovian repository

In [60]:
import jovian

In [61]:
jovian.commit()

[jovian] Updating notebook "vikasrajoria11ece/python-web-scraping-practice-project" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/vikasrajoria11ece/python-web-scraping-practice-project


'https://jovian.ai/vikasrajoria11ece/python-web-scraping-practice-project'

## Get the top 25 repositories from a topic page

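For each topic, we download the topic page, parse it with Beautiful Soup, and extract the repo name, username, stars and repo URL of each listed repository. Let's start with a function to download and parse a topic page.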

In [62]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [63]:
doc = get_topic_page('https://github.com/topics/3d')

TODO - talk about the h1 tags 

Star counts appear in abbreviated form, so we convert them to numbers (for example, from 16K to 16000).


In [64]:
def parse_star_count(stars):
    stars = stars.strip()
    if stars[-1].lower() == 'k':        # check whether the last character is `k` (or `K`)
        return int(float(stars[:-1]) * 1000)
    return int(stars)

In [65]:
stars_string = '69.9K'
stars_string[-1]              # checking the last character of the demo stars_string


'K'

In [66]:
stars_string[:-1]             # getting everything except the last character of the demo stars_string

'69.9'

In [67]:
float(stars_string[:-1])            # converting this value to a float

69.9

In [68]:
float(stars_string[:-1]) * 1000           # accounting for the `k` suffix, which stands for 1000

69900.0

In [69]:
int(float(stars_string[:-1]) * 1000)          # converting the result to an integer

69900

In the second case, if the star count does not contain a `k`, it is a plain number such as 699, so we only need to convert it to an integer.

In [70]:
stars_string = '699'
int(stars_string)

699

Now let's put this logic together to get the number of stars.

The function below processes the star count as follows:
- it removes any surrounding spaces with `.strip()`
- it checks the last character
- if the last character is `k`, it takes everything except the last character using `stars[:-1]` (this is still a string)
- it converts that value to a float
- it multiplies the float by 1000 to account for the `k`
- lastly, it converts the result to an int and returns it: `int(float(stars[:-1])*1000)`

- if the count is not in thousands, for example `699`, it simply converts the string to an int
- and returns the result of `int(stars)`

In [71]:
def parse_star_count(stars):
    stars = stars.strip()
    if stars[-1].lower() == 'k':        # accept both `k` and `K`
        return int(float(stars[:-1]) * 1000)
    return int(stars)
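A quick, self-contained check of this logic. The variant below is a sketch: it accepts both `k` and `K`, and it uses `round` rather than `int` to guard against floating-point truncation. Both choices are assumptions about the inputs, and the name `parse_star_count_checked` is hypothetical.

```python
def parse_star_count_checked(stars):
    stars = stars.strip()
    if stars[-1].lower() == 'k':
        # e.g. '69.9k' -> 69.9 -> 69900
        return round(float(stars[:-1]) * 1000)
    # No suffix: the string is already a whole number
    return int(stars)

for sample in ['69.9k', ' 18.3K ', '699']:
    print(sample.strip(), '->', parse_star_count_checked(sample))
```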

In [76]:
def get_repo_info(h1_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    base_url = 'https://github.com'
    repo_url =  base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

TODO - show an example
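As an illustration, here is a sketch that runs the same logic on a hand-written HTML fragment. The fragment is made up and much simpler than a real GitHub page, and the two functions are repeated so the block runs on its own.

```python
from bs4 import BeautifulSoup

def parse_star_count(stars):
    stars = stars.strip()
    if stars[-1].lower() == 'k':
        return int(float(stars[:-1]) * 1000)
    return int(stars)

def get_repo_info(h1_tag, star_tag):
    # The first link holds the username, the second the repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = 'https://github.com' + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

# Hand-written fragment for illustration only
html = '''
<h1><a href="/mrdoob"> mrdoob </a> / <a href="/mrdoob/three.js"> three.js </a></h1>
<span id="repo-stars-counter-star">69.9k</span>
'''
doc = BeautifulSoup(html, 'html.parser')
repo_info = get_repo_info(
    doc.find('h1'),
    doc.find('span', {'id': 'repo-stars-counter-star'}),
)
print(repo_info)
```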

In [40]:
def get_topic_repos(topic_doc):
    # Get the article tags containing repo title, repo URL and username
    repo_tags = topic_doc.find_all('article',{'class':'border rounded color-shadow-small color-bg-subtle my-4'})

    # Get star tags
    star_tags=topic_doc.find_all('span',{'id':'repo-stars-counter-star'})
    
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

TODO - show an example

In [73]:
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)

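The skip-if-exists check makes the scraper safe to re-run. A self-contained sketch of the pattern, using a temporary directory and a stand-in `fetch_rows` function in place of the real download (both are assumptions for illustration, as is the helper name `save_once`):

```python
import csv
import os
import tempfile

def fetch_rows():
    # Stand-in for downloading and parsing a topic page
    return [['three.js', 'mrdoob', 69700, 'https://github.com/mrdoob/three.js']]

def save_once(path):
    # Return True if the file was written, False if it already existed
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return False
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Repo Name', 'Username', 'Stars', 'Repo URL'])
        writer.writerows(fetch_rows())
    return True

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, '3D.csv')
    first_run = save_once(csv_path)    # writes the file
    second_run = save_once(csv_path)   # skips, the file is already there

print(first_run, second_run)
```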
## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topics page
- Let's create a function to put them together

In [74]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [77]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin

We can check that the CSVs were created properly

In [46]:
# read and display a CSV using Pandas
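A minimal sketch of such a check. To stay self-contained, it first writes a tiny CSV in the format described in the project outline to a temporary directory and then reads it back. With the real data you would instead pass a path such as `data/3D.csv` to `pd.read_csv`, assuming the scraper above has been run.

```python
import os
import tempfile

import pandas as pd

# Sample content in the format from the project outline
sample_csv = (
    'Repo Name,Username,Stars,Repo URL\n'
    'three.js,mrdoob,69700,https://github.com/mrdoob/three.js\n'
    'libgdx,libgdx,18300,https://github.com/libgdx/libgdx\n'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'sample.csv')
    with open(csv_path, 'w') as f:
        f.write(sample_csv)
    # Read the CSV back into a DataFrame
    df = pd.read_csv(csv_path)

print(df.head())
print(df.shape)
```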

#### Saving the notebook on Jovian

In [78]:
!pip install jovian --upgrade --quiet

In [79]:
# Execute this to save new versions of the notebook
jovian.commit(project="python-web-scraping-project")

[jovian] Updating notebook "vikasrajoria11ece/python-web-scraping-practice-project" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/vikasrajoria11ece/python-web-scraping-practice-project


'https://jovian.ai/vikasrajoria11ece/python-web-scraping-practice-project'


## References and Future Work

Summary of what we did

- ?
- ?


References to links you found useful

- ?
- ?
 
Ideas for future work

- ?
- ?

In [None]:
jovian.commit()
