### Web Scraping Exercise

I want to familiarize myself with HTTP requests in order to extract and store scraped data. The documentation for requests is [here](https://requests.readthedocs.io/en/latest/).
In this project I will scrape the [GitHub](https://api.github.com) API.

First I need to install the package with pip:

- `!pip install requests`

Then import the packages:

- `import requests`
- `from requests.exceptions import HTTPError`
- `import json`
- `import pathlib`
- `import pandas`

Python 3.10.8, 64-bit

In [1]:
#!pip install requests

In [2]:
# imports
# web scraping packages
import requests
from requests.exceptions import HTTPError
import json
import pathlib
import pandas

__Request site__

In [3]:
# website
url = 'https://api.github.com'
r = requests.get(url)
# check the status code of the response
if r.status_code == 200:
    print('Success!')
elif r.status_code == 404:
    print('Not Found.')

Success!
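The `if`/`elif` check above only handles two status codes. As a minimal sketch, the standard library's `http.HTTPStatus` enum can translate any standard code into its reason phrase. The helper name `describe_status` is made up for this illustration.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short label such as '404 Not Found' for a standard HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define
        return f"{code} Unknown status"


print(describe_status(200))
print(describe_status(404))
```

In the notebook this could be called as `describe_status(r.status_code)`.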


__Request status__

In [4]:
for url in ['https://api.github.com', 'https://api.github.com/invalid']:
    try:
        r = requests.get(url)

        # If the request was successful, no Exception will be raised
        r.raise_for_status()
    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')  # f-strings require Python 3.6+
    except Exception as err:
        print(f'Other error occurred: {err}')  # f-strings require Python 3.6+
    else:
        print('Success!')

Success!
HTTP error occurred: 404 Client Error: Not Found for url: https://api.github.com/invalid
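The same pattern can be wrapped in a small helper. This is only a sketch: `safe_get` is a hypothetical name, and the 10-second timeout is an arbitrary choice of mine (requests does not time out unless a timeout is given). It catches `RequestException`, the base class that `HTTPError` inherits from, instead of the broader `Exception`.

```python
import requests
from requests.exceptions import HTTPError, RequestException


def safe_get(url, timeout=10):
    """Return the Response for url, or None if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # raises HTTPError for 4xx and 5xx responses
    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')
    except RequestException as err:
        # connection problems, timeouts, malformed URLs, and similar
        print(f'Other error occurred: {err}')
    else:
        return r
    return None
```

A caller can then write `r = safe_get(url)` and only continue when `r is not None`.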


__Request content__

In [5]:
response = requests.get('https://api.github.com')
response.content

b'{\n  "current_user_url": "https://api.github.com/user",\n  "current_user_authorizations_html_url": "https://github.com/settings/connections/applications{/client_id}",\n  "authorizations_url": "https://api.github.com/authorizations",\n  "code_search_url": "https://api.github.com/search/code?q={query}{&page,per_page,sort,order}",\n  "commit_search_url": "https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}",\n  "emails_url": "https://api.github.com/user/emails",\n  "emojis_url": "https://api.github.com/emojis",\n  "events_url": "https://api.github.com/events",\n  "feeds_url": "https://api.github.com/feeds",\n  "followers_url": "https://api.github.com/user/followers",\n  "following_url": "https://api.github.com/user/following{/target}",\n  "gists_url": "https://api.github.com/gists{/gist_id}",\n  "hub_url": "https://api.github.com/hub",\n  "issue_search_url": "https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}",\n  "issues_url": "https://api.

Request content: alternatives

`response.encoding = 'utf-8'` (optional; requests otherwise infers the encoding from the response headers)

`response.text`
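To see how the different views of the body relate, here is a rough offline sketch using only the standard library. The `raw` bytes are a shortened stand-in for the real response body, and the decode and parse steps approximate what `response.text` and `response.json()` return; they are not the library's actual internals.

```python
import json

# Shortened stand-in for response.content (bytes)
raw = b'{\n  "current_user_url": "https://api.github.com/user"\n}'

# Roughly what response.text gives: the bytes decoded to str
text = raw.decode('utf-8')

# Roughly what response.json() gives: the text parsed into Python objects
data = json.loads(text)

print(type(raw).__name__, type(text).__name__, type(data).__name__)
print(data['current_user_url'])
```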

In [6]:
response.json()

{'current_user_url': 'https://api.github.com/user',
 'current_user_authorizations_html_url': 'https://github.com/settings/connections/applications{/client_id}',
 'authorizations_url': 'https://api.github.com/authorizations',
 'code_search_url': 'https://api.github.com/search/code?q={query}{&page,per_page,sort,order}',
 'commit_search_url': 'https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}',
 'emails_url': 'https://api.github.com/user/emails',
 'emojis_url': 'https://api.github.com/emojis',
 'events_url': 'https://api.github.com/events',
 'feeds_url': 'https://api.github.com/feeds',
 'followers_url': 'https://api.github.com/user/followers',
 'following_url': 'https://api.github.com/user/following{/target}',
 'gists_url': 'https://api.github.com/gists{/gist_id}',
 'hub_url': 'https://api.github.com/hub',
 'issue_search_url': 'https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}',
 'issues_url': 'https://api.github.com/issues',
 'keys_url': '

In [7]:
response.headers

{'Server': 'GitHub.com', 'Date': 'Tue, 25 Oct 2022 13:21:49 GMT', 'Cache-Control': 'public, max-age=60, s-maxage=60', 'Vary': 'Accept, Accept-Encoding, Accept, X-Requested-With', 'ETag': '"4f825cc84e1c733059d46e76e6df9db557ae5254f9625dfe8e1b09499c449438"', 'Access-Control-Expose-Headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '0', 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'Content-Security-Policy': "default-src 'none'", 'Content-Type': 'application/json; charset=utf-8', 'X-GitHub-Media-Type': 'github.v3; format

In [8]:
response.headers['content-type']

'application/json; charset=utf-8'
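The `Content-Type` value bundles the media type and the character set. A small sketch, using the standard library's `email.message.Message` as a generic header parser (my choice for the illustration, not something requests requires), splits the two:

```python
from email.message import Message

# Header value copied from the output above
content_type = 'application/json; charset=utf-8'

msg = Message()
msg['Content-Type'] = content_type

media_type = msg.get_content_type()   # media type without parameters
charset = msg.get_content_charset()   # value of the charset parameter

print(media_type)
print(charset)
```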

In [9]:
import requests

# search GitHub's repositories for requests
response = requests.get(
    'https://api.github.com/search/repositories',
    params={'q': 'requests+language:python'},
)

# inspect some attributes of the `requests` repository
json_response = response.json()
repository = json_response['items'][0]
print(f'Repository name: {repository["name"]}')  # Python 3.6+
print(f'Repository description: {repository["description"]}')  # Python 3.6+

Repository name: grequests
Repository description: Requests + Gevent = <3


In [10]:
# params can be passed to get() as a dictionary, as above, or as a list of tuples:
requests.get(
    'https://api.github.com/search/repositories',
    params=[('q', 'requests+language:python')],
)

<Response [200]>

In [11]:
# or pass the query string as bytes
requests.get(
    'https://api.github.com/search/repositories',
    params=b'q=requests+language:python',
)

<Response [200]>
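The three forms are not encoded identically. As far as I can tell, requests URL-encodes dictionary and tuple values (so a literal `+` becomes `%2B`) but passes a bytes or string value through mostly as written, where `+` stands for a space. The sketch below prepares each request without sending it, so the final URLs can be compared offline; `prepared_url` is a hypothetical helper name.

```python
import requests

url = 'https://api.github.com/search/repositories'


def prepared_url(params):
    """Build the request without sending it and return the final URL."""
    return requests.Request('GET', url, params=params).prepare().url


dict_url = prepared_url({'q': 'requests+language:python'})
tuple_url = prepared_url([('q', 'requests+language:python')])
bytes_url = prepared_url(b'q=requests+language:python')

print(dict_url)
print(tuple_url)
print(bytes_url)
```

If the intent is to search for "requests" restricted to Python, writing the dictionary value with a space (`'requests language:python'`) may be closer to what the search API expects. Treat that as a suggestion to check against the GitHub search documentation.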

__Request headers__

In [12]:
response = requests.get(
    'https://api.github.com/search/repositories',
    params={'q': 'requests+language:python'},
    headers={'Accept': 'application/vnd.github.v3.text-match+json'},
)

# View the new `text-matches` array which provides information
# about your search term within the results
json_response = response.json()
repository = json_response['items'][0]
print(f'Text matches: {repository["text_matches"]}')

Text matches: [{'object_url': 'https://api.github.com/repositories/4290214', 'object_type': 'Repository', 'property': 'description', 'fragment': 'Requests + Gevent = <3', 'matches': [{'text': 'Requests', 'indices': [0, 8]}]}]
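Each entry in `text_matches` carries the matched `fragment` plus character `indices` into it. A minimal sketch of pulling out the matched substrings, using the output above as sample data (`matched_terms` is a hypothetical helper name):

```python
# Sample copied from the output above
text_matches = [{
    'object_url': 'https://api.github.com/repositories/4290214',
    'object_type': 'Repository',
    'property': 'description',
    'fragment': 'Requests + Gevent = <3',
    'matches': [{'text': 'Requests', 'indices': [0, 8]}],
}]


def matched_terms(text_matches):
    """Return (property, matched substring) pairs sliced out of each fragment."""
    found = []
    for entry in text_matches:
        for match in entry['matches']:
            start, end = match['indices']
            found.append((entry['property'], entry['fragment'][start:end]))
    return found


print(matched_terms(text_matches))
```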


__Storing__

In [13]:
soliditems = requests.get('https://api.github.com')
data = soliditems.content
with open('data.json', 'wb') as f:
    f.write(data)
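`pathlib` is imported above but not yet used. One possible variation, sketched under the assumption that a pretty-printed file is wanted, parses the body first and writes it through `pathlib.Path`. The `save_json` helper, the file name, and the sample dictionary are illustrative only; in the notebook the data would come from `soliditems.json()`.

```python
import json
import pathlib
import tempfile


def save_json(data, path):
    """Write data to path as indented UTF-8 JSON and return the Path."""
    path = pathlib.Path(path)
    path.write_text(json.dumps(data, indent=2), encoding='utf-8')
    return path


# Stand-in for soliditems.json()
sample = {'current_user_url': 'https://api.github.com/user'}

# A temporary directory keeps the demonstration from leaving files behind
with tempfile.TemporaryDirectory() as tmp:
    saved = save_json(sample, pathlib.Path(tmp) / 'data_pretty.json')
    round_trip = json.loads(saved.read_text(encoding='utf-8'))

print(round_trip == sample)
```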

__Converting json to csv__

In [21]:
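import json
import csv

# raw strings keep the backslashes in the Windows paths literal
json_path = r"C:\Users\ckraft-bot\path\data.json"
csv_path = r"C:\Users\ckraft-bot\path\data_converted.csv"

# context managers close both files even if an error occurs
with open(json_path, 'r', encoding='utf-8') as json_file:
    json_data_to_python_dict = json.load(json_file)

# newline='' stops the csv module from writing blank rows on Windows
with open(csv_path, 'w', newline='', encoding='utf-8') as csv_file:
    write = csv.writer(csv_file)
    write.writerow(json_data_to_python_dict.keys())
    write.writerow(json_data_to_python_dict.values())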
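Writing all keys as one header row produces a very wide two-line file. An alternative layout, offered as a sketch and not as the intended design, stores one endpoint per row. The column names `name` and `url` and the helper `dict_to_rows_csv` are my own choices, and the sample dictionary stands in for the loaded JSON.

```python
import csv
import io


def dict_to_rows_csv(mapping, out):
    """Write one (name, url) row per dictionary entry to the file-like object out."""
    writer = csv.writer(out)
    writer.writerow(['name', 'url'])
    writer.writerows(mapping.items())


sample = {
    'current_user_url': 'https://api.github.com/user',
    'emojis_url': 'https://api.github.com/emojis',
}

# StringIO keeps the example in memory; a real file opened with newline='' works the same way
buffer = io.StringIO()
dict_to_rows_csv(sample, buffer)
print(buffer.getvalue())
```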