### Collecting data from web-based sources

Apart from data libraries, there are two main sources of data:

1) querying an API (the majority of which are web-based, these days);

2) scraping data from a web page.

In [None]:
import requests
response = requests.get("http://www.cmu.edu")

print("Status Code:", response.status_code)
print("Headers:", response.headers)

Status Code: 200
Headers: {'Date': 'Wed, 09 Dec 2020 00:36:46 GMT', 'Server': 'Apache', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff, nosniff', 'x-frame-options': 'SAMEORIGIN', 'Vary': 'Referer', 'Accept-Ranges': 'bytes', 'Cache-Control': 'max-age=7200, must-revalidate', 'Expires': 'Wed, 09 Dec 2020 02:36:46 GMT', 'Keep-Alive': 'timeout=5, max=500', 'Connection': 'Keep-Alive', 'Transfer-Encoding': 'chunked', 'Content-Type': 'text/html'}


In [None]:
## Another Example

params = {"query": "python download url content", "source":"chrome"}
response = requests.get("http://www.google.com/search", params=params)
print(response.status_code)

200
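
The `params` dictionary above is sent to the server as a URL query string. The following is a minimal sketch of what that encoding looks like. It uses only the standard library `urllib.parse` module rather than `requests` itself, so the URL that `requests` actually builds is assumed, not guaranteed, to look the same.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "http://www.google.com/search"
params = {"query": "python download url content", "source": "chrome"}

# urlencode turns the dictionary into a query string; spaces become "+"
query_string = urlencode(params)
full_url = base_url + "?" + query_string
print(full_url)

# Parsing the query string again recovers the original parameters
# (parse_qs returns a list of values for each key)
recovered = parse_qs(urlsplit(full_url).query)
print(recovered)
```

Passing a dictionary instead of assembling the URL by hand means special characters in the values are escaped for you.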


### RESTful APIs


While parsing data in HTML (the format returned by these web queries) is sometimes a necessity, and we’ll discuss it further below, HTML is meant as a format for displaying pages visually, not as an efficient way of encoding data. Fortunately, a fair number of the web-based data services you will use in practice offer REST APIs (REST stands for Representational State Transfer, although the full name is rarely used). We won’t go into detail about REST APIs, but there are a few main features that are important for our purposes:

1. You call REST APIs using standard HTTP commands: GET, POST, DELETE, PUT. You will probably see GET and POST used most frequently.
2. REST servers don’t store state. This means that each time you issue a request, you need to include all relevant information like your account key, etc.
3. REST calls will usually return information in a nice format, typically JSON (more on this later). The requests library can parse it into Python data (typically a dictionary) when you call the response’s json() method.

Let’s see how to issue a REST request using the same method as before. Here we’ll query my GitHub account to get information. More info about GitHub’s REST API is available at their Developer Site.

In [None]:
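#### Get your own at https://github.com/settings/tokens/new
token = "3125e4430a58c5259a14ddd48157061cdb7055c0"
# GitHub no longer accepts tokens passed as an "access_token" query parameter,
# so the token is sent in the Authorization header instead
response = requests.get("https://api.github.com/user", headers={"Authorization": "token " + token})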

print(response.status_code)
print(response.headers["Content-Type"])
print(response.json().keys())

## Common data formats and handling

1. CSV (comma-separated values) files

2. JSON (JavaScript Object Notation) files and strings

3. HTML/XML (HyperText Markup Language / Extensible Markup Language) files and strings



### CSV Example

The term CSV refers to any delimited text file (for instance, fields could be delimited by spaces or tabs,
or any other character specific to the file). For example,
let’s take a look at the following data file describing weather data near the Pittsburgh airport:

A description of each data column above is available here: https://shawxiaozhang.github.io/wefacts/
but the important points are that the first two columns are time (UTC and local),
and, for example, the third column is degrees Celsius scaled by 10.

In [None]:

import pandas as pd
dataframe = pd.read_csv("kpit_weather.csv", delimiter=",", quotechar='"')
dataframe.head()
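
Since "CSV" also covers files delimited by tabs or other characters, here is a minimal sketch of reading a tab-delimited file with the standard library `csv` module and undoing a "scaled by 10" temperature. The rows and column names (`time_utc`, `time_local`, `temp_c_x10`) are invented for illustration and are not the actual contents of `kpit_weather.csv`.

```python
import csv
import io

# Hypothetical tab-delimited sample; NOT the real kpit_weather.csv contents
raw = (
    "time_utc\ttime_local\ttemp_c_x10\n"
    "20170820040000\t20170819230000\t178\n"
    "20170820050000\t20170820000000\t172\n"
)

# DictReader uses the first row as column names; delimiter selects the separator
reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
rows = list(reader)

# Every field is read as a string, so convert before dividing by 10
temps_c = [int(row["temp_c_x10"]) / 10 for row in rows]
print(temps_c)
```

With pandas, the same idea would be to pass a different `delimiter` to `pd.read_csv`, as in the cell above.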

### JSON data

JSON allows for storing a few different data types:

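- Numbers: e.g. 1.0, either integers or floating point; many parsers read every number as floating point, although Python's json module returns an int for numbers written without a fractional part or exponent
- Booleans: true or false (null is a separate value, used to mark a missing or empty value)
- Strings: "string", characters enclosed in double quotes (a " character inside the string then needs to be escaped as \")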
- Arrays (lists): [item1, item2, item3] list of items, where item is any of the described data types
- Objects (dictionaries): {"key1":item1, "key2":item2}, where the keys are strings and item is again any data type
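
As a rough illustration of how these types map onto Python values, the sketch below parses a small JSON string with the standard library `json` module. The keys and values are made up for the example.

```python
import json

# Made-up JSON text; the raw string keeps the \" escapes intact
text = r'{"count": 3, "ratio": 1.0, "ok": true, "missing": null, "name": "say \"hi\"", "items": [1, "two", false]}'

data = json.loads(text)

# Print each key with its parsed Python value and type
for key, value in data.items():
    print(key, repr(value), type(value).__name__)

# json.dumps goes the other way, from Python objects back to JSON text
round_trip = json.loads(json.dumps(data))
print(round_trip == data)
```

Objects become dictionaries, arrays become lists, `true`/`false` become `True`/`False`, and `null` becomes `None`.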

### XML/HTML

XML contains “open” tags denoted by angle brackets, like <tag>,
which are then closed by a corresponding “close” tag </tag>. 
The tags can be nested, and have optional attributes, 
of the form attribute_name="attribute_value". 
Finally, there are “open/close” tags that don’t have any included content (except perhaps attributes), 
denoted by <openclosetag/>.
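
To make the tag structure concrete, here is a small sketch that parses an XML string with the standard library `xml.etree.ElementTree` module. The document itself (`catalog`, `book`, `title`, `available`) is invented for illustration; it contains nested tags, attributes, and one open/close tag with no content.

```python
import xml.etree.ElementTree as ET

# Invented example document: nested tags, attributes, and an empty <available/> tag
xml_text = """<catalog>
  <book id="b1" lang="en">
    <title>Data Science Basics</title>
    <available/>
  </book>
  <book id="b2" lang="en">
    <title>Parsing the Web</title>
  </book>
</catalog>"""

root = ET.fromstring(xml_text)

results = []
for book in root.findall("book"):
    title = book.find("title").text                # text between <title> and </title>
    in_stock = book.find("available") is not None  # is the empty tag present?
    results.append((book.get("id"), title, in_stock))

print(results)
```

Real-world HTML is often not well-formed XML, so a strict parser like this one may reject it; a more forgiving HTML parser is usually needed for scraped pages.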



## Regular expressions

Regular expressions are invaluable when parsing any type of unstructured data, whether you’re trying to quickly find or extract some text from a long string or writing a more complex parser. In general, regular expressions let us find and match portions of text using a simple syntax (by some definition).




In [None]:


## Finding 

import re
text = "This course will introduce the basics of data science"
match = re.search(r"data science", text)
print(match.start())

41


The important element here is the re.search(r"data science", text) call. It searches text for the string “data science” and returns a regular expression “match” object that contains information about where this match was found: for instance, we can find the character index (in text) where the match starts, using the match.start() call. In addition to the search call, there are three more regular expression matching commands you may find useful:



- re.match(): Match the regular expression starting at the beginning of the text string
- re.finditer(): Find all matches in the text, returning an iterator over match objects
- re.findall(): Find all matches in the text, returning a list of the matched text only (not a match object)
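
The three calls can be compared on the same sentence used earlier. This is a minimal sketch; the pattern `data|science` is just an illustrative choice that matches either word.

```python
import re

text = "This course will introduce the basics of data science"

# re.match only succeeds if the pattern matches at the very start of the string
at_start = re.match(r"data science", text)  # "data science" is not at index 0
first_word = re.match(r"This", text)        # "This" is at index 0
print(at_start, first_word.group())

# re.finditer yields match objects, so positions are available
spans = [(m.start(), m.end()) for m in re.finditer(r"data|science", text)]
print(spans)

# re.findall returns only the matched strings
found = re.findall(r"data|science", text)
print(found)
```

Use `re.finditer` when you need positions or groups for every match, and `re.findall` when the matched text alone is enough.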