## Data Collection
Data collection is the process of obtaining data and then cleaning and preparing it for data science applications. This notebook demonstrates several methods for gathering and cleaning data.

Ways that you can get data:
1. Government/state sources - https://www.data.gov/  https://opendata.cityofnewyork.us/data/
2. Database queries - using SQL
3. Data feeds from instruments/sensors
4. APIs - authenticated web-based calls that return data
5. Scraping data (legally) - https://apnews.com/article/1e1cacd92df74f48846e8bce5237b97d

Most data is unstructured and requires extensive processing before it can be analyzed.


## Download data from URL

In [None]:
import pandas as pd
# COVID-19 Hospital Data Coverage Detail
url ='https://healthdata.gov/api/views/ieks-f4qs/rows.csv?accessType=DOWNLOAD'
df = pd.read_csv(url, delimiter=',', quotechar='"')
df

## Using Databases

SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day.

In [None]:
import sqlite3
connection = sqlite3.connect("publisher2.db")
# A cursor lets Python code execute SQL commands in a database session.
# Cursors are created by the connection.cursor() method: they are bound to the connection
# for their entire lifetime, and all commands are executed in the context of the database
# session wrapped by the connection.
cursor = connection.cursor()
# do your stuff
#connection.close()

In [None]:
# Execute a database operation (query or command).
cursor.execute("""
CREATE TABLE publisher2 (
id INTEGER PRIMARY KEY,
name TEXT
)""")

In [None]:
# Insert data into the table:
cursor.execute("INSERT INTO publisher2 VALUES (1, 'Pearson Press')")
cursor.execute("INSERT INTO publisher2 VALUES (2, 'MIT Press')")
cursor.execute("INSERT INTO publisher2 VALUES (3, 'Cambridge Press')")
connection.commit()

In [None]:
# delete item
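cursor.execute("DELETE FROM publisher2 WHERE id = 3")
# Commit on the same connection that owns the cursor; committing on a
# second, newly opened connection would not save this change.
connection.commit()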

In [None]:
# query data
for row in cursor.execute('SELECT * FROM publisher2'):
  print (row)

In [None]:
import pandas as pd
# Read table directly into a Pandas DataFrame:
pd.read_sql_query("SELECT * FROM publisher2", connection, index_col="id")
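
The insert, delete, and query steps above can also be written with parameter placeholders, so that values are bound by sqlite3 rather than pasted into the SQL text. The following is a minimal sketch under stated assumptions: it uses an in-memory database and a hypothetical table name `publisher_demo` so that it does not touch `publisher2.db`.

```python
import sqlite3

# An in-memory database keeps this sketch independent of publisher2.db.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE publisher_demo (id INTEGER PRIMARY KEY, name TEXT)")

# "?" placeholders let sqlite3 bind each value instead of formatting
# it into the SQL string by hand.
rows = [(1, "Pearson Press"), (2, "MIT Press"), (3, "Cambridge Press")]
cur.executemany("INSERT INTO publisher_demo VALUES (?, ?)", rows)
cur.execute("DELETE FROM publisher_demo WHERE id = ?", (3,))
conn.commit()   # commit on the connection that made the changes

remaining = cur.execute(
    "SELECT id, name FROM publisher_demo ORDER BY id"
).fetchall()
print(remaining)
conn.close()
```

Placeholders matter most when the values come from user input or from another data source, because they avoid quoting mistakes and SQL injection.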

### Collecting Data Using HTTP/HTTPS Protocols
The first step of collecting web-based data is to issue a request for this data via some protocol: HTTP (HyperText Transfer Protocol) or HTTPS (the secure version).
#### Using the requests library: https://docs.python-requests.org/en/master/

In [None]:
import requests
response = requests.get("http://andyguna.com")

print("Status Code:", response.status_code)   # code = 200 is ok
print("Headers:", response.headers)
print(response.text[:1000])
print(type(response.text))           

### Google Search
https://www.google.com/search?q=how+to+break+dance&source=chrome

In [None]:
import requests
params = {"q": "how to break dance", "source": "chrome"}   # the search term goes in "q", as in the URL above
response = requests.get("http://www.google.com/search", params=params)
print(response.status_code)
response.text[:1000]
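
To see how a dictionary of parameters becomes the query string in the search URL above, the encoding step can be reproduced without sending any request. This is a small sketch using only the standard library; the requests library is assumed to perform an equivalent encoding when it is given `params`.

```python
from urllib.parse import urlencode

params = {"q": "how to break dance", "source": "chrome"}
query_string = urlencode(params)   # spaces are encoded as "+"
full_url = "https://www.google.com/search?" + query_string
print(full_url)
```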

### RESTful APIs
A fair number of the web-based data services you will use in practice expose REST (Representational State Transfer, although the full name is rarely used) APIs.
Common API calls use the HTTP methods GET, POST, PUT, and DELETE.
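
As a rough illustration of how the four methods are typically used, the sketch below builds (but never sends) one request per method with the standard library. The endpoint `https://api.example.com/items`, the helper `build`, and the payloads are all hypothetical; a real service defines its own URLs and its own meaning for each method.

```python
import json
from urllib.request import Request

BASE = "https://api.example.com/items"   # hypothetical endpoint

def build(method, url, payload=None):
    """Build (but do not send) a request for the given HTTP method."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(url, data=data, headers=headers, method=method)

examples = {
    "read":   build("GET", BASE + "/42"),
    "create": build("POST", BASE, {"name": "MIT Press"}),
    "update": build("PUT", BASE + "/42", {"name": "MIT Press (revised)"}),
    "delete": build("DELETE", BASE + "/42"),
}
for action, req in examples.items():
    print(action, req.get_method(), req.full_url)
```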

In [None]:
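# create a GitHub access token at https://github.com/settings/tokens/new
token = "????"
# GitHub no longer accepts the token as a query parameter; send it in the Authorization header.
response = requests.get("https://api.github.com/user", headers={"Authorization": f"token {token}"})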

print(response.status_code)
print(response.headers["Content-Type"])
print(response.json().keys())

### Authentication
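For a long time the standard was "Basic Authentication", which the requests library supports by passing the login and password as the auth argument to the relevant calls, as below. Note that GitHub itself no longer accepts account passwords for API calls, so expect this particular request to be rejected unless a token is used instead.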

In [None]:
response = requests.get("https://api.github.com/user", auth=("andyguna", "github_password"))
print(response.status_code)
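
Under the hood, Basic Authentication sends the login and password in a single `Authorization` header. The sketch below shows how that header is formed, using only the standard library; the helper name `basic_auth_header` is hypothetical, and the credentials are the example pair from the HTTP Basic Authentication specification.

```python
import base64

def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP Basic Authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")

header = basic_auth_header("Aladdin", "open sesame")
print(header)
```

Base64 is an encoding, not encryption, so this scheme should only be used over HTTPS.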

## Common Data Formats
Data comes in many different formats, but some of the more common ones that you'll deal with as a data scientist are:
CSV, JSON, HTML, XML

### CSV Data Formats

In [None]:
import pandas as pd
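df = pd.read_csv("data/439_01.csv", delimiter=",", quotechar='"')
# Only the last expression in a cell is displayed, so print each summary explicitly.
print(df.head(10))
print(df.size)       # total number of values (rows * columns)
print(df.describe())
print(df.shape)      # (rows, columns)
print(df.columns)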
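
The `delimiter` and `quotechar` arguments matter when a field itself contains a comma. The following sketch illustrates this with the standard-library csv module on a small, made-up table held in memory (the rows are invented for illustration and are not from `data/439_01.csv`).

```python
import csv
import io

# Made-up sample data; the quoted field contains a comma.
raw = 'id,name,city\n1,"Pearson Press","London"\n2,"MIT Press","Cambridge, MA"\n'

reader = csv.DictReader(io.StringIO(raw), delimiter=",", quotechar='"')
records = list(reader)

print(reader.fieldnames)
print(records[1]["city"])   # the comma inside the quotes is kept
```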

## JSON Data
**Data types**
- Numbers: e.g. 1.0, either integers or floating point (Python's json module parses 1 as an int and 1.0 as a float)
- Booleans: true or false
- Null: null, the empty value
- Strings: "string", characters enclosed in double quotes (a " character inside the string must be escaped as \")
- Arrays (lists): [item1, item2, item3], a list of items, where each item is any of the described data types
- Objects (dictionaries): {"key1":item1, "key2":item2}, where the keys are strings and each item is again any data type
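
The mapping from these JSON types to Python types can be checked directly. This is a small sketch with a made-up JSON document; the key names are arbitrary.

```python
import json

# Raw string so that \" reaches the JSON parser as an escaped quote.
text = r'''
{"count": 3, "ratio": 1.0, "ok": true, "missing": null,
 "quote": "she said \"hi\"", "items": [1, "two", {"three": 3}]}
'''
obj = json.loads(text)
for key, value in obj.items():
    print(key, type(value).__name__, repr(value))
```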

In [None]:
# read a Jupyter notebook file in its JSON format
# f = open("data/Lab 1 - Introduction.ipynb","r")
with open("Week01_Notebook.ipynb", "r") as f:
    str_text = f.read()   # whole file as a single string
str_text

In [None]:
import json
y = json.loads(str_text)   # parse the string
# y["cells"][0]['source']   # extract the title of the lab
y["description"]        # value stored under the "description" key

In [None]:
y["modules"]        # extract the list of modules

In [None]:
y["modules"][4]['videos'][0]['video_id']     # video_id of the first video in the module at index 4

In [None]:
y["modules"][4]['videos'][2]['video_id']  # video_id of the third video in the same module

## Dictionary to JSON Object

In [None]:
# convert a Python dictionary to a JSON string
import json
data = {"a":[1,2,3,{"b":2.1}], 'c':4}
json_text = json.dumps(data)   # dumps returns a str; data itself is unchanged
json_text

In [None]:
import json
# json.load parses the open file directly
with open(".json", "r") as f:
    y = json.load(f)
y["collection_name"]

In [None]:
json.dumps(response)    # raises TypeError: a requests Response object cannot be represented as JSON
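
When an object contains a type that JSON cannot represent, one common workaround is the `default` argument of `json.dumps`, which is called for every value the encoder does not know how to serialize. The sketch below uses a made-up record with a `date` value; converting with `str` is just one possible choice.

```python
import json
from datetime import date

record = {"label": "sample", "day": date(2023, 1, 15)}   # made-up sample

try:
    json.dumps(record)
except TypeError as err:
    print("not serializable:", err)

# default=str converts any unsupported value to its string form.
encoded = json.dumps(record, default=str)
print(encoded)
```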

### XML/HTML

<tag attribute="value">
    <subtag>
        Some content for the subtag
    </subtag>
    <openclosetag attribute="value2"/>
</tag>
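
The structure of this XML fragment (a tag with an attribute, a child with text content, and a self-closing child) can be explored with the standard-library ElementTree parser. This is a minimal sketch; the variable name `root_el` is arbitrary.

```python
import xml.etree.ElementTree as ET

doc = """
<tag attribute="value">
    <subtag>
        Some content for the subtag
    </subtag>
    <openclosetag attribute="value2"/>
</tag>
"""
root_el = ET.fromstring(doc.strip())

print(root_el.tag, root_el.attrib)                     # element name and attributes
print(root_el.find("subtag").text.strip())             # text inside <subtag>
print(root_el.find("openclosetag").get("attribute"))   # attribute of the self-closing tag
```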

### Scraping Data

In [None]:
from bs4 import BeautifulSoup

root = BeautifulSoup("""
<tag attribute="value">
    <subtag>
        Some content for the subtag
    </subtag>
    <openclosetag attribute="value2"/>
    <subtag>
        Second one
    </subtag>
</tag>
""", "lxml-xml")

print(root, "\n")
print(root.tag.subtag, "\n")
print(root.tag.openclosetag.attrs)

In [None]:
print(root.tag.find_all("subtag"))

In [None]:
print(root.find_all("subtag"))

### Home Page Information

In [None]:
# parsing the Rutgers CS web page
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.cs.rutgers.edu/")
root = BeautifulSoup(response.content, "lxml")
for div in root.find_all("div", class_="custom"):    # quick links sit in div elements with class "custom"
    for li in div.find_all("li"):                    # each link is an li element inside that div
        print(li.text.strip())

### Extract Product Information

In [None]:
# parsing product information
from urllib.request import urlopen
from bs4 import BeautifulSoup
URL = "https://www.newegg.com/p/pl?d=headphones"
html = urlopen(URL)
page_str = html.read()
html.close()
page_str
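
Once the page has been downloaded, the next step is to pull the product fields out of the HTML. The sketch below shows one way to do that with the standard-library HTML parser on a small, made-up snippet. The class name `item-title` and the product names are assumptions for illustration; the real page's markup must be inspected to find the right tags and classes.

```python
from html.parser import HTMLParser

class ItemTitleParser(HTMLParser):
    """Collect the text of every <a class="item-title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("class") == "item-title":
            self.in_title = True
            self.titles.append("")

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_title = False

# Made-up HTML standing in for the downloaded page.
sample_html = """
<div class="item-cell"><a class="item-title" href="#">Wireless Headphones A</a></div>
<div class="item-cell"><a class="item-title" href="#">Studio Headphones B</a></div>
"""
parser = ItemTitleParser()
parser.feed(sample_html)
titles = [t.strip() for t in parser.titles]
print(titles)
```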

## Regular Expressions
Any character (except the special characters .^$*+?{}[]\|() ) just matches itself; i.e., the character a just matches the character a. This is how a pattern such as r"data science" works: each character in the regular expression looks to match that exact character.

Putting a group of characters within brackets, [abc], will match any of the characters a, b, or c. You can also use ranges within these brackets, so that [a-z] matches any lowercase letter.

Putting a caret at the start of the bracket matches anything but these characters; i.e., [^abc] matches any character except a, b, or c.
- The special character \d will match any digit, i.e. [0-9].
- The special character \w will match any alphanumeric character plus the underscore; for ASCII text it is equivalent to [a-zA-Z0-9_].
- The special character \s will match whitespace, any of [ \t\n\r\f\v] (a space, tab, and various newline characters).
- The special character . (the period) matches any character. In their original versions, regular expressions were often applied line-by-line to a file, so by default . will not match the newline character. If you want it to match newlines, pass re.DOTALL to the "flags" argument of the various regular expression calls.
- .* matches any number of characters, including zero.
- a+ matches one or more repetitions of a: a, aa, aaa, aaaa, ...
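
The rules above can be tried out one at a time. The following sketch uses short, made-up strings chosen only to exercise each rule.

```python
import re

sample = "Room 42b, floor_3\tBldg-7"   # made-up test string

digits = re.findall(r"\d", sample)           # every single digit
words = re.findall(r"\w+", sample)           # runs of letters, digits, underscore
in_set = re.findall(r"[abc]", "cabbage")     # any of a, b, c
not_in_set = re.findall(r"[^abc]", "cabbage")  # anything except a, b, c

no_newline = re.search(r"a.c", "a\nc")                     # . skips "\n" by default
with_dotall = re.search(r"a.c", "a\nc", flags=re.DOTALL)   # now . matches "\n" too

runs_of_a = re.findall(r"a+", "caaandaab")   # one or more a's

print(digits, words, in_set, not_in_set, sep="\n")
print(no_newline, with_dotall, runs_of_a, sep="\n")
```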

In [None]:
import re
text = "This course will introduce the basics of data science and more Data science"
match = re.search(r"data science", text)
print(match.start())
matches = re.findall(r"[Dd]ata science", text)    # returns all matches as a list
print(matches)

In [None]:
# compile and search
regex = re.compile(r"data science")
regex.search(text)

## Multiple Pattern Strings

In [None]:
# lower or upper case D and S and \s space between the two
print(re.search(r"[Dd]ata\s[Ss]cience", text))   

In [None]:
matches = re.findall(r"[Dd]ata\s[Ss]cience", text)    # returns all matches as a list
print(matches)

### Grouping with regex

In [None]:
match = re.search(r"(\w+)\s([Ss]cience)", text)
print(match.groups())

In [None]:
match = re.search(r"(\w+)\s([Ss]cience)", text)
print(match.group(0))
print(match.group(1))
print(match.group(2))

### Substitution

In [None]:
print(re.sub(r"data science", r"data schmience", text))

In [None]:
print(re.sub(r"(\w+) ([Ss])cience", r"\1 \2chmience", text))
print(re.sub(r"(\w+) ([Ss])cience", r"\1 \2chmience", "Life Science"))

### Miscellaneous Items

In [None]:
# order of operations
print(re.match(r"abc|def", "abc"))
print(re.match(r"abc|def", "def"))
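
The point about order of operations is that `|` binds more loosely than concatenation, so `abc|def` means "abc or def", not "ab, then c or d, then ef". A short sketch of the difference, using a non-capturing group `(?:...)` to limit the alternation:

```python
import re

either = re.compile(r"abc|def")        # whole alternatives: "abc" or "def"
grouped = re.compile(r"ab(?:c|d)ef")   # alternation limited to one character

r1 = either.fullmatch("abdef")    # neither "abc" nor "def"
r2 = grouped.fullmatch("abdef")   # "ab" + "d" + "ef"
r3 = grouped.fullmatch("abcef")   # "ab" + "c" + "ef"

print(r1 is not None, r2 is not None, r3 is not None)
```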

In [None]:
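# crazy stuff
import re
s = "101"    # avoid naming a variable str, which shadows the built-in type
# In a raw string a backreference is written \1; \\1 would match a literal backslash followed by 1.
matches = re.findall(r".?|(..+?)\1+", s)    # returns all matches as a list
print(matches)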
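
The pattern in the cell above resembles a well-known regular expression trick for testing primality; assuming that is the intent, the usual form anchors both alternatives and is applied to a string of n identical characters. In this sketch the first alternative matches lengths 0 and 1, and the second matches any length that splits into two or more equal blocks of at least two characters, i.e. a composite number. The names `NOT_PRIME` and `is_prime` are hypothetical.

```python
import re

# Matches a string of n characters exactly when n is NOT prime.
NOT_PRIME = re.compile(r"^.?$|^(..+?)\1+$")

def is_prime(n):
    """Return True if n is prime, using the unary-string regex trick."""
    return NOT_PRIME.match("1" * n) is None

primes = [n for n in range(20) if is_prime(n)]
print(primes)
```

This is a curiosity rather than a practical primality test, since the backtracking cost grows quickly with n.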

### @ Copyright 2023  A.D. Gunawardena