# The Many Sources of Data

In [None]:
import pandas as pd
import json
import requests as rq
from bs4 import BeautifulSoup
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from matplotlib import pyplot as plt
import wave
import plotly.express as px

## Tables

Stereotypically, data comes in the form of *spreadsheets* or *tables*, with a row-and-column structure.

This is still very much a popular source of data for the practicing data scientist. In the raw text of such data files, the values in adjacent cells are usually separated by commas or by `tab` characters. (These are often good choices because these characters appear relatively rarely inside the values themselves, so there is generally little chance of ambiguity.)
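
To make the role of the delimiter concrete, here is a minimal sketch using Python's built-in `csv` module on two small in-memory strings. The names and cities are invented for illustration. The last line shows the usual remedy when a value does contain the delimiter: wrapping it in quotes.

```python
import csv
import io

# Hypothetical data: the same small table written with two different delimiters.
comma_text = "name,city\nAda,London\nAlan,Manchester\n"
tab_text = "name\tcity\nAda\tLondon\nAlan\tManchester\n"

comma_rows = list(csv.reader(io.StringIO(comma_text)))
tab_rows = list(csv.reader(io.StringIO(tab_text), delimiter='\t'))

# A value that contains a comma is quoted so that it stays in one cell.
quoted_row = next(csv.reader(io.StringIO('"Smith, J.",Leeds\n')))

print(comma_rows)
print(tab_rows)
print(quoted_row)
```

Both readers produce the same list of rows; only the separator in the raw text differs.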

### .csv ($\underline{c}$omma-$\underline{s}$eparated $\underline{v}$alues)

In [None]:
# This data taken from the Australian Institute of Health and Welfare:
# https://www.aihw.gov.au/data-by-subject.

pd.read_csv('../data/aihw-phc-4-csv/PHN_ERP_CSV.csv').head(10)

Let's use *VIM* to see the commas in the raw data.

### .tsv ($\underline{t}$ab-$\underline{s}$eparated $\underline{v}$alues)

In [None]:
# This data taken from Peter Norvig's pytudes:
# https://github.com/norvig/pytudes.
# It's a list of his bike rides that were longer
# than 25 miles. The sixth column records the
# elevation change.

pd.read_csv('../data/bikerides25.tsv',
           delimiter='\t',
           header=None).head()

Let's use *VIM* to see the `tab`s in the raw data.

## Clickstreams

But data gets generated in lots of different ways these days. Sometimes the data of interest are patterns in the clicks of a website's users.

In [None]:
# This data taken from https://github.com/mafudge/datasets.
# It is a set of sample weblogs from a Seattle e-commerce
# website called "nopCommerce".

stream = []
with open('../data/u_ex160211.log') as f:
    # Iterating over the file yields one logged click per line.
    for click in f:
        stream.append(click.rstrip('\n'))

In [None]:
stream
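
Raw log lines become easier to analyze once each one is split into named fields. The sketch below assumes the log follows the W3C extended format, in which a `#Fields:` directive names the space-separated columns. That is an assumption about this file, and the sample lines here are invented rather than taken from it.

```python
# Hypothetical log lines in W3C extended format (not copied from the real file).
sample_log = [
    '#Version: 1.0',
    '#Fields: date time cs-method cs-uri-stem sc-status',
    '2016-02-11 00:00:01 GET /index.html 200',
    '2016-02-11 00:00:05 GET /cart 404',
]

fields = []
records = []
for line in sample_log:
    if line.startswith('#Fields:'):
        # The directive lists the column names for the lines that follow.
        fields = line.split()[1:]
    elif line and not line.startswith('#'):
        # Pair each value with its column name.
        records.append(dict(zip(fields, line.split())))

print(records)
```

Each record is now a dictionary, so a list of them can be passed straight to `pd.DataFrame`.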

## APIs

An $\underline{A}$pplication $\underline{P}$rogramming $\underline{I}$nterface, or API, is a generic term for a defined interface through which pieces of software interact with each other. Many web services expose an API so that other programs can request their data directly. [Foursquare](https://foursquare.com/) is one example: its API allows developers to supplement their own applications with information about geographical locations.

In [None]:
url = 'https://api.foursquare.com/v2/venues/explore'
with open('../.secrets/credentials.json') as f:
    params = json.load(f)

In [None]:
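# Note: no trailing commas here. A trailing comma would turn
# each value into a one-element tuple instead of a plain value.
params['v'] = '20201201'
params['ll'] = '34, -118'
params['query'] = 'tacos'
params['intent'] = 'browse'
params['radius'] = 100000
params['limit'] = 10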
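
It may help to see what a dictionary of parameters turns into when it is sent. The sketch below uses the standard library's `urlencode` to build a query string by hand. The `demo_params` dictionary is a hypothetical subset that leaves out the credentials, and `requests` does its own encoding, so the URL it actually sends may differ in small details.

```python
from urllib.parse import urlencode

base_url = 'https://api.foursquare.com/v2/venues/explore'

# Hypothetical subset of the parameters (no credentials included).
demo_params = {'v': '20201201', 'll': '34, -118', 'query': 'tacos', 'limit': 10}

# Each key=value pair is percent-encoded and joined with '&'.
query_string = urlencode(demo_params)
full_url = f'{base_url}?{query_string}'

print(full_url)
```

The comma and the space in the `ll` value are escaped so that they cannot be confused with the characters that structure the URL.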

In [None]:
response = rq.get(url=url, params=params)
# The response body is JSON; parse it into nested dicts and lists.
data = response.json()
[(item['venue']['name'], item['venue']['location'])
 for item in data['response']['groups'][0]['items']]
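
The list comprehension above walks down several levels of nested JSON. Here is a small offline sketch of the same traversal. The payload is invented: it only imitates the nesting that the code above relies on, and the venue names and coordinates are made up.

```python
import json

# Hypothetical payload mirroring the nesting used above: response -> groups -> items -> venue.
sample_text = '''
{"response": {"groups": [{"items": [
    {"venue": {"name": "Taqueria Uno", "location": {"lat": 34.0, "lng": -118.0}}},
    {"venue": {"name": "Taqueria Dos", "location": {"lat": 34.1, "lng": -118.2}}}
]}]}}
'''

demo_data = json.loads(sample_text)

# JSON objects become dicts and JSON arrays become lists.
venues = [(item['venue']['name'], item['venue']['location'])
          for item in demo_data['response']['groups'][0]['items']]

print(venues)
```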

## Scraping from the Web

There is also a ton of data just sitting on various webpages. If need be, data scientists can get at it by parsing the HTML code underlying the webpages that have the data of interest.

In [None]:
url = 'https://www.pro-football-reference.com/'

res = rq.get(url)
soup = BeautifulSoup(res.content, 'lxml')

In [None]:
teams = []
table = soup.find('table', {'id': 'AFC'})

In [None]:
for row in table.find('tbody').find_all('tr'):
    try:
        team = {'name': row.find('th', {'data-stat': 'team'}).text,
               'wins': row.find('td', {'data-stat': 'wins'}).text,
               'losses': row.find('td', {'data-stat': 'losses'}).text,
               'ties': row.find('td', {'data-stat': 'ties'}).text}
        teams.append(team)
    except AttributeError:
        # `find` returns None when a row lacks one of these cells,
        # so `.text` fails; skip such rows.
        pass

In [None]:
teams
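
To see the kind of markup this sort of scraping relies on, here is an offline sketch that parses a small invented table. It swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without a network connection or extra packages. The HTML, the team names, and the `StatTableParser` class are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical markup imitating a table whose cells carry `data-stat` attributes.
SAMPLE_HTML = """
<table id="AFC"><tbody>
  <tr><td colspan="4">Some Division</td></tr>
  <tr><th data-stat="team">Team A</th><td data-stat="wins">10</td>
      <td data-stat="losses">6</td><td data-stat="ties">0</td></tr>
  <tr><th data-stat="team">Team B</th><td data-stat="wins">7</td>
      <td data-stat="losses">8</td><td data-stat="ties">1</td></tr>
</tbody></table>
"""

class StatTableParser(HTMLParser):
    # Maps each `data-stat` value to the key used in the output dicts.
    FIELDS = {'team': 'name', 'wins': 'wins', 'losses': 'losses', 'ties': 'ties'}

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None     # dict for the row being read, if any
        self._field = None   # output key for the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = {}
        elif tag in ('th', 'td') and self._row is not None:
            self._field = self.FIELDS.get(dict(attrs).get('data-stat'))

    def handle_data(self, data):
        if self._field is not None:
            self._row[self._field] = self._row.get(self._field, '') + data

    def handle_endtag(self, tag):
        if tag in ('th', 'td'):
            self._field = None
        elif tag == 'tr' and self._row is not None:
            # Keep only rows that had all four cells (this skips the divider row).
            if len(self._row) == len(self.FIELDS):
                self.rows.append(self._row)
            self._row = None

parser = StatTableParser()
parser.feed(SAMPLE_HTML)
demo_teams = parser.rows
print(demo_teams)
```

The idea is the same as in the BeautifulSoup loop: find the rows, pull out the cells by their `data-stat` attribute, and skip rows that don't have them.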

## Text - Natural Language Processing

Data-processing tools have become sophisticated enough that even long-form text can be treated as data these days, for example by counting how often each word appears in each document.

In [None]:
# This is a list of the texts of all 58 Presidential inaugural addresses.
# Most of them can be found here: https://avalon.law.yale.edu/subject_menus/inaug.asp.

with open('../data/speeches.pkl', 'rb') as f:
    speeches = pickle.load(f)

In [None]:
speeches

In [None]:
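cv = CountVectorizer()
# `get_feature_names_out` replaces `get_feature_names`,
# which was removed in recent versions of scikit-learn.
pd.DataFrame(cv.fit_transform(speeches).toarray(),
             columns=cv.get_feature_names_out()).iloc[:10, 71:]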
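
The table above is a "bag of words": one row per document, one column per word, and counts in the cells. Here is a rough standard-library sketch of the same idea on two invented sentences. It imitates what I understand to be `CountVectorizer`'s default behavior (lowercasing and keeping tokens of two or more word characters), but it is a stand-in, not scikit-learn's implementation.

```python
import re
from collections import Counter

# Hypothetical mini-corpus.
docs = ["We the people", "The people of the union"]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r'\b\w\w+\b', text.lower())

counts = [Counter(tokenize(doc)) for doc in docs]

# The vocabulary is every distinct token, in alphabetical order.
vocab = sorted(set().union(*counts))

# One row per document, one column per vocabulary word.
matrix = [[count[word] for word in vocab] for count in counts]

print(vocab)
print(matrix)
```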

## Audio Files

Audio files these days are created digitally, and so, even though we experience them most immediately as *sounds*, the digital encoding means that they're data as well! If we want to, we can access their digital descriptions.

In [None]:
# One second from Elvis Presley's "All Shook Up",
# courtesy of https://www.wavsource.com.

with wave.open('../data/all_shook_up.wav') as f:
    stats = f.getparams()
    frames = f.readframes(200)

In [None]:
stats

In [None]:
# These raw bytes encode the amplitude samples of a 16-bit mono track.

frames
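
The `frames` object is a string of raw bytes, which is hard to read directly. The sketch below shows one way to turn such bytes into integers. To stay self-contained, it writes a tiny 16-bit mono WAV file in memory and reads it back. The four sample values and the frame rate are invented for the demo.

```python
import io
import struct
import wave

# Hypothetical sample values standing in for real audio.
samples = [0, 1000, -1000, 32767]

# Write a tiny WAV file into an in-memory buffer.
buf = io.BytesIO()
with wave.open(buf, 'wb') as w:
    w.setnchannels(1)     # mono
    w.setsampwidth(2)     # 2 bytes = 16 bits per sample
    w.setframerate(8000)  # arbitrary rate for the demo
    w.writeframes(struct.pack('<4h', *samples))

# Read it back the same way as the cell above.
buf.seek(0)
with wave.open(buf, 'rb') as r:
    demo_stats = r.getparams()
    demo_frames = r.readframes(demo_stats.nframes)

# '<h' means a little-endian signed 16-bit integer; one per frame for mono audio.
decoded = struct.unpack(f'<{demo_stats.nframes}h', demo_frames)

print(demo_stats)
print(decoded)
```

For a stereo or 8-bit file, the format string would need to change to match the channel count and sample width reported by `getparams`.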

## Images

An image is a visual object, of course. But again there are digital representations of them that bring them into the domain of data science. A digital image is a grid of pixels, and each pixel contains a part of the image. Each part of the image has a color, and we can represent any color by a number or sequence of numbers.

In [None]:
# These are images of handwritten digits, 0-9.
# Here's a 9.

from tensorflow.keras.datasets import mnist

plt.imshow(mnist.load_data()[0][0][4]);

In [None]:
# And here's a digital representation of that same 9.

mnist.load_data()[0][0][4]
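
To make the "grid of numbers" idea concrete without downloading anything, here is a tiny hand-made grayscale image. The pixel values are invented: 0 stands for black and 255 for white, and the grid is rendered as text by thresholding each pixel.

```python
# Hypothetical 5x5 grayscale image: a bright plus sign on a dark background.
image = [
    [0,   0,   255, 0,   0],
    [0,   0,   255, 0,   0],
    [255, 255, 255, 255, 255],
    [0,   0,   255, 0,   0],
    [0,   0,   255, 0,   0],
]

# Render bright pixels as '#' and dark pixels as '.'.
ascii_rows = [''.join('#' if pixel > 127 else '.' for pixel in row)
              for row in image]

print('\n'.join(ascii_rows))
```

A color image works the same way, except that each pixel holds a sequence of numbers (for example red, green, and blue values) rather than a single one.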