# Data in Python

- data files in Python
    - semi-structured files
    - `pandas`
    - Web Scraping & APIs
- Working With Data

## Setup

In [1]:
# Import standard libraries
%matplotlib inline
import pandas as pd
import numpy as np

## Data 'Friendliness'

The degree to which a data filetype easily lends itself to useful analysis.

## 'Friendly' File Types:

- csv
- tsv
- json
- txt
- xml

## 'Unfriendly' File Types:
- pdf
- docx
- html
- Anything made to look nice for humans

### CSV Files

- 'Comma Separated Value' files store data, separated by commas.
- Think of them like lists.

In [2]:
# Note: through this notebook, I will be using '!' to run the shell command 'cat'
#  to print out the content of example data files

!cat data/dat.csv

1, 2, 3, 4
5, 6, 7, 8
9, 10, 11, 12

In [3]:
# Python has a module devoted to working with csv's
import csv

In [4]:
# We can read through our file with the csv module
with open('data/dat.csv') as csvfile:
    csv_reader = csv.reader(csvfile, delimiter=',')
    for row in csv_reader:
        print(', '.join(row))

1,  2,  3,  4
5,  6,  7,  8
9,  10,  11,  12
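
The doubled spaces in the output above appear because each field keeps the space that follows the comma in the file. A minimal sketch of one way to handle this, assuming the same file contents (held in a string here so the example is self-contained):

```python
import csv
import io

# Stand-in for data/dat.csv, with the same contents shown above
csv_text = "1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n"

rows = []
# skipinitialspace=True drops the space that follows each comma
for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
    # csv.reader always yields strings, so convert each field to int
    rows.append([int(value) for value in row])

print(rows)
```

Each row comes back as a list, which is why it helps to think of csv data as lists.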


In [5]:
# Pandas also has functions to directly load csv data
pd.read_csv?

In [6]:
# Let's read in our csv file
pd.read_csv('data/dat.csv', header=None) 

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 1 | 5 | 6 | 7 | 8 |
| 2 | 9 | 10 | 11 | 12 |
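
With `header=None`, pandas labels the columns 0 to 3. If you would rather name them yourself, `read_csv` accepts a `names` argument. The sketch below assumes the same file contents. The column labels `a` to `d` are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for data/dat.csv, with the same contents shown above
csv_text = "1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n"

# 'names' supplies column labels (hypothetical ones here);
# skipinitialspace=True ignores the space after each comma
df_small = pd.read_csv(io.StringIO(csv_text), header=None,
                       names=['a', 'b', 'c', 'd'], skipinitialspace=True)

print(df_small)
print(df_small['c'].tolist())
```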


## iclicker Question #1

What does `pd` in `pd.read_csv()` specify?

- A) it's the name of the function
- B) that the `read_csv` method is from the pd package
- **C) that the `read_csv` method is from the pandas package (we're using the shortcut `pd`)**
- D) to read a csv file into python
- E) I'm super lost

### JSON

- JavaScript Object Notation files can store hierarchical key/value pairings.
- Think of them like dictionaries.
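
To make 'hierarchical' concrete, here is a small sketch using the standard `json` module on a nested record. The `address` field and its contents are made up for illustration.

```python
import json

# A nested record; the 'address' field is hypothetical
text = '{"firstName": "John", "age": 53, "address": {"city": "San Diego", "zip": "92093"}}'

# json.loads parses a JSON string into Python objects
record = json.loads(text)

# Nested JSON objects become nested dictionaries
print(record['address']['city'])

# JSON numbers become Python numbers, not strings
print(type(record['age']))

# json.dumps turns the dictionary back into a JSON string
round_trip = json.dumps(record, indent=2)
```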

In [7]:
!cat data/dat.json

{
  "firstName": "John",
  "age": 53
}


In [8]:
# Think of json's as similar to dictionaries
d = {'firstName': 'John', 'age': '53'}
print(type(d),'\n',d)

<class 'dict'> 
 {'firstName': 'John', 'age': '53'}


In [9]:
# Python also has a module for dealing with json
import json

In [10]:
# Load a json file
with open('data/dat.json') as dat_file:    
    dat = json.load(dat_file)

In [11]:
# Check what data type this gets loaded as
print(type(dat))

<class 'dict'>


In [12]:
# Pandas also has support for reading in json files
pd.read_json?

In [13]:
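# You can read in json formatted strings with pandas
# (newer pandas versions expect the string wrapped in a file-like object)
from io import StringIO
pd.read_json(StringIO('{ "first": "Alan", "place": "Manchester"}'), typ='series')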

first          Alan
place    Manchester
dtype: object

In [14]:
# Read in our json file with pandas
pd.read_json('data/dat.json', typ = 'series')

firstName    John
age            53
dtype: object
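
A single JSON object maps onto a Series, as above. A JSON list of objects maps onto the rows of a DataFrame. The sketch below assumes such a list. The second record is made up for illustration.

```python
import io
import pandas as pd

# A list of records; the second record is hypothetical
json_text = '[{"firstName": "John", "age": 53}, {"firstName": "Jane", "age": 41}]'

# Wrapping the string in StringIO gives pandas a file-like object to read
people = pd.read_json(io.StringIO(json_text))

print(people)
```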

### XML

- eXtensible Markup Language files store 'tagged' data. 
- Think of them like HTML.
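
To make 'tagged' data concrete, here is a minimal sketch that turns a small, well-formed record into a dictionary with the standard library's `xml.etree.ElementTree`. Note that this parser is strict and raises an error on mismatched tags.

```python
import xml.etree.ElementTree as ET

# A small, well-formed record held in a string
xml_text = "<person><who>Claude</who><what>Info</what><when>50s</when></person>"

# Parse the string; 'root' is the outer <person> element
root = ET.fromstring(xml_text)

# Each child element has a tag name and text content
record = {child.tag: child.text for child in root}

print(record)
```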

In [15]:
!cat data/dat.xml

<person>
	<who>Claude</who>
	<what>Info</who>
	<when>50s</when>
</person>

In [16]:
# We can read in the XML file with standard python I/O
with open('data/dat.xml') as dat_file:
    dat = dat_file.read()

In [18]:
# Check out the data
dat

'<person>\n\t<who>Claude</who>\n\t<what>Info</who>\n\t<when>50s</when>\n</person>'

In [19]:
# Beautiful Soup has functions to 'clean up' XML into human-friendlier formats
from bs4 import BeautifulSoup
nice_dat = BeautifulSoup(dat, 'xml')

In [20]:
# Check out the parsed data
print(nice_dat)

<?xml version="1.0" encoding="utf-8"?>
<person>
<who>Claude</who>
<what>Info</what>
<when>50s</when>
</person>


<center>
<img src="img/pandas.png" alt="pandas" width="600px">
</center>

Pandas is a Python library for managing heterogeneous data.

At its core, Pandas is built around the **DataFrame** object, which is:
- a data structure for labeled rows and columns of data
- bundled with associated methods and utilities for working with data
- made up of columns, each of which is a `pandas` **Series**
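
As a quick illustration of these points, the sketch below builds a tiny DataFrame by hand and checks that a single column is a Series. The variable name `small` is arbitrary.

```python
import pandas as pd

# A tiny hand-built DataFrame: each dictionary key becomes a column
small = pd.DataFrame({
    'first_name': ['Andrea', 'Bill', 'Alexander'],
    'age': [46, 46, 48],
    'value': [24547.87, 46713.90, 32071.74],
})

# Selecting one column returns a pandas Series
print(type(small['age']))

# Each column can hold a different data type
print(small.dtypes)
```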

## Loading Data

In [21]:
# Load a csv file of data
df = pd.read_csv('data/my_data.csv')

In [22]:
# Check out a few rows of the dataframe
df.head()

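|   | id | first_name | last_name | age | score | value |
|---|---|---|---|---|---|---|
| 0 | 295 | Andrea | Clark | 46 | -1 | 24547.87 |
| 1 | 620 | Bill | Woods | 46 | 492 | 46713.90 |
| 2 | 891 | Alexander | Jacobson | 48 | 489 | 32071.74 |
| 3 | 914 | Derrick | Bradley | 52 | -1 | 30650.48 |
| 4 | 1736 | Allison | Thomas | 44 | -1 | 9553.12 |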


Pandas DataFrame:
- Index for each row
- Column name for each column
- Stores heterogeneous types

## Indexing & Slicing

In [23]:
# Indexing: select a column using its name
df['last_name']

0         Clark
1         Woods
2      Jacobson
3       Bradley
4        Thomas
         ...   
195       Ortiz
196    Chambers
197       Pitts
198     Jenkins
199       Brown
Name: last_name, Length: 200, dtype: object

In [24]:
type(df['last_name'])

pandas.core.series.Series

In [25]:
# Indexing: select a row & column with 'loc'
df.loc[10, 'score']

500
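
`loc` selects by label. Its counterpart `iloc` selects by integer position. The difference only shows up when the row labels are not simply 0, 1, 2, ..., so the sketch below uses a small hand-built DataFrame with made-up row labels 10, 20, 30.

```python
import pandas as pd

# Hypothetical row labels (10, 20, 30) chosen so labels differ from positions
scores = pd.DataFrame(
    {'last_name': ['Clark', 'Woods', 'Jacobson'], 'score': [-1, 492, 489]},
    index=[10, 20, 30],
)

# loc: the row labelled 20, the column named 'score'
by_label = scores.loc[20, 'score']

# iloc: the second row and second column, counting from zero
by_position = scores.iloc[1, 1]

# Label-based slices include the end label
subset = scores.loc[10:20, ['last_name', 'score']]

print(by_label, by_position)
print(subset)
```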

## iclicker Question #2

What would be the output of `df['age'] > 10`?

- A) subset of `df` including only rows of individuals older than 10
- **B) a Boolean Series with `True` for rows where age is greater than 10 and `False` otherwise**
- C) `id`s of rows where observations are greater than 10 
- D) an error
- E) I'm super lost

In [28]:
## YOUR CODE HERE
df['age'] > 10

# to get dataframe 
df[df['age'] > 10]

|   | id | first_name | last_name | age | score | value |
|---|---|---|---|---|---|---|
| 0 | 295 | Andrea | Clark | 46 | -1 | 24547.87 |
| 1 | 620 | Bill | Woods | 46 | 492 | 46713.90 |
| 2 | 891 | Alexander | Jacobson | 48 | 489 | 32071.74 |
| 3 | 914 | Derrick | Bradley | 52 | -1 | 30650.48 |
| 4 | 1736 | Allison | Thomas | 44 | -1 | 9553.12 |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | 97441 | Krista | Ortiz | 34 | -1 | 24074.79 |
| 196 | 97728 | Anna | Chambers | 37 | 598 | 0.00 |
| 197 | 98115 | Jennifer | Pitts | 29 | 606 | 6876.75 |
| 198 | 98284 | Brittany | Jenkins | 34 | 665 | 43525.88 |
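
Boolean masks can also be combined. The sketch below rebuilds the first five rows of the example data by hand and keeps rows that satisfy two conditions at once. Each condition needs its own parentheses, and `&` is used rather than `and`.

```python
import pandas as pd

# The first five rows of the example data, rebuilt by hand
people = pd.DataFrame({
    'first_name': ['Andrea', 'Bill', 'Alexander', 'Derrick', 'Allison'],
    'age': [46, 46, 48, 52, 44],
    'score': [-1, 492, 489, -1, -1],
})

# Combine two Boolean Series element-wise with & (and) or | (or)
mask = (people['age'] > 45) & (people['score'] > 0)

# Indexing with the mask keeps only the rows where it is True
selected = people[mask]

print(selected)
```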


## Checking out the DataFrame

In [29]:
# Check how large our dataframe is
df.shape

(200, 6)

In [30]:
# Check what columns we have in our DataFrame
df.columns

Index(['id', 'first_name', 'last_name', 'age', 'score', 'value'], dtype='object')

In [31]:
# Check the datatypes of our variables
df.dtypes

id              int64
first_name     object
last_name      object
age             int64
score           int64
value         float64
dtype: object

In [32]:
# Convert the id column to strings and use it as the index (row labels)
df['id'] = df['id'].astype('str')
df = df.set_index('id', drop=False)  # drop=False keeps 'id' as a column too
df.head() 

| id (index) | id | first_name | last_name | age | score | value |
|---|---|---|---|---|---|---|
| 295 | 295 | Andrea | Clark | 46 | -1 | 24547.87 |
| 620 | 620 | Bill | Woods | 46 | 492 | 46713.90 |
| 891 | 891 | Alexander | Jacobson | 48 | 489 | 32071.74 |
| 914 | 914 | Derrick | Bradley | 52 | -1 | 30650.48 |
| 1736 | 1736 | Allison | Thomas | 44 | -1 | 9553.12 |
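
Once the index holds strings, `loc` lookups need string labels too. A minimal sketch on a three-row stand-in for the data:

```python
import pandas as pd

# A three-row stand-in for the example data
people = pd.DataFrame({
    'id': [295, 620, 891],
    'first_name': ['Andrea', 'Bill', 'Alexander'],
    'age': [46, 46, 48],
})

# Convert ids to strings and use them as row labels
people['id'] = people['id'].astype('str')
people = people.set_index('id', drop=False)

# Look up a row by its string label
name = people.loc['620', 'first_name']
print(name)

# The integer 620 is no longer a valid label
print(620 in people.index, '620' in people.index)
```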


## Exploring the data

- quantitative (numbers)
- qualitative (categorical)
- basic descriptive statistics

In [33]:
# Checking categorical data
df['first_name'].value_counts()[0:10]

David       6
Michael     5
Eric        4
James       4
Charles     4
Jason       3
John        3
Jonathan    3
Sarah       3
Jennifer    3
Name: first_name, dtype: int64

In [34]:
# Check a particular descriptive statistic
df['value'].mean()

28730.336296296293

In [35]:
# Describe a particular column
df['score'].describe()

count    200.000000
mean     416.595000
std      237.176674
min       -1.000000
25%      288.750000
50%      463.500000
75%      596.500000
max      942.000000
Name: score, dtype: float64

In [36]:
# Get descriptive statistics of all numerical columns
df.describe()

|   | age | score | value |
|---|---|---|---|
| count | 200.0 | 200.0 | 189.0 |
| mean | 46.02 | 416.595 | 28730.336296 |
| std | 10.028582 | 237.176674 | 32493.945741 |
| min | 14.0 | -1.0 | 0.0 |
| 25% | 39.0 | 288.75 | 9593.03 |
| 50% | 46.0 | 463.5 | 17976.51 |
| 75% | 53.0 | 596.5 | 33163.31 |
| max | 69.0 | 942.0 | 204999.96 |
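
Two details in this summary are worth a closer look. The `value` count is 189 rather than 200, because pandas leaves missing values (`NaN`) out of these statistics. The minimum `score` is -1, which may be a placeholder rather than a real score. That is an assumption, since the data alone cannot confirm it. If it is a placeholder, it distorts the mean. The sketch below shows the effect on the first five scores.

```python
import numpy as np
import pandas as pd

# The first five scores from the example data
scores = pd.Series([-1, 492, 489, -1, -1], name='score')

# ASSUMPTION: -1 marks a missing score; replace it with NaN
cleaned = scores.replace(-1, np.nan)

# The mean with and without the placeholder values
print(scores.mean())
print(cleaned.mean())

# count() only counts non-missing entries
print(cleaned.count())
```

If -1 turns out to be a legitimate score, the replacement step should be skipped.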


## iclicker Question #3

What's the average (mean) age of the individuals in this dataset?

- A) 14
- **B) 46**
- C) 28730
- D) NA
- E) I'm super lost/unsure

In [37]:
## YOUR CODE HERE
df['age'].mean()

46.02

## Application Programming Interfaces (APIs)

- APIs are basically a way for software to talk to software 
    - They are an interface into an application / website / database designed for computers / software.

Notes on APIs:
- Follow API guidelines! 
- These guidelines typically specify the number / rate / size of requests
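
One simple way to respect a rate guideline is to pause between requests. The sketch below is a hypothetical helper (`throttled` is not part of any library), and the interval is an arbitrary example value rather than a real API limit.

```python
import time

def throttled(items, min_interval):
    """Yield items one at a time, at least min_interval seconds apart."""
    last = None
    for item in items:
        if last is not None:
            # Sleep for whatever remains of the interval since the last item
            wait = min_interval - (time.monotonic() - last)
            if wait > 0:
                time.sleep(wait)
        last = time.monotonic()
        yield item

urls = ['https://api.github.com/users/shanellis'] * 3

visited = []
start = time.monotonic()
for url in throttled(urls, min_interval=0.05):
    visited.append(url)  # a real script would send the request here
elapsed = time.monotonic() - start

print(len(visited), round(elapsed, 2))
```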

## GitHub API

You can access the GitHub API at the following URL. Just add specifiers for what you are looking for.

https://api.github.com/

For example, the following URL will look up the user 'ShanEllis':

https://api.github.com/users/shanellis

<center>
<img src="img/github.png" alt="sql" height="100" width="100">
</center>

## Requesting Web Pages from Python

In [38]:
# The requests module allows you to send URL requests from python
import requests  
from bs4 import BeautifulSoup

In [39]:
# Request data from the Github API on a particular user
page = requests.get('https://api.github.com/users/shanellis')  

In [40]:
# The content we get back is JSON, returned as one long, unformatted bytes string
page.content

b'{"login":"ShanEllis","id":6606571,"node_id":"MDQ6VXNlcjY2MDY1NzE=","avatar_url":"https://avatars3.githubusercontent.com/u/6606571?v=4","gravatar_id":"","url":"https://api.github.com/users/ShanEllis","html_url":"https://github.com/ShanEllis","followers_url":"https://api.github.com/users/ShanEllis/followers","following_url":"https://api.github.com/users/ShanEllis/following{/other_user}","gists_url":"https://api.github.com/users/ShanEllis/gists{/gist_id}","starred_url":"https://api.github.com/users/ShanEllis/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/ShanEllis/subscriptions","organizations_url":"https://api.github.com/users/ShanEllis/orgs","repos_url":"https://api.github.com/users/ShanEllis/repos","events_url":"https://api.github.com/users/ShanEllis/events{/privacy}","received_events_url":"https://api.github.com/users/ShanEllis/received_events","type":"User","site_admin":false,"name":"Shannon Ellis","company":null,"blog":"shanellis.com","location":"San Die

## iclicker Question #6

What type/format of output is this?

- A) CSV
- B) XML
- **C) JSON**
- D) API
- E) I'm super lost

In [41]:
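# We can read in the json data with pandas
# (page.text is the decoded response; StringIO wraps it as a file-like object)
from io import StringIO
git_data = pd.read_json(StringIO(page.text), typ='series')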

In [42]:
# Check out the pandas series object full of data
git_data  

login                                                          ShanEllis
id                                                               6606571
node_id                                             MDQ6VXNlcjY2MDY1NzE=
avatar_url             https://avatars3.githubusercontent.com/u/66065...
gravatar_id                                                             
url                               https://api.github.com/users/ShanEllis
html_url                                    https://github.com/ShanEllis
followers_url           https://api.github.com/users/ShanEllis/followers
following_url          https://api.github.com/users/ShanEllis/followi...
gists_url              https://api.github.com/users/ShanEllis/gists{/...
starred_url            https://api.github.com/users/ShanEllis/starred...
subscriptions_url      https://api.github.com/users/ShanEllis/subscri...
organizations_url            https://api.github.com/users/ShanEllis/orgs
repos_url                   https://api.github.com/
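
As an alternative to pandas, the standard `json` module can parse the response bytes into a dictionary. The sketch below uses a shortened copy of the response, keeping only a few of its fields, so that it runs without a network connection.

```python
import json

# A shortened copy of the response bytes (only a few fields kept)
content = b'{"login":"ShanEllis","id":6606571,"type":"User","site_admin":false}'

# json.loads accepts bytes as well as strings
user = json.loads(content)

print(user['login'])
print(user['site_admin'])
```

With a live `requests` response, `page.json()` performs the same parsing in one step.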

### Authorized Access - OAuth

Open Authorization is a protocol to authorize access (of a user / application) to an API.

OAuth provides a secure way to 'log-in' without using account names and passwords. 

In practice, you authenticate with a set of keys and tokens rather than your account password.
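
In many APIs, a token is sent along with each request in an `Authorization` header. The sketch below builds such a request with the standard library, without sending it. The helper name `build_request`, the token value, and the `Bearer` scheme are assumptions for illustration. Check the documentation of the API you are using for the exact format it expects.

```python
import urllib.request

def build_request(url, token):
    """Build (but do not send) a request carrying a token in its headers."""
    # ASSUMPTION: the API expects the 'Bearer <token>' scheme
    headers = {'Authorization': f'Bearer {token}'}
    return urllib.request.Request(url, headers=headers)

# 'EXAMPLE_TOKEN' is a placeholder, not a real credential
req = build_request('https://api.github.com/user', token='EXAMPLE_TOKEN')

print(req.full_url)
print(req.get_header('Authorization'))
```

With the `requests` module, the same dictionary can be passed as `requests.get(url, headers=headers)`. Keep real tokens out of notebooks and version control.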

## Web Scraping vs. APIs

Web scraping and APIs are different approaches:

- APIs are an interface to interact with an application, designed for programmatic use
    - They allow systematic, controlled access to (for example) an application's database
    - They typically return structured (friendly) data 

- Web scraping (typically) involves navigating through the internet, programmatically following an architecture built for humans
    - This can be hard to systematize, as it depends on the idiosyncrasies of a web page at the time you request it
    - This typically returns relatively unstructured data
    - This entails much more wrangling of the data
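
The contrast can be sketched with one piece of information delivered two ways. Both the HTML snippet and the JSON string below are made up for illustration. To keep the example dependency-free it uses the standard library's `html.parser`; Beautiful Soup would do the same job in fewer lines.

```python
import json
from html.parser import HTMLParser

# The same (made-up) information as a web page and as an API response
html_text = '<html><body><h1>Profile</h1><p class="name">Shannon Ellis</p></body></html>'
json_text = '{"name": "Shannon Ellis"}'

class NameParser(HTMLParser):
    """Collect the text inside <p class="name"> elements."""

    def __init__(self):
        super().__init__()
        self.in_name = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs
        if tag == 'p' and ('class', 'name') in attrs:
            self.in_name = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_name = False

    def handle_data(self, data):
        if self.in_name:
            self.names.append(data.strip())

# Scraping: the code depends on the page's exact layout
parser = NameParser()
parser.feed(html_text)
scraped_name = parser.names[0]

# API: the structure is explicit, so one key lookup is enough
api_name = json.loads(json_text)['name']

print(scraped_name, api_name)
```

If the page's markup changes (say, the class is renamed), the scraper silently finds nothing, while the API response keeps its documented keys.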

# Notes on Working with Data

### Data Science is Ad-Hoc

- It is part of the job description to put things together that were not designed to go together.
- We do not have universal solutions, but haphazard, idiosyncratic systems, for data collection, storage and analysis.
- Data is everywhere. But relatively little of it was collected *as data*.

### Data Collection, Curation, and Storage are Difficult

- It can be difficult to choose broadly useful standards
- Take time to think about your data, and how you will load, store, organize and save it

### Data is Inherently Noisy

- We live in a messy, noisy world, with messy, noisy people, using messy, noisy instruments.
- There is no perfect data. 
    - There is better or worse data, given the context.

### Different Objectives

- Humans and computers are different.
- We interact with '*data*' in different ways.
- This underlies many aspects of data wrangling
    - The 'friendliness' of data types / files
    - The difference between web scraping and APIs
    - A disconnect between data in the real world, and data we want to use

## So... What to do?

- Think about how your data are stored and structured
- Look at your data before you analyze it
    - are there missing values? 
    - outlier values? 
- Are your data trustworthy? 
    - source?
    - how was it generated?
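
A first look for missing and outlying values can be sketched in a few lines of pandas. The values below are a handful from the example data plus one `NaN` added for illustration. The 1.5 × IQR rule used here is one common convention for flagging outliers, not the only one.

```python
import numpy as np
import pandas as pd

# A few example values, with one NaN added for illustration
values = pd.Series([24547.87, 46713.90, np.nan, 30650.48, 9553.12, 204999.96],
                   name='value')

# Count the missing entries
n_missing = values.isna().sum()

# Flag values more than 1.5 interquartile ranges outside the middle 50%
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(n_missing)
print(outliers)
```

A flagged value is a prompt to investigate rather than an automatic reason to delete it.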

## Specific Recommendations

- Prioritize using well structured, common, open file types
    - Take advantage of existing tools to deal with these files (numpy, pandas, etc.)

- Look into, and then follow, common conventions
    - Minimize custom objects, workflows and data files 
- Look for APIs. Ask if they are available.
    - Acknowledge that web scraping and/or wrangling unstructured data are complex / long tasks

- Think about data flow from the beginning. Organize your data pipeline and consider the 'wrangling' aspects throughout
    - Set yourself up with a well-organized, labelled approach to your data
    - Think about when and how you might want/need to save out intermediate results.
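
Saving an intermediate result can be as simple as writing a csv file and reading it back. The sketch below uses a temporary directory and a made-up file name. Note that a csv file does not store data types, so string ids need to be requested explicitly when reloading.

```python
import os
import tempfile
import pandas as pd

# A small cleaned table with string ids
cleaned = pd.DataFrame({'id': ['295', '620'], 'age': [46, 46]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # 'cleaned.csv' is a hypothetical file name
    path = os.path.join(tmp_dir, 'cleaned.csv')

    # index=False avoids writing the row labels as an extra column
    cleaned.to_csv(path, index=False)

    # Without dtype, pandas would read the ids back as integers
    reloaded = pd.read_csv(path, dtype={'id': str})

print(reloaded)
```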