# Finding other ways to get data

In [None]:
import pandas as pd

As it so happens, pandas also has a function for importing Excel spreadsheets (i.e. we don't need to call openpyxl directly unless we are doing more sophisticated things with Excel; pandas still uses openpyxl behind the scenes to read `.xlsx` files).

In [None]:
pd.read_excel('data/iris-excel-starter.xlsx')

This still has the buggy data.

Let's say that we have a couple of friends who have already corrected this data for us.

# Fisher

Fisher has already given us one source of data:

<img src='data-sci-images/fisher-table-all.png' style='height:500px'>

Um....

That requires us to convert an image with text into an array of data..... let's not do that (though there are libraries and applications for doing optical character recognition and importing images!)

# Colleague #1

One of our colleagues has given us a JSON file.

In [None]:
import json

In [None]:
with open('data/iris-v1.json') as f:
    x = json.load(f)

In [None]:
x

Putting this into a DataFrame requires that we flatten the dictionary....

In [None]:
pd.read_json('data/iris-v1.json')

Apparently Pandas by default wants us to do that too.

In [None]:
pd.read_json('data/iris-v1-Copy1.json')
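The layout of `iris-v1.json` isn't shown here, so the sketch below uses a small made-up list of nested records (the `records` variable and its keys are hypothetical) to illustrate one way flattening could be done, with `pd.json_normalize`:

```python
import pandas as pd

# Hypothetical nested records; the real layout of iris-v1.json may differ.
records = [
    {"species": "setosa", "sepal": {"length": 5.1, "width": 3.5}},
    {"species": "virginica", "sepal": {"length": 6.3, "width": 3.3}},
]

# json_normalize flattens nested dictionaries into dotted column names,
# e.g. {"sepal": {"length": ...}} becomes a "sepal.length" column.
flat = pd.json_normalize(records)
print(flat)
```

If the real file nests its values differently (for example, a dictionary of columns rather than a list of records), the call would need adjusting.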

We could do more, but let's move on.  If you would like to work through some details, check this out:

In [None]:
pd.read_json('data/iris-v2.json')

But let's try another data set.

# Colleague #2

Another of our colleagues has given us a CSV.

In [None]:
import csv

In [None]:
with open('data/iris.csv') as f:
    x = csv.reader(f)

In [None]:
x

In [None]:
with open('data/iris.csv') as f:
    x = csv.reader(f)
    for row in x:
        print(row)

In [None]:
csvlists = []
with open('data/iris.csv') as f:
    x = csv.reader(f)
    for row in x:
        csvlists.append(row)
dfcsv = pd.DataFrame(csvlists[1:],columns=csvlists[0])

In [None]:
dfcsv

...or ....

In [None]:
pd.read_csv('data/iris.csv')
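One difference between the two routes is worth seeing: `csv.reader` hands back every value as a string, while `pd.read_csv` tries to infer numeric types. The sketch below uses a tiny made-up sample (the `sample` string and its column names are assumptions, standing in for `data/iris.csv`) so it runs without the file:

```python
import csv
import io
import pandas as pd

# Made-up two-row sample standing in for data/iris.csv.
sample = "sepal_length,species\n5.1,setosa\n4.9,setosa\n"

# Manual route: csv.reader yields lists of strings, so every cell stays a string.
rows = list(csv.reader(io.StringIO(sample)))
df_manual = pd.DataFrame(rows[1:], columns=rows[0])

# pandas route: read_csv parses the header and infers numeric types.
df_pandas = pd.read_csv(io.StringIO(sample))

print(type(df_manual["sepal_length"].iloc[0]))  # a str
print(df_pandas.dtypes)
```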

# Colleague #3

Colleague #3 exported tab-separated values.

(Look at 'data/iris-tab.txt')

In [None]:
pd.read_csv('data/iris-tab.txt',delimiter='\t')

# Online resources

Let's say we're uneasy about consulting our colleagues on this one.

The UCI machine learning repository hosts this data.

https://archive.ics.uci.edu/ml/datasets/iris

From a Jupyter notebook, you can execute shell commands.  For example, let's execute a command with wget.

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names

"GNU Wget is a free software package for retrieving files using HTTP, HTTPS, FTP and FTPS, the most widely used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc."
<br>-- https://www.gnu.org/software/wget/

The above uses "!" to execute shell commands from this notebook.

Ways to run shell commands from a Python program:
* use `os.system()`
* use `os.popen()`
* use the `subprocess` module

In [None]:
import os

In [None]:
os.system('wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names')
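For comparison, here is a minimal sketch of the `subprocess` route. The command it runs (the current Python interpreter printing a message) is only a stand-in chosen so the example works without wget or a network connection; in practice you would pass your own command as a list of arguments.

```python
import subprocess
import sys

# Run a child process and capture what it prints.
# The command here is a placeholder, not the wget call used above.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    capture_output=True,  # collect stdout and stderr
    text=True,            # decode bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)

print(result.returncode)
print(result.stdout.strip())
```

Passing the command as a list avoids shell quoting issues, and `check=True` turns a failed command into an exception rather than a silently ignored exit code.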

Consulting that file -- it's a text file, and gives us good information, but maybe more than we want right now.

In [None]:
import requests

In [None]:
x = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')

In [None]:
x

In [None]:
print(x.content)

In [None]:
x.text.split('\n')

To make a DataFrame from a list, we need the list items (that is, the eventual rows) to be lists.

This can be obtained via list comprehension.

In [None]:
# Starter examples of list comprehension
[i for i in [1,2,3]]

In [None]:
[i for i in ['What,is,this','a,comma,full,sentence']]

In [None]:
[i.split(',') for i in ['What,is,this','a,comma,full,sentence']]

In [None]:
xarr = [i.split(',') for i in x.text.split('\n')]

Essentially this is a Pythonic way of saying:
```
make a new list
whose elements are i.split(',')
-- e.g. ['5.1','3.5','1.4','0.2','Iris-setosa'] from '5.1,3.5,1.4,0.2,Iris-setosa'
for every element i from x.text.split('\n') 
-- e.g. the list of strings above
```    

In [None]:
pd.DataFrame(xarr)

In [None]:
pd.DataFrame(xarr).info()

We should still fix the data types.
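One possible way to do that is sketched below, using a short made-up string (`text`, an assumption) in the same shape as the downloaded file rather than the real response. It skips blank lines, which otherwise become empty rows, and converts the four measurement columns to floats. The column names are the ones used later in this notebook.

```python
import pandas as pd

# Made-up sample in the same shape as iris.data, including trailing blank lines.
text = "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n\n"

# Skip blank lines so they do not become empty rows.
rows = [line.split(",") for line in text.splitlines() if line.strip()]

# The raw file has no header, so we supply column names ourselves.
cols = ["sepalLength", "sepalWidth", "petalLength", "petalWidth", "species"]
df = pd.DataFrame(rows, columns=cols)

# Convert the four measurement columns from strings to floats.
num_cols = cols[:4]
df[num_cols] = df[num_cols].astype(float)

df.info()
```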

# Pandas through the web

Lo and behold, you can even pass a URL into pandas' `read_csv`.

In [None]:
irisdf = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv')

For pd.read_csv, "Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file."
<br> --https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

This means you can import from AWS S3 buckets too! (though this further requires installing the `s3fs` package and setting up authentication credentials)

In [None]:
irisdf

Notice that the first row of data was used as the column names, because this file has no header row. We can tell pandas that and supply our own names:

In [None]:
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
irisdf = pd.read_csv(path, header=None)
irisdf.columns = ['sepalLength','sepalWidth','petalLength','petalWidth','species']

In [None]:
irisdf
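As a variation, `read_csv` also accepts the column names directly through its `names` parameter, so the two steps can be combined. The sketch below reads from a small made-up string (`sample`, an assumption) instead of the URL so that it runs offline:

```python
import io
import pandas as pd

# Made-up headerless sample in the same shape as the iris CSV.
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

cols = ["sepalLength", "sepalWidth", "petalLength", "petalWidth", "species"]

# header=None says the file has no header row; names= supplies the labels.
df = pd.read_csv(io.StringIO(sample), header=None, names=cols)
print(df)
```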

Pandas has many IO capabilities:

<img src='data-sci-images/pdio.png' width=700>

## There are many ways to grab data, and many places from which to grab it.

## But now we will turn to visualizing our data