<a href="https://colab.research.google.com/github/jbshirk/DS-Unit-1-Sprint-1-Dealing-With-Data/blob/master/module2-loadingdata/Joseph_Shirk_LS_DSPT3_112_Loading_Data_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practice Loading Datasets

This assignment is purposely semi-open-ended. You will be asked to load datasets both from GitHub and from CSV files from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).

Remember that the UCI datasets may not have a file type of `.csv`, so it's important that you learn as much as you can about the dataset before you try to load it. See if you can look at the raw text of the file (locally, on GitHub, using the `!curl` shell command, or in some other way) before you try to read it in as a dataframe. This will help you catch what would otherwise be unforeseen problems.
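
One way to follow this advice from Python rather than the shell is to print the first few raw lines of a file before parsing it. The sketch below is only an illustration: the `peek` helper and the sample file are hypothetical and not part of the assignment.

```python
import itertools
import os
import tempfile


def peek(path, n=5):
    """Return the first n raw lines of a text file, without parsing it."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in itertools.islice(fh, n)]


# Hypothetical sample file standing in for a downloaded dataset.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.data")
with open(sample_path, "w", encoding="utf-8") as fh:
    fh.write("39, State-gov, 77516\n50, ?, 83311\n")

first_lines = peek(sample_path)
for line in first_lines:
    print(repr(line))  # repr() makes leading spaces and odd delimiters visible
```

Printing with `repr()` is a deliberate choice here, since stray spaces after a delimiter are otherwise easy to miss.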


## 1) Load a dataset from Github (via its *RAW* URL)

Pick a dataset from the following repository and load it into Google Colab. Make sure that the headers are what you would expect and check to see if missing values have been encoded as NaN values:

<https://github.com/ryanleeallred/datasets>

In [0]:
!git clone https://raw.githubusercontent.com/ryanleeallred/datasets/master/concrete_data.csv

Cloning into 'concrete_data.csv'...
fatal: repository 'https://raw.githubusercontent.com/ryanleeallred/datasets/master/concrete_data.csv/' not found


* The above method does not work: `git clone` expects a repository URL, and a raw URL points to a single file rather than a repository. Reading the raw URL directly with pandas works.

In [0]:
import pandas as pd

In [0]:
url = 'https://raw.githubusercontent.com/ryanleeallred/datasets/master/concrete_data.csv'
df1 = pd.read_csv(url)

In [0]:
df1.head()

Unnamed: 0,cement,blast_furnace_slag,fly_ash,water,superplasticizer,coarse_aggregate,fine_aggregate,age,concrete_compressive_strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


* Conveniently, the column headings are included.
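
The task also asks whether missing values are encoded as NaN. A minimal sketch of that check, using a hypothetical three-row excerpt rather than the real file (the column subset and the empty field are assumptions for illustration):

```python
import io

import pandas as pd

# Hypothetical excerpt; the empty field in the second data row stands in for a missing value.
csv_text = "cement,water,age\n540.0,162.0,28\n332.5,,270\n198.6,192.0,360\n"
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.columns.tolist())            # confirm the header row was picked up
missing_per_column = sample.isna().sum()  # count NaN values in each column
print(missing_per_column)
```

The same two calls can be run on `df1` to confirm its headers and missing-value counts.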

## 2) Load a dataset from your local machine
Download a dataset from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) and then upload the file to Google Colab either using the files tab in the left-hand sidebar or by importing `files` from `google.colab`. The following link will be a useful resource if you can't remember the syntax: <https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92>

While you are free to try and load any dataset from the UCI repository, I strongly suggest starting with one of the most popular datasets like those that are featured on the right-hand side of the home page. 

Some datasets on UCI will have challenges associated with importing them far beyond what we have exposed you to in class today, so if you run into a dataset that you don't know how to deal with, struggle with it for a little bit, but ultimately feel free to simply choose a different one. 

- Make sure that your file has correct headers, and the same number of rows and columns as is specified on the UCI page. If your dataset doesn't have headers use the parameters of the `read_csv` function to add them. Likewise make sure that missing values are encoded as `NaN`.
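
A minimal sketch of the `read_csv` parameters mentioned in this bullet, assuming a hypothetical headerless file in which `?` marks a missing value (the data and column names are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical headerless data; "?" marks a missing value.
raw = "5.1,3.5,setosa\n4.9,?,setosa\n6.2,2.9,versicolor\n"
column_names = ["sepal_length", "sepal_width", "species"]  # illustrative names

headerless = pd.read_csv(
    io.StringIO(raw),
    header=None,         # the first line is data, not a header
    names=column_names,  # supply the headers ourselves
    na_values="?",       # treat "?" as NaN
)
print(headerless.shape)         # compare with the counts on the UCI page
print(headerless.isna().sum())
```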

### [UCI ML - Forest Fires](https://archive.ics.uci.edu/ml/datasets/Forest+Fires)

NOTE: Choosing the same dataset that was used in the lecture may look like a lack of creativity, but I had picked it and started working with it before the lecture.

In [0]:
# https://archive.ics.uci.edu/ml/datasets/Forest+Fires

from google.colab import files
uploaded = files.upload()

Saving forestfires.csv to forestfires.csv


In [0]:
import io
df2 = pd.read_csv(io.BytesIO(uploaded['forestfires.csv']))

In [0]:
df2.head()

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0


* Conveniently, the column headings are included.

## 3) Load a dataset from UCI using `!wget`

"Shell Out" and try loading a file directly into your google colab's memory using the `!wget` command and then read it in with `read_csv`.

With this file we'll do a bit more to it.

[x] - Read it in, 

[x] fix any problems with the header 

[_] as well as make sure missing values are encoded as `NaN`.
- Use the `.fillna()` method to fill any missing values. 
 - [.fillna() documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)
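
A small sketch of `.fillna()` on a hypothetical frame. The fill values chosen here (a label for text, the median for numbers) are just one reasonable option, not a rule:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
people = pd.DataFrame(
    {"workclass": ["Private", np.nan, "State-gov"], "hours": [40.0, 38.0, np.nan]}
)

# Fill each column with a value that suits its type.
filled = people.fillna({"workclass": "unknown", "hours": people["hours"].median()})
print(filled)
```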


Create one of each of the following plots using the Pandas plotting functionality:
 - Scatterplot
 - Histogram
 - Density Plot
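
A sketch of the three plot types using the pandas plotting functionality. The data is synthetic and the column names are placeholders. The density plot relies on SciPy being installed, and the `Agg` backend line is only there so the code also runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data standing in for two numeric columns of a real dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"age": rng.integers(18, 80, 200)})
demo["hours"] = 20 + 0.3 * demo["age"] + rng.normal(0, 5, 200)

# One axes per plot so the three plots do not draw on top of each other.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
demo.plot.scatter(x="age", y="hours", ax=axes[0])
demo["age"].plot.hist(bins=20, ax=axes[1])
demo["hours"].plot.density(ax=axes[2])  # kernel density estimate; needs SciPy
fig.tight_layout()
```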


### UCI ML [Adult](https://archive.ics.uci.edu/ml/datasets/Adult) dataset

In [0]:
# parse the column names from the .names file

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names

--2019-09-06 20:55:54--  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5229 (5.1K) [application/x-httpd-php]
Saving to: ‘adult.names’


2019-09-06 20:55:54 (102 MB/s) - ‘adult.names’ saved [5229/5229]



In [0]:
!tail adult.names

education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El

In [0]:
#do the same with old.adult.names
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/adult/old.adult.names

--2019-09-06 20:58:55--  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/old.adult.names
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4267 (4.2K) [application/x-httpd-php]
Saving to: ‘old.adult.names’


2019-09-06 20:58:55 (114 MB/s) - ‘old.adult.names’ saved [4267/4267]



In [0]:
!tail old.adult.names

 NOTE : this could all be done with just awk or sed, if one has time to master them 

 1. use 'awk' to just get the last "paragraph"
 https://unix.stackexchange.com/questions/82944/how-to-grep-for-text-in-a-file-and-display-the-paragraph-that-has-the-text

 2. use 'grep -v' to drop the comment lines (in this file they begin with '|')
 https://stackoverflow.com/questions/3548453/negative-matching-using-grep-match-lines-that-do-not-contain-foo

 3. use 'cut' with ':' as delimiter to get column names only without their definitions
 https://www.computerhope.com/unix/ucut.htm

 4. use 'tr' to replace newlines with comma to form csv header. theoretically can be done with sed, but it's arcane
 https://www.computerhope.com/unix/utr.htm

 5. save output to adult.header
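
For comparison, roughly the same filtering can be sketched in plain Python. The text below is a hypothetical excerpt laid out like the names file, not the real file, and `income` is just an illustrative name for the unlabeled class column:

```python
# Hypothetical excerpt in the same layout as adult.names:
# "|" starts a comment line, and each attribute line is "name: values."
names_text = """\
| This line is a comment.
>50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc.
sex: Female, Male.
"""

header = [
    line.split(":", 1)[0].strip()  # keep only the text before the first ":"
    for line in names_text.splitlines()
    if ":" in line and not line.startswith("|")  # skip comments and non-attribute lines
]
header.append("income")  # illustrative name for the unlabeled class column

print(",".join(header))
```

Building a list instead of a comma-joined string avoids the trailing comma, and the list can be passed straight to `names=` in `read_csv`.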

In [0]:
#this needs improvement but does most of the work

!awk -v RS='' '/:/' adult.names | grep -v '|' | cut -d ':' -f 1 | tr \\n , > adult.header

!cat adult.header

age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,

In [0]:
# same as above but for old.adult.names for the sake of comparison
# not fully processing or saving this

!awk '/Attribute Information/{y=1;next}/^8/{y=0}y' old.adult.names 
#| cut -d ':' -f 1 | tr \\n , > adult.header

#!cat adult.header


age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-R

In [0]:
#manually fixed and string quoted
adultheaders = ['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country','income']

The last column, **income**, is omitted from the attribute list in the names file (the 'old.adult.names' file calls it 'class'), so it had to be added manually to the `adultheaders` variable.

In [0]:
adultdataurl =  'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
df2 = pd.read_csv(adultdataurl, names=adultheaders)

In [0]:
df2.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [0]:
df2.describe(exclude="number")

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country,income
count,32561,32561,32561,32561,32561,32561,32561,32561,32561
unique,9,16,7,15,6,5,2,42,2
top,Private,HS-grad,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States,<=50K
freq,22696,10501,14976,4140,13193,27816,21790,29170,24720


In [0]:
df2["marital-status"].value_counts()

 Married-civ-spouse       14976
 Never-married            10683
 Divorced                  4443
 Separated                 1025
 Widowed                    993
 Married-spouse-absent      418
 Married-AF-spouse           23
Name: marital-status, dtype: int64

In [0]:
df2.isna().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64

So `isna()` reports no missing values (nulls), but...

In [0]:
# randomly inspect parts of the data without having to show all of it at once
df2.sample(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
3353,49,Private,186172,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,7688,0,45,United-States,>50K
20257,28,Self-emp-inc,31717,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8583,49,Private,143482,HS-grad,9,Divorced,Other-service,Unmarried,White,Female,0,0,40,United-States,<=50K
4771,34,Local-gov,43959,HS-grad,9,Married-civ-spouse,Other-service,Husband,White,Male,0,0,50,United-States,<=50K
20890,35,Self-emp-not-inc,170174,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,7298,0,60,United-States,>50K
10884,45,Private,189225,HS-grad,9,Never-married,Other-service,Unmarried,Black,Female,0,0,40,United-States,<=50K
5064,18,?,163788,Some-college,10,Never-married,?,Own-child,White,Female,0,0,40,United-States,<=50K
2935,50,Local-gov,177705,Some-college,10,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,1740,48,United-States,<=50K
236,40,State-gov,170525,Some-college,10,Never-married,Adm-clerical,Not-in-family,White,Female,0,0,38,United-States,<=50K
27934,34,Private,180714,Some-college,10,Married-civ-spouse,Transport-moving,Husband,Black,Male,0,2179,40,United-States,<=50K


Random sampling like this is hit or miss on large datasets where anomalies or irregularities are sparse.
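
A more systematic check is to count a suspected placeholder in every text column. This is a sketch on hypothetical rows, assuming the placeholder is `?` and that values may carry a leading space:

```python
import pandas as pd

# Hypothetical rows; note the leading space before each text value.
sample = pd.DataFrame(
    {
        "workclass": [" Private", " ?", " State-gov", " ?"],
        "occupation": [" Sales", " ?", " Tech-support", " Sales"],
        "age": [39, 18, 40, 52],
    }
)

text_columns = sample.select_dtypes(exclude="number")
# Strip whitespace, then count cells that are exactly "?" in each text column.
placeholder_counts = text_columns.apply(lambda col: col.str.strip().eq("?").sum())
print(placeholder_counts)
```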

In [0]:
df2["native-country"].value_counts()

USA                            29170
 Mexico                          643
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
India                            100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 Greece                           29
 

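There are 583 unknown countries represented by '?'. Had we known this before loading, we could have marked them as missing during the import. Because every field after the first carries a leading space, `skipinitialspace=True` is needed for `na_values="?"` to match:

In [0]:
df4 = pd.read_csv(adultdataurl, names=adultheaders, na_values="?", skipinitialspace=True)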
df4.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
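
One detail worth checking when using `na_values` on this file: the fields are separated by a comma plus a space, so the placeholder arrives as `" ?"` unless the space is dropped during parsing. A sketch on a hypothetical two-row excerpt that compares the two ways of reading it:

```python
import io

import pandas as pd

# Hypothetical excerpt using the same ", " separator as adult.data.
raw = "39, State-gov, United-States\n18, ?, ?\n"
cols = ["age", "workclass", "native-country"]

# Without skipinitialspace the cells hold " ?" (with a space),
# so they are not expected to match the bare "?" in na_values.
loose = pd.read_csv(io.StringIO(raw), names=cols, na_values="?")

# With skipinitialspace the space after each comma is dropped first.
strict = pd.read_csv(io.StringIO(raw), names=cols, na_values="?", skipinitialspace=True)

print(loose.isna().sum().sum(), strict.isna().sum().sum())
```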


But since the data has already been read in, and in case it is huge or a lot of work has already been done on it, it is better to do a replace:

In [0]:
import numpy as np

In [0]:
df2 = df2.replace(regex=r"^\s*\?$", value=np.nan)  # regex pattern instead of to_replace= because the imported values carry a leading space

In [0]:
df2.head(16)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,USA,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,USA,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,USA,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,USA,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,USA,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,USA,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,USA,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,USA,>50K


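TODO: find out how to replace part of a matched string (e.g.: " India" > "India")

It should be something like the following, using Python `re` syntax. The backreference has to be a raw string (`r"\1"`), because in a normal string `"\1"` is the control character `\x01`:
````
 df.replace(regex=r"[ ]+(India)", value=r"\1")  # just India
 
 df.replace(regex=r"\s+(.+)", value=r"\1")      # any str
 ````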

In [0]:
df2 = df2.replace(regex="[ ]+(India)", value="India")

In [0]:
df2 = df2.replace(regex="[ ]+(United-States)", value="USA")

In [0]:
df2["native-country"].value_counts()
# shows that USA and India were stripped of the leading space, but the same problem exists for all other values; probably best to strip every value (e.g. with a list comprehension or .str.strip())

USA                            29170
 Mexico                          643
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
India                            100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 Greece                           29
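
A sketch of stripping the whitespace from every text column in one pass, rather than one country at a time. The frame below is hypothetical and `Series.str.strip()` is used in place of a list comprehension:

```python
import pandas as pd

# Hypothetical rows with the same leading-space problem.
messy = pd.DataFrame(
    {"native-country": [" Cuba", " India", " Mexico"], "age": [28, 30, 41]}
)

text_cols = messy.select_dtypes(exclude="number").columns
# Series.str.strip() removes leading and trailing whitespace from every value.
messy[text_cols] = messy[text_cols].apply(lambda col: col.str.strip())

print(messy["native-country"].tolist())
```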
 

In [0]:
import pandas as pd

In [0]:
import re  # Python's regex module; df.replace(regex=...) accepts the same pattern syntax
help(re)

In [0]:
help(df2.replace)

In [0]:
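df5 = df2.copy()  # df5 has to exist before a column can be assigned to it
# NOTE: "\1" in a normal (non-raw) string is the control character \x01, not a backreference
df5["native-country"] = df2["native-country"].replace(regex="[ ]+(.+)", value="\1") #spaces followed by any string
df5["native-country"].value_counts()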

USA      29170
         2708
India      100
Name: native-country, dtype: int64

In [0]:
df5.shape #shape is an attribute, not a method ; above code destroyed all other values in native-country
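
The values were most likely destroyed because `"\1"` in a normal Python string is a single control character, not a backreference. A sketch of the same idea with raw strings, on a hypothetical Series:

```python
import pandas as pd

countries = pd.Series([" Cuba", " India", " United-States"])

# In a normal string literal, "\1" is the single control character chr(1).
print(repr("\1"))

# In a raw string, r"\1" is a backslash followed by "1", which the regex
# engine reads as "the text captured by group 1".
cleaned = countries.replace(regex=r"^\s+(.+)$", value=r"\1")
print(cleaned.tolist())
```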

Given the problems in this case, it would be best to eliminate the leading spaces on import by setting the delimiter to ", ":

In [0]:
df2 = pd.read_csv(adultdataurl, sep=', ', engine='python',  # multi-character separators need the python engine
    names=adultheaders)

  


In [0]:
df2.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [0]:
df2["native-country"].value_counts()

United-States                 29170
Mexico                          643
unknown                         583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
Greece                      

In [0]:
df2 = df2.replace(regex="[ ]*(United-States)", value="USA")  # [ ]* because no leading space remains after sep=', '

In [0]:
df2 = df2.replace(regex=r"\?", value="unknown") #use unknown because value_counts() skips NaN by default
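
If NaN is preferred over an "unknown" label, `value_counts` can still count the missing values with `dropna=False`. A minimal sketch on a hypothetical Series:

```python
import numpy as np
import pandas as pd

countries = pd.Series(["Cuba", np.nan, "India", np.nan, "Cuba", "Cuba"])

default_counts = countries.value_counts()          # NaN is left out
all_counts = countries.value_counts(dropna=False)  # NaN gets its own row
print(all_counts)
```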

In [0]:
help(pd.read_csv)

# Stretch Goals - Other types and sources of data

Not all data comes in a nice single file - for example, image classification involves handling lots of image files. You still will probably want labels for them, so you may have tabular data in addition to the image blobs - and the images may be reduced in resolution and even fit in a regular csv as a bunch of numbers.

If you're interested in natural language processing and analyzing text, that is another example where, while it can be put in a csv, you may end up loading much larger raw data and generating features that can then be thought of in a more standard tabular fashion.

Overall you will in the course of learning data science deal with loading data in a variety of ways. Another common way to get data is from a database - most modern applications are backed by one or more databases, which you can query to get data to analyze. We'll cover this more in our data engineering unit.

How does data get in the database? Most applications generate logs - text files with lots and lots of records of each use of the application. Databases are often populated based on these files, but in some situations you may directly analyze log files. The usual way to do this is with command line (Unix) tools - command lines are intimidating, so don't expect to learn them all at once, but depending on your interests it can be useful to practice.

One last major source of data is APIs: https://github.com/toddmotto/public-apis

API stands for Application Programming Interface, and while originally meant e.g. the way an application interfaced with the GUI or other aspects of an operating system, now it largely refers to online services that let you query and retrieve data. You can essentially think of most of them as "somebody else's database" - you have (usually limited) access.

*Stretch goal* - research one of the above extended forms of data/data loading. See if you can get a basic example working in a notebook. Image, text, or (public) APIs are probably more tractable - databases are interesting, but there aren't many publicly accessible and they require a great deal of setup.

## https://www.bls.gov/data/#employment

query API https://beta.bls.gov/dataQuery/find?fq=survey:[ln]&s=popularity:D

current population survey https://www.bls.gov/cps/

employment https://www.bls.gov/cps/lfcharacteristics.htm#emp

databases https://data.bls.gov/cgi-bin/surveymost?ln

characteristics of the employed https://www.bls.gov/cps/tables.htm#charemp
* HTML format https://www.bls.gov/cps/cpsaat23.htm

## Reading HTML (scraping)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

This will be difficult because of stacked summary columns. The US government loves to publish cross tabulations for statistics.

* [read_html() doesn't handle tables with multiple header rows #13434 ](https://github.com/pandas-dev/pandas/issues/13434)


I've gone down a dead end. Need to backtrack to some data that is more accessible.

Arg. BLS gives .xlsx files, which I can't open.

Trying to work with BLS was a bad idea.




### Resources for dealing with summarized data in pandas

[MultiIndex / advanced indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)

[How to create a multilevel dataframe in pandas](https://stackoverflow.com/questions/40820017/how-to-create-a-multilevel-dataframe-in-pandas)





NOTE: multi-index dataframes are well beyond my capability currently, so I'll skip to the JSON problem.

## CryptoCompare API
https://min-api.cryptocompare.com/documentation

dealing with JSON as a dict https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html

In [0]:
ccurl='https://min-api.cryptocompare.com/data/price?fsym=HOLO&tsyms=USD,JPY,EUR' #endpoint for HOLO priced in USD, JPY, EUR


In [0]:
cchurl='https://min-api.cryptocompare.com/data/histoday?fsym=HOLO&tsym=USD&limit=10' #endpoint for historical price of HOLO in USD


retrieve JSON data into Pandas https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html



In [0]:
import requests

In [0]:
ccrequest = requests.get(cchurl)

In [0]:
jsondata = ccrequest.json()

In [0]:
type(jsondata)

dict

In [0]:
jsondata.keys()



In [0]:
res = jsondata['Data'] #key 'All' did not work

In [0]:
res[:3]

[{'close': 0.0007976,
  'high': 0.0008893,
  'low': 0.0007855,
  'open': 0.0008498,
  'time': 1566950400,
  'volumefrom': 992242598,
  'volumeto': 791412.7},
 {'close': 0.0007839,
  'high': 0.0008025,
  'low': 0.0007586,
  'open': 0.0007976,
  'time': 1567036800,
  'volumefrom': 287459166,
  'volumeto': 225339.24},
 {'close': 0.0007988,
  'high': 0.0008325,
  'low': 0.0007718,
  'open': 0.0007839,
  'time': 1567123200,
  'volumefrom': 307556309,
  'volumeto': 245675.98}]

In [0]:
df6 = pd.read_json(jsondata) #wrong way to do it: read_json expects a JSON string, path, or file-like object, and jsondata is already a dict

JSON maps naturally onto a Python dictionary, and `ccrequest.json()` has already converted the response into one.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html

In [116]:
df6 = pd.DataFrame.from_dict(jsondata ) # orient='columns' is default

ValueError: ignored

https://stackoverflow.com/questions/49505872/read-json-to-pandas-dataframe-valueerror-mixing-dicts-with-non-series-may-lea
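
The error is consistent with a payload that mixes scalar fields and a list of records at the top level. A sketch with a hypothetical payload shaped like the response above (the `Response` key and the trimmed records are assumptions, only `Data` is known from the cells above):

```python
import pandas as pd

# Hypothetical payload: scalar fields at the top level plus a list of
# records under "Data". The "Response" key is an assumption.
payload = {
    "Response": "Success",
    "Data": [
        {"time": 1566950400, "open": 0.0008498, "close": 0.0007976},
        {"time": 1567036800, "open": 0.0007976, "close": 0.0007839},
    ],
}

# Build the frame from the list of records only, not from the whole dict.
prices = pd.DataFrame(payload["Data"])
print(prices)
```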

In [0]:
import json

In [119]:
data = json.loads(jsondata)  # fails: json.loads expects a str, but jsondata is already a dict

TypeError: ignored

In [24]:
# https://stackoverflow.com/questions/33559660/error-while-reading-json-file

df6 = pd.DataFrame(res["Data"])

NameError: ignored

In [23]:
data

NameError: ignored

In [0]:
df6 = pd.DataFrame(res[:3]) #only a slice ; not sure how to get all data

In [127]:
df6.head()

Unnamed: 0,close,high,low,open,time,volumefrom,volumeto
0,0.000798,0.000889,0.000785,0.00085,1566950400,992242598,791412.7
1,0.000784,0.000803,0.000759,0.000798,1567036800,287459166,225339.24
2,0.000799,0.000833,0.000772,0.000784,1567123200,307556309,245675.98


In [0]:
df6 = pd.DataFrame(res)  # the whole list of records, no slice needed

In [129]:
df6.head()

Unnamed: 0,close,high,low,open,time,volumefrom,volumeto
0,0.000798,0.000889,0.000785,0.00085,1566950400,992242598.0,791412.7
1,0.000784,0.000803,0.000759,0.000798,1567036800,287459166.0,225339.24
2,0.000799,0.000833,0.000772,0.000784,1567123200,307556309.0,245675.98
3,0.000795,0.000837,0.000792,0.000799,1567209600,233869685.0,185996.56
4,0.000797,0.000802,0.00078,0.000795,1567296000,198968365.0,158597.68


There ought to be more than 5 results (there are 11); `head()` only shows the first 5 rows by default.
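
A small sketch of the row count check, plus a conversion of the `time` column from epoch seconds to dates. The frame is hypothetical: 11 daily timestamps starting from the first `time` value shown above.

```python
import pandas as pd

# Hypothetical frame with 11 rows of daily timestamps (seconds since the epoch).
start = 1566950400
daily = pd.DataFrame({"time": [start + 86400 * i for i in range(11)]})

print(len(daily.head()))  # head() returns the first 5 rows unless told otherwise
print(len(daily))         # the full frame still holds every row

# Convert the epoch seconds into readable dates.
daily["date"] = pd.to_datetime(daily["time"], unit="s")
print(daily["date"].iloc[0])
```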

Seems to me that the trick must have something to do with [read_json(... 'orient='...)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)

orient : string,

    Indication of expected JSON string format. Compatible JSON strings can be produced by to_json() with a corresponding orient value. The set of possible orients is:

        'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
        'records' : list like [{column -> value}, ... , {column -> value}]
        'index' : dict like {index -> {column -> value}}
        'columns' : dict like {column -> {index -> value}}
        'values' : just the values array

    The allowed and default values depend on the value of the typ parameter.

        when typ == 'series',
            allowed orients are {'split','records','index'}
            default is 'index'
            The Series index must be unique for orient 'index'.
        when typ == 'frame',
            allowed orients are {'split','records','index', 'columns','values', 'table'}
            default is 'columns'
            The DataFrame index must be unique for orients 'index' and 'columns'.
            The DataFrame columns must be unique for orients 'index', 'columns', and 'records'.


In [13]:
pd.read_json('cchurl',orient='records',typ='series')
# Defaults
 # typ='frame' : orient='columns'
 # typ='series' : orient='index'

ValueError: ignored

In [18]:
cchurl

'https://min-api.cryptocompare.com/data/histoday?fsym=HOLO&tsym=USD&limit=10'

In [19]:
type(cchurl) # a str is accepted; the failing call above passed the literal string 'cchurl' (in quotes) instead of the variable cchurl

str

[pandas.read_json](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)

pandas.read_json(path_or_buf=None, ...)

**path_or_buf** : a valid JSON str, path object or file-like object

Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.json.

If you want to pass in a path object, pandas accepts any os.PathLike.

By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.



In [0]:
jsondata

In [25]:
df = pd.read_json(cchurl,typ='series')
pd.DataFrame(data=df['Data'])

Unnamed: 0,close,high,low,open,time,volumefrom,volumeto
0,0.000784,0.000802,0.000759,0.000798,1567036800,287459166.0,225339.24
1,0.000799,0.000833,0.000772,0.000784,1567123200,307556309.0,245675.98
2,0.000795,0.000837,0.000792,0.000799,1567209600,233869685.0,185996.56
3,0.000797,0.000802,0.00078,0.000795,1567296000,198968365.0,158597.68
4,0.000799,0.000831,0.000781,0.000797,1567382400,211339298.0,168881.23
5,0.000793,0.00081,0.000779,0.000799,1567468800,195096404.0,154809.0
6,0.000805,0.000814,0.000768,0.000793,1567555200,188384917.7,151649.86
7,0.00082,0.000827,0.000795,0.000805,1567641600,195401246.0,160150.86
8,0.000786,0.000815,0.000764,0.00082,1567728000,369126913.0,290244.49
9,0.000797,0.000834,0.000613,0.000786,1567814400,298542574.0,237968.29


Found helpful solution at : https://stackoverflow.com/questions/50437311/reading-a-json-file-into-a-dataframe-without-using-the-json-module#1

## Conclusion

I simply don't understand multi-dimensional JSON data well enough to complete this as well as I would like, but this result is satisfactory for the time being.

Ironically, the problem of multi-index JSON data is not much different from the BLS summary statistics. I literally went down the same blind alley twice.

I can see that it will be worth the time to get a grip on multi-index and crosstab data import to pandas.