
# Practice Loading Datasets

This assignment is purposely semi-open-ended: you will be asked to load datasets both from GitHub and from CSV files from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).

Remember that the UCI datasets may not have a file type of `.csv`, so it's important to learn as much as you can about a dataset before you try to load it. See if you can look at the raw text of the file (locally, on GitHub, with the `!curl` shell command, or in some other way) before you read it in as a dataframe. This will help you catch otherwise unforeseen problems.
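As a rough illustration of inspecting raw text before loading it, the sketch below prints the first lines of a file-like object and counts candidate delimiters. The helper names `peek` and `count_delimiters` and the sample rows are hypothetical, made up for this example; with a real download you would pass `open(path)` instead of the in-memory sample.

```python
import io
from itertools import islice

def peek(lines, n=5):
    """Return the first n lines of an iterable of text lines, newlines stripped."""
    return [line.rstrip("\n") for line in islice(lines, n)]

def count_delimiters(line, candidates=(",", ";", "\t", " ")):
    """Count how often each candidate delimiter occurs in one line."""
    return {delim: line.count(delim) for delim in candidates}

# Hypothetical stand-in for a downloaded file; with a real file, use open(path).
sample_file = io.StringIO(
    "15.26\t14.84\t0.871\n14.88\t14.57\t0.8811\n14.29\t14.09\t0.905\n"
)

first_lines = peek(sample_file, n=2)
delimiter_counts = count_delimiters(first_lines[0])
print(first_lines)
print(delimiter_counts)
```

A delimiter that appears the same number of times on every line is a reasonable first guess for the `sep` argument of `read_csv`.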


## 1) Load a dataset from GitHub (via its *RAW* URL)

Pick a dataset from the following repository and load it into Google Colab. Make sure that the headers are what you would expect and check to see if missing values have been encoded as NaN values:

<https://github.com/ryanleeallred/datasets>
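One way to check headers and missing-value encoding is sketched below on a small in-memory CSV. The CSV text, the `?` placeholder, and the name `df_demo` are assumptions made for illustration; they are not taken from the repository above.

```python
import io
import pandas as pd

# Hypothetical CSV text: "?" marks a missing score, and the last state is blank.
raw_csv = io.StringIO("id,score,state\n1,3.5,AZ\n2,?,VT\n3,7.0,\n")

# Blank fields become NaN by default; "?" only does so because it is
# listed in na_values.
df_demo = pd.read_csv(raw_csv, na_values=["?"])

print(list(df_demo.columns))   # confirm the header row was picked up
print(df_demo.isnull().sum())  # count NaN values per column
```

If a column that should be numeric comes back with dtype `object`, that is often a hint that some placeholder string was not recognized as missing.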

In [18]:
import pandas as pd
data_url = 'https://raw.githubusercontent.com/ryanleeallred/datasets/master/messy-data.csv'

df = pd.read_csv(data_url)

df.head()

Unnamed: 0,alpha,beta,gamma,delta,epsilon,zeta,eta
0,2,48,12,240,3.0,Yes,AZ
1,3,46,18,230,5.0,,VT
2,4,44,24,220,7.0,No,PA
3,5,42,30,210,9.0,Yes,OK
4,6,44,36,220,11.0,Yes,MD


In [19]:
df.isnull().sum()

alpha       0
beta        0
gamma       0
delta       0
epsilon    24
zeta       19
eta         0
dtype: int64

## 2) Load a dataset from your local machine
Download a dataset from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) and then upload the file to Google Colab, either using the files tab in the left-hand sidebar or by importing `files` from `google.colab`. The following link will be a useful resource if you can't remember the syntax: <https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92>

While you are free to try and load any dataset from the UCI repository, I strongly suggest starting with one of the most popular datasets like those that are featured on the right-hand side of the home page. 

Some datasets on UCI will have challenges associated with importing them far beyond what we have exposed you to in class today, so if you run into a dataset that you don't know how to deal with, struggle with it for a little bit, but ultimately feel free to simply choose a different one. 

- Make sure that your file has correct headers, and the same number of rows and columns as is specified on the UCI page. If your dataset doesn't have headers use the parameters of the `read_csv` function to add them. Likewise make sure that missing values are encoded as `NaN`.
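A minimal sketch of adding headers to a headerless file and checking its shape follows. The two data rows are invented for the example, and `expected_shape` stands in for the row and column counts you would read off the dataset's UCI page.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt of a file that has no header row.
raw_rows = io.StringIO("1,1,1,13,2,4,2,3,1,12,0\n3,12,3,2,3,11,4,5,2,5,1\n")
col_names = ['S1', 'C1', 'S2', 'C2', 'S3', 'C3', 'S4', 'C4', 'S5', 'C5', 'Class']

# header=None keeps the first data row as data; names supplies the headers.
df_headerless = pd.read_csv(raw_rows, header=None, names=col_names)

expected_shape = (2, 11)  # (rows, columns) for this made-up sample
print(df_headerless.shape == expected_shape)
```

Without `header=None`, pandas would treat the first data row as the header, leaving the dataframe one row short of the documented count.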

In [20]:
import io
import pandas as pd
from google.colab import files
uploaded = files.upload()


In [21]:
col_names = ['S1', 'C1', 'S2', 'C2', 'S3', 'C3', 'S4', 'C4', 'S5', 'C5', 'Class']
#the file has no header row, so pass header=None and supply the column names directly
df2 = pd.read_csv('poker-hand-testing.data', header=None, names=col_names)

FileNotFoundError: ignored

In [0]:
df2.head()

In [0]:
df2.describe().T

In [0]:
df2.isnull().sum()

In [0]:
df2 = df2.dropna()


In [0]:
df2.isnull().sum()

In [0]:
df2 = df2.astype(int)

In [0]:
df2.dtypes

## 3) Load a dataset from UCI using `!wget`

"Shell Out" and try loading a file directly into your google colab's memory using the `!wget` command and then read it in with `read_csv`.

We'll do a bit more with this file.

- Read it in, fix any problems with the header, and make sure missing values are encoded as `NaN`.
- Use the `.fillna()` method to fill any missing values. 
 - [.fillna() documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)
- Create one of each of the following plots using the Pandas plotting functionality:
 - Scatterplot
 - Histogram
 - Density Plot
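The `.fillna()` step from the list above might look like the sketch below. The dataframe, its column names, and the chosen fill values (column mean for numbers, a placeholder string for text) are assumptions for illustration; the right fill strategy depends on the dataset.

```python
import pandas as pd

# Hypothetical dataframe with one missing value in each column.
df_gaps = pd.DataFrame({
    "area": [15.0, float("nan"), 14.0],
    "variety": ["Kama", None, "Rosa"],
})

# Passing a dict to fillna gives each column its own fill value.
fill_values = {"area": df_gaps["area"].mean(), "variety": "unknown"}
df_filled = df_gaps.fillna(fill_values)

print(df_filled)
```

`fillna` returns a new dataframe here, so `df_gaps` still holds the original missing values for comparison.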


In [0]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt

In [0]:
df_seeds = pd.read_csv('seeds_dataset.txt', header=None, sep=r'\s+')  #fields are whitespace-separated, so
                                                                      #sep=r'\s+' splits on any run of whitespace

In [0]:
df_seeds.head()
#column 7 is the class label (wheat variety), not one of the seven measurements

In [0]:
seed_cols = ['area', 'perimeter', 'compactness', 'length', 
             'width', 'asymmetry coefficient', 'length of groove']
df_seeds = df_seeds.drop(df_seeds.columns[7], axis=1)  #this drops column with index 7
df_seeds.columns = seed_cols

In [0]:
df_seeds.head()

In [0]:
df_seeds.isnull().sum()

In [0]:
df_seeds['area'].hist()

In [0]:

df_seeds['area'].plot.density()

In [0]:
df_seeds.plot.scatter('area', 'length of groove')

## Stretch Goals - Other types and sources of data

Not all data comes in a nice single file - for example, image classification involves handling lots of image files. You still will probably want labels for them, so you may have tabular data in addition to the image blobs - and the images may be reduced in resolution and even fit in a regular csv as a bunch of numbers.

If you're interested in natural language processing and analyzing text, that is another example where, while it can be put in a csv, you may end up loading much larger raw data and generating features that can then be thought of in a more standard tabular fashion.

Over the course of learning data science, you will load data in a variety of ways. Another common source is a database - most modern applications are backed by one or more databases, which you can query to get data to analyze. We'll cover this more in our data engineering unit.
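To make "querying a database" concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `visits` table, its columns, and its rows are invented for this example; a real application database would already exist and would usually need connection credentials.

```python
import sqlite3

# An in-memory database stands in for a real application database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (username TEXT, pages INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("ana", 3), ("bo", 5), ("ana", 2)],
)

# Query: total pages viewed per user, sorted by name.
rows = conn.execute(
    "SELECT username, SUM(pages) FROM visits GROUP BY username ORDER BY username"
).fetchall()
conn.close()

print(rows)
```

The result is a list of tuples, which can be passed to `pd.DataFrame` for further analysis.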

How does data get in the database? Most applications generate logs - text files with lots and lots of records of each use of the application. Databases are often populated based on these files, but in some situations you may directly analyze log files. The usual way to do this is with command line (Unix) tools - command lines are intimidating, so don't expect to learn them all at once, but depending on your interests it can be useful to practice.
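Command line tools are the usual route, but as an alternative, a log file can also be parsed in Python. The sketch below assumes a hypothetical log format of `<date> <time> <LEVEL> <message>`; real logs vary widely, so the regular expression would need adjusting.

```python
import re
from collections import Counter

# Hypothetical log excerpt in the assumed "<date> <time> <LEVEL> <message>" format.
log_text = """\
2019-09-10 12:00:01 INFO user=ana action=login
2019-09-10 12:00:05 ERROR user=bo action=upload
2019-09-10 12:01:12 INFO user=ana action=logout
"""

line_pattern = re.compile(r"^(\S+) (\S+) (\w+) (.*)$")

# Turn each matching line into a dict, i.e. one tabular record per log entry.
log_records = []
for line in log_text.splitlines():
    match = line_pattern.match(line)
    if match:
        date, time, level, message = match.groups()
        log_records.append(
            {"date": date, "time": time, "level": level, "message": message}
        )

level_counts = Counter(record["level"] for record in log_records)
print(level_counts)
```

Once the lines are dicts, they can be loaded into a dataframe like any other tabular data.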

One last major source of data is APIs: https://github.com/toddmotto/public-apis

API stands for Application Programming Interface. The term originally meant, for example, the way an application interfaced with the GUI or other aspects of an operating system, but it now largely refers to online services that let you query and retrieve data. You can essentially think of most of them as "somebody else's database" - you have (usually limited) access.
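Many online APIs return JSON text. The sketch below skips the network request and only shows the parsing step, on a made-up response body; the field names `results`, `name`, `score`, and `count` are assumptions, since every API defines its own structure.

```python
import json

# Hypothetical response body, shaped like what a JSON API might return.
response_text = (
    '{"results": [{"name": "Ada", "score": 91}, '
    '{"name": "Lin", "score": 78}], "count": 2}'
)

payload = json.loads(response_text)   # JSON text -> Python dicts and lists
api_records = payload["results"]      # the list of records inside the payload
names = [record["name"] for record in api_records]

print(payload["count"], names)
```

A list of dicts like `api_records` maps naturally onto rows and columns, which is why API data often ends up in a dataframe.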

*Stretch goal* - research one of the above extended forms of data/data loading. See if you can get a basic example working in a notebook. Image, text, or (public) APIs are probably more tractable - databases are interesting, but there aren't many publicly accessible and they require a great deal of setup.

In [29]:
!pip install tabula-py



In [0]:
from tabula import read_pdf
from tabula import convert_into

#data file from https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/31721/datadocumentation

In [44]:
#Output is an empty list even though the PDF contains data; possibly because
#tabula only parses page 1 unless a pages argument (e.g. pages='all') is given
convert_into('/content/31721-0001-Codebook.pdf', "data.json", output_format='json')
!cat data.json

[]

In [48]:
!pip install PyPDF2



In [0]:
import PyPDF2

In [0]:
pdfFileObj = open('/content/31721-0001-Codebook.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

In [51]:
pdfReader.numPages   #at least this is reading the PDF like it has data in it

286

In [56]:
pageObj = pdfReader.getPage(10)
pageObj.extractText()   #this does something but it's not useful

'ˆ\n6#\n2-˘\n()ˆ%˙\n7\n\n2˝˛\' "# !!˛˝!˜\n\nˇˆ\n\n9$\n\n\n\n;!!$\n˘\n˘ˇˆ\n\n\'\n ˜\n˜˜7\n2˛-)˛˝!#˛- ˛˝#˛ˇ\n\n.*!˚34#!5˛˝31\n\n#˚˛"˛#"\n˘˘ˇ˘\n\n$@A+%B˘%#%1\'ˆ!C,%C/%˝D%1˛\n=˚˛!˜-""˛#!˛˝?˚!#&˛#\'$"->0 "!#˛05"!#˛.2˝˛#\'1\nˆ\n6#\n2-˘\n()ˆ%˙\n7\n\n˝#\n\nˇˆ\n\n/˝#\n\n˘ˇˆ\n\n\'\n ˜\n˜˜7\n2˛-)˛˝!#˛- ˛˝#˛ˇ\n:\n;!!5-53ˇ\n:\n;˛<!5-53ˇ\n\n.*!˚34#!5˛˝31\n\n-5"!#\n$@A+%B˘%#%1\'ˆ!:˘˝ˆ%˛\n=˚˛!˜-""˛#!˛˝?˚!#&˛#\'$"->=˚!.D˛-#˛!˛1\nˆ\n6#\n2-˘\n()ˆ%˙\n7\n\n˝#\n\nˇ˘ˆ\n\n/˝#\n˘\n˘ˇˆ\n\n\'\n ˜\n˜˜7\n2˛-)˛˝!#˛- ˛˝#˛ˇ\n:\n;!!5-53ˇ\n:\n;˛<!5-53ˇ\n\n.*!˚34#!5˛˝31\n\n-5"!#\n$@A+%B˘%#%1\'ˆ!&%\'6\'B6\n=˚˛!˜-""˛#!˛˝?˚!#&˛#\'$"->@!˛!#"B˛!?B˛!˛\n\n\n\n\n'

In [58]:
pdfReader.isEncrypted   #so it's not encrypted, it's just not outputting anything useful

False

In [0]:
path = '/content/31721-0001-Codebook.pdf'   #I'm getting tired of typing this out each time

In [64]:
!pip install camelot-py  #a different library that says it reads PDF into Python https://hackernoon.com/announcing-camelot-a-python-library-to-extract-tabular-data-from-pdfs-605f8e63c2d5

Collecting camelot-py
  Downloading https://files.pythonhosted.org/packages/70/d6/a47894242a6fba58a2332489358afedc6209da43942ab7f850b932019101/camelot_py-0.7.3-py3-none-any.whl (42kB)
Installing collected packages: camelot-py
Successfully installed camelot-py-0.7.3


In [0]:
import camelot

In [68]:
tables = camelot.read_pdf(path)   #I get an error saying I need Ghostscript so I guess I'll install that

RuntimeError: ignored

In [69]:
!pip install ghostscript



In [70]:
tables = camelot.read_pdf(path)

RuntimeError: ignored
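The error persisted after `pip install ghostscript`. One possible explanation is that the pip package only provides Python bindings, while Camelot needs the Ghostscript program itself installed on the system (on a Linux environment such as Colab, typically through the system package manager). The sketch below is one way to check for it; the function name `ghostscript_available` is hypothetical, and the name `gs` is an assumption that holds on Linux and macOS but not on Windows.

```python
import ctypes.util
import shutil

def ghostscript_available():
    """Return True if a Ghostscript executable or shared library can be located."""
    has_executable = shutil.which("gs") is not None           # gs binary on PATH
    has_library = ctypes.util.find_library("gs") is not None  # libgs shared library
    return has_executable or has_library

gs_found = ghostscript_available()
print("Ghostscript found:", gs_found)
```

If this prints `False`, installing the system package and re-running the check before calling `camelot.read_pdf` again would be a reasonable next step.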