# 05 Obtaining and Cleaning Data
__Math 3080: Fundamentals of Data Science__

Reading:
* McKinney, Chapter 6 - Data Loading, Storage, and File Formats
* Grus, Chapter 9 - Getting Data
* Geron, Chapter 2 - End-to-End Machine Learning Project, pp. 46-51, 62-68

Outline:
1. Obtaining the data
    * Downloading the data directly
    * HTML Scraping
    * APIs, JSON, and XML
    * Filtering the Data
2. Loading the data
    * Loading in NumPy
    * Loading in Pandas
3. Cleaning the Data

-----
We have learned about the calculations needed for basic models. Linear algebra is used frequently, both to handle the data and to carry out the mathematics behind the statistical models we use. We will return to linear regression later in the course, along with logistic regression and decision trees.

* Bring up the Question/Data circle: 
  * Questions --> Datasets --> Data Types --> Input/Calculations/Output
* We have talked about some of the calculations, and there will be a lot more
* An overarching concern in the entire process is __Data Wrangling__ (Draw in middle of data circle). Data Wrangling consists of:
    1. Obtaining the data
    2. Cleaning the data
    3. Manipulating the data (not changing numbers, but arranging the data in useful formats)
    4. Visualization and Analysis of the data (Leads into the calculation part of the cycle)

In this segment, we will look at how to obtain and clean the data. The next two segments will look at how to manipulate and visualize the data.

## 5.1 Obtaining the data
Where can we get data?

### Online websites (Kaggle, Data Centers)
Data is stored all over the web. Most websites that deal with data provide a way to download it. For example:
  * kaggle.com

Sometimes the data is only displayed on a page, so you have to copy it, paste it into Excel or a text editor, and save it in a format that can be loaded. For example:
  *  https://www.weather.gov/wrh/timeseries?site=K41U

This works just fine, but every time you need updated data, you have to capture the data, put it into Excel, save it in the right format, and then load it into your program. It would be very helpful if we could get the data automatically. There are a couple of good ways to do this:
  * HTML Scraping: code that goes through an HTML file and grabs the displayed data (done in Data Mining - 2nd semester)
  * APIs (Application Programming Interfaces): some data is available online and can be loaded directly into the program
    * INQUIRE TO SEE WHO HAS ALREADY WORKED WITH APIs
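
As a rough illustration of the HTML scraping idea, the sketch below pulls the cells out of an HTML table using only Python's standard library. The `TableParser` class and the sample page content are made up for this example; a real page would first have to be downloaded, and its layout would likely need more careful handling.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row currently being read
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that appears inside a table cell
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up page content standing in for a downloaded HTML file
html = """
<table>
  <tr><th>Time</th><th>Temp (F)</th></tr>
  <tr><td>08:00</td><td>41</td></tr>
  <tr><td>09:00</td><td>45</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
header, *records = parser.rows  # first row holds the column names
```

Every scraped value arrives as a string, so numeric columns still need to be converted before any calculations.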

### APIs
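
Many APIs return their data as JSON text. The sketch below shows how such a response could be turned into Python objects with the standard `json` module. The response body and its field names (`station`, `observations`, `temp_f`) are invented for illustration; in a live program the text would come from an HTTP request to the API's URL, and the real field names would come from that API's documentation.

```python
import json

# Made-up response body standing in for what an API might return
response_text = """
{
  "station": "K41U",
  "observations": [
    {"time": "2024-01-01T08:00", "temp_f": 41.0},
    {"time": "2024-01-01T09:00", "temp_f": 45.0}
  ]
}
"""

# json.loads turns JSON text into nested dicts and lists
payload = json.loads(response_text)

# Pull one field out of every observation
temps = [obs["temp_f"] for obs in payload["observations"]]
mean_temp = sum(temps) / len(temps)
```

Because the result is ordinary dictionaries and lists, it can be passed straight into NumPy or pandas for further work.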



## 5.2 Loading the data

In [27]:
# Loading files for reading using NumPy
import numpy as np

matrix = np.array([[1,2,3],
                   [2,3,4],
                   [3,4,5]])

np.save('data/matrix.npy', matrix)

In [28]:
load_file = np.load('data/matrix.npy')
load_file

array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])

In [29]:
np.loadtxt('data/test.txt', delimiter=' ')
# See the np.loadtxt documentation for the other options; common delimiters include:
#   ','   comma-separated values
#   '\t'  tab-separated values

array([[1., 2., 3., 4., 5.],
       [2., 3., 4., 5., 6.],
       [3., 4., 5., 6., 7.],
       [4., 5., 6., 7., 8.]])
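
To try the delimiter options without creating files, `np.loadtxt` can also read from a file-like object. The following is a small sketch using in-memory text; the sample strings are made up for illustration.

```python
import io
import numpy as np

csv_text = "1,2,3\n4,5,6\n"           # comma-separated sample
tab_text = "x\ty\n1\t2\n3\t4\n"       # tab-separated sample with a header line

# Comma-delimited data; values are loaded as floats by default
a = np.loadtxt(io.StringIO(csv_text), delimiter=',')

# Tab-delimited data; skiprows=1 skips the header line
b = np.loadtxt(io.StringIO(tab_text), delimiter='\t', skiprows=1)

# dtype=int loads the values as integers instead of floats
c = np.loadtxt(io.StringIO(csv_text), delimiter=',', dtype=int)
```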

In [30]:
# Loading files using pandas
import pandas as pd

# Without header=None, the first row of the file is used as the column labels
df = pd.read_csv('data/test.txt', delimiter=' ')
df

   1  2  3  4  5
0  2  3  4  5  6
1  3  4  5  6  7
2  4  5  6  7  8


In [31]:
# header=None keeps the first row as data and numbers the columns 0, 1, 2, ...
df = pd.read_csv('data/test.txt', delimiter=' ', header=None)
df

   0  1  2  3  4
0  1  2  3  4  5
1  2  3  4  5  6
2  3  4  5  6  7
3  4  5  6  7  8
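
Column labels can also be supplied at load time through the `names` argument of `pd.read_csv`. The sketch below reads the same kind of space-delimited text from an in-memory string so that it runs without a data file; the labels are only examples.

```python
import io
import pandas as pd

# In-memory stand-in for a space-delimited text file
text = "1 2 3 4 5\n2 3 4 5 6\n3 4 5 6 7\n4 5 6 7 8\n"

# names= assigns the column labels; header=None keeps the first row as data
df = pd.read_csv(
    io.StringIO(text),
    sep=' ',
    header=None,
    names=['HW 1', 'HW 2', 'HW 3', 'Quiz 1', 'Exam 1'],
)
```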


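In [32]:
# Label the columns and rows with plain lists.
# A nested list ([[...]]) would create a MultiIndex, which changes how
# the selections in the following cells behave.
df = pd.read_csv('data/test.txt', delimiter=' ', header=None)
df.columns = ['HW 1', 'HW 2', 'HW 3', 'Quiz 1', 'Exam 1']
df.index = ['001', '002', '003', '004']
df

     HW 1  HW 2  HW 3  Quiz 1  Exam 1
001     1     2     3       4       5
002     2     3     4       5       6
003     3     4     5       6       7
004     4     5     6       7       8


In [33]:
# Select one column by its label
df['HW 2']

001    2
002    3
003    4
004    5
Name: HW 2, dtype: int64


In [34]:
# Select one row by its index label
df.loc['003']

HW 1      3
HW 2      4
HW 3      5
Quiz 1    6
Exam 1    7
Name: 003, dtype: int64


In [35]:
# Select the same row by its integer position (counting from 0)
df.iloc[2]

HW 1      3
HW 2      4
HW 3      5
Quiz 1    6
Exam 1    7
Name: 003, dtype: int64

In [36]:
# Select a single value by row label and column label (here, the score 4)
df.loc['003','HW 2']


In [37]:
# A comparison on a column gives one True/False value per row
df['HW 3'] >= 4

001    False
002     True
003     True
004     True
Name: HW 3, dtype: bool


In [None]:
# Use the True/False values to keep only the rows where HW 3 equals 4
df[df['HW 3'] == 4]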
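
Row filters can be combined. The sketch below rebuilds a small gradebook in memory (the values mirror the example file, but the second condition is made up for illustration) and joins two conditions with `&`.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7], [4, 5, 6, 7, 8]],
    columns=['HW 1', 'HW 2', 'HW 3', 'Quiz 1', 'Exam 1'],
    index=['001', '002', '003', '004'],
)

# Each condition needs its own parentheses; & means "and", | means "or"
mask = (df['HW 3'] >= 4) & (df['Exam 1'] < 8)

selected = df[mask]                   # all columns for the matching rows
exam_scores = df.loc[mask, 'Exam 1']  # one column for the matching rows
```

Python's `and` and `or` keywords do not work element by element on a Series, which is why the `&` and `|` operators are used here.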

## 5.3 Cleaning the data
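
As one possible starting point, the sketch below shows a few common cleaning steps in pandas: removing duplicate rows, converting text to numbers, filling missing values, and dropping rows that cannot be repaired. The table and the choice to fill missing homework scores with the column mean are assumptions made for this example; the right treatment depends on the dataset and the question being asked.

```python
import numpy as np
import pandas as pd

# Made-up messy data: a duplicated row, a missing value, and a non-numeric entry
raw = pd.DataFrame({
    'student': ['001', '002', '002', '003', '004'],
    'HW 1': [1.0, 2.0, 2.0, np.nan, 4.0],
    'Exam 1': ['5', '6', '6', '7', 'absent'],
})

# 1. Remove exact duplicate rows
clean = raw.drop_duplicates().copy()

# 2. Convert text to numbers; entries that cannot be converted become NaN
clean['Exam 1'] = pd.to_numeric(clean['Exam 1'], errors='coerce')

# 3. Fill missing homework scores with the column mean (an assumption)
clean['HW 1'] = clean['HW 1'].fillna(clean['HW 1'].mean())

# 4. Drop rows that still have no exam score, then renumber the rows
clean = clean.dropna(subset=['Exam 1']).reset_index(drop=True)
```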