# 05 Obtaining and Cleaning Data
__Math 3080: Fundamentals of Data Science__

Reading:
* McKinney, Chapter 6 - Data Loading, Storage, and File Formats
* Grus, Chapter 9 - Getting Data
* Geron, Chapter 2 - End-to-End Machine Learning Project, pp. 46-51, 62-68

Outline:
1. Obtaining the data
    * Downloading the data directly
    * HTML Scraping
    * APIs, JSON, and XML
    * Filtering the Data
2. Loading the data
    * Loading in NumPy
    * Loading in Pandas
3. Cleaning the Data

-----
We have learned about the calculations needed for basic models. Linear algebra is used frequently, both to handle data and to carry out the mathematics behind the models we use. We will return to linear regression later in the course, along with logistic regression and decision trees.

* Bring up the Question/Data circle: 
  * Questions --> Datasets --> Data Types --> Input/Calculations/Output
* We have talked about some of the calculations, and there will be a lot more
* An overarching concern in the entire process is __Data Wrangling__ (Draw in middle of data circle). Data Wrangling consists of
    1. Obtaining the data
    2. Cleaning the data
    3. Manipulating the data (not changing numbers, but arranging the data in useful formats)
    4. Visualization and Analysis of the data (Leads into the calculation part of the cycle)

In this segment, we will look at how to obtain and clean the data. The next two segments will look at how to manipulate and visualize the data.

## 5.1 Obtaining the data
Where can we get data?

### Online websites (Kaggle, Data Centers)
Data is stored all over the web. Most websites that deal with data will have a way to download it. For example:
  * kaggle.com

Sometimes, the data is available to be displayed, but you have to copy it, put it into Excel or a text editor, and save it in a format that can be loaded. For example:
  *  https://www.weather.gov/wrh/timeseries?site=K41U

This works just fine, but then every time you need updated data, you have to capture the data, put it into Excel, save it in the right format, and then load it into the computer. It would be very helpful if we could get the data automatically. There are a couple of good ways to do this:
  * HTML Scraping: code that goes through an HTML file and grabs the printed data (done in Data Mining - 2nd semester)
  * APIs (Application Programming Interfaces): some data is available online and can be loaded directly into the program
    * INQUIRE TO SEE WHO HAS ALREADY WORKED WITH APIs

### APIs
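
As a rough sketch of what working with an API looks like: many APIs return JSON text, which Python can parse into dictionaries and lists, and which pandas can turn into a DataFrame. The payload below is made up for illustration (the station name, field names, and values are assumptions, not output from a real service). In practice the text would come from an HTTP request to the API's URL.

```python
import json

import pandas as pd

# Hypothetical JSON text standing in for an API response.
# The station name, field names, and values are invented for this sketch.
payload = """
{
  "station": "EXAMPLE",
  "observations": [
    {"time": "2024-01-01T00:00", "temp_f": 31.0, "wind_mph": 5.0},
    {"time": "2024-01-01T01:00", "temp_f": 30.2, "wind_mph": null},
    {"time": "2024-01-01T02:00", "temp_f": 29.5, "wind_mph": 7.0}
  ]
}
"""

# json.loads turns the JSON text into nested Python dicts and lists.
data = json.loads(payload)

# A list of dicts maps naturally onto rows of a DataFrame.
obs = pd.DataFrame(data["observations"])
print(obs)
```

Note that the JSON `null` is read as a missing value, so data from an API usually needs the same cleaning as data from a file.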



## 5.2 Loading the data

In [27]:
# Loading files for reading using NumPy
import numpy as np

matrix = np.array([[1,2,3],
                   [2,3,4],
                   [3,4,5]])

np.save('data/matrix.npy', matrix)

In [28]:
load_file = np.load('data/matrix.npy')
load_file

array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])

In [29]:
np.loadtxt('data/test.txt', delimiter=' ')
# Other common delimiters are ',' (comma) and '\t' (tab).
# See the np.loadtxt documentation for the full list of options.

array([[1., 2., 3., 4., 5.],
       [2., 3., 4., 5., 6.],
       [3., 4., 5., 6., 7.],
       [4., 5., 6., 7., 8.]])
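
To see the effect of the `delimiter` option without needing any files on disk, here is a small sketch that reads the same numbers from comma-separated and tab-separated text. The in-memory text (via `io.StringIO`) is an assumption made so the example is self-contained; with real data you would pass a file name instead.

```python
import io

import numpy as np

# In-memory stand-ins for two small text files with different delimiters.
comma_text = io.StringIO("1,2,3\n4,5,6")
tab_text = io.StringIO("1\t2\t3\n4\t5\t6")

# The delimiter must match the character that separates values in the file.
from_commas = np.loadtxt(comma_text, delimiter=',')
from_tabs = np.loadtxt(tab_text, delimiter='\t')

print(from_commas)
print(np.array_equal(from_commas, from_tabs))
```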

In [30]:
# Loading files using pandas
import pandas as pd

df = pd.read_csv('data/test.txt', delimiter=' ')
df

Unnamed: 0,1,2,3,4,5
0,2,3,4,5,6
1,3,4,5,6,7
2,4,5,6,7,8


In [31]:
df = pd.read_csv('data/test.txt', delimiter=' ', header=None)
df

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,5
1,2,3,4,5,6
2,3,4,5,6,7
3,4,5,6,7,8


In [32]:
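df = pd.read_csv('data/test.txt', delimiter=' ', header=None)
# Note: the nested lists ([[...]]) create a MultiIndex rather than a plain Index,
# which is why a later cell prints the row name as the tuple (003,).
# Use flat lists (['HW 1', ...]) to get ordinary labels.
df.columns=[['HW 1', 'HW 2', 'HW 3', 'Quiz 1', 'Exam 1']]
df.index=[['001','002','003','004']]
df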

Unnamed: 0,HW 1,HW 2,HW 3,Quiz 1,Exam 1
1,1,2,3,4,5
2,2,3,4,5,6
3,3,4,5,6,7
4,4,5,6,7,8


In [33]:
df['HW 2']

Unnamed: 0,HW 2
1,2
2,3
3,4
4,5


In [34]:
df.loc['003']

Unnamed: 0,HW 1,HW 2,HW 3,Quiz 1,Exam 1
3,3,4,5,6,7


In [35]:
df.iloc[2]

HW 1      3
HW 2      4
HW 3      5
Quiz 1    6
Exam 1    7
Name: (003,), dtype: int64

In [36]:
df.loc['003','HW 2']

Unnamed: 0,HW 2
3,4


In [37]:
df['HW 3'] >= 4

Unnamed: 0,HW 3
1,False
2,True
3,True
4,True


In [None]:
df[df['HW 3'] == 4]
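
The last two cells rely on boolean masks: a comparison on a column produces a True/False value for each row, and passing that mask back to the DataFrame keeps only the rows marked True. A minimal sketch follows, rebuilding the same grade table in code and assuming flat (single-level) labels, so the results here are ordinary Series and scalars. The name `scores` is introduced only for this illustration.

```python
import pandas as pd

# Same values as the grade table above, built directly with flat labels.
scores = pd.DataFrame(
    [[1, 2, 3, 4, 5],
     [2, 3, 4, 5, 6],
     [3, 4, 5, 6, 7],
     [4, 5, 6, 7, 8]],
    columns=['HW 1', 'HW 2', 'HW 3', 'Quiz 1', 'Exam 1'],
    index=['001', '002', '003', '004'],
)

# Step 1: the comparison gives one True/False per row.
mask = scores['HW 3'] >= 4

# Step 2: indexing with the mask keeps only the rows where it is True.
filtered = scores[mask]
print(filtered)

# Label-based lookup of a single cell: row '003', column 'HW 2'.
print(scores.loc['003', 'HW 2'])
```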

## 5.3 Cleaning the data
If we are lucky, datasets are perfect and ready to use. Most of the time, however, there are problems:
* Unlabeled data
* Missing data
* Unorganized data



#### 5.3.1 Unlabeled data
Consider the following dataset:

In [20]:
import pandas as pd

df = pd.read_csv('../Datasets/unlabeled_data.csv', header=None)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,71,72,73,74,75,76,77,78,79,80
0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,1,60,RL,65,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
2,2,20,RL,80,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
3,3,60,RL,68,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
4,4,70,RL,60,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1456,1456,60,RL,62,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1457,1457,20,RL,85,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1458,1458,70,RL,66,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1459,1459,20,RL,68,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


All data should have 2 things: Labeled Columns and Documentation. Now, consider the original dataset:

In [21]:
df = pd.read_csv('../Datasets/Housing_Data/train.csv')
df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


* Open the file ./math3080/Datasets/Housing_Data/data_description.txt

Every dataset should come with labeled columns and with another file that provides more information for each variable.
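
When a file arrives with no header row at all, one option is to supply the labels yourself while loading. This is a minimal sketch, assuming the labels are known from the documentation. The three-column excerpt reuses a few values from the housing table above, and the in-memory text stands in for a real file.

```python
import io

import pandas as pd

# A small unlabeled excerpt: Id, LotArea, SalePrice (values from the table above).
unlabeled_text = "1,8450,208500\n2,9600,181500\n3,11250,223500\n"

# header=None says the file has no header row; names= supplies the labels.
housing = pd.read_csv(
    io.StringIO(unlabeled_text),
    header=None,
    names=['Id', 'LotArea', 'SalePrice'],
)
print(housing)
```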

#### 5.3.2 Unorganized data
Another potential problem is that data could be unorganized. Sometimes data is just a list. Sometimes values are not in the right place. Sometimes they have the wrong units. So before anything else can be done, we have to organize the data.

#### 5.3.3 Missing data
How do we determine if data is missing? Missing values can appear in any of the following forms:
* An extreme number (-9999)
* NaN (Not a Number)
  * In Python, a NaN is produced using `np.nan`
* Blank entries (no information) - programs usually fill these with NaN
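
The forms listed above can be handled at load time or afterward. The sketch below uses a made-up two-column table (the column names and values are assumptions for illustration) in which one reading is recorded as -9999 and another is blank. `na_values` is a `pd.read_csv` parameter that treats the listed values as missing.

```python
import io

import numpy as np
import pandas as pd

# Made-up data: one sentinel value (-9999) and one blank entry in temp_f.
text = "temp_f,wind_mph\n31.0,5\n-9999,7\n,9\n"

# By default only the blank entry is recognized as missing.
raw = pd.read_csv(io.StringIO(text))
print(raw['temp_f'].isnull().sum())

# Option 1: declare the sentinel as missing while loading.
cleaned = pd.read_csv(io.StringIO(text), na_values=[-9999])
print(cleaned['temp_f'].isnull().sum())

# Option 2: replace the sentinel with NaN after loading.
replaced = raw.replace(-9999, np.nan)
print(replaced['temp_f'].isnull().sum())
```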

Let's look at the following dataset on passengers of the Titanic:

In [22]:
titanic = pd.read_csv('../Datasets/Titanic/titanic_train.csv')
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


There are definitely some missing values in the __Age__ and __Cabin__ columns. Are there any others?

We can test for missing values using the `.isnull()` method. The result is a DataFrame of True/False values indicating whether each value is missing (True) or not missing (False).

In [23]:
titanic.isnull()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


By taking the sum of this `.isnull()` DataFrame, we can find out exactly how many elements are missing in each column.

In [24]:
titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
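
Counts are easier to judge next to the fraction of each column that is missing. A small sketch on a made-up four-row table follows (the values are invented, not taken from the Titanic file). Since True counts as 1, the mean of the `.isnull()` DataFrame gives the missing fraction per column.

```python
import numpy as np
import pandas as pd

# Made-up table with missing entries in two of its three columns.
small = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.93, 8.05],
})

counts = small.isnull().sum()      # number of missing values per column
fractions = small.isnull().mean()  # fraction of missing values per column
has_missing = small.isnull().any() # True for columns with at least one gap

print(counts)
print(fractions)
print(has_missing)
```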

We can't just leave these missing values in place, because they will affect our calculations. Here are four ways to handle them:
* Filling missing values with the column/row average
  * Use when there are few missing values and the other values in the column/row are random (have no meaningful order)
  * Using row averages will be rare, since rows are generally composed of different variables
* Filling missing values with the previous/next value
  * Use when there are few missing values and there is an order to the values in that variable
* Filling missing values with the average of the previous/next values
  * Use when there are few missing values and there is an order to the values in that variable
* Removing rows and columns that contain NaN
  * If it doesn't make sense to fill them
  * If that column is needed but missing values will cause problems
  * If there are too many missing values
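
The four options above can be sketched on a short made-up Series (the numbers are invented for illustration). The methods used, `fillna`, `ffill`, `bfill`, and `dropna`, are standard pandas methods. Averaging the forward-filled and backward-filled versions is one simple way to get the average of the previous and next values.

```python
import numpy as np
import pandas as pd

# Made-up ordered measurements with two gaps.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 8.0])

# 1. Fill with the column average (the mean skips NaN by default).
mean_filled = s.fillna(s.mean())

# 2. Fill with the previous value (forward fill) or the next value (backward fill).
previous_filled = s.ffill()
next_filled = s.bfill()

# 3. Fill with the average of the previous and next values.
neighbor_avg = (s.ffill() + s.bfill()) / 2

# 4. Remove the missing entries altogether.
dropped = s.dropna()

print(mean_filled.tolist())
print(previous_filled.tolist())
print(next_filled.tolist())
print(neighbor_avg.tolist())
print(dropped.tolist())
```

On a DataFrame, `dropna()` removes rows that contain any NaN by default, and `dropna(axis=1)` removes columns instead.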

In [16]:
df2 = pd.read_csv('../Datasets/Titanic/titanic_test.csv')
df2

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


In [17]:
df2.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64