## Loading Data

### 2.0 Introduction
The first step in any machine learning endeavor is to get the raw data into our system. The raw data might be a logfile, dataset file, or database. Furthermore, we will often want to retrieve data from multiple sources. The recipes in this chapter look at methods of loading data from a variety of sources, including CSV files and SQL databases. We also cover methods of generating simulated data with desirable properties for experimentation. Finally, while there are many ways to load data in the Python ecosystem, we will focus on using the pandas library's extensive set of methods for loading external data, and on using scikit-learn, an open source machine learning library in Python, for generating simulated data.

### 2.1 Loading a Sample Dataset
#### Problem
You want to load a preexisting sample dataset.

#### Solution
scikit-learn comes with a number of popular datasets for you to use:

In [1]:
# load scikit-learn's datasets
from sklearn import datasets

# load digits dataset
digits = datasets.load_digits()

# create features matrix
features = digits.data

# create target vector
target = digits.target

# view first observation
features[0]

array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])

#### Discussion
Often we do not want to go through the work of loading, transforming and cleaning a real-world dataset before we can explore some machine learning algorithm or method. Luckily, scikit-learn comes with some common datasets we can quickly load. These datasets are often called "toy" datasets because they are far smaller and cleaner than a dataset we would see in the real world. Some popular sample datasets in scikit-learn are:

`load_boston`
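* Contains 506 observations on Boston housing prices. It is a good dataset for exploring regression algorithms. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so it is only available in older versions.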
    
`load_iris`
* Contains 150 observations on the measurements of Iris flowers. It is a good dataset for exploring classification algorithms.

`load_digits`
* Contains 1,797 observations from images of handwritten digits. It is a good dataset for teaching image classification.
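
As a quick check on the sizes listed above, the following sketch loads the Iris dataset and inspects the shape of its features matrix and target vector. The variable names (`iris_features`, `iris_target`) are illustrative choices.

```python
from sklearn import datasets

# load the iris dataset (bundled with scikit-learn, no download needed)
iris = datasets.load_iris()

# features matrix: one row per observation, one column per measurement
iris_features = iris.data

# target vector: one class label per observation
iris_target = iris.target

# inspect the dimensions and the names of the measured features
print(iris_features.shape)
print(iris_target.shape)
print(iris.feature_names)
```

The same `data` and `target` attributes are available on the object returned by `load_digits`.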

#### See Also
* scikit-learn toy datasets (http://scikit-learn.org/stable/datasets/index.html#toy-datasets)
* The Digit Dataset (http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html)

### 2.2 Creating a Simulated Dataset
#### Problem
You need to generate a dataset of simulated data.

#### Solution
scikit-learn offers many methods for creating simulated data. Of those, three are particularly useful: `make_regression`, `make_classification`, and `make_blobs`.
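
A minimal sketch of the three generators listed under See Also follows. `make_regression` suits regression experiments, `make_classification` suits classification, and `make_blobs` suits clustering. All parameter values and variable names here are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_regression, make_classification, make_blobs

# regression: 100 observations, 3 features, all of them informative
# coef=True also returns the true coefficients used to build the target
reg_features, reg_target, coefficients = make_regression(
    n_samples=100,
    n_features=3,
    n_informative=3,
    n_targets=1,
    noise=0.0,
    coef=True,
    random_state=1,
)

# classification: 2 imbalanced classes (roughly 25% / 75%)
clf_features, clf_target = make_classification(
    n_samples=100,
    n_features=3,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    weights=[0.25, 0.75],
    random_state=1,
)

# clustering: 3 well-separated blobs in 2 dimensions
blob_features, blob_target = make_blobs(
    n_samples=100,
    n_features=2,
    centers=3,
    cluster_std=0.5,
    shuffle=True,
    random_state=1,
)

# view the shape of each features matrix
print(reg_features.shape, clf_features.shape, blob_features.shape)
```

Fixing `random_state` makes the generated data reproducible across runs.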


#### See Also
* `make_regression` documentation (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html)
* `make_classification` documentation (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html)
* `make_blobs` documentation (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html)

### 2.3 Loading a CSV File
#### Problem
You need to import a comma-separated values (CSV) file.

#### Solution
Use the `pandas` library's `read_csv` to load a local or hosted CSV file:

In [2]:
# load library
import pandas as pd

# create url
url = "http://samplecsvs.s3.amazonaws.com/SacramentocrimeJanuary2006.csv"

# load data
df = pd.read_csv(url)

# view the first two rows
df.head(2)

Unnamed: 0,cdatetime,address,district,beat,grid,crimedescr,ucr_ncic_code,latitude,longitude
0,1/1/06 0:00,3108 OCCIDENTAL DR,3,3C,1115,10851(A)VC TAKE VEH W/O OWNER,2404,38.55042,-121.391416
1,1/1/06 0:00,2082 EXPEDITION WAY,5,5A,1512,459 PC BURGLARY RESIDENCE,2204,38.473501,-121.490186
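
`read_csv` also accepts file-like objects, which makes it easy to experiment without a network connection. The sketch below uses made-up, semicolon-separated data with no header row to show two commonly used parameters, `sep` and `header`. The data and column names are hypothetical.

```python
import io

import pandas as pd

# hypothetical semicolon-separated data with no header row
raw_csv = io.StringIO(
    "5;2015-01-01 00:00:00;0\n"
    "5;2015-01-01 00:00:01;0\n"
    "9;2015-01-01 00:00:02;1\n"
)

# sep sets the delimiter; header=None tells pandas the first line is data
df_local = pd.read_csv(
    raw_csv,
    sep=";",
    header=None,
    names=["integer", "datetime", "category"],
)

# view the first two rows
print(df_local.head(2))
```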


### 2.4 Loading an Excel File
#### Problem
You need to import an Excel spreadsheet.

#### Solution
Use the `pandas` library's `read_excel` to load an Excel spreadsheet:

In [3]:
# load library
import pandas as pd

# create url
url = "https://www.sample-videos.com/xls/Sample-Spreadsheet-10-rows.xls"

# load data
df = pd.read_excel(url, sheet_name=0, header=None)

# view the first two rows
df.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1,"Eldon Base for stackable storage shelf, platinum",Muhammed MacIntyre,3,-213.25,38.94,35.0,Nunavut,Storage & Organization,0.8
1,2,"1.7 Cubic Foot Compact ""Cube"" Office Refrigera...",Barry French,293,457.81,208.16,68.02,Nunavut,Appliances,0.58


### 2.5 Loading a JSON File
#### Problem
You need to load a JSON file for data preprocessing.

#### Solution
The pandas library provides `read_json` to convert a JSON file into a pandas object:

In [4]:
# load library
import pandas as pd

# create url
url = "https://api.github.com/users/TheAlgorithms/repos"

# load data
df = pd.read_json(url, orient="columns")

# view first two rows
df.head(2)

Unnamed: 0,archive_url,archived,assignees_url,blobs_url,branches_url,clone_url,collaborators_url,comments_url,commits_url,compare_url,...,subscribers_url,subscription_url,svn_url,tags_url,teams_url,trees_url,updated_at,url,watchers,watchers_count
0,https://api.github.com/repos/TheAlgorithms/Alg...,False,https://api.github.com/repos/TheAlgorithms/Alg...,https://api.github.com/repos/TheAlgorithms/Alg...,https://api.github.com/repos/TheAlgorithms/Alg...,https://github.com/TheAlgorithms/Algorithms-Ex...,https://api.github.com/repos/TheAlgorithms/Alg...,https://api.github.com/repos/TheAlgorithms/Alg...,https://api.github.com/repos/TheAlgorithms/Alg...,https://api.github.com/repos/TheAlgorithms/Alg...,...,https://api.github.com/repos/TheAlgorithms/Alg...,https://api.github.com/repos/TheAlgorithms/Alg...,https://github.com/TheAlgorithms/Algorithms-Ex...,https://api.github.com/repos/TheAlgorithms/Alg...,https://api.github.com/repos/TheAlgorithms/Alg...,https://api.github.com/repos/TheAlgorithms/Alg...,2018-12-24 13:45:51,https://api.github.com/repos/TheAlgorithms/Alg...,209,209
1,https://api.github.com/repos/TheAlgorithms/C/{...,False,https://api.github.com/repos/TheAlgorithms/C/a...,https://api.github.com/repos/TheAlgorithms/C/g...,https://api.github.com/repos/TheAlgorithms/C/b...,https://github.com/TheAlgorithms/C.git,https://api.github.com/repos/TheAlgorithms/C/c...,https://api.github.com/repos/TheAlgorithms/C/c...,https://api.github.com/repos/TheAlgorithms/C/c...,https://api.github.com/repos/TheAlgorithms/C/c...,...,https://api.github.com/repos/TheAlgorithms/C/s...,https://api.github.com/repos/TheAlgorithms/C/s...,https://github.com/TheAlgorithms/C,https://api.github.com/repos/TheAlgorithms/C/tags,https://api.github.com/repos/TheAlgorithms/C/t...,https://api.github.com/repos/TheAlgorithms/C/g...,2018-12-25 14:04:50,https://api.github.com/repos/TheAlgorithms/C,1009,1009
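
The `orient` parameter tells pandas how the JSON is laid out. For instance, `"records"` expects a list of objects with one object per row, while `"columns"` expects an object keyed by column name. A minimal offline sketch, assuming a small hypothetical payload in the records layout:

```python
import io

import pandas as pd

# hypothetical JSON payload: a list of objects, one per row
raw_json = io.StringIO(
    '[{"name": "alpha", "stars": 10}, {"name": "beta", "stars": 25}]'
)

# orient="records" matches the list-of-objects layout above
df_records = pd.read_json(raw_json, orient="records")

# view the resulting rows
print(df_records)
```

When the layout of a file is unclear, it may take some experimenting to find which `orient` value matches it.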


### 2.6 Querying a SQL Database
#### Problem
You need to load data from a database using structured query language (SQL).

#### Solution
`pandas`' `read_sql_query` allows us to run a SQL query against a database and load the result into a DataFrame:

In [5]:
# load libraries
import pandas as pd
from sqlalchemy import create_engine, inspect

# create a connection to the database
db_connection = create_engine('sqlite:///sample.db')

# see table names (Engine.table_names() was removed in SQLAlchemy 2.0)
print(inspect(db_connection).get_table_names())

# load data
df = pd.read_sql_query('SELECT * FROM albums', db_connection)

# view the first five rows
df.head()

['albums', 'artists', 'customers', 'employees', 'genres', 'invoice_items', 'invoices', 'media_types', 'playlist_track', 'playlists', 'sqlite_sequence', 'sqlite_stat1', 'tracks']


Unnamed: 0,AlbumId,Title,ArtistId
0,1,For Those About To Rock We Salute You,1
1,2,Balls to the Wall,2
2,3,Restless and Wild,2
3,4,Let There Be Rock,1
4,5,Big Ones,3
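
To try `read_sql_query` without a database file on disk, one option is an in-memory SQLite database from Python's standard library. The sketch below is an assumption-laden stand-in for `sample.db`: it creates a small hypothetical `albums` table, then runs a parameterized query with `params` so that values are not pasted into the SQL string.

```python
import sqlite3

import pandas as pd

# in-memory SQLite database (hypothetical stand-in for sample.db)
connection = sqlite3.connect(":memory:")

# create and populate a small albums table
connection.execute(
    "CREATE TABLE albums ("
    "AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER)"
)
connection.executemany(
    "INSERT INTO albums (Title, ArtistId) VALUES (?, ?)",
    [
        ("For Those About To Rock We Salute You", 1),
        ("Balls to the Wall", 2),
        ("Restless and Wild", 2),
    ],
)
connection.commit()

# parameterized query: "?" is the placeholder style used by sqlite3
df_artist = pd.read_sql_query(
    "SELECT AlbumId, Title FROM albums WHERE ArtistId = ? ORDER BY AlbumId",
    connection,
    params=(2,),
)

connection.close()

# view the matching rows
print(df_artist)
```

Selecting only the needed columns and rows in SQL keeps the resulting DataFrame small, which matters more as tables grow.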
