# Chapter 4
# How To Load Machine Learning Data

In this lesson you will learn three ways to load your CSV data in Python:
1. Load CSV Files with the Python Standard Library.
2. Load CSV Files with NumPy.
3. Load CSV Files with Pandas.

Let's get started.

## 4.1 Considerations When Loading CSV Data
There are a number of considerations when loading your machine learning data from CSV files. For reference, the expectations for CSV files are described in RFC 4180, Common Format and MIME Type for Comma-Separated Values (CSV) Files.

## 4.1.1 File Header
Does your data have a file header? If so, this can help in automatically assigning names to each column of data. If not, you may need to name your attributes manually. Either way, you should explicitly specify whether or not your CSV file has a file header when loading your data.
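
As a minimal sketch of stating the header explicitly, the example below uses pandas.read_csv() on two small in-memory samples. The sample values and the shortened column list are made up for illustration and are not taken from the dataset file.

```python
import io

import pandas as pd

# Two tiny in-memory CSV samples (made-up values, for illustration only).
with_header = "preg,plas,class\n6,148,1\n1,85,0\n"
without_header = "6,148,1\n1,85,0\n"

# header=0: the first line of the file supplies the column names.
df_a = pd.read_csv(io.StringIO(with_header), header=0)

# header=None: no line is treated as a header, so names are given manually.
df_b = pd.read_csv(io.StringIO(without_header), header=None,
                   names=["preg", "plas", "class"])

print(df_a.shape, list(df_a.columns))  # (2, 3) ['preg', 'plas', 'class']
print(df_b.shape, list(df_b.columns))  # (2, 3) ['preg', 'plas', 'class']
```

Both calls produce the same two rows of data; the only difference is where the column names come from.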

## 4.1.2 Comments
Does your data have comments? Comments in a CSV file are conventionally indicated by a hash (#) at the start of a line. If you have comments in your file, then depending on the method used to load your data, you may need to indicate whether or not to expect comments and which character signifies a comment line.
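
The following sketch shows how two loaders treat a comment line, assuming a small made-up sample held in memory. numpy.loadtxt() skips lines starting with # by default, whereas pandas.read_csv() only does so when the comment parameter is set.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample: one comment line followed by two data rows.
text = "# illustrative sample, not the real dataset\n6,148,1\n1,85,0\n"

# numpy.loadtxt treats '#' as the comment character by default.
arr = np.loadtxt(io.StringIO(text), delimiter=",")

# pandas.read_csv needs the comment character stated explicitly.
df = pd.read_csv(io.StringIO(text), header=None, comment="#")

print(arr.shape)  # (2, 3)
print(df.shape)   # (2, 3)
```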

## 4.1.3 Delimiter
The standard delimiter that separates values in fields is the comma (,) character. Your file could use a different delimiter, such as a tab or whitespace, in which case you must specify it explicitly.
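
For instance, a tab-separated file can be read with the standard library by passing the delimiter to csv.reader(). This is a small sketch on made-up in-memory data rather than on the dataset file.

```python
import csv
import io

# Made-up tab-separated sample held in memory.
tab_text = "6\t148\t1\n1\t85\t0\n"

# Tell the reader that fields are separated by tabs, not commas.
rows = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))

print(rows)  # [['6', '148', '1'], ['1', '85', '0']]
```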

## 4.1.4 Quotes
Sometimes field values can have spaces. In these CSV files the values are often quoted. The default quote character is the double quotation mark ("). Other characters can be used, in which case you must specify the quote character used in your file.
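
As a sketch of a non-default quote character, the example below reads made-up values quoted with single quotes by passing quotechar to csv.reader(). Quoting also lets a field contain the delimiter itself.

```python
import csv
import io

# Made-up sample whose text fields are wrapped in single quotes.
text = "'Pima Indians',768\n'made up, with comma',9\n"

# quotechar tells the reader which character encloses quoted fields.
rows = list(csv.reader(io.StringIO(text), quotechar="'"))

print(rows)  # [['Pima Indians', '768'], ['made up, with comma', '9']]
```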

## 4.2 Pima Indians Dataset

The <b>Pima Indians Dataset</b> is used to demonstrate data loading in this lesson. It will also be used in many of the lessons to come. This dataset describes the medical records for Pima Indians and whether or not each patient will have an onset of diabetes within five years. As such, it is a classification problem. It is a good dataset for demonstration because all of the input attributes are numeric and the output variable to be predicted is binary (0 or 1).

## 4.3 Load CSV Files with the Python Standard Library

The Python standard library provides the csv module and its reader() function, which can be used to load CSV files. Once loaded, you can convert the CSV data to a NumPy array and use it for machine learning. For example, you can download the Pima Indians dataset into a data subdirectory of your working directory with the filename pima-indians-diabetes.data.csv. All fields in this dataset are numeric and there is no header line.

In [34]:
# Load CSV Using Python Standard Library
import csv
import numpy

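# Forward slashes avoid invalid escape sequences such as '\d' and also work on Windows
filename = './data/pima-indians-diabetes.data.csv'

# newline='' lets the csv module handle line endings itself
with open(filename, newline='') as csvfile:
    rows = list(csv.reader(csvfile))

# Each field is read as a string, so convert the whole table to floats
data = numpy.array(rows).astype('float')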

print(data)
print(data.shape)

[[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]
(768, 9)
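
When every field is numeric and unquoted, as in this dataset, the csv module can also do the float conversion itself. The sketch below assumes a small made-up sample and uses quoting=csv.QUOTE_NONNUMERIC, which instructs the reader to convert all non-quoted fields to float.

```python
import csv
import io

# Made-up numeric sample held in memory (not the real dataset).
sample = "6,148,0.627\n1,85,0.351\n"

# QUOTE_NONNUMERIC makes the reader return unquoted fields as floats.
reader = csv.reader(io.StringIO(sample), quoting=csv.QUOTE_NONNUMERIC)
rows = list(reader)

print(rows)  # [[6.0, 148.0, 0.627], [1.0, 85.0, 0.351]]
```

This avoids the separate astype('float') step, at the cost of raising an error if any unquoted field is not a valid number.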


In [41]:
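# Alternative: load the same file with NumPy's genfromtxt
import numpy as np

# Use a name other than csv so the standard library module is not shadowed
raw = np.genfromtxt('./data/pima-indians-diabetes.data.csv', delimiter=",")

# genfromtxt already returns a float array, so this conversion leaves the values unchanged
data = np.array(raw).astype('float')

print(raw)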

print('----')
print(data)

print('----')
print(data.shape)

[[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]
----
[[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]
----
(768, 9)


## 4.4 Load CSV Files with NumPy

You can load your CSV data using NumPy and the numpy.loadtxt() function. By default, this function assumes there is no header row and that all data has the same format. The example below assumes that the file pima-indians-diabetes.data.csv is in the data subdirectory of your working directory.

In [42]:
# Load CSV using NumPy
from numpy import loadtxt

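# Forward slashes avoid invalid escape sequences such as '\d' and also work on Windows
filename = './data/pima-indians-diabetes.data.csv'

# Open the file in a with-block so it is closed once loading finishes
with open(filename, 'rb') as raw_data:
    data = loadtxt(raw_data, delimiter=",")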

print(data.shape)

(768, 9)
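
If a file does have a header row, loadtxt() can be told to skip it. The sketch below assumes a made-up in-memory sample with one header line and uses the skiprows parameter, then splits the array into input columns and an output column.

```python
import io

import numpy as np

# Made-up sample with a header line (not the real dataset).
text = "preg,plas,class\n6,148,1\n1,85,0\n"

# skiprows=1 skips the header so only numeric rows are parsed.
data = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1)

# Treat the last column as the output variable and the rest as inputs.
X = data[:, :-1]
y = data[:, -1]

print(data.shape)  # (2, 3)
print(X.shape)     # (2, 2)
print(y.shape)     # (2,)
```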


<b>Important:</b> In Python 3, the Python 2 urllib2 module has been split into <a href='https://docs.python.org/3/library/urllib.request.html'>urllib.request</a> and <a href='https://docs.python.org/3/library/urllib.error.html'>urllib.error</a>, so urlopen() is imported from urllib.request.

With that in mind, this example can be modified to load the same dataset directly from a URL as follows:

In [49]:
# Load CSV from URL using NumPy
from numpy import loadtxt
from urllib.request import urlopen

url = 'https://gist.githubusercontent.com/ktisha/c21e73a1bd1700294ef790c56c8aec1f/raw/819b69b5736821ccee93d05b51de0510bea00294/pima-indians-diabetes.csv'
raw_data = urlopen(url)
dataset = loadtxt(raw_data, delimiter=",")

print(dataset.shape)

(768, 9)


## 4.5 Load CSV Files with Pandas

You can load your CSV data using Pandas and the pandas.read_csv() function. This function is very flexible and is perhaps the approach I would recommend for loading your machine learning data. The function returns a pandas.DataFrame that you can immediately start summarizing and plotting.

In [47]:
# Load CSV using Pandas
from pandas import read_csv

# Forward slashes avoid invalid escape sequences such as '\d' and also work on Windows
filename = './data/pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

data = read_csv(filename, names=names)

print(data.shape)

(768, 9)
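
To illustrate what "immediately start summarizing" can look like, here is a minimal sketch on a made-up three-row sample with a shortened, hypothetical column list. It peeks at the rows, counts the class values, and converts the DataFrame to a NumPy array for later modeling.

```python
import io

import pandas as pd

# Made-up sample and shortened column list, for illustration only.
sample = "6,148,1\n1,85,0\n8,183,1\n"
names = ["preg", "plas", "class"]

df = pd.read_csv(io.StringIO(sample), names=names)

# Quick looks at the data: first rows and class distribution.
print(df.head())
class_counts = df["class"].value_counts()
print(class_counts)

# Convert to a NumPy array and split into inputs (X) and output (y).
array = df.to_numpy()
X, y = array[:, :-1], array[:, -1]
print(X.shape, y.shape)  # (3, 2) (3,)
```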


We can also modify this example to load CSV data directly from a URL.

In [48]:
# Load CSV using Pandas from URL
from pandas import read_csv

url = 'https://gist.githubusercontent.com/ktisha/c21e73a1bd1700294ef790c56c8aec1f/raw/819b69b5736821ccee93d05b51de0510bea00294/pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(url, names=names)

print(data.shape)

(777, 9)

The row count here (777) differs from the 768 rows that numpy.loadtxt() reported for the same URL. One likely explanation, which you should confirm by inspecting the file, is that it contains comment lines beginning with #: loadtxt() skips such lines by default, while read_csv() treats them as data unless you pass comment='#'.
