## Chapter 1. How To Load Machine Learning Data
Any Machine Learning project begins with loading data into your workspace. The most common format for Machine Learning data is the CSV file. In this chapter we will look at several techniques for loading CSV files into our workspace.
#### Content
1. Load CSV Files with the Python Standard Library.
2. Load CSV Files with NumPy.
3. Load CSV Files with Pandas.

### Pima Indians Dataset
This dataset describes the medical records of Pima Indians and whether or not each patient will have an onset of diabetes within five years. It is a good dataset to start with because all the input variables are numeric and the task is a binary classification problem.

### 1. Load CSV Files with the Python Standard Library

In [2]:
# Load CSV file using Python Standard Library.

# Import the standard library csv module, plus NumPy for the final conversion
import csv
import numpy as np

# Open the CSV file in read-only mode; the with block closes it when done
with open("diabetes.csv", "r", newline="") as raw_data:
    # Load the CSV file
    reader = csv.reader(raw_data, delimiter=",")

    # Skip the first line, which holds the variable names
    next(reader, None)
    x = list(reader)

# Convert datatype to float
data = np.array(x).astype('float')

# display data
print(data)

# display data dimension (data is a NumPy array, not a DataFrame)
print("Shape of array: ", data.shape)

[[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]
Shape of array:  (768, 9)


The resulting data is a NumPy array.
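
If you would rather keep the column names than discard them, the header row can be captured before the remaining rows are read. The following is a minimal sketch of that idea, not a replacement for the cell above. It assumes a small in-memory sample (the `sample` variable, with only three of the dataset's columns) standing in for `diabetes.csv` so that it runs without the file.

```python
import csv
import io

import numpy as np

# Illustrative in-memory sample standing in for diabetes.csv
# (assumption: only three columns are included to keep it short).
sample = io.StringIO(
    "Pregnancies,Glucose,Outcome\n"
    "6,148,1\n"
    "1,85,0\n"
    "8,183,1\n"
)

reader = csv.reader(sample, delimiter=",")

# Keep the header row instead of throwing it away
header = next(reader)

# The remaining rows are lists of strings; convert them to floats
data = np.array(list(reader)).astype("float")

print(header)
print(data.shape)
```

The same pattern should work with a real file by swapping `sample` for a file object opened with `open("diabetes.csv", "r", newline="")`.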

### 2. Load CSV Files with NumPy

In [3]:
# Load CSV using NumPy
from numpy import loadtxt

# Load the data directly from the file path and skip the header row
data = loadtxt("diabetes.csv", delimiter=",", skiprows=1)

print(data)
print(data.shape)

[[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]
(768, 9)


The **limitation** of both approaches above is that the header row has to be skipped by hand and every value must be convertible to the same numeric type. For this reason these methods of loading data are used less often. The most commonly used method is discussed below.
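
To make the limitation concrete, here is a small sketch of what happens when a file contains a text column. The `sample` data and its `Label` column are hypothetical and are not part of the Pima Indians dataset; they exist only to trigger the failure.

```python
import io

import numpy as np

# Hypothetical sample with one text column ("Label") to show the limitation
sample = io.StringIO(
    "Glucose,Label\n"
    "148,positive\n"
    "85,negative\n"
)

try:
    np.loadtxt(sample, delimiter=",", skiprows=1)
    failed = False
except ValueError as err:
    # loadtxt tries to convert every field to float and fails on text
    failed = True
    print("loadtxt failed:", err)
```

With its default settings `loadtxt` expects every field to be a number, so a single text column is enough to make the load fail.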

### 3. Load CSV Files with Pandas
You can load your CSV data using Pandas and the `pandas.read_csv()` function. This function is very flexible and is the approach I recommend for loading your machine learning data. By default it reads the header row as column names and infers a type for each column. The function returns a pandas DataFrame.

In [4]:
# Load CSV using Pandas 

import pandas as pd

# Load the data
data = pd.read_csv("diabetes.csv")

# display the top 5 rows
data.head()

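| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |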
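
The flexibility of `read_csv()` is easiest to see on data that mixes numbers and text. The sketch below assumes a small in-memory sample (the `sample` variable and its `Label` column are hypothetical, not taken from `diabetes.csv`) and shows that the header becomes the column names while each column gets its own type.

```python
import io

import pandas as pd

# Hypothetical sample mixing integer, float and text columns
sample = io.StringIO(
    "Glucose,BMI,Label\n"
    "148,33.6,positive\n"
    "85,26.6,negative\n"
)

# The first row is used as column names by default
df = pd.read_csv(sample)

print(df.columns.tolist())  # column names taken from the header row
print(df.dtypes)            # one inferred type per column
print(df.shape)
```

Running `data.dtypes` on the DataFrame loaded from `diabetes.csv` gives the same kind of per-column summary for the real dataset.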


### Next
Now that you know how to load your CSV data using Python, it is time to start looking at it. In the next lesson you will discover how to use simple descriptive statistics to better understand your data.