# How to Load a Dataset in Machine Learning Projects

You must be able to load your data before you can start a machine learning project.
The most common format for machine learning data is the CSV file.
#### There are several ways to load a CSV file in Python:

#### 1. Load CSV files with the Python standard library.
#### 2. Load CSV files with NumPy.
#### 3. Load CSV files with Pandas.

## 1. Considerations When Loading CSV Data:

#### 1.1 File Header:
Does your data have a file header? If so, it can be used to assign a name to each
column of data automatically. If not, you may need to name your attributes manually.
###### Either way, you should explicitly specify whether or not your CSV file has a file header when loading your data.
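
As a minimal sketch of stating the header explicitly, the snippet below uses the `header` and `names` parameters of `pandas.read_csv()`. The two in-memory CSV strings and the column names are hypothetical stand-ins for real files.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV text standing in for files on disk.
with_header = "preg,plas,class\n6,148,1\n1,85,0\n"
without_header = "6,148,1\n1,85,0\n"

# header=0: the first line of the file holds the column names.
df_with = pd.read_csv(io.StringIO(with_header), header=0)

# header=None: there is no header line, so the names are supplied manually.
df_without = pd.read_csv(
    io.StringIO(without_header),
    header=None,
    names=["preg", "plas", "class"],
)

print(df_with.columns.tolist())
print(df_without.shape)
```

Both calls should produce the same two-row table; the difference is only in where the column names come from.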

#### 1.2 Comments:
Does your data have comments? The CSV format has no standard comment syntax, but by convention
a comment is often indicated by a hash (#) at the start of a line.
##### If you have comments in your file, depending on the method used to load your data, you may need to indicate whether or not to expect comments and which character marks a comment line.
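
One way to handle comment lines, sketched here with the `comments` parameter of `numpy.loadtxt()`. The in-memory text is a hypothetical example, not part of the Pima dataset file.

```python
import io

import numpy

# Hypothetical CSV text with comment lines marked by a hash.
text = "# first comment line\n6,148,1\n# another comment\n1,85,0\n"

# comments="#" tells loadtxt to ignore everything after a hash on a line.
data = numpy.loadtxt(io.StringIO(text), delimiter=",", comments="#")

print(data.shape)
```

The hash is already the default value of `comments`; passing it explicitly simply documents the expectation.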

#### 1.3 Delimiter:
The standard delimiter that separates the values in a row is the comma (,) character.
##### Your file could use a different delimiter, such as a tab or white space, in which case you must specify it explicitly.
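
A short sketch of specifying a non-default delimiter with the standard library's `csv.reader()`. The tab-separated text is a hypothetical example.

```python
import csv
import io

# Hypothetical tab-separated text standing in for a file on disk.
text = "6\t148\t1\n1\t85\t0\n"

# delimiter="\t" splits each line on tabs instead of commas.
rows = list(csv.reader(io.StringIO(text), delimiter="\t"))

print(rows)
```

Note that `csv.reader()` returns every value as a string, so a numeric conversion is still needed afterwards.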

#### 1.4 Quotes:
Sometimes field values contain spaces or even the delimiter itself. In these CSV files the values
are often quoted. The default quote character is the double quotation mark (").
##### Other characters can be used, in which case you must specify the quote character used in your file.
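
To illustrate a non-default quote character, here is a minimal sketch using the `quotechar` parameter of `csv.reader()`. The single-quoted text and its values are hypothetical.

```python
import csv
import io

# Hypothetical CSV text that quotes values with single quotes.
text = "name,city\n'Ada Lovelace','London, UK'\n"

# quotechar="'" keeps the comma inside 'London, UK' as part of the value.
rows = list(csv.reader(io.StringIO(text), quotechar="'"))

print(rows[1])
```

Without the `quotechar` argument, the comma inside the quoted city would be treated as a delimiter and the row would split into three fields.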

## 2. Load CSV Files with the Python Standard Library: 

The Python standard library provides the **csv** module and its **reader()** function, which can be used to load
CSV files. Once loaded, you can convert the CSV data to a NumPy array and use it for machine
learning.  <br> For example, you can download the Pima Indians dataset into your local directory
with the filename **pima-indians-diabetes.data.csv**. This dataset describes the medical records of Pima Indians
and whether or not each patient will have an onset of diabetes within five years. As such, it
is a classification problem. It is a good dataset for demonstration because all of the input
attributes are numeric, the output variable to be predicted is binary (0 or 1), and there is no header line. <br> [Dataset Link: pima-indians-diabetes.data.csv](https://github.com/jbrownlee/Datasets/blob/master/pima-indians-diabetes.csv)


#### CODE:

```python
# Load CSV using the Python standard library

import csv
import numpy

filename = '/home/ubuntu/Desktop/ML/Machine Learning With Python/pima-indians-diabetes.data.csv'

# Open the file in a with-block so it is closed automatically;
# newline='' is what the csv module documentation recommends.
with open(filename, 'r', newline='') as raw_data:
    reader = csv.reader(raw_data, delimiter=',')
    x = list(reader)  # each row is a list of strings

# Convert the rows of strings to a float NumPy array
data = numpy.array(x).astype('float')
print(data)

print("Type: ", type(data))
```

```
[[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]
Type:  <class 'numpy.ndarray'>
```


## 3. Load CSV Files with the NumPy Library: 

You can load your CSV data using NumPy and the **numpy.loadtxt()** function. By default, this function
assumes that there is no header row and that all data has the same format.
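
If a file does have a header row, one option is to skip it with the `skiprows` parameter of `numpy.loadtxt()`. The sketch below uses hypothetical in-memory text with a made-up header line rather than the Pima file, which has no header.

```python
import io

import numpy

# Hypothetical CSV text whose first line is a header.
text = "preg,plas,class\n6,148,1\n1,85,0\n"

# skiprows=1 skips the header so only numeric rows are parsed.
data = numpy.loadtxt(io.StringIO(text), delimiter=",", skiprows=1)

print(data.shape)
```

The column names are discarded in this approach, so keep them in a separate list if you need them later.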

#### CODE:

```python
# Load CSV using NumPy

import numpy

filename = '/home/ubuntu/Desktop/ML/Machine Learning With Python/pima-indians-diabetes.data.csv'

# loadtxt accepts a filename directly and returns a numpy.ndarray
data = numpy.loadtxt(filename, delimiter=',')

print(data)
print("Type: ", type(data))
```

```
[[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]
Type:  <class 'numpy.ndarray'>
```


### To load the same dataset directly from a URL:

#### CODE:

```python
# Load CSV from URL using NumPy

import numpy
from urllib.request import urlopen  # Python 3

url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'

# urlopen returns a file-like object that loadtxt can read from
raw_data = urlopen(url)
dataset = numpy.loadtxt(raw_data, delimiter=",")

print(dataset.shape)

print(dataset)
print("Type: ", type(dataset))
```

```
(768, 9)
[[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]
Type:  <class 'numpy.ndarray'>
```


## 4. Load CSV Files with the Pandas Library: 

You can load your CSV data using Pandas and the **pandas.read_csv()** function. The function returns a pandas.DataFrame that you can immediately start summarizing and plotting.

#### CODE:

```python
# Load CSV using Pandas

import pandas as pd

filename = '/home/ubuntu/Desktop/ML/Machine Learning With Python/pima-indians-diabetes.data.csv'

# Column names, supplied manually because the file has no header line
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(filename, names=names)

print(data.shape)

print(data.head())

print("Type: ", type(data))
```

```
(768, 9)
   preg  plas  pres  skin  test  mass   pedi  age  class
0     6   148    72    35     0  33.6  0.627   50      1
1     1    85    66    29     0  26.6  0.351   31      0
2     8   183    64     0     0  23.3  0.672   32      1
3     1    89    66    23    94  28.1  0.167   21      0
4     0   137    40    35   168  43.1  2.288   33      1
Type:  <class 'pandas.core.frame.DataFrame'>
```
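
As a rough illustration of summarizing a DataFrame right after loading, the sketch below calls `describe()` and `groupby()` on a small in-memory sample. The five rows reuse three columns from the `head()` output above; the reduced column set is an assumption made to keep the example short.

```python
import io

import pandas as pd

# Three columns (preg, plas, class) of the first five rows shown above.
text = "6,148,1\n1,85,0\n8,183,1\n1,89,0\n0,137,1\n"

df = pd.read_csv(io.StringIO(text), names=["preg", "plas", "class"])

# Descriptive statistics (count, mean, std, min, quartiles, max) per column.
summary = df.describe()

# Number of rows in each class.
class_counts = df.groupby("class").size()

print(summary)
print(class_counts)
```

On the full dataset the same two calls give a quick view of value ranges and class balance before any modeling.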


### To load the same dataset directly from a URL:

#### CODE:

```python
# Load CSV using Pandas from URL

import pandas as pd

url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'

# Column names, supplied manually because the file has no header line
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)  # read_csv accepts a URL directly

print(data.shape)

print(data.head())

print("Type: ", type(data))
```

```
(768, 9)
   preg  plas  pres  skin  test  mass   pedi  age  class
0     6   148    72    35     0  33.6  0.627   50      1
1     1    85    66    29     0  26.6  0.351   31      0
2     8   183    64     0     0  23.3  0.672   32      1
3     1    89    66    23    94  28.1  0.167   21      0
4     0   137    40    35   168  43.1  2.288   33      1
Type:  <class 'pandas.core.frame.DataFrame'>
```
