# Methods to Load CSV Data File

For tabular data, the most common format in ML projects is CSV (Comma-Separated Values). CSV is a simple plain-text file format for storing tabular data (numbers and text), such as a spreadsheet.

## Considerations While Loading CSV Data

### File Header

In a CSV data file, the header row contains the name of each field.
Two cases related to the CSV file header must be considered:
- Case-I: The data file has a header. The column names can be assigned automatically from the header row.
- Case-II: The data file has no header. We must assign a name to each column manually.

In both cases, we must specify explicitly whether the CSV file contains a header or not.
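The two header cases can be sketched with the standard library's `csv.DictReader`. This is only a minimal illustration, not the method used later in this notebook; the in-memory sample text and the variable names are assumptions made up for the example.

```python
import csv
import io

# Hypothetical sample data held in memory instead of a file on disk.
with_header = "sepal_length,sepal_width\n5.1,3.5\n4.9,3.0\n"
without_header = "5.1,3.5\n4.9,3.0\n"

# Case-I: the first row is a header, so DictReader takes the field names from it.
rows_case1 = list(csv.DictReader(io.StringIO(with_header)))

# Case-II: there is no header row, so the field names are supplied explicitly.
names = ["sepal_length", "sepal_width"]
rows_case2 = list(csv.DictReader(io.StringIO(without_header), fieldnames=names))

print(rows_case1[0])
print(rows_case2[0])
```

Both calls should produce the same records; the only difference is where the column names come from.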

### Comments
In a CSV data file, comments are conventionally indicated by a hash (#) at the start of a line. Whether such lines are skipped depends on the tool used to read the file.
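Python's `csv` module does not skip comment lines by itself, whereas `numpy.loadtxt()` has a `comments` parameter and `pandas.read_csv()` has a `comment` parameter. As a rough sketch, comment lines can be filtered out before parsing with the standard library; the sample text below is made up for illustration.

```python
import csv
import io

# Hypothetical sample: two comment lines mixed with two data rows.
text = "# measurements in cm\n5.1,3.5\n# second sample\n4.9,3.0\n"

# Drop lines whose first non-blank character is '#', then parse the rest.
lines = (line for line in io.StringIO(text) if not line.lstrip().startswith("#"))
rows = list(csv.reader(lines))

print(rows)
```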

### Delimiter
In CSV data files, the comma (,) is the standard delimiter. We can also use a different delimiter, such as a tab or white space, but in that case we must specify it explicitly.
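As a small sketch of specifying a non-standard delimiter, the example below reads tab-separated text with `csv.reader`. The sample string is an assumption for illustration only.

```python
import csv
import io

# Hypothetical tab-separated sample held in memory.
tab_text = "5.1\t3.5\t1.4\n4.9\t3.0\t1.4\n"

# The delimiter must be passed explicitly because it is not a comma.
rows = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))

print(rows)
```

Without `delimiter="\t"`, each line would be returned as a single field, because the reader would look for commas.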

### Quotes
In CSV data files, the double quotation mark (") is the default quote character.
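Quoting matters when a field itself contains the delimiter. The sketch below, using made-up sample text, parses one file with the default double quotes and one with single quotes declared through the `quotechar` parameter of `csv.reader`.

```python
import csv
import io

# Hypothetical sample: the second field contains a comma, so it is quoted.
double_quoted = 'name,note\n"Iris setosa","small, short petals"\n'
single_quoted = "name,note\n'Iris setosa','small, short petals'\n"

# Double quotes are the default quote character.
rows_default = list(csv.reader(io.StringIO(double_quoted)))

# Any other quote character has to be declared explicitly.
rows_single = list(csv.reader(io.StringIO(single_quoted), quotechar="'"))

print(rows_default[1])
print(rows_single[1])
```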

The dataset we are going to use is the famous Iris dataset. Additional information about it is available at:

https://archive.ics.uci.edu/ml/datasets/iris

The dataset consists of 150 records of Iris plants with four features: 'sepal-length', 'sepal-width', 'petal-length', and 'petal-width'. All of the features are numeric. Each record is classified into one of three classes: 'Iris-setosa', 'Iris-versicolor', or 'Iris-virginica'.

## Load CSV with Python Standard Library

First, import the `csv` module provided by the Python standard library:

In [1]:
import csv

Next, import the NumPy module, which we use to convert the loaded data into a NumPy array:

In [2]:
import numpy as np

Now, provide the path to the CSV data file (here, a file stored in the local working directory):

In [3]:
path = "iris_with_header_without_class.csv"

Next, use the __`csv.reader()`__ function to read the data from the CSV file:

In [4]:
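with open(path, 'r', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    headers = next(reader)  # the first row holds the column names
    data = list(reader)     # the remaining rows are the records
    # Convert the list of string rows into a 2-D float array.
    data = np.array(data).astype(float)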

We can print the column names from the header with the following line:

In [5]:
print(headers)

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']


The following line prints the shape of the data, i.e. the number of rows and columns in the file:

In [6]:
print(data.shape)

(150, 4)


The next line prints the first three rows of the data file:

In [7]:
print(data[:3])

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]]


---
Let's see what happens if we load the data together with the class column, which holds strings:

In [3]:
path = "iris_with_header.csv"
with open(path, 'r', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    headers = next(reader)  # skip the header row
    data = list(reader)     # every remaining row, including the class column

In [6]:
data = np.array(data)


In [7]:
data.shape

(151,)

In [8]:
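with open(path, 'r', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    headers = next(reader)
    data = list(reader)
    # dtype='object' stores each row as a Python object instead of forcing one numeric type.
    data = np.array(data, dtype='object')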

In [9]:
data.shape

(151,)

In [10]:
print(data)

[list(['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'])
 list(['4.9', '3.0', '1.4', '0.2', 'Iris-setosa'])
 list(['4.7', '3.2', '1.3', '0.2', 'Iris-setosa'])
 list(['4.6', '3.1', '1.5', '0.2', 'Iris-setosa'])
 list(['5.0', '3.6', '1.4', '0.2', 'Iris-setosa'])
 list(['5.4', '3.9', '1.7', '0.4', 'Iris-setosa'])
 list(['4.6', '3.4', '1.4', '0.3', 'Iris-setosa'])
 list(['5.0', '3.4', '1.5', '0.2', 'Iris-setosa'])
 list(['4.4', '2.9', '1.4', '0.2', 'Iris-setosa'])
 list(['4.9', '3.1', '1.5', '0.1', 'Iris-setosa'])
 list(['5.4', '3.7', '1.5', '0.2', 'Iris-setosa'])
 list(['4.8', '3.4', '1.6', '0.2', 'Iris-setosa'])
 list(['4.8', '3.0', '1.4', '0.1', 'Iris-setosa'])
 list(['4.3', '3.0', '1.1', '0.1', 'Iris-setosa'])
 list(['5.8', '4.0', '1.2', '0.2', 'Iris-setosa'])
 list(['5.7', '4.4', '1.5', '0.4', 'Iris-setosa'])
 list(['5.4', '3.9', '1.3', '0.4', 'Iris-setosa'])
 list(['5.1', '3.5', '1.4', '0.3', 'Iris-setosa'])
 list(['5.7', '3.8', '1.7', '0.3', 'Iris-setosa'])
 ...
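The one-dimensional shape `(151,)` and the `list([...])` entries indicate that NumPy stored each row as a separate Python list rather than building a two-dimensional array, which happens when the rows do not all have the same length. One plausible cause, assumed here rather than confirmed, is a blank line in the file, which `csv.reader` returns as an empty list. A minimal sketch of one way to handle this is to skip empty rows and keep the numeric features and the string labels in separate arrays. The in-memory sample and the names `X` and `y` are made up for illustration.

```python
import csv
import io

import numpy as np

# Hypothetical stand-in for the file: header, two records, and a trailing blank line.
text = (
    "sepal_length,sepal_width,petal_length,petal_width,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "\n"
)

reader = csv.reader(io.StringIO(text))
headers = next(reader)
# csv.reader yields an empty list for a blank line; skip those rows.
rows = [row for row in reader if row]

# Numeric feature columns become a float array; the class labels stay as strings.
X = np.array([row[:-1] for row in rows]).astype(float)
y = np.array([row[-1] for row in rows])

print(X.shape, y.shape)
```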

## Load CSV with NumPy

Another way to load a CSV data file is with NumPy's `numpy.loadtxt()` function.

In [8]:
from numpy import loadtxt

path = "iris_without_header_without_class.csv"
# Open the file in a context manager so that it is closed after loading.
with open(path, 'r') as datapath:
    data = loadtxt(datapath, delimiter=",")
print(data.shape)
print(data[:3])

(150, 4)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]]
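The file above has neither a header nor a class column. For a file that has both, `numpy.loadtxt()` offers the `skiprows` and `usecols` parameters. The following is a hedged sketch on made-up in-memory data, not on the actual Iris file; the variable names are assumptions.

```python
import io

import numpy as np

# Hypothetical sample with a header row and a string class column.
text = (
    "sepal_length,sepal_width,petal_length,petal_width,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)

# Skip the header row and read only the four numeric columns.
features = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1, usecols=(0, 1, 2, 3))

# Read the class column separately as strings.
labels = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1, usecols=4, dtype=str)

print(features.shape)
print(labels)
```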


## Load CSV with Pandas

Another way to load a CSV data file is with __`Pandas`__ and its __`pandas.read_csv()`__ function. This very flexible function returns a __`pandas.DataFrame`__.

In [9]:
import pandas as pd

df = pd.read_csv('iris.data', header=None)
df.tail()

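|     | 0   | 1   | 2   | 3   | 4              |
|-----|-----|-----|-----|-----|----------------|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |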
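With `header=None`, Pandas labels the columns with integers, as in the output above. Column names can be supplied through the `names` parameter of `pandas.read_csv()`. The sketch below uses a two-row in-memory sample instead of `iris.data`, and the column names are borrowed from the header of `iris_with_header.csv`; both choices are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as iris.data (no header row).
text = "5.1,3.5,1.4,0.2,Iris-setosa\n5.9,3.0,5.1,1.8,Iris-virginica\n"

column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# header=None says the file has no header row; names assigns the labels manually.
df_named = pd.read_csv(io.StringIO(text), header=None, names=column_names)

print(df_named.columns.tolist())
print(df_named.dtypes)
```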


<hr>

### Note:


You can also read the data directly from the Internet. The Iris dataset is hosted by UCI at https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data. For instance, to load the Iris dataset from the Internet, replace the line

    df = pd.read_csv('your/local/path/to/iris.data', header=None)

with

    df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
     


In [10]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
df.tail()

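|     | 0   | 1   | 2   | 3   | 4              |
|-----|-----|-----|-----|-----|----------------|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |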


Now load the data file that has a header row:

In [11]:
df = pd.read_csv('iris_with_header.csv')
df.tail()

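|     | sepal_length | sepal_width | petal_length | petal_width | class          |
|-----|--------------|-------------|--------------|-------------|----------------|
| 145 | 6.7          | 3.0         | 5.2          | 2.3         | Iris-virginica |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         | Iris-virginica |
| 147 | 6.5          | 3.0         | 5.2          | 2.0         | Iris-virginica |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         | Iris-virginica |
| 149 | 5.9          | 3.0         | 5.1          | 1.8         | Iris-virginica |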


In [12]:
df.head()

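|   | sepal_length | sepal_width | petal_length | petal_width | class       |
|---|--------------|-------------|--------------|-------------|-------------|
| 0 | 5.1          | 3.5         | 1.4          | 0.2         | Iris-setosa |
| 1 | 4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa |
| 2 | 4.7          | 3.2         | 1.3          | 0.2         | Iris-setosa |
| 3 | 4.6          | 3.1         | 1.5          | 0.2         | Iris-setosa |
| 4 | 5.0          | 3.6         | 1.4          | 0.2         | Iris-setosa |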
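Once the data is in a `DataFrame`, it can be converted to NumPy arrays, matching the format used in the earlier sections. The following is a minimal sketch on a two-row in-memory sample; the names `df_demo`, `X`, and `y` are hypothetical, and splitting features from labels is one possible follow-up step rather than part of the original walkthrough.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as iris_with_header.csv.
text = (
    "sepal_length,sepal_width,petal_length,petal_width,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "5.9,3.0,5.1,1.8,Iris-virginica\n"
)

df_demo = pd.read_csv(io.StringIO(text))

# The numeric feature columns become a 2-D float array.
X = df_demo.drop(columns="class").to_numpy()
# The class column becomes a 1-D array of strings.
y = df_demo["class"].to_numpy()

print(X.shape, y.shape)
```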


<hr>