# Load your ML dataset

The most common format for ML data is the CSV (comma-separated values) file. More on [CSV files](https://tools.ietf.org/html/rfc4180).

We will explore various ways to load CSV files:

* Load CSV files from a local drive
* Load CSV files with the Python Standard Library
* Load CSV files with NumPy
* Load CSV files with Pandas
* Load CSV files from GitHub

## Preliminary considerations

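Before loading a CSV file, review in particular:

* ***File Header***: does the first row hold the column names? If it does, it can be used to name the columns; if it does not, you may need to name them yourself. Either way, tell the loading function explicitly whether a header is present.

* ***Comments***: does the file contain comments, typically lines starting with a hash (`#`)? How to declare them depends on the loading method.

* ***Delimiter***: the standard delimiter is the comma (`,`), but a file may use another one, such as a tab or white space, which must then be specified explicitly.

* ***Quotes***: field values that contain the delimiter or spaces are usually quoted. The default quote character is the double quote (`"`); any other character must be specified explicitly.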
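
As a minimal sketch of how these four points show up in practice, the snippet below parses a small made-up sample (not the Pima file) with the standard `csv` module. The sample text, the `;` delimiter and the column names are assumptions chosen only for illustration.

```python
import csv
import io

# Hypothetical sample, not the Pima file: it has a comment line, a header row,
# a ';' delimiter and quoted fields that contain the delimiter.
sample = '''# a comment line
name;plas;mass
"Smith; J.";148;33.6
"Doe; A.";85;26.6
'''

# csv.reader has no option for comments, so comment lines are filtered out first.
lines = (line for line in io.StringIO(sample) if not line.startswith('#'))
reader = csv.reader(lines, delimiter=';', quotechar='"')

header = next(reader)   # the first non-comment row holds the column names
rows = list(reader)     # the remaining rows are data, still as strings

print(header)
print(rows)
```

The quoted names survive as single fields even though they contain the delimiter, which is exactly what the quote character is for.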

## Some test data 

We will use the famous "Pima Indians dataset". The data was freely available from the UCI ML Repository, and can now be found elsewhere. 

For your convenience it can be downloaded from multiple sources:

   * on Google Drive: https://docs.google.com/spreadsheets/d/1bblAMi-MdOUszE8wv65Qv2G-jP2NtbP05ShsgMky_aU
   * on a web server: https://bonacor.web.cern.ch/DSC-AML-AA201920/datasets/pima-indians-diabetes.data.csv
   * on GitHub: https://raw.githubusercontent.com/dbonacorsi/AML_basic_AA1920/master/datasets/pima-indians-diabetes.data.csv



## Load CSV Files from a local drive

In [1]:
from google.colab import files
uploaded = files.upload()

Saving pima-indians-diabetes.data.csv to pima-indians-diabetes.data.csv


In [2]:
!pwd

/content


In [3]:
!ls /content

pima-indians-diabetes.data.csv	sample_data


In [4]:
!ls -trlh /content/pima-indians-diabetes.data.csv

-rw-r--r-- 1 root root 24K Apr 23 05:04 /content/pima-indians-diabetes.data.csv


In [5]:
!head -10 /content/pima-indians-diabetes.data.csv

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0,0.232,54,1
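
The `!ls` and `!head` shell commands above only work in a notebook. Outside Colab, a rough pure-Python equivalent could look like the sketch below. The helper name `peek` is hypothetical, and the demo runs on a throwaway file holding the first three rows shown above rather than on the real dataset.

```python
import itertools
import tempfile
from pathlib import Path

def peek(path, n=10):
    """Return the file size in bytes and the first n lines of a text file."""
    path = Path(path)
    with path.open('rt') as f:
        first_lines = [line.rstrip('\n') for line in itertools.islice(f, n)]
    return path.stat().st_size, first_lines

# Demo on a temporary file (an assumption, so the sketch is self-contained).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / 'demo.csv'
    demo.write_text(
        '6,148,72,35,0,33.6,0.627,50,1\n'
        '1,85,66,29,0,26.6,0.351,31,0\n'
        '8,183,64,0,0,23.3,0.672,32,1\n'
    )
    size, lines = peek(demo, n=2)

print(size)
print(lines)
```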


## Load CSV Files with the Python Standard Library

More info on `csv.reader()` can be found in the [CSV File Reading and Writing](https://docs.python.org/3/library/csv.html) section of the Python documentation.

In [0]:
import csv
import numpy as np

Note: file open options are documented [here](https://docs.python.org/3/library/functions.html#open).

In [7]:
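filename = 'pima-indians-diabetes.data.csv'
# 'rt' = open for reading (r) in text mode (t); both are the defaults.
# newline='' is what the csv module documentation recommends for file objects.
with open(filename, 'rt', newline='') as raw_data:   # the file is closed when the block ends
    reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
    x = list(reader)                                 # every field is still a string here
x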

[['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1'],
 ['1', '85', '66', '29', '0', '26.6', '0.351', '31', '0'],
 ['8', '183', '64', '0', '0', '23.3', '0.672', '32', '1'],
 ['1', '89', '66', '23', '94', '28.1', '0.167', '21', '0'],
 ['0', '137', '40', '35', '168', '43.1', '2.288', '33', '1'],
 ['5', '116', '74', '0', '0', '25.6', '0.201', '30', '0'],
 ['3', '78', '50', '32', '88', '31', '0.248', '26', '1'],
 ['10', '115', '0', '0', '0', '35.3', '0.134', '29', '0'],
 ['2', '197', '70', '45', '543', '30.5', '0.158', '53', '1'],
 ['8', '125', '96', '0', '0', '0', '0.232', '54', '1'],
 ['4', '110', '92', '0', '0', '37.6', '0.191', '30', '0'],
 ['10', '168', '74', '0', '0', '38', '0.537', '34', '1'],
 ['10', '139', '80', '0', '0', '27.1', '1.441', '57', '0'],
 ['1', '189', '60', '23', '846', '30.1', '0.398', '59', '1'],
 ['5', '166', '72', '19', '175', '25.8', '0.587', '51', '1'],
 ['7', '100', '0', '0', '0', '30', '0.484', '32', '1'],
 ['0', '118', '84', '47', '230', '45.8', '0.551',

In [8]:
data = np.array(x).astype('float')
data

array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])

In [9]:
print(data.shape)

(768, 9)
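
The values come back from `csv.reader()` as strings, which is why the NumPy conversion above is needed. As an alternative sketch, the reader can convert them itself: with `quoting=csv.QUOTE_NONNUMERIC`, every unquoted field is turned into a `float`. The example uses an in-memory copy of the first two rows instead of the real file, so it runs on its own.

```python
import csv
import io

# In-memory stand-in for the file: the first two rows of the dataset shown above.
sample = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

# QUOTE_NONNUMERIC makes the reader convert every unquoted field to float.
reader = csv.reader(sample, delimiter=',', quoting=csv.QUOTE_NONNUMERIC)
rows = list(reader)

print(rows[0])
print(type(rows[0][0]))
```

This only suits purely numeric files: an unquoted field that is not a number, such as a header name, raises a `ValueError`.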


## Load CSV Files with NumPy

In [0]:
from numpy import loadtxt

More information on the `numpy.loadtxt()` function can be found in the [NumPy API documentation for loadtxt](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.loadtxt.html). The code below loads the file as a `numpy.ndarray` (more info in the [NumPy API documentation for ndarray](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.ndarray.html)).

In [11]:
filename = 'pima-indians-diabetes.data.csv'
# loadtxt accepts a file name directly and then opens and closes the file itself
data = loadtxt(filename, delimiter=",")
print(data.shape)

(768, 9)
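
The Pima file has no header and no comments, so the defaults are enough. For files that do, `numpy.loadtxt()` has the `skiprows` and `comments` parameters. The sketch below uses a made-up three-column sample held in memory, which is an assumption for illustration only.

```python
import io
import numpy as np

# Hypothetical sample: a header row, a comment line, then two data rows.
sample = io.StringIO(
    "preg,plas,class\n"
    "# a comment line\n"
    "6,148,1\n"
    "1,85,0\n"
)

# skiprows=1 drops the header line; lines starting with '#' are ignored as comments.
data = np.loadtxt(sample, delimiter=',', skiprows=1, comments='#')

print(data.shape)
print(data.dtype)
```

Note that `skiprows` counts comment lines too, so a comment placed before the header would need `skiprows=2`.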


## Load CSV Files with Pandas

Use the `pandas.read_csv()` function (more info [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)). The function returns a `pandas.DataFrame` (more information in the [Pandas API documentation for DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)) that one can immediately process, summarize, plot, etc.

In [0]:
from pandas import read_csv

In [0]:
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)

In [14]:
type(data)

pandas.core.frame.DataFrame

In [15]:
print(data.shape)

(768, 9)


In [16]:
data

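|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |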
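
The `names` argument matters here because the file has no header row. The following sketch, on a made-up three-column sample held in memory, shows what may go wrong without it: by default `read_csv()` takes the first data row as the header.

```python
import io
import pandas as pd

# Hypothetical three-column sample without a header row.
text = "6,148,1\n1,85,0\n8,183,1\n"
names = ['preg', 'plas', 'class']

# Without names, the first data row is consumed as the column names.
wrong = pd.read_csv(io.StringIO(text))

# With names (and no explicit header argument), every row is treated as data.
right = pd.read_csv(io.StringIO(text), names=names)

print(wrong.shape, list(wrong.columns))
print(right.shape, list(right.columns))
```

Comparing the two shapes is a quick check that no row was silently lost to the header.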


## Load CSV Files from GitHub

In [17]:
import pandas as pd

url = 'https://raw.githubusercontent.com/dbonacorsi/AML_basic_AA1920/master/datasets/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)
data

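|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |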


## <font color='red'>Exercise</font>

Write a few lines of code, in a blank notebook, with the method of your choice, to import this CSV file into Google Colab. Make sure it would work with any other CSV file you might be given access to throughout this course - you will reuse it often!

Share your solution with the class (e.g. CTRL+C/V in the Teams chat window, or - better - share a GitHub link to your notebook).

## Summary

What we did:

* we discussed the need to import data
* we discussed the CSV format 
* we discussed peculiarities to check in the file before importing
* we became familiar with a few ways to load data into Python (for ML purposes), and discussed why pandas might be just the right way to go.

## What's next

It is time to start looking at the data we loaded. We will discover how to use simple descriptive statistics to better understand our data.