# Load your ML dataset

More on [CSV files](https://tools.ietf.org/html/rfc4180). 

First of all we need to get the dataset!

* Get CSV files from a local drive directly into Colab

Then, we will explore a few common ways to load CSV files into the notebook:

* Load CSV Files with the Python Standard Library
* Load CSV Files with NumPy
* Load CSV Files with Pandas

Then, we will try one last thing:

* Get CSV files from GitHub and load them into pandas

## Preliminary considerations

Review in particular:

* ***File Header***

* ***Comments***

* ***Delimiter***

* ***Quotes***
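
Each of the four items above usually maps to an option of the loading function. The sketch below uses the standard `csv` module on a small made-up text (the names `sample_text`, `header` and `rows` are illustrative and not part of the course dataset) to show where a header row, a comment line, a non-default delimiter and a quoted field each come into play. It is a minimal illustration under those assumptions, not a general-purpose parser.

```python
import csv
import io

# Toy CSV text (made up for illustration): a header row, a comment line,
# a semicolon delimiter and one quoted field that contains the delimiter.
sample_text = (
    "name;score\n"
    "# this line is a comment\n"
    '"Smith; John";3.5\n'
    "Lee;4.0\n"
)

# The csv module has no comment option, so comment lines are filtered by hand.
lines = [line for line in io.StringIO(sample_text) if not line.startswith("#")]

# delimiter and quotechar tell the reader how fields are separated and quoted.
reader = csv.reader(lines, delimiter=";", quotechar='"')
header = next(reader)   # the first remaining row is the file header
rows = list(reader)     # every other row is data

print(header)
print(rows)
```

Judging from the loading code used in this notebook, the Pima file appears to be the simple case: no header row, no comment lines, a comma delimiter and no quoted fields.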

## Your dataset

We will use the well-known "Pima Indians dataset". The data used to be freely available from the UCI ML Repository and can now be found elsewhere. A good description of its contents is available here: https://www.kaggle.com/uciml/pima-indians-diabetes-database.

For your convenience in this course, it can be downloaded from multiple sources:

   * on Google Drive: https://docs.google.com/spreadsheets/d/1u6YdBpjywHjT3vlT4DewaWn4rMfIz6cHT87ILLjwnBU  
   * on GitHub: https://raw.githubusercontent.com/dbonacorsi/AML2021Bas/main/datasets/pima-indians-diabetes.data.csv 



## Get CSV files from a local drive directly into Colab

_(To try this option with the file above, i.e. to pretend the dataset file is on your machine and upload it into your working notebook, first download it from one of the sources to your local disk. Do it now, before proceeding.)_

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
!pwd

In [None]:
!ls /content


In [None]:
!ls -trlh pima-indians-diabetes.data.csv


In [None]:
!head -10 /content/pima-indians-diabetes.data.csv
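
Besides writing the file to the working directory, `files.upload()` in Colab returns a dictionary that maps each uploaded file name to its content as bytes, so the data can also be read straight from memory. The sketch below replaces that dictionary with a hand-made stand-in (the file name `toy.csv` and its content are hypothetical) so that it runs outside Colab too.

```python
import io
import pandas as pd

# Stand-in for the dictionary returned by files.upload() in Colab
# (hypothetical file name and made-up content, so the sketch runs anywhere).
uploaded = {"toy.csv": b"6,148,72\n1,85,66\n"}

# Wrap the raw bytes in a file-like object that read_csv can consume.
# header=None because the toy data has no header row.
toy_df = pd.read_csv(io.BytesIO(uploaded["toy.csv"]), header=None)
print(toy_df.shape)
```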

## Load CSV Files with the Python Standard Library

More info on `csv.reader()` can be found in the Python documentation: [CSV File Reading and Writing](https://docs.python.org/3/library/csv.html).

In [None]:
import csv
import numpy as np

Note: file open options are documented [here](https://docs.python.org/3/library/functions.html#open).

In [None]:
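filename = 'pima-indians-diabetes.data.csv'
# the with-block closes the file once reading is done;
# newline='' is what the csv module documentation recommends
with open(filename, 'rt', newline='') as raw_data:
    reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
    x = list(reader)   # one list of strings per row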

In [None]:
x

In [None]:
mydata_withpython = np.array(x).astype('float')

In [None]:
mydata_withpython

In [None]:
print(mydata_withpython.shape)
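
The conversion step matters because `csv.reader()` always yields strings. A minimal sketch on made-up rows (the names `toy_csv`, `toy_rows` and `toy_array` are illustrative only) shows the types before and after the conversion:

```python
import csv
import io

import numpy as np

# Two made-up rows standing in for a numeric CSV file.
toy_csv = io.StringIO("6,148,72\n1,85,66\n")

toy_rows = list(csv.reader(toy_csv, delimiter=","))
print(type(toy_rows[0][0]))   # every field comes back as a string

# Convert the nested list of strings into a 2-D array of floats.
toy_array = np.array(toy_rows).astype(float)
print(toy_array.dtype, toy_array.shape)
```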

## Load CSV Files with NumPy

In [None]:
from numpy import loadtxt

More information on the `numpy.loadtxt()` function can be found in the [NumPy API documentation for loadtxt](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.loadtxt.html). The code below loads the file as a `numpy.ndarray` (more info in the [NumPy API documentation for ndarray](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.ndarray.html)).

In [None]:
filename = 'pima-indians-diabetes.data.csv'
# the with-block closes the file once loading is done
with open(filename, 'rt') as raw_data:
    mydata_withnumpy = loadtxt(raw_data, delimiter=",")

In [None]:
mydata_withnumpy

In [None]:
print(mydata_withnumpy.shape)
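
Other CSV files you meet may start with a header row, which `loadtxt()` cannot convert to numbers. A minimal sketch, on made-up data with hypothetical column names, of skipping that row with the `skiprows` parameter:

```python
import io

import numpy as np

# Made-up CSV text with one header row followed by numeric data.
toy_text = io.StringIO("preg,plas\n6,148\n1,85\n")

# skiprows=1 drops the header line before the numeric parsing starts.
toy_numeric = np.loadtxt(toy_text, delimiter=",", skiprows=1)
print(toy_numeric.shape)
```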

## Load CSV Files with Pandas

Use the `pandas.read_csv()` function (more info [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)). The function returns a `pandas.DataFrame` (more information in the [Pandas API documentation for DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)) that one can immediately process, summarize, plot, etc.

In [None]:
from pandas import read_csv

In [None]:
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
mydata_withpandas = read_csv(filename, names=names)

In [None]:
type(mydata_withpandas)

In [None]:
mydata_withpandas

In [None]:
print(mydata_withpandas.shape)
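
Passing `names` matters for a file without a header row: by default `read_csv()` treats the first line as column names, so the first data row would be lost. A minimal sketch on made-up, headerless data (the variables `toy_text`, `with_names` and `without_names` are illustrative only):

```python
import io

import pandas as pd

toy_text = "6,148,72\n1,85,66\n"        # made-up data with no header row
toy_names = ["preg", "plas", "pres"]

# With names=..., both lines are read as data.
with_names = pd.read_csv(io.StringIO(toy_text), names=toy_names)

# Without names, the first line is consumed as the header.
without_names = pd.read_csv(io.StringIO(toy_text))

print(with_names.shape)
print(without_names.shape)
print(list(without_names.columns))
```

The first frame keeps both rows, while the second keeps only one and uses the values of the first line as column labels.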

## Get CSV files from GitHub and load into pandas

Can you read the code below and spot the differences and similarities with respect to the previous methods?

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/dbonacorsi/AML2021Bas/main/datasets/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)
data

---

## <font color='red'>Exercise 1</font>

Compare the import methods above. Can you convince yourself that they all imported the same data?

*   _HINT_: let's say it is enough to check that, e.g., one specific entry is the same in all imported datasets.



In [None]:
### add your solution here

---

## <font color='red'>Exercise 2</font>

Set up a few lines of code, in a blank notebook, with the method of your choice, to import this CSV file into Google Colab. Make sure it would work with any other CSV file you might be given access to throughout this course: you will reuse it often!

Share your solution with the class (e.g. CTRL+C/V in the Teams chat window, or, better, share a GitHub link to your notebook).

## Summary

What we did:

* we discussed the need to import data
* we discussed the CSV format 
* we discussed peculiarities to check in the file before importing
* we became familiar with a few ways to load data into Python (for ML purposes), and discussed why pandas might be the right way to go

## What's next

It is time to start looking at the data we loaded. We will discover how to use simple descriptive statistics to better understand our data.