## Installing Python libraries

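A useful feature of Jupyter notebooks is that they can run terminal commands from a code cell when the command is prefixed with "!". This means we can install Python libraries on the fly. The `%pip` and `%conda` magic commands do the same job and install into the environment of the running kernel. If you plan on running these notebooks on your own machine, you'll need to install a few libraries: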

In [1]:
#!pip install pandas --user

%conda install pandas

Channels:
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


## CSV files

You must be able to load your data before you can start your machine learning project. The most 
common format for machine learning data is the CSV file. There are a number of ways to load a CSV 
file in Python. In the first section of this lab you will learn three ways to load your 
CSV data in Python: 

1. Load CSV Files with the Python Standard Library. 
2. Load CSV Files with NumPy.
3. Load CSV Files with Pandas.
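
As a rough sketch of the first two options, the snippet below parses the same text with the standard library's `csv.reader` and with `numpy.loadtxt`. The three rows are copied from the start of the Pima Indians file used later in this lab, and the in-memory `io.StringIO` buffer is an assumption made so the example runs without a file on disk.

```python
import csv
import io

import numpy as np

# Three rows in the same layout as the Pima file (no header, all numeric).
raw = (
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)

# Standard library: csv.reader yields each row as a list of strings,
# so converting the values to numbers is up to us.
rows = [[float(value) for value in row] for row in csv.reader(io.StringIO(raw))]
print(len(rows), len(rows[0]))

# NumPy: loadtxt parses the text straight into a 2-D float array.
array = np.loadtxt(io.StringIO(raw), delimiter=",")
print(array.shape)
```

The standard library needs no extra installation but leaves type conversion to you, while NumPy returns an array that is ready for numerical work.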

There are a number of considerations when loading your machine learning data from CSV files. For 
reference, you can learn a lot about the expectations for CSV files by reviewing the Request for 
Comments (RFC 4180) titled Common Format and MIME Type for Comma-Separated Values (CSV) Files (URL: 
https://tools.ietf.org/html/rfc4180).

**File Header**. Does your data have a file header? If so, this can help in automatically assigning 
names to each column of data. If not, you may need to name your attributes manually. Either way, 
you should explicitly specify whether or not your CSV file has a file header when loading your 
data.
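
A minimal sketch of both cases with `pandas.read_csv`, assuming two tiny in-memory files with made-up column names `a` and `b`: `header=0` takes the names from the first line, while `header=None` together with `names=` supplies them manually.

```python
import io

import pandas as pd

# Hypothetical two-column data, once with a header line and once without.
with_header = "a,b\n1,2\n3,4\n"
without_header = "1,2\n3,4\n"

# header=0: the first line holds the column names.
df_with = pd.read_csv(io.StringIO(with_header), header=0)

# header=None: every line is data, so we pass the names ourselves.
df_without = pd.read_csv(io.StringIO(without_header), header=None, names=["a", "b"])

print(list(df_with.columns), df_with.shape)
print(list(df_without.columns), df_without.shape)
```

Both calls produce two rows and two columns. Being explicit avoids the first data row being silently consumed as a header.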

**Comments**. Does your data have comments? Comments in a CSV file are commonly indicated by a hash (#) at 
the start of a line, although this is a convention rather than part of RFC 4180. If you have comments 
in your file, depending on the method used to load your data, you may need to indicate whether or not 
to expect comments and which character signifies a comment line.
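
With pandas, one way to handle this is the `comment` parameter of `read_csv`. The sketch below assumes a small in-memory file whose first line is a made-up comment.

```python
import io

import pandas as pd

# Hypothetical file: one comment line followed by two data rows.
raw = "# exported from a hypothetical source\n6,148,1\n1,85,0\n"

# comment="#" tells the parser to ignore everything after a hash,
# so a line that starts with "#" is skipped entirely.
df = pd.read_csv(io.StringIO(raw), header=None, comment="#")
print(df.shape)
```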

**Delimiter**. The standard delimiter that separates values in fields is the comma (,) character. Your 
file could use a different delimiter, such as a tab or white space, in which case you must specify it 
explicitly.
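
As an illustration, `pandas.read_csv` takes the delimiter through its `sep` parameter. The two in-memory samples below are assumptions made for the example: one is tab-separated and the other is separated by runs of spaces.

```python
import io

import pandas as pd

# Hypothetical tab-separated and space-separated versions of the same rows.
tab_raw = "6\t148\t1\n1\t85\t0\n"
space_raw = "6  148 1\n1 85   0\n"

# A literal tab as the separator.
df_tab = pd.read_csv(io.StringIO(tab_raw), sep="\t", header=None)

# A regular expression matching one or more whitespace characters.
df_space = pd.read_csv(io.StringIO(space_raw), sep=r"\s+", header=None)

print(df_tab.shape, df_space.shape)
```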

**Quotes**. Sometimes field values can contain spaces or even the delimiter itself. In these CSV files the values are often quoted. 
The default quote character is the double quotation mark ("). Other characters can be used, 
and you must specify the quote character used in your file.
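
A short sketch of a non-default quote character, using the `quotechar` parameter of `pandas.read_csv`. The sample text is made up for the example and uses single quotes around values, one of which contains a comma.

```python
import io

import pandas as pd

# Hypothetical file that quotes text values with single quotes.
raw = "'Pima Indians',768\n'hypothetical, with comma',9\n"

# quotechar="'" keeps the comma inside the quotes as part of the value.
df = pd.read_csv(io.StringIO(raw), header=None, quotechar="'")
print(df.shape)
print(df.iloc[1, 0])
```

Without the `quotechar` argument the second line would be split into three fields instead of two.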

### Pima Indians Dataset

The Pima Indians dataset will be used to demonstrate loading data. It will also be used in 
many of the practical lessons to come. 

This dataset describes the medical records for Pima Indians and whether 
or not each patient will have an onset of diabetes within five years. 

As such it is a classification problem. It is a good dataset for demonstration because all of the input attributes are numeric and the output variable to be predicted is binary (0 or 1). 

The eight input attributes and the output class are listed below: 

1. Number of times pregnant.
2. Plasma glucose concentration 2 hours in an oral glucose tolerance test. 
3. Diastolic blood pressure (mm Hg).
4. Triceps skin fold thickness (mm).
5. 2-Hour serum insulin (mu U/ml).
6. Body mass index (BMI).
7. Diabetes pedigree function.
8. Age (years).
9. Class, onset of diabetes within five years.

Because all attributes are numerical, the dataset is easy to use directly with machine learning 
algorithms that expect numerical input and output values. 

### Load CSV from file
Before we can use the dataset in our notebook, we need to load it into memory. This first example shows you how to read a file from disk:

In [14]:
import pandas as pd

filename = './pima-indians-diabetes.data.csv'
header = ['Pregnancy_Count','Glucone_conc','Blood_pressure','Skin_thickness','Insulin','BMI','DPF','Age','Class']

data = pd.read_csv(filename, names=header)

The `read_csv()` function is very flexible and is my recommended approach for loading your machine learning data. The function returns a `pandas.DataFrame` that you can immediately start summarising and 
plotting. Note that in this code block we explicitly specify the name of each attribute through the `names` argument, because the file itself has no header row. 

For more information on the `pandas.DataFrame` see the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html">API documentation</a>.
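
To give a flavour of the summarising step, here is a small sketch that loads the first three rows of the dataset from an in-memory string (an assumption made so it runs without the file) and calls `DataFrame.dtypes` and `DataFrame.describe()`.

```python
import io

import pandas as pd

header = ['Pregnancy_Count', 'Glucone_conc', 'Blood_pressure', 'Skin_thickness',
          'Insulin', 'BMI', 'DPF', 'Age', 'Class']

# First three rows of the Pima Indians file, held in memory for illustration.
sample = (
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)

data = pd.read_csv(io.StringIO(sample), names=header)

# The inferred type of each column (integers or floats here).
print(data.dtypes)

# Count, mean, min, max and more for every numeric column.
summary = data.describe()
print(summary.loc[["count", "mean", "min", "max"]])
```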


### Load CSV using Pandas from URL
We can also modify this example to load CSV data directly from a URL.

In [15]:
from pandas import read_csv

url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'

header = ['Pregnancy_Count','Glucone_conc','Blood_pressure','Skin_thickness','Insulin','BMI','DPF','Age','Class']

data = read_csv(url, names=header)

## Viewing our data

We can inspect the first few rows of our data to check that it loaded correctly into columns using the `DataFrame.head()` method.

In [16]:
data.head()

Unnamed: 0,Pregnancy_Count,Glucone_conc,Blood_pressure,Skin_thickness,Insulin,BMI,DPF,Age,Class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


This method returns the first five rows of our data by default as a preview. You can specify how many rows to return as a parameter:

In [17]:
data.head(20)

Unnamed: 0,Pregnancy_Count,Glucone_conc,Blood_pressure,Skin_thickness,Insulin,BMI,DPF,Age,Class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


We can also look at the end of our data using the `.tail()` method.

In [18]:
data.tail()

Unnamed: 0,Pregnancy_Count,Glucone_conc,Blood_pressure,Skin_thickness,Insulin,BMI,DPF,Age,Class
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.34,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
767,1,93,70,31,0,30.4,0.315,23,0
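
Both methods accept a row count. The sketch below uses a hypothetical ten-row `DataFrame` with a single made-up column so that the effect of the argument is easy to see.

```python
import pandas as pd

# Hypothetical toy data: ten rows indexed 0 to 9.
frame = pd.DataFrame({"value": range(10)})

first_three = frame.head(3)   # rows with index 0, 1, 2
last_two = frame.tail(2)      # rows with index 8, 9
default_head = frame.head()   # no argument: five rows

print(list(first_three.index))
print(list(last_two.index))
print(len(default_head))
```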


## What is the dimensionality of our data?

Once we have loaded our data, we need to determine how many rows of data we have, and how many attributes or features are in the columns. The `DataFrame.shape` attribute provides this information directly:

In [19]:
print(data.shape)

(768, 9)


From the tuple returned, we can see that we have 768 rows of data and 9 columns, representing the 8 features and 1 target variable (class). The column names are not counted as a row.
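
Since `shape` is an ordinary tuple, it can be unpacked into separate variables. The sketch below assumes the first three rows of the dataset held in memory, and that the last column is the target, as in this lab.

```python
import io

import pandas as pd

# First three rows of the Pima Indians file, held in memory for illustration.
sample = (
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)
data = pd.read_csv(io.StringIO(sample), header=None)

# shape is (number of rows, number of columns).
n_rows, n_cols = data.shape

# Assumption: exactly one column is the target, the rest are features.
n_features = n_cols - 1

print(n_rows, n_cols, n_features)
```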