# Numeric data for Machine Learning

### Introduction

Before we can do anything with machine learning, we need to load our data—and one of the most common formats for storing data is CSV, which stands for *Comma-Separated Values*. A CSV file is simply a plain text file where each row represents a record (like a person, a review, or a product), and each value is separated by a comma.

Python gives us a few ways to load CSV files into a program. The most common methods are:

1. Using Python’s built-in `csv` module  
2. Using NumPy’s functions like `loadtxt` or `genfromtxt` (useful for numbers)  
3. Using the `pandas` library and its `read_csv()` function (the most flexible and beginner-friendly)

Each method has its own strengths:

- *Python’s csv module* is built in—you don’t need to install anything—but it requires a bit more effort to convert data into the types you need (like numbers or dates).
- *NumPy* is great for large numerical datasets, but it can struggle with files that contain text, headers, or missing values.
- *pandas* is the most popular option because it makes working with tables of data very easy. It automatically handles many of the common problems you might run into and gives you a powerful tool called a *DataFrame*—a table that you can search, filter, and analyse with just a few lines of code.
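
To make the comparison concrete, here is a minimal sketch that reads the same small CSV with all three approaches. The data and the column names `a`, `b`, `c` are made up for illustration, and an in-memory string stands in for a file on disk.

```python
import csv
import io

import numpy as np
import pandas as pd

# A tiny, made-up CSV with no header row (values are illustrative only).
raw = "6,148,72\n1,85,66\n8,183,64\n"

# 1. csv module: every value arrives as a string, so we convert by hand.
rows = [[float(value) for value in row] for row in csv.reader(io.StringIO(raw))]

# 2. NumPy: returns a 2D float array directly.
array = np.loadtxt(io.StringIO(raw), delimiter=",")

# 3. pandas: returns a DataFrame with the column names we supply.
frame = pd.read_csv(io.StringIO(raw), names=["a", "b", "c"])

print(rows[0])       # a plain Python list of floats
print(array.shape)   # (3, 3)
print(frame.dtypes)  # pandas infers a numeric type for each column
```

With a real file you would pass a path (or an open file object) instead of `io.StringIO(raw)`.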

If you’re curious about how CSV files are structured, [RFC 4180](https://tools.ietf.org/html/rfc4180) documents the most widely followed format, although many real-world files deviate from it.

Here are a few things to check when loading a CSV file:

- *Header row*: Does the first line of your file contain column names (like "Name", "Age", or "Review")? If so, most tools can detect and use it automatically. If not, you’ll need to tell the tool what the column names should be.

- *Comments*: Some CSV files include comment lines that start with a symbol like `#`. These lines are meant for humans, not the program, so you may need to tell your loading function to skip them.

- *Delimiter*: Although CSV stands for "comma-separated", sometimes other characters are used instead—like tabs (`\t`) or semicolons (`;`). Make sure you specify the correct one so the data loads properly.

- *Quotes*: Some values in a file might contain commas, spaces, or special characters. To avoid confusion, these fields are wrapped in quotation marks (usually `"`). If your file uses a different type of quote (like `'`), you’ll need to set that too.

If you understand how your CSV file is set up, you’ll be able to load it correctly and avoid common errors—getting you one step closer to analysing and learning from your data.
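
The four checks above map onto parameters of `pandas.read_csv`. The sketch below uses made-up file contents (a comment line, a header row, semicolons as delimiters, and single-quoted fields) to show one way of setting them. Treat it as an illustration rather than a recipe for any particular file.

```python
import io

import pandas as pd

# Made-up file contents: a comment line, a header row, semicolon delimiters,
# and single-quoted fields (one of which contains the delimiter itself).
raw = (
    "# exported for a demo\n"
    "Name;Age;Review\n"
    "'Ana';34;'Good; would buy again'\n"
    "'Ben';41;'Fine'\n"
)

frame = pd.read_csv(
    io.StringIO(raw),
    sep=";",        # delimiter
    comment="#",    # ignore everything from '#' to the end of the line
    quotechar="'",  # fields are wrapped in single quotes
    header=0,       # the first non-comment line holds the column names
)

print(frame)
```

Because the review text is quoted, the semicolon inside it is kept as part of the value instead of being treated as a column break.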

### Installing Python libraries
A useful feature of Jupyter notebooks is that they can run terminal commands when you prefix the command with `!`. This means we can install Python libraries on the fly. If you plan to run these notebooks on your own machine, you'll need to install a few libraries as they are required. The example below installs `pandas`, `numpy`, and `scikit-learn`.

In [None]:
!pip install --upgrade pip

!pip install pandas numpy scikit-learn

## Pima Indians dataset

We will use the **Pima Indians dataset** to demonstrate how to load data into Python. This dataset describes medical records for Pima Indians and indicates whether or not each patient developed diabetes within five years.

The Pima Indian diabetes dataset is a renowned benchmark in the field of machine learning. It was originally made available through the *National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)* in the United States, and later hosted on the *UCI Machine Learning Repository*. Over time, it has become a standard reference dataset for illustrating and evaluating classification algorithms.

### Who was studied

- The dataset focuses on adult female patients (aged 21 or older) of Pima Indian heritage residing near Phoenix, Arizona.  
- This population is known to have a significantly higher incidence of type 2 diabetes.

### What was measured

Each row in the dataset represents one person and includes *eight numbers* that describe their medical background and personal health measurements. These features help researchers and doctors predict the chances of developing type 2 diabetes.

Here’s what each number means:

1. *Number of times pregnant* – How many times the participant has been pregnant.  
2. *Plasma glucose concentration* – The amount of glucose (sugar) in the blood after a two-hour test where the person drinks a sugary solution.  
3. *Diastolic blood pressure* – The lower of the two blood pressure numbers, measured in millimetres of mercury (mm Hg).  
4. *Triceps skin fold thickness* – A measurement in millimetres of the fat under the skin at the back of the upper arm.  
5. *2-hour serum insulin* – The level of insulin in the blood two hours after the glucose test, measured in micro-units per millilitre.  
6. *Body Mass Index (BMI)* – A number that shows whether someone is underweight, normal, overweight, or obese, based on their weight and height.  
7. *Diabetes Pedigree Function (DPF)* – A calculated value that estimates a person’s genetic risk of diabetes based on family history.  
8. *Age* – The person’s age in years.

Finally, there’s a *binary outcome* called the *class*, *target*, or *label*, which tells us whether that person *did* or *did not* develop type 2 diabetes within the next five years. This is what models will try to predict using the other measurements.

### Why it was collected

Researchers aimed to investigate risk factors for diabetes among a group with a particularly high risk of the disease. Various medical and demographic data (such as glucose tolerance tests, insulin measurements, and age) were gathered to determine which factors most strongly predicted the onset of diabetes.

### How the data was obtained

1. **Patient recruitment**  
   Eligible participants were female Pima Indians, aged 21 or older.  

2. **Measurements and testing**  
   Each participant underwent standard medical tests, including measuring plasma glucose concentration, blood pressure, and insulin levels, alongside providing demographic details such as age and number of pregnancies.  

3. **Five-year follow-up**  
   The pivotal outcome was whether each participant experienced the onset of type 2 diabetes within five years. This information was determined through follow-up medical records and diagnoses.  

4. **Compilation**  
   The anonymised data were assembled into a structured dataset of 768 entries, each representing a single participant’s measurements and diabetes outcome (onset or no onset).

The Pima Indian diabetes dataset remains a pivotal resource for demonstrating fundamental classification techniques and exploring how demographic and medical attributes can help predict the onset of type 2 diabetes.

### Download the dataset

In [None]:
import urllib.request

url = 'https://raw.githubusercontent.com/martyn-harris-bbk/AppliedMachineLearning/main/data/pima-indians-diabetes.data.csv'
filename = 'pima-indians-diabetes.data.csv'

urllib.request.urlretrieve(url, filename)
print("Download complete.")

### Load csv from file

We demonstrate how to read a CSV file from your local system. We pass the file path in the `filename` variable and supply a list of column names (`header`) through the `names=` parameter, because this file has no header row. If your CSV already contains a header row, omit `names=` and pandas will use the first row of the file as the column names by default; setting `header=0` makes this explicit.

The `.read_csv()` function returns a `pandas.DataFrame` object, which is a powerful 2D data structure that allows row and column operations, descriptive statistics, and data manipulations. You can immediately start summarising and visualising the data using DataFrame methods such as `.describe()`, `.head()`, `.plot()`, and so on.

For more information on `pandas.DataFrame`, see the [API documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

In [None]:
import pandas as pd

filename = 'pima-indians-diabetes.data.csv'

header = [
    'Pregnancy_Count',
    'Glucose_conc',
    'Blood_pressure',
    'Skin_thickness',
    'Insulin',
    'BMI',
    'DPF',
    'Age',
    'Class'
]

data = pd.read_csv(filename, names=header)

### Load csv using pandas from url

In many cases, you may want to load CSV data directly from a web resource. Below, we show you how to modify the example to read the Pima Indians data from a GitHub URL, without having to download it locally first.

In [None]:
url = 'https://raw.githubusercontent.com/martyn-harris-bbk/AppliedMachineLearning/main/data/pima-indians-diabetes.data.csv'

header = [
    'Pregnancy_Count',
    'Glucose_conc',
    'Blood_pressure',
    'Skin_thickness',
    'Insulin',
    'BMI',
    'DPF',
    'Age',
    'Class'
]

data = pd.read_csv(url, names=header)

In [None]:
data.head()

By default, `head()` returns 5 rows, but you can specify exactly how many rows you want to preview by passing an integer. For example, `data.head(20)` will show the first 20 rows.

Previewing just the first few rows is extremely helpful for verifying that the data has been read in correctly, especially if you suspect issues with delimiters, headers, or quoting.

In [None]:
data.head(20)

Similarly, `data.tail()` returns the last few rows of the DataFrame. This can help you inspect how data is structured near the end of the file and confirm if the file terminates properly.

In [None]:
data.tail()

### What is the dimensionality of our data?

Before we start analysing or training a model, it’s helpful to know the *shape* of the dataset—that is, how many rows and columns it contains. This gives us a sense of how big the dataset is and what we’re working with.

We can check this using the `.shape` attribute in a pandas DataFrame, which returns a result like this:

```
(rows, columns)
```

- The number of *rows* tells us how many *examples* or *participants* are in the dataset.  
- The number of *columns* tells us how many *features* (measurements) we have for each example. If the dataset includes the target variable (like whether someone developed diabetes), it will be one of these columns.

Understanding the dataset’s shape helps us:
- Plan how to split the data into training and testing sets  
- Estimate how much memory we’ll need  
- Choose models and techniques that are suitable for the size of the data  

For example, a dataset with 768 rows and 9 columns means we have 768 people in the study, and for each person, we’ve recorded 8 health measurements plus 1 outcome label:

In [None]:
# How big is our data
print(data.shape)

The returned tuple shows that the Pima Indians dataset contains **768 rows** and **9 columns**. The 9 columns correspond to 8 explanatory features plus 1 target column (`Class`).

This knowledge helps us ensure we have the complete dataset loaded. It's also a good initial check before we proceed to more in-depth data profiling or feature analysis.
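
As a small illustration, the tuple can be unpacked into separate variables. The DataFrame below is a made-up stand-in with two feature columns and one target column; it is not the Pima data.

```python
import pandas as pd

# A stand-in DataFrame with made-up values: 2 feature columns plus a target.
frame = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "Age": [50, 31, 32],
    "Class": [1, 0, 1],
})

n_rows, n_cols = frame.shape  # .shape is an attribute, so no parentheses
n_features = n_cols - 1       # assumes exactly one target column

print(n_rows, n_cols, n_features)  # 3 3 2
```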

### Exploring more details of the data

Beyond checking the first and last few rows, pandas offers convenient methods to quickly summarise your dataset. For instance, to view column data types and non-null counts, you can use `data.info()`. This is useful for spotting missing values or confirming that columns are numeric:

In [None]:
# Checking data info
data.info()

From `data.info()`, we get a quick overview of the dataset’s structure. It tells us:

- *How many rows are in the dataset*  
- *Which columns are present*  
- *How many non-null (non-missing) values are in each column*  
- *What data type* each column uses (such as `int64` for whole numbers, `float64` for decimal numbers, or `object` for text)

This information is especially helpful for spotting issues early. For example:
- If a column has fewer non-null values than the total number of rows, it likely contains *missing data*.
- If a column that should contain numbers is listed as `object`, the values might be stored as text—possibly due to formatting issues (like commas or symbols).

Checking the results of `data.info()` allows us to make sure the dataset is properly formatted and ready for analysis or model training.
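
The two warning signs described above can be reproduced on a tiny example. The DataFrame below is invented for illustration: `Age` has one missing entry, and `Income` holds numbers stored as text because one value contains a thousands separator. The conversion shown is one possible fix, not the only one.

```python
import numpy as np
import pandas as pd

# Made-up values: one missing entry and one number written with a comma.
frame = pd.DataFrame({
    "Age": [50, 31, np.nan, 21],
    "Income": ["42,000", "38500", "51000", "29000"],
})

print(frame.isna().sum())  # missing values per column
print(frame.dtypes)        # Income is stored as text, not as a number

# Strip the thousands separator, then convert; unparseable values become NaN.
frame["Income"] = pd.to_numeric(
    frame["Income"].str.replace(",", ""), errors="coerce"
)
print(frame.dtypes)        # Income is now numeric
```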

## Data preparation and transformation

In the real world, data often doesn’t come in a neat and tidy format that’s ready for analysis or machine learning. Before we can build reliable models, we usually need to *prepare* and *transform* the data to make it suitable for the task at hand.

This step is crucial—poorly prepared data can lead to misleading results, even if you’re using the most advanced algorithms. Good preparation helps models learn more effectively, make better predictions, and generalise well to new data.

Here are some of the most common transformations applied to numeric data:

- *Rescaling* – Changing the range of values (for example, so everything falls between 0 and 1). This is useful when your data includes features measured on very different scales.
- *Standardising* – Shifting data so it has a mean of 0 and a standard deviation of 1. This helps many algorithms that assume the data is centred and scaled, such as linear models or clustering methods.
- *Normalising* – Adjusting values so that each row (or sample) has the same overall scale, often useful when comparing data with different units or magnitudes.
- *Binarising* – Turning numerical data into 0s and 1s, based on a threshold. This is useful when you want to highlight the presence or absence of a certain condition (e.g. blood pressure above a certain level).

As an example, we use the *Pima Indians Diabetes dataset*—a real-world medical dataset where preparation is especially important. This dataset contains health-related measurements such as blood pressure, glucose levels, and BMI. These features are all measured on different scales, which makes it harder for some models to interpret them fairly without some kind of adjustment.

When we apply these transformations, we ensure that each feature contributes appropriately to the model—rather than allowing one with larger numbers to dominate simply because of its scale.

### Rescale data

Sometimes in a dataset, the different features or attributes are measured on very different scales. For example, one column might show *income* in the tens of thousands, while another shows *age* in single or double digits. If we leave them as they are, some machine learning algorithms may give more importance to the larger numbers—even if they’re not actually more important.

*Rescaling* helps solve this problem by putting all features on a similar scale, usually between 0 and 1. This means that the smallest value in a column becomes 0, the largest becomes 1, and everything else is adjusted to fit in between.

This technique is useful for:
- *Ensuring that all features contribute fairly to a model*  
- *Improving the performance and stability of algorithms like neural networks, k-nearest neighbours, and other models trained with gradient descent*  
- *Speeding up training time by making the data easier for the algorithm to process*

Think of it like putting everything on the same ruler—when all the numbers are in the same range, comparisons between them become more balanced.

For example, in a health dataset with *cholesterol level*, *heart rate*, and *age*, rescaling makes sure that no one measurement overshadows the others simply because it's measured on a larger scale.

In [None]:
from sklearn.preprocessing import MinMaxScaler  # Import MinMaxScaler for feature scaling

# Extract values from the DataFrame
array = data.values

# Separate features (columns 0 to 7) and target (column 8)
X = array[:, 0:8]  # Features
Y = array[:, 8]    # Target

# Initialise the MinMaxScaler to scale features to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit the scaler on the feature data and transform it
rescaledX = scaler.fit_transform(X)

# Display the scaled feature matrix
rescaledX
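
To see what min-max rescaling computes, the sketch below applies the formula $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$ to each column of a small made-up array using NumPy alone. This is the calculation `MinMaxScaler` performs for the default range of 0 to 1. The hand-written version does not guard against a constant column, where the denominator would be zero.

```python
import numpy as np

# Made-up feature matrix: rows are samples, columns are features.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [5.0, 200.0]])

col_min = X_demo.min(axis=0)  # smallest value in each column
col_max = X_demo.max(axis=0)  # largest value in each column

# Min-max rescaling, applied column by column.
rescaled = (X_demo - col_min) / (col_max - col_min)
print(rescaled)
```

In each column the smallest value maps to 0 and the largest to 1, regardless of the original units.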


### Standardise data

*Standardisation* is another way to adjust your data so it's easier for a machine learning model to work with. Instead of changing the range to 0–1 (like rescaling), standardisation shifts the data so that each feature has:
- A *mean* (average) of 0  
- A *standard deviation* of 1 (which measures how spread out the values are)

This doesn’t change the shape of the distribution, but it re-centres the data around zero and gives it a consistent scale. This is especially useful when features are measured in very different units or ranges.

Standardisation is helpful for:
- *Making sure features with large numbers don’t dominate smaller ones*  
- *Helping models that rely on distances or on regularised, gradient-based fitting, such as support vector machines and regularised linear or logistic regression*  
- *Speeding up learning and improving accuracy in many cases*

Imagine you’re analysing housing data. One feature might be *number of bedrooms* (typically between 1 and 5), and another might be *property value* (ranging into the hundreds of thousands). Standardising puts them on a level playing field so the model can treat both features fairly and interpret their influence correctly.

When you bring everything into a consistent form, standardisation makes it easier for your model to find patterns and make more reliable predictions:

In [None]:
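from sklearn.preprocessing import StandardScaler

# Learn each column's mean and standard deviation, then standardise
scaler = StandardScaler().fit(X)
standardisedX = scaler.transform(X)

# Show the first five rows
print(standardisedX[0:5,:])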
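
As a worked check, standardisation can be written by hand as $z = (x - \mu) / \sigma$, computed per column. The sketch below uses a made-up array and NumPy only. It uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Made-up feature matrix: rows are samples, columns are features.
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

mean = X_demo.mean(axis=0)  # per-column mean
std = X_demo.std(axis=0)    # per-column population standard deviation (ddof=0)

standardised = (X_demo - mean) / std

print(standardised.mean(axis=0))  # approximately 0 for each column
print(standardised.std(axis=0))   # approximately 1 for each column
```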

### Normalise data

Normalisation is a transformation that adjusts each row of your dataset so that all its values are scaled relative to one another. In simple terms, it makes sure that each data point (or row) has the same overall size or *length*, usually set to 1. This is also called giving it a *unit norm*.

Unlike rescaling or standardising—which adjust each column—normalisation works across each row, balancing all the values in that row so they can be fairly compared.

This is especially useful when:
- *Your data is sparse* (meaning most of the values are zero or very small)  
- *You're using models that rely on measuring distances between points*, like k-nearest neighbours or clustering algorithms  
- *You want to compare patterns in proportions rather than absolute values*

For example, imagine you're comparing music streaming habits across users. One person might listen to 50 songs a day, while another listens to just 5. But if you normalise the data, you’re not comparing total listening time—you’re comparing *how each person divides their listening across different genres or artists*. This allows you to focus on their preferences rather than how much they listen.

Normalisation is helpful when the *direction* or *pattern* of the values matters more than their total size:

In [None]:
from sklearn.preprocessing import Normalizer

# Scale each row to unit length (L2 norm by default)
scaler = Normalizer().fit(X)
normalisedX = scaler.transform(X)

# Show the first five rows
print(normalisedX[0:5,:])
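
The row-wise nature of normalisation is easiest to see on a tiny example. In the sketch below (made-up listening counts, NumPy only), each row is divided by its own Euclidean (L2) length, which is the default norm used by `Normalizer`. The two users listen to very different amounts, but in the same ratio, so they end up with the same normalised row.

```python
import numpy as np

# Made-up listening counts: rows are users, columns are genres.
X_demo = np.array([[30.0, 40.0],
                   [3.0, 4.0]])

# Euclidean length of each row, kept as a column so the division broadcasts.
row_norms = np.linalg.norm(X_demo, axis=1, keepdims=True)

normalised = X_demo / row_norms
print(normalised)  # both rows become [0.6, 0.8]
```

Note that with the L2 norm the squared values in each row sum to 1; if you want the values themselves to sum to 1 (true proportions), the L1 norm is the one to use.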

### Binarise data

Binarisation is a technique that turns numerical values into simple 0s and 1s based on a set threshold. If a value is *above* the threshold, it becomes 1. If it’s *below* or equal to the threshold, it becomes 0.

This can be useful when you want to highlight the *presence or absence* of something, or turn a continuous variable into a yes/no type format.

Binarisation is helpful for:
- *Simplifying features into clear categories*  
- *Preparing data for models that work better with binary input*  
- *Highlighting specific conditions, like values above a healthy range*

For example, imagine you're working with temperature data. You could set a threshold of 0°C to create a new feature that tells you whether each day was *above freezing* (1) or *at or below freezing* (0). This makes it easy for a model to focus on a specific condition without dealing with the full range of temperatures.

It’s a straightforward way to turn continuous values into something simpler and more focused:

In [None]:
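from sklearn.preprocessing import Binarizer

# With threshold=0.0, every value greater than 0 becomes 1 and zeros stay 0
binariser = Binarizer(threshold=0.0).fit(X)
binaryX = binariser.transform(X)

# Show the first five rows
print(binaryX[0:5,:])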
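
A threshold of `0.0` only separates zero from non-zero values, so a feature-specific threshold is often more informative. The sketch below applies the same rule by hand to made-up blood pressure readings. The cut-off of 90 is purely illustrative and is not clinical guidance.

```python
import numpy as np

# Made-up diastolic blood pressure readings (mm Hg).
pressure = np.array([[72.0], [90.0], [95.0], [64.0]])

threshold = 90.0  # illustrative cut-off only, not clinical guidance

# Strictly greater than the threshold -> 1, otherwise 0.
binary = (pressure > threshold).astype(float)
print(binary.ravel())  # [0. 0. 1. 0.]
```

The reading of exactly 90 maps to 0, because only values strictly above the threshold become 1.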

## Influence of data transformation on ML models

Data transformations can have a big impact on how well machine learning models perform—especially those that rely on patterns, distances, or consistent scales across features.

In this example, we look at how *normalising* and *standardising* the data influences the performance of a basic decision tree model and a K-nearest neighbours model. We compare results *before* and *after* each transformation to see whether it makes any difference to the accuracy of our model.

While decision trees aren’t usually sensitive to feature scales, this test helps illustrate that:
- *Some models (like decision trees)* may not change much in terms of accuracy, but this depends on the data.  
- *Other models (like k-nearest neighbours, logistic regression, or neural networks)* can be heavily affected.  
- Even small differences in transformation can shift model accuracy, speed, or consistency.

Running this kind of test is a quick way to check whether a transformation improves model behaviour—and helps you choose the right preprocessing steps for your task.

In [None]:
# Import necessary libraries
from sklearn.model_selection import KFold, cross_val_score  # For cross-validation (measuring accuracy)
from sklearn.tree import DecisionTreeClassifier              # Decision tree model
from sklearn.preprocessing import Normalizer                 # Normalizer for row-wise normalisation

# Set up 10-fold cross-validation
kfold = KFold(n_splits=10)

# Initialise a decision tree classifier (fixed seed so repeated runs give the same result)
model = DecisionTreeClassifier(random_state=0)

# Evaluate model using original (unscaled) data
results1 = cross_val_score(model, X, Y, cv=kfold)
print("Mean estimated accuracy (original data):", results1.mean())

# Normalise the feature data (L2 norm by default)
scaler = Normalizer().fit(X)
normalisedX = scaler.transform(X)

# Evaluate model using normalised data
results2 = cross_val_score(model, normalisedX, Y, cv=kfold)
print("Mean estimated accuracy (normalised data):", results2.mean())


Since KNN is a distance-based model, it's often very sensitive to the scale of the input data—so you should see a noticeable difference in accuracy after applying a standard scaler:

In [None]:
# Import necessary libraries
from sklearn.model_selection import KFold, cross_val_score      # For cross-validation (for accuracy measures)
from sklearn.neighbors import KNeighborsClassifier              # K-Nearest Neighbors model
from sklearn.preprocessing import StandardScaler                # StandardScaler for feature scaling

# Set up 10-fold cross-validation
kfold = KFold(n_splits=10)

# Initialise a KNN classifier
model = KNeighborsClassifier()

# Evaluate the model using original (unscaled) data
results1 = cross_val_score(model, X, Y, cv=kfold)
print("Mean estimated accuracy (original data):", results1.mean())

# Standardise the feature data (zero mean, unit variance)
scaler = StandardScaler().fit(X)
standardisedX = scaler.transform(X)

# Evaluate the model using standardised data
results2 = cross_val_score(model, standardisedX, Y, cv=kfold)
print("Mean estimated accuracy (standardised data):", results2.mean())
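
One caveat with the comparison above is that the scaler is fitted on the whole dataset before cross-validation, so each test fold has already influenced the scaling statistics. A common alternative is to wrap the scaler and the model in a pipeline, so the scaler is re-fitted on the training part of every fold. The sketch below shows the idea on synthetic data generated with `make_classification`, which is an assumption made here so the example runs on its own. The names `X_demo` and `y_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, so the sketch is self-contained).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Scaler and model travel together: the scaler is fitted on each
# training fold only, then applied to the matching test fold.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=kfold)
print("Mean estimated accuracy (pipeline):", scores.mean())
```

To try this on the notebook's data, you could pass `X` and `Y` in place of `X_demo` and `y_demo`.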


## What have we learnt?

Numeric data is structured and typically easier for machine learning models to handle than text. It forms the basis of most traditional machine learning tasks, such as predicting house prices, detecting fraud, or classifying medical outcomes. 

We started by exploring how to inspect and understand numeric datasets using tools like pandas. This included loading the data, checking the shape of the dataset, and viewing column types with `.info()`. These initial steps help us identify missing values before we consider preprocessing our data.

However, before we can use numeric data effectively, we must understand its structure, clean it, and apply the right transformations. We've seen how important it is to prepare and transform data before applying machine learning algorithms. Even small changes—like adjusting the scale or structure of the data—can make a noticeable difference in how models perform. This is because:

- *Real-world data is rarely ready “as-is”* — it often needs cleaning and transforming to work well with machine learning models.
- *Different transformations serve different purposes* — rescaling, standardising, normalising, and binarising each help in their own way, depending on the model and the data.
- *Some models are sensitive to scale* — algorithms like K-nearest neighbours and logistic regression perform better when features are on a similar scale. Other models, like decision trees, are less affected — but it’s still good practice to test and compare.

In short, data transformation isn't just a technical step—it’s a key part of the modelling process. The better you understand your data and how to prepare it, the more likely your model is to succeed.


## Recommended datasets

Here are some widely used, freely available numeric datasets you might explore for practice:

**Iris**  
- *Description*: Classic dataset of 150 iris flowers with four features each (sepal length, sepal width, petal length, petal width) and three species of iris as the target.  
- *Why it’s popular*: Very small (perfect for quick demos) and well-labelled, making it easy to visualise in 2D or 3D.  
- *Where to get it*: Built into scikit-learn (`sklearn.datasets.load_iris`) or from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Iris).

**Wine**  
- *Description*: Chemical analysis of wines grown in the same region in Italy but from three different cultivars. Each sample has 13 numeric features.  
- *Why it’s popular*: Good example for classification with multiple classes.  
- *Where to get it*: Built into scikit-learn (`sklearn.datasets.load_wine`) or from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Wine).

**Wine Quality**  
- *Description*: Wine samples (red or white) with attributes such as acidity, sulphates, pH, and a quality rating.  
- *Why it’s popular*: Demonstrates regression or classification.  
- *Where to get it*: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Wine+Quality).

**Adult**  
- *Description*: Census data (48,842 instances) used for predicting whether an individual’s income exceeds $50K/year.  
- *Why it’s popular*: Classification with categorical and numeric features, plus data-cleaning challenges.  
- *Where to get it*: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Adult).

**Titanic**  
- *Description*: Passenger data about who survived/perished on the RMS Titanic.  
- *Why it’s popular*: Classic Kaggle competition for beginners, with mixed feature types and missing data.  
- *Where to get it*: [Kaggle Titanic Competition](https://www.kaggle.com/c/titanic).