# Lab 6: Laboratory Notes - Week 7: Preparing datasets

## Introducing scikit-learn

As part of our basic libraries for data science, we are introducing the scikit-learn library.  This library works closely with

* numpy  
* pandas  
* matplotlib  

that we have already introduced earlier. We will use this library to

* pre-process the datasets  
* implement machine learning algorithms  
* apply evaluation metrics  

for the next 3 to 4 weeks. In practice, we use libraries such as scikit-learn because we don’t want to recreate a complex algorithm every time we want to use it. Scikit-learn is a library in Python that provides numerous functions for data pre-processing, unsupervised and supervised learning algorithms, and various evaluation metrics. The functionality that scikit-learn provides includes:

* Preprocessing, including Min-Max Normalization  
* Regression, including Linear and Logistic Regression  
* Classification, including K-Nearest Neighbours  
* Clustering, including K-Means  
* Model selection through evaluation metrics  

Let's start by importing scikit-learn.

<span style="color:red">import sklearn</span>

Let's go into something more specific first.  Let's have a look at normalisation.

<span style="color:red">from sklearn import preprocessing</span>

## What is a "seed"?

Before we proceed, we need to understand the term "seed" in generating numbers.  A seed, like the seed of a plant, is a starting point: it is the number used to initialise a pseudo random number generator.  A computer is a state machine, that is, a machine which can be in one of a set number of stable conditions depending on its previous condition and on the present values of its inputs.  Hence, technically it cannot generate truly random numbers; instead it generates pseudo random numbers from a specific equation.  By default, this equation is typically initialised from the machine clock or some other default starting value, but it can also be initialised by fixing the seed.  This means that if we always use the same seed, the equation will generate the same sequence of pseudo random numbers.

This is important for us to be able to reproduce our results consistently.
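As a minimal sketch of this idea, the snippet below seeds NumPy's global generator twice with the same value and compares the two sequences. The seed value 123 and the variable names are arbitrary choices made for illustration only.

```python
import numpy as np

# Seed the global generator; 123 is an arbitrary illustrative seed
np.random.seed(123)
first_run = np.random.randint(8, size=10)

# Reset the generator to the same seed and draw again
np.random.seed(123)
second_run = np.random.randint(8, size=10)

# Same seed, same sequence of pseudo random numbers
print(first_run)
print(second_run)
print(np.array_equal(first_run, second_run))  # True

# Without re-seeding, the generator simply continues from its current state
third_run = np.random.randint(8, size=10)
print(third_run)
```

Any integer seed would work equally well; what matters for reproducibility is that the same seed is set before the same sequence of calls.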

## Normalisation

We will not go into details here (as it is part of your coursework 2 :-) ) but will give you an idea of what is expected.  Use the normalize() function in scikit-learn, i.e., sklearn.preprocessing.normalize().  We will use this to illustrate the normalisation of a vector (array-like) dataset.

<span style="color:red">from sklearn import preprocessing</span>

The <span style="color:red">normalize()</span> function rescales each vector (each row of the input) so that it has a length of one.  By default, normalize() divides each value by the square root of the sum of squares of all the values in the vector, also known as the Euclidean norm, or just L2.  If we use L1, each value is instead divided by the sum of the absolute values, so that the absolute values add up to 1.0.  Do note that, for non-negative data, the <span style="color:red">normalize()</span> function results in values between 0 and 1, but it is not the same as simply scaling the values to fall between 0 and 1.  Let's normalise a one dimensional numpy array.

<span style="color:red">import numpy as np  
x_array = np.random.randint(8, size=10)</span>

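The above statement creates 10 integers (size=10), each drawn from 0 to 7 (the upper bound, 8, is excluded).  Use the normalize() function on the array to normalise the data along a row; since normalize() expects a two dimensional input, the one dimensional array is wrapped in a list: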

<span style="color:red">normalised_l2 = preprocessing.normalize([x_array])  
print(normalised_l2)</span>

Run the complete example code to demonstrate how to normalise a numpy array using the normalize() function.  Observe the output.  Now, rerun it (including the random number generator) a few times, and observe the output.

#### Exercise 6.1:

Do you get the same output?

The output should show that all the values are now in the range between 0 and 1.  If you try to manually compute it, square each value in the output and then add them together, you should get 1 as a result (or very close to 1 allowing for some rounding).
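To make the manual computation concrete, here is a small worked sketch using a hand-picked vector (3, 4), whose Euclidean length is 5. The vector and the variable names are assumptions chosen so the arithmetic is easy to follow; they are not part of the lab data.

```python
import numpy as np
from sklearn import preprocessing

v = np.array([3.0, 4.0])              # hand-picked vector for illustration

l2_length = np.sqrt(np.sum(v ** 2))   # sqrt(9 + 16) = 5.0
by_hand = v / l2_length               # each value divided by the length

# normalize() expects a 2-D input (one row per vector), hence [v]
by_sklearn = preprocessing.normalize([v])

print(by_hand)
print(by_sklearn)
print(np.allclose(by_hand, by_sklearn[0]))  # True
```

Dividing by the length gives 0.6 and 0.8, and 0.36 + 0.64 = 1, which is the property the exercise below asks you to verify on the random array.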

#### Exercise 6.2:

Using the <span style="color:red">numpy.square()</span> function and the <span style="color:red">numpy.sum()</span> function, check that the sum of the squared values is close to 1.0.

When you rerun the <span style="color:red">np.random.randint()</span> function, you will get a different set of integers. 

#### Exercise 6.3: 

How can you get the same output each time you run it?

Hint: https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html

In the function <span style="color:red">normalize()</span>, you can use a different method, called "<span style="color:red">l1</span>" and it will produce fractions that add up to 1.0.   Execute the following:

<span style="color:red">normalised_l1 = preprocessing.normalize([x_array], norm="l1")  
print(normalised_l1)</span>

You can use the <span style="color:red">numpy.sum()</span> function to add all the elements together.

<span style="color:red">np.sum(normalised_l1)</span>

You should get 1.0 as the result.  For other normalisations under sklearn, you can explore

* sklearn.preprocessing.MinMaxScaler (Min-Max range scaler)  
* sklearn.preprocessing.StandardScaler (Z-Score, using standard deviation)  
* sklearn.preprocessing.FunctionTransformer (Specify log_transform)

You will normally need to understand your data before applying any normalisation.
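The following sketch contrasts normalize() with MinMaxScaler on the same values, to illustrate why a unit-length vector is not the same as a 0 to 1 range scaling. The four values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

values = np.array([2.0, 4.0, 6.0, 8.0])   # made-up values

# normalize() works per row: the whole row becomes a unit-length vector
unit_l2 = normalize(values.reshape(1, -1))[0]

# MinMaxScaler works per column: the minimum maps to 0, the maximum to 1
min_max = MinMaxScaler().fit_transform(values.reshape(-1, 1)).ravel()

print("L2 normalised:", unit_l2)
print("Min-max scaled:", min_max)
print("Sum of squares (L2):", np.sum(unit_l2 ** 2))
```

Note the different input shapes: normalize() treats the values as one row (one vector), whereas the scalers treat them as one column (one attribute observed four times).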

## Training and Testing Dataset

Let's now look at splitting the data.  We will use the titanic dataset for this exercise.

<span style="color:red">import pandas as pd  
titanic = pd.read_csv("titanic.csv")  
titanic.shape  
titanic.describe()</span>

The above tells us (reminds us) of the size of the dataset, which should consist of 891 rows (observations) and 12 columns (attributes).  The describe() function gives us a statistical summary of the dataset (which includes the 5-number summary).  Do note the statistics for the dataset.  Suppose we split the dataset for training and testing manually, e.g., we use about 800 observations for training and 91 observations for testing:

<span style="color:red">train_data = titanic[0:800]  
test_data = titanic[800:]</span>

Once we have the split, do a describe() for the training and testing dataset.

#### Exercise 6.4: 

Are the statistics (just use the 5-number summary) consistent with the original titanic dataset?  Which column would NOT be of interest?

Have a look at the "Fare" column.  Now, let's introduce a built-in function in scikit-learn for this.

<span style="color:red">from sklearn.model_selection import train_test_split</span>

We then ask it to keep 90% for training and 10% for testing by specifying that we want 0.1 of the dataset for testing.

<span style="color:red">train, test = train_test_split(titanic, test_size=0.1)</span>

Do note the Python syntax for unpacking the 2 DataFrames that the function returns.
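The unpacking can be tried on a toy DataFrame without needing the titanic file. In this sketch the column names "id" and "fare" and all values are made up; only the call to train_test_split() mirrors the one above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy DataFrame standing in for the titanic data (values are made up)
toy = pd.DataFrame({
    "id": range(1, 21),
    "fare": [i * 2.5 for i in range(1, 21)],
})

# The function returns a list of two DataFrames; unpacking names each one
train_toy, test_toy = train_test_split(toy, test_size=0.1)

print(train_toy.shape)  # (18, 2)
print(test_toy.shape)   # (2, 2)

# Rows are shuffled before splitting, so the test rows are not simply the last ones
print(test_toy.index.tolist())
```

Because no seed is given here, the rows that land in the test set will generally differ from run to run.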

#### Exercise 6.5: 

Are the train and test datasets more consistent in terms of the 5-number summary for the "Fare" column?  What do you think the function did?

Side note: The <span style="color:red">train_test_split()</span> function can accept more than 1 DataFrame input and usually it is used to also create appropriate training and testing datasets with the respective labels, e.g., 

<span style="color:red">X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)</span>

* "<span style="color:red">X</span>" holds the features and "<span style="color:red">y</span>" holds the corresponding labels; both are split into a training part and a testing part.
* The <span style="color:red">test_size</span> in this case specifies that 30% of the observations are to be kept for testing.
* The <span style="color:red">random_state</span> is the "seed" number to ensure that the analyst can reproduce the data split.
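A minimal sketch of the points above, using made-up arrays in place of real features and labels. The data is constructed so that each label can be recomputed from its feature row, which lets us check that features and labels stay aligned after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 observations with 2 features each; row i is [2i, 2i + 1]
X = np.arange(20).reshape(10, 2)
# Made-up binary labels: y[i] = i % 2
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=40
)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)

# Each label still belongs to its feature row: recompute it from X and compare
print(np.array_equal(y_test, (X_test[:, 0] // 2) % 2))  # True

# The same random_state reproduces exactly the same split
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.30, random_state=40
)
print(np.array_equal(X_test, X_test2))  # True
```

Changing or omitting random_state would generally give a different, but equally valid, split.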

If we have normalised the data, and then we split the training and testing datasets with the respective labels, we are ready for the next step.

#### Exercise 6.6:

Discuss among yourselves: which of the titanic columns (attributes) should be used to build a model to predict whether a titanic passenger survived, and which of the columns (attributes) may require normalisation?

## My code part

#### Introducing scikit-learn

In [1]:
#import sklearn
from sklearn import preprocessing

#### Normalisation

In [5]:
import numpy as np
x_array = np.random.randint(8, size=10)

normalised_l2 = preprocessing.normalize([x_array])
print(normalised_l2)

[[0.60302269 0.30151134 0.30151134 0.         0.10050378 0.40201513
  0.20100756 0.40201513 0.20100756 0.20100756]]


#### Exercise 6.1:

In [6]:
# Print results
print("Original Array:", x_array)
print("L2 Normalized Array:", normalised_l2)

Original Array: [6 3 3 0 1 4 2 4 2 2]
L2 Normalized Array: [[0.60302269 0.30151134 0.30151134 0.         0.10050378 0.40201513
  0.20100756 0.40201513 0.20100756 0.20100756]]


Since np.random.randint() generates different random numbers each time, the output will change every time you execute it.  
The normalization scales the vector so that the sum of the squares of its elements equals 1.

#### Exercise 6.2:

In [21]:
# Generate a random array of integers between 0 and 7
x_array = np.random.randint(8, size=10)

# Normalize using L2 norm
normalised_l2 = preprocessing.normalize([x_array], norm='l2')

# Compute the sum of squares
sum_of_squares = np.sum(np.square(normalised_l2))

# Print results
print("Original Array:", x_array)
print("L2 Normalized Array:", normalised_l2)
print("Sum of Squares:", sum_of_squares)

Original Array: [3 1 3 0 6 0 0 1 5 3]
L2 Normalized Array: [[0.31622777 0.10540926 0.31622777 0.         0.63245553 0.
  0.         0.10540926 0.52704628 0.31622777]]
Sum of Squares: 1.0


#### Exercise 6.3:

In [22]:
# Set seed for reproducibility
np.random.seed(42)

# Generate a random array of integers between 0 and 7
x_array = np.random.randint(8, size=10)

# L2 Normalization
normalised_l2 = preprocessing.normalize([x_array], norm="l2")
sum_of_squares = np.sum(np.square(normalised_l2))

# L1 Normalization
normalised_l1 = preprocessing.normalize([x_array], norm="l1")
sum_of_abs = np.sum(normalised_l1)

# Print results
print("Original Array:", x_array)
print("L2 Normalized Array:", normalised_l2)
print("Sum of Squares (L2 Norm):", sum_of_squares)  # Should be ~1.0

print("L1 Normalized Array:", normalised_l1)
print("Sum of Absolute Values (L1 Norm):", sum_of_abs)  # Should be 1.0

Original Array: [6 3 4 6 2 7 4 4 6 1]
L2 Normalized Array: [[0.40544243 0.20272121 0.27029495 0.40544243 0.13514748 0.47301616
  0.27029495 0.27029495 0.40544243 0.06757374]]
Sum of Squares (L2 Norm): 1.0
L1 Normalized Array: [[0.13953488 0.06976744 0.09302326 0.13953488 0.04651163 0.1627907
  0.09302326 0.09302326 0.13953488 0.02325581]]
Sum of Absolute Values (L1 Norm): 1.0


In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, FunctionTransformer

# Set seed for reproducibility
np.random.seed(42)

# Generate a random array of integers between 0 and 7
x_array = np.random.randint(8, size=(10, 1))  # Reshape needed for scalers

# Min-Max Scaling (scales values between 0 and 1)
minmax_scaler = MinMaxScaler()
x_minmax = minmax_scaler.fit_transform(x_array)

# Standard Scaling (Z-score: mean = 0, std = 1)
standard_scaler = StandardScaler()
x_standard = standard_scaler.fit_transform(x_array)

# Log Transformation (apply natural logarithm)
log_transformer = FunctionTransformer(np.log1p)  # log1p(x) = log(x + 1) to avoid log(0)
x_log = log_transformer.transform(x_array)

# Print results
print("Original Array:\n", x_array.flatten())
print("\nMin-Max Scaled:\n", x_minmax.flatten())
print("\nStandard Scaled (Z-score):\n", x_standard.flatten())
print("\nLog Transformed:\n", x_log.flatten())


Original Array:
 [6 3 4 6 2 7 4 4 6 1]

Min-Max Scaled:
 [0.83333333 0.33333333 0.5        0.83333333 0.16666667 1.
 0.5        0.5        0.83333333 0.        ]

Standard Scaled (Z-score):
 [ 0.92060161 -0.70398947 -0.16245911  0.92060161 -1.24551983  1.46213197
 -0.16245911 -0.16245911  0.92060161 -1.78705019]

Log Transformed:
 [1.94591015 1.38629436 1.60943791 1.94591015 1.09861229 2.07944154
 1.60943791 1.60943791 1.94591015 0.69314718]


#### Training and Testing Dataset

In [24]:
import pandas as pd
titanic = pd.read_csv("data/titanic.csv")
titanic.shape
titanic.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [25]:
# Slice ends are exclusive: [0:800] keeps rows 0 to 799, so no row is dropped
train_data = titanic[0:800]
test_data = titanic[800:]

#### Exercise 6.4:

In [26]:
from sklearn.model_selection import train_test_split

In [27]:
train, test = train_test_split(titanic, test_size=0.1)

In [28]:
print("Original dataset:\n", titanic.describe())
print("\nTraining dataset:\n", train.describe())
print("\nTesting dataset:\n", test.describe())


Original dataset:
        PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  

Training dataset:
        PassengerId    Survived      Pclass

#### Exercise 6.5:

In [29]:
print("Original Fare summary:\n", titanic["Fare"].describe())
print("\nTraining Fare summary:\n", train["Fare"].describe())
print("\nTesting Fare summary:\n", test["Fare"].describe())


Original Fare summary:
 count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

Training Fare summary:
 count    801.000000
mean      31.352309
std       47.802900
min        0.000000
25%        7.895800
50%       14.108300
75%       30.070800
max      512.329200
Name: Fare, dtype: float64

Testing Fare summary:
 count     90.000000
mean      39.786111
std       63.947837
min        0.000000
25%        8.662500
50%       26.000000
75%       51.272925
max      512.329200
Name: Fare, dtype: float64


#### Exercise 6.6:

1) Selecting Relevant Features ("X")  
A. Potentially Useful Features  
Key variables (strongly correlated with survival):  

* Pclass (Passenger class) → First-class passengers had a higher survival rate.  
* Sex (Gender) → Women had a higher chance of survival.  
* Age → Children were prioritized for lifeboats.  
* SibSp (Number of siblings/spouses aboard) → May influence survival chances.  
* Parch (Number of parents/children aboard) → Family members may have helped each other.  
* Fare (Ticket price) → Could be correlated with Pclass (higher fare = first-class).  
* Embarked (Port of embarkation) → Might indicate socioeconomic status and class.  

B. Irrelevant or Less Useful Features  
Features to exclude or with limited predictive power:  

* PassengerId: Just an identifier, no predictive value.  
* Name: Unique to each passenger, difficult to use meaningfully.  
* Ticket: No clear impact on survival.  
* Cabin: Too many missing values, challenging to use without advanced imputation.  

2) Which Features Need Normalization?  
Normalization is generally applied to continuous numerical variables to prevent certain features from dominating others when training the model.  

Features that need normalization (e.g., MinMaxScaler or StandardScaler):  
* Age: Ranges from 0.42 to 80 in this dataset, needs scaling.  
* Fare: Can range from 0 to over 500, has extreme values.  

Features that do not necessarily require normalization:  
* Pclass, SibSp, Parch → These are discrete values, so normalization is not necessary but possible.  

Categorical features that need encoding:  

* Sex (male, female) → Convert to 0 and 1.  
* Embarked (C, Q, S) → Use One-Hot Encoding to avoid introducing an arbitrary order.  

3) Conclusion: Data Preprocessing Pipeline  
Recommended preprocessing before training the model:  

* Encode categorical variables (Sex, Embarked).  
* Impute missing values (Age, Embarked); Fare has a count of 891 in the summary above, so it has no missing values here.  
* Normalize continuous variables (Age, Fare).  
* Select final features (Pclass, Sex, Age, SibSp, Parch, Fare, Embarked).  