# Day 1: Introduction to Machine Learning with Python
## Lab 1-1: Cleaning and Processing Data

Welcome to the very first lab of the Machine Learning workshop! In this lab, we will cover the basics of data cleaning and processing. We'll start with a popular toy dataset for machine learning: the Titanic dataset. Then, we will move on to loading and handling FASTA and FASTQ files, which are commonly used in bioinformatics.

### The Titanic Dataset

The Titanic dataset is a popular practice dataset for machine learning. The dataset contains information about passengers on the Titanic, such as their age, ticket class, name, and crucially whether they survived or not. Generally, this dataset is used to build a model to predict whether a passenger survived or not based on the other information available.

However, many of the columns in the dataset contain missing values, which can cause problems when building a machine learning model. In this lab, we will learn about a few of the different ways to handle missing values in a dataset.

### Scikit-Learn

Scikit-Learn is a popular machine learning library in Python. It provides a wide range of tools for building machine learning models, including tools for data preprocessing, model building, and model evaluation. We will take advantage of Scikit-Learn's tools in this lab to handle missing values in the Titanic dataset. Let's start by loading the data, and inspecting it using Pandas:

In [None]:
import pandas as pd
from sklearn.datasets import fetch_openml

# Load the Titanic dataset (as_frame=True asks for Pandas objects explicitly)
titanic = fetch_openml(name='titanic', version=1, as_frame=True)

# Convert the data to a Pandas DataFrame
df = pd.DataFrame(titanic.data, columns=titanic.feature_names)
# Add survival information to the DataFrame
df['survived'] = titanic.target

In [None]:
# Look at some of the data
df

Take a moment to familiarize yourself with the contents of the dataset. Try to answer some of the following questions. If you are new to Python, work in a group with someone who has familiarity with Python, or ask for help!

1. What percentage of passengers survived?
2. What was the average age of passengers?
3. What was the most common ticket class?

In [None]:
# Your code here

Note: At this stage we are also going to remove the `name`, `ticket`, `cabin`, `boat` and `home.dest` columns, as these contain non-numeric data that is difficult to work with. We will come back to these columns in a later lab.

In [None]:
# Remove non-numeric columns
df = df.drop(columns=['name', 'ticket', 'cabin', 'boat', 'home.dest'])

There is one other thing we need to do before we can move forward. The model we will build requires that all of the data be numeric, but we have a few columns with text in them. The `sex` and `embarked` columns are examples of this. We can convert these columns to numeric values using a technique called "one-hot encoding". This technique converts each unique value in a column to a new column, and assigns a 1 or 0 to each new column depending on whether the original column contained that value. So for example, we can replace the `sex` column with two new columns, `sex_male` and `sex_female`, one of which will be 1 and the other 0 for each row (note: we can actually get away with one fewer column than the number of unique values in the original column, but we'll ignore that for now. Think about why this is the case).

In [None]:
# Perform one-hot encoding on the 'sex' and 'embarked' columns
df = pd.get_dummies(df, columns=['sex', 'embarked'])

# Look at the data again
df
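
To make the encoding concrete, here is a small sketch on a made-up table (the values below are invented for illustration and are not the Titanic data). It also shows what happens when a text column contains a missing entry: by default the row simply gets 0 in every new column, while the `dummy_na=True` option of `pd.get_dummies` adds an explicit indicator column for missing entries.

```python
import pandas as pd

# A made-up table, not the Titanic data: one text column with a missing entry.
toy = pd.DataFrame({
    'age': [22, 38, 26],
    'embarked': ['S', 'C', None],
})

# Default behaviour: one indicator column per observed value.
# The row whose value is missing gets 0 in every indicator column.
encoded = pd.get_dummies(toy, columns=['embarked'], dtype=int)
print(encoded)

# dummy_na=True adds an explicit indicator column for missing entries.
encoded_with_na = pd.get_dummies(toy, columns=['embarked'], dummy_na=True, dtype=int)
print(encoded_with_na)
```

Which of the two behaviours is preferable depends on whether the fact that a value is missing is itself informative, a question we return to below.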

### Handling Missing Values

One of the most common problems in real-world datasets is missing values. Missing values can cause problems when building machine learning models, so it is important to handle them properly. There are several ways to handle missing values, including:

1. Removing rows or columns with missing values
2. "Imputing" missing values by filling them in with a best guess
3. Using a model that treats missing values as a separate category

We are going to look at each of these methods in turn, and see how they affect the performance of a machine learning model.

### Removing Rows with Missing Values

The simplest way to handle missing values is to remove any rows that contain missing values. This is a quick and easy way to handle missing values, but it can also lead to a loss of information. Pandas makes it easy to remove rows with missing values using the `dropna()` method. Let's see how this affects the Titanic dataset:

In [None]:
# Remove rows with missing values
df_dropped = df.dropna()

# Look at the shape of the original and modified data
print(f'Original data shape: {df.shape}')
print(f'Modified data shape: {df_dropped.shape}')

...whoops! It looks like we have lost the vast majority of our dataset with this approach. If we inspect the data further, we can see which columns are responsible by checking the percentage of missing values in each one:

In [None]:
# Report the percentage of missing values in each column
print(df.isnull().mean() * 100)

Based on this we can see that the `body` column is missing for the vast majority of passengers. This column records a body identification number, so it is only filled in for victims whose bodies were recovered; it is not surprising that it is missing for everyone else. Let's see what happens if we remove this column entirely, and then remove rows with missing values:

In [None]:
# Remove the column with a large number of missing values
df_removed = df.drop(columns=['body'])

# Remove rows with missing values
df_removed = df_removed.dropna()

# Look at the shape of the original and modified data
print(f'Original data shape: {df.shape}')
print(f'Modified data shape: {df_removed.shape}')

Okay, that loss of data is a bit easier to work with. We now know that every row in the dataset has complete information. There's an obvious tradeoff here, in that we have to decide whether it is better to have more data with missing values, or less data with complete information. However, there's also a slightly more subtle risk that this approach introduces. Can you think of what it might be? We'll come back to it later.
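
The row-versus-column tradeoff can be seen on a made-up table (invented values, not the Titanic data). This is only a sketch of the three options discussed so far: dropping rows, dropping columns, and dropping one sparse column before dropping rows.

```python
import numpy as np
import pandas as pd

# A made-up table: 'sparse' is mostly missing, 'age' has a single gap.
toy = pd.DataFrame({
    'age':    [22.0, np.nan, 26.0, 35.0],
    'fare':   [7.25, 71.28, 7.92, 53.10],
    'sparse': [np.nan, np.nan, 135.0, np.nan],
})

rows_only = toy.dropna()                               # drop every row with any gap
cols_only = toy.dropna(axis=1)                         # drop every column with any gap
col_then_rows = toy.drop(columns=['sparse']).dropna()  # drop the sparse column first

print(rows_only.shape)
print(cols_only.shape)
print(col_then_rows.shape)
```

Dropping the single sparse column first keeps three of the four rows, whereas dropping rows straight away keeps only one.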

For now, let's build a very simple machine learning model to predict whether a passenger survived or not. We'll use the `RandomForestClassifier` model from Scikit-Learn, which is a popular model for classification tasks. We'll start by splitting the data into features and labels, and then splitting the data into training and testing sets:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split the data into features and labels
X = df_removed.drop(columns='survived')
y = df_removed['survived']

In [None]:
# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
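
The split above is shuffled at random, so results differ from run to run. If you want a repeatable split, one option is to pass `random_state` to `train_test_split`, and `stratify` to keep the class ratio the same in both parts. The sketch below uses made-up data (`X_toy` and `y_toy` are illustrative names, not part of the lab):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up features and labels (10 rows, 5 per class), not the Titanic data.
X_toy = pd.DataFrame({'feature': range(10)})
y_toy = pd.Series(['0', '1'] * 5)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

X_test_a, y_test_a = split_a[1], split_a[3]
X_test_b = split_b[1]

print(list(X_test_a.index) == list(X_test_b.index))  # same rows both times
print(sorted(y_test_a))                              # one label from each class
```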

In [None]:
# Train a Random Forest model
model = RandomForestClassifier()

# Fit the model to the training data
model.fit(X_train, y_train)

# Evaluate the model on the testing data
accuracy = model.score(X_test, y_test)

print(f'Model accuracy: {accuracy:.4f}')

Because the split is random, your numbers will vary a little from run to run, but you should see an accuracy of around 80%. This is not bad! It is also worth breaking this down by whether the passenger survived or not, as this gives a better idea of where the model makes its mistakes:

In [None]:
print(f'Model accuracy for passengers who did not survive: {model.score(X_test[y_test == "0"], y_test[y_test == "0"]):.4f}')
print(f'Model accuracy for passengers who survived:        {model.score(X_test[y_test == "1"], y_test[y_test == "1"]):.4f}')

As we go through the workshop, we will discuss further how the choice of performance measure can change our view of a model. For now, consider how a gap between the two per-class accuracies might matter depending on what the model is being used for. What could be a situation where this gap would be particularly important?
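
The accuracy restricted to one true class is usually called the recall for that class. As a sketch, the same breakdown can be obtained with `confusion_matrix` and `recall_score` from `sklearn.metrics`; the labels and predictions below are made up for illustration and use the same `'0'`/`'1'` strings as the lab.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels and predictions, using the same '0'/'1' strings as the lab.
y_true = ['0', '0', '0', '1', '1']
y_pred = ['0', '0', '1', '1', '0']

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred, labels=['0', '1'])
print(matrix)

# Recall per class: the fraction of each true class that was predicted correctly.
per_class = recall_score(y_true, y_pred, labels=['0', '1'], average=None)
print(dict(zip(['0', '1'], per_class)))
```

Here two of the three true `'0'` rows and one of the two true `'1'` rows are predicted correctly, even though the overall accuracy is a single number (3 out of 5).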

### Imputing Missing Values

Another way to handle missing values is to "impute" them, which means filling them in with a best guess. There are many ways to impute missing values, but one common way is to fill them in with the mean or median of the column. This is slightly more sophisticated than just removing rows with missing values, as it allows us to keep more data. Essentially, what we are doing is building a very simple model to predict the missing values based on the data we do have.

The simplest type of imputation is to fill in missing values with the mean of the column. This is easy to do with Pandas:

In [None]:
# Make a copy of our dataframe for imputation
df_imputed = df.copy()

In [None]:
for column in df_imputed.columns:
    n_missing = df_imputed[column].isnull().sum()
    if n_missing > 0:
        print(f'Imputing missing values for "{column}", which has {n_missing} missing value(s)')
        # mean() skips missing values by default
        mean = df_imputed[column].mean()
        print(f'  Mean value: {mean:.2f}')
        df_imputed[column] = df_imputed[column].fillna(mean)

An obvious flaw here is that we are just using the same mean for all missing values in a column. This is a very simple approach, and there are many more sophisticated ways to impute missing values. However, this is a good starting point. Let's see how this affects the performance of our model:

In [None]:
# Split the data into features and labels
X = df_imputed.drop(columns='survived')
y = df_imputed['survived']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train a Random Forest model
model = RandomForestClassifier()

# Fit the model to the training data
model.fit(X_train, y_train)

# Evaluate the model on the testing data
accuracy = model.score(X_test, y_test)

print(f'Model accuracy: {accuracy:.4f}')

In [None]:
print(f'Model accuracy for passengers who did not survive: {model.score(X_test[y_test == "0"], y_test[y_test == "0"]):.4f}')
print(f'Model accuracy for passengers who survived:        {model.score(X_test[y_test == "1"], y_test[y_test == "1"]):.4f}')

Compare the performance of the model with imputed missing values to the performance of the model with missing values removed. What do you notice? What are the tradeoffs between these two approaches?
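
One refinement worth knowing about when making this comparison: the loop above computes each mean over all rows before the train/test split, so the test rows influence the fill values. A common alternative, sketched here on made-up data rather than the Titanic table, is to learn the fill value from the training rows only using Scikit-Learn's `SimpleImputer` and then reuse it for the test rows. The median strategy is an assumption chosen for illustration; `strategy='mean'` would mirror the loop above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-feature data with gaps, already split into train and test.
X_train_toy = np.array([[20.0], [30.0], [np.nan], [40.0]])
X_test_toy = np.array([[np.nan], [50.0]])

# Learn the fill value (here the median) from the training rows only...
imputer = SimpleImputer(strategy='median')
X_train_filled = imputer.fit_transform(X_train_toy)

# ...then reuse that same value for the test rows.
X_test_filled = imputer.transform(X_test_toy)

print(imputer.statistics_)    # fill value learned from the training data
print(X_test_filled.ravel())
```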

### Using a Model to Handle Missing Values

Another way to handle missing values is to use a model that can work with them directly. As we alluded to earlier, there can be a significant cost to removing rows with missing values: not only are we losing information, but often there is an underlying reason why the data is missing. For example, in the Titanic dataset, the `body` column is empty for every passenger who survived and for every victim whose body was never recovered. This is not a random process (a recorded body number tells us the passenger died), and removing these rows could introduce bias into our model.

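Some Scikit-Learn models can handle missing values directly, without any dropping or imputing beforehand. One such model is `HistGradientBoostingClassifier`. Rather than creating a literal extra category, it learns at each split whether rows with a missing value should follow the left or the right branch, which has much the same effect. (Whether `RandomForestClassifier` accepts missing values depends on your Scikit-Learn version; older releases raise an error.) Let's see how this model performs on the Titanic dataset: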

In [None]:
# Split the data into features and labels
X = df.drop(columns='survived')
y = df['survived']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
from sklearn.ensemble import HistGradientBoostingClassifier

# Train a Gradient Boosting model
model = HistGradientBoostingClassifier()

# Fit the model to the training data
model.fit(X_train, y_train)

# Evaluate the model on the testing data
accuracy = model.score(X_test, y_test)

print(f'Model accuracy: {accuracy:.4f}')

In [None]:
print(f'Model accuracy for passengers who did not survive: {model.score(X_test[y_test == "0"], y_test[y_test == "0"]):.4f}')
print(f'Model accuracy for passengers who survived:        {model.score(X_test[y_test == "1"], y_test[y_test == "1"]):.4f}')

How does the performance of the `HistGradientBoostingClassifier` model compare to the `RandomForestClassifier` model? What are the tradeoffs between these two approaches?

### Deciding on the Best Approach

There is no one-size-fits-all approach to handling missing values in a dataset. The best approach depends on the dataset, the problem you are trying to solve, and the model you are using. In general, it is a good idea to try multiple approaches and see which one works best for your particular problem. Some of the things that you might want to consider when deciding on an approach include:

- The amount of missing data in the dataset: if there are only a handful of missing values, it might be best to just remove them. If there are a large number of missing values, it might be better to impute them.
- The underlying reason for the missing data: could there be a systemic reason why the data is missing? If so, removing the missing values could introduce bias into the model.
- The model you are using: some models can handle missing values directly, while others cannot. It is a good idea to choose a model that can handle missing values if you have a large amount of missing data.
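
The second consideration can be checked empirically. One simple sketch, using a made-up table with hypothetical column names `outcome` and `measurement`, is to compare the fraction of missing values across the groups you are trying to predict:

```python
import numpy as np
import pandas as pd

# Made-up table: 'measurement' is missing more often in one outcome group.
toy = pd.DataFrame({
    'outcome':     ['0', '0', '0', '0', '1', '1', '1', '1'],
    'measurement': [1.2, np.nan, np.nan, np.nan, 0.7, 0.9, np.nan, 1.1],
})

# Fraction of missing values within each outcome group.
missing_rate = toy['measurement'].isnull().groupby(toy['outcome']).mean()
print(missing_rate)
```

A large gap between groups suggests the data is not missing at random, in which case dropping rows or filling in a single mean may distort the picture.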

## FASTA and FASTQ Files

FASTA and FASTQ files are commonly used in bioinformatics to store biological sequences. FASTA files store nucleotide or protein sequences in a simple text format, while FASTQ files store sequencing reads along with a quality score for each base in the read. In this section, we will look at how to load and handle FASTA and FASTQ files in Python.

### Loading FASTA Files

FASTA files store sequences in a simple text format. Each sequence is represented by a header line starting with a `>` character, followed by one or more lines containing the sequence itself. We can use the `Biopython` library to load and handle FASTA files in Python. Let's start by loading a FASTA file and looking at the sequences it contains. First we need to install the library:

In [None]:
!pip install -U biopython

In [None]:
from Bio import SeqIO

# Load a FASTA file
fasta_file = 'NM_001323632.2.fasta'
records = list(SeqIO.parse(fasta_file, 'fasta'))

In [None]:
# Look at the first record
record = records[0]
print(f'ID: {record.id}')
print(f'Description: {record.description}')
print(f'Sequence: {record.seq}')

Once we have loaded the FASTA file, each entry in `records` is a `SeqRecord` object. The `SeqRecord` object has several attributes, including `id`, `description`, and `seq`, which contain the ID of the sequence, a description of the sequence, and the sequence itself, respectively. We can do some simple processing on the sequences, such as calculating the length of the sequence, or counting the number of each base in the sequence:

In [None]:
# Calculate the length of the sequence
length = len(record.seq)

# Count the number of each base in the sequence
counts = {base: record.seq.count(base) for base in 'ACGT'}

In [None]:
print(f'Sequence length: {length}')
print(f'Base counts: {counts}')
# Calculate the GC content of the sequence
gc_content = (counts['G'] + counts['C']) / length
print(f'GC content: {gc_content:.2f}')
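
To illustrate the file format itself, here is a minimal pure-Python reader. It is only a sketch: `read_fasta` is a hypothetical helper written for this example, the two records are made up, and for real work `SeqIO.parse` is the better choice because it handles many more edge cases.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []  # start a new record
        else:
            chunks.append(line)            # a sequence may span several lines
    if header is not None:
        yield header, ''.join(chunks)


# Made-up two-record example, not taken from the lab's FASTA file.
example = """>seq1 first toy record
ACGTAC
GGTA
>seq2 second toy record
TTGCA
"""

for header, sequence in read_fasta(example.splitlines()):
    gc = (sequence.count('G') + sequence.count('C')) / len(sequence)
    print(header, len(sequence), f'{gc:.2f}')
```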

### Loading FASTQ Files

FASTQ files are similar to FASTA files, but they also contain a quality score for each base in the sequence. Each quality score encodes the estimated probability that the corresponding base was called incorrectly, with higher scores meaning more confident calls. We can again use the `Biopython` library to load and handle FASTQ files in Python. Let's start by loading a FASTQ file and looking at the sequences and quality scores it contains:

In [None]:
fastq_file = 'SRR000129.fastq'
records = list(SeqIO.parse(fastq_file, 'fastq'))

In [None]:
record = records[0]
print(f'ID: {record.id}')
print(f'Description: {record.description}')
print(f'Sequence: {record.seq}')
print(f'Quality scores: {record.letter_annotations["phred_quality"]}')

In [None]:
# Calculate the average quality score for the sequence
average_quality = sum(record.letter_annotations['phred_quality']) / len(record.letter_annotations['phred_quality'])
print(f'Average quality score: {average_quality:.2f}')
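
The scores above are Phred quality scores, where a score Q corresponds to an error probability of 10^(-Q/10). The sketch below shows that conversion, along with how a raw quality string is decoded under the Phred+33 ASCII offset (the Sanger-style encoding that Biopython's `'fastq'` format name assumes). Both helper functions and the quality string are made up for illustration.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming the Phred+33 ASCII offset."""
    return [ord(char) - 33 for char in quality_string]


# Made-up quality string, not taken from the lab's FASTQ file.
scores = decode_phred33('!5?I')
probs = [phred_to_error_prob(q) for q in scores]
print(scores)
print(probs)

# The mean error probability is dominated by the worst bases,
# so it can differ a lot from the probability implied by the mean score.
mean_prob = sum(probs) / len(probs)
prob_of_mean_score = phred_to_error_prob(sum(scores) / len(scores))
print(mean_prob, prob_of_mean_score)
```

Because the scale is logarithmic, a plain average of quality scores tends to look more optimistic than the average error probability, which is worth keeping in mind when summarizing a read with a single number.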

In [None]:
# Calculate the GC content of the sequence
length = len(record.seq)
counts = {base: record.seq.count(base) for base in 'ACGT'}
gc_content = (counts['G'] + counts['C']) / length
print(f'GC content: {gc_content:.2f}')

In [None]:
# Calculate the average quality score for each type of base (A, C, G, T),
# skipping bases that never occur in the read to avoid dividing by zero
qualities = record.letter_annotations['phred_quality']
average_quality = {base: sum(q for b, q in zip(record.seq, qualities) if b == base) / counts[base]
                   for base in 'ACGT' if counts[base] > 0}
print(f'Average quality scores: {average_quality}')

### Conclusion

In this lab, we have looked at some of the different conceptual approaches in machine learning to handling missing data. We have also looked at how to load and handle FASTA and FASTQ files in Python. In the next lab, we will look in more detail at building machine learning models using Scikit-Learn.