
# **Introduction: The Data Revolution in Biology**
- By Mugume Twinamatsiko Atwine

Biological research has traditionally relied on well-established experimental techniques and manual data analysis. However, with the advent of high-throughput technologies—such as next-generation sequencing, advanced imaging systems, and environmental sensors—the field of biology now generates data at an unprecedented scale and complexity. This explosion of data presents both an opportunity and a challenge:

- **Opportunity:**  
  Data-driven insights can lead to groundbreaking discoveries—from identifying novel genetic markers for diseases to unraveling complex ecological interactions.

- **Challenge:**  
  Without the right tools and methodologies, extracting meaningful information from vast datasets can be daunting. This is where data science, machine learning, and big data analytics come into play.

  ![image.png](attachment:91ab59e6-d9b2-420c-8edf-a0556612fe12.png)

---

## **Why Is This Course Important?**

**1. Accelerating Discoveries in Genomics and Precision Medicine:**

- **Real-World Use Case:**  
  *[The Cancer Genome Atlas (TCGA)](https://www.cancer.gov/ccg/research/genome-sequencing/tcga)* is a landmark project that molecularly characterized thousands of cancer samples. Computational analyses of its data, including machine learning, have identified molecular subtypes of cancer, paving the way for personalized treatment strategies that help clinicians choose a suitable therapy for each patient.

  >> The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between NCI and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions.

  >> Over the next dozen years, TCGA generated over 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data. The data, which has already led to improvements in our ability to diagnose, treat, and prevent cancer, will remain publicly available for anyone in the research community to use.

    ![image.png](attachment:ac0101ba-6d7b-459c-9252-c93510b4234c.png)
  
- **Impact:**  
  With the ability to analyze and interpret large-scale genomic data, researchers can uncover subtle patterns that might indicate disease risk or treatment efficacy.

---

**2. Transforming Medical Imaging and Diagnostics:**

- **Real-World Use Case:**  
  Deep learning models have been developed to detect conditions such as diabetic retinopathy from retinal images. In several studies, these algorithms have achieved accuracy levels comparable to, or even exceeding, those of expert ophthalmologists.
  
- **Impact:**  
  Machine learning can improve diagnostic accuracy, reduce human error, and enable early detection of diseases, which is crucial for successful treatment.

---

**3. Enhancing Environmental Biology and Ecology:**

- **Real-World Use Case:**  
  Big data analytics are revolutionizing how ecologists monitor and predict environmental changes. For example, analyzing vast datasets on species distributions and migratory patterns helps predict the impacts of climate change on biodiversity.
  
- **Impact:**  
  This allows for more informed conservation strategies and policy decisions that can mitigate the effects of environmental changes.

---

**4. Accelerating Drug Discovery and Bioinformatics:**

- **Real-World Use Case:**  
  Machine learning models are being used to analyze protein structures and gene expression data to identify potential drug targets. This approach can significantly speed up the drug development process, reducing both time and cost.
  
- **Impact:**  
  By leveraging computational techniques, researchers can more efficiently sift through vast datasets, pinpointing key insights that drive innovation in therapeutic development.

![WhatsApp Image 2025-02-22 at 13.25.42_a40e300b.jpg](attachment:6d32a99a-f1ea-4311-997c-f2aefe97f73e.jpg)

---

## **Course Objectives**

This introductory course is designed specifically for those with a biological background who may not have formal training in computational methods. By the end of the session, you will:

- **Gain Hands-On Experience:**  
  Learn to load, clean, and visualize data using Python libraries like Pandas, Matplotlib, and Seaborn.

- **Understand Core Machine Learning Concepts:**  
  Apply simple classification techniques using scikit-learn to see how predictive models can be built and interpreted in a biological context.

- **Explore Big Data Tools:**  
  Get a glimpse into how distributed computing frameworks like Apache Spark can handle massive datasets typical in modern biology.

- **Bridge Theory with Practice:**  
  Work through real-world case studies that illustrate the transformative power of data science in genomics, diagnostics, ecology, and drug discovery.

---

### **Today's Lesson**

Today, we’ll not only explore the fundamental tools and techniques of data science and machine learning but also see how these methods are applied to solve real biological problems. Whether you're interested in understanding disease mechanisms, improving diagnostics, or contributing to environmental conservation, the skills you gain in this course will empower you to leverage data for innovative research.


# **Understanding Data: The Foundation of Data Science**

Before we jump into data preparation, it’s important to understand what data is, the types of data you might encounter, its varying sizes, and the challenges related to accessing and integrating data. These concepts are critical in any data-driven field, including biology.

---

## **1. Definition of Data**

- **Data** is a collection of facts, figures, measurements, or observations that can be processed, analyzed, and used to draw conclusions.  
- In biology, data may come from experiments, observations, high-throughput technologies (like sequencing or imaging), and simulations.

---

## **2. Different Types of Data**

Understanding the nature of your data helps determine how best to handle and analyze it:

- **Structured Data:**  
  - **Definition:** Highly organized data that fits neatly into tables (rows and columns).  
  - **Examples:** Gene expression matrices, clinical trial results, and patient records in a database.

- **Unstructured Data:**  
  - **Definition:** Data that does not have a predefined structure.  
  - **Examples:** DNA sequences (text), medical images, and free-form lab notes.

- **Semi-Structured Data:**  
  - **Definition:** Data that has some organizational properties, but not as rigid as structured data.  
  - **Examples:** JSON or XML files containing experimental metadata.

- **Why do the types matter?** Each type calls for different methods, so identifying the type of data you have is the first step toward applying the right ones.

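
To make the three categories concrete, here is a minimal sketch in Python. The gene names, sample IDs, metadata fields, and DNA string are made-up values used only for illustration.

```python
import json
import pandas as pd

# Structured: a tiny, hypothetical gene expression table (rows = samples, columns = genes)
expression = pd.DataFrame(
    {"gene_A": [2.1, 0.4], "gene_B": [5.3, 4.8]},
    index=["sample_1", "sample_2"],
)
print(expression.mean())  # column-wise operations work directly on tables

# Semi-structured: hypothetical experimental metadata stored as JSON
metadata_text = '{"sample": "sample_1", "instrument": {"type": "sequencer", "run": 7}}'
metadata = json.loads(metadata_text)
print(metadata["instrument"]["run"])  # nested fields are reached by key

# Unstructured: a raw DNA string has no columns or keys, so features must be computed
sequence = "ATGCGCGATATTAGC"
gc_content = (sequence.count("G") + sequence.count("C")) / len(sequence)
print(round(gc_content, 2))
```

The table can be summarized immediately, the JSON needs to be parsed and navigated, and the raw sequence only becomes analyzable once a feature such as GC content is derived from it.
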
---

## **3. Different Sizes of Data**

Data size matters, especially when processing and analyzing information. Here’s a quick comparison:

- **Megabytes (MB):**  
  - **Usage:** Small datasets such as basic spreadsheets or short text files.
  
- **Gigabytes (GB):**  
  - **Usage:** Moderate-sized datasets like detailed images or medium-sized experimental datasets.

- **Terabytes (TB):**  
  - **Usage:** Large-scale datasets, such as high-resolution imaging data or extensive sequencing data.

- **Petabytes (PB):**  
  - **Usage:** Massive datasets typical in big data environments—think population-scale genomics or large-scale environmental sensor data.
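
To get a feel for these units, the sketch below measures how much memory a small dataset occupies and converts byte counts into readable units. The helper `human_readable_size` is a hypothetical function written for this example, and it assumes binary units (1 KB = 1024 bytes); decimal units (1 KB = 1000 bytes) are also in common use.

```python
import pandas as pd
from sklearn.datasets import load_iris

def human_readable_size(num_bytes):
    """Convert a byte count to a readable string (assumes 1 KB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} PB"

iris = load_iris()
df_size = pd.DataFrame(iris.data, columns=iris.feature_names)

# Memory actually used by the DataFrame, including its index
n_bytes = int(df_size.memory_usage(deep=True).sum())
print("Iris features in memory:", human_readable_size(n_bytes))

# The same helper applied to the 2.5 petabytes reported for TCGA
print("TCGA:", human_readable_size(2.5 * 1024**5))
```
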

---

## **4. Data Access and Silos**

- **Data Access:**  
  - **Definition:** The methods and tools used to retrieve data from storage systems.  
  - **In Practice:** This could involve querying databases, using APIs, or directly accessing files on a server.

- **Data Silos:**  
  - **Definition:** Isolated data repositories where information is stored separately, making integration and holistic analysis difficult.  
  - **Challenge:** In multidisciplinary fields like biology, data silos can hinder the ability to combine datasets (e.g., clinical data with genomic data) for comprehensive analysis.  
  - **Solution:** Efforts to integrate data from various sources are essential to unlock deeper insights. Expect this to be one of the harder parts of working with data: the organizations that hold it are complex, access is controlled through formal gateways, and data protection laws restrict what can be shared.
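
Once access has been granted, combining two silos often comes down to joining tables on a shared identifier. The sketch below is one way to do that with pandas; the two tables, the `patient_id` key, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical "silo" 1: clinical records
clinical = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3"],
    "diagnosis": ["type_A", "type_B", "type_A"],
})

# Hypothetical "silo" 2: genomic results, kept in a separate system
genomic = pd.DataFrame({
    "patient_id": ["P2", "P3", "P4"],
    "variant_count": [12, 7, 30],
})

# An outer join keeps every patient; the indicator column records each row's origin
combined = pd.merge(clinical, genomic, on="patient_id", how="outer", indicator=True)
print(combined)

# Only patients present in both sources are ready for joint analysis
complete = combined[combined["_merge"] == "both"]
print(len(complete), "of", len(combined), "patients have both data types")
```

The `_merge` column makes the cost of the silo visible: any row marked `left_only` or `right_only` is a patient whose record is incomplete in the combined view.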

![image.png](attachment:8a453f37-bbe0-471b-9a4d-da178ffadae0.png)

---

# **Transitioning to Data Preparation**

Now that we’ve established what data is, the different types and sizes you might encounter, and the challenges of data access, let’s move on to the first critical step in any data science workflow: **Data Preparation**. This stage is where you:

1. **Acquire and Load Data:**  
   Convert raw data into a workable format (e.g., using Pandas to create DataFrames).

2. **Explore the Data:**  
   Understand its structure, types, and summary statistics.

3. **Clean the Data:**  
   Handle missing values, remove or correct outliers, and ensure the data types are appropriate.

4. **Engineer and Transform Features:**  
   Scale, normalize, or encode data as needed to prepare for analysis.

5. **Split the Data:**  
   Divide it into training and testing sets for modeling and evaluation.

Below is an example of these steps using the Iris dataset. Although this dataset is clean and well-structured, the same principles apply to more complex biological datasets.


### Step 1: Data Acquisition & Loading

In [6]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the Iris dataset from scikit-learn
from sklearn.datasets import load_iris
iris = load_iris()

# Convert the dataset to a Pandas DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Display the first few rows
print("First 5 rows of the dataset:")
df.head()

First 5 rows of the dataset:


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [7]:
# Check the dimensions of the data as (rows, columns)
# Everything you have learnt in NumPy and pandas applies here
df.shape

(150, 5)

In [8]:
# example, we can create a new column aggregating the columns above
# It can be anything, let the students give ideas on this.
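
One possible idea, offered only as an illustration: approximate a sepal and petal "area" as length times width. The column names `sepal area (cm^2)` and `petal area (cm^2)` are made up for this sketch, and the product is a rough rectangle approximation rather than a botanical measurement.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_new = pd.DataFrame(iris.data, columns=iris.feature_names)

# New columns computed element-wise from existing ones
df_new["sepal area (cm^2)"] = df_new["sepal length (cm)"] * df_new["sepal width (cm)"]
df_new["petal area (cm^2)"] = df_new["petal length (cm)"] * df_new["petal width (cm)"]

print(df_new[["sepal area (cm^2)", "petal area (cm^2)"]].head())
```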

In [9]:
# we can check for the metadata of the data loaded
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int32  
dtypes: float64(4), int32(1)
memory usage: 5.4 KB


In [10]:
# we can even check the statistical summaries on the go
df.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |


In [11]:
# Check for missing data: this counts the columns that contain at least one missing value
df.isnull().any().sum()

0

In [12]:
# Introduction to the ydata-profiling API (formerly pandas-profiling)
# It runs all of the checks above in one go, which helps when you want to reach the analysis quickly
# !pip install ydata-profiling

In [13]:
# We can generate a report with a single line of code
# The report summarizes many aspects of the data, such as data integrity, distributions, and missing values
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Profiling Report")

In [14]:
# we can learn various aspects about the data from the report generated below.
profile




### Step 2: Data Cleaning

Even though the Iris dataset is clean, here’s how you might handle missing values:

In [28]:
# Simulate a missing value for demonstration purposes
df_missing = df.copy()
df_missing.loc[0, 'sepal length (cm)'] = np.nan

In [32]:
# Number of columns that now contain at least one missing value
df_missing.isna().any().sum()

1
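
Note that `isna().any().sum()` counts columns with at least one missing value, not the missing values themselves. A minimal sketch of the difference, using a small made-up DataFrame (`df_demo` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [4.0, 5.0, 6.0],
})

cols_with_missing = df_demo.isna().any().sum()  # columns containing any NaN
total_missing = df_demo.isna().sum().sum()      # individual NaN cells
per_column = df_demo.isna().sum()               # NaN count for each column

print(cols_with_missing, total_missing)
print(per_column)
```

Here one column is affected but two cells are missing, so the per-column count is usually the more informative starting point when deciding how to clean.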

In [34]:
df_missing.head()

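| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | NaN | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |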


In [38]:
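# Fill missing values with the column mean and assign the result back
# (fillna returns a new Series; without the assignment df_missing stays unchanged)
df_missing['sepal length (cm)'] = df_missing['sepal length (cm)'].fillna(df_missing['sepal length (cm)'].mean())

print("\nAfter Handling Missing Data:")
df_missing.head()

# This is a simple example; real datasets often call for more careful imputation strategies.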


After Handling Missing Data:


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.848322 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
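
Cleaning also covers outliers and data types, which the missing-value example does not touch. The sketch below shows one common approach, assuming the 1.5 × IQR rule as the outlier threshold (a convention, not the only choice) and using illustrative names such as `df_clean`. Whether to drop or clip flagged values depends on the biology, since an extreme measurement may be real.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_clean = pd.DataFrame(iris.data, columns=iris.feature_names)
df_clean["target"] = iris.target

# Flag outliers in one column with the 1.5 * IQR rule
col = "sepal width (cm)"
q1, q3 = df_clean[col].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df_clean[col] < lower) | (df_clean[col] > upper)
print("Flagged rows:", int(is_outlier.sum()))

# Option A: drop the flagged rows; Option B: clip them to the bounds
df_dropped = df_clean[~is_outlier]
df_clipped = df_clean.assign(**{col: df_clean[col].clip(lower, upper)})

# Store the class label as a categorical rather than a plain integer
df_clean["target"] = df_clean["target"].astype("category")
print(df_clean.dtypes)
```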


### **Why Feature Engineering and Transformations Are Important**

1. **Improving Model Performance:**  
   - Raw data might not be in the best form for a learning algorithm. Transforming features (e.g., scaling or encoding) can help models converge faster and produce more accurate results.

2. **Handling Different Data Types and Scales:**  
   - In biological datasets, you may have features measured on very different scales (e.g., gene expression levels, counts, or categorical labels). Normalizing or scaling these features ensures that one feature doesn't dominate the learning process.

3. **Revealing Hidden Relationships:**  
   - Creating new features (for example, by combining existing ones) can expose relationships that are not immediately obvious in the raw data.

4. **Reducing Noise and Redundancy:**  
   - Transformations like dimensionality reduction (e.g., PCA) help in removing noise and redundant features, leading to a more robust model.

5. **Handling Nonlinear Relationships:**  
   - Some relationships in the data might be nonlinear. Transformations like polynomial features can capture these interactions more effectively.
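
Point 4 mentions PCA. A minimal sketch of it on the Iris measurements follows, assuming scikit-learn's `PCA` and two components (an illustrative choice that makes the result easy to plot).

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# PCA is sensitive to scale, so standardize the features first
X_std = StandardScaler().fit_transform(iris.data)

# Project the four measurements onto two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)
print(pca.explained_variance_ratio_)
```

`explained_variance_ratio_` reports the fraction of the total variance captured by each component, which is a quick way to judge how much information the reduced representation keeps.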

---

In [42]:

# 1. Scaling and Normalization
# StandardScaler standardizes each feature to zero mean and unit variance. This is especially important when features are measured in different units.
# Example: Scaling with StandardScaler

from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris dataset and create a DataFrame
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Initialize and apply StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[iris.feature_names])

# Create a DataFrame with scaled features
df_scaled = pd.DataFrame(scaled_features, columns=iris.feature_names)
print("First 5 rows after scaling:")
df_scaled.head()


# Why it’s important:  
# Scaling helps models (especially those based on distance metrics like k-NN or algorithms that use gradient descent) perform better and converge faster.

First 5 rows after scaling:


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | -0.900681 | 1.019004 | -1.340227 | -1.315444 |
| 1 | -1.143017 | -0.131979 | -1.340227 | -1.315444 |
| 2 | -1.385353 | 0.328414 | -1.397064 | -1.315444 |
| 3 | -1.506521 | 0.098217 | -1.283389 | -1.315444 |
| 4 | -1.021849 | 1.249201 | -1.340227 | -1.315444 |
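
To see where these numbers come from, the sketch below recomputes the scaling by hand as z = (x - mean) / std and compares it with `StandardScaler`. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; `df.describe()` reports the sample standard deviation (`ddof=1`), so its `std` row differs slightly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Manual z-score: subtract the column mean, divide by the column standard deviation.
# ddof=0 (population standard deviation) matches StandardScaler.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

X_sklearn = StandardScaler().fit_transform(X)
print(np.allclose(X_manual, X_sklearn))

# After scaling, each column has mean ~0 and standard deviation ~1
print(X_sklearn.mean(axis=0).round(6), X_sklearn.std(axis=0).round(6))
```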


In [44]:

# 2. Encoding Categorical Variables
# Many machine learning models require numerical input. Categorical features (e.g., labels or classes) must be transformed into a numeric format.
# Example: One-Hot Encoding

# Example DataFrame with a categorical feature
data = {
    'gene_status': ['overexpressed', 'underexpressed', 'normal', 'overexpressed']
}
df_cat = pd.DataFrame(data)

# One-hot encode the 'gene_status' column
df_encoded = pd.get_dummies(df_cat, columns=['gene_status'])
print("One-hot encoded DataFrame:")
df_encoded

# Why it’s important:
# One-hot encoding allows models to understand categorical data without implying any ordinal relationship between the categories.


One-hot encoded DataFrame:


| | gene_status_normal | gene_status_overexpressed | gene_status_underexpressed |
|---|---|---|---|
| 0 | False | True | False |
| 1 | False | False | True |
| 2 | True | False | False |
| 3 | False | True | False |


In [None]:
# we can encode the whole dataframe if we want to
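
A minimal sketch of encoding a whole DataFrame at once, assuming a small made-up table (`df_mixed`, its column names, and its values are hypothetical). When `columns` is not given, `pd.get_dummies` encodes the text and categorical columns and leaves numeric columns as they are.

```python
import pandas as pd

# Hypothetical mixed-type table: one numeric column, two categorical columns
df_mixed = pd.DataFrame({
    "expression_level": [8.2, 1.1, 4.5, 9.0],
    "gene_status": ["overexpressed", "underexpressed", "normal", "overexpressed"],
    "tissue": ["liver", "lung", "liver", "lung"],
})

# With no `columns` argument, every text/categorical column is encoded and
# numeric columns pass through unchanged. dtype=int gives 0/1 instead of True/False.
df_mixed_encoded = pd.get_dummies(df_mixed, dtype=int)
print(df_mixed_encoded.columns.tolist())
print(df_mixed_encoded)
```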

In [48]:
# 3. Polynomial Features
# Polynomial features are created by taking combinations of the existing features, allowing the model to capture nonlinear relationships.
# Example: Generating Polynomial Features

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Sample DataFrame with two features
X_sample = pd.DataFrame({
    'feature1': [1, 2, 3, 4],
    'feature2': [5, 6, 7, 8]
})

# Initialize PolynomialFeatures (degree=2 creates interaction and squared terms)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_sample)

# Convert to DataFrame for better readability
poly_feature_names = poly.get_feature_names_out(X_sample.columns)
df_poly = pd.DataFrame(X_poly, columns=poly_feature_names)
print("Polynomial features DataFrame:")
print(df_poly)


# Why it’s important:
# Polynomial features can help a linear model learn more complex, nonlinear relationships, which might be present in biological data.


Polynomial features DataFrame:
   feature1  feature2  feature1^2  feature1 feature2  feature2^2
0       1.0       5.0         1.0                5.0        25.0
1       2.0       6.0         4.0               12.0        36.0
2       3.0       7.0         9.0               21.0        49.0
3       4.0       8.0        16.0               32.0        64.0
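
Step 5 of the workflow listed earlier, splitting the data, can be sketched in the same style. This assumes scikit-learn's `train_test_split`; the 80/20 ratio, the fixed `random_state`, and the stratified split are illustrative choices rather than requirements.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 20% of the samples for testing; stratify keeps the class proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

Any scaler or encoder should be fitted on the training portion only and then applied to the test portion, so that information from the test set does not leak into the model.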
