# University of Guelph MBINF 2023 - Introduction to Python Workshop

Hello Guelph MBINF Program 2023! Today we will introduce Python for bioinformatics. This workshop is designed to gently introduce Python programming concepts used in bioinformatics, then apply them to do some real-world science with biological data. 

---

## Part 2. Introduction to Pandas for Data Science

- Introduction to the Pandas DataFrame
- Creating DataFrames
- Loading DataFrames
- Manipulating DataFrames
- Basic plotting with Pandas and Seaborn

---


### Introduction to the Pandas DataFrame

NOTE: This section requires fundamental knowledge of Python. Please review Part 1 before attempting.

Pandas is a popular Python library widely used in data science for data manipulation, analysis, and exploration. It provides powerful tools for working with structured data, primarily in the form of two main data structures: Series and DataFrames. In this introduction, I'll focus on DataFrames and their components.

<div style="background-color: black; padding: 15px;">

> "Pandas" is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. — Wikipedia

</div>

**DataFrames in Pandas:**

A DataFrame is a 2-dimensional labeled data structure with columns that can hold different data types. It is similar to a spreadsheet or a SQL table. DataFrames are incredibly versatile and are used to store, manipulate, and analyze tabular data.

**Components of a DataFrame:**
- **Columns:** Columns represent the variables or attributes of the dataset. Each column can hold a specific type of data, such as numbers, strings, or dates.
- **Rows (Index):** Rows represent individual observations or records in the dataset. Each row is identified by an index, which can be an integer or a label.
- **Index:** The index holds the label for each row (by default the integers 0, 1, 2, ...). It lets you look up rows by label, and rows can also be accessed by position. Labels are usually unique, although Pandas does not enforce this.
- **Values:** The actual data values are stored in a 2-dimensional array-like structure. Each column contains data of **the same type**.
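
To make these components concrete, here is a small sketch that inspects each one. The DataFrame `example` and its values are made up purely for illustration.

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
import pandas as pd

# A small, made-up DataFrame used only for illustration
example = pd.DataFrame({
    'Gene': ['BRCA1', 'TP53', 'EGFR'],
    'Count': [120, 340, 95],
    'Score': [0.5, 1.2, 0.8],
})

print(example.columns.tolist())  # column names
print(example.index.tolist())    # row labels (0, 1, 2 by default)
print(example.dtypes)            # one data type per column
print(example.shape)             # (number of rows, number of columns)
```
</div>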

---



### Creating DataFrames

You can create a DataFrame from various data sources, including dictionaries, lists, NumPy arrays, CSV files, and more. Here's a simple example using a dictionary:

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 28],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)
```
</div>
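
A dictionary of columns is not the only option. As a minimal sketch, the same table can be built from a list of row records (one dictionary per row). The names `records` and `df_from_records` are illustrative.

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
import pandas as pd

# Each dictionary describes one row; its keys become the column names
records = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 28, 'City': 'Chicago'},
]

df_from_records = pd.DataFrame(records)
print(df_from_records)
```
</div>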


**Accessing Data in a DataFrame:**

You can access data in a DataFrame using various methods:

Column Selection:
You can select a specific column by putting its name in square brackets after the DataFrame:

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
ages = df['Age']
```
</div>
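
Selecting a single name returns a Series, while passing a list of names returns a smaller DataFrame. A short sketch, rebuilding the example DataFrame so the cell runs on its own (the variable `name_and_city` is illustrative):

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 28],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

ages = df['Age']                      # one column name -> Series
name_and_city = df[['Name', 'City']]  # list of column names -> DataFrame

print(type(ages).__name__)            # Series
print(type(name_and_city).__name__)   # DataFrame
```
</div>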

Row Selection:
You can select rows using the `.loc[]` and `.iloc[]` indexers. `.loc[]` selects by index label, while `.iloc[]` selects by integer position:

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
row = df.loc[0]  # Select the first row using the label-based index
row = df.iloc[1]  # Select the second row using the position-based index
```
</div>
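
With the default index, labels and positions are the same numbers, so the difference is easy to miss. The sketch below uses hypothetical sample IDs as row labels to show where the two indexers diverge.

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
import pandas as pd

# Hypothetical sample IDs used as row labels instead of 0, 1, 2
samples = pd.DataFrame(
    {'Age': [25, 30, 28]},
    index=['S1', 'S2', 'S3']
)

by_label = samples.loc['S2']    # the row whose label is 'S2'
by_position = samples.iloc[1]   # the second row, counting from 0

# Both lookups refer to the same row
print(by_label['Age'], by_position['Age'])
```
</div>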

**Exercises:**

<div style="background-color: black; padding: 15px;">

Exercise 1: Analyzing Student Data

In this exercise, you'll work with student data using a Pandas DataFrame. You'll create a DataFrame with student information and then perform various data manipulation and analysis tasks.

1. Import the Pandas library and create a DataFrame containing the following information for at least 6 students:
- Name (as strings)
- Age (as integers)
- Grade (as floats)
- City (as strings)

You can use a dictionary to define the data.

2. Print the entire DataFrame to see the data.

3. Print the first few rows of the DataFrame using the `.head()` method.

4. Access and print the names of all the students.

5. Calculate and print the average age of the students.

6. Filter the DataFrame to show only students who are older than 20.

7. Create a new column in the DataFrame called "Status." For each student, if their grade is greater than or equal to 70, set the status as "Pass," otherwise set it as "Fail."
</div>

Here's a starting point for the exercise that you can copy and paste into a new Code cell:

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
import pandas as pd

# Step 1: Create the DataFrame
data = {
    'Name': ['Alice', ...],
    'Age': [25, ...],
    'Grade': [85.5, ...],
    'City': ['Guelph', ...]
}


# Step 2: Print the entire DataFrame


# Step 3: Print the first few rows


# Step 4: Access and print names


# Step 5: Calculate and print average age

# Step 6: Filter students older than 20

# Step 7: Add a "Status" column

```
</div>

This exercise will give you hands-on experience with creating a Pandas DataFrame, accessing its data, and performing various data manipulation tasks.

---

### Loading DataFrames

Loading data into Pandas DataFrames from CSV (comma-separated values) files is a fundamental task in data analysis using Python. Here's a short and simple summary for new Python users:

**Loading DataFrames from CSV Files in Pandas**

To load data from a CSV file into a Pandas DataFrame, you can follow these steps:

Import Pandas: Begin by importing the Pandas library, typically at the beginning of your script or notebook. By convention, Pandas is almost always aliased to `pd` using the `as` keyword.

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
import pandas as pd
```
</div>

Load CSV Data: Use the `pd.read_csv()` function to load data from a CSV file into a DataFrame. Provide the path to your CSV file as an argument.

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
data = pd.read_csv('path_to_your_file.csv')
```
</div>

Exploring Data: You can now explore your data using various DataFrame methods. For example, you can use `data.head()` to display the first few rows of the DataFrame and get a sense of its structure.

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
print(data.head())
```
</div>
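
Beyond `head()`, a few other attributes and methods are commonly used for a first look. This sketch uses a small in-memory DataFrame with made-up column names and values in place of a loaded file.

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
import pandas as pd

# Made-up data standing in for a loaded CSV file
data = pd.DataFrame({
    'patient_id': [1, 2, 3, 4],
    'age': [34, 51, 29, 46],
    'treatment': ['A', 'B', 'A', 'B'],
})

print(data.shape)       # (rows, columns)
print(data.dtypes)      # data type of each column
print(data.describe())  # summary statistics for the numeric columns
data.info()             # column names, non-null counts, memory usage
```
</div>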

The `read_csv()` function has many optional parameters to handle various scenarios, like specifying the delimiter, handling missing values, skipping rows, and more. You can refer to the [Pandas documentation](https://pandas.pydata.org/docs/user_guide/index.html) for a complete list of options.
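
As a hedged sketch of a few of these options, the example below reads semicolon-delimited text from an in-memory string rather than a real file. The contents of `raw_text` are invented for illustration; with a real file you would pass its path or URL instead.

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
import io
import pandas as pd

# Stand-in for a file on disk: one comment line, ';' separators, 'NA' for missing
raw_text = """# comment line to skip
sample;gene;count
S1;geneA;10
S2;geneB;NA
"""

table = pd.read_csv(
    io.StringIO(raw_text),  # a file path or URL would normally go here
    sep=';',                # column delimiter
    skiprows=1,             # skip the first line of the file
    na_values=['NA'],       # strings to treat as missing values
)
print(table)
```
</div>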


**Exercises:**

<div style="background-color: black; padding: 15px;">

> Exercise 2: Loading a CSV file. I have prepared two CSV files for you in the GitHub repo for this workshop, located at the following URLs:
- Gene Expression Data: https://raw.githubusercontent.com/davidlevybooth/MBINF-Introduction-to-Python/main/data/mbinf_gene_expression.csv
- Patient Data: https://raw.githubusercontent.com/davidlevybooth/MBINF-Introduction-to-Python/main/data/mbinf_patient_info.csv

> Load both CSV files into DataFrames in a new code cell.

</div>

---

### Manipulating DataFrames

**1. Using `pd.merge` for Data Frame Joining**

The `pd.merge()` function combines two DataFrames based on a common column or index. It's similar to a SQL JOIN operation. Here's a basic example:

Suppose you have two DataFrames, `df1` and `df2`, with the following data:

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
import pandas as pd

# Create DataFrame df1
df1 = pd.DataFrame({'ID': [1, 2, 3],
                    'Name': ['Alice', 'Bob', 'Charlie']})

# Create DataFrame df2
df2 = pd.DataFrame({'ID': [2, 3, 4],
                    'Age': [25, 30, 28]})
```

</div>

Now, you can merge these DataFrames using the common column 'ID':

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
merged_df = pd.merge(df1, df2, on='ID')

print(merged_df)
```

</div>

Output:
```
   ID     Name  Age
0   2      Bob   25
1   3  Charlie   30
```

In this example, the `pd.merge()` function combines the two DataFrames based on the 'ID' column. By default it performs an inner join (`how='inner'`), so only rows whose 'ID' appears in both DataFrames are kept: IDs 1 and 4 are dropped.
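
Rows without a match can be kept by changing the `how` argument. The sketch below rebuilds `df1` and `df2` so it runs on its own; `left_df` and `outer_df` are illustrative names.

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3],
                    'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4],
                    'Age': [25, 30, 28]})

# Keep every row of df1; Age is NaN where df2 has no matching ID
left_df = pd.merge(df1, df2, on='ID', how='left')

# Keep rows from both; the _merge column records where each row came from
outer_df = pd.merge(df1, df2, on='ID', how='outer', indicator=True)

print(left_df)
print(outer_df)
```
</div>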


**2. Melting DataFrames**

Melting a DataFrame means transforming it from a wide format (where multiple columns represent different variables) to a long format (with a single column representing variable names and another column representing corresponding values). The `pd.melt()` function is used for this purpose.

Consider a DataFrame `wide_df` as follows:

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
# Create a wide DataFrame
wide_df = pd.DataFrame({'ID': [1, 2, 3],
                        'Math': [85, 90, 78],
                        'Science': [70, 88, 92]})
```
</div>

You can melt this wide DataFrame to long format using `pd.melt()`:

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
melted_df = pd.melt(wide_df, id_vars=['ID'], var_name='Subject', value_name='Score')

print(melted_df)
```

</div>

Output:
```
   ID  Subject  Score
0   1     Math     85
1   2     Math     90
2   3     Math     78
3   1  Science     70
4   2  Science     88
5   3  Science     92
```

In this example, the columns 'Math' and 'Science' in the original DataFrame are "melted" into a single 'Subject' column, and their corresponding values are placed in the 'Score' column.
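
If only some columns should be melted, `pd.melt()` also accepts a `value_vars` argument. A minimal sketch, rebuilding `wide_df` so it runs on its own (`math_only` is an illustrative name):

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
import pandas as pd

wide_df = pd.DataFrame({'ID': [1, 2, 3],
                        'Math': [85, 90, 78],
                        'Science': [70, 88, 92]})

# value_vars limits which columns are melted; here only 'Math'
math_only = pd.melt(wide_df, id_vars=['ID'], value_vars=['Math'],
                    var_name='Subject', value_name='Score')

print(math_only)
```
</div>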

<br>

**3. Filtering a DataFrame by a Column's Value**

Filtering a DataFrame allows you to extract rows that meet certain conditions. You can filter a DataFrame based on the values in a specific column using boolean indexing. To filter rows based on a column's value, you can create a boolean condition that evaluates to `True` or `False` for each row. Use this condition to index the DataFrame.

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
data = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                     'Age': [25, 30, 28]})

filtered_data = data[data['Age'] > 28]

print(filtered_data)
```
</div>

Output:
```
  Name  Age
1  Bob   30
```

By understanding and applying filtering techniques, you can extract relevant subsets of data from your DataFrame for further analysis.
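
Conditions can also be combined, or tested against a list of allowed values. The sketch below adds a made-up 'City' column to the example data; `young_or_chicago` and `coastal` are illustrative names.

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
import pandas as pd

data = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                     'Age': [25, 30, 28],
                     'City': ['New York', 'Los Angeles', 'Chicago']})

# Combine conditions with & (and) or | (or); wrap each one in parentheses
young_or_chicago = data[(data['Age'] < 28) | (data['City'] == 'Chicago')]

# isin() keeps rows whose value appears in the given list
coastal = data[data['City'].isin(['New York', 'Los Angeles'])]

print(young_or_chicago)
print(coastal)
```
</div>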

**Exercises:**

<div style="background-color: black; padding: 15px;">

Exercise 3. Using the `pd.merge()` function, join the gene expression and patient data dataframes that you loaded in the previous exercise. What is the common key on which to join? 

Exercise 4. Using the `pd.melt()` function, melt the gene expression dataframe so that all genes can be found in a single column called "gene" and all counts are in a separate column called "counts". 

Exercise 5.
- Return a dataframe of all patients under 25. 
- Return a dataframe of only patients from North American cities. 

</div>

---

### Basic plotting with Pandas and Seaborn

Seaborn is a powerful data visualization library in Python that works in conjunction with Matplotlib. It provides a high-level interface for creating informative and attractive statistical graphics. Here's an introduction to creating scatterplots, histograms, and bar plots using Seaborn:

**Introduction to Plotting with Seaborn**

1. **Import Seaborn and Matplotlib**: Start by importing the necessary libraries.

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

   ```python
   import seaborn as sns
   import matplotlib.pyplot as plt
   ```
</div>

2. **Load Data**: For the examples, let's assume you have a dataset named `data`:

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

   ```python
   import pandas as pd

   data = pd.DataFrame({'X': [1, 2, 3, 4, 5],
                        'Y': [3, 5, 4, 7, 6],
                        'Category': ['A', 'B', 'A', 'C', 'B']})
   ```

</div>

3. **Scatterplot with Seaborn**:

   A scatterplot is used to visualize the relationship between two numeric variables. Seaborn makes it easy to create visually appealing scatterplots:

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

   ```python
   sns.scatterplot(x='X', y='Y', data=data, hue='Category')
   plt.title('Scatterplot')
   plt.show()
   ```
</div>

   This code will create a scatterplot where 'X' is plotted on the x-axis, 'Y' on the y-axis, and points are colored based on the 'Category' column.

4. **Histogram with Seaborn**:

   A histogram displays the distribution of a single numeric variable. Seaborn simplifies creating histograms:

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

   ```python
   sns.histplot(data['Y'], bins=10, kde=True)
   plt.title('Histogram')
   plt.show()
   ```
</div>

   This code generates a histogram of the 'Y' values with 10 bins and a Kernel Density Estimate (KDE) curve.

5. **Bar Plot with Seaborn**:

   A bar plot is used to compare categories or groups. Seaborn's bar plot function provides an elegant way to create them:

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

   ```python
   # errorbar='sd' needs seaborn >= 0.12; older versions use ci='sd' instead
   sns.barplot(x='Category', y='Y', data=data, errorbar='sd')
   plt.title('Bar Plot')
   plt.show()
   ```
</div>

   This code creates a bar plot where 'Category' is on the x-axis and the height of each bar is the mean of 'Y' for that category. The error bars show the standard deviation ('sd') of 'Y' within each category rather than a confidence interval.

These are just basic examples of what you can achieve with Seaborn. The library offers a wide range of customization options to fine-tune your visualizations, including controlling colors, styles, labels, and more. As you become more comfortable with Seaborn, you can explore its documentation for more advanced techniques and features.
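
As one small example of such customization, the sketch below sets axis labels and a title through the Matplotlib Axes object that Seaborn's plotting functions return. The label text is arbitrary, and the example data is rebuilt so the cell runs on its own.

<div style="background-color: rgb(50, 50, 50); padding: 15px;">

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.DataFrame({'X': [1, 2, 3, 4, 5],
                     'Y': [3, 5, 4, 7, 6],
                     'Category': ['A', 'B', 'A', 'C', 'B']})

# Axes-level Seaborn functions return a Matplotlib Axes object
ax = sns.scatterplot(data=data, x='X', y='Y', hue='Category')

# Labels and title are set through that Axes object
ax.set(xlabel='X value', ylabel='Y value',
       title='Scatterplot with custom labels')
plt.show()
```
</div>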

**Exercises:**

<div style="background-color: black; padding: 15px;">

Exercise 6. Using the joined dataframe from the previous exercise, use a seaborn plot to determine which treatment led to the strongest change in gene expression. 

Exercise 7. Use a seaborn plot to explore the effect of age on the response of gene expression to treatment. 

Exercise 8. Determine if any other patient variables influence the response in gene expression to treatment. 

</div>

---