# Data Wrangling

In this notebook, we will walk through the initial steps of the data science process that make up data wrangling:

<div>
<img src="https://github.com/biosustain/data_club/raw/main/figures/data_science_process.png" width="900"/>
</div>

The steps we will go through are:

#### - Data collection

#### - Data cleaning

#### - Data transformation

#### - Data annotation

#### - Data validation


As a project dataset, we will use [Xia et al. 2022](https://www.nature.com/articles/s41467-022-30513-2): **Proteome allocations change linearly with the specific growth rate of *Saccharomyces cerevisiae* under glucose limitation**

<div>
<img src="https://github.com/biosustain/data_club/raw/main/figures/xia_et_al_2022.png" width="900"/>
</div>



Specifically, we will use the absolute proteome and transcriptome datasets:

<div>
<img src="https://github.com/biosustain/data_club/raw/main/figures/xia_datasets.png" width="500"/>
</div>


## Collecting Data

[Collecting data notebook](CollectingData.ipynb)

In [None]:
import os
import pandas as pd

In [None]:
article_url = "https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-022-30513-2/MediaObjects/" 
proteome_file = "41467_2022_30513_MOESM4_ESM.xlsx"
transcriptome_file = "41467_2022_30513_MOESM6_ESM.xlsx"

### Transcriptome data

<div>
<img src="https://github.com/biosustain/data_club/raw/main/figures/xia_trans_dataset.png" width="300"/>
</div>

In [None]:
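# article_url already ends with "/", so plain concatenation builds a valid URL
# (os.path.join would insert a backslash on Windows)
transcriptome_data = pd.read_excel(article_url + transcriptome_file, header=0)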

### Proteome data

<div>
<img src="https://github.com/biosustain/data_club/raw/main/figures/xia_prot_dataset.png" width="300"/>
</div>

In [None]:
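# article_url already ends with "/", so plain concatenation builds a valid URL
# (os.path.join would insert a backslash on Windows)
proteome_data = pd.read_excel(article_url + proteome_file, header=0)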

## Data Cleaning

### Transcriptome

**Print the first few rows of the dataset**

**Print the last few rows of the dataset**

**Have a look at the shape and format of the dataframe**

**Check for missing values**
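
The four inspection steps above can be sketched on a small stand-in table. The `toy` dataframe, its column names and its values are made up for illustration; in the notebook the same calls would be applied to `transcriptome_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for transcriptome_data; names and values are hypothetical.
toy = pd.DataFrame(
    {
        "gene": ["g1", "g2", "g3", "g4", "g5", "g6"],
        "sample_1": [10.0, 12.5, np.nan, 8.1, 9.9, 11.2],
        "sample_2": [9.5, np.nan, np.nan, 7.8, 10.4, 11.0],
    }
)

first_rows = toy.head(3)                 # first three rows (default is five)
last_rows = toy.tail(3)                  # last three rows
n_rows, n_cols = toy.shape               # (number of rows, number of columns)
column_types = toy.dtypes                # data type of each column
missing_per_column = toy.isnull().sum()  # count of missing values per column

print(first_rows)
print(last_rows)
print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
```

`toy.info()` prints the shape, the column types and the non-null counts in one call, which can be a convenient alternative.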

### Proteome

**Print the first few rows of the dataset**

**Print the last few rows of the dataset**

**Have a look at the shape and format of the dataframe**

**Check for missing values**

**Handling missing values**


- _Missing at Random (MAR)_

MAR missing values mostly result from technical limitations and stochastic fluctuations, and occur independently of abundance.

- _Missing Not at Random (MNAR)_

MNAR missing values are abundance-dependent and can be explained by the measurability of the corresponding peptides.

(source: [A comparative study of evaluating missing value imputation methods in label-free proteomics](https://www.nature.com/articles/s41598-021-81279-4))
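
One rough way to get a feeling for which of the two cases applies is to compare the observed abundance of rows that contain gaps with that of complete rows. The sketch below uses a made-up table (`toy`, `rep_1` to `rep_3` are hypothetical names), and the comparison is only a heuristic, not a statistical test.

```python
import numpy as np
import pandas as pd

# Hypothetical abundances for five proteins across three replicates.
toy = pd.DataFrame(
    {
        "rep_1": [20.0, 18.0, 3.0, np.nan, 19.0],
        "rep_2": [21.0, 17.5, np.nan, 2.5, 18.5],
        "rep_3": [19.5, 18.5, 2.8, np.nan, 19.5],
    },
    index=["p1", "p2", "p3", "p4", "p5"],
)

has_missing = toy.isnull().any(axis=1)  # True for rows with at least one gap
row_mean = toy.mean(axis=1)             # mean of observed values (NaN is skipped)

mean_with_missing = row_mean[has_missing].mean()
mean_complete = row_mean[~has_missing].mean()

print(f"rows with gaps: {mean_with_missing:.2f}, complete rows: {mean_complete:.2f}")
```

If rows with gaps have clearly lower observed abundance than complete rows, as in this toy table, the missingness may be abundance-dependent (MNAR). Similar means would be more compatible with MAR.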

Some options:

1. Drop rows or columns that have a missing value
```python
# drop column where there is a missing value
df.dropna(axis=1)
```

2. Fill with a constant value
```python
# use 0 to fill the gap
df.fillna(value=0)
```

3. Fill with an aggregated value (e.g., max, min, mean, median)
```python
# use the mean of the column to fill the gap
df['column1'].fillna(df['column1'].mean())
```

4. Replace with the previous (ffill) or next value (bfill)
```python
# use the next valid observation to fill the gap
# (fillna(method='bfill') is deprecated in recent pandas versions)
df.bfill()
```

5. Fill the missing values by linear interpolation
```python
# interpolate the missing values linearly, filling in the forward direction
df.interpolate(method='linear', limit_direction='forward')
```
```
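
To see how the options above differ, they can be applied side by side to a small made-up dataframe. The values are hypothetical and chosen so that each result is easy to check by eye; none of the calls modifies `df` in place.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column1 has two gaps, column2 is complete.
df = pd.DataFrame(
    {
        "column1": [1.0, np.nan, 3.0, np.nan, 5.0],
        "column2": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)

dropped_rows = df.dropna(axis=0)      # 1. keeps only rows 0, 2 and 4
dropped_cols = df.dropna(axis=1)      # 1. keeps only column2
filled_zero = df.fillna(value=0)      # 2. gaps become 0
filled_mean = df["column1"].fillna(df["column1"].mean())  # 3. gaps become 3.0
filled_bfill = df.bfill()             # 4. gaps take the next valid value
interpolated = df.interpolate(method="linear", limit_direction="forward")  # 5.

print(filled_mean.tolist())
print(filled_bfill["column1"].tolist())
print(interpolated["column1"].tolist())
```

Which option is appropriate depends on why the values are missing. Filling with the mean or interpolating assumes the gaps are unrelated to abundance, which may not hold for MNAR values.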

### Transcriptome

In [None]:
transcriptome_data_complete = transcriptome_data.dropna(inplace=False)  # Remove rows with missing data

In [None]:
transcriptome_data_complete.isnull().sum()

### Proteome

In [None]:
proteome_data.head()

**Filtering**

Relevant column: `Majority protein IDs`
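
A minimal sketch of what filtering on the `Majority protein IDs` column could look like. It assumes MaxQuant-style conventions that are not confirmed for this dataset: several identifiers joined by `;` in one cell, and the prefixes `CON__` and `REV__` marking contaminants and reverse hits. The `toy` table, the `intensity` column, the new `protein_id` column and all identifiers are made up.

```python
import pandas as pd

# Hypothetical table; only the column name "Majority protein IDs" comes from the dataset.
toy = pd.DataFrame(
    {
        "Majority protein IDs": ["P00001;P00002", "CON__P00003", "REV__P00004", "P00005"],
        "intensity": [1.0e6, 2.0e6, 3.0e5, 4.0e6],
    }
)

ids = toy["Majority protein IDs"]

# Assumption: contaminants and reverse hits carry these prefixes.
is_flagged = ids.str.startswith("CON__") | ids.str.startswith("REV__")
filtered = toy.loc[~is_flagged].copy()

# Keep the first identifier of each group as a single protein ID.
filtered["protein_id"] = filtered["Majority protein IDs"].str.split(";").str[0]

print(filtered)
```

Keeping only the first identifier is a simplification. Whether it is acceptable depends on how the protein groups were assembled.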

### Complete dataframes

## Data Transformation

## Data Annotation


Data annotation means adding metadata or labels to data to make it easier to understand and work with. This is a crucial step in data science applications, as it helps to identify patterns, classify data, make predictions or extend analyses.


Here, we will annotate the protein identifiers with extra information from [UniProt](https://www.uniprot.org/). UniProt is a comprehensive and freely accessible resource of protein sequence and functional information. It has an [API (Application Programming Interface)](https://en.wikipedia.org/wiki/API) that allows [programmatic access](https://www.uniprot.org/help/programmatic_access) to this information.

Example:

```python
import requests

# Query the EBI Proteins API for a single UniProt accession
requestURL = "https://www.ebi.ac.uk/proteins/api/proteins/P19097"

r = requests.get(requestURL, headers={"Accept": "application/json"})

# Raise an exception if the request failed (4xx or 5xx status)
r.raise_for_status()

responseBody = r.text
print(responseBody)
```
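
The response is JSON text, so it can be parsed into a dictionary (for example with `r.json()`) and reduced to the few fields needed for annotation. The sketch below does not call the API. It works on a hand-written, heavily trimmed `sample_entry` with placeholder values, and the field names used (`accession`, `protein`, `recommendedName`, `fullName`, `gene`, `name`, `value`) are an assumption about the response layout that should be checked against a real response. `extract_annotation` is a hypothetical helper.

```python
def extract_annotation(entry):
    """Pull a few annotation fields out of one protein entry (a parsed JSON dict).

    Missing fields yield None instead of raising an error.
    """
    recommended = entry.get("protein", {}).get("recommendedName", {})
    protein_name = recommended.get("fullName", {}).get("value")

    genes = entry.get("gene", [])
    gene_name = genes[0].get("name", {}).get("value") if genes else None

    return {
        "accession": entry.get("accession"),
        "protein_name": protein_name,
        "gene_name": gene_name,
    }


# Placeholder entry, NOT real UniProt data; the layout is an assumption.
sample_entry = {
    "accession": "P00000",
    "protein": {"recommendedName": {"fullName": {"value": "Example protein"}}},
    "gene": [{"name": {"value": "EXA1"}}],
}

annotation = extract_annotation(sample_entry)
print(annotation)
```

Calling the helper once per identifier and collecting the resulting dictionaries in a list gives rows that can be turned into a dataframe and merged with the proteome table.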

## Data Validation

The final step involves verifying that the data has been transformed correctly and is ready for analysis. This step includes checking that the data is accurate, complete, and consistent.
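
A few of these checks can be automated. The sketch below is one possible set of checks, not a complete validation: `validate_abundance_table` is a hypothetical helper, and the column names and values in the two toy tables are made up.

```python
import numpy as np
import pandas as pd


def validate_abundance_table(df, id_column):
    """Return a list of problems found in the table (empty list means none found)."""
    problems = []
    if df.isnull().any().any():
        problems.append("missing values present")        # completeness
    if df[id_column].duplicated().any():
        problems.append("duplicate identifiers")         # consistency
    numeric = df.select_dtypes(include="number")
    if (numeric < 0).any().any():
        problems.append("negative abundances")           # plausibility
    return problems


clean = pd.DataFrame({"protein_id": ["p1", "p2", "p3"], "abundance": [1.0, 2.5, 0.3]})
dirty = pd.DataFrame({"protein_id": ["p1", "p1", "p3"], "abundance": [1.0, np.nan, -2.0]})

print(validate_abundance_table(clean, "protein_id"))
print(validate_abundance_table(dirty, "protein_id"))
```

Returning a list of problems rather than raising on the first failure makes it possible to see every issue in one pass.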