# **METK Barley SNP-Chip:** Exploring the correlation between barley’s genetic makeup and its protein content
## Importing and editing the necessary datasets

In [12]:
import pandas as pd
snip_data = pd.read_csv("SNIP_DATA.csv")
barley_data = pd.read_csv("BARLEY_DATA.csv")

In [13]:
# Giving the first column in "SNIP_DATA.csv" a header since it was originally unnamed
snip_data.rename(columns={snip_data.columns[0]: 'SNP'}, inplace=True)

# Dropping ID column for "BARLEY_DATA.csv"
barley_data = barley_data.drop(columns='Id')

## Cleaning dataset 'SNIP_DATA.csv' based on the following criteria: 
### 1. Handling 'failed' values by replacing them with NaN

In [14]:
snip_data.replace('failed', pd.NA, inplace=True)

### 2. Removing SNPs with the same allele across all varieties

In [15]:
# Excluding the first column
snp_columns = snip_data.columns[1:]

# Keeping only SNPs (rows) that have more than one distinct allele across the varieties
# Missing values are ignored, so a row such as ["A", "A", "A", "A", NaN, NaN] is still removed:
# it contains only one actual allele
snip_data = snip_data[snip_data[snp_columns].nunique(axis=1) > 1]
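
To illustrate the criterion (more than one distinct non-missing allele per SNP), here is a small sketch on a hypothetical table. The row labels `mono`, `mono_with_gaps` and `poly` are invented for the example.

```python
import pandas as pd

# Hypothetical toy table: one monomorphic SNP, one monomorphic SNP with
# missing calls, and one polymorphic SNP
toy = pd.DataFrame({
    "SNP": ["mono", "mono_with_gaps", "poly"],
    "VarA": ["A", "A", "A"],
    "VarB": ["A", pd.NA, "G"],
    "VarC": ["A", pd.NA, pd.NA],
})

variety_columns = toy.columns[1:]

# nunique ignores missing values by default, so gaps do not count as alleles
n_alleles = toy[variety_columns].nunique(axis=1)
kept = toy[n_alleles > 1]

print(n_alleles.tolist())
print(kept["SNP"].tolist())
```

Only the polymorphic SNP survives, because the missing calls in `mono_with_gaps` are not treated as a second allele.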

### 3. Removing barley varieties that are not present in both datasets

In [16]:
# Function to normalize variety names: some varieties written as '5777.7.1.2'
# in barley_data appear as '5777712' in snip_data
def normalize_variety_names(variety):
    return str(variety).replace('.', '')
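
A quick check of the function on a few inputs, as a sketch. The dotted name comes from the comment above; the other inputs are illustrative.

```python
def normalize_variety_names(variety):
    return str(variety).replace('.', '')

# Dotted breeding-line names lose their dots
print(normalize_variety_names("5777.7.1.2"))

# Ordinary names are unchanged
print(normalize_variety_names("Anni"))

# Integers are converted to strings
print(normalize_variety_names(6006142))

# Caveat: a name that pandas parsed as a float gains a trailing zero
print(normalize_variety_names(6006142.0))
```

The last case is worth keeping in mind: if the `Nimi` column were ever read as a floating-point column, the normalized names would no longer match those in `snip_data`.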

In [17]:
# Normalizing variety names in barley_data
barley_data['Nimi'] = barley_data['Nimi'].apply(normalize_variety_names)

# Extracting variety names from snip_data (columns starting from the second column)
snip_varieties = set(snip_data.columns[1:])

# Extracting variety names from barley_data (row values in the appropriate column)
barley_varieties = set(barley_data['Nimi'])

# Finding common varieties
common_varieties = snip_varieties.intersection(barley_varieties)

# Filtering snip_data to keep only common varieties (sorted so the column order is reproducible)
snip_data = snip_data[['SNP'] + sorted(common_varieties)]

# Filtering barley_data to keep only rows with common varieties
barley_data = barley_data[barley_data['Nimi'].isin(common_varieties)]
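
Before discarding varieties, it may help to list which names appear in only one of the two datasets, since a spelling mismatch would silently drop a variety. A minimal sketch with made-up sets follows; `OnlyInChip` and `OnlyInField` are hypothetical names.

```python
# Hypothetical variety names standing in for the two real sets
snip_varieties = {"Anni", "Amy", "5777712", "OnlyInChip"}
barley_varieties = {"Anni", "Amy", "5777712", "OnlyInField"}

# Varieties present in both datasets
common_varieties = snip_varieties & barley_varieties

# Varieties that would be dropped from each dataset
only_in_snip = sorted(snip_varieties - barley_varieties)
only_in_barley = sorted(barley_varieties - snip_varieties)

print(sorted(common_varieties))
print(only_in_snip)
print(only_in_barley)
```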

## Processing the datasets 
### **In preparation for finding correlations between protein content and genetic makeup**

In [18]:
# Creating a copy with only the barley variety and protein columns
protein_data = barley_data[['Nimi', 'Proteiin']].copy()

### Merging datasets on variety name

In [19]:
# Transposing snip_data to have barley varieties as rows not columns
snip_data_transposed = snip_data.set_index('SNP').T.reset_index()
snip_data_transposed.rename(columns={'index': 'Nimi'}, inplace=True)


# Merging protein_data with the transposed snip_data
merged_data = protein_data.merge(snip_data_transposed, on='Nimi', how='inner')
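
The transpose-and-merge step can be tried on a small example. The sketch below uses hypothetical SNP labels and, as an optional safeguard not used in the notebook, passes `validate="one_to_one"` so that pandas raises an error if a variety name is duplicated on either side.

```python
import pandas as pd

# Hypothetical toy inputs: SNPs as rows, varieties as columns
snps = pd.DataFrame({
    "SNP": ["snp_1", "snp_2"],
    "Amy": ["A", "C"],
    "Anni": ["G", "C"],
})
protein = pd.DataFrame({"Nimi": ["Amy", "Anni"], "Proteiin": [12.7, 12.4]})

# Transposing so that each variety becomes a row
transposed = snps.set_index("SNP").T.reset_index()
transposed = transposed.rename(columns={"index": "Nimi"})

# Merging on the variety name; validate checks that names are unique on both sides
merged = protein.merge(transposed, on="Nimi", how="inner", validate="one_to_one")

print(list(merged.columns))
print(merged.shape)
```

If a variety legitimately appears several times in the protein data, this check would fail, so whether it applies depends on how `BARLEY_DATA.csv` is structured.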


### Seeing what different types of values we have as alleles

In [20]:
# Extracting allele columns
allele_columns = merged_data.columns[2:]

# Flattening all values from allele columns into a single series and dropping NaN
all_alleles = merged_data[allele_columns].stack().dropna()

# Counting the occurrences of each allele value
allele_counts = all_alleles.value_counts()

print(allele_counts)

G    562756
A    540192
C    492895
T    429012
R      2030
Y      1405
K       466
M       416
S        82
W        45
Name: count, dtype: int64
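
Using the counts printed above, we can work out how rare the two-base codes (R, Y, K, M, S, W) are compared with the four standard bases. The snippet copies the numbers from the output into a plain dictionary.

```python
# Allele counts copied from the output above
allele_counts = {
    "G": 562756, "A": 540192, "C": 492895, "T": 429012,
    "R": 2030, "Y": 1405, "K": 466, "M": 416, "S": 82, "W": 45,
}

# Codes that stand for two possible bases
two_base_codes = ["R", "Y", "K", "M", "S", "W"]

ambiguous = sum(allele_counts[code] for code in two_base_codes)
total = sum(allele_counts.values())
share = ambiguous / total

print(f"{ambiguous} of {total} calls ({share:.2%})")
```

So only about 0.2% of all non-missing calls use a two-base code.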


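**Brief biological explanation:** A, C, G and T are the four DNA bases. The remaining symbols are IUPAC ambiguity codes, each standing for one of two possible bases; in genotype data they usually indicate a heterozygous call.

| Nucleotide Symbol | Base(s) Represented             |
|-------------------|---------------------------------|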
| A                 | Adenine                         |
| C                 | Cytosine                        |
| G                 | Guanine                         |
| T                 | Thymine                         |
| R                 | Guanine / Adenine (purine)      |
| Y                 | Cytosine / Thymine (pyrimidine) |
| K                 | Guanine / Thymine               |
| M                 | Adenine / Cytosine              |
| S                 | Guanine / Cytosine              |
| W                 | Adenine / Thymine               |
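
The table can also be written as a lookup, which is handy if the two-base codes need to be treated differently later. This is only a sketch; the names `IUPAC_BASES` and `is_two_base_code` are introduced here for illustration and are not used elsewhere in the notebook.

```python
# Mapping of each symbol in the table to the set of bases it represents
IUPAC_BASES = {
    "A": {"A"},
    "C": {"C"},
    "G": {"G"},
    "T": {"T"},
    "R": {"A", "G"},  # purine
    "Y": {"C", "T"},  # pyrimidine
    "K": {"G", "T"},
    "M": {"A", "C"},
    "S": {"C", "G"},
    "W": {"A", "T"},
}

def is_two_base_code(symbol):
    """Return True if the symbol stands for two possible bases."""
    return len(IUPAC_BASES[symbol]) == 2

print([s for s in IUPAC_BASES if is_two_base_code(s)])
```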


### Performing one-hot encoding

In [21]:
# Looping through the SNP columns and one-hot encoding each one
one_hot_frames = [
    pd.get_dummies(merged_data[column], prefix=column, dtype=bool)
    for column in merged_data.columns[2:]  # Skip 'Nimi' and 'Proteiin'
]

# Concatenating once at the end avoids repeatedly copying a growing DataFrame
encoded_snps = pd.concat(one_hot_frames, axis=1)

# Concatenating back with 'Nimi' and 'Proteiin' columns
encoded_data = pd.concat([merged_data[['Nimi', 'Proteiin']], encoded_snps], axis=1)


Each SNP is represented by several Boolean columns, one for each allele **observed** at that site. For instance, a SNP with the alleles "A", "C" and "T" is encoded as three columns (e.g., SNP_A, SNP_C, SNP_T), where True marks the allele a variety carries and False marks the alleles it does not. A variety with a missing call at a SNP has False in all of that SNP's columns.
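
A small example of the encoding for a single SNP, as a sketch with hypothetical variety names (`V1` to `V4`) and a hypothetical SNP label (`snp_1`):

```python
import pandas as pd

# Hypothetical toy data: one SNP column with three alleles and one missing call
toy = pd.DataFrame({
    "Nimi": ["V1", "V2", "V3", "V4"],
    "snp_1": ["A", "C", "T", pd.NA],
})

# One Boolean column per observed allele, named <SNP>_<allele>
one_hot = pd.get_dummies(toy["snp_1"], prefix="snp_1", dtype=bool)

print(list(one_hot.columns))

# Number of True values per variety: 1 for a called allele, 0 for a missing call
print(one_hot.sum(axis=1).tolist())
```

Because `pd.get_dummies` does not create a column for missing values by default, the variety with the missing call ends up with False in every column for that SNP.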

**Let's look at our encoded dataset:**

In [22]:
encoded_data

,Nimi,Proteiin,BK_01_G,BK_01_T,BK_03_C,BK_03_T,BK_05_C,BK_05_T,BK_05_Y,BK_08_C,...,TGBA15K-TG0400_G,TGBA15K-TG0402_NC_MA_G,TGBA15K-TG0402_NC_MA_T,TGBA15K-TG0402_NG_MA_G,TGBA15K-TG0402_NG_MA_T,TGBA15K-TG0403_A,TGBA15K-TG0403_C,TGBA15K-TG0409_A,TGBA15K-TG0409_G,TGBA15K-TG0409_R
0,Amidala,12.2,False,False,False,True,True,False,False,False,...,False,False,True,False,True,False,True,False,True,False
1,Amy,12.7,False,True,False,True,True,False,False,False,...,False,False,True,False,True,False,True,False,True,False
2,Anneli,13.0,False,True,False,True,False,True,False,False,...,False,False,True,False,True,False,True,True,False,False
3,Anni,12.4,False,True,False,True,True,False,False,True,...,False,False,True,False,True,False,True,False,True,False
4,Annika,11.2,False,True,False,True,True,False,False,False,...,False,False,True,False,True,False,True,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
166,6006142,12.1,False,True,False,True,True,False,False,True,...,False,False,True,False,True,False,True,False,True,False
167,6006153,11.3,False,True,False,True,True,False,False,False,...,False,False,True,False,True,False,True,False,True,False
168,6011421,11.4,False,True,False,True,True,False,False,True,...,False,False,True,False,True,False,True,False,True,False
169,6012243,11.6,False,True,False,True,True,False,False,False,...,False,False,True,False,True,False,True,False,True,False


## Finding correlations