# SNP Data Integration Plan of Action

## Objective
Compare every available rsid file against the SNP data and merge the matching rows, joining the "SNP Name" column of the SNP data to the first column ("Name") of the rsid files.

## Tools Required
- Python 3.x
- Pandas library for Python
- An Anaconda environment (used here for Python development) with Jupyter Notebook, pandas, and NumPy installed

## Stage 1: Preprocessing

### 1.1. Prepare the Python Environment
- Ensure Python and Pandas are installed.
- Verify that you have read/write access to the files and sufficient memory for processing.

### 1.2. Define File Paths
- Define the paths to the rsid files and the SNP data files.
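
As a rough sketch of step 1.2, the paths can be collected in one place and checked before anything is loaded. The helper names `build_paths` and `missing_files` are hypothetical; the folder and file names are the ones used later in this notebook, and the project root is assumed to be the current working directory.

```python
from pathlib import Path


def build_paths(project_root):
    """Return the rsid and SNP file paths under a project root.

    Folder and file names follow this notebook; adjust them for other layouts.
    """
    root = Path(project_root)
    rsid_names = [
        "GSA-24v3-0_A1_b151_rsids.txt",
        "GSA-24v1-0_C2_b150_rsids.txt",
    ]
    return {
        "rsid": [root / "references" / name for name in rsid_names],
        "snp": root / "data" / "test" / "test_snp.txt",
    }


def missing_files(paths):
    """List every configured path that does not exist on disk."""
    candidates = list(paths["rsid"]) + [paths["snp"]]
    return [p for p in candidates if not p.is_file()]


# Assumption: the notebook is being run from the project root.
paths = build_paths(Path.cwd())
for p in missing_files(paths):
    print(f"Missing input file: {p}")
```

Checking for missing files up front gives a clearer message than a failure in the middle of a later load step.
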



## Stage 2: Load and Process Files

### 2.1. Load rsid Files and Validate the Data
- Confirm that each rsid file has a "Name" column followed by an "RsID" column.
- Expect the "RsID" column to hold either a `.` (missing value) or one or more rsid values separated by commas.
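
The two checks above could be sketched as follows. This is only an illustration: the rows are made up, `load_rsid_table` is a hypothetical helper, and the extra `RsID_list` column is an assumption about how comma-separated values might be handled, not something the notebook does.

```python
import io

import pandas as pd

# Tiny stand-in for an rsid file; the rows are made up for illustration.
sample = io.StringIO(
    "Name\tRsID\n"
    "snp_a\trs111\n"
    "snp_b\t.\n"
    "snp_c\trs222,rs333\n"
)


def load_rsid_table(source):
    """Read a tab-delimited rsid file and validate its first two columns."""
    df = pd.read_csv(source, sep="\t", dtype=str)
    if list(df.columns[:2]) != ["Name", "RsID"]:
        raise ValueError(f"Unexpected columns: {list(df.columns)}")
    # Treat "." as a missing value.
    df["RsID"] = df["RsID"].mask(df["RsID"] == ".")
    # Split comma-separated rsids into lists; missing values stay missing.
    df["RsID_list"] = df["RsID"].str.split(",")
    return df


rsid = load_rsid_table(sample)
print(rsid)
```

Raising on unexpected columns stops the run early, whereas the notebook cell below only prints a message and continues.
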
### 2.2. Load SNP Data File and Validate the Data
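
The SNP data file used in this notebook has a header block that ends with a `[Data]` line, and the table starts on the next line. A minimal sketch of skipping that block, using made-up file contents and a hypothetical helper `rows_to_skip`:

```python
import io

import pandas as pd

# Made-up miniature of an SNP file: header lines, a [Data] marker, then the table.
report_text = (
    "[Header]\n"
    "example header line\n"
    "[Data]\n"
    "SNP Name\n"
    "snp_a\n"
    "snp_b\n"
)


def rows_to_skip(lines, marker="[Data]"):
    """Return how many leading lines to skip, including the marker line."""
    for i, line in enumerate(lines):
        if line.strip() == marker:
            return i + 1
    raise ValueError(f"Marker {marker!r} not found")


skip = rows_to_skip(report_text.splitlines())
snp = pd.read_csv(io.StringIO(report_text), sep="\t", skiprows=skip, dtype=str)
print(snp.columns.tolist(), len(snp))
```

Comparing the stripped line to the marker tolerates trailing whitespace and a missing final newline, which an exact match on `'[Data]\n'` does not.
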
### 2.3. Identify Unmatched SNP Names
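
One way to find unmatched names is a boolean mask built with `Series.isin`. The frames below are made-up examples, not real data:

```python
import pandas as pd

# Made-up stand-ins for the SNP data and one rsid file.
snp = pd.DataFrame({"SNP Name": ["snp_a", "snp_b", "snp_x"]})
rsid = pd.DataFrame(
    {"Name": ["snp_a", "snp_b", "snp_c"], "RsID": ["rs111", ".", "rs222"]}
)

# Keep the SNP names that never appear in the rsid "Name" column.
unmatched = snp.loc[~snp["SNP Name"].isin(rsid["Name"]), "SNP Name"]

print(f"{len(unmatched)} of {len(snp)} SNP names have no rsid entry")
print(unmatched.tolist())
```
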

## Stage 3: Combine Data and Output Results
### 3.1. Merge SNP Data with rsid Information
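
A possible variant of the merge, shown here on made-up frames, is a left join that keeps every SNP row and records whether it matched. This is an alternative sketch, not the notebook's method: the Stage 3 cell uses the default inner join. The `validate` and `indicator` arguments are standard `DataFrame.merge` options.

```python
import pandas as pd

# Made-up stand-ins for the SNP data and one rsid file.
snp = pd.DataFrame({"SNP Name": ["snp_a", "snp_b", "snp_x"]})
rsid = pd.DataFrame(
    {"Name": ["snp_a", "snp_b", "snp_c"], "RsID": ["rs111", ".", "rs222"]}
)

combined = snp.merge(
    rsid,
    how="left",              # keep SNP rows even when no rsid entry exists
    left_on="SNP Name",
    right_on="Name",
    validate="many_to_one",  # raise if "Name" is duplicated in the rsid frame
    indicator=True,          # add a "_merge" column: "both" or "left_only"
)

print(combined["_merge"].value_counts())
```

With a left join the row count always equals that of the SNP data as long as `validate` passes, so unmatched names show up as `left_only` rows instead of disappearing.
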




In [36]:
# Stage 1: preprocessing

import os
import pandas as pd
import numpy as np




# Work from the main directory, which is the parent directory of notebooks.
# If the notebook is in neither of those directories, move it to notebooks and reload it.
if os.path.basename(os.getcwd()) == 'notebooks':
    print('Current working directory: ', os.getcwd())
    os.chdir('..')
elif not os.path.isdir('notebooks'):
    # the cwd is neither notebooks nor its parent directory
    print('Current working directory is not notebooks.  Please move notebook to the notebooks directory')
    print('Current working directory: ', os.getcwd())


# Names of the rsid files (they are loaded in Stage 2).
rsid_file_1 = 'GSA-24v3-0_A1_b151_rsids.txt'
rsid_file_2 = 'GSA-24v1-0_C2_b150_rsids.txt'
# define the rsid file paths cwd + references folder + rsid file name
rsid_file_path_1 = os.path.join(os.getcwd(), 'references', rsid_file_1)
rsid_file_path_2 = os.path.join(os.getcwd(), 'references', rsid_file_2)


# The test snp file is a copy of the SNP data file with every column except the first removed.
# Create it by running the following command:
# awk -F '\t' '{print $1}' filename.txt > test_snp.txt
test_snp_file = 'test_snp.txt'
# the test data is in the data/test folder
test_snp_file_path_1 = os.path.join(os.getcwd(), 'data', 'test', test_snp_file)



In [37]:
# Stage 2: Load the files

# Load the rsid files with a \t delimiter into a pandas dataframe
rsid_1 = pd.read_csv(rsid_file_path_1, delimiter='\t')
rsid_2 = pd.read_csv(rsid_file_path_2, delimiter='\t')
rsid_data = [rsid_1, rsid_2]
# Read the test snp file.  It contains a "[Data]" row; that row and everything above it must be skipped.

# open the file
with open(test_snp_file_path_1) as f:
    # read the lines
    lines = f.readlines()

# find the index of the "[Data]" line, ignoring surrounding whitespace and the line ending
index = next(i for i, line in enumerate(lines) if line.strip() == '[Data]')

# read the file again, skipping the "[Data]" line and everything above it
test_snp_1 = pd.read_csv(test_snp_file_path_1, delimiter='\t', skiprows=index+1)

for value in rsid_data:
    # print the shape of the dataframe
    print(value.shape)
    # confirm that the first header is Name and the second is RsID
    if value.columns[0] != 'Name' or value.columns[1] != 'RsID':
        print('Error: The columns are not named correctly')
        # print the head of the dataframe
        print(value.head())



# find all rows in rsid_1 whose Name is not in rsid_2
rsid_1_not_in_2 = rsid_1[~rsid_1['Name'].isin(rsid_2['Name'])]
# find all rows in rsid_2 whose Name is not in rsid_1
rsid_2_not_in_1 = rsid_2[~rsid_2['Name'].isin(rsid_1['Name'])]

# print the shape of the dataframes
print(rsid_1_not_in_2.shape)
print(rsid_2_not_in_1.shape)


# determine if there are any values in the first column of the test snp dataframe that are not in the rsid_1 dataframe
test_snp_1_not_in_rsid_1 = test_snp_1[~test_snp_1[test_snp_1.columns[0]].isin(rsid_1[rsid_1.columns[0]])]


# if there are any values in test_snp_1_not_in_rsid_1, print the shape
if test_snp_1_not_in_rsid_1.shape[0] > 0:
    print("there are values in the test snp file that are not in the rsid file")
    print(test_snp_1_not_in_rsid_1.shape)
    print(test_snp_1_not_in_rsid_1.head())


(654027, 2)
(618540, 2)
(64609, 2)
(29122, 2)


In [39]:
# Stage 3: Combine Data and output results
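# combine the test_snp_1 and rsid_1 dataframes on their first columns (inner join: unmatched rows are dropped)
combined = pd.merge(test_snp_1, rsid_1, left_on=test_snp_1.columns[0], right_on=rsid_1.columns[0])

# the length of the combined dataframe should be the same as the test_snp_1 dataframe
if combined.shape[0] == test_snp_1.shape[0]:
    print('The combined dataframe has the same length as the test_snp_1 dataframe')
else:
    # fewer rows means unmatched SNP names; more rows means duplicate Name values in rsid_1
    print(f'Warning: combined has {combined.shape[0]} rows but test_snp_1 has {test_snp_1.shape[0]}')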

# save the combined dataframe to a tsv file in the data/test folder
combined_file = 'test_combined.tsv'
combined_file_path = os.path.join(os.getcwd(), 'data', 'test', combined_file)
combined.to_csv(combined_file_path, sep='\t', index=False)





The combined dataframe has the same length as the test_snp_1 dataframe
