# MyHeritage DNA Analysis

---

**Analysis sites:**

- MyHeritage

- MyTrueAncestry

rsid = reference SNP ID (gives the "name" of the first and last SNP of the segment)

SNPs (single nucleotide polymorphisms), pronounced "snips", are minor variations in the genome within a population. A single nucleotide, the basic building block of DNA, is changed.
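
To make the definition concrete, here is a minimal sketch that compares two short DNA sequences and reports the positions where a single nucleotide differs. The sequences and the function name `find_snps` are made up for illustration; they are not taken from the MyHeritage file.

```python
def find_snps(reference, sample):
    """Return (position, reference_base, sample_base) for each single-base difference."""
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [(i, r, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]

# two made-up sequences that differ at one position (0-based index 4)
reference = "ATGCATTGA"
sample = "ATGCGTTGA"
snps = find_snps(reference, sample)
print(snps)
```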

See also genome-wide association studies (GWAS).

https://towardsdatascience.com/know-thyself-using-data-science-to-explore-your-own-genome-ec726303f16c

https://towardsdatascience.com/machine-learning-in-bioinformatics-genome-geography-d1b1dbbfb4c2

In [3]:
# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
sns.set_palette('Spectral')  # color_palette() only returns a palette; set_palette() applies it

# data analysis
import numpy as np
import requests
import pandas as pd
import re

# using web browser in python
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

## Data collection

In [26]:
data = pd.read_csv('MyHeritage_raw_dna_data.csv', sep=',', dtype={'RSID':'str', 'CHROMOSOME':'object', 'POSITION':'int', 'RESULT':'str'}, comment='#', low_memory=False)
df = pd.DataFrame(data) # for Collection and EDA
data.head()

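|   | RSID | CHROMOSOME | POSITION | RESULT |
|---|------|------------|----------|--------|
| 0 | rs547237130 | 1 | 72526 | AA |
| 1 | rs562180473 | 1 | 565703 | AA |
| 2 | rs575203260 | 1 | 567693 | TT |
| 3 | rs3131972 | 1 | 752721 | AA |
| 4 | rs200599638 | 1 | 752918 | GG |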


**Variable description:**

(`RESULT` is the genotype)

In [51]:
display(df.isna().any(), df.nunique())

RSID          False
CHROMOSOME    False
POSITION      False
RESULT        False
dtype: bool

RSID          609346
CHROMOSOME        24
POSITION      608314
RESULT            14
dtype: int64
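
`RESULT` holds 14 distinct values. The sketch below shows one way to group such values by zygosity. It assumes that each call is a two-letter string over A, C, G and T, and that anything else (for example a `--` placeholder) is a no-call. Both assumptions should be checked against `df['RESULT'].unique()`, and the function name `zygosity` is hypothetical.

```python
from collections import Counter

def zygosity(genotype):
    """Classify a two-letter genotype; anything else is treated as a no-call."""
    if len(genotype) != 2 or not set(genotype) <= set("ACGT"):
        return "no-call"
    return "homozygous" if genotype[0] == genotype[1] else "heterozygous"

# made-up values standing in for the RESULT column
results = ["AA", "AG", "TT", "--", "CT"]
counts = Counter(zygosity(g) for g in results)
print(counts)
```

On the real dataframe the same function could be applied with `df['RESULT'].map(zygosity).value_counts()`.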

**Simplifying the dataset:**

In [24]:
# keep a lookup of the original chromosome labels (index -> label)
chromosome_dict = df['CHROMOSOME'].unique()
chromosome_dict = dict(zip(range(chromosome_dict.size), chromosome_dict))
# encode the sex chromosomes as numbers (X -> 23, Y -> 24), then cast the column to int
df['CHROMOSOME'] = df['CHROMOSOME'].replace({'X': '23', 'Y': '24'}).astype(int)
df.head()

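|   | RSID | CHROMOSOME | POSITION | RESULT |
|---|------|------------|----------|--------|
| 0 | rs547237130 | 1 | 72526 | AA |
| 1 | rs562180473 | 1 | 565703 | AA |
| 2 | rs575203260 | 1 | 567693 | TT |
| 3 | rs3131972 | 1 | 752721 | AA |
| 4 | rs200599638 | 1 | 752918 | GG |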
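
A small sketch on made-up labels of what the integer encoding changes in practice. String labels sort lexicographically, so `'10'` comes before `'2'`, while the integer codes sort in chromosome order. The reverse mapping `decode` is a hypothetical helper, not part of the notebook.

```python
# chromosome labels as they appear in the raw file (strings)
labels = ["1", "2", "10", "X", "Y"]

# same encoding as in the cell above: X -> 23, Y -> 24, everything else parsed as int
encode = {"X": 23, "Y": 24}
codes = [encode[label] if label in encode else int(label) for label in labels]

# hypothetical reverse mapping, to recover the original labels for display
decode = {23: "X", 24: "Y"}
decoded = [decode.get(code, str(code)) for code in codes]

print(sorted(labels))  # lexicographic order of the string labels
print(sorted(codes))   # numeric order of the integer codes
```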


**Retrieving the SNPedia database (https://www.snpedia.com/)**

Interesting article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245045/

The wikitool tutorial (https://www.snpedia.com/index.php/Bulk) does not work with Python 3; do the web scraping with BeautifulSoup instead. See:

- https://stackoverflow.com/questions/70233801/python-scraping-of-wikipedia-category-page
- https://stackoverflow.com/questions/62398524/how-can-i-get-an-article-from-wiki-with-a-specific-language-using-python

In [None]:
import requests
from bs4 import BeautifulSoup

nb_snp = 111727  # expected number of SNP pages, used as an upper bound on the requests
snp_list = []
# page through the category with the MediaWiki API, 500 titles per request
URL = "http://bots.snpedia.com/api.php"
PARAMS = {"action": "query", "list": "categorymembers", "cmtitle": "Category:Is a snp", "cmlimit": "500", "format": "json"}
for _ in range(nb_snp // 500 + 1):
    # request
    req = requests.get(url=URL, params=PARAMS).json()
    # get list
    pages = req['query']['categorymembers']
    snp_list += [p['title'] for p in pages]
    # stop once the API no longer returns a continuation token
    if 'continue' not in req:
        break
    PARAMS['cmcontinue'] = req['continue']['cmcontinue']
#display(snp_list)

**Matching the list against the dataframe (reduction):**

In [149]:
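snp_series = pd.Series(snp_list, name="RSID")
# SNPedia titles start with "Rs"; the raw data uses lowercase "rs"
snp_series = snp_series.str.replace(r'^Rs', 'rs', regex=True)
# new df: keep each title's position in the SNPedia list as ID_LIST
snp_df = pd.concat([snp_series, pd.Series(range(len(snp_series)), name="ID_LIST")], axis=1)
# keep only the SNPs present in both SNPedia and the raw data
snp_df = snp_df.merge(df, how='inner', on=['RSID'])
# save list of rsid referenced
snp_df[['RSID', 'ID_LIST']].to_csv('rsid_referenced.csv', index=False)
snp_df.head()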
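
The reduction step can be checked on toy data. The sketch below uses a few illustrative SNPedia-style titles and made-up genotypes (none of these rows come from the real files) to show that the inner merge keeps only the identifiers present in both tables, once the leading `Rs` has been lowercased.

```python
import pandas as pd

# illustrative SNPedia-style titles (first letter capitalised) and a made-up genotype table
titles = pd.Series(["Rs53576", "Rs1815739", "I3000001"], name="RSID")
genotypes = pd.DataFrame({
    "RSID": ["rs53576", "rs999999", "rs1815739"],
    "RESULT": ["AG", "CC", "CT"],
})

# normalise only the leading "Rs" so the keys match the raw-data spelling
titles = titles.str.replace(r"^Rs", "rs", regex=True)
reference = pd.concat([titles, pd.Series(range(len(titles)), name="ID_LIST")], axis=1)

# the inner join keeps only the SNPs present in both tables
matched = reference.merge(genotypes, how="inner", on="RSID")
print(matched)
```

Titles that are not rsids (here `I3000001`) and genotyped SNPs without a SNPedia page (here `rs999999`) both drop out of the result.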

## Data Exploration


## Data Processing
