# Building a dataset for Phenotype Prediction using openSNP

## Context

There is a field of study called DNA Forensics, or Forensic DNA Phenotyping, which can make use of phenotype prediction based on collected genetic material.

Predicting externally visible characteristics can help the police narrow down the lines of an investigation and solve a crime faster.

With genetic material in hand, it is possible to look for gene mutations called SNPs (Single-Nucleotide Polymorphisms). Some of these mutations may be associated with phenotypes such as eye color, skin color, and hair color.

Machine learning models can be applied to help with the task of predicting phenotypes from genetic material. For that we need... DATA!!! Collecting and processing that data is what this work is about.

## Introduction

The [openSNP](https://opensnp.org/) platform lets customers of genetic tests publish their results and share their phenotypes.

Using this platform, it was possible to download a large archive that includes genetic data from 6326 users together with their self-reported phenotypes.

An auxiliary file provides a table with 673 phenotypes for each user.

Of the platform's 6326 users, 1678 filled in their eye color phenotype.

Below, a dataset is built with the aim of enabling machine learning algorithms to predict eye color in individuals.

In [1]:
# importing libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)

## Working with the phenotypes file

1. Read the file containing each user's phenotypes.
2. Filter the users who declared their eye color.
3. Group the colors into 3 categories:
   * Blue, Green, Gray: predominantly light eyes.
   * Intermediate: light brown, honey, hazel.
   * Brown.
4. Write the processed file to an auxiliary dataset.

Note: only 1286 users remained after this step, because uninformative labels and duplicated data were discarded.
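The grouping in step 3 depends on matching free-text labels exactly, so variants that differ only in case or surrounding whitespace need their own entries. As a minimal sketch of one possible alternative (not the approach used in the cells below), the labels could be normalised before mapping. The sample labels and the `color_map_lower` dictionary are hypothetical and exist only for illustration.

```python
import pandas as pd

# Toy labels standing in for the free-text "Eye color" column (hypothetical sample)
raw = pd.Series(["Brown", "brown ", "Blue-Grey", "HAZEL", "purple"])

# Hypothetical lowercase map; the notebook itself uses a longer, case-sensitive map
color_map_lower = {
    "brown": "Brown",
    "blue-grey": "BGG",
    "hazel": "Int",
}

# Strip whitespace and lowercase first, so spelling variants share a single key
normalised = raw.str.strip().str.lower()
categories = normalised.map(color_map_lower)

# Labels without an entry in the map become NaN and can be dropped afterwards
print(categories.tolist())
```

With this kind of normalisation, fewer dictionary entries are needed, at the cost of losing the exact original spelling in the mapping step.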


In [2]:
phenotypes_df = pd.read_csv("opensnp_alldata/phenotypes_202206080235.csv", sep=';')

In [3]:
phenotypes_df.head(2)

Unnamed: 0,user_id,genotype_filename,date_of_birth,chrom_sex,openhumans_name,Retrognathia (Marfan Syndrome),Eye pigmentation,Vegetarianism/Preference for Meat,Form of foot,Eye color,A+,MSG tastes...,Boldness type,Lynch Syndrome,Early Onset Heart Disease,black,Interstitial cystitis,Jewish Ancestry,Response to Enbrel,Bicuspid aortic valve,Hemochromotosis,ACT science,Hypomagnesemia,ACT math,12,Prolapsed Organ,vi/vim or Emacs,Heart Problems - Long QT Syndrome,chronically sore neck glands,Urticaria,R1b1a2a1a1b,ACT reading,Cramps,Number of Neanderthal variants,Dental decay,Miscarriage/Spontaneous Abortion,Thrombosis,Resistance To Infectious Disease,Enhanced Hippocampal Volume,Significantly increased Risk of Heart Disease,Amount of Body Hair (Male),Morton's Toe,Pressure Sensitivity of the Ear,"Dolichocephaly: Disproportionately Long, Narrow Head & Neanderthal rs12416000(A;G)",Neanderthal,Webbed toes,Thyroid Issues/Cancer,Purposefulness,Right Atrial Enlargement,mtDNA Haplogroup (PhyloTree),Lactose intolerance,"Light-skinned, European Ancestry (rs14256654)",Fainting Upon Seeing Blood/Gore/Violence,Erythromycin allergy,Crohn's Disease +rs2241880,ethnicity,Interests-General,"Increased Risk of GYN, Head & Neck Cancers (mutation on RNASEL Gene ~ Cancer Marker)",Premature Atrial Contractions,Reduced MAO-A Activity,Hidradenitis Supperativa,Scottish Ancestry,Welsh Ancestry,nosebleeds as a child,Hair and eye color Brown,Mitral Insufficiency (Regurgitation),Left Ventricular Hypertrophy,ABH Blood Group (antigens),"black skin,O+,hair Black,Eye color Brown,",Photic Sneeze Reflex (Photoptarmis),ENTP,Tricuspid Insufficiency (Regurgitation),Weight,Large tonsil crypts and tonsil stones,Uterine Fibroids,AB +,Type 2 Diabetes +rs13266634,Kell Blood Group (K/k antigens),MBTI (Myers Briggs) Type,"Throat, Stomach Cancer +rs2274223",Nicotine dependence,Thyroid Disorders +rs966423,Asparagus Metabolite Detection,Beard Color,Sexual Preferences,IQ,In both the large group of people is best for you 
?for you ?,Ear - darwin's tubercle,Unable To Metabolize Common Medications (CYP3A5 non-expressor),Alzheimer's Disease,Frequency of colds/flu,Fibromyalgia,Hedonic set point,Otosclerosis,Skin color.,Ability to Tan,Little fingers(pinkies),Response to Humira,Farnsworth Munsell 100 Hue Test,FBN1 Mutation,CMV serostatus,Aquagenic pruritus (episodes),Atherosclerotic Vascular Disease +rs2943634,music ear,eczema,How do you put your glasses down?,Polycystic ovary syndrome (pcos),Impacted Canines,Primary Open Angle Glaucoma,Misophonia,Panic Disorder,Allergy to Egg Whites,Acrophobia,vitiligo,Double vision,Preterm labor,Anemic?,TMJ/TMD,ADHD? Subtype?,Response to Metformin,Autoimmune Disease,Non Alcoholic Fatty Liver,Social Level,Supraventricuar Tachycardia,"dark Blonde, hazal eyes,, 165 cm",Plantar fasciitis,Anorgasmia,maternal or mTDNA haplogroup,Supernumerary Kidney,Fish Preference,Looking at the world around you,No headaches EVER,Exercise Induced Ischemia +rs1024611,Kidd Blood Group,multiple brain aneurysms,adult a.d.d,Alcohol Consumption (per week),Alcoholism,Cigarette/Cigar Smoker,Chronic cough and single CFTR mutation,"Busy Bee, Multi tasker",Curiosity and love of research,shoe size (US MEN),Mother's eye color,Rain,Latino Ancestry,Secretor Status,hair color,impaired NSAID drug metabolism,Scoliosis,"Third molars ""wishdom teeth""",water taste,Anorexia nervosa or ednos,rs429358,immune thrombocytopenia,Spondyloarthropathy,Oxycodone/Oxycontin Effectiveness,parsonage-turner syndrome,Toenail or fingernail fungus,Arthritis,Peanut butter preference,Eye with Blue Halo,Serotonin transporter,Personality Disorder test - top result,Aphantasia,Depression,One Warrior Gene/One Non-Warrior Gene,Cocaine addiction,"Gray eyes,fair skin, dark blonde, o- blood",Craves sugar,Dermatofibroma,Essential Tremor,low appetite,PICA - eating non food stuff,introvert or extrovert,Can you smell cut-grass?,Cervical dysplasia / cancer,Critical thinking,CLL,Carrier,Medium brown skin,Online Alexithymia 
Questionnaire,"Medium blonde, medium tone skims blood type o+",OCD - Obsessive-Compulsive Disorder,Blood Type: O negative,Pseudocholinesterase deficiency,Behçet's Syndrome / Disease (Behcet's),Father or grandfather died of Prostate Cancer,EBV Epstein Barr Virus history: Infectious mononucleosis history now may be not symptomatic,MENSA Member,33,Rh Factor: A Positive,Tooth sensitivity,Woolnerian Tip (Darwin's Tubercle),Clinodactyly,Handedness,Atheism,(Male) Nipple's size,Ehlers-Danlos,Enjoy using the Internet,Heroin addiction,Raynaud's,Diet,Squizophrenia,Dietary supplements used,Irritable Bowel Syndrome,Wake up preference,5-HTP effectiveness,Tooth shade range,Brazil nut diarrhoea,Caffeine dependence,Gambling,Umami taster?,lichen planus,macular degeneration,Bipolar disorder,Gap between front teeth,Easily irritated or frustrated,MS,Premature Gray,Thyroid nodules,deviated septum,Renal hypoplasia,inverted nipples,Do You Have Lucid Dreams?,Allergy to Hair Dye,Cholesterol,Skin dry,Inflammation,Pulmonary Fibrosis,Political Ideology,Post traumatic Stress Disorder or PTSD?,Vasovagal syncopy or Neurocardiogenic syncopy,Pain Tolerance,"Do you prefer python, matlab, or R?",Atypical Sulfonomide Antibiotic Reaction,Red Hair,Sleep Disorders,Do You Suffer From Agoraphobia?,"hemophilia C (factor XI deficiency is caused by ""rare mutations"" in the F11 gene)",Are You The Advertising Phenotype?,Skin - Fitzpatrick Scale,Score on Big Five at PersonalityLab.org,"Brunette, O+, Caucasian",Resting Heart Rate,Wanting to be immortal,Motion sickness,Cusp of Carabelli,brown hair colour,Diagnosed Vitamin D deficiency,Affinity to Cannabis,Can you bend your fifth finger (pinky) without bending your fourth finger (ring)?,Number of fingers,Black,blue eyes with yellow stripes making them look depending on the light or my mood,Income Range?,Ocular migraines,"brown hair colour, white skin color",Hypospadias,clubfoot,Hazel Eyes,No Allergies,Cat or Dog person,Do hops taste like soap?,SAT 
Verbal,Persistant Muscle Pain or Fatigue,erectil disfunction,Ability to find a bug in openSNP,dark Blonde,SAT - when taken,Enophthalmos,Have ME/CFS,Axiiety (following cannabis consumption),Negative reaction to fluoroquinolone antibiotics,Hand span,A- (cisA2B3),Sweat eating spicy food,autism,Pancreas,ADHD,Penicillin reaction,SAT Math,paternal or Y haplogroup,How many wisdom teeth did you have/ do you have (if you know).,Gynecomastia (male breast tissue),midtarsal flexibility,Physical Aversion to Certain Foods,Interested in news from real newspaper / news from the Internet,Myers-Briggs Type Indicator,Sensitivity to Mosquito Bites,Response to Methotrexate,Autoimmune disorder,Strabismus,Marfan Syndrome,lips size,Have You Been Diagnosed With Brown's Syndrome (Tendon Sheath Syndrome)?,Degree of Empathy,eosinophilic disease,Rythm Test Result,Colon cancer ONLY FOR (rs3219489 GG)!,Brachydactyly,Disseminated Superficial Actinic Porokeratosis,Synesthesia,Walk in nature on the roads,"Intolerance: gluten, casein, soy",At what altitude you live or have lived most of their lives?,Third Nipple,Cleverness,Anxiety,Diseases,Downslanting Palpebral Fissures Related To Marfan Syndrome,Bone Mineral Density,SLE - Lupus,Dyscalculia,Penis Circumference at Glans,black skin,Asthma,Political Compass,Diagnosed with Sleep Apnea?,Hair colour,Sports interest,MethyleneTetraHydroFolate Reductase (MTHFR),excessive daytime sleepiness,Keratoconus Disease,Reading habits,Diego Blood Group,Cognitive impairment side effect on statin,Allergic to Lexapro (Escitalopram),Gestures when speaking,Reaction to poison ivy,Second MBTI type,Ancestry,Physical,restless leg syndrome,The Dress: Perception of colour,white skin,Cancer Marker RNASEL Gene,Digit ratio,Hindfoot Valgus (Marfan Syndrome),Hair color,Chronotype - Morningness-eveningness questionnaire (MEQ),SAT Writing,Thumb Sign (Marfan Syndrome),Do you have dust-mite allergies,head form,Rheumatoid Arthritis,Migraine frequency,Does cilantro taste like soap to 
you?,Mole (Nevus) type,Hereditary breast and ovarian cancer,Enjoy watching TV,brown hair colour blue eyes,Number of toes,prognathism,Interest in Spirituality and Mysticism,Nickel Dermatitis,Good / poor eater as child,Hashimoto's,hair on fingers,Dyslexia,Musical Perfect Pitch,Mirror Touch Synesthesia,syndactyly,Eye Color,Gorlin sign,Female with Ring Finger longer than Index Finger,ABO Rh,Aspirin Allergy,Age you started wearing glasses,Keloid,Moles raised,Pheochromocytoma,Fat-pad knee syndrome,Multiple Sclerosis,Preference for Loud Music,(male) penis releases pre-cum when sexually aroused.,apthous in mouth tendency,Hypertriglyceridemia,Broad face,You look like the flame of the fire?,Enjoy riding a motorbike,Dupuytren's disease,Laterality,Seborrhoeic Dermatitis,Insect bites and stings,Cardiac Arterial Disease,Sense of smell,ALS,Female Pattern Baldnes,Migraine,Daily Sleep Duration (hours),sex drive,Phobia,hair on ear,glass eye,Grey hair very late,Easiness to navigate/orientate,Acromegaly,Malar hypoplasia in Marfan Syndrome,Autism,Ambition,Ectopia Lentis,Type II Diabetes,Physician-diagnosed celiac/coeliac disease,eye colour,Lisp,Do you have a parent who was diagnosed with Alzheimer's disease?,ASMR,Birth year,mthfr,Sneezing induced by sexual ideation or orgasm?,Skintype,double jointed thumbs,Cystic Fibrosis Like Disease,Would you invite strangers into your living space?,Retinitis pigmentosa,Vitiligo,Chest Pain on Ritalin,cluster headache,form of the nose,Photophobia,Lipoma,ear proximity to head,Tea consumption,Age you had a heart attack,Sexuality,Sensitivity to smell,Allergy to artificial grape flavoring,Jogger,Melasma,Faktor 5 Leiden (F5),Widow's Peak,Hair Length,Sporting activity participation,libido,Earwax type,blood type,Age learned to read,"Allergy to Strawberries, Tomatoes, Citrus",Handwriting/Fine Motor Skills,Clotting disorder,I dance....,Ear Infections In Childhood,Creutzfeldt–Jakob disease,Aortic Insufficiency (Regurgitation),Metabolic Syndrome 
[MetS],Dermatographia,Kinsey Scale,Academic degree,"White skin, black hair",hair colour,Allergic/bad reaction to fish oil supplements,Enneagram Personality Type,Short-sightedness (Myopia),Astigmatism,Freckling,Eurogenes,Favorite Color/Colour,Hypermobility,Hair color changed from blonde to brown,Nationality,Allergic rhinitis,Blood type,number of biological children,Нос,Like the taste of Stevia,High platelet count,Adult Second Language Acquistion Aptitude,Amblyopia,Response to Remicade,mouth size,Fight or flight response (see description),Do you grind your teeth,Haemophilia/Hemophilia,congenital talipes equinovarus (CTEV),Physical Characteristic,Abnormal Blood Pressure,brunette,Pectus Excavatum (sunken chest),Index Toe Longer than Big Toe,Tongue roller,Hair Color,opensnp.org,Subjective dream intensity,Prone to Cysts,Am I a duck?,Body shape,Black hair and Green Eyes,Desmoid Tumor,Artistic ability,Coffee consumption,ABO Rh.1,Neanderthal (Interpretome),ring finger longer than index finger,African,Taste of broccoli,Ear wiggling,Artistic Talent,african-northern european,Do you like the taste of hops?,Homophobia - Questionnaire Result,rolled tongue,Sex,Mental Disease,Enjoy driving a car,English ACT,Smell of coffee in urine,First word,"Blood Type A1, Rh positive","Palpitations Unrelated To Food, Alcohol, Drugs or Exercise",Dimple in the chin,Number of wisdom teeth,natural skinny,how do you like your steak?,Pale,Affinity For Animals,Toe Thumbs,Sport interest,Body complexity,"White skin, black straight hair, B+",Height,Foot length,Energy Level,Earlobe: Free or attached,Colour Blindness,Extra Fingers (Polydactyly),Eyebrow Mover,Blood Type,Y-DNA Haplogroup (ISOGG),Can You Spread Your Toes?,Penis Length,Hair Type,Do you like the taste of bananas?,Night Owl/Early Bird,Carpal tunnel,Are you a map lover?,Horseshoe kidney,Cushing's Disease,Sang À+,Force,Lichen Sclerosis(LS)/Balanitis Xerotica Obliterates(BXO),"Dark brown,Rheusus Neg B,Olive",Familial hypercholesterolemia,Extra 
Teeth,"dark Blonde,fair skin,brown eyes",Severe Acne,Sleep duration,"Optimistic, empathetic, handles stress well",Response to Codeine,Fibromyalgia Response to SRI drugs,Suicidality,Rosacea,Eye Color - Heterochromia,GREEK DNA,Supernumerary nipples,Dyshydrotic Eczema,Gulf War Illness,"chilblains/perniosis, have you ever had them, yes/no","wide feet, yes/no","green eyes, light brown hair, left-handed, Dupuytren's",Diagnosed celiac disease AND photic sneeze reflex,Therapeutically induced hypothyroidism,Gall bladder disease,Cold sores (herpes),can't stand Tarragon,Ectopic heartbeats,Hypertension,Resistance to Norovirus,blood incompatibility,"Brown hair, Hazel, Caucasian.",Missing canine teeth,Han Arm Creases,Asexual,"Brown hair colour, white skin color",generalized anxiety disorder,MPN myeloproliferative neoplasms,Penis Girth,Fuchs' Disease,Double eyelashes,Left-handed,Two tendons at wrists,Hearing loss genetic cookie bite configuration,abnormal brain shape,Episodic Major Depression,Age achieved full height,Biophilia,Reading speed(WPM),Chimera or mosaic?,"Strongly feel that true ""intelligence"" is more than just IQ",Feel that intelligence is genetically more than just IQ,RCCX,immunological disorder: HLA-b27,Deep Blue,Do you read a lot on the why’s and how’s things work.,"Can you wake up at night, drink coffee and a couple hours later go right back to sleep","nail biter, lip biter or skin picker when stressed",Have a military neck,High kidneys,Allergic to a steroid shot,Wake time before leaving for work,Allergic to bees,Birthing pain,Tattoos,"black hair and brown eyes, blood B+, 6,5 tall",Hypermobile Ehlers Danlos Syndrome,repetitive movements or unwanted sounds,Rh Neg Blood Type WITH Unusual Allergies or Early Onset Degenerative Disorders,tourettes syndrome,Smell of rain,Boldness type A+,Age of first grey hair (scalp),Age of first grey hair (facial),"deep sleep, debilitating sleep pressure",Alcoholism.1,Simian Line(s),Why MBBS Ukraine Is The Best Option For MBBS 
Aspirants?,Which is the cheapest country to study MBBS?,The Cracker Test,Wiggle My Ears,Dry Eye Syndrome / Fishing Eye Syndrome,Allergic to Band-Aid Adhesives / Other Adhesives,Allergy to Macrobid (nitrofurantoin) Antibiotic (Only),Recurrent Urinary Tract Infections (UTI) / Bladder Infections,Spinal Degenerative Bone or Disc Disease Early Onset Genetic-Cause Only,TIA (transient ischemic attack),Highest Education,Look like mother or father,Income,"Темный, очі темні",Blood transfusion,Occipital bun,"Cabelo liso castanho, tipo sanguíneo AB+, pele morena",blood compatibility for transfussion,Hidradenitis suppurativa,Covid19 Symptoms,Cannabis Allergy,MGUS,Gender,Gender Dysphoria,Polycythemia Vera,Mast Cell Activation Syndrome,Ear deformity,VACTERL Association,Grip strength,Nationality.1,Lacrimal Duct Hyoplasia,"brown hair, hazel eyes, caucasian, blood type O, Rh+",Bipolar Disorder (Immediate Family or Personal Diagnosis),"Blonde, blue green eyes, type a+, caucasian fair skin",Blue eyes,Type 1 diabetes,IQ score (2022 research),"IQ score (2022 research, ignore other IQ phenotype as it had typo)",Psoriasis,Psoriatic Arthritis,Alpha-1 Antitrypsin Deficiency,Altitude acclimatisation and sickness,"Black curly hair,brown eyes,olive skin",music preference,autologous blood,"black hair and brown eyes, blood A-, and white skin","brown hair, brown eyes, caucasian, blood type O, Rh+",Congenital Sucrase-Isomaltase Deficiency,Familial Mediterranean Fever,Autism: Systemizing,Musical Ability,Autistic Spectrum Disorder,TCF4 - Schizophrenia,Schizophrenia,Childhood Intelligence,Y-Haplogroup,"white skin, dark blond",Blood type AB Rh+
0,4134,4134.23andme.2800,rather not say,rather not say,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,Impacted molars,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
1,885,885.23andme-exome-vcf.994,1982,XY,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,Right-handed,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,Male,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,6',-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-


In [4]:
# filter the dataset to keep only the columns that may be of interest
colunas_interesse = ["user_id","genotype_filename", "Eye color"]
eye_color_df = phenotypes_df[colunas_interesse]

In [5]:
eye_color_df["Eye color"].value_counts()

-                                                                                                         4648
Brown                                                                                                      408
Blue                                                                                                       214
Hazel                                                                                                      124
Blue-grey                                                                                                  122
Green                                                                                                      116
Dark brown                                                                                                  88
Blue-green                                                                                                  75
Brown-green                                                                                                 68
H

In [6]:
# keep only the rows where eye color was filled in
# (.copy() avoids pandas' SettingWithCopyWarning when a column is added later)
eye_color_df = eye_color_df[eye_color_df["Eye color"] != '-'].copy()
len(eye_color_df)

1678

In [7]:
# organize the entries into 3 categories: Brown, Blue/Green/Gray (BGG), Intermediate (Int)
color_map = {
    "Brown":"Brown",
    "Blue":"BGG",
    "Hazel":"Int",
    "Blue-grey":"BGG",
    "Green":"BGG",
    "Dark brown":"Brown",
    "Blue-green":"BGG",
    "Hazel (brown/green)":"Int",
    "Hazel/light brown":"Int",
    "Gray-blue":"BGG",
    "Blue-grey; broken amber collarette":"BGG",
    "hazel":"Int",
    "Dark blue":"BGG",
    "Green-hazel":"Int",
    "Green-brown":"Int",
    "brown":"Brown",
    "Green-gray":"BGG",
    "blue-green":"BGG",
    "blue":"BGG",
    "Blue, grey, green, changing":"BGG",
    "Blue grey":"BGG",
    "Blue with yellow parts":"BGG",
    "Blue-grey with central heterochromia":"BGG",
    "Light-mixed green":"BGG",
    "blue-grey":"BGG",
    "Blue-green; amber collarette, and gray-blue ringing ":"BGG",
    "Blue with a yellow ring of flecks that make my eyes look green depending on the light or my  mood":"BGG",
    "Brown/black":"Brown",
    "Hazel (light brown, dark green, dark blue)":"Int",
    "Blue-green-grey":"BGG",
    "Brown-amber":"Int",
    "blue, grey, green, changing":"BGG",
    "Amber":"Int",
    "Amber - (yellow/ocre  brown)":"Int",
    "Hazel/Light Brown":"Int",
}
    
eye_color_df["color_cat"] = eye_color_df["Eye color"].map(color_map)
eye_color_df = eye_color_df.dropna(axis=0, subset=["color_cat"])
eye_color_df["color_cat"].value_counts()

BGG      698
Brown    518
Int      263
Name: color_cat, dtype: int64

In [8]:
len(eye_color_df)

1479

In [9]:
eye_color_df.head(2)

Unnamed: 0,user_id,genotype_filename,Eye color,color_cat
3,2953,2953.ftdna-illumina.1885,Dark brown,Brown
16,4135,4135.ftdna-illumina.2801,Blue-grey,BGG


In [10]:
eye_color_df_snp = eye_color_df.copy()
eye_color_df_snp.reset_index(drop = True, inplace = True)

In [11]:
eye_color_df_snp.head(2)

Unnamed: 0,user_id,genotype_filename,Eye color,color_cat
0,2953,2953.ftdna-illumina.1885,Dark brown,Brown
1,4135,4135.ftdna-illumina.2801,Blue-grey,BGG


In [12]:
eye_color_df_snp['user_id'].duplicated().sum()

193

In [13]:
eye_color_df_snp_drop_duplicates = eye_color_df_snp.drop_duplicates(subset='user_id')

In [14]:
len(eye_color_df_snp_drop_duplicates)

1286
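`drop_duplicates(subset='user_id')` keeps the first row for each user, so if a user has several genotype files with different eye color labels, the first label silently wins. The sketch below, on toy data, shows one way to check for such conflicts. The toy dataframe and the variable names are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy data: user 2 appears twice with two different categories (hypothetical values)
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "color_cat": ["Brown", "Brown", "BGG", "Int", "Int"],
})

# Count the distinct categories reported for each user
labels_per_user = toy.groupby("user_id")["color_cat"].nunique()

# Users with more than one distinct category have conflicting labels
conflicting_ids = labels_per_user[labels_per_user > 1].index.tolist()
print(conflicting_ids)

# One possible policy: drop conflicting users, then keep one row per remaining user
consistent = toy[~toy["user_id"].isin(conflicting_ids)].drop_duplicates(subset="user_id")
print(consistent)
```

Whether to drop conflicting users or keep the first label is a design choice; the notebook keeps the first row.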

In [15]:
# save this data
import os
os.makedirs('datasets', exist_ok=True)  
eye_color_df_snp_drop_duplicates.to_csv('datasets/eye_color_df_snp.csv', index=False)  

## Reading the SNPs from the individual files

In this step, a dataset is built containing the SNPs we want to analyze together with the user ID of each user.

1. The previous file has a column with the name of the individual file holding each user's genetic data. The values in this column were processed into a pattern used to locate these individual files (using glob). The result is an auxiliary dataset with a column containing the original file name for each user.
2. The [snps](https://pypi.org/project/snps/) library was used to parse each individual's genetic data into a pandas DataFrame.
3. Each individual's dataframe was filtered down to the SNPs we want to save, and each SNP of interest was appended to a list used to create a dataframe with the SNPs of interest for each user.
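Step 1 can be illustrated in isolation. The sketch below turns a `genotype_filename` value such as `2953.ftdna-illumina.1885` into a glob pattern. The helper name `build_glob_pattern` and its `data_dir` parameter are hypothetical. The trailing underscore after the file number is an assumption based on the file names shown in this notebook's output (`user2953_file1885_yearofbirth_...`); it is there so that `file1885` does not also match `file18851`.

```python
def build_glob_pattern(genotype_filename, data_dir="opensnp_alldata"):
    """Build a glob pattern for one user's genotype file (hypothetical helper)."""
    # "2953.ftdna-illumina.1885" -> ["2953", "ftdna-illumina", "1885"]
    parts = genotype_filename.split(".")
    user_id, file_id = parts[0], parts[-1]
    # Assumed naming scheme on disk: user<ID>_file<N>_<anything>.txt
    return "{}/user{}_file{}_*.txt".format(data_dir, user_id, file_id)


pattern = build_glob_pattern("2953.ftdna-illumina.1885")
print(pattern)
```

The resulting string can be passed to `glob.glob`, which returns a list that is empty when no file matches.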

In [16]:
# importing the required libraries
from snps import SNPs 
import glob

### Creating the auxiliary dataset

In [17]:
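def get_true_file_name():
    """Find the genotype file on disk for each user in the phenotype table."""
    user_id_list = []
    doc_name_list = []
    indexes = []

    for index, row in eye_color_df_snp_drop_duplicates.iterrows():
        user_id = row['user_id']
        filename = row['genotype_filename']

        # e.g. "2953.ftdna-illumina.1885" -> user 2953, file 1885
        split = filename.split('.')
        if len(split) < 3:
            # unexpected file name format; skip this user
            continue
        glob_pattern = "user{}_file{}".format(split[0], split[2])

        matches = glob.glob("opensnp_alldata/{}*.txt".format(glob_pattern))
        if not matches:
            # no genotype file found on disk for this user; skip
            continue

        user_id_list.append(user_id)
        doc_name_list.append(matches[0])
        indexes.append(index)

    return user_id_list, doc_name_list, indexes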

In [18]:
user_id_list, doc_name_list, indexes = get_true_file_name()

In [19]:
def create_dataframe(user_id_list, doc_name_list, indexes):
    return pd.DataFrame({"user_id":user_id_list, "filename":doc_name_list, "index": indexes})

In [20]:
true_filenames = create_dataframe(user_id_list, doc_name_list, indexes)

In [21]:
true_filenames_drop_duplicates = true_filenames.drop_duplicates(subset='user_id')

In [22]:
# save this data
os.makedirs('datasets', exist_ok=True)  
true_filenames_drop_duplicates.to_csv('datasets/true_filenames.csv', index=False)

In [23]:
true_filenames_drop_duplicates.head(2)

Unnamed: 0,user_id,filename,index
0,2953,opensnp_alldata\user2953_file1885_yearofbirth_...,0
1,4135,opensnp_alldata\user4135_file2801_yearofbirth_...,1


### List of SNPs we want to save
The cell below lists SNPs with predictive value for eye color according to scientific articles in the field.

In [24]:
snps_eye_color = [
    'rs12913832',
    'rs1800407',
    'rs12896399',
    'rs16891982',
    'rs1393350',
    'rs12203592',
    'rs1129038',
    'rs116363232',
    'rs1289399']

### Procedure that goes through all the files to:
* Filter the individual data
* Build a dataset with the SNPs from the `snps_eye_color` list
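The filtering step for a single individual can be sketched on a toy table. The example assumes a dataframe indexed by rsid with a `genotype` column. Instead of filtering and then handling absent SNPs one by one, it uses `reindex`, which returns the SNPs in a fixed order and leaves gaps that can be filled with a placeholder. The genotype values and the id `rs0000001` are made up for illustration.

```python
import pandas as pd

snps_of_interest = ["rs12913832", "rs1800407", "rs16891982"]

# Toy stand-in for one user's parsed file: rsid index plus a genotype column.
# Genotype values and the id "rs0000001" are made up.
toy = pd.DataFrame(
    {"genotype": ["AG", "CC", "TT"]},
    index=pd.Index(["rs12913832", "rs1800407", "rs0000001"], name="rsid"),
)

# reindex keeps the requested order and yields NaN for SNPs absent from the file;
# fillna then marks those as "missing"
row = toy["genotype"].reindex(snps_of_interest).fillna("missing")
print(row.to_dict())
```

Note that `reindex` requires the rsid index to be unique, so this sketch only applies to files without duplicated rsids.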

In [25]:
# one list of genotypes per SNP of interest, filled in file order
snp_dict = {snp: [] for snp in snps_eye_color}

user_id_list2 = []

for index, row in true_filenames_drop_duplicates.iterrows():
    try:
        s = SNPs(row['filename'])
        df = s.snps[['genotype']]

        # one-row dataframe with one column per SNP of interest found in this file
        df_t = df.filter(items=snps_eye_color, axis=0).T

        for snp in snps_eye_color:
            try:
                snp_dict[snp].append(df_t[snp].iloc[0])
            except (KeyError, IndexError):
                # SNP not present in this user's file
                snp_dict[snp].append('missing')

        user_id_list2.append(row['user_id'])

        # number of SNPs of interest found in this file
        print(len(df.filter(items=snps_eye_color, axis=0)))

    except Exception:
        # the file could not be parsed; report it and move on
        print(r"ERRO!!! NO ARQUIVO {}".format(row['filename']))


2
2
6
8
6
8
[output truncated: one line per file with the number of SNPs of interest found (values from 0 to 8, mostly 8), interleaved with repeated pandas warning fragments (`df = pd.read_csv(`) and occasional `no SNPs loaded...` messages. Three files failed:]
ERRO!!! NO ARQUIVO opensnp_alldata\user4088_file2768_yearofbirth_unknown_sex_unknown.23andme.txt
ERRO!!! NO ARQUIVO opensnp_alldata\user881_file429_yearofbirth_unknown_sex_XY.ftdna-illumina.txt
ERRO!!! NO ARQUIVO opensnp_alldata\user158_file66_yearofbirth_unknown_sex_unknown.23andme.txt


8


  df = pd.read_csv(


8
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2
6


  df = pd.read_csv(


6


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


6
0
6


  df = pd.read_csv(
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["chrom"] = df["chrom"].map(


ERRO!!! NO ARQUIVO opensnp_alldata\user1964_file1683_yearofbirth_1990_sex_XY.23andme.txt


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(
no SNPs loaded...


8
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
ERRO!!! NO ARQUIVO opensnp_alldata\user2199_file1344_yearofbirth_1960_sex_XX.23andme.txt


  df = pd.read_csv(


8
6
6


  df = pd.read_csv(


6


  df = pd.read_csv(


8


  df = pd.read_csv(


8
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


6
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
ERRO!!! NO ARQUIVO opensnp_alldata\user2287_file1400_yearofbirth_unknown_sex_unknown.23andme.txt


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
ERRO!!! NO ARQUIVO opensnp_alldata\user2322_file1430_yearofbirth_unknown_sex_unknown.23andme.txt
0


  df = pd.read_csv(


8
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
ERRO!!! NO ARQUIVO opensnp_alldata\user2398_file1490_yearofbirth_1963_sex_XX.23andme.txt


  df = pd.read_csv(


8
6


  df = pd.read_csv(


6
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6
6
0
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6
2


  df = pd.read_csv(


8
6
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8
6


  df = pd.read_csv(
no SNPs loaded...


8
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


6


  df = pd.read_csv(


6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


no SNPs loaded...


0


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8
0


  df = pd.read_csv(


8
2
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(
no SNPs loaded...


8
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6
6


  df = pd.read_csv(


8


  df = pd.read_csv(
no SNPs loaded...


8
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6
6
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(
no SNPs loaded...


8
0
7


  df = pd.read_csv(


8
6
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
ERRO!!! NO ARQUIVO opensnp_alldata\user3930_file2629_yearofbirth_unknown_sex_unknown.23andme.txt


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8
0
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8
6
6


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(
no SNPs loaded...


8
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8
ERRO!!! NO ARQUIVO opensnp_alldata\user4676_file3279_yearofbirth_unknown_sex_unknown.23andme.txt


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8
0
ERRO!!! NO ARQUIVO opensnp_alldata\user4192_file2845_yearofbirth_unknown_sex_unknown.23andme.txt
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8
7


  df = pd.read_csv(


8
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(
no SNPs loaded...


8
0


  df = pd.read_csv(


8
0
ERRO!!! NO ARQUIVO opensnp_alldata\user1111_file3026_yearofbirth_unknown_sex_unknown.ftdna-illumina.txt


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6
6
6
6


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2
6


  df = pd.read_csv(


8
6


  df = pd.read_csv(


9


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8
6
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2
5
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6
5


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8
2
5
6
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8
6
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
5
5
5
2
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


no SNPs loaded...


2
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8
5
5


  df = pd.read_csv(


8
2
5


  df = pd.read_csv(


8
0


  df = pd.read_csv(


8


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2
5
7
2
8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8
6
6
8


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8
5
5
2


  df = pd.read_csv(


8
2
6


  df = pd.read_csv(


8
5
2
2


  df = pd.read_csv(


8


  df = pd.read_csv(


6
2
5


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8
2


  df = pd.read_csv(


9
5
5
5
5
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8
5


  df = pd.read_csv(


9
7
7
5


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


6
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6
5
6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8
5
6


  df = pd.read_csv(


9


  df = pd.read_csv(


8
ERRO!!! NO ARQUIVO opensnp_alldata\user6373_file4863_yearofbirth_unknown_sex_XY.23andme-exome-vcf.txt


  df = pd.read_csv(


8
2


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


9
ERRO!!! NO ARQUIVO opensnp_alldata\user6667_file5067_yearofbirth_unknown_sex_unknown.IYG.txt


  df = pd.read_csv(


9


  df = pd.read_csv(


8
2


  df = pd.read_csv(


9
2


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


9
5


  df = pd.read_csv(


8
0


  df = pd.read_csv(


9
6


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


9
6


  df = pd.read_csv(


9
2
0
2


  df = pd.read_csv(


9
2


  df = pd.read_csv(


9
2
5
5
5
6


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


8
ERRO!!! NO ARQUIVO opensnp_alldata\user9079_file7418_yearofbirth_1997_sex_XY.ftdna-illumina.txt


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2
5


  df = pd.read_csv(


8
2


  df = pd.read_csv(


8
5


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


9
2
5
5


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


8
5
6
5


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


9
7
5


  df = pd.read_csv(
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["chrom"] = df["chrom"].map(


9
5


  df = pd.read_csv(


9


  df = pd.read_csv(


9
2


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


9
5
6


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


8
2
6


  df = pd.read_csv(


9


  df = pd.read_csv(


8
2
7
6
2
5


  df = pd.read_csv(


9


  df = pd.read_csv(


9
2
7
5
5
2


  df = pd.read_csv(


9
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


9
5
8


  df = pd.read_csv(


8
2


  df = pd.read_csv(
no SNPs loaded...


8
0


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8
5
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2
2
8
2
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8
7
7


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


9
5
7


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


9
2
2


  df = pd.read_csv(


9
5
2


  df = pd.read_csv(


9


  df = pd.read_csv(


9
7


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


9
2


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


9
5


  df = pd.read_csv(


8


  df = pd.read_csv(


8
7


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


6


  df = pd.read_csv(


8
2
7


  df = pd.read_csv(


9
7
2


  df = pd.read_csv(


9


  df = pd.read_csv(


9
7


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


8
7


  df = pd.read_csv(


9
6


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


9
5


  df = pd.read_csv(


9
5


  df = pd.read_csv(


8


no SNPs loaded...


0
7
5


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


9
2


  df = pd.read_csv(


9
5


  df = pd.read_csv(


9
6


  df = pd.read_csv(


9


  df = pd.read_csv(


7
8


  df = pd.read_csv(


9
7


  df = pd.read_csv(


9
ERRO!!! NO ARQUIVO opensnp_alldata\user8655_file7006_yearofbirth_1983_sex_XX.ancestry.txt
6


  df = pd.read_csv(


9
2


  df = pd.read_csv(


8
2
7


  df = pd.read_csv(


9


  df = pd.read_csv(


9
8
5


  df = pd.read_csv(


9


  df = pd.read_csv(


8
6


  df = pd.read_csv(


8
2
2
7


  df = pd.read_csv(


9
7
7
7


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


8
6
2
5


  df = pd.read_csv(


8
5
8
2


  df = pd.read_csv(


9
7
7
7


no SNPs loaded...


0
7
7
2


  df = pd.read_csv(


9
7


  df = pd.read_csv(


9


  df = pd.read_csv(


8
2


  df = pd.read_csv(


9
7


  df = pd.read_csv(


8


  df = pd.read_csv(


6


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


8
2


  df = pd.read_csv(


9
7
7
0


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


8
5


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


9
5


  df = pd.read_csv(


9


  df = pd.read_csv(


9
7
7
2


  df = pd.read_csv(


9
8


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


9
6


  df = pd.read_csv(


9
ERRO!!! NO ARQUIVO opensnp_alldata\user9548_file7831_yearofbirth_unknown_sex_unknown.ancestry.txt


  df = pd.read_csv(


9


  df = pd.read_csv(


9
7


  df = pd.read_csv(


6


  df = pd.read_csv(


8


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


9
7


  df = pd.read_csv(


9


  df = pd.read_csv(


9
7


  df = pd.read_csv(


9
6


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


9
7
7
5


  df = pd.read_csv(


9


  df = pd.read_csv(


9
5


  df = pd.read_csv(


9
7
7


  df = pd.read_csv(


9
7


  df = pd.read_csv(


8
7


  df = pd.read_csv(


9


  df = pd.read_csv(


9
7


  df = pd.read_csv(


9
7


  df = pd.read_csv(


9


  df = pd.read_csv(


9
ERRO!!! NO ARQUIVO opensnp_alldata\user9697_file7982_yearofbirth_1967_sex_XY.23andme.txt
6


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


8
7


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


8
7


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


9
7
7
7


  df = pd.read_csv(


8
8


  df = pd.read_csv(


9
8


  df = pd.read_csv(


9
8


  df = pd.read_csv(


8


  df = pd.read_csv(


8
5


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


9


  df = pd.read_csv(


9
7


no SNPs loaded...


0
6
5
7


  df = pd.read_csv(


9


  df = pd.read_csv(


9


  df = pd.read_csv(


9
7


  df = pd.read_csv(


9
6


  df = pd.read_csv(


9
7


  df = pd.read_csv(


8
7
8


  df = pd.read_csv(


9
7


  df = pd.read_csv(


9
ERRO!!! NO ARQUIVO opensnp_alldata\user10627_file8878_yearofbirth_unknown_sex_unknown.ancestry.txt
5
5
7


  df = pd.read_csv(


9
7
6


  df = pd.read_csv(


9


  df = pd.read_csv(


8


  df = pd.read_csv(


9
6
7
9


  df = pd.read_csv(


9


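The log above interleaves library warnings with the messages that actually matter (the files that could not be read). One way to keep such a run readable is to silence warnings inside the loop and collect failures in a list instead of printing them. The following is a minimal sketch only: `load_quietly` and its `reader` argument are hypothetical names that do not appear in the notebook, and `reader` stands in for whatever function parses one genotype file.

```python
import warnings


def load_quietly(paths, reader):
    """Apply `reader` to each path, hiding warnings and recording failures.

    `reader` is a placeholder for the notebook's per-file parsing step.
    Returns (loaded, failed): a dict of path -> result and a list of
    (path, error description) tuples.
    """
    loaded, failed = {}, []
    for path in paths:
        try:
            # Warnings raised while reading this file are suppressed
            # only inside this block.
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                loaded[path] = reader(path)
        except Exception as exc:  # broad on purpose: files can fail in many ways
            failed.append((path, repr(exc)))
    return loaded, failed
```

After the loop, `failed` can be printed once or written to a text file, which yields the same list of problem files without the surrounding noise.
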
In [26]:
# Building the dataset from the dictionary and the lists
snp_dict_user = snp_dict.copy()
snp_dict_user['user_id'] = user_id_list2
df_final = pd.DataFrame(snp_dict_user)
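
`pd.DataFrame` built from a dictionary of lists requires every list to have the same length, so a mismatch between the genotype lists and `user_id_list2` would raise an error at this point. The sketch below illustrates the same construction on toy values (the genotypes and the two-SNP dictionary are made up for the example) and adds an explicit length check before the DataFrame is created.

```python
import pandas as pd

# Toy stand-ins for the notebook's objects; the values are illustrative only.
snp_dict = {
    "rs12913832": ["missing", "AG"],
    "rs1800407": ["CC", "TC"],
}
user_id_list2 = [2953, 4135]

# Every genotype list must line up with the list of user ids.
lengths = {len(values) for values in snp_dict.values()} | {len(user_id_list2)}
assert len(lengths) == 1, f"column lengths differ: {sorted(lengths)}"

snp_dict_user = snp_dict.copy()  # shallow copy: the lists themselves are shared
snp_dict_user["user_id"] = user_id_list2
df_example = pd.DataFrame(snp_dict_user)
```

The check fails early with a readable message instead of leaving the length error to pandas.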

In [27]:
df_final.head(2)

| | rs12913832 | rs1800407 | rs12896399 | rs16891982 | rs1393350 | rs12203592 | rs1129038 | rs116363232 | rs1289399 | user_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | missing | CC | missing | missing | missing | missing | TC | missing | missing | 2953 |
| 1 | missing | TC | missing | missing | missing | missing | TT | missing | missing | 4135 |


In [28]:
df_final_drop_duplicates = df_final.drop_duplicates()

In [29]:
len(df_final_drop_duplicates)

1264
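
Called without arguments, `drop_duplicates()` compares entire rows, `user_id` included. A user who uploaded several files therefore collapses to one row only when all genotype calls agree; files with differing calls remain as separate rows. The sketch below, on made-up toy data, shows the difference between full-row deduplication and keeping one row per user via `subset="user_id"`. Which behaviour is wanted for the final dataset is a design decision this example does not settle.

```python
import pandas as pd

# Toy data: user 1 has three files, two identical and one with a missing call.
df_toy = pd.DataFrame(
    {
        "rs12913832": ["AG", "AG", "missing", "GG"],
        "rs1800407": ["CC", "CC", "CC", "TC"],
        "user_id": [1, 1, 1, 2],
    }
)

# Same call as in the notebook: only fully identical rows are removed.
full_row = df_toy.drop_duplicates()

# Alternative: keep the first row seen for each user.
per_user = df_toy.drop_duplicates(subset="user_id", keep="first")

# Number of rows whose user_id already appeared after full-row deduplication.
repeated = int(full_row["user_id"].duplicated().sum())
```

A non-zero `repeated` on the real data would indicate users who still appear more than once among the 1264 rows.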

In [30]:
# Saving the dataframe
df_final_drop_duplicates.to_csv('datasets/users_snps.csv', index=False)
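
`to_csv` does not create missing folders, so the `datasets/` directory has to exist before the call (for instance via `os.makedirs('datasets', exist_ok=True)`). As a quick sanity check on the saved file, the sketch below writes a toy frame to a temporary directory, reads it back, and computes the share of `missing` genotypes per SNP. The toy values and the names `df_toy`, `reloaded` and `missing_share` are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same layout as the saved dataset (values are illustrative).
df_toy = pd.DataFrame(
    {
        "rs12913832": ["missing", "AG"],
        "rs1800407": ["CC", "TC"],
        "user_id": [2953, 4135],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "users_snps.csv")
    df_toy.to_csv(csv_path, index=False)
    # The literal string "missing" is not one of pandas' default NA markers,
    # so it is read back as ordinary text.
    reloaded = pd.read_csv(csv_path)

# Fraction of users with no call for each SNP column.
missing_share = (reloaded.drop(columns="user_id") == "missing").mean()
```

On the real file, SNPs with a very high missing share may be candidates for removal before training a model.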