# Data Overview

Before we dig too deeply into the data, we should understand its format a little better.

At its core, the OpenSNP dataset has 4 components: a CSV file, several text files, several image files, and another CSV file corresponding to the image files. We will be ignoring the image data during this exploration.

The CSV file is delimited by ';'. Each row corresponds to a different user, and each column corresponds to a question that a user could answer. Users do not need to answer any question; in fact, most of them do not. To add further complication, the answers are primarily free-form and not normalized to any standard.
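To make that layout concrete, here is a minimal sketch on a made-up three-user table. The column names and answers are hypothetical, and it assumes that an unanswered question is stored as `-`, which is the convention the counting code later in this notebook relies on.

```python
import io
import pandas as pd

# Hypothetical miniature of the phenotype file:
# one row per user, one column per question, ';' as the delimiter.
toy_csv = io.StringIO(
    "user_id;Eye color;Handedness\n"
    "1;Brown;-\n"
    "2;-;-\n"
    "3;blue-green;right\n"
)
toy_df = pd.read_csv(toy_csv, sep=";")

# Keep only the question columns (user_id is an identifier, not an answer).
questions = toy_df.drop(columns=["user_id"])

# Assumption: "-" marks an unanswered question.
answered_per_question = (questions != "-").sum(axis=0)
answered_per_user = (questions != "-").sum(axis=1)

print(answered_per_question)
print(answered_per_user)
```

User 2 answered nothing, and the two eye color answers use different vocabularies, which is the kind of free-form variation we will need to deal with later.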

Each text file corresponds to a single user (and a single user may have more than 1 text file associated with them). The text files contain a list of genetic variants that were identified in the user's genetic data.

The genetic variant information in this dataset was uploaded by users after being collected and analyzed through various services such as [23andMe](https://www.23andme.com/), [deCODEme](https://www.decode.com/decode-launches-decodeme/), or [FamilyTreeDNA](https://www.familytreedna.com/) (the source is listed in the `.txt` files). The survey information was collected through [OpenSNP](https://www.opensnp.org/).

In [1]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
!ls ../data/*csv

../data/phenotypes_201811100342.csv
../data/picture_phenotypes_201811100342.csv


In [3]:
!ls ../data/*txt | wc -l

4704


In [4]:
!ls ../data/* | wc -l

4708


In [5]:
# Assign the file that we are looking at
data_file = "../data/phenotypes_201811100342.csv"

In [6]:
# We already know the name of the column we are interested in
col_name_myers = 'Myers-Briggs Type Indicator'

In [7]:
# Load the CSV into a dataframe
df = pd.read_csv(data_file, delimiter=";")

# Initial Data Exploration
Now that we understand the data and we have loaded it into a dataframe, we will take a look at it.

Let's start by looking at the number of users and the number of questions, and then determine how many people answered each question.

## Summary CSV File

In [None]:
# Print out the row, column count
df.shape

In [None]:
# Display the first 5 user submissions
df.head()

In [None]:
# Print out each column by alphabetical order
for column in sorted(df.columns):
    print(column)

In [None]:
# For each column, count the entries that differ from the '-' placeholder
(df != '-').sum(axis = 0, skipna = True)

In [None]:
# Let's get the same count, but only for the column we care about
(df != '-').sum(axis = 0, skipna = True)[col_name_myers]


What's going on here? The website said there were 299 entries.

Upon closer inspection, we see that the website has a section labeled "Already answered on previous submission." That probably accounts for the missing ten, and it would likely indicate that we have 10 users who submitted twice. We'll look into this later.
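As a rough sketch of how such a check could look, the toy example below compares the number of answered rows with the number of distinct users. It assumes the dataframe has a per-user identifier column, here called `user_id`. That column name and the sample values are assumptions made for illustration, not something confirmed above.

```python
import pandas as pd

col_name = "Myers-Briggs Type Indicator"

# Hypothetical data: user 1 appears twice (two submissions), user 2 did not answer.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    col_name: ["INTJ", "INTJ", "-", "ENFP"],
})

# Rows where the question was actually answered
answered = toy[toy[col_name] != "-"]

n_rows = len(answered)                    # answered rows, repeats included
n_users = answered["user_id"].nunique()   # distinct users who answered

# Users who show up more than once among the answered rows
repeat_users = answered.loc[
    answered["user_id"].duplicated(keep=False), "user_id"
].unique()

print(n_rows, n_users, list(repeat_users))
```

If the row count and the distinct-user count differ on the real data, the difference is the number of repeated submissions for that question.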

For now let's plot some of the data out.

In [None]:
# Add a new column to the dataframe with a count of how many answers were given
count_df = (df != '-').sum(axis = 1, skipna = True)
df.insert(1, column='count', value=count_df)

In [None]:
df
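Before plotting, it can help to see how the per-user answer counts are distributed. The sketch below uses made-up data and a hypothetical `user_id` column. Unlike the cell above, it leaves identifier columns out of the count, since a comparison against `-` on every column also counts identifier and metadata fields as answers.

```python
import pandas as pd

# Hypothetical data: four users, two questions, "-" meaning unanswered.
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "Eye color": ["Brown", "-", "Blue", "-"],
    "Handedness": ["right", "-", "-", "-"],
})

# Count answers over question columns only
question_cols = [c for c in toy.columns if c != "user_id"]
answer_counts = (toy[question_cols] != "-").sum(axis=1)

# How many users gave 0, 1, 2, ... answers
users_per_count = answer_counts.value_counts().sort_index()
print(users_per_count)
```

The resulting series maps a number of answers to the number of users who gave that many, which is a convenient input for a bar chart.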

## Individual User Files

Now that we have taken a look at the summary file, let's take a look at a single user file and then see how we can load all relevant user files into a dataframe for analysis.