# Unique Phenotypes

You have been given a file of phenotypes that have been assigned to individual cardiology patients. The data was supposed to be entered as a table with two columns per patient: the first column holding a single primary phenotype, and the second holding a comma-separated list of secondary phenotypes for that patient. Unfortunately, data entry did not follow this prescription and was inconsistent. Sometimes multiple phenotypes appear in column one, and sometimes a delimiter other than a comma is used.

Your task is to provide an alphabetically sorted list of unique phenotypes encountered in the data. Missing values are denoted by an empty string and should **not** be treated as a phenotype.

As a side note, the researcher responsible for this data would also like to know what delimiters were used in data entry.

I have used pandas to read the data in and convert it to a list of strings, one string per row, for you to start with.

In [None]:
import pandas as pd

In [None]:
data = pd.read_excel("./create_uniq_phenotypes.xlsx", header=None)
print(data.columns)
data.head(50)

In [None]:
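# Join the two columns with a comma (not a space) so the primary phenotype
# is not fused with the first secondary phenotype when we split on commas.
data = [",".join(d) for d in data.fillna("").values.tolist()]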


In [None]:
data
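
For the side question about delimiters, one rough way to get candidates is to count every character that is neither a letter, a digit, nor whitespace. The sketch below runs on `sample_rows`, a made-up stand-in for the real `data` list. The counts still need a human look, since characters such as hyphens can be part of a phenotype name rather than a delimiter.

```python
from collections import Counter

# Hypothetical rows standing in for the real `data` list.
sample_rows = [
    "cardiomyopathy, arrhythmia; heart failure",
    "long qt syndrome|syncope",
    "",
]

# Count each character that is not alphanumeric and not whitespace.
delimiter_counts = Counter(
    ch
    for row in sample_rows
    for ch in row
    if not (ch.isalnum() or ch.isspace())
)

print(delimiter_counts.most_common())
```

Swapping `sample_rows` for `data` should give the candidate delimiters for the actual file.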

### Create a variable to store an empty list

In [None]:
datalist = []

#### We want to test what happens when we use a comma to split a string, but the string doesn't have a comma in it.

In [None]:
test1 = "Brian Chapman"
test2 = "Chapman, Brian"
test1.split(","), test2.split(",")

#### With or without a comma, we end up with a list

The class decided the best way to add the lists generated by the split to `datalist` was with the list `extend` method.

In [None]:
for d in data:
    datalist.extend(d.split(","))
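
To see why `extend` fits better than `append` here, this small sketch adds the same split result to two empty lists. The variable names are illustrative only.

```python
pieces = "Chapman, Brian".split(",")

appended = []
appended.append(pieces)  # adds the whole list as a single element

extended = []
extended.extend(pieces)  # adds each string as its own element

print(appended)  # [['Chapman', ' Brian']]
print(extended)  # ['Chapman', ' Brian']
```

With `append` we would end up with a list of lists, which cannot be turned into a set of phenotype strings directly.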

#### When we split on the comma, we end up with leading and trailing whitespace

It was decided we could use the string `strip` method to remove the leading and trailing whitespace.

In a second pass, we decided to convert everything to lowercase so that entries differing only in capitalization count as the same phenotype.

In [None]:
datalist = [d.strip().lower() for d in datalist]

In [None]:
datalist[:5]

In [None]:
len(datalist)
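
Because the data entry also used delimiters other than the comma, a comma-only split can leave several phenotypes fused in one string. A possible extension is to split on a set of delimiters with `re.split`. The delimiter set (`,`, `;`, `|`), the helper name `split_phenotypes`, and the sample rows below are assumptions for illustration, not something taken from the real file.

```python
import re

# Assumed delimiter set; adjust it after inspecting the real data.
DELIMITERS = re.compile(r"[,;|]")


def split_phenotypes(row):
    """Split one row on any assumed delimiter, then strip and lowercase each piece."""
    return [piece.strip().lower() for piece in DELIMITERS.split(row)]


# Hypothetical rows standing in for the real `data` list.
sample_rows = ["HCM; Atrial Fibrillation, syncope", "", "DCM|heart failure"]

pieces = []
for row in sample_rows:
    pieces.extend(split_phenotypes(row))

print(pieces)
# ['hcm', 'atrial fibrillation', 'syncope', '', 'dcm', 'heart failure']
```

An empty row still produces an empty string, so the empty value has to be dropped afterwards, just as in the comma-only version.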

#### To get unique phenotypes, we convert the list to a set

In [None]:
data_set = set(datalist)

In [None]:
len(data_set)

In [None]:
# data_list_order is not defined until later, so look up the method on the list type itself
help(list.sort)

#### We use the `remove` method of a set to get rid of the empty-string phenotype

In [None]:
data_set.remove('')

In [None]:
data_set
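
One thing to keep in mind: `remove` raises a `KeyError` when the element is not in the set, for example if the cell is run a second time. The set method `discard` does the same job but is silent when the element is missing. A small sketch on a made-up set:

```python
phenotypes = {"", "hcm", "dcm"}  # hypothetical values

phenotypes.remove("")  # succeeds because "" is present

try:
    phenotypes.remove("")  # "" is already gone, so this raises KeyError
    second_remove_failed = False
except KeyError:
    second_remove_failed = True

phenotypes.discard("")  # no error even though "" is absent

print(second_remove_failed)  # True
print(sorted(phenotypes))    # ['dcm', 'hcm']
```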

#### In order to sort, we convert the set back to a list

In [None]:
data_list_order = list(data_set)

In [None]:
data_list_order.sort(reverse=False)

In [None]:
data_list_order

#### The sorted list shows the expected order: Python sorts strings by the ordinal position (code point) of their characters in the ASCII/Unicode definitions
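
The effect of sorting by code point is easiest to see with mixed case and digits. In this sketch the phenotype names are made up; `ord` gives the code point of a character, and `key=str.casefold` is one way to get a case-insensitive order instead.

```python
names = ["syncope", "Brugada syndrome", "atrial fibrillation", "3-vessel disease"]

# Default sort: digits come before uppercase letters, which come before lowercase.
by_code_point = sorted(names)
print(by_code_point)
# ['3-vessel disease', 'Brugada syndrome', 'atrial fibrillation', 'syncope']

# Code points of the first characters explain the order.
first_char_codes = [ord(name[0]) for name in by_code_point]
print(first_char_codes)
# [51, 66, 97, 115]

# Case-insensitive alternative.
ignoring_case = sorted(names, key=str.casefold)
print(ignoring_case)
# ['3-vessel disease', 'atrial fibrillation', 'Brugada syndrome', 'syncope']
```

Since we lowercased every phenotype earlier, uppercase letters do not affect the order in this notebook, but digits and punctuation still sort ahead of letters.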