# Unique Phenotypes

You have been given a file of phenotypes that have been assigned to individual cardiology patients. The data was supposed to be entered as a table with two columns per patient. The first column was to hold a single, primary phenotype for the patient. The second column was to hold a comma-separated list of secondary phenotypes for that patient. Unfortunately, data entry did not follow this prescription and was inconsistent. Sometimes multiple phenotypes are provided in column one, and sometimes a delimiter other than a comma is used.

Your task is to provide an alphabetically sorted list of unique phenotypes encountered in the data. Missing values are denoted by an empty string and should **not** be treated as a phenotype.

As a side note, the researcher responsible for this data would also like to know what delimiters were used in data entry.
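One possible way to approach the delimiter question is to count every character that is not a letter, digit, underscore, or whitespace. The sketch below is only an illustration: `find_delimiters` is a hypothetical helper, and the sample rows are made up rather than taken from the spreadsheet.

```python
import re
from collections import Counter


def find_delimiters(rows):
    """Count characters that are not letters, digits, underscores, or whitespace."""
    counts = Counter()
    for row in rows:
        # [^\w\s] matches any single character that is neither a word
        # character nor whitespace, i.e. punctuation-like candidates.
        counts.update(re.findall(r"[^\w\s]", row))
    return counts


# Made-up rows for illustration; not taken from the spreadsheet.
sample_rows = ["AS, BAV", "HCM; DCM", "TOF/VSD", "ASD"]
delimiter_counts = find_delimiters(sample_rows)
print(delimiter_counts)
```

Punctuation can also be a legitimate part of a phenotype name (a hyphen, for example), so the counts are candidates to inspect by eye rather than a definitive list of delimiters.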

I have used Pandas to read the data in and convert it to a list for you to start with.

In [None]:
import pandas as pd

In [None]:
data = pd.read_excel("./create_uniq_phenotypes.xlsx", header=None)
print(data.columns)
data.head(50)

### "," vs " "

My original code created a string from each row by joining the columns with a single space (`" "`). This made it look as if the first term in the second column was part of the last term in the first column. For example,

a row with `AS, BAV` in column one and `Cardiology` in column two would appear to contain "AS" and "BAV Cardiology", whereas it should contain "AS", "BAV", and "Cardiology".

Joining with `", "` instead keeps the three terms properly separated.
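The difference between the two joins can be checked on a single made-up row that mirrors the example above (column one holds `AS, BAV`, column two holds `Cardiology`). The variable names here are illustrative only.

```python
# Made-up row mirroring the example: two columns, the first holding two terms.
row = ["AS, BAV", "Cardiology"]

# Joining with a space glues "Cardiology" onto the preceding term.
space_terms = " ".join(row).split(",")

# Joining with ", " leaves a comma between every term.
comma_terms = ", ".join(row).split(",")

print(space_terms)
print(comma_terms)
```

The leading spaces left on some terms are removed later with `strip()`.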

In [None]:
data = [", ".join(d) for d in data.fillna("").values.tolist()]


In [None]:
data 


In [None]:
phenotypes = []
for string in data:
    phenotypes.extend(string.split(","))
    
print(phenotypes)

#### Here is a list comprehension approach

```Python
phenotypes = [p.strip().upper() for p in phenotypes if p.strip()]
```

Because we want to drop the empty strings from our phenotypes, the list comprehension includes a boolean test. We only keep the `p` (the strings within `phenotypes`) that are still non-empty after `strip()`. Testing `p` alone would keep whitespace-only strings such as `" "`, which turn into empty strings once stripped.

In [None]:
phenotypes = [p.strip().upper() for p in phenotypes if p.strip()]
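As a small sketch of how the boolean test behaves, the example below compares a filter on `p` with a filter on `p.strip()` using made-up fragments. Because missing values were joined with `", "`, splitting on `","` can leave whitespace-only strings, which are truthy even though they are empty after stripping.

```python
# Made-up fragments, as might result from splitting "AS, , bav," on ",".
raw = ["AS", " ", " bav", ""]

# Testing p keeps the whitespace-only entry, which becomes "" after strip().
kept_by_truthiness = [p.strip().upper() for p in raw if p]

# Testing p.strip() drops both the empty and the whitespace-only entries.
kept_by_stripped = [p.strip().upper() for p in raw if p.strip()]

print(kept_by_truthiness)
print(kept_by_stripped)
```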

In [None]:
phenotype_set = set(phenotypes)

In [None]:
phenotype_set

In [None]:
plist = list(phenotype_set)
plist


In [None]:
help(plist.sort)

In [None]:
plist.sort(key=len, reverse=True)
plist

#### Anonymous functions

```Python
lambda x: x[-1]
```

This declares an **anonymous function** that takes a single argument `x` and returns the last element of `x`.
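For comparison, here is a minimal sketch showing that the lambda behaves the same as a named function when used as a sort key. The name `last_char` and the sample terms are made up for illustration.

```python
def last_char(x):
    """Named equivalent of `lambda x: x[-1]`."""
    return x[-1]


# The same function written anonymously and bound to a name for comparison.
last_char_lambda = lambda x: x[-1]

# Made-up terms; their last characters are "V", "S", and "M".
terms = ["BAV", "AS", "HCM"]
by_named = sorted(terms, key=last_char)
by_lambda = sorted(terms, key=last_char_lambda)

print(by_named)
print(by_lambda)
```

A lambda is mainly a convenience when the function is short and only needed in one place, such as the `key` argument of a sort.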

In [None]:
plist.sort(key=lambda x: x[-1], reverse=True)
plist

In [None]:
plist.sort(key=lambda x: len(x), reverse=True)  # this key is equivalent to key=len
plist
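The task statement asks for alphabetical order, which needs no `key` at all. A minimal sketch, using a made-up set in place of `phenotype_set`:

```python
# Made-up set standing in for phenotype_set.
example_set = {"HCM", "AS", "BAV", "DCM"}

# sorted() accepts any iterable and returns a new list in ascending order.
alphabetical = sorted(example_set)
print(alphabetical)
```

Plain string comparison is case-sensitive, so this only gives true alphabetical order because the terms were upper-cased earlier. For mixed-case data, `key=str.lower` would be one option.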