# Practical Python Programming for Biologists
Author: Dr. Daniel Pass | www.CompassBioinformatics.com

---

# Day 2 Project - More strings & loops - and very messy data

Data is messy. Biological data even more so. Here we have some data on bacterial abundance, collected by some well-meaning scientists, but unfortunately it's a bit of a mess. It is technically in a four-column format like this; however, when you look at the data itself it's mixed up:

```
| Collector | Percentage abundance | Dominant Phyla | Date |
```

Delimiters:
- Between collected sample records: ```,```
- Between data fields per sample: ```-```
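
To make the two delimiters concrete, here is a minimal sketch on a made-up string. The names, values, and dates are invented for illustration and are not taken from ```MessyData.txt```; only the layout is meant to match.

```python
# Hypothetical two-record string in the same layout as the real file
toy_data = "Alice - 12.5 - Firmicutes - 15/03/22, Bob - 30.1 - Bacillus & Chloroflexi - 04/05/22"

# Commas separate the sample records
toy_records = toy_data.split(",")
print(len(toy_records))

# Hyphens separate the four fields inside one record
first_fields = toy_records[0].split("-")
print(first_fields)
```

Note that the pieces still carry stray spaces after splitting, which is why a cleaning step is needed later.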

We want to clean up the data and make some sense out of it. **The objective is to output a count of the number of samples with a high proportion of each phylum.**

1. Look at the text file first so that you know what we are looking at!
2. We will read in the file ```MessyData.txt``` with ```open()``` as one object (it is too mixed-up to read line-by-line), and then split based on the delimiters above. We will learn more about loading files in the IO session.
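
As a rough illustration of reading a file "as one object", the sketch below writes a small stand-in file and reads it back with ```.read()```. The file name ```ToyMessyData.txt``` and its contents are assumptions made up for this example, so that it runs without the course data.

```python
import os
import tempfile

# Create a small stand-in file so the example runs anywhere
# (the contents are invented, not the real course data)
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "ToyMessyData.txt")
with open(toy_path, "w") as out_file:
    out_file.write("Alice - 12.5 - Firmicutes - 15/03/22,\nBob - 30.1 - Bacillus - 04/05/22\n")

# .read() returns the whole file as a single string, line breaks included
with open(toy_path) as in_file:
    toy_text = in_file.read()

print(type(toy_text))
print(toy_text.count("\n"))  # number of line breaks kept inside the string
```

Because the whole file arrives as one string, a record that is broken across lines can still be recovered by splitting on the delimiters rather than on line breaks.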

If you want to challenge yourself, try to clean the data first before looking at the guide section below!
I recommend using ```print()``` after each step to check that the output is as expected.

---

<details>
<summary>Step-by-step guide</summary>

1. First split the data by commas into a new list of ```records``` with the function ```.split()```.
2. Create a new loop to go through your ```records``` list and split each record by ```-``` into the 4 data elements (put the output into a new list too).
3. Create a **2D/nested** loop for your latest list, to remove the whitespace from each element with ```.strip()```. (First go through each record, then through each element. Make sure to keep experiments together!)
4. Create one long list of all the dominant phyla per sample (the third column of the data). Some samples have multiple phyla, so these have to be split again first! Careful here, because you want a basic list, not a list of lists.
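
Steps 3 and 4 can be sketched on a tiny invented dataset. The records below are hypothetical and assumed to be already split on ```,``` and ```-```; this is one possible way to write the loops, not the only one.

```python
# Invented records that have already been split on "," and "-"
toy_split = [["Alice ", " 12.5 ", " Firmicutes ", " 15/03/22"],
             [" Bob ", " 30.1 ", " Bacillus & Chloroflexi ", " 04/05/22"]]

# Nested loop: the outer loop walks the records, the inner loop walks the fields
toy_clean = []
for record in toy_split:
    clean_record = []
    for field in record:
        clean_record.append(field.strip())
    toy_clean.append(clean_record)  # one cleaned list per record keeps each sample together

# The third column (index 2) may hold several phyla joined by "&"
toy_phyla = []
for record in toy_clean:
    for phylum in record[2].split("&"):
        toy_phyla.append(phylum.strip())  # append each name, so the list stays flat

print(toy_clean)
print(toy_phyla)
```

Appending each name inside the inner loop is what avoids a list of lists; appending the result of ```.split("&")``` directly would nest the output.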

</details>

---

5. Print out your new clean dataframe!

**Extensions**
1. Calculate the average abundance per collection date (there are 4 dates). Use ```if date_column == ....``` for now; we'll look at automatically building lists later.
2. Output a clean list of all named phyla from the data column in a list named ```phyla_count```. There may be more than one phylum per sample. There is a code block at the end that will count each phylum from the list I've given you and summarise your output.
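
For the first extension, the ```if date_column == ....``` idea might look something like the sketch below for a single date. The records are invented, and the variable names ```march_values``` and ```march_average``` are just suggestions.

```python
# Invented cleaned records: [collector, abundance, phyla, date]
toy_clean = [["Alice", "12.5", "Firmicutes", "15/03/22"],
             ["Bob", "30.1", "Bacillus", "04/05/22"],
             ["Cara", "17.5", "Chloroflexi", "15/03/22"]]

march_values = []
for record in toy_clean:
    date_column = record[3]
    if date_column == "15/03/22":
        march_values.append(float(record[1]))  # convert text to a number before doing maths

march_average = sum(march_values) / len(march_values)
print(march_average)
```

The same pattern would be repeated with a different date string for each of the other collection dates.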

In [2]:
# Read file in as one block because too messy to read line by line
with open("/content/Day2-Project-MessyData.txt") as inFile:
  data = inFile.read()

# swap the % signs for spaces so the abundances can be converted to numbers later
data = data.replace("%", " ")

data = data.strip()

#separate out each entry by the dividing ,

records = data.split(",")

#make new nested list separating each record at -

data_points_separate = []

for data_point in records :
  data_points_separate.append(data_point.split("-"))

#make nested loop to clean each part of each record (.strip() works on strings, not on lists) - need to append twice!

data_points_separate_cleaned = []

for parts in data_points_separate:
    cleaned_parts = []
    for part in parts:
        cleaned_parts.append(part.strip())
    data_points_separate_cleaned.append(cleaned_parts)

#print (data_points_separate_cleaned)
# select out index2 (3rd item) and split at &

dominant_phyla_raw = []

for individual_record in data_points_separate_cleaned :
  dominant_phyla_raw.append(individual_record[2])


dominant_phyla_splitup = []

for bugs in dominant_phyla_raw :
  dominant_phyla_splitup.append(bugs.split("&"))

#flatten the list so no nesting

dominant_phyla_separated = []
for sublist in dominant_phyla_splitup:
    for item in sublist:
        dominant_phyla_separated.append(item)

print ("The dominant phyla observed in the samples were as follows:\n", dominant_phyla_separated)

phyla_count = dominant_phyla_separated



The dominant phyla observed in the samples were as follows:
 ['Chloroflexi', 'Chloroflexi', 'Acidobacteria', 'Chloroflexi', 'Acidobacteria', 'Chloroflexi', 'Chloroflexi', 'Bacillus', 'Actinomycetes', 'Actinomycetes', 'Bacillus', 'Actinomycetes', 'Bacillus', 'Acidobacteria', 'Acidobacteria', 'Actinomycetes', 'Acidobacteria', 'Chloroflexi', 'Chloroflexi', 'Firmicutes', 'Chloroflexi', 'Acidobacteria', 'Firmicutes', 'Acidobacteria', 'Proteobacteria', 'Acidobacteria', 'Proteobacteria', 'Acidobacteria', 'Firmicutes', 'Proteobacteria', 'Acidobacteria', 'Firmicutes', 'Cyanobacteria', 'Cyanobacteria', 'Bacillus', 'Chloroflexi', 'Cyanobacteria', 'Bacillus', 'Chloroflexi', 'Cyanobacteria', 'Bacillus', 'Proteobacteria', 'Proteobacteria', 'Bacillus', 'Proteobacteria', 'Bacillus', 'Acidobacteria', 'Proteobacteria', 'Bacillus', 'Actinomycetes', 'Acidobacteria', 'Cyanobacteria', 'Cyanobacteria', 'Acidobacteria', 'Cyanobacteria', 'Acidobacteria', 'Cyanobacteria', 'Cyanobacteria', 'Actinomycetes', 'Cyan

In [29]:
# Name your final clean list of all phyla "phyla_count", then test it with this code block
phyla = ['Actinomycetes', 'Proteobacteria', 'Cyanobacteria', 'Firmicutes', 'Chloroflexi', 'Acidobacteria', 'Bacillus']

print("Phylum\t\tCount")
for p in phyla:
    print(p, "\t", phyla_count.count(p))


Phylum		Count
Actinomycetes 	 17
Proteobacteria 	 30
Cyanobacteria 	 26
Firmicutes 	 24
Chloroflexi 	 28
Acidobacteria 	 22
Bacillus 	 34
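
If you do not want to type out the list of phyla by hand, one alternative is ```Counter``` from Python's standard ```collections``` module. This is a sketch on an invented list, not part of the course solution.

```python
from collections import Counter

# Invented flat list standing in for phyla_count
toy_phyla = ["Bacillus", "Firmicutes", "Bacillus", "Chloroflexi", "Bacillus"]

# Counter tallies every distinct item it sees
toy_counts = Counter(toy_phyla)

# most_common() lists the items from most to least frequent
for phylum, count in toy_counts.most_common():
    print(phylum, "\t", count)
```

A side benefit is that misspelled or unexpected names in the data show up in the tally instead of being silently ignored.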


In [35]:
collection_dates =["15/03/22", "04/05/22", "21/06/22", "01/08/22"]

abundance_by_date = {date: [] for date in collection_dates}

for record in data_points_separate_cleaned:
    record_date = record[3] #4th item in nested record
    if record_date in abundance_by_date:   # only looks for dates in the collection_dates list/dictionary
        abundance_by_date[record_date].append(record[1])

abundance_march = abundance_by_date["15/03/22"]
abundance_may = abundance_by_date["04/05/22"]
abundance_june = abundance_by_date["21/06/22"]
abundance_august = abundance_by_date["01/08/22"]

sum_march = sum(float(values) for values in abundance_march)
sum_may = sum(float(values) for values in abundance_may)
sum_june = sum(float(values) for values in abundance_june)
sum_august = sum(float(values) for values in abundance_august)

average_abundance_march = round(sum_march/len(abundance_march),3)
average_abundance_may = round(sum_may/len(abundance_may),3)
average_abundance_june = round(sum_june/len(abundance_june),3)
average_abundance_august = round(sum_august/len(abundance_august),3)

print (average_abundance_march)
print (average_abundance_may)
print (average_abundance_june)
print (average_abundance_august)


13.32
14.322
20.389
12.347


In [38]:
# can i loop it?


collection_dates =["15/03/22", "04/05/22", "21/06/22", "01/08/22", "01/01/22"]

abundance_by_date = {date: [] for date in collection_dates}

for record in data_points_separate_cleaned:
    record_date = record[3] #4th item in nested record
    if record_date in abundance_by_date:   # only looks for dates in the collection_dates dictionary
        abundance_by_date[record_date].append(record[1]) #adds index item to the dictionary if date has been found

averages_by_date = {}

for date, abundances in abundance_by_date.items():
    if abundances: # only average dates that have records (an empty list would cause a divide-by-zero error)
        total = sum(float(value) for value in abundances)
        averages_by_date[date] = round(total / len(abundances), 3)
    else:
        averages_by_date[date] = None

print (averages_by_date)

#MY BRAIN HURTS

{'15/03/22': 13.32, '04/05/22': 14.322, '21/06/22': 20.389, '01/08/22': 12.347, '01/01/22': None}
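
The dates above are still typed in by hand. A possible next step is to let the data supply the dates, for example with ```defaultdict``` from the standard ```collections``` module. The records below are invented for illustration.

```python
from collections import defaultdict

# Invented cleaned records: [collector, abundance, phyla, date]
toy_clean = [["Alice", "12.5", "Firmicutes", "15/03/22"],
             ["Bob", "30.1", "Bacillus", "04/05/22"],
             ["Cara", "17.5", "Chloroflexi", "15/03/22"]]

# A defaultdict(list) creates an empty list the first time a new date is seen
values_by_date = defaultdict(list)
for record in toy_clean:
    values_by_date[record[3]].append(float(record[1]))

# Every date in the dictionary has at least one value, so no empty-list check is needed
toy_averages = {}
for date, values in values_by_date.items():
    toy_averages[date] = round(sum(values) / len(values), 3)

print(toy_averages)
```

The trade-off is that a mistyped date in the file becomes its own group, so it is worth printing the keys to check them.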
