# Day 2: Python Advanced



## 1. Making a decision

The following decision tree is used in a (pretty bad) hospital to decide whether a patient has "more" or "less" health risk.

![](tree1.png)

### 1.1 Classify a patient

* Create a function that takes as input a tuple containing values for the attributes (smoker, age, diet) and computes the output of the decision tree. It should return `"less"` or `"more"`.
* Test your function on the tuple `('yes', 31, 'good')`.

In [11]:
def decision(x):
    # >>>>> YOUR CODE HERE
    if x[0] == "yes":
        if x[1] < 29.5:
            condition = "less"
        else:
            if x[2] == "good":
                condition = "less"
            else:
                condition = "more"
    else:
        if x[2] == "good":
            condition = "less"
        else:
            if x[1] > 70:
                condition = "more"
            else:
                condition = "less"
    return condition
    # <<<<< END YOUR CODE

In [12]:
# Test
x = ('yes', 31, 'good')
assert decision(x) == 'less'

### 1.2 Reading a dataset from a .txt file

The file `health-test.txt` contains several fictitious records of personal data and habits.

* Read the file automatically using the methods introduced during the lecture.
* Represent the dataset as a list of tuples. Make sure that the tuples have the same format as above, e.g. `('yes', 31, 'good')`.
* Pay particular attention to the datatype of each element.

In [13]:
def gettest():
    # >>>>> YOUR CODE HERE
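    with open("health-test.txt","r") as f:
        datas = list()
        for line in f:
            line = line.strip()  # drop the newline without cutting a last line that lacks one
            if line:             # skip empty lines
                datas.append(line.split(","))
    
    # convert the age (second field) to int, keep the other fields as strings
    datan = list()
    for j in datas:
        datan.append((j[0],int(j[1]),j[2]))
    
    return datan # get the testdata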
    # <<<<< END YOUR CODE

In [14]:
print(gettest()) #test your implementation

[('yes', 21, 'poor'), ('no', 50, 'good'), ('no', 23, 'good'), ('no', 45, 'poor'), ('yes', 51, 'good'), ('no', 60, 'good'), ('no', 15, 'poor'), ('no', 18, 'good'), ('yes', 24, 'poor'), ('no', 55, 'good'), ('no', 37, 'good'), ('yes', 99, 'poor'), ('yes', 5, 'good'), ('no', 44, 'poor'), ('yes', 16, 'good'), ('no', 18, 'good'), ('no', 25, 'poor'), ('yes', 59, 'good'), ('no', 24, 'good'), ('yes', 45, 'good'), ('yes', 51, 'good'), ('no', 65, 'good'), ('yes', 15, 'poor'), ('no', 16, 'good'), ('yes', 24, 'poor'), ('no', 65, 'good'), ('no', 37, 'good'), ('no', 99, 'good'), ('no', 5, 'poor'), ('yes', 84, 'good'), ('no', 16, 'good'), ('no', 48, 'poor')]
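
As an alternative to splitting each line by hand, the standard-library `csv` module can do the parsing. The sketch below assumes that the file holds one comma-separated record per line (for example `yes,21,poor`), which is consistent with the output above. The helper name `read_records` and the temporary sample file are made up for illustration and are not part of the exercise.

```python
import csv
import os
import tempfile


def read_records(path):
    """Read records as (smoker, age, diet) tuples; the age is converted to int."""
    records = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:  # skip blank lines
                continue
            smoker, age, diet = (field.strip() for field in row)
            records.append((smoker, int(age), diet))
    return records


# Write a tiny sample file so the sketch runs on its own (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("yes,21,poor\nno,50,good\n")
    sample_path = tmp.name

sample_records = read_records(sample_path)
os.remove(sample_path)
print(sample_records)
```

Unpacking into exactly three names makes a malformed line fail loudly instead of producing a tuple of the wrong length.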


### 1.3 Applying the decision tree to the dataset

* Apply the decision tree to all points in the dataset, and return the fraction of points that are classified as "more".

In [15]:
def evaluate_testset():
    # >>>>> YOUR CODE HERE
    tot = 0
    morenum = 0
    for data in gettest():
        tot += 1
        if decision(data) == "more":
            morenum += 1
    return morenum/tot #return the ratio of them that are classified as "more"
    # <<<<< END YOUR CODE

In [16]:
print(evaluate_testset())

0.03125
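
The counting loop can also be written with `sum()` over a generator, because Python treats `True` as 1 and `False` as 0. The following is a minimal sketch under the assumption that the classifier is passed in as a function; `ratio_of`, `toy_classify` and the toy records are hypothetical names and data, and the toy rule is not the hospital's decision tree.

```python
def ratio_of(label, records, classify):
    """Return the fraction of records for which classify(record) == label."""
    if not records:
        raise ValueError("records must not be empty")
    # Each comparison yields True (1) or False (0), so sum() counts the matches.
    return sum(classify(r) == label for r in records) / len(records)


# Toy classifier, used only to make the sketch runnable.
def toy_classify(record):
    return "more" if record[1] > 60 else "less"


toy_records = [("yes", 21, "poor"), ("no", 99, "good"), ("no", 45, "poor"), ("yes", 84, "good")]
toy_ratio = ratio_of("more", toy_records, toy_classify)
print(toy_ratio)
```

With the functions from this notebook, the equivalent call would be `ratio_of("more", gettest(), decision)`.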


### 1.4 Build a DecisionMaker

Write a class `DecisionMaker` that can decide on a single patient's health risk and can also evaluate multiple patients.
- Write a class `DecisionMaker` whose constructor sets the values `b1`, `b2` and `D`. The default value for `D` is `None`.
- For this class, write the function `setBoundaries(b1,b2)`, which sets the boundaries of the class again (setter).
- Write a setter for `D`.
- Write a function `decision(x)` that makes a decision for the data point `x`. Check that `x` has the right format and that `D` has been initialized. Use the boundaries of the class.
- Write a function `evaluate()` that returns the ratio of points in `D` classified as `more`.

In [17]:
class DecisionMaker:
    # >>>>> YOUR CODE HERE
    def __init__(self,b1,b2,D = None):
        self.b1 = b1
        self.b2 = b2
        self.D = D
        
    def setBoundaries(self, b1, b2):
        self.b1 = b1
        self.b2 = b2
        
    def setData(self,D):
        self.D = D
    
    def checker(self,x):
        if (x[0] == "yes" or x[0] == "no") and (type(x[1]) == int) and \
        (x[2] == "poor" or x[2] == "good"):
            return 0
        else:
            return -1
    
    def decision(self,x):
        if self.checker(x) != 0:
            print("The data format is wrong!")
            return -1
        else:
            if not self.D:
                print("D is not initialized")
                return -1
            else:
                # use the boundaries
                if x[0] == "yes":
                    if x[1] < self.b1:
                        condition = "less"
                    else:
                        if x[2] == "good":
                            condition = "less"
                        else:
                            condition = "more"
                else:
                    if x[2] == "good":
                        condition = "less"
                    else:
                        if x[1] > self.b2:
                            condition = "more"
                        else:
                            condition = "less"
                return condition
        
    def evaluate(self):
        if not self.D:
            print("D is not initialized")
            return -1
        morenum = 0
        tot = 0
        for x in self.D:
            result = self.decision(x)  # classify each point only once
            if result == -1:
                print("Wrong data format!")
                return -1
            else:
                tot += 1
                if result == "more":
                    morenum += 1
        
        return morenum/tot
            
    # <<<<< END YOUR CODE
    

In [18]:
dm = DecisionMaker(10,30)
dm.setData(gettest())

dm.evaluate()



0.25
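
Returning `-1` on errors mixes error codes with the valid results `"less"` and `"more"`. One possible alternative is to raise exceptions instead. The sketch below is a variant under that assumption, not the reference solution: the class name `StrictDecisionMaker` and the helper `_check` are hypothetical, the tree logic is taken from the class above, and the check for `D` is moved into `evaluate()` so that single decisions work without a dataset.

```python
class StrictDecisionMaker:
    """Variant of DecisionMaker that raises ValueError instead of returning -1."""

    def __init__(self, b1, b2, D=None):
        self.b1 = b1
        self.b2 = b2
        self.D = D

    @staticmethod
    def _check(x):
        # bool is a subclass of int, so it is excluded explicitly.
        ok = (
            isinstance(x, tuple)
            and len(x) == 3
            and x[0] in ("yes", "no")
            and isinstance(x[1], int)
            and not isinstance(x[1], bool)
            and x[2] in ("good", "poor")
        )
        if not ok:
            raise ValueError(f"bad record format: {x!r}")

    def decision(self, x):
        self._check(x)
        if x[0] == "yes":
            # smokers: young or good diet means "less"
            if x[1] < self.b1 or x[2] == "good":
                return "less"
            return "more"
        # non-smokers: only poor diet above the age boundary means "more"
        if x[2] == "poor" and x[1] > self.b2:
            return "more"
        return "less"

    def evaluate(self):
        if not self.D:
            raise ValueError("D has not been set or is empty")
        labels = [self.decision(x) for x in self.D]
        return labels.count("more") / len(labels)


dm_strict = StrictDecisionMaker(
    10, 30, D=[("yes", 21, "poor"), ("no", 50, "good"), ("no", 45, "poor"), ("yes", 5, "good")]
)
strict_ratio = dm_strict.evaluate()
print(strict_ratio)
```

A caller can then wrap `evaluate()` in `try`/`except ValueError` and never has to compare a result against `-1`.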

## 2 Evaluate student performance

In this exercise, we will read a dataset about the performance of students. We will then use Python to make some statements about the students.

### 2.1 Read the data from a .csv file

Read the data from the file `StudentsPerformance.csv` and store the values in a dictionary.
- The keys of the dictionary should be the header of the csv file (the first line of the .csv file).
- The value for each key should be a list containing the value for each student.
- All lists need to have the same length and should preserve the order of the students.
- The values which are numerical should be converted to `int`.

In [19]:
def parse_csv(filename):
    # >>>>> YOUR CODE HERE
    kvs = list()
    with open(filename,"r") as f:
        strings = f.readlines()
        
        everylines = list()
        for s in strings:
            everylines.append(s.replace("\n",""))
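        #print(everylines[0])
        #print(everylines[1])
        # everylines holds each line as a string without the newline character
        
        '''
        # first, process the header line everylines[0]
        keys2 = list(everylines[0].split(","))
        keys = list()
        for key in keys2:
            keys.append(key.strip('\"'))
        print(tuple(keys))
        # the first line is now processed
        # next, process the remaining lines in the same way
        '''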
        
        datas = list()
        for everyline in everylines:
            values2 = list(everyline.split(","))
            values = list()
            for value in values2:
                values.append(value.strip('\"'))
            datas.append(values)
        # datas now holds, for every line, the list of its fields
        
        # create one empty list per column
        lt = []
        for i in range(8):
            lt.append([])
        
        keys = datas[0]
        for j in range(len(datas)-1):
            for i in range(8):
                if i < 5:
                    lt[i].append(datas[j+1][i])
                else:
                    lt[i].append(int(datas[j+1][i]))
        mydict = dict(zip(keys,lt))
        #print(mydict["math score"])
        # the data is now stored in the dictionary mydict
        
    # <<<<< END YOUR CODE
    return mydict
mydict = parse_csv("StudentsPerformance.csv")
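
For comparison, the standard-library `csv.DictReader` handles the quoting and the header line automatically. This is a sketch under assumptions, not the expected solution: the function name `parse_csv_columns` is hypothetical, the sample rows are made up and use only a few of the column names from the exercise, and every field that consists only of digits is treated as numerical.

```python
import csv
import io


def parse_csv_columns(f):
    """Read CSV text from the file object f into a dict: header -> list of values."""
    reader = csv.DictReader(f)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            # Convert purely numeric fields to int, keep everything else as str.
            columns[name].append(int(value) if value.isdigit() else value)
    return columns


# Made-up sample data in place of the real file.
sample = io.StringIO(
    '"gender","parental level of education","math score"\n'
    '"female","some college","72"\n'
    '"male","high school","47"\n'
)
sample_columns = parse_csv_columns(sample)
print(sample_columns)
```

For a real file, the same function can be called as `parse_csv_columns(open(filename, newline=""))` inside a `with` block. Unlike the version above, it does not hard-code the number of columns.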

### 2.2 Extend dataset and write .csv files

Now, the dataset dictionary should be extended with a student ID for each student and with new test scores from the Python Programming Test.
- The ID should be a random 4-digit number. Each student's ID needs to be unique (no repetition). Also, the ID should be stored as a string.
- The Python Programming scores are in the file _PythonScores.csv_. The order of the students is the same as in _StudentsPerformance.csv_.
- When you have extended the dictionary, write the extended dataset into a new .csv file with the name _NewStudentPerformance.csv_. This file should also contain the headers for the new values we have added.


In [20]:
import random
def add_ids(mydict):
    # >>>>> YOUR CODE HERE
    # all 4-digit strings "0000" .. "9999" (zero-padded)
    idnum = [str(i).zfill(4) for i in range(10000)]
    # one ID per student; random.sample draws without replacement, so the IDs are unique
    nstudents = len(mydict["gender"])
    myid = random.sample(idnum, nstudents)
    mydict.update({"ID": myid})
    return
    # <<<<< END YOUR CODE
add_ids(mydict)

In [21]:
# a set can also be used to make sure that no ID is repeated
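
The idea of using a set could look roughly like this. The sketch assumes that "4-digit" means the range 1000 to 9999 (no leading zeros), which is a stricter reading than the cell above, where IDs such as `"0007"` are allowed. The function name `unique_ids` and the `seed` parameter are hypothetical.

```python
import random


def unique_ids(n, seed=None):
    """Return n distinct 4-digit ID strings in the range 1000-9999."""
    if n > 9000:
        raise ValueError("only 9000 distinct 4-digit IDs exist")
    rng = random.Random(seed)
    seen = set()
    ids = []
    while len(ids) < n:
        candidate = str(rng.randint(1000, 9999))
        if candidate not in seen:  # set membership test is O(1) on average
            seen.add(candidate)
            ids.append(candidate)
    return ids


demo_ids = unique_ids(1000, seed=0)
print(demo_ids[:5])
```

The set only tracks which IDs were already used, while the list keeps the order in which they are assigned to the students. `random.sample(range(1000, 10000), n)` achieves the same result in a single call.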


In [22]:
def add_python_scores(mydict):
    # >>>>> YOUR CODE HERE
    with open("PythonScores.csv","r") as f:
        strings = f.readlines()
        
    everylines = list()
    for s in strings:
        everylines.append(s.replace("\n",""))
    key = everylines[0]
    value = list()
    for i in range(len(everylines)-1):
        value.append(int(everylines[i+1]))
    #print(key)
    #print(value)
    
    # add the new column {key: value} to mydict
    mydict.update({key:value})
    return
    # <<<<< END YOUR CODE
add_python_scores(mydict)

In [23]:
for key in list(mydict.keys()):
    print(len(mydict[key]))

1000
1000
1000
1000
1000
1000
1000
1000
1000
1001


In [24]:
def write_new_csv(mydict):
    # >>>>> YOUR CODE HERE
    with open("NewStudentPerformance.csv","w") as f:
        keys = list(mydict.keys())
        # join the fields with "," so that no trailing comma ends the line
        f.write(",".join(keys)+"\n")
        for i in range(len(mydict["gender"])):
            f.write(",".join(str(mydict[key][i]) for key in keys)+"\n")
    return
write_new_csv(mydict)
                        
    # <<<<< END YOUR CODE
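
Writing can also be delegated to `csv.writer`, which takes care of separators and quoting. The sketch below is one possible way under the assumption that all columns have the same length; `write_columns` and the demo data are hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def write_columns(columns, f):
    """Write a dict of equally long column lists to f as CSV, header first."""
    writer = csv.writer(f)
    writer.writerow(columns.keys())
    # zip(*lists) turns the column lists into one tuple per student (row).
    writer.writerows(zip(*columns.values()))


demo_columns = {"gender": ["female", "male"], "math score": [72, 47], "ID": ["4821", "0937"]}
buffer = io.StringIO()
write_columns(demo_columns, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that `zip` stops at the shortest column, so a column that is one entry too long is silently truncated rather than reported. When writing to a real file, the `csv` documentation recommends opening it with `newline=""`.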

### 2.3 Get some statistics of the dataset

Finally, we want to get a better understanding of the dataset. Therefore, answer the following questions:
- How many males and females are in the dataset? What is the ratio?
- For each test, find the mean, min and max value and standard deviation (you can use built-ins)
- For each parental level of education, get the percentage of students.
- Who is the best student for each test?
- Who has the best average of all tests?

For each question, print out a message.
You can write a function for each question.

In [33]:
# >>>>> YOUR CODE HERE
def get_gender(mydict):
    gender = mydict["gender"]
    m = 0
    f = 0
    for g in gender:
        if g == "male":
            m += 1
        if g == "female":
            f += 1
    print("There are " + str(m) + " males in the dataset.")
    print("There are " + str(f) + " females in the dataset.")
    print("The ratio of males to females is " + str(m/f) + ".")

def get_count(mydict):
    # the same statistics are needed for every test, so loop over the tests
    for test in ["math", "reading", "writing", "python"]:
        scores = mydict[test + " score"]
        print("For the " + test + " test:")
        print("The max score is " + str(max(scores)))
        print("The min score is " + str(min(scores)))
        avg = sum(scores)/len(scores)
        print("The average score is " + str(avg))
        # population standard deviation: square root of the mean squared deviation
        squares = [(s - avg)**2 for s in scores]
        print("The standard deviation is " + str((sum(squares)/len(scores))**0.5))
        print("\n")
    
    return

def get_parental_level_of_education(mydict):
    # how do I add an item to a dictionary?
    # how do I use enumerate?
    level = {}
    for lv in mydict["parental level of education"]:
        if lv in level.keys():
            level[lv] += 1
        else:
            level.update({lv: 1})
    
    tot = len(mydict["parental level of education"])
    for lv in level.keys():
        print(str(level[lv]/tot*100)+"% of the students have parents whose level of education is "+lv)
    return

def get_index(lst=None, item=''):
    return [index for (index,value) in enumerate(lst) if value == item]

def get_best(mydict):
    #math
    print("For the math test:")
    mathmax = max(mydict["math score"])
    mathname = get_index(mydict["math score"],mathmax)
    
    print("The best student for math test has the index of " + " and ".join([str(s) for s in mathname]))
    
    #reading
    print("For the reading test:")
    readingmax = max(mydict["reading score"])
    readingname = get_index(mydict["reading score"],readingmax)
    
    print("The best student for reading test has the index of " + " and ".join([str(s) for s in readingname]))
    
    #writing
    print("For the writing test:")
    writingmax = max(mydict["writing score"])
    writingname = get_index(mydict["writing score"],writingmax)
    
    print("The best student for writing test has the index of " + " and ".join([str(s) for s in writingname]))
    
    #python
    print("For the python test:")
    pythonmax = max(mydict["python score"])
    pythonname = get_index(mydict["python score"],pythonmax)
    
    print("The best student for python test has the index of " + " and ".join([str(s) for s in pythonname]))
    
    return

def get_best_average(mydict):
    #avg
    print("For the average score:")
    avglist = list()
    for i in range(len(mydict["math score"])):
        avglist.append((mydict["math score"][i] + mydict["writing score"][i] + mydict["reading score"][i] + mydict["python score"][i]) / 4)
    avgmax = max(avglist)
    avgname = get_index(avglist,avgmax)
    
    print("The student with best average score has the index of " + " and ".join([str(s) for s in avgname]))
    return
# <<<<< END YOUR CODE
#get_gender(mydict)
#get_count(mydict)
#get_parental_level_of_education(mydict)
#get_best(mydict)
#get_best_average(mydict)
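
Since built-ins are allowed, the `statistics` module and `collections.Counter` can replace the hand-written formulas. A minimal sketch with made-up numbers follows; `describe` and `percentages` are hypothetical helper names. `statistics.pstdev` is the population standard deviation (dividing by N), which is the same convention as the formula used in `get_count`.

```python
import statistics
from collections import Counter


def describe(scores):
    """Return mean, min, max and population standard deviation of scores."""
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),
        "max": max(scores),
        "std": statistics.pstdev(scores),
    }


def percentages(values):
    """Return a dict mapping each distinct value to its share in percent."""
    counts = Counter(values)
    return {value: 100 * n / len(values) for value, n in counts.items()}


demo_scores = [2, 4, 4, 4, 5, 5, 7, 9]
demo_stats = describe(demo_scores)
demo_shares = percentages(["high school", "some college", "high school", "high school"])
print(demo_stats)
print(demo_shares)
```

With the dictionary from this notebook, the calls would be `describe(mydict["math score"])` and `percentages(mydict["parental level of education"])`.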

## 3 More List Comprehension!!

- Find all of the numbers from 1-1000 that are divisible by 7
- Find all of the numbers from 1-1000 that have a 3 in them
- Count the number of spaces in a string
- Remove all of the vowels in a string
- Find all of the words in a string that are less than 4 letters

In [2]:
[i for i in range(1, 1001) if i % 7 == 0]

[7,
 14,
 21,
 28,
 35,
 42,
 49,
 56,
 63,
 70,
 77,
 84,
 91,
 98,
 105,
 112,
 119,
 126,
 133,
 140,
 147,
 154,
 161,
 168,
 175,
 182,
 189,
 196,
 203,
 210,
 217,
 224,
 231,
 238,
 245,
 252,
 259,
 266,
 273,
 280,
 287,
 294,
 301,
 308,
 315,
 322,
 329,
 336,
 343,
 350,
 357,
 364,
 371,
 378,
 385,
 392,
 399,
 406,
 413,
 420,
 427,
 434,
 441,
 448,
 455,
 462,
 469,
 476,
 483,
 490,
 497,
 504,
 511,
 518,
 525,
 532,
 539,
 546,
 553,
 560,
 567,
 574,
 581,
 588,
 595,
 602,
 609,
 616,
 623,
 630,
 637,
 644,
 651,
 658,
 665,
 672,
 679,
 686,
 693,
 700,
 707,
 714,
 721,
 728,
 735,
 742,
 749,
 756,
 763,
 770,
 777,
 784,
 791,
 798,
 805,
 812,
 819,
 826,
 833,
 840,
 847,
 854,
 861,
 868,
 875,
 882,
 889,
 896,
 903,
 910,
 917,
 924,
 931,
 938,
 945,
 952,
 959,
 966,
 973,
 980,
 987,
 994]

In [3]:
[i for i in range(1, 1001) if "3" in str(i)]

[3,
 13,
 23,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 43,
 53,
 63,
 73,
 83,
 93,
 103,
 113,
 123,
 130,
 131,
 132,
 133,
 134,
 135,
 136,
 137,
 138,
 139,
 143,
 153,
 163,
 173,
 183,
 193,
 203,
 213,
 223,
 230,
 231,
 232,
 233,
 234,
 235,
 236,
 237,
 238,
 239,
 243,
 253,
 263,
 273,
 283,
 293,
 300,
 301,
 302,
 303,
 304,
 305,
 306,
 307,
 308,
 309,
 310,
 311,
 312,
 313,
 314,
 315,
 316,
 317,
 318,
 319,
 320,
 321,
 322,
 323,
 324,
 325,
 326,
 327,
 328,
 329,
 330,
 331,
 332,
 333,
 334,
 335,
 336,
 337,
 338,
 339,
 340,
 341,
 342,
 343,
 344,
 345,
 346,
 347,
 348,
 349,
 350,
 351,
 352,
 353,
 354,
 355,
 356,
 357,
 358,
 359,
 360,
 361,
 362,
 363,
 364,
 365,
 366,
 367,
 368,
 369,
 370,
 371,
 372,
 373,
 374,
 375,
 376,
 377,
 378,
 379,
 380,
 381,
 382,
 383,
 384,
 385,
 386,
 387,
 388,
 389,
 390,
 391,
 392,
 393,
 394,
 395,
 396,
 397,
 398,
 399,
 403,
 413,
 423,
 430,
 431,
 432,
 433,
 434,
 435,
 436,
 437,
 438,
 439,


In [5]:
string = input("Input a string: ")
len([s for s in string if s == " "])

Input a string:     


5

In [8]:
string = input("Input a string: ")
"".join([s for s in string if s not in ["a","e","i","o","u"] + ["A","E","I","O","U"]])

Input a string: heuiadhauihiwdhaiu


'hdhhwdh'

In [35]:
string = input("Input a string: ")
[word for word in string.split(" ") if 0 < len(word) < 4]
# note: punctuation such as commas is not stripped here, so "cat," counts as 4 characters

Input a string: sjsj sjsj sjaj   adiwj akj


['akj']
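
If punctuation should not count towards the word length, it can be stripped from both ends of each word before filtering. This is only a sketch of one way to do it; the function name `short_words`, the `max_len` parameter and the example sentence are made up.

```python
from string import punctuation


def short_words(text, max_len=3):
    """Return the words with at most max_len characters, ignoring punctuation at the edges."""
    # split() without arguments also collapses repeated spaces
    words = (w.strip(punctuation) for w in text.split())
    return [w for w in words if 0 < len(w) <= max_len]


demo_short = short_words("Yes, the cat sat on a mat, quietly.")
print(demo_short)
```

Punctuation inside a word (for example the apostrophe in "don't") is kept, because `strip` only removes characters at the start and end.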

### Challenge:

- Use a dictionary comprehension to count the length of each word in a sentence.
- Use a nested list comprehension to find all of the numbers from 1-1000 that are divisible by any single digit besides 1 (2-9)
- For all the numbers 1-1000, use a nested list/dictionary comprehension to find the highest single digit any of the numbers is divisible by

In [None]:
# a dict comprehension has the form {key: value for item in iterable}
string = input("Input a string: ")
{word: len(word) for word in string.split()}
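
The two remaining challenge items could be approached as sketched below. The third item is read here as "for each number, find the highest single digit that divides it", which is an interpretation of the task rather than something the text states exactly. The names `divisible` and `highest` are made up.

```python
# Numbers from 1-1000 that are divisible by at least one digit in 2-9.
# The inner comprehension lists the digits that divide n; a non-empty list is truthy.
divisible = [n for n in range(1, 1001) if [d for d in range(2, 10) if n % d == 0]]

# Highest single digit (1-9) that divides each number. 1 divides everything,
# so the inner list is never empty and max() always has a value.
highest = {n: max([d for d in range(1, 10) if n % d == 0]) for n in range(1, 1001)}

print(divisible[:10])
print(highest[18], highest[11])
```

Using `any(n % d == 0 for d in range(2, 10))` as the filter would stop at the first matching digit, but the nested list keeps the solution in the comprehension style the challenge asks for.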