<h1> <center> ENSF 519.01 Programming Fundamentals for Data Engineers </center></h1>
<h2> <center> Assignment 5: Numpy and Pandas (100 marks)</center></h2>

<center>
<div class="alert alert-block alert-info">
Updated Due: Sunday 28 Oct Midnight. To be submitted on D2L.
</div></center>


Edit this file and write your solutions to the problems in the sections marked `# Your solution goes here`. Test your code and, when you are done, download this notebook as an `.ipynb` file and submit it to D2L. To get this file in Jupyter Notebook, go to File -> Download as -> Notebook (.ipynb).

# Working with Numbers

In this assignment, as a data scientist, you will learn how to work with numbers efficiently and perform operations faster using the `numpy` and `pandas` libraries.

### Before You start: 
##### Make Sure :
* Numpy and pandas are already installed with the Anaconda stack. If you installed jupyter directly, make sure that you have them installed:<br> 
    `import numpy
    numpy.__version__`

* Download data.csv and put it somewhere in your repository where your code can access it (for ease of access, put it in the same directory as your notebook file).
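
A quick way to confirm that both libraries can be imported from the notebook kernel is sketched below. This is only a minimal check; the version numbers printed will depend on your own installation.

```python
import numpy
import pandas

# Both packages expose their installed version as a plain string
print("numpy", numpy.__version__)
print("pandas", pandas.__version__)
```

If either import raises `ModuleNotFoundError`, the package is not installed in the environment that the notebook is running in.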

<div class="alert alert-block alert-info">
<b>Tip:</b> Use `numpy ?` to read numpy documentation before you start working with it.
</div>


### Description:
Nowadays, many healthcare providers invest in technology-based startups that assist patients and doctors with diagnosis and treatment. As a data scientist, you are assigned to study data collected from 293 patients with heart disease, extract meaningful information, and report it to one of these healthcare providers. You can download the dataset (data.csv) from D2L; to learn more about it, see the Kaggle link below:<br>
https://www.kaggle.com/imnikhilanand/heart-attack-prediction



|Feature|Description|
|-------|----------------------------|
|age|age in years|
|gender|(1 = male; 0 = female)|
|cp|chest pain type|
|trestbps|resting blood pressure (in mm Hg on admission to the hospital)|
|chol|serum cholesterol in mg/dl|
|fbs|(fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)|
|restecg|resting electrocardiographic results|
|thalach|maximum heart rate achieved|
|exang|exercise induced angina (1 = yes; 0 = no)|
|oldpeak|ST depression induced by exercise relative to rest|
|slope|the slope of the peak exercise ST segment|
|ca|number of major vessels (0-3) colored by fluoroscopy|
|thal|3 = normal; 6 = fixed defect; 7 = reversible defect|
|num|diagnosis of heart disease (angiographic disease status)|

# Section A. Numpy (60pts)
### Part I. Table (40pts)
As a first step, to work properly with data, design a class that is responsible for general processes that you'll need.
* Keep the header names in a Python list, and store the data rows in a numpy.ndarray, both as data attributes (why not class attributes?).
* Implement a `readCSV` method that reads the header and the data from a csv file.
* Implement a `printHead` method that prints the header and the first 10 rows of the table.
* Implement a `sort` method that sorts the table based on the specified `column`. (`default='age'`)
* Implement a `deleteRow` method that gets a `row index` and deletes that row from the table. (`default=last row`)
* Implement a `deleteCol` method that gets a `column index` and deletes that column from the table. (`default=last column`)
* Implement a `getColumn` method that gets a `column` name and, after removing the `nan` values from it, returns it as a numpy array.
* Implement a `select` method that gets a `column` name and a `value` and, after removing the `nan` values, searches for the records with `column=value` and returns that sub-table.
* Implement a `rangeSelect` method that gets a `column` name and a `begin` and `end` (that define a range) and, after removing the `nan` values and sorting the table based on `column`, searches for the records with `begin<column<end` and returns that sub-table.
* Implement a `percentageSelect` method that gets a `column` name, a `Perc` (percentage) and an `index`, removes the `nan` values, sorts the table based on `column`, and then:
    * if index == 0: returns the sub-table made of the __first__ `Perc*column.size` rows.
    * if index == -1: returns the sub-table made of the __last__ `Perc*column.size` rows.
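
On the parenthetical question in the first bullet, the following minimal sketch shows one reason to prefer data attributes. `SharedTable` and `OwnTable` are hypothetical names used only for this illustration; they are not part of the required `Table` class.

```python
class SharedTable:
    rows = []                      # class attribute: one list shared by every instance

    def add(self, row):
        self.rows.append(row)      # mutates the shared list


class OwnTable:
    def __init__(self):
        self.rows = []             # data attribute: a fresh list per instance

    def add(self, row):
        self.rows.append(row)


a, b = SharedTable(), SharedTable()
a.add("patient 1")
print(b.rows)                      # b sees the row that was added through a

c, d = OwnTable(), OwnTable()
c.add("patient 1")
print(d.rows)                      # d is unaffected
```

The same sharing can happen with a mutable default argument such as `header=[]`, because the default object is created once when the function is defined. It is harmless as long as the default is only reassigned and never mutated in place.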

<div class="alert alert-block alert-danger">
<b>Note:</b> <br>
Do not change the signature of methods and complete the same class structure as below<br>
`select`, `rangeSelect` and `percentageSelect` should return a new `Table` as their output without changing the current Table.<br>
To remove `nan` values and sort, you must reuse the `getColumn` and `sort` methods that you have implemented before.
</div>
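
The building blocks behind most of these methods can be tried on a tiny in-memory file first. The sketch below is an illustration under assumptions: the three-row CSV text is made up, and it reads from `io.StringIO` rather than from `data.csv`.

```python
import io
import numpy as np

# Hypothetical three-row file; the empty chol field stands for a missing value
csv_text = "age,chol,num\n54,,1\n28,132,0\n41,250,1\n"

data = np.genfromtxt(io.StringIO(csv_text), dtype=float, names=True, delimiter=',')
header = data.dtype.names                 # field names taken from the first row

by_age = np.sort(data, order='age')       # records ordered by the 'age' field
chol = data['chol']                       # one column as a plain float array
chol_clean = chol[~np.isnan(chol)]        # boolean mask drops the nan entry
sick = data[data['num'] == 1]             # records where num equals 1

print(header)
print(by_age['age'])
print(chol_clean)
print(sick['age'])
```

Because `names=True` produces a structured array, a column is addressed by its field name, and a boolean mask over one column selects whole records.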

In [1]:
import numpy as np
import csv

class Table:
    
    def __init__(self,header=[],data=np.array([])):
        self.header=header
        self.data = data

    def readCSV(self,filename:str)->None:
        # names=True takes the field names from the first row, giving a structured array
        self.data = np.genfromtxt(filename, dtype=float, names=True, delimiter=',')
        self.header= self.data.dtype.names

    def printHead(self)->None:
        print(self.header)
        print(self.data[:10])

    def sort(self,column:str='age')->None:
        # order=column sorts the records by that field
        self.data =np.sort(self.data,order=column)

    def deleteRow(self,index:int=-1)->None:
        self.data = np.delete(self.data,index,axis=0)

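    def deleteCol(self,column:int=-1)->None:
        # Normalize negative indices first: with column=-1, header[column+1:]
        # would be header[0:] (the whole header), so the last column would not be dropped
        column = column % len(self.header)
        self.header = self.header[:column]+self.header[column+1:]
        self.data = self.data[list(self.header)]

    def getColumn(self,column:str='age')->np.array:  
        selected_column = self.data[column]
        selected_column= selected_column[~(np.isnan(selected_column))]
        return selected_column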

    def select(self,column:str,value:float):
        # Sort a copy so that the current table is left unchanged
        result = Table(self.header,self.data.copy())
        result.sort(column)
        selected_column = result.getColumn(column)
        # Rows whose `column` is not nan, in the same order as selected_column
        rows = result.data[~np.isnan(result.data[column])]
        return Table(self.header,rows[selected_column==value])
    
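    def rangeSelect(self,column:str,begin:int,end:int):
        # Sort a copy so that the current table is left unchanged
        result = Table(self.header,self.data.copy())
        result.sort(column)
        selected_column = result.getColumn(column)
        # Rows whose `column` is not nan, in the same order as selected_column
        rows = result.data[~np.isnan(result.data[column])]
        in_range = (selected_column>begin) & (selected_column<end)
        return Table(self.header,rows[in_range])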
    
    def percentageSelect(self,column:str,Perc:float,index:int):
        # Sort a copy so that the current table is left unchanged
        result = Table(self.header,self.data.copy())
        result.sort(column)
        selected_column = result.getColumn(column)
        # Rows whose `column` is not nan, in the same order as selected_column
        rows = result.data[~np.isnan(result.data[column])]
        count = int(Perc*selected_column.size)
        if index == 0:
            return Table(self.header,rows[:count])
        if index ==-1:
            # rows[-0:] would return every row, so slice from an explicit start
            return Table(self.header,rows[rows.size-count:])
    



In [2]:
# some test cases

"""  
make sure to add some more test cases to ensure the
correctness of all your methods before going to partII
"""

Synopsis=Table()
Synopsis.readCSV("data.csv")
Synopsis.printHead()
Synopsis.sort(column='trestbps')
Synopsis.printHead()
Synopsis.deleteRow(0)
Synopsis.printHead()
Synopsis.deleteCol(0)
Synopsis.printHead()
Synopsis.getColumn('cp')
Synopsis.select('slope',value =2).printHead()
Synopsis.percentageSelect('thal',.20,-1).printHead()

('age', 'gender', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num')
[(28., 1., 2., 130., 132., 0., 2., 185., 0., 0., nan, nan, nan, 0.)
 (29., 1., 2., 120., 243., 0., 0., 160., 0., 0., nan, nan, nan, 0.)
 (29., 1., 2., 140.,  nan, 0., 0., 170., 0., 0., nan, nan, nan, 0.)
 (30., 0., 1., 170., 237., 0., 1., 170., 0., 0., nan, nan,  6., 0.)
 (31., 0., 2., 100., 219., 0., 1., 150., 0., 0., nan, nan, nan, 0.)
 (32., 0., 2., 105., 198., 0., 0., 165., 0., 0., nan, nan, nan, 0.)
 (32., 1., 2., 110., 225., 0., 0., 184., 0., 0., nan, nan, nan, 0.)
 (32., 1., 2., 125., 254., 0., 0., 155., 0., 0., nan, nan, nan, 0.)
 (33., 1., 3., 120., 298., 0., 0., 185., 0., 0., nan, nan, nan, 0.)
 (34., 0., 2., 130., 161., 0., 0., 190., 0., 0., nan, nan, nan, 0.)]
('age', 'gender', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num')
[(38., 1., 4.,  92., 117., 0., 0., 134., 1., 2.5,  2., nan, nan, 1.)
 (34

### Part II. Extract Meaning (20pts)

Now that you have a proper infrastructure to handle the data, the company wants you to analyze the data and answer the following questions:
* Report1: What is the average age of the patients?
* Report2: Report the average `chol` level of people in 10-year age intervals ([20,30], [30,40], [40,50], [50,60]).
* Report3: Report the average `trestbps` of people with the highest `chol` levels (the highest 30%) and the lowest `chol` levels (the lowest 30%).
* Report4: Report the percentage of men and women with a positive diagnosis of heart disease (`num=1`).
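
The arithmetic behind Report2 and Report3 can be checked on a handful of values before running it on the full table. The sketch below uses plain NumPy arrays with synthetic numbers (not taken from data.csv) and does not go through the `Table` class, so treat it as a sanity check rather than a solution.

```python
import numpy as np

# Synthetic values for illustration only
age      = np.array([25., 33., 38., 41., 47., 52., 58.])
chol     = np.array([180., 210., np.nan, 240., 260., 250., 270.])
trestbps = np.array([110., 120., 125., 130., 140., 135., 150.])

# Report2 style: mean chol for 30 < age < 40, ignoring the missing value
in_thirties = (age > 30) & (age < 40)
mean_chol_thirties = np.nanmean(chol[in_thirties])

# Report3 style: drop missing chol, order by chol, then take 30% from each end
valid = ~np.isnan(chol)
order = np.argsort(chol[valid])
sorted_bps = trestbps[valid][order]
count = int(0.30 * sorted_bps.size)
lowest_mean = sorted_bps[:count].mean()
highest_mean = sorted_bps[sorted_bps.size - count:].mean()

print(mean_chol_thirties, count, lowest_mean, highest_mean)
```

Slicing from `size - count` instead of `-count` avoids the case where `count` is 0 and `[-0:]` silently returns every row.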


In [3]:
# Your code goes here - complete the code below

report=Table()
report.readCSV("data.csv")

# report 1
average_age = report.getColumn('age').mean()

print("Report1\n \tthe average age of patients are: ",average_age)


# report2
print("\nReport2")
for i in range(20,60,10):
    ages = report.rangeSelect(column= 'age',begin = i,end=i+10)
    average_chol = np.average(ages.getColumn(column='chol'))
    print("\tfor people with age in range of", i,i+10,"average CHOL is: ",average_chol)

# report3

print("\nReport3")
# index=0 selects the lowest 30% (the table is sorted in ascending order)
low_chol = report.percentageSelect(column = 'chol',Perc= 0.30, index=0)
average_tres_low_chol=np.average(low_chol.getColumn(column= 'trestbps'))
print("\tfor patients with lowest 30% of chol, average trestbps is: ",average_tres_low_chol)
# index=-1 selects the highest 30%
high_chol = report.percentageSelect(column = 'chol',Perc= 0.30, index=-1)
average_tres_high_chol=np.average(high_chol.getColumn(column= 'trestbps'))
print("\tfor patients with highest 30% of chol, average trestbps is: ",average_tres_high_chol) 

# report4
print("\nReport4")
#total_gender=report.getColumn(column='gender').size
#male_gender=report.select(column='gender',value=1).getColumn(column='gender').size
#female_gender= report.select(column='gender',value=0).getColumn(column='gender').size
male_gender_sick=report.select(column='gender',value=1).select(column='num',value =1).getColumn(column='gender').size
female_gender_sick= report.select(column='gender',value=0).select(column='num',value =1).getColumn(column='gender').size
total_sick_people=male_gender_sick+female_gender_sick
male_percent=(male_gender_sick/total_sick_people)*100
female_percent=(female_gender_sick/total_sick_people)*100
print("\t{0:.2f}% of patients diagnosed with heart disease are men and ".format(  male_percent ),"{0:.2f}% of them are women".format(  female_percent))


Report1
 	the average age of patients are:  47.767918088737204

Report2
	for people with age in range of 20 30 average CHOL is:  187.5
	for people with age in range of 30 40 average CHOL is:  239.6888888888889
	for people with age in range of 40 50 average CHOL is:  245.6451612903226
	for people with age in range of 50 60 average CHOL is:  258.6666666666667

Report3
	for patients with lowest 30% of chol, average trestbps is:  132.97530864197532
	for patients with highest 30% of chol, average trestbps is:  135.6375

Report4
	88.57% of patients diagnosed with heart disease are men and  11.43% of them are women


# Section B. Pandas (40pts)

Now generate the same reports (Section A, Part II) again, but this time use a Pandas DataFrame instead of the `Table` class.
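
The DataFrame operations these reports rely on can be tried on a few rows first. The sketch below is an illustration under assumptions: the four-row CSV text is made up, and `?` is assumed to mark a missing value, as the `na_values="?"` argument in the solution cell suggests.

```python
import io
import pandas as pd

# Hypothetical four-row file; "?" marks a missing chol value
csv_text = "age,gender,chol,num\n28,1,132,0\n35,0,?,1\n37,1,250,1\n54,1,260,1\n"
df = pd.read_csv(io.StringIO(csv_text), na_values="?")

mean_age = df["age"].mean()                          # column mean

thirties = df[(df["age"] > 30) & (df["age"] < 40)]   # boolean-mask row filter
mean_chol_thirties = thirties["chol"].mean()         # mean() skips NaN by default

sick = df[df["num"] == 1]
share_men = (sick["gender"] == 1).mean() * 100       # percentage of diagnosed rows that are men

print(mean_age, mean_chol_thirties, share_men)
```

Each comparison must be wrapped in parentheses before combining with `&`, because `&` binds more tightly than `>` and `<` in Python.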

In [4]:
# Your code goes here - complete the code below

import pandas as pd

df = pd.read_csv('data.csv', index_col=False, header=0,na_values="?")



# report 1
average_age=df.age.mean()
print("Report1\n \tthe average age of patients are: ",average_age)

#report2
print("\nReport2")
for i in range(20,60,10):
    average_chol =df[(df['age'] >i) & (df['age']<i+10)]['chol'].mean()
    print("\tfor people with age in range of", i,i+10,"average CHOL is: ",average_chol )

# report3
print("\nReport3")
df_sorted=df.sort_values(by= ['chol'])
# Note: sort_values places rows with a missing chol last, so they fall inside the "highest" slice
count=int(df_sorted.shape[0]*.3)
average_tres_low_chol=df_sorted[:count]['trestbps'].mean()
print("\tfor patients with lowest 30% of chol, average trestbps is: ",average_tres_low_chol)
average_tres_high_chol=df_sorted[-count:]['trestbps'].mean()
print("\tfor patients with highest 30% of chol, average trestbps is: ",average_tres_high_chol) 


# report4
print("\nReport4")
male_gender_sick=df[(df['gender']==1)&(df['num']==1)]['gender'].size
female_gender_sick=df[(df['gender']==0)&(df['num']==1)]['gender'].size
total_sick_people=male_gender_sick+female_gender_sick
male_percent=(male_gender_sick/total_sick_people)*100
female_percent=(female_gender_sick/total_sick_people)*100

print("\t{0:.2f}% of patients diagnosed with heart disease are men and ".format( male_percent ),"{0:.2f}% of them are women".format( female_percent ))


Report1
 	the average age of patients are:  47.767918088737204

Report2
	for people with age in range of 20 30 average CHOL is:  187.5
	for people with age in range of 30 40 average CHOL is:  239.6888888888889
	for people with age in range of 40 50 average CHOL is:  245.6451612903226
	for people with age in range of 50 60 average CHOL is:  258.6666666666667

Report3
	for patients with lowest 30% of chol, average trestbps is:  133.22988505747125
	for patients with highest 30% of chol, average trestbps is:  136.09302325581396

Report4
	88.57% of patients diagnosed with heart disease are men and  11.43% of them are women
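
The Report3 figures here differ slightly from those in Section A. One likely contributor is the handling of missing `chol` values: the `Table` version removes them before taking 30%, while the DataFrame version slices 30% of all rows, and `sort_values` places `NaN` last by default. The sketch below illustrates the effect on synthetic data (the numbers are made up and `n` is a hypothetical slice size); it is not a claim about the exact cause in data.csv.

```python
import numpy as np
import pandas as pd

# Synthetic data: two of the five chol values are missing
df = pd.DataFrame({
    "chol":     [200.0, np.nan, 300.0, 100.0, np.nan],
    "trestbps": [120.0, 150.0, 140.0, 110.0, 160.0],
})
n = 2  # hypothetical size of the "highest" slice

# Without dropping NaN, the tail of the sorted frame holds the missing rows
naive_top = df.sort_values(by="chol")[-n:]

# Dropping rows with a missing chol first gives the rows with the highest chol
clean_top = df.dropna(subset=["chol"]).sort_values(by="chol")[-n:]

print(naive_top["trestbps"].mean())
print(clean_top["trestbps"].mean())
```

Calling `dropna(subset=['chol'])` before sorting would bring the DataFrame version in line with the `nan` removal that `getColumn` performs in Section A.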
