# Coding with Pandas, Matplotlib and Sklearn

The aim of this notebook is to learn about medical coding systems, their hierarchy, and clustering of medical data using several Python libraries.

The following reference documentation may help:
1. <a href="https://pandas.pydata.org/docs/">Pandas</a> and <a href="https://numpy.org/doc/1.24/">NumPy</a>
    1. [Pandas 10 minute introduction](https://pandas.pydata.org/docs/user_guide/10min.html) 
    2. [NumPy for absolute beginners](https://numpy.org/doc/stable/user/absolute_beginners.html), read until the section on [Creating Matrices](https://numpy.org/doc/stable/user/absolute_beginners.html#creating-matrices)
2. <a href="https://matplotlib.org/stable/index.html">Matplotlib</a> 
    1. [Pyplot Tutorial](https://matplotlib.org/stable/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py)
3. Coding System: <a href="https://icd.who.int/browse10/2019/en#/">ICD-10</a> 
4. Clustering: <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html">K-means Sklearn</a> || <a href="https://en.wikipedia.org/wiki/K-means_clustering">K-means Wikipedia</a>
    1. [Kaggle Clustering with K-means](https://www.kaggle.com/code/ryanholbrook/clustering-with-k-means)

***Run the setup script below before starting.***

In [None]:
# Initial Setup Script
import warnings
warnings.filterwarnings("ignore")
from lab_2 import *
import pandas as pd
from IPython.display import display
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans

print("Setup complete.")

#### ***Warm Up***

***Warm Up: 1***

Create a DataFrame called `icd_codes` with the columns `ICD Code` and `Name`. Include two rows with the ICD codes `C000` and `D104`, whose names are `External upper lip` and `Tonsil` respectively. The DataFrame should look like this:

|       | ICD Code | Name               |
| ----- | -------- | ------------------ |
| **0** | C000     | External upper lip |
| **1** | D104     | Tonsil             |
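
As a minimal sketch of the general pattern (not the answer to this exercise), a DataFrame can be built from a dictionary that maps column names to lists of values. The column names and values below are made up for illustration.

```python
import pandas as pd

# Hypothetical example data, unrelated to the exercise answer
wards = pd.DataFrame({
    "Ward": ["A1", "B2"],   # first column
    "Beds": [12, 8],        # second column
})

# Rows get the default integer index 0, 1, ...
print(wards)
```

Each dictionary key becomes a column, and all lists must have the same length.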



In [None]:
# TODO: Your code goes here.
icd_codes = _____

# Check your answer
w1_check(icd_codes)
icd_codes

In [None]:
#view one of the possible answers
w1_solution()

***Warm Up: 2***

Create a DataFrame called `icd_codes` with the columns `ICD Code` and `Count`. Include two rows with the ICD codes `C000` and `D104`, whose counts are `4` and `6` respectively. The DataFrame should look like this:

|         | ICD Code | Count |
| ------- | -------- | ----- |
| **0**   | C000     | 4     |
| **1**   | D104     | 6     | 

We will use this DataFrame to create a barplot.

In [None]:
# TODO: Your code goes here.
icd_codes = _____

# Check your answer
icd_codes.plot.bar(x="ICD Code", y="Count")
plt.show()
w2_check()

In [None]:
# View one of the possible answers
w2_solution()

***Warm Up: 3***

Find the ICD-10 code for `Tuberculosis of lung, confirmed by culture only`. You can look up the ICD-10 code [here](https://icd.who.int/browse10/2019/en#/). Submit your answer below in capital letters, without the dot (.) in the ICD code (example: `B653`):

In [None]:
icd_code = input("Enter your answer here: ")

# Check your answer
w3_check(icd_code)

In [None]:
# View one of the possible answers
w3_solution()

***Warm Up: 4 - Just for Viewing: Real-World Dataset Overview***

The custom function `datasetOverview()`, written specifically for this notebook, shows the columns and the first 10 rows of our dataset. Run it and study the output to get familiar with the data.

In [None]:
# Get an overview of the dataset
datasetOverview()

<hr>
<hr>
<hr>
<hr>

## Tasks
***For the tasks use the DataFrame `DATA_4`***

<hr>
<hr>

In [None]:
# Sites for searching ICD / diagnosis codes and diseases:
# https://icd.who.int/browse10/2010/en
# https://finnkode.ehelse.no/#icd10/0/0/0/-1

##### Task 1

Find the correct ICD code for the condition named in the code cell below.

You can look up the ICD-10 code [here](https://icd.who.int/browse10/2019/en#/). Submit your answer in capital letters, without the dot (.) in the ICD code (example: `B653`):

In [None]:
# Run this cell and provide the ICD code as input
print("For: Iron deficiency anaemia")
icd_code = input("Enter corresponding ICD code: ")
t1_check(icd_code)

In [None]:
#view answer
t1_solution()

<hr>

##### Task 2

Find the correct ICD code for the condition named in the code cell below.

You can look up the ICD-10 code [here](https://icd.who.int/browse10/2019/en#/). Submit your answer in capital letters, without the dot (.) in the ICD code (example: `B653`):

In [None]:
# Run this cell and provide the ICD code as input
print("For: Cutaneous abscess, furuncle and carbuncle")
icd_code = input("Enter corresponding ICD code: ")
t2_check(icd_code)

In [None]:
#view answer
t2_solution()

<hr>

##### Task 3

Find the ICD-10 code for `Dementia in Alzheimer disease`, which is a mental and behavioural disorder. The ICD-10 code must encompass all of the following:
 - Dementia in Alzheimer disease with early onset
 - Dementia in Alzheimer disease with late onset
 - Dementia in Alzheimer disease, atypical or mixed type
 - Dementia in Alzheimer disease, unspecified

You can look up the ICD-10 code [here](https://icd.who.int/browse10/2019/en#/). Submit your answer below, without the dot (.) in the ICD code:

In [None]:
#Provide the ICD-10 code as input
icd_code = input("Enter your answer here: ")

# Check your answer
t3_check(icd_code)

In [None]:
#view solution
t3_solution()

<hr>

In [None]:
# Searching or filtering DataFrames/patients based on ICD / diagnosis codes.

# Find all patients in DATA_4 with age_diagnosis_year greater than 100
"""DATA_8 = DATA_4[DATA_4["age_diagnosis_year"] > 100]

DATA_8.loc[:,["gender","id"]]"""

# Find all patients in DATA_4 with age_diagnosis_year greater than 95 and age_dead_year greater than 100
"""DATA_4[(DATA_4["age_diagnosis_year"] > 95) & (DATA_4["age_dead_year"] > 100)]"""




In [None]:
# Example 
data = {'Names': ['Alice', 'Bob', 'Charlie', 'David', 'Eve']}
df = pd.DataFrame(data)

# Select rows where Names starts with 'E' using startswith()
selected_df = df[df['Names'].str.startswith('E')]

print(selected_df)



# Find all patients diagnosed with I10 in the diagnosis_code column of DATA_4
# (text-based filtering on a string column of a DataFrame)
"""DATA_4[DATA_4["diagnosis_code"].str.startswith("I10")]"""

#DATA_4[DATA_4["diagnosis_code"].str.contains("I10")]

##### Task 4

Display the rows of `DATA_4` whose `diagnosis_code` falls under the `Dementia in Alzheimer disease` ICD-10 code, which encompasses all of the following:
 - Dementia in Alzheimer disease with early onset
 - Dementia in Alzheimer disease with late onset
 - Dementia in Alzheimer disease, atypical or mixed type
 - Dementia in Alzheimer disease, unspecified

Tips: 
1. Fill in the correct ICD code.
2. Use `str.startswith()` or `str.contains()` with the ICD-10 code on the `diagnosis_code` column to select the appropriate rows.
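
The difference between the two string methods can be sketched on a toy Series. The codes below are hypothetical and only illustrate the matching behaviour: `startswith()` matches at the beginning of each string, while `contains()` matches anywhere in it.

```python
import pandas as pd

# Hypothetical codes for illustration only
codes = pd.Series(["A150", "A151", "XA15", "B200"])

# startswith() matches only at the beginning of each string
prefix_mask = codes.str.startswith("A15")

# contains() matches the pattern anywhere in the string
anywhere_mask = codes.str.contains("A15")

print(codes[prefix_mask].tolist())
print(codes[anywhere_mask].tolist())
```

Because ICD-10 codes are hierarchical, a prefix match is usually the safer way to select a code together with all of its subcodes.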

In [None]:
# TODO: Your code goes here.
icd_code = "F00"
diagnosis_codes_XXX_df = DATA_4[DATA_4["diagnosis_code"].str.startswith(icd_code)]
#_____

# Check your answer
t4_check(diagnosis_codes_XXX_df)
diagnosis_codes_XXX_df

In [None]:
#view one of the possible answers
t4_solution()

<hr>

In [None]:
le = LabelEncoder()
DATA_4["labelencoded"] = le.fit_transform(DATA_4["gender"])
DATA_4[["labelencoded"]]

##### Task 5: Apply label encoding using `LabelEncoder()` on the `gender` and `diagnosis_code` columns of the `DATA_4` DataFrame.

Follow these instructions in order:
1. Create a new DataFrame `DATA_4_Gender_DiagnosisCode` as a copy of only the `gender` and `diagnosis_code` columns of `DATA_4`.
2. Display the first 20 rows of `DATA_4_Gender_DiagnosisCode`.
3. Create an instance of `LabelEncoder()` named `le`. `LabelEncoder` is already imported via `from sklearn.preprocessing import LabelEncoder`.
4. Use a for loop to iterate over the columns of `DATA_4_Gender_DiagnosisCode`.
5. Inside the loop, fit and transform each column with the label encoder instance. (This step is already done in the snippet below.)
6. Display `DATA_4_Gender_DiagnosisCode` after label encoding.
7. Analyze the difference between the original and the label-encoded DataFrame.
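
To see what label encoding does before applying it to `DATA_4`, here is a minimal sketch on a made-up column (the `colour` data is hypothetical). `LabelEncoder` sorts the distinct labels and replaces each one with its position in that sorted list.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, not taken from DATA_4
toy = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

le = LabelEncoder()
toy["colour_encoded"] = le.fit_transform(toy["colour"])

# classes_ lists the original labels; the position of each label is its integer code
print(list(le.classes_))
print(toy)
```

Note that calling `fit_transform` again on another column refits the encoder, so a single `le` reused in a loop only remembers the classes of the last column it was fitted on.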

In [None]:
# TODO: Your code goes here.

'''1. Create a new DataFrame `DATA_4_Gender_DiagnosisCode` as a copy of only the `gender` and `diagnosis_code` columns of `DATA_4`.'''
DATA_4_Gender_DiagnosisCode = _____.copy()

'''2. Display the first 20 rows of the `DATA_4_Gender_DiagnosisCode'''
print("Original DataFrame")
display(DATA_4_Gender_DiagnosisCode.head(20))

'''3. Create an instance named `le` for the `LabelEncoder()'''
le = _____()

'''4. Apply a for loop to iterate through the `DATA_4_Gender_DiagnosisCode` columns'''
for col in _____:
    
    '''5. Perform fit transformation using label encoder instance on dataframe `DATA_4_Gender_DiagnosisCode`.'''
    DATA_4_Gender_DiagnosisCode[col] = le.fit_transform(DATA_4_Gender_DiagnosisCode[col])
    
'''6. Display the `DATA_4_Gender_DiagnosisCode` after performing label encoding.'''
print("DataFrame After Label Encoding")
display(DATA_4_Gender_DiagnosisCode.head(20))

# Checks your answer
t5_check(DATA_4_Gender_DiagnosisCode.head(20))

In [None]:
#view the answers
t5_solution()

<hr>

##### Task 6: Apply one-hot encoding on the `gender` and `diagnosis_code` columns of the `DATA_4` DataFrame using `get_dummies()`.

Follow these instructions in order:
1. Create a new DataFrame `DATA_4_Gender_DiagnosisCode` as a copy of only the `gender` and `diagnosis_code` columns of `DATA_4`.
2. Display the first 20 rows of `DATA_4_Gender_DiagnosisCode`.
3. Create a DataFrame `onehotencoded_DATA_4` that stores the result of applying one-hot encoding with pandas `get_dummies()`.
    1. Syntax for one-hot encoding with pandas `get_dummies()`: `pd.get_dummies(data,columns=[column_name],dtype=int)`
4. Display `onehotencoded_DATA_4` after one-hot encoding.
5. Analyze the difference between the original and the one-hot encoded DataFrame.
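
As a rough illustration of the syntax above, the sketch below one-hot encodes two made-up categorical columns (the `colour` and `size` data are hypothetical). Each distinct value becomes its own 0/1 column named `<column>_<value>`.

```python
import pandas as pd

# Hypothetical toy data, not taken from DATA_4
toy = pd.DataFrame({
    "colour": ["red", "green", "red"],
    "size": ["S", "M", "S"],
})

# One new 0/1 column per distinct value in each listed column
encoded = pd.get_dummies(toy, columns=["colour", "size"], dtype=int)
print(encoded)
```

Unlike label encoding, this does not impose an artificial order on the categories, at the cost of one column per distinct value.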

In [None]:
# TODO: Your code goes here.
'''1. Create a new DataFrame `DATA_4_Gender_DiagnosisCode` as a copy of only the `gender` and `diagnosis_code` columns of `DATA_4`.'''
_____ = DATA_4[['gender','diagnosis_code']].copy()

'''2. Display the first 20 rows of the `DATA_4_Gender_DiagnosisCode'''
display(DATA_4_Gender_DiagnosisCode.head(20))

'''3. Create a dataframe onehotencoded_DATA_4 that stores the new dataframe generated after applying one hot encoding through pandas get_dummies()
        3A. Syntax to apply one hot encoding through pandas get_dummies: pd.get_dummies(data,columns=[column_name],dtype=int)'''
onehotencoded_DATA_4 = pd._____(DATA_4_Gender_DiagnosisCode,columns = ['gender','diagnosis_code'],dtype=int)

'''4. Display `onehotencoded_DATA_4` after performing one-hot encoding.'''
display(onehotencoded_DATA_4.head(20))

# Checks your answer
t6_check(onehotencoded_DATA_4.head(20))

In [None]:
#view the answers
t6_solution()

<hr>

In [None]:
point1 = np.array((3, 2, 3))
point2 = np.array((2, 1, 1))

sum_sq = np.sum(np.square(point1 - point2))
print(sum_sq)

##### Task 7

Compute the Euclidean distance between the `age_dead_year` and `age_diagnosis_year` columns of the `DATA_4` DataFrame for the rows with `diagnosis_code` `F102` or `F011`, using NumPy's `np.sqrt()` and `np.sum()` together with pandas.

Tips: You can select the rows using either `isin()` or the OR operator (`|`).

Read more about [Euclidean distance](https://machinelearningmastery.com/distance-measures-for-machine-learning/#:~:text=Euclidean%20distance%20calculates%20the%20distance,floating%20point%20or%20integer%20values.). It computes the distance between two points in Euclidean space.
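
As a small worked example, the sketch below computes the Euclidean distance $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$ for the two points used in the cell before this task. That cell stops at the sum of squared differences; taking the square root gives the distance. The comparison with `np.linalg.norm` is only a cross-check and is not required by the task.

```python
import numpy as np

p = np.array((3, 2, 3))
q = np.array((2, 1, 1))

# Sum of squared differences: 1 + 1 + 4 = 6
sum_sq = np.sum((p - q) ** 2)

# The Euclidean distance is the square root of that sum
distance = np.sqrt(sum_sq)
print(distance)

# Cross-check with NumPy's built-in vector norm
print(np.isclose(distance, np.linalg.norm(p - q)))
```

The same expression works unchanged when `p` and `q` are two equally long columns of values rather than 3D points.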

In [None]:
# TODO: Your code goes here.
# Select data based on diagnosis_code F102, F011
selected_df = DATA_4[DATA_4["diagnosis_code"].isin(["F102", "F011"])]

# Extract the values of the age_dead_year and age_diagnosis_year columns
x = selected_df["age_dead_year"].values
y = selected_df["age_diagnosis_year"].values

# Compute the Euclidean distance between x and y
distance = np.sqrt(np.sum((x - y) ** 2))

# Print the Euclidean distance
print(distance)

# Check your answer
t7_check(distance)

In [None]:
#view one of the possible answers
t7_solution()

<hr>

##### Task 8

Select the `diagnosis_text` column from the `DATA_4` DataFrame and make a copy containing only the first 5 rows. Then convert the text in each row to its vector representation using `CountVectorizer` from `sklearn`. Display both the original and the transformed `diagnosis_text` data.

Follow these instructions in order:
1. Copy the first 5 rows of the `diagnosis_text` column from the `DATA_4` DataFrame and save the copy as `TASKXX_DATA_4`.
2. Display the copied `TASKXX_DATA_4` DataFrame.
3. Create an instance `count_vectorizor` of `CountVectorizer()` for vectorizing strings/text.
4. Fit and transform the `diagnosis_text` column of the `TASKXX_DATA_4` DataFrame with `count_vectorizor`.
5. Convert the transformed data `numeric_diagnosis_text` to a DataFrame and save it as `numeric_diagnosis_text_df`.
6. Display the `numeric_diagnosis_text_df` DataFrame.

Info: Many data analysis and machine learning algorithms require numerical input, so text must first be converted to a numerical/vector representation before they can be applied.
There are several ways to do this, such as A. `Bag of Words` B. `Word Embedding` C. `Document Embedding` D. `TF-IDF`. `CountVectorizer` produces a bag-of-words representation.

In [None]:
# TODO: Your code goes here.

# Create a copy of the DataFrame
TASKXX_DATA_4 = DATA_4[['diagnosis_text']].iloc[:5]._____()

# Display all rows of the original DataFrame
display(TASKXX_DATA_4)

# Initialize the CountVectorizer
count_vectorizor = _____()

# Fit and transform the text column
numeric_diagnosis_text = count_vectorizor._____(TASKXX_DATA_4['diagnosis_text'])

# Convert the transformed data into a DataFrame
numeric_diagnosis_text_df = pd.DataFrame(numeric_diagnosis_text.toarray(), columns=count_vectorizor.get_feature_names_out())

# Display all rows of the transformed DataFrame
display(numeric_diagnosis_text_df)

# Check your answer
t8_check(numeric_diagnosis_text_df)

In [None]:
#view answer
t8_solution()

<hr>

##### Task 9

Compute the minimum, maximum, mean, and median of the `age_dead_year` column of `DATA_4`. 

Tips: Use `min()`,`max()`,`mean()`,`median()`
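
As a quick illustration of these four methods, the sketch below applies them to a small made-up Series (the ages are hypothetical). The `agg()` call at the end is an optional shortcut that computes all four in one step.

```python
import pandas as pd

# Hypothetical ages for illustration only
ages = pd.Series([70, 82, 65, 90, 82])

print(ages.min())     # smallest value
print(ages.max())     # largest value
print(ages.mean())    # arithmetic mean
print(ages.median())  # middle value of the sorted data

# Optional: compute all four statistics in one call
summary = ages.agg(["min", "max", "mean", "median"])
print(summary)
```

A mean that differs noticeably from the median hints at a skewed distribution or at outliers.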

In [None]:
# TODO: Your code goes here.
min_val = DATA_4['age_dead_year']._____()
max_val = DATA_4['age_dead_year']._____()
mean_val = DATA_4['age_dead_year']._____()
median_val = DATA_4['age_dead_year']._____()

print(f'Minimum:{min_val}, Maximum:{max_val}, Mean:{mean_val}, Median:{median_val}')

# Check your answer
answer = min_val,max_val,mean_val,median_val
t9_check(answer)

In [None]:
#view one of the possible answers
t9_solution()

<hr>

##### Task 10

Display all rows with the specific `diagnosis_code` "J449" (i.e., return a DataFrame where the `diagnosis_code` is "J449").

In [None]:
# TODO: Your code goes here.
diagnosis_codes_starting_with_J449_df = DATA_4[DATA_4["_____"].str.startswith("_____")]

# Check your answer
t10_check(diagnosis_codes_starting_with_J449_df)
diagnosis_codes_starting_with_J449_df

In [None]:
#view one of the possible answers
t10_solution()

<hr>

In [None]:

data = {'X': [1, 2, 3, 4, 5],
        'Y': [10, 15, 13, 20, 18]}
df = pd.DataFrame(data)

x_data = df['X']
y_data = df['Y']

# The label is what plt.legend() displays
plt.plot(x_data, y_data, label='Y')


# Add labels and a title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot Example')

# Show a legend if needed
plt.legend()

# Display the plot
plt.show()

##### Task 11

Create a new `DATA_4_ICD` DataFrame by filtering `DATA_4` to include only rows whose `diagnosis_code` starts with "C11", which is `Malignant neoplasm of nasopharynx`. Then plot `age_diagnosis_year`, `age_first_contact_year`, `age_last_contact_year`, and `age_dead_year` in a line plot using matplotlib, and analyze the result.

Follow these instructions in order:
1. Select the rows with a `diagnosis_code` starting with `C11` using `startswith()` and save them as `DATA_4_ICD`.
2. Create a new DataFrame named `DATA_4_ICD_C11` consisting of the columns `'id', 'age_diagnosis_year', 'age_first_contact_year', 'age_last_contact_year', 'age_dead_year'`.
3. Set the `id` column, which is the patient id, as the index using `set_index()`.
4. Create a `line` plot with a `grid` and `marker` set to `o` using `plot()`.
5. Set the plot labels: `Patient id` as `xlabel`, `Year` as `ylabel`, and `Line plot of patient with diagnose, first, last, dead year` as the title.
6. Show the plot.
7. Analyse the plot result yourself.

In [None]:
# TODO: Your code goes here.

# Select the rows with a C11 diagnosis code and keep the relevant columns
DATA_4_ICD = DATA_4[DATA_4['diagnosis_code'].str.startswith('C11')]
DATA_4_ICD_C11 = _____[['id', 'age_diagnosis_year', 'age_first_contact_year', 'age_last_contact_year', 'age_dead_year']]

# Set the 'id' column (patient id) as the index
DATA_4_ICD_C11.set_index('id', inplace=True)

# Create a line plot with a grid
ax = DATA_4_ICD_C11._____(kind='_____', marker='o')

# Add a grid
ax._____(True)

# Set plot labels and title
plt.xlabel('Patient id')
plt.ylabel('Year')
plt.title('Line plot of patient with diagnose, first, last, dead year')

# Show the plot
plt.show()

In [None]:
# Check your answer (Visualize and compare the plot to determine if your result is correct or not)
t11_check()

In [None]:
# View one of the possible answers
t11_solution()

<hr>

In [None]:
data = {'X': [1, 2, 3, 4, 5],
        'Y': [10, 15, 13, 20, 18]}
df = pd.DataFrame(data)

x_data = df['X']
y_data = df['Y']

# Create a scatter plot
plt.scatter(x_data, y_data, marker='o', label='Y')

# Add labels and a title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')

# Show a legend if needed
plt.legend()

# Display the plot
plt.show()

##### Task 12

Using the `DATA_4_ICD` DataFrame, create a scatter plot using `scatter()` for diagnosis_code `F001` which is `Dementia in Alzheimer disease with late onset`. 
The scatter plot should have `age_last_contact_year` as the x-axis and  `age_dead_year` as the y-axis.

In [None]:
# TODO: Your code goes here.
DATA_4_ICD = DATA_4[DATA_4["diagnosis_code"].str.startswith("F001")]

# Create a scatter plot for `age_last_contact_year` vs `age_dead_year` using matplotlib
DATA_4_ICD.plot.scatter(x="_____", y="_____")
plt._____(DATA_4_ICD["age_last_contact_year"], DATA_4_ICD["age_dead_year"])

# Assign the x and y labels
plt.xlabel("age_last_contact_year")
plt.ylabel("age_dead_year")

# Display the scatter plot
plt.show()

In [None]:
# Check your answer (Visualize and compare the plot to determine if your result is correct or not)
t12_check()

In [None]:
# View one of the possible answers
t12_solution()

<hr>

In [None]:

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5]

# Create the histogram
plt.hist(data)


plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Histogram')

# Show the plot
plt.show()


##### Task 13

Using `DATA_4`, create a histogram plot with matplotlib showing the total number of patients with some of the world's most common diseases, identified by the ICD codes (`diagnosis_code` prefixes) listed below. The title of the histogram should be `Histogram showing the number of patients with some common diseases`: 
1. Diagnosis code: `I2`, Disease: `Ischaemic heart diseases and Pulmonary heart disease and diseases of pulmonary circulation`
2. Diagnosis code: `C`, Disease: `Malignant neoplasms`
3. Diagnosis code: `J4`, Disease: `Chronic lower respiratory diseases` 
4. Diagnosis code: `E66`, Disease: `Obesity` 
5. Diagnosis code: `F00`, Disease: `Dementia in Alzheimer disease` 
6. Diagnosis code: `E1`, Disease: `Diabetes mellitus and Other disorders of glucose regulation and pancreatic internal secretion` 

In [None]:
# TODO: Your code goes here.

# Select the rows for each diagnosis_code prefix
DATA_4_ICD_I2 = DATA_4[DATA_4['diagnosis_code'].str.startswith('I2')]
DATA_4_ICD_C = DATA_4[DATA_4['diagnosis_code'].str.startswith('C')]
DATA_4_ICD_J4 = DATA_4[DATA_4['diagnosis_code'].str.startswith('J4')]
DATA_4_ICD_E66 = DATA_4[DATA_4['diagnosis_code'].str.startswith('E66')]
DATA_4_ICD_F00 = DATA_4[DATA_4['diagnosis_code'].str.startswith('F00')]
DATA_4_ICD_E1 = DATA_4[DATA_4['diagnosis_code'].str.startswith('E1')]

# Compute the length of the DataFrame for each diagnosis_code prefix
number_of_patient = [len(DATA_4_ICD_I2), len(DATA_4_ICD_C), len(DATA_4_ICD_J4), len(DATA_4_ICD_E66), len(DATA_4_ICD_F00), len(DATA_4_ICD_E1)]
diagnose_code_labels = ['I2', 'C', 'J4', 'E66', 'F00', 'E1']

# Create a histogram with count of patient and the diagnosis code
plt._____(diagnose_code_labels,number_of_patient, color='blue', edgecolor='black')
plt._____('Histogram showing the number of patients with some common diseases ')
plt.xlabel('Diagnosis Code')
plt.ylabel('Count')
plt.grid(axis='y')

# Show the histogram
plt.show()

In [None]:
# Check your answer (Visualize and compare the plot to determine if your result is correct or not)
t13_check()

In [None]:
# View one of the possible answers
t13_solution()

<hr>

##### Task 14

Select rows from the `DATA_4` DataFrame whose `diagnosis_code` starts with "C41", "I42", "J42", or "Z42". Then, using matplotlib:
- Plot the selected rows in a scatter plot with `diagnosis_code` on the y-axis and `age_diagnosis_year` on the x-axis.
- Assign `age_diagnosis_year` as the x-label and `diagnosis_code` as the y-label.
- Set the title of the plot to `age_diagnosis_year vs diagnosis_code`.
- Finally, display the plot.

Tips: Create a tuple with all the `diagnosis_code` prefixes and pass it to `startswith()`.
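
The tuple tip can be sketched on made-up data (the codes and ages below are hypothetical). This assumes pandas 1.5 or newer, where `str.startswith()` accepts a tuple of prefixes; a row is selected if its code starts with any of them.

```python
import pandas as pd

# Hypothetical toy data, not taken from DATA_4
toy = pd.DataFrame({
    "code": ["K210", "L400", "K219", "M545", "L409"],
    "age": [40, 55, 61, 38, 70],
})

# A tuple of prefixes: a row matches if its code starts with any of them
prefixes = ("K21", "L40")
selected = toy[toy["code"].str.startswith(prefixes)]
print(selected)
```

On older pandas versions, combining several single-prefix masks with the OR operator (`|`) gives the same selection.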

In [None]:
# TODO: Your code goes here.
"""
Create a tuple `icd_codes` consisting of the ICD codes starting with C41,I42,J42,Z42.
"""
icd_codes = ("C41", "I42", "J42", "Z42")

# Select rows that start with those ICD codes 
df_selected = DATA_4[DATA_4["diagnosis_code"].str.startswith(______)]

# Create a scatter plot of age_diagnosis_year vs. diagnosis code
plt.______(_____)

# Assign x and y labels and a title below respectively
___._____
___._____
___._____

# Show the plot
___._____


In [None]:
# Check your answer (Visualize and compare the plot to determine if your result is correct or not)
t14_check()

In [None]:
# View one of the possible answers
t14_solution()

<hr>

##### Task 15

Select rows from the `DATA_4` DataFrame that have ICD codes starting with "A415", "C348", or "J440". Then plot `diagnosis_code` against `age_diagnosis_year` using matplotlib. 

Analyze the result.

In [None]:
# TODO: Your code goes here.
"""
1. Create a tuple variable `icd_codes` consisting of the ICD codes starting with "A415", "C348", and "J440".
"""

icd_codes = ("A415", "C348", "J440")

# Select rows that start with those ICD codes 
df_selected =  DATA_4[DATA_4["diagnosis_code"].str.startswith(icd_codes)]

# Create a scatter plot of age_diagnosis_year vs. diagnosis code
plt._____(_____, _____)

# Assign labels and a title
plt.xlabel("age_diagnosis_year")
plt.ylabel("diagnosis_code")
plt.title("age_diagnosis_year vs. diagnosis_code")


# Show the plot
plt.show()

In [None]:
# Check your answer (Visualize and compare the plot to determine if your result is correct or not)
t15_check()

In [None]:
#view one of the possible answers
t15_solution()

<hr>

In [None]:
categories = ['Category A', 'Category B', 'Category C', 'Category D']
values = [10, 25, 15, 30]

# Create the bar chart
plt.bar(categories, values)

# Customize the plot (add labels, title, etc.)
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Chart ')

# Show the plot
plt.show()

##### Task 16

Select rows from the `DATA_4` DataFrame whose `diagnosis_code` starts with `A415` (`sepsis due to gram-negative bacteria`) or `J440` (`Chronic obstructive pulmonary disease with (acute) lower respiratory infection`), and count how often each code occurs. Using matplotlib, plot these counts against the age at diagnosis (column `age_diagnosis_year`) in a grouped bar chart with labels. 

Analyze the result.
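
The counting step can be sketched on made-up data (the `year` and `code` columns below are hypothetical). Grouping by two columns and calling `size()` counts the rows per combination, and `unstack()` turns the second grouping level into columns. The `fill_value=0` argument is an optional addition here that replaces missing combinations with 0 instead of `NaN`.

```python
import pandas as pd

# Hypothetical toy data, not taken from DATA_4
toy = pd.DataFrame({
    "year": [60, 60, 60, 70, 70],
    "code": ["X1", "X1", "Y2", "Y2", "Y2"],
})

# Count rows per (year, code) pair, then pivot the codes into columns
counts = toy.groupby(["year", "code"]).size().unstack(fill_value=0)
print(counts)
```

In the resulting table each row is one group on the x-axis and each column is one bar within that group.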

In [None]:
# TODO: Your code goes here.
# Select rows with diagnosis codes starting with 'A415' and 'J440'
df_selected =  DATA_4[DATA_4["diagnosis_code"].str.startswith(("A415", "J440"))]

# Group the counts by age at diagnosis and diagnosis code
grouped_count_df = df_selected.groupby(["_____", "diagnosis_code"]).size().unstack()
# Plot the grouped bar chart
grouped_count_df.plot(kind="_____")
# Assign labels
plt.xlabel("Diagnosis Year")
plt.ylabel("Count")
#plot legend titled "ICD Code"
plt.legend(title="ICD Code")
    
# Show the plot
plt.show()

In [None]:
# Check your answer (Visualize and compare the plot to determine if your result is correct or not)
t16_check()

In [None]:
#view one of the possible answers
t16_solution()

<hr>

In [None]:
#Plot pie
sizes = [15, 30, 45, 10]  
labels = ['Category A', 'Category B', 'Category C', 'Category D']  
colors = ['blue', 'green', 'red', 'purple'] 

# Create the pie chart
plt.pie(sizes, labels=labels, colors=colors)

plt.axis('equal')  
plt.title('Pie Chart')

#plt.savefig('my_plot.jpg')

# Show the plot
plt.show()

##### Task 17

Plot the rows with `diagnosis_code` "A415", "J440", and "C348" as a pie chart showing their respective proportions in percent. 

Tips: Select the appropriate rows and count them per code using `value_counts()`. Plot the pie chart using `pie()` with the proportions shown in percent, and display the plot with the title `Pie chart of selected ICD codes with their count values`.

Analyze the result.

In [None]:
# TODO: Your code goes here.
"""
1. Create a tuple variable `icd_codes` consisting of the ICD codes starting with A415, J440, C348.
"""
icd_codes = ("A415", "J440", "C348")

# Select rows with diagnosis codes starting with A415, J440, C348
df_selected = DATA_4[DATA_4["diagnosis_code"].str.startswith(icd_codes)]

# Get the count of dataframe selected based on the diagnosis_code
age_counts = df_selected["diagnosis_code"]._____()

# Create a pie chart based on the countvalue of each diagnosis_code
plt._____(_____, labels=age_counts.index, autopct="%1.1f%%")
plt.title("Pie chart of selected ICD codes with their count values")
plt.show()


In [None]:
# Check your answer (Visualize and compare the plot to determine if your result is correct or not)
t17_check()

In [None]:
#view one of the possible answers
t17_solution()

<hr>

##### Task 18

Select rows with diagnosis codes "A415", "J440", "C348". Then plot them as a bar chart using `bar()` with a grid (`grid()`) to see which of these diseases is most common, which is least common, and how they are distributed.

And analyze the result.

In [None]:
# TODO: Your code goes here.
"""
1. Create a tuple variable `icd_codes` consisting of the ICD codes starting with A415, J440, C348.
"""

icd_codes = ("A415", "J440", "C348")

# Select rows with diagnosis codes starting with A415, J440, C348
df_selected = DATA_4[DATA_4["diagnosis_code"].str.startswith(icd_codes)]

# Get the count of each ICD code value
icd_counts = df_selected["diagnosis_code"].value_counts()

# Create a bar plot of ICD code index and their respective count values
plt._____(icd_counts.index, _____)
plt.xlabel("Diagnosis Code")
plt.ylabel("Count")
plt.title("Bar plot of selected ICD codes(A415, J440, C348)")
plt.grid(True)

# Show the plot
plt.show()

In [None]:
# Check your answer (Visualize and compare the plot to determine if your result is correct or not)
t18_check()

In [None]:
#view one of the possible answers
t18_solution()

<hr>

##### Task 19

Create two scatter subplots placed side by side, containing:
1. `age_first_contact_year` vs `age_diagnosis_year`, and 
2. `age_last_contact_year` vs `age_dead_year`.

The figure should have a size of `(12, 5)` and use the diagnosis codes "A415" and "C348". Follow the instructions in the code comments for tips.

In [None]:
# TODO: Your code goes here.
# Create figure with two horizontally placed subplots of size (12, 5)
fig, (ax1, ax2) = plt.subplots(1, 2, _______)

"""
1.  Create a tuple variable `icd_codes` consisting of the ICD codes starting with A415, C348.
"""
icd_codes = ("A415", "C348")

# Select rows with diagnosis codes starting with "A415" and "C348"
df_selected = DATA_4[DATA_4["diagnosis_code"].str.startswith(icd_codes)]

# Plot age_first_contact_year vs age_diagnosis_year on the first subplot
ax1._______(df_selected["age_first_contact_year"], df_selected["age_diagnosis_year"])
# Assign "Age at first contact" as x labels for subplot1
ax1.set_xlabel("Age at first contact")
# Assign "Age diagnosis year" as y labels for subplot1
ax1.set_ylabel("Age diagnosis year")
# Assign title for subplot1
ax1._________("Relationship between age at first contact and age at diagnosis")


# Plot age_last_contact_year vs age_dead_year on the second subplot
ax2._______(df_selected["age_last_contact_year"], df_selected["age_dead_year"])
# Assign "age_last_contact_year" as x labels for subplot2
_____._________("age_last_contact_year")
# Assign "age_dead_year" as y labels for subplot2
_____._________("age_dead_year")
# Assign title for subplot2
_____.___________("Relationship between age_last_contact_year and age_dead_year")


# Show the plot
plt.show()


In [None]:
# Check your answer (Visualize and compare the plot to determine if your result is correct or not)
t19_check()

In [None]:
#view one of the possible answers
t19_solution()

<hr>

##### Task 20

Create a 3D scatter plot for the columns `gender`, `age_diagnosis_year`, and `age_dead_year` using matplotlib's `mpl_toolkits.mplot3d`.

Tips: Create a figure named `fig`, add a subplot using `add_subplot()` with the projection set to `3d`, and display it after setting the x, y, z labels and the title.  

In [None]:
# TODO: Your code goes here.
"""Replace female,male as the numeric value 0 and 1 respectively in 
DATA_4 and save it in the variable CLUSTER_DATA_4"""
CLUSTER_DATA_4 = DATA_4.replace(to_replace=["female","male"],value=[0,1])

# Create a figure
fig = _____
# Add subplot and projection as 3d
ax = fig.add_subplot(111, projection="_____")

#plot the data into a 3d scatter plot 
ax.scatter(CLUSTER_DATA_4["gender"], CLUSTER_DATA_4["age_diagnosis_year"], CLUSTER_DATA_4["age_dead_year"])

# Set plot properties (e.g., the labels and title)
ax.set_xlabel("Gender")
ax.set_ylabel("Age at diagnosis")
ax.set_zlabel("Age at death")
ax.set_title("3D plot of gender, age at diagnosis, and age at death")

# Show the 3d plot
plt.show()


In [None]:
# Check your answer (Visualize and compare the plot to determine if your result is correct or not)
t20_check()

In [None]:
#view one of the possible answers
t20_solution()

<hr>

**Just for viewing: Elbow method to find the appropriate number of clusters.** 

In [None]:
# For viewing only: elbow method for finding the appropriate number of clusters
df_selected = DATA_4[DATA_4["diagnosis_code"].str.startswith("H")]
kmeans_range = range(1, 10)

# Sum of squared errors (inertia) for each number of clusters
sse = []
for k in kmeans_range:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(df_selected[["age_diagnosis_year", "age_dead_year"]])
    sse.append(kmeans.inertia_)

# Plot the sum of squared errors against the number of clusters
plt.plot(kmeans_range, sse)

# Assign title and labels
plt.title("Elbow Plot for No of Clusters")
plt.xlabel("K")
plt.ylabel("Sum of squared errors")

# Display the plot
plt.show()

##### Task 21

Create a `df_selected` DataFrame by selecting all rows of `DATA_4` whose `diagnosis_code` starts with "H". 
Apply sklearn's K-means clustering (`KMeans()`) to the `age_diagnosis_year` and `age_dead_year` columns of `df_selected`, using `fit_predict` to form 3 clusters. Color Cluster 1, Cluster 2, and Cluster 3 green, red, and blue respectively. The elbow plot above suggests that 3 clusters is a meaningful choice for this task. 

Also read the comments in the code to complete the task.
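
To illustrate what `fit_predict` returns, here is a minimal sketch on synthetic data: three well-separated groups of 2D points generated with NumPy. The blob centers, the noise scale, and the `n_init` and `random_state` settings are assumptions chosen for a reproducible illustration, not values taken from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 20 points each (hypothetical)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [20, 0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Fit k-means with 3 clusters and get one cluster label per point
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster index (0, 1 or 2) for each point
print(kmeans.cluster_centers_)  # coordinates of the 3 learned centers
```

The label numbers themselves are arbitrary: which group ends up as cluster 0, 1, or 2 depends on the initialization, so only the grouping is meaningful.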

In [None]:
# TODO: Your code goes here.

# Select the rows of DATA_4 with diagnosis codes starting with H and save them in df_selected
df_selected = _____

# Initialize the k-means model having 3 clusters
kmeans = _____(n_clusters =_____)

# Fit the model selecting column age_diagnosis_year and age_dead_year and predict the clusters using fit_predict()
y_predicted = kmeans._____(df_selected[["age_diagnosis_year","age_dead_year"]])
df_selected["cluster"] = y_predicted

#Assign cluster to respective data 
cluster0 = _____[_____["cluster"]==0]
cluster1 = _____[_____["cluster"]==1]
cluster2 = _____[_____["_____"]==2]

# Visualize the clusters
plt.scatter(_____["age_diagnosis_year"],cluster0["_____"],color="green", label = "Cluster 1")
plt.scatter(_____["age_diagnosis_year"],cluster1["_____"],color="red", label = "Cluster 2")
plt.scatter(_____["age_diagnosis_year"],cluster2["_____"],color="blue", label = "Cluster 3")

#Provide the x and y labels and legend
plt.xlabel("age_diagnosis_year")
plt._____("age_dead_year")
_____.legend()

# Show or save plot
plt._____()

In [None]:
# Check your answer (Visualize and compare the plot to determine if your result is correct or not)
t21_check()

In [None]:
#view one of the possible answers
t21_solution()