#### Data Transformation

In this step, I prepared the collected and cleaned data for analysis.  
The goal was to make the dataset consistent, well-structured, and suitable for visualization.

#### ✅ Tasks Performed:
- Converted all **column names** (like `age`, `gender`, `region`, `year`, and `disease`) into consistent formats.
- Created a new column **Age Group** to categorize people, using the bins `[0, 21, 40, 60, 100]` with left-closed intervals:
  - Young (0–20)
  - Mid (21–39)
  - Old (40–59)
  - Senior (60–99)
- Standardized **gender** values (converted `M/F` or lowercase text to `Male/Female`).
- Cleaned and formatted **region** and **disease** names for uniformity.
- Ensured **year** column was in integer format.
- Removed unnecessary columns like `name` for focused analysis.
- Saved the final transformed dataset as `multi_disease_transformed.csv`.
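
The column-name step is not shown in the cells below, so here is a minimal sketch of one way to do it. The `raw` frame and its messy headers are hypothetical; the only grounded detail is the target title-case style (`Age`, `Gender`, `Region`, `Year`, `Disease`) used in the loaded file.

```python
import pandas as pd

# Hypothetical raw frame with inconsistently formatted column names
raw = pd.DataFrame({
    " age": [34],
    "GENDER ": ["M"],
    "region": ["uk"],
    "Year": [2021],
    "disease": ["cancer"],
})

# Strip surrounding whitespace, then title-case every column name
raw.columns = raw.columns.str.strip().str.title()
print(list(raw.columns))
```

Doing this once up front means later cells can refer to `multi['Age']` or `multi['Region']` without worrying about case or stray spaces.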

#### 🧠 Outcome:
A clean and structured dataset ready for detailed analysis on disease distribution by  
**age group, gender, region, and year.**


In [2]:
# import required libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [3]:
# load the combined, cleaned CSV file for data transformation

multi = pd.read_csv('multi_disease_cleaned.csv')
print(multi.head())
print(multi.tail())

                Name  Age  Gender  Year    Region  Disease
0      John Lawrence   71    Male  2021        UK  Cancer 
1    Robert Matthews   34    Male  2021     China  Cancer 
2  Chandran Malhotra   80    Male  2023  Pakistan  Cancer 
3       Joshua Smart   40    Male  2015        UK  Cancer 
4       Denise Lopez   43  Female  2017    Brazil  Cancer 
                  Name  Age  Gender  Year     Region Disease
55387  Jennifer Garcia   43    Male  2023     Canada  Asthma
55388     Michael Frye   18    Male  2016  Australia  Asthma
55389       Mark Smith   54  Female  2021      China  Asthma
55390   Michael Miller   46    Male  2021      China  Asthma
55391     Teresa Russo   26    Male  2022      China  Asthma


In [4]:
# create the age group column

bins = [0,21,40,60,100]
labels = ['Young','Mid','Old','Senior']
multi['Age Group'] = pd.cut(multi['Age'],bins=bins, labels = labels,right=False)
print(multi.head())
print(multi.tail())
multi['Age Group'].value_counts()

                Name  Age  Gender  Year    Region  Disease Age Group
0      John Lawrence   71    Male  2021        UK  Cancer     Senior
1    Robert Matthews   34    Male  2021     China  Cancer        Mid
2  Chandran Malhotra   80    Male  2023  Pakistan  Cancer     Senior
3       Joshua Smart   40    Male  2015        UK  Cancer        Old
4       Denise Lopez   43  Female  2017    Brazil  Cancer        Old
                  Name  Age  Gender  Year     Region Disease Age Group
55387  Jennifer Garcia   43    Male  2023     Canada  Asthma       Old
55388     Michael Frye   18    Male  2016  Australia  Asthma     Young
55389       Mark Smith   54  Female  2021      China  Asthma       Old
55390   Michael Miller   46    Male  2021      China  Asthma       Old
55391     Teresa Russo   26    Male  2022      China  Asthma       Mid


Age Group
Senior    23249
Old       15735
Mid       15079
Young      1329
Name: count, dtype: int64
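
To make the bin edges concrete, the sketch below runs the same `bins`, `labels`, and `right=False` setting on a handful of boundary ages. The `ages` series is made up for illustration. With `right=False` each interval includes its left edge and excludes its right edge, which is why age 40 lands in `Old` in the output above.

```python
import pandas as pd

bins = [0, 21, 40, 60, 100]
labels = ['Young', 'Mid', 'Old', 'Senior']

# Boundary ages (illustrative) showing where each interval starts and ends
ages = pd.Series([0, 20, 21, 39, 40, 59, 60, 99, 100])

# Intervals are [0, 21), [21, 40), [40, 60), [60, 100)
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(pd.DataFrame({'Age': ages, 'Age Group': groups}))
```

An age of exactly 100 falls outside the last interval and becomes `NaN`, so a wider final edge would be needed if the data could contain such values.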

In [5]:
# check gender values

multi['Gender'].value_counts()

Gender
Male      27996
Female    27396
Name: count, dtype: int64
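
The counts above show only `Male` and `Female`, so no further change was needed here. If the column had contained short or lowercase forms, the standardization could be sketched as follows. The sample values and the `gender_map` dictionary are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw values; the loaded file already contains only Male/Female
gender = pd.Series(['M', 'f', 'male', ' Female ', 'F'])

# Assumed mapping from normalized short and long forms to one label each
gender_map = {'m': 'Male', 'male': 'Male', 'f': 'Female', 'female': 'Female'}

# Normalize whitespace and case first, then map to the final labels
standardized = gender.str.strip().str.lower().map(gender_map)
print(standardized.tolist())
```

Any value missing from `gender_map` becomes `NaN` after `map`, which makes unexpected entries easy to spot with `isnull().sum()`.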

In [6]:
# Standardize disease and region names

# check whether the disease and region names are already standardized

print(multi['Disease'].value_counts())
print(multi['Region'].value_counts())

Disease
Cancer           50000
Heart Disease     3000
Asthma            2392
Name: count, dtype: int64
Region
Australia    5668
UK           5627
USA          5613
India        5565
Russia       5555
Brazil       5553
Germany      5540
Pakistan     5461
China        5428
Canada       5382
Name: count, dtype: int64
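
Each name appears only once in the counts above, so the categories look consistent. The printed rows earlier show what may be a trailing space after `Cancer`, so a whitespace and case cleanup is still worth sketching. The sample frame and the `acronyms` dictionary below are assumptions, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical sample with stray whitespace and mixed case
df = pd.DataFrame({
    'Region': [' uk', 'China ', 'pakistan'],
    'Disease': ['Cancer ', 'heart disease', ' ASTHMA'],
})

# Assumed list of names that should stay upper-case after title-casing
acronyms = {'Uk': 'UK', 'Usa': 'USA'}

# Trim whitespace and title-case, then restore the acronyms
df['Disease'] = df['Disease'].str.strip().str.title()
df['Region'] = df['Region'].str.strip().str.title().replace(acronyms)
print(df)
```

Running `value_counts()` again after a cleanup like this would confirm that no near-duplicate labels remain.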


In [7]:
# check the year values (the column is expected to hold integers)

multi['Year'].value_counts()

Year
2016    5722
2019    5663
2020    5645
2015    5626
2017    5606
2021    5589
2018    5566
2023    5542
2022    5439
2024    4994
Name: count, dtype: int64
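
The cell above only inspects the values. If `Year` had been read in as text, an explicit conversion could look like the sketch below. The `year` series is hypothetical.

```python
import pandas as pd

# Hypothetical Year column stored as text
year = pd.Series(['2021', '2015', '2023'])

# to_numeric parses the strings; astype(int) enforces an integer dtype
year_int = pd.to_numeric(year).astype(int)
print(year_int.dtype)
print(year_int.tolist())
```

`pd.to_numeric` raises an error on values it cannot parse by default, so bad entries surface immediately instead of being silently kept as text.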

In [8]:
# remove unnecessary column
multi = multi.drop(columns = ['Name'])
print(multi.head())

   Age  Gender  Year    Region  Disease Age Group
0   71    Male  2021        UK  Cancer     Senior
1   34    Male  2021     China  Cancer        Mid
2   80    Male  2023  Pakistan  Cancer     Senior
3   40    Male  2015        UK  Cancer        Old
4   43  Female  2017    Brazil  Cancer        Old


In [9]:
# check any missing value in the dataset

multi.isnull().sum()

Age          0
Gender       0
Year         0
Region       0
Disease      0
Age Group    0
dtype: int64

In [10]:
# save transformed data 

multi.to_csv('multi_disease_transformed.csv', index=False)
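
As a quick sanity check, the saved file can be read back and compared with the frame in memory. The sketch below uses a small stand-in frame and a temporary directory so it does not touch the real output file; both are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the transformed frame
sample = pd.DataFrame({
    'Age': [71, 34],
    'Gender': ['Male', 'Male'],
    'Year': [2021, 2021],
    'Region': ['UK', 'China'],
    'Disease': ['Cancer', 'Cancer'],
})
sample['Age Group'] = pd.cut(
    sample['Age'], bins=[0, 21, 40, 60, 100],
    labels=['Young', 'Mid', 'Old', 'Senior'], right=False,
)

# Write to a temporary location and read the file back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'multi_disease_transformed.csv')
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.dtypes)
```

CSV does not store the categorical dtype, so `Age Group` comes back as plain text and its ordering has to be restored with `pd.Categorical` if later plots depend on it.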