## 1-Data Overview

I started by examining each dataset to understand its structure and contents. To do so, I wrote a function that reads a CSV file, looks for columns whose names contain certain keywords, and prints an overview of the dataset's structure.


In [1]:
import pandas as pd

def info_data(file):
    """Load a CSV file, print an overview, and return the DataFrame (None on failure)."""
    if not file.endswith('.csv'):
        print("File format not supported.")
        return None

    try:
        df = pd.read_csv(file)
    except Exception as e:
        # Return early so that `df` is never referenced when loading fails
        print("Error:", e)
        return None

    # Substring match on lower-cased names, so 'stage' also matches the keyword 'age'
    df_columns = df.columns.str.lower()
    keywords = ['age', 'sex', 'stage', 'grade', 'gender', 'type', 'name']
    matching_columns = [col for col in df_columns if any(keyword in col for keyword in keywords)]

    print("\n--- Overview for", file, "---")
    if matching_columns:
        print("Found the following matching columns:", matching_columns)
    else:
        print("No matching columns found.")
    print("\nAll columns in the dataset:")
    print(df.columns)

    print("\nFirst 5 rows of the dataset:")
    print(df.head())

    # Column data types and non-null counts
    print("\nDataset info:")
    df.info()

    # Summary statistics for numerical columns
    print("\nStatistical summary:")
    print(df.describe())

    return df



The `see_value` function is used to print the unique values within a specific column of a DataFrame. This step helps in assessing whether there are patterns or categories in the data that can be extracted or standardized.

In [2]:
def see_value(df, nom_colonne):
    """Print the unique values of the column `nom_colonne` (French for "column name")."""
    value = df[nom_colonne].unique()
    print(value)
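
Unique values alone do not show how often each value occurs. A minimal sketch, assuming a hypothetical helper `see_value_counts` that is not part of the original notebook, which also reports frequencies and missing values:

```python
import pandas as pd

def see_value_counts(df, column):
    """Return each distinct value in `column` with its frequency (missing values included)."""
    # dropna=False keeps missing values as their own row in the result
    return df[column].value_counts(dropna=False)

# Toy frame for illustration only
demo = pd.DataFrame(
    {"description": ["ovarian tumor: serous", "Normal ovary", "ovarian tumor: serous", None]}
)
counts = see_value_counts(demo, "description")
print(counts)
```

Rare values in such a listing are often spelling variants that a later standardization step has to absorb.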

## 2-Approach and Methodology

To extract the key information from the datasets, I combined string operations with regular expressions, which cope with inconsistent values and varying formats. Because the fields of interest (age, sex, stage, grade, and so on) were neither stored in the same column across datasets nor written in a uniform format, I wrote a dedicated function for each one: age, sex, stage, grade, cancer type, sample type, and tumour subtype.

Each function takes a DataFrame and the name of the source column, and writes its result to a fixed output column (for example `Age` or `Stage`), falling back to 'NA' when nothing matches. This keeps the output standardized regardless of how each dataset represents the data.


In [3]:
import re

def standardize_age(df, input_column_name):
    # Extract the first integer or decimal number found in the text; 'NA' if there is none
    def extract_age(text):
        match = re.search(r'(\d+(\.\d+)?)', str(text))
        return match.group(1) if match else 'NA'

    df['Age'] = df[input_column_name].apply(extract_age)
    return df


def standardize_sex(df, input_column_name):
    df['Sex'] = df[input_column_name].apply(
        lambda text: 'Female' if re.search(r'\bfemale\b', str(text), re.IGNORECASE) 
        else 'Male' if re.search(r'\bmale\b', str(text), re.IGNORECASE) 
        else 'NA'
    )
    return df


def standardize_cancer_type(df, input_column_name):
    df['Cancer Type'] = df[input_column_name].apply(
        lambda text: 'Ovarian Cancer' if re.search(r'ovarian', str(text), re.IGNORECASE)
        else 'Breast Cancer' if re.search(r'breast', str(text), re.IGNORECASE)
        else 'Endometrial cancer' if re.search(r'endometrial', str(text), re.IGNORECASE)
        else 'NA'
    )
    return df


def standardize_sample_type(df, input_column_name):
    # Note: the second pattern matches the British spelling 'tumour' only, not 'tumor'
    df['Sample Type'] = df[input_column_name].apply(
        lambda text: 'normal' if re.search(r'normal', str(text), re.IGNORECASE)
        else 'primary tumour' if re.search(r'(primary|tumour)', str(text), re.IGNORECASE)
        else 'NA'
    )
    return df

def standardize_tumour_subtype(df, input_column_name):
    subtypes = ['serous', 'endometrioid', 'mucinous', 'clear cell', 'ductal', 'lobular', 'apocrine']
    df['Tumour Subtype'] = df[input_column_name].apply(lambda text: next((subtype for subtype in subtypes if re.search(subtype, str(text), re.IGNORECASE)), 'NA'))
    return df


def standardize_stage(df, input_column_name):
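    # Longer numerals come first: with 'I|II|III|IV', the alternative 'I' would win on 'III' or 'IV'
    stage_pattern = r'(IV|III|II|I|[1-4])'
    df['Stage'] = df[input_column_name].apply(lambda text: re.search(stage_pattern, str(text)).group(1) if re.search(stage_pattern, str(text)) else 'NA')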
    return df

def standardize_stage_romain(df, input_column_name):
    df['Stage'] = df[input_column_name].apply(lambda text: re.search(r'(?:(I{1,4}|II?I?|III?|IV)|(\d{1,2})(?:[A-Z])?)|(high|low)', str(text), re.IGNORECASE).group(1) if re.search(r'(?:(I{1,4}|II?I?|III?|IV)|(\d{1,2})(?:[A-Z])?)|(high|low)', str(text), re.IGNORECASE) else 'NA')
    return df



def standardize_stage_mot(df, input_column_name):
    df['Stage'] = df[input_column_name].apply(
        lambda text: re.search(r'(high|low|benign|borderline|advanced|early|invasive|microinvasion)', str(text), re.IGNORECASE).group(1) 
        if re.search(r'(high|low|benign|borderline|advanced|early|invasive|microinvasion)', str(text), re.IGNORECASE) 
        else 'NA'
    )
    return df


def standardize_grade(df, input_column_name):
    df['Grade'] = df[input_column_name].apply(lambda text: re.search(r'(high-grade|low-grade|1|2|3|4)', str(text), re.IGNORECASE).group(1) if re.search(r'(high-grade|low-grade|1|2|3|4)', str(text), re.IGNORECASE) else 'NA')
    return df
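
Alternation order matters when matching Roman numerals: Python's `re` module tries alternatives from left to right and stops at the first one that succeeds, so a pattern that lists `I` before `III` or `IV` returns `I` for all of them. The sketch below illustrates this with a hypothetical helper, `extract_roman_stage`, which is not part of the notebook; the lookahead allowing a substage letter A to C is an assumption about how stages are written.

```python
import re

# Naive order: the first alternative that matches wins, so 'I' shadows 'II', 'III' and 'IV'
NAIVE = re.compile(r'(I|II|III|IV)')
# Longest alternatives first; the lookahead accepts an optional substage letter (assumed A-C)
ORDERED = re.compile(r'\b(IV|III|II|I)(?=[A-C]?\b)')

def extract_roman_stage(text, pattern=ORDERED):
    """Return the first Roman-numeral stage found in `text`, or 'NA' (hypothetical helper)."""
    match = pattern.search(str(text))
    return match.group(1) if match else 'NA'

for sample in ['Stage III', 'stage: IV', 'Stage IIIC', 'Invasive']:
    print(sample, '->', extract_roman_stage(sample, NAIVE), 'vs', extract_roman_stage(sample))
```

The word boundary also prevents the capital `I` of a word such as "Invasive" from being read as stage I.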



## 3-Extraction of information from each dataset

First dataset: GSE6008_metadata

In [4]:
df1=info_data('GSE6008_metadata.csv')


--- Overview for GSE6008_metadata.csv ---
Found the following matching columns: ['characteristics_ch1.0.tumor_type', 'characteristics_ch1.1.stage', 'characteristics_ch1.12.cel_file_name', 'characteristics_ch1.2.grade', 'contact_name', 'source_name_ch1', 'type']

All columns in the dataset:
Index(['channel_count', 'characteristics_ch1.0.Tumor_Type',
       'characteristics_ch1.1.stage', 'characteristics_ch1.10.TP53_mutation',
       'characteristics_ch1.11.P53_immunohistochemistry',
       'characteristics_ch1.12.CEL_file_name', 'characteristics_ch1.2.grade',
       'characteristics_ch1.3.B-canenin_nuclear_accumulation',
       'characteristics_ch1.4.CTNNB1_mutation',
       'characteristics_ch1.5.APC_mutation',
       'characteristics_ch1.6.PTEN_mutation',
       'characteristics_ch1.7.PTEN_immunohistochemistry',
       'characteristics_ch1.8.KRAS_mutation',
       'characteristics_ch1.9.PIK3CA_mutation', 'contact_address',
       'contact_city', 'contact_country', 'contact_email', 'c

In [5]:
df1.head(2)

Unnamed: 0,channel_count,characteristics_ch1.0.Tumor_Type,characteristics_ch1.1.stage,characteristics_ch1.10.TP53_mutation,characteristics_ch1.11.P53_immunohistochemistry,characteristics_ch1.12.CEL_file_name,characteristics_ch1.2.grade,characteristics_ch1.3.B-canenin_nuclear_accumulation,characteristics_ch1.4.CTNNB1_mutation,characteristics_ch1.5.APC_mutation,...,relation,scan_protocol,series_id,source_name_ch1,status,submission_date,supplementary_file,taxid_ch1,title,type
0,1,Clear_Cell,3C,,,CHTN-OC-004.CEL,3,,,,...,,standard Affymetrix procedures,GSE6008,Ovarian_Tumor,Public on Apr 09 2007,Oct 10 2006,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM139n...,9606,Ovarian_Tumor_ClearCell_CHTN-OC-004,RNA
1,1,Clear_Cell,4,,,CHTN-OC-012.CEL,3,,,,...,,standard Affymetrix procedures,GSE6008,Ovarian_Tumor,Public on Apr 09 2007,Oct 10 2006,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM139n...,9606,Ovarian_Tumor_ClearCell_CHTN-OC-012,RNA


In [6]:
see_value(df1,'description')

['ovarian tumor: clear cell' 'ovarian tumor: endometrioid'
 'ovarian tumor: mucinous' 'ovarian tumor: serous' 'Normal ovary']


Upon examining the dataset, we can identify several key pieces of information to extract. The "Sample Type" and the "Tumour Subtype" are derived from the "description" column, the "Cancer Type" from "source_name_ch1", the "Stage" from "characteristics_ch1.1.stage", and the "Grade" from "characteristics_ch1.2.grade". The "characteristics_ch1.0.Tumor_Type" column also holds the subtype, but written with underscores (for example `Clear_Cell`), which the 'clear cell' pattern would not match.


In [7]:
df1=standardize_cancer_type(df1,'source_name_ch1')
df1=standardize_tumour_subtype(df1,'description')
df1=standardize_stage(df1,'characteristics_ch1.1.stage')
df1=standardize_sample_type(df1,'description')
df1=standardize_grade(df1,'characteristics_ch1.2.grade')

In [8]:
df1=df1[['Cancer Type','Tumour Subtype','Stage','Sample Type','Grade']]

In [9]:
df1

Unnamed: 0,Cancer Type,Tumour Subtype,Stage,Sample Type,Grade
0,Ovarian Cancer,clear cell,3,,3
1,Ovarian Cancer,clear cell,4,,3
2,Ovarian Cancer,clear cell,1,,3
3,Ovarian Cancer,clear cell,1,,
4,Ovarian Cancer,clear cell,2,,
...,...,...,...,...,...
98,Ovarian Cancer,serous,3,,3
99,,,,normal,
100,,,,normal,
101,,,,normal,


In [10]:
see_value(df1,'Cancer Type')

['Ovarian Cancer' 'NA']
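
In the table above, the tumour rows have no Sample Type. The likely cause is spelling: the `description` values use "tumor", whereas the pattern in `standardize_sample_type` looks for "tumour". A minimal sketch of a variant that accepts both spellings, using a hypothetical helper `classify_sample_type` that is not part of the notebook:

```python
import re

def classify_sample_type(text):
    """Label free text as 'normal', 'primary tumour' or 'NA' (hypothetical helper)."""
    text = str(text)
    if re.search(r'normal', text, re.IGNORECASE):
        return 'normal'
    # 'tumou?r' matches both the British 'tumour' and the American 'tumor'
    if re.search(r'primary|tumou?r', text, re.IGNORECASE):
        return 'primary tumour'
    return 'NA'

labels = [
    classify_sample_type(value)
    for value in ['ovarian tumor: serous', 'Normal ovary', 'Primary breast tumor_BMI <25']
]
print(labels)
```

Like the original function, this labels any tumour as 'primary tumour'; it does not try to tell primary tumours from metastases.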


Second dataset: GSE6822_metadata

In [11]:
df2=info_data('GSE6822_metadata.csv')


--- Overview for GSE6822_metadata.csv ---
Found the following matching columns: ['contact_name', 'source_name_ch1', 'type']

All columns in the dataset:
Index(['channel_count', 'characteristics_ch1.0.Tissue', 'contact_address',
       'contact_city', 'contact_country', 'contact_department',
       'contact_email', 'contact_fax', 'contact_institute', 'contact_name',
       'contact_phone', 'contact_state', 'contact_zip/postal_code',
       'data_processing', 'data_row_count', 'description',
       'extract_protocol_ch1', 'geo_accession', 'hyb_protocol', 'label_ch1',
       'label_protocol_ch1', 'last_update_date', 'molecule_ch1',
       'organism_ch1', 'platform_id', 'scan_protocol', 'series_id',
       'source_name_ch1', 'status', 'submission_date', 'supplementary_file',
       'taxid_ch1', 'title', 'type'],
      dtype='object')

First 5 rows of the dataset:
   channel_count                       characteristics_ch1.0.Tissue  \
0              1  epithelial ovarian tumor, Tumor type: 

In [12]:
see_value(df2,'source_name_ch1')
see_value(df2,'description')
see_value(df2,'characteristics_ch1.0.Tissue')
see_value(df2,'description')

['epithelian ovarian tumor']
['low malignant potential' 'invasive']
['epithelial ovarian tumor, Tumor type: serous, Tumor stage: benign'
 'epithelial ovarian tumor, Tumor type: serous, Tumor stage: borderline'
 'epithelial ovarian tumor, Tumor type: serous, Tumor stage: invasive grade 3'
 'epithelial ovarian tumor, Tumor type: serous, Tumor stage: invasive grade 3, potentially from extra-ovarian origin'
 'epithelial ovarian tumor, Tumor type: serous, Tumor stage: invasive grade 2'
 'epithelial ovarian tumor, Tumor type: serous, Tumor stage: invasive grade 3, post-chemotherapy sample'
 'epithelial ovarian tumor, Tumor type: undetermined, Tumor stage: invasive, small foci of grade 3 tumor'
 'epithelial ovarian tumor, Tumor type: endometrioid, Tumor stage: invasive grade 3'
 'epithelial ovarian tumor, Tumor type: clear cell, Tumor stage: invasive grade 2'
 'epithelial ovarian tumor, Tumor type: clear cell, Tumor stage: invasive grade 3'
 'epithelial ovarian tumor, Tumor type: mixed, Tumor

These values show that 'characteristics_ch1.0.Tissue' is the most informative column: it contains the cancer type, the tumour subtype, the stage, and the grade.
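
Because the tissue strings printed above follow a regular "Tumor type: ..., Tumor stage: ..." layout, the fields could also be pulled out in one pass with named groups. This is only a sketch under that assumption; `TISSUE_PATTERN` and `parse_tissue` are hypothetical names, not part of the notebook.

```python
import re

# Layout assumed from the printed values: "..., Tumor type: X, Tumor stage: Y[ grade N]..."
TISSUE_PATTERN = re.compile(
    r'Tumor type:\s*(?P<subtype>[^,]+),\s*'
    r'Tumor stage:\s*(?P<stage>[a-z]+)(?:\s+grade\s+(?P<grade>\d))?',
    re.IGNORECASE,
)

def parse_tissue(text):
    """Split one tissue string into subtype, stage and grade; missing parts become 'NA'."""
    match = TISSUE_PATTERN.search(str(text))
    if not match:
        return {'subtype': 'NA', 'stage': 'NA', 'grade': 'NA'}
    return {key: (value.strip() if value else 'NA') for key, value in match.groupdict().items()}

example = 'epithelial ovarian tumor, Tumor type: clear cell, Tumor stage: invasive grade 2'
print(parse_tissue(example))
```

A grade that appears only in trailing free text (as in "invasive, small foci of grade 3 tumor") is not captured by this pattern, so the looser per-field functions remain useful as a fallback.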

In [13]:
df2=standardize_cancer_type(df2,'characteristics_ch1.0.Tissue')
df2=standardize_grade(df2,'characteristics_ch1.0.Tissue')
df2=standardize_stage_mot(df2,'characteristics_ch1.0.Tissue')
df2=standardize_tumour_subtype(df2,'characteristics_ch1.0.Tissue')


In [14]:
df2=df2[['Cancer Type','Grade','Stage','Tumour Subtype']]
#df2_for_model=df2[['Cancer Type','Grade','Stage','Tumour Subtype','description']]

In [15]:
df2

Unnamed: 0,Cancer Type,Grade,Stage,Tumour Subtype
0,Ovarian Cancer,,benign,serous
1,Ovarian Cancer,,borderline,serous
2,Ovarian Cancer,,borderline,serous
3,Ovarian Cancer,,borderline,serous
4,Ovarian Cancer,3,invasive,serous
...,...,...,...,...
69,Ovarian Cancer,3,invasive,serous
70,Ovarian Cancer,3,invasive,endometrioid
71,Ovarian Cancer,3,invasive,clear cell
72,Ovarian Cancer,2,invasive,serous


Third dataset: GSE12418_metadata

In [16]:
df3=info_data('GSE12418_metadata.csv')


--- Overview for GSE12418_metadata.csv ---
Found the following matching columns: ['type', 'source_name_ch1', 'source_name_ch2', 'contact_name']

All columns in the dataset:
Index(['title', 'geo_accession', 'status', 'submission_date',
       'last_update_date', 'type', 'channel_count', 'source_name_ch1',
       'organism_ch1', 'characteristics_ch1', 'characteristics_ch1.1',
       'characteristics_ch1.2', 'characteristics_ch1.3',
       'characteristics_ch1.4', 'characteristics_ch1.5', 'molecule_ch1',
       'extract_protocol_ch1', 'label_ch1', 'label_protocol_ch1', 'taxid_ch1',
       'source_name_ch2', 'organism_ch2', 'characteristics_ch2',
       'molecule_ch2', 'extract_protocol_ch2', 'label_ch2',
       'label_protocol_ch2', 'taxid_ch2', 'hyb_protocol', 'scan_protocol',
       'scan_protocol.1', 'description', 'data_processing', 'platform_id',
       'contact_name', 'contact_email', 'contact_phone', 'contact_fax',
       'contact_department', 'contact_institute', 'contact_address

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 46 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   title                    54 non-null     object
 1   geo_accession            54 non-null     object
 2   status                   54 non-null     object
 3   submission_date          54 non-null     object
 4   last_update_date         54 non-null     object
 5   type                     54 non-null     object
 6   channel_count            54 non-null     int64 
 7   source_name_ch1          54 non-null     object
 8   organism_ch1             54 non-null     object
 9   characteristics_ch1      54 non-null     object
 10  characteristics_ch1.1    54 non-null     object
 11  characteristics_ch1.2    54 non-null     object
 12  characteristics_ch1.3    54 non-null     object
 13  characteristics_ch1.4    54 non-null     object
 14  characteristics_ch1.5    54 non-null     obj

In [17]:
see_value(df3,'source_name_ch1') #tumour subtype

['serous ovarian adenocarcinoma']


In [18]:
df3=standardize_tumour_subtype(df3,'source_name_ch1')
df3=standardize_sample_type(df3,'source_name_ch1')
df3=standardize_cancer_type(df3,'source_name_ch1')
df3=standardize_stage_romain(df3,'characteristics_ch1.2')
df3=standardize_age(df3,'characteristics_ch1.4')

In [19]:
df3=df3[['Cancer Type','Age','Stage','Tumour Subtype','Sample Type']]

In [20]:
df3.head()

Unnamed: 0,Cancer Type,Age,Stage,Tumour Subtype,Sample Type
0,Ovarian Cancer,43,III,serous,
1,Ovarian Cancer,43,III,serous,
2,Ovarian Cancer,58,III,serous,
3,Ovarian Cancer,72,III,serous,
4,Ovarian Cancer,60,III,serous,


Fourth dataset: GSE12470_metadata

In [21]:
df4=info_data('GSE12470_metadata.csv')


--- Overview for GSE12470_metadata.csv ---
Found the following matching columns: ['characteristics_ch1.0.gender', 'characteristics_ch1.2.stage', 'contact_name', 'source_name_ch1', 'type']

All columns in the dataset:
Index(['channel_count', 'characteristics_ch1.0.Gender',
       'characteristics_ch1.1.Tissue', 'characteristics_ch1.2.Stage',
       'contact_address', 'contact_city', 'contact_country',
       'contact_department', 'contact_email', 'contact_institute',
       'contact_name', 'contact_zip/postal_code', 'data_processing',
       'data_row_count', 'description', 'extract_protocol_ch1',
       'geo_accession', 'hyb_protocol', 'label_ch1', 'label_protocol_ch1',
       'last_update_date', 'molecule_ch1', 'organism_ch1', 'platform_id',
       'scan_protocol', 'series_id', 'source_name_ch1', 'status',
       'submission_date', 'supplementary_file', 'taxid_ch1', 'title', 'type'],
      dtype='object')

First 5 rows of the dataset:
   channel_count characteristics_ch1.0.Gender cha

In [22]:
see_value(df4,'source_name_ch1') #Cancer type and sample type
see_value(df4,'characteristics_ch1.1.Tissue') # Tumour Subtype
see_value(df4,'characteristics_ch1.2.Stage') #Stage 

['normal peritoneum' 'ovarian cancer' 'ovarian cacner']
['normal peritoneum' 'ovarian cancer' 'serous ovarian cancer']
[nan 'advanced stage' 'early stage']


In [23]:
df4=standardize_tumour_subtype(df4,'characteristics_ch1.1.Tissue')
df4=standardize_sample_type(df4,'source_name_ch1')
df4=standardize_cancer_type(df4,'source_name_ch1')
df4=standardize_stage_mot(df4,'characteristics_ch1.2.Stage')
df4=standardize_sex(df4,'characteristics_ch1.0.Gender')

In [24]:
df4=df4[['Stage','Cancer Type','Sex','Tumour Subtype','Sample Type']]


In [25]:
see_value(df4,'Stage')

['NA' 'advanced' 'early']


In [26]:
df4

Unnamed: 0,Stage,Cancer Type,Sex,Tumour Subtype,Sample Type
0,,,Female,,normal
1,,,Female,,normal
2,,,Female,,normal
3,,,Female,,normal
4,,,Female,,normal
5,,,Female,,normal
6,,,Female,,normal
7,,,Female,,normal
8,,,Female,,normal
9,,,Female,,normal


Fifth dataset: GSE26712_metadata

In [27]:
df5=info_data('GSE26712_metadata.csv') 


--- Overview for GSE26712_metadata.csv ---
Found the following matching columns: ['unnamed: 0', 'contact_name', 'source_name_ch1', 'type']

All columns in the dataset:
Index(['Unnamed: 0', 'channel_count', 'characteristics_ch1.0.tissue',
       'characteristics_ch1.1.surgery outcome', 'characteristics_ch1.2.status',
       'characteristics_ch1.3.survival years', 'contact_address',
       'contact_city', 'contact_country', 'contact_department',
       'contact_email', 'contact_institute', 'contact_laboratory',
       'contact_name', 'contact_phone', 'contact_state',
       'contact_zip/postal_code', 'data_processing', 'data_row_count',
       'description', 'extract_protocol_ch1', 'geo_accession', 'hyb_protocol',
       'label_ch1', 'label_protocol_ch1', 'last_update_date', 'molecule_ch1',
       'organism_ch1', 'platform_id', 'scan_protocol', 'series_id',
       'source_name_ch1', 'status', 'submission_date', 'supplementary_file',
       'taxid_ch1', 'title', 'treatment_protocol_ch1',

In [28]:
df5=standardize_cancer_type(df5,'characteristics_ch1.0.tissue')
df5=standardize_stage_mot(df5,'characteristics_ch1.0.tissue') 
df5=standardize_grade(df5,'characteristics_ch1.0.tissue')
df5=standardize_sample_type(df5,'characteristics_ch1.0.tissue')  
df5=standardize_tumour_subtype(df5,'characteristics_ch1.0.tissue')  

In [29]:
df5

Unnamed: 0.1,Unnamed: 0,channel_count,characteristics_ch1.0.tissue,characteristics_ch1.1.surgery outcome,characteristics_ch1.2.status,characteristics_ch1.3.survival years,contact_address,contact_city,contact_country,contact_department,...,supplementary_file,taxid_ch1,title,treatment_protocol_ch1,type,Cancer Type,Stage,Grade,Sample Type,Tumour Subtype
0,GSM657519,1,Normal ovarian surface epithelium,,,,70 Blossom Street,Boston,USA,Medicine,...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM657n...,9606,Normal HOSE2237,Tumor specimens were obtained from 185 previou...,RNA,Ovarian Cancer,,,normal,
1,GSM657520,1,Normal ovarian surface epithelium,,,,70 Blossom Street,Boston,USA,Medicine,...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM657n...,9606,Normal HOSE2008,Tumor specimens were obtained from 185 previou...,RNA,Ovarian Cancer,,,normal,
2,GSM657521,1,Normal ovarian surface epithelium,,,,70 Blossom Street,Boston,USA,Medicine,...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM657n...,9606,Normal HOSE2061,Tumor specimens were obtained from 185 previou...,RNA,Ovarian Cancer,,,normal,
3,GSM657522,1,Normal ovarian surface epithelium,,,,70 Blossom Street,Boston,USA,Medicine,...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM657n...,9606,Normal HOSE2064,Tumor specimens were obtained from 185 previou...,RNA,Ovarian Cancer,,,normal,
4,GSM657523,1,Normal ovarian surface epithelium,,,,70 Blossom Street,Boston,USA,Medicine,...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM657n...,9606,Normal HOSE2085,Tumor specimens were obtained from 185 previou...,RNA,Ovarian Cancer,,,normal,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,GSM657709,1,Late-stage high-grade ovarian cancer,Optimal,NED (no evidence of disease),4.74,70 Blossom Street,Boston,USA,Medicine,...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM657n...,9606,Ovarian Cancer SO87,Tumor specimens were obtained from 185 previou...,RNA,Ovarian Cancer,high,high-grade,,
191,GSM657710,1,Late-stage high-grade ovarian cancer,Suboptimal,NED (no evidence of disease),4.76,70 Blossom Street,Boston,USA,Medicine,...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM657n...,9606,Ovarian Cancer SO88,Tumor specimens were obtained from 185 previou...,RNA,Ovarian Cancer,high,high-grade,,
192,GSM657711,1,Late-stage high-grade ovarian cancer,Suboptimal,AWD (alive with disease),4.66,70 Blossom Street,Boston,USA,Medicine,...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM657n...,9606,Ovarian Cancer SO89,Tumor specimens were obtained from 185 previou...,RNA,Ovarian Cancer,high,high-grade,,
193,GSM657712,1,Late-stage high-grade ovarian cancer,Optimal,NED (no evidence of disease),4.63,70 Blossom Street,Boston,USA,Medicine,...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM657n...,9606,Ovarian Cancer SO91,Tumor specimens were obtained from 185 previou...,RNA,Ovarian Cancer,high,high-grade,,


In [30]:
df5=df5[['Cancer Type','Stage','Grade','Sample Type','Tumour Subtype']]


In [31]:
df5.head()

Unnamed: 0,Cancer Type,Stage,Grade,Sample Type,Tumour Subtype
0,Ovarian Cancer,,,normal,
1,Ovarian Cancer,,,normal,
2,Ovarian Cancer,,,normal,
3,Ovarian Cancer,,,normal,
4,Ovarian Cancer,,,normal,


Sixth dataset: GSE42568_metadata

In [32]:
df6=info_data('GSE42568_metadata.csv')


--- Overview for GSE42568_metadata.csv ---
Found the following matching columns: ['unnamed: 0.1', 'unnamed: 0', 'type', 'source_name_ch1', 'characteristics_ch1.1.age', 'characteristics_ch1.4.grade', 'contact_name']

All columns in the dataset:
Index(['Unnamed: 0.1', 'Unnamed: 0', 'title', 'geo_accession', 'status',
       'submission_date', 'last_update_date', 'type', 'channel_count',
       'source_name_ch1', 'organism_ch1', 'taxid_ch1',
       'characteristics_ch1.0.tissue', 'characteristics_ch1.1.age',
       'characteristics_ch1.2.er_status', 'characteristics_ch1.3.size',
       'characteristics_ch1.4.grade',
       'characteristics_ch1.5.lymph node status',
       'characteristics_ch1.6.relapse free survival time_days',
       'characteristics_ch1.7.relapse free survival event',
       'characteristics_ch1.8.overall survival time_days',
       'characteristics_ch1.9.overall survival event',
       'treatment_protocol_ch1', 'molecule_ch1', 'extract_protocol_ch1',
       'label_ch1

In [33]:
df6=standardize_age(df6,'characteristics_ch1.1.age')
df6=standardize_grade(df6,'characteristics_ch1.4.grade')
df6=standardize_cancer_type(df6,'characteristics_ch1.0.tissue')
df6=standardize_sample_type(df6,'characteristics_ch1.0.tissue')
df6=standardize_tumour_subtype(df6,'characteristics_ch1.0.tissue')

In [34]:
df6=df6[['Age','Grade','Cancer Type','Sample Type','Tumour Subtype']]
df6.head()

Unnamed: 0,Age,Grade,Cancer Type,Sample Type,Tumour Subtype
0,,,Breast Cancer,normal,
1,,,Breast Cancer,normal,
2,,,Breast Cancer,normal,
3,,,Breast Cancer,normal,
4,,,Breast Cancer,normal,


Seventh dataset: GSE44408_metadata

In [35]:
df7=info_data('GSE44408_metadata.csv')


--- Overview for GSE44408_metadata.csv ---
Found the following matching columns: ['unnamed: 0.1', 'unnamed: 0', 'type', 'source_name_ch1', 'contact_name']

All columns in the dataset:
Index(['Unnamed: 0.1', 'Unnamed: 0', 'title', 'geo_accession', 'status',
       'submission_date', 'last_update_date', 'type', 'channel_count',
       'source_name_ch1', 'organism_ch1', 'taxid_ch1',
       'characteristics_ch1.0.tissue', 'molecule_ch1', 'extract_protocol_ch1',
       'label_ch1', 'label_protocol_ch1', 'hyb_protocol', 'scan_protocol',
       'description', 'data_processing', 'platform_id', 'contact_name',
       'contact_email', 'contact_department', 'contact_institute',
       'contact_address', 'contact_city', 'contact_zip/postal_code',
       'contact_country', 'supplementary_file', 'series_id', 'data_row_count'],
      dtype='object')

First 5 rows of the dataset:
   Unnamed: 0.1  Unnamed: 0  \
0             0  GSM1084655   
1             1  GSM1084656   
2             2  GSM1084657  

In [36]:
df7=standardize_sample_type(df7,'title')
df7=standardize_cancer_type(df7,'title')
df7=standardize_tumour_subtype(df7,'title')

In [37]:
df7=df7[['Cancer Type', 'Sample Type','Tumour Subtype']]

In [38]:
df7

Unnamed: 0,Cancer Type,Sample Type,Tumour Subtype
0,,,
1,,,
2,,,
3,Breast Cancer,,ductal
4,Breast Cancer,,ductal
5,Breast Cancer,,ductal
6,Breast Cancer,,ductal
7,Breast Cancer,,ductal
8,Breast Cancer,,ductal
9,Breast Cancer,,ductal


Eighth dataset: GSE78958_metadata

In [39]:
df8=info_data('GSE78958_metadata.csv')


--- Overview for GSE78958_metadata.csv ---
Found the following matching columns: ['unnamed: 0.1', 'unnamed: 0', 'type', 'source_name_ch1', 'characteristics_ch1.2.tumor grade', 'characteristics_ch1.3.tumor subtype (via breastprs)', 'characteristics_ch1.4.tumor stage', 'contact_name']

All columns in the dataset:
Index(['Unnamed: 0.1', 'Unnamed: 0', 'title', 'geo_accession', 'status',
       'submission_date', 'last_update_date', 'type', 'channel_count',
       'source_name_ch1', 'organism_ch1', 'taxid_ch1',
       'characteristics_ch1.0.patient ethnicity', 'characteristics_ch1.1.bmi',
       'characteristics_ch1.2.tumor grade',
       'characteristics_ch1.3.tumor subtype (via breastprs)',
       'characteristics_ch1.4.tumor stage', 'molecule_ch1',
       'extract_protocol_ch1', 'label_ch1', 'label_protocol_ch1',
       'hyb_protocol', 'scan_protocol', 'description', 'data_processing',
       'platform_id', 'contact_name', 'contact_email', 'contact_department',
       'contact_institute

In [40]:
df8.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,title,geo_accession,status,submission_date,last_update_date,type,channel_count,source_name_ch1,...,contact_department,contact_institute,contact_address,contact_city,contact_state,contact_zip/postal_code,contact_country,supplementary_file,series_id,data_row_count
0,357,GSM2082442,primary breast tumor_366,GSM2082442,Public on Mar 08 2016,Mar 07 2016,Mar 08 2016,RNA,1,Primary breast tumor_BMI <25,...,Clinical Breast Care Project,Windber Research Institute,620 Seventh Street,Windber,PA,15963,USA,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2082...,GSE78958,22277
1,2,GSM2082087,primary breast tumor_3,GSM2082087,Public on Mar 08 2016,Mar 07 2016,Mar 08 2016,RNA,1,Primary breast tumor_BMI Unk,...,Clinical Breast Care Project,Windber Research Institute,620 Seventh Street,Windber,PA,15963,USA,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2082...,GSE78958,22277
2,3,GSM2082088,primary breast tumor_4,GSM2082088,Public on Mar 08 2016,Mar 07 2016,Mar 08 2016,RNA,1,Primary breast tumor_BMI 25-29.99,...,Clinical Breast Care Project,Windber Research Institute,620 Seventh Street,Windber,PA,15963,USA,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2082...,GSE78958,22277
3,4,GSM2082089,primary breast tumor_5,GSM2082089,Public on Mar 08 2016,Mar 07 2016,Mar 08 2016,RNA,1,Primary breast tumor_BMI Unk,...,Clinical Breast Care Project,Windber Research Institute,620 Seventh Street,Windber,PA,15963,USA,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2082...,GSE78958,22277
4,7,GSM2082092,primary breast tumor_8,GSM2082092,Public on Mar 08 2016,Mar 07 2016,Mar 08 2016,RNA,1,Primary breast tumor_BMI <25,...,Clinical Breast Care Project,Windber Research Institute,620 Seventh Street,Windber,PA,15963,USA,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2082...,GSE78958,22277


In [41]:
df8=standardize_sample_type(df8,'title')
df8=standardize_cancer_type(df8,'title')
df8=standardize_grade(df8,'characteristics_ch1.2.tumor grade')
df8=standardize_stage_romain(df8,'characteristics_ch1.4.tumor stage')

In [42]:
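# .copy() so that the column added in the next cells is not written to a slice of the original frame
df8=df8[['Sample Type', 'Cancer Type', 'Stage', 'characteristics_ch1.3.tumor subtype (via breastprs)']].copy()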

In [43]:
see_value(df8,'characteristics_ch1.3.tumor subtype (via breastprs)')

['Luminal B' 'Luminal A' 'HER2 enriched' 'Basal Like' 'Normal-like']


In [44]:
subtype_mapping = {
    'Luminal B': 'ductal',
    'Luminal A': 'lobular',
    'HER2 enriched': 'ductal',
    'Basal Like': 'apocrine',
    'Normal-like': 'clear cell'
}

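# Map the molecular subtype labels onto the histological labels used for the other datasets
# (a heuristic mapping, not a biological equivalence) and store the result in 'Tumour Subtype'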
df8['Tumour Subtype'] = df8['characteristics_ch1.3.tumor subtype (via breastprs)'].map(subtype_mapping)
df8=df8.drop(columns=['characteristics_ch1.3.tumor subtype (via breastprs)'])

In [45]:
df8.head()

Unnamed: 0,Sample Type,Cancer Type,Stage,Tumour Subtype
0,primary tumour,Breast Cancer,I,ductal
1,primary tumour,Breast Cancer,I,lobular
2,primary tumour,Breast Cancer,I,lobular
3,primary tumour,Breast Cancer,I,lobular
4,primary tumour,Breast Cancer,I,lobular


### 3-Merge

After reviewing all the datasets, we will merge them into a single DataFrame.

In [46]:
merged_df = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8], ignore_index=True, sort=False)
#merged_df = pd.concat([df1, df2_for_model, df3, df4, df5, df6, df7, df8], ignore_index=True, sort=False)
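
Once the frames are stacked, nothing records which dataset a row came from. A minimal sketch, with toy frames and a hypothetical `Dataset` column that the notebook does not create, of tagging each frame before concatenation:

```python
import pandas as pd

# Two toy frames standing in for the per-dataset results (illustrative values only)
frames = {
    'GSE6008': pd.DataFrame({'Cancer Type': ['Ovarian Cancer'], 'Stage': ['3']}),
    'GSE42568': pd.DataFrame({'Cancer Type': ['Breast Cancer'], 'Age': ['43']}),
}

# Tag each frame with its source, then stack; columns missing from a frame are filled with NaN
tagged = [df.assign(Dataset=name) for name, df in frames.items()]
combined = pd.concat(tagged, ignore_index=True, sort=False)
print(combined)
```

The NaN filling also explains why a merged column can contain both NaN (the column was absent from a dataset) and the string 'NA' (the column existed but nothing matched).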

In [47]:
merged_df['Patient ID'] = [f"{i:03}" for i in range(1, len(merged_df) + 1)]
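
With a fixed width of three digits, IDs beyond 999 become longer than the rest ('001' versus '1064'). If a uniform width is preferred, it can be derived from the row count. The helper below, `make_patient_ids`, is a hypothetical sketch rather than part of the notebook:

```python
def make_patient_ids(n_rows):
    """Return zero-padded IDs that all share one width (hypothetical helper)."""
    # At least 3 digits, more when the table has 1000 rows or more
    width = max(3, len(str(n_rows)))
    return [f"{i:0{width}d}" for i in range(1, n_rows + 1)]

ids = make_patient_ids(1067)
print(ids[0], ids[-1])
```

Equal-width IDs sort the same way as strings and as numbers, which avoids surprises when the table is later sorted on this column.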


In [48]:
# Reorder the columns
desired_order = ['Patient ID', 'Age', 'Sex', 'Stage', 'Grade', 'Cancer Type', 'Sample Type', 'Tumour Subtype']
#desired_order = ['Patient ID', 'Age', 'Sex', 'Stage', 'Grade', 'Cancer Type', 'Sample Type', 'Tumour Subtype','description']
merged_df = merged_df[desired_order]


In [49]:
merged_df

Unnamed: 0,Patient ID,Age,Sex,Stage,Grade,Cancer Type,Sample Type,Tumour Subtype
0,001,,,3,3,Ovarian Cancer,,clear cell
1,002,,,4,3,Ovarian Cancer,,clear cell
2,003,,,1,3,Ovarian Cancer,,clear cell
3,004,,,1,,Ovarian Cancer,,clear cell
4,005,,,2,,Ovarian Cancer,,clear cell
...,...,...,...,...,...,...,...,...
1063,1064,,,I,,Breast Cancer,primary tumour,apocrine
1064,1065,,,I,,Breast Cancer,primary tumour,ductal
1065,1066,,,I,,Breast Cancer,primary tumour,ductal
1066,1067,,,I,,Breast Cancer,primary tumour,lobular


## 4-Standardization

### Methodology for Column Standardization

**Objective:** Dedicated standardization functions transform the raw extracted values into a consistent format suitable for data analysis and machine learning. Each function applies transformations tailored to one column and maps missing or inconsistent values to 'NA'.

### Key Points of the Method

### Use of Functions
- **Modularity**: Each function is independent, allowing easy use or modification without impacting others.
- **Reusability**: Functions can be applied to multiple datasets, ensuring consistent data processing.
- **Reliability**: Functions systematically standardize data and handle errors, such as unknown or missing values.

### Benefits
By using dedicated functions for data standardization, we ensure data consistency, improve data quality, and make datasets ready for analysis and modeling.

In [50]:
see_value(merged_df,'Age')
see_value(merged_df,'Sex')
print("Stage : ")
see_value(merged_df,'Stage')
print("Grade : ")
see_value(merged_df,'Grade')
see_value(merged_df,'Cancer Type')
see_value(merged_df,'Sample Type')
see_value(merged_df,'Tumour Subtype')

[nan '43' '58' '72' '60' '50' '62' '49' '77' '52' '36' '44' '84' '70' '65'
 '54' '35' '40' '41' '56' '76' '73' '59' '67' '69' '64' '74' '75' '63'
 '55' '47' '68' '51' '82' 'NA' '74.03' '67.26' '53.35' '43.72' '51.47'
 '42.49' '52.01' '43.4' '45.86' '78.6' '56.89' '54.36' '49.03' '51.46'
 '63.88' '43.89' '76.49' '68.57' '82.14' '73.05' '77.3' '65.07' '47.59'
 '57.59' '38.6' '49.59' '80.3' '58.52' '46.08' '50.62' '42.95' '73.38'
 '53.75' '72.74' '54.6' '53.28' '54.69' '48.01' '60.12' '52.82' '62.81'
 '43.48' '72.49' '54.34' '73.68' '78.8' '74.34' '74.02' '75.15' '67.83'
 '50.21' '69.58' '31.06' '71.06' '60.72' '65.08' '67.24' '56.88' '69.42'
 '65.68' '61.42' '46.97' '52.26' '56.1' '59.56' '50.94' '75.81' '59.0'
 '48.27' '38.9' '46.59' '48.84' '47.39' '67.7' '46.31' '71.64' '66.66'
 '68.29' '67.75' '47.13' '44.52' '55.89' '55.57' '51.23' '54.31' '68.76'
 '56.04' '60.55' '89.93' '73.79' '61.41' '57.23' '75.75' '41.17' '53.07'
 '55.95' '60.39' '46.09' '47.82' '48.43' '70.44' '62.51']
[nan '

In [51]:
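def standardize_age(df, column):
    # Convert to numbers; unparseable entries (including the string 'NA') become NaN
    df[column] = pd.to_numeric(df[column], errors='coerce')
    # Truncate fractional ages to whole years and mark missing values as 'NA'
    df[column] = df[column].apply(lambda x: int(x) if pd.notna(x) else 'NA')
    return df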

merged_df = standardize_age(merged_df, 'Age')
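
Storing the string `'NA'` next to integers leaves the column with `object` dtype. If a numeric column is preferred for later analysis, one possible alternative (a sketch, not what this notebook uses) is pandas' nullable `Int64` dtype, which keeps missing ages as `<NA>`. The name `standardize_age_nullable` and the sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def standardize_age_nullable(df, column):
    """Hypothetical variant of standardize_age that keeps a numeric dtype."""
    # Unparseable entries (such as the string 'NA') become NaN
    ages = pd.to_numeric(df[column], errors='coerce')
    # Drop the fractional part, as int() does, then cast to nullable integers
    df[column] = np.trunc(ages).astype('Int64')
    return df

# Toy values, invented for illustration only
demo = pd.DataFrame({'Age': ['43', '74.03', 'NA', None]})
demo = standardize_age_nullable(demo, 'Age')
print(demo['Age'])
```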




In [52]:
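def standardize_sex(df, column):
    # Trim whitespace and normalize case, e.g. ' female' -> 'Female'
    df[column] = df[column].str.strip().str.capitalize()
    # Keep only the two expected labels; everything else (including missing values) becomes 'NA'
    df[column] = df[column].where(df[column].isin(['Male', 'Female']), 'NA')
    return df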

merged_df = standardize_sex(merged_df, 'Sex')


In [53]:
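def standardize_cancer_type(df, column):
    # Labels as they appear after str.capitalize() (only the first letter is upper case)
    valid_types = ['Ovarian cancer', 'Breast cancer']
    # Normalize first, then validate the normalized values; anything else becomes 'NA'
    normalized = df[column].str.strip().str.capitalize()
    df[column] = normalized.where(normalized.isin(valid_types), 'NA')
    return df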

merged_df = standardize_cancer_type(merged_df, 'Cancer Type')



In [54]:
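def standardize_sample_type(df, column):
    valid_types = ['normal', 'primary tumour']
    # Normalize first, then validate the normalized values; anything else becomes 'NA'
    normalized = df[column].str.strip().str.lower()
    df[column] = normalized.where(normalized.isin(valid_types), 'NA')
    return df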

merged_df = standardize_sample_type(merged_df, 'Sample Type')


In [55]:
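def standardize_tumour_subtype(df, column):
    valid_subtypes = ['clear cell', 'endometrioid', 'mucinous', 'serous', 'ductal', 'lobular', 'apocrine']
    # Normalize first, then validate the normalized values; anything else becomes 'NA'
    normalized = df[column].str.strip().str.lower()
    df[column] = normalized.where(normalized.isin(valid_subtypes), 'NA')
    return df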

merged_df = standardize_tumour_subtype(merged_df, 'Tumour Subtype')


In [56]:
def standardize_stage_grade(df, column):
    # Map numeric and descriptive labels to Roman numerals; labels that are not a stage or grade become 'NA'
    standard_map = {
        '1': 'I', '2': 'II', '3': 'III', '4': 'IV',
        'I': 'I', 'II': 'II', 'III': 'III', 'IV': 'IV',
        'early': 'I', 'advanced': 'IV', 'high': 'IV',
        'benign': 'NA', 'borderline': 'NA', 'microinvasion': 'NA',
        'NA': 'NA'
    }
    df[column] = df[column].replace(standard_map)
    # Anything not covered by the mapping (including missing values) becomes 'NA'
    df[column] = df[column].where(df[column].isin(['I', 'II', 'III', 'IV', 'NA']), 'NA')
    return df

merged_df = standardize_stage_grade(merged_df, 'Stage')
merged_df = standardize_stage_grade(merged_df, 'Grade')


In [57]:
see_value(merged_df,'Age')
see_value(merged_df,'Sex')
print("Stage : ")
see_value(merged_df,'Stage')
print("Grade : ")
see_value(merged_df,'Grade')
see_value(merged_df,'Cancer Type')
see_value(merged_df,'Sample Type')
see_value(merged_df,'Tumour Subtype')

['NA' 43 58 72 60 50 62 49 77 52 36 44 84 70 65 54 35 40 41 56 76 73 59 67
 69 64 74 75 63 55 47 68 51 82 53 42 45 78 57 38 80 46 48 31 71 61 66 89]
['NA' 'Female']
Stage : 
['III' 'IV' 'I' 'II' 'NA']
Grade : 
['III' 'NA' 'I' 'II']
['Ovarian cancer' 'NA' 'Breast cancer']
['NA' 'normal' 'primary tumour']
['clear cell' 'endometrioid' 'mucinous' 'serous' 'NA' 'ductal' 'lobular'
 'apocrine']


In [58]:
merged_df.head(10)

Unnamed: 0,Patient ID,Age,Sex,Stage,Grade,Cancer Type,Sample Type,Tumour Subtype
0,1,,,III,III,Ovarian cancer,,clear cell
1,2,,,IV,III,Ovarian cancer,,clear cell
2,3,,,I,III,Ovarian cancer,,clear cell
3,4,,,I,,Ovarian cancer,,clear cell
4,5,,,II,,Ovarian cancer,,clear cell
5,6,,,I,,Ovarian cancer,,clear cell
6,7,,,I,,Ovarian cancer,,clear cell
7,8,,,I,,Ovarian cancer,,clear cell
8,9,,,IV,III,Ovarian cancer,,endometrioid
9,10,,,I,I,Ovarian cancer,,endometrioid


## Conclusion

The extraction and standardization functions bring the datasets into a single, uniform format: ages are whole numbers, stage and grade use the Roman numerals I to IV, the categorical columns are restricted to a fixed set of labels, and missing or unrecognized values are marked `'NA'`. This consistency across the source datasets reduces errors caused by mismatched formats and gives downstream analysis and machine learning a reliable starting point.

In [59]:
# Save merged_df as a CSV file, without the DataFrame index

merged_df.to_csv('Final_Dataset.csv', index=False)
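
One caveat when reloading the file: by default `pd.read_csv` treats the string `NA` as a missing value, so the `'NA'` placeholders come back as `NaN`. The sketch below uses a small in-memory CSV (invented for illustration, not `Final_Dataset.csv`) to show the default behaviour and how `keep_default_na=False` preserves the literal strings.

```python
import io
import pandas as pd

# Toy CSV content, invented for illustration only
csv_text = "Patient ID,Sex,Stage\n1,NA,III\n2,Female,NA\n"

# Default parsing: the string 'NA' is read as a missing value
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: 'NA' stays a literal string
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df['Sex'].isna().tolist())
print(literal_df['Sex'].tolist())
```

Which behaviour is preferable depends on whether later steps expect real missing values or the `'NA'` label.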
