# Introduction

Machine learning (ML) is a rapidly evolving field with the potential to transform many industries, including healthcare. ML algorithms can learn from large datasets and identify patterns that would be difficult or impossible for humans to find unaided. This makes ML well suited to healthcare tasks such as developing new diagnostic tools and predicting patient outcomes.

One area where ML is having a major impact is RNA sequencing (RNA-seq). RNA-seq is a powerful technology that lets scientists measure the expression of all the genes in a sample at once. This information can be used to identify biomarkers for disease, understand the molecular mechanisms of cancer, and develop new personalized treatments.

ML algorithms can analyze RNA-seq data to identify biomarkers for cancer and other diseases, and they can support diagnostic tools that aim to be more accurate and sensitive than traditional methods. For example, ML models trained on RNA-seq data have been used to build classifiers for breast cancer, kidney cancer, and lung cancer, which may complement traditional methods such as mammography and biopsy.

**Problem statement**

Breast invasive carcinoma, kidney clear cell carcinoma, and lung adenocarcinoma are three of the most common types of cancer. Early detection and treatment of these cancers are essential for improving patient outcomes. However, traditional diagnostic methods for these cancers are not always accurate or sensitive.

ML-assisted RNA-seq based diagnostics have the potential to overcome the limitations of traditional diagnostic methods. ML algorithms can analyze RNA-seq data to identify biomarkers for these cancers and to develop diagnostic tools that are more accurate and sensitive.

**Benefits of ML-assisted RNA-seq based diagnostics**

ML-assisted RNA-seq based diagnostics offer a number of potential benefits over traditional diagnostic methods, including:

  * Improved accuracy and sensitivity: ML algorithms can be trained on large RNA-seq datasets to identify cancer biomarkers with high accuracy and sensitivity. As a result, ML-assisted RNA-seq based diagnostics could detect cancer earlier and more accurately than traditional methods.
  * Reduced invasiveness: RNA-seq can be performed on liquid biopsies, such as blood or urine. ML-assisted RNA-seq based diagnostics could therefore be used to diagnose cancer without invasive procedures such as tissue biopsies.
  * Personalization: ML-assisted RNA-seq based diagnostics can be used to develop diagnostic tools tailored to the individual patient's tumor. This can help improve the accuracy of diagnosis and guide treatment decisions.


# Imports

In [None]:
# imports

# data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go

# preprocessing, dimensionality reduction, and model selection
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.tree import export_text

# evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import multilabel_confusion_matrix
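
The imports above suggest a workflow of scaling, dimensionality reduction, and classification tuned by grid search. The following is only a minimal sketch of how those pieces can be chained, not the notebook's actual configuration: the synthetic data from `make_classification`, the step names (`scale`, `pca`, `clf`), and the parameter grid are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an expression matrix (samples x genes) with 3 classes.
# This is NOT the RNA-seq data used in this notebook.
X, y = make_classification(
    n_samples=150, n_features=40, n_informative=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scale features, project onto principal components, then classify.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: number of components and regularization strength.
param_grid = {"pca__n_components": [5, 10], "clf__C": [0.1, 1.0]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)
```

Wrapping the scaler and PCA inside the `Pipeline` means they are refit on each cross-validation training fold, which avoids leaking information from the validation fold into preprocessing.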

# Data Preprocessing

In [None]:
# load the labels and keep the primary-disease column as the target
train_label = pd.read_csv("/data/train_label.tsv", sep='\t')
train_labels = train_label.X_primary_disease
train_labels.shape

(1845,)

In [None]:
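# load the features and transpose so that each row is one sample
train_data = pd.read_csv("/data/train_data.tsv", sep='\t')
train_data = train_data.T
# drop the first row of the transposed table (assumed to be the identifier column of the file)
train_data = train_data.tail(-1)
print("Before processing: ", train_data.shape)
# remove samples that contain missing values
train_data_1 = train_data.dropna()
print("After_1 processing: ", train_data_1.shape)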

Before processing:  (1845, 16340)
After_1 processing:  (1845, 16340)
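
To make the transpose-and-trim step above easier to follow, here is a small sketch on a made-up table. It assumes, as the shapes above suggest, that the file stores genes as rows and samples as columns, with the first column holding gene identifiers. The tiny TSV string and the names `raw`, `features`, and `alt` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical miniature of a genes-by-samples TSV; first column holds gene IDs.
tsv = "sample\tS1\tS2\tS3\nGENE_A\t1.0\t2.0\t3.0\nGENE_B\t4.0\t\t6.0\n"
raw = pd.read_csv(io.StringIO(tsv), sep="\t")

# After transposing, the first row contains the gene IDs rather than a sample.
transposed = raw.T
# tail(-1) keeps every row except the first one.
features = transposed.tail(-1)
print(features.shape)

# dropna() removes any sample (row) with a missing value; S2 is dropped here.
complete = features.dropna()
print(complete.shape)

# Alternative: read the gene IDs as the index so they become column names
# after transposing and the values keep a numeric dtype.
alt = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0).T
print(list(alt.columns))
```

In the notebook output above, the shape is unchanged by `dropna()`, which indicates that no sample in the training data has missing values.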
