# A Data Science Framework: 

Architecture for bladder cancer analytics and prediction via machine learning

#Define the Problem: Traditional cancer diagnosis and prognosis rely on biopsy, so cancer is usually discovered only after the tumor has reached a certain size or entered the mid or late stage. Biopsy interpretation also relies on the pathologist's experience, and different pathologists may reach different opinions on the same sample. A non-invasive, objective measurement of patient samples that supports unbiased judgment is therefore a goal of the clinical cancer community. This study measured over 2500 miRNAs in peripheral blood samples from bladder cancer patients at different stages, plus samples from healthy individuals and from patients with other cancers as controls.

We hope this study can identify biomarker signatures of bladder cancer that could be used for diagnosis, prognosis, and cancer staging. Through machine learning and predictive analytics, we aim to develop a method that can predict bladder cancer at an early stage.
●Hypothesis or project topic
➢Cancer is a disease of molecular dysregulation. Is there a unique molecular signature for bladder cancer samples?
➢If there is a signature, can we build a data product that identifies bladder cancer from a given blood sample?
➢Is non-invasive, early detection of bladder cancer achievable with high accuracy and high sensitivity?
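
Since the last question is framed in terms of accuracy and sensitivity, it may help to pin down how those quantities are computed. The following is a minimal sketch, assuming binary labels where 1 means bladder cancer and 0 means anything else; the function name and the label lists are hypothetical and are not results from this study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 = bladder cancer."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cancers correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cancers missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-cancers correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

For an early-detection test, sensitivity (the fraction of true cancers that are caught) is usually the number to watch first, because a missed cancer is the costlier error.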


#Gather the Data: 
➢The dataset was downloaded from GEO (NCBI/NIH), series GSE113486.
➢The dataset contains profiles of 972 samples: 392 bladder cancer, 100 non-cancer control, and 480 from patients with other types of cancer. Each sample is described by about 2600 variables.
➢Bladder cancer, non-cancer control, and other cancers are the class labels.
➢miRNA measurements plus gender, age, and location are the major features.
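
The class counts above already fix a useful reference point: how well a model would score by always predicting the majority class. The sketch below works that arithmetic out, assuming the binary framing used later in this notebook (bladder cancer versus everything else); the variable names are illustrative.

```python
# Class counts as stated in the dataset description.
counts = {"bladder_cancer": 392, "non_cancer_control": 100, "other_cancer": 480}
total = sum(counts.values())

# Binary framing: bladder cancer (1) versus everything else (0).
positives = counts["bladder_cancer"]
negatives = total - positives

prevalence = positives / total                         # share of bladder cancer samples
majority_baseline = max(positives, negatives) / total  # accuracy of always guessing the larger class

print(total, round(prevalence, 3), round(majority_baseline, 3))
```

Any classifier trained on this data should be judged against that baseline rather than against 50%.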


#Prepare Data for Consumption: 

This step is often referred to as data wrangling, a required process to turn “wild” data into “manageable” data. Data wrangling includes implementing data architectures for storage and processing, developing data governance standards for quality and control, data extraction (e.g. ETL and web scraping), and data cleaning to identify aberrant, missing, or outlier data points.
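
As a rough illustration of the cleaning part of this step, the sketch below counts missing values and flags outliers on a toy expression table. The column names and values are hypothetical, and the 1.5 × IQR rule is only one common rule of thumb, not necessarily the one used in this project.

```python
import numpy as np
import pandas as pd

# Toy expression table (hypothetical values); rows are samples, columns are miRNAs.
toy = pd.DataFrame(
    {
        "miR_a": [1.2, 0.8, np.nan, 1.1, 1.0],
        "miR_b": [5.0, 5.2, 4.9, 25.0, 5.1],
    }
)

# Count missing values per feature.
missing_per_feature = toy.isna().sum()

# Flag values that fall outside 1.5 * IQR of their own feature.
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
outlier_mask = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)

print(missing_per_feature)
print(outlier_mask.sum())
```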

#Perform Exploratory Analysis: Anyone who has worked with data knows the rule of garbage in, garbage out (GIGO). It is therefore important to use descriptive and graphical statistics to look for potential problems, patterns, classifications, correlations, and comparisons in the dataset. Categorizing the data (e.g. qualitative vs. quantitative) is also important for understanding and selecting the correct hypothesis test or data model.

#Model Data: Like descriptive and inferential statistics, data modeling can either summarize the data or predict future outcomes. Your dataset and expected results determine which algorithms are available for use. It's important to remember that algorithms are tools, not magic wands or silver bullets. You must still be the master crafts(wo)man who knows how to select the right tool for the job. An analogy would be asking someone to hand you a Phillips screwdriver and being handed a flathead screwdriver or, worse, a hammer. At best, it shows a complete lack of understanding. At worst, it makes completing the project impossible. The same is true in data modeling: the wrong model can lead to poor performance at best and to the wrong conclusion (used as actionable intelligence) at worst.

#Validate and Implement Data Model: After you've trained your model on a subset of your data, it's time to test it on data it has not seen. This helps ensure you haven't overfit the model, that is, made it so specific to the selected subset that it does not accurately fit another subset from the same dataset. In this step we determine whether the model overfits, generalizes, or underfits the dataset.
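
One common way to run this check is a stratified hold-out split combined with cross-validation. The sketch below uses synthetic data from scikit-learn and a logistic regression purely as stand-ins; the sample size, feature count, and model choice are assumptions, not the configuration of this study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (not the miRNA dataset).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Hold out 25% of the samples; stratify so both splits keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# 5-fold cross-validation on the training portion only.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)

print(f"train={train_acc:.3f} test={test_acc:.3f} cv_mean={cv_scores.mean():.3f}")
```

A training score far above the test and cross-validation scores suggests overfitting; low scores everywhere suggest underfitting.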

#Optimize and Strategize: This is the "bionic man" step, where you iterate back through the process to make it better, stronger, and faster than it was before. As a data scientist, your strategy should be to outsource developer operations and application plumbing so that you have more time to focus on recommendations and design. Once you're able to package your ideas, this becomes your “currency exchange” rate.
    
    

# Workflow goals

#The data science solutions workflow solves for seven major goals.
#Classifying. We may want to classify or categorize our samples. We may also want to understand the implications or correlation of different classes with our solution goal.
#Correlating. One can approach the problem based on the features available in the training dataset. Which features contribute significantly to our solution goal? Statistically speaking, is there a correlation between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa? This can be tested for both numerical and categorical features in the given dataset. We may also want to determine correlations among features other than the class label for subsequent goals and workflow stages. Correlating certain features may help in creating, completing, or correcting features.
#Converting. For the modeling stage, one needs to prepare the data. Depending on the choice of model algorithm, all features may need to be converted to numerical equivalents, for instance by converting text categorical values to numeric values.
#Completing. Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values.
#Correcting. We may also analyze the given training dataset for errors or possibly inaccurate values within features and try to correct these values or exclude the samples containing the errors. One way to do this is to detect outliers among our samples or features. We may also completely discard a feature if it does not contribute to the analysis or may significantly skew the results.
#Creating. Can we create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals?
#Charting. How do we select the right visualization plots and charts depending on the nature of the data and the solution goals?
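
The converting, completing, and correcting goals above can be sketched on a toy table as follows. The column names mirror the ones used later in this notebook, but the values, the median/mode imputation, and the zero-variance filter are illustrative assumptions rather than the project's chosen methods.

```python
import pandas as pd

# Toy table with hypothetical values.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Male", None],
        "Age": ["59", "77", None, "50"],
        "miR_x": [1.0, 2.0, 3.0, 4.0],
        "constant": [0.0, 0.0, 0.0, 0.0],
    }
)

# Converting: text categories and numeric strings become numbers.
toy["Gender"] = toy["Gender"].map({"Female": 1, "Male": 0})
toy["Age"] = pd.to_numeric(toy["Age"], errors="coerce")

# Completing: fill missing values (median for Age, most frequent value for Gender).
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0])

# Correcting: discard features that carry no information (a single unique value).
zero_variance = [col for col in toy.columns if toy[col].nunique() <= 1]
toy = toy.drop(columns=zero_variance)

print(toy)
```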

In [None]:
# Imports - you'll need some of these later, but it's traditional to put them all at the beginning.

import os
import csv
import json
import gzip
import shutil

#from collections import Counter
from operator import itemgetter
from requests import get


def download(download_url, output_file):
    """
    Downloads a URL and writes it to the specified path. The "path" 
    is like the mailing address for the file - it tells the function 
    where on your computer to send it!
    
    The request is made before the file is opened, so a failed download
    does not leave an empty file behind. Also note the use of "with" to
    automatically close files - this is a good standard practice to follow.
    """
    response = get(download_url)
    response.raise_for_status()  # stop here on an HTTP error
    with open(output_file, 'wb') as f:
        f.write(response.content)


## Execute the function and download the GEO series matrix for GSE113486:
url = "http://ftp.ncbi.nlm.nih.gov/geo/series/GSE113nnn/GSE113486/matrix/GSE113486_series_matrix.txt.gz"
path = "C:/BigData/DSCert/input/GSE113486_series_matrix.txt.gz"
download(url, path)

#gunzip the downloaded file. This only needs to be done once. For analytical purposes, the unzipped file will
#usually be opened many times during debugging and testing
bladder_cancer_file = "C:/BigData/DSCert/input/GSE113486_series_matrix.txt"
with gzip.open(path, 'rb') as src, open(bladder_cancer_file, 'wb') as dst:
    shutil.copyfileobj(src, dst)
 

In [15]:
#Main program

#clean the memory
#in ipython
%reset -f 

#in python
import gc
gc.collect()

# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
#show plots in the Jupyter Notebook
%matplotlib inline
#configure visualization defaults
sns.set(style='white', context='notebook', palette='deep')
sns.set_style('white')


# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier


#Acquire data
#The pandas package helps us work with our datasets. We start by loading the "master" data file;
#we will split the data into training and testing datasets later.
#
#Downloaded master file for this project:

bladder_cancer_file = "C:/Users/Liu_PC/Documents/Georgetown/GSE113486_series_matrix.txt"

df = pd.read_csv(bladder_cancer_file, delimiter="\t", skiprows = 48, header = None) 

#df = pd.read_csv(bladder_cancer_file, delimiter="\t", skiprows = 73)
df.shape



(2592, 973)
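
The hardcoded `skiprows` value above depends on the exact number of header lines in the file. GEO series matrix files mark the start of the data table with a `!series_matrix_table_begin` line, so the offset can also be looked up. The following is a minimal sketch on an in-memory toy file; the helper name `find_table_start` and the sample text are hypothetical, and this variant reads only the expression table, not the sample metadata rows that the cell above keeps.

```python
import io

import pandas as pd


def find_table_start(lines, marker="!series_matrix_table_begin"):
    """Return the zero-based index of the line that opens the data table."""
    for i, line in enumerate(lines):
        if line.startswith(marker):
            return i
    raise ValueError(f"marker {marker!r} not found")


# Tiny stand-in for a series matrix file (hypothetical content).
sample_text = (
    '!Series_title\t"toy"\n'
    '!Sample_characteristics_ch1\t"Sex: Male"\t"Sex: Female"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"MIMAT0000062"\t-1.061\t0.765\n'
    '!series_matrix_table_end\n'
)

start = find_table_start(sample_text.splitlines())

# Skip everything up to and including the marker; comment="!" drops the closing marker line.
expr = pd.read_csv(
    io.StringIO(sample_text),
    sep="\t",
    skiprows=start + 1,
    index_col=0,
    comment="!",
)
print(expr)
```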

In [16]:
#Move the ID_REF row (25) and the disease-status row (2) to the top; the remaining rows keep their order
df3 = pd.concat([df.iloc[[25],:], df.iloc[[2],:], df.drop([2, 25], axis=0)], axis=0)

del df

df3.head(30)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,963,964,965,966,967,968,969,970,971,972
25,ID_REF,GSM3106847,GSM3106848,GSM3106849,GSM3106850,GSM3106851,GSM3106852,GSM3106853,GSM3106854,GSM3106855,...,GSM3107818,GSM3107819,GSM3107820,GSM3107821,GSM3107822,GSM3107823,GSM3107824,GSM3107825,GSM3107826,GSM3107827
2,!Sample_characteristics_ch1,disease status: Bladder Cancer,disease status: Bladder Cancer,disease status: Bladder Cancer,disease status: Bladder Cancer,disease status: Bladder Cancer,disease status: Bladder Cancer,disease status: Bladder Cancer,disease status: Bladder Cancer,disease status: Bladder Cancer,...,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma
0,!Sample_characteristics_ch1,Sex: Male,Sex: Female,Sex: Male,Sex: Male,Sex: Female,Sex: Male,Sex: Male,Sex: Female,Sex: Male,...,Sex: Female,Sex: Female,Sex: Male,Sex: Female,Sex: Male,Sex: Male,Sex: Male,Sex: Female,Sex: Male,Sex: Male
1,!Sample_characteristics_ch1,age: 59,age: 77,age: 50,age: 76,age: 81,age: 54,age: 74,age: 76,age: 58,...,age: 32,age: 31,age: 46,age: 26,age: 65,age: 56,age: 30,age: 28,age: 60,age: 77
3,!Sample_characteristics_ch1,pathological tstage: <pT2,pathological tstage: >=pT2,pathological tstage: <pT2,pathological tstage: <pT2,pathological tstage: <pT2,pathological tstage: <pT2,pathological tstage: <pT2,pathological tstage: <pT2,pathological tstage: <pT2,...,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain
4,!Sample_characteristics_ch1,pathological grade: low,pathological grade: high,pathological grade: high,pathological grade: high,pathological grade: high,pathological grade: low,pathological grade: low,pathological grade: high,pathological grade: low,...,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain
5,!Sample_molecule_ch1,total RNA,total RNA,total RNA,total RNA,total RNA,total RNA,total RNA,total RNA,total RNA,...,total RNA,total RNA,total RNA,total RNA,total RNA,total RNA,total RNA,total RNA,total RNA,total RNA
6,!Sample_extract_protocol_ch1,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...,Total RNA was extracted each from 300uL serum ...
7,!Sample_label_ch1,Cy5,Cy5,Cy5,Cy5,Cy5,Cy5,Cy5,Cy5,Cy5,...,Cy5,Cy5,Cy5,Cy5,Cy5,Cy5,Cy5,Cy5,Cy5,Cy5
8,!Sample_label_protocol_ch1,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...,miRNA was labeled using 3D-Gene® miRNA Labelin...


In [17]:
# Drop the metadata rows that are not used as features (rows 5-18 and 20-24).
# Row 19 (!Sample_contact_city) is kept.
rows_to_drop = list(range(5, 19)) + list(range(20, 25))
df3 = df3.drop(rows_to_drop, axis=0)

df3.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,963,964,965,966,967,968,969,970,971,972
25,ID_REF,GSM3106847,GSM3106848,GSM3106849,GSM3106850,GSM3106851,GSM3106852,GSM3106853,GSM3106854,GSM3106855,...,GSM3107818,GSM3107819,GSM3107820,GSM3107821,GSM3107822,GSM3107823,GSM3107824,GSM3107825,GSM3107826,GSM3107827
2,!Sample_characteristics_ch1,disease status: Bladder Cancer,disease status: Bladder Cancer,disease status: Bladder Cancer,disease status: Bladder Cancer,disease status: Bladder Cancer,disease status: Bladder Cancer,disease status: Bladder Cancer,disease status: Bladder Cancer,disease status: Bladder Cancer,...,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma,disease status: Sarcoma
0,!Sample_characteristics_ch1,Sex: Male,Sex: Female,Sex: Male,Sex: Male,Sex: Female,Sex: Male,Sex: Male,Sex: Female,Sex: Male,...,Sex: Female,Sex: Female,Sex: Male,Sex: Female,Sex: Male,Sex: Male,Sex: Male,Sex: Female,Sex: Male,Sex: Male
1,!Sample_characteristics_ch1,age: 59,age: 77,age: 50,age: 76,age: 81,age: 54,age: 74,age: 76,age: 58,...,age: 32,age: 31,age: 46,age: 26,age: 65,age: 56,age: 30,age: 28,age: 60,age: 77
3,!Sample_characteristics_ch1,pathological tstage: <pT2,pathological tstage: >=pT2,pathological tstage: <pT2,pathological tstage: <pT2,pathological tstage: <pT2,pathological tstage: <pT2,pathological tstage: <pT2,pathological tstage: <pT2,pathological tstage: <pT2,...,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain,pathological tstage: uncertain
4,!Sample_characteristics_ch1,pathological grade: low,pathological grade: high,pathological grade: high,pathological grade: high,pathological grade: high,pathological grade: low,pathological grade: low,pathological grade: high,pathological grade: low,...,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain,pathological grade: uncertain
19,!Sample_contact_city,Kanagawa,Kanagawa,Kanagawa,Kanagawa,Kanagawa,Kanagawa,Kanagawa,Kanagawa,Kanagawa,...,Kanagawa,Kanagawa,Kanagawa,Kanagawa,Kanagawa,Kanagawa,Kanagawa,Kanagawa,Kanagawa,Kanagawa
26,MIMAT0000062,-1.061,0.765,2.949,3.033,4.832,1.729,6.330,6.835,0.195,...,2.504,-0.258,4.517,3.239,5.270,3.708,0.132,5.440,0.071,4.798
27,MIMAT0000063,-1.061,0.765,3.451,6.224,5.349,4.569,6.452,1.320,4.343,...,-1.241,-0.258,4.911,3.096,5.926,0.047,5.036,6.437,4.577,4.875
28,MIMAT0000064,2.303,4.920,0.420,3.496,5.571,3.458,3.505,4.226,1.969,...,2.750,-0.258,0.842,2.058,0.319,0.047,2.702,6.580,5.210,5.118


In [18]:
#Transpose so that each row is a sample and each column is a feature
df = df3.T

del df3
df.head(10)

Unnamed: 0,25,2,0,1,3,4,19,26,27,28,...,2582,2583,2584,2585,2586,2587,2588,2589,2590,2591
0,ID_REF,!Sample_characteristics_ch1,!Sample_characteristics_ch1,!Sample_characteristics_ch1,!Sample_characteristics_ch1,!Sample_characteristics_ch1,!Sample_contact_city,MIMAT0000062,MIMAT0000063,MIMAT0000064,...,MIMAT0032026,MIMAT0032029,MIMAT0032110,"MIMAT0032114, MIMAT0032115",MIMAT0032116,MIMAT0033692,MIMAT0035542,MIMAT0035703,MIMAT0035704,!series_matrix_table_end
1,GSM3106847,disease status: Bladder Cancer,Sex: Male,age: 59,pathological tstage: <pT2,pathological grade: low,Kanagawa,-1.061,-1.061,2.303,...,-1.061,7.743,-1.061,-1.061,6.507,3.906,-1.061,-1.061,-1.061,
2,GSM3106848,disease status: Bladder Cancer,Sex: Female,age: 77,pathological tstage: >=pT2,pathological grade: high,Kanagawa,0.765,0.765,4.920,...,0.765,8.038,0.765,0.765,5.946,0.765,0.765,0.765,0.765,
3,GSM3106849,disease status: Bladder Cancer,Sex: Male,age: 50,pathological tstage: <pT2,pathological grade: high,Kanagawa,2.949,3.451,0.420,...,-1.492,7.596,-1.492,-1.492,6.058,4.482,2.917,-1.492,-1.492,
4,GSM3106850,disease status: Bladder Cancer,Sex: Male,age: 76,pathological tstage: <pT2,pathological grade: high,Kanagawa,3.033,6.224,3.496,...,0.867,7.526,0.867,0.867,6.315,2.759,5.028,0.867,4.042,
5,GSM3106851,disease status: Bladder Cancer,Sex: Female,age: 81,pathological tstage: <pT2,pathological grade: high,Kanagawa,4.832,5.349,5.571,...,1.237,7.388,5.724,1.237,6.788,6.695,5.987,1.237,5.985,
6,GSM3106852,disease status: Bladder Cancer,Sex: Male,age: 54,pathological tstage: <pT2,pathological grade: low,Kanagawa,1.729,4.569,3.458,...,-0.486,9.491,-0.486,-0.486,6.825,4.682,-0.486,-0.486,-0.486,
7,GSM3106853,disease status: Bladder Cancer,Sex: Male,age: 74,pathological tstage: <pT2,pathological grade: low,Kanagawa,6.330,6.452,3.505,...,1.51,8.169,1.51,1.51,5.517,1.51,4.643,1.51,1.51,
8,GSM3106854,disease status: Bladder Cancer,Sex: Female,age: 76,pathological tstage: <pT2,pathological grade: high,Kanagawa,6.835,1.320,4.226,...,3.98,8.404,5.749,1.32,6.913,4.67,1.32,1.32,1.32,
9,GSM3106855,disease status: Bladder Cancer,Sex: Male,age: 58,pathological tstage: <pT2,pathological grade: low,Kanagawa,0.195,4.343,1.969,...,0.195,7.219,0.195,0.195,6.598,6.062,3.48,2.795,0.195,


In [19]:
#need to reset the column names so they become unique and indexable
df.loc[0, 0] = 'Gender'
df.loc[0, 1] = 'Age'
df.loc[0, 2] = 'Class'
df.loc[0, 3] = 'Path_Stage'
df.loc[0, 4] = 'Path_Grade'
df.loc[0, 19] = 'Sample_City'

df.loc[0, 25] = 'Sample_ID'

#df.loc['ID_REF', 6] = 'ID_REF_old'

#now the dataframe is indexable
new_header = df.iloc[0] #grab the first row for the header
df = df[1:] #take the data less the header row
df.columns = new_header 

#df = df.loc[:, ~df.columns.duplicated()]
df.head(10)


Unnamed: 0,Sample_ID,Class,Gender,Age,Path_Stage,Path_Grade,Sample_City,MIMAT0000062,MIMAT0000063,MIMAT0000064,...,MIMAT0032026,MIMAT0032029,MIMAT0032110,"MIMAT0032114, MIMAT0032115",MIMAT0032116,MIMAT0033692,MIMAT0035542,MIMAT0035703,MIMAT0035704,!series_matrix_table_end
1,GSM3106847,disease status: Bladder Cancer,Sex: Male,age: 59,pathological tstage: <pT2,pathological grade: low,Kanagawa,-1.061,-1.061,2.303,...,-1.061,7.743,-1.061,-1.061,6.507,3.906,-1.061,-1.061,-1.061,
2,GSM3106848,disease status: Bladder Cancer,Sex: Female,age: 77,pathological tstage: >=pT2,pathological grade: high,Kanagawa,0.765,0.765,4.92,...,0.765,8.038,0.765,0.765,5.946,0.765,0.765,0.765,0.765,
3,GSM3106849,disease status: Bladder Cancer,Sex: Male,age: 50,pathological tstage: <pT2,pathological grade: high,Kanagawa,2.949,3.451,0.42,...,-1.492,7.596,-1.492,-1.492,6.058,4.482,2.917,-1.492,-1.492,
4,GSM3106850,disease status: Bladder Cancer,Sex: Male,age: 76,pathological tstage: <pT2,pathological grade: high,Kanagawa,3.033,6.224,3.496,...,0.867,7.526,0.867,0.867,6.315,2.759,5.028,0.867,4.042,
5,GSM3106851,disease status: Bladder Cancer,Sex: Female,age: 81,pathological tstage: <pT2,pathological grade: high,Kanagawa,4.832,5.349,5.571,...,1.237,7.388,5.724,1.237,6.788,6.695,5.987,1.237,5.985,
6,GSM3106852,disease status: Bladder Cancer,Sex: Male,age: 54,pathological tstage: <pT2,pathological grade: low,Kanagawa,1.729,4.569,3.458,...,-0.486,9.491,-0.486,-0.486,6.825,4.682,-0.486,-0.486,-0.486,
7,GSM3106853,disease status: Bladder Cancer,Sex: Male,age: 74,pathological tstage: <pT2,pathological grade: low,Kanagawa,6.33,6.452,3.505,...,1.51,8.169,1.51,1.51,5.517,1.51,4.643,1.51,1.51,
8,GSM3106854,disease status: Bladder Cancer,Sex: Female,age: 76,pathological tstage: <pT2,pathological grade: high,Kanagawa,6.835,1.32,4.226,...,3.98,8.404,5.749,1.32,6.913,4.67,1.32,1.32,1.32,
9,GSM3106855,disease status: Bladder Cancer,Sex: Male,age: 58,pathological tstage: <pT2,pathological grade: low,Kanagawa,0.195,4.343,1.969,...,0.195,7.219,0.195,0.195,6.598,6.062,3.48,2.795,0.195,
10,GSM3106856,disease status: Bladder Cancer,Sex: Female,age: 65,pathological tstage: >=pT2,pathological grade: high,Kanagawa,2.491,0.882,0.801,...,0.801,6.498,0.801,0.801,6.438,0.801,0.801,2.912,3.758,


In [20]:
#Now we are ready to clean the data by string replacement:
#strip the "key: " prefix from each metadata value
df['Gender'] = df['Gender'].str.replace('Sex: ', '', regex=False)
df['Age'] = df['Age'].str.replace('age: ', '', regex=False)
df['Class'] = df['Class'].str.replace('disease status: ', '', regex=False)
df['Path_Stage'] = df['Path_Stage'].str.replace('pathological tstage: ', '', regex=False)
df['Path_Grade'] = df['Path_Grade'].str.replace('pathological grade: ', '', regex=False)

df.head(10)

Unnamed: 0,Sample_ID,Class,Gender,Age,Path_Stage,Path_Grade,Sample_City,MIMAT0000062,MIMAT0000063,MIMAT0000064,...,MIMAT0032026,MIMAT0032029,MIMAT0032110,"MIMAT0032114, MIMAT0032115",MIMAT0032116,MIMAT0033692,MIMAT0035542,MIMAT0035703,MIMAT0035704,!series_matrix_table_end
1,GSM3106847,Bladder Cancer,Male,59,<pT2,low,Kanagawa,-1.061,-1.061,2.303,...,-1.061,7.743,-1.061,-1.061,6.507,3.906,-1.061,-1.061,-1.061,
2,GSM3106848,Bladder Cancer,Female,77,>=pT2,high,Kanagawa,0.765,0.765,4.92,...,0.765,8.038,0.765,0.765,5.946,0.765,0.765,0.765,0.765,
3,GSM3106849,Bladder Cancer,Male,50,<pT2,high,Kanagawa,2.949,3.451,0.42,...,-1.492,7.596,-1.492,-1.492,6.058,4.482,2.917,-1.492,-1.492,
4,GSM3106850,Bladder Cancer,Male,76,<pT2,high,Kanagawa,3.033,6.224,3.496,...,0.867,7.526,0.867,0.867,6.315,2.759,5.028,0.867,4.042,
5,GSM3106851,Bladder Cancer,Female,81,<pT2,high,Kanagawa,4.832,5.349,5.571,...,1.237,7.388,5.724,1.237,6.788,6.695,5.987,1.237,5.985,
6,GSM3106852,Bladder Cancer,Male,54,<pT2,low,Kanagawa,1.729,4.569,3.458,...,-0.486,9.491,-0.486,-0.486,6.825,4.682,-0.486,-0.486,-0.486,
7,GSM3106853,Bladder Cancer,Male,74,<pT2,low,Kanagawa,6.33,6.452,3.505,...,1.51,8.169,1.51,1.51,5.517,1.51,4.643,1.51,1.51,
8,GSM3106854,Bladder Cancer,Female,76,<pT2,high,Kanagawa,6.835,1.32,4.226,...,3.98,8.404,5.749,1.32,6.913,4.67,1.32,1.32,1.32,
9,GSM3106855,Bladder Cancer,Male,58,<pT2,low,Kanagawa,0.195,4.343,1.969,...,0.195,7.219,0.195,0.195,6.598,6.062,3.48,2.795,0.195,
10,GSM3106856,Bladder Cancer,Female,65,>=pT2,high,Kanagawa,2.491,0.882,0.801,...,0.801,6.498,0.801,0.801,6.438,0.801,0.801,2.912,3.758,
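
The per-column replacements above all remove a "key: " prefix, so the same result can be reached generically by splitting on the first ": ". The sketch below shows that on a toy frame and also converts Age to a number, since the cleaned values are still strings. The toy frame is hypothetical, and this is an alternative formulation rather than the notebook's own code.

```python
import pandas as pd

# Toy frame with the same "key: value" layout as the GEO metadata columns.
toy = pd.DataFrame(
    {
        "Gender": ["Sex: Male", "Sex: Female"],
        "Age": ["age: 59", "age: 77"],
        "Path_Stage": ["pathological tstage: <pT2", "pathological tstage: >=pT2"],
    }
)

# Keep only the part after the first ": " in every metadata column.
for col in toy.columns:
    toy[col] = toy[col].str.split(": ", n=1).str[1]

# Age is still text at this point; make it numeric so it can be used as a feature.
toy["Age"] = pd.to_numeric(toy["Age"])

print(toy)
print(toy.dtypes)
```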


In [21]:
df['Gender'] = df['Gender'].map( {'Female': 1, 'Male': 0} ).astype(int)

df['Class'] = df['Class'].map(lambda x: 1 if x == "Bladder Cancer" else 0).astype(int)


df.tail(30)

Unnamed: 0,Sample_ID,Class,Gender,Age,Path_Stage,Path_Grade,Sample_City,MIMAT0000062,MIMAT0000063,MIMAT0000064,...,MIMAT0032026,MIMAT0032029,MIMAT0032110,"MIMAT0032114, MIMAT0032115",MIMAT0032116,MIMAT0033692,MIMAT0035542,MIMAT0035703,MIMAT0035704,!series_matrix_table_end
943,GSM3107798,0,0,30,uncertain,uncertain,Kanagawa,-1.189,-1.189,-1.189,...,-1.189,6.782,-1.189,-1.189,6.918,4.754,0.952,-1.189,-1.189,
944,GSM3107799,0,1,68,uncertain,uncertain,Kanagawa,-0.028,-0.028,4.598,...,-0.028,7.355,-0.028,-0.028,5.203,-0.028,-0.028,4.23,-0.028,
945,GSM3107800,0,1,61,uncertain,uncertain,Kanagawa,3.306,1.028,1.028,...,1.028,6.339,1.028,1.028,5.332,1.028,5.046,1.028,4.413,
946,GSM3107801,0,0,44,uncertain,uncertain,Kanagawa,-0.635,-0.782,3.004,...,2.127,6.497,4.05,-0.782,5.716,4.913,-0.782,-0.782,1.671,
947,GSM3107802,0,1,41,uncertain,uncertain,Kanagawa,-1.174,2.167,-1.174,...,3.008,7.312,-1.174,-1.174,5.899,5.171,3.738,-1.174,-1.174,
948,GSM3107803,0,0,78,uncertain,uncertain,Kanagawa,6.624,6.485,3.172,...,0.193,6.615,0.193,0.193,4.976,5.76,4.682,0.193,3.882,
949,GSM3107804,0,0,69,uncertain,uncertain,Kanagawa,2.215,0.244,6.263,...,-1.383,7.038,0.361,-1.383,5.99,4.403,-1.383,-1.383,-1.383,
950,GSM3107805,0,0,58,uncertain,uncertain,Kanagawa,-0.006,-0.006,-0.006,...,-0.006,5.156,-0.006,-0.006,6.148,3.774,-0.006,-0.006,1.989,
951,GSM3107806,0,0,86,uncertain,uncertain,Kanagawa,3.194,3.232,1.849,...,0.044,7.634,0.044,0.044,6.276,5.68,3.427,0.044,4.119,
952,GSM3107807,0,0,27,uncertain,uncertain,Kanagawa,6.122,5.325,4.726,...,0.722,6.938,2.358,0.722,6.173,6.218,4.526,0.722,5.219,


## 
#Acquire data
#The pandas package helps us work with our datasets. We load the master file into a pandas DataFrame.
#Downloaded master file for this project:
bladder_cancer_file = "C:/Users/Liu_PC/Documents/Georgetown/GSE113486_series_matrix.txt"

df = pd.read_csv(bladder_cancer_file, delimiter="\t", skiprows=39)
#drop() returns a new DataFrame, so the result must be assigned back
df = df.drop([1,2,3,4,5,6,7,8], axis=0)
df = df.T
df.shape

#Analyze by describing data
#pandas also helps describe the dataset, answering the following questions early in our project.
#Which features are available in the dataset?
#Note the feature names so they can be manipulated or analyzed directly. The miRNA features are named by their MIMAT accession IDs.


    

In [None]:
df.describe()
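
One caveat with `describe()` here: after transposing a table that mixes text and numbers, the columns typically have `object` dtype, and `describe()` then reports count/unique/top/freq instead of mean and quantiles. A minimal sketch of converting the miRNA columns first, on a hypothetical toy frame (selecting columns by the `MIMAT` prefix is an assumption based on the column names shown above):

```python
import pandas as pd

# Toy frame where expression values arrived as strings (hypothetical values).
toy = pd.DataFrame(
    {
        "Sample_ID": ["GSM1", "GSM2", "GSM3"],
        "MIMAT0000062": ["-1.061", "0.765", "2.949"],
        "MIMAT0000063": ["-1.061", "0.765", "3.451"],
    }
)

# Select the miRNA columns by their accession prefix and convert them to floats.
mirna_cols = [col for col in toy.columns if col.startswith("MIMAT")]
toy[mirna_cols] = toy[mirna_cols].apply(pd.to_numeric, errors="coerce")

# describe() now returns numeric summaries (mean, std, quartiles).
summary = toy[mirna_cols].describe()
print(summary)
```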