# The midterm report and 5-minute presentation should include the following.

- Preliminary visualizations of data.
- Detailed description of data processing done so far.
- Detailed description of data modeling methods used so far.
- Preliminary results (e.g., we fit a linear model to the data and achieved promising results, or we clustered the data and noticed a clear pattern).

We expect to see preliminary code in your project repo at this point.

Your report should be submitted as README.md in your project GitHub repo.

The 5-minute presentation should be a recording uploaded to YouTube. Please add the video link to the beginning of your report.

# Visualizations of Data

import pandas as pd
import numpy as np

df = pd.read_csv('./data/aml_ohsu_2022_clinical_data.tsv', sep='\t') # load the data (tab-separated file, so sep must be specified)


## Data 

# 1. Basic structure: column dtypes, non-null counts, and (rows, columns)
df.info() # prints its own summary
print(df.shape)

# 2. First and last few rows
print(df.head())
print(df.tail())

# 3. Summary statistics
print(df.describe()) # numeric columns only
print(df.describe(include='all')) # all columns, including categorical ones

# 4. Missing values per column: count and percentage
print(df.isnull().sum())
print(df.isnull().mean() * 100)

# 5. Duplicated rows: count and percentage
print(df.duplicated().sum())
print(df.duplicated().mean() * 100)

# 6. Column names and the number of unique values in each column
print(df.columns)
print(df.nunique())
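
The missing-value checks in step 4 can be combined into a single sorted table, which tends to be easier to scan when a dataset has many columns. The following is a minimal sketch rather than part of the original notebook: the helper name `missing_summary` and the small `demo` frame are hypothetical and only stand in for the loaded `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst columns first."""
    summary = pd.DataFrame({
        "n_missing": frame.isnull().sum(),
        "pct_missing": frame.isnull().mean() * 100,
    })
    # Keep only the columns that actually contain missing values.
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("pct_missing", ascending=False)


# Hypothetical demo data; in the notebook, pass the loaded `df` instead.
demo = pd.DataFrame({
    "age": [61.0, np.nan, 47.0, 70.0],
    "sex": ["F", "M", None, None],
    "status": ["alive", "dead", "alive", "dead"],
})
print(missing_summary(demo))
```

Calling `missing_summary(df)` on the real data would list only the columns with gaps, which may help decide which ones to drop or impute.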

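import seaborn as sns
import matplotlib.pyplot as plt

# NOTE: 'column_name', 'column1', and 'column2' are placeholders; replace them with real column names from df.

# Histogram of a single numeric column
plt.figure()
df['column_name'].hist()

# Boxplot of a single numeric column
plt.figure()
sns.boxplot(data=df, x='column_name')

# Correlation heatmap (numeric columns only; text columns cannot be correlated)
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')

# Scatterplot of two numeric columns
plt.figure()
sns.scatterplot(data=df, x='column1', y='column2')

# Distribution plot (histplot with a KDE overlay replaces the deprecated sns.distplot)
plt.figure()
sns.histplot(df['column_name'], kde=True)

plt.show()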
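
Because the plotting calls above use placeholder column names, one way to get a first look at the data is to select the numeric columns automatically and draw a histogram for each. This is a rough sketch under stated assumptions, not the project's actual plotting code: the function name `plot_numeric_histograms`, the `max_cols` parameter, and the demo columns (`age`, `wbc_count`, `sex`) are all hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_numeric_histograms(frame: pd.DataFrame, max_cols: int = 6):
    """Draw one histogram per numeric column (up to max_cols) and return the figure."""
    numeric_cols = frame.select_dtypes(include="number").columns[:max_cols]
    if len(numeric_cols) == 0:
        raise ValueError("frame has no numeric columns to plot")
    fig, axes = plt.subplots(
        1, len(numeric_cols), figsize=(4 * len(numeric_cols), 3), squeeze=False
    )
    for ax, col in zip(axes[0], numeric_cols):
        # Drop missing values so they do not interfere with binning.
        ax.hist(frame[col].dropna(), bins=20)
        ax.set_title(col)
    fig.tight_layout()
    return fig


# Hypothetical demo data; in the notebook, pass the loaded `df` instead.
rng = np.random.default_rng(0)
plot_demo = pd.DataFrame({
    "age": rng.normal(60, 12, size=100),
    "wbc_count": rng.lognormal(2, 1, size=100),
    "sex": rng.choice(["F", "M"], size=100),
})
fig = plot_numeric_histograms(plot_demo)
```

In a notebook the returned figure is displayed automatically; in a script, `plt.show()` or `fig.savefig(...)` would be needed afterwards.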
