# Tempus AI Coding Assessment

## Bioinformatics Scientist, Pipeline Development - Katelyn Bechler

### 1. Design a Scalable Twitter Clone in the Cloud

1. Define MVP core functionality and features that must be supported - what are the core components that should exist in order for a "Twitter" to function in the cloud?
    - Posts: write posts, add tags (mentions), add hashtags, like posts, unlike posts, repost, comment on posts
    - Community: follow, unfollow, view followers list, view following list, private messaging, search for users
    - Technical: support for multi-media posts, flagging unsafe posts, ability to get live data/feeds, algorithmic newsfeed, notifications, search posts, username/password login
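The post and community features above can be sketched as a toy in-memory model. This is only an illustration under stated assumptions: the `Post` and `MiniTwitter` names and all method names are hypothetical, and a real deployment would back this with a database rather than Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    post_id: int
    author: str
    text: str
    liked_by: set = field(default_factory=set)


class MiniTwitter:
    """Hypothetical toy store for posts and follow relationships."""

    def __init__(self):
        self.posts = {}       # post_id -> Post
        self.following = {}   # username -> set of usernames they follow

    def write_post(self, author, text):
        post = Post(post_id=len(self.posts) + 1, author=author, text=text)
        self.posts[post.post_id] = post
        return post

    def follow(self, user, target):
        self.following.setdefault(user, set()).add(target)

    def unfollow(self, user, target):
        self.following.get(user, set()).discard(target)

    def like(self, user, post_id):
        self.posts[post_id].liked_by.add(user)

    def unlike(self, user, post_id):
        self.posts[post_id].liked_by.discard(user)

    def feed(self, user):
        # Newest-first posts from the accounts this user follows.
        followed = self.following.get(user, set())
        newest_first = sorted(self.posts.values(), key=lambda p: -p.post_id)
        return [p for p in newest_first if p.author in followed]


app = MiniTwitter()
app.follow("alice", "bob")
first = app.write_post("bob", "hello #world")
app.like("alice", first.post_id)
print([p.text for p in app.feed("alice")])
```

Keeping likes and follows as sets makes like/unlike and follow/unfollow idempotent, which is one reasonable choice for an MVP.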

2. Required cloud resources: I would build and deploy this MVP on AWS, though Microsoft Azure or GCP would also work. I would run the backend on Amazon EC2 instances so that the number of instances can scale with traffic and user count. For storage, I would keep user data in a relational database such as PostgreSQL on Amazon RDS; Amazon Redshift is a data warehouse and is better suited to analytics than to serving user-facing reads and writes.

3. Scaling strategy: To help with scaling, I would use Elastic Load Balancing, specifically an Application Load Balancer, to distribute user requests as evenly as possible across the EC2 instances. For storage, I would use Amazon S3 object storage for large media such as images and videos that are associated with posts.
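As a rough illustration of what "distributing requests evenly" means, here is a minimal round-robin sketch. It is a toy, not how a managed load balancer is implemented; the `make_round_robin` helper and the instance names are hypothetical.

```python
from collections import Counter
from itertools import cycle


def make_round_robin(instances):
    """Return a function that hands out instances in rotating order."""
    rotation = cycle(instances)
    return lambda: next(rotation)


# Hypothetical instance names; a managed load balancer tracks its own targets.
pick_instance = make_round_robin(["ec2-a", "ec2-b", "ec2-c"])
assignments = [pick_instance() for _ in range(9)]
load = Counter(assignments)
print(load)
```

Round-robin ignores how busy each instance is, so a production setup would also rely on health checks and load-aware routing.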

4. Additional features and considerations: Additional features could personalize a user's feed, provide follow and post recommendations, and highlight trending hashtags. These and most other personalized features would require machine learning or AI methods, which would need significant additional compute power since most of this work would be deep learning. Security is another consideration: as the world becomes more technology savvy and cyber attacks become more sophisticated, it is crucial to ensure proper and robust security for users and their data. Lastly, as the app is developed, it may be useful to consider integrations with other data platforms to create a more holistic user experience. These integrations could require a lot of scrutiny from both a compute and a security perspective.

5. Key questions to keep in mind while scaling and developing:
    - How does the system handle peak user times, e.g. elections or breaking news?
    - What information needs to be real-time vs. information that can have time delays?
    - How does the system handle monitoring, logging, failures, handling errors?
    - What does user privacy look like?
    - Are there user permissions, authentications, or authorization for certain data?
    - How to handle authentications or authorizations?
    - How does scaling impact costs?
    - How does scaling impact speed?
    - How do we monitor user behavior?
    - How do we maintain a safe and secure platform?
    - How do we ensure content is appropriate and safe?
    - How can we encourage integration with other data platforms?
    - How do we handle user history?

### 2. Write a Dockerized pipeline in Nextflow

Requirements: Take a stock data set, slice it by a relevant variable, fit a simple regression model on each slice, and then combine these models in a single table.
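The slice, fit, and combine steps in this requirement can be sketched in a few lines. This is a minimal illustration on synthetic stand-in data (not the dataset used below), and it swaps in `numpy.polyfit` for the line fit; the `fit_line` helper and the column names `group`, `x`, and `y` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: y is an exact linear function of x within each group.
df = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4,
    "x": [1.0, 2.0, 3.0, 4.0] * 2,
})
df["y"] = np.where(df["group"] == "a", 2.0 + 3.0 * df["x"], 10.0 - 1.0 * df["x"])


def fit_line(slice_df):
    """Least-squares fit of y on x for one slice; returns one summary row."""
    slope, intercept = np.polyfit(slice_df["x"], slice_df["y"], deg=1)
    return {"intercept": intercept, "coefficient": slope}


# Slice by group, fit each slice, and combine the fits into a single table.
rows = [{"group": name, **fit_line(slice_df)} for name, slice_df in df.groupby("group")]
combined = pd.DataFrame(rows)
print(combined)
```

In a pipeline, each slice would typically become its own task, with a final task collecting the per-slice rows.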

The notebook below is an exploratory data analysis (EDA) of the dataset and task in Python, done before integrating with Nextflow. See 'Problem2' on git.

In [14]:
# Load required libraries.
import os

import pandas as pd
from palmerpenguins import load_penguins
from sklearn.linear_model import LinearRegression
import joblib

#### 1. Load and slice dataset

Palmer Penguins dataset: https://pypi.org/project/palmerpenguins/

In [2]:
# Load and slice dataset by a relevant variable.
penguins = load_penguins()

In [3]:
penguins.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |


In [4]:
penguins.isna().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64

In [5]:
# Slice dataset by species.
penguins.species.value_counts()

species
Adelie       152
Gentoo       124
Chinstrap     68
Name: count, dtype: int64

In [8]:
# Create a function to run a regression model on each slice of data.
def fit_regression(sliced_data, species):
    # Keep the feature and target columns and drop rows with missing values,
    # since LinearRegression cannot handle NAs.
    sliced_data_na_rm = sliced_data[['bill_length_mm', 'flipper_length_mm']].dropna()

    # Define features and target.
    X = sliced_data_na_rm[['bill_length_mm']]
    y = sliced_data_na_rm['flipper_length_mm']

    # Fit a regression model.
    lr_model = LinearRegression()
    lr_model.fit(X, y)

    # Save the model as a pickle file named after the species.
    joblib.dump(lr_model, f"model_{species}.pkl")

In [9]:
# Slice the dataset by species, save each slice, and fit a model on it.
for species in penguins.species.unique():
    sliced_data = penguins[penguins['species'] == species]
    sliced_data.to_csv(f"sliced_{species}.csv", index=False)
    fit_regression(sliced_data, species)

In [89]:
# Load sliced data.
#sliced_data = pd.read_csv("${slice}")

In [10]:
# Combine all models.
models_combined = []

In [11]:
# Load each model and summarize its coefficients.
for model_name in os.listdir("."):
    if model_name.startswith("model_") and model_name.endswith(".pkl"):
        model = joblib.load(model_name)
        coefficients = model.coef_[0]
        intercept = model.intercept_
        species = model_name.replace("model_", "").replace(".pkl", "")
        models_combined.append({"species": species, "intercept": intercept, "coefficient": coefficients})

In [12]:
# Convert the combined models into a DataFrame and save it.
models_combined_df = pd.DataFrame(models_combined)
models_combined_df.to_csv("models_combined.csv", index = False)

In [13]:
models_combined_df.head()

| | species | intercept | coefficient |
|---|---|---|---|
| 0 | Chinstrap | 146.63584 | 1.007246 |
| 1 | Adelie | 158.924442 | 0.799899 |
| 2 | Gentoo | 151.096041 | 1.391246 |
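
To show how a row of the combined table is read, here is a small worked example that evaluates the fitted line for Adelie penguins. The `predict_flipper_length` helper is hypothetical, and the 40 mm bill length is an illustrative input rather than a value from the analysis.

```python
# Intercept and slope copied from the Adelie row of the combined table.
adelie_intercept = 158.924442
adelie_coefficient = 0.799899


def predict_flipper_length(bill_length_mm, intercept, coefficient):
    """Evaluate the fitted line: flipper_length = intercept + coefficient * bill_length."""
    return intercept + coefficient * bill_length_mm


# 40 mm is an illustrative bill length, not a value taken from the analysis.
predicted = predict_flipper_length(40.0, adelie_intercept, adelie_coefficient)
print(round(predicted, 2))
```

Under this fit, each additional millimetre of bill length corresponds to roughly 0.8 mm more flipper length for Adelie penguins.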


### 3. Describe a Python package that you use regularly.

Though this might be a big one to tackle, I'm going to talk about the *pandas* package - there is a lot to uncover here, but I want to highlight the main features, limitations, and areas for growth for this package.

**Important classes/methods/functions**
- Classes: DataFrame, Series, Index
- Methods: .apply(), .map(), .dropna(), .sort_values(), .query()
- Functions: working with CSV files (read/write), bringing data together (merge, concat), simple analysis (pivot_table, groupby)
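
A short sketch of several of the methods listed above working together. The two small tables are made up for illustration only.

```python
import pandas as pd

# Small made-up tables for illustration only.
samples = pd.DataFrame({
    "sample_id": [1, 2, 3, 4],
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "mass_g": [3750.0, None, 5000.0, 5400.0],
})
sites = pd.DataFrame({"species": ["Adelie", "Gentoo"], "island": ["Torgersen", "Biscoe"]})

# Drop rows with missing values.
clean = samples.dropna()
# Element-wise transform with .apply().
clean = clean.assign(mass_kg=clean["mass_g"].apply(lambda g: g / 1000))
# Bring the two tables together on a shared key.
merged = clean.merge(sites, on="species")
# Filter with .query() and order with .sort_values().
heavy = merged.query("mass_kg > 4").sort_values("mass_kg", ascending=False)
# Simple per-group analysis.
mean_mass = merged.groupby("species")["mass_kg"].mean()

print(heavy[["sample_id", "island", "mass_kg"]])
print(mean_mass)
```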

Pandas is also extremely useful because it integrates well with other packages (NumPy, Matplotlib, Seaborn, scikit-learn, Dask, PyTorch, TensorFlow, BeautifulSoup). It is built on top of NumPy and works especially well with the plotting packages Matplotlib and Seaborn. At a high level, pandas is exceptional at manipulating datasets in clean, interpretable ways and carrying the results forward into later tasks.

One limitation of the pandas package is that it is not as efficient for large datasets because it mostly processes data in memory and on a single node. This could be improved by distributing tasks across CPUs or machines using frameworks like Dask or Apache Spark, or by running operations in parallel over partitions of the data; built-in methods like .apply() run serially on their own.
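
One way to ease the in-memory limitation without leaving pandas is to stream a file in chunks and aggregate as you go. The sketch below uses the `chunksize` parameter of `pd.read_csv`; the in-memory CSV and its column names are made up to stand in for a file too large to load at once.

```python
import io

import pandas as pd

# In-memory CSV standing in for a file too large to load at once.
csv_text = "species,body_mass_g\n" + "\n".join(
    f"{'Adelie' if i % 2 == 0 else 'Gentoo'},{3000 + i}" for i in range(10)
)

totals = {}
counts = {}
# chunksize makes read_csv yield DataFrames of at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    grouped = chunk.groupby("species")["body_mass_g"].agg(["sum", "count"])
    for species, row in grouped.iterrows():
        totals[species] = totals.get(species, 0) + row["sum"]
        counts[species] = counts.get(species, 0) + row["count"]

# Combine the running sums and counts into per-species means.
mean_mass = {species: totals[species] / counts[species] for species in totals}
print(mean_mass)
```

Only one chunk is held in memory at a time, at the cost of writing the aggregation so that partial results can be combined.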

Another limitation that has come up a lot for me personally is how pandas handles NA data. The pandas syntax for addressing missing values can be fairly cumbersome, and for some cases it does not exist at all (a limitation of pandas in general), especially when we want to use downstream methods. For example, LinearRegression from sklearn cannot take NA values, so it would be awesome to have specialized functions in pandas that produce cleaner datasets by resolving missing values through various methods, depending on user needs. If I were to propose working on this effort, I would create an advanced function for handling missing values based on custom parameters. A user could choose to address NAs by column mean imputation, column mode imputation, column median imputation, KNN imputation, logistic regression imputation, removing all rows with any NA values, or encoding missing values as a feature.
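
A minimal sketch of what such a function might look like, covering only the simpler strategies. The name `impute_missing` and its parameters are hypothetical (this is not a pandas API), the mean and median strategies assume numeric columns, and the KNN and regression-based options are left out.

```python
import pandas as pd


def impute_missing(df, strategy="mean", columns=None):
    """Hypothetical helper (not a pandas API): handle NAs with one chosen strategy."""
    out = df.copy()
    columns = list(columns) if columns is not None else list(out.columns)
    if strategy == "drop":
        # Remove every row that has an NA in any of the selected columns.
        return out.dropna(subset=columns)
    for col in columns:
        if strategy == "indicator":
            # Encode missingness as its own feature; the original column is left as-is.
            out[f"{col}_missing"] = out[col].isna()
        elif strategy == "mean":
            out[col] = out[col].fillna(out[col].mean())
        elif strategy == "median":
            out[col] = out[col].fillna(out[col].median())
        elif strategy == "mode":
            out[col] = out[col].fillna(out[col].mode().iloc[0])
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return out


df = pd.DataFrame({
    "bill_length_mm": [39.1, None, 40.3],
    "flipper_length_mm": [181.0, 186.0, None],
})
filled = impute_missing(df, strategy="mean")
print(filled)
```

Returning a copy rather than modifying the input keeps the original data available for comparing strategies side by side.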


### 4. Part A: SQL/GBQ Challenge 1: Analyzing Genomic Data for Disease Association

See 'Problem4' on git.

### 4. Part B: SQL/GBQ Challenge 2: Analyzing Genomic Data for Gene Expression

See 'Problem4' on git.