## Detect negative controls

In [9]:
import pandas as pd

In [10]:
profiles = pd.read_parquet("../1.load/output/raw_filtered_profiles.parquet")
profiles.head()

| | Metadata_JCP2022 | Metadata_broad_sample | Metadata_Name | Metadata_Vector | Metadata_Transcript | Metadata_Symbol | Metadata_NCBI_Gene_ID | Metadata_Taxon_ID | Metadata_Gene_Description | Metadata_Prot_Match | ... | Nuclei_Texture_Variance_RNA_10_03_256 | Nuclei_Texture_Variance_RNA_3_00_256 | Nuclei_Texture_Variance_RNA_3_01_256 | Nuclei_Texture_Variance_RNA_3_02_256 | Nuclei_Texture_Variance_RNA_3_03_256 | Nuclei_Texture_Variance_RNA_5_00_256 | Nuclei_Texture_Variance_RNA_5_01_256 | Nuclei_Texture_Variance_RNA_5_02_256 | Nuclei_Texture_Variance_RNA_5_03_256 | Metadata_Batch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | JCP2022_900002 | ccsbBroad304_00001 | ORF008415.1_TRC304.1 | pLX_304 | NM_001160173.3 | NAT1 | 9 | 9606 | N-acetyltransferase 1 | 100.0 | ... | 82.875999 | 76.996002 | 77.473999 | 76.582001 | 77.233002 | 78.186996 | 80.055 | 77.632004 | 79.955002 | 2021_06_21_Batch7 |
| 1 | JCP2022_900011 | ccsbBroad304_00013 | ORF009063.1_TRC304.1 | pLX_304 | NM_001612.6 | ACRV1 | 56 | 9606 | acrosomal vesicle protein 1 | 100.0 | ... | 93.607002 | 88.196999 | 89.211998 | 88.081001 | 89.154999 | 89.897003 | 92.719002 | 89.843002 | 92.597 | 2021_06_21_Batch7 |
| 2 | JCP2022_900033 | ccsbBroad304_00037 | ORF015627.1_TRC304.1 | pLX_304 | NM_001136.5 | AGER | 177 | 9606 | advanced glycosylation end-product specific re... | 100.0 | ... | 133.380005 | 126.150002 | 127.25 | 125.769997 | 127.25 | 128.429993 | 131.880005 | 127.940002 | 131.960007 | 2021_06_21_Batch7 |
| 3 | JCP2022_900063 | ccsbBroad304_00069 | ORF005433.1_TRC304.1 | pLX_304 | NM_001153.5 | ANXA4 | 307 | 9606 | annexin A4 | 100.0 | ... | 84.871002 | 80.910004 | 81.814003 | 80.850998 | 81.926003 | 82.567001 | 85.179001 | 82.646004 | 85.292999 | 2021_06_21_Batch7 |
| 4 | JCP2022_900084 | ccsbBroad304_00091 | ORF014376.1_TRC304.1 | pLX_304 | NM_001651.4 | AQP5 | 362 | 9606 | aquaporin 5 | 100.0 | ... | 91.669998 | 87.241997 | 87.132004 | 86.538002 | 87.476997 | 88.224998 | 90.223 | 87.663002 | 90.227997 | 2021_06_21_Batch7 |


In [11]:
profiles.shape

(79560, 4780)

In [12]:
# Create a new column `Metadata_SymbolX` that equals `Metadata_Symbol` only where the value is in the list `selected_negcons`; otherwise it is set to "other"

selected_negcons = ["BFP", "HcRed", "LUCIFERASE"]
profiles["Metadata_SymbolX"] = profiles.Metadata_Symbol
profiles.loc[~profiles.Metadata_Symbol.isin(selected_negcons), "Metadata_SymbolX"] = "other"

# Now report counts of Metadata_SymbolX

profiles.Metadata_SymbolX.value_counts()

other         76800
BFP             920
HcRed           920
LUCIFERASE      920
Name: Metadata_SymbolX, dtype: int64

In [13]:
# Keep only `Metadata_SymbolX` and columns that start with `Cells_` or `Nuclei_` or `Cytoplasm_` or `Image_`

prefixes = ["Cells_", "Nuclei_", "Cytoplasm_", "Image_"]
profiles = profiles[
    ["Metadata_SymbolX"]
    + [col for col in profiles.columns if any(col.startswith(prefix) for prefix in prefixes)]
]



I have a dataframe with a column `Metadata_SymbolX` and several feature columns.

`Metadata_SymbolX` contains the class label.

I want to create a classifier that uses the features to predict the class label.

Follow machine learning best practices and come up with a classifier that is robust to overfitting.

Then report the performance of the classifier on the test set.



In [14]:
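# Drop rows where Metadata_SymbolX == "other" so that only the three negative controls remain
# (the classification report below covers only these three classes, 552 test rows in total)

profiles = profiles[profiles.Metadata_SymbolX != "other"]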

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Splitting the dataset into features and target
features = profiles.drop('Metadata_SymbolX', axis=1)
target = profiles['Metadata_SymbolX']

# Splitting the dataset into training and testing sets
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)

# Creating the classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Training the classifier
clf.fit(features_train, target_train)

# Predicting the classes for the test set
target_pred = clf.predict(features_test)

# Reporting the performance
print(f"Accuracy: {accuracy_score(target_test, target_pred)}")
print(classification_report(target_test, target_pred))



Accuracy: 0.5778985507246377
              precision    recall  f1-score   support

         BFP       0.59      0.66      0.62       184
       HcRed       0.54      0.57      0.55       184
  LUCIFERASE       0.62      0.51      0.56       184

    accuracy                           0.58       552
   macro avg       0.58      0.58      0.58       552
weighted avg       0.58      0.58      0.58       552
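
The report above comes from a single 80/20 split, so the accuracy estimate may vary with the choice of split. One way to check how stable it is, and to compare it against the chance level for three balanced classes (1/3), is stratified k-fold cross-validation. The sketch below is illustrative only: it runs on synthetic data, not on the profiles, and the array names `X` and `labels`, the feature count, and the injected class shifts are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the filtered profiles (assumption: NOT the real data).
# Three balanced classes, 60 rows each, 20 random features.
class_names = ["BFP", "HcRed", "LUCIFERASE"]
rng = np.random.default_rng(42)
labels = np.repeat(class_names, 60)
X = rng.normal(size=(labels.size, 20))

# Shift the first three features per class so there is some signal to learn.
for i, name in enumerate(class_names):
    X[labels == name, :3] += 0.75 * i

# Stratified folds keep the class balance identical in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, labels, cv=cv, scoring="accuracy")

# Chance level for balanced classes is 1 / number of classes.
chance = 1 / len(class_names)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f} (chance = {chance:.3f})")
```

To apply the same idea to the notebook's data, `X` and `labels` would be replaced by `features` and `target`. The spread across folds gives a rough sense of how much the single-split accuracy could move.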

