**Structured Data Assignment**

**Problem Statement**

Problem 1 - The development of drugs is critical in providing therapeutic options
for patients suffering from chronic and terminal illnesses. “Target Drug”, in particular,
is designed to enhance the patient's health and well-being without causing
dependence on other medications that could potentially lead to severe and
life-threatening side effects. These drugs are specifically tailored to treat a particular
disease or condition, offering a more focused and effective approach to treatment,
while minimising the risk of harmful reactions.
The objective in this assignment is to develop a predictive model which will predict
whether a patient will be eligible*** for “Target Drug” or not in next 30 days. Knowing
if the patient is eligible or not will help physician treating the patient make informed
decision on the which treatments to give.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [7]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [8]:
# Step 1: Data Preprocessing
train_data = pd.read_parquet('/content/drive/MyDrive/train.parquet')
test_data = pd.read_parquet('/content/drive/MyDrive/test.parquet')

In [9]:
print(train_data.columns)


Index(['Patient-Uid', 'Date', 'Incident'], dtype='object')


In [10]:
print(train_data.shape)


(3220868, 3)


In [27]:


# Identify patients who have taken the "Target Drug" at least once
positive_set = train_data[train_data['Incident'] == 'TARGET DRUG']['Patient-Uid'].unique()

# Create a negative set by randomly selecting patients who have not taken the "Target Drug"
negative_set = train_data[~train_data['Patient-Uid'].isin(positive_set)]['Patient-Uid'].sample(n=len(positive_set), random_state=42)

# Combine positive and negative sets
dataset = pd.concat([train_data[train_data['Patient-Uid'].isin(positive_set)], train_data[train_data['Patient-Uid'].isin(negative_set)]])

# Step 2: Feature Engineering
# Example: Creating frequency-based features
patient_incident_count = dataset.groupby('Patient-Uid')['Incident'].count().reset_index()
patient_incident_count.columns = ['Patient-Uid', 'Incident_Count']

# Merge frequency-based features with the dataset
dataset = pd.merge(dataset, patient_incident_count, on='Patient-Uid', how='left')

# Step 3: Model Development
# Split the dataset into training and validation sets
train_size = int(0.8 * len(dataset))
train_set = dataset[:train_size]
val_set = dataset[train_size:]

#  machine learning model and train it
#  using Logistic Regression:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X_train = train_set[['Incident_Count']]
y_train = train_set['Incident'].apply(lambda x: 1 if x == 'TARGET DRUG' else 0)

X_val = val_set[['Incident_Count']]
y_val = val_set['Incident'].apply(lambda x: 1 if x == 'TARGET DRUG' else 0)

model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate the model on the validation set
val_predictions = model.predict(X_val)
print(classification_report(y_val, val_predictions))

# Step 4: Generate Predictions for the Test Set
# Compute frequency-based features for the test set
test_patient_incident_count = test_data.groupby('Patient-Uid')['Incident'].count().reset_index()
test_patient_incident_count.columns = ['Patient-Uid', 'Incident_Count']

# Merge frequency-based features with the test dataset
test_dataset = pd.merge(test_data, test_patient_incident_count, on='Patient-Uid', how='left')

# Generate predictions using the trained model
X_test = test_dataset[['Incident_Count']]
test_predictions = model.predict(X_test)

# Step 5: Create the Final Submission File
submission_df = pd.DataFrame({'Patient-Uid': test_dataset['Patient-Uid'], 'label': test_predictions})

# Replace 'fill 1 or 0' with the actual predicted labels (1 or 0)
submission_df['label'] = submission_df['label'].apply(lambda x: '1' if x == 1 else '1')

submission_df.to_csv('Final_submission.csv', index=False)

# Step 6: Evaluation
# Calculate the F1-score for your model's predictions
# compare the predictions with ground truth labels if available to compute the F1-score


              precision    recall  f1-score   support

           0       1.00      1.00      1.00    450980

    accuracy                           1.00    450980
   macro avg       1.00      1.00      1.00    450980
weighted avg       1.00      1.00      1.00    450980



In [12]:
import pandas as pd

submission_df = pd.read_csv('Final_submission.csv')
print(submission_df.head())  # Display the first few rows of the DataFrame


                            Patient-Uid  label
0  a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f      1
1  a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f      1
2  a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f      1
3  a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f      1
4  a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f      1


In [19]:
submission_df = submission_df.groupby('Patient-Uid')['label'].max().reset_index()


In [25]:
print(submission_df)

                                  Patient-Uid label
0        a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f    no
1        a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f    no
2        a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f    no
3        a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f    no
4        a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f    no
...                                       ...   ...
1065519  a10272c9-1c7c-11ec-b3ce-16262ee38c7f    no
1065520  a10272c9-1c7c-11ec-b3ce-16262ee38c7f    no
1065521  a10272c9-1c7c-11ec-b3ce-16262ee38c7f    no
1065522  a10272c9-1c7c-11ec-b3ce-16262ee38c7f    no
1065523  a10272c9-1c7c-11ec-b3ce-16262ee38c7f    no

[1065524 rows x 2 columns]


In [14]:
import pandas as pd

submission_df = pd.read_csv('Final_submission.csv')
print(submission_df.head())  # Display the first few rows of the DataFrame

                            Patient-Uid  label
0  a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f      1
1  a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f      1
2  a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f      1
3  a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f      1
4  a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f      1
