## <font color="lime">BA 305 - Team Exercise</font>

---

### Teaching Team Pre-Processing Data Before Distributing the Exercise

##### We split the original training dataset into two parts

- 80% for students to train and test their models
- 20% for us to evaluate their models

We also create a sample submission file based on the 20% split.


---

#### How the code works

Step 1: Load the Dataset
First, load the dataset from the uploaded train.csv file.

Step 2: Split the Dataset
Split the dataset into an 80% portion for students and a 20% portion for evaluation, using train_test_split from sklearn.model_selection with a fixed random_state so the split is reproducible.

Step 3: Save the 80% Split
Save the 80% portion as a CSV file for students. This file will be used by students to train and validate their models.

Step 4: Prepare the Evaluation Split
For the 20% evaluation split, prepare two files: one with the features (excluding the labels) for students to make predictions on, and another with the actual labels, which the teaching team keeps private for evaluation. Both files keep the original row index so that predictions can be matched to labels.

Step 5: Create a Sample Submission File
Create a sample submission file to guide students on how the submission should be formatted. This file uses the same row index as the 20% features file and contains a single label column filled with mock predictions.

---

#### Output Files:
- Student Training Data:
    - team_exercise_data.csv - The 80% split of the dataset, labels included, for students to train and test their models.

- Evaluation Features:
    - test.csv - The features from the 20% split, without labels, for students to make predictions on.

- Sample Submission:
    - sample_submission.csv - A template showing how students should format their predictions for submission.

- Evaluation Labels (private):
    - answer.csv - The actual labels for the 20% split, saved under evaluation_data_path and kept by the teaching team.
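
On the student side, following the template comes down to producing one prediction per evaluation row, keyed by the same row index as the features file. The sketch below illustrates that layout with a tiny in-memory frame. `make_submission` is a hypothetical helper, the three rows are a stand-in for the real features file, and the prediction values are placeholders rather than model output.

```python
import pandas as pd

# Hypothetical stand-in for the evaluation features file; the index holds
# the original row positions, as in the real file.
features = pd.DataFrame(
    {"ID": [2140, 310, 4779], "X10": [0, 0, 1]},
    index=[1073, 144, 2380],
)


def make_submission(features, predictions, label="y"):
    """Return a one-column frame of predictions keyed by the features' index."""
    predictions = list(predictions)
    if len(predictions) != len(features):
        raise ValueError("expected exactly one prediction per evaluation row")
    return pd.DataFrame({label: predictions}, index=features.index)


# Placeholder values; a real submission would put model output here.
submission = make_submission(features, [100.0, 101.5, 99.0])
```

Writing the result with `submission.to_csv(..., index=True)` should give the same two-column layout as the template: the row index followed by the label column.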

---

In [23]:
# TODO
# Change the Following Variables

# Where is the original data?
original_data_path = './original-data/train.csv'

# Where should the student-facing files (80% data, test features, sample submission) be saved?
student_data_path = './student-data/globaltrains'

# Where should the private evaluation labels be saved?
evaluation_data_path = './evaluation-data/globaltrains'

# What is the variable name for the label?
data_label = 'y'
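
`DataFrame.to_csv` does not create missing directories, so the folders named by `student_data_path` and `evaluation_data_path` need to exist before the split cell runs. A minimal sketch of one way to guarantee that is shown here. `ensure_dirs` is a hypothetical helper, and the demonstration uses a temporary directory so it has no effect on the real paths.

```python
import tempfile
from pathlib import Path


def ensure_dirs(*paths):
    """Create each directory (and any missing parents) if it is not already there."""
    for path in paths:
        Path(path).mkdir(parents=True, exist_ok=True)


# Demonstration in a throwaway location; in the notebook the arguments would be
# student_data_path and evaluation_data_path.
demo_root = Path(tempfile.mkdtemp())
demo_student = demo_root / "student-data" / "globaltrains"
demo_eval = demo_root / "evaluation-data" / "globaltrains"

ensure_dirs(demo_student, demo_eval)
ensure_dirs(demo_student)  # calling it again on an existing folder is harmless
```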


In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [25]:
# Load the dataset
df = pd.read_csv(original_data_path)

# Reindex the dataset
df.reset_index(inplace=True, drop=True)

# Split the dataset into training (80%) and evaluation (20%) sets, maintaining the index
train_df, eval_df = train_test_split(df, test_size=0.2, random_state=42)

# Save the 80% split for students without the index
# (index=False already omits the index, so no reset_index is needed)
train_df.to_csv(student_data_path + '/team_exercise_data.csv', index=False)

# Prepare the evaluation split: features for students and labels for evaluation
eval_features = eval_df.drop(columns=[data_label])
eval_labels = eval_df[[data_label]]

# Save the evaluation features for students, including the index
eval_features.to_csv(student_data_path + '/test.csv', index=True)

# Save the actual labels privately for your evaluation, including the index
eval_labels.to_csv(evaluation_data_path + '/answer.csv', index=True)

# Create a sample submission file, using the index from the evaluation features
sample_submission = eval_features.copy()
sample_submission[data_label] = 0  # Mock predictions, replace 0 with the model's predictions
sample_submission[[data_label]].to_csv(student_data_path + '/sample_submission.csv', index=True)
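
Before distributing the files, it can be worth confirming that the split leaks nothing: no row appears in both parts, the two parts together cover the original data, and the label column is absent from the student-facing features. The sketch below is one way to check this. `check_split` is a hypothetical helper, and the ten-row frame with a simple positional split is only a stand-in for `df`, `train_df`, `eval_df`, and `eval_features` from the cell above.

```python
import pandas as pd


def check_split(df, train_df, eval_df, eval_features, label):
    """Raise ValueError if the split overlaps, drops rows, or leaks the label."""
    if not train_df.index.intersection(eval_df.index).empty:
        raise ValueError("some rows appear in both the training and evaluation parts")
    if set(train_df.index) | set(eval_df.index) != set(df.index):
        raise ValueError("the two parts do not cover the original rows exactly")
    if label in eval_features.columns:
        raise ValueError("the label column is still present in the evaluation features")
    return True


# Stand-in data: ten rows, split 8/2 by position purely for illustration.
df = pd.DataFrame({
    "ID": range(10),
    "y": [float(i) for i in range(10)],
    "X0": list("abcdefghij"),
})
train_df = df.iloc[:8]
eval_df = df.iloc[8:]
eval_features = eval_df.drop(columns=["y"])

split_ok = check_split(df, train_df, eval_df, eval_features, "y")
```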


---

#### Check the data

In [26]:
student_data = pd.read_csv(student_data_path + '/team_exercise_data.csv')
student_data

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,2011,88.96,m,v,as,c,d,ag,k,x,...,0,0,1,0,0,0,0,0,0,0
1,3690,89.90,n,s,as,d,d,ae,g,s,...,1,0,0,0,0,0,0,0,0,0
2,7597,92.59,f,c,m,c,d,v,i,e,...,0,0,1,0,0,0,0,0,0,0
3,322,108.84,j,aa,g,d,d,i,i,e,...,0,1,0,0,0,0,0,0,0,0
4,3103,111.15,ay,i,as,c,d,ad,l,k,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3362,6879,109.42,j,i,as,c,d,r,l,r,...,0,0,1,0,0,0,0,0,0,0
3363,898,78.25,az,y,e,c,d,d,j,j,...,0,0,0,0,0,0,1,0,0,0
3364,6214,92.18,y,w,ae,c,d,q,i,c,...,1,0,0,0,0,0,0,0,0,0
3365,7558,91.92,y,r,ak,f,d,v,i,b,...,0,0,0,0,0,0,0,0,0,0


In [27]:
student_test = pd.read_csv(student_data_path + '/test.csv', index_col=0)
student_test

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
1073,2140,al,o,ai,f,d,ag,j,l,0,...,0,0,0,0,0,0,0,0,0,0
144,310,f,l,ae,f,d,i,i,w,0,...,0,0,0,0,0,0,0,0,0,0
2380,4779,j,aa,ay,c,d,n,l,o,1,...,1,0,0,0,0,0,0,0,0,0
184,385,az,y,b,c,d,i,j,l,0,...,0,0,0,0,0,0,1,0,0,0
2587,5180,ak,v,ak,d,d,m,i,r,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
657,1280,x,b,m,c,d,c,d,n,0,...,0,0,1,0,0,0,0,0,0,0
3975,7972,t,b,m,c,d,w,j,w,0,...,0,0,1,0,0,0,0,0,0,0
907,1810,z,aa,ay,c,d,ag,h,s,1,...,1,0,0,0,0,0,0,0,0,0
3597,7206,f,v,ae,c,d,r,g,m,0,...,0,0,1,0,0,0,0,0,0,0


In [28]:
sample = pd.read_csv(student_data_path + '/sample_submission.csv', index_col=0)
sample

Unnamed: 0,y
1073,0
144,0
2380,0
184,0
2587,0
...,...
657,0
3975,0
907,0
3597,0


In [29]:
answer = pd.read_csv(evaluation_data_path + '/answer.csv', index_col=0)
answer

Unnamed: 0,y
1073,97.94
144,96.41
2380,105.83
184,79.09
2587,108.69
...,...
657,113.68
3975,88.85
907,89.60
3597,89.23
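
With answer.csv and a student's submission both keyed by the original row index, scoring reduces to aligning the two on that index and computing an error metric. The sketch below shows one possible approach. The notebook does not say which metric the course uses, so RMSE and R² are assumptions here, and `score_submission` is a hypothetical helper. The three answer rows are copied from the output above for illustration.

```python
import pandas as pd


def score_submission(answer, submission, label="y"):
    """Align a submission to the answer key by index and return RMSE and R^2."""
    if not submission.index.is_unique:
        raise ValueError("the submission contains duplicate row indices")
    missing = answer.index.difference(submission.index)
    if len(missing) > 0:
        raise ValueError(f"the submission is missing {len(missing)} evaluation rows")

    # Reorder the predictions to match the answer key, ignoring extra rows.
    predicted = submission.loc[answer.index, label]
    errors = answer[label] - predicted

    rmse = float((errors ** 2).mean() ** 0.5)
    ss_res = float((errors ** 2).sum())
    ss_tot = float(((answer[label] - answer[label].mean()) ** 2).sum())
    return {"rmse": rmse, "r2": 1.0 - ss_res / ss_tot}


answer_demo = pd.DataFrame({"y": [97.94, 96.41, 105.83]}, index=[1073, 144, 2380])

# A submission with the right values but rows in a different order.
shuffled_demo = answer_demo.loc[[2380, 1073, 144]]
scores = score_submission(answer_demo, shuffled_demo)
```

Aligning with `.loc[answer.index]` rather than comparing rows by position means a submission is scored correctly even if the student sorted or shuffled the rows.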
