## Assignment

The task is to build and train a classifier given a labeled dataset and then use it to infer the labels of a given unlabeled evaluation dataset. 

You will find the training and evaluation data on Canvas.

Here's the training data: TrainOnMe-2.csv 

Here's the evaluation data: EvaluateOnMe-2.csv 

Here's the ground truth: EvaluationGT-2.csv

You can use whatever Python libraries you like! The steps below are suggestions, but feel free to try any other techniques we discussed in class.

You can submit the predicted labels by uploading them in CSV format; they will then be compared to the ground truth.
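As a rough illustration of what "compared to the ground truth" could mean, the sketch below computes plain accuracy with `accuracy_score`. The two label lists are made up for the example, and the actual grading metric and the layout of the ground-truth file are not stated in this notebook, so treat this as an assumption.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels, invented for illustration only
y_true_demo = ['Bob', 'Jorg', 'Shoogee', 'Atsuto', 'Bob']
y_pred_demo = ['Bob', 'Jorg', 'Atsuto', 'Atsuto', 'Jorg']

# Fraction of positions where the prediction equals the true label
accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(f"accuracy: {accuracy:.2f}")
```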


In [4]:
# Import packages 
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
# import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# For feature selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# For min-max scaling
from sklearn.preprocessing import MinMaxScaler

# For encoding
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Some models you can try
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier


## Load the training and evaluation datasets

In [6]:
# Read datasets
df = pd.read_csv('TrainOnMe-2.csv') 
eval_df = pd.read_csv('EvaluateOnMe-2.csv')

# Split your training dataset into features and labels - Audrey Updated
X = df.drop(columns=['y'])  # Features (all columns except Labels which is 'y' column)
y = df['y']  # Labels


## Data pre-processing

#### Remove NA values and noise - edited

In [21]:
# Make a copy of the training set - edited
train_copy = df.copy()
feature_copy = X.copy()

# Remove the rows whose y value is not one of the four names
valid_y_outputs = ['Bob', 'Jorg', 'Shoogee', 'Atsuto']
invalid_rows = df[~df['y'].isin(valid_y_outputs)]
print("Rows with invalid outputs:")
print(invalid_rows)
# Store the index labels of the invalid rows (not the values of the first column)
removed_row_numbers = invalid_rows.index.tolist()
print("Row numbers removed due to invalid outputs:", removed_row_numbers)
# Remove the invalid rows from the DataFrame
df_cleaned = df[df['y'].isin(valid_y_outputs)]

# Remove the rows with missing values
missing_rows = df[df.isnull().any(axis=1)]
dropped_indices = missing_rows.index.tolist()
df_cleaned = df_cleaned.dropna()  # Drop from df_cleaned so the label filter above is kept
# Combine the row numbers removed for missing values and for invalid y outputs
combined_removed_rows = sorted(set(removed_row_numbers + dropped_indices))

Rows with invalid outputs:
                                    Unnamed: 0               y       x1  \
134               https://youtu.be/74TDvzZcHPw             NaN      NaN   
307                                        306             NaN  1.12582   
408                                        407             NaN  0.95125   
581  Det är då som det stora vemodet rullar in             NaN      NaN   
582            Och från havet blåser en isande   gråkall vind.      NaN   
633                                        630             NaN  0.96981   
739                                        736             NaN  0.61820   
803          Den här är felt. Darfor erase it.             NaN      NaN   

          x2       x3       x4        x5        x6                  x7  \
134      NaN      NaN      NaN       NaN       NaN                 NaN   
307  0.37246 -3.91978 -8.69740   9.94560   0.38929     Jerry Fernström   
408 -0.14534  2.69299 -9.22679  10.24097  -0.15571  Erik Sven Williams   
5

#### Check the dtypes of all features - edited 

In [None]:
# Check the dtypes of all features - edited 
print(df.dtypes)
df.info()

#### Convert text columns to category - edited 

In [None]:
# Convert the x7 column with one-hot encoding
x7_cat = df[['x7']]
# sparse_output replaced the sparse argument in scikit-learn 1.2; use sparse=False on older versions
x7_cat_encoder = OneHotEncoder(sparse_output=False, drop='first')
x7_encoded = x7_cat_encoder.fit_transform(x7_cat)
x7_encoded_df = pd.DataFrame(x7_encoded, columns=x7_cat_encoder.get_feature_names_out(['x7']))  # Convert the result to a DataFrame
x7_encoded_df.index = df.index  # Align the index of the new DataFrame with the original DataFrame
df = df.drop('x7', axis=1)  # Drop the original 'x7' column from the DataFrame
df = pd.concat([df, x7_encoded_df], axis=1)  # Concatenate the new one-hot encoded columns with the original DataFrame

# Convert the x12 column
df['x12'] = df['x12'].astype(int)  # Convert the True/False column to 1s and 0s (fails if x12 still contains missing values)

# Change categories to encoded labels using LabelEncoder()
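The `LabelEncoder` step mentioned above could look roughly like the sketch below. It runs on a small made-up Series standing in for `df['y']`; the names `y_demo`, `label_encoder`, `y_encoded`, and `y_decoded` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for df['y'] (hypothetical values)
y_demo = pd.Series(['Bob', 'Jorg', 'Shoogee', 'Atsuto', 'Bob'])

label_encoder = LabelEncoder()
# Each distinct label gets an integer code; classes_ holds the labels in sorted order
y_encoded = label_encoder.fit_transform(y_demo)

# inverse_transform maps integer predictions back to the original names
y_decoded = label_encoder.inverse_transform(y_encoded)

print(list(label_encoder.classes_))
print(y_encoded)
```

Keeping the fitted encoder around matters: the submission needs names rather than integer codes, so predictions have to go through `inverse_transform` before saving.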

## Dealing with outliers

In [2]:
# Try to remove outliers from training data to improve performance
# There are different ways to do this but one way could be to use stats.zscore
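One possible way to apply `stats.zscore` is sketched below on synthetic data. The threshold of 3 standard deviations is a common rule of thumb rather than something this assignment prescribes, and the frame `demo` with its planted outlier is invented for the example.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic numeric features (hypothetical stand-in for the real training columns)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
demo.loc[0, 'x1'] = 50.0  # Plant one obvious outlier in the first row

# Absolute z-score of every value, computed column by column
z = np.abs(stats.zscore(demo[['x1', 'x2']].to_numpy()))

# Keep only the rows where every feature is within 3 standard deviations
keep = (z < 3).all(axis=1)
demo_no_outliers = demo[keep]

print(len(demo), "->", len(demo_no_outliers))
```

Z-scores are only defined for numeric columns, so text and boolean features would need to be excluded or encoded first. The same row mask must also be applied to the labels.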


## Scaling the features

In [3]:
# Scale your features
# You can try both StandardScaler and MinMaxScaler and see which works better
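A minimal sketch of trying both scalers follows, assuming tiny made-up arrays in place of the real features. Each scaler is fitted on the training rows only and then reused on the held-out rows, which is one common way to avoid leaking test statistics into training.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical train/test features with very different column ranges
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_demo = np.array([[2.0, 400.0]])

scaled_train = {}
scaled_test = {}
for scaler in (StandardScaler(), MinMaxScaler()):
    name = type(scaler).__name__
    # Learn the scaling parameters from the training rows only
    scaled_train[name] = scaler.fit_transform(X_train_demo)
    # Apply the same parameters to the test rows
    scaled_test[name] = scaler.transform(X_test_demo)
    print(name, scaled_test[name])
```

Test values can fall outside the training range after scaling, as the second column does here. That is expected behavior, not an error.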



## Feature selection

In [None]:
# You could try applying the SelectKBest class to extract the most useful features (this is optional but MIGHT improve accuracy)
# Remove whichever features are not useful
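The sketch below shows one way `SelectKBest` with `chi2` might be used. Because `chi2` requires non-negative inputs, the features are min-max scaled first. The data comes from `make_classification`, and `k=3` is an arbitrary choice for illustration, not a recommendation for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic 4-class problem standing in for the real training data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# chi2 only accepts non-negative values, so rescale every feature to [0, 1]
X_nonneg = MinMaxScaler().fit_transform(X_demo)

# Score each feature against the labels and keep the k highest-scoring ones
selector = SelectKBest(score_func=chi2, k=3)
X_selected = selector.fit_transform(X_nonneg, y_demo)

# Column positions of the features that were kept
selected_idx = selector.get_support(indices=True)
print(selected_idx, X_selected.shape)
```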



## Split your data to train and test set

In [581]:
# Train, test, split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, random_state = 0)


## Fit the model

* You can try models other than the models listed below
* You can try different hyperparameters
* Evaluate your model using cross-validation
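The points above about hyperparameters and cross-validation can be combined with `GridSearchCV`, as in the hedged sketch below. The data is synthetic, and the parameter grid is an arbitrary example rather than a tuned choice for this assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 4-class data standing in for the pre-processed training set
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Putting the scaler in the pipeline refits it inside every cross-validation fold
pipeline = make_pipeline(StandardScaler(), SVC())

# make_pipeline names the SVC step 'svc', hence the 'svc__' parameter prefix
param_grid = {'svc__C': [0.1, 1, 10], 'svc__kernel': ['linear', 'rbf']}

# 5-fold cross-validation for every combination in the grid
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

After fitting, `search.best_estimator_` is the pipeline refitted on all the data passed to `fit`, so it can be used directly for prediction.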

In [None]:
# Try linear SVM classifier
linear = SVC(kernel='linear', C=0.5).fit(X_train, y_train)
print(linear.score(X_test, y_test))

# Evaluate using cross-validation on the training split (the held-out test split is too small for reliable folds)
scores = cross_val_score(linear, X_train, y_train, cv=5)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

In [None]:
# Try decision tree classifier
decision_tree = DecisionTreeClassifier(criterion="gini").fit(X_train, y_train)
print(decision_tree.score(X_test, y_test))

# Evaluate using cross-validation on the training split
scores = cross_val_score(decision_tree, X_train, y_train, cv=10)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

In [None]:
# Try random forest classifier
random_forest = RandomForestClassifier().fit(X_train, y_train)
print(random_forest.score(X_test, y_test))

# Evaluate using cross-validation on the training split
scores = cross_val_score(random_forest, X_train, y_train, cv=10)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

In [None]:
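# Use your best model to predict the labels for the evaluation set
# best_model and X_eval are placeholders: assign your chosen fitted model to best_model and
# build X_eval from eval_df with the same pre-processing steps applied to the training features

y_pred = best_model.predict(X_eval)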

print(y_pred)


In [599]:
# Save your predictions to a CSV file and upload it to Canvas

pd.DataFrame(y_pred).to_csv("file.txt",index = False,header=False)