## Assignment

The task is to build and train a classifier given a labeled dataset and then use it to infer the labels of a given unlabeled evaluation dataset. 

You will find the training and evaluation data on Canvas.

Here's the training data: TrainOnMe-2.csv 

Here's the evaluation data: EvaluateOnMe-2.csv 

Here's the ground truth: EvaluationGT-2.csv

You can use whatever Python libraries you like! The steps below are suggestions, but feel free to try any other techniques we discussed in class.

You can submit the predicted labels by uploading them in CSV format; they will then be compared to the ground truth.


In [70]:
# Import packages 
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# For feature selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# For scaling (min-max and standardization)
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# For encoding
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Some models you can try
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier


## Load the training and evaluation datasets

In [71]:
# Read datasets
df = pd.read_csv('TrainOnMe-2.csv') 
eval_df = pd.read_csv('EvaluateOnMe-2.csv')

# Split your training dataset into features and labels
X = df.drop(columns=['y'])  # Features (all columns except the label column 'y')
y = df['y']  # Labels


## Data pre-processing

#### Remove NA values and noise

In [72]:
# Make a copy of the training set - edited
train_copy = df.copy()
feature_copy = X.copy()

# 1: Removing the rows with y values that are not one of the four names
# Force remove row 134
df = df.drop(134) 
valid_y_outputs = ['Bob', 'Jorg', 'Shoogee', 'Atsuto']
invalid_rows = df[~df['y'].isin(valid_y_outputs)]
print("Rows with invalid outputs:")
print(invalid_rows)

# Store row indices of invalid rows in a list
removed_row_numbers = invalid_rows.index.tolist()
print("Row indices removed due to invalid outputs:", removed_row_numbers)

# Remove the invalid rows from the DataFrame
# (take a copy so the column assignment below does not write to a view of df)
df_y = df[df['y'].isin(valid_y_outputs)].copy()


# 2: Remove rows that do not have a valid row number in the first column
first_column_df = df_y.columns[0]

# Convert to numeric, forcing errors to NaN, and then drop rows with NaN
df_y.loc[:, first_column_df] = pd.to_numeric(df_y[first_column_df], errors='coerce')

# Identify rows with non-numeric values in the first column
non_numeric_rows = df_y[df_y[first_column_df].isna()]
non_numeric_indices = non_numeric_rows.index.tolist()

# Remove rows with non-numeric values
df_0 = df_y.dropna(subset=[first_column_df])


# 3: Removing the rows with missing values
missing_rows = df_0[df_0.isnull().any(axis=1)]
dropped_indices = missing_rows.index.tolist()
df_cleaned = df_0.dropna()

# Combine the list of removed indices
combined_removed_rows = list(set(removed_row_numbers + dropped_indices + non_numeric_indices))
print("Combined row indices removed:", combined_removed_rows)


Rows with invalid outputs:
                                    Unnamed: 0               y       x1  \
307                                        306             NaN  1.12582   
408                                        407             NaN  0.95125   
581  Det är då som det stora vemodet rullar in             NaN      NaN   
582            Och från havet blåser en isande   gråkall vind.      NaN   
633                                        630             NaN  0.96981   
739                                        736             NaN  0.61820   
803          Den här är felt. Darfor erase it.             NaN      NaN   

          x2       x3       x4        x5        x6                  x7  \
307  0.37246 -3.91978 -8.69740   9.94560   0.38929     Jerry Fernström   
408 -0.14534  2.69299 -9.22679  10.24097  -0.15571  Erik Sven Williams   
581      NaN      NaN      NaN       NaN       NaN                 NaN   
582      NaN      NaN      NaN       NaN       NaN                 NaN   
63
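
The three cleaning rules above can also be written as boolean masks, which keeps a record of why each row was dropped. This is only a minimal sketch on a made-up frame: the column names mirror the training file, but the values and the `reasons` table are illustrative assumptions, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training file (all values invented for illustration)
toy = pd.DataFrame({
    "Unnamed: 0": ["0", "1", "some stray text", "3", "4"],
    "y": ["Bob", None, None, "Jorg", "Nobody"],
    "x1": [0.5, 1.1, np.nan, np.nan, 0.2],
})
valid_labels = ["Bob", "Jorg", "Shoogee", "Atsuto"]

# One boolean mask per rule, so each removal can be traced back to a cause
bad_label = ~toy["y"].isin(valid_labels)
bad_row_id = pd.to_numeric(toy["Unnamed: 0"], errors="coerce").isna()
has_missing = toy.isna().any(axis=1)

reasons = pd.DataFrame(
    {"bad_label": bad_label, "bad_row_id": bad_row_id, "has_missing": has_missing}
)

# Keep only the rows that violate none of the rules
toy_cleaned = toy[~reasons.any(axis=1)].copy()

print(reasons[reasons.any(axis=1)])
print(toy_cleaned)
```

Printing `reasons` for the dropped rows makes it easy to check that nothing was removed by accident.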

#### Check the dtypes of all features

In [73]:
# Check the dtypes of all features - edited 
print(df.dtypes)
df.info()

Unnamed: 0     object
y              object
x1            float64
x2            float64
x3            float64
x4            float64
x5            float64
x6             object
x7             object
x8            float64
x9            float64
x10           float64
x11           float64
x12            object
x13           float64
dtype: object
<class 'pandas.core.frame.DataFrame'>
Index: 1003 entries, 0 to 1003
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  1003 non-null   object 
 1   y           997 non-null    object 
 2   x1          1000 non-null   float64
 3   x2          1000 non-null   float64
 4   x3          1000 non-null   float64
 5   x4          1000 non-null   float64
 6   x5          1000 non-null   float64
 7   x6          1000 non-null   object 
 8   x7          1000 non-null   object 
 9   x8          1000 non-null   float64
 10  x9          1000 non-null   float64
 11  x10         1000

#### Convert text columns to category

In [74]:
# Check for missing values in the 'x7' column
print("Missing values in 'x7':", df_cleaned['x7'].isna().sum())

# Drop rows with missing values in 'x7'
df_cleaned = df_cleaned.dropna(subset=['x7'])

# Convert the x7 column using OneHotEncoder
x7_cat = df_cleaned[['x7']]
x7_cat_encoder = OneHotEncoder(sparse_output=False, drop='first')  
x7_encoded = x7_cat_encoder.fit_transform(x7_cat)
x7_encoded_df = pd.DataFrame(x7_encoded, columns=x7_cat_encoder.get_feature_names_out(['x7'])) 

# Ensure the index lengths match before assigning
if len(x7_encoded_df) == len(df_cleaned):
    x7_encoded_df.index = df_cleaned.index  # Reset the index to match the original DataFrame
else:
    print("Warning: Mismatch in the number of rows after encoding 'x7'")

# Drop the original 'x7' column and concatenate the encoded columns
df_cleaned = df_cleaned.drop('x7', axis=1)
df_cleaned = pd.concat([df_cleaned, x7_encoded_df], axis=1)

# Convert the x12 column
df_cleaned['x12'] = df_cleaned['x12'].astype(int)  # Convert True/False column to 1's and 0's

# Attempt to convert x6 to numeric, coercing errors to NaN
df_cleaned['x6'] = pd.to_numeric(df_cleaned['x6'], errors='coerce')
# Check if there are any NaN values after the conversion
non_numeric_x6 = df_cleaned[df_cleaned['x6'].isna()]
print("Rows with non-numeric values in 'x6':")
print(non_numeric_x6)
# Fill in non-numeric x6
df_cleaned['x6'].fillna(df_cleaned['x6'].median(), inplace=True)

# Convert the first column to numeric, coercing errors to NaN
df_cleaned[df_cleaned.columns[0]] = pd.to_numeric(df_cleaned[df_cleaned.columns[0]], errors='coerce')
# Compute the median of the column, ignoring NaN values
median_value = df_cleaned[df_cleaned.columns[0]].median()
# Fill NaN values with the median value
df_cleaned[df_cleaned.columns[0]].fillna(median_value, inplace=True)

# Check the types and information of the DataFrame
print(df_cleaned.dtypes)
df_cleaned.info()

# Check for non-numeric columns
non_numeric_cols = df_cleaned.select_dtypes(exclude=[np.number]).columns
print(f"Non-numeric columns: {non_numeric_cols}")


Missing values in 'x7': 0
Rows with non-numeric values in 'x6':
    Unnamed: 0     y       x1       x2       x3       x4        x5  x6  \
265        264  Jorg  0.88362  0.12864 -4.91248 -9.41917  10.04934 NaN   

          x8      x9      x10      x11  x12       x13  x7_Erik Sven Williams  \
265  0.60795  2.3159 -5.25392 -1.40652    0  41.72493                    0.0   

     x7_Jerry Fernström  x7_Jerry Williams  x7_Jerry från Solna  
265                 1.0                0.0                  0.0  
Unnamed: 0                 int64
y                         object
x1                       float64
x2                       float64
x3                       float64
x4                       float64
x5                       float64
x6                       float64
x8                       float64
x9                       float64
x10                      float64
x11                      float64
x12                        int64
x13                      float64
x7_Erik Sven Williams    float64

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_cleaned['x6'].fillna(df_cleaned['x6'].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_cleaned[df_cleaned.columns[0]].fillna(median_value, inplace=True)
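
The two warnings above are triggered by calling `fillna(..., inplace=True)` on a column selected with `df[col]`. The form the warning itself recommends is to assign the filled column back. A small sketch on invented data:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (values invented for illustration)
toy = pd.DataFrame({"x6": [1.0, np.nan, 3.0, 5.0]})

# Assign the filled column back instead of filling a selected column in place
toy["x6"] = toy["x6"].fillna(toy["x6"].median())

print(toy["x6"].tolist())
```

The same pattern applies to any column, including one addressed as `df.columns[0]`.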


#### Convert evaluation columns to categories

In [75]:
# Check the dtypes of the evaluation features
print(eval_df.dtypes)
eval_df.info()

# Encode the eval x7 column with the encoder already fitted on the training data,
# so both frames get the same one-hot columns in the same order
eval_x7_cat = eval_df[['x7']]
eval_x7_encoded = x7_cat_encoder.transform(eval_x7_cat)
eval_x7_encoded_df = pd.DataFrame(eval_x7_encoded, columns=x7_cat_encoder.get_feature_names_out(['x7']), index=eval_df.index)

# Drop the original 'x7' column and concatenate the encoded columns
eval_df = eval_df.drop('x7', axis=1)
eval_df = pd.concat([eval_df, eval_x7_encoded_df], axis=1)

# Convert the x12 column
eval_df['x12'] = eval_df['x12'].astype(int)  # Convert True/False column to 1's and 0's


Unnamed: 0      int64
x1            float64
x2            float64
x3            float64
x4            float64
x5            float64
x6            float64
x7             object
x8            float64
x9            float64
x10           float64
x11           float64
x12              bool
x13           float64
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  10000 non-null  int64  
 1   x1          10000 non-null  float64
 2   x2          10000 non-null  float64
 3   x3          10000 non-null  float64
 4   x4          10000 non-null  float64
 5   x5          10000 non-null  float64
 6   x6          10000 non-null  float64
 7   x7          10000 non-null  object 
 8   x8          10000 non-null  float64
 9   x9          10000 non-null  float64
 10  x10         10000 non-null  float64
 11  x11         10000 non-null  floa

#### Check types have changed

In [76]:
# Check types of eval have changed
print(eval_df.columns)
print(eval_df.dtypes)
eval_df.info()

Index(['Unnamed: 0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x8', 'x9', 'x10',
       'x11', 'x12', 'x13', 'x7_Erik Sven Williams', 'x7_Jerry Fernström',
       'x7_Jerry Williams', 'x7_Jerry från Solna'],
      dtype='object')
Unnamed: 0                 int64
x1                       float64
x2                       float64
x3                       float64
x4                       float64
x5                       float64
x6                       float64
x8                       float64
x9                       float64
x10                      float64
x11                      float64
x12                        int64
x13                      float64
x7_Erik Sven Williams    float64
x7_Jerry Fernström       float64
x7_Jerry Williams        float64
x7_Jerry från Solna      float64
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 ------------

## Dealing with outliers

In [77]:
# Ensure you only use numeric columns for Z-score calculation
numeric_df = df_cleaned.select_dtypes(include=[np.number])

# Calculate absolute Z-scores for each numeric column (the result keeps df_cleaned's index)
z_scores = np.abs(stats.zscore(numeric_df))

# Define a threshold for Z-scores
z_threshold = 3

# Identify rows that have any Z-score above the threshold
outliers = (z_scores > z_threshold).any(axis=1)

# Ensure the index of 'outliers' matches df_cleaned
outliers = outliers.reindex(df_cleaned.index, fill_value=False)

# Print the rows with outliers
outlier_rows = df_cleaned[outliers]
print("Rows with outliers:")
print(outlier_rows)

# Remove outliers from df_cleaned
df_cleaned = df_cleaned[~outliers]

# Print the cleaned DataFrame
print("\nDataFrame after removing outliers:")
print(df_cleaned)

# There are different ways to do this but one way could be to use stats.zscore

Rows with outliers:
     Unnamed: 0        y       x1       x2        x3            x4        x5  \
27           27     Jorg  2.10221  1.56680  -4.65063 -9.084660e+00  10.62068   
40           40      Bob  1.34002  1.83351  -3.68605 -8.389370e+00  10.48468   
55           55     Jorg  1.99826  0.05477  -0.98095 -9.344060e+00   9.92647   
64           64      Bob  3.00284  1.45413 -11.15675 -9.381620e+00   9.92228   
82           82  Shoogee -0.10483  1.87380   0.76205 -9.500880e+00   9.94529   
146         145     Jorg  1.06445  0.70250  -0.68295 -9.645790e+00  10.08945   
220         219     Jorg  0.86135 -0.10248   0.47975 -9.133260e+00  10.45166   
250         249  Shoogee -0.71322 -2.00360  -0.68410 -9.951670e+00  10.04280   
269         268     Jorg  4.00674 -0.14577  -5.37766 -9.081050e+00  10.44582   
281         280     Jorg  1.10084 -0.61411   0.50381 -9.234500e+00  10.24072   
346         345      Bob  1.92895  0.20643  -3.61880 -8.928800e+00  10.10618   
402         401     
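
Z-scores are most meaningful for continuous measurements. For a row-number column or a 0/1 indicator, a large Z-score mostly reflects position or rarity rather than noise, so one option is to restrict the check to continuous columns. A sketch on synthetic data, where the column names and the choice of `continuous_cols` are assumptions:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "row_id": np.arange(200),
    "x1": rng.normal(size=200),
    "flag": rng.integers(0, 2, size=200),
})
toy.loc[5, "x1"] = 25.0  # plant one obvious outlier

# Assumed choice: leave out the id column and the 0/1 indicator column
continuous_cols = ["x1"]

# Absolute Z-score per column, then flag rows exceeding the threshold anywhere
z = np.abs(stats.zscore(toy[continuous_cols].to_numpy(), axis=0))
is_outlier = pd.Series((z > 3).any(axis=1), index=toy.index)

toy_no_outliers = toy[~is_outlier]
print(toy[is_outlier])
print(len(toy), "->", len(toy_no_outliers))
```

Whether to exclude those columns is a judgment call; comparing model accuracy with and without the restriction is a reasonable way to decide.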

## Scaling the features

In [78]:
# Scale your features
# You can try both StandardScaler and MinMaxScaler and see which works better
# Only the numerical columns are scaled
data_num = df_cleaned.select_dtypes(include=['number'])
std_scaler = StandardScaler()
data_num_std_scaled = std_scaler.fit_transform(data_num)

# Check that standard scaling worked
scaled_df = pd.DataFrame(data_num_std_scaled, columns=data_num.columns, index=df_cleaned.index)

# Check mean and standard deviation
means = scaled_df.mean()
std_devs = scaled_df.std()

print("Means of scaled features:")
print(means)
print("\nStandard deviations of scaled features:")
print(std_devs)

# Check for non-numeric columns
non_numeric_cols = df_cleaned.select_dtypes(exclude=[np.number]).columns
print(f"Non-numeric columns: {non_numeric_cols}")

Means of scaled features:
Unnamed: 0               0.000000e+00
x1                       3.337622e-17
x2                       1.483388e-17
x3                       2.039658e-17
x4                      -2.117536e-15
x5                       4.031106e-15
x6                      -3.708469e-17
x8                       2.225082e-17
x9                       1.668811e-17
x10                     -5.562704e-17
x11                     -7.416939e-17
x12                     -3.893893e-17
x13                     -9.271173e-18
x7_Erik Sven Williams   -7.046092e-17
x7_Jerry Fernström       5.701772e-17
x7_Jerry Williams        0.000000e+00
x7_Jerry från Solna     -2.595929e-17
dtype: float64

Standard deviations of scaled features:
Unnamed: 0               1.000522
x1                       1.000522
x2                       1.000522
x3                       1.000522
x4                       1.000522
x5                       1.000522
x6                       1.000522
x8                       1.000522
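
The standard deviations printed above sit slightly above 1 rather than exactly at 1. That is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `DataFrame.std()` defaults to the sample standard deviation (`ddof=1`), so the two differ by a factor of sqrt(n / (n - 1)). A short sketch on synthetic data illustrates the effect:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({"a": rng.normal(5, 2, size=n), "b": rng.normal(-1, 10, size=n)})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

print(scaled.std(ddof=0))     # population std: 1 up to floating-point rounding
print(scaled.std())           # sample std (ddof=1): slightly above 1
print(np.sqrt(n / (n - 1)))   # the expected ratio between the two
```

So the scaling itself worked; the small excess is only a difference in convention between the two libraries.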


## Feature selection

In [79]:
# You could try applying the SelectKBest class to extract the most useful features (this is optional but MIGHT improve accuracy)
# Remove any features that are not useful
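
One possible way to fill in this step is sketched below. It uses `f_classif` (an ANOVA F-test) as the scoring function instead of the imported `chi2`, because `chi2` only accepts non-negative feature values. The synthetic data and `k=5` are assumptions for illustration, not values tuned for this assignment.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic four-class problem standing in for the real features
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Keep the k features with the highest F-score against the labels
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X_toy, y_toy)

kept = selector.get_support(indices=True)
print("Kept feature indices:", kept)
print("Shape after selection:", X_selected.shape)
```

The fitted `selector` can then be applied to the evaluation features with `selector.transform(...)` so both sets keep the same columns.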




## Split your data to train and test set

In [80]:
# Check for non-numeric columns
non_numeric_cols = df_cleaned.select_dtypes(exclude=[np.number]).columns
print(f"Non-numeric columns: {non_numeric_cols}")

X = df_cleaned.drop(columns=['y'])  # Drop the target column to get features
y = df_cleaned['y']  # Extract the target variable

# Split into training (90%) and test (10%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, random_state=0)

# Check for non-numeric columns in X_train
non_numeric_columns = X_train.select_dtypes(exclude=[np.number]).columns
print("Non-numeric columns in X_train:", non_numeric_columns)

# Check for non-numeric values in each column
for col in X_train.columns:
    non_numeric = X_train[X_train[col].apply(lambda x: not isinstance(x, (int, float)))]
    if not non_numeric.empty:
        print(f"Non-numeric values in column {col}:")
        print(non_numeric)



Non-numeric columns: Index(['y'], dtype='object')
Non-numeric columns in X_train: Index([], dtype='object')
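
With a test split this small, the class proportions in it can drift from those in the full data. Passing `stratify=y` to `train_test_split` keeps them aligned. A sketch on invented, deliberately imbalanced labels:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
y_toy = pd.Series(["Bob"] * 100 + ["Jorg"] * 60 + ["Shoogee"] * 30 + ["Atsuto"] * 10)

# stratify=y_toy keeps each label's share the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.9, random_state=0, stratify=y_toy
)

print(y_tr.value_counts())
print(y_te.value_counts())
```

Without stratification, a rare label could be missing from the test split entirely, which makes the test score harder to interpret.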


## Fit the model

* You can try models other than the models listed below
* You can try different hyperparameters
* Evaluate your model using cross-validation
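
A `Pipeline` is one way to make sure the scaling step is actually applied to the data the model sees, and that it is re-fitted inside every cross-validation fold. The sketch below cross-validates on the training split and scores the held-out split once at the end. The data is synthetic and the hyperparameters are placeholders, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic four-class data standing in for the cleaned training set
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.9, random_state=0, stratify=y_toy
)

# Scaler and classifier are fitted together, so no fold sees statistics from its own validation part
model = make_pipeline(StandardScaler(), LinearSVC(C=0.5, max_iter=10000))

# Cross-validate on the training split only
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5)
print("%0.2f accuracy with a standard deviation of %0.2f" % (cv_scores.mean(), cv_scores.std()))

# Score the held-out split once, after the model choice is settled
model.fit(X_tr, y_tr)
test_score = model.score(X_te, y_te)
print(test_score)
```

Keeping the test split out of cross-validation leaves it as an independent check on the chosen model.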

In [81]:
# Try linear SVM classifier
from sklearn.svm import LinearSVC
linear = LinearSVC(C=0.5).fit(X_train, y_train)
print(linear.score(X_test,y_test))

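# Evaluate using cross-validation
# Note: here and in the next two cells, cross-validation runs on the held-out test split only;
# cross-validating on X_train, y_train would use roughly nine times as much data
scores = cross_val_score(linear, X_test, y_test, cv=5)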
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
print('test')

0.4479166666666667
0.46 accuracy with a standard deviation of 0.14
test


In [82]:
# Try decision tree classifier
decision_tree = DecisionTreeClassifier(criterion = "gini").fit(X_train, y_train)
print(decision_tree.score(X_test,y_test))

# Evaluate using cross-validation
scores = cross_val_score(decision_tree,X_test,y_test,cv=10)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.5
0.45 accuracy with a standard deviation of 0.15


In [83]:
# Try random forest classifier
random_forest = RandomForestClassifier().fit(X_train, y_train)
print(random_forest.score(X_test,y_test))

scores = cross_val_score(random_forest,X_test,y_test,cv=10)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.6041666666666666
0.53 accuracy with a standard deviation of 0.14
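
Trying different hyperparameters can be automated with `GridSearchCV`, which cross-validates every combination in a grid and keeps the best one. The grid below is a small, arbitrary example on synthetic data; the parameter values are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic four-class data standing in for the training split
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Candidate values for three random forest hyperparameters
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}

# 5-fold cross-validation for each of the 8 combinations
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_)
print("%0.2f mean cross-validated accuracy" % search.best_score_)
```

After fitting, `search.best_estimator_` is a forest refitted on all the data passed to `fit`, ready to call `predict` on.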


In [84]:
# Use your best model to predict the labels for the evaluation set
best_model = random_forest
y_pred = best_model.predict(eval_df)

print(y_pred)


['Jorg' 'Bob' 'Jorg' ... 'Jorg' 'Bob' 'Jorg']


In [85]:
# Save your predictions, one label per line with no header, and upload the file to Canvas

pd.DataFrame(y_pred).to_csv("file.txt", index=False, header=False)

## Calculate the percentage of correct predictions

In [87]:
import pandas as pd

# Step 1: Read the ground-truth labels from the CSV file (it has no header row)
csv_file = 'EvaluationGT-2.csv'
gt_df = pd.read_csv(csv_file, header=None)  # separate name so the training DataFrame df is not overwritten

# Grab the first column as a list
csv_values = gt_df.iloc[:, 0].tolist()

# Step 2: Read values from the text file (assuming values are line-separated)
text_file = 'file.txt'
with open(text_file, 'r') as file:
    text_values = [line.strip() for line in file.readlines()]  # Removing any extra spaces or newline chars

# Step 3: Ensure both lists are the same length
if len(csv_values) != len(text_values):
    print("The CSV and text file do not have the same number of values.")
else:
    # Step 4: Compare the values element-wise, considering the order
    matches = [1 for csv_value, text_value in zip(csv_values, text_values) if csv_value == text_value]
    
    # Step 5: Calculate the percentage of matching values
    num_matches = len(matches)
    total_values = len(csv_values)
    percentage_match = (num_matches / total_values) * 100

    # Output the results
    print(f"Total values in CSV: {total_values}")
    print(f"Number of matching values (order-sensitive): {num_matches}")
    print(f"Percentage of matching values: {percentage_match:.2f}%")


Total values in CSV: 10000
Number of matching values (order-sensitive): 6101
Percentage of matching values: 61.01%
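
The element-wise comparison above is the same quantity that `sklearn.metrics.accuracy_score` computes: the fraction of positions where the prediction equals the true label. A small sketch with invented labels (not the real ground truth):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels for illustration only
truth = np.array(["Bob", "Jorg", "Jorg", "Atsuto", "Shoogee"])
predicted = np.array(["Bob", "Jorg", "Bob", "Atsuto", "Jorg"])

# Fraction of positions where the two arrays agree
accuracy = accuracy_score(truth, predicted)
print(f"Percentage of matching values: {accuracy * 100:.2f}%")
```

`accuracy_score` raises an error if the two inputs differ in length, which covers the length check done by hand above.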
