# Importing and Analysis

In [52]:
import pandas as pd
import matplotlib.pyplot as plt  # needed for the histogram plotting at the end of this cell

# Read the dataset using pandas
data = pd.read_pickle("/content/drive/MyDrive/Colab Notebooks/ass2.pickle")
df = pd.DataFrame([data])

# Show the first few rows of the dataset
print("First few rows of the dataset:")
print(df.head())

# Check the shape of the dataset
print("\nShape of the dataset:")
print(df.shape)

# Explore different columns and their data types
print("\nColumns and their data types:")
print(df.dtypes)

# Check for numeric columns stored as objects
numeric_columns_as_objects = df.select_dtypes(include=['object']).apply(pd.to_numeric, errors='coerce').notnull().all()

# Convert columns to numeric if applicable
for col in numeric_columns_as_objects[numeric_columns_as_objects].index:
    df[col] = pd.to_numeric(df[col])

# Check the updated data types
print("\nUpdated Columns and their data types:")
print(df.dtypes)

# Summary statistics
print("\nSummary statistics:")
print(df.describe())

# Check for any missing values
print("\nMissing values:")
print(df.isnull().sum())

# Plot histograms for numeric features
numeric_features = df.select_dtypes(include=['int64', 'float64']).columns

if not numeric_features.empty:
    df[numeric_features].hist(figsize=(12, 10))
    plt.tight_layout()
    plt.show()
else:
    print("\nNo numeric columns available for histogram plotting.")


First few rows of the dataset:
                                               train  \
0         f0  f1  f2  f3  f4  f5  f6  f7  f8  f9 ...   

                                                 dev  \
0         f0  f1  f2  f3  f4  f5  f6  f7  f8  f9 ...   

                                                test  
0         f0  f1  f2  f3  f4  f5  f6  f7  f8  f9 ...  

Shape of the dataset:
(1, 3)

Columns and their data types:
train    object
dev      object
test     object
dtype: object

Updated Columns and their data types:
train    object
dev      object
test     object
dtype: object

Summary statistics:
                                                    train  \
count                                                   1   
unique                                                  1   
top            f0  f1  f2  f3  f4  f5  f6  f7  f8  f9 ...   
freq                                                    1   

                                                      dev  \
count                

# Data Preprocessing
We encoded the target variable as integer class labels so that the classification models treat it as categorical rather than continuous. Many machine learning algorithms expect the target to be encoded as integers, one per class, so this conversion keeps the labels compatible with the classifiers. The encoder is fit on the training set and then reused for the dev and test sets, so the same class always maps to the same integer.
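As a minimal sketch of the encoding step described above, the snippet below fits a `LabelEncoder` on training labels and reuses it on dev labels. The label values (`"low"`, `"medium"`, `"high"`) are made up for illustration and are not taken from the assignment data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, used only for illustration.
train_labels = ["low", "high", "medium", "high"]
dev_labels = ["medium", "low"]

le = LabelEncoder()
train_encoded = le.fit_transform(train_labels)  # learn the mapping on train only
dev_encoded = le.transform(dev_labels)          # reuse the same mapping on dev

print(list(le.classes_))        # classes are stored in sorted order
print(train_encoded.tolist())
print(dev_encoded.tolist())

# A label that was never seen during fit raises ValueError
# instead of being silently mapped to some integer.
try:
    le.transform(["unknown"])
    unseen_label_accepted = True
except ValueError:
    unseen_label_accepted = False
print(unseen_label_accepted)
```

Fitting on the training labels only means that an unexpected class in the dev or test set surfaces as an error rather than as a silently shifted encoding.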

In [54]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler, LabelEncoder
import pickle

# Extract the train, dev, and test sets
train_data = data['train']
dev_data = data['dev']
test_data = data['test']

# Assuming the data is stored as a list of dictionaries or similar structures
train_df = pd.DataFrame(train_data)
dev_df = pd.DataFrame(dev_data)
test_df = pd.DataFrame(test_data)

# Print the first few rows to inspect the data
print(train_df.head())
print(dev_df.head())
print(test_df.head())

# Ensure the target column is categorical
print("\nData types before encoding:")
print(train_df.dtypes)
print(dev_df.dtypes)
print(test_df.dtypes)

# Convert the target column to categorical if it's not
if train_df['target'].dtype == 'object' or train_df['target'].dtype == 'float64' or train_df['target'].dtype == 'int64':
    le = LabelEncoder()
    train_df['target'] = le.fit_transform(train_df['target'])
    dev_df['target'] = le.transform(dev_df['target'])
    test_df['target'] = le.transform(test_df['target'])

# Verify the target column is now categorical
print("\nData types after encoding:")
print(train_df.dtypes)
print(dev_df.dtypes)
print(test_df.dtypes)

# Handle missing values by forward-filling: each gap takes the previous row's value.
# DataFrame.ffill replaces the deprecated fillna(method='ffill').
train_df.ffill(inplace=True)
dev_df.ffill(inplace=True)
test_df.ffill(inplace=True)

# # Standardize the numeric features if needed
# numeric_features = train_df.select_dtypes(include=['int64', 'float64']).columns
# scaler = StandardScaler()

# train_df[numeric_features] = scaler.fit_transform(train_df[numeric_features])
# dev_df[numeric_features] = scaler.transform(dev_df[numeric_features])
# test_df[numeric_features] = scaler.transform(test_df[numeric_features])

# Assume the target column is named 'target'
X_train = train_df.drop('target', axis=1)
y_train = train_df['target']
X_dev = dev_df.drop('target', axis=1)
y_dev = dev_df['target']
X_test = test_df.drop('target', axis=1)
y_test = test_df['target']

       f0  f1  f2  f3  f4  f5  f6  f7  f8  f9  ...  f33  f34  f35  f36  f37  \
51905   1   0   0   0   0   0   2   1   2   2  ...    0    0    0    2    0   
52612   0   0   0   0   0   0   2   1   0   0  ...    0    0    0    2    0   
61699   2   1   2   1   1   0   2   2   0   0  ...    0    0    0    1    0   
6291    0   0   0   0   0   0   0   0   0   0  ...    0    0    0    2    0   
17484   0   0   0   0   0   0   1   1   2   0  ...    0    0    0    2    1   

       f38  f39  f40  f41  target  
51905    0    0    0    0       2  
52612    0    0    0    0       2  
61699    0    0    0    0       2  
6291     0    0    0    0       2  
17484    2    0    0    0       2  

[5 rows x 43 columns]
       f0  f1  f2  f3  f4  f5  f6  f7  f8  f9  ...  f33  f34  f35  f36  f37  \
104     1   2   0   0   0   0   0   0   0   0  ...    0    0    0    0    0   
3548    2   0   0   0   0   0   2   0   0   0  ...    0    0    0    0    0   
11672   1   0   0   0   0   0   2   2   0   0  ..

# The Different Classifiers

### **Logistic Regression**
Logistic regression was chosen as one of the initial models because it is simple and interpretable. It fits a decision boundary that is linear in the features, so it may underperform when the relationship between the features and the target variable is non-linear.

Additionally, the UndefinedMetricWarning indicates that the logistic regression model predicted no samples for some classes, so precision and F-score are set to 0 for those classes. This can happen when the model is poorly tuned or the data is imbalanced; tuning hyperparameters or applying techniques that handle class imbalance may improve performance. The run below also stops at the solver's iteration limit, so the reported accuracy comes from a model that has not fully converged.

In [55]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize and train the model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predict on the dev set
log_reg_pred = log_reg.predict(X_dev)

# Evaluate the model
print("Logistic Regression Performance:")
print(f"Accuracy: {accuracy_score(y_dev, log_reg_pred)}")


# Optionally, evaluate on the test set
log_reg_test_pred = log_reg.predict(X_test)
print("\nLogistic Regression Test Performance:")
print(f"Accuracy: {accuracy_score(y_test, log_reg_test_pred)}")


Logistic Regression Performance:
Accuracy: 0.6596358792184724

Logistic Regression Test Performance:
Accuracy: 0.6595618709295441


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
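The warning above suggests two remedies: scaling the features and raising `max_iter`. The sketch below applies both in a single pipeline. It runs on synthetic data from `make_classification`, not on the assignment data, and `max_iter=1000` is an arbitrary illustrative value, so the printed accuracy says nothing about how the real model would change.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real train/dev split (assumption for illustration).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_tr, X_dv, y_tr, y_dv = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The scaler is fit on the training split only; the pipeline then applies
# the same transformation to the dev split before predicting.
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_tr, y_tr)

dev_accuracy = scaled_log_reg.score(X_dv, y_dv)
print(f"Dev accuracy: {dev_accuracy:.3f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the dev and test sets out of the scaling statistics.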


### **Decision Tree**
A decision tree was chosen because it can capture non-linear relationships between the features and the target variable. Decision trees are also easy to interpret and visualize, which makes them useful for gaining insight into the data. However, they tend to overfit the training data, which can lead to poor generalization on unseen data. We therefore used it as a starting point rather than as a final model.
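The overfitting tendency can be illustrated by comparing training and dev accuracy for a shallow tree and an unrestricted one. This is a sketch on synthetic, deliberately noisy data (`flip_y=0.1`), not on the assignment data, and the depth of 3 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (assumption for illustration).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=6, flip_y=0.1, random_state=0
)
X_tr, X_dv, y_tr, y_dv = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for depth in (3, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_dv, y_dv))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, dev={results[depth][1]:.3f}")
```

A large gap between training and dev accuracy for the unrestricted tree is the usual sign that it has memorized noise in the training set.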

In [37]:
# Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize and train the model
dec_tree = DecisionTreeClassifier()
dec_tree.fit(X_train, y_train)

# Predict on the dev set
dec_tree_pred = dec_tree.predict(X_dev)

# Evaluate the model
print("Decision Tree Performance:")
print(f"Accuracy: {accuracy_score(y_dev, dec_tree_pred)}")

# Optionally, evaluate on the test set
dec_tree_test_pred = dec_tree.predict(X_test)
print("\nDecision Tree Test Performance:")
print(f"Accuracy: {accuracy_score(y_test, dec_tree_test_pred)}")


Decision Tree Performance:
Accuracy: 0.7217288336293665

Decision Tree Test Performance:
Accuracy: 0.7223208999407934


### **Random Forest**
Random forests were chosen as an extension of decision trees to address the overfitting issue. A random forest is an ensemble of decision trees in which each tree is trained on a bootstrap sample of the data and considers a random subset of the features at each split. By aggregating the predictions of many trees, random forests reduce overfitting and improve generalization, as we saw in class. They also provide feature importance scores, which are useful for feature selection and interpretation.
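The feature importance scores mentioned above are exposed through the fitted model's `feature_importances_` attribute. A minimal sketch on synthetic data follows; the feature names `f0`..`f9` only mimic the naming style of the dataset and do not correspond to its real columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption for illustration): 10 features, 4 informative.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)
feature_names = [f"f{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances.head())
```

These impurity-based scores are computed on the training data, so they are best read as a rough ranking rather than as a precise measure of each feature's effect.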

In [36]:
# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize and train the model
rand_forest = RandomForestClassifier()
rand_forest.fit(X_train, y_train)

# Predict on the dev set
rand_forest_pred = rand_forest.predict(X_dev)

# Evaluate the model
print("Random Forest Performance:")
print(f"Accuracy: {accuracy_score(y_dev, rand_forest_pred)}")


# Optionally, evaluate on the test set
rand_forest_test_pred = rand_forest.predict(X_test)
print("\nRandom Forest Test Performance:")
print(f"Accuracy: {accuracy_score(y_test, rand_forest_test_pred)}")



Random Forest Performance:
Accuracy: 0.8088365896980462

Random Forest Test Performance:
Accuracy: 0.8040260509177027


### **Random Forest with Class Weighting**
Class imbalance can bias a model toward the majority classes. Setting `class_weight='balanced'` weights each class inversely to its frequency, so that errors on minority-class samples count more during training. This does not change the class distribution itself, but it improves the model's ability to learn from minority-class instances.
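To make the weighting concrete, the sketch below computes the weights that the `'balanced'` setting assigns, using scikit-learn's `compute_class_weight` and the equivalent formula `n_samples / (n_classes * count_per_class)`. The class counts (80, 15, 5) are invented for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 80 / 15 / 5 samples per class.
y_toy = np.array([0] * 80 + [1] * 15 + [2] * 5)
classes = np.unique(y_toy)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights computed by hand.
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")
```

The rarest class receives the largest weight, so a misclassified sample from it contributes more to the split criterion than one from the majority class.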

In [59]:
from sklearn.ensemble import RandomForestClassifier

# Initialize and train the model with class weights
rand_forest = RandomForestClassifier(class_weight='balanced')
rand_forest.fit(X_train, y_train)

# Predict on the dev set
rand_forest_pred = rand_forest.predict(X_dev)

# Evaluate the model
print("Random Forest Performance with Class Weighting:")
print(f"Accuracy: {accuracy_score(y_dev, rand_forest_pred)}")

# Optionally, evaluate on the test set
rand_forest_test_pred = rand_forest.predict(X_test)
print("\nRandom Forest Test Performance with Class Weighting:")
print(f"Accuracy: {accuracy_score(y_test, rand_forest_test_pred)}")



Random Forest Performance with Class Weighting:
Accuracy: 0.8100207223208999

Random Forest Test Performance with Class Weighting:
Accuracy: 0.8035079928952042


### **AdaBoost**
AdaBoost was tried because:
1. Works Well with Weak Learners: AdaBoost is effective with weak learners, such as decision trees with limited depth. It trains the weak learners sequentially and weights each one by its performance, combining them into a strong classifier.
2. Boosting Mechanism: After each round, AdaBoost increases the weight of misclassified instances, so later learners focus on the difficult cases. However, its performance depends on the quality of the weak learners and the characteristics of the dataset.
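The iterative nature of boosting can be inspected with `staged_score`, which yields the accuracy of the ensemble after each boosting round. The sketch below uses synthetic data and the default depth-1 tree as the weak learner; it illustrates the mechanism and does not reproduce the run on the assignment data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real train/dev split (assumption for illustration).
X, y = make_classification(
    n_samples=800, n_features=20, n_informative=6, random_state=0
)
X_tr, X_dv, y_tr, y_dv = train_test_split(X, y, test_size=0.25, random_state=0)

# The default weak learner is a decision stump (a tree of depth 1).
booster = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# One dev-accuracy value per boosting round, in training order.
staged_dev_accuracy = list(booster.staged_score(X_dv, y_dv))
for round_idx in (0, 9, len(staged_dev_accuracy) - 1):
    print(f"after {round_idx + 1} learners: dev accuracy {staged_dev_accuracy[round_idx]:.3f}")
```

Plotting these staged scores against the round number is a simple way to judge whether more estimators are still helping on the dev set.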

In [57]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

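# Initialize the AdaBoost classifier. Its default base estimator is already a
# depth-1 decision tree, so it is not passed explicitly; the keyword for it is
# `estimator` in scikit-learn >= 1.2 and `base_estimator` in older releases.
adaboost = AdaBoostClassifier(n_estimators=50, random_state=42)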

# Train the model
adaboost.fit(X_train, y_train)

# Predict on the test set
y_pred = adaboost.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy: {accuracy}')




Test Accuracy: 0.7295737122557726


# Metric Used for Evaluation on the Dev Set
We used accuracy as the primary metric to evaluate the models on the dev set. Accuracy is the proportion of correctly classified instances out of all instances in the dataset. It gives a single overall figure, but on imbalanced datasets it can be dominated by the majority classes and hide poor performance on minority classes. Because we used class weights and ensemble methods to mitigate class imbalance, we treated accuracy as a reasonable metric in this context.
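The limitation of accuracy on imbalanced data can be shown with a small worked example. In the sketch below, a hypothetical classifier predicts the majority class for every sample of an invented 90/10 dataset; balanced accuracy and macro-averaged F1 are shown alongside accuracy as two alternative metrics that scikit-learn provides.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Invented labels: 90 samples of class 0 and 10 of class 1.
y_true = [0] * 90 + [1] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
# zero_division=0 sets the undefined precision of class 1 to 0 without a warning.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy:          {acc:.3f}")
print(f"balanced accuracy: {bal_acc:.3f}")
print(f"macro F1:          {macro_f1:.3f}")
```

Accuracy looks high here even though the minority class is never predicted, while the two class-averaged metrics expose the problem.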

# Hyperparameter Search and its Impact on Model Accuracy
In the provided code snippets, we did not explicitly perform hyperparameter search or tuning. However, the chosen models, such as Random Forest with class weights and AdaBoost, have hyperparameters that can significantly impact model performance.

For Random Forest, key hyperparameters include the number of trees (n_estimators), the maximum depth of each tree (max_depth), and the minimum number of samples required to split an internal node (min_samples_split). Adjusting these hyperparameters can affect the model's ability to capture complex patterns in the data, prevent overfitting, and improve generalization performance.

Similarly, AdaBoost has hyperparameters like the number of weak learners (n_estimators) and the learning rate (learning_rate). These parameters control the boosting process and influence the model's capacity to adapt to the training data.

A systematic hyperparameter search, such as grid search or random search, could be conducted to find the optimal combination of hyperparameters that maximizes model performance on the dev set. By evaluating the models with different hyperparameter configurations, we can identify the settings that yield the highest accuracy or other desired metrics.
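A grid search evaluated on a held-out dev set could look roughly like the sketch below. It loops over a small `ParameterGrid` for the three Random Forest hyperparameters named above and keeps the configuration with the best dev accuracy. The data is synthetic and the candidate values are arbitrary, so this only illustrates the procedure and is not a search that was actually run for this assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in for the real train/dev split (assumption for illustration).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6, random_state=0
)
X_tr, X_dv, y_tr, y_dv = train_test_split(X, y, test_size=0.25, random_state=0)

# Illustrative candidate values; a real search would cover a wider range.
param_grid = ParameterGrid({
    "n_estimators": [50, 100],
    "max_depth": [5, None],
    "min_samples_split": [2, 10],
})

best_score, best_params = -1.0, None
for params in param_grid:
    model = RandomForestClassifier(class_weight="balanced", random_state=0, **params)
    model.fit(X_tr, y_tr)
    score = model.score(X_dv, y_dv)  # accuracy on the dev split
    if score > best_score:
        best_score, best_params = score, params

print(f"Best dev accuracy: {best_score:.3f} with {best_params}")
```

Selecting hyperparameters on the dev set keeps the test set untouched, so the final test accuracy remains an unbiased estimate for the chosen configuration.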

# Model Selection
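Based on the evaluation results and considering the use of class weights to handle class imbalance, we ultimately chose Random Forest with class weights. It reached the highest dev accuracy (0.8100), slightly ahead of the unweighted Random Forest (0.8088) and well ahead of the decision tree (0.7217) and logistic regression (0.6596). AdaBoost was evaluated only on the test set, where it scored 0.7296 against 0.8035 for Random Forest with class weights, so it was not preferred for our classification task.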