In [5]:
import pickle
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import svm
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')


### `read_data` Function

The `read_data` function reads a CSV file, prints a short exploratory summary, and prepares the data for modeling. It performs the following steps:

1. **Reading Data from CSV:**
   - Reads data from a CSV file specified by the `file` parameter using the Pandas library and stores it in a DataFrame named `data`.

2. **Printing the First 5 Rows:**
   - Prints the first 5 rows of the DataFrame using `data.head()` to provide a quick preview of the data.

3. **Printing Data Information:**
   - Calls `data.info()`, which prints the data type of each column, the number of non-null values, and the memory usage.

4. **Printing Summary Statistics:**
   - Prints summary statistics of the DataFrame using `data.describe()`. For numerical columns this includes the count, mean, standard deviation, minimum, quartiles, and maximum.

5. **Counting Missing Values:**
   - Prints the count of missing values in each column using `data.isnull().sum()`.

6. **Counting Unique Classes:**
   - Prints the number of unique classes in the 'Order' column to provide an idea of the number of unique categories or labels in that specific column.

7. **Label Encoding:**
   - Label-encodes the 'Year', 'Major', 'University', and 'Order' columns using scikit-learn's `LabelEncoder`. This converts each category into an integer, making the data suitable for machine learning models that require numerical input. The function does not return the fitted encoders, so the integer-to-category mappings are not available afterwards.

8. **Data Separation:**
   - Separates the data into feature variables (X) and the target variable (y). It assumes that the last column in the DataFrame is the target variable. The X variable includes all columns except the last one, and the y variable contains the last column.

9. **Data Splitting:**
   - Splits the data into training and testing sets using scikit-learn's `train_test_split` function. It assigns 80% of the data to the training set (`X_train` and `y_train`) and 20% to the testing set (`X_test` and `y_test`), with `random_state=42` so the split is reproducible.

In short, `read_data` combines a brief exploratory summary with the preprocessing needed to turn the raw CSV data into training and testing sets for the models that follow.
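
The label-encoding step can be sketched with one encoder per column, so that encoded values can later be mapped back to category names. This is only an illustration under assumptions: the `toy` DataFrame holds a few values copied from the data preview, and the `encoders` dictionary is a hypothetical addition that `read_data` itself does not have.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows shaped like the dataset preview; this is not the real data
toy = pd.DataFrame({
    "Year": ["Year 2", "Year 3", "Year 3", "Year 2"],
    "Major": ["Physics", "Chemistry", "Chemistry", "Biology"],
    "Order": ["Fried Catfish Basket", "Sugar Cream Pie",
              "Indiana Pork Chili", "Fried Catfish Basket"],
})

# Hypothetical: keep one fitted encoder per column instead of reusing one
encoders = {}
for column in toy.columns:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# LabelEncoder assigns integers in sorted order of the class names
print(toy["Order"].tolist())

# Map encoded values (for example, model predictions) back to order names
decoded_orders = encoders["Order"].inverse_transform(toy["Order"])
print(list(decoded_orders))
```

Keeping the encoder for 'Order' is what makes it possible to report predictions as dish names rather than integers.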


In [6]:
def read_data(file):
    # Load the CSV and print a quick exploratory summary
    data = pd.read_csv(file)
    print(data.head())
    data.info()  # info() prints its report directly and returns None
    print(data.describe())
    print(data.isnull().sum())
    print("Number of Classes  :", len(data['Order'].unique()))

    # Label-encode each categorical column (the encoder is refit per column)
    le = LabelEncoder()
    for column in ['Year', 'Major', 'University', 'Order']:
        data[column] = le.fit_transform(data[column])

    # Features are all columns except the last; the target is the last column
    X = data.iloc[:, :-1]
    y = data.iloc[:, -1]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.8, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test


### `training_dtree` Function

The `training_dtree` function trains a Decision Tree classifier, evaluates it on the test set, and saves the trained model to a file. It performs the following steps:

1. **Training a Decision Tree Classifier:**
   - Trains a Decision Tree classifier model with a maximum depth of 10 using scikit-learn's `DecisionTreeClassifier`. The training is performed with the provided training data (`X_train` for features and `y_train` for target labels).

2. **Making Predictions:**
   - Uses the trained Decision Tree model to make predictions on the test data (`X_test`) and stores the predicted labels in the `pred` variable.

3. **Calculating Accuracy:**
   - Calculates the accuracy of the model's predictions using scikit-learn's `accuracy_score` function, which measures how well the model performs on the test data.

4. **Calculating F1 Score:**
   - Calculates the F1 score using scikit-learn's `f1_score` function. The F1 score is the harmonic mean of precision and recall. With `average='weighted'`, it is computed for each class and then averaged, weighting each class by its number of true instances.

5. **Saving the Model:**
   - Persists the trained Decision Tree model by saving it to a file named 'finalized_model_dtree.sav' using the `pickle.dump` function. This allows for future use of the model.

6. **Printing Accuracy and F1 Score:**
   - Prints the accuracy and F1 score as percentages, formatted to two decimal places.

7. **Returning Results:**
   - Returns three values: accuracy, F1 score, and the filename under which the trained model is saved.

The purpose of this function is to facilitate the training and evaluation of a Decision Tree classifier model as part of a machine learning workflow.
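
To make the `average='weighted'` option concrete, the sketch below computes a support-weighted F1 score by hand on a small made-up example. The `weighted_f1` helper and the label lists are hypothetical and written for illustration only; this is not scikit-learn's implementation, just the same arithmetic spelled out.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 scores averaged with weights equal to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)


# Made-up labels: four samples of class 0 and two of class 1
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = weighted_f1(y_true, y_pred)
print("Accuracy : %.2f" % (accuracy * 100))
print("F1 : %.2f" % (score * 100))
```

Here class 0 has an F1 of 8/9 and class 1 an F1 of 2/3, so the weighted score is (4 × 8/9 + 2 × 2/3) / 6 = 22/27, slightly below the accuracy of 5/6.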


In [7]:
def training_dtree(X_train, X_test, y_train, y_test):
    # Fit a depth-limited decision tree and predict on the held-out test set
    dtree_model = DecisionTreeClassifier(max_depth=10).fit(X_train, y_train)
    pred = dtree_model.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    f1 = f1_score(y_test, pred, average='weighted')

    # Persist the fitted model; the context manager closes the file handle
    filename = 'finalized_model_dtree.sav'
    with open(filename, 'wb') as f:
        pickle.dump(dtree_model, f)

    print('Accuracy : ', "%.2f" % (accuracy*100))
    print('F1 : ', "%.2f" % (f1*100))
    return accuracy, f1, filename

### `training_svm` Function

The `training_svm` function trains a Support Vector Machine (SVM) classifier with an RBF (Radial Basis Function) kernel, evaluates it on the test set, and saves the trained model to a file. It performs the following steps:

1. **Training an SVM Classifier with RBF Kernel:**
   - Trains an SVM classifier using scikit-learn's `SVC` (Support Vector Classification) class with `kernel='rbf'`, `gamma=0.5`, and `C=0.1`. The training is performed with the provided training data (`X_train` for features and `y_train` for target labels).

2. **Making Predictions:**
   - Uses the trained SVM model with the RBF kernel to make predictions on the test data (`X_test`) using `rbf.predict(X_test)`. The predicted labels are stored in the `pred` variable.

3. **Calculating Accuracy:**
   - Calculates the accuracy of the model's predictions using scikit-learn's `accuracy_score` function, which measures how well the model performs on the test data.

4. **Calculating F1 Score:**
   - Calculates the F1 score using scikit-learn's `f1_score` function. The F1 score is the harmonic mean of precision and recall. With `average='weighted'`, it is computed for each class and then averaged, weighting each class by its number of true instances.

5. **Saving the Model:**
   - Persists the trained SVM model with the RBF kernel by saving it to a file named 'finalized_model_svm.sav' using the `pickle.dump` function. This allows for future use of the model.

6. **Printing Accuracy and F1 Score:**
   - Prints the accuracy and F1 score as percentages, formatted to two decimal places.

7. **Returning Results:**
   - Returns three values: accuracy, F1 score, and the filename under which the trained model is saved.

The `training_svm` function follows the same train, evaluate, and save pattern as `training_dtree`, using an SVM classifier in place of the Decision Tree.
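
As a rough illustration of what `gamma` controls, the sketch below evaluates the RBF kernel, K(x, z) = exp(-gamma * ||x - z||^2), on a few made-up feature vectors. The vectors are hypothetical stand-ins for rows of label-encoded features ('Year', 'Major', 'University', 'Time'); they are not taken from the dataset, and the `rbf_kernel` helper here is written by hand rather than imported from scikit-learn.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel value: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


gamma = 0.5  # same value as in training_svm

# Hypothetical encoded rows: [Year, Major, University, Time]
base = [1, 3, 2, 12]
one_hour_later = [1, 3, 2, 13]   # differs by 1 in 'Time'
other_major = [1, 7, 2, 12]      # differs by 4 in the 'Major' code

same = rbf_kernel(base, base, gamma)            # exp(0)
near = rbf_kernel(base, one_hour_later, gamma)  # exp(-0.5)
far = rbf_kernel(base, other_major, gamma)      # exp(-8)
print(same, near, far)
```

Because label encoding assigns integer codes in alphabetical order, two majors whose codes happen to be far apart look almost completely dissimilar to the kernel, even though the ordering carries no meaning.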


In [8]:
def training_svm(X_train, X_test, y_train, y_test):
    # Fit an RBF-kernel SVM and predict on the held-out test set
    rbf = svm.SVC(kernel='rbf', gamma=0.5, C=0.1).fit(X_train, y_train)
    pred = rbf.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    f1 = f1_score(y_test, pred, average='weighted')

    # Persist the fitted model; the context manager closes the file handle
    filename = 'finalized_model_svm.sav'
    with open(filename, 'wb') as f:
        pickle.dump(rbf, f)

    print('Accuracy : ', "%.2f" % (accuracy*100))
    print('F1 : ', "%.2f" % (f1*100))
    return accuracy, f1, filename


### `training_knn` Function

The `training_knn` function trains a k-Nearest Neighbors (k-NN) classifier, evaluates it on the test set, and saves the trained model to a file. It performs the following steps:

1. **Training a k-NN Classifier:**
   - It trains a k-Nearest Neighbors (k-NN) classifier using scikit-learn's `KNeighborsClassifier`. The number of neighbors (k) is set to 10, and the training is performed with the provided training data (`X_train` for features and `y_train` for target labels).

2. **Making Predictions:**
   - The trained k-NN model is used to make predictions on the test data (`X_test`) using `knn.predict(X_test)`. The predicted labels are stored in the `pred` variable.

3. **Calculating Accuracy:**
   - The function calculates the accuracy of the model's predictions using scikit-learn's `accuracy_score` function, which measures how well the model performs on the test data.

4. **Calculating F1 Score:**
   - It calculates the F1 score using scikit-learn's `f1_score` function. The F1 score is the harmonic mean of precision and recall. With `average='weighted'`, it is computed for each class and then averaged, weighting each class by its number of true instances.

5. **Saving the Model:**
   - The trained k-NN model is saved to a file named 'finalized_model_knn.sav' using the `pickle.dump` function. This allows for persisting the model for future use.

6. **Printing Accuracy and F1 Score:**
   - The function prints the accuracy and F1 score as percentages, formatted to two decimal places.

7. **Returning Results:**
   - The function returns three values: accuracy, F1 score, and the filename under which the trained model is saved.

The `training_knn` function follows the same train, evaluate, and save pattern as the two functions above, using a k-Nearest Neighbors classifier.
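
Since the three training functions share the same steps, they could be folded into one generic helper. The sketch below is one possible way to do that, not the notebook's own code: the `train_and_save` name, the `sketch_model_*.sav` filenames, and the use of the Iris dataset with a temporary directory are all assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def train_and_save(model, filename, X_train, X_test, y_train, y_test):
    """Hypothetical generic version of the training_* functions."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    f1 = f1_score(y_test, pred, average='weighted')
    # The context manager closes the file once the model is written
    with open(filename, 'wb') as f:
        pickle.dump(model, f)
    return accuracy, f1, filename


# Stand-in data so the sketch is self-contained (not the XTern dataset)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42)

# Same model settings as the three functions above
models = {
    'dtree': DecisionTreeClassifier(max_depth=10),
    'svm': svm.SVC(kernel='rbf', gamma=0.5, C=0.1),
    'knn': KNeighborsClassifier(n_neighbors=10),
}

out_dir = tempfile.mkdtemp()
results = {}
for name, model in models.items():
    path = os.path.join(out_dir, 'sketch_model_%s.sav' % name)
    results[name] = train_and_save(
        model, path, X_train, X_test, y_train, y_test)
    print(name, "Accuracy : %.2f" % (results[name][0] * 100))
```

Adding another classifier then only requires a new entry in the `models` dictionary.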


In [9]:
def training_knn(X_train, X_test, y_train, y_test):
    # Fit a 10-nearest-neighbors classifier and predict on the test set
    knn = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
    pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    f1 = f1_score(y_test, pred, average='weighted')

    # Persist the fitted model; the context manager closes the file handle
    filename = 'finalized_model_knn.sav'
    with open(filename, 'wb') as f:
        pickle.dump(knn, f)

    print('Accuracy : ', "%.2f" % (accuracy*100))
    print('F1 : ', "%.2f" % (f1*100))
    return accuracy, f1, filename


In [10]:
X_train,X_test,y_train,y_test = read_data("XTern 2024 Artificial Intelegence Data Set - Xtern_TrainData.csv")


     Year                    Major                University  Time  \
0  Year 2                  Physics  Indiana State University    12   
1  Year 3                Chemistry     Ball State University    14   
2  Year 3                Chemistry         Butler University    12   
3  Year 2                  Biology  Indiana State University    11   
4  Year 3  Business Administration         Butler University    12   

                                               Order  
0                               Fried Catfish Basket  
1                                    Sugar Cream Pie  
2                                 Indiana Pork Chili  
3                               Fried Catfish Basket  
4  Indiana Corn on the Cob (brushed with garlic b...  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Year        5000 non-null   object
 1   Major       5000 n

```python
# Train and evaluate three different machine learning models

# Train a Decision Tree model and store the results
acc_dtree, f1_dtree, dtree_mod = training_dtree(X_train, X_test, y_train, y_test)

# Train an SVM model and store the results
acc_svm, f1_svm, svm_mod = training_svm(X_train, X_test, y_train, y_test)

# Train a k-Nearest Neighbors (k-NN) model and store the results
acc_knn, f1_knn, knn_mod = training_knn(X_train, X_test, y_train, y_test)
```


In [11]:
acc_dtree,f1_dtree,dtree_mod = training_dtree(X_train,X_test,y_train,y_test)
acc_svm,f1_svm,svm_mod = training_svm(X_train,X_test,y_train,y_test)
acc_knn,f1_knn,knn_mod = training_knn(X_train,X_test,y_train,y_test)

Accuracy :  64.10
F1 :  63.77
Accuracy :  58.90
F1 :  58.58
Accuracy :  61.70
F1 :  61.13


```python
# Load the trained Decision Tree model
with open(dtree_mod, 'rb') as f:
    dtree_model = pickle.load(f)

# Evaluate the Decision Tree model on the test data
accuracy = dtree_model.score(X_test, y_test)
```


In [12]:
with open(dtree_mod, 'rb') as f:
    dtree_model = pickle.load(f)
accuracy = dtree_model.score(X_test, y_test)
print("Accuracy for Decision Tree : ",accuracy*100)

Accuracy for Decision Tree :  64.1


```python
# Load the trained SVM model
with open(svm_mod, 'rb') as f:
    svm_model = pickle.load(f)

# Evaluate the SVM model on the test data
accuracy = svm_model.score(X_test, y_test)
```



In [13]:
with open(svm_mod, 'rb') as f:
    svm_model = pickle.load(f)
accuracy = svm_model.score(X_test, y_test)
print("Accuracy for Support Vector Machines : ",accuracy*100)

Accuracy for Support Vector Machines :  58.9


```python
# Load the trained k-NN model
with open(knn_mod, 'rb') as f:
    knn_model = pickle.load(f)

# Evaluate the k-NN model on the test data
accuracy = knn_model.score(X_test, y_test)
```



In [14]:
with open(knn_mod, 'rb') as f:
    knn_model = pickle.load(f)
accuracy = knn_model.score(X_test, y_test)
print("Accuracy for K-Nearest-Neighbour : ",accuracy*100)

Accuracy for K-Nearest-Neighbour :  61.7
