# Homework 7 - Classification
In this assignment, we will apply some basic classification methods to the soccer database (found on Canvas). We will first need to import all the libraries required for this guide.

## Instructions
In this assignment, you will be performing the specified classification methods in Python.

---

### Step 1: Load Data

- Load the following attributes from `Player_Attributes`:
  - `gk_reflexes`
  - `gk_kicking`
  - `gk_handling`

These values will be used for classification.

---

### Step 2: Classification (Part 1)

- Use `gk_reflexes` and `gk_kicking`.
- Choose one of the attributes as the **target attribute**.
- Generate **five classes** in the target attribute by binning its range of values.
- Split the data into **training** and **testing** sets.
- Apply the following methods and print the resulting `accuracy_score` from `sklearn.metrics`:
  - Logistic Regression
  - Support Vector Machine (SVM)
  - Decision Tree
  - K-Nearest Neighbors (KNN)

---

### Step 3: Classification (Part 2)

- Repeat **Step 2**, this time using:
  - `gk_kicking` and `gk_handling`

- Again, print the corresponding `accuracy_score` for each classification method.

---

### Step 4: Analysis (Comment in Python file)

Answer the following question as a **comment** in your Python file:

> Since this assignment (Classification) and the previous assignment (Regression) are with the same data, can you compare and conclude which technique is yielding best results?

---

### Dataset Overview
The dataset covers information about soccer players in sqlite format. This file is located in the `Datasets` directory of this repository. The file is called `fifa_soccer_dataset.sqlite.gz`. **This is the same file from the previous homework (assignment 4).**

If you haven't decompressed the file, you may need to follow the instructions below to decompress it.

**IMPORTANT** The database is compressed and needs to be decompressed before use. You can do this by running the following command in your terminal on Linux or macOS:

```bash
gunzip Datasets/fifa_soccer_dataset.sqlite.gz
```

If you are using Windows, you can run the following commands in PowerShell:

```powershell
$sourceFile = "$PWD\Datasets\fifa_soccer_dataset.sqlite.gz"
$destinationFile = "$PWD\Datasets\fifa_soccer_dataset.sqlite"

$inputStream = [System.IO.File]::OpenRead($sourceFile)
$outputStream = [System.IO.File]::Create($destinationFile)
$gzipStream = New-Object System.IO.Compression.GzipStream($inputStream, [System.IO.Compression.CompressionMode]::Decompress)
$gzipStream.CopyTo($outputStream)

$gzipStream.Close()
$outputStream.Close()
$inputStream.Close()
```

Alternatively, you can extract the file using the GUI of your operating system.
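
If you would rather stay in Python on any operating system, the same decompression can be sketched with the standard library. This is a minimal sketch, not part of the original instructions: `decompress_gzip` is a hypothetical helper name, and the path is assumed from the `Datasets` directory described above.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(source: Path, destination: Path) -> Path:
    """Stream-decompress a .gz file to `destination` and return that path."""
    with gzip.open(source, "rb") as compressed, open(destination, "wb") as out:
        # copyfileobj works in chunks, so the whole file is never held in memory.
        shutil.copyfileobj(compressed, out)
    return destination


if __name__ == "__main__":
    # Path assumed from the instructions above; adjust it to your checkout.
    src = Path("Datasets/fifa_soccer_dataset.sqlite.gz")
    if src.exists():
        # with_suffix("") drops the trailing ".gz" from the file name.
        print(decompress_gzip(src, src.with_suffix("")))
```

The original `.gz` file is left in place, unlike `gunzip`, which removes it by default.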


### Submission Guidelines

- Submit your completed notebook as an HTML export or a PDF file.

To export to HTML, if you are on Jupyter, select `File` > `Export Notebook As` > `HTML`.

If you are on VSCode, you can use the `Jupyter: Export to HTML` command:
- Open the command palette (Ctrl+Shift+P or Cmd+Shift+P on Mac).
- Search for `Jupyter: Export to HTML`.
- Save the HTML file to your computer and submit it via Canvas.

---


In [2]:
import pandas as pd
import sqlite3
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

To start this assignment, we first need to connect to the SQLite database. Do so below.

In [3]:
# Input Code Here
dataset_path = "fifa_soccer_dataset.sqlite" # Fix your path accordingly

# Connect to the SQLite database
conn = sqlite3.connect(dataset_path)

# Query to fetch the required attributes: gk_reflexes, gk_kicking, and gk_handling
query = "SELECT gk_reflexes, gk_kicking, gk_handling FROM Player_Attributes"

Now that we're connected, let's grab the required attributes from the `Player_Attributes` table. Part 1 uses `gk_reflexes` and `gk_kicking`.

In [4]:
player_attr_df = pd.read_sql_query(query, conn)
player_attr_df.head()

|   | gk_reflexes | gk_kicking | gk_handling |
|---|-------------|------------|-------------|
| 0 | 8.0         | 10.0       | 11.0        |
| 1 | 8.0         | 10.0       | 11.0        |
| 2 | 8.0         | 10.0       | 11.0        |
| 3 | 7.0         | 9.0        | 10.0        |
| 4 | 7.0         | 9.0        | 10.0        |
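
If you want to try the connect-and-query pattern without the real file, the sketch below builds a tiny in-memory stand-in. The table contents are made up for illustration, and `demo_conn` and `demo_df` are hypothetical names; only the table and column names come from the query above.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# closing() guarantees the connection is closed once the block finishes.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute(
        "CREATE TABLE Player_Attributes "
        "(gk_reflexes REAL, gk_kicking REAL, gk_handling REAL)"
    )
    # Made-up rows, including one NULL to mimic a missing value.
    demo_conn.executemany(
        "INSERT INTO Player_Attributes VALUES (?, ?, ?)",
        [(8.0, 10.0, 11.0), (7.0, 9.0, 10.0), (None, 12.0, 9.0)],
    )
    demo_df = pd.read_sql_query(
        "SELECT gk_reflexes, gk_kicking, gk_handling FROM Player_Attributes",
        demo_conn,
    )

print(demo_df.shape)         # rows and columns returned by the query
print(demo_df.isna().sum())  # count of missing values per column
```

Checking `isna().sum()` right after loading shows how many rows a later `dropna()` call would remove.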


Drop the rows that have missing values.

In [5]:
# Drop rows with missing values in the selected columns
player_attr_df = player_attr_df.dropna(subset=['gk_reflexes', 'gk_kicking', 'gk_handling'])

For this classification, we'll use `gk_reflexes` and `gk_kicking`. Pull these values into `x` and `y`.

In [7]:
x = player_attr_df[['gk_kicking']].values # Your Code Here
y = player_attr_df[['gk_reflexes']].values # Your Code Here

The target variable should be reduced to just 5 classes.

In [8]:
y = pd.cut(player_attr_df['gk_reflexes'],
           bins=5,
           labels=['Class 1', 'Class 2', 'Class 3', 'Class 4', 'Class 5']).values
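
To see what `bins=5` does, here is a small sketch on made-up ratings (not the soccer data). It shows that `pd.cut` builds five equal-width intervals, so skewed data can leave some classes nearly empty. The `pd.qcut` call is included only as a contrast and is an assumption on my part, not something the assignment asks for.

```python
import pandas as pd

# Made-up ratings, deliberately skewed toward low values.
ratings = pd.Series([5, 6, 7, 8, 9, 10, 11, 12, 40, 65, 80, 95], dtype=float)
labels = ["Class 1", "Class 2", "Class 3", "Class 4", "Class 5"]

# bins=5 splits the range [min, max] into five equal-width intervals.
equal_width, edges = pd.cut(ratings, bins=5, labels=labels, retbins=True)
print(edges)  # the six interval boundaries
print(equal_width.value_counts().sort_index())

# pd.qcut instead places roughly the same number of rows in each class.
equal_freq = pd.qcut(ratings, q=5, labels=labels)
print(equal_freq.value_counts().sort_index())
```

With these values, eight of the twelve ratings land in `Class 1` under equal-width binning, which is the kind of imbalance to keep in mind when reading accuracy scores.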

Let's split the data set into training and test sets using the `train_test_split()` function. We'll also want to standardize our `x` variable: fit a `StandardScaler` on the training set with `fit_transform()`, then apply the same scaling to the test set with `transform()`.

In [11]:
# (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.3,
                                                    random_state=0)

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
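
The difference between `fit_transform()` and `transform()` is easy to check on a toy array. This sketch uses made-up numbers and hypothetical names (`train`, `test`, `scaler`); it only illustrates that the test set is scaled with the mean and standard deviation learned from the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

scaler = StandardScaler()
train_std = scaler.fit_transform(train)  # learns mean and std from train only
test_std = scaler.transform(test)        # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(train_std.ravel(), test_std.ravel())
```

Fitting the scaler on the training set only keeps information about the test set out of the preprocessing step.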

To perform a logistic regression, we'll use the `LogisticRegression()` function. This may take a few moments to run.

In [12]:
lr = LogisticRegression(C=1000.0, random_state=0, max_iter=1000)
lr.fit(X_train_std, y_train.ravel())
y_pred_lr = lr.predict(X_test_std)

print("Accuracy :", accuracy_score(y_test, y_pred_lr))

Accuracy : 0.85348452032106


Great! Let's try applying SVM instead. Try using `SVC()` below, then use the same prediction and output methods as the above cell.

In [13]:
svm = SVC(kernel='linear', C=1.0, random_state=0, cache_size=7000)
svm.fit(X_train_std, y_train.ravel())
y_pred_svm = svm.predict(X_test_std)

print(f'Accuracy: {accuracy_score(y_test, y_pred_svm)}')

Accuracy: 0.8662431974955863


Let's try using a KNeighbors classifier. We can call the `KNeighborsClassifier()` function and supply 2 parameters: `n_neighbors=5` and `metric='euclidean'`. Once you run this method, display the accuracy of your model as you did in the above cells.

In [14]:
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train_std, y_train.ravel())
y_pred_knn = knn.predict(X_test_std)

print("Accuracy:", accuracy_score(y_test, y_pred_knn))

Accuracy: 0.8586535136414102
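
The instructions also list a Decision Tree, which the cells above do not cover. Below is a minimal sketch of how one could be fitted with `DecisionTreeClassifier`. It runs on synthetic stand-in data so that it is self-contained; the data, the variable names, and `max_depth=4` are all assumptions for illustration, not results from the soccer database. In the notebook you would fit on `X_train_std` and `y_train` instead.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): two correlated ratings in [1, 100].
rng = np.random.default_rng(0)
kicking = rng.uniform(1, 100, size=500)
reflexes = np.clip(kicking + rng.normal(0, 5, size=500), 1, 100)

X_demo = kicking.reshape(-1, 1)
y_demo = pd.cut(reflexes, bins=5, labels=False)  # integer class codes 0..4

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Trees split on thresholds, so feature scaling is not required here.
# max_depth=4 is an illustrative choice, not a tuned value.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_tr, y_tr)
tree_acc = accuracy_score(y_te, tree.predict(X_te))
print("Accuracy:", tree_acc)
```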


Let's repeat the above steps with `gk_kicking` and `gk_handling`.

In [15]:
# Feature: use gk_handling
X_part2 = player_attr_df[['gk_handling']].values

# Target: discretize gk_kicking into 5 classes
y_part2 = pd.cut(player_attr_df['gk_kicking'],
                 bins=5,
                 labels=['Class 1', 'Class 2', 'Class 3', 'Class 4', 'Class 5']).values

# pd.value_counts() is deprecated; count through a Series instead
print(pd.Series(y_part2).value_counts())

Class 1    142652
Class 4     21334
Class 3     14026
Class 2      3118
Class 5      2012
Name: count, dtype: int64
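
These counts are heavily imbalanced, so it helps to know what accuracy a model would get by always predicting the most common class. The sketch below computes that baseline from the counts printed above. Note that it uses the counts for the whole data set, so the figure for the 30% test split will differ slightly.

```python
# Class counts copied from the value_counts() output above.
class_counts = {
    "Class 1": 142652,
    "Class 2": 3118,
    "Class 3": 14026,
    "Class 4": 21334,
    "Class 5": 2012,
}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_label] / total

print(f"Majority class: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy score for this target is best read against this baseline of roughly 0.78 rather than against zero.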



In [16]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_part2, y_part2,
                                                        test_size=0.3,
                                                        random_state=0)

sc2 = StandardScaler()
X_train2_std = sc2.fit_transform(X_train2)
X_test2_std = sc2.transform(X_test2)

In [17]:
lr2 = LogisticRegression(C=1000.0, random_state=0, max_iter=1000)
lr2.fit(X_train2_std, y_train2.ravel())
y_pred_lr2 = lr2.predict(X_test2_std)

print("Accuracy:", accuracy_score(y_test2, y_pred_lr2))

Accuracy: 0.8844984802431611


In [18]:
svm2 = SVC(kernel='linear', C=1.0, random_state=0, cache_size=7000)
svm2.fit(X_train2_std, y_train2.ravel())
y_pred_svm2 = svm2.predict(X_test2_std)

print("Accuracy:", accuracy_score(y_test2, y_pred_svm2))

Accuracy: 0.8857361265311323


In [19]:
knn2 = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn2.fit(X_train2_std, y_train2.ravel())
y_pred_knn2 = knn2.predict(X_test2_std)

print("Accuracy: ", accuracy_score(y_test2, y_pred_knn2))

Accuracy:  0.8857725278925431


Lastly, in the cell below, answer the question:
Since this assignment (Classification) and the previous assignment (Regression) are with the same data, can you compare and conclude which technique is yielding best results?

In [None]:
"""
The regression results (R² ≈ 0.93) show that nearly 93% of the variance in gk_handling is explained by gk_reflexes,
indicating a strong continuous relationship. Regression keeps the full detail of the data.

In contrast, discretizing the attributes into five classes for classification discards that granularity: the models
only predict which of five broad, arbitrary bins a value falls into, and they reach accuracies of about 0.85 to 0.89.
R² and accuracy are different metrics, so the comparison is not exact, but for this data regression is the more
effective technique.
"""