In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
df = pd.read_csv("file-name.csv",encoding="latin-1")

In [3]:
df.head()

Unnamed: 0,PLAYER,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,BA,HOF
0,Ty Cobb,24,3035,11434,2246,4189,724,295,117,726,1249,357,892,178,0.366,1
1,Stan Musial,22,3026,10972,1949,3630,725,177,475,1951,1599,696,78,31,0.331,1
2,Tris Speaker,22,2789,10195,1882,3514,792,222,117,724,1381,220,432,129,0.345,1
3,Derek Jeter,20,2747,11195,1923,3465,544,66,260,1311,1082,1840,358,97,0.31,1
4,Honus Wagner,21,2792,10430,1736,3430,640,252,101,0,963,327,722,15,0.329,1


Drop the Categorical columns that are not important whil training the dataset

In [4]:
df = df.drop(columns=["PLAYER","CS"])

We can see that the two columns are dropped from the dataframe

In [5]:
df.head()

Unnamed: 0,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,BA,HOF
0,24,3035,11434,2246,4189,724,295,117,726,1249,357,892,0.366,1
1,22,3026,10972,1949,3630,725,177,475,1951,1599,696,78,0.331,1
2,22,2789,10195,1882,3514,792,222,117,724,1381,220,432,0.345,1
3,20,2747,11195,1923,3465,544,66,260,1311,1082,1840,358,0.31,1
4,21,2792,10430,1736,3430,640,252,101,0,963,327,722,0.329,1


We have created a set of Independent variable stored in X and Dependent variable stored in y. We hav eused index numbers where we know the last column is the target variable

In [6]:
X = df.iloc[:,0:13]

In [7]:
y = df.iloc[:,13]

Now we will split the X and y into training and test dataset.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state = 1,test_size = 0.2)

We have used min max scaler in  this case.
Min-Max Scaling (Normalisation) and Standardisation are both techniques for feature scaling in data preprocessing, but they serve slightly different purposes and have distinct implementations.

## Min Max Scaler which is also known as Normalisation:

# Definition: 
Rescales the data to fit within a specific range, typically [0, 1].

# Formula: 
Scaled Value = (X - X_min)/(X_max - X_min)

If you want to fit the data in specific range [a,b] then
Scaled Value = a +  (X - X_min)/(X_max - X_min) * (b-a)

# Purpose: 
Suitable for models sensitive to magnitude, but not to variance, such as neural networks and distance-based algorithms (e.g., KNN, K-means).

# When to Use:
1. Algorithms Sensitive to Magnitudes
Normalization is ideal for algorithms where the magnitude or absolute scale of the data directly impacts the results. For example:

Neural Networks: Normalization ensures that inputs fall within a uniform range, helping gradient-based optimization algorithms converge faster.

K-Nearest Neighbors (KNN) and K-Means Clustering: These algorithms use distance metrics (e.g., Euclidean distance), so features with larger magnitudes can dominate unless normalized. SVM too

Principal Component Analysis (PCA): While PCA often benefits from standardization, normalization is used if features must be scaled within a bounded range.

2. When Data Needs to Fit a Specific Range
Some applications or algorithms require data to lie within a bounded range. For example:

Image processing: Pixel values are often normalized to [0,1] 0r [-1,1]
Certain activation functions (e.g., sigmoid, tanh) in neural networks perform better with normalized inputs.

3. When Data Has a Large Range
Normalization can help mitigate the impact of features with extremely large ranges on the model's performance.

4. When Data Has a Large Number of Zero Values
Normalization can help reduce the impact of zero values on the model's performance.

5. When Data Has a Large Number of Missing Values
Normalization can help reduce the impact of missing values on the model's performance.

6. When Features Have Different Units
Normalization ensures that features measured in different units (e.g., kilograms, meters, dollars) are scaled to a comparable range, preventing one feature from disproportionately influencing the results.

7. When Dealing with Sparse or Binary Data
For datasets with binary features or sparse representations, normalization can help maintain the sparsity and ensure values stay within a meaningful range.

8. When Using Algorithms with Predefined Ranges
Some algorithms, such as specific regularization techniques or those relying on bounded optimizations, perform better with normalized data.

9. When Feature Values Are Naturally Bounded
Normalization is suitable when data inherently lies within a bounded range e.g Percentage or probability.

In [9]:
scaler = MinMaxScaler(feature_range=(0,1))

In [10]:
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

In [11]:
knn = KNeighborsClassifier(n_neighbors=8)

In [12]:
knn.fit(X_train,y_train)

In [13]:
y_pred = knn.predict(X_test)
y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0])

In [14]:
knn.score(X_test,y_test)

0.8817204301075269

In [15]:
cm = confusion_matrix(y_test,y_pred)
cm

array([[66,  3],
       [ 8, 16]])

Now we find out the classification report which consists of precision, recall and other 

In [16]:
cr = classification_report(y_test,y_pred)

In [17]:
print(cr)

              precision    recall  f1-score   support

           0       0.89      0.96      0.92        69
           1       0.84      0.67      0.74        24

    accuracy                           0.88        93
   macro avg       0.87      0.81      0.83        93
weighted avg       0.88      0.88      0.88        93

