### Q1. Data Holdout
### Load the dataset auto-mpg.csv and perform the following tasks:
#### 1. Print the shape of the dataset.
#### 2. Split the dataset into training (80%) and testing (20%) subsets.
#### 3. Use a random state for reproducibility.
#### 4. Print the training and test sets and their lengths.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("auto-mpg.csv")
print(df.shape)

X = df.drop("mpg", axis=1)
y = df["mpg"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train)
print(X_test)
print(len(X_train), len(X_test))

(398, 9)
     cylinders  displacement  horsepower  weight  acceleration  model_year  \
3            8         304.0       150.0  3433.0          12.0          70   
18           4          97.0        88.0  2130.0          14.5          70   
376          4          91.0        68.0  2025.0          18.2          82   
248          4          91.0        60.0  1800.0          16.4          78   
177          4         115.0        95.0  2694.0          15.0          75   
..         ...           ...         ...     ...           ...         ...   
71           3          70.0        97.0  2330.0          13.5          72   
106          8         350.0       180.0  4499.0          12.5          73   
270          4         134.0        95.0  2515.0          14.8          78   
348          4          89.0        62.0  2050.0          17.3          81   
102          4          97.0        46.0  1950.0          21.0          73   

     origin                   car_name  
3         1  
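
As a quick sanity check on the split sizes, the sketch below repeats the 80/20 holdout on a synthetic frame with the same number of rows as the printed shape (398). The column names `x` and `mpg` in `demo` are placeholders chosen for illustration, not the real columns of auto-mpg.csv.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 398 rows (same row count as the printed shape)
demo = pd.DataFrame({"x": np.arange(398), "mpg": np.arange(398) * 0.1})

X_demo = demo.drop("mpg", axis=1)
y_demo = demo["mpg"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The test size is rounded up: ceil(0.2 * 398) = 80, leaving 318 for training
print(len(X_tr), len(X_te))

# Calling again with the same random_state reproduces the same partition
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_te.index.equals(X_te2.index))
```

Fixing `random_state` is what makes the printed rows repeatable between runs; changing the seed would give a different but equally valid 318/80 partition.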

### Q2. Implementing 5-Fold Cross-Validation
### Using the dataset btissue.csv, do the following:
#### 1. Create a 5-fold cross-validation.
#### 2. For each fold (starting with 0), print:
##### - fold number
##### - indices of the training set
##### - indices of the test set
#### 3. Print the training set, the test set, and their lengths for the 3rd iteration.

In [None]:
import pandas as pd
from sklearn.model_selection import KFold

df = pd.read_csv("btissue.csv")

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# enumerate() starts at 0 by default, matching "starting with 0" in the task
for fold, (train_idx, test_idx) in enumerate(kf.split(df)):
    print("Fold:", fold)
    print("Train indices:", train_idx)
    print("Test indices:", test_idx)
    print()

    # fold index 2 is the 3rd iteration
    if fold == 2:
        print("Train set:\n", df.iloc[train_idx])
        print("Test set:\n", df.iloc[test_idx])
        print("Train length:", len(train_idx))
        print("Test length:", len(test_idx))


Fold: 0
Train indices: [  1   2   3   5   6   7   8   9  13  14  15  16  17  19  20  21  22  23
  24  25  26  27  28  29  31  32  34  35  36  37  38  39  40  41  43  45
  46  48  49  50  51  52  53  54  56  57  58  59  60  61  62  63  66  68
  69  70  71  72  73  74  75  76  77  82  83  84  85  86  87  88  89  90
  91  92  94  95  96  98  99 101 102 103 104 105]
Test indices: [  0   4  10  11  12  18  30  33  42  44  47  55  64  65  67  78  79  80
  81  93  97 100]

Fold: 1
Train indices: [  0   1   2   3   4   6   7   8  10  11  12  13  14  17  18  19  20  21
  23  24  25  27  29  30  32  33  34  37  38  41  42  43  44  46  47  48
  49  50  51  52  54  55  56  57  58  59  60  61  63  64  65  66  67  68
  69  70  71  72  74  75  78  79  80  81  82  83  84  85  86  87  89  90
  91  92  93  94  96  97  98  99 100 101 102 103 105]
Test indices: [  5   9  15  16  22  26  28  31  35  36  39  40  45  53  62  73  76  77
  88  95 104]

Fold: 2
Train indices: [  0   1   2   4   5   9  10  11  1
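
The fold output can be checked structurally without the data file. The sketch below assumes 106 rows (the printed indices run from 0 to 105) and uses a plain index array as a stand-in for btissue.csv; `rows`, `kf_demo`, `test_sizes`, and `seen` are names introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for the 106 rows implied by the printed indices 0..105
rows = np.arange(106)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

test_sizes = []
seen = []
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(rows)):
    # Train and test indices never overlap within a fold
    assert np.intersect1d(train_idx, test_idx).size == 0
    test_sizes.append(len(test_idx))
    seen.extend(test_idx.tolist())

# 106 is not divisible by 5, so the first fold gets one extra test row
print(test_sizes)

# Across the five folds, every row lands in a test set exactly once
print(sorted(seen) == rows.tolist())
```

This matches the printed output, where fold 0 has 22 test indices and fold 1 has 21.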

### Q3. Bootstrap Sampling
### With the dataset btissue.csv, perform the following:
#### 1. Extract the predictors.
#### 2. Generate a bootstrap sample of size 100.
#### 3. Display the first 10 rows of the resampled data.

In [5]:
import pandas as pd
from sklearn.utils import resample

df = pd.read_csv("btissue.csv")
X = df.iloc[:,0:9]

boot = resample(X, n_samples=100, replace=True, random_state=42)

print(boot.head(10))

              I0     PA500       HFS           DA           Area        A/DA  \
102  2600.000000  0.200538  0.208043  1063.441427  174480.476200  164.071543   
51    274.993396  0.147131  0.137532    66.457943    1217.415651   18.318588   
92   1800.000000  0.091979  0.205251   362.863321   15021.553890   41.397278   
14    485.668806  0.230209  0.134041   253.893699    8135.968359   32.044783   
71   1385.664721  0.092328  0.089361   202.480044    8785.028733   43.387134   
60    197.000000  0.132645  0.074002    33.460653     409.647141   12.242652   
20    500.000000  0.192684  0.194779   144.688578    3055.012963   21.114403   
102  2600.000000  0.200538  0.208043  1063.441427  174480.476200  164.071543   
82   1647.939811  0.080983  0.086568   576.770376   11852.485060   20.549747   
86   2100.000000  0.121649  0.377689   450.551667   35671.606290   79.173176   

         Max IP          DR            P  
102  418.687286  977.552367  2664.583623  
51    40.849678   52.421008   327
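
Row 102 appears twice in the first ten rows above because a bootstrap sample is drawn with replacement. The sketch below illustrates this on a made-up frame (the columns `a` and `b` are hypothetical, and the 106-row size is an assumption based on the indices seen earlier), and also collects the out-of-bag rows that were never drawn.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical stand-in for the predictors: 106 rows, two made-up columns
X_demo = pd.DataFrame({"a": np.arange(106), "b": np.arange(106) ** 2})

boot_demo = resample(X_demo, n_samples=100, replace=True, random_state=42)

# Rows are drawn with replacement, so index labels can repeat
n_unique = boot_demo.index.nunique()
print("rows drawn:", len(boot_demo), "distinct rows:", n_unique)

# Out-of-bag rows: never drawn, so they can serve as a validation set
oob = X_demo.loc[~X_demo.index.isin(boot_demo.index)]
print("out-of-bag rows:", len(oob))
```

Note that `resample` keeps the original index labels, which is why duplicates are visible directly in the printed index column.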

### Q4. Random Forest Classification and Evaluation
### Using the dataset btissue.csv, carry out the steps below:
#### 1. Separate predictors and target.
#### 2. Perform a holdout split (80% training, 20% testing).
#### 3. Train a Random Forest Classifier.
#### 4. Evaluate the model using:
##### - accuracy_score
##### - confusion_matrix

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv("btissue.csv")

# 1 (using iloc)
X = df.iloc[:, 0:9]      # first 9 columns = predictors
y = df.iloc[:, 9]        # 10th column = target

# 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# 4
pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))


Accuracy: 0.7727272727272727
Confusion Matrix:
 [[3 0 0 0 0 0]
 [0 6 0 0 0 0]
 [0 0 4 0 0 0]
 [0 0 0 0 0 2]
 [0 0 0 0 2 2]
 [0 0 0 0 1 2]]
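
The two metrics above are linked: accuracy is the sum of the confusion matrix diagonal divided by the total count. The sketch below copies the printed matrix into NumPy to show this, and adds per-class recall as an extra view (recall is not part of the original task).

```python
import numpy as np

# Confusion matrix copied from the output above
# (scikit-learn convention: rows = true class, columns = predicted class)
cm = np.array([
    [3, 0, 0, 0, 0, 0],
    [0, 6, 0, 0, 0, 0],
    [0, 0, 4, 0, 0, 0],
    [0, 0, 0, 0, 0, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 1, 2],
])

accuracy = np.trace(cm) / cm.sum()      # 17 correct out of 22
recall = np.diag(cm) / cm.sum(axis=1)   # per-class recall

print("Accuracy:", accuracy)
print("Per-class recall:", recall)
```

The first three classes are classified perfectly, while both test samples of the fourth class are assigned to the sixth class, which is where most of the lost accuracy comes from.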


### Q5. Regression and Evaluation
### Load the dataset auto-mpg.csv and perform the following tasks:
#### 1. Remove all rows where the value of any column is ‘NaN’.
#### 2. Separate predictors and target.
#### 3. Perform a holdout split (80% training, 20% testing).
#### 4. Train a Linear Regression Model.
#### 5. Evaluate the model using:
##### - mean_squared_error
##### - r2_score

In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# 1
df = pd.read_csv("auto-mpg.csv")
df = df.dropna()

# 2  (mpg = first col, car_name = last col)
X = df.iloc[:, 1:8]   # cylinders ... origin
y = df.iloc[:, 0]     # mpg

# 3
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4
lr = LinearRegression()
lr.fit(X_train, y_train)

# 5
pred = lr.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("R2:", r2_score(y_test, pred))

MSE: 10.710864418838373
R2: 0.790150038676035
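
To make the two scores concrete, the sketch below computes MSE and R2 by hand and compares them with scikit-learn. The arrays `y_true` and `y_hat` are toy values chosen for illustration only; they do not come from auto-mpg.csv.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values chosen for illustration only (not from auto-mpg.csv)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))

# RMSE expresses the MSE reported above in the target's own units (mpg)
rmse = np.sqrt(10.710864418838373)
print(rmse)
```

Taking the square root of the reported MSE gives an RMSE of roughly 3.27, which can be read as a typical prediction error in mpg.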


### Q6. Clustering and Quality Measures
### Using the dataset Dataset_spine.csv:
#### 1. Exclude the class (last) column to form predictor variables.
#### 2. Apply K-Means clustering with n_clusters=3 and random_state=123.
#### 3. Compute and print:
##### - v-measure score
##### - Silhouette score

In [None]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score, silhouette_score

df = pd.read_csv("Dataset_spine.csv")

# 1. predictors (without class column)
df_woc = df.iloc[:, 0:12]
df_class = df.iloc[:, 12]

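# pick 3 features
# (note: the task asks for all predictors; this solution clusters on a 3-column subset)
X = df_woc[["Col1", "Col5", "Col9"]].to_numpy()

# 2. K-Means
# n_init=10 is set explicitly (the default at the time of this run) to avoid the n_init warning
kmeans = KMeans(n_clusters=3, n_init=10, random_state=123)
cluster_labels = kmeans.fit_predict(X)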

# 3. scores
v = v_measure_score(df_class, cluster_labels)
print("V-measure:", v)

sil = silhouette_score(X, cluster_labels)
print("Silhouette:", sil)

V-measure: 0.11473187808331904
Silhouette: 0.351912933895437
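
The two scores measure different things: V-measure compares the clusters with the known class labels, while the silhouette score only looks at the geometry of the features. As a small sketch of how V-measure is built, the example below uses made-up labels (`true_labels`, `cluster_ids`, and `renamed` are hypothetical, not taken from Dataset_spine.csv) to show that it is the harmonic mean of homogeneity and completeness, and that it ignores how clusters are numbered.

```python
from sklearn.metrics import (
    completeness_score,
    homogeneity_score,
    v_measure_score,
)

# Made-up labels for illustration (not from Dataset_spine.csv)
true_labels = [0, 0, 1, 1, 2, 2]
cluster_ids = [0, 0, 1, 2, 2, 2]

h = homogeneity_score(true_labels, cluster_ids)
c = completeness_score(true_labels, cluster_ids)
v = v_measure_score(true_labels, cluster_ids)

# With the default beta=1, V-measure is the harmonic mean of h and c
print(v, 2 * h * c / (h + c))

# Renaming the clusters does not change the score
renamed = [2, 2, 0, 1, 1, 1]
v_renamed = v_measure_score(true_labels, renamed)
print(v_renamed)
```

This label invariance is why the arbitrary cluster numbers returned by `KMeans` can be passed straight to `v_measure_score` without matching them to class names first.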


