## **BEGINNER'S TUTORIAL ON SCIKIT-LEARN**

In this tutorial, we will explore some common tasks that can be accomplished using scikit-learn, a popular machine learning package in Python. Scikit-learn is known for its simplicity and efficiency in handling various machine learning algorithms. We will cover the following topics:

1. Loading a dataset
2. Splitting the dataset into training, validation, and test sets
3. Training different classification and regression models
4. Finding missing values in the dataset
5. Evaluating model performance using various metrics

By the end of this tutorial, you will have a good understanding of how to use scikit-learn to build and evaluate machine learning models. Let's get started!


### Package Installation & Importation

In [None]:
#execute this cell to install the required packages (if not done already)
! pip install scikit-learn numpy pandas

#### Install Kaggle API
You need to have the Kaggle API installed. You can install it using pip:

In [None]:
import os
! pip install kaggle

#### Set Up Kaggle API Credentials
1. Go to Kaggle's website and sign up.
2. Go to "My Account" (click on your profile picture in the top right corner and then on "My Account").
3. Go to "Settings"
4. Scroll down to the "API" section and click on "Create New API Token". This will download a file called kaggle.json.
5. Place the kaggle.json file in the .kaggle directory in your home directory. You can do this with the following commands:
```sh
    mkdir -p ~/.kaggle
    mv /path/to/kaggle.json ~/.kaggle/
    chmod 600 ~/.kaggle/kaggle.json
```
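
The placement and `chmod 600` steps above can be double-checked from Python. The following is a minimal sketch, not part of the Kaggle tooling: `check_kaggle_credentials` is a hypothetical helper name, and the demonstration runs on a throwaway file so that it never touches real credentials.

```python
import stat
import tempfile
from pathlib import Path


def check_kaggle_credentials(path):
    """Return (exists, owner_only) for a credentials file at `path`.

    Hypothetical helper for illustration; it is not part of the Kaggle API.
    """
    path = Path(path)
    if not path.is_file():
        return (False, False)
    mode = stat.S_IMODE(path.stat().st_mode)
    # owner_only is True when group and others have no permission bits set,
    # which is what `chmod 600` produces on Unix-like systems.
    owner_only = (mode & 0o077) == 0
    return (True, owner_only)


# Demonstrate on a temporary file instead of the real ~/.kaggle/kaggle.json.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "kaggle.json"
    demo.write_text("{}")
    demo.chmod(0o600)
    demo_result = check_kaggle_credentials(demo)
    missing_result = check_kaggle_credentials(Path(tmp) / "missing.json")

print(demo_result, missing_result)
```

To check the real file, call `check_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json")`.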

#### Joining Relevant Competition on Kaggle
1. Make sure you are logged into Kaggle with the same account that generated the API token.
2. Go to https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques and click on "Join Competition".
3. Once you have joined, the regression dataset (House Price Prediction) can be downloaded through the API later in this tutorial.

# Classification

#### Download the Dataset
Now you can use the Kaggle API to download the dataset:

In [None]:
filename = 'breast-cancer-wisconsin-data.zip'
os.makedirs("classification", exist_ok=True)
for root, dirs, file in os.walk("./", topdown=True):
    if filename in file:
        break
else:
    !kaggle datasets download -d uciml/breast-cancer-wisconsin-data
    !mv breast-cancer-wisconsin-data.zip 'classification/'

Dataset URL: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data
License(s): CC-BY-NC-SA-4.0
Downloading breast-cancer-wisconsin-data.zip to /Users/shyam/Desktop/Engineering/7th_Semester/MI_TA/MI Hands-On/sklearn
  0%|                                               | 0.00/48.6k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 48.6k/48.6k [00:00<00:00, 3.08MB/s]
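
The download cell above relies on Python's `for`/`else`: the `else` branch runs only when the loop finishes without hitting `break`, which here means the zip file was not found anywhere under the current directory. The sketch below isolates that idiom in a plain function. `find_file` is a hypothetical helper name, and the demonstration uses a temporary directory rather than the notebook's folder.

```python
import os
import tempfile


def find_file(filename, start="."):
    """Return the first directory under `start` that contains `filename`, else None."""
    for root, _dirs, files in os.walk(start, topdown=True):
        if filename in files:
            break  # found: the else branch below is skipped
    else:
        return None  # the loop ended without break: the file is not present
    return root


# Demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    target_dir = os.path.join(tmp, "classification")
    os.makedirs(target_dir, exist_ok=True)
    open(os.path.join(target_dir, "data.csv"), "w").close()

    found_dir = find_file("data.csv", start=tmp)
    not_found = find_file("no-such-file.zip", start=tmp)
    found_name = os.path.basename(found_dir)

print(found_name, not_found)
```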


In [None]:
for root, dirs, file in os.walk("./", topdown=True):
    if 'data.csv' in file:
        break
else:
    !unzip classification/breast-cancer-wisconsin-data.zip -d classification/

Archive:  classification/breast-cancer-wisconsin-data.zip
  inflating: classification/data.csv  


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

#### 1. Loading a dataset
The Breast Cancer Wisconsin (Diagnostic) dataset, downloaded above from Kaggle (a version of it also ships with scikit-learn), is a widely used dataset in machine learning and data science. It contains measurements of various features of cell nuclei present in breast cancer biopsies. It is commonly used for binary classification tasks to distinguish between malignant (cancerous) and benign (non-cancerous) tumors.

In [None]:
data = "classification/data.csv"
df = pd.read_csv(data)

# Display the first few rows of the dataset
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,
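
If the Kaggle download is not available, a version of the same dataset can be loaded directly from scikit-learn. This is an alternative sketch, not what the notebook does: the names `X_builtin` and `y_builtin` are chosen here for illustration, the bundled column names differ from the CSV (for example `mean radius` rather than `radius_mean`), and the bundled target is encoded as 0 = malignant, 1 = benign, which is the reverse of the mapping used later in this tutorial.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects instead of NumPy arrays.
bunch = load_breast_cancer(as_frame=True)
X_builtin = bunch.data      # DataFrame with the 30 feature columns
y_builtin = bunch.target    # Series of 0/1 labels

print(X_builtin.shape)
print(list(bunch.target_names))  # label names, in the order of their codes
```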


In [None]:
# Display basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

#### 2. Summary statistics and missing values


In [None]:
# Display summary statistics
print("Display summary statistics: \n",df.describe())

Display summary statistics: 
                  id  radius_mean  texture_mean  perimeter_mean    area_mean  \
count  5.690000e+02   569.000000    569.000000      569.000000   569.000000   
mean   3.037183e+07    14.127292     19.289649       91.969033   654.889104   
std    1.250206e+08     3.524049      4.301036       24.298981   351.914129   
min    8.670000e+03     6.981000      9.710000       43.790000   143.500000   
25%    8.692180e+05    11.700000     16.170000       75.170000   420.300000   
50%    9.060240e+05    13.370000     18.840000       86.240000   551.100000   
75%    8.813129e+06    15.780000     21.800000      104.100000   782.700000   
max    9.113205e+08    28.110000     39.280000      188.500000  2501.000000   

       smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
count       569.000000        569.000000      569.000000           569.000000   
mean          0.096360          0.104341        0.088799             0.048919   
std           0

In [None]:
# Check for missing values
print("missing values:\n", df.isnull().sum())

missing values:
 id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_wors

### Data Preprocessing and Splitting

In [None]:
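# Drop columns that are not useful as features:
# 'Unnamed: 32' (shown as empty in the preview above) and the 'id' identifier

df = df.drop(['Unnamed: 32', 'id'], axis=1)

X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

# Convert the 'diagnosis' column to binary labels (malignant = 1, benign = 0)
y = y.map({'M': 1, 'B': 0})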

print("target: \n",y.head(10))

target: 
 0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: diagnosis, dtype: int64


In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training, validation, and test sets (approximately 70%, 10%, 20%):
# 30% is held out first, then 66% of that held-out part becomes the test set
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.66, random_state=42)

# Check the shapes of the splits
X_train.shape, X_val.shape, X_test.shape

((398, 30), (58, 30), (113, 30))
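
The shapes above follow from simple arithmetic on the two `test_size` values. The sketch below reproduces them, assuming that a fractional `test_size` is rounded up to a whole number of rows and the remainder goes to the other part; that assumption matches the shapes printed above. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_remaining, n_held_out) for a fractional test_size."""
    n_held_out = math.ceil(test_size * n_samples)  # held-out part, rounded up
    return n_samples - n_held_out, n_held_out


n_total = 569
n_train, n_temp = split_sizes(n_total, 0.3)   # first split: train vs. temp
n_val, n_test = split_sizes(n_temp, 0.66)     # second split: validation vs. test

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.1%})")
```

The resulting proportions are close to, but not exactly, 70/10/20 because 0.66 only approximates two thirds.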

#### 3. Training different classification models
In this section, we will demonstrate how to initialize and train different classification models using scikit-learn. While we won't go into the detailed workings of these models, it's important to know that there are multiple algorithms available for classification tasks.


In [None]:
# Initialize and train a Logistic Regression Model
log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)

In [None]:
# Initialize and train DecisionTree Model
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

In [None]:
# Initialize and train SVM Model
svm_clf = SVC(random_state=42)
svm_clf.fit(X_train, y_train)
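
Support vector machines are generally sensitive to the scale of their inputs, and the features in this dataset range from fractions (such as `smoothness_mean`) to thousands (such as `area_mean`). A common variation, sketched below, is to standardize the features inside a pipeline before the SVM. This is an illustration rather than the notebook's method: it uses scikit-learn's bundled copy of the dataset so that it runs on its own, and the names `X_demo`, `X_tr`, `X_va`, and `scaled_svm` are chosen here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Bundled copy of the dataset, used only to keep this sketch self-contained.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The scaler is fitted on the training rows only, then applied before the SVM.
scaled_svm = make_pipeline(StandardScaler(), SVC(random_state=42))
scaled_svm.fit(X_tr, y_tr)

scaled_acc = scaled_svm.score(X_va, y_va)
print(f"Held-out accuracy with scaling: {scaled_acc:.3f}")
```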

If you wish to look at the predictions of each model separately, try executing `model_name.predict(X_val)`, where `model_name` is `log_reg`, `tree_clf`, or `svm_clf`.

Comparing these predictions with `y_val` gives better insight into how each model is performing.
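
One way to compare predictions with the true labels is a confusion matrix, which counts each combination of true and predicted class. The sketch below uses small made-up label lists (`y_true_demo` and `y_pred_demo` are toy data, not output from the models above) so that the counts are easy to verify by hand.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only: 1 = malignant, 0 = benign.
y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes, both ordered 0 then 1.
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = (int(v) for v in cm.ravel())

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

With the notebook's variables, the equivalent call would be `confusion_matrix(y_val, log_reg.predict(X_val))`.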

#### 4. Visualizing the metrics for each model

In [None]:
# Summary of performance metrics
metrics = {
    'Model': ['Logistic Regression', 'Decision Tree', 'SVM'],
    'Accuracy': [accuracy_score(y_val, log_reg.predict(X_val)),
                 accuracy_score(y_val, tree_clf.predict(X_val)),
                 accuracy_score(y_val, svm_clf.predict(X_val))],
    'Precision': [precision_score(y_val, log_reg.predict(X_val)),
                  precision_score(y_val, tree_clf.predict(X_val)),
                  precision_score(y_val, svm_clf.predict(X_val))],
    'Recall': [recall_score(y_val, log_reg.predict(X_val)),
               recall_score(y_val, tree_clf.predict(X_val)),
               recall_score(y_val, svm_clf.predict(X_val))],
    'F1-Score': [f1_score(y_val, log_reg.predict(X_val)),
                 f1_score(y_val, tree_clf.predict(X_val)),
                 f1_score(y_val, svm_clf.predict(X_val))]
}

metrics_df = pd.DataFrame(metrics)
metrics_df

Unnamed: 0,Model,Accuracy,Precision,Recall,F1-Score
0,Logistic Regression,0.965517,1.0,0.916667,0.956522
1,Decision Tree,0.931034,0.884615,0.958333,0.92
2,SVM,0.913793,1.0,0.791667,0.883721


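Based on the validation metrics, the **Logistic Regression** model appears to be the best fit for this data. It achieved the highest accuracy (0.966) and F1-score (0.957) and tied with the SVM for the highest precision (1.0). The Decision Tree had a slightly higher recall (0.958 versus 0.917), which matters if missing a malignant case is the main concern.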
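
The four scores in the table can be reproduced from confusion counts using their standard definitions. The counts below are an assumption: they were inferred from the reported scores and the 58-row validation set, not printed by the notebook. They are consistent with 24 malignant and 34 benign validation cases.

```python
def scores(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Inferred counts (assumption), each summing to the 58 validation rows.
counts = {
    "Logistic Regression": dict(tp=22, fp=0, fn=2, tn=34),
    "Decision Tree": dict(tp=23, fp=3, fn=1, tn=31),
    "SVM": dict(tp=19, fp=0, fn=5, tn=34),
}

results = {name: scores(**c) for name, c in counts.items()}
for name, (acc, prec, rec, f1) in results.items():
    print(f"{name}: accuracy={acc:.6f} precision={prec:.6f} "
          f"recall={rec:.6f} f1={f1:.6f}")
```

Read this way, the recall column says that the Logistic Regression missed 2 of the 24 malignant cases, the Decision Tree 1, and the SVM 5.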

# Regression

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

filename = 'house-prices-advanced-regression-techniques.zip'
os.makedirs('regression', mode=0o777, exist_ok=True)
for root, dirs, file in os.walk("./", topdown=True):
    if filename in file:
        break
else:
    !kaggle competitions download -c house-prices-advanced-regression-techniques
    !mv house-prices-advanced-regression-techniques.zip regression/

Downloading house-prices-advanced-regression-techniques.zip to /Users/shyam/Desktop/Engineering/7th_Semester/MI_TA/MI Hands-On/sklearn
100%|█████████████████████████████████████████| 199k/199k [00:00<00:00, 540kB/s]
100%|█████████████████████████████████████████| 199k/199k [00:00<00:00, 538kB/s]


In [None]:
if len(os.listdir("regression/")) == 1:
    ! unzip 'regression/house-prices-advanced-regression-techniques.zip' -d 'regression/'

Archive:  regression/house-prices-advanced-regression-techniques.zip
  inflating: regression/data_description.txt  
  inflating: regression/sample_submission.csv  
  inflating: regression/test.csv     
  inflating: regression/train.csv    


In [None]:
df = pd.read_csv("regression/train.csv")
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [None]:
# Define the target variable and features
target = 'SalePrice'
features = df.drop(columns=[target])
features.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,2,2008,WD,Normal
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,5,2007,WD,Normal
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,9,2008,WD,Normal
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,2,2006,WD,Abnorml
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,12,2008,WD,Normal


In [None]:
# Drop rows with missing target values
df = df.dropna(subset=[target])
df.isnull().sum()

Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64

In [None]:
# Drop columns that contain any missing values (dropna defaults to how='any')
df = df.dropna(axis=1)
df.isnull().sum()

Id               0
MSSubClass       0
MSZoning         0
LotArea          0
Street           0
                ..
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
SalePrice        0
Length: 62, dtype: int64
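
Note that `dropna(axis=1)` uses `how='any'` by default, so it removes every column with at least one missing value (for example `LotFrontage`, with 259 missing entries), which is why the column count falls from 81 to 62. The toy frame below, made up for illustration, contrasts this with `how='all'`, which removes only columns that are entirely empty.

```python
import pandas as pd

# Toy data for illustration: one complete, one partly missing, one empty column.
toy = pd.DataFrame({
    "complete": [1, 2, 3],
    "partial": [1.0, float("nan"), 3.0],
    "empty": [float("nan")] * 3,
})

kept_any = list(toy.dropna(axis=1).columns)             # default how='any'
kept_all = list(toy.dropna(axis=1, how="all").columns)  # only fully empty columns go

print("how='any' keeps:", kept_any)
print("how='all' keeps:", kept_all)
```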

In [None]:
X = df.drop(columns=[target])
y = df[target]

# Identify numerical columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

# Preprocess the data: scale the numerical features only
# (categorical columns are not included in X_scaled)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X[numerical_features])
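
`StandardScaler` standardizes each column to zero mean and unit variance, that is, z = (x - mean) / std, using the population standard deviation. The sketch below applies the same formula by hand to the five `LotArea` values shown in the preview earlier; it is an illustration of the formula, not a replacement for the scaler.

```python
from statistics import fmean, pstdev

# LotArea values from the first five rows of the training data.
lot_area = [8450, 9600, 11250, 9550, 14260]

mu = fmean(lot_area)      # column mean
sigma = pstdev(lot_area)  # population standard deviation (divides by n)

z = [(v - mu) / sigma for v in lot_area]

print(f"mean={mu}, std={sigma:.2f}")
print([round(v, 3) for v in z])
```

After the transformation, the values have mean 0 and standard deviation 1, so columns with very different ranges become comparable.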

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
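
In the cells above, the scaler is fitted on all rows before the split, so the test rows contribute to the scaling statistics. A common alternative is to put the scaler and the model in one pipeline and fit it on the training rows only. The sketch below shows that pattern on synthetic data from `make_regression`, which stands in for the house-price features; the names `X_syn`, `pipe`, and `r2_syn` are chosen here for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, used only to keep this sketch self-contained.
X_syn, y_syn = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from the training rows only;
# score() reuses those statistics when transforming the test rows.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

r2_syn = pipe.score(X_te, y_te)  # R² on the held-out rows
print(f"R² on synthetic test data: {r2_syn:.3f}")
```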

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)
y_pred[:10]

array([157107.77013597, 309240.43532632, 114604.16700413, 180173.91671344,
       300062.74480074,  47476.7036485 , 229838.63937551, 148189.24029287,
        44005.4854943 , 152154.37105333])

In [None]:
# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [None]:
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R²): {r2:.2f}")

Mean Squared Error (MSE): 1394027020.51
R-squared (R²): 0.82
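
The MSE is expressed in squared units of `SalePrice`, which makes it hard to read. Taking its square root gives the root mean squared error (RMSE) in the same units as the target. The sketch below computes it from the value printed above and also spells out the R² formula, 1 - SS_res / SS_tot, on a small made-up example (`r2_manual` and the toy lists are illustrative, not part of the notebook).

```python
import math

# RMSE from the MSE reported above.
mse_reported = 1394027020.51
rmse = math.sqrt(mse_reported)
print(f"RMSE: {rmse:.2f}")


def r2_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values for illustration only.
toy_r2 = r2_manual([3, 5, 7, 9], [2.5, 5.5, 7, 9.5])
print(f"Toy R²: {toy_r2:.4f}")
```

An RMSE of roughly 37,000 means that typical prediction errors are on the order of tens of thousands in `SalePrice` units, which is useful context next to an R² of 0.82.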


In [None]:
if r2 > 0.8:
    print("The model explains a high proportion of the variance in house prices, suggesting a strong fit.")
elif r2 > 0.5:
    print("The model explains a moderate proportion of the variance in house prices, indicating a reasonable fit.")
else:
    print("The model explains a low proportion of the variance in house prices, indicating that it may not fit the data well.")

The model explains a high proportion of the variance in house prices, suggesting a strong fit.


## That's the end of this notebook. We hope you had a fun learning experience!