## **BEGINNER'S TUTORIAL ON SCIKIT-LEARN**

In this tutorial, we will explore some common tasks that can be accomplished using scikit-learn, a popular machine learning package in Python. Scikit-learn is known for its simplicity and efficiency in handling various machine learning algorithms. We will cover the following topics:

1. Loading a dataset
2. Finding missing values in the dataset
3. Splitting the dataset into training, validation, and test sets
4. Training different classification and regression models
5. Evaluating model performance using various metrics

By the end of this tutorial, you will have a good understanding of how to use scikit-learn to build and evaluate machine learning models. Let's get started!


### Package Installation & Importation

In [3]:
#execute this cell to install the required packages (if not done already)
! pip install scikit-learn numpy pandas



#### Install Kaggle API
You need to have the Kaggle API installed. You can install it using pip:

In [5]:
import os
! pip install kaggle



#### Set Up Kaggle API Credentials
1. Go to Kaggle's website and sign up.
2. Go to "My Account" (click on your profile picture in the top right corner and then on "My Account").
3. Go to "Settings"
4. Scroll down to the "API" section and click on "Create New API Token". This will download a file called kaggle.json.
5. Place the kaggle.json file in the .kaggle directory in your home directory. You can do this with the following commands:
```sh
    mkdir -p ~/.kaggle
    mv /path/to/kaggle.json ~/.kaggle/
    chmod 600 ~/.kaggle/kaggle.json
```
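Before calling the API, it can help to confirm that the token is where the Kaggle client expects it. The following is a minimal sketch using only the standard library. `check_kaggle_credentials` is a hypothetical helper name (it is not part of the Kaggle package), and the default path assumes the `~/.kaggle/kaggle.json` location used above on a POSIX system.

```python
import json
import os
import stat


def check_kaggle_credentials(path=None):
    """Return a list of problems with the token file (an empty list means it looks fine)."""
    if path is None:
        # Assumed default location, matching the shell commands above
        path = os.path.join(os.path.expanduser("~"), ".kaggle", "kaggle.json")
    if not os.path.isfile(path):
        return [f"{path} does not exist"]

    problems = []
    # The file should be readable and writable by the owner only (chmod 600)
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions are {oct(mode)}; expected 0o600")

    try:
        with open(path) as fh:
            token = json.load(fh)
    except (OSError, ValueError):
        return problems + ["file is not readable JSON"]

    # The downloaded token is a JSON object with these two fields
    for field in ("username", "key"):
        if not isinstance(token, dict) or field not in token:
            problems.append(f"missing field: {field}")
    return problems


for problem in check_kaggle_credentials():
    print("Kaggle credentials:", problem)
```

If nothing is printed, the token file exists, is private to your user, and contains both expected fields.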

#### Joining Relevant Competition on Kaggle
1. Make sure you are logged into Kaggle with the same account the API key was generated for.
2. Go to https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques and click on "Join Competition".
3. This way, the regression dataset (House Price Prediction) will download seamlessly.

# Classification

#### Download the Dataset
Now you can use the Kaggle API to download the dataset:

In [6]:
import os
filename = 'breast-cancer-wisconsin-data.zip'
os.makedirs("classification", exist_ok=True)
for root, dirs, file in os.walk("./", topdown=True):
    if filename in file:
        break
else:
    !kaggle datasets download -d uciml/breast-cancer-wisconsin-data
    !mv breast-cancer-wisconsin-data.zip 'classification/'

In [7]:
for root, dirs, file in os.walk("./", topdown=True):
    if 'data.csv' in file:
        break
else:
    !unzip classification/breast-cancer-wisconsin-data.zip -d classification/
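The two cells above rely on Python's `for ... else`: the `else` branch runs only when the loop finishes without hitting `break`, that is, when no directory under the current folder contains the file. A roughly equivalent check written as a plain function might look like the sketch below; `file_exists_under` is a hypothetical name introduced here for illustration.

```python
import os


def file_exists_under(root, filename):
    """Return True if any directory at or below `root` contains `filename`."""
    for _dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            return True
    return False
```

With such a helper, the download step could be guarded by `if not file_exists_under("./", filename):` instead of the loop with `else`.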

In [8]:
pip install pandas


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC


#### 1. Loading a dataset
The Breast Cancer Wisconsin (Diagnostic) dataset, downloaded from Kaggle in the cells above, is widely used in machine learning and data science. It contains measurements of various features of cell nuclei from breast cancer biopsies and is commonly used for binary classification: distinguishing malignant (cancerous) from benign (non-cancerous) tumors.
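The CSV used in this section comes from Kaggle. scikit-learn also bundles a copy of the same data, which can be handy when the Kaggle API is not set up. A minimal sketch follows; note that the bundled copy encodes the target the other way round from this notebook (0 = malignant, 1 = benign), and the variable names are only illustrative.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects instead of NumPy arrays
bunch = load_breast_cancer(as_frame=True)
X_builtin = bunch.data      # DataFrame with the 30 numeric features
y_builtin = bunch.target    # Series: 0 = malignant, 1 = benign

print(X_builtin.shape)
print(list(bunch.target_names))
```

The bundled copy has no `id` or `Unnamed: 32` columns, so the column-dropping step later in this section would not apply to it.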

In [None]:
data = "classification/data.csv"
df = pd.read_csv(data)

# Display the first few rows of the dataset
df.head()

In [None]:
# Display basic information about the dataset
df.info()

#### 2. Exploring the dataset and checking for missing values


In [None]:
# Display summary statistics
print("Display summary statistics: \n",df.describe())

In [None]:
# Check for missing values
print("missing values:\n", df.isnull().sum())
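To make the output of `isnull().sum()` easier to read on a wide table, one option is to keep only the columns that actually have gaps and to look at the share of missing rows. The sketch below uses a small made-up DataFrame (the values are not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Toy data with a few gaps, for illustration only
toy = pd.DataFrame({
    "radius": [14.2, np.nan, 12.9, 15.1],
    "texture": [20.1, 18.4, np.nan, np.nan],
    "label": ["M", "B", "B", "M"],
})

missing_count = toy.isnull().sum()    # number of missing cells per column
missing_share = toy.isnull().mean()   # fraction of missing cells per column

# Keep only the columns that have at least one missing value
columns_with_gaps = missing_count[missing_count > 0].index.tolist()
print(columns_with_gaps)
print(missing_share[columns_with_gaps])
```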

### Data Preprocessing and Splitting

In [None]:
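# Drop columns that carry no predictive signal: 'Unnamed: 32' is an empty
# column in the Kaggle CSV and 'id' is only a record identifier
df = df.drop(['Unnamed: 32', 'id'], axis=1)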

X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

# Convert the 'diagnosis' labels to binary (malignant = 1, benign = 0)
y = y.map({'M': 1, 'B': 0})

print("target: \n",y.head(10))

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training, validation, and test sets (roughly 70% / 10% / 20%):
# the second call sends 66% of the 30% hold-out to the test set
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.66, random_state=42)

# Check the shapes of the splits
X_train.shape, X_val.shape, X_test.shape
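The two-stage split can be wrapped in a small function so that the validation and test fractions are stated directly rather than as a fraction of the hold-out. This is a sketch under a few assumptions: `three_way_split` is a hypothetical helper, the data is synthetic, and `stratify` (which keeps the class balance similar in every split) is an addition that the cell above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_size=0.1, test_size=0.2, seed=42):
    """Split into train/validation/test; sizes are fractions of the full data."""
    holdout = val_size + test_size
    # First split: training set versus everything held out
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=holdout, random_state=seed, stratify=y
    )
    # Second split: divide the hold-out into validation and test
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=test_size / holdout, random_state=seed, stratify=y_tmp
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in with the same number of rows as the breast cancer data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(569, 5))
y_demo = (rng.random(569) < 0.37).astype(int)

splits = three_way_split(X_demo, y_demo)
print([len(part) for part in splits[:3]])
```

Because scikit-learn rounds the split sizes to whole rows, the resulting fractions are close to, but not exactly, 70/10/20.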

#### 3. Training different classification models
In this section, we will demonstrate how to initialize and train different classification models using scikit-learn. While we won't go into the detailed workings of these models, it's important to know that there are multiple algorithms available for classification tasks.


In [None]:
# Initialize and train a Logistic Regression Model
log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)

In [None]:
# Initialize and train DecisionTree Model
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

In [None]:
# Initialize and train SVM Model
svm_clf = SVC(random_state=42)
svm_clf.fit(X_train, y_train)

If you wish to look at the predictions of each model separately, try executing `model_name.predict(X_val)`, for example `log_reg.predict(X_val)`.

Comparing these predictions with `y_val` gives better insight into how the model is performing.
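As an illustration of comparing predictions with the true labels, the sketch below trains a logistic regression on synthetic data (generated with `make_classification`, not the breast cancer data), computes the accuracy by hand, and prints a confusion matrix. All variable names here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_pred = clf.predict(X_va)

# Accuracy is simply the share of predictions that match the true labels
manual_accuracy = float(np.mean(val_pred == y_va))
print(manual_accuracy, accuracy_score(y_va, val_pred))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_va, val_pred)
print(cm)
```

The off-diagonal entries of the confusion matrix show which kind of mistake the model makes more often, which a single accuracy number hides.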

#### 4. Visualizing the metrics for each model

In [None]:
# Summary of performance metrics
metrics = {
    'Model': ['Logistic Regression', 'Decision Tree', 'SVM'],
    'Accuracy': [accuracy_score(y_val, log_reg.predict(X_val)),
                 accuracy_score(y_val, tree_clf.predict(X_val)),
                 accuracy_score(y_val, svm_clf.predict(X_val))],
    'Precision': [precision_score(y_val, log_reg.predict(X_val)),
                  precision_score(y_val, tree_clf.predict(X_val)),
                  precision_score(y_val, svm_clf.predict(X_val))],
    'Recall': [recall_score(y_val, log_reg.predict(X_val)),
               recall_score(y_val, tree_clf.predict(X_val)),
               recall_score(y_val, svm_clf.predict(X_val))],
    'F1-Score': [f1_score(y_val, log_reg.predict(X_val)),
                 f1_score(y_val, tree_clf.predict(X_val)),
                 f1_score(y_val, svm_clf.predict(X_val))]
}

metrics_df = pd.DataFrame(metrics)
metrics_df

Based on the performance metrics, the **Logistic Regression** model appears to be the best fit for this data: it achieved the highest accuracy, and its precision, recall, and F1-score are also superior to those of the other models.
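For reference, the four metrics in the table can be computed directly from the counts of true/false positives and negatives. The following pure-Python sketch uses made-up labels, and `binary_metrics` is a hypothetical helper rather than a scikit-learn function.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up example: 4 positives and 6 negatives
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
result = binary_metrics(y_true_demo, y_pred_demo)
print(result)
```

Here accuracy is 0.7 while recall is only 0.5, which shows why accuracy alone can be misleading when missing a positive case (a malignant tumor, in this tutorial) is costly.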

# Regression

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

filename = 'house-prices-advanced-regression-techniques.zip'
os.makedirs('regression', mode=0o777, exist_ok=True)
for root, dirs, file in os.walk("./", topdown=True):
    if filename in file:
        break
else:
    !kaggle competitions download -c house-prices-advanced-regression-techniques
    !mv house-prices-advanced-regression-techniques.zip regression/

In [None]:
if len(os.listdir("regression/")) == 1:
    ! unzip 'regression/house-prices-advanced-regression-techniques.zip' -d 'regression/'

In [None]:
df = pd.read_csv("regression/train.csv")
df.head()

In [None]:
# Define the target variable and features
target = 'SalePrice'
features = df.drop(columns=[target])
features.head()

In [None]:
# Drop rows with missing target values
df = df.dropna(subset=[target])
df.isnull().sum()

In [None]:
# Drop every column that contains at least one NaN (the default is how='any')
df = df.dropna(axis=1)
df.isnull().sum()
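The behaviour of `dropna(axis=1)` depends on its `how` parameter: the default `how="any"` removes a column as soon as it has a single missing value, whereas `how="all"` removes only columns that are entirely empty. The sketch below shows the difference on a toy DataFrame; the column names mirror ones found in the house price data, but the values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],         # complete
    "LotFrontage": [65.0, np.nan, 68.0],    # one missing value
    "PoolQC": [np.nan, np.nan, np.nan],     # entirely missing
})

any_dropped = toy.dropna(axis=1)              # how="any": drops LotFrontage and PoolQC
all_dropped = toy.dropna(axis=1, how="all")   # drops only PoolQC

print(list(any_dropped.columns))
print(list(all_dropped.columns))
```

Dropping with `how="any"` is simple but can discard useful features; imputing the gaps is a common alternative that this tutorial does not cover.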

In [None]:
X = df.drop(columns=[target])
y = df[target]

# Identify numerical columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

# Preprocess the data: scale numerical features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X[numerical_features])

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
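One caveat: the cells above fit the scaler on all rows before splitting, so statistics from the test rows influence the scaling. A common alternative is to put the scaler and the model in a pipeline and fit it on the training rows only. The sketch below demonstrates this on synthetic data from `make_regression`, not on the house price data, and its variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the numerical features
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when predicting on the test rows
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

fitted_scaler = pipe.named_steps["standardscaler"]
test_r2 = pipe.score(X_te, y_te)   # R-squared on the held-out rows
print(f"Test R-squared: {test_r2:.3f}")
```

For plain linear regression, scaling mostly affects how the coefficients are interpreted rather than the predictions, but the split-then-fit habit matters for models that are sensitive to feature scale.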

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)
y_pred[:10]

In [None]:
# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [None]:
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R²): {r2:.2f}")
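To make the two numbers concrete: MSE is the average squared difference between the true and predicted values, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The pure-Python sketch below uses made-up values; `mse_and_r2` is a hypothetical helper.

```python
import math


def mse_and_r2(y_true, y_pred):
    """Return (MSE, R-squared) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: how far predictions are from the truth
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far the truth is from its own mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return ss_res / n, 1.0 - ss_res / ss_tot


# Made-up example values
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.0, 7.5, 9.0]

mse_demo, r2_demo = mse_and_r2(y_true_demo, y_pred_demo)
print(f"MSE:  {mse_demo:.3f}")             # 0.125
print(f"RMSE: {math.sqrt(mse_demo):.3f}")  # same units as the target
print(f"R2:   {r2_demo:.3f}")              # 0.975
```

Because MSE is in squared units of the target (squared dollars for house prices), its square root, RMSE, is often easier to interpret.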

In [None]:
if r2 > 0.8:
    print("The model explains a high proportion of the variance in house prices, suggesting a strong fit.")
elif r2 > 0.5:
    print("The model explains a moderate proportion of the variance in house prices, indicating a reasonable fit.")
else:
    print("The model explains a low proportion of the variance in house prices, indicating that it may not fit the data well.")

## That's the end of this notebook. We hope you had a fun learning experience!