# Python Tutorial: Scikit-Learn (sklearn)

Scikit-learn (sklearn) is a popular machine learning library in Python that provides various tools for building and applying machine learning models. 

## Steps

1. Installation.
2. Load libraries.
3. Load the dataset.
4. Get to know the data.
5. Visualize the data.
6. Building and training a Linear Regression Model.
7. Evaluating the Model and making predictions.
    

## Scikit-learn

https://scikit-learn.org/stable/


<details>
<summary><b>Overview</b></summary>

Sklearn offers a wide range of models, which can be broadly categorized into the following types based on the learning task:

1. **Supervised Learning Models:**
   In supervised learning, the algorithm learns from labeled data, meaning the input data is accompanied by corresponding output labels. Supervised learning models in sklearn can be further divided into two subcategories:

   - **Classification Models:** These models are used for predicting categorical labels. Examples include Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), k-Nearest Neighbors (kNN), etc.
   
   - **Regression Models:** Regression models are used for predicting continuous values. Examples include Linear Regression, Ridge Regression, Lasso Regression, Support Vector Regression (SVR), etc.

2. **Unsupervised Learning Models:**
   In unsupervised learning, the algorithm learns patterns and structures from unlabeled data. Unsupervised learning models in sklearn include:

   - **Clustering Models:** These models are used for grouping similar data points into clusters based on some similarity measure. Examples include K-Means, Hierarchical Clustering, DBSCAN, etc.
   
   - **Dimensionality Reduction Models:** These models are used for reducing the number of features in the data while preserving important information. Examples include Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), etc.

3. **Semi-Supervised Learning Models:**
   Semi-supervised learning combines both labeled and unlabeled data to improve learning accuracy. Sklearn provides some semi-supervised learning algorithms, including LabelPropagation and LabelSpreading.

4. **Model Selection and Evaluation:**
   Sklearn also provides tools for model selection and evaluation, including:

   - **Cross-Validation:** Techniques like k-fold cross-validation, which splits the dataset into k subsets and trains the model k times, each time using a different subset as the test set.
   
   - **Model Evaluation Metrics:** Sklearn offers various metrics to evaluate model performance, such as accuracy, precision, recall, F1-score for classification, and mean squared error, R-squared for regression.
   
   - **Hyperparameter Tuning:** GridSearchCV and RandomizedSearchCV are used for finding the best hyperparameters for a model by exhaustively searching through a specified parameter grid or randomly sampling from a parameter distribution.

5. **Ensemble Methods:**
   Ensemble methods combine multiple individual models to improve performance. Sklearn provides ensemble methods like Random Forest, Gradient Boosting, AdaBoost, etc.

These are some of the major model types and functionalities offered by sklearn. Each model type has its strengths and weaknesses, and choosing the right model depends on the specific problem at hand and the characteristics of the dataset.
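As a rough illustration of the model selection tools listed above, the sketch below runs 5-fold cross-validation and a small grid search. The data is synthetic and the `Ridge` model and `alpha` grid are arbitrary choices made for this example, not part of the tutorial's dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# 5-fold cross-validation: one R^2 score per held-out fold
cv_scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)
print("R^2 per fold:", cv_scores)
print("Mean R^2:", cv_scores.mean())

# Exhaustive search over a small grid of regularization strengths
alpha_grid = [0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(Ridge(), param_grid={"alpha": alpha_grid}, cv=5)
search.fit(X, y)
print("Best alpha:", search.best_params_["alpha"])
```

`RandomizedSearchCV` follows the same fit-then-inspect pattern but samples candidate settings instead of trying every combination.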
</details>

## 1. Installation.
  
You can install scikit-learn using pip:


In [None]:
pip install scikit-learn pandas numpy seaborn matplotlib


## 2. Load libraries.

Load the libraries and turn off warnings.

- Pandas : Data structures and operations for manipulating numerical tables and time series.
- Sklearn : Various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN.
- NumPy : N-dimensional arrays and the numerical routines that operate on them.
- Seaborn : High-level interface for drawing attractive and informative statistical graphics.
- Matplotlib : Object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK.


In [None]:
# Load libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression


# Ignore warnings
# https://docs.python.org/3/library/warnings.html
import warnings

warnings.filterwarnings('ignore')


## 3. Load the dataset.

The dataset has columns such as MPG, Cylinders, Engine Disp, Horsepower, Weight, Accelerate, Year, and Origin.


In [None]:
# Load the dataset from csv
df =  pd.read_csv('mpg.csv')


## 4. Get to know the data.
 
Get an understanding of the data.


In [None]:
# Display sample rows from the dataset
df.sample(5)


In [None]:
# Total number of rows and columns
df.shape


In [None]:
# Index dtype and columns, non-null values and memory usage
df.info()


In [None]:
# Description of the data in the DataFrame
df.describe()


Checking the data types shows what kinds of variables the dataset contains.


In [None]:
category_cols = ['category']
category_lst = list(df.select_dtypes(include=category_cols).columns)
print("Total number of categorical columns:", len(category_lst))
print("Their names are as follows:", category_lst)


In [None]:
int64_cols = ['int64']
int64_lst = list(df.select_dtypes(include=int64_cols).columns)
print("Total number of int64 columns:", len(int64_lst))
print("Their names are as follows:", int64_lst)


In [None]:
float64_cols = ['float64']
float64_lst = list(df.select_dtypes(include=float64_cols).columns)
print("Total number of float64 columns:", len(float64_lst))
print("Their names are as follows:", float64_lst)
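The per-dtype checks above can also be condensed into a couple of calls. The sketch below uses a tiny hypothetical DataFrame (`demo`) in place of the tutorial's `df`. Note that text columns read from a CSV usually load as `object` (or a string dtype) rather than `category`, so a `category` check may report zero columns.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's DataFrame
demo = pd.DataFrame({
    "MPG": [18.0, 15.0, 36.0],
    "Cylinders": [8, 8, 4],
    "Origin": ["American", "American", "Japanese"],
})

# Count how many columns there are of each dtype
dtype_counts = demo.dtypes.value_counts()
print(dtype_counts)

# Split the columns into numeric and non-numeric groups
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
other_cols = demo.select_dtypes(exclude="number").columns.tolist()
print("Numeric columns:", numeric_cols)
print("Non-numeric columns:", other_cols)
```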


## 5. Visualize the data.


In [None]:
# Relationship between Horsepower and mileage
df.plot.scatter(x="Horsepower", y="MPG")


In [None]:
# Relationship between Horsepower and weight
df.plot.scatter(x="Horsepower", y="Weight")


In [None]:
# Show the distribution of each numeric column
# (sns.distplot is deprecated; sns.histplot with kde=True is its replacement)
num = [f for f in df.columns if df.dtypes[f] != 'object']
nd = pd.melt(df, value_vars=num)
n1 = sns.FacetGrid(nd, col='variable', col_wrap=4, sharex=False, sharey=False)
n1 = n1.map(sns.histplot, 'value', kde=True)
n1


In [None]:
# Correlation plot
sns.set(rc = {'figure.figsize':(25,20)})
corr = df.corr(numeric_only=True).abs()
sns.heatmap(corr, annot=True) 
plt.show()


In [None]:
# Find the outlier in a dataset/column
plt.figure(figsize=(8, 6))
sns.boxplot(data=df)
plt.title('Outliers')
plt.show()
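By default, a box plot marks as outliers the points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically to list those points. This is a minimal sketch on made-up horsepower values, not on the tutorial's dataset.

```python
import pandas as pd

# Hypothetical horsepower values with one obvious extreme
hp = pd.Series([88, 90, 95, 100, 105, 110, 230], name="Horsepower")

# Quartiles and interquartile range
q1, q3 = hp.quantile(0.25), hp.quantile(0.75)
iqr = q3 - q1

# Fences used by the 1.5 * IQR rule
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = hp[(hp < lower) | (hp > upper)]
print("Fences:", lower, upper)
print("Outliers:", outliers.tolist())
```

Whether a flagged value is an error or a genuine extreme (for example, a very powerful car) is a judgment call that depends on the data.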
    

## 6. Building and training a Linear Regression Model.

Linear Regression is a simple and widely used supervised learning algorithm for predicting continuous values. It establishes a linear relationship between the independent variables (features) and the dependent variable (target). In sklearn, linear regression is implemented in the `LinearRegression` class.

First, identify the target and feature columns for building a Linear Regression model. In this case, we’ll predict car mileage (MPG) based on Horsepower and Weight.


In [None]:
# Identify the target column
target = df["MPG"]

# Identify the features
features = df[["Horsepower", "Weight"]]


In [None]:
# Create a Linear Regression model
lr = LinearRegression()

# Train the model
lr.fit(features, target)


## 7. Evaluating the Model and making predictions.


In [None]:
# Evaluate the model
model_score = lr.score(features, target)
print("Model Score:", model_score)
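For a regressor, `score` returns the coefficient of determination (R²). Scoring on the same rows used for training tends to give an optimistic figure, so a common alternative is to hold out part of the data. The sketch below shows that pattern on synthetic data. The 80/20 split, the random seeds and the data-generating formula are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features: column 0 mimics horsepower, column 1 mimics weight
rng = np.random.default_rng(42)
X_demo = rng.uniform(low=[50, 1500], high=[200, 5000], size=(200, 2))
y_demo = 50 - 0.1 * X_demo[:, 0] - 0.005 * X_demo[:, 1] + rng.normal(scale=1.0, size=200)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Train on the training rows only
held_out_lr = LinearRegression().fit(X_train, y_train)

# Evaluate on rows the model has not seen
y_pred = held_out_lr.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
test_mse = mean_squared_error(y_test, y_pred)
print("Test R^2:", test_r2)
print("Test MSE:", test_mse)
```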


In [None]:
# Make predictions for a car with Horsepower = 100 and Weight = 2000
predictions = lr.predict([[100, 2000]])
print("Predicted MPG:", predictions[0])
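A linear regression prediction is simply the intercept plus each coefficient multiplied by its feature value. The sketch below checks this on a small hypothetical model. It also passes the new car as a DataFrame with the training column names, which keeps the feature order explicit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data built from a known rule:
# MPG = 50 - 0.1 * Horsepower - 0.005 * Weight
train = pd.DataFrame({
    "Horsepower": [70, 90, 110, 130, 150, 100],
    "Weight": [2000, 2500, 2800, 3500, 4000, 2200],
})
train["MPG"] = 50 - 0.1 * train["Horsepower"] - 0.005 * train["Weight"]
model = LinearRegression().fit(train[["Horsepower", "Weight"]], train["MPG"])

# New observation with the same column names as the training features
new_car = pd.DataFrame({"Horsepower": [100], "Weight": [2000]})
pred = model.predict(new_car)[0]

# The same number computed by hand: intercept + coefficients . features
manual = model.intercept_ + model.coef_ @ new_car.iloc[0].to_numpy()
print("Predicted MPG:", pred)
print("Manual computation:", manual)
```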


## Exercise 1

Load the diamonds dataset.


In [None]:
# Solution


## Exercise 2

Identify the target column as price and the features as carat and depth.


In [None]:
# Solution


## Exercise 3

Build and train a new Linear Regression model for diamond price prediction.


In [None]:
# Solution


## Exercise 4

Print the score of the model to assess its performance.


In [None]:
# Solution


## Exercise 5

Predict the price of a diamond with carat = 0.3 and depth = 60.


In [None]:
# Solution


## Summary

Scikit-learn is a powerful library for machine learning in Python, offering a wide range of algorithms and tools for various tasks. By following this tutorial and practicing the exercises, you'll gain a good understanding of how to use scikit-learn effectively for building and evaluating machine learning models.


<details>
<summary><b>Instructor Notes</b></summary>

https://levelup.gitconnected.com/predictions-using-sklearn-regression-for-car-mileage-and-diamond-price-558e6c1daefa

The dataset was originally downloaded from the URL below and saved to a CSV file.

```python
# URL for the car mileage dataset
URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv"

# Load the data into a Pandas DataFrame
df = pd.read_csv(URL)

df.to_csv('mpg.csv', encoding='utf-8', index=False)

# URL for the diamonds dataset
URL2 = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv"
```


</details>