<a href="https://colab.research.google.com/github/dhakehruturaj/Topics-in-Generative-AI/blob/main/lab_session_ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Lab Session

## 1. Introduction to Machine Learning

**Definition:**
Machine learning is a subset of artificial intelligence that involves training algorithms to make predictions or decisions based on data. It allows systems to learn from experience and improve over time.

**Types of Machine Learning:**

- **Supervised Learning:** Training a model on labeled data (e.g., classification, regression).
- **Unsupervised Learning:** Finding patterns in unlabeled data (e.g., clustering, dimensionality reduction).
- **Reinforcement Learning:** Training an agent to make decisions through rewards and punishments.

**Applications:**

- Image recognition
- Recommendation systems
- Natural language processing


# NumPy

Before moving on to machine learning, let's explore NumPy.

---

NumPy is a powerful numerical computing library in Python. It provides support for arrays, matrices, and a wide range of mathematical functions to operate on these data structures. Key features of NumPy include:

- **N-dimensional Array Object**: The primary data structure in NumPy is the `ndarray`, an N-dimensional array, which allows for efficient storage and manipulation of large datasets.
- **Broadcasting**: Enables arithmetic between arrays of different but compatible shapes (for example, a matrix and a row vector), making code more concise and faster.
- **Vectorization**: Eliminates the need for explicit loops, optimizing performance by applying operations element-wise on arrays.
- **Comprehensive Math Functions**: Includes a vast collection of mathematical functions for operations like linear algebra, statistics, and Fourier transforms.
- **Interoperability**: Integrates well with other libraries like SciPy, Pandas, and Matplotlib, forming the backbone of the scientific Python ecosystem.

NumPy is essential for data science, machine learning, and scientific computing due to its efficiency and ease of use.

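Broadcasting and vectorization are easier to see on a small example. The sketch below uses made-up arrays (`matrix`, `row`) chosen only for illustration.

```python
import numpy as np

# A 3x4 matrix and a length-4 row vector (illustrative values)
matrix = np.arange(12).reshape(3, 4)
row = np.array([10, 20, 30, 40])

# Broadcasting: the (4,) vector is applied to each of the 3 rows of the (3, 4) matrix
shifted = matrix + row

# Vectorization: one array expression replaces an explicit Python loop
squares_loop = np.array([x ** 2 for x in range(5)])
squares_vec = np.arange(5) ** 2

print(shifted)
print(np.array_equal(squares_loop, squares_vec))
```

Both approaches to squaring give the same values; the vectorized form simply pushes the loop into NumPy's compiled code.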

In [None]:
import pandas as pd
import numpy as np

# Number of rows
num_rows = 30

# Generate the data
data = {
    'v1': np.arange(1, num_rows + 1),  # Increasing numbers from 1 to 30
    'v2': np.floor(np.random.rand(num_rows) * 100),  # Random values between 0 and 99
    'v3': np.floor(np.random.rand(num_rows) * 100),  # Random values between 0 and 99
    'v4': np.floor(np.random.rand(num_rows) * 100)   # Random values between 0 and 99
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save to CSV
df.to_csv('myData.csv', index=False)



In [None]:
df.head()

   v1    v2    v3    v4
0   1   9.0  74.0   9.0
1   2  13.0  32.0  42.0
2   3  29.0  40.0  69.0
3   4  32.0  78.0  68.0
4   5  79.0   1.0  40.0


In [None]:
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('./myData.csv')

# Rename the columns
df.columns = ['v1', 'v2', 'v3', 'v4']

# Print the DataFrame
print(df)


    v1    v2    v3    v4
0    1   9.0  74.0   9.0
1    2  13.0  32.0  42.0
2    3  29.0  40.0  69.0
3    4  32.0  78.0  68.0
4    5  79.0   1.0  40.0
5    6  52.0  89.0  95.0
6    7  76.0  31.0  66.0
7    8  78.0  56.0  16.0
8    9   2.0  47.0  25.0
9   10  15.0  39.0  74.0
10  11  48.0  83.0  55.0
11  12   8.0  18.0   4.0
12  13  33.0  70.0  32.0
13  14  49.0  16.0  34.0
14  15  99.0   0.0  44.0
15  16  26.0  73.0  62.0
16  17  77.0  33.0  20.0
17  18  72.0  43.0  74.0
18  19  20.0  14.0  28.0
19  20  91.0  28.0  50.0
20  21  64.0  46.0  58.0
21  22  83.0  11.0  97.0
22  23  76.0  53.0  66.0
23  24  84.0  90.0  98.0
24  25  64.0   3.0  65.0
25  26  94.0  60.0  50.0
26  27  77.0  20.0  84.0
27  28  82.0  19.0  43.0
28  29  30.0  21.0  91.0
29  30  65.0  63.0  79.0


### Dropping Specific Rows in a DataFrame

To drop specific rows from a DataFrame using `pandas`, you can use the `drop` method.


In [None]:
import pandas as pd

# Drop rows 20 and 21
df.drop([20, 21], axis=0, inplace=True)

# Print the updated DataFrame
print(df)


    v1    v2    v3    v4
0    1   9.0  74.0   9.0
1    2  13.0  32.0  42.0
2    3  29.0  40.0  69.0
3    4  32.0  78.0  68.0
4    5  79.0   1.0  40.0
5    6  52.0  89.0  95.0
6    7  76.0  31.0  66.0
7    8  78.0  56.0  16.0
8    9   2.0  47.0  25.0
9   10  15.0  39.0  74.0
10  11  48.0  83.0  55.0
11  12   8.0  18.0   4.0
12  13  33.0  70.0  32.0
13  14  49.0  16.0  34.0
14  15  99.0   0.0  44.0
15  16  26.0  73.0  62.0
16  17  77.0  33.0  20.0
17  18  72.0  43.0  74.0
18  19  20.0  14.0  28.0
19  20  91.0  28.0  50.0
22  23  76.0  53.0  66.0
23  24  84.0  90.0  98.0
24  25  64.0   3.0  65.0
25  26  94.0  60.0  50.0
26  27  77.0  20.0  84.0
27  28  82.0  19.0  43.0
28  29  30.0  21.0  91.0
29  30  65.0  63.0  79.0

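Notice that the output above jumps from index 19 to 22: `drop` removes rows by label and leaves the remaining labels untouched. The sketch below, on a small hypothetical DataFrame (`df_demo`), shows the non-in-place form of `drop`, how to renumber the rows afterwards, and how to remove rows by condition instead of by label.

```python
import pandas as pd

# Small illustrative DataFrame (values are made up)
df_demo = pd.DataFrame({"v1": range(1, 7), "v2": [9, 13, 29, 32, 79, 52]})

# drop() returns a new DataFrame by default; df_demo itself is unchanged
trimmed = df_demo.drop(index=[2, 3])

# The surviving rows keep their original labels
print(trimmed.index.tolist())

# reset_index(drop=True) renumbers the rows from 0
renumbered = trimmed.reset_index(drop=True)
print(renumbered.index.tolist())

# Rows can also be filtered by condition with a boolean mask
small_only = df_demo[df_demo["v2"] < 30]
print(small_only)
```

Returning a new DataFrame instead of using `inplace=True` keeps the original data available, which can make notebook cells safer to re-run.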

### Principal Component Analysis (PCA) to Determine Important Components

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction. It transforms a large set of variables into a smaller one that still contains most of the information in the large set. Key aspects of PCA include:

- **Variance Maximization**: PCA identifies the directions (principal components) in which the data varies the most and projects the data along these directions.
- **Orthogonal Components**: The principal components are orthogonal (uncorrelated), ensuring that each new dimension represents distinct information.
- **Feature Reduction**: By selecting the top principal components, PCA reduces the number of features, simplifying the dataset while retaining its essential patterns.
- **Eigenvectors and Eigenvalues**: PCA uses eigenvectors and eigenvalues of the data's covariance matrix to determine the principal components.
- **Applications**: Widely used in fields like image processing, genomics, finance, and any domain where reducing data dimensionality can lead to more efficient and effective analysis.

PCA is a fundamental tool in exploratory data analysis and preprocessing, helping to uncover hidden patterns and simplify complex datasets.

The following steps use `sklearn` to determine how many principal components are needed to retain most of the variance in a dataset:

1. **Import Required Libraries**
   ```python
   import pandas as pd
   from sklearn.decomposition import PCA
   ```


In [None]:
import pandas as pd
from sklearn.decomposition import PCA

# Loop through different numbers of components
for i in range(1, 5):
    pca = PCA(n_components=i)  # Initialize PCA with i components
    pca.fit(df)                # Fit PCA on the data
    explained_variance_ratio = sum(pca.explained_variance_ratio_)  # Sum of explained variance ratios
    print(f'Number of components: {i}, Explained variance ratio: {explained_variance_ratio}')


Number of components: 1, Explained variance ratio: 0.4331213083920601
Number of components: 2, Explained variance ratio: 0.8023122462637541
Number of components: 3, Explained variance ratio: 0.9780314500798681
Number of components: 4, Explained variance ratio: 0.9999999999999998

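The eigenvector and eigenvalue view of PCA described earlier can be sketched directly in NumPy. This is a minimal illustration on hypothetical random data, not how `sklearn` implements `PCA` internally; the names `data`, `centered`, and `projected` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 samples, 3 features, the third nearly a copy of the first
x = rng.normal(size=(50, 2))
data = np.column_stack([x[:, 0], x[:, 1], x[:, 0] + 0.1 * rng.normal(size=50)])

# PCA works on mean-centered data
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)  # 3x3 covariance matrix

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Each eigenvalue's share of the total is that component's explained variance ratio
explained_variance_ratio = eigenvalues / eigenvalues.sum()

# Project onto the top two principal components
projected = centered @ eigenvectors[:, :2]

print(explained_variance_ratio)
print(projected.shape)
```

Because the third feature is almost redundant here, the last component should carry only a small share of the variance.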

Let's look at a different scenario.

In [None]:
import numpy as np
import pandas as pd

# Set the number of data points
numPoints = 15

# Generate random data for columns
v1 = [np.random.randint(low=1, high=80) for i in range(numPoints)]    # Random integers from 1 to 79 (high is exclusive)
v2 = [2 * v1[i] for i in range(numPoints)]                            # Exactly double the values of v1
v3 = [np.random.randint(low=1, high=80) for i in range(numPoints)]    # Random integers from 1 to 79 (high is exclusive)
v4 = np.random.permutation(v1)                                        # Random permutation of v1
v5 = [np.random.randint(low=0, high=2) for i in range(numPoints)]     # Random labels, either 0 or 1

# Combine the columns into a list of tuples
aData = list(zip(v1, v2, v3, v4, v5))

# Create a DataFrame from the data
df = pd.DataFrame(data=aData, columns=['v1', 'v2', 'v3', 'v4', 'v5'])

# Save the DataFrame to a CSV file
df.to_csv('aData.csv', index=False, header=False)

# Read the data back from the CSV file
Location = './aData.csv'
df = pd.read_csv(Location, names=['v1', 'v2', 'v3', 'v4', 'v5'])

# Print the DataFrame
print(df)


    v1   v2  v3  v4  v5
0   25   50   9  49   1
1   72  144  64  43   1
2   26   52  18  45   1
3   43   86  11  74   0
4   72  144  48  20   0
5   37   74  57  26   1
6   53  106  47  64   1
7   37   74  72  46   0
8   74  148  39  25   1
9   46   92  59  37   0
10  64  128  27  37   0
11  30   60  66  72   0
12  20   40  25  72   0
13  45   90  48  53   0
14  49   98  73  30   1


### Variance Ratio

Variance ratio, in the context of Principal Component Analysis (PCA), refers to the proportion of the dataset's total variance that is explained by each principal component. It is a crucial metric for understanding the significance of each principal component. Key points about variance ratio include:

- **Explained Variance Ratio**: This metric indicates how much of the total variance is captured by each principal component. It helps in assessing the importance of each component.
- **Cumulative Variance Ratio**: Summing the explained variance ratios of the principal components provides the cumulative variance ratio, which shows the total variance explained by a subset of the principal components.
- **Selection of Principal Components**: By examining the explained variance ratios, one can decide the number of principal components to retain. Typically, the smallest number of components whose cumulative variance ratio reaches a chosen threshold (e.g., 95% or 99%) is kept.
- **Dimensionality Reduction**: Variance ratio is used to reduce the dataset's dimensionality while preserving as much information as possible, simplifying analysis and visualization.

Understanding the variance ratio helps in making informed decisions about retaining the most informative components and discarding the less significant ones in PCA.

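As a rough sketch of the selection rule described above, the cumulative ratio can be computed with `np.cumsum` and compared against a threshold. The data here is hypothetical but mirrors the earlier example in one respect: the second column is exactly twice the first, so it adds no new variance. The 0.95 threshold is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical data: the second column is exactly 2 * the first
v1 = rng.integers(1, 80, size=40)
v3 = rng.integers(1, 80, size=40)
features = np.column_stack([v1, 2 * v1, v3]).astype(float)

# Fit with all components kept so every ratio is available
pca = PCA().fit(features)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative)
print(n_keep)
```

A perfectly redundant column is why one component fewer than the number of columns can already explain all of the variance. `PCA` also accepts a float between 0 and 1 for `n_components`, which performs a similar selection by explained variance.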

In [None]:
import pandas as pd
from sklearn.decomposition import PCA

# Read the CSV file into a DataFrame
df = pd.read_csv('./aData.csv', names=['v1', 'v2', 'v3', 'v4', 'v5'])

# Perform PCA for different numbers of components
for i in range(1, 6):
    pca = PCA(n_components=i)  # Initialize PCA with i components
    pca.fit(df)                # Fit PCA on the data
    explained_variance_ratio = sum(pca.explained_variance_ratio_)  # Sum of explained variance ratios
    print(f'Number of components: {i}, Explained variance ratio: {explained_variance_ratio}')


Number of components: 1, Explained variance ratio: 0.735544248929193
Number of components: 2, Explained variance ratio: 0.9176376786557637
Number of components: 3, Explained variance ratio: 0.9999019587685559
Number of components: 4, Explained variance ratio: 0.9999999999999999
Number of components: 5, Explained variance ratio: 0.9999999999999999


Here is how the data looks when we keep only three components:

In [None]:
import pandas as pd
from sklearn.decomposition import PCA

# Read the CSV file into a DataFrame
df = pd.read_csv('./aData.csv', names=['v1', 'v2', 'v3', 'v4', 'v5'])

# Initialize PCA with 3 components
pca = PCA(n_components=3)

# Fit PCA on the data and transform the data
df2 = pca.fit_transform(df)

# Print the transformed data (fit_transform returns a NumPy array, not a DataFrame)
print(df2)


[[ 52.13991901  23.01242374  15.68603496]
 [-59.02105693  -5.84785066 -14.58404552]
 [ 47.22397126  14.30875487  17.12796845]
 [ 20.68615665  34.28335406 -17.97714226]
 [-62.29453688   6.13672347  10.51429324]
 [ 11.41231545 -20.12781125  21.44537722]
 [ -9.97823804   3.52346343 -21.06156454]
 [ 14.05515611 -31.59404067  -0.6230932 ]
 [-63.4287783   16.57391175   6.55269938]
 [ -4.91564125 -15.67362233   5.82950333]
 [-36.77514533  24.57694507   2.85245436]
 [ 37.08793664 -25.70851553 -20.41646868]
 [ 65.95515987   8.38358426  -6.8005431 ]
 [  3.67504309  -3.28438025  -6.60761641]
 [-15.82226135 -28.56293996   8.06214277]]


Explore an interactive PCA visualization with the TensorFlow Embedding Projector: [Link](https://projector.tensorflow.org/)

### Classification

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Generate data
numPoints = 80
np.random.seed(500)

v1 = [np.random.randint(low=1, high=80) for i in range(numPoints)]
v2 = [2 * v1[i] for i in range(numPoints)]
v3 = [np.random.randint(low=1, high=80) for i in range(numPoints)]
v4 = np.random.permutation(v1)
v5 = [np.random.randint(low=0, high=2) for i in range(numPoints)]
aData = list(zip(v1, v2, v3, v4, v5))

# Create DataFrame
df = pd.DataFrame(data=aData, columns=['v1', 'v2', 'v3', 'v4', 'v5'])

# Save DataFrame to CSV
df.to_csv('aData.csv', index=False, header=False)

## 2. Basic Concepts

**Datasets:**
A dataset consists of features (input variables) and labels (output variables). For supervised learning, data is divided into training and testing sets.

**Training and Testing:**

- **Training Data:** Used to train the model.
- **Testing Data:** Used to evaluate the model's performance.

**Model Evaluation:**

- **Accuracy:** Proportion of correctly predicted instances.
- **Precision:** Proportion of true positives among all positive predictions.
- **Recall:** Proportion of true positives among all actual positives.
- **F1 Score:** Harmonic mean of precision and recall.

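For a binary classifier, the four metrics above can be written in terms of the confusion-matrix counts: true positives ($TP$), true negatives ($TN$), false positives ($FP$), and false negatives ($FN$).

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
```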

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
location = './aData.csv'
df = pd.read_csv(location, names=['v1', 'v2', 'v3', 'v4', 'v5'])

# Separate features and target variable
X = df.drop(columns=['v5'])
y = df['v5']

# Split the data into training and test sets
X_tr, X_tst, y_tr, y_tst = train_test_split(X, y, test_size=0.1, random_state=42)


In [None]:
# Print the split data to verify
print("Training features:")
print(X_tr.head())

print("\nTesting features:")
print(X_tst.head())

print("\nTraining labels:")
print(y_tr.head())

print("\nTesting labels:")
print(y_tst.head())

Training features:
    v1   v2  v3  v4
4   62  124  13  24
12  35   70  14  56
49  54  108   9  16
33  16   32  66  78
67  33   66   4  20

Testing features:
    v1   v2  v3  v4
30  30   60  53  38
0   56  112  36   3
22  14   28  33  20
31  29   58  68   5
18  48   96  50  43

Training labels:
4     1
12    1
49    0
33    1
67    1
Name: v5, dtype: int64

Testing labels:
30    1
0     1
22    1
31    1
18    0
Name: v5, dtype: int64

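With 80 rows and `test_size=0.1`, only 8 rows end up in the test set, so the class balance of the test set can drift from that of the full data. One option, sketched below on hypothetical data, is the `stratify` parameter of `train_test_split`, which keeps the class proportions roughly equal in both splits. The arrays `X` and `y` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 80 samples, 4 features, 48 labels of class 1 and 32 of class 0
X = np.arange(320).reshape(80, 4)
y = np.array([1] * 48 + [0] * 32)

# stratify=y asks for similar class proportions in the training and test sets
X_tr, X_tst, y_tr, y_tst = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

print(X_tr.shape, X_tst.shape)
print(y_tr.mean(), y_tst.mean())
```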

Apply Decision Tree Classifier

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Read the CSV file into a DataFrame
Location = './aData.csv'
df = pd.read_csv(Location, names=['v1', 'v2', 'v3', 'v4', 'v5'])

# Extract target labels and remove them from the feature set
y = df['v5']
del df['v5']

# Split data into training and testing sets
X_tr, X_tst, y_tr, y_tst = train_test_split(df, y, test_size=0.1, random_state=500)

# Initialize and train Decision Tree Classifier
clf = DecisionTreeClassifier()
clf.fit(X_tr, y_tr)

# Predict on the test set
predicted = clf.predict(X_tst)

# Calculate and print accuracy
accuracy = metrics.accuracy_score(y_tst, predicted)
print(f"Decision Tree Classifier Accuracy: {accuracy}")


Decision Tree Classifier Accuracy: 0.75


Apply KNN Classifier

In [None]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Read the CSV file into a DataFrame
Location = './aData.csv'
df = pd.read_csv(Location, names=['v1', 'v2', 'v3', 'v4', 'v5'])

# Extract target labels and remove them from the feature set
y = df['v5']
del df['v5']

# Split data into training and testing sets
X_tr, X_tst, y_tr, y_tst = train_test_split(df, y, test_size=0.1, random_state=500)

# Initialize and train K-Nearest Neighbors Classifier
clf = KNeighborsClassifier()
clf.fit(X_tr, y_tr)

# Predict on the test set
predicted = clf.predict(X_tst)

# Calculate and print accuracy
accuracy = metrics.accuracy_score(y_tst, predicted)
print(f"K-Nearest Neighbors Classifier Accuracy: {accuracy}")


K-Nearest Neighbors Classifier Accuracy: 0.5


Apply Naive Bayes Classifier

In [None]:
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Read the CSV file into a DataFrame
Location = './aData.csv'
df = pd.read_csv(Location, names=['v1', 'v2', 'v3', 'v4', 'v5'])

# Extract target labels and remove them from the feature set
y = df['v5']
del df['v5']

# Split data into training and testing sets
X_tr, X_tst, y_tr, y_tst = train_test_split(df, y, test_size=0.1, random_state=500)

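# Initialize and train Naive Bayes Classifier
# Note: BernoulliNB binarizes features at 0.0 by default, so these all-positive features each become 1
clf = BernoulliNB()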
clf.fit(X_tr, y_tr)

# Predict on the test set
predicted = clf.predict(X_tst)

# Calculate and print accuracy
accuracy = metrics.accuracy_score(y_tst, predicted)
print(f"Naive Bayes Classifier Accuracy: {accuracy}")


Naive Bayes Classifier Accuracy: 0.625


Apply SVM Classifier

In [None]:
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Read the CSV file into a DataFrame
Location = './aData.csv'
df = pd.read_csv(Location, names=['v1', 'v2', 'v3', 'v4', 'v5'])

# Extract target labels and remove them from the feature set
y = df['v5']
del df['v5']

# Split data into training and testing sets
X_tr, X_tst, y_tr, y_tst = train_test_split(df, y, test_size=0.1, random_state=500)

# Initialize and train Support Vector Machine Classifier
clf = SVC()
clf.fit(X_tr, y_tr)

# Predict on the test set
predicted = clf.predict(X_tst)

# Calculate and print accuracy
accuracy = metrics.accuracy_score(y_tst, predicted)
print(f"SVM Classifier Accuracy: {accuracy}")


SVM Classifier Accuracy: 0.5


Apply Random Forest Classifier

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Read the CSV file into a DataFrame
Location = './aData.csv'
df = pd.read_csv(Location, names=['v1', 'v2', 'v3', 'v4', 'v5'])

# Extract target labels and remove them from the feature set
y = df['v5']
del df['v5']

# Split data into training and testing sets
X_tr, X_tst, y_tr, y_tst = train_test_split(df, y, test_size=0.1, random_state=500)

# Initialize and train Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=500)
clf.fit(X_tr, y_tr)

# Predict on the test set
predicted = clf.predict(X_tst)

# Calculate and print accuracy
accuracy = metrics.accuracy_score(y_tst, predicted)
print(f"Random Forest Classifier Accuracy: {accuracy}")


Random Forest Classifier Accuracy: 0.5

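The five cells above repeat the same load, split, fit, and score steps. A more compact way to compare the classifiers is to loop over them, as in the sketch below. It uses a hypothetical stand-in for `aData.csv` (random features and random 0/1 labels) so that it runs on its own. Because the labels are generated independently of the features, accuracies near chance are to be expected, and with only 8 test rows each accuracy moves in steps of 0.125.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(500)
# Hypothetical stand-in for aData.csv: 80 rows, 4 integer features, random 0/1 labels
X = rng.integers(1, 80, size=(80, 4))
y = rng.integers(0, 2, size=80)

X_tr, X_tst, y_tr, y_tst = train_test_split(X, y, test_size=0.1, random_state=500)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=500),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": BernoulliNB(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=500),
}

# For classifiers, score() returns the mean accuracy on the given data
scores = {name: model.fit(X_tr, y_tr).score(X_tst, y_tst) for name, model in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```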

In [None]:
# Calculate and print the confusion matrix
conf_matrix = metrics.confusion_matrix(y_tst, predicted)
print("Confusion Matrix:")
print(conf_matrix)

# Optionally, you can also print classification report for detailed metrics
print("\nClassification Report:")
print(metrics.classification_report(y_tst, predicted))

Confusion Matrix:
[[1 2]
 [2 3]]

Classification Report:
              precision    recall  f1-score   support

           0       0.33      0.33      0.33         3
           1       0.60      0.60      0.60         5

    accuracy                           0.50         8
   macro avg       0.47      0.47      0.47         8
weighted avg       0.50      0.50      0.50         8


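The report above can be reproduced by hand from the confusion matrix. In `sklearn`, rows are actual classes and columns are predicted classes, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The sketch below recomputes the metrics for class 1 from those four counts.

```python
# Counts read off the confusion matrix above (rows = actual, columns = predicted)
tn, fp = 1, 2
fn, tp = 2, 3

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.50 precision=0.60 recall=0.60 f1=0.60
```

These values match the row for class 1 and the overall accuracy in the classification report.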

Let's look at some practical applications.

---
## Exercise
---

### Stock Price Prediction Model

Dataset: https://www.kaggle.com/datasets/svaningelgem/nyse-100-daily-stock-prices?resource=download


In [None]:
from google.colab import files
upload = files.upload()

Saving kaggle.json to kaggle.json


In [None]:
# Kaggle: move the uploaded API token into ~/.kaggle and restrict its permissions
!mkdir -p ~/.kaggle && mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d svaningelgem/nyse-100-daily-stock-prices

Dataset URL: https://www.kaggle.com/datasets/svaningelgem/nyse-100-daily-stock-prices
License(s): copyright-authors
Downloading nyse-100-daily-stock-prices.zip to /content
 47% 5.00M/10.5M [00:00<00:00, 34.5MB/s]
100% 10.5M/10.5M [00:00<00:00, 54.4MB/s]


In [None]:
!unzip nyse-100-daily-stock-prices.zip

Archive:  nyse-100-daily-stock-prices.zip
  inflating: ABBV.csv                
  inflating: ABT.csv                 
  inflating: ACN.csv                 
  inflating: AMT.csv                 
  inflating: ANET.csv                
  inflating: APH.csv                 
  inflating: AXP.csv                 
  inflating: BA.csv                  
  inflating: BABA.csv                
  inflating: BAC.csv                 
  inflating: BHP.csv                 
  inflating: BLK.csv                 
  inflating: BP.csv                  
  inflating: BRK.B.csv               
  inflating: BSX.csv                 
  inflating: BUD.csv                 
  inflating: BX.csv                  
  inflating: C.csv                   
  inflating: CAT.csv                 
  inflating: CB.csv                  
  inflating: CI.csv                  
  inflating: COP.csv                 
  inflating: CRM.csv                 
  inflating: CVX.csv                 
  inflating: DE.csv                  
  inflat

In [None]:
import pandas as pd
dataframe = pd.read_csv('IBM.csv')
dataframe.head()

  ticker        date     open     high      low    close
0    IBM  1962-01-02   5.0461   5.0461  4.98716  4.98716
1    IBM  1962-01-03  4.98716  5.03292  4.98716  5.03292
2    IBM  1962-01-04  5.03292  5.03292  4.98052  4.98052
3    IBM  1962-01-05  4.97389  4.97389  4.87511  4.88166
4    IBM  1962-01-08  4.88166  4.88166  4.75059  4.78972


In [None]:
dataframe.shape
ibm_df = pd.read_csv('IBM.csv', parse_dates=['date'])
ibm_df.head()

  ticker        date     open     high      low    close
0    IBM  1962-01-02   5.0461   5.0461  4.98716  4.98716
1    IBM  1962-01-03  4.98716  5.03292  4.98716  5.03292
2    IBM  1962-01-04  5.03292  5.03292  4.98052  4.98052
3    IBM  1962-01-05  4.97389  4.97389  4.87511  4.88166
4    IBM  1962-01-08  4.88166  4.88166  4.75059  4.78972


In [None]:
ibm_df = ibm_df.set_index('date')
ibm_df.head()

           ticker     open     high      low    close
date
1962-01-02    IBM   5.0461   5.0461  4.98716  4.98716
1962-01-03    IBM  4.98716  5.03292  4.98716  5.03292
1962-01-04    IBM  5.03292  5.03292  4.98052  4.98052
1962-01-05    IBM  4.97389  4.97389  4.87511  4.88166
1962-01-08    IBM  4.88166  4.88166  4.75059  4.78972

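Parsing the `date` column and using it as the index makes date-based operations straightforward. The sketch below inlines the five rows displayed above so that it runs without the download; the `daily_return` column is an illustrative choice for a first feature, not part of the dataset.

```python
import io
import pandas as pd

# First rows of IBM.csv as displayed above, inlined so the sketch is self-contained
csv_text = """ticker,date,open,high,low,close
IBM,1962-01-02,5.0461,5.0461,4.98716,4.98716
IBM,1962-01-03,4.98716,5.03292,4.98716,5.03292
IBM,1962-01-04,5.03292,5.03292,4.98052,4.98052
IBM,1962-01-05,4.97389,4.97389,4.87511,4.88166
IBM,1962-01-08,4.88166,4.88166,4.75059,4.78972
"""
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"]).set_index("date")

# Label-based slicing on a DatetimeIndex includes both endpoints
window = sample.loc["1962-01-03":"1962-01-05"]

# Day-over-day relative change in the closing price; the first row has no predecessor
sample["daily_return"] = sample["close"].pct_change()

print(window)
print(sample[["close", "daily_return"]])
```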

### Categorizing Emails as Spam or Not Spam

Dataset: https://www.kaggle.com/datasets/ozlerhakan/spam-or-not-spam-dataset