#   LAB 04 - Python version

Luca Catalano, Daniele Rege Cambrin and Eleonora Poeta 

### Disclaimer

This material is intended for students who want to learn how to solve the problems presented in the laboratory classes using Python. It was prepared because, in recent years, some students have chosen Python for their exam projects.

To solve these exercises using Python, you need to install Python (version 3.9.6 or later) and some libraries using pip or conda.

Here's a list of the libraries needed for this case:

- `os`: Provides operating system dependent functionality, commonly used for file operations such as reading and writing files, interacting with the filesystem, etc.
- `pandas`: A data manipulation and analysis library that offers data structures and functions to efficiently work with structured data.
- `numpy`: A numerical computing library that provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
- `matplotlib.pyplot`: A plotting library for creating visualizations like charts, graphs, histograms, etc.
- `sklearn`: Machine learning algorithms and tools.
- `sklearn_extra`: Additional machine learning algorithms and extensions.
- `nltk`: The Natural Language Toolkit, a library for natural language processing tasks such as tokenization, stemming, part-of-speech tagging, and more.
- `xlrd`: A Python library used for reading data and formatting information from legacy Excel files (`.xls` format). Since version 2.0, `xlrd` no longer reads `.xlsx` files; pandas relies on `openpyxl` for those.

You can download Python from [here](https://www.python.org/downloads/) and follow the installation instructions for your operating system.

For installing libraries using [pip](https://pip.pypa.io/en/stable/) or [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html), you can use the following commands:

- Using pip:
  ```
  pip install pandas numpy matplotlib nltk scikit-learn xlrd scikit-learn-extra
  ```

- Using conda:
  ```
  conda install -c conda-forge pandas numpy matplotlib nltk scikit-learn xlrd scikit-learn-extra
  ```

Make sure to run these commands in your terminal or command prompt after installing Python. You can also execute them in a cell of a Jupyter Notebook file (`.ipynb`) by starting the command with '!'.

#   Exercise 1

Import some libraries

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

from sklearn_extra.cluster import KMedoids
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.cluster import DBSCAN


from sklearn.metrics import silhouette_score

## Read the Excel file

### Read the Excel file named "UserSmall.xls"

To read the Excel file, use the `pd.read_excel()` function from the pandas library, passing the path of the file to read as its argument.

In a Jupyter Notebook cell, you can print a subset of the representation by simply calling the name of the variable containing the DataFrame. 
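As a minimal sketch of the loading step, assuming `UserSmall.xls` sits in the notebook's working directory (the path and the helper name `load_users` are illustrative assumptions, not part of the lab text):

```python
import os

import pandas as pd


def load_users(path):
    """Read an Excel sheet into a DataFrame (.xls files require xlrd)."""
    return pd.read_excel(path)


# Assumed location of the lab file; adjust it to your own folder layout.
excel_path = "UserSmall.xls"

if os.path.exists(excel_path):
    dataset = load_users(excel_path)
    # head() shows only the first rows, similar to evaluating `dataset` in a cell
    print(dataset.head())
else:
    print(f"{excel_path} not found; adjust excel_path before running")
```

The existence check is only there so the sketch does not fail when the file is stored elsewhere.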

## How to handle Missing values?

### Find if there are missing values in our dataset. 

In real datasets, missing values are usually stored as NaN. In this dataset, however, they are represented by the '?' symbol.

So, first of all, replace each '?' symbol with NaN [using `.replace()`].

Then count the number of missing values in each column [using the `.isnull()` and `.sum()` functions].
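The two steps can be tried on a small hand-made table first. The DataFrame `toy` and its columns are invented for illustration and are not the lab dataset:

```python
import numpy as np
import pandas as pd

# Toy table where '?' marks a missing value (hypothetical data)
toy = pd.DataFrame({
    "Age": [25, "?", 40, 31],
    "Sex": ["M", "F", "?", "F"],
    "Income": [1000, 2000, 1500, "?"],
})

# Turn every '?' into a proper NaN
toy = toy.replace("?", np.nan)

# isnull() gives a boolean table; sum() counts the True values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```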

### Replace the missing values

As you have seen in class, there are different methods for filling NaN values. Here we use the average for numerical columns and the most frequent value for non-numerical columns [use the `.fillna()` function together with `.mean()` and `.mode()`].
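A minimal sketch of this filling strategy on invented data (the DataFrame `toy` is an assumption for illustration). It uses `.mode()` to find the most frequent value, which is one possible way to do it:

```python
import numpy as np
import pandas as pd

# Toy table with one missing value per column (hypothetical data)
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Sex": ["F", np.nan, "F"],
})

for column in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[column]):
        # Numerical column: fill with the column average
        toy[column] = toy[column].fillna(toy[column].mean())
    else:
        # Non-numerical column: fill with the most frequent value
        toy[column] = toy[column].fillna(toy[column].mode()[0])

print(toy)
```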

In [None]:
# Replace NaN values with the average value for numerical columns
    # Get the average value for the column and replace NaN values with it

# Replace NaN values with the most frequent value for non-numerical columns
    # Get the most frequent string value
    # Get the most frequent value for the column and replace NaN values with it

In [None]:
# Check (printing the variable dataset) that there are no NaN values left


##  Outlier detection

### Plot the dataset features using the pyplot library

You can plot a scatter/bubble plot to identify some outliers
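A small sketch of such a plot, with 'Age' on the y-axis. The values and the 'Income' column are invented for illustration; any other attribute could go on the x-axis:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data containing two suspicious ages (150 and -3)
toy = pd.DataFrame({
    "Income": [1200, 1500, 900, 2000, 1100],
    "Age": [25, 150, 40, -3, 33],
})

fig, ax = plt.subplots()
# One point per row; outliers stand out far from the main group
ax.scatter(toy["Income"], toy["Age"])
ax.set_xlabel("Income")
ax.set_ylabel("Age")
ax.set_title("Age vs. Income")
```

In a notebook the figure is displayed when the cell finishes; in a plain script you would add `plt.show()` at the end.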

In [None]:
# Fix the 'Age' attribute on the y-axis


# Plot a scatter/bubble plot with an attribute on the x-axis. You can choose whatever attribute you want


As evident, the 'Age' attribute in our dataset contains errors, such as improbable values like 150 for age or an age less than 0. To ensure the integrity of our data, we need to perform cleaning by filtering out such rows from the dataset.
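The filtering can be sketched with a boolean mask. The toy data is invented, and the bounds 0 and 105 follow the range suggested in the cell below:

```python
import pandas as pd

# Hypothetical data containing two implausible ages
toy = pd.DataFrame({
    "Age": [25, 150, 40, -3, 33],
    "Sex": ["M", "F", "F", "M", "F"],
})

# True for rows whose age is plausible, False otherwise
mask = (toy["Age"] >= 0) & (toy["Age"] <= 105)

# Keep only the rows where the mask is True
toy = toy[mask]
print(toy)
```

Note the parentheses around each comparison: `&` binds more tightly than `>=` and `<=`, so they are required.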

In [None]:
# Get the condition for the age values (between 0 and 105 years old) [use a boolean condition saved in mask variable]

# Apply the condition to the dataset and store the result in the dataset variable (overwrite the previous dataset)

## Select attributes

### Remove the 'Response' attribute (the last column) from the dataset variable

The `.iloc` indexer in pandas is used for integer-location based indexing. It allows you to select rows and columns from a DataFrame by their integer position rather than by label, similar to indexing in NumPy arrays.

### Syntax

```python
DataFrame.iloc[row_indexer, column_indexer]
```

- `row_indexer`: Specifies the rows to select. It can be:
  - An integer, e.g., `2`.
  - A list or array of integers, e.g., `[1, 3, 5]`.
  - A slice object with integers, e.g., `1:4`.
  - A boolean array.

- `column_indexer`: Specifies the columns to select. It follows the same rules as `row_indexer`.

### Example Usage

```python
import pandas as pd

# Creating a sample DataFrame
data = {'A': [1, 2, 3, 4],
        'B': [5, 6, 7, 8],
        'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)

# Selecting specific rows and columns using iloc
selected_data = df.iloc[1:3, 0:2]
print(selected_data)
```

### Output

```
   A  B
1  2  6
2  3  7
```

### Notes

- `.iloc` is exclusive of the end index when using slices, similar to Python indexing conventions.
- If you want to select specific rows and columns by label instead of position, you should use the `.loc` indexer.
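Negative positions also work with `.iloc`. The sketch below reuses the sample DataFrame from the example above to drop its last column; the variable name `without_last` is an assumption for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'C': [9, 10, 11, 12]})

# All rows, every column except the last one (the slice stops before position -1)
without_last = df.iloc[:, :-1]
print(without_last.columns.tolist())
```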

In [None]:
# Remove the last column from the dataset (which is the Response attribute) [you can use index -1 in Python]

In [None]:
# print dataset and check that the last column has been removed

##  Normalization

### Normalize age attribute

In [None]:
# Initialize MinMaxScaler

# Normalize the 'age' attribute [using the .fit_transform() function]


After normalization, the 'Age' values should lie in the range [0, 1].
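A minimal sketch of min-max normalization on invented ages (the DataFrame `toy` is an assumption). `MinMaxScaler` maps the smallest value to 0 and the largest to 1:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages
toy = pd.DataFrame({"Age": [20, 35, 50, 80]})

scaler = MinMaxScaler()
# fit_transform expects a 2D input, hence the double brackets;
# ravel() flattens the result back to one value per row
toy["Age"] = scaler.fit_transform(toy[["Age"]]).ravel()
print(toy)
```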

In [None]:
# Print the dataset to check that the 'Age' values are now in the expected range

##   KMedoids Clustering

### Apply KMedoids clustering

K-medoids is a partitioning clustering method closely related to K-means. Each cluster is represented by a medoid, that is, an actual data point of the cluster that minimizes the total dissimilarity to the other points assigned to it. Because medoids are real observations rather than averages, K-medoids is generally less sensitive to outliers than K-means and can be used with arbitrary distance measures.
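For intuition about what a K-medoids model computes, here is a small NumPy sketch of the simple alternating heuristic (assign points to the nearest medoid, then re-pick each medoid). It is not the `sklearn_extra` implementation; the function name `k_medoids_sketch`, the Euclidean distance, and the toy points are assumptions for illustration only:

```python
import numpy as np


def k_medoids_sketch(X, k, n_iter=100, seed=0):
    """Alternating K-medoids heuristic on Euclidean distances (illustrative)."""
    rng = np.random.default_rng(seed)
    # Pairwise distance matrix between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start from k distinct points chosen at random
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest medoid
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue
            # Update step: the member with the smallest total distance
            # to the other members becomes the new medoid
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels


toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
medoid_indices, toy_labels = k_medoids_sketch(toy_points, k=2)
print(medoid_indices, toy_labels)
```

Unlike K-means centroids, the returned medoids are row indices of real data points. In the exercise itself you would use `KMedoids` from `sklearn_extra.cluster`, as imported at the top of the notebook.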

In [None]:
# Make a copy of the dataset to avoid modifying the original data


# Instantiate a LabelEncoder object to encode categorical variables

# Iterate through columns of the dataset that have object data type
# and encode them using the LabelEncoder


# Instantiate a KMedoids clustering model with 3 clusters


# Fit the KMedoids clustering model to the encoded dataset
# and obtain cluster labels for each data point


### Evaluate the cluster algorithm

Silhouette score is a metric used to evaluate the quality of clustering in unsupervised learning. It quantifies how similar an object is to its own cluster compared to other clusters. The silhouette score ranges from -1 to 1.
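A quick sketch on invented points shows how `silhouette_score` is called: it takes the data and one label per row. Two compact, well-separated groups give a score close to 1:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight groups far away from each other (hypothetical data)
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
toy_labels = np.array([0, 0, 1, 1])

score = silhouette_score(toy_points, toy_labels)
print(f"silhouette score: {score:.3f}")
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that points may have been assigned to the wrong cluster.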

In [None]:
# use the silhouette_score function to evaluate the clustering model


##  SVD

### Apply SVD algorithm after performing KMedoids cluster algorithm

This code uses a dimensionality reduction technique, SVD, to reduce the dataset to 2 dimensions and then visualizes the data points in a 2D scatter plot. Each data point is colored according to the cluster label assigned by the clustering algorithm (`cluster_labels`).
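A minimal sketch of the projection and plot, using two components as requested in the cell below. The random data and the labels in `toy_labels` are placeholders standing in for the encoded dataset and the KMedoids output:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Placeholder data: 30 rows, 5 numerical features, 3 made-up cluster labels
rng = np.random.default_rng(0)
toy_data = rng.random((30, 5))
toy_labels = np.repeat([0, 1, 2], 10)

# Project the data onto its first two singular directions
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(toy_data)

fig, ax = plt.subplots()
# c= colors each point by its cluster label
ax.scatter(reduced[:, 0], reduced[:, 1], c=toy_labels)
ax.set_title("SVD projection colored by cluster")
ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
```

SVD itself never sees the labels; they only determine the colors of the points.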

In [None]:
# Instantiate a TruncatedSVD object with 2 components


# Fit the TruncatedSVD model to the encoded dataset and transform it
# (SVD is unsupervised: the KMedoids cluster labels are only used to color the plot)


# Create a new figure for plotting

# Scatter plot of the transformed data with colors representing cluster labels

# Set plot title and axis labels

# Show the plot


##  LDA

### Apply LDA algorithm after performing KMedoids cluster algorithm and compare the visualization done before with SVD
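Unlike SVD, LDA is supervised: it needs the cluster labels when fitting. A small sketch on synthetic data follows; the three groups and their centers are invented for illustration. With 3 classes, LDA can produce at most 2 components:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic data: three groups of 10 points in 4 dimensions (hypothetical)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0, 0.0],
                    [5.0, 5.0, 0.0, 0.0],
                    [0.0, 5.0, 5.0, 0.0]])
toy_labels = np.repeat([0, 1, 2], 10)
toy_data = centers[toy_labels] + rng.normal(scale=0.5, size=(30, 4))

lda = LinearDiscriminantAnalysis(n_components=2)
# The labels are passed as the second argument of fit_transform
projected = lda.fit_transform(toy_data, toy_labels)
print(projected.shape)
```

The two columns of `projected` can be passed to `ax.scatter(..., c=toy_labels)` to draw the same kind of colored scatter plot used for SVD.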

In [None]:
# Instantiate an LDA object with 2 components

# Fit the LDA model to the dataset and transform it
# using the cluster labels obtained from the KMedoids clustering


# Create a new figure for plotting


# Scatter plot of the transformed data with colors representing cluster labels

# Set plot title and axis labels

# Show the plot


##  DBSCAN

### Apply DBSCAN algorithm to the dataset after converting it into one-hot encoded form [use `pd.get_dummies()`]

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm used in machine learning and data mining. It groups together points that are closely packed together based on their density in a high-dimensional space. Unlike other clustering algorithms, DBSCAN doesn't require the number of clusters to be specified in advance. Instead, it defines clusters as continuous regions of high density separated by regions of low density.

The key parameters of DBSCAN are:

- Epsilon (ε): A distance threshold that determines the neighborhood of a point (`eps` in scikit-learn).
- MinPts: The minimum number of points required to form a dense region (`min_samples` in scikit-learn).

DBSCAN works by iteratively exploring the neighborhood of each point. A point is classified as a core point if it has at least MinPts points within its ε-neighborhood. Core points are then used to expand clusters by adding neighboring points to the same cluster. Points that are not core points themselves but are within the ε-neighborhood of a core point are classified as border points and are included in the cluster. Points that are neither core points nor within the ε-neighborhood of any core point are considered noise and are not assigned to any cluster.

DBSCAN is particularly useful for clustering data with irregular shapes and handling noise effectively. It's robust to outliers and doesn't require specifying the number of clusters beforehand, making it suitable for various applications, including spatial data analysis, anomaly detection, and image segmentation. However, choosing appropriate values for ε and MinPts can be challenging and may significantly affect the clustering results.
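The behaviour described above can be observed on a tiny invented dataset with two dense groups and one isolated point. The column names and values are assumptions; the parameters match those suggested in the cell below:

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense groups plus one far-away point
toy = pd.DataFrame({
    "x": [0.0, 0.0, 0.5, 10.0, 10.0, 10.5, 50.0],
    "y": [0.0, 0.5, 0.0, 10.0, 10.5, 10.0, 50.0],
    "group": ["a", "a", "a", "b", "b", "b", "a"],
})

# One-hot encode the categorical column; numerical columns are left unchanged
encoded = pd.get_dummies(toy, dtype=float)

dbscan = DBSCAN(eps=1.0, min_samples=3)
labels = dbscan.fit_predict(encoded)

# Noise points receive the label -1 and do not count as a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

In scikit-learn, `min_samples` counts the point itself, so each point in a group of three mutually close points qualifies as a core point here.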

In [None]:
# Make a copy of the dataset to avoid modifying the original data


# Perform one-hot encoding on the categorical variables in the dataset


# Initialize DBSCAN with specified parameters (epsilon=1.0, min_samples=3)


# Fit DBSCAN to the one-hot encoded dataset and obtain cluster labels


# Analyze the clusters
# Calculate the number of clusters and the number of noise points



### Evaluate the cluster algorithm

Silhouette score is a metric used to evaluate the quality of clustering in unsupervised learning. It quantifies how similar an object is to its own cluster compared to other clusters. The silhouette score ranges from -1 to 1.

In [None]:
silhouette_score(dataset_copy, cluster_labels)

#   Exercise 2

Import some libraries

In [None]:
import os
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

##  Data preprocessing

### Preprocess text files stored in a specific folder, tokenize them, remove stopwords, perform stemming, and then convert them into a TF-IDF (Term Frequency-Inverse Document Frequency) matrix using scikit-learn's TfidfVectorizer. 

In [None]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Define the folder path containing the text files

# List to hold preprocessed text from each file

# List to hold file names

# stemming
stemmer = SnowballStemmer("english")
# file with the stopwords
file_italian_stopwords = open("TODO")
italian_stopwords = set(file_italian_stopwords.read().splitlines())

# Loop through files in the folder
for file_name in os.listdir('''TO COMPILE'''):
    # open file

        # Read the text from the file

        # Tokenization

        # Transform to lowercase

        # Remove stopwords

        # Stemming

        # Join the tokens back to form preprocessed text

        # Append the preprocessed text to the list

        # Append the file name to the list


# Initialize the TfidfVectorizer

# Fit and transform the preprocessed text data


# Convert the TF-IDF matrix to a DataFrame for better visualization



##  KMeans

### Apply KMeans algorithm to the new dataset

K-means is a popular clustering algorithm used in unsupervised machine learning for partitioning a dataset into a predetermined number of clusters. It aims to group similar data points together by minimizing the sum of squared distances between each point and the centroid of its cluster. The algorithm iteratively assigns each data point to the nearest cluster centroid and recalculates the centroids as the mean of the points in each cluster. This process continues until the centroids no longer change significantly, indicating convergence.

K-means is sensitive to the initial placement of centroids, and different initializations can lead to different clustering results. Therefore, multiple runs with random initializations are often performed to mitigate this issue. While K-means is computationally efficient and easy to implement, it assumes roughly spherical clusters and struggles with irregularly shaped clusters. It may also perform poorly on datasets with varying densities or clusters of unequal sizes. Despite these limitations, K-means remains widely used for clustering tasks in various domains due to its simplicity and scalability.
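A minimal sketch of the steps listed in the cell below, run on four invented sentences instead of the lab documents. The number of clusters, `max_iter=100`, `n_init=10`, and `random_state=0` are illustrative choices, not values prescribed by the lab:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical documents covering two unrelated topics
toy_docs = [
    "cat dog pet",
    "dog cat animal",
    "stock market price",
    "market stock trade",
]
tfidf = TfidfVectorizer().fit_transform(toy_docs)

# n_init repeats the run with different initial centroids and keeps the best
kmeans = KMeans(n_clusters=2, max_iter=100, n_init=10, random_state=0)
kmeans.fit(tfidf)
centroids = kmeans.cluster_centers_

# One row per document, one column per centroid
similarity = cosine_similarity(tfidf, centroids)

# Each document goes to the centroid it is most similar to
assigned = similarity.argmax(axis=1)
for doc, cluster in zip(toy_docs, assigned):
    print(f"cluster {cluster}: {doc}")
```

The cluster numbers themselves are arbitrary; only which documents end up together is meaningful.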

In [None]:
# K-Means clustering

# Number of clusters

# Maximum number of iterations

# Obtain the centroids of the clusters

# Calculate the cosine similarity between the documents and the cluster centroids

# Assign the documents to clusters based on maximum cosine similarity

# Print the results
