#   LAB 01 - Python version

Luca Catalano, Daniele Rege Cambrin and Eleonora Poeta 

### Disclaimer

This material is intended for students who want to solve the problems presented in the laboratory classes using Python. It was prepared because, in recent years, some students have chosen Python for their exam projects.

To solve these exercises using Python, you need to install Python (version 3.9.6 or later) and some libraries using pip or conda.

The following libraries are needed:

- `os`: Provides operating-system-dependent functionality, commonly used for file operations such as reading and writing files and interacting with the filesystem. It is part of the Python standard library, so it does not need to be installed.
- `pandas`: A data manipulation and analysis library that offers data structures and functions to efficiently work with structured data.
- `numpy`: A numerical computing library that provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
- `matplotlib.pyplot`: A plotting library for creating visualizations like charts, graphs, histograms, etc.
- `sklearn`: Machine learning algorithms and tools.
- `sklearn_extra`: Additional machine learning algorithms and extensions.
- `nltk`: The Natural Language Toolkit, a library for natural language processing tasks such as tokenization, stemming, part-of-speech tagging, and more.
- `xlrd`: A Python library for reading data and formatting information from Excel files in the legacy `.xls` format (versions 2.0 and later no longer read `.xlsx` files). pandas uses it as the engine for reading `.xls` files.

You can download Python from [here](https://www.python.org/downloads/) and follow the installation instructions for your operating system.

For installing libraries using [pip](https://pip.pypa.io/en/stable/) or [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html), you can use the following commands:

- Using pip:
  ```
  pip install pandas numpy matplotlib nltk scikit-learn xlrd scikit-learn-extra
  ```

- Using conda:
  ```
  conda install -c conda-forge pandas numpy matplotlib nltk scikit-learn xlrd scikit-learn-extra
  ```

Make sure to run these commands in your terminal or command prompt after installing Python. You can also execute them in a cell of a Jupyter Notebook file (`.ipynb`) by starting the command with '!'.

#   Exercise 1

### Import some libraries

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Read the Excel file

### Read the Excel file named "UserSmall.xls"

To read the Excel file, use the `pd.read_excel()` function built into the pandas library, passing the path of the file to be read as its argument.

In a Jupyter Notebook cell, you can print a subset of the representation by simply calling the name of the variable containing the DataFrame. 
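
The call can be sketched as follows. This is a minimal example that assumes the file sits in the current working directory; the helper name `load_dataset` and the existence check are illustrative additions, not part of the lab text.

```python
import os
import pandas as pd


def load_dataset(path):
    """Read an Excel file into a DataFrame (pandas relies on xlrd for .xls files)."""
    return pd.read_excel(path)


# Assumed location of the file; adjust it to where you saved the dataset
path = "UserSmall.xls"

if os.path.exists(path):
    dataset = load_dataset(path)
    # In a notebook, writing `dataset` alone on the last line of a cell shows a similar preview
    print(dataset.head())
else:
    print(f"File not found: {path}")
```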

## How to handle missing values?

### Check whether the dataset contains missing values

In real datasets, missing values are usually stored as NaN. In this dataset they are represented by the '?' symbol instead.

So, first of all, replace each '?' symbol with a NaN value [using .replace()].

Then count the number of missing values in each column [using the .isnull() and .sum() functions].
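
A minimal sketch of these two steps on a toy DataFrame. The `Age` and `City` columns and their values are hypothetical stand-ins for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (hypothetical columns and values)
toy = pd.DataFrame({
    "Age": [25, "?", 40, 31],
    "City": ["Turin", "Milan", "?", "Turin"],
})

# Replace every '?' with NaN
toy = toy.replace("?", np.nan)

# Count the missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Each column of the toy data contains one `?`, so each count is 1.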

### Replace the missing values

As you have seen in class, there are different methods for filling NaN values. Here we use the average for numerical columns and the most frequent value for non-numerical columns [use the .fillna() and .mean() functions].

In [None]:
# Replace NaN values with the average value for numerical columns
    # Get the average value for the column and replace NaN values with it

# Replace NaN values with the most frequent value for non-numerical columns
    # Get the most frequent string value
    # Get the most frequent value for the column and replace NaN values with it

In [None]:
# Check (printing the variable dataset) that there are no NaN values left
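
One possible way to implement the filling strategy described above, shown on toy data. The column names and values are hypothetical, and using `mode()` to obtain the most frequent value is one choice among several.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (hypothetical columns and values)
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "City": ["Turin", "Milan", np.nan, "Turin"],
})

for column in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[column]):
        # Numerical column: replace NaN with the column average
        toy[column] = toy[column].fillna(toy[column].mean())
    else:
        # Non-numerical column: replace NaN with the most frequent value
        most_frequent = toy[column].mode()[0]
        toy[column] = toy[column].fillna(most_frequent)

print(toy)
# Every count should now be zero
print(toy.isnull().sum())
```

In the real dataset, a numerical column that contained '?' may still have the `object` dtype after the replacement. Converting it with `pd.to_numeric` before computing the mean is likely needed in that case.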


##  Outlier detection

### Plot the dataset features using the pyplot library

You can draw a scatter/bubble plot to identify outliers.

In [None]:
# Fix the 'Age' attribute on the y-axis


# Plot a scatter/bubble plot with an attribute on the x-axis. You can choose whichever attribute you want
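
A small sketch of such a plot on toy data. The `Income` column and all values are hypothetical; any other attribute could be placed on the x-axis.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (hypothetical values), including two suspicious ages
toy = pd.DataFrame({
    "Age": [25, 40, 150, 31, -3],
    "Income": [1200, 2500, 1800, 2100, 1500],
})

fig, ax = plt.subplots()
# Chosen attribute on the x-axis, 'Age' fixed on the y-axis
ax.scatter(toy["Income"], toy["Age"])
ax.set_xlabel("Income")
ax.set_ylabel("Age")
ax.set_title("Age vs. Income")
# Outside a notebook, call plt.show() to open the figure window
```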


As evident, the 'Age' attribute in our dataset contains errors, such as improbable values like 150 for age or an age less than 0. To ensure the integrity of our data, we need to perform cleaning by filtering out such rows from the dataset.

In [None]:
# Get the condition for the age values (between 0 and 105 years old) [use a boolean condition saved in mask variable]

# Apply the condition to the dataset and store the result in the dataset variable (overwrite the previous dataset)
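
The filtering step can be sketched with a boolean mask as follows. The toy values are hypothetical, and treating both bounds as inclusive is an assumption.

```python
import pandas as pd

# Toy data (hypothetical values) with two implausible ages
toy = pd.DataFrame({"Age": [25, 40, 150, 31, -3]})

# Boolean condition: True for rows whose age lies between 0 and 105 (bounds included)
mask = (toy["Age"] >= 0) & (toy["Age"] <= 105)

# Keep only the rows that satisfy the condition, overwriting the previous variable
toy = toy[mask]
print(toy)
```

The two conditions are combined with `&`, and each one needs its own parentheses because `&` binds more tightly than the comparison operators.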

##  Discretize some data

Data discretization is a preprocessing technique used to transform continuous data into discrete intervals or categories. This process involves dividing the continuous range of values into a finite number of intervals, or bins. Discretization is commonly used in data analysis and machine learning tasks to simplify data representation, reduce noise, and improve the performance of algorithms.

In [None]:
# Define bin edges
# Define your own bin edges as needed

# Define bin labels

# Discretize 'Age' attribute using cut() function
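
A minimal sketch of discretization with `pd.cut()`. The bin edges, the labels, and the new `AgeGroup` column name are illustrative choices, not values prescribed by the lab.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"Age": [5, 17, 25, 40, 70, 101]})

# Bin edges (illustrative): (0, 18], (18, 35], (35, 65], (65, 105]
bin_edges = [0, 18, 35, 65, 105]

# One label per interval, so there is one label fewer than edges
bin_labels = ["child", "young adult", "adult", "senior"]

# Discretize 'Age' into the labelled intervals
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bin_edges, labels=bin_labels)
print(toy)
```

By default the intervals are closed on the right, so an age equal to the lowest edge (0 here) falls outside every bin and becomes NaN unless `include_lowest=True` is passed.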

#   Exercise 2

### Import some libraries

In [None]:
import os
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

##  Data preprocessing

### Preprocess text files stored in a specific folder, tokenize them, remove stopwords, perform stemming, and then convert them into a TF-IDF (Term Frequency-Inverse Document Frequency) matrix using scikit-learn's TfidfVectorizer. 

### This code can be divided into different sections:

#### Download NLTK resources
- The code begins by downloading the necessary resources from the NLTK (Natural Language Toolkit) library, specifically the 'punkt' tokenizer and the 'stopwords' corpus.

#### Define Folder Path and Initialize Variables
- Next, the code defines the folder path containing the text files to be processed.
- It initializes two lists: 'preprocessed_texts' to hold preprocessed text from each file, and 'file_names' to store the names of the files.

#### Initialize Snowball Stemmer and Define Italian Stopwords
- A Snowball stemmer is initialized to reduce each word to its stem.
- Italian stopwords are defined to remove common and uninformative words from the text. Stopwords are loaded either from NLTK's stopwords corpus or from a custom file.

#### Loop Through Files in the Folder
- The code iterates through each file in the specified folder.
- It reads the content of each file, tokenizes the text into individual words, converts them to lowercase, removes stopwords, and performs stemming on the remaining tokens.
- The preprocessed text is then joined back together and appended to the 'preprocessed_texts' list. Additionally, the file name is added to the 'file_names' list.
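
The per-file steps above can be sketched for a single string as follows. Several parts are assumptions made for illustration: the stemmer language is set to "italian" on the assumption that the documents are Italian, the stopword set is a tiny hypothetical one, non-alphabetic tokens are dropped, and a regular expression replaces `word_tokenize` when the NLTK tokenizer data has not been downloaded.

```python
import re

from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

# Assumption: the documents are in Italian, so the stemmer language is "italian"
stemmer = SnowballStemmer("italian")
# Tiny hypothetical stopword set; the lab loads the real one from a file or from NLTK
stopwords = {"il", "nel", "e"}


def tokenize(text):
    """Tokenize with NLTK, falling back to a regex if the tokenizer data is missing."""
    try:
        return word_tokenize(text, language="italian")
    except LookupError:
        # Swapped-in technique: plain regex word matching instead of the NLTK tokenizer
        return re.findall(r"\w+", text)


def preprocess(text):
    tokens = tokenize(text)
    # Lowercase and keep alphabetic tokens only (this drops punctuation)
    tokens = [t.lower() for t in tokens if t.isalpha()]
    # Remove stopwords
    tokens = [t for t in tokens if t not in stopwords]
    # Stemming
    tokens = [stemmer.stem(t) for t in tokens]
    # Join the tokens back into a single string
    return " ".join(tokens)


preprocessed = preprocess("Il gatto corre nel giardino.")
print(preprocessed)
```

Recent NLTK releases may look for the `punkt_tab` resource rather than `punkt`; if `word_tokenize` raises a `LookupError`, the error message names the resource to download.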

#### TF-IDF Vectorization
- The TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer is initialized to convert the preprocessed text data into a matrix of TF-IDF features.
- The preprocessed text data is fitted and transformed using the TF-IDF vectorizer, resulting in a TF-IDF matrix.
- The TF-IDF matrix is converted into a DataFrame for better visualization, with columns representing unique features extracted from the text and rows corresponding to the files processed.


In [None]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Define the folder path containing the text files

# List to hold preprocessed text from each file

# List to hold file names

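# Stemming (the language passed here should match the language of the documents)
stemmer = SnowballStemmer("english")
# File with the stopwords, one per line; the with block closes the file after reading
with open("TODO") as file_italian_stopwords:
    italian_stopwords = set(file_italian_stopwords.read().splitlines())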

# Loop through files in the folder
for file_name in os.listdir('''TO COMPILE'''):
    # open file

        # Read the text from the file

        # Tokenization

        # Transform to lowercase

        # Remove stopwords

        # Stemming

        # Join the tokens back to form preprocessed text

        # Append the preprocessed text to the list

        # Append the file name to the list


# Initialize the TfidfVectorizer

# Fit and transform the preprocessed text data


# Convert the TF-IDF matrix to a DataFrame for better visualization

