### Week 2 (W37) – 11.09 – 17.09 - Understand your data and your modelling goal
- an established communication channel and appropriate strategy for code sharing.
- data imported completely and correctly into appropriate matrices: observations as rows, variables (predictors) as columns.
- identification of challenges in the data, for example: time series not synchronized, missing values, extra variables, variables with unknown physical meaning.
- a visualization of the dataset with comments: variable distribution, number of observations, type of measurements (time series or not time series)
- identification of pretreatment steps, and a plan on how to do data pretreatment

### - Importing necessary libraries

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

### - Load the dataset and preprocess

In [None]:
# Load the dataset and preprocess
df = pd.read_csv("../dataset/MiningProcess_Flotation_Plant_Database.csv")


### - Displaying first few rows/observations of the dataset

In [None]:
# Displaying first few rows/observations of the dataset
df.head()

### - Number of Rows and columns

In [None]:
# Rows and columns
df.shape

### - Print all the column names

In [None]:
# Print all the column names
df.columns.to_list()

### - Convert the date column to CORRECT date-times
- This helps with further processing of the data
- The 'date' column needs to be converted so that the data forms a proper time series

In [None]:
# Check the number of entries for each hourly timestamp:
# at one sample every 20 s, a complete hour has 3600 / 20 = 180 rows
count_df = df.groupby(['date']).count()
wrong_num_sample_ids = count_df.index[count_df["Starch Flow"] != 180]
# wrong_num_sample_ids

# Convert the date column to datetime
df['date'] = pd.to_datetime(df['date'])
# The real start of recording is from 00:03:00
DATA_COLLECT_START = df["date"][0].replace(minute=3)

# Fix the missing entry on 10 May 2017
row_to_repeat = df.loc[df["date"] == wrong_num_sample_ids[1]].iloc[0]
df.loc[-1] = row_to_repeat
df = df.sort_index()

# Find gaps in time: rows whose timestamp jumps by more than one hour from the previous row
t = pd.Timedelta('1hour')
mask = df['date'].diff().gt(t)
starts = df.loc[mask.shift(-1, fill_value=False), 'date'].add(t).astype(str)
stops = df.loc[mask, 'date'].sub(t).astype(str)
out = list(zip(starts, stops))

# Get list of start-end periods for data collection
starts_ends = [str(DATA_COLLECT_START),
       str(datetime.datetime.strptime(out[0][0], '%Y-%m-%d %H:%M:%S')
           - datetime.timedelta(hours=1)),
       str(datetime.datetime.strptime(out[0][1], '%Y-%m-%d %H:%M:%S')
           - datetime.timedelta(seconds=20)),
       str(df["date"][df["date"].idxmax()] + datetime.timedelta(hours=1))]

# Get correct datetimes for dataset based on sampling rate (one sample every 20 s)
# Lowercase 's' is the seconds alias; uppercase 'S' is deprecated in recent pandas
dates = pd.date_range(start=starts_ends[0], end=starts_ends[1], freq='20s')
dates = dates.union(pd.date_range(start=starts_ends[2], end=starts_ends[3], freq='20s'))
dates
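
The gap detection used in the cell above can be hard to follow on the full dataset. The following is a minimal sketch of the same `diff()`/`shift()` idea on a toy series; the timestamps and the names `stamps`, `step`, and `gaps` are made up for illustration and are not taken from the plant data. Unlike the cell above, it reports the last stamp before and the first stamp after each gap, without adding or subtracting the step.

```python
import pandas as pd

# Toy hourly timestamps with one deliberate gap (hypothetical data)
stamps = pd.Series(pd.to_datetime([
    "2017-03-10 01:00:00",
    "2017-03-10 02:00:00",
    "2017-03-10 03:00:00",
    "2017-03-10 07:00:00",  # 4 h after the previous stamp: a gap
    "2017-03-10 08:00:00",
]))

step = pd.Timedelta(hours=1)

# True where the jump from the previous stamp is larger than one step
is_after_gap = stamps.diff().gt(step)

# Shifting the mask up by one row marks the last stamp before each gap
before_gap = stamps[is_after_gap.shift(-1, fill_value=False)]
after_gap = stamps[is_after_gap]

# One (last stamp before gap, first stamp after gap) tuple per gap
gaps = list(zip(before_gap, after_gap))
print(gaps)
```

Each tuple brackets one interruption in data collection, which is the information needed to build separate date ranges on either side of the gap.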

### - Replace date column with correct date-times and set variable as index

In [None]:
df["date"] = dates
df.set_index('date', inplace=True)

# Check conversion
df.head()

### - Convert columns/variables to numeric

In [None]:
# Convert columns/variables to numeric
# Replace ',' by '.'
for col in df.columns:
    df[col] = df[col].str.replace(',', '.').astype(float)
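
As a possible alternative to replacing ',' by '.' column by column, `pd.read_csv` can parse decimal commas at load time through its `decimal` parameter. The sketch below uses a made-up two-row excerpt held in memory; the values and the column `Hypothetical Var` are assumptions for illustration, not real plant data.

```python
import io
import pandas as pd

# Hypothetical excerpt with decimal commas in quoted fields (not real plant data)
raw = io.StringIO(
    'date,Starch Flow,Hypothetical Var\n'
    '2017-03-10 01:00:00,"3019,53","55,2"\n'
    '2017-03-10 01:00:00,"3024,41","55,2"\n'
)

# decimal="," makes the parser treat ',' as the decimal separator;
# parse_dates converts the 'date' column to datetime while loading
sample = pd.read_csv(raw, decimal=",", parse_dates=["date"])
print(sample.dtypes)
```

With this approach the numeric columns arrive as floats directly, so the string replacement loop would not be needed. Whether it suits this notebook depends on the order of the later steps, since the hourly count check above runs before the date conversion.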

In [None]:
# Check conversion
df.head()

### - Checking datatypes of all columns

In [None]:
# Checking datatypes of all columns
df.dtypes

### - Missing/null value checking

In [None]:
# Missing/null value checking
# All variables converted to numeric
df.info()

### - Cross-checking if there is any null values

In [None]:
# Cross-checking if there is any null values
df.isnull().sum()

### - Checking additional statistics- (e.g. count, mean, std, 25%, 50%,75%, min, max)

In [None]:
# Checking additional statistics- (e.g. count, mean, std, 25%, 50%,75%, min, max)
df.describe()

#### - Visualization (Box-Plot of all variables)

In [None]:
# Select numeric variables
numeric_variables = df.select_dtypes(include='number')

fig, ax = plt.subplots(figsize=(15, 8))

# Create a box plot for each numeric variable
box_plot = ax.boxplot(numeric_variables.values, vert=False, patch_artist=True)

for box in box_plot['boxes']:
    box.set(facecolor='lightblue')
for whisker in box_plot['whiskers']:
    whisker.set(color='black', linestyle='-', linewidth=1.2)
for median in box_plot['medians']:
    median.set(color='red', linewidth=1.5)

ax.set_yticklabels(numeric_variables.columns)

ax.set_xlabel('Value')
ax.set_title('Box Plot of Numeric Variables')

# Show the plot
plt.tight_layout()
plt.show()

#### - Visualization (individual Box-Plot of all variables)

In [None]:
numeric_variables = df.select_dtypes(include='number')

column_names = numeric_variables.columns

# Subplots per row for better visualization
num_cols = 4
num_rows = (len(column_names) + num_cols - 1) // num_cols  # ceiling division, avoids an empty extra row

fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 12))

if num_rows == 1:
    axes = axes.reshape(1, -1)

# Create box plots for each numeric variable
for i, column in enumerate(column_names):
    row_idx = i // num_cols
    col_idx = i % num_cols
    ax = axes[row_idx, col_idx]

    ax.boxplot(df[column], vert=False)
    ax.set_title(column)
    ax.set_xlabel('Value')

for i in range(len(column_names), num_cols * num_rows):
    fig.delaxes(axes.flatten()[i])

plt.tight_layout()
plt.show()

#### - Visualization (Histogram of all variables)

In [None]:
# Get numeric variables
numeric_variables = df.select_dtypes(include='number')

# Subplots per row for better visualization
subplots_per_row = 4
num_variables = len(numeric_variables.columns)
num_rows = (num_variables + subplots_per_row - 1) // subplots_per_row

fig, axes = plt.subplots(nrows=num_rows, ncols=subplots_per_row, figsize=(16, 4 * num_rows), sharex=False)

axes = axes.flatten()

# Plot histograms for each numeric variable
for i, col in enumerate(numeric_variables.columns):
    ax = axes[i]
    ax.hist(df[col], bins=20)
    ax.set_title(col)
    ax.grid(True)

for i in range(num_variables, num_rows * subplots_per_row):
    fig.delaxes(axes[i])

plt.tight_layout(pad=2.0)
plt.show()

### - Visualization (Correlation matrix)
- Find the relationships among variables, especially when there is a huge number of observations
- Helpful for identifying redundant variables

In [None]:
# All columns were converted to numeric earlier, so no column selection is needed here

# Calculate the correlation matrix
correlation_matrix = df.corr()

# Create a heatmap
plt.figure(figsize=(18, 12))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.show()
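
A heatmap with many variables can be hard to read, so it may help to list the strongly correlated pairs explicitly when looking for redundant variables. The following is a minimal sketch on synthetic data; the frame `toy`, its columns, and the cut-off `THRESHOLD = 0.9` are assumptions chosen for illustration, not values derived from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: "a_copy" is almost a rescaled copy of "a", "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + rng.normal(scale=0.01, size=500),
    "b": rng.normal(size=500),
})

THRESHOLD = 0.9  # assumed cut-off for "highly correlated"

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (self-correlation of 1.0) is excluded
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to a Series indexed by (variable, variable) and filter
pairs = upper.stack()
high_pairs = pairs[pairs > THRESHOLD].sort_values(ascending=False)
print(high_pairs)
```

Applied to the real correlation matrix, the same filtering would give a short list of candidate pairs to inspect; a high correlation alone does not decide which variable of a pair to drop.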

### Descriptions to provide:
- Do we need to resample or preprocess the data to synchronize it? If yes, what is the plan?
- Are there any irrelevant or redundant variables that don't contribute to the analysis or prediction?
- Identification of pretreatment steps, and a plan on how to do data pretreatment
- Identify which variables are time series data and which are not. Time series data will have timestamps associated with them.
- Identify the different measurement units or scales of the variables
- How do we deal with the different measurement units/scales?
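
For the synchronization question, one option is to aggregate the 20-second data onto a coarser common grid. The sketch below shows what that could look like with `resample` on a toy series; the variable `fast_var`, its values, and the choice of an hourly mean are assumptions for illustration, not a decision made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy 20-second series covering exactly two hours (hypothetical values)
idx = pd.date_range("2017-03-10 01:00:00", periods=360,
                    freq=pd.Timedelta(seconds=20))
fast = pd.DataFrame({"fast_var": np.arange(360, dtype=float)}, index=idx)

# Aggregate to one row per hour using the mean of the 180 samples in each hour
hourly = fast.resample(pd.Timedelta(hours=1)).mean()
print(hourly)
```

Other aggregations (median, last value) or the opposite direction (forward-filling slow variables onto the 20-second grid) are equally possible; which one is appropriate depends on the modelling goal.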

### Identification of Pretreatment Steps:

- Handling Missing Values: If missing values are detected, decide on a strategy to handle them. Options include imputation (e.g., mean, median, forward-fill, or interpolation) or removal of rows/columns with missing values.
- Resampling: If time series data is not synchronized, we may need to resample it to a consistent time interval.
- Feature Selection: Assess the relevance of each variable for our analysis or modeling task. Remove any irrelevant or redundant variables.
- Data Scaling/Normalization: Depending on the modeling techniques that we plan to use, we may need to scale or normalize the data to ensure all variables have similar ranges.
- Outlier Detection and Handling: Identify and handle outliers if they exist in the dataset. Outliers can significantly impact modeling results.
- Data Splitting: For our predictive models, we need to decide how to split the data into training, validation, and test sets.
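
Three of the steps above (outlier detection, data splitting, and scaling) can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the columns `x1` and `x2`, the 1.5 × IQR rule, the 70/15/15 chronological split, and z-score scaling are hypothetical choices, not the plan adopted for this dataset.

```python
import numpy as np
import pandas as pd

# Synthetic time-indexed data with two variables on very different scales
rng = np.random.default_rng(0)
idx = pd.date_range("2017-03-10 01:00:00", periods=1000,
                    freq=pd.Timedelta(seconds=20))
toy = pd.DataFrame({
    "x1": rng.normal(loc=50.0, scale=5.0, size=1000),
    "x2": rng.normal(loc=3000.0, scale=400.0, size=1000),
}, index=idx)

# 1) Flag outliers per column with the 1.5 * IQR rule (assumed rule)
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
is_outlier = (toy.lt(q1 - 1.5 * iqr, axis="columns")
              | toy.gt(q3 + 1.5 * iqr, axis="columns"))

# 2) Chronological split without shuffling (assumed 70/15/15 proportions)
n = len(toy)
train_end, val_end = n * 70 // 100, n * 85 // 100
train = toy.iloc[:train_end]
val = toy.iloc[train_end:val_end]
test = toy.iloc[val_end:]

# 3) Standardize with statistics from the training set only,
#    so no information from validation/test leaks into the scaling
mean, std = train.mean(), train.std()
train_s, val_s, test_s = [(part - mean) / std for part in (train, val, test)]

print(is_outlier.sum())
print(train_s.describe().loc[["mean", "std"]])
```

Splitting by time order rather than at random keeps later observations out of the training set, which matters for time series; the scaling statistics are computed on the training part for the same reason.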
