<a href="https://colab.research.google.com/github/codewithselva/industrial-copper-modelling/blob/main/Capstone_Industrial_Copper_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Statement:

Sales and pricing data in the copper industry is relatively simple, but it often suffers from skewness and noise, which reduce the accuracy of manual predictions. Handling these issues by hand is time-consuming and may not lead to optimal pricing decisions. A machine learning regression model can address them through data normalization, feature scaling, and outlier detection, combined with algorithms that are robust to skewed and noisy data.

The copper industry also faces challenges in capturing leads. A lead classification model evaluates and classifies leads by how likely they are to become customers. Use the STATUS variable as the target, with WON treated as Success and LOST as Failure, and remove data points whose STATUS is anything other than WON or LOST.


The solution must include the following steps:
1. Explore skewness and outliers in the dataset.
2. Transform the data into a suitable format and perform any necessary cleaning and pre-processing steps.
3. Build an ML regression model that predicts the continuous variable ‘Selling_Price’.
4. Build an ML classification model that predicts Status: WON or LOST.
5. Create a Streamlit page where you can enter each column value and get the predicted Selling_Price or Status (Won/Lost).


# About the Data:
1. `id`: This column likely serves as a unique identifier for each transaction or item, which can be useful for tracking and record-keeping.
2. `item_date`: This column represents the date when each transaction or item was recorded or occurred. It's important for tracking the timing of business activities.
3. `quantity tons`: This column indicates the quantity of the item in tons, which is essential for inventory management and understanding the volume of products sold or produced.
4. `customer`: The "customer" column refers to the name or identifier of the customer who either purchased or ordered the items. It's crucial for maintaining customer relationships and tracking sales.
5. `country`: The "country" column specifies the country associated with each customer. This information can be useful for understanding the geographic distribution of customers and may have implications for logistics and international sales.
6. `status`: The "status" column likely describes the current status of the transaction or item. This information can be used to track the progress of orders or transactions, such as "Draft" or "Won."
7. `item type`: This column categorizes the type or category of the items being sold or produced. Understanding item types is essential for inventory categorization and business reporting.
8. `application`: The "application" column defines the specific use or application of the items. This information can help tailor marketing and product development efforts.
9. `thickness`: The "thickness" column provides details about the thickness of the items. It's critical when dealing with materials where thickness is a significant factor, such as metals or construction materials.
10. `width`: The "width" column specifies the width of the items. It's important for understanding the size and dimensions of the products.
11. `material_ref`: This column appears to be a reference or identifier for the material used in the items. It's essential for tracking the source or composition of the products.
12. `product_ref`: The "product_ref" column seems to be a reference or identifier for the specific product. This information is useful for identifying and cataloging products in a standardized way.
13. `delivery date`: This column records the expected or actual delivery date for each item or transaction. It's crucial for managing logistics and ensuring timely delivery to customers.
14. `selling_price`: The "selling_price" column represents the price at which the items are sold. This is a critical factor for revenue generation and profitability analysis.

# Approach: 
1. Data Understanding: Identify the types of variables (continuous, categorical) and their distributions. `material_ref` contains rubbish values that start with ‘00000’; these should be converted to null. Treat reference columns as categorical variables. INDEX may not be useful.
2. Data Preprocessing:
   - Handle missing values with the mean, median, or mode.
   - Treat outliers using IQR or Isolation Forest from the sklearn library.
   - Identify skewness in the dataset and treat highly skewed continuous variables with an appropriate transformation, such as a log transformation, a Box-Cox transformation, or other techniques. The log transformation is best suited to the target variable: train on the transformed target, predict, and then reverse the transformation to return to the original scale (e.g. dollars).
   - Encode categorical variables using a suitable technique, such as one-hot encoding, label encoding, or ordinal encoding, based on their nature and relationship with the target variable.
3. EDA: Visualize outliers and skewness (before and after treating skewness) using Seaborn’s boxplot, distplot (deprecated in recent Seaborn releases in favour of histplot/displot), and violinplot.
4. Feature Engineering: Engineer new features if applicable, such as aggregating or transforming existing features to create more informative representations of the data. Drop highly correlated columns, identified with a Seaborn heatmap.
5. Model Building and Evaluation:
   - Split the dataset into training and testing/validation sets.
   - Train and evaluate different classification models, such as ExtraTreesClassifier, XGBClassifier, or Logistic Regression, using appropriate evaluation metrics such as accuracy, precision, recall, F1 score, and ROC AUC.
   - Optimize model hyperparameters using techniques such as cross-validation and grid search to find the best-performing model.
   - Interpret the model results and assess performance against the defined problem statement.
   - Follow the same steps for regression modelling. (Note: the dataset is noisy and its independent variables are linearly related to one another, so tree-based models are likely to perform best.)
6. Model GUI: Using the Streamlit module, create an interactive page that
   (1) takes the task as input (Regression or Classification),
   (2) provides an input field for each column value except ‘Selling_Price’ for the regression model and except ‘Status’ for the classification model, and
   (3) applies the same feature engineering, scaling, and log/other transformation steps used when training the ML model, then predicts on the new data and displays the output.
7. Tips: Use the pickle module to dump and load fitted objects such as encoders (one-hot, label, `.cat.codes`, etc.), scalers (StandardScaler), and ML models. Fit first, then transform on a separate line, and use transform only for unseen data. For example:
   - `scaler = StandardScaler()`
   - `scaler.fit(X_train)`
   - `scaler.transform(X_train)`
   - `scaler.transform(X_test_new)  # unseen data`
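
The pickle tip above can be sketched as follows. This is a minimal illustration, not the project's actual persistence code: the toy arrays and the variable names (`blob`, `restored`, `X_new`) are assumptions made for the example.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training matrix standing in for the real features (illustrative values only)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 20.0]])  # unseen row, e.g. entered in the Streamlit form

scaler = StandardScaler()
scaler.fit(X_train)                         # learn mean and std from training data only
X_train_scaled = scaler.transform(X_train)  # transform in a separate step

# Serialize the fitted scaler (pickle.dump to an open file works the same way)
blob = pickle.dumps(scaler)

# Later, in the app: restore the scaler and call transform only, never fit
restored = pickle.loads(blob)
X_new_scaled = restored.transform(X_new)
print(X_new_scaled)
```

The same dump/load round trip applies to fitted encoders and models, so the app reuses exactly the statistics learned during training.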


In [1]:
# streamlit is included because it is imported further down
!pip install pandas numpy openpyxl seaborn scikit-learn scipy matplotlib streamlit


In [None]:
#run this command to mount the google drive while using colab

# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_squared_error, accuracy_score
import streamlit as st

In [None]:
# Read the CSV file and load it into a Pandas DataFrame
csv_file_path = 'copper_data_set.csv'

df = pd.read_csv(csv_file_path)


In [None]:
# Display the first few rows of the DataFrame

df.head()


In [None]:
# Total number of records in the data set

len(df)

In [None]:
# Display the info of the DataFrame

df.info()

**Inference:**
1. Total number of records: 181673
2. The `item_date` field is stored as float64 and needs to be converted to a datetime dtype.


In [None]:
df.describe()

In [None]:
# Create a copy to avoid modifying the original DataFrame
cleaned_df = df.copy()

In [None]:
cleaned_df.head()

In [None]:
status_unique = cleaned_df['status'].unique()
status_unique

In [None]:
# Keep only Won/Lost rows; .copy() avoids SettingWithCopyWarning on later column assignments
cleaned_df = cleaned_df[cleaned_df['status'].isin(['Won', 'Lost'])].copy()

In [None]:
cleaned_df

In [None]:
cleaned_df.info()

In [None]:
# Convert 'quantity tons' to numeric; values that cannot be parsed become NaN
cleaned_df['quantity tons'] = pd.to_numeric(cleaned_df['quantity tons'], errors='coerce')
cleaned_df.sort_values(by='quantity tons', ascending=False, inplace=True)
cleaned_df.tail()

In [None]:
cleaned_df['item_date'] = pd.to_datetime(cleaned_df['item_date'], format='%Y%m%d', errors='coerce')
cleaned_df['delivery date'] = pd.to_datetime(cleaned_df['delivery date'], format='%Y%m%d', errors='coerce')
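
Because the date columns are stored as float64, a value such as `20210401.0` may not match the `%Y%m%d` format directly, depending on the pandas version. A more defensive sketch, assuming the dates are yyyymmdd numbers, routes each value through an integer string first. The sample values and the names `raw`, `as_text`, and `parsed` are made up for illustration.

```python
import pandas as pd

# Illustrative float-coded dates, including an invalid one and a missing one
raw = pd.Series([20210401.0, 20210331.0, 19950000.0, None])

# Turn 20210401.0 into '20210401'; keep missing values as None
as_text = raw.map(lambda v: str(int(v)) if pd.notna(v) else None)

# Invalid strings (month 00 here) and missing values become NaT
parsed = pd.to_datetime(as_text, format='%Y%m%d', errors='coerce')
print(parsed)
```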


In [None]:

cleaned_df.sort_values(by='item_date', ascending=False, inplace=True)
cleaned_df.tail()

In [None]:
# Some rubbish values are present in ‘Material_Reference’ which starts with ‘00000’ value which should be converted into null
cleaned_df['material_ref'] = cleaned_df['material_ref'].apply(lambda x: None if str(x).startswith('00000') else x)
# Replace commas with '@' in the material_ref column
cleaned_df['material_ref'] = cleaned_df['material_ref'].str.replace(',', '@')
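
The placeholder clean-up above can also be written without a Python-level `apply`. The sketch below uses a vectorized string test and `Series.mask`; the reference strings are invented examples, not values taken from the dataset.

```python
import pandas as pd

# Invented reference values: one '00000...' placeholder, one missing, two real-looking
refs = pd.Series(['0000000000000000000000104991', 'DX51D+Z', None, 'S0380700'])

# True where the value starts with '00000'; missing values count as False
is_placeholder = refs.str.startswith('00000', na=False)

# mask() replaces the flagged entries with NaN and leaves the rest untouched
cleaned = refs.mask(is_placeholder)
print(cleaned)
```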

In [None]:
cleaned_df['customer'] = cleaned_df['customer'].astype(pd.Int64Dtype())

In [None]:
# Cast identifier-like and label columns to the categorical dtype
cleaned_df['country'] = cleaned_df['country'].astype('category')
cleaned_df['status'] = cleaned_df['status'].astype('category')
cleaned_df['item type'] = cleaned_df['item type'].astype('category')
cleaned_df['application'] = cleaned_df['application'].astype('category')
cleaned_df['product_ref'] = cleaned_df['product_ref'].astype('category')
cleaned_df['material_ref'] = cleaned_df['material_ref'].astype('category')
cleaned_df['customer'] = cleaned_df['customer'].astype('category')
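
Once a column has the categorical dtype, one simple way to feed it to a tree-based model is to use its integer codes, as the Tips in the Approach section suggest. The following is a small sketch with made-up item types; the names `items`, `codes`, and `mapping` are illustrative only.

```python
import pandas as pd

# Made-up item types, including one missing value
items = pd.Series(['W', 'S', 'W', 'PL', None], dtype='category')

# Integer code per row; categories are sorted, and missing values get -1
codes = items.cat.codes

# Keep the code-to-label mapping so the same encoding can be applied to new input
mapping = dict(enumerate(items.cat.categories))
print(list(codes), mapping)
```

The mapping needs to be saved alongside the model so that values entered in the app are encoded the same way as the training data.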

In [None]:
cleaned_df.info()

In [None]:
# Display basic statistics

cleaned_df.describe()

In [None]:
# Handle missing values by filling numeric columns with 0. Assign the result back
# instead of calling fillna(inplace=True) on a column selection, which may not
# update cleaned_df in recent pandas versions.
for column in ['quantity tons', 'thickness', 'width', 'selling_price']:
    cleaned_df[column] = cleaned_df[column].fillna(0)
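
The Approach section suggests mean, median, or mode imputation rather than a constant. As a hedged alternative to filling with 0, the sketch below fills each numeric column with its own median; the small frame and its values are invented for the example.

```python
import numpy as np
import pandas as pd

# Invented measurements with gaps
frame = pd.DataFrame({
    'thickness': [0.5, 2.0, np.nan, 1.0],
    'width': [1000.0, np.nan, 1250.0, 1500.0],
})

# Column-wise medians, computed while ignoring the missing entries
medians = frame.median()

# fillna with a Series fills each column with its matching median
filled = frame.fillna(medians)
print(filled)
```

In a train/test workflow the medians would be computed on the training split only and then reused for the test data and for new input.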

In [None]:
cleaned_df.info()

In [None]:
skewness = cleaned_df.select_dtypes(include=['number']).skew()

In [None]:
# Display skewness for each numerical column
print("Skewness for each numerical column:")
print(skewness)
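
For columns that turn out to be strongly right-skewed, the log transformation mentioned in the Approach section can be sketched as below. The data is synthetic (a seeded log-normal sample), and `np.log1p`/`np.expm1` are one reasonable choice of transform and inverse, not necessarily the one used in the final model.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" drawn from a log-normal distribution
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=1000))

skew_before = prices.skew()

# log1p compresses the long right tail; it is defined for values >= 0
log_prices = np.log1p(prices)
skew_after = log_prices.skew()

# expm1 is the exact inverse, used to bring predictions back to the original scale
restored = np.expm1(log_prices)

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

When the target is transformed this way, the model is trained on `log_prices` and its predictions are passed through `np.expm1` before being reported.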

In [None]:
# Flag values more than 3 standard deviations from the column mean as outliers
numeric_df = cleaned_df.select_dtypes(include=['number'])
outliers = (numeric_df - numeric_df.mean()).abs() > 3 * numeric_df.std()

In [None]:
outliers
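
The cell above uses a 3-standard-deviation rule. The IQR method named in the Approach section is less sensitive to the extreme values it is trying to detect. Here is a minimal sketch on invented numbers, using the conventional 1.5 × IQR fences.

```python
import pandas as pd

# Invented quantities with one obvious extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 300], name='quantity tons')

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)

# One treatment option: clip to the fences instead of dropping rows
clipped = values.clip(lower, upper)
print(is_outlier.sum(), clipped.max())
```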

In [None]:
# Data preprocessing
# 'selling_price' is the target variable for regression
# 'status' is the target variable for classification

In [None]:
# Regression
regression_features = cleaned_df.drop(['selling_price', 'status','id','item_date','delivery date','material_ref', 'customer','item type','product_ref'], axis=1)
regression_target = cleaned_df['selling_price']

In [None]:
regression_features

In [None]:
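# Classification
# 'status' is the target, so it must also be dropped from the features to avoid leaking the label
classification_features = cleaned_df.drop(['selling_price', 'status', 'id', 'item_date', 'delivery date', 'material_ref', 'customer', 'item type', 'product_ref'], axis=1)
classification_target = cleaned_df['status']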

In [None]:


# Train-test split
regression_X_train, regression_X_test, regression_y_train, regression_y_test = train_test_split(
    regression_features, regression_target, test_size=0.2, random_state=42
)

classification_X_train, classification_X_test, classification_y_train, classification_y_test = train_test_split(
    classification_features, classification_target, test_size=0.2, random_state=42
)
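
With a train/test split in place, the hyperparameter search described in step 5 of the Approach could look like the sketch below. It runs on synthetic data from `make_regression`, and the parameter grid is a hypothetical, deliberately small one chosen so the example runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the copper features and selling price
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical grid; real searches usually cover more values
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,                                   # 3-fold cross-validation on the training set
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X_train, y_train)

best_params = search.best_params_
# score() returns the negated RMSE of the refitted best model, so flip the sign
test_rmse = -search.score(X_test, y_test)
print(best_params, test_rmse)
```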


In [None]:
regression_X_test.describe()

In [None]:


# Data normalization and feature scaling
scaler = StandardScaler()
regression_X_train_scaled = scaler.fit_transform(regression_X_train)
regression_X_test_scaled = scaler.transform(regression_X_test)

In [None]:


# NOTE: the Streamlit calls below only render when this code is run as a script
# with `streamlit run`, not from inside a notebook cell.

# Regression model
regression_model = RandomForestRegressor()
regression_model.fit(regression_X_train_scaled, regression_y_train)
regression_predictions = regression_model.predict(regression_X_test_scaled)
regression_rmse = np.sqrt(mean_squared_error(regression_y_test, regression_predictions))

# Classification model
classification_model = RandomForestClassifier()
classification_model.fit(classification_X_train, classification_y_train)
classification_predictions = classification_model.predict(classification_X_test)
classification_accuracy = accuracy_score(classification_y_test, classification_predictions)

# Streamlit App
st.title("Copper Industry ML Application")

# Sidebar for user input: one numeric field per feature column used in training
st.sidebar.title("Insert Column Values")
feature_columns = list(regression_X_train.columns)
user_input = {}
for column in feature_columns:
    # Default each field to the training median so the form starts with a plausible value
    default_value = float(regression_X_train[column].astype(float).median())
    user_input[column] = st.sidebar.number_input(f"Enter {column}", value=default_value)

# Build one-row inputs. Neither target ('selling_price', 'status') is an input;
# the classifier is expected to use the same feature columns as the regressor.
regression_input = pd.DataFrame([user_input], columns=feature_columns)
classification_input = pd.DataFrame([user_input], columns=feature_columns)

# Scaling and prediction for regression (transform only, the scaler is already fitted)
regression_input_scaled = scaler.transform(regression_input)
predicted_selling_price = regression_model.predict(regression_input_scaled)

# Prediction for classification
predicted_status = classification_model.predict(classification_input)[0]

# Display predictions
st.header("Regression Prediction (Selling Price)")
st.write(f"The predicted Selling Price is: {predicted_selling_price[0]}")

st.header("Classification Prediction (Status)")
st.write(f"The predicted Status is: {predicted_status}")

# Display model evaluation metrics
st.header("Model Evaluation Metrics")
st.write(f"Regression RMSE: {regression_rmse}")
st.write(f"Classification Accuracy: {classification_accuracy}")
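
Accuracy alone can hide poor performance when Won and Lost are imbalanced, which is why the Approach section also lists precision, recall, F1 score, and AUC. The sketch below computes them on a handful of invented labels and scores, with 1 standing for Won and 0 for Lost; in the notebook the scores would come from the classifier's `predict_proba`.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Invented ground truth, hard predictions, and predicted probabilities of "Won"
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 1])
y_score = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.95])

precision = precision_score(y_true, y_pred)  # of predicted Won, how many were Won
recall = recall_score(y_true, y_pred)        # of actual Won, how many were found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)         # ranking quality of the probabilities

print(precision, recall, f1, auc)
```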