<a href="https://colab.research.google.com/github/codewithselva/industrial-copper-modelling/blob/main/Capstone_Industrial_Copper_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **About the Data:**
1. `id`: This column likely serves as a unique identifier for each transaction or item, which can be useful for tracking and record-keeping.
2. `item_date`: This column represents the date when each transaction or item was recorded or occurred. It's important for tracking the timing of business activities.
3. `quantity tons`: This column indicates the quantity of the item in tons, which is essential for inventory management and understanding the volume of products sold or produced.
4. `customer`: The "customer" column refers to the name or identifier of the customer who either purchased or ordered the items. It's crucial for maintaining customer relationships and tracking sales.
5. `country`: The "country" column specifies the country associated with each customer. This information can be useful for understanding the geographic distribution of customers and may have implications for logistics and international sales.
6. `status`: The "status" column likely describes the current status of the transaction or item. This information can be used to track the progress of orders or transactions, such as "Draft" or "Won."
7. `item type`: This column categorizes the type or category of the items being sold or produced. Understanding item types is essential for inventory categorization and business reporting.
8. `application`: The "application" column defines the specific use or application of the items. This information can help tailor marketing and product development efforts.
9. `thickness`: The "thickness" column provides details about the thickness of the items. It's critical when dealing with materials where thickness is a significant factor, such as metals or construction materials.
10. `width`: The "width" column specifies the width of the items. It's important for understanding the size and dimensions of the products.
11. `material_ref`: This column appears to be a reference or identifier for the material used in the items. It's essential for tracking the source or composition of the products.
12. `product_ref`: The "product_ref" column seems to be a reference or identifier for the specific product. This information is useful for identifying and cataloging products in a standardized way.
13. `delivery date`: This column records the expected or actual delivery date for each item or transaction. It's crucial for managing logistics and ensuring timely delivery to customers.
14. `selling_price`: The "selling_price" column represents the price at which the items are sold. This is a critical factor for revenue generation and profitability analysis.

**Approach:**
1. Data Understanding: Identify the variable types (continuous, categorical) and their distributions. Some `material_ref` values start with '00000'; these are rubbish values and should be converted to null. Treat reference columns as categorical variables. The index column may not be useful.
2. Data Preprocessing:
   - Handle missing values with the mean, median, or mode.
   - Treat outliers using the IQR or Isolation Forest from the sklearn library.
   - Identify skewness in the data set and treat it with an appropriate transformation, such as a log or Box-Cox transformation, to handle high skewness in continuous variables. A log transformation suits the target variable best: train and predict on the log scale, then reverse the transformation to return to the original scale (e.g. dollars).
   - Encode categorical variables using a suitable technique, such as one-hot, label, or ordinal encoding, based on their nature and their relationship with the target variable.
3. EDA: Visualize outliers and skewness (before and after treating skewness) using Seaborn's boxplot, distplot, and violinplot. Note that distplot is deprecated in recent Seaborn releases in favour of histplot and displot.
4. Feature Engineering: Engineer new features where applicable, for example by aggregating or transforming existing features into more informative representations of the data. Drop highly correlated columns, identified with a Seaborn heatmap.
5. Model Building and Evaluation:
   - Split the data set into training and testing/validation sets.
   - Train and evaluate different classification models, such as ExtraTreesClassifier, XGBClassifier, or Logistic Regression, using appropriate evaluation metrics such as accuracy, precision, recall, F1 score, and ROC AUC.
   - Optimize model hyperparameters using techniques such as cross-validation and grid search to find the best-performing model.
   - Interpret the model results and assess performance against the defined problem statement.
   - Follow the same steps for regression modelling. (Note: the data set is noisy and its independent variables are linearly related to one another, so it is expected to perform well only with tree-based models.)
6. Model GUI: Using the streamlit module, create an interactive page with
   (1) a task input (Regression or Classification),
   (2) an input field for each column value except `selling_price` for the regression model and except `status` for the classification model, and
   (3) the same feature engineering, scaling, and log (or other) transformation steps that were used to train the ML model, applied to the new data before predicting from streamlit and displaying the output.
7. Tips: Use the pickle module to dump and load fitted objects such as encoders (one-hot, label, `str.cat.codes`, etc.), scalers (StandardScaler), and ML models. Call fit and transform on separate lines, and use only transform for unseen data.
   E.g.:

       scaler = StandardScaler()
       scaler.fit(X_train)
       X_train_scaled = scaler.transform(X_train)
       X_test_scaled = scaler.transform(X_test_new)  # unseen data: transform only
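
As a rough sketch of the skewness handling described in step 2, the snippet below log-transforms a right-skewed target and then maps it back to the original scale. The price values are made up for illustration, and the `np.log1p`/`np.expm1` pair is only one reasonable choice of transform, not necessarily the one this notebook ends up using.

```python
import numpy as np
import pandas as pd

# Hypothetical, right-skewed prices (not taken from the copper data set)
prices = pd.Series([100.0, 120.0, 150.0, 200.0, 300.0,
                    500.0, 900.0, 2000.0, 5000.0, 12000.0])

# log1p compresses the long right tail; expm1 is its inverse
log_prices = np.log1p(prices)
restored = np.expm1(log_prices)

# Compare sample skewness before and after the transformation
print("skew before:", prices.skew())
print("skew after: ", log_prices.skew())
```

In a modelling workflow the model would be trained on `log_prices`, and its predictions passed through `np.expm1` before being reported.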


In [5]:
!pip install pandas
!pip install numpy
!pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.1.2-py2.py3-none-any.whl.metadata (2.5 kB)
Collecting et-xmlfile (from openpyxl)
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Downloading openpyxl-3.1.2-py2.py3-none-any.whl (249 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.2


In [2]:
#import gdown
import pandas as pd
#import chardet
import numpy as np



In [None]:
# Google Drive file ID
#file_id = "18eR6DBe5TMWU9FnIewaGtsepDbV4BOyr"

# URL of the file on Google Drive
#url = f'https://drive.google.com/uc?id={file_id}'

# Destination file path to save the downloaded file
#output_path = '/content/data.csv'

# Download the file
#gdown.download(url, output_path, quiet=False)

In [3]:
excel_file_path = 'copper_data_set.xlsx'

# Specify the sheet name or index
sheet_name_or_index = 'Result 1'  # or use 0 for the first sheet

# Read the Excel file with a specific sheet into a Pandas DataFrame
df = pd.read_excel(excel_file_path, sheet_name=sheet_name_or_index)


In [4]:
# Display the first few rows of the DataFrame

df.head()


Unnamed: 0,id,item_date,quantity tons,customer,country,status,item type,application,thickness,width,material_ref,product_ref,delivery date,selling_price
0,EC06F063-9DF0-440C-8764-0B0C05A4F6AE,20210401.0,54.151139,30156308.0,28.0,Won,W,10.0,2.0,1500.0,DEQ1 S460MC,1670798778,20210701.0,854.0
1,4E5F4B3D-DDDF-499D-AFDE-A3227EC49425,20210401.0,768.024839,30202938.0,25.0,Won,W,41.0,0.8,1210.0,0000000000000000000000000000000000104991,1668701718,20210401.0,1047.0
2,E140FF1B-2407-4C02-A0DD-780A093B1158,20210401.0,386.127949,30153963.0,30.0,Won,WI,28.0,0.38,952.0,S0380700,628377,20210101.0,644.33
3,F8D507A0-9C62-4EFE-831E-33E1DA53BB50,20210401.0,202.411065,30349574.0,32.0,Won,S,59.0,2.3,1317.0,DX51D+ZM310MAO 2.3X1317,1668701718,20210101.0,768.0
4,4E1C4E78-152B-430A-8094-ADD889C9D0AD,20210401.0,785.526262,30211560.0,28.0,Won,W,10.0,4.0,2000.0,2_S275JR+AR-CL1,640665,20210301.0,577.0


In [5]:
# Total number of records in the data set

len(df)

181673

In [6]:
# Display the info of the DataFrame

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181673 entries, 0 to 181672
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   id             181671 non-null  object 
 1   item_date      181672 non-null  float64
 2   quantity tons  181673 non-null  object 
 3   customer       181672 non-null  float64
 4   country        181645 non-null  float64
 5   status         181671 non-null  object 
 6   item type      181673 non-null  object 
 7   application    181649 non-null  float64
 8   thickness      181672 non-null  float64
 9   width          181673 non-null  float64
 10  material_ref   103754 non-null  object 
 11  product_ref    181673 non-null  int64  
 12  delivery date  181672 non-null  float64
 13  selling_price  181672 non-null  float64
dtypes: float64(8), int64(1), object(5)
memory usage: 19.4+ MB


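**Inference:**
1. Total number of records: 181673.
2. `item_date` and `delivery date` are stored as float64 and need to be converted to a datetime dtype.
3. `quantity tons` is stored as object although it should be numeric, and `material_ref` is non-null in only 103754 of the 181673 rows.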
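
`df.info()` reports `quantity tons` as `object`, which usually means at least one entry is not a plain number. The sketch below shows one way to locate such entries on a toy frame; the values, including the stray `"e"`, are invented for illustration and are not claimed to match the real data.

```python
import pandas as pd

# Toy frame that mimics an object-typed quantity column (illustrative values)
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "quantity tons": [54.151139, "e", 386.127949],
})

# Coerce to numeric; anything unparseable becomes NaN
as_number = pd.to_numeric(toy["quantity tons"], errors="coerce")

# Rows whose original entry was present but not numeric
bad_rows = toy[as_number.isna() & toy["quantity tons"].notna()]
print(bad_rows)
```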


In [7]:
# Create a copy to avoid modifying the original DataFrame
cleaned_df = df.copy()

In [8]:
cleaned_df.head()

Unnamed: 0,id,item_date,quantity tons,customer,country,status,item type,application,thickness,width,material_ref,product_ref,delivery date,selling_price
0,EC06F063-9DF0-440C-8764-0B0C05A4F6AE,20210401.0,54.151139,30156308.0,28.0,Won,W,10.0,2.0,1500.0,DEQ1 S460MC,1670798778,20210701.0,854.0
1,4E5F4B3D-DDDF-499D-AFDE-A3227EC49425,20210401.0,768.024839,30202938.0,25.0,Won,W,41.0,0.8,1210.0,0000000000000000000000000000000000104991,1668701718,20210401.0,1047.0
2,E140FF1B-2407-4C02-A0DD-780A093B1158,20210401.0,386.127949,30153963.0,30.0,Won,WI,28.0,0.38,952.0,S0380700,628377,20210101.0,644.33
3,F8D507A0-9C62-4EFE-831E-33E1DA53BB50,20210401.0,202.411065,30349574.0,32.0,Won,S,59.0,2.3,1317.0,DX51D+ZM310MAO 2.3X1317,1668701718,20210101.0,768.0
4,4E1C4E78-152B-430A-8094-ADD889C9D0AD,20210401.0,785.526262,30211560.0,28.0,Won,W,10.0,4.0,2000.0,2_S275JR+AR-CL1,640665,20210301.0,577.0


In [14]:
cleaned_df['item_date'] = pd.to_datetime(cleaned_df['item_date'], format='%Y%m%d', errors='coerce')
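
To make the effect of `errors='coerce'` concrete, here is a small sketch on made-up values. It first renders each float as an 8-digit string, which is an extra, defensive step that the cell above does not take, so that the input matches the `%Y%m%d` format regardless of how a given pandas version handles floats.

```python
import pandas as pd

# Illustrative values: two valid dates, one impossible date, one missing entry
raw = pd.Series([20210401.0, 20200702.0, 19950000.0, None])

# Render each float as an 8-digit string so it matches the '%Y%m%d' format
as_text = raw.map(lambda v: None if pd.isna(v) else str(int(v)))

# Values that cannot be parsed are coerced to NaT instead of raising
parsed = pd.to_datetime(as_text, format="%Y%m%d", errors="coerce")

print(parsed)
print("unparseable or missing:", parsed.isna().sum())
```

Counting `NaT` values after the conversion is a quick way to see how many rows carried an invalid or missing date.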


In [17]:

cleaned_df.sort_values(by='item_date', ascending=False, inplace=True)
cleaned_df.tail()

Unnamed: 0,id,item_date,quantity tons,customer,country,status,item type,application,thickness,width,material_ref,product_ref,delivery date,selling_price
181671,7AFFD323-01D9-4E15-B80D-7D1B03498FC8,2020-07-02,-2000.0,30200854.0,25.0,Won,W,41.0,0.85,1250.0,0000000000000000000000000000000001001149,164141591,20200701.0,601.0
181672,AD0CA853-AE3C-4B2F-9FBB-8B0B965F84BC,2020-07-02,406.686538,30200854.0,25.0,Won,W,41.0,0.71,1240.0,0000000000000000000000000000000001005439,164141591,20200701.0,607.0
52,175B56C3-CDF1-4BD4-BC83-C1BF1FEAD8B8,NaT,27.743221,30162161.0,77.0,Won,S,4.0,1.1,1300.0,DX51D+Z100 MA,164141591,20210601.0,1046.0
104640,1BA92915-36FC-437A-811C-9DC7BF958EA6,NaT,51.785585,30230331.0,80.0,Lost,S,10.0,0.9,1435.0,,628377,20210101.0,654.0
105485,40203729-1A96-481E-9B71-3FF672C27F0B,NaT,101.742899,30210087.0,26.0,Lost,S,42.0,3.0,1494.0,,1668701718,20210201.0,795.0


In [19]:
# Uniform date format

cleaned_df['delivery date'] = pd.to_datetime(cleaned_df['delivery date'], format='%Y%m%d', errors='coerce')

In [22]:
cleaned_df.sort_values(by='delivery date', ascending=False, inplace=True)
cleaned_df.head()

Unnamed: 0,id,item_date,quantity tons,customer,country,status,item type,application,thickness,width,material_ref,product_ref,delivery date,selling_price
2519,86A827B7-C2FD-47D8-8D64-342ABE448939,2021-03-29,1075.116415,30344971.0,84.0,Lost,W,38.0,12.0,1710.0,,640665,2022-01-01,1001.0
3233,230684A7-6716-4B08-9AD7-1DD13D0066A5,2021-03-29,1079.945988,30196886.0,84.0,Lost,W,10.0,16.0,975.0,,640665,2022-01-01,1015.0
2515,78D13D53-A217-4FC9-A54D-D6433B2262E0,2021-03-29,1076.477597,30344971.0,84.0,Lost,W,38.0,16.0,1255.0,,640665,2022-01-01,1004.0
2523,9B251DFD-3C16-42CB-8A1A-7276373A59BD,2021-03-29,1084.605598,30344971.0,84.0,Lost,W,38.0,16.0,975.0,,640665,2022-01-01,1015.0
3229,94C21F40-F46E-418E-80AA-C85818D6C0CB,2021-03-29,1080.035109,30196886.0,84.0,Lost,W,10.0,16.0,1255.0,,640665,2022-01-01,1003.0


In [None]:
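# Handling missing values
# Note: dropna() without `subset` drops every row that has a NaN in any column,
# including rows where only material_ref is missing (it is non-null in 103754 of 181673 rows).
cleaned_df.dropna(inplace=True)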
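
Because `material_ref` is sparsely populated, a blanket `dropna()` can discard far more rows than intended. The sketch below contrasts it with `dropna(subset=...)` on a toy frame; the choice of "required" columns is an assumption made for illustration only.

```python
import pandas as pd

# Toy frame: 'material_ref' is sparse, the other columns are nearly complete
toy = pd.DataFrame({
    "selling_price": [854.0, 1047.0, None, 768.0],
    "status": ["Won", "Won", "Lost", None],
    "material_ref": ["DEQ1 S460MC", None, None, None],
})

# Dropping on every column keeps only fully populated rows
strict = toy.dropna()

# Dropping only where required fields are missing keeps more data
required = ["selling_price", "status"]  # assumed choice of required columns
lenient = toy.dropna(subset=required)

print(len(toy), len(strict), len(lenient))
```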

In [23]:
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 181673 entries, 2519 to 104761
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   id             181671 non-null  object        
 1   item_date      181670 non-null  datetime64[ns]
 2   quantity tons  181673 non-null  object        
 3   customer       181672 non-null  float64       
 4   country        181645 non-null  float64       
 5   status         181671 non-null  object        
 6   item type      181673 non-null  object        
 7   application    181649 non-null  float64       
 8   thickness      181672 non-null  float64       
 9   width          181673 non-null  float64       
 10  material_ref   103754 non-null  object        
 11  product_ref    181673 non-null  int64         
 12  delivery date  181670 non-null  datetime64[ns]
 13  selling_price  181672 non-null  float64       
dtypes: datetime64[ns](2), float64(6), int64(1), object(5)


In [24]:
# Handling duplicate rows
cleaned_df.drop_duplicates(inplace=True)

In [25]:
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 181673 entries, 2519 to 104761
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   id             181671 non-null  object        
 1   item_date      181670 non-null  datetime64[ns]
 2   quantity tons  181673 non-null  object        
 3   customer       181672 non-null  float64       
 4   country        181645 non-null  float64       
 5   status         181671 non-null  object        
 6   item type      181673 non-null  object        
 7   application    181649 non-null  float64       
 8   thickness      181672 non-null  float64       
 9   width          181673 non-null  float64       
 10  material_ref   103754 non-null  object        
 11  product_ref    181673 non-null  int64         
 12  delivery date  181670 non-null  datetime64[ns]
 13  selling_price  181672 non-null  float64       
dtypes: datetime64[ns](2), float64(6), int64(1), object(5)


In [26]:
# Inspect rows with negative selling prices
cleaned_df1 = cleaned_df[cleaned_df['selling_price'] < 0]

In [28]:
cleaned_df1

Unnamed: 0,id,item_date,quantity tons,customer,country,status,item type,application,thickness,width,material_ref,product_ref,delivery date,selling_price
44865,87F20C79-CE1E-4325-BBA8-1C6DE4657084,2021-02-03,28.368563,30217604.0,27.0,Not lost for AM,PL,10.0,1.5,1270.0,BOB,164141591,2021-05-01,-25.0
28,BEC18863-E965-478B-9861-A49A77F26655,2021-04-01,99.059199,30153510.0,30.0,Won,W,41.0,0.595,1207.0,GOO1208X595SP,611993,2021-04-01,-1160.0
44761,947C725B-85ED-4817-B4F8-27720314F9E6,2021-02-04,101.397995,30198657.0,32.0,Won,W,41.0,1.25,1100.0,,1721130331,2021-04-01,-730.0
44810,35C64267-229F-438E-9A3B-91A6A41DACE2,2021-02-03,12.225889,30157111.0,78.0,Won,W,41.0,0.75,1250.0,,164141591,2021-04-01,-445.0
105189,8CA4D51F-DF96-4B88-805D-3937CCFDA810,2020-11-12,5.280274,30209814.0,25.0,Won,W,15.0,6.0,1250.0,,1671863738,2021-02-01,-336.0
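
The cell above only lists the negative prices; it does not change them. Two possible treatments are sketched below on illustrative values. Which one is appropriate (or whether these rows are legitimate credits) is an open assumption, not something the data confirms.

```python
import numpy as np
import pandas as pd

# Illustrative prices, including negative entries like those listed above
toy = pd.DataFrame({"selling_price": [854.0, -25.0, 1047.0, -1160.0, 644.33]})

# Option 1 (assumption): treat non-positive prices as missing
masked = toy["selling_price"].where(toy["selling_price"] > 0, np.nan)

# Option 2 (assumption): drop those rows entirely
dropped = toy[toy["selling_price"] > 0]

print(masked.isna().sum(), len(dropped))
```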


In [None]:
# Handling outliers: keep rows within the 1st-99th percentile of quantity and thickness.
# 'quantity tons' is stored as object, so convert it to numeric before taking quantiles.
cleaned_df['quantity tons'] = pd.to_numeric(cleaned_df['quantity tons'], errors='coerce')
qty, thk = cleaned_df['quantity tons'], cleaned_df['thickness']
cleaned_df = cleaned_df[qty.between(qty.quantile(0.01), qty.quantile(0.99)) &
                        thk.between(thk.quantile(0.01), thk.quantile(0.99))]
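
The cell above trims by percentiles, while the approach notes also mention the IQR rule. A minimal sketch of that alternative follows; the helper name `iqr_bounds`, the multiplier `k=1.5`, and the sample values are all illustrative assumptions.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative thickness values with one extreme entry
thickness = pd.Series([0.38, 0.8, 2.0, 2.3, 4.0, 400.0])
low, high = iqr_bounds(thickness)

# Either drop values outside the fences or clip them to the fences
kept = thickness[thickness.between(low, high)]
clipped = thickness.clip(lower=low, upper=high)

print(low, high, len(kept))
```

Dropping loses rows, whereas clipping keeps every row but caps the extreme values; the better option depends on whether the extremes are errors or genuine large orders.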

In [None]:
cleaned_df.head()

In [None]:
cleaned_df.info()

In [None]:
# Checking for consistency in categorization
# Assuming 'status', 'item type', 'application' are categorical columns
cleaned_df['status'] = cleaned_df['status'].astype('category')
cleaned_df['item type'] = cleaned_df['item type'].astype('category')
cleaned_df['application'] = cleaned_df['application'].astype('category')
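
For readers unfamiliar with the `category` dtype, the sketch below shows what the conversion stores and how `.cat.codes` (mentioned in the tips at the top) exposes integer codes. The status values are a small illustrative sample.

```python
import pandas as pd

# Illustrative status values, including one missing entry
status = pd.Series(["Won", "Lost", "Won", "Draft", None]).astype("category")

# Categories are stored once; each row holds a small integer code
categories = status.cat.categories.tolist()
codes = status.cat.codes.tolist()

print(categories)
print(codes)
```

Missing values receive the code `-1`, so they need to be handled before the codes are fed to a model.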

In [None]:
cleaned_df.info()

In [None]:
cleaned_df.head()

In [None]:
# Some rubbish values in 'material_ref' start with '00000'; convert them to null
cleaned_df['material_ref'] = cleaned_df['material_ref'].apply(lambda x: None if str(x).startswith('00000') else x)
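
A vectorised variant of the same clean-up is sketched below on a few sample references. It is offered as an alternative under the assumption that every value starting with '00000' is a placeholder.

```python
import pandas as pd

refs = pd.Series([
    "DEQ1 S460MC",
    "0000000000000000000000000000000000104991",
    "S0380700",
    None,
])

# na=False makes missing entries count as "not a placeholder"
is_placeholder = refs.str.startswith("00000", na=False)

# mask() replaces flagged entries with NaN and leaves the rest untouched
cleaned = refs.mask(is_placeholder)

print(cleaned)
```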


In [None]:
cleaned_df.head()

In [None]:
# material_ref - is a category
cleaned_df['material_ref'] = cleaned_df['material_ref'].astype('category')

In [None]:
cleaned_df.info()