# Linear Discriminant Analysis

### Objective

Train a Linear Discriminant Analysis (LDA) model to predict whether a product has been shipped or canceled.

### Problem Statement

XYZ.com is an e-commerce company based in Argentina. Due to the COVID crisis and lockdown, XYZ.com is facing many issues with its dealers and shipment team. XYZ.com has product data that records various shipping and sales details for each product. To reduce customer escalations, XYZ.com wants to find out which products have been shipped and which have been canceled. As data scientists, we have to train an LDA (Linear Discriminant Analysis) model to predict which products have been shipped and which have been canceled.

### 1. Import necessary libraries.

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix,roc_curve,accuracy_score,auc,roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

### 2. Display a sample of five rows of the data frame

In [7]:
df = pd.read_csv("C:/Users/ASUS-NB/Downloads/LDA/sales_data_sample.csv",encoding='unicode_escape')
print(df.sample(n=5))

      ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER    SALES  \
1408        10311               25      66.99                2  1674.75   
1324        10207               47     100.00               16  6658.02   
1957        10273               48      83.02                3  3984.96   
1752        10198               48      67.82                5  3255.36   
2780        10222               31      45.69                7  1416.39   

            ORDERDATE   STATUS  QTR_ID  MONTH_ID  YEAR_ID  ...  \
1408  10/16/2004 0:00  Shipped       4        10     2004  ...   
1324   12/9/2003 0:00  Shipped       4        12     2003  ...   
1957   7/21/2004 0:00  Shipped       3         7     2004  ...   
1752  11/27/2003 0:00  Shipped       4        11     2003  ...   
2780   2/19/2004 0:00  Shipped       1         2     2004  ...   

                                    ADDRESSLINE1  ADDRESSLINE2         CITY  \
1408                          C/ Moralzarzal, 86           NaN       Madr

### 3. Check the shape of the data (number of rows and columns). Check the general information about the dataframe using the .info() method.

In [9]:
print(df.shape)
df.info()  # info() prints its report directly and returns None, so no print() is needed

(2823, 25)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ORDERNUMBER       2823 non-null   int64  
 1   QUANTITYORDERED   2823 non-null   int64  
 2   PRICEEACH         2823 non-null   float64
 3   ORDERLINENUMBER   2823 non-null   int64  
 4   SALES             2823 non-null   float64
 5   ORDERDATE         2823 non-null   object 
 6   STATUS            2823 non-null   object 
 7   QTR_ID            2823 non-null   int64  
 8   MONTH_ID          2823 non-null   int64  
 9   YEAR_ID           2823 non-null   int64  
 10  PRODUCTLINE       2823 non-null   object 
 11  MSRP              2823 non-null   int64  
 12  PRODUCTCODE       2823 non-null   object 
 13  CUSTOMERNAME      2823 non-null   object 
 14  PHONE             2823 non-null   object 
 15  ADDRESSLINE1      2823 non-null   object 
 16  ADDRESSLINE2      302 non-null 

### 4. Check the percentage of missing values in each column of the data frame.

In [10]:
missing_percentage = df.isnull().mean() * 100

# Display the result
print("Percentage of missing values in each column:")
print(missing_percentage)

Percentage of missing values in each column:
ORDERNUMBER          0.000000
QUANTITYORDERED      0.000000
PRICEEACH            0.000000
ORDERLINENUMBER      0.000000
SALES                0.000000
ORDERDATE            0.000000
STATUS               0.000000
QTR_ID               0.000000
MONTH_ID             0.000000
YEAR_ID              0.000000
PRODUCTLINE          0.000000
MSRP                 0.000000
PRODUCTCODE          0.000000
CUSTOMERNAME         0.000000
PHONE                0.000000
ADDRESSLINE1         0.000000
ADDRESSLINE2        89.302161
CITY                 0.000000
STATE               52.639036
POSTALCODE           2.692171
COUNTRY              0.000000
TERRITORY           38.044633
CONTACTLASTNAME      0.000000
CONTACTFIRSTNAME     0.000000
DEALSIZE             0.000000
dtype: float64
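
To focus on the columns that actually need attention, the percentages can be filtered and sorted. The following is a minimal sketch on a small hypothetical frame (`toy` and its values are illustrative assumptions, not the sales dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the sales data
toy = pd.DataFrame({
    "CITY": ["Madrid", "Paris", "NYC", "Lyon"],
    "STATE": [np.nan, np.nan, "NY", np.nan],
    "POSTALCODE": ["28034", None, "10022", "69004"],
})

missing_pct = toy.isnull().mean() * 100

# Keep only columns with at least one missing value, largest share first
missing_only = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_only)
```

Applied to the real data frame, the same filter would leave only ADDRESSLINE2, STATE, TERRITORY and POSTALCODE, in that order.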


### 5. Check if there are any duplicate rows.

In [11]:
# Check for duplicates
duplicates = df[df.duplicated()]

# Display the duplicates
print("Duplicate rows:")
print(duplicates)

Duplicate rows:
Empty DataFrame
Columns: [ORDERNUMBER, QUANTITYORDERED, PRICEEACH, ORDERLINENUMBER, SALES, ORDERDATE, STATUS, QTR_ID, MONTH_ID, YEAR_ID, PRODUCTLINE, MSRP, PRODUCTCODE, CUSTOMERNAME, PHONE, ADDRESSLINE1, ADDRESSLINE2, CITY, STATE, POSTALCODE, COUNTRY, TERRITORY, CONTACTLASTNAME, CONTACTFIRSTNAME, DEALSIZE]
Index: []

[0 rows x 25 columns]


### 6. Write a function that will impute missing values of the columns “STATE”, “POSTALCODE”, “TERRITORY” with their most frequent label.

In [12]:
def impute_most_occuring_label(df, columns):
    """
    Impute missing values in specified columns with the most occurring label.

    Parameters:
    - df: DataFrame
    - columns: List of column names to impute

    Returns:
    - df: Updated DataFrame with imputed values
    """

    for column in columns:
        most_occuring_label = df[column].mode().iloc[0]  # Get the most occurring label
        # Assign the result back instead of calling fillna(inplace=True) on a
        # column selection, which newer pandas versions warn about
        df[column] = df[column].fillna(most_occuring_label)

    return df

columns_to_impute = ["STATE", "POSTALCODE", "TERRITORY"]

# Call the function to impute missing values
df = impute_most_occuring_label(df, columns_to_impute)
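
The same mode-based imputation can also be written without an explicit loop. This is only a sketch on a hypothetical toy frame (`toy` and `cols` are illustrative names), relying on `DataFrame.fillna` accepting a Series of per-column fill values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two categorical columns
toy = pd.DataFrame({
    "STATE": ["CA", np.nan, "CA", "NY"],
    "TERRITORY": ["EMEA", "EMEA", np.nan, "APAC"],
})
cols = ["STATE", "TERRITORY"]

# mode() returns one row per tied mode; iloc[0] takes the first mode of each column
most_frequent = toy[cols].mode().iloc[0]

# fillna with a Series fills each column with the value stored under its own label
toy[cols] = toy[cols].fillna(most_frequent)
print(toy)
```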

### 7. Drop the “ADDRESSLINE2”, “ORDERDATE”, and “PHONE” columns.

In [13]:
columns_to_drop = ["ADDRESSLINE2", "ORDERDATE", "PHONE"]

# Drop the specified columns
df.drop(columns=columns_to_drop, inplace=True)

### 8. Convert the labels of the STATUS column to 0 and 1. Assign 1 to Shipped and 0 to all other labels (i.e. ‘Cancelled’, ‘Resolved’, ‘On Hold’, ‘In Process’, ‘Disputed’). Note that we treat everything apart from Shipped as canceled (i.e. 0).


In [15]:
status_mapping = {'Shipped': 1, 'Cancelled': 0, 'Resolved': 0, 'On Hold': 0, 'In Process': 0, 'Disputed': 0}

# Convert the labels of the STATUS column
df['STATUS'] = df['STATUS'].map(status_mapping)  # map avoids the downcasting warning that replace raises in recent pandas

# Display the updated DataFrame
print(df)

      ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER    SALES  \
0           10107               30      95.70                2  2871.00   
1           10121               34      81.35                5  2765.90   
2           10134               41      94.74                2  3884.34   
3           10145               45      83.26                6  3746.70   
4           10159               49     100.00               14  5205.27   
...           ...              ...        ...              ...      ...   
2818        10350               20     100.00               15  2244.40   
2819        10373               29     100.00                1  3978.51   
2820        10386               43     100.00                4  5417.57   
2821        10397               34      62.24                1  2116.16   
2822        10414               47      65.52                9  3079.44   

      STATUS  QTR_ID  MONTH_ID  YEAR_ID  PRODUCTLINE  ...  \
0          1       1         2     200
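
Since every label other than Shipped becomes 0, the encoding can also be expressed as a boolean comparison, and it is worth checking the resulting class balance before modelling. A minimal sketch, using a hypothetical `status` Series rather than the real column:

```python
import pandas as pd

# Hypothetical sample of STATUS labels (not the real data)
status = pd.Series([
    "Shipped", "Shipped", "Cancelled", "Shipped",
    "On Hold", "Shipped", "Shipped", "Shipped",
])

# 1 for Shipped, 0 for everything else
encoded = (status == "Shipped").astype(int)

# Share of each class; a heavily skewed split signals class imbalance
balance = encoded.value_counts(normalize=True)
print(balance)
```

On the real data, a large share for class 1 would indicate that plain accuracy can look good even when the canceled class is rarely predicted.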

### 9. Encode the categorical features using dummy encoding

In [16]:
df = pd.get_dummies(df,drop_first=True)
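
To illustrate what `drop_first=True` does, here is a small sketch on a hypothetical frame (`toy` is an assumption for illustration). Each categorical column becomes one indicator column per category, minus the first category in sorted order:

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column
toy = pd.DataFrame({
    "SALES": [2871.0, 2765.9, 3884.34],
    "DEALSIZE": ["Small", "Medium", "Large"],
})

# Categories are sorted (Large, Medium, Small) and the first one is dropped
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
# ['SALES', 'DEALSIZE_Medium', 'DEALSIZE_Small']
```

Columns with many distinct text values (for example customer names or addresses) produce one indicator per value, which is likely why the encoded data frame ends up with several hundred columns.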

### 10. Separate the target and independent features

In [21]:
X = df.drop('STATUS',axis=1)
y = df['STATUS']

### 11. Split the dataset into two parts (i.e. 80% train and 20% test) using random_state=42. 

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
print("Training set shape:")
print(X_train.shape, y_train.shape)

print("\nTesting set shape:")
print(X_test.shape, y_test.shape)

Training set shape:
(2258, 633) (2258,)

Testing set shape:
(565, 633) (565,)
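
With an imbalanced target, a plain random split can leave the train and test sets with slightly different class proportions. One option, shown here as a sketch on synthetic data (the arrays below are assumptions, not the sales data), is the `stratify` parameter of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 90 positives, 10 negatives
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 90 + [0] * 10)

# stratify=y_demo keeps the 90/10 class ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```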


### 12. Scale the data

In [26]:
mm = MinMaxScaler()

# Fit on the training data only, then rebuild DataFrames so column names and indexes are kept
X_train = pd.DataFrame(mm.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(mm.transform(X_test), columns=X_test.columns, index=X_test.index)
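
The scaler is fitted on the training data only and then reused on the test data, so the test set never influences the learned minimum and maximum. A small sketch with made-up numbers shows one consequence: scaled test values can fall outside the [0, 1] range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data for illustration
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10 and max=30 from train
test_scaled = scaler.transform(test)        # reuses the train min and max

print(test_scaled.ravel())  # 40 lies above the train maximum, so it scales above 1
```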


## LDA 

### Training a Random Forest classifier before applying LDA

In [27]:
rf = RandomForestClassifier(max_depth=3,n_estimators=25)
rf.fit(X_train,y_train)
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)

print("Train Accuracy",accuracy_score(y_train,y_train_pred))
print("Test Accuracy",accuracy_score(y_test,y_test_pred))
print("*"*50)
print("Train confusion matrix",'\n',confusion_matrix(y_train,y_train_pred))
print("Test confusion matrix",'\n',confusion_matrix(y_test,y_test_pred))

Train Accuracy 0.9322409211691762
Test Accuracy 0.9079646017699115
**************************************************
Train confusion matrix 
 [[   1  153]
 [   0 2104]]
Test confusion matrix 
 [[  0  52]
 [  0 513]]
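
The test confusion matrix above has an empty first column, meaning no test row was predicted as class 0. As a quick check, the sketch below compares the model's test accuracy with the accuracy of always predicting Shipped, using only the counts printed above:

```python
# Test confusion matrix reported above: rows = true class, columns = predicted class
test_cm = [[0, 52],
           [0, 513]]

n_test = sum(sum(row) for row in test_cm)   # 565 test rows in total
n_shipped = sum(test_cm[1])                 # 513 rows whose true class is 1

# Accuracy of a trivial model that always predicts 1 (Shipped)
majority_baseline = n_shipped / n_test

# Accuracy of the random forest: correct predictions on the diagonal
model_accuracy = (test_cm[0][0] + test_cm[1][1]) / n_test

print(majority_baseline, model_accuracy)
```

Both values come out to 513/565, so on the test set this baseline model is no better than a majority-class predictor.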


### Training a Linear Discriminant Analysis (LDA) model on the train data. Do fit_transform on the train data and only transform on the test data.

In [29]:
lda = LDA(n_components=1)

X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
X_train[:5],X_test[:5]

(array([[ 0.60496511],
        [ 0.28984435],
        [ 0.6028427 ],
        [-0.21719613],
        [ 0.27988057]]),
 array([[ 1.26394182],
        [ 2.52900911],
        [ 1.33478261],
        [-2.94060547],
        [ 0.7395123 ]]))
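
In scikit-learn, `n_components` for LDA can be at most `min(n_classes - 1, n_features)`, so a binary target yields a single discriminant axis. The sketch below uses synthetic data (`X_toy`, `y_toy` and `lda_toy` are illustrative names) to show the projected shape, and also that a fitted LDA can classify directly with `predict`:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic binary problem with 5 features (not the sales data)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

lda_toy = LinearDiscriminantAnalysis(n_components=1)

# Supervised projection: the labels are needed to find the discriminant axis
Z = lda_toy.fit_transform(X_toy, y_toy)
print(Z.shape)  # one column, because there are two classes

# LDA is also a classifier in its own right
preds = lda_toy.predict(X_toy)
print(preds[:5])
```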

### Training a random forest model on the transformed data and printing the train and test accuracy. Take max_depth=3 and n_estimators=25

In [30]:
rf=RandomForestClassifier(max_depth=3,n_estimators=25)
rf.fit(X_train,y_train)
y_train_pred=rf.predict(X_train)
y_test_pred=rf.predict(X_test)

print("Train Accuracy",accuracy_score(y_train,y_train_pred))
print("Test Accuracy",accuracy_score(y_test,y_test_pred))
print("*"*50)
print("Train confusion matrix",'\n',confusion_matrix(y_train,y_train_pred))
print("Test confusion matrix",'\n',confusion_matrix(y_test,y_test_pred))

Train Accuracy 0.9565987599645704
Test Accuracy 0.9274336283185841
**************************************************
Train confusion matrix 
 [[  91   63]
 [  35 2069]]
Test confusion matrix 
 [[ 21  31]
 [ 10 503]]
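
The scale, project and classify steps can also be chained in a scikit-learn `Pipeline`, which fits each transformer on the training data only. This is a sketch on synthetic imbalanced data, not a rerun of the notebook; the step names and the fixed `random_state` are assumptions added for reproducibility:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with roughly 10% of rows in class 0
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.1, 0.9], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

pipe = Pipeline([
    ("scale", MinMaxScaler()),                            # scale features to [0, 1]
    ("lda", LinearDiscriminantAnalysis(n_components=1)),  # project onto one axis
    ("rf", RandomForestClassifier(max_depth=3, n_estimators=25, random_state=42)),
])

pipe.fit(X_tr, y_tr)               # every step is fitted on the training split only
test_acc = pipe.score(X_te, y_te)  # accuracy on the held-out split
print(test_acc)
```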


### Conclusion 

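- Before applying LDA, the random forest predicts every test sample as Shipped: on the test data both the true negative and false negative counts are zero. After applying LDA, the model starts to predict the canceled class, giving 21 true negatives and 10 false negatives on the test data.
- Due to class imbalance, the majority class (Shipped) has high recall, but the minority class still has poor recall, even after LDA.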
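
The recall figures behind these points can be worked out directly from the test confusion matrices reported earlier. A short sketch (the helper `per_class_recall` is a hypothetical name introduced here):

```python
def per_class_recall(cm):
    """Recall per class from a confusion matrix (rows = true, columns = predicted)."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

# Test confusion matrices copied from the outputs above
before_lda = [[0, 52], [0, 513]]
after_lda = [[21, 31], [10, 503]]

recall_before = per_class_recall(before_lda)
recall_after = per_class_recall(after_lda)

print(recall_before)                         # [0.0, 1.0]
print([round(r, 3) for r in recall_after])   # [0.404, 0.981]
```

So on the test data, recall for class 0 rises from 0 to about 0.40 after LDA, while recall for class 1 stays around 0.98.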


### Happy Learning:)