# What is the multivariate normal distribution?

The multivariate normal distribution is a probability distribution that describes the joint behavior of a set of random variables. It generalizes the univariate normal (or Gaussian) distribution, which describes a single random variable.

In the multivariate normal distribution, each random variable is a dimension, and the distribution is characterized by a mean vector and a covariance matrix. The mean vector represents the center of the distribution, and the covariance matrix describes the spread and correlation between the random variables.
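To make the roles of the mean vector and the covariance matrix concrete, here is a minimal sketch that draws samples from a two-dimensional normal distribution and compares the empirical mean and covariance with the parameters used. The values of `mu` and `sigma`, the seed, and the sample size are illustrative assumptions, not taken from the notebook.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration.
mu = np.array([1.0, -2.0])            # mean vector: center of the distribution
sigma = np.array([[2.0, 0.8],         # covariance matrix: variances on the diagonal,
                  [0.8, 1.0]])        # covariance between the two variables off it

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, sigma, size=50_000)

# With many samples, the empirical estimates should be close to mu and sigma.
sample_mean = samples.mean(axis=0)
sample_cov = np.cov(samples, rowvar=False)

print("sample mean:\n", sample_mean)
print("sample covariance:\n", sample_cov)
```

The positive off-diagonal entry (0.8) means the two variables tend to move together; setting it to 0 would give uncorrelated variables, as in the identity covariance matrices used later in this notebook.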

The probability density function of the multivariate normal distribution has a bell-shaped, symmetric form, similar to the univariate normal distribution. The key difference is that the probability density function of the multivariate normal distribution depends on multiple variables, and it can be used to model complex data distributions that have dependencies between their variables.
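For reference, the density described above can be written out explicitly. The notation is an assumption introduced here: $k$ is the number of variables, $\boldsymbol{\mu}$ the mean vector, and $\boldsymbol{\Sigma}$ the covariance matrix, taken to be positive definite so that its inverse exists.

```latex
f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^{k}\,\lvert\boldsymbol{\Sigma}\rvert}}
\exp\!\left( -\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\top}
\boldsymbol{\Sigma}^{-1}\,(\mathbf{x}-\boldsymbol{\mu}) \right)
```

For $k = 1$, with $\boldsymbol{\Sigma} = \sigma^2$, this reduces to the familiar univariate normal density.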

The multivariate normal distribution is widely used in various fields, including statistics, machine learning, finance, and engineering, as it provides a flexible and powerful tool to model multivariate data.

# Generating a synthetic dataset

The code below generates a synthetic dataset of 40 samples with three features and a binary target variable. Here is a step-by-step explanation of the code:

np.random.seed(23): This sets the random seed for the NumPy random number generator to 23. Setting the seed ensures that the random numbers generated are reproducible, meaning that running the code multiple times with the same seed will produce the same results.

mu_vec1 = np.array([0,0,0]): This creates a NumPy array with three elements, each initialized to zero. This array represents the mean vector of the first class.

cov_mat1 = np.array([[1,0,0] ,[0,1,0],[0,0,1]]): This creates a 3x3 NumPy array representing the covariance matrix of the first class. The diagonal elements of the covariance matrix are set to 1, indicating that each feature has a variance of 1. The off-diagonal elements are set to 0, indicating that the features are uncorrelated.

class1_sample = np.random.multivariate_normal(mu_vec1 , cov_mat1 , 20): This generates a random sample of 20 points from a multivariate normal distribution with mean mu_vec1 and covariance cov_mat1. The resulting class1_sample is a 20x3 NumPy array, where each row represents a point with three features.

df = pd.DataFrame(class1_sample , columns = ['feature1' , 'feature2' , 'feature3']): This creates a Pandas DataFrame df with the class1_sample data and column labels 'feature1', 'feature2', and 'feature3'.

df['target'] = 1: This adds a new column to the DataFrame with the label 'target' and initializes all values to 1, indicating that these points belong to the first class.

mu_vec2 = np.array([1,1,1]): This creates a NumPy array with three elements, each initialized to 1. This array represents the mean vector of the second class.

cov_mat2 = np.array([[1,0,0], [0,1,0],[0,0,1]]): This creates a 3x3 NumPy array representing the covariance matrix of the second class. The diagonal elements of the covariance matrix are set to 1, indicating that each feature has a variance of 1. The off-diagonal elements are set to 0, indicating that the features are uncorrelated.

class2_sample = np.random.multivariate_normal(mu_vec2 , cov_mat2 , 20): This generates a random sample of 20 points from a multivariate normal distribution with mean mu_vec2 and covariance cov_mat2. The resulting class2_sample is a 20x3 NumPy array, where each row represents a point with three features.

df1 = pd.DataFrame(class2_sample , columns = ['feature1' , 'feature2' , 'feature3']): This creates a Pandas DataFrame df1 with the class2_sample data and column labels 'feature1', 'feature2', and 'feature3'.

df1['target'] = 0: This adds a new column to the DataFrame with the label 'target' and initializes all values to 0, indicating that these points belong to the second class.

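df = df.append(df1 , ignore_index = True): This appends the rows of df1 to the end of df, effectively merging the two DataFrames into one with 40 rows. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat([df, df1], ignore_index=True) is the equivalent call.

df = df.sample(40): This draws all 40 rows in random order, which shuffles the dataset so the two classes are no longer stored in separate blocks.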

In [1]:
import numpy as np 
import pandas as pd 

In [4]:
np.random.seed(23) 

mu_vec1 = np.array([0,0,0])
cov_mat1 = np.array([[1,0,0] ,[0,1,0],[0,0,1]])
class1_sample = np.random.multivariate_normal(mu_vec1 , cov_mat1 , 20)

df = pd.DataFrame(class1_sample , columns = ['feature1' , 'feature2' , 'feature3'])
df['target'] = 1

mu_vec2 = np.array([1,1,1]) 
cov_mat2 = np.array([[1,0,0], [0,1,0],[0,0,1]])
class2_sample = np.random.multivariate_normal(mu_vec2 , cov_mat2 , 20) 

df1 = pd.DataFrame(class2_sample , columns = ['feature1' , 'feature2' , 'feature3']) 
df1['target'] = 0 

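# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
df = pd.concat([df, df1], ignore_index=True)
# Sampling all 40 rows shuffles the dataset (the original index labels are kept)
df = df.sample(40)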


In [5]:
df.head()

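|    | feature1  | feature2  | feature3  | target |
|----|-----------|-----------|-----------|--------|
| 2  | -0.367548 | -1.13746  | -1.322148 | 1      |
| 34 | 0.177061  | -0.598109 | 1.226512  | 0      |
| 14 | 0.420623  | 0.41162   | -0.071324 | 1      |
| 11 | 1.968435  | -0.547788 | -0.679418 | 1      |
| 12 | -2.50623  | 0.14696   | 0.606195  | 1      |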


In [17]:
import plotly.express as px 
fig = px.scatter_3d(df , x = df['feature1'] , y = df['feature2'] , z = df['feature3'] , color = df['target'].astype('str'))
fig.update_traces(marker = dict(size = 12 , line = dict(width = 2 , color = 'DarkSlateGrey')) ,
                           selector=dict(mode = 'markers'))
fig.show()

In [7]:
# Step 1 - Apply standard scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() 
df.iloc[: , 0:3] = scaler.fit_transform(df.iloc[:,0:3])
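Standard scaling subtracts each feature's mean and divides by its standard deviation. The sketch below reproduces that computation with plain NumPy on made-up data (the array `X`, the seed, and the feature scales are assumptions for illustration). It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses by default.

```python
import numpy as np

# Hypothetical data: 40 samples, 3 features with different centers and spreads.
rng = np.random.default_rng(1)
X = rng.normal(loc=[5.0, -3.0, 10.0], scale=[2.0, 0.5, 4.0], size=(40, 3))

mean = X.mean(axis=0)
std = X.std(axis=0)          # ddof=0, i.e. the population standard deviation
X_scaled = (X - mean) / std  # z-score for every entry, computed per column

print("column means after scaling:", X_scaled.mean(axis=0))
print("column stds after scaling: ", X_scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1 up to floating-point error, so no feature dominates the covariance matrix just because of its units.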

In [8]:
# Step 2 - Find the covariance matrix

covariance_matrix = np.cov([df.iloc[:,0] , df.iloc[:,1] , df.iloc[: , 2]])
print('covariance matrix: \n' , covariance_matrix)

covariance matrix: 
 [[1.02564103 0.20478114 0.080118  ]
 [0.20478114 1.02564103 0.19838882]
 [0.080118   0.19838882 1.02564103]]
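The diagonal entries are 1.02564103 rather than exactly 1 because the two steps use different normalizations: standard scaling divides by the population standard deviation (denominator n), while `np.cov` defaults to the sample covariance (denominator n - 1). With n = 40, each variance becomes 40/39 ≈ 1.0256. A small sketch on made-up data (the array `X` and the seed are assumptions) shows the effect:

```python
import numpy as np

# Hypothetical data with the same shape as the notebook's feature matrix.
rng = np.random.default_rng(2)
n = 40
X = rng.normal(size=(n, 3))

# Standardize with the population standard deviation (ddof=0).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# rowvar=False treats columns as variables and rows as observations.
cov_sample = np.cov(X_scaled, rowvar=False)              # divides by n - 1 (default)
cov_population = np.cov(X_scaled, rowvar=False, ddof=0)  # divides by n

print("diagonal with ddof=1:", np.diag(cov_sample))
print("n / (n - 1):         ", n / (n - 1))
print("diagonal with ddof=0:", np.diag(cov_population))
```

The constant factor rescales every eigenvalue equally and leaves the eigenvectors unchanged, so it does not affect which directions PCA selects.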


In [9]:
# Step 3 - Find the eigenvalues and eigenvectors
eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix)

In [10]:
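# Caution: np.linalg.eig returns eigenvectors as COLUMNS and does not sort them
# by eigenvalue. This slice takes the first two ROWS, which reproduces the output
# shown here but is not, in general, the top two principal components. To select
# those, sort by eigenvalue in descending order and take the matching columns.
pc = eigen_vectors[0:2]
pc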

array([[-0.53875915, -0.69363291,  0.47813384],
       [-0.65608325, -0.01057596, -0.75461442]])
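Note that `np.linalg.eig` returns the eigenvectors as the columns of its second return value and does not order them by eigenvalue, so slicing rows with `eigen_vectors[0:2]` does not generally select the two leading principal components. Below is a minimal sketch of a selection that does, using the covariance matrix printed earlier. Switching to `np.linalg.eigh` (meant for symmetric matrices) is a substitution made here, not what the notebook uses.

```python
import numpy as np

# Covariance matrix copied from the printed output earlier in the notebook.
covariance_matrix = np.array([
    [1.02564103, 0.20478114, 0.080118],
    [0.20478114, 1.02564103, 0.19838882],
    [0.080118,   0.19838882, 1.02564103],
])

# eigh is designed for symmetric matrices and returns real eigenvalues
# in ascending order, with the eigenvectors stored as columns.
eigen_values, eigen_vectors = np.linalg.eigh(covariance_matrix)

# Reorder so the largest eigenvalue (most variance) comes first.
order = np.argsort(eigen_values)[::-1]
eigen_values = eigen_values[order]
eigen_vectors = eigen_vectors[:, order]

# Take the first two COLUMNS and transpose, so each row of pc is one component.
pc = eigen_vectors[:, :2].T   # shape (2, 3)

print("eigenvalues (descending):", eigen_values)
print("top-2 components:\n", pc)
```

The sign of each eigenvector is arbitrary, so components may come out flipped relative to another library's result; the projection is equivalent up to that sign.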

In [11]:
transformed_df = np.dot(df.iloc[:,0:3],pc.T)
# shapes: (40, 3) @ (3, 2) -> (40, 2)
new_df = pd.DataFrame(transformed_df,columns=['PC1','PC2'])
new_df['target'] = df['target'].values
new_df.head()

|   | PC1       | PC2       | target |
|---|-----------|-----------|--------|
| 0 | 0.599433  | 1.795862  | 1      |
| 1 | 1.056919  | -0.212737 | 0      |
| 2 | -0.271876 | 0.498222  | 1      |
| 3 | -0.621586 | 0.02311   | 1      |
| 4 | 1.567286  | 1.730967  | 1      |
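One way to sanity-check a PCA projection is to confirm that the projected columns are uncorrelated and that their variances equal the leading eigenvalues. The sketch below runs the whole pipeline with NumPy only, on data regenerated with a `Generator`. The seed, the variable names, and the eigenvalue-sorted component selection are assumptions, so the numbers will not match the table above.

```python
import numpy as np

# Regenerate a comparable dataset: two classes, 20 points each, 3 features.
rng = np.random.default_rng(23)
class1 = rng.multivariate_normal(np.zeros(3), np.eye(3), size=20)
class2 = rng.multivariate_normal(np.ones(3), np.eye(3), size=20)
X = np.vstack([class1, class2])

# Step 1: standardize each feature (population std, as StandardScaler does).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the features (columns are variables).
cov = np.cov(X_scaled, rowvar=False)

# Step 3: eigen-decomposition, sorted by descending eigenvalue.
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Step 4: project onto the top two components: (40, 3) @ (3, 2) -> (40, 2).
pc = vecs[:, :2].T
projected = X_scaled @ pc.T

# Checks: projected covariance is diagonal, with the top eigenvalues on it.
projected_cov = np.cov(projected, rowvar=False)
explained_ratio = vals[:2].sum() / vals.sum()

print("projected covariance:\n", projected_cov)
print("top-2 eigenvalues:", vals[:2])
print("fraction of variance kept:", explained_ratio)
```

If the off-diagonal entries of `projected_cov` are not near zero, or its diagonal does not match the leading eigenvalues, the components were probably taken from the wrong axis or in the wrong order.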
