# Dermatology Database

 * This database contains 34 attributes: 33 are linear-valued
   and one is nominal.


 * The differential diagnosis of erythemato-squamous diseases is a real
   problem in dermatology. They all share the clinical features of
   erythema and scaling, with very few differences. The diseases in
   this group are psoriasis, seborrheic dermatitis, lichen planus,
   pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris.
   Usually a biopsy is necessary for the diagnosis, but unfortunately
   these diseases share many histopathological features as
   well. Another difficulty for the differential diagnosis is that a
   disease may show the features of another disease at the beginning
   stage and may have the characteristic features at the following stages. 
   Patients were first evaluated clinically with 12 features.
   Afterwards, skin samples were taken for the evaluation of 22
   histopathological features. The values of the histopathological features
   are determined by an analysis of the samples under a microscope. 


 * In the dataset constructed for this domain, the family history feature
   has the value 1 if any of these diseases has been observed in the
   family, and 0 otherwise. The age feature simply represents the age of
   the patient. Every other feature (clinical and histopathological) was
   given a degree in the range of 0 to 3. Here, 0 indicates that the
   feature was not present, 3 indicates the largest amount possible,
   and 1, 2 indicate the relative intermediate values.

 * The dataset is taken from the UCI Machine Learning Repository: [Dermatology Dataset](https://archive.ics.uci.edu/ml/datasets/dermatology)

# Steps Involved:

 1. Importing the library
 2. Loading the Dataset
 3. Structure of the Dataset
 4. Exploration of Dataset
  * Statistics
  * Data Cleaning
  * Heat Map
 5. PCA Implementation
 6. Machine learning Model
  * Splitting the Dataset
  * Training the Model
  * Prediction
  * Model Score

# 1. Importing the library

In [1]:
#Import all the required packages, such as numpy and pandas
#Also import the plotting libraries

# 2. Loading the Dataset

In [2]:
'''Load the Dermatology dataset''';
#Use pandas or numpy to read the file

In [3]:
'''If the data was loaded with numpy, convert it into a pandas DataFrame''';

In [4]:
'''The last column, i.e. ClassCode (the disease class), is your target value''';
#Print the target value

In [5]:
#Print the columns names in the dataset

In [6]:
#Save the target value in some other variable for further analysis
'''Use columns name for this indexing''';
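As a minimal sketch of the loading steps above, the following reads a tiny in-memory excerpt instead of the real file. The three rows, the shortened column list, and every column name other than `ClassCode` are illustrative assumptions, and the sketch assumes a comma-separated file without a header row.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for the real file.
# The real dataset has 34 feature columns plus the class label.
sample = io.StringIO(
    "2,2,0,55,2\n"
    "3,3,1,8,1\n"
    "2,1,0,?,3\n"
)

# Illustrative column names (assumptions, except for 'ClassCode').
columns = ["erythema", "scaling", "family_history", "Age", "ClassCode"]

# header=None because the excerpt has no header row.
df = pd.read_csv(sample, header=None, names=columns)

print(df.columns.tolist())  # column names in the dataset

# Keep the target in its own variable, indexed by column name.
target = df["ClassCode"]
print(target)
```

Note that the "?" in the `Age` column causes pandas to read that column as text rather than numbers, which is why the cleaning step later on is needed.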

# 3. Structure of the Dataset

In [7]:
#Print the shape of the dataset

In [8]:
#Print the summary statistics of the dataset using describe()

## Missing or Null Points

In [9]:
#Check for null value in the features using isnull function

In [10]:
#Check for nan value in the features using isnan function

In [11]:
'''Check whether you found any null or NaN values''';
#If any are found, handle them before proceeding; otherwise move on to the next step
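The null and NaN checks above could look roughly like this. The small DataFrame and its column names (other than `ClassCode`) are made up for illustration; only `isnull` from pandas and `isnan` from numpy are the calls named in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age.
df = pd.DataFrame({
    "itching": [0, 2, 3, 1],
    "Age": [55.0, np.nan, 26.0, 40.0],
    "ClassCode": [2, 1, 3, 1],
})

# Count null values per column with pandas.
null_counts = df.isnull().sum()
print(null_counts)

# Count NaN values in a numeric column with numpy.
nan_count = int(np.isnan(df["Age"].to_numpy()).sum())
print(nan_count)
```

`np.isnan` only works on numeric arrays, so a column that still contains text markers such as "?" has to be cleaned first.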

# 4. Exploration of the Dataset

## Statistics

For our first coding implementation, we will calculate descriptive statistics about the Dermatology Dataset. Since numpy has already been imported, we will use it to perform the necessary calculations. These statistics will be important later on when analyzing the prediction results from the constructed model.

In the code cell below, we will need to implement the following:

 * Calculate the minimum, maximum, mean, median, and unique values of 'ClassCode'.
 * Store each calculation in its respective variable.

In [12]:
#Minimum of the 'ClassCode'


#Maximum of the 'ClassCode'


#Mean of the 'ClassCode'


#Median of the 'ClassCode'


#Unique of the 'ClassCode'

# Show the calculated statistics
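One possible way to fill in the cell above with numpy is sketched below. The ten labels are hypothetical stand-ins for the real `ClassCode` column, and the variable names are arbitrary.

```python
import numpy as np

# Hypothetical class labels standing in for the 'ClassCode' column.
class_code = np.array([2, 1, 3, 1, 3, 2, 5, 6, 4, 1])

minimum_class = np.min(class_code)
maximum_class = np.max(class_code)
mean_class = np.mean(class_code)
median_class = np.median(class_code)
unique_class = np.unique(class_code)

# Show the calculated statistics.
print("Minimum:", minimum_class)
print("Maximum:", maximum_class)
print("Mean:", mean_class)
print("Median:", median_class)
print("Unique:", unique_class)
```

Because `ClassCode` is a nominal label, the mean and median serve only as sanity checks; the unique values and their counts are the informative part.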

After the statistical analysis, move on to a graphical representation.

In [13]:
#You'll use displot from seaborn on the target value

In [14]:
'''Note any observations from the graph above and filter the data at this stage if needed.''';

In [15]:
#Check for value_counts for target

## Data Cleaning

In [16]:
#Check for any feature that needs cleaning
#Hint: Check for "?" in feature, search for it 

In [17]:
#Replace the value with nan

In [18]:
#Replace nan with other statistics, like mean, mode etc.
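The three cleaning steps above might be sketched as follows. The data are hypothetical, and filling with the median is just one choice among the statistics the cell mentions (mean, mode, and so on).

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Age' uses "?" as a missing-value marker.
df = pd.DataFrame({
    "Age": ["55", "8", "?", "26", "?"],
    "ClassCode": [2, 1, 3, 1, 3],
})

# Step 1: search for the "?" marker.
n_unknown = int((df["Age"] == "?").sum())
print("Rows with '?':", n_unknown)

# Step 2: replace the marker with NaN and convert the column to numbers.
df["Age"] = pd.to_numeric(df["Age"].replace("?", np.nan))

# Step 3: fill NaN with a summary statistic (the median, as an assumption).
df["Age"] = df["Age"].fillna(df["Age"].median())
print(df)
```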

## Heat Map

In [19]:
'''Check for Correlation in the Dataset''';

In [20]:
#Use heat map in the seaborn library to get the correlation graph

A heat map uses a warm-to-cool color spectrum to represent the values in a matrix. Here, each cell's color encodes the correlation coefficient between a pair of features, so strongly related features stand out at a glance.

The correlation coefficient ranges from -1 to 1. If the value is close to 1, there is a strong positive correlation between the two variables. When it is close to -1, the variables have a strong negative correlation. Values near 0 indicate little or no linear relationship.
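To make the two extremes concrete, here is a small sketch with made-up features `a`, `b`, and `c` (hypothetical names, not columns of the dataset). It computes the correlation matrix with pandas; the plotting call is left as a comment and assumes seaborn has been imported as `sns`.

```python
import pandas as pd

# Hypothetical features chosen to show both extremes.
df = pd.DataFrame({
    "a": [0, 1, 2, 3],
    "b": [0, 2, 4, 6],  # rises with a: correlation +1
    "c": [3, 2, 1, 0],  # falls as a rises: correlation -1
})

corr = df.corr()
print(corr)

# To draw the heat map (assuming `import seaborn as sns`):
# sns.heatmap(corr, annot=True, cmap="coolwarm")
```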

**Are there any relationships among the features?**

In [21]:
'''Identify correlated features from the graph above, and consider keeping 
   only the features with the strongest (especially positive) correlations''';

# 5. PCA Implementation

In [22]:
#The ClassCode column is the value to be predicted from the analysis.
#Hence, split the data into X (features) and y (labels) accordingly

In [23]:
#Call StandardScaler's fit_transform on the features

In [24]:
#print the scaled_data
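A rough sketch of the split and scaling cells above, using scikit-learn's `StandardScaler`. The DataFrame is hypothetical and much smaller than the real dataset; the feature names other than `ClassCode` are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature dataset.
df = pd.DataFrame({
    "erythema": [2, 3, 2, 1],
    "Age": [55, 8, 26, 40],
    "ClassCode": [2, 1, 3, 1],
})

# Split into features (X) and labels (y).
X = df.drop(columns=["ClassCode"])
y = df["ClassCode"]

# Standardize each feature to zero mean and unit variance.
scaled_data = StandardScaler().fit_transform(X)
print(scaled_data)
```

Scaling matters here because PCA is driven by variance: without it, a wide-ranging feature such as age would dominate features graded from 0 to 3.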

## The Algebra for PCA

 * Calculating the covariance matrix
 * Calculating the eigenvalues and eigenvectors
 * Forming Principal Components
 * Projection into the new feature space

### a). Calculating the covariance matrix

 * The covariance matrix is a matrix of variances and covariances (or correlations) among every pair of the m variables.
 * It is a square, symmetric matrix.
 * Covariance matrix (S) = X.T * X, up to a constant scaling factor, where X is the standardized data. We can compute it with the numpy matmul() function in Python.

In [25]:
#Find the covariance matrix, which is X.T * X

In [26]:
#Matrix multiplication using numpy

In [27]:
#Print the shape of the covariance matrix
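The covariance step could be sketched like this on random stand-in data (the 50 x 4 matrix is an assumption, not the real dataset). The sketch divides by the number of rows, which the formula above omits; that constant factor rescales the eigenvalues but leaves the eigenvectors unchanged.

```python
import numpy as np

# Random stand-in for the feature matrix: 50 samples, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

# Standardize each column (zero mean, unit variance).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the standardized data via matrix multiplication.
n_samples = X_std.shape[0]
S = np.matmul(X_std.T, X_std) / n_samples

print(S.shape)  # one row and one column per feature
```

For standardized data, this matrix coincides with the correlation matrix of the original features, which is why its diagonal entries are all 1.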

### b). Calculating the eigen values and eigen vectors

 * λ is an eigenvalue of a square matrix A (here, the covariance matrix S) if it is a solution of the characteristic equation det(λ*I - A) = 0, where I is the identity matrix of the same dimension as A.
 * The sum of all m eigenvalues equals the trace of S (the sum of the variances of the original variables).
 * For each eigenvalue λ, a corresponding eigenvector v can be found by solving (λ*I - A)v = 0.
 * The eigenvalues λ1, λ2, ... λm are the variances of the coordinates on each principal component axis.

In [28]:
#Find the top two eigenvalues and the corresponding eigenvectors
#for projecting onto a 2-dimensional space.


#The 'eigvals' parameter is given as an index range (low index to high index)
#The eigh function returns the eigenvalues in ascending order,
#so selecting the last two indices yields only the top two eigenvalues

#Convert the eigenvectors into (2, d) shape for ease of further computation

#vectors[1] is the eigenvector of the 1st principal component
#vectors[0] is the eigenvector of the 2nd principal component
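A sketch of this step on random stand-in data follows. It swaps in `numpy.linalg.eigh` and slices the last two results, rather than using the `eigvals` argument mentioned in the comments above (recent SciPy releases name the equivalent keyword `subset_by_index`). The data and variable names are assumptions.

```python
import numpy as np

# Random stand-in data, standardized, and its covariance matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
S = np.matmul(X_std.T, X_std) / X_std.shape[0]

# eigh returns eigenvalues in ascending order; eigenvectors are columns.
all_values, all_vectors = np.linalg.eigh(S)

# Keep the two largest eigenvalues and their eigenvectors.
values = all_values[-2:]
vectors = all_vectors[:, -2:].T  # shape (2, d): one eigenvector per row

print(values)
print(vectors.shape)
# vectors[1] belongs to the 1st principal component,
# vectors[0] to the 2nd.
```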

### c). Forming Principal Components

In [29]:
#Project the original data samples onto the plane
#formed by the two principal eigenvectors, using matrix multiplication

In [30]:
#Print the new data point shape
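The projection can be sketched as a single matrix product. As before, the data are random stand-ins and the variable names are assumptions; `vectors` has shape (2, d) and the standardized data has shape (n, d).

```python
import numpy as np

# Random stand-in data, standardized, and its covariance matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
S = np.matmul(X_std.T, X_std) / X_std.shape[0]

# Top two eigenpairs (eigh sorts eigenvalues in ascending order).
all_values, all_vectors = np.linalg.eigh(S)
values = all_values[-2:]
vectors = all_vectors[:, -2:].T  # shape (2, d)

# Project every sample onto the two principal directions.
new_coordinates = np.matmul(vectors, X_std.T)  # shape (2, n)
print(new_coordinates.shape)

# Row 1 holds the 1st principal component, row 0 the 2nd.
print(new_coordinates.var(axis=1))  # compare with `values`
```

The variance of each projected coordinate matches the corresponding eigenvalue, which is the property stated in the eigenvalue notes above.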

### d). Projection into the new features space

In [31]:
#Create a DataFrame with the 1st and 2nd principal components

#create new_dataframe for plotting labeled points

In [32]:
# plot the 2d data points with seaborn

# 6. Machine learning Model

In [33]:
'''Choose any classification model''';

## Splitting the Data

In [34]:
#Use train_test_split
'''split the data in 70:30 ratio''';

In [35]:
#print the shape of train data

In [36]:
#print the shape of test_data

## Import the Classifier Model

In [37]:
#Import a built-in classifier model from scikit-learn

In [38]:
#Call the classifier

## Training the Model

In [39]:
#Train the model on the training data using the built-in .fit method

## Testing the Model

In [40]:
#Compute y_predict by calling .predict on X_test

## Model Score

In [41]:
#Print the model score
#Plot any related graph
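The whole modelling section, from splitting to scoring, might be sketched as below. Everything specific here is an assumption: the data are synthetic (generated with `make_classification`, with 34 features and 6 classes to mirror the description above and an illustrative sample count), and `KNeighborsClassifier` is an arbitrary pick, since the notebook leaves the choice of classifier open.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 34 features, 6 classes.
X, y = make_classification(
    n_samples=366,
    n_features=34,
    n_informative=10,
    n_classes=6,
    n_clusters_per_class=1,
    random_state=0,
)

# Split the data in a 70:30 ratio, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)

# Create and train the classifier (an arbitrary choice of model).
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Predict on the held-out data and report the accuracy.
y_predict = model.predict(X_test)
score = model.score(X_test, y_test)
print("Accuracy:", score)
```

For classifiers, `.score` returns the mean accuracy on the given test data, so it summarizes how many of the predictions in `y_predict` match `y_test`.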