## Critical Analysis Report
name: Y Z

ID: 00000000

### Issue 1. "We note that the 'not.fully.paid' feature has missed a record. We can repair it by using the average value of this feature to replace the missing value."
When dealing with missing data, we have two main options: impute the missing values or remove the affected records.
- In this case, only one value of one feature is missing, so the proportion of missing data is very low.
- This feature is a categorical variable.

Here, a common approach is to drop the row that contains the missing record. Replacing a missing value with the median or average is normally reserved for continuous variables.

#### Solution: drop the row(s) with missing records
data = Data.dropna()

data.info()
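
Before dropping, it can help to confirm how little data is missing and to see why the average is a poor fill value for a categorical feature. The following is only a minimal sketch: `toy` is a hypothetical stand-in for the loan data, and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loan data; the values are invented.
toy = pd.DataFrame({
    "int.rate": [0.11, 0.13, 0.09, 0.15],
    "not.fully.paid": [0, 1, np.nan, 0],  # categorical label with one missing record
})

# Count missing values per column to confirm the missing portion is small.
missing_per_column = toy.isna().sum()

# The mean of a 0/1 label is a fraction, which is not a valid category.
mean_fill = toy["not.fully.paid"].mean()

# Drop every row that contains at least one missing value.
toy_clean = toy.dropna()

print(missing_per_column)
print("mean of the label:", mean_fill)
print(len(toy), "rows ->", len(toy_clean), "rows")
```

In this toy frame the mean lies strictly between 0 and 1, so filling with it would create a label value that does not correspond to either class.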

### Issue 2. "If 'int.rate' is negative it will be replaced by the median value of 'int.rate'. If 'days.with.cr.line' is greater than 36,500 days (or 100 years), it will be replaced by the median value of 'days.with.cr.line'"
Some outliers can be identified with common sense and indicate a mistake in data collection.
If we know that a value is wrong and we have plenty of data, it is safe to drop the rows that contain such outliers.

#### Solution: drop the row(s) with outliers
data_clean = data.drop(data.index[(data['int.rate'] <= 0) | (data['days.with.cr.line'] > 36500)])

data_clean.info()
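
A row is invalid if it breaks either rule, so the two conditions need to be combined with `|` (or); combining them with `&` (and) would only match rows that break both rules at once. A small sketch of the difference, using a hypothetical `toy` frame with invented values:

```python
import pandas as pd

# Hypothetical data: row 1 has an invalid rate, row 2 an impossible credit-line age.
toy = pd.DataFrame({
    "int.rate": [0.12, -0.05, 0.10, 0.14],
    "days.with.cr.line": [4000.0, 3500.0, 40000.0, 2800.0],
})

bad_rate = toy["int.rate"] <= 0
bad_days = toy["days.with.cr.line"] > 36500

# Keep rows that break neither rule.
kept_with_or = toy[~(bad_rate | bad_days)]

# With "&" no row breaks both rules at once, so nothing is removed.
kept_with_and = toy[~(bad_rate & bad_days)]

print(len(kept_with_or), len(kept_with_and))
```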

### Issue 3. "From the heatmaps, there are different correlations between each feature and 'credit.policy'. Only reserve features that have positive correlations by removing all features that are negatively correlated with labeled feature."
- Negative correlation is a relationship between two variables in which one variable increases as the other decreases, and vice versa.
- A perfect negative correlation is represented by the value -1.0, a perfect positive correlation by +1.0, and 0 indicates no linear correlation.
- The sign of a correlation does not tell us whether a feature is useful: a strongly negative correlation is as informative as a strongly positive one.
- The further a correlation is from zero, the stronger the relationship between the two variables.

#### Solution: do not drop the features that have negative correlation with 'credit.policy'
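
If features are to be ranked by correlation at all, ranking by absolute value keeps the strongly negative ones. The sketch below assumes a hypothetical `toy` frame; the feature columns `fico`, `inq.last.6mths` and `noise` and all values are illustrative assumptions, not taken from the analysed notebook.

```python
import pandas as pd

# Hypothetical data: one positively related, one negatively related, one unrelated feature.
toy = pd.DataFrame({
    "credit.policy": [1, 1, 1, 0, 0, 1, 0, 1],
    "fico": [740, 720, 700, 640, 650, 710, 630, 730],
    "inq.last.6mths": [0, 1, 0, 5, 4, 1, 6, 0],
    "noise": [3, 1, 2, 2, 3, 1, 1, 3],
})

# Pearson correlation of every feature with the label.
corr = toy.corr()["credit.policy"].drop("credit.policy")

# Rank by strength (distance from zero), ignoring the sign.
ranked = corr.abs().sort_values(ascending=False)

print(corr)
print(ranked)
```

In this toy data the negatively correlated feature ranks far above the unrelated one, so dropping it because of its sign would discard useful information.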

### Issue 4. "Sort all records by an ascending order of 'credit.policy'. "
On inspecting the dataset, we found that all records are sorted in descending order of 'credit.policy'.
Because there are far more records with credit.policy = 1 than with credit.policy = 0,
it is a good idea to include all records with credit.policy = 0 in the training dataset.
Sorting in ascending order achieves this.

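The problem is that the following code does not actually change the order: `sort_values` returns a new sorted DataFrame and leaves `New_Data` unchanged unless the result is assigned or `inplace=True` is passed.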
from sklearn.model_selection import train_test_split

New_Data.sort_values(by=['credit.policy'])

#### Solution
from sklearn.model_selection import train_test_split

New_Data.sort_values(by=['credit.policy'], inplace=True)
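
The behaviour can be checked on a small frame. This is only a sketch: `toy` is a hypothetical stand-in for `New_Data`.

```python
import pandas as pd

toy = pd.DataFrame({"credit.policy": [1, 0, 1, 0]})

# The sorted result is discarded here, so toy keeps its original order.
toy.sort_values(by=["credit.policy"])
order_after_plain_call = toy["credit.policy"].tolist()

# Option 1: assign the returned, sorted copy to a name.
sorted_copy = toy.sort_values(by=["credit.policy"])

# Option 2: sort the existing frame in place.
toy.sort_values(by=["credit.policy"], inplace=True)
order_after_inplace = toy["credit.policy"].tolist()

print(order_after_plain_call, order_after_inplace)
```

Assigning the result (`New_Data = New_Data.sort_values(...)`) and passing `inplace=True` have the same effect on the order; which one to use is mostly a matter of style.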



### Issue 5. Data normalisation following the dataset train_test_split
The goal of data normalisation is to transform features in a dataset to be on a similar scale.

According to the sklearn documentation for `StandardScaler`:
- fit(X[, y, sample_weight]): Compute the mean and std to be used for later scaling.
- fit_transform(X[, y]): Fit to data, then transform it.
- transform(X[, copy]): Perform standardization by centering and scaling.

Here we should fit the scaler on the training dataset only, then transform both the training and test datasets with that same scaler, so that no information from the test data influences the scaling.

#### Solution
from sklearn.preprocessing import StandardScaler

obje_ss=StandardScaler()

x_ex1_train=obje_ss.fit_transform(x_ex1_train)

x_ex1_test=obje_ss.transform(x_ex1_test)
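
To make the roles of `fit` and `transform` concrete, the same standardisation can be written out by hand with NumPy. This is an illustrative sketch with invented arrays, not the scaler's actual implementation; it relies on the fact that `StandardScaler` uses the population standard deviation, which is also NumPy's default.

```python
import numpy as np

# Hypothetical training and test features (two columns each).
x_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_test = np.array([[4.0, 40.0]])

# "fit": learn the per-feature mean and standard deviation from the training data only.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)

# "transform": apply the training statistics to both datasets.
x_train_scaled = (x_train - mean) / std
x_test_scaled = (x_test - mean) / std

print(x_train_scaled.mean(axis=0), x_train_scaled.std(axis=0))
print(x_test_scaled)
```

The scaled training data has mean 0 and standard deviation 1 per feature, while the test data generally does not, which is expected because its statistics were never used.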

________________________________________________________________________________________________________________________________

Please check [critical_analysis_revised](http://localhost:8888/notebooks/DSWorkshop/data-science-portfolio-annyz7516/critical_analysis_revised.ipynb) for the complete solution, which gives the following results for the decision tree model:

- Train success rate: 99.86078886310905%
- Test success rate: 99.79101358411702%