### Linearity in linear regression 
  Linearity in linear regression means the model assumes that the expected value of the dependent variable is a linear function of the independent variables: a one-unit change in an independent variable changes the predicted value by a constant amount (its coefficient), holding the other variables fixed.
  
  
##### Effect of outliers
 * Biased model predictions: in a regression model, for example, outliers can pull the fitted line significantly away from the true line of best fit.
 * Shifted center: outliers can pull estimates of the data's center towards their location. The mean is affected most, while the median is more robust. This can lead to incorrect generalizations and predictions, especially when the majority of data points follow a different trend.
 * Larger residuals: outliers can lead to larger residuals (differences between observed and predicted values), causing increased error rates and reduced model precision.
 * Altered decision boundaries: in classification models, outliers might cause decision boundaries to be skewed or incorrectly positioned. This could lead to misclassification of data points, reducing the model's ability to correctly classify new instances.
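
As a rough illustration of the regression-related points above, the sketch below fits a least-squares line with NumPy to a small synthetic dataset, once as-is and once with a single extreme value. The data and the outlier are made up for illustration only.

```python
import numpy as np

# Synthetic data that lies exactly on y = 2x + 1 (illustrative values only)
x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0

# Same data, but the last observation is replaced by an extreme value
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_out, intercept_out = np.polyfit(x, y_outlier, 1)

# Residuals of the outlier-affected fit, measured on the unchanged points only
residuals = y_clean[:-1] - (slope_out * x[:-1] + intercept_out)

print("slope without outlier:", round(slope_clean, 3))
print("slope with outlier:   ", round(slope_out, 3))
print("max |residual| on unchanged points:", round(np.abs(residuals).max(), 3))
```

A single point is enough to change the slope noticeably here, and the points that did follow the original trend now show large residuals.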
 
 
### Feature scaling
 Feature scaling techniques are important preprocessing steps in machine learning that involve adjusting the values of your features (variables) to a common scale. This ensures that the features contribute fairly to the learning process and that the model can better understand the data.
##### Normalization:

* What is Normalization: <br> 
Normalization is a feature scaling technique that adjusts the values of features to be within a specific range, usually between 0 and 1. <br>

* Why Use Normalization: <br>
Imagine you have features with different scales, like age (0-100) and income (0-100,000). Normalization ensures that regardless of the original scale, all features end up in a common range, making comparisons fair. Note that min-max normalization is sensitive to outliers, because a single extreme value sets the minimum or maximum and compresses the remaining values into a narrow band.<br>

* How Does Normalization Work:<br>
For each data point in a feature, normalization subtracts the minimum value and divides by the range (max-min). This rescales the feature values to lie between 0 and 1.<br>

* Benefits of Normalization: <br>
Normalization is helpful when you don't want any feature to dominate others based on its original scale.<br>
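
The min-max formula described above can be sketched in a few lines of NumPy. The helper name `min_max_normalize` and the sample values are hypothetical; in practice scikit-learn's `MinMaxScaler` applies the same idea per column.

```python
import numpy as np

def min_max_normalize(values):
    """Rescale a 1-D array to [0, 1] using (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    value_range = values.max() - values.min()
    if value_range == 0:
        # A constant feature has no spread; map it to zeros to avoid dividing by zero
        return np.zeros_like(values)
    return (values - values.min()) / value_range

ages = [18, 25, 40, 60, 100]                     # hypothetical ages
incomes = [0, 20_000, 50_000, 75_000, 100_000]   # hypothetical incomes

ages_scaled = min_max_normalize(ages)
incomes_scaled = min_max_normalize(incomes)
print(ages_scaled)
print(incomes_scaled)
```

After scaling, both features lie in the same 0 to 1 range even though their original units differ by several orders of magnitude.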

##### Standardization:

* What is Standardization:<br> Standardization is another feature scaling technique that transforms feature values so they have a mean of 0 and a standard deviation of 1.<br>

* Why Use Standardization:<br> When features have different scales and follow different distributions, standardization helps to bring them to a common ground for analysis and modeling.<br>

* How Does Standardization Work:<br> For each data point in a feature, standardization subtracts the mean of the feature and divides by its standard deviation. This centers the data around 0 and scales it by the spread of the data.<br>

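* Benefits of Standardization:<br> Standardization is useful when features have varying units and different ranges. It is commonly used with algorithms that are sensitive to feature scale or that work best with roughly centered inputs. Standardizing does not by itself make the data normally distributed.<br>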
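
Written as a formula, the standardization step described above looks as follows. The symbols are introduced here for illustration: $x_i$ is one value of a feature, $n$ the number of values, $\mu$ the feature's mean and $\sigma$ its standard deviation. The population form of $\sigma$ (dividing by $n$) is assumed; some tools divide by $n - 1$ instead.

$$
z_i = \frac{x_i - \mu}{\sigma}, \qquad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}
$$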

### What are the common problems with decision trees and how to overcome them
<br>

Decision trees are powerful and intuitive machine learning algorithms, but they are not without limitations. Here are some common problems associated with decision trees and strategies to overcome them:

 * Overfitting:
    * Problem: Decision trees can easily become too complex and fit the training data too closely, resulting in overfitting. This leads to poor generalization on new, unseen data.
    * Solution: Use pruning to limit the depth and complexity of the tree. Pre-pruning sets constraints on tree growth, such as maximum depth, minimum samples per leaf, or minimum samples per split. Post-pruning grows the tree fully and then simplifies it by removing or collapsing nodes. Ensemble methods such as bagging also help reduce overfitting.
 
* Not Suitable for Linear Relationships
 
* Handling Missing Data
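
The pruning strategies listed under Overfitting can be sketched with scikit-learn's `DecisionTreeClassifier`. This is only an illustration under stated assumptions: the data is synthetic, and the constraint values (`max_depth=4`, `ccp_alpha=0.01`, and so on) are arbitrary choices, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (a stand-in, not a dataset from these notes)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Unconstrained tree: grows until the leaves are pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: constrain growth up front
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                                    min_samples_split=10, random_state=0).fit(X_train, y_train)

# Post-pruning: cost-complexity pruning via ccp_alpha (the value 0.01 is arbitrary)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "leaves:", tree.get_n_leaves(),
          "train acc:", round(tree.score(X_train, y_train), 3),
          "test acc:", round(tree.score(X_test, y_test), 3))
```

Comparing the train and test accuracy of the three trees gives a quick sense of how much the unconstrained tree memorizes the training data.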





##### Feature scaling on KNN dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
df=pd.read_csv("KNN_Project_Data.unknown")

In [4]:
df.head()

Unnamed: 0,XVPM,GWYH,TRAT,TLLZ,IGGA,HYKR,EDFS,GUUB,MGJM,JHZC,TARGET CLASS
0,1636.670614,817.988525,2565.995189,358.347163,550.417491,1618.870897,2147.641254,330.727893,1494.878631,845.136088,0
1,1013.40276,577.587332,2644.141273,280.428203,1161.873391,2084.107872,853.404981,447.157619,1193.032521,861.081809,1
2,1300.035501,820.518697,2025.854469,525.562292,922.206261,2552.355407,818.676686,845.491492,1968.367513,1647.186291,1
3,1059.347542,1066.866418,612.000041,480.827789,419.467495,685.666983,852.86781,341.664784,1154.391368,1450.935357,0
4,1018.340526,1313.679056,950.622661,724.742174,843.065903,1370.554164,905.469453,658.118202,539.45935,1899.850792,0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   XVPM          1000 non-null   float64
 1   GWYH          1000 non-null   float64
 2   TRAT          1000 non-null   float64
 3   TLLZ          1000 non-null   float64
 4   IGGA          1000 non-null   float64
 5   HYKR          1000 non-null   float64
 6   EDFS          1000 non-null   float64
 7   GUUB          1000 non-null   float64
 8   MGJM          1000 non-null   float64
 9   JHZC          1000 non-null   float64
 10  TARGET CLASS  1000 non-null   int64  
dtypes: float64(10), int64(1)
memory usage: 86.1 KB


In [7]:
df.shape

(1000, 11)

In [8]:
df.describe()

Unnamed: 0,XVPM,GWYH,TRAT,TLLZ,IGGA,HYKR,EDFS,GUUB,MGJM,JHZC,TARGET CLASS
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,1055.071157,991.851567,1529.373525,495.107156,940.590072,1550.637455,1561.003252,561.346117,1089.067338,1452.521629,0.5
std,370.980193,392.27889,640.286092,142.789188,345.923136,493.491988,598.608517,247.357552,402.666953,568.132005,0.50025
min,21.17,21.72,31.8,8.45,17.93,27.93,31.96,13.52,23.21,30.89,0.0
25%,767.413366,694.859326,1062.600806,401.788135,700.763295,1219.267077,1132.097865,381.704293,801.849802,1059.499689,0.0
50%,1045.904805,978.355081,1522.507269,500.197421,939.348662,1564.996551,1565.882879,540.420379,1099.087954,1441.554053,0.5
75%,1326.065178,1275.52877,1991.128626,600.525709,1182.578166,1891.93704,1981.739411,725.762027,1369.923665,1864.405512,1.0
max,2117.0,2172.0,3180.0,845.0,1793.0,2793.0,3196.0,1352.0,2321.0,3089.0,1.0


In [9]:
df.duplicated().sum()

0

In [10]:
#1. Specify dependent and independent variable

In [14]:
x=df.drop("TARGET CLASS",axis=1)

In [15]:
x

Unnamed: 0,XVPM,GWYH,TRAT,TLLZ,IGGA,HYKR,EDFS,GUUB,MGJM,JHZC
0,1636.670614,817.988525,2565.995189,358.347163,550.417491,1618.870897,2147.641254,330.727893,1494.878631,845.136088
1,1013.402760,577.587332,2644.141273,280.428203,1161.873391,2084.107872,853.404981,447.157619,1193.032521,861.081809
2,1300.035501,820.518697,2025.854469,525.562292,922.206261,2552.355407,818.676686,845.491492,1968.367513,1647.186291
3,1059.347542,1066.866418,612.000041,480.827789,419.467495,685.666983,852.867810,341.664784,1154.391368,1450.935357
4,1018.340526,1313.679056,950.622661,724.742174,843.065903,1370.554164,905.469453,658.118202,539.459350,1899.850792
...,...,...,...,...,...,...,...,...,...,...
995,1343.060600,1289.142057,407.307449,567.564764,1000.953905,919.602401,485.269059,668.007397,1124.772996,2127.628290
996,938.847057,1142.884331,2096.064295,483.242220,522.755771,1703.169782,2007.548635,533.514816,379.264597,567.200545
997,921.994822,607.996901,2065.482529,497.107790,457.430427,1577.506205,1659.197738,186.854577,978.340107,1943.304912
998,1157.069348,602.749160,1548.809995,646.809528,1335.737820,1455.504390,2788.366441,552.388107,1264.818079,1331.879020


In [16]:
y=df["TARGET CLASS"]

In [17]:
y

0      0
1      1
2      1
3      0
4      0
      ..
995    0
996    1
997    1
998    1
999    1
Name: TARGET CLASS, Length: 1000, dtype: int64

In [18]:
from sklearn.preprocessing import StandardScaler

In [19]:
scaler=StandardScaler()
scaler.fit(x) # The fit method is used to compute the mean and standard deviation of each feature in your data (x).
scaled=scaler.transform(x)

In [20]:
scaled

array([[ 1.56852168, -0.44343461,  1.61980773, ..., -0.93279392,
         1.00831307, -1.06962723],
       [-0.11237594, -1.05657361,  1.7419175 , ..., -0.46186435,
         0.25832069, -1.04154625],
       [ 0.66064691, -0.43698145,  0.77579285, ...,  1.14929806,
         2.1847836 ,  0.34281129],
       ...,
       [-0.35889496, -0.97901454,  0.83771499, ..., -1.51472604,
        -0.27512225,  0.86428656],
       [ 0.27507999, -0.99239881,  0.0303711 , ..., -0.03623294,
         0.43668516, -0.21245586],
       [ 0.62589594,  0.79510909,  1.12180047, ..., -1.25156478,
        -0.60352946, -0.87985868]])

In [21]:
scaleddf=pd.DataFrame(scaled,columns=df.columns[:-1])

###### What is StandardScaler doing
* The StandardScaler subtracts the mean of each feature from every data point and then divides by that feature's standard deviation. This centers each feature around 0 and scales it to unit variance.
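
A small check of that description, using a made-up two-column array rather than the notebook's data: the output of `StandardScaler` should match a manual "subtract the mean, divide by the standard deviation" computation, provided the population standard deviation (`ddof=0`) is used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up feature matrix: two columns on very different scales
data = np.array([[1.0, 1000.0],
                 [2.0, 3000.0],
                 [3.0, 5000.0],
                 [4.0, 7000.0]])

demo_scaler = StandardScaler()
scaled_by_sklearn = demo_scaler.fit_transform(data)

# Manual computation: subtract the column mean, divide by the column standard deviation
# (NumPy's default ddof=0 is the population standard deviation)
scaled_by_hand = (data - data.mean(axis=0)) / data.std(axis=0)

print(demo_scaler.mean_)    # per-column means learned by fit
print(demo_scaler.scale_)   # per-column standard deviations learned by fit
print(np.allclose(scaled_by_sklearn, scaled_by_hand))
```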

In [22]:
scaleddf.head()

Unnamed: 0,XVPM,GWYH,TRAT,TLLZ,IGGA,HYKR,EDFS,GUUB,MGJM,JHZC
0,1.568522,-0.443435,1.619808,-0.958255,-1.128481,0.138336,0.980493,-0.932794,1.008313,-1.069627
1,-0.112376,-1.056574,1.741918,-1.50422,0.640009,1.081552,-1.182663,-0.461864,0.258321,-1.041546
2,0.660647,-0.436981,0.775793,0.213394,-0.053171,2.030872,-1.240707,1.149298,2.184784,0.342811
3,0.011533,0.191324,-1.433473,-0.100053,-1.507223,-1.753632,-1.183561,-0.888557,0.16231,-0.002793
4,-0.099059,0.820815,-0.904346,1.609015,-0.282065,-0.365099,-1.095644,0.391419,-1.365603,0.787762


###### Process
Not all algorithms require standardization. For example, decision trees and random forests are not as sensitive to feature scales. However, algorithms like support vector machines, k-nearest neighbors, linear regression and neural networks often benefit from standardized features.

Standardization is usually performed on the independent variables to improve the performance of certain algorithms. The dependent variable (target) is typically left in its original scale so that predictions and evaluations remain meaningful.
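
To make the point about scale-sensitive algorithms concrete, here is a hedged sketch that trains k-nearest neighbors with and without standardization. It uses synthetic data (not the KNN project file), inflates one feature by an arbitrary factor of 1000, and wraps the scaler in a pipeline so it is fitted on the training split only; that last choice is an assumption about good practice, not something the notebook above does.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one feature is blown up to mimic mismatched scales
X, y = make_classification(n_samples=600, n_features=6, n_informative=4, random_state=42)
X[:, 0] *= 1000.0

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# KNN on raw features: distances are dominated by the large-scale column
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# KNN inside a pipeline: the scaler is fitted on the training split only,
# then the same transformation is applied to the test split
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_train, y_train)

raw_accuracy = knn_raw.score(X_test, y_test)
scaled_accuracy = knn_scaled.score(X_test, y_test)
print("accuracy without scaling:", round(raw_accuracy, 3))
print("accuracy with scaling:   ", round(scaled_accuracy, 3))
```

The target `y` is passed through untouched in both cases; only the features are scaled.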

#### GridSearchCV on Random Forest dataset

* The Random Forest algorithm builds many decision trees, each trained on a slightly different bootstrap sample of the data (bagging). It then combines the predictions of all these trees to give a more reliable and accurate prediction. The individual decision trees act as the weak learners.
* Grid Search Cross-Validation (CV) helps you find the best combination of settings for your model without you having to guess. It tries every combination of the parameter values you list and reports which one performs best according to the cross-validated score.

In [23]:
df=pd.read_csv("heart.csv")

In [24]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [26]:
df.shape

(303, 14)

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [29]:
df.duplicated().sum()

1

In [32]:
df.drop_duplicates(inplace=True)

In [33]:
df.duplicated().sum()

0

In [35]:
df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

### Model building

In [36]:
#1. Specify dependent and independent variable

In [38]:
x=df.drop("target",axis=1)

In [39]:
y=df["target"]

In [41]:
x.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [42]:
y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 302, dtype: int64

In [43]:
#2. splitting

In [44]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=100)

In [45]:
# Num of trees 
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 80, num = 10)]
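# Num of features at every split
# NOTE: 'auto' was deprecated in scikit-learn 1.1 and removed in 1.3; on such versions every
# candidate that uses it fails to fit (see the failed fits reported below). Prefer 'sqrt' or 'log2'.
max_features = ['auto', 'sqrt']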
# Max num of levels 
max_depth = [2,4]
# Min num of samples required to split a node
min_samples_split = [2, 5]
# Min num of samples required at each leaf node
min_samples_leaf = [1, 2]
# Method of selecting samples for training each tree
bootstrap = [True, False]

In [46]:
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(param_grid)

{'n_estimators': [10, 17, 25, 33, 41, 48, 56, 64, 72, 80], 'max_features': ['auto', 'sqrt'], 'max_depth': [2, 4], 'min_samples_split': [2, 5], 'min_samples_leaf': [1, 2], 'bootstrap': [True, False]}
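
A quick way to see how much work a grid implies is to count its combinations. The sketch below copies the printed grid and uses scikit-learn's `ParameterGrid`; the variable names `candidates`, `n_candidates` and `n_folds` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# The grid exactly as printed above
param_grid = {'n_estimators': [10, 17, 25, 33, 41, 48, 56, 64, 72, 80],
              'max_features': ['auto', 'sqrt'],
              'max_depth': [2, 4],
              'min_samples_split': [2, 5],
              'min_samples_leaf': [1, 2],
              'bootstrap': [True, False]}

candidates = ParameterGrid(param_grid)
n_candidates = len(candidates)   # 10 * 2 * 2 * 2 * 2 * 2
n_folds = 3                      # the cv value passed to GridSearchCV in this notebook
print(n_candidates, "candidates,", n_candidates * n_folds, "fits")
```

Every extra value in any list multiplies the total, so grids grow quickly.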


In [47]:
from sklearn.ensemble import RandomForestClassifier

In [48]:
rf= RandomForestClassifier()

In [50]:
from sklearn.model_selection import GridSearchCV
rfg = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 3, verbose=2, n_jobs = 4)

In [51]:
rfg.fit(x_train, y_train)

Fitting 3 folds for each of 320 candidates, totalling 960 fits


480 fits failed out of a total of 960.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
480 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\hp\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\hp\anaconda3\Lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\hp\anaconda3\Lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\hp\anaconda3\Lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sk

In [53]:
rfg.best_params_

{'bootstrap': False,
 'max_depth': 4,
 'max_features': 'sqrt',
 'min_samples_leaf': 2,
 'min_samples_split': 5,
 'n_estimators': 33}
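
For reference, a minimal sketch of the same kind of search that should run without failed fits on recent scikit-learn versions. It is not a rerun of the notebook: the data is synthetic, the grid is deliberately small, `'auto'` is swapped for `'log2'`, and `error_score='raise'` is set so that an invalid setting stops the search instead of being recorded as `nan`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the heart dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=100)

# Deliberately small grid; only values that current scikit-learn accepts
small_grid = {'n_estimators': [10, 30],
              'max_features': ['sqrt', 'log2'],
              'max_depth': [2, 4]}

search = GridSearchCV(estimator=RandomForestClassifier(random_state=100),
                      param_grid=small_grid,
                      cv=3,
                      error_score='raise')  # fail loudly instead of recording nan scores
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated accuracy of the winning combination, which is a useful companion to `best_params_`.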

#### Label encoding 

In [67]:
df=pd.read_csv("car_evaluation.csv")

In [68]:
df.head()

Unnamed: 0,vhigh,vhigh.1,2,2.1,small,low,unacc
0,vhigh,vhigh,2,2,small,med,unacc
1,vhigh,vhigh,2,2,small,high,unacc
2,vhigh,vhigh,2,2,med,low,unacc
3,vhigh,vhigh,2,2,med,med,unacc
4,vhigh,vhigh,2,2,med,high,unacc


In [69]:
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
df.columns=col_names
col_names

['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

In [70]:
df.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,med,unacc
1,vhigh,vhigh,2,2,small,high,unacc
2,vhigh,vhigh,2,2,med,low,unacc
3,vhigh,vhigh,2,2,med,med,unacc
4,vhigh,vhigh,2,2,med,high,unacc
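
The column labels in the first `df.head()` output (`vhigh`, `vhigh.1`, `2`, `2.1`, ...) suggest that the file has no header line, so pandas used the first record as the header and that record is lost once the columns are renamed. Assuming that is the case, one way to keep the row is to pass `header=None` together with `names`. The sketch uses a few in-memory rows instead of the real file, and the name `cars` is hypothetical.

```python
import io
import pandas as pd

# A few rows in the same shape as the file shown above, with no header line
raw_text = """vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
"""

col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

# header=None tells pandas the first line is data; names supplies the column labels
cars = pd.read_csv(io.StringIO(raw_text), header=None, names=col_names)
print(cars.shape)
print(cars.iloc[0].tolist())
```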


The dataset contains categorical data that needs to be converted to numeric values.

In [71]:
from sklearn import preprocessing

In [72]:
l_enc=preprocessing.LabelEncoder()

In [73]:
df['lug_boot']=l_enc.fit_transform(df['lug_boot'])
df['lug_boot'].unique()

array([2, 1, 0])

In [74]:
df.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,2,med,unacc
1,vhigh,vhigh,2,2,2,high,unacc
2,vhigh,vhigh,2,2,1,low,unacc
3,vhigh,vhigh,2,2,1,med,unacc
4,vhigh,vhigh,2,2,1,high,unacc
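
`LabelEncoder` assigns codes in sorted (alphabetical) order, which is why `small` became 2 and `med` became 1 above. If the natural order of the categories matters, `OrdinalEncoder` with an explicit `categories` list is one alternative. The sketch below assumes the third `lug_boot` value is `big`, and uses a hypothetical mini-frame rather than the real data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical mini-frame with the three assumed lug_boot values
cars = pd.DataFrame({'lug_boot': ['small', 'med', 'big', 'med', 'small']})

# LabelEncoder sorts the classes alphabetically: big -> 0, med -> 1, small -> 2
label_codes = LabelEncoder().fit_transform(cars['lug_boot'])

# OrdinalEncoder with an explicit category order: small -> 0, med -> 1, big -> 2
ordinal = OrdinalEncoder(categories=[['small', 'med', 'big']])
ordinal_codes = ordinal.fit_transform(cars[['lug_boot']]).ravel()

print(label_codes)
print(ordinal_codes)
```

scikit-learn documents `LabelEncoder` as intended for target labels, so `OrdinalEncoder` (or one-hot encoding for unordered categories) is usually the better fit for input features.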
