# What does “Choosing an ML Model” actually mean?

It means selecting the right algorithm for your problem type + data.

❌ Not “which model is best overall”
✅ “which model fits this problem”

First question you must ask ❓

What kind of problem is this?

1️⃣ Regression (predict a number)

Examples:

House price

Temperature

Salary

Models to choose from:

Linear Regression

Ridge / Lasso

Random Forest Regressor

from sklearn.linear_model import LinearRegression
model = LinearRegression()
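
As a rough sketch of how these regressors are used, the snippet below fits LinearRegression and Ridge on a small synthetic dataset. The data, the `alpha` value, and the variable names are assumptions made for illustration; they do not come from this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (assumption): y = 3*x + 2 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.5, size=100)

# Both models share the same fit / predict interface
for model in (LinearRegression(), Ridge(alpha=1.0)):
    model.fit(X, y)
    pred = model.predict(np.array([[4.0]]))[0]
    print(type(model).__name__, round(pred, 2))
```

Both predictions should land close to 14, the noise-free value at x = 4.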

2️⃣ Classification (predict a category / label)

Examples:

Spam / Not spam

Disease Yes / No

Pass / Fail

Models:

Logistic Regression

Decision Tree

SVM

Random Forest

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

3️⃣ Clustering (no labels)

Examples:

Customer grouping

Pattern discovery

Models:

KMeans

DBSCAN

from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
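
Clustering differs from the two cases above because no `y` is passed to the model. A minimal sketch, assuming three well-separated synthetic blobs (the data and the `n_init` / `random_state` values are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (assumption): three well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([c + rng.normal(0, 0.5, size=(50, 2)) for c in centers])

# No y is passed: clustering only sees the features
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)

print(np.bincount(labels))  # number of points assigned to each cluster
```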

Second question ❓

How much data do you have?

| Data Size | Good Choices |
|---|---|
| Small | Linear / Logistic Regression |
| Medium | Decision Tree, KNN |
| Large | Random Forest, Gradient Boosting |

Third question ❓

Is data simple or complex?

| Data | Model |
|---|---|
| Linear relationship | Linear Regression |
| Non-linear | Decision Tree, Random Forest |
| Many features | Regularized models |

Beginner-friendly rule (VERY useful) ⭐

Start simple → then go complex:

Logistic / Linear Regression

Decision Tree

Random Forest

Boosting models

If a simple model works well, don’t overcomplicate.
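
One way to follow this rule is to score each model in order of complexity and stop once the extra complexity no longer pays off. The sketch below assumes synthetic data from `make_classification` and 5-fold cross-validation; neither comes from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption, not the notebook's CSV)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Ordered from simplest to most complex
ladder = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(random_state=0),
    GradientBoostingClassifier(random_state=0),
]

results = {}
for model in ladder:
    scores = cross_val_score(model, X, y, cv=5)  # accuracy for classifiers
    results[type(model).__name__] = scores.mean()
    print(f"{type(model).__name__}: {scores.mean():.3f}")
```

If the first model already scores close to the later ones, it is usually the easier one to keep.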

How sklearn helps here

Sklearn gives:

Consistent API (fit, predict)

Easy model switching

Sensible built-in defaults that make a reasonable starting point

You can literally change one line and test another model:

model = RandomForestClassifier()
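
To make the one-line swap concrete, here is a small sketch. The `evaluate` helper and the synthetic data are hypothetical and used only for illustration; the point is that the surrounding code does not change when the model does.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption) standing in for a real dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def evaluate(model):
    """Hypothetical helper: fit on the training split, return test accuracy."""
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Swapping the model is a one-line change; everything else stays the same
print(evaluate(LogisticRegression(max_iter=1000)))
print(evaluate(RandomForestClassifier(random_state=0)))
```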

Real-world thinking (job mindset 💼)

ML engineers try multiple models, compare scores, and choose the best.

That’s normal and expected.

One-line takeaway ✨

Choosing an ML model = matching the problem + data + complexity.

In [90]:
import pandas as pd
import matplotlib.pyplot as plt

In [91]:
heart_disease = pd.read_csv('heart_disease_300.csv')

In [92]:
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,67,1,2,122,139,1,1,168,0,2.2,1,1,0,0
1,57,0,0,154,232,1,2,108,0,4.1,1,2,0,0
2,43,0,3,107,259,0,0,84,0,4.9,1,3,1,1
3,71,1,2,185,166,1,2,188,1,5.7,2,3,0,1
4,36,0,0,138,120,1,1,98,1,1.5,2,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,46,1,1,115,376,0,1,171,1,0.4,0,3,1,0
299,74,0,1,157,197,0,1,88,0,3.6,0,0,2,1
300,46,0,2,108,490,1,1,151,0,3.5,0,3,2,0
301,30,0,0,173,294,0,0,92,1,3.5,1,3,1,0


In [93]:
# Create X (the features): every column except the target
X = heart_disease.drop('target', axis=1)
# Create Y (the labels): the target column only
Y = heart_disease['target']
Y

0      0
1      0
2      1
3      1
4      0
      ..
298    0
299    1
300    0
301    0
302    1
Name: target, Length: 303, dtype: int64

In [94]:
# From lesson 140: 2. Choosing a machine learning model
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [95]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.3)

In [96]:
clf.fit(X_train,Y_train)

RandomForestClassifier()


In [97]:
# End of lesson 143

In [98]:
################ Evaluate Model ############
predicted_y=clf.predict(X_test)
predicted_y


array([0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 1])

In [99]:
# Our job now is to check how accurate the predictions are, as a percentage

In [100]:
clf.score(X_train,Y_train)

1.0

In [101]:
clf.score(X_test,Y_test)

0.43956043956043955
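
For classifiers, `score` returns the mean accuracy on the data it is given, that is, the fraction of predictions that match the true labels. A minimal sketch on synthetic data (an assumption, not the heart disease CSV) that computes the same number three ways:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
preds = clf.predict(X_test)

by_hand = np.mean(preds == y_test)         # fraction of correct predictions
by_metric = accuracy_score(y_test, preds)  # same value via sklearn.metrics
by_score = clf.score(X_test, y_test)       # the call used in this notebook

print(f"{by_score:.2%}")  # formatted as a percentage
```

Comparing `clf.score` on the training split with the test split, as the two cells above do, shows how much of the training performance carries over to unseen rows.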

In [102]:
# About 44% test accuracy (versus 100% on the training set). That is poor, so the next question is: how can we raise the accuracy of the predictions?

In [103]:
################ 145. Step 5 Improve Model

In [104]:
# 5 improve Model
for i in range(10,200,10):
    print(f'Running Model with {i} estimators ')
    clf = RandomForestClassifier(n_estimators=i).fit(X_train,Y_train)
    print(f'Accuracy is :{clf.score(X_test,Y_test)}')

Running Model with 10 estimators 
Accuracy is :0.4835164835164835
Running Model with 20 estimators 
Accuracy is :0.4835164835164835
Running Model with 30 estimators 
Accuracy is :0.5054945054945055
Running Model with 40 estimators 
Accuracy is :0.4065934065934066
Running Model with 50 estimators 
Accuracy is :0.42857142857142855
Running Model with 60 estimators 
Accuracy is :0.46153846153846156
Running Model with 70 estimators 
Accuracy is :0.43956043956043955
Running Model with 80 estimators 
Accuracy is :0.42857142857142855
Running Model with 90 estimators 
Accuracy is :0.45054945054945056
Running Model with 100 estimators 
Accuracy is :0.4945054945054945
Running Model with 110 estimators 
Accuracy is :0.4725274725274725
Running Model with 120 estimators 
Accuracy is :0.4945054945054945
Running Model with 130 estimators 
Accuracy is :0.4835164835164835
Running Model with 140 estimators 
Accuracy is :0.5604395604395604
Running Model with 150 estimators 
Accuracy is :0.4395604395604395
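
The scores above jump around without a clear trend. Part of that is likely randomness: each forest is seeded differently, and a single 91-row test split gives a noisy estimate. One way to get a steadier comparison, sketched here on synthetic data (an assumption; substitute the notebook's X and Y), is to fix `random_state` and average over cross-validation folds:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

mean_scores = {}
for n in range(10, 60, 10):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    # 5-fold cross-validation: five train/test splits instead of one
    scores = cross_val_score(clf, X, y, cv=5)
    mean_scores[n] = scores.mean()
    print(f"{n} estimators: mean accuracy {scores.mean():.3f}")

best_n = max(mean_scores, key=mean_scores.get)
print("Best n_estimators:", best_n)
```

If the means barely differ between settings, `n_estimators` is probably not what limits the model.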

In [105]:
##### 146. Step 6 Save Model ######

In [106]:
# save
import pickle
# 'wb' means write binary; the with-block closes the file afterwards
with open('Heart Disease Predictor', 'wb') as f:
    pickle.dump(clf, f)

In [107]:
# 'rb' means read binary
with open('Heart Disease Predictor', 'rb') as f:
    load_model = pickle.load(f)
load_model.score(X_test, Y_test)

0.4835164835164835

In [108]:
#####################   this is Data Science

#####  #Section 18 ::: 147. What we are going to Do #####################################

In [109]:
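# Commas are required between items; without them Python joins
# adjacent string literals into one long string.
sk_learn_steps = [
    "1: Getting the Data Ready",
    "2: Choosing Machine Learning Model",
    "3: Fit Model",
    "4: Evaluate Model",
    "5: Improve Model",
    "6: Saving the Model",
    "7: Summary",
]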

In [110]:
# # Steps
# print("""
# Machine Learning Workflow Completed:
# 1. Data Prepared
# 2. Model Selected
# 3. Model Trained
# 4. Model Evaluated
# 5. Model Improved
# 6. Model Saved
# 7. Process Summarized
# """)


# 1 : Getting the Data Ready
##### 1.1 Split data into features and labels (independent vs. dependent variables): X, y
##### 1.2 Filling missing values
##### 1.3 Converting data Types

In [111]:
# Standard imports that we need in every notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [112]:
hrt_dis =pd.read_csv('heart_disease_300.csv')

In [113]:
hrt_dis.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,67,1,2,122,139,1,1,168,0,2.2,1,1,0,0
1,57,0,0,154,232,1,2,108,0,4.1,1,2,0,0
2,43,0,3,107,259,0,0,84,0,4.9,1,3,1,1
3,71,1,2,185,166,1,2,188,1,5.7,2,3,0,1
4,36,0,0,138,120,1,1,98,1,1.5,2,0,1,0


In [114]:
X = hrt_dis.drop('target', axis=1)  # drop returns a copy without the target column

In [115]:
y = hrt_dis['target']  # y holds the target column, i.e. the label

In [116]:
X.head()  # head() confirms that the target column is gone

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,67,1,2,122,139,1,1,168,0,2.2,1,1,0
1,57,0,0,154,232,1,2,108,0,4.1,1,2,0
2,43,0,3,107,259,0,0,84,0,4.9,1,3,1
3,71,1,2,185,166,1,2,188,1,5.7,2,3,0
4,36,0,0,138,120,1,1,98,1,1.5,2,0,1


In [117]:
y.head()  # shows only the label values, with the Series name and dtype printed below

0    0
1    0
2    1
3    1
4    0
Name: target, dtype: int64

In [118]:
#end of this lecture

### 1.1 Split data into features and labels (independent vs. dependent variables): X, y

In [119]:
from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test = train_test_split(X,y,test_size=0.25)

In [120]:
X_train.shape

(227, 13)

In [121]:
len(hrt_dis)

303

In [122]:
X_train.shape, X_test.shape,y_train.shape,y_test.shape

((227, 13), (76, 13), (227,), (76,))

In [123]:
227+76

303

In [124]:
303*0.25  # hold out 25% of the rows for testing

75.75
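
303 * 0.25 is 75.75, yet the test set has 76 rows. When `test_size` is a float, `train_test_split` rounds the test count up and gives the remaining rows to the training set. A small check on placeholder arrays (made up for illustration, not the heart disease data):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 303
X = np.zeros((n_samples, 13))  # placeholder features
y = np.zeros(n_samples)        # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

expected_test = math.ceil(n_samples * 0.25)      # 75.75 rounds up to 76
print(X_train.shape, X_test.shape)               # (227, 13) (76, 13)
print(expected_test, n_samples - expected_test)  # 76 227
```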

## 149 Step 1 Getting Data Ready Part 1
### 1.2 Converting data types


In [135]:
phone_data = pd.read_csv("phone_data_larg.csv")
phone_data

Unnamed: 0,Make,Colour,Memory(kb),Sim Cards,Price
0,Nokia,White,150043,4,$400.00
1,Samsung,Red,87899,4,$500.00
2,Nokia,Blue,32549,3,$700.00
3,Iphone,Black,11179,5,$220.00
4,Motorolla,White,213095,4,$350.00
...,...,...,...,...,...
995,Nokia,Green,99213,4,$450.00
996,Samsung,Blue,45698,4,$750.00
997,Samsung,Blue,54738,4,$700.00
998,Nokia,White,60000,4,$625.00


In [134]:
#df_extended = pd.concat([df] * (1000 // len(df)), ignore_index=True) # this step repeated the rows to grow the dataset to about 1000 rows
#df_extended.to_csv("phone_data_larg.csv", index=False)
# print(df_extended.head(15)) 
# print("Total rows:", len(df_extended))

In [136]:
phone_data.dtypes

Make            str
Colour          str
Memory(kb)    int64
Sim Cards     int64
Price           str
dtype: object

In [138]:
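# Drop the last two characters of each price string.
# Careful: this cell is not idempotent; every rerun strips two more characters.
# One pass turns "$400.00" into "$400."; the output below matches two passes.
phone_data['Price'] = phone_data['Price'].str[:-2]
phone_data['Price']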

0      $40
1      $50
2      $70
3      $22
4      $35
      ... 
995    $45
996    $75
997    $70
998    $62
999    $97
Name: Price, Length: 1000, dtype: str

In [143]:
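# In newer pandas versions regex=True must be passed explicitly; older versions treated the pattern as a regex by default.
# Remove $, commas, dots, and spaces. The column is still a string afterwards because .astype(int) is not applied here.
phone_data['Price'] = phone_data['Price'].str.replace(r"[\$,\. ]", "", regex=True)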
phone_data['Price']

0      40
1      50
2      70
3      22
4      35
       ..
995    45
996    75
997    70
998    62
999    97
Name: Price, Length: 1000, dtype: str

In [144]:
phone_data

Unnamed: 0,Make,Colour,Memory(kb),Sim Cards,Price
0,Nokia,White,150043,4,40
1,Samsung,Red,87899,4,50
2,Nokia,Blue,32549,3,70
3,Iphone,Black,11179,5,22
4,Motorolla,White,213095,4,35
...,...,...,...,...,...
995,Nokia,Green,99213,4,45
996,Samsung,Blue,45698,4,75
997,Samsung,Blue,54738,4,70
998,Nokia,White,60000,4,62
